RAID Reliability Calculations (Was Storage array advice anyone?)

  • From: chris@xxxxxxxxxxxxxxxxxxxxx
  • To: oracle-l@xxxxxxxxxxxxx
  • Date: Thu, 6 Jan 2005 10:38:14 +0000

Well, it's the new year now and I've completed the data-loss calculations to
"update" the figures from section 3.4.5 of the paper "RAID: High-Performance,
Reliable Secondary Storage" mentioned by Cary.

      http://www.eecs.umich.edu/CoVirt/papers/diskArraySurvey.pdf

First a quick summary of the original results:

Double disk failure
 - 285 years mean time to data loss (MTTDL)
 - 3.4% probability of data loss over 10 years (PDL10Y)

Disk failure + unrecoverable read error (bit error) during reconstruction of the
failed disk
 - 36 years mean time to data loss (MTTDL)
 - 24.4% probability of data loss over 10 years (PDL10Y)

Based on:
 - 500 GB of data, 5 GB drives with 200,000 hrs MTTF, 16 disks per RAID set
 - 1 unrecoverable bit error per 10^14 bits read

Now with only 8 disks per RAID set:

Double disk failure                     - 571 years MTTDL,  1.7% PDL10Y
Disk failure + unrecoverable read error -  71 years MTTDL, 13.1% PDL10Y

Finally, 2 disks per RAID set, i.e. mirroring:

Double disk failure                     - 2,283 years MTTDL, 0.44% PDL10Y
Disk failure + unrecoverable read error -   285 years MTTDL, 3.44% PDL10Y
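For anyone who wants to check the arithmetic, here's a rough sketch of the
model as I read it from section 3.4.5 (the 1-hour MTTR and correlated-failure
factor of 10 are the ones given further down; small rounding differences from
the paper's own figures aside):

```python
import math

HRS_PER_YR = 24 * 365.25

# Parameters from the original paper: 500 GB of data on 5 GB drives,
# 200,000 hr MTTF, 16 disks per RAID set (15 data + 1 parity).
data_disks = 500 / 5                  # 100 data disks
g = 16                                # disks per RAID set
n = data_disks * g / (g - 1)          # total disks incl. parity, ~107
mttf = 200_000                        # hours
mttr = 1.0                            # hours to reconstruct a failed disk
corr = 10.0                           # correlated second-failure factor

# Double disk failure: a second, correlated failure in the same set
# during the repair window.
mttdl_ddf = mttf * (mttf / corr) / (n * (g - 1) * mttr) / HRS_PER_YR

# Disk failure + unrecoverable bit error: reconstruction reads the 15
# surviving drives; 1 bit error per 1e14 bits read.
bits_read = (g - 1) * 5e9 * 8
p_ure = bits_read * 1e-14             # ~0.006 per reconstruction
mttdl_ure = mttf / (n * p_ure) / HRS_PER_YR

def pdl(years, mttdl_years):
    """Probability of data loss over `years`, exponential failure model."""
    return 1.0 - math.exp(-years / mttdl_years)

print(f"double failure: {mttdl_ddf:,.0f} yrs, PDL10Y {pdl(10, mttdl_ddf):.1%}")
print(f"failure + URE:  {mttdl_ure:,.0f} yrs, PDL10Y {pdl(10, mttdl_ure):.1%}")
```

Dropping g to 8 or 2 (with the extra parity/mirror disks that implies)
reproduces the other two rows above.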


Now the figures using a modern Seagate Cheetah 15K.4 Ultra SCSI 320 drive
(http://www.seagate.com/docs/pdf/datasheet/disc/ds_cheetah15k.4.pdf)

Based on:
 - 10 TB of data, 72 GB drives with 1,400,000 hrs MTTF
 - 1 unrecoverable bit error per 10^15 sectors read

8 disks per RAID set:

Double disk failure                     -    20,173 years MTTDL, 0.050% PDL10Y
Disk failure + unrecoverable read error - 1,023,650 years MTTDL, 0.001% PDL10Y

2 disks per RAID set, i.e. mirroring:

Double disk failure                     -    80,584 years MTTDL, 0.0124% PDL10Y
Disk failure + unrecoverable read error - 4,094,597 years MTTDL, 0.0002% PDL10Y
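Plugging the Cheetah numbers into the same model gives roughly the figures
above (a sketch; note I've taken the drive's error rate as 1 unrecoverable
error per 10^15 sectors read, and assumed 512-byte sectors, so any rounding
differences from the spreadsheet are small):

```python
data_disks = 10_000 / 72              # 10 TB of data on 72 GB drives
mttf = 1_400_000                      # hours
mttr, corr = 1.0, 10.0                # same MTTR and correlation factor
HRS_PER_YR = 24 * 365.25

for g in (8, 2):                      # RAID set of 8, then mirroring
    n = data_disks * g / (g - 1)      # total disks incl. parity/mirror
    ddf = mttf * (mttf / corr) / (n * (g - 1) * mttr) / HRS_PER_YR
    sectors_read = (g - 1) * 72e9 / 512   # sectors read to reconstruct
    p_ure = sectors_read * 1e-15          # 1 error per 1e15 sectors read
    ure = mttf / (n * p_ure) / HRS_PER_YR
    print(f"{g} disks/set: double failure {ddf:,.0f} yrs, "
          f"failure + URE {ure:,.0f} yrs")
```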


All the calculations assume a mean time to repair (MTTR), i.e. the time to
reconstruct a failed disk, of 1 hour, and a correlated disk-error factor of 10.
These are the figures used in the original paper, so that we are always
comparing "apples with apples" as far as possible.

I've ignored the case of a system crash followed by a disk failure mentioned in
the original paper, as that applies to software RAID and not to hardware RAID
with non-volatile cache storage, which exists in all modern medium- to high-end
RAID solutions.

Also, I've not used the harmonic-sum approach found in the original paper, as I
was unable to work out exactly what was being done.

I hope some people find this useful; it helps to provide some science towards
the question of RAID reliability, and is certainly much better than my original
statement along the lines of:

   "you'd have to be very unlucky to suffer data loss with a modern RAID 5
solution"

I used a basic Excel spreadsheet to do the calculations, which I've put up on my
company's website (to avoid clogging up the list server). If anyone is
interested in looking further at the calculations, or in trying different
parameters, e.g. disk MTTF, then please download the spreadsheet and use it
however you see fit.

  http://www.christallize.com/download/diskfailurecalc.xls

Of course, I provide no warranty or support, and accept no liability, for
whatever you may use the spreadsheet for.

Cheers,

Chris Dunscombe

Christallize Ltd
--
//www.freelists.org/webpage/oracle-l