RE: Solid State Drives

  • From: "Mark W. Farnham" <mwf@xxxxxxxx>
  • To: <tanel@xxxxxxxxxx>, <jeremy.schneider@xxxxxxxxxxxxxx>, <mzito@xxxxxxxxxxx>
  • Date: Sat, 2 May 2009 08:46:29 -0400

We seem to have adopted an SSD==Flash assumption on this thread. Given the
faster cost drop in flash than in other types of solid state memory, that
may be appropriate. Still, there are other choices, and whether or not they
remain economic going forward, it has long been the case that if you really
needed isolated throughput to persistent storage for particular modest-sized
objects (like redo logs in a high-update Oracle database, or UNDO space
under a load of many transactions and long queries), then media choices
superior to spinning rust were available. I suggest reading Kevin Closson's
fine contributions on that topic, so you are not disappointed when the real
achievable throughput improvement falls short of the ratio of service times
between traditional disk farms and SSD. Kevin's analysis of where you hit
other bottlenecks in the total throughput picture is spot on. His mission of
debunking hyperbole in this area is, by my observation, scientifically
complete.

I have long held that the biggest throughput-per-dollar improvement from
selective, economic deployment of SSD is not the straight-out acceleration
of i/o to the objects placed on the SSD (yes, Virginia, there is still
acceleration; it is just bounded by the other bottlenecks rather than by the
magical-sounding ratio of device speeds you get when an address calculation
replaces a mechanical seek), but rather the "deheating" of the rest of the
disk farm. While spinning rust capacities have grown and the drop in cost
per unit of storage, i.e. dollars per terabyte, has been truly impressive,
the drop in cost per spindle has been much less impressive. So isolating a
few spindles to segregate the really hot i/o from the rest of the farm is
often more expensive now than segregating that hot i/o onto some flavor of
SSD that meets or exceeds the mean time between failures of traditional
disk.
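
To put rough shape on that trade-off, here is a toy comparison in Python;
every price and size in it is made up purely for illustration, not a quote
from any vendor or any particular year:

# Toy comparison of isolating hot i/o on dedicated spindles vs. on SSD.
# All prices and sizes are made-up illustrative numbers, not real quotes.

hot_io_gb = 20                      # mirrored redo/undo footprint to isolate

spindles_needed = 4                 # two mirrored pairs reserved for hot i/o
cost_per_spindle = 300              # hypothetical price of one enterprise disk
spindle_isolation_cost = spindles_needed * cost_per_spindle

ssd_cost_per_gb = 15                # hypothetical enterprise SSD price per GB
ssd_isolation_cost = hot_io_gb * 2 * ssd_cost_per_gb   # mirrored pair

print(spindle_isolation_cost, ssd_isolation_cost)      # 1200 vs 600
# The spindle option buys whole drives regardless of how little capacity the
# hot objects need; the SSD option pays only for the capacity it isolates.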

Especially when paired with stripe-and-mirror-everything on the rest of the
disk farm, this removal of hot, interrupting i/o reduces service times for
everything else. And it reduces wear and tear on the traditional mechanical
disk farm components.

I think Tanel's analysis of wear-out duration is on target as to the shape
of wear-out patterns for "flash" SSD. It is also useful to know how much
reserve is built into a given manufacturer's "flash" SSD offering, and
whether it provides a routine utility to tell you how much of that reserve
remains free. If you track your peak i/o requirement and verify your free
margin, you will have plenty of insurance when scheduling migration to new
plexes. Unlike the crash of spinning rust, degradation of flash is
incremental and can be watched.
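
A rough sketch of the kind of tracking I have in mind, in Python; the
reserve figures, the 20% safety margin, and the function name are mine,
standing in for whatever your vendor's utility and your own monitoring
actually report:

# Project when the spare-block reserve will hit a chosen safety margin, so
# migration to new plexes can be scheduled well ahead of any failure.
# The numbers are illustrative, not from any particular device.

def days_until_migration(reserve_pct_now, reserve_pct_30_days_ago,
                         safety_margin_pct=20.0):
    """Linear projection of when the reserve reaches the safety margin."""
    burn_per_day = (reserve_pct_30_days_ago - reserve_pct_now) / 30.0
    if burn_per_day <= 0:
        return float("inf")          # no measurable wear over the last month
    return (reserve_pct_now - safety_margin_pct) / burn_per_day

# Example: the vendor utility reported 92% reserve a month ago, 90% today.
print(f"plan migration within ~{days_until_migration(90.0, 92.0):.0f} days")
# -> roughly 1050 days at the current burn rate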

I confess I didn't read this entire thread, but I've consistently found
Poder, Zito, Morle, and Closson to be speakers of the truth whose detailed
experiments accurately predict the results when you apply the resulting
logical suggestions to the construction of an Oracle Server complex.
Apologies in advance to those I've left out.

I wonder how long it will be before the best economic solution for storage
is completely non-mechanical? I just hope we don't "Rollerball" the 14th
century. Probably not a concern; there will likely be 2.6 billion copies of
everything, including all classified material, in the cloud and on the moon
and Mars, too.

Regards,

mwf

-----Original Message-----
From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx]
On Behalf Of Tanel Poder
Sent: Friday, May 01, 2009 2:34 PM
To: jeremy.schneider@xxxxxxxxxxxxxx; mzito@xxxxxxxxxxx
Cc: andrew.kerber@xxxxxxxxx; dofreeman@xxxxxxxxxxx; oracle-l@xxxxxxxxxxxxx
Subject: RE: Solid State Drives

Well, even without wear levelling and copy-on-writes, and assuming you have
loaded the SSD 100% full with redo logs only, you could still write into
this redolog space many times.

So if we do an exercise assuming that:

1) you have 8 x 1 GB redologs for a database on an SSD

2) it takes 1 hour to fill these 8 GB of logs ( 8 GB per hour = 192 GB per
24 hours ) - you will be writing to the same block once per hour (twice to
the redo header though, as it's updated when the log gets full, to mark the
end SCN in the file)

3) it's possible to write to an SSD disk block "only" 100,000 times

So, if you write to a block at most 2 times per hour, it would still take
50,000 hours to wear it out - still over 5 years.
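
To sanity-check the arithmetic above with a quick Python scratchpad (same
assumptions as in points 1-3):

# 8 x 1 GB redologs filled once per hour, at most 2 writes to any one block
# per hour, and an assumed endurance of ~100,000 writes per block.

writes_per_block_per_hour = 2
endurance_writes_per_block = 100_000

hours = endurance_writes_per_block / writes_per_block_per_hour
years = hours / (24 * 365)
print(f"{hours:.0f} hours ~= {years:.1f} years")   # 50000 hours ~= 5.7 years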

The controlfiles and temp tablespace files (and sometimes undo) experience
many more writes to the "same" blocks than the redologs do, so they would be
the first ones hitting problems :)

But there IS the wear levelling, which avoids writing to the same blocks too
much by physically writing somewhere else and updating the virtual-to-physical
location translation table. Much depends on the algorithm used...
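
A toy sketch of that translation-table idea in Python (purely an
illustration of the principle, not any controller's real algorithm):

# Logical block addresses stay stable while physical blocks rotate, so no
# single physical block absorbs all of the rewrites.

class ToyWearLeveller:
    def __init__(self, physical_blocks):
        self.free = list(range(physical_blocks))   # physical blocks not in use
        self.map = {}                              # logical -> physical address
        self.erase_counts = [0] * physical_blocks

    def write(self, logical_block):
        # New data always goes to the least-worn free physical block.
        target = min(self.free, key=lambda b: self.erase_counts[b])
        self.free.remove(target)
        old = self.map.get(logical_block)
        if old is not None:
            # The stale copy is erased and its block returned to the free pool.
            self.erase_counts[old] += 1
            self.free.append(old)
        self.map[logical_block] = target

wl = ToyWearLeveller(physical_blocks=16)
for _ in range(1000):
    wl.write(0)                       # hammer one logical block repeatedly
print(min(wl.erase_counts), max(wl.erase_counts))  # wear spreads almost evenly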

Regarding whether the mirrored SSDs would wear out at the same time -
probably not, as the number of writes before wearing out is not some fixed
discrete number; it will probably vary quite a lot. And you would not want
to wait until these disks fail anyway, but rather replace them before a
known "expire time". This expire time would be measured in number of write
operations rather than wall clock time.

--
Regards,
Tanel Poder
http://blog.tanelpoder.com



--
//www.freelists.org/webpage/oracle-l

