RE: Solid State Drives

  • From: "Matthew Zito" <mzito@xxxxxxxxxxx>
  • To: "Jeremy Schneider" <jeremy.schneider@xxxxxxxxxxxxxx>
  • Date: Fri, 1 May 2009 13:58:55 -0400

Well, so, unfortunately today is a little too busy for me to go back
through and track down some of the really good nuts-and-bolts level
discussions that have been going on in the storage community concerning
the lifespan of SSDs vs. traditional hard drives.  I'll see if I have
some time on the plane this weekend to put together a summary and post
it to the list.

However, as I recall, there are a couple of things people are doing to
minimize the issue, above and beyond vanilla wear leveling:
- Reserved blocks - all of the high-end SSD devices have a percentage of
reserved blocks that are not visible to the OS as usable capacity.
That way, as blocks begin to fail, they can be seamlessly remapped to
the reserved blocks.  This mitigates the edge case where some blocks
start to fail earlier than others, thanks to the magic of manufacturing
defects or other fun (see the first sketch after this list).
- Write cache - most of the folks implementing SSD support are tweaking
their algorithms to manage the write process more carefully.  For
example, a filesystem with a 4KB blocksize and a DB with an 8KB
blocksize may sit on an SSD that uses 128KB erase blocks internally.  A
naive implementation would let a user sequentially write sixteen 8KB
blocks without realizing that this generates 16 write cycles on the
same internal block of the SSD.  Arrays are batching writes to match
the SSD's internal blocksize, even in high-I/O environments where cache
pressure would typically force an immediate destage (the second sketch
after this list works through the numbers).
- Drive-level write cache - the better SSDs have a much larger write
cache than a traditional disk, protected by battery backup against
power failure.  This way, high-volume or write-read-overwrite patterns
don't necessarily generate an immediate write cycle on a given block.
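
To make the reserved-block idea concrete, here's a toy sketch in Python
(the visible capacity and reserve percentage are numbers I made up, and
the class and structure are purely illustrative, not how any real
controller works) of a failing block getting quietly remapped to a
spare:

# Toy model of reserved-block remapping.  128 visible blocks plus ~7%
# reserved spares are assumed figures, not a real drive's geometry.
class SimpleSSD:
    def __init__(self, visible_blocks=128, reserve_pct=0.07):
        self.visible = visible_blocks
        self.spares = list(range(visible_blocks,
                                 visible_blocks + int(visible_blocks * reserve_pct)))
        self.remap = {}   # logical block -> spare physical block

    def mark_failed(self, logical_block):
        # Retire a worn-out block by pointing it at a spare, if any remain.
        if not self.spares:
            raise RuntimeError("out of reserved blocks - failures now visible")
        self.remap[logical_block] = self.spares.pop(0)

    def physical(self, logical_block):
        # Where a logical block actually lives after any remapping.
        return self.remap.get(logical_block, logical_block)

ssd = SimpleSSD()
ssd.mark_failed(5)        # block 5 wore out early (manufacturing luck)
print(ssd.physical(5))    # -> 128, the first spare; the OS never notices
print(ssd.physical(6))    # -> 6, healthy blocks map to themselves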
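
And here's the write-cache point as plain arithmetic - same 8KB/128KB
sizes as in the example above, everything else invented for
illustration:

# Count write cycles on one 128KB internal block when sixteen
# sequential 8KB database writes arrive.
ERASE_BLOCK = 128 * 1024   # assumed internal block size
DB_BLOCK = 8 * 1024        # database blocksize from the example
writes = [(i * DB_BLOCK, DB_BLOCK) for i in range(16)]

# Naive write-through: each 8KB write rewrites its whole internal block.
naive_cycles = len(writes)

# Coalescing cache: group pending writes by internal block, flush once.
touched = {offset // ERASE_BLOCK for offset, _ in writes}
batched_cycles = len(touched)

print(naive_cycles, batched_cycles)   # -> 16 1, a 16x difference in wear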

The other thing to consider is that even if you have 10GB of active redo
logs, the odds are high that the array you're using will be striping
that write workload across at least two much larger (several hundred GB)
devices.  In addition, since there's no rotational latency penalty, it
is very feasible to mix workloads, such that 20GB is allocated for the
redo logs and another 150GB for an archive log dest, and the writes get
leveled across the whole set of flash chips inside the drive.
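
For a rough sense of scale, here's a back-of-the-envelope calculation
in Python - the redo rate, cycle rating, and leveling span are all
assumed numbers, and it ignores write amplification, so treat it as
intuition rather than a sizing tool:

# How long until cells hit their program/erase rating if redo wear is
# leveled across the whole device.  All inputs below are assumptions.
redo_mb_per_sec = 10      # sustained redo write rate (assumed)
pe_cycles = 100_000       # rated program/erase cycles per cell (assumed)
span_gb = 256             # capacity the wear gets leveled across (assumed)

bytes_per_year = redo_mb_per_sec * 1024**2 * 3600 * 24 * 365
cycles_per_cell_per_year = bytes_per_year / (span_gb * 1024**3)
years_to_rating = pe_cycles / cycles_per_cell_per_year

print(round(cycles_per_cell_per_year), round(years_to_rating))
# -> roughly 1200 cycles per cell per year, decades to reach the rating.
# Shrink the leveling span from 256GB to 10GB and the same math leaves
# about 25x less headroom, which is the intuition behind letting the
# drive level across everything rather than fencing off a tiny redo area.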

Finally, the reality is that two traditional disks dedicated to
high-I/O-volume redo logs will fail faster than drives serving a
traditional mixed workload, regardless.  And while your failure rates
may be higher in those scenarios, the same truths about hot spares,
etc. apply as in "vanilla" environments.  Your vendor will eat some of
the cost of higher failure rates in certain environments, same as they
do today with regular disks.

Thanks,
Matt

-----Original Message-----
From: Jeremy Schneider [mailto:jeremy.schneider@xxxxxxxxxxxxxx] 
Sent: Friday, May 01, 2009 1:24 PM
To: Matthew Zito
Cc: andrew.kerber@xxxxxxxxx; dofreeman@xxxxxxxxxxx;
oracle-l@xxxxxxxxxxxxx
Subject: Re: Solid State Drives

Matthew Zito wrote:
>
> As far as the upgrade path goes, the lifespan is comparable to a
> "spinning rust" hard drive.
>
I'm curious if this is actually true? (What is it based on?) I would
think that lifespan would be dependent on I/O patterns (because of
hardware wear leveling) -- and filesystem vs redo logs could be very
different access patterns.  In particular, redo could easily pound every
single block on a smaller SSD (hardware leveling becomes fairly
meaningless), which is rather different from a filesystem where some
blocks may not get accessed that heavily. I'm not sure one way or the
other, just something I've been wondering about.

Similarly, if you mirrored two of them for redo then isn't there a high
likelihood that they would wear out around the same time?

-Jeremy

-- 
Jeremy Schneider
Chicago, IL
http://www.ardentperf.com

--
//www.freelists.org/webpage/oracle-l

