RE: To use SAME or NOT for High End Storage Setup ? .... StripeUnit Size 32 MB Vs. 64 KB ?

  • From: "Mark W. Farnham" <mwf@xxxxxxxx>
  • To: <kevinc@xxxxxxxxxxxxx>, <oracle-l@xxxxxxxxxxxxx>
  • Date: Mon, 15 May 2006 12:35:51 -0400

At this point I think reviewing the input data to the question is in order:

Quoting VIVEK_SHARMA <VIVEK_SHARMA@xxxxxxxxxxx>:

> Folks
> 1) IBM is recommending SAME (Stripe across all the 46 LUNs) + 2 separate
> LUNs for online redo logfiles
> 2) IBM is recommending 32 MB Stripe Unit size across the 46 LUNs using
> Volume Manager.
> NOTE - Each underlying LUN has 8 Disks (Hardware Raid 1+0 with Stripe
> Unit Size 64 KB - This is NOT changeable)
> Qs. Any feedback on impact of 32 MB Stripe Unit Size(across LUNs) on
> Performance of OLTP / Batch Transactions?

I *think* this means that a LUN has 4 pairwise mirrored disks and that the
stripe width is 256KB (4*64K) at the hardware level. At the individual disk
level, the OS sees each 64K stripe unit as 128 512-byte sectors, so unless
you're consistently perfectly aligned with the disk blocks, you've got only
a small chance that any individual multiblock read request that actually
reaches the spinning platters will stay on a single head. If you're reading
from the array cache it probably doesn't matter much, and the penalty you'll
pay going to physical disk is gated by how well (and whether) the hardware
overlaps seeks within a hardware-managed LUN.
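To keep the arithmetic straight, here's a quick back-of-the-envelope sketch
of that geometry in Python. It only encodes the numbers stated above (8 disks
per LUN as 4 mirrored pairs, 64 KB stripe unit, 512-byte sectors); nothing
here comes from IBM documentation.

```python
# Back-of-the-envelope geometry for one hardware LUN, as described above.
# All figures are from the original post, not from vendor specs.

DISKS_PER_LUN = 8
MIRROR_PAIRS = DISKS_PER_LUN // 2        # RAID 1+0: 4 data spindles per LUN
STRIPE_UNIT_KB = 64                      # fixed by the hardware
SECTOR_BYTES = 512

stripe_width_kb = MIRROR_PAIRS * STRIPE_UNIT_KB       # data per full stripe
sectors_per_unit = STRIPE_UNIT_KB * 1024 // SECTOR_BYTES

print(stripe_width_kb)    # 256 (KB across the 4 mirrored pairs)
print(sectors_per_unit)   # 128 (512-byte sectors per 64 KB unit)
```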

Since you wrote 32 MB stripe unit size across the 46 LUNs using the Volume
Manager, I *think* this means a stripe WIDTH of 32MB*46, which is huge. (If
you meant a stripe WIDTH of 32 MB, then your stripe unit size is a little
less than three quarters of a MB.) I also *think* this means that if an
object is 16MB in size, it has a 50-50 chance of living in one chunk on a
single LUN, and likewise a 50-50 chance of being split in 2 pieces across
two LUNs. Objects small compared to 32 MB will reside on 1 or 2 LUNs in this
configuration, while objects big compared to 32 MB will be spread across
more LUNs.
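To make that 50-50 figure concrete, here's a sketch of the probability that
an object fits entirely inside one stripe unit, under the simplifying
assumption that its starting offset is uniformly distributed within the
unit (real extent allocation is block-aligned, but the idea is the same):

```python
def p_single_chunk(object_mb, unit_mb):
    """Probability an object lands wholly inside one stripe unit,
    assuming a uniformly random start offset within the unit.
    This is a toy model, not a description of Oracle extent allocation."""
    if object_mb > unit_mb:
        return 0.0                       # too big to ever fit in one unit
    return (unit_mb - object_mb) / unit_mb

print(p_single_chunk(16, 32))    # 0.5 -> the 50-50 chance mentioned above
print(p_single_chunk(1, 32))     # ~0.97: small objects almost always 1 LUN
```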

So if you have hot objects that are relatively small, you've got a good
chance they'll be on 1 or 2 LUNs, and it won't take too much
bad luck to get several small hot objects on the same LUN.
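How much bad luck does it take? A birthday-problem sketch gives a feel for
it, under the (hypothetical) assumption that each small hot object lands on
one of the 46 LUNs uniformly and independently, ignoring objects that
straddle two LUNs:

```python
from math import prod

def p_some_lun_shared(n_hot_objects, n_luns=46):
    """Probability at least two of n small hot objects share a LUN,
    assuming uniform independent placement -- a toy model only."""
    p_all_distinct = prod((n_luns - i) / n_luns for i in range(n_hot_objects))
    return 1 - p_all_distinct

for n in (2, 5, 8):
    print(n, round(p_some_lun_shared(n), 2))
```

Even a handful of hot objects gives a noticeable chance of a shared LUN
(roughly 20% with 5 objects, nearly half with 8), which is the "not too much
bad luck" point above.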

There is a good chance that any such potential hot spots are handled in the
cache, because they only pertain to relatively small objects.
Those objects will be in cache if they are hot (unless of course they are
hot with regard to writing).

But you're still only going to see the write degradation if the hot objects
on a single LUN overrun the cache and the ability of four drives to keep up.

Then again, I'm not sure what overhead you incur if a single read references
multiple LUNs anyway. At that point it is the software volume manager, and
it is not clear to me whether there is any different penalty for referencing
two physical platters within a single LUN versus the last platter of one LUN
and the first platter of the next LUN. That would depend on whether the
hardware is capable of overlapping seeks and whether the chain of software
reaching the platter triggers that capability.

If I'm wrong, we need to clarify your meaning of the terminology as regards
"stripe unit size" with respect to both the hardware LUN creation and the
volume manager's creation of volumes as seen at the file system or raw
partition level.

I'm also curious whether you're re-mirroring across the 46 LUNs. I'm
guessing NOT, but I could take your meaning that way from your statement
that you're using SAME across the 46 LUNs. I *think* you've mirrored
pairwise at the underlying hardware level and you're simply creating volumes
striped across 46 of the resulting LUNs.

Finally, if the stripe WIDTH across the 46 LUNs is 32 MB, that could work
out very well with the first LUN in each logical stripe rotating as the
volume manager creates the storage. That is *if* there is any overhead to
cross-LUN reads within a logical volume. The other way to do this is to make
each LUN be presented as 4 LUNs, rotating the starting drive on each
quarter. Again, with a large cache you probably will never see the
difference, but in the old days of actually depending on reading and writing
from the physical platters, my observation was that the first drive of a
raid set tended to get beat on, so splitting up a raid set into what I
defined as "stripe sets" using a round-robin allocation of the "first"
platter in each stripe set made quite a difference.
When Oracle handled much smaller volumes, you had to split up the raid sets
into multiple volumes to use them raw anyway, and the overhead of rotating
the starting point was just thinking of it (if the volume manager supported
telling it where to start), so it was worth it if there was any chance of a
hot spot.
I apologize for writing too much. I didn't have time to make it shorter.

Kevin's comment about 32KB being a trainwreck is probably an UNDERSTATEMENT.
Kevin's comment that "the planets do not align that way" is exactly correct,
and I think Cary and others have written whole papers on the math of what
lines up usefully and when, which I vastly oversimplified above, just giving
you the 50-50 point.



-----Original Message-----
From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx]On
Behalf Of Kevin Closson
Sent: Monday, May 15, 2006 10:49 AM
To: oracle-l@xxxxxxxxxxxxx
Subject: RE: To use SAME or NOT for High End Storage Setup ? .... StripeUnit
Size 32 MB Vs. 64 KB ?

>>>array. In both cases performance was fine and there were no "hot"
>>>disks. The logic for choosing 4 MB was to ensure that any
>>>full table scans (in our case 128 KB) would avoid having the
>>>multi-block reads split into two reads due to the required
>>>blocks existing in more than 1 "stripe".

a 4MB stripe width will reduce the odds that there will be cross-stripe
reads, but in no way eliminates them. The planets do not align that way.
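A quick sketch of why 4 MB reduces but doesn't eliminate cross-stripe reads:
assume block-aligned multiblock reads whose starting block is uniform over
the stripe unit (the 8 KB db_block_size below is my assumption, not from
either post).

```python
def p_cross_stripe(read_kb, unit_kb, block_kb=8):
    """Probability a block-aligned multiblock read crosses a stripe-unit
    boundary, assuming a uniformly random block-aligned start.
    block_kb=8 is an assumed db_block_size -- a toy model only."""
    positions = unit_kb // block_kb           # possible starting slots
    crossing = read_kb // block_kb - 1        # starts that straddle a boundary
    return min(1.0, max(0, crossing) / positions)

print(p_cross_stripe(128, 4 * 1024))   # ~0.029: rare with 4 MB units, not zero
print(p_cross_stripe(128, 64))         # 1.0: 128 KB always spans 64 KB units
```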

>>>IMHO I'm not sure why IBM are recommending that you go
>>>as large as 32 MB.

I'm still surprised to hear there is such a thing as a 32MB stripe
width on a DSXXXX array... maybe they meant 32KB (which would be a
trainwreck) ?


