Re: I/O performance

  • From: Karl Arao <karlarao@xxxxxxxxx>
  • To: Brandon.Allen@xxxxxxxxxxx
  • Date: Thu, 21 Jun 2012 14:36:03 -0500

I've got a couple of points here..
Calibrate IO -
--------------------
Yes, I agree with calibrate IO doing the 8K reads first and then the large reads at the end of it.. and if you increase the "num_physical_disks" to 128, or some value way larger than your actual number of disks, it will drive a longer sustained IO workload.. which you can see here
https://lh6.googleusercontent.com/-sOsWu7Pic6Y/T-Np0FfGMTI/AAAAAAAABp0/LgvdL6-kF8A/s2048/20120621_calibrateio.PNG
That's a run with num_physical_disks of 8, 16, and 128. The 128 run reached the max bandwidth of the storage.
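
In case anyone wants to reproduce it, a run with an inflated num_physical_disks looks roughly like this (just a sketch - 128 and the max_latency of 10 are example values, adjust for your own storage):

# sketch: calibrate IO with num_physical_disks cranked way up for a longer sustained run
sqlplus -s "/ as sysdba" <<'EOF'
set serveroutput on
declare
  l_iops pls_integer;
  l_mbps pls_integer;
  l_lat  pls_integer;
begin
  dbms_resource_manager.calibrate_io(
    num_physical_disks => 128,   -- way larger than the real disk count = longer sustained IO
    max_latency        => 10,
    max_iops           => l_iops,
    max_mbps           => l_mbps,
    actual_latency     => l_lat);
  dbms_output.put_line('max_iops='||l_iops||' max_mbps='||l_mbps||' latency='||l_lat);
end;
/
EOF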

Short Stroking -
--------------------
Also, on your earlier reply, short stroking the disks is really helpful for IO performance.. and that's what Exadata is actually doing when it allocates cell disks out of the outer layer. I've got an R&D server with 8 x 1TB disks and I short stroked them to the 320GB outer layer, but it took me a while to find the short stroke sweet spot
http://www.facebook.com/photo.php?fbidH9469633028&lfe0cb72e
I believe I used HD Tach there, plus a bunch of test cases in Linux where I grew the data area size of the disk, and it also depends on the size of the data area that you need.. so I laid mine out as 320GB (outer) x 8 for the DATA ASM, 320GB x 8 on the 2nd outer band for the LVM that I striped for my VirtualBox guests, and the rest for my RECO area, which I use for backups.
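
The carving itself is just ordinary partitioning - the low LBAs of a drive map to the outer (fastest) tracks, so the first partition on each disk becomes the short-stroked area. A rough sketch of a layout like the one above (device names and boundaries are placeholders, not my exact partition table):

# sketch: carve each 1TB disk into an outer band, a 2nd band, and the rest
for d in sda sdb sdc sdd sde sdf sdg sdh; do
  parted -s /dev/$d mklabel gpt
  parted -s /dev/$d mkpart data  0%    320GB    # outer ~320GB -> DATA ASM
  parted -s /dev/$d mkpart vbox  320GB 640GB    # 2nd band     -> LVM for VirtualBox guests
  parted -s /dev/$d mkpart reco  640GB 100%     # the rest     -> RECO / backups
done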

Stripe size -
--------------------
Aside from short stroking the disks, the larger the stripe size I used for my LVM, the greater the performance I got from sequential reads and writes:
# VBOX
pvcreate /dev/sda6 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3 /dev/sdf3 /dev/sdg3 /dev/sdh3
vgcreate vgvbox /dev/sda6 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3 /dev/sdf3 /dev/sdg3 /dev/sdh3
lvcreate -n lvvbox -i 8 -I 4096 vgvbox -l 625008    <-- striped across 8 disks with a 4MB stripe, so it behaves like ASM writing 4M chunks on each physical volume; when you do IO operations all of your spindles are working
mkfs.ext3 /dev/vgvbox/lvvbox
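
Before running Orion against it, it's worth a quick sanity check that the stripe really landed across all the PVs - something like this (a rough sketch):

# confirm the LV segment is striped across all 8 PVs with a 4MB stripe size
lvdisplay -m /dev/vgvbox/lvvbox

# quick-and-dirty sequential read check off the LV, bypassing the page cache
dd if=/dev/vgvbox/lvvbox of=/dev/null bs=4M count=2048 iflag=direct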

and check out the 1MB and 4MB stripe size comparison here
#### 320GB LVM STRIPE 1MB VS 4MB ON UEK

$ less 16_ss_320GB_LVM_lvvm-1MBstripe/orion.log | grep +++ | grep -v RUN
+++ localhost.localdomain params_dss_randomwrites Maximum Large MBPS=286.61 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqwrites Maximum Large MBPS=285.16 @ Small=0 and Large=256          <---- at 1MB stripe
+++ localhost.localdomain params_dss_randomreads Maximum Large MBPS=460.69 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqreads Maximum Large MBPS=459.27 @ Small=0 and Large=256           <---- at 1MB stripe
+++ localhost.localdomain params_oltp_randomwrites Maximum Small IOPS=754 @ Small=256 and Large=0 Minimum Small Latency38.54 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqwrites Maximum Small IOPS=735 @ Small=256 and Large=0 Minimum Small Latency47.24 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_randomreads Maximum Small IOPS01 @ Small=256 and Large=0 Minimum Small Latency2.61 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqreads Maximum Small IOPS96 @ Small=256 and Large=0 Minimum Small Latency3.24 @ Small=256 and Large=0
+++ localhost.localdomain params_dss Maximum Large MBPS14.26 @ Small=0 and Large
+++ localhost.localdomain params_oltp Maximum Small IOPS=791 @ Small and Large=0 Minimum Small Latency.51 @ Small=1 and Large=0
oracle@xxxxxxxxxxxxxxxxxxx:/reco/orion:dw


$ less 15_ss_320GB_LVM_lvvm-4MBstripe/orion.log  | grep +++ | grep -v RUN
+++ localhost.localdomain params_dss_randomwrites Maximum Large MBPS=283.53 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqwrites Maximum Large MBPS=426.95 @ Small=0 and Large=256          <---- at 4MB stripe
+++ localhost.localdomain params_dss_randomreads Maximum Large MBPS=462.84 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqreads Maximum Large MBPS=614.10 @ Small=0 and Large=256           <---- at 4MB stripe
+++ localhost.localdomain params_oltp_randomwrites Maximum Small IOPS=753 @ Small=256 and Large=0 Minimum Small Latency38.63 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqwrites Maximum Small IOPS=731 @ Small=256 and Large=0 Minimum Small Latency49.22 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_randomreads Maximum Small IOPS98 @ Small=256 and Large=0 Minimum Small Latency3.03 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqreads Maximum Small IOPS06 @ Small=256 and Large=0 Minimum Small Latency1.97 @ Small=256 and Large=0
+++ localhost.localdomain params_dss Maximum Large MBPS15.68 @ Small=0 and Large
+++ localhost.localdomain params_oltp Maximum Small IOPS=792 @ Small and Large=0 Minimum Small Latency.42 @ Small=1 and Large=0
oracle@xxxxxxxxxxxxxxxxxxx:/reco/orion:dw


You see the difference: 459.27 MB/s at the 1MB stripe vs 614.10 MB/s at the 4MB stripe on sequential reads.. that's a lot!
You'll see more details here --> LVM stripe size, AU size, UEK kernel test case comparison -
http://www.evernote.com/shard/s48/sh/36636b46-995a-4812-bd07-e88fa0dfd191/d36f37565243025e7b5792f496dc5a37
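
On the ASM side, the analogous knob in those "AU size" test cases is the diskgroup allocation unit, which is set at creation time - something like this (just a sketch, the disk paths and compatible setting are placeholders):

# sketch: create a diskgroup with a 4M AU, the ASM counterpart of the 4MB LVM stripe
sqlplus -s "/ as sysasm" <<'EOF'
create diskgroup DATA external redundancy
  disk '/dev/oracleasm/disks/DATA1', '/dev/oracleasm/disks/DATA2'
  attribute 'au_size' = '4M', 'compatible.asm' = '11.2';
EOF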


UEK vs Regular Kernel -
----------------------------------------
And not only that.. I noticed that when I used a UEK kernel it gave me more MB/s on sequential reads and writes, which is possibly because of kernel optimizations
http://www.oracle.com/us/technologies/linux/uek-for-linux-177034.pdf
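
(When comparing kernels like this it's worth recording which kernel and IO elevator each Orion run was actually on - a quick sketch, the example values are just illustrative:)

# note down the kernel and IO scheduler in effect for each run
uname -r                               # e.g. ...el5uek vs the regular kernel
cat /sys/block/sd*/queue/scheduler     # e.g. [deadline] vs [cfq]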

#### NON-UEK VS UEK ON LVM - *the regular kernel gives lower MB/s on
sequential reads/writes*

$ cat 23_ss_320GB_LVM_lvvbox-4MBstripe-regularkernel/orion.log | grep +++ | grep -v RUN
+++ localhost.localdomain params_dss_randomwrites Maximum Large MBPS=258.40 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqwrites Maximum Large MBPS43.02 @ Small=0 and Large=256              <---- regular kernel
+++ localhost.localdomain params_dss_randomreads Maximum Large MBPS=413.60 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqreads Maximum Large MBPS=550.17 @ Small=0 and Large=256              <---- regular kernel
+++ localhost.localdomain params_oltp_randomwrites Maximum Small IOPS=734 @ Small=256 and Large=0 Minimum Small Latency47.84 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqwrites Maximum Small IOPS=716 @ Small=256 and Large=0 Minimum Small Latency56.26 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_randomreads Maximum Small IOPS45 @ Small=256 and Large=0 Minimum Small Latency0.21 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqreads Maximum Small IOPS40 @ Small=256 and Large=0 Minimum Small Latency0.91 @ Small=256 and Large=0
+++ localhost.localdomain params_dss Maximum Large MBPS10.54 @ Small=0 and Large
+++ localhost.localdomain params_oltp Maximum Small IOPS=780 @ Small and Large=0 Minimum Small Latency.47 @ Small=1 and Large=0
oracle@xxxxxxxxxxxxxxxxxxx:/reco/orion:dw

$ less 15_ss_320GB_LVM_lvvm-4MBstripe/orion.log  | grep +++ | grep -v RUN
+++ localhost.localdomain params_dss_randomwrites Maximum Large MBPS=283.53 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqwrites Maximum Large MBPS=426.95 @ Small=0 and Large=256             <---- UEK kernel
+++ localhost.localdomain params_dss_randomreads Maximum Large MBPS=462.84 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqreads Maximum Large MBPS=614.10 @ Small=0 and Large=256              <---- UEK kernel
+++ localhost.localdomain params_oltp_randomwrites Maximum Small IOPS=753 @ Small=256 and Large=0 Minimum Small Latency38.63 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqwrites Maximum Small IOPS=731 @ Small=256 and Large=0 Minimum Small Latency49.22 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_randomreads Maximum Small IOPS98 @ Small=256 and Large=0 Minimum Small Latency3.03 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqreads Maximum Small IOPS06 @ Small=256 and Large=0 Minimum Small Latency1.97 @ Small=256 and Large=0
+++ localhost.localdomain params_dss Maximum Large MBPS15.68 @ Small=0 and Large
+++ localhost.localdomain params_oltp Maximum Small IOPS=792 @ Small and Large=0 Minimum Small Latency.42 @ Small=1 and Large=0
oracle@xxxxxxxxxxxxxxxxxxx:/reco/orion:dw



ASM redundancy / SAN redundancy -
-----------------------------------------------------------

I'll pull up a conversation I had with a really good friend of mine.. his question was: "Quick question on your AWR mining script AWR-gen-wl..., is IOPs calculated before or after ASM mirroring? For example on Exadata, if I see 10,000 write IOPs, did the cells do 10k or did they do 20k (normal redundancy)?"

and here's my response..

The scripts awr_iowl.sql and awr_iowlexa.sql have columns that account for RAID1.. that is, a read penalty of 1 and a write penalty of 2.

Read the section "the IOPS RAID penalty" on this link
http://www.zdnetasia.com/calculate-iops-in-a-storage-array-62061792.htm
and the "real life examples" on this link
http://www.yellow-bricks.com/2009/12/23/iops/


So those computations should also apply to Exadata, since normal redundancy is essentially RAID1, which is a write penalty of 2, and high redundancy is a penalty of 3.


Now I remember this sizing exercise I had with an EMC engineer on a project bid before
https://www.evernote.com/shard/s48/sh/03602b99-3274-4c64-b5d1-bbe7bd961f8d/95be02ccf9aa75cf863bb19115353eb0

and that's why I created those columns to get the data directly from AWR.. so for every snapshot you've got the "hardware IOPS needed" and "number of disks needed". What's good about that is, as your workload varies, those two numbers stay representative of that workload. And since you have a lot of data samples, I usually make a histogram on those two columns and get the top percentile numbers, because most likely those are my peak periods, and I can investigate them by drilling down on the snap_ids, looking into the SQLs, and validating with the app owners what the application was running at that time.


I've attached an excel sheet where you can just plug the total workload IOPS into the yellow box. So in your case, let's say you have 10K workload IOPS... that's equivalent to 15K hardware IOPS for normal redundancy and 20K hardware IOPS for high redundancy.

the excel screenshot is actually here ---->
https://lh6.googleusercontent.com/-00PkzwfwnOE/T-N0oo2Q-FI/AAAAAAAABqE/EbTOnHBlpmQ/s2048/20120621_IOPS.png
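
If you don't want to open the spreadsheet, the math behind the yellow box is just this (a sketch of the calculation; the 50/50 read/write mix is the assumption that makes 10K come out to 15K and 20K):

# hardware IOPS = reads + (writes x write penalty), assuming a 50/50 read/write mix
workload_iops=10000
read_pct=50
awk -v iops=$workload_iops -v rd=$read_pct 'BEGIN {
  r = iops * rd / 100; w = iops - r;
  printf "normal redundancy (penalty 2): %d hardware IOPS\n", r + w*2;
  printf "high redundancy   (penalty 3): %d hardware IOPS\n", r + w*3;
}'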

Note that I'm particular with the words "workload IOPS" and "hardware IOPS".

So on this statement:

*if I see 10,000 write IOPs, did the cells do 10k or did they do 20k (normal redundancy)?* <-- if this 10,000 is what you pulled from the AWR, then it's the database that did the 10K IOPS, so that's the "workload IOPS".. and that's essentially your "IO workload requirements".


Then let's say you haven't migrated to Exadata yet.. you have to take into account the penalty computation shown above.. so you arrive at 15000 "hardware IOPS" needed (normal redundancy).. and say each disk does 180 IOPS, then you need at least 83 disks, and that's 83 disks / 12 disks per cell = 6.9 storage cells ... which is a Half Rack Exadata. But looking at the data sheet
https://www.dropbox.com/s/ltvr7caysvfmvkr/dbmachine-x2-2-datasheet-175280.pdf
it seems like you can fit the 15000 on a quarter rack (because of the flash).. mmm.. well, I'm not too confident with that, because if let's say 50% of the 15000 IOPS are writes (7.5K IOPS), then I would investigate the write mix to see if most of it is DBWR related (v$sysstat "physical write IO requests") or LGWR related (v$sysstat "redo writes"), and if most of it is DBWR related then I don't think you'll ever benefit from the smart flash log. So I would still go with the Half Rack (12500 disk IOPS) or Full Rack (25000 disk IOPS) for my "hardware IOPS" capacity. And I'll also take into consideration the MB/s needed for that database, but that should be augmented by the flash cache.
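
Carrying that through to the disk and cell count is the same kind of arithmetic (a sketch, using the 180 IOPS/disk and 12 disks/cell figures from the example above):

# 15000 hardware IOPS at 180 IOPS per disk, 12 disks per storage cell
awk 'BEGIN {
  hw_iops = 15000; iops_per_disk = 180; disks_per_cell = 12;
  disks = hw_iops / iops_per_disk;
  printf "disks needed: %.1f   storage cells: %.1f\n", disks, disks / disks_per_cell;
}'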




The effect of ASM redundancy on read/write IOPS? SLOB test case!
-------------------------------------------------------------------------------------------------------
I'm currently writing a blog post about this, but I'll give you bits of it right now.. So my statement above is true: the ASM redundancy or parity affects the workload write IOPS number, but it will not affect the workload read IOPS.

As you can see here, as I change the redundancy the read IOPS stayed in the range of 2400+ IOPS
128R
https://lh4.googleusercontent.com/-QEFEQkc3iy4/T-Npy9FRfpI/AAAAAAAABpk/VfvcxgN9D0k/s2048/20120621_128R.png

while on the writes, as I moved to "normal" redundancy it went down to half, and on "high" redundancy it went down to 1/3
128W
https://lh4.googleusercontent.com/-H7q6OpJnhRA/T-Npy8t5DDI/AAAAAAAABpo/jd2_Cp4exAc/s2048/20120621_128W.png

This behavior is the same even on a regular SAN environment... which is something you have to be careful about and aware of when sizing storage.





-- 
Karl Arao
karlarao.wordpress.com
karlarao.tiddlyspot.com

--
//www.freelists.org/webpage/oracle-l

