In order to test networking between two sides of a network, use 'iperf'.

The direct path read code does a lot of things differently than the traditional multiblock-read code path. Sadly, this behavior is as good as undocumented by Oracle. Takeaways:

(Please mind these are the results from testing on Linux x64 with Oracle 11.2.0.1 & 11.2.0.3 with locally attached disks and ASM.)

- direct path read starts off with 2 IOs in parallel
- heuristics determine if the number of IOs in parallel scales up (and supposedly down, but I never saw that)
- if IOs are fast enough, no wait is shown in the tracefile (fast enough means all IOs are ready)

Frits Hoogland
http://fritshoogland.wordpress.com
frits.hoogland@xxxxxxxxx
+31 6 53569942

On Aug 19, 2013, at 6:45 PM, "Hameed, Amir" <Amir.Hameed@xxxxxxxxx> wrote:

> Hi Folks,
> We migrated our single-instance ERP database to a four-node RAC system about a week ago. As part of this migration, we also moved from SAN to NAS. Here is a high-level architecture of the environment:
>
> - RDBMS and Grid versions = 11.2.0.3 with April PSU
> - Each node has two 10GbE paths to the NAS.
> - Database is configured with DNFS.
> - Storage vendor is EMC. NAS Data Movers are connected to the EMC VMAX storage.
> - Database files are sitting on SSD drives with a RAID-5 configuration.
> - Redo log files are on RAID-10.
>
> Since migrating to RAC, we are seeing high latencies on some batch jobs. Jobs that used to run in 45 minutes are now taking a few hours to run. We have a copy of production that is set up the same way as production (same disk layout, same hardware, etc.), with the only difference that non-production is a three-node RAC environment.
> If I take a simple statement to force a full scan in parallel on a large table and run it against production and the production-sized test environment, the average wait of "direct path read" from a 10046 trace is almost three times higher in production than in non-production. Here is the statement that I am running:
>
> select /*+ full(MMT) parallel(MMT,8) */ count(*) from apps.mtl_material_transactions MMT;
>
> Production
> call     count       cpu    elapsed       disk      query    current        rows
> ------- ------  -------- ---------- ---------- ---------- ----------  ----------
> Parse        1      0.00       0.00          0          0          0           0
> Execute      1      0.02       0.02          0         31          0           0
> Fetch        2    185.89     888.70    7706356    7706545          0           1
> ------- ------  -------- ---------- ---------- ---------- ----------  ----------
> total        4    185.91     888.73    7706356    7706576          0           1
>
> Elapsed times include waiting on following events:
>   Event waited on                             Times   Max. Wait  Total Waited
>   ----------------------------------------   Waited  ----------  ------------
>   reliable message                                2        0.00          0.00
>   enq: KO - fast object checkpoint                4        0.00          0.00
>   SQL*Net message to client                       2        0.00          0.00
>   Disk file operations I/O                      117        0.00          0.00
>   direct path read                            83856        3.03        465.14
>   db file sequential read                        14        0.00          0.02
>   SQL*Net message from client                     2        0.00          0.00
> ********************************************************************************
>
> Non-production
> call     count       cpu    elapsed       disk      query    current        rows
> ------- ------  -------- ---------- ---------- ---------- ----------  ----------
> Parse        1      0.00       0.00          0          2          0           0
> Execute      1      0.03       0.03          0         31          0           0
> Fetch        2    148.36     224.13    7103122    7103268          0           1
> ------- ------  -------- ---------- ---------- ---------- ----------  ----------
> total        4    148.39     224.17    7103122    7103301          0           1
>
> Elapsed times include waiting on following events:
>   Event waited on                             Times   Max. Wait  Total Waited
>   ----------------------------------------   Waited  ----------  ------------
>   enq: KO - fast object checkpoint                6        0.00          0.00
>   reliable message                                2        0.00          0.00
>   enq: PS - contention                           52        0.00          0.01
>   SQL*Net message to client                       2        0.00          0.00
>   Disk file operations I/O                      106        0.00          0.00
>   direct path read                             3773        1.49         67.53
>   SQL*Net message from client                     2      845.16        845.16
> ********************************************************************************
>
> The raw trace file shows that some IOs in production are taking over a second to complete, as shown below:
>
> WAIT #18446744071452723008: nam='direct path read' ela= 1109814 file number=1125 first dba=2080 block cnt=128 obj#r04 tim=950081082913
>
> We have EMC and Oracle engaged and are trying to diagnose the issue. Since this is DNFS and all IOs are going over the network layer, we had our network folks look at the network layer, and according to them all is clean there; however, I am not convinced. EMC is reporting that from the NAS side they are seeing an average service response of 2-4 ms.
>
> My question is: is there a way to send a large packet from the RAC host side and gauge its latency? Ping can work for packet sizes up to 32k, but because Oracle tries to do a 1 MB IO for direct path reads, I am trying to see if there is a tool available that can allow me to send larger packets and gauge their latency.
>
> Any help will be appreciated.
>
> Thanks,
> Amir
>
> --
> //www.freelists.org/webpage/oracle-l

--
//www.freelists.org/webpage/oracle-l
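[Editor's note] Rather than eyeballing individual WAIT lines, the raw 10046 trace can be aggregated to get the count, average, and worst-case 'direct path read' latency directly. A minimal Python sketch: the helper name and the sample lines are illustrative, modeled on the WAIT line format quoted above (`ela=` is in microseconds on 11g).

```python
import re

# Match the ela= value (microseconds) on 'direct path read' WAIT lines.
WAIT_RE = re.compile(r"nam='direct path read' ela=\s*(\d+)")

def summarize_direct_path_reads(trace_lines):
    """Return (count, avg_ms, max_ms) for direct path read waits."""
    elas = [int(m.group(1))
            for line in trace_lines
            for m in [WAIT_RE.search(line)] if m]
    if not elas:
        return (0, 0.0, 0.0)
    return (len(elas),
            sum(elas) / len(elas) / 1000.0,   # microseconds -> milliseconds
            max(elas) / 1000.0)

# Hypothetical two-line excerpt modeled on the WAIT line quoted in the post.
sample = [
    "WAIT #140: nam='direct path read' ela= 1109814 file number=1125 first dba=2080 block cnt=128 obj#=12345 tim=950081082913",
    "WAIT #140: nam='direct path read' ela= 2500 file number=1125 first dba=2208 block cnt=128 obj#=12345 tim=950081085413",
]
count, avg_ms, max_ms = summarize_direct_path_reads(sample)
print(count, avg_ms, max_ms)
```

Run against the full production trace, this would show whether the 465 seconds of total wait comes from a broad slowdown or from a tail of outlier IOs like the 1.1-second one above.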
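[Editor's note] On the closing question: iperf, as suggested above, is the purpose-built tool, but the idea can be sketched in a few lines of Python: open a TCP connection, push a 1 MiB buffer (roughly the direct path read IO size mentioned above), and time how long until the far side acknowledges it. Everything below (`sink_server`, `measure_send_latency`, the loopback demo) is a hypothetical illustration, not a substitute for iperf; to measure a real path, run the server half on the remote host.

```python
import socket
import threading
import time

SIZE = 1024 * 1024  # 1 MiB, roughly Oracle's direct path read IO size

def sink_server(listener, size):
    """Accept one connection, read `size` bytes, then send a short ack."""
    conn, _ = listener.accept()
    with conn:
        received = 0
        while received < size:
            chunk = conn.recv(65536)
            if not chunk:
                break
            received += len(chunk)
        conn.sendall(b"ok")

def measure_send_latency(host, port, size=SIZE):
    """Time pushing `size` bytes and waiting for the ack; returns seconds."""
    with socket.create_connection((host, port)) as s:
        start = time.monotonic()
        s.sendall(b"x" * size)
        s.recv(2)                       # wait for the far side's ack
        return time.monotonic() - start

# Loopback demo; on a real network, run the server half on the remote host.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=sink_server, args=(listener, SIZE), daemon=True).start()
elapsed = measure_send_latency("127.0.0.1", port)
print(f"1 MiB transfer + ack: {elapsed * 1000:.2f} ms")
```

The server drains the whole buffer before replying, so unlike a naive echo test there is no risk of both sides blocking on full socket buffers mid-transfer.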