What does the 10046 trace for the MRP process show? You may well also end up doing OS-level tracing with strace or similar.
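For a background process like MRP0 the 10046 trace has to be switched on from a second session via oradebug. A minimal sketch, using the MRP0 OS pid (21647) from the top output quoted below; the same would work for the p00n slave pids:

  [oracle]$ sqlplus / as sysdba
  SQL> -- attach to the recovery process by OS pid, lift the trace file size limit
  SQL> oradebug setospid 21647
  SQL> oradebug unlimit
  SQL> -- level 8 includes wait events; let a log or two apply, then turn it off
  SQL> oradebug event 10046 trace name context forever, level 8
  SQL> oradebug event 10046 trace name context off
  SQL> -- print the path of the trace file that was written
  SQL> oradebug tracefile_name

On the OS side a syscall summary is often enough to start with, e.g. (mind the overhead on a busy process):

  [oracle]$ strace -cp 21647    # Ctrl-C after a minute or so to get the summary table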
On Thu, Jan 28, 2010 at 9:15 AM, Martin Bach <development@xxxxxxxxxxxxxxxxx> wrote:
> Hi Niall!
>
> Couldn't agree more. I seem to have the dilemma where my storage array
> seems to be in good working order and my OS doesn't report any unusually
> high I/O service times. I'm wondering where else I could look for clues
> as to why, trying to follow a methodical approach to troubleshooting.
>
> The load is clearly caused by Oracle processes; shutting the standby down
> reduces the iowaits to 0 (but that doesn't really solve my problem :)
>
> Regards,
>
> Martin
>
> On 01/27/2010 07:06 PM, Niall Litchfield wrote:
>> I'd have thought this problem was eminently suited to Method R.
>>
>> On 1/27/10, Johnson, William L (TEIS) <WLJohnson@xxxxxxxxxxxxxxxxxxx> wrote:
>>> What do you have your parallel_execution_message_size set to on the
>>> standby database?
>>>
>>> -----Original Message-----
>>> From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx]
>>> On Behalf Of Martin Bach
>>> Sent: Wednesday, January 27, 2010 12:13 PM
>>> To: ORACLE-L
>>> Subject: Application of archived logs on physical standby very slow
>>>
>>> Dear listers,
>>>
>>> I'd like to add another "interesting" problem to today's interesting
>>> RMAN problem. I have quite a bit of output from diagnostic tools;
>>> hopefully it doesn't get too badly mangled...
>>>
>>> Our alerting system fired when my physical standby database went out of
>>> sync with production. DR and production share the same storage (an
>>> entry-level MSA) and are multipathed. I could observe this on the hosts
>>> (both RHEL 5.2 32-bit, don't ask):
>>>
>>> Wed Jan 27 10:42:42 2010
>>> RFS[2]: Archived Log: '/u03/oradata/dr/arch/1_49264_596779339.dbf'
>>> Primary database is in MAXIMUM PERFORMANCE mode
>>> Wed Jan 27 10:47:55 2010
>>> Media Recovery Log /u03/oradata/dr/arch/1_49252_596779339.dbf
>>> Wed Jan 27 10:55:27 2010
>>> Media Recovery Log /u03/oradata/dr/arch/1_49253_596779339.dbf
>>> Wed Jan 27 11:01:46 2010
>>>
>>> Online redo logs are 500M in size and so are most archived logs; as you
>>> can see, it consistently takes 5-10 minutes to apply a single log.
>>>
>>> top - 11:25:37 up 57 days, 17:34, 13 users, load average: 0.55, 1.05, 1.19
>>> Tasks: 209 total, 1 running, 208 sleeping, 0 stopped, 0 zombie
>>> Cpu0 :  2.8%us,  1.9%sy,  0.0%ni, 51.9%id, 43.5%wa,  0.0%hi,  0.0%si,  0.0%st
>>> Cpu1 :  1.9%us,  0.9%sy,  0.0%ni, 42.6%id, 54.6%wa,  0.0%hi,  0.0%si,  0.0%st
>>> Cpu2 :  0.9%us,  0.0%sy,  0.0%ni, 98.1%id,  0.9%wa,  0.0%hi,  0.0%si,  0.0%st
>>> Cpu3 :  0.0%us,  0.9%sy,  0.0%ni, 99.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>> Mem:   8310064k total,  8019968k used,   290096k free,    36456k buffers
>>> Swap:  8152948k total,    70124k used,  8082824k free,  7568912k cached
>>>
>>>   PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
>>> 21647 oracle 15   0 2178m  48m  31m S    3  0.6   0:00.41 ora_mrp0_dr
>>>  7496 oracle 15   0 2145m 1.5g 1.5g S    2 18.4 190:47.10 ora_p000_dr
>>>  7500 oracle 15   0 2145m 1.5g 1.5g S    2 18.4 199:38.12 ora_p002_dr
>>>  2580 root   10  -5     0    0    0 S    1  0.0  17:42.07 [kjournald]
>>>  7498 oracle 16   0 2145m 1.5g 1.5g D    1 18.4 189:40.17 ora_p001_dr
>>>
>>> The only processes on CPU are MRP0 and its parallel slaves. Funnily
>>> enough, iostat doesn't seem to see a problem:
>>>
>>> avg-cpu:  %user %nice %system %iowait %steal %idle
>>>            0.00  0.00    0.25   24.75   0.00 75.00
>>>
>>> Device:    rrqm/s wrqm/s    r/s    w/s rkB/s   wkB/s avgrq-sz avgqu-sz await svctm %util
>>> cciss/c0d0   0.00   4.00   0.00   8.00  0.00   48.00    12.00     0.01  0.88  0.50  0.40
>>> cciss/c0d1   0.00   5.00   0.00  18.00  0.00   92.00    10.22     0.02  0.89  0.06  0.10
>>> sda          0.00   0.00   0.00   0.00  0.00    0.00     0.00     0.00  0.00  0.00  0.00
>>> sdb          0.00   0.00   0.00   0.00  0.00    0.00     0.00     0.00  0.00  0.00  0.00
>>> sdc          0.00 211.00   0.00 189.00  0.00 1600.00    16.93     0.49  2.57  2.57 48.60
>>> sdd          0.00 203.00   0.00 183.00  0.00 1544.00    16.87     0.50  2.71  2.74 50.10
>>> sde          0.00   0.00   0.00   0.00  0.00    0.00     0.00     0.00  0.00  0.00  0.00
>>> sdf          0.00   0.00   0.00   0.00  0.00    0.00     0.00     0.00  0.00  0.00  0.00
>>> dm-0         0.00   0.00   0.00   0.00  0.00    0.00     0.00     0.00  0.00  0.00  0.00
>>> dm-1         0.00   0.00   0.00   0.00  0.00    0.00     0.00     0.00  0.00  0.00  0.00
>>> dm-2         0.00   0.00   0.00 786.00  0.00 3144.00     8.00     2.09  2.64  1.26 98.70
>>> dm-3         0.00   0.00   0.00   0.00  0.00    0.00     0.00     0.00  0.00  0.00  0.00
>>> dm-4         0.00   0.00   0.00   0.00  0.00    0.00     0.00     0.00  0.00  0.00  0.00
>>> dm-5         0.00   0.00   0.00 786.00  0.00 3144.00     8.00     2.09  2.65  1.26 98.80
>>> dm-6         0.00   0.00   0.00  23.00  0.00   92.00     8.00     0.02  0.83  0.04  0.10
>>> dm-7         0.00   0.00   0.00   0.00  0.00    0.00     0.00     0.00  0.00  0.00  0.00
>>> dm-8         0.00   0.00   0.00   0.00  0.00    0.00     0.00     0.00  0.00  0.00  0.00
>>>
>>> Service time is pretty decent, I'd say, and the SAN administrators tell
>>> me that everything on the array is fine. I checked for asynchronous I/O
>>> and that's in use as well:
>>>
>>> [oracle]$ cat /proc/slabinfo | grep ^k
>>> kioctx    25 60 192 20 1 : tunables 120 60 8 : slabdata 3 3 0
>>> kiocb     64 90 128 30 1 : tunables 120 60 8 : slabdata 3 3 0
>>>
>>> Data files are scattered over mount points /u02-/u06, and there are no
>>> multipathing (or other I/O-related) errors reported in dmesg or
>>> /var/log/messages. Archive logs are received on the DR box in time, and
>>> there are no gaps. Our sys admin tells me that device-mapper multipath
>>> is happily distributing I/O round-robin. Oh yes, and it's Oracle
>>> 10.2.0.4 32-bit.
>>>
>>> Now, does anyone have an idea on how to proceed?
>>>
>>> Regards,
>>>
>>> Martin
>>> --
>>> Martin Bach
>>> OCM 10g
>>> http://martincarstenbach.wordpress.com
>>> http://www.linkedin.com/in/martincarstenbach
>>> --
>>> //www.freelists.org/webpage/oracle-l
>>>
>
> --
> Martin Bach
> OCM 10g
> http://martincarstenbach.wordpress.com
> http://www.linkedin.com/in/martincarstenbach

--
Niall Litchfield
Oracle DBA
http://www.orawin.info
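Worth cross-checking against the OS picture: what Oracle itself thinks the recovery processes are waiting on. A minimal sketch against v$session on the standby (10g exposes the wait columns inline; the program patterns assume the standard background-process naming seen in the top output):

  [oracle]$ sqlplus / as sysdba
  SQL> -- current wait event of MRP0 and the recovery slaves
  SQL> select program, event, state, seconds_in_wait
    2    from v$session
    3   where program like '%(MRP0)' or program like '%(P0%)';

If the slaves turn out to be waiting on writes rather than reads, that would line up with dm-2 and dm-5 sitting at ~99% utilisation in the iostat output above despite the healthy per-request service times.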