Application of archived logs on physical standby very slow

  • From: Martin Bach <development@xxxxxxxxxxxxxxxxx>
  • To: ORACLE-L <oracle-l@xxxxxxxxxxxxx>
  • Date: Wed, 27 Jan 2010 17:13:28 +0000

Dear listers,

I'd like to add another "interesting" problem to today's interesting RMAN problem. I have quite a bit of output from diagnostic tools, hopefully this doesn't got too badly mangled...

Our alerting system fired when my physical standby database went out of sync with production. DR and production share the same storage (an entry level MSA) and are multipathed. I could observe this on the hosts (both RHEL 5.2 32bit, don't ask):

Wed Jan 27 10:42:42 2010
RFS[2]: Archived Log: '/u03/oradata/dr/arch/1_49264_596779339.dbf'
Primary database is in MAXIMUM PERFORMANCE mode
Wed Jan 27 10:47:55 2010
Media Recovery Log /u03/oradata/dr/arch/1_49252_596779339.dbf
Wed Jan 27 10:55:27 2010
Media Recovery Log /u03/oradata/dr/arch/1_49253_596779339.dbf
Wed Jan 27 11:01:46 2010

Online redo logs are 500M in size and so are most archived logs, and as you can see it takes 5-10 minutes to apply just one, which is consistent.

top - 11:25:37 up 57 days, 17:34, 13 users,  load average: 0.55, 1.05, 1.19
Tasks: 209 total,   1 running, 208 sleeping,   0 stopped,   0 zombie
Cpu0 : 2.8%us, 1.9%sy, 0.0%ni, 51.9%id, 43.5%wa, 0.0%hi, 0.0%si, 0.0%st Cpu1 : 1.9%us, 0.9%sy, 0.0%ni, 42.6%id, 54.6%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 0.9%us, 0.0%sy, 0.0%ni, 98.1%id, 0.9%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 0.0%us, 0.9%sy, 0.0%ni, 99.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   8310064k total,  8019968k used,   290096k free,    36456k buffers
Swap:  8152948k total,    70124k used,  8082824k free,  7568912k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 21647 oracle 15 0 2178m 48m 31m S 3 0.6 0:00.41 ora_mrp0_dr 7496 oracle 15 0 2145m 1.5g 1.5g S 2 18.4 190:47.10 ora_p000_dr 7500 oracle 15 0 2145m 1.5g 1.5g S 2 18.4 199:38.12 ora_p002_dr 2580 root 10 -5 0 0 0 S 1 0.0 17:42.07 [kjournald] 7498 oracle 16 0 2145m 1.5g 1.5g D 1 18.4 189:40.17 ora_p001_dr

The only processes on cpu are MRP0 and its parallel slaves. Funny enough, iostat doesn't seem to see a problem:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.25   24.75    0.00   75.00

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util cciss/c0d0 0.00 4.00 0.00 8.00 0.00 48.00 12.00 0.01 0.88 0.50 0.40 cciss/c0d1 0.00 5.00 0.00 18.00 0.00 92.00 10.22 0.02 0.89 0.06 0.10 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 211.00 0.00 189.00 0.00 1600.00 16.93 0.49 2.57 2.57 48.60 sdd 0.00 203.00 0.00 183.00 0.00 1544.00 16.87 0.50 2.71 2.74 50.10 sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 786.00 0.00 3144.00 8.00 2.09 2.64 1.26 98.70 dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-5 0.00 0.00 0.00 786.00 0.00 3144.00 8.00 2.09 2.65 1.26 98.80 dm-6 0.00 0.00 0.00 23.00 0.00 92.00 8.00 0.02 0.83 0.04 0.10 dm-7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Service time is pretty decent I'd say and the SAN administrators tell me that everything on the array is fine. I checked for AIO and that's also used:

[oracle]$ cat /proc/slabinfo | grep ^k
kioctx 25 60 192 20 1 : tunables 120 60 8 : slabdata 3 3 0 kiocb 64 90 128 30 1 : tunables 120 60 8 : slabdata 3 3 0

Data files are scattered over mount points /u02-/u06, and there are no multipathing (or other I/O related errors) reported in dmesg and /var/log/messages. Archive logs are received on the DR box in time, and there are no gaps. Our sys admin tells me that device mapper multipath is happily distributing IO round robin. Oh yes, and it's Oracle 10.2.0.4 32bit.

Now does anyone have an idea on how to proceed?

Regards,

Martin
--
Martin Bach
OCM 10g
http://martincarstenbach.wordpress.com
http://www.linkedin.com/in/martincarstenbach
--
//www.freelists.org/webpage/oracle-l


Other related posts: