RE: Oracle 10g hangs intermittently waiting for I/O

  • From: "Matthew Zito" <mzito@xxxxxxxxxxx>
  • To: <pkotla@xxxxxx>, <oracle-l@xxxxxxxxxxxxx>
  • Date: Fri, 15 May 2009 09:57:19 -0400

If you run "dmesg", do you see any errors in the kernel logs?  If the 
devices stop responding to I/O for periods of time, there should be SCSI 
timeouts in the logs, or at least some warnings from the multipathing driver.
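
A minimal sketch of that check, for anyone who wants to script it. The
patterns below are assumptions, not exact kernel messages: the wording of
SCSI and multipath warnings varies by kernel version and driver, so adjust
them to what your logs actually contain. The names find_suspect_lines and
read_dmesg are made up for this example.

```python
import re
import subprocess

# Hypothetical patterns: exact kernel message wording differs between
# kernel versions and drivers, so treat these as a starting point only.
SUSPECT_PATTERNS = [
    re.compile(r"scsi.*(timeout|timed out|abort|reset)", re.IGNORECASE),
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"multipath", re.IGNORECASE),
]


def find_suspect_lines(log_text):
    """Return the log lines that match any of the suspect patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in SUSPECT_PATTERNS)
    ]


def read_dmesg():
    """Return the kernel ring buffer as text (may require root)."""
    result = subprocess.run(
        ["dmesg"], stdout=subprocess.PIPE, universal_newlines=True, check=True
    )
    return result.stdout
```

Calling find_suspect_lines(read_dmesg()) prints nothing useful by itself; 
loop over the returned list and print each line, or run it against a saved
copy of /var/log/messages from the day of a stall.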

Thanks,
Matt

--
Matthew Zito
Chief Scientist
GridApp Systems
P: 646-452-4090
mzito@xxxxxxxxxxx
http://www.gridapp.com



-----Original Message-----
From: oracle-l-bounce@xxxxxxxxxxxxx on behalf of Pawel Kotlarz
Sent: Fri 5/15/2009 9:48 AM
To: oracle-l@xxxxxxxxxxxxx
Subject: Oracle 10g hangs intermittently waiting for I/O
 
Hello all.

I have an Oracle 10.2.0.3 data warehouse database on 11.1.0.7 ASM with
asmlib, running RHEL 4.7 on a ProLiant DL585 G2 with MSA70 storage.

The problem I face is an 'I/O hiccup'. The database can work properly
for a week or two and then suddenly starts stalling for no apparent
reason. Users complain that their selects take 2x or 3x longer.
vmstat shows I/O activity (bi and bo columns) for half a minute, then
for another half a minute shows no activity (bi and bo equal to 0)
and a number of processes waiting for I/O (procs/b column). strace on an
Oracle process waiting for I/O shows it is waiting for a 'read' call
to complete. The only thing that helps is rebooting the box.
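
The stall pattern described above (bi and bo both 0 while the b column is
non-zero) can be picked out of captured vmstat output mechanically. This is
only a sketch: find_stalled_samples is a made-up name, and it assumes the
usual vmstat layout with a header row that names the b, bi and bo columns.

```python
def find_stalled_samples(vmstat_text):
    """Return (sample_index, blocked_count) for every vmstat sample that
    shows blocked processes (b > 0) but no block I/O (bi == bo == 0)."""
    columns = None
    stalled = []
    sample_index = 0
    for line in vmstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The column-name header is repeated periodically; re-read it each time.
        if "bi" in fields and "bo" in fields and "b" in fields:
            columns = {name: position for position, name in enumerate(fields)}
            continue
        # Skip the "procs ---memory---" banner and anything non-numeric.
        if columns is None or not all(field.isdigit() for field in fields):
            continue
        if len(fields) < len(columns):
            continue
        blocked = int(fields[columns["b"]])
        blocks_in = int(fields[columns["bi"]])
        blocks_out = int(fields[columns["bo"]])
        if blocked > 0 and blocks_in == 0 and blocks_out == 0:
            stalled.append((sample_index, blocked))
        sample_index += 1
    return stalled
```

Feeding it the output of something like "vmstat 5" saved to a file gives the
indexes of the stalled samples, which can then be lined up with timestamps
from iostat or the kernel log. Note that the first vmstat sample reports
averages since boot, so it is best ignored.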

I can isolate the problem to specific disks using iostat. The affected
disks stay the same throughout a day on which the problem occurs, but
they differ from one occurrence of the problem to another. Storage /
Linux admins do not see any problem on their side.
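
One way to cross-check the iostat observation is to compare two snapshots
of /proc/diskstats taken a few seconds apart: a device that has requests in
flight but completes no reads or writes between the snapshots is a stall
candidate. This is a sketch under assumptions, not a replacement for
iostat: the function names are made up, and it relies on the 2.6-kernel
layout where whole-disk lines carry 14 fields (major, minor, name, then 11
counters).

```python
def parse_diskstats(text):
    """Map device name -> (reads_completed, writes_completed, ios_in_progress)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Whole-disk lines have at least 14 fields; older kernels print
        # shorter lines for partitions, which are skipped here.
        if len(fields) < 14:
            continue
        name = fields[2]
        stats[name] = (int(fields[3]), int(fields[7]), int(fields[11]))
    return stats


def find_stuck_devices(before_text, after_text):
    """Return devices with I/O in flight but no completions between snapshots."""
    before = parse_diskstats(before_text)
    after = parse_diskstats(after_text)
    stuck = []
    for name, (reads, writes, in_flight) in sorted(after.items()):
        if name not in before:
            continue
        previous_reads, previous_writes, _ = before[name]
        if in_flight > 0 and reads == previous_reads and writes == previous_writes:
            stuck.append(name)
    return stuck
```

To use it, read /proc/diskstats, sleep for a few seconds, read it again and
pass both strings to find_stuck_devices. A device that shows up repeatedly
during a stall, while others keep completing I/O, is worth raising with the
storage admins.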

I have several one-off patches recommended by Oracle support:

Bug 5452672: Hung database instance if linux kernel miss aio request
Bug 6656824: LNX-10204-TC6  SIGSEGV AT SKGFR_REAP64()+281, IN DBW0
Bug 6087207: WARNING:ORACLE PROCESS RUNNING OUT OF OS KERNEL I/O RESOURCES
Bug 6882513: MERGE LABEL REQUEST ON TOP OF 10.2.0.3 FOR BUGS 6801535 5576584
Bug 5576584 (4880399): ASM PARALLEL READS PERFORMANCE NOT ACCEPTABLE

I plan to upgrade to 10.2.0.4 but first need to sort out some hash join
bugs (as yet unknown to Oracle) that break our large queries with ORA-600
errors.

What would you recommend to narrow the problem down to an Oracle /
ASM / asmlib / Linux / storage fault?

Do you know of any other bugs that can cause such behaviour?

Thanks.


Pawel Kotlarz
--
//www.freelists.org/webpage/oracle-l


