RE: Oracle 10g hangs intermittently waiting for I/O

  • From: "Tanel Poder" <tanel@xxxxxxxxxx>
  • To: <pkotla@xxxxxx>, "'Rajeev Prabhakar'" <rprabha01@xxxxxxxxx>
  • Date: Sat, 16 May 2009 14:33:44 +0300

It looks like a classic case of extremely slow IO at some lower (hardware or
hardware driver) level. We only know the IO times once the IOs complete, and
that's why the average IO times in iostat jump up only after the "hang" is
over.
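
To illustrate why the averages lag, here is a minimal Python sketch (not
something from the original thread) that computes an iostat-style average
wait from two /proc/diskstats snapshots. It assumes the usual Linux layout of
at least 14 whitespace-separated fields per disk line; the function names
parse_diskstats, interval_await and watch are made up for this example. As
far as I know the time counters are only charged when a request completes,
so an interval with no completions has no average at all, and the in-flight
count is the only hint that something is stuck.

```python
import time


def parse_diskstats(text):
    """Parse /proc/diskstats text into {device: (ios_done, ms_spent, in_flight)}."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # skip blank or short lines (assumption: 14+ field layout)
        reads, ms_read = int(f[3]), int(f[6])      # reads completed, ms spent reading
        writes, ms_write = int(f[7]), int(f[10])   # writes completed, ms spent writing
        in_flight = int(f[11])                     # IOs currently in progress
        stats[f[2]] = (reads + writes, ms_read + ms_write, in_flight)
    return stats


def interval_await(before, after):
    """Return {device: (avg_wait_ms or None, in_flight)} for one interval.

    avg_wait_ms is None when no IO completed during the interval, which is
    what a stalled device looks like while the stall is still going on.
    """
    result = {}
    for dev, (ios2, ms2, in_flight) in after.items():
        if dev not in before:
            continue
        ios1, ms1, _ = before[dev]
        done = ios2 - ios1
        avg = (ms2 - ms1) / done if done > 0 else None
        result[dev] = (avg, in_flight)
    return result


def watch(path="/proc/diskstats", interval=5, rounds=12):
    """Print one line per device per interval (hypothetical helper)."""
    with open(path) as fh:
        prev = parse_diskstats(fh.read())
    for _ in range(rounds):
        time.sleep(interval)
        with open(path) as fh:
            cur = parse_diskstats(fh.read())
        for dev, (avg, in_flight) in sorted(interval_await(prev, cur).items()):
            shown = "no completions" if avg is None else "%.1f ms" % avg
            print(dev, shown, "in flight:", in_flight)
        prev = cur
```

Calling watch() during a hang should show "no completions" with a non-zero
in-flight count for the affected devices, followed by one interval with a
very large average once the stuck requests finally return.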

In my experience sometimes (or often) the SAs and storage admins just
perform some kind of healthcheck - they look into their equivalent of
alert.logs and if they don't find anything there, they come back with "we
did a healthcheck and everything looks fine from our side". What this
statement really means is "we have no idea where to look and frankly we
don't care as it's easier to think that it must be a database problem
anyway".

Another thing I've unfortunately found too often is that the storage team
(running high end storage arrays) sometimes doesn't even have proper
LUN/port level performance instrumentation enabled. They say it's going to
affect IO performance a lot (even though I'm not a storage guy, I find it a
little hard to believe that today's most expensive DMX etc. arrays haven't
gotten this right). And that's why their "healthcheck" doesn't show
anything.

In most of my sudden IO problem troubleshooting cases we have eventually
found out that there had been some change or misconfiguration (like putting
the database on slow storage meant for backups or forgetting to enable some
HBAs for multipathing). DBAs can't look into the storage level, but it helps
if you can point out (and show hard evidence) that there is definitely a
difference in IO performance. That's when the SAs and storage admins go and
do yet another "healthcheck", this time taking it seriously thanks to the
evidence presented, and oops, they find out that someone had forgotten to do
their work properly.
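
As a rough idea of what such hard evidence could look like, the sketch below
summarizes two sets of IO wait samples (in milliseconds), one from a normal
period and one from a period containing a hang. It is only an illustration:
the sample values are made up, the helper names percentile and summarize are
hypothetical, and the 15 ms "slow" threshold is simply borrowed from the
figure quoted further down in this thread.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples_ms, slow_ms=15):
    """Summarize IO wait samples; slow_ms is an assumed threshold."""
    slow = sum(1 for s in samples_ms if s > slow_ms)
    return {
        "count": len(samples_ms),
        "median_ms": percentile(samples_ms, 50),
        "p99_ms": percentile(samples_ms, 99),
        "max_ms": max(samples_ms),
        "pct_slow": 100 * slow / len(samples_ms),
    }


# Made-up numbers, for illustration only.
baseline = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
incident = [3, 4, 5, 6, 8, 9, 11, 14, 4000, 9000]

for label, samples in (("baseline", baseline), ("incident", incident)):
    print(label, summarize(samples))
```

The point of reporting the median next to the 99th percentile and the maximum
is that a few multi-second IOs barely move the median, so showing only
"typical" response times can hide exactly the problem you are trying to
prove.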

When there's a lot of finger-pointing going on, visualizing IO stats
(before and after) can be a good asset at meetings with the different
infrastructure teams, as I've written here:
http://blog.tanelpoder.com/2008/12/28/performance-visualization-made-easy-perfsheet-20-beta/

So I would first go and ask the SAs and storage admins *what exactly* they
checked and saw during their "healthchecks".

If you want to get systematic about this troubleshooting, there are tools
for monitoring IO requests at lower kernel levels. Linux has blktrace and
systemtap for that. However, neither of these is 100% production-ready.
Blktrace requires mounting debugfs and a recent kernel, 2.6.18 I think
(which is standard in Red Hat 5.2 and equivalents), and systemtap requires
installing the systemtap and kernel debuginfo RPMs.

You probably don't want to start hacking your production environment like
this, so I would suggest asking what exactly the SAs and storage admins
checked when they said that everything is fine...

--
Regards,
Tanel Poder
http://blog.tanelpoder.com

> -----Original Message-----
> From: oracle-l-bounce@xxxxxxxxxxxxx 
> [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On Behalf Of Pawel Kotlarz
> Sent: 16 May 2009 00:35
> To: Rajeev Prabhakar
> Cc: oracle-l@xxxxxxxxxxxxx
> Subject: Re: Oracle 10g hangs intermittently waiting for I/O
> 
> Rajeev,
> 
> Oracle shows many sessions waiting for direct path read 
> (temp). Tanel's waitprof reports single events taking many 
> seconds though most of them are below 15ms.
> 
> On the OS level vmstat shows normal reading for some time and 
> then sessions in an uninterruptible sleep with no I/O taking 
> place. iostat -x and asmiostat (ML 437996.1) show specific 
> volumes. Just after the performance returns to normal these 
> volumes show much greater queue length (iostat) or much 
> greater average read time (asmiostat).
> 
> I ran strace on a process servicing the session on which I 
> used waitprof earlier. It stops on a read call.
> 
> Currently I only know that the sysadmins found nothing in 
> Linux logs and on a 'system management page'. Unfortunately 
> it is difficult to obtain more information from them unless I 
> tell what exactly to check...
> 
> Thanks,
> 

--
//www.freelists.org/webpage/oracle-l

