We have observed several critical slowdowns of our production system since upgrading to 10g. The symptoms include a rapid rise in library cache latch waits (with associated, but less notable, rises in shared pool and library cache pin waits), CPU pegged at 100% with a load average in the triple digits, and massive shared pool resizing evidenced by v$sga_resize_ops (1.7 GB added in 15 seconds). Obviously, trying to run queries against the normal v$ views is rather slow (it only adds to the latching problem). We sent hanganalyze and systemstate dumps to Oracle Support, but after a couple of months of trying to figure things out, we have word that the initial support engineer may have misdiagnosed our problem.

Based on what little I know and have researched (reading a number of notes here, Metalink, Google, OraFAQ, etc.), a large number of literal SQL statements might be able to cause such symptoms. It is just hard for me to imagine the number of literal statements it would take to consume 1.7 GB of shared pool in 15 seconds. Perhaps it is not only quantity but quality as well, with each statement consuming a large chunk of real estate. The only problem with this conjecture is that I do not find evidence of it in v$sql (looking at the number of executions, number of literals, and SHARABLE_MEM).

But the puzzle is even more elusive. EM reports extremely (abnormally) large numbers of executions for seemingly random SQL statements. In one case, we had 3 statements that were each executed 35 million times in one hour (2 of those statements had bind variables; the other was a "simple" select). I can never catch these large numbers in v$sql, so I have to assume the statements age out by the time we start to look. I have asked the application folks to check whether these queries are related, and why the execution count is a couple of orders of magnitude greater than normal. Still waiting for resolution at that end.
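For the literal-SQL theory, one thing we could try (I have not verified this on our system yet) is grouping v$sql by the 10g FORCE_MATCHING_SIGNATURE column, which is the same for statements that differ only by their literals; the cutoff below is just illustrative:

```sql
-- Sketch: find families of cursors that differ only by literal values.
-- FORCE_MATCHING_SIGNATURE is 0 for some statement types, so filter those out.
SELECT force_matching_signature,
       COUNT(*)          AS versions,
       SUM(sharable_mem) AS total_sharable_mem,
       MIN(sql_text)     AS sample_text
FROM   v$sql
WHERE  force_matching_signature <> 0
GROUP  BY force_matching_signature
HAVING COUNT(*) > 20               -- arbitrary cutoff for "a lot of literals"
ORDER  BY total_sharable_mem DESC;
```

Of course, if cursors age out as fast as they seem to for the 35-million-execution statements, this would only catch the problem while it is actually happening.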
It seems to be the general consensus that these larger numbers are symptomatic as well, and that they only start to escalate when the system is severely slow and the middle tier starts to time out. NOTE: that is the working hypothesis; we are trying to prove it.

So, my question is two-fold. Do you experts out there have advice on how we can reactively figure out what already happened? And how can we proactively avoid this happening again? Because the automatic AWR snapshots only happen each hour, it is hard for us to get more granular when looking at a historical view. Perhaps we should temporarily increase the snapshot frequency?

Any and all advice is gladly welcome. I am hoping to learn some of the tricks and glean from your experience.

Thanks in advance,

PS - after reading about Kyle Hailey's and Roger Sanders' work, does anyone have access to a DMA for dummies? =)

charles schultz
oracle dba
aits - adsd
university of illinois
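PPS - for reference, I believe the AWR snapshot interval can be lowered with something along these lines (the interval is in minutes; I have not tried this in Production yet):

```sql
-- Sketch: take snapshots every 15 minutes instead of hourly.
-- Retention is left at its current setting.
BEGIN
  dbms_workload_repository.modify_snapshot_settings(interval => 15);
END;
/
```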