RE: Spotting the real cause of a Production slowdown (10g)

  • From: "Schultz, Charles" <sac@xxxxxxxxxxxxx>
  • To: "Alex Gorbachev" <gorbyx@xxxxxxxxx>
  • Date: Thu, 20 Apr 2006 16:59:17 -0500

That is insightful, thank you.

In light of trying to prove it (so that we can officially point our
finger at the cause and keep this from happening again), it sounds
like we are going to have to focus on the AWR to see whether the workload
is actually higher or not. Again, hypothetically, if the latch waits are
merely a nuisance, they make the analysis that much harder because they
in turn cause some of the CPU load. Is it possible to use AWR and
attempt to exclude latch waits and their side effects? Perhaps we need
to concentrate on periods of time before latch waits become an issue?
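
For instance, a sketch of what I have in mind (assuming the standard 10g
DBA_HIST views, with :begin_snap and :end_snap standing in for a pair of
snapshot ids picked from dba_hist_snapshot): compare the raw workload deltas
first, then the wait time with latch events filtered out:

  -- Delta of a few workload statistics between two AWR snapshots,
  -- to see whether the work itself (executes, calls, parses) really rose.
  select e.stat_name,
         e.value - b.value as delta
    from dba_hist_sysstat b, dba_hist_sysstat e
   where b.snap_id = :begin_snap
     and e.snap_id = :end_snap
     and b.dbid = e.dbid
     and b.instance_number = e.instance_number
     and b.stat_id = e.stat_id
     and e.stat_name in ('CPU used by this session', 'execute count',
                         'user calls', 'parse count (total)')
   order by e.stat_name;

  -- Same interval, wait time with latch-related events excluded.
  select e.event_name,
         e.time_waited_micro - b.time_waited_micro as waited_micro
    from dba_hist_system_event b, dba_hist_system_event e
   where b.snap_id = :begin_snap
     and e.snap_id = :end_snap
     and b.dbid = e.dbid
     and b.instance_number = e.instance_number
     and b.event_id = e.event_id
     and e.event_name not like 'latch%'
   order by waited_micro desc;

If execute count and user calls barely move between a good interval and a bad
one while the latch waits explode, that would support the idea that the
latching is a side effect rather than extra work.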

Again, thanks for this thought. Another direction for us to pursue. 

-----Original Message-----
From: Alex Gorbachev [mailto:gorbyx@xxxxxxxxx] 
Sent: Thursday, April 20, 2006 4:50 PM
To: Schultz, Charles
Cc: oracle-l@xxxxxxxxxxxxx
Subject: Re: Spotting the real cause of a Production slowdown (10g)

Imagine that you have a bit more workload at some point and it leads to
higher CPU consumption. It also makes the latching issue (since concurrency
increases) more pervasive. Processes spinning on latches consume more CPU
and hold the latches they already have for longer... and since CPU is
becoming a "rare" resource (well, a bottleneck), taking a latch becomes
more and more difficult and takes longer and longer...
A catch-22.

2006/4/20, Schultz, Charles <sac@xxxxxxxxxxxxx>:
>
> Thanks to all for the many replies on this - it gave us some really 
> good ideas. Unfortunately, we are in the middle of such a problem now, 
> and I highly doubt that the literal sql statements are causing our 
> latch waits and high cpu usage right now. Additionally, the 
> shared_pool is not changing that much (relative to past events, 
> anywhere from 2.5gb to 3gb, sga_target = 10g). CPU is being consumed
> by many active processes, all doing "normal" 
> things that usually do not consume this much CPU. At least, that is 
> what we believe at this point in time.
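
One rough way to sanity-check that belief, using nothing beyond the standard
v$sesstat / v$statname / v$session views, is to rank sessions by accumulated
CPU and see whether the load is broadly spread or concentrated in a few
sessions. A sketch only:

  -- Top 20 sessions by CPU consumed so far (value is in centiseconds).
  select * from (
    select ss.sid, s.program, ss.value as cpu_centiseconds
      from v$sesstat ss, v$statname sn, v$session s
     where ss.statistic# = sn.statistic#
       and sn.name = 'CPU used by this session'
       and s.sid = ss.sid
     order by ss.value desc
  ) where rownum <= 20;
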
>
> Paul, thanks for the hash_plan_value query - it definitely helps 
> disprove that we have a truckload of literal statements (in other 
> words, I do not think literal statements are our problem, at least not 
> in large numbers).
> Niall, I am still seeing a number of statements that are not sampled 
> into v$active_session_history, but those that are sampled are showing a 
> lot of library cache waits.
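
Roughly, the kind of check involved (this is not Paul's actual query, just a
sketch of the same idea) groups v$sql by plan_hash_value and looks for plans
backed by many one-execution cursors, the usual signature of literal SQL; a
second sketch against v$active_session_history shows what the sampled
sessions have been waiting on:

  -- Plans shared by many distinct cursors: a hint of unshared literal SQL.
  select plan_hash_value,
         count(*)                           as cursor_count,
         sum(executions)                    as total_execs,
         round(sum(sharable_mem)/1024/1024) as mem_mb
    from v$sql
   where plan_hash_value <> 0
   group by plan_hash_value
  having count(*) > 100
   order by cursor_count desc;

  -- What ASH has sampled over the last hour, by wait event.
  select event, count(*) as samples
    from v$active_session_history
   where sample_time > sysdate - 1/24
     and session_state = 'WAITING'
   group by event
   order by samples desc;
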
>
>  ________________________________
>
> From: oracle-l-bounce@xxxxxxxxxxxxx 
> [mailto:oracle-l-bounce@xxxxxxxxxxxxx]
> On Behalf Of Schultz, Charles
> Sent: Thursday, April 20, 2006 8:15 AM
>
> To: oracle-l@xxxxxxxxxxxxx
> Subject: Spotting the real cause of a Production slowdown (10g)
>
>
>
>
> We have observed several critical slowdowns of our Production system 
> since upgrading to 10g. The symptoms include a rapid rise in library 
> cache latch waits (with associated, but less notable, rises in shared 
> pool and library cache pin waits), CPU pegged at 100% and CPU load in 
> triple-digits, and massive shared pool resizing evidenced by 
> v$sga_resize_ops (1.7gb added in
> 15 seconds). Obviously, trying to run queries against the normal v$ 
> views is rather slow (only adds to the latching problem). We sent 
> hanganalyze and systemstate dumps to Oracle Support, but after a 
> couple months of trying to figure things out, we have word that the 
> initial Support Engineer may have misdiagnosed our problem.
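
For reference, the resize history mentioned above can be pulled with a query
along these lines against the 10g v$sga_resize_ops view (a sketch; the
timestamps show how quickly the shared pool grew):

  -- Recent shared pool grow/shrink operations and their sizes.
  select component, oper_type, oper_mode,
         round(initial_size/1024/1024) as initial_mb,
         round(final_size/1024/1024)   as final_mb,
         start_time, end_time
    from v$sga_resize_ops
   where component = 'shared pool'
   order by start_time desc;
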
>
> Based on what little I know and have researched (reading a number of notes 
> here, MetaLink, Google, OraFAQ, etc.), a large number of literal sql 
> statements might be able to cause such symptoms. It is just hard for 
> me to imagine the number of literal statements that would consume 
> 1.7gb of shared pool in 15 seconds. Perhaps it is not just quantity but 
> quality as well, each statement consuming a large chunk of real 
> estate. The only problem with this conjecture is that I do not find 
> evidence of this in v$sql (looking at number of executions, number of 
> literals, sharable_mem).
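
A rough way to test the "quality, not just quantity" idea, using only the
standard v$sql columns, is to rank cursors by how much shared pool memory
each one currently holds; a sketch:

  -- Top 20 cursors by sharable memory held in the shared pool.
  select * from (
    select sql_id, executions,
           round(sharable_mem/1024/1024, 1) as sharable_mb,
           substr(sql_text, 1, 60)          as sql_snippet
      from v$sql
     order by sharable_mem desc
  ) where rownum <= 20;
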
>
> But the puzzle is even more elusive. EM reports an extremely (abnormally) 
> large number of executions for seemingly random sql statements. In one 
> case, we had 3 statements that were each executed 35 million times in 
> one hour (2 of those statements had bind variables, the other was a 
> "simple" select). I can never catch these large numbers in v$sql, so I 
> have to assume they age out by the time we start to look. I have asked 
> the application folks to see if these queries are related, and why the 
> execution count is a couple orders of magnitude greater than normal. 
> Still waiting for resolution at that end. It seems the general 
> consensus is that these larger numbers are symptomatic as well, that they 
> only start to escalate when the system is severely slow and the 
> middle-tier starts to time out. NOTE: that is the working hypothesis; 
> we are trying to prove it.
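
Assuming AWR captured those statements at all, something like the following
against dba_hist_sqlstat can show per-snapshot execution counts even after
the cursors have aged out of v$sql (a sketch; :begin_snap and :end_snap are
placeholders for snapshot ids around the slowdown):

  -- Statements with the highest execution counts in a snapshot interval.
  select snap_id, sql_id, executions_delta
    from dba_hist_sqlstat
   where snap_id between :begin_snap and :end_snap
   order by executions_delta desc;
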
>
> So, my question is two-fold. Do you experts out there have advice as 
> to how we can reactively figure out what already happened? And how can 
> we proactively keep this from happening again? Because the automatic 
> AWR snapshots only happen each hour, it is hard for us to get more 
> granular when looking at a historical view. Perhaps we should 
> temporarily increase the snapshot frequency?
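
The snapshot interval can be changed through dbms_workload_repository; a
sketch (the interval and retention arguments are in minutes, so worth
double-checking against the documentation before running it):

  -- Drop the AWR snapshot interval from 60 minutes to 15.
  begin
    dbms_workload_repository.modify_snapshot_settings(interval => 15);
  end;
  /

  -- A snapshot can also be taken manually during an episode:
  exec dbms_workload_repository.create_snapshot;
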
>
> Any and all advice is gladly welcome. I am hoping to learn some of the 
> tricks and glean from your experience.
>
> Thanks in advance,
>
> PS - after reading about Kyle Hailey's and Roger Sanders' work, does anyone 
> have access to a DMA for dummies? =)
>
>
> charles schultz
> oracle dba
> aits - adsd
> university of illinois
>


--
Best regards,
Alex Gorbachev
--
//www.freelists.org/webpage/oracle-l

