Re: How to isolate root cause of increasing libcache latch waits? (Subtitle: Hoping to upstage Oracle Support)

  • From: Tanel Poder <tanel@xxxxxxxxxx>
  • To: martin.a.berger@xxxxxxxxx
  • Date: Thu, 15 Oct 2009 17:13:11 +0800

Short spikes in latch contention?

Do you use SGA_TARGET / ASMM? A shrinking shared pool can flush out some of
your library cache contents and cause this kind of trouble. I understand you
have AWR licenses, so you can query DBA_HIST_SGASTAT to see whether you had
SGA resize operations happening at the same time as the latch contention
spikes.
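
For example, something along these lines (just a rough sketch, summing the
shared pool components per AWR snapshot, so a sudden drop hints at a shrink):

  -- Shared pool size per AWR snapshot, to correlate resize activity
  -- with the latch contention spikes.
  select sn.snap_id,
         sn.begin_interval_time,
         round(sum(st.bytes) / 1024 / 1024) shared_pool_mb
  from   dba_hist_snapshot sn,
         dba_hist_sgastat  st
  where  st.snap_id         = sn.snap_id
  and    st.dbid            = sn.dbid
  and    st.instance_number = sn.instance_number
  and    st.pool            = 'shared pool'
  group  by sn.snap_id, sn.begin_interval_time
  order  by sn.snap_id;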

You can easily visualize memory usage from DBA_HIST_SGASTAT with PerfSheet
if you like, as I've done here:

http://blog.tanelpoder.com/2009/06/04/ora-04031-errors-and-monitoring-shared-pool-subpool-memory-utilization-with-sgastatxsql/

Another cause of spikes in library cache latch contention is crappy
monitoring tools which poll V$SQL or V$SQLAREA too often. Queries against
these views walk through the entire library cache (unless you look up a
specific hash value), and in order to walk through the entire library cache
you need to get and hold library cache latches a lot. Depending on which
Enterprise Manager version you use, OEM itself may be the problem. Since
10.2, EM should be able to use V$SQLSTATS for real-time top SQL activity
monitoring instead of V$SQL. V$SQLSTATS is a separate array in the shared
pool and reading it does not require library cache latches (each bucket in
the stats array is protected by a separate mutex instead).
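
If your own monitoring scripts poll top SQL, the same idea applies. A rough
sketch (top SQL by CPU time straight from V$SQLSTATS, with no library cache
walk under latches):

  -- Top 10 SQL by CPU time from V$SQLSTATS instead of V$SQL / V$SQLAREA.
  select *
  from  ( select sql_id, executions, buffer_gets, cpu_time, elapsed_time
          from   v$sqlstats
          order  by cpu_time desc )
  where  rownum <= 10;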

You can use LatchProfX for systematic troubleshooting of latching issues...
or just write a PL/SQL loop which dumps V$LATCHHOLDER contents to a table
every couple of seconds, and you will see which sessions were holding the
library cache latches once the problem happens again (something along the
lines of the sketch below).
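
A minimal sketch of such a polling loop (the log table name, loop count and
sleep interval are just placeholders):

  -- Poll V$LATCHHOLDER into a log table every couple of seconds.
  create table latchholder_log as
    select sysdate ts, sid, laddr, name
    from   v$latchholder
    where  1 = 0;

  begin
    for i in 1 .. 1800 loop   -- roughly an hour at 2-second intervals
      insert into latchholder_log
        select sysdate, sid, laddr, name from v$latchholder;
      commit;
      dbms_lock.sleep(2);
    end loop;
  end;
  /

Afterwards just group the logged rows by SID and latch name over the problem
window to see who was holding the library cache latches.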

Tanel.

On Thu, Oct 15, 2009 at 2:26 AM, Martin Berger <martin.a.berger@xxxxxxxxx> wrote:

> Charles,
> have you read Tanel Poder's
> http://blog.tanelpoder.com/2009/09/13/kgl-simulator-shared-pool-simulator-and-buffer-cache-simulator-what-are-these/
> especially these quotes might apply to your env:
> > As these allocations are dynamic, then when your application suddenly
> starts making lots of hard parses due to newly released code, causing memory
> shortage, then the simulator can make things even worse as it can make
> relatively big allocations for itself too!
> and
> > Note that I’m talking about big systems here, ...
>
> It's just a kind of reflex: every time I hear library cache latch, I check
> Tanel's blog; LatchProfX and the other tools are just too helpful. :)
>
> hope it helps,
>   Martin
>
>
> On 14.10.2009 at 15:42, Charles Schultz wrote:
>
> Good day, list,
> *The Symptoms*
> We have observed a gradual rise in library cache latch waits in our
> Production database. Oracle Enterprise Manager shows these as small,
> short-lived "spikes" in the Top Activity Monitor. Our experience is that
> these spikes generally grow in intensity and duration (impacting more users
> for a longer period of time). We have observed that turning off OEM and the
> agent helps to alleviate these latch waits.
>
> *The System*
> Oracle EE 10.2.0.2 on Solaris 10. Upgrading to 10.2.0.4 this weekend.
> Hardware is Sun F15k with 44 CPUs and 96GB of RAM.
>
> *The Support*
> In our first SR with Oracle, the "resolution" was to stop using OEM. This
> is a very poor response in my opinion, for it does not address the root
> problem at all; it offers no insight into what is causing the latch waits
> in the first place, nor is it a very effective way to treat a problem, as
> it merely alleviates the situation without eliminating it.
>
> In our second SR, I have an analyst who wants me to turn on event 10046
> level 12. Good and dandy; I would have done that myself if not for the fact
> that we often get hit with mutex hangs triggered by rowsource execution
> statistics. Besides, the analyst is simply fishing for something; he has
> given no indication that he really understands the situation. He has
> identified several large, CPU-intensive queries; how does CPU-intensive
> relate to latch waits, especially considering that our box is only about
> 12% CPU utilized? I admit, perhaps I am especially dense today and he is
> spot on. Perhaps.
>
> *The Synopsis*
> Since it is library cache latches that we are talking about, I looked at
> the shared pool fragmentation. Since fragmentation generally gets worse as
> time progresses, my theory is that perhaps the libcache latch requests are
> taking longer and longer to fulfill as there are more (and smaller) chunks
> of memory to search. I am currently running a couple of diagnostic scripts
> (forgot which Metalink note I got them from, but it is a memory diagnostic
> suite). We know from past experience that flushing the shared pool has a
> marginal effect, which makes sense, as x$ksmsp shows us that only a small
> fraction of the shared pool is flushable. We also know from past experience
> that if we let the latch waits continue to get worse, the system approaches
> a critical mass and we have to bounce it. And nobody likes to bounce the
> Production system. =)
>
> So, one big (obvious) caveat. We are upgrading to 10.2.0.4 this weekend,
> which means we will inherit a ton of bug fixes, as well as those wonderful
> new bugs that come along with them. Unfortunately, the PSU (10.2.0.4.1
> bundle) came out too late for our regression tests, but I have scheduled it
> for future regression tests. If memory fragmentation really is the root
> problem, I have my doubts as to whether 10.2.0.4 is actually going to help
> us. If anyone has any insights as to whether or not this might be true, I
> am all ears.
>
>
>


-- 
Tanel Poder
http://blog.tanelpoder.com
