How to isolate root cause of increasing libcache latch waits? (Subtitle: Hoping to upstage Oracle Support)

  • From: Charles Schultz <sacrophyte@xxxxxxxxx>
  • To: ORACLE-L <oracle-l@xxxxxxxxxxxxx>
  • Date: Wed, 14 Oct 2009 08:42:58 -0500

Good day, list,
*The Symptoms*
We have observed a gradual rise in library cache latch waits in our
Production database. Oracle Enterprise Manager shows these as small,
short-lived "spikes" in the Top Activity Monitor. In our experience these
spikes generally grow in intensity and duration (impacting more users for a
longer period of time). We have also found that turning off OEM and the
agent helps to alleviate these latch waits.
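
Something along these lines shows the raw latch numbers outside of OEM (a
sketch, not our exact scripts; the second query assumes you are licensed for
the Diagnostics Pack since it reads ASH):

    -- Which library cache child latches are sleeping the most
    select child#, gets, misses, sleeps
      from v$latch_children
     where name = 'library cache'
     order by sleeps desc;

    -- Which SQL has been waiting on the latch over the last hour (ASH)
    select sql_id, count(*) samples
      from v$active_session_history
     where event = 'latch: library cache'
       and sample_time > sysdate - 1/24
     group by sql_id
     order by samples desc;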

*The System*
Oracle EE 10.2.0.2 on Solaris 10. Upgrading to 10.2.0.4 this weekend.
Hardware is Sun F15k with 44 CPUs and 96GB of RAM.

*The Support*
In our first SR with Oracle, the "resolution" was to stop using OEM. In my
opinion this is a very poor response: it does not address the root problem at
all, it offers no insight into what is causing the latch waits in the first
place, and it merely masks the symptom rather than eliminating it.

In our second SR, I have an analyst who wants me to turn on event 10046
level 12. Good and dandy; I would have done that myself if not for the fact
that we often get hit with mutex hangs triggered by rowsource execution
statistics. Besides, the analyst is simply fishing; he has given no
indication that he really understands the situation. He has identified
several large, CPU-intensive queries, but how does CPU-intensive relate to
latch waits, especially considering that our box is only about 12% CPU
utilized? I admit, perhaps I am especially dense today and he is spot on.
Perhaps.
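
(For the record, what he is asking for is just the standard 10046 level 12
trace; something along these lines, if we ever decide to risk it:)

    -- Level 12 = waits + binds, on my own session
    alter session set events '10046 trace name context forever, level 12';
    -- ... reproduce the workload ...
    alter session set events '10046 trace name context off';

    -- Or against another session via dbms_monitor (10g and up)
    exec dbms_monitor.session_trace_enable(:sid, :serial, waits => TRUE, binds => TRUE)
    exec dbms_monitor.session_trace_disable(:sid, :serial)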

*The Synopsis*
Since it is library cache latches that we are talking about, I looked at the
shared pool fragmentation. Since fragmentation generally gets worse as time
progresses, my theory is that the libcache latch requests are taking longer
and longer to fulfill as there are more (and smaller) chunks of memory to
search. I am currently running a couple of diagnostic scripts (I forget
which Metalink note I got them from, but it is a memory diagnostic suite).
We know from past experience that flushing the shared pool has a marginal
effect, which makes sense, as x$ksmsp shows us that only a small fraction of
the shared pool is flushable. We also know from past experience that if we
let the latch waits continue to get worse, the system approaches a critical
mass and we have to bounce it. And nobody likes to bounce the Production
system. =)
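
In case it helps anyone else follow along, the gist of the fragmentation
check is just a chunk breakdown from x$ksmsp, something like the following
(run as SYS, and sparingly; scanning x$ksmsp on a busy instance can itself
contend for the shared pool latch):

    -- Chunk count, total bytes and largest chunk per class
    select ksmchcls cls,
           count(*)      chunks,
           sum(ksmchsiz) total_bytes,
           max(ksmchsiz) largest_chunk
      from x$ksmsp
     group by ksmchcls
     order by cls;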

So, one big (obvious) caveat. We are upgrading to 10.2.0.4 this weekend,
which means we will inherit a ton of bug fixes, as well as the wonderful new
bugs that come along with them. Unfortunately, the PSU (10.2.0.4.1 bundle)
came out too late for our regression tests, but I have scheduled it for a
future regression cycle. If memory fragmentation really is the root problem,
I have my doubts as to whether 10.2.0.4 is going to help us. If anyone has
any insights as to whether or not this might be true, I am all ears.

-- 
Charles Schultz
