Re: How to isolate root cause of increasing libcache latch waits? (Subtitle: Hoping to upstage Oracle Support)

  • From: Tim Gorman <tim@xxxxxxxxx>
  • To: sacrophyte@xxxxxxxxx
  • Date: Wed, 14 Oct 2009 10:44:34 -0600

Charles,

Consider looking at what the sessions are doing?

Display the SQL_ID and CHILD_NUMBER values from V$SESSION while event = <whatever> and STATE = 'WAITING'.  Perhaps query recent history in V$ACTIVE_SESSION_HISTORY this way as well?
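For example (a rough sketch only -- the literal event name 'latch: library cache' and the 30-minute window are assumptions; substitute whatever event you are actually seeing):

   -- current sessions waiting on the latch, and what they are executing
   select sql_id, sql_child_number, count(*) waiters
     from v$session
    where event = 'latch: library cache'
      and state = 'WAITING'
    group by sql_id, sql_child_number
    order by waiters desc;

   -- same idea against recent ASH samples
   select sql_id, sql_child_number, count(*) samples
     from v$active_session_history
    where event = 'latch: library cache'
      and session_state = 'WAITING'
      and sample_time > sysdate - 30/1440
    group by sql_id, sql_child_number
    order by samples desc;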

If the SQL_ID is constantly different and changing rapidly, consider that you might have a SQL-sharing problem, where someone isn't using bind variables?  Correct the application code, or consider setting CURSOR_SHARING?  If you find that all of these sessions have similar (or the same) SQL_IDs while waiting, consider looking at the CHILD_NUMBER to see if that is changing often?  That might indicate that the client-side application code is "soft-parsing", calling OPEN-EXECUTE-FETCH-CLOSE at each execution when it should be calling OPEN of the cursor once at the beginning of the session, then calling EXECUTE-FETCH whenever needed, then calling CLOSE at the end of the session.
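You can also see the soft-parse pattern from the shared pool side.  Something like this (a sketch only -- the execution threshold is arbitrary) shows statements that are re-parsed nearly every time they are executed:

   -- parse_calls close to executions suggests OPEN-EXECUTE-FETCH-CLOSE per call
   select sql_id, parse_calls, executions, loads,
          substr(sql_text, 1, 60) sql_text
     from v$sql
    where executions > 1000
      and parse_calls >= 0.9 * executions
    order by parse_calls desc;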

Lots of applications fail to use bind-variables, causing excessive hard-parsing, evidenced by changing SQL_ID values for essentially the same SQL text.  Lots of applications (generally "object-oriented" in nature) create-use-destroy their cursors for each execution, the methods mirroring the use of the "object", causing excessive soft-parsing, evidenced by same/similar SQL_ID values but changing CHILD_NUMBER.
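In 10.2 you can also group V$SQL by FORCE_MATCHING_SIGNATURE to spot the literal-SQL case -- many distinct SQL_IDs sharing one signature would collapse into a single cursor if bind variables (or CURSOR_SHARING) were used.  A sketch, with an arbitrary cutoff:

   -- families of statements that differ only in their literals
   select force_matching_signature,
          count(distinct sql_id) distinct_sql_ids,
          sum(executions) total_execs
     from v$sql
    where force_matching_signature <> 0
    group by force_matching_signature
   having count(distinct sql_id) > 50
    order by distinct_sql_ids desc;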

Look at what the application is doing; understand what the application is firing at the database.  Library cache pin/lock contention, library cache latch contention, whatever -- most often it is a symptom, not a cause.  Don't look for an Oracle bug, don't look to tune the database instance; instead, look for the cause within the application code.

Hope this helps....
Tim Gorman
consultant - Evergreen Database Technologies, Inc.
P.O. Box 630791, Highlands Ranch CO  80163-0791
website   = http://www.EvDBT.com/
email     = Tim@xxxxxxxxx
mobile    = +1-303-885-4526
fax       = +1-303-484-3608
Yahoo IM  = tim_evdbt


Charles Schultz wrote:
Good day, list,

The Symptoms
We have observed a gradual rise in library cache latch waits in our Production database. Oracle Enterprise Manager shows these as small, short-lived "spikes" in the Top Activity Monitor. Our experience is that these spikes generally grow in intensity and duration (impacting more users for a longer period of time). We have observed that turning off OEM and the agent helps to alleviate these latch waits.
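(For what it is worth, the raw numbers behind the OEM picture come from something like the following -- a sketch, assuming the event shows up as 'latch: library cache' on 10.2; the counters are cumulative since instance startup:)

   select event, total_waits, round(time_waited_micro/1e6) seconds_waited
     from v$system_event
    where event = 'latch: library cache';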

The System
Oracle EE 10.2.0.2 on Solaris 10. Upgrading to 10.2.0.4 this weekend. Hardware is Sun F15k with 44 CPUs and 96GB of RAM.

The Support
In our first SR with Oracle, the "resolution" was to stop using OEM. In my opinion this is a very poor response: it does not address the root problem at all, it offers no insight into what is causing the latch waits in the first place, and it is not an effective way to treat the problem, since it merely alleviates the symptom without eliminating it.

In our second SR, I have an analyst who wants me to turn on event 10046 level 12. Fine and dandy; I would have done that myself if not for the fact that we often get hit with mutex hangs triggered by rowsource execution statistics. Besides, the analyst is simply fishing for something; he has given no indication that he really understands the situation. He has identified several large, CPU-intensive queries, but how does CPU-intensive relate to latch waits, especially considering that our box is only about 12% CPU-utilized? I admit, perhaps I am especially dense today and he is spot on. Perhaps.
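For reference, what he is asking for amounts to something like this (a sketch; the session_id/serial_num values are placeholders for whichever session you would actually trace):

   -- in my own session
   alter session set events '10046 trace name context forever, level 12';

   -- or for another session
   exec dbms_monitor.session_trace_enable(session_id => 1234, serial_num => 5678, waits => TRUE, binds => TRUE);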

The Synopsis
Since it is library cache latches that we are talking about, I looked at the shared pool fragmentation. Since fragmentation generally gets worse as time progresses, my theory is that the libcache latch requests are taking longer and longer to fulfill as there are more (and smaller) chunks of memory to search. I am currently running a couple of diagnostic scripts (I forget which Metalink note I got them from, but it is a memory diagnostic suite). We know from past experience that flushing the shared pool has a marginal effect, which makes sense, as x$ksmsp shows us that only a small fraction of the shared pool is flushable. We also know from past experience that if we let the latch waits continue to get worse, the system approaches a critical mass and we have to bounce it. And nobody likes to bounce the Production system. =)
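A quick way to eyeball the fragmentation (a sketch, not the Metalink scripts themselves -- note that scanning x$ksmsp requires SYS and itself takes the shared pool latch, so run it sparingly on a busy system):

   -- shared pool chunks by class; many small 'free' chunks = fragmentation
   select ksmchcls chunk_class,
          count(*) chunks,
          round(sum(ksmchsiz)/1024/1024, 1) total_mb,
          max(ksmchsiz) largest_chunk_bytes
     from x$ksmsp
    group by ksmchcls
    order by total_mb desc;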

So, one big (obvious) caveat. We are upgrading to 10.2.0.4 this weekend, which means we will inherit a ton of bug fixes, as well as those wonderful new bugs that come along with them. Unfortunately, the PSU (10.2.0.4.1 bundle) came out too late for our regression tests, but I have scheduled it for future regression tests. If memory fragmentation actually is the root problem, I have my doubts as to whether 10.2.0.4 is going to help us. If anyone has any insights as to whether or not this might be true, I am all ears.

--
Charles Schultz
-- //www.freelists.org/webpage/oracle-l
