Fwd: Issue Report - partial db hang (bug/library cache: mutex X)

From: Jeremy Schneider <jeremy.schneider@xxxxxxxxxxxxxx>
To: Oracle-L <oracle-l@xxxxxxxxxxxxx>
Date: Wed, 13 Oct 2010 11:16:18 -0500
Hello all -

Had an interesting problem yesterday, seemingly related to a bug.  Thought
I'd forward to the list in case anyone else finds it useful.  FYI, we're
running on AIX 6.1.0.0 with 11.1.0.6 (no one-offs).

-Jeremy


---------- Forwarded message ----------
From: Jeremy Schneider
Date: Wed, Oct 13, 2010 at 10:14 AM
Subject: Fwd: Issue Report - partial db hang
To: Jerry Carlisle, Kerry Bush

I generated system state dumps during an outage windows last night, then
started reading them.  I then realized that we were following the wrong
rabbit-trail with the MMON slave process.  The real culprit had grabbed the
mutexes and was sitting quietly, blending in with the other 200 DB
sessions.  The MMON slave process was just the first unlucky process to
start spinning on the mutex (which it would never get).

Key takeaway: the VALUE of a mutex is very useful.  For an exclusive mode
mutex, the value tells the SID of the process holding it.  For shared
mutexes, I believe that the value is the number of sessions who are holding
the mutex.

Also, it's glaringly obvious now, but the v$ (or x$) view
mutex_sleep_history has column called "blocking_sessions" - not sure how in
the world I missed this yesterday.  This is just giving the same info as the
value of the mutex.  The 11gR1 reference manual for this table explains the
bit about mutex values which I summarized above.  Blocking session.  Duuuh.


 ---------- Forwarded message ----------
From: Jeremy Schneider
Date: Wed, Oct 13, 2010 at 9:10 AM
Subject: Issue Report - partial db hang

  *Summary:*

There was a partial db hang on Tuesday, which is now resolved.  It did not
have any direct impact on the application users, but it did impact
administrators and developers.  We were unable to connect to the database
with SQL Developer or to run certain queries.



The root cause was a bug in Oracle’s code – sometime between 2pm and 3pm on
Monday (Oct 11) – causing one specific JDBC thin-client user session (SID
371) to indefinitely hold 2 mutexes (kgllkdl1 and kglic1).  While this user
session mistakenly held the mutexes, other sessions which needed these
mutexes began to stack up – spinning (running null operations) on the CPU
while waiting.



Note: mutexes are an Oracle internal data structure (inaccessible to us) and
are not tracked quite as visibly as locks, which made the analysis slightly
more challenging.



*Symptoms:*

1.       The first and most obvious symptom was that certain queries would
hang indefinitely from SQLPlus – such as any query from v$sqlarea or
x$kglob.  On a related note, SQLDeveloper was unable to connect to the
database for most developers (it would hang during connection).

2.       The AWR report shows an extraordinary amount of contention for
“library cache: mutex X”.  Also, at any given point in time, there are a
number of records in V$SESSION where the EVENT field is equal to this value.

3.       The CPU usage on the server began to rise dramatically, seemingly
inexplicably.

4.       The V$MUTEX_SLEEP_HISTORY view shows an extraordinarily high number
of SLEEPs on a few mutexes.

5.       At the OS level…  the stack trace of high-CPU processes
consistently includes kglGetMutex and truss/strace shows very little
activity [only calls to yield() on AIX].



*Solution:*

The already-scheduled system restart completely removed the problem.



In hindsight (after some analysis), I realized that the problem could most
likely have been solved without restarting the system – by examining the
BLOCKING_SESSION field of the V$MUTEX_SLEEP_HISTORY system view.  That would
probably give the SID of the session holding the mutexes.  Killing that
session may solve this problem in the future.  I should have realized this
much sooner, it was my mistake that that I didn’t catch it until after the
fact.  (I was mistakenly looking at the wrong “root cause” for a long time…)




*Prevention:*

We will open an SR with Oracle.  Before restarting the database, I captured
system state dumps.  This does not seem to be a reproducible bug and it’s
unlikely that Oracle will be able to determine what caused the mutexes to be
held, but we ought to open the ticket anyway.  If it does happen again, then
we should capture some diagnostics information about the “root cause”
session before killing it.



*Notes:*

This was the most recent SQL executed in the session which held the mutexes:

*SELECT sql_id,sql_text from v$sql  WHERE sql_id in
(:1,:2,:3,:4,:5,:6,:7,:8,:9,:10)*



I found permutations of this query to be very helpful in troubleshooting:

*select SID, SERIAL#, USERNAME, STATUS, OSUSER, MACHINE, STATE,
substr(PROGRAM,1,18) program, SECONDS_IN_WAIT, P1RAW*

*from v$session where event='library cache: mutex X' order by p1raw,
seconds_in_wait, sid*




-- 
http://www.ardentperf.com
+1 312-725-9249

Jeremy Schneider
Chicago
Fwd: Issue Report - partial db hang (bug/library cache: mutex X)

Other related posts: