Hello all - Had an interesting problem yesterday, seemingly related to a bug. Thought I'd forward to the list in case anyone else finds it useful. FYI, we're running on AIX 6.1.0.0 with 11.1.0.6 (no one-offs). -Jeremy ---------- Forwarded message ---------- From: Jeremy Schneider Date: Wed, Oct 13, 2010 at 10:14 AM Subject: Fwd: Issue Report - partial db hang To: Jerry Carlisle, Kerry Bush I generated system state dumps during an outage windows last night, then started reading them. I then realized that we were following the wrong rabbit-trail with the MMON slave process. The real culprit had grabbed the mutexes and was sitting quietly, blending in with the other 200 DB sessions. The MMON slave process was just the first unlucky process to start spinning on the mutex (which it would never get). Key takeaway: the VALUE of a mutex is very useful. For an exclusive mode mutex, the value tells the SID of the process holding it. For shared mutexes, I believe that the value is the number of sessions who are holding the mutex. Also, it's glaringly obvious now, but the v$ (or x$) view mutex_sleep_history has column called "blocking_sessions" - not sure how in the world I missed this yesterday. This is just giving the same info as the value of the mutex. The 11gR1 reference manual for this table explains the bit about mutex values which I summarized above. Blocking session. Duuuh. ---------- Forwarded message ---------- From: Jeremy Schneider Date: Wed, Oct 13, 2010 at 9:10 AM Subject: Issue Report - partial db hang *Summary:* There was a partial db hang on Tuesday, which is now resolved. It did not have any direct impact on the application users, but it did impact administrators and developers. We were unable to connect to the database with SQL Developer or to run certain queries. The root cause was a bug in Oracle’s code – sometime between 2pm and 3pm on Monday (Oct 11) – causing one specific JDBC thin-client user session (SID 371) to indefinitely hold 2 mutexes (kgllkdl1 and kglic1). While this user session mistakenly held the mutexes, other sessions which needed these mutexes began to stack up – spinning (running null operations) on the CPU while waiting. Note: mutexes are an Oracle internal data structure (inaccessible to us) and are not tracked quite as visibly as locks, which made the analysis slightly more challenging. *Symptoms:* 1. The first and most obvious symptom was that certain queries would hang indefinitely from SQLPlus – such as any query from v$sqlarea or x$kglob. On a related note, SQLDeveloper was unable to connect to the database for most developers (it would hang during connection). 2. The AWR report shows an extraordinary amount of contention for “library cache: mutex X”. Also, at any given point in time, there are a number of records in V$SESSION where the EVENT field is equal to this value. 3. The CPU usage on the server began to rise dramatically, seemingly inexplicably. 4. The V$MUTEX_SLEEP_HISTORY view shows an extraordinarily high number of SLEEPs on a few mutexes. 5. At the OS level… the stack trace of high-CPU processes consistently includes kglGetMutex and truss/strace shows very little activity [only calls to yield() on AIX]. *Solution:* The already-scheduled system restart completely removed the problem. In hindsight (after some analysis), I realized that the problem could most likely have been solved without restarting the system – by examining the BLOCKING_SESSION field of the V$MUTEX_SLEEP_HISTORY system view. That would probably give the SID of the session holding the mutexes. Killing that session may solve this problem in the future. I should have realized this much sooner, it was my mistake that that I didn’t catch it until after the fact. (I was mistakenly looking at the wrong “root cause” for a long time…) *Prevention:* We will open an SR with Oracle. Before restarting the database, I captured system state dumps. This does not seem to be a reproducible bug and it’s unlikely that Oracle will be able to determine what caused the mutexes to be held, but we ought to open the ticket anyway. If it does happen again, then we should capture some diagnostics information about the “root cause” session before killing it. *Notes:* This was the most recent SQL executed in the session which held the mutexes: *SELECT sql_id,sql_text from v$sql WHERE sql_id in (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10)* I found permutations of this query to be very helpful in troubleshooting: *select SID, SERIAL#, USERNAME, STATUS, OSUSER, MACHINE, STATE, substr(PROGRAM,1,18) program, SECONDS_IN_WAIT, P1RAW* *from v$session where event='library cache: mutex X' order by p1raw, seconds_in_wait, sid* -- http://www.ardentperf.com +1 312-725-9249 Jeremy Schneider Chicago