RE: systemstate dump

  • From: "Poras, Henry R." <Henry_Poras@xxxxxxxxxxxxxxxx>
  • To: "'oracle-l@xxxxxxxxxxxxx'" <oracle-l@xxxxxxxxxxxxx>
  • Date: Mon, 3 May 2004 13:25:41 -0400

Mladen (et al),
Thanks for the reply. In the last week I've done a bit more digging and found
another Bug which I believe is the cause of these problems. I checked for the
wait events mentioned in Mladen's suggestion, but they weren't anywhere near the
top of the list. Also, the systemstate dump didn't have any negative 'timeout'
values or an On_timer_q of Y. 

On the other hand, Bug 1484634 is identified by:
-OPS
-Excessive use of rowcache memory in shared pool
-Periodic excessive contention on shared pool and row cache objects latch
-Following the contention, the rowcache memory shrinks.
-One instance of a 2 node OPS starts to hang. Only connect internal is possible.
-In a very short timeframe (1..3 minutes or less) suddenly there are lots of
latch waits (row cache, library cache, modify parameter, buffer chain, ...) and
the system hangs.

From the Statspack information below, I found the major jumps in timed events
were in latch free (library cache, shared pool, row cache objects), enqueue,
library cache pin, row cache lock, and global cache lock busy. 
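
For anyone following along, roughly the same cumulative picture can be pulled
straight from v$system_event (a sketch only; which events to exclude as 'idle'
is a judgment call, and in 8i time_waited is reported in centiseconds):

        select event, total_waits, time_waited
        from   v$system_event
        where  event not in ('SQL*Net message from client', 'pmon timer',
                             'rdbms ipc message', 'smon timer')
        order by time_waited desc;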

I went to the systemstate dump to examine these in more detail. I started off by
looking for resources that were being held and resources being waited for.
Ignoring idle waits, the processes in the state dump are waiting for either
latch free, enqueue, row cache lock, or library cache pin. The state dump showed
one process holding both the (single) shared pool latch and the (single) row
cache objects latch, with other processes holding some of the library cache
child latches (the library cache latch has 5 children). 
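
When the instance is still responsive enough to be queried, the holder/waiter
picture can also be sketched from the v$ views (or their gv$ equivalents across
both OPS instances) rather than from a state dump. A minimal example, assuming
the latch names of interest above:

        select h.sid, h.name latch_held, w.event, w.seconds_in_wait
        from   v$latchholder h, v$session_wait w
        where  w.sid = h.sid
        and    h.name in ('shared pool', 'row cache objects', 'library cache');

During a hang like the one described here, of course, the systemstate dump may
be the only thing obtainable.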

Looking at our waits in more detail we found the following sources for
contention:
-Contention on the library cache latch: all sessions holding a library cache
child latch were waiting for the shared pool latch. 
-Contention for enqueues: all enqueue waits in the systemstate dump came from
attempts to obtain the space management (ST) enqueue in exclusive (X) mode. That
enqueue was held by a process that was itself waiting for either the row cache
lock or the row cache objects latch (depending on which dump we look at; see the
v$lock sketch after this list).
-Contention for library cache pin: all processes are requesting the pin in Share
mode on the same handle. The process which holds that pin in X mode is waiting
for the row cache objects latch.
-Contention for row cache lock: we have waits for this, but nothing in the
systemstate dump is shown holding it. (Can you wait on the row cache lock if you
are not holding the row cache object latch?)
-Contention for global cache lock busy: still researching. (well, not really,
but I sort of ignored this one)
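
For the ST enqueue item above, a live cross-check (just a sketch, not part of
the original diagnosis) is simply:

        -- holders and waiters of the space management (ST) enqueue;
        -- lmode > 0 means held, request > 0 means waited for
        select sid, type, lmode, request, ctime, block
        from   v$lock
        where  type = 'ST';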

So we see that ultimately the bulk of the contention points to a single process
which is holding both the shared pool latch and the row cache object latch. What
is this process doing? Why the long hold of these latches?

The process (which is not tied to any particular username or piece of SQL; I
know this since I have multiple state dumps) displays:

      holding     8000ebec shared pool level=7 
        Location from where latch is held: kghfrunp: alloc: clatch nowait: 
        Context saved from call: 0
        state=busy
      holding     8000c114 row cache objects level=4 
        Location from where latch is held: kghfrunp: clatch: nowait: 
        Context saved from call: 0
        state=busy

Since the final statspack snapshot completed after the slowdown cleared, I can
look at how certain pools changed across the slow period rather than just at the
end points. The largest changes occurred in the following areas of the SGA.

POOL         NAME              BEGIN VALUE (early slice)  END (early) = BEGIN (slow)  END VALUE (slow)
-----------  ----------------  -------------------------  --------------------------  ----------------
Shared pool  dictionary cache  44,700,000                 46,000,000                  27,900,000
Shared pool  sql area          34,500,000                 35,300,000                  53,600,000
Shared pool  free memory       24,500,000                 27,900,000                  16,700,000
Shared pool  KQLS heap         34,000,000                 28,700,000                  40,700,000
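
The snapshot values above come from statspack's SGA statistics; the same
components can be watched live with something like this (a sketch; the component
names are as they appear in v$sgastat on 8i):

        select pool, name, bytes
        from   v$sgastat
        where  pool = 'shared pool'
        and    name in ('dictionary cache', 'sql area', 'free memory', 'KQLS heap')
        order by bytes desc;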

My thought is that the system needs more space for the dictionary cache, so the
process holding the row cache objects latch also takes the shared pool latch
and, while holding both, tries to free unpinned ('frunp', as in kghfrunp)
chunks. Scanning the shared pool and freeing the additional space takes a long
time (similar to Riyaj's response to this thread).
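
One way to watch whether the dictionary cache really is the component demanding
space is v$rowcache (again only a sketch; a getmisses count that keeps climbing
suggests the row cache keeps having to load entries and allocate new chunks):

        select parameter, count, usage, gets, getmisses
        from   v$rowcache
        order by getmisses desc;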

All this matches fairly well with the description of the Bug at the beginning of
this thread. In the meantime, another DBA here increased the shared_pool size
(which delays the onset of this problem, but it should hit harder when it hits),
and upgraded from 8.1.7.2 to 8.1.7.4 (Oracle thought it was a different bug
[1819214, 1647363] and that this upgrade would fix them. I am tracking down
whether our bug is also covered by this upgrade). It seems that reducing shared
pool contention (a smaller shared_pool_size, session_cached_cursors, ...) will
also help.
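
For what it's worth, session_cached_cursors can be tried per-session before
committing it to init.ora (the value below is purely illustrative; on 8i
shared_pool_size itself is not dynamic, so changing that still means an init.ora
edit and a bounce):

        -- illustrative value only
        alter session set session_cached_cursors = 50;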

Thanks to Ross, Mladen, and Riyaj for the help.

Henry


-----Original Message-----
From: oracle-l-bounce@xxxxxxxxxxxxx
[mailto:oracle-l-bounce@xxxxxxxxxxxxx] On Behalf Of Mladen Gogala
Sent: Wednesday, April 28, 2004 12:52 AM
To: oracle-l@xxxxxxxxxxxxx
Subject: Re: systemstate dump


I only now saw this post. Sorry for being late. There is an Oracle
bug in 8i with the DLM. To make a long story short, a global enqueue
"loses" its owner and, consequently, cannot be released. That causes all
kinds of beautiful events. Here is what I'm talking about and what Mogens
would call CTNOM:

Article-ID:         <Note:157766.1> 
Circulation:        PUBLISHED (EXTERNAL) 
Folder:             server.HA.RAC 
Topic:              Performance, Tuning and Hanging Issues 
Title:              Sessions Wait Forever for 'global cache cr request' Wait

                    Event in OPS or RAC 
Document-Type:      PROBLEM 
Impact:             MEDIUM 
Skill-Level:        NOVICE 
Server-Version:     08.01 to 09.00 
Updated-Date:       18-FEB-2003 11:35:45 
References:         <BUG:1916409> 
Shared-Refs: 
Authors:            MPOLASKI.US 
Attachments:        NONE 
Content-Type:       TEXT/PLAIN 
Products:           5/RDBMS; 
Platforms:          GENERIC; 


Problem Description 
------------------- 


Sessions hang indefinitely in an OPS environment.  Session(s) wait forever for
the 'global cache cr request' wait event.  Other sessions could end up waiting
for the 'buffer busy due to global cache' or 'global cache lock busy' wait
events. 
  
...


Explanation 
----------- 


Due to Bug 1916409, a PCM lock whose timeout has expired might never be
cancelled. 


Possible Symptoms: 


- The 'WAITING SESSIONS' section of an opsdiag.sql output (Note 135714.1) shows
  a session waiting for 'global cache cr request' with a high value for
  'seconds'. 


- The 'PCM LOCK WAITERS' section of an opsdiag.sql output (Note 135714.1) shows
  a lock upgrading (for example: granted in 'Null' mode and requesting 'Share'
  mode) with a high value for 'sec' (seconds). 


- The 'global cache cr request' wait event never finishes.  Seq# and
  seconds_in_wait continue to increase. 


        select sid, event, seq#, seconds_in_wait 
        from gv$session_wait 
        where event like 'global cache%'; 


- The lock dump (from an lkdebug dump or systemstate dump) for the offending
  PCM lock has On_timer_q? : Y and a negative timeout value.  Example: 


      ----------lock-------------------------- 
lock version     : 10165 
Owner node       : 0 
grant_level      : KJUSERNL            <-- Granted in Null Mode 
req_level        : KJUSERPR            <-- Requesting Protected Read 
bast_level       : KJUSERNL 
notify_func      : 0 
resp             : 0x99dd5174 
procp            : 0x9866e644 
remote_lockp     : 0x9916d114 
pid              : 5012 
proc version     : 418 
oprocp           : 0x9866e644 
opid             : 5012 
gid              : 1 
xid              : 0 0 
crstate          : 5 
dd_time          : 0.0 secs 
dd_count         : 0 
timeout          : -26786.0 secs       <-- Negative timeout value 
On_timer_q?      : Y                   <-- On Timer Queue 
On_dd_q?         : N 
lock_state       : KJUSEROPENING 
Open Options     : KJUSERPERSISTENT 
Convert options  : KJUSERGETVALUE 
History          : 0x231726cd 
Msg_Seq          : 0x0 


- 'buffer busy due to global cache' or 'global cache lock busy' wait events
  appearing on the system (although this is more of a side effect). 

> -----Original Message----- 
> From: Henry Poras [mailto:hporas@xxxxxxxxxxx] 
> Sent: Wed 4/28/2004 2:08 AM 
> To: oracle-l@xxxxxxxxxxxxx 
> Cc: 
> Subject: systemstate dump
> 
> 
> 
> 
> From statspack, I compared the 15 (closer to 20) minute snapshot where the
> issue arose to the 15 minute snapshot just prior to that. What I found was:
> 
> -CPU time almost unchanged 
> -a huge change in the Total Wait Time (cs) for the following events (the
> numbers of waits were all within a factor of 2 of each other, except for latch
> free, which jumped from 200 to 150,000).
> 
> NAME                              TIME-initial (cs)          TIME-final (cs)
>  
> latch free                        200                        2,800,000 
> enqueue                           900                        280,000
> global cache lock busy            8,200                      30,000 
> library cache pin                 500                        18,000 
> row cache lock                    600                        15,000 
> library cache lock                600                        8,000 
> DFS lock handle                   30                         5,000 
> PX Deq: Execution Msg             --                         5,000 
> 
