Re: Understanding "cursor: pin S wait on X"

  • From: Thomas Kellerer <thomas.kellerer@xxxxxxxxxx>
  • To: "oracle-l@xxxxxxxxxxxxx" <oracle-l@xxxxxxxxxxxxx>
  • Date: Tue, 09 Apr 2013 13:14:17 +0200

Tanel,

thanks for the quick reply.

> 1) The cursor got flushed out for some reason (or it's just the first
> time anyone is running it) and the other session is currently hard
> parsing/compiling/loading the contents of this cursor back into
> library cache. The longer the parsing/loading takes, the longer you
> wait. If the parsing session hits some (performance) issues, this pin
> may be held for a long time.

To make sure I understood this correctly:
the pin is only held during parsing/loading, *not* while the cursor is 
executing (and fetching rows), correct?
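
For the next time this happens, I also want to capture who actually holds the
exclusive pin while the others are waiting. A rough sketch of what I have in
mind, assuming the usual 64-bit decoding where the top 8 hex digits of P2RAW
carry the blocking SID (on 11g v$session.blocking_session should show the
same, if I'm not mistaken):

   select sid, sql_id, p2raw,
          to_number(substr(rawtohex(p2raw), 1, 8), 'XXXXXXXX') as blocking_sid
   from   v$session
   where  event = 'cursor: pin S wait on X';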

But if this is really caused by parsing, shouldn't I also see a high number of 
hard parses in my AWR report?
There were only 865 hard parses and 155,291 total parses during that time.
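
To double-check the report, I'll also pull the parse counters per snapshot
straight out of AWR, roughly like this (DBID filter left out for brevity):

   select sn.begin_interval_time, st.stat_name,
          st.value - lag(st.value) over (partition by st.stat_name, st.instance_number
                                         order by st.snap_id) as delta_per_snapshot
   from   dba_hist_sysstat  st
   join   dba_hist_snapshot sn
          on  sn.snap_id         = st.snap_id
          and sn.dbid            = st.dbid
          and sn.instance_number = st.instance_number
   where  st.stat_name in ('parse count (total)', 'parse count (hard)')
   order  by st.snap_id, st.stat_name;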

> 2) The child cursor is already loaded into cache, but someone else is
> currently flushing it out (shared pool shortage, fragmentation &
> loading of a big object, SGA target / ASMM reducing the SGA size).

There are 3 grow operations and one shrink operation recorded in the AWR for 
the node in question.
Overall it grew from 3.2G to 3.4G during that hour (SGA Target is 30G, with
AMM enabled, if I'm not mistaken).
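
For the record, this is roughly how I pulled the grow/shrink operations (with
MEMORY_TARGET set there is also V$MEMORY_RESIZE_OPS, if I remember correctly):

   select component, oper_type, oper_mode, status,
          initial_size/1024/1024 as initial_mb,
          final_size/1024/1024   as final_mb,
          start_time, end_time
   from   v$sga_resize_ops
   order  by start_time;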

> The thing with "auto-extending" connection pools is that there's a
> chicken-and-egg situation with troubleshooting performance &
> utilization spikes - what caused what? Sometimes there's just a mild
> performance hiccup for whatever reason, which nobody would even
> noticed, but a trigger-happy connection pool will greatly amplify it
> by starting up hundreds of new connections during the problem time.

One problem was that the pools on the application servers were configured to
allow way too many connections (if all of them had extended to the maximum
size, this would have been more than twice the number of connections allowed
on the RAC system).

The first thing we did was to cap the number to a reasonable limit on each app
server.

(Additionally: the application has a "reputation" for doing this sort of thing.
The root cause is almost certainly inside the application or the underlying
E-Commerce platform.)

> But if you are talking about 0.6 logons per second *on average in a 1-hour 
> AWR report*

Yes, that's what I'm talking about.
  
> then this could mean also (let's say) 40 new connections per second during
> your spike time (of 60 seconds for example) and then no more logins
> throughout the rest of the hour. 60 x 40 / 3600 = 0.66...

> So if your spike lasted for X seconds, then your performance data
> should also be from X seconds (or even more granular)

The spike lasted over 3 hours (if I'm not mistaken). The AWR was taken from one 
hour during that time.
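
Since the 1-hour snapshot is too coarse for the spike itself, I'm going to
look at the minute-by-minute shape via the ASH history, something like this
(the times are placeholders for our actual window):

   select trunc(sample_time, 'MI') as minute,
          event,
          count(*)                 as ash_samples  -- each retained sample covers roughly 10 seconds
   from   dba_hist_active_sess_history
   where  sample_time between timestamp '2013-04-08 10:00:00'
                          and timestamp '2013-04-08 13:00:00'
     and  event in ('cursor: pin S wait on X', 'library cache lock')
   group  by trunc(sample_time, 'MI'), event
   order  by 1, 2;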

> If you want to reproduce "cursor: pin S wait on X" waits, you can use one of
> my seminar demo scripts
> (http://blog.tanelpoder.com/files/scripts/aot/demo4.sql ). Just run the
> entire script in one session (it will pin a child and end up hard parsing for
> a very long time) and then run the select part of that script only in a
> different session. But my demo very narrowly focuses on causing the wait
> event problem, I don't think it will reproduce anything like you saw.

I'm trying to understand the mechanics/reasons that lead to this situation. If
I understand it correctly, the demo is running a statement that takes really
long to parse and therefore creates the wait event, right?
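
While the demo runs I also plan to watch the mutex sleeps to see the
requesting/blocking session pairs, along these lines (assuming "Cursor Pin" is
the right mutex type to filter on):

   select requesting_session, blocking_session, location, sleeps
   from   v$mutex_sleep_history
   where  mutex_type = 'Cursor Pin'
   order  by sleep_timestamp desc;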

> Another option would be to get 10s or hundreds of sessions running the same
> SQL statement (and overload the CPUs) and then purging the cursor with
> DBMS_SHARED_POOL.PURGE (not just flush shared pool as it will skip pinned
> chunks, but purge will wait until it gets the pin).

I'll try that as well. Could the "logon storm" cause a flushing of the cursors
from the shared pool, requiring each new session to re-parse the statements?
(But then again: there were hardly any hard parses during that time.)
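
My rough plan for that test (sql_id, address and hash value below are
placeholders, and DBMS_SHARED_POOL needs to be installed and granted):

   -- identify the cursor that the test sessions keep executing
   select address, hash_value, executions
   from   v$sqlarea
   where  sql_id = '&sql_id';

   -- purge it while they are running; PURGE waits for the pins,
   -- unlike ALTER SYSTEM FLUSH SHARED_POOL, which skips pinned chunks
   begin
     dbms_shared_pool.purge('&address, &hash_value', 'C');
   end;
   /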

Regards
Thomas


> On Tue, Apr 9, 2013 at 11:59 AM, Thomas Kellerer <thomas.kellerer@xxxxxxxxxx 
> <mailto:thomas.kellerer@xxxxxxxxxx>> wrote:
>
>     Hello all,
>
>     we had a situation on a 2 node RAC system where for some reason the 
> system more or less stopped working due to excessive "cursor: pin S wait on 
> X" waits. (89% of the DB time on one node, 36% on the other node).
>
>     This seems to be the result of the application servers opening an
> excessively high number of connections to the RAC (the reason for that is as
> yet unknown).
>
>     We are not done investigating everything that happened on the application
> server yet, but it seems that all those sessions were running the same
> statements and for some reason they were all routed to the same RAC node.
>
>     The top 5 Wait events from the overloaded node show up like this:
>
>     Event                      Waits        Time(s)    Avg wait (ms)   % DB time   Wait Class
>     cursor: pin S wait on X    4,411        429,949    97472           88.68       Concurrency
>     library cache lock         246          30,608     124424         6.31        Concurrency
>     db file sequential read    1,237,878    8,275      7               1.71        User I/O
>     DB CPU                                  4,728                      0.98
>     db file scattered read     1,014,046    2,738      3               0.56        User I/O
>
>     (1 hour AWR report)
>
>     In order to better understand the background of when a "cursor: pin S
> wait on X" is requested, I tried to create a test environment to reproduce
> this, but so far without luck. The "best" I can do simulating the high
> concurrency (on a test environment) results in a high number of "latch:
> shared pool" and "latch: row cache objects" waits.
>
>     So my question is:
>
>     Could somebody explain in more detail at which step(s) in the query 
> processing a "cursor: pin S wait on X" is requested?
>     By understanding the sequence that leads to such a wait event I hope to 
> understand better what was going on on the Oracle side of the problem.
>
>     Btw: is 0.6 logons per second considered a "logon storm"?
>
>     Thanks in advance
>
>     Kind regards
>     Thomas

--
//www.freelists.org/webpage/oracle-l

