John, You're describing the too-many-LIOs problem in more and more detail. = Don't be confused; just because the Oracle kernel's not emitting data every = few microseconds doesn't mean it's not doing tons of work. Two factors about this problem often mess people up. When your system is killing itself on LIO calls (that is, on accesses to the database buffer cache), there are two things you might notice: (1)=A0The Oracle kernel = emits lines to the trace file only once per few minutes, and (2)=A0A single = "wait" row lingers in V$SESSION_WAIT for several minutes. The combination of = these two observations confuses a lot of people, because it looks like a = single timed event is to blame for the problem (that event that is the last on = in the trace file before the huge gap in time, or the one that's apparently "stuck" in V$SESSION_WAIT for the session). But it's not. There are two things you should realize. First, when the Oracle kernel = emits the record of a timed event to the trace file, that event is FINISHED. = Its elapsed duration is the value of the 'ela' field (not the duration of = the apparent "gap" in time that occurred after the event completed). If its = ela number is small, then that event's execution is not to blame for the response time problem. Second, with event 10046, the Oracle kernel = doesn't emit anything for individual LIO processing. You'll see a record of how = many LIOs were executed by a dbcall in a PARSE, EXEC, or FETCH line (in the = cr+cu total), but never in a WAIT line.=20 However, it's easy to create a record of LIO work while the slow program = is running. Just do this: select block_gets, consistent_gets from v$sess_io where sid=3D:sid Execute this once in SQL*Plus, and then just type "/<enter>" every few seconds, you'll be able to see the LIO count accumulate by the = thousands. For a little more detail, see "Optimizing Oracle Performance" on pages = 189 and 192-194. (There's also event 10200, which shows a 4-line x 80-byte record of each LIO if you're interested. Unlike 10046, 10200 is far too expensive to use in a production environment.) Again, John, the bottom line here is that you need to fix some SQL. By = the time you're finished doing that, you should have SQL that consumes no = more than about 10 LIO calls per row returned per table in your FROM clause. = For example, a query that's supposed to return 7 rows from an 8-way join = should consume no more than about 560 LIOs. (For those of you who have read the book, you know the caveat about aggregation functions, but the rule of = thumb is still accurate for the top-level row source operation that's beneath = the aggregation.) Instead of the number "10", right now, John, you're probably enduring billions. If you have no idea how to fix your SQL (whether it's your own custom = code, or part of an application you've purchased), a thread on this list that occurred earlier in the week revealed a couple of good options for = learning how to do the job. Good luck to you. Cary Millsap Hotsos Enterprises, Ltd. http://www.hotsos.com * Nullius in verba * Upcoming events: - Performance Diagnosis 101: 1/4 Calgary - SQL Optimization 101: 2/7 Dallas - Hotsos Symposium 2005: March 6-10 Dallas - Visit www.hotsos.com for schedule details... -----Original Message----- From: oracle-l-bounce@xxxxxxxxxxxxx = [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On Behalf Of John Clarke Sent: Friday, December 10, 2004 6:38 AM To: cary.millsap@xxxxxxxxxx; Oracle-L@xxxxxxxxxxxxx Subject: RE: mostly latch-free SCN? Thanks for the description of the *** lines. I should have done more research before asking this. Here's why I originally posted the question, just so you know my frame = of mind. I run a problematic session that had been hurting this client's nightly batch load for months. I knew we had some SQL optimization = issues and was going to go through the exercise of training their DBAs how to trace, analyze, etc. While running this thing I was showing them what's = in V$SESSION_WAIT and noticed the session waiting on latch free waits, = p2=3D88 (which on 9.2.0.4 is the mostly latch-free SCN latch, as I learned). I = then started examining the trace file and while tailing it, saw it "stuck" on this latch free wait. I expected the next line in the trace file to = have either a large ela value on a named wait event or a high CPU/elapsed = value for an LIO. After waiting about 10 minutes, the trace file started dumping more data = and the next 100 or so lines were mbr waits, all with relatively small = elapsed times. I guess I expected a "big value" following this latch 88 wait. = I suppose the sum of the consecutive db file scattered read waits that followed this latch wait will have a long elapsed time, but typically = don't you see these waits "stream" to the trace file while the session is = waiting on them? And if they're not streaming w/ explainable wait times, what = then? I admit I didn't check the next fetch/exec/parse line for this - I just expected the next posted line to the trace file would have high values. = If this next "cpu" line does have high elapsed/cpu times, as we would = expect, what does that mean about all the db file scattered read waits between = the time the trace file was not being written to (10 minutes) before the = next db call? I think you've got something about this in your book but don't = recall (and don't have it with me). When I get back online at the office I'll double-check and hopefully = things will become more clear. Thanks -----Original Message----- From: oracle-l-bounce@xxxxxxxxxxxxx = [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On Behalf Of Cary Millsap Sent: Thursday, December 09, 2004 10:26 PM To: Oracle-L@xxxxxxxxxxxxx Subject: RE: mostly latch-free SCN? John, Your 'latch free' excerpt is microscopic and definitely not a concern in = =3D the excerpt. That particular type of 'latch free' event has contributed only 0.000049 seconds to the approximately 6.5 MINUTES represented by your = =3D trace data. Pursuing a fix to your latch "problem" will improve your response = =3D time by--at most!--about 0.000013%. So, you're asking the wrong question. :) The *** lines you're seeing are emitted by Oracle kernel calls to dbms_system.ksdddt(). The Oracle kernel behaves as if there's a function called "write_to_trace()" in the kernel code that handles Oracle =3D extended SQL trace file writing in the following manner: { static int time_of_prior_write =3D3D 0; subroutine write_to_trace(TRACE, text) { if now() - time_of_prior_write > 10 seconds { ksdddt; } time_of_prior_write =3D3D now(); write(TRACE, text); } } The *** line doesn't mean that your 'latch free' consumed a bunch of =3D time. The ela value for each timed event shows you how much time that event = =3D has consumed. The *** line is telling you that it's a long time between =3D calls to Oracle timed events. This almost always indicates that the process being traced is accessing the database buffer cache way too many times. This = =3D is common because so many books, courses, etc. teach you that making sure everything is in the buffer cache is golden goodness. The dbcall (PARSE, EXEC, FETCH, ...) that immediately /follows/ this =3D excerpt will have a huge c value (lots of CPU service consumption for the call), = =3D and a huge e value (lots of elapsed time for the call). It will almost undoubtedly also have a huge cr+cu value as well, which will confirm =3D that you're suffering from, I would estimate, on the order of tens of =3D millions of unnecessary "consistent gets" and "db block gets." Bottom line: You have some very inefficient SQL here that you need to = =3D fix. The book "Optimizing Oracle Performance" explains further. Cary Millsap Hotsos Enterprises, Ltd. http://www.hotsos.com * Nullius in verba * Upcoming events: - Performance Diagnosis 101: 1/4 Calgary - SQL Optimization 101: 12/13 Atlanta - Hotsos Symposium 2005: March 6-10 Dallas - Visit www.hotsos.com for schedule details... -----Original Message----- From: oracle-l-bounce@xxxxxxxxxxxxx =3D [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On Behalf Of John Clarke Sent: Thursday, December 09, 2004 4:51 PM To: oracle-l@xxxxxxxxxxxxx Subject: mostly latch-free SCN? Anyone have any idea what this is=3D3D3F I was 10046 tracing a process and found it post a wait on mostly =3D latch-f=3D3D ree SCN latch. Excerpt from trace file below. WAIT #2: nam=3D3D3D'db file scattered read' ela=3D3D3D 1054 p1=3D3D3D17 = =3D p2=3D3D3D30500 p=3D3D 3=3D3D3D3 WAIT #2: nam=3D3D3D'db file sequential read' ela=3D3D3D 1377 p1=3D3D3D17 = =3D p2=3D3D3D30504 =3D3D p3=3D3D3D1 *** 2004-12-09 17:27:55.695 WAIT #2: nam=3D3D3D'latch free' ela=3D3D3D 49 = p1=3D3D3D-4611686009467115424 =3D p2=3D3D3D88 =3D3D p3=3D3D3D0 *** 2004-12-09 17:34:20.573 WAIT #2: nam=3D3D3D'db file sequential read' ela=3D3D3D 4850 p1=3D3D3D50 = =3D p2=3D3D3D119591=3D3D p3=3D3D3D1 WAIT #2: nam=3D3D3D'db file sequential read' ela=3D3D3D 6175 p1=3D3D3D50 = =3D p2=3D3D3D119590=3D3D p3=3D3D3D1 I'm not positive what the "^***" lines mean, but what I can tell is =3D that=3D3D when hit the latch wait and displayed a wall time of 17:27, it didn't = =3D d=3D3D o anything until the next "^***" line. Shouldn't the ela in either the = =3D =3D3D current or next wait match the elasped time=3D3D3F Thanks in advance. -- //www.freelists.org/webpage/oracle-l -- //www.freelists.org/webpage/oracle-l -- //www.freelists.org/webpage/oracle-l -- //www.freelists.org/webpage/oracle-l