Is it really necessary to reduce ITL Waits to near zero?

From: "Taft, David" <TaftD@xxxxxxxxxxx>
To: "'oracle-l@xxxxxxxxxxxxx'" <oracle-l@xxxxxxxxxxxxx>
Date: Tue, 8 May 2007 14:53:04 -0400
All,

Recently I had to investigate if the frequent deadlocks in one database
could be due to insufficient ITL slots. It was determined that none of the
deadlocks were due to ITL Waits.  However, since there are ITL Waits in the
database, this raised the question if this a problem worth fixing?  I tried
to figure out what is an acceptable number of ITL Waits before the table or
index should be reorged/rebuilt with a larger INITRANS value.  I couldn't
find any hard  answers on the net, so I decided to do some analysis.  I
chose to use the YAPP method.  The final analysis was that eliminating all
ITL Waits would only have increased response time by 0.03%, which doesn't
justify reorging tables or rebuilding indexes. 

The system is 9.2.0.7 w/RAC on AIX5L & HACMP 5.2.  What follows is the
breakdown of that analysis. My main concern is if this analysis is sound.  I
think it is, but would appreciate a second opinion before submitting my
report. Thanks.

David Taft

Begin analysis:

Checked the lmd trace file.  No deadlocks were due to insufficient ITL
slots, i.e. no TX mode-4 locks.  All were due to TX mode-5 locks. Example:

*** 2007-04-30 09:02:09.340
Global Wait-For-Graph(WFG) at ddTS[0.7a6] :
BLOCKED 7000001dad4c140 5 [0x150004][0x5e710a],[TX] [65556,576166] 0
BLOCKER 7000001dad4cf68 5 [0x150004][0x5e710a],[TX] [65620,7413] 0
BLOCKED 7000001dad4d3a0 5 [0x70008][0x5dabb1],[TX] [65620,7413] 0
BLOCKER 7000001dad4aa90 5 [0x70008][0x5dabb1],[TX] [65556,576166] 0

Just as an FYI, I have seen deadlocks on this system where both TX mode 3
and TX mode 5 were involved. Just mentioning this because I have seen
threads where others were wondering if the LMD trace ever lists anything
other than mode 5.

Since no deadlocks were due to insufficient ITL slots, on to the questions
if the ITL Waits that do exists are enough to justify reorging tables and/or
rebuilding indexes.

First I searched the alert log to find a time period with multiple
deadlocks:

  alert_log=$1
  lines=`grep -n "Deadlock detected" $alert_log|awk -F: '{print $1}'`
  for line in $lines
  do
    beforeline=`expr $line - 1`
    datestamp=`sed -n "${beforeline}p" $alert_log|awk '{print $3 $2
$5"."$4}'`
    echo $datestamp
  done

  30Apr2007.09:02:09
  30Apr2007.09:02:21
  30Apr2007.09:18:02
  30Apr2007.09:18:14
  30Apr2007.09:19:16

There were five between the hours of 9am and 10am. This system is set up to
take a level 7 statspack snapshot at the top of every hour, so I pulled the
report covering that time period and broke down the response time as
follows:

  CPU used by this session:  240937

Convert from centiseconds to seconds to get Service Time.

  Service Time:  240937/100 = 2409

Note, the top five timed events are reported in seconds, while "CPU used by
this session" appears to be in centiseconds.

Estimate the total wait time, divide the "Time (s)" by the "% Total Ela
Time" for the top event in the "Top 5 Wait Events".

  Event                                               Waits    Time (s) Ela
Time
  -------------------------------------------- ------------ -----------
--------
  db file sequential read                         2,841,497      30,837
68.55

  Wait Time:  30837 /.6855 = 44985

Add "Wait Time" to "Service Time" to estimate "Response Time".

  Response Time:  2409 + 44985 = 47394

Subtract "Wait Time" from "Response Time", to estimate "CPU Time".

  CPU Time:   47394 - 44985 = 2409

Not sure why the above was necessary since it matches Service Time, but I do
seem to recall that was not always the case.  At least that is what I seem
to remember when using the YAPP method with 8i databases. I think 8i
reported the top 5 wait events in centiseconds and not seconds, so that may
account for the difference. Some sort of rounding issue?

Calculate the percentage of "CPU Time" and "Wait Time" by dividing the value
of each by "Response Time".

  "%CPU Time":    2409 / 47394 = .0508 ( 5.08%)
  "%Wait Time":  44985 / 47394 = .9491 (94.91%)

Note, 95% of total response time was due to waits. This system is heavy on
IO waits and the application needs to reduce its' overall IO.

Divide the enqueue time by "Response Time" to get estimated percent of
enqueue time.

                                                     Total Wait   wait
Waits
  Event                             Waits   Timeouts   Time (s)   (ms)
/txn
  -------------------------- ------------ ---------- ---------- ------
--------
  enqueue                          13,371      1,437        721     54
0.2

  %enqueue Time:  721 / 47394 = .0152 (1.52%)

Enqueues account for only 1.52% of the total response time.  Some tiny
percentage of those enqueue waits were due to ITL Waits in TX mode-4.  OK,
so this is where I begin to reduce the analysis to the ridiculous.

Estimate the total ITL Waits by dividing "ITL Waits" by %Total for the top
event in the "Top 5 ITL Waits per Segment".

                                           Subobject  Obj.           ITL
  Owner      Tablespace Object Name        Name       Type         Waits
%Total
  ---------- ---------- ------------------ ---------- ----- ------------
-------
  ADADCC     ADA_INDEX1 DCC_OBSV_FK_I                 INDEX           26
9.03

  ITL Waits:  26 / .0903 = 288

Average the time that each enqueue wait represents.

  Average Enqueue Time: 721 / 13371 = .054 centiseconds (cs)

Multiply the "ITL Waits" by the "Average Enqueue Time" to get the ITL wait
time.

  ITL Wait Time:  288 * .054 = 15.552 (s)

Divide "ITL Wait Time" by "Response Time" to get percent of ITL waits as
part of the overall response time.

  %ITL Wait Time: 15.552 / 47394 = .0003 (0.03%)

So if I had eliminated all ITL Waits, I would have reduced the average
response time by 0.03% for no perceptible change.  In the absence of any
extended SQL trace showing that ITL Waits was the largest contributor to any
particular job or report's response time, reorging or rebuilding is not
justified.

End Analysis.
Is it really necessary to reduce ITL Waits to near zero?

Other related posts: