Re: Deadlock ITL Waits

  • From: kyle Hailey <kylelf@xxxxxxxxx>
  • To: Stalin <stalinsk@xxxxxxxxx>, ORACLE-L <Oracle-L@xxxxxxxxxxxxx>
  • Date: Thu, 21 Jul 2011 08:43:56 -0700

Yes, bad log write times:

Event                      Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s
>1s
-------------------------- ----- ----- ----- ----- ----- ----- ----- -----
-----
log file parallel write      38K  72.5  15.4   5.4   2.0    .8    .4   1.3
2.2


You've got over 2% over a second! That's pretty bad.
There is something wrong with your I/O subsystem.

I did some lgwr I/O write time distribution graphs on a couple of systems
here

http://dboptimizer.com/2011/07/14/lgwr-redo-write-times-log-file-parallel-write/

Even at 6MB log writes , I was still getting under 300ms over NFS mounted
file systems .
Typically log writes are much smaller and on the order of 3ms.

- Kyle Hailey
http://dboptimizer.com


On Thu, Jul 21, 2011 at 12:16 AM, Stalin <stalinsk@xxxxxxxxx> wrote:

> Hah, totally forgot about event_histograms :)
>
> i think either the storage is pushed to its limits or something terribly
> wrong on the storage side.
>
> Wait Event Histogram  DB/Inst: NBBSDB/nbbsdb  Snaps: 11203-11204
> -> Total Waits - units: K is 1000, M is 1000000, G is 1000000000
> -> % of Waits - column heading: <=1s is truly <1024ms, >1s is truly
> >=1024ms
> -> % of Waits - value: .0 indicates value was <.05%, null is truly 0
> -> Ordered by Event (idle events last)
>
>                            Total ----------------- % of Waits
> ------------------
> Event                      Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s
>   >1s
> -------------------------- ----- ----- ----- ----- ----- ----- ----- -----
> -----
> ADR block file read          13   69.2        15.4   7.7   7.7
> ADR block file write          5   80.0  20.0
> ADR file lock                 6  100.0
> ARCH wait for archivelog l    8  100.0
> Data Pump slave startup       1  100.0
> Datapump dump file I/O       18   72.2  27.8
> Disk file operations I/O   1567   92.5    .1    .3    .1          .1   2.1
>   4.9
> LGWR wait for redo copy     562   98.9          .9                      .2
> Log archive I/O               2                                100.0
> PL/SQL lock timer            28    3.6               3.6              92.9
> Parameter File I/O          104   76.0   9.6   1.9   1.9   1.9   3.8   4.8
> SGA: allocation forcing co   19                      5.3              94.7
> SQL*Net break/reset to cli   56   94.6   3.6               1.8
> SQL*Net more data to clien 3818   91.9                            .0   8.1
> Streams AQ: enqueue blocke    1
>  100.0
> asynch descriptor resize    269  100.0
> buffer busy waits          7556   87.7    .8    .3    .2    .2    .2   2.5
>   8.2
> buffer deadlock             212  100.0
> buffer exterminate           92K    .0          .0    .0  99.1    .5    .4
> change tracking file synch   58   17.2              13.8  10.3   1.7  43.1
>  13.8
> change tracking file synch   58   39.7               3.4         5.2  43.1
>   8.6
> control file parallel writ 1474   32.2   6.8   2.0   1.6   2.1   2.8  44.1
>   8.3
> control file sequential re 8078   55.9    .8   1.1   3.6   2.0   1.6  30.0
>   5.0
> cursor: mutex S               7  100.0
> cursor: pin S              1363   29.2   7.7   7.9   2.4   2.5   6.4  43.9
> cursor: pin S wait on X     378                            9.8   6.6  82.3
>   1.3
> db file parallel read         7                           14.3  14.3  28.6
>  42.9
> db file parallel write      276K  13.5   7.9  11.8  13.2  12.4  10.8  30.0
>    .3
> db file scattered read       32K  11.3   2.7  10.2  20.2  44.3   8.9   2.3
>    .0
> db file sequential read     308K  30.2    .9   2.3   4.1   1.6    .6  38.6
>  21.7
> db file single write          5   60.0                                40.0
> direct path read            143   31.5   5.6  17.5  30.1   6.3    .7   7.0
>   1.4
> direct path read temp      9032    7.8   5.9  14.2  28.2  10.1   2.0  25.5
>   6.2
> direct path sync              2        100.0
> direct path write             4   25.0  25.0                          50.0
> direct path write temp     3441    2.2   2.6   6.4  22.2  58.8   6.2   1.6
>    .1
> enq: CF - contention         22                                  4.5  22.7
>  72.7
> enq: CR - block range reus    6               83.3  16.7
> enq: FB - contention          2  100.0
> enq: HW - contention         17   35.3         5.9  11.8         5.9   5.9
>  35.3
> enq: KO - fast object chec    1                                100.0
> enq: RO - fast object reus    7   42.9  28.6        14.3  14.3
> enq: SQ - contention         11   27.3  72.7
> enq: TQ - DDL contention      1                                      100.0
> enq: TQ - TM contention       1                                      100.0
> enq: TX - allocate ITL ent 1716     .9    .1    .2    .6   1.9   2.2  30.7
>  63.4
> enq: TX - contention          1              100.0
> enq: TX - index contention  592   37.2   6.6    .2    .3   1.7   3.0  38.0
>  13.0
> enq: TX - row lock content    4                     25.0  25.0  25.0  25.0
> enq: UL - contention          1                          100.0
> kksfbc child completion       1                                      100.0
> kupp process wait             2                                      100.0
> latch free                   33K  26.8  10.7  12.3   8.2  17.0  19.5   5.5
> latch: In memory undo latc 1046   98.6    .7    .5    .3
> latch: cache buffers chain 4750   96.3   1.1   1.1    .9    .2    .4    .0
> latch: cache buffers lru c   51   37.3   7.8   5.9  21.6   5.9  19.6   2.0
> latch: enqueue hash chains 1927    6.8   3.3   3.2   2.5   3.0    .9  80.3
> latch: messages             333                                100.0
> latch: object queue header   39   76.9   5.1               5.1  12.8
> latch: redo allocation        5   20.0  20.0        20.0        20.0  20.0
> latch: redo writing          12   83.3         8.3               8.3
> latch: row cache objects    173   30.1   6.9    .6   4.0        36.4  22.0
> latch: session allocation   240   17.1  11.7  18.3  49.6   3.3
> latch: shared pool            2   50.0  50.0
> latch: undo global data       9   77.8  22.2
> library cache load lock       1                                100.0
> library cache lock           31   64.5                     9.7   6.5  19.4
> library cache: mutex X      649   98.9    .6          .2    .3
> log file parallel write      38K  72.5  15.4   5.4   2.0    .8    .4   1.3
>   2.2
> log file sequential read    470    2.8    .2          .9    .6    .2  72.6
>  22.8
> log file single write         8   62.5        12.5                    12.5
>  12.5
> log file switch (private s 2303                             .0    .2   6.2
>  93.6
> log file sync               838K   2.9   1.0    .5   1.7   1.7    .8   7.6
>  83.8
> os thread startup            21                                      100.0
> rdbms ipc reply              29   89.7                                 6.9
>   3.4
> read by other session        96K  73.6   2.0   3.7   3.5   1.4    .4  10.7
>   4.6
> reliable message             58   62.1                                37.9
> row cache lock                4                    100.0
> undo segment extension       20   15.0                    15.0  10.0  60.0
> utl_file I/O               4602K  99.7    .0    .0    .3    .0    .0    .0
> wait list latch free          1                          100.0
> DIAG idle wait             6675                                      100.0
>    .0
> KSV master wait               3   66.7                                33.3
> SQL*Net message from clien 2871K  33.7   2.8   4.4   3.8  10.9  18.3  22.2
>   3.8
> SQL*Net message to client  2811K 100.0    .0    .0    .0    .0    .0    .0
> SQL*Net more data from cli   26K  99.8    .0    .0          .0    .0    .1
> Space Manager: slave idle   690
>  100.0
> Streams AQ: qmn coordinato  269   51.3                                  .4
>  48.3
> Streams AQ: qmn slave idle  251    2.4
>  97.6
> Streams AQ: waiting for ti    2                                       50.0
>  50.0
> class slave wait             30  100.0
> jobq slave wait             120                                      100.0
> pmon timer                 1243    8.5
>  91.5
> rdbms ipc message            53K  13.0  30.5   8.7   1.8   2.1   2.5  14.6
>  26.8
> smon timer                   18    5.6
>  94.4
> wait for unread message on   81    1.2   1.2   2.5  18.5   8.6   6.2  61.7
>           -------------------------------------------------------------
>
>
> On Wed, Jul 20, 2011 at 11:50 PM, Stalin <stalinsk@xxxxxxxxx> wrote:
>
>> Thanks Kyle.
>>
>> Apparently we don't have diagnostic pack licenses to use DBA_HIST tables.
>> Is this something i can get from statspack tables ?
>>
>>
>> On Wed, Jul 20, 2011 at 10:20 PM, kyle Hailey <kylelf@xxxxxxxxx> wrote:
>>
>>>
>>> Yeah, the massive log waits could exacerbate the ITL waits as things will
>>> be slow commiting their changes.
>>> The db file sequential reads are huge as well. Looks like you have a
>>> problem on your storage.
>>>
>>> Check the log write times  ( I blogged some of this at
>>> http://dboptimizer.com/2011/07/20/wait-event-and-wait-class-metrics-vs-vsystem_event/)
>>>
>>>
>>> Latencies in the past minute
>>>
>>> col name for a25
>>> select m.intsize_csec,
>>>        n.name ,
>>>        m.time_waited,
>>>        m.wait_count,
>>>        10*m.time_waited/nullif(m.wait_count,0) avgms
>>> from v$eventmetric m,
>>>      v$event_name n
>>> where m.event_id=n.event_id
>>>   and n.name in (
>>>                   'log file sync',
>>>                   'log file parallel write'
>>> );
>>>
>>>
>>> Latencies averaged over each hour
>>>
>>> select
>>>        btime,
>>>        (time_ms_end-time_ms_beg)/nullif(count_end-count_beg,0) avg_ms
>>> from (
>>> select
>>>        to_char(s.BEGIN_INTERVAL_TIME,'DD-MON-YY HH24:MI')  btime,
>>>        total_waits count_end,
>>>        time_waited_micro/1000 time_ms_end,
>>>        Lag (e.time_waited_micro/1000)
>>>               OVER( PARTITION BY e.event_name ORDER BY s.snap_id) 
>>> time_ms_beg,
>>>        Lag (e.total_waits)
>>>               OVER( PARTITION BY e.event_name ORDER BY s.snap_id) count_beg
>>> from
>>>        DBA_HIST_SYSTEM_EVENT e,
>>>        DBA_HIST_SNAPSHOT s
>>> where
>>>          s.snap_id=e.snap_id
>>>    and e.event_name in (
>>>                   'log file sync',
>>>                   'log file parallel write'
>>> )
>>> order by begin_interval_time
>>> )
>>> order by btime
>>> /
>>>
>>> If the log file parallel writes are bad then probably something wrong on
>>> the storage system.
>>>
>>> - Kyle Hailey
>>> http://dboptimizer.com
>>>
>>>
>>>
>>> On Wed, Jul 20, 2011 at 3:37 PM, Stalin <stalinsk@xxxxxxxxx> wrote:
>>>
>>>> We have been seeing lots of deadlock errors lately in load testing
>>>> environments and they all have been due to enq: TX - allocate ITL entry. In
>>>> reviewing the statspack report for the periods of deadlock, i see that, log
>>>> file sync wait being the top consumer with a terrible wait time. That makes
>>>> to me think the deadlock, is just a symptom of high log file sync wait
>>>> times.  Below is the snippet from statspack and looking at these numbers,
>>>> especially CPU not being heavily loaded, wondering if this could be a case
>>>> of storage issue. Sys Admins are checking the storage layer but thought
>>>> would check here get any opinions/feedback.
>>>>
>>>> Top 5 Timed Events
>>>>  Avg %Total
>>>> ~~~~~~~~~~~~~~~~~~
>>>> wait   Call
>>>> Event                                            Waits    Time (s)
>>>> (ms)   Time
>>>> ----------------------------------------- ------------ -----------
>>>> ------ ------
>>>> log file sync                                1,400,773   4,357,902
>>>> 3111   91.4
>>>> db file sequential read                        457,568     334,834
>>>>  732    7.0
>>>> db file parallel write                         565,843      27,573
>>>> 49     .6
>>>> read by other session                           16,168       7,395
>>>>  457     .2
>>>> enq: TX - allocate ITL entry                       575       6,854
>>>>  11919     .1
>>>>          -------------------------------------------------------------
>>>> Host CPU  (CPUs: 64  Cores: 8  Sockets: 1)
>>>> ~~~~~~~~              Load Average
>>>>                      Begin     End      User  System    Idle     WIO
>>>> WCPU
>>>>                    ------- -------   ------- ------- ------- -------
>>>> --------
>>>>                       3.13    7.04      2.26    3.30   94.44    0.00
>>>>  7.81
>>>>
>>>> Statistic                                      Total     per Second
>>>>  per Trans
>>>> --------------------------------- ------------------ --------------
>>>> ------------
>>>> redo synch time                          435,852,302      120,969.3
>>>>    309.7
>>>> redo synch writes                          1,400,807          388.8
>>>>      1.0
>>>> redo wastage                               5,128,804        1,423.5
>>>>      3.6
>>>> redo write time                              357,414           99.2
>>>>      0.3
>>>> redo writes                                    9,935            2.8
>>>>      0.0
>>>> user commits                               1,400,619          388.7
>>>>      1.0
>>>>
>>>>
>>>> Environment : 11gr2 EE (11.2.0.1), Sol 10 Sparc
>>>>
>>>> Thanks,
>>>> Stalin
>>>>
>>>>
>>>
>>
>>
>> --
>> Thanks,
>>
>> Stalin
>>
>
>
>
> --
> Thanks,
>
> Stalin
>

Other related posts: