RE: Hot index block split on a very busy table--what is the impact

  • From: Tanel Poder <tanel.poder.003@xxxxxxx>
  • To: zhuchao@xxxxxxxxx, oracle-l@xxxxxxxxxxxxx
  • Date: Thu, 31 Aug 2006 00:16:08 +0800

You're on 9i or higher, right?
 
One possibility is that this is caused by shared CBC latching peculiarities:
 
1) During normal selects, your index root block can be examined under a
shared cache buffers chains (CBC) latch.
So as long as everybody is only reading the index root block, everybody can
do it concurrently (without pinning the block). The "current holder count"
in the CBC latch structure is just incremented by one for every read-only
latch get and decremented by one on every release. A value of 0 means that
nobody currently holds the latch.
 
In an all-read-only case, nobody has to wait for anyone else to read the
index root block. That greatly helps to combat hot index root issues.
 
2) Now, if a branch block split happens a level below the root block, the
root block has to be pinned in exclusive mode to reflect this change in it.
In order to pin a block, you need to get the corresponding CBC latch in
exclusive mode.
If there is already a bunch of readers on the latch, the exclusive getter
will just flip a bit in the CBC latch structure, stating its interest in an
exclusive get.
 
Every read-only latch get checks for this bit; if it's set, the getters
will just spin instead, waiting for this bit to be cleared (they may yield
or sleep immediately as well, I haven't checked). Meanwhile, the exclusive
getter has to spin/wait until all the shared getters have released the
latch and the "current holder count" drops to zero. Once it's zero (and the
getter manages to get onto a CPU), it can take the latch, do its work and
release the latch.
 
During all that time, starting from when the "exclusive interest" bit was
set, nobody could access this index's root block except the processes which
already held the latch in shared mode. Depending on the latch spin/sleep
strategy for this particular case and the OSD implementation, this could
mean that all those "4000 readers per second" just start spinning on that
latch, causing a heavy spike in CPU usage as they all queue up.
 
 
How to diagnose this:
 
You could sample v$latch_misses to see whether the "kcbgtcr: kslbegin
shared" nowait-fail/sleep counters take an exceptional jump once you
observe this hiccup.
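A minimal sketch of such a sampling query (column names as in 9i/10g's
v$latch_misses; verify against your version, and sample it repeatedly to
spot deltas rather than absolute values):

```sql
-- Watch the nowait-miss and sleep counters for the shared CBC latch get
-- in kcbgtcr; an abrupt jump coinciding with the load spike supports the
-- shared-latch theory described above.
SELECT parent_name, location, nwfail_count, sleep_count
FROM   v$latch_misses
WHERE  parent_name = 'cache buffers chains'
AND    location LIKE 'kcbgtcr: kslbegin%';
```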
 
 
How to fix that once diagnosed:
 
The usual stuff, like partitioning if possible, or creating a single-table
hash cluster instead.
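For illustration, a minimal single-table hash cluster sketch (the table and
column names are made up, and the HASHKEYS/SIZE values must be sized for
your own data); PK lookups then hash straight to the data block and bypass
the index root block entirely:

```sql
-- Hypothetical example: single-table hash cluster keyed on the PK column.
CREATE CLUSTER orders_cluster (order_id NUMBER)
  SINGLE TABLE
  HASHKEYS 1000000   -- expected number of distinct key values
  SIZE 100;          -- bytes reserved per key

CREATE TABLE orders (
  order_id   NUMBER PRIMARY KEY,
  order_data VARCHAR2(100)
) CLUSTER orders_cluster (order_id);
```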
 
If you see that the problem comes from excessive spinning, think about
reducing the spinning overhead (by reducing _spin_count, for example). This
could affect your other database functions, though.
 
If you can't do the above, and you have an off-peak window, then analyse
the indexes (using a treedump for a start), and if you see a block split
coming in a branch below the root block, force that branch block to split
during off-peak time by inserting carefully picked values into the index
tree, values which fall exactly in the range that causes the proper block
to split. Then you can just roll back your transaction: the block splits
are neither rolled back nor coalesced, as they are done in a separate
recursive transaction.
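A sketch of those steps (the treedump event syntax is standard; the owner,
index and key values here are hypothetical placeholders you would replace
with your own):

```sql
-- 1) Find the index's object_id.
SELECT object_id FROM dba_objects
WHERE  owner = 'SCOTT' AND object_name = 'ORDERS_PK';

-- 2) Dump the B-tree structure to a trace file in user_dump_dest,
--    substituting the object_id returned above for 12345.
ALTER SESSION SET EVENTS
  'immediate trace name treedump level 12345';

-- 3) After identifying a nearly-full branch, insert a carefully chosen
--    key value off-peak to force the split, then roll back; the split
--    itself happens in a recursive transaction and is not undone.
INSERT INTO scott.orders (order_id, order_data) VALUES (912345, 'x');
ROLLBACK;
```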
 
Tanel.
 



  _____  

From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx]
On Behalf Of Zhu,Chao
Sent: Wednesday, August 30, 2006 21:53
To: oracle-l@xxxxxxxxxxxxx
Subject: Hot index block split on a very busy table--what is the impact


hi, all,
    We have a few databases with very busy, highly concurrent tables, where
some indexes of those busy tables can be accessed 1.5K-4K times per second
(PK lookups).
    I am wondering, when such an index undergoes a block split (at the root
or branch level), what will be the impact on the system.
    An index block split is said to be an expensive operation. During a
block split on a branch/root block, what will the other sessions doing
selects based on this index be waiting on? There is a wait event named
"index block split" with p1, p2, p3 pointing to block address and level.
   The document "Description of Oracle 7 Wait Events and Enqueues" says it
will yield the CPU.
   We have had a few production incidents when the load suddenly jumped
from 10 to several hundred, and during that time nothing could be done.
From our monitoring tools, it is mainly "latch free" contention inside
Oracle, and it is cache buffers chains. The load spike matches Oracle's
description of a block split, but the wait event does not match. And
because the load spike is very short, we can't capture the actual wait
event/p1,p2,p3 during the exact time of the spike.
  Does anyone have a similar problem or insight into this issue?
  Is there some v$ view to track which index goes through how many block
splits (v$segstat does not have this)? Or any other effective way?
   

-- 
Regards
Zhu Chao
www.cnoug.org

