Thanks very much to Tanel, Jonathan, and others who reviewed the issue.
To give an update on this issue: it turned out to be an Oracle bug in the implementation of _db_block_hash_latches, where spin_count is internally set to spin_count * cpu_count by Oracle. There was also another problem with scanning the index structure, which I am not yet clear about.
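For context, a rough back-of-the-envelope illustration of why such a bug hurts: if the effective spin count is silently multiplied by cpu_count, each latch miss burns far more CPU than the DBA intended when setting _spin_count. The figures below are hypothetical, not measurements from our system:

```python
# Toy arithmetic: effective spinning per latch miss if Oracle internally
# multiplies _spin_count by cpu_count (the buggy behaviour described
# above). All numbers are made up for illustration.

def effective_spins(spin_count: int, cpu_count: int) -> int:
    """Spins actually performed per latch miss under the buggy behaviour."""
    return spin_count * cpu_count

intended = 2000    # a typical default _spin_count
cpus = 32          # a hypothetical 32-CPU host

buggy = effective_spins(intended, cpus)
print(f"intended spins per miss: {intended}")
print(f"buggy spins per miss:    {buggy} ({buggy // intended}x more CPU burned)")
```

On a busy system with thousands of latch gets per second, that multiplier alone is enough to explain a sudden CPU spike once contention begins.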
We applied a patch and the issue is gone now.
On 8/31/06, Tanel Poder <tanel.poder.003@xxxxxxx> wrote:

> You're on 9i or higher, right?
>
> One possibility is that this is caused by shared CBC latching peculiarities:
>
> 1) During normal selects, your index root block can be examined under a shared cache buffers chains latch. So as long as everybody is only reading the index root block, everybody can do it concurrently (without pinning the block). The "current holder count" in the CBC latch structure is just increased by one for every read-only latch get and decreased by one on every release. A value of 0 means that nobody currently holds this latch.
>
> Nobody has to wait for others when reading the index root block in the all-read-only case. That greatly helps to combat hot index root issues.
>
> 2) Now if a branch block split happens a level below the root block, the root block has to be pinned in exclusive mode to reflect this change in it. In order to pin a block you need to get the corresponding CBC latch in exclusive mode. If there are already a bunch of readers on the latch, then the exclusive latch getter will just flip a bit in the CBC latch structure, stating its interest in an exclusive get.
>
> Every read-only latch get checks for this bit; if it is set, the getters will just spin instead, waiting for this bit to be cleared (they may yield or sleep immediately as well, I haven't checked). Now the exclusive getter has to spin/wait until all the shared getters have released the latch and the "current holder count" drops to zero. Once it is zero (and the getter manages to get onto a CPU), it can get the latch, do its work and release the latch.
>
> During all that time, starting from when the "exclusive interest" bit was set, nobody could access this index's root block except the processes which already had the latch in shared mode.
> Depending on the latch spin/sleep strategy for this particular case and the OSD implementation, this could mean that all those "4000 readers per second" start just spinning on that latch, causing a heavy spike in CPU usage, and they all queue up.
>
> How to diagnose that:
>
> You could sample v$latch_misses to see whether the "kcbgtcr: kslbegin shared" nowait-fails/sleeps counter takes an exceptional jump up once you observe this hiccup.
>
> How to fix that once diagnosed:
>
> The usual stuff, like partitioning if possible, or creating a single table hash cluster instead.
>
> If you see that the problem comes from excessive spinning, think about reducing the spinning overhead (by reducing _spin_count, for example). This could affect your other database functions though.
>
> If you can't do the above, and you have off-peak time, then analyze the indexes (using a treedump for a start), and if you see a block split coming in a branch below the root block, force the branch block to split during off-peak time by inserting carefully picked values into the index tree which go exactly in the range that causes the proper block to split. Then you can just roll back your transaction; the block splits are not rolled back nor coalesced somehow, as this is done in a separate recursive transaction.
>
> Tanel.
>
> ------------------------------
> *From:* oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] *On Behalf Of* Zhu,Chao
> *Sent:* Wednesday, August 30, 2006 21:53
> *To:* oracle-l@xxxxxxxxxxxxx
> *Subject:* Hot index block split on a very busy table--what is the impact
>
> Hi all,
> We have a few databases with very busy, highly concurrently accessed tables, where some indexes of the busy tables can be accessed 1.5K-4K times per second (PK lookups).
> I am wondering, when such an index has a block split (at the root or branch level), what will be the impact on the system?
> Index block split is said to be an expensive operation. During the block split on a branch/root block, what will the other sessions doing selects based on this index be waiting on? There is a wait event named "Index Block Split" with p1, p2, p3 pointing to block address and level. The document "Description of Oracle 7 Wait Events and Enqueues" says it will yield CPU.
> We have had a few production incidents where the load suddenly jumped from 10 to several hundred, and during that time nothing could be done. From our monitoring tools, it is mainly "latch free" contention inside Oracle, and it is cache buffers chains. The load spike matches Oracle's description of block splits, but the wait event does not match. And because the load spike is very short, we can't capture the actual wait event/p1,p2,p3 during the exact time of the spike.
> Does anyone have a similar problem or insight into this issue?
> Is there some v$ view to track (v$segstat does not have) a log of which index goes through how many block splits? Or any other effective way?
>
> --
> Regards
> Zhu Chao
> www.cnoug.org
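To make the shared/exclusive interplay Tanel describes concrete, here is a toy Python model of the mechanism: readers bump a holder count, an exclusive getter flips an "exclusive interest" bit, new readers back off while that bit is set, and the exclusive get succeeds only once the holder count drains to zero. This is a sketch of the idea only; the class and method names are invented, and Oracle's actual latch code is of course nothing like this:

```python
# Toy single-threaded model of the shared cache-buffers-chains latch
# behaviour described above. Real latch gets spin/yield/sleep and retry;
# here the try_* methods simply report success or failure.

class SharedCBCLatch:
    def __init__(self):
        self.holder_count = 0            # current shared (read-only) holders
        self.exclusive_interest = False  # "someone wants this latch exclusively"

    def try_get_shared(self) -> bool:
        # New readers must not join once an exclusive getter has flagged
        # interest; in reality they would spin waiting for the bit to clear.
        if self.exclusive_interest:
            return False
        self.holder_count += 1
        return True

    def release_shared(self):
        self.holder_count -= 1

    def try_get_exclusive(self) -> bool:
        # Flip the interest bit, then succeed only when all shared holders
        # have released; otherwise keep spinning/sleeping and retry.
        self.exclusive_interest = True
        return self.holder_count == 0

    def release_exclusive(self):
        self.exclusive_interest = False


latch = SharedCBCLatch()
assert latch.try_get_shared()          # reader 1 in
assert latch.try_get_shared()          # reader 2 in, concurrently
assert not latch.try_get_exclusive()   # splitter must wait: 2 holders left
assert not latch.try_get_shared()      # new readers are now blocked too
latch.release_shared()
latch.release_shared()
assert latch.try_get_exclusive()       # holders drained: exclusive get works
latch.release_exclusive()
assert latch.try_get_shared()          # readers flow again
```

The window between setting the interest bit and draining the holders is exactly where the "4000 readers per second" pile up: every new reader fails its shared get and spins, which is consistent with the sudden CPU spike and cache buffers chains "latch free" waits described in this thread.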
--
Regards
Zhu Chao
www.cnoug.org