Re: KTSJ / Wnnn

From: Ls Cheng <exriscer@xxxxxxxxx>
To: Dominic Brooks <dombrooks@xxxxxxxxxxx>
Date: Thu, 15 Jul 2021 14:27:05 +0200

Hi

Is this RAC? If it is, then you most probably need to apply this patch:

*Bug 32245850 - txtsdan : dml operations hung on "gc current request" waits*

We had quite a few problems with KTSJ in a 19.10 database running on an
Exadata Full Rack and, after raising an SR, were told to apply this patch,
which fixed the problems.

BR

On Thu, Jul 15, 2021 at 1:02 PM Dominic Brooks <dombrooks@xxxxxxxxxxx>
wrote:

I was observing a foreground process yesterday which was running a series
of batched updates from Java in a single thread and was running very
slowly.

Each element in the batch was updating a single row via a unique scan.

The execution time of this feed was reported as having tripled since
moving to 19.6.

Performance was atrocious. For example, in AWR over a period of 15-16
hours, an average-sized batch of a couple of hundred elements averaged
anywhere between 8 and over 100 seconds per execution, depending on the
hour, with the vast majority of the time in cluster-related waits. Averages
hide a whole bunch of detail, of course, but they are a useful indicator.
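
As a rough illustration of the per-hour averages described above, here is a minimal Python sketch. It assumes rows shaped like the EXECUTIONS_DELTA, ELAPSED_TIME_DELTA and CLWAIT_DELTA columns of DBA_HIST_SQLSTAT (times in microseconds). The numbers are invented for illustration and are not taken from the system discussed here.

```python
# Hypothetical hourly AWR rows, invented for illustration:
# (snapshot hour, executions_delta, elapsed_time_delta_us, clwait_delta_us)
ROWS = [
    ("01:00", 400, 3_200_000_000, 2_880_000_000),
    ("02:00", 30, 3_300_000_000, 3_000_000_000),
    ("03:00", 0, 0, 0),  # no executions in this hour
]


def per_execution(rows):
    """Return (hour, avg seconds per execution, % of elapsed time in cluster waits)."""
    result = []
    for hour, execs, elapsed_us, clwait_us in rows:
        if execs == 0 or elapsed_us == 0:
            continue  # skip idle hours to avoid dividing by zero
        avg_seconds = elapsed_us / execs / 1_000_000
        cluster_pct = 100.0 * clwait_us / elapsed_us
        result.append((hour, avg_seconds, cluster_pct))
    return result


if __name__ == "__main__":
    for hour, avg_seconds, cluster_pct in per_execution(ROWS):
        print(f"{hour}  {avg_seconds:6.1f} s/exec  {cluster_pct:5.1f}% cluster")
```

Looking at each hour separately, rather than one average over the whole 15-16 hours, at least shows how far the per-execution time swings from hour to hour.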

I was observing from GV$SESSION and GV$ACTIVE_SESSION_HISTORY, and the
source of the cluster waits seems to be related to KTSJ slave activity.
There was a strong correlation between the “two” (the Java update plus
multiple active KTSJ slaves) working on the same datafile/blocks: a series
of the two doing gc buffer busy release, gc buffer busy acquire and gc
current block busy, with the occasional cell single block physical read.
Blocking session information on some of the gc waits occasionally pointed
at the other (update blocked by KTSJ or vice versa).
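
The kind of correlation described above could be checked offline along these lines. This is only a sketch under stated assumptions: the tuples stand in for ASH samples (module, event, CURRENT_FILE#, CURRENT_BLOCK#), the module names "KTSJ" and "JDBC Thin Client" are assumed labels, and the sample data is made up.

```python
from collections import defaultdict

# Hypothetical ASH-style samples, invented for illustration:
# (sample_id, module, event, current_file#, current_block#)
SAMPLES = [
    (1, "JDBC Thin Client", "gc buffer busy acquire", 12, 4001),
    (1, "KTSJ", "gc current block busy", 12, 4001),
    (2, "KTSJ", "gc buffer busy release", 12, 4002),
    (2, "JDBC Thin Client", "cell single block physical read", 12, 9000),
    (3, "JDBC Thin Client", "gc buffer busy acquire", 12, 4002),
]


def shared_blocks(samples, background_module="KTSJ"):
    """Return {(file#, block#): modules} for blocks sampled under both the
    background module and at least one other module."""
    seen = defaultdict(set)
    for _sample_id, module, _event, file_no, block_no in samples:
        seen[(file_no, block_no)].add(module)
    return {
        block: modules
        for block, modules in seen.items()
        if background_module in modules and len(modules) > 1
    }


if __name__ == "__main__":
    for block, modules in sorted(shared_blocks(SAMPLES).items()):
        print(block, sorted(modules))
```

A block appearing in the result only shows that both sides were sampled on it at some point, not that one was blocking the other. The blocking session columns in ASH would still be needed to confirm that.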

Reading all the responses in the oracle-l thread from May 2020 on KTSJ was
the best source of information I could find:

https://www.freelists.org/post/oracle-l/reads-by-KTSJ,17

And a couple of bug references leading from there, not all of which are
relevant to my version (19.6) but which give indications of what might be
going on:

   - blocks are not marked as free in assm after delete - 12.2 and later
   (Doc ID 30265523.8)
   - performance degradation by w00 processes after applying july 2020
   dbru (Doc ID 32075777.8) superseded by
   - force full repair enabled by fix control and populate repair list
   even if _assm_segment_repair_bg=false (Doc ID 32234161.8)

With mention of the parameter _assm_segment_repair_bg.

Per the explanations in the oracle-l thread, it seems the foreground
session does something which then prompts a background session to check/fix
the ASSM information. But in my case, this fixing is causing significant
contention back on the foreground session.

I ran snapper on some of the KTSJ slaves and, of the ASSM fix related
stats, "ASSM bg: slave fix state" was consistently around 5000 in a 5 second
period. That is not a statistic I have any context to judge the value of.

This is a monthly feed, so it doesn’t happen every day, but when it does it
sits on a critical path. It’s finished now, so there’s not a lot I can look
at if it’s not in ASH. Obviously a next step is to try to reproduce this in
a test environment.

I just wondered whether anyone had done any further investigation into
this behaviour.

Cheers,

Dominic


Follow-Ups:
- RE: KTSJ / Wnnn
  - From: Dominic Brooks

References:
- KTSJ / Wnnn
  - From: Dominic Brooks
