Re: intermittent long "log file sync" waits

From: Cary Millsap <cary.millsap@xxxxxxxxxxxx>
To: Stefan Koehler <contact@xxxxxxxx>
Date: Tue, 28 Jan 2020 14:56:55 -0600

Chris, I recognize your Method R Workbench output there :-).

Recently, Jeff and I saw the same kind of behavior. Client had intermittent
problems with a little program that usually runs in a second or so, but
every once in a while—about six times per day—it would hang for 25+
seconds. Client traced instance-wide for about six hours. We ended up with
several thousand trace files, but with Method R Workbench 9 (beta, but
coming soon), we were able to find what we were looking for in just a few
minutes. We found that each of the 25+ second executions were spending 24+
seconds of "log file sync" that the short-duration programs didn't have.

Happily, we had the trace file for LGWR, and we could line up the long "log
file sync" calls in the user programs *precisely* on the timeline with "log
file parallel write" calls in the LGWR trace, which were taking—you guessed
it—25+ seconds apiece. Weird thing was, some of these super long-latency
writes were processing only 2KB of data apiece. This was on IBM AIX. The
AIX sysadmins couldn't see anything wrong in their aggregations of all
write call latencies (which you wouldn't—it's easy to hide a half a dozen
25.000's in a list of millions of 0.003's). I'm not sure they ever really
tackled the problem of why their hardware writes were taking so long.

One hypothesis that Jeff and I were left with at the end was that perhaps
the application we were trying to diagnose was having a noisy neighbor
problem with some other system on some other VM on the same box.

Cary Millsap
Method R Corporation
Author of *Optimizing Oracle Performance <http://amzn.to/OM0q75>*
and *The Method R Guide to Mastering Oracle Trace Data, 3rd edition
<https://amzn.to/2IhhCG6+-+Millsap+2019.+Mastering+Oracle+Trace+Data+3ed>*

On Tue, Jan 28, 2020 at 2:31 PM Stefan Koehler <contact@xxxxxxxx> wrote:

Hello Chris,
these are the most fun problems - I love to troubleshoot sporadic LGWR
issues :-)

I just read through the whole thread quickly and here are my two cents:

1) Can you profile the extended SQL trace file with an appropriate
profiler (e.g. Method-R Profiler or TVD$XTAT, the latter is free) to get an
exact wait event histogram of the log file sync waits for the slow run?

2) You don't see any ASH data for LGWR process for this particular time
frame, right? Might it be possible that the scalable log writer mode kicked
in and the redo work is done by the worker processes? Just check the LGWR
trace file and you should see which mode (single log writer mode/scalable
log writer mode) was used. If it is the scalable one you have to take a
look at the lg<nn> processes.

3) You see "log file parallel write" for your loop workload which might
not be a surprise as you need to write redo. However if you see such large
latency spikes (basically in the same ball park as the "log file sync" ones
- see point 1) it is very likely that you got an OS config problem (e.g.
Linux kernel AIO slots, scheduler, device queue, etc.) or a storage
problem. You can troubleshoot these "log file parallel write" syscalls
incl. wchan with Tanel Poder's psnapper (
https://github.com/tanelpoder/psnapper) and see where it is stuck/waiting.

Have fun troubleshooting :-)

Best Regards
Stefan Koehler

Independent Oracle performance consultant and researcher
Website: http://www.soocs.de
Twitter: @OracleSK

Chris Stephens <cstephens16@xxxxxxxxx> hat am 28. Januar 2020 um 17:09

geschrieben:

3-node Oracle 19.3 RAC
Centos 7

We have a SQLAlchemy/Python based application workload that is running

the exact same steps with widely varying response times which appear to be
related to varying "log file sync" wait times.

Here is a profile of a "fast" run:

CALL-NAME                        DURATION       %   CALLS      MEAN

  MIN        MAX

------------------------------  ---------  ------  ------  --------

--------  ---------

SQL*Net message from client     53.197782   91.8%  10,092  0.005271

0.000177  28.568493

EXEC                             3.759177    6.5%   9,816  0.000383

0.000000   0.239592

row cache lock                   0.233153    0.4%     541  0.000431

0.000113   0.000941

PARSE                            0.140399    0.2%   4,867  0.000029

0.000000   0.006620

DLM cross inst call completion   0.137330    0.2%     956  0.000144

0.000004   0.000505

library cache lock               0.100171    0.2%     215  0.000466

0.000151   0.002133

library cache pin                0.079729    0.1%     216  0.000369

0.000056   0.000710

FETCH                            0.058253    0.1%   1,062  0.000055

0.000000   0.004148

log file sync                    0.048217    0.1%     149  0.000324

0.000259   0.000505

CLOSE                            0.045416    0.1%   4,929  0.000009

0.000000   0.000073

20 others                        0.135624    0.2%  11,854  0.000011

0.000000   0.001700

------------------------------  ---------  ------  ------  --------

--------  ---------

TOTAL (30)                      57.935251  100.0%  44,697  0.001296

0.000000  28.568493

Here is a profile of a "slow" run:

CALL-NAME                         DURATION       %   CALLS      MEAN

  MIN         MAX

------------------------------  ----------  ------  ------  --------

--------  ----------

SQL*Net message from client     131.186118   61.0%  10,092  0.012999

0.000212  106.789360

log file sync                    79.291166   36.8%     150  0.528608

0.000264    2.986575

EXEC                              3.728402    1.7%   9,816  0.000380

0.000000    0.221403

row cache lock                    0.248868    0.1%     542  0.000459

0.000111    0.001036

PARSE                             0.164267    0.1%   4,867  0.000034

0.000000    0.004652

DLM cross inst call completion    0.146981    0.1%     957  0.000154

0.000005    0.001188

library cache lock                0.104354    0.0%     218  0.000479

0.000160    0.000728

library cache pin                 0.082504    0.0%     202  0.000408

0.000157    0.000672

FETCH                             0.056687    0.0%   1,062  0.000053

0.000000    0.003969

CLOSE                             0.043590    0.0%   4,929  0.000009

0.000000    0.000180

20 others                         0.142044    0.1%  11,866  0.000012

0.000000    0.001792

------------------------------  ----------  ------  ------  --------

--------  ----------

TOTAL (30)                      215.194981  100.0%  44,701  0.004814

0.000000  106.789360

The weird thing is that I don't see corresponding log I/O waits (awaits)

in iostat output.

I have a ticket open w/ oracle but does anyone have any suggestions to

discover root cause and/or solution?
--
//www.freelists.org/webpage/oracle-l

Follow-Ups:
- Re: intermittent long "log file sync" waits
  - From: Chris Stephens
- RE: intermittent long "log file sync" waits
  - From: Noveljic Nenad

References:
- intermittent long "log file sync" waits
  - From: Chris Stephens
- Re: intermittent long "log file sync" waits
  - From: Stefan Koehler

Re: intermittent long "log file sync" waits

Other related posts: