Re: intermittent long "log file sync" waits

From: Frits Hoogland <frits.hoogland@xxxxxxxxx>
To: Tanel Poder <tanel@xxxxxxxxxxxxxx>, Chris Stephens <cstephens16@xxxxxxxxx>
Date: Sun, 2 Feb 2020 13:51:14 +0100

I didn’t look that deep, I tried to get an idea of the situation and the
current state.

This, as others have concluded already, then looks like the bug that seems to
surface every once in a while where sessions that use post-wait for waiting for
the logwriter get “forgotten” when the instance switches to polling.
The switching is visible, or at least was visible when I looked at it in
versions 11.2 and 12.1.0.2, by looking at the logwriter trace file, it mentions
when the logwriter switches between post-wait and polling and back.
However, in these cases, the process would way longer, until the logwriter
decided to switch back to post-wait and found the waiting sessions again.

When this is a bug with the above described issue, of course it needs to be
fixed. However, there is a way around it:
- In post-wait mode, a process waiting in ‘log file sync’ can only end the wait
if two conditions are met: it needs to be posted and the on_disk_scn must be
beyond the process’ commit SCN. Based on what is shown, I assume the logwriter
happily gone on and had the on disk scn updated (it does that always every 3
seconds) probably beyond the process’ commit scn. So the most logical thing the
process is waiting for, is to be posted, which did not happen.
- In polling mode, a process will wait for a pre-computed time, during which it
can not be notified, and then check for the on disk scn. The pre-computed time
in normal circumstances is really low, probably around a millisecond, but
computed based on the latency of your system. The important bit is there is no
need for the logwriter to do anything: the waiting is based on the progress
that the logwriter makes by means of the on disk scn, but it doesn’t require
additional work, like being posted.
- Therefore, if you fix your instances to use polling only, it can not come in
the situation of a post-wait process being forgotten. This can be achieved by
setting “_use_adaptive_log_file_sync” to polling_only. This is an underscore
parameter, and therefore you have to ask blessing from oracle support to use
it, but since you already got a SR open, this should be simple to do.

I can’t think of any downsides of using polling_only. Maybe someone can?
Even the use of it, since it’s switched to dynamically by all supported
versions, is high.

Frits Hoogland

http://fritshoogland.wordpress.com ;<http://fritshoogland.wordpress.com/>
frits.hoogland@xxxxxxxxx <mailto:frits.hoogland@xxxxxxxxx>
Mobile: +31 6 14180860

On 2 Feb 2020, at 04:50, Tanel Poder <tanel@xxxxxxxxxxxxxx> wrote:

But your trace file here already indicates that there's some glitch regarding
LGWR wakeups/posting:

Foreground:

*** 2020-01-29T13:04:20.724666-06:00
WAIT #0: nam='log file sync' ela= 6909711 buffer#=58996 sync scn=976089279
p3=0 obj#=-1 tim=1025228531958

LGWR at the same time:

*** 2020-01-29T13:04:20.725994-06:00
WAIT #0: nam='rdbms ipc message' ela= 2918443 timeout=300 p2=0 p3=0 obj#=-1
tim=1025228533313

The WAIT lines are printed out when the wait ends (and the *** timestamp line
just before the WAIT line is printed out, if enough time has passed since the
last *** line). So this shows that LGWR had happily been idle, waiting for
its idle 3-second timeout to expire, or someone to wake it up (that's what
"rdbms ipc message" means) - while someone had issued a commit and was
waiting for LGWR acknowledgment. So this shows that LGWR didn't even get the
post from your session, nor any other sessions (if you had concurrent
transactional activity). Back in 11g, typically these problems happened when
adaptive log file syncing switch between poll and post/wait wakeup method
happened and you hit a bug. But in 12c there are more moving parts.

--
Tanel Poder
https://blog.tanelpoder.com/seminar ;<https://blog.tanelpoder.com/seminar>
On Sat, Feb 1, 2020 at 10:34 PM Tanel Poder <tanel@xxxxxxxxxxxxxx
<mailto:tanel@xxxxxxxxxxxxxx>> wrote:
Hi Chris,

As 12c+ has log writer slaves and Oracle can adaptavely switch from single
LGWR to LGWR + LGxx slave (_use_single_log_writer=adaptive), can you check
ASH data for the LGxx slaves at this time too. Not all process-to-process
interactions/waits are instrumented with a blocking_session in Oracle...

Should be this command:

@ashtop username,event2,program2,blocking_session,blocking_inst_id "program
like '%(LG%)'" "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29
13:05:00'"

--
Tanel Poder
https://blog.tanelpoder.com/seminar ;<https://blog.tanelpoder.com/seminar>

On Thu, Jan 30, 2020 at 3:44 PM Chris Stephens <cstephens16@xxxxxxxxx
<mailto:cstephens16@xxxxxxxxx>> wrote:
Here's some additional ash output from the trace data posted in last message
including event_hist.sql:

SQL> @ashtop username,event2,program2,blocking_session,blocking_inst_id 1=1
"TIMESTAMP'2020-01-29 13:04:20'" "TIMESTAMP'2020-01-29 13:04:27'"
   TOTALSECONDS    AAS      %This         USERNAME           EVENT2
PROGRAM2    BLOCKING_SESSION    BLOCKING_INST_ID             FIRST_SEEN
       LAST_SEEN    DIST_SQLEXEC_SEEN
_______________ ______ __________ ________________ ________________
_____________ ___________________ ___________________ ______________________
______________________ ____________________
              2    0.3   50% |    SYS              ON CPU           (sqlplus)
                                            2020-01-29 13:04:24    2020-01-29
13:04:25                       1
              1    0.1   25% |    GEN3_MGOWER_3    ON CPU           (python)
                                            2020-01-29 13:04:22    2020-01-29
13:04:22                       1
              1    0.1   25% |    GEN3_MGOWER_3    log file sync    (python)
                   1710                   2 2020-01-29 13:04:20    2020-01-29
13:04:20                       1

SQL> @bg lgwr
   NAME    DESCRIPTION     SID    OPID     SPID               PADDR
    SADDR
_______ ______________ _______ _______ ________ ___________________
___________________
LGWR    Redo etc.         1710      34 26552    00000001E8718860
00000001D8BFF4A0

SQL> @ashtop username,event2,program2,blocking_session,blocking_inst_id
session_id=1710 "TIMESTAMP'2020-01-29 13:04:20'" "TIMESTAMP'2020-01-29
13:04:27'"

no rows selected

SQL> @ashtop username,event2,program2,blocking_session,blocking_inst_id
session_id=1710 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29
13:05:00'"

no rows selected

SQL> @ash/event_hist "log file" 1=1  "TIMESTAMP'2020-01-29 13:04:00'"
"TIMESTAMP'2020-01-29 13:05:00'"
       EVH_EVENT       EVH_MILLISEC    EVH_SAMPLE_COUNT
EVH_EST_EVENT_COUNT    EVH_EST_TOTAL_SEC    PCT_EVT_TIME       EVH_GRAPH
       FIRST_SEEN              LAST_SEEN
________________ __________________ ___________________
______________________ ____________________ _______________ _______________
______________________ ______________________
log file sync             < 1024                      1
1.8                  1.8            12.8 |#         |    2020-01-29 13:04:13
  2020-01-29 13:04:13
log file sync             < 4096                      1
1                  4.1            29.1 |###       |    2020-01-29 13:04:12
2020-01-29 13:04:12
log file sync             < 8192                      1
1                  8.2            58.2 |######    |    2020-01-29 13:04:20
2020-01-29 13:04:20

SQL> @ash/event_hist "write" 1=1  "TIMESTAMP'2020-01-29 13:04:00'"
"TIMESTAMP'2020-01-29 13:05:00'"

no rows selected

SQL> @ash/event_hist "*" 1=1  "TIMESTAMP'2020-01-29 13:04:00'"
"TIMESTAMP'2020-01-29 13:05:00'"
        EVH_EVENT       EVH_MILLISEC    EVH_SAMPLE_COUNT
EVH_EST_EVENT_COUNT    EVH_EST_TOTAL_SEC    PCT_EVT_TIME       EVH_GRAPH
       FIRST_SEEN              LAST_SEEN
_________________ __________________ ___________________
______________________ ____________________ _______________ _______________
______________________ ______________________
log file sync              < 1024                      1
1.8                  1.8            12.8 |#         |    2020-01-29 13:04:13
  2020-01-29 13:04:13
log file sync              < 4096                      1
1                  4.1            29.1 |###       |    2020-01-29 13:04:12
2020-01-29 13:04:12
log file sync              < 8192                      1
1                  8.2            58.2 |######    |    2020-01-29 13:04:20
2020-01-29 13:04:20
row cache lock                < 1                      1
2469.1                  2.5             100 |##########|    2020-01-29
13:04:01    2020-01-29 13:04:01

SQL> @ashtop username,event2,program2,blocking_session,blocking_inst_id
"lower(program) like '%lg%'" "TIMESTAMP'2020-01-29 13:04:00'"
"TIMESTAMP'2020-01-29 13:05:00'"

no rows selected

i'm going to copy over gv$active_session_history to a single instance
database and take a look at ash_wait_chains but i don't expect that to reveal
anything either.

i need to confirm whether adding redo resolved issue (will have developer
fire up workload) but I think next step is to investigate x$trace and
possible process communication issues?

On Thu, Jan 30, 2020 at 10:27 AM Chris Stephens <cstephens16@xxxxxxxxx
<mailto:cstephens16@xxxxxxxxx>> wrote:
Hey Tanel,

Thanks for taking time to write this up! Will definitely be attending that
webinar.

comments inline

On Wed, Jan 29, 2020 at 10:32 PM Tanel Poder <tanel@xxxxxxxxxxxxxx
<mailto:tanel@xxxxxxxxxxxxxx>> wrote:
Comments:

1) First, looks like you're using all the right tools for systematic
troubleshooting ;-)

***i try!

2) As "log file sync" is not directly an I/O wait event (IO happens
indirectly under the hood), but "wait for LGWR to ACK/increment redo
persisting position" then there may be multiple reasons for delays. I'm
repeating some of what others have said here too:
  - LGWR could be waiting for remote ACK (synchronous data guard) - that
should show up as a LNS* wait or something like that though
  - Process slowness due to CPU overload/scheduling latency or swapping (in
both cases, LGWR may do its work fast, but it takes a while for the
foreground process to wake up and end the wait event - you could check if
there are any reports regarding switching LGWR mode in alert.log around that
time)
  - Bug/missed post due to switching between LGWR progress polling vs.
waiting for LGWR post
  - ASM instance communication issues (ASM communication can be on the
critical path when creating/changing files on ASM and even for reading, when
the ASM metadata cache in your local instance doesn't have what it needs)
  - So, the physical redo log file disk write is not necessarily the biggest
contributor to the log file sync wait

*** no data guard in the env.
*** we have excess CPU. (20 cores / instance and usually only 1-5 active
foreground processes but i will confirm from TFA collections)
*** yesterday i was ready to blame bug. today, not so sure. i checked log
file switches with a script google found me a few days ago. now thinking that
was inaccurate as a script from metastink shows tones of switching so that
could very well be related.
***asm communication is something i need to look into. TFA has reported some
ASM issues but there's no details and i never saw anything in asm alert logs
indicating an issue.

3) You'll need to measure what LGWR itself was doing during the problem time
  - Since you're on 19c (12c or later) the LGWR has slaves too and they may
be doing most of the work
  - You'll need to look into both the LGWR process and any other LG* ones too
  - Since you have (Oracle) Snapper running already and writing to trace you
should have the data already
  - The lgwr in the following @snapper ash,stats,trace 10 999999 lgwr syntax
includes LGWR slaves too
  - As LGWR slaves have their own sessions (and SIDs), you should run ashtop
etc and report if there's any BLOCKING_SESSION value listed for LGWR and
slaves
  - If LGWR and the slaves were listed idle during the problem time (while
others waited for log file sync event at the same time) and if there was no
huge OS level CPU/memory shortage at the time, this looks like a background
<-> foreground communication problem.
     - This (and any potential ASM issues) could probably be troubleshooted
via KST tracing, KSR channels and X$TRACE that could help here (some of it is
enabled by default ( Like I showed in a previous hacking session:
https://www.youtube.com/watch?v=mkmvZv58W6w ;
<https://www.youtube.com/watch?v=mkmvZv58W6w> )

here is all the output i have corresponding to a 6 sec "log file sync" wait:

Experience duration:  R = t1 − t0 = 41.336137 seconds
<x-msg://3/#m_-8103031405908768729_m_7542945560652427325_m_-5509691095923674063_profile-by-subroutine>
Experience begin time:        t0 = 2020-01-29T13:04:07.560478
Experience end time:  t1 = 2020-01-29T13:04:48.896615
CALL-NAME                        DURATION       %   CALLS      MEAN       MIN
      MAX
------------------------------  ---------  ------  ------  --------  --------
--------
SQL*Net message from client     26.017105   66.5%   9,886  0.002632  0.000209
7.858051
log file sync                   10.582409   27.1%     150  0.070549  0.000295
6.909711
EXEC                             2.088757    5.3%   9,702  0.000215  0.000000
0.209199
DLM cross inst call completion   0.109617    0.3%     212  0.000517  0.000005
0.080575
row cache lock                   0.070118    0.2%     125  0.000561  0.000208
0.000972
FETCH                            0.047586    0.1%   1,029  0.000046  0.000000
0.003056
PARSE                            0.045293    0.1%   4,778  0.000009  0.000000
0.002692
CLOSE                            0.033453    0.1%   4,784  0.000007  0.000000
0.000237
library cache lock               0.023614    0.1%      55  0.000429  0.000152
0.000901
library cache pin                0.022311    0.1%      55  0.000406  0.000153
0.000689
17 others                        0.071430    0.2%  10,772  0.000007  0.000000
0.001487
------------------------------  ---------  ------  ------  --------  --------
--------
TOTAL (27)                      39.111693  100.0%  41,548  0.000941  0.000000
7.858051

6 sec wait from raw data:

*** 2020-01-29T13:04:20.724666-06:00
WAIT #0: nam='log file sync' ela= 6909711 buffer#=58996 sync scn=976089279
p3=0 obj#=-1 tim=1025228531958
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1
p3=0 obj#=-1 tim=1025228532118
WAIT #0: nam='SQL*Net message from client' ela= 1711 driver id=1413697536
#bytes=1 p3=0 obj#=-1 tim=1025228533852
BINDS #140204587734864:

i enabled 10046 on lgwr:

*** 2020-01-29T13:04:20.725994-06:00
WAIT #0: nam='rdbms ipc message' ela= 2918443 timeout=300 p2=0 p3=0 obj#=-1
tim=1025228533313
WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=17 requests=2
obj#=-1 tim=1025228533777
WAIT #0: nam='log file parallel write' ela= 171 files=1 blocks=3 requests=2
obj#=-1 tim=1025228534104
WAIT #0: nam='LGWR wait for redo copy' ela= 22 copy latch #=28 p2=0 p3=0
obj#=-1 tim=1025228534182
WAIT #0: nam='log file parallel write' ela= 183 files=1 blocks=2 requests=2
obj#=-1 tim=1025228534437
WAIT #0: nam='log file parallel write' ela= 174 files=1 blocks=2 requests=2
obj#=-1 tim=1025228534721
WAIT #0: nam='rdbms ipc message' ela= 31 timeout=8 p2=0 p3=0 obj#=-1
tim=1025228534786
WAIT #0: nam='rdbms ipc message' ela= 3233 timeout=8 p2=0 p3=0 obj#=-1
tim=1025228538038
WAIT #0: nam='log file parallel write' ela= 170 files=1 blocks=2 requests=1
obj#=-1 tim=1025228538297
WAIT #0: nam='rdbms ipc message' ela= 25 timeout=8 p2=0 p3=0 obj#=-1
tim=1025228538358
WAIT #0: nam='rdbms ipc message' ela= 11829 timeout=8 p2=0 p3=0 obj#=-1
tim=1025228550204
WAIT #0: nam='LGWR wait for redo copy' ela= 52 copy latch #=8 p2=0 p3=0
obj#=-1 tim=1025228550294
WAIT #0: nam='log file parallel write' ela= 239 files=1 blocks=7 requests=1
obj#=-1 tim=1025228550624
WAIT #0: nam='rdbms ipc message' ela= 23 timeout=6 p2=0 p3=0 obj#=-1
tim=1025228550699
WAIT #0: nam='rdbms ipc message' ela= 16750 timeout=6 p2=0 p3=0 obj#=-1
tim=1025228567467
WAIT #0: nam='log file parallel write' ela= 207 files=1 blocks=9 requests=1
obj#=-1 tim=1025228567762
WAIT #0: nam='rdbms ipc message' ela= 26 timeout=5 p2=0 p3=0 obj#=-1
tim=1025228567825
WAIT #0: nam='rdbms ipc message' ela= 16249 timeout=5 p2=0 p3=0 obj#=-1
tim=1025228584091
WAIT #0: nam='log file parallel write' ela= 213 files=1 blocks=9 requests=1
obj#=-1 tim=1025228584400
WAIT #0: nam='rdbms ipc message' ela= 48 timeout=3 p2=0 p3=0 obj#=-1
tim=1025228584502
WAIT #0: nam='rdbms ipc message' ela= 16466 timeout=3 p2=0 p3=0 obj#=-1
tim=1025228600993
WAIT #0: nam='log file parallel write' ela= 228 files=1 blocks=9 requests=1
obj#=-1 tim=1025228601393
WAIT #0: nam='rdbms ipc message' ela= 29 timeout=1 p2=0 p3=0 obj#=-1
tim=1025228601482
WAIT #0: nam='rdbms ipc message' ela= 9548 timeout=1 p2=0 p3=0 obj#=-1
tim=1025228611053
WAIT #0: nam='rdbms ipc message' ela= 7467 timeout=300 p2=0 p3=0 obj#=-1
tim=1025228618574
WAIT #0: nam='LGWR wait for redo copy' ela= 13 copy latch #=8 p2=0 p3=0
obj#=-1 tim=1025228618636
WAIT #0: nam='log file parallel write' ela= 222 files=1 blocks=9 requests=1
obj#=-1 tim=1025228618960
WAIT #0: nam='rdbms ipc message' ela= 31 timeout=300 p2=0 p3=0 obj#=-1
tim=1025228619066
WAIT #0: nam='rdbms ipc message' ela= 15946 timeout=300 p2=0 p3=0 obj#=-1
tim=1025228635045
WAIT #0: nam='log file parallel write' ela= 240 files=1 blocks=9 requests=1
obj#=-1 tim=1025228635427
WAIT #0: nam='rdbms ipc message' ela= 33 timeout=298 p2=0 p3=0 obj#=-1
tim=1025228635525
WAIT #0: nam='rdbms ipc message' ela= 15987 timeout=298 p2=0 p3=0 obj#=-1
tim=1025228651537
WAIT #0: nam='log file parallel write' ela= 255 files=1 blocks=9 requests=1
obj#=-1 tim=1025228651912
WAIT #0: nam='rdbms ipc message' ela= 51 timeout=296 p2=0 p3=0 obj#=-1
tim=1025228652034
WAIT #0: nam='rdbms ipc message' ela= 16332 timeout=296 p2=0 p3=0 obj#=-1
tim=1025228668405
WAIT #0: nam='log file parallel write' ela= 256 files=1 blocks=9 requests=1
obj#=-1 tim=1025228668804
WAIT #0: nam='rdbms ipc message' ela= 37 timeout=295 p2=0 p3=0 obj#=-1
tim=1025228668911
WAIT #0: nam='rdbms ipc message' ela= 16969 timeout=295 p2=0 p3=0 obj#=-1
tim=1025228685907
WAIT #0: nam='log file parallel write' ela= 273 files=1 blocks=9 requests=1
obj#=-1 tim=1025228686327
WAIT #0: nam='rdbms ipc message' ela= 38 timeout=293 p2=0 p3=0 obj#=-1
tim=1025228686453
WAIT #0: nam='rdbms ipc message' ela= 6908 timeout=293 p2=0 p3=0 obj#=-1
tim=1025228693390
WAIT #0: nam='log file parallel write' ela= 247 files=1 blocks=2 requests=1
obj#=-1 tim=1025228693768
WAIT #0: nam='rdbms ipc message' ela= 38 timeout=292 p2=0 p3=0 obj#=-1
tim=1025228693862j
...more of the same
264396 WAIT #0: nam='rdbms ipc message' ela= 32 timeout=52 p2=0 p3=0 obj#=-1
tim=1025231090742
264397 WAIT #0: nam='rdbms ipc message' ela= 19460 timeout=52 p2=0 p3=0
obj#=-1 tim=1025231110227
264398 WAIT #0: nam='log file parallel write' ela= 272 files=1 blocks=9
requests=1 obj#=-1 tim=1025231110633
264399 WAIT #0: nam='rdbms ipc message' ela= 37 timeout=50 p2=0 p3=0 obj#=-1
tim=1025231110748
264400 WAIT #0: nam='rdbms ipc message' ela= 15484 timeout=50 p2=0 p3=0
obj#=-1 tim=1025231126261
264401 WAIT #0: nam='log file parallel write' ela= 247 files=1 blocks=4
requests=1 obj#=-1 tim=1025231126641
264402 WAIT #0: nam='rdbms ipc message' ela= 36 timeout=49 p2=0 p3=0 obj#=-1
tim=1025231126753
264403 WAIT #0: nam='rdbms ipc message' ela= 16732 timeout=49 p2=0 p3=0
obj#=-1 tim=1025231143514
264404 WAIT #0: nam='LGWR wait for redo copy' ela= 31 copy latch #=8 p2=0
p3=0 obj#=-1 tim=1025231143603
264405 WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=9
requests=1 obj#=-1 tim=1025231143976
264406 WAIT #0: nam='rdbms ipc message' ela= 35 timeout=47 p2=0 p3=0 obj#=-1
tim=1025231144093
264407 WAIT #0: nam='rdbms ipc message' ela= 5384 timeout=47 p2=0 p3=0
obj#=-1 tim=1025231149515
264408 WAIT #0: nam='log file parallel write' ela= 255 files=1 blocks=4
requests=1 obj#=-1 tim=1025231149889
264409 WAIT #0: nam='rdbms ipc message' ela= 32 timeout=47 p2=0 p3=0 obj#=-1
tim=1025231149989
264410
264411 *** 2020-01-29T13:04:23.802774-06:00
264412 WAIT #0: nam='rdbms ipc message' ela= 460056 timeout=46 p2=0 p3=0
obj#=-1 tim=1025231610086
264413
264414 *** 2020-01-29T13:04:26.797925-06:00
264415 WAIT #0: nam='rdbms ipc message' ela= 2995027 timeout=300 p2=0 p3=0
obj#=-1 tim=1025234605238
264416 WAIT #0: nam='rdbms ipc message' ela= 10649 timeout=1 p2=0 p3=0
obj#=-1 tim=1025234616054
264417

i will get you snapper lgwr output associated w/ this interval as soon as i
can find it (can't remember where i stashed it and lost trace file name ..
fml)

4) Paste the full Snapper output (of one time snapshot from the problem time)
here if you can - we'll be able to see what the slaves were doing
  - You'd need to include not only LGWR SID but also any LG* slave SIDs too
  - Also we'll see better LGWR delay metrics from V$SESSTAT that go beyond
just wait events
  - Thinking of metrics like these - snapper already handles them (if you
look into the rightmost column)
     - redo synch time overhead (usec) - "FG wakeup overhead per log file
sync"
     - redo write time - accounting end to end redo write time (not just the
log file parallel write syscall) "per redo write"

5) Regarding the ASH wait chains error on 19c - I'm aware of this but forgot
to look into it
   - It has (probably) something to do with GV$ views (or PX slaves used for
PDB V$ queries) and possibly the connect by I'm using
   - Does it work in the CDB (if multitenant)?
   - Does it work with lower optimizer features enabled?
   - As we are already manually looking into what LGWR and its slaves are
doing, there's no need for the wait chains script here

***oracle has filed a bug. altering optimizer_features_enabled doesn't help.

6) Linux Process Snapper - pSnapper
   - If you want to see the full "wait" states of Linux processes/threads,
use the -a option.
   - It's somewhat like always showing "idle" threads too
   - But for linux apps it's harder to determine what's idle and what's some
network RPC wait, so I currently just do what Linux "load" metric does
   - I just shows threads in Runnable and Disk/Uninterruptible sleep states
by default (-a shows all states, including Sleep and kernel-Idle threads etc)
   - So the Linux pSnapper "idle" means a somewhat different thing than
Oracle "idle" - say LGWR was waiting for a Data Guard ACK, that should not be
shown as Idle in Oracle, but at the Linux process level it's "just some
network socket wait" and you'd need to use -a option to show all of them

*** will add that option to future collections (and watch the video/read
help)

7) Were LGWR/LG* processes mostly idle during the problem time?
   - As I kept reading the thread, it looks like ASH reported LGWR being
mostly idle during the problem time
   - This is not exactly the same pattern, but I've seen cases in the past
where everyone's waiting for LGWR and LGWR doesn't seem to be busy and
doesn't show any significant waits
   - It turned out that LGWR was actually waiting for CKPT to make progress,
but it wasn't synchronously waiting, but just going to sleep and "checking
again later" in a loop
   - I wrote about this here:
https://blog.tanelpoder.com/posts/log-file-switch-checkpoint-incomplete-and-lgwr-waiting-for-checkpoint-progress/

<https://blog.tanelpoder.com/posts/log-file-switch-checkpoint-incomplete-and-lgwr-waiting-for-checkpoint-progress/>
   - So it may still make sense to check what LGWR/LG* processes are doing
even if they don't seem to be active much (like do you see an occasional
controlfile read wait by LGWR etc)
   - SQL*Trace would capture all activity of LGWR, but may slow things down
further, so you could also sample & spool V$SESSION_WAIT_HISTORY (that keeps
a log of last 10 waits of every session in memory)

*** i think i've addressed most of this.

Incidentally I'm doing another hacking session tomorrow (Thursday 30 Jan @
12pm ET) about complex Oracle performance problems that span multiple layers
and have multiple moving parts. I can add some LGWR troubleshooting magic in
the end of that (here goes my plan to fit the session in 1.5h again!). I'll
post the details in my blog in a few minutes.

*** can't wait!!
--
Tanel Poder
https://blog.tanelpoder.com/seminar ;<https://blog.tanelpoder.com/seminar>

On Tue, Jan 28, 2020 at 2:22 PM Chris Stephens <cstephens16@xxxxxxxxx
<mailto:cstephens16@xxxxxxxxx>> wrote:
doesn't look like any ash data for lgwr:

SQL> @ashtop session_id,blocking_session,event2 1=1 "TIMESTAMP'2020-01-27
15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
   TOTALSECONDS    AAS      %This    SESSION_ID    BLOCKING_SESSION
     EVENT2             FIRST_SEEN              LAST_SEEN    DIST_SQLEXEC_SEEN
_______________ ______ __________ _____________ ___________________
_____________________ ______________________ ______________________
____________________
             81    0.7   82% |             1713                1710 log file
sync         2020-01-27 15:50:54    2020-01-27 15:52:14
1
              5      0    5% |             1221                     ON CPU
            2020-01-27 15:50:59    2020-01-27 15:52:14                       1
              3      0    3% |              246                     ON CPU
            2020-01-27 15:51:00    2020-01-27 15:52:00                       1
              3      0    3% |             1713                     ON CPU
            2020-01-27 15:50:51    2020-01-27 15:52:27                       2
              1      0    1% |              123                     ON CPU
            2020-01-27 15:51:35    2020-01-27 15:51:35                       1
              1      0    1% |              127                     ON CPU
            2020-01-27 15:51:50    2020-01-27 15:51:50                       1
              1      0    1% |              252                2321 latch
free            2020-01-27 15:52:28    2020-01-27 15:52:28
   1
              1      0    1% |              978                     ges
remote message    2020-01-27 15:50:48    2020-01-27 15:50:48
     1
              1      0    1% |              983                     latch
free            2020-01-27 15:52:28    2020-01-27 15:52:28
   1
              1      0    1% |             1713                     library
cache lock    2020-01-27 15:50:48    2020-01-27 15:50:48
1
              1      0    1% |             1831                     ON CPU
            2020-01-27 15:50:49    2020-01-27 15:50:49                       1

11 rows selected.

SQL> @ashtop session_id,blocking_session,event2 session_id=1710
"TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"

no rows selected

On Tue, Jan 28, 2020 at 1:09 PM Noveljic Nenad <nenad.noveljic@xxxxxxxxxxxx
<mailto:nenad.noveljic@xxxxxxxxxxxx>> wrote:
You’re using v0.2, but  v0.6 should be the latest version. 0.2 wouldn’t show
idle blockers.

I looked at the ashtop output once again. Strange that lgwr processes nor any
other bg processes show up significantly there. The question is, what was
lgwr doing while the fg processes were waiting on him.

Follow-Ups:
- Re: intermittent long "log file sync" waits
  - From: Chris Stephens

References:
- Re: intermittent long "log file sync" waits
  - From: Tanel Poder
- Re: intermittent long "log file sync" waits
  - From: Tanel Poder

Re: intermittent long "log file sync" waits

Other related posts: