Here's some additional ashtop output from the trace data posted in the last
message, including event_hist.sql:
SQL> @ashtop username,event2,program2,blocking_session,blocking_inst_id 1=1 "TIMESTAMP'2020-01-29 13:04:20'" "TIMESTAMP'2020-01-29 13:04:27'"
TOTALSECONDS  AAS  %This  USERNAME       EVENT2         PROGRAM2   BLOCKING_SESSION  BLOCKING_INST_ID  FIRST_SEEN           LAST_SEEN            DIST_SQLEXEC_SEEN
------------  ---  -----  -------------  -------------  ---------  ----------------  ----------------  -------------------  -------------------  -----------------
           2  0.3  50% |  SYS            ON CPU         (sqlplus)                                      2020-01-29 13:04:24  2020-01-29 13:04:25                  1
           1  0.1  25% |  GEN3_MGOWER_3  ON CPU         (python)                                       2020-01-29 13:04:22  2020-01-29 13:04:22                  1
           1  0.1  25% |  GEN3_MGOWER_3  log file sync  (python)               1710                 2  2020-01-29 13:04:20  2020-01-29 13:04:20                  1
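As a quick sanity check on the numbers above: AAS here appears to be sampled seconds divided by the interval length, and %This each row's share of all samples in the window. A back-of-the-envelope check (the column semantics are my assumption based on how ashtop normally reports, not confirmed in this thread):

```python
# Interval 13:04:20 -> 13:04:27 is 7 seconds; 4 ASH samples total (2 + 1 + 1).
interval_sec = 7
samples = {"SYS / ON CPU": 2, "GEN3 / ON CPU": 1, "GEN3 / log file sync": 1}
total = sum(samples.values())

for name, s in samples.items():
    aas = round(s / interval_sec, 1)   # average active sessions on that row
    pct = round(100 * s / total)       # share of all samples in the window
    print(f"{name}: AAS={aas} %This={pct}%")
# SYS / ON CPU: AAS=0.3 %This=50%
# GEN3 / ON CPU: AAS=0.1 %This=25%
# GEN3 / log file sync: AAS=0.1 %This=25%
```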
SQL> @bg lgwr
NAME  DESCRIPTION  SID   OPID  SPID   PADDR             SADDR
----  -----------  ----  ----  -----  ----------------  ----------------
LGWR  Redo etc.    1710  34    26552  00000001E8718860  00000001D8BFF4A0
SQL> @ashtop username,event2,program2,blocking_session,blocking_inst_id session_id=1710 "TIMESTAMP'2020-01-29 13:04:20'" "TIMESTAMP'2020-01-29 13:04:27'"
no rows selected
SQL> @ashtop username,event2,program2,blocking_session,blocking_inst_id session_id=1710 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
no rows selected
SQL> @ash/event_hist "log file" 1=1 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
EVH_EVENT      EVH_MILLISEC  EVH_SAMPLE_COUNT  EVH_EST_EVENT_COUNT  EVH_EST_TOTAL_SEC  PCT_EVT_TIME  EVH_GRAPH    FIRST_SEEN           LAST_SEEN
-------------  ------------  ----------------  -------------------  -----------------  ------------  -----------  -------------------  -------------------
log file sync  < 1024                       1                  1.8                1.8          12.8  |#         |  2020-01-29 13:04:13  2020-01-29 13:04:13
log file sync  < 4096                       1                    1                4.1          29.1  |###       |  2020-01-29 13:04:12  2020-01-29 13:04:12
log file sync  < 8192                       1                    1                8.2          58.2  |######    |  2020-01-29 13:04:20  2020-01-29 13:04:20
SQL> @ash/event_hist "write" 1=1 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
no rows selected
SQL> @ash/event_hist "*" 1=1 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
EVH_EVENT       EVH_MILLISEC  EVH_SAMPLE_COUNT  EVH_EST_EVENT_COUNT  EVH_EST_TOTAL_SEC  PCT_EVT_TIME  EVH_GRAPH     FIRST_SEEN           LAST_SEEN
--------------  ------------  ----------------  -------------------  -----------------  ------------  ------------  -------------------  -------------------
log file sync   < 1024                       1                  1.8                1.8          12.8  |#          |  2020-01-29 13:04:13  2020-01-29 13:04:13
log file sync   < 4096                       1                    1                4.1          29.1  |###        |  2020-01-29 13:04:12  2020-01-29 13:04:12
log file sync   < 8192                       1                    1                8.2          58.2  |######     |  2020-01-29 13:04:20  2020-01-29 13:04:20
row cache lock  < 1                          1               2469.1                2.5           100  |##########|   2020-01-29 13:04:01  2020-01-29 13:04:01
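For what it's worth, the PCT_EVT_TIME column in the output above looks like each latency bucket's estimated total seconds as a fraction of that event's total across all buckets (my reading of the numbers, not a documented claim). The 'log file sync' rows check out:

```python
# EVH_EST_TOTAL_SEC per 'log file sync' bucket, taken from the output above
est_total_sec = [1.8, 4.1, 8.2]
total = sum(est_total_sec)  # ~14.1 seconds of estimated log file sync time

pcts = [round(100 * s / total, 1) for s in est_total_sec]
print(pcts)  # [12.8, 29.1, 58.2] -- matches PCT_EVT_TIME
```

The single < 8192 ms sample carries 58% of the estimated time, which is consistent with one multi-second outlier sync dominating the interval.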
SQL> @ashtop username,event2,program2,blocking_session,blocking_inst_id "lower(program) like '%lg%'" "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
no rows selected
i'm going to copy gv$active_session_history over to a single instance
database and take a look at ash_wait_chains but i don't expect that to
reveal anything either.
i need to confirm whether adding redo resolved the issue (will have a
developer fire up the workload) but i think the next step is to investigate
x$trace and possible process communication issues?
On Thu, Jan 30, 2020 at 10:27 AM Chris Stephens <cstephens16@xxxxxxxxx>
wrote:
Hey Tanel,
Thanks for taking time to write this up! Will definitely be attending that
webinar.
comments inline
On Wed, Jan 29, 2020 at 10:32 PM Tanel Poder <tanel@xxxxxxxxxxxxxx> wrote:
Comments:
1) First, looks like you're using all the right tools for systematic
troubleshooting ;-)
***i try!
*** we have excess CPU (20 cores / instance and usually only 1-5 active foreground processes, but i will confirm from TFA collections)
2) As "log file sync" is not directly an I/O wait event (the IO happens
indirectly under the hood) but rather a "wait for LGWR to ACK/increment the
redo persisting position", there may be multiple reasons for delays. I'm
repeating some of what others have said here too:
- LGWR could be waiting for remote ACK (synchronous data guard) - that
should show up as a LNS* wait or something like that though
- Process slowness due to CPU overload/scheduling latency or swapping
(in both cases, LGWR may do its work fast, but it takes a while for the
foreground process to wake up and end the wait event - you could check if
there are any reports regarding switching LGWR mode in alert.log around
that time)
- Bug/missed post due to switching between LGWR progress polling vs.
waiting for LGWR post
- ASM instance communication issues (ASM communication can be on the
critical path when creating/changing files on ASM and even for reading,
when the ASM metadata cache in your local instance doesn't have what it
needs)
- So, the physical redo log file disk write is not necessarily the
biggest contributor to the log file sync wait
*** no data guard in the env.
*** yesterday i was ready to blame a bug. today, not so sure. i checked log
file switches with a script google found me a few days ago. now thinking
that was inaccurate, as a script from metastink shows tons of switching, so
that could very well be related.
***asm communication is something i need to look into. TFA has reported
some ASM issues but there are no details and i never saw anything in the
asm alert logs indicating an issue.
3) You'll need to measure what LGWR itself was doing during the problem
time
- Since you're on 19c (12c or later) the LGWR has slaves too and they
may be doing most of the work
- You'll need to look into both the LGWR process and any other LG* ones
too
- Since you have (Oracle) Snapper running already and writing to trace
you should have the data already
- The *lgwr* in the following @snapper ash,stats,trace 10 999999 lgwr
syntax includes LGWR slaves too
- As LGWR slaves have their own sessions (and SIDs), you should run
ashtop etc and report if there's any BLOCKING_SESSION value listed for LGWR
and slaves
- If LGWR and the slaves were listed *idle* during the problem time
(while others waited for log file sync event at the same time) *and if *there
was no huge OS level CPU/memory shortage at the time, this looks like a
background <-> foreground communication problem.
   - This (and any potential ASM issues) could probably be
   troubleshot via KST tracing, KSR channels and X$TRACE (some of it is
   enabled by default), like I showed in a previous hacking session:
   https://www.youtube.com/watch?v=mkmvZv58W6w ;-)
*** here is all the output i have corresponding to a 6 sec "log file sync" wait:
Experience duration: *R* = *t*1 − *t*0 = 41.336137 seconds
Experience begin time: *t*0 = 2020-01-29T13:04:07.560478
Experience end time: *t*1 = 2020-01-29T13:04:48.896615

CALL-NAME                       DURATION       %   CALLS      MEAN       MIN       MAX
------------------------------  ---------  ------  ------  --------  --------  --------
SQL*Net message from client     26.017105   66.5%   9,886  0.002632  0.000209  7.858051
log file sync                   10.582409   27.1%     150  0.070549  0.000295  6.909711
EXEC                             2.088757    5.3%   9,702  0.000215  0.000000  0.209199
DLM cross inst call completion   0.109617    0.3%     212  0.000517  0.000005  0.080575
row cache lock                   0.070118    0.2%     125  0.000561  0.000208  0.000972
FETCH                            0.047586    0.1%   1,029  0.000046  0.000000  0.003056
PARSE                            0.045293    0.1%   4,778  0.000009  0.000000  0.002692
CLOSE                            0.033453    0.1%   4,784  0.000007  0.000000  0.000237
library cache lock               0.023614    0.1%      55  0.000429  0.000152  0.000901
library cache pin                0.022311    0.1%      55  0.000406  0.000153  0.000689
17 others                        0.071430    0.2%  10,772  0.000007  0.000000  0.001487
------------------------------  ---------  ------  ------  --------  --------  --------
TOTAL (27)                      39.111693  100.0%  41,548  0.000941  0.000000  7.858051
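Two quick checks on the profile above (simple arithmetic on the printed columns, nothing more): MEAN is DURATION/CALLS, and a single outlier dominates the log file sync total.

```python
# 'log file sync' row from the profile above
duration, calls, worst = 10.582409, 150, 6.909711

print(round(duration / calls, 6))        # 0.070549 -> matches the MEAN column
print(f"{worst / duration:.0%}")         # 65% -> one 6.9s sync is ~2/3 of all sync time
```

So the 70 ms "average" log file sync is misleading here; 149 of the 150 syncs were fast and one 6.9-second outlier carries most of the time.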
6 sec wait from raw data:
*** 2020-01-29T13:04:20.724666-06:00
WAIT #0: nam='log file sync' ela= 6909711 buffer#=58996 sync scn=976089279 p3=0 obj#=-1 tim=1025228531958
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1025228532118
WAIT #0: nam='SQL*Net message from client' ela= 1711 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1025228533852
BINDS #140204587734864:
i enabled 10046 on lgwr:
*** 2020-01-29T13:04:20.725994-06:00
WAIT #0: nam='rdbms ipc message' ela= 2918443 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228533313
WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=17 requests=2 obj#=-1 tim=1025228533777
WAIT #0: nam='log file parallel write' ela= 171 files=1 blocks=3 requests=2 obj#=-1 tim=1025228534104
WAIT #0: nam='LGWR wait for redo copy' ela= 22 copy latch #=28 p2=0 p3=0 obj#=-1 tim=1025228534182
WAIT #0: nam='log file parallel write' ela= 183 files=1 blocks=2 requests=2 obj#=-1 tim=1025228534437
WAIT #0: nam='log file parallel write' ela= 174 files=1 blocks=2 requests=2 obj#=-1 tim=1025228534721
WAIT #0: nam='rdbms ipc message' ela= 31 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228534786
WAIT #0: nam='rdbms ipc message' ela= 3233 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228538038
WAIT #0: nam='log file parallel write' ela= 170 files=1 blocks=2 requests=1 obj#=-1 tim=1025228538297
WAIT #0: nam='rdbms ipc message' ela= 25 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228538358
WAIT #0: nam='rdbms ipc message' ela= 11829 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228550204
WAIT #0: nam='LGWR wait for redo copy' ela= 52 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025228550294
WAIT #0: nam='log file parallel write' ela= 239 files=1 blocks=7 requests=1 obj#=-1 tim=1025228550624
WAIT #0: nam='rdbms ipc message' ela= 23 timeout=6 p2=0 p3=0 obj#=-1 tim=1025228550699
WAIT #0: nam='rdbms ipc message' ela= 16750 timeout=6 p2=0 p3=0 obj#=-1 tim=1025228567467
WAIT #0: nam='log file parallel write' ela= 207 files=1 blocks=9 requests=1 obj#=-1 tim=1025228567762
WAIT #0: nam='rdbms ipc message' ela= 26 timeout=5 p2=0 p3=0 obj#=-1 tim=1025228567825
WAIT #0: nam='rdbms ipc message' ela= 16249 timeout=5 p2=0 p3=0 obj#=-1 tim=1025228584091
WAIT #0: nam='log file parallel write' ela= 213 files=1 blocks=9 requests=1 obj#=-1 tim=1025228584400
WAIT #0: nam='rdbms ipc message' ela= 48 timeout=3 p2=0 p3=0 obj#=-1 tim=1025228584502
WAIT #0: nam='rdbms ipc message' ela= 16466 timeout=3 p2=0 p3=0 obj#=-1 tim=1025228600993
WAIT #0: nam='log file parallel write' ela= 228 files=1 blocks=9 requests=1 obj#=-1 tim=1025228601393
WAIT #0: nam='rdbms ipc message' ela= 29 timeout=1 p2=0 p3=0 obj#=-1 tim=1025228601482
WAIT #0: nam='rdbms ipc message' ela= 9548 timeout=1 p2=0 p3=0 obj#=-1 tim=1025228611053
WAIT #0: nam='rdbms ipc message' ela= 7467 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228618574
WAIT #0: nam='LGWR wait for redo copy' ela= 13 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025228618636
WAIT #0: nam='log file parallel write' ela= 222 files=1 blocks=9 requests=1 obj#=-1 tim=1025228618960
WAIT #0: nam='rdbms ipc message' ela= 31 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228619066
WAIT #0: nam='rdbms ipc message' ela= 15946 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228635045
WAIT #0: nam='log file parallel write' ela= 240 files=1 blocks=9 requests=1 obj#=-1 tim=1025228635427
WAIT #0: nam='rdbms ipc message' ela= 33 timeout=298 p2=0 p3=0 obj#=-1 tim=1025228635525
WAIT #0: nam='rdbms ipc message' ela= 15987 timeout=298 p2=0 p3=0 obj#=-1 tim=1025228651537
WAIT #0: nam='log file parallel write' ela= 255 files=1 blocks=9 requests=1 obj#=-1 tim=1025228651912
WAIT #0: nam='rdbms ipc message' ela= 51 timeout=296 p2=0 p3=0 obj#=-1 tim=1025228652034
WAIT #0: nam='rdbms ipc message' ela= 16332 timeout=296 p2=0 p3=0 obj#=-1 tim=1025228668405
WAIT #0: nam='log file parallel write' ela= 256 files=1 blocks=9 requests=1 obj#=-1 tim=1025228668804
WAIT #0: nam='rdbms ipc message' ela= 37 timeout=295 p2=0 p3=0 obj#=-1 tim=1025228668911
WAIT #0: nam='rdbms ipc message' ela= 16969 timeout=295 p2=0 p3=0 obj#=-1 tim=1025228685907
WAIT #0: nam='log file parallel write' ela= 273 files=1 blocks=9 requests=1 obj#=-1 tim=1025228686327
WAIT #0: nam='rdbms ipc message' ela= 38 timeout=293 p2=0 p3=0 obj#=-1 tim=1025228686453
WAIT #0: nam='rdbms ipc message' ela= 6908 timeout=293 p2=0 p3=0 obj#=-1 tim=1025228693390
WAIT #0: nam='log file parallel write' ela= 247 files=1 blocks=2 requests=1 obj#=-1 tim=1025228693768
WAIT #0: nam='rdbms ipc message' ela= 38 timeout=292 p2=0 p3=0 obj#=-1 tim=1025228693862
...more of the same
264396 WAIT #0: nam='rdbms ipc message' ela= 32 timeout=52 p2=0 p3=0 obj#=-1 tim=1025231090742
264397 WAIT #0: nam='rdbms ipc message' ela= 19460 timeout=52 p2=0 p3=0 obj#=-1 tim=1025231110227
264398 WAIT #0: nam='log file parallel write' ela= 272 files=1 blocks=9 requests=1 obj#=-1 tim=1025231110633
264399 WAIT #0: nam='rdbms ipc message' ela= 37 timeout=50 p2=0 p3=0 obj#=-1 tim=1025231110748
264400 WAIT #0: nam='rdbms ipc message' ela= 15484 timeout=50 p2=0 p3=0 obj#=-1 tim=1025231126261
264401 WAIT #0: nam='log file parallel write' ela= 247 files=1 blocks=4 requests=1 obj#=-1 tim=1025231126641
264402 WAIT #0: nam='rdbms ipc message' ela= 36 timeout=49 p2=0 p3=0 obj#=-1 tim=1025231126753
264403 WAIT #0: nam='rdbms ipc message' ela= 16732 timeout=49 p2=0 p3=0 obj#=-1 tim=1025231143514
264404 WAIT #0: nam='LGWR wait for redo copy' ela= 31 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025231143603
264405 WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=9 requests=1 obj#=-1 tim=1025231143976
264406 WAIT #0: nam='rdbms ipc message' ela= 35 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231144093
264407 WAIT #0: nam='rdbms ipc message' ela= 5384 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231149515
264408 WAIT #0: nam='log file parallel write' ela= 255 files=1 blocks=4 requests=1 obj#=-1 tim=1025231149889
264409 WAIT #0: nam='rdbms ipc message' ela= 32 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231149989
264410
264411 *** 2020-01-29T13:04:23.802774-06:00
264412 WAIT #0: nam='rdbms ipc message' ela= 460056 timeout=46 p2=0 p3=0 obj#=-1 tim=1025231610086
264413
264414 *** 2020-01-29T13:04:26.797925-06:00
264415 WAIT #0: nam='rdbms ipc message' ela= 2995027 timeout=300 p2=0 p3=0 obj#=-1 tim=1025234605238
264416 WAIT #0: nam='rdbms ipc message' ela= 10649 timeout=1 p2=0 p3=0 obj#=-1 tim=1025234616054
264417
264417
i will get you the snapper lgwr output associated w/ this interval as soon
as i can find it (can't remember where i stashed it and lost the trace file
name .. fml)
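In the meantime, the raw WAIT lines above can be aggregated with a few lines of script to see where LGWR's time went during the window. A quick sketch (`sum_waits` is a hypothetical helper of mine, assuming the standard 10046 `WAIT #n: nam='...' ela= N ... tim=N` line format):

```python
import re
from collections import defaultdict

# Matches the event name and elapsed microseconds in a 10046 WAIT line
wait_re = re.compile(r"WAIT #\d+: nam='([^']+)' ela= (\d+)")

def sum_waits(trace_lines):
    """Sum elapsed microseconds per wait event name."""
    totals = defaultdict(int)
    for line in trace_lines:
        m = wait_re.search(line)
        if m:
            totals[m.group(1)] += int(m.group(2))
    return dict(totals)

sample = [
    "WAIT #0: nam='rdbms ipc message' ela= 2918443 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228533313",
    "WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=17 requests=2 obj#=-1 tim=1025228533777",
    "WAIT #0: nam='log file parallel write' ela= 171 files=1 blocks=3 requests=2 obj#=-1 tim=1025228534104",
]
print(sum_waits(sample))
# {'rdbms ipc message': 2918443, 'log file parallel write': 450}
```

Run over the whole LGWR trace for the window, this would make it obvious that LGWR spent almost all of its time in 'rdbms ipc message' (idle) with only sub-millisecond 'log file parallel write' calls, which supports the fg/bg communication theory rather than slow redo writes.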
4) Paste the full *Snapper* output (of one time snapshot from the problem
time) here if you can - we'll be able to see what the slaves were doing.
- You'd need to include not only LGWR SID but also any LG* slave SIDs
too
- Also we'll see better LGWR delay metrics from V$SESSTAT that go
beyond just wait events
- Thinking of metrics like these - snapper already handles them (if you
look into the rightmost column)
- *redo synch time overhead (usec)* - "FG wakeup overhead per log
file sync"
- *redo write time* - accounting end to end redo write time (not
just the log file parallel write syscall) "per redo write"
5) Regarding the *ASH wait chains* error on 19c - I'm aware of this but
forgot to look into it
- It has (probably) something to do with GV$ views (or PX slaves used
for PDB V$ queries) and possibly the connect by I'm using
- Does it work in the CDB (if multitenant)?
- Does it work with lower optimizer features enabled?
- As we are already manually looking into what LGWR and its slaves are
doing, there's no need for the wait chains script here
***oracle has filed a bug. altering optimizer_features_enabled doesn't help.
6) Linux Process Snapper - pSnapper
- If you want to see the full "wait" states of Linux
processes/threads, use the* -a *option.
- It's somewhat like always showing "idle" threads too
- But for linux apps it's harder to determine what's idle and what's
some network RPC wait, so I currently just do what Linux "load" metric does
   - It just shows threads in Runnable and Disk/Uninterruptible sleep
   states by default (-a shows all states, including Sleep and kernel-Idle
   threads etc)
- So the Linux pSnapper "idle" means a somewhat different thing than
Oracle "idle" - say LGWR was waiting for a Data Guard ACK, that should not
be shown as Idle in Oracle, but at the Linux process level it's "just some
network socket wait" and you'd need to use *-a* option to show all of
them
*** will add that option to future collections (and watch the video/read
help)
7) Were LGWR/LG* processes mostly idle during the problem time?
- As I kept reading the thread, it looks like ASH reported LGWR being
mostly idle during the problem time
- This is not exactly the same pattern, but I've seen cases in the
past where everyone's waiting for LGWR and LGWR doesn't seem to be busy and
doesn't show any significant waits
- It turned out that LGWR was actually waiting for CKPT to make
progress, but it wasn't synchronously waiting, but just going to sleep and
"checking again later" in a loop
- I wrote about this here:
https://blog.tanelpoder.com/posts/log-file-switch-checkpoint-incomplete-and-lgwr-waiting-for-checkpoint-progress/
- So it may still make sense to check what LGWR/LG* processes are
doing even if they don't seem to be active much (like do you see an
occasional controlfile read wait by LGWR etc)
- SQL*Trace would capture all activity of LGWR, but may slow things
down further, so you could also sample & spool V$SESSION_WAIT_HISTORY (that
keeps a log of last 10 waits of every session in memory)
*** i think i've addressed most of this.
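A low-impact way to act on the V$SESSION_WAIT_HISTORY suggestion above is a small polling loop that spools the LGWR/LG* ring buffers every second. A sketch, assuming python-oracledb (the connection setup, the `'%LG%'` program pattern, and the `poll` helper are my placeholders, not from the thread):

```python
import time

# In real use: import oracledb; conn = oracledb.connect(..., mode=oracledb.SYSDBA)
SQL = """
select sid, seq#, event, wait_time, p1, p2, p3
from   v$session_wait_history
where  sid in (select sid from v$session where program like '%LG%')
order  by sid, seq#
"""

def poll(conn, interval_sec=1, samples=10):
    """Sample the last-10-waits ring buffer for LGWR/LG* sessions and print it."""
    cur = conn.cursor()
    for _ in range(samples):
        cur.execute(SQL)
        for row in cur.fetchall():
            print(row)
        time.sleep(interval_sec)
```

Unlike 10046 tracing this adds essentially no overhead to LGWR itself, at the cost of only seeing the last 10 waits per sample.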
Incidentally I'm doing another hacking session tomorrow (Thursday 30 Jan
@ 12pm ET) about complex Oracle performance problems that span multiple
layers and have multiple moving parts. I can add some LGWR troubleshooting
magic in the end of that (here goes my plan to fit the session in 1.5h
again!). I'll post the details in my blog in a few minutes.
*** can't wait!!
--
Tanel Poder
https://blog.tanelpoder.com/seminar
On Tue, Jan 28, 2020 at 2:22 PM Chris Stephens <cstephens16@xxxxxxxxx>
wrote:
doesn't look like any ash data for lgwr:
SQL> @ashtop session_id,blocking_session,event2 1=1 "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
TOTALSECONDS  AAS  %This  SESSION_ID  BLOCKING_SESSION  EVENT2              FIRST_SEEN           LAST_SEEN            DIST_SQLEXEC_SEEN
------------  ---  -----  ----------  ----------------  ------------------  -------------------  -------------------  -----------------
          81  0.7  82% |        1713              1710  log file sync       2020-01-27 15:50:54  2020-01-27 15:52:14                  1
           5    0   5% |        1221                    ON CPU              2020-01-27 15:50:59  2020-01-27 15:52:14                  1
           3    0   3% |         246                    ON CPU              2020-01-27 15:51:00  2020-01-27 15:52:00                  1
           3    0   3% |        1713                    ON CPU              2020-01-27 15:50:51  2020-01-27 15:52:27                  2
           1    0   1% |         123                    ON CPU              2020-01-27 15:51:35  2020-01-27 15:51:35                  1
           1    0   1% |         127                    ON CPU              2020-01-27 15:51:50  2020-01-27 15:51:50                  1
           1    0   1% |         252              2321  latch free          2020-01-27 15:52:28  2020-01-27 15:52:28                  1
           1    0   1% |         978                    ges remote message  2020-01-27 15:50:48  2020-01-27 15:50:48                  1
           1    0   1% |         983                    latch free          2020-01-27 15:52:28  2020-01-27 15:52:28                  1
           1    0   1% |        1713                    library cache lock  2020-01-27 15:50:48  2020-01-27 15:50:48                  1
           1    0   1% |        1831                    ON CPU              2020-01-27 15:50:49  2020-01-27 15:50:49                  1

11 rows selected.
SQL> @ashtop session_id,blocking_session,event2 session_id=1710 "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
no rows selected
On Tue, Jan 28, 2020 at 1:09 PM Noveljic Nenad <
nenad.noveljic@xxxxxxxxxxxx> wrote:
You’re using v0.2, but v0.6 should be the latest version. 0.2 wouldn’t
show idle blockers.
I looked at the ashtop output once again. Strange that neither lgwr nor any
other bg processes show up significantly there. The question is, what was
lgwr doing while the fg processes were waiting on him.