Here is some correlated ASH output showing "log file sync", with psnapper
output for lgwr from the same time window:
SQL> @ashtop event2,program 1=1 sysdate-1/60/24 sysdate
TOTALSECONDS  AAS  %This  EVENT2                    PROGRAM                                   FIRST_SEEN           LAST_SEEN            DIST_SQLEXEC_SEEN
------------  ---  -----  ------------------------  ----------------------------------------  -------------------  -------------------  -----------------
          10  0.2  26% |  ON CPU                    sqlplus@lsst-oradb05 (TNS V1-V3)          2020-01-29 10:42:19  2020-01-29 10:43:09                  5
           4  0.1  11% |  ON CPU                    sqlplus@lsst-verify-worker28 (TNS V1-V3)  2020-01-29 10:43:04  2020-01-29 10:43:07                  3
           4  0.1  11% |  log file sync             python@lsst-verify-worker28 (TNS V1-V3)   2020-01-29 10:42:10  2020-01-29 10:42:13                  1
           3  0.1   8% |  ON CPU                    sqlplus@lsst-verify-worker45 (TNS V1-V3)  2020-01-29 10:42:15  2020-01-29 10:42:20                  2
           2    0   5% |  ON CPU                    oracle@lsst-oradb05 (PSP0)                2020-01-29 10:42:33  2020-01-29 10:42:34                  1
           2    0   5% |  ON CPU                    python@lsst-verify-worker28 (TNS V1-V3)   2020-01-29 10:42:32  2020-01-29 10:43:09                  2
           2    0   5% |  ON CPU                    python@lsst-verify-worker45 (TNS V1-V3)   2020-01-29 10:42:22  2020-01-29 10:42:23                  2
           2    0   5% |  row cache lock            sqlplus@lsst-verify-worker45 (TNS V1-V3)  2020-01-29 10:42:16  2020-01-29 10:42:18                  2
           1    0   3% |  IPC send completion sync  oracle@lsst-oradb04 (PPA7)                2020-01-29 10:42:20  2020-01-29 10:42:20                  1
           1    0   3% |  ON CPU                    oracle@lsst-oradb05 (LMD1)                2020-01-29 10:43:07  2020-01-29 10:43:07                  1
           1    0   3% |  ON CPU                    oracle@lsst-oradb05 (PPA7)                2020-01-29 10:42:46  2020-01-29 10:42:46                  1
           1    0   3% |  ON CPU                    oracle@lsst-oradb06 (LCK1)                2020-01-29 10:42:16  2020-01-29 10:42:16                  1
           1    0   3% |  ON CPU                    oracle@lsst-oradb06 (LMD0)                2020-01-29 10:42:16  2020-01-29 10:42:16                  1
           1    0   3% |  ON CPU                    oracle@lsst-oradb06 (LMON)                2020-01-29 10:42:31  2020-01-29 10:42:31                  1
           1    0   3% |  ON CPU                    oracle@lsst-oradb06 (M001)                2020-01-29 10:43:06  2020-01-29 10:43:06                  1
15 rows selected.
Wed Jan 29 10:42:12 CST 2020
Linux Process Snapper v0.14 by Tanel Poder [https://tp.dev/psnapper]
Sampling /proc/syscall, wchan, cmdline, stat for 5 seconds... finished.
=== Active Threads ========================================================

 samples | avg_threads | cmdline           | state            | syscall    | wchan
------------------------------------------------------------------------------------------
       4 |        0.04 | ora_lgwr_lsst2db2 | Running (ON CPU) | semtimedop | SYSC_semtimedop

samples: 100 (expected: 100)
total processes: 1, threads: 1
runtime: 5.00, measure time: 0.50
Wed Jan 29 10:42:17 CST 2020
Linux Process Snapper v0.14 by Tanel Poder [https://tp.dev/psnapper]
Sampling /proc/syscall, wchan, cmdline, stat for 5 seconds... finished.
=== Active Threads ========================================================

 samples | avg_threads | cmdline           | state            | syscall    | wchan
------------------------------------------------------------------------------------------
       1 |        0.01 | ora_lgwr_lsst2db2 | Running (ON CPU) | semtimedop | SYSC_semtimedop
       1 |        0.01 | ora_lgwr_lsst2db2 | Running (ON CPU) | semtimedop | read_events
Neither of the log writer slave processes showed any activity via psnapper.
On Wed, Jan 29, 2020 at 10:05 AM Chris Stephens <cstephens16@xxxxxxxxx>
wrote:
ugh. it's still occurring. i don't think it is related to log file switches:
SQL> @log_switches
Date                 INST_ID  Day  Total  h0  h1  h2  h3  h4  h5  h6  h7  h8-h23    Avg
-------------------  -------  ---  -----  --  --  --  --  --  --  --  --  ------  -----
2020-01-29 00:00:00        1  Wed     10   1   1   3   1   1   1   1   1   all 0   0.42
2020-01-29 00:00:00        2  Wed     13   2   1   3   1   2   2   1   1   all 0   0.54
2020-01-29 00:00:00        3  Wed     14   2   1   4   1   2   1   1   2   all 0   0.58
SQL> @ashtop event2 1=1 sysdate-5/60/24 sysdate
TOTALSECONDS  AAS  %This  EVENT2                          FIRST_SEEN           LAST_SEEN            DIST_SQLEXEC_SEEN
------------  ---  -----  ------------------------------  -------------------  -------------------  -----------------
         123  0.4  52% |  ON CPU                          2020-01-29 09:04:01  2020-01-29 09:08:51                 48
          96  0.3  41% |  log file sync                   2020-01-29 09:06:04  2020-01-29 09:08:59                  1
           3    0   1% |  DLM cross inst call completion  2020-01-29 09:04:59  2020-01-29 09:07:23                  3
           3    0   1% |  PX Deq: Slave Session Stats     2020-01-29 09:05:30  2020-01-29 09:06:58                  1
           2    0   1% |  ASM file metadata operation     2020-01-29 09:05:18  2020-01-29 09:06:03                  2
           2    0   1% |  library cache lock              2020-01-29 09:04:54  2020-01-29 09:07:24                  2
           1    0   0% |  Disk file operations I/O        2020-01-29 09:05:09  2020-01-29 09:05:09                  1
           1    0   0% |  KSV master wait                 2020-01-29 09:05:18  2020-01-29 09:05:18                  1
           1    0   0% |  PX Deq: Join ACK                2020-01-29 09:08:01  2020-01-29 09:08:01                  1
           1    0   0% |  PX Deq: reap credit             2020-01-29 09:06:58  2020-01-29 09:06:58                  1
           1    0   0% |  control file parallel write     2020-01-29 09:05:06  2020-01-29 09:05:06                  1
           1    0   0% |  db file async I/O submit        2020-01-29 09:05:37  2020-01-29 09:05:37                  1
           1    0   0% |  enq: PS - contention [mode=6]   2020-01-29 09:05:30  2020-01-29 09:05:30                  1
           1    0   0% |  library cache pin               2020-01-29 09:04:56  2020-01-29 09:04:56                  1
14 rows selected.
SQL> @ashtop event2,program 1=1 sysdate-5/60/24 sysdate
TOTALSECONDS  AAS  %This  EVENT2                          PROGRAM                                   FIRST_SEEN           LAST_SEEN            DIST_SQLEXEC_SEEN
------------  ---  -----  ------------------------------  ----------------------------------------  -------------------  -------------------  -----------------
          85  0.3  34% |  log file sync                   python@lsst-verify-worker28 (TNS V1-V3)   2020-01-29 09:07:36  2020-01-29 09:09:24                  1
          56  0.2  22% |  ON CPU                          sqlplus@lsst-oradb05 (TNS V1-V3)          2020-01-29 09:05:15  2020-01-29 09:09:54                 24
          20  0.1   8% |  log file sync                   sqlplus@lsst-verify-worker28 (TNS V1-V3)  2020-01-29 09:06:04  2020-01-29 09:10:02                  1
          15  0.1   6% |  ON CPU                          python@lsst-verify-worker28 (TNS V1-V3)   2020-01-29 09:06:22  2020-01-29 09:09:21                 13
          11    0   4% |  ON CPU                          sqlplus@lsst-verify-worker28 (TNS V1-V3)  2020-01-29 09:06:01  2020-01-29 09:10:04                  8
           5    0   2% |  ON CPU                          oracle@lsst-oradb05 (DIA0)                2020-01-29 09:07:57  2020-01-29 09:10:02                  1
           5    0   2% |  ON CPU                          oracle@lsst-oradb06 (CJQ0)                2020-01-29 09:09:01  2020-01-29 09:09:05                  1
           5    0   2% |  ON CPU                          oracle@lsst-oradb06 (PPA7)                2020-01-29 09:06:46  2020-01-29 09:09:54                  1
           4    0   2% |  ON CPU                          oracle@lsst-oradb04 (CJQ0)                2020-01-29 09:05:05  2020-01-29 09:05:08                  1
           4    0   2% |  ON CPU                          oracle@lsst-oradb05 (LMON)                2020-01-29 09:06:37  2020-01-29 09:09:38                  1
           4    0   2% |  ON CPU                          oracle@lsst-oradb06 (LMON)                2020-01-29 09:05:06  2020-01-29 09:09:06                  1
           3    0   1% |  PX Deq: Slave Session Stats     oracle@lsst-oradb05 (PPA7)                2020-01-29 09:05:30  2020-01-29 09:06:58                  1
           2    0   1% |  DLM cross inst call completion  sqlplus@lsst-verify-worker28 (TNS V1-V3)  2020-01-29 09:06:02  2020-01-29 09:07:23                  2
           2    0   1% |  ON CPU                          oracle@lsst-oradb04 (PSP0)                2020-01-29 09:06:17  2020-01-29 09:06:18                  1
           2    0   1% |  ON CPU                          oracle@lsst-oradb05 (MMON)                2020-01-29 09:05:17  2020-01-29 09:05:47                  2
15 rows selected.
SQL> @ashtop session_id,session_serial#,event2,program 1=1 sysdate-5/60/24 sysdate
TOTALSECONDS  AAS  %This  SESSION_ID  SESSION_SERIAL#  EVENT2         PROGRAM                                   FIRST_SEEN           LAST_SEEN            DIST_SQLEXEC_SEEN
------------  ---  -----  ----------  ---------------  -------------  ----------------------------------------  -------------------  -------------------  -----------------
          85  0.3  36% |        2078            23219  log file sync  python@lsst-verify-worker28 (TNS V1-V3)   2020-01-29 09:07:36  2020-01-29 09:09:24                  1
          56  0.2  24% |        2085            56305  ON CPU         sqlplus@lsst-oradb05 (TNS V1-V3)          2020-01-29 09:06:44  2020-01-29 09:11:34                 25
           6    0   3% |        2078             5162  log file sync  sqlplus@lsst-verify-worker28 (TNS V1-V3)  2020-01-29 09:09:57  2020-01-29 09:10:02                  1
           6    0   3% |        2078            24706  ON CPU         python@lsst-verify-worker28 (TNS V1-V3)   2020-01-29 09:10:12  2020-01-29 09:10:39                  2
           5    0   2% |           4            53890  ON CPU         oracle@lsst-oradb06 (CJQ0)                2020-01-29 09:09:01  2020-01-29 09:09:05                  1
           5    0   2% |         123            45272  ON CPU         oracle@lsst-oradb05 (DIA0)                2020-01-29 09:07:57  2020-01-29 09:10:02                  1
           5    0   2% |        2078            59858  ON CPU         python@lsst-verify-worker28 (TNS V1-V3)   2020-01-29 09:06:43  2020-01-29 09:06:52                  5
           4    0   2% |         246             4525  ON CPU         oracle@lsst-oradb05 (LMON)                2020-01-29 09:07:37  2020-01-29 09:11:38                  1
           4    0   2% |        2078             5162  ON CPU         sqlplus@lsst-verify-worker28 (TNS V1-V3)  2020-01-29 09:09:56  2020-01-29 09:10:05                  3
           4    0   2% |        2078            23219  ON CPU         python@lsst-verify-worker28 (TNS V1-V3)   2020-01-29 09:09:10  2020-01-29 09:09:21                  4
           3    0   1% |         246             5232  ON CPU         oracle@lsst-oradb06 (LMON)                2020-01-29 09:09:06  2020-01-29 09:11:06                  1
           3    0   1% |        2078            20119  ON CPU         sqlplus@lsst-verify-worker28 (TNS V1-V3)  2020-01-29 09:07:25  2020-01-29 09:07:27                  2
           2    0   1% |         123            25437  ON CPU         oracle@lsst-oradb06 (DIA0)                2020-01-29 09:08:20  2020-01-29 09:11:22                  1
           2    0   1% |         368            63897  ON CPU         oracle@lsst-oradb06 (LMD0)                2020-01-29 09:07:29  2020-01-29 09:09:10                  1
           2    0   1% |         489            23602  ON CPU         oracle@lsst-oradb06 (PSP0)                2020-01-29 09:07:03  2020-01-29 09:09:08                  1
15 rows selected.
"@snapper ash,stats,trace 10 999999 lgwr" shows no active session
(v$session sampling) output during FIRST_SEEN - LAST_SEEN interval above.
i'm going to start looping psnapper on all LG* processes and hope that
turns something up.
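A sketch of that monitoring loop; the ./psn location, its -d (sampling
duration) flag, and the process-name pattern are assumptions:

```shell
# One sampling pass over the lg* background processes seen in the ps
# output earlier in the thread (ora_lgwr / ora_lg00 / ora_lg01).
pat='ora_lg(wr|[0-9]+)_lsst2db2'
snap_lg() {
    for pid in $(pgrep -f "$pat"); do
        # timestamp each sample so it can be lined up with ASH output
        { date; ./psn -p "$pid" -d 5; } >> "/tmp/psn.$pid"
    done
}
snap_lg   # wrap in: while :; do snap_lg; sleep 5; done  to keep looping
```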
still *very* open to any other suggestions. this is driving me crazy.
On Wed, Jan 29, 2020 at 9:00 AM Noveljic Nenad <
nenad.noveljic@xxxxxxxxxxxx> wrote:
Yes, it was strange to see rman backups in all your reports. We should
have paid more attention to it. But there were also periods when the backup
was running without impacting log file sync.

> possibly locks on CF?

Locks on CF are typically held during log switches. You can correlate log
switch times with the log file sync problem to answer this. I think
Jonathan has already suggested that.

> not sure why that wouldn't show up on ASH though.

That’s a big question – I’ve never seen that before. pSnapper would
provide the information on what the lg* processes were doing.
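One way to pull log switch timestamps for that correlation is to grep the
alert log; the message text is standard, but the path is instance-specific,
so this sketch greps an embedded sample rather than a real alert log:

```shell
# Log switches appear in the alert log as "advanced to log sequence",
# with the timestamp on the preceding line; -B1 prints both, so the
# times can be compared against the FIRST_SEEN/LAST_SEEN window of the
# log file sync waits. A real path would look like
# $ORACLE_BASE/diag/rdbms/<db>/<sid>/trace/alert_<sid>.log (assumption).
cat > /tmp/alert_sample.log <<'EOF'
Wed Jan 29 09:06:00 2020
Thread 2 advanced to log sequence 1234 (LGWR switch)
EOF
grep -B1 'advanced to log sequence' /tmp/alert_sample.log
```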
*From:* oracle-l-bounce@xxxxxxxxxxxxx <oracle-l-bounce@xxxxxxxxxxxxx> *On
Behalf Of *Chris Stephens
*Sent:* Wednesday, 29 January 2020 15:49
*To:* Stefan Koehler <contact@xxxxxxxx>
*Cc:* oracle-l <Oracle-L@xxxxxxxxxxxxx>
*Subject:* Re: intermittent long "log file sync" waits
holy shnickees psnapper is awesome. i discovered quite a few stuck
backup-related processes w/ pstack prior to pulling down and using
psnapper. after killing those sessions, the remaining rman processes showed
progress w/ psnapper output like:
psnapper]# ./psn -p 874 -g comm,state,syscall
Linux Process Snapper v0.14 by Tanel Poder [https://tp.dev/psnapper]
Sampling /proc/syscall, stat for 5 seconds... finished.
=== Active Threads ===========================================================

 samples | avg_threads | comm            | state                  | syscall
------------------------------------------------------------------------------
      43 |        0.43 | (oracle_*_lsst) | Disk (Uninterruptible) | pread64
      31 |        0.31 | (oracle_*_lsst) | Disk (Uninterruptible) | lstat
       7 |        0.07 | (oracle_*_lsst) | Running (ON CPU)       | pread64
       6 |        0.06 | (oracle_*_lsst) | Running (ON CPU)       | fstatfs
       3 |        0.03 | (oracle_*_lsst) | Disk (Uninterruptible) | [running]
       3 |        0.03 | (oracle_*_lsst) | Disk (Uninterruptible) | open
       3 |        0.03 | (oracle_*_lsst) | Running (ON CPU)       | [running]
       1 |        0.01 | (oracle_*_lsst) | Disk (Uninterruptible) | fstatfs
       1 |        0.01 | (oracle_*_lsst) | Running (ON CPU)       | open
it turns out we (i) copied backup scripts and cron entries over to
another cluster i built a few weeks ago. i thought i had successfully
replaced all references to the database experiencing issues, but that was
not the case. i killed all processes related to backup scripts kicked off
from the new RAC connecting to the problem RAC and now ASH output is
looking much better.
i will continue to monitor for unreasonable "log file sync" waits but i
suspect this was the cause of the issue. possibly locks on CF? not sure why
that wouldn't show up on ASH though.
On Wed, Jan 29, 2020 at 2:13 AM Stefan Koehler <contact@xxxxxxxx> wrote:
Hello Chris,
as I already mentioned - please use pSnapper
(https://github.com/tanelpoder/psnapper) by Tanel Poder.

pSnapper samples the /proc file system and does not suspend the process the
way pstack does (pstack is just a wrapper around gdb, which uses ptrace()
under the hood). Suspending a process can screw it up :)

In addition, pSnapper provides more useful information (e.g. the process
state, the C function running in kernel space for a specific syscall, etc.).
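The non-intrusive part is just reads from /proc; a minimal sketch of the
idea (not pSnapper's actual code):

```shell
# Read a process's state and kernel wait channel without stopping it.
# Field 3 of /proc/<pid>/stat is the state (R, S, D, ...); this simple
# awk parse assumes the comm name contains no spaces, which holds for
# ora_* processes. Defaults to sampling the current shell.
pid=${1:-$$}
state=$(awk '{print $3}' "/proc/$pid/stat")
wchan=$(cat "/proc/$pid/wchan" 2>/dev/null)
echo "pid=$pid state=$state wchan=${wchan:-[running]}"
```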
Best Regards
Stefan Koehler
Independent Oracle performance consultant and researcher
Website: http://www.soocs.de
Twitter: @OracleSK
Chris Stephens <cstephens16@xxxxxxxxx> wrote on 28 January 2020 at 23:49:

was just getting ready to sign off and noticed the archivelog backup
scheduled to run every hour seems to be stuck and has been for at least 10
mins:

[oracle@lsst-oradb04 ~]$ ps -ef | grep oracle_backup.sh
oracle   13369 31167  0 16:43 pts/0    00:00:00 grep --color=auto oracle_backup.sh
oracle   14649 14645  0 16:00 ?        00:00:00 /bin/bash /home/oracle/scripts/rman/oracle_backup.sh -d lsst2db -s archivelog.rman -c lsst2db_rman -e cs2018@xxxxxxxxxxxxxxxxx -r iddsdba_rman
[oracle@lsst-oradb04 ~]$ pstack 15649
Process 15649 not found.
[oracle@lsst-oradb04 ~]$ pstack 14649
#0 0x00007f9a6537e41c in waitpid () from /lib64/libc.so.6
#1 0x0000000000440b84 in waitchld.isra.10 ()
#2 0x0000000000441e3c in wait_for ()
#3 0x0000000000433b0e in execute_command_internal ()
#4 0x0000000000433d2e in execute_command ()
#5 0x000000000041e365 in reader_loop ()
#6 0x000000000041c9ce in main ()
[oracle@lsst-oradb04 ~]$ pstack 14649
#0 0x00007f9a6537e41c in waitpid () from /lib64/libc.so.6
#1 0x0000000000440b84 in waitchld.isra.10 ()
#2 0x0000000000441e3c in wait_for ()
#3 0x0000000000433b0e in execute_command_internal ()
#4 0x0000000000433d2e in execute_command ()
#5 0x000000000041e365 in reader_loop ()
#6 0x000000000041c9ce in main ()
[oracle@lsst-oradb04 ~]$ pstack 14649
#0 0x00007f9a6537e41c in waitpid () from /lib64/libc.so.6
#1 0x0000000000440b84 in waitchld.isra.10 ()
#2 0x0000000000441e3c in wait_for ()
#3 0x0000000000433b0e in execute_command_internal ()
#4 0x0000000000433d2e in execute_command ()
#5 0x000000000041e365 in reader_loop ()
#6 0x000000000041c9ce in main ()
[oracle@lsst-oradb04 ~]$ pstack 14649
#0 0x00007f9a6537e41c in waitpid () from /lib64/libc.so.6
#1 0x0000000000440b84 in waitchld.isra.10 ()
#2 0x0000000000441e3c in wait_for ()
#3 0x0000000000433b0e in execute_command_internal ()
#4 0x0000000000433d2e in execute_command ()
#5 0x000000000041e365 in reader_loop ()
#6 0x000000000041c9ce in main ()
[oracle@lsst-oradb04 ~]$
On Tue, Jan 28, 2020 at 4:30 PM Chris Stephens <cstephens16@xxxxxxxxx>
wrote:

ok. i've had the following running since lunchtime and will let it
run through the night in hopes that issue occurs again:

@snapper ash,stats,trace 10 999999 lgwr

I will run the following script as well:

#!/bin/bash
while true
do
    echo `date` >> /tmp/pstack.26552
    pstack 26552 >> /tmp/pstack.26552
    echo `date` >> /tmp/pstack.26556
    pstack 26556 >> /tmp/pstack.26556
    echo `date` >> /tmp/pstack.26560
    pstack 26560 >> /tmp/pstack.26560
    sleep 5
done

based off:

[oracle@lsst-oradb05 bin]$ ps -ef | egrep "lg.*lsst2db2"
oracle   26552     1  0 Jan17 ?        00:06:14 ora_lgwr_lsst2db2
oracle   26556     1  0 Jan17 ?        00:00:13 ora_lg00_lsst2db2
oracle   26560     1  0 Jan17 ?        00:00:12 ora_lg01_lsst2db2

All sessions for this workload are connecting to a service w/ tracing
enabled, so we'll have trace data as well.

I will (hopefully) have updates in the morning.