RE: LGWR, EMC or app cursors?

From: "Herring, David" <dmarc-noreply@xxxxxxxxxxxxx> (Redacted sender "HerringD" for DMARC)
To: "oracle-l@xxxxxxxxxxxxx" <oracle-l@xxxxxxxxxxxxx>
Date: Fri, 11 Oct 2019 14:19:19 +0000

(ongoing saga)

Yesterday I threw together a simple script that mimics what OSW does for "ps"
but instead only does so for LGWR, loops 5 seconds at a time and includes
"wchan=WIDE-WCHAN-COLUMN" so that I can get a bit more detail on what LGWR is
potentially waiting on.

During the duration of the most recent issue output shows that LGWR is waiting
on "asm_maybe_wait_io".  This shows up periodically for LGWR wchan values but
never consecutively, until we hit the issue and then it shows up until the
problem goes away.  We also had enabled tracing on LGWR and the file shows a
pause in entries for 30 seconds, to me meaning it was completely stuck for the
duration.

Anyone know any details about routine "asm_maybe_wait_io"?  I only got a couple
hits on it and it's possibly part of the ASM driver.  I've validated we have
the correct rpm for RHEL7.5.  At this point I'm whole-heartedly trying to force
the storage team to perform a detailed investigation but am open to any
insights others have.  Thx.

Regards,

Dave

From: Herring, David
Sent: Thursday, October 10, 2019 12:09 PM
To: dmarc-noreply@xxxxxxxxxxxxx; oracle-l@xxxxxxxxxxxxx
Cc: Tiwari, Yogesh <Yogesh.Tiwari@xxxxxxxxxxxxxx>
Subject: RE: LGWR, EMC or app cursors?

The storage folks are assuring me that there are no issues at that level and
that any shared resources are NOT seeing the same thing.  I have to take their
word for it.

As for oswatcher details, yes, it's running and I've reviewed "ps" data looking
for the LGWR process.  For WCHAN around one of the times we had a problem, LGWR
lists "SYSC_s" and "asm_ma".  I'll see if I can track down what that means.
OSW runs every 30 sec so obviously there's a lot of detail missing when the
problem happens.

Regards,

Dave
From: oracle-l-bounce@xxxxxxxxxxxxx <oracle-l-bounce@xxxxxxxxxxxxx> On Behalf
Of Tiwari, Yogesh
Sent: Wednesday, October 9, 2019 11:13 PM
To: oracle-l@xxxxxxxxxxxxx
Cc: Tiwari, Yogesh <Yogesh.Tiwari@xxxxxxxxxxxxxx>
Subject: RE: LGWR, EMC or app cursors?

CAUTION: This email originated from outside of D&B. Please do not click links
or open attachments unless you recognize the sender and know the content is
safe.

Dave,

If you are sharing SAN with other databases hosts, probably they should also
complain at the same time. Is this true for you?

Further, you might want to capture WCHAN for LGWR(s) processes, when it goes
into this hang. This can give clues, if it is really SAN, or OS issue.
Tools like oswatcher(TFA) do capture complete “ps” detail, which can help in
this situation. Do see that as an option?

Thanks,
Yogi
Disclaimer: The information transmitted is intended for the person or entity to
which it is addressed and may contain confidential, privileged or copyrighted
material or attorney work product. If you receive this in error, please contact
the sender and delete the material from any system. Any unauthorised copying,
disclosure or distribution of the material in this e-mail is strictly
forbidden. Any comments or statements made are not necessarily those of
Fidelity. All e-mails may be monitored or recorded.

From: oracle-l-bounce@xxxxxxxxxxxxx <oracle-l-bounce@xxxxxxxxxxxxx> On Behalf
Of Herring, David
Sent: 10 October 2019 08:11
To: martin.a.berger@xxxxxxxxx; dmarc-noreply@xxxxxxxxxxxxx
Cc: oracle-l@xxxxxxxxxxxxx
Subject: RE: LGWR, EMC or app cursors?

First, thanks to a number of you who replied.  The problem still exists but at
least I know a bit more about it.  I read through Frits' post and it's quite
interesting and informative, but didn't help.  One potentially serious problem
in debugging this is "strace" or any type of system trace is not available on
these servers, my guess is that the security team felt having access to that is
a "no-no".  Of course I seriously doubt it'd be realistic to strace LGWR and
let it run for hrs., waiting for the problem to occur (potentially large
performance impact, let alone a giant tracefile).

Unfortunately I'm back to trying to figure out details on exactly what LGWR is
doing during it's "log file parallel write".  Per Andy's suggestion I validated
that the column SEQ# in ASH doesn't change during the duration of the problem
for LGWR, so it's one huge wait.  In fact seconds before the one example that
I’m trying to tear apart I see LGWR waiting on the same event but it's a
different SEQ# so it got some work done, then just spun for nearly 30 seconds
while all other DML sat and waited on "log file sync".  LGWR finally gets it's
work done, everything back to normal.

I'm going to go back to the full issue bridge list (we have calls on this daily
with SMEs covering all areas) and see if I can get a 100% confirmation that
they've validated all components inbetween LGWR and the physical disk.

Regards,

Dave

From: oracle-l-bounce@xxxxxxxxxxxxx<mailto:oracle-l-bounce@xxxxxxxxxxxxx>
<oracle-l-bounce@xxxxxxxxxxxxx<mailto:oracle-l-bounce@xxxxxxxxxxxxx>> On Behalf
Of Martin Berger
Sent: Tuesday, October 8, 2019 2:34 AM
To: dmarc-noreply@xxxxxxxxxxxxx<mailto:dmarc-noreply@xxxxxxxxxxxxx>
Cc: oracle-l@xxxxxxxxxxxxx<mailto:oracle-l@xxxxxxxxxxxxx>
Subject: Re: LGWR, EMC or app cursors?

CAUTION: This email originated from outside of D&B. Please do not click links
or open attachments unless you recognize the sender and know the content is
safe.

Hi Dave,

as you asked for tracing, a "normal" 10046 trace can be enabled for
logwriter<https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Furldefense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttps-3A__nam03.safelinks.protection.outlook.com_-3Furl-3Dhttps-253A-252F-252Ffritshoogland.files.wordpress.com-252F2014-252F04-252Fprofiling-2Dthe-2Dlogwriter-2Dand-2Ddatabase-2Dwriter.pdf-26data-3D02-257C01-257Cherringd-2540dnb.com-257Cf5bf752ea83c4c4c36fc08d74bc21067-257C19e2b708bf12437597198dec42771b3e-257C0-257C1-257C637061169179075851-26sdata-3DaTuydwpa7-252FrZwDsMemlNoGO-252BVCBOdYqys5T6rU6AG6o-253D-26reserved-3D0%26d%3DDwMGaQ%26c%3DSsZxQMfaWJ1sSVfloc5FVGba8BA_qR4Jzdt8ol2oSPA%26r%3Dz73EKtGMkOyHMZwSjVYW896tQVgTZQlAucPsWFx6Th0%26m%3DIcJTtnwLA4QbBqsz3xiIZBRtMymLghSGvDucAHCe4Qg%26s%3Dvwjtbe9hdSGr0oJwFK3ZT-2E2ohv6TVwmu-Wvs_Hbjw%26e%3D&data=02%7C01%7Cherringd%40dnb.com%7Cf4c9ab584d9f44527c2d08d74d38485f%7C19e2b708bf12437597198dec42771b3e%7C0%7C0%7C637062776467906384&sdata=1L9oicNn%2BP229Ps3V%2FDZT%2F1zOK2m783pIdfhRM0Vf8Q%3D&reserved=0>.
You will not get SQL statements, but normal trace information regarding WAITs.

The event log file parallel write is somehow tricky. Frits wrote a nice blog
post<https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Furldefense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttps-3A__nam03.safelinks.protection.outlook.com_-3Furl-3Dhttps-253A-252F-252Ffritshoogland.wordpress.com-252F2013-252F08-252F30-252Foracle-2Dio-2Don-2Dlinux-2Dlog-2Dwriter-2Dio-2Dand-2Dwait-2Devents-252F-26data-3D02-257C01-257Cherringd-2540dnb.com-257Cf5bf752ea83c4c4c36fc08d74bc21067-257C19e2b708bf12437597198dec42771b3e-257C0-257C1-257C637061169179075851-26sdata-3D9yHLvzg3kvHMZdsz0FhRs4-252B3sRVKyFAtlF64P3n-252F8-252Bo-253D-26reserved-3D0%26d%3DDwMGaQ%26c%3DSsZxQMfaWJ1sSVfloc5FVGba8BA_qR4Jzdt8ol2oSPA%26r%3Dz73EKtGMkOyHMZwSjVYW896tQVgTZQlAucPsWFx6Th0%26m%3DIcJTtnwLA4QbBqsz3xiIZBRtMymLghSGvDucAHCe4Qg%26s%3DhjOIzHWSJ3UVrSt-2pULB3JLSMAdSr3BgQXE2HQYmDA%26e%3D&data=02%7C01%7Cherringd%40dnb.com%7Cf4c9ab584d9f44527c2d08d74d38485f%7C19e2b708bf12437597198dec42771b3e%7C0%7C0%7C637062776467916380&sdata=cGdFGtNYL%2F0jz7BuPIyFG%2Br9TFgngrUhVYcJws3KypI%3D&reserved=0>
about it.
It's important to understand that it represents multiple IOs (that's the
parallel).

"EMC and sysadmins have confirmed there are no disk errors and from their
standpoint the disks are waiting on Oracle."

I assume you have a (or two) FiberChannel SAN which connects EMS and your
DB-host.  Please ask them for measurements on those switches also.
The argument is simple: If the host claims it waits on the disks (according to
iostat) and EMC claims it's waiting on Oracle, have a closer look at the
components in between.

hth,
Martin

Am Mo., 7. Okt. 2019 um 17:20 Uhr schrieb Herring, David
<dmarc-noreply@xxxxxxxxxxxxx<mailto:dmarc-noreply@xxxxxxxxxxxxx>>:
Folks, I've got a bit of a mystery with a particular db where we're getting a
periodic 25-30 pause between user sessions and LGWR processes and can't clearly
identify what's the cause.

  *   The database is 11.2.0.4, RHEL 7.5, running ASM on EMC.
  *   Sometimes once a day, sometimes more (never more than 5) times a day we
see user processes start waiting on "log file sync".  LGWR is waiting on "log
file parallel write".
  *   At the same time one of the emcpower* devices shows 100% busy and service
time 200+ (from iostat via osw).  mpstat shows 1 CPU at 100% on iowait.  It's
not always the same disk (emcpowere1, a1, h1, …), not always the same CPU.  EMC
and sysadmins have confirmed there are no disk errors and from their standpoint
the disks are waiting on Oracle.
  *   During this time LGWR stats in ASH are all 0 - TIME_WAITED, DELTA*
columns.  Only after the problem goes away (about 25 secs) these columns are
populated again, obviously the DELTA* columns 1 row later.  LGWR's session
state is WAITING so I assume the column value observations are due to LGWR
waiting, as it won't write stats until it can do something.

I am stuck trying to find out, really prove who is the culprit or what exactly
the wait is on.  Is LGWR waiting on user sessions and user sessions are waiting
on LGWR and all that causes the disk to be 100%?  Can I enable some sort of
tracing on LGWR and would that point to exactly what he's waiting on to prove
where the problem is?

Regards,

Dave

References:
- LGWR, EMC or app cursors?
  - From: Herring, David
- Re: LGWR, EMC or app cursors?
  - From: Martin Berger
- RE: LGWR, EMC or app cursors?
  - From: Herring, David
- RE: LGWR, EMC or app cursors?
  - From: Tiwari, Yogesh
- RE: LGWR, EMC or app cursors?
  - From: Herring, David

RE: LGWR, EMC or app cursors?

Other related posts: