First, thanks to a number of you who replied. The problem still exists but at
least I know a bit more about it. I read through Frits' post and it's quite
interesting and informative, but didn't help. One potentially serious problem
in debugging this is that "strace", or any other type of system trace, is not
available on these servers; my guess is that the security team felt having
access to that is a "no-no". Of course, I seriously doubt it'd be realistic to
strace LGWR and let it run for hours waiting for the problem to occur
(potentially large performance impact, let alone a giant trace file).
Unfortunately I'm back to trying to figure out details on exactly what LGWR is
doing during its "log file parallel write". Per Andy's suggestion I validated
that the column SEQ# in ASH doesn't change for LGWR for the duration of the
problem, so it's one huge wait. In fact, seconds before the one example that
I’m trying to tear apart I see LGWR waiting on the same event but with a
different SEQ#, so it got some work done, then just spun for nearly 30 seconds
while all other DML sat and waited on "log file sync". LGWR finally gets its
work done, and everything is back to normal.
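For anyone wanting to repeat the SEQ# check, here is a minimal sketch in Python. It assumes the ASH rows for LGWR have already been extracted as (sample_time, event, seq#) tuples ordered by sample time. The function name and the sample data are made up for illustration and are not taken from the real system. Consecutive samples sharing the same event and SEQ# are treated as one continuous wait.

```python
from datetime import datetime, timedelta
from itertools import groupby


def longest_single_wait(samples):
    """Return (event, seq, first_sample, last_sample, n_samples) for the
    longest run of consecutive samples that share the same event and SEQ#.

    samples: iterable of (sample_time, event, seq) ordered by sample_time.
    """
    best = None
    for (event, seq), run in groupby(samples, key=lambda s: (s[1], s[2])):
        run = list(run)
        if best is None or len(run) > best[4]:
            best = (event, seq, run[0][0], run[-1][0], len(run))
    return best


# Hypothetical data: two short waits around one long one (one sample per second).
t0 = datetime(2000, 1, 1, 12, 0, 0)
ev = "log file parallel write"
samples = [(t0 + timedelta(seconds=i), ev, 101) for i in range(2)]
samples += [(t0 + timedelta(seconds=2 + i), ev, 102) for i in range(28)]
samples += [(t0 + timedelta(seconds=30), ev, 103)]

event, seq, first, last, n = longest_single_wait(samples)
span = (last - first).total_seconds()
print(f"{event} SEQ#={seq}: {n} samples spanning {span:.0f}s")
```

Since ASH only samples once per second, the span between first and last sample is a lower bound on the real wait time, not an exact figure.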
I'm going to go back to the full issue bridge list (we have calls on this daily
with SMEs covering all areas) and see if I can get a 100% confirmation that
they've validated all components between LGWR and the physical disk.
Regards,
Dave
Dave Herring
DBA
103 JFK Parkway
Short Hills, New Jersey 07078
Mobile 630.441.4404
dnb.com<http://www.dnb.com/>
From: oracle-l-bounce@xxxxxxxxxxxxx <oracle-l-bounce@xxxxxxxxxxxxx> On Behalf
Of Martin Berger
Sent: Tuesday, October 8, 2019 2:34 AM
To: dmarc-noreply@xxxxxxxxxxxxx
Cc: oracle-l@xxxxxxxxxxxxx
Subject: Re: LGWR, EMC or app cursors?
Hi Dave,
as you asked for tracing, a "normal" 10046 trace can be enabled for
logwriter<https://fritshoogland.files.wordpress.com/2014/04/profiling-the-logwriter-and-database-writer.pdf>.
You will not get SQL statements, but normal trace information regarding WAITs.
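As a rough illustration of what can be done with those WAIT lines, here is a small Python sketch that summarizes them per event. The sample lines are invented for the example and only mimic the usual 10046 layout (nam=, ela= in microseconds, then the event parameters); the function name is hypothetical as well. For "log file parallel write" it also adds up the requests= parameter, which is the number of IO requests covered by a single wait.

```python
import re
from collections import defaultdict

# Matches lines such as:
# WAIT #0: nam='log file parallel write' ela= 283 files=1 blocks=2 requests=1 ...
WAIT_RE = re.compile(r"WAIT #\d+: nam='(?P<nam>[^']+)' ela= *(?P<ela>\d+)(?P<rest>.*)")
REQUESTS_RE = re.compile(r"\brequests=(\d+)")


def summarize_waits(lines):
    """Aggregate count, total and max elapsed time (microseconds) per wait event."""
    stats = defaultdict(lambda: {"count": 0, "total_us": 0, "max_us": 0, "requests": 0})
    for line in lines:
        m = WAIT_RE.match(line.strip())
        if not m:
            continue  # not a WAIT line
        s = stats[m.group("nam")]
        ela = int(m.group("ela"))
        s["count"] += 1
        s["total_us"] += ela
        s["max_us"] = max(s["max_us"], ela)
        req = REQUESTS_RE.search(m.group("rest"))
        if req:
            s["requests"] += int(req.group(1))
    return dict(stats)


# Invented trace excerpt, for illustration only.
trace = [
    "WAIT #0: nam='log file parallel write' ela= 283 files=1 blocks=2 requests=1 obj#=-1 tim=1000001",
    "WAIT #0: nam='rdbms ipc message' ela= 2999000 timeout=300 p2=0 p3=0 obj#=-1 tim=4000002",
    "WAIT #0: nam='log file parallel write' ela= 29800000 files=2 blocks=40 requests=2 obj#=-1 tim=34000003",
]

stats = summarize_waits(trace)
for name, s in stats.items():
    print(f"{name}: n={s['count']} max={s['max_us'] / 1e6:.3f}s requests={s['requests']}")
```

Sorting the result by max_us would quickly show whether a single multi-second wait stands out from the normal sub-millisecond ones.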
The event "log file parallel write" is somewhat tricky. Frits wrote a nice blog
post<https://fritshoogland.wordpress.com/2013/08/30/oracle-io-on-linux-log-writer-io-and-wait-events/>
about it.
It's important to understand that it represents multiple IOs (that's the
parallel).
"EMC and sysadmins have confirmed there are no disk errors and from their
standpoint the disks are waiting on Oracle."
I assume you have a (or two) FibreChannel SAN which connects EMC and your