Chris, yeah, seems that way. I had a problem a while back where operations
would time out at seemingly random times, and it turned out $LD_LIBRARY_PATH
included a NAS mount that no longer existed. "strace" showed that valid paths
were searched for a while, then the invalid NAS would get hit and there'd be a
pause.
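For anyone wanting to check for the same kind of stale entry, here's a rough sketch (the helper name is mine, not a standard tool) that walks a colon-separated search path and flags entries that no longer resolve. Note that a truly stale NFS/NAS mount may make the existence check itself hang, which is a symptom in its own right:

```python
import os

def check_search_path(path_var="LD_LIBRARY_PATH"):
    """Return {entry: exists} for each entry of a colon-separated
    search path variable. A hang here (rather than a clean False)
    often means a stale network mount."""
    results = {}
    for entry in filter(None, os.environ.get(path_var, "").split(":")):
        results[entry] = os.path.isdir(entry)
    return results
```

Running it under `strace -f -e trace=stat,openat` would also show exactly which path entry stalls, much as described above.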
Anyway, the EMC config is all multipath. I'll check with storage/sysadmin
teams (pretty much everything is outsourced) to review it all again.
103 JFK Parkway
Short Hills, New Jersey 07078
From: oracle-l-bounce@xxxxxxxxxxxxx <oracle-l-bounce@xxxxxxxxxxxxx> On Behalf
Of Chris Taylor
Sent: Monday, October 7, 2019 10:27 AM
Subject: Re: LGWR, EMC or app cursors?
Always the same database/machine?
Almost sounds like a path is down/unavailable from the machine to the storage
but the OS doesn't realize it isn't responding. I'm not as familiar with EMC
PowerPath, and I think it uses something besides the native multipath drivers
(but I might be mistaken).
If you're not getting any path errors, it might be worthwhile to have someone
go into the cage and replace all the fiber cables connecting this server to
the storage (I believe they can test the cables and see if one is faulty).
On Mon, Oct 7, 2019 at 10:20 AM Herring, David
Folks, I've got a bit of a mystery with a particular db where we're getting a
periodic 25-30 second pause affecting user sessions and the LGWR process, and
I can't clearly identify the cause.
* The database is 220.127.116.11, RHEL 7.5, running ASM on EMC.
* Sometimes once a day, sometimes more often (never more than 5 times a
day), we see user processes start waiting on "log file sync" while LGWR waits
on "log file parallel write".
* At the same time, one of the emcpower* devices shows 100% busy with
service times of 200+ ms (from iostat, captured by OSWatcher), and mpstat
shows 1 CPU at 100% iowait. It's not always the same disk (emcpowere1, a1,
h1, …) and not always the same CPU. EMC and the sysadmins have confirmed there
are no disk errors; from their standpoint the disks are waiting on Oracle.
* During this time LGWR's stats in ASH are all 0 - TIME_WAITED and the
DELTA* columns. Only after the problem clears (about 25 secs) are these
columns populated again, the DELTA* columns obviously 1 row later. LGWR's
session state is WAITING, so I assume the zeroed columns are because LGWR is
still waiting - it won't record stats until it can do something.
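To correlate the pauses with a specific device, I've found it handy to scan the archived iostat output mechanically rather than by eye. A quick sketch (the function and thresholds are my own, not part of OSWatcher) that flags devices pegged near 100% busy or with extreme service times in `iostat -x` output:

```python
def find_busy_devices(iostat_text, util_threshold=99.0, svctm_threshold=200.0):
    """Scan `iostat -x` output and return (device, svctm, %util) tuples
    for devices exceeding either threshold. Assumes svctm and %util are
    the last two columns, as in sysstat's extended device report."""
    busy = []
    for line in iostat_text.splitlines():
        fields = line.split()
        # Skip blanks, headers, and avg-cpu value rows (which start with a digit).
        if len(fields) < 3 or not fields[0][0].isalpha():
            continue
        try:
            util = float(fields[-1])
            svctm = float(fields[-2])
        except ValueError:
            continue  # header row such as "Device: ... svctm %util"
        if util >= util_threshold or svctm >= svctm_threshold:
            busy.append((fields[0], svctm, util))
    return busy
```

Fed the OSWatcher iostat archives for the pause windows, this makes it easy to see whether the same emcpower* device (or LUN behind it) keeps recurring.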
I am stuck trying to find out - really, prove - who the culprit is, or what
exactly the wait is on. Is LGWR waiting on user sessions, and user sessions
waiting on LGWR, and all of that driving the disk to 100%? Can I enable some
sort of tracing on LGWR that would point to exactly what it's waiting on, to
prove where the problem is?
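If you do get a 10046 trace on LGWR (e.g. attach with oradebug and enable event 10046 at level 8), the WAIT lines record elapsed time in microseconds, so a throwaway parser can total exactly where the 25-30 seconds went. A sketch (my own, not an Oracle-supplied tool):

```python
import re

# 10046 trace WAIT lines look like:
#   WAIT #0: nam='log file parallel write' ela= 123456 files=1 blocks=2 ...
WAIT_RE = re.compile(r"WAIT #\d+: nam='([^']+)' ela= (\d+)")

def sum_waits(trace_text):
    """Total elapsed microseconds per wait event in a 10046 trace."""
    totals = {}
    for event, ela in WAIT_RE.findall(trace_text):
        totals[event] = totals.get(event, 0) + int(ela)
    return totals
```

A single 'log file parallel write' wait with ela in the tens of millions of microseconds during a pause window would point squarely at the I/O path rather than at the user sessions.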