Michael,
Regardless of what anyone else may have done, increasing the priority of
LGWR indicates that the problem has been definitively diagnosed as lack
of CPU.
Is that true? Has that diagnosis been made?
Or is it just a guess? Because
The LGWR process posts the "log file parallel write" event when it is
trying to flush redo from the log buffer to the online redo log files.
Is that event showing significant timings?
Be aware that at any point in time, there is only one process posting
"log file parallel write" while many processes may be posting "log file
sync". So, the cumulative timings for the latter may look higher on an
AWR report, but if you correct for concurrency, you might see a strong
correlation between "log file sync" and "log file parallel write", which
indicates that waits for I/O are the real cause of "log file sync", not
waits for CPU.
Please consider trying the following query in ASH...
select timestamp, event, count(*) cnt
from v$active_session_history
where event in ('log file sync','log file parallel write')
group by timestamp, event
order by timestamp, event;
From this query, I would be looking for times when "log file sync"
occurs *WITHOUT* an accompanying "log file parallel write".
Think of a "log file parallel write" wait by the LGWR as a "stimulus",
and a "log file sync" response by one or many background processes as a
"response". This is another way of saying that if the LGWR is slow in
writing redo down from the Log Buffer to the online redo log files, then
it is going to keep the Log Buffer locked for a longer period of time,
and foreground server processes are going wait, and while they wait they
are going to post "log file sync". So "log file parallel write" and
"oog file sync" aren't supposed to occur at the same instant, but maybe
we can see them in the same "neighborhood" of time often enough to show
correlation?
Now remember that ASH is only sampling every 1000 milli-seconds (i.e. 1
second), and that it is possible for many waits on either of these
events to have occurred during the course of a full second without being
sampled on the 1000th milli-second by ASH. But this level of
granularity is the information that we have, so we have to work with it.
So using this query, try to see if "log file sync" is occurring most
often when "log file parallel write" is occurring, or not.
If so, then the problem may not be CPU starvation for LGWR, but slow I/O
writing to the online redo log files. Granting higher CPU priority to
LGWR is not going to help. In fact, you may hurt things by starving
other processes of CPU priority.
If not, then it won't be definitive, but it will seem unlikely that slow
I/O to the online redo log files is an issue, and perhaps granting
higher CPU priority may help.
Anyway, please let us know what you find?
Happy hunting!
Thanks!
-Tim
On 9/30/15 13:26, Michael Calisi wrote:
I been dealing with log file sync showing up in my top 10 of my awr report.
I made changes to eliminate the log file sync, however there was one additional recommendation that I don't feel comfortable about and can't seem to get clarification from support.
They are recommending to promote the LGWR background process to become a critical thread by placing the thread in the FX scheduling class at priority sixty will cause the thread to be treated as critical.
when i check our current setting. i was already at 60.
QL> host ps -ecf | grep lgwr
oracle 21338 21337 FX 0 15:22:17 pts/85 0:00 grep lgwr
oracle 21337 18771 FX 0 15:22:17 pts/85 0:00 /bin/sh -c ps -ecf | grep lgwr
oracle 25393 25382 FX 60 Aug 29 ? 545:15 ora_lgwr_cg
What I don't understand is how this value is determined can it go higher and what are the consequence of changing this value higher? Not sure how increasing the value help the io issue..
Anyone every make changes to this process.