Re: log file sync

  • From: Tim Gorman <tim@xxxxxxxxx>
  • To: oracle-l@xxxxxxxxxxxxx
  • Date: Wed, 30 Sep 2015 15:22:56 -0600

Michael,

Regardless of what anyone else may have done, increasing the priority of LGWR indicates that the problem has been definitively diagnosed as lack of CPU.

Is that true? Has that diagnosis been made?

Or is it just a guess? Because

The LGWR process posts the "log file parallel write" event when it is trying to flush redo from the log buffer to the online redo log files. Is that event showing significant timings?

Be aware that at any point in time, there is only one process posting "log file parallel write" while many processes may be posting "log file sync". So, the cumulative timings for the latter may look higher on an AWR report, but if you correct for concurrency, you might see a strong correlation between "log file sync" and "log file parallel write", which indicates that waits for I/O are the real cause of "log file sync", not waits for CPU.

Please consider trying the following query in ASH...

select timestamp, event, count(*) cnt
from v$active_session_history
where event in ('log file sync','log file parallel write')
group by timestamp, event
order by timestamp, event;


From this query, I would be looking for times when "log file sync" occurs *WITHOUT* an accompanying "log file parallel write".

Think of a "log file parallel write" wait by the LGWR as a "stimulus", and a "log file sync" response by one or many background processes as a "response". This is another way of saying that if the LGWR is slow in writing redo down from the Log Buffer to the online redo log files, then it is going to keep the Log Buffer locked for a longer period of time, and foreground server processes are going wait, and while they wait they are going to post "log file sync". So "log file parallel write" and "oog file sync" aren't supposed to occur at the same instant, but maybe we can see them in the same "neighborhood" of time often enough to show correlation?

Now remember that ASH is only sampling every 1000 milli-seconds (i.e. 1 second), and that it is possible for many waits on either of these events to have occurred during the course of a full second without being sampled on the 1000th milli-second by ASH. But this level of granularity is the information that we have, so we have to work with it.

So using this query, try to see if "log file sync" is occurring most often when "log file parallel write" is occurring, or not.

If so, then the problem may not be CPU starvation for LGWR, but slow I/O writing to the online redo log files. Granting higher CPU priority to LGWR is not going to help. In fact, you may hurt things by starving other processes of CPU priority.

If not, then it won't be definitive, but it will seem unlikely that slow I/O to the online redo log files is an issue, and perhaps granting higher CPU priority may help.

Anyway, please let us know what you find?

Happy hunting!

Thanks!

-Tim




On 9/30/15 13:26, Michael Calisi wrote:

I been dealing with log file sync showing up in my top 10 of my awr report.

I made changes to eliminate the log file sync, however there was one additional recommendation that I don't feel comfortable about and can't seem to get clarification from support.

They are recommending to promote the LGWR background process to become a critical thread by placing the thread in the FX scheduling class at priority sixty will cause the thread to be treated as critical.


when i check our current setting. i was already at 60.

QL> host ps -ecf | grep lgwr
oracle 21338 21337 FX 0 15:22:17 pts/85 0:00 grep lgwr
oracle 21337 18771 FX 0 15:22:17 pts/85 0:00 /bin/sh -c ps -ecf | grep lgwr
oracle 25393 25382 FX 60 Aug 29 ? 545:15 ora_lgwr_cg

What I don't understand is how this value is determined can it go higher and what are the consequence of changing this value higher? Not sure how increasing the value help the io issue..

Anyone every make changes to this process.

Other related posts: