Re: ASM of any significant value when switching to Direct NFS / NetApp / non-RAC?

  • From: Austin Hackett <hacketta_57@xxxxxx>
  • To: "CRISLER, JON A" <JC1706@xxxxxxx>
  • Date: Mon, 20 Aug 2012 19:28:28 +0100

Hi Jon

I wasn't aware of that parameter, so it's good to hear about it.
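
To confirm where we stand, a quick OS-level check of LGWR's scheduling
class should tell me whether it's already running realtime (FF/RR) or
still timesharing (TS) - something along these lines:

  # show scheduling class and priority for the log writer
  ps -eo pid,cls,pri,comm | grep -i lgwr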

In our case, the log file sync waits are caused by slow disk I/O; they
always correspond to massive log write elapsed warnings in the LGWR
trace file - I've seen 90+ seconds to write just a couple of KB of
redo. I've got a couple of leads, like the misconfigured jumbo frames,
and also some nfsd.tcp.close.idle.notify:warning messages on the filer
that correlate with the slow writes and reference the IP of the NIC on
the db host that saw the spike.
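
In case it's useful to anyone else, pulling those warnings out of the
trace is just a grep along these lines (the diag path below is only an
example - adjust for your own ADR home and instance name):

  # list every slow-write warning LGWR has logged
  grep "log write elapsed" \
      /u01/app/oracle/diag/rdbms/mydb/MYDB1/trace/*_lgwr_*.trc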

We're working on getting some tcpdumps the next time the issue occurs,
so those should allow me to validate what the LGWR trace file is
telling me.
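
The rough plan is a ring-buffer capture on the storage NIC, something
like the below (the interface name and filer address are placeholders):

  # keep 10 x 100 MB files of NFS traffic to/from the filer
  tcpdump -i eth2 -s 0 -C 100 -W 10 -w /tmp/dnfs.pcap \
      host 192.168.10.50 and port 2049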

Thanks

Austin


On 20 Aug 2012, at 17:04, CRISLER, JON A wrote:

> What is your setting for this parameter?
>
> SQL> alter system set "_high_priority_processes"='LMS*|VKTM|LGWR'  
> scope=spfile sid='*';
>
> System altered.
>
> If LGWR is not set to RT priority, it might be the reason behind
> higher log file sync times.
>
> -----Original Message-----
> From: Austin Hackett [mailto:hacketta_57@xxxxxx]
> Sent: Sunday, August 19, 2012 5:59 AM
> To: CRISLER, JON A
> Cc: Oracle-L@xxxxxxxxxxxxx
> Subject: Re: ASM of any significant value when switching to Direct  
> NFS / NetApp / non-RAC?
>
> Hi Jon
>
> Interesting - thanks for the info.
>
> Yes, we also see those symptoms - a big spike in log file sync,
> accompanied by some GCS waits. When the spikes occurred, we checked
> CPU utilization on the storage controller, and it was less than 50%.
> Write latencies, IOPS, and throughput were all within acceptable
> limits, and actually much lower than during other periods when
> performance had been fine.
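>
> Next time one hits, I'll probably also leave something like sysstat
> running on the filer console, so we have a per-second view of
> controller CPU and throughput to line up against the LGWR trace:
>
>   sysstat -x 1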
>
> We're using dNFS, so we aren't using DM-multipath. Indeed, there is
> only a single storage NIC - a decision that predates me and one we're
> working to address. We are on OEL 5.4, which is interesting.
>
> One idea is that this could be caused by an incorrect MTU on the
> storage NIC. It's currently set to 8000 (a setting I'm told was
> inherited when they switched from Solaris to Linux a while back),
> whereas it's 9000 on the filer and the switch.
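>
> As a quick sanity check (the filer address below is just a
> placeholder), a don't-fragment ping ought to show whether a full
> 9000-byte frame makes it end to end - with the NIC still at 8000 I'd
> expect this to fail from the db host:
>
>   # 8972 = 9000-byte MTU minus 28 bytes of IP and ICMP headers
>   ping -M do -s 8972 -c 3 192.168.10.50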
>
> Out of curiosity, what's the biggest log write elapsed warning you've
> seen? We see 1 or 2 spikes a week and the biggest has been 92 seconds
> - yes, 92 seconds!
>
> On 17 Aug 2012, at 04:35, CRISLER, JON A wrote:
>
>> Austin - we have observed the exact same behavior, and it appears to
>> be caused by periodic spikes in CPU utilization on the NetApp
>> controller in a RAC environment.  The info is fuzzy right now, but if
>> you have an LGWR delay, it also causes a GCS delay in passing the
>> dirty block to another node that needs it.  In our case it's a
>> SAN-ASM-RAC environment, and the NetApp CPU is always churning above
>> 80%.  We found that RH tuning and multipath issues contributed to the
>> cause, and it seems to have been mostly addressed with RH 5.8 (was
>> 5.4).  In an FC SAN environment, something like SANscreen that can
>> measure end-to-end FC response time helped to narrow down some of the
>> contributing factors.  You can set an undocumented parameter to allow
>> the GCS dirty block to be passed over to the other nodes while an
>> LGWR wait occurs, but you risk data corruption in the event of a node
>> crash (hence we passed on that tip).
>

--
//www.freelists.org/webpage/oracle-l

