RE: ASM of any significant value when switching to Direct NFS / NetApp / non-RAC?

  • From: "CRISLER, JON A" <JC1706@xxxxxxx>
  • To: "hacketta_57@xxxxxx" <hacketta_57@xxxxxx>, "Oracle-L@xxxxxxxxxxxxx" <Oracle-L@xxxxxxxxxxxxx>
  • Date: Fri, 17 Aug 2012 03:35:13 +0000

Austin- we have observed the exact same behavior, and it appears to be periodic 
spikes in NetApp controller CPU utilization in a RAC environment.  The info is 
fuzzy right now, but if you have an LGWR delay, it also causes a GCS delay in 
passing the dirty block to another node that needs it.  In our case it's a 
SAN-ASM-RAC environment, and the NetApp CPU is always churning above 80%.  We 
found that RH tuning and multipath issues contributed to the problem, and it 
seems to have been mostly addressed with RH 5.8 (we were on 5.4).  In an FC 
SAN environment something like SANscreen that can measure end-to-end FC 
response time helped to narrow down some of the contributing factors.  You can 
set an undocumented parameter to allow the GCS dirty block to be passed over to 
the other node while an LGWR wait is in progress, but you risk data corruption 
in the event of a node crash (hence we passed on that tip).
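If you want to see whether the two symptoms move together, sampling 
v$system_event for the LGWR write event and the GCS current-block event side by 
side is usually enough. Below is a rough sketch only (it assumes python-oracledb, 
SELECT access on the v$ views, and the standard event names for your version; 
the connection details are placeholders):

# Rough sketch: sample v$system_event twice and compare average waits for
# LGWR writes vs. GCS current-block transfers. Assumes python-oracledb and
# SELECT privilege on the v$ views; connection details are placeholders.
import time
import oracledb

EVENTS = ("log file parallel write", "gc current block busy")

def snapshot(cur):
    cur.execute(
        "select event, total_waits, time_waited_micro "
        "from v$system_event where event in (:1, :2)",
        EVENTS)
    return {ev: (waits, micros) for ev, waits, micros in cur}

conn = oracledb.connect(user="system", password="...", dsn="db-host/orcl")
cur = conn.cursor()

first = snapshot(cur)
time.sleep(60)                      # sample interval in seconds
second = snapshot(cur)

for ev in EVENTS:
    d_waits = second[ev][0] - first[ev][0]
    d_micros = second[ev][1] - first[ev][1]
    avg_ms = (d_micros / d_waits / 1000.0) if d_waits else 0.0
    print(f"{ev}: {d_waits} waits, avg {avg_ms:.2f} ms")

If the average for the GCS event climbs whenever the LGWR write average does, 
that points at the log-flush-before-send behavior described above.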

-----Original Message-----
From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On 
Behalf Of Austin Hackett
Sent: Wednesday, August 15, 2012 4:52 PM
To: Oracle-L@xxxxxxxxxxxxx
Subject: Re: ASM of any significant value when switching to Direct NFS / NetApp 
/ non-RAC?

Hi Dana
This info doesn't exactly relate to ASM, but hopefully it'll be of use to you 
in the future...

I've recently started a new role at a shop that uses Linux, Direct NFS and NetApp 
(no ASM) and as others have suggested, the solution does have a number of nice 
management features.

However, I am finding the apparent lack of read and write latency stats 
frustrating.

Something I'm currently looking into is occasional spikes in redo log writes. 
I know these are happening because there are log write elapsed warnings in the 
LGWR trace file. When these spikes occur, NetApp Ops Manager reports 2 - 3 
millisecond write latencies for the volume in question.

What I'd like to be able to do is cross-check these warnings against host-level 
io stats, but there seems to be no way of achieving this.
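For what it's worth, pulling the warning timestamps straight out of the LGWR 
trace makes the cross-check a bit easier. Here's a rough sketch (the trace path 
is a placeholder, and the message format is what our LGWR traces show, so it may 
need adjusting for other versions):

# Quick-and-dirty sketch: pull the "log write elapsed time" warnings (and the
# "***" timestamp lines that precede them) out of the LGWR trace file so the
# spike times can be lined up with filer-side stats. The trace path is a
# placeholder and the message format may differ between versions.
import re

TRACE = "/u01/app/oracle/diag/rdbms/orcl/ORCL/trace/ORCL_lgwr_1234.trc"

ts_re = re.compile(r"^\*\*\* (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")
warn_re = re.compile(r"log write elapsed time (\d+)ms", re.IGNORECASE)

last_ts = None
with open(TRACE) as f:
    for line in f:
        m = ts_re.match(line)
        if m:
            last_ts = m.group(1)
            continue
        m = warn_re.search(line)
        if m:
            print(f"{last_ts}  log write took {m.group(1)} ms")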

Using the standard Linux NFS client, iostat can show you the number of reads, 
writes etc. but not latencies. With the 2.6.17 kernel it seems that counters 
are available to report latency information, and there are scripts like 
nfs-iostat.py out there which will display this info:
http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=blob;f=tools/nfs-iostat/nfs-iostat.py;h=9626d42609b9485c7fda0c9ef69d698f9fa929fd;hb=HEAD
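For the curious, the counters nfs-iostat.py reads live in /proc/self/mountstats; 
a stripped-down sketch of the same calculation is below (the field layout is 
assumed from the kernels we run, so treat it as illustrative rather than exact):

# Minimal sketch of what nfs-iostat.py computes: parse the per-op counters in
# /proc/self/mountstats and print average RTT for READ and WRITE. Field layout
# assumed to be "op: ops trans timeouts bytes_sent bytes_recv queue rtt execute"
# with cumulative values and times in ms; check your kernel's format.
MOUNTSTATS = "/proc/self/mountstats"

with open(MOUNTSTATS) as f:
    device = None
    in_ops = False
    for line in f:
        line = line.strip()
        if line.startswith("device ") and " nfs" in line:
            device = line.split()[1]
            in_ops = False
        elif line == "per-op statistics":
            in_ops = True
        elif in_ops and (line.startswith("READ:") or line.startswith("WRITE:")):
            op, rest = line.split(":", 1)
            fields = [int(x) for x in rest.split()]
            ops, rtt_ms = fields[0], fields[6]
            avg = rtt_ms / ops if ops else 0.0
            print(f"{device} {op}: {ops} ops, avg RTT {avg:.2f} ms")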

However, because Direct NFS bypasses the host's NFS mount points (the Oracle 
database processes mount the files directly), it's my understanding that the 
above tools won't include any operations performed by Direct NFS in their 
output. There 
is a post about this here: 
http://glennfawcett.wordpress.com/2009/11/25/monitoring-direct-nfs-with-oracle-11g-and-solaris-pealing-back-the-layers-of-the-onion/

Now, whilst the v$dnfs_stats view does record the number of reads, writes etc., 
it doesn't have any latency data. That just leaves you with the usual v$ and AWR 
views like v$eventmetric, v$system_event etc.
And if you're trying to confirm that the issue is at the host level and not in 
Oracle, those don't help you much. So, at the moment I'm missing being able to 
run iostat and see svc_t and the like...

Incidentally, Glenn Fawcett has a nice script here for capturing v$dnfs_stats 
output: 
http://glennfawcett.wordpress.com/2010/02/18/simple-script-to-monitor-dnfs-activity
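Along the same lines, here's a rough sketch of sampling v$dnfs_stats from Python 
(it assumes python-oracledb and the documented NFS_READ / NFS_WRITE / NFS_COMMIT 
columns; connection details are placeholders, and note the view only gives 
operation counts, which is exactly the latency gap mentioned above):

# Rough sketch: sample the v$dnfs_stats counters twice and print per-second
# operation rates. Assumes python-oracledb and the documented column names;
# the view has counts only, no timings. Connection details are placeholders.
import time
import oracledb

SQL = "select sum(nfs_read), sum(nfs_write), sum(nfs_commit) from v$dnfs_stats"
INTERVAL = 10  # seconds between samples

conn = oracledb.connect(user="system", password="...", dsn="db-host/orcl")
cur = conn.cursor()

cur.execute(SQL)
r1, w1, c1 = cur.fetchone()
time.sleep(INTERVAL)
cur.execute(SQL)
r2, w2, c2 = cur.fetchone()

print(f"reads/s={(r2 - r1) / INTERVAL:.1f}  "
      f"writes/s={(w2 - w1) / INTERVAL:.1f}  "
      f"commits/s={(c2 - c1) / INTERVAL:.1f}")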

It's also worth being aware of bugs 13043012 and 13647945.

Hope that helps

Austin




--
//www.freelists.org/webpage/oracle-l


