You need to drill down into the OCFS2 issues. There are some Metalink notes about gcc versions, and a known issue where high CPU can starve the OCFS2 processes. Changing grub.conf to use the deadline I/O scheduler may help (CFQ is the default). The messages about o2net correspond to what we have found.

Also, the OCFS2 configuration can be tweaked: increase the I/O timeouts if you are using multipathed SAN storage, and increase the network timeouts if you are using bonded NICs. Find out the failover times set for both SAN multipathing and network bonding, and make sure OCFS2 is configured with timeout values higher than both. What can happen is that during a multipath or NIC bond failover, OCFS2 triggers an outage before the OS-level resource failover completes. Again, from the o2net messages this seems likely in your case.

We have otherwise identical 10.2 RAC clusters running ASM vs. OCFS2, and the ASM clusters rarely have problems in this regard compared to OCFS2. Once you have a stable gcc/glibc and stable OCFS2 timeout parameters, OCFS2 should be a lot more reliable.

-----Original Message-----
From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On Behalf Of Karl Arao
Sent: Wednesday, February 10, 2010 4:38 PM
To: K Gopalakrishnan
Cc: oracle-l@xxxxxxxxxxxxx
Subject: Re: Network interconnect traffic on RAC

Thanks for the replies Andrew, Krishna, Aaron, Gopal...

I have this client running on a three-node RAC. Just recently two of the nodes got evicted, and I'm trying to diagnose whether it was a CPU capacity, disk latency, or interconnect issue. I've been reading:

- Oracle Clusterware and Private Network Considerations
- Practical Performance Management for Oracle RAC
- RAC Performance Tuning best practices

BTW, they are running 2 x 3.00GHz Xeon CPUs with 4GB memory on each node, connected to an EMC CX300.
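The scheduler and timeout changes suggested in the reply above look roughly like the sketch below on an Enterprise Linux box. The values here are illustrative assumptions only, not recommendations; size them above the measured SAN multipath and NIC bond failover times. On OCFS2 1.2.5+ the cluster timeouts are typically set in /etc/sysconfig/o2cb (and applied via `service o2cb configure`); the kernel version in the grub line is a placeholder.

```shell
# /boot/grub/grub.conf -- append elevator=deadline to the kernel line
#   kernel /vmlinuz-2.6.18-x.el5 ro root=/dev/sda2 elevator=deadline

# /etc/sysconfig/o2cb -- illustrative values; must exceed your measured
# multipath and bond failover times
O2CB_HEARTBEAT_THRESHOLD=61     # disk heartbeat: (threshold - 1) * 2 = 120s
O2CB_IDLE_TIMEOUT_MS=60000      # o2net idle timeout (default 30000)
O2CB_KEEPALIVE_DELAY_MS=4000    # keepalive packet delay (default 2000)
O2CB_RECONNECT_DELAY_MS=4000    # reconnect delay (default 2000)
```

All nodes in the cluster must use the same timeout values, so the change has to be rolled out cluster-wide before restarting o2cb.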
At the time of the eviction, the two evicted nodes were at 60-65% CPU utilization (run queues of 5 and 2.5 respectively) while the surviving node was only about 30% utilized (data from SAR). BTW, the OCFS2 filesystem (where the OCR and voting disk reside) was also on the interconnect IPs, so it was also affected by the latency problem (shown in the OS logs). Unfortunately, since the servers restarted, the data from the current SNAP_ID at the time of the busy load was lost, so I only have the SAR data and the prior and after SNAP_IDs for diagnosis:

- OS: two nodes at 60-65% CPU utilization (run queues of 5 and 2.5 respectively); the other was only at 30%
- Disk: I don't have latency numbers, but from the SAR disk data the two evicted nodes had block transfer read/write per second of 450-500 and TPS of 60-65; the surviving node had block transfer read/write per second of 60 and TPS of 10
- Network: on the interconnect interface, from the SAR network data, the two evicted nodes had utilization similar to the surviving node: txbytes/rxbytes of 3,000,000-4,000,000 per second
- Database: for the prior and after SNAP_IDs, all nodes have an AAS of < CPU count, and the two evicted nodes had just 7 MB/s read/write activity; looking at the ASH data, "CPU" and "gc cr multi block" are the top two events

Below is some of the output from one of the failing nodes:

-- OS log
Jan 25 14:47:57 rac1-3 kernel: o2net: connection to node rac1-2 (num 1) at 192.168.0.2:7777 has been idle for 30.0 seconds, shutting it down.
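A side note on that o2net message: the "idle for 30.0 seconds" figure is the o2cb network idle timeout, and on a running node the effective timeout values can be read from configfs. This is a sketch assuming the default cluster name "ocfs2"; the paths depend on the OCFS2 version and on o2cb being online.

```shell
# Effective o2cb network timeouts, exposed via configfs on a live node
cat /sys/kernel/config/cluster/ocfs2/idle_timeout_ms
cat /sys/kernel/config/cluster/ocfs2/keepalive_delay_ms
cat /sys/kernel/config/cluster/ocfs2/reconnect_delay_ms
```

Comparing these against the multipath and bond failover times shows directly whether a failover window is longer than the OCFS2 timeout.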
-- Clusterware alert log
[cssd(13414)]CRS-1610: node rac1-1 (3) at 90% heartbeat fatal, eviction in 0.130 seconds
2010-01-25 14:47:55.880

-- CSS log
[ CSSD]2010-01-25 14:47:26.691 [1199618400] >WARNING: clssnmPollingThread: node rac1-1 (3) at 90 3.123428e-317artbeat fatal, eviction in 0.130 seconds
[ CSSD]2010-01-25 14:47:26.823 [1199618400] >TRACE: clssnmPollingThread: Eviction started for node rac1-1 (3), flags 0x040d, state 3, wt4c 0
[ CSSD]2010-01-25 14:47:26.823 [1199618400] >TRACE: clssnmDiscHelper: rac1-1, node(3) connection failed, con (0x785550), probe((nil))
[ CSSD]2010-01-25 14:47:27.328 [1115699552] >TRACE: clssnmReadDskHeartbeat: node(3) is down. rcfg(30) wrtcnt(519555) LATS(534471324) Disk lastSeqNo(519555)

So from the data above, my initial finding is that the latency issue could be caused either by sustained high CPU utilization on the OS side, which affected the scheduling of critical RAC processes, or by a congested interconnect switch. I'd like to drill down into which of the two is the culprit, which is the reason behind my asking.

- Karl Arao
karlarao.wordpress.com
--
//www.freelists.org/webpage/oracle-l