Re: Network interconnect traffic on RAC

  • From: Karl Arao <karlarao@xxxxxxxxx>
  • To: K Gopalakrishnan <kaygopal@xxxxxxxxx>
  • Date: Thu, 11 Feb 2010 05:38:28 +0800

Thanks for the replies, Andrew, Krishna, Aaron, Gopal...

I have a client running a three-node RAC. Just recently two of the
nodes got evicted... I'm trying to diagnose whether it was a CPU
capacity, disk latency, or interconnect issue...
and have been reading:
- Oracle Clusterware and Private Network Considerations
- Practical Performance Management for Oracle RAC
- RAC Performance Tuning best practices

BTW, they are running 2 x 3.00GHz Xeon CPUs per node with 4GB of
memory, connected to an EMC CX300.

At the time of the eviction, the two nodes that got evicted were at
60-65% CPU utilization (run queues of 5 and 2.5 respectively) and the
surviving node was only 30% utilized (data from SAR). Then the cluster
evicted the two nodes. BTW, the OCFS2 volume (where the OCR & voting
disk reside) was also on the interconnect IPs, so it was also affected
by the latency problem (shown in the OS logs)...

Unfortunately, since the servers restarted, the data for the current
SNAP_ID covering the busy period was lost. So I just have the SAR data
and the prior & after SNAP_IDs for diagnosis:
- OS: 2 nodes at 60-65% CPU utilization (run queues of 5 and 2.5
respectively); the other node was only at 30%
- Disk: I don't have latency numbers, but from the SAR disk data the 2
evicted nodes had 450-500 block transfers read/write per second and
60-65 TPS... the surviving node had 60 block transfers read/write per
second and 10 TPS
- Network: on the interconnect interface, from the SAR network data,
the 2 evicted nodes had utilization similar to the surviving node...
txbytes/rxbytes of 3,000,000-4,000,000 per second
- Database: on the prior & after SNAP_IDs all nodes have an AAS < CPU
count, and the 2 evicted nodes had only 7 MB/s read/write activity...
Looking at the ASH data I can see "CPU" and "gc cr multi block
request" as the top two events.
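As a sanity check on those network numbers: 3-4 MB/s of tx/rx traffic is only a few percent of a gigabit link, so raw bandwidth on the NICs doesn't look saturated. A quick awk computation (assuming a 1 Gbit/s private interconnect, which is an assumption on my part):

```shell
# Rough link-utilization check for the interconnect NIC.
# Assumes a 1 Gbit/s private interconnect -- adjust link_bps if yours differs.
awk 'BEGIN {
    bytes_per_sec = 4000000          # upper end of the SAR txbytes/rxbytes figure
    link_bps      = 1000000000       # 1 Gbit/s, assumed
    util = bytes_per_sec * 8 / link_bps * 100
    printf "interconnect utilization: %.1f%% of a 1GbE link\n", util
}'
# prints: interconnect utilization: 3.2% of a 1GbE link
```

So if the switch was the problem, it would more likely be congestion or errors at the switch port than simple bandwidth exhaustion on the host side.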


Below is some of the output from one of the failing nodes:

-- OS log
Jan 25 14:47:57 rac1-3 kernel: o2net: connection to node rac1-2 (num
1) at 192.168.0.2:7777 has been idle for 30.0 seconds, shutting it
down.

-- Clusterware Alert log
2010-01-25 14:47:55.880
[cssd(13414)]CRS-1610:node rac1-1 (3) at 90% heartbeat fatal, eviction
in 0.130 seconds

-- CSS log
[    CSSD]2010-01-25 14:47:26.691 [1199618400] >WARNING:
clssnmPollingThread: node rac1-1 (3) at 90 3.123428e-317artbeat fatal,
eviction in 0.130 seconds
[    CSSD]2010-01-25 14:47:26.823 [1199618400] >TRACE:
clssnmPollingThread: Eviction started for node rac1-1 (3), flags
0x040d, state 3, wt4c 0
[    CSSD]2010-01-25 14:47:26.823 [1199618400] >TRACE:
clssnmDiscHelper: rac1-1, node(3) connection failed, con (0x785550),
probe((nil))
[    CSSD]2010-01-25 14:47:27.328 [1115699552] >TRACE:
clssnmReadDskHeartbeat: node(3) is down. rcfg(30) wrtcnt(519555)
LATS(534471324) Disk lastSeqNo(519555)


So from the data above, my initial finding is that the latency issue
could have been caused either by sustained high CPU utilization at the
OS level, which affected the scheduling of critical RAC processes, or
by a congested interconnect switch...
I'd like to drill down on which of the two is the culprit, which is
the reason behind my asking...
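To separate the two theories, one thing worth checking is whether the interconnect NICs recorded errors or drops: a congested switch usually shows up there, while CPU starvation of ocssd does not. A minimal sketch parsing a /proc/net/dev line with awk -- the sample line and the eth1 interface name are made up for illustration:

```shell
# Hypothetical /proc/net/dev line for the private interconnect NIC.
# Field order after the iface name: rx bytes/packets/errs/drop/fifo/frame/
# compressed/multicast, then tx bytes/packets/errs/drop/...
sample="eth1: 4000000000 3500000 0 124 0 0 0 0 3900000000 3400000 0 0 0 0 0 0"

echo "$sample" | awk '{
    gsub(":", "", $1)                  # strip trailing colon from iface name
    rx_err = $4; rx_drop = $5          # receive errors and drops
    tx_err = $12; tx_drop = $13        # transmit errors and drops
    printf "%s rx_err=%d rx_drop=%d tx_err=%d tx_drop=%d\n",
           $1, rx_err, rx_drop, tx_err, tx_drop
}'
# prints: eth1 rx_err=0 rx_drop=124 tx_err=0 tx_drop=0
```

On a live node the same awk can read /proc/net/dev directly, and `sar -n EDEV` keeps the same error/drop counters historically, so they can be lined up against the eviction timestamp the way the CPU and network rates were.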




- Karl Arao
karlarao.wordpress.com
--
//www.freelists.org/webpage/oracle-l

