Re: Network interconnect traffic on RAC

  • From: Riyaj Shamsudeen <riyaj.shamsudeen@xxxxxxxxx>
  • To: karlarao@xxxxxxxxx
  • Date: Thu, 11 Feb 2010 16:35:51 -0600

Karl
  Do you have OSWatcher installed? It is a great tool to identify evictions
due to performance issues.
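  A quick way to use it here (a sketch, assuming a default OSWatcher
install with the standard archive layout) is to pull the vmstat and
private-network samples surrounding the eviction timestamp:

    # archive subdirectories are typically oswvmstat, oswiostat,
    # oswnetstat, oswprvtnet, etc.; <osw_home> is wherever OSWatcher
    # was installed
    ls <osw_home>/archive
    grep -A 5 "Jan 25 14:4" <osw_home>/archive/oswvmstat/*.dat
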
  CSS daemons and LMS background processes must be running at RT
(real-time) priority. So even if you have high CPU usage, heartbeats
shouldn't be missed due to CPU scheduling latency.
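  A quick way to verify this on Linux is to check the scheduling class
of those processes; ocssd.bin and the lms processes should show RR or FF
(real-time) in the CLS column, not TS (time-sharing):

    # CLS = scheduling class, RTPRIO = real-time priority
    ps -e -o pid,cls,rtprio,pri,pcpu,comm | egrep 'CLS|ocssd|lms'
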
  I would suspect disk or network issues. Without detailed performance data
it will be almost impossible to debug this further.
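  If it is the network, packet loss on the interconnect often shows up
as IP reassembly failures; a quick check to run on each node:

    # non-trivial counts here usually point at dropped/fragmented
    # interconnect traffic
    netstat -s | grep -i reassemb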

Cheers

Riyaj Shamsudeen
Principal DBA,
Ora!nternals -  http://www.orainternals.com - Specialists in Performance,
Recovery and EBS11i
Blog: http://orainternals.wordpress.com
OakTable member http://www.oaktable.com
Co-author: "Expert Oracle Practices: Oracle Database Administration from the
Oak Table" http://www.apress.com/book/view/9781430226680



On Wed, Feb 10, 2010 at 3:38 PM, Karl Arao <karlarao@xxxxxxxxx> wrote:

> Thanks for the replies Andrew, Krishna, Aaron, Gopal...
>
> I had this client running on a three-node RAC. Just recently two of
> the nodes got evicted... trying to diagnose whether it was a CPU
> capacity, disk latency, or interconnect issue...
> and I've been reading:
> - Oracle Clusterware and Private Network Considerations
> - Practical Performance Management for Oracle RAC
> - RAC Performance Tuning best practices
>
> BTW they are running 2 x 3.00GHz Xeon CPUs on each node with 4GB
> memory, connected to an EMC CX300.
>
> At the time of the eviction, the two nodes that got evicted were at
> 60-65% CPU utilization (run queues of 5 and 2.5 respectively) and the
> surviving node was only 30% utilized (data from SAR). Then the cluster
> evicted the two nodes. BTW, the OCFS2 filesystem (where the OCR &
> voting disk reside) was also on the interconnect IPs, so it was also
> affected by the latency problem (shown in the OS logs)...
>
> Unfortunately, since the servers restarted, the data from the current
> SNAP_ID covering the busy load was lost... So I just have the SAR data
> and the prior & after SNAP_IDs for diagnosis:
> - OS: 2 nodes at 60-65% CPU utilization (run queues of 5 & 2.5
> respectively), the other only 30%
> - Disk: I don't have latency numbers, but from the SAR disk data, the
> 2 evicted nodes had block read/write transfers of 450-500/s at 60-65
> tps... the surviving node had block read/write transfers of 60/s at
> 10 tps
> - Network: on the interconnect interface, from the SAR network data,
> the 2 evicted nodes had utilization similar to the surviving node...
> tx/rx of 3,000,000-4,000,000 bytes/s (see the quick check below)
> - Database: at the prior & after SNAP_IDs all nodes had an AAS less
> than their CPU count, and the 2 evicted nodes had just 7 MB/s of
> read/write activity... Looking at the ASH data I can see "CPU" and
> "gc cr multi block request" as the top two events.
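>
> As a quick sanity check on the Network numbers above (assuming the
> private interconnect is 1GbE, which I still need to confirm): 3-4
> MB/s is only a few percent of a GigE link's ~125 MB/s ceiling, so raw
> bandwidth doesn't look saturated:
>
>     # (the figures above came from sar -u, -q, -b, and -n DEV)
>     # ~4,000,000 bytes/s against a ~125,000,000 bytes/s GigE ceiling
>     echo $(( 4000000 * 100 / 125000000 ))%    # => 3%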
>
>
> Below is some of the output from one of the failing nodes:
>
> -- OS log
> Jan 25 14:47:57 rac1-3 kernel: o2net: connection to node rac1-2 (num
> 1) at 192.168.0.2:7777 has been idle for 30.0 seconds, shutting it
> down.
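>
> The "idle for 30.0 seconds" matches the default o2cb network idle
> timeout, so the o2net shutdown is consistent with ~30s of interconnect
> silence. Assuming a stock ocfs2-tools setup, the configured value can
> be checked with:
>
>     # O2CB_IDLE_TIMEOUT_MS, typically set in /etc/sysconfig/o2cb
>     grep -i timeout /etc/sysconfig/o2cb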
>
> -- Clusterware Alert log
> 2010-01-25 14:47:55.880
> [cssd(13414)]CRS-1610:node rac1-1 (3) at 90% heartbeat fatal, eviction
> in 0.130 seconds
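>
> The "90% heartbeat fatal" should be 90% of the CSS misscount; worth
> confirming the configured value:
>
>     crsctl get css misscount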
>
> -- CSS log
> [    CSSD]2010-01-25 14:47:26.691 [1199618400] >WARNING:
> clssnmPollingThread: node rac1-1 (3) at 90% heartbeat fatal,
> eviction in 0.130 seconds
> [    CSSD]2010-01-25 14:47:26.823 [1199618400] >TRACE:
> clssnmPollingThread: Eviction started for node rac1-1 (3), flags
> 0x040d, state 3, wt4c 0
> [    CSSD]2010-01-25 14:47:26.823 [1199618400] >TRACE:
> clssnmDiscHelper: rac1-1, node(3) connection failed, con (0x785550),
> probe((nil))
> [    CSSD]2010-01-25 14:47:27.328 [1115699552] >TRACE:
> clssnmReadDskHeartbeat: node(3) is down. rcfg(30) wrtcnt(519555)
> LATS(534471324) Disk lastSeqNo(519555)
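>
> To also rule out voting disk I/O latency, the disk heartbeat timeout
> can be checked the same way (on 10.2.0.2 and later):
>
>     crsctl get css disktimeout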
>
>
> So from the data above... my initial finding is that the latency could
> have been caused either by sustained high CPU utilization on the OS
> side, which affected the scheduling of critical RAC processes, or by a
> congested interconnect switch...
> I'd like to drill down into which of the two is the culprit, which is
> the reason behind my asking...
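>
> One check I plan to run to separate the two (assuming sysstat kept the
> extended interface statistics): errors or drops on the private NIC
> would point at the network/switch, while a deep run queue with clean
> NIC counters would point at CPU starvation:
>
>     sar -n EDEV    # rxerr/s, txerr/s, rxdrop/s, txdrop/s per interface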
>
>
>
>
> - Karl Arao
> karlarao.wordpress.com
> --
> //www.freelists.org/webpage/oracle-l
