CSSD panics all nodes when single node loses cluster interconnect

  • From: Maureen English <sxmte@xxxxxxxxxxxxxxxx>
  • To: oracle-l@xxxxxxxxxxxxx
  • Date: Thu, 05 Feb 2009 16:43:20 -0900

A coworker of mine asked if I could see if anyone on this list has seen
anything like this problem we are having, and if there is a solution.
We've opened a Service Request with Oracle, so if they have a solution,
I'll post it to the list, too.

Three-node RHEL 5.2 x86_64 cluster running Oracle Clusterware 10.2.0.4
Kernel 2.6.18-92.1.18.el5
OCFS2 2.6.18-92.1.18.el5-1.4.1-1.el5.x86_64

Each node has two 2-port gigabit NICs, bonded with the Linux bonding module across
two switches to provide redundancy. bond0 is the public interface and bond1 is the
cluster interconnect. Testing private interconnect failure with 'ifconfig bond1 down'
on any single node would cause the entire cluster to panic approximately 90% of
the time.
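
For context, the bonding setup on each node looks roughly like the sketch below.
This is one common RHEL 5 layout; the bonding mode, miimon value, interface names
and addresses are illustrative and not necessarily our exact settings:

  # /etc/modprobe.conf -- one bonding driver instance per bond device
  alias bond0 bonding
  alias bond1 bonding
  options bonding max_bonds=2 mode=active-backup miimon=100

  # /etc/sysconfig/network-scripts/ifcfg-bond1 -- private interconnect
  DEVICE=bond1
  IPADDR=192.168.10.11      # example private address, one per node
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none
  USERCTL=no

  # /etc/sysconfig/network-scripts/ifcfg-eth2 -- one of the two slaves,
  # cabled to a different switch than the other slave
  DEVICE=eth2
  MASTER=bond1
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none
  USERCTL=no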

Looking at the log files (/var/log/messages, $CRSHOME/log/$NODE/cssd/ocssd.log)
showed that the two 'live' nodes were losing the voting disks before OCFS2 could
finish evicting the 'dead' node from the cluster, causing CSSD to reboot them.
Lowering the 'OCFS_HEARTBEAT_THRESHOLD' and 'Network Idle Timeout' values in the
OCFS2 configuration reduced the likelihood of the entire cluster panicking to
approximately 20% of the time.
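
In case it helps anyone reproduce this: those timeouts live in /etc/sysconfig/o2cb
(they can also be set through '/etc/init.d/o2cb configure'). The values below are
only an illustration of the direction we lowered them, assuming the stock o2cb init
script shipped with OCFS2 1.4, and are not a recommendation:

  # /etc/sysconfig/o2cb -- O2CB cluster timeouts (illustrative values)
  O2CB_ENABLED=true
  O2CB_BOOTCLUSTER=ocfs2
  # Disk heartbeat threshold, counted in 2-second iterations
  # (default is around 31, roughly 60s); lowered so the dead node
  # is fenced before CSS gives up on the voting disks.
  O2CB_HEARTBEAT_THRESHOLD=16
  # Network idle timeout in milliseconds (default 30000); lowered
  # for the same reason.
  O2CB_IDLE_TIMEOUT_MS=15000
  O2CB_KEEPALIVE_DELAY_MS=2000
  O2CB_RECONNECT_DELAY_MS=2000

Changing these means restarting o2cb (and therefore unmounting the OCFS2 volumes)
on each node, and the values need to match across all nodes in the cluster.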

The chance of losing both NICs/switches simultaneously is small; however, management
wants it looked at before the cluster goes into production, to determine whether it's
a known issue with no fix, a misconfiguration, etc. Searching Metalink hasn't turned
up anything very useful.

Is this an issue anyone has run into before?  If so, how did you end up dealing
with it?

Thanks (on behalf of my coworker, too),

- Maureen


