RAC node "has a disk HB, but no network HB" but traceroute reports no problem

From: "Yong Huang" <dmarc-noreply@xxxxxxxxxxxxx> (Redacted sender "yong321" for DMARC)
To: <oracle-l@xxxxxxxxxxxxx>
Date: Tue, 3 Jan 2017 22:20:43 +0000 (UTC)

Oracle and GI (grid infrastructure) 11.2.0.3 on 64-bit Red Hat Linux 6.6. Cisco
UCS.

Node 2 of a 2-node RAC crashed. Its ocssd.log shows:

2016-12-18 02:03:06.307: [ CSSD][499648256]clssnmPollingThread: node
d1prpcrndb1a (1) at 50% heartbeat fatal, removal in 14.760 seconds
2016-12-18 02:03:06.307: [ CSSD][499648256]clssnmPollingThread: node
d1prpcrndb1a (1) is impending reconfig, flag 2493454, misstime 15240
2016-12-18 02:03:06.307: [ CSSD][499648256]clssnmPollingThread: local
diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending
reconfig status(1)
2016-12-18 02:03:06.307: [ CSSD][510686976]clssnmvDHBValidateNcopy: node 1,
d1prpcrndb1a, has a disk HB, but no network HB, DHB has rcfg 306434975, wrtcnt,
197140394, LATS 4040636964, lastSeqNo 185041690, uniqueness 1468029747,
timestamp 1482048185/1112586906
...[some lines snipped here]...
2016-12-18 02:03:28.094: [ CSSD][510686976]clssnmvDHBValidateNcopy: node 1,
d1prpcrndb1a, has a disk HB, but no network HB, DHB has rcfg 306434975, wrtcnt,
197140475, LATS 4040658754, lastSeqNo 197140472, uniqueness 1468029747,
timestamp 1482048207/1112608986
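
The numbers in these lines can be cross-checked with a little arithmetic. The sketch below is only a reading aid, not an Oracle tool. It assumes that "misstime" is in milliseconds and that the first field of the DHB "timestamp" is Unix epoch seconds; both are assumptions inferred from the values themselves. Under those assumptions, misstime plus the time left until removal comes to 30000 ms, which would match a CSS misscount of 30 seconds (verify the actual setting with "crsctl get css misscount"). The disk heartbeat timestamp converts to 08:03:05 UTC, i.e. 02:03:05 CST, one second before the log entry, which suggests node 1 was still writing its disk heartbeat.

```python
import re
from datetime import datetime, timezone

# Log text copied from the excerpt above, with the e-mail line wrapping removed.
POLLING = (
    "clssnmPollingThread: node d1prpcrndb1a (1) at 50% heartbeat fatal, "
    "removal in 14.760 seconds\n"
    "clssnmPollingThread: node d1prpcrndb1a (1) is impending reconfig, "
    "flag 2493454, misstime 15240"
)
DHB = "uniqueness 1468029747, timestamp 1482048185/1112586906"


def implied_timeout_ms(text):
    """Add misstime (assumed ms) to the remaining time before removal."""
    removal = re.search(r"removal in (\d+\.\d+) seconds", text)
    miss = re.search(r"misstime (\d+)", text)
    return int(miss.group(1)) + round(float(removal.group(1)) * 1000)


def dhb_time_utc(text):
    """Interpret the first DHB timestamp field as epoch seconds (assumption)."""
    epoch = int(re.search(r"timestamp (\d+)/", text).group(1))
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


print(implied_timeout_ms(POLLING))    # 30000
print(dhb_time_utc(DHB).isoformat())  # 2016-12-18T08:03:05+00:00
```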

We installed Oracle's OSWatcher and enabled traceroute for the private network.
It shows no errors at the time of the failure:

zzz ***Sun Dec 18 02:03:28 CST 2016
traceroute to dcprpcrndb1bic1 (10.114.21.3), 30 hops max, 60 byte packets
1 dcprpcrndb1bic1 (10.114.21.3) 0.020 ms 0.008 ms 0.004 ms
traceroute to dcprpcrndb1bic2 (10.114.21.67), 30 hops max, 60 byte packets
1 dcprpcrndb1bic2 (10.114.21.67) 0.020 ms 0.006 ms 0.004 ms
traceroute to d1prpcrndb1aic1 (10.114.21.2), 30 hops max, 60 byte packets
1 d1prpcrndb1aic1 (10.114.21.2) 0.262 ms 0.259 ms 0.255 ms
traceroute to d1prpcrndb1aic2 (10.114.21.66), 30 hops max, 60 byte packets
1 d1prpcrndb1aic2 (10.114.21.66) 0.135 ms 0.123 ms 0.110 ms

If traceroute never reports a problem, what does "no network HB" in ocssd.log
mean? At 02:03:28, we see both "no network HB" and successful traceroute probes.
This is not the first time we have had this problem. The network team never
finds any issue, consistent with the traceroute report.

OSWatcher traceroute has only basic options:
traceroute -r -F <private network IP>
where -r means "Bypass the normal routing tables and send directly to a host on
an attached network". -F means "Do not fragment probe packets".
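
One thing to keep in mind is that traceroute sends only a few probes (three per hop by default) each time OSWatcher runs it, so a clean snapshot does not by itself show that every packet in between was delivered. A continuous probe would be more sensitive. What follows is a minimal sketch, assuming plain UDP datagrams sent once per second from one node and timestamped on arrival at the other. The function names, the port, and the 2-second gap threshold are all hypothetical choices for illustration, and plain UDP may not match the transport CSS itself uses.

```python
import socket
import time


def find_gaps(arrival_times, max_gap_s=2.0):
    """Return (previous, current, gap) for arrivals more than max_gap_s apart."""
    gaps = []
    for prev, cur in zip(arrival_times, arrival_times[1:]):
        if cur - prev > max_gap_s:
            gaps.append((prev, cur, cur - prev))
    return gaps


def send_probes(peer_ip, port, count, interval_s=1.0):
    """Send `count` small UDP datagrams to peer_ip:port, one per interval."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for seq in range(count):
            sock.sendto(str(seq).encode(), (peer_ip, port))
            time.sleep(interval_s)


def receive_probes(bind_ip, port, duration_s):
    """Listen on bind_ip:port for duration_s seconds; return arrival times."""
    arrivals = []
    deadline = time.monotonic() + duration_s
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_ip, port))
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                sock.recvfrom(64)
            except socket.timeout:
                break
            arrivals.append(time.time())
    return arrivals
```

For example, receive_probes("10.114.21.3", 50000, 3600) on one node and send_probes("10.114.21.3", 50000, 3600) on the other (port 50000 is an arbitrary, hypothetical choice) would collect an hour of arrival times. Passing them to find_gaps() lists any silent periods, which can then be compared with the timestamps in ocssd.log.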

/var/log/messages reports no problem at the time. It only starts to show
problems after the cluster has already decided on eviction.

Yong Huang
--
//www.freelists.org/webpage/oracle-l

Follow-Ups:
- Re: RAC node "has a disk HB, but no network HB" but traceroute reports no problem
  - From: Justin Mungal
