I have a couple of follow-up questions:
1. When was the last time you executed a successful failover test of this environment?
2. What has changed since that last successful test? (assuming nothing)
3. What are the public, private, and VIP IPs for these nodes? It seems at least possible that somehow there's a network misconfiguration (however unlikely that may be).
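On question 3, a quick sanity check of the addresses can be sketched as below. It assumes the usual RAC layout in which the VIP sits on the same subnet as the node's public IP and the private interconnect sits on a different one; the function name and the sample addresses are hypothetical, not taken from this environment.

```python
import ipaddress


def vip_subnet_problems(public_ip, vip, netmask, private_ip=None):
    """Return a list of readable problems; an empty list means none were found.

    Assumption (not confirmed for this cluster): the VIP must live on the
    public subnet, and the private interconnect must not.
    """
    problems = []
    # strict=False lets us derive the network from a host address plus netmask.
    public_net = ipaddress.ip_network(f"{public_ip}/{netmask}", strict=False)

    if ipaddress.ip_address(vip) not in public_net:
        problems.append(f"VIP {vip} is outside the public subnet {public_net}")
    if vip == public_ip:
        problems.append("VIP and public IP are identical")
    if private_ip is not None and ipaddress.ip_address(private_ip) in public_net:
        problems.append(
            f"private IP {private_ip} falls inside the public subnet {public_net}"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical addresses for illustration only.
    for problem in vip_subnet_problems(
        "192.0.2.11", "192.0.3.111", "255.255.255.0", "10.0.0.11"
    ):
        print(problem)
```

An empty result only rules out the most obvious subnet mistakes; it says nothing about routing, interface bindings, or duplicate addresses on the wire.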
It seems unusual for a VIP resource to be in UNKNOWN state since VIPs are generally lightweight and there's little effort associated with failover. When resources are in UNKNOWN, I generally try "crs_stop -f <resource_name>" to clear the current state. Then I'd try "crs_start -c <resource_name> <node-where-you-want-it-to-start>" to see if you can start it manually. Hopefully, that (possibly in combination with answers to the above questions) will yield something worth investigating.
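The sequence above could be scripted roughly as follows. This is only a sketch: the crs_stop -f and crs_start -c invocations are copied from this message and crs_stat -t from the quoted one, none of them verified here, and the helper names, resource name, and node name are hypothetical. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def recovery_commands(resource, node):
    """Build the command sequence described in the thread, as argv lists."""
    return [
        ["crs_stat", "-t"],                    # state before touching anything
        ["crs_stop", "-f", resource],          # clear the current state
        ["crs_start", "-c", resource, node],   # manual start on the chosen node
        ["crs_stat", "-t"],                    # state afterwards
    ]


def run_recovery(resource, node, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for argv in recovery_commands(resource, node):
        print("+ " + shlex.join(argv))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical names; substitute the real resource and target node.
    run_recovery("ora.node1.vip", "node1", dry_run=True)
```

Keeping the dry run as the default makes it easy to review the exact commands before anything is executed on a live cluster.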
Alessandro Vercelli wrote:
-- http://www.freelists.org/webpage/oracle-l

The exact time of the crash is not clearly defined: it happened on the morning of May 9th, and it was a database crash, not a system crash. crsd.log reported many messages like:

    2008-05-09 12:32:33.833: [ CRSEVT]0CAAMonitorHandler :: 0:Action Script /u01/app/oracle/product/crs/bin/racgwrap(check) timed out for ora.<failednode>.ons! (timeout=600)

Each message referred to a different resource.

Last week I tried to restart the failed node (in the meantime, other people had made other attempts) and crsd.log reported, among other messages, the following:

    2008-07-07 16:10:18.743: [ CRSRES]0CRS-1028: Dependency analysis failed because of: 'Resource in UNKNOWN state: ora.<failednode>.vip'

According to crs_stat -t, the ora.<failednode>.vip resource was allocated on the partner node, not the failed one, and its state was UNKNOWN (as expected).

My opinion is that, at the time of the crash, the partner node attempted an automatic failover but it failed. From crsd.log on the partner node:

    2008-05-09 11:55:55.278: [ CRSRES]0Attempting to start `ora.<failednode>.vip` on member `<partnernode>`
    2008-05-09 11:56:58.305: [ CRSAPP]0StartResource error for ora.<failednode>.vip error code = -2
    2008-05-09 11:57:05.429: [ CRSEVT]0CAAMonitorHandler :: 0:Action Script /u01/app/oracle/product/crs/bin/racgwrap(check) timed out for ora.<failednode>.vip! (timeout=60)

and, finally:

    2008-05-09 11:58:01.422: [ CRSRES]0X_OP_StopResourceFailed : Stop Resource failed (File: rti.cpp, line: 1698
    2008-05-09 11:58:01.422: [ CRSRES][ALERT]0`ora.<failednode>.vip` on member `<partnernode>` has experienced an unrecoverable failure.
    2008-05-09 11:58:01.422: [ CRSRES]0Human intervention required to resume its availability.
    2008-05-09 11:58:01.444: [ CRSRES]0CRS-1028: Dependency analysis failed because of: 'Resource in UNKNOWN state: ora.<failednode>.vip'

Sorry for the *mess* of messages.....
Thanks,
Alessandro

If you think it's related to the resource not starting because of some dependency, then I'd suggest looking at $CRS_HOME/log/<nodename>/crsd/crsd.log on each node (especially the crashed node) to see what's there around the time of startup. If the node won't boot, try booting it into single-user mode and disabling clusterware from starting, if you think clusterware is what's preventing it from booting completely.

Dan

Alessandro Vercelli wrote:

O.S.: RHEL AS4
Hardware: HP BL45P, 4 x AMD dual core, 8 GB RAM
Oracle 10.2.0.1, RAC and Clusterware

Anyway, the issue became "crabbed", since the last attempt to start the failing node succeeded, so I have one more task now... :)

The failed attempts reported on the console that the listener nodeapp could not start. Looking into the network configuration, I noticed that the VIP address for the failing listener was allocated not on that node but on its partner. Which log files do you suggest checking for errors?

Thanks,
Alessandro
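
When reading crsd.log on each node as suggested in this thread, it can help to pull out only the entries for one resource and put them in time order. The sketch below does that for lines shaped like the excerpts quoted in the thread. The regular expression is inferred from those excerpts alone, not from any documented log format, the function name is hypothetical, and whatever follows the bracketed component tag is kept untouched as the message.

```python
import re
from datetime import datetime

# Timestamp, bracketed component tag, then the rest of the line as the message.
# Pattern inferred from the excerpts in this thread (an assumption).
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): \[\s*(\w+)\](.*)$"
)


def resource_timeline(lines, resource):
    """Return (timestamp, component, message) tuples mentioning `resource`,
    sorted by timestamp. Lines that do not match the pattern are skipped."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None or resource not in match.group(3):
            continue
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        events.append((when, match.group(2), match.group(3).strip()))
    return sorted(events)


# Lines copied from the thread, deliberately out of order.
SAMPLE = [
    "2008-05-09 11:57:05.429: [ CRSEVT]0CAAMonitorHandler :: 0:Action Script /u01/app/oracle/product/crs/bin/racgwrap(check) timed out for ora.<failednode>.vip! (timeout=60)",
    "2008-05-09 11:55:55.278: [ CRSRES]0Attempting to start `ora.<failednode>.vip` on member `<partnernode>`",
    "2008-05-09 11:56:58.305: [ CRSAPP]0StartResource error for ora.<failednode>.vip error code = -2",
    "2008-05-09 12:32:33.833: [ CRSEVT]0CAAMonitorHandler :: 0:Action Script /u01/app/oracle/product/crs/bin/racgwrap(check) timed out for ora.<failednode>.ons! (timeout=600)",
]

if __name__ == "__main__":
    for when, component, message in resource_timeline(SAMPLE, "ora.<failednode>.vip"):
        print(when.isoformat(sep=" "), component, message)
```

Lines that never name the resource, such as the "Human intervention required" entry, are dropped by this filter, so the surrounding lines in the real log are still worth reading by hand.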