Fwd:Re: Oracle RAC and VIPs

From: "Alessandro Vercelli" <alever@xxxxxxxxx>
To: "Oracle Freelists.org" <Oracle-L@xxxxxxxxxxxxx>
Date: Wed, 16 Jul 2008 12:25:17 +0200
Re-sending only to Oracle-L (overquoting....)

>   I have a couple of follow up questions:
>   1. When was the last time you executed a successful failover test of this 
> environment?

This was the first time I had worked on that platform, which is being 
developed by another team, but looking through the log files I saw some failed 
and a few successful attempts to relocate resources from the node that (later) 
failed to its partner and vice versa.

>   2. What has changed since that last successful test? (assuming nothing)

I assume nothing, too :)), but I cannot be sure... 

>   3. What are the public, private, and VIP IPs for these nodes? 

Public and virtual IPs are in the same network, xxx.xxx.xxx.xxx/24; the private 
IPs are on a completely different one, 10.10.10.xxx/24
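
To make that layout checkable, here is a minimal Python sketch using the standard ipaddress module. The real public addresses are masked in this thread, so 192.0.2.0/24 and every host address below are hypothetical placeholders; only the 10.10.10.x/24 interconnect range comes from the description above.

```python
import ipaddress

# Hypothetical values: the real public network is masked in this thread.
PUBLIC_NET = ipaddress.ip_network("192.0.2.0/24")
# Interconnect range as described above (10.10.10.xxx/24).
PRIVATE_NET = ipaddress.ip_network("10.10.10.0/24")


def classify(addr: str) -> str:
    """Return which cluster network an address belongs to."""
    ip = ipaddress.ip_address(addr)
    if ip in PUBLIC_NET:
        return "public"
    if ip in PRIVATE_NET:
        return "private"
    return "unknown"


def vip_layout_ok(public_ip: str, vip: str, private_ip: str) -> bool:
    """True when the VIP shares the public subnet and the private IP
    sits on the interconnect subnet."""
    return (
        classify(public_ip) == "public"
        and classify(vip) == "public"
        and classify(private_ip) == "private"
    )


if __name__ == "__main__":
    # Placeholder addresses for one node: public, VIP, private.
    print(vip_layout_ok("192.0.2.11", "192.0.2.21", "10.10.10.11"))
```

A VIP that classifies as "private" or "unknown" would be one sign of the kind of misconfiguration Dan mentions below.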

> It seems
>   at least possible that somehow there's a network misconfiguration
>   (however unlikely that may be).
>   It seems unusual for a VIP resource to be in UNKNOWN state since VIPs
>   are generally lightweight and there's little effort associated with
>   failover. When resources are in UNKNOWN, I generally try "crs_stop -f
>   <resource_name>" to clear the current state. Then I'd try "crs_start
>   -c <resource_name> <node-where-you-want-it-to-start>" to see if you
>   can start it manually. Hopefully, that (possibly in combination with
>   answers to the above questions) will yield something worth
>   investigating.
>   Dan

Nodes are remote, so it's difficult to check the whole physical network 
configuration for problems/conflicts; I didn't try the crs_stop -f command, but 
I will if this issue arises again.
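
Before forcing a stop, it may help to list which resources are actually in UNKNOWN state. The following is only a sketch in Python: it assumes the long-form crs_stat output (blocks of NAME=, TYPE=, TARGET= and STATE= lines), and the sample text with node names nodeA and nodeB is invented for illustration, not taken from these servers.

```python
from typing import List, Tuple


def unknown_resources(crs_stat_output: str) -> List[Tuple[str, str]]:
    """Return (resource, state) pairs whose STATE starts with UNKNOWN.

    Assumes blocks of KEY=value lines, with NAME= preceding STATE=.
    """
    found = []
    name = None
    for line in crs_stat_output.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep:
            continue  # blank or unrecognized line
        if key == "NAME":
            name = value
        elif key == "STATE" and name and value.startswith("UNKNOWN"):
            found.append((name, value))
    return found


# Invented sample output; nodeA and nodeB are hypothetical node names.
SAMPLE = """\
NAME=ora.nodeA.vip
TYPE=application
TARGET=ONLINE
STATE=UNKNOWN on nodeB

NAME=ora.nodeB.vip
TYPE=application
TARGET=ONLINE
STATE=ONLINE on nodeB
"""

if __name__ == "__main__":
    for resource, state in unknown_resources(SAMPLE):
        print(f"{resource}: {state}")
```

Each name it reports would be a candidate for the crs_stop -f step Dan describes.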

Many thanks for your help,

Alessandro

>   Alessandro Vercelli wrote:
>
>The exact time of the crash is not clearly known; it happened on the morning 
>of May 9th and was a database crash, not a system crash. crsd.log reported 
>many messages like:
>
>2008-05-09 12:32:33.833: [  CRSEVT][3695033264]0CAAMonitorHandler :: 0:Action Script /u01/app/oracle/product/crs/bin/racgwrap(check) timed out for ora.<failednode>.ons! (timeout=600)
>
>each message referred to a different resource.
>
>Last week, I tried to restart the failed node (in the meantime, other people 
>made other attempts) and crsd.log reported, among other messages, the following:
>
>2008-07-07 16:10:18.743: [  CRSRES][3781585840]0CRS-1028: Dependency analysis failed because of:
>'Resource in UNKNOWN state: ora.<failednode>.vip'
>
>According to crs_stat -t, the ora.<failednode>.vip resource was allocated on 
>the partner node, not the failed one, and its state was UNKNOWN (as expected).
>
>My opinion is that, at the crash time, the partner node performed an automatic 
>failover but it failed; crsd.log of partner node:
>
>2008-05-09 11:55:55.278: [  CRSRES][3686595504]0Attempting to start `ora.<failednode>.vip` on member `<partnernode>`
>2008-05-09 11:56:58.305: [  CRSAPP][3686595504]0StartResource error for ora.<failednode>.vip error code = -2
>2008-05-09 11:57:05.429: [  CRSEVT][3697085360]0CAAMonitorHandler :: 0:Action Script /u01/app/oracle/product/crs/bin/racgwrap(check) timed out for ora.<failednode>.vip! (timeout=60)
>
>and, finally:
>
>2008-05-09 11:58:01.422: [  CRSRES][3686595504]0X_OP_StopResourceFailed : Stop Resource failed
>(File: rti.cpp, line: 1698
>
>2008-05-09 11:58:01.422: [  CRSRES][3686595504][ALERT]0`ora.<failednode>.vip` on member `<partnernode>` has experienced an unrecoverable failure.
>2008-05-09 11:58:01.422: [  CRSRES][3686595504]0Human intervention required to resume its availability.
>2008-05-09 11:58:01.444: [  CRSRES][3686595504]0CRS-1028: Dependency analysis failed because of:
>'Resource in UNKNOWN state: ora.<failednode>.vip'
>
>Sorry for the *mess* of messages.....
>Thanks,
>Alessandro
>
>
>If you think it's related to the resource not starting because of some 
>dependency, then I'd suggest looking at 
>$CRS_HOME/log/<nodename>/crsd/crsd.log on each node (especially the 
>crashed node) and see what's there around the time of startup.
>
>If the node won't boot, try booting it into single user mode and 
>disabling clusterware from starting if you think clusterware is what's 
>not allowing it to boot completely.
>
>Dan
>
>Alessandro Vercelli wrote:
>    
>
>O.S.: RHEL AS4
>Hardware is HP BL45P, 4 x AMD Dual core, 8 Gb RAM.
>Oracle 10.2.0.1,  RAC and Clusterware

<cut>

>The failed attempts reported on the console that the listener nodeapp could 
>not start; looking into the network configuration, I noticed that the VIP 
>address for the failing listener was not allocated on that node but on its 
>partner. Which log files do you suggest checking for errors?
>Thanks,
>Alessandro
>
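
For going through crsd.log around the failover window, as Dan suggests, a small parser can pull out the entries that mention one resource. This is a rough Python sketch, not an Oracle tool. The line layout (timestamp, [component][thread id], an optional [ALERT] tag, then a literal "0" before the message) is inferred only from the excerpts quoted in this thread, and the sample lines are those excerpts with the mail line wraps rejoined and the placeholders kept as posted.

```python
import re
from datetime import datetime
from typing import List, NamedTuple, Optional

# Layout inferred from the excerpts in this thread (an assumption):
# "<timestamp>: [  COMPONENT][thread]" + optional "[LEVEL]" + "0" + message
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"\[\s*(?P<component>\w+)\]\[(?P<thread>\d+)\]"
    r"(?:\[(?P<level>\w+)\])?0?(?P<message>.*)$"
)


class Entry(NamedTuple):
    when: datetime
    component: str
    level: Optional[str]
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse one log line; return None for continuation lines."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return Entry(when, m["component"], m["level"], m["message"])


def entries_mentioning(lines: List[str], needle: str) -> List[Entry]:
    """Return parsed entries whose message contains `needle`."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and needle in e.message]


SAMPLE_LINES = [
    "2008-05-09 11:55:55.278: [  CRSRES][3686595504]0Attempting to start "
    "`ora.<failednode>.vip` on member `<partnernode>`",
    "2008-05-09 11:56:58.305: [  CRSAPP][3686595504]0StartResource error "
    "for ora.<failednode>.vip error code = -2",
    "2008-05-09 11:58:01.422: [  CRSRES][3686595504][ALERT]0"
    "`ora.<failednode>.vip` on member `<partnernode>` has experienced an "
    "unrecoverable failure.",
    "'Resource in UNKNOWN state: ora.<failednode>.vip'",
]

if __name__ == "__main__":
    hits = entries_mentioning(SAMPLE_LINES, ".vip")
    for e in hits:
        print(e.when, e.component, e.level or "-", e.message)
    elapsed = (hits[-1].when - hits[0].when).total_seconds()
    print(f"seconds from first start attempt to failure: {elapsed}")
```

Lines that do not start with a timestamp, such as the quoted 'Resource in UNKNOWN state' continuation, are skipped rather than merged into the previous entry; that is a simplification.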

--
//www.freelists.org/webpage/oracle-l