RE: Oracle clusterware related question

  • From: "Hameed, Amir" <Amir.Hameed@xxxxxxxxx>
  • To: Mathias Zarick <Mathias.Zarick@xxxxxxxxxxxx>, Martin Berger <martin.a.berger@xxxxxxxxx>
  • Date: Thu, 10 May 2012 09:23:59 -0400

I looked at the document "generic RAC System Test Plan Outline for 11gR2" listed 
in note "RAC and Oracle Clusterware Best Practices and Starter Kit (Platform 
Independent) [ID 810394.1]". Tests #18 and #19 state:

Test 18: Node Loses Access to Disks with CSS Voting Device
Expected Results: For 11.2.0.2 and above:
        CSS will detect this and evict the node as follows:
                o All I/O capable client processes will be terminated and all 
resources will be cleaned up. If process termination and/or resource cleanup 
does not complete successfully the node will be rebooted.
                o Assuming that the above has completed successfully, OHASD 
will attempt to restart the stack. In this case the stack will be restarted 
once the network connectivity of the private interconnect network has been 
restored.
                o Review the following logs:
                        o $GI_HOME/log/<nodename>/alert<nodename>.log
                        o $GI_HOME/log/<nodename>/cssd/ocssd.log
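
(An aside, not from the note: to watch this happen from a surviving node, 
something like the following should work on 11.2; $GI_HOME is the grid home and 
`hostname -s` stands in for <nodename>.)

        # List the configured voting disks and their state (ONLINE/OFFLINE)
        $GI_HOME/bin/crsctl query css votedisk

        # Follow the cluster alert log while the test runs
        tail -f $GI_HOME/log/$(hostname -s)/alert$(hostname -s).log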

Test 19: Node Loses Access to Disks with OCR Device(s)
        CRSD will detect the failure of the OCR device and abort. OHASD will 
attempt to restart CRSD 10 times after which manual intervention will be 
required.
                o The database instance, ASM instance and listeners will not be 
impacted.
                o Review the following logs:
                        o $GI_HOME/log/<nodename>/crsd/crsd.log
                        o $GI_HOME/log/<nodename>/alert<nodename>.log
                        o $GI_HOME/log/<nodename>/ohasd/ohasd.log
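
(Again as an aside: after the test, OCR integrity and stack health can be 
checked with the standard tools, e.g.:)

        # Verify OCR device integrity (run as root for the full logical check)
        $GI_HOME/bin/ocrcheck

        # Confirm that the HA services, CRS, CSS and EVM are back online
        $GI_HOME/bin/crsctl check crs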

The note does not say what will happen if CRS is unable to write to its log 
files. But it seems that when CRS loses access to the voting disks, it will 
reboot the node if resource clean-up does not succeed.

Amir
-----Original Message-----
From: Mathias Zarick [mailto:Mathias.Zarick@xxxxxxxxxxxx] 
Sent: Thursday, May 10, 2012 3:08 AM
To: Martin Berger
Cc: tim@xxxxxxxxx; oracle-l@xxxxxxxxxxxxx; Hameed, Amir
Subject: RE: Oracle clusterware related question

Hi Martin,

The tests were like this:
- CRS and CRS logs on SAN disks -> unplug cables from HBA -> node does not reset
- CRS and CRS logs on local disks -> unplug cables from HBA -> node does reset

In Amir's setup the CRS home and CRS logs are on NFS, but the problem is the 
same here. And you probably will not find anything in the log files if the 
cluster processes cannot write to them. :-)

As you stated, a node cannot commit suicide if its processes are hanging in an 
I/O path. This is of course different in configurations with IPMI, where a 
surviving node can reset the hung node directly; a rough sketch of the setup is 
below.
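
(If I remember correctly, on 11.2 the per-node setup is roughly this; the user 
name and BMC address are placeholders:)

        # Store the IPMI admin credential and BMC address in the local registry
        crsctl set css ipmiadmin ipmiuser
        crsctl set css ipmiaddr 192.168.10.45

        # Verify that the local IPMI device is usable
        crsctl query css ipmidevice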

Cheers Mathias

-----Original Message-----
From: Martin Berger [mailto:martin.a.berger@xxxxxxxxx] 
Sent: Tuesday, May 08, 2012 8:03 PM
To: Amir.Hameed@xxxxxxxxx
Cc: tim@xxxxxxxxx; Mathias Zarick; oracle-l@xxxxxxxxxxxxx
Subject: Re: Oracle clusterware related question

Amir,

in Oracle Clusterware no node can be evicted directly by the remote nodes.
The 'others' can only exclude a node from the cluster and hope that it commits 
suicide.

The problem here is that on your hanging node the clusterware processes are 
hanging in I/O to their log files. As your NFS mount does not disappear, the 
file handles are still open. It seems writing to the log files is a synchronous 
task, so while they hang in file I/O the processes cannot do higher-priority 
tasks such as killing the node.
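
(You can usually see this from the OS: processes stuck in uninterruptible I/O 
show state 'D'. On Linux, for example:)

        # Show clusterware daemons stuck in uninterruptible sleep (state D)
        ps -eo pid,stat,wchan:32,cmd | egrep 'ocssd|crsd|ohasd'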

You can try to mount your log directories 'soft'; maybe that solves the hanging 
issue. But I don't know what side effects this might cause! An example entry is 
below.
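
(Something like this in /etc/fstab, untested; server, export and mount point 
are placeholders:)

        # 'soft' makes a write fail with an error after timeo/retrans expires,
        # instead of blocking forever the way a 'hard' mount does
        filer01:/vol/crslogs  /u01/app/11.2.0/grid/log  nfs  soft,timeo=30,retrans=3  0 0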

I am not sure whether CRS shows the same behavior when the log file write hangs 
(as on NFS) or when it fails (as when the mount points disappear because the 
SAN network is removed). Mathias, do you remember the details? But since those 
tests were back on 11.2.0.1, I probably should run the test case again.

I second Mathias: grid logs (and also grid binaries) should be local!
Everything else, like RDBMS binaries and logs, can be on any remote system. A 
quick check is below.
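
(To see which filesystem the grid home and its log directory actually live on:)

        # Both should report a local device, not an NFS server:/export path
        df -hP $GI_HOME $GI_HOME/log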

hth
 Martin

On Tue, May 8, 2012 at 6:11 PM, Hameed, Amir <Amir.Hameed@xxxxxxxxx> wrote:
> So, if voting disks are not updated by a certain node for any reason 
> for an extended period of time, that node would not be evicted by the 
> remote nodes from the cluster?
>
>
> From: Tim Gorman [mailto:tim@xxxxxxxxx]
> Sent: Tuesday, May 08, 2012 12:05 PM
> To: Mathias.Zarick@xxxxxxxxxxxx; Hameed, Amir
> Cc: oracle-l@xxxxxxxxxxxxx
> Subject: Re: Oracle clusterware related question
>
>
>
> Mathias hit the nail on the head.  Think about it this way:  NFS 
> errors and disconnects typically do not kill running programs, but 
> cause them to hang.  If the binaries for the clusterware are 
> themselves on NFS, then clearly they are going to hang also.
--
//www.freelists.org/webpage/oracle-l

