Re: Newbie Oracle RAC issue

From: Mark Bobak <Mark.Bobak@xxxxxxxxxxxx>
To: Chris King <ckaj111@xxxxxxxx>, "oracle-l@xxxxxxxxxxxxx" <oracle-l@xxxxxxxxxxxxx>
Date: Wed, 30 Apr 2014 18:52:14 +0000

No, you misunderstand.

What they did, mounting new storage and copying to a new mount point, is 
definitely *not* ok!

What I was saying is that it should be easy enough, if you're running LVM or 
some other volume management solution, to grow a LUN and then grow the 
filesystem live.  This is what your sysadmins need to do if you're in this 
situation in the future.
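
For illustration, a minimal sketch of what "grow the LUN, then grow the 
filesystem live" can look like on Linux with LVM. The device and volume names 
(/dev/sdb, /dev/vg_root/lv_root) are hypothetical placeholders, not taken from 
this system, and the sketch assumes the storage team has already enlarged the 
LUN. By default it only prints the commands (DRY_RUN=1).

```shell
#!/bin/sh
# Sketch only: device and volume names below are hypothetical placeholders.
# With DRY_RUN=1 (the default) the commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Grow an LVM-backed filesystem online after the underlying LUN was enlarged.
#   $1 = physical volume (e.g. /dev/sdb), $2 = logical volume path
grow_online() {
    pv="$1"
    lv="$2"
    run pvresize "$pv"                   # let LVM see the larger LUN
    run lvextend -r -l +100%FREE "$lv"   # extend the LV; -r also resizes the filesystem
}

grow_online /dev/sdb /dev/vg_root/lv_root
```

Run it once as-is to review the printed commands, then as root with DRY_RUN=0 
once the names match the real system. Nothing is unmounted or copied, so the 
Oracle homes keep their original paths.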

What they did pretty much hosed you.

If I were onsite, maybe I could try untangling it, but it may be easier for you 
to wipe and re-install.  And tell the sysadmins I said "What the &*!?&* were 
you thinking??" :-)

One thing you might try, before wiping and re-installing, would be to shut 
everything down (assuming anything is up now, which I'm guessing it isn't), 
remove the old filesystem (the one that had filled up), and remount the new 
filesystem at the same path the old one had.  I'm not making any promises, 
but it's worth a quick try.  If it's still dead, a total wipe and reload may 
be easiest.
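
A rough sketch of that remount attempt, in dry-run form. Every path and device 
here (/u01 as the original location, /dev/sdc1, /mnt/u01_new) is a hypothetical 
placeholder; the thread does not say where the copy was mounted. One deliberate 
difference from the description above: the sketch renames the old directory 
instead of removing it, so the attempt can be undone.

```shell
#!/bin/sh
# Sketch only: every path and device here is a hypothetical placeholder.
# With DRY_RUN=1 (the default) the commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1 = original path of the Oracle software
# $2 = device holding the new copy
# $3 = mount point the copy currently lives on
remount_at_old_path() {
    old="$1"
    dev="$2"
    tmp="$3"
    run crsctl stop crs -f       # stop whatever part of the stack is still up
    run umount "$tmp"            # release the new filesystem
    run mv "$old" "$old.full"    # set the filled-up copy aside, do not delete it
    run mkdir "$old"
    run mount "$dev" "$old"      # the new filesystem now answers at the old path
    run crsctl start crs
}

remount_at_old_path /u01 /dev/sdc1 /mnt/u01_new
```

If this were run for real (as root, DRY_RUN=0), an /etc/fstab entry would also 
be needed for the mount to survive a reboot, and ownership and permissions on 
the copied files would have to match the originals.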

-Mark

From: Chris King <ckaj111@xxxxxxxx>
Reply-To: Chris King <ckaj111@xxxxxxxx>
Date: Wednesday, April 30, 2014 at 2:45 PM
To: Mark Bobak <Mark.Bobak@xxxxxxxxxxxx>, "oracle-l@xxxxxxxxxxxxx" 
<oracle-l@xxxxxxxxxxxxx>
Subject: Re: Newbie Oracle RAC issue

The alert log for the remaining node in the grid home says:

Oracle Database 11g Clusterware Release 11.2.0.3.0 - Production Copyright 1996, 2011 Oracle. All rights reserved.
2014-04-30 13:25:39.673
[client(3130)]CRS-2317:Fatal error: cannot get local GPnP security keys (wallet).
2014-04-30 13:25:39.674
[client(3130)]CRS-2316:Fatal error: cannot initialize GPnP, CLSGPNP_ERR (Generic GPnP error).
2014-04-30 13:25:39.684
[client(3130)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/rac1/client/ocrconfig_3130.log.


I executed the command:
$  ./crsctl start crs
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.

There were no further lines written to the alert log after this command was 
issued.

The ocrconfig_3130.log file contains the following:
Oracle Database 11g Clusterware Release 11.2.0.3.0 - Production Copyright 1996, 2011 Oracle. All rights reserved.
2014-04-30 13:25:35.969: [ OCRCONF][2705233664]ocrconfig starts...
2014-04-30 13:25:36.064: [  OCRMSG][2705233664]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2)
2014-04-30 13:25:36.064: [  OCRMSG][2705233664]GIPC error [29] msg [gipcretConnectionRefused]
2014-04-30 13:25:36.064: [  OCRMSG][2705233664]prom_connect: error while waiting for connection complete [24]
2014-04-30 13:25:36.064: [ OCRCONF][2705233664]Failure initializing OCR in DEFAULT. Trying REBOOT. err :[PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]]


Yes, a disk was added and mounted, and then all the Oracle software was copied 
to the new mount point. So, okay, I'm glad to know this is okay to do, even 
with the cluster/database running.


On Wednesday, April 30, 2014 2:30:58 PM, Mark Bobak 
<Mark.Bobak@xxxxxxxxxxxx> wrote:
What do you see in $GRID_HOME/log/`hostname -s`/alert`hostname -s`.log ?

What happens if you do 'crsctl start crs'?  What other info do you see in that 
log file after attempting that command?
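
A small sketch of locating and tailing that alert log. The Grid home 
(/u01/app/11.2.0/grid) and node name (rac1) are taken from the log excerpt 
elsewhere in this thread; alert_log_path is a hypothetical helper name, and 
the path layout is the 11.2 one quoted above.

```shell
#!/bin/sh
# Build the clusterware alert log path for an 11.2 Grid home.
#   $1 = Grid home, $2 = short host name (defaults to `hostname -s`)
alert_log_path() {
    grid_home="$1"
    node="${2:-$(hostname -s)}"
    echo "$grid_home/log/$node/alert$node.log"
}

# Grid home and node name as they appear in this thread; adjust as needed.
log=$(alert_log_path /u01/app/11.2.0/grid rac1)
echo "alert log: $log"

# Show the most recent entries if the file is readable on this host.
if [ -r "$log" ]; then
    tail -n 50 "$log"
fi
```

Running the tail right after a failed 'crsctl start crs' shows whether the 
attempt wrote anything new to the log.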

When you say "a disk was added and files copied", are you saying they added a 
disk, mounted a new f/s, and copied stuff over to new mount point?  It should 
be relatively straightforward to grow a filesystem live.  I know our admins do 
it all the time.

-Mark

From: Chris King <ckaj111@xxxxxxxx>
Reply-To: Chris King <ckaj111@xxxxxxxx>
Date: Wednesday, April 30, 2014 at 2:21 PM
To: "oracle-l@xxxxxxxxxxxxx" <oracle-l@xxxxxxxxxxxxx>
Subject: Newbie Oracle RAC issue

Had a successful first install of Oracle RAC 11gR2 on RHEL6 in the lab... but 
we were running out of disk on the root drive, where Oracle software is 
installed. In my absence, disk was added, and files copied while the 
cluster/database was running. Subsequently one node crashed and is not 
recoverable. The remaining node keeps throwing this error when I attempt to 
start the clusterware:

$  crsctl start cluster
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Start failed, or completed with errors.

I'm unable to start the clusterware. I looked at the log file, and saw 
references to failures reaching the crashed node, so I thought maybe I have to 
tell the clusterware that we're missing a node, but all the commands I've found 
to do so require cluster services to be running.

What else should I be looking at to diagnose this? I'm trying to evaluate if I 
have to reinstall everything from scratch or if this lab setup can be 
salvaged. Thanks!

Also, please note that the following is the only cluster-related process I 
find running on the remaining node:
root      1557     1  0 11:50 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
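
For reference, a sketch of the kind of check that produces a listing like the 
one above. The daemon names in the pattern are the usual 11.2 Grid 
Infrastructure ones, but the list is an assumption and not exhaustive; 
cluster_procs is a hypothetical helper name.

```shell
#!/bin/sh
# Filter a process listing (read from stdin) down to clusterware daemons.
# The name list is a best-effort assumption for 11.2, not exhaustive.
cluster_procs() {
    grep -E 'ohasd|crsd|ocssd|evmd|gpnpd|gipcd|mdnsd' | grep -v grep
}

ps -ef | cluster_procs || echo "no clusterware processes found"
```

If only the init.ohasd wrapper script shows up and no ohasd.bin, the High 
Availability Services daemon itself is most likely not running, which would be 
consistent with the CRS-4639 error.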