ASM Disks Dropping on RHEL 6.3 - Practical Limit to Disks?

From: David Barbour <david.barbour1@xxxxxxxxx>
To: oracle-l mailing list <oracle-l@xxxxxxxxxxxxx>
Date: Wed, 16 Apr 2014 16:34:08 -0500
I was warned about a certain SAN.  Regardless, what's happening now
probably is not directly related, but it could be.

3-Node RAC
Dell R720
Dual Quad-Core Intel Xeon E5-2643 0 @ 3.30GHz
384GB RAM
GI 11.2.0.3
Database 11.2.0.3
488 ASM Disks

Yesterday the bottom fell out of our test RAC.  Node2 just lost drives.
While I was trying to diagnose the problem, OEM alerted that it had lost
contact with Node1.  When I tried to log into Node1, there was 'no route to
host.'  So I engaged the sys admins on that one and went back to looking at
Node2.  I couldn't start the failed instances, nor could I stop them.  Nor
could I stop crs on the Node.  I should have saved the output, but the
bottom line is when I ran crsctl stop crs it failed.  Running srvctl stop
database -d failed.  So I logged into the instance and shut it down the
old-fashioned way.  When I finally got most everything stopped, I rebooted
the box.  Nothing came up.  Here's an abbreviated output from crsctl:

/rchr1t02/
/oracle/D00 #  crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  INTERMEDIATE rchr1t02                 OCR not started
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rchr1t02
ora.crf
      1        ONLINE  ONLINE       rchr1t02
ora.crsd
      1        ONLINE  OFFLINE
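
For anyone who wants to pull the unhealthy resources out of a listing like this automatically, here is a minimal Python sketch. It assumes the layout shown above (a resource name starting with "ora." followed by instance rows) and treats any row whose STATE differs from its TARGET as unhealthy. The function name is made up for illustration, and rows that carry STATE_DETAILS but no SERVER are not handled.

```python
import re

# Matches an instance row such as:
#   "      1        ONLINE  INTERMEDIATE rchr1t02      OCR not started"
# SERVER and STATE_DETAILS are optional (an OFFLINE row may have neither).
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)(?:\s+(\S+))?(?:\s+(.*\S))?\s*$")


def unhealthy_resources(text):
    """Return (resource, target, state, details) for rows where STATE != TARGET.

    Hypothetical helper: 'text' is the captured output of
    'crsctl stat res -t -init' in the layout shown in the post.
    """
    results = []
    resource = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ora."):
            # Resource name line; the rows that follow belong to it.
            resource = stripped
            continue
        match = ROW.match(line)
        if match and resource:
            _, target, state, _server, details = match.groups()
            if state != target:
                results.append((resource, target, state, details or ""))
    return results
```

Run against the listing above, this reports ora.asm (INTERMEDIATE, "OCR not started") and ora.crsd (OFFLINE).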


Makes sense because the +GRID diskgroup that has the OCR didn't mount. I've
been through a lot of Oracle Docs on this.  ocssd.bin and evmd.bin and haip
were all running.  Just couldn't bring up the diskgroup(s).  Oh, and while
I'm struggling with this, the Sys Admin rebooted Node1 because he couldn't
log in through the console either.  That Node ended up in the same state as
Node2.  Node3 meanwhile is chugging along.  At least until it was
rebooted.  Now I've got 3 servers up, but no ASM disks.

Messages shows stuff like:

Apr 15 19:36:27 rchr1t02 udevd[13929]: worker [29550] unexpectedly returned with status 0x0100
Apr 15 19:37:44 rchr1t02 udevd[13929]: worker [52527] failed while handling '/devices/virtual/block/asm!.asm_ctl_vbg2'
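
To see whether the udevd failures cluster on particular devices, the messages can be tallied with a short script. This is only a sketch in Python: the two patterns are written against the two message shapes quoted above, the function name is made up, and other udevd messages are ignored.

```python
import re
from collections import Counter

# Patterns written against the two udevd message shapes quoted in the post.
FAILED = re.compile(
    r"udevd\[\d+\]: worker \[(\d+)\] failed while handling '([^']+)'"
)
EXITED = re.compile(
    r"udevd\[\d+\]: worker \[(\d+)\] unexpectedly returned with status "
    r"(0x[0-9a-fA-F]+)"
)


def summarize_udev_errors(lines):
    """Count failed device paths and worker exit statuses in syslog lines.

    Hypothetical helper: returns (failed_devices, exit_statuses), both
    Counters keyed by device path and status string respectively.
    """
    failed_devices = Counter()
    exit_statuses = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            failed_devices[match.group(2)] += 1
            continue
        match = EXITED.search(line)
        if match:
            exit_statuses[match.group(2)] += 1
    return failed_devices, exit_statuses

# Example use against the system log (requires read access to the file):
#   with open("/var/log/messages") as log:
#       devices, statuses = summarize_udev_errors(log)
#   for path, count in devices.most_common(20):
#       print(count, path)
```

A list dominated by a handful of device paths would point at specific LUNs or rules, while failures spread evenly across hundreds of paths would fit better with udev being overwhelmed at boot.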

Red Hat suggested a workaround by upping the udev timeout and limiting the
number of udev worker processes (which appears to be a function of total
memory on this release) on boot.  This didn't help.  Eventually I was able
to stop and start multipathd, reload udev rules, log in to the ASM instance
(which was stuck at 'ONLINE' 'INTERMEDIATE') and mount the disk groups
manually.  I say eventually, because some mounted right away, others gave
me permission errors but then mounted many minutes later.

Totally freakin' weird.

Has anyone experienced anything like this?  I've opened an SR, but folks
around here want to do a root cause analysis right now and I don't have
anything to say except it appears the disks no longer mount on boot, I may
or may not be able to bring them up manually, and it could happen again.

We've rebooted nodes on this RAC numerous times without incident.  Why
now?  The Storage, Systems and Network folks swear nothing has changed.
Except there was a firmware update to the DRAC.  Oh, and they put a new
route on the boxes to accommodate a new set of IPs we're introducing.  But
other than that...........

cluvfy comes back clean.

Is there a practical limit to the number of disks?  I know ASM is limited
to 63 diskgroups.