Re: RAC Full cluster outage (almost)

  • From: LS Cheng <exriscer@xxxxxxxxx>
  • To: Christo Kutrovsky <kutrovsky.oracle@xxxxxxxxx>
  • Date: Thu, 12 Mar 2009 00:55:44 +0100

Hi

Normally, in a two-node cluster, when there are network problems on the private
interconnect the lower-numbered node always survives. This is due to the rule
that it is impossible to know which node has lost the network in a two-machine
configuration.
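
To make the rule concrete, here is a rough sketch of the tie-break as I
understand it. It is only a simplified model, not the actual CSS logic:
the function name surviving_subcluster and the "largest group wins, lowest
node number breaks ties" ordering are my assumptions for illustration.

```python
def surviving_subcluster(subclusters):
    """Pick the survivor among disjoint groups of node numbers.

    Simplified model (an assumption, not Oracle's implementation):
    the largest group wins; if sizes tie, the group that contains
    the lowest node number wins.
    """
    groups = [frozenset(g) for g in subclusters if g]
    if not groups:
        raise ValueError("no nodes given")
    # Sort key: bigger group first, then smaller minimum node number.
    return max(groups, key=lambda g: (len(g), -min(g)))


# Two nodes, interconnect down: sizes tie, so node 1 is expected to stay.
print(sorted(surviving_subcluster([{1}, {2}])))

# Three nodes, node 2 isolated: nodes 1 and 3 form the majority.
print(sorted(surviving_subcluster([{1, 3}, {2}])))
```

With only two nodes the sizes always tie, so the outcome rests entirely on the
node number. A third node lets the majority decide instead.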

This is the first time I have seen, in a two-node RAC, node 2 stay up and
node 1 get evicted when the private network went down.

My favourite is 4 nodes, but not many customers are doing that (yet).


Cheers

--
LSC


On Wed, Mar 11, 2009 at 7:09 PM, Christo Kutrovsky <
kutrovsky.oracle@xxxxxxxxx> wrote:

> Hi,
>
> We had a similar problem, except node 2 evicted node 1 via the voting
> disk, and node 1 rebooted itself.
>
> In reality, a 2-node cluster is not reliable enough when there are
> network issues, as it is unknown which server should remain up. It's a
> 50/50 chance.
>
> One approach is to have a 3-node cluster, with only 2 nodes running
> instances. The clusterware does not require any licenses; it is free.
>
> The 3rd node only serves as an arbiter of which node should remain up.
>
> --
> Christo Kutrovsky
> Senior DBA
> The Pythian Group - www.pythian.com
> I blog at http://www.pythian.com/blogs/
>
>
> On Wed, Mar 11, 2009 at 11:35 AM, LS Cheng <exriscer@xxxxxxxxx> wrote:
> > Hi
> >
> > A couple of days ago one of my customers faced an almost full cluster
> > outage in a 2-node 10.2.0.4 RAC on Sun Solaris 10 SPARC (full Oracle
> > stack).
> >
> > The sequence was as follows:
> >
> > 1. node 2 lost its private network, the interface went down
> > 2. node 1 evicted node 2 (as expected)
> > 3. node 1 then evicted itself
> > 4. after node 1 returned to the cluster and the cluster reformed from
> > one node to two nodes, node 2 lost its private network again, and this
> > time the eviction occurred on node 2
> >
> > So it was not really a full cluster outage, but the evictions occurred
> > one after another, so it looked like a full outage to the users.
> >
> > My doubt is this: in a two-node cluster node 1 always survives, which
> > was not the case here. My only theory is that node 2 was so ill that it
> > could not reboot the server, so node 1 then evicted itself to avoid
> > corruption.
> >
> > Any more ideas?
> >
> > Cheers
> >
> > --
> > LSC
> >
> >
