Re: RAC Full cluster outage (almost)

  • From: LS Cheng <exriscer@xxxxxxxxx>
  • To: Jason Heinrich <jheinrichdba@xxxxxxxxx>
  • Date: Thu, 12 Mar 2009 00:52:33 +0100

Hi

There are three voting disks

--
LSC


On Wed, Mar 11, 2009 at 7:48 PM, Jason Heinrich <jheinrichdba@xxxxxxxxx> wrote:

> The OP didn't mention how many voting disks were in the cluster.  In order
> to survive the failure of N voting disks, it is recommended that there be
> 2N+1 voting disks.  So a 2-node cluster should have 3 voting disks.
>
> //www.freelists.org/post/oracle-l/Voting-disk-TIE,5
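>
> Roughly, the rule is just a strict-majority check: a node stays in the
> cluster only while it can access more than half of the voting disks. A
> quick Python sketch of that arithmetic (my own illustration, not Oracle
> code):
>
>     # Illustrative only: strict-majority rule for voting disk access.
>     def node_survives(total_disks: int, failed_disks: int) -> bool:
>         accessible = total_disks - failed_disks
>         return accessible > total_disks // 2  # must see a strict majority
>
>     print(node_survives(3, 1))  # True: 2 of 3 is still a majority
>     print(node_survives(2, 1))  # False: 1 of 2 is only half, node is evicted
>
> So with 2N+1 = 3 disks you can lose N = 1 and still keep a majority.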
>
> --
> Jason Heinrich
>
>
>
> On Wed, Mar 11, 2009 at 1:09 PM, Christo Kutrovsky <kutrovsky.oracle@xxxxxxxxx> wrote:
>
>> Hi,
>>
>> We had a similar problem, except node 2 evicted node 1 via the voting
>> disk, and node 1 then rebooted itself.
>>
>> In reality, a 2-node cluster is not reliable enough during network
>> issues, as it is unknown which server should remain up. It's a 50/50
>> chance.
>>
>> One approach is to have a 3-node cluster, with only 2 nodes running
>> instances. The clusterware does not require any licenses; it is free.
>>
>> The 3rd node only serves as an arbiter for which node should remain up.
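>>
>> Roughly how I understand the sub-cluster survival logic (a sketch of my
>> own, not Oracle code): after an interconnect split the larger sub-cluster
>> stays up, so with a 3rd node there is never a 50/50 tie.
>>
>>     # Illustrative only: the bigger sub-cluster survives the split.
>>     def surviving_subcluster(subclusters):
>>         # subclusters: lists of node numbers, e.g. [[1, 3], [2]]
>>         return max(subclusters, key=len)
>>
>>     # 3-node cluster, node 2 loses the private network:
>>     print(surviving_subcluster([[1, 3], [2]]))  # [1, 3] stays, node 2 is evicted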
>>
>> --
>> Christo Kutrovsky
>> Senior DBA
>> The Pythian Group - www.pythian.com
>> I blog at http://www.pythian.com/blogs/
>>
>>
>> On Wed, Mar 11, 2009 at 11:35 AM, LS Cheng <exriscer@xxxxxxxxx> wrote:
>> > Hi
>> >
>> > A couple of days ago one of my customers faced an almost full cluster
>> > outage in a 2-node 10.2.0.4 RAC on Sun Solaris 10 SPARC (full Oracle stack).
>> >
>> > The sequence was as follows
>> >
>> > 1. node 2 lost the private network, the interface went down
>> > 2. node 1 evicts node 2 (as expected)
>> > 3. node 1 then evicts itself
>> > 4. after node 1 returned to the cluster and the cluster reformed from one
>> > node to two nodes, node 2 lost the private network again and this time the
>> > eviction occurred on node 2
>> >
>> > So it was not really a full cluster outage, but the evictions occurred one
>> > after another, so it looked like a full outage to the users.
>> >
>> > My doubt is, in a 2-node cluster node 1 always survives, which is not what
>> > happened in this case. My only theory is that node 2 was so ill that it
>> > could not reboot the server, so node 1 then evicted itself to avoid
>> > corruption.
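>> >
>> > My understanding of the tie-break (an assumption on my part, not taken
>> > from the clusterware code): the larger sub-cluster survives, and on an
>> > equal split the sub-cluster with the lowest node number wins, which is
>> > why I expected node 1 to stay up. A rough sketch:
>> >
>> >     # Illustrative only: bigger sub-cluster first, then lowest node number.
>> >     def surviving_subcluster(subclusters):
>> >         return max(subclusters, key=lambda sc: (len(sc), -min(sc)))
>> >
>> >     print(surviving_subcluster([[1], [2]]))  # [1] -> node 1 should survive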
>> >
>> > Any more ideas?
>> >
>> > Cheers
>> >
>> > --
>> > LSC
>>
>>
>>
>>
>
