Re: Oracle RAC nodes eviction question

  • From: Moovarkku Mudhalvan <oraclebeanz@xxxxxxxxx>
  • To: Riyaj Shamsudeen <riyaj.shamsudeen@xxxxxxxxx>
  • Date: Thu, 14 Aug 2014 08:35:44 +0900

Hi Shamsudeen

     In reality, yes, in your test (2) the node should have been evicted.
Setting aside test (1) in your email, since that is a controlled failover,
test (2) should have rebooted the clusterware: from the description of the
test, the GI had no access to the voting disks or the binaries. Is it
possible that the voting disks/binaries were still available through
failed-over paths? Can you tell us about the state of the node after 15
minutes? Was the GI still running or not?
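
If you repeat the test, one quick way to capture that state is to poll the
stack from the affected node for the duration of the cable pull. A minimal
sketch along those lines is below; the GI home path is only an assumption
and has to be adjusted, and if the mount is hard-mounted the call itself may
simply block, which is telling in its own right.

    #!/usr/bin/env python3
    # Minimal sketch: poll the GI stack state during a cable-pull test.
    # GRID_HOME is a hypothetical path; adjust it to the actual 11.2 GI home.
    import subprocess
    import time

    GRID_HOME = "/u01/app/11.2.0/grid"   # assumption, not from this thread

    def gi_stack_status():
        # 'crsctl check crs' reports whether the HA services, CRS, CSS and
        # EVM are online; capture the return code and output for later review.
        cmd = [GRID_HOME + "/bin/crsctl", "check", "crs"]
        res = subprocess.run(cmd, stdout=subprocess.PIPE,
                             stderr=subprocess.STDOUT, universal_newlines=True)
        return res.returncode, res.stdout

    # Sample once a minute for roughly the 15-minute test window.
    for _ in range(15):
        rc, out = gi_stack_status()
        print(time.ctime(), "return code:", rc)
        print(out)
        time.sleep(60)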

Cheers

Riyaj Shamsudeen
Principal DBA,
Ora!nternals -  http://www.orainternals.com - Specialists in Performance,
RAC and EBS
Blog: http://orainternals.wordpress.com/
Oracle ACE Director and OakTable member <http://www.oaktable.com/>

Co-author of the books: Expert Oracle Practices
<http://tinyurl.com/book-expert-oracle-practices/>, Pro Oracle SQL
<http://tinyurl.com/ahpvms8>, Expert RAC Practices 12c
<http://tinyurl.com/expert-rac-12c>, Expert PL/SQL Practices
<http://tinyurl.com/book-expert-plsql-practices>



On Wed, Aug 13, 2014 at 3:09 PM, Moovarkku Mudhalvan <oraclebeanz@xxxxxxxxx>
wrote:

> Hi Amir
>
>      It looks to me as if you tested a NAS head failover. That means at
> least one of the NAS heads was still available to your nodes.
>
>      The actual problem you faced was that a NAS head failed for a few
> minutes, which means the nodes lost their connection to it entirely.
> On Aug 14, 2014 6:38 AM, "Hameed, Amir" <Amir.Hameed@xxxxxxxxx> wrote:
>
>>  Thanks Riyaj and Martin.
>>
>> So, based on your responses, it seems that if either the Grid binaries or
>> the Grid log files become inaccessible, that node will be evicted. This
>> does not match the testing that I have done in the past, but it does match
>> the recent event in which, after the NAS head failed, all RAC nodes were
>> rebooted. This is how we had tested it in the past and saw no impact:
>>
>>
>>
>> Prior to going live last year, we conducted destructive tests on the same
>> hardware on which production was going to go live. We used the same storage
>> NAS head that production was going to use. We then took a copy of
>> production, which was a single instance at that time, and RAC'd it across
>> four nodes on the new hardware. At that point we had a copy of production,
>> RAC'd across four nodes and looking exactly like what production was going
>> to look like. We then conducted a lot of destructive tests, including the
>> following:
>>
>> 1.       To test the resilience of Oracle RAC in the event of NAS head
>> failure, we ran the following tests while the Grid and the database were up
>> and running: (a) we failed over the NAS head to its standby counterpart via
>> a controlled failover, and RAC stayed up; (b) we induced a panic to force
>> the NAS head to fail over, and the environment stayed up.
>>
>> 2.       On one of the RAC nodes, we pulled both cables of the
>> LACP/bonded NIC which was used to mount storage for binaries, voting disks,
>> etc., and left it like that for over 15 minutes. I was expecting an
>> eviction, primarily because the voting disks were not available on this
>> node, but nothing happened.
>>
>>
>>
>> This is why I am a bit confused and trying to figure out why I am seeing
>> different results.
>>
>>
>>
>> Thanks
>>
>> *From:* Riyaj Shamsudeen [mailto:riyaj.shamsudeen@xxxxxxxxx]
>> *Sent:* Wednesday, August 13, 2014 4:32 PM
>> *To:* Hameed, Amir
>> *Cc:* oracle-l@xxxxxxxxxxxxx
>> *Subject:* Re: Oracle RAC nodes eviction question
>>
>>
>>
>> Hello Amir
>>
>>    Losing the binaries can, and most probably will, lead to node eviction.
>> When there is a fault for an executable page that is not in the page cache,
>> that page needs to be paged in from the binary. If the binary is not
>> available, then the GI processes will be killed. The death of the GI
>> processes leads to events such as missed heartbeats and, finally, to node
>> eviction. From 11gR2 onwards, a GI restart is attempted before the node is
>> restarted. Possibly that file system was still unavailable during the GI
>> restart attempt, so it would have led to an eventual node restart.
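
That chain is also easy to watch from the outside during a test: check, side
by side, whether the GI binaries can still be read and whether ocssd is still
alive. The sketch below is only an illustration of that idea, not the actual
GI mechanism; the paths and process name are assumptions for an 11.2 install
on a Unix-like host, and on a hard NFS mount the read may hang rather than
fail.

    #!/usr/bin/env python3
    # Illustration only: correlate "binary still readable?" with
    # "ocssd still alive?" during an NFS outage.
    import os
    import subprocess
    import time

    GRID_HOME = "/u01/app/11.2.0/grid"               # hypothetical GI home on NFS
    OCSSD_BIN = os.path.join(GRID_HOME, "bin", "ocssd.bin")

    def binary_readable(path):
        # A demand page-in is only possible if the file can still be read;
        # on a hard NFS mount this read may block instead of failing.
        try:
            with open(path, "rb") as f:
                f.read(4096)
            return True
        except OSError:
            return False

    def ocssd_running():
        # Rough liveness check via ps; a dead ocssd means missed heartbeats
        # and, as described above, eventual eviction or a GI/node restart.
        out = subprocess.run(["ps", "-eo", "comm"], stdout=subprocess.PIPE,
                             universal_newlines=True).stdout
        return "ocssd.bin" in out

    while True:
        print(time.ctime(),
              "binary readable:", binary_readable(OCSSD_BIN),
              "ocssd running:", ocssd_running())
        time.sleep(10)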
>>
>>
>>
>>    This is analogous to removing the oracle binary while the database is
>> up (in that case, too, the database will eventually crash).
>>
>>
>>
>>    I guess an option to avoid node eviction due to the loss of binaries
>> mounted through NFS is to keep the GI and RDBMS homes local; still, that
>> has its own risks. Of course, in big cluster environments, this is easier
>> said than done.
>>
>>
>>   Cheers
>>
>> Riyaj Shamsudeen
>> Principal DBA,
>> Ora!nternals -  http://www.orainternals.com - Specialists in
>> Performance, RAC and EBS
>> Blog: http://orainternals.wordpress.com/
>> Oracle ACE Director and OakTable member <http://www.oaktable.com/>
>>
>> Co-author of the books: Expert Oracle Practices
>> <http://tinyurl.com/book-expert-oracle-practices/>, Pro Oracle SQL,
>> <http://tinyurl.com/ahpvms8>Expert RAC Practices 12c.
>> <http://tinyurl.com/expert-rac-12c> Expert PL/SQL practices
>> <http://tinyurl.com/book-expert-plsql-practices>
>>
>>
>>
>>
>>
>> On Wed, Aug 13, 2014 at 12:57 PM, Hameed, Amir <Amir.Hameed@xxxxxxxxx>
>> wrote:
>>
>>  Folks,
>>
>> I am trying to understand the behavior of an Oracle RAC cluster if the
>> Grid and RAC binary homes become unavailable while the cluster and Oracle
>> RAC are running. The Grid version is 11.2.0.3 and the platform is Solaris
>> 10. The Oracle Grid and the Oracle RAC environments are on NAS, with the
>> database configured with dNFS. The storage for the Grid and RAC binaries
>> comes from one NAS head, whereas the OCR and voting disks (three of each)
>> are spread over three NAS heads so that, in the event that one NAS head
>> becomes unavailable, the cluster can still access two voting disks. The
>> recommendation for this configuration came from the storage vendor and
>> Oracle. What we observed was that last weekend, when the NAS head from
>> which the Grid and RAC binaries were mounted went down for a few minutes,
>> all RAC nodes were rebooted even though two voting disks were still
>> accessible. In my destructive testing about a year ago, one of the tests
>> run was to pull all cables of the NICs used for kernel NFS on one of the
>> RAC nodes, but the cluster did not evict that node. Any feedback will be
>> appreciated.
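
As a side note on verifying that layout during such an outage: the supported
check is 'crsctl query css votedisk' from the GI home, but since the voting
disks here are NFS files, a quick reachability probe of the three paths from
each node also shows which copies that node can still see. The sketch below
uses purely hypothetical paths.

    #!/usr/bin/env python3
    # Sketch only: probe the three voting disk copies spread across three
    # NAS heads. All paths below are hypothetical placeholders.
    import os

    VOTING_DISKS = [
        "/u01/crs/vote/vote1",   # assumed mount from NAS head 1
        "/u02/crs/vote/vote2",   # assumed mount from NAS head 2
        "/u03/crs/vote/vote3",   # assumed mount from NAS head 3
    ]

    for vd in VOTING_DISKS:
        try:
            # On a hard NFS mount a stat can hang rather than fail when the
            # head is gone, so a hang here is itself a useful signal.
            st = os.stat(vd)
            print(vd, "reachable, size", st.st_size)
        except OSError as exc:
            print(vd, "NOT reachable:", exc)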
>>
>>
>>
>> Thanks,
>>
>> Amir
>>
>>
>>
>
