Re: Doubt about timeout between nodes of cluster

From: "Waldirio Manhães Pinheiro" <waldirio@xxxxxxxxx>
To: "Riyaj Shamsudeen" <riyaj.shamsudeen@xxxxxxxxx>
Date: Thu, 12 Jun 2008 21:22:19 -0300
   Hello Riyaj

  I'm reinstalling the operating system on the machines, and tomorrow I'll
reinstall Oracle RAC. With the default settings in place I'll check the crsd
logs first, and then try the tuning again.

Thanks again.

PS: In general, how long does it take between the first node stopping and the
second node bringing up the first node's VIP interface?

Good Night All.
Waldirio

2008/6/12 Riyaj Shamsudeen <riyaj.shamsudeen@xxxxxxxxx>:

> Hello Waldirio
>  Breaking down crsd.log, approximately 30 seconds are spent on the CLSC
> recv/send failures etc. The CSS misscount parameter is set to 30 on Unix
> platforms. I would say misscount is controlling this duration, but that
> needs to be validated by enabling further tracing and looking at cssd.log
> etc., if you want.
>
> 2008-06-12 14:19:15.781: [  OCRMSG][1484962144]prom_rpc: CLSC recv
> failure..ret code 7
> 2008-06-12 14:19:42.464: [  OCRMSG][1484962144]prom_rpc: CLSC send
> failure..ret code 6
>
>  Another 26 seconds are spent in the cluster reconfiguration below:
>
> 2008-06-12 14:19:46.036: [  OCRSRV][2541411904]proath_init: Failed to
> retrieve pubdata. Expect a rcfg
> 2008-06-12 14:20:12.283: [  OCRMAS][1210108256]th_master:12: I AM THE NEW
> OCR MASTER at incar 1. Node Number 1
>
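The durations discussed above can be read straight off the log timestamps. What follows is only a sketch in Python, not something posted in the thread. It assumes each crsd.log entry begins with a `YYYY-MM-DD HH:MM:SS.mmm:` timestamp followed by a bracketed component name, as in the four excerpts above (rejoined here onto single lines). The names `parse_entry` and `gaps` are made up for illustration.

```python
import re
from datetime import datetime

# The four crsd.log entries quoted above, rejoined onto single lines.
LOG = """\
2008-06-12 14:19:15.781: [  OCRMSG][1484962144]prom_rpc: CLSC recv failure..ret code 7
2008-06-12 14:19:42.464: [  OCRMSG][1484962144]prom_rpc: CLSC send failure..ret code 6
2008-06-12 14:19:46.036: [  OCRSRV][2541411904]proath_init: Failed to retrieve pubdata. Expect a rcfg
2008-06-12 14:20:12.283: [  OCRMAS][1210108256]th_master:12: I AM THE NEW OCR MASTER at incar 1. Node Number 1
"""

# Assumed entry layout: "<timestamp>: [ <COMPONENT>]..." (taken from the excerpts only).
ENTRY = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): \[\s*(\w+)\]")


def parse_entry(line):
    """Return (timestamp, component) for a log line, or None if it does not match."""
    m = ENTRY.match(line)
    if m is None:
        return None
    stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return stamp, m.group(2)


def gaps(text):
    """Return (from_component, to_component, seconds) for consecutive entries."""
    entries = [e for e in map(parse_entry, text.splitlines()) if e is not None]
    return [
        (a[1], b[1], (b[0] - a[0]).total_seconds())
        for a, b in zip(entries, entries[1:])
    ]


if __name__ == "__main__":
    for src, dst, seconds in gaps(LOG):
        print(f"{src} -> {dst}: {seconds:.3f} s")
```

On these four entries the gaps come to about 26.7 s, 3.6 s and 26.2 s, which is roughly 56.5 s from the first receive failure to the new OCR master.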
>  Changing these parameters has a profound effect on availability, especially
> if the network architecture is not good enough.
>
>  Cheers
> Riyaj Shamsudeen
> The Pythian Group: http://www.pythian.com/
> Personal blog: http://orainternals.wordpress.com/
>
> Waldirio Manhães Pinheiro wrote:
>
>>    Hello, friend
>>    Thank you for the answer. Let's check.
>>  2008/6/12, Riyaj Shamsudeen <riyaj.shamsudeen@xxxxxxxxx>:
>>
>>    Hello Waldirio
>>      >> the time to the first machine detect the second machine
>>    powered off is very big (between 1 and 2 min),
>>     How are you measuring this time? Are you checking the alert log or
>>    are you using DB connections to check it?
>>
>>     I measured this time starting from when I sent the shutdown to the
>> server until the second VIP interface came up on the second node (the
>> backup node).
>>
>>     Can you also send crsd.log?
>>
>>  OK, here is a link instead, because of the size:
>> http://rafb.net/p/hqE13995.html
>>  When I power off the first node, the second node (crsd log at the link
>> above) logs on line 1 the message "[ COMMCRS][1147169120]clsc_receive:
>> (0xc6d180) Error receiving, ns (12535, 12560), transport (505, 110, 0)" and
>> keeps reporting "Connection not active" until line 2045.
>>  PS: Now the VIP address of the first node no longer migrates to the second
>> node after a power off (it may be necessary to reinstall the OS and Oracle
>> Clusterware, because I've changed the system a lot while testing).
>>
>>     Further, refer to $CRS_HOME/bin/racgvip; there are a few parameters,
>>    such as check interval and restart attempts, that control the behavior
>>    of VIP failover too. I'm not sure they are applicable when the machine
>>    is rebooted, since the heartbeat will fail before the VIP check.
>>
>>  Yes, I checked this file too, but didn't change it.
>>  Now, looking at the crsd log file, I believe Oracle knows when another
>> node is down, but what is responsible for performing the failover (bringing
>> up the VIP aliases on the other machine)? (A script, a daemon, an angel? :P)
>>  Thank you, friends, for the help.
>> Waldirio
>>
>>    Cheers
>>    Riyaj Shamsudeen
>>    The Pythian Group: http://www.pythian.com/
>>    Personal blog: http://orainternals.wordpress.com/
>>
>>    Waldirio Manhães Pinheiro wrote:
>>
>>          Hello Friends
>>           I'd like to ask about Oracle RAC in a Linux environment. I
>>        installed two machines with Red Hat AS 4 Update 5 and Oracle
>>        10.2.0.3 with Clusterware. The installation finished successfully
>>        and the database works fine.
>>           I checked the availability of my environment with the test below:
>>         Station cambeba UP
>>        Station cangua UP
>>         # crs_stat -t
>>         Name           Type           Target    State     Host
>>        ------------------------------------------------------------
>>        ora....BA.lsnr application    ONLINE    ONLINE    cambeba
>>        ora....eba.gsd application    ONLINE    ONLINE    cambeba
>>        ora....eba.ons application    ONLINE    ONLINE    cambeba
>>        ora....eba.vip application    ONLINE    ONLINE    cambeba
>>        ora....UA.lsnr application    ONLINE    ONLINE    cangua
>>        ora.cangua.gsd application    ONLINE    ONLINE    cangua
>>        ora.cangua.ons application    ONLINE    ONLINE    cangua
>>        ora.cangua.vip application    ONLINE    ONLINE    cangua
>>        ora.ora10gq.db application    ONLINE    ONLINE    cangua
>>        ora....q1.inst application    ONLINE    ONLINE    cangua
>>        ora....q2.inst application    ONLINE    ONLINE    cambeba
>>         Up to this point everything is OK, but when I force a power off on
>>        cangua or cambeba (the names of my machines), the time for the
>>        surviving machine to detect that the other machine is powered off
>>        is very long (between 1 and 2 min), so a client that was working at
>>        that moment will lose its query to a timeout.
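One way to time this from the cluster side is to poll `crs_stat -t` on the surviving node and note when the Host column of the other node's VIP resource changes. The sketch below covers only the parsing step and is an illustration, not part of the original test. It assumes the five-column layout shown in the listing above, and it assumes that a row with no Host value should be kept with an empty host. The function names `parse_crs_stat` and `vip_hosts` are hypothetical.

```python
# Sample copied from the crs_stat -t listing above.
SAMPLE = """\
Name           Type           Target    State     Host
------------------------------------------------------------
ora....BA.lsnr application    ONLINE    ONLINE    cambeba
ora....eba.gsd application    ONLINE    ONLINE    cambeba
ora....eba.ons application    ONLINE    ONLINE    cambeba
ora....eba.vip application    ONLINE    ONLINE    cambeba
ora....UA.lsnr application    ONLINE    ONLINE    cangua
ora.cangua.gsd application    ONLINE    ONLINE    cangua
ora.cangua.ons application    ONLINE    ONLINE    cangua
ora.cangua.vip application    ONLINE    ONLINE    cangua
ora.ora10gq.db application    ONLINE    ONLINE    cangua
ora....q1.inst application    ONLINE    ONLINE    cangua
ora....q2.inst application    ONLINE    ONLINE    cambeba
"""


def parse_crs_stat(text):
    """Parse crs_stat -t style output into a list of row dictionaries."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row and the dashed separator.
        if not parts or parts[0] == "Name" or set(parts[0]) == {"-"}:
            continue
        if len(parts) not in (4, 5):
            continue
        name, rtype, target, state = parts[:4]
        # Assumption: a missing fifth column means no host is reported.
        host = parts[4] if len(parts) == 5 else ""
        rows.append({"name": name, "type": rtype, "target": target,
                     "state": state, "host": host})
    return rows


def vip_hosts(rows):
    """Map each resource whose name ends in .vip to the host it is running on."""
    return {r["name"]: r["host"] for r in rows if r["name"].endswith(".vip")}


if __name__ == "__main__":
    for name, host in vip_hosts(parse_crs_stat(SAMPLE)).items():
        print(f"{name} is on {host}")
```

Calling `vip_hosts` on successive snapshots and recording the wall-clock time at which a VIP's host changes would give the failover duration as seen by the clusterware.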
>>         I changed the configuration of the ora.cambeba.vip and
>>        ora.cangua.vip resources, but without success.
>>         Any idea how to fix this problem (decrease the check time
>>        between the nodes of the cluster)?
>>         PS: I searched the list archives, but found nothing about
>>        this problem.
>>
>>         Thanks in advance.
>>        --
>>        ______________
>>        Regards
>>        Waldirio
>>        msn: wmp@xxxxxxxxxxxxx
>>        Site: www.waldirio.com.br
>>        Blog: blog.waldirio.com.br
>>        PGP: www.waldirio.com.br/public.html
>>
>>
>>
>>
>>
>> --
>> ______________
>> Regards
>> Waldirio
>> msn: wmp@xxxxxxxxxxxxx
>> Site: www.waldirio.com.br
>> Blog: blog.waldirio.com.br
>> PGP: www.waldirio.com.br/public.html
>>
>
>


-- 
______________
Regards
Waldirio
msn: wmp@xxxxxxxxxxxxx
Site: www.waldirio.com.br
Blog: blog.waldirio.com.br
PGP: www.waldirio.com.br/public.html