RE: Really Strange Problem

  • From: "Amaral, Rui" <Rui.Amaral@xxxxxxxxxxxxxxxx>
  • To: "'ora_kclosson@xxxxxxxxx'" <ora_kclosson@xxxxxxxxx>, John Smith <john40855@xxxxxxxxx>
  • Date: Fri, 12 Nov 2010 12:38:26 -0500

Both servers going down simultaneously points to the OS. Even if the SAN went 
away or all connectivity were lost, something would still get written 
indicating the problem. The fact that the logs are clean and then all of a 
sudden you see startup messages in the Oracle logs indicates to me that an 
immediate reboot (an OS crash, if you will) happened, with Oracle having no 
chance to write anything. As Kevin indicated, in scenarios like that the 
messages would go through syslogd but are typically lost. However, there are 
ways to try to capture them going forward:

1) enable netdump on the servers. Netdump runs in its own protected memory and 
would be able to dump those messages prior to the machine rebooting. I have had 
SAs do this with some success.

2) disable the reboot so that SAs can either iLO into the box or manually 
connect a terminal to it, capture the messages on screen, and then restart the 
box manually. We have also used this with some success (especially in the very 
early days of OCFS).

3) or enable remote syslog capture (though I am not too convinced of this one):

http://www.linuxhomenetworking.com/wiki/index.php/Quick_HOWTO_:_Ch05_:_Troubleshooting_Linux_with_syslog

Like Niall, I suggest looking at the OS cron; the timing is just too 
conspicuous. Make sure updatedb is not scheduled to run.
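
As a rough way to do that cron check, here is a minimal Python sketch that 
lists cron entries able to start in the 15 minutes leading up to 2:15 am. The 
function names (parse_field, jobs_near) and the 15-minute window are my own 
assumptions for illustration, not anything from the thread. It only 
understands numeric fields, ignores the day and month columns, and does not 
look inside the directories that a run-parts line points at, so a hit on a 
run-parts entry means the scripts in that directory (where updatedb may live) 
still need checking by hand.

```python
from pathlib import Path


def parse_field(field, lo, hi):
    """Expand one numeric cron field (e.g. "*/5", "1-3", "10,40") into a set."""
    values = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/", 1)
            step = int(step_text)
        if part == "*":
            start, end = lo, hi
        elif "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
        else:
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return values


def jobs_near(lines, hour=2, minute=15, window=15):
    """Return cron lines that can start within `window` minutes before hour:minute."""
    target = hour * 60 + minute
    hits = []
    for line in lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue
        fields = text.split()
        if "=" in fields[0] or len(fields) < 6:
            continue  # environment setting, or too short to be a schedule line
        try:
            minutes = parse_field(fields[0], 0, 59)
            hours = parse_field(fields[1], 0, 23)
        except ValueError:
            continue  # "@reboot", names, etc. are not handled in this sketch
        starts = (h * 60 + m for h in hours for m in minutes)
        if any(target - window <= t <= target for t in starts):
            hits.append(text)
    return hits


if __name__ == "__main__":
    # Usual cron locations on an Enterprise Linux box; adjust as needed.
    candidates = [Path("/etc/crontab")]
    for directory in (Path("/etc/cron.d"), Path("/var/spool/cron")):
        try:
            candidates.extend(sorted(directory.iterdir()))
        except OSError:
            pass  # directory missing or not readable by this user
    for path in candidates:
        try:
            lines = path.read_text(errors="replace").splitlines()
        except OSError:
            continue  # missing file, a directory, or insufficient permissions
        for job in jobs_near(lines):
            print(f"{path}: {job}")
```

Run it as root on each node so the per-user crontabs are readable too. Anything 
it prints is only a candidate; entries that run every few minutes will always 
show up and can usually be ruled out quickly.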

Rui Amaral
Database Administrator
ITS - SSG
TD Bank Financial Group
220 Bay St., 11th Floor
Toronto, ON, CA, M5K1A2
(bb) (647) 204-9106



________________________________
From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On 
Behalf Of Kevin Closson
Sent: Friday, November 12, 2010 12:26 PM
To: John Smith
Cc: andrew.kerber@xxxxxxxxx; harish.kumar.kalra@xxxxxxxxx; 
oracle-l@xxxxxxxxxxxxx
Subject: Re: Really Strange Problem

Wow..re-reading my email...massive "typos." Actually, the contacts on this old 
keyboard are nearly gone and I'm finding myself mashing keys...time to stop 
procrastinating and get another one.

Anyway, I don't think suicide is your problem. I was just addressing the bit 
about evidence. I'd check the common components (switches, storage) to see if 
there is anything there.

________________________________
From: John Smith <john40855@xxxxxxxxx>
To: Kevin Closson <ora_kclosson@xxxxxxxxx>
Cc: andrew.kerber@xxxxxxxxx; harish.kumar.kalra@xxxxxxxxx; 
oracle-l@xxxxxxxxxxxxx
Sent: Fri, November 12, 2010 8:36:53 AM
Subject: Re: Really Strange Problem

If it was a node eviction, wouldn't one server go before the other?  In this 
case, they appear to be going simultaneously.  If it is an eviction, is there 
anyplace besides the clusterware logs that would show evidence?

On Fri, Nov 12, 2010 at 10:25 AM, Kevin Closson <ora_kclosson@xxxxxxxxx> wrote:
>Absolutely no indication of a node eviction.

I'm not saying this is your problem, but... the messages you are looking for 
are sent via syslogd and are buffered writes. Don't expect a catatonic server 
to be able to flush buffered writes to a log. There is a reason Oracle 
implemented IPMI fencing in 11.2...I guess I wasn't such a renegade for 
blogging about fencing approaches all those years...



________________________________
From: Andrew Kerber <andrew.kerber@xxxxxxxxx>
To: harish.kumar.kalra@xxxxxxxxx
Cc: john40855@xxxxxxxxx; oracle-l@xxxxxxxxxxxxx
Sent: Thu, November 11, 2010 8:50:30 PM
Subject: Re: Really Strange Problem

Absolutely no indication of a node eviction.  Nothing in any of the clusterware 
logs (crsd.log, ocssd.log, etc.) indicates a node eviction on either node.  
They are all normal until they suddenly start back up after an unexpected 
shutdown.

On Thu, Nov 11, 2010 at 9:36 PM, Harish Kumar <harish.kumar.kalra@xxxxxxxxx> 
wrote:
John,

Have you checked ocssd.log and the system logfiles? Download and install CHM, 
also known as Cluster Health Monitor, and leave it running until a node is 
evicted again.

Once the nodes have been evicted, check and analyze the logfiles collected by 
CHM. Oracle may evict a node for different reasons, such as CPU saturation, 
long I/O latencies, a misconfigured network, etc.

I think once you have the logfiles in place it will be clearer what the actual 
problem is.

Regards
Harish Kumar
Independent Database Consultant

www.oraxperts.com



On Fri, Nov 12, 2010 at 1:20 PM, John Smith <john40855@xxxxxxxxx> wrote:
Oh yes, if I didn't make it clear:

OS - OEL 5.5 x86_64
Clusterware:  11.1.0.7 x86_64
ASM - 11.1.0.7 x86_64 (running over RAW)
Database: 10.1.0.5 x86_64 (running)
Database: 10.2.0.4 x86_64 (installed, but not running at this point)


---------- Forwarded message ----------
From: John Smith <john40855@xxxxxxxxx>
Date: Thu, Nov 11, 2010 at 8:14 PM
Subject: Really Strange Problem
To: oracle-l@xxxxxxxxxxxxx


OK, I don't know if this one is related to oracle database, OEL, or something 
else entirely.  But here it is:

We have Oracle Clusterware 11.1 installed and running with ASM 11.1.  We also 
have Oracle 10.2 installed, as well as 10.1.  I have created a 10.1 database.  
ASM is on RAW against EMC storage.  This has to be on raw because the intent is 
to take a 10.1 32-bit database to 10.2 64-bit.  This requires a stop at 10.1 
64-bit along the way, and 10.1 requires ASM on raw.

Anyway, the problem is that the servers are rebooting every 2-3 days at 2:15 
am, and we have not been able to figure out why.  There is nothing in the ASM, 
clusterware, or database logs; they show everything running fine, then a 
restart.  There is nothing in /var/log/messages either; it just shows a 
restart.  Any ideas?




--
Andrew W. Kerber

'If at first you dont succeed, dont take up skydiving.'




