Although a lot has been said on this subject, I was wondering whether any
system statistics, such as sar output, are available. They could shed some
light on whether the issue has anything to do with I/O or CPU resources. I
could imagine that when CPU or I/O is briefly saturated, certain timeout
limits are reached. A mirror sync on the storage, for example, could also
briefly freeze things, which might end in a timeout and strange side effects
in your cluster environment.

Regards,

Gerwin Hendriksen

2010/11/12 Niall Litchfield <niall.litchfield@xxxxxxxxx>

> Also look for other servers running the same OS that you might have missed
> (because, say, Apache is configured to autostart). Or, flavour of my week
> this week, blame an unspecified server configuration issue - though to
> reproduce the insanity properly you'll need a clear error in the logs and
> confirmation from development that it's a known bug before blaming the
> nebulous. :)
>
> On 12 Nov 2010 17:41, "Amaral, Rui" <Rui.Amaral@xxxxxxxxxxxxxxxx> wrote:
>
> Both servers going down simultaneously indicates the OS. Even with the SAN
> going away or all connectivity being lost, something would still get
> written indicating the problem. The fact that the log is clean and then all
> of a sudden you see startup messages in the Oracle logs indicates to me
> that an immediate reboot (OS crash, if you will) happened, with Oracle
> having no chance to write. As Kevin indicated, in scenarios like that there
> would be messages captured by syslogd, but they would typically be lost in
> those types of cases. However, there are ways to try to capture them going
> forward:
>
> 1) Enable netdump on the servers. Netdump runs in its own protected memory
> and would be able to dump those messages prior to the machine rebooting. I
> have had SAs do this with some success.
>
> 2) Disable the reboot so that SAs can either iLO into the box or manually
> connect a terminal to it, screen-capture the messages, and then manually
> restart the box. We have also used this with some success (especially in
> the very early days of OCFS).
>
> 3) Or enable remote syslog capture (though I am not too convinced of this
> one):
>
> http://www.linuxhomenetworking.com/wiki/index.php/Quick_HOWTO_:_Ch05_:_Troubleshooting_Linux_with_syslog
>
> Like Niall, I suggest looking at the OS cron - the timing is just too
> conspicuous. Make sure updatedb is not scheduled to run.
>
> Rui Amaral
> Database Administrator
> ITS - SSG
> TD Bank Financial Group
> 220 Bay St., 11th Floor
> Toron...
> ------------------------------
> From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx]
> On Behalf Of Kevin Closson
> Sent: Friday, November 12, 2010 12:26 PM
> To: John Smith
> Cc: andrew.kerber@xxxxxxxxx; harish.kumar.kalra@xxxxxxxxx;
> oracle-l@xxxxxxxxxxxxx
> Subject: Re: Really Strange Problem
>
> Wow..re-reading my email...massive "typos." Actually, the contacts on this
> old keyboard are nearly g...
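
As a rough illustration of the cron check suggested in the quoted thread (making sure updatedb is not scheduled), here is a minimal Python sketch. It only scans crontab-style lines that you hand to it; the function name find_cron_entries and the sample entries are hypothetical and made up for this example, not taken from the affected servers. On a real box you would feed it the contents of files such as /etc/crontab, and keep in mind that a job started from a script in a cron directory would not show up in a plain line scan like this.

```python
import re


def find_cron_entries(lines, command="updatedb"):
    """Return (line_number, text) for active cron lines that mention `command`.

    Blank lines and lines commented out with '#' are skipped.
    """
    pattern = re.compile(r"\b" + re.escape(command) + r"\b")
    hits = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text or text.startswith("#"):
            continue  # ignore blanks and disabled entries
        if pattern.search(text):
            hits.append((number, text))
    return hits


# Hypothetical crontab content, for illustration only.
sample = [
    "# m h dom mon dow user command",
    "02 4 * * * root /usr/bin/updatedb",
    "#15 3 * * * root updatedb",
    "*/10 * * * * root /usr/lib/sa/sa1 1 1",
]

for number, text in find_cron_entries(sample):
    print(number, text)
```

Comparing the schedule fields of any hits against the time of the simultaneous restarts would show whether the timing lines up.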