Re: SunFire Server Hangs

From: "MacGregor, Ian A." <ian@xxxxxxxxxxxxxxxxx>
To: Andrew Kerber <andrew.kerber@xxxxxxxxx>
Date: Sun, 8 Mar 2015 04:49:46 +0000

It’s all internal storage, not using ASM. The Oracle version is 11.2.0.3.

Ian
On Mar 6, 2015, at 4:50 PM, Andrew Kerber <andrew.kerber@xxxxxxxxx> wrote:

Sounds like a problem with storage to me. Is it on local storage or SAN? ASM or
file system? Also, what Oracle version?

On Mar 6, 2015, at 6:31 PM, MacGregor, Ian A. <ian@xxxxxxxxxxxxxxxxx> wrote:

Over the past 26 months or so we have had three SunFire x86 servers hang, two of
them quite recently, within a few weeks of each other. The servers show no signs
of high activity before the freeze, and none of the monitoring scripts we run
indicates any problem at all beforehand. When it happens the machine hangs; it
still answers ping, and can be reset through the SP. Looking at the boot events,
there is a system downtime which matches when the freeze occurs.
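
Because a hung machine still answers ping, one way to date a freeze independently of the SP boot events is to look for silent periods in a heartbeat that the host writes itself. The following is a minimal sketch of that idea, not something taken from the monitoring scripts mentioned above: the function name, the one-minute interval, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance=2.0):
    """Return (last_seen, next_seen) pairs where two consecutive heartbeat
    samples are further apart than tolerance * expected_interval."""
    limit = expected_interval * tolerance
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Hypothetical one-minute heartbeat samples with a silent period in the middle.
samples = [
    datetime(2015, 3, 1, 2, 0),
    datetime(2015, 3, 1, 2, 1),
    datetime(2015, 3, 1, 2, 2),
    datetime(2015, 3, 1, 2, 47),
    datetime(2015, 3, 1, 2, 48),
]

gaps = find_gaps(samples, timedelta(minutes=1))
for last_seen, next_seen in gaps:
    print(f"no heartbeat between {last_seen} and {next_seen}")
```

A gap found this way can then be compared with the downtime recorded in the boot events.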

It has a very bad impact on Oracle, which suffers from lost writes:

SLACQA_ora_13043.trc:ORA-00600: internal error code, arguments: 
[kcrf_resilver_log_1], [0x0E0CF5390], [2], [], [], [], [], [], [], [], [], []

This error is mostly associated with power outages.  In this case there was no 
loss of power.
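
When several databases have to be checked for the same failure, it can help to pull the ORA-00600 arguments out of trace or alert log messages mechanically. Here is a minimal sketch, assuming one message is passed in at a time (possibly wrapped over several lines, as in the excerpt above); the function name is hypothetical.

```python
import re

ORA600 = re.compile(r"ORA-00600: internal error code, arguments:(.*)")


def parse_ora600(message):
    """Return the non-empty bracketed arguments of an ORA-00600 message,
    or None if the message does not contain ORA-00600."""
    # Undo line wrapping so the whole message sits on one line.
    flat = " ".join(message.split())
    match = ORA600.search(flat)
    if match is None:
        return None
    args = re.findall(r"\[([^\]]*)\]", match.group(1))
    return [arg for arg in args if arg]


message = """SLACQA_ora_13043.trc:ORA-00600: internal error code, arguments:
[kcrf_resilver_log_1], [0x0E0CF5390], [2], [], [], [], [], [], [], [], [], []"""

print(parse_ora600(message))
```

The first argument, here kcrf_resilver_log_1, is the one normally used when searching for matching notes.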

I reported the first machine a while back, explaining that automatic failover
failed to occur when a machine is not quite dead.

The more recent failures have happened on machines which didn’t have a
physical standby.

I could not find the article about fixing this problem by applying the
current redo log. All I could find were articles saying the database was
unrecoverable. After ensuring the backups were valid, I proceeded.

The database reported a corrupt rollback segment. I switched to manual undo,
made sure there were no partially available segments, created a new undo
tablespace, and successfully opened the database.
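
The check for partially available segments can be sketched as follows. This is only an illustration under stated assumptions, not the exact procedure used here: it assumes the rows have already been fetched from DBA_ROLLBACK_SEGS by some means, and the segment names in the sample data are hypothetical.

```python
# DBA_ROLLBACK_SEGS and its STATUS column are standard; everything else
# in this block is illustrative.
ROLLBACK_SEG_QUERY = (
    "SELECT segment_name, tablespace_name, status FROM dba_rollback_segs"
)

# Statuses that mean a segment still needs attention before the old undo
# tablespace can be abandoned.
PROBLEM_STATUSES = {"NEEDS RECOVERY", "PARTLY AVAILABLE"}


def problem_segments(rows):
    """rows: iterable of (segment_name, tablespace_name, status) tuples.
    Return the rows whose status is in PROBLEM_STATUSES."""
    return [row for row in rows if row[2].strip().upper() in PROBLEM_STATUSES]


# Hypothetical result set.
rows = [
    ("SYSTEM", "SYSTEM", "ONLINE"),
    ("_SYSSMU1_1234567890$", "UNDOTBS1", "NEEDS RECOVERY"),
    ("_SYSSMU2_1234567891$", "UNDOTBS1", "PARTLY AVAILABLE"),
    ("_SYSSMU3_1234567892$", "UNDOTBS1", "OFFLINE"),
]

for name, tablespace, status in problem_segments(rows):
    print(f"{name} in {tablespace}: {status}")
```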

We had one other problem: an index disagreed with its table. The index was
missing a row. We were able to ascertain which program was using the index; it
was a CDC job which was no longer needed. Oracle was perfectly happy with the
situation, and there was no reported corruption unless the affected index block
was read by a query. A further complication was that the index and table are
bootstrap objects. We eventually used expdp/impdp to move to another machine.
But I am getting ahead of myself.
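
An index and table in disagreement can be pictured as two sets of rowids that should be identical: one collected by scanning the table, the other by walking the index. The sketch below shows that comparison only. How the two lists are obtained is left open, and the function name and rowid values are hypothetical.

```python
def compare_table_and_index(table_rowids, index_rowids):
    """Compare rowids seen by a table scan with rowids seen through the index.
    Returns rows the index is missing and index entries with no table row."""
    table_set, index_set = set(table_rowids), set(index_rowids)
    return {
        "missing_from_index": sorted(table_set - index_set),
        "orphaned_in_index": sorted(index_set - table_set),
    }


# Hypothetical rowids: the index lacks an entry for one table row.
table_rowids = ["AAAR1", "AAAR2", "AAAR3"]
index_rowids = ["AAAR1", "AAAR3"]

result = compare_table_and_index(table_rowids, index_rowids)
print(result)
```

A row that appears only on the table side corresponds to the situation described here, where queries going through the index would not see it.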


When I brought the database up with the new undo tablespace, all Oracle
scheduler jobs reported they could not open a wallet. I’m not sure which wallet
is referenced here. Also, clients which called PL/SQL programs to open a wallet
could not open a different wallet, the one involved in authenticating to AD.
However, if the code was executed on the database server itself, the wallet
opened without a problem. Restarting both the database and the listener fixed
the problem. I have not been able to find any information on this.

On another database the  corrupt rollback segments included some partially 
available ones.  This was a QA database and was refreshed from backup.  Again 
the problem with the wallet occurred.

On the third database both active redo logs  were corrupt.  It too was 
recovered from backup.  It also had the wallet problem.

So even with no loss of power (yes, the RAID cache batteries were good) and
with multiplexed redo logs, we needed to recover from backup on two databases,
and on the other had a bootstrap table and index in disagreement.

Ian MacGregor
SLAC National Accelerator Center
