Re: SunFire Server Hangs

  • From: Andrew Kerber <andrew.kerber@xxxxxxxxx>
  • To: "ian@xxxxxxxxxxxxxxxxx" <ian@xxxxxxxxxxxxxxxxx>
  • Date: Fri, 6 Mar 2015 18:50:34 -0600

Sounds like a problem with storage to me. Is it on local storage or SAN? ASM or 
file system? Also, what Oracle version?

Sent from my iPad

> On Mar 6, 2015, at 6:31 PM, MacGregor, Ian A. <ian@xxxxxxxxxxxxxxxxx> wrote:
> 
> Over the past 26 months or so we have had three SunFire x86 servers hang, two
> quite recently, within a few weeks of each other. The servers show no signs
> of high activity before the freeze. None of the monitoring scripts we run
> indicate any problem at all before the freeze. When it happens the machine
> hangs; it still responds to ping, and can be reset through the SP.
> Looking at the boot events, there is a system downtime which matches when
> the freeze occurs.
> 
> It has a very bad impact on Oracle, which suffers from lost writes:
> 
> SLACQA_ora_13043.trc:ORA-00600: internal error code, arguments: 
> [kcrf_resilver_log_1], [0x0E0CF5390], [2], [], [], [], [], [], [], [], [], []
> 
> This error is mostly associated with power outages.  In this case there was 
> no loss of power.
> 
> I reported the first machine a while back, explaining that automatic failover
> failed to occur when a machine is not quite dead.
> 
> The more recent failures have happened on machines which didn’t have a
> physical standby.
> 
> I could not find the article about fixing this problem through applying the
> current redo log. All I could find were articles saying the database was
> unrecoverable. After ensuring the backups were valid, I proceeded.
> 
> The database reported a corrupt rollback segment. I switched to manual
> undo, made sure there were no partially available segments, created a new
> undo tablespace, and successfully opened the database.
> 
> We had one other problem: an index disagreed with its table. The index was
> missing a row. We were able to ascertain the program which was using the
> index. It was a CDC job which was no longer needed. Oracle was perfectly
> happy with the situation. There was no reported corruption unless the
> affected index block was read by a query. Another problem was that the index
> and table are bootstrap objects. We eventually used impdp/expdp to move
> to another machine. But I am getting ahead of myself.
> 
> 
> When I brought the database up with the new undo tablespace, all Oracle
> scheduler jobs reported they could not open a wallet. I’m not sure which
> wallet is referenced here. Also, clients which called PL/SQL programs to
> open a wallet could not open a different wallet involved in authenticating
> to AD. However, if the code was executed on the database server itself, the
> wallet opened without a problem. Restarting both the database and the
> listener fixed the problem. I have not been able to find any information on
> this.
> 
> On another database the corrupt rollback segments included some partially
> available ones. This was a QA database and was refreshed from backup. Again
> the problem with the wallet occurred.
> 
> On the third database both active redo logs were corrupt. It too was
> recovered from backup. It also had the wallet problem.
> 
> So even with no loss of power (yes, the RAID cache batteries were good) and
> with multiplexed redo logs, we needed to recover from backup on two
> databases, and on the other had a bootstrap table and index in disagreement.
> 
> Ian MacGregor
> SLAC National Accelerator Center
> 
>   
