Re: SunFire Server Hangs

  • From: Andrew Kerber <andrew.kerber@xxxxxxxxx>
  • To: "ian@xxxxxxxxxxxxxxxxx" <ian@xxxxxxxxxxxxxxxxx>
  • Date: Mon, 9 Mar 2015 11:13:33 -0500

I can't figure out a mechanism that may cause this, but it sounds like the 
operating system is losing its path to the storage.

Sent from my iPad

> On Mar 9, 2015, at 10:21 AM, MacGregor, Ian A. <ian@xxxxxxxxxxxxxxxxx> wrote:
> 
> It's the entire server which hangs. Once we can access the machine, after
> the reset through the system process, a check of the system downtime shows
> the time the machine froze. It's like the server panicked but did not
> make it all the way down, as it remains pingable. When it happens, all
> programs which might provide some information as to the cause stop.
> However, up to that point things look very normal indeed.
> 
> All the storage is onboard. These machines accommodate 16 drives
> internally.
> 
> None of the machines which has had this problem is clustered. They are
> dedicated database machines. The OS is Solaris 10.
> 
> Ian MacGregor
> SLAC National Accelerator Center
> 
> -----Original Message-----
> From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On 
> Behalf Of Mladen Gogala
> Sent: Saturday, March 07, 2015 9:51 PM
> To: oracle-l@xxxxxxxxxxxxx
> Subject: Re: SunFire Server Hangs
> 
> Is the whole server hanging or just Oracle? Can you ssh into the server?  
> Unfortunately Solaris is not unbreakable, like Linux.
> 
> On 3/7/2015 11:49 PM, MacGregor, Ian A. wrote:
> 
> 
>    It’s all internal storage, not using ASM. The Oracle version is
> 11.2.0.3.
> 
>    Ian
>    
> 
>        On Mar 6, 2015, at 4:50 PM, Andrew Kerber <andrew.kerber@xxxxxxxxx> 
> wrote:
> 
>        Sounds like a problem with storage to me.  Is it on local storage or
> SAN?  ASM or file system?  Also, what Oracle version?
>        
>        Sent from my iPad
> 
>        On Mar 6, 2015, at 6:31 PM, MacGregor, Ian A. <ian@xxxxxxxxxxxxxxxxx> 
> wrote:
>        
>        
> 
>            Over the past 26 months or so we have had three SunFire x86
> servers hang, two of them quite recently, within a few weeks of each other.
> The servers show no signs of high activity before the freeze. None of the
> monitoring scripts we run indicates any problem at all before the freeze.
> When it happens the machine hangs; it does ping, and can be reset through
> the sp.
>            Looking at the boot events, there is a system downtime which
> matches when the freeze occurred.
>            
> 
>            It has a very bad impact on Oracle, which suffers from lost writes:
> 
>            SLACQA_ora_13043.trc:ORA-00600: internal error code, arguments: 
> [kcrf_resilver_log_1], [0x0E0CF5390], [2], [], [], [], [], [], [], [], [], []
> 
>            This error is mostly associated with power outages.  In this case 
> there was no loss of power.
> 
>            I reported the first machine a while back, explaining that
> automatic failover failed to occur when a machine is not quite dead.
> 
>            The more recent failures have happened on machines which didn’t
> have a physical standby.
> 
>            I could not find the article about fixing this problem through
> applying the current redo log. All I could find were articles saying the
> database was unrecoverable. After ensuring the backups were valid, I
> proceeded.
> 
>            The database reported a corrupt rollback segment. I switched to
> manual undo, made sure there were no partially available segments, created
> a new undo tablespace, and successfully opened the database.
> 
>            We had one other problem: an index disagreed with its table. The
> index was missing a row. We were able to ascertain the program which was
> using the index. It was a CDC job which was no longer needed. Oracle was
> perfectly happy with the situation. There was no reported corruption unless
> the affected index block was read by a query. Another problem was that the
> index and table are bootstrap objects. We eventually used impdp/expdp to
> move to another machine. But I am getting ahead of myself.
> 
> 
>            When I brought the database up with the new undo tablespace, all
> Oracle scheduler jobs reported they could not open a wallet. I’m not
> sure which wallet is referenced here. Also, clients which called PL/SQL
> programs to open a wallet could not open a different wallet involved in
> authenticating to AD. However, if the code was executed on the database
> server itself, the wallet opened without a problem. Restarting both the
> database and the listener fixed the problem. I have not been able to find
> any information on this.
> 
>            On another database the corrupt rollback segments included some
> partially available ones. This was a QA database and was refreshed from
> backup. Again the problem with the wallet occurred.
> 
>            On the third database both active redo logs were corrupt. It too
> was recovered from backup. It also had the wallet problem.
> 
>            So even with no loss of power (yes, the RAID cache batteries were
> good) and with multiplexed redo logs, we needed to recover from backup on
> two databases, and on the other had a bootstrap table and index in
> disagreement.
> 
>            Ian MacGregor
>            SLAC National Accelerator Center
> 
>              
> 
> 
> 
> 
> -- 
> Mladen Gogala
> Oracle DBA
> http://mgogala.freehostia.com
--
//www.freelists.org/webpage/oracle-l

