RE: SunFire Server Hangs

  • From: "MacGregor, Ian A." <ian@xxxxxxxxxxxxxxxxx>
  • To: "oracle-l@xxxxxxxxxxxxx" <oracle-l@xxxxxxxxxxxxx>
  • Date: Mon, 9 Mar 2015 15:21:28 +0000

It's the entire server that hangs. Once we can access the machine again, after
a reset through the service processor, a check of the system downtime shows the
time the machine froze. It's as if the server panicked but did not make it all
the way down, since it remains pingable. When it happens, all programs which
might provide some information as to the cause stop. However, up to that point
things look very normal indeed.

All the storage is onboard. These machines accommodate 16 drives internally.

None of the machines that have had this problem is clustered. They are
dedicated database machines. The OS is Solaris 10.

Ian MacGregor
SLAC National Accelerator Center

-----Original Message-----
From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On 
Behalf Of Mladen Gogala
Sent: Saturday, March 07, 2015 9:51 PM
To: oracle-l@xxxxxxxxxxxxx
Subject: Re: SunFire Server Hangs

Is the whole server hanging or just Oracle? Can you ssh into the server?  
Unfortunately Solaris is not unbreakable, like Linux.

On 3/7/2015 11:49 PM, MacGregor, Ian A. wrote:


        It’s all internal storage, not using ASM. The Oracle version is
11.2.0.3.

        Ian
        

                On Mar 6, 2015, at 4:50 PM, Andrew Kerber 
<andrew.kerber@xxxxxxxxx> wrote:

                Sounds like a problem with storage to me.  Is it on local storage
or SAN?  ASM or file system?  Also, what Oracle version?
                
                Sent from my iPad

                On Mar 6, 2015, at 6:31 PM, MacGregor, Ian A. 
<ian@xxxxxxxxxxxxxxxxx> wrote:
                
                

                        Over the past 26 months or so we have had three SunFire
x86 servers hang, two of them quite recently, within a few weeks of each other.
The servers show no signs of high activity before the freeze. None of the
monitoring scripts we run indicates any problem at all before the freeze. When
it happens the machine hangs; it still responds to ping, and can be reset
through the SP.
                        Looking at the boot events, there is a system downtime
which matches when the freeze occurred.
                        

                        It has a very bad impact on Oracle, which suffers from
lost writes:

                        SLACQA_ora_13043.trc:ORA-00600: internal error code, 
arguments: [kcrf_resilver_log_1], [0x0E0CF5390], [2], [], [], [], [], [], [], 
[], [], []
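
When several trace files have to be checked, it can help to pull the non-empty
arguments out of each ORA-00600 line. The following is only a minimal sketch in
Python; the function name ora600_arguments is hypothetical, and it assumes the
message has the same layout as the line quoted above.

```python
import re

# Matches the fixed ORA-00600 prefix and captures everything after "arguments:".
ORA600 = re.compile(r"ORA-00600: internal error code,\s*arguments:\s*(.*)")


def ora600_arguments(text):
    """Return the non-empty bracketed arguments of the first ORA-00600 in text."""
    # Mail or terminal wrapping may split the message; collapse whitespace first.
    flat = " ".join(text.split())
    match = ORA600.search(flat)
    if match is None:
        return []
    # Each argument is written as [value]; empty placeholders are dropped.
    return [arg for arg in re.findall(r"\[([^\]]*)\]", match.group(1)) if arg]
```

Applied to the line above, the first argument returned is the internal function
name, which is usually the most useful search term.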

                        This error is mostly associated with power outages.  In 
this case there was no loss of power.

                        I reported on the first machine a while back, explaining
that automatic failover fails to occur when a machine is not quite dead.
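
A machine that still answers ping but runs nothing will pass an ICMP-only
health check. One possible way to catch the "not quite dead" state is to require
an answer from a user-space service instead. The sketch below is an assumption,
not what our monitoring does: the function name service_responds is
hypothetical, and it presumes a service that sends a banner on connect (sshd on
port 22 does).

```python
import socket


def service_responds(host, port, timeout=5.0):
    """True only if a TCP service on host:port sends at least one byte in time."""
    try:
        # The kernel can complete the TCP handshake even when user space is
        # frozen, so a successful connect alone proves nothing.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            # Require real data from the process behind the port.
            return len(conn.recv(64)) > 0
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

A failover monitor could treat a host as down when ping succeeds but
service_responds(host, 22) keeps returning False for several checks in a row.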

                        The more recent failures have happened on machines
which didn’t have a physical standby.

                        I could not find the article about fixing this problem
by applying the current redo log. All I could find were articles saying the
database was unrecoverable. After ensuring the backups were valid, I proceeded.

                        The database reported a corrupt rollback segment. I
switched to manual undo, made sure there were no partially available segments,
created a new undo tablespace, and successfully opened the database.

                        We had one other problem: an index disagreed with its
table. The index was missing a row. We were able to ascertain the program
which was using the index. It was a CDC job which was no longer needed.
Oracle was perfectly happy with the situation. There was no reported
corruption unless the affected index block was read by a query. Another
problem was that the index and table are bootstrap objects. We eventually used
impdp/expdp to move to another machine. But I am getting ahead of myself.


                        When I brought the database up with the new undo
tablespace, all Oracle scheduler jobs reported they could not open a wallet.
I’m not sure which wallet is referenced here. Also, clients which called
PL/SQL programs to open a wallet could not open a different wallet involved in
authenticating to AD. However, if the code was executed on the database server
itself, the wallet opened without a problem. Restarting both the database and
the listener fixed the problem. I have not been able to find any information on
this.

                        On another database the corrupt rollback segments
included some partially available ones. This was a QA database and was
refreshed from backup. Again the problem with the wallet occurred.

                        On the third database both active redo logs were
corrupt. It too was recovered from backup. It also had the wallet problem.

                        So even with no loss of power (yes, the RAID cache
batteries were good) and with multiplexed redo logs, we needed to recover
from backup on two databases, and the other had a bootstrap table and index
in disagreement.

                        Ian MacGregor
                        SLAC National Accelerator Center

                          




-- 
Mladen Gogala
Oracle DBA
http://mgogala.freehostia.com
