Sounds like a storage problem to me. Is it on local storage or SAN? ASM or file system? Also, which Oracle version?

Sent from my iPad

> On Mar 6, 2015, at 6:31 PM, MacGregor, Ian A. <ian@xxxxxxxxxxxxxxxxx> wrote:
>
> Over the past 26 months or so we have had three SunFire x86 servers hang, two
> quite recently, within a few weeks of each other. The servers show no signs
> of high activity before the freeze. None of the monitoring scripts we run
> indicate any problem at all before the freeze. When it happens the machine
> hangs; it still responds to ping and can be reset through the SP.
> Looking at the boot events, there is a system downtime which matches when
> the freeze occurs.
>
> It has a very bad impact on Oracle, which suffers from lost writes:
>
> SLACQA_ora_13043.trc:ORA-00600: internal error code, arguments:
> [kcrf_resilver_log_1], [0x0E0CF5390], [2], [], [], [], [], [], [], [], [], []
>
> This error is mostly associated with power outages. In this case there was
> no loss of power.
>
> I reported the first machine a while back, explaining that automatic failover
> failed to occur when a machine is not quite dead.
>
> The more recent failures have happened on machines which didn’t have a
> physical standby.
>
> I could not find the article about fixing this problem through applying the
> current redo log. All I could find were articles saying the database was
> unrecoverable. After ensuring the backups were valid, I proceeded.
>
> The database reported a corrupt rollback segment. I switched to manual
> undo, made sure there were no partially available segments,
> created a new undo tablespace, and successfully opened the database.
>
> We had one other problem: an index disagreed with its table. The index was
> missing a row. We were able to ascertain the program which was using the
> index. It was a CDC job which was no longer needed. Oracle was perfectly
> happy with the situation. There was no reported corruption unless the
> affected index block was read by a query.
> Another problem was that the index
> and table are bootstrap objects. We eventually used impdp/expdp to move
> to another machine. But I am getting ahead of myself.
>
> When I brought the database up with the new undo tablespace, all Oracle
> scheduler jobs reported they could not open a wallet. I’m not sure which
> wallet is referenced here. Also, clients which called PL/SQL programs to
> open a wallet could not open a different wallet involved in authenticating
> to AD. However, if the code was executed on the database server itself, the
> wallet opened without a problem. Restarting both the database and the
> listener fixed the problem. I have not been able to find any information on this.
>
> On another database the corrupt rollback segments included some partially
> available ones. This was a QA database and was refreshed from backup. Again
> the problem with the wallet occurred.
>
> On the third database both active redo logs were corrupt. It too was
> recovered from backup. It also had the wallet problem.
>
> So even with no loss of power (yes, the RAID cache batteries were good) and
> with multiplexed redo logs, we needed to recover from backup on two
> databases, and on the other had a bootstrap table and index in disagreement.
>
> Ian MacGregor
> SLAC National Accelerator Center
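
For anyone scanning trace files for the same signature, here is a minimal Python sketch that pulls the non-empty arguments out of an ORA-00600 line like the one quoted above. The function name parse_ora600 and the regular expressions are illustrative only, not an Oracle tool, and the sketch assumes the error text sits on a single unwrapped line.

```python
import re

# Pattern for the ORA-00600 line as it appears in the quoted trace excerpt.
# Assumption: the whole argument list is on one line (no mail-client wrapping).
ORA600 = re.compile(r"ORA-00600: internal error code, arguments:\s*(.*)")
ARG = re.compile(r"\[([^\]]*)\]")


def parse_ora600(text):
    """Return the non-empty ORA-00600 arguments found in text, or None."""
    match = ORA600.search(text)
    if match is None:
        return None
    # Drop the empty [] placeholders that pad the argument list.
    return [arg for arg in ARG.findall(match.group(1)) if arg]


sample = (
    "SLACQA_ora_13043.trc:ORA-00600: internal error code, arguments: "
    "[kcrf_resilver_log_1], [0x0E0CF5390], [2], [], [], [], [], [], [], [], [], []"
)
print(parse_ora600(sample))
```

The first argument (here kcrf_resilver_log_1) is the one usually used when searching for matching notes, so grouping trace files by that value is a reasonable first pass.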