I can't figure out a mechanism that may cause this, but it sounds like
the operating system is losing its path to the storage.

Sent from my iPad

> On Mar 9, 2015, at 10:21 AM, MacGregor, Ian A. <ian@xxxxxxxxxxxxxxxxx> wrote:
>
> It's the entire server which hangs. Once we can access the machine,
> after the reset through the system process, a check of the system
> down time shows the time the machine froze. It's like the server
> panicked, but did not make it all the way down, as it remains
> pingable. When it happens, all programs which might provide some
> information as to the cause stop. However, up to that point things
> look very normal indeed.
>
> All the storage is onboard. These machines accommodate 16 drives
> internally.
>
> None of the machines which has had this problem is clustered. They
> are dedicated database machines. The OS is Solaris 10.
>
> Ian MacGregor
> SLAC National Accelerator Center
>
> -----Original Message-----
> From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On Behalf Of Mladen Gogala
> Sent: Saturday, March 07, 2015 9:51 PM
> To: oracle-l@xxxxxxxxxxxxx
> Subject: Re: SunFire Server Hangs
>
> Is the whole server hanging or just Oracle? Can you ssh into the
> server? Unfortunately Solaris is not unbreakable, like Linux.
>
> On 3/7/2015 11:49 PM, MacGregor, Ian A. wrote:
>
> It’s all internal storage, not using ASM. The Oracle version is
> 11.2.0.3.
>
> Ian
>
> On Mar 6, 2015, at 4:50 PM, Andrew Kerber <andrew.kerber@xxxxxxxxx> wrote:
>
> Sounds like a problem with storage to me. Is it in local storage or
> SAN? ASM or file system? Also, what Oracle version?
>
> Sent from my iPad
>
> On Mar 6, 2015, at 6:31 PM, MacGregor, Ian A. <ian@xxxxxxxxxxxxxxxxx> wrote:
>
> Over the past 26 months or so we have had three SunFire x86 servers
> hang, two, quite recently, within a few weeks of each other.
> The servers show no signs of high activity before the freeze. None
> of the monitoring scripts we run indicate any problem at all before
> the freeze. When it happens the machine hangs; it still responds to
> ping and can be reset through the SP.
>
> Looking at the boot events, there is a system downtime which matches
> when the freeze occurred.
>
> It has a very bad impact on Oracle, which suffers from lost writes:
>
> SLACQA_ora_13043.trc:ORA-00600: internal error code, arguments:
> [kcrf_resilver_log_1], [0x0E0CF5390], [2], [], [], [], [], [], [], [], [], []
>
> This error is mostly associated with power outages. In this case
> there was no loss of power.
>
> I reported the first machine a while back, explaining that automatic
> failover failed to occur when a machine is not quite dead.
>
> The more recent failures have happened on machines which didn’t have
> a physical standby.
>
> I could not find the article about fixing this problem through
> applying the current redo log. All I could find were articles saying
> the database was unrecoverable. After ensuring the backups were
> valid, I proceeded.
>
> The database reported a corrupt rollback segment. I switched to
> manual undo, made sure there were no partially available segments,
> created a new undo tablespace, and successfully opened the database.
>
> We had one other problem: an index disagreed with its table. The
> index was missing a row. We were able to ascertain the program which
> was using the index. It was a CDC job which was no longer needed.
> Oracle was perfectly happy with the situation. There was no reported
> corruption unless the affected index block was read by a query.
> Another problem was that the index and table are bootstrap objects.
> We eventually used impdp/expdp to move to another machine. But I am
> getting ahead of myself.
>
> When I brought the database up with the new undo tablespace, all
> Oracle scheduler jobs reported they could not open a wallet. I’m not
> sure which wallet is referenced here.
> Also, clients which called PL/SQL programs to open a wallet could
> not open a different wallet involved in authenticating to AD.
> However, if the code was executed on the database server itself, the
> wallet opened without a problem. Restarting both the database and
> the listener fixed the problem. I have not been able to find any
> information on this.
>
> On another database the corrupt rollback segments included some
> partially available ones. This was a QA database and was refreshed
> from backup. Again the problem with the wallet occurred.
>
> On the third database both active redo logs were corrupt. It too was
> recovered from backup. It also had the wallet problem.
>
> So even with no loss of power (yes, the RAID cache batteries were
> good) and with multiplexed redo logs, we needed to recover from
> backup on two databases, and on the other had a bootstrap table and
> index in disagreement.
>
> Ian MacGregor
> SLAC National Accelerator Center
>
> --
> Mladen Gogala
> Oracle DBA
> http://mgogala.freehostia.com
--
//www.freelists.org/webpage/oracle-l