When control files go bad

Hey all,

Our 10.1.0.5.0 DBs on AIX had some "issues" this weekend after the A/C
suffered multiple failures in the server room.  The DB server itself was OK,
but the SAN did an emergency shutdown from a temperature alarm.

Our SAN houses all datafiles, redo logs, archived logs, FRA, and 2/3 of the
control files (remember that last part!).

The alert.log shows something very close to this:

Sat May 30 18:10:57 2009
Errors in file /oracle/admin/db/bdump/oprd_ckpt_324056.trc:
ORA-00221: error on write to controlfile
ORA-00206: error in writing (block 3, # blocks 1) of controlfile
ORA-00202: controlfile: '/oracle/data/db/control02.ctl'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 9
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of controlfile
ORA-00202: controlfile: '/oracle/data/db/control01.ctl'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 9
Additional information: 3
Sat May 30 18:10:57 2009
CKPT: terminating instance due to error 221
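
For anyone who wants to pull the affected control files out of a longer alert.log, here is a minimal Python sketch. It assumes the ORA-00202 line format shown in the excerpt above; the function name failed_controlfiles and the ALERT_EXCERPT constant are made up for illustration and are not part of any Oracle tooling.

```python
import re

# Trimmed copy of the alert.log excerpt above (only the lines the parser needs).
ALERT_EXCERPT = """\
ORA-00221: error on write to controlfile
ORA-00206: error in writing (block 3, # blocks 1) of controlfile
ORA-00202: controlfile: '/oracle/data/db/control02.ctl'
ORA-27072: File I/O error
ORA-00206: error in writing (block 3, # blocks 1) of controlfile
ORA-00202: controlfile: '/oracle/data/db/control01.ctl'
ORA-27072: File I/O error
CKPT: terminating instance due to error 221
"""

# Assumption: every affected control file is named on its own ORA-00202 line.
CONTROLFILE_LINE = re.compile(r"ORA-00202: controlfile: '([^']+)'")


def failed_controlfiles(alert_text):
    """Return the control file paths named in ORA-00202 lines, in first-seen order."""
    paths = []
    for match in CONTROLFILE_LINE.finditer(alert_text):
        path = match.group(1)
        if path not in paths:  # skip repeats from later write attempts
            paths.append(path)
    return paths
```

Calling failed_controlfiles(ALERT_EXCERPT) lists control02.ctl and control01.ctl, which matches the two copies living on the SAN; the third copy never shows up in the errors.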

After the A/C was back online and the ambient temp was in operating range again,
the SAN was restarted and had its cache flushed to disk.  The DB server was
halted (not shut down) and restarted.  I started the DB manually with
nomount, mount, and finally open, all successfully.

My question -- why???  I fully expected to have to rebuild the controlfile
or at least copy controlfile 3 back to 1 and 2, but all were apparently
consistent prior to startup (in hindsight, I should have copied them to
another place before attempting a restart!).  The same scenario played out on
three DBs across two physical servers.
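
The "copy them somewhere first" step could look something like the Python sketch below, run while the instance is still down. The helper names (sha256_of, preserve_controlfiles, copies_agree) and the backup directory are hypothetical, and treating matching checksums as a sign that the multiplexed copies are in sync is my assumption, not something I found in the docs.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve_controlfiles(controlfiles, backup_dir):
    """Copy each control file into backup_dir; return {source path: sha256 of the copy}."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    digests = {}
    for index, source in enumerate(controlfiles, start=1):
        source = Path(source)
        # Prefix with an index so same-named files from different mounts do not collide.
        target = backup_dir / f"{index:02d}_{source.name}"
        shutil.copy2(source, target)
        digests[str(source)] = sha256_of(target)
    return digests


def copies_agree(digests):
    """True when every preserved copy has the same checksum."""
    return len(set(digests.values())) <= 1
```

Pass it the three control file paths from the control_files parameter plus a scratch directory off the SAN. If copies_agree() comes back False, at least the pre-restart state is saved for a later comparison.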

The current working theory is that Oracle had nothing to do with the
controlfiles being up-to-date, and that the SAN's cache flush to disk is what
made them consistent.  Or is
it possible that Oracle determined that controlfile 3 was the up-to-date one
and did the copy back to 1 and 2 for me?  I didn't think that functionality
existed since there's nothing in the alert.log about that and scanning the
docs didn't turn up anything either.

The last time I had this happen to me, there was no local controlfile and
the SAN got disconnected.  I ended up rebuilding the controlfile from the
daily trace.

Thoughts?

Rich


--
http://www.freelists.org/webpage/oracle-l