Re: Nasty RAC Bug in 10g. If you are running multi-nodes and one instance or more is not normally running - Read this...

From: Ravi Gaur <ravigaur1@xxxxxxxxx>
To: robertgfreeman@xxxxxxxxx
Date: Tue, 7 Apr 2009 10:00:45 -0500

Thanks Robert for bringing this up!!
I asked Oracle Support about this bug (because we plan to add/drop
redo log groups in our RAC env). I'm sharing the response I got --
~~~~~~~~~~~~~~~~~~~~~~
Look for all the following:
- RAC configuration (ie multiple redo threads)
- looking at the alert logs of all the instances, they show
the following sequence of events in order:
1) the DBA dropped a specific log group# <x> belonging
to a redo thread whose instanceA was already in a
shutdown state (ie their redo thread was closed)
2) then the DBA added log group# <x> to the redo thread
belonging to a currently running instanceB
3) then the currently running instanceB switched logs
into the new log group# <x> and then began writing to
the old member logfiles which were formerly members
of the log group# <x> before it was dropped
4) there are LGWR errors, instance failure, and redo log
corruption
for example, here is one possible set of errors seen in
the case where the old logfiles were smaller than the
new logfiles (the errors are reported when LGWR tries
to write redo beyond the end of the old logfiles)
LGWR reported:
ORA-00340: IO error processing online log of thread
ORA-00345: redo log write error block <blk#> count <cnt>
ORA-00312: online log <log#> thread <thr#>: '<old_logfile>'
ORA-17510: Attempt to do i/o beyond file size
other processes (eg pmon,lmd,lms,lmon) reported:
ORA-00340: IO error processing online log of thread
and LGWR terminated the instance
on instance restart, crash recovery failed with:
ORA-00314: log <log#> of thread <thr#>,
expected sequence# <seq#> doesn't match 0
ORA-00312: online log <log#> thread <thr#>: '<new_logfile>'
LGWR stack:
-> ksbrdp() -> ksbabs() -> kcrfw_redo_write() ->
-> kcrfw_post() -> -> kcrfwcint() -> ORA-00345
- there is another (less severe) variation of this bug,
the sequence of events is almost the same as described
above except in step (2), the DBA added a different log
group#, and the error is different - LGWR reports an
ORA-00600:[kcrf_cached_open_log_1] when it tries to
switch into the wrong logfile, and in this case, no redo
is actually written to the wrong logfile, and after the
instance has terminated, it can be restarted without any
problems.

-- And the WORKAROUND, mentioned as follows:

To avoid the possibility of encountering this problem in the
first place, set the following event in the init.ora files of
all the instances:
event="10468 trace name context forever, level 2"
One side effect of doing this is that instance
recovery may be slower in cases where the logfiles are
located on ASM disk groups (see bug 4967266).
If the problem has already happened, the instance has
terminated, and the database can no longer be opened, then
to recover the database to a consistent (earlier)
state, do media recovery and apply redo up until just
before the wrong online redo log file was switched into.

-- Currently, there is no backport patch available for bug 6786022 for your
platform (Sun Solaris SPARC), but we can request one, if needed, on top of
10.2.0.4.

eos
~~~~~~~~~~~~~~~~~~~~~~
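
For anyone wanting to confirm that the event from the workaround above is set on every instance, here is a minimal sketch in Python that scans the text of an init.ora file for it. The function name has_event_10468 is made up for illustration. The sketch assumes a plain-text pfile where comment lines start with "#" and the parameter may carry a "*." or "SID." prefix. It does not read an spfile or query the running instance.

```python
import re

# Matches a parameter line such as:
#   event="10468 trace name context forever, level 2"
# An optional "*." or "SID." prefix is tolerated (assumption about pfile layout).
EVENT_RE = re.compile(
    r'^\s*(?:[\w*]+\.)?event\s*=.*\b10468\s+trace\s+name\s+context'
    r'\s+forever\s*,\s*level\s+2\b',
    re.IGNORECASE,
)


def has_event_10468(init_ora_text: str) -> bool:
    """Return True if a non-comment line sets event 10468 at level 2."""
    for line in init_ora_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip commented-out settings
        if EVENT_RE.match(stripped):
            return True
    return False
```

Running it against the init.ora contents of each instance in turn gives a quick yes/no per instance. A False result only means the line was not found in that file, not that the event is unset in a running instance.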

I'm not sure I follow the sequence, so I'm going to ask them again.
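
One possible reading of steps (1) and (2) in the response above, written as a small Python sketch: a log group number is dropped from a thread whose instance is shut down, and the same group number is later added to a thread whose instance is running. The LogEvent record and the risky_groups function are hypothetical names used only to model the ordering. Nothing here parses a real alert log or talks to a database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    action: str        # "drop" or "add"
    group: int         # redo log group number
    thread: int        # redo thread the group belongs to
    thread_open: bool  # was that thread's instance running at the time?


def risky_groups(events):
    """Return group numbers dropped from a closed thread and later
    re-added to an open thread, in the order the re-adds occur."""
    dropped_while_closed = set()
    risky = []
    for ev in events:
        if ev.action == "drop" and not ev.thread_open:
            dropped_while_closed.add(ev.group)
        elif (ev.action == "add" and ev.thread_open
              and ev.group in dropped_while_closed):
            risky.append(ev.group)
    return risky


# Group 5 dropped from closed thread 3, then added to running thread 1.
history = [
    LogEvent("drop", 5, 3, False),
    LogEvent("add", 5, 1, True),
]
print(risky_groups(history))
```

The sketch only flags the reuse of the same group number. The less severe variation in the response, where a different group number is added in step (2), would not be reported by it.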

- Ravi Gaur



On Sun, Apr 5, 2009 at 2:12 AM, Robert Freeman <robertgfreeman@xxxxxxxxx> wrote:

>
> So, we ran into a nasty bug last night. We are running 10g (various
> releases) RAC on 3 or 4 node clusters. In this particular configuration we
> had a 4 node cluster, with an instance for this database on each node. 2
> instances were active, two were configured but not running.
>
> DBA went to make redo log adjustments (adding a new group) and database
> crashed. There is a bug in 10g (and apparently 11g) with respect to this
> kind of configuration. If you are running an active/passive kind of RAC
> configuration, you will want to read up on the bug. Be very careful making
> any online redo log changes if you are running in such an environment.
>
> Metalink bug number is 6786022 and it's public. We understand patch is in
> QA to correct. There is also an event you can set to avoid the problem. See
> the bug on Metalink for more information.
>
> I'll also be posting a copy of this on my Blog...
>
> Cheers to all!
>
> RF
>
>
>  Robert G. Freeman
> Author:
> Blog: http://robertgfreeman.blogspot.com
> OCP: Oracle Database 11g Administrator Certified Professional Study Guide
> (Sybex)
> Oracle Database 11g New Features (Oracle Press)
> Portable DBA: Oracle  (Oracle Press)
> Oracle Database 10g New Features (Oracle Press)
> Oracle9i RMAN Backup and Recovery (Oracle Press)
> Oracle9i New Features (Oracle Press)
> Other various titles out of print now...
> The LDS Church is looking for DBA's. You do have to be a Church member in
> good standing. A lot of kind people write me, concerned I may be breaking
> the law by saying you have to be a Church member. It's legal I promise! :-)
