Re: FW: My worst nightmare - ORA-8103

  • From: Gerwin Hendriksen <gerwin.hendriksen@xxxxxxxxx>
  • To: Maureen English <maureen.english@xxxxxxxxxx>
  • Date: Sat, 14 Jun 2014 08:48:02 +0200

Hi Maureen,

I think the first ORA-600 errors have been due to trouble with the refresh
of materialized views, you mentioned something about, and corruptions in
the temporary tablespace. See note: "Errors For [rworupo.2] Or
[kcblsltio_1] While Refresh A Materialized View (Doc ID 1250564.1)".

The errors you describe after this first ORA-600 also have to do with
corruptions, so the first ORA-600 pointing to corruptions in the temporary
tablespace might be the first signs of hardware IO issues. So not only
temporary tablespace was affected but also the undo tablespace, data
tablespaces, etc. That is probably the whole list of issues which you have
been encountering. So the first ORA-600 is not a cause, the cause is
hardware failure.

From my opinion I would advise the following to make sure your database
will work reliable and Oracle supported again:
- First of all ensure the root cause of the hardware problems has found and
this is fixed.
- Create from the database an offline backup of the entire database. Might
something for some reasons not present in the export, you can go back to
the original database.
- Create a new database and use export / import (or datapump of course).
  --> The new database will make sure no hidden problems like transactions
in tables still pointing to non existing rollback segments will not be
present.
- Verify the new database and fix missing data, by the use of the old
database.
  --> As Mark also suggested with the help of indexes.
- Make a new offline backup of the new database.
- Use the new database for production.

I wish you good luck and would like to hear your continuation on this
entire issue.

With kind regards,

Gerwin









2014-06-13 23:44 GMT+02:00 Maureen English <maureen.english@xxxxxxxxxx>:

> Gerwin, Mark, (rest of list);
>
> I think we had a couple of different problems.
>
> Something happened on 5/28...still unknown as to what this was, but the
> alert log shows these errors.
>
>  Wed May 28 10:28:57 2014
>> Errors in file 
>> /ORACLE/home/admin/diag/rdbms/prod/PROD/trace/PROD_ora_13212.trc
>>  (incident=15673791):
>> ORA-00600: internal error code, arguments: [kcblsltio_1], [4582055],
>> [86302], [], [], [], [], [], [], [], [], []
>> Incident details in: /ORACLE/home/admin/diag/rdbms/
>> prod/PROD/incident/incdir_15673791/PROD_ora_13212_i15673791.trc
>>
>> Wed May 28 14:03:52 2014
>> Errors in file 
>> /ORACLE/home/admin/diag/rdbms/prod/PROD/trace/PROD_ora_11869.trc
>>  (incident=15670567):
>> ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [],
>> [], [], [], [], [], []
>> Incident details in: /ORACLE/home/admin/diag/rdbms/
>> prod/PROD/incident/incdir_15670567/PROD_ora_11869_i15670567.trc
>>
>
> Oracle Support says that if we can't reproduce the first one, they can't
> help us.  It's not a documented error
> and we're on an unsupported version of Oracle (11.2.0.2) and they won't
> open a new case for it.
>
> The second one has
>
>> 0x30c76bb8.2 file# 195 block# 486328 slot 2 not found
>>
>
> which is the file for the tablespace with the table having the ORA-8103
> problem.
>
> A few days later, we had another hardware issue that caused some problems
> with our scheduling software and that
> system got rebooted a couple of times.  I think that there were some jobs
> running against the database at that
> time, because that's when the issues with the undo segment started.
>  Recreating the UNDO tablespace resolved that
> issue and we're now left with getting the good data out of the huge table,
> truncating the table and getting the
> good data back in.  The plan is to get the good data out into a different
> table, expdp the data from the good
> table, drop the indexes on the table with the bad data and rename it to
> ..._corrupt, then rename the table with
> the good data to the original table name.  Then rebuild the indexes.
>
> If anyone has any idea how we can figure out what may have happened prior
> to that first ORA-00600 error, please let
> me know.
>
>
> Regarding recreating the UNDO tablespace, we were online with Oracle
> Support the entire time and did use _ parameters.
> Here's what we did:
> - created a pfile with
> undo management set to manual, and
> event = '10513 trace name context forever, level 2
> - startup restrict with pfile and verify all undo segments were
> offline...but one segment was PARTLY AVAILABLE, so
> we had to shutdown and add
> _corrupted_rollback_segments = ('_SYSSMU1_2143500570$') to the pfile
> then the segment was showing NEEDS RECOVERY and we were able to continue
> - create new undo tablespace
> - drop old undo tablespace
> - shutdown immediate, startup mount
> - alter system set undo_tablespace to new undo tablespace in spfile
> - shutdown immediate and then startup normal
>
>
> I'm ready to move to a place where they have a tiny little database and
> hardly any users...or, better yet, maybe I
> can retire some day soon...1478 days per my countdown clock :-)
>
> - Maureen
>
>
>
> On 6/11/2014 9:10 PM, Gerwin Hendriksen wrote:
>
>> Hello Maureen,
>>
>> I think Marc might be right with index suggestion. What worries me is the
>> fact you had a corrupted undo tablespace. I
>> guess you used parameters like '_corrupted_rollback_segments' and/or
>> '_offline_rollback_segments' to startup your
>> database before able to create a new undo tablespace. Can you explain how
>> this was done?
>>
>> Can you also state what kind of errors you encountered in your database
>> at the time you recognized a corrupted undo
>> tablespace? The problem is that most of this problems are going with
>> ORA-600 errors in the 4000 range (transaction layer
>> in the RDBMS code). So the problem you see with the table involved giving
>> ORA-8103 might be just one of the problems in
>> your database. Maybe some others have simply not been touched yet (don't
>> you see any errors in your alert file?).
>>
>> If the database was openened with one of the underscore parameters I
>> mentioned the only valid thing to do is recreating
>> the database by export / import. This is probably a very big job, but in
>> my opinion the only official Oracle supported
>> way to run the database.
>>
>> With kind regards,
>>
>> Gerwin Hendriksen
>>
>>
>> 2014-06-12 6:09 GMT+02:00 Maureen English <maureen.english@xxxxxxxxxx
>> <mailto:maureen.english@xxxxxxxxxx>>:
>>
>>
>>     Thanks!  I updated our ticket with Oracle asking for info on this
>> bug.  I can't find anything about it on Metalink.
>>
>>     Regarding Mark's post, does this mean that the reason users aren't
>> seeing any problems is because
>>     they are able to access the data because the application uses the
>> indexes when querying?
>>
>>     We did have to create a new undo tablespace due to corruption in the
>> undo tablespace.
>>
>>     At least one of the indexes got an ORA-8103 when running analyze on
>> it.  I didn't check the others.
>>     Now I'm starting to panic again.  Oracle says that the data is lost,
>> but now it might not be???
>>
>>     Interestingly enough, we were able to fix the other issues that we've
>> found by recreating indexes, or stepping through
>>     a table using a value from a unique index and then find the corrupt
>> or duplicate rows and delete them.  This table
>>     doesn't have a unique index...and it has 553M+ rows.  It takes 20
>> minutes just to get a rowcount back :-(
>>
>>     I ended up following Oracle's suggestion of copying the rows that
>> didn't return the ORA-8102 error to another table,
>>     exported that table, truncated my application table, dropped the
>> indexes on the application table and imported the
>>     data back in.  I'm currently rebuilding the indexes.  Tomorrow will
>> be when the 2 main Finance users will check things
>>     out and let me know the status.
>>
>>     I'm so dreading the possibility of someone telling me that all of a
>> sudden nothing balances...especially when it all
>>     seems to be in balance now....
>>
>>     - Maureen
>>
>>
>>
>>
>>     On 6/11/2014 5:59 AM, Gerwin Hendriksen wrote:
>>
>>         Hello Maureen,
>>
>>         The ORA-8103 can occur when the block contains a vector to an
>> undo segment and for some reason the change can't
>>         be found
>>         in the undo segment. Years ago I logged a bug at Oracle (at that
>> time working for Oracle Support) describing a
>>         situation
>>         when recovering a database ending in an ORA-8103 or ORA-600
>> [4146], when selecting a table. Because the issue
>>         described
>>         here is in a reporting database, this bug although a very old bug
>> could be underneath. The bug is caused by the
>>         way redo
>>         is applied and a small issue there might give the same situation
>> as described in the bug. The bug was filed 31 march
>>         2001 and has number 1714688, the bug is probably never really
>> fixed because fixing it would mean a major design
>>         change
>>         in the redo apply. Further more the bug was encountered due to
>> human error in the way the recovery was done.
>>         Just ask
>>         support that this bug might give an idea why you encounter the
>> ORA-8103.
>>
>>         Good luck and kind regards,
>>
>>         Gerwin Hendriksen
>>
>>
>>         2014-06-11 14:01 GMT+02:00 Mark W. Farnham <mwf@xxxxxxxx <mailto:
>> mwf@xxxxxxxx> <mailto:mwf@xxxxxxxx
>>
>>         <mailto:mwf@xxxxxxxx>>>:
>>
>>
>>              Sorry, missed the list.
>>
>>              -----Original Message-----
>>              From: Mark W. Farnham [mailto:mwf@xxxxxxxx <mailto:
>> mwf@xxxxxxxx> <mailto:mwf@xxxxxxxx <mailto:mwf@xxxxxxxx>>]
>>              Sent: Tuesday, June 10, 2014 2:43 PM
>>              To: 'maureen.english@xxxxxxxxxx <mailto:maureen.english@
>> alaska.edu> <mailto:maureen.english@__alaska.edu
>>         <mailto:maureen.english@xxxxxxxxxx>>'
>>              Subject: RE: My worst nightmare - ORA-8103
>>
>>              If a unique index survives, you should be able to retrieve
>> the rowids for
>>              the rows not in your recovery target table with a minus.
>>              Then, using the rowids and column sets for indexes, you can
>> retrieve at
>>              least those column values from the indexes.
>>              This at least should provide you with a complete set of keys
>> for the missing
>>              rows.
>>
>>              I presume you lack the ability to do individual block
>> patching via RMAN,
>>              etc.
>>
>>              good luck,
>>
>>              mwf
>>
>>              -----Original Message-----
>>              From: oracle-l-bounce@xxxxxxxxxxxxx <mailto:oracle-l-bounce@
>> freelists.org>
>>         <mailto:oracle-l-bounce@__freelists.org <mailto:oracle-l-bounce@
>> freelists.org>>
>>         [mailto:oracle-l-bounce@__freelists.org <mailto:oracle-l-bounce@
>> freelists.org>
>>              <mailto:oracle-l-bounce@__freelists.org <mailto:
>> oracle-l-bounce@xxxxxxxxxxxxx>>]
>>              On Behalf Of Maureen English
>>              Sent: Tuesday, June 10, 2014 2:29 PM
>>              To: oracle-l@xxxxxxxxxxxxx <mailto:oracle-l@xxxxxxxxxxxxx>
>> <mailto:oracle-l@xxxxxxxxxxxxx
>>         <mailto:oracle-l@xxxxxxxxxxxxx>__>
>>              Subject: My worst nightmare - ORA-8103
>>
>>              Hi,
>>
>>              Original post to BOracle list...apologies to those who are
>> seeing this
>>              again.
>>
>>              On 5/31, the refresh of a materialized view in our reporting
>> instance failed
>>              with an ORA-8103 error, as did a gather stats job for the
>> same table in the
>>              production database.  We had an application upgrade done on
>> 6/1 and copied
>>              our production database to a preprod version on 6/2, so the
>> errors weren't
>>              caught immediately.
>>
>>              Users complained that the current data wasn't available in
>> the reporting
>>              instance but my attempts to refresh and recreate all failed
>> with the same
>>              ORA-8103 error.
>>
>>              I've been working with Oracle Support since early last week
>> and keep hitting
>>              brick walls.  At the moment, we've managed to copy
>> 'uncorrupted' rows out of
>>              the table in our preproduction database and are 480 rows
>> short in a table
>>              with 550M rows in it.  Oracle is currently working to
>> identify the rowids of
>>              the corrupt blocks based on the output in a trace file
>> generated by a
>>              failing query.
>>
>>              I'm looking at Document 336133.1 while I wait for more info
>> from Oracle.
>>
>>              Anyone have any comments/suggestions/other info that might
>> help identify and
>>              fix the problem?  We really can't afford to lose that much
>> data.  We're
>>              working on recovering our database to a different location
>> to try to get
>>              back any data that we lose, but since I don't have any idea
>> what caused the
>>              corruption, I'm lost.
>>
>>              - Maureen
>>              --
>>         //www.freelists.org/__webpage/oracle-l <
>> //www.freelists.org/webpage/oracle-l>
>>
>>
>>              --
>>         //www.freelists.org/__webpage/oracle-l <
>> //www.freelists.org/webpage/oracle-l>
>>
>>
>>
>>
>>
>>     --
>>     Maureen English
>>     Lead Database Administrator
>>     University of Alaska
>>     Fairbanks, AK
>>     (907) 450-8329 <tel:%28907%29%20450-8329>
>>
>>
>>
> --
> Maureen English
> Lead Database Administrator
> University of Alaska
> Fairbanks, AK
> (907) 450-8329
>

Other related posts: