RE: Lost Writes

From: "Mark W. Farnham" <mwf@xxxxxxxx>
To: <charlottejanehammond@xxxxxxxxx>, <oracle-l@xxxxxxxxxxxxx>
Date: Thu, 6 Aug 2009 09:32:19 -0400
DB_LOST_WRITE_PROTECT, see manual B28270-03. There is some overhead.

You mentioned you're using RMAN, so it is possible you could have used block
recovery from a backup before the lost write(s) if you have the thread of
redo logs from that point to the present. If you have the thread of redo for
the file from its creation, then you could have also recovered the whole
file from thin air (and the redo logs.)

Now, as you mentioned a required component of generating the problem is that
the i/o subsystem loses the write but tells Oracle it made the write. But
you also note that there was a disk fault. So the disk fault should trigger
verification of the contents of that disk. If it throws faults too often,
then the disk should be retired. What would be worse is if the disk
subsystem lost a write and did not throw any fault. Then you really couldn't
use that i/o subsystem for anything important without implementing a
paranoid scheme for error detection. (Paranoid means "a good dba" in this
context.) If you must use a disk i/o subsystem that is not or never should
have been certified for use with Oracle databases, then the basic paranoid
scheme is to mine your redo log and force a read of each modified block in
Oracle and compare it to a block read in your program from the actual disk
block address. You've got to that to verify that the block (that could still
be in the Oracle cache) from Oracle matches the block on disk. Of course if
there is a delayed block cleanout you might get a false positive that you'd
have to figure out how to resolve.

Better to use disk subsystems that at least throw a fault when they lose a
write, and then to not ignore disk faults.



-----Original Message-----
From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx]
On Behalf Of Charlotte Hammond
Sent: Wednesday, August 05, 2009 5:29 PM
To: oracle-l@xxxxxxxxxxxxx
Subject: Lost Writes

Hi All,

We recently had a disk fault on one of our test servers.  After replacing
the drive I recovered the database using RMAN only to encounter an ORA-600
[3020] part way relating to blocks in the system tablespace.  I had to do a
point-in-time restore to get the database back.  We lost transactions (but
since it was test not a big deal).

As I understand it this problem is possibly due to lost writes (i/o not
going to disk but saying it has) due to the disk drive going bad (it may be
due to other things, but lets assume that's the problem for the sake of
discussion).  What I'm trying to understand is:

1. Will this only manifest itself during a recovery or will I get a similar
error if I try to update a stale block using SQL?
2. Given I used the same backup but terminated the recovery earlier,
presumably the stale block is still there, I just avoiding applying recovery
to it.   Do I now have a "bad" database? - should I exp/imp?  DBV shows
nothing amiss but presumably these will not be seen as logical corruptions
as the blocks are simply old, not badly formed.         
3. I see 11g can detect lost writes in conjunction with a standby.  Is there
anything I can do on stand-alone 10g?  I'm now very paranoid I'll get the
same thing in production and really loose data.


Thanks!
Charlotte


      
--
//www.freelists.org/webpage/oracle-l




--
//www.freelists.org/webpage/oracle-l
Follow-Ups:
- RE: Lost Writes
  - From: Mark W. Farnham
- RE: Lost Writes
  - From: Crisler, Jon
References:
- Lost Writes
  - From: Charlotte Hammond
RE: Lost Writes

Other related posts: