Re: Harry Houdini Corruptions?

  • From: Stefan Knecht <knecht.stefan@xxxxxxxxx>
  • To: Spare EmailAcct <emailacctspare@xxxxxxxxx>, Jeremy Schneider <jeremy.schneider@xxxxxxxxxxxxxx>, nmjamaleddin@xxxxxxxxxxxxxxxx, willyk@xxxxxxxxxxx
  • Date: Tue, 15 Sep 2015 15:00:31 +0700

Thanks everyone for the ideas so far.

To answer some of the questions I've received:

- RMAN is backing up the blocks just fine, as the backup is scheduled at
the point in time when the corruption has "vanished". We're using RMAN
VALIDATE to detect them (running that at 3h intervals).

- We have block dumps during the time the blocks are corrupted and after.
But we don't know exactly when the corruption occurs, and what's writing to
them. Validating a 500GB database with RMAN takes time and resources so we
can't run that constantly.
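
Since a full validate is too heavy to run constantly, one cheaper option
might be to watch only the block range that was just reported corrupt and
log when its contents change. Below is a minimal sketch of that idea, not
something we are running. Assumptions: an 8 KB db_block_size, that block N
sits at byte offset N * block size in the datafile, and the function names
and the datafile path are made up for illustration. Note that a plain
buffered read may be served from the OS page cache, so it shows what the
kernel returns, which is not necessarily what is on the disks.

```python
import hashlib
import time

BLOCK_SIZE = 8192  # assumption: db_block_size is 8 KB


def hash_block_range(path, first_block, last_block, block_size=BLOCK_SIZE):
    """Return {block_number: sha256 hex digest} for first..last inclusive."""
    digests = {}
    with open(path, "rb") as f:
        for block in range(first_block, last_block + 1):
            # Assumption: block N starts at byte offset N * block_size.
            f.seek(block * block_size)
            digests[block] = hashlib.sha256(f.read(block_size)).hexdigest()
    return digests


def changed_blocks(before, after):
    """Return the sorted block numbers whose digest differs between snapshots."""
    return sorted(b for b in before if before[b] != after.get(b))


def watch(path, first_block, last_block, interval_s=60, rounds=None):
    """Re-hash the range every interval_s seconds and print any changes.

    rounds=None loops until interrupted; a number limits the iterations.
    """
    previous = hash_block_range(path, first_block, last_block)
    done = 0
    while rounds is None or done < rounds:
        time.sleep(interval_s)
        current = hash_block_range(path, first_block, last_block)
        changed = changed_blocks(previous, current)
        if changed:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(stamp, "changed blocks:", changed)
        previous = current
        done += 1
```

For example, after the 3:30AM report one could call
watch("/path/to/datafile.dbf", 2200555, 2200559) (hypothetical path) and
read off the timestamp at which the blocks change back. That would narrow
the "vanish" time to a minute rather than a 3h window.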

It seems that the general consensus is that it's somehow related to the
sync mechanism between the controller and its disks. What I just don't get
is the apparent pattern in when they occur and when they "vanish". Also,
we've never had it occur on weekends. So perhaps it's simply a combination
of high load, or a certain type of load, plus an issue with the controller
and/or the disks.

They're planning on switching out the disks next weekend. If the
corruptions cease after that, we'll know it was the disks.

Stefan




On Fri, Sep 11, 2015 at 9:36 PM, Spare EmailAcct <emailacctspare@xxxxxxxxx>
wrote:

Stefan,

That is a good one.

I don't understand: if they are not Oracle-formatted blocks, how can
RMAN back them up?

If RMAN can see them and provide the details, you should be able to dump
the blocks and see the contents; that might offer an insight into the root
cause.

Example:

SQL> alter system dump datafile X BLOCK YYYYY;

then convert and review the data.

Can the customer open an SR and file a bug with Oracle? That would be a
good option too.

Thanks,
Frank

------------------------------
*From:* Stefan Knecht <knecht.stefan@xxxxxxxxx>
*To:* oracle-l-freelists <oracle-l@xxxxxxxxxxxxx>
*Sent:* Thursday, September 10, 2015 3:40 AM
*Subject:* Harry Houdini Corruptions?

Hi all

Got a scenario on a client system that has me puzzled.

RHEL box, running on top of a local disk hardware RAID 1-0, and Linux
kernel (LUKS) encryption on top of that.

Database files are on an ext3 filesystem created on top of that encrypted
raid device.

Daily, but only from Monday to Friday, in the wee hours of the morning, a
handful (between 3 and 7) of consecutive blocks get corrupted. They're
garbage data, not Oracle-formatted blocks. They're in different files and
in different places. The pattern is totally random: sometimes it's table
blocks, sometimes indexes or LOBs. But every day, it's a bunch of them in
consecutive order.

We detect those by running RMAN validate on all the files every 3 hours.

Then, around noon the same day, the validation runs again, and the
corrupt blocks are now valid.

So basically it follows this pattern:

Monday, 3:30AM - file 7 blocks 2200555 to 2200559 corrupt
Monday, 6:30AM - same blocks reported as corrupt
Monday, 9:30AM - same blocks reported as corrupt
Monday, just after noon - no corrupt blocks found.
Nothing until the next day.
Tuesday, 3:30AM - file 4 blocks 3101220 to 3101224 corrupt
Tuesday, 6:30AM - same blocks reported as corrupt
Tuesday, 9:30AM - same blocks reported as corrupt
Tuesday, just after noon - no corrupt blocks found.

And on and on it goes.
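
To check whether these ranges line up with some storage boundary, the
block numbers can be turned into byte offsets within the datafile. The
sketch below is only an illustration. Assumptions: an 8 KB block size, and
a 64 KiB stripe unit that is purely hypothetical (the real value would
have to come from the controller). The offsets are relative to the file,
so they would still need to be mapped through the filesystem's extent
layout before they say anything about positions on the encrypted device.

```python
BLOCK_SIZE = 8192        # assumption: 8 KB Oracle block size
STRIPE_UNIT = 64 * 1024  # hypothetical; use the controller's real value


def block_range_to_bytes(first_block, last_block, block_size=BLOCK_SIZE):
    """Return (start_offset, length) in bytes for first..last inclusive."""
    start = first_block * block_size
    length = (last_block - first_block + 1) * block_size
    return start, length


def boundaries_crossed(start, length, unit=STRIPE_UNIT):
    """Return how many unit-sized boundaries the byte range spans across."""
    first_unit = start // unit
    last_unit = (start + length - 1) // unit
    return last_unit - first_unit


if __name__ == "__main__":
    # The two ranges from the pattern above: (file, first block, last block).
    for file_no, first, last in [(7, 2200555, 2200559), (4, 3101220, 3101224)]:
        start, length = block_range_to_bytes(first, last)
        print("file", file_no,
              "start", start,
              "length", length,
              "offset in unit", start % STRIPE_UNIT,
              "boundaries crossed", boundaries_crossed(start, length))
```

If the corrupt ranges kept starting or ending on the same kind of
boundary, that would point at the storage stack rather than at anything
inside Oracle; two samples are of course far too few to conclude anything.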

What on earth could be doing this?

There's no anti-virus or something crazy like that running. No OS jobs
found that would touch anything like this.

The only non-standard thing about that setup is the encryption, which I
have not encountered on a database server before. But I have a hard time
understanding (and particularly proving) that the encryption could be doing
that.

Does anyone have any wild ideas? :)

Cheers

Stefan







