RE: Harry Houdini Corruptions?

  • From: "Chitale, Hemant K" <Hemant-K.Chitale@xxxxxx>
  • To: "knecht.stefan@xxxxxxxxx" <knecht.stefan@xxxxxxxxx>
  • Date: Tue, 15 Sep 2015 08:19:30 +0000

A question or two:
Are you running RMAN backups to disk? Is the target disk location for the
RMAN backups also encrypted? Have you run VALIDATE BACKUPSET to check that
the backup pieces are not being corrupted as well?


Hemant K Chitale


From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On
Behalf Of Stefan Knecht
Sent: Tuesday, September 15, 2015 4:01 PM
To: Spare EmailAcct; Jeremy Schneider; nmjamaleddin@xxxxxxxxxxxxxxxx;
willyk@xxxxxxxxxxx
Cc: oracle-l-freelists
Subject: Re: Harry Houdini Corruptions?

Thanks everyone for the ideas so far.

To answer some of the questions I've received:

- RMAN is backing up the blocks just fine, because the backup is scheduled at a
point in time when the corruption has already "vanished". We're using RMAN
VALIDATE to detect them (running it at 3-hour intervals).

- We have block dumps from while the blocks are corrupted and from afterwards.
But we don't know exactly when the corruption occurs, or what is writing to the
blocks. Validating a 500GB database with RMAN takes time and resources, so we
can't run it constantly.
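One way to narrow down when the corruption appears and vanishes, without re-validating all 500GB, is to poll only the handful of blocks already reported as corrupt. The Python sketch below is an illustration, not something proposed in the thread: it assumes block N of a datafile starts at byte offset N * block_size, uses a hypothetical 8 KB block size, and assumes a read-only read of a live datafile is acceptable on this system. DBWR legitimately rewrites blocks too, so a reported change only timestamps an event to correlate with RMAN results and OS or controller logs; it does not by itself mean corruption.

```python
import hashlib
import time
from datetime import datetime

# Hypothetical block size: replace with the database's actual DB_BLOCK_SIZE.
BLOCK_SIZE = 8192


def read_blocks(path: str, first_block: int, count: int,
                block_size: int = BLOCK_SIZE) -> bytes:
    """Read `count` consecutive blocks starting at `first_block`.

    Assumes block N of the datafile starts at byte offset N * block_size.
    The read goes through the OS page cache, so it shows what the filesystem
    (and the decryption layer beneath it) returns, not necessarily the raw
    bytes on the disks.
    """
    with open(path, "rb") as f:
        f.seek(first_block * block_size)
        return f.read(count * block_size)


def block_digests(path: str, first_block: int, count: int,
                  block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one SHA-256 hex digest per block in the range."""
    data = read_blocks(path, first_block, count, block_size)
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def diff_digests(before: list[str], after: list[str],
                 first_block: int) -> list[int]:
    """Return the block numbers whose digest differs between two snapshots."""
    return [first_block + i
            for i, (old, new) in enumerate(zip(before, after))
            if old != new]


def watch(path: str, first_block: int, count: int,
          interval_s: float = 60.0, block_size: int = BLOCK_SIZE) -> None:
    """Poll the range forever and print a timestamp whenever a block changes."""
    previous = block_digests(path, first_block, count, block_size)
    while True:
        time.sleep(interval_s)
        current = block_digests(path, first_block, count, block_size)
        for block_no in diff_digests(previous, current, first_block):
            print(f"{datetime.now().isoformat()} block {block_no} changed")
        previous = current
```

For example, `watch("/path/to/file7.dbf", 2200555, 5)` (hypothetical path) would start watching the five blocks reported on Monday from the moment they are flagged, and print when they change again around noon.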

It seems that the general consensus is that it's somehow related to the sync
mechanism between the controller and its disks. What I just don't get is the
apparent pattern in when they occur and when they "vanish". Also, we've never
had it occur on weekends. So perhaps it's a combination of high load (or a
certain type of load) and an issue with the controller and/or the disks.

They're planning on switching out the disks next weekend. If they cease after
this, we'll know it was the disks.

Stefan




On Fri, Sep 11, 2015 at 9:36 PM, Spare EmailAcct
<emailacctspare@xxxxxxxxx> wrote:
Stefan,

That is a good one.

I don't understand: if they are not Oracle-formatted blocks, how can RMAN
back them up?

If RMAN can see them and provide the details, you should be able to dump the
blocks and see their contents; that might offer some insight into the root cause.

Example:

SQL> alter system dump datafile X BLOCK YYYYY;

then convert and review the data in the trace file the dump is written to.
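Alongside the trace-file dump, a raw read of the same block can show at a glance whether its header still identifies it as the expected block. This Python sketch is an illustration built on assumptions to verify against your own dumps: the commonly described cache-layer header layout (1-byte type, 1-byte format, 2 spare bytes, then a 4-byte relative data block address), a little-endian platform such as Linux on x86-64, a smallfile tablespace whose RDBA holds the relative file number in its top 10 bits and the block number in its low 22 bits, and a hypothetical 8 KB block size. The relative file number (RELATIVE_FNO in DBA_DATA_FILES) can differ from the absolute file number RMAN reports.

```python
import struct
from typing import NamedTuple

# Hypothetical block size: replace with the database's actual DB_BLOCK_SIZE.
BLOCK_SIZE = 8192


class BlockHeader(NamedTuple):
    block_type: int
    block_format: int
    rel_file: int
    block_no: int


def parse_header(block: bytes) -> BlockHeader:
    """Decode the first 8 bytes of a block (assumed layout, little-endian)."""
    # B = type byte, B = format byte, xx = two spare bytes, I = 4-byte RDBA
    block_type, block_format, rdba = struct.unpack_from("<BBxxI", block, 0)
    # Assumed smallfile RDBA: top 10 bits = relative file#, low 22 bits = block#
    return BlockHeader(block_type, block_format, rdba >> 22, rdba & 0x3FFFFF)


def header_matches(path: str, rel_file: int, block_no: int,
                   block_size: int = BLOCK_SIZE) -> bool:
    """True if the block at `block_no` carries its own address in its header.

    A garbage (non-Oracle) block is very unlikely to encode the right address.
    """
    with open(path, "rb") as f:
        f.seek(block_no * block_size)
        block = f.read(block_size)
    if len(block) < 8:
        return False  # short read: the block lies past the end of the file
    header = parse_header(block)
    return header.rel_file == rel_file and header.block_no == block_no
```

Running `header_matches` against a reported block while RMAN flags it, and again after it "vanishes", would show whether the bytes the filesystem returns actually changed in between.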

Can the customer open an SR and file a bug with Oracle? That would be a good
option too.

Thanks,
Frank

________________________________
From: Stefan Knecht <knecht.stefan@xxxxxxxxx>
To: oracle-l-freelists <oracle-l@xxxxxxxxxxxxx>
Sent: Thursday, September 10, 2015 3:40 AM
Subject: Harry Houdini Corruptions?

Hi all

Got a scenario on a client system that has me puzzled.

RHEL box running on a local hardware RAID 1+0 array, with Linux kernel (LUKS)
encryption on top of that.

Database files are on an ext3 filesystem created on top of that encrypted RAID
device.

Daily, but only from Monday to Friday, in the wee hours of the morning, a
handful of consecutive blocks (between 3 and 7) get corrupted. They're garbage
data, not Oracle-formatted blocks. They're in different files and in different
places. The pattern is totally random: sometimes it's table blocks, sometimes
indexes or LOBs. But every day, it's a bunch of them in consecutive order.

We detect those by running RMAN VALIDATE on all the files every 3 hours.

Then, around noon the same day, a re-validation runs again, and the corrupt
blocks are now valid.

So basically it follows this pattern:

Monday, 3:30AM - file 7 blocks 2200555 to 2200559 corrupt
Monday, 6:30AM - same blocks reported as corrupt
Monday, 9:30AM - same blocks reported as corrupt
Monday, just after noon - no corrupt blocks found.
Nothing until the next day.
Tuesday, 3:30AM - file 4 blocks 3101220 to 3101224 corrupt
Tuesday, 6:30AM - same blocks reported as corrupt
Tuesday, 9:30AM - same blocks reported as corrupt
Tuesday, just after noon - no corrupt blocks found.

And on and on it goes.

What on earth could be doing this?

There's no anti-virus or anything crazy like that running. We found no OS jobs
that would touch anything like this.

The only non-standard thing about this setup is the encryption, which I have not
encountered on a database server before. But I have a hard time understanding
how the encryption could be doing this, let alone proving it.

Does anyone have any wild ideas? :)

Cheers

Stefan
