Re: Help with database corruption issue

From: William Muriithi <william.muriithi@xxxxxxxxx>
To: frits.hoogland@xxxxxxxxx
Date: Sun, 5 Aug 2012 14:37:13 -0400

On 5 August 2012 12:52, Frits Hoogland <frits.hoogland@xxxxxxxxx> wrote:
> Please mind that checking most journalled file systems will ONLY check
> metadata, not the data itself. Ext3 can be configured to journal data
> writes as well (mount option data=journal).
> Most filesystem-related Linux messages/errors I've seen were due to
> memory corruption, not disk corruption. But that could have been
> coincidence or bad luck.
>
This one likely had nothing to do with memory. I am relatively
certain his hardware is fine; it's just the nature of the way ext4 was
designed. Delayed allocation is a feature introduced in ext4 that
improves I/O speed. However, it also increases the chance of
corruption after a power outage, or whenever the system has not had
enough time to sync the file system.

To avoid it happening again in future, check how you are mounting your
file system, specifically the journalling flags. Make sure the file
system code (that is, the kernel) is up to date. The ext4 authors
implemented a workaround that forces ext4 to sync more often at the
expense of speed. The workaround should be on by default, but for your
information it can be disabled with the noauto_da_alloc mount option.
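
As a rough sketch of that check, assuming a Linux host where
/proc/mounts is readable: the snippet below lists every ext4 mount with
the options the kernel reports for it, and then only those mounts where
the workaround has been switched off. The function names
ext4_mount_opts and ext4_without_workaround are made up for this
illustration, and options left at their defaults may not be listed at
all.

```shell
# Print "mountpoint: options" for every ext4 entry in a mounts table.
# Fields are: device, mount point, fs type, options, dump, pass.
# The optional argument (default /proc/mounts) is an assumption of this
# sketch, so the function can also be pointed at a saved copy.
ext4_mount_opts() {
    awk '$3 == "ext4" { print $2 ": " $4 }' "${1:-/proc/mounts}"
}

# Show only the ext4 mounts that carry noauto_da_alloc, i.e. where the
# delayed-allocation workaround has been explicitly disabled.
ext4_without_workaround() {
    ext4_mount_opts "$@" | grep 'noauto_da_alloc'
}

# Report on the running system, if a mounts table is available.
if [ -r /proc/mounts ]; then
    ext4_mount_opts
fi
```

If ext4_without_workaround prints nothing, none of the mounted ext4
file systems has the workaround disabled through that option.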

http://en.wikipedia.org/wiki/Ext4#Delayed_allocation_and_potential_data_loss

This may help to check whether you have the above patch:

rpm -q --changelog  kernel-2.6.18-308-xxx.el6.x86_64  | grep ext4

Regards,

William

> Frits Hoogland
>
> http://fritshoogland.wordpress.com
> mailto:frits.hoogland@xxxxxxxxx
> cell: +31 6 53569942
>
> On 5 Aug 2012 at 05:25, Steve Montgomerie <stmontgo@xxxxxxxxx>
> wrote:
>
>> Thanks List!
>>
>> Dennis and Peter,
>>
>> We could start 19 of 20 databases. When we tried to start database X,
>> it would lock up the mount point,
>> would not open, and would hang all of the other 19 databases.
>>
>> The actual error points to software corruption. Something like running
>> fsck against a mounted file system.
>> SA swears he did not do that, and we believe him.
>>
>> In regards to the error, it points to a system utility that detects a
>> bad block and then tries to fix it, which ends
>> up with the header information being zeroed out of some blocks.
>>
>> The only thing that makes sense to me is that the cp command somehow
>> rebuilt the header information
>> of the bad blocks. Is that possible?
>>
>> On Fri, Aug 3, 2012 at 6:49 AM, Peter Hitchman <pjhoraclel@xxxxxxxxx> wrote:
>>> Hi,
>>> Well, for some reason the ext4 file system had errors, leading to lost
>>> data. That impacts the undo tablespace data file and Oracle could not
>>> recover. All I can think is that at some point in time the ext4 file
>>> system was not 100% OK and then when you made the data file copy it
>>> had been fixed. What sort of disk layout do you have, maybe the error
>>> was corrected by way of a disk mirror or some other RAID set-up
>>> protection?
>>>
>>> Regards
>>> Pete
>>> --
>>> //www.freelists.org/webpage/oracle-l
