linux-kernel.vger.kernel.org archive mirror
* Something corrupts raid5 disks slightly during reboot
@ 2003-10-31 19:08 Ville Herva
  2003-11-01  1:41 ` Jeffrey E. Hundstad
  0 siblings, 1 reply; 24+ messages in thread
From: Ville Herva @ 2003-10-31 19:08 UTC (permalink / raw)
  To: linux-kernel

I've been experiencing strange corruption on a raid5 volume for some time.
Basically, after unmounting the filesystem, I can mount it again without
problems. I can also raidstop the raid device in between and all is still
fine:

> umount /dev/md4; mount /dev/md4
    - no corruption
> umount /dev/md4; raidstop /dev/md4; raidstart /dev/md4; mount /dev/md4
    - no corruption

But after a reboot, the filesystem is corrupted:

> mount /dev/md4
 EXT2-fs error (device md(9,4)): ext2_check_descriptors: Block bitmap for
 group 17 not in group (block 0)!
 EXT2-fs: group descriptors corrupted !

(This is recoverable with e2fsck.)

The array consists of three 80GB Samsung disks in raid5 mode, but I
experienced this problem with two of the disks in raid0 mode, too. The raid
consists of raw disks hdb,hdc,hdg (rather than partitions hdb1,hdc1,hdg1).

On the same box I have three other raid arrays on different disks, all of
which consist of partitions. These do not show corruption on boot.

I made a little experiment and saved the first megabyte of hd[bcg] between
umount,mount and umount,raidstop,raidstart,mount operations. They did not
change.
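
The snapshot-and-compare step described above can be sketched in Python. This is only an illustration: the helper names `snapshot` and `differing_ranges` are hypothetical, and the 1 MB length is taken from the experiment as described.

```python
SNAPSHOT_BYTES = 1024 * 1024  # first megabyte, as in the experiment above


def snapshot(path, length=SNAPSHOT_BYTES):
    """Return the first `length` bytes of a device node or regular file."""
    with open(path, "rb") as f:
        return f.read(length)


def differing_ranges(before, after):
    """Return [(start, end)] byte ranges (end exclusive) where the buffers differ."""
    ranges = []
    start = None
    for i, (a, b) in enumerate(zip(before, after)):
        if a != b and start is None:
            start = i                     # a run of differing bytes begins
        elif a == b and start is not None:
            ranges.append((start, i))     # the run ended at the previous byte
            start = None
    if start is not None:
        ranges.append((start, min(len(before), len(after))))
    return ranges
```

Saving `snapshot("/dev/hdc")` before the reboot and passing it together with a fresh snapshot to `differing_ranges` afterwards would list the byte ranges that changed, which is how a range such as 1060-1080 can be narrowed down.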

Then I did umount,raidstop and rebooted. After boot, the beginning of hdb was
intact, but hdc and hdg had been tampered with. (Unfortunately, raidstart was
automatically run on boot, but I did raidstop as the first thing.)

I narrowed the difference down to bytes between 1060-1080 on hdc:

root@linux:/scratch>od -x hdc_bytes-1060-1080_before_boot
0000000 1e1e 00d0 000d 00d0 752e 4264 7714 3fa2
0000020 0002 0014
root@linux:/scratch>od -x hdc_bytes-1060-1080_after_boot
0000000 1e1e 00d0 000d 00d0 75ff 4264 7427 3fa2
0000020 0003 0014

On hdg, this range differed too:

root@linux:/scratch>od -x hdg_bytes-1060-1080_before_boot
0000000 8000 0000 8000 0000 7526 3fa2 7539 3fa2
0000020 0002 0014
root@linux:/scratch>od -x hdg_bytes-1060-1080_after_boot
0000000 8000 0000 8000 0000 75f7 3fa2 760a 3fa2
0000020 0003 0014

But there was an additional difference somewhere between 1kB and 5kB that
wasn't there on hdc.

When I copied the saved 1MB blocks back in place, the fs mounted without
problems.

AFAIK, the first 512 bytes on each disk should be the raid superblock and the
next 512 may be the ext2 superblock. I assume 1060-1080 falls into the group
descriptor table that gets corrupted.
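
One way to read the dumps above is sketched below. It rests on assumptions that the thread does not confirm: that the filesystem starts at byte 0 of hdg, and that the standard ext2 layout applies, with the primary superblock at byte offset 1024, so that disk bytes 1060-1079 would be superblock offsets 36-55 (s_frags_per_group, s_inodes_per_group, s_mtime, s_wtime, s_mnt_count, s_max_mnt_count). The helper names are hypothetical.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # assumption: primary ext2 superblock at byte 1024
DUMP_OFFSET = 1060         # first byte covered by the od dumps

FIELDS = ("frags_per_group", "inodes_per_group", "mtime", "wtime",
          "mnt_count", "max_mnt_count")


def od_words_to_bytes(words):
    """Turn `od -x` words (16-bit, little-endian host) back into raw bytes."""
    return b"".join(struct.pack("<H", int(w, 16)) for w in words.split())


def decode(words):
    """Decode 20 dumped bytes as ext2 superblock fields at offsets 36..55."""
    assert DUMP_OFFSET - SUPERBLOCK_OFFSET == 36
    return dict(zip(FIELDS, struct.unpack("<IIIIHh", od_words_to_bytes(words))))


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


HDC_BEFORE = "1e1e 00d0 000d 00d0 752e 4264 7714 3fa2 0002 0014"
HDC_AFTER = "1e1e 00d0 000d 00d0 75ff 4264 7427 3fa2 0003 0014"
HDG_BEFORE = "8000 0000 8000 0000 7526 3fa2 7539 3fa2 0002 0014"
HDG_AFTER = "8000 0000 8000 0000 75f7 3fa2 760a 3fa2 0003 0014"

hdg_before = decode(HDG_BEFORE)
hdg_after = decode(HDG_AFTER)

# Show which decoded fields changed across the reboot.
for name in FIELDS:
    if hdg_before[name] != hdg_after[name]:
        print(name, hdg_before[name], "->", hdg_after[name])

# If hdc held raid5 parity for this stripe, hdc XOR hdg would stay constant.
parity_consistent = (
    xor(od_words_to_bytes(HDC_BEFORE), od_words_to_bytes(HDG_BEFORE))
    == xor(od_words_to_bytes(HDC_AFTER), od_words_to_bytes(HDG_AFTER))
)
print("hdc XOR hdg unchanged:", parity_consistent)
```

Under these assumptions the hdg bytes decode to plausible values, and the fields that differ are the mount time, the write time and the mount count. The hdc bytes do not decode sensibly on their own, but their XOR with the hdg bytes is the same before and after, which would fit hdc holding parity for this stripe while hdb stays unchanged. This is an interpretation of the dumps, not something established in the thread.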

It may be something in userspace that corrupts the disks, but I cannot think
what it could be.

Right now, the kernel is 2.2.25-secure + patches, but earlier 2.2.x kernels
exhibited this as well. These include the newest raid 0.90 patches for 2.2.

Any ideas what might cause this or how to debug this further?


-- v --

v@iki.fi


* Re: Something corrupts raid5 disks slightly during reboot
  2003-10-31 19:08 Something corrupts raid5 disks slightly during reboot Ville Herva
@ 2003-11-01  1:41 ` Jeffrey E. Hundstad
  2003-11-01  1:57   ` Mike Fedyk
  2003-11-01  8:27   ` ide write cache issue? [Re: Something corrupts raid5 disks slightly during reboot] Ville Herva
  0 siblings, 2 replies; 24+ messages in thread
From: Jeffrey E. Hundstad @ 2003-11-01  1:41 UTC (permalink / raw)
  To: Ville Herva; +Cc: linux-kernel

Try:

hdparm -W0 /dev/hdX

for each of your ide drives.  This turns off write-caching which is 
usually a bad thing with ide drives anyway.




* Re: Something corrupts raid5 disks slightly during reboot
  2003-11-01  1:41 ` Jeffrey E. Hundstad
@ 2003-11-01  1:57   ` Mike Fedyk
  2003-11-01  8:33     ` Ville Herva
  2003-11-01  8:27   ` ide write cache issue? [Re: Something corrupts raid5 disks slightly during reboot] Ville Herva
  1 sibling, 1 reply; 24+ messages in thread
From: Mike Fedyk @ 2003-11-01  1:57 UTC (permalink / raw)
  To: Jeffrey E. Hundstad; +Cc: Ville Herva, linux-kernel

On Fri, Oct 31, 2003 at 07:41:30PM -0600, Jeffrey E. Hundstad wrote:
> Try:
> 
> hdparm -W0 /dev/hdX
> 
> for each of your ide drives.  This turns off write-caching which is 
> usually a bad thing with ide drives anyway.
> 

Also try installing smartmontools, and run smartctl -a on each of the
drives.  It might tell you one of the drives is going bad...


* ide write cache issue? [Re: Something corrupts raid5 disks slightly during reboot]
  2003-11-01  1:41 ` Jeffrey E. Hundstad
  2003-11-01  1:57   ` Mike Fedyk
@ 2003-11-01  8:27   ` Ville Herva
  2003-11-01 15:56     ` Willy Tarreau
  1 sibling, 1 reply; 24+ messages in thread
From: Ville Herva @ 2003-11-01  8:27 UTC (permalink / raw)
  To: Jeffrey E. Hundstad; +Cc: linux-kernel

On Fri, Oct 31, 2003 at 07:41:30PM -0600, you [Jeffrey E. Hundstad] wrote:
> Try:
> 
> hdparm -W0 /dev/hdX
> 
> for each of your ide drives.  This turns off write-caching which is 
> usually a bad thing with ide drives anyway.

According to hdparm, write caching is indeed enabled for all the drives.
I find it somewhat odd if this was the cause, though. Before reboot, the
drives were not being written to for quite a while (the fs had been
unmounted and the raid array had been stopped.) 

I suppose it _is_ possible that the drives were updating the ext2 superblock
from their write cache when power went off. The md5sum of first 1MB of the
drives was probably in sync before reboot because I got it from kernel's
cache (or drive's cache), although the up-to-date data had not been written
onto the platter yet. Also, as this is a raid5 array, one of the drives
could have been clean because the ext2 superblock (that I assume was being
updated) is physically located on only two of the drives.

I can try to turn off write caching well before next reboot. I don't
suppose there is a way to boot so that the write caching would be off all
the time - the best I can do is turn it off early in boot scripts, no?

Does anyone know if there is a crucial write caching / flushing fix in
2.4/2.6 that hasn't been merged into 2.2 (I am using the newest 2.4 ide
backport from Krzysztof Olędzki (ide-2.2.21-06162002))?

I don't suppose there is a way to explicitly flush the IDE drive write
cache from user space?

Or is this likely to be a drive firmware problem (kernel tries to flush the
drives, but they don't do it early enough?) How long do ide drives normally
hold data in write cache if they are idle?

The drives are SAMSUNG SV8004H, FwRev=QR100-07, fwiw.

Turning off write caching permanently doesn't sound inviting though, as
it'll probably ruin the raid performance completely...


-- v --

v@iki.fi


* Re: Something corrupts raid5 disks slightly during reboot
  2003-11-01  1:57   ` Mike Fedyk
@ 2003-11-01  8:33     ` Ville Herva
  0 siblings, 0 replies; 24+ messages in thread
From: Ville Herva @ 2003-11-01  8:33 UTC (permalink / raw)
  To: linux-kernel

On Fri, Oct 31, 2003 at 05:57:33PM -0800, you [Mike Fedyk] wrote:
> On Fri, Oct 31, 2003 at 07:41:30PM -0600, Jeffrey E. Hundstad wrote:
> > Try:
> > 
> > hdparm -W0 /dev/hdX
> > 
> > for each of your ide drives.  This turns off write-caching which is 
> > usually a bad thing with ide drives anyway.
> > 
> 
> Also try installing smartmontools, and run smartctl -a on each of the
> drives.  It might tell you one of the drives is going bad...

I am monitoring all my drives with smart constantly, and they haven't shown
any symptoms. The corruption only happens upon reboot, which is a quite
rare event for a server.

Also, I find that smart rarely gives many useful warnings beforehand when a
drive is about to fail. And when the drive fails I usually get a good dose
of UncorrectableErrors in the log, not silent corruption (and I've seen a
lot of drives fail ;( ). Silent corruption is usually caused by the chipset
or driver (seen that, too ;( ), but it has usually happened under stress,
not when nothing much is being written to the drive.


-- v --

v@iki.fi


* Re: ide write cache issue? [Re: Something corrupts raid5 disks slightly during reboot]
  2003-11-01  8:27   ` ide write cache issue? [Re: Something corrupts raid5 disks slightly during reboot] Ville Herva
@ 2003-11-01 15:56     ` Willy Tarreau
  2003-11-01 18:25       ` Ville Herva
  0 siblings, 1 reply; 24+ messages in thread
From: Willy Tarreau @ 2003-11-01 15:56 UTC (permalink / raw)
  To: Ville Herva, linux-kernel

Hi Ville,

do you have the ability to reboot this beast on a DOS floppy equipped with a
disk editor or even debug? It would tell you whether it's the IDE
initialization or shutdown which harms the disks. BTW, it may even be your
BIOS which believes for an unknown reason that it has to write to a
partition table where there is none.

just my 2 cents,
Willy



* Re: ide write cache issue? [Re: Something corrupts raid5 disks slightly during reboot]
  2003-11-01 15:56     ` Willy Tarreau
@ 2003-11-01 18:25       ` Ville Herva
  2003-11-01 19:01         ` Willy Tarreau
  0 siblings, 1 reply; 24+ messages in thread
From: Ville Herva @ 2003-11-01 18:25 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel

On Sat, Nov 01, 2003 at 04:56:04PM +0100, you [Willy Tarreau] wrote:
> Hi Ville,
> 
> do you have the ability to reboot this beast on a DOS floppy equipped with a
> disk editor or even debug ? 

I have been planning (as someone else suggested) to boot to a different
kernel, but unfortunately I think my off-the-shelf solution, knoppix, won't
do as it probably includes raid autodetection in its kernel, and I'd rather
rule raidstart out as well.

Is there anything special in booting to DOS instead of different linux
kernel, other than that it would rule out some strange kernel bug that is
present in 2.2 and 2.4?

> BTW, it may even be your bios which believes for an unknown reason that it
> has to write to the partition table which is not one.

Yes, but I find it unlikely. The partition table is within the first 512
bytes and the corruption was in bytes 1060-1080. Also, one of the corrupted
disks is on i815 and another is on HPT370.

BTW: the corruption happens on warm reboots (running reboot command), not
just on power off / on.
 

-- v --

v@iki.fi


* Re: ide write cache issue? [Re: Something corrupts raid5 disks slightly during reboot]
  2003-11-01 18:25       ` Ville Herva
@ 2003-11-01 19:01         ` Willy Tarreau
  2003-11-01 21:02           ` Ville Herva
  2004-01-02 19:42           ` Something corrupts raid5 disks slightly during reboot Ville Herva
  0 siblings, 2 replies; 24+ messages in thread
From: Willy Tarreau @ 2003-11-01 19:01 UTC (permalink / raw)
  To: Ville Herva, linux-kernel

On Sat, Nov 01, 2003 at 08:25:18PM +0200, Ville Herva wrote:
 
> Is there anything special in booting to DOS instead of different linux
> kernel, other than that it would rule out some strange kernel bug that is
> present in 2.2 and 2.4?

No, it was just to quickly confirm or deny the fact that it's the kernel
which causes the problem. It could have been a long standing bug in the IDE
or partition code, and which is present in several kernels. But as you say
that it affects two different controllers, there's little chance that it's
caused by anything except linux itself. Then, the reboot on DOS will only
tell you if the drives were corrupted at startup or at shutdown.

> Yes, but I find it unlikely. The partition table is within the first 512
> bytes and the corruption was in bytes 1060-1080. Also, one of the corrupted
> disks is on i815 and another is on HPT370.

I agree, but I proposed it just because it was simple to test.

> BTW: the corruption happens on warm reboots (running reboot command), not
> just on power off / on.

OK, but the BIOS scans your disks even during warm reboots. Though I don't
think it comes from there because of your two different controllers.

Willy



* Re: ide write cache issue? [Re: Something corrupts raid5 disks slightly during reboot]
  2003-11-01 19:01         ` Willy Tarreau
@ 2003-11-01 21:02           ` Ville Herva
  2003-11-02  6:05             ` Andre Hedrick
  2004-01-02 19:42           ` Something corrupts raid5 disks slightly during reboot Ville Herva
  1 sibling, 1 reply; 24+ messages in thread
From: Ville Herva @ 2003-11-01 21:02 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel

On Sat, Nov 01, 2003 at 08:01:14PM +0100, you [Willy Tarreau] wrote:
> On Sat, Nov 01, 2003 at 08:25:18PM +0200, Ville Herva wrote:
>  
> > Is there anything special in booting to DOS instead of different linux
> > kernel, other than that it would rule out some strange kernel bug that is
> > present in 2.2 and 2.4?
> 
> No, it was just to quickly confirm or deny the fact that it's the kernel
> which causes the problem. It could have been a long standing bug in the IDE
> or partition code, and which is present in several kernels. 

I vaguely recall some ide write cache flushing code was fixed some time ago,
but I can't find it in the archives. Maybe I dreamed that up. But I still
wonder why an otherwise idle drive would hold the data in write cache for so
long (several minutes.)

> But as you say that it affects two different controllers, there's little
> chance that it's caused by anything except linux itself. 

Unless the drive is buggy wrt. flushing its write cache. But I think it's
a quite distant possibility.

> Then, the reboot on DOS will only tell you if the drives were corrupted at
> startup or at shutdown.

Yep. I'll try to find the moment to boot the beast into something else than
the current kernel / distro (it could in theory be something in userspace,
though I cannot think what). 

> > BTW: the corruption happens on warm reboots (running reboot command), not
> > just on power off / on.
> 
> OK, but the BIOS scans your disks even during warm reboots. 

True, I mainly made this note because I hadn't mentioned it before in the
thread, and I thought it might have some relevance wrt. possible ide write
caching problems. I didn't mean it as a response to the BIOS theory.


-- v --

v@iki.fi


* Re: ide write cache issue? [Re: Something corrupts raid5 disks slightly during reboot]
  2003-11-01 21:02           ` Ville Herva
@ 2003-11-02  6:05             ` Andre Hedrick
  2003-11-02  8:28               ` Ville Herva
  0 siblings, 1 reply; 24+ messages in thread
From: Andre Hedrick @ 2003-11-02  6:05 UTC (permalink / raw)
  To: Ville Herva; +Cc: Willy Tarreau, linux-kernel


I added the flush code to flush a drive in several places but it got
pulled and munged.

The original model was to flush each time a device was closed, when any
partition mount point was released, and called by notifier.

With a minimal partition count of 1, you had at least two flushes before
shutdown or reboot.

So it was not the code because I fixed it, but then again I am retiring
from formal maintainership.

Cheers,

Andre Hedrick
LAD Storage Consulting Group




* Re: ide write cache issue? [Re: Something corrupts raid5 disks slightly during reboot]
  2003-11-02  6:05             ` Andre Hedrick
@ 2003-11-02  8:28               ` Ville Herva
  2003-11-02 20:57                 ` Matthias Andree
  2003-11-03  5:34                 ` Andre Hedrick
  0 siblings, 2 replies; 24+ messages in thread
From: Ville Herva @ 2003-11-02  8:28 UTC (permalink / raw)
  To: Andre Hedrick; +Cc: linux-kernel

On Sat, Nov 01, 2003 at 10:05:31PM -0800, you [Andre Hedrick] wrote:
> 
> I added the flush code to flush a drive in several places but it got
> pulled and munged.
> 
> The original model was to flush each time a device was closed, when any
> partition mount point was released, and called by notifier.
> 
> With a minimal partition count of 1, you had at least two flushes before
> shutdown or reboot.
> 
> So it was not the code because I fixed it, but then again I am retiring
> from formal maintainership.

Thanks, Andre :(.

As an^Wthe IDE expert, can you clarify a few points:

  - How long can the unwritten data linger in the drive cache if the drive
    is otherwise idle? (Without an explicit flush and with write caching
    enabled.)

    I had unmounted the fs and raidstopped the md minutes before the boot.

  - Can this corruption happen on warmboot or only on poweroff?

  - What kind of corruption can one see if the boot takes place "too fast"
    and drive hasn't got enough time to flush its cache?



-- v --

v@iki.fi


* Re: ide write cache issue? [Re: Something corrupts raid5 disks slightly during reboot]
  2003-11-02  8:28               ` Ville Herva
@ 2003-11-02 20:57                 ` Matthias Andree
  2003-11-03  5:34                 ` Andre Hedrick
  1 sibling, 0 replies; 24+ messages in thread
From: Matthias Andree @ 2003-11-02 20:57 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ville Herva, Andre Hedrick

On Sun, 02 Nov 2003, Ville Herva wrote:

> As an^Wthe IDE expert, can you clarify a few points:
> 
>   - How long can the unwritten data linger in the drive cache if the drive
>     is otherwise idle? (Without an explicit flush and with write caching
>     enabled.)

Several seconds.  This is usually detailed in the OEM integrator manual,
at least it used to be for several IBM and Fujitsu drives when I looked
two years ago.  Drives usually start flushing cached data before they go
idle, and some drives guarantee maximum times before data hits the disk.
IIRC, Fujitsu MAH drives (SCSI though, not ATA) for instance guarantee
not to cache data for longer than 3 s, even if that means interrupting
reordering writes and hits write performance adversely (because it might
involve seeks). I seem to recall some IBM ATA drive claimed 15 s, but
don't quote me on that, I don't even recall if that was 2.5" or 3.5".

I don't recall the exact wording, so it may mean that the drive will not
VOLUNTARILY DELAY the write for more than 3 s. It's quite hard to write
4,096 scattered blocks on individual cylinders in 3 s even on 10,025/min
drives and requires knowing the block offset from the current rotational
angle of the platter... I wonder if drive firmware makes such scheduling
efforts.
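
The arithmetic behind that remark can be sketched as follows. The three input figures are the ones quoted above; everything derived from them is plain division and says nothing about what real drive firmware does.

```python
RPM = 10025        # spindle speed quoted above
WINDOW_S = 3.0     # maximum voluntary caching delay quoted above
BLOCKS = 4096      # scattered blocks that would have to be written

revolutions = RPM / 60.0 * WINDOW_S          # platter revolutions in the window
blocks_per_revolution = BLOCKS / revolutions  # writes needed per revolution
ms_per_block = WINDOW_S * 1000.0 / BLOCKS     # time budget per block

print(f"revolutions in window:  {revolutions:.2f}")
print(f"blocks per revolution:  {blocks_per_revolution:.2f}")
print(f"time budget per block:  {ms_per_block:.3f} ms")
```

With roughly eight scattered blocks to place per revolution and well under a millisecond per block, there is no room for a seek between writes, which is the point being made.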

>     I had unmounted the fs and raidstopped the md minutes before the boot.

Ugly if it still corrupts. :-(

>   - Can this corruption happen on warmboot or only on poweroff?

On ATA drives, the cache contents must persist across soft or hard reset
(warmboot).

>   - What kind of corruption can one see if the boot takes place "too fast"
>     and drive hasn't got enough time to flush its cache?

None with intact drives and bug-free firmware (I doubt such a thing
exists). Anyways, on powering down or with firmware bugs, anything is
possible.

-- 
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95


* Re: ide write cache issue? [Re: Something corrupts raid5 disks slightly during reboot]
  2003-11-02  8:28               ` Ville Herva
  2003-11-02 20:57                 ` Matthias Andree
@ 2003-11-03  5:34                 ` Andre Hedrick
  2003-11-03  6:38                   ` Ville Herva
  1 sibling, 1 reply; 24+ messages in thread
From: Andre Hedrick @ 2003-11-03  5:34 UTC (permalink / raw)
  To: Ville Herva; +Cc: linux-kernel

On Sun, 2 Nov 2003, Ville Herva wrote:

> On Sat, Nov 01, 2003 at 10:05:31PM -0800, you [Andre Hedrick] wrote:
> > 
> > I added the flush code to flush a drive in several places but it got
> > pulled and munged.
> > 
> > The original model was to flush each time a device was closed, when any
> > partition mount point was released, and called by notifier.
> > 
> > With a minimal partition count of 1, you had at least two flushes before
> > shutdown or reboot.
> > 
> > So it was not the code because I fixed it, but then again I am retiring
> > from formal maintainership.
> 
> Thanks, Andre :(.
> 
> As an^Wthe IDE expert, can you clarify a few points:
> 
>   - How long can the unwritten data linger in the drive cache if the drive
>     is otherwise idle? (Without an explicit flush and with write caching
>     enabled.)

Basically forever, until a read is issued to a range of lba's which starts
smaller than the uncommitted contents' lba, and includes the content in
question.  Or if a flush cache or disable write-back cache is issued.

>     I had unmounted the fs and raidstopped the md minutes before the boot.

The problem imho, is a break down of fundamental cascading callers.

Unmount MD -> flush MD

	MD is a fakie device :-/

MD fakie calls for flush of R_DEV's

Likewise unloading or stopping MD operations should repeat regardless of
mount or not.
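
The cascade described above can be illustrated with a toy model. This is not kernel code: the `Disk` and `MdArray` classes and the `survives_reboot` helper are hypothetical, and the model simply assumes that a reset discards whatever is still in the drive's write cache.

```python
class Disk:
    """Toy drive with a volatile write-back cache (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.platter = {}   # data that has reached the medium
        self.cache = {}     # acknowledged writes not yet committed

    def write(self, lba, data):
        self.cache[lba] = data

    def flush(self):
        self.platter.update(self.cache)
        self.cache.clear()

    def reset(self):
        # Assumption of this model: a reset throws away the cached writes.
        self.cache.clear()


class MdArray:
    """Toy md device that may or may not pass a flush down to its members."""

    def __init__(self, members):
        self.members = members

    def stop(self, cascade_flush):
        if cascade_flush:
            for disk in self.members:
                disk.flush()


def survives_reboot(cascade_flush):
    """Write one block, stop the array, reset the disks, check the platter."""
    disks = [Disk(name) for name in ("hdb", "hdc", "hdg")]
    array = MdArray(disks)
    disks[2].write(2, b"superblock update")
    array.stop(cascade_flush)
    for disk in disks:
        disk.reset()
    return disks[2].platter.get(2) == b"superblock update"


print("with cascading flush:   ", survives_reboot(True))
print("without cascading flush:", survives_reboot(False))
```

In this model the write only reaches the platter when stopping the array forwards a flush to every member disk; whether the real md and IDE code paths behave this way is exactly what is in question here.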

>   - Can this corruption happen on warmboot or only on poweroff?

Given POST (assume x86 for only a brief moment) will issue execute
diagnostics to hunt for signatures on the ribbon, that basically wacks
the content.  Cool cycle obviously wacks the buffer.

>   - What kind of corruption can one see if the boot takes place "too fast"
>     and drive hasn't got enough time to flush its cache?

erm, I am lost with the above.
Flush Cache is a hold and wait on completion, period.
However, a cache error at this point is a wasted effort to attempt
recovery.

Not sure I helped or not ...

Cheers,

Andre



* Re: ide write cache issue? [Re: Something corrupts raid5 disks slightly during reboot]
  2003-11-03  5:34                 ` Andre Hedrick
@ 2003-11-03  6:38                   ` Ville Herva
  0 siblings, 0 replies; 24+ messages in thread
From: Ville Herva @ 2003-11-03  6:38 UTC (permalink / raw)
  To: Andre Hedrick; +Cc: linux-kernel

On Sun, Nov 02, 2003 at 09:34:30PM -0800, you [Andre Hedrick] wrote:
>
> >   - How long can the unwritten data linger in the drive cache if the drive
> >     is otherwise idle? (Without an explicit flush and with write caching
> >     enabled.)
> 
> Basically forever, until a read is issued to a range of lba's which starts
> smaller than the uncommitted contents' lba, and includes the content in
> question.  Or if a flush cache or disable write-back cache is issued.

Huh. Sounds stunning.

I mean if the drive is otherwise idle, why would it hold the data in cache
without trying to write it onto platter? But I'll take your word for it.
 
> >     I had unmounted the fs and raidstopped the md minutes before the boot.
> 
> The problem imho, is a break down of fundamental cascading callers.
> 
> Unmount MD -> flush MD
> 
> 	MD is a fakie device :-/
> 
> MD fakie calls for flush of R_DEV's
> 
> Likewise unloading or stopping MD operations should repeat regardless of
> mount or not.

Yep. You wouldn't happen to know if it could make a difference if the md
consists of raw devices (hdb,hdc,hdg) instead of partitions (hdb1,hdc1,hdg1)
wrt. how and when the IDE flushes get triggered? Is there code that does it
for partitions but is lacking for whole devices? 

(The other MDs on the same box that consist of partitions do not get
corrupted, but they are on Maxtors, not Samsungs.)
 
> >   - Can this corruption happen on warmboot or only on poweroff?
> 
> Given POST (assume x86 for only a brief moment) will issue execute

x86 in this case, yes.

> diagnostics to hunt for signatures on the ribbon, that basically wacks
> the content.  Cool cycle obviously wacks the buffer.

Ack.
 
> >   - What kind of corruption can one see if the boot takes place "too fast"
> >     and drive hasn't got enough time to flush its cache?
> 
> erm, I am lost with the above.
> Flush Cache is a hold and wait on completion, period.
> However, a cache error at this point is a wasted effort to attempt
> recovery.

I meant: if the drive does not flush its cache before reboot, is it likely to
see the sectors either up-to-date or having the old data? Or can one see
half-written or otherwise corrupted sectors?

The corruption I saw didn't look like the sector just had the old data, but
I'm not sure.

Then again, this may very well be something completely unrelated to ide
write caching.
 
> Not sure I helped or not ...

Yes you did, thanks!


-- v --

v@iki.fi


* Re: Something corrupts raid5 disks slightly during reboot
  2003-11-01 19:01         ` Willy Tarreau
  2003-11-01 21:02           ` Ville Herva
@ 2004-01-02 19:42           ` Ville Herva
  2004-01-02 20:02             ` Ville Herva
  2004-01-14 14:46             ` Ville Herva
  1 sibling, 2 replies; 24+ messages in thread
From: Ville Herva @ 2004-01-02 19:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: Willy Tarreau

Summary:                                                                   

I've been experiencing strange corruption on a raid5 volume for some time. 
The kernel is 2.2.x + RAID-0.90 patch. Fs is ext2 (+e2compr). After        
unmounting the filesystem, I can mount it again without problems. I can also
raidstop the raid device in between and all is still fine:

> umount /dev/md4; mount /dev/md4
    - no corruption              
> umount /dev/md4; raidstop /dev/md4; raidstart /dev/md4; mount /dev/md4
    - no corruption                                                     

But after a reboot, the filesystem is corrupted - a few bytes differ in the
beginning of /dev/md4 between 1k and 5k.

See the threads
  http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=utf-8&threadm=MMYt.4B2.1%40gated-at.bofh.it&rnum=1&prev=/groups%3Fnum%3D50%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3Dutf-8%26q%3DSomething%2Bcorrupts%2Braid5%2Bdisks%2Bslightly%2Bduring%2Breboot%26sa%3DN%26tab%3Dwg
  http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=utf-8&threadm=MZsH.72R.5%40gated-at.bofh.it&rnum=4&prev=/groups%3Fnum%3D50%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3Dutf-8%26q%3DSomething%2Bcorrupts%2Braid5%2Bdisks%2Bslightly%2Bduring%2Breboot%26sa%3DN%26tab%3Dwg
for details.

I did some further research.

First I thought this was an artifact of using a "non-normal" blocksize on the
fs, 4096 bytes. (The other raid partitions I have on the system are 1024 and
do not get corrupted.) Also the corrupting fs is on raid5 on bare disks
(hdb+hdc+hdg), while the others are on partitions (hda1+hdd1+hdf1 and so
on.)

I tried to reproduce this under vmware with 3-disk raid5 (hda+hdb+hdd) using
4096-byte ext2 and the exact same kernel. Initially, I thought I was able to
trigger it by mounting the fs while raid rebuild was in progress. The kernel
spat out this:

  set_blocksize: b_count 1, dev md(9,4), block 15642112, from c014c3fb
  set_blocksize: b_count 1, dev md(9,4), block 15642113, from c014c3fb
  set_blocksize: b_count 1, dev md(9,4), block 15642114, from c014c3fb
  ...
  set_blocksize: b_count 2, dev md(9,4), block 15642367, from c014c3fb
  md4: blocksize changed during read
  nr_blocks changed to 64 (blocksize 4096, j 3910528, max_blocks 39091968)

and fsck reported problems, but only once (the set_blocksize stuff appeared
each time). It seems the "set_blocksize" outpouring is a known issue, and
not severe:

  http://www.ussg.iu.edu/hypermail/linux/kernel/0110.1/0493.html

The fsck errors were probably just a side-effect of the unclean shutdown I
used to force a raid rebuild.


After the failed vmware experiment, I tried to isolate when exactly the
corruption happens, shutdown or boot. Also, in the mentioned threads, people
had suggested turning off the write cache of the IDE disk.

I found out that the difference (corruption) is usually on three bytes on
/dev/hdg, but sometimes on /dev/hdc, too. (/dev/md4 = hdb+hdc+hdg; hdb&hdc
are on i810, hdg is on hpt370).

First, I did
   umount /dev/md4
   raidstop /dev/md4
   head -c 50k /dev/hdg > /save/hdg
   reboot

To rule out kernel raid autodetect and raid code in general, I
booted 2.2.25-1-secure with "single init=/bin/bash raid=noautodetect".
 Did
   head -c50k /dev/hdg | cmp -l /save/hdg
 Three bytes differed:
   bytepos  after  before
            boot   boot
   4641     0      35
   4642     0      205
   4643     0      10

 Wrote the original stuff back:
   dd if=/save/hdg of=/dev/hdg
   sync
   hdparm -W0 /dev/hdg
   sync
   reboot

Booted 2.2.25-1-secure with "single init=/bin/bash raid=noautodetect"
again.
 Did
   head -c50k /dev/hdg | cmp -l /save/hdg
 The same three bytes differed again.
 Wrote the stuff back, sync'ed, did hdparm, and powered off. Still, the
bytes differed on next boot.
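
For anyone wanting to repeat the save-and-compare steps above, they could be
wrapped up roughly as below. This is only a sketch: snapshot_head and
compare_head are made-up names, and the byte count is passed in plain bytes
(51200 for 50k). Note that with this argument order cmp -l prints the saved
("before") byte in the second column and the current byte in the third, both
in octal.

```shell
# Sketch only: snapshot_head and compare_head are hypothetical helper names.

# Save the first N bytes of a device (or file) to a snapshot file.
#   $1 = device, $2 = snapshot file, $3 = byte count (51200 = 50k)
snapshot_head() {
    head -c "$3" "$1" > "$2"
}

# Compare the first N bytes of the device against the snapshot.
# Prints one line per differing byte: 1-based position, snapshot byte,
# current byte (byte values in octal). Prints nothing if they match.
#   $1 = device, $2 = snapshot file, $3 = byte count
compare_head() {
    head -c "$3" "$1" | cmp -l "$2" -
}
```

Usage would be something like "snapshot_head /dev/hdg /save/hdg 51200" before
the reboot and "compare_head /dev/hdg /save/hdg 51200" after it; no output
means nothing changed.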

Then I booted 2.4.21-jam1 with "single init=/bin/bash raid=noautodetect" (I
happened to have 2.4.21-jam1 compiled with suitable drivers at hand).
 Wrote the same stuff back with dd, synced, turned ide cache off.
 Booted 2.4.21-jam1 with "single init=/bin/bash raid=noautodetect" again.
 Did the diff; the three bytes differed again.

Note that sometimes a few bytes on hdc differed, too. Usually it was just the
three hdg bytes.
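
The byte positions can be turned into an on-disk location with a little
arithmetic. The sketch below rests on assumptions that are not verified
here: that the start of hdg maps straight onto the start of the ext2
filesystem, that with a 4096-byte blocksize the group descriptor table
starts at byte 4096, and that each descriptor is 32 bytes with the block
bitmap pointer as its first little-endian field. cmp -l reports 1-based
positions and octal byte values. All function names are made up.

```shell
# Sketch under the assumptions stated above; helper names are hypothetical.

# cmp -l prints byte values in octal; convert one to decimal.
octal_to_dec() {
    echo $((0$1))
}

# Map a 1-based cmp byte position to "group field_offset".
#   $1 = position from cmp -l, $2 = fs blocksize (assumed 4096)
# Assumes the descriptor table starts one block in, 32 bytes per descriptor.
desc_for_pos() {
    off=$(( $1 - 1 - $2 ))
    echo "$(( off / 32 )) $(( off % 32 ))"
}

# Combine three octal bytes (lowest byte first) into a block number.
le_bytes_to_num() {
    b0=$(octal_to_dec "$1")
    b1=$(octal_to_dec "$2")
    b2=$(octal_to_dec "$3")
    echo $(( b0 + (b1 << 8) + (b2 << 16) ))
}
```

With the numbers above, "desc_for_pos 4641 4096" gives group 17, field
offset 0, and "le_bytes_to_num 35 205 10" gives block 558365 for the
"before" bytes. If the assumptions hold, that would fit the "Block bitmap
for group 17 not in group (block 0)" error from the first mail, since
zeroing those bytes would make the pointer read as block 0. It does not say
anything about who writes the zeros.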

So this is not a 2.2 kernel issue. I very much doubt it's a kernel issue at
all. Unless it is a bug in kernel partition detection that is still present
in 2.4.x.
         
I tried to turn off the ide write cache with hdparm -W0, so it shouldn't  
be a write caching issue.

If it's a bios issue, it's really a strange one, since it affects both disks
on i810 ide and on hpt370. The disks have no partition table, though, which
_could_ confuse the bios.
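
Whether a disk looks like it carries a partition table can be checked by
looking for the 0x55 0xAA signature in the last two bytes of sector 0. A
minimal sketch, with has_mbr_signature being a made-up name; it only tests
the signature and says nothing about what the BIOS actually does with it:

```shell
# Sketch only: has_mbr_signature is a hypothetical helper name.
# Returns 0 if bytes 510-511 of the given device or file are 55 aa.
has_mbr_signature() {
    sig=$(od -An -tx1 -j510 -N2 "$1" | tr -d ' \n')
    [ "$sig" = "55aa" ]
}
```

For example, "has_mbr_signature /dev/hdg && echo signature present".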

Any ideas? Who the heck could write to those three bytes, and why?


-- v --

v@iki.fi


* Re: Something corrupts raid5 disks slightly during reboot
  2004-01-02 19:42           ` Something corrupts raid5 disks slightly during reboot Ville Herva
@ 2004-01-02 20:02             ` Ville Herva
  2004-01-14 14:46             ` Ville Herva
  1 sibling, 0 replies; 24+ messages in thread
From: Ville Herva @ 2004-01-02 20:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: Willy Tarreau

> So this is not a 2.2 kernel issue. I very much doubt it's a kernel issue at
> all. Unless it is a bug in kernel partition detection that is still present
> in 2.4.x.

Short addition: in the earlier thread, it was suggested to inspect the disk
with another OS (DOS, Windows, something else) to rule out Linux kernel
completely. I couldn't easily find anything that boots from cd or preferably
from floppy (since I don't have cdrom attached due to ide cable shortage)
*and* supports the HPT370 ide controller /dev/hdg is connected to.

If I find something that fits the bill, I'll give it a shot.


-- v --

v@iki.fi


* Re: Something corrupts raid5 disks slightly during reboot
  2004-01-02 19:42           ` Something corrupts raid5 disks slightly during reboot Ville Herva
  2004-01-02 20:02             ` Ville Herva
@ 2004-01-14 14:46             ` Ville Herva
  2004-01-14 22:22               ` Willy Tarreau
  1 sibling, 1 reply; 24+ messages in thread
From: Ville Herva @ 2004-01-14 14:46 UTC (permalink / raw)
  To: linux-kernel, Willy Tarreau

On Fri, Jan 02, 2004 at 09:42:00PM +0200, you [Ville Herva] wrote:
> Summary:                                                                   
> 
> I've been experiencing strange corruption on a raid5 volume for some time. 
> The kernel is 2.2.x + RAID-0.90 patch. Fs is ext2 (+e2compr). After        
> unmounting the filesystem, I can mount it again without problems. I can also
> raidstop the raid device in between and all is still fine:
> 
> > umount /dev/md4; mount /dev/md4
>     - no corruption              
> > umount /dev/md4; raidstop /dev/md4; raidstart /dev/md4; mount /dev/md4
>     - no corruption                                                     
> 
> But after a reboot, the filesystem is corrupted - a few bytes differ at the
> beginning of /dev/md4, between 1k and 5k.
> 
> See the threads
>   http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=utf-8&threadm=MMYt.4B2.1%40gated-at.bofh.it&rnum=1&prev=/groups%3Fnum%3D50%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3Dutf-8%26q%3DSomething%2Bcorrupts%2Braid5%2Bdisks%2Bslightly%2Bduring%2Breboot%26sa%3DN%26tab%3Dwg
>   http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=utf-8&threadm=MZsH.72R.5%40gated-at.bofh.it&rnum=4&prev=/groups%3Fnum%3D50%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3Dutf-8%26q%3DSomething%2Bcorrupts%2Braid5%2Bdisks%2Bslightly%2Bduring%2Breboot%26sa%3DN%26tab%3Dwg
> for details.
(...) 
> I found out that the difference (corruption) is usually on three bytes on
> /dev/hdg, but sometimes on /dev/hdc, too. (/dev/md4 = hdb+hdc+hdg; hdb&hdc
> are on i810, hdg is on hpt370).
> 
> First, I did
>    umount /dev/md4
>    raidstop /dev/md4
>    head -c 50k /dev/hdg > /save/hdg
>    reboot
> 
> To rule out kernel raid autodetect and raid code in general, I
> booted 2.2.25-1-secure with "single init=/bin/bash raid=noautodetect".
>  Did
>    head -c50k /dev/hdg | cmp -l /save/hdg
>  Three bytes differed:
>    bytepos  after  before
>             boot   boot
>    4641     0      35
>    4642     0      205
>    4643     0      10
> 
>  Wrote the original stuff back:
>    dd if=/save/hdg of=/dev/hdg
>    sync
>    hdparm -W0 /dev/hdg
>    sync
>    reboot
> 
> Booted 2.2.25-1-secure with "single init=/bin/bash raid=noautodetect"
> again.
>  Did
>    head -c50k /dev/hdg | cmp -l /save/hdg
>  The same three bytes differed again.
>  Wrote the stuff back, sync'ed, did hdparm, and powered off. Still, the
> bytes differed on next boot.
> 
> Then I booted 2.4.21-jam1 with "single init=/bin/bash raid=noautodetect" (I
> happened to have 2.4.21-jam1 compiled with suitable drivers at hand).
>  Wrote the same stuff back with dd, synced, turned ide cache off.
>  Booted 2.4.21-jam1 with "single init=/bin/bash raid=noautodetect" again.
>  Did the diff; the three bytes differed again.
> 
> Note that sometimes a few bytes on hdc differed, too. Usually it was just the
> three hdg bytes.
> 
> So this is not a 2.2 kernel issue. I very much doubt it's a kernel issue at
> all. Unless it is a bug in kernel partition detection that is still present
> in 2.4.x.
>          
> I tried to turn off the ide write cache with hdparm -W0, so it shouldn't  
> be a write caching issue.
> 
> If it's a bios issue, it's really a strange one, since it affects both disks
> on i810 ide and on hpt370. The disks have no partition table, though, which
> _could_ confuse the bios.

Addition: 

  - I tried booting from 2.6.1 single user mode to 2.6.1 single user
    mode (booting with sysrq-b to avoid shutdown process):
       ->  The corruption on /dev/hdg happens like with 2.2 and 2.4

  - I booted from 2.6.1 single user mode to 2.6.1 single user
    mode with kexec patch to avoid entering BIOS in between
       ->  The corruption DOES NOT happen
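
The two reboot paths compared above could be scripted roughly like this. It
is a sketch under assumptions: the kexec options shown are those of the
kexec-tools userspace utility, which may not match the tool that goes with
the kexec patch used here; the kernel path and command line are
placeholders; and run, reboot_via_bios and reboot_via_kexec are made-up
names. By default the commands are only printed, not executed.

```shell
# Sketch only: all function names are hypothetical.
# DRY_RUN=1 (the default) prints each command instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Reboot through the BIOS, skipping the normal shutdown process
# (the same effect as magic SysRq 'b').
reboot_via_bios() {
    run sync
    run sh -c 'echo b > /proc/sysrq-trigger'
}

# Jump straight into a new kernel without passing through the BIOS.
#   $1 = kernel image (placeholder), $2 = kernel command line (placeholder)
reboot_via_kexec() {
    run sync
    run kexec -l "$1" --append="$2"
    run kexec -e
}
```

Comparing the saved bytes after each path separates what the BIOS and boot
loader do from what the kernel does, which is the point of the experiment.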

I'm pretty much out of ideas.


-- v --

v@iki.fi


* Re: Something corrupts raid5 disks slightly during reboot
  2004-01-14 14:46             ` Ville Herva
@ 2004-01-14 22:22               ` Willy Tarreau
  2004-01-14 22:46                 ` Ville Herva
  0 siblings, 1 reply; 24+ messages in thread
From: Willy Tarreau @ 2004-01-14 22:22 UTC (permalink / raw)
  To: Ville Herva, linux-kernel

Hi Ville,

On Wed, Jan 14, 2004 at 04:46:46PM +0200, Ville Herva wrote:
 
>   - I tried booting from 2.6.1 single user mode to 2.6.1 single user
>     mode (booting with sysrq-b to avoid shutdown process):
>        ->  The corruption on /dev/hdg happens like with 2.2 and 2.4
> 
>   - I booted from 2.6.1 single user mode to 2.6.1 single user
>     mode with kexec patch to avoid entering BIOS in between
>        ->  The corruption DOES NOT happen
> 
> I'm pretty much out of ideas.

To me, it proves that the BIOS triggers the problem. It could be doing this
in its device enumeration or device initialization functions. Perhaps it is
even something nastier, such as a pending DMA write which completes during
a device reset. That's very odd anyway. I don't remember your whole setup
very well. Have you tried enabling/disabling shadow RAM/caching on BIOS
regions to check whether faster/slower code execution in the BIOS changes
anything? Also do it on additional ROMs if you have an onboard BIOS on your
secondary controller.

I'm also getting stuck without any other idea :-/

Regards,
Willy



* Re: Something corrupts raid5 disks slightly during reboot
  2004-01-14 22:22               ` Willy Tarreau
@ 2004-01-14 22:46                 ` Ville Herva
  0 siblings, 0 replies; 24+ messages in thread
From: Ville Herva @ 2004-01-14 22:46 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel

On Wed, Jan 14, 2004 at 11:22:14PM +0100, you [Willy Tarreau] wrote:
> Hi Ville,
> 
> On Wed, Jan 14, 2004 at 04:46:46PM +0200, Ville Herva wrote:
>  
> >   - I tried booting from 2.6.1 single user mode to 2.6.1 single user
> >     mode (booting with sysrq-b to avoid shutdown process):
> >        ->  The corruption on /dev/hdg happens like with 2.2 and 2.4
> > 
> >   - I booted from 2.6.1 single user mode to 2.6.1 single user
> >     mode with kexec patch to avoid entering BIOS in between
> >        ->  The corruption DOES NOT happen
> > 
> > I'm pretty much out of ideas.
> 
> To me, it proves that the bios triggers the problem. 

Or lilo. Abit BIOS, Adaptec SCSI BIOS, Highpoint HPT370 BIOS and lilo are
the only pieces of code that get executed between power on and the kernel.

Unfortunately, I was unable to rule that (unlikely) alternative out just
yet, because I found out that the box doesn't have a working floppy either
(cdrom is not plugged because of lack of cables - I guess I miswired the
floppy drive too when I last messed with the power cables.) This is also why
I didn't try your DOS disk on the box. It seems its diskedit can recognize
at least scsi disks, so it could well handle the disk on Highpoint
controller, too. Anyway, thanks for that (and reminding me how rusty my
French is - and has always been :). I plan to try booting from floppy
without lilo and the dos editor, when I next open the box and can fix the
floppy wiring. It's a server so I don't take it down all the time...

> It could also be in the device enumeration functions or device
> initialization that it does this thing. Perhaps even a more nasty thing
> such as a pending DMA write which completes during a device reset. 

Something like that crossed my mind initially, but waiting >10min between
the write and boot didn't help, nor did "hdparm -W 0"...

> That's very odd anyway. I don't quite remember well all your setup. 

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=utf-8&threadm=MMYt.4B2.1%40gated-at.bofh.it&rnum=1&prev=/groups%3Fnum%3D50%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3Dutf-8%26q%3DSomething%2Bcorrupts%2Braid5%2Bdisks%2Bslightly%2Bduring%2Breboot%26sa%3DN%26tab%3Dwg

gives some details. Basically it's an Abit ST6R mobo (i815 and HPT370 IDEs),
with three Maxtor 250GB disks (root and first data fs) and three Samsung 80GB
disks (second data fs). One of the Samsungs on the HPT370 is the one that
exhibits the corruption.

> Have you tried enabling/disabling shadow ram/caching on bios regions to
> check if a faster/slower code execution in the bios changes something ?

No. I could try that.

> Also do it on additionnal ROMs if you have an onboard bios on your
> secondary controller.

Ok, if only I can manage to find such options in the BIOS.
 
> I'm also getting stuck without any other idea :-/

No wonder. So far you have been most helpful - big thanks for that.

PS: Again, the next round of results will only be in after some time - as I
said, I'll need to wait for a suitable reboot time for the box... Sorry for
the trickle.


-- v --

v@iki.fi


* Re: Something corrupts raid5 disks slightly during reboot
  2004-01-15 19:57     ` Ville Herva
@ 2004-01-16 10:24       ` Samium Gromoff
  0 siblings, 0 replies; 24+ messages in thread
From: Samium Gromoff @ 2004-01-16 10:24 UTC (permalink / raw)
  To: Ville Herva, Samium Gromoff, linux-kernel

At Thu, 15 Jan 2004 21:57:58 +0200,
Ville Herva wrote:
> 
> On Thu, Jan 15, 2004 at 03:42:41PM +0300, you [Samium Gromoff] wrote:
> > At Thu, 15 Jan 2004 00:30:40 +0200,
> > Ville Herva wrote:
> > > 
> > > On Wed, Jan 14, 2004 at 07:39:37PM +0300, you [Samium Gromoff] wrote:
> > > > 
> > > > I know this sounds stupid, but anyway:
> > > > 
> > > > I have seen the very same symptome caused by RAM faults (too slow ram
> > > > for given clocks, to be exact).
> > > 
> > > The very same? You mean if booted, wrote few kB's of data to disk, synced,
> > > then pressed reset, the same three bytes were corrupted (set to zero) each
> > > time after reboot? 
> > 
> > No, corruption after reboot and perfect work inbetween.
> 
> Very strange. And you got rid of it by replacing the memory? 

Yeah.

> Any theories on how faulty memory could actually cause something like this?
> A bad spot in memory on an area where the bios code is cached, and hence is
> never used apart from running the bios startup (not even by memtest86)?

No idea, really :-)

> -- v --
> 
> v@iki.fi

regards, Samium Gromoff




* Re: Something corrupts raid5 disks slightly during reboot
  2004-01-15 12:42   ` Samium Gromoff
@ 2004-01-15 19:57     ` Ville Herva
  2004-01-16 10:24       ` Samium Gromoff
  0 siblings, 1 reply; 24+ messages in thread
From: Ville Herva @ 2004-01-15 19:57 UTC (permalink / raw)
  To: Samium Gromoff; +Cc: linux-kernel

On Thu, Jan 15, 2004 at 03:42:41PM +0300, you [Samium Gromoff] wrote:
> At Thu, 15 Jan 2004 00:30:40 +0200,
> Ville Herva wrote:
> > 
> > On Wed, Jan 14, 2004 at 07:39:37PM +0300, you [Samium Gromoff] wrote:
> > > 
> > > I know this sounds stupid, but anyway:
> > > 
> > > I have seen the very same symptome caused by RAM faults (too slow ram
> > > for given clocks, to be exact).
> > 
> > The very same? You mean if booted, wrote few kB's of data to disk, synced,
> > then pressed reset, the same three bytes were corrupted (set to zero) each
> > time after reboot? 
> 
> No, corruption after reboot and perfect work inbetween.

Very strange. And you got rid of it by replacing the memory? 

Any theories on how faulty memory could actually cause something like this?
A bad spot in memory on an area where the bios code is cached, and hence is
never used apart from running the bios startup (not even by memtest86)?


-- v --

v@iki.fi


* Re: Something corrupts raid5 disks slightly during reboot
  2004-01-14 22:30 ` Ville Herva
@ 2004-01-15 12:42   ` Samium Gromoff
  2004-01-15 19:57     ` Ville Herva
  0 siblings, 1 reply; 24+ messages in thread
From: Samium Gromoff @ 2004-01-15 12:42 UTC (permalink / raw)
  To: Ville Herva, Samium Gromoff, linux-kernel

At Thu, 15 Jan 2004 00:30:40 +0200,
Ville Herva wrote:
> 
> On Wed, Jan 14, 2004 at 07:39:37PM +0300, you [Samium Gromoff] wrote:
> > 
> > I know this sounds stupid, but anyway:
> > 
> > I have seen the very same symptome caused by RAM faults (too slow ram
> > for given clocks, to be exact).
> 
> The very same? You mean if booted, wrote few kB's of data to disk, synced,
> then pressed reset, the same three bytes were corrupted (set to zero) each
> time after reboot? 

No, corruption after reboot and perfect work in between.

> v@iki.fi

regards, Samium Gromoff




* Re: Something corrupts raid5 disks slightly during reboot
  2004-01-14 16:39 Samium Gromoff
@ 2004-01-14 22:30 ` Ville Herva
  2004-01-15 12:42   ` Samium Gromoff
  0 siblings, 1 reply; 24+ messages in thread
From: Ville Herva @ 2004-01-14 22:30 UTC (permalink / raw)
  To: Samium Gromoff; +Cc: linux-kernel

On Wed, Jan 14, 2004 at 07:39:37PM +0300, you [Samium Gromoff] wrote:
> 
> I know this sounds stupid, but anyway:
> 
> I have seen the very same symptome caused by RAM faults (too slow ram
> for given clocks, to be exact).

The very same? You mean if you booted, wrote a few kB of data to disk, synced,
then pressed reset, the same three bytes were corrupted (set to zero) each
time after reboot?

I can buy the faulty ram explanation for many symptoms, but somehow in
this case it seems very unlikely. The box can be doing its thing (backing up
more than 20 workstations onto 6 ide disks) for weeks without ever corrupting
anything, and then when I power it down and up (after manually raidstopping
and umounting), three bytes get corrupted. (Well, sometimes a few bytes in
addition to the three, but usually just three.)

> Yes gzipping/gunzipping a gigabyte of /dev/random data didn`t show up a
> crc error.

The box does survive memtest, but you're right that doesn't prove anything.
 
> That was an i845 chipset, by the way...

This is i815.


-- v --

v@iki.fi


* Re: Something corrupts raid5 disks slightly during reboot
@ 2004-01-14 16:39 Samium Gromoff
  2004-01-14 22:30 ` Ville Herva
  0 siblings, 1 reply; 24+ messages in thread
From: Samium Gromoff @ 2004-01-14 16:39 UTC (permalink / raw)
  To: vherva; +Cc: linux-kernel


I know this sounds stupid, but anyway:

I have seen the very same symptom caused by RAM faults (RAM too slow
for the given clocks, to be exact).

Yes, gzipping/gunzipping a gigabyte of /dev/random data didn't show up a
crc error.

That was an i845 chipset, by the way...

regards, Samium Gromoff




end of thread, other threads:[~2004-01-16 10:25 UTC | newest]

Thread overview: 24+ messages
2003-10-31 19:08 Something corrupts raid5 disks slightly during reboot Ville Herva
2003-11-01  1:41 ` Jeffrey E. Hundstad
2003-11-01  1:57   ` Mike Fedyk
2003-11-01  8:33     ` Ville Herva
2003-11-01  8:27   ` ide write cache issue? [Re: Something corrupts raid5 disks slightly during reboot] Ville Herva
2003-11-01 15:56     ` Willy Tarreau
2003-11-01 18:25       ` Ville Herva
2003-11-01 19:01         ` Willy Tarreau
2003-11-01 21:02           ` Ville Herva
2003-11-02  6:05             ` Andre Hedrick
2003-11-02  8:28               ` Ville Herva
2003-11-02 20:57                 ` Matthias Andree
2003-11-03  5:34                 ` Andre Hedrick
2003-11-03  6:38                   ` Ville Herva
2004-01-02 19:42           ` Something corrupts raid5 disks slightly during reboot Ville Herva
2004-01-02 20:02             ` Ville Herva
2004-01-14 14:46             ` Ville Herva
2004-01-14 22:22               ` Willy Tarreau
2004-01-14 22:46                 ` Ville Herva
2004-01-14 16:39 Samium Gromoff
2004-01-14 22:30 ` Ville Herva
2004-01-15 12:42   ` Samium Gromoff
2004-01-15 19:57     ` Ville Herva
2004-01-16 10:24       ` Samium Gromoff
