* Random bit flips - better data integrity needed [Was: Re: mismatch_count != 0 on multiple hosts]
From: Greg Freemyer @ 2009-09-19 16:10 UTC (permalink / raw)
  To: Mario 'BitKoenig' Holbe; +Cc: linux-raid

On Sat, Sep 19, 2009 at 11:10 AM, Mario 'BitKoenig' Holbe
<Mario.Holbe@tu-ilmenau.de> wrote:
> Bryan Mesich <bryan.mesich@ndsu.edu> wrote:
>> On Wed, Sep 16, 2009 at 09:20:35PM +0200, Mario 'BitKoenig' Holbe wrote:
>>> They should not appear on RAID5.
>> I would agree.  The only reason I mentioned RAID5 was to remove the
>> possibility that the HDs were spontaneously flipping bits.  Our SAN
>
> This should not happen on recent disks (and even not-that-recent ones)
> either. Disks do error correction and don't deliver faulty data; they
> deliver read errors instead. Maybe there are some disks where you can
> disable ECC (I'm not aware of any), but I doubt you would get even one
> reasonable bit out of them then :)
>
> And *if* spontaneously flipping bits *would* happen on single disks,
> they would also happen on your RAID5. No RAID level except RAID2 (which
> does ECC on its own) tolerates this kind of error, they all rely on
> disks delivering either correct data or error messages.
>

There is a whole series of places a bit flip can occur, after the data
is read from the platter and the ECC verified, that will not generate
any error message.

It could be in the drive electronics themselves, in the IDE or SATA
cable (I think SATA has a checksum on the transmission; IDE cables
don't), in the controller, in RAM, in the cache, in the CPU, and so
on.

If you want reliable data you have to build in end-to-end
verification.  As long as you attack the issue piece by little piece,
you are going to have weaknesses where a bit flip can sneak in.  That
is one reason we see MD5s distributed with lots of downloadable
ISOs: in theory the whole distribution process is reliable, but by
verifying at the very end you gain a significant amount of
confidence.

With regards to data storage, one major step in this direction is the
"integrity" patch that went into the kernel last winter (2.6.28?).
There is apparently now a scsi standard that allows a checksum / crc
to be passed along with the data.  The protocol for calculating the
value is published, so at the top of the linux block stack, with this
feature enabled, a checksum / crc is calculated as soon as a filesystem
puts a block of data into the block queues.  The checksum / crc
travels with the data all the way to the scsi subsystem, which in turn
verifies the value and errors out the write if the data and checksum /
crc disagree.  On read, the subsystem also provides the checksum / crc
in addition to the data; it travels back up the linux block stack and
is verified immediately before being handed off to the filesystem.
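
To make that concrete, here is a minimal userspace sketch of the kind
of per-sector protection information involved.  The 8-byte tuple and
the CRC16 polynomial below follow the published T10 DIF format as I
understand it; treat this as an illustration, not the kernel's actual
implementation.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 8-byte protection tuple carried with each 512-byte sector. */
struct dif_tuple {
    uint16_t guard;   /* CRC16 of the sector data (poly 0x8BB7)  */
    uint16_t app_tag; /* free for application use                */
    uint32_t ref_tag; /* low 32 bits of the target LBA (Type 1)  */
};

/* Bitwise CRC16 with the T10 DIF polynomial, no reflection, init 0. */
static uint16_t crc_t10dif(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)buf[i] << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)(crc << 1) ^ 0x8BB7
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

int main(void)
{
    uint8_t sector[512];
    memset(sector, 0xA5, sizeof(sector));

    struct dif_tuple t = {
        .guard   = crc_t10dif(sector, sizeof(sector)),
        .app_tag = 0,
        .ref_tag = 12345,   /* LBA this sector is bound for */
    };
    printf("guard tag: 0x%04x\n", t.guard);
    return 0;
}

Because the reference tag names the target LBA, a block written to the
wrong location gets caught too, not just flipped bits.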

This is all pretty new obviously.  To the best of my knowledge
filesystems have not yet been enhanced to track this value, thus
covering even more of the end-to-end transaction.

I don't know how specifically, but it also seems to me the mdraid
stack could improve the currently poor data integrity situation even in
the absence of a supporting scsi subsystem.  Maybe by pulling out the
integrity checksum / crc info and putting it on yet another disk, or
mixing it in with the parity calculation.

Specifically you could steal the second parity stripe from a raid 6
setup and replace it with this end-to-end data integrity checksum /
crc.  The checksum / crc is much smaller than the original data, so the
one integrity disk should support a reasonable number of data disks.
Obviously this would not be one of the formal raid levels, but that
doesn't mean it's not useful.

Greg


* Re: Random bit flips - better data integrity needed [Was: Re: mismatch_count != 0 on multiple hosts]
From: John Robinson @ 2009-09-20  9:36 UTC (permalink / raw)
  To: Greg Freemyer; +Cc: Linux RAID

On 19/09/2009 17:10, Greg Freemyer wrote:
[...]
> I don't know how specifically, but it also seems to me the mdraid
> stack could improve the currently poor data integrity situation even in
> the absence of a supporting scsi subsystem.  Maybe by pulling out the
> integrity checksum / crc info and putting it on yet another disk, or
> mixing it in with the parity calculation.
> 
> Specifically you could steal the second parity stripe from a raid 6
> setup and replace it with this end-to-end data integrity checksum /
> crc.  The checksum / crc is much smaller than the original data, so the
> one integrity disk should support a reasonable number of data disks.
> Obviously this would not be one of the formal raid levels, but that
> doesn't mean it's not useful.

I vaguely remember someone here was prototyping/developing a device 
mapper thingy which added checksumming/integrity to simulate high-end 
RAID cards adding a checksum to each 512-byte sector by using 520- or 
528-byte sectors on their component discs. I don't remember the details, 
but what I have in mind was something along the lines of using an extra 
sector on the underlying device per 64 sectors or so. There wouldn't be 
too heavy an overhead on small reads - we do readahead anyway - and it 
would make small writes even more painful than they are already, but 
shouldn't significantly reduce throughput on large (chunk size) reads 
and writes. I'd use it.
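
To put numbers on that: one extra sector per 64 data sectors leaves
exactly 512 / 64 = 8 bytes of checksum per data sector.  Here's a
minimal sketch of the address arithmetic such a target might use - the
group size and checksum width are my assumptions for illustration, not
whatever the actual prototype did:

#include <stdint.h>
#include <stdio.h>

#define GROUP   64  /* data sectors per integrity sector        */
#define CSUM_SZ  8  /* checksum bytes per sector: 512 / 64 = 8  */

/* Logical data sector -> physical sector on the component device,
 * with one integrity sector interleaved after every GROUP sectors. */
static uint64_t phys_sector(uint64_t lba)
{
    return (lba / GROUP) * (GROUP + 1) + (lba % GROUP);
}

/* Where the checksum bytes for a logical sector live. */
static void csum_location(uint64_t lba, uint64_t *sector, unsigned *offset)
{
    *sector = (lba / GROUP) * (GROUP + 1) + GROUP; /* group's spare sector */
    *offset = (unsigned)(lba % GROUP) * CSUM_SZ;   /* byte offset inside   */
}

int main(void)
{
    for (uint64_t lba = 0; lba <= 65; lba += 65) {
        uint64_t s;
        unsigned off;
        csum_location(lba, &s, &off);
        printf("lba %2llu -> phys %2llu, csum in sector %3llu at byte %2u\n",
               (unsigned long long)lba, (unsigned long long)phys_sector(lba),
               (unsigned long long)s, off);
    }
    return 0;
}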

Cheers,

John.


* Re: Random bit flips - better data integrity needed [Was: Re: mismatch_count != 0 on multiple hosts]
From: Majed B. @ 2009-09-20  9:43 UTC (permalink / raw)
  To: John Robinson; +Cc: Greg Freemyer, Linux RAID

Check the thread started by Matthias within the last week.

On Sun, Sep 20, 2009 at 12:36 PM, John Robinson
<john.robinson@anonymous.org.uk> wrote:
> On 19/09/2009 17:10, Greg Freemyer wrote:
> [...]
>>
>> I don't know how specifically, but it also seems to me the mdraid
>> stack could improve the currently poor data integrity situation even in
>> the absence of a supporting scsi subsystem.  Maybe by pulling out the
>> integrity checksum / crc info and putting it on yet another disk, or
>> mixing it in with the parity calculation.
>>
>> Specifically you could steal the second parity stripe from a raid 6
>> setup and replace it with this end-to-end data integrity checksum /
>> crc.  The checksum / crc is much smaller than the original data, so the
>> one integrity disk should support a reasonable number of data disks.
>> Obviously this would not be one of the formal raid levels, but that
>> doesn't mean it's not useful.
>
> I vaguely remember someone here was prototyping/developing a device mapper
> thingy which added checksumming/integrity to simulate high-end RAID cards
> adding a checksum to each 512-byte sector by using 520- or 528-byte sectors
> on their component discs. I don't remember the details, but what I have in
> mind was something along the lines of using an extra sector on the
> underlying device per 64 sectors or so. There wouldn't be too heavy an
> overhead on small reads - we do readahead anyway - and it would make small
> writes even more painful than they are already, but shouldn't significantly
> reduce throughput on large (chunk size) reads and writes. I'd use it.
>
> Cheers,
>
> John.



-- 
       Majed B.


* Re: Random bit flips - better data integrity needed [Was: Re: mismatch_count != 0 on multiple hosts]
From: Mario 'BitKoenig' Holbe @ 2009-09-20 15:30 UTC (permalink / raw)
  To: linux-raid

Greg Freemyer <greg.freemyer@gmail.com> wrote:
> On Sat, Sep 19, 2009 at 11:10 AM, Mario 'BitKoenig' Holbe
>> And *if* spontaneously flipping bits *would* happen on single disks,
>> they would also happen on your RAID5. No RAID level except RAID2 (which
> If you want reliable data you have to build in end-to-end
> verification.

Indeed.

> ISOs: in theory the whole distribution process is reliable, but by
> verifying at the very end you gain a significant amount of
> confidence.

And cheaper, usually.

> With regards to data storage, one major step in this direction is the
> "integrity" patch that went into the kernel last winter (2.6.28?).

That's far from being end-to-end. Actually, it is even against the
end-to-end argument. End-to-end means from application to application.

> This is all pretty new obviously.  To the best of my knowledge

The end-to-end argument ("End-to-End Arguments in System Design",
Saltzer, Reed, and Clark, 1981) is far from new.


regards
   Mario
-- 
reich sein heisst nicht, einen Ferrari zu kaufen, sondern einen zu
verbrennen
                                               Dietmar Wischmeier



* Re: Random bit flips - better data integrity needed
From: Martin K. Petersen @ 2009-09-20 21:48 UTC (permalink / raw)
  To: Greg Freemyer; +Cc: Mario 'BitKoenig' Holbe, linux-raid

>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes:

Greg> This is all pretty new obviously.  To the best of my knowledge
Greg> filesystems have not yet been enhanced to track this value, thus
Greg> covering even more of the end-to-end transaction.

The filesystem part is pretty easy.  So far there hasn't been much
point, because the window between the filesystem submitting the bio and
the block layer generating the checksum is fairly small.

What's important wrt. filesystems is to allow the checksums to be
passed through the page cache so we can get them to/from userland.
That's on my list, but it's stalled a bit waiting for aio to suck less.
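
For the curious, the in-kernel interface has roughly the following
shape.  This is only a sketch against the 2.6.28-era API
(bio_integrity_alloc / bio_integrity_add_page), with error handling
and the metadata format elided - not code from any shipping
filesystem:

/* Sketch: a filesystem handing its own integrity metadata down with
 * a bio, so the block layer carries it along with the data instead
 * of generating its own checksums at submission time. */
#include <linux/bio.h>
#include <linux/errno.h>
#include <linux/gfp.h>

static int fs_attach_integrity(struct bio *bio, struct page *meta_page,
                               unsigned int meta_len)
{
    struct bio_integrity_payload *bip;

    bip = bio_integrity_alloc(bio, GFP_NOIO, 1);
    if (!bip)
        return -ENOMEM;

    /* Attach the filesystem-generated protection tuples. */
    if (bio_integrity_add_page(bio, meta_page, meta_len, 0) < (int)meta_len)
        return -EIO;

    return 0;
}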


Greg> I don't know how specifically, but it also seems to me the mdraid
Greg> stack could improve the currently poor data integrity situation
Greg> even in the absence of a supporting scsi subsystem.  Maybe by
Greg> pulling out the integrity checksum / crc info and putting it on
Greg> yet another disk, or mixing it in with the parity calculation.

Check dm-devel.  Alberto Bertogli has posted a DM target that does this.
I've only had time to do a cursory review.

However, I think the important thing here is to realize that the
strength of the data integrity infrastructure is in catching corruption
at WRITE time.  And that requires hardware participation, because you
want coverage all the way down the stack.

If you want corruption detection and recovery at READ time the answer is
btrfs.  Really.  It was explicitly designed to do this.

I keep hearing talk about retrofitting checksums into existing
filesystems and software RAID.  Would the people that want to work on
this please stop partying like it's 1999 and go help out on btrfs
instead.  The world will be a much better place...

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: Random bit flips - better data integrity needed
From: Martin K. Petersen @ 2009-09-20 21:55 UTC (permalink / raw)
  To: Mario 'BitKoenig' Holbe; +Cc: linux-raid

>>>>> "Mario" == Mario 'BitKoenig' Holbe <Mario.Holbe@TU-Ilmenau.DE> writes:

>> With regards to data storage, one major step in this direction is the
>> "integrity" patch that went into the kernel last winter (2.6.28?).

Mario> That's far from being end-to-end. Actually, it is even against
Mario> the end-to-end argument. End-to-end means from application to
Mario> application.

Oracle has a custom (as in non-POSIX) async I/O submission interface
called oracleasm.  Hooking into my Linux kernel block integrity
infrastructure, I can protect the I/O all the way from within the
Oracle DB context in userland to the drive firmware and back.

We're working on a generic (as in POSIX-like) interface that allows data
integrity passthrough for normal applications.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: Random bit flips - better data integrity needed [Was: Re: mismatch_count != 0 on multiple hosts]
From: Matthias Urlichs @ 2009-09-22 11:18 UTC (permalink / raw)
  To: linux-raid

On Sat, 19 Sep 2009 12:10:34 -0400, Greg Freemyer wrote:

> Specifically you could steal the second parity stripe from a raid 6
> setup and replace it with this end-to-end data integrity checksum / crc.

If you're willing to add that kind of overhead, simply read all of the 
RAID6 stripes into memory and check whether they're consistent.

If they're not, it's easy to decide (for RAID6) whether the data or the 
parity is wrong: simply check both P and Q. If only one of them is 
broken, fix it. If both are, correct each data block in turn according 
to P and check whether Q becomes correct; if it does, you've found the 
culprit, so fix it. Otherwise the only thing you can do is fail the 
whole array and alert the operator that they have major hardware 
issues. :-/

For RAID4/5 you can do the same check, except that there's no way to 
fix any problems, since you don't know whether the data or the parity 
is right. And since the error may have crept in while writing, 
rereading is of limited use.

For RAID1 (and maybe even multipath), the same idea applies; add majority 
rule when you have more than two disks.

Adding this kind of checking to the RAID456 driver should be rather easy 
for somebody who knows its internals. Its effect on read throughput is 
anyone's guess, of course.
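
The RAID6 decision procedure above is mechanical enough to sketch.
Below is a small self-contained userspace model of it - toy disk count
and block size, GF(2^8) arithmetic with the usual RAID6 generator -
meant to illustrate the logic, not to mirror the md driver's code:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NDATA 4   /* data disks (toy example)           */
#define BLK  16   /* bytes per block, tiny for the demo */

/* Multiply in GF(2^8) with the RAID6 polynomial 0x11d. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        b >>= 1;
    }
    return p;
}

/* P = xor of all data; Q = sum of g^i * D_i (Horner's rule, g = 2). */
static void compute_pq(uint8_t d[NDATA][BLK], uint8_t *p, uint8_t *q)
{
    for (int j = 0; j < BLK; j++) {
        uint8_t pp = 0, qq = 0;
        for (int i = NDATA - 1; i >= 0; i--) {
            pp ^= d[i][j];
            qq = gf_mul(qq, 2) ^ d[i][j];
        }
        p[j] = pp;
        q[j] = qq;
    }
}

/* The decision procedure from above: 0 = consistent or repaired in
 * place, -1 = inconsistent beyond repair (fail the array). */
static int scrub_stripe(uint8_t d[NDATA][BLK], uint8_t *p, uint8_t *q)
{
    uint8_t pc[BLK], qc[BLK], delta[BLK];

    compute_pq(d, pc, qc);
    int pbad = memcmp(p, pc, BLK) != 0;
    int qbad = memcmp(q, qc, BLK) != 0;

    if (!pbad && !qbad)
        return 0;                      /* all consistent   */
    if (pbad && !qbad) {
        memcpy(p, pc, BLK);            /* only P was wrong */
        return 0;
    }
    if (!pbad && qbad) {
        memcpy(q, qc, BLK);            /* only Q was wrong */
        return 0;
    }

    /* Both wrong: assume a single bad data block.  For each disk,
     * apply the fix P implies and see whether Q then agrees. */
    for (int j = 0; j < BLK; j++)
        delta[j] = p[j] ^ pc[j];
    for (int i = 0; i < NDATA; i++) {
        for (int j = 0; j < BLK; j++)
            d[i][j] ^= delta[j];       /* candidate repair via P */
        compute_pq(d, pc, qc);
        if (memcmp(q, qc, BLK) == 0)
            return 0;                  /* disk i was the culprit */
        for (int j = 0; j < BLK; j++)
            d[i][j] ^= delta[j];       /* undo, try the next one */
    }
    return -1;
}

int main(void)
{
    uint8_t d[NDATA][BLK], p[BLK], q[BLK];
    for (int i = 0; i < NDATA; i++)
        memset(d[i], 0x11 * (i + 1), BLK);
    compute_pq(d, p, q);

    d[2][5] ^= 0x40;                   /* inject a silent bit flip */
    printf("scrub: %s\n", scrub_stripe(d, p, q) == 0 ? "repaired" : "failed");
    printf("block restored: %s\n", d[2][5] == 0x33 ? "yes" : "no");
    return 0;
}

Only when the actually-corrupted block is the one "repaired" via P
does Q come out consistent, which is what locates the culprit.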




* Re: Random bit flips - better data integrity needed [Was: Re: mismatch_count != 0 on multiple hosts]
From: Bill Davidsen @ 2009-10-13 21:45 UTC (permalink / raw)
  To: Matthias Urlichs; +Cc: linux-raid

Matthias Urlichs wrote:
> On Sat, 19 Sep 2009 12:10:34 -0400, Greg Freemyer wrote:
>
>   
>> Specifically you could steal the second parity stripe from a raid 6
>> setup and replace it with this end-to-end data integrity checksum / crc.
>>     
>
> If you're willing to add that kind of overhead, simply read all of the 
> RAID6 stripes into memory and check whether they're consistent.
>
> If they're not, it's easy to decide (for RAID6) whether the data or 
> the parity is wrong: simply check both P and Q. If only one of them is 
> broken, fix it. If both are, correct each data block in turn according 
> to P and check whether Q becomes correct; if it does, you've found the 
> culprit, so fix it. Otherwise the only thing you can do is fail the 
> whole array and alert the operator that they have major hardware 
> issues. :-/
>
> For RAID4/5 you can do the same check, except that there's no way to 
> fix any problems, since you don't know whether the data or the parity 
> is right. And since the error may have crept in while writing, 
> rereading is of limited use.
>
> For RAID1 (and maybe even multipath), the same idea applies; add majority 
> rule when you have more than two disks.
>
> Adding this kind of checking to the RAID456 driver should be rather easy 
> for somebody who knows its internals. Its effect on read throughput is 
> anyone's guess, of course.
>   

To do this right requires forcing the data to the platter, then reading 
it back (from the platter, not the cache) and checking it, preferably 
with ECC off to catch marginal data. In the '60s there were drives with 
read-after-write heads, but the data density was so low you could 
sprinkle oxide on the platter and see the data patterns. I can't see 
doing it that way with "heads" any more, but when solid state becomes 
more mainstream it becomes possible at useful transfer rates.

I have the feeling that someone had a patch to do that with a loopback 
mount, but I can't find a pointer.
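
Something in that spirit can at least be approximated from userland: 
write, flush, then read the block back around the page cache and 
compare.  A rough sketch, with the caveat that O_DIRECT only bypasses 
the kernel's cache, not the drive's own cache, so it is still weaker 
than true read-after-write from the platter:

/* WARNING: overwrites the first 4 KiB of the target file/device. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <scratch-file-or-device>\n", argv[0]);
        return 1;
    }

    void *wbuf, *rbuf;
    if (posix_memalign(&wbuf, BLK, BLK) || posix_memalign(&rbuf, BLK, BLK))
        return 1;                      /* O_DIRECT needs aligned buffers */
    memset(wbuf, 0x5a, BLK);

    int fd = open(argv[1], O_RDWR | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    if (pwrite(fd, wbuf, BLK, 0) != BLK) { perror("pwrite"); return 1; }
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

    /* Read back around the page cache and verify. */
    if (pread(fd, rbuf, BLK, 0) != BLK) { perror("pread"); return 1; }

    puts(memcmp(wbuf, rbuf, BLK) == 0 ? "verify: OK" : "verify: MISMATCH");
    close(fd);
    return 0;
}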

-- 
Bill Davidsen <davidsen@tmr.com>
  Unintended results are the well-earned reward for incompetence.


