* Chances of silent errors?
@ 2013-01-21 19:58 Roy Sigurd Karlsbakk
  2013-01-21 22:16 ` Roger Heflin
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-01-21 19:58 UTC
  To: linux-raid Raid

Hi all

Coming from the ZFS world, I've heard a few people talk about the chances of "silent errors", meaning the checksum on the drive matches but the data is bad because the corrupted block happens to produce the same checksum (aka a collision). Does anyone here know the relative chance of something like that happening with the checksums of current hard disks? Do the 1:10^14 or 1:10^15 figures quoted for a URE cover this case, or only the case where the drive reports an error, or the two combined?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms with xenotypic etymology. In most cases adequate and relevant synonyms exist in Norwegian.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Chances of silent errors?
  2013-01-21 19:58 Chances of silent errors? Roy Sigurd Karlsbakk
@ 2013-01-21 22:16 ` Roger Heflin
  2013-01-21 22:39 ` Peter Grandi
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Roger Heflin @ 2013-01-21 22:16 UTC
  To: Roy Sigurd Karlsbakk; +Cc: linux-raid Raid

On Mon, Jan 21, 2013 at 1:58 PM, Roy Sigurd Karlsbakk <roy@karlsbakk.net> wrote:
> Hi all
>
> Coming from the zfs world, I've heard a few talk about the chances of "silent errors", meaning the checksum on the drives match, but the data being bad because of matching checksum (aka collisions). Does anyone in here know the relative chance of something like that happening with the checksums of current harddisks? Is the 1:10^14 or 1:10^15 chances for a URE in regard to this, or is that when the drive reports an error, or those two combined?
>
> Vennlige hilsener / Best regards

I don't know the answer to that and would be interested in it too.

Both of the silent corruption issues I have dealt with before came
back to controller problems. In one case the RAID/disk controller had
gone bad and would silently corrupt a block of data at a rate of about
1 block per 3-10 GB of reads (I never saw or confirmed corruption on
writes). In the other case the PCI-X bus speed was set too high (with
2 cards in use it should have been at 100 MHz but was at 133 MHz),
which produced similar results on reads, with about 1 block corrupted
per 2-3 GB of reads. The two controllers were completely different
models from completely different companies.

Based on that experience, I am guessing that silent corruption coming
out of RAM (non-ECC RAM) and/or the controller is at least an order of
magnitude more likely than corruption on the disk. I have never
debugged a corruption issue that was traced to the disk itself, so I
suspect that the disks' checksums are a lot more reliable than a lot
of people believe, and that the risk points are at the controller ->
PCI -> CPU level, where things are not so carefully designed and
checksummed. Note that both of the controllers were from (at the time)
widely respected companies and both were enterprise-level cards (not
home-use cards; these cost > US$1k).

We also added checksums at the software level to give us end-to-end
confirmation, since we had previously identified corruption due to
software/library issues. Checksumming in the application is much
superior, as it also catches software issues.
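
A minimal sketch of the application-level, end-to-end checksumming
described above. The record layout (payload followed by a SHA-256
digest) and the names `seal` and `verify` are assumptions made for
illustration; they are not what was actually deployed.

```python
import hashlib

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def seal(payload: bytes) -> bytes:
    """Return the payload with its SHA-256 digest appended (hypothetical layout)."""
    return payload + hashlib.sha256(payload).digest()


def verify(record: bytes) -> bytes:
    """Return the payload if the trailing digest matches, else raise ValueError."""
    if len(record) < DIGEST_LEN:
        raise ValueError("record too short to contain a digest")
    payload, digest = record[:-DIGEST_LEN], record[-DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: data was altered somewhere in the stack")
    return payload


if __name__ == "__main__":
    record = seal(b"important block of data")
    assert verify(record) == b"important block of data"

    # Simulate a silent single-bit flip between the application and the disk.
    damaged = bytearray(record)
    damaged[3] ^= 0x01
    try:
        verify(bytes(damaged))
    except ValueError as exc:
        print("detected:", exc)
```

Because the digest is computed and checked by the application itself,
it covers the controller, bus, RAM and any library code in between,
not just the disk.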


* Re: Chances of silent errors?
  2013-01-21 19:58 Chances of silent errors? Roy Sigurd Karlsbakk
  2013-01-21 22:16 ` Roger Heflin
@ 2013-01-21 22:39 ` Peter Grandi
  2013-01-21 23:00 ` Chris Murphy
  2013-01-22 10:55 ` Roy Sigurd Karlsbakk
  3 siblings, 0 replies; 7+ messages in thread
From: Peter Grandi @ 2013-01-21 22:39 UTC
  To: Linux RAID

> Coming from the zfs world, I've heard a few talk about the
> chances of "silent errors", meaning the checksum on the drives
> match, but the data being bad because of matching checksum
> (aka collisions). [ ... ]

That's a very narrow definition of "silent errors": they happen
in any case where incorrect data has been written to persistent
storage from memory, and yet no error has been signaled.

A common cause of those is software or firmware (HBA, disk, ...)
bugs, that either read or write the wrong blocks or modify them
in transit.

The classic report on this is from CERN's extensive testing:

  http://w3.hepix.org/storage/hep_pdf/2007/Spring/kelemen-2007-HEPiX-Silent_Corruptions.pdf

As to checksum collisions, that depends a bit on sector size and
the type/length of checksum, and "enterprise" drives can usually
be formatted with different size sectors to accommodate different
size checksums. I would also suspect that it is far more likely
that very different blocks on the same disk legitimately have
the same checksum than that a slightly corrupted block gets the
same checksum as the uncorrupted one...
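
A rough sketch of the comparison made above, under the idealized
assumption that the checksum behaves like a uniformly random n-bit
value. The disk size, sector size and 32-bit checksum length are
hypothetical numbers picked for illustration, not figures from any
drive manual.

```python
def p_corruption_passes(checksum_bits: int) -> float:
    """Chance that one random corruption yields the same checksum (idealized)."""
    return 2.0 ** -checksum_bits


def expected_colliding_pairs(num_blocks: int, checksum_bits: int) -> float:
    """Expected number of block pairs sharing a checksum (birthday estimate)."""
    pairs = num_blocks * (num_blocks - 1) / 2
    return pairs * 2.0 ** -checksum_bits


if __name__ == "__main__":
    # Hypothetical example: 4 TB disk, 4096-byte sectors, 32-bit checksum.
    blocks = 4 * 10**12 // 4096
    print(f"blocks on disk:           {blocks}")
    print(f"P(corruption passes):     {p_corruption_passes(32):.3e}")
    print(f"expected colliding pairs: {expected_colliding_pairs(blocks, 32):.3e}")
```

With these assumed numbers, unrelated blocks sharing a checksum is
routine, while any single corruption slipping through is very
unlikely. Real ECC and CRC codes are typically designed to catch
small errors outright, so the idealized model likely overstates the
risk for a slightly corrupted block.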

For some context, see the very informative SAVVIO product manual
here, page 15, the "Miscorrected Data" line:

  http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/savvio-15k/

or also, page 43, the section "Protection Information".

But note that the URE figure is the *Unrecovered* Error Rate, that
is, the rate of errors that have been detected but not corrected,
not the *Undetected* Error Rate.

As someone famously said, as far as he knew his datacenter never
had an undetected error.


* Re: Chances of silent errors?
  2013-01-21 19:58 Chances of silent errors? Roy Sigurd Karlsbakk
  2013-01-21 22:16 ` Roger Heflin
  2013-01-21 22:39 ` Peter Grandi
@ 2013-01-21 23:00 ` Chris Murphy
  2013-01-22 10:55 ` Roy Sigurd Karlsbakk
  3 siblings, 0 replies; 7+ messages in thread
From: Chris Murphy @ 2013-01-21 23:00 UTC
  To: Roy Sigurd Karlsbakk; +Cc: linux-raid Raid


On Jan 21, 2013, at 12:58 PM, Roy Sigurd Karlsbakk <roy@karlsbakk.net> wrote:

> Hi all
> 
> Coming from the zfs world, I've heard a few talk about the chances of "silent errors", meaning the checksum on the drives match, but the data being bad because of matching checksum (aka collisions). Does anyone in here know the relative chance of something like that happening with the checksums of current harddisks? Is the 1:10^14 or 1:10^15 chances for a URE in regard to this, or is that when the drive reports an error, or those two combined?

It's fun trying to pin down what a URE, a UER, and a BER are. I don't see even SNIA using one term consistently. WDC uses "non-recoverable" rather than "unrecoverable", and while linguistically these are the same, if they are effectively using a different term than SNIA, what they define as an "error" might be a different thing. All of these, though, are disk errors.

SDC (silent data corruption) is not necessarily a disk-only error. It can occur in the disk, in the cable between disk and controller, in the controller, or between controller and memory. So there are actually many more areas where SDC can occur.

To count as SDC, it is either an undetected error or a detected but improperly corrected error. In either case the result is error propagation without the OS being notified by the constituent components in the storage stack.

Anyway, I think the probability you're after, of ZFS being spoofed by SDC that results in a checksum collision, is really remote. You're talking about a very rare case of SDC to start with; second, SDC is not understood well enough to have probabilities assigned to it; and then you have a remarkably small surface area. To get a collision like you're suggesting, the SDC would have to affect the data and the metadata in exactly such a way that the corrupt data's checksum indicates valid data. Both would have to be so affected. That sort of collision is next to impossible, even for MD5. Collisions have been demonstrated to be possible, but aren't expected in the wild. What checksum method does ZFS default to?
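
A small sketch of why an accidental collision needs such a precise coincidence: flipping any single bit of a block changes a cryptographic digest. It uses SHA-256 from Python's hashlib as a stand-in (other algorithm names such as "md5" can be passed where available); it does not model ZFS or its checksum, and the helper names are made up for this example.

```python
import hashlib
import os


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def corruption_detected(data: bytes, bit_index: int, algo: str = "sha256") -> bool:
    """True if the digest of the damaged block differs from the original's."""
    original = hashlib.new(algo, data).digest()
    damaged = hashlib.new(algo, flip_bit(data, bit_index)).digest()
    return original != damaged


if __name__ == "__main__":
    block = os.urandom(4096)  # one random 4 KiB block
    total_bits = len(block) * 8
    misses = sum(not corruption_detected(block, i) for i in range(total_bits))
    print(f"undetected single-bit flips out of {total_bits}: {misses}")
```

For the checksum to still match, the corruption would have to hit either the data in a way that happens to reproduce the stored digest, or both the data and the stored digest consistently.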

Chris Murphy


* Re: Chances of silent errors?
  2013-01-21 19:58 Chances of silent errors? Roy Sigurd Karlsbakk
                   ` (2 preceding siblings ...)
  2013-01-21 23:00 ` Chris Murphy
@ 2013-01-22 10:55 ` Roy Sigurd Karlsbakk
  2013-01-22 16:28   ` Chris Murphy
  3 siblings, 1 reply; 7+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-01-22 10:55 UTC
  To: linux-raid Raid

> Coming from the zfs world, I've heard a few talk about the chances of
> "silent errors", meaning the checksum on the drives match, but the
> data being bad because of matching checksum (aka collisions). Does
> anyone in here know the relative chance of something like that
> happening with the checksums of current harddisks? Is the 1:10^14 or
> 1:10^15 chances for a URE in regard to this, or is that when the drive
> reports an error, or those two combined?

A follow-up here. I see that most drive manufacturers report the chance of a URE as 1:10^14 for desktop drives, 1:10^15 for 7k2 RPM enterprise drives and 1:10^16 for 10k and 15k enterprise drives. I was under the impression that "nearline" / 7k2 enterprise drives were the same thing as desktop drives, only with slightly different firmware (TLER and friends => don't do anything as stupid as going into "deep recovery mode" like some desktop drives do).

Any idea if there's a real difference between the hardware in enterprise and desktop 7k2 drives? Is this 10-fold difference between error rates real, or is it just marketing?
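
To put the quoted figures in perspective, here is a back-of-the-envelope sketch. It assumes the spec can be read as an independent per-bit error probability, which is a simplification (the datasheet numbers are usually stated as upper bounds), and the 4 TB drive size is a hypothetical example.

```python
import math


def bytes_per_expected_ure(rate_bits: float) -> float:
    """Average bytes read per URE at a rate of 1 error per rate_bits bits."""
    return rate_bits / 8


def p_at_least_one_ure(bytes_read: float, rate_bits: float) -> float:
    """P(>=1 URE) when reading bytes_read, assuming independent per-bit errors."""
    bits = bytes_read * 8
    # 1 - (1 - 1/rate)^bits, computed with log1p/expm1 to stay accurate.
    return -math.expm1(bits * math.log1p(-1.0 / rate_bits))


if __name__ == "__main__":
    disk_bytes = 4e12  # hypothetical 4 TB drive read end to end
    for exponent in (14, 15, 16):
        rate = 10.0 ** exponent
        tb = bytes_per_expected_ure(rate) / 1e12
        p = p_at_least_one_ure(disk_bytes, rate)
        print(f"1:10^{exponent}: one URE per {tb:g} TB on average, "
              f"P(URE in one full read) = {p:.3f}")
```

Under this model, 1:10^14 works out to one URE per 12.5 TB read on average, so each factor of ten in the spec matters a great deal for a full-disk read such as a RAID rebuild.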

Vennlige hilsener / Best regards

roy

* Re: Chances of silent errors?
  2013-01-22 10:55 ` Roy Sigurd Karlsbakk
@ 2013-01-22 16:28   ` Chris Murphy
  2013-01-22 18:33     ` Roy Sigurd Karlsbakk
  0 siblings, 1 reply; 7+ messages in thread
From: Chris Murphy @ 2013-01-22 16:28 UTC
  To: Roy Sigurd Karlsbakk; +Cc: linux-raid Raid


On Jan 22, 2013, at 3:55 AM, Roy Sigurd Karlsbakk <roy@karlsbakk.net> wrote:

>> Coming from the zfs world, I've heard a few talk about the chances of
>> "silent errors", meaning the checksum on the drives match, but the
>> data being bad because of matching checksum (aka collisions). Does
>> anyone in here know the relative chance of something like that
>> happening with the checksums of current harddisks? Is the 1:10^14 or
>> 1:10^15 chances for a URE in regard to this, or is that when the drive
>> reports an error, or those two combined?
> 
> A follow-up here. I see drive manufacturers report the chance of an URE is 1:10^14 for desktop drives, 1:10^15 for 7k2RPM enterprise drives and 1:10^16 for 10k and 15k enterprise drives (or most do). I was under the impression that "nearline"  / 7k2 enterprise drives were the same thing as desktop drives, only with a slightly different firmware (TLER and friends => don't do anything as stupid as going into "deep recovery mode" like some desktop drives do).
> 
> Any idea if there's a real difference between the hardware on enterprise and desktop 7k2 drives? Is this 10-fold difference between error rates real, or is it just marketing?

SNIA has said for some time that there are differences in ECC between consumer SATA, nearline SATA, and enterprise SAS. They also describe quite a few other differences that aren't ECC related, but they don't discuss error rates directly. And this is old information…

http://www.snia.org/sites/default/education/tutorials/2007/spring/storage/Desktop_Nearline_Deltas_by_Design.pdf

Chris Murphy

* Re: Chances of silent errors?
  2013-01-22 16:28   ` Chris Murphy
@ 2013-01-22 18:33     ` Roy Sigurd Karlsbakk
  0 siblings, 0 replies; 7+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-01-22 18:33 UTC
  To: Chris Murphy; +Cc: linux-raid Raid

> SNIA has said there are differences in ECC between consumer SATA,
> nearline SATA, and enterprise SAS for some time. They also distinguish
> quite a few other differences that aren't ECC related, but they don't
> discuss error rates directly. And this is old information…
> 
> http://www.snia.org/sites/default/education/tutorials/2007/spring/storage/Desktop_Nearline_Deltas_by_Design.pdf

Thanks! Dunno how well this matches with today's drives, but I guess most of it is still valid.

Vennlige hilsener / Best regards

roy
