* md-raid paranoia mode?
@ 2014-06-11  6:48 Bart Kus
       [not found] ` <CAH3kUhH06kpJNqb-zdcv5nu2e1FeZuotcW0SjBbWDOCcasm9OA@mail.gmail.com>
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Bart Kus @ 2014-06-11  6:48 UTC (permalink / raw)
  To: linux-raid

Hello,

As far as I understand, md-raid relies on the underlying devices to 
inform it of IO errors before it'll seek redundant/parity data to 
fulfill the read request.  I have, however, seen certain hard drives 
report successful reads while returning garbage data.

Is it possible to set md-raid into a paranoid mode, in which it reads 
all available data and confirms integrity?  Here's how it would work:

RAID6: read data + parity 1 + parity 2.  If 1 of the 3 mismatches, 
correct it, and write corrected data to the corrupt source.  Log the 
event.  If all 3 disagree, alert user somehow.
RAID5: read data + parity.  If they mismatch, alert user somehow.
RAID1: read data 1 + data 2.  If they mismatch, alert user somehow.

You can see this is mostly useful for RAID6 mode, where there is a 
chance at automated recovery.  However, it can also be used to prevent 
silent data corruption in the other modes, by making it not silent.
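
To make the RAID6 case concrete, here is a rough userspace sketch of the 
per-stripe check I have in mind.  This is illustrative Python only, not 
md code; it assumes the usual P = XOR parity and Q computed over GF(2^8) 
mod 0x11d with generator g = 2, and it ignores chunk layout and degraded 
arrays:

# Illustrative sketch only: verify one RAID6 stripe and, if exactly one
# data chunk is bad, locate it and rewrite it.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
        b >>= 1
    return p

def gf_pow2(n):
    """g**n for generator g = 2."""
    x = 1
    for _ in range(n):
        x = gf_mul(x, 2)
    return x

def syndromes(data, p, q):
    """Per-byte P and Q syndromes; all zero means the stripe is consistent."""
    sp, sq = bytearray(len(p)), bytearray(len(q))
    for i in range(len(p)):
        pb = qb = 0
        for z, chunk in enumerate(data):
            pb ^= chunk[i]
            qb ^= gf_mul(gf_pow2(z), chunk[i])
        sp[i], sq[i] = pb ^ p[i], qb ^ q[i]
    return sp, sq

def paranoid_read_check(data, p, q):
    """data: list of bytearrays, one per data chunk; p, q: parity chunks."""
    sp, sq = syndromes(data, p, q)
    if not any(sp) and not any(sq):
        return "clean"
    # A single bad data chunk z satisfies sq[i] == g**z * sp[i] at every byte.
    # (A corrupt P or Q chunk alone shows up as only one nonzero syndrome;
    # not handled in this sketch.)
    for z in range(len(data)):
        gz = gf_pow2(z)
        if all(sq[i] == gf_mul(gz, sp[i]) for i in range(len(sp))):
            data[z] = bytearray(b ^ s for b, s in zip(data[z], sp))
            return "repaired data chunk %d: log it and rewrite the device" % z
    return "inconsistent stripe, cannot blame a single chunk: alert the user"

RAID5 would be the same check with only the P syndrome (detect, but not 
attribute), and RAID1 is a straight compare of the copies.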

--Bart



* Re: md-raid paranoia mode?
       [not found] ` <CAH3kUhH06kpJNqb-zdcv5nu2e1FeZuotcW0SjBbWDOCcasm9OA@mail.gmail.com>
@ 2014-06-11 10:34   ` Bart Kus
  2014-06-12  7:26     ` Mattias Wadenstein
  0 siblings, 1 reply; 12+ messages in thread
From: Bart Kus @ 2014-06-11 10:34 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: linux-raid

Doing the periodic check does not prevent corruption of read() data 
though (RAID6 case).  Copied files may be corrupted, even though the 
RAID would eventually fix itself after a repair is done.

Yes, there is a performance penalty, but data integrity is also 
improved.  Paranoid mode should probably not be the default, but I would 
like the choice to improve data integrity at the expense of some small 
speed penalty.  ZFS implements this anti-corruption checking by using 
checksums on its data.  We don't have a simple checksumming mechanism 
in md-raid, but we do have the full stripe data available and ready for 
verification.

BTW, the idea of a daily repair operation doesn't work when it takes 14 
hours to repair a large RAID.  That would only leave 10 hours of each 
day for normal-speed access.  I schedule repairs weekly, though.
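
For what it's worth, the weekly job amounts to roughly this (an 
illustrative sketch, assuming the array is md0 and the standard md 
sysfs interface):

#!/usr/bin/env python3
# Rough sketch of a weekly scrub job; assumes the array is md0.
import time

MD = "/sys/block/md0/md"

def write_attr(name, value):
    with open("%s/%s" % (MD, name), "w") as f:
        f.write(value)

def read_attr(name):
    with open("%s/%s" % (MD, name)) as f:
        return f.read().strip()

write_attr("sync_action", "repair")        # "check" counts mismatches without rewriting
while read_attr("sync_action") != "idle":  # a large array takes many hours
    time.sleep(60)
print("mismatch_cnt after scrub:", read_attr("mismatch_cnt"))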

--Bart


On 6/11/2014 2:53 AM, Roberto Spadim wrote:
> Hi
> IMHO
>
> For silent corruption I think a periodic RAID check is better 
> than a paranoid mode.
>
> Normally silent corruption occurs with an 'old disk' or with old data, 
> but it doesn't occur on every disk read (disk studies should be checked).
>
> I think a 'paranoid' mode is nice, but I think it will reduce overall 
> system performance; maybe a daily cron check is better than a 'read 
> everything, check everything' (paranoid) mode.
>
> On Wednesday, June 11, 2014, Bart Kus <me@bartk.us> wrote:
>
>     Hello,
>
>     As far as I understand, md-raid relies on the underlying devices
>     to inform it of IO errors before it'll seek redundant/parity data
>     to fulfill the read request.  I have, however, seen certain hard
>     drives report successful reads while returning garbage data.
>
>     Is it possible to set md-raid into a paranoid mode, in which it
>     reads all available data and confirms integrity?  Here's how it
>     would work:
>
>     RAID6: read data + parity 1 + parity 2.  If 1 of the 3 mismatches,
>     correct it, and write corrected data to the corrupt source.  Log
>     the event.  If all 3 disagree, alert user somehow.
>     RAID5: read data + parity.  If they mismatch, alert user somehow.
>     RAID1: read data 1 + data 2.  If they mismatch, alert user somehow.
>
>     You can see this is mostly useful for RAID6 mode, where there is a
>     chance at automated recovery.  However, it can also be used to
>     prevent silent data corruption in the other modes, by making it
>     not silent.
>
>     --Bart
>
>
>
>
> -- 
> Roberto Spadim
> SPAEmpresarial
> Automation and Control Engineer
>



* Re: md-raid paranoia mode?
  2014-06-11  6:48 md-raid paranoia mode? Bart Kus
       [not found] ` <CAH3kUhH06kpJNqb-zdcv5nu2e1FeZuotcW0SjBbWDOCcasm9OA@mail.gmail.com>
@ 2014-06-11 17:31 ` Piergiorgio Sartor
  2014-06-12  2:15 ` Brad Campbell
  2 siblings, 0 replies; 12+ messages in thread
From: Piergiorgio Sartor @ 2014-06-11 17:31 UTC (permalink / raw)
  To: Bart Kus; +Cc: linux-raid

On Tue, Jun 10, 2014 at 11:48:46PM -0700, Bart Kus wrote:
> Hello,
> 
> As far as I understand, md-raid relies on the underlying devices to inform
> it of IO errors before it'll seek redundant/parity data to fulfill the read
> request.  I have, however, seen certain hard drives report successful reads
> while returning garbage data.
> 
> Is it possible to set md-raid into a paranoid mode, in which it reads all
> available data and confirms integrity?  Here's how it would work:
> 
> RAID6: read data + parity 1 + parity 2.  If 1 of the 3 mismatches, correct
> it, and write corrected data to the corrupt source.  Log the event.  If all
> 3 disagree, alert user somehow.
> RAID5: read data + parity.  If they mismatch, alert user somehow.
> RAID1: read data 1 + data 2.  If they mismatch, alert user somehow.
> 
> You can see this is mostly useful for RAID6 mode, where there is a chance at
> automated recovery.  However, it can also be used to prevent silent data
> corruption in the other modes, by making it not silent.

Hi Bart,

this was discussed some time ago, mainly for RAID6.
One compromise was "raid6check", which runs in user space.

The main objection is the performance drop that such a
reading method would have, which, of course, would require
an enable/disable switch.

I do not want to speak for Neil, but I guess reasonable
patches doing what you propose would be accepted.

bye,

pg

> 
> --Bart
> 

-- 

piergiorgio


* Re: md-raid paranoia mode?
  2014-06-11  6:48 md-raid paranoia mode? Bart Kus
       [not found] ` <CAH3kUhH06kpJNqb-zdcv5nu2e1FeZuotcW0SjBbWDOCcasm9OA@mail.gmail.com>
  2014-06-11 17:31 ` Piergiorgio Sartor
@ 2014-06-12  2:15 ` Brad Campbell
  2014-06-12  6:28   ` Roman Mamedov
  2 siblings, 1 reply; 12+ messages in thread
From: Brad Campbell @ 2014-06-12  2:15 UTC (permalink / raw)
  To: Bart Kus, linux-raid

On 11/06/14 14:48, Bart Kus wrote:
> Hello,
>
> As far as I understand, md-raid relies on the underlying devices to
> inform it of IO errors before it'll seek redundant/parity data to
> fulfill the read request.  I have, however, seen certain hard drives
> report successful reads while returning garbage data.

If you have drives that return garbage as valid data then you have far 
greater problems than what you are suggesting will fix. So much so I 
suggest you document these instances and start banging a drum announcing 
them in a name and shame campaign. That sort of behavior from storage 
devices is never ok, and the manufacturer needs to know that.

This comes up on the list at least once a year, and the upshot is that 
your storage platform needs to be reliable. Storage is *supposed* to be 
reliable. Even the cheapest solution is *supposed* to say "I'm sorry but 
that bit of data you asked for is toast". Even my 35c USB drives do that.

Whether you have a single drive or 10 mirrors, if you have a drive 
returning garbage you need to solve that problem first. Patching 
software that is built on the fundamental assumption that the storage 
stack knows when something is bad, so that it no longer trusts that 
assumption, makes all sorts of guarantees go out the window.

From personal experience, I lost a 12TB RAID-6 and all the data on it 
due to a bad SATA controller. The controller would return corrupt reads 
under heavy load, and months of read/modify/write cycles combined with 
corrupt data spread the corruption all over the array. My immediate 
reaction was the same as yours: "RAID6 should be able to protect against 
this stuff", but after education from people who are more knowledgeable 
than I am, it became apparent that bad hardware is just that insidious, 
and papering over one part of the stack would just lead to it biting me 
elsewhere anyway.

I learned two very valuable lessons.
- Don't deploy hardware unless you trust it. This may mean a month of 
burn-in testing in a spare machine, or delaying before trusting it with 
valuable data. In my case it was a cheap 2-port PCIe SATA card procured 
to get me out of a tight spot, so I plugged it in and strapped drives to 
it, blindly believing it would be OK.
- RAID is no substitute for backups.




* Re: md-raid paranoia mode?
  2014-06-12  2:15 ` Brad Campbell
@ 2014-06-12  6:28   ` Roman Mamedov
  2014-06-12  6:45     ` NeilBrown
  2014-06-12  7:26     ` David Brown
  0 siblings, 2 replies; 12+ messages in thread
From: Roman Mamedov @ 2014-06-12  6:28 UTC (permalink / raw)
  To: Brad Campbell; +Cc: Bart Kus, linux-raid


On Thu, 12 Jun 2014 10:15:32 +0800
Brad Campbell <lists2009@fnarfbargle.com> wrote:

> On 11/06/14 14:48, Bart Kus wrote:
> > Hello,
> >
> > As far as I understand, md-raid relies on the underlying devices to
> > inform it of IO errors before it'll seek redundant/parity data to
> > fulfill the read request.  I have, however, seen certain hard drives
> > report successful reads while returning garbage data.
> 
> If you have drives that return garbage as valid data then you have far 
> greater problems than what you are suggesting will fix. So much so I 
> suggest you document these instances and start banging a drum announcing 
> them in a name and shame campaign. That sort of behavior from storage 
> devices is never ok, and the manufacturer needs to know that.

If your RAM can return garbage, that's not a justification for having ECC RAM.
ECC RAM is a gimmick invented by weak conformist people. Instead, you should go
and loudly scream at the manufacturer who sold you that RAM! Errors from RAM
are never OK! RAM should always work perfectly! And if it doesn't, you have
greater problems. We shall not tolerate this behavior! So go get a drum and
start banging it as loudly as you can! Name and shame the manufacturer who
sold you that RAM. Fight the power, brother!!!

You can probably tell just how sick I am of reasoning like yours. That's why
we can't have nice things (md-side resiliency for the cases when you need/want
it), and sadly Neil is of the same opinion as you.

-- 
With respect,
Roman



* Re: md-raid paranoia mode?
  2014-06-12  6:28   ` Roman Mamedov
@ 2014-06-12  6:45     ` NeilBrown
  2014-06-12  7:26     ` David Brown
  1 sibling, 0 replies; 12+ messages in thread
From: NeilBrown @ 2014-06-12  6:45 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Brad Campbell, Bart Kus, linux-raid


On Thu, 12 Jun 2014 12:28:14 +0600 Roman Mamedov <rm@romanrm.net> wrote:

> On Thu, 12 Jun 2014 10:15:32 +0800
> Brad Campbell <lists2009@fnarfbargle.com> wrote:
> 
> > On 11/06/14 14:48, Bart Kus wrote:
> > > Hello,
> > >
> > > As far as I understand, md-raid relies on the underlying devices to
> > > inform it of IO errors before it'll seek redundant/parity data to
> > > fulfill the read request.  I have, however, seen certain hard drives
> > > report successful reads while returning garbage data.
> > 
> > If you have drives that return garbage as valid data then you have far 
> > greater problems than what you are suggesting will fix. So much so I 
> > suggest you document these instances and start banging a drum announcing 
> > them in a name and shame campaign. That sort of behavior from storage 
> > devices is never ok, and the manufacturer needs to know that.
> 
> If your RAM can return garbage, that's not a justification for having ECC RAM.
> ECC RAM is a gimmick invented by weak conformist people. Instead, you should go
> and loudly scream at the manufacturer who sold you that RAM! Errors from RAM
> are never OK! RAM should always work perfectly! And if it doesn't, you have
> greater problems. We shall not tolerate this behavior! So go get a drum and
> start banging it as loudly as you can! Name and shame the manufacturer who
> sold you that RAM. Fight the power, brother!!!

Your screwdriver is leaking  (*).

Hard drives contain ECC.  It should ensure undetected errors are an
*extremely* rare event (more rare than bugs in the md code).

If your ECC RAM started returning bad data without telling you, would you
build a complex virtual memory system to load every byte from two different
DIMMs into CPU registers and compare them before trusting them?

I know that hard drives can return bad data.  I've seen it happen.  I don't
think that trying to "fix" it in the md/raid layer is appropriate.

File-systems and higher level data management systems (e.g. git) are much
better placed to detect such errors than md/raid is.  Supposedly btrfs will
DTRT with your drives (though TRT is to RMA them, and I don't think btrfs
has an RMA plugin yet).

> 
> You can probably tell just how sick I am of reasoning like yours. That's why
> we can't have nice things (md-side resiliency for the cases when you need/want
> it), and sadly Neil is of the same opinion as you.
> 

In general, if you want nice things you need to pay for them.  If you are
willing to pay I suspect you can find someone who is willing to provide.

NeilBrown

(*)http://www.zazzle.com/a_bad_analogy_is_like_a_leaky_screwdriver_tshirts-235102919981826183



* Re: md-raid paranoia mode?
  2014-06-12  6:28   ` Roman Mamedov
  2014-06-12  6:45     ` NeilBrown
@ 2014-06-12  7:26     ` David Brown
  2014-06-12  8:06       ` Roman Mamedov
  1 sibling, 1 reply; 12+ messages in thread
From: David Brown @ 2014-06-12  7:26 UTC (permalink / raw)
  To: Roman Mamedov, Brad Campbell; +Cc: Bart Kus, linux-raid

On 12/06/14 08:28, Roman Mamedov wrote:
> On Thu, 12 Jun 2014 10:15:32 +0800
> Brad Campbell <lists2009@fnarfbargle.com> wrote:
> 
>> On 11/06/14 14:48, Bart Kus wrote:
>>> Hello,
>>>
>>> As far as I understand, md-raid relies on the underlying devices to
>>> inform it of IO errors before it'll seek redundant/parity data to
>>> fulfill the read request.  I have, however, seen certain hard drives
>>> report successful reads while returning garbage data.
>>
>> If you have drives that return garbage as valid data then you have far 
>> greater problems than what you are suggesting will fix. So much so I 
>> suggest you document these instances and start banging a drum announcing 
>> them in a name and shame campaign. That sort of behavior from storage 
>> devices is never ok, and the manufacturer needs to know that.
> 
> If your RAM can return garbage, that's not a justification for having ECC RAM.
> ECC RAM is a gimmick invented by weak conformist people. Instead, you should go
> and loudly scream at the manufacturer who sold you that RAM! Errors from RAM
> are never OK! RAM should always work perfectly! And if it doesn't, you have
> greater problems. We shall not tolerate this behavior! So go get a drum and
> start banging it as loudly as you can! Name and shame the manufacturer who
> sold you that RAM. Fight the power, brother!!!

There are several points here.

First, RAM is susceptible to single event upsets - typically a cosmic
ray that hits the RAM array and knocks a bit out.  As geometries get
smaller and RAM gets denser, this gets more likely.  So ECC on RAM makes
sense as an economically practical way to reduce the impact of
real-world errors that are unavoidable (i.e., it's not just bad design
or production of the chips).  What would make more sense, however, is to
avoid the extra ECC lines from the chips - the ECC mechanism should be
entirely within the RAM chips.  The extra parity lines between the
memory and the controller are a left-over from the old days in which
there was no logic on the memory modules.

Secondly, hard disks already have ECC, in several layers.  There is
/far/ more error detection and correction on the data read from the
platters than you could hope to do in software at the md layer.  There
is nothing that you can do on the md layer to detect bad reads that
could not be better handled on the controller on the disk itself.  So if
you are getting /undetected/ read errors from a disk (as distinct from
/unrecoverable/ read errors), then something has gone very bad.  It is
at least as likely to be a write error as a read error, and you will
have no idea how long it has been going on and how much of your data is
corrupt.  It is probably a systematic error (such as firmware bug) in
either the disk controller or the interface card.  Such faults are
fortunately very rare - and thus very rarely worth the cost of checking
for online.

And since an undetected read error is not just an odd occasional event,
but a catastrophic system failure, the correct response is not
"re-create the data from parities" - it is "full scale panic - assume
/all/ your data is bad, check from backups, call the hardware service
people, replace the entire disk system".


If you really are paranoid about the integrity of data in the face of
undetected read errors, then there are three ways to handle it.  One is
by doing a raid scrub (a good idea anyway, to maintain redundancy
despite occasional detected read errors) - this will detect such
problems without the online costs.  Another is to maintain and check
lists of checksums (md5, sha256, etc.) of files - this is often done as
a security measure to detect alteration of files during break-ins.
Finally, you can use a filesystem that does checksumming (it is vastly
easier and more efficient to do the checksumming at the filesystem level
than at the md raid level) - btrfs is the obvious choice.
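
As an illustration of the second option, the checksum list can be as 
simple as something like this (a hypothetical sketch, not any particular 
tool; note that files you modify legitimately will also be flagged):

#!/usr/bin/env python3
# Hypothetical sketch of the "maintain and check lists of checksums" approach.
# "build" records sha256 digests; "verify" reports files whose contents changed.
import hashlib, json, os, sys

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def build(root, dbfile):
    db = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            db[full] = sha256_of(full)
    with open(dbfile, "w") as f:
        json.dump(db, f)

def verify(dbfile):
    with open(dbfile) as f:
        db = json.load(f)
    for path, digest in db.items():
        if not os.path.exists(path) or sha256_of(path) != digest:
            print("MISMATCH:", path)

if __name__ == "__main__":
    mode, *args = sys.argv[1:]
    build(*args) if mode == "build" else verify(*args)

Run "build" against a known-good tree and "verify" from cron; anything 
flagged that you did not change yourself deserves the full-scale-panic 
treatment described above.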

> 
> You can probably tell just how sick I am of reasoning like yours. That's why
> we can't have nice things (md-side resiliency for the cases when you need/want
> it), and sadly Neil is of the same opinion as you.
> 

If you disagree so strongly, you are free to do something about it.  The
people (Neil and others) who do the work in creating and maintaining md
raid know a great deal about the realistic problems in storage systems,
and realistic solutions.  They understand when people want magic, and
they understand the costs (in development time and run time) of
implementing something that is at best a very partial fix to an almost
non-existent problem (since the most likely cause of undetected read
errors is things like controller failure, which have no possible
software fix).  Given their limited time and development resources, they
therefore concentrate on features of md raid that make a real difference
to many users.

However, this is all open source development.  If you can write code to
support new md modes that do on-line scrubbing and smart recovery, then
I'm sure many people would be interested.  If you can't write the code
yourself, but can raise the money to hire a qualified developer, then
I'm sure that would also be of interest.

The point is not that such on-line checking is not a "nice thing" to
have - /I/ don't think it would be worth the on-line cost, but some
people might, and choice is always a good thing.  The point is that it is
very rarely a useful feature - and there are many other "nice things"
that have higher priority amongst the developers.

<http://neil.brown.name/blog/20100211050355>
<http://neil.brown.name/blog/20110227114201>






* Re: md-raid paranoia mode?
  2014-06-11 10:34   ` Bart Kus
@ 2014-06-12  7:26     ` Mattias Wadenstein
  0 siblings, 0 replies; 12+ messages in thread
From: Mattias Wadenstein @ 2014-06-12  7:26 UTC (permalink / raw)
  To: Bart Kus; +Cc: Roberto Spadim, linux-raid

On Wed, 11 Jun 2014, Bart Kus wrote:

> Doing the periodic check does not prevent corruption of read() data though 
> (RAID6 case).  Copied files may be corrupted, even though the RAID would 
> eventually fix itself after a repair is done.
>
> Yes, there is a performance penalty, but data integrity is also improved. 
> Paranoid mode should probably not be the default, but I would like the choice 
> to improve data integrity at the expense of some small speed penalty.  ZFS 
> implements this anti-corruption checking by using checksums on its data. 
> We don't have a simple checksumming mechanism in md-raid, but we do have the 
> full stripe data available and ready for verification.
>
> BTW, the idea of a daily repair operation doesn't work when it takes 14 hours 
> to repair a large RAID.  That would only leave 10 hours of each day for 
> normal-speed access.  I schedule repairs weekly, though.

Since bytes read and written are the dominating factor behind disk failures 
these days, I certainly wouldn't want to do repairs daily. Weekly might 
even be pushing it.

/Mattias Wadenstein


* Re: md-raid paranoia mode?
  2014-06-12  7:26     ` David Brown
@ 2014-06-12  8:06       ` Roman Mamedov
  2014-06-12  8:30         ` Brad Campbell
                           ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Roman Mamedov @ 2014-06-12  8:06 UTC (permalink / raw)
  To: David Brown; +Cc: Brad Campbell, Bart Kus, linux-raid


On Thu, 12 Jun 2014 09:26:18 +0200
David Brown <david.brown@hesbynett.no> wrote:

> Secondly, hard disks already have ECC, in several layers.  There is
> /far/ more error detection and correction on the data read from the
> platters than you could hope to do in software at the md layer.  There
> is nothing that you can do on the md layer to detect bad reads that
> could not be better handled on the controller on the disk itself.  So if
> you are getting /undetected/ read errors from a disk (as distinct from
> /unrecoverable/ read errors), then something has gone very bad.  It is
> at least as likely to be a write error as a read error, and you will
> have no idea how long it has been going on and how much of your data is
> corrupt.  It is probably a systematic error (such as firmware bug) in
> either the disk controller or the interface card.  Such faults are
> fortunately very rare - and thus very rarely worth the cost of checking
> for online.

In one case which Brad was describing, it was a hardware design fault in his
RAID controller, resulting in it returning bad data only when all ports are
utilized at high speeds. If MD had online checksum mismatch detection, it
would alert him immediately that something's going wrong, rather than have
this bug happily chew through all his data, with "months of read/modify/write
cycles combined with corrupt data spread the corruption all over the array".

> And since an undetected read error is not just an odd occasional event,
> but a catastrophic system failure, the correct response is not
> "re-create the data from parities" - it is "full scale panic - assume
> /all/ your data is bad, check from backups, call the hardware service
> people, replace the entire disk system".

Sure, it could and should loudly complain with "zomg, we just had a data
corruption and had to correct it from parity" messages to dmesg.

> Another is to maintain and check lists of checksums (md5, sha256, etc.)
> of files - this is often done as a security measure to detect alteration
> of files during break-ins.

Not always feasible at all, in case of e.g. VM images, including those of
"other" operating systems, also in case of e.g. actively modified databases.

> Finally, you can use a filesystem that does checksumming (it is vastly
> easier and more efficient to do the checksumming at the filesystem level
> than at the md raid level) - btrfs is the obvious choice.

Btrfs could not be further from the obvious choice at the moment, as Btrfs
RAID5/6 support is still in its infancy.

Sure you could use Btrfs in a single-device mode over MD; then it would detect
any checksum errors as they happen. But of course it will not be able to
correct them.

Which is sad, since MD (on RAID6) *has* all the parity information needed to
recover a read error, and there isn't even any need for a special filesystem
on top of it, but it's like it just won't help you, almost out of principle.

> If you disagree so strongly, you are free to do something about it.  The
> people (Neil and others) who do the work in creating and maintaining md
> raid know a great deal about the realistic problems in storage systems,
> and realistic solutions.  They understand when people want magic, and
> they understand the costs (in development time and run time) of
> implementing something that is at best a very partial fix to an almost
> non-existent problem (since the most likely cause of undetected read
> errors is things like controller failure, which have no possible
> software fix).  Given their limited time and development resources, they
> therefore concentrate on features of md raid that make a real difference
> to many users.

Absolutely, however the thing is, having a mode to always full-check RAID1/5/6
reads does not even seem like an extremely complicated feature to implement;
it's just the collective echo chamber of "this is useless; we don't need this;
md is the wrong place to do this; etc" that discourages any work in this area.
And those who think that on the contrary this is a good idea (as Brad said,
"this comes up at least once a year") typically lack the necessary experience
with the MD or kernel programming to implement it themselves.

> However, this is all open source development.  If you can write code to
> support new md modes that do on-line scrubbing and smart recovery, then
> I'm sure many people would be interested.  If you can't write the code
> yourself, but can raise the money to hire a qualified developer, then
> I'm sure that would also be of interest.

Sure, but that also does not stop me from doing my part by whining^W providing
valuable input on mailing lists, to signal to any interested developers that
yes, that's indeed one feature which is very much in demand by some users in
the real world :)

-- 
With respect,
Roman



* Re: md-raid paranoia mode?
  2014-06-12  8:06       ` Roman Mamedov
@ 2014-06-12  8:30         ` Brad Campbell
  2014-06-12  8:53         ` Roman Mamedov
  2014-06-12 11:27         ` David Brown
  2 siblings, 0 replies; 12+ messages in thread
From: Brad Campbell @ 2014-06-12  8:30 UTC (permalink / raw)
  To: Roman Mamedov, David Brown; +Cc: Bart Kus, linux-raid


On 12/06/14 16:06, Roman Mamedov wrote:
> In one case which Brad was describing, it was a hardware design fault 
> in his RAID controller, resulting in it returning bad data only when 
> all ports are utilized at high speeds. If MD had online checksum 
> mismatch detection, it would alert him immediately that something's 
> going wrong, rather than have this bug happily chew through all his 
> data, with "months of read/modify/write cycles combined with corrupt 
> data spread the corruption all over the array".


Yeah, you are right, it might have spared some of my data. That said, 
if I'd been paying attention to the mismatch counts at the end of my 
monthly scrubs I'd have noticed it a _lot_ sooner as well. I had the 
tools; I just wasn't using them right. My fault, not md's.
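
For anyone wanting to avoid my mistake, the monitoring half is tiny; 
roughly this (illustrative only, using the standard md sysfs layout):

#!/usr/bin/env python3
# Illustrative only: complain if any md array reports mismatches after a scrub.
import glob

for path in glob.glob("/sys/block/md*/md/mismatch_cnt"):
    count = int(open(path).read())
    if count:
        print("WARNING: %s = %d, investigate before trusting this array"
              % (path, count))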

Having said that, if I'd not gone through that I'd probably still not 
have comprehensive and complete backups, and I'd not have 
developed/found tools to allow me to better monitor my systems. So while 
it was a painful experience, it was not catastrophic and (as Calvin's 
dad would say) it built some more character.

I'm a lot older, and hopefully wiser from the experience. I also know my 
time is better spent with monitoring and backups than developing code to 
build that feature into md. While that would paper over one part of the 
storage chain, backups and monitoring cover me end to end.

-- 
Dolphins are so intelligent that within a few weeks they can train 
Americans to stand at the edge of the pool and throw them fish.


* Re: md-raid paranoia mode?
  2014-06-12  8:06       ` Roman Mamedov
  2014-06-12  8:30         ` Brad Campbell
@ 2014-06-12  8:53         ` Roman Mamedov
  2014-06-12 11:27         ` David Brown
  2 siblings, 0 replies; 12+ messages in thread
From: Roman Mamedov @ 2014-06-12  8:53 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: David Brown, Brad Campbell, Bart Kus, linux-raid


On Thu, 12 Jun 2014 14:06:44 +0600
Roman Mamedov <rm@romanrm.net> wrote:

> Which is sad, since MD (on RAID6) *has* all the parity information needed to
> recover a read error, and there isn't even any need for a special filesystem
> on top of it, but it's like it just won't help you, almost out of principle.

s/error/corruption/

-- 
With respect,
Roman



* Re: md-raid paranoia mode?
  2014-06-12  8:06       ` Roman Mamedov
  2014-06-12  8:30         ` Brad Campbell
  2014-06-12  8:53         ` Roman Mamedov
@ 2014-06-12 11:27         ` David Brown
  2 siblings, 0 replies; 12+ messages in thread
From: David Brown @ 2014-06-12 11:27 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Brad Campbell, Bart Kus, linux-raid

On 12/06/14 10:06, Roman Mamedov wrote:
> On Thu, 12 Jun 2014 09:26:18 +0200
> David Brown <david.brown@hesbynett.no> wrote:
> 
>> Secondly, hard disks already have ECC, in several layers.  There is
>> /far/ more error detection and correction on the data read from the
>> platters than you could hope to do in software at the md layer.  There
>> is nothing that you can do on the md layer to detect bad reads that
>> could not be better handled on the controller on the disk itself.  So if
>> you are getting /undetected/ read errors from a disk (as distinct from
>> /unrecoverable/ read errors), then something has gone very bad.  It is
>> at least as likely to be a write error as a read error, and you will
>> have no idea how long it has been going on and how much of your data is
>> corrupt.  It is probably a systematic error (such as firmware bug) in
>> either the disk controller or the interface card.  Such faults are
>> fortunately very rare - and thus very rarely worth the cost of checking
>> for online.
> 
> In one case which Brad was describing, it was a hardware design fault in his
> RAID controller, resulting in it returning bad data only when all ports are
> utilized at high speeds. If MD had online checksum mismatch detection, it
> would alert him immediately that something's going wrong, rather than have
> this bug happily chew through all his data, with "months of read/modify/write
> cycles combined with corrupt data spread the corruption all over the array".

More regular scrubs would have spotted the issue sooner, though not as
soon as online checks.  Fortunately, cases like this are rare.

> 
>> And since an undetected read error is not just an odd occasional event,
>> but a catastrophic system failure, the correct response is not
>> "re-create the data from parities" - it is "full scale panic - assume
>> /all/ your data is bad, check from backups, call the hardware service
>> people, replace the entire disk system".
> 
> Sure, it could and should loudly complain with "zomg, we just had a data
> corruption and had to correct it from parity" messages to dmesg.
> 

I would be tempted to consider a "kernel panic" rather than a log
message.  If such a serious problem is found, you don't want to write
anything more to the disks in case you make things worse - the user may
be better off disconnecting the disks and re-connecting them on another
system to get the data off them.

Of course, it would be nicer to make the level of reaction configurable.

>> Another is to maintain and check lists of checksums (md5, sha256, etc.)
>> of files - this is often done as a security measure to detect alteration
>> of files during break-ins.
> 
> Not always feasible at all, in case of e.g. VM images, including those of
> "other" operating systems, also in case of e.g. actively modified databases.
> 

Yes - it works for some usage patterns, but not others.

>> Finally, you can use a filesystem that does checksumming (it is vastly
>> easier and more efficient to do the checksumming at the filesystem level
>> than at the md raid level) - btrfs is the obvious choice.
> 
> Btrfs could not be further from the obvious choice at the moment, as Btrfs
> RAID5/6 support is still in its infancy.
> 
> Sure you could use Btrfs in a single-device mode over MD; then it would detect
> any checksum errors as they happen. But of course it will not be able to
> correct them.

That's correct.  But since the chances of you having an undetectable
read error are tiny, and there is /no/ good answer for how to "correct"
it, simple detection is absolutely fine.

> 
> Which is sad, since MD (on RAID6) *has* all the parity information needed to
> recover a read error, and there isn't even any need for a special filesystem
> on top of it, but it's like it just won't help you, almost out of principle.

Before going any further, you must understand that there is /no/ way to
recover from such read errors.  There are ways that /might/ help,
depending on the underlying cause.  Detection is important here, not
recovery, so a filesystem checksum that turns an undetected read error
into a detected one is all that's needed.

Another thing to note here is that there are a few circumstances in
which a parity mismatch is actually normal behaviour - and any automatic
online system would have to be able to distinguish those.  If parts of
the array are out of sync, such as when first building the array, while
writing a stripe, or after an unclean shutdown, then you will get
mismatches.  Swap areas can also be out of sync for short times if
memory changes while the pages are being written.  Such issues make it
harder than it might first seem to implement online checking.

> 
>> If you disagree so strongly, you are free to do something about it.  The
>> people (Neil and others) who do the work in creating and maintaining md
>> raid know a great deal about the realistic problems in storage systems,
>> and realistic solutions.  They understand when people want magic, and
>> they understand the costs (in development time and run time) of
>> implementing something that is at best a very partial fix to an almost
>> non-existent problem (since the most likely cause of undetected read
>> errors is things like controller failure, which have no possible
>> software fix).  Given their limited time and development resources, they
>> therefore concentrate on features of md raid that make a real difference
>> to many users.
> 
> Absolutely, however the thing is, having a mode to always full-check RAID1/5/6
> reads does not even seem like an extremely complicated feature to implement;
> it's just the collective echo chamber of "this is useless; we don't need this;
> md is the wrong place to do this; etc" that discourages any work in this area.
> And those who think that on the contrary this is a good idea (as Brad said,
> "this comes up at least once a year") typically lack the necessary experience
> with the MD or kernel programming to implement it themselves.
> 
>> However, this is all open source development.  If you can write code to
>> support new md modes that do on-line scrubbing and smart recovery, then
>> I'm sure many people would be interested.  If you can't write the code
>> yourself, but can raise the money to hire a qualified developer, then
>> I'm sure that would also be of interest.
> 
> Sure, but that also does not stop me from doing my part by whining^W providing
> valuable input on mailing lists, to signal to any interested developers that
> yes, that's indeed one feature which is very much in demand by some users in
> the real world :)
> 

That's certainly true - user pressure and demand are always an influence
when prioritising development!



end of thread

Thread overview: 12+ messages
2014-06-11  6:48 md-raid paranoia mode? Bart Kus
     [not found] ` <CAH3kUhH06kpJNqb-zdcv5nu2e1FeZuotcW0SjBbWDOCcasm9OA@mail.gmail.com>
2014-06-11 10:34   ` Bart Kus
2014-06-12  7:26     ` Mattias Wadenstein
2014-06-11 17:31 ` Piergiorgio Sartor
2014-06-12  2:15 ` Brad Campbell
2014-06-12  6:28   ` Roman Mamedov
2014-06-12  6:45     ` NeilBrown
2014-06-12  7:26     ` David Brown
2014-06-12  8:06       ` Roman Mamedov
2014-06-12  8:30         ` Brad Campbell
2014-06-12  8:53         ` Roman Mamedov
2014-06-12 11:27         ` David Brown
