* Fault tolerance in RAID0 with badblocks
@ 2017-05-04 10:04 Ravi (Tom) Hale
  2017-05-04 13:44 ` Wols Lists
  0 siblings, 1 reply; 69+ messages in thread
From: Ravi (Tom) Hale @ 2017-05-04 10:04 UTC (permalink / raw)
  To: linux-raid

Since btrfs doesn't support badblocks, this btrfs mailing list post[1]
suggested using mdadm RAID0 (3.1+).

Is there a way of having blocks from a spare device automatically
replacing bad blocks when they are next written to (like SMART does for
HDDs)?

Or would mdadm be able to add a "badblocks layer" to btrfs in some other
way?

My use case is mining storj - I don't mind some data loss.

[1] https://www.spinics.net/lists/linux-btrfs/msg40909.html

-- 
Cheers,

Tom Hale

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Fault tolerance in RAID0 with badblocks
  2017-05-04 10:04 Fault tolerance in RAID0 with badblocks Ravi (Tom) Hale
@ 2017-05-04 13:44 ` Wols Lists
  2017-05-05  4:03   ` Fault tolerance " Ravi (Tom) Hale
  0 siblings, 1 reply; 69+ messages in thread
From: Wols Lists @ 2017-05-04 13:44 UTC (permalink / raw)
  To: Ravi (Tom) Hale, linux-raid

On 04/05/17 11:04, Ravi (Tom) Hale wrote:
> Since btrfs doesn't support badblocks, this btrfs mailing list post[1]
> suggested to use mdadm RAID0 3.1+.

Having read the email you linked to, I don't think mdadm will be any
help at all ...
> 
> Is there a way of having blocks from a spare device automatically
> replacing bad blocks when they are next written to (like SMART does for
> HDDs)?

What quite do you mean?
> 
> Or would mdadm be able to add a "badblocks layer" to btrfs in some other
> way?

No. With modern hard drives, no filesystem should pay any attention to
badblocks - it's all handled in the drive firmware. Badblocks is an
unfortunate legacy from the past when drives really were CHS, and the
layer above needed some way of knowing which blocks were bad and should
be avoided. mdadm has had a lot of grief with its handling of badblocks,
and getting drives confused, and it's all totally unnecessary anyway.

Let the drive worry about what blocks are bad. One major point behind
LBA is it hides the actual disk layout from the computer, and allows the
drive to relocate blocks that aren't working properly. Let it do its job.
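For what it's worth, the firmware's sparing activity can at least be
watched from the host with smartmontools - a sketch, assuming a SATA
drive at /dev/sda (attribute names vary by vendor):

```shell
# Show the drive's own view of its spare-sector usage. Reallocated_Sector_Ct
# counts spared sectors; Current_Pending_Sector counts sectors the firmware
# is waiting to re-verify on the next write.
smartctl -A /dev/sda | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'

# Ask the firmware to sweep the whole surface itself (extended self-test):
smartctl -t long /dev/sda
```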

If you want to use raid, don't bother with 0. Use mdadm and raid 5 or 6
to combine your drives, and create a btrfs filesystem on top. (Don't
bother with raid1 - that part of btrfs apparently works well, so use the
filesystem variant, not an external one.)
> 
> My use case is mining storj - I don't mind some data loss.

Using a badblock list will have no impact on this whatsoever.
> 
> [1] https://www.spinics.net/lists/linux-btrfs/msg40909.html
> 
Cheers,
Wol


* Re: Fault tolerance with badblocks
  2017-05-04 13:44 ` Wols Lists
@ 2017-05-05  4:03   ` Ravi (Tom) Hale
  2017-05-05 19:20     ` Anthony Youngman
  2017-05-05 20:23     ` Peter Grandi
  0 siblings, 2 replies; 69+ messages in thread
From: Ravi (Tom) Hale @ 2017-05-05  4:03 UTC (permalink / raw)
  To: Wols Lists, linux-raid

On 04/05/17 20:44, Wols Lists wrote:
> On 04/05/17 11:04, Ravi (Tom) Hale wrote:
>> Is there a way of having blocks from a spare device automatically
>> replacing bad blocks when they are next written to (like SMART does for
>> HDDs)?
> 
> What quite do you mean?

I mean: should a bad block be identified, any writes to that virtual
block are redirected to another good LBA block held in a spare pool,
which would need to be inaccessible for other purposes (so that its
blocks are indeed spare).
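The scheme described here could be sketched as follows - hypothetical
Python, not an existing md or dm feature; `RemappingDevice` and its
methods are made up for illustration:

```python
# Hypothetical sketch of write-redirect bad-block sparing: writes to a
# known-bad virtual block are redirected to a block taken from a
# reserved spare pool, mirroring what drive firmware does internally.

class RemappingDevice:
    def __init__(self, data_blocks, spare_blocks):
        self.store = {}                 # physical block number -> data
        self.spares = list(range(data_blocks, data_blocks + spare_blocks))
        self.bad = set()                # virtual blocks known to be bad
        self.remap = {}                 # virtual block -> spare block

    def mark_bad(self, vblock):
        self.bad.add(vblock)

    def write(self, vblock, data):
        # Redirect happens lazily, on the next write to a bad block.
        if vblock in self.bad and vblock not in self.remap:
            if not self.spares:
                raise IOError("spare pool exhausted")
            self.remap[vblock] = self.spares.pop(0)
        self.store[self.remap.get(vblock, vblock)] = data

    def read(self, vblock):
        return self.store[self.remap.get(vblock, vblock)]

dev = RemappingDevice(data_blocks=100, spare_blocks=4)
dev.mark_bad(7)
dev.write(7, b"hello")       # lands in the spare pool
print(dev.remap[7])          # 100 (the first spare block)
print(dev.read(7))           # b'hello'
```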

>> Or would mdadm be able to add a "badblocks layer" to btrfs in some other
>> way?
> 
> No. With modern hard drives, no filesystem should pay any attention to
> badblocks - it's all handled in the drive firmware.

ext4 supports this, and is a relatively modern filesystem, released in
December 2008. While it could be argued that this is for legacy support,
this feature still adds value (see below).
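For reference, ext4's bad-block handling is driven from userspace with
e2fsprogs - a sketch, with /dev/sdX1 standing in for an unmounted ext4
partition:

```shell
# Scan the device read-only for unreadable blocks. -b must match the
# filesystem block size (commonly 4096 for ext4).
badblocks -sv -b 4096 -o /tmp/badblocks.txt /dev/sdX1

# Record the list in an existing filesystem...
e2fsck -l /tmp/badblocks.txt /dev/sdX1

# ...or avoid the blocks at mkfs time:
mkfs.ext4 -b 4096 -l /tmp/badblocks.txt /dev/sdX1
```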

> mdadm has had a lot of grief with its handling of badblocks,
> and getting drives confused, and it's all totally unnecessary anyway.

The use case is simple: What if I want to have more goodblocks to
correct for badblocks than Seagate thinks I should have?

Eg, a charity or poor student wanting to get the most out of their old
hardware.

In my case, I don't care about actual data loss (RAID0).

However, in the usual case, running RAID 1, 5 or 6 with a pool of spare
goodblocks would allow extending the life of hardware considerably while
still providing a poor-man's margin of redundancy.

> Let the drive worry about what blocks are bad. One major point behind
> LBA is it hides the actual disk layout from the computer, and allows the
> drive to relocate blocks that aren't working properly. Let it do its job.

Until it can't do its job any more because it runs out of its
manufacturer-determined fixed-size spare pool. Yes, there are things to
consider for performance, like keeping the physical good sector
close to the physical bad sector, so a spare data area could be
allocated every N usable data areas.

And perhaps I could write that one day. :)

>> My use case is mining storj - I don't mind some data loss.
> 
> Using a badblock list will have no impact on this whatsoever.

A corrupted file is a corrupted file, and can be deleted at minimal
loss. I just don't want the next file being corrupted by the same badblock.

-- 
Tom


* Re: Fault tolerance with badblocks
  2017-05-05  4:03   ` Fault tolerance " Ravi (Tom) Hale
@ 2017-05-05 19:20     ` Anthony Youngman
  2017-05-06 11:21       ` Ravi (Tom) Hale
  2017-05-05 20:23     ` Peter Grandi
  1 sibling, 1 reply; 69+ messages in thread
From: Anthony Youngman @ 2017-05-05 19:20 UTC (permalink / raw)
  To: Ravi (Tom) Hale, linux-raid

On 05/05/17 05:03, Ravi (Tom) Hale wrote:
> On 04/05/17 20:44, Wols Lists wrote:
>> On 04/05/17 11:04, Ravi (Tom) Hale wrote:
>>> Is there a way of having blocks from a spare device automatically
>>> replacing bad blocks when they are next written to (like SMART does for
>>> HDDs)?
>>
>> What quite do you mean?
>
> I mean: should a bad block be identified, any writes to that virtual
> block are redirected to another good LBA block held in a spare pool
> which would need to be inaccessible for other purposes (so that they are
> indeed spare).
>
>>> Or would mdadm be able to add a "badblocks layer" to btrfs in some other
>>> way?
>>
>> No. With modern hard drives, no filesystem should pay any attention to
>> badblocks - it's all handled in the drive firmware.
>
> ext4 supports this, and is a relatively modern filesystem released in
> December 2008. While it could be argued that this is for legacy support,
> This feature still adds value (see below).
>
>> mdadm has had a lot of grief with its handling of badblocks,
>> and getting drives confused, and it's all totally unnecessary anyway.
>
> The use case is simple: What if I want to have more goodblocks to
> correct for badblocks than Seagate thinks I should have?

Understood. Except that when you get to that state, your drive is 
probably dying anyway. Or tiny by modern standards.
>
> Eg, a charity or poor student wanting to get the most out of their old
> hardware.
>
> In my case, I don't care about actual data loss (RAID0).
>
> However, in the usual case, running RAID 1, 5 or 6 with a pool of spare
> goodblocks would allow extending the life of hardware considerably while
> still providing a poor-man's margin of redundancy.
>
>> Let the drive worry about what blocks are bad. One major point behind
>> LBA is it hides the actual disk layout from the computer, and allows the
>> drive to relocate blocks that aren't working properly. Let it do its job.
>
> Until it can't do its job any more because it runs out of its
> manufacturer determined fixed-size spare pool.

Bear in mind I'm speculating slightly here ... but how are you going to 
know when the drive has run out of its spare-pool? Bear in mind that 
most SSDs, it seems, will commit suicide at this point ...

Bear in mind also, that any *within* *spec* drive can have an "accident" 
every 10TB and still be considered perfectly okay. Which means that if 
you do what you are supposed to do (rewrite the block) you're risking 
the drive remapping the block - and getting closer to the drive bricking 
itself. But if you trap the error yourself and add it to the badblocks 
list, you are risking throwing away perfectly decent blocks that just 
hiccuped.

Bear in mind also, that with raid we recommend "scrubbing". That's 
basically reading the entire disk looking for errors, because data does 
fade. So if you "look after" a 3TB drive, you could be losing a block a 
month to your badblock list. Not good.
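As a rough check of that arithmetic - a sketch, assuming the common
consumer spec of one unrecoverable read error (URE) per 1e14 bits read -
"a block a month" is a pessimistic round-up:

```python
# Rough check of the "losing a block a month" claim, assuming the common
# consumer-drive spec of 1 URE per 1e14 bits read.

def expected_ures(bytes_read, ure_per_bit=1e-14):
    """Expected number of UREs for a given amount of data read."""
    return bytes_read * 8 * ure_per_bit

per_scrub = expected_ures(3e12)        # one full read of a 3 TB drive
print(f"per monthly scrub: {per_scrub:.2f}")        # 0.24
print(f"per year of scrubs: {12 * per_scrub:.1f}")  # 2.9
```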

> Yes there are things to
> consider for performance like having the physical good sector being
> close to the physical bad sector, so a spare data area could be
> allocated every N usable data areas.
>
> And perhaps I could write that one day. :)
>
>>> My use case is mining storj - I don't mind some data loss.
>>
>> Using a badblock list will have no impact on this whatsoever.
>
> A corrupted file is a corrupted file, and can be deleted at minimal
> loss. I just don't want the next file being corrupted by the same badblock.
>
As we say, YMMV. If that's what you want to do, fine. Which is going to 
happen first - the drive bricks itself because it runs out of 
manufacturer-supplied spare blocks, or you bin the drive because your 
bad-blocks-list has got too big to handle? I suspect your bad block list 
will fill up long before the drive runs out of manufacturer-supplied blocks.

Cheers,
Wol


* Re: Fault tolerance with badblocks
  2017-05-05  4:03   ` Fault tolerance " Ravi (Tom) Hale
  2017-05-05 19:20     ` Anthony Youngman
@ 2017-05-05 20:23     ` Peter Grandi
  2017-05-05 22:14       ` Nix
  1 sibling, 1 reply; 69+ messages in thread
From: Peter Grandi @ 2017-05-05 20:23 UTC (permalink / raw)
  To: Linux RAID

>> No. With modern hard drives, no filesystem should pay any
>> attention to badblocks - it's all handled in the drive firmware.

> ext4 supports this,

JFS also supports bad-block avoidance, but only at 'mkfs' time, and it
does this for legacy reasons: Linux JFS is a port of JFS/2 from OS/2,
which was itself a port of JFS version 1 from AIX (1990).

> and is a relatively modern filesystem released in December
> 2008.

It is just a retread of 'ext3' which itself was a recycling of
'ext2' which was in turn a clone of the 4BSD FFS, and we are talking
of design decisions taken in 1982-3, not 2008.

> While it could be argued that this is for legacy support,

It is for legacy support. Once upon a time a drive's controller was
the main CPU itself, and the kernel had to manage bad block sparing
(as well as rotational layout and track buffering). That was up to
around 20-30 years ago :-).

> This feature still adds value (see below).

It adds value if one underestimates typical disk drive failure
modes.  It is quite irritating even for me that a drive with way
less than 1% bad blocks becomes effectively unusable, but long
experience tells me that once a drive starts to grow defects to the
point that manufacturer spare sectors run out there is usually a
reason for it and sooner than later it will be almost completely
unusable.

[ ... ]

> The use case is simple: What if I want to have more goodblocks to
> correct for badblocks than Seagate thinks I should have?

The answer is also simple: if you think you know better than
Seagate, or if you think that Seagate deliberately allocates too few
spare sectors, ask Seagate for custom firmware that allocates more of
the disk's capacity for spares. I suspect that with an order of at
least 100,000 drives they will be happy to help. :-)

> Eg, a charity or poor student wanting to get the most out of their
> old hardware.

If it is your itch, and you think you know better than the rest
of the industry, scratch your itch: send patches :-).

Other people know that keeping decaying drives in use is usually
fairly pointless. Legend has it that USSR computer engineers perfected
that art, but they worked in special circumstances. For a similar
example, look at BadRAM and similar modules:

  https://help.ubuntu.com/community/BadRAM
  http://rick.vanrein.org/linux/badram/

They haven't become that popular... :-)


* Re: Fault tolerance with badblocks
  2017-05-05 20:23     ` Peter Grandi
@ 2017-05-05 22:14       ` Nix
  0 siblings, 0 replies; 69+ messages in thread
From: Nix @ 2017-05-05 22:14 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

On 5 May 2017, Peter Grandi stated:

>> This feature still adds value (see below).
>
> It adds value if one underestimates typical disk drive failure
> modes.  It is quite irritating even for me that a drive with way
> less than 1% bad blocks becomes effectively unusable, but long
> experience tells me that once a drive starts to grow defects to the
> point that manufacturer spare sectors run out there is usually a
> reason for it and sooner than later it will be almost completely
> unusable.

Quite. In my experience, if there are that many bad blocks on rotational
storage, it generally means either that a head has died or that the disk
surface is damaged. If the disk surface is damaged to that degree, there
will be crap flying around inside the drive at very high speed, abrading
the drive surface further with every passing minute. Such a drive is
walking dead. Get any surviving data off now and throw it away with
extreme prejudice, possibly pulling it apart first to gawp at the
horribleness that is all that remains of your disk surfaces. As for the
dead-head case, the question is whether whatever killed the head
produced debris. If it did, you're back at the previous problem, and if
it's electronic failure, frankly the whole drive is untrustworthy IMHO.
(There *are* other possibilities: catastrophically buggy drive firmware,
for instance -- but in such cases the drive is *also* walking dead.)

-- 
NULL && (void)


* Re: Fault tolerance with badblocks
  2017-05-05 19:20     ` Anthony Youngman
@ 2017-05-06 11:21       ` Ravi (Tom) Hale
  2017-05-06 13:00         ` Wols Lists
  0 siblings, 1 reply; 69+ messages in thread
From: Ravi (Tom) Hale @ 2017-05-06 11:21 UTC (permalink / raw)
  To: Anthony Youngman, linux-raid

On 06/05/17 02:20, Anthony Youngman wrote:
> Bear in mind I'm speculating slightly here ... but how are you going to
> know when the drive has run out of its spare-pool? Bear in mind that
> most SSDs, it seems, will commit suicide at this point ...

Intel and Samsung SSDs support S.M.A.R.T. (though not the SSD in my
personal laptop).

> Bear in mind also, that any *within* *spec* drive can have an "accident"
> every 10TB and still be considered perfectly okay. Which means that if
> you do what you are supposed to do (rewrite the block) you're risking
> the drive remapping the block - and getting closer to the drive bricking
> itself. But if you trap the error yourself and add it to the badblocks
> list, you are risking throwing away perfectly decent blocks that just
> hiccuped.

For hiccups, having a bad-read-count for each suspected-bad block could
be sensible. If that number goes above <small-threshold> it's very
likely that the block is indeed bad and should be avoided in future.
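A sketch of that idea - hypothetical Python; `BadBlockTracker` and the
threshold value are made up for illustration:

```python
# Hypothetical bad-read-count tracker: a block is only retired to the
# badblock list after it fails reads repeatedly, so transient hiccups
# don't permanently throw away good blocks.

from collections import Counter

class BadBlockTracker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.read_failures = Counter()

    def record_read(self, block, ok):
        if ok:
            self.read_failures.pop(block, None)  # a clean read clears the count
        else:
            self.read_failures[block] += 1

    def is_bad(self, block):
        return self.read_failures[block] >= self.threshold

t = BadBlockTracker(threshold=3)
t.record_read(42, ok=False)
t.record_read(42, ok=False)
print(t.is_bad(42))    # False: could still be a hiccup
t.record_read(42, ok=False)
print(t.is_bad(42))    # True: persistent failure, avoid in future
```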

-- 
Tom Hale


* Re: Fault tolerance with badblocks
  2017-05-06 11:21       ` Ravi (Tom) Hale
@ 2017-05-06 13:00         ` Wols Lists
  2017-05-08 14:50           ` Nix
  0 siblings, 1 reply; 69+ messages in thread
From: Wols Lists @ 2017-05-06 13:00 UTC (permalink / raw)
  To: Ravi (Tom) Hale, linux-raid

On 06/05/17 12:21, Ravi (Tom) Hale wrote:
>> Bear in mind also, that any *within* *spec* drive can have an "accident"
>> > every 10TB and still be considered perfectly okay. Which means that if
>> > you do what you are supposed to do (rewrite the block) you're risking
>> > the drive remapping the block - and getting closer to the drive bricking
>> > itself. But if you trap the error yourself and add it to the badblocks
>> > list, you are risking throwing away perfectly decent blocks that just
>> > hiccuped.

> For hiccups, having a bad-read-count for each suspected-bad block could
> be sensible. If that number goes above <small-threshold> it's very
> likely that the block is indeed bad and should be avoided in future.

Except you have the second law of thermodynamics in play - "what man
proposes, nature opposes". This could well screw up big time.

DRAM needs to be refreshed by a read-write cycle every few
milliseconds. Hard drives are the same, actually, except that the
interval is measured in years, not milliseconds. Fill your brand-new hard
drive with data, then hammer it gently over a few years. Especially if a
block's neighbours are repeatedly rewritten but this particular block is
never touched, it is likely to become unreadable.

So it will fail your test - reads will repeatedly fail - but if the
firmware was given a look-in (by rewriting it) it wouldn't be remapped.

And as Nix said, once a drive starts getting a load of errors, chances
are something is catastrophically wrong and things are going to get
exponentially worse.

Cheers,
Wol


* Re: Fault tolerance with badblocks
  2017-05-06 13:00         ` Wols Lists
@ 2017-05-08 14:50           ` Nix
  2017-05-08 18:00             ` Anthony Youngman
                               ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Nix @ 2017-05-08 14:50 UTC (permalink / raw)
  To: Wols Lists; +Cc: Ravi (Tom) Hale, linux-raid

On 6 May 2017, Wols Lists outgrape:

> On 06/05/17 12:21, Ravi (Tom) Hale wrote:
>>> Bear in mind also, that any *within* *spec* drive can have an "accident"
>>> > every 10TB and still be considered perfectly okay. Which means that if
>>> > you do what you are supposed to do (rewrite the block) you're risking
>>> > the drive remapping the block - and getting closer to the drive bricking
>>> > itself. But if you trap the error yourself and add it to the badblocks
>>> > list, you are risking throwing away perfectly decent blocks that just
>>> > hiccuped.
>
>> For hiccups, having a bad-read-count for each suspected-bad block could
>> be sensible. If that number goes above <small-threshold> it's very
>> likely that the block is indeed bad and should be avoided in future.
>
> Except you have the second law of thermodynamics in play - "what man
> proposes, nature opposes". This could well screw up big time.
>
> DRAM needs to be refreshed by a read-write cycle every few
> milliseconds. Hard drives are the same, actually, except that the
> interval is measured in years, not milliseconds. Fill your brand-new hard
> drive with data, then hammer it gently over a few years. Especially if a
> block's neighbours are repeatedly rewritten but this particular block is
> never touched, it is likely to become unreadable.
>
> So it will fail your test - reads will repeatedly fail - but if the
> firmware was given a look-in (by rewriting it) it wouldn't be remapped.

You mean it *would* be remapped (and all would be well).

I wonder... scrubbing is not very useful with md, particularly with RAID
6, because it does no writes unless something mismatches, and on failure
there is no attempt to determine which of the N disks is bad and rewrite
its contents from the other devices (nor, as I understand it, does it
clearly say which drive gave the error, so even failing it out and
resyncing it is hard).

If there was a way to get md to *rewrite* everything during scrub,
rather than just checking, this might help (in addition to letting the
drive refresh the magnetization of absolutely everything). "repair" mode
appears to do no writes until an error is found, whereupon (on RAID 6)
it proceeds to make a "repair" that is more likely than not to overwrite
good data with bad. Optionally writing what's already there on non-error
seems like it might be a worthwhile (and fairly simple) change.
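For context, md's existing scrub modes are driven through sysfs - a
sketch assuming an array named md0; neither mode rewrites everything:

```shell
# Read-only scrub: unreadable sectors are reconstructed from redundancy
# and rewritten; parity/copy mismatches only increment a counter.
echo check > /sys/block/md0/md/sync_action

# Repairing scrub: as above, but mismatches are also rewritten
# (for RAID5/6, parity is recomputed from the data blocks).
echo repair > /sys/block/md0/md/sync_action

# Mismatches found by the last check/repair pass:
cat /sys/block/md0/md/mismatch_cnt
```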

-- 
NULL && (void)


* Re: Fault tolerance with badblocks
  2017-05-08 14:50           ` Nix
@ 2017-05-08 18:00             ` Anthony Youngman
  2017-05-09 10:11               ` David Brown
  2017-05-09 10:18               ` Nix
  2017-05-08 19:02             ` Phil Turmel
  2017-05-09  7:37             ` David Brown
  2 siblings, 2 replies; 69+ messages in thread
From: Anthony Youngman @ 2017-05-08 18:00 UTC (permalink / raw)
  To: Nix; +Cc: Ravi (Tom) Hale, linux-raid

On 08/05/17 15:50, Nix wrote:
> On 6 May 2017, Wols Lists outgrape:
> 
>> On 06/05/17 12:21, Ravi (Tom) Hale wrote:
>>>> Bear in mind also, that any *within* *spec* drive can have an "accident"
>>>>> every 10TB and still be considered perfectly okay. Which means that if
>>>>> you do what you are supposed to do (rewrite the block) you're risking
>>>>> the drive remapping the block - and getting closer to the drive bricking
>>>>> itself. But if you trap the error yourself and add it to the badblocks
>>>>> list, you are risking throwing away perfectly decent blocks that just
>>>>> hiccuped.
>>
>>> For hiccups, having a bad-read-count for each suspected-bad block could
>>> be sensible. If that number goes above <small-threshold> it's very
>>> likely that the block is indeed bad and should be avoided in future.
>>
>> Except you have the second law of thermodynamics in play - "what man
>> proposes, nature opposes". This could well screw up big time.
>>
>> DRAM needs to be refreshed by a read-write cycle every few
>> milliseconds. Hard drives are the same, actually, except that the
>> interval is measured in years, not milliseconds. Fill your brand-new hard
>> drive with data, then hammer it gently over a few years. Especially if a
>> block's neighbours are repeatedly rewritten but this particular block is
>> never touched, it is likely to become unreadable.
>>
>> So it will fail your test - reads will repeatedly fail - but if the
>> firmware was given a look-in (by rewriting it) it wouldn't be remapped.
> 
> You mean it *would* be remapped (and all would be well).
> 
No. The data would be lost, the block would be overwritten successfully 
and there would be no need to remap. Basically, the magnetism has 
decayed (so it can't be reconstructed from the extra error recovery bits 
on disk) and rewriting it fixes the problem. But the data's been lost ...

> I wonder... scrubbing is not very useful with md, particularly with RAID
> 6, because it does no writes unless something mismatches, and on failure
> there is no attempt to determine which of the N disks is bad and rewrite
> its contents from the other devices (nor, as I understand it, does it
> clearly say which drive gave the error, so even failing it out and
> resyncing it is hard).

With redundant raid (and that doesn't include a two-disk, or even 
three-disk mirror), it SHOULD recalculate the failed block. If it 
doesn't bother even though it can, I'd call that a bug in scrub. What I 
thought happened was that it reads a stripe direct from disk, and if 
that failed it read the same stripe via the raid code, to get the raid 
error correction to fire, and then it rewrote the stripe.

What would be a nice touch: since we have a massive timeout for
non-SCT drives, if the scrub has to wait more than, say, 10 seconds for
a read to succeed, it assumes the block is failing and rewrites it.
Actually, scrub that (groan... :-) - if the drive takes longer than 1/3
of the timeout to respond, then the scrub assumes it's dodgy and
rewrites it.
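That heuristic is simple enough to sketch - hypothetical Python;
`should_rewrite` and the 30-second default timeout are made up for
illustration:

```python
# Hypothetical scrub heuristic: a read that takes longer than a third of
# the drive timeout suggests the firmware is retrying hard, so treat the
# sector as marginal and rewrite it from redundancy.

def should_rewrite(read_seconds, drive_timeout_seconds=30.0):
    """Flag a sector for rewrite if the drive struggled to return it."""
    return read_seconds > drive_timeout_seconds / 3

print(should_rewrite(0.02))   # False: a normal read
print(should_rewrite(12.0))   # True: the drive had to retry for a while
```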
> 
> If there was a way to get md to *rewrite* everything during scrub,
> rather than just checking, this might help (in addition to letting the
> drive refresh the magnetization of absolutely everything). "repair" mode
> appears to do no writes until an error is found, whereupon (on RAID 6)
> it proceeds to make a "repair" that is more likely than not to overwrite
> good data with bad. Optionally writing what's already there on non-error
> seems like it might be a worthwhile (and fairly simple) change.
> 
Agreed. But without some heuristic, it's actually going to make a scrub 
much slower, and achieve very little apart from adding unnecessary wear 
to the drive.

Cheers,
Wol


* Re: Fault tolerance with badblocks
  2017-05-08 14:50           ` Nix
  2017-05-08 18:00             ` Anthony Youngman
@ 2017-05-08 19:02             ` Phil Turmel
  2017-05-08 19:52               ` Nix
  2017-05-09  7:37             ` David Brown
  2 siblings, 1 reply; 69+ messages in thread
From: Phil Turmel @ 2017-05-08 19:02 UTC (permalink / raw)
  To: Nix, Wols Lists; +Cc: Ravi (Tom) Hale, linux-raid

On 05/08/2017 10:50 AM, Nix wrote:

> I wonder... scrubbing is not very useful with md, particularly with RAID
> 6, because it does no writes unless something mismatches,

This is wrong.  The purpose of scrubbing is to expose any sectors that
have degraded (as Wol describes) to the point of generating a read
error.  A "check" scrub only writes back to the sectors that report a
URE, giving the drive firmware a chance to fix or relocate the sector.

A check scrub will NOT write on mismatch, just increment the mismatch
counter.  This is the recommended regular scrubbing operation.  You want
to know when mismatches occur.

> If there was a way to get md to *rewrite* everything during scrub,
> rather than just checking, this might help (in addition to letting the
> drive refresh the magnetization of absolutely everything).

This is actually counterproductive.  Rewriting everything may refresh
the magnetism on weakening sectors, but will also prevent the drive from
*finding* weakening sectors that really do need relocation.

Phil


* Re: Fault tolerance with badblocks
  2017-05-08 19:02             ` Phil Turmel
@ 2017-05-08 19:52               ` Nix
  2017-05-08 20:27                 ` Anthony Youngman
  2017-05-08 20:56                 ` Phil Turmel
  0 siblings, 2 replies; 69+ messages in thread
From: Nix @ 2017-05-08 19:52 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Wols Lists, Ravi (Tom) Hale, linux-raid

On 8 May 2017, Phil Turmel verbalised:

> On 05/08/2017 10:50 AM, Nix wrote:
>
>> I wonder... scrubbing is not very useful with md, particularly with RAID
>> 6, because it does no writes unless something mismatches,
>
> This is wrong.  The purpose of scrubbing is to expose any sectors that
> have degraded (as Wol describes) to the point of generating a read
> error.  A "check" scrub only writes back to the sectors that report a
> URE, giving the drive firmware a chance to fix or relocate the sector.
>
> A check scrub will NOT write on mismatch, just increment the mismatch
> counter.  This is the recommended regular scrubbing operation.  You want
> to know when mismatches occur.

And... then what do you do? On RAID-6, it appears the answer is "live
with a high probability of inevitable corruption". That's not very good.
(AIUI, if a check scrub finds a URE, it'll rewrite it, and when in the
common case the drive spares it out and the write succeeds, this will
not be reported as a mismatch: is this right?)

>> If there was a way to get md to *rewrite* everything during scrub,
>> rather than just checking, this might help (in addition to letting the
>> drive refresh the magnetization of absolutely everything).
>
> This is actually counterproductive.  Rewriting everything may refresh
> the magnetism on weakening sectors, but will also prevent the drive from
> *finding* weakening sectors that really do need relocation.

If a sector weakens purely because of neighbouring writes or temperature
or a vibrating housing or something (i.e. not because of actual damage),
so that a rewrite will strengthen it and relocation was never necessary,
surely you've just saved a pointless bit of sector sparing? (I don't
know: I'm not sure what the relative frequency of these things is. Read
and write errors in general are so rare that it's quite possible I'm
worrying about nothing at all. I do know I forgot to scrub my old
hardware RAID array for about three years and nothing bad happened...)

-- 
NULL && (void)


* Re: Fault tolerance with badblocks
  2017-05-08 19:52               ` Nix
@ 2017-05-08 20:27                 ` Anthony Youngman
  2017-05-09  9:53                   ` Nix
  2017-05-09 16:05                   ` Chris Murphy
  2017-05-08 20:56                 ` Phil Turmel
  1 sibling, 2 replies; 69+ messages in thread
From: Anthony Youngman @ 2017-05-08 20:27 UTC (permalink / raw)
  To: Nix, Phil Turmel; +Cc: Ravi (Tom) Hale, linux-raid



On 08/05/17 20:52, Nix wrote:
> On 8 May 2017, Phil Turmel verbalised:
> 
>> On 05/08/2017 10:50 AM, Nix wrote:
>>
>>> I wonder... scrubbing is not very useful with md, particularly with RAID
>>> 6, because it does no writes unless something mismatches,
>>
>> This is wrong.  The purpose of scrubbing is to expose any sectors that
>> have degraded (as Wol describes) to the point of generating a read
>> error.  A "check" scrub only writes back to the sectors that report a
>> URE, giving the drive firmware a chance to fix or relocate the sector.
>>
>> A check scrub will NOT write on mismatch, just increment the mismatch
>> counter.  This is the recommended regular scrubbing operation.  You want
>> to know when mismatches occur.
> 
> And... then what do you do? On RAID-6, it appears the answer is "live
> with a high probability of inevitable corruption". That's not very good.
> (AIUI, if a check scrub finds a URE, it'll rewrite it, and when in the
> common case the drive spares it out and the write succeeds, this will
> not be reported as a mismatch: is this right?)

I think you're misunderstanding RAID here. IF the drive says "I can't 
read this block", the RAID reconstructs the block, and rewrites it. No 
corruption.

If the scrub finds a mismatch, then the drives are reporting 
"everything's fine here". Something's gone wrong, but the question is 
what? If you've got a four-drive raid that reports a mismatch, how do 
you know which of the four drives is corrupt? Doing an auto-correct here 
risks doing even more damage. (I think a raid-6 could recover, but 
raid-5 is toast ...)

And seeing as drives are pretty much guaranteed (unless something's gone 
BADLY wrong) to either (a) accurately return the data written, or (b) 
return a read error, a data mismatch indicates something seriously 
wrong that has NOTHING to do with the drives.

<snip>
> 
> If a sector weakens purely because of neighbouring writes or temperature
> or a vibrating housing or something (i.e. not because of actual damage),
> so that a rewrite will strengthen it and relocation was never necessary,
> surely you've just saved a pointless bit of sector sparing? (I don't
> know: I'm not sure what the relative frequency of these things is. Read
> and write errors in general are so rare that it's quite possible I'm
> worrying about nothing at all. I do know I forgot to scrub my old
> hardware RAID array for about three years and nothing bad happened...)
> 
Yes, you have saved a sector sparing. But note that a consumer 3TB drive 
can return, on average, one error for every three end-to-end reads and 
still be considered "within spec", ie "not faulty", by the manufacturer. 
And that's a *brand* *new* drive. That's why building a large array from 
consumer drives is a stupid idea - an array of 4 x 3TB drives that are 
all *within* *spec* must expect to handle at least one error on every 
scrub.
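(For anyone who wants to check that figure, here's the back-of-envelope
arithmetic, assuming the commonly quoted consumer spec of one
unrecoverable read error per 1e14 bits read:)

```python
# Back-of-envelope URE arithmetic. Assumption: consumer drives are
# typically specced at 1 unrecoverable read error per 1e14 bits read.
drive_bytes = 3e12                  # 3 TB drive
bits_per_pass = drive_bytes * 8     # 2.4e13 bits per end-to-end read
ure_spec_bits = 1e14                # bits read per expected error

passes_per_error = ure_spec_bits / bits_per_pass
# about 4 end-to-end passes per expected error - the same ballpark as
# the "every 3 reads" figure above
```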

Okay - most drives are actually way over spec, and could probably be 
read end-to-end many times without a single error, but you'd be a fool 
to gamble on it.

Cheers,
Wol


* Re: Fault tolerance with badblocks
  2017-05-08 19:52               ` Nix
  2017-05-08 20:27                 ` Anthony Youngman
@ 2017-05-08 20:56                 ` Phil Turmel
  2017-05-09 10:28                   ` Nix
  1 sibling, 1 reply; 69+ messages in thread
From: Phil Turmel @ 2017-05-08 20:56 UTC (permalink / raw)
  To: Nix; +Cc: Wols Lists, Ravi (Tom) Hale, linux-raid

On 05/08/2017 03:52 PM, Nix wrote:
> On 8 May 2017, Phil Turmel verbalised:
> 
>> On 05/08/2017 10:50 AM, Nix wrote:

> And... then what do you do? On RAID-6, it appears the answer is "live
> with a high probability of inevitable corruption".

No, you investigate the quality of your data and the integrity of the
rest of the system, as something *other* than a drive problem caused the
mismatch.  (Swap is a known exception, though.)

> That's not very good.
> (AIUI, if a check scrub finds a URE, it'll rewrite it, and when in the
> common case the drive spares it out and the write succeeds, this will
> not be reported as a mismatch: is this right?)

This is also wrong, because you are assuming sparing-out is the common
case.  A read error does not automatically trigger relocation.  It
triggers *verification* of the next *write*.  In young drives,
successful rewrite in place is the common case.  As the drive ages,
rewrites will begin relocating because there really is a new problem at
that spot, not simple thermal/magnetic decay.

But keep in mind that the firmware of the drive will start verification
of a sector only if it gets a *read* error.  Such sectors get marked as
"pending" relocations until they are written again.  If that write
verifies correct, the "pending" status simply goes away.  Ordinary
writes to presumed-ok sectors are *not* verified.  (There'd be a huge
difference between read and write speeds on rotating media if they were.)

{ Drive self tests might do some pre-emptive rewriting of marginal
sectors -- it's not something drive manufacturers are documenting.  But
a drive self-test cannot fix an unreadable sector -- it doesn't know
what to write there. }

>> This is actually counterproductive.  Rewriting everything may refresh
>> the magnetism on weakening sectors, but will also prevent the drive from
>> *finding* weakening sectors that really do need relocation.
> 
> If a sector weakens purely because of neighbouring writes or temperature
> or a vibrating housing or something (i.e. not because of actual damage),
> so that a rewrite will strengthen it and relocation was never necessary,
> surely you've just saved a pointless bit of sector sparing? (I don't
> know: I'm not sure what the relative frequency of these things is. Read
> and write errors in general are so rare that it's quite possible I'm
> worrying about nothing at all. I do know I forgot to scrub my old
> hardware RAID array for about three years and nothing bad happened...)

Drives that are in applications that get *read* pretty often don't need
much if any scrubbing -- the application itself will expose problem
sectors.  Hobbyists and home media servers can go months with specific
files unread, so developing problems can hit in clusters.  Regular
scrubbing will catch these problems before they take your array down.

And you can't compare hardware array behavior to MD -- they have their
own algorithms to take care of attached disks without OS intervention.

Phil


* Re: Fault tolerance with badblocks
  2017-05-08 14:50           ` Nix
  2017-05-08 18:00             ` Anthony Youngman
  2017-05-08 19:02             ` Phil Turmel
@ 2017-05-09  7:37             ` David Brown
  2017-05-09  9:58               ` Nix
  2 siblings, 1 reply; 69+ messages in thread
From: David Brown @ 2017-05-09  7:37 UTC (permalink / raw)
  To: Nix, Wols Lists; +Cc: Ravi (Tom) Hale, linux-raid

On 08/05/17 16:50, Nix wrote:

> 
> I wonder... scrubbing is not very useful with md, particularly with RAID
> 6, because it does no writes unless something mismatches, and on failure
> there is no attempt to determine which of the N disks is bad and rewrite
> its contents from the other devices (nor, as I understand it, does it
> clearly say which drive gave the error, so even failing it out and
> resyncing it is hard).
> 

Please read Neil Brown's article on this: "Smart or simple RAID
recovery?" <http://neil.brown.name/blog/20100211050355>

> If there was a way to get md to *rewrite* everything during scrub,
> rather than just checking, this might help (in addition to letting the
> drive refresh the magnetization of absolutely everything). "repair" mode
> appears to do no writes until an error is found, whereupon (on RAID 6)
> it proceeds to make a "repair" that is more likely than not to overwrite
> good data with bad. Optionally writing what's already there on non-error
> seems like it might be a worthwhile (and fairly simple) change.
> 

Scrubbing /does/ rewrite disk blocks - when necessary.  It does not do
it explicitly, but the disks handle this themselves.

To the processor, a disk block is 4K of data.  But to the disk and its
controllers, it is 4K plus a sizeable amount of error checking and
correcting bits.  Some are spread out within the block, some are
collected together at the end of the block.  The ECC system can handle a
large number of failed bits, either in lumps caused by a physical defect
on the disk surface, or spread out due to the slow decay of the magnetic
orientation, or hits by cosmic rays.

When the disk is asked to read a block, it pulls up the data and the ECC
bits, uses them to check and re-construct the 4K of data, and keeps a
measure of how many errors were corrected.  On modern high-capacity
drives, it is normal that some errors are corrected on a read.  But if
more than a certain level occur, then the firmware will trigger a
re-write automatically to the same sector.  This will then be re-read.
If the error rate is low, fine.  If it is high, then the sector will be
remapped by the disk.

So simply /reading/ the data, as far as the processor is concerned, will
cause re-writes as and when needed.



* Re: Fault tolerance with badblocks
  2017-05-08 20:27                 ` Anthony Youngman
@ 2017-05-09  9:53                   ` Nix
  2017-05-09 11:09                     ` David Brown
  2017-05-09 21:32                     ` NeilBrown
  2017-05-09 16:05                   ` Chris Murphy
  1 sibling, 2 replies; 69+ messages in thread
From: Nix @ 2017-05-09  9:53 UTC (permalink / raw)
  To: Anthony Youngman; +Cc: Phil Turmel, Ravi (Tom) Hale, linux-raid

On 8 May 2017, Anthony Youngman told this:

> If the scrub finds a mismatch, then the drives are reporting
> "everything's fine here". Something's gone wrong, but the question is
> what? If you've got a four-drive raid that reports a mismatch, how do
> you know which of the four drives is corrupt? Doing an auto-correct
> here risks doing even more damage. (I think a raid-6 could recover,
> but raid-5 is toast ...)

With a RAID-5 you are screwed: you can reconstruct the parity but cannot
tell if it was actually right. You can make things consistent, but not
correct.

But with a RAID-6 you *do* have enough data to make things correct, with
precisely the same probability as recovery of a RAID-5 "drive" of length
a single sector. It seems wrong that not only does md not do this but
doesn't even tell you which drive made the mistake so you could do the
millions-of-times-slower process of a manual fail and readdition of the
drive (or, if you suspect it of being wholly buggered, a manual fail and
replacement).
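To make that concrete, here's a toy sketch of the P/Q syndrome
arithmetic: one byte per "drive", GF(2^8) with the 0x11d polynomial and
generator 2 (the same field the kernel's raid6 code uses). Function
names are mine, not md's.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r

# Log/antilog tables for the generator g = 2 (primitive in this field).
EXP = [0] * 255
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x = gf_mul(_x, 2)

def syndromes(data):
    """P (plain XOR) and Q (XOR of g^i * d_i) over the data bytes."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(EXP[i], d)
    return p, q

def locate_corrupt_drive(data, p, q):
    """Given stored P and Q and possibly corrupt data bytes, return the
    index of a single corrupted data 'drive', or None if consistent.
    If d_z was corrupted, the P syndrome is (d_z ^ d_z') and the Q
    syndrome is g^z * (d_z ^ d_z'), so z = log(Qs) - log(Ps)."""
    cp, cq = syndromes(data)
    ps, qs = cp ^ p, cq ^ q
    if ps == 0 and qs == 0:
        return None                  # everything consistent
    if ps == 0 or qs == 0:
        raise ValueError("P or Q itself corrupt, or more than one error")
    return (LOG[qs] - LOG[ps]) % 255
```

The real raid6 code does this per byte position across whole stripes,
but the locating step is exactly this discrete-log division.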

> And seeing as drives are pretty much guaranteed (unless something's
> gone BADLY wrong) to either (a) accurately return the data written, or
> (b) return a read error, that means a data mismatch indicates
> something is seriously wrong that is NOTHING to do with the drives.

This turns out not to be the case. See this ten-year-old paper:
<https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf>.
Five weeks of doing 2GiB writes on 3000 nodes once every two hours
found, they estimated, 50 errors possibly attributable to disk problems
(sector- or page-size regions of corrupted data) on 1/30th of their
nodes. This is *not* rare and it is hard to imagine that 1/30th of disks
used by CERN deserve discarding. It is better to assume that drives
misdirect writes now and then, and to provide a means of recovering from
them that does not take days of panic. RAID-6 gives you that means: md
should use it.

The page-sized regions of corrupted data were probably software -- but
the sector-sized regions were just as likely the drives, possibly
misdirected writes or misdirected reads.

Neil decided not to do any repair work in this case on the grounds that
if the drive is misdirecting one write it might misdirect the repair as
well -- but if the repair is *consistently* misdirected, that seems
relatively harmless (you had corruption before, you have it now, it just
moved), and if it was a sporadic error, the repair is worthwhile. The
only case in which a repair should not be attempted is if the drive is
misdirecting all or most writes -- but in that case, by the time you do
a scrub, on all but the quietest arrays you'll see millions of
mismatches and it'll be obvious that it's time to throw the drive out.
(Assuming md told you which drive it was.)

>> If a sector weakens purely because of neighbouring writes or temperature
>> or a vibrating housing or something (i.e. not because of actual damage),
>> so that a rewrite will strengthen it and relocation was never necessary,
>> surely you've just saved a pointless bit of sector sparing? (I don't
>> know: I'm not sure what the relative frequency of these things is. Read
>> and write errors in general are so rare that it's quite possible I'm
>> worrying about nothing at all. I do know I forgot to scrub my old
>> hardware RAID array for about three years and nothing bad happened...)
>>
> Yes you have saved a sector sparing. Note that a consumer 3TB drive
> can return, on average, one error every time it's read from end to end
> 3 times, and still be considered "within spec" ie "not faulty" by the

Yeah, that's why RAID-6 is a good idea. :)

> manufacturer. And that's a *brand* *new* drive. That's why building a
> large array using consumer drives is a stupid idea - 4 x 3TB drives
> and a *within* *spec* array must expect to handle at least one error
> every scrub.

That's just one reason why. The lack of control over URE timeouts is
just as bad.

> Okay - most drives are actually way over spec, and could probably be
> read end-to-end many times without a single error, but you'd be a fool
> to gamble on it.

I'm trying *not* to gamble on it -- but I don't want to end up in the
current situation we seem to have with md6, which is "oh, you have a
mismatch, it's not going away, but we're neither going to tell you where
it is nor what disk it's on nor repair it ourselves, even though we
could, just to make it as hard as possible for you to repair the problem
or even tell if it's a consistent one" (is the single mismatch an
expected, spurious read error because of the volume of data you're
reading, or one that's consistent and needs repair? All mismatch_cnt
tells you is that there's a mismatch).

-- 
NULL && (void)


* Re: Fault tolerance with badblocks
  2017-05-09  7:37             ` David Brown
@ 2017-05-09  9:58               ` Nix
  2017-05-09 10:28                 ` Brad Campbell
  0 siblings, 1 reply; 69+ messages in thread
From: Nix @ 2017-05-09  9:58 UTC (permalink / raw)
  To: David Brown; +Cc: Wols Lists, Ravi (Tom) Hale, linux-raid

On 9 May 2017, David Brown spake thusly:

> On 08/05/17 16:50, Nix wrote:
>
>> I wonder... scrubbing is not very useful with md, particularly with RAID
>> 6, because it does no writes unless something mismatches, and on failure
>> there is no attempt to determine which of the N disks is bad and rewrite
>> its contents from the other devices (nor, as I understand it, does it
>> clearly say which drive gave the error, so even failing it out and
>> resyncing it is hard).
>
> Please read Neil Brown's article on this: "Smart or simple RAID
> recovery?" <http://neil.brown.name/blog/20100211050355>

I have. The simple recovery is too simple. So you have a 40TiB RAID-6
array, say, and mismatch_cnt is consistently >0, but a low value, on
scrub. What can you do? The drive is probably not faulty or you'd have
many more mismatches from persistent misdirected reads or writes. md
doesn't repair the corruption, even though on RAID-6 it could. It
doesn't tell you which disk disagreed so you can fail it out. It doesn't
even tell you where the disagreement was so you can try to rebuild it by
hand. What on earth are you supposed to do in this case? Wipe the entire
array and restore from backup? For a *single* sector?

Right now I'm doing scrubs and ignoring the mismatch_cnt, because all it
can do is increase my worry level to no gain at all. I could just as
well do a dd over /dev/md*. It would have the same effect, only without
md's progress feedback and bandwidth throttling. (And what use is
progress feedback when you aren't told where errors are found?!)
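(For reference, the plumbing being complained about here is md's sysfs
scrub interface - a sketch, assuming the array is md0:)

```shell
# Kick off a read-only consistency check of the whole array.
echo check > /sys/block/md0/md/sync_action

# Progress, with md's own bandwidth throttling applied.
cat /proc/mdstat

# When the check finishes: count of mismatched sectors found.
# A nonzero value names neither the disk nor the offset - which is
# exactly the complaint above.
cat /sys/block/md0/md/mismatch_cnt

# "repair" regenerates parity from the data blocks (RAID-5/6); it
# does not vote on which block was actually wrong.
echo repair > /sys/block/md0/md/sync_action
```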

> When the disk is asked to read a block, it pulls up the data and the ECC
> bits, and uses this to check and re-construct the 4K of data, and a
> measure of how many errors were corrected.  On modern high-capacity
> drives, it is normal that some errors are corrected on a read.  But if
> more than a certain level occur, then the firmware will trigger a
> re-write automatically to the same sector.  This will then be re-read.
> If the error rate is low, fine.  If it is high, then the sector will be
> remapped by the disk.
>
> So simply /reading/ the data, as far as the processor is concerned, will
> cause re-writes as and when needed.

Last time I asked a disk manufacturer about this, they said oh no we
never correct on read, we can't: if we needed to correct on read, the
data would already be unreadable: you have to trigger a write to get
sparing. Nice to see the drive firmware has improved in the last few
years... but one wonders how many disks actually *do* this. It's hard to
tell because sector sparing is so quiet: it's not always even reflected
in the SMART data, AIUI.


* Re: Fault tolerance with badblocks
  2017-05-08 18:00             ` Anthony Youngman
@ 2017-05-09 10:11               ` David Brown
  2017-05-09 10:18               ` Nix
  1 sibling, 0 replies; 69+ messages in thread
From: David Brown @ 2017-05-09 10:11 UTC (permalink / raw)
  To: Anthony Youngman, Nix; +Cc: Ravi (Tom) Hale, linux-raid

On 08/05/17 20:00, Anthony Youngman wrote:

> With redundant raid (and that doesn't include a two-disk, or even
> three-disk mirror), it SHOULD recalculate the failed block. If it
> doesn't bother even though it can, I'd call that a bug in scrub. 

Please read:
<http://neil.brown.name/blog/20100211050355>

> What I
> thought happened was that it reads a stripe direct from disk, and if
> that failed it read the same stripe via the raid code, to get the raid
> error correction to fire, and then it rewrote the stripe.

That /is/ what happens.

As I mentioned in another reply, /reading/ is enough to trigger a
re-write on the disk if significant /correctable/ errors are discovered
by the disk's firmware.  It is extremely rare that the raid level will
see an error (see the linked article by Neil Brown) - usually, the raid
level sees a missing block because the disk firmware could not read the
block correctly.  In such cases, the raid software will write the
correct data back to the disk at the same logical block, and the disk
firmware will re-map it to a different block.

> 
> What would be a nice touch, is that if we have a massive timeout for
> non-SCT drives, if the scrub has to wait more than, say, 10 seconds for
> a read to succeed it then assumes the block is failing and rewrites it.

I don't think the raid level can do that - it must wait for the drive to
finish handling the read request, or drop the drive entirely.

If the disk takes a long time to read a block, then it will either fail
and mark the block bad, or it will get the data off the disk and then
automatically re-write the data to a re-mapped block.  The scrub can
therefore handle it like any other read.

> Actually, scrub that (groan... :-) - if the drive takes longer than 1/3
> of the timeout to respond, then the scrub assumes it's dodgy and
> rewrites it.
>>
>> If there was a way to get md to *rewrite* everything during scrub,
>> rather than just checking, this might help (in addition to letting the
>> drive refresh the magnetization of absolutely everything). "repair" mode
>> appears to do no writes until an error is found, whereupon (on RAID 6)
>> it proceeds to make a "repair" that is more likely than not to overwrite
>> good data with bad. Optionally writing what's already there on non-error
>> seems like it might be a worthwhile (and fairly simple) change.
>>
> Agreed. But without some heuristic, it's actually going to make a scrub
> much slower, and achieve very little apart from adding unnecessary wear
> to the drive.
> 
> Cheers,
> Wol



* Re: Fault tolerance with badblocks
  2017-05-08 18:00             ` Anthony Youngman
  2017-05-09 10:11               ` David Brown
@ 2017-05-09 10:18               ` Nix
  1 sibling, 0 replies; 69+ messages in thread
From: Nix @ 2017-05-09 10:18 UTC (permalink / raw)
  To: Anthony Youngman; +Cc: Ravi (Tom) Hale, linux-raid

On 8 May 2017, Anthony Youngman verbalised:

> On 08/05/17 15:50, Nix wrote:
>> I wonder... scrubbing is not very useful with md, particularly with RAID
>> 6, because it does no writes unless something mismatches, and on failure
>> there is no attempt to determine which of the N disks is bad and rewrite
>> its contents from the other devices (nor, as I understand it, does it
>> clearly say which drive gave the error, so even failing it out and
>> resyncing it is hard).
>
> With redundant raid (and that doesn't include a two-disk, or even
> three-disk mirror), it SHOULD recalculate the failed block. If it
> doesn't bother even though it can, I'd call that a bug in scrub. What

It didn't, once upon a time (in 2010), and as far as I can tell from the
code it still doesn't.

> I thought happened was that it reads a stripe direct from disk, and if
> that failed it read the same stripe via the raid code, to get the raid
> error correction to fire, and then it rewrote the stripe.

There's *failed*, which does trigger a rewrite, and there's 'we got a
mismatch', which on RAID-6 arguably should trigger a rewrite but instead
just tells you there was a mismatch, but not where, nor even on what
disk.

> What would be a nice touch, is that if we have a massive timeout for
> non-SCT drives, if the scrub has to wait more than, say, 10 seconds
> for a read to succeed it then assumes the block is failing and
> rewrites it.

What tends to happen is that the drive gets reset, which from md's
perspective is the drive vanishing and reappearing again. I don't see
any sane way for md to interpret *that* as anything but a possibly
rather major failure that should be reacted to by failing the drive out.
I mean, all it knows is there was a timeout: for all it knows there are
electrical problems there or something. The drive doesn't say (and
doesn't get a chance to say, because we reset it rather than wait five
minutes for it to tell us what's up).

>              Actually, scrub that (groan... :-) - if the drive takes
> longer than 1/3 of the timeout to respond, then the scrub assumes it's
> dodgy and rewrites it.

It's hard to rewrite anything on a drive that's too busy failing a read
to do anything else.

-- 
NULL && (void)


* Re: Fault tolerance with badblocks
  2017-05-09  9:58               ` Nix
@ 2017-05-09 10:28                 ` Brad Campbell
  2017-05-09 10:40                   ` Nix
  0 siblings, 1 reply; 69+ messages in thread
From: Brad Campbell @ 2017-05-09 10:28 UTC (permalink / raw)
  To: Nix, David Brown; +Cc: Wols Lists, Ravi (Tom) Hale, linux-raid

On 09/05/17 17:58, Nix wrote:
>  md doesn't repair the corruption, even though on RAID-6 it could.

Patches are *always* welcome.

> but one wonders how many disks actually *do* this. It's hard to
> tell because sector sparing is so quiet: it's not always even reflected
> in the SMART data, AIUI.

Decent SAS drives do it routinely *and* they tell you in the SMART data 
how long it has been since the last scrub, how long it is until the next 
scrub, and how many errors they have silently corrected over the drive's 
life. You get what you pay for.

Brad



* Re: Fault tolerance with badblocks
  2017-05-08 20:56                 ` Phil Turmel
@ 2017-05-09 10:28                   ` Nix
  2017-05-09 10:50                     ` Reindl Harald
  0 siblings, 1 reply; 69+ messages in thread
From: Nix @ 2017-05-09 10:28 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Wols Lists, Ravi (Tom) Hale, linux-raid

On 8 May 2017, Phil Turmel said:

> On 05/08/2017 03:52 PM, Nix wrote:
>> And... then what do you do? On RAID-6, it appears the answer is "live
>> with a high probability of inevitable corruption".
>
> No, you investigate the quality of your data and the integrity of the
> rest of the system, as something *other* than a drive problem caused the
> mismatch.  (Swap is a known exception, though.)

Yeah, I'm going to "rely" on the fact that this machine has heaps of
memory and won't be swapping much when it does a RAID scrub. :)

But "you investigate the quality of your data"... so now, on a single
mismatch that won't go away, I have to compare all my data with backups,
taking countless hours and emitting heaps of spurious errors because no
backup is ever quite up to date? Those backups *live* on hard drives, so
it has exactly the same chance of spurious disk-layer errors as the
thing that preceded it (quite possibly higher).

Honestly, scrubs are looking less and less desirable the more I talk
about them. Massive worry inducers that don't actually spot problems in
any meaningful sense (not even at the level of "there is a problem on
this disk", just "there is a problem on this array").

>> That's not very good.
>> (AIUI, if a check scrub finds a URE, it'll rewrite it, and when in the
>> common case the drive spares it out and the write succeeds, this will
>> not be reported as a mismatch: is this right?)
>
> This is also wrong, because you are assuming sparing-out is the common
> case.  A read error does not automatically trigger relocation.  It
> triggers *verification* of the next *write*.  In young drives,

So I guess we only need to worry about mismatches if they don't go away
and are persistently in the same place on the same drive. (Only you
can't tell what place that is, or what drive that is, because md doesn't
tell you. I'm really tempted to fix *that* at least, a printk() or
something.)

> { Drive self tests might do some pre-emptive rewriting of marginal
> sectors -- it's not something drive manufacturers are documenting.  But
> a drive self-test cannot fix an unreadable sector -- it doesn't know
> what to write there. }

Agreed.

>>> This is actually counterproductive.  Rewriting everything may refresh
>>> the magnetism on weakening sectors, but will also prevent the drive from
>>> *finding* weakening sectors that really do need relocation.
>> 
>> If a sector weakens purely because of neighbouring writes or temperature
>> or a vibrating housing or something (i.e. not because of actual damage),
>> so that a rewrite will strengthen it and relocation was never necessary,
>> surely you've just saved a pointless bit of sector sparing? (I don't
>> know: I'm not sure what the relative frequency of these things is. Read
>> and write errors in general are so rare that it's quite possible I'm
>> worrying about nothing at all. I do know I forgot to scrub my old
>> hardware RAID array for about three years and nothing bad happened...)
>
> Drives that are in applications that get *read* pretty often don't need
> much if any scrubbing -- the application itself will expose problem
> sectors.  Hobbyists and home media servers can go months with specific
> files unread, so developing problems can hit in clusters.  Regular
> scrubbing will catch these problems before they take your array down.

Yeah, and I have plenty of archival data on this array -- it's the first
one I've ever had that's big enough to consider using for that as well
as for frequently-used stuff whose integrity I care about. (But even the
frequently-read stuff is bcached, so even that is in effect archival
much of the time, from the perspective of its read.)

> And you can't compare hardware array behavior to MD -- they have their
> own algorithms to take care of attached disks without OS intervention.

I don't see what the difference is between a hardware array controller
(with its own noddy OS, barely-maintained software, creaking processor,
and not-very-big battery-backed RAM) and md (with a decent OS, much
faster processor, decent software, and often masses of RAM and a journal
on SSD), except that the md array will be far faster, and if anything
goes wrong you have a much higher chance of actually getting your data
back with md. :)

The days of saying "hardware arrays are just different/better, md cannot
compete with them" are many years in the past. People are *replacing*
hardware arrays with md these days because the hardware arrays are
*worse* on almost every metric. If hardware arrays have magic recovery
algorithms that md and/or the Linux block layer lack, the question now
is "why not?", not "oh, we cannot compare".


* Re: Fault tolerance with badblocks
  2017-05-09 10:28                 ` Brad Campbell
@ 2017-05-09 10:40                   ` Nix
  2017-05-09 12:15                     ` Tim Small
  0 siblings, 1 reply; 69+ messages in thread
From: Nix @ 2017-05-09 10:40 UTC (permalink / raw)
  To: Brad Campbell; +Cc: David Brown, Wols Lists, Ravi (Tom) Hale, linux-raid

On 9 May 2017, Brad Campbell stated:

> On 09/05/17 17:58, Nix wrote:
>>  md doesn't repair the corruption, even though on RAID-6 it could.
>
> Patches are *always* welcome.

Oh good. I might well look at that.

>> but one wonders how many disks actually *do* this. It's hard to
>> tell because sector sparing is so quiet: it's not always even reflected
>> in the SMART data, AIUI.
>
> Decent SAS drvies do it routinely *and* they tell you in the SMART
> data how long it has been since the last scrub, how long it is until
> the next scrub and how many errors it has silently corrected over the
> drive life. You get what you pay for.

Enterprise SATA drives appear similar except that they don't do the
scrubbing automatically: you have to trigger a SMART self-test. (I'm
wondering if that's enough, and perhaps I can ignore RAID scrubbing
entirely, except that if something *does* go wrong I won't know.)

Of course I haven't yet owned a drive that has ever deigned to give a
nonzero sector-sparing value in any of its SMART info, and I've been
using allegedly-enterprise drives (first SCSI, then SATA) for about
fifteen years now. I've had disk failures without warning, and
non-failed disks with both read and write errors that would not go away,
but that SMART reallocation value just stayed stuck at zero through all
of it. I'm wondering if smartctl is even reading the right field, but
it's hard to imagine how it couldn't be...
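(For the record, the SMART bits in question can be poked with smartctl;
the attribute numbering below is the usual convention, though vendors
vary in what they actually report:)

```shell
# Start the drive's own long surface scan (runs in the background).
smartctl -t long /dev/sda

# Self-test progress and past results.
smartctl -l selftest /dev/sda

# Attribute 5 (Reallocated_Sector_Ct) is the sector-sparing counter
# discussed above; 197 (Current_Pending_Sector) counts sectors
# waiting for a verified rewrite.
smartctl -A /dev/sda
```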


* Re: Fault tolerance with badblocks
  2017-05-09 10:28                   ` Nix
@ 2017-05-09 10:50                     ` Reindl Harald
  2017-05-09 11:15                       ` Nix
  0 siblings, 1 reply; 69+ messages in thread
From: Reindl Harald @ 2017-05-09 10:50 UTC (permalink / raw)
  To: Nix, Phil Turmel; +Cc: Wols Lists, Ravi (Tom) Hale, linux-raid



Am 09.05.2017 um 12:28 schrieb Nix:
> Honestly, scrubs are looking less and less desirable the more I talk
> about them. Massive worry inducers that don't actually spot problems in
> any meaningful sense (not even at the level of "there is a problem on
> this disk", just "there is a problem on this array")

that is your opinion

my experience over years of using md arrays is that *every* *time* smartd 
triggered an alert mail that a drive would fail soon, it happened while a 
scrub was running, so scrubs let you replace failing drives as soon as 
possible


* Re: Fault tolerance with badblocks
  2017-05-09  9:53                   ` Nix
@ 2017-05-09 11:09                     ` David Brown
  2017-05-09 11:27                       ` Nix
  2017-05-09 21:32                     ` NeilBrown
  1 sibling, 1 reply; 69+ messages in thread
From: David Brown @ 2017-05-09 11:09 UTC (permalink / raw)
  To: Nix, Anthony Youngman; +Cc: Phil Turmel, Ravi (Tom) Hale, linux-raid

On 09/05/17 11:53, Nix wrote:
> On 8 May 2017, Anthony Youngman told this:
> 
>> If the scrub finds a mismatch, then the drives are reporting
>> "everything's fine here". Something's gone wrong, but the question is
>> what? If you've got a four-drive raid that reports a mismatch, how do
>> you know which of the four drives is corrupt? Doing an auto-correct
>> here risks doing even more damage. (I think a raid-6 could recover,
>> but raid-5 is toast ...)
> 
> With a RAID-5 you are screwed: you can reconstruct the parity but cannot
> tell if it was actually right. You can make things consistent, but not
> correct.
> 
> But with a RAID-6 you *do* have enough data to make things correct, with
> precisely the same probability as recovery of a RAID-5 "drive" of length
> a single sector. 

No, you don't have enough data to make things correct.  You /might/ have
enough data to guess at what /might/ be right, but the guess might also
be wrong.  And you don't have enough data to have the
slightest idea about the probabilities.  And you don't have enough data
to know if "fixing" it will help overall, or make things worse if you
accidentally "fix" the wrong block.  (See the link I gave on other posts
for details.)

> It seems wrong that not only does md not do this but
> doesn't even tell you which drive made the mistake so you could do the
> millions-of-times-slower process of a manual fail and readdition of the
> drive (or, if you suspect it of being wholly buggered, a manual fail and
> replacement).
> 
>> And seeing as drives are pretty much guaranteed (unless something's
>> gone BADLY wrong) to either (a) accurately return the data written, or
>> (b) return a read error, that means a data mismatch indicates
>> something is seriously wrong that is NOTHING to do with the drives.
> 
> This turns out not to be the case. See this ten-year-old paper:
> <https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf>.
> Five weeks of doing 2GiB writes on 3000 nodes once every two hours
> found, they estimated, 50 errors possibly attributable to disk problems
> (sector- or page-size regions of corrupted data) on 1/30th of their
> nodes. This is *not* rare and it is hard to imagine that 1/30th of disks
> used by CERN deserve discarding. It is better to assume that drives
> misdirect writes now and then, and to provide a means of recovering from
> them that does not take days of panic. RAID-6 gives you that means: md
> should use it.

RAID-6 does not help here.  You have to understand the types of errors
that can occur, the reasons for them, the possibilities for detection,
the possibilities for recovery, and what the different layers in the
system can do about them.

RAID (1/5/6) will let you recover from one or more known failed reads,
on the assumption that the driver firmware is correct, memories have no
errors, buses have no errors, block writes are atomic, write ordering
matches the flush commands, block reads are either correct or marked as
failed, etc.

RAID will /not/ let you reliably detect or correct other sorts of
errors.  It is designed to cheaply and simply reduce the risk of a
certain class of possible errors - it is not a magic method of stopping
all errors.  Similarly, the drive firmware works under certain
assumptions to greatly reduce other sorts of errors (those local to the
block), but not everything.  And ECC memory, PCI bus CRCs, and other
such things reduce the risk of other kinds of error.

If you need more error checking or correction, you need different
mechanisms.  For example, BTRFS and ZFS will do checksumming on the
filesystem level.  They can be combined with raid/duplication to allow
correction on checksum error.  And they can usefully build on top of a
normal md raid layer, or use their own raid (with its pros and cons).
Or you can have multiple servers and also track md5 sums of files, with
cross-server scrubbing of the data.  There are lots of possibilities,
depending on what you want to get.
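As a rough sketch of that last file-checksum approach (using sha256 rather
than md5, and an invented manifest layout - this is an illustration, not a
real tool):

```python
import hashlib
import os

def checksum_tree(root):
    """Map relative path -> SHA-256 hex digest for every file under root."""
    sums = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # hash in 1 MiB chunks so large files don't need to fit in RAM
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            sums[os.path.relpath(path, root)] = h.hexdigest()
    return sums

def scrub(local, remote):
    """Compare two manifests; return the paths whose contents disagree."""
    return sorted(p for p in local.keys() & remote.keys()
                  if local[p] != remote[p])
```

Run checksum_tree() on each server, ship the manifests around, and any path
that scrub() returns has silently diverged somewhere below the filesystem.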

What does /not/ work, however, is trying to squeeze magic capabilities
out of existing layers in the system, or expecting more out of them than
they can give.




* Re: Fault tolerance with badblocks
  2017-05-09 10:50                     ` Reindl Harald
@ 2017-05-09 11:15                       ` Nix
  2017-05-09 11:48                         ` Reindl Harald
  0 siblings, 1 reply; 69+ messages in thread
From: Nix @ 2017-05-09 11:15 UTC (permalink / raw)
  To: Reindl Harald; +Cc: Phil Turmel, Wols Lists, Ravi (Tom) Hale, linux-raid

On 9 May 2017, Reindl Harald said:

> Am 09.05.2017 um 12:28 schrieb Nix:
>> Honestly, scrubs are looking less and less desirable the more I talk
>> about them. Massive worry inducers that don't actually spot problems in
>> any meaningful sense (not even at the level of "there is a problem on
>> this disk", just "there is a problem on this array")
>
> that is your opinion
>
> my experience over years of using md-arrays is that *every time* smartd triggered an alert mail that a drive would fail soon, it happened
> while a scrub was running - so you can replace drives as soon as possible

What, it triggered a SMART warning while a scrub was running which SMART
long self-tests didn't? That's depressing. You'd think SMART would be
watching for errors while its own tests were running!

(Or were you not running any long self-tests? That's at least as risky
as not scrubbing, IMNSHO.)


* Re: Fault tolerance with badblocks
  2017-05-09 11:09                     ` David Brown
@ 2017-05-09 11:27                       ` Nix
  2017-05-09 11:58                         ` David Brown
  2017-05-09 19:16                         ` Fault tolerance with badblocks Phil Turmel
  0 siblings, 2 replies; 69+ messages in thread
From: Nix @ 2017-05-09 11:27 UTC (permalink / raw)
  To: David Brown; +Cc: Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, linux-raid

On 9 May 2017, David Brown uttered the following:

> On 09/05/17 11:53, Nix wrote:
>> This turns out not to be the case. See this ten-year-old paper:
>> <https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf>.
>> Five weeks of doing 2GiB writes on 3000 nodes once every two hours
>> found, they estimated, 50 errors possibly attributable to disk problems
>> (sector- or page-size regions of corrupted data) on 1/30th of their
>> nodes. This is *not* rare and it is hard to imagine that 1/30th of disks
>> used by CERN deserve discarding. It is better to assume that drives
>> misdirect writes now and then, and to provide a means of recovering from
>> them that does not take days of panic. RAID-6 gives you that means: md
>> should use it.
>
> RAID-6 does not help here.  You have to understand the types of errors
> that can occur, the reasons for them, the possibilities for detection,
> the possibilities for recovery, and what the different layers in the
> system can do about them.
>
> RAID (1/5/6) will let you recover from one or more known failed reads,
> on the assumption that the driver firmware is correct, memories have no
> errors, buses have no errors, block writes are atomic, write ordering
> matches the flush commands, block reads are either correct or marked as
> failed, etc.

I think you're being too pedantic. Many of these things are known not to
be true on real hardware, and at least one of them cannot possibly be
true without a journal (atomic block writes). Nonetheless, the md layer
is quite happy to rebuild after a failed disk even though the write hole
might have torn garbage into your data, on the grounds that it
*probably* did not. If your argument was used everywhere, md would never
have been started because 100% reliability was not guaranteed.

The same, it seems to me, is true of cases in which one drive in a
RAID-6 reports a few mismatched blocks. It is true that you don't know
the cause of the mismatches, but you *do* know which bit of the mismatch
is wrong and what data should be there, subject only to the assumption
that sufficiently few drives have made simultaneous mistakes that
redundancy is preserved. And that's the same assumption RAID >0 makes
all the time anyway!

The only difference in the disk-failure case is that you know that one
drive has failed without needing to ask other drives to be sure. I mean,
yeah, *possibly* in the RAID-6 mismatch case *five* drives have gone
simultaneously wrong in such a way that their syndromes all match and
the one surviving drive is mistakenly misrepaired, but frankly you'd
need to wait for black holes to evaporate of old age before this became
an issue.

(I'm not suggesting repairing RAID-5 mismatches. That's clearly
impossible. You can't even tell what disk is affected. But in the RAID-6
case none of this is impossible, or so it seems to me. You have at least
three and probably four or more drives with consistent syndromes, and
one that is out of whack. You know which one must be wrong -- the
"minority vote" -- and you know what has to be done to make it
consistent with the others again. Why not do it? It's no more risky than
that aspect of a RAID rebuild from a failed disk would be.)
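For what it's worth, the single-bad-block location that "minority vote"
relies on is easy to sketch. This is a toy per-byte model of the RAID-6
math from H. Peter Anvin's raid6 paper (one byte per data drive, GF(2^8)
with the usual 0x11d polynomial, brute-force discrete log) - not md's
actual code:

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8+x^4+x^3+x^2+1 (0x11d)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def syndromes(data):
    """P = XOR of the data bytes; Q = XOR of g^i * D_i with g = 2."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_pow(2, i), d)
    return p, q

def locate_single_error(data, p, q):
    """Index of the one corrupt data byte, None if consistent, -1 if the
    mismatch cannot be explained by a single bad data byte (e.g. bad P/Q)."""
    p2, q2 = syndromes(data)
    dp, dq = p ^ p2, q ^ q2
    if dp == 0 and dq == 0:
        return None
    # solve g^z * dp == dq; g is primitive, so z is unique if it exists
    for z in range(len(data)):
        if gf_mul(gf_pow(2, z), dp) == dq:
            return z
    return -1

stripe = [0x11, 0x22, 0x33, 0x44]   # one byte per data drive
p, q = syndromes(stripe)            # stored on the P and Q drives
stripe[2] ^= 0x5a                   # silent corruption on "drive 2"
assert locate_single_error(stripe, p, q) == 2
```

The caveat the rest of this thread argues about is the precondition: this
only points at the right drive if exactly one byte of the stripe is wrong
and P and Q themselves were written consistently.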

> RAID will /not/ let you reliably detect or correct other sorts of
> errors.

... only it clearly can. What stops it from handling the RAID-6-and-
one-disk-is-wrong case when it can already handle the RAID-6-and-one-
disk-has-failed case, given that you can unambiguously determine which disk
is wrong using the data on the surviving drives, with an undetected-
failure probability of something way below 2^128? (I could work out the
actual value but I haven't had any coffee yet and it seems pointless
when it's that low.)

> What does /not/ work, however, is trying to squeeze magic capabilities
> out of existing layers in the system, or expecting more out of them than
> they can give.

I don't see that these capabilities are any more magic than what RAID-6
does already. It can recover from two failed drives: why can't it
recover from one wrong one? (Or, rather, from one drive with very
occasionally wrong sectors on it. Obviously if it was always getting
things wrong its presence is not a benefit and you have essentially
fallen back to nothing better than RAID-5, only with worse performance.
But that's what error thresholds are for, which md already employs in
similar situations.)

-- 
NULL && (void)


* Re: Fault tolerance with badblocks
  2017-05-09 11:15                       ` Nix
@ 2017-05-09 11:48                         ` Reindl Harald
  2017-05-09 16:11                           ` Nix
  0 siblings, 1 reply; 69+ messages in thread
From: Reindl Harald @ 2017-05-09 11:48 UTC (permalink / raw)
  To: Nix; +Cc: Phil Turmel, Wols Lists, Ravi (Tom) Hale, linux-raid



Am 09.05.2017 um 13:15 schrieb Nix:
> On 9 May 2017, Reindl Harald said:
> 
>> Am 09.05.2017 um 12:28 schrieb Nix:
>>> Honestly, scrubs are looking less and less desirable the more I talk
>>> about them. Massive worry inducers that don't actually spot problems in
>>> any meaningful sense (not even at the level of "there is a problem on
>>> this disk", just "there is a problem on this array")
>>
>> that is your opinion
>>
>> my experience over years of using md-arrays is that *every time* smartd triggered an alert mail that a drive would fail soon, it happened
>> while a scrub was running - so you can replace drives as soon as possible
> 
> What, it triggered a SMART warning while a scrub was running which SMART
> long self-tests didn't? That's depressing. You'd think SMART would be
> watching for errors while its own tests were running!

Different timing, different access patterns.

I guess people smarter than both of us had a reason to develop scrubbing 
rather than saying "just let the drive do it on its own".

> (Or were you not running any long self-tests? That's at least as risky
> as not scrubbing, IMNSHO.)

No, I do both regularly:

* smart short self-test daily
* smart long self-test weekly
* raid scrub weekly
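(For reference, that self-test schedule can be expressed in smartd.conf;
the device name, hours, and mail address here are just placeholders:)

```
# /etc/smartd.conf - short self-test daily at 02:00, long test Saturdays at 03:00
# -s schedule fields are T/MM/DD/d/HH: test type, month, day-of-month, day-of-week, hour
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
```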

And no, running the long SMART self-test daily is not a good solution: the 
RAID10 array in my office makes *terrible noises* while the SMART test is 
running. After doing this weekly for the last 6 years (Power_On_Hours 
14786, Start_Stop_Count 1597) I would say the noises are normal, but it is 
probably not good to run that operation all the time.

Well, that machine has not lost a single drive, while a clone of it 
running as a home server 24/7/365 has lost a dozen in the same time...


* Re: Fault tolerance with badblocks
  2017-05-09 11:27                       ` Nix
@ 2017-05-09 11:58                         ` David Brown
  2017-05-09 17:25                           ` Chris Murphy
  2017-05-09 19:16                         ` Fault tolerance with badblocks Phil Turmel
  1 sibling, 1 reply; 69+ messages in thread
From: David Brown @ 2017-05-09 11:58 UTC (permalink / raw)
  To: Nix; +Cc: Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, linux-raid

On 09/05/17 13:27, Nix wrote:
> On 9 May 2017, David Brown uttered the following:

> (I'm not suggesting repairing RAID-5 mismatches. That's clearly
> impossible. You can't even tell what disk is affected. But in the RAID-6
> case none of this is impossible, or so it seems to me. You have at least
> three and probably four or more drives with consistent syndromes, and
> one that is out of whack. You know which one must be wrong -- the
> "minority vote" -- and you know what has to be done to make it
> consistent with the others again. Why not do it? It's no more risky than
> that aspect of a RAID rebuild from a failed disk would be.)
> 
>> RAID will /not/ let you reliably detect or correct other sorts of
>> errors.
> 
> ... only it clearly can. What stops it from handling the RAID-6-and-
> one-disk-is-wrong case when it can already handle the RAID-6-and-one-
> disk-has-failed case, given that you can unambiguously determine which disk
> is wrong using the data on the surviving drives, with an undetected-
> failure probability of something way below 2^128? (I could work out the
> actual value but I haven't had any coffee yet and it seems pointless
> when it's that low.)
> 
>> What does /not/ work, however, is trying to squeeze magic capabilities
>> out of existing layers in the system, or expecting more out of them than
>> they can give.
> 
> I don't see that these capabilities are any more magic than what RAID-6
> does already. It can recover from two failed drives: why can't it
> recover from one wrong one? (Or, rather, from one drive with very
> occasionally wrong sectors on it. Obviously if it was always getting
> things wrong its presence is not a benefit and you have essentially
> fallen back to nothing better than RAID-5, only with worse performance.
> But that's what error thresholds are for, which md already employs in
> similar situations.)
> 

I thought you said that you had read Neil's article.  Please go back and
read it again.  If you don't agree with what is written there, then
there is little more I can say to convince you.

One thing I can try, is to note that you are /not/ the first person to
think "Surely with RAID-6 we can correct mismatches - it should be
easy?".  You are /not/ the first person to think "Correcting RAID-6
mismatches would be a marvellous feature that would make it /far/
better".  Linux md raid does not correct RAID-6 mismatches found on a
scrub.  To my (admittedly limited) knowledge, hardware RAID-6 systems do
not correct mismatches found on a scrub.  If correcting RAID-6
mismatches were as simple, reliable, and useful as you seem to believe,
then I think Linux md raid would already do it - either as part of the
scrub, or as an extra utility to run on mismatched stripes.



* Re: Fault tolerance with badblocks
  2017-05-09 10:40                   ` Nix
@ 2017-05-09 12:15                     ` Tim Small
  2017-05-09 15:30                       ` Nix
  0 siblings, 1 reply; 69+ messages in thread
From: Tim Small @ 2017-05-09 12:15 UTC (permalink / raw)
  To: Nix; +Cc: linux-raid


On 09/05/17 11:40, Nix wrote:
> I've had disk failures without warning, and
> non-failed disks with both read and write errors that would not go away,
> but that SMART reallocation value just stayed stuck at zero through all
> of it.

Really?  I see them pretty frequently...  Let's see

server1, RAID6 (4 disks), reallocated_sector_ct: 0 9 1 0
server2, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0
server3, RAID6 (5 disks), reallocated_sector_ct: 34 754 15 115 1
server4, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0
server5, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0

Disk 2 in server3 (which has drives which are a bit long in the tooth)
is scheduled to be replaced next time I visit that site.

Are you looking at the 'raw' column in the smartctl output?
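(If it helps, pulling that raw column out of captured `smartctl -A` text is
a one-liner to script across a fleet; this assumes the usual attribute-table
layout with the raw value in the last field:)

```python
def raw_attribute(smartctl_output, name="Reallocated_Sector_Ct"):
    """Return the RAW_VALUE for one attribute from `smartctl -A` text.

    The attribute table puts the raw value in the last whitespace-separated
    field of the matching line; returns None if the attribute is absent.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == name:
            # some drives report raw values as "current/max"; take the first
            return int(fields[-1].split("/")[0])
    return None
```

Feed it the stdout of `smartctl -A /dev/sdX` for each drive and compare the
numbers, not the normalised 100/200 columns.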


Tim


* Re: Fault tolerance with badblocks
  2017-05-09 12:15                     ` Tim Small
@ 2017-05-09 15:30                       ` Nix
  0 siblings, 0 replies; 69+ messages in thread
From: Nix @ 2017-05-09 15:30 UTC (permalink / raw)
  To: Tim Small; +Cc: linux-raid

On 9 May 2017, Tim Small spake thusly:

> On 09/05/17 11:40, Nix wrote:
>> I've had disk failures without warning, and
>> non-failed disks with both read and write errors that would not go away,
>> but that SMART reallocation value just stayed stuck at zero through all
>> of it.
>
> Really?  I see them pretty frequently...  Let's see
>
> server1, RAID6 (4 disks), reallocated_sector_ct: 0 9 1 0
> server2, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0
> server3, RAID6 (5 disks), reallocated_sector_ct: 34 754 15 115 1
> server4, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0
> server5, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0
>
> Disk 2 in server3 (which has drives which are a bit long in the tooth)
> is scheduled to be replaced next time I visit that site.
>
> Are you looking at the 'raw' column in the smartctl output?

No, but since they all read all zero:

  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0

this is pretty redundant.

I do see, on all my disks (regardless of hardware versus software RAID
or indeed age, and some of these disks are seven years old):

196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

One figure is much higher:

195 Hardware_ECC_Recovered  -O-RC-   100   064   000    -    2067212
195 Hardware_ECC_Recovered  -O-RC-   100   064   000    -    2088928
195 Hardware_ECC_Recovered  -O-RC-   082   064   000    -    156528817
195 Hardware_ECC_Recovered  -O-RC-   082   065   000    -    156513792

but this is on a bunch of three-month-old Seagate enterprise disks, and
as with the seek error rate Seagate use a deeply bizarre encoding for
this value, and none of the SeaChest programs seem to be able to decode
it.

It appears that the lower the decoded value, the worse things are -- I
have no idea why two of my drives are doing so much worse than two
others on this score. I guess I should keep an eye on them. In any case,
it's going up fast on those two even when the drives are totally idle
and even when I forcibly spin them down... I don't trust this figure to
tell me anything useful at all. SMART, borderline useless as ever.

Aside: in hex these are

001f8b0c
001fdfe0
095470b1
09543600

which rather suggests that the drives have two distinct encodings to me,
with two drives using one encoding and the other two another one,
probably split at the four-hex-digit mark -- but the drives have
identical firmware and the same model number...
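(If that guess is right and the raw value is a pair of counters packed at
the four-hex-digit mark - a hypothesis, not a documented format - the four
values decode like this:)

```python
# the four Hardware_ECC_Recovered raw values quoted above
raw = [2067212, 2088928, 156528817, 156513792]

def split16(v):
    """Split a raw value at the four-hex-digit mark into (high, low) words."""
    return v >> 16, v & 0xFFFF

for v in raw:
    hi, lo = split16(v)
    print(f"{v:08x} -> high={hi} low={lo}")
# the first pair shares high word 31, the second pair 2388 - consistent
# with one packing and two drives simply much further along one counter,
# rather than two different encodings
```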


* Re: Fault tolerance with badblocks
  2017-05-08 20:27                 ` Anthony Youngman
  2017-05-09  9:53                   ` Nix
@ 2017-05-09 16:05                   ` Chris Murphy
  2017-05-09 17:49                     ` Wols Lists
  1 sibling, 1 reply; 69+ messages in thread
From: Chris Murphy @ 2017-05-09 16:05 UTC (permalink / raw)
  To: Anthony Youngman; +Cc: Nix, Phil Turmel, Ravi (Tom) Hale, Linux-RAID

On Mon, May 8, 2017 at 2:27 PM, Anthony Youngman
<antlists@youngman.org.uk> wrote:


> Yes you have saved a sector sparing. Note that a consumer 3TB drive can
> return, on average, one error every time it's read from end to end 3 times,
> and still be considered "within spec" ie "not faulty" by the manufacturer.

All specs say "less than" which means it's a maximum permissible rate,
not an average. We have no idea what the minimum error rate is - we
being consumers. It's possible high volume users (e.g. Backblaze) have
data on this by now.
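For concreteness, the "one error per few end-to-end reads" figure falls
straight out of that spec ceiling (the usual consumer number is less than
one unrecoverable read error per 1e14 bits read):

```python
capacity_bits = 3e12 * 8     # 3 TB drive, in bits
ure_ceiling = 1e-14          # spec: < 1 unrecoverable error per 1e14 bits read

errors_per_full_read = capacity_bits * ure_ceiling   # 0.24 at the ceiling
full_reads_per_error = 1 / errors_per_full_read
print(round(full_reads_per_error, 1))   # ~4.2 end-to-end reads per expected error
```

So a drive running right at the limit can throw an error roughly every
three to four full scrubs and still be in spec; real drives are presumably
well below the ceiling, but we don't know how far.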



> And that's a *brand* *new* drive. That's why building a large array using
> consumer drives is a stupid idea - 4 x 3TB drives and a *within* *spec*
> array must expect to handle at least one error every scrub.

The requirement for any large array is that drives quickly abandon
reattempted reads in favor of reporting a read error. That's the main
reason consumer drives are a bad idea: they can hang user space waiting
on a drive's long in-firmware recovery.


-- 
Chris Murphy


* Re: Fault tolerance with badblocks
  2017-05-09 11:48                         ` Reindl Harald
@ 2017-05-09 16:11                           ` Nix
  2017-05-09 16:46                             ` Reindl Harald
  0 siblings, 1 reply; 69+ messages in thread
From: Nix @ 2017-05-09 16:11 UTC (permalink / raw)
  To: Reindl Harald; +Cc: Phil Turmel, Wols Lists, Ravi (Tom) Hale, linux-raid

On 9 May 2017, Reindl Harald verbalised:

> Am 09.05.2017 um 13:15 schrieb Nix:
>> (Or were you not running any long self-tests? That's at least as risky
>> as not scrubbing, IMNSHO.)
>
> no i do both regulary
>
> * smart short self-test daily
> * smart long self-test weekly
> * raid scrub weekly
>
> and no - doing a long-smart-test daily is not a good solution, the
> RAID10 array in my office makes *terrible noises* when the SMART

Agreed, though in my case not because of noise, but just because the
test takes fourteen hours and noticeably degrades disk performance while
it runs. I'm doing a long self-test monthly and frankly I'm wondering if
every three months is sufficient.

> Well, that machine has not lost a single drive, while a clone of it
> running as a home server 24/7/365 has lost a dozen in the same time...

A *dozen*?! In six years? Even with a big array you've been incredibly
unlucky, or you have young children and a corresponding disaster rate.
(Meanwhile, my last machine, with much-maligned WD GreenPower
variable-spin-rate disks, was completely happy for eight years, zero
failures, zero reallocations that I can see. I can only hope my new lot
are that good.)


* Re: Fault tolerance with badblocks
  2017-05-09 16:11                           ` Nix
@ 2017-05-09 16:46                             ` Reindl Harald
  0 siblings, 0 replies; 69+ messages in thread
From: Reindl Harald @ 2017-05-09 16:46 UTC (permalink / raw)
  To: Nix; +Cc: Phil Turmel, Wols Lists, Ravi (Tom) Hale, linux-raid



Am 09.05.2017 um 18:11 schrieb Nix:
> On 9 May 2017, Reindl Harald verbalised:
>> And no, running the long SMART self-test daily is not a good solution: the
>> RAID10 array in my office makes *terrible noises* while the SMART
> 
> Agreed, though in my case not because of noise, but just because the
> test takes fourteen hours and noticeably degrades disk performance while
> it runs. I'm doing a long self-test monthly and frankly I'm wondering if
> every three months is sufficient.
> 
>> Well, that machine has not lost a single drive, while a clone of it
>> running as a home server 24/7/365 has lost a dozen in the same time...
> 
> A *dozen*?! In six years? Even with a big array you've been incredibly
> unlucky, or you have young children and a corresponding disaster rate.
> (Meanwhile, my last machine, with much-maligned WD GreenPower
> variable-spin-rate disks, was completely happy for eight years, zero
> failures, zero reallocations that I can see. I can only hope my new lot
> are that good.)

RAID10, 4x2 TB, a room temperature of 28 degrees 12 months a year, and 
only 214 TB written - I doubt it is just bad luck with workloads like that


Filesystem created:       Wed Jun  8 13:10:56 2011
Lifetime writes:          214 TB


* Re: Fault tolerance with badblocks
  2017-05-09 11:58                         ` David Brown
@ 2017-05-09 17:25                           ` Chris Murphy
  2017-05-09 19:44                             ` Wols Lists
                                               ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Chris Murphy @ 2017-05-09 17:25 UTC (permalink / raw)
  To: David Brown
  Cc: Nix, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID

On Tue, May 9, 2017 at 5:58 AM, David Brown <david.brown@hesbynett.no> wrote:

> I thought you said that you had read Neil's article.  Please go back and
> read it again.  If you don't agree with what is written there, then
> there is little more I can say to convince you.
>
> One thing I can try, is to note that you are /not/ the first person to
> think "Surely with RAID-6 we can correct mismatches - it should be
> easy?".

H. Peter Anvin's RAID 6 paper, section 4 is what's apparently under discussion
http://milbret.anydns.info/pub/linux/kernel/people/hpa/raid6.pdf

This is totally non-trivial, especially because it says raid6 cannot
detect or correct more than one corruption, and ensuring that
additional corruption isn't introduced in the rare case is even more
non-trivial.

I do think it's sane for raid6 repair to avoid the current assumption
that data strip is correct, by doing the evaluation in equation 27. If
there's no corruption do nothing, if there's corruption of P or Q then
replace, if there's corruption of data, then report but do not repair
as follows:

1. md reports all data drives and the LBAs for the affected stripe
(otherwise this is not simple if it has to figure out which drive is
actually affected but that's not required, just a matter of better
efficiency in finding out what's really affected.)

2. the file system needs to be able to accept the error from md

3. the file system reports what it negatively impacted: file system
metadata or data and if data, the full filename path.

And now suddenly this work is likewise non-trivial.

And there is already something that will do exactly this: ZFS and
Btrfs. Both can unambiguously, efficiently determine whether data is
corrupt even if a drive doesn't report a read error.



-- 
Chris Murphy


* Re: Fault tolerance with badblocks
  2017-05-09 16:05                   ` Chris Murphy
@ 2017-05-09 17:49                     ` Wols Lists
  2017-05-10  3:06                       ` Chris Murphy
  0 siblings, 1 reply; 69+ messages in thread
From: Wols Lists @ 2017-05-09 17:49 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Nix, Phil Turmel, Ravi (Tom) Hale, Linux-RAID

On 09/05/17 17:05, Chris Murphy wrote:
>> Yes you have saved a sector sparing. Note that a consumer 3TB drive can
>> > return, on average, one error every time it's read from end to end 3 times,
>> > and still be considered "within spec" ie "not faulty" by the manufacturer.

> All specs say "less than" which means it's a maximum permissible rate,
> not an average. We have no idea what the minimum error rate is - we
> being consumers. It's possible high volume users (e.g. Backblaze) have
> data on this by now.
> 
In other words, an error rate that high is "acceptable".

And to design software that quite explicitly expects greater perfection
than the hardware itself is guaranteed to provide is, in my humble
opinion, downright negligent!!!

I'm sorry, but like Linus, I take an *engineering* approach to this
stuff, not a mathematical approach. In a mathematical world everything
works perfectly. In an engineering world, things go wrong. You should
always plan for the worst case. But to fail to plan for "the worst
*acceptable* case" is just plain IDIOTIC.

Cheers,
Wol


* Re: Fault tolerance with badblocks
  2017-05-09 11:27                       ` Nix
  2017-05-09 11:58                         ` David Brown
@ 2017-05-09 19:16                         ` Phil Turmel
  2017-05-09 20:01                           ` Nix
  1 sibling, 1 reply; 69+ messages in thread
From: Phil Turmel @ 2017-05-09 19:16 UTC (permalink / raw)
  To: Nix, David Brown; +Cc: Anthony Youngman, Ravi (Tom) Hale, linux-raid

On 05/09/2017 07:27 AM, Nix wrote:
> On 9 May 2017, David Brown uttered the following:
> 
>> On 09/05/17 11:53, Nix wrote:
>>> This turns out not to be the case. See this ten-year-old paper:
>>> <https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf>.
>>> Five weeks of doing 2GiB writes on 3000 nodes once every two hours
>>> found, they estimated, 50 errors possibly attributable to disk problems
>>> (sector- or page-size regions of corrupted data) on 1/30th of their
>>> nodes. This is *not* rare and it is hard to imagine that 1/30th of disks
>>> used by CERN deserve discarding. It is better to assume that drives
>>> misdirect writes now and then, and to provide a means of recovering from
>>> them that does not take days of panic. RAID-6 gives you that means: md
>>> should use it.
>>
>> RAID-6 does not help here.  You have to understand the types of errors
>> that can occur, the reasons for them, the possibilities for detection,
>> the possibilities for recovery, and what the different layers in the
>> system can do about them.
>>
>> RAID (1/5/6) will let you recover from one or more known failed reads,
>> on the assumption that the driver firmware is correct, memories have no
>> errors, buses have no errors, block writes are atomic, write ordering
>> matches the flush commands, block reads are either correct or marked as
>> failed, etc.
> 
> I think you're being too pedantic. Many of these things are known not to
> be true on real hardware, and at least one of them cannot possibly be
> true without a journal (atomic block writes). Nonetheless, the md layer
> is quite happy to rebuild after a failed disk even though the write hole
> might have torn garbage into your data, on the grounds that it
> *probably* did not. If your argument was used everywhere, md would never
> have been started because 100% reliability was not guaranteed.
> 
> The same, it seems to me, is true of cases in which one drive in a
> RAID-6 reports a few mismatched blocks. It is true that you don't know
> the cause of the mismatches, but you *do* know which bit of the mismatch
> is wrong and what data should be there, subject only to the assumption
> that sufficiently few drives have made simultaneous mistakes that
> redundancy is preserved. And that's the same assumption RAID >0 makes
> all the time anyway!

You are completely ignoring the fact that reconstruction from P,Q is
mathematically correct only if the entire stripe is written together.
Any software or hardware problem that interrupts a complete stripe write
or short-circuits a P,Q update can, and sooner or later will, deliver a
*wrong* assessment of which device is corrupted.  In particular, you
can't even tell which devices got new data and which got old data.  Even
worse, cable and controller problems have been known to create patterns
of corruption to the way to one or more drives.  You desperately need to
know if this happens to your array.  It is not only possible, but
*likely* in systems without ECC RAM.

The bottom line is that any kernel that implements the auto-correct you
seem to think is a slam dunk will be shunned by any system administrator
who actually cares about their data.  Your obtuseness notwithstanding.

All:  Please drop me from future CCs on this thread.

Phil



* Re: Fault tolerance with badblocks
  2017-05-09 17:25                           ` Chris Murphy
@ 2017-05-09 19:44                             ` Wols Lists
  2017-05-10  3:53                               ` Chris Murphy
  2017-05-09 20:18                             ` Nix
  2017-05-09 21:06                             ` A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) Nix
  2 siblings, 1 reply; 69+ messages in thread
From: Wols Lists @ 2017-05-09 19:44 UTC (permalink / raw)
  To: Chris Murphy, David Brown; +Cc: Nix, Phil Turmel, Ravi (Tom) Hale, Linux-RAID

On 09/05/17 18:25, Chris Murphy wrote:
> On Tue, May 9, 2017 at 5:58 AM, David Brown <david.brown@hesbynett.no> wrote:
> 
>> I thought you said that you had read Neil's article.  Please go back and
>> read it again.  If you don't agree with what is written there, then
>> there is little more I can say to convince you.
>>
>> One thing I can try, is to note that you are /not/ the first person to
>> think "Surely with RAID-6 we can correct mismatches - it should be
>> easy?".
> 
> H. Peter Anvin's RAID 6 paper, section 4 is what's apparently under discussion
> http://milbret.anydns.info/pub/linux/kernel/people/hpa/raid6.pdf
> 
> This is totally non-trivial, especially because it says raid6 cannot
> detect or correct more than one corruption, and ensuring that
> additional corruption isn't introduced in the rare case is even more
> non-trivial.

And can I point out that that is just one person's opinion? A
well-informed, respected person true, but it's still just opinion. And
imho the argument that says raid should not repair the data applies
equally against fsck - that shouldn't do any repair either! :-)
> 
> I do think it's sane for raid6 repair to avoid the current assumption
> that data strip is correct, by doing the evaluation in equation 27. If
> there's no corruption do nothing, if there's corruption of P or Q then
> replace, if there's corruption of data, then report but do not repair
> as follows:

From an ENGINEERING viewpoint, what is the probability that we get a
two-drive error? And if we do, then there's probably something rather
more serious gone wrong?
> 
> 1. md reports all data drives and the LBAs for the affected stripe
> (otherwise this is not simple if it has to figure out which drive is
> actually affected but that's not required, just a matter of better
> efficiency in finding out what's really affected.)

md should report the error AND THE DRIVE THAT APPEARS TO BE FAULTY. (Or
maybe we leave that to the below-mentioned mdfsck.)

That way, if it's a bunch of errors on the same drive we know we've got
a problem with the drive. If we've got a bunch of errors on random
drives, we know the problem is probably elsewhere.
> 
> 2. the file system needs to be able to accept the error from md
> 
> 3. the file system reports what it negatively impacted: file system
> metadata or data and if data, the full filename path.
> 
> And now suddenly this work is likewise non-trivial.

Which is why we keep the filesystem out of this. By all means make md
return a list of dud strips, which a filesystem-level utility can then
interpret, but that isn't md's problem.
> 
> And there is already something that will do exactly this: ZFS and
> Btrfs. Both can unambiguously, efficiently determine whether data is
> corrupt even if a drive doesn't report a read error.
> 
Or we write an mdfsck program. Just like you shouldn't run fsck with
write privileges on a mounted filesystem, you wouldn't run mdfsck with
filesystems in the array mounted.

At the end of the day, md should never corrupt data by default. Which is
what it sounds like is happening at the moment, if it's assuming the
data sectors are correct and the parity is wrong. If one parity appears
correct then by all means rewrite the second ...

But the current setup, where it's currently quite happy to assume a
single-drive error and rewrite it if it's a parity drive, but it won't
assume a single-drive error and rewrite it if it's a data drive,
just seems totally wrong. Worse, in the latter case, it seems it
actively prevents fixing the problem by updating the parity and
(probably) corrupting the data.

Report the error, give the user the tools to fix it, and LET THEM sort
it out. Just like we do when we run fsck on a filesystem.

(I know I know, patches welcome :-)

Cheers,
Wol


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Fault tolerance with badblocks
  2017-05-09 19:16                         ` Fault tolerance with badblocks Phil Turmel
@ 2017-05-09 20:01                           ` Nix
  2017-05-09 20:57                             ` Wols Lists
  2017-05-09 21:23                             ` Phil Turmel
  0 siblings, 2 replies; 69+ messages in thread
From: Nix @ 2017-05-09 20:01 UTC (permalink / raw)
  To: Phil Turmel; +Cc: David Brown, Anthony Youngman, Ravi (Tom) Hale, linux-raid

On 9 May 2017, Phil Turmel told this:

> On 05/09/2017 07:27 AM, Nix wrote:
>> The same, it seems to me, is true of cases in which one drive in a
>> RAID-6 reports a few mismatched blocks. It is true that you don't know
>> the cause of the mismatches, but you *do* know which bit of the mismatch
>> is wrong and what data should be there, subject only to the assumption
>> that sufficiently few drives have made simultaneous mistakes that
>> redundancy is preserved. And that's the same assumption RAID >0 makes
>> all the time anyway!
>
> You are completely ignoring the fact that reconstruction from P,Q is
> mathematically correct only if the entire stripe is written together.

Ooh, true.

> Any software or hardware problem that interrupts a complete stripe write
> or a short-circuited P,Q update can and therefore often will deliver a
> *wrong* assessment of what device is corrupted.  In particular, you
> can't even tell which devices got new data and which got old data.  Even
> worse, cable and controller problems have been known to create patterns
> of corruption on the way to one or more drives.  You desperately need to
> know if this happens to your array.  It is not only possible, but
> *likely* in systems without ECC ram.

Is this still true if the md cache or PPL is in use? The whole point of
these, after all, is to ensure that stripe writes either happen
completely or not at all. (But, again, that'll only guard against things
like power failure interruptions, not bad cabling. However, again, if
you have bad cabling or a bad controller you can expect to have *lots
and lots* of errors -- a small number of errors are much less likely to
be something of this nature. So, again, a threshold like md already
applies elsewhere might seem to be worthwhile. If you are seeing *lots*
of mismatches, clearly correction is unwise -- heck, writing to the
array at all is unwise, and the whole thing might profitably be
remounted ro. I suspect the filesystems will have been remounted ro by
the kernel by this point in any case.)

The point made elsewhere that all your arguments also apply against fsck
still stands. (Why bother with it? If it gave an error, you have a
kernel bug or a bad disk controller, RAM, or cabling, and nothing on
your filesystem can be trusted! just restore from backup!)

Your arguments are absolutely classic "the perfect is the enemy of the
good" arguments, in my view. I can understand falling into that trap on
a RAID list, it's all about paranoia :) but that doesn't mean I agree
with them. I *have* excellent backups, but that doesn't mean I want to
waste hours to days restoring and/or revalidating everything just
because of a persistent mismatch_cnt > 0 which md won't localize for me
or even try to fix because it *might*, uh... no, as far as I can tell
you're worrying that it might in some cases cause corruption of data
that is *already known to be corrupt*. You'll pardon me if this
possibility does not fill me with fear.

> The bottom line is that any kernel that implements the auto-correct you
> seem to think is a slam dunk will be shunned by any system administrator
> who actually cares about their data.  Your obtuseness notwithstanding.

Gee, thanks heaps. Next time I want randomly insulting by someone who
doesn't bother to tell me his actual *arguments* in any message before
the one that starts on the insults, I'll come straight to you.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Fault tolerance with badblocks
  2017-05-09 17:25                           ` Chris Murphy
  2017-05-09 19:44                             ` Wols Lists
@ 2017-05-09 20:18                             ` Nix
  2017-05-09 20:52                               ` Wols Lists
  2017-05-10  8:41                               ` David Brown
  2017-05-09 21:06                             ` A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) Nix
  2 siblings, 2 replies; 69+ messages in thread
From: Nix @ 2017-05-09 20:18 UTC (permalink / raw)
  To: Chris Murphy
  Cc: David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID

On 9 May 2017, Chris Murphy verbalised:

> On Tue, May 9, 2017 at 5:58 AM, David Brown <david.brown@hesbynett.no> wrote:
>
>> I thought you said that you had read Neil's article.  Please go back and
>> read it again.  If you don't agree with what is written there, then
>> there is little more I can say to convince you.

The entire article is predicated on the assumption that when an
inconsistent stripe is found, fixing it is simple because you can just
fail whichever device is inconsistent... but given that the whole
premise of the article is that *you cannot tell which that is*, I don't
see the point in failing anything.

The first comment in the article is someone noting that md doesn't say
which device is failing, what the location of the error is or anything
else a sysadmin might actually find useful for fixing it. "Hey, you have
an error somewhere on some disk on this multi-terabyte array which might
be data corruption and if a disk fails will be data corruption!" is not
too useful :( The fourth comment notes that the "smart" approach, given
RAID-6, has a significantly higher chance of actually fixing the problem
than the simple approach. I'd call that a fairly important comment...

(Neil said: "Similarly a RAID6 with inconsistent P and Q could well not
be able to identify a single block which is "wrong" and even if it could
there is a small possibility that the identified block isn't wrong, but
the other blocks are all inconsistent in such a way as to accidentally
point to it. The probability of this is rather small, but it is
non-zero". As far as I can tell the probability of this is exactly the
same as that of multiple read errors in a single stripe -- possibly far
lower, if you need not only multiple wrong P and Q values but *precisely
mis-chosen* ones. If that wasn't acceptably rare, you wouldn't be using
RAID-6 to begin with.

I've been talking all the time about a stripe which is singly
inconsistent: either all the data blocks are fine and one of P or Q is
fine, or both P and Q and all but one data block are fine, and the
remaining block is inconsistent with all the rest. Obviously if more
blocks are corrupt, you can do nothing but report it. The redundancy
simply isn't there to attempt repair.)

> H. Peter Anvin's RAID 6 paper, section 4 is what's apparently under discussion
> http://milbret.anydns.info/pub/linux/kernel/people/hpa/raid6.pdf
>
> This is totally non-trivial, especially because it says raid6 cannot
> detect or correct more than one corruption, and ensuring that
> additional corruption isn't introduced in the rare case is even more
> non-trivial.

Yeah. Testing this is the bastard problem, really. Fault injection via
dm is the only approach that seems remotely practical to me.

> I do think it's sane for raid6 repair to avoid the current assumption
> that data strip is correct, by doing the evaluation in equation 27. If
> there's no corruption do nothing, if there's corruption of P or Q then
> replace, if there's corruption of data, then report but do not repair

At least indicate *where* the corruption is in the report. (I'd say
"repair, as a non-default option" for people with a different
availability/P(corruption) tradeoff -- since, after all, if you're using
RAID In the first place you value high availability across disk problems
more than most people do, and there is a difference between one bit of
unreported damage that causes a near-certain restore from backup and
either zero or two of them plus a report with an LBA attached so you
know you need to do something...)

> as follows:
>
> 1. md reports all data drives and the LBAs for the affected stripe
> (otherwise this is not simple if it has to figure out which drive is
> actually affected but that's not required, just a matter of better
> efficiency in finding out what's really affected.)

Yep.

> 2. the file system needs to be able to accept the error from md

It would probably need to report this as an -EIO, but I don't know of
any filesystems that can accept asynchronous reports of errors like
this. You'd need reverse mapping to even stand a chance (a non-default
option on xfs, and of course available on btrfs and zfs too). You'd
need self-healing metadata to stand a chance of doing anything about it.
And god knows what a filesystem is meant to do if part of the file data
vanishes. Replace it with \0? ugh. I'd almost rather have the error
go back out to a monitoring daemon and have it send you an email...

> 3. the file system reports what it negatively impacted: file system
> metadata or data and if data, the full filename path.
> 
> And now suddenly this work is likewise non-trivial.

Yeah, it's all the layers stacked up to the filesystem that are buggers
to deal with... and now the optional 'just repair it dammit' approach
seems useful again, if just because it doesn't have to deal with all
these extra layers.

> And there is already something that will do exactly this: ZFS and
> Btrfs. Both can unambiguously, efficiently determine whether data is
> corrupt even if a drive doesn't report a read error.

Yeah. Unfortunately both have their own problems: ZFS reimplements the
page cache and adds massive amounts of inefficiency in the process, and
btrfs is... well... not really baked enough for the sort of high-
availability system that's going to be running RAID, yet. (Alas!)

(Recent xfs can do the same with metadata, but not data.)

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Fault tolerance with badblocks
  2017-05-09 20:18                             ` Nix
@ 2017-05-09 20:52                               ` Wols Lists
  2017-05-10  8:41                               ` David Brown
  1 sibling, 0 replies; 69+ messages in thread
From: Wols Lists @ 2017-05-09 20:52 UTC (permalink / raw)
  To: Nix, Chris Murphy; +Cc: David Brown, Ravi (Tom) Hale, Linux-RAID

On 09/05/17 21:18, Nix wrote:
> (Neil said: "Similarly a RAID6 with inconsistent P and Q could well not
> be able to identify a single block which is "wrong" and even if it could
> there is a small possibility that the identified block isn't wrong, but
> the other blocks are all inconsistent in such a way as to accidentally
> point to it. The probability of this is rather small, but it is
> non-zero". As far as I can tell the probability of this is exactly the
> same as that of multiple read errors in a single stripe -- possibly far
> lower, if you need not only multiple wrong P and Q values but *precisely
> mis-chosen* ones. If that wasn't acceptably rare, you wouldn't be using
> RAID-6 to begin with.

This to me is the crux of the argument.

What is the probability of CORRECTLY identifying a single-disk error?

What is the probability of WRONGLY mistaking a multi-disk error for a
single-disk error?

My gut instinct is that the second scenario is much less likely. So, in
that case, the current setup is that we DELIBERATELY CORRUPT a
recoverable error because of the TINY risk that we might have got it
wrong. Picking probabilities at random, let's say the first probability
is 99 in a hundred, the second is one in a thousand.

On a four-disk raid-6, that means we're throwing away about 500 chances
of recovering the correct data, so that on one occasion we can avoid
corruption. To me that's an insane trade-off.

Neil goes on about "what if a write fails? What if the power goes down?
What if what if?" Those are the wrong questions!!! The correct question
is "can we identify the difference between a single-disk failure and a
multi-disk failure". We don't care what *caused* that failure.

If the power goes down and only the first disk in a stripe is written,
we can correct it back to what it was. If only the last disk failed to
be written, we can correct it back to what it should have been. If at
least two disks are written and at least two disks are not, CAN WE
DETECT THAT? Surely we can - we don't care how many disks are or aren't
written - in that scenario surely all the parities mess up. In which
case we give up and say "corrupt data". Which is no different from the
present, except that at present we fix the parity and pretend nothing is
wrong :-(

The problem is that at present we fix the parity and pretend nothing is
wrong when the reality is we *could* have corrected the data, if we
could have been bothered.

So we have to write an mdfsck. Okay. So we have to make sure that no
filesystems on the array are mounted. Okay, that's a bit harder. So we
have to assume that sysadmins are sensible beings who don't screw things
up - okay that's a lot harder :-) But we shouldn't be throwing away LOTS
of data that's easy to recover, because we MIGHT "recover" data that's
wrong.

Yes, yes, I know - code welcome ... :-)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Fault tolerance with badblocks
  2017-05-09 20:01                           ` Nix
@ 2017-05-09 20:57                             ` Wols Lists
  2017-05-09 21:22                               ` Nix
  2017-05-09 21:23                             ` Phil Turmel
  1 sibling, 1 reply; 69+ messages in thread
From: Wols Lists @ 2017-05-09 20:57 UTC (permalink / raw)
  To: Nix, Phil Turmel; +Cc: linux-raid

On 09/05/17 21:01, Nix wrote:
> Gee, thanks heaps. Next time I want randomly insulting by someone who
> doesn't bother to tell me his actual *arguments* in any message before
> the one that starts on the insults, I'll come straight to you.

Nix, much as I don't think people are thinking this through rationally
(they live in the perfect world of maths, not the imperfect world of
engineering), I do NOT think insulting Phil on this list is a good idea.

We all say things we shouldn't - I'm a master at it too :-) but sniping
at a well-respected regular isn't wise ...

Can we all tone it down, please ...

Cheers,
Wol

^ permalink raw reply	[flat|nested] 69+ messages in thread

* A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-09 17:25                           ` Chris Murphy
  2017-05-09 19:44                             ` Wols Lists
  2017-05-09 20:18                             ` Nix
@ 2017-05-09 21:06                             ` Nix
  2017-05-12 11:14                               ` Nix
  2017-05-16  3:27                               ` NeilBrown
  2 siblings, 2 replies; 69+ messages in thread
From: Nix @ 2017-05-09 21:06 UTC (permalink / raw)
  To: Chris Murphy
  Cc: David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID

On 9 May 2017, Chris Murphy verbalised:

> 1. md reports all data drives and the LBAs for the affected stripe

Enough rambling from me. Here's a hilariously untested patch against
4.11 (as in I haven't even booted with it: my systems are kind of in
flux right now as I migrate to the md-based server that got me all
concerned about this). It compiles! And it's definitely safer than
trying a repair, and makes it possible to recover from a real mismatch
without losing all your hair in the process, or determine that a
mismatch is spurious or irrelevant. And that's enough for me, frankly.
This is a very rare problem, one hopes.

(It's probably not ideal, because the error is just known to be
somewhere in that stripe, not on that sector, which makes determining
the affected data somewhat harder. But at least you can figure out what
filesystem it's on. :) )

8<------------------------------------------------------------->8
From: Nick Alcock <nick.alcock@oracle.com>
Subject: [PATCH] md: report sector of stripes with check mismatches

This makes it possible, with appropriate filesystem support, for a
sysadmin to tell what is affected by the mismatch, and whether
it should be ignored (if it's inside a swap partition, for
instance).

We ratelimit to prevent log flooding: if there are so many
mismatches that ratelimiting is necessary, the individual messages
are relatively unlikely to be important (either the machine is
swapping like crazy or something is very wrong with the disk).

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
---
 drivers/md/raid5.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ed5cd705b985..bcd2e5150e29 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3959,10 +3959,14 @@ static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh,
 			set_bit(STRIPE_INSYNC, &sh->state);
 		else {
 			atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches);
-			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
+			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
 				/* don't try to repair!! */
 				set_bit(STRIPE_INSYNC, &sh->state);
-			else {
+				pr_warn_ratelimited("%s: mismatch around sector "
+						    "%llu\n", __func__,
+						    (unsigned long long)
+						    sh->sector);
+			} else {
 				sh->check_state = check_state_compute_run;
 				set_bit(STRIPE_COMPUTE_RUN, &sh->state);
 				set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
@@ -4111,10 +4115,14 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
 			}
 		} else {
 			atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches);
-			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
+			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
 				/* don't try to repair!! */
 				set_bit(STRIPE_INSYNC, &sh->state);
-			else {
+				pr_warn_ratelimited("%s: mismatch around sector "
+						    "%llu\n", __func__,
+						    (unsigned long long)
+						    sh->sector);
+			} else {
 				int *target = &sh->ops.target;
 
 				sh->ops.target = -1;
-- 
2.12.2.212.gea238cf35.dirty


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: Fault tolerance with badblocks
  2017-05-09 20:57                             ` Wols Lists
@ 2017-05-09 21:22                               ` Nix
  0 siblings, 0 replies; 69+ messages in thread
From: Nix @ 2017-05-09 21:22 UTC (permalink / raw)
  To: Wols Lists; +Cc: Phil Turmel, linux-raid

On 9 May 2017, Wols Lists told this:

> On 09/05/17 21:01, Nix wrote:
>> Gee, thanks heaps. Next time I want randomly insulting by someone who
>> doesn't bother to tell me his actual *arguments* in any message before
>> the one that starts on the insults, I'll come straight to you.
>
> Nix, much as I don't think people are thinking this through rationally
> (they live in the perfect world of maths, not the imperfect world of
> engineering), I do NOT think insulting Phil on this list is a good idea.

Errr... sure, but I may be ignorant, but I'm not obtuse. Not as far as I
know, anyway. What I am is sleep-deprived. (It takes a special kind of
nervous wreck to be kept awake by a problem like this, that has never
happened in many years of my using md/raid. I think I'll be kept awake
by the possibility of an asteroid strike or a second Carrington Event
tonight.)

> Can we all tone it down, please ...

Sure! I'm generating untested patches now, is that better? (Probably
not. But they do solve this problem enough to reduce the worry quotient
without actually doing the much-more-complex repair side of things.)

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Fault tolerance with badblocks
  2017-05-09 20:01                           ` Nix
  2017-05-09 20:57                             ` Wols Lists
@ 2017-05-09 21:23                             ` Phil Turmel
  1 sibling, 0 replies; 69+ messages in thread
From: Phil Turmel @ 2017-05-09 21:23 UTC (permalink / raw)
  To: Nix; +Cc: David Brown, Anthony Youngman, Ravi (Tom) Hale, linux-raid

On 05/09/2017 04:01 PM, Nix wrote:
> On 9 May 2017, Phil Turmel told this:

>> The bottom line is that any kernel that implements the auto-correct you
>> seem to think is a slam dunk will be shunned by any system administrator
>> who actually cares about their data.  Your obtuseness notwithstanding.
> 
> Gee, thanks heaps. Next time I want randomly insulting by someone who
> doesn't bother to tell me his actual *arguments* in any message before
> the one that starts on the insults, I'll come straight to you.

Ok, yeah, I was a bit harsh.  Ad hominem is not appropriate.  Not that
the shunning wouldn't happen.

As for the arguments, well, *everyone* on this list is providing
arguments and you are ignoring them.  Whether you are filtering facts on
pre-conceived ideas about raid6 or simply can't understand the points,
the result *appears* obtuse.

And now, please all drop me from the CC.

Phil

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Fault tolerance with badblocks
  2017-05-09  9:53                   ` Nix
  2017-05-09 11:09                     ` David Brown
@ 2017-05-09 21:32                     ` NeilBrown
  2017-05-10 19:03                       ` Nix
  1 sibling, 1 reply; 69+ messages in thread
From: NeilBrown @ 2017-05-09 21:32 UTC (permalink / raw)
  To: Nix, Anthony Youngman; +Cc: Phil Turmel, Ravi (Tom) Hale, linux-raid

[-- Attachment #1: Type: text/plain, Size: 5958 bytes --]

On Tue, May 09 2017, Nix wrote:

> On 8 May 2017, Anthony Youngman told this:
>
>> If the scrub finds a mismatch, then the drives are reporting
>> "everything's fine here". Something's gone wrong, but the question is
>> what? If you've got a four-drive raid that reports a mismatch, how do
>> you know which of the four drives is corrupt? Doing an auto-correct
>> here risks doing even more damage. (I think a raid-6 could recover,
>> but raid-5 is toast ...)
>
> With a RAID-5 you are screwed: you can reconstruct the parity but cannot
> tell if it was actually right. You can make things consistent, but not
> correct.
>
> But with a RAID-6 you *do* have enough data to make things correct, with
> precisely the same probability as recovery of a RAID-5 "drive" of length
> a single sector. It seems wrong that not only does md not do this but
> doesn't even tell you which drive made the mistake so you could do the
> millions-of-times-slower process of a manual fail and readdition of the
> drive (or, if you suspect it of being wholly buggered, a manual fail and
> replacement).
>
>> And seeing as drives are pretty much guaranteed (unless something's
>> gone BADLY wrong) to either (a) accurately return the data written, or
>> (b) return a read error, that means a data mismatch indicates
>> something is seriously wrong that is NOTHING to do with the drives.
>
> This turns out not to be the case. See this ten-year-old paper:
> <https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf>.
> Five weeks of doing 2GiB writes on 3000 nodes once every two hours
> found, they estimated, 50 errors possibly attributable to disk problems
> (sector- or page-size regions of corrupted data) on 1/30th of their
> nodes. This is *not* rare and it is hard to imagine that 1/30th of disks
> used by CERN deserve discarding. It is better to assume that drives
> misdirect writes now and then, and to provide a means of recovering from
> them that does not take days of panic. RAID-6 gives you that means: md
> should use it.
>
> The page-sized regions of corrupted data were probably software -- but
> the sector-sized regions were just as likely the drives, possibly
> misdirected writes or misdirected reads.
>
> Neil decided not to do any repair work in this case on the grounds that
> if the drive is misdirecting one write it might misdirect the repair as
> well

My justification was a bit broader than that.
 If you get a consistency error on RAID6, there is not one model to
 explain it which is significantly more likely than any other model.
 So it is not possible to predict the results of any particular remedial
 action.  It might help, it might hurt, it might have no effect.
 Better to do nothing and appear incompetent, than to do the wrong thing
 and remove all doubt.
 (there could be problems with media, buffering in the drive, addressing
 in the drive, buffer/addressing in the controller, errors in main
 memory, CPU problems comparing bytes, corruption on a bus, either
 reading or writing - of either data or addresses)

NeilBrown


>    -- but if the repair is *consistently* misdirected, that seems
> relatively harmless (you had corruption before, you have it now, it just
> moved), and if it was a sporadic error, the repair is worthwhile. The
> only case in which a repair should not be attempted is if the drive is
> misdirecting all or most writes -- but in that case, by the time you do
> a scrub, on all but the quietest arrays you'll see millions of
> mismatches and it'll be obvious that it's time to throw the drive out.
> (Assuming md told you which drive it was.)
>
>>> If a sector weakens purely because of neighbouring writes or temperature
>>> or a vibrating housing or something (i.e. not because of actual damage),
>>> so that a rewrite will strengthen it and relocation was never necessary,
>>> surely you've just saved a pointless bit of sector sparing? (I don't
>>> know: I'm not sure what the relative frequency of these things is. Read
>>> and write errors in general are so rare that it's quite possible I'm
>>> worrying about nothing at all. I do know I forgot to scrub my old
>>> hardware RAID array for about three years and nothing bad happened...)
>>>
>> Yes you have saved a sector sparing. Note that a consumer 3TB drive
>> can return, on average, one error every time it's read from end to end
>> 3 times, and still be considered "within spec" ie "not faulty" by the
>
> Yeah, that's why RAID-6 is a good idea. :)
>
>> manufacturer. And that's a *brand* *new* drive. That's why building a
>> large array using consumer drives is a stupid idea - 4 x 3TB drives
>> and a *within* *spec* array must expect to handle at least one error
>> every scrub.
>
> That's just one reason why. The lack of control over URE timeouts is
> just as bad.
>
>> Okay - most drives are actually way over spec, and could probably be
>> read end-to-end many times without a single error, but you'd be a fool
>> to gamble on it.
>
> I'm trying *not* to gamble on it -- but I don't want to end up in the
> current situation we seem to have with md6, which is "oh, you have a
> mismatch, it's not going away, but we're neither going to tell you where
> it is nor what disk it's on nor repair it ourselves, even though we
> could, just to make it as hard as possible for you to repair the problem
> or even tell if it's a consistent one" (is the single mismatch an
> expected, spurious read error because of the volume of data you're
> reading, or one that's consistent and needs repair? All mismatch_cnt
> tells you is that there's a mismatch).
>
> -- 
> NULL && (void)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Fault tolerance with badblocks
  2017-05-09 17:49                     ` Wols Lists
@ 2017-05-10  3:06                       ` Chris Murphy
  0 siblings, 0 replies; 69+ messages in thread
From: Chris Murphy @ 2017-05-10  3:06 UTC (permalink / raw)
  To: Wols Lists; +Cc: Chris Murphy, Nix, Phil Turmel, Ravi (Tom) Hale, Linux-RAID

On Tue, May 9, 2017 at 11:49 AM, Wols Lists <antlists@youngman.org.uk> wrote:
> On 09/05/17 17:05, Chris Murphy wrote:
>>> Yes you have saved a sector sparing. Note that a consumer 3TB drive can
>>> > return, on average, one error every time it's read from end to end 3 times,
>>> > and still be considered "within spec" ie "not faulty" by the manufacturer.
>
>> All specs say "less than" which means it's a maximum permissible rate,
>> not an average. We have no idea what the minimum error rate is - we
>> being consumers. It's possible high volume users (e.g. Backblaze) have
>> data on this by now.
>>
> In other words, an error rate that high is "acceptable".

It's acceptable in that the manufacturer sells products with such
specification and consumers buy them. It's totally voluntary. There
are drives with one and two orders of magnitude lower unrecoverable
error rates and some people buy them and pay extra to get that spec as
a feature among other features.


> And to design software that quite explicitly expects greater perfection
> than the hardware itself is guaranteed to provide is, in my humble
> opinion, downright negligent!!!

How does the software expect a lower error rate than the drive specification?




-- 
Chris Murphy



* Re: Fault tolerance with badblocks
  2017-05-09 19:44                             ` Wols Lists
@ 2017-05-10  3:53                               ` Chris Murphy
  2017-05-10  4:49                                 ` Wols Lists
                                                   ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Chris Murphy @ 2017-05-10  3:53 UTC (permalink / raw)
  To: Wols Lists; +Cc: Linux-RAID

On Tue, May 9, 2017 at 1:44 PM, Wols Lists <antlists@youngman.org.uk> wrote:

>> This is totally non-trivial, especially because it says raid6 cannot
>> detect or correct more than one corruption, and ensuring that
>> additional corruption isn't introduced in the rare case is even more
>> non-trivial.
>
> And can I point out that that is just one person's opinion?

Right off the bat you ask a stupid question that contains the answer
to your own stupid question. This is condescending and annoying, and
it invites treating you with suspicion as a troll. But then you make
it worse by saying it again:

> A
> well-informed, respected person true, but it's still just opinion.

Except it is not just an opinion, it's a fact by any objective reader
who isn't even a programmer, let alone if you know something about
math and/or programming. Let's break down how totally stupid your
position is.

1. Opinions don't count for much.
2. You have presented no code that contradicts the opinion that this
is hard. You've opined that an opinion is to be discarded at face
value. Therefore your own opinion is just an opinion and likewise
discardable.
3. How to do the thing you think is trivial has been well documented
for some time and yet there are essentially no implementations. That
it's simple to do (your idea) and yet does not exist (fact) means this
is a big fat conspiracy to fuck you over, on purpose.

It's so asinine I feel trolled right now.

>And
> imho the argument that says raid should not repair the data applies
> equally against fsck - that shouldn't do any repair either! :-)

And now the dog shit cake has cat shit icing on it. Great.


>> And there is already something that will do exactly this: ZFS and
>> Btrfs. Both can unambiguously, efficiently determine whether data is
>> corrupt even if a drive doesn't report a read error.
>>
> Or we write an mdfsck program. Just like you shouldn't run fsck with
> write privileges on a mounted filesystem, you wouldn't run mdfsck with
> filesystems in the array mounted.

Who is we? Are you volunteering other people build you a feature?


> At the end of the day, md should never corrupt data by default. Which is
> what it sounds like is happening at the moment, if it's assuming the
> data sectors are correct and the parity is wrong. If one parity appears
> correct then by all means rewrite the second ...

This is an obtuse and frankly malicious characterization. Scrubs don't
happen by default. And scrub repair's assuming data strips are correct
is well documented. If you don't like this assumption, don't use scrub
repair. You can't say corruption happens by default unless you admit
that there are UREs on a drive by default - of course that's absurd and
makes no sense.

>
> But the current setup, where it's currently quite happy to assume a
> single-drive error and rewrite it if it's a parity drive, but it won't
> assume a single-drive error and rewrite it if it's a data drive,
> just seems totally wrong. Worse, in the latter case, it seems it
> actively prevents fixing the problem by updating the parity and
> (probably) corrupting the data.

The data is already corrupted by definition. No additional damage to
data is done. What does happen is good P and Q are replaced by bad P
and Q which matches the already bad data.

And nevertheless you have the very real problem that drives lie about
having committed data to stable media. And they reorder writes,
breaking the write order assumptions of things. And we have RMW
happening on live arrays. And that means you have a real likelihood
that you cannot absolutely determine with the available information
why P and Q don't agree with the data, you're still making probability
assumptions and if that assumption is wrong any correction will
introduce more corruption.

The only unambiguous way to do this has already been done and it's ZFS
and Btrfs. And a big part of why they can do what they do is because
they are copy on write. If you need to solve the problem of ambiguous
data strip integrity in relation to P and Q, then use ZFS. It's
production ready. If you are prepared to help test and improve things,
then you can look into the Btrfs implementation.

Otherwise I'm sure md and LVM folks have a feature list that
represents a few years of work as it is without yet another pile on.

>
> Report the error, give the user the tools to fix it, and LET THEM sort
> it out. Just like we do when we run fsck on a filesystem.

They're not at all comparable. One is a file system, the other a raid
implementation, they have nothing in common.


-- 
Chris Murphy


* Re: Fault tolerance with badblocks
  2017-05-10  3:53                               ` Chris Murphy
@ 2017-05-10  4:49                                 ` Wols Lists
  2017-05-10 17:18                                   ` Chris Murphy
  2017-05-16  3:20                                   ` NeilBrown
  2017-05-10  5:00                                 ` Dave Stevens
  2017-05-10 16:44                                 ` Edward Kuns
  2 siblings, 2 replies; 69+ messages in thread
From: Wols Lists @ 2017-05-10  4:49 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Linux-RAID

On 10/05/17 04:53, Chris Murphy wrote:
> On Tue, May 9, 2017 at 1:44 PM, Wols Lists <antlists@youngman.org.uk> wrote:
> 
>>> This is totally non-trivial, especially because it says raid6 cannot
>>> detect or correct more than one corruption, and ensuring that
>>> additional corruption isn't introduced in the rare case is even more
>>> non-trivial.
>>
>> And can I point out that that is just one person's opinion?
> 
> Right off the bat you ask a stupid question that contains the answer
> to your own stupid question. This is condescending and annoying, and
>> it invites treating you with suspicion as a troll. But then you make
> it worse by saying it again:
> 
Sorry. But I thought we were talking about *Neil's* paper. My bad for
missing it.

>> A
>> well-informed, respected person true, but it's still just opinion.
> 
> Except it is not just an opinion, it's a fact by any objective reader
> who isn't even a programmer, let alone if you know something about
> math and/or programming. Let's break down how totally stupid your
> position is.
> 

<snip ad hominems :-) >
> 
>> At the end of the day, md should never corrupt data by default. Which is
>> what it sounds like is happening at the moment, if it's assuming the
>> data sectors are correct and the parity is wrong. If one parity appears
>> correct then by all means rewrite the second ...
> 
> This is an obtuse and frankly malicious characterization. Scrubs don't
> happen by default. And scrub repair's assuming data strips are correct
> is well documented. If you don't like this assumption, don't use scrub
> repair. You can't say corruption happens by default unless you admit
> that there's URE's on a drive by default - of course that's absurd and
> makes no sense.
> 
Documenting bad behaviour doesn't turn it into good behaviour, though ...
>>
>> But the current setup, where it's currently quite happy to assume a
>> single-drive error and rewrite it if it's a parity drive, but it won't
>> assume a single-drive error and rewrite it if it's a data drive,
>> just seems totally wrong. Worse, in the latter case, it seems it
>> actively prevents fixing the problem by updating the parity and
>> (probably) corrupting the data.
> 
> The data is already corrupted by definition. No additional damage to
> data is done. What does happen is good P and Q are replaced by bad P
> and Q which matches the already bad data.

Except, in my world, replacing good P & Q by bad P & Q *IS* doing
additional damage! We can identify and fix the bad data. So why don't
we? Throwing away good P & Q prevents us from doing that, and means we
can no longer recover the good data!
> 
> And nevertheless you have the very real problem that drives lie about
> having committed data to stable media. And they reorder writes,
> breaking the write order assumptions of things. And we have RMW
> happening on live arrays. And that means you have a real likelihood
> that you cannot absolutely determine with the available information
> why P and Q don't agree with the data, you're still making probability
> assumptions and if that assumption is wrong any correction will
> introduce more corruption.
> 
> The only unambiguous way to do this has already been done and it's ZFS
> and Btrfs. And a big part of why they can do what they do is because
> they are copy on write. If you need to solve the problem of ambiguous
> data strip integrity in relation to P and Q, then use ZFS. It's
> production ready. If you are prepared to help test and improve things,
> then you can look into the Btrfs implementation.

So how come btrfs and ZFS can handle this, and md can't? Can't md use
the same techniques? (Seriously, I don't know the answer. But, like Nix,
when I feel I'm being fed the answer "we're not going to give you the
choice because we know better than you", I get cheesed off. If I get the
answer "we're snowed under, do it yourself" then that is normal and
acceptable.)
> 
> Otherwise I'm sure md and LVM folks have a feature list that
> represents a few years of work as it is without yet another pile on.
> 
>>
>> Report the error, give the user the tools to fix it, and LET THEM sort
>> it out. Just like we do when we run fsck on a filesystem.
> 
> They're not at all comparable. One is a file system, the other a raid
> implementation, they have nothing in common.
> 
> 
And what are file systems and raid implementations? They are both data
store abstractions. They have everything in common.

Oh and by the way, now I've realised my mistake, I've taken a look at
the paper you mention. In particular, section 4. Yes it does say you
can't detect and correct multi-disk errors - but that's not what we're
asking for!

By implication, it seems to be saying LOUD AND CLEAR that you CAN detect
and correct a single-disk error. So why the blankety-blank won't md let
you do that!
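
For the record, the single-error locate-and-fix that section 4 describes can be sketched in a few lines of GF(2^8) arithmetic. This is a toy byte-at-a-time model of one stripe, not md's actual code:

```python
# Toy model of RAID6 single-disk error identification (H. Peter Anvin,
# raid6.pdf section 4). One byte per "drive"; real arrays do this per
# byte position across whole strips.

def gf_mul(a, b, poly=0x11D):
    """Multiply in GF(2^8) with the RAID6 polynomial x^8+x^4+x^3+x^2+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

# log/antilog tables for the generator g = 2
EXP, LOG = [0] * 255, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x = gf_mul(x, 2)

def pq(data):
    """Compute the P (xor) and Q (Reed-Solomon) parity of a stripe."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(EXP[i], d)      # q += g^i * d_i
    return p, q

data = [0x11, 0x22, 0x33, 0x44]     # four data "drives"
P, Q = pq(data)                     # parity as written to disk

stripe = list(data)
stripe[2] ^= 0x5A                   # silent corruption on drive 2

sp, sq = P ^ pq(stripe)[0], Q ^ pq(stripe)[1]   # syndromes vs stored P, Q
assert sp != 0 and sq != 0          # both differ => a data drive is bad
z = (LOG[sq] - LOG[sp]) % 255       # sq/sp = g^z locates the bad drive
stripe[z] ^= sp                     # xor the error back out

assert z == 2 and stripe == data    # located and repaired
```

The caveat raised elsewhere in this thread still applies: the location is only trustworthy under the assumption that exactly one block in the stripe is wrong; multiple corruptions can conspire to point at an innocent drive.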

Neil's point seems to be that it's a bad idea to do it automatically. I
get his logic. But to then actively prevent you doing it manually - this
is the paternalistic attitude that gets my goat.

Anyways, I've been thinking about this, and I've got a proposal (RFC?).
I haven't got time right now - I'm supposed to be at work - but I'll
write it up this evening. If the response is "we're snowed under - it
sounds a good idea but do it yourself", then so be it. But if the
response is "we don't want the sysadmin to have the choice", then expect
more flak from people like Nix and me.

(And the proposal involves giving sysadmins CHOICE. If they want to take
the hit, it's *their* decision, not a paternalistic choice forced on them.)

(Sorry to keep on about paternalism, but there is a sense that decisions
have been made, and they're not going to be reversed "because I say so".
I'm NOT getting a "you want it, you write it" vibe, and that's what gets
to me.)

Cheers,
Wol


* Re: Fault tolerance with badblocks
  2017-05-10  3:53                               ` Chris Murphy
  2017-05-10  4:49                                 ` Wols Lists
@ 2017-05-10  5:00                                 ` Dave Stevens
  2017-05-10 16:44                                 ` Edward Kuns
  2 siblings, 0 replies; 69+ messages in thread
From: Dave Stevens @ 2017-05-10  5:00 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Wols Lists, Linux-RAID

Quoting Chris Murphy <lists@colorremedies.com>:

> On Tue, May 9, 2017 at 1:44 PM, Wols Lists <antlists@youngman.org.uk> wrote:
>
>>> This is totally non-trivial, especially because it says raid6 cannot
>>> detect or correct more than one corruption, and ensuring that
>>> additional corruption isn't introduced in the rare case is even more
>>> non-trivial.
>>
>> And can I point out that that is just one person's opinion?
>
> Right off the bat you ask a stupid question that contains the answer

snip!

you know Chris, I've read this twice and think it's abusive. You
shouldn't do this.

Dave

-- 
"As long as politics is the shadow cast on society by big business,
the attenuation of the shadow will not change the substance."

-- John Dewey


* Re: Fault tolerance with badblocks
  2017-05-09 20:18                             ` Nix
  2017-05-09 20:52                               ` Wols Lists
@ 2017-05-10  8:41                               ` David Brown
  1 sibling, 0 replies; 69+ messages in thread
From: David Brown @ 2017-05-10  8:41 UTC (permalink / raw)
  To: Nix, Chris Murphy
  Cc: Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID

On 09/05/17 22:18, Nix wrote:
> On 9 May 2017, Chris Murphy verbalised:
> 
>> On Tue, May 9, 2017 at 5:58 AM, David Brown <david.brown@hesbynett.no> wrote:
>>
>>> I thought you said that you had read Neil's article.  Please go back and
>>> read it again.  If you don't agree with what is written there, then
>>> there is little more I can say to convince you.
> 
> The entire article is predicated on the assumption that when an
> inconsistent stripe is found, fixing it is simple because you can just
> fail whichever device is inconsistent... but given that the whole
> premise of the article is that *you cannot tell which that is*, I don't
> see the point in failing anything.

The point is that if an inconsistent stripe is found, then there is no
way to be sure how to fix it correctly.  So scrub certainly will not
touch it.  And what should "repair" do?  I see several choices:

1. It could assume the data is correct, and re-create the parities.
This is simple, and it avoids changing anything on the array from the
viewpoint of higher levels (i.e., the filesystem).

2. It could do a "smart" repair of the stripe, if it sees that there is
only one inconsistent block in the stripe.

3. It could pass the problem on to higher level tools (possibly
correcting a single inconsistency in the P or Q parities first).

At the moment, raid6 repair follows the first choice here.  Many people
seem to think the second choice is a good idea.  Personally, I would say
choice 3 is right - but unless and until higher level tools are
available, I think 1 is no worse than 2 - and it is simpler, clearer,
and works today.

Key to why I don't like choice 2 is a question of why you have a
mismatch in the first place.  Undetected read errors - the drive
returning wrong data as though it were correct data - are astoundingly
rare.  Even on huge disks, they do not occur often.  (Unrecoverable read
errors - when the drive reports a sector as unreadable - are not
uncommon.  That is what raid is for.)  If you get a mismatch, a likely
cause is a crash or power fault during a stripe write.  Another main
cause is hardware errors such as memory faults.  "Smart" repair can make
the situation worse.

Secondly, "smart" repair means changing the data on the disk.  You can't
do that while a file system is mounted (unless you want to risk chaos).
One major reason for using raid is to minimise downtime of a system in
the event of problems - offline repair goes against that philosophy.


What do I mean about passing the problem on to higher levels?  One
example would be if there is another raid level sitting above, such as
a raid1 pair of raid6 arrays (it would make more sense the other way
round - the same principle applies there).  The raid6 level could ask
the block layer above if that layer can re-create the correct data.  In
the case of a raid1 pair at a higher level, then it could - that way the
stripe would be written with the full known correct data, rather than
just a guess.  Perhaps the layer above is a filesystem - this could say
if that stripe is actually in use (no need to worry if it is in deleted
space), or if it can re-create the data from a BTRFS duplicate.

Failing that, a tool could interact with the filesystem to determine
what sort of data was on that stripe, and perhaps check it in some way.
At least a tool could run a consistency check - would the filesystem be
consistent if the stripe was "smart repaired", or would it be consistent
if the stripe data was left untouched (and the P & Q parities recreated)?

A simple method here could be to mark the whole stripe as unreadable,
then run a filesystem check.  If there are higher level raids that can
re-create the lost stripe, that will happen automatically.  If not, then
the filesystem repair will ensure that the filesystem is consistent even
though data may be lost.

And of course, a higher level repair tool could be one that simply runs
a "smart repair" on the stripe.



All in all, when there is /no/ correct answer, I think we have to be
very careful about picking methods here.  Before switching to a "smart"
repair, rather than the simple method, we have to be /very/ sure that it
gives noticeably "better" results in real-world cases.  We can't just
say it sounds good - we need to know.


> 
> The first comment in the article is someone noting that md doesn't say
> which device is failing, what the location of the error is or anything
> else a sysadmin might actually find useful for fixing it. "Hey, you have
> an error somewhere on some disk on this multi-terabyte array which might
> be data corruption and if a disk fails will be data corruption!" is not
> too useful :( 

I haven't looked at the information you get out of the scrub, but of
course more information is better than less information.

> The fourth comment notes that the "smart" approach, given
> RAID-6, has a significantly higher chance of actually fixing the problem
> than the simple approach. I'd call that a fairly important comment...
> 
> (Neil said: "Similarly a RAID6 with inconsistent P and Q could well not
> be able to identify a single block which is "wrong" and even if it could
> there is a small possibility that the identified block isn't wrong, but
> the other blocks are all inconsistent in such a way as to accidentally
> point to it. The probability of this is rather small, but it is
> non-zero".

It is true that for some causes of mismatches, the "smart" repair has a
high chance of being correct.

> As far as I can tell the probability of this is exactly the
> same as that of multiple read errors in a single stripe -- possibly far
> lower, if you need not only multiple wrong P and Q values but *precisely
> mis-chosen* ones. If that wasn't acceptably rare, you wouldn't be using
> RAID-6 to begin with.
> 
> I've been talking all the time about a stripe which is singly
> inconsistent: either all the data blocks are fine and one of P or Q is
> fine, or both P and Q and all but one data block is fine, and the
> remaining block is inconsistent with all the rest. Obviously if more
> blocks are corrupt, you can do nothing but report it. The redundancy
> simply isn't there to attempt repair.)

Or possibly mark the whole stripe as "unreadable", and punt the problem
to the higher levels.

> 
>> H. Peter Anvin's RAID 6 paper, section 4 is what's apparently under discussion
>> http://milbret.anydns.info/pub/linux/kernel/people/hpa/raid6.pdf
>>
>> This is totally non-trivial, especially because it says raid6 cannot
>> detect or correct more than one corruption, and ensuring that
>> additional corruption isn't introduced in the rare case is even more
>> non-trivial.
> 
> Yeah. Testing this is the bastard problem, really. Fault injection via
> dm is the only approach that seems remotely practical to me.

That's what the "FAULTY" raid level in md is for :-)

But what are the /realistic/ fault situations?

> 
>> I do think it's sane for raid6 repair to avoid the current assumption
>> that data strip is correct, by doing the evaluation in equation 27. If
>> there's no corruption do nothing, if there's corruption of P or Q then
>> replace, if there's corruption of data, then report but do not repair
> 
> At least indicate *where* the corruption is in the report. (I'd say
> "repair, as a non-default option" for people with a different
> availability/P(corruption) tradeoff -- since, after all, if you're using
> RAID in the first place you value high availability across disk problems
> more than most people do, and there is a difference between one bit of
> unreported damage that causes a near-certain restore from backup and
> either zero or two of them plus a report with an LBA attached so you
> know you need to do something...)

One thing to consider here is the sort of person using the raid array.
When Neil wrote his article, raid6 would only be used by an expert.  He
did not want to change existing data and make life harder for the
systems administrator doing more serious repair.

However, these days the raid6 "administrator" may be someone who owns a
NAS box and has no idea what raid, or even Linux, actually is.  In such
cases, "smart" repair is probably the best idea if the filesystem on top
is not BTRFS.

> 
>> as follows:
>>
>> 1. md reports all data drives and the LBAs for the affected stripe
>> (otherwise this is not simple if it has to figure out which drive is
>> actually affected but that's not required, just a matter of better
>> efficiency in finding out what's really affected.)
> 
> Yep.
> 
>> 2. the file system needs to be able to accept the error from md
> 
> It would probably need to report this as an -EIO, but I don't know of
> any filesystems that can accept asynchronous reports of errors like
> this. You'd need reverse mapping to even stand a chance (a non-default
> option on xfs, and of course available on btrfs and zfs too). You'd
> need self-healing metadata to stand a chance of doing anything about it.
> And god knows what a filesystem is meant to do if part of the file data
> vanishes. Replace it with \0? ugh. I'd almost rather have the error
> go back out to a monitoring daemon and have it send you an email...
> 
>> 3. the file system reports what it negatively impacted: file system
>> metadata or data and if data, the full filename path.
>>
>> And now suddenly this work is likewise non-trivial.
> 
> Yeah, it's all the layers stacked up to the filesystem that are buggers
> to deal with... and now the optional 'just repair it dammit' approach
> seems useful again, if just because it doesn't have to deal with all
> these extra layers.
> 
>> And there is already something that will do exactly this: ZFS and
>> Btrfs. Both can unambiguously, efficiently determine whether data is
>> corrupt even if a drive doesn't report a read error.
> 
> Yeah. Unfortunately both have their own problems: ZFS reimplements the
> page cache and adds massive amounts of inefficiency in the process, and
> btrfs is... well... not really baked enough for the sort of high-
> availability system that's going to be running RAID, yet. (Alas!)

I disagree about BTRFS here.  First, raid is a good idea no matter how
"experimental" you consider your filesystem.  Second, BTRFS is solid
enough for a great many uses - I use it on laptops, desktops and servers.

/No/ storage system should be viewed as infallible - backups are
important.  So if BTRFS were to eat my data, then I'd get it back from
backups - just as I would if the server died, both disks failed, it got
stolen, or whatever.

But BTRFS on our servers means very cheap regular snapshots.  That
protects us from the biggest cause of data loss - user error.

> 
> (Recent xfs can do the same with metadata, but not data.)
> 



* Re: Fault tolerance with badblocks
  2017-05-10  3:53                               ` Chris Murphy
  2017-05-10  4:49                                 ` Wols Lists
  2017-05-10  5:00                                 ` Dave Stevens
@ 2017-05-10 16:44                                 ` Edward Kuns
  2017-05-10 18:09                                   ` Chris Murphy
  2 siblings, 1 reply; 69+ messages in thread
From: Edward Kuns @ 2017-05-10 16:44 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Wols Lists, Linux-RAID

On Tue, May 9, 2017 at 10:53 PM, Chris Murphy <lists@colorremedies.com> wrote:
> Scrubs don't happen by default.

From the perspective of Linux Raid authors, that is true.  However,
the version of Fedora I have installed on my server at home does
weekly scrubs by default.  This is arguably a good thing, considering
that many people installing this OS will not proactively research the
technologies in use holding their server together and won't know that
there are certain maintenance activities that are essential if you
care about your data.

I'm not getting involved in the bigger discussion.  My opinion is too
uninformed to say anything there.  I just wanted to point out that
*from the viewpoint of some users*, scrubs *will* happen by default.
That is all.

            Eddie


* Re: Fault tolerance with badblocks
  2017-05-10  4:49                                 ` Wols Lists
@ 2017-05-10 17:18                                   ` Chris Murphy
  2017-05-16  3:20                                   ` NeilBrown
  1 sibling, 0 replies; 69+ messages in thread
From: Chris Murphy @ 2017-05-10 17:18 UTC (permalink / raw)
  To: Wols Lists; +Cc: Chris Murphy, Linux-RAID

On Tue, May 9, 2017 at 10:49 PM, Wols Lists <antlists@youngman.org.uk> wrote:
> On 10/05/17 04:53, Chris Murphy wrote:
>> On Tue, May 9, 2017 at 1:44 PM, Wols Lists <antlists@youngman.org.uk> wrote:
>>
>>>> This is totally non-trivial, especially because it says raid6 cannot
>>>> detect or correct more than one corruption, and ensuring that
>>>> additional corruption isn't introduced in the rare case is even more
>>>> non-trivial.
>>>
>>> And can I point out that that is just one person's opinion?
>>
>> Right off the bat you ask a stupid question that contains the answer
>> to your own stupid question. This is condescending and annoying, and
>> it invites treating you with suspicion as a troll. But then you make
>> it worse by saying it again:
>>
> Sorry. But I thought we were talking about *Neil's* paper. My bad for
> missing it.

Doesn't matter. Your standard is mere opinions are ignorable, and
therefore by your own standard you can be ignored for posing mere
opinions yourself. You set your own trap but you clearly want to hold
a double standard: your opinions are valid and should be listened to,
and others' opinions are merely opinion and can be easily discarded.


>>> A
>>> well-informed, respected person true, but it's still just opinion.
>>
>> Except it is not just an opinion, it's a fact by any objective reader
>> who isn't even a programmer, let alone if you know something about
>> math and/or programming. Let's break down how totally stupid your
>> position is.
>>
>
> <snip ad hominems :-) >

It is not an ad hominem attack to evaluate your lack of logic. An ad
hominem attack is one on the person rather than their arguments. I
haven't attacked you, I've attacked your arguing style and the deep
ignorance that style conveys. And you shouldn't like it, but you have
only yourself to blame, you didn't exactly bother to do any list
archive research before declaring everyone foolish for having
withheld this feature from you personally. It almost immediately
became noise.


>>> At the end of the day, md should never corrupt data by default. Which is
>>> what it sounds like is happening at the moment, if it's assuming the
>>> data sectors are correct and the parity is wrong. If one parity appears
>>> correct then by all means rewrite the second ...
>>
>> This is an obtuse and frankly malicious characterization. Scrubs don't
>> happen by default. And scrub repair's assuming data strips are correct
>> is well documented. If you don't like this assumption, don't use scrub
>> repair. You can't say corruption happens by default unless you admit
>> that there are UREs on a drive by default - of course that's absurd and
>> makes no sense.
>>
> Documenting bad behaviour doesn't turn it into good behaviour, though ...

It is a common loophole to describe the chosen behavior when good
behavior is difficult or infeasible. It happens all the time.
Complaining here isn't going to change this.


>>>
>>> But the current setup, where it's currently quite happy to assume a
>>> single-drive error and rewrite it if it's a parity drive, but it won't
>>> assume a single-drive error and rewrite it if it's a data drive,
>>> just seems totally wrong. Worse, in the latter case, it seems it
>>> actively prevents fixing the problem by updating the parity and
>>> (probably) corrupting the data.
>>
>> The data is already corrupted by definition. No additional damage to
>> data is done. What does happen is good P and Q are replaced by bad P
>> and Q which matches the already bad data.
>
> Except, in my world, replacing good P & Q by bad P & Q *IS* doing
> additional damage!

Arguing about it doesn't make it true. The primary data is corrupt and
in normal operation P & Q are not checked, so it will always silently
return corrupt data in normal operation, and if there is a failure
that does not exactly coincide with the corruption, the corruption
that is read in the ensuing reconstruction will corrupt the
reconstruction even though P & Q are good. So what you want to fix is
a lot of cost for almost no gain.

>We can identify and fix the bad data. So why don't
> we? Throwing away good P & Q prevents us from doing that, and means we
> can no longer recover the good data!

There is no possible way to know that P & Q are both good. That
requires assumption. So you've arbitrarily traded an assumption you
don't like for one that you do like, but have no evidence for in
either case.

There are better ways to solve this problem. md and LVM raid are
really about solving one or two particular problems, and data
integrity is not among them: they provide data availability and
recovery via reconstruction rather than from backups being restored.

Better is defined by the use case at hand. Some use cases will want
this solved at the file system level, which points to ZFS or Btrfs -
the very problem you're talking about is one of those problems that
led to the design of both of those file systems. Other use cases can
have it solved at an application level. And still others will solve it
with a cluster file system, like glusterfs does with per file
checksums and replication.


>> And nevertheless you have the very real problem that drives lie about
>> having committed data to stable media. And they reorder writes,
>> breaking the write order assumptions of things. And we have RMW
>> happening on live arrays. And that means you have a real likelihood
>> that you cannot absolutely determine with the available information
>> why P and Q don't agree with the data, you're still making probability
>> assumptions and if that assumption is wrong any correction will
>> introduce more corruption.
>>
>> The only unambiguous way to do this has already been done and it's ZFS
>> and Btrfs. And a big part of why they can do what they do is because
>> they are copy on write. If you need to solve the problem of ambiguous
>> data strip integrity in relation to P and Q, then use ZFS. It's
>> production ready. If you are prepared to help test and improve things,
>> then you can look into the Btrfs implementation.
>
> So how come btrfs and ZFS can handle this, and md can't?

All data and metadata blocks are checksummed, and they're always
verified during normal operation for every read. The data checksums
are themselves checksummed. Even if a drive does not report an error,
errors can be detected and trigger reconstruction if redundant
metadata or data is available.
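To make the mechanism concrete, here is a toy sketch in C of
verify-on-read with fallback to a redundant copy. It is not btrfs's or
ZFS's actual code (they use crc32c and similar over fixed-size blocks);
the Fletcher-16 stand-in checksum below is purely for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy Fletcher-16 checksum standing in for a real block checksum. */
static unsigned int csum(const unsigned char *buf, size_t len)
{
	unsigned int a = 0, b = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		a = (a + buf[i]) % 255;
		b = (b + a) % 255;
	}
	return (b << 8) | a;
}

struct copy {
	const unsigned char *data;
	unsigned int stored_csum;	/* recorded at write time */
};

/* Return the index of the first replica whose contents still match
 * the checksum recorded when it was written, or -1 if every copy is
 * bad.  This catches silent corruption even when the drive itself
 * reported a successful read. */
static int verified_read(const struct copy *copies, int n, size_t len)
{
	int i;

	for (i = 0; i < n; i++)
		if (csum(copies[i].data, len) == copies[i].stored_csum)
			return i;
	return -1;
}
```

The point is that a read only succeeds if the data still matches the
checksum recorded at write time, so corruption is caught before bad
data is ever returned to the caller.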

md does not checksum anything but its own metadata, which is just the
superblock; there isn't much of anything else to it. There are no
checksums for data strips or parity strips, and no timestamps for any
of the writes. There's a distinct lack of information for doing an
autopsy after the fact without making assumptions.

> Can't md use
> the same techniques. (Seriously, I don't know the answer. But, like Nix,
> when I feel I'm being fed the answer "we're not going to give you the
> choice because we know better than you", I get cheesed off. If I get the
> answer "we're snowed under, do it yourself" then that is normal and
> acceptable.)

No, they operate on completely different architectures and assumptions.
You really should search the archives; all of these things you're
wanting to discuss now have already been discussed and argued, and
nothing has changed.



>>
>> Otherwise I'm sure md and LVM folks have a feature list that
>> represents a few years of work as it is without yet another pile on.
>>
>>>
>>> Report the error, give the user the tools to fix it, and LET THEM sort
>>> it out. Just like we do when we run fsck on a filesystem.
>>
>> They're not at all comparable. One is a file system, the other a raid
>> implementation, they have nothing in common.
>>
>>
> And what are file systems and raid implementations? They are both data
> store abstractions. They have everything in common.

They have almost nothing in common. File systems store files. RAID does
not know anything at all about files. RAID has a superblock and a
couple of optional logs for very specific purposes; there are no
trees. RAID works by logical assumptions about where things are
located; it doesn't do lookups using metadata to find your data. It's
all determined by geometry, totally unlike a file system.


>
> Oh and by the way, now I've realised my mistake, I've taken a look at
> the paper you mention. In particular, section 4. Yes it does say you
> can't detect and correct multi-disk errors - but that's not what we're
> asking for!
>
> By implication, it seems to be saying LOUD AND CLEAR that you CAN detect
> and correct a single-disk error. So why the blankety-blank won't md let
> you do that!

It's one particular kind of error, and there isn't enough on-disk
metadata to differentiate this particular kind of error after the
fact. You're looking at this problem in total isolation from all other
problems. And you're not familiar with the lack of information
available in the corpse.

Neil's version of this explanation:

"Similarly a RAID6 with inconsistent P and Q could well not be able to
identify a single block which is "wrong" and even if it could there is
a small possibility that the identified block isn't wrong, but the
other blocks are all inconsistent in such a way as to accidentally
point to it. The probability of this is rather small, but it is
non-zero."

The autofix in such a case could cause more damage.


>
> Neil's point seems to be that it's a bad idea to do it automatically. I
> get his logic. But to then actively prevent you doing it manually - this
> is the paternalistic attitude that gets my goat.

You have no example code. You've basically come on the list, without
any prior research, and said "GIMME!"

*shrug*



>
> Anyways, I've been thinking about this, and I've got a proposal (RFC?).
> I haven't got time right now - I'm supposed to be at work - but I'll
> write it up this evening. If the response is "we're snowed under - it
> sounds a good idea but do it yourself", then so be it. But if the
> response is "we don't want the sysadmin to have the choice", then expect
> more flak from people like Nix and me.

1. The default response without having to say it is "we're snowed
under, show us a proof of concept first".
2. You showed no imagination in assuming this has never come up
before, thinking you're the first to have this feature in mind.

You have an idea; the burden is on you to demonstrate a need, provide
code examples, and ask the right questions, like "would the maintainers
accept some changes to error reporting for scrub checks?" At the very
least, what you suggest implies error reporting enhancements, so why
not ask about those parameters?

Instead, from the outset you treated this resistance as if other
people are your grumpy daddy and they're just being mean to you.
That's why you got the reception you did. Mischaracterizing other
people as being paternalistic isn't going to help get a different
perception. (I was thinking of Commander Sela, referring to Toral,
when she said "Silence the child or send him away!")

My proposal for your proposal is a patch that implements equation 27
from HPA's paper, and enhances error reporting per its descriptive
outcomes.

md: error: mismatch, P corruption, array logical <LBA>
md: error: mismatch, Q corruption, array logical <LBA>
md: error: mismatch, data corruption suspected, array logical <LBA>

That's subject to wording and formatting discussion; I have not looked
at existing formatting, but you need to ask whether approximately this
would be accepted.

However, the main point is that you need to find out what the
computational cost of this scrub enhancement is. If it takes 5
times longer, even you will laugh and say it's not worth it. Stop
asking "why isn't this already implemented! do it now! now! now! now!"
Instead ask "what is the ballpark maximum performance impact to scrub
that would be accepted? And if that maximum is busted, would
maintainers consider a new value "check2", as in echo check2 >
/sys/block/mdX/md/sync_action?"

Once you have better error reporting, a user space tool could use the
array metadata and the reported LBA to look up that stripe and
reconstruct just that stripe under the assumption that P & Q are
correct, and hopefully fix your data. Or whatever other assumptions
you want to make to attempt different recoveries. That user space
tool could also back up the existing stripe so the fixes are all
reversible.
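The backup step of such a hypothetical tool could be as simple as the
following sketch (512-byte sectors assumed; in real use 'dev' would be
the md array device, e.g. /dev/md0):

```c
#include <stdio.h>
#include <stdlib.h>

/* Save the raw contents of one suspect stripe from the array device
 * to a file, so that any attempted fix can be rolled back.  Returns
 * the number of bytes saved, or -1 on any failure. */
static long backup_stripe(const char *dev, const char *out_path,
			  long start_sector, long sectors)
{
	long len = sectors * 512, written = -1;
	FILE *in = fopen(dev, "rb");
	FILE *out = fopen(out_path, "wb");
	unsigned char *buf = malloc(len);

	if (in && out && buf &&
	    fseek(in, start_sector * 512, SEEK_SET) == 0 &&
	    fread(buf, 1, len, in) == (size_t)len &&
	    fwrite(buf, 1, len, out) == (size_t)len)
		written = len;

	free(buf);
	if (in)
		fclose(in);
	if (out)
		fclose(out);
	return written;
}
```

Restoring is the same copy in the other direction, which is what makes
every experimental "fix" reversible.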


-- 
Chris Murphy


* Re: Fault tolerance with badblocks
  2017-05-10 16:44                                 ` Edward Kuns
@ 2017-05-10 18:09                                   ` Chris Murphy
  0 siblings, 0 replies; 69+ messages in thread
From: Chris Murphy @ 2017-05-10 18:09 UTC (permalink / raw)
  To: Edward Kuns; +Cc: Chris Murphy, Wols Lists, Linux-RAID

On Wed, May 10, 2017 at 10:44 AM, Edward Kuns <eddie.kuns@gmail.com> wrote:
> On Tue, May 9, 2017 at 10:53 PM, Chris Murphy <lists@colorremedies.com> wrote:
>> Scrubs don't happen by default.
>
> From the perspective of Linux Raid authors, that is true.  However,
> the version of Fedora I have installed on my server at home does
> weekly scrubs by default.

That is a check scrub, not a repair scrub, so it still wouldn't
obliterate "good" P & Q by default.

> This is arguably a good thing, considering
> that many people installing this OS will not proactively research the
> technologies in use holding their server together and won't know that
> there are certain maintenance activities that are essential if you
> care about your data.
>
> I'm not getting involved in the bigger discussion.  My opinion is too
> uninformed to say anything there.  I just wanted to point out that
> *from the viewpoint of some users*, scrubs *will* happen by default.
> That is all.


Absolutely, just not the kind of scrub that's being accused of
damaging assumed-good P & Q parity.

-- 
Chris Murphy


* Re: Fault tolerance with badblocks
  2017-05-09 21:32                     ` NeilBrown
@ 2017-05-10 19:03                       ` Nix
  0 siblings, 0 replies; 69+ messages in thread
From: Nix @ 2017-05-10 19:03 UTC (permalink / raw)
  To: NeilBrown; +Cc: Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, linux-raid

On 9 May 2017, NeilBrown outgrape:

> On Tue, May 09 2017, Nix wrote:
>> Neil decided not to do any repair work in this case on the grounds that
>> if the drive is misdirecting one write it might misdirect the repair as
>> well
>
> My justification was a bit broader than that.

I noticed your trailing comment on the blog post only after sending all
these emails out :( bah!

>  If you get a consistency error on RAID6, there is not one model to
>  explain it which is significantly more likely than any other model.

Yeah, I'm quite satisfied with "we don't have enough data to know if
repairing is safe" as reasoning: among other things it suggests that
mismatches are really rare, which is reassuring! This certainly suggests
that repairing should be, at the very least, off by default, and I'm not
terribly unhappy for it to not exist.

... but I do want to at least report the location of stripes that fail
checks, as in my earlier ugly patch. That's useful for any array with >1
partition or LVM LV on it. ("Oh, that mismatch is harmless, it's in
swap. That one is in small_but_crucial_lv, I'll restore it from backup,
without affecting the massive_messy_lv which had no mismatches and would
take weeks to restore.")
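As a sketch of how that lookup might work for LVM, a helper could
match the reported sector against `dmsetup table` output for linear
targets. The LV names and numbers below are invented for illustration:

```c
#include <stdio.h>
#include <string.h>

/* Given one line of `dmsetup table` output in the form
 *   <name>: <start> <length> linear <major:minor> <offset>
 * decide whether a mismatch sector reported against the md array
 * (identified by its major:minor) falls inside this linear target.
 * On a hit, fill in the LV name and the LV-relative sector. */
static int sector_in_target(const char *line, const char *md_devno,
			    unsigned long long sector,
			    char name[64], unsigned long long *rel)
{
	unsigned long long start, length, offset;
	char devno[32], type[16];

	if (sscanf(line, "%63[^:]: %llu %llu %15s %31s %llu",
		   name, &start, &length, type, devno, &offset) != 6)
		return 0;
	if (strcmp(type, "linear") != 0 || strcmp(devno, md_devno) != 0)
		return 0;
	if (sector < offset || sector >= offset + length)
		return 0;
	*rel = start + (sector - offset);
	return 1;
}
```

Run over every table line, this answers "which LV (and where in it)
does this mismatch land in?" without touching the array itself.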

(As far as I'm concerned, if you don't *have* a backup of some fs, you
deserve what's coming to you! Good backups are easy and with md you can
even make them as resilient as the main RAID arrays. I'm interested in
maximizing availability here: having to take a big array with many LVs
down for ages for a restore because you don't know which bit is
corrupted just seems *wrong*.)


* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-09 21:06                             ` A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) Nix
@ 2017-05-12 11:14                               ` Nix
  2017-05-16  3:27                               ` NeilBrown
  1 sibling, 0 replies; 69+ messages in thread
From: Nix @ 2017-05-12 11:14 UTC (permalink / raw)
  To: Chris Murphy
  Cc: David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID

On 9 May 2017, nix@esperi.org.uk outgrape:

> On 9 May 2017, Chris Murphy verbalised:
>
>> 1. md reports all data drives and the LBAs for the affected stripe
>
> Enough rambling from me. Here's a hilariously untested patch against
> 4.11 (as in I haven't even booted with it: my systems are kind of in
> flux right now as I migrate to the md-based server that got me all
> concerned about this). It compiles! And it's definitely safer than
> trying a repair, and makes it possible to recover from a real mismatch
> without losing all your hair in the process, or determine that a
> mismatch is spurious or irrelevant. And that's enough for me, frankly.
> This is a very rare problem, one hopes.
>
> (It's probably not ideal, because the error is just known to be
> somewhere in that stripe, not on that sector, which makes determining
> the affected data somewhat harder. But at least you can figure out what
> filesystem it's on. :) )

Aside: this foolish optimist hopes that it might be fairly easy to tie
the new GETFSMAP ioctl() into mismatch reports if the filesystem(s)
overlying a mismatched stripe support it: it looks like we could get the
necessary info for a whole stripe in a single call. Being automatically
told "these files may be corrupted, restore them" or "oops you lost some
metadata on fses A and B, run fsck" would be wonderful. (Though the
actual corruption would be less wonderful.)

This feels like something mdadm's monitor mode should be able to do, to
me. I'll have a look in a bit, but I know nothing about the
implementation of monitor mode at all so I have some learning to do
first...

-- 
NULL && (void)


* Re: Fault tolerance with badblocks
  2017-05-10  4:49                                 ` Wols Lists
  2017-05-10 17:18                                   ` Chris Murphy
@ 2017-05-16  3:20                                   ` NeilBrown
  1 sibling, 0 replies; 69+ messages in thread
From: NeilBrown @ 2017-05-16  3:20 UTC (permalink / raw)
  To: Wols Lists, Chris Murphy; +Cc: Linux-RAID


On Wed, May 10 2017, Wols Lists wrote:

> On 10/05/17 04:53, Chris Murphy wrote:
>> 
>> The data is already corrupted by definition. No additional damage to
>> data is done. What does happen is good P and Q are replaced by bad P
>> and Q which matches the already bad data.
>
> Except, in my world, replacing good P & Q by bad P & Q *IS* doing
> additional damage! We can identify and fix the bad data. So why don't
> we? Throwing away good P & Q prevents us from doing that, and means we
> can no longer recover the good data!
>> 
>> And nevertheless you have the very real problem that drives lie about
>> having committed data to stable media. And they reorder writes,
>> breaking the write order assumptions of things. And we have RMW
>> happening on live arrays. And that means you have a real likelihood
>> that you cannot absolutely determine with the available information
>> why P and Q don't agree with the data, you're still making probability
>> assumptions and if that assumption is wrong any correction will
>> introduce more corruption.
>> 
>> The only unambiguous way to do this has already been done and it's ZFS
>> and Btrfs. And a big part of why they can do what they do is because
>> they are copy on write. If you need to solve the problem of ambiguous
>> data strip integrity in relation to P and Q, then use ZFS. It's
>> production ready. If you are prepared to help test and improve things,
>> then you can look into the Btrfs implementation.
>
> So how come btrfs and ZFS can handle this, and md can't? Can't md use
> the same techniques. (Seriously, I don't know the answer.

Security theater?
I don't actually know what, specifically, btrfs and ZFS do, so I cannot
say for certain.  But I am far from convinced by what I know.

I come back to the same question I always come back to.  Is there a
likely cause for a particular anomaly, and does a particular action
properly respond to that cause.  I don't like addressing symptoms, I
like addressing causes.

In the case of a resync after an unclean shutdown, if I find a stripe in
which P and Q are not consistent with the data, then a likely cause is
that some, but not all, blocks in a new stripe were written just before
the crash.  If the array is not degraded, it is likely that the data is
all valid and P and Q are not needed.  So it makes sense to regenerate P
and Q.  Other responses might also make sense, but they don't make
*more* sense.  And regenerating P and Q is obvious and easy.  If the
array is degraded and a data block is lost, there is no reliable way to
recover that block.  So md refuses to start the array by default.

If you find an inconsistent data block during a scrub, then I have no
idea what could have caused that, so I cannot suggest anything
(actually I have lots of ideas, but most of them suggest you should
replace your hardware and test your backups). Maybe there is a way to
recover data, maybe there is no need.  I cannot tell.  raid6recover is a
tool that can be used by a sysadmin to explore options.  Maybe not a
perfect tool, but it has some uses.

>                                                           But, like Nix,
> when I feel I'm being fed the answer "we're not going to give you the
> choice because we know better than you", I get cheesed off. If I get the
> answer "we're snowed under, do it yourself" then that is normal and
> acceptable.)

The main reason I have never implemented your idea of "validate every
block before reporting a successful read" is that I genuinely don't
think many people would use it.  Writing code that won't be used is not
very rewarding.
The simple way to provide evidence to the contrary is to turn the
interest into cash.  If 1000 people all give $10 to get it done, I
suspect we could make it happen.

>> 
>> Otherwise I'm sure md and LVM folks have a feature list that
>> represents a few years of work as it is without yet another pile on.
>> 
>>>
>>> Report the error, give the user the tools to fix it, and LET THEM sort
>>> it out. Just like we do when we run fsck on a filesystem.
>> 
>> They're not at all comparable. One is a file system, the other a raid
>> implementation, they have nothing in common.
>> 
>> 
> And what are file systems and raid implementations? They are both data
> store abstractions. They have everything in common.
>
> Oh and by the way, now I've realised my mistake, I've taken a look at
> the paper you mention. In particular, section 4. Yes it does say you
> can't detect and correct multi-disk errors - but that's not what we're
> asking for!
>
> By implication, it seems to be saying LOUD AND CLEAR that you CAN detect
> and correct a single-disk error. So why the blankety-blank won't md let
> you do that!
>
> Neil's point seems to be that it's a bad idea to do it automatically. I
> get his logic. But to then actively prevent you doing it manually - this
> is the paternalistic attitude that gets my goat.

I'm certainly not actively preventing you.  I certainly wouldn't object
to a patch which reports the details of mismatches.  I myself was never
motivated enough to write one. That might be inactively preventing you,
but not actively preventing you.

NeilBrown



* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-09 21:06                             ` A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) Nix
  2017-05-12 11:14                               ` Nix
@ 2017-05-16  3:27                               ` NeilBrown
  2017-05-16  9:13                                 ` Nix
  2017-05-16 21:11                                 ` NeilBrown
  1 sibling, 2 replies; 69+ messages in thread
From: NeilBrown @ 2017-05-16  3:27 UTC (permalink / raw)
  To: Nix, Chris Murphy
  Cc: David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID


On Tue, May 09 2017, Nix wrote:

> On 9 May 2017, Chris Murphy verbalised:
>
>> 1. md reports all data drives and the LBAs for the affected stripe
>
> Enough rambling from me. Here's a hilariously untested patch against
> 4.11 (as in I haven't even booted with it: my systems are kind of in
> flux right now as I migrate to the md-based server that got me all
> concerned about this). It compiles! And it's definitely safer than
> trying a repair, and makes it possible to recover from a real mismatch
> without losing all your hair in the process, or determine that a
> mismatch is spurious or irrelevant. And that's enough for me, frankly.
> This is a very rare problem, one hopes.
>
> (It's probably not ideal, because the error is just known to be
> somewhere in that stripe, not on that sector, which makes determining
> the affected data somewhat harder. But at least you can figure out what
> filesystem it's on. :) )
>
> 8<------------------------------------------------------------->8
> From: Nick Alcock <nick.alcock@oracle.com>
> Subject: [PATCH] md: report sector of stripes with check mismatches
>
> This makes it possible, with appropriate filesystem support, for a
> sysadmin to tell what is affected by the mismatch, and whether
> it should be ignored (if it's inside a swap partition, for
> instance).
>
> We ratelimit to prevent log flooding: if there are so many
> mismatches that ratelimiting is necessary, the individual messages
> are relatively unlikely to be important (either the machine is
> swapping like crazy or something is very wrong with the disk).
>
> Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
> ---
>  drivers/md/raid5.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index ed5cd705b985..bcd2e5150e29 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3959,10 +3959,14 @@ static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh,
>  			set_bit(STRIPE_INSYNC, &sh->state);
>  		else {
>  			atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches);
> -			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
> +			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
>  				/* don't try to repair!! */
>  				set_bit(STRIPE_INSYNC, &sh->state);
> -			else {
> +				pr_warn_ratelimited("%s: mismatch around sector "
> +						    "%llu\n", __func__,
> +						    (unsigned long long)
> +						    sh->sector);
> +			} else {

I think there is no point giving the function name,
but that you should give the name of the array.
Also "around" is a little vague.
Maybe something like:

> +				pr_warn_ratelimited("%s: mismatch sector in range "
> +						    "%llu-%llu\n", mdname(conf->mddev),
> +						    (unsigned long long) sh->sector,
> +						    (unsigned long long) sh->sector + STRIPE_SECTORS);

As an optional enhancement, you could add "will recalculate P/Q" or
"left unchanged" as appropriate.

Providing at least that the array name is included in the message, I
support this patch.

NeilBrown



>  				sh->check_state = check_state_compute_run;
>  				set_bit(STRIPE_COMPUTE_RUN, &sh->state);
>  				set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
> @@ -4111,10 +4115,14 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
>  			}
>  		} else {
>  			atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches);
> -			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
> +			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
>  				/* don't try to repair!! */
>  				set_bit(STRIPE_INSYNC, &sh->state);
> -			else {
> +				pr_warn_ratelimited("%s: mismatch around sector "
> +						    "%llu\n", __func__,
> +						    (unsigned long long)
> +						    sh->sector);
> +			} else {
>  				int *target = &sh->ops.target;
>  
>  				sh->ops.target = -1;
> -- 
> 2.12.2.212.gea238cf35.dirty
>



* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-16  3:27                               ` NeilBrown
@ 2017-05-16  9:13                                 ` Nix
  2017-05-16 21:11                                 ` NeilBrown
  1 sibling, 0 replies; 69+ messages in thread
From: Nix @ 2017-05-16  9:13 UTC (permalink / raw)
  To: NeilBrown
  Cc: Chris Murphy, David Brown, Anthony Youngman, Phil Turmel,
	Ravi (Tom) Hale, Linux-RAID

On 16 May 2017, NeilBrown said:

>> -			else {
>> +				pr_warn_ratelimited("%s: mismatch around sector "
>> +						    "%llu\n", __func__,
>> +						    (unsigned long long)
>> +						    sh->sector);
>> +			} else {
>
> I think there is no point giving the function name,
> but that you should give the name of the array.

*ouch* I can't believe I forgot that. I have more than one array
myself... "we have a fault but we don't know what array it's on" is not
much of an improvement over the status quo, really! (though you could
make a good guess by looking for preceding sync-start messages, but
you can of course sync two arrays at the same time...)

> Also "around" is a little vague.

Intentionally: I couldn't think of the right terminology. Yours is
better.

> Maybe something like:
>
>> +				pr_warn_ratelimited("%s: mismatch sector in range "
>> +						    "%llu-%llu\n", mdname(conf->mddev),
>> +						    (unsigned long long) sh->sector,
>> +						    (unsigned long long) sh->sector + STRIPE_SECTORS);

Nice! Here's a rerolled patch. (We exceed the 80-char limit but that's
pr_warn_ratelimited()'s fault for having such a long name!)

Tested by making a raid array on a bunch of sparse files then dding a
byte of garbage into one of them and checking it. I got a nice error
message, name and all, and the sector count looked good.


From f05a451d46900849c7965a0e7dde085f1fb50dfc Mon Sep 17 00:00:00 2001
From: Nick Alcock <nick.alcock@oracle.com>
Date: Tue, 9 May 2017 21:55:17 +0100
Subject: [PATCH] md: report sector of stripes with check mismatches

This makes it possible, with appropriate filesystem support, for a
sysadmin to tell what is affected by the mismatch, and whether
it should be ignored (if it's inside a swap partition, for
instance).

We ratelimit to prevent log flooding: if there are so many
mismatches that ratelimiting is necessary, the individual messages
are relatively unlikely to be important (either the machine is
swapping like crazy or something is very wrong with the disk).

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
---
 drivers/md/raid5.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ed5cd705b985..937314051be5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3959,10 +3959,15 @@ static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh,
 			set_bit(STRIPE_INSYNC, &sh->state);
 		else {
 			atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches);
-			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
+			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
 				/* don't try to repair!! */
 				set_bit(STRIPE_INSYNC, &sh->state);
-			else {
+				pr_warn_ratelimited("%s: mismatch sector in range "
+						    "%llu-%llu\n", mdname(conf->mddev),
+						    (unsigned long long) sh->sector,
+						    (unsigned long long) sh->sector +
+						    STRIPE_SECTORS);
+			} else {
 				sh->check_state = check_state_compute_run;
 				set_bit(STRIPE_COMPUTE_RUN, &sh->state);
 				set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
@@ -4111,10 +4116,15 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
 			}
 		} else {
 			atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches);
-			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
+			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
 				/* don't try to repair!! */
 				set_bit(STRIPE_INSYNC, &sh->state);
-			else {
+				pr_warn_ratelimited("%s: mismatch sector in range "
+						    "%llu-%llu\n", mdname(conf->mddev),
+						    (unsigned long long) sh->sector,
+						    (unsigned long long) sh->sector +
+						    STRIPE_SECTORS);
+			} else {
 				int *target = &sh->ops.target;
 
 				sh->ops.target = -1;
-- 
2.12.2.212.gea238cf35.dirty


* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-16  3:27                               ` NeilBrown
  2017-05-16  9:13                                 ` Nix
@ 2017-05-16 21:11                                 ` NeilBrown
  2017-05-16 21:46                                   ` Nix
  1 sibling, 1 reply; 69+ messages in thread
From: NeilBrown @ 2017-05-16 21:11 UTC (permalink / raw)
  To: Nix, Chris Murphy
  Cc: David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID


On Tue, May 16 2017, NeilBrown wrote:

> On Tue, May 09 2017, Nix wrote:
>
>> On 9 May 2017, Chris Murphy verbalised:
>>
>>> 1. md reports all data drives and the LBAs for the affected stripe
>>
>> Enough rambling from me. Here's a hilariously untested patch against
>> 4.11 (as in I haven't even booted with it: my systems are kind of in
>> flux right now as I migrate to the md-based server that got me all
>> concerned about this). It compiles! And it's definitely safer than
>> trying a repair, and makes it possible to recover from a real mismatch
>> without losing all your hair in the process, or determine that a
>> mismatch is spurious or irrelevant. And that's enough for me, frankly.
>> This is a very rare problem, one hopes.
>>
>> (It's probably not ideal, because the error is just known to be
>> somewhere in that stripe, not on that sector, which makes determining
>> the affected data somewhat harder. But at least you can figure out what
>> filesystem it's on. :) )
>>
>> 8<------------------------------------------------------------->8
>> From: Nick Alcock <nick.alcock@oracle.com>
>> Subject: [PATCH] md: report sector of stripes with check mismatches
>>
>> This makes it possible, with appropriate filesystem support, for a
>> sysadmin to tell what is affected by the mismatch, and whether
>> it should be ignored (if it's inside a swap partition, for
>> instance).
>>
>> We ratelimit to prevent log flooding: if there are so many
>> mismatches that ratelimiting is necessary, the individual messages
>> are relatively unlikely to be important (either the machine is
>> swapping like crazy or something is very wrong with the disk).
>>
>> Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
>> ---
>>  drivers/md/raid5.c | 16 ++++++++++++----
>>  1 file changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index ed5cd705b985..bcd2e5150e29 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -3959,10 +3959,14 @@ static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh,
>>  			set_bit(STRIPE_INSYNC, &sh->state);
>>  		else {
>>  			atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches);
>> -			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
>> +			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
>>  				/* don't try to repair!! */
>>  				set_bit(STRIPE_INSYNC, &sh->state);
>> -			else {
>> +				pr_warn_ratelimited("%s: mismatch around sector "
>> +						    "%llu\n", __func__,
>> +						    (unsigned long long)
>> +						    sh->sector);
>> +			} else {
>
> I think there is no point giving the function name,
> but that you should give the name of the array.
> Also "around" is a little vague.
> Maybe something like:
>
>> +				pr_warn_ratelimited("%s: mismatch sector in range "
>> +						    "%llu-%llu\n", mdname(conf->mddev),
>> +						    (unsigned long long) sh->sector,
>> +						    (unsigned long long) sh->sector + STRIPE_SECTORS);
>
> As an optional enhancement, you could add "will recalculate P/Q" or
> "left unchanged" as appropriate.
>
> Providing at least that the array name is included in the message, I
> support this patch.

Actually, I have another caveat.  I don't think we want these messages
during initial resync, or any resync.  Only during a 'check' or
'repair'.
So add a check for MD_RECOVERY_REQUESTED or maybe for
  sh->sectors >= conf->mddev->recovery_cp

NeilBrown


>
> NeilBrown
>
>
>
>>  				sh->check_state = check_state_compute_run;
>>  				set_bit(STRIPE_COMPUTE_RUN, &sh->state);
>>  				set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
>> @@ -4111,10 +4115,14 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
>>  			}
>>  		} else {
>>  			atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches);
>> -			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
>> +			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
>>  				/* don't try to repair!! */
>>  				set_bit(STRIPE_INSYNC, &sh->state);
>> -			else {
>> +				pr_warn_ratelimited("%s: mismatch around sector "
>> +						    "%llu\n", __func__,
>> +						    (unsigned long long)
>> +						    sh->sector);
>> +			} else {
>>  				int *target = &sh->ops.target;
>>  
>>  				sh->ops.target = -1;
>> -- 
>> 2.12.2.212.gea238cf35.dirty
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-16 21:11                                 ` NeilBrown
@ 2017-05-16 21:46                                   ` Nix
  2017-05-18  0:07                                     ` Shaohua Li
  2017-05-19  4:49                                     ` NeilBrown
  0 siblings, 2 replies; 69+ messages in thread
From: Nix @ 2017-05-16 21:46 UTC (permalink / raw)
  To: NeilBrown
  Cc: Chris Murphy, David Brown, Anthony Youngman, Phil Turmel,
	Ravi (Tom) Hale, Linux-RAID

On 16 May 2017, NeilBrown spake thusly:

> Actually, I have another caveat.  I don't think we want these messages
> during initial resync, or any resync.  Only during a 'check' or
> 'repair'.
> So add a check for MD_RECOVERY_REQUESTED or maybe for
>   sh->sectors >= conf->mddev->recovery_cp

I completely agree, but it's already inside MD_RECOVERY_CHECK:

if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
        /* don't try to repair!! */
        set_bit(STRIPE_INSYNC, &sh->state);
        pr_warn_ratelimited("%s: mismatch sector in range "
                            "%llu-%llu\n", mdname(conf->mddev),
                            (unsigned long long) sh->sector,
                            (unsigned long long) sh->sector +
                            STRIPE_SECTORS);
} else {

Doesn't that already mean that someone has explicitly triggered a check
action?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-16 21:46                                   ` Nix
@ 2017-05-18  0:07                                     ` Shaohua Li
  2017-05-19  4:53                                       ` NeilBrown
  2017-05-19  4:49                                     ` NeilBrown
  1 sibling, 1 reply; 69+ messages in thread
From: Shaohua Li @ 2017-05-18  0:07 UTC (permalink / raw)
  To: Nix
  Cc: NeilBrown, Chris Murphy, David Brown, Anthony Youngman,
	Phil Turmel, Ravi (Tom) Hale, Linux-RAID

On Tue, May 16, 2017 at 10:46:13PM +0100, Nix wrote:
> On 16 May 2017, NeilBrown spake thusly:
> 
> > Actually, I have another caveat.  I don't think we want these messages
> > during initial resync, or any resync.  Only during a 'check' or
> > 'repair'.
> > So add a check for MD_RECOVERY_REQUESTED or maybe for
> >   sh->sectors >= conf->mddev->recovery_cp
> 
> I completely agree, but it's already inside MD_RECOVERY_CHECK:
> 
> if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
>         /* don't try to repair!! */
>         set_bit(STRIPE_INSYNC, &sh->state);
>         pr_warn_ratelimited("%s: mismatch sector in range "
>                             "%llu-%llu\n", mdname(conf->mddev),
>                             (unsigned long long) sh->sector,
>                             (unsigned long long) sh->sector +
>                             STRIPE_SECTORS);
> } else {
> 
> Doesn't that already mean that someone has explicitly triggered a check
> action?


Hi,
So the idea is: run 'check' and report mismatches; userspace (raid6check, for
example) uses the reported info to fix them. The pr_warn_ratelimited
isn't a good way to communicate the info to userspace. I'm wondering why we
don't just run raid6check on its own; it can do the job like the kernel does, and
we avoid the crappy pr_warn_ratelimited.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-16 21:46                                   ` Nix
  2017-05-18  0:07                                     ` Shaohua Li
@ 2017-05-19  4:49                                     ` NeilBrown
  2017-05-19 10:32                                       ` Nix
  1 sibling, 1 reply; 69+ messages in thread
From: NeilBrown @ 2017-05-19  4:49 UTC (permalink / raw)
  To: Nix
  Cc: Chris Murphy, David Brown, Anthony Youngman, Phil Turmel,
	Ravi (Tom) Hale, Linux-RAID

[-- Attachment #1: Type: text/plain, Size: 1209 bytes --]

On Tue, May 16 2017, Nix wrote:

> On 16 May 2017, NeilBrown spake thusly:
>
>> Actually, I have another caveat.  I don't think we want these messages
>> during initial resync, or any resync.  Only during a 'check' or
>> 'repair'.
>> So add a check for MD_RECOVERY_REQUESTED or maybe for
>>   sh->sectors >= conf->mddev->recovery_cp
>
> I completely agree, but it's already inside MD_RECOVERY_CHECK:
>
> if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
>         /* don't try to repair!! */
>         set_bit(STRIPE_INSYNC, &sh->state);
>         pr_warn_ratelimited("%s: mismatch sector in range "
>                             "%llu-%llu\n", mdname(conf->mddev),
>                             (unsigned long long) sh->sector,
>                             (unsigned long long) sh->sector +
>                             STRIPE_SECTORS);
> } else {
>
> Doesn't that already mean that someone has explicitly triggered a check
> action?

Uhmm... yeah.  I lose track of which flags mean what exactly.
Your log messages aren't generated when 'repair' is used, only when
'check' is.
I can see why you might have chosen that, but I wonder if it is best.

But I'm OK with this patch as it stands.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-18  0:07                                     ` Shaohua Li
@ 2017-05-19  4:53                                       ` NeilBrown
  2017-05-19 10:31                                         ` Nix
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2017-05-19  4:53 UTC (permalink / raw)
  To: Shaohua Li, Nix
  Cc: Chris Murphy, David Brown, Anthony Youngman, Phil Turmel,
	Ravi (Tom) Hale, Linux-RAID

[-- Attachment #1: Type: text/plain, Size: 1989 bytes --]

On Wed, May 17 2017, Shaohua Li wrote:

> On Tue, May 16, 2017 at 10:46:13PM +0100, Nix wrote:
>> On 16 May 2017, NeilBrown spake thusly:
>> 
>> > Actually, I have another caveat.  I don't think we want these messages
>> > during initial resync, or any resync.  Only during a 'check' or
>> > 'repair'.
>> > So add a check for MD_RECOVERY_REQUESTED or maybe for
>> >   sh->sectors >= conf->mddev->recovery_cp
>> 
>> I completely agree, but it's already inside MD_RECOVERY_CHECK:
>> 
>> if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
>>         /* don't try to repair!! */
>>         set_bit(STRIPE_INSYNC, &sh->state);
>>         pr_warn_ratelimited("%s: mismatch sector in range "
>>                             "%llu-%llu\n", mdname(conf->mddev),
>>                             (unsigned long long) sh->sector,
>>                             (unsigned long long) sh->sector +
>>                             STRIPE_SECTORS);
>> } else {
>> 
>> Doesn't that already mean that someone has explicitly triggered a check
>> action?
>
>
> Hi,
> So the idea is: run 'check' and report mismatches; userspace (raid6check, for
> example) uses the reported info to fix them. The pr_warn_ratelimited
> isn't a good way to communicate the info to userspace. I'm wondering why we
> don't just run raid6check on its own; it can do the job like the kernel does, and
> we avoid the crappy pr_warn_ratelimited.
>

raid6check is *much* slower than doing it in the kernel, as the
interlocking to avoid checking a stripe that is being written is
clumsy... and async IO is harder in user space.

I think the warnings are useful as warnings quite apart from the
possibility of raid6check using them.
If we really wanted a seamless "fix the raid6 thing" (which I don't
think we do), we'd probably make the list of inconsistencies appear in a
sysfs file.  That would be less 'crappy'.  But as I say, I don't think
we really want to do that.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-19  4:53                                       ` NeilBrown
@ 2017-05-19 10:31                                         ` Nix
  2017-05-19 16:48                                           ` Shaohua Li
  0 siblings, 1 reply; 69+ messages in thread
From: Nix @ 2017-05-19 10:31 UTC (permalink / raw)
  To: NeilBrown
  Cc: Shaohua Li, Chris Murphy, David Brown, Anthony Youngman,
	Phil Turmel, Ravi (Tom) Hale, Linux-RAID

On 19 May 2017, NeilBrown verbalised:

> On Wed, May 17 2017, Shaohua Li wrote:
>
>> On Tue, May 16, 2017 at 10:46:13PM +0100, Nix wrote:
>>> Doesn't that already mean that someone has explicitly triggered a check
>>> action?
>>
>> So the idea is: run 'check' and report mismatches; userspace (raid6check, for
>> example) uses the reported info to fix them. The pr_warn_ratelimited
>> isn't a good way to communicate the info to userspace. I'm wondering why we
>> don't just run raid6check on its own; it can do the job like the kernel does, and
>> we avoid the crappy pr_warn_ratelimited.

It'll do when there are a few inconsistencies but you don't want to
spend days recovering a huge array to fix a small but nonzero
mismatch_cnt, or when you want reassurance that yes, these mismatch_cnts
are in swap and can be ignored. When there are a lot, enough that a
ratelimited warning hits its rate limit, Neil's right: the array is
probably toast. The ratelimit is then important to stop log flooding.

> If we really wanted a seamless "fix the raid6 thing" (which I don't
> think we do),

Oh, I want seamless everything -- the seamlessness and flexibility of md
are its killer features over hardware RAID in my eyes -- but I'm
convinced that this is probably too hard to test and simply too
disruptive to bother with for a likely vanishingly rare failure mode all
entangled with fairly hot paths.

>               we'd probably make the list of inconsistencies appear in a
> sysfs file.  That would be less 'crappy'.  But as I say, I don't think
> we really want to do that.

Aren't sysfs files in effect length-limited to one page (or at least
length-limited by virtue of being stored in memory?) It seems to me this
would just bring the same problem ratelimit is solving right back again,
except a sysfs file doesn't have a logging daemon sucking the contents
out constantly so you can overwrite your old output without worrying.
(And there is no other daemon running to do that, except mdadm in
monitor mode, which might not be running and really this job feels out
of scope for it anyway.)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-19  4:49                                     ` NeilBrown
@ 2017-05-19 10:32                                       ` Nix
  2017-05-19 16:55                                         ` Shaohua Li
  0 siblings, 1 reply; 69+ messages in thread
From: Nix @ 2017-05-19 10:32 UTC (permalink / raw)
  To: NeilBrown
  Cc: Chris Murphy, David Brown, Anthony Youngman, Phil Turmel,
	Ravi (Tom) Hale, Linux-RAID

On 19 May 2017, NeilBrown said:

> On Tue, May 16 2017, Nix wrote:
>
>> On 16 May 2017, NeilBrown spake thusly:
>>
>>> Actually, I have another caveat.  I don't think we want these messages
>>> during initial resync, or any resync.  Only during a 'check' or
>>> 'repair'.
>>> So add a check for MD_RECOVERY_REQUESTED or maybe for
>>>   sh->sectors >= conf->mddev->recovery_cp
>>
>> I completely agree, but it's already inside MD_RECOVERY_CHECK:
>>
>> if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
>>         /* don't try to repair!! */
>>         set_bit(STRIPE_INSYNC, &sh->state);
>>         pr_warn_ratelimited("%s: mismatch sector in range "
>>                             "%llu-%llu\n", mdname(conf->mddev),
>>                             (unsigned long long) sh->sector,
>>                             (unsigned long long) sh->sector +
>>                             STRIPE_SECTORS);
>> } else {
>>
>> Doesn't that already mean that someone has explicitly triggered a check
>> action?
>
> Uhmm... yeah.  I lose track of which flags mean what exactly.
> Your log messages aren't generated when 'repair' is used, only when
> 'check' is.
> I can see why you might have chosen that, but I wonder if it is best.

I'm not sure what the point is of being told when repair is used: hey,
there was an inconsistency here but there isn't any more! I suppose you
could still use it to see if the repair did the right thing. My problem
on that front was that I'm not sure what flag should be used to catch
repair but not resync etc: everywhere else in the code, repair is in an
unadorned else branch... is it the *lack* of MD_RECOVERY_CHECK and the
presence of, uh, something else?

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-19 10:31                                         ` Nix
@ 2017-05-19 16:48                                           ` Shaohua Li
  2017-06-02 12:28                                             ` Nix
  0 siblings, 1 reply; 69+ messages in thread
From: Shaohua Li @ 2017-05-19 16:48 UTC (permalink / raw)
  To: Nix
  Cc: NeilBrown, Chris Murphy, David Brown, Anthony Youngman,
	Phil Turmel, Ravi (Tom) Hale, Linux-RAID

On Fri, May 19, 2017 at 11:31:23AM +0100, Nix wrote:
> On 19 May 2017, NeilBrown verbalised:
> 
> > On Wed, May 17 2017, Shaohua Li wrote:
> >
> >> On Tue, May 16, 2017 at 10:46:13PM +0100, Nix wrote:
> >>> Doesn't that already mean that someone has explicitly triggered a check
> >>> action?
> >>
> >> So the idea is: run 'check' and report mismatches; userspace (raid6check, for
> >> example) uses the reported info to fix them. The pr_warn_ratelimited
> >> isn't a good way to communicate the info to userspace. I'm wondering why we
> >> don't just run raid6check on its own; it can do the job like the kernel does, and
> >> we avoid the crappy pr_warn_ratelimited.
> 
> It'll do when there are a few inconsistencies but you don't want to
> spend days recovering a huge array to fix a small but nonzero
> mismatch_cnt, or to reassure you that yes, these mismatch_cnts are in
> swap, ignore them. When there are a lot, enough that a ratelimited
> warning hits its rate limit, Neil's right: the array is probably toast.
> The limit is then important to stop log flooding.
> 
> > If we really wanted a seamless "fix the raid6 thing" (which I don't
> > think we do),
> 
> Oh, I want seamless everything -- the seamlessness and flexibility of md
> are its killer features over hardware RAID in my eyes -- but I'm
> convinced that this is probably too hard to test and simply too
> disruptive to bother with for a likely vanishingly rare failure mode all
> entangled with fairly hot paths.
> 
> >               we'd probably make the list of inconsistencies appear in a
> > sysfs file.  That would be less 'crappy'.  But as I say, I don't think
> > we really want to do that.
> 
> Aren't sysfs files in effect length-limited to one page (or at least
> length-limited by virtue of being stored in memory?) It seems to me this
> would just bring the same problem ratelimit is solving right back again,
> except a sysfs file doesn't have a logging daemon sucking the contents
> out constantly so you can overwrite your old output without worrying.
> (And there is no other daemon running to do that, except mdadm in
> monitor mode, which might not be running and really this job feels out
> of scope for it anyway.)

No, my concern is not that the print is ratelimited. The problem is that dmesg isn't a
good way to communicate info to userspace. You can easily lose all dmesg info
with a simple 'dmesg -c'; a sysfs file is more reliable. Being length-limited isn't a
problem: as you said, if there are a lot of mismatches, the array is toast.

Alright, I'll accept Neil's suggestion. Unless you guys really need a seamless
fix (which I'm still thinking of doing in userspace by optimizing
raid6check), we'll take this simple warning patch.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-19 10:32                                       ` Nix
@ 2017-05-19 16:55                                         ` Shaohua Li
  2017-05-21 22:00                                           ` NeilBrown
  0 siblings, 1 reply; 69+ messages in thread
From: Shaohua Li @ 2017-05-19 16:55 UTC (permalink / raw)
  To: Nix
  Cc: NeilBrown, Chris Murphy, David Brown, Anthony Youngman,
	Phil Turmel, Ravi (Tom) Hale, Linux-RAID

On Fri, May 19, 2017 at 11:32:43AM +0100, Nix wrote:
> On 19 May 2017, NeilBrown said:
> 
> > On Tue, May 16 2017, Nix wrote:
> >
> >> On 16 May 2017, NeilBrown spake thusly:
> >>
> >>> Actually, I have another caveat.  I don't think we want these messages
> >>> during initial resync, or any resync.  Only during a 'check' or
> >>> 'repair'.
> >>> So add a check for MD_RECOVERY_REQUESTED or maybe for
> >>>   sh->sectors >= conf->mddev->recovery_cp
> >>
> >> I completely agree, but it's already inside MD_RECOVERY_CHECK:
> >>
> >> if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
> >>         /* don't try to repair!! */
> >>         set_bit(STRIPE_INSYNC, &sh->state);
> >>         pr_warn_ratelimited("%s: mismatch sector in range "
> >>                             "%llu-%llu\n", mdname(conf->mddev),
> >>                             (unsigned long long) sh->sector,
> >>                             (unsigned long long) sh->sector +
> >>                             STRIPE_SECTORS);
> >> } else {
> >>
> >> Doesn't that already mean that someone has explicitly triggered a check
> >> action?
> >
> > Uhmm... yeah.  I lose track of which flags mean what exactly.
> > Your log messages aren't generated when 'repair' is used, only when
> > 'check' is.
> > I can see why you might have chosen that, but I wonder if it is best.
> 
> I'm not sure what the point is of being told when repair is used: hey,
> there was an inconsistency here but there isn't any more! I suppose you
> could still use it to see if the repair did the right thing. My problem
> on that front was that I'm not sure what flag should be used to catch
> repair but not resync etc: everywhere else in the code, repair is in an
> unadorned else branch... is it the *lack* of MD_RECOVERY_CHECK and the
> presence of, uh, something else?
MD_RECOVERY_SYNC && MD_RECOVERY_REQUESTED && MD_RECOVERY_CHECK == check
MD_RECOVERY_SYNC && MD_RECOVERY_REQUESTED == repair
MD_RECOVERY_SYNC && !MD_RECOVERY_REQUESTED == resync

I don't see the point of printing the info for 'repair': 'repair' already changes the
data, so how could we use the info?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-19 16:55                                         ` Shaohua Li
@ 2017-05-21 22:00                                           ` NeilBrown
  0 siblings, 0 replies; 69+ messages in thread
From: NeilBrown @ 2017-05-21 22:00 UTC (permalink / raw)
  To: Shaohua Li, Nix
  Cc: Chris Murphy, David Brown, Anthony Youngman, Phil Turmel,
	Ravi (Tom) Hale, Linux-RAID

[-- Attachment #1: Type: text/plain, Size: 3081 bytes --]

On Fri, May 19 2017, Shaohua Li wrote:

> On Fri, May 19, 2017 at 11:32:43AM +0100, Nix wrote:
>> On 19 May 2017, NeilBrown said:
>> 
>> > On Tue, May 16 2017, Nix wrote:
>> >
>> >> On 16 May 2017, NeilBrown spake thusly:
>> >>
>> >>> Actually, I have another caveat.  I don't think we want these messages
>> >>> during initial resync, or any resync.  Only during a 'check' or
>> >>> 'repair'.
>> >>> So add a check for MD_RECOVERY_REQUESTED or maybe for
>> >>>   sh->sectors >= conf->mddev->recovery_cp
>> >>
>> >> I completely agree, but it's already inside MD_RECOVERY_CHECK:
>> >>
>> >> if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
>> >>         /* don't try to repair!! */
>> >>         set_bit(STRIPE_INSYNC, &sh->state);
>> >>         pr_warn_ratelimited("%s: mismatch sector in range "
>> >>                             "%llu-%llu\n", mdname(conf->mddev),
>> >>                             (unsigned long long) sh->sector,
>> >>                             (unsigned long long) sh->sector +
>> >>                             STRIPE_SECTORS);
>> >> } else {
>> >>
>> >> Doesn't that already mean that someone has explicitly triggered a check
>> >> action?
>> >
>> > Uhmm... yeah.  I lose track of which flags mean what exactly.
>> > Your log messages aren't generated when 'repair' is used, only when
>> > 'check' is.
>> > I can see why you might have chosen that, but I wonder if it is best.
>> 
>> I'm not sure what the point is of being told when repair is used: hey,
>> there was an inconsistency here but there isn't any more! I suppose you
>> could still use it to see if the repair did the right thing. My problem
>> on that front was that I'm not sure what flag should be used to catch
>> repair but not resync etc: everywhere else in the code, repair is in an
>> unadorned else branch... is it the *lack* of MD_RECOVERY_CHECK and the
>> presence of, uh, something else?
> MD_RECOVERY_SYNC && MD_RECOVERY_REQUESTED && MD_RECOVERY_CHECK == check
> MD_RECOVERY_SYNC && MD_RECOVERY_REQUESTED == repair
> MD_RECOVERY_SYNC && !MD_RECOVERY_REQUESTED == resync
>
> I don't see the point of printing the info for 'repair': 'repair' already changes the
> data, so how could we use the info?

Surprising data is potentially valuable.
I don't think you should *ever* get an inconsistency in a RAID6 unless
you have faulty hardware.
If you do, then any information about the nature of the inconsistency
might be valuable in understanding the hardware fault.
I don't know in advance how I would interpret the data, but I do
know that if I didn't have the data, then I wouldn't be able to
interpret it.

However .... running "repair" when you don't know exactly what has
happened and why, is probably a bad idea.  So logging probably won't
provide value.
I wouldn't go out of my way to add extra logging for the 'repair' case,
but I certainly wouldn't go out of my way to avoid logging in that case.

It seems inconsistent to log for 'check' but not 'repair', but it isn't
a big deal for me.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)
  2017-05-19 16:48                                           ` Shaohua Li
@ 2017-06-02 12:28                                             ` Nix
  0 siblings, 0 replies; 69+ messages in thread
From: Nix @ 2017-06-02 12:28 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, Chris Murphy, David Brown, Anthony Youngman,
	Phil Turmel, Ravi (Tom) Hale, Linux-RAID

[getting back to this...]

On 19 May 2017, Shaohua Li told this:
> On Fri, May 19, 2017 at 11:31:23AM +0100, Nix wrote:
>> On 19 May 2017, NeilBrown verbalised:
>> >               we'd probably make the list of inconsistencies appear in a
>> > sysfs file.  That would be less 'crappy'.  But as I say, I don't think
>> > we really want to do that.
>> 
>> Aren't sysfs files in effect length-limited to one page (or at least
>> length-limited by virtue of being stored in memory?) It seems to me this
>> would just bring the same problem ratelimit is solving right back again,
>> except a sysfs file doesn't have a logging daemon sucking the contents
>> out constantly so you can overwrite your old output without worrying.
>> (And there is no other daemon running to do that, except mdadm in
>> monitor mode, which might not be running and really this job feels out
>> of scope for it anyway.)
>
> No, my concern is not that the print is ratelimited. The problem is that dmesg isn't a
> good way to communicate info to userspace. You can easily lose all dmesg info
> with a simple 'dmesg -c'; a sysfs file is more reliable. Being length-limited isn't a
> problem: as you said, if there are a lot of mismatches, the array is toast.

I agree that in future having a mechanism for reporting this more easily
usable by programs would be good, and sysfs does seem like just such a
mechanism.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2017-06-02 12:28 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-04 10:04 Fault tolerance in RAID0 with badblocks Ravi (Tom) Hale
2017-05-04 13:44 ` Wols Lists
2017-05-05  4:03   ` Fault tolerance " Ravi (Tom) Hale
2017-05-05 19:20     ` Anthony Youngman
2017-05-06 11:21       ` Ravi (Tom) Hale
2017-05-06 13:00         ` Wols Lists
2017-05-08 14:50           ` Nix
2017-05-08 18:00             ` Anthony Youngman
2017-05-09 10:11               ` David Brown
2017-05-09 10:18               ` Nix
2017-05-08 19:02             ` Phil Turmel
2017-05-08 19:52               ` Nix
2017-05-08 20:27                 ` Anthony Youngman
2017-05-09  9:53                   ` Nix
2017-05-09 11:09                     ` David Brown
2017-05-09 11:27                       ` Nix
2017-05-09 11:58                         ` David Brown
2017-05-09 17:25                           ` Chris Murphy
2017-05-09 19:44                             ` Wols Lists
2017-05-10  3:53                               ` Chris Murphy
2017-05-10  4:49                                 ` Wols Lists
2017-05-10 17:18                                   ` Chris Murphy
2017-05-16  3:20                                   ` NeilBrown
2017-05-10  5:00                                 ` Dave Stevens
2017-05-10 16:44                                 ` Edward Kuns
2017-05-10 18:09                                   ` Chris Murphy
2017-05-09 20:18                             ` Nix
2017-05-09 20:52                               ` Wols Lists
2017-05-10  8:41                               ` David Brown
2017-05-09 21:06                             ` A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) Nix
2017-05-12 11:14                               ` Nix
2017-05-16  3:27                               ` NeilBrown
2017-05-16  9:13                                 ` Nix
2017-05-16 21:11                                 ` NeilBrown
2017-05-16 21:46                                   ` Nix
2017-05-18  0:07                                     ` Shaohua Li
2017-05-19  4:53                                       ` NeilBrown
2017-05-19 10:31                                         ` Nix
2017-05-19 16:48                                           ` Shaohua Li
2017-06-02 12:28                                             ` Nix
2017-05-19  4:49                                     ` NeilBrown
2017-05-19 10:32                                       ` Nix
2017-05-19 16:55                                         ` Shaohua Li
2017-05-21 22:00                                           ` NeilBrown
2017-05-09 19:16                         ` Fault tolerance with badblocks Phil Turmel
2017-05-09 20:01                           ` Nix
2017-05-09 20:57                             ` Wols Lists
2017-05-09 21:22                               ` Nix
2017-05-09 21:23                             ` Phil Turmel
2017-05-09 21:32                     ` NeilBrown
2017-05-10 19:03                       ` Nix
2017-05-09 16:05                   ` Chris Murphy
2017-05-09 17:49                     ` Wols Lists
2017-05-10  3:06                       ` Chris Murphy
2017-05-08 20:56                 ` Phil Turmel
2017-05-09 10:28                   ` Nix
2017-05-09 10:50                     ` Reindl Harald
2017-05-09 11:15                       ` Nix
2017-05-09 11:48                         ` Reindl Harald
2017-05-09 16:11                           ` Nix
2017-05-09 16:46                             ` Reindl Harald
2017-05-09  7:37             ` David Brown
2017-05-09  9:58               ` Nix
2017-05-09 10:28                 ` Brad Campbell
2017-05-09 10:40                   ` Nix
2017-05-09 12:15                     ` Tim Small
2017-05-09 15:30                       ` Nix
2017-05-05 20:23     ` Peter Grandi
2017-05-05 22:14       ` Nix
