* Fault tolerance in RAID0 with badblocks
From: Ravi (Tom) Hale @ 2017-05-04 10:04 UTC
To: linux-raid

Since btrfs doesn't support badblocks, this btrfs mailing list post [1]
suggested using mdadm RAID0 (mdadm 3.1+).

Is there a way of having blocks from a spare device automatically replace
bad blocks when they are next written to (like SMART does for HDDs)?

Or would mdadm be able to add a "badblocks layer" to btrfs in some other
way?

My use case is mining storj - I don't mind some data loss.

[1] https://www.spinics.net/lists/linux-btrfs/msg40909.html

--
Cheers,
Tom Hale

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance in RAID0 with badblocks
From: Wols Lists @ 2017-05-04 13:44 UTC
To: Ravi (Tom) Hale, linux-raid

On 04/05/17 11:04, Ravi (Tom) Hale wrote:
> Since btrfs doesn't support badblocks, this btrfs mailing list post[1]
> suggested to use mdadm RAID0 3.1+.

Having read the email you linked to, I don't think mdadm will be any help
at all ...

> Is there a way of having blocks from a spare device automatically
> replacing bad blocks when they are next written to (like SMART does for
> HDDs)?

What quite do you mean?

> Or would mdadm be able to add a "badblocks layer" to btrfs in some other
> way?

No. With modern hard drives, no filesystem should pay any attention to
badblocks - it's all handled in the drive firmware. Badblocks is an
unfortunate legacy from the days when drives really were addressed by CHS,
and the layer above needed some way of knowing which blocks were bad and
should be avoided.

mdadm has had a lot of grief with its handling of badblocks, and with
getting drives confused, and it's all totally unnecessary anyway. Let the
drive worry about which blocks are bad. One major point of LBA is that it
hides the actual disk layout from the computer, and allows the drive to
relocate blocks that aren't working properly. Let it do its job.

If you want to use raid, don't bother with 0. Use mdadm and raid 5 or 6 to
combine your drives, and create a btrfs filesystem on top. (Don't bother
with raid1 - that part of btrfs apparently works well, so use the
filesystem's own variant, not an external one.)

> My use case is mining storj - I don't mind some data loss.

Using a badblock list will have no impact on this whatsoever.

> [1] https://www.spinics.net/lists/linux-btrfs/msg40909.html

Cheers,
Wol
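The raid-5/6-with-btrfs-on-top layout suggested above can be sketched as
follows. This is only a sketch: the device names /dev/sd[b-e] are
placeholders, and mdadm --create destroys any existing data on the member
disks.

```shell
# Combine four whole disks into one md RAID6 array (tolerates the loss
# of any two members). Substitute your own device names.
mdadm --create /dev/md0 --level=6 --raid-devices=4 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Put a plain single-device btrfs filesystem on top of the array,
# leaving the redundancy to md rather than to btrfs.
mkfs.btrfs /dev/md0
mount /dev/md0 /mnt
```

With this split, md handles disk failures and btrfs contributes its
checksumming, so silent corruption is at least detected even though btrfs
itself sees only one "device".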
* Re: Fault tolerance with badblocks
From: Ravi (Tom) Hale @ 2017-05-05 4:03 UTC
To: Wols Lists, linux-raid

On 04/05/17 20:44, Wols Lists wrote:
> On 04/05/17 11:04, Ravi (Tom) Hale wrote:
>> Is there a way of having blocks from a spare device automatically
>> replacing bad blocks when they are next written to (like SMART does for
>> HDDs)?
>
> What quite do you mean?

I mean: should a bad block be identified, any writes to that virtual block
are redirected to another good LBA block held in a spare pool, which would
need to be inaccessible for other purposes (so that those blocks are
indeed spare).

>> Or would mdadm be able to add a "badblocks layer" to btrfs in some other
>> way?
>
> No. With modern hard drives, no filesystem should pay any attention to
> badblocks - it's all handled in the drive firmware.

ext4 supports this, and is a relatively modern filesystem, released in
December 2008. While it could be argued that this is for legacy support,
the feature still adds value (see below).

> mdadm has had a lot of grief with its handling of badblocks,
> and getting drives confused, and it's all totally unnecessary anyway.

The use case is simple: what if I want to have more good blocks to correct
for bad blocks than Seagate thinks I should have? E.g. a charity or poor
student wanting to get the most out of their old hardware.

In my case, I don't care about actual data loss (RAID0). However, in the
usual case, running RAID 1, 5 or 6 with a pool of spare good blocks would
allow extending the life of hardware considerably while still providing a
poor man's margin of redundancy.

> Let the drive worry about what blocks are bad. One major point behind
> LBA is it hides the actual disk layout from the computer, and allows the
> drive to relocate blocks that aren't working properly. Let it do its job.

Until it can't do its job any more, because it runs out of its
manufacturer-determined fixed-size spare pool.

Yes, there are performance considerations, like keeping the replacement
good sector physically close to the bad sector it stands in for, so a
spare data area could be allocated every N usable data areas.

And perhaps I could write that one day. :)

>> My use case is mining storj - I don't mind some data loss.
>
> Using a badblock list will have no impact on this whatsoever.

A corrupted file is a corrupted file, and can be deleted at minimal loss.
I just don't want the next file being corrupted by the same bad block.

--
Tom
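For reference, the ext4 badblocks support mentioned above works roughly as
follows. A sketch only: /dev/sdX and the list-file paths are placeholders,
and the -b 4096 block size passed to badblocks must match the filesystem's
block size so the block numbers line up.

```shell
# Scan the device for unreadable blocks (non-destructive read-only test),
# writing the bad block numbers to a file.
badblocks -sv -b 4096 /dev/sdX > /tmp/sdX.bad

# Build the filesystem around the bad blocks found above.
mke2fs -t ext4 -b 4096 -l /tmp/sdX.bad /dev/sdX

# Blocks that go bad later in service can be appended to the
# filesystem's bad-block inode:
e2fsck -l /tmp/sdX.more-bad /dev/sdX
```

The filesystem then simply never allocates the listed blocks, which is the
"badblocks layer" being discussed, minus any spare-pool remapping.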
* Re: Fault tolerance with badblocks
From: Anthony Youngman @ 2017-05-05 19:20 UTC
To: Ravi (Tom) Hale, linux-raid

On 05/05/17 05:03, Ravi (Tom) Hale wrote:
> The use case is simple: What if I want to have more goodblocks to
> correct for badblocks than Seagate thinks I should have?

Understood. Except that when you get to that state, your drive is probably
dying anyway. Or tiny by modern standards.

> Eg, a charity or poor student wanting to get the most out of their old
> hardware.
>
> In my case, I don't care about actual data loss (RAID0).
>
> However, in the usual case, running RAID 1, 5 or 6 with a pool of spare
> goodblocks would allow extending the life of hardware considerably while
> still providing a poor-man's margin of redundancy.
>
>> Let the drive worry about what blocks are bad. One major point behind
>> LBA is it hides the actual disk layout from the computer, and allows
>> the drive to relocate blocks that aren't working properly. Let it do
>> its job.
>
> Until it can't do its job any more because it runs out of its
> manufacturer determined fixed-size spare pool.

Bear in mind I'm speculating slightly here ... but how are you going to
know when the drive has run out of its spare pool? Bear in mind that most
SSDs, it seems, will commit suicide at that point ...

Bear in mind also that any *within* *spec* drive can have an "accident"
every 10TB read and still be considered perfectly okay. Which means that
if you do what you are supposed to do (rewrite the block), you're risking
the drive remapping the block - and getting closer to the drive bricking
itself. But if you trap the error yourself and add it to the badblocks
list, you are risking throwing away perfectly decent blocks that just
hiccuped.

Bear in mind also that with raid we recommend "scrubbing". That's
basically reading the entire disk looking for errors, because data does
fade. So if you "look after" a 3TB drive yourself, you could be losing a
block a month to your badblock list. Not good.

> Yes there are things to consider for performance like having the
> physical good sector being close to the physical bad sector, so a spare
> data area could be allocated every N usable data areas.
>
> And perhaps I could write that one day. :)
>
>>> My use case is mining storj - I don't mind some data loss.
>>
>> Using a badblock list will have no impact on this whatsoever.
>
> A corrupted file is a corrupted file, and can be deleted at minimal
> loss. I just don't want the next file being corrupted by the same
> badblock.

As we say, YMMV. If that's what you want to do, fine. Which is going to
happen first - the drive bricks itself because it runs out of
manufacturer-supplied spare blocks, or you bin the drive because your
bad-blocks list has got too big to handle? I suspect your bad block list
will fill up long before the drive runs out of manufacturer-supplied
blocks.

Cheers,
Wol
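For what it's worth, md's own per-device bad-block log (on arrays created
with one) can be inspected, so the growth of the list can at least be
watched. A sketch; md0 and sdb1 are placeholder names, and the sysfs path
depends on the member device's name.

```shell
# Each line is "start-sector length" for a recorded bad range on that
# member device.
cat /sys/block/md0/md/dev-sdb1/bad_blocks

# Or read the recorded bad-block log straight from a member's superblock:
mdadm --examine-badblocks /dev/sdb1
```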
* Re: Fault tolerance with badblocks
From: Ravi (Tom) Hale @ 2017-05-06 11:21 UTC
To: Anthony Youngman, linux-raid

On 06/05/17 02:20, Anthony Youngman wrote:
> Bear in mind I'm speculating slightly here ... but how are you going to
> know when the drive has run out of its spare-pool? Bear in mind that
> most SSDs, it seems, will commit suicide at this point ...

Intel and Samsung SSDs support S.M.A.R.T. (but not the one in my personal
laptop).

> Bear in mind also, that any *within* *spec* drive can have an "accident"
> every 10TB and still be considered perfectly okay. Which means that if
> you do what you are supposed to do (rewrite the block) you're risking
> the drive remapping the block - and getting closer to the drive bricking
> itself. But if you trap the error yourself and add it to the badblocks
> list, you are risking throwing away perfectly decent blocks that just
> hiccuped.

For hiccups, keeping a bad-read count for each suspected-bad block could
be sensible. If that number goes above <small-threshold>, it's very likely
that the block is indeed bad and should be avoided in future.

--
Tom Hale
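On drives that do report it, the state of the firmware's spare pool can be
estimated from SMART attributes. A sketch assuming smartmontools is
installed; /dev/sda is a placeholder, and attribute names vary by vendor.

```shell
# Attribute 5 (Reallocated_Sector_Ct) counts sectors already remapped,
# 197 (Current_Pending_Sector) counts sectors awaiting remap on the next
# write, and 198 (Offline_Uncorrectable) counts sectors given up on.
smartctl -A /dev/sda | grep -Ei 'reallocat|pending|uncorrect'
```

A steadily climbing attribute 5, or any non-zero 197/198 that doesn't
clear after a rewrite, is the usual sign the spare pool is being eaten.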
* Re: Fault tolerance with badblocks
From: Wols Lists @ 2017-05-06 13:00 UTC
To: Ravi (Tom) Hale, linux-raid

On 06/05/17 12:21, Ravi (Tom) Hale wrote:
> For hiccups, having a bad-read-count for each suspected-bad block could
> be sensible. If that number goes above <small-threshold> it's very
> likely that the block is indeed bad and should be avoided in future.

Except you have the second law of thermodynamics in play - "what man
proposes, nature opposes". This could well screw up big time.

DRAM needs to be refreshed by a read-write cycle every few milliseconds.
Hard drives are the same, actually, except that the interval is measured
in years, not milliseconds. Fill your brand new hard drive with data, then
hammer it gently over a few years. Especially if a block's neighbours are
repeatedly rewritten but this particular block is never touched, it is
likely to become unreadable.

So it will fail your test - reads will repeatedly fail - but if the
firmware was given a look-in (by rewriting the block) it wouldn't be
remapped.

And as Nix said, once a drive starts getting a load of errors, chances are
something is catastrophically wrong and things are going to get
exponentially worse.

Cheers,
Wol
* Re: Fault tolerance with badblocks
From: Nix @ 2017-05-08 14:50 UTC
To: Wols Lists; Cc: Ravi (Tom) Hale, linux-raid

On 6 May 2017, Wols Lists outgrape:

> On 06/05/17 12:21, Ravi (Tom) Hale wrote:
>> For hiccups, having a bad-read-count for each suspected-bad block could
>> be sensible. If that number goes above <small-threshold> it's very
>> likely that the block is indeed bad and should be avoided in future.
>
> Except you have the second law of thermodynamics in play - "what man
> proposes, nature opposes". This could well screw up big time.
>
> DRAM needs to be refreshed by a read-write cycle every few milliseconds.
> Hard drives are the same, actually, except that the interval is measured
> in years, not milliseconds. Fill your brand new hard drive with data,
> then hammer it gently over a few years. Especially if a block's
> neighbours are repeatedly rewritten but this particular block is never
> touched, it is likely to become unreadable.
>
> So it will fail your test - reads will repeatedly fail - but if the
> firmware was given a look-in (by rewriting it) it wouldn't be remapped.

You mean it *would* be remapped (and all would be well).

I wonder... scrubbing is not very useful with md, particularly with RAID
6, because it does no writes unless something mismatches, and on failure
there is no attempt to determine which of the N disks is bad and rewrite
its contents from the other devices (nor, as I understand it, does it
clearly say which drive gave the error, so even failing it out and
resyncing it is hard).

If there was a way to get md to *rewrite* everything during a scrub,
rather than just checking, this might help (in addition to letting the
drive refresh the magnetization of absolutely everything). "repair" mode
appears to do no writes until an error is found, whereupon (on RAID 6) it
proceeds to make a "repair" that is more likely than not to overwrite good
data with bad. Optionally writing back what's already there on non-error
seems like it might be a worthwhile (and fairly simple) change.

--
NULL && (void)
* Re: Fault tolerance with badblocks
From: Anthony Youngman @ 2017-05-08 18:00 UTC
To: Nix; Cc: Ravi (Tom) Hale, linux-raid

On 08/05/17 15:50, Nix wrote:
> On 6 May 2017, Wols Lists outgrape:
>> So it will fail your test - reads will repeatedly fail - but if the
>> firmware was given a look-in (by rewriting it) it wouldn't be remapped.
>
> You mean it *would* be remapped (and all would be well).

No. The data would be lost, the block would be overwritten successfully,
and there would be no need to remap. Basically, the magnetism has decayed
(so the data can't be reconstructed from the extra error-recovery bits on
disk) and rewriting the block fixes the problem. But the data's been lost
...

> I wonder... scrubbing is not very useful with md, particularly with RAID
> 6, because it does no writes unless something mismatches, and on failure
> there is no attempt to determine which of the N disks is bad and rewrite
> its contents from the other devices (nor, as I understand it, does it
> clearly say which drive gave the error, so even failing it out and
> resyncing it is hard).

With redundant raid (and that doesn't include a two-disk, or even
three-disk, mirror), it SHOULD recalculate the failed block. If it doesn't
bother even though it can, I'd call that a bug in scrub. What I thought
happened was that it reads a stripe directly from disk, and if that fails
it reads the same stripe via the raid code, to get the raid error
correction to fire, and then rewrites the stripe.

What would be a nice touch is that, given we have a massive timeout for
non-SCT drives, if the scrub has to wait more than, say, 10 seconds for a
read to succeed, it then assumes the block is failing and rewrites it.
Actually, scrub that (groan... :-) - if the drive takes longer than 1/3 of
the timeout to respond, then the scrub assumes it's dodgy and rewrites it.

> If there was a way to get md to *rewrite* everything during scrub,
> rather than just checking, this might help (in addition to letting the
> drive refresh the magnetization of absolutely everything). "repair" mode
> appears to do no writes until an error is found, whereupon (on RAID 6)
> it proceeds to make a "repair" that is more likely than not to overwrite
> good data with bad. Optionally writing what's already there on non-error
> seems like it might be a worthwhile (and fairly simple) change.

Agreed. But without some heuristic, it's actually going to make a scrub
much slower, and achieve very little apart from adding unnecessary wear to
the drive.

Cheers,
Wol
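The "massive timeout for non-SCT drives" mentioned above refers to the
usual linux-raid configuration advice, which can be sketched as below.
The values and device name are illustrative.

```shell
# On drives that support SCT ERC, cap internal error recovery at 7
# seconds (the arguments are read/write limits in units of 100 ms), so
# the drive reports a URE quickly instead of retrying for minutes:
smartctl -l scterc,70,70 /dev/sda

# On desktop drives without SCT ERC, raise the kernel's per-command
# timeout instead, so the drive isn't reset (and kicked from the array)
# while it is still retrying internally:
echo 180 > /sys/block/sda/device/timeout
```

Either way, the goal is the same: the drive must give up and report the
error before the kernel gives up on the drive, so md gets a clean read
error it can repair by reconstruction.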
* Re: Fault tolerance with badblocks
From: David Brown @ 2017-05-09 10:11 UTC
To: Anthony Youngman, Nix; Cc: Ravi (Tom) Hale, linux-raid

On 08/05/17 20:00, Anthony Youngman wrote:
> With redundant raid (and that doesn't include a two-disk, or even
> three-disk mirror), it SHOULD recalculate the failed block. If it
> doesn't bother even though it can, I'd call that a bug in scrub.

Please read: <http://neil.brown.name/blog/20100211050355>

> What I thought happened was that it reads a stripe direct from disk, and
> if that failed it read the same stripe via the raid code, to get the
> raid error correction to fire, and then it rewrote the stripe.

That /is/ what happens. As I mentioned in another reply, /reading/ is
enough to trigger a rewrite on the disk if significant /correctable/
errors are discovered by the disk's firmware. It is extremely rare that
the raid level will see an error (see the linked article by Neil Brown) -
usually, the raid level sees a missing block because the disk firmware
could not read the block correctly. In such cases, the raid software will
write the correct data back to the same logical block, and the disk
firmware will remap it to a different physical block.

> What would be a nice touch, is that if we have a massive timeout for
> non-SCT drives, if the scrub has to wait more than, say, 10 seconds for
> a read to succeed it then assumes the block is failing and rewrites it.

I don't think the raid level can do that - it must wait for the drive to
finish handling the read request, or drop the drive entirely. If the disk
takes a long time to read a block, then it will either fail and mark the
block bad, or it will get the data off the disk and then automatically
rewrite it to a remapped block. The scrub can therefore handle it like any
other read.
* Re: Fault tolerance with badblocks
From: Nix @ 2017-05-09 10:18 UTC
To: Anthony Youngman; Cc: Ravi (Tom) Hale, linux-raid

On 8 May 2017, Anthony Youngman verbalised:
> On 08/05/17 15:50, Nix wrote:
>> I wonder... scrubbing is not very useful with md, particularly with
>> RAID 6, because it does no writes unless something mismatches, and on
>> failure there is no attempt to determine which of the N disks is bad
>> and rewrite its contents from the other devices (nor, as I understand
>> it, does it clearly say which drive gave the error, so even failing it
>> out and resyncing it is hard).
>
> With redundant raid (and that doesn't include a two-disk, or even
> three-disk mirror), it SHOULD recalculate the failed block. If it
> doesn't bother even though it can, I'd call that a bug in scrub.

It didn't, once upon a time (in 2010), and as far as I can tell from the
code it still doesn't.

> What I thought happened was that it reads a stripe direct from disk, and
> if that failed it read the same stripe via the raid code, to get the
> raid error correction to fire, and then it rewrote the stripe.

There's *failed*, which does trigger a rewrite, and there's "we got a
mismatch", which on RAID-6 arguably should trigger a rewrite, but instead
just tells you there was a mismatch - not where, nor even on which disk.

> What would be a nice touch, is that if we have a massive timeout for
> non-SCT drives, if the scrub has to wait more than, say, 10 seconds
> for a read to succeed it then assumes the block is failing and
> rewrites it.

What tends to happen is that the drive gets reset, which from md's
perspective is the drive vanishing and reappearing again. I don't see any
sane way for md to interpret *that* as anything but a possibly rather
major failure that should be reacted to by failing the drive out. I mean,
all md knows is that there was a timeout: for all it knows there are
electrical problems or something. The drive doesn't say (and doesn't get a
chance to say, because we reset it rather than wait five minutes for it to
tell us what's up).

> Actually, scrub that (groan... :-) - if the drive takes longer than 1/3
> of the timeout to respond, then the scrub assumes it's dodgy and
> rewrites it.

It's hard to rewrite anything on a drive that's too busy failing a read to
do anything else.

--
NULL && (void)
* Re: Fault tolerance with badblocks
From: Phil Turmel @ 2017-05-08 19:02 UTC
To: Nix, Wols Lists; Cc: Ravi (Tom) Hale, linux-raid

On 05/08/2017 10:50 AM, Nix wrote:
> I wonder... scrubbing is not very useful with md, particularly with RAID
> 6, because it does no writes unless something mismatches,

This is wrong. The purpose of scrubbing is to expose any sectors that have
degraded (as Wol describes) to the point of generating a read error. A
"check" scrub only writes back to the sectors that report a URE, giving
the drive firmware a chance to fix or relocate the sector.

A check scrub will NOT write on mismatch, just increment the mismatch
counter. This is the recommended regular scrubbing operation. You want to
know when mismatches occur.

> If there was a way to get md to *rewrite* everything during scrub,
> rather than just checking, this might help (in addition to letting the
> drive refresh the magnetization of absolutely everything).

This is actually counterproductive. Rewriting everything may refresh the
magnetism on weakening sectors, but will also prevent the drive from
*finding* weakening sectors that really do need relocation.

Phil
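The "check" scrub described above is driven through md's sysfs interface.
A sketch for an array named md0; the 60-second poll interval is arbitrary.

```shell
# Start a read-only scrub: every sector is read; unreadable sectors are
# repaired by reconstruction and rewrite, while parity/mirror mismatches
# are only counted, not corrected.
echo check > /sys/block/md0/md/sync_action

# Wait for the scrub to finish (sync_action returns to "idle") ...
while [ "$(cat /sys/block/md0/md/sync_action)" != "idle" ]; do
    sleep 60
done

# ... then see how many sectors disagreed across the stripes.
# Non-zero warrants investigation.
cat /sys/block/md0/md/mismatch_cnt
```

Distributions typically wire this up as a monthly cron job or systemd
timer so the counter is checked regularly.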
* Re: Fault tolerance with badblocks
From: Nix @ 2017-05-08 19:52 UTC
To: Phil Turmel; Cc: Wols Lists, Ravi (Tom) Hale, linux-raid

On 8 May 2017, Phil Turmel verbalised:
> On 05/08/2017 10:50 AM, Nix wrote:
>> I wonder... scrubbing is not very useful with md, particularly with
>> RAID 6, because it does no writes unless something mismatches,
>
> This is wrong. The purpose of scrubbing is to expose any sectors that
> have degraded (as Wol describes) to the point of generating a read
> error. A "check" scrub only writes back to the sectors that report a
> URE, giving the drive firmware a chance to fix or relocate the sector.
>
> A check scrub will NOT write on mismatch, just increment the mismatch
> counter. This is the recommended regular scrubbing operation. You want
> to know when mismatches occur.

And... then what do you do? On RAID-6, it appears the answer is "live with
a high probability of inevitable corruption". That's not very good. (AIUI,
if a check scrub finds a URE, it'll rewrite it, and when, in the common
case, the drive spares the sector out and the write succeeds, this will
not be reported as a mismatch: is that right?)

>> If there was a way to get md to *rewrite* everything during scrub,
>> rather than just checking, this might help (in addition to letting the
>> drive refresh the magnetization of absolutely everything).
>
> This is actually counterproductive. Rewriting everything may refresh
> the magnetism on weakening sectors, but will also prevent the drive from
> *finding* weakening sectors that really do need relocation.

If a sector weakens purely because of neighbouring writes or temperature
or a vibrating housing or something (i.e. not because of actual damage),
so that a rewrite will strengthen it and relocation was never necessary,
surely you've just saved a pointless bit of sector sparing? (I don't know:
I'm not sure what the relative frequency of these things is. Read and
write errors in general are so rare that it's quite possible I'm worrying
about nothing at all. I do know I forgot to scrub my old hardware RAID
array for about three years and nothing bad happened...)

--
NULL && (void)
* Re: Fault tolerance with badblocks
From: Anthony Youngman @ 2017-05-08 20:27 UTC
To: Nix, Phil Turmel; Cc: Ravi (Tom) Hale, linux-raid

On 08/05/17 20:52, Nix wrote:
> And... then what do you do? On RAID-6, it appears the answer is "live
> with a high probability of inevitable corruption". That's not very good.
> (AIUI, if a check scrub finds a URE, it'll rewrite it, and when in the
> common case the drive spares it out and the write succeeds, this will
> not be reported as a mismatch: is this right?)

I think you're misunderstanding RAID here. If the drive says "I can't read
this block", the RAID reconstructs the block and rewrites it. No
corruption.

If the scrub finds a mismatch, then the drives are reporting "everything's
fine here". Something's gone wrong, but the question is what? If you've
got a four-drive raid that reports a mismatch, how do you know which of
the four drives is corrupt? Doing an auto-correct here risks doing even
more damage. (I think a raid-6 could recover, but raid-5 is toast ...)

And seeing as drives are pretty much guaranteed (unless something's gone
BADLY wrong) to either (a) accurately return the data written, or (b)
return a read error, a data mismatch indicates something seriously wrong
that is NOTHING to do with the drives.

<snip>

> If a sector weakens purely because of neighbouring writes or temperature
> or a vibrating housing or something (i.e. not because of actual damage),
> so that a rewrite will strengthen it and relocation was never necessary,
> surely you've just saved a pointless bit of sector sparing? (I don't
> know: I'm not sure what the relative frequency of these things is. Read
> and write errors in general are so rare that it's quite possible I'm
> worrying about nothing at all. I do know I forgot to scrub my old
> hardware RAID array for about three years and nothing bad happened...)

Yes, you have saved a sector sparing.

Note that a consumer 3TB drive can return, on average, one error every
time it's read from end to end three times, and still be considered
"within spec", i.e. "not faulty", by the manufacturer. And that's a
*brand* *new* drive. That's why building a large array using consumer
drives is a stupid idea - an array of 4 x 3TB *within-spec* drives must
expect to handle at least one error every scrub. Okay - most drives are
actually way over spec, and could probably be read end-to-end many times
without a single error, but you'd be a fool to gamble on it.

Cheers,
Wol
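The back-of-envelope arithmetic behind that claim, assuming the common
consumer-drive spec of at most one unrecoverable read error per 1e14 bits
read (real drives usually do much better):

```shell
# Expected worst-case UREs from reading a 3TB drive end to end three
# times, at the quoted spec rate of 1 error per 1e14 bits read.
awk 'BEGIN {
    bits_read = 3e12 * 8 * 3     # 3 TB x 8 bits/byte x 3 full passes
    printf "%.2f\n", bits_read / 1e14
}'
```

That works out to roughly 0.7 expected errors per three full reads, which
is the order of magnitude behind "one error every three end-to-end reads"
and the earlier "accident every 10TB" figure.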
* Re: Fault tolerance with badblocks 2017-05-08 20:27 ` Anthony Youngman @ 2017-05-09 9:53 ` Nix 2017-05-09 11:09 ` David Brown 2017-05-09 21:32 ` NeilBrown 2017-05-09 16:05 ` Chris Murphy 1 sibling, 2 replies; 69+ messages in thread From: Nix @ 2017-05-09 9:53 UTC (permalink / raw) To: Anthony Youngman; +Cc: Phil Turmel, Ravi (Tom) Hale, linux-raid On 8 May 2017, Anthony Youngman told this: > If the scrub finds a mismatch, then the drives are reporting > "everything's fine here". Something's gone wrong, but the question is > what? If you've got a four-drive raid that reports a mismatch, how do > you know which of the four drives is corrupt? Doing an auto-correct > here risks doing even more damage. (I think a raid-6 could recover, > but raid-5 is toast ...) With a RAID-5 you are screwed: you can reconstruct the parity but cannot tell if it was actually right. You can make things consistent, but not correct. But with a RAID-6 you *do* have enough data to make things correct, with precisely the same probability as recovery of a RAID-5 "drive" of length a single sector. It seems wrong that not only does md not do this but doesn't even tell you which drive made the mistake so you could do the millions-of-times-slower process of a manual fail and readdition of the drive (or, if you suspect it of being wholly buggered, a manual fail and replacement). > And seeing as drives are pretty much guaranteed (unless something's > gone BADLY wrong) to either (a) accurately return the data written, or > (b) return a read error, that means a data mismatch indicates > something is seriously wrong that is NOTHING to do with the drives. This turns out not to be the case. See this ten-year-old paper: <https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf>. 
Five weeks of doing 2GiB writes on 3000 nodes once every two hours found, they estimated, 50 errors possibly attributable to disk problems (sector- or page-size regions of corrupted data) on 1/30th of their nodes. This is *not* rare and it is hard to imagine that 1/30th of disks used by CERN deserve discarding. It is better to assume that drives misdirect writes now and then, and to provide a means of recovering from them that does not take days of panic. RAID-6 gives you that means: md should use it. The page-sized regions of corrupted data were probably software -- but the sector-sized regions were just as likely the drives, possibly misdirected writes or misdirected reads. Neil decided not to do any repair work in this case on the grounds that if the drive is misdirecting one write it might misdirect the repair as well -- but if the repair is *consistently* misdirected, that seems relatively harmless (you had corruption before, you have it now, it just moved), and if it was a sporadic error, the repair is worthwhile. The only case in which a repair should not be attempted is if the drive is misdirecting all or most writes -- but in that case, by the time you do a scrub, on all but the quietest arrays you'll see millions of mismatches and it'll be obvious that it's time to throw the drive out. (Assuming md told you which drive it was.) >> If a sector weakens purely because of neighbouring writes or temperature >> or a vibrating housing or something (i.e. not because of actual damage), >> so that a rewrite will strengthen it and relocation was never necessary, >> surely you've just saved a pointless bit of sector sparing? (I don't >> know: I'm not sure what the relative frequency of these things is. Read >> and write errors in general are so rare that it's quite possible I'm >> worrying about nothing at all. I do know I forgot to scrub my old >> hardware RAID array for about three years and nothing bad happened...) >> > Yes you have saved a sector sparing. 
Note that a consumer 3TB drive > can return, on average, one error every time it's read from end to end > 3 times, and still be considered "within spec" ie "not faulty" by the Yeah, that's why RAID-6 is a good idea. :) > manufacturer. And that's a *brand* *new* drive. That's why building a > large array using consumer drives is a stupid idea - 4 x 3TB drives > and a *within* *spec* array must expect to handle at least one error > every scrub. That's just one reason why. The lack of control over URE timeouts is just as bad. > Okay - most drives are actually way over spec, and could probably be > read end-to-end many times without a single error, but you'd be a fool > to gamble on it. I'm trying *not* to gamble on it -- but I don't want to end up in the current situation we seem to have with md6, which is "oh, you have a mismatch, it's not going away, but we're neither going to tell you where it is nor what disk it's on nor repair it ourselves, even though we could, just to make it as hard as possible for you to repair the problem or even tell if it's a consistent one" (is the single mismatch an expected, spurious read error because of the volume of data you're reading, or one that's consistent and needs repair? All mismatch_cnt tells you is that there's a mismatch). -- NULL && (void) ^ permalink raw reply [flat|nested] 69+ messages in thread
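The CERN figures quoted earlier in this message are easier to weigh with the volumes written out; a sketch (the five-week duration, two-hour cadence, and 50-error estimate are taken from the text above, the rest is plain arithmetic):

```python
nodes = 3000
gib_per_write = 2
writes_per_day = 24 // 2      # one 2 GiB write every two hours
days = 5 * 7                  # five weeks

total_gib = nodes * gib_per_write * writes_per_day * days
errors = 50                   # CERN's estimate of disk-attributable errors

print(total_gib)              # total GiB written across the fleet
print(total_gib / errors)     # GiB written per observed error
```

That works out to roughly 2.4 PiB written and on the order of one silent error per ~50 TiB, which is why "rare" and "safe to ignore" are not the same thing at array scale.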
* Re: Fault tolerance with badblocks 2017-05-09 9:53 ` Nix @ 2017-05-09 11:09 ` David Brown 2017-05-09 11:27 ` Nix 2017-05-09 21:32 ` NeilBrown 1 sibling, 1 reply; 69+ messages in thread From: David Brown @ 2017-05-09 11:09 UTC (permalink / raw) To: Nix, Anthony Youngman; +Cc: Phil Turmel, Ravi (Tom) Hale, linux-raid On 09/05/17 11:53, Nix wrote: > On 8 May 2017, Anthony Youngman told this: > >> If the scrub finds a mismatch, then the drives are reporting >> "everything's fine here". Something's gone wrong, but the question is >> what? If you've got a four-drive raid that reports a mismatch, how do >> you know which of the four drives is corrupt? Doing an auto-correct >> here risks doing even more damage. (I think a raid-6 could recover, >> but raid-5 is toast ...) > > With a RAID-5 you are screwed: you can reconstruct the parity but cannot > tell if it was actually right. You can make things consistent, but not > correct. > > But with a RAID-6 you *do* have enough data to make things correct, with > precisely the same probability as recovery of a RAID-5 "drive" of length > a single sector. No, you don't have enough data to make things correct. You /might/ have enough data to make a guess at what /might/ be right, but the guess might also be wrong. And you don't have enough data to have the slightest idea about the probabilities. And you don't have enough data to know if "fixing" it will help overall, or make things worse if you accidentally "fix" the wrong block. (See the link I gave in other posts for details.) > It seems wrong that not only does md not do this but > doesn't even tell you which drive made the mistake so you could do the > millions-of-times-slower process of a manual fail and readdition of the > drive (or, if you suspect it of being wholly buggered, a manual fail and > replacement). 
> >> And seeing as drives are pretty much guaranteed (unless something's >> gone BADLY wrong) to either (a) accurately return the data written, or >> (b) return a read error, that means a data mismatch indicates >> something is seriously wrong that is NOTHING to do with the drives. > > This turns out not to be the case. See this ten-year-old paper: > <https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf>. > Five weeks of doing 2GiB writes on 3000 nodes once every two hours > found, they estimated, 50 errors possibly attributable to disk problems > (sector- or page-size regions of corrupted data) on 1/30th of their > nodes. This is *not* rare and it is hard to imagine that 1/30th of disks > used by CERN deserve discarding. It is better to assume that drives > misdirect writes now and then, and to provide a means of recovering from > them that does not take days of panic. RAID-6 gives you that means: md > should use it. RAID-6 does not help here. You have to understand the types of errors that can occur, the reasons for them, the possibilities for detection, the possibilities for recovery, and what the different layers in the system can do about them. RAID (1/5/6) will let you recover from one or more known failed reads, on the assumption that the driver firmware is correct, memories have no errors, buses have no errors, block writes are atomic, write ordering matches the flush commands, block reads are either correct or marked as failed, etc. RAID will /not/ let you reliably detect or correct other sorts of errors. It is designed to cheaply and simply reduce the risk of a certain class of possible errors - it is not a magic method of stopping all errors. Similarly, the drive firmware works under certain assumptions to greatly reduce other sorts of errors (those local to the block), but not everything. And ECC memory, PCI bus CRCs, and other such things reduce the risk of other kinds of error. 
If you need more error checking or correction, you need different mechanisms. For example, BTRFS and ZFS will do checksumming on the filesystem level. They can be combined with raid/duplication to allow correction on checksum error. And they can usefully build on top of a normal md raid layer, or use their own raid (with its pros and cons). Or you can have multiple servers and also track md5 sums of files, with cross-server scrubbing of the data. There are lots of possibilities, depending on what you want to get. What does /not/ work, however, is trying to squeeze magic capabilities out of existing layers in the system, or expecting more out of them than they can give. ^ permalink raw reply [flat|nested] 69+ messages in thread
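The filesystem-level checksumming mentioned above can be sketched in a few lines: a checksum stored separately from the block is an independent witness that tells you *which* copy is wrong, something parity alone cannot. This is a toy illustration of the idea, not the Btrfs or ZFS implementation (the function names and the CRC-32 choice are illustrative):

```python
import zlib

def write_block(store, sums, addr, data):
    """Store a block and, separately from it, its checksum."""
    store[addr] = data
    sums[addr] = zlib.crc32(data)

def read_block(store, sums, addr, reconstruct):
    """Verify on read; on mismatch, rebuild from redundancy and self-heal."""
    data = store[addr]
    if zlib.crc32(data) == sums[addr]:
        return data                      # this copy checks out
    fixed = reconstruct(addr)            # e.g. read the mirror, or rebuild from parity
    if zlib.crc32(fixed) != sums[addr]:
        raise IOError("block %d: no copy matches its checksum" % addr)
    store[addr] = fixed                  # rewrite the bad copy in place
    return fixed

# Silent corruption on the primary copy is caught and healed from a mirror:
primary, mirror, sums = {}, {}, {}
write_block(primary, sums, 0, b"important data")
mirror[0] = b"important data"
primary[0] = b"importANT data"           # bit rot: no read error reported
assert read_block(primary, sums, 0, lambda a: mirror[a]) == b"important data"
assert primary[0] == b"important data"   # the bad copy was repaired
```

The design point: because the checksum vouches for the data independently of the redundancy, there is no "which of N drives lied?" ambiguity left to resolve.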
* Re: Fault tolerance with badblocks 2017-05-09 11:09 ` David Brown @ 2017-05-09 11:27 ` Nix 2017-05-09 11:58 ` David Brown 2017-05-09 19:16 ` Fault tolerance with badblocks Phil Turmel 0 siblings, 2 replies; 69+ messages in thread From: Nix @ 2017-05-09 11:27 UTC (permalink / raw) To: David Brown; +Cc: Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, linux-raid On 9 May 2017, David Brown uttered the following: > On 09/05/17 11:53, Nix wrote: >> This turns out not to be the case. See this ten-year-old paper: >> <https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf>. >> Five weeks of doing 2GiB writes on 3000 nodes once every two hours >> found, they estimated, 50 errors possibly attributable to disk problems >> (sector- or page-size regions of corrupted data) on 1/30th of their >> nodes. This is *not* rare and it is hard to imagine that 1/30th of disks >> used by CERN deserve discarding. It is better to assume that drives >> misdirect writes now and then, and to provide a means of recovering from >> them that does not take days of panic. RAID-6 gives you that means: md >> should use it. > > RAID-6 does not help here. You have to understand the types of errors > that can occur, the reasons for them, the possibilities for detection, > the possibilities for recovery, and what the different layers in the > system can do about them. > > RAID (1/5/6) will let you recover from one or more known failed reads, > on the assumption that the driver firmware is correct, memories have no > errors, buses have no errors, block writes are atomic, write ordering > matches the flush commands, block reads are either correct or marked as > failed, etc. I think you're being too pedantic. Many of these things are known not to be true on real hardware, and at least one of them cannot possibly be true without a journal (atomic block writes). 
Nonetheless, the md layer is quite happy to rebuild after a failed disk even though the write hole might have torn garbage into your data, on the grounds that it *probably* did not. If your argument was used everywhere, md would never have been started because 100% reliability was not guaranteed. The same, it seems to me, is true of cases in which one drive in a RAID-6 reports a few mismatched blocks. It is true that you don't know the cause of the mismatches, but you *do* know which bit of the mismatch is wrong and what data should be there, subject only to the assumption that sufficiently few drives have made simultaneous mistakes that redundancy is preserved. And that's the same assumption RAID >0 makes all the time anyway! The only difference in the disk-failure case is that you know that one drive has failed without needing to ask other drives to be sure. I mean, yeah, *possibly* in the RAID-6 mismatch case *five* drives have gone simultaneously wrong in such a way that their syndromes all match and the one surviving drive is mistakenly misrepaired, but frankly you'd need to wait for black holes to evaporate of old age before this became an issue. (I'm not suggesting repairing RAID-5 mismatches. That's clearly impossible. You can't even tell what disk is affected. But in the RAID-6 case none of this is impossible, or so it seems to me. You have at least three and probably four or more drives with consistent syndromes, and one that is out of whack. You know which one must be wrong -- the "minority vote" -- and you know what has to be done to make it consistent with the others again. Why not do it? It's no more risky than that aspect of a RAID rebuild from a failed disk would be.) > RAID will /not/ let you reliably detect or correct other sorts of > errors. ... only it clearly can. 
What stops it from handling the RAID-6-and-one-disk-is-wrong case when it can handle the RAID-6-and-one-disk-has-failed case, given that you can unambiguously determine which disk is wrong using the data on the surviving drives, with an undetected-failure probability of something way below 2^128? (I could work out the actual value but I haven't had any coffee yet and it seems pointless when it's that low.) > What does /not/ work, however, is trying to squeeze magic capabilities > out of existing layers in the system, or expecting more out of them that > they can give. I don't see that these capabilities are any more magic than what RAID-6 does already. It can recover from two failed drives: why can't it recover from one wrong one? (Or, rather, from one drive with very occasionally wrong sectors on it. Obviously if it was always getting things wrong its presence is not a benefit and you have essentially fallen back to nothing better than RAID-5, only with worse performance. But that's what error thresholds are for, which md already employs in similar situations.) -- NULL && (void) ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 11:27 ` Nix @ 2017-05-09 11:58 ` David Brown 2017-05-09 17:25 ` Chris Murphy 2017-05-09 19:16 ` Fault tolerance with badblocks Phil Turmel 1 sibling, 1 reply; 69+ messages in thread From: David Brown @ 2017-05-09 11:58 UTC (permalink / raw) To: Nix; +Cc: Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, linux-raid On 09/05/17 13:27, Nix wrote: > On 9 May 2017, David Brown uttered the following: > (I'm not suggesting repairing RAID-5 mismatches. That's clearly > impossible. You can't even tell what disk is affected. But in the RAID-6 > case none of this is impossible, or so it seems to me. You have at least > three and probably four or more drives with consistent syndromes, and > one that is out of whack. You know which one must be wrong -- the > "minority vote" -- and you know what has to be done to make it > consistent with the others again. Why not do it? It's no more risky than > that aspect of a RAID rebuild from a failed disk would be.) > >> RAID will /not/ let you reliably detect or correct other sorts of >> errors. > > ... only it clearly can. What stops it from handling the RAID-6-and- > one-disk-is-wrong case where it cannot handle the RAID-6-and-one-disk- > has-failed case, given that you can unambiguously determine which disk > is wrong using the data on the surviving drives, with an undetected- > failure probability of something way below 2^128? (I could work out the > actual value but I haven't had any coffee yet and it seems pointless > when it's that low.) > >> What does /not/ work, however, is trying to squeeze magic capabilities >> out of existing layers in the system, or expecting more out of them that >> they can give. > > I don't see that these capabilities are any more magic than what RAID-6 > does already. It can recover from two failed drives: why can't it > recover from one wrong one? (Or, rather, from one drive with very > occasionally wrong sectors on it. 
Obviously if it was always getting > things wrong its presence is not a benefit and you have essentially > fallen back to nothing better than RAID-5, only with worse performance. > But that's what error thresholds are for, which md already employs in > similar situations.) > I thought you said that you had read Neil's article. Please go back and read it again. If you don't agree with what is written there, then there is little more I can say to convince you. One thing I can try is to note that you are /not/ the first person to think "Surely with RAID-6 we can correct mismatches - it should be easy?". You are /not/ the first person to think "Correcting RAID-6 mismatches would be a marvellous feature that would make it /far/ better". Linux md raid does not correct RAID-6 mismatches found on a scrub. To my (admittedly limited) knowledge, hardware RAID-6 systems do not correct mismatches found on a scrub. If correcting RAID-6 mismatches were as simple, reliable, and useful as you seem to believe, then I think Linux md raid would already do it - either as part of the scrub, or as an extra utility to run on mismatched stripes. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 11:58 ` David Brown @ 2017-05-09 17:25 ` Chris Murphy 2017-05-09 19:44 ` Wols Lists ` (2 more replies) 0 siblings, 3 replies; 69+ messages in thread From: Chris Murphy @ 2017-05-09 17:25 UTC (permalink / raw) To: David Brown Cc: Nix, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On Tue, May 9, 2017 at 5:58 AM, David Brown <david.brown@hesbynett.no> wrote: > I thought you said that you had read Neil's article. Please go back and > read it again. If you don't agree with what is written there, then > there is little more I can say to convince you. > > One thing I can try, is to note that you are /not/ the first person to > think "Surely with RAID-6 we can correct mismatches - it should be > easy?". H. Peter Anvin's RAID 6 paper, section 4 is what's apparently under discussion http://milbret.anydns.info/pub/linux/kernel/people/hpa/raid6.pdf This is totally non-trivial, especially because it says raid6 cannot detect or correct more than one corruption, and ensuring that additional corruption isn't introduced in the rare case is even more non-trivial. I do think it's sane for raid6 repair to avoid the current assumption that data strip is correct, by doing the evaluation in equation 27. If there's no corruption do nothing, if there's corruption of P or Q then replace, if there's corruption of data, then report but do not repair as follows: 1. md reports all data drives and the LBAs for the affected stripe (otherwise this is not simple if it has to figure out which drive is actually affected but that's not required, just a matter of better efficiency in finding out what's really affected.) 2. the file system needs to be able to accept the error from md 3. the file system reports what it negatively impacted: file system metadata or data and if data, the full filename path. And now suddenly this work is likewise non-trivial. And there is already something that will do exactly this: ZFS and Btrfs. 
Both can unambiguously, efficiently determine whether data is corrupt even if a drive doesn't report a read error. -- Chris Murphy ^ permalink raw reply [flat|nested] 69+ messages in thread
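The evaluation referred to above (section 4 / equation 27 of Anvin's paper) can be sketched directly: recompute P and Q over the stripe, and the pair of syndromes classifies a *single* corrupt block and, for a data block, locates and repairs it. A toy Python version over GF(2^8) with the RAID-6 polynomial (a 4-byte "stripe"; this is an illustration of the math, not md's actual code):

```python
# GF(2^8) log/antilog tables, polynomial 0x11d, generator g = 2 (as in RAID-6).
EXP = [0] * 510
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 510):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def pq(data):
    p = q = 0
    for i, d in enumerate(data):
        p ^= d                     # P = d_0 + d_1 + ...        (XOR parity)
        q ^= gf_mul(EXP[i], d)     # Q = g^0*d_0 + g^1*d_1 + ... (Reed-Solomon)
    return p, q

def diagnose(data, p_stored, q_stored):
    """Classify a single corrupt block: ('ok',), ('P',), ('Q',),
    or ('data', z, repaired_byte) for data drive z."""
    p, q = pq(data)
    ps, qs = p ^ p_stored, q ^ q_stored
    if ps == 0 and qs == 0:
        return ('ok',)
    if qs == 0:
        return ('P',)              # only P disagrees: the P block is corrupt
    if ps == 0:
        return ('Q',)              # only Q disagrees: the Q block is corrupt
    z = (LOG[qs] - LOG[ps]) % 255  # Q*/P* = g^z identifies the data drive
    if z >= len(data):
        return ('inconsistent',)   # syndromes don't name a real drive: >1 error
    return ('data', z, data[z] ^ ps)
```

Corrupting one byte of the stripe makes `diagnose()` name the right drive and recover the original byte. The caveat is exactly the one raised in this thread: the arithmetic only holds for a single bad block per stripe, and two silent corruptions in one stripe can masquerade as a plausible single-drive answer, so "repairing" on this evidence alone is a policy decision, not just math.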
* Re: Fault tolerance with badblocks 2017-05-09 17:25 ` Chris Murphy @ 2017-05-09 19:44 ` Wols Lists 2017-05-10 3:53 ` Chris Murphy 2017-05-09 20:18 ` Nix 2017-05-09 21:06 ` A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) Nix 2 siblings, 1 reply; 69+ messages in thread From: Wols Lists @ 2017-05-09 19:44 UTC (permalink / raw) To: Chris Murphy, David Brown; +Cc: Nix, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On 09/05/17 18:25, Chris Murphy wrote: > On Tue, May 9, 2017 at 5:58 AM, David Brown <david.brown@hesbynett.no> wrote: > >> I thought you said that you had read Neil's article. Please go back and >> read it again. If you don't agree with what is written there, then >> there is little more I can say to convince you. >> >> One thing I can try, is to note that you are /not/ the first person to >> think "Surely with RAID-6 we can correct mismatches - it should be >> easy?". > > H. Peter Anvin's RAID 6 paper, section 4 is what's apparently under discussion > http://milbret.anydns.info/pub/linux/kernel/people/hpa/raid6.pdf > > This is totally non-trivial, especially because it says raid6 cannot > detect or correct more than one corruption, and ensuring that > additional corruption isn't introduced in the rare case is even more > non-trivial. And can I point out that that is just one person's opinion? A well-informed, respected person true, but it's still just opinion. And imho the argument that says raid should not repair the data applies equally against fsck - that shouldn't do any repair either! :-) > > I do think it's sane for raid6 repair to avoid the current assumption > that data strip is correct, by doing the evaluation in equation 27. If > there's no corruption do nothing, if there's corruption of P or Q then > replace, if there's corruption of data, then report but do not repair > as follows: From an ENGINEERING viewpoint, what is the probability that we get a two-drive error? 
And if we do, then there's probably something rather more serious gone wrong? > > 1. md reports all data drives and the LBAs for the affected stripe > (otherwise this is not simple if it has to figure out which drive is > actually affected but that's not required, just a matter of better > efficiency in finding out what's really affected.) md should report the error AND THE DRIVE THAT APPEARS TO BE FAULTY. (Or maybe we leave that to the below-mentioned mdfsck.) That way, if it's a bunch of errors on the same drive we know we've got a problem with the drive. If we've got a bunch of errors on random drives, we know the problem is probably elsewhere. > > 2. the file system needs to be able to accept the error from md > > 3. the file system reports what it negatively impacted: file system > metadata or data and if data, the full filename path. > > And now suddenly this work is likewise non-trivial. Which is why we keep the filesystem out of this. By all means make md return a list of dud strips, which a filesystem-level utility can then interpret, but that isn't md's problem. > > And there is already something that will do exactly this: ZFS and > Btrfs. Both can unambiguously, efficiently determine whether data is > corrupt even if a drive doesn't report a read error. > Or we write an mdfsck program. Just like you shouldn't run fsck with write privileges on a mounted filesystem, you wouldn't run mdfsck with filesystems in the array mounted. At the end of the day, md should never corrupt data by default. Which is what it sounds like is happening at the moment, if it's assuming the data sectors are correct and the parity is wrong. If one parity appears correct then by all means rewrite the second ... But the current setup, where it's currently quite happy to assume a single-drive error and rewrite it if it's a parity drive, but it won't assume a single-drive error and rewrite it if it's a data drive, just seems totally wrong. 
Worse, in the latter case, it seems it actively prevents fixing the problem by updating the parity and (probably) corrupting the data. Report the error, give the user the tools to fix it, and LET THEM sort it out. Just like we do when we run fsck on a filesystem. (I know I know, patches welcome :-) Cheers, Wol ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 19:44 ` Wols Lists @ 2017-05-10 3:53 ` Chris Murphy 2017-05-10 4:49 ` Wols Lists ` (2 more replies) 0 siblings, 3 replies; 69+ messages in thread From: Chris Murphy @ 2017-05-10 3:53 UTC (permalink / raw) To: Wols Lists; +Cc: Linux-RAID On Tue, May 9, 2017 at 1:44 PM, Wols Lists <antlists@youngman.org.uk> wrote: >> This is totally non-trivial, especially because it says raid6 cannot >> detect or correct more than one corruption, and ensuring that >> additional corruption isn't introduced in the rare case is even more >> non-trivial. > > And can I point out that that is just one person's opinion? Right off the bat you ask a stupid question that contains the answer to your own stupid question. This is condescending and annoying, and it invites treating you with suspicion as a troll. But then you make it worse by saying it again: > A > well-informed, respected person true, but it's still just opinion. Except it is not just an opinion, it's a fact by any objective reader who isn't even a programmer, let alone if you know something about math and/or programming. Let's break down how totally stupid your position is. 1. Opinions don't count for much. 2. You have presented no code that contradicts the opinion that this is hard. You've opined that an opinion is to be discarded at face value. Therefore your own opinion is just an opinion and likewise discardable. 3. How to do the thing you think is trivial has been well documented for some time and yet there are essentially no implementations. That it's simple to do (your idea) and yet does not exist (fact) means this is a big fat conspiracy to fuck you over, on purpose. It's so asinine I feel trolled right now. >And > imho the argument that says raid should not repair the data applies > equally against fsck - that shouldn't do any repair either! :-) And now the dog shit cake has cat shit icing on it. Great. 
>> And there is already something that will do exactly this: ZFS and >> Btrfs. Both can unambiguously, efficiently determine whether data is >> corrupt even if a drive doesn't report a read error. >> > Or we write an mdfsck program. Just like you shouldn't run fsck with > write privileges on a mounted filesystem, you wouldn't run mdfsck with > filesystems in the array mounted. Who is we? Are you volunteering other people to build you a feature? > At the end of the day, md should never corrupt data by default. Which is > what it sounds like is happening at the moment, if it's assuming the > data sectors are correct and the parity is wrong. If one parity appears > correct then by all means rewrite the second ... This is an obtuse and frankly malicious characterization. Scrubs don't happen by default. And scrub repair's assuming data strips are correct is well documented. If you don't like this assumption, don't use scrub repair. You can't say corruption happens by default unless you admit that there's URE's on a drive by default - of course that's absurd and makes no sense. > > But the current setup, where it's currently quite happy to assume a > single-drive error and rewrite it if it's a parity drive, but it won't > assume a single-drive error and and rewrite it if it's a data drive, > just seems totally wrong. Worse, in the latter case, it seems it > actively prevents fixing the problem by updating the parity and > (probably) corrupting the data. The data is already corrupted by definition. No additional damage to data is done. What does happen is good P and Q are replaced by bad P and Q which match the already bad data. And nevertheless you have the very real problem that drives lie about having committed data to stable media. And they reorder writes, breaking the write order assumptions of things. And we have RMW happening on live arrays. 
And that means you have a real likelihood that you cannot absolutely determine with the available information why P and Q don't agree with the data; you're still making probability assumptions, and if that assumption is wrong any correction will introduce more corruption. The only unambiguous way to do this has already been done and it's ZFS and Btrfs. And a big part of why they can do what they do is because they are copy on write. If you need to solve the problem of ambiguous data strip integrity in relation to P and Q, then use ZFS. It's production ready. If you are prepared to help test and improve things, then you can look into the Btrfs implementation. Otherwise I'm sure md and LVM folks have a feature list that represents a few years of work as it is without yet another pile on. > > Report the error, give the user the tools to fix it, and LET THEM sort > it out. Just like we do when we run fsck on a filesystem. They're not at all comparable. One is a file system, the other a raid implementation, they have nothing in common. -- Chris Murphy ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-10 3:53 ` Chris Murphy @ 2017-05-10 4:49 ` Wols Lists 2017-05-10 17:18 ` Chris Murphy 2017-05-16 3:20 ` NeilBrown 2017-05-10 5:00 ` Dave Stevens 2017-05-10 16:44 ` Edward Kuns 2 siblings, 2 replies; 69+ messages in thread From: Wols Lists @ 2017-05-10 4:49 UTC (permalink / raw) To: Chris Murphy; +Cc: Linux-RAID On 10/05/17 04:53, Chris Murphy wrote: > On Tue, May 9, 2017 at 1:44 PM, Wols Lists <antlists@youngman.org.uk> wrote: > >>> This is totally non-trivial, especially because it says raid6 cannot >>> detect or correct more than one corruption, and ensuring that >>> additional corruption isn't introduced in the rare case is even more >>> non-trivial. >> >> And can I point out that that is just one person's opinion? > > Right off the bat you ask a stupid question that contains the answer > to your own stupid question. This is condescending and annoying, and > it invites treating you with suspicious as a troll. But then you make > it worse by saying it again: > Sorry. But I thought we were talking about *Neil's* paper. My bad for missing it. >> A >> well-informed, respected person true, but it's still just opinion. > > Except it is not just an opinion, it's a fact by any objective reader > who isn't even a programmer, let alone if you know something about > math and/or programming. Let's break down how totally stupid your > position is. > <snip ad hominems :-) > > >> At the end of the day, md should never corrupt data by default. Which is >> what it sounds like is happening at the moment, if it's assuming the >> data sectors are correct and the parity is wrong. If one parity appears >> correct then by all means rewrite the second ... > > This is an obtuse and frankly malicious characterization. Scrubs don't > happen by default. And scrub repair's assuming data strips are correct > is well documented. If you don't like this assumption, don't use scrub > repair. 
You can't say corruption happens by default unless you admit > that there's URE's on a drive by default - of course that's absurd and > makes no sense. > Documenting bad behaviour doesn't turn it into good behaviour, though ... >> >> But the current setup, where it's currently quite happy to assume a >> single-drive error and rewrite it if it's a parity drive, but it won't >> assume a single-drive error and and rewrite it if it's a data drive, >> just seems totally wrong. Worse, in the latter case, it seems it >> actively prevents fixing the problem by updating the parity and >> (probably) corrupting the data. > > The data is already corrupted by definition. No additional damage to > data is done. What does happen is good P and Q are replaced by bad P > and Q which matches the already bad data. Except, in my world, replacing good P & Q by bad P & Q *IS* doing additional damage! We can identify and fix the bad data. So why don't we? Throwing away good P & Q prevents us from doing that, and means we can no longer recover the good data! > > And nevertheless you have the very real problem that drives lie about > having committed data to stable media. And they reorder writes, > breaking the write order assumptions of things. And we have RMW > happening on live arrays. And that means you have a real likelihood > that you cannot absolutely determine with the available information > why P and Q don't agree with the data, you're still making probability > assumptions and if that assumption is wrong any correction will > introduce more corruption. > > The only unambiguous way to do this has already been done and it's ZFS > and Btrfs. And a big part of why they can do what they do is because > they are copy on write. IIf you need to solve the problem of ambiguous > data strip integrity in relation to P and Q, then use ZFS. It's > production ready. If you are prepared to help test and improve things, > then you can look into the Btrfs implementation. 
So how come btrfs and ZFS can handle this, and md can't? Can't md use the same techniques? (Seriously, I don't know the answer. But, like Nix, when I feel I'm being fed the answer "we're not going to give you the choice because we know better than you", I get cheesed off. If I get the answer "we're snowed under, do it yourself" then that is normal and acceptable.) > > Otherwise I'm sure md and LVM folks have a feature list that > represents a few years of work as it is without yet another pile on. > >> >> Report the error, give the user the tools to fix it, and LET THEM sort >> it out. Just like we do when we run fsck on a filesystem. > > They're not at all comparable. One is a file system, the other a raid > implementation, they have nothing in common. > > And what are file systems and raid implementations? They are both data store abstractions. They have everything in common. Oh and by the way, now I've realised my mistake, I've taken a look at the paper you mention. In particular, section 4. Yes it does say you can't detect and correct multi-disk errors - but that's not what we're asking for! By implication, it seems to be saying LOUD AND CLEAR that you CAN detect and correct a single-disk error. So why the blankety-blank won't md let you do that! Neil's point seems to be that it's a bad idea to do it automatically. I get his logic. But to then actively prevent you doing it manually - this is the paternalistic attitude that gets my goat. Anyways, I've been thinking about this, and I've got a proposal (RFC?). I haven't got time right now - I'm supposed to be at work - but I'll write it up this evening. If the response is "we're snowed under - it sounds a good idea but do it yourself", then so be it. But if the response is "we don't want the sysadmin to have the choice", then expect more flak from people like Nix and me. (And the proposal involves giving sysadmins CHOICE. 
If they want to take the hit, it's *their* decision, not a paternalistic choice forced on them.) (Sorry to keep on about paternalism, but there is a sense that decisions have been made, and they're not going to be reversed "because I say so". I'm NOT getting a "you want it, you write it" vibe, and that's what gets to me.) Cheers, Wol ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-10 4:49 ` Wols Lists @ 2017-05-10 17:18 ` Chris Murphy 2017-05-16 3:20 ` NeilBrown 1 sibling, 0 replies; 69+ messages in thread From: Chris Murphy @ 2017-05-10 17:18 UTC (permalink / raw) To: Wols Lists; +Cc: Chris Murphy, Linux-RAID On Tue, May 9, 2017 at 10:49 PM, Wols Lists <antlists@youngman.org.uk> wrote: > On 10/05/17 04:53, Chris Murphy wrote: >> On Tue, May 9, 2017 at 1:44 PM, Wols Lists <antlists@youngman.org.uk> wrote: >> >>>> This is totally non-trivial, especially because it says raid6 cannot >>>> detect or correct more than one corruption, and ensuring that >>>> additional corruption isn't introduced in the rare case is even more >>>> non-trivial. >>> >>> And can I point out that that is just one person's opinion? >> >> Right off the bat you ask a stupid question that contains the answer >> to your own stupid question. This is condescending and annoying, and >> it invites treating you with suspicion as a troll. But then you make >> it worse by saying it again: >> > Sorry. But I thought we were talking about *Neil's* paper. My bad for > missing it. Doesn't matter. Your standard is mere opinions are ignorable, and therefore by your own standard you can be ignored for posing mere opinions yourself. You set your own trap but you clearly want to hold a double standard: your opinions are valid and should be listened to, and others' opinions are merely opinion and can be easily discarded. >>> A >>> well-informed, respected person true, but it's still just opinion. >> >> Except it is not just an opinion, it's a fact by any objective reader >> who isn't even a programmer, let alone if you know something about >> math and/or programming. Let's break down how totally stupid your >> position is. >> > > <snip ad hominems :-) > It is not an ad hominem attack to evaluate your lack of logic. An ad hominem attack is one on the person rather than their arguments. 
I haven't attacked you, I've attacked your arguing style and the deep ignorance that style conveys. And you shouldn't like it, but you have only yourself to blame, you didn't exactly bother to do any list archive research before deciding everyone's foolish for having withheld this feature from you personally. It almost immediately became noise. >>> At the end of the day, md should never corrupt data by default. Which is >>> what it sounds like is happening at the moment, if it's assuming the >>> data sectors are correct and the parity is wrong. If one parity appears >>> correct then by all means rewrite the second ... >> >> This is an obtuse and frankly malicious characterization. Scrubs don't >> happen by default. And scrub repair's assumption that data strips are correct >> is well documented. If you don't like this assumption, don't use scrub >> repair. You can't say corruption happens by default unless you admit >> that there's UREs on a drive by default - of course that's absurd and >> makes no sense. >> > Documenting bad behaviour doesn't turn it into good behaviour, though ... It is a common loophole to describe the chosen behavior when good behavior is difficult or infeasible. It happens all the time. Complaining here isn't going to change this. >>> >>> But the current setup, where it's currently quite happy to assume a >>> single-drive error and rewrite it if it's a parity drive, but it won't >>> assume a single-drive error and rewrite it if it's a data drive, >>> just seems totally wrong. Worse, in the latter case, it seems it >>> actively prevents fixing the problem by updating the parity and >>> (probably) corrupting the data. >> >> The data is already corrupted by definition. No additional damage to >> data is done. What does happen is good P and Q are replaced by bad P >> and Q which matches the already bad data. > > Except, in my world, replacing good P & Q by bad P & Q *IS* doing > additional damage! Arguing about it doesn't make it true. 
The primary data is corrupt, and in normal operation P & Q are not checked, so the array will always silently return corrupt data in normal operation; and if there is a failure that does not exactly coincide with the corruption, the corrupt data read during the ensuing reconstruction will corrupt the reconstruction even though P & Q are good. So what you want to fix is a lot of cost for almost no gain. >We can identify and fix the bad data. So why don't > we? Throwing away good P & Q prevents us from doing that, and means we > can no longer recover the good data! There is no possible way to know that P & Q are both good. That requires assumption. So you've arbitrarily traded an assumption you don't like for one that you do like, but have no evidence for in either case. There are better ways to solve this problem. md and LVM raid are really about solving one or two particular problems, and that is not data integrity; it is data availability, and recovery via reconstruction rather than from backups being restored. Better is defined by the use case at hand. Some use cases will want this solved at the file system level, which points to ZFS or Btrfs - the very problem you're talking about is one of those problems that led to the design of both of those file systems. Other use cases can have it solved at an application level. And still others will solve it with a cluster file system, like glusterfs does with per file checksums and replication. >> And nevertheless you have the very real problem that drives lie about >> having committed data to stable media. And they reorder writes, >> breaking the write order assumptions of things. And we have RMW >> happening on live arrays. And that means you have a real likelihood >> that you cannot absolutely determine with the available information >> why P and Q don't agree with the data, you're still making probability >> assumptions and if that assumption is wrong any correction will >> introduce more corruption. 
>> >> The only unambiguous way to do this has already been done and it's ZFS >> and Btrfs. And a big part of why they can do what they do is because >> they are copy on write. If you need to solve the problem of ambiguous >> data strip integrity in relation to P and Q, then use ZFS. It's >> production ready. If you are prepared to help test and improve things, >> then you can look into the Btrfs implementation. > > So how come btrfs and ZFS can handle this, and md can't? All data and metadata blocks are checksummed, and they're always verified during normal operation for every read. The data checksums are themselves checksummed. Even if a drive does not report an error, errors can be detected and trigger reconstruction if redundant metadata or data is available. md does not checksum anything but its own metadata, which is just the superblock; there isn't much of anything else to it. There are no checksums for data strips or parity strips, there's no timestamp for any of the writes, there's a distinct lack of information to be able to do an autopsy after the fact without any assumptions. > Can't md use > the same techniques? (Seriously, I don't know the answer. But, like Nix, > when I feel I'm being fed the answer "we're not going to give you the > choice because we know better than you", I get cheesed off. If I get the > answer "we're snowed under, do it yourself" then that is normal and > acceptable.) No, they operate on completely different architectures and assumptions. You really should search the archives, all of these things you're wanting to discuss now have already been discussed and argued and nothing has changed. >> >> Otherwise I'm sure md and LVM folks have a feature list that >> represents a few years of work as it is without yet another pile on. >> >>> >>> Report the error, give the user the tools to fix it, and LET THEM sort >>> it out. Just like we do when we run fsck on a filesystem. >> >> They're not at all comparable. 
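For what it's worth, the checksum-on-every-read behaviour Chris describes can be caricatured in a few lines. This is a toy, not how btrfs or ZFS actually lay anything out; the CRC32 checksum, the dict standing in for a device, and the function names are all illustrative assumptions:

```python
# Toy model of checksum-on-read with self-healing from a redundant copy.
# A dict stands in for a device; zlib.crc32 stands in for the real
# checksums. Purely illustrative; nothing here mirrors btrfs/ZFS code.
import zlib

def write_block(dev, key, data):
    # Store the block together with its checksum.
    dev[key] = (zlib.crc32(data), data)

def read_block(dev, key, mirror=None):
    csum, data = dev[key]
    if zlib.crc32(data) == csum:
        return data
    # Silent corruption detected even though the "drive" reported no
    # error -- this is the step md has no information to perform.
    if mirror is not None:
        good = read_block(mirror, key)
        write_block(dev, key, good)   # self-heal the bad copy
        return good
    raise IOError("checksum mismatch and no redundant copy")
```

The point of the sketch is the asymmetry Chris is describing: detection needs only the checksum, but repair needs a redundant copy (or parity) that can itself be verified.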
One is a file system, the other a raid >> implementation, they have nothing in common. >> >> > And what are file systems and raid implementations? They are both data > store abstractions. They have everything in common. They have almost nothing in common. File systems store files. RAIDs do not know anything at all about files. RAID has a superblock, and a couple of optional logs for very specific purposes; there are no trees. RAID works by logical assumptions about where things are located, it doesn't do lookups using metadata to find your data, it's all determined by geometry, totally unlike a file system. > > Oh and by the way, now I've realised my mistake, I've taken a look at > the paper you mention. In particular, section 4. Yes it does say you > can't detect and correct multi-disk errors - but that's not what we're > asking for! > > By implication, it seems to be saying LOUD AND CLEAR that you CAN detect > and correct a single-disk error. So why the blankety-blank won't md let > you do that! It's one particular kind of error and there isn't enough on-disk metadata to differentiate this particular kind of error after the fact. You're looking at this problem in total isolation from all other problems. And you're not familiar with the lack of information available in the corpse. Neil's version of this explanation: "Similarly a RAID6 with inconsistent P and Q could well not be able to identify a single block which is "wrong" and even if it could there is a small possibility that the identified block isn't wrong, but the other blocks are all inconsistent in such a way as to accidentally point to it. The probability of this is rather small, but it is non-zero." The autofix in such a case could cause more damage. > > Neil's point seems to be that it's a bad idea to do it automatically. I > get his logic. But to then actively prevent you doing it manually - this > is the paternalistic attitude that gets my goat. You have no example code. 
You've basically come on the list, without any prior research, and said "GIMME!" *shrug* > > Anyways, I've been thinking about this, and I've got a proposal (RFC?). > I haven't got time right now - I'm supposed to be at work - but I'll > write it up this evening. If the response is "we're snowed under - it > sounds a good idea but do it yourself", then so be it. But if the > response is "we don't want the sysadmin to have the choice", then expect > more flak from people like Nix and me. 1. The default response without having to say it is "we're snowed under, show us a proof of concept first". 2. You showed no imagination by assuming this has never come up before, instead thinking you're the first to have this feature in mind. 3. You took the ensuing resistance personally. You have an idea; the burden is on you to demonstrate a need, provide code examples, and ask the right questions like "would the maintainers accept some changes for error reporting for scrub checks?" At the very least what you suggest indicates error reporting enhancements, so why not ask about those parameters? Instead, from the outset you treated this resistance as if other people are your grumpy daddy and they're just being mean to you. That's why you got the reception you did. Mischaracterizing other people as being paternalistic isn't going to help get a different perception. (I was thinking of Commander Sela, referring to Toral, when she said "Silence the child or send him away!") My proposal for your proposal is a patch that implements equation 27 from HPA's paper, and enhances error reporting per its descriptive outcomes. md: error: mismatch, P corruption, array logical <LBA> md: error: mismatch, Q corruption, array logical <LBA> md: error: mismatch, data corruption suspected, array logical <LBA> That's subject to wording and formatting discussion, I have not looked at existing formatting, but you need to ask if approximately this would be accepted. 
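The diagnosis Chris is proposing from equation 27 can be sketched in a few lines. The sketch below is illustrative, not kernel code: it works on one byte per strip for clarity, and the function names are made up here; only the field arithmetic (GF(2^8), polynomial 0x11d, generator g = 2) matches what the kernel's raid6 code actually uses.

```python
# Illustrative single-bad-strip diagnosis from section 4 of HPA's
# raid6 paper. P = xor(D_i), Q = sum over i of g^i * D_i in GF(2^8).

def build_tables():
    exp = [0] * 512
    log = [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x & 0x100:
            x ^= 0x11d         # reduce modulo the field polynomial
    for i in range(255, 512):
        exp[i] = exp[i - 255]  # spill area so gmul needs no modulo
    return exp, log

EXP, LOG = build_tables()

def gmul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def syndromes(data, p, q):
    """P* = P ^ xor(D_i);  Q* = Q ^ sum_i g^i * D_i."""
    ps, qs = p, q
    for i, d in enumerate(data):
        ps ^= d
        qs ^= gmul(EXP[i], d)  # EXP[i] == g^i
    return ps, qs

def diagnose(data, p, q):
    ps, qs = syndromes(data, p, q)
    if ps == 0 and qs == 0:
        return ("ok", None)
    if ps != 0 and qs == 0:
        return ("P corrupt", None)
    if ps == 0 and qs != 0:
        return ("Q corrupt", None)
    # Both syndromes non-zero: if exactly one data strip z is wrong by
    # error e, then P* = e and Q* = g^z * e, so z = log(Q*) - log(P*).
    z = (LOG[qs] - LOG[ps]) % 255
    if z < len(data):
        return ("data corrupt", z)  # data[z] ^ P* recovers the old byte
    return ("inconsistent, not a single-strip error", None)
```

The three non-ok outcomes map onto the three proposed error lines. Note that the caveat Neil raises still applies: a multi-strip error can, with small probability, land on a valid-looking z.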
However, the main point is that you need to find out what the computational cost of this scrub enhancement is. If it takes 5 times longer, even you will laugh and say it's not worth it. Stop asking "why isn't this already implemented! do it now! now! now! now!" Instead ask "what is the ballpark maximum performance impact to scrub that would be accepted? And if that maximum is busted, would maintainers consider a new value "check2" to write to /sys/block/mdX/md/sync_action, analogous to the existing echo check > /sys/block/mdX/md/sync_action?" Once you have better error reporting, a user space tool could use the array metadata and the reported LBA to look up that stripe and reconstruct just that stripe with the assumption that P & Q are correct, and hopefully fix your data. Or whatever other assumptions you want to try and make to attempt different recoveries. That user space tool could also back up the existing stripe so the fixes are all reversible. -- Chris Murphy ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-10 4:49 ` Wols Lists 2017-05-10 17:18 ` Chris Murphy @ 2017-05-16 3:20 ` NeilBrown 1 sibling, 0 replies; 69+ messages in thread From: NeilBrown @ 2017-05-16 3:20 UTC (permalink / raw) To: Wols Lists, Chris Murphy; +Cc: Linux-RAID [-- Attachment #1: Type: text/plain, Size: 5288 bytes --] On Wed, May 10 2017, Wols Lists wrote: > On 10/05/17 04:53, Chris Murphy wrote: >> >> The data is already corrupted by definition. No additional damage to >> data is done. What does happen is good P and Q are replaced by bad P >> and Q which matches the already bad data. > > Except, in my world, replacing good P & Q by bad P & Q *IS* doing > additional damage! We can identify and fix the bad data. So why don't > we? Throwing away good P & Q prevents us from doing that, and means we > can no longer recover the good data! >> >> And nevertheless you have the very real problem that drives lie about >> having committed data to stable media. And they reorder writes, >> breaking the write order assumptions of things. And we have RMW >> happening on live arrays. And that means you have a real likelihood >> that you cannot absolutely determine with the available information >> why P and Q don't agree with the data, you're still making probability >> assumptions and if that assumption is wrong any correction will >> introduce more corruption. >> >> The only unambiguous way to do this has already been done and it's ZFS >> and Btrfs. And a big part of why they can do what they do is because >> they are copy on write. If you need to solve the problem of ambiguous >> data strip integrity in relation to P and Q, then use ZFS. It's >> production ready. If you are prepared to help test and improve things, >> then you can look into the Btrfs implementation. > > So how come btrfs and ZFS can handle this, and md can't? Can't md use > the same techniques? (Seriously, I don't know the answer. Security theater? 
I don't actually know what, specifically, btrfs and ZFS do, so I cannot say for certain. But I am far from convinced by what I know. I come back to the same question I always come back to. Is there a likely cause for a particular anomaly, and does a particular action properly respond to that cause? I don't like addressing symptoms, I like addressing causes. In the case of a resync after an unclean shutdown, if I find a stripe in which P and Q are not consistent with the data, then a likely cause is that some, but not all, blocks in a new stripe were written just before the crash. If the array is not degraded, it is likely that the data is all valid and P and Q are not needed. So it makes sense to regenerate P and Q. Other responses might also make sense, but they don't make *more* sense. And regenerating P and Q is obvious and easy. If the array is degraded and a data block is lost, there is no reliable way to recover that block. So md refuses to start the array by default. If you find an inconsistent data block during a scrub, then I have no idea what could have caused that, so I cannot suggest anything (actually I have lots of ideas, but most of them suggest you should replace your hardware and test your backups). Maybe there is a way to recover data, maybe there is no need. I cannot tell. raid6recover is a tool that can be used by a sysadmin to explore options. Maybe not a perfect tool, but it has some uses. > But, like Nix, > when I feel I'm being fed the answer "we're not going to give you the > choice because we know better than you", I get cheesed off. If I get the > answer "we're snowed under, do it yourself" then that is normal and > acceptable.) The main reason I have never implemented your idea of "validate every block before reporting a successful read" is that I genuinely don't think many people would use it. Writing code that won't be used is not very rewarding. The simple way to provide evidence to the contrary is to turn the interest into cash. 
If 1000 people all give $10 to get it done, I suspect we could make it happen. >> >> Otherwise I'm sure md and LVM folks have a feature list that >> represents a few years of work as it is without yet another pile on. >> >>> >>> Report the error, give the user the tools to fix it, and LET THEM sort >>> it out. Just like we do when we run fsck on a filesystem. >> >> They're not at all comparable. One is a file system, the other a raid >> implementation, they have nothing in common. >> >> > And what are file systems and raid implementations? They are both data > store abstractions. They have everything in common. > > Oh and by the way, now I've realised my mistake, I've taken a look at > the paper you mention. In particular, section 4. Yes it does say you > can't detect and correct multi-disk errors - but that's not what we're > asking for! > > By implication, it seems to be saying LOUD AND CLEAR that you CAN detect > and correct a single-disk error. So why the blankety-blank won't md let > you do that! > > Neil's point seems to be that it's a bad idea to do it automatically. I > get his logic. But to then actively prevent you doing it manually - this > is the paternalistic attitude that gets my goat. I'm certainly not actively preventing you. I certainly wouldn't object to a patch which reports the details of mismatches. I myself was never motivated enough to write one. That might be inactively preventing you, but not actively preventing you. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
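For concreteness, the "regenerate P and Q" response Neil describes for the non-degraded case amounts to recomputing both parities from the data strips and writing them back. A minimal sketch, one byte per strip; the field (GF(2^8), polynomial 0x11d, generator g = 2) matches the kernel's raid6 arithmetic, everything else is an illustrative assumption:

```python
# Minimal sketch of regenerating P and Q from the data strips, as a
# resync does when it trusts the data and rewrites the parities.
from functools import reduce

def gmul2(a):
    """Multiply by the generator g = 2 in GF(2^8) mod 0x11d."""
    a <<= 1
    return (a ^ 0x11d) & 0xff if a & 0x100 else a

def regen_pq(data):
    # P is the plain xor of the data strips.
    p = reduce(lambda x, y: x ^ y, data, 0)
    # Horner's rule gives Q = D_0 ^ g*D_1 ^ ... ^ g^(n-1)*D_(n-1).
    q = 0
    for d in reversed(data):
        q = gmul2(q) ^ d
    return p, q
```

A degraded-array rebuild is the reverse operation: trusting P (or P and Q) in order to reconstruct a missing data strip.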
* Re: Fault tolerance with badblocks 2017-05-10 3:53 ` Chris Murphy 2017-05-10 4:49 ` Wols Lists @ 2017-05-10 5:00 ` Dave Stevens 2017-05-10 16:44 ` Edward Kuns 2 siblings, 0 replies; 69+ messages in thread From: Dave Stevens @ 2017-05-10 5:00 UTC (permalink / raw) To: Chris Murphy; +Cc: Wols Lists, Linux-RAID Quoting Chris Murphy <lists@colorremedies.com>: > On Tue, May 9, 2017 at 1:44 PM, Wols Lists <antlists@youngman.org.uk> wrote: > >>> This is totally non-trivial, especially because it says raid6 cannot >>> detect or correct more than one corruption, and ensuring that >>> additional corruption isn't introduced in the rare case is even more >>> non-trivial. >> >> And can I point out that that is just one person's opinion? > > Right off the bat you ask a stupid question that contains the answer snip! you know Chris, I've read this twice and think it's abusive. You shouldn't do this. Dave -- "As long as politics is the shadow cast on society by big business, the attenuation of the shadow will not change the substance." -- John Dewey ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-10 3:53 ` Chris Murphy 2017-05-10 4:49 ` Wols Lists 2017-05-10 5:00 ` Dave Stevens @ 2017-05-10 16:44 ` Edward Kuns 2017-05-10 18:09 ` Chris Murphy 2 siblings, 1 reply; 69+ messages in thread From: Edward Kuns @ 2017-05-10 16:44 UTC (permalink / raw) To: Chris Murphy; +Cc: Wols Lists, Linux-RAID On Tue, May 9, 2017 at 10:53 PM, Chris Murphy <lists@colorremedies.com> wrote: > Scrubs don't happen by default. From the perspective of Linux Raid authors, that is true. However, the version of Fedora I have installed on my server at home does weekly scrubs by default. This is arguably a good thing, considering that many people installing this OS will not proactively research the technologies in use holding their server together and won't know that there are certain maintenance activities that are essential if you care about your data. I'm not getting involved in the bigger discussion. My opinion is too uninformed to say anything there. I just wanted to point out that *from the viewpoint of some users*, scrubs *will* happen by default. That is all. Eddie ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-10 16:44 ` Edward Kuns @ 2017-05-10 18:09 ` Chris Murphy 0 siblings, 0 replies; 69+ messages in thread From: Chris Murphy @ 2017-05-10 18:09 UTC (permalink / raw) To: Edward Kuns; +Cc: Chris Murphy, Wols Lists, Linux-RAID On Wed, May 10, 2017 at 10:44 AM, Edward Kuns <eddie.kuns@gmail.com> wrote: > On Tue, May 9, 2017 at 10:53 PM, Chris Murphy <lists@colorremedies.com> wrote: >> Scrubs don't happen by default. > > From the perspective of Linux Raid authors, that is true. However, > the version of Fedora I have installed on my server at home does > weekly scrubs by default. That is a check scrub, not a repair scrub, so it still wouldn't obliterate "good" P & Q by default. > This is arguably a good thing, considering > that many people installing this OS will not proactively research the > technologies in use holding their server together and won't know that > there are certain maintenance activities that are essential if you > care about your data. > > I'm not getting involved in the bigger discussion. My opinion is too > uninformed to say anything there. I just wanted to point out that > *from the viewpoint of some users*, scrubs *will* happen by default. > That is all. Absolutely, just not the kind of scrub that's being accused of damaging assumed to be good P & Q parity. -- Chris Murphy ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 17:25 ` Chris Murphy 2017-05-09 19:44 ` Wols Lists @ 2017-05-09 20:18 ` Nix 2017-05-09 20:52 ` Wols Lists 2017-05-10 8:41 ` David Brown 2017-05-09 21:06 ` A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) Nix 2 siblings, 2 replies; 69+ messages in thread From: Nix @ 2017-05-09 20:18 UTC (permalink / raw) To: Chris Murphy Cc: David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On 9 May 2017, Chris Murphy verbalised: > On Tue, May 9, 2017 at 5:58 AM, David Brown <david.brown@hesbynett.no> wrote: > >> I thought you said that you had read Neil's article. Please go back and >> read it again. If you don't agree with what is written there, then >> there is little more I can say to convince you. The entire article is predicated on the assumption that when an inconsistent stripe is found, fixing it is simple because you can just fail whichever device is inconsistent... but given that the whole premise of the article is that *you cannot tell which that is*, I don't see the point in failing anything. The first comment in the article is someone noting that md doesn't say which device is failing, what the location of the error is or anything else a sysadmin might actually find useful for fixing it. "Hey, you have an error somewhere on some disk on this multi-terabyte array which might be data corruption and if a disk fails will be data corruption!" is not too useful :( The fourth comment notes that the "smart" approach, given RAID-6, has a significantly higher chance of actually fixing the problem than the simple approach. I'd call that a fairly important comment... (Neil said: "Similarly a RAID6 with inconsistent P and Q could well not be able to identify a single block which is "wrong" and even if it could there is a small possibility that the identified block isn't wrong, but the other blocks are all inconsistent in such a way as to accidentally point to it. 
The probability of this is rather small, but it is non-zero". As far as I can tell the probability of this is exactly the same as that of multiple read errors in a single stripe -- possibly far lower, if you need not only multiple wrong P and Q values but *precisely mis-chosen* ones. If that wasn't acceptably rare, you wouldn't be using RAID-6 to begin with. I've been talking all the time about a stripe which is singly inconsistent: either all the data blocks are fine and one of P or Q is fine, or both P and Q and all but one data block is fine, and the remaining block is inconsistent with all the rest. Obviously if more blocks are corrupt, you can do nothing but report it. The redundancy simply isn't there to attempt repair.) > H. Peter Anvin's RAID 6 paper, section 4 is what's apparently under discussion > http://milbret.anydns.info/pub/linux/kernel/people/hpa/raid6.pdf > > This is totally non-trivial, especially because it says raid6 cannot > detect or correct more than one corruption, and ensuring that > additional corruption isn't introduced in the rare case is even more > non-trivial. Yeah. Testing this is the bastard problem, really. Fault injection via dm is the only approach that seems remotely practical to me. > I do think it's sane for raid6 repair to avoid the current assumption > that data strip is correct, by doing the evaluation in equation 27. If > there's no corruption do nothing, if there's corruption of P or Q then > replace, if there's corruption of data, then report but do not repair At least indicate *where* the corruption is in the report. 
(I'd say "repair, as a non-default option" for people with a different availability/P(corruption) tradeoff -- since, after all, if you're using RAID in the first place you value high availability across disk problems more than most people do, and there is a difference between one bit of unreported damage that causes a near-certain restore from backup and either zero or two of them plus a report with an LBA attached so you know you need to do something...) > as follows: > > 1. md reports all data drives and the LBAs for the affected stripe > (otherwise this is not simple if it has to figure out which drive is > actually affected but that's not required, just a matter of better > efficiency in finding out what's really affected.) Yep. > 2. the file system needs to be able to accept the error from md It would probably need to report this as an -EIO, but I don't know of any filesystems that can accept asynchronous reports of errors like this. You'd need reverse mapping to even stand a chance (a non-default option on xfs, and of course available on btrfs and zfs too). You'd need self-healing metadata to stand a chance of doing anything about it. And god knows what a filesystem is meant to do if part of the file data vanishes. Replace it with \0? ugh. I'd almost rather have the error go back out to a monitoring daemon and have it send you an email... > 3. the file system reports what it negatively impacted: file system > metadata or data and if data, the full filename path. > > And now suddenly this work is likewise non-trivial. Yeah, it's all the layers stacked up to the filesystem that are buggers to deal with... and now the optional 'just repair it dammit' approach seems useful again, if just because it doesn't have to deal with all these extra layers. > And there is already something that will do exactly this: ZFS and > Btrfs. Both can unambiguously, efficiently determine whether data is > corrupt even if a drive doesn't report a read error. Yeah. 
Unfortunately both have their own problems: ZFS reimplements the page cache and adds massive amounts of inefficiency in the process, and btrfs is... well... not really baked enough for the sort of high-availability system that's going to be running RAID, yet. (Alas!) (Recent xfs can do the same with metadata, but not data.) -- NULL && (void) ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 20:18 ` Nix @ 2017-05-09 20:52 ` Wols Lists 2017-05-10 8:41 ` David Brown 1 sibling, 0 replies; 69+ messages in thread From: Wols Lists @ 2017-05-09 20:52 UTC (permalink / raw) To: Nix, Chris Murphy; +Cc: David Brown, Ravi (Tom) Hale, Linux-RAID On 09/05/17 21:18, Nix wrote: > (Neil said: "Similarly a RAID6 with inconsistent P and Q could well not > be able to identify a single block which is "wrong" and even if it could > there is a small possibility that the identified block isn't wrong, but > the other blocks are all inconsistent in such a way as to accidentally > point to it. The probability of this is rather small, but it is > non-zero". As far as I can tell the probability of this is exactly the > same as that of multiple read errors in a single stripe -- possibly far > lower, if you need not only multiple wrong P and Q values but *precisely > mis-chosen* ones. If that wasn't acceptably rare, you wouldn't be using > RAID-6 to begin with. This to me is the crux of the argument. What is the probability of CORRECTLY identifying a single-disk error? What is the probability of WRONGLY mistaking a multi-disk error for a single-disk error? My gut instinct is that the second scenario is much less likely. So, in that case, the current setup is that we DELIBERATELY CORRUPT a recoverable error because of the TINY risk that we might have got it wrong. Picking probabilities at random, let's say the first probability is 99 in a hundred, the second is one in a thousand. On a four-disk raid-6, that means we're throwing away about 500 chances of recovering the correct data, so that on one occasion we can avoid corruption. To me that's an insane trade-off. Neil goes on about "what if a write fails? What if the power goes down? What if what if?" Those are the wrong questions!!! The correct question is "can we identify the difference between a single-disk failure and a multi-disk failure". We don't care what *caused* that failure. 
If the power goes down and only the first disk in a stripe is written, we can correct it back to what it was. If only the last disk failed to be written, we can correct it back to what it should have been. If at least two disks are written and at least two disks are not, CAN WE DETECT THAT? Surely we can - we don't care how many disks are or aren't written - in that scenario surely all the parities mess up. In which case we give up and say "corrupt data". Which is no different from at present other than at present we fix the parity and pretend nothing is wrong :-( The problem is that at present we fix the parity and pretend nothing is wrong when the reality is we *could* have corrected the data, if we could have been bothered. So we have to write an mdfsck. Okay. So we have to make sure that no filesystems on the array are mounted. Okay, that's a bit harder. So we have to assume that sysadmins are sensible beings who don't screw things up - okay that's a lot harder :-) But we shouldn't be throwing away LOTS of data that's easy to recover, because we MIGHT "recover" data that's wrong. Yes, yes, I know - code welcome ... :-) Cheers, Wol ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 20:18 ` Nix 2017-05-09 20:52 ` Wols Lists @ 2017-05-10 8:41 ` David Brown 1 sibling, 0 replies; 69+ messages in thread From: David Brown @ 2017-05-10 8:41 UTC (permalink / raw) To: Nix, Chris Murphy Cc: Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On 09/05/17 22:18, Nix wrote: > On 9 May 2017, Chris Murphy verbalised: > >> On Tue, May 9, 2017 at 5:58 AM, David Brown <david.brown@hesbynett.no> wrote: >> >>> I thought you said that you had read Neil's article. Please go back and >>> read it again. If you don't agree with what is written there, then >>> there is little more I can say to convince you. > > The entire article is predicated on the assumption that when an > inconsistent stripe is found, fixing it is simple because you can just > fail whichever device is inconsistent... but given that the whole > premise of the article is that *you cannot tell which that is*, I don't > see the point in failing anything. The point is that if an inconsistent stripe is found, then there is no way to be sure how to fix it correctly. So scrub certainly will not touch it. And what should "repair" do? I see several choices: 1. It could assume the data is correct, and re-create the parities. This is simple, and it avoids changing anything on the array from the viewpoint of higher levels (i.e., the filesystem). 2. It could do a "smart" repair of the stripe, if it sees that there is only one inconsistent block in the stripe. 3. It could pass the problem on to higher level tools (possibly correcting a single inconsistency in the P or Q parities first). At the moment, raid6 repair follows the first choice here. Many people seem to think the second choice is a good idea. Personally, I would say choice 3 is right - but unless and until higher level tools are available, I think 1 is no worse than 2 - and it is simpler, clearer, and works today. 
Key to why I don't like choice 2 is a question of why you have a mismatch in the first place. Undetected read errors - the drive returning wrong data as though it were correct data - are astoundingly rare. Even on huge disks, they do not occur often. (Unrecoverable read errors - when the drive reports a sector as unreadable - are not uncommon. That is what raid is for.) If you get a mismatch, a likely cause is a crash or power fault during a stripe write. Another main cause is hardware errors such as memory faults. "Smart" repair can make the situation worse. Secondly, "smart" repair means changing the data on the disk. You can't do that while a file system is mounted (unless you want to risk chaos). One major reason for using raid is to minimise downtime of a system in the event of problems - offline repair goes against that philosophy. What do I mean about passing the problem on to higher levels? One example would be if there is another raid level sitting above, such as a raid1 pair of raid6 arrays (it would make more sense the other way round - the same principle applies there). The raid6 level could ask the block layer above if that layer can re-create the correct data. In the case of a raid1 pair at a higher level, then it could - that way the stripe would be written with the full known correct data, rather than just a guess. Perhaps the layer above is a filesystem - it could say whether that stripe is actually in use (no need to worry if it is in deleted space), or if it can re-create the data from a BTRFS duplicate. Failing that, a tool could interact with the filesystem to determine what sort of data was on that stripe, and perhaps check it in some way. At least a tool could run a consistency check - would the filesystem be consistent if the stripe was "smart repaired", or would it be consistent if the stripe data was left untouched (and the P & Q parities recreated)? 
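For reference, the "smart" repair under discussion rests on the raid6 syndrome maths from H. Peter Anvin's raid6 paper. A minimal, hedged sketch of the single-error locator (one byte stands in for each disk's block; this is an illustration, not md's actual code):

```python
# Sketch of the raid6 single-data-block error locator, per H. Peter
# Anvin's paper. GF(2^8) with the 0x11d polynomial and g = {02}, as md
# uses. One byte per "disk" stands in for a whole block.

def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r

def pq(data):
    """P is the plain xor parity; Q weights disk i by generator power g**i."""
    p = q = 0
    g = 1
    for d in data:
        p ^= d
        q ^= gf_mul(d, g)
        g = gf_mul(g, 2)
    return p, q

def locate_single_error(data, stored_p, stored_q):
    """Return (disk index, corrected byte) if exactly one data byte is bad."""
    p, q = pq(data)
    dp, dq = p ^ stored_p, q ^ stored_q
    if dp == 0 and dq == 0:
        return None              # stripe is consistent
    g = 1
    for i in range(len(data)):
        if gf_mul(dp, g) == dq:  # dq/dp == g**i points at disk i
            return i, data[i] ^ dp
        g = gf_mul(g, 2)
    return None                  # no single data block explains the mismatch

good = [0x11, 0x22, 0x33, 0x44]
p0, q0 = pq(good)
bad = list(good)
bad[2] ^= 0x5a                   # corrupt one "disk"
print(locate_single_error(bad, p0, q0))   # -> (2, 51): disk 2, byte 0x33 back
```

This is the happy path only; as the thread notes, a crash mid-write or a memory fault can leave a stripe inconsistent in ways where this locator's answer is confidently wrong.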
A simple method here could be to mark the whole stripe as unreadable, then run a filesystem check. If there are higher level raids that can re-create the lost stripe, that will happen automatically. If not, then the filesystem repair will ensure that the filesystem is consistent even though data may be lost. And of course, a higher level repair tool could be one that simply runs a "smart repair" on the stripe. All in all, when there is /no/ correct answer, I think we have to be very careful about picking methods here. Before switching to a "smart" repair, rather than the simple method, we have to be /very/ sure that it gives noticeably "better" results in real-world cases. We can't just say it sounds good - we need to know. > > The first comment in the article is someone noting that md doesn't say > which device is failing, what the location of the error is or anything > else a sysadmin might actually find useful for fixing it. "Hey, you have > an error somewhere on some disk on this multi-terabyte array which might > be data corruption and if a disk fails will be data corruption!" is not > too useful :( I haven't looked at the information you get out of the scrub, but of course more information is better than less information. > The fourth comment notes that the "smart" approach, given > RAID-6, has a significantly higher chance of actually fixing the problem > than the simple approach. I'd call that a fairly important comment... > > (Neil said: "Similarly a RAID6 with inconsistent P and Q could well not > be able to identify a single block which is "wrong" and even if it could > there is a small possibility that the identified block isn't wrong, but > the other blocks are all inconsistent in such a way as to accidentally > point to it. The probability of this is rather small, but it is > non-zero". It is true that for some causes of mismatches, the "smart" repair has a high chance of being correct. 
> As far as I can tell the probability of this is exactly the > same as that of multiple read errors in a single stripe -- possibly far > lower, if you need not only multiple wrong P and Q values but *precisely > mis-chosen* ones. If that wasn't acceptably rare, you wouldn't be using > RAID-6 to begin with. > > I've been talking all the time about a stripe which is singly > inconsistent: either all the data blocks are fine and one of P or Q is > fine, or both P and Q and all but one data block is fine, and the > remaining block is inconsistent with all the rest. Obviously if more > blocks are corrupt, you can do nothing but report it. The redundancy > simply isn't there to attempt repair.) Or possibly mark the whole stripe as "unreadable", and punt the problem to the higher levels. > >> H. Peter Anvin's RAID 6 paper, section 4 is what's apparently under discussion >> http://milbret.anydns.info/pub/linux/kernel/people/hpa/raid6.pdf >> >> This is totally non-trivial, especially because it says raid6 cannot >> detect or correct more than one corruption, and ensuring that >> additional corruption isn't introduced in the rare case is even more >> non-trivial. > > Yeah. Testing this is the bastard problem, really. Fault injection via > dm is the only approach that seems remotely practical to me. That's what the "FAULTY" raid level in md is for :-) But what are the /realistic/ fault situations? > >> I do think it's sane for raid6 repair to avoid the current assumption >> that data strip is correct, by doing the evaluation in equation 27. If >> there's no corruption do nothing, if there's corruption of P or Q then >> replace, if there's corruption of data, then report but do not repair > > At least indicate *where* the corruption is in the report. 
(I'd say > "repair, as a non-default option" for people with a different > availability/P(corruption) tradeoff -- since, after all, if you're using > RAID in the first place you value high availability across disk problems > more than most people do, and there is a difference between one bit of > unreported damage that causes a near-certain restore from backup and > either zero or two of them plus a report with an LBA attached so you > know you need to do something...) One thing to consider here is the sort of person using the raid array. When Neil wrote his article, raid6 would only be used by an expert. He did not want to change existing data and make life harder for the systems administrator doing more serious repair. However, these days the raid6 "administrator" may be someone who owns a NAS box and has no idea what raid, or even Linux, actually is. In such cases, "smart" repair is probably the best idea if the filesystem on top is not BTRFS. > >> as follows: >> >> 1. md reports all data drives and the LBAs for the affected stripe >> (otherwise this is not simple if it has to figure out which drive is >> actually affected but that's not required, just a matter of better >> efficiency in finding out what's really affected.) > > Yep. > >> 2. the file system needs to be able to accept the error from md > > It would probably need to report this as an -EIO, but I don't know of > any filesystems that can accept asynchronous reports of errors like > this. You'd need reverse mapping to even stand a chance (a non-default > option on xfs, and of course available on btrfs and zfs too). You'd > need self-healing metadata to stand a chance of doing anything about it. > And god knows what a filesystem is meant to do if part of the file data > vanishes. Replace it with \0? ugh. I'd almost rather have the error > go back out to a monitoring daemon and have it send you an email... > >> 3. 
the file system reports what it negatively impacted: file system >> metadata or data and if data, the full filename path. >> >> And now suddenly this work is likewise non-trivial. > > Yeah, it's all the layers stacked up to the filesystem that are buggers > to deal with... and now the optional 'just repair it dammit' approach > seems useful again, if just because it doesn't have to deal with all > these extra layers. > >> And there is already something that will do exactly this: ZFS and >> Btrfs. Both can unambiguously, efficiently determine whether data is >> corrupt even if a drive doesn't report a read error. > > Yeah. Unfortunately both have their own problems: ZFS reimplements the > page cache and adds massive amounts of inefficiency in the process, and > btrfs is... well... not really baked enough for the sort of high- > availability system that's going to be running RAID, yet. (Alas!) I disagree about BTRFS here. First, raid is a good idea no matter how "experimental" you consider your filesystem. Second, BTRFS is solid enough for a great many uses - I use it on laptops, desktops and servers. /No/ storage system should be viewed as infallible - backups are important. So if BTRFS were to eat my data, then I'd get it back from backups - just as I would if the server died, both disks failed, it got stolen, or whatever. But BTRFS on our servers means very cheap regular snapshots. That protects us from the biggest cause of data loss - user error. > > (Recent xfs can do the same with metadata, but not data.) > ^ permalink raw reply [flat|nested] 69+ messages in thread
* A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-09 17:25 ` Chris Murphy 2017-05-09 19:44 ` Wols Lists 2017-05-09 20:18 ` Nix @ 2017-05-09 21:06 ` Nix 2017-05-12 11:14 ` Nix 2017-05-16 3:27 ` NeilBrown 2 siblings, 2 replies; 69+ messages in thread From: Nix @ 2017-05-09 21:06 UTC (permalink / raw) To: Chris Murphy Cc: David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On 9 May 2017, Chris Murphy verbalised: > 1. md reports all data drives and the LBAs for the affected stripe Enough rambling from me. Here's a hilariously untested patch against 4.11 (as in I haven't even booted with it: my systems are kind of in flux right now as I migrate to the md-based server that got me all concerned about this). It compiles! And it's definitely safer than trying a repair, and makes it possible to recover from a real mismatch without losing all your hair in the process, or determine that a mismatch is spurious or irrelevant. And that's enough for me, frankly. This is a very rare problem, one hopes. (It's probably not ideal, because the error is just known to be somewhere in that stripe, not on that sector, which makes determining the affected data somewhat harder. But at least you can figure out what filesystem it's on. :) ) 8<------------------------------------------------------------->8 From: Nick Alcock <nick.alcock@oracle.com> Subject: [PATCH] md: report sector of stripes with check mismatches This makes it possible, with appropriate filesystem support, for a sysadmin to tell what is affected by the mismatch, and whether it should be ignored (if it's inside a swap partition, for instance). We ratelimit to prevent log flooding: if there are so many mismatches that ratelimiting is necessary, the individual messages are relatively unlikely to be important (either the machine is swapping like crazy or something is very wrong with the disk). 
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> --- drivers/md/raid5.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index ed5cd705b985..bcd2e5150e29 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -3959,10 +3959,14 @@ static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh, set_bit(STRIPE_INSYNC, &sh->state); else { atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches); - if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) + if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { /* don't try to repair!! */ set_bit(STRIPE_INSYNC, &sh->state); - else { + pr_warn_ratelimited("%s: mismatch around sector " + "%llu\n", __func__, + (unsigned long long) + sh->sector); + } else { sh->check_state = check_state_compute_run; set_bit(STRIPE_COMPUTE_RUN, &sh->state); set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request); @@ -4111,10 +4115,14 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh, } } else { atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches); - if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) + if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { /* don't try to repair!! */ set_bit(STRIPE_INSYNC, &sh->state); - else { + pr_warn_ratelimited("%s: mismatch around sector " + "%llu\n", __func__, + (unsigned long long) + sh->sector); + } else { int *target = &sh->ops.target; sh->ops.target = -1; -- 2.12.2.212.gea238cf35.dirty ^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-09 21:06 ` A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) Nix @ 2017-05-12 11:14 ` Nix 2017-05-16 3:27 ` NeilBrown 1 sibling, 0 replies; 69+ messages in thread From: Nix @ 2017-05-12 11:14 UTC (permalink / raw) To: Chris Murphy Cc: David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On 9 May 2017, nix@esperi.org.uk outgrape: > On 9 May 2017, Chris Murphy verbalised: > >> 1. md reports all data drives and the LBAs for the affected stripe > > Enough rambling from me. Here's a hilariously untested patch against > 4.11 (as in I haven't even booted with it: my systems are kind of in > flux right now as I migrate to the md-based server that got me all > concerned about this). It compiles! And it's definitely safer than > trying a repair, and makes it possible to recover from a real mismatch > without losing all your hair in the process, or determine that a > mismatch is spurious or irrelevant. And that's enough for me, frankly. > This is a very rare problem, one hopes. > > (It's probably not ideal, because the error is just known to be > somewhere in that stripe, not on that sector, which makes determining > the affected data somewhat harder. But at least you can figure out what > filesystem it's on. :) ) Aside: this foolish optimist hopes that it might be fairly easy to tie the new GETFSMAP ioctl() into mismatch reports if the filesystem(s) overlying a mismatched stripe support it: it looks like we could get the necessary info for a whole stripe in a single call. Being automatically told "these files may be corrupted, restore them" or "oops you lost some metadata on fses A and B, run fsck" would be wonderful. (Though the actual corruption would be less wonderful.) This feels like something mdadm's monitor mode should be able to do, to me. 
I'll have a look in a bit, but I know nothing about the implementation of monitor mode at all so I have some learning to do first... -- NULL && (void) ^ permalink raw reply [flat|nested] 69+ messages in thread
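Pending that kind of GETFSMAP integration, the mapping Nix describes can be sketched by hand. A hedged example (the sector number is made up; it assumes an ext4 filesystem starting at offset 0 of /dev/md0 with 4 KiB blocks - adjust for partitions or LVM - and uses the `icheck`/`ncheck` requests of e2fsprogs' `debugfs`):

```python
# Sketch: turn a reported md mismatch sector into an ext4 block number
# that debugfs (e2fsprogs) can resolve to an inode and then a path.
SECTOR_SIZE = 512
FS_BLOCK_SIZE = 4096            # assumption: 4 KiB ext4 blocks

def sector_to_fs_block(sector, fs_start_sector=0):
    """fs_start_sector is where the filesystem begins on the md device."""
    return (sector - fs_start_sector) * SECTOR_SIZE // FS_BLOCK_SIZE

block = sector_to_fs_block(1234880)       # made-up reported sector
print(f"debugfs -R 'icheck {block}' /dev/md0")   # block -> inode number
# then: debugfs -R 'ncheck <inode>' /dev/md0     # inode -> path(s)
```

If `icheck` reports no owning inode, the mismatch falls in free space (or a swap area, if the range maps there instead) and can be ignored.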
* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-09 21:06 ` A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) Nix 2017-05-12 11:14 ` Nix @ 2017-05-16 3:27 ` NeilBrown 2017-05-16 9:13 ` Nix 2017-05-16 21:11 ` NeilBrown 1 sibling, 2 replies; 69+ messages in thread From: NeilBrown @ 2017-05-16 3:27 UTC (permalink / raw) To: Nix, Chris Murphy Cc: David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID [-- Attachment #1: Type: text/plain, Size: 4235 bytes --] On Tue, May 09 2017, Nix wrote: > On 9 May 2017, Chris Murphy verbalised: > >> 1. md reports all data drives and the LBAs for the affected stripe > > Enough rambling from me. Here's a hilariously untested patch against > 4.11 (as in I haven't even booted with it: my systems are kind of in > flux right now as I migrate to the md-based server that got me all > concerned about this). It compiles! And it's definitely safer than > trying a repair, and makes it possible to recover from a real mismatch > without losing all your hair in the process, or determine that a > mismatch is spurious or irrelevant. And that's enough for me, frankly. > This is a very rare problem, one hopes. > > (It's probably not ideal, because the error is just known to be > somewhere in that stripe, not on that sector, which makes determining > the affected data somewhat harder. But at least you can figure out what > filesystem it's on. :) ) > > 8<------------------------------------------------------------->8 > From: Nick Alcock <nick.alcock@oracle.com> > Subject: [PATCH] md: report sector of stripes with check mismatches > > This makes it possible, with appropriate filesystem support, for a > sysadmin to tell what is affected by the mismatch, and whether > it should be ignored (if it's inside a swap partition, for > instance). 
> > We ratelimit to prevent log flooding: if there are so many > mismatches that ratelimiting is necessary, the individual messages > are relatively unlikely to be important (either the machine is > swapping like crazy or something is very wrong with the disk). > > Signed-off-by: Nick Alcock <nick.alcock@oracle.com> > --- > drivers/md/raid5.c | 16 ++++++++++++---- > 1 file changed, 12 insertions(+), 4 deletions(-) > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c > index ed5cd705b985..bcd2e5150e29 100644 > --- a/drivers/md/raid5.c > +++ b/drivers/md/raid5.c > @@ -3959,10 +3959,14 @@ static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh, > set_bit(STRIPE_INSYNC, &sh->state); > else { > atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches); > - if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) > + if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { > /* don't try to repair!! */ > set_bit(STRIPE_INSYNC, &sh->state); > - else { > + pr_warn_ratelimited("%s: mismatch around sector " > + "%llu\n", __func__, > + (unsigned long long) > + sh->sector); > + } else { I think there is no point giving the function name, but that you should give the name of the array. Also "around" is a little vague. Maybe something like: > + pr_warn_ratelimited("%s: mismatch sector in range " > + "%llu-%llu\n", mdname(conf->mddev), > + (unsigned long long) sh->sector, > + (unsigned long long) sh->sector + STRIPE_SECTORS); As an optional enhancement, you could add "will recalculate P/Q" or "left unchanged" as appropriate. Providing at least that the array name is included in the message, I support this patch. 
NeilBrown > sh->check_state = check_state_compute_run; > set_bit(STRIPE_COMPUTE_RUN, &sh->state); > set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request); > @@ -4111,10 +4115,14 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh, > } > } else { > atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches); > - if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) > + if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { > /* don't try to repair!! */ > set_bit(STRIPE_INSYNC, &sh->state); > - else { > + pr_warn_ratelimited("%s: mismatch around sector " > + "%llu\n", __func__, > + (unsigned long long) > + sh->sector); > + } else { > int *target = &sh->ops.target; > > sh->ops.target = -1; > -- > 2.12.2.212.gea238cf35.dirty > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-16 3:27 ` NeilBrown @ 2017-05-16 9:13 ` Nix 2017-05-16 21:11 ` NeilBrown 1 sibling, 0 replies; 69+ messages in thread From: Nix @ 2017-05-16 9:13 UTC (permalink / raw) To: NeilBrown Cc: Chris Murphy, David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On 16 May 2017, NeilBrown said: >> - else { >> + pr_warn_ratelimited("%s: mismatch around sector " >> + "%llu\n", __func__, >> + (unsigned long long) >> + sh->sector); >> + } else { > > I think there is no point giving the function name, > but that you should give the name of the array. *ouch* I can't believe I forgot that. I have more than one array myself... "we have a fault but we don't know what array it's on" is not much of an improvement over the status quo, really! (though you could make a good guess by looking for preceding sync-start messages, you can of course sync two arrays at the same time...) > Also "around" is a little vague. Intentionally: I couldn't think of the right terminology. Yours is better. > Maybe something like: > >> + pr_warn_ratelimited("%s: mismatch sector in range " >> + "%llu-%llu\n", mdname(conf->mddev), >> + (unsigned long long) sh->sector, >> + (unsigned long long) sh->sector + STRIPE_SECTORS); Nice! Here's a rerolled patch. (We exceed the 80-char limit but that's pr_warn_ratelimited()'s fault for having such a long name!) Tested by making a raid array on a bunch of sparse files then dding a byte of garbage into one of them and checking it. I got a nice error message, name and all, and the sector count looked good. 
From f05a451d46900849c7965a0e7dde085f1fb50dfc Mon Sep 17 00:00:00 2001 From: Nick Alcock <nick.alcock@oracle.com> Date: Tue, 9 May 2017 21:55:17 +0100 Subject: [PATCH] md: report sector of stripes with check mismatches This makes it possible, with appropriate filesystem support, for a sysadmin to tell what is affected by the mismatch, and whether it should be ignored (if it's inside a swap partition, for instance). We ratelimit to prevent log flooding: if there are so many mismatches that ratelimiting is necessary, the individual messages are relatively unlikely to be important (either the machine is swapping like crazy or something is very wrong with the disk). Signed-off-by: Nick Alcock <nick.alcock@oracle.com> --- drivers/md/raid5.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index ed5cd705b985..937314051be5 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -3959,10 +3959,15 @@ static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh, set_bit(STRIPE_INSYNC, &sh->state); else { atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches); - if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) + if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { /* don't try to repair!! 
*/ set_bit(STRIPE_INSYNC, &sh->state); - else { + pr_warn_ratelimited("%s: mismatch sector in range " + "%llu-%llu\n", mdname(conf->mddev), + (unsigned long long) sh->sector, + (unsigned long long) sh->sector + + STRIPE_SECTORS); + } else { sh->check_state = check_state_compute_run; set_bit(STRIPE_COMPUTE_RUN, &sh->state); set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request); @@ -4111,10 +4116,15 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh, } } else { atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches); - if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) + if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { /* don't try to repair!! */ set_bit(STRIPE_INSYNC, &sh->state); - else { + pr_warn_ratelimited("%s: mismatch sector in range " + "%llu-%llu\n", mdname(conf->mddev), + (unsigned long long) sh->sector, + (unsigned long long) sh->sector + + STRIPE_SECTORS); + } else { int *target = &sh->ops.target; sh->ops.target = -1; -- 2.12.2.212.gea238cf35.dirty ^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-16 3:27 ` NeilBrown 2017-05-16 9:13 ` Nix @ 2017-05-16 21:11 ` NeilBrown 2017-05-16 21:46 ` Nix 1 sibling, 1 reply; 69+ messages in thread From: NeilBrown @ 2017-05-16 21:11 UTC (permalink / raw) To: Nix, Chris Murphy Cc: David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID [-- Attachment #1: Type: text/plain, Size: 4656 bytes --] On Tue, May 16 2017, NeilBrown wrote: > On Tue, May 09 2017, Nix wrote: > >> On 9 May 2017, Chris Murphy verbalised: >> >>> 1. md reports all data drives and the LBAs for the affected stripe >> >> Enough rambling from me. Here's a hilariously untested patch against >> 4.11 (as in I haven't even booted with it: my systems are kind of in >> flux right now as I migrate to the md-based server that got me all >> concerned about this). It compiles! And it's definitely safer than >> trying a repair, and makes it possible to recover from a real mismatch >> without losing all your hair in the process, or determine that a >> mismatch is spurious or irrelevant. And that's enough for me, frankly. >> This is a very rare problem, one hopes. >> >> (It's probably not ideal, because the error is just known to be >> somewhere in that stripe, not on that sector, which makes determining >> the affected data somewhat harder. But at least you can figure out what >> filesystem it's on. :) ) >> >> 8<------------------------------------------------------------->8 >> From: Nick Alcock <nick.alcock@oracle.com> >> Subject: [PATCH] md: report sector of stripes with check mismatches >> >> This makes it possible, with appropriate filesystem support, for a >> sysadmin to tell what is affected by the mismatch, and whether >> it should be ignored (if it's inside a swap partition, for >> instance). 
>> >> We ratelimit to prevent log flooding: if there are so many >> mismatches that ratelimiting is necessary, the individual messages >> are relatively unlikely to be important (either the machine is >> swapping like crazy or something is very wrong with the disk). >> >> Signed-off-by: Nick Alcock <nick.alcock@oracle.com> >> --- >> drivers/md/raid5.c | 16 ++++++++++++---- >> 1 file changed, 12 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c >> index ed5cd705b985..bcd2e5150e29 100644 >> --- a/drivers/md/raid5.c >> +++ b/drivers/md/raid5.c >> @@ -3959,10 +3959,14 @@ static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh, >> set_bit(STRIPE_INSYNC, &sh->state); >> else { >> atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches); >> - if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) >> + if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { >> /* don't try to repair!! */ >> set_bit(STRIPE_INSYNC, &sh->state); >> - else { >> + pr_warn_ratelimited("%s: mismatch around sector " >> + "%llu\n", __func__, >> + (unsigned long long) >> + sh->sector); >> + } else { > > I think there is no point giving the function name, > but that you should give the name of the array. > Also "around" is a little vague. > Maybe something like: > >> + pr_warn_ratelimited("%s: mismatch sector in range " >> + "%llu-%llu\n", mdname(conf->mddev), >> + (unsigned long long) sh->sector, >> + (unsigned long long) sh->sector + STRIPE_SECTORS); > > As an optional enhancement, you could add "will recalculate P/Q" or > "left unchanged" as appropriate. > > Providing at least that the array name is included in the message, I > support this patch. Actually, I have another caveat. I don't think we want these messages during initial resync, or any resync. Only during a 'check' or 'repair'. 
So add a check for MD_RECOVERY_REQUESTED or maybe for sh->sectors >= conf->mddev->recovery_cp NeilBrown > > NeilBrown > > > >> sh->check_state = check_state_compute_run; >> set_bit(STRIPE_COMPUTE_RUN, &sh->state); >> set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request); >> @@ -4111,10 +4115,14 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh, >> } >> } else { >> atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches); >> - if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) >> + if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { >> /* don't try to repair!! */ >> set_bit(STRIPE_INSYNC, &sh->state); >> - else { >> + pr_warn_ratelimited("%s: mismatch around sector " >> + "%llu\n", __func__, >> + (unsigned long long) >> + sh->sector); >> + } else { >> int *target = &sh->ops.target; >> >> sh->ops.target = -1; >> -- >> 2.12.2.212.gea238cf35.dirty >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-16 21:11 ` NeilBrown @ 2017-05-16 21:46 ` Nix 2017-05-18 0:07 ` Shaohua Li 2017-05-19 4:49 ` NeilBrown 0 siblings, 2 replies; 69+ messages in thread From: Nix @ 2017-05-16 21:46 UTC (permalink / raw) To: NeilBrown Cc: Chris Murphy, David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On 16 May 2017, NeilBrown spake thusly: > Actually, I have another caveat. I don't think we want these messages > during initial resync, or any resync. Only during a 'check' or > 'repair'. > So add a check for MD_RECOVERY_REQUESTED or maybe for > sh->sectors >= conf->mddev->recovery_cp I completely agree, but it's already inside MD_RECOVERY_CHECK: if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { /* don't try to repair!! */ set_bit(STRIPE_INSYNC, &sh->state); pr_warn_ratelimited("%s: mismatch sector in range " "%llu-%llu\n", mdname(conf->mddev), (unsigned long long) sh->sector, (unsigned long long) sh->sector + STRIPE_SECTORS); } else { Doesn't that already mean that someone has explicitly triggered a check action? ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-16 21:46 ` Nix @ 2017-05-18 0:07 ` Shaohua Li 2017-05-19 4:53 ` NeilBrown 2017-05-19 4:49 ` NeilBrown 1 sibling, 1 reply; 69+ messages in thread From: Shaohua Li @ 2017-05-18 0:07 UTC (permalink / raw) To: Nix Cc: NeilBrown, Chris Murphy, David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On Tue, May 16, 2017 at 10:46:13PM +0100, Nix wrote: > On 16 May 2017, NeilBrown spake thusly: > > > Actually, I have another caveat. I don't think we want these messages > > during initial resync, or any resync. Only during a 'check' or > > 'repair'. > > So add a check for MD_RECOVERY_REQUESTED or maybe for > > sh->sectors >= conf->mddev->recovery_cp > > I completely agree, but it's already inside MD_RECOVERY_CHECK: > > if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { > /* don't try to repair!! */ > set_bit(STRIPE_INSYNC, &sh->state); > pr_warn_ratelimited("%s: mismatch sector in range " > "%llu-%llu\n", mdname(conf->mddev), > (unsigned long long) sh->sector, > (unsigned long long) sh->sector + > STRIPE_SECTORS); > } else { > > Doesn't that already mean that someone has explicitly triggered a check > action? Hi, So the idea is: run 'check' and report mismatches, and userspace (raid6check for example) uses the reported info to fix them. The pr_warn_ratelimited isn't a good way to communicate the info to userspace. I'm wondering why we don't just run raid6check on its own; it can do the job just as the kernel does, and we avoid the crappy pr_warn_ratelimited. Thanks, Shaohua ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-18 0:07 ` Shaohua Li @ 2017-05-19 4:53 ` NeilBrown 2017-05-19 10:31 ` Nix 0 siblings, 1 reply; 69+ messages in thread From: NeilBrown @ 2017-05-19 4:53 UTC (permalink / raw) To: Shaohua Li, Nix Cc: Chris Murphy, David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID [-- Attachment #1: Type: text/plain, Size: 1989 bytes --] On Wed, May 17 2017, Shaohua Li wrote: > On Tue, May 16, 2017 at 10:46:13PM +0100, Nix wrote: >> On 16 May 2017, NeilBrown spake thusly: >> >> > Actually, I have another caveat. I don't think we want these messages >> > during initial resync, or any resync. Only during a 'check' or >> > 'repair'. >> > So add a check for MD_RECOVERY_REQUESTED or maybe for >> > sh->sectors >= conf->mddev->recovery_cp >> >> I completely agree, but it's already inside MD_RECOVERY_CHECK: >> >> if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { >> /* don't try to repair!! */ >> set_bit(STRIPE_INSYNC, &sh->state); >> pr_warn_ratelimited("%s: mismatch sector in range " >> "%llu-%llu\n", mdname(conf->mddev), >> (unsigned long long) sh->sector, >> (unsigned long long) sh->sector + >> STRIPE_SECTORS); >> } else { >> >> Doesn't that already mean that someone has explicitly triggered a check >> action? > > > Hi, > So the idea is: run 'check' and report mismatch, userspace (raid6check for > example) uses the reported info to fix the mismatch. The pr_warn_ratelimited > isn't a good way to communicate the info to userspace. I'm wondering why we > don't just run raid6check solely, it can do the job like what kernel does and > we avoid the crappy pr_warn_ratelimited. > raid6check is *much* slower than doing it in the kernel, as the interlocking to avoid checking a stripe that is being written are clumsy.... and async IO is harder in user space. I think the warnings are useful as warnings quite apart from the possibility of raid6check using them. 
If we really wanted a seamless "fix the raid6 thing" (which I don't think we do), we'd probably make the list of inconsistencies appear in a sysfs file. That would be less 'crappy'. But as I say, I don't think we really want to do that. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-19 4:53 ` NeilBrown @ 2017-05-19 10:31 ` Nix 2017-05-19 16:48 ` Shaohua Li 0 siblings, 1 reply; 69+ messages in thread From: Nix @ 2017-05-19 10:31 UTC (permalink / raw) To: NeilBrown Cc: Shaohua Li, Chris Murphy, David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On 19 May 2017, NeilBrown verbalised: > On Wed, May 17 2017, Shaohua Li wrote: > >> On Tue, May 16, 2017 at 10:46:13PM +0100, Nix wrote: >>> Doesn't that already mean that someone has explicitly triggered a check >>> action? >> >> So the idea is: run 'check' and report mismatch, userspace (raid6check for >> example) uses the reported info to fix the mismatch. The pr_warn_ratelimited >> isn't a good way to communicate the info to userspace. I'm wondering why we >> don't just run raid6check solely, it can do the job like what kernel does and >> we avoid the crappy pr_warn_ratelimited. It'll do when there are a few inconsistencies but you don't want to spend days recovering a huge array to fix a small but nonzero mismatch_cnt, or to reassure you that yes, these mismatch_cnts are in swap, ignore them. When there are a lot, enough that a ratelimited warning hits its rate limit, Neil's right: the array is probably toast. The limit is then important to stop log flooding. > If we really wanted a seamless "fix the raid6 thing" (which I don't > think we do), Oh, I want seamless everything -- the seamlessness and flexibility of md are its killer features over hardware RAID in my eyes -- but I'm convinced that this is probably too hard to test and simply too disruptive to bother with for a likely vanishingly rare failure mode all entangled with fairly hot paths. > we'd probably make the list of inconsistencies appear in a > sysfs file. That would be less 'crappy'. But as I say, I don't think > we really want to do that. 
Aren't sysfs files in effect length-limited to one page (or at least length-limited by virtue of being stored in memory)? It seems to me this would just bring the same problem ratelimit is solving right back again, except that a sysfs file doesn't have a logging daemon sucking the contents out constantly so you can overwrite your old output without worrying. (And there is no other daemon running to do that, except mdadm in monitor mode, which might not be running, and really this job feels out of scope for it anyway.) ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-19 10:31 ` Nix @ 2017-05-19 16:48 ` Shaohua Li 2017-06-02 12:28 ` Nix 0 siblings, 1 reply; 69+ messages in thread From: Shaohua Li @ 2017-05-19 16:48 UTC (permalink / raw) To: Nix Cc: NeilBrown, Chris Murphy, David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On Fri, May 19, 2017 at 11:31:23AM +0100, Nix wrote: > On 19 May 2017, NeilBrown verbalised: > > > On Wed, May 17 2017, Shaohua Li wrote: > > > >> On Tue, May 16, 2017 at 10:46:13PM +0100, Nix wrote: > >>> Doesn't that already mean that someone has explicitly triggered a check > >>> action? > >> > >> So the idea is: run 'check' and report mismatch, userspace (raid6check for > >> example) uses the reported info to fix the mismatch. The pr_warn_ratelimited > >> isn't a good way to communicate the info to userspace. I'm wondering why we > >> don't just run raid6check solely, it can do the job like what kernel does and > >> we avoid the crappy pr_warn_ratelimited. > > It'll do when there are a few inconsistencies but you don't want to > spend days recovering a huge array to fix a small but nonzero > mismatch_cnt, or to reassure you that yes, these mismatch_cnts are in > swap, ignore them. When there are a lot, enough that a ratelimited > warning hits its rate limit, Neil's right: the array is probably toast. > The limit is then important to stop log flooding. > > > If we really wanted a seamless "fix the raid6 thing" (which I don't > > think we do), > > Oh, I want seamless everything -- the seamlessness and flexibility of md > are its killer features over hardware RAID in my eyes -- but I'm > convinced that this is probably too hard to test and simply too > disruptive to bother with for a likely vanishingly rare failure mode all > entangled with fairly hot paths. > > > we'd probably make the list of inconsistencies appear in a > > sysfs file. That would be less 'crappy'. 
But as I say, I don't think > > we really want to do that. > > Aren't sysfs files in effect length-limited to one page (or at least > length-limited by virtue of being stored in memory)? It seems to me this > would just bring the same problem ratelimit is solving right back again, > except a sysfs file doesn't have a logging daemon sucking the contents > out constantly so you can overwrite your old output without worrying. > (And there is no other daemon running to do that, except mdadm in > monitor mode, which might not be running and really this job feels out > of scope for it anyway.) No, my point is not that the print is ratelimited. The problem is that dmesg isn't a good way to communicate info to userspace. You can easily lose all dmesg info with a simple 'dmesg -c'. A sysfs file is more reliable. Being length-limited isn't a problem: as you said, if there are a lot of mismatches, the array is toast. Alright, I'll accept Neil's suggestion. Unless you guys really need a seamless fix (which I'm still thinking of doing in userspace by optimizing raid6check), we'd take this simple warning patch. Thanks, Shaohua ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-19 16:48 ` Shaohua Li @ 2017-06-02 12:28 ` Nix 0 siblings, 0 replies; 69+ messages in thread From: Nix @ 2017-06-02 12:28 UTC (permalink / raw) To: Shaohua Li Cc: NeilBrown, Chris Murphy, David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID [getting back to this...] On 19 May 2017, Shaohua Li told this: > On Fri, May 19, 2017 at 11:31:23AM +0100, Nix wrote: >> On 19 May 2017, NeilBrown verbalised: >> > we'd probably make the list of inconsistencies appear in a >> > sysfs file. That would be less 'crappy'. But as I say, I don't think >> > we really want to do that. >> >> Aren't sysfs files in effect length-limited to one page (or at least >> length-limited by virtue of being stored in memory?) It seems to me this >> would just bring the same problem ratelimit is solving right back again, >> except a sysfs file doesn't have a logging daemon sucking the contents >> out constantly so you can overwrite your old output without worrying. >> (And there is no other daemon running to do that, except mdadm in >> monitor mode, which might not be running and really this job feels out >> of scope for it anyway.) > > No, my question is not the print is ratelimited. The problem is dmesg isn't a > good way to communicate info to userspace. You can easily lose all dmesg info > with a simple 'dmesg -c'. sysfs file is more reliable. Length-limited isn't a > problem, as you said, if there are a lot of mismatch, the array is toast. I agree that in future having a mechanism for reporting this more easily usable by programs would be good, and sysfs does seem like just such a mechanism. -- NULL && (void) ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-16 21:46 ` Nix 2017-05-18 0:07 ` Shaohua Li @ 2017-05-19 4:49 ` NeilBrown 2017-05-19 10:32 ` Nix 1 sibling, 1 reply; 69+ messages in thread From: NeilBrown @ 2017-05-19 4:49 UTC (permalink / raw) To: Nix Cc: Chris Murphy, David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID [-- Attachment #1: Type: text/plain, Size: 1209 bytes --] On Tue, May 16 2017, Nix wrote: > On 16 May 2017, NeilBrown spake thusly: > >> Actually, I have another caveat. I don't think we want these messages >> during initial resync, or any resync. Only during a 'check' or >> 'repair'. >> So add a check for MD_RECOVERY_REQUESTED or maybe for >> sh->sectors >= conf->mddev->recovery_cp > > I completely agree, but it's already inside MD_RECOVERY_CHECK: > > if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { > /* don't try to repair!! */ > set_bit(STRIPE_INSYNC, &sh->state); > pr_warn_ratelimited("%s: mismatch sector in range " > "%llu-%llu\n", mdname(conf->mddev), > (unsigned long long) sh->sector, > (unsigned long long) sh->sector + > STRIPE_SECTORS); > } else { > > Doesn't that already mean that someone has explicitly triggered a check > action? Uhmm... yeah. I lose track of which flags mean what exactly. Your log messages aren't generated when 'repair' is used, only when 'check' is. I can see why you might have chosen that, but I wonder if it is best. But I'm OK with this patch as it stands. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-19 4:49 ` NeilBrown @ 2017-05-19 10:32 ` Nix 2017-05-19 16:55 ` Shaohua Li 0 siblings, 1 reply; 69+ messages in thread From: Nix @ 2017-05-19 10:32 UTC (permalink / raw) To: NeilBrown Cc: Chris Murphy, David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On 19 May 2017, NeilBrown said: > On Tue, May 16 2017, Nix wrote: > >> On 16 May 2017, NeilBrown spake thusly: >> >>> Actually, I have another caveat. I don't think we want these messages >>> during initial resync, or any resync. Only during a 'check' or >>> 'repair'. >>> So add a check for MD_RECOVERY_REQUESTED or maybe for >>> sh->sectors >= conf->mddev->recovery_cp >> >> I completely agree, but it's already inside MD_RECOVERY_CHECK: >> >> if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { >> /* don't try to repair!! */ >> set_bit(STRIPE_INSYNC, &sh->state); >> pr_warn_ratelimited("%s: mismatch sector in range " >> "%llu-%llu\n", mdname(conf->mddev), >> (unsigned long long) sh->sector, >> (unsigned long long) sh->sector + >> STRIPE_SECTORS); >> } else { >> >> Doesn't that already mean that someone has explicitly triggered a check >> action? > > Uhmm... yeah. I lose track of which flags me what exactly. > You log messages aren't generated when 'repair' is used, only when > 'check' is. > I can see why you might have chosen that, but I wonder if it is best. I'm not sure what the point is of being told when repair is used: hey, there was an inconsistency here but there isn't any more! I suppose you could still use it to see if the repair did the right thing. My problem on that front was that I'm not sure what flag should be used to catch repair but not resync etc: everywhere else in the code, repair is in an unadorned else branch... is it the *lack* of MD_RECOVERY_CHECK and the presence of, uh, something else? -- NULL && (void) ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-19 10:32 ` Nix @ 2017-05-19 16:55 ` Shaohua Li 2017-05-21 22:00 ` NeilBrown 0 siblings, 1 reply; 69+ messages in thread From: Shaohua Li @ 2017-05-19 16:55 UTC (permalink / raw) To: Nix Cc: NeilBrown, Chris Murphy, David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On Fri, May 19, 2017 at 11:32:43AM +0100, Nix wrote: > On 19 May 2017, NeilBrown said: > > > On Tue, May 16 2017, Nix wrote: > > > >> On 16 May 2017, NeilBrown spake thusly: > >> > >>> Actually, I have another caveat. I don't think we want these messages > >>> during initial resync, or any resync. Only during a 'check' or > >>> 'repair'. > >>> So add a check for MD_RECOVERY_REQUESTED or maybe for > >>> sh->sectors >= conf->mddev->recovery_cp > >> > >> I completely agree, but it's already inside MD_RECOVERY_CHECK: > >> > >> if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { > >> /* don't try to repair!! */ > >> set_bit(STRIPE_INSYNC, &sh->state); > >> pr_warn_ratelimited("%s: mismatch sector in range " > >> "%llu-%llu\n", mdname(conf->mddev), > >> (unsigned long long) sh->sector, > >> (unsigned long long) sh->sector + > >> STRIPE_SECTORS); > >> } else { > >> > >> Doesn't that already mean that someone has explicitly triggered a check > >> action? > > > > Uhmm... yeah. I lose track of which flags me what exactly. > > You log messages aren't generated when 'repair' is used, only when > > 'check' is. > > I can see why you might have chosen that, but I wonder if it is best. > > I'm not sure what the point is of being told when repair is used: hey, > there was an inconsistency here but there isn't any more! I suppose you > could still use it to see if the repair did the right thing. My problem > on that front was that I'm not sure what flag should be used to catch > repair but not resync etc: everywhere else in the code, repair is in an > unadorned else branch... 
is it the *lack* of MD_RECOVERY_CHECK and the > presence of, uh, something else? MD_RECOVERY_SYNC && MD_RECOVERY_REQUESTED && MD_RECOVERY_CHECK == check MD_RECOVERY_SYNC && MD_RECOVERY_REQUESTED == repair MD_RECOVERY_SYNC && !MD_RECOVERY_REQUESTED == resync I don't see the point of printing the info for 'repair'. 'repair' already changes the data, so how could we use the info? Thanks, Shaohua ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) 2017-05-19 16:55 ` Shaohua Li @ 2017-05-21 22:00 ` NeilBrown 0 siblings, 0 replies; 69+ messages in thread From: NeilBrown @ 2017-05-21 22:00 UTC (permalink / raw) To: Shaohua Li, Nix Cc: Chris Murphy, David Brown, Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, Linux-RAID [-- Attachment #1: Type: text/plain, Size: 3081 bytes --] On Fri, May 19 2017, Shaohua Li wrote: > On Fri, May 19, 2017 at 11:32:43AM +0100, Nix wrote: >> On 19 May 2017, NeilBrown said: >> >> > On Tue, May 16 2017, Nix wrote: >> > >> >> On 16 May 2017, NeilBrown spake thusly: >> >> >> >>> Actually, I have another caveat. I don't think we want these messages >> >>> during initial resync, or any resync. Only during a 'check' or >> >>> 'repair'. >> >>> So add a check for MD_RECOVERY_REQUESTED or maybe for >> >>> sh->sectors >= conf->mddev->recovery_cp >> >> >> >> I completely agree, but it's already inside MD_RECOVERY_CHECK: >> >> >> >> if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { >> >> /* don't try to repair!! */ >> >> set_bit(STRIPE_INSYNC, &sh->state); >> >> pr_warn_ratelimited("%s: mismatch sector in range " >> >> "%llu-%llu\n", mdname(conf->mddev), >> >> (unsigned long long) sh->sector, >> >> (unsigned long long) sh->sector + >> >> STRIPE_SECTORS); >> >> } else { >> >> >> >> Doesn't that already mean that someone has explicitly triggered a check >> >> action? >> > >> > Uhmm... yeah. I lose track of which flags me what exactly. >> > You log messages aren't generated when 'repair' is used, only when >> > 'check' is. >> > I can see why you might have chosen that, but I wonder if it is best. >> >> I'm not sure what the point is of being told when repair is used: hey, >> there was an inconsistency here but there isn't any more! I suppose you >> could still use it to see if the repair did the right thing. 
My problem >> on that front was that I'm not sure what flag should be used to catch >> repair but not resync etc: everywhere else in the code, repair is in an >> unadorned else branch... is it the *lack* of MD_RECOVERY_CHECK and the >> presence of, uh, something else? > MD_RECOVERY_SYNC && MD_RECOVERY_REQUESTED && MD_RECOVERY_CHECK == check > MD_RECOVERY_SYNC && MD_RECOVERY_REQUESTED == repair > MD_RECOVERY_SYNC && !MD_RECOVERY_REQUESTED == resync > > Don't see the point to print the info for 'repair'. 'repair' already changes the > data, how could we use the info? Surprising data can be valuable. I don't think you should *ever* get an inconsistency in a RAID6 unless you have faulty hardware. If you do, then any information about the nature of the inconsistency might be valuable in understanding the hardware fault. I don't know in advance how I would interpret the data, but I do know that if I didn't have the data, then I wouldn't be able to interpret it. However .... running "repair" when you don't know exactly what has happened and why, is probably a bad idea. So logging probably won't provide value. I wouldn't go out of my way to add extra logging for the 'repair' case, but I certainly wouldn't go out of my way to avoid logging in that case. It seems inconsistent to log for 'check' but not 'repair', but it isn't a big deal for me. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 11:27 ` Nix 2017-05-09 11:58 ` David Brown @ 2017-05-09 19:16 ` Phil Turmel 2017-05-09 20:01 ` Nix 1 sibling, 1 reply; 69+ messages in thread From: Phil Turmel @ 2017-05-09 19:16 UTC (permalink / raw) To: Nix, David Brown; +Cc: Anthony Youngman, Ravi (Tom) Hale, linux-raid On 05/09/2017 07:27 AM, Nix wrote: > On 9 May 2017, David Brown uttered the following: > >> On 09/05/17 11:53, Nix wrote: >>> This turns out not to be the case. See this ten-year-old paper: >>> <https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf>. >>> Five weeks of doing 2GiB writes on 3000 nodes once every two hours >>> found, they estimated, 50 errors possibly attributable to disk problems >>> (sector- or page-size regions of corrupted data) on 1/30th of their >>> nodes. This is *not* rare and it is hard to imagine that 1/30th of disks >>> used by CERN deserve discarding. It is better to assume that drives >>> misdirect writes now and then, and to provide a means of recovering from >>> them that does not take days of panic. RAID-6 gives you that means: md >>> should use it. >> >> RAID-6 does not help here. You have to understand the types of errors >> that can occur, the reasons for them, the possibilities for detection, >> the possibilities for recovery, and what the different layers in the >> system can do about them. >> >> RAID (1/5/6) will let you recover from one or more known failed reads, >> on the assumption that the driver firmware is correct, memories have no >> errors, buses have no errors, block writes are atomic, write ordering >> matches the flush commands, block reads are either correct or marked as >> failed, etc. > > I think you're being too pedantic. Many of these things are known not to > be true on real hardware, and at least one of them cannot possibly be > true without a journal (atomic block writes). 
Nonetheless, the md layer > is quite happy to rebuild after a failed disk even though the write hole > might have torn garbage into your data, on the grounds that it > *probably* did not. If your argument was used everywhere, md would never > have been started because 100% reliability was not guaranteed. > > The same, it seems to me, is true of cases in which one drive in a > RAID-6 reports a few mismatched blocks. It is true that you don't know > the cause of the mismatches, but you *do* know which bit of the mismatch > is wrong and what data should be there, subject only to the assumption > that sufficiently few drives have made simultaneous mistakes that > redundancy is preserved. And that's the same assumption RAID >0 makes > all the time anyway! You are completely ignoring the fact that reconstruction from P,Q is mathematically correct only if the entire stripe is written together. Any software or hardware problem that interrupts a complete stripe write or a short-circuited P,Q update can, and therefore often will, deliver a *wrong* assessment of which device is corrupted. In particular, you can't even tell which devices got new data and which got old data. Even worse, cable and controller problems have been known to create patterns of corruption on the way to one or more drives. You desperately need to know if this happens to your array. It is not only possible, but *likely* in systems without ECC RAM. The bottom line is that any kernel that implements the auto-correct you seem to think is a slam dunk will be shunned by any system administrator who actually cares about their data. Your obtuseness notwithstanding. All: Please drop me from future CCs on this thread. Phil ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 19:16 ` Fault tolerance with badblocks Phil Turmel @ 2017-05-09 20:01 ` Nix 2017-05-09 20:57 ` Wols Lists 2017-05-09 21:23 ` Phil Turmel 0 siblings, 2 replies; 69+ messages in thread From: Nix @ 2017-05-09 20:01 UTC (permalink / raw) To: Phil Turmel; +Cc: David Brown, Anthony Youngman, Ravi (Tom) Hale, linux-raid On 9 May 2017, Phil Turmel told this: > On 05/09/2017 07:27 AM, Nix wrote: >> The same, it seems to me, is true of cases in which one drive in a >> RAID-6 reports a few mismatched blocks. It is true that you don't know >> the cause of the mismatches, but you *do* know which bit of the mismatch >> is wrong and what data should be there, subject only to the assumption >> that sufficiently few drives have made simultaneous mistakes that >> redundancy is preserved. And that's the same assumption RAID >0 makes >> all the time anyway! > > You are completely ignoring the fact that reconstruction from P,Q is > mathematically correct only if the entire stripe is written together. Ooh, true. > Any software or hardware problem that interrupts a complete stripe write > or a short-circuited P,Q update can and therefore often will deliver a > *wrong* assessment of what device is corrupted. In particular, you > can't even tell which devices got new data and which got old data. Even > worse, cable and controller problems have been known to create patterns > of corruption to the way to one or more drives. You desperately need to > know if this happens to your array. It is not only possible, but > *likely* in systems without ECC ram. Is this still true if the md cache or PPL is in use? The whole point of these, after all, is to ensure that stripe writes either happen completely or not at all. (But, again, that'll only guard against things like power failure interruptions, not bad cabling. 
However, again, if you have bad cabling or a bad controller you can expect to have *lots and lots* of errors -- a small number of errors are much less likely to be something of this nature. So, again, a threshold like md already applies elsewhere might seem to be worthwhile. If you are seeing *lots* of mismatches, clearly correction is unwise -- heck, writing to the array at all is unwise, and the whole thing might profitably be remounted ro. I suspect the filesystems will have been remounted ro by the kernel by this point in any case.) The point made elsewhere that all your arguments also apply against fsck still stands. (Why bother with it? If it gave an error, you have a kernel bug or a bad disk controller, RAM, or cabling, and nothing on your filesystem can be trusted! just restore from backup!) Your arguments are absolutely classic "the perfect is the enemy of the good" arguments, in my view. I can understand falling into that trap on a RAID list, it's all about paranoia :) but that doesn't mean I agree with them. I *have* excellent backups, but that doesn't mean I want to waste hours to days restoring and/or revalidating everything just because of a persistent mismatch_cnt > 0 which md won't localize for me or even try to fix because it *might*, uh... no, as far as I can tell you're worrying that it might in some cases cause corruption of data that is *already known to be corrupt*. You'll pardon me if this possibility does not fill me with fear. > The bottom line is that any kernel that implements the auto-correct you > seem to think is a slam dunk will be shunned by any system administrator > who actually cares about their data. Your obtuseness notwithstanding. Gee, thanks heaps. Next time I want randomly insulting by someone who doesn't bother to tell me his actual *arguments* in any message before the one that starts on the insults, I'll come straight to you. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 20:01 ` Nix @ 2017-05-09 20:57 ` Wols Lists 2017-05-09 21:22 ` Nix 2017-05-09 21:23 ` Phil Turmel 1 sibling, 1 reply; 69+ messages in thread From: Wols Lists @ 2017-05-09 20:57 UTC (permalink / raw) To: Nix, Phil Turmel; +Cc: linux-raid On 09/05/17 21:01, Nix wrote: > Gee, thanks heaps. Next time I want randomly insulting by someone who > doesn't bother to tell me his actual *arguments* in any message before > the one that starts on the insults, I'll come straight to you. Nix, much as I don't think people are thinking this through rationally (they live in the perfect world of maths, not the imperfect world of engineering), I do NOT think insulting Phil on this list is a good idea. We all say things we shouldn't - I'm a master at it too :-) but sniping at a well-respected regular isn't wise ... Can we all tone it down, please ... Cheers, Wol ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 20:57 ` Wols Lists @ 2017-05-09 21:22 ` Nix 0 siblings, 0 replies; 69+ messages in thread From: Nix @ 2017-05-09 21:22 UTC (permalink / raw) To: Wols Lists; +Cc: Phil Turmel, linux-raid On 9 May 2017, Wols Lists told this: > On 09/05/17 21:01, Nix wrote: >> Gee, thanks heaps. Next time I want randomly insulting by someone who >> doesn't bother to tell me his actual *arguments* in any message before >> the one that starts on the insults, I'll come straight to you. > > Nix, much as I don't think people are thinking this through rationally > (they live in the perfect world of maths, not the imperfect world of > engineering), I do NOT think insulting Phil on this list is a good idea. Errr... sure, but I may be ignorant, but I'm not obtuse. Not as far as I know, anyway. What I am is sleep-deprived. (It takes a special kind of nervous wreck to be kept awake by a problem like this, that has never happened in many years of my using md/raid. I think I'll be kept awake by the possibility of an asteroid strike or a second Carrington Event tonight.) > Can we all tone it down, please ... Sure! I'm generating untested patches now, is that better? (Probably not. But they do solve this problem enough to reduce the worry quotient without actually doing the much-more-complex repair side of things.) -- NULL && (void) ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 20:01 ` Nix 2017-05-09 20:57 ` Wols Lists @ 2017-05-09 21:23 ` Phil Turmel 1 sibling, 0 replies; 69+ messages in thread From: Phil Turmel @ 2017-05-09 21:23 UTC (permalink / raw) To: Nix; +Cc: David Brown, Anthony Youngman, Ravi (Tom) Hale, linux-raid On 05/09/2017 04:01 PM, Nix wrote: > On 9 May 2017, Phil Turmel told this: >> The bottom line is that any kernel that implements the auto-correct you >> seem to think is a slam dunk will be shunned by any system administrator >> who actually cares about their data. Your obtuseness notwithstanding. > > Gee, thanks heaps. Next time I want randomly insulting by someone who > doesn't bother to tell me his actual *arguments* in any message before > the one that starts on the insults, I'll come straight to you. Ok, yeah, I was a bit harsh. Ad hominem is not appropriate. Not that the shunning wouldn't happen. As for the arguments, well, *everyone* on this list is providing arguments and you are ignoring them. Whether you are filtering facts on pre-conceived ideas about raid6 or simply can't understand the points, the result *appears* obtuse. And now, please all drop me from the CC. Phil ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 9:53 ` Nix 2017-05-09 11:09 ` David Brown @ 2017-05-09 21:32 ` NeilBrown 2017-05-10 19:03 ` Nix 1 sibling, 1 reply; 69+ messages in thread From: NeilBrown @ 2017-05-09 21:32 UTC (permalink / raw) To: Nix, Anthony Youngman; +Cc: Phil Turmel, Ravi (Tom) Hale, linux-raid [-- Attachment #1: Type: text/plain, Size: 5958 bytes --] On Tue, May 09 2017, Nix wrote: > On 8 May 2017, Anthony Youngman told this: > >> If the scrub finds a mismatch, then the drives are reporting >> "everything's fine here". Something's gone wrong, but the question is >> what? If you've got a four-drive raid that reports a mismatch, how do >> you know which of the four drives is corrupt? Doing an auto-correct >> here risks doing even more damage. (I think a raid-6 could recover, >> but raid-5 is toast ...) > > With a RAID-5 you are screwed: you can reconstruct the parity but cannot > tell if it was actually right. You can make things consistent, but not > correct. > > But with a RAID-6 you *do* have enough data to make things correct, with > precisely the same probability as recovery of a RAID-5 "drive" of length > a single sector. It seems wrong that not only does md not do this but > doesn't even tell you which drive made the mistake so you could do the > millions-of-times-slower process of a manual fail and readdition of the > drive (or, if you suspect it of being wholly buggered, a manual fail and > replacement). > >> And seeing as drives are pretty much guaranteed (unless something's >> gone BADLY wrong) to either (a) accurately return the data written, or >> (b) return a read error, that means a data mismatch indicates >> something is seriously wrong that is NOTHING to do with the drives. > > This turns out not to be the case. See this ten-year-old paper: > <https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf>. 
> Five weeks of doing 2GiB writes on 3000 nodes once every two hours > found, they estimated, 50 errors possibly attributable to disk problems > (sector- or page-size regions of corrupted data) on 1/30th of their > nodes. This is *not* rare and it is hard to imagine that 1/30th of disks > used by CERN deserve discarding. It is better to assume that drives > misdirect writes now and then, and to provide a means of recovering from > them that does not take days of panic. RAID-6 gives you that means: md > should use it. > > The page-sized regions of corrupted data were probably software -- but > the sector-sized regions were just as likely the drives, possibly > misdirected writes or misdirected reads. > > Neil decided not to do any repair work in this case on the grounds that > if the drive is misdirecting one write it might misdirect the repair as > well My justification was a bit broader than that. If you get a consistency error on RAID6, there is not one model to explain it which is significantly more likely than any other model. So it is not possible to predict the results of any particular remedial action. It might help, it might hurt, it might have no effect. Better to do nothing and appear incompetent, than to do the wrong thing and remove all doubt. (there could be problems with media, buffering in the drive, addressing in the drive, buffer/addressing in the controller, errors in main memory, CPU problems comparing bytes, corruption on a bus, either reading or writing - of either data or addresses) NeilBrown > -- but if the repair is *consistently* misdirected, that seems > relatively harmless (you had corruption before, you have it now, it just > moved), and if it was a sporadic error, the repair is worthwhile. 
The > only case in which a repair should not be attempted is if the drive is > misdirecting all or most writes -- but in that case, by the time you do > a scrub, on all but the quietest arrays you'll see millions of > mismatches and it'll be obvious that it's time to throw the drive out. > (Assuming md told you which drive it was.) > >>> If a sector weakens purely because of neighbouring writes or temperature >>> or a vibrating housing or something (i.e. not because of actual damage), >>> so that a rewrite will strengthen it and relocation was never necessary, >>> surely you've just saved a pointless bit of sector sparing? (I don't >>> know: I'm not sure what the relative frequency of these things is. Read >>> and write errors in general are so rare that it's quite possible I'm >>> worrying about nothing at all. I do know I forgot to scrub my old >>> hardware RAID array for about three years and nothing bad happened...) >>> >> Yes you have saved a sector sparing. Note that a consumer 3TB drive >> can return, on average, one error every time it's read from end to end >> 3 times, and still be considered "within spec" ie "not faulty" by the > > Yeah, that's why RAID-6 is a good idea. :) > >> manufacturer. And that's a *brand* *new* drive. That's why building a >> large array using consumer drives is a stupid idea - 4 x 3TB drives >> and a *within* *spec* array must expect to handle at least one error >> every scrub. > > That's just one reason why. The lack of control over URE timeouts is > just as bad. > >> Okay - most drives are actually way over spec, and could probably be >> read end-to-end many times without a single error, but you'd be a fool >> to gamble on it. 
> > I'm trying *not* to gamble on it -- but I don't want to end up in the > current situation we seem to have with md6, which is "oh, you have a > mismatch, it's not going away, but we're neither going to tell you where > it is nor what disk it's on nor repair it ourselves, even though we > could, just to make it as hard as possible for you to repair the problem > or even tell if it's a consistent one" (is the single mismatch an > expected, spurious read error because of the volume of data you're > reading, or one that's consistent and needs repair? All mismatch_cnt > tells you is that there's a mismatch). > > -- > NULL && (void) > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
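[Editorial note: Nix's claim that RAID-6 carries enough information to *locate* a single corrupt data block, not merely detect it, can be illustrated with a toy Galois-field model. This is an educational sketch only, not md's actual code: the function names are invented, real md works on chunks rather than single bytes, and the caveats Neil gives above (the corruption may not be in a data block at all) still apply.]

```python
# Toy GF(2^8) RAID-6 stripe: P = xor of data bytes, Q = xor of g^i * d_i.
# With both parities intact, a single corrupt data byte can be located
# from the two syndromes, because S_Q = g^z * S_P for error position z.

def gf_mul(a, b, poly=0x11d):
    """Multiply in GF(2^8) with the RAID-6 polynomial 0x11d."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def pow_g(i):
    """g^i where g = 2 (a generator of the field's multiplicative group)."""
    r = 1
    for _ in range(i):
        r = gf_mul(r, 2)
    return r

# Discrete-log table: LOG[g^i] = i, for recovering the error position.
LOG = {}
_x = 1
for _i in range(255):
    LOG[_x] = _i
    _x = gf_mul(_x, 2)

def syndromes(data, p, q):
    """Syndromes of a stripe: both zero iff the stripe is consistent."""
    sp, sq = p, q
    for i, d in enumerate(data):
        sp ^= d
        sq ^= gf_mul(pow_g(i), d)
    return sp, sq

def locate_bad_drive(data, p, q):
    """Return None (consistent), 'P'/'Q' (parity block bad), or the
    index of the single corrupt data drive."""
    sp, sq = syndromes(data, p, q)
    if sp == 0 and sq == 0:
        return None
    if sq == 0:
        return 'P'            # only P disagrees: P block itself is bad
    if sp == 0:
        return 'Q'            # only Q disagrees: Q block itself is bad
    # single data error e at index z: sp = e, sq = g^z * e
    return (LOG[sq] - LOG[sp]) % 255
```

Once located, repair of a data error is just `data[z] ^= sp`. This is the "RAID-5 of length one sector" recovery Nix describes; whether md *should* do it automatically is exactly what this thread disputes.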
* Re: Fault tolerance with badblocks 2017-05-09 21:32 ` NeilBrown @ 2017-05-10 19:03 ` Nix 0 siblings, 0 replies; 69+ messages in thread From: Nix @ 2017-05-10 19:03 UTC (permalink / raw) To: NeilBrown; +Cc: Anthony Youngman, Phil Turmel, Ravi (Tom) Hale, linux-raid On 9 May 2017, NeilBrown outgrape: > On Tue, May 09 2017, Nix wrote: >> Neil decided not to do any repair work in this case on the grounds that >> if the drive is misdirecting one write it might misdirect the repair as >> well > > My justification was a bit broader than that. I noticed your trailing comment on the blog post only after sending all these emails out :( bah! > If you get a consistency error on RAID6, there is not one model to > explain it which is significantly more likely than any other model. Yeah, I'm quite satisfied with "we don't have enough data to know if repairing is safe" as reasoning: among other things it suggests that mismatches are really rare, which is reassuring! This certainly suggests that repairing should be, at the very least, off by default, and I'm not terribly unhappy for it to not exist. ... but I do want to at least report the location of stripes that fail checks, as in my earlier ugly patch. That's useful for any array with >1 partition or LVM LV on it. ("Oh, that mismatch is harmless, it's in swap. That one is in small_but_crucial_lv, I'll restore it from backup, without affecting the massive_messy_lv which had no mismatches and would take weeks to restore.") (As far as I'm concerned, if you don't *have* a backup of some fs, you deserve what's coming to you! Good backups are easy and with md you can even make them as resilient as the main RAID arrays. I'm interested in maximizing availability here: having to take a big array with many LVs down for ages for a restore because you don't know which bit is corrupted just seems *wrong*.) ^ permalink raw reply [flat|nested] 69+ messages in thread
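[Editorial note: the triage Nix wants, "which partition or LV does this mismatch sector fall in?", can be done entirely in user space once md reports a sector number, given the sector layout of the array (e.g. from `lsblk -b`). A minimal sketch; the volume names and offsets below are invented for illustration.]

```python
# Map a reported mismatch sector to the volume that owns it, so a
# restore can be limited to that volume. Layout entries are
# (name, start_sector, length_sectors) relative to the array.

def owner_of(sector, layout):
    """Return the name of the volume containing `sector`, or None."""
    for name, start, length in layout:
        if start <= sector < start + length:
            return name
    return None  # unallocated gap

# Hypothetical layout echoing the example in the mail above:
layout = [
    ("swap_lv",              2048,    8388608),
    ("small_but_crucial_lv", 8390656, 2097152),
    ("massive_messy_lv",     10487808, 858993459),
]
```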
* Re: Fault tolerance with badblocks 2017-05-08 20:27 ` Anthony Youngman 2017-05-09 9:53 ` Nix @ 2017-05-09 16:05 ` Chris Murphy 2017-05-09 17:49 ` Wols Lists 1 sibling, 1 reply; 69+ messages in thread From: Chris Murphy @ 2017-05-09 16:05 UTC (permalink / raw) To: Anthony Youngman; +Cc: Nix, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On Mon, May 8, 2017 at 2:27 PM, Anthony Youngman <antlists@youngman.org.uk> wrote: > Yes you have saved a sector sparing. Note that a consumer 3TB drive can > return, on average, one error every time it's read from end to end 3 times, > and still be considered "within spec" ie "not faulty" by the manufacturer. All specs say "less than" which means it's a maximum permissible rate, not an average. We have no idea what the minimum error rate is - we being consumers. It's possible high volume users (e.g. Backblaze) have data on this by now. > And that's a *brand* *new* drive. That's why building a large array using > consumer drives is a stupid idea - 4 x 3TB drives and a *within* *spec* > array must expect to handle at least one error every scrub. The requirement for any large array is quickly abandoning reattempted reads in favor of reporting a read error. That's the main reason why consumer drives are a bad idea, is that it can hang user space waiting on the long recovery of a drive. -- Chris Murphy ^ permalink raw reply [flat|nested] 69+ messages in thread
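[Editorial note: the quoted "one error every three end-to-end reads" figure can be sanity-checked with back-of-envelope arithmetic, assuming the common consumer-drive spec of less than 1 unrecoverable read error per 10^14 bits, which, as Chris notes, is a ceiling, not an average.]

```python
# Expected unrecoverable read errors at the spec ceiling of 1e-14 per bit.

URE_RATE = 1e-14          # errors per bit read (spec ceiling, "less than")
DRIVE_BYTES = 3e12        # one 3 TB drive

bits_per_full_read = DRIVE_BYTES * 8
errors_per_full_read = bits_per_full_read * URE_RATE    # ~0.24

# Three end-to-end reads sit just under one expected error,
# which is roughly where the "one error per 3 reads" claim comes from:
errors_per_three_reads = 3 * errors_per_full_read        # ~0.72

# A scrub of a 4 x 3 TB array reads all four drives end to end:
errors_per_scrub = 4 * errors_per_full_read              # ~0.96
```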
* Re: Fault tolerance with badblocks 2017-05-09 16:05 ` Chris Murphy @ 2017-05-09 17:49 ` Wols Lists 2017-05-10 3:06 ` Chris Murphy 0 siblings, 1 reply; 69+ messages in thread From: Wols Lists @ 2017-05-09 17:49 UTC (permalink / raw) To: Chris Murphy; +Cc: Nix, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On 09/05/17 17:05, Chris Murphy wrote: >> Yes you have saved a sector sparing. Note that a consumer 3TB drive can >> > return, on average, one error every time it's read from end to end 3 times, >> > and still be considered "within spec" ie "not faulty" by the manufacturer. > All specs say "less than" which means it's a maximum permissible rate, > not an average. We have no idea what the minimum error rate is - we > being consumers. It's possible high volume users (e.g. Backblaze) have > data on this by now. > In other words, an error rate that high is "acceptable". And to design software that quite explicitly expects greater perfection than the hardware itself is guaranteed to provide is, in my humble opinion, downright negligent!!! I'm sorry, but like Linus, I take an *engineering* approach to this stuff, not a mathematical approach. In a mathematical world everything works perfectly. In an engineering world, things go wrong. You should always plan for the worst case. But to fail to plan for "the worst *acceptable* case" is just plain IDIOTIC. Cheers, Wol ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 17:49 ` Wols Lists @ 2017-05-10 3:06 ` Chris Murphy 0 siblings, 0 replies; 69+ messages in thread From: Chris Murphy @ 2017-05-10 3:06 UTC (permalink / raw) To: Wols Lists; +Cc: Chris Murphy, Nix, Phil Turmel, Ravi (Tom) Hale, Linux-RAID On Tue, May 9, 2017 at 11:49 AM, Wols Lists <antlists@youngman.org.uk> wrote: > On 09/05/17 17:05, Chris Murphy wrote: >>> Yes you have saved a sector sparing. Note that a consumer 3TB drive can >>> > return, on average, one error every time it's read from end to end 3 times, >>> > and still be considered "within spec" ie "not faulty" by the manufacturer. > >> All specs say "less than" which means it's a maximum permissible rate, >> not an average. We have no idea what the minimum error rate is - we >> being consumers. It's possible high volume users (e.g. Backblaze) have >> data on this by now. >> > In other words, an error rate that high is "acceptable". It's acceptable in that the manufacturer sells products with such specification and consumers buy them. It's totally voluntary. There are drives with one and two orders of magnitude lower unrecoverable error rates and some people buy them and pay extra to get that spec as a feature among other features. > And to design software that quite explicitly expects greater perfection > than the hardware itself is guaranteed to provide is, in my humble > opinion, downright negligent!!! How does the software expect a lower error rate than the drive specification? -- Chris Murphy ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-08 19:52 ` Nix 2017-05-08 20:27 ` Anthony Youngman @ 2017-05-08 20:56 ` Phil Turmel 2017-05-09 10:28 ` Nix 1 sibling, 1 reply; 69+ messages in thread From: Phil Turmel @ 2017-05-08 20:56 UTC (permalink / raw) To: Nix; +Cc: Wols Lists, Ravi (Tom) Hale, linux-raid On 05/08/2017 03:52 PM, Nix wrote: > On 8 May 2017, Phil Turmel verbalised: > >> On 05/08/2017 10:50 AM, Nix wrote: > And... then what do you do? On RAID-6, it appears the answer is "live > with a high probability of inevitable corruption". No, you investigate the quality of your data and the integrity of the rest of the system, as something *other* than a drive problem caused the mismatch. (Swap is a known exception, though.) > That's not very good. > (AIUI, if a check scrub finds a URE, it'll rewrite it, and when in the > common case the drive spares it out and the write succeeds, this will > not be reported as a mismatch: is this right?) This is also wrong, because you are assuming sparing-out is the common case. A read error does not automatically trigger relocation. It triggers *verification* of the next *write*. In young drives, successful rewrite in place is the common case. As the drive ages, rewrites will begin relocating because there really is a new problem at that spot, not simple thermal/magnetic decay. But keep in mind that the firmware of the drive will start verification of a sector only if it gets a *read* error. Such sectors get marked as "pending" relocations until they are written again. If that write verifies correct, the "pending" status simply goes away. Ordinary writes to presumed-ok sectors are *not* verified. (There'd be a huge difference between read and write speeds on rotating media if they were.) { Drive self tests might do some pre-emptive rewriting of marginal sectors -- it's not something drive manufacturers are documenting. But a drive self-test cannot fix an unreadable sector -- it doesn't know what to write there. 
} >> This is actually counterproductive. Rewriting everything may refresh >> the magnetism on weakening sectors, but will also prevent the drive from >> *finding* weakening sectors that really do need relocation. > > If a sector weakens purely because of neighbouring writes or temperature > or a vibrating housing or something (i.e. not because of actual damage), > so that a rewrite will strengthen it and relocation was never necessary, > surely you've just saved a pointless bit of sector sparing? (I don't > know: I'm not sure what the relative frequency of these things is. Read > and write errors in general are so rare that it's quite possible I'm > worrying about nothing at all. I do know I forgot to scrub my old > hardware RAID array for about three years and nothing bad happened...) Drives that are in applications that get *read* pretty often don't need much if any scrubbing -- the application itself will expose problem sectors. Hobbyists and home media servers can go months with specific files unread, so developing problems can hit in clusters. Regular scrubbing will catch these problems before they take your array down. And you can't compare hardware array behavior to MD -- they have their own algorithms to take care of attached disks without OS intervention. Phil ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-08 20:56 ` Phil Turmel @ 2017-05-09 10:28 ` Nix 2017-05-09 10:50 ` Reindl Harald 0 siblings, 1 reply; 69+ messages in thread From: Nix @ 2017-05-09 10:28 UTC (permalink / raw) To: Phil Turmel; +Cc: Wols Lists, Ravi (Tom) Hale, linux-raid On 8 May 2017, Phil Turmel said: > On 05/08/2017 03:52 PM, Nix wrote: >> And... then what do you do? On RAID-6, it appears the answer is "live >> with a high probability of inevitable corruption". > > No, you investigate the quality of your data and the integrity of the > rest of the system, as something *other* than a drive problem caused the > mismatch. (Swap is a known exception, though.) Yeah, I'm going to "rely" on the fact that this machine has heaps of memory and won't be swapping much when it does a RAID scrub. :) But "you investigate the quality of your data"... so now, on a single mismatch that won't go away, I have to compare all my data with backups, taking countless hours and emitting heaps of spurious errors because no backup is ever quite up to date? Those backups *live* on hard drives, so it has exactly the same chance of spurious disk-layer errors as the thing that preceded it (quite possibly higher). Honestly, scrubs are looking less and less desirable the more I talk about them. Massive worry inducers that don't actually spot problems in any meaningful sense (not even at the level of "there is a problem on this disk", just "there is a problem on this array"). >> That's not very good. >> (AIUI, if a check scrub finds a URE, it'll rewrite it, and when in the >> common case the drive spares it out and the write succeeds, this will >> not be reported as a mismatch: is this right?) > > This is also wrong, because you are assuming sparing-out is the common > case. A read error does not automatically trigger relocation. It > triggers *verification* of the next *write*. 
In young drives, So I guess we only need to worry about mismatches if they don't go away and are persistently in the same place on the same drive. (Only you can't tell what place that is, or what drive that is, because md doesn't tell you. I'm really tempted to fix *that* at least, a printk() or something.) > { Drive self tests might do some pre-emptive rewriting of marginal > sectors -- it's not something drive manufacturers are documenting. But > a drive self-test cannot fix an unreadable sector -- it doesn't know > what to write there. } Agreed. >>> This is actually counterproductive. Rewriting everything may refresh >>> the magnetism on weakening sectors, but will also prevent the drive from >>> *finding* weakening sectors that really do need relocation. >> >> If a sector weakens purely because of neighbouring writes or temperature >> or a vibrating housing or something (i.e. not because of actual damage), >> so that a rewrite will strengthen it and relocation was never necessary, >> surely you've just saved a pointless bit of sector sparing? (I don't >> know: I'm not sure what the relative frequency of these things is. Read >> and write errors in general are so rare that it's quite possible I'm >> worrying about nothing at all. I do know I forgot to scrub my old >> hardware RAID array for about three years and nothing bad happened...) > > Drives that are in applications that get *read* pretty often don't need > much if any scrubbing -- the application itself will expose problem > sectors. Hobbyists and home media servers can go months with specific > files unread, so developing problems can hit in clusters. Regular > scrubbing will catch these problems before they take your array down. Yeah, and I have plenty of archival data on this array -- it's the first one I've ever had that's big enough to consider using for that as well as for frequently-used stuff whose integrity I care about. 
(But even the frequently-read stuff is bcached, so even that is in effect archival much of the time, as far as reads of the array are concerned.) > And you can't compare hardware array behavior to MD -- they have their > own algorithms to take care of attached disks without OS intervention. I don't see what the difference is between a hardware array controller with its own noddy OS, barely-maintained software, creaking processor, and not very big battery-backed RAM and md with a decent OS, much faster processor, decent software, and often masses of RAM and a journal on SSD, except that the md array will be far faster and if anything goes wrong you have a much higher chance of actually getting your data back with md. :) The days of saying "hardware arrays are just different/better, md cannot compete with them" are many years in the past. People are *replacing* hardware arrays with md these days because the hardware arrays are *worse* on almost every metric. If hardware arrays have magic recovery algorithms that md and/or the Linux block layer don't, the question now is "why not?", not "oh, we cannot compare". ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 10:28 ` Nix @ 2017-05-09 10:50 ` Reindl Harald 2017-05-09 11:15 ` Nix 0 siblings, 1 reply; 69+ messages in thread From: Reindl Harald @ 2017-05-09 10:50 UTC (permalink / raw) To: Nix, Phil Turmel; +Cc: Wols Lists, Ravi (Tom) Hale, linux-raid On 09.05.2017 at 12:28, Nix wrote: > Honestly, scrubs are looking less and less desirable the more I talk > about them. Massive worry inducers that don't actually spot problems in > any meaningful sense (not even at the level of "there is a problem on > this disk", just "there is a problem on this array") That is your opinion. My experience over years of using md arrays is that *every time* smartd triggered an alert mail saying a drive would fail soon, it happened while the scrub was running, so you can replace drives as soon as possible. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 10:50 ` Reindl Harald @ 2017-05-09 11:15 ` Nix 2017-05-09 11:48 ` Reindl Harald 0 siblings, 1 reply; 69+ messages in thread From: Nix @ 2017-05-09 11:15 UTC (permalink / raw) To: Reindl Harald; +Cc: Phil Turmel, Wols Lists, Ravi (Tom) Hale, linux-raid On 9 May 2017, Reindl Harald said: > On 09.05.2017 at 12:28, Nix wrote: >> Honestly, scrubs are looking less and less desirable the more I talk >> about them. Massive worry inducers that don't actually spot problems in >> any meaningful sense (not even at the level of "there is a problem on >> this disk", just "there is a problem on this array") > > That is your opinion. > > My experience over years of using md arrays is that *every time* smartd triggered an alert mail saying a drive would fail soon, it happened > while the scrub was running, so you can replace drives as soon as possible. What, it triggered a SMART warning while a scrub was running which SMART long self-tests didn't? That's depressing. You'd think SMART would be watching for errors while its own tests were running! (Or were you not running any long self-tests? That's at least as risky as not scrubbing, IMNSHO.) ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 11:15 ` Nix @ 2017-05-09 11:48 ` Reindl Harald 2017-05-09 16:11 ` Nix 0 siblings, 1 reply; 69+ messages in thread From: Reindl Harald @ 2017-05-09 11:48 UTC (permalink / raw) To: Nix; +Cc: Phil Turmel, Wols Lists, Ravi (Tom) Hale, linux-raid On 09.05.2017 at 13:15, Nix wrote: > On 9 May 2017, Reindl Harald said: > >> On 09.05.2017 at 12:28, Nix wrote: >>> Honestly, scrubs are looking less and less desirable the more I talk >>> about them. Massive worry inducers that don't actually spot problems in >>> any meaningful sense (not even at the level of "there is a problem on >>> this disk", just "there is a problem on this array") >> >> That is your opinion. >> >> My experience over years of using md arrays is that *every time* smartd triggered an alert mail saying a drive would fail soon, it happened >> while the scrub was running, so you can replace drives as soon as possible. > > What, it triggered a SMART warning while a scrub was running which SMART > long self-tests didn't? That's depressing. You'd think SMART would be > watching for errors while its own tests were running! Different kind of test, different access patterns, I guess. Smarter people than both of us had a reason to develop scrub instead of saying "just let the drive do it on its own". > (Or were you not running any long self-tests? That's at least as risky > as not scrubbing, IMNSHO.) No, I do both regularly: * smart short self-test daily * smart long self-test weekly * raid scrub weekly And no, doing a long SMART self-test daily is not a good solution: the RAID10 array in my office makes *terrible noises* when the SMART test is running. After doing this every week for the last 6 years (Power_On_Hours 14786, Start_Stop_Count 1597) I would say the noises are normal, but it's probably not good to do such operations all the time. Well, that machine has not lost a single drive; a clone of it acting as a home server 365/24/7 has lost a dozen in the same time....
^ permalink raw reply [flat|nested] 69+ messages in thread
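[Editorial note: a schedule like the one Reindl describes can be expressed with smartd's regex test-schedule syntax plus a cron entry for the scrub; the device name, array name, and times below are placeholders, not from the original mails.]

```
# /etc/smartd.conf -- short self-test daily at 02:00,
# long self-test every Saturday at 03:00 (smartd -s T/MM/DD/d/HH syntax)
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03)

# root crontab -- weekly md scrub, Sunday 04:00
0 4 * * 0  echo check > /sys/block/md0/md/sync_action
```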
* Re: Fault tolerance with badblocks 2017-05-09 11:48 ` Reindl Harald @ 2017-05-09 16:11 ` Nix 2017-05-09 16:46 ` Reindl Harald 0 siblings, 1 reply; 69+ messages in thread From: Nix @ 2017-05-09 16:11 UTC (permalink / raw) To: Reindl Harald; +Cc: Phil Turmel, Wols Lists, Ravi (Tom) Hale, linux-raid On 9 May 2017, Reindl Harald verbalised: > Am 09.05.2017 um 13:15 schrieb Nix: >> (Or were you not running any long self-tests? That's at least as risky >> as not scrubbing, IMNSHO.) > > no i do both regulary > > * smart short self-test daily > * smart long self-test weekly > * raid scrub weekly > > and no - doing a long-smart-test daily is not a good solution, the > RAID10 array in my office makes *terrible noises* when the SMART Agreed, though in my case not because of noise, but just because the test takes fourteen hours and noticeably degrades disk performance while it runs. I'm doing a long self-test monthly and frankly I'm wondering if every three months is sufficient. > well, that machine has not lost a single drive, a clone of it acting > as homeserver 365/24/7 has lost a dozen in the same time.... A *dozen*?! In six years? Even with a big array you've been incredibly unlucky, or you have young children and a corresponding disaster rate. (Meanwhile, my last machine, with much-maligned WD GreenPower variable-spin-rate disks, was completely happy for eight years, zero failures, zero reallocations that I can see. I can only hope my new lot are that good.) ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 16:11 ` Nix @ 2017-05-09 16:46 ` Reindl Harald 0 siblings, 0 replies; 69+ messages in thread From: Reindl Harald @ 2017-05-09 16:46 UTC (permalink / raw) To: Nix; +Cc: Phil Turmel, Wols Lists, Ravi (Tom) Hale, linux-raid On 09.05.2017 at 18:11, Nix wrote: > On 9 May 2017, Reindl Harald verbalised: >> and no - doing a long-smart-test daily is not a good solution, the >> RAID10 array in my office makes *terrible noises* when the SMART > > Agreed, though in my case not because of noise, but just because the > test takes fourteen hours and noticeably degrades disk performance while > it runs. I'm doing a long self-test monthly and frankly I'm wondering if > every three months is sufficient. > >> well, that machine has not lost a single drive, a clone of it acting >> as homeserver 365/24/7 has lost a dozen in the same time.... > > A *dozen*?! In six years? Even with a big array you've been incredibly > unlucky, or you have young children and a corresponding disaster rate. > (Meanwhile, my last machine, with much-maligned WD GreenPower > variable-spin-rate disks, was completely happy for eight years, zero > failures, zero reallocations that I can see. I can only hope my new lot > are that good.) RAID10, 4x2 TB, a room temperature of 28 degrees 12 months a year, and only 214 TB written - I doubt I'm merely unlucky with that workload. Filesystem created: Wed Jun 8 13:10:56 2011 Lifetime writes: 214 TB ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-08 14:50 ` Nix 2017-05-08 18:00 ` Anthony Youngman 2017-05-08 19:02 ` Phil Turmel @ 2017-05-09 7:37 ` David Brown 2017-05-09 9:58 ` Nix 2 siblings, 1 reply; 69+ messages in thread From: David Brown @ 2017-05-09 7:37 UTC (permalink / raw) To: Nix, Wols Lists; +Cc: Ravi (Tom) Hale, linux-raid On 08/05/17 16:50, Nix wrote: > > I wonder... scrubbing is not very useful with md, particularly with RAID > 6, because it does no writes unless something mismatches, and on failure > there is no attempt to determine which of the N disks is bad and rewrite > its contents from the other devices (nor, as I understand it, does it > clearly say which drive gave the error, so even failing it out and > resyncing it is hard). > Please read Neil Brown's article on this: "Smart or simple RAID recovery?" <http://neil.brown.name/blog/20100211050355> > If there was a way to get md to *rewrite* everything during scrub, > rather than just checking, this might help (in addition to letting the > drive refresh the magnetization of absolutely everything). "repair" mode > appears to do no writes until an error is found, whereupon (on RAID 6) > it proceeds to make a "repair" that is more likely than not to overwrite > good data with bad. Optionally writing what's already there on non-error > seems like it might be a worthwhile (and fairly simple) change. > Scrubbing /does/ rewrite disk blocks - when necessary. It does not do it explicitly, but the disks handle this themselves. To the processor, a disk block is 4K of data. But to the disk and its controllers, it is 4K plus a sizeable amount of error checking and correcting bits. Some are spread out within the block, some are collected together at the end of the block. The ECC system can handle a large number of failed bits, either in lumps caused by a physical defect on the disk surface, or spread out due to the slow decay of the magnetic orientation, or hits by cosmic rays. 
When the disk is asked to read a block, it pulls up the data and the ECC bits, and uses this to check and re-construct the 4K of data, and a measure of how many errors were corrected. On modern high-capacity drives, it is normal that some errors are corrected on a read. But if more than a certain level occur, then the firmware will trigger a re-write automatically to the same sector. This will then be re-read. If the error rate is low, fine. If it is high, then the sector will be remapped by the disk. So simply /reading/ the data, as far as the processor is concerned, will cause re-writes as and when needed. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks 2017-05-09 7:37 ` David Brown @ 2017-05-09 9:58 ` Nix 2017-05-09 10:28 ` Brad Campbell 0 siblings, 1 reply; 69+ messages in thread From: Nix @ 2017-05-09 9:58 UTC (permalink / raw) To: David Brown; +Cc: Wols Lists, Ravi (Tom) Hale, linux-raid On 9 May 2017, David Brown spake thusly: > On 08/05/17 16:50, Nix wrote: > >> I wonder... scrubbing is not very useful with md, particularly with RAID >> 6, because it does no writes unless something mismatches, and on failure >> there is no attempt to determine which of the N disks is bad and rewrite >> its contents from the other devices (nor, as I understand it, does it >> clearly say which drive gave the error, so even failing it out and >> resyncing it is hard). > > Please read Neil Brown's article on this: "Smart or simple RAID > recovery?" <http://neil.brown.name/blog/20100211050355> I have. The simple recovery is too simple. So you have a 40TiB RAID-6 array, say, and mismatch_cnt is consistently >0, but a low value, on scrub. What can you do? The drive is probably not faulty or you'd have many more mismatches from persistent misdirected reads or writes. md doesn't repair the corruption, even though on RAID-6 it could. It doesn't tell you which disk disagreed so you can fail it out. It doesn't even tell you where the disagreement was so you can try to rebuild it by hand. What on earth are you supposed to do in this case? Wipe the entire array and restore from backup? For a *single* sector? Right now I'm doing scrubs and ignoring the mismatch_cnt, because all it can do is increase my worry level to no gain at all. I could just as well do a dd over /dev/md*. It would have the same effect, only without md's progress feedback and bandwidth throttling. (And even with md, you get progress feedback but you don't get told where errors are found?!) 
> When the disk is asked to read a block, it pulls up the data and the ECC
> bits, and uses this to check and re-construct the 4K of data, and a
> measure of how many errors were corrected.  On modern high-capacity
> drives, it is normal that some errors are corrected on a read.  But if
> more than a certain level occur, then the firmware will trigger a
> re-write automatically to the same sector.  This will then be re-read.
> If the error rate is low, fine.  If it is high, then the sector will be
> remapped by the disk.
>
> So simply /reading/ the data, as far as the processor is concerned, will
> cause re-writes as and when needed.

Last time I asked a disk manufacturer about this, they said oh no we
never correct on read, we can't: if we needed to correct on read, the
data would already be unreadable: you have to trigger a write to get
sparing. Nice to see the drive firmware has improved in the last few
years... but one wonders how many disks actually *do* this. It's hard to
tell because sector sparing is so quiet: it's not always even reflected
in the SMART data, AIUI.

^ permalink raw reply	[flat|nested] 69+ messages in thread
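The scrub-and-mismatch_cnt dance discussed above is driven through the
md sysfs interface. A minimal sketch, assuming an array named md0 and
root privileges (if no such array is present, the script just reports
that instead of failing):

```shell
# Kick off a read-only scrub ('check') on md0 and read back the mismatch
# count. 'repair' would instead rewrite inconsistent stripes, but -- as
# complained about above -- neither action reports which member disagreed
# or where. The array name md0 is an assumption.
MD=/sys/block/md0/md
if [ -w "$MD/sync_action" ]; then
    echo check > "$MD/sync_action"
    # Wait for the scrub to finish before trusting the counter.
    while grep -q check "$MD/sync_action"; do sleep 10; done
    mismatches=$(cat "$MD/mismatch_cnt")
else
    mismatches=unknown    # no writable md0 array on this machine
fi
echo "mismatch_cnt: $mismatches"
```

A non-zero count after a check means some stripes were internally
inconsistent; the interface gives you the total and nothing more.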
* Re: Fault tolerance with badblocks
  2017-05-09  9:58 ` Nix
@ 2017-05-09 10:28 ` Brad Campbell
  2017-05-09 10:40 ` Nix
  0 siblings, 1 reply; 69+ messages in thread
From: Brad Campbell @ 2017-05-09 10:28 UTC (permalink / raw)
To: Nix, David Brown; +Cc: Wols Lists, Ravi (Tom) Hale, linux-raid

On 09/05/17 17:58, Nix wrote:

> md doesn't repair the corruption, even though on RAID-6 it could.

Patches are *always* welcome.

> but one wonders how many disks actually *do* this. It's hard to
> tell because sector sparing is so quiet: it's not always even reflected
> in the SMART data, AIUI.

Decent SAS drives do it routinely *and* they tell you in the SMART data
how long it has been since the last scrub, how long it is until the next
scrub and how many errors it has silently corrected over the drive life.
You get what you pay for.

Brad

^ permalink raw reply	[flat|nested] 69+ messages in thread
* Re: Fault tolerance with badblocks
  2017-05-09 10:28 ` Brad Campbell
@ 2017-05-09 10:40 ` Nix
  2017-05-09 12:15 ` Tim Small
  0 siblings, 1 reply; 69+ messages in thread
From: Nix @ 2017-05-09 10:40 UTC (permalink / raw)
To: Brad Campbell; +Cc: David Brown, Wols Lists, Ravi (Tom) Hale, linux-raid

On 9 May 2017, Brad Campbell stated:

> On 09/05/17 17:58, Nix wrote:
>> md doesn't repair the corruption, even though on RAID-6 it could.
>
> Patches are *always* welcome.

Oh good. I might well look at that.

>> but one wonders how many disks actually *do* this. It's hard to
>> tell because sector sparing is so quiet: it's not always even reflected
>> in the SMART data, AIUI.
>
> Decent SAS drives do it routinely *and* they tell you in the SMART
> data how long it has been since the last scrub, how long it is until
> the next scrub and how many errors it has silently corrected over the
> drive life. You get what you pay for.

Enterprise SATA drives appear similar except that they don't do the
scrubbing automatically: you have to trigger a SMART self-test. (I'm
wondering if that's enough, and perhaps I can ignore RAID scrubbing
entirely, except that if something *does* go wrong I won't know.)

Of course I haven't yet owned a drive that has ever deigned to give a
nonzero sector-sparing value in any of its SMART info, and I've been
using allegedly-enterprise drives (first SCSI, then SATA) for about
fifteen years now. I've had disk failures without warning, and
non-failed disks with both read and write errors that would not go away,
but that SMART reallocation value just stayed stuck at zero through all
of it. I'm wondering if smartctl is even reading the right field, but
it's hard to imagine how it couldn't be...

^ permalink raw reply	[flat|nested] 69+ messages in thread
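Triggering the drive's own surface scan by hand, as described above,
looks like this with smartmontools. A sketch, assuming root and a drive
at /dev/sda (the device name is an assumption; the script skips itself
if smartctl or the drive is absent):

```shell
# Start a long (full-surface) SMART self-test -- it runs inside the drive
# firmware in the background -- and read back the raw
# Reallocated_Sector_Ct attribute that the thread is arguing about.
DEV=/dev/sda
if command -v smartctl >/dev/null 2>&1 && [ -b "$DEV" ]; then
    smartctl -t long "$DEV"
    realloc=$(smartctl -A "$DEV" |
              awk '$2 == "Reallocated_Sector_Ct" { print $NF }')
else
    realloc=unavailable    # no smartctl, or no such drive here
fi
echo "Reallocated_Sector_Ct raw: $realloc"
```

`smartctl -l selftest /dev/sda` later shows whether the test completed
and at what LBA it failed, if any.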
* Re: Fault tolerance with badblocks
  2017-05-09 10:40 ` Nix
@ 2017-05-09 12:15 ` Tim Small
  2017-05-09 15:30 ` Nix
  0 siblings, 1 reply; 69+ messages in thread
From: Tim Small @ 2017-05-09 12:15 UTC (permalink / raw)
To: Nix; +Cc: linux-raid

On 09/05/17 11:40, Nix wrote:
> I've had disk failures without warning, and
> non-failed disks with both read and write errors that would not go away,
> but that SMART reallocation value just stayed stuck at zero through all
> of it.

Really? I see them pretty frequently... Let's see:

server1, RAID6 (4 disks), reallocated_sector_ct: 0 9 1 0
server2, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0
server3, RAID6 (5 disks), reallocated_sector_ct: 34 754 15 115 1
server4, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0
server5, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0

Disk 2 in server3 (which has drives which are a bit long in the tooth)
is scheduled to be replaced next time I visit that site.

Are you looking at the 'raw' column in the smartctl output?

Tim

^ permalink raw reply	[flat|nested] 69+ messages in thread
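Per-server lists like the ones above can be harvested with a short loop.
A sketch assuming root, smartmontools, and drives named /dev/sda through
/dev/sdd (an assumption; any that don't exist are skipped):

```shell
# Print the raw Reallocated_Sector_Ct for each drive present.
found=0
for d in /dev/sd[a-d]; do
    [ -b "$d" ] || continue
    found=1
    raw=$(smartctl -A "$d" |
          awk '$2 == "Reallocated_Sector_Ct" { print $NF }')
    printf '%s: %s\n' "$d" "$raw"
done
[ "$found" -eq 1 ] || echo "no /dev/sd[a-d] drives on this machine"
```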
* Re: Fault tolerance with badblocks
  2017-05-09 12:15 ` Tim Small
@ 2017-05-09 15:30 ` Nix
  0 siblings, 0 replies; 69+ messages in thread
From: Nix @ 2017-05-09 15:30 UTC (permalink / raw)
To: Tim Small; +Cc: linux-raid

On 9 May 2017, Tim Small spake thusly:
> On 09/05/17 11:40, Nix wrote:
>> I've had disk failures without warning, and
>> non-failed disks with both read and write errors that would not go away,
>> but that SMART reallocation value just stayed stuck at zero through all
>> of it.
>
> Really? I see them pretty frequently... Let's see
>
> server1, RAID6 (4 disks), reallocated_sector_ct: 0 9 1 0
> server2, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0
> server3, RAID6 (5 disks), reallocated_sector_ct: 34 754 15 115 1
> server4, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0
> server5, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0
>
> Disk 2 in server3 (which has drives which are a bit long in the tooth)
> is scheduled to be replaced next time I visit that site.
>
> Are you looking at the 'raw' column in the smartctl output?

No, but since they all read all zero:

  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0

this is pretty redundant.
I do see, on all my disks (regardless of hardware versus software RAID
or indeed age, and some of these disks are seven years old):

196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

One figure is much higher:

195 Hardware_ECC_Recovered  -O-RC-   100   064   000    -    2067212
195 Hardware_ECC_Recovered  -O-RC-   100   064   000    -    2088928
195 Hardware_ECC_Recovered  -O-RC-   082   064   000    -    156528817
195 Hardware_ECC_Recovered  -O-RC-   082   065   000    -    156513792

but this is on a bunch of three-month-old Seagate enterprise disks, and
as with the seek error rate Seagate use a deeply bizarre encoding for
this value, and none of the SeaChest programs seem to be able to decode
it. It appears that the lower the decoded value, the worse things are --
I have no idea why two of my drives are doing so much worse than two
others on this score. I guess I should keep an eye on them. In any case,
it's going up fast on those two even when the drives are totally idle
and even when I forcibly spin them down... I don't trust this figure to
tell me anything useful at all. SMART, borderline useless as ever.

Aside: in hex these are

  001f8b0c 001fdfe0 095470b1 09543600

which rather suggests that the drives have two distinct encodings to me,
with two drives using one encoding and the other two another one,
probably split at the four-hex-digit mark -- but the drives have
identical firmware and the same model number...

^ permalink raw reply	[flat|nested] 69+ messages in thread
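The four-hex-digit split guessed at above is easy to check with shell
arithmetic. This just splits each 32-bit raw value at the 16-bit
boundary; whether Seagate actually packs attribute 195 this way is the
poster's conjecture, not documented fact:

```shell
# Split each raw Hardware_ECC_Recovered value into its high and low
# 16 bits, to see whether the drives pair up.
for v in 001f8b0c 001fdfe0 095470b1 09543600; do
    hi=$(( 0x$v >> 16 ))
    lo=$(( 0x$v & 0xffff ))
    printf '%s -> high=%d low=%d\n' "$v" "$hi" "$lo"
done
```

The first two values share high=31 and the last two share high=2388,
which is at least consistent with the high half and low half being two
separate counters rather than two drives using different encodings.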
* Re: Fault tolerance with badblocks
  2017-05-05  4:03 ` Fault tolerance " Ravi (Tom) Hale
  2017-05-05 19:20 ` Anthony Youngman
@ 2017-05-05 20:23 ` Peter Grandi
  2017-05-05 22:14 ` Nix
  1 sibling, 1 reply; 69+ messages in thread
From: Peter Grandi @ 2017-05-05 20:23 UTC (permalink / raw)
To: Linux RAID

>> No. With modern hard drives, no filesystem should pay any
>> attention to badblocks - it's all handled in the drive firmware.

> ext4 supports this,

JFS also supports bad-block avoidance, but only at 'mkfs' time, and JFS
does this for legacy reasons: Linux JFS supports it because it is a port
of JFS/2 from OS/2, which was a port of JFS version 1 from AIX in 1990.

> and is a relatively modern filesystem released in December
> 2008.

It is just a retread of 'ext3', which itself was a recycling of 'ext2',
which was in turn a clone of the 4BSD FFS, and we are talking of design
decisions taken in 1982-3, not 2008.

> While it could be argued that this is for legacy support,

It is for legacy support. Once upon a time a drive's controller was the
main CPU itself, and the kernel had to manage bad block sparing (as well
as rotational layout and track buffering). That was up to around 20-30
years ago :-).

> This feature still adds value (see below).

It adds value if one underestimates typical disk drive failure modes. It
is quite irritating even for me that a drive with way less than 1% bad
blocks becomes effectively unusable, but long experience tells me that
once a drive starts to grow defects to the point that manufacturer spare
sectors run out there is usually a reason for it, and sooner rather than
later it will be almost completely unusable.

[ ... ]

> The use case is simple: What if I want to have more goodblocks to
> correct for badblocks than Seagate thinks I should have?
The answer is also simple: if you think you know better than Seagate, or
if you think that Seagate deliberately allocates too few spare sectors,
you ask Seagate for custom firmware that allocates more of the disk's
capacity for spares. I suspect that with an order of at least 100,000
drives they will be happy to help. :-)

> Eg, a charity or poor student wanting to get the most out of their
> old hardware.

If it is your itch, and you think you know better than the rest of the
industry, scratch your itch: send patches :-). Other people know that
keeping decaying drives in use is usually fairly pointless. Legend has
it that USSR computer engineers perfected that art, but they worked in
special circumstances.

For a similar example look at the BadRAM and similar modules:

  https://help.ubuntu.com/community/BadRAM
  http://rick.vanrein.org/linux/badram/

They haven't become that popular... :-)

^ permalink raw reply	[flat|nested] 69+ messages in thread
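For reference, the filesystem-level bad-block path being debated above
is driven like this on ext4. A sketch against the placeholder device
name /dev/sdX, so it deliberately refuses to run unless you substitute
a real (and expendable) disk; the mkfs step destroys whatever is on it:

```shell
# Scan a device for unreadable sectors, log their block numbers, and
# hand the list to mkfs so the new filesystem never allocates them.
# Block sizes must match between the scan and the filesystem.
DEV=/dev/sdX    # placeholder: substitute an expendable device
if [ -b "$DEV" ]; then
    badblocks -b 4096 -sv -o /tmp/bad-list "$DEV"   # read-only scan
    mkfs.ext4 -b 4096 -l /tmp/bad-list "$DEV"       # avoid listed blocks
else
    echo "refusing to run: $DEV is a placeholder"
fi
```

On an existing ext4 filesystem, `e2fsck -c` performs the same scan and
updates the bad-block inode in place instead.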
* Re: Fault tolerance with badblocks
  2017-05-05 20:23 ` Peter Grandi
@ 2017-05-05 22:14 ` Nix
  0 siblings, 0 replies; 69+ messages in thread
From: Nix @ 2017-05-05 22:14 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux RAID

On 5 May 2017, Peter Grandi stated:

>> This feature still adds value (see below).
>
> It adds value if one underestimates typical disk drive failure
> modes. It is quite irritating even for me that a drive with way
> less than 1% bad blocks becomes effectively unusable, but long
> experience tells me that once a drive starts to grow defects to the
> point that manufacturer spare sectors run out there is usually a
> reason for it and sooner than later it will be almost completely
> unusable.

Quite. In my experience, if there are that many bad blocks on rotational
storage, it generally means either that a head has died or that the disk
surface is damaged. If the disk surface is damaged to that degree, there
will be crap flying around inside the drive at very high speed, abrading
the drive surface further with every passing minute. Such a drive is
walking dead. Get any surviving data off now and throw it away with
extreme prejudice, possibly pulling it apart first to gawp at the
horribleness that is all that remains of your disk surfaces.

As for the dead-head case, the question is whether whatever killed the
head produced debris. If it did, you're back at the previous problem,
and if it's electronic failure, frankly the whole drive is untrustworthy
IMHO. (There *are* other possibilities: catastrophically buggy drive
firmware, for instance -- but in such cases the drive is *also* walking
dead.)

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 69+ messages in thread
end of thread, other threads:[~2017-06-02 12:28 UTC | newest] Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-05-04 10:04 Fault tolerance in RAID0 with badblocks Ravi (Tom) Hale 2017-05-04 13:44 ` Wols Lists 2017-05-05 4:03 ` Fault tolerance " Ravi (Tom) Hale 2017-05-05 19:20 ` Anthony Youngman 2017-05-06 11:21 ` Ravi (Tom) Hale 2017-05-06 13:00 ` Wols Lists 2017-05-08 14:50 ` Nix 2017-05-08 18:00 ` Anthony Youngman 2017-05-09 10:11 ` David Brown 2017-05-09 10:18 ` Nix 2017-05-08 19:02 ` Phil Turmel 2017-05-08 19:52 ` Nix 2017-05-08 20:27 ` Anthony Youngman 2017-05-09 9:53 ` Nix 2017-05-09 11:09 ` David Brown 2017-05-09 11:27 ` Nix 2017-05-09 11:58 ` David Brown 2017-05-09 17:25 ` Chris Murphy 2017-05-09 19:44 ` Wols Lists 2017-05-10 3:53 ` Chris Murphy 2017-05-10 4:49 ` Wols Lists 2017-05-10 17:18 ` Chris Murphy 2017-05-16 3:20 ` NeilBrown 2017-05-10 5:00 ` Dave Stevens 2017-05-10 16:44 ` Edward Kuns 2017-05-10 18:09 ` Chris Murphy 2017-05-09 20:18 ` Nix 2017-05-09 20:52 ` Wols Lists 2017-05-10 8:41 ` David Brown 2017-05-09 21:06 ` A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) Nix 2017-05-12 11:14 ` Nix 2017-05-16 3:27 ` NeilBrown 2017-05-16 9:13 ` Nix 2017-05-16 21:11 ` NeilBrown 2017-05-16 21:46 ` Nix 2017-05-18 0:07 ` Shaohua Li 2017-05-19 4:53 ` NeilBrown 2017-05-19 10:31 ` Nix 2017-05-19 16:48 ` Shaohua Li 2017-06-02 12:28 ` Nix 2017-05-19 4:49 ` NeilBrown 2017-05-19 10:32 ` Nix 2017-05-19 16:55 ` Shaohua Li 2017-05-21 22:00 ` NeilBrown 2017-05-09 19:16 ` Fault tolerance with badblocks Phil Turmel 2017-05-09 20:01 ` Nix 2017-05-09 20:57 ` Wols Lists 2017-05-09 21:22 ` Nix 2017-05-09 21:23 ` Phil Turmel 2017-05-09 21:32 ` NeilBrown 2017-05-10 19:03 ` Nix 2017-05-09 16:05 ` Chris Murphy 2017-05-09 17:49 ` Wols Lists 2017-05-10 3:06 ` Chris Murphy 2017-05-08 20:56 ` Phil Turmel 2017-05-09 10:28 ` Nix 2017-05-09 10:50 ` Reindl Harald 2017-05-09 11:15 ` Nix 
2017-05-09 11:48 ` Reindl Harald 2017-05-09 16:11 ` Nix 2017-05-09 16:46 ` Reindl Harald 2017-05-09 7:37 ` David Brown 2017-05-09 9:58 ` Nix 2017-05-09 10:28 ` Brad Campbell 2017-05-09 10:40 ` Nix 2017-05-09 12:15 ` Tim Small 2017-05-09 15:30 ` Nix 2017-05-05 20:23 ` Peter Grandi 2017-05-05 22:14 ` Nix