* Feature request: Remove the badblocks list
@ 2020-08-18 18:00 Roy Sigurd Karlsbakk
2020-08-18 19:26 ` Wols Lists
2020-08-18 21:03 ` Håkon Struijk Holmen
0 siblings, 2 replies; 14+ messages in thread
From: Roy Sigurd Karlsbakk @ 2020-08-18 18:00 UTC (permalink / raw)
To: Linux RAID Mailing List; +Cc: Håkon
Hi all
It seems the badblocks list was added around 10 years ago[1]. The reason was to keep track of unreadable sectors, which may have made sense 20 years earlier, but not in 2010. The first IDE drives came out at the end of the 1980s and were named for their 'Integrated Drive Electronics', which was a new thing at the time. As opposed to earlier MFM drives and such, these were "smart" and could handle errors a bit better, even reallocating bad sectors elsewhere when needed. A lot happened between 1987 and 2010, but for some reason this feature slipped through anyway, perhaps because Linus was drunk, I don't know. As far as I can understand, the feature works a bit like this:
- If a bad (that is, unreadable) block is found, it is flagged as bad, not to be used ever again, in the md member's superblock
- If a new disk is added to the array, the block number of the initial bad block is flagged on the new drive, since the whole stripe is rendered useless (erm, didn't we have redundancy here?)
- If replacing the original drive with a new drive, md happily replaces all the data to the new drive and updates the superblock with the same badblock list.
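The steps in the list above can be played through with a toy model. This is only a sketch of the behaviour as described in this mail for a two-disk mirror, not md's actual code; the function name and the use of plain sets of sector numbers are made up for illustration.

```python
# Toy model of the behaviour described in the list above, not md's actual code.
# Each mirror member is represented only by its set of flagged sector numbers.

def replace_member(members, name):
    """Swap in a fresh disk for `name` and rebuild it from the other members.

    A sector is inherited as "bad" when every remaining member has it
    flagged, because then there is no source left to copy it from.
    """
    sources = [flags for other, flags in members.items() if other != name]
    inherited = set.intersection(*sources) if sources else set()
    members[name] = inherited
    return inherited


# Two-way mirror: disk A once failed to read sector 1000, disk B is clean.
mirror = {"A": {1000}, "B": set()}
replace_member(mirror, "B")   # new B cannot get sector 1000 from A
replace_member(mirror, "A")   # new A cannot get it from the new B either
print(mirror)
```

In this model both physical disks have been swapped for new ones, yet sector 1000 is still flagged on both members.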
So no attempt is ever made to check or repair that sector. Disks reallocate sectors if they are bad, and it's not necessarily a big issue unless there are a lot of such errors. We just say 'this sector or block said *ouch* and is thus dead, and so will its siblings be for ever and ever'. There's a nice article about it here[2].
In practice, if you have data in stripes with badblocks, they may be lost forever and for no reason at all, since drives tend to fix their problems if you issue a write to that sector. ZFS does this nicely - when it finds a bad read, it reconstructs from parity or mirror and writes it again. If it encounters a write error, it tries over. Eventually the drive may fail, but hell, that's why we have redundancy.
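The read-then-rewrite approach described above can be sketched roughly like this. It is a simplified model, not ZFS or md code: the Disk class and read_with_repair() are hypothetical, and the model simply assumes that a successful write makes the sector readable again (the drive remapping it), which a real drive does not guarantee.

```python
class Disk:
    """Hypothetical in-memory stand-in for one mirror member."""

    def __init__(self, data, unreadable=()):
        self.data = dict(data)
        self.unreadable = set(unreadable)

    def read(self, sector):
        if sector in self.unreadable:
            raise OSError(f"read error at sector {sector}")
        return self.data[sector]

    def write(self, sector, value):
        # Assumption of this model: a write lets the drive remap the
        # sector, so it becomes readable again afterwards.
        self.data[sector] = value
        self.unreadable.discard(sector)


def read_with_repair(mirror, sector):
    """Read from the first good copy and rewrite the copies that failed."""
    failed = []
    for disk in mirror:
        try:
            value = disk.read(sector)
        except OSError:
            failed.append(disk)
            continue
        for bad in failed:
            bad.write(sector, value)
        return value
    raise OSError(f"no readable copy of sector {sector}")
```

For example, with a = Disk({7: b"x"}, unreadable={7}) and b = Disk({7: b"x"}), calling read_with_repair([a, b], 7) returns the data from b and leaves sector 7 readable on a again.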
As far as I can see, the only way to remove the badblocks list is "mdadm ... --assemble --update=no-bbl", from [2], and to have md return garbage for those lost sectors, which is fine, since fsck/xfs_repair should fix what's fixable (and the rest won't be readable anyway). An alternative written by a friend of mine (Håkon on cc) is available at [3]; it removes the list from an offlined array.
As far as I can understand, this list doesn't have any reason to exist, except to annoy sysadmins.
So please remove this useless thing or at least don't enable it by default
[1] https://linux-raid.vger.kernel.narkive.com/R1rvkUiQ/using-the-new-bad-block-log-in-md-for-linux-3-1
[2] https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy
[3] https://git.thehawken.org/hawken/md-badblocktool.git
Vennlig hilsen
roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Feature request: Remove the badblocks list
2020-08-18 18:00 Feature request: Remove the badblocks list Roy Sigurd Karlsbakk
@ 2020-08-18 19:26 ` Wols Lists
2020-08-18 19:34 ` Piergiorgio Sartor
2020-08-18 19:43 ` Phil Turmel
2020-08-18 21:03 ` Håkon Struijk Holmen
1 sibling, 2 replies; 14+ messages in thread
From: Wols Lists @ 2020-08-18 19:26 UTC (permalink / raw)
To: Roy Sigurd Karlsbakk, Linux RAID Mailing List; +Cc: Håkon
On 18/08/20 19:00, Roy Sigurd Karlsbakk wrote:
> As far as I can understand, this list doesn't have any reason to exist, except to annoy sysadmins.
Actually, there's at least one good reason for it to exist that I can
think of - it *could* make recovering a broken array much easier. Think
about it, I think it's documented in the wiki.
That said, I'm hoping to do some work soon that will make it redundant.
One little tip though - you've done a load of research to tell us what
we already know - as documented on the wiki - and now you're asking us
to do a load of work. If you want it done, well nobody else has bothered
so far so what makes you think they'll bother now?
Cheers,
Wol
* Re: Feature request: Remove the badblocks list
2020-08-18 19:26 ` Wols Lists
@ 2020-08-18 19:34 ` Piergiorgio Sartor
2020-08-18 19:43 ` Phil Turmel
1 sibling, 0 replies; 14+ messages in thread
From: Piergiorgio Sartor @ 2020-08-18 19:34 UTC (permalink / raw)
To: Wols Lists; +Cc: Roy Sigurd Karlsbakk, Linux RAID Mailing List, Håkon
On Tue, Aug 18, 2020 at 08:26:07PM +0100, Wols Lists wrote:
> On 18/08/20 19:00, Roy Sigurd Karlsbakk wrote:
> > As far as I can understand, this list doesn't have any reason to exist, except to annoy sysadmins.
>
> Actually, there's at least one good reason for it to exist that I can
> think of - it *could* make recovering a broken array much easier. Think
> about it, I think it's documented in the wiki.
>
> That said, I'm hoping to do some work soon that will make it redundant.
>
> One little tip though - you've done a load of research to tell us what
> we already know - as documented on the wiki - and now you're asking us
> to do a load of work. If you want it done, well nobody else has bothered
> so far so what makes you think they'll bother now?
Is it really "a load of work" to switch from
default "on" to default "off"?
Because that's what he is asking.
If this is the case, there is something more
broken in the code...
BTW, I find it quite problematic too to have
a feature, activated by default, which is
_officially_ declared as *buggy*.
bye,
--
piergiorgio
* Re: Feature request: Remove the badblocks list
2020-08-18 19:26 ` Wols Lists
2020-08-18 19:34 ` Piergiorgio Sartor
@ 2020-08-18 19:43 ` Phil Turmel
1 sibling, 0 replies; 14+ messages in thread
From: Phil Turmel @ 2020-08-18 19:43 UTC (permalink / raw)
To: Wols Lists, Roy Sigurd Karlsbakk, Linux RAID Mailing List; +Cc: Håkon
On 8/18/20 3:26 PM, Wols Lists wrote:
> On 18/08/20 19:00, Roy Sigurd Karlsbakk wrote:
>> As far as I can understand, this list doesn't have any reason to exist, except to annoy sysadmins.
>
> Actually, there's at least one good reason for it to exist that I can
> think of - it *could* make recovering a broken array much easier. Think
> about it, I think it's documented in the wiki.
Link please.
* Re: Feature request: Remove the badblocks list
2020-08-18 18:00 Feature request: Remove the badblocks list Roy Sigurd Karlsbakk
2020-08-18 19:26 ` Wols Lists
@ 2020-08-18 21:03 ` Håkon Struijk Holmen
2020-08-22 1:42 ` David C. Rankin
1 sibling, 1 reply; 14+ messages in thread
From: Håkon Struijk Holmen @ 2020-08-18 21:03 UTC (permalink / raw)
To: Linux RAID Mailing List
On 8/18/20 8:00 PM, Roy Sigurd Karlsbakk wrote:
> Hi all
Hi,
Thanks for the CC, I just managed to get myself subscribed to the list :)
I have gathered some thoughts on the subject as well after reading up on it,
figuring out what the actual header format is, and writing a tool [3] to fix my array...
About the tool, it will try to read all the supposedly bad blocks in a drive,
and erase the whole list if no blocks fail to read. As long as it's not run against
a drive that got marked because data was unavailable during rebuild, this
should make it possible to read the data again. A possible improvement here, would be to reduce
the size of the list down to the actually bad blocks if some still fail, but right now the tool
will refuse to do anything to the drive if md was correct. You also need to flip a variable that
I hid awkwardly between two functions before it will write to the drive at all.
But I have some complaints about the thing..
Good data marked as bad:
My viewpoint is from what happened to my raid array, a 5 drive raid6.
3 of the drives had identical lists of bad blocks while 2 had empty lists.
Therefore, the marked sectors corresponded to lost data. This was solved by
iterating the bad block list, verifying that all the sectors were in fact readable,
and then removing the bad block list. Since I did not have any drive replacements,
I was certain enough that I would not run into uninitialized space. This gave me back
the data that md had decided was gone.
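The "verify that every listed sector is readable" step can be sketched as follows. This is not the code of the tool in [3]; the function name is made up, the 512-byte sector size is an assumption, and how the list entries map to offsets on the member device is deliberately left out, so the ranges are taken as sector offsets into whatever file or device is passed in.

```python
import os

SECTOR = 512  # assumed sector size for this sketch


def unreadable_ranges(path, ranges):
    """Return the (start, length) sector ranges that could not be read in full."""
    failed = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for start, length in ranges:
            try:
                data = os.pread(fd, length * SECTOR, start * SECTOR)
            except OSError:
                # The device reported a read error for this range
                failed.append((start, length))
                continue
            if len(data) != length * SECTOR:
                # Short read, e.g. the range runs past the end of the device
                failed.append((start, length))
    finally:
        os.close(fd)
    return failed
```

Only when this returns an empty list would it be reasonable to consider dropping the list, and even then only if no member was rebuilt while the entries existed, since a readable sector may still hold uninitialized data.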
I do not really think one can say that the md badblock list corresponds
to bad blocks on the device. The lists consist of sectors where
md thinks the data is permanently unavailable. It happens in two ways:
- A read error occurs for any reason
- A new drive is rebuilt, but the array doesn't have the parity to find out what
data was supposed to go there, because badblock entries for other devices prevent
it from finding a source for the data that it's supposed to write there. It's assumed
that such reads would fail.
Since these are added to the same list of bad blocks, it follows that
even if you were to have a successful read from a bad block, it can also be uninitialized
space.
Once enough drives have bad blocks for the same stripe, that data is now gone. md will not read it.
Even if it's there on the drives. I can only speculate on what happened in my case,
so far I think that some intermittent controller failure caused any reads to give errors, and
somehow md was still able to write to the badblock list.
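To make "enough drives" concrete, here is a small sketch that counts on how many members each sector is flagged. It is a simplification: it treats the same sector number on every member as belonging to the same stripe, and the function name and the redundancy parameter (2 for raid6, 1 for raid5 or a two-way mirror) are assumptions of the example, not md terminology.

```python
from collections import Counter


def lost_sectors(member_lists, redundancy):
    """Sectors flagged on more members than the array can reconstruct around."""
    counts = Counter()
    for flagged in member_lists:
        counts.update(set(flagged))
    return sorted(sector for sector, n in counts.items() if n > redundancy)


# 5-drive raid6 as in my case: 3 members share a list, 2 have empty lists.
shared = {100, 200}
members = [shared, shared, shared, set(), set()]
print(lost_sectors(members, redundancy=2))
```

With three identical lists on a raid6, every listed sector is flagged on 3 members, one more than the 2 the array can reconstruct around, so all of them count as lost.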
I think it's not just me, and it seems like it's a common phenomenon that arrays end up with
identical lists across drives. Be this controller failures, or just a bug, it's not good
and undermines the assumption that the underlying blocks will actually be bad.
9 years ago, Lutz Vieweg asked "I've experienced drives
with intermittent read / write failures (due to controller or power stability
problems), and I wonder whether such a situation could quickly fill up the
"bad block list", doing more harm than good in the "intermittent error"-
szenario." [1]. I have my doubts that this was resolved.
I also don't know if this is the cause of the issue with many drives sharing the exact same
list, or if some other logic error type bug is causing it.
md indicating all is good:
In the same URL, Neil Brown said "(...) You shouldn't aim to run an array
with bad blocks any more than you should run an array degraded. The purpose
of bad block management is to provide a more graceful failure path, not to
encourage you to run an array with bad drives". However, an array with
bad blocks is not reported as "degraded", and you have to run
--examine to even see them. The result is that the array is not treated
as bad, because md communicates that it is still good. In the end, the
software encourages running an array with bad drives. The assumption may
have been that one would treat this like a degraded array, but nothing
tells you to: you have to --examine and specifically look for bad blocks.
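Until md reports this by itself, one could poll for non-empty lists. The sketch below assumes that the kernel exposes each member's list as a sysfs file named bad_blocks under /sys/block/mdX/md/dev-YYY/, with one "start length" pair (in sectors) per line; treat both the path pattern and the format as assumptions to check against your kernel's md documentation. The function names are made up.

```python
import glob


def parse_bad_blocks(text):
    """Parse lines of the form 'start length' into (start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        entries.append((int(fields[0]), int(fields[1])))
    return entries


def members_with_bad_blocks(pattern="/sys/block/md*/md/dev-*/bad_blocks"):
    """Map each sysfs file matching `pattern` to its non-empty entry list."""
    found = {}
    for path in glob.glob(pattern):
        try:
            with open(path) as handle:
                entries = parse_bad_blocks(handle.read())
        except OSError:
            continue
        if entries:
            found[path] = entries
    return found
```

A cron job could call members_with_bad_blocks() and send mail whenever the result is not empty, which is roughly the warning I would have expected from md.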
Lack of documentation:
I have added some links in addition to the ones found by Roy. This was the extent
of the documentation that I was able to find. I'll be interested if this is documented
better elsewhere. The kernel documentation also briefly mentioned the existence of bad
block lists in mdraid. The wiki article on the superblock format [5] hasn't been updated
with the badblock fields. The 2010 blog post [4] was the closest thing to documentation,
even if it was written before the thing was finalized.
Overall:
I don't think this uncertainty is good at all. I feel like it would be easier to deal with
a controller failure throwing the whole raid apart. You'd assemble it back together and
check the filesystem, and with fingers crossed, everything will be fine. I think one
finds that it makes sense how md acts without this algorithm enabled. Drives thrown out
of arrays still have data on them. This means that if unrecoverable errors occur, one can
still run ddrescue and try to copy the array to new drives, one by one and get as much data
back as possible.
Once you hit a bad block during read, adopting the zfs model of calculating parity and
overwriting seems better because it tries to just solve the problem so it doesn't happen
the next time around. I think md will throw it in the list and expect it to be fixed during
the next check operation. Unless that doesn't happen and more bad blocks accumulate until
data loss happens..
I would also like to see the functionality changed to opt-in or just removed. If it's kept as
opt-in, I still hope that some of this feedback is taken. For example, reporting the array
as degraded if the lists get populated. Automatically fixing bad blocks as soon as possible,
before the situation develops any further. Making the uninitialized data and the bad blocks
two separate things, so that one can still try reading those blocks and keep track of
where the data is supposed to be, and where it's definitely not.
Maybe dropping badblocks and taking inspiration from ZFS instead.
> [1] https://linux-raid.vger.kernel.narkive.com/R1rvkUiQ/using-the-new-bad-block-log-in-md-for-linux-3-1
> [2] https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy
> [3] https://git.thehawken.org/hawken/md-badblocktool.git
[4]https://neil.brown.name/blog/20100519043730
[5]https://raid.wiki.kernel.org/index.php/RAID_superblock_formats
Regards,
Håkon
* Re: Feature request: Remove the badblocks list
2020-08-18 21:03 ` Håkon Struijk Holmen
@ 2020-08-22 1:42 ` David C. Rankin
2020-09-02 13:36 ` Roy Sigurd Karlsbakk
0 siblings, 1 reply; 14+ messages in thread
From: David C. Rankin @ 2020-08-22 1:42 UTC (permalink / raw)
To: mdraid
On 8/18/20 4:03 PM, Håkon Struijk Holmen wrote:
> Hi,
>
> Thanks for the CC, I just managed to get myself subscribed to the list :)
>
> I have gathered some thoughts on the subject as well after reading up on it,
> figuring out the actual header format is, and writing a tool [3] to fix my
> array...
>
<snip>
> But I have some complaints about the thing..
Well,
There is code in all things that can be fixed, but I for one will chime in
and say I don't care if I lose a stripe or two so long as on a failed disk I
pop the new one in and it rebuilds without issue (which it does, even when the
disk was replaced due to bad blocks)
So whatever is done, don't fix what isn't broken and introduce more bugs
along the way. If this is such an immediate problem, then why aren't patches
being attached to the complaints?
--
David C. Rankin, J.D.,P.E.
* Re: Feature request: Remove the badblocks list
2020-08-22 1:42 ` David C. Rankin
@ 2020-09-02 13:36 ` Roy Sigurd Karlsbakk
2020-09-02 14:34 ` Adam Goryachev
0 siblings, 1 reply; 14+ messages in thread
From: Roy Sigurd Karlsbakk @ 2020-09-02 13:36 UTC (permalink / raw)
To: David C. Rankin; +Cc: Linux Raid
----- Original Message -----
> From: "David C. Rankin" <drankinatty@suddenlinkmail.com>
> To: "Linux Raid" <linux-raid@vger.kernel.org>
> Sent: Saturday, 22 August, 2020 03:42:40
> Subject: Re: Feature request: Remove the badblocks list
> On 8/18/20 4:03 PM, Håkon Struijk Holmen wrote:
>> Hi,
>>
>> Thanks for the CC, I just managed to get myself subscribed to the list :)
>>
>> I have gathered some thoughts on the subject as well after reading up on it,
>> figuring out the actual header format is, and writing a tool [3] to fix my
>> array...
>>
> <snip>
>> But I have some complaints about the thing..
>
> Well,
>
> There is code in all things that can be fixed, but I for one will chime in
> and say I don't care if a lose a strip or two so long as on a failed disk I
> pop the new one in and it rebuilds without issue (which it does, even when the
> disk was replaced due to bad blocks)
>
> So whatever is done, don't fix what isn't broken and introduce more bugs
> along the way. If this is such an immediate problem, then why are patches
> being attached to the complaints?
The problem is that it's already broken. Take a single mirror. One drive experiences a bad sector; fine, you have redundancy, so you read the data from the other drive and md flags the sector as bad. Then drive two is replaced, and you lose the data. The new drive gets flagged with the same sector number as faulty, since the first drive has it flagged. So you replace the first drive, and during resync it also gets flagged as having a bad sector. And so on.
Modern disks (that is, disks from the last 20 years or so) reallocate sectors as they wear out. We have redundancy to handle errors, not to pinpoint them on disks and fill up not-so-smart lists with broken sectors that work. If md sees a drive with excessive errors, that drive should be kicked out and marked as dead, but it should not interfere with the rest of the raid.
Vennlig hilsen
roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.
* Re: Feature request: Remove the badblocks list
2020-09-02 13:36 ` Roy Sigurd Karlsbakk
@ 2020-09-02 14:34 ` Adam Goryachev
2020-09-02 14:50 ` Roy Sigurd Karlsbakk
0 siblings, 1 reply; 14+ messages in thread
From: Adam Goryachev @ 2020-09-02 14:34 UTC (permalink / raw)
To: Roy Sigurd Karlsbakk, David C. Rankin; +Cc: Linux Raid
On 2/9/20 23:36, Roy Sigurd Karlsbakk wrote:
> ----- Original Message -----
>> From: "David C. Rankin" <drankinatty@suddenlinkmail.com>
>> To: "Linux Raid" <linux-raid@vger.kernel.org>
>> Sent: Saturday, 22 August, 2020 03:42:40
>> Subject: Re: Feature request: Remove the badblocks list
>> On 8/18/20 4:03 PM, Håkon Struijk Holmen wrote:
>>> Hi,
>>>
>>> Thanks for the CC, I just managed to get myself subscribed to the list :)
>>>
>>> I have gathered some thoughts on the subject as well after reading up on it,
>>> figuring out the actual header format is, and writing a tool [3] to fix my
>>> array...
>>>
>> <snip>
>>> But I have some complaints about the thing..
>> Well,
>>
>> There is code in all things that can be fixed, but I for one will chime in
>> and say I don't care if a lose a strip or two so long as on a failed disk I
>> pop the new one in and it rebuilds without issue (which it does, even when the
>> disk was replaced due to bad blocks)
>>
>> So whatever is done, don't fix what isn't broken and introduce more bugs
>> along the way. If this is such an immediate problem, then why are patches
>> being attached to the complaints?
> The problem is that it's already broken. Take a single mirror. One drive experiences a bad sector, fine, you have redundancy, so you read the data from the other drive and md flags the sector as bad. The drive two is replaced, you lose the data. The new drive will get flagged with the same sector number as faulty, since the first drive has it flagged. So you replace the first drive and during resync, it also gets flagged as having a bad sector. And so on.
>
> Modern (that is, disks since 20 years ago or so) reallocate sectors as they wear out. We have redundancy to handle errors, not to pinpoint them on disks and fill up not-so-smart lists with broken sectors that work. If md sees a drive with excessive errors, that drive should be kicked out, marked as dead, but not interfere with the rest of the raid.
>
> Vennlig hilsen
>
> roy
I'm no MD expert, but I think there are a couple of things to consider...
1) MD doesn't mark the sector as bad unless we try to write to it, AND
the drive replies to say it could not be written. So, in your case, the
drive is saying that it doesn't have any "spare" sectors left to
re-allocate, we are already past that point.
2) When MD tries to read, it gets an error, so read from the other
mirror, or re-construct from parity/etc, and automatically attempt to
write to the sector, see point 1 above for the failure case.
So by the time MD gets a write error for a sector, the drive really is
bad, and MD can no longer ensure that *this* sector will be able to
properly store data again (whatever level of RAID we asked for, that
level can't be achieved with one drive faulty). So MD marks it bad, and
won't store any user data in that sector in future. As other drives are
replaced, we mark the corresponding sector on those drives as also bad,
so they also know that no user data should be stored there.
Eventually, we replace the faulty disk, and it would probably be safe to
store user data in the marked sector (assuming the new drive is not
faulty on the same sector, and all other member drives are not faulty on
the same sector).
So, to "fix" this, we just need a way to tell MD to try and write to all
member drives, on all faulty sectors, and if any drive fails to
write, then keep the sector marked as bad; if *ALL* drives succeed, then
remove it from the bad blocks list on all members.
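As a rough sketch of that re-test idea (not an existing md feature; the function and the try_write callback are hypothetical, and what data gets written during the test is left open here):

```python
def retest_bad_sectors(members, bad_sectors, try_write):
    """Return the sectors that must stay on the bad blocks list.

    try_write(member, sector) is a hypothetical callback that returns True
    when the test write to that sector on that member succeeded.
    """
    still_bad = set()
    for sector in bad_sectors:
        # Build the full list first so every member really gets tested,
        # instead of stopping at the first failure.
        results = [try_write(member, sector) for member in members]
        if not all(results):
            still_bad.add(sector)
    return still_bad
```

Sectors not in the returned set passed on every member and could be removed from the list on all of them.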
So why not add this feature to fix the problem, instead of throwing away
something that is potentially useful? Perhaps this could be done as part
of the "repair" mode, or done during a replace/add (when we reach the
"bad" sector, test the new drive, test all existing drives, and then
continue with the repair/add).
Would that solve the "bug"?
PS, As you noted, if MD gets repeated write errors for one drive, then
it will be kicked out. That value is configurable.
* Re: Feature request: Remove the badblocks list
2020-09-02 14:34 ` Adam Goryachev
@ 2020-09-02 14:50 ` Roy Sigurd Karlsbakk
2020-09-02 15:09 ` Adam Goryachev
0 siblings, 1 reply; 14+ messages in thread
From: Roy Sigurd Karlsbakk @ 2020-09-02 14:50 UTC (permalink / raw)
To: Adam Goryachev; +Cc: David C. Rankin, Linux Raid
> I'm no MD expert, but I there are a couple of things to consider...
>
> 1) MD doesn't mark the sector as bad unless we try to write to it, AND
> the drive replies to say it could not be written. So, in your case, the
> drive is saying that it doesn't have any "spare" sectors left to
> re-allocate, we are already passed that point.
>
> 2) When MD tries to read, it gets an error, so read from the other
> mirror, or re-construct from parity/etc, and automatically attempt to
> write to the sector, see point 1 above for the failure case.
>
> So by the time MD gets a write error for a sector, the drive really is
> bad, and MD can no longer ensure that *this* sector will be able to
> properly store data again (whatever level of RAID we asked for, that
> level can't be achieved with one drive faulty). So MD marks it bad, and
> won't store any user data in that sector in future. As other drives are
> replaced, we mark the corresponding sector on those drives as also bad,
> so they also know that no user data should be stored there.
>
> Eventually, we replace the faulty disk, and it would probably be safe to
> store user data in the marked sector (assuming the new drive is not
> faulty on the same sector, and all other member drives are not faulty on
> the same sector).
>
> So, to "fix" this, we just need a way to tell MD to try and write to all
> member drives, on all faulty sectors, and if any drive returns fails to
> write, then keep the sector as marked bad, if *ALL* drives succeed, then
> remove from the bad blocks list on all members.
>
> So why not add this feature to fix the problem, instead of throwing away
> something that is potentially useful? Perhaps this could be done as part
> of the "repair" mode, or done during a replace/add (when we reach the
> "bad" sector, test the new drive, test all existing drives, and then
> continue with the repair/add.
>
> Would that solve the "bug"?
I'd rather have md stop fixing "somebody else's problem", that is, the disk's, and just do its job. As for my case, I have tried to manually read the sectors named in the badblocks list and they all work. All of them. But then, there's no fixing them, since they are proclaimed dead. So are their sibling sectors with the same number on the other drives, regardless of status.
If a drive has multiple issues with bad sectors, kick it out. It doesn't have anything to do in the RAID anymore.
Vennlig hilsen
roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.
* Re: Feature request: Remove the badblocks list
2020-09-02 14:50 ` Roy Sigurd Karlsbakk
@ 2020-09-02 15:09 ` Adam Goryachev
2020-09-02 15:25 ` Roy Sigurd Karlsbakk
0 siblings, 1 reply; 14+ messages in thread
From: Adam Goryachev @ 2020-09-02 15:09 UTC (permalink / raw)
To: Roy Sigurd Karlsbakk; +Cc: David C. Rankin, Linux Raid
On 3/9/20 00:50, Roy Sigurd Karlsbakk wrote:
>> I'm no MD expert, but I there are a couple of things to consider...
>>
>> 1) MD doesn't mark the sector as bad unless we try to write to it, AND
>> the drive replies to say it could not be written. So, in your case, the
>> drive is saying that it doesn't have any "spare" sectors left to
>> re-allocate, we are already passed that point.
>>
>> 2) When MD tries to read, it gets an error, so read from the other
>> mirror, or re-construct from parity/etc, and automatically attempt to
>> write to the sector, see point 1 above for the failure case.
>>
>> So by the time MD gets a write error for a sector, the drive really is
>> bad, and MD can no longer ensure that *this* sector will be able to
>> properly store data again (whatever level of RAID we asked for, that
>> level can't be achieved with one drive faulty). So MD marks it bad, and
>> won't store any user data in that sector in future. As other drives are
>> replaced, we mark the corresponding sector on those drives as also bad,
>> so they also know that no user data should be stored there.
>>
>> Eventually, we replace the faulty disk, and it would probably be safe to
>> store user data in the marked sector (assuming the new drive is not
>> faulty on the same sector, and all other member drives are not faulty on
>> the same sector).
>>
>> So, to "fix" this, we just need a way to tell MD to try and write to all
>> member drives, on all faulty sectors, and if any drive returns fails to
>> write, then keep the sector as marked bad, if *ALL* drives succeed, then
>> remove from the bad blocks list on all members.
>>
>> So why not add this feature to fix the problem, instead of throwing away
>> something that is potentially useful? Perhaps this could be done as part
>> of the "repair" mode, or done during a replace/add (when we reach the
>> "bad" sector, test the new drive, test all existing drives, and then
>> continue with the repair/add.
>>
>> Would that solve the "bug"?
> I'd better want md to stop fixing "somebody else's problem", that is, the disk, and rather just do its job. As for the case, I have tried to manually read those sectors named in the badblocks list and they all work. All of them. But then, there's no fixing, since they are proclaimed dead. So are their siblings' sectors with the same number, regardless of status.
Just because you can read them, doesn't mean you can write them.
Clearly, at some point in time, one of your drives failed. You now need
to recover from that failed drive in the most sensible way.
> If a drive has multiple issues with bad sector, kick it out. It doesn't have anything to do in the RAID anymore
And if a group of 100 sectors are bad on drive 1, and 100 different
sectors on drive 2, you want to kick both drives out, and destroy all
your data until you can create a new array and restore from backup?
OR, just mark those parts of all disks faulty, and at some point in the
future, you replace the disks, and then find a way to tell MD that the
sectors are working now (and preferably, re-test them before marking
them as OK)?
BTW, I just found this:
https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy
Which suggests that there is indeed a bug which should be hunted and
fixed, and that actually the BBL isn't populated via failed writes, it
is populated by failed reads while doing a replace/add, AND the failed
read is from the source drive AND the parity/mirror drives.
Either way, perhaps what is needed (if you are interested) is a
repeatable test scenario causing the problem, which could then be used
to identify and fix the bug.
Regards,
Adam
* Re: Feature request: Remove the badblocks list
2020-09-02 15:09 ` Adam Goryachev
@ 2020-09-02 15:25 ` Roy Sigurd Karlsbakk
2020-09-02 16:32 ` Adam Goryachev
0 siblings, 1 reply; 14+ messages in thread
From: Roy Sigurd Karlsbakk @ 2020-09-02 15:25 UTC (permalink / raw)
To: Adam Goryachev; +Cc: David C. Rankin, Linux Raid
>> I'd better want md to stop fixing "somebody else's problem", that is, the disk,
>> and rather just do its job. As for the case, I have tried to manually read
>> those sectors named in the badblocks list and they all work. All of them. But
>> then, there's no fixing, since they are proclaimed dead. So are their siblings'
>> sectors with the same number, regardless of status.
> Just because you can read them, doesn't mean you can write them.
> Clearly, at some point in time, one of your drives failed. You now need
> to recover from that failed drive in the most sensible way.
>> If a drive has multiple issues with bad sector, kick it out. It doesn't have
>> anything to do in the RAID anymore
>
> And if a group of 100 sectors are bad on drive 1, and 100 different
> sectors on drive 2, you want to kick both drives out, and destroy all
> your data until you can create a new array and restore from backup?
>
> OR, just mark those parts of all disks faulty, and at some point in the
> future, you replace the disks, and then find a way to tell MD that the
> sectors are working now (and preferably, re-test them before marking
> them as OK)?
>
> BTW, I just found this:
>
> https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy
I linked to that earlier in the thread
> Which suggests that there is indeed a bug which should be hunted and
> fixed, and that actually the BBL isn't populated via failed writes, it
> is populated by failed reads while doing a replace/add, AND the failed
> read is from the source drive AND the parity/mirror drives.
It is neither hunted down nor fixed. It's the same thing and it has stayed the same for these years.
> Either way, perhaps what is needed (if you are interested) is a
> repeatable test scenario causing the problem, which could then be used
> to identify and fix the bug.
I have tried several things and all show the same. I just don't know how to tell md "this drive's sector X is bad, so flag it so".
Again, this is not the way to work around a problem. All this does is hide real problems and let them grow over generations of drives, instead of just flagging a bad drive as bad, since that's the originating problem here.
Vennlig hilsen
roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.
* Re: Feature request: Remove the badblocks list
2020-09-02 15:25 ` Roy Sigurd Karlsbakk
@ 2020-09-02 16:32 ` Adam Goryachev
2020-09-02 16:50 ` Roy Sigurd Karlsbakk
2020-09-02 19:45 ` Håkon Struijk Holmen
0 siblings, 2 replies; 14+ messages in thread
From: Adam Goryachev @ 2020-09-02 16:32 UTC (permalink / raw)
To: Roy Sigurd Karlsbakk; +Cc: David C. Rankin, Linux Raid
On 3/9/20 01:25, Roy Sigurd Karlsbakk wrote:
>>> I'd better want md to stop fixing "somebody else's problem", that is, the disk,
>>> and rather just do its job. As for the case, I have tried to manually read
>>> those sectors named in the badblocks list and they all work. All of them. But
>>> then, there's no fixing, since they are proclaimed dead. So are their siblings'
>>> sectors with the same number, regardless of status.
>> Just because you can read them, doesn't mean you can write them.
>> Clearly, at some point in time, one of your drives failed. You now need
>> to recover from that failed drive in the most sensible way.
>>> If a drive has multiple issues with bad sector, kick it out. It doesn't have
>>> anything to do in the RAID anymore
>> And if a group of 100 sectors are bad on drive 1, and 100 different
>> sectors on drive 2, you want to kick both drives out, and destroy all
>> your data until you can create a new array and restore from backup?
>>
>> OR, just mark those parts of all disks faulty, and at some point in the
>> future, you replace the disks, and then find a way to tell MD that the
>> sectors are working now (and preferably, re-test them before marking
>> them as OK)?
>>
>> BTW, I just found this:
>>
>> https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy
> I linked to that earlier in the thread
>
>> Which suggests that there is indeed a bug which should be hunted and
>> fixed, and that actually the BBL isn't populated via failed writes, it
>> is populated by failed reads while doing a replace/add, AND the failed
>> read is from the source drive AND the parity/mirror drives.
> It is neither hunted down nor fixed. It's the same thing and it has stayed the same for these years.
So what will you do now to change that? Obviously nobody else has had
enough of a problem with it to be bothered to "hunt it down and fix it".
Can you help hunt it down at least?
>> Either way, perhaps what is needed (if you are interested) is a
>> repeatable test scenario causing the problem, which could then be used
>> to identify and fix the bug.
> I have tried several things and all show the same. I just don't know how to tell md "this drive's sector X is bad, so flag it so".
>
> Again, this is not the way to walk around a problem. What this does is just hiding real problems and let them grow in generations instead of just flagging a bad drive as bad, since that's the originating problem here.
>
> Kind regards
>
> roy
Based on the linked page, you would need to do something like this:
1) Create a clean array with correctly working disks
2) Tell the underlying block device to pretend there is a read error on
a specific sector of one disk
3) Ask MD to replace the "bad" block device with a "good" one
4) See what happens with the BBL
5) Various steps of reading/writing to that specific stripe, and
document the outcome/behavior
6) Replace another drive, and document the results
Hint: there is a block device that could sit between your actual block
device and MD, and it can "pretend" there are certain errors. The
answers here seem to contain relevant information:
https://stackoverflow.com/questions/1870696/simulate-a-faulty-block-device-with-read-errors
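A minimal sketch of step 2, assuming the device-mapper "linear" and "error" targets are used as the layer between the real block device and MD. The helper name faulty_table and the example device and sizes are hypothetical; the function only builds table lines and runs nothing. Note that the "error" target fails writes as well as reads in its range, so it may not model a sector that recovers once it is rewritten.

```python
def faulty_table(backing_dev, total_sectors, bad_start, bad_len=8):
    """Build device-mapper table lines that pass I/O through to backing_dev,
    except for one range of 512-byte sectors that always returns an error.

    All names and sizes here are illustrative assumptions.
    """
    if bad_start < 0 or bad_len <= 0 or bad_start + bad_len > total_sectors:
        raise ValueError("bad range must lie inside the device")

    lines = []
    # Healthy region before the simulated bad range.
    if bad_start > 0:
        lines.append(f"0 {bad_start} linear {backing_dev} 0")
    # The simulated bad range: every request here fails.
    lines.append(f"{bad_start} {bad_len} error")
    # Healthy region after the simulated bad range.
    tail = bad_start + bad_len
    if tail < total_sectors:
        lines.append(f"{tail} {total_sectors - tail} linear {backing_dev} {tail}")
    return lines


if __name__ == "__main__":
    print("\n".join(faulty_table("/dev/loop0", 204800, 1000)))
```

The printed lines could be piped into "dmsetup create <name>", which reads a table from standard input, and the resulting /dev/mapper device used as the array member in steps 3 to 6.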
As I said, I suspect that if a reproducible error is found, then it
should be easier to fix the bug.
OTOH, you could just remove the BBL from your arrays, and ensure you
create new arrays without the BBL.
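A rough sketch of what removing the BBL from an existing array could look like, assuming the mdadm update options "no-bbl" and "force-no-bbl" behave as described in mdadm(8) for your version (the former is expected to refuse when entries are recorded, the latter to drop the list anyway). The helper name and the device names are hypothetical, and the function only returns the commands, it does not run them.

```python
def no_bbl_commands(md_dev, members, force=False):
    """Return argv lists that stop an array and reassemble it without a
    bad block list. Nothing is executed; device names are placeholders.
    """
    if not members:
        raise ValueError("at least one member device is required")
    # Assumption: "no-bbl" fails if the list is non-empty, "force-no-bbl"
    # discards recorded entries as well.
    update = "force-no-bbl" if force else "no-bbl"
    return [
        ["mdadm", "--stop", md_dev],
        ["mdadm", "--assemble", md_dev, f"--update={update}", *members],
    ]


if __name__ == "__main__":
    for argv in no_bbl_commands("/dev/md0", ["/dev/sdb1", "/dev/sdc1"]):
        print(" ".join(argv))
```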
Regards,
Adam
* Re: Feature request: Remove the badblocks list
2020-09-02 16:32 ` Adam Goryachev
@ 2020-09-02 16:50 ` Roy Sigurd Karlsbakk
2020-09-02 19:45 ` Håkon Struijk Holmen
1 sibling, 0 replies; 14+ messages in thread
From: Roy Sigurd Karlsbakk @ 2020-09-02 16:50 UTC (permalink / raw)
To: Adam Goryachev; +Cc: David C. Rankin, Linux Raid
> Based on the linked page, you would need to do something like this:
>
> 1) Create a clean array with correctly working disks
>
> 2) Tell the underlying block device to pretend there is a read error on
> a specific sector of one disk
>
> 3) Ask MD to replace the "bad" block device with a "good" one
Do you have a howto for steps 2 and 3?
> 4) See what happens with the BBL
>
> 5) Various steps of reading/writing to that specific stripe, and
> document the outcome/behavior
Or this one: how?
> 6) Replace another drive, and document the results
>
> Hint: there is a block device that could sit between your actual block
> device and MD, and it can "pretend" there are certain errors. The
> answers here seem to contain relevant information:
> https://stackoverflow.com/questions/1870696/simulate-a-faulty-block-device-with-read-errors
>
> As I said, I suspect that if a reproducible error is found, then it
> should be easier to fix the bug.
>
> OTOH, you could just remove the BBL from your arrays, and ensure you
> create new arrays without the BBL.
Anything better than just "mdadm ... --assemble --update=force-no-bbl"?
Kind regards
roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
The good you shall carve in stone, the bad write in snow.
* Re: Feature request: Remove the badblocks list
2020-09-02 16:32 ` Adam Goryachev
2020-09-02 16:50 ` Roy Sigurd Karlsbakk
@ 2020-09-02 19:45 ` Håkon Struijk Holmen
1 sibling, 0 replies; 14+ messages in thread
From: Håkon Struijk Holmen @ 2020-09-02 19:45 UTC (permalink / raw)
To: Adam Goryachev, Roy Sigurd Karlsbakk; +Cc: David C. Rankin, Linux Raid
On 9/2/20 6:32 PM, Adam Goryachev wrote:
>
> On 3/9/20 01:25, Roy Sigurd Karlsbakk wrote:
>>>> I'd better want md to stop fixing "somebody else's problem", that
>>>> is, the disk,
>>>> and rather just do its job. As for the case, I have tried to
>>>> manually read
>>>> those sectors named in the badblocks list and they all work. All of
>>>> them. But
>>>> then, there's no fixing, since they are proclaimed dead. So are
>>>> their siblings'
>>>> sectors with the same number, regardless of status.
>>> Just because you can read them, doesn't mean you can write them.
>>> Clearly, at some point in time, one of your drives failed. You now need
>>> to recover from that failed drive in the most sensible way.
>>>> If a drive has multiple issues with bad sector, kick it out. It
>>>> doesn't have
>>>> anything to do in the RAID anymore
>>> And if a group of 100 sectors are bad on drive 1, and 100 different
>>> sectors on drive 2, you want to kick both drives out, and destroy all
>>> your data until you can create a new array and restore from backup?
>>>
>>> OR, just mark those parts of all disks faulty, and at some point in the
>>> future, you replace the disks, and then find a way to tell MD that the
>>> sectors are working now (and preferably, re-test them before marking
>>> them as OK)?
>>>
>>> BTW, I just found this:
>>>
>>> https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy
>> I linked to that earlier in the thread
>>
>>> Which suggests that there is indeed a bug which should be hunted and
>>> fixed, and that actually the BBL isn't populated via failed writes, it
>>> is populated by failed reads while doing a replace/add, AND the failed
>>> read is from the source drive AND the parity/mirror drives.
>> It is neither hunted down nor fixed. It's the same thing and it has
>> stayed the same for these years.
> So what will you do now to change that? Obviously nobody else has had
> enough of a problem with it to be bothered to "hunt it down and fix
> it". Can you help hunt it down at least?
>>> Either way, perhaps what is needed (if you are interested) is a
>>> repeatable test scenario causing the problem, which could then be used
>>> to identify and fix the bug.
>> I have tried several things and all show the same. I just don't know
>> how to tell md "this drive's sector X is bad, so flag it so".
>>
>> Again, this is not the way to walk around a problem. What this does
>> is just hiding real problems and let them grow in generations instead
>> of just flagging a bad drive as bad, since that's the originating
>> problem here.
>>
>> Kind regards
>>
>> roy
>
> Based on the linked page, you would need to do something like this:
>
> 1) Create a clean array with correctly working disks
>
> 2) Tell the underlying block device to pretend there is a read error
> on a specific sector of one disk
>
> 3) Ask MD to replace the "bad" block device with a "good" one
>
> 4) See what happens with the BBL
>
> 5) Various steps of reading/writing to that specific stripe, and
> document the outcome/behavior
>
> 6) Replace another drive, and document the results
>
> Hint: there is a block device that could sit between your actual block
> device and MD, and it can "pretend" there are certain errors. The
> answers here seem to contain relevant information:
> https://stackoverflow.com/questions/1870696/simulate-a-faulty-block-device-with-read-errors
>
> As I said, I suspect that if a reproducible error is found, then it
> should be easier to fix the bug.
>
> OTOH, you could just remove the BBL from your arrays, and ensure you
> create new arrays without the BBL.
>
> Regards,
> Adam
>
Hi,
I think you may have misunderstood slightly. Bad block entries can be
recorded as a result of failed read requests, which is the case that Roy
and I are complaining about. Such a read error may be only temporary, and
may affect multiple drives if there is some sort of controller problem.
I have actually done an experiment, and I would like to explain it in
terms of your numbered points.
1) An NFS server was set up with a share holding some block files,
approximately 100 MB each. The NFS server was given a secondary IP
address for the client to use, which could be added or removed to
simulate a passing controller failure. The NFS client mounted the share
with a soft mount, so that I/O errors are returned after a timeout. The
files were attached to loop devices and a RAID array was created on
them, I think it was RAID 5. The array was formatted with XFS and filled
with data. Caches were dropped.
2) The IP was removed to simulate the controller temporarily failing.
Then I tried reading from the RAID array, producing I/O errors on all
the drives. The IP was added back to restore communication, and md took
the opportunity to mark one of the drives full of bad blocks. The rest
of the block devices were thrown out, maybe for failing to write to the
bad block list.
3) My attempt wasn't entirely successful, since only one drive got bad
blocks. I think this was down to luck. In this case md will have enough
data to repair the error during a drive replacement. Maybe if one of the
"healthy" ones were removed, we would see md failing to reconstruct the
data and writing bad blocks to the new device. I didn't carry this out,
but that is how I understand the algorithm to work.
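The setup in points 1) and 2) can be sketched as a command plan. This is only an illustration under assumptions: the IP address, export path, mount point, interface name, mount options and image names are all made up, and the function merely returns command strings rather than running anything. Filling the filesystem with data is left out.

```python
def experiment_plan(server_ip, export, mountpoint, n_disks=3,
                    size="100M", iface="eth0", prefix=24):
    """Return the commands for an NFS-backed loop-device RAID 5 test bed.

    Every value is a placeholder; nothing here is executed.
    """
    if n_disks < 3:
        raise ValueError("RAID 5 needs at least three member devices")

    # Soft mount so that the client returns I/O errors after a timeout
    # instead of blocking forever (timeo is in tenths of a second).
    client = [f"mount -t nfs -o soft,timeo=30,retrans=2 "
              f"{server_ip}:{export} {mountpoint}"]

    loops = []
    for i in range(n_disks):
        image = f"{mountpoint}/disk{i}.img"
        loop = f"/dev/loop{i}"
        client.append(f"truncate -s {size} {image}")   # backing file on NFS
        client.append(f"losetup {loop} {image}")       # expose it as a block device
        loops.append(loop)

    client.append(f"mdadm --create /dev/md0 --level=5 "
                  f"--raid-devices={n_disks} " + " ".join(loops))
    client += ["mkfs.xfs /dev/md0", "sync",
               "echo 3 > /proc/sys/vm/drop_caches"]    # drop caches before reading

    return {
        "client_setup": client,
        # Run on the server: remove, then restore, the secondary address.
        "server_fail": f"ip addr del {server_ip}/{prefix} dev {iface}",
        "server_heal": f"ip addr add {server_ip}/{prefix} dev {iface}",
    }


if __name__ == "__main__":
    plan = experiment_plan("192.0.2.10", "/export", "/mnt/nfs")
    print("\n".join(plan["client_setup"]))
```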
The issue I have is that a temporary read failure can cause blocks to be
marked with a flag that means "the data here is not the correct data".
Read failures would need to be recorded differently, so that they can be
told apart and reading from that kind of bad block can be retried.
There's just one flag, and it's used if reading fails, if writing fails,
if the correct data was not found for a new drive and thus the data was
not initialized...
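To illustrate the distinction, here is a purely hypothetical model in which each entry remembers why it was recorded. This is not how md stores its list; the type and function names are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Reason(Enum):
    """Hypothetical reasons an entry could carry; md has no such field."""
    READ_FAILED = "read failed"        # data may still be intact on disk
    WRITE_FAILED = "write failed"      # on-disk data is stale
    NEVER_WRITTEN = "never written"    # data was never initialized


@dataclass(frozen=True)
class BadRange:
    start: int      # first sector
    length: int     # number of sectors
    reason: Reason


def retry_candidates(entries):
    """Entries where simply re-reading the member might succeed.

    Only failed reads qualify: for the other two reasons the sectors are
    known not to hold the correct data, so a successful read proves nothing.
    """
    return [e for e in entries if e.reason is Reason.READ_FAILED]
```

With a single flag, all three cases collapse into one and no such filtering is possible.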
I've talked to Roy and we will probably try removing the lists, and I
think it will work. At least partially. For his array, he has been
replacing some drives from time to time without knowing about the bad
block lists, and this means that his bad blocks are a combination of
drives where the data actually is present, and drives where the data was
never written in the first place. If we remove the lists, then we will
probably get a mix of uninitialized data and correct data back. I did
the same to my array, but I did not replace any drives so I was certain
that I had all the data. My drives actually don't have any bad blocks at
all, I iterated the lists and read all of the sectors.
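Iterating a list and reading the sectors back could look roughly like this. It is a sketch under assumptions: the list is taken to be in the "start length" per-line form (counted in 512-byte sectors) that the md sysfs bad_blocks attribute uses, the sector numbers are assumed to be offsets from the start of the member device, and the function names are made up. Reads may also be served from the page cache unless caches are dropped first.

```python
import os

SECTOR = 512  # bad block lists are counted in 512-byte sectors


def parse_bad_blocks(text):
    """Parse lines of the form '<start> <length>' into (start, length) tuples."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            ranges.append((int(fields[0]), int(fields[1])))
    return ranges


def unreadable_ranges(device_path, ranges):
    """Try to read every listed range; return the ranges that fail.

    A range counts as failed if the read raises an error or returns
    fewer bytes than requested.
    """
    failed = []
    fd = os.open(device_path, os.O_RDONLY)
    try:
        for start, length in ranges:
            try:
                data = os.pread(fd, length * SECTOR, start * SECTOR)
            except OSError:
                failed.append((start, length))
                continue
            if len(data) != length * SECTOR:
                failed.append((start, length))
    finally:
        os.close(fd)
    return failed
```

An empty result only shows that the sectors are readable now; it says nothing about whether the data in them is the data the array expects.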
I would expect md to state that the array is degraded, send angry emails
and such, but it seems like you will only know the state of your BBLs if
you go and check them.
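Going and checking them could be automated along these lines. The sketch assumes the per-member sysfs attribute /sys/block/mdX/md/dev-YYY/bad_blocks with one "start length" pair per line; the function name and the report format are invented.

```python
import glob
import os


def arrays_with_bad_blocks(sysfs_root="/sys"):
    """Map 'mdX/dev-YYY' to the total number of sectors listed as bad.

    Members with an empty list are left out of the result.
    """
    report = {}
    pattern = os.path.join(sysfs_root, "block", "md*", "md", "dev-*", "bad_blocks")
    for path in sorted(glob.glob(pattern)):
        total = 0
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) == 2:
                    total += int(fields[1])   # second field is the length
        if total:
            parts = path.split(os.sep)
            # .../block/<array>/md/<member>/bad_blocks
            report[f"{parts[-4]}/{parts[-2]}"] = total
    return report


if __name__ == "__main__":
    for member, sectors in arrays_with_bad_blocks().items():
        print(f"{member}: {sectors} sectors marked bad")
```

Run periodically, something like this could provide the warning that is otherwise missing.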
Regards and thanks for understanding,
Håkon
Thread overview: 14+ messages
2020-08-18 18:00 Feature request: Remove the badblocks list Roy Sigurd Karlsbakk
2020-08-18 19:26 ` Wols Lists
2020-08-18 19:34 ` Piergiorgio Sartor
2020-08-18 19:43 ` Phil Turmel
2020-08-18 21:03 ` Håkon Struijk Holmen
2020-08-22 1:42 ` David C. Rankin
2020-09-02 13:36 ` Roy Sigurd Karlsbakk
2020-09-02 14:34 ` Adam Goryachev
2020-09-02 14:50 ` Roy Sigurd Karlsbakk
2020-09-02 15:09 ` Adam Goryachev
2020-09-02 15:25 ` Roy Sigurd Karlsbakk
2020-09-02 16:32 ` Adam Goryachev
2020-09-02 16:50 ` Roy Sigurd Karlsbakk
2020-09-02 19:45 ` Håkon Struijk Holmen