* Feature request: Remove the badblocks list
@ 2020-08-18 18:00 Roy Sigurd Karlsbakk
  2020-08-18 19:26 ` Wols Lists
  2020-08-18 21:03 ` Håkon Struijk Holmen
  0 siblings, 2 replies; 14+ messages in thread
From: Roy Sigurd Karlsbakk @ 2020-08-18 18:00 UTC (permalink / raw)
  To: Linux RAID Mailing List; +Cc: Håkon

Hi all

It seems the badblocks list was added around 10 years ago[1]. The reason was to keep track of sectors that weren't readable, which may have made sense 20 years earlier, but not even in 2010. The first IDE drives came out at the end of the 1980s and were named for their 'Integrated Drive Electronics', which was a new thing at the time. As opposed to earlier MFM drives and such, these were "smart" and could handle errors a bit better, even reallocate bad sectors somewhere else when needed. A lot happened between 1987 and 2010, but for some reason, this feature slipped through anyway, perhaps because Linus was drunk, I don't know. As far as I can understand, this feature works a bit like this:

 - If a bad (that is, unreadable) block is found, it is flagged as bad in the md member's superblock, never to be used again
 - If a new disk is added to the array, the block number of the original bad block is flagged as bad on the new drive too, since the whole stripe is rendered useless (erm, didn't we have redundancy here?)
 - If the original drive is replaced with a new drive, md happily copies all the data to the new drive and carries the same badblock list over to its superblock.

So no attempt is ever made to check or repair that sector. Disks reallocate sectors if they are bad, and it's not necessarily a big issue unless there are a lot of such errors. We just say 'this sector or block said *ouch* and is thus dead, and so will its siblings be for ever and ever'. There's a nice article about it here[2].
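For what it's worth, whether a drive has actually had to reallocate anything is visible in SMART; a quick check with smartmontools (the device name is just an example):

  # non-zero Reallocated_Sector_Ct or Current_Pending_Sector values
  # suggest the drive itself is struggling
  smartctl -A /dev/sdb | grep -Ei 'realloc|pending'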

In practice, if you have data in stripes with badblocks, it may be lost forever and for no reason at all, since drives tend to fix their problems if you issue a write to that sector. ZFS does this nicely - when it finds a bad read, it reconstructs the data from parity or mirror and writes it again. If it encounters a write error, it retries. Eventually the drive may fail, but hell, that's why we have redundancy.

As far as I can see, the only way to remove the badblocks list is "mdadm ... --assemble --update=no-bbl", from [2], and have md return garbage for those lost sectors, which is fine, since fsck/xfs_repair should fix what's fixable (the rest wouldn't have been readable anyway). An alternate version, written by a friend of mine (Håkon, on cc), is available at [3] and removes the list from an offlined array.
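To make that concrete, a rough sketch with stock mdadm (array and member names are examples; force-no-bbl throws away a populated list, so only use it once you're convinced the listed sectors are actually fine):

  # show the bad block list recorded in a member's superblock
  mdadm --examine-badblocks /dev/sdb1

  # reassemble without the list; plain no-bbl only works when the list is
  # empty, force-no-bbl drops a populated list as well
  mdadm --stop /dev/md0
  mdadm --assemble /dev/md0 --update=no-bbl /dev/sdb1 /dev/sdc1
  # or, if entries remain:
  # mdadm --assemble /dev/md0 --update=force-no-bbl /dev/sdb1 /dev/sdc1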

As far as I can understand, this list doesn't have any reason to exist, except to annoy sysadmins.

So please remove this useless thing, or at least don't enable it by default.

[1] https://linux-raid.vger.kernel.narkive.com/R1rvkUiQ/using-the-new-bad-block-log-in-md-for-linux-3-1
[2] https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy
[3] https://git.thehawken.org/hawken/md-badblocktool.git

Vennlig hilsen

roy
-- 
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.


* Re: Feature request: Remove the badblocks list
  2020-08-18 18:00 Feature request: Remove the badblocks list Roy Sigurd Karlsbakk
@ 2020-08-18 19:26 ` Wols Lists
  2020-08-18 19:34   ` Piergiorgio Sartor
  2020-08-18 19:43   ` Phil Turmel
  2020-08-18 21:03 ` Håkon Struijk Holmen
  1 sibling, 2 replies; 14+ messages in thread
From: Wols Lists @ 2020-08-18 19:26 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk, Linux RAID Mailing List; +Cc: Håkon

On 18/08/20 19:00, Roy Sigurd Karlsbakk wrote:
> As far as I can understand, this list doesn't have any reason to exist, except to annoy sysadmins.

Actually, there's at least one good reason for it to exist that I can
think of - it *could* make recovering a broken array much easier. Think
about it, I think it's documented in the wiki.

That said, I'm hoping to do some work soon that will make it redundant.

One little tip though - you've done a load of research to tell us what
we already know - as documented on the wiki - and now you're asking us
to do a load of work. If you want it done, well nobody else has bothered
so far so what makes you think they'll bother now?

Cheers,
Wol


* Re: Feature request: Remove the badblocks list
  2020-08-18 19:26 ` Wols Lists
@ 2020-08-18 19:34   ` Piergiorgio Sartor
  2020-08-18 19:43   ` Phil Turmel
  1 sibling, 0 replies; 14+ messages in thread
From: Piergiorgio Sartor @ 2020-08-18 19:34 UTC (permalink / raw)
  To: Wols Lists; +Cc: Roy Sigurd Karlsbakk, Linux RAID Mailing List, Håkon

On Tue, Aug 18, 2020 at 08:26:07PM +0100, Wols Lists wrote:
> On 18/08/20 19:00, Roy Sigurd Karlsbakk wrote:
> > As far as I can understand, this list doesn't have any reason to exist, except to annoy sysadmins.
> 
> Actually, there's at least one good reason for it to exist that I can
> think of - it *could* make recovering a broken array much easier. Think
> about it, I think it's documented in the wiki.
> 
> That said, I'm hoping to do some work soon that will make it redundant.
> 
> One little tip though - you've done a load of research to tell us what
> we already know - as documented on the wiki - and now you're asking us
> to do a load of work. If you want it done, well nobody else has bothered
> so far so what makes you think they'll bother now?

Is it really "a load of work" to switch from
default "on" to default "off"?
Because that's what he is asking.

If this is the case, there is something more
broken in the code...

BTW, I find it quite problematic too to have
a feature, activated by default, which is
_officially_ declared as *buggy*.

bye,

-- 

piergiorgio


* Re: Feature request: Remove the badblocks list
  2020-08-18 19:26 ` Wols Lists
  2020-08-18 19:34   ` Piergiorgio Sartor
@ 2020-08-18 19:43   ` Phil Turmel
  1 sibling, 0 replies; 14+ messages in thread
From: Phil Turmel @ 2020-08-18 19:43 UTC (permalink / raw)
  To: Wols Lists, Roy Sigurd Karlsbakk, Linux RAID Mailing List; +Cc: Håkon

On 8/18/20 3:26 PM, Wols Lists wrote:
> On 18/08/20 19:00, Roy Sigurd Karlsbakk wrote:
>> As far as I can understand, this list doesn't have any reason to exist, except to annoy sysadmins.
> 
> Actually, there's at least one good reason for it to exist that I can
> think of - it *could* make recovering a broken array much easier. Think
> about it, I think it's documented in the wiki.

Link please.


* Re: Feature request: Remove the badblocks list
  2020-08-18 18:00 Feature request: Remove the badblocks list Roy Sigurd Karlsbakk
  2020-08-18 19:26 ` Wols Lists
@ 2020-08-18 21:03 ` Håkon Struijk Holmen
  2020-08-22  1:42   ` David C. Rankin
  1 sibling, 1 reply; 14+ messages in thread
From: Håkon Struijk Holmen @ 2020-08-18 21:03 UTC (permalink / raw)
  To: Linux RAID Mailing List

On 8/18/20 8:00 PM, Roy Sigurd Karlsbakk wrote:

> Hi all

Hi,

Thanks for the CC, I just managed to get myself subscribed to the list :)

I have gathered some thoughts on the subject as well after reading up on it,
figuring out what the actual header format is, and writing a tool [3] to fix my array...

About the tool: it will try to read all the supposedly bad blocks on a drive,
and erase the whole list if no blocks fail to read. As long as it's not run against
a drive that got marked because data was unavailable during rebuild, this
should make it possible to read the data again. A possible improvement here would be to reduce
the size of the list down to the actually bad blocks if some still fail, but right now the tool
will refuse to do anything to the drive if md turns out to be correct about any of them. You also
need to flip a variable that I hid awkwardly between two functions before it will write to the drive at all.
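The same idea can be approximated by hand for a single entry; a rough sketch (member name and
sector numbers are made up, and whether the listed sectors are relative to the whole device or
to the member's data area is exactly the kind of detail [3] had to work out, so verify that first):

  # list recorded entries for one member: "start-sector length" pairs
  mdadm --examine-badblocks /dev/sdb1

  # try reading one listed range straight off the member, bypassing the cache
  dd if=/dev/sdb1 of=/dev/null bs=512 skip=123456 count=8 iflag=direct && echo "readable"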

But I have some complaints about the thing..


Good data marked as bad:

My viewpoint is from what happened to my raid array, a 5 drive raid6.
3 of the drives had identical lists of bad blocks while 2 had empty lists.
Therefore, the marked sectors corresponded to lost data. This was solved by
iterating the bad block list, verifying that all the sectors were in fact readable,
and then removing the bad block list. Since I did not have any drive replacements,
I was certain enough that I would not run into uninitialized space. This gave me back
the data that md had decided was gone.

I do not really think one can say that the md badblock list corresponds
to bad blocks on the device. The list consists of sectors where
md thinks the data is permanently unavailable. That happens in two ways:
- A read error occurs for any reason
- A new drive is rebuilt, but the array doesn't have the parity to find out what
   data was supposed to go there, because badblock entries on other devices prevent
   it from finding a source for the data that it's supposed to write there. It's assumed
   that such reads would fail.

Since both cases are added to the same list of bad blocks, it follows that
even if a read from a 'bad' block were to succeed, its contents can also be
uninitialized space.

Once enough drives have bad blocks for the same stripe, that data is gone - md will not read it,
even if it's still there on the drives. I can only speculate on what happened in my case;
so far I think that some intermittent controller failure caused all reads to give errors, and
somehow md was still able to write to the badblock list.

I think it's not just me, and it seems to be a common phenomenon that arrays end up with
identical lists across drives. Whether this is down to controller failures or just a bug, it's not good
and it undermines the assumption that the underlying blocks will actually be bad.

9 years ago, Lutz Vieweg asked "I've experienced drives
with intermittent read / write failures (due to controller or power stability
problems), and I wonder whether such a situation could quickly fill up the
"bad block list", doing more harm than good in the "intermittent error"-
szenario." [1]. I have my doubts that this was resolved.

I also don't know if this is the cause of the issue with many drives sharing the exact same
list, or if some other logic error type bug is causing it.


md indicating all is good:

At the same URL, Neil Brown said "(...) You shouldn't aim to run an array
with bad blocks any more than you should run an array degraded. The purpose
of bad block management is to provide a more graceful failure path, not to
encourage you to run an array with bad drives". However, an array with
bad blocks does not report this as "degraded", and you have to run
--examine to even see it. The result is that the array is not treated
as bad, with md communicating that the array is still good. The end result
is that the software encourages running an array with bad drives.

The assumption seems to be that one would treat this as a degraded array, but
you have to run --examine and specifically look for it to see that there are bad blocks at all.
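Until md reports this itself, the check has to be scripted by hand; a small sketch over the
sysfs files of a running array (md0 is an example name):

  # warn about any md0 member that has recorded bad blocks
  for f in /sys/block/md0/md/dev-*/bad_blocks; do
      bb=$(cat "$f")
      [ -n "$bb" ] && echo "WARNING: $f: $bb"
  done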


Lack of documentation:

I have added some links in addition to the ones found by Roy. This was the extent
of the documentation that I was able to find; I'd be interested to hear if this is documented
better elsewhere. The kernel documentation also briefly mentions the existence of bad
block lists in mdraid. The wiki article on the superblock format [5] hasn't been updated
with the badblock fields. The 2010 blog post [4] was the closest thing to documentation,
even if it was written before the thing was finalized.


Overall:

I don't think this uncertainty is good at all. I feel like it would be easier to deal with
a controller failure throwing the whole raid apart: you'd assemble it back together and
check the filesystem, and with fingers crossed, everything would be fine. I think md's
behaviour without this feature enabled makes more sense. Drives thrown out
of arrays still have data on them. This means that if unrecoverable errors occur, one can
still run ddrescue and try to copy the array to new drives, one by one, and get as much data
back as possible.
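For the record, a rough sketch of that last step with GNU ddrescue (device names and the map
file are examples):

  # copy a failing member to a fresh disk, keeping a map so the copy can be
  # resumed; first a quick pass that skips bad areas, then a retry pass
  ddrescue -f -n  /dev/sdb /dev/sdd /root/sdb.map
  ddrescue -f -r3 /dev/sdb /dev/sdd /root/sdb.map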

Once you hit a bad block during a read, adopting the zfs model of calculating parity and
overwriting seems better, because it tries to just solve the problem so it doesn't happen
the next time around. I think md will throw it in the list and expect it to be fixed during
the next check operation - unless that never happens and more bad blocks accumulate until
data loss occurs...

I would also like to see the functionality changed to opt-in, or just removed. If it's kept as
opt-in, I still hope that some of this feedback is taken. For example: reporting the array
as degraded if the lists get populated; automatically fixing bad blocks as soon as possible,
before the situation develops any further; making the uninitialized data and the bad blocks
two separate things, so that one can still try reading those blocks and keep track of
where the data is supposed to be, and where it's definitely not.

Maybe dropping badblocks and taking inspiration from ZFS instead.

> [1] https://linux-raid.vger.kernel.narkive.com/R1rvkUiQ/using-the-new-bad-block-log-in-md-for-linux-3-1
> [2] https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy
> [3] https://git.thehawken.org/hawken/md-badblocktool.git

[4] https://neil.brown.name/blog/20100519043730

[5] https://raid.wiki.kernel.org/index.php/RAID_superblock_formats



Regards,
Håkon
     



* Re: Feature request: Remove the badblocks list
  2020-08-18 21:03 ` Håkon Struijk Holmen
@ 2020-08-22  1:42   ` David C. Rankin
  2020-09-02 13:36     ` Roy Sigurd Karlsbakk
  0 siblings, 1 reply; 14+ messages in thread
From: David C. Rankin @ 2020-08-22  1:42 UTC (permalink / raw)
  To: mdraid

On 8/18/20 4:03 PM, Håkon Struijk Holmen wrote:
> Hi,
> 
> Thanks for the CC, I just managed to get myself subscribed to the list :)
> 
> I have gathered some thoughts on the subject as well after reading up on it,
> figuring out the actual header format is, and writing a tool [3] to fix my
> array...
> 
<snip>
> But I have some complaints about the thing..

Well,

  There is code in all things that can be fixed, but I for one will chime in
and say I don't care if I lose a stripe or two, so long as on a failed disk I
pop the new one in and it rebuilds without issue (which it does, even when the
disk was replaced due to bad blocks).

  So whatever is done, don't fix what isn't broken and introduce more bugs
along the way. If this is such an immediate problem, then why aren't patches
being attached to the complaints?

-- 
David C. Rankin, J.D.,P.E.


* Re: Feature request: Remove the badblocks list
  2020-08-22  1:42   ` David C. Rankin
@ 2020-09-02 13:36     ` Roy Sigurd Karlsbakk
  2020-09-02 14:34       ` Adam Goryachev
  0 siblings, 1 reply; 14+ messages in thread
From: Roy Sigurd Karlsbakk @ 2020-09-02 13:36 UTC (permalink / raw)
  To: David C. Rankin; +Cc: Linux Raid

----- Original Message -----
> From: "David C. Rankin" <drankinatty@suddenlinkmail.com>
> To: "Linux Raid" <linux-raid@vger.kernel.org>
> Sent: Saturday, 22 August, 2020 03:42:40
> Subject: Re: Feature request: Remove the badblocks list

> On 8/18/20 4:03 PM, Håkon Struijk Holmen wrote:
>> Hi,
>> 
>> Thanks for the CC, I just managed to get myself subscribed to the list :)
>> 
>> I have gathered some thoughts on the subject as well after reading up on it,
>> figuring out the actual header format is, and writing a tool [3] to fix my
>> array...
>> 
> <snip>
>> But I have some complaints about the thing..
> 
> Well,
> 
>  There is code in all things that can be fixed, but I for one will chime in
> and say I don't care if a lose a strip or two so long as on a failed disk I
> pop the new one in and it rebuilds without issue (which it does, even when the
> disk was replaced due to bad blocks)
> 
>  So whatever is done, don't fix what isn't broken and introduce more bugs
> along the way. If this is such an immediate problem, then why are patches
> being attached to the complaints?

The problem is that it's already broken. Take a single mirror. One drive experiences a bad sector; fine, you have redundancy, so you read the data from the other drive and md flags the sector as bad. Then drive two is replaced and you lose the data: the new drive gets flagged with the same sector number as faulty, since the first drive has it flagged. So you replace the first drive, and during resync it also gets flagged as having a bad sector. And so on.

Modern disks (that is, disks from the last 20 years or so) reallocate sectors as they wear out. We have redundancy to handle errors, not to pinpoint them on disks and fill up not-so-smart lists with 'broken' sectors that actually work. If md sees a drive with excessive errors, that drive should be kicked out and marked as dead, not interfere with the rest of the raid.

Vennlig hilsen

roy
-- 
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.



* Re: Feature request: Remove the badblocks list
  2020-09-02 13:36     ` Roy Sigurd Karlsbakk
@ 2020-09-02 14:34       ` Adam Goryachev
  2020-09-02 14:50         ` Roy Sigurd Karlsbakk
  0 siblings, 1 reply; 14+ messages in thread
From: Adam Goryachev @ 2020-09-02 14:34 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk, David C. Rankin; +Cc: Linux Raid


On 2/9/20 23:36, Roy Sigurd Karlsbakk wrote:
> ----- Original Message -----
>> From: "David C. Rankin" <drankinatty@suddenlinkmail.com>
>> To: "Linux Raid" <linux-raid@vger.kernel.org>
>> Sent: Saturday, 22 August, 2020 03:42:40
>> Subject: Re: Feature request: Remove the badblocks list
>> On 8/18/20 4:03 PM, Håkon Struijk Holmen wrote:
>>> Hi,
>>>
>>> Thanks for the CC, I just managed to get myself subscribed to the list :)
>>>
>>> I have gathered some thoughts on the subject as well after reading up on it,
>>> figuring out the actual header format is, and writing a tool [3] to fix my
>>> array...
>>>
>> <snip>
>>> But I have some complaints about the thing..
>> Well,
>>
>>   There is code in all things that can be fixed, but I for one will chime in
>> and say I don't care if a lose a strip or two so long as on a failed disk I
>> pop the new one in and it rebuilds without issue (which it does, even when the
>> disk was replaced due to bad blocks)
>>
>>   So whatever is done, don't fix what isn't broken and introduce more bugs
>> along the way. If this is such an immediate problem, then why are patches
>> being attached to the complaints?
> The problem is that it's already broken. Take a single mirror. One drive experiences a bad sector, fine, you have redundancy, so you read the data from the other drive and md flags the sector as bad. The drive two is replaced, you lose the data. The new drive will get flagged with the same sector number as faulty, since the first drive has it flagged. So you replace the first drive and during resync, it also gets flagged as having a bad sector. And so on.
>
> Modern (that is, disks since 20 years ago or so) reallocate sectors as they wear out. We have redundancy to handle errors, not to pinpoint them on disks and fill up not-so-smart lists with broken sectors that work. If md sees a drive with excessive errors, that drive should be kicked out, marked as dead, but not interfere with the rest of the raid.
>
> Vennlig hilsen
>
> roy

I'm no MD expert, but there are a couple of things to consider...

1) MD doesn't mark the sector as bad unless we try to write to it, AND
the drive replies to say it could not be written. So, in your case, the
drive is saying that it doesn't have any "spare" sectors left to
re-allocate; we are already past that point.

2) When MD tries to read, it gets an error, so it reads from the other
mirror, or reconstructs from parity/etc, and automatically attempts to
write to the sector; see point 1 above for the failure case.

So by the time MD gets a write error for a sector, the drive really is 
bad, and MD can no longer ensure that *this* sector will be able to 
properly store data again (whatever level of RAID we asked for, that 
level can't be achieved with one drive faulty). So MD marks it bad, and 
won't store any user data in that sector in future. As other drives are 
replaced, we mark the corresponding sector on those drives as also bad, 
so they also know that no user data should be stored there.

Eventually, we replace the faulty disk, and it would probably be safe to 
store user data in the marked sector (assuming the new drive is not 
faulty on the same sector, and all other member drives are not faulty on 
the same sector).

So, to "fix" this, we just need a way to tell MD to try and write to all 
member drives, on all faulty sectors, and if any drive returns fails to 
write, then keep the sector as marked bad, if *ALL* drives succeed, then 
remove from the bad blocks list on all members.

So why not add this feature to fix the problem, instead of throwing away
something that is potentially useful? Perhaps this could be done as part
of the "repair" mode, or done during a replace/add (when we reach the
"bad" sector, test the new drive, test all existing drives, and then
continue with the replace/add).

Would that solve the "bug"?

PS, As you noted, if MD gets repeated write errors for one drive, then 
it will be kicked out. That value is configurable.
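For reference, a sketch of the existing knobs in this area (md0 is an example
name; whether either of these interacts with the BBL at all is exactly what is
in question in this thread):

  echo check  > /sys/block/md0/md/sync_action    # read-only scrub
  # echo repair > /sys/block/md0/md/sync_action  # scrub that rewrites mismatches
  cat /sys/block/md0/md/mismatch_cnt

  # per-array threshold of corrected read errors before md fails a member
  cat /sys/block/md0/md/max_corrected_read_errors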



* Re: Feature request: Remove the badblocks list
  2020-09-02 14:34       ` Adam Goryachev
@ 2020-09-02 14:50         ` Roy Sigurd Karlsbakk
  2020-09-02 15:09           ` Adam Goryachev
  0 siblings, 1 reply; 14+ messages in thread
From: Roy Sigurd Karlsbakk @ 2020-09-02 14:50 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: David C. Rankin, Linux Raid

> I'm no MD expert, but I there are a couple of things to consider...
> 
> 1) MD doesn't mark the sector as bad unless we try to write to it, AND
> the drive replies to say it could not be written. So, in your case, the
> drive is saying that it doesn't have any "spare" sectors left to
> re-allocate, we are already passed that point.
> 
> 2) When MD tries to read, it gets an error, so read from the other
> mirror, or re-construct from parity/etc, and automatically attempt to
> write to the sector, see point 1 above for the failure case.
> 
> So by the time MD gets a write error for a sector, the drive really is
> bad, and MD can no longer ensure that *this* sector will be able to
> properly store data again (whatever level of RAID we asked for, that
> level can't be achieved with one drive faulty). So MD marks it bad, and
> won't store any user data in that sector in future. As other drives are
> replaced, we mark the corresponding sector on those drives as also bad,
> so they also know that no user data should be stored there.
> 
> Eventually, we replace the faulty disk, and it would probably be safe to
> store user data in the marked sector (assuming the new drive is not
> faulty on the same sector, and all other member drives are not faulty on
> the same sector).
> 
> So, to "fix" this, we just need a way to tell MD to try and write to all
> member drives, on all faulty sectors, and if any drive returns fails to
> write, then keep the sector as marked bad, if *ALL* drives succeed, then
> remove from the bad blocks list on all members.
> 
> So why not add this feature to fix the problem, instead of throwing away
> something that is potentially useful? Perhaps this could be done as part
> of the "repair" mode, or done during a replace/add (when we reach the
> "bad" sector, test the new drive, test all existing drives, and then
> continue with the repair/add.
> 
> Would that solve the "bug"?

I'd rather md stopped fixing "somebody else's problem", that is, the disk, and just did its job. As for my case, I have tried to manually read the sectors named in the badblocks list and they all work. All of them. But then, there's no fixing them, since they are proclaimed dead. So are their siblings' sectors with the same number, regardless of status.

If a drive has multiple issues with bad sectors, kick it out. It doesn't have anything to do in the RAID anymore.

Vennlig hilsen

roy
-- 
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.



* Re: Feature request: Remove the badblocks list
  2020-09-02 14:50         ` Roy Sigurd Karlsbakk
@ 2020-09-02 15:09           ` Adam Goryachev
  2020-09-02 15:25             ` Roy Sigurd Karlsbakk
  0 siblings, 1 reply; 14+ messages in thread
From: Adam Goryachev @ 2020-09-02 15:09 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: David C. Rankin, Linux Raid


On 3/9/20 00:50, Roy Sigurd Karlsbakk wrote:
>> I'm no MD expert, but I there are a couple of things to consider...
>>
>> 1) MD doesn't mark the sector as bad unless we try to write to it, AND
>> the drive replies to say it could not be written. So, in your case, the
>> drive is saying that it doesn't have any "spare" sectors left to
>> re-allocate, we are already passed that point.
>>
>> 2) When MD tries to read, it gets an error, so read from the other
>> mirror, or re-construct from parity/etc, and automatically attempt to
>> write to the sector, see point 1 above for the failure case.
>>
>> So by the time MD gets a write error for a sector, the drive really is
>> bad, and MD can no longer ensure that *this* sector will be able to
>> properly store data again (whatever level of RAID we asked for, that
>> level can't be achieved with one drive faulty). So MD marks it bad, and
>> won't store any user data in that sector in future. As other drives are
>> replaced, we mark the corresponding sector on those drives as also bad,
>> so they also know that no user data should be stored there.
>>
>> Eventually, we replace the faulty disk, and it would probably be safe to
>> store user data in the marked sector (assuming the new drive is not
>> faulty on the same sector, and all other member drives are not faulty on
>> the same sector).
>>
>> So, to "fix" this, we just need a way to tell MD to try and write to all
>> member drives, on all faulty sectors, and if any drive returns fails to
>> write, then keep the sector as marked bad, if *ALL* drives succeed, then
>> remove from the bad blocks list on all members.
>>
>> So why not add this feature to fix the problem, instead of throwing away
>> something that is potentially useful? Perhaps this could be done as part
>> of the "repair" mode, or done during a replace/add (when we reach the
>> "bad" sector, test the new drive, test all existing drives, and then
>> continue with the repair/add.
>>
>> Would that solve the "bug"?
> I'd better want md to stop fixing "somebody else's problem", that is, the disk, and rather just do its job. As for the case, I have tried to manually read those sectors named in the badblocks list and they all work. All of them. But then, there's no fixing, since they are proclaimed dead. So are their siblings' sectors with the same number, regardless of status.
Just because you can read them, doesn't mean you can write them. 
Clearly, at some point in time, one of your drives failed. You now need 
to recover from that failed drive in the most sensible way.
> If a drive has multiple issues with bad sector, kick it out. It doesn't have anything to do in the RAID anymore

And if a group of 100 sectors are bad on drive 1, and 100 different 
sectors on drive 2, you want to kick both drives out, and destroy all 
your data until you can create a new array and restore from backup?

OR, just mark those parts of all disks faulty, and at some point in the 
future, you replace the disks, and then find a way to tell MD that the 
sectors are working now (and preferably, re-test them before marking 
them as OK)?

BTW, I just found this:

https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy

Which suggests that there is indeed a bug which should be hunted and
fixed, and that actually the BBL isn't populated via failed writes; it
is populated by failed reads while doing a replace/add, AND the failed
read is from the source drive AND the parity/mirror drives.

Either way, perhaps what is needed (if you are interested) is a 
repeatable test scenario causing the problem, which could then be used 
to identify and fix the bug.

Regards,
Adam



* Re: Feature request: Remove the badblocks list
  2020-09-02 15:09           ` Adam Goryachev
@ 2020-09-02 15:25             ` Roy Sigurd Karlsbakk
  2020-09-02 16:32               ` Adam Goryachev
  0 siblings, 1 reply; 14+ messages in thread
From: Roy Sigurd Karlsbakk @ 2020-09-02 15:25 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: David C. Rankin, Linux Raid

>> I'd better want md to stop fixing "somebody else's problem", that is, the disk,
>> and rather just do its job. As for the case, I have tried to manually read
>> those sectors named in the badblocks list and they all work. All of them. But
>> then, there's no fixing, since they are proclaimed dead. So are their siblings'
>> sectors with the same number, regardless of status.
> Just because you can read them, doesn't mean you can write them.
> Clearly, at some point in time, one of your drives failed. You now need
> to recover from that failed drive in the most sensible way.
>> If a drive has multiple issues with bad sector, kick it out. It doesn't have
>> anything to do in the RAID anymore
> 
> And if a group of 100 sectors are bad on drive 1, and 100 different
> sectors on drive 2, you want to kick both drives out, and destroy all
> your data until you can create a new array and restore from backup?
> 
> OR, just mark those parts of all disks faulty, and at some point in the
> future, you replace the disks, and then find a way to tell MD that the
> sectors are working now (and preferably, re-test them before marking
> them as OK)?
> 
> BTW, I just found this:
> 
> https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy

I linked to that earlier in the thread

> Which suggests that there is indeed a bug which should be hunted and
> fixed, and that actually the BBL isn't populated via failed writes, it
> is populated by failed reads while doing a replace/add, AND the failed
> read is from the source drive AND the parity/mirror drives.

It is neither hunted down nor fixed. It's the same thing and it has stayed the same all these years.

> Either way, perhaps what is needed (if you are interested) is a
> repeatable test scenario causing the problem, which could then be used
> to identify and fix the bug.

I have tried several things and all show the same. I just don't know how to tell md "this drive's sector X is bad, so flag it so".

Again, this is not the way to work around a problem. What this does is just hide real problems and let them grow over generations, instead of just flagging a bad drive as bad, since that's the originating problem here.

Vennlig hilsen

roy
-- 
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.



* Re: Feature request: Remove the badblocks list
  2020-09-02 15:25             ` Roy Sigurd Karlsbakk
@ 2020-09-02 16:32               ` Adam Goryachev
  2020-09-02 16:50                 ` Roy Sigurd Karlsbakk
  2020-09-02 19:45                 ` Håkon Struijk Holmen
  0 siblings, 2 replies; 14+ messages in thread
From: Adam Goryachev @ 2020-09-02 16:32 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: David C. Rankin, Linux Raid


On 3/9/20 01:25, Roy Sigurd Karlsbakk wrote:
>>> I'd better want md to stop fixing "somebody else's problem", that is, the disk,
>>> and rather just do its job. As for the case, I have tried to manually read
>>> those sectors named in the badblocks list and they all work. All of them. But
>>> then, there's no fixing, since they are proclaimed dead. So are their siblings'
>>> sectors with the same number, regardless of status.
>> Just because you can read them, doesn't mean you can write them.
>> Clearly, at some point in time, one of your drives failed. You now need
>> to recover from that failed drive in the most sensible way.
>>> If a drive has multiple issues with bad sector, kick it out. It doesn't have
>>> anything to do in the RAID anymore
>> And if a group of 100 sectors are bad on drive 1, and 100 different
>> sectors on drive 2, you want to kick both drives out, and destroy all
>> your data until you can create a new array and restore from backup?
>>
>> OR, just mark those parts of all disks faulty, and at some point in the
>> future, you replace the disks, and then find a way to tell MD that the
>> sectors are working now (and preferably, re-test them before marking
>> them as OK)?
>>
>> BTW, I just found this:
>>
>> https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy
> I linked to that earlier in the thread
>
>> Which suggests that there is indeed a bug which should be hunted and
>> fixed, and that actually the BBL isn't populated via failed writes, it
>> is populated by failed reads while doing a replace/add, AND the failed
>> read is from the source drive AND the parity/mirror drives.
> It is neither hunted down nor fixed. It's the same thing and it has stayed the same for these years.
So what will you do now to change that? Obviously nobody else has had 
enough of a problem with it to be bothered to "hunt it down and fix it". 
Can you help hunt it down at least?
>> Either way, perhaps what is needed (if you are interested) is a
>> repeatable test scenario causing the problem, which could then be used
>> to identify and fix the bug.
> I have tried several things and all show the same. I just don't know how to tell md "this drive's sector X is bad, so flag it so".
>
> Again, this is not the way to walk around a problem. What this does is just hiding real problems and let them grow in generations instead of just flagging a bad drive as bad, since that's the originating problem here.
>
> Vennlig hilsen
>
> roy

Based on the linked page, you would need to do something like this:

1) Create a clean array with correctly working disks

2) Tell the underlying block device to pretend there is a read error on 
a specific sector of one disk

3) Ask MD to replace the "bad" block device with a "good" one

4) See what happens with the BBL

5) Various steps of reading/writing to that specific stripe, and 
document the outcome/behavior

6) Replace another drive, and document the results

Hint: there is a block device that could sit between your actual block 
device and MD, and it can "pretend" there are certain errors. The 
answers here seem to contain relevant information: 
https://stackoverflow.com/questions/1870696/simulate-a-faulty-block-device-with-read-errors
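For step 2, a rough sketch with device-mapper's error target, which is
roughly what that link boils down to (device name, sizes and sector
numbers are examples):

  # wrap /dev/sdb1 so that 8 sectors starting at 1000000 return I/O errors
  SIZE=$(blockdev --getsz /dev/sdb1)
  printf '%s\n' \
      "0 1000000 linear /dev/sdb1 0" \
      "1000000 8 error" \
      "1000008 $((SIZE - 1000008)) linear /dev/sdb1 1000008" \
      | dmsetup create faulty-sdb1
  # then build the test array on /dev/mapper/faulty-sdb1 instead of /dev/sdb1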

As I said, I suspect that if a reproducible error is found, then it 
should be easier to fix the bug.
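And on the earlier point of not knowing how to tell md that a given sector
is bad: the per-member sysfs attribute below is the closest thing I'm aware
of. A sketch with made-up names; worth checking against the kernel's md
documentation before relying on it:

  # show the recorded entries ("start-sector length" pairs) for one member
  cat /sys/block/md0/md/dev-sdb1/bad_blocks

  # writing "sector length" is documented to add an entry by hand
  echo "123456 8" > /sys/block/md0/md/dev-sdb1/bad_blocks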

OTOH, you could just remove the BBL from your arrays, and ensure you 
create new arrays without the BBL.

Regards,
Adam




* Re: Feature request: Remove the badblocks list
  2020-09-02 16:32               ` Adam Goryachev
@ 2020-09-02 16:50                 ` Roy Sigurd Karlsbakk
  2020-09-02 19:45                 ` Håkon Struijk Holmen
  1 sibling, 0 replies; 14+ messages in thread
From: Roy Sigurd Karlsbakk @ 2020-09-02 16:50 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: David C. Rankin, Linux Raid

> Based in the linked page, you would need to do something like this:
> 
> 1) Create a clean array with correctly working disks
> 
> 2) Tell the underlying block device to pretend there is a read error on
> a specific sector of one disk
> 
> 3) Ask MD to replace the "bad" block device with a "good" one

Do you have a howto on 2,3?

> 4) See what happens with the BBL
> 
> 5) Various steps of reading/writing to that specific stripe, and
> document the outcome/behavior

or this - how?

> 6) Replace another drive, and document the results
> 
> Hint: there is a block device that could sit between your actual block
> device and MD, and it can "pretend" there are certain errors. The
> answers here seem to contain relevant information:
> https://stackoverflow.com/questions/1870696/simulate-a-faulty-block-device-with-read-errors
> 
> As I said, I suspect that if a reproducible error is found, then it
> should be easier to fix the bug.
> 
> OTOH, you could just remove the BBL from your arrays, and ensure you
> create new arrays without the BBL.

Anything better than just "mdadm ... --assemble --update=force-no-bbl"?

Vennlig hilsen

roy
-- 
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.


* Re: Feature request: Remove the badblocks list
  2020-09-02 16:32               ` Adam Goryachev
  2020-09-02 16:50                 ` Roy Sigurd Karlsbakk
@ 2020-09-02 19:45                 ` Håkon Struijk Holmen
  1 sibling, 0 replies; 14+ messages in thread
From: Håkon Struijk Holmen @ 2020-09-02 19:45 UTC (permalink / raw)
  To: Adam Goryachev, Roy Sigurd Karlsbakk; +Cc: David C. Rankin, Linux Raid


On 9/2/20 6:32 PM, Adam Goryachev wrote:
>
> On 3/9/20 01:25, Roy Sigurd Karlsbakk wrote:
>>>> I'd better want md to stop fixing "somebody else's problem", that 
>>>> is, the disk,
>>>> and rather just do its job. As for the case, I have tried to 
>>>> manually read
>>>> those sectors named in the badblocks list and they all work. All of 
>>>> them. But
>>>> then, there's no fixing, since they are proclaimed dead. So are 
>>>> their siblings'
>>>> sectors with the same number, regardless of status.
>>> Just because you can read them, doesn't mean you can write them.
>>> Clearly, at some point in time, one of your drives failed. You now need
>>> to recover from that failed drive in the most sensible way.
>>>> If a drive has multiple issues with bad sector, kick it out. It 
>>>> doesn't have
>>>> anything to do in the RAID anymore
>>> And if a group of 100 sectors are bad on drive 1, and 100 different
>>> sectors on drive 2, you want to kick both drives out, and destroy all
>>> your data until you can create a new array and restore from backup?
>>>
>>> OR, just mark those parts of all disks faulty, and at some point in the
>>> future, you replace the disks, and then find a way to tell MD that the
>>> sectors are working now (and preferably, re-test them before marking
>>> them as OK)?
>>>
>>> BTW, I just found this:
>>>
>>> https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy
>> I linked to that earlier in the thread
>>
>>> Which suggests that there is indeed a bug which should be hunted and
>>> fixed, and that actually the BBL isn't populated via failed writes, it
>>> is populated by failed reads while doing a replace/add, AND the failed
>>> read is from the source drive AND the parity/mirror drives.
>> It is neither hunted down nor fixed. It's the same thing and it has 
>> stayed the same for these years.
> So what will you do now to change that? Obviously nobody else has had 
> enough of a problem with it to be bothered to "hunt it down and fix 
> it". Can you help hunt it down at least?
>>> Either way, perhaps what is needed (if you are interested) is a
>>> repeatable test scenario causing the problem, which could then be used
>>> to identify and fix the bug.
>> I have tried several things and all show the same. I just don't know 
>> how to tell md "this drive's sector X is bad, so flag it so".
>>
>> Again, this is not the way to walk around a problem. What this does 
>> is just hiding real problems and let them grow in generations instead 
>> of just flagging a bad drive as bad, since that's the originating 
>> problem here.
>>
>> Vennlig hilsen
>>
>> roy
>
> Based in the linked page, you would need to do something like this:
>
> 1) Create a clean array with correctly working disks
>
> 2) Tell the underlying block device to pretend there is a read error 
> on a specific sector of one disk
>
> 3) Ask MD to replace the "bad" block device with a "good" one
>
> 4) See what happens with the BBL
>
> 5) Various steps of reading/writing to that specific stripe, and 
> document the outcome/behavior
>
> 6) Replace another drive, and document the results
>
> Hint: there is a block device that could sit between your actual block 
> device and MD, and it can "pretend" there are certain errors. The 
> answers here seem to contain relevant information: 
> https://stackoverflow.com/questions/1870696/simulate-a-faulty-block-device-with-read-errors
>
> As I said, I suspect that if a reproducible error is found, then it 
> should be easier to fix the bug.
>
> OTOH, you could just remove the BBL from your arrays, and ensure you 
> create new arrays without the BBL.
>
> Regards,
> Adam
>
Hi,

I think you may have misunderstood slightly. Bad blocks can get written 
based on failed read requests, which is the case that Roy and I are 
complaining about. Such a read error may just be temporary, and affect 
multiple drives if there is some sort of a controller problem.

I have actually done an experiment, and I would like to explain it in 
terms of your numbered points.


1) An NFS server was set up with a share containing some block files, approx
100MB in size each. The NFS server was given a secondary IP address for the
client, which could be added or removed to simulate a passing controller
failure. The NFS client mounted this with a soft mount, allowing it to give
IO errors after a timeout. The files were mapped to loopback devices and a
raid array was created, I think it was raid 5. The array was formatted with
xfs and filled with data. Caches were wiped. (A bare-bones way to recreate
this setup is sketched below, after point 3.)

2) The IP was removed to simulate the controller temporarily failing. 
Then I tried reading from the raid array, producing io errors on all the 
drives. The IP was added back in to restore communication, and md took 
the opportunity to write one of the drives full of bad blocks. The rest 
of the block devices were thrown out, maybe for failing to write to the 
bad block list.

3) My attempt wasn't entirely successful, since only one drive got bad
blocks. I think this was down to luck. In this case md will have enough
data to repair the error during a drive replacement. Maybe if one of the
"healthy" ones had been removed, we would have seen md failing to reconstruct
data and writing bad blocks to the new device. I didn't carry this out,
but I understand the algorithm to work like that.
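A bare-bones way to recreate the array part of this setup on loop devices,
leaving out the NFS failure injection (sizes and names are examples):

  # five ~100MB backing files on loop devices, assembled into a small array
  for i in 0 1 2 3 4; do truncate -s 100M /var/tmp/disk$i.img; done
  LOOPS=$(for i in 0 1 2 3 4; do losetup -f --show /var/tmp/disk$i.img; done)

  mdadm --create /dev/md100 --level=5 --raid-devices=5 $LOOPS
  mkfs.xfs /dev/md100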


The issue I have is that a temporary read failure can cause blocks to be 
marked with a flag that means "the data here is not the correct data". 
It would be necessary to handle read failures differently to have a 
distinction and be able to retry reading from these types of bad blocks. 
There's just one flag, and it's used if reading fails, if writing fails, 
if the correct data was not found for a new drive and thus the data was 
not initialized...

I've talked to Roy and we will probably try removing the lists, and I 
think it will work. At least partially. For his array, he has been 
replacing some drives from time to time without knowing about the bad 
block lists, and this means that his bad blocks are a combination of 
drives where the data actually is present, and drives where the data was 
never written in the first place. If we remove the lists, then we will 
probably get a mix of uninitialized data and correct data back. I did 
the same to my array, but I did not replace any drives so I was certain 
that I had all the data. My drives actually don't have any bad blocks at
all; I iterated the lists and read all of the sectors.

I would expect md to state that the array is degraded, send angry emails 
and such, but it seems like you will only know the state of your BBLs if 
you go and check them.


Regards and thanks for understanding,
Håkon

