* BTRFS bad block management. Does it exist?
@ 2018-10-14 11:08 waxhead
2018-10-14 11:31 ` Qu Wenruo
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: waxhead @ 2018-10-14 11:08 UTC (permalink / raw)
To: Btrfs BTRFS
In case BTRFS fails to WRITE to a disk, what happens?
Does the bad area get mapped out somehow? Does it try again until it
succeeds, or until it "times out" or reaches a threshold counter?
Does it eventually try to write to a different disk (in case of using
the raid1/10 profile)?
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: BTRFS bad block management. Does it exist?
2018-10-14 11:08 BTRFS bad block management. Does it exist? waxhead
@ 2018-10-14 11:31 ` Qu Wenruo
2018-10-15 12:09 ` Austin S. Hemmelgarn
2018-10-16 9:57 ` Anand Jain
2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2018-10-14 11:31 UTC (permalink / raw)
To: waxhead, Btrfs BTRFS
On 2018/10/14 7:08 PM, waxhead wrote:
> In case BTRFS fails to WRITE to a disk. What happens?
Normally it should return an error when we flush the disk.
In that case, the error will lead to a transaction abort, and the fs goes
RO to prevent further corruption.
> Does the bad area get mapped out somehow?
No.
> Does it try again until it
> succeeds, or until it "times out" or reaches a threshold counter?
Unless it's done by the block layer, btrfs doesn't retry.
> Does it eventually try to write to a different disk (in case of using
> the raid1/10 profile?)
No. That's not what RAID is designed to do.
A flush error is only tolerated when using the "degraded" mount
option and the error count is within the profile's tolerance.
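To illustrate the tolerance point, here is a minimal, hypothetical sketch of such a per-profile check. The profile names are real btrfs profiles, but the function, the table, and the exact values are made up for illustration and are not btrfs's actual code:

```python
# Hypothetical sketch: how many device failures each profile can
# tolerate before a flush error must abort the transaction.
# (Assumed values for illustration, not copied from btrfs source.)
TOLERATED_MISSING = {
    "single": 0,
    "raid0": 0,
    "raid1": 1,    # one copy may be lost
    "raid10": 1,
    "raid1c3": 2,  # three copies, two may be lost
}

def can_continue_degraded(profile, failed_devices):
    """Return True if a degraded mount could keep writing despite
    `failed_devices` devices having failed their flush."""
    return failed_devices <= TOLERATED_MISSING[profile]

print(can_continue_degraded("raid1", 1))   # one lost mirror is tolerable
print(can_continue_degraded("raid0", 1))   # raid0 tolerates nothing
```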
Thanks,
Qu
* Re: BTRFS bad block management. Does it exist?
2018-10-14 11:08 BTRFS bad block management. Does it exist? waxhead
2018-10-14 11:31 ` Qu Wenruo
@ 2018-10-15 12:09 ` Austin S. Hemmelgarn
2018-10-16 9:57 ` Anand Jain
2 siblings, 0 replies; 4+ messages in thread
From: Austin S. Hemmelgarn @ 2018-10-15 12:09 UTC (permalink / raw)
To: waxhead, Btrfs BTRFS
On 2018-10-14 07:08, waxhead wrote:
> In case BTRFS fails to WRITE to a disk. What happens?
> Does the bad area get mapped out somehow? Does it try again until it
> succeeds, or until it "times out" or reaches a threshold counter?
> Does it eventually try to write to a different disk (in case of using
> the raid1/10 profile)?
Building on Qu's answer (which is absolutely correct), BTRFS makes the
perfectly reasonable assumption that you're not trying to use known-bad
hardware. It's not alone in this respect either: pretty much every
Linux filesystem makes the exact same assumption (and almost all
non-Linux ones do too), because it really is a perfectly reasonable
assumption. The only exception is ext[234], but those only support it
statically (you can set the bad block list at mkfs time, but not
afterwards, and they don't update it at runtime), and it's a holdover
from earlier filesystems which originated at a time when storage was
sufficiently expensive _and_ unreliable that you kept using disks until
they were essentially completely dead.
The reality is that with modern storage hardware, if you have
persistently bad sectors the device is either defective (and should be
returned under warranty), or it's beyond its expected EOL (and should
just be replaced). Most people know about SSDs doing block remapping to
avoid bad blocks, but hard drives do it too, and they're actually rather
good at it. In both cases, enough spare blocks are provided that the
device can handle average rates of media errors through the entirety of
its average life expectancy without running out of spare blocks.
On top of all of that though, it's fully possible to work around bad
blocks in the block layer if you take the time to actually do it. With
a bit of reasonably simple math, you can easily set up an LVM volume
that actively avoids all the bad blocks on a disk while still fully
utilizing the rest of the volume. Similarly, with a bit of work (and a
partition table that supports _lots_ of partitions) you can work around
bad blocks with an MD concatenated device.
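As a rough illustration of the math involved, here is a hedged Python sketch that computes a device-mapper "linear" table skipping over bad sectors; the emitted lines are in the format `dmsetup` accepts for the linear target. The device name, device size, and bad-sector list are made up for illustration:

```python
# Sketch: build a dm-linear table that maps a contiguous logical device
# around known bad sectors. Device name and bad-sector list are
# hypothetical; run the output through `dmsetup create` at your own risk.

def dm_linear_table(dev, dev_sectors, bad_sectors, pad=8):
    """Return dm 'linear' table lines ("start length linear dev offset",
    all in 512-byte sectors) that skip each bad sector, padded out to
    `pad`-sector alignment."""
    # Align each bad sector down to a pad boundary, then merge
    # overlapping/adjacent exclusion ranges.
    excl = sorted({(s // pad) * pad for s in bad_sectors})
    ranges = []
    for start in excl:
        if ranges and start <= ranges[-1][1]:
            ranges[-1] = (ranges[-1][0], start + pad)
        else:
            ranges.append((start, start + pad))
    # Emit a linear segment for every good stretch between exclusions.
    table, logical, phys = [], 0, 0
    for bad_start, bad_end in ranges + [(dev_sectors, dev_sectors)]:
        length = bad_start - phys
        if length > 0:
            table.append(f"{logical} {length} linear {dev} {phys}")
            logical += length
        phys = bad_end
    return table

# Hypothetical disk: 1000 sectors with bad sectors at 100 and 505.
for line in dm_linear_table("/dev/sdb", 1000, [100, 505]):
    print(line)
```

The same arithmetic underlies the LVM approach: each good stretch becomes one physical extent range (or one tiny partition, for the MD-concatenation variant), and everything in between is simply never allocated.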
* Re: BTRFS bad block management. Does it exist?
2018-10-14 11:08 BTRFS bad block management. Does it exist? waxhead
2018-10-14 11:31 ` Qu Wenruo
2018-10-15 12:09 ` Austin S. Hemmelgarn
@ 2018-10-16 9:57 ` Anand Jain
2 siblings, 0 replies; 4+ messages in thread
From: Anand Jain @ 2018-10-16 9:57 UTC (permalink / raw)
To: waxhead, Btrfs BTRFS
On 10/14/2018 07:08 PM, waxhead wrote:
> In case BTRFS fails to WRITE to a disk. What happens?
> Does the bad area get mapped out somehow?
There was a proposed patch, but it was not convincing: the disk does
the bad-block relocation transparently to the host, and if the disk
runs out of its reserved list, it is probably time to replace it. In my
experience, the disk will have failed with some other non-media error
before it runs out of the reserved list, and in that case host-side
relocation won't help. Furthermore, at the filesystem level you can't
accurately determine whether a block write failed because of a bad-media
error rather than, say, a fault in the target circuitry.
> Does it try again until it
> succeeds, or
> until it "times out" or reaches a threshold counter?
Block IO timeout and retry are properties of the block layer; depending
on the type of error, it retries.
The sd module already retries 5 times (when failfast is not set); that
should be tunable, and I think there was a patch for that on the ML.
We had a few discussions on the retry part in the past. [1]
[1]
https://www.spinics.net/lists/linux-btrfs/msg70240.html
https://www.spinics.net/lists/linux-btrfs/msg71779.html
> Does it eventually try to write to a different disk (in case of using
> the raid1/10 profile)?
When there is a mirror copy, it does not go into RO mode, and it leaves
write holes scattered across transactions, since we don't fail the disk
at the first failed transaction. That means that if a disk is at the
nth transaction per the super-block, it is not guaranteed that all
previous transactions made it to the disk successfully in mirrored
configs. I consider this a bug. There is also a danger of reading junk
data, which is hard but not impossible to hit, due to our unreasonable
(there is a patch on the ML to address that as well) hard-coded
pid-based read-mirror policy.
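The pid-based policy mentioned above can be sketched in a couple of lines; this is an illustration of the idea (mirror chosen by the reading process's pid), not btrfs's actual code:

```python
# Sketch of pid-based read-mirror selection for a 2-copy RAID1 read:
# the mirror is chosen by the reader's pid, so a single long-lived
# process always reads the same copy, and a stale/junk copy on one
# mirror can go unnoticed by half the readers indefinitely.

def pick_mirror(pid, num_mirrors=2):
    # Effectively: mirror = pid % num_mirrors
    return pid % num_mirrors

print(pick_mirror(1000))  # even pid -> mirror 0
print(pick_mirror(1001))  # odd pid  -> mirror 1
```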
I sent a patch to fail the disk when the first write fails, so that we
know the last good integrity of the FS based on the transaction id.
That was a long time back; I still believe it's an important patch, but
I guess there weren't enough comments for it to go to the next step.
The current solution is to replace the offending disk _without_ reading
from it, to get a good recovery from the failed disk. As data centers
can't rely on admin-initiated manual recovery, there is also a patch to
do this automatically using the auto-replace feature; the patches are
on the ML. Again, I guess there weren't enough comments for it to go to
the next step.
Thanks, Anand