* experiment: suboptimal behaviour with write errors and multi-device filesystems
From: Marc Lehmann @ 2020-04-26 12:46 UTC (permalink / raw)
  To: linux-btrfs

Hi!

I made an experiment whose results I would like to share with you, in the
hope of possible behaviour improvement in the future.

Summary: A disk was physically removed from a multi-device filesystem while
copying large amounts of data to the fs. btrfs continued writing half a
TB of data without signalling an error - it probably would have continued
like that forever, which I think is suboptimal behaviour.

And here is the longer version:

I created a multi-device fs with data=single and meta=raid1 and copied
about 8TB of data to it. After copying roughly 7.5TB of data I powercycled
the disk, which caused the raid controller to remove the device
semi-permanently.
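
For reference, a minimal sketch of how such a filesystem is created
(device names here are hypothetical; my real devices were LVM volumes):

   # single-copy data, mirrored metadata, spread over two devices
   mkfs.btrfs -d single -m raid1 /dev/mapper/vg-a /dev/mapper/vg-b
   mount /dev/mapper/vg-a /mnt/big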

Since the partitions were on LVM, this didn't cause btrfs to see a removed
device (if btrfs can even do that) - it did get EIO on every write, but
btrfs f u for example did display the device even though it was physically
missing, likely as the device-mapper device was still there.

While the write errors kept increasing (altogether over 300000) in the
kernel log, no other indications showed anything out of the ordinary -
mkdir/file writes still worked.

After restoring the missing disk and rebooting, I was able to mount the
filesystem without any special options. Accessing the data got a lot of:

Apr 24 21:01:53 doom kernel: [   83.515375] BTRFS error (device dm-32): bad tree block start, want 35423883739136 have 15380345110528
Apr 24 21:01:53 doom kernel: [   83.534174] BTRFS info (device dm-32): read error corrected: ino 0 off 35423883743232 (dev /dev/mapper/xmnt-faulty sector 14241833192)
Apr 24 21:01:53 doom kernel: [   83.849524] BTRFS error (device dm-32): parent transid verify failed on 34293446770688 wanted 2575 found 2539

While btrfs seemed to be able to repair most, maybe all, of the metadata
errors, I did get lots of inaccessible files and directories, which is of
course expected.

I tried to balance the metadata to simulate a metadata-only btrfs scrub
(which I wish would exist :), but the balance kept erroring out with
repeated ENOSPC errors and switched the filesystem to read-only, which was
unexpected due to using raid1.

I finally rebalanced the metadata to dup profile and back to raid1, which
seemed to have the expected effect of repairing the metadata errors.
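
A sketch of the conversion I mean (the mount point is hypothetical):

   # convert metadata to dup (both copies on one device), then back to raid1
   btrfs balance start -mconvert=dup /mnt/big
   btrfs balance start -mconvert=raid1 /mnt/big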

When remounting and unmounting the device, I got a number of these messages
as well:

Apr 24 21:30:48 doom kernel: [ 1818.523929] BTRFS warning (device dm-32): page private not zero on page 32786264244224
Apr 24 21:30:48 doom kernel: [ 1818.523931] BTRFS warning (device dm-32): page private not zero on page 32786264248320
Apr 24 21:30:48 doom kernel: [ 1818.523932] BTRFS warning (device dm-32): page private not zero on page 32786264252416

I then deleted all directories written while the disk was gone, did a
btrfs scrub (no errors) and some other tests (all remaining files were
readable and had correct contents) and it seems btrfs completely recovered
from this accident, which is a very positive change compared to older
kernel versions (I did this with 4.9 and the fs was effectively lost).

Discussion:

The reason I think the write-error behaviour is suboptimal is that
btrfs seems to not be bothered by a disk that loudly throws away all data
- it keeps writing to it and it never signals userspace about it. In my
case, 500GB were written "successfully" before I stopped it.

While signalling userspace for writes is hard (as the EIO comes too
late to signal userspace directly), I am nevertheless surprised by btrfs
not only effectively ignoring all write errors, but also not signalling
errors where it could - for example, a number of subdirectories were
gone or unreadable after the reboot (as they at least partially were on
the missing disk) which were written without error even though they were
multiple times larger than the memory size, i.e. it was almost certainly
writing to directories long _after_ btrfs got an EIO for the respective
directory blocks. This is substantiated by the fact that I was able to
list the directories before rebooting, but not afterwards, so some info
lived in blocks which were not written but were still cached.

I can't say with confidence how to improve this behaviour - I could
understand writing some gigabytes of data that are still in the cache,
or writing new files, but I think btrfs should not simply pretend an I/O
error means "successfully written" to the extent it does now.

On the other hand, kicking out a disk because it had a single write error
might not be the best behaviour either, but at least with normal disks,
an EIO on write rarely means that the block has a problem (as disk cache
usually ensures that write errors are silent), but usually indicates
a much worse condition, so considering a disk unusable after EIO (or a
certain number of EIO errors) might be better, especially if there is a
way to get the disk back into the filesystem.

I hope this mail comes in useful.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: experiment: suboptimal behaviour with write errors and multi-device filesystems
From: Zygo Blaxell @ 2020-04-28  6:19 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-btrfs


On Sun, Apr 26, 2020 at 02:46:13PM +0200, Marc Lehmann wrote:
> Hi!
> 
> I made an experiment whose results I would like to share with you, in the
> hope of possible behaviour improvement in the future.
> 
> Summary: A disk was physically removed from a multi-device filesystem while
> copying large amounts of data to the fs. btrfs continued writing half a
> TB of data without signalling an error - it probably would have continued
> like that forever, which I think is suboptimal behaviour.
> 
> And here is the longer version:
> 
> I created a multi-device fs with data=single and meta=raid1 and copied
> about 8TB of data to it. After copying roughly 7.5TB of data I powercycled
> the disk, which caused the raid controller to remove the device
> semi-permanently.
> 
> Since the partitions were on LVM, this didn't cause btrfs to see a removed
> device (if btrfs can even do that) - it did get EIO on every write, but
> btrfs f u for example did display the device even though it was physically
> missing, likely as the device-mapper device was still there.
> 
> While the write errors kept increasing (altogether over 300000) in the
> kernel log, no other indications showed anything out of the ordinary -
> mkdir/file writes still worked.
> 
> After restoring the missing disk and rebooting, I was able to mount the
> filesystem without any special options. Accessing the data got a lot of:
> 
> Apr 24 21:01:53 doom kernel: [   83.515375] BTRFS error (device dm-32): bad tree block start, want 35423883739136 have 15380345110528
> Apr 24 21:01:53 doom kernel: [   83.534174] BTRFS info (device dm-32): read error corrected: ino 0 off 35423883743232 (dev /dev/mapper/xmnt-faulty sector 14241833192)
> Apr 24 21:01:53 doom kernel: [   83.849524] BTRFS error (device dm-32): parent transid verify failed on 34293446770688 wanted 2575 found 2539
> 
> While btrfs seemed to be able to repair most, maybe all, of the metadata
> errors, I did get lots of inaccessible files and directories, which is of
> course expected.

That is _not_ expected.  Directories in btrfs are stored entirely in
metadata as btrfs items.  They do not have data blocks in data block
groups.

With metadata=raid1 and a surviving uncorrupted disk, you should have
been able to enumerate and stat() every file on the filesystem, even
the ones where the data inside the files was lost.

> I tried to balance the metadata to simulate a metadata-only btrfs scrub
> (which I wish would exist :), but the balance kept erroring out with
> repeated ENOSPC errors and switched the filesystem to read-only, which was
> unexpected due to using raid1.
> 
> I finally rebalanced the metadata to dup profile and back to raid1, which
> seemed to have the expected effect of repairing the metadata errors.
> 
> When remounting and unmounting the device, I got a number of these messages
> as well:
> 
> Apr 24 21:30:48 doom kernel: [ 1818.523929] BTRFS warning (device dm-32): page private not zero on page 32786264244224
> Apr 24 21:30:48 doom kernel: [ 1818.523931] BTRFS warning (device dm-32): page private not zero on page 32786264248320
> Apr 24 21:30:48 doom kernel: [ 1818.523932] BTRFS warning (device dm-32): page private not zero on page 32786264252416
> 
> I then deleted all directories written while the disk was gone, did a
> btrfs scrub (no errors) and some other tests (all remaining files were
> readable and had correct contents) and it seems btrfs completely recovered
> from this accident, which is a very positive change compared to older
> kernel versions (I did this with 4.9 and the fs was effectively lost).
> 
> Discussion:
> 
> The reason I think the write-error behaviour is suboptimal is that
> btrfs seems to not be bothered by a disk that loudly throws away all data
> - it keeps writing to it and it never signals userspace about it. In my
> case, 500GB were written "successfully" before I stopped it.
> 
> While signalling userspace for writes is hard (as the EIO comes too
> late to signal userspace directly), I am nevertheless surprised by btrfs
> not only effectively ignoring all write errors, but also not signalling
> errors where it could - for example, a number of subdirectories were
> gone or unreadable after the reboot (as they at least partially were on
> the missing disk) which were written without error even though they were
> multiple times larger than the memory size, i.e. it was almost certainly
> writing to directories long _after_ btrfs got an EIO for the respective
> directory blocks. 

There would be a surviving mirror copy of the directory, because it's in
raid1 metadata, so that should be a successful write in degraded mode.

Uncorrectable EIO on metadata triggers a hard shutdown of all writes to
the filesystem.  Userspace will definitely be informed when that happens.
It's something we'd want to avoid with raid1.

> This is substantiated by the fact that I was able to
> list the directories before rebooting, but not afterwards, so some info
> lived in blocks which were not written but were still cached.

It sounds like you hit some other kind of failure there (this and the
"page private not zero" messages.  What kernel was this?

> I can't say with confidence how to improve this behaviour - I could
> understand writing some gigabytes of data that are still in the cache,
> or writing new files, but I think btrfs should not simply pretend an I/O
> error means "successfully written" to the extent it does now.
> 
> On the other hand, kicking out a disk because it had a single write error
> might not be the best behaviour either, but at least with normal disks,
> an EIO on write rarely means that the block has a problem (as disk cache
> usually ensures that write errors are silent), but usually indicates
> a much worse condition, so considering a disk unusable after EIO (or a
> certain number of EIO errors) might be better, especially if there is a
> way to get the disk back into the filesystem.
> 
> I hope this mail comes in useful.
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\



* Re: experiment: suboptimal behaviour with write errors and multi-device filesystems
From: Marc Lehmann @ 2020-04-28 18:14 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

Hi, thanks for your reply!

On Tue, Apr 28, 2020 at 02:19:59AM -0400, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> That is _not_ expected.  Directories in btrfs are stored entirely in
> metadata as btrfs items.  They do not have data blocks in data block
> groups.

Ah, ok, yes, I agree then. I wrongly assumed directory data would be
stored as file data. I am actually very happy to be wrong about this, as
it makes me even more confident when facing a missing disk in production,
which is bound to happen.

That is strange then - I was able to delete the directories (and obviously
the files inside), but I did that _after_ "regenerating" the
metadata by balancing.

The only other inconsistency is that

   btrfs ba start -musage=100 -mdevid=2

kept failing with ENOSPC after doing some work, and

   btrfs ba start -mconvert=dup

worked flawlessly and apparently fixed all errors (other than missing file
data). Maybe the difference is the -mdevid=2 - although the disk had more
than 100G of unallocated space, so that alone wouldn't explain the enospc.

Just FYI, here are example kernel messages for such a failed balance with
only -musage:

Apr 24 22:08:01 doom kernel: [ 4051.894190] BTRFS info (device dm-32): balance: start -musage=100,devid=2
Apr 24 22:08:02 doom kernel: [ 4052.194964] BTRFS info (device dm-32): relocating block group 35508773191680 flags metadata|raid1
Apr 24 22:08:02 doom kernel: [ 4052.296436] BTRFS info (device dm-32): relocating block group 35507699449856 flags metadata|raid1
Apr 24 22:08:02 doom kernel: [ 4052.410760] BTRFS info (device dm-32): relocating block group 35506625708032 flags metadata|raid1
Apr 24 22:08:02 doom kernel: [ 4052.552481] BTRFS info (device dm-32): relocating block group 35505551966208 flags metadata|raid1
Apr 24 22:08:03 doom kernel: [ 4052.940950] BTRFS info (device dm-32): relocating block group 35504478224384 flags metadata|raid1
Apr 24 22:08:03 doom kernel: [ 4053.047505] BTRFS info (device dm-32): relocating block group 35503404482560 flags metadata|raid1
Apr 24 22:08:03 doom kernel: [ 4053.128938] BTRFS info (device dm-32): relocating block group 35502330740736 flags metadata|raid1
Apr 24 22:08:03 doom kernel: [ 4053.218385] BTRFS info (device dm-32): relocating block group 35501256998912 flags metadata|raid1
Apr 24 22:08:03 doom kernel: [ 4053.326941] BTRFS info (device dm-32): relocating block group 35500183257088 flags metadata|raid1
Apr 24 22:08:03 doom kernel: [ 4053.432318] BTRFS info (device dm-32): relocating block group 35499109515264 flags metadata|raid1
Apr 24 22:08:22 doom kernel: [ 4072.112133] BTRFS info (device dm-32): found 50845 extents
Apr 24 22:08:27 doom kernel: [ 4077.002724] BTRFS info (device dm-32): 3 enospc errors during balance
Apr 24 22:08:27 doom kernel: [ 4077.002727] BTRFS info (device dm-32): balance: ended with status: -28

> > multiple times larger than the memory size, i.e. it was almost certainly
> > writing to directories long _after_ btrfs got an EIO for the respective
> > directory blocks. 
> 
> There would be a surviving mirror copy of the directory, because it's in
> raid1 metadata, so that should be a successful write in degraded mode.
> 
> Uncorrectable EIO on metadata triggers a hard shutdown of all writes to
> the filesystem.  Userspace will definitely be informed when that happens.
> It's something we'd want to avoid with raid1.

Does "Uncorrectable EIO" also mean writes, though? I know from experience
that I get EIO when btrfs hits a metadata error, and that nowadays it is
very successful in correcting metadata errors (which is a relatively new
thing).

My main takeaway from this experiment was that a) I did get my filesystem
back without having to reformat, which is admirable, and b) I can write a
surprising amount of data to a missing disk without seeing anything more
than kernel messages. In my stupidity I can well imagine having a disk
falling out of the "array" and me not noticing it for days.

Arguably, that is how it is though - a write error does not cause btrfs to
dismiss the whole disk, and most write errors cannot be reported back to
userspace, so btrfs would somehow have to correlate write errors and decide
when enough is enough.

OTOH, write errors are very rare on normal disks, and raid controllers
usually immediately kick out a disk on write errors so maybe marking
the disk bad (until a remount or so) might be a good idea - with modern
drives, write errors are almost always a symptom of something very bad
happening that is usually not directly associated with a specific block -
for example, an SSD might be end-of-life and switch to read-only, or
a conventional disk might have run out of spare blocks.

i.e. maybe btrfs shouldn't treat write errors as less bad than read
errors - not sure.

> > This is substantiated by the fact that I was able to
> > list the directories before rebooting, but not afterwards, so some info
> > lived in blocks which were not written but were still cached.
> 
> It sounds like you hit some other kind of failure there (this and the
> "page private not zero" messages.  What kernel was this?

5.4.28 from mainline-ppa (https://kernel.ubuntu.com/~kernel-ppa/mainline/).

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: experiment: suboptimal behaviour with write errors and multi-device filesystems
From: Zygo Blaxell @ 2020-04-28 21:35 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-btrfs

On Tue, Apr 28, 2020 at 08:14:36PM +0200, Marc Lehmann wrote:
> Hi, thanks for your reply!
> 
> On Tue, Apr 28, 2020 at 02:19:59AM -0400, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> > That is _not_ expected.  Directories in btrfs are stored entirely in
> > metadata as btrfs items.  They do not have data blocks in data block
> > groups.
> 
> Ah, ok, yes, I agree then. I wrongly assumed directory data would be
> stored as file data. I am actually very happy to be wrong about this, as
> it makes me even more confident when facing a missing disk in production,
> which is bound to happen.
> 
> That is strange then - I was able to delete the directories (and obviously
> the files inside), but I did that _after_ "regenerating" the
> metadata by balancing.
> 
> The only other inconsistency is that
> 
>    btrfs ba start -musage=100 -mdevid=2
> 
> kept failing with ENOSPC after doing some work, and
> 
>    btrfs ba start -mconvert=dup
> 
> worked flawlessly and apparently fixed all errors (other than missing file
> data). Maybe the difference is the -mdevid=2 - although the disk had more
> than 100G of unallocated space, so that alone wouldn't explain the enospc.

I'm not sure, but my guess is the allocator may have noticed you have
only one disk in degraded mode, and will not be able to allocate more
raid1 block groups (which require 2 disks).  A similar thing happens
when raid5 arrays degrade--allocation continues on remaining disks,
in new block groups.

> Just FYI, here are example kernel messages for such a failed balance with
> only -musage:
> 
> Apr 24 22:08:01 doom kernel: [ 4051.894190] BTRFS info (device dm-32): balance: start -musage=100,devid=2
> Apr 24 22:08:02 doom kernel: [ 4052.194964] BTRFS info (device dm-32): relocating block group 35508773191680 flags metadata|raid1
> Apr 24 22:08:02 doom kernel: [ 4052.296436] BTRFS info (device dm-32): relocating block group 35507699449856 flags metadata|raid1
> Apr 24 22:08:02 doom kernel: [ 4052.410760] BTRFS info (device dm-32): relocating block group 35506625708032 flags metadata|raid1
> Apr 24 22:08:02 doom kernel: [ 4052.552481] BTRFS info (device dm-32): relocating block group 35505551966208 flags metadata|raid1
> Apr 24 22:08:03 doom kernel: [ 4052.940950] BTRFS info (device dm-32): relocating block group 35504478224384 flags metadata|raid1
> Apr 24 22:08:03 doom kernel: [ 4053.047505] BTRFS info (device dm-32): relocating block group 35503404482560 flags metadata|raid1
> Apr 24 22:08:03 doom kernel: [ 4053.128938] BTRFS info (device dm-32): relocating block group 35502330740736 flags metadata|raid1
> Apr 24 22:08:03 doom kernel: [ 4053.218385] BTRFS info (device dm-32): relocating block group 35501256998912 flags metadata|raid1
> Apr 24 22:08:03 doom kernel: [ 4053.326941] BTRFS info (device dm-32): relocating block group 35500183257088 flags metadata|raid1
> Apr 24 22:08:03 doom kernel: [ 4053.432318] BTRFS info (device dm-32): relocating block group 35499109515264 flags metadata|raid1
> Apr 24 22:08:22 doom kernel: [ 4072.112133] BTRFS info (device dm-32): found 50845 extents
> Apr 24 22:08:27 doom kernel: [ 4077.002724] BTRFS info (device dm-32): 3 enospc errors during balance
> Apr 24 22:08:27 doom kernel: [ 4077.002727] BTRFS info (device dm-32): balance: ended with status: -28

> > > multiple times larger than the memory size, i.e. it was almost certainly
> > > writing to directories long _after_ btrfs got an EIO for the respective
> > > directory blocks. 
> > 
> > There would be a surviving mirror copy of the directory, because it's in
> > raid1 metadata, so that should be a successful write in degraded mode.
> > 
> > Uncorrectable EIO on metadata triggers a hard shutdown of all writes to
> > the filesystem.  Userspace will definitely be informed when that happens.
> > It's something we'd want to avoid with raid1.
> 
> Does "Uncorrectable EIO" also mean writes, though? I know from experience
> that I get EIO when btrfs hits a metadata error, and that nowadays it is
> very successful in correcting metadata errors (which is a relatively new
> thing).

Either.  EIO is the result of _two_ read or write failures (for raid1).

> My main takeaway from this experiment was that a) I did get my filesystem
> back without having to reformat, which is admirable, and b) I can write a
> surprising amount of data to a missing disk without seeing anything more
> than kernel messages. In my stupidity I can well imagine having a disk
> falling out of the "array" and me not noticing it for days.

It's critical to continuously monitor btrfs raids by polling 'btrfs
dev stats'.  Ideally there would be an ioctl or something that would
block until they change, so an alert can be generated without polling.
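
A minimal polling sketch of what I mean (untested; the mount point and
interval are made up, and it assumes btrfs-progs' 'device stats -c',
which exits non-zero when any error counter is non-zero):

   #!/bin/sh
   # poll the btrfs error counters, log whenever any of them are non-zero
   MNT=/mnt/data
   while sleep 60; do
       if ! btrfs device stats -c "$MNT" >/dev/null; then
           btrfs device stats "$MNT" | logger -t btrfs-monitor -p daemon.err
       fi
   done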

This is true of most raid implementations.  The whole point of a RAID1
is to _not_ report correctable errors on individual drives to userspace
applications.  There is usually(*) a side-channel for monitoring error
rates, and producing alert notifications when those are not zero.

(*) There are a few RAID implementations out there that don't implement
a monitoring channel, or it's unreliable or hard to use.  When you find
such a RAID implementation, it should be placed directly in an appropriate
e-waste recycling bin.

> Arguably, that is how it is though - a write error does not cause btrfs to
> dismiss the whole disk, and most write errors cannot be reported back to
> userspace, so btrfs would somehow have to correlate write errors and decide
> when enough is enough.

That's a black art at the best of times.  Monitoring software can implement
corrective action, and by not being part of the kernel it can be easily
customized for various levels of error tolerance and response.

That said, some of the response tools are lacking, e.g. 'btrfs replace'
doesn't allow you to rewrite an existing drive with its own contents,
you have to take it offline first (bad: putting the array in degraded
mode) or use a separate disk as the replace target (bad: needs one more
disk than you have).  If you have a disk with known corruption, only
'scrub' will repair it in-place (bad: doesn't work on nodatacow files,
and only works 99.999999925% of the time with crc32c csums).
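
Concretely, the two responses that do exist today look roughly like this
(device names and mount point hypothetical):

   # rebuild onto a spare - needs a separate target device
   btrfs replace start /dev/mapper/failing /dev/mapper/spare /mnt
   # in-place repair of a corrupted-but-present disk (csum-protected data only)
   btrfs scrub start -Bd /mnt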

> OTOH, write errors are very rare on normal disks, and raid controllers
> usually immediately kick out a disk on write errors so maybe marking
> the disk bad (until a remount or so) might be a good idea - with modern
> drives, write errors are almost always a symptom of something very bad
> happening that is usually not directly associated with a specific block -
> for example, an SSD might be end-of-life and switch to read-only, or
> a conventional disk might have run out of spare blocks.
> 
> i.e. maybe btrfs shouldn't treat write errors as less bad than read
> errors - not sure.

Cheap SSDs (and some NAS HDDs) corrupt data randomly without any
indication of failure at all--you have to find the damage with a scrub
later on, and the drive _never_ reports an error even when it is clearly
and obviously and repeatedly failing.  It's hard to imagine how a drive
could behave worse--even locking up the bus is better, at least it's
something different from the "successfully completed" response to a
write or flush command.  No filesystem can detect these errors at
write time.  I'm pretty sure even the disk firmware doesn't know.

Given disks like those, applications cannot rely on write time IO
failure detection--some data is gonna get lost, probably while you're not
watching.  As a service operator I have to ensure that applications are
adequately protected against data loss from write time to the next read
time and later--a scope which includes much more than merely reporting
an IO error during the write.  That means designing applications to not
rely on writing to local disks at all (e.g. multi-host replication), so
what the local filesystem says about the success or failure of a single
individual write operation is not usually interesting.  (*)

All that said, from what you've described, it sounds like there are still
failures even on the stuff btrfs does well?  e.g. there should not have
been a directory search problem at _any_ time with that setup.

(*) Read errors are super important though--even the ones not reported
to userspace--as they are direct evidence of drive failure.  Since 5.0,
btrfs silently ignores some of those, and this even got backported to
4.19 and earlier LTS kernels.  >:-(

> > > This is substantiated by the fact that I was able to
> > > list the directories before rebooting, but not afterwards, so some info
> > > lived in blocks which were not written but were still cached.
> > 
> > It sounds like you hit some other kind of failure there (this and the
> > "page private not zero" messages.  What kernel was this?
> 
> 5.4.28 from mainline-ppa (https://kernel.ubuntu.com/~kernel-ppa/mainline/).
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
> 


* Re: experiment: suboptimal behaviour with write errors and multi-device filesystems
From: Marc Lehmann @ 2020-05-01  1:55 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Tue, Apr 28, 2020 at 05:35:51PM -0400, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> > worked flawlessly and apparently fixed all errors (other than missing file
> > data). Maybe the difference is the -mdevid=2 - although the disk had more
> > than 100G of unallocated space, so that alone wouldn't explain the enospc.
> 
> I'm not sure, but my guess is the allocator may have noticed you have
> only one disk in degraded mode, and will not be able to allocate more
> raid1 block groups (which require 2 disks).  A similar thing happens
> when raid5 arrays degrade--allocation continues on remaining disks,
> in new block groups.

There were at least 2 other disks with some unallocated space available. Could
the -mdevid=2 have limited the allocation or reading somehow?

> > > the filesystem.  Userspace will definitely be informed when that happens.
> > > It's something we'd want to avoid with raid1.
> > 
> > Does "Uncorrectable EIO" also mean writes, though? I know from experience
> > that I get EIO when btrfs hits a metadata error, and that nowadays it is
> > very successful in correcting metadata errors (which is a relatively new
> > thing).
> 
> Either.  EIO is the result of _two_ read or write failures (for raid1).

But then btrfs doesn't correct underlying EIO errors on write in raid1,
i.e. it gets EIO from the block write, and doesn't fix it.

> > My main takeaway from this experiment was that a) I did get my filesystem
> > back without having to reformat, which is admirable, and b) I can write a
> > surprising amount of data to a missing disk without seeing anything more
> > than kernel messages. In my stupidity I can well imagine having a disk
> > falling out of the "array" and me not noticing it for days.
> 
> It's critical to continuously monitor btrfs raids by polling 'btrfs
> dev stats'.  Ideally there would be an ioctl or something that would
> block until they change, so an alert can be generated without polling.

Right, that might be helpful.

> This is true of most raid implementations.  The whole point of a RAID1
> is to _not_ report correctable errors on individual drives to userspace
> applications.  There is usually(*) a side-channel for monitoring error
> rates, and producing alert notifications when those are not zero.

Well, the data wasn't raid1, but single, and no error was ever reported.

My concern is that btrfs will happily, continuously and mostly silently lose
data practically forever by assuming a disk that gives an error on every
access is still there and able to hold data.

That behaviour is very different to other "raid" implementations.

> 'scrub' will repair it in-place (bad: doesn't work on nodatacow files,
> and only works 99.999999925% of the time with crc32c csums).

I assume that calculation assumes random bit errors - but that is rarely
the case. In this case, for example, there were no crc32 errors, all
detection came from other layers ("parent transid failed" etc.).

> Cheap SSDs (and some NAS HDDs) corrupt data randomly without any
[...]
> All that said, from what you've described, it sounds like there are still

I'm not sure I can follow you here completely - from what you write, it
sounds like "some disks fail silently, so btrfs doesn't care when disks fail
loudly".

I mean, in the case described, there were no silent failures except maybe in
the split second before the disk disconnected (and not even then when the
raid controller keeps the cache and writes it later).

All failures were properly reported (by device-mapper in this case), i.e.
every read and write caused an EIO to be reported to btrfs from the block
layer.

Just because some disks behave badly doesn't seem like sufficient reason to
me to completely ignore cases where errors _are_ being reported.

I don't think my case is very unlikely - it's basically how linux behaves
when lvm is used and, say, one of your disks has a temporary outage - the
device node might go away and all accesses will result in EIO.

Other filesystems can get around this by not supporting multiple devices and
relying on underlying systems (e.g. software or hardware raid) to make the
disks appear as a single device.

I do think btrfs would need more robust error handling for such cases -
I don't know *any* raid implementation that ignores write errors, for
example, and I don't think there is any raid implementation that ignores
missing disks.

> failures even on the stuff btrfs does well?  e.g. there should not have
> been a directory search problem at _any_ time with that setup.

That was my expectation, although I am well aware that this is still under
development. I am already positively surprised that I was able to get an
(apparently) fully functional filesystem back after something so drastic,
with relatively little effort (metadata profile conversion).

Tooling is not so much of an issue for me, the biggest issue would be
detecting which files are on the missing disk, and if I can't come up with
something better (e.g. ioctls to query the data block location), I can
always read all files and see which of them error out and restore them.

> (*) Read errors are super important though--even the ones not reported
> to userspace--as they are direct evidence of drive failure.

Under what conditions are write errors not even more direct evidence
of drive failure (usually, but not exclusively, indicating far bigger
problems than single block errors)?

> Since 5.0, btrfs silently ignores some of those, and this even got
> backported to 4.19 and earlier LTS kernels. >:-(

eh, cool, uh :)

Well, I assume that my case of concern - single disk failure - is not
something that will escape my attention forever, so doing a manual
metadata balance is fully viable for me.

My concern is merely that btrfs stubbornly insists a completely missing
disk is totally fine to write to, essentially forever :)

I'm not saying alerting userspace is required in some way. But maybe
btrfs should not consider an obviously badly broken disk as healthy. I
would have expected btrfs to stop writing to a disk when it is told in no
uncertain terms that the write failed, at least at some point.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: experiment: suboptimal behaviour with write errors and multi-device filesystems
From: Zygo Blaxell @ 2020-05-01  3:37 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-btrfs

On Fri, May 01, 2020 at 03:55:20AM +0200, Marc Lehmann wrote:
> On Tue, Apr 28, 2020 at 05:35:51PM -0400, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> > > worked flawlessly and apparently fixed all errors (other than missing file
> > > data). Maybe the difference is the -mdevid=2 - although the disk had more
> > > than 100G of unallocated space, so that alone wouldn't explain the enospc.
> > 
> > I'm not sure, but my guess is the allocator may have noticed you have
> > only one disk in degraded mode, and will not be able to allocate more
> > raid1 block groups (which require 2 disks).  A similar thing happens
> > when raid5 arrays degrade--allocation continues on remaining disks,
> > in new block groups.
> 
> There were at least 2 other disks with some unallocated space available. Could
> the -mdevid=2 have limited the allocation or reading somehow?

It shouldn't.  Balance will happily copy data to a different location on the
same device, and if you have at least 2 disks with unallocated space then
raid1 allocation should always succeed.  Doesn't mean there isn't a
bug there though.  The btrfs allocator is full of 5-year-old bugs,
a few get fixed every month.

> > > > the filesystem.  Userspace will definitely be informed when that happens.
> > > > It's something we'd want to avoid with raid1.
> > > 
> > > Does "Uncorrectable EIO" also mean writes, though? I know from experience
> > > that I get EIO when btrfs hits a metadata error, and that nowadays it is
> > > very successful in correcting metadata errors (which is a relatively new
> > > thing).
> > 
> > Either.  EIO is the result of _two_ read or write failures (for raid1).
> 
> But then btrfs doesn't correct underlying EIO errors on write in raid1,
> i.e. it gets EIO from the block write, and doesn't fix it.

Fixing it would mean repeating the same write--btrfs doesn't feed back
into the allocator to try to reallocate the data on a block group with
at least one chunk disk online.  Nothing to do there.

> > > My main takeaway from this experiment was that a) I did get my filesystem
> > > back without having to reformat, which is admirable, and b) I can write a
> > > surprising amount of data to a missing disk without seeing anything more
> > > than kernel messages. In my stupidity I can well imagine having a disk
> > > falling out of the "array" and me not noticing it for days.
> > 
> > It's critical to continuously monitor btrfs raids by polling 'btrfs
> > dev stats'.  Ideally there would be an ioctl or something that would
> > block until they change, so an alert can be generated without polling.
> 
> Right, that might be helpful.
> 
> > This is true of most raid implementations.  The whole point of a RAID1
> > is to _not_ report correctable errors on individual drives to userspace
> > applications.  There is usually(*) a side-channel for monitoring error
> > rates, and producing alert notifications when those are not zero.
> 
> Well, the data wasn't raid1, but single, and no error was ever reported.

The metadata was raid1, that's all that should matter.

> My concern is that btrfs will happily, continuously and mostly silently lose
> data practically forever by assuming a disk that gives an error on every
> access is still there and able to hold data.
> 
> That behaviour is very different to other "raid" implementations.

Be careful what you ask for:  you have configured a filesystem that can
tolerate metadata write failures (raid1 insists on continuous operation
in degraded mode).  Fine, btrfs checks for metadata write failures, and
handles them by forcing itself read-only.  Data block async write failures
are not normally reported to userspace unless reports are explicitly
requested, so it doesn't matter that the single profile data block writes
are failing, there's nowhere to report those errors.  So it's going to
tolerate all the failures in metadata because that's what you asked for,
and ignore all the failures in data because that's what you asked for.

If your applications didn't call fsync() or use O_SYNC or O_DIRECT or
any of the other weird and wonderful ways to ask the filesystem to tell
you about data block write errors, then userspace won't learn anything
about data block write errors on most Linux filesystems.  The system
calls will exit before the disk is even touched, so there's no way to
tell the application that a delalloc extent write, which maybe happens
30 seconds after the application closed the file, didn't work.  Nor is it
reasonable to shut down the entire filesystem because of a deferred
non-metadata write failure (well, maybe it's reasonable if btrfs had an
'errors=remount-ro' option like ext4's).
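
For reference, the ext4 knob mentioned (a plain mount option, nothing
btrfs-specific; device name hypothetical):

   mount -o errors=remount-ro /dev/sdX1 /mnt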

If you contrive a test case where other filesystems are able to write
their metadata without error, but all data block writes (excluding
directory block writes) fail with EIO, they will do the same thing as btrfs.
It's a simple experiment with a USB stick (make sure to defeat your distro
which might be mounting with 'sync'):  write a multi-megabyte file to a
filesystem on the USB stick with 'cat', and let cat finish the write and
close the file without error.  Then immediately pull the USB stick out
before writeback starts.  Result:  the application sees write success,
the file is gone (or, depending on the filesystem, contains garbage),
and the only indication of failure is a handful of kernel messages about
the lost USB device.
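
In shell terms the experiment is roughly this (device and paths
hypothetical):

   mount /dev/sdX1 /mnt/stick         # make sure it isn't mounted 'sync'
   cat bigfile > /mnt/stick/bigfile   # completes from the page cache, no error
   # yank the stick here, before writeback starts:
   # no syscall ever failed, yet the file never reached the medium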

If you write a file and you do call fsync() on btrfs, and fsync() doesn't
report an IO error, that's...possibly?...a bug.  If fsync writes the data
to metadata block groups then there will be no error because a failure
did not happen, but if fsync writes the data to data block groups then
a failure does happen and fsync should report it.  So e.g. a small file
that becomes an inline extent won't trigger an error (and won't be lost)
but a larger file will.
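
A quick way to see the difference without writing C (paths hypothetical;
dd's conv=fsync issues one fsync at the end of the copy):

   dd if=/dev/zero of=/mnt/test/file bs=1M count=64 conv=fsync
   # with a failing data device the final fsync should surface EIO and dd
   # exits non-zero; drop conv=fsync and the same copy "succeeds"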

Other "raid" implementations for the most part don't support "care about
some data but not others" operating modes like the ones btrfs has.
There's nothing like data-single/metadata-raid1 in any standard RAID
configuration.  Each profile is operating correctly according to its
rules, even if the combination of those rules doesn't make much sense
in terms of real-world use cases.

> > 'scrub' will repair it in-place (bad: doesn't work on nodatacow files,
> > and only works 99.999999925% of the time with crc32c csums).
> 
> I assume that calculation assumes random bit errors - but that is rarely
> the case. In this case, for example, there were no crc32 errors, all
> detection came from other layers ("parent transid failed" etc.).

Parent transid verification happens after csum checks.  For metadata, the
csum is stored inline in the block, so any older version of the metadata
(e.g. a page that was previously present but failed to be overwritten)
will pass the csum check but fail the later checks on level and transid.

> > Cheap SSDs (and some NAS HDDs) corrupt data randomly without any
> [...]
> > All that said, from what you've described, it sounds like there are still
> 
> I'm not sure I can follow you here completely - from what you write, it
> sounds like "some disks fail silently, so btrfs doesn't care when disks fail
> loudly".
> 
> I mean, in the case described, there were no silent failures except maybe in
> the split second before the disk disconnected (and not even then when the
> raid controller keeps the cache and writes it later).
> 
> All failures were properly reported (by device-mapper in this case), i.e.
> every read and write caused an EIO to be reported to btrfs from the block
> layer.
> 
> Just because some disks behave badly doesn't seem like sufficient reason to
> me to completely ignore cases where errors _are_ being reported.
> 
> I don't think my case is very unlikely - it's basically how linux behaves
> when lvm is used and, say, one of your disks has a temporary outage - the
> device node might go away and all accesses will result in EIO.

Were the read accesses not returning EIO?  That would be a bug.

Asynchronous writes have terrible reporting facilities on all Linux
filesystems.  btrfs is not inconsistent here.

> Other filesystems can get around this by not supporting multiple devices and
> relying on underlying systems (e.g. software or hardware raid) to make the
> disks appear as a single device.

Indeed, this is contributing to the difference between your expectations
and reality.

> I do think btrfs would need more robust error handling for such cases -
> I don't know *any* raid implementation that ignores write errors, for
> example, and I don't think there is any raid implementation that ignores
> missing disks.

Most RAID implementations have a mode that allows data recovery when the
array has exceeded maximum tolerated failures (e.g. lvm "partial" mode,
or mdadm --force --run).  This is currently the only mode btrfs has.
Until we get some better management tools (kick a device out of the
array, reintegrate a previously disconnected device into an array, all
while online) we're stuck permanently in partial mode.
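
For comparison, the rough equivalents elsewhere (from memory, so treat
these as a sketch):

   mdadm --assemble --force --run /dev/md0 /dev/sda1 /dev/sdb1  # run degraded
   lvchange -ay --activationmode partial vg/lv                  # partial mode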

> > failures even on the stuff btrfs does well?  e.g. there should not have
> > been a directory search problem at _any_ time with that setup.
> 
> That was my expectation, although I am well aware that this is still under
> development. I am already positively surprised that I was able to get an
> (apparently) fully functional filesystem back after something so drastic,
> with relatively little effort (metadata profile conversion).

RAID1 passed my test cases for the first time in 2016 (after some NULL
deref bugs were fixed).  If it's not working today, there has been
a _regression_.

> Tooling is not so much of an issue for me, the biggest issue would be
> detecting which files are on the missing disk, and if I can't come up with
> something better (e.g. ioctls to query the data block location), I can
> always read all files and see which of them error out and restore them.
> 
> > (*) Read errors are super important though--even the ones not reported
> > to userspace--as they are direct evidence of drive failure.
> 
> Under what conditions are write errors not even more direct evidence
> of drive failure (usually, but not exclusively, indicating far bigger
> problems than single block errors)?

Well, this is why you monitor dev stats for write errors (or, for
that matter, the raw disk devices).  The monitor can remount the
filesystem read-only or kill all the applications with open files if
that's what you'd prefer.

> > Since 5.0, btrfs silently ignores some of those, and this even got
> > backported to 4.19 and earlier LTS kernels. >:-(
> 
> eh, cool, uh :)
> 
> Well, I assume that my case of concern - single disk failure - is not
> something that will escape my attention forever, so doing a manual
> metadata balance is fully viable for me.

Remember that data-single metadata-raid1 is a weird case--you're
essentially saying that if a disk fails, you want the filesystem to be
read-only forever (btrfs won't mount it read-write without the missing
disk), and you don't care about which data disappears (since there's
no facility to control which files go on which disks).  I'd say being
confused about when Linux decides to return EIO to userspace--already
well understood on other filesystems--is the least of your problems.  ;)

> My concern is merely that btrfs stubbornly insists a completely missing
> disk is totally fine to write to, essentially forever :)

That's an administrator decision, but btrfs does currently lack the tool
to implement the "remove the failing device" decision.  A workaround is
'echo 1 > /sys/block/sdX/device/delete'.
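
Spelled out for an LVM-backed setup like yours (names hypothetical; the
sysfs trick applies to SCSI/SATA-backed disks):

   lsblk -no pkname /dev/mapper/xmnt-faulty   # underlying device(s), e.g. sdX3
   echo 1 > /sys/block/sdX/device/delete      # hot-unplug sdX from below btrfs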

> I'm not saying alerting userspace is required in some way. But maybe
> btrfs should not consider an obviously badly broken disk as healthy. I
> would have expected btrfs to stop writing to a disk when it is told in
> no uncertain terms that the write failed, at least at some point.

The trouble is that it's a continuum--disks aren't "good" at 0 errors
and "bad" at 1 or write errors.  Even mdadm isn't that strict--they have
maximum error counts, retries, mechanisms to do partial resyncs if disks
come back.  In btrfs this is all block-level stuff, every individual
block has its own sync state and data integrity.

lvm completely ignores _read_ errors during pvmove, a feature I use to
expedite the recovery of broken filesystems (and btrfs ends up not even
being broken).

> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\


* Re: experiment: suboptimal behaviour with write errors and multi-device filesystems
From: Marc Lehmann @ 2020-05-02 18:23 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Thu, Apr 30, 2020 at 11:37:20PM -0400, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> > Well, the data wasn't raid1, but single, and no error was ever reported.
> 
> The metadata was raid1, that's all that should matter.

So, in other words, btrfs is a filesystem for metadata only and doesn't
care about file data.

> > My concern is that btrfs will happily, continuously and mostly silently lose
> > data practically forever by assuming a disk that gives an error on every
> > access is still there and able to hold data.
> > 
> > That behaviour is very different to other "raid" implementations.
> 
> Be careful what you ask for:  you have configured a filesystem that can
> tolerate metadata write failures (raid1 insists on continuous operation
> in degraded mode).  Fine, btrfs checks for metadata write failures, and
> handles them by forcing itself read-only.

I'm quite sure I didn't ask for that. And I'm quite ok with how the
metadata was handled.

> those errors.  So it's going to tolerate all the failures in metadata
> because that's what you asked for, and ignore all the failures in data
> because that's what you asked for.

Well, that's clearly unlike any other filesystem/raid/etc. system out
there.  Maybe this should then be pointed out more clearly, as people
probably approach btrfs with a similar attitude as existing systems.

> If your applications didn't call fsync() or use O_SYNC or O_DIRECT or
> any of the other weird and wonderful ways to ask the filesystem to tell
> you about data block write errors, then userspace won't learn anything
> about data block write errors on most Linux filesystems.  The system
> calls will exit before the disk is even touched, so there's no way to
> tell the application that a delalloc extent write, which maybe happens
> 30 seconds after the application closed the file, didn't work.  Nor is it
> reasonable to shut down the entire filesystem because of a deferred
> non-metadata write failure (well, maybe it's reasonable if btrfs had an
> 'errors=remount-ro' option like ext4's).

I fully agree - that's why I suggested a policy more like other existing
systems: declare the faulty disk as faulty.

> If you contrive a test case where other filesystems are able to write
> their metadata without error, but all data block writes (excluding
> directory block writes) fail with EIO, they will do the same thing as btrfs.

That's not a useful comparison, as the failure case being discussed
(device failure) is all-or-nothing with those, as they don't manage
multiple devices and rely on underlying mechanisms (such as md) to do the
faulty device management.

> It's a simple experiment with a USB stick (make sure to defeat your distro
> which might be mounting with 'sync'):  write a multi-megabyte file to a
> filesystem on the USB stick with 'cat', and let cat finish the write and
> close the file without error.  Then immediately pull the USB stick out
> before writeback starts.  Result:  the application sees write success,
> the file is gone (or, depending on the filesystem, contains garbage),
> and the only indication of failure is a handful of kernel messages about
> the lost USB device.

Yes, I understand that perfectly, as I have initially explained...

> Other "raid" implementations for the most part don't support "care about
> some data but not others" operating modes like the ones btrfs has.

Sure - since those other raids do not support distinguishing data, they
care for all data, not just metadata as btrfs does at the moment. I.e. it
doesn't matter whether the data written is metadata or file data, a faulty
device will be detected at some point.

> configuration.  Each profile is operating correctly according to its
> rules, even if the combination of those rules doesn't make much sense
> in terms of real-world use cases.

As I said, no hardware raid controller I know acts like btrfs "in the real
world", and neither do most software raids. Neither did the linux block
layer in the case at hand, by disallowing writes to the device forever,
raid or not - just unplug the USB stick from your example and plug it in
again - linux will not continue to write to it.

The issue here is that btrfs behaves very differently from other, similar,
real-world systems, and the behaviour clearly makes no sense.

So, sure, btrfs can create rules that make no sense and apply them, but my
whole point is that improving the rules so that they actually make sense is a
worthwhile goal.

> > I don't think my case is very unlikely - it's basically how linux behaves
> > when lvm is used and, say, one of your disks has a temporary outage - the
> > device node might go away and all accesses will result in EIO.
> 
> Were the read accesses not returning EIO?  That would be a bug.

Every access returned EIO (from the block layer), and I assume, but haven't
investigated, that btrfs passed these through.

> Asynchronous writes have terrible reporting facilities on all Linux
> filesystems.  btrfs is not inconsistent here.

Ah, but it is - other filesystems will effectively stop writing to the disk
in these cases (arguably not because they contain code to do so, the block
layer forces this on them). The difference is that other filesystems do not
contain a device manager for multiple devices, so they can let the kernel or
other underlying systems do the disk management.

btrfs can do the disk management itself, but fails to do so in a
reasonable way for missing disks.

> > Other filesystems can get around this by not supporting multiple devices and
> > relying on underlying systems (e.g. software or hardware raid) to make the
> > disks appear as a single device.
> 
> Indeed, this is contributing to the difference between your expectations
> and reality.

Yes, that's why I think btrfs would be improved by kicking faulty disks
out of its filesystem, versus continuing to use them as if they still were
there.

> > I do think btrfs would need more robust error handling for such cases -
> > I don't know *any* raid implementation that ignores write errors, for
> > example, and I don't think there is any raid implementation that ignores
> > missing disks.
> 
> Most RAID implementations have a mode that allows data recovery when the
> array has exceeded maximum tolerated failures (e.g. lvm "partial" mode,
> or mdadm --force --run).  This is currently the only mode btrfs has.

I wish - btrfs simply ignored the missing disk and continued on.

> Until we get some better management tools (kick a device out of the
> array, reintegrate a previously disconnected device into an array, all
> while online) we're stuck permanently in partial mode.

I'm not sure how exactly you define partial mode - lvm and mdraid both
define it as missing "disks".

That's clearly not what btrfs does - btrfs didn't go into any partial
mode, it simply continued on pretending the disk is fine.

> > That was my expectation, although I am well aware that this is still under
> > development. I am already positively surprised that I was able to get an
> > (apparently) fully functional filesystem back after something so drastic,
> > with relatively little effort (metadata profile conversion).
> 
> RAID1 passed my test cases for the first time in 2016 (after some NULL
> deref bugs were fixed).  If it's not working today, there has been
> a _regression_.

That's good to hear - however, the really useful improvements (for this
case) were not in btrfs raid1, but in the fact that btrfs recently got
a lot more picky about treating errors as raid errors (e.g. parent
transid mismatches), and thus using the mirrored information a lot more
aggressively.

That's what allowed it to recover from having been presented an old
version of a member disk.

> > Under what conditions are write errors not even more direct evidence
> > of drive failure (usually, but not exclusively, indicating far bigger
> > problems than single block errors)?
> 
> Well, this is why you monitor dev stats for write errors (or, for
> that matter, the raw disk devices).  The monitor can remount the
> filesystem read-only or kill all the applications with open files if
> that's what you'd prefer.

Well, if that is all that btrfs can do, that's how it has to be done, of
course.

It would clearly be better (and probably trivial) in my eyes if btrfs
would act more like all the other systems out there and limit the amount
of data loss, but I'm not a btrfs developer, so I take what I get :)
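
For what it's worth, a minimal sketch of such a monitor, assuming a
btrfs-progs whose "btrfs device stats -c" sets a non-zero exit status
when any error counter is non-zero (mount point and interval invented):

    #!/bin/sh
    # Hypothetical watchdog -- MNT and the interval are made up.
    MNT=/mnt/archive
    while sleep 60; do
        # -c makes the exit status reflect non-zero error counters
        if ! btrfs device stats -c "$MNT" >/dev/null; then
            logger "btrfs errors on $MNT, remounting read-only"
            mount -o remount,ro "$MNT"
            break
        fi
    done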

> > Well, I assume that my case of concern - single disk failure - is not
> > something that will escape my attention forever, so doing a manual
> > metadata balance is fully viable for me.
>
> Remember that data-single metadata-raid1 is a weird case--you're
> essentially saying that if a disk fails, you want the filesystem to be
> read-only forever (btrfs won't mount it read-write without the missing
> disk)

It doesn't feel weird to me - it seems the only way of limiting data loss.
With metadata=raid1, the filesystem will survive a single device loss, with
some work. With metadata=single, it almost certainly won't.

I currently have a single multi-device filesystem for archival, and it
keeps acquiring disks (it's currently at 7 devices). Single device
failures are almost guaranteed during the lifetime of the fs, and being
able to recover from that without losing all data - restoring only the
missing data - seems not so weird to me.

Also, I didn't want btrfs to be read-only, but that would probably be
preferable over the current behaviour, as it would limit data loss.

> and you don't care about which data disappears (since there's
> no facility to control which files go on which disks).  I'd say being
> confused about when Linux decides to return EIO to userspace--already
> well understood on other filesystems--is the least of your problems. ;)

I'm not sure how to handle this case better, other than using actual raid.

I'm still not sure why you think this is weird, though - btrfs itself has
a "dup" mode which also suffers from the same problems (no facility to
control where the copy of the blocks goes), and with single profile on a
single disk (the most common case), device loss means total data loss.

Why is it so weird to try to limit data loss and restore costs? What's
the point of the dup profile if not the same (limit data loss on partial
failure)?

If dup is so weird, why is it being used by default in some cases in
btrfs?

> > My concern is merely that btrfs stubbornly insists a completely
> > missing disk is totally fine to write to, essentially forever :)
>
> That's an administrator decision, but btrfs does currently lack the tool
> to implement the "remove the failing device" decision.  A workaround is
> 'echo 1 > /sys/block/sdX/device/delete'.

Well, the tool (that makes the decision) should obviously be inside the
kernel, not userspace, because currently I, as an administrator, cannot
make that decision, and it would necessarily be delayed.

Interestingly enough, it's not an administrative decision with other,
similar, systems - the Linux kernel doesn't let me configure the
behaviour btrfs currently has, and neither do software and hardware
raids - they all kick out faulty devices automatically at some point.

> > I'm not saying alerting userspace is required in some way. But maybe
> > btrfs should not consider an obviously badly broken disk as healthy. I
> > would have expected btrfs to stop writing to a disk when it is told in
> > no uncertain terms that the write failed, at least at some point.
>
> The trouble is that it's a continuum--disks aren't "good" at 0 errors
> and "bad" at 1 or more write errors.  Even mdadm isn't that strict--it has
> maximum error counts, retries, mechanisms to do partial resyncs if disks
> come back.  In btrfs this is all block-level stuff, every individual
> block has its own sync state and data integrity.

The problem is that btrfs has a maximum, unconfigurable, error count of
infinity.

I'd be totally happy with an unconfigurable error count of "0", "small",
"bigger", or it being configurable :)

I think you are misunderstanding me - it seems we actually fully
agree that the btrfs behaviour is bad as it is and would gain from
improvement. I'm not proposing any fixed solution, other than having
*some* reasonable kind of data loss limiting inside btrfs, at least in
obvious cases.

> lvm completely ignores _read_ errors during pvmove, a feature I use to
> expedite the recovery of broken filesystems (and btrfs ends up not even
> being broken).

That's interesting - last time I used pvmove on a source with read errors,
it didn't move that data (that was a while ago; most of my volumes nowadays
are raid5'ed and don't suffer from read errors).

More importantly, however, if your source drive fails, pvmove will *not*
skip the rest of the transfer and finish successfully (as btrfs did in
the case we discuss), resulting in very massive data loss, simply because
it cannot commit the new state.

No matter what other tool you look at, none behave as btrfs does
currently. Actual behaviour differs widely in detail, of course, but I
can't come up with a situation where a removed disk will result in upper
layers continuing to use it as if it were there.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: experiment: suboptimal behaviour with write errors and multi-device filesystems
  2020-05-01  3:37         ` Zygo Blaxell
  2020-05-02 18:23           ` Marc Lehmann
@ 2020-05-02 18:27           ` Marc Lehmann
  1 sibling, 0 replies; 10+ messages in thread
From: Marc Lehmann @ 2020-05-02 18:27 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Thu, Apr 30, 2020 at 11:37:20PM -0400, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> > My concern is merely that btrfs stubbornly insists a completely missing
> > disk is totally fine to write to, essentially forever :)
> 
> That's an administrator decision, but btrfs does currently lack the tool
> to implement the "remove the failing device" decision.  A workaround is
> 'echo 1 > /sys/block/sdX/device/delete'.

Ah, I forgot to mention, the kernel did this automatically in my
experiment (as I described - the device node was gone).

The problem is that the upper layers (lvm/dm and btrfs) didn't react to
this - dm because it leaves error handling to the fs, and btrfs because
it didn't have error handling other than "ignore and continue".

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: experiment: suboptimal behaviour with write errors and multi-device filesystems
  2020-05-02 18:23           ` Marc Lehmann
@ 2020-05-02 18:49             ` Remi Gauvin
  2020-05-03  4:16             ` Zygo Blaxell
  1 sibling, 0 replies; 10+ messages in thread
From: Remi Gauvin @ 2020-05-02 18:49 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-btrfs



On 2020-05-02 2:23 p.m., Marc Lehmann wrote:

> 
> That's interesting - last time I used pvmove on a source with read errors,
> it didn't move that data (that was a while ago; most of my volumes nowadays
> are raid5'ed and don't suffer from read errors).
> 
> More importantly, however, if your source drive fails, pvmove will *not*
> skip the rest of the transfer and finish successfully (as btrfs did in
> the case we discuss), resulting in very massive data loss, simply because
> it cannot commit the new state.
> 
> No matter what other tool you look at, none behave as btrfs does
> currently. Actual behaviour differs widely in detail, of course, but I
> can't come up with a situation where a removed disk will result in upper
> layers continuing to use it as if it were there.
> 

I agree with the core of what you said, but I also think you're
overcomplicating it a bit.  If BTRFS is unable to write a single copy of
data, it should go R/O. (God knows, it has enough triggers to go R/O on
its own already; it seems odd that being unable to write data is not
among them.)

A stricter raid error mode could be used to go R/O if any of the
writes fail (rather than btrfs continuing in a degraded mode
indefinitely until reboot), but that would be something that could be a
mount option.





* Re: experiment: suboptimal behaviour with write errors and multi-device filesystems
  2020-05-02 18:23           ` Marc Lehmann
  2020-05-02 18:49             ` Remi Gauvin
@ 2020-05-03  4:16             ` Zygo Blaxell
  1 sibling, 0 replies; 10+ messages in thread
From: Zygo Blaxell @ 2020-05-03  4:16 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-btrfs

On Sat, May 02, 2020 at 08:23:16PM +0200, Marc Lehmann wrote:
> On Thu, Apr 30, 2020 at 11:37:20PM -0400, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> > > Well, the data wasn't raid1, but single, and no error was ever reported.
> > 
> > The metadata was raid1, that's all that should matter.
> 
> So, in other words, btrfs is a filesystem for metadata only and doesn't
> care about file data.

delalloc means applications don't get to learn about write errors
unless they jump through hoops.  If your application doesn't request
this information then it's correct for the application not to have it.
If your application _does_ request this information (via fsync, or mount
-o sync), and btrfs doesn't provide it, it's a bug.  btrfs doesn't do
anything about data block write errors beyond reporting them to userspace
when asked.  This is the same as any other filesystem on Linux.  You
just haven't been paying attention.
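
A quick way to see the hoop-jumping from the shell -- a sketch with
invented paths, where GNU dd's conv=fsync is the "hoop":

    # Plain cp exits 0 once the data is in the page cache, even if
    # the device dies before writeback runs:
    cp big.img /mnt/flaky/

    # dd's conv=fsync calls fsync() before exiting, so the deferred
    # write error becomes a non-zero exit status:
    dd if=big.img of=/mnt/flaky/big.img conv=fsync ||
        echo "write error reported via fsync"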

So that leaves only metadata that btrfs would or should do anything
about by its own initiative.  Other filesystems on Linux that support the
'errors=' mount option behave the same way--they won't stop for a data
block IO error, but they'll optionally panic the entire system for a
metadata block IO error.

> > > My concern is that btrfs will happily, continuously and mostly silently
> > > lose data practically forever by assuming a disk that gives an error on
> > > every access is still there and able to hold data.
> > > 
> > > That behaviour is very different to other "raid" implementations.
> > 
> > Be careful what you ask for:  you have configured a filesystem that can
> > tolerate metadata write failures (raid1 insists on continuous operation
> > in degraded mode).  Fine, btrfs checks for metadata write failures, and
> > handles them by forcing itself read-only.
> 
> I'm quite sure I didn't ask for that. And I'm quite ok with how the
> metadata was handled.
>
> > those errors.  So it's going to tolerate all the failures in metadata
> > because that's what you asked for, and ignore all the failures in data
> > because that's what you asked for.
> 
> Well, that's clearly unlike any other filesystem/raid/etc. system out
> there.  Maybe this should then be pointed out more clearly, as people
> probably approach btrfs with the same attitude as existing systems.
> 
> > If your applications didn't call fsync() or use O_SYNC or O_DIRECT or
> > any of the other weird and wonderful ways to ask the filesystem to tell
> > you about data block write errors, then userspace won't learn anything
> > about data block write errors on most Linux filesystems.  The system
> > calls will exit before the disk is even touched, so there's no way to
> > tell the application that a delalloc extent write, which maybe happens
> > 30 seconds after the application closed the file, didn't work.  Nor is it
> > reasonable to shut down the entire filesystem because of a deferred
> > non-metadata write failure (well, maybe it's reasonable if btrfs had a
> > 'errors=remount-ro' option like ext4's).
> 
> I fully agree, that's why I suggested a policy more like other existing
> systems: declare the faulty disk as faulty.

This would not change anything in your case.

> > If you contrive a test case where other filesystems are able to write
> > their metadata without error, but all data block writes (excluding
> > directory block writes) fail with EIO, they will do the same thing as btrfs.
> 
> That's not a useful comparison, as the failure case being discussed
> (device failure) is all-or-nothing with those, as they don't manage
> multiple devices and rely on underlying mechanisms (such as md) to do
> the faulty device management.

OK, suppose we import what mdadm does in this case.  Result: no change.

> > It's a simple experiment with a USB stick (make sure to defeat your distro
> > which might be mounting with 'sync'):  write a multi-megabyte file to a
> > filesystem on the USB stick with 'cat', and let cat finish the write and
> > close the file without error.  Then immediately pull the USB stick out
> > before writeback starts.  Result:  the application sees write success,
> > the file is gone (or, depending on the filesystem, contains garbage),
> > and the only indication of failure is a handful of kernel messages about
> > the lost USB device.
> 
> Yes, I understand that perfectly, as I have initially explained...

I kind of think you don't, since you keep demonstrating that you don't
understand how every Linux filesystem behaves when it sees data block
write errors on delalloc block writes, nor do you understand what
mdadm and other raid systems do with JBOD and other non-redundant
configurations.

You keep expecting raid1 behaviors from single data.  That doesn't
make sense.  The whole point of single data is that it behaves like
data on other single-device filesystems and filesystems on top of JBOD
devices, not like data on a RAID1 device.  If you expect RAID1 behavior,
use raid1 data on btrfs.  It's your choice, you formatted your filesystem
with single data profile.

> > Other "raid" implementations for the most part don't support "care about
> > some data but not others" operating modes like the ones btrfs has.
> 
> Sure - since those other raids do not support distinguishing data, they
> care for all data, not just metadata as btrfs does at the moment. I.e. it
> doesn't matter whether the data written is metadata or file data; a faulty
> device will be detected at some point.

Faulty devices in btrfs can be detected from dev stats.  It's what
dev stats are for.  A daemon can easily be set up to fire off a btrfs
replace on a spare drive if one of your data disks has write errors.

In your case you used single data profile, so all you can get from a
monitoring daemon is an email telling you that some of your data is
definitely gone now.
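
A rough sketch of such a daemon's core, assuming a hot spare and a
redundant profile that makes replacement worthwhile; device names and
the parsing are illustrative only:

    #!/bin/sh
    # Hypothetical auto-replace hook -- MNT, SPARE and the parsing
    # are assumptions; a real version needs locking and a spare pool.
    MNT=/mnt/archive
    SPARE=/dev/sdh
    # Pick the first device with a non-zero write error counter
    # (stats lines look like "[/dev/sda].write_io_errs   5"):
    dev=$(btrfs device stats "$MNT" |
          awk '$1 ~ /write_io_errs/ && $2 > 0 {
                   gsub(/^\[|\].*/, "", $1); print $1; exit }')
    if [ -n "$dev" ]; then
        logger "write errors on $dev, replacing with $SPARE"
        # -f overwrites any stale filesystem on the spare
        btrfs replace start -f "$dev" "$SPARE" "$MNT"
    fi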

> > configuration.  Each profile is operating correctly according to its
> > rules, even if the combination of those rules doesn't make much sense
> > in terms of real-world use cases.
> 
> As I said, no hardware raid controller I know acts like btrfs "in the real
> world", and neither do most software raids. Neither did the Linux block
> layer in the case at hand - it disallowed writes to the device forever,
> raid or not. Just unplug the USB stick from your example and plug it in
> again - Linux will not continue to write to it.

Every single-disk interface with persistent bus enumeration keeps the
device attached to its dev node.  mdadm does not kick a device out of a
JBOD array because of a write error--this would make the entire filesystem
instantly inaccessible.  Neither do most (all?) SATA interfaces, even if
the device drops off the bus and returns later.  USB can be configured to
do it as well, though USB is unreliable enough that it's dangerous to
enable this by default.

For RAID arrays with redundancy, some controllers and mdadm will kick
individual disks out of arrays _if they can continue to run in degraded
mode without the disk_.  Since you are using 'single' data, the number
of disks you can kick out of the array is zero.  Even if btrfs had a
full copy of the mdadm feature, it would behave _exactly_ the
same way it does now.

> The issue here is that btrfs behaves very differently from other, similar,
> real-world systems, and the behaviour clearly makes no sense.
> 
> So, sure, btrfs can create rules that make no sense and apply them, but my
> whole point is that improving the rules so that they actually make sense is
> a worthwhile goal.
> 
> > > I don't think my case is very unlikely - it's basically how Linux behaves
> > > when lvm is used and, say, one of your disks has a temporary outage - the
> > > device node might go away and all accesses will result in EIO.
> > 
> > Were the read accesses not returning EIO?  That would be a bug.
> 
> Every access returned EIO (from the block layer), and I assume, but haven't
> investigated, that btrfs passed these through.
> 
> > Asynchronous writes have terrible reporting facilities on all Linux
> > filesystems.  btrfs is not inconsistent here.
> 
> Ah, but it is - other filesystems will effectively stop writing to the disk
> in these cases (arguably not because they contain code to do so, but because
> the block layer forces this on them).

This is not correct.  Other filesystems strictly stop writing to the
disk if there are _critical_ (meaning metadata) IO failures.  Data errors
(read or write) are only reported to userspace, and if userspace doesn't
stick around to get the write error status, nobody gets notified about
the lost data.

The block layer only forces a disconnect if the device disconnects.  If
the device stays online but rejects every write request with an error,
then filesystems will keep submitting write requests until they are forced
readonly because of a metadata update failure.

> The difference is that other filesystems do not contain a device manager
> for multiple devices, so they can let the kernel or other underlying
> systems do the disk management.
> 
> btrfs can do the disk management itself, but fails to do so in a
> reasonable way for missing disks.
> 
> > > Other filesystems can get around this by not supporting multiple devices
> > > and relying on underlying systems (e.g. software or hardware raid) to
> > > make the disks appear as a single device.
> > 
> > Indeed, this is contributing to the difference between your expectations
> > and reality.
> 
> Yes, that's why I think btrfs would be improved by kicking faulty disks
> out of its filesystem, instead of continuing to use them as if they were
> still there.

Still different expectations and reality.

> > > I do think btrfs would need more robust error handling for such cases -
> > > I don't know *any* raid implementation that ignores write errors, for
> > > example, and I don't think there is any raid implementation that ignores
> > > missing disks.
> > 
> > Most RAID implementations have a mode that allows data recovery when the
> > array has exceeded maximum tolerated failures (e.g. lvm "partial" mode,
> > or mdadm --force --run).  This is currently the only mode btrfs has.
> 
> I wish - btrfs simply ignored the missing disk and continued on.
> 
> > Until we get some better management tools (kick a device out of the
> > array, reintegrate a previously disconnected device into an array, all
> > while online) we're stuck permanently in partial mode.
> 
> I'm not sure how exactly you define partial mode - lvm and mdraid both
> define it as missing "disks".

Yes.  In btrfs, chunks from different disks are gathered into block
groups.  The RAID profiles work on the block group level, as if you
had made thousands of partitions and then used mdadm to assemble each
pair of partitions into arrays with different profiles.

If there are block groups anywhere in your btrfs where too many disks are
missing (i.e. for single data, if one disk is missing), your filesystem
will be read-only, and you'll be able to read any data that is still
available on remaining disks.  This doesn't happen immediately--disks
fall off the bus and reconnect sometimes--but if you umount the filesystem
and mount it again, you will be in degraded mode because of the missing
disk, and strictly read-only because single profile tolerates zero
disk failures.  If you are not able to recover the missing disk then
you will have to mkfs and copy the surviving data to a freshly formatted
filesystem.
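
Illustratively -- device names made up, and the exact refusal
messages vary by kernel:

    # A normal rw mount fails once a device is missing:
    mount /dev/sdb /mnt                # refused: missing device

    # A degraded mount is allowed, but with single data and a disk
    # gone it is read-only territory:
    mount -o degraded,ro /dev/sdb /mnt

    # Salvage what is still readable, then mkfs and restore:
    cp -a /mnt/surviving /elsewhere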

This is equivalent to the lvm partial and mdadm recovery modes, though
I believe they still allow write access.

It would be nice if btrfs could allow read-write mode here, so you can
delete all your damaged files and then remove the failed disk; however,
currently it doesn't do that, and simply removing the check for missing
disks isn't enough to fix it (I've already tried that ;).

> That's clearly not what btrfs does - btrfs didn't go into any partial
> mode, it simply continued on, pretending the disk was fine.
> 
> > > That was my expectation, although I am well aware that this is still under
> > > development. I am already positively surprised that I was able to get an
> > > (apparently) fully functional filesystem back after something so drastic,
> > > with relatively little effort (metadata profile conversion).
> > 
> > RAID1 passed my test cases for the first time in 2016 (after some NULL
> > deref bugs were fixed).  If it's not working today, there has been
> > a _regression_.
> 
> That's good to hear - however, the really useful improvements (for this
> case) were not in btrfs raid1, but in the fact that btrfs recently got
> a lot more picky about treating errors as raid errors (e.g. parent
> transid mismatches), and thus uses the mirrored information a lot more
> aggressively.
> 
> That's what allowed it to recover from having been presented an old
> version of a member disk.
> 
> > > Under what conditions are write errors not even more direct evidence
> > > of drive failure (usually, but not exclusively, indicating far bigger
> > > problems than single block errors)?
> > 
> > Well, this is why you monitor dev stats for write errors (or, for
> > that matter, the raw disk devices).  The monitor can remount the
> > filesystem read-only or kill all the applications with open files if
> > that's what you'd prefer.
> 
> Well, if that is all that btrfs can do, that's how it has to be done, of
> course.
> 
> It would clearly be better (and probably trivial) in my eyes if btrfs
> would act more like all the other systems out there and limit the amount
> of data loss, but I'm not a btrfs developer, so I take what I get :)
> 
> > > Well, I assume that my case of concern - single disk failure - is not
> > > something that will escape my attention forever, so doing a manual
> > > metadata balance is fully viable for me.
> >
> > Remember that data-single metadata-raid1 is a weird case--you're
> > essentially saying that if a disk fails, you want the filesystem to be
> > read-only forever (btrfs won't mount it read-write without the missing
> > disk)
> 
> It doesn't feel weird to me - it seems the only way of limiting data loss.
> With metadata=raid1, the filesystem will survive a single device loss, with
> some work. With metadata=single, it almost certainly won't.
> 
> I currently have a single multi-device filesystem for archival, and it
> keeps acquiring disks (it's currently at 7 devices). Single device
> failures are almost guaranteed during the lifetime of the fs, and being
> able to recover from that without losing all data - restoring only the
> missing data - seems not so weird to me.
> 
> Also, I didn't want btrfs to be read-only, but that would probably be
> preferable over the current behaviour, as it would limit data loss.

OK, so you've gotten stuck on a trivial side issue, and missed the fact
that your entire plan does not work.

Single data means if you lose any disk (with data on it) then the
filesystem will be forced read-only and you have to start over with a
new filesystem.  You'll need raid1, raid5, or raid6 (or raid10, why not)
to survive disk failures; otherwise, the entire filesystem stops when
any disk is not available at mount time.  Here, all raid1 metadata does
for you is give you the same protection against bitrot and bad sectors
that dup metadata does.  The advantage of raid1 metadata is that raid1 is
faster than dup.  It's still fundamentally a filesystem on top of JBOD,
and it behaves exactly as such.

The partial recovery you speak of only works with sector-level errors
(UNC sectors and csum failures).  I suppose if you restore the superblocks
of missing disks onto a new disk, you could fool btrfs into thinking the
entire disk is still online, then use scrub to recover the metadata, then
delete all the broken files, but you may find the results disappointing.

Even when mounted read-only, 1/7 of the data extents are gone, so files
will be damaged according to their size:  14% of files 4K and smaller will
be 100% destroyed, 100% of files over 7GB will be missing at least 14% of
their contents, files between those sizes will have various probabilities
and quantities of loss in between.

Depending on how your file sizes are
distributed, you might keep 10-20% of the files intact, which is better
than 0%, but it's very unlikely you'll get anything close to 86%.
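
The arithmetic behind those numbers, treating each extent as
independently lost with probability 1/7 (real allocation is only
approximately independent):

    P(intact) = (6/7)^n        for a file spanning n extents

    n = 1  (files <= 4K):  (6/7)^1  ~ 86% survive, 14% destroyed
    n = 10:                (6/7)^10 ~ 21% survive
    n = 50:                (6/7)^50 ~ 0.05% survive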

raid5 data, despite the bugs, costs only one drive's worth of capacity
(the largest one) and can recover from losing a disk.  Be sure to use
space_cache=v2 and raid1 metadata with raid5 data (space_cache=v1 is
stored in data blocks and can be corrupted when used with raid5, and
raid5 metadata has two other filesystem-killing problems).
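
A setup along those lines might be created like this (device names are
placeholders; check the state of the raid5 fixes for your kernel
first):

    # raid1 metadata, raid5 data across three disks:
    mkfs.btrfs -m raid1 -d raid5 /dev/sda /dev/sdb /dev/sdc

    # space_cache=v2 stores the free space tree in metadata (raid1
    # here) rather than in data blocks:
    mount -o space_cache=v2 /dev/sda /mnt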

With _major_ surgery in the delalloc write handling, btrfs could detect
that one disk is producing write errors, and relocate the data to a
remaining disk with free space in single profile block groups, and try
the write again; however, the cost of that work would be paid in years
of regression fixes, and the gains achieved would be very small, like
having only 75% of a filesystem damaged by a single-disk failure instead
of 80%, or having all but the most critical 30 seconds of a log file
as a disk fails.

> > and you don't care about which data disappears (since there's
> > no facility to control which files go on which disks).  I'd say being
> > confused about when Linux decides to return EIO to userspace--already
> > well understood on other filesystems--is the least of your problems. ;)
> 
> I'm not sure how to handle this case better, other than using actual raid.
> 
> I'm still not sure why you think this is weird, though - btrfs itself has
> a "dup" mode which also suffers from the same problems (no facility to
> control where the copy of the blocks goes), and with single profile on a
> single disk (the most common case), device loss means total data loss.
> 
> Why is it so weird to try to limit data loss and restore costs? What's
> the point of the dup profile if not the same (limit data loss on partial
> failure)?

It's weird because single data does not limit data loss or restore costs.
At most, single data can reliably report data loss after the fact.
After a failure, you'll need 6 new drives to copy the shattered remains
of your filesystem to, and it might be cheaper to buy an 8th disk today
for RAID5.

The "raid1" in the raid1 metadata doesn't help you survive disk failures.
It's just a faster version of dup metadata for 2 or more drives, with
all the same failure modes.

> If dup is so weird, why is it being used by default in some cases in
> btrfs?

dup isn't weird on single-disk filesystems.  There are many more
individual sector errors and silent corruption on single disks than
there are total disk failures, and dup profile can fix most of them.
dup metadata is essential on cheap SSDs since silent data corruption is
their most common failure mode.

It's a bit unfortunate that btrfs's default is still to use single
metadata on SSD--it's going to eat a lot of filesystems on low-end
machines.
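
For reference, converting is a quick online operation (mount point
illustrative):

    # Convert existing metadata block groups to dup, in place:
    btrfs balance start -mconvert=dup /mnt

    # Or choose dup at mkfs time on a single disk:
    mkfs.btrfs -m dup -d single /dev/sdX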

> > > My concern is merely that btrfs stubbornly insists a completely
> > > missing disk is totally fine to write to, essentially forever :)
> >
> > That's an administrator decision, but btrfs does currently lack the tool
> > to implement the "remove the failing device" decision.  A workaround is
> > 'echo 1 > /sys/block/sdX/device/delete'.
> 
> Well, the tool (that makes the decision) should obviously be inside the
> kernel, not userspace, because currently I, as an administrator, cannot
> make that decision, and it would necessarily be delayed.
> 
> Interestingly enough, it's not an administrative decision with other,
> similar, systems - the Linux kernel doesn't let me configure the
> behaviour btrfs currently has, and neither do software and hardware
> raids - they all kick out faulty devices automatically at some point.
> 
> > > I'm not saying alerting userspace is required in some way. But maybe
> > > btrfs should not consider an obviously badly broken disk as healthy. I
> > > would have expected btrfs to stop writing to a disk when it is told in
> > > no uncertain terms that the write failed, at least at some point.
> >
> > The trouble is that it's a continuum--disks aren't "good" at 0 errors
> > and "bad" at 1 or more write errors.  Even mdadm isn't that strict--it has
> > maximum error counts, retries, mechanisms to do partial resyncs if disks
> > come back.  In btrfs this is all block-level stuff, every individual
> > block has its own sync state and data integrity.
> 
> The problem is that btrfs has a maximum, unconfigurable, error count of
> infinity.
> 
> I'd be totally happy with an unconfigurable error count of "0", "small",
> "bigger", or it being configurable :)
> 
> I think you are misunderstanding me - it seems we actually fully
> agree that the btrfs behaviour is bad as it is and would gain from
> improvement.

The current behavior isn't wrong--none of the data integrity requirements
are violated.  Potential improvements in this area are mostly related
to not spamming the kernel log with errors from bad drives, and doing
something about disks that don't report errors and don't corrupt data,
but have suddenly become orders of magnitude slower.  It would be nice to
say "look I know you think sda is healthy, but get rid of it immediately"
to btrfs directly, as opposed to talking to the block layer underneath
and hoping it has a device delete function of some kind.
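
For completeness, the block-layer route -- sdX is a placeholder, and
note this detaches the disk from the kernel entirely, not just from
btrfs:

    # Tell the SCSI layer to drop the device; btrfs then sees EIO
    # everywhere and goes read-only on the next failed metadata write.
    echo 1 > /sys/block/sdX/device/delete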

This is necessarily not automated kernel-side, unless we get some more
sophisticated primitives underneath (e.g. "migrate metadata to new disk"
or "replace with chunk tree relocation").  You definitely do not want to
throw btrfs metadata into degraded raid1 mode without soberly considering
the risk tradeoffs involved--that implies single metadata, and when that
happens, you are literally 1 wrong bit away from not having a filesystem
any more.  Proactive monitoring (or an automated script with a pool of
available spare drives to deploy for replacement) is essential.

> I'm not proposing any fixed solution, other than having
> *some* reasonable kind of data loss limiting inside btrfs, at least in
> obvious cases.
> 
> > lvm completely ignores _read_ errors during pvmove, a feature I use to
> > expedite the recovery of broken filesystems (and btrfs ends up not even
> > being broken).
> 
> That's interesting - last time I used pvmove on a source with read errors,
> it didn't move that data (that was a while ago; most of my volumes nowadays
> are raid5'ed and don't suffer from read errors).

Interesting, I've never seen it stop.  I've pvmoved plenty of LVs from
half-broken drives (UNC remapping table full).  If it didn't work,
I'd have to create a new LV on a new PV and then use dd_rescue to copy
the data--but I've never had to do that, pvmove just plows through as
megabytes of IO errors scroll by on dmesg.  There are plenty of error
counters in sysfs that increment while this happens, but nothing ever
seems to check them.

> More importantly, however, if your source drive fails, pvmove will *not*
> skip the rest of the transfer and finish successfully (as btrfs did in
> the case we discuss), resulting in very massive data loss, simply because
> it cannot commit the new state.

lvm copies the remaining data, and then reports success (well, it doesn't
report anything per se, it just finishes the operation and updates the VG
config when it's done).  Obviously the data is garbage in the unreadable
blocks, but a scrub fixes that.

> No matter what other tool you look at, none behave as btrfs does
> currently. Actual behaviour differs widely in detail, of course, but I
> can't come up with a situation where a removed disk will result in upper
> layers continuing to use it as if it were there.

See lvm.conf, activation_mode "partial":

        #   partial
        #     Allows the activation of any LV even if a missing or failed PV
        #     could cause data loss with a portion of the LV inaccessible.

You can run an ext4 FS on top of a LV with missing PVs.  It behaves mostly
the same way that btrfs does--ignores write errors on data blocks other
than reporting them to the application, and keeps going until it hits
an IO error on a metadata update.  This part of what btrfs does isn't
the incorrect part--or if it is, then all the other filesystems on Linux
are wrong too.
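
For comparison, the ext4 knob mentioned earlier (device and mount
point invented):

    # The reaction to *metadata* IO errors is chosen at mount time;
    # data write errors are only ever reported to the application:
    mount -o errors=remount-ro /dev/sdX /mnt   # or errors=continue, errors=panic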

> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
> 


Thread overview: 10+ messages
2020-04-26 12:46 experiment: suboptimal behaviour with write errors and multi-device filesystems Marc Lehmann
2020-04-28  6:19 ` Zygo Blaxell
2020-04-28 18:14   ` Marc Lehmann
2020-04-28 21:35     ` Zygo Blaxell
2020-05-01  1:55       ` Marc Lehmann
2020-05-01  3:37         ` Zygo Blaxell
2020-05-02 18:23           ` Marc Lehmann
2020-05-02 18:49             ` Remi Gauvin
2020-05-03  4:16             ` Zygo Blaxell
2020-05-02 18:27           ` Marc Lehmann
