linux-btrfs.vger.kernel.org archive mirror
* btrfs dev del not transaction protected?
@ 2019-12-20  4:05 Marc Lehmann
  2019-12-20  5:24 ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2019-12-20  4:05 UTC (permalink / raw)
  To: linux-btrfs

Hi!

I used btrfs del /somedevice /mountpoint to remove a device, and then typed
sync. A short time later the system had a hard reset.

Now the file system doesn't mount read-write anymore because it complains
about a missing device (linux 5.4.5):

[  247.385346] BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
[  247.386942] BTRFS error (device dm-32): failed to read chunk tree: -2
[  247.462693] BTRFS error (device dm-32): open_ctree failed

The thing is, the device is still there and accessible, but btrfs no longer
recognises it, as it already deleted it before the crash.

I can mount the filesystem in degraded mode, and I have a backup in case
something isn't readable, so this is merely a costly inconvenience for me
(it's a 40TB volume), but this seems very unexpected, both that device
dels apparently have a race condition and that sync doesn't actually
synchronise the filesystem - I naively expected that btrfs dev del doesn't
cause the loss of the filesystem due to a system crash.

Probably not related, but maybe worth mentioning: I found that system
crashes (resets, not power failures) cause btrfs to not mount the first
time a mount is attempted, but it always succeeds the second time, e.g.:

   # mount /device /mnt
   ... no errors or warnings in kernel log, except:
   BTRFS error (device dm-34): open_ctree failed
   # mount /device /mnt
   magically succeeds

The typical symptom here is that systemd goes into emergency mode on mount
failure, but simply rebooting, or executing the mount manually, then succeeds.

Greetings,
Marc

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: btrfs dev del not transaction protected?
  2019-12-20  4:05 btrfs dev del not transaction protected? Marc Lehmann
@ 2019-12-20  5:24 ` Qu Wenruo
  2019-12-20  6:37   ` Marc Lehmann
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2019-12-20  5:24 UTC (permalink / raw)
  To: Marc Lehmann, linux-btrfs





On 2019/12/20 12:05 PM, Marc Lehmann wrote:
> Hi!
> 
> I used btrfs del /somedevice /mountpoint to remove a device, and then typed
> sync. A short time later the system had a hard reset.

Then it doesn't look like the title.

Normally for sync, btrfs will commit the transaction, thus even if something
like the title happened, you shouldn't be affected at all.

> 
> Now the file system doesn't mount read-write anymore because it complains
> about a missing device (linux 5.4.5):
> 
> [  247.385346] BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
> [  247.386942] BTRFS error (device dm-32): failed to read chunk tree: -2
> [  247.462693] BTRFS error (device dm-32): open_ctree failed

Is that devid 1 the device you tried to delete?
Or some unrelated device?

> 
> The thing is, the device is still there and accessible, but btrfs no longer
> recognises it, as it already deleted it before the crash.

I think it's not what you thought, but btrfs device scan is not properly
triggered.

Would you please give some more dmesg? As each scanned btrfs device will
show up in dmesg.
That would help us to pin down the real cause.
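
For reference, a hedged sketch of how to re-trigger the scan and capture
those messages (assuming standard btrfs-progs; adjust paths as needed):

   # btrfs device scan            ask the kernel to (re)scan block devices for btrfs members
   # dmesg | grep -i btrfs        each recognised member logs a "device label ... devid ... transid" line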

> 
> I can mount the filesystem in degraded mode, and I have a backup in case
> something isn't readable, so this is merely a costly inconvenience for me
> (it's a 40TB volume), but this seems very unexpected, both that device
> dels apparently have a race condition and that sync doesn't actually
> synchronise the filesystem - I naively expected that btrfs dev del doesn't
> cause the loss of the filesystem due to a system crash.
> 
> Probably not related, but maybe worth mentioning: I found that system
> crashes (resets, not power failures) cause btrfs to not mount the first
> time a mount is attempted, but it always succeeds the second time, e.g.:
> 
>    # mount /device /mnt
>    ... no errors or warnings in kernel log, except:
>    BTRFS error (device dm-34): open_ctree failed
>    # mount /device /mnt
>    magically succeeds

Yep, this makes it sound more like a scan related bug.

Thanks,
Qu

> 
> The typical symptom here is that systemd goes into emergency mode on mount
> failure, but simply rebooting, or executing the mount manually, then succeeds.
> 
> Greetings,
> Marc
> 




* Re: btrfs dev del not transaction protected?
  2019-12-20  5:24 ` Qu Wenruo
@ 2019-12-20  6:37   ` Marc Lehmann
  2019-12-20  7:10     ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2019-12-20  6:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Dec 20, 2019 at 01:24:20PM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > I used btrfs del /somedevice /mountpoint to remove a device, and then typed
> > sync. A short time later the system had a hard reset.
> 
> Then it doesn't look like the title.

Hmm, I am not sure I understand: do you mean the subject? The command here
is obviously not copied and pasted, and when typing it into my mail client,
I forgot the "dev" part. The exact command, I think, was this:

   btrfs dev del /dev/mapper/xmnt-cold13 /oldcold

> Normally for sync, btrfs will commit the transaction, thus even if something
> like the title happened, you shouldn't be affected at all.

Exactly, that is my expectation.

> > [  247.385346] BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
> > [  247.386942] BTRFS error (device dm-32): failed to read chunk tree: -2
> > [  247.462693] BTRFS error (device dm-32): open_ctree failed
> 
> Is that devid 1 the device you tried to delete?
> Or some unrelated device?

I think the device I removed had devid 1. I am not 100% sure, but I am
reasonably sure because I had "watch -n10 btrfs dev us" running while
waiting for the removal to finish and not being able to control the device
ids triggers my ocd reflexes (mostly because btrfs fi res needs the device
id even for some single-device filesystems :), so I kind of memorised
them.

> > The thing is, the device is still there and accessible, but btrfs no longer
> > recognises it, as it already deleted it before the crash.
> 
> I think it's not what you thought, but btrfs device scan is not properly
> triggered.

Quite possible - I based my statement that it is no longer recognized
on the fact that a) blkid also didn't recognize a filesystem on
the removed device anymore and b) btrfs found the other two remaining
devices, so if btrfs scan is not properly triggered, then this is a
serious issue in current GNU/Linux distributions (I use debian buster on
that server).

I assume that the device is not recognised as btrfs by blkid anymore
because the signature had been wiped by btrfs dev del, based on previous
experience, but I of course can't exactly know it's not, say, a hardware
error that wiped that disk, although I would find that hard to believe :)

> Would you please give some more dmesg? As each scanned btrfs device will
> show up in dmesg.

Here should be all btrfs-related messages for this (from grep -i btrfs):

 [   10.288533] BTRFS: device label ROOT devid 1 transid 2106939 /dev/mapper/vg_doom-root
 [   10.314498] BTRFS info (device dm-0): disk space caching is enabled
 [   10.316488] BTRFS info (device dm-0): has skinny extents
 [   10.900930] BTRFS info (device dm-0): enabling ssd optimizations
 [   10.902741] BTRFS info (device dm-0): disk space caching is enabled
 [   11.524129] BTRFS info (device dm-0): device fsid bb3185c8-19f0-4018-b06f-38678c06c7c2 devid 1 moved old:/dev/mapper/vg_doom-root new:/dev/dm-0
 [   11.528554] BTRFS info (device dm-0): device fsid bb3185c8-19f0-4018-b06f-38678c06c7c2 devid 1 moved old:/dev/dm-0 new:/dev/mapper/vg_doom-root
 [   42.273530] BTRFS: device label LOCALVOL3 devid 1 transid 1240483 /dev/dm-28
 [   42.312354] BTRFS info (device dm-28): enabling auto defrag
 [   42.314152] BTRFS info (device dm-28): force zstd compression, level 12
 [   42.315938] BTRFS info (device dm-28): using free space tree
 [   42.317696] BTRFS info (device dm-28): has skinny extents
 [   49.115007] BTRFS: device label LOCALVOL5 devid 1 transid 146201 /dev/dm-29
 [   49.138816] BTRFS info (device dm-29): using free space tree
 [   49.140590] BTRFS info (device dm-29): has skinny extents
 [  102.348872] BTRFS info (device dm-29): checking UUID tree
 [  102.393185] BTRFS: device label COLD1 devid 5 transid 1876906 /dev/dm-30
 [  109.626550] BTRFS: device label COLD1 devid 4 transid 1876907 /dev/dm-32
 [  109.654401] BTRFS: device label COLD1 devid 3 transid 1876907 /dev/dm-31
 [  109.656171] BTRFS info (device dm-32): use zstd compression, level 12
 [  109.657924] BTRFS info (device dm-32): using free space tree
 [  109.660917] BTRFS info (device dm-32): has skinny extents
 [  109.662687] BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
 [  109.664832] BTRFS error (device dm-32): failed to read chunk tree: -2
 [  109.742501] BTRFS error (device dm-32): open_ctree failed

At this point, /dev/mapper/xmnt-cold11 (dm-32),
/dev/mapper/xmnt-oldcold12 (dm-31) and /dev/mapper/xmnt-cold14 (dm-30)
were the remaining disks in the filesystem, while xmnt-cold13 was the
device I had formerly removed (which doesn't show up).

(There are two btrfs filesystems with the COLD1 label in this machine at
the moment, as I was migrating the fs, but the above COLD1 messages should
all relate to the same fs).

"blkid -o value -s TYPE /dev/mapper/xmnt-cold13" didn't give any output
(the mounting script checks for that and pauses to make provisioning
of new disks easier), while normally it would give "btrfs" on volume
members. This, I think, would be normal behaviour for devices that have
been removed from a btrfs.
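
A hedged way to double-check that, should it come up again (wipefs without an
erase option only lists signatures; -p makes blkid probe the device directly):

   # blkid -p /dev/mapper/xmnt-cold13
   # wipefs /dev/mapper/xmnt-cold13
   empty output from both would be consistent with the superblock having been
   wiped by the completed device removal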

BTW, the four devices in question are all dmcrypt-on-lvm and are single
devices in a hardware raid controller (a perc h740).

>>> Probably not related, but maybe worth mentioning: I found that system
> > crashes (resets, not power failures) cause btrfs to not mount the first
> > time a mount is attempted, but it always succeeds the second time, e.g.:
> > 
> >    # mount /device /mnt
> >    ... no errors or warnings in kernel log, except:
> >    BTRFS error (device dm-34): open_ctree failed
> >    # mount /device /mnt
> >    magically succeeds
> 
> Yep, this makes it sound more like a scan related bug.

BTW, this (second issue) also happens with filesystems that are not
multi-device. Not sure if that means that btrfs scan would be involved, as
I would assume the only device btrfs would need in such cases is the one
given to mount, but maybe that also needs a working btrfs scan?

Thanks for your work on btrfs btw. :)

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: btrfs dev del not transaction protected?
  2019-12-20  6:37   ` Marc Lehmann
@ 2019-12-20  7:10     ` Qu Wenruo
  2019-12-20 13:27       ` Marc Lehmann
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2019-12-20  7:10 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-btrfs





On 2019/12/20 2:37 PM, Marc Lehmann wrote:
> On Fri, Dec 20, 2019 at 01:24:20PM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>> I used btrfs del /somedevice /mountpoint to remove a device, and then typed
>>> sync. A short time later the system had a hard reset.
>>
>> Then it doesn't look like the title.
> 
> Hmm, I am not sure I understand: do you mean the subject?

Oh, sorry, I mean subject line "btrfs dev del not transaction protected".

> The command here
> is obviously not copied and pasted, and when typing it into my mail client,
> I forgot the "dev" part. The exact command, I think, was this:

No big deal, as we all get the point.

> 
>    btrfs dev del /dev/mapper/xmnt-cold13 /oldcold
>
>> Normally for sync, btrfs will commit the transaction, thus even if something
>> like the title happened, you shouldn't be affected at all.
> 
> Exactly, that is my expectation.
> 
>>> [  247.385346] BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
>>> [  247.386942] BTRFS error (device dm-32): failed to read chunk tree: -2
>>> [  247.462693] BTRFS error (device dm-32): open_ctree failed
>>
>> Is that devid 1 the device you tried to delete?
>> Or some unrelated device?
> 
> I think the device I removed had devid 1. I am not 100% sure, but I am
> reasonably sure because I had "watch -n10 btrfs dev us" running while
> waiting for the removal to finish and not being able to control the device
> ids triggers my ocd reflexes (mostly because btrfs fi res needs the device
> id even for some single-device filesystems :), so I kind of memorised
> them.

Then it looks like a big deal.

After looking into the code (at least the v5.5-rc kernel), btrfs will commit
a transaction after deleting the device item in btrfs_rm_dev_item().

So even if no manual sync is called, as long as there is no error reported
by "btrfs dev del", such a case shouldn't happen.

> 
>>> The thing is, the device is still there and accessible, but btrfs no longer
>>> recognises it, as it already deleted it before the crash.
>>
>> I think it's not what you thought, but btrfs device scan is not properly
>> triggered.
> 
> Quite possible - I based my statement that it is no longer recognized
> on the fact that a) blkid also didn't recognize a filesystem on
> the removed device anymore and b) btrfs found the other two remaining
> devices, so if btrfs scan is not properly triggered, then this is a
> serious issue in current GNU/Linux distributions (I use debian buster on
> that server).

a) means btrfs has wiped the superblock, which happens after
btrfs_rm_dev_item().
Something is not sane now.

> 
> I assume that the device is not recognised as btrfs by blkid anymore
> because the signature had been wiped by btrfs dev del, based on previous
> experience, but I of course can't exactly know it's not, say, a hardware
> error that wiped that disk, although I would find that hard to believe :)
> 
>> Would you please give some more dmesg? As each scanned btrfs device will
>> show up in dmesg.
> 
> Here should be all btrfs-related messages for this (from grep -i btrfs):
> 
>  [   10.288533] BTRFS: device label ROOT devid 1 transid 2106939 /dev/mapper/vg_doom-root
>  [   10.314498] BTRFS info (device dm-0): disk space caching is enabled
>  [   10.316488] BTRFS info (device dm-0): has skinny extents
>  [   10.900930] BTRFS info (device dm-0): enabling ssd optimizations
>  [   10.902741] BTRFS info (device dm-0): disk space caching is enabled
>  [   11.524129] BTRFS info (device dm-0): device fsid bb3185c8-19f0-4018-b06f-38678c06c7c2 devid 1 moved old:/dev/mapper/vg_doom-root new:/dev/dm-0
>  [   11.528554] BTRFS info (device dm-0): device fsid bb3185c8-19f0-4018-b06f-38678c06c7c2 devid 1 moved old:/dev/dm-0 new:/dev/mapper/vg_doom-root
>  [   42.273530] BTRFS: device label LOCALVOL3 devid 1 transid 1240483 /dev/dm-28
>  [   42.312354] BTRFS info (device dm-28): enabling auto defrag
>  [   42.314152] BTRFS info (device dm-28): force zstd compression, level 12
>  [   42.315938] BTRFS info (device dm-28): using free space tree
>  [   42.317696] BTRFS info (device dm-28): has skinny extents
>  [   49.115007] BTRFS: device label LOCALVOL5 devid 1 transid 146201 /dev/dm-29
>  [   49.138816] BTRFS info (device dm-29): using free space tree
>  [   49.140590] BTRFS info (device dm-29): has skinny extents
>  [  102.348872] BTRFS info (device dm-29): checking UUID tree
>  [  102.393185] BTRFS: device label COLD1 devid 5 transid 1876906 /dev/dm-30

dm-30 is one transaction older than other devices.

Is that expected? If not, it may explain why we got the dead device. As
we're using the older superblock, which may point to an older chunk tree which
has the device item.

>  [  109.626550] BTRFS: device label COLD1 devid 4 transid 1876907 /dev/dm-32
>  [  109.654401] BTRFS: device label COLD1 devid 3 transid 1876907 /dev/dm-31

And I'm also curious about the 7s delay between devid5 and devid 3/4
detection.

Can you find a way to make devid 3/4 show up before devid 5 and try again?

And if you find a way to mount the volume RW, please write a single
empty file, and sync the fs, then umount the fs, ensure "btrfs ins
dump-super" gives the same transid of all 3 related disks.

Then the problem *may* be gone if it matches my assumption.
(After all of these succeed, please do an unmounted btrfs check
just to make sure nothing is wrong)
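
A hedged sketch of that sequence (paths and the file name are placeholders;
if only a degraded mount is possible, the RW mount may still be refused when
the missing device held the only copy of a chunk):

   # mount -o degraded /dev/mapper/xmnt-cold11 /oldcold
   # touch /oldcold/transid-bump && sync
   # umount /oldcold
   # btrfs inspect-internal dump-super /dev/mapper/xmnt-cold11 | grep -w generation
   repeat dump-super for the other two members - all three should print the same value
   # btrfs check /dev/mapper/xmnt-cold11
   btrfs check is read-only by default and should be run while the fs is unmounted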

>  [  109.656171] BTRFS info (device dm-32): use zstd compression, level 12
>  [  109.657924] BTRFS info (device dm-32): using free space tree
>  [  109.660917] BTRFS info (device dm-32): has skinny extents
>  [  109.662687] BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
>  [  109.664832] BTRFS error (device dm-32): failed to read chunk tree: -2
>  [  109.742501] BTRFS error (device dm-32): open_ctree failed
> 
> At this point, /dev/mapper/xmnt-cold11 (dm-32),
> /dev/mapper/xmnt-oldcold12 (dm-31) and /dev/mapper/xmnt-cold14 (dm-30)
> were the remaining disks in the filesystem, while xmnt-cold13 was the
> device I had formerly removed (which doesn't show up).
> 
> (There are two btrfs filesystems with the COLD1 label in this machine at
> the moment, as I was migrating the fs, but the above COLD1 messages should
> all relate to the same fs).
> 
> "blkid -o value -s TYPE /dev/mapper/xmnt-cold13" didn't give any output
> (the mounting script checks for that and pauses to make provisioning
> of new disks easier), while normally it would give "btrfs" on volume
> members. This, I think, would be normal behaviour for devices that have
> been removed from a btrfs.
> 
> BTW, the four devices in question are all dmcrypt-on-lvm and are single
> devices in a hardware raid controller (a perc h740).
> 
>>> Probably not related, but maybe worth mentioning: I found that system
>>> crashes (resets, not power failures) cause btrfs to not mount the first
>>> time a mount is attempted, but it always succeeds the second time, e.g.:
>>>
>>>    # mount /device /mnt
>>>    ... no errors or warnings in kernel log, except:
>>>    BTRFS error (device dm-34): open_ctree failed
>>>    # mount /device /mnt
>>>    magically succeeds
>>
>> Yep, this makes it sound more like a scan related bug.
> 
> BTW, this (second issue) also happens with filesystems that are not
> multi-device.

Single device btrfs doesn't need device scan.
If that happened, something insane happened again...
Thanks,
Qu

> Not sure if that means that btrfs scan would be involved, as
> I would assume the only device btrfs would need in such cases is the one
> given to mount, but maybe that also needs a working btrfs scan?
> 
> Thanks for your work on btrfs btw. :)
> 




* Re: btrfs dev del not transaction protected?
  2019-12-20  7:10     ` Qu Wenruo
@ 2019-12-20 13:27       ` Marc Lehmann
  2019-12-20 13:41         ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2019-12-20 13:27 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

> >  [  102.393185] BTRFS: device label COLD1 devid 5 transid 1876906 /dev/dm-30
> 
> dm-30 is one transaction older than other devices.
> 
> Is that expected? If not, it may explain why we got the dead device. As
> we're using the older superblock, which may point to an older chunk tree which
> has the device item.

Well, not that my expectation here would mean anything, but no, from
experience I have never seen the transids disagree, or bad things will
happen...

> >  [  109.626550] BTRFS: device label COLD1 devid 4 transid 1876907 /dev/dm-32
> >  [  109.654401] BTRFS: device label COLD1 devid 3 transid 1876907 /dev/dm-31
> 
> And I'm also curious about the 7s delay between devid5 and devid 3/4
> detection.

That is about the time it takes the disk to wake up when it's spun down,
so maybe that was the case - the disks are used for archiving ("cold"
storage), have a short spin-down and btrfs filesystems can take ages to
mount. The real question is why the fourth disk was already spun up then,
but the disks do not apply timeouts very exactly.

> Can you find a way to make devid 3/4 show up before devid 5 and try again?

Unfortunately, I had to start restoring from backup a while ago, as I need
the machine up and restoring takes days.

How would I go about making it show up in different orders though? If
these messages come up independently, I could have spun down some of the
disks, right?

> And if you find a way to mount the volume RW, please write a single
> empty file, and sync the fs, then umount the fs, ensure "btrfs ins
> dump-super" gives the same transid of all 3 related disks.

I tried -o degraded followed by remounting rw, but couldn't get it to
mount rw. I tried to mount/remount, though:

   04:48:45 doom kernel: BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
   04:48:45 doom kernel: BTRFS error (device dm-32): failed to read chunk tree: -2
   04:48:45 doom kernel: BTRFS error (device dm-32): open_ctree failed
   04:49:37 doom kernel: BTRFS warning (device dm-31): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
   04:52:30 doom kernel: BTRFS warning (device dm-31): chunk 12582912 missing 1 devices, max tolerance is 0 for writable mount
   04:52:30 doom kernel: BTRFS warning (device dm-31): writable mount is not allowed due to too many missing devices
   04:52:30 doom kernel: BTRFS error (device dm-31): open_ctree failed
   04:54:01 doom kernel: BTRFS warning (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
   04:54:45 doom kernel: BTRFS warning (device dm-32): chunk 12582912 missing 1 devices, max tolerance is 0 for writable mount
   04:54:45 doom kernel: BTRFS warning (device dm-32): too many missing devices, writable remount is not allowed

Since (in theory :) the filesystem was completely backed up, I didn't
bother with further recovery after I made sure the physical disk is
actually there and was unlocked (cryptsetup), so it wasn't a case of an
actual missing disk.

> > BTW, this (second issue) also happens with filesystems that are not
> > multi-device.
> 
> Single device btrfs doesn't need device scan.
> If that happened, something insane happened again...
> Thanks,

It happens since at least 4.14 on at least four machines, but I haven't
seen it recently, after I switched to 5.2.21 on some machines (post-4.4
kernels have this habit of freezing under memory pressure, and 5.2.21 has
greatly improved in this regard). That also means I had far fewer hard
resets with 5.2.21, but the problem did not happen on the last resets in
5.2.21 and 5.4.5.

I originally reported it below, with some evidence that it isn't a
hardware issue (no reset needed, just wipe the dm table while the device
is mounted which should cleanly "cut off" the write stream):

https://bugzilla.kernel.org/show_bug.cgi?id=204083

Since multiple scrubs and full reads of the volumes didn't show up any
issues, I didn't think much of it.

And if you want to hear more "insane" things, after I hard-reset
my desktop machine (5.2.21) two days ago I had to "btrfs rescue
fix-device-size" to be able to mount (can't find the kernel error atm.).

Greetings,

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: btrfs dev del not transaction protected?
  2019-12-20 13:27       ` Marc Lehmann
@ 2019-12-20 13:41         ` Qu Wenruo
  2019-12-20 16:53           ` Marc Lehmann
                             ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Qu Wenruo @ 2019-12-20 13:41 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-btrfs





On 2019/12/20 9:27 PM, Marc Lehmann wrote:
>>>  [  102.393185] BTRFS: device label COLD1 devid 5 transid 1876906 /dev/dm-30
>>
>> dm-30 is one transaction older than other devices.
>>
>> Is that expected? If not, it may explain why we got the dead device. As
>> we're using the older superblock, which may point to an older chunk tree which
>> has the device item.
> 
> Well, not that my expectation here would mean anything, but no, from
> experience I have never seen the transids disagree, or bad things will
> happen...
> 
>>>  [  109.626550] BTRFS: device label COLD1 devid 4 transid 1876907 /dev/dm-32
>>>  [  109.654401] BTRFS: device label COLD1 devid 3 transid 1876907 /dev/dm-31
>>
>> And I'm also curious about the 7s delay between devid5 and devid 3/4
>> detection.
> 
> That is about the time it takes the disk to wake up when it's spun down,
> so maybe that was the case - the disks are used for archiving ("cold"
> storage), have a short spin-down and btrfs filesystems can take ages to
> mount. The real question is why the fourth disk was already spun up then,
> but the disks do not apply timeouts very exactly.
> 
>> Can you find a way to make devid 3/4 show up before devid 5 and try again?
> 
> Unfortunately, I had to start restoring from backup a while ago, as I need
> the machine up and restoring takes days.
> 
> How would I go about making it show up in different orders though? If
> these messages come up independently, I could have spun down some of the
> disks, right?

You could utilize the latest "forget" feature, to make btrfs kernel
module forget that device, provided by "btrfs device scan -u".

So the plan would be something like:
- Forget all devices of that volume
- Scan the two disks with higher transid
- Scan the disk with mismatched transid

Then try to mount the volume.
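
As a hedged command sketch (this needs btrfs-progs and a kernel new enough to
have the forget ioctl, i.e. roughly 5.4+; device names are taken from the
earlier dmesg):

   # btrfs device scan -u                        forget all currently unmounted scanned devices
   # btrfs device scan /dev/dm-32 /dev/dm-31     register the two members at transid 1876907 first
   # btrfs device scan /dev/dm-30                then the member that was one transaction behind
   # mount /dev/dm-32 /oldcold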

> 
>> And if you find a way to mount the volume RW, please write a single
>> empty file, and sync the fs, then umount the fs, ensure "btrfs ins
>> dump-super" gives the same transid of all 3 related disks.
> 
> I tried -o degraded followed by remounting rw, but couldn't get it to
> mount rw. I tried to mount/remount, though:
> 
>    04:48:45 doom kernel: BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
>    04:48:45 doom kernel: BTRFS error (device dm-32): failed to read chunk tree: -2
>    04:48:45 doom kernel: BTRFS error (device dm-32): open_ctree failed
>    04:49:37 doom kernel: BTRFS warning (device dm-31): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
>    04:52:30 doom kernel: BTRFS warning (device dm-31): chunk 12582912 missing 1 devices, max tolerance is 0 for writable mount

BTW, that chunk number is very small, and since it has 0 tolerance, it
looks like a SINGLE chunk.

In that case, it looks like a temporary chunk from the older mkfs, and it
should contain no data/metadata at all, thus bringing no data loss.

>    04:52:30 doom kernel: BTRFS warning (device dm-31): writable mount is not allowed due to too many missing devices
>    04:52:30 doom kernel: BTRFS error (device dm-31): open_ctree failed
>    04:54:01 doom kernel: BTRFS warning (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
>    04:54:45 doom kernel: BTRFS warning (device dm-32): chunk 12582912 missing 1 devices, max tolerance is 0 for writable mount
>    04:54:45 doom kernel: BTRFS warning (device dm-32): too many missing devices, writable remount is not allowed
> 
> Since (in theory :) the filesystem was completely backed up, I didn't
> bother with further recovery after I made sure the physical disk is
> actually there and was unlocked (cryptsetup), so it wasn't a case of an
> actual missing disk.

BTW, "btrfs ins dump-tree -t chunk <dev>" would help a lot.
That would directly tell us if the devid 1 device is in chunk tree.

If passing different <dev> would cause different output, please also
provide all different versions.
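
A hedged way to capture and compare that output (file names are placeholders;
the exact item formatting may vary between btrfs-progs versions):

   # btrfs inspect-internal dump-tree -t chunk /dev/dm-30 > chunk-dm-30.txt
   # btrfs inspect-internal dump-tree -t chunk /dev/dm-31 > chunk-dm-31.txt
   # btrfs inspect-internal dump-tree -t chunk /dev/dm-32 > chunk-dm-32.txt
   # grep -cw 'devid 1' chunk-dm-*.txt           a non-zero count means devid 1 is still referenced
   # diff chunk-dm-30.txt chunk-dm-31.txt        any difference between copies is worth posting too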

> 
>>> BTW, this (second issue) also happens with filesystems that are not
>>> multi-device.
>>
>> Single device btrfs doesn't need device scan.
>> If that happened, something insane happened again...
>> Thanks,
> 
> It happens since at least 4.14 on at least four machines, but I haven't
> seen it recently, after I switched to 5.2.21 on some machines (post-4.4
> kernels have this habit of freezing under memory pressure, and 5.2.21 has
> greatly improved in this regard). That also means I had far fewer hard
> resets with 5.2.21, but the problem did not happen on the last resets in
> 5.2.21 and 5.4.5.
> 
> I originally reported it below, with some evidence that it isn't a
> hardware issue (no reset needed, just wipe the dm table while the device
> is mounted which should cleanly "cut off" the write stream):
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=204083
> 
> Since multiple scrubs and full reads of the volumes didn't show up any
> issues, I didn't think much of it.
> 
> And if you want to hear more "insane" things, after I hard-reset
> my desktop machine (5.2.21) two days ago I had to "btrfs rescue
> fix-device-size" to be able to mount (can't find the kernel error atm.).

Considering all these insane things, I tend to believe there is some
FUA/FLUSH related hardware problem.
E.g. the HDD/SSD controller reports FUA/FLUSH finished way before it
really writes data to the disk or non-volatile cache, or the
non-volatile cache recovery is not implemented properly...

Thanks,
Qu
> 
> Greetings,
> 




* Re: btrfs dev del not transaction protected?
  2019-12-20 13:41         ` Qu Wenruo
@ 2019-12-20 16:53           ` Marc Lehmann
  2019-12-20 17:24             ` Remi Gauvin
                               ` (2 more replies)
  2019-12-20 17:07           ` Marc Lehmann
  2019-12-20 17:20           ` Marc Lehmann
  2 siblings, 3 replies; 18+ messages in thread
From: Marc Lehmann @ 2019-12-20 16:53 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Dec 20, 2019 at 09:41:15PM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> BTW, that chunk number is very small, and since it has 0 tolerance, it
> looks like a SINGLE chunk.
> 
> In that case, it looks like a temporary chunk from the older mkfs, and it
> should contain no data/metadata at all, thus bringing no data loss.

Well, there indeed should not have been any data or metadata left as the
btrfs dev del succeeded after lengthy copying.

> BTW, "btrfs ins dump-tree -t chunk <dev>" would help a lot.
> That would directly tell us if the devid 1 device is in chunk tree.

Apologies if I wasn't too clear about it - I already had to mkfs and
redo the filesystem. I understand that makes tracking this down hard or
impossible, but I did need that machine and filesystem.

> > And if you want to hear more "insane" things, after I hard-reset
> > my desktop machine (5.2.21) two days ago I had to "btrfs rescue
> > fix-device-size" to be able to mount (can't find the kernel error atm.).
> 
> Considering all these insane things, I tend to believe there is some
> FUA/FLUSH related hardware problem.

Please don't - I honestly think btrfs developers are way too fast to blame
hardware for problems. I currently lose btrfs filesystems about once every
6 months, and other than the occasional user error, it's always the kernel
(e.g. 4.11 corrupting data, dmcache and/or bcache corrupting things,
low-memory situations etc. - none of these seem to be specific to btrfs,
but none of those are hardware errors either). I know it's the kernel in
most cases because in those cases, I can identify the fix in a later
kernel, or the mitigating circumstances don't appear (e.g. freezes).

In any case if it is a hardware problem, then Linux and/or btrfs has
to work around it, because it affects many different controllers on
different boards:

- Dell PERC H740 on "doom" and "cerebro".
- Intel Series 9 controller on "doom" and "cerebro".
- Samsung NVMe controller on "yoyo" and "yuna".
- Marvell SATA controller on "doom".

Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs
filesystem I restored to went into readonly mode with ENOSPC. Another
hardware problem?

[41801.618772] ------------[ cut here ]------------
[41801.618776] BTRFS: Transaction aborted (error -28)
[41801.618843] WARNING: CPU: 2 PID: 5713 at fs/btrfs/inode.c:3159 btrfs_finish_ordered_io+0x730/0x820 [btrfs]
[41801.618844] Modules linked in: nfsv3 nfs fscache nvidia_modeset(POE) nvidia(POE) btusb algif_skcipher af_alg dm_crypt nfsd auth_rpcgss nfs_acl lockd grace cls_fw sch_htb sit tunnel4 ip_tunnel hidp act_police cls_u32 sch_ingress sch_tbf 8021q garp mrp stp llc ip6t_REJECT nf_reject_ipv6 xt_CT xt_MASQUERADE xt_nat xt_REDIRECT nft_chain_nat nf_nat xt_owner xt_TCPMSS xt_DSCP xt_mark nf_log_ipv4 nf_log_common xt_LOG xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_length xt_mac xt_tcpudp nft_compat nft_counter nf_tables xfrm_user xfrm_algo nfnetlink cmac uhid bnep tda10021 snd_hda_codec_hdmi binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass tda827x tda10023 crct10dif_pclmul mei_hdcp crc32_pclmul btrtl btbcm rc_tt_1500 ghash_clmulni_intel snd_emu10k1 btintel snd_util_mem snd_ac97_codec aesni_intel bluetooth snd_hda_intel budget_av snd_rawmidi snd_intel_nhlt crypto_simd saa7
 146_vv
[41801.618864]  snd_hda_codec videobuf_dma_sg budget_ci videobuf_core snd_seq_device budget_core cryptd ttpci_eeprom glue_helper snd_hda_core saa7146 dvb_core intel_cstate ac97_bus snd_hwdep rc_core snd_pcm intel_rapl_perf mxm_wmi cdc_acm pcspkr videodev snd_timer ecdh_generic snd emu10k1_gp ecc mc gameport soundcore mei_me mei mac_hid acpi_pad tcp_bbr drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt ipmi_devintf ipmi_msghandler hid_generic usbhid hid usbkbd coretemp nct6775 hwmon_vid sunrpc parport_pc ppdev lp parport msr ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio libcrc32c ahci megaraid_sas i2c_i801 libahci lpc_ich r8169 realtek wmi video [last unloaded: nvidia]
[41801.618887] CPU: 2 PID: 5713 Comm: kworker/u8:15 Tainted: P           OE     5.4.5-050405-generic #201912181630
[41801.618888] Hardware name: MSI MS-7816/Z97-G43 (MS-7816), BIOS V17.8 12/24/2014
[41801.618903] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[41801.618916] RIP: 0010:btrfs_finish_ordered_io+0x730/0x820 [btrfs]
[41801.618917] Code: 49 8b 46 50 f0 48 0f ba a8 40 ce 00 00 02 72 1c 8b 45 b0 83 f8 fb 0f 84 d4 00 00 00 89 c6 48 c7 c7 48 33 62 c0 e8 eb 9c 91 d5 <0f> 0b 8b 4d b0 ba 57 0c 00 00 48 c7 c6 40 67 61 c0 4c 89 f7 bb 01
[41801.618918] RSP: 0018:ffffc18b40edfd80 EFLAGS: 00010282
[41801.618921] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left
[41801.618922] RAX: 0000000000000000 RBX: ffff9f8b7b2e3800 RCX: 0000000000000006
[41801.618922] BTRFS info (device dm-35): forced readonly
[41801.618924] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9f8bbeb17440
[41801.618924] RBP: ffffc18b40edfdf8 R08: 00000000000005a6 R09: ffffffff979a4d90
[41801.618925] R10: ffffffff97983fa8 R11: ffffc18b40edfbe8 R12: ffff9f8ad8b4ab60
[41801.618926] R13: ffff9f867ddb53c0 R14: ffff9f8bbb0446e8 R15: 0000000000000000
[41801.618927] FS:  0000000000000000(0000) GS:ffff9f8bbeb00000(0000) knlGS:0000000000000000
[41801.618928] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41801.618929] CR2: 00007f8ab728fc30 CR3: 000000049080a002 CR4: 00000000001606e0
[41801.618930] Call Trace:
[41801.618943]  finish_ordered_fn+0x15/0x20 [btrfs]
[41801.618957]  normal_work_helper+0xbd/0x2f0 [btrfs]
[41801.618959]  ? __schedule+0x2eb/0x740
[41801.618973]  btrfs_endio_write_helper+0x12/0x20 [btrfs]
[41801.618975]  process_one_work+0x1ec/0x3a0
[41801.618977]  worker_thread+0x4d/0x400
[41801.618979]  kthread+0x104/0x140
[41801.618980]  ? process_one_work+0x3a0/0x3a0
[41801.618982]  ? kthread_park+0x90/0x90
[41801.618984]  ret_from_fork+0x1f/0x40
[41801.618985] ---[ end trace 35086266bf39c897 ]---
[41801.618987] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left

unmount/remount seems to make it work again, and it is full (df) yet has
3TB of unallocated space left. No clue what to do now, do I have to start
over restoring again?

   Filesystem               Size  Used Avail Use% Mounted on
   /dev/mapper/xmnt-cold15   27T   23T     0 100% /cold1

   Overall:
       Device size:                       24216.49GiB
       Device allocated:                  20894.89GiB
       Device unallocated:                 3321.60GiB
       Device missing:                        0.00GiB
       Used:                              20893.68GiB
       Free (estimated):                   3322.73GiB      (min: 1661.93GiB)
       Data ratio:                               1.00
       Metadata ratio:                           2.00
       Global reserve:                        0.50GiB      (used: 0.00GiB)

   Data,single: Size:20839.01GiB, Used:20837.88GiB (99.99%)
      /dev/mapper/xmnt-cold15      9288.01GiB
      /dev/mapper/xmnt-cold12      7427.00GiB
      /dev/mapper/xmnt-cold13      4124.00GiB

   Metadata,RAID1: Size:27.91GiB, Used:27.90GiB (99.97%)
      /dev/mapper/xmnt-cold15        25.44GiB
      /dev/mapper/xmnt-cold12        24.46GiB
      /dev/mapper/xmnt-cold13         5.91GiB

   System,RAID1: Size:0.03GiB, Used:0.00GiB (6.69%)
      /dev/mapper/xmnt-cold15         0.03GiB
      /dev/mapper/xmnt-cold12         0.03GiB

   Unallocated:
      /dev/mapper/xmnt-cold15         0.01GiB
      /dev/mapper/xmnt-cold12         0.00GiB
      /dev/mapper/xmnt-cold13      3321.59GiB

Please, don't always chalk it up to hardware problems - btrfs is a
wonderful filesystem for many reasons; one reason I like it is that it can
detect corruption much earlier than other filesystems. This feature alone
makes it impossible for me to go back to xfs. However, I had corruption
on ext4, xfs, reiserfs over the years, but btrfs *is* simply way buggier
still than those - before btrfs (and even now) I kept md5sums of all
archived files (~200TB), and xfs and ext4 _do_ a much better job at not
corrupting data than btrfs on the same hardware - while I get filesystem
problems about every half a year with btrfs, I had (silent) corruption
problems maybe once every three to four years with xfs or ext4 (and not
yet on the boxes I use currently).

Please take these issues seriously - the trend of "it's a hardware
problem" will not remove the "unstable" stigma from btrfs as long as btrfs
is clearly more buggy than other filesystems.

Sorry to be so blunt, but I am a bit sensitive about always being told
"it's probably a hardware problem" when it clearly affects practically any
server and any laptop I administrate. I believe in btrfs, and detecting
corruption early is a feature to me.

I understand it can be frustrating to be confronted with hard to explain
accidents, and I understand if you can't find the bug with the sparse info
I gave, especially as the bug might not even be in btrfs. But keep in mind
that the people who boldly/dumbly use btrfs in production and restore
dozens of terabytes from backup every so and so many months are also being
frustrated if they present evidence from multiple machines and get told
"its probably a hardware problem".

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: btrfs dev del not transaction protected?
  2019-12-20 13:41         ` Qu Wenruo
  2019-12-20 16:53           ` Marc Lehmann
@ 2019-12-20 17:07           ` Marc Lehmann
  2019-12-21  1:23             ` Qu Wenruo
  2019-12-20 17:20           ` Marc Lehmann
  2 siblings, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2019-12-20 17:07 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

> Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs
> filesystem I restored to went into readonly mode with ENOSPC. Another
> hardware problem?

btrfs check gave me a possible hint:

   Checking filesystem on /dev/mapper/xmnt-cold15
   UUID: 6e035cfe-5b47-406a-998f-b8ee6567abbc
   [1/7] checking root items
   [2/7] checking extents
   [3/7] checking free space tree
   cache and super generation don't match, space cache will be invalidated
   [4/7] checking fs roots
   [no other errors]

But mounting with clear_cache,space_cache=v2 didn't help, df still shows 0
bytes free, "btrfs f us" still shows 3tb unallocated. I'll play around with
it more...

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: btrfs dev del not transaction protected?
  2019-12-20 13:41         ` Qu Wenruo
  2019-12-20 16:53           ` Marc Lehmann
  2019-12-20 17:07           ` Marc Lehmann
@ 2019-12-20 17:20           ` Marc Lehmann
  2 siblings, 0 replies; 18+ messages in thread
From: Marc Lehmann @ 2019-12-20 17:20 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

> > Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs
> > filesystem I restored to went into readonly mode with ENOSPC. Another
> > hardware problem?
>
> But mounting with clear_cache,space_cache=v2 didn't help, df still shows 0
> bytes free, "btrfs f us" still shows 3tb unallocated. I'll play around with
> it more...

clear_cache didn't work, but btrfsck --clear-space-cache v1 and .. v2 did
work:

   Filesystem               Size  Used Avail Use% Mounted on
   /dev/mapper/xmnt-cold15   27T   23T  3.6T  87% /cold1

Which is rather insane, as I can't see how this filesystem was ever
mounted without -o space_cache=v2.

Looking at btrfs f u again...

   Metadata,single: Size:1.22GiB, Used:0.00B (0.00%)
      /dev/mapper/xmnt-cold13         1.22GiB

   Metadata,RAID1: Size:27.92GiB, Used:27.90GiB (99.91%)
      /dev/mapper/xmnt-cold15        25.46GiB
      /dev/mapper/xmnt-cold12        24.46GiB
      /dev/mapper/xmnt-cold13         5.92GiB

   System,RAID1: Size:32.00MiB, Used:2.16MiB (6.74%)
      /dev/mapper/xmnt-cold15        32.00MiB
      /dev/mapper/xmnt-cold12        32.00MiB

   Unallocated:
      /dev/mapper/xmnt-cold15         1.00MiB
      /dev/mapper/xmnt-cold12         1.00MiB
      /dev/mapper/xmnt-cold13         3.24TiB

Did this happen because metadata is raid1 and two of the disks were full,
and for some reason, btrfsck freed up a tiny bit of space somewhere?

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: btrfs dev del not transaction protected?
  2019-12-20 16:53           ` Marc Lehmann
@ 2019-12-20 17:24             ` Remi Gauvin
  2019-12-20 17:50               ` Marc Lehmann
  2019-12-20 18:00               ` Marc Lehmann
  2019-12-20 20:24             ` Chris Murphy
  2019-12-21  1:32             ` Qu Wenruo
  2 siblings, 2 replies; 18+ messages in thread
From: Remi Gauvin @ 2019-12-20 17:24 UTC (permalink / raw)
  To: Marc Lehmann, Qu Wenruo; +Cc: linux-btrfs



On 2019-12-20 11:53 a.m., Marc Lehmann wrote:

> 
>    Filesystem               Size  Used Avail Use% Mounted on
>    /dev/mapper/xmnt-cold15   27T   23T     0 100% /cold1
> 
>    Overall:
>        Device size:                       24216.49GiB
>        Device allocated:                  20894.89GiB
>        Device unallocated:                 3321.60GiB
>        Device missing:                        0.00GiB
>        Used:                              20893.68GiB
>        Free (estimated):                   3322.73GiB      (min: 1661.93GiB)
>        Data ratio:                               1.00
>        Metadata ratio:                           2.00
>        Global reserve:                        0.50GiB      (used: 0.00GiB)
> 
>    Data,single: Size:20839.01GiB, Used:20837.88GiB (99.99%)
>       /dev/mapper/xmnt-cold15      9288.01GiB
>       /dev/mapper/xmnt-cold12      7427.00GiB
>       /dev/mapper/xmnt-cold13      4124.00GiB
> 
>    Metadata,RAID1: Size:27.91GiB, Used:27.90GiB (99.97%)
>       /dev/mapper/xmnt-cold15        25.44GiB
>       /dev/mapper/xmnt-cold12        24.46GiB
>       /dev/mapper/xmnt-cold13         5.91GiB
> 
>    System,RAID1: Size:0.03GiB, Used:0.00GiB (6.69%)
>       /dev/mapper/xmnt-cold15         0.03GiB
>       /dev/mapper/xmnt-cold12         0.03GiB
> 
>    Unallocated:
>       /dev/mapper/xmnt-cold15         0.01GiB
>       /dev/mapper/xmnt-cold12         0.00GiB
>       /dev/mapper/xmnt-cold13      3321.59GiB
> 

You don't need hints, the problem is right here.

Your metadata is RAID1 (which requires a minimum of 2 devices), your
allocated metadata is full (27.90GiB / 27.91GiB), and you only have 1
device left with unallocated space, so no new metadata space can be
allocated until you fix that.
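
A hedged sketch of the usual ways out of that corner (device path is only an
example - adding any device with free space works, as does freeing whole
chunks on the full members):

   # btrfs device add /dev/mapper/xmnt-cold14 /cold1
   # btrfs balance start -musage=10 /cold1       compact mostly-empty metadata chunks once there is room
   or, without adding a device, try to free whole data chunks first:
   # btrfs balance start -dusage=10 /cold1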





* Re: btrfs dev del not transaction protected?
  2019-12-20 17:24             ` Remi Gauvin
@ 2019-12-20 17:50               ` Marc Lehmann
  2019-12-20 18:00               ` Marc Lehmann
  1 sibling, 0 replies; 18+ messages in thread
From: Marc Lehmann @ 2019-12-20 17:50 UTC (permalink / raw)
  To: Remi Gauvin; +Cc: Qu Wenruo, linux-btrfs

On Fri, Dec 20, 2019 at 12:24:05PM -0500, Remi Gauvin <remi@georgianit.com> wrote:
> You don't need hints, the problem is right here.

Yes, I already guessed that (see my other mail). I fortunately can add two
more devices. However:

> device left with unallocated space, so no new metadata space can be
> allocated until you fix that.

I think it really shouldn't be up to me to second guess btrfs's not very
helpful error messages "and fix things". And if I couldn't add another
device, I would be pretty much fucked - btrfs balance does not allow me
to move any chunks to the other device, I tried balancing 10 data chunks
and 10 metadata chunks - the data chunks balanced successfully but nothing
changed, and the metadata chunks instantly hit the ENOSPC problem.

Pushing "fix things" at users without giving them the ability to do so is
rather poor.

So is there a legit fix for this? The tools don't allow me to rebalance
the filesystem so there is more space on the drives and deleting data and
writing it again doesn't seem to help - btrfs still wants to write to the
nearly full disks. I could probably convert the metadata to single and
back, but as long as btrfs has no way of moving data from one disk to
another, that's going to be tough. Maybe converting to single and resizing
would do the trick - seriously, though, btrfs shouldn't force users to
jump through such hoops.
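
For reference, the convert dance mentioned above would look roughly like this
(hedged - converting metadata to single temporarily drops redundancy, needs
the -f/--force flag for exactly that reason, and still requires some
unallocated space to succeed):

   # btrfs balance start -f -mconvert=single /cold1
   # btrfs balance start -mconvert=raid1 /cold1      convert back once more space is available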

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: btrfs dev del not transaction protected?
  2019-12-20 17:24             ` Remi Gauvin
  2019-12-20 17:50               ` Marc Lehmann
@ 2019-12-20 18:00               ` Marc Lehmann
  2019-12-20 18:28                 ` Eli V
  1 sibling, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2019-12-20 18:00 UTC (permalink / raw)
  To: Remi Gauvin; +Cc: Qu Wenruo, linux-btrfs

On Fri, Dec 20, 2019 at 12:24:05PM -0500, Remi Gauvin <remi@georgianit.com> wrote:
> You don't need hints, the problem is right here.
> Your metadata is RAID1 (which requires a minimum of 2 devices), your

Guess I found another bug - three disks with >>3tb free space, but df
still shows 0 available bytes. Sure I can probably work around it somehow,
but no, I refuse to accept that this is supposedly a user problem - surely
btrfs could create more raid1 metadata with _three disks with lots of free
space_.

doom ~# df /cold1
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/xmnt-cold15   43T   23T     0 100% /cold1
doom ~# btrfs dev us /cold1
/dev/mapper/xmnt-cold15, ID: 1
   Device size:             9.09TiB
   Device slack:              0.00B
   Data,single:             9.07TiB
   Metadata,RAID1:         25.46GiB
   System,RAID1:           32.00MiB
   Unallocated:             1.00MiB

/dev/mapper/xmnt-cold12, ID: 2
   Device size:             7.28TiB
   Device slack:              0.00B
   Data,single:             7.25TiB
   Metadata,RAID1:         24.46GiB
   System,RAID1:           32.00MiB
   Unallocated:             1.00MiB

/dev/mapper/xmnt-cold13, ID: 3
   Device size:             7.28TiB
   Device slack:              0.00B
   Data,single:             4.03TiB
   Metadata,RAID1:          5.92GiB
   Unallocated:             3.24TiB

/dev/mapper/xmnt-cold14, ID: 4
   Device size:             7.28TiB
   Device slack:              0.00B
   Unallocated:             7.28TiB

/dev/mapper/xmnt-cold11, ID: 5
   Device size:             7.28TiB
   Device slack:              0.00B
   Unallocated:             7.28TiB

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: btrfs dev del not transaction protected?
  2019-12-20 18:00               ` Marc Lehmann
@ 2019-12-20 18:28                 ` Eli V
  0 siblings, 0 replies; 18+ messages in thread
From: Eli V @ 2019-12-20 18:28 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: Remi Gauvin, Qu Wenruo, linux-btrfs

In general df will only ever be an approximation on btrfs filesystems
since the different profiles use different amounts of space, and it
does have bugs from time to time. If you untar a mail spool on the
filesystem the metadata usage may shoot way up when only a small
amount of additional data is needed. So on btrfs filesystems I really
just ignore df, and use btrfs filesystem usage -T almost exclusively.
The table format of -T does make it much more readable for an admin.
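
For example (hedged; the mount point is the one from this thread):

   # btrfs filesystem usage -T /cold1
   the table shows, per device, the data/metadata/system allocations and the
   unallocated space, which makes situations like the one above obvious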

On Fri, Dec 20, 2019 at 1:02 PM Marc Lehmann <schmorp@schmorp.de> wrote:
>
> On Fri, Dec 20, 2019 at 12:24:05PM -0500, Remi Gauvin <remi@georgianit.com> wrote:
> > You don't need hints, the problem is right here.
> > Your metadata is RAID1 (which requires a minimum of 2 devices), your
>
> Guess I found another bug - three disks with >>3tb free space, but df
> still shows 0 available bytes. Sure I can probably work around it somehow,
> but no, I refuse to accept that this is supposedly a user problem - surely
> btrfs could create more raid1 metadata with _three disks with lots of free
> space_.
>
> doom ~# df /cold1
> Filesystem               Size  Used Avail Use% Mounted on
> /dev/mapper/xmnt-cold15   43T   23T     0 100% /cold1
> doom ~# btrfs dev us /cold1
> /dev/mapper/xmnt-cold15, ID: 1
>    Device size:             9.09TiB
>    Device slack:              0.00B
>    Data,single:             9.07TiB
>    Metadata,RAID1:         25.46GiB
>    System,RAID1:           32.00MiB
>    Unallocated:             1.00MiB
>
> /dev/mapper/xmnt-cold12, ID: 2
>    Device size:             7.28TiB
>    Device slack:              0.00B
>    Data,single:             7.25TiB
>    Metadata,RAID1:         24.46GiB
>    System,RAID1:           32.00MiB
>    Unallocated:             1.00MiB
>
> /dev/mapper/xmnt-cold13, ID: 3
>    Device size:             7.28TiB
>    Device slack:              0.00B
>    Data,single:             4.03TiB
>    Metadata,RAID1:          5.92GiB
>    Unallocated:             3.24TiB
>
> /dev/mapper/xmnt-cold14, ID: 4
>    Device size:             7.28TiB
>    Device slack:              0.00B
>    Unallocated:             7.28TiB
>
> /dev/mapper/xmnt-cold11, ID: 5
>    Device size:             7.28TiB
>    Device slack:              0.00B
>    Unallocated:             7.28TiB
>
> --
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\


* Re: btrfs dev del not transaction protected?
  2019-12-20 16:53           ` Marc Lehmann
  2019-12-20 17:24             ` Remi Gauvin
@ 2019-12-20 20:24             ` Chris Murphy
  2019-12-20 23:30               ` Marc Lehmann
  2019-12-21 20:06               ` Zygo Blaxell
  2019-12-21  1:32             ` Qu Wenruo
  2 siblings, 2 replies; 18+ messages in thread
From: Chris Murphy @ 2019-12-20 20:24 UTC (permalink / raw)
  To: Btrfs BTRFS; +Cc: Qu Wenruo, Marc Lehmann, Zygo Blaxell

On Fri, Dec 20, 2019 at 9:53 AM Marc Lehmann <schmorp@schmorp.de> wrote:
>
> On Fri, Dec 20, 2019 at 09:41:15PM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:

> > Considering all these insane things, I tend to believe there is some
> > FUA/FLUSH related hardware problem.
>
> Please don't - I honestly think btrfs developers are way too fast to blame
> hardware for problems.

That's because they have a lot of evidence of this, in a way that's
only inferable with other file systems. This was long suspected,
and demonstrated, well before Btrfs, during ZFS development.

A reasonable criticism of Btrfs development is the state of the file
system check repair, which still has danger warnings. But it's also a
case of damned if they do, and damned if they don't provide it. It
might be the best chance of recovery, so why not provide it?
Conversely, the reality is that the file system is complicated enough,
and the file system checker too slow, that the effort needs to be on
(what I call) file system autopsy tools, to figure out why the
corruption happened, and prevent that from happening. The repair is
often too difficult.

Take, for example, the recent 5.2.0-5.2.14 corruption bug. That was
self-reported once it was discovered and fixed, which took longer than
usual, and developers apologized. What else can they do? It's not like
the developers are blaming hardware for their own bugs. They have
consistently taken responsibility for Btrfs bugs.


> I currently lose btrfs filesystems about once every
> 6 months, and other than the occasional user error, it's always the kernel
> (e.g. 4.11 corrupting data, dmcache and/or bcache corrupting things,
> low-memory situations etc. - none of these seem to be centric to btrfs,
> but none of those are hardware errors either). I know its the kernel in
> most cases because in those cases, I can identify the fix in a later
> kernel, or the mitigating circumstances don't appear (e.g. freezes).

Usually Btrfs developers do mention the possibility of other software
layers contributing to the problem; it's a valid observation, and worth
stating explicitly.

However, if it's exclusively a software problem, then it should be
reproducible on other systems.


> In any case if it is a hardware problem, then linux and/or btrfs has
> to work around them, because it affects many different controllers on
> different boards:

How do you propose Btrfs work around it? In particular when there are
additional software layers over which it has no control?

Have you tried disabling the (drives') write cache?
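
For plain SATA/SAS drives that's usually something along the lines of
(device names here are placeholders, and drives behind a RAID controller
need the controller's own tool instead):

   # SATA: turn off the drive's volatile write cache
   hdparm -W 0 /dev/sdX
   # SAS/SCSI: clear the WCE bit in the caching mode page
   sdparm --clear=WCE /dev/sdX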


> Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs
> filesystem I restored to went into readonly mode with ENOSPC. Another
> hardware problem?

> [41801.618887] CPU: 2 PID: 5713 Comm: kworker/u8:15 Tainted: P           OE     5.4.5-050405-generic #201912181630

Why is this kernel tainted? The point isn't to blame whatever is
tainting the kernel, but to point out that identifying the cause of your
problems is made a lot more difficult. I think you need to simplify the
setup, a lot, in order to reduce the surface area of possible problems.
Any bug hunt is made way harder when there's complication.



> [41801.618888] Hardware name: MSI MS-7816/Z97-G43 (MS-7816), BIOS V17.8 12/24/2014
> [41801.618903] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> [41801.618916] RIP: 0010:btrfs_finish_ordered_io+0x730/0x820 [btrfs]
> [41801.618917] Code: 49 8b 46 50 f0 48 0f ba a8 40 ce 00 00 02 72 1c 8b 45 b0 83 f8 fb 0f 84 d4 00 00 00 89 c6 48 c7 c7 48 33 62 c0 e8 eb 9c 91 d5 <0f> 0b 8b 4d b0 ba 57 0c 00 00 48 c7 c6 40 67 61 c0 4c 89 f7 bb 01
> [41801.618918] RSP: 0018:ffffc18b40edfd80 EFLAGS: 00010282
> [41801.618921] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left
> [41801.618922] RAX: 0000000000000000 RBX: ffff9f8b7b2e3800 RCX: 0000000000000006
> [41801.618922] BTRFS info (device dm-35): forced readonly
> [41801.618924] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9f8bbeb17440
> [41801.618924] RBP: ffffc18b40edfdf8 R08: 00000000000005a6 R09: ffffffff979a4d90
> [41801.618925] R10: ffffffff97983fa8 R11: ffffc18b40edfbe8 R12: ffff9f8ad8b4ab60
> [41801.618926] R13: ffff9f867ddb53c0 R14: ffff9f8bbb0446e8 R15: 0000000000000000
> [41801.618927] FS:  0000000000000000(0000) GS:ffff9f8bbeb00000(0000) knlGS:0000000000000000
> [41801.618928] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [41801.618929] CR2: 00007f8ab728fc30 CR3: 000000049080a002 CR4: 00000000001606e0
> [41801.618930] Call Trace:
> [41801.618943]  finish_ordered_fn+0x15/0x20 [btrfs]
> [41801.618957]  normal_work_helper+0xbd/0x2f0 [btrfs]
> [41801.618959]  ? __schedule+0x2eb/0x740
> [41801.618973]  btrfs_endio_write_helper+0x12/0x20 [btrfs]
> [41801.618975]  process_one_work+0x1ec/0x3a0
> [41801.618977]  worker_thread+0x4d/0x400
> [41801.618979]  kthread+0x104/0x140
> [41801.618980]  ? process_one_work+0x3a0/0x3a0
> [41801.618982]  ? kthread_park+0x90/0x90
> [41801.618984]  ret_from_fork+0x1f/0x40
> [41801.618985] ---[ end trace 35086266bf39c897 ]---
> [41801.618987] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left
>
> unmount/remount seems to make it work again, and it is full (df) yet has
> 3TB of unallocated space left. No clue what to do now, do I have to start
> over restoring again?
>
>    Filesystem               Size  Used Avail Use% Mounted on
>    /dev/mapper/xmnt-cold15   27T   23T     0 100% /cold1

Clearly a bug, possibly more than one. This problem is being discussed
in other threads on df misreporting with recent kernels, and a fix is
pending.

As for the ENOSPC, also clearly a bug. But not clear why or where.


> Please, don't always chalk it up to hardware problems - btrfs is a
> wonderful filesystem for many reasons, one reason I like is that it can
> detect corruption much earlier than other filesystems. This featire alone
> makes it impossible for me to go back to xfs. However, I had corruption
> on ext4, xfs, reiserfs over the years, but btrfs *is* simply way buggier
> still than those - before btrfs (and even now) I kept md5sums of all
> archived files (~200TB), and xfs and ext4 _do_ a much better job at not
> corrupting data than btrfs on the same hardware - while I get filesystem
> problems about every half a year with btrfs, I had (silent) corruption
> problems maybe once every three to four years with xfs or ext4 (and not
> yet on the bxoes I use currently).

I can't really parse the suggestion that you are seeing md5 mismatches
(indicating data changes) on Btrfs without Btrfs producing a csum warning
along with EIO on those files. Are these files nodatacow, either via the
nodatasum or nodatacow mount options, or via chattr +C on the files
themselves?

A mechanism explaining this anecdote isn't clear. Not even crc32c
checksum collision would explain more than maybe one instance of it.

I'm curious what Zygo thinks about this.







>
> Please take these issues seriously - the trend of "it's a hardware
> problem" will not remove the "unstable" stigma from btrfs as long as btrfs
> is clearly more buggy then other filesystems.
>
> Sorry to be so blunt, but I am a bit sensitive with always being told
> "it's probably a hardware problem" when it clearly affects practically any
> server and any laptop I administrate. I believe in btrfs, and detecting
> corruption early is a feature to me.

The problem with the anecdotal method of arguing in favor of software
bugs as the explanation? It directly goes against my own experience, which
is also anecdotal. I've had no problems that I can attribute to Btrfs. All
were hardware or user sabotage. And I've had zero data loss, outside
of user sabotage.

I have seen device UNC read errors, corrected automatically by Btrfs.
And I have seen devices return bad data that Btrfs caught, that would
otherwise have been silent corruption of either metadata or data, and
this was corrected in the raid1 cases, and merely reported in the
non-raid cases. And I've also seen considerable corruption reported
upon SD Cards in the midst of implosion and becoming read only. But
even read only, I was able to get all the data out.

But in your case, practically every server and laptop? That's weird and
unexpected. And it makes me wonder what's in common. Btrfs is much
fussier than other file systems because by far the largest target for
corruption isn't file system metadata, but data. The actual payload
of a file system isn't the file system. And Btrfs is the only Linux
native file system that checksums data. The other file systems check
only metadata, and only somewhat recently, depending on the
distribution you're using.


> I understand it can be frustrating to be confronted with hard to explain
> accidents, and I understand if you can't find the bug with the sparse info
> I gave, especially as the bug might not even be in btrfs. But keep in mind
> that the people who boldly/dumbly use btrfs in production and restore
> dozens of terabytes from backup every so and so many months are also being
> frustrated if they present evidence from multiple machines and get told
> "its probably a hardware problem".

For sure. But take the contrary case that other file systems have
depended on for more than a decade: assuming the hardware is returning
valid data. This is intrinsic to their design. And go back before they
had metadata checksumming, and you'd see it stated on their lists that
they do assume this, and if your devices return any bad data, it's not
the file system's fault at all. Not even the lack of reporting any
kind of problem whatsoever. How is that better?

Well indeed, not long after Btrfs demonstrated that these are actually
more common problems than suspected, metadata checksumming started
creeping into other file systems, finally becoming the default (a
while ago on XFS, and very recently on ext4). And they are catching a
lot of these same kinds of layer and hardware bugs. Hardware does not
just mean the drive, it can be power supply, logic board, controller,
cables, drive write caches, drive firmware, and other drive internals.

And the only way any problem can be fixed, is to understand how, when
and where it happened.

--
Chris Murphy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: btrfs dev del not transaction protected?
  2019-12-20 20:24             ` Chris Murphy
@ 2019-12-20 23:30               ` Marc Lehmann
  2019-12-21 20:06               ` Zygo Blaxell
  1 sibling, 0 replies; 18+ messages in thread
From: Marc Lehmann @ 2019-12-20 23:30 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Fri, Dec 20, 2019 at 01:24:02PM -0700, Chris Murphy <lists@colorremedies.com> wrote:
> > Please don't - I honestly think btrfs developers are way to fast to blame
> > hardware for problems.
> 
> That's because they have a lot of evidence of this, in a way that's
> only inferable with other file systems. This has long been suspected
> by, and demonstrated, well before Btrfs with ZFS development.

But they don't - when I report that I see this reproducible behaviour on
different machines, what is that lot of evidence? At the least they could
inquire further.

> A reasonable criticism of Btrfs development is the state of the file
> system check repair, which still has danger warnings. But it's also a
> case of damned if they do, and damned if they don't provide it. It
> might be the best chance of recovery, so why not provide it?

Note that I have not asked for a better fsck or anything of the sort.

> usual, and developers apologized. What else can they do? It's not like
> the developers are blaming hardware for their own bugs. They have
> consistently taken responsibility for Btrfs bugs.

That's not the reality I live in, though. Most of my bug reports on btrfs
have either been completely ignored or "oh, I can't reproduce it today
anymore, maybe it's fixed itself".

Sure, some of my bug reports have been taken seriously as well, and btrfs
has advanced considerably over the years.

I am a software developer myself, and I understand that not every bug
report can be acted upon, and that sometimes you need to be sceptical for
reasons other than the ones stated in the report.

> Usually Btrfs developers do mention the possibility of other software
> layers contributing to the problem, it's a valid observation that this
> possibility be stated.

That's probably why I stated it, yes.

Your mail doesn't really apply to much of what I wrote - have you really
read my bug report, or is this the pre-canned response you send out for
criticism? Sorry to be so blunt, but that's pretty much how your mail feels
to me, as it doesn't seem to take into account what I reported.

> However, if it's exclusively a software problem, then it should be
> reproducible on other systems.

Which in this case, it is.

Even hardware problems can be reproduced on other systems, when it's,
say, a controller problem, so reproducibility of a problem does not mean
it's a software bug. But likewise, jumping to conclusions because it is
convenient is also a non sequitur.

> > In any case if it is a hardware problem, then linux and/or btrfs has
> > to work around them, because it affects many different controllers on
> > different boards:
> 
> How do you propose Btrfs work around it? In particular when there are
> additional software layers over which it has no control?

Why do I suddenly have to propose how btrfs should work around it? I said
that if it's a hardware problem with practically every current controller,
then linux and/or btrfs have to work around it, otherwise they become
useless. And if other filesystems can keep data safe when btrfs can't,
then clearly btrfs can be improved to do likewise.

> Have you tried disabling the (drives') write cache?

The write caches of all drives have been off during the lifetime of the
filesystem.

Nevertheless, what's your basis for asking for write caches to be turned
off? Is there any evidence that drives lose their cache contents unless
the cache is turned off? Maybe such drives exist, but I have not heard of
them - have you?

The reason the write cache is off is that the stupid lsi raid
controllers I use _do_ lose data on power outages when the drive cache is
on, something that 3ware controllers didn't do. Not that power outages
(actual brownouts or manually induced via the power switch) are something
that really happens here.

I do, however, expect that current filesystems properly flush data to
disk, and outside the raid controllers, I do not disable the write cache.

> > [41801.618887] CPU: 2 PID: 5713 Comm: kworker/u8:15 Tainted: P           OE     5.4.5-050405-generic #201912181630
> 
> Why is this kernel tainted?

non-free nvidia driver

> The point of pointing this out isn't to blame whatever it tainting the
> kernel, but to point out that identifying the cause of your problems is
> made a lot more difficult.

I think you are wildly exaggerating. Are there any reports of nvidia
drivers actually corrupting filesystems? I am genuinely curious.

Other than that, your claim that a tainted kernel somehow makes identifying
the problem a lot more difficult is just taken out of the blue, isn't it?

> I think you need to simplify the setup, a lot, in order to reduce the
> surface area of possible problems. Any bug hunt is made way harder when
> there's complication.

Well, sure, you can hide behind the kernel taint. I am sorry, in that case, I
just cannot provide you with any reports anymore, if that is what you really
want.

Note, however, that this is simply an excuse: "oh, a raid controller, you
need to simplify your setup first", "oh, a tainted driver, you need to
simplify your setup first". What's next, "oh, you used a sata disk, you
need to simplify the setup first to see if filesystem corruption happens
without the disk first"?

Debugging real world problems is hard, and just ignoring the real world
because it doesn't fit into a lab doesn't work.

Besides, I already simplified the setup for you - my dell laptop only uses
certified genuine non-tainted in-kernel drivers and an nvme controller,
and still suffered from the open_ctree on first reboot problem once.

> >    Filesystem               Size  Used Avail Use% Mounted on
> >    /dev/mapper/xmnt-cold15   27T   23T     0 100% /cold1
> 
> Clearly a bug, possibly more than one. This problem is being discussed
> in other threads on df misreporting with recent kernels, and a fix is
> pending.

As it turns out, it's not df misreporting, it's simply very bad error
reporting on the side of btrfs, requiring a lot of guesswork on the side of
the user, followed by "you need to fix that problem first" from the btrfs
mailing list.

In any case, I am sorry I was triggered and brought this up - this last
oops report was not meant as a request to help me solve my problem, but
to show how bad the user experience really is, both with btrfs and with
this list.

Seriously, when I mention I have a reproducible problem on multiple kernel
versions on multiple very different computers (that I reported in May),
then it is simply not appreciated to tell me it's probably a hardware
problem, even if, however inconceivable it might be, it possibly could be
a hardware problem.

> As for the ENOSPC, also clearly a bug. But not clear why or where.

So at least it wasn't immediately obvious to you either - it took me a
while to figure out the "obvious", namely that one disk with free space
is not enough for raid1 metadata. The issue here is not df potentially
misreporting, but the fact that btrfs simply has no tool to do much about
it in obvious ways, yet the btrfs list tells me I need to fix things
first. Great advice.
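
For completeness, the way out of that corner - just a sketch, and assuming
the data profile stays "single" - seems to be to balance a few data chunks
off the full devices so that a second device has unallocated space for the
raid1 metadata again:

   # relocate up to 50 mostly-empty data chunks; adjust usage/limit to taste
   btrfs balance start -dusage=50,limit=50 /cold1
   # then check that more than one device shows unallocated space again
   btrfs fi usage /cold1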

> > Please, don't always chalk it up to hardware problems - btrfs is a
> > wonderful filesystem for many reasons, one reason I like is that it can
> > detect corruption much earlier than other filesystems. This featire alone
> > makes it impossible for me to go back to xfs. However, I had corruption
> > on ext4, xfs, reiserfs over the years, but btrfs *is* simply way buggier
> > still than those - before btrfs (and even now) I kept md5sums of all
> > archived files (~200TB), and xfs and ext4 _do_ a much better job at not
> > corrupting data than btrfs on the same hardware - while I get filesystem
> > problems about every half a year with btrfs, I had (silent) corruption
> > problems maybe once every three to four years with xfs or ext4 (and not
> > yet on the bxoes I use currently).
> 
> I can't really parse the suggestion that you are seeing md5 mismatches
> (indicating data changes) on Btrfs,

Me neither, where do you think I suggested that?

> where Btrfs doesn't produce a csum warning along with EIO on those
> files? Are these files nodatacow,

Well, first of all, the default is a relatively weak crc checksum
(fortunately, current kernels already offer more, another good point in
favour of btrfs). Second, nocow data isn't checksummed. Third, I didn't
say btrfs claims good checksums on md5 mismatches, I claimed it corrupts
data.
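
As an aside, with a new enough kernel and btrfs-progs - 5.5 or so, though
take the exact versions as my assumption - the checksum algorithm can even
be chosen at mkfs time:

   mkfs.btrfs --csum xxhash /dev/mapper/some-device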

For me, btrfs telling me instantly that a file is unreadable due to a
checksum error is much preferable to me having to manually checksum it to
find it, something I do maybe once a year.

What I am saying is that I lose way more files and data on btrfs than
on any other filesystem I have used or use. And that's a fact - we can
now speculate on why that is.

I have a few data points - most of the problems I ran into have been
other kernel bugs (such as the 4.11 corruption bug, various dmcache
corruption bugs and so on). Some of these have been btrfs bugs that have
been fixed. Some of these have been operator errors. And some are unknown,
but it's easy to chalk it up to bugs in, say, dmcache, which is another
big unknown. I don't report any of these issues because I have no useful
data to report. (And before you ask, the write cache of the dmcache
backing ssd is switched off, although I believe the Crucial ssds that I
use are some of the rare devices which actually honor FUA, and have data
at rest protection).

The big advantage of btrfs is that I can often mount it and make a full
backup (which only stats files to see if they changed) of the disk before
recreating the filesystem, and even if it fails, it usually can and does
loudly tell me so. I had way worse experience with, say, reiserfs long
ago, for example :)

> A mechanism explaining this anecdote isn't clear. Not even crc32c
> checksum collision would explain more than maybe one instance of it.
> 
> I'm curious what Zygo thinks about this.

I think you are reading way too much extra stuff into what I wrote - I
really feel you are replying to a superficial version of what I actually
reported, because you are used to other such reports maybe?

> The problem with the anecdotal method of arguing in favor of software
> bugs as the explanation?

Is this your method? Because it is certainly not mine, or where do you see
that?

I have carefully presented facts together with supporting evidence that
it is unlikely to be a hardware problem. Nowhere did I exclude hardware
problems, nowhere did I exclude operator errors, nowhere did I exclude
errors in other parts of the kernel (the most common source for corruption
problems I have encountered).

Nowhere have I claimed it must be a software problem.

> also anecdote. I've had no problems that I can attribute to Btrfs. All
> were hardware or user sabotage. And I've had zero data loss, outside
> of user sabotage.

That's great for you. I have ~100 active disks here with >>200TB of data,
the vast majority nowadays runs with btrfs. The filesystems tend to be
very busy, with very different use cases, and I think as far as anecdotes
go, I have far better chances of running into btrfs (or other!) bugs than
most other people, possibly (don't know) even you.

> I have seen device UNC read errors, corrected automatically by Btrfs.

Yeah, me too, but only on a USB stick that I knew was badly borked.

I currently see about one unreadable sector every 1-1.5 years, and in most
cases, raid5 scrubbing takes care of that, or I identify the file, delete
it, and move on. I had a single case of a sudden drive death in my whole
3x years career (maybe I am lucky).

Of course these are just anecdotes. But hundreds of disks give a much
better statistical base than, say, a single drive in somebody's home
computer.

But of course I only have a single multi-device btrfs filesystem (as an
experiment), so my statistical basis here is very thin...

> And I have seen devices return bad data that Btrfs caught, that would
> otherwise have been silent corruption of either metadata or data, and

Me too, me too. Wonderful feature, that.

I have also seen btrfs fail to use the mirror copy because of a broken
block, even though the mirror would have been fine - quite a number of
these have been fixed in recent years, did you know that? Until quite
recently, btrfs only believed in the checksum, and if that was good,
dup/raid1 was of no use...

Until recently, btrfs did practically nothing about corruption introduced
by itself or the kernel (or bad ram for example). It's great that this
changed, even though I had a few filesystems that the new stricter 5.4.x
checker refused to mount.

It's painful, but clearly progress.

I really think you confuse me with some other person that mindlessly
complains. I think my complaints do have substance, though, and you chide
unfairly :)

> But in your case, practically ever server and laptop? That's weird and
> unexpected.

Not if you have some google fu. btrfs corruption problems are extremely
common, but it's often very hard to guess what caused them.

I also think you are super badly informed about btrfs (or pretend to be,
to defend btrfs against all reason) - recent kernels report a lot of scary
corruption messages and refuse mounts, with no hint as to what could be
the problem (a stricter checker - your fs might work fine under previous
kernels).

If btrfs declares my fs as corrupt, I consider that filesystem
corruption. It took me quite a while to realise it's stricter checking and
not necessarily an indication of a real problem. I quietly reformatted those
affected partitions.

Seriously, if you think claims of btrfs corruption are so far-fetched and
unexpected, you live in a parallel universe.

Just go through kernel changelogs and count the btrfs bugfixes that could
somehow lead to corruption on somebody's system - btrfs received a lot of
bugfixes (which is good!) but it also means there certainly were a lot of
bugs in it.

And must I remind you of the raid5 issues - I never ran into these,
because careful reading of the documentation clearly told me to not use
it, but it certainly caused a lot of btrfs corruption - let's chalk that
up to user errors, though.

> And it makes me wonder what's in common. Btrfs is much fussier than
> other file systems because the by far largest target for corruption,
> isn't file system metadata, but data. The actual payload

I don't think this is true. File data might offer more surface, but there
are many workloads (some of mine included) where metadata is shuffled around
a lot more than data, and there is a lot less that can go wrong with actual
data - btrfs just has to copy and checksum it - while for metadata, very
complicated algorithms are in use.

Maybe actual data is the largest target, but I don't think you can
substantiate that claim in this generality.

> of a file system isn't the file system. And Btrfs is the only Linux
> native file system that checksums data. The other file systems check

Which is exactly why I am using it. I had a single case of a file that was
silently corrupted on xfs in the last decade, and I only caught it because
the backup had a good copy, which also practically proves that it was
silent data corruption.

> only metadata, and only somewhat recently, depending on the
> distribution you're using.

I think that is a clear case of fanboyism - ext4 has had metadata
checksums for almost 8 years in the standard kernel now, and is probably
the most commonly used fs. XFS has had metadata checksums in the standard
kernel for more than 6 years. I am not sure how stable for production
btrfs was when ext4 introduced these.

Sure, metadata checksums are for noobs, but let's not make other
filesystems look worse than they really are.

> For sure. But take the contrary case that other file systems have
> depended on for more than a decade: assuming the hardware is returning
> valid data. This is intrinsic to their design. And go back before they
> had metadata checksumming, and you'd see it stated on their lists that
> they do assume this, and if your devices return any bad data, it's not
> the file system's fault at all. Not even the lack of reporting any
> kind of problem whatsoever. How is that better?

Sure, but I have considerable data about devices returning bad data over
decades, as, remember, I keep md5 sums of practically all my files and
more or less regularly compare them, and in many cases have backups, so I
can even investigate what exactly was corrupted and how.
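
The mechanics are nothing fancy - roughly something like this, run
periodically against the archive trees (paths are made up):

   # record checksums once, when files are archived
   find /archive -type f -print0 | xargs -0 md5sum > /archive/MD5SUMS
   # later, verify everything and print only the mismatches
   md5sum --quiet -c /archive/MD5SUMS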

I am sorry to bring you the bad news, but outside of known broken hardware
(e.g. the cmd640 corruption, which I suffered from if somebody is old
enough to remember those), devices returning bad data happens, but is
_exceedingly_ rare. Unreadable sectors are by far more common on spinning
rust, and in my experience, quite rare unless there was "an incident" (such
as a head crash).

The most common source of data corruption is not bad hardware, especially
on hardware that otherwise works fine (e.g. survives a few hours of
memtest etc. and keeps file data in general), but software bugs. By far
the most common source of data loss for me is kernel bugs, especially
in recent years. The second most common source of data loss is operator
error, at least for me. Having backups certainly made me careless.

> Well indeed, not long after Btrfs was demonstrating these are actually
> more common problems that suspected, metadata checksumming started

Sure they are more common than suspected, but that's trivially true since
practically nobody expected them.

And I know these problems exist, having suffered from them. But they are
still an insignificant issue.

Maybe we come from different backgrounds - practically all my data is
on hardware raid5 or better, and an unreadable sector is not something
my filesystem usually has to take care of. They happen. Also, hardware
has become both better (e.g. checksumming on transfer) and worse (as in,
cheaper and much closer to physical limits).

Yet still, disks silently returning other data is exceedingly rare (even
if you include controller firmware problems - I have no doubt that lsi
controller firmwares are a complete bugfest).

However, what you are seeing is "btrfs is reporting bad checksums", and
you wrongly seem to ascribe all these cases to hardware, while probably
many of these cases are driver bugs, mm bugs or even filesystem bugs (a
checksum will also fail if a metadata block is outdated or points to the
wrong data block for example, which can easily happen if something goes
wrong during tree management).

I think that is not warranted without further evidence, which you don't
seem to have.

> while ago on XFS, and very recently on ext4). And they are catching a
> lot of these same kinds of layer and hardware bugs. Hardware does not
> just mean the drive, it can be power supply, logic board, controller,
> cables, drive write caches, drive firmware, and other drive internals.

And it can also be software. And the filesystem. On what grounds do you
exclude btrfs from this list, for example? It clearly had a lot of bugs,
and like every complex piece of software, it surely has a lot of bugs
left.

> And the only way any problem can be fixed, is to understand how, when
> and where it happened.

Yes, and you can't understand how if you simply exclude the
filesystem because it probably was the hardware anyway and ignore the
problem. "Could not reproduce this in the current kernel anymore, maybe
it's fixed, closing bug".

A disclaimer: this mail (and your mail!) was way too long. When I can't
fully participate in any (potential) further discussion, it is because I
lack the time, not for other reasons.

Greetings,

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: btrfs dev del not transaction protected?
  2019-12-20 17:07           ` Marc Lehmann
@ 2019-12-21  1:23             ` Qu Wenruo
  0 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2019-12-21  1:23 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 1232 bytes --]



On 2019/12/21 上午1:07, Marc Lehmann wrote:
>> Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs
>> filesystem I restored to went into readonly mode with ENOSPC. Another
>> hardware problem?
> 
> btrfs check gave me a possible hint:
> 
>    Checking filesystem on /dev/mapper/xmnt-cold15
>    UUID: 6e035cfe-5b47-406a-998f-b8ee6567abbc
>    [1/7] checking root items
>    [2/7] checking extents
>    [3/7] checking free space tree
>    cache and super generation don't match, space cache will be invalidated

That's common, and not a problem at all.
Btrfs will rebuild the free space tree.

>    [4/7] checking fs roots
>    [no other errors]
> 
> But mounting with clear_cache,space_cache=v2 didn't help, df still shows 0
> bytes free, "btrfs f us" still shows 3tb unallocated. I'll play around with
> it more...

Df reporting 0 available is a bug, and the cause has been pinned down.
It's that btrfs_statfs() can't co-operate with the latest over-commit
behavior.

This happens when there are some metadata operations queued.
It's completely a runtime false alert; nothing on-disk is incorrect.

I had a fix submitted for it.
https://patchwork.kernel.org/patch/11293419/

Thanks,
Qu

> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: btrfs dev del not transaction protected?
  2019-12-20 16:53           ` Marc Lehmann
  2019-12-20 17:24             ` Remi Gauvin
  2019-12-20 20:24             ` Chris Murphy
@ 2019-12-21  1:32             ` Qu Wenruo
  2 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2019-12-21  1:32 UTC (permalink / raw)
  To: Marc Lehmann, Josef Bacik; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 10518 bytes --]



On 2019/12/21 上午12:53, Marc Lehmann wrote:
> On Fri, Dec 20, 2019 at 09:41:15PM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> BTW, that chunk number is very small, and since it has 0 tolerance, it
>> looks like to be SINGLE chunk.
>>
>> In that case, it looks like a temporary chunk from older mkfs, and it
>> should contain no data/metadata at all, thus brings no data loss.
> 
> Well, there indeed should not have been any data or metadata left as the
> btrfs dev del succeeded after lengthy copying.
> 
>> BTW, "btrfs ins dump-tree -t chunk <dev>" would help a lot.
>> That would directly tell us if the devid 1 device is in chunk tree.
> 
> Apologies if I wasn't too clear about it - I already had to mkfs and
> redo the filesystem. I understand that makes tracking this down hard or
> impossible, but I did need that machine and filesystem.
> 
>>> And if you want to hear more "insane" things, after I hard-reset
>>> my desktop machine (5.2.21) two days ago I had to "btrfs rescue
>>> fix-device-size" to be able to mount (can't find the kernel error atm.).
>>
>> Consider all these insane things, I tend to believe there is some
>> FUA/FLUSH related hardware problem.
> 
> Please don't - I honestly think btrfs developers are way to fast to blame
> hardware for problems. I currently lose btrfs filesystems about once every
> 6 months, and other than the occasional user error, it's always the kernel
> (e.g. 4.11 corrupting data, dmcache and/or bcache corrupting things,
> low-memory situations etc. - none of these seem to be centric to btrfs,
> but none of those are hardware errors either). I know its the kernel in
> most cases because in those cases, I can identify the fix in a later
> kernel, or the mitigating circumstances don't appear (e.g. freezes).
> 
> In any case if it is a hardware problem, then linux and/or btrfs has
> to work around them, because it affects many different controllers on
> different boards:
> 
> - dell perc h740 on "doom" and "cerebro"
> - intel series 9 controller on "doom'" and "cerebro".
> - samsung nvme controller on "yoyo" and "yuna".
> - marvell sata controller on "doom".
> 
> Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs
> filesystem I restored to went into readonly mode with ENOSPC. Another
> hardware problem?
> 
> [41801.618772] ------------[ cut here ]------------
> [41801.618776] BTRFS: Transaction aborted (error -28)

According to your later replies, this bug turns out to be a problem in
over-commit calculation.

It doesn't really take per-disk requirements into consideration, thus can't
handle cases like a 3-disk RAID1 with 2 full disks.
Now it acts just as if we were using DUP profiles, thus causing the problem.

To Josef, any idea to fix it?
I guess we could go the complex statfs() way to do a calculation on how
many bytes can really be allocated.

Or hugely reduce the over-commit threshold?

Thanks,
Qu

> [41801.618843] WARNING: CPU: 2 PID: 5713 at fs/btrfs/inode.c:3159 btrfs_finish_ordered_io+0x730/0x820 [btrfs]
> [41801.618844] Modules linked in: nfsv3 nfs fscache nvidia_modeset(POE) nvidia(POE) btusb algif_skcipher af_alg dm_crypt nfsd auth_rpcgss nfs_acl lockd grace cls_fw sch_htb sit tunnel4 ip_tunnel hidp act_police cls_u32 sch_ingress sch_tbf 8021q garp mrp stp llc ip6t_REJECT nf_reject_ipv6 xt_CT xt_MASQUERADE xt_nat xt_REDIRECT nft_chain_nat nf_nat xt_owner xt_TCPMSS xt_DSCP xt_mark nf_log_ipv4 nf_log_common xt_LOG xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_length xt_mac xt_tcpudp nft_compat nft_counter nf_tables xfrm_user xfrm_algo nfnetlink cmac uhid bnep tda10021 snd_hda_codec_hdmi binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass tda827x tda10023 crct10dif_pclmul mei_hdcp crc32_pclmul btrtl btbcm rc_tt_1500 ghash_clmulni_intel snd_emu10k1 btintel snd_util_mem snd_ac97_codec aesni_intel bluetooth snd_hda_intel budget_av snd_rawmidi snd_intel_nhlt crypto_simd saa7146_vv
> [41801.618864]  snd_hda_codec videobuf_dma_sg budget_ci videobuf_core snd_seq_device budget_core cryptd ttpci_eeprom glue_helper snd_hda_core saa7146 dvb_core intel_cstate ac97_bus snd_hwdep rc_core snd_pcm intel_rapl_perf mxm_wmi cdc_acm pcspkr videodev snd_timer ecdh_generic snd emu10k1_gp ecc mc gameport soundcore mei_me mei mac_hid acpi_pad tcp_bbr drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt ipmi_devintf ipmi_msghandler hid_generic usbhid hid usbkbd coretemp nct6775 hwmon_vid sunrpc parport_pc ppdev lp parport msr ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio libcrc32c ahci megaraid_sas i2c_i801 libahci lpc_ich r8169 realtek wmi video [last unloaded: nvidia]
> [41801.618887] CPU: 2 PID: 5713 Comm: kworker/u8:15 Tainted: P           OE     5.4.5-050405-generic #201912181630
> [41801.618888] Hardware name: MSI MS-7816/Z97-G43 (MS-7816), BIOS V17.8 12/24/2014
> [41801.618903] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> [41801.618916] RIP: 0010:btrfs_finish_ordered_io+0x730/0x820 [btrfs]
> [41801.618917] Code: 49 8b 46 50 f0 48 0f ba a8 40 ce 00 00 02 72 1c 8b 45 b0 83 f8 fb 0f 84 d4 00 00 00 89 c6 48 c7 c7 48 33 62 c0 e8 eb 9c 91 d5 <0f> 0b 8b 4d b0 ba 57 0c 00 00 48 c7 c6 40 67 61 c0 4c 89 f7 bb 01
> [41801.618918] RSP: 0018:ffffc18b40edfd80 EFLAGS: 00010282
> [41801.618921] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left
> [41801.618922] RAX: 0000000000000000 RBX: ffff9f8b7b2e3800 RCX: 0000000000000006
> [41801.618922] BTRFS info (device dm-35): forced readonly
> [41801.618924] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9f8bbeb17440
> [41801.618924] RBP: ffffc18b40edfdf8 R08: 00000000000005a6 R09: ffffffff979a4d90
> [41801.618925] R10: ffffffff97983fa8 R11: ffffc18b40edfbe8 R12: ffff9f8ad8b4ab60
> [41801.618926] R13: ffff9f867ddb53c0 R14: ffff9f8bbb0446e8 R15: 0000000000000000
> [41801.618927] FS:  0000000000000000(0000) GS:ffff9f8bbeb00000(0000) knlGS:0000000000000000
> [41801.618928] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [41801.618929] CR2: 00007f8ab728fc30 CR3: 000000049080a002 CR4: 00000000001606e0
> [41801.618930] Call Trace:
> [41801.618943]  finish_ordered_fn+0x15/0x20 [btrfs]
> [41801.618957]  normal_work_helper+0xbd/0x2f0 [btrfs]
> [41801.618959]  ? __schedule+0x2eb/0x740
> [41801.618973]  btrfs_endio_write_helper+0x12/0x20 [btrfs]
> [41801.618975]  process_one_work+0x1ec/0x3a0
> [41801.618977]  worker_thread+0x4d/0x400
> [41801.618979]  kthread+0x104/0x140
> [41801.618980]  ? process_one_work+0x3a0/0x3a0
> [41801.618982]  ? kthread_park+0x90/0x90
> [41801.618984]  ret_from_fork+0x1f/0x40
> [41801.618985] ---[ end trace 35086266bf39c897 ]---
> [41801.618987] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left
> 
> unmount/remount seems to make it work again, and it is full (df) yet has
> 3TB of unallocated space left. No clue what to do now, do I have to start
> over restoring again?
> 
>    Filesystem               Size  Used Avail Use% Mounted on
>    /dev/mapper/xmnt-cold15   27T   23T     0 100% /cold1
> 
>    Overall:
>        Device size:                       24216.49GiB
>        Device allocated:                  20894.89GiB
>        Device unallocated:                 3321.60GiB
>        Device missing:                        0.00GiB
>        Used:                              20893.68GiB
>        Free (estimated):                   3322.73GiB      (min: 1661.93GiB)
>        Data ratio:                               1.00
>        Metadata ratio:                           2.00
>        Global reserve:                        0.50GiB      (used: 0.00GiB)
> 
>    Data,single: Size:20839.01GiB, Used:20837.88GiB (99.99%)
>       /dev/mapper/xmnt-cold15      9288.01GiB
>       /dev/mapper/xmnt-cold12      7427.00GiB
>       /dev/mapper/xmnt-cold13      4124.00GiB
> 
>    Metadata,RAID1: Size:27.91GiB, Used:27.90GiB (99.97%)
>       /dev/mapper/xmnt-cold15        25.44GiB
>       /dev/mapper/xmnt-cold12        24.46GiB
>       /dev/mapper/xmnt-cold13         5.91GiB
> 
>    System,RAID1: Size:0.03GiB, Used:0.00GiB (6.69%)
>       /dev/mapper/xmnt-cold15         0.03GiB
>       /dev/mapper/xmnt-cold12         0.03GiB
> 
>    Unallocated:
>       /dev/mapper/xmnt-cold15         0.01GiB
>       /dev/mapper/xmnt-cold12         0.00GiB
>       /dev/mapper/xmnt-cold13      3321.59GiB
> 
> Please, don't always chalk it up to hardware problems - btrfs is a
> wonderful filesystem for many reasons, one reason I like is that it can
> detect corruption much earlier than other filesystems. This featire alone
> makes it impossible for me to go back to xfs. However, I had corruption
> on ext4, xfs, reiserfs over the years, but btrfs *is* simply way buggier
> still than those - before btrfs (and even now) I kept md5sums of all
> archived files (~200TB), and xfs and ext4 _do_ a much better job at not
> corrupting data than btrfs on the same hardware - while I get filesystem
> problems about every half a year with btrfs, I had (silent) corruption
> problems maybe once every three to four years with xfs or ext4 (and not
> yet on the bxoes I use currently).
> 
> Please take these issues seriously - the trend of "it's a hardware
> problem" will not remove the "unstable" stigma from btrfs as long as btrfs
> is clearly more buggy then other filesystems.
> 
> Sorry to be so blunt, but I am a bit sensitive with always being told
> "it's probably a hardware problem" when it clearly affects practically any
> server and any laptop I administrate. I believe in btrfs, and detecting
> corruption early is a feature to me.
> 
> I understand it can be frustrating to be confronted with hard to explain
> accidents, and I understand if you can't find the bug with the sparse info
> I gave, especially as the bug might not even be in btrfs. But keep in mind
> that the people who boldly/dumbly use btrfs in production and restore
> dozens of terabytes from backup every so and so many months are also being
> frustrated if they present evidence from multiple machines and get told
> "its probably a hardware problem".
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: btrfs dev del not transaction protected?
  2019-12-20 20:24             ` Chris Murphy
  2019-12-20 23:30               ` Marc Lehmann
@ 2019-12-21 20:06               ` Zygo Blaxell
  1 sibling, 0 replies; 18+ messages in thread
From: Zygo Blaxell @ 2019-12-21 20:06 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS, Qu Wenruo, Marc Lehmann

[-- Attachment #1: Type: text/plain, Size: 17482 bytes --]

On Fri, Dec 20, 2019 at 01:24:02PM -0700, Chris Murphy wrote:
> On Fri, Dec 20, 2019 at 9:53 AM Marc Lehmann <schmorp@schmorp.de> wrote:
> >
> > On Fri, Dec 20, 2019 at 09:41:15PM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> 
> > > Consider all these insane things, I tend to believe there is some
> > > FUA/FLUSH related hardware problem.
> >
> > Please don't - I honestly think btrfs developers are way to fast to blame
> > hardware for problems.
> 
> That's because they have a lot of evidence of this, in a way that's
> only inferable with other file systems. This has long been suspected
> by, and demonstrated, well before Btrfs with ZFS development.
> 
> A reasonable criticism of Btrfs development is the state of the file
> system check repair, which still has danger warnings. But it's also a
> case of damned if they do, and damned if they don't provide it. It
> might be the best chance of recovery, so why not provide it?
> Conversely, the reality is that the file system is complicated enough,
> and the file system checker too slow, that the effort needs to be on
> (what I call) file system autopsy tools, to figure out why the
> corruption happened, and prevent that from happening. The repair is
> often too difficult.
> 
> Take, for example, the recent 5.2.0-5.2.14 corruption bug. That was
> self-reported once it was discovered and fixed, which took longer than
> usual, and developers apologized. What else can they do? It's not like
> the developers are blaming hardware for their own bugs. They have
> consistently taken responsibility for Btrfs bugs.
> 
> 
> > I currently lose btrfs filesystems about once every
> > 6 months, and other than the occasional user error, it's always the kernel
> > (e.g. 4.11 corrupting data, dmcache and/or bcache corrupting things,
> > low-memory situations etc. - none of these seem to be centric to btrfs,
> > but none of those are hardware errors either). I know its the kernel in
> > most cases because in those cases, I can identify the fix in a later
> > kernel, or the mitigating circumstances don't appear (e.g. freezes).
> 
> Usually Btrfs developers do mention the possibility of other software
> layers contributing to the problem, it's a valid observation that this
> possibility be stated.

Also note that not all btrfs developers will agree on a failure analysis.
Some patience is required.  Be prepared to support your bug report with
working reproducers and relevant evidence, possibly many times, with fresh
backtraces on each new kernel release in which the bug still appears.

> However, if it's exclusively a software problem, then it should be
> reproducible on other systems.
> 
> 
> > In any case if it is a hardware problem, then linux and/or btrfs has
> > to work around them, because it affects many different controllers on
> > different boards:
> 
> How do you propose Btrfs work around it? In particular when there are
> additional software layers over which it has no control?
> 
> Have you tried disabling the (drives') write cache?

Apparently many sysadmins disable write cache proactively on all drives,
instead of waiting until the drive drops some data to learn that there's
a problem with the firmware.  That's a reasonable tradeoff for btrfs,
which already has a heavily optimized write path (most of the IO time
in btrfs commit is spent _reading_ metadata).

> > Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs
> > filesystem I restored to went into readonly mode with ENOSPC. Another
> > hardware problem?
> 
> > [41801.618887] CPU: 2 PID: 5713 Comm: kworker/u8:15 Tainted: P           OE     5.4.5-050405-generic #201912181630
> 
> Why is this kernel tainted? The point of pointing this out isn't to
> blame whatever it tainting the kernel, but to point out that
> identifying the cause of your problems is made a lot more difficult. I
> think you need to simplify the setup, a lot, in order to reduce the
> surface area of possible problems. Any bug hunt is made way harder
> when there's complication.
> 
> 
> 
> > [41801.618888] Hardware name: MSI MS-7816/Z97-G43 (MS-7816), BIOS V17.8 12/24/2014
> > [41801.618903] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> > [41801.618916] RIP: 0010:btrfs_finish_ordered_io+0x730/0x820 [btrfs]
> > [41801.618917] Code: 49 8b 46 50 f0 48 0f ba a8 40 ce 00 00 02 72 1c 8b 45 b0 83 f8 fb 0f 84 d4 00 00 00 89 c6 48 c7 c7 48 33 62 c0 e8 eb 9c 91 d5 <0f> 0b 8b 4d b0 ba 57 0c 00 00 48 c7 c6 40 67 61 c0 4c 89 f7 bb 01
> > [41801.618918] RSP: 0018:ffffc18b40edfd80 EFLAGS: 00010282
> > [41801.618921] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left
> > [41801.618922] RAX: 0000000000000000 RBX: ffff9f8b7b2e3800 RCX: 0000000000000006
> > [41801.618922] BTRFS info (device dm-35): forced readonly
> > [41801.618924] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9f8bbeb17440
> > [41801.618924] RBP: ffffc18b40edfdf8 R08: 00000000000005a6 R09: ffffffff979a4d90
> > [41801.618925] R10: ffffffff97983fa8 R11: ffffc18b40edfbe8 R12: ffff9f8ad8b4ab60
> > [41801.618926] R13: ffff9f867ddb53c0 R14: ffff9f8bbb0446e8 R15: 0000000000000000
> > [41801.618927] FS:  0000000000000000(0000) GS:ffff9f8bbeb00000(0000) knlGS:0000000000000000
> > [41801.618928] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [41801.618929] CR2: 00007f8ab728fc30 CR3: 000000049080a002 CR4: 00000000001606e0
> > [41801.618930] Call Trace:
> > [41801.618943]  finish_ordered_fn+0x15/0x20 [btrfs]
> > [41801.618957]  normal_work_helper+0xbd/0x2f0 [btrfs]
> > [41801.618959]  ? __schedule+0x2eb/0x740
> > [41801.618973]  btrfs_endio_write_helper+0x12/0x20 [btrfs]
> > [41801.618975]  process_one_work+0x1ec/0x3a0
> > [41801.618977]  worker_thread+0x4d/0x400
> > [41801.618979]  kthread+0x104/0x140
> > [41801.618980]  ? process_one_work+0x3a0/0x3a0
> > [41801.618982]  ? kthread_park+0x90/0x90
> > [41801.618984]  ret_from_fork+0x1f/0x40
> > [41801.618985] ---[ end trace 35086266bf39c897 ]---
> > [41801.618987] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left
> >
> > unmount/remount seems to make it work again, and it is full (df) yet has
> > 3TB of unallocated space left. No clue what to do now, do I have to start
> > over restoring again?
> >
> >    Filesystem               Size  Used Avail Use% Mounted on
> >    /dev/mapper/xmnt-cold15   27T   23T     0 100% /cold1
> 
> Clearly a bug, possibly more than one. This problem is being discussed
> in other threads on df misreporting with recent kernels, and a fix is
> pending.
> 
> As for the ENOSPC, also clearly a bug. But not clear why or where.
> 
> 
> > Please, don't always chalk it up to hardware problems - btrfs is a
> > wonderful filesystem for many reasons, one reason I like is that it can
> > detect corruption much earlier than other filesystems. This featire alone
> > makes it impossible for me to go back to xfs. However, I had corruption
> > on ext4, xfs, reiserfs over the years, but btrfs *is* simply way buggier
> > still than those - before btrfs (and even now) I kept md5sums of all
> > archived files (~200TB), and xfs and ext4 _do_ a much better job at not
> > corrupting data than btrfs on the same hardware - while I get filesystem
> > problems about every half a year with btrfs, I had (silent) corruption
> > problems maybe once every three to four years with xfs or ext4 (and not
> > yet on the bxoes I use currently).
> 
> I can't really parse the suggestion that you are seeing md5 mismatches
> (indicating data changes) on Btrfs, where Btrfs doesn't produce a csum
> warning along with EIO on those files? Are these files nodatacow,
> either by mount option nodatasum or nodatacow, or using chattr +C on
> these files?
> 
> A mechanism explaining this anecdote isn't clear. Not even crc32c
> checksum collision would explain more than maybe one instance of it.
> 
> I'm curious what Zygo thinks about this.

Hardware bugs and failures are certainly common, and fleetwide hardware
failures do happen.  They're also recognizable as hardware bugs--some
specific failure modes (e.g. single-bit data value errors, parent transid
verify failure after crashes) are definitely hardware and can be easily
spotted with only a few lines of kernel logs.  Some components of btrfs
(e.g.  scrubs, csum verification, raid1 corruption recovery) are very
reliable detectors of hardware or firmware misbehavior (although sometimes
it is not trivial to identify _which_ hardware is at fault).  Some parts
of btrfs (like free space management) are completely btrfs, and cannot
be affected by hardware failures without destroying the entire filesystem.

On the other hand, it's not like btrfs or the Linux kernel has been
bug free either, and a lot of serious but hard to detect bugs are 5-10
years old when they get fixed.  All kernels before 5.1 had silent data
corruption bugs for compressed data at hole boundaries.  Kernels 5.1 to
5.4 have use-after-free bugs in btrfs that lead to metadata corruption
(5.1), transaction aborts due to self-detected metadata corruption
(5.2), and crashes (5.3 and 5.4).  5.2 also had a second metadata
corruption with deadlock bug.  Other parts of the kernel are hard on
data as well: somewhere around 4.7 a year-old kernel memory corruption
bug was found in the r8169 network driver, and 4.0, 4.19, and 5.1 all
had famous block-layer bugs that would destroy any filesystem under
certain conditions.

I test every upstream kernel release thoroughly before deploying to
production, because every upstream Linux kernel release has thousands
of bugs (btrfs is usually about 1-2% of those).  I am still waiting
for the very first upstream kernel release for btrfs that can run our
full production stress test workload without any backported fixes and
without crashing or corrupting data or metadata for 30 days.  So far
that goal has never been met.  We upgrade kernels when a new release
gets better than an old one, but the median uptime under stress is
still an order of magnitude short of the 30 day mark, and our testing
on 5.4.5+fixes isn't done yet.

Unfortunately, due to the nature of crashing bugs, we can only work on
the most frequently occurring bug at any time, and each one has to be
fixed before the next most frequently occurring bug can be discovered,
making these fixes a very sequential process.  Then there's the two-month
lag to get patches from the mailing list into stable kernels, which is
plenty of time for new regressions to appear, and we start over again
with a fresh set of bugs to fix.

btrfs dev del bugs are not crashing bugs, so they are so far down my
priority list that I haven't bothered to test for them, or even to report
them when I find one accidentally.  There are a few bugs there though,
especially if you are low on metadata space (which is a likely event if
you just removed an entire disk's worth of storage) or btrfs has a bug in
that kernel version that just makes btrfs _think_ it is low on metadata
space, and the transaction aborts during the delete.  Occasionally I hit
one of these in an array and work around it with a patch like this one:

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 56e35d2e957c..b16539fd2c23 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7350,6 +7350,8 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 #if 0
                ret = -EINVAL;
                goto error;
+#else
+               btrfs_set_super_num_devices(fs_info->super_copy, total_dev);
 #endif
        }
        if (btrfs_super_total_bytes(fs_info->super_copy) <

Probably not a good idea for general use, but it may solve an immediate
problem if the problem is simply that the wrong number of devices is
stored in the superblock.

> 
> 
> 
> 
> 
> 
> >
> > Please take these issues seriously - the trend of "it's a hardware
> > problem" will not remove the "unstable" stigma from btrfs as long as btrfs
> > is clearly more buggy then other filesystems.
> 
> > Sorry to be so blunt, but I am a bit sensitive with always being told
> > "it's probably a hardware problem" when it clearly affects practically any
> > server and any laptop I administrate. I believe in btrfs, and detecting
> > corruption early is a feature to me.
> 
> The problem with the anecdotal method of arguing in favor of software
> bugs as the explanation? It directly goes against my own experience,
> also anecdote. I've had no problems that I can attribute to Btrfs. All
> were hardware or user sabotage. And I've had zero data loss, outside
> of user sabotage.

You are definitely not testing hard enough.  ;)

At one point in 2016 there were 145 bugs active that we know about today.
About 10 of those 145 were only discovered in the last few months (i.e. it
was broken in 2016, and we only know now, after 3 years of hindsight, how
broken it was then).  https://imgur.com/a/A2sXcQB

Thankfully, many of those bugs were mostly harmless, but some were not:
I've found at least 5 distinct data or metadata corrupting bugs since
2014, and confirmed the existence of several more in regression testing.

> I have seen device UNC read errors, corrected automatically by Btrfs.
> And I have seen devices return bad data that Btrfs caught, that would
> otherwise have been silent corruption of either metadata or data, and
> this was corrected in the raid1 cases, and merely reported in the
> non-raid cases. And I've also seen considerable corruption reported
> upon SD Cards in the midst of implosion and becoming read only. But
> even read only, I was able to get all the data out.

btrfs data recovery on raid1 from csum and UNC sector failures is
excellent.  I've seen no issues there since 3.18ish.  I do test that
from time to time with VMs and fault injection and also with real
disk failures.
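
A minimal sketch of that kind of test, with loop devices standing in for
real disks (sizes, names and offsets below are made up, and the clobbered
range only proves the point if it happens to land on allocated chunks):

   truncate -s 1G /tmp/a.img /tmp/b.img
   A=$(losetup -f --show /tmp/a.img)
   B=$(losetup -f --show /tmp/b.img)
   mkfs.btrfs -d raid1 -m raid1 "$A" "$B"
   mount "$A" /mnt
   dd if=/dev/urandom of=/mnt/testfile bs=1M count=512 conv=fsync
   umount /mnt
   # clobber part of one mirror, well away from the superblocks
   dd if=/dev/zero of="$A" bs=1M seek=300 count=32 conv=notrunc
   mount "$A" /mnt
   btrfs scrub start -B /mnt   # should detect and repair from the good copy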

btrfs on raid5 (internal or external raid5 implementation), device delete,
and some unfortunate degraded mode behaviors still need some work.

> But in your case, practically ever server and laptop? That's weird and
> unexpected. And it makes me wonder what's in common. Btrfs is much
> fussier than other file systems because the by far largest target for
> corruption, isn't file system metadata, but data. The actual payload
> of a file system isn't the file system. And Btrfs is the only Linux
> native file system that checksums data. The other file systems check
> only metadata, and only somewhat recently, depending on the
> distribution you're using.

If the "corruption" consists of large quantities of zeros, the problem
might be using the (default) noflushoncommit mount option, or using
applications that don't fsync() religiously.  This is correct filesystem
behavior, though maybe not behavior any application developer wants.
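
A minimal sketch of the fsync() discipline that avoids the
zero-filled-file surprise is the usual write-to-temp-then-rename
pattern (file names here are made up for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write buf to path so that after a crash the file contains either the
 * old contents or the new contents, never a truncated or zero-filled
 * version.  Minimal sketch: no partial-write loop, no error reporting.
 */
static int write_file_durably(const char *path, const char *buf, size_t len)
{
        char tmp[4096];
        int fd, dirfd, ret;

        snprintf(tmp, sizeof(tmp), "%s.tmp", path);

        fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return -1;

        /* Data must be on stable storage before the rename happens. */
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
                close(fd);
                unlink(tmp);
                return -1;
        }
        close(fd);

        if (rename(tmp, path) != 0)
                return -1;

        /*
         * fsync the directory containing the file (the cwd in this
         * sketch) so the rename itself survives a crash.
         */
        dirfd = open(".", O_RDONLY);
        if (dirfd < 0)
                return -1;
        ret = fsync(dirfd);
        close(dirfd);
        return ret;
}

int main(void)
{
        const char *data = "important contents\n";

        return write_file_durably("example.conf", data, strlen(data)) ? 1 : 0;
}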

If the corruption affects compressed data adjacent to holes, then it's
a known problem fixed in 5.1 and later.

If the corruption is specifically and only parent transid verify failures
after a crash, UNC sector read, or power failure, then we'd be looking for
drive firmware issues or non-default kernel settings to get a fleetwide
effect.
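
A cheap experiment in that case, assuming SATA drives and hdparm being
available (the device name is a placeholder), is to rule out a volatile
write cache that isn't honoring flushes:

   # hdparm -W /dev/sdX       # report whether write caching is enabled
   # hdparm -W 0 /dev/sdX     # temporarily disable it for testing

If the parent transid failures stop reproducing with the write cache
off, the drive is a much stronger suspect than btrfs.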

If the corruption is general metadata corruption without metadata page
csum failures, then it could be host RAM failure, general kernel memory
corruption (i.e. you have to look at all the other device drivers in the
system), or known bugs in btrfs kernel 5.1 and later.

If the corruption is all csum failures, then there's a long list of
drive issues that could cause it, or the partition could be trampled by
other software (BIOSes are sometimes surprisingly bad at this).

> > I understand it can be frustrating to be confronted with hard-to-explain
> > accidents, and I understand if you can't find the bug with the sparse info
> > I gave, especially as the bug might not even be in btrfs. But keep in mind
> > that the people who boldly/dumbly use btrfs in production and restore
> > dozens of terabytes from backup every few months are also being
> > frustrated if they present evidence from multiple machines and get told
> > "it's probably a hardware problem".
> 
> For sure. But take the contrary case that other file systems have
> depended on for more than a decade: assuming the hardware is returning
> valid data. This is intrinsic to their design. And go back before they
> had metadata checksumming, and you'd see it stated on their lists that
> they do assume this, and if your devices return any bad data, it's not
> the file system's fault at all. Not even the lack of reporting any
> kind of problem whatsoever. How is that better?
> 
> Well indeed, not long after Btrfs was demonstrating these are actually
> more common problems than suspected, metadata checksumming started
> creeping into other file systems, finally becoming the default (a
> while ago on XFS, and very recently on ext4). And they are catching a
> lot of these same kinds of layer and hardware bugs. Hardware does not
> just mean the drive, it can be power supply, logic board, controller,
> cables, drive write caches, drive firmware, and other drive internals.
> 
> And the only way any problem can be fixed is to understand how, when
> and where it happened.
> 
> --
> Chris Murphy
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-12-21 20:06 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-20  4:05 btrfs dev del not transaction protected? Marc Lehmann
2019-12-20  5:24 ` Qu Wenruo
2019-12-20  6:37   ` Marc Lehmann
2019-12-20  7:10     ` Qu Wenruo
2019-12-20 13:27       ` Marc Lehmann
2019-12-20 13:41         ` Qu Wenruo
2019-12-20 16:53           ` Marc Lehmann
2019-12-20 17:24             ` Remi Gauvin
2019-12-20 17:50               ` Marc Lehmann
2019-12-20 18:00               ` Marc Lehmann
2019-12-20 18:28                 ` Eli V
2019-12-20 20:24             ` Chris Murphy
2019-12-20 23:30               ` Marc Lehmann
2019-12-21 20:06               ` Zygo Blaxell
2019-12-21  1:32             ` Qu Wenruo
2019-12-20 17:07           ` Marc Lehmann
2019-12-21  1:23             ` Qu Wenruo
2019-12-20 17:20           ` Marc Lehmann
