linux-btrfs.vger.kernel.org archive mirror
* I think he's dead, Jim
@ 2020-05-18 20:51 Justin Engwer
  2020-05-18 23:23 ` Chris Murphy
  2020-05-20  1:32 ` Zygo Blaxell
  0 siblings, 2 replies; 10+ messages in thread
From: Justin Engwer @ 2020-05-18 20:51 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I'm hoping to get some (or all) data back from what I can only assume
is the dreaded write hole. I did a fairly lengthy post on reddit that
you can find here:
https://old.reddit.com/r/btrfs/comments/glbde0/btrfs_died_last_night_pulling_out_hair_all_day/

TLDR; BTRFS keeps dropping to RO. System sometimes completely locks up
and needs to be hard powered off because of read activity on BTRFS.
See reddit link for actual errors.

I'm really not super familiar, or at all familiar, with BTRFS or the
recovery of it.
-- 

Justin Engwer


* Re: I think he's dead, Jim
  2020-05-18 20:51 I think he's dead, Jim Justin Engwer
@ 2020-05-18 23:23 ` Chris Murphy
       [not found]   ` <CAGAeKuv3y=rHvRsq6SVSQ+NadyUaFES94PpFu1zD74cO3B_eLA@mail.gmail.com>
  2020-05-20  1:32 ` Zygo Blaxell
  1 sibling, 1 reply; 10+ messages in thread
From: Chris Murphy @ 2020-05-18 23:23 UTC (permalink / raw)
  To: Justin Engwer; +Cc: Btrfs BTRFS

On Mon, May 18, 2020 at 2:51 PM Justin Engwer <justin@mautobu.com> wrote:
>
> Hi,
>
> I'm hoping to get some (or all) data back from what I can only assume
> is the dreaded write hole. I did a fairly lengthy post on reddit that
> you can find here:
> https://old.reddit.com/r/btrfs/comments/glbde0/btrfs_died_last_night_pulling_out_hair_all_day/
>
> TLDR; BTRFS keeps dropping to RO. System sometimes completely locks up
> and needs to be hard powered off because of read activity on BTRFS.
> See reddit link for actual errors.

Almost no one will follow the links. You've got a problem, which is
unfortunate, but you're also asking for help, so you need to make it
easy for readers to understand the setup instead of making them go
digging for it elsewhere. The details are also needed for archive
searchability, which an external reference doesn't provide.

a. kernel and btrfs-progs version; ideally also include some kernel
history for this file system
b. basics of the storage stack: what the physical drives are and how
they're connected
c. if VM, what's the hypervisor, are the drives being passed through,
what caching mode
d. the mkfs command used to create it; or just state the metadata and
data profiles; or paste 'btrfs fi us /mnt'
e. ideally a complete dmesg (start to finish, not snipped) from the time
of the original problem; this might be the prior boot. It's probably
too big to attach on-list, so in that case use nextcloud, dropbox,
pastebin, etc.
f. a current dmesg for the mount failure
g. btrfs check --readonly /dev/
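
A rough one-shot way to collect most of that (mountpoint and device
names below are placeholders, adjust to your setup):

# versions and basic layout
uname -r; btrfs --version
lsblk -o NAME,SIZE,MODEL,SERIAL,TRAN

# profiles and usage (filesystem must be mounted for this one)
btrfs fi us /mnt > btrfs-usage.txt

# kernel log: current boot, plus the previous boot if the original
# failure happened there (needs a persistent journal)
dmesg > dmesg-current.txt
journalctl -k -b -1 > dmesg-previous-boot.txt

# read-only check; any one member device of the filesystem is enough
btrfs check --readonly /dev/sdX > check.txt 2>&1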


I thought we had a FAQ item with what info we wanted reported to the
list, but I can't find it.


Thanks,

-- 
Chris Murphy


* Re: I think he's dead, Jim
       [not found]     ` <CAJCQCtQXR+x4mG+jT34nhkE69sP94yio-97MLmd_ugKS+m96DQ@mail.gmail.com>
@ 2020-05-19 18:45       ` Justin Engwer
  2020-05-19 20:44         ` Chris Murphy
  0 siblings, 1 reply; 10+ messages in thread
From: Justin Engwer @ 2020-05-19 18:45 UTC (permalink / raw)
  To: linux-btrfs

On Mon, May 18, 2020 at 7:03 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> On Mon, May 18, 2020 at 6:47 PM Justin Engwer <justin@mautobu.com> wrote:
> >
> > Thanks for getting back to me Chris. Here's the info requested:
> >
> > a. Kernels are:
> > CentOS Linux (5.5.2-1.el7.elrepo.x86_64) 7 (Core)
> > CentOS Linux (4.16.7-1.el7.elrepo.x86_64) 7 (Core)
> > CentOS Linux (4.4.213-1.el7.elrepo.x86_64) 7 (Core)
> >
> > I was originally on 4.4, then updated to 4.16. After updating to 5.5 I
> > must have screwed up the grub boot default as it started booting to
> > 4.4.
>
> The problem happened while using kernel 5.5.2?
>

Likely 4.4

> These:
> parent transid verify failed on 2788917248 wanted 173258 found 173174
>
> suggest that the problem didn't happen too long ago. But the
> difficulty I see is that the "found" ranges from 172716 to 173167.
>
> A further difficulty is the wanted ranges from 173237 to 173258. That
> is really significant.
>
> Have there been crashes/power failures while the file system was being written?
>

Given the system is hard locking up when btrfs is accessing some data,
yes most likely.

>
> > btrfs-progs v4.9.1
>
> This is too old to attempt a repair. The errors reported seem
> reliable, but there might be other problems going on that it's not
> catching, so I suggest updating it in any case.
>
> Try this:
> https://copr.fedorainfracloud.org/coprs/ngompa/btrfs-progs-el8/
>
> But I can't recommend a repair except as a last resort. It seems like
> things can't get worse, but it's better to be prepared. It also
> includes the more capable offline scrape tool 'btrfs restore'.
>

Working on restoring. Will start with the 4 "good" drives. I recall,
from years ago working in a computer repair shop, that if a drive was
bad and we left it in the freezer overnight, we could get data off it
for a few hours before it went completely dead. Might be worth a shot
if nothing else works.
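
For reference, the rough shape of the 'btrfs restore' runs I'm
attempting (device and destination paths are placeholders; the
destination is on a separate, healthy filesystem):

# dry run first: list what restore thinks it can pull out, no writes
btrfs restore -D -v /dev/sdb /tmp/ignored

# real run: -i ignores errors and keeps going, -s also grabs snapshots,
# -m tries to restore ownership/mode/timestamps
btrfs restore -v -i -s -m /dev/sdb /mnt/recovery/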

Does BTRFS store whole files on single drives then use a parity across
all of them or does it break single large files up, store them across
different drives, then parity?

>
> > b. Physical drives are identical seagate SATA 3tb drives. Ancient
> > bastards. Connected through a combination of LSI HBA and motherboard.
>
> Does the LSI HBA have a cache enabled? If it's battery-backed it's
> probably OK but otherwise it should be disabled. And the write caches
> on the drives should be disabled. That's the conservative
> configuration. If the controller and drives really honor FUA/fsync
> then it's OK to leave the write caches enabled. But the problem is if
> they honor different flushes in different order you end up with an
> inconsistent file system. And that's bad for Btrfs because repairing
> inconsistency is difficult. It really just needs to be avoided in the
> first place.
>

All cards are LSI 9211 or 9200 in the system. None of them have onboard caching.

> >
> > c. Not a vm. They host(ed) vms though.
> >
> > d. [root@kvm2 ~]# btrfs fi us recovery/mount/
> > WARNING: RAID56 detected, not implemented
> > WARNING: RAID56 detected, not implemented
> > WARNING: RAID56 detected, not implemented
> > Overall:
> >     Device size:                  13.64TiB
> >     Device allocated:                0.00B
> >     Device unallocated:           13.64TiB
> >     Device missing:                  0.00B
> >     Used:                            0.00B
> >     Free (estimated):                0.00B      (min: 8.00EiB)
> >     Data ratio:                       0.00
> >     Metadata ratio:                   0.00
> >     Global reserve:              512.00MiB      (used: 0.00B)
> >
> > Data,RAID6: Size:4.39TiB, Used:0.00B
> >    /dev/sdh        1.46TiB
> >    /dev/sdi        1.46TiB
> >    /dev/sdl        1.46TiB
> >    /dev/sdo        1.46TiB
> >    /dev/sdp        1.46TiB
> >
> > Metadata,RAID6: Size:7.12GiB, Used:176.00KiB
>
> This is more difficult to recover from since it can spread a single
> transaction across multiple disks, and it's harder (sometimes
> impossible) to guarantee atomic updates. It's recommended to use
> raid1c3 or raid1c4 in this configuration. I understand that's not
> supported by the older kernels you were using, hopefully this was an
> experimental setup.
>

Noted. It's a homelab, so it's not ideal but not a huge issue. Just
time consuming to rebuild.

> >
> > f. Mounting without ro,norecovery,degraded results in immediate system
> > lockup and nothing in dmesg.
> >
> >  mount -o ro,norecovery,degraded /dev/sdi recovery/mount/
>
> I think that's a bug on the face of it. It shouldn't indefinitely hang.
>

Now mounts on Fedora. Drops to RO quickly though. See
https://pastebin.com/94BbRamb

> >
> > May 18 17:38:46 kvm2.mordor.local kernel: BTRFS info (device sdi):
> > disabling log replay at mount time
> > May 18 17:38:46 kvm2.mordor.local kernel: BTRFS info (device sdi):
> > allowing degraded mounts
> > May 18 17:38:46 kvm2.mordor.local kernel: BTRFS info (device sdi):
> > disk space caching is enabled
> > May 18 17:38:46 kvm2.mordor.local kernel: BTRFS info (device sdi): has
> > skinny extents
> > May 18 17:38:46 kvm2.mordor.local kernel: BTRFS info (device sdi):
> > bdev /dev/sdl errs: wr 12, rd 264, flush 4, corrupt 0, gen 0
>
> From the reddit thread, all of these errors are confined to a single drive.
>
> Unfortunately there's more than one thing going on. If it were just
> one or two problems, then in theory Btrfs can deal with it. But looks
> like there's one device problem, corrupt extent tree, and checksum
> failure preventing further recovery.
>
> Another thing to check that frequently causes problems with raid on
> Linux, whether Btrfs, LVM or mdadm, are timeout mismatches between
> drive firmware and the kernel command timer.
>
> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>
> If it were my problem, in order:
>
> - Update btrfs progs and the kernel. Recent versions should at least
> fail with some sane error reporting, or it's a bug. Not every problem
> can be fixed, but there shouldn't be crashes.
>
> - 'btrfs rescue super -v /anydev/'  - this should check all supers on
> all devices and see if they're the same or not. I don't recommend
> repairing yet if there are differences. It's vaguely possible that
> there is a really old one that might point to a tree that isn't
> busted. And then point btrfs check at that old super.
>
> - 'btrfs check --readonly' to update the report; also, for multiple
> devices you only need to run this command on any one of them.
>
> - try to 'mount -o ro,nologreplay' first and if that doesn't work try
> 'mount -o ro,nologreplay,degraded'
>
> I suggest you ssh into this system and use 'journalctl -fk' to follow the
> journal while you do these things in case there are kernel messages; in
> particular if it leads to a hang or crash, hopefully this will still
> catch it. And in a second shell, I suggest having sysrq enabled and
> ready to issue sysrq+t.
>
> It's a lot to collect, and tedious. But the better the information, the
> more likely it'll attract developer attention to see if there's a bug
> that needs to be fixed. Also, that might not happen for a while. So
> it's best to collect as much info as possible now in case you have to
> give up and move on.
>
> --
> Chris Murphy


I put the drives in a box with Fedora Rawhide connected directly to
the motherboard. It looks like all of the supers are the same.

[root@localhost ~]# btrfs rescue super -v /dev/sdb
All Devices:
        Device: id = 4, name = /dev/sdh
        Device: id = 2, name = /dev/sdf
        Device: id = 5, name = /dev/sde
        Device: id = 3, name = /dev/sdd
        Device: id = 1, name = /dev/sdb

Before Recovering:
        [All good supers]:
                device name = /dev/sdh
                superblock bytenr = 65536

                device name = /dev/sdh
                superblock bytenr = 67108864

                device name = /dev/sdh
                superblock bytenr = 274877906944

                device name = /dev/sdf
                superblock bytenr = 65536

                device name = /dev/sdf
                superblock bytenr = 67108864

                device name = /dev/sdf
                superblock bytenr = 274877906944

                device name = /dev/sde
                superblock bytenr = 65536

                device name = /dev/sde
                superblock bytenr = 67108864

                device name = /dev/sde
                superblock bytenr = 274877906944

                device name = /dev/sdd
                superblock bytenr = 65536

                device name = /dev/sdd
                superblock bytenr = 67108864

                device name = /dev/sdd
                superblock bytenr = 274877906944

                device name = /dev/sdb
                superblock bytenr = 65536

                device name = /dev/sdb
                superblock bytenr = 67108864

                device name = /dev/sdb
                superblock bytenr = 274877906944

        [All bad supers]:

All supers are valid, no need to recover



I tried mounting all three supers on sdb using "btrfs-select-super
-s 0 /dev/sdb" and mounting with "mount /dev/sdb btrfs/ -t btrfs -o
ro,nologreplay,degraded".  Still unable to get data. syslog here:
https://pastebin.com/94BbRamb

Results of btrfs check:

[root@localhost ~]# btrfs check --readonly /dev/sde
Opening filesystem to check...
Checking filesystem on /dev/sde
UUID: 64961501-b1d1-4470-8461-5c47aa5e72c6
[1/7] checking root items
parent transid verify failed on 2788917248 wanted 173258 found 173174
checksum verify failed on 2788917248 found 000000E4 wanted 00000029
checksum verify failed on 2788917248 found 000000E4 wanted 00000029
bad tree block 2788917248, bytenr mismatch, want=2788917248, have=1438426880
ERROR: failed to repair root items: Input/output error

[root@localhost ~]# btrfs check -s 2 --readonly /dev/sde
using SB copy 2, bytenr 274877906944
Opening filesystem to check...
Checking filesystem on /dev/sde
UUID: 64961501-b1d1-4470-8461-5c47aa5e72c6
[1/7] checking root items
parent transid verify failed on 2788917248 wanted 173258 found 173174
checksum verify failed on 2788917248 found 000000E4 wanted 00000029
checksum verify failed on 2788917248 found 000000E4 wanted 00000029
bad tree block 2788917248, bytenr mismatch, want=2788917248, have=1438426880
ERROR: failed to repair root items: Input/output error


I highly doubt any of this is a bug. This pretty much sums up my
feelings right now: https://imgflip.com/i/422w78


-- 

Justin Engwer


* Re: I think he's dead, Jim
  2020-05-19 18:45       ` Justin Engwer
@ 2020-05-19 20:44         ` Chris Murphy
  0 siblings, 0 replies; 10+ messages in thread
From: Chris Murphy @ 2020-05-19 20:44 UTC (permalink / raw)
  To: Justin Engwer; +Cc: Btrfs BTRFS

On Tue, May 19, 2020 at 12:45 PM Justin Engwer <justin@mautobu.com> wrote:
>
> On Mon, May 18, 2020 at 7:03 PM Chris Murphy <lists@colorremedies.com> wrote:
> >
> > On Mon, May 18, 2020 at 6:47 PM Justin Engwer <justin@mautobu.com> wrote:
> > >
> > > Thanks for getting back to me Chris. Here's the info requested:
> > >
> > > a. Kernels are:
> > > CentOS Linux (5.5.2-1.el7.elrepo.x86_64) 7 (Core)
> > > CentOS Linux (4.16.7-1.el7.elrepo.x86_64) 7 (Core)
> > > CentOS Linux (4.4.213-1.el7.elrepo.x86_64) 7 (Core)
> > >
> > > I was originally on 4.4, then updated to 4.16. After updating to 5.5 I
> > > must have screwed up the grub boot default as it started booting to
> > > 4.4.
> >
> > The problem happened while using kernel 5.5.2?
> >
>
> Likely 4.4

While it's a long-term supported kernel, this is really difficult for
Btrfs because some fixes and features just never get backported.
File systems become less and less predictable the older they get,
even with a static, never-changing kernel version. The LTS kernels are
perhaps best suited for distributions (public or internal) with
dedicated dev teams.

For example:
$ git diff --shortstat v4.4..v5.6 -- fs/btrfs
 109 files changed, 52729 insertions(+), 36600 deletions(-)
$ git diff --shortstat v4.4..v5.6 -- fs/btrfs/raid56.c
 1 file changed, 396 insertions(+), 373 deletions(-)
$ wc -l fs/btrfs/raid56.c
2749 fs/btrfs/raid56.c

Has this bug you're running into been fixed? *shrug*

I think if you're using raid1 or raid10 you could use the 4.19 series,
but you're probably still better off using something more recent, in
particular for raid56, so that you can use raid1c3 metadata instead of
raid6.
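
For reference, on a healthy raid6 filesystem running a 5.5+ kernel (and
matching btrfs-progs), that metadata conversion is just a filtered
balance, roughly:

# convert metadata (and system) block groups to raid1c3; data stays raid6
# -f is required when the system chunk profile is being changed
btrfs balance start -mconvert=raid1c3 -sconvert=raid1c3 -f /mnt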





>
> > These:
> > parent transid verify failed on 2788917248 wanted 173258 found 173174
> >
> > suggest that the problem didn't happen too long ago. But the
> > difficulty I see is that the "found" ranges from 172716 to 173167.
> >
> > A further difficulty is the wanted ranges from 173237 to 173258. That
> > is really significant.
> >
> > Have there been crashes/power failures while the file system was being written?
> >
>
> Given the system is hard locking up when btrfs is accessing some data,
> yes most likely.

I don't expect hard lockups just because a power fail or crash has
confused the file system state on disk. If the proper ordering has
been honored, none of the written garbage is pointed to by any
superblock. So the difficult but important question is: why might the
proper ordering not have been honored?

At least the Btrfs developers have said Btrfs theoretically does the
correct thing order-wise; and dm-log-writes is one of the
contributions they've made so that all file systems can test and do
better with respect to power failures.

Anyway, the question is more about looking for possible prior events
that might explain the transid discrepancies. And yeah, crashes and
power fails can do that, but it takes other things too, like writes
being committed out of order - and correct ordering is difficult to
ensure with multiple-device file systems. Especially if the drives
aren't telling the whole truth about when data is actually committed
to stable media, but claim it is even when the data is merely in the
write cache.



> Working on restoring. Will start with the 4 "good" drives. I recall,
> from years ago working in a computer repair shop, that if a drive was
> bad and we left it in the freezer overnight, we could get data off it
> for a few hours before it went completely dead. Might be worth a shot
> if nothing else works.
>
> Does BTRFS store whole files on single drives then use a parity across
> all of them or does it break single large files up, store them across
> different drives, then parity?

The latter.

The stripe element size is 64KiB (a.k.a. strip size, a.k.a. chunk in
mdadm terminology; btrfs chunks are the same as block groups). The
striping is per block group. And the order isn't always consistent.

So if metadata and data are raid6, it means everything is in 64KiB
"strips". Including the file system itself.


>
> >
> > > b. Physical drives are identical seagate SATA 3tb drives. Ancient
> > > bastards. Connected through a combination of LSI HBA and motherboard.
> >
> > Does the LSI HBA have a cache enabled? If it's battery-backed it's
> > probably OK but otherwise it should be disabled. And the write caches
> > on the drives should be disabled. That's the conservative
> > configuration. If the controller and drives really honor FUA/fsync
> > then it's OK to leave the write caches enabled. But the problem is if
> > they honor different flushes in different order you end up with an
> > inconsistent file system. And that's bad for Btrfs because repairing
> > inconsistency is difficult. It really just needs to be avoided in the
> > first place.
> >
>
> All cards are LSI 9211 or 9200 in the system. None of them have onboard caching.

Use hdparm -W to check the write cache on the drives, and disable it on
all of them. Make sure not to use the lowercase -w option; see the man page.
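
Something like this, per drive:

# report the current write cache setting
hdparm -W /dev/sdX

# disable the write cache (capital -W; lowercase -w is a dangerous reset)
hdparm -W 0 /dev/sdX

Note the setting doesn't survive a power cycle on many drives, so it may
need to be reapplied at boot (udev rule, rc.local, etc.).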


> > I think that's a bug on the face of it. It shouldn't indefinitely hang.
> >
>
> Now mounts on Fedora. Drops to RO quickly though. See
> https://pastebin.com/94BbRamb

May 19 10:46:59 localhost.localdomain kernel: BTRFS: error (device
sdb) in btrfs_remove_chunk:2959: errno=-117 unknown
May 19 10:46:59 localhost.localdomain kernel: BTRFS info (device sdb):
forced readonly

It consistently gets tripped up removing block groups.


> I put the drives in a box with Fedora Rawhide connected directly to
> the motherboard. It looks like all of the supers are the same.
>
> [root@localhost ~]# btrfs rescue super -v /dev/sdb
> All Devices:
>         Device: id = 4, name = /dev/sdh
>         Device: id = 2, name = /dev/sdf
>         Device: id = 5, name = /dev/sde
>         Device: id = 3, name = /dev/sdd
>         Device: id = 1, name = /dev/sdb
>
> Before Recovering:
>         [All good supers]:
>                 device name = /dev/sdh
>                 superblock bytenr = 65536
>
>                 device name = /dev/sdh
>                 superblock bytenr = 67108864
>
>                 device name = /dev/sdh
>                 superblock bytenr = 274877906944
>
>                 device name = /dev/sdf
>                 superblock bytenr = 65536
>
>                 device name = /dev/sdf
>                 superblock bytenr = 67108864
>
>                 device name = /dev/sdf
>                 superblock bytenr = 274877906944
>
>                 device name = /dev/sde
>                 superblock bytenr = 65536
>
>                 device name = /dev/sde
>                 superblock bytenr = 67108864
>
>                 device name = /dev/sde
>                 superblock bytenr = 274877906944
>
>                 device name = /dev/sdd
>                 superblock bytenr = 65536
>
>                 device name = /dev/sdd
>                 superblock bytenr = 67108864
>
>                 device name = /dev/sdd
>                 superblock bytenr = 274877906944
>
>                 device name = /dev/sdb
>                 superblock bytenr = 65536
>
>                 device name = /dev/sdb
>                 superblock bytenr = 67108864
>
>                 device name = /dev/sdb
>                 superblock bytenr = 274877906944
>
>         [All bad supers]:
>
> All supers are valid, no need to recover

Interesting.


>
>
>
> I tried mounting all three supers on sdb using "btrfs-select-super
> -s 0 /dev/sdb" and mounting with "mount /dev/sdb btrfs/ -t btrfs -o
> ro,nologreplay,degraded".  Still unable to get data. syslog here:
> https://pastebin.com/94BbRamb
>
> Results of btrfs check:
>
> [root@localhost ~]# btrfs check --readonly /dev/sde
> Opening filesystem to check...
> Checking filesystem on /dev/sde
> UUID: 64961501-b1d1-4470-8461-5c47aa5e72c6
> [1/7] checking root items
> parent transid verify failed on 2788917248 wanted 173258 found 173174
> checksum verify failed on 2788917248 found 000000E4 wanted 00000029
> checksum verify failed on 2788917248 found 000000E4 wanted 00000029
> bad tree block 2788917248, bytenr mismatch, want=2788917248, have=1438426880
> ERROR: failed to repair root items: Input/output error


btrfs inspect dump-t --follow -b 2788917248 /anydev/
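
(That's the abbreviated form; assuming the usual btrfs-progs prefix
matching, it expands to the following, run against any member device:)

# dump the tree block at that bytenr; --follow also walks its children,
# which should show what the stale block actually contains
btrfs inspect-internal dump-tree --follow -b 2788917248 /dev/sdb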




>
> [root@localhost ~]# btrfs check -s 2 --readonly /dev/sde
> using SB copy 2, bytenr 274877906944
> Opening filesystem to check...
> Checking filesystem on /dev/sde
> UUID: 64961501-b1d1-4470-8461-5c47aa5e72c6
> [1/7] checking root items
> parent transid verify failed on 2788917248 wanted 173258 found 173174
> checksum verify failed on 2788917248 found 000000E4 wanted 00000029
> checksum verify failed on 2788917248 found 000000E4 wanted 00000029
> bad tree block 2788917248, bytenr mismatch, want=2788917248, have=1438426880
> ERROR: failed to repair root items: Input/output error
>
>
> I highly doubt any of this is a bug. This pretty much sums up my
> feelings right now: https://imgflip.com/i/422w78

It's probably not any one thing. That's the difficulty. There are
certainly a lot of bug fixes between 4.4 and 5.6. But there are also the
enabled write caches, the question of whether one or more drives fib
about commits actually being on disk, loss of writes in the write cache
during a power fail, etc. And these things can accumulate over time, not
just happen all at once.


--
Chris Murphy


* Re: I think he's dead, Jim
  2020-05-18 20:51 I think he's dead, Jim Justin Engwer
  2020-05-18 23:23 ` Chris Murphy
@ 2020-05-20  1:32 ` Zygo Blaxell
  2020-05-20 20:53   ` Johannes Hirte
  1 sibling, 1 reply; 10+ messages in thread
From: Zygo Blaxell @ 2020-05-20  1:32 UTC (permalink / raw)
  To: Justin Engwer; +Cc: linux-btrfs

On Mon, May 18, 2020 at 01:51:03PM -0700, Justin Engwer wrote:
> Hi,
> 
> I'm hoping to get some (or all) data back from what I can only assume
> is the dreaded write hole. I did a fairly lengthy post on reddit that

Write hole is a popular scapegoat; however, write hole is far down the
list of the most common ways that a btrfs dies.  The top 6 are:

1.  Firmware bugs (specifically, write ordering failure in lower storage
layers).  If you have a drive with bad firmware, turn off write caching
(or, if you don't have a test rig to verify firmware behavior, just turn
off write caching for all drives).  Also please post your drive models and
firmware revisions so we can correlate them with other failure reports
(a quick way to grab those is sketched right after this list).

2.  btrfs kernel bugs.  See list below.

3.  Other (non-btrfs) kernel bugs.  In theory any use-after-free (UAF)
bug can kill a btrfs.  In 5.2 btrfs added run-time checks for this, and
will force
the filesystem read-only instead of writing obviously broken metadata
to disk.

4.  Non-disk hardware failure (bad RAM, power supply, cables, SATA
bridge, etc).  These can be hard to diagnose.  Sometimes the only way to
know for sure is to swap the hardware one piece at a time to a different
machine and test to see if the failure happens again.

5.  Isolation failure, e.g. one of your drives shorts out its motor as
it fails, and causes other drives sharing the same power supply rail to
fail at the same time.  Or two drives share a SATA bridge chip and the
bridge chip fails, causing an unrecoverable multi-device failure in btrfs.

6.  raid5/6 write hole, if somehow your filesystem survives the above.
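
(Re item 1: a quick way to pull model and firmware for each drive,
assuming smartmontools is installed:)

smartctl -i /dev/sdX | grep -E 'Device Model|Firmware Version'

# or for all disks at once:
lsblk -d -o NAME,MODEL,REV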

A quick map of btrfs raid5/6 kernel bugs:

	2.6 to 3.4:  don't use btrfs on these kernels

	3.5 to 3.8:  don't use raid5 or raid6 because it doesn't exist

	3.9 to 3.18:  don't use raid5 or raid6 because parity repair
	code not present

	3.19 to 4.4:  don't use raid5 or raid6 because space_cache=v2
	does not exist yet and parity repair code badly broken

	4.5 to 4.15:  don't use raid5 or raid6 because parity repair
	code badly broken

	4.16 to 5.0:  use raid5 data + raid1 metadata.  Use only
	with space_cache=v2.  Don't use raid6 because raid1c3 does not
	exist yet.

	5.1:  don't use btrfs on this kernel because of metadata
	corruption bugs

	5.2 to 5.3:  don't use btrfs on these kernels because of metadata
	corruption bugs partially contained by runtime corrupt metadata
	checking

	5.4:  use raid5 data + raid1 metadata.	Use only with
	space_cache=v2.  Don't use raid6 because raid1c3 does not
	exist yet.  Don't use kernels 5.4.0 to 5.4.13 with btrfs
	because they still have the metadata corruption bug.

	5.5 to 5.7:  use raid5 data + raid1 metadata, or raid6 data
	+ raid1c3 metadata.  Use only with space_cache=v2.

On current kernels there are still some leftover issues:

	- btrfs sometimes corrupts parity if there is corrupted data
	already present on one of the disks while a write is performed
	to other data blocks in the same raid stripe.  Note that if a
	disk goes offline temporarily for any reason, any writes that
	it missed will appear to be corrupted data on the disk when it
	returns to the array, so the impact of this bug can be surprising.

	- there is some risk of data loss due to write hole, which has an
	effect very similar to the above btrfs bug; however, the btrfs
	bug can only occur when all disks are online, and the write hole
	bug can only occur when some disks are offline.

	- scrub can detect parity corruption but cannot map the corrupted
	block to the correct drive in some cases, so the error statistics
	can be wildly inaccurate when there is data corruption on the
	disks (i.e. error counts will be distributed randomly across
	all disks).  This cannot be fixed with the current on-disk format.

Never use raid5 or raid6 for metadata because the write hole and parity
corruption bugs still present in current kernels will race to see which
gets to destroy the filesystem first.

Corollary:  Never use space_cache=v1 with raid5 or raid6 data.
space_cache=v1 puts some metadata (free space cache) in data block
groups, so it violates the "never use raid5 or raid6 for metadata" rule.
space_cache=v2 eliminates this problem by storing the free space tree
in metadata block groups.
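
(Switching is a one-time mount option on a 4.5+ kernel and persists
afterward; something like:)

# first mount with the option builds the free space tree, then the flag sticks
mount -o space_cache=v2 /dev/sdX /mnt

# verify; newer progs decode the FREE_SPACE_TREE compat_ro flag names
btrfs inspect-internal dump-super /dev/sdX | grep -A2 compat_ro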

> you can find here:
> https://old.reddit.com/r/btrfs/comments/glbde0/btrfs_died_last_night_pulling_out_hair_all_day/
> 
> TLDR; BTRFS keeps dropping to RO. System sometimes completely locks up
> and needs to be hard powered off because of read activity on BTRFS.
> See reddit link for actual errors.

You were lucky to have a filesystem with raid6 metadata and presumably
space_cache=v1 survive this long.

It looks like you were in the middle of trying to delete something, i.e.
a snapshot or file was deleted before the last crash.  The metadata
is corrupted, so the next time you mount, it detects the corruption
and aborts.  This repeats on the next mount because btrfs can't modify
anything.

My guess is you hit a firmware bug first, and then the other errors
followed, but at this point it's hard to tell which came first.  It looks
like this wasn't detected until much later, and recovery gets harder
the longer the initial error is uncorrected.

> I'm really not super familiar, or at all familiar, with BTRFS or the
> recovery of it.
> -- 
> 
> Justin Engwer


* Re: I think he's dead, Jim
  2020-05-20  1:32 ` Zygo Blaxell
@ 2020-05-20 20:53   ` Johannes Hirte
  2020-05-20 21:35     ` Chris Murphy
  2020-05-21  6:20     ` Zygo Blaxell
  0 siblings, 2 replies; 10+ messages in thread
From: Johannes Hirte @ 2020-05-20 20:53 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Justin Engwer, linux-btrfs

On 2020 May 19, Zygo Blaxell wrote:
> 
> Corollary:  Never use space_cache=v1 with raid5 or raid6 data.
> space_cache=v1 puts some metadata (free space cache) in data block
> groups, so it violates the "never use raid5 or raid6 for metadata" rule.
> space_cache=v2 eliminates this problem by storing the free space tree
> in metadata block groups.
> 

This should not be a real problem, as the space cache can be discarded
and rebuilt at any time. Or am I missing something?

-- 
Regards,
  Johannes Hirte



* Re: I think he's dead, Jim
  2020-05-20 20:53   ` Johannes Hirte
@ 2020-05-20 21:35     ` Chris Murphy
  2020-05-20 22:15       ` Johannes Hirte
  2020-05-21  6:20     ` Zygo Blaxell
  1 sibling, 1 reply; 10+ messages in thread
From: Chris Murphy @ 2020-05-20 21:35 UTC (permalink / raw)
  To: Johannes Hirte; +Cc: Zygo Blaxell, Justin Engwer, Btrfs BTRFS

On Wed, May 20, 2020 at 3:02 PM Johannes Hirte
<johannes.hirte@datenkhaos.de> wrote:
>
> On 2020 May 19, Zygo Blaxell wrote:
> >
> > Corollary:  Never use space_cache=v1 with raid5 or raid6 data.
> > space_cache=v1 puts some metadata (free space cache) in data block
> > groups, so it violates the "never use raid5 or raid6 for metadata" rule.
> > space_cache=v2 eliminates this problem by storing the free space tree
> > in metadata block groups.
> >
>
> This should not be a real problem, as the space cache can be discarded
> and rebuilt at any time. Or am I missing something?

The bitmap locations for the free space cache are referenced from the
extent tree. It's not as trivial to update or drop the v1 space cache as
it is the v2, which lives in its own btree.
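
For reference, the knobs look roughly like this (filesystem unmounted
for the check-based ones, reasonably recent progs assumed):

# v1: drop the cache inodes offline, or rebuild online via a mount option
btrfs check --clear-space-cache v1 /dev/sdX
mount -o clear_cache /dev/sdX /mnt

# v2: the whole free space tree is deleted, and gets rebuilt on the next
# mount with -o space_cache=v2
btrfs check --clear-space-cache v2 /dev/sdX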


-- 
Chris Murphy


* Re: I think he's dead, Jim
  2020-05-20 21:35     ` Chris Murphy
@ 2020-05-20 22:15       ` Johannes Hirte
  0 siblings, 0 replies; 10+ messages in thread
From: Johannes Hirte @ 2020-05-20 22:15 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Zygo Blaxell, Justin Engwer, Btrfs BTRFS

On 2020 May 20, Chris Murphy wrote:
> On Wed, May 20, 2020 at 3:02 PM Johannes Hirte
> <johannes.hirte@datenkhaos.de> wrote:
> >
> > On 2020 May 19, Zygo Blaxell wrote:
> > >
> > > Corollary:  Never use space_cache=v1 with raid5 or raid6 data.
> > > space_cache=v1 puts some metadata (free space cache) in data block
> > > groups, so it violates the "never use raid5 or raid6 for metadata" rule.
> > > space_cache=v2 eliminates this problem by storing the free space tree
> > > in metadata block groups.
> > >
> >
> > This should not be a real problem, as the space cache can be discarded
> > and rebuilt at any time. Or am I missing something?
> 
> The bitmap locations for the free space cache are referred to in the
> extent tree. It's not as trivial update or drop the v1 space cache as
> it is the v2 which is in its own btree.

I still don't see the problem. Free space cache is needed for
performance, not function. If it's not available, this can be
ignored. 

-- 
Regards,
  Johannes Hirte



* Re: I think he's dead, Jim
  2020-05-20 20:53   ` Johannes Hirte
  2020-05-20 21:35     ` Chris Murphy
@ 2020-05-21  6:20     ` Zygo Blaxell
  2020-05-21 17:24       ` Justin Engwer
  1 sibling, 1 reply; 10+ messages in thread
From: Zygo Blaxell @ 2020-05-21  6:20 UTC (permalink / raw)
  To: Johannes Hirte; +Cc: Justin Engwer, linux-btrfs

On Wed, May 20, 2020 at 10:53:19PM +0200, Johannes Hirte wrote:
> On 2020 May 19, Zygo Blaxell wrote:
> > 
> > Corollary:  Never use space_cache=v1 with raid5 or raid6 data.
> > space_cache=v1 puts some metadata (free space cache) in data block
> > groups, so it violates the "never use raid5 or raid6 for metadata" rule.
> > space_cache=v2 eliminates this problem by storing the free space tree
> > in metadata block groups.
> > 
> 
> This should not be a real problem, as the space cache can be discarded
> and rebuilt at any time. Or am I missing something?

Keep in mind that there are multiple reasons to not use space_cache=v1;
space_cache=v1 is quite slow, especially on filesystems big enough that
raid5 is in play, even when it's not recovering from integrity failures.

The free space cache (v1) is stored in nodatacow inodes, so it has all
the btrfs RAID data integrity problems of nodatasum, plus the parity
corruption and write hole issues of raid5.  Free space tree (v2) is
stored in metadata, so it has csums to detect data corruption and transid
checks for dropped writes, and if you are using raid1 metadata you also
avoid the parity corruption bug in btrfs's raid5/6 implementation and
the write hole.  v2 is faster too, especially at commit time.

The probability of undetected space_cache=v1 failure is low, but not zero.
In the event of failure, the filesystem should detect the error when it
tries to create new entries in the extent tree--they'll overlap existing
allocated blocks, and the filesystem will force itself read-only, so
there should be no permanent damage other than killing any application
that was writing to the disk at the time.

Come to think of it, though, the space_cache=v1 problems are not specific
to raid5.  You shouldn't use space_cache=v1 with raid1 or raid10 data
either, for the same reasons.

In the raid5/6 case it's a bit simpler:   kernels that can't do
space_cache=v2 (4.4 and earlier) don't have working raid5 recovery either.

> -- 
> Regards,
>   Johannes Hirte
> 


* Re: I think he's dead, Jim
  2020-05-21  6:20     ` Zygo Blaxell
@ 2020-05-21 17:24       ` Justin Engwer
  0 siblings, 0 replies; 10+ messages in thread
From: Justin Engwer @ 2020-05-21 17:24 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Johannes Hirte, linux-btrfs

So, in my case at least, I'd guess the root cause was dropping the
kernel from 4.16 to 4.4, combined with a failed disk.

I've done what little recovery I can of the current state of the files
using btrfs restore. Is there a means of rebuilding the metadata from
the existing data on the drives? Can I put that rebuilt metadata in a
different location so as not to overwrite anything? I'm thinking of
moving on to destructive recovery at this point anyway.

Cheers,
Justin

On Wed, May 20, 2020 at 11:20 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Wed, May 20, 2020 at 10:53:19PM +0200, Johannes Hirte wrote:
> > On 2020 May 19, Zygo Blaxell wrote:
> > >
> > > Corollary:  Never use space_cache=v1 with raid5 or raid6 data.
> > > space_cache=v1 puts some metadata (free space cache) in data block
> > > groups, so it violates the "never use raid5 or raid6 for metadata" rule.
> > > space_cache=v2 eliminates this problem by storing the free space tree
> > > in metadata block groups.
> > >
> >
> > This should not be a real problem, as the space cache can be discarded
> > and rebuilt at any time. Or am I missing something?
>
> Keep in mind that there are multiple reasons to not use space_cache=v1;
> space_cache=v1 is quite slow, especially on filesystems big enough that
> raid5 is in play, even when it's not recovering from integrity failures.
>
> The free space cache (v1) is stored in nodatacow inodes, so it has all
> the btrfs RAID data integrity problems of nodatasum, plus the parity
> corruption and write hole issues of raid5.  Free space tree (v2) is
> stored in metadata, so it has csums to detect data corruption and transid
> checks for dropped writes, and if you are using raid1 metadata you also
> avoid the parity corruption bug in btrfs's raid5/6 implementation and
> the write hole.  v2 is faster too, especially at commit time.
>
> The probability of undetected space_cache=v1 failure is low, but not zero.
> In the event of failure, the filesystem should detect the error when it
> tries to create new entries in the extent tree--they'll overlap existing
> allocated blocks, and the filesystem will force itself read-only, so
> there should be no permanent damage other than killing any application
> that was writing to the disk at the time.
>
> Come to think of it, though, the space_cache=v1 problems are not specific
> to raid5.  You shouldn't use space_cache=v1 with raid1 or raid10 data
> either, for the same reasons.
>
> In the raid5/6 case it's a bit simpler:   kernels that can't do
> space_cache=v2 (4.4 and earlier) don't have working raid5 recovery either.
>
> > --
> > Regards,
> >   Johannes Hirte
> >



-- 

Justin Engwer
Mautobu Business Services
250-415-3709


end of thread, other threads:[~2020-05-21 17:24 UTC | newest]

Thread overview: 10+ messages
2020-05-18 20:51 I think he's dead, Jim Justin Engwer
2020-05-18 23:23 ` Chris Murphy
     [not found]   ` <CAGAeKuv3y=rHvRsq6SVSQ+NadyUaFES94PpFu1zD74cO3B_eLA@mail.gmail.com>
     [not found]     ` <CAJCQCtQXR+x4mG+jT34nhkE69sP94yio-97MLmd_ugKS+m96DQ@mail.gmail.com>
2020-05-19 18:45       ` Justin Engwer
2020-05-19 20:44         ` Chris Murphy
2020-05-20  1:32 ` Zygo Blaxell
2020-05-20 20:53   ` Johannes Hirte
2020-05-20 21:35     ` Chris Murphy
2020-05-20 22:15       ` Johannes Hirte
2020-05-21  6:20     ` Zygo Blaxell
2020-05-21 17:24       ` Justin Engwer
