* I think he's dead, Jim @ 2020-05-18 20:51 Justin Engwer 2020-05-18 23:23 ` Chris Murphy 2020-05-20 1:32 ` Zygo Blaxell 0 siblings, 2 replies; 10+ messages in thread From: Justin Engwer @ 2020-05-18 20:51 UTC (permalink / raw) To: linux-btrfs Hi, I'm hoping to get some (or all) data back from what I can only assume is the dreaded write hole. I did a fairly lengthy post on reddit that you can find here: https://old.reddit.com/r/btrfs/comments/glbde0/btrfs_died_last_night_pulling_out_hair_all_day/ TLDR; BTRFS keeps dropping to RO. System sometimes completely locks up and needs to be hard powered off because of read activity on BTRFS. See reddit link for actual errors. I'm really not super familiar, or at all familiar, with BTRFS or the recovery of it. -- Justin Engwer ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: I think he's dead, Jim 2020-05-18 20:51 I think he's dead, Jim Justin Engwer @ 2020-05-18 23:23 ` Chris Murphy [not found] ` <CAGAeKuv3y=rHvRsq6SVSQ+NadyUaFES94PpFu1zD74cO3B_eLA@mail.gmail.com> 2020-05-20 1:32 ` Zygo Blaxell 1 sibling, 1 reply; 10+ messages in thread From: Chris Murphy @ 2020-05-18 23:23 UTC (permalink / raw) To: Justin Engwer; +Cc: Btrfs BTRFS On Mon, May 18, 2020 at 2:51 PM Justin Engwer <justin@mautobu.com> wrote: > > Hi, > > I'm hoping to get some (or all) data back from what I can only assume > is the dreaded write hole. I did a fairly lengthy post on reddit that > you can find here: > https://old.reddit.com/r/btrfs/comments/glbde0/btrfs_died_last_night_pulling_out_hair_all_day/ > > TLDR; BTRFS keeps dropping to RO. System sometimes completely locks up > and needs to be hard powered off because of read activity on BTRFS. > See reddit link for actual errors. Almost no one will follow the links. You've got a problem, which is unfortunate, but you're also asking for help so you kinda need to make it easy for readers to understand the setup instead of having to go digging for it elsewhere. And also it's needed for archive searchability, which an external reference doesn't provide. a. kernel and btrfs-progs version; ideally also include some kernel history for this file system b. basics of the storage stack: what are the physical drives, how are they connected, c. if VM, what's the hypervisor, are the drives being passed through, what caching mode d. mkfs command used to create; or just state the metadata and data profiles; or paste 'btrfs fi us /mnt' e. ideally a complete dmesg (start to finish, not snipped) at the time of the original problem, this might be the prior boot; it's probably too big to attach to the list so in that case nextcloud, dropbox, pastebin, etc. f. a current dmesg for the mount failure g. btrfs check --readonly /dev/ I thought we had a FAQ item with what info we wanted reported to the list, but I can't find it. 
Thanks, -- Chris Murphy ^ permalink raw reply [flat|nested] 10+ messages in thread
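The a-g checklist above can be gathered with a short script along these lines (a sketch; the mount point and device are placeholder examples, adjust to your setup and run as root):

```shell
# Collect the report items requested above. Paths are examples only.
MNT=/mnt          # (d) mount point, for 'btrfs filesystem usage'
DEV=/dev/sdb      # (g) any one member device of the filesystem

echo "kernel: $(uname -r)"                                   # (a)
command -v btrfs >/dev/null 2>&1 && btrfs --version || true  # (a) progs version
# (d) profiles and usage, if the filesystem mounts at all:
command -v btrfs >/dev/null 2>&1 && btrfs filesystem usage "$MNT" 2>/dev/null || true
# (f) current kernel log; attach the whole file, not a snippet:
dmesg > dmesg-current.txt 2>/dev/null || true
# (g) read-only check; never run --repair at this stage:
command -v btrfs >/dev/null 2>&1 && btrfs check --readonly "$DEV" || true
```

Pasting the full output of a script like this into the first mail saves a round trip and keeps the details searchable in the archive.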
[parent not found: <CAGAeKuv3y=rHvRsq6SVSQ+NadyUaFES94PpFu1zD74cO3B_eLA@mail.gmail.com>]
[parent not found: <CAJCQCtQXR+x4mG+jT34nhkE69sP94yio-97MLmd_ugKS+m96DQ@mail.gmail.com>]
* Re: I think he's dead, Jim [not found] ` <CAJCQCtQXR+x4mG+jT34nhkE69sP94yio-97MLmd_ugKS+m96DQ@mail.gmail.com> @ 2020-05-19 18:45 ` Justin Engwer 2020-05-19 20:44 ` Chris Murphy 0 siblings, 1 reply; 10+ messages in thread From: Justin Engwer @ 2020-05-19 18:45 UTC (permalink / raw) To: linux-btrfs On Mon, May 18, 2020 at 7:03 PM Chris Murphy <lists@colorremedies.com> wrote: > > On Mon, May 18, 2020 at 6:47 PM Justin Engwer <justin@mautobu.com> wrote: > > > > Thanks for getting back to me Chris. Here's the info requested: > > > > a. Kernels are: > > CentOS Linux (5.5.2-1.el7.elrepo.x86_64) 7 (Core) > > CentOS Linux (4.16.7-1.el7.elrepo.x86_64) 7 (Core) > > CentOS Linux (4.4.213-1.el7.elrepo.x86_64) 7 (Core) > > > > I was originally on 4.4, then updated to 4.16. After updating to 5.5 I > > must have screwed up the grub boot default as it started booting to > > 4.4. > > The problem happened while using kernel 5.5.2? > Likely 4.4 > These: > parent transid verify failed on 2788917248 wanted 173258 found 173174 > > suggest that the problem didn't happen too long ago. But the > difficulty I see is that the "found" ranges from 172716 to 173167. > > A further difficulty is the wanted ranges from 173237 to 173258. That > is really significant. > > Have there been crashes/power failures while the file system was being written? > Given the system is hard locking up when btrfs is accessing some data, yes most likely. > > > btrfs-progs v4.9.1 > > This is too old to attempt a repair. The errors reported seem > reliable, but there might be other problems going on that it's not > catching, so I suggest updating it in any case. > > Try this: > https://copr.fedorainfracloud.org/coprs/ngompa/btrfs-progs-el8/ > > But I can't recommend a repair except as a last resort. It seems like > things can't get worse, but it's better to be prepared. It also > includes a more capable offline scrape tool, 'btrfs restore'. > Working on restoring. Will start with the 4 "good" drives.
I recall, from years ago working in a computer repair shop, that if a drive was bad and we left it in the freezer overnight, we could get data from it for a few hours; then it would be completely dead afterward. Might be worth a shot if nothing else works. Does BTRFS store whole files on single drives then use a parity across all of them, or does it break single large files up, store them across different drives, then parity? > > > b. Physical drives are identical seagate SATA 3tb drives. Ancient > > bastards. Connected through a combination of LSI HBA and motherboard. > > Does the LSI HBA have a cache enabled? If its battery backed it's > probably OK but otherwise it should be disabled. And the write caches > on the drives should be disabled. That's the conservative > configuration. If the controller and drives really honor FUA/fsync > then it's OK to leave the write caches enabled. But the problem is if > they honor different flushes in different order you end up with an > inconsistent file system. And that's bad for Btrfs because repairing > inconsistency is difficult. It really just needs to be avoided in the > first place. > All cards are LSI 9211 or 9200 in the system. None of them have onboard caching. > > > > c. Not a vm. They host(ed) vms though. > > > > d.
[root@kvm2 ~]# btrfs fi us recovery/mount/ > > WARNING: RAID56 detected, not implemented > > WARNING: RAID56 detected, not implemented > > WARNING: RAID56 detected, not implemented > > Overall: > > Device size: 13.64TiB > > Device allocated: 0.00B > > Device unallocated: 13.64TiB > > Device missing: 0.00B > > Used: 0.00B > > Free (estimated): 0.00B (min: 8.00EiB) > > Data ratio: 0.00 > > Metadata ratio: 0.00 > > Global reserve: 512.00MiB (used: 0.00B) > > > > Data,RAID6: Size:4.39TiB, Used:0.00B > > /dev/sdh 1.46TiB > > /dev/sdi 1.46TiB > > /dev/sdl 1.46TiB > > /dev/sdo 1.46TiB > > /dev/sdp 1.46TiB > > > > Metadata,RAID6: Size:7.12GiB, Used:176.00KiB > > This is more difficult to recover from since it can spread a single > transaction across multiple disks, and it's harder (sometimes > impossible) to guarantee atomic updates. It's recommended to use > raid1c3 or raid1c4 in this configuration. I understand that's not > supported by the older kernels you were using, hopefully this was an > experimental setup. > Noted. It's a homelab, so it's not ideal but not a huge issue. Just time consuming to rebuild. > > > > f. Mounting without ro,norecovery,degraded results in immediate system > > lockup and nothing in dmesg. > > > > mount -o ro,norecovery,degraded /dev/sdi recovery/mount/ > > I think that's a bug on the face of it. It shouldn't indefinitely hang. > Now mounts on Fedora. Drops to RO quickly though. 
See https://pastebin.com/94BbRamb > > > > May 18 17:38:46 kvm2.mordor.local kernel: BTRFS info (device sdi): > > disabling log replay at mount time > > May 18 17:38:46 kvm2.mordor.local kernel: BTRFS info (device sdi): > > allowing degraded mounts > > May 18 17:38:46 kvm2.mordor.local kernel: BTRFS info (device sdi): > > disk space caching is enabled > > May 18 17:38:46 kvm2.mordor.local kernel: BTRFS info (device sdi): has > > skinny extents > > May 18 17:38:46 kvm2.mordor.local kernel: BTRFS info (device sdi): > > bdev /dev/sdl errs: wr 12, rd 264, flush 4, corrupt 0, gen 0 > > From the reddit thread, all of these errors are confined to a single drive. > > Unfortunately there's more than one thing going on. If it were just > one or two problems, then in theory Btrfs can deal with it. But it looks > like there's one device problem, a corrupt extent tree, and a checksum > failure preventing further recovery. > > Another thing to check that frequently causes problems with raid on > Linux, whether Btrfs, LVM or mdadm, is a timeout mismatch between > drive firmware and the kernel command timer. > > https://raid.wiki.kernel.org/index.php/Timeout_Mismatch > > If it were my problem, in order: > > - Update btrfs progs and the kernel. Recent versions should at least > fail with some sane error reporting, or it's a bug. Not every problem > can be fixed, but there shouldn't be crashes. > > - 'btrfs rescue super -v /anydev/' - this should check all supers on > all devices and see if they're the same or not. I don't recommend > repairing yet if there are differences. It's vaguely possible that > there is a really old one that might point to a tree that isn't > busted. And then point btrfs check at that old super. > > - 'btrfs check --readonly' to update the report; also, for multiple > devices you only need to run this command on any one of them. 
> > - try to 'mount -o ro,nologreplay' first and if that doesn't work try > 'mount -o ro,nologreplay,degraded' > > I suggest you ssh into this system and use 'journalctl -fk' to follow the > journal while you do these things in case there are kernel messages; in > particular if it leads to a hang or crash, hopefully this will still > catch it. And in a second shell, I suggest having sysrq enabled and > ready to issue sysrq+t. > > It's a lot to collect, and tedious. But the better the information the > more likely it'll attract developer attention to see if there's a bug > that needs to be fixed. Also, that might not happen for a while. So > it's best to collect as much info as possible now in case you have to > give up and move on. > > -- > Chris Murphy I put the drives in a box with Fedora Rawhide connected directly to the motherboard. It looks like all of the supers are the same. [root@localhost ~]# btrfs rescue super -v /dev/sdb All Devices: Device: id = 4, name = /dev/sdh Device: id = 2, name = /dev/sdf Device: id = 5, name = /dev/sde Device: id = 3, name = /dev/sdd Device: id = 1, name = /dev/sdb Before Recovering: [All good supers]: device name = /dev/sdh superblock bytenr = 65536 device name = /dev/sdh superblock bytenr = 67108864 device name = /dev/sdh superblock bytenr = 274877906944 device name = /dev/sdf superblock bytenr = 65536 device name = /dev/sdf superblock bytenr = 67108864 device name = /dev/sdf superblock bytenr = 274877906944 device name = /dev/sde superblock bytenr = 65536 device name = /dev/sde superblock bytenr = 67108864 device name = /dev/sde superblock bytenr = 274877906944 device name = /dev/sdd superblock bytenr = 65536 device name = /dev/sdd superblock bytenr = 67108864 device name = /dev/sdd superblock bytenr = 274877906944 device name = /dev/sdb superblock bytenr = 65536 device name = /dev/sdb superblock bytenr = 67108864 device name = /dev/sdb superblock bytenr = 274877906944 [All bad supers]: All supers are valid, no need to recover I 
tried mounting all three supers on sdb using "btrfs-select-super -s 0 /dev/sdb" and mounting with "mount /dev/sdb btrfs/ -t btrfs -o ro,nologreplay,degraded". Still unable to get data. syslog here: https://pastebin.com/94BbRamb Results of btrfs check: [root@localhost ~]# btrfs check --readonly /dev/sde Opening filesystem to check... Checking filesystem on /dev/sde UUID: 64961501-b1d1-4470-8461-5c47aa5e72c6 [1/7] checking root items parent transid verify failed on 2788917248 wanted 173258 found 173174 checksum verify failed on 2788917248 found 000000E4 wanted 00000029 checksum verify failed on 2788917248 found 000000E4 wanted 00000029 bad tree block 2788917248, bytenr mismatch, want=2788917248, have=1438426880 ERROR: failed to repair root items: Input/output error [root@localhost ~]# btrfs check -s 2 --readonly /dev/sde using SB copy 2, bytenr 274877906944 Opening filesystem to check... Checking filesystem on /dev/sde UUID: 64961501-b1d1-4470-8461-5c47aa5e72c6 [1/7] checking root items parent transid verify failed on 2788917248 wanted 173258 found 173174 checksum verify failed on 2788917248 found 000000E4 wanted 00000029 checksum verify failed on 2788917248 found 000000E4 wanted 00000029 bad tree block 2788917248, bytenr mismatch, want=2788917248, have=1438426880 ERROR: failed to repair root items: Input/output error I highly doubt any of this is a bug. This pretty much sums up my feelings right now: https://imgflip.com/i/422w78 -- Justin Engwer ^ permalink raw reply [flat|nested] 10+ messages in thread
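The three bytenr values repeated for every device in the output above are btrfs's fixed superblock mirror offsets, 64 KiB, 64 MiB and 256 GiB, which is why 'btrfs check -s 2' lands on the copy at 274877906944:

```shell
# btrfs keeps up to three superblock copies at fixed offsets on every
# member device: 64 KiB, 64 MiB, and 256 GiB from the start of the device.
echo "copy 0: $((64 * 1024))"                 # 65536
echo "copy 1: $((64 * 1024 * 1024))"          # 67108864
echo "copy 2: $((256 * 1024 * 1024 * 1024))"  # 274877906944
```

A device smaller than 256 GiB simply has fewer copies, which is why 'rescue super' reports a per-device list rather than a fixed count.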
* Re: I think he's dead, Jim 2020-05-19 18:45 ` Justin Engwer @ 2020-05-19 20:44 ` Chris Murphy 0 siblings, 0 replies; 10+ messages in thread From: Chris Murphy @ 2020-05-19 20:44 UTC (permalink / raw) To: Justin Engwer; +Cc: Btrfs BTRFS On Tue, May 19, 2020 at 12:45 PM Justin Engwer <justin@mautobu.com> wrote: > > On Mon, May 18, 2020 at 7:03 PM Chris Murphy <lists@colorremedies.com> wrote: > > > > On Mon, May 18, 2020 at 6:47 PM Justin Engwer <justin@mautobu.com> wrote: > > > > > > Thanks for getting back to me Chris. Here's the info requested: > > > > > > a. Kernels are: > > > CentOS Linux (5.5.2-1.el7.elrepo.x86_64) 7 (Core) > > > CentOS Linux (4.16.7-1.el7.elrepo.x86_64) 7 (Core) > > > CentOS Linux (4.4.213-1.el7.elrepo.x86_64) 7 (Core) > > > > > > I was originally on 4.4, then updated to 4.16. After updating to 5.5 I > > > must have screwed up the grub boot default as it started booting to > > > 4.4. > > > > The problem happened while using kernel 5.5.2? > > > > Likely 4.4 While it's a long term supported kernel, this is really difficult for Btrfs because some fixes and features just don't ever get backported. File systems are increasingly non-deterministic the older they get, even with a static never changing kernel version. The LTS kernels are perhaps best suited for distributions (public or internal) with dedicated dev teams. For example: $ git diff --shortstat v4.4..v5.6 -- fs/btrfs 109 files changed, 52729 insertions(+), 36600 deletions(-) $ git diff --shortstat v4.4..v5.6 -- fs/btrfs/raid56.c 1 file changed, 396 insertions(+), 373 deletions(-) $ wc -l fs/btrfs/raid56.c 2749 fs/btrfs/raid56.c Has this bug you're running into been fixed? *shrug* I think if you're using raid1 or raid10 you could use 4.19 series but you're probably still better off using something more recent, in particular for raid56 so that you can use metadata raid1c3 instead of raid6. 
> > > These: > > parent transid verify failed on 2788917248 wanted 173258 found 173174 > > > > suggest that the problem didn't happen too long ago. But the > > difficulty I see is that the "found" ranges from 172716 to 173167. > > > > A further difficulty is the wanted ranges from 173237 to 173258. That > > is really significant. > > > > Have there been crashes/power failures while the file system was being written? > > > > Given the system is hard locking up when btrfs is accessing some data, > yes most likely. I don't expect hard lockups just because a power fail or crash has confused the file system state on disk. If the proper ordering has been honored, none of the written garbage is pointed to by any superblock. So the difficult but important question is, why might the proper ordering not have been honored? At least the Btrfs developers have said Btrfs theoretically does the correct thing order-wise; and dm-log-writes is one of the contributions they've made so all file systems can test and do better with respect to power failures. Anyway, the question is more about looking for a possible prior event (or events) that might explain the transid discrepancies. And yeah crashes and powerfails can do that, but it takes other things too, like writes being committed out of order - and correct ordering is difficult to ensure with multiple device file systems. Especially if they aren't telling the whole truth about when data is actually committed to stable media, but claim it is even when the data is merely in the write cache. > Working on restoring. Will start with the 4 "good" drives. 
> > Does BTRFS store whole files on single drives then use a parity across > all of them or does it break single large files up, store them across > different drives, then parity? The latter. The stripe element size is 64KiB (a.k.a. strip size, a.k.a. chunk in mdadm terminology; btrfs chunks are the same as block groups). The striping is per block group. And the order isn't always consistent. So if metadata and data are raid6, it means everything is in 64KiB "strips". Including the file system itself. > > > > > > b. Physical drives are identical seagate SATA 3tb drives. Ancient > > > bastards. Connected through a combination of LSI HBA and motherboard. > > > > Does the LSI HBA have a cache enabled? If its battery backed it's > > probably OK but otherwise it should be disabled. And the write caches > > on the drives should be disabled. That's the conservative > > configuration. If the controller and drives really honor FUA/fsync > > then it's OK to leave the write caches enabled. But the problem is if > > they honor different flushes in different order you end up with an > > inconsistent file system. And that's bad for Btrfs because repairing > > inconsistency is difficult. It really just needs to be avoided in the > > first place. > > > > All cards are LSI 9211 or 9200 in the system. None of them have onboard caching. Use hdparm -W to check the write cache on the drives and disable it on all drives. Make sure not to use the lowercase -w option; see the man page. > > I think that's a bug on the face of it. It shouldn't indefinitely hang. > > > > Now mounts on Fedora. Drops to RO quickly though. See > https://pastebin.com/94BbRamb May 19 10:46:59 localhost.localdomain kernel: BTRFS: error (device sdb) in btrfs_remove_chunk:2959: errno=-117 unknown May 19 10:46:59 localhost.localdomain kernel: BTRFS info (device sdb): forced readonly It consistently gets tripped up removing block groups. > I put the drives in a box with Fedora Rawhide connected directly to > the motherboard. 
It looks like all of the supers are the same. > > [root@localhost ~]# btrfs rescue super -v /dev/sdb > All Devices: > Device: id = 4, name = /dev/sdh > Device: id = 2, name = /dev/sdf > Device: id = 5, name = /dev/sde > Device: id = 3, name = /dev/sdd > Device: id = 1, name = /dev/sdb > > Before Recovering: > [All good supers]: > device name = /dev/sdh > superblock bytenr = 65536 > > device name = /dev/sdh > superblock bytenr = 67108864 > > device name = /dev/sdh > superblock bytenr = 274877906944 > > device name = /dev/sdf > superblock bytenr = 65536 > > device name = /dev/sdf > superblock bytenr = 67108864 > > device name = /dev/sdf > superblock bytenr = 274877906944 > > device name = /dev/sde > superblock bytenr = 65536 > > device name = /dev/sde > superblock bytenr = 67108864 > > device name = /dev/sde > superblock bytenr = 274877906944 > > device name = /dev/sdd > superblock bytenr = 65536 > > device name = /dev/sdd > superblock bytenr = 67108864 > > device name = /dev/sdd > superblock bytenr = 274877906944 > > device name = /dev/sdb > superblock bytenr = 65536 > > device name = /dev/sdb > superblock bytenr = 67108864 > > device name = /dev/sdb > superblock bytenr = 274877906944 > > [All bad supers]: > > All supers are valid, no need to recover Interesting. > > > > I tried all mounting all three supers on sdb using "btrfs-select-super > -s 0 /dev/sdb" and mounting with "mount /dev/sdb btrfs/ -t btrfs -o > ro,nologreplay,degraded". Still unable to get data. syslog here: > https://pastebin.com/94BbRamb > > Results of btrfs check: > > [root@localhost ~]# btrfs check --readonly /dev/sde > Opening filesystem to check... 
> Checking filesystem on /dev/sde > UUID: 64961501-b1d1-4470-8461-5c47aa5e72c6 > [1/7] checking root items > parent transid verify failed on 2788917248 wanted 173258 found 173174 > checksum verify failed on 2788917248 found 000000E4 wanted 00000029 > checksum verify failed on 2788917248 found 000000E4 wanted 00000029 > bad tree block 2788917248, bytenr mismatch, want=2788917248, have=1438426880 > ERROR: failed to repair root items: Input/output error btrfs inspect dump-t --follow -b 2788917248 /anydev/ > > [root@localhost ~]# btrfs check -s 2 --readonly /dev/sde > using SB copy 2, bytenr 274877906944 > Opening filesystem to check... > Checking filesystem on /dev/sde > UUID: 64961501-b1d1-4470-8461-5c47aa5e72c6 > [1/7] checking root items > parent transid verify failed on 2788917248 wanted 173258 found 173174 > checksum verify failed on 2788917248 found 000000E4 wanted 00000029 > checksum verify failed on 2788917248 found 000000E4 wanted 00000029 > bad tree block 2788917248, bytenr mismatch, want=2788917248, have=1438426880 > ERROR: failed to repair root items: Input/output error > > > I highly doubt any of this is a bug. This pretty much sums up my > feelings right now: https://imgflip.com/i/422w78 It's probably not any one thing. That's the difficulty. There are certainly a lot of bug fixes between 4.4 and 5.6. But also write caches enabled, who knows if one or more drives fib about commits actually being on disk, loss of writes in write cache during power fail, etc. And these things can accumulate, not just happen all at the same time. -- Chris Murphy ^ permalink raw reply [flat|nested] 10+ messages in thread
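The 64 KiB striping Chris describes earlier in this message can be sketched as a toy calculation. The rotation scheme and device order here are assumptions for illustration; real btrfs chooses the device order per block group and it is not always consistent:

```shell
# Toy model of raid6 striping: a file is cut into 64 KiB strips, each full
# stripe holds NDEV-2 data strips plus P and Q parity on the other two.
STRIP=$((64 * 1024))
NDEV=5                          # devices in the array (as in this thread)
DATA_PER_STRIPE=$((NDEV - 2))   # raid6 reserves two strips per stripe
FILE_SIZE=$((300 * 1024))       # a 300 KiB example file

NSTRIPS=$(( (FILE_SIZE + STRIP - 1) / STRIP ))   # ceiling division: 5 strips
s=0
while [ "$s" -lt "$NSTRIPS" ]; do
    stripe=$(( s / DATA_PER_STRIPE ))
    slot=$(( s % DATA_PER_STRIPE ))
    dev=$(( (slot + stripe) % NDEV ))   # rotate start per stripe (toy layout)
    echo "strip $s -> stripe $stripe, device $dev"
    s=$(( s + 1 ))
done
```

So a 300 KiB file needs five data strips spread over two stripes and four distinct devices; no single drive holds the whole file, which is why losing parity consistency damages data on every member.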
* Re: I think he's dead, Jim 2020-05-18 20:51 I think he's dead, Jim Justin Engwer 2020-05-18 23:23 ` Chris Murphy @ 2020-05-20 1:32 ` Zygo Blaxell 2020-05-20 20:53 ` Johannes Hirte 1 sibling, 1 reply; 10+ messages in thread From: Zygo Blaxell @ 2020-05-20 1:32 UTC (permalink / raw) To: Justin Engwer; +Cc: linux-btrfs On Mon, May 18, 2020 at 01:51:03PM -0700, Justin Engwer wrote: > Hi, > > I'm hoping to get some (or all) data back from what I can only assume > is the dreaded write hole. I did a fairly lengthy post on reddit that Write hole is a popular scapegoat; however, write hole is far down the list of the most common ways that a btrfs dies. The top 6 are: 1. Firmware bugs (specifically, write ordering failure in lower storage layers). If you have a drive with bad firmware, turn off write caching (or, if you don't have a test rig to verify firmware behavior, just turn off write caching for all drives). Also please post your drive models and firmware revisions so we can correlate them with other failure reports. 2. btrfs kernel bugs. See list below. 3. Other (non-btrfs) kernel bugs. In theory any UAF bug can kill a btrfs. In 5.2 btrfs added run-time checks for this, and will force the filesystem read-only instead of writing obviously broken metadata to disk. 4. Non-disk hardware failure (bad RAM, power supply, cables, SATA bridge, etc). These can be hard to diagnose. Sometimes the only way to know for sure is to swap the hardware one piece at a time to a different machine and test to see if the failure happens again. 5. Isolation failure, e.g. one of your drives shorts out its motor as it fails, and causes other drives sharing the same power supply rail to fail at the same time. Or two drives share a SATA bridge chip and the bridge chip fails, causing an unrecoverable multi-device failure in btrfs. 6. raid5/6 write hole, if somehow your filesystem survives the above. 
A quick map of btrfs raid5/6 kernel bugs: 2.6 to 3.4: don't use btrfs on these kernels 3.5 to 3.8: don't use raid5 or raid6 because it doesn't exist 3.9 to 3.18: don't use raid5 or raid6 because parity repair code not present 3.19 to 4.4: don't use raid5 or raid6 because space_cache=v2 does not exist yet and parity repair code badly broken 4.5 to 4.15: don't use raid5 or raid6 because parity repair code badly broken 4.16 to 5.0: use raid5 data + raid1 metadata. Use only with space_cache=v2. Don't use raid6 because raid1c3 does not exist yet. 5.1: don't use btrfs on this kernel because of metadata corruption bugs 5.2 to 5.3: don't use btrfs on these kernels because of metadata corruption bugs partially contained by runtime corrupt metadata checking 5.4: use raid5 data + raid1 metadata. Use only with space_cache=v2. Don't use raid6 because raid1c3 does not exist yet. Don't use kernels 5.4.0 to 5.4.13 with btrfs because they still have the metadata corruption bug. 5.5 to 5.7: use raid5 data + raid1 metadata, or raid6 data + raid1c3 metadata. Use only with space_cache=v2. On current kernels there are still some leftover issues: - btrfs sometimes corrupts parity if there is corrupted data already present on one of the disks while a write is performed to other data blocks in the same raid stripe. Note that if a disk goes offline temporarily for any reason, any writes that it missed will appear to be corrupted data on the disk when it returns to the array, so the impact of this bug can be surprising. - there is some risk of data loss due to write hole, which has an effect very similar to the above btrfs bug; however, the btrfs bug can only occur when all disks are online, and the write hole bug can only occur when some disks are offline. - scrub can detect parity corruption but cannot map the corrupted block to the correct drive in some cases, so the error statistics can be wildly inaccurate when there is data corruption on the disks (i.e. 
error counts will be distributed randomly across all disks). This cannot be fixed with the current on-disk format. Never use raid5 or raid6 for metadata because the write hole and parity corruption bugs still present in current kernels will race to see which gets to destroy the filesystem first. Corollary: Never use space_cache=v1 with raid5 or raid6 data. space_cache=v1 puts some metadata (free space cache) in data block groups, so it violates the "never use raid5 or raid6 for metadata" rule. space_cache=v2 eliminates this problem by storing the free space tree in metadata block groups. > you can find here: > https://old.reddit.com/r/btrfs/comments/glbde0/btrfs_died_last_night_pulling_out_hair_all_day/ > > TLDR; BTRFS keeps dropping to RO. System sometimes completely locks up > and needs to be hard powered off because of read activity on BTRFS. > See reddit link for actual errors. You were lucky to have a filesystem with raid6 metadata and presumably space_cache=v1 survive this long. It looks like you were in the middle of trying to delete something, i.e. a snapshot or file was deleted before the last crash. The metadata is corrupted, so the next time you mount, it detects the corruption and aborts. This repeats on the next mount because btrfs can't modify anything. My guess is you hit a firmware bug first, and then the other errors followed, but at this point it's hard to tell which came first. It looks like this wasn't detected until much later, and recovery gets harder the longer the initial error is uncorrected. > I'm really not super familiar, or at all familiar, with BTRFS or the > recovery of it. > -- > > Justin Engwer ^ permalink raw reply [flat|nested] 10+ messages in thread
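Zygo's version matrix condenses to a lookup along these lines (an illustrative sketch of the guidance above, not an official support table; the point-release caveat for 5.4.0 to 5.4.13 is noted in the advice string):

```shell
# Condensed from the kernel guidance above; takes "major.minor".
advise() {
    case "$1" in
        3.[5-8])                echo "no raid5/6: not implemented" ;;
        3.9|3.1[0-8])           echo "no raid5/6: no parity repair code" ;;
        3.19|4.[0-9]|4.1[0-5])  echo "no raid5/6: parity repair badly broken" ;;
        4.1[6-9]|4.20|5.0)      echo "raid5 data + raid1 metadata, space_cache=v2" ;;
        5.[1-3])                echo "avoid btrfs: metadata corruption bugs" ;;
        5.4)                    echo "raid5 data + raid1 metadata (>= 5.4.14 only)" ;;
        5.[5-7])                echo "raid5+raid1, or raid6+raid1c3 metadata, space_cache=v2" ;;
        *)                      echo "no guidance in this table" ;;
    esac
}
advise 5.5
```

For the kernels in this thread, 'advise 4.4' lands in the "parity repair badly broken" range, which matches the failure mode being discussed.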
* Re: I think he's dead, Jim 2020-05-20 1:32 ` Zygo Blaxell @ 2020-05-20 20:53 ` Johannes Hirte 2020-05-20 21:35 ` Chris Murphy 2020-05-21 6:20 ` Zygo Blaxell 0 siblings, 2 replies; 10+ messages in thread From: Johannes Hirte @ 2020-05-20 20:53 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Justin Engwer, linux-btrfs On 2020 Mai 19, Zygo Blaxell wrote: > > Corollary: Never use space_cache=v1 with raid5 or raid6 data. > space_cache=v1 puts some metadata (free space cache) in data block > groups, so it violates the "never use raid5 or raid6 for metadata" rule. > space_cache=v2 eliminates this problem by storing the free space tree > in metadata block groups. > This should not be a real problem, as the space-cache can be discarded and rebuilt anytime. Or am I missing something? -- Regards, Johannes Hirte ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: I think he's dead, Jim 2020-05-20 20:53 ` Johannes Hirte @ 2020-05-20 21:35 ` Chris Murphy 2020-05-20 22:15 ` Johannes Hirte 2020-05-21 6:20 ` Zygo Blaxell 1 sibling, 1 reply; 10+ messages in thread From: Chris Murphy @ 2020-05-20 21:35 UTC (permalink / raw) To: Johannes Hirte; +Cc: Zygo Blaxell, Justin Engwer, Btrfs BTRFS On Wed, May 20, 2020 at 3:02 PM Johannes Hirte <johannes.hirte@datenkhaos.de> wrote: > > On 2020 Mai 19, Zygo Blaxell wrote: > > > > Corollary: Never use space_cache=v1 with raid5 or raid6 data. > > space_cache=v1 puts some metadata (free space cache) in data block > > groups, so it violates the "never use raid5 or raid6 for metadata" rule. > > space_cache=v2 eliminates this problem by storing the free space tree > > in metadata block groups. > > > > This should not be a real problem, as the space-cache can be discarded > and rebuild anytime. Or do I miss something? The bitmap locations for the free space cache are referred to in the extent tree. It's not as trivial to update or drop the v1 space cache as it is the v2, which is in its own btree. -- Chris Murphy ^ permalink raw reply [flat|nested] 10+ messages in thread
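For reference, moving a filesystem off v1 as discussed here is a one-time operation with current tools (a sketch; /dev/sdX is a placeholder, and the clear step requires the filesystem to be unmounted):

```shell
# Remove the old v1 free space cache (filesystem must be unmounted):
btrfs check --clear-space-cache v1 /dev/sdX

# First mount with the free space tree; the feature flag persists, so
# later mounts no longer need the option:
mount -o space_cache=v2 /dev/sdX /mnt
```

After this, the free space information lives in a metadata btree and gets the same checksums and transid checks as the rest of the metadata.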
* Re: I think he's dead, Jim 2020-05-20 21:35 ` Chris Murphy @ 2020-05-20 22:15 ` Johannes Hirte 0 siblings, 0 replies; 10+ messages in thread From: Johannes Hirte @ 2020-05-20 22:15 UTC (permalink / raw) To: Chris Murphy; +Cc: Zygo Blaxell, Justin Engwer, Btrfs BTRFS On 2020 Mai 20, Chris Murphy wrote: > On Wed, May 20, 2020 at 3:02 PM Johannes Hirte > <johannes.hirte@datenkhaos.de> wrote: > > > > On 2020 Mai 19, Zygo Blaxell wrote: > > > > > > Corollary: Never use space_cache=v1 with raid5 or raid6 data. > > > space_cache=v1 puts some metadata (free space cache) in data block > > > groups, so it violates the "never use raid5 or raid6 for metadata" rule. > > > space_cache=v2 eliminates this problem by storing the free space tree > > > in metadata block groups. > > > > > > > This should not be a real problem, as the space-cache can be discarded > > and rebuild anytime. Or do I miss something? > > The bitmap locations for the free space cache are referred to in the > extent tree. It's not as trivial update or drop the v1 space cache as > it is the v2 which is in its own btree. I still don't see the problem. Free space cache is needed for performance, not function. If it's not available, this can be ignored. -- Regards, Johannes Hirte ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: I think he's dead, Jim
  2020-05-20 20:53         ` Johannes Hirte
  2020-05-20 21:35           ` Chris Murphy
@ 2020-05-21  6:20           ` Zygo Blaxell
  2020-05-21 17:24             ` Justin Engwer
  1 sibling, 1 reply; 10+ messages in thread
From: Zygo Blaxell @ 2020-05-21 6:20 UTC (permalink / raw)
  To: Johannes Hirte; +Cc: Justin Engwer, linux-btrfs

On Wed, May 20, 2020 at 10:53:19PM +0200, Johannes Hirte wrote:
> On 2020 May 19, Zygo Blaxell wrote:
> >
> > Corollary: Never use space_cache=v1 with raid5 or raid6 data.
> > space_cache=v1 puts some metadata (free space cache) in data block
> > groups, so it violates the "never use raid5 or raid6 for metadata" rule.
> > space_cache=v2 eliminates this problem by storing the free space tree
> > in metadata block groups.
> >
>
> This should not be a real problem, as the space cache can be discarded
> and rebuilt at any time. Or am I missing something?

Keep in mind that there are multiple reasons not to use space_cache=v1:
it is quite slow, especially on filesystems big enough that raid5 is in
play, even when it's not recovering from integrity failures.

The free space cache (v1) is stored in nodatacow inodes, so it has all
the btrfs RAID data integrity problems of nodatasum, plus the parity
corruption and write hole issues of raid5. The free space tree (v2) is
stored in metadata, so it has csums to detect data corruption and
transid checks for dropped writes, and if you are using raid1 metadata
you also avoid the parity corruption bug in btrfs's raid5/6
implementation and the write hole. v2 is faster too, especially at
commit time.

The probability of undetected space_cache=v1 failure is low, but not
zero. In the event of failure, the filesystem should detect the error
when it tries to create new entries in the extent tree--they'll overlap
existing allocated blocks, and the filesystem will force itself
read-only, so there should be no permanent damage other than killing
any application that was writing to the disk at the time.

Come to think of it, though, the space_cache=v1 problems are not
specific to raid5. You shouldn't use space_cache=v1 with raid1 or
raid10 data either, for the same reasons.

In the raid5/6 case it's a bit simpler: kernels that can't do
space_cache=v2 (4.4 and earlier) don't have working raid5 recovery
either.

> --
> Regards,
> Johannes Hirte
>
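As a practical aside for anyone following the thread: you can check whether a filesystem is already on the v2 free space tree by looking at the read-only compat flags in its superblock. The device path is a placeholder, and the decoded flag names shown are as printed by reasonably recent btrfs-progs; older versions may print only the raw hex value.

```shell
# FREE_SPACE_TREE / FREE_SPACE_TREE_VALID in compat_ro_flags mean the
# v2 free space tree is in use; 0x0 means the filesystem is still on v1
# (or using no cache at all).
btrfs inspect-internal dump-super /dev/sdX | grep -A1 compat_ro_flags

# Kernel-side support check: space_cache=v2 needs 4.5 or later.
uname -r
```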
* Re: I think he's dead, Jim
  2020-05-21  6:20 ` Zygo Blaxell
@ 2020-05-21 17:24   ` Justin Engwer
  0 siblings, 0 replies; 10+ messages in thread
From: Justin Engwer @ 2020-05-21 17:24 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Johannes Hirte, linux-btrfs

So, in my case at least, I'd guess the root cause was dropping the
kernel from 4.16 to 4.4, combined with a failed disk.

I've done what little recovery I can of the current state of the files
using btrfs restore. Is there a means of rebuilding the metadata from
the existing data on the drives? Can I put that metadata in a different
location so as not to overwrite anything? I'm thinking of moving on to
destructive recovery at this point anyway.

Cheers,
Justin

On Wed, May 20, 2020 at 11:20 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Wed, May 20, 2020 at 10:53:19PM +0200, Johannes Hirte wrote:
> > On 2020 May 19, Zygo Blaxell wrote:
> > >
> > > Corollary: Never use space_cache=v1 with raid5 or raid6 data.
> > > space_cache=v1 puts some metadata (free space cache) in data block
> > > groups, so it violates the "never use raid5 or raid6 for metadata" rule.
> > > space_cache=v2 eliminates this problem by storing the free space tree
> > > in metadata block groups.
> > >
> >
> > This should not be a real problem, as the space cache can be discarded
> > and rebuilt at any time. Or am I missing something?
>
> Keep in mind that there are multiple reasons not to use space_cache=v1:
> it is quite slow, especially on filesystems big enough that raid5 is in
> play, even when it's not recovering from integrity failures.
>
> The free space cache (v1) is stored in nodatacow inodes, so it has all
> the btrfs RAID data integrity problems of nodatasum, plus the parity
> corruption and write hole issues of raid5. The free space tree (v2) is
> stored in metadata, so it has csums to detect data corruption and
> transid checks for dropped writes, and if you are using raid1 metadata
> you also avoid the parity corruption bug in btrfs's raid5/6
> implementation and the write hole. v2 is faster too, especially at
> commit time.
>
> The probability of undetected space_cache=v1 failure is low, but not
> zero. In the event of failure, the filesystem should detect the error
> when it tries to create new entries in the extent tree--they'll overlap
> existing allocated blocks, and the filesystem will force itself
> read-only, so there should be no permanent damage other than killing
> any application that was writing to the disk at the time.
>
> Come to think of it, though, the space_cache=v1 problems are not
> specific to raid5. You shouldn't use space_cache=v1 with raid1 or
> raid10 data either, for the same reasons.
>
> In the raid5/6 case it's a bit simpler: kernels that can't do
> space_cache=v2 (4.4 and earlier) don't have working raid5 recovery
> either.
>
> > --
> > Regards,
> > Johannes Hirte

--
Justin Engwer
Mautobu Business Services
250-415-3709
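For readers in the same situation: btrfs restore and btrfs-find-root only read from the source devices, so they can be tried before any destructive step. A rough sketch, where `/dev/sdX`, `/mnt/recovery`, and the `-t` bytenr value are placeholders (the bytenr comes from btrfs-find-root output):

```shell
# Copy whatever is readable to a different, healthy filesystem.
# -i ignores errors and keeps going; -v lists files as they are recovered.
# Nothing is written to /dev/sdX.
mkdir -p /mnt/recovery
btrfs restore -v -i /dev/sdX /mnt/recovery

# If the default tree root is damaged, list older tree root candidates...
btrfs-find-root /dev/sdX

# ...and retry restore against one of the reported bytenr values
# (123456789 here is a placeholder, not a real address).
btrfs restore -v -i -t 123456789 /dev/sdX /mnt/recovery
```

Only after the read-only options are exhausted does it make sense to reach for write-path repairs such as `btrfs check --repair` or `--init-extent-tree`, which modify the devices and can make things worse.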
end of thread, other threads:[~2020-05-21 17:24 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-18 20:51 I think he's dead, Jim Justin Engwer
2020-05-18 23:23 ` Chris Murphy
     [not found]   ` <CAGAeKuv3y=rHvRsq6SVSQ+NadyUaFES94PpFu1zD74cO3B_eLA@mail.gmail.com>
     [not found]     ` <CAJCQCtQXR+x4mG+jT34nhkE69sP94yio-97MLmd_ugKS+m96DQ@mail.gmail.com>
2020-05-19 18:45       ` Justin Engwer
2020-05-19 20:44         ` Chris Murphy
2020-05-20  1:32 ` Zygo Blaxell
2020-05-20 20:53   ` Johannes Hirte
2020-05-20 21:35     ` Chris Murphy
2020-05-20 22:15       ` Johannes Hirte
2020-05-21  6:20   ` Zygo Blaxell
2020-05-21 17:24     ` Justin Engwer