* btrfs dev del not transaction protected?

From: Marc Lehmann @ 2019-12-20  4:05 UTC
To: linux-btrfs

Hi!

I used btrfs del /somedevice /mountpoint to remove a device, and then typed
sync. A short time later the system had a hard reset.

Now the filesystem doesn't mount read-write anymore because it complains
about a missing device (linux 5.4.5):

[  247.385346] BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
[  247.386942] BTRFS error (device dm-32): failed to read chunk tree: -2
[  247.462693] BTRFS error (device dm-32): open_ctree failed

The thing is, the device is still there and accessible, but btrfs no longer
recognises it, as it already deleted it before the crash.

I can mount the filesystem in degraded mode, and I have a backup in case
something isn't readable, so this is merely a costly inconvenience for me
(it's a 40TB volume). But this seems very unexpected - both that device
deletion apparently has a race condition, and that sync doesn't actually
synchronise the filesystem. I naively expected that btrfs dev del cannot
cause the loss of the filesystem due to a system crash.

Probably not related, but maybe worth mentioning: I found that system
crashes (resets, not power failures) cause btrfs to not mount the first
time a mount is attempted, but it always succeeds the second time, e.g.:

# mount /device /mnt
... no errors or warnings in kernel log, except:
BTRFS error (device dm-34): open_ctree failed
# mount /device /mnt
magically succeeds

The typical symptom here is that systemd goes into emergency mode on the
mount failure, but simply rebooting, or executing the mount manually, then
succeeds.
Greetings,
   Marc

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\
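For reference, the removal sequence described above boils down to two commands. Below is a sketch using the device path and mountpoint Marc gives later in the thread (treat both as assumptions about his local setup); it is wrapped in a function so it can be read without being run against real devices:

```shell
# Sketch only: device path and mountpoint are the ones mentioned later in
# this thread, i.e. assumptions about the reporter's setup.
remove_and_sync() {
    # Remove the device from the mounted filesystem; this migrates all of
    # its chunks to the remaining devices and can take a long time...
    btrfs device delete /dev/mapper/xmnt-cold13 /oldcold &&
    # ...then force a commit so the new device layout is persisted.
    sync
}
```

The question the thread raises is whether a crash shortly after this sequence can still leave the filesystem referring to the deleted device.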
* Re: btrfs dev del not transaction protected?

From: Qu Wenruo @ 2019-12-20  5:24 UTC
To: Marc Lehmann, linux-btrfs

On 2019/12/20 12:05 PM, Marc Lehmann wrote:
> Hi!
>
> I used btrfs del /somedevice /mountpoint to remove a device, and then typed
> sync. A short time later the system had a hard reset.

Then it doesn't look like the title.

Normally for sync, btrfs will commit the transaction, thus even if
something like the title happened, you shouldn't be affected at all.

>
> Now the filesystem doesn't mount read-write anymore because it complains
> about a missing device (linux 5.4.5):
>
> [  247.385346] BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
> [  247.386942] BTRFS error (device dm-32): failed to read chunk tree: -2
> [  247.462693] BTRFS error (device dm-32): open_ctree failed

Is that devid 1 the device you tried to delete?
Or some unrelated device?

>
> The thing is, the device is still there and accessible, but btrfs no longer
> recognises it, as it already deleted it before the crash.

I think it's not what you thought; rather, btrfs device scan is not
properly triggered.

Would you please give some more dmesg? Each scanned btrfs device will show
up in dmesg, and that would help us pin down the real cause.

>
> I can mount the filesystem in degraded mode, and I have a backup in case
> something isn't readable, so this is merely a costly inconvenience for me
> (it's a 40TB volume). But this seems very unexpected - both that device
> deletion apparently has a race condition, and that sync doesn't actually
> synchronise the filesystem. I naively expected that btrfs dev del cannot
> cause the loss of the filesystem due to a system crash.
> Probably not related, but maybe worth mentioning: I found that system
> crashes (resets, not power failures) cause btrfs to not mount the first
> time a mount is attempted, but it always succeeds the second time, e.g.:
>
> # mount /device /mnt
> ... no errors or warnings in kernel log, except:
> BTRFS error (device dm-34): open_ctree failed
> # mount /device /mnt
> magically succeeds

Yep, this makes it sound more like a scan related bug.

Thanks,
Qu

> The typical symptom here is that systemd goes into emergency mode on the
> mount failure, but simply rebooting, or executing the mount manually, then
> succeeds.
>
> Greetings,
> Marc
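If the first-mount failure really is a scan problem, the usual workaround is to re-trigger the device scan before retrying the mount. A sketch, using the placeholder paths from the example above (not real paths), wrapped in a function so it is not executed here:

```shell
# Sketch: /device and /mnt are the placeholder names used in the example
# above, not real paths on any machine.
retry_mount() {
    # (Re)register all block devices carrying btrfs superblocks with the
    # kernel, then retry the mount that failed with open_ctree.
    btrfs device scan &&
    mount /device /mnt
}
```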
* Re: btrfs dev del not transaction protected?

From: Marc Lehmann @ 2019-12-20  6:37 UTC
To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Dec 20, 2019 at 01:24:20PM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > I used btrfs del /somedevice /mountpoint to remove a device, and then typed
> > sync. A short time later the system had a hard reset.
>
> Then it doesn't look like the title.

Hmm, I am not sure I understand: do you mean the subject? The command here
is obviously not copied and pasted, and when typing it into my mail client,
I forgot the "dev" part. The exact command, I think, was this:

    btrfs dev del /dev/mapper/xmnt-cold13 /oldcold

> Normally for sync, btrfs will commit the transaction, thus even if
> something like the title happened, you shouldn't be affected at all.

Exactly, that is my expectation.

> > [  247.385346] BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
> > [  247.386942] BTRFS error (device dm-32): failed to read chunk tree: -2
> > [  247.462693] BTRFS error (device dm-32): open_ctree failed
>
> Is that devid 1 the device you tried to delete?
> Or some unrelated device?

I think the device I removed had devid 1. I am not 100% sure, but I am
reasonably sure, because I had "watch -n10 btrfs dev us" running while
waiting for the removal to finish, and not being able to control the device
ids triggers my ocd reflexes (mostly because btrfs fi res needs the device
id even for some single-device filesystems :), so I kind of memorised them.

> > The thing is, the device is still there and accessible, but btrfs no longer
> > recognises it, as it already deleted it before the crash.
>
> I think it's not what you thought; rather, btrfs device scan is not
> properly triggered.
Quite possible - I based my statement that it is no longer recognised on
the fact that a) blkid also didn't recognise a filesystem on the removed
device anymore, and b) btrfs found the other two remaining devices. So if
btrfs scan is not properly triggered, then this is a serious issue in
current GNU/Linux distributions (I use debian buster on that server).

I assume that the device is not recognised as btrfs by blkid anymore
because the signature had been wiped by btrfs dev del, based on previous
experience, but of course I can't know for certain that it wasn't, say, a
hardware error that wiped that disk, although I would find that hard to
believe :)

> Would you please give some more dmesg? Each scanned btrfs device will show
> up in dmesg, and that would help us pin down the real cause.

Here should be all btrfs-related messages for this (from grep -i btrfs):

[   10.288533] BTRFS: device label ROOT devid 1 transid 2106939 /dev/mapper/vg_doom-root
[   10.314498] BTRFS info (device dm-0): disk space caching is enabled
[   10.316488] BTRFS info (device dm-0): has skinny extents
[   10.900930] BTRFS info (device dm-0): enabling ssd optimizations
[   10.902741] BTRFS info (device dm-0): disk space caching is enabled
[   11.524129] BTRFS info (device dm-0): device fsid bb3185c8-19f0-4018-b06f-38678c06c7c2 devid 1 moved old:/dev/mapper/vg_doom-root new:/dev/dm-0
[   11.528554] BTRFS info (device dm-0): device fsid bb3185c8-19f0-4018-b06f-38678c06c7c2 devid 1 moved old:/dev/dm-0 new:/dev/mapper/vg_doom-root
[   42.273530] BTRFS: device label LOCALVOL3 devid 1 transid 1240483 /dev/dm-28
[   42.312354] BTRFS info (device dm-28): enabling auto defrag
[   42.314152] BTRFS info (device dm-28): force zstd compression, level 12
[   42.315938] BTRFS info (device dm-28): using free space tree
[   42.317696] BTRFS info (device dm-28): has skinny extents
[   49.115007] BTRFS: device label LOCALVOL5 devid 1 transid 146201 /dev/dm-29
[   49.138816] BTRFS info (device dm-29): using free space tree
[   49.140590] BTRFS info (device dm-29): has skinny extents
[  102.348872] BTRFS info (device dm-29): checking UUID tree
[  102.393185] BTRFS: device label COLD1 devid 5 transid 1876906 /dev/dm-30
[  109.626550] BTRFS: device label COLD1 devid 4 transid 1876907 /dev/dm-32
[  109.654401] BTRFS: device label COLD1 devid 3 transid 1876907 /dev/dm-31
[  109.656171] BTRFS info (device dm-32): use zstd compression, level 12
[  109.657924] BTRFS info (device dm-32): using free space tree
[  109.660917] BTRFS info (device dm-32): has skinny extents
[  109.662687] BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
[  109.664832] BTRFS error (device dm-32): failed to read chunk tree: -2
[  109.742501] BTRFS error (device dm-32): open_ctree failed

At this point, /dev/mapper/xmnt-cold11 (dm-32), /dev/mapper/xmnt-oldcold12
(dm-31) and /dev/mapper/xmnt-cold14 (dm-30) were the remaining disks in the
filesystem, while xmnt-cold13 was the device I had formerly removed (which
doesn't show up).

(There are two btrfs filesystems with the COLD1 label in this machine at
the moment, as I was migrating the fs, but the above COLD1 messages should
all relate to the same fs.)

"blkid -o value -s TYPE /dev/mapper/xmnt-cold13" didn't give any output
(the mounting script checks for that and pauses, to make provisioning of
new disks easier), while normally it would give "btrfs" on volume members.
This, I think, would be normal behaviour for devices that have been removed
from a btrfs.

BTW, the four devices in question are all dmcrypt-on-lvm and are single
devices in a hardware raid controller (a perc h740).

> > Probably not related, but maybe worth mentioning: I found that system
> > crashes (resets, not power failures) cause btrfs to not mount the first
> > time a mount is attempted, but it always succeeds the second time, e.g.:
> >
> > # mount /device /mnt
> > ... no errors or warnings in kernel log, except:
> > BTRFS error (device dm-34): open_ctree failed
> > # mount /device /mnt
> > magically succeeds
>
> Yep, this makes it sound more like a scan related bug.

BTW, this (second issue) also happens with filesystems that are not
multi-device. Not sure if that means that btrfs scan would be involved, as
I would assume the only device btrfs would need in such cases is the one
given to mount, but maybe that also needs a working btrfs scan?

Thanks for your work on btrfs btw. :)
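The mounting-script check described above can be expressed as a tiny helper. This is a hypothetical function (the name and usage are not from the thread): a device that is still a btrfs member reports `TYPE=btrfs`, while a freshly removed device, whose signature has been wiped by `btrfs dev del`, reports nothing:

```shell
# Hypothetical helper mirroring the provisioning check described above.
# Succeeds (exit 0) only if blkid identifies the device as a btrfs member;
# a wiped ex-member produces no TYPE output, so the comparison fails.
is_btrfs_member() {
    [ "$(blkid -o value -s TYPE "$1" 2>/dev/null)" = "btrfs" ]
}

# Example (assumed device path from the thread):
# is_btrfs_member /dev/mapper/xmnt-cold13 || echo "not (or no longer) a btrfs member"
```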
* Re: btrfs dev del not transaction protected?

From: Qu Wenruo @ 2019-12-20  7:10 UTC
To: Marc Lehmann; +Cc: linux-btrfs

On 2019/12/20 2:37 PM, Marc Lehmann wrote:
> On Fri, Dec 20, 2019 at 01:24:20PM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>> I used btrfs del /somedevice /mountpoint to remove a device, and then typed
>>> sync. A short time later the system had a hard reset.
>>
>> Then it doesn't look like the title.
>
> Hmm, I am not sure I understand: do you mean the subject?

Oh, sorry, I meant the subject line "btrfs dev del not transaction
protected".

> The command here
> is obviously not copied and pasted, and when typing it into my mail client,
> I forgot the "dev" part. The exact command, I think, was this:

No big deal, as we all get the point.

>
>     btrfs dev del /dev/mapper/xmnt-cold13 /oldcold
>
>> Normally for sync, btrfs will commit the transaction, thus even if
>> something like the title happened, you shouldn't be affected at all.
>
> Exactly, that is my expectation.
>
>>> [  247.385346] BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
>>> [  247.386942] BTRFS error (device dm-32): failed to read chunk tree: -2
>>> [  247.462693] BTRFS error (device dm-32): open_ctree failed
>>
>> Is that devid 1 the device you tried to delete?
>> Or some unrelated device?
>
> I think the device I removed had devid 1. I am not 100% sure, but I am
> reasonably sure, because I had "watch -n10 btrfs dev us" running while
> waiting for the removal to finish, and not being able to control the device
> ids triggers my ocd reflexes (mostly because btrfs fi res needs the device
> id even for some single-device filesystems :), so I kind of memorised them.

Then it looks like a big deal.
After looking into the code (at least the v5.5-rc kernel), btrfs will
commit the transaction after deleting the device item in
btrfs_rm_dev_item(). So even if no manual sync is issued, as long as
"btrfs dev del" reported no error, such a case shouldn't happen.

>
>>> The thing is, the device is still there and accessible, but btrfs no longer
>>> recognises it, as it already deleted it before the crash.
>>
>> I think it's not what you thought; rather, btrfs device scan is not
>> properly triggered.
>
> Quite possible - I based my statement that it is no longer recognised on
> the fact that a) blkid also didn't recognise a filesystem on the removed
> device anymore, and b) btrfs found the other two remaining devices. So if
> btrfs scan is not properly triggered, then this is a serious issue in
> current GNU/Linux distributions (I use debian buster on that server).

a) means btrfs has wiped the superblock, which happens after
btrfs_rm_dev_item(). Something is not sane now.

>
> I assume that the device is not recognised as btrfs by blkid anymore
> because the signature had been wiped by btrfs dev del, based on previous
> experience, but of course I can't know for certain that it wasn't, say, a
> hardware error that wiped that disk, although I would find that hard to
> believe :)
>
>> Would you please give some more dmesg? Each scanned btrfs device will show
>> up in dmesg, and that would help us pin down the real cause.
>
> Here should be all btrfs-related messages for this (from grep -i btrfs):
>
> [   10.288533] BTRFS: device label ROOT devid 1 transid 2106939 /dev/mapper/vg_doom-root
> [   10.314498] BTRFS info (device dm-0): disk space caching is enabled
> [   10.316488] BTRFS info (device dm-0): has skinny extents
> [   10.900930] BTRFS info (device dm-0): enabling ssd optimizations
> [   10.902741] BTRFS info (device dm-0): disk space caching is enabled
> [   11.524129] BTRFS info (device dm-0): device fsid bb3185c8-19f0-4018-b06f-38678c06c7c2 devid 1 moved old:/dev/mapper/vg_doom-root new:/dev/dm-0
> [   11.528554] BTRFS info (device dm-0): device fsid bb3185c8-19f0-4018-b06f-38678c06c7c2 devid 1 moved old:/dev/dm-0 new:/dev/mapper/vg_doom-root
> [   42.273530] BTRFS: device label LOCALVOL3 devid 1 transid 1240483 /dev/dm-28
> [   42.312354] BTRFS info (device dm-28): enabling auto defrag
> [   42.314152] BTRFS info (device dm-28): force zstd compression, level 12
> [   42.315938] BTRFS info (device dm-28): using free space tree
> [   42.317696] BTRFS info (device dm-28): has skinny extents
> [   49.115007] BTRFS: device label LOCALVOL5 devid 1 transid 146201 /dev/dm-29
> [   49.138816] BTRFS info (device dm-29): using free space tree
> [   49.140590] BTRFS info (device dm-29): has skinny extents
> [  102.348872] BTRFS info (device dm-29): checking UUID tree
> [  102.393185] BTRFS: device label COLD1 devid 5 transid 1876906 /dev/dm-30

dm-30 is one transaction older than the other devices.

Is that expected? If not, it may explain why we got the dead device: we're
using the older superblock, which may point to an older chunk tree that
still contains the device item.

> [  109.626550] BTRFS: device label COLD1 devid 4 transid 1876907 /dev/dm-32
> [  109.654401] BTRFS: device label COLD1 devid 3 transid 1876907 /dev/dm-31

And I'm also curious about the 7s delay between the devid 5 and devid 3/4
detection.

Can you find a way to make devid 3/4 show up before devid 5 and try again?
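The generation mismatch Qu points out is easy to spot mechanically: pull the device path, devid and transid out of the scan messages and look for the odd one out. A sketch, using the three COLD1 dmesg lines quoted above as sample input:

```shell
# Sample input: the three COLD1 scan lines from the dmesg quoted above.
dmesg_sample='[  102.393185] BTRFS: device label COLD1 devid 5 transid 1876906 /dev/dm-30
[  109.626550] BTRFS: device label COLD1 devid 4 transid 1876907 /dev/dm-32
[  109.654401] BTRFS: device label COLD1 devid 3 transid 1876907 /dev/dm-31'

# Print device, devid and transid per scan line; a transid that differs
# from the others means that device's superblock is a generation behind.
printf '%s\n' "$dmesg_sample" |
    awk '/BTRFS: device label/ { print $NF, "devid", $8, "transid", $10 }'
```

On this input, dm-30 stands out with transid 1876906 against 1876907 on the other two devices.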
And if you find a way to mount the volume RW, please write a single empty
file, sync the fs, then umount the fs, and ensure "btrfs ins dump-super"
gives the same transid for all 3 related disks.

Then the problem *may* be gone, if it matches my assumption.

(After all these assumed successes, please do an unmounted btrfs check,
just to make sure nothing is wrong.)

> [  109.656171] BTRFS info (device dm-32): use zstd compression, level 12
> [  109.657924] BTRFS info (device dm-32): using free space tree
> [  109.660917] BTRFS info (device dm-32): has skinny extents
> [  109.662687] BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
> [  109.664832] BTRFS error (device dm-32): failed to read chunk tree: -2
> [  109.742501] BTRFS error (device dm-32): open_ctree failed
>
> At this point, /dev/mapper/xmnt-cold11 (dm-32), /dev/mapper/xmnt-oldcold12
> (dm-31) and /dev/mapper/xmnt-cold14 (dm-30) were the remaining disks in the
> filesystem, while xmnt-cold13 was the device I had formerly removed (which
> doesn't show up).
>
> (There are two btrfs filesystems with the COLD1 label in this machine at
> the moment, as I was migrating the fs, but the above COLD1 messages should
> all relate to the same fs.)
>
> "blkid -o value -s TYPE /dev/mapper/xmnt-cold13" didn't give any output
> (the mounting script checks for that and pauses, to make provisioning of
> new disks easier), while normally it would give "btrfs" on volume members.
> This, I think, would be normal behaviour for devices that have been removed
> from a btrfs.
>
> BTW, the four devices in question are all dmcrypt-on-lvm and are single
> devices in a hardware raid controller (a perc h740).
>
>>> Probably not related, but maybe worth mentioning: I found that system
>>> crashes (resets, not power failures) cause btrfs to not mount the first
>>> time a mount is attempted, but it always succeeds the second time, e.g.:
>>>
>>> # mount /device /mnt
>>> ... no errors or warnings in kernel log, except:
>>> BTRFS error (device dm-34): open_ctree failed
>>> # mount /device /mnt
>>> magically succeeds
>>
>> Yep, this makes it sound more like a scan related bug.
>
> BTW, this (second issue) also happens with filesystems that are not
> multi-device.

Single device btrfs doesn't need device scan.
If that happened, something insane happened again...

Thanks,
Qu

> Not sure if that means that btrfs scan would be involved, as
> I would assume the only device btrfs would need in such cases is the one
> given to mount, but maybe that also needs a working btrfs scan?
>
> Thanks for your work on btrfs btw. :)
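The transid comparison Qu asks for above can be scripted: `btrfs inspect-internal dump-super` prints a `generation` field for each device, and after a clean write + sync + umount all members should report the same value. A sketch (the function name is hypothetical and the device paths are this thread's, so assumptions), wrapped in a function since it needs real member devices:

```shell
# Print the superblock generation of each given device. After a clean
# write + sync + umount, the values should all match.
show_generations() {
    for dev in "$@"; do
        printf '%s: ' "$dev"
        # dump-super prints a line starting with "generation" per device.
        btrfs inspect-internal dump-super "$dev" |
            awk '/^generation/ { print $2 }'
    done
}

# Example (assumed device paths from this thread):
# show_generations /dev/mapper/xmnt-cold11 /dev/mapper/xmnt-oldcold12 /dev/mapper/xmnt-cold14
```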
* Re: btrfs dev del not transaction protected?

From: Marc Lehmann @ 2019-12-20 13:27 UTC
To: Qu Wenruo; +Cc: linux-btrfs

> > [  102.393185] BTRFS: device label COLD1 devid 5 transid 1876906 /dev/dm-30
>
> dm-30 is one transaction older than the other devices.
>
> Is that expected? If not, it may explain why we got the dead device: we're
> using the older superblock, which may point to an older chunk tree that
> still contains the device item.

Well, not that my expectation here would mean anything, but no - from
experience, I have never seen the transids disagree without bad things
happening...

> > [  109.626550] BTRFS: device label COLD1 devid 4 transid 1876907 /dev/dm-32
> > [  109.654401] BTRFS: device label COLD1 devid 3 transid 1876907 /dev/dm-31
>
> And I'm also curious about the 7s delay between the devid 5 and devid 3/4
> detection.

That is about the time it takes a disk to wake up when it's spun down, so
maybe that was the case - the disks are used for archiving ("cold"
storage), have a short spin-down timeout, and btrfs filesystems can take
ages to mount. The real question is why the fourth disk was already spun
up then, but the disks do not apply their timeouts very exactly.

> Can you find a way to make devid 3/4 show up before devid 5 and try again?

Unfortunately, I had to start restoring from backup a while ago, as I need
the machine up and restoring takes days.

How would I go about making them show up in a different order, though? If
these messages come up independently, I could have spun down some of the
disks, right?

> And if you find a way to mount the volume RW, please write a single empty
> file, sync the fs, then umount the fs, and ensure "btrfs ins dump-super"
> gives the same transid for all 3 related disks.

I tried -o degraded followed by remounting rw, but couldn't get it to
mount rw.
I tried to mount/remount, though:

04:48:45 doom kernel: BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
04:48:45 doom kernel: BTRFS error (device dm-32): failed to read chunk tree: -2
04:48:45 doom kernel: BTRFS error (device dm-32): open_ctree failed
04:49:37 doom kernel: BTRFS warning (device dm-31): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
04:52:30 doom kernel: BTRFS warning (device dm-31): chunk 12582912 missing 1 devices, max tolerance is 0 for writable mount
04:52:30 doom kernel: BTRFS warning (device dm-31): writable mount is not allowed due to too many missing devices
04:52:30 doom kernel: BTRFS error (device dm-31): open_ctree failed
04:54:01 doom kernel: BTRFS warning (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
04:54:45 doom kernel: BTRFS warning (device dm-32): chunk 12582912 missing 1 devices, max tolerance is 0 for writable mount
04:54:45 doom kernel: BTRFS warning (device dm-32): too many missing devices, writable remount is not allowed

Since (in theory :) the filesystem was completely backed up, I didn't
bother with further recovery after I made sure the physical disk was
actually there and unlocked (cryptsetup), so it wasn't a case of an
actually missing disk.

> > BTW, this (second issue) also happens with filesystems that are not
> > multi-device.
>
> Single device btrfs doesn't need device scan.
> If that happened, something insane happened again...

It has happened since at least 4.14 on at least four machines, but I
haven't seen it recently, after I switched to 5.2.21 on some machines
(post-4.4 kernels have this habit of freezing under memory pressure, and
5.2.21 has greatly improved in this regard). That also means I had far
fewer hard resets with 5.2.21, but the problem did not happen on the last
resets with 5.2.21 and 5.4.5.
I originally reported it below, with some evidence that it isn't a
hardware issue (no reset needed - just wipe the dm table while the device
is mounted, which should cleanly "cut off" the write stream):

https://bugzilla.kernel.org/show_bug.cgi?id=204083

Since multiple scrubs and full reads of the volumes didn't show up any
issues, I didn't think much of it.

And if you want to hear more "insane" things: after I hard-reset my
desktop machine (5.2.21) two days ago, I had to "btrfs rescue
fix-device-size" to be able to mount (can't find the kernel error atm).

Greetings,
   Marc
* Re: btrfs dev del not transaction protected?

From: Qu Wenruo @ 2019-12-20 13:41 UTC
To: Marc Lehmann; +Cc: linux-btrfs

On 2019/12/20 9:27 PM, Marc Lehmann wrote:
>>> [  102.393185] BTRFS: device label COLD1 devid 5 transid 1876906 /dev/dm-30
>>
>> dm-30 is one transaction older than the other devices.
>>
>> Is that expected? If not, it may explain why we got the dead device: we're
>> using the older superblock, which may point to an older chunk tree that
>> still contains the device item.
>
> Well, not that my expectation here would mean anything, but no - from
> experience, I have never seen the transids disagree without bad things
> happening...
>
>>> [  109.626550] BTRFS: device label COLD1 devid 4 transid 1876907 /dev/dm-32
>>> [  109.654401] BTRFS: device label COLD1 devid 3 transid 1876907 /dev/dm-31
>>
>> And I'm also curious about the 7s delay between the devid 5 and devid 3/4
>> detection.
>
> That is about the time it takes a disk to wake up when it's spun down, so
> maybe that was the case - the disks are used for archiving ("cold"
> storage), have a short spin-down timeout, and btrfs filesystems can take
> ages to mount. The real question is why the fourth disk was already spun
> up then, but the disks do not apply their timeouts very exactly.
>
>> Can you find a way to make devid 3/4 show up before devid 5 and try again?
>
> Unfortunately, I had to start restoring from backup a while ago, as I need
> the machine up and restoring takes days.
>
> How would I go about making them show up in a different order, though? If
> these messages come up independently, I could have spun down some of the
> disks, right?

You could utilize the latest "forget" feature to make the btrfs kernel
module forget those devices, provided by "btrfs device scan -u".
So the plan would be something like:

- Forget all devices of that volume
- Scan the two disks with the higher transid
- Scan the disk with the mismatched transid

Then try to mount the volume.

>
>> And if you find a way to mount the volume RW, please write a single empty
>> file, sync the fs, then umount the fs, and ensure "btrfs ins dump-super"
>> gives the same transid for all 3 related disks.
>
> I tried -o degraded followed by remounting rw, but couldn't get it to
> mount rw. I tried to mount/remount, though:
>
> 04:48:45 doom kernel: BTRFS error (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
> 04:48:45 doom kernel: BTRFS error (device dm-32): failed to read chunk tree: -2
> 04:48:45 doom kernel: BTRFS error (device dm-32): open_ctree failed
> 04:49:37 doom kernel: BTRFS warning (device dm-31): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
> 04:52:30 doom kernel: BTRFS warning (device dm-31): chunk 12582912 missing 1 devices, max tolerance is 0 for writable mount

BTW, that chunk number is very small, and since it has 0 tolerance, it
looks like a SINGLE chunk.

In that case, it looks like a temporary chunk from an older mkfs, and it
should contain no data/metadata at all, thus bringing no data loss.
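The forget-then-rescan plan above could be scripted roughly as follows. The device paths are the ones named earlier in the thread (assumptions about the affected machine), and the whole thing is wrapped in a function since it only makes sense there:

```shell
# Sketch of the plan above; device paths are assumptions from this thread.
rescan_in_order() {
    # 1. Make the kernel forget all scanned (unmounted) btrfs devices -
    #    this is the "forget" feature Qu refers to (btrfs device scan -u).
    btrfs device scan --forget &&
    # 2. Scan the two disks whose superblocks carry the newer transid first...
    btrfs device scan /dev/mapper/xmnt-cold11 /dev/mapper/xmnt-oldcold12 &&
    # 3. ...then the disk whose superblock is one transaction behind.
    btrfs device scan /dev/mapper/xmnt-cold14 &&
    # 4. Retry the mount.
    mount /dev/mapper/xmnt-cold11 /oldcold
}
```

The intent of the ordering is that the devices with the newer superblock generation register first, so the mount does not pick up the stale chunk tree from the lagging device.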
> 04:52:30 doom kernel: BTRFS warning (device dm-31): writable mount is not allowed due to too many missing devices
> 04:52:30 doom kernel: BTRFS error (device dm-31): open_ctree failed
> 04:54:01 doom kernel: BTRFS warning (device dm-32): devid 1 uuid f5c3dc63-1fac-45b3-b9ba-ed1ec5f92403 is missing
> 04:54:45 doom kernel: BTRFS warning (device dm-32): chunk 12582912 missing 1 devices, max tolerance is 0 for writable mount
> 04:54:45 doom kernel: BTRFS warning (device dm-32): too many missing devices, writable remount is not allowed
>
> Since (in theory :) the filesystem was completely backed up, I didn't
> bother with further recovery after I made sure the physical disk was
> actually there and unlocked (cryptsetup), so it wasn't a case of an
> actually missing disk.

BTW, "btrfs ins dump-tree -t chunk <dev>" would help a lot.
That would directly tell us whether the devid 1 device is still in the
chunk tree.

If passing a different <dev> causes different output, please also provide
all the different versions.

>
>>> BTW, this (second issue) also happens with filesystems that are not
>>> multi-device.
>>
>> Single device btrfs doesn't need device scan.
>> If that happened, something insane happened again...
>
> It has happened since at least 4.14 on at least four machines, but I
> haven't seen it recently, after I switched to 5.2.21 on some machines
> (post-4.4 kernels have this habit of freezing under memory pressure, and
> 5.2.21 has greatly improved in this regard). That also means I had far
> fewer hard resets with 5.2.21, but the problem did not happen on the last
> resets with 5.2.21 and 5.4.5.
>
> I originally reported it below, with some evidence that it isn't a
> hardware issue (no reset needed - just wipe the dm table while the device
> is mounted, which should cleanly "cut off" the write stream):
>
> https://bugzilla.kernel.org/show_bug.cgi?id=204083
>
> Since multiple scrubs and full reads of the volumes didn't show up any
> issues, I didn't think much of it.
> And if you want to hear more "insane" things: after I hard-reset my
> desktop machine (5.2.21) two days ago, I had to "btrfs rescue
> fix-device-size" to be able to mount (can't find the kernel error atm).

Considering all these insane things, I tend to believe there is some
FUA/FLUSH related hardware problem.

E.g. the HDD/SSD controller reports FUA/FLUSH as finished way before it
really writes the data to the disk or non-volatile cache, or the
non-volatile cache recovery is not implemented properly...

Thanks,
Qu
* Re: btrfs dev del not transaction protected?

From: Marc Lehmann @ 2019-12-20 16:53 UTC
To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Dec 20, 2019 at 09:41:15PM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> BTW, that chunk number is very small, and since it has 0 tolerance, it
> looks like a SINGLE chunk.
>
> In that case, it looks like a temporary chunk from an older mkfs, and it
> should contain no data/metadata at all, thus bringing no data loss.

Well, there indeed should not have been any data or metadata left, as the
btrfs dev del succeeded after lengthy copying.

> BTW, "btrfs ins dump-tree -t chunk <dev>" would help a lot.
> That would directly tell us whether the devid 1 device is still in the
> chunk tree.

Apologies if I wasn't clear about it - I already had to mkfs and redo the
filesystem. I understand that makes tracking this down hard or impossible,
but I did need that machine and filesystem.

> > And if you want to hear more "insane" things: after I hard-reset my
> > desktop machine (5.2.21) two days ago, I had to "btrfs rescue
> > fix-device-size" to be able to mount (can't find the kernel error atm).
>
> Considering all these insane things, I tend to believe there is some
> FUA/FLUSH related hardware problem.

Please don't - I honestly think btrfs developers are way too fast to blame
hardware for problems. I currently lose btrfs filesystems about once every
6 months, and other than the occasional user error, it's always the kernel
(e.g. 4.11 corrupting data, dmcache and/or bcache corrupting things,
low-memory situations etc. - none of these seem to be specific to btrfs,
but none of them are hardware errors either).
I know its the kernel in most cases because in those cases, I can identify the fix in a later kernel, or the mitigating circumstances don't appear (e.g. freezes). In any case if it is a hardware problem, then linux and/or btrfs has to work around them, because it affects many different controllers on different boards: - dell perc h740 on "doom" and "cerebro" - intel series 9 controller on "doom'" and "cerebro". - samsung nvme controller on "yoyo" and "yuna". - marvell sata controller on "doom". Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs filesystem I restored to went into readonly mode with ENOSPC. Another hardware problem? [41801.618772] ------------[ cut here ]------------ [41801.618776] BTRFS: Transaction aborted (error -28) [41801.618843] WARNING: CPU: 2 PID: 5713 at fs/btrfs/inode.c:3159 btrfs_finish_ordered_io+0x730/0x820 [btrfs] [41801.618844] Modules linked in: nfsv3 nfs fscache nvidia_modeset(POE) nvidia(POE) btusb algif_skcipher af_alg dm_crypt nfsd auth_rpcgss nfs_acl lockd grace cls_fw sch_htb sit tunnel4 ip_tunnel hidp act_police cls_u32 sch_ingress sch_tbf 8021q garp mrp stp llc ip6t_REJECT nf_reject_ipv6 xt_CT xt_MASQUERADE xt_nat xt_REDIRECT nft_chain_nat nf_nat xt_owner xt_TCPMSS xt_DSCP xt_mark nf_log_ipv4 nf_log_common xt_LOG xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_length xt_mac xt_tcpudp nft_compat nft_counter nf_tables xfrm_user xfrm_algo nfnetlink cmac uhid bnep tda10021 snd_hda_codec_hdmi binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass tda827x tda10023 crct10dif_pclmul mei_hdcp crc32_pclmul btrtl btbcm rc_tt_1500 ghash_clmulni_intel snd_emu10k1 btintel snd_util_mem snd_ac97_codec aesni_intel bluetooth snd_hda_intel budget_av snd_rawmidi snd_intel_nhlt crypto_simd saa7 146_vv [41801.618864] snd_hda_codec videobuf_dma_sg budget_ci videobuf_core snd_seq_device budget_core cryptd ttpci_eeprom 
glue_helper snd_hda_core saa7146 dvb_core intel_cstate ac97_bus snd_hwdep rc_core snd_pcm intel_rapl_perf mxm_wmi cdc_acm pcspkr videodev snd_timer ecdh_generic snd emu10k1_gp ecc mc gameport soundcore mei_me mei mac_hid acpi_pad tcp_bbr drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt ipmi_devintf ipmi_msghandler hid_generic usbhid hid usbkbd coretemp nct6775 hwmon_vid sunrpc parport_pc ppdev lp parport msr ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio libcrc32c ahci megaraid_sas i2c_i801 libahci lpc_ich r8169 realtek wmi video [last unloaded: nvidia] [41801.618887] CPU: 2 PID: 5713 Comm: kworker/u8:15 Tainted: P OE 5.4.5-050405-generic #201912181630 [41801.618888] Hardware name: MSI MS-7816/Z97-G43 (MS-7816), BIOS V17.8 12/24/2014 [41801.618903] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] [41801.618916] RIP: 0010:btrfs_finish_ordered_io+0x730/0x820 [btrfs] [41801.618917] Code: 49 8b 46 50 f0 48 0f ba a8 40 ce 00 00 02 72 1c 8b 45 b0 83 f8 fb 0f 84 d4 00 00 00 89 c6 48 c7 c7 48 33 62 c0 e8 eb 9c 91 d5 <0f> 0b 8b 4d b0 ba 57 0c 00 00 48 c7 c6 40 67 61 c0 4c 89 f7 bb 01 [41801.618918] RSP: 0018:ffffc18b40edfd80 EFLAGS: 00010282 [41801.618921] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left [41801.618922] RAX: 0000000000000000 RBX: ffff9f8b7b2e3800 RCX: 0000000000000006 [41801.618922] BTRFS info (device dm-35): forced readonly [41801.618924] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9f8bbeb17440 [41801.618924] RBP: ffffc18b40edfdf8 R08: 00000000000005a6 R09: ffffffff979a4d90 [41801.618925] R10: ffffffff97983fa8 R11: ffffc18b40edfbe8 R12: ffff9f8ad8b4ab60 [41801.618926] R13: ffff9f867ddb53c0 R14: ffff9f8bbb0446e8 R15: 0000000000000000 [41801.618927] FS: 0000000000000000(0000) GS:ffff9f8bbeb00000(0000) 
knlGS:0000000000000000
[41801.618928] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41801.618929] CR2: 00007f8ab728fc30 CR3: 000000049080a002 CR4: 00000000001606e0
[41801.618930] Call Trace:
[41801.618943]  finish_ordered_fn+0x15/0x20 [btrfs]
[41801.618957]  normal_work_helper+0xbd/0x2f0 [btrfs]
[41801.618959]  ? __schedule+0x2eb/0x740
[41801.618973]  btrfs_endio_write_helper+0x12/0x20 [btrfs]
[41801.618975]  process_one_work+0x1ec/0x3a0
[41801.618977]  worker_thread+0x4d/0x400
[41801.618979]  kthread+0x104/0x140
[41801.618980]  ? process_one_work+0x3a0/0x3a0
[41801.618982]  ? kthread_park+0x90/0x90
[41801.618984]  ret_from_fork+0x1f/0x40
[41801.618985] ---[ end trace 35086266bf39c897 ]---
[41801.618987] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left

unmount/remount seems to make it work again, and it is full according to df, yet it has 3TB of unallocated space left. No clue what to do now - do I have to start over restoring again?

Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/xmnt-cold15   27T   23T     0 100% /cold1

Overall:
    Device size:                24216.49GiB
    Device allocated:           20894.89GiB
    Device unallocated:          3321.60GiB
    Device missing:                 0.00GiB
    Used:                       20893.68GiB
    Free (estimated):            3322.73GiB  (min: 1661.93GiB)
    Data ratio:                        1.00
    Metadata ratio:                    2.00
    Global reserve:                 0.50GiB  (used: 0.00GiB)

Data,single: Size:20839.01GiB, Used:20837.88GiB (99.99%)
    /dev/mapper/xmnt-cold15  9288.01GiB
    /dev/mapper/xmnt-cold12  7427.00GiB
    /dev/mapper/xmnt-cold13  4124.00GiB

Metadata,RAID1: Size:27.91GiB, Used:27.90GiB (99.97%)
    /dev/mapper/xmnt-cold15    25.44GiB
    /dev/mapper/xmnt-cold12    24.46GiB
    /dev/mapper/xmnt-cold13     5.91GiB

System,RAID1: Size:0.03GiB, Used:0.00GiB (6.69%)
    /dev/mapper/xmnt-cold15     0.03GiB
    /dev/mapper/xmnt-cold12     0.03GiB

Unallocated:
    /dev/mapper/xmnt-cold15     0.01GiB
    /dev/mapper/xmnt-cold12     0.00GiB
    /dev/mapper/xmnt-cold13  3321.59GiB

Please, don't always chalk it up to hardware problems - btrfs is a wonderful filesystem for many reasons; one reason I like is that it can detect
corruption much earlier than other filesystems. This feature alone makes it impossible for me to go back to xfs. However, I had corruption on ext4, xfs and reiserfs over the years, but btrfs *is* simply way buggier still than those - before btrfs (and even now) I kept md5sums of all archived files (~200TB), and xfs and ext4 _do_ a much better job at not corrupting data than btrfs on the same hardware - while I get filesystem problems about every half a year with btrfs, I had (silent) corruption problems maybe once every three to four years with xfs or ext4 (and not yet on the boxes I use currently).

Please take these issues seriously - the trend of "it's a hardware problem" will not remove the "unstable" stigma from btrfs as long as btrfs is clearly more buggy than other filesystems.

Sorry to be so blunt, but I am a bit sensitive about always being told "it's probably a hardware problem" when it clearly affects practically any server and any laptop I administrate. I believe in btrfs, and detecting corruption early is a feature to me.

I understand it can be frustrating to be confronted with hard-to-explain accidents, and I understand if you can't find the bug with the sparse info I gave, especially as the bug might not even be in btrfs. But keep in mind that the people who boldly/dumbly use btrfs in production and restore dozens of terabytes from backup every so many months are also frustrated when they present evidence from multiple machines and get told "it's probably a hardware problem".

-- 
The choice of a       Deliantra, the free code+content MORPG
-----==-     _GNU_    http://www.deliantra.net
----==-- _       generation
---==---(_)__  __ ____  __      Marc Lehmann
--==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
-=====/_/_//_/\_,_/ /_/\_\

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: btrfs dev del not transaction protected? 2019-12-20 16:53 ` Marc Lehmann @ 2019-12-20 17:24 ` Remi Gauvin 2019-12-20 17:50 ` Marc Lehmann 2019-12-20 18:00 ` Marc Lehmann 2019-12-20 20:24 ` Chris Murphy 2019-12-21 1:32 ` Qu Wenruo 2 siblings, 2 replies; 18+ messages in thread
From: Remi Gauvin @ 2019-12-20 17:24 UTC (permalink / raw)
To: Marc Lehmann, Qu Wenruo; +Cc: linux-btrfs

[-- Attachment #1.1: Type: text/plain, Size: 1806 bytes --]

On 2019-12-20 11:53 a.m., Marc Lehmann wrote:
>
> Filesystem               Size  Used Avail Use% Mounted on
> /dev/mapper/xmnt-cold15   27T   23T     0 100% /cold1
>
> Overall:
>     Device size:                24216.49GiB
>     Device allocated:           20894.89GiB
>     Device unallocated:          3321.60GiB
>     Device missing:                 0.00GiB
>     Used:                       20893.68GiB
>     Free (estimated):            3322.73GiB  (min: 1661.93GiB)
>     Data ratio:                        1.00
>     Metadata ratio:                    2.00
>     Global reserve:                 0.50GiB  (used: 0.00GiB)
>
> Data,single: Size:20839.01GiB, Used:20837.88GiB (99.99%)
>     /dev/mapper/xmnt-cold15  9288.01GiB
>     /dev/mapper/xmnt-cold12  7427.00GiB
>     /dev/mapper/xmnt-cold13  4124.00GiB
>
> Metadata,RAID1: Size:27.91GiB, Used:27.90GiB (99.97%)
>     /dev/mapper/xmnt-cold15    25.44GiB
>     /dev/mapper/xmnt-cold12    24.46GiB
>     /dev/mapper/xmnt-cold13     5.91GiB
>
> System,RAID1: Size:0.03GiB, Used:0.00GiB (6.69%)
>     /dev/mapper/xmnt-cold15     0.03GiB
>     /dev/mapper/xmnt-cold12     0.03GiB
>
> Unallocated:
>     /dev/mapper/xmnt-cold15     0.01GiB
>     /dev/mapper/xmnt-cold12     0.00GiB
>     /dev/mapper/xmnt-cold13  3321.59GiB

You don't need hints, the problem is right here. Your metadata is RAID1 (which requires a minimum of 2 devices), your allocated metadata is full (27.90GiB of 27.91GiB), and you only have 1 device left with unallocated space, so no new metadata space can be allocated until you fix that.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread
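[Editorial note: Remi's diagnosis can be made concrete with a small model. The sketch below is illustrative only - a hypothetical helper, not btrfs kernel code - of the constraint at play: a RAID1 block group needs unallocated space on at least two *distinct* devices, so the total amount of free space in the pool is irrelevant.]

```python
# Hypothetical model of btrfs's RAID1 chunk-allocation constraint
# (illustrative sketch, not the kernel's actual allocator code).

def can_alloc_raid1(unallocated_gib, chunk_gib=1.0):
    """A RAID1 chunk needs `chunk_gib` of unallocated space on at
    least two distinct devices; total free space is irrelevant."""
    eligible = [u for u in unallocated_gib.values() if u >= chunk_gib]
    return len(eligible) >= 2

# Figures from the report: cold15/cold12 are full, cold13 has ~3.2 TiB free.
pool = {"cold15": 0.01, "cold12": 0.00, "cold13": 3321.59}
print(can_alloc_raid1(pool))   # only one eligible device -> False

# After a `btrfs device add` of an empty disk (size assumed here),
# a second device becomes eligible and allocation can proceed:
pool["cold14"] = 7453.0
print(can_alloc_raid1(pool))   # two eligible devices -> True
```

This is exactly why adding a device (or converting metadata to a profile that needs fewer devices) unblocks the filesystem even though terabytes were already "free".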
* Re: btrfs dev del not transaction protected? 2019-12-20 17:24 ` Remi Gauvin @ 2019-12-20 17:50 ` Marc Lehmann 2019-12-20 18:00 ` Marc Lehmann 1 sibling, 0 replies; 18+ messages in thread
From: Marc Lehmann @ 2019-12-20 17:50 UTC (permalink / raw)
To: Remi Gauvin; +Cc: Qu Wenruo, linux-btrfs

On Fri, Dec 20, 2019 at 12:24:05PM -0500, Remi Gauvin <remi@georgianit.com> wrote:
> You don't need hints, the problem is right here.

Yes, I already guessed that (see my other mail). I fortunately can add two more devices. However:

> device left with unallocated space, so no new metadata space can be
> allocated until you fix that.

I think it really shouldn't be up to me to second-guess btrfs's not very helpful error messages "and fix things". And if I couldn't add another device, I would be pretty much fucked - btrfs balance does not allow me to move any chunks to the other device. I tried balancing 10 data chunks and 10 metadata chunks - the data chunks balanced successfully but nothing changed, and the metadata chunks instantly hit the ENOSPC problem. Pushing "fix things" at users without giving them the ability to do so is rather poor.

So is there a legit fix for this? The tools don't allow me to rebalance the filesystem so that there is more space on the drives, and deleting data and writing it again doesn't seem to help - btrfs still wants to write to the nearly full disks. I could probably convert the metadata to single and back, but as long as btrfs has no way of moving data from one disk to another, that's going to be tough. Maybe converting to single and resizing would do the trick - seriously, though, btrfs shouldn't force users to jump through such hoops.

-- 
The choice of a       Deliantra, the free code+content MORPG
-----==-     _GNU_    http://www.deliantra.net
----==-- _       generation
---==---(_)__  __ ____  __      Marc Lehmann
--==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
-=====/_/_//_/\_,_/ /_/\_\

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: btrfs dev del not transaction protected? 2019-12-20 17:24 ` Remi Gauvin 2019-12-20 17:50 ` Marc Lehmann @ 2019-12-20 18:00 ` Marc Lehmann 2019-12-20 18:28 ` Eli V 1 sibling, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2019-12-20 18:00 UTC (permalink / raw)
To: Remi Gauvin; +Cc: Qu Wenruo, linux-btrfs

On Fri, Dec 20, 2019 at 12:24:05PM -0500, Remi Gauvin <remi@georgianit.com> wrote:
> You don't need hints, the problem is right here.
> Your metadata is RAID1 (which requires a minimum of 2 devices), your

Guess I found another bug - three disks with >>3TB of free space, but df still shows 0 available bytes. Sure, I can probably work around it somehow, but no, I refuse to accept that this is supposedly a user problem - surely btrfs could create more raid1 metadata with _three disks with lots of free space_.

doom ~# df /cold1
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/xmnt-cold15   43T   23T     0 100% /cold1
doom ~# btrfs dev us /cold1
/dev/mapper/xmnt-cold15, ID: 1
   Device size:             9.09TiB
   Device slack:              0.00B
   Data,single:             9.07TiB
   Metadata,RAID1:         25.46GiB
   System,RAID1:           32.00MiB
   Unallocated:             1.00MiB

/dev/mapper/xmnt-cold12, ID: 2
   Device size:             7.28TiB
   Device slack:              0.00B
   Data,single:             7.25TiB
   Metadata,RAID1:         24.46GiB
   System,RAID1:           32.00MiB
   Unallocated:             1.00MiB

/dev/mapper/xmnt-cold13, ID: 3
   Device size:             7.28TiB
   Device slack:              0.00B
   Data,single:             4.03TiB
   Metadata,RAID1:          5.92GiB
   Unallocated:             3.24TiB

/dev/mapper/xmnt-cold14, ID: 4
   Device size:             7.28TiB
   Device slack:              0.00B
   Unallocated:             7.28TiB

/dev/mapper/xmnt-cold11, ID: 5
   Device size:             7.28TiB
   Device slack:              0.00B
   Unallocated:             7.28TiB

-- 
The choice of a       Deliantra, the free code+content MORPG
-----==-     _GNU_    http://www.deliantra.net
----==-- _       generation
---==---(_)__  __ ____  __      Marc Lehmann
--==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
-=====/_/_//_/\_,_/ /_/\_\

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: btrfs dev del not transaction protected? 2019-12-20 18:00 ` Marc Lehmann @ 2019-12-20 18:28 ` Eli V 0 siblings, 0 replies; 18+ messages in thread From: Eli V @ 2019-12-20 18:28 UTC (permalink / raw) To: Marc Lehmann; +Cc: Remi Gauvin, Qu Wenruo, linux-btrfs In general df will only ever be an approximation on btrfs filesystems since the different profiles use different amounts of space, and it does have bugs from time to time. If you untar a mail spool on the filesystem the metadata usage may shoot way up when only a small amount of additional data is needed. So on btrfs filesystems I really just ignore df, and use btrfs filesystem usage -T almost exclusively. The table format of -T does make it much more readable for an admin. On Fri, Dec 20, 2019 at 1:02 PM Marc Lehmann <schmorp@schmorp.de> wrote: > > On Fri, Dec 20, 2019 at 12:24:05PM -0500, Remi Gauvin <remi@georgianit.com> wrote: > > You don't need hints, the problem is right here. > > Your Metadata is Raid 1, (which requires minimum of 2 devices,) Your > > Guess I found another bug - three disks with >>3tb free space, but df > still shows 0 available bytes. Sure I can probably work around it somehow, > but no, I refuse to accept that this is supposedly a user problem - surely > btrfs could create more raid1 metadata with _three disks with lots of free > space_. 
> > doom ~# df /cold1 > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/xmnt-cold15 43T 23T 0 100% /cold1 > doom ~# btrfs dev us /cold1 > /dev/mapper/xmnt-cold15, ID: 1 > Device size: 9.09TiB > Device slack: 0.00B > Data,single: 9.07TiB > Metadata,RAID1: 25.46GiB > System,RAID1: 32.00MiB > Unallocated: 1.00MiB > > /dev/mapper/xmnt-cold12, ID: 2 > Device size: 7.28TiB > Device slack: 0.00B > Data,single: 7.25TiB > Metadata,RAID1: 24.46GiB > System,RAID1: 32.00MiB > Unallocated: 1.00MiB > > /dev/mapper/xmnt-cold13, ID: 3 > Device size: 7.28TiB > Device slack: 0.00B > Data,single: 4.03TiB > Metadata,RAID1: 5.92GiB > Unallocated: 3.24TiB > > /dev/mapper/xmnt-cold14, ID: 4 > Device size: 7.28TiB > Device slack: 0.00B > Unallocated: 7.28TiB > > /dev/mapper/xmnt-cold11, ID: 5 > Device size: 7.28TiB > Device slack: 0.00B > Unallocated: 7.28TiB > > -- > The choice of a Deliantra, the free code+content MORPG > -----==- _GNU_ http://www.deliantra.net > ----==-- _ generation > ---==---(_)__ __ ____ __ Marc Lehmann > --==---/ / _ \/ // /\ \/ / schmorp@schmorp.de > -=====/_/_//_/\_,_/ /_/\_\ ^ permalink raw reply [flat|nested] 18+ messages in thread
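[Editorial note: Eli's point that df is only an approximation on btrfs can be illustrated with a rough model (not the kernel's statfs code) of how much *logical* RAID1 space a set of per-device unallocated amounts can yield. Every logical byte must land on two distinct devices, which is why a single large free device can report as zero available space.]

```python
# Rough, hypothetical model of usable RAID1 capacity on btrfs
# (illustration of the pairing constraint, not kernel code).

def raid1_usable(unallocated):
    """Upper bound on logical bytes storable as RAID1 given per-device
    unallocated space: each logical byte needs copies on two distinct
    devices, so the largest device can only pair with the others."""
    total = sum(unallocated)
    return min(total // 2, total - max(unallocated))

# The pool from the report (GiB): only cold13 has unallocated space.
print(raid1_usable([0, 0, 3321]))   # -> 0: no RAID1 chunk can be placed

# With two empty 7453 GiB disks added, most of it becomes pairable:
print(raid1_usable([0, 0, 3321, 7453, 7453]))   # -> 9113
```

Under this model, "3.2 TiB unallocated" and "0 bytes available for RAID1 metadata" are simultaneously true, which is the df output Marc is seeing.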
* Re: btrfs dev del not transaction protected? 2019-12-20 16:53 ` Marc Lehmann 2019-12-20 17:24 ` Remi Gauvin @ 2019-12-20 20:24 ` Chris Murphy 2019-12-20 23:30 ` Marc Lehmann 2019-12-21 20:06 ` Zygo Blaxell 2019-12-21 1:32 ` Qu Wenruo 2 siblings, 2 replies; 18+ messages in thread
From: Chris Murphy @ 2019-12-20 20:24 UTC (permalink / raw)
To: Btrfs BTRFS; +Cc: Qu Wenruo, Marc Lehmann, Zygo Blaxell

On Fri, Dec 20, 2019 at 9:53 AM Marc Lehmann <schmorp@schmorp.de> wrote:
>
> On Fri, Dec 20, 2019 at 09:41:15PM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > Consider all these insane things, I tend to believe there is some
> > FUA/FLUSH related hardware problem.
>
> Please don't - I honestly think btrfs developers are way too fast to blame
> hardware for problems.

That's because they have a lot of evidence of this, in a way that's only inferable with other file systems. This was long suspected, and demonstrated well before Btrfs, during ZFS development.

A reasonable criticism of Btrfs development is the state of the file system check/repair tool, which still carries danger warnings. But it's also a case of damned if they do, and damned if they don't provide it. It might be the best chance of recovery, so why not provide it? Conversely, the reality is that the file system is complicated enough, and the file system checker too slow, that the effort needs to go into (what I call) file system autopsy tools, to figure out why the corruption happened and prevent it from happening. The repair is often too difficult.

Take, for example, the recent 5.2.0-5.2.14 corruption bug. That was self-reported once it was discovered and fixed, which took longer than usual, and the developers apologized. What else can they do? It's not like the developers are blaming hardware for their own bugs. They have consistently taken responsibility for Btrfs bugs.

> I currently lose btrfs filesystems about once every
> 6 months, and other than the occasional user error, it's always the kernel
> (e.g.
> 4.11 corrupting data, dmcache and/or bcache corrupting things,
> low-memory situations etc. - none of these seem to be specific to btrfs,
> but none of those are hardware errors either). I know it's the kernel in
> most cases because in those cases, I can identify the fix in a later
> kernel, or the mitigating circumstances don't appear (e.g. freezes).

Usually Btrfs developers do mention the possibility of other software layers contributing to the problem; it's a valid observation that this possibility should be stated. However, if it's exclusively a software problem, then it should be reproducible on other systems.

> In any case if it is a hardware problem, then linux and/or btrfs has
> to work around them, because it affects many different controllers on
> different boards:

How do you propose Btrfs work around it? In particular when there are additional software layers over which it has no control? Have you tried disabling the (drives') write cache?

> Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs
> filesystem I restored to went into readonly mode with ENOSPC. Another
> hardware problem?

> [41801.618887] CPU: 2 PID: 5713 Comm: kworker/u8:15 Tainted: P OE 5.4.5-050405-generic #201912181630

Why is this kernel tainted? The point of pointing this out isn't to blame whatever is tainting the kernel, but to point out that identifying the cause of your problems is made a lot more difficult. I think you need to simplify the setup, a lot, in order to reduce the surface area of possible problems. Any bug hunt is made way harder when there's complication.
> [41801.618888] Hardware name: MSI MS-7816/Z97-G43 (MS-7816), BIOS V17.8 12/24/2014 > [41801.618903] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] > [41801.618916] RIP: 0010:btrfs_finish_ordered_io+0x730/0x820 [btrfs] > [41801.618917] Code: 49 8b 46 50 f0 48 0f ba a8 40 ce 00 00 02 72 1c 8b 45 b0 83 f8 fb 0f 84 d4 00 00 00 89 c6 48 c7 c7 48 33 62 c0 e8 eb 9c 91 d5 <0f> 0b 8b 4d b0 ba 57 0c 00 00 48 c7 c6 40 67 61 c0 4c 89 f7 bb 01 > [41801.618918] RSP: 0018:ffffc18b40edfd80 EFLAGS: 00010282 > [41801.618921] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left > [41801.618922] RAX: 0000000000000000 RBX: ffff9f8b7b2e3800 RCX: 0000000000000006 > [41801.618922] BTRFS info (device dm-35): forced readonly > [41801.618924] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9f8bbeb17440 > [41801.618924] RBP: ffffc18b40edfdf8 R08: 00000000000005a6 R09: ffffffff979a4d90 > [41801.618925] R10: ffffffff97983fa8 R11: ffffc18b40edfbe8 R12: ffff9f8ad8b4ab60 > [41801.618926] R13: ffff9f867ddb53c0 R14: ffff9f8bbb0446e8 R15: 0000000000000000 > [41801.618927] FS: 0000000000000000(0000) GS:ffff9f8bbeb00000(0000) knlGS:0000000000000000 > [41801.618928] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [41801.618929] CR2: 00007f8ab728fc30 CR3: 000000049080a002 CR4: 00000000001606e0 > [41801.618930] Call Trace: > [41801.618943] finish_ordered_fn+0x15/0x20 [btrfs] > [41801.618957] normal_work_helper+0xbd/0x2f0 [btrfs] > [41801.618959] ? __schedule+0x2eb/0x740 > [41801.618973] btrfs_endio_write_helper+0x12/0x20 [btrfs] > [41801.618975] process_one_work+0x1ec/0x3a0 > [41801.618977] worker_thread+0x4d/0x400 > [41801.618979] kthread+0x104/0x140 > [41801.618980] ? process_one_work+0x3a0/0x3a0 > [41801.618982] ? 
kthread_park+0x90/0x90
> [41801.618984] ret_from_fork+0x1f/0x40
> [41801.618985] ---[ end trace 35086266bf39c897 ]---
> [41801.618987] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left
>
> unmount/remount seems to make it work again, and it is full (df) yet has
> 3TB of unallocated space left. No clue what to do now, do I have to start
> over restoring again?
>
> Filesystem               Size  Used Avail Use% Mounted on
> /dev/mapper/xmnt-cold15   27T   23T     0 100% /cold1

Clearly a bug, possibly more than one. This problem is being discussed in other threads on df misreporting with recent kernels, and a fix is pending. As for the ENOSPC, also clearly a bug, but it's not clear why or where.

> Please, don't always chalk it up to hardware problems - btrfs is a
> wonderful filesystem for many reasons, one reason I like is that it can
> detect corruption much earlier than other filesystems. This feature alone
> makes it impossible for me to go back to xfs. However, I had corruption
> on ext4, xfs, reiserfs over the years, but btrfs *is* simply way buggier
> still than those - before btrfs (and even now) I kept md5sums of all
> archived files (~200TB), and xfs and ext4 _do_ a much better job at not
> corrupting data than btrfs on the same hardware - while I get filesystem
> problems about every half a year with btrfs, I had (silent) corruption
> problems maybe once every three to four years with xfs or ext4 (and not
> yet on the boxes I use currently).

I can't really parse the suggestion that you are seeing md5 mismatches (indicating data changes) on Btrfs where Btrfs doesn't produce a csum warning along with EIO on those files. Are these files nodatacow, either by mount option nodatasum or nodatacow, or using chattr +C on these files? A mechanism explaining this anecdote isn't clear. Not even a crc32c checksum collision would explain more than maybe one instance of it.

I'm curious what Zygo thinks about this.
>
> Please take these issues seriously - the trend of "it's a hardware
> problem" will not remove the "unstable" stigma from btrfs as long as btrfs
> is clearly more buggy than other filesystems.
>
> Sorry to be so blunt, but I am a bit sensitive about always being told
> "it's probably a hardware problem" when it clearly affects practically any
> server and any laptop I administrate. I believe in btrfs, and detecting
> corruption early is a feature to me.

The problem with the anecdotal method of arguing in favor of software bugs as the explanation? It directly goes against my own experience, which is also anecdote. I've had no problems that I can attribute to Btrfs; all were hardware or user sabotage, and I've had zero data loss outside of user sabotage. I have seen device UNC read errors, corrected automatically by Btrfs. And I have seen devices return bad data that Btrfs caught - data that would otherwise have been silent corruption of either metadata or data - and this was corrected in the raid1 cases and merely reported in the non-raid cases. I've also seen considerable corruption reported on SD cards in the midst of implosion and becoming read only. But even read only, I was able to get all the data out.

But in your case, practically every server and laptop? That's weird and unexpected, and it makes me wonder what's in common. Btrfs is much fussier than other file systems because by far the largest target for corruption isn't file system metadata, but data. The actual payload of a file system isn't the file system. And Btrfs is the only Linux native file system that checksums data. The other file systems check only metadata, and only somewhat recently, depending on the distribution you're using.

> I understand it can be frustrating to be confronted with hard to explain
> accidents, and I understand if you can't find the bug with the sparse info
> I gave, especially as the bug might not even be in btrfs.
> But keep in mind
> that the people who boldly/dumbly use btrfs in production and restore
> dozens of terabytes from backup every so and so many months are also being
> frustrated if they present evidence from multiple machines and get told
> "it's probably a hardware problem".

For sure. But take the contrary case that other file systems have depended on for more than a decade: assuming the hardware returns valid data. This is intrinsic to their design. Go back to before they had metadata checksumming, and you'd see it stated on their lists that they do assume this, and that if your devices return any bad data, it's not the file system's fault at all - not even the lack of reporting any kind of problem whatsoever. How is that better?

Well indeed, not long after Btrfs demonstrated that these problems are actually more common than suspected, metadata checksumming started creeping into other file systems, finally becoming the default (a while ago on XFS, and very recently on ext4). And they are catching a lot of these same kinds of layer and hardware bugs.

Hardware does not just mean the drive; it can be the power supply, logic board, controller, cables, drive write caches, drive firmware, and other drive internals. And the only way any problem can be fixed is to understand how, when and where it happened.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: btrfs dev del not transaction protected? 2019-12-20 20:24 ` Chris Murphy @ 2019-12-20 23:30 ` Marc Lehmann 2019-12-21 20:06 ` Zygo Blaxell 1 sibling, 0 replies; 18+ messages in thread
From: Marc Lehmann @ 2019-12-20 23:30 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS

On Fri, Dec 20, 2019 at 01:24:02PM -0700, Chris Murphy <lists@colorremedies.com> wrote:
> > Please don't - I honestly think btrfs developers are way too fast to blame
> > hardware for problems.
>
> That's because they have a lot of evidence of this, in a way that's
> only inferable with other file systems. This was long suspected, and
> demonstrated well before Btrfs, during ZFS development.

But they don't - when I report that I see this reproducible behaviour on different machines, what is that lot of evidence? At the least they could inquire.

> A reasonable criticism of Btrfs development is the state of the file
> system check/repair tool, which still carries danger warnings. But it's also a
> case of damned if they do, and damned if they don't provide it. It
> might be the best chance of recovery, so why not provide it?

Note that I have not asked for a better fsck or anything of the sort.

> usual, and the developers apologized. What else can they do? It's not like
> the developers are blaming hardware for their own bugs. They have
> consistently taken responsibility for Btrfs bugs.

That's not the reality I live in, though. Most of my bug reports on btrfs have either been completely ignored or met with "oh, I can't reproduce it today anymore, maybe it's fixed itself". Sure, some of my bug reports have been taken seriously as well, and btrfs has advanced considerably over the years. I am a software developer myself, and I understand that not every bug report can be acted upon, and that sometimes you need to be sceptical for other reasons than the assumed reported ones.
> Usually Btrfs developers do mention the possibility of other software
> layers contributing to the problem; it's a valid observation that this
> possibility should be stated.

That's probably why I stated it, yes. Your mail doesn't really apply to much of what I wrote - have you really read my bug report, or is this the pre-canned response you send out for criticism? Sorry to be so blunt, but that's pretty much how your mail feels to me, as it doesn't seem to take into account what I reported.

> However, if it's exclusively a software problem, then it should be
> reproducible on other systems.

Which, in this case, it is. Even hardware problems can be reproduced on other systems - when it's, say, a controller problem - so reproducibility of a problem does not mean it's a software bug. But likewise, jumping to conclusions because it is convenient is also a non sequitur.

> > In any case if it is a hardware problem, then linux and/or btrfs has
> > to work around them, because it affects many different controllers on
> > different boards:
>
> How do you propose Btrfs work around it? In particular when there are
> additional software layers over which it has no control?

Why do I suddenly have to propose how btrfs should work around it? I said if it's a hardware problem with practically every current controller, then Linux and/or btrfs have to work around it, otherwise btrfs becomes useless. And if other filesystems can keep data safe where btrfs can't, then clearly btrfs can be improved to do likewise.

> Have you tried disabling the (drives') write cache?

The write caches of all drives have been off during the lifetime of the filesystem. Nevertheless, what's your basis for asking for write caches to be turned off? Is there any evidence that drives lose their cache contents unless the cache is turned off? Maybe such drives exist, but I have not heard of them - have you?
The reason why the write cache is off is that the stupid LSI RAID controllers I use _do_ lose data on power outages when the drive cache is on, something the 3ware controllers didn't do. Not that power outages (actual brownouts or manually induced via the power switch) are something that really happens here. I do, however, expect current filesystems to properly flush data to disk, and outside the RAID controllers I do not disable the write cache.

> > [41801.618887] CPU: 2 PID: 5713 Comm: kworker/u8:15 Tainted: P OE 5.4.5-050405-generic #201912181630
>
> Why is this kernel tainted?

The non-free nvidia driver.

> The point of pointing this out isn't to blame whatever is tainting the
> kernel, but to point out that identifying the cause of your problems is
> made a lot more difficult.

I think you are wildly exaggerating. Are there any reports of nvidia drivers actually corrupting filesystems? I am genuinely curious. Other than that, your claim that a tainted kernel somehow makes identifying the problem a lot more difficult is just taken out of the blue, isn't it?

> I think you need to simplify the setup, a lot, in order to reduce the
> surface area of possible problems. Any bug hunt is made way harder when
> there's complication.

Well, sure, you can hide behind the kernel taint. I am sorry - in that case I just cannot provide you with any reports anymore, if that is what you really want. Note, however, that this is simply an excuse: "oh, a raid controller, you need to simplify your setup first"; "oh, a tainted driver, you need to simplify your setup first". What's next - "oh, you used a SATA disk, you need to simplify the setup first to see if filesystem corruption happens without the disk first"? Debugging real world problems is hard, and just ignoring the real world because it doesn't fit into a lab doesn't work.
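[Editorial note: on the write-cache/flush point above - the durability contract applications rely on is fsync(2); whether data then survives a reset depends on the kernel issuing FLUSH/FUA and every layer below (dm, controller, drive cache) honoring it. A minimal userspace sketch, with hypothetical file names, of what "properly flushed" means from an application's point of view:]

```python
import os
import tempfile

# Minimal sketch of the userspace durability contract: once fsync()
# returns, the data is supposed to survive a crash -- provided the
# kernel issues FLUSH/FUA and the storage stack below honors it.
path = os.path.join(tempfile.mkdtemp(), "journal.log")  # hypothetical name
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd, b"committed record\n")
os.fsync(fd)              # block until the device reports the data durable
os.close(fd)

# To make the file *name* crash-safe too, fsync the containing directory:
dfd = os.open(os.path.dirname(path), os.O_RDONLY)
os.fsync(dfd)
os.close(dfd)

print(open(path, "rb").read().decode(), end="")   # -> committed record
```

The point of contention in the thread is precisely the "provided" clause: fsync can only be as strong as the weakest cache layer beneath it.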
Besides, I already simplified the setup for you - my Dell laptop only uses certified genuine non-tainted in-kernel drivers and an NVMe controller, and it still suffered from the open_ctree-on-first-mount problem once.

> > Filesystem               Size  Used Avail Use% Mounted on
> > /dev/mapper/xmnt-cold15   27T   23T     0 100% /cold1
>
> Clearly a bug, possibly more than one. This problem is being discussed
> in other threads on df misreporting with recent kernels, and a fix is
> pending.

As it turns out, it's not df misreporting - it's simply very bad error reporting on the side of btrfs, requiring a lot of guesswork on the side of the user, followed by "you need to fix that problem first" from the btrfs mailing list.

In any case, I am sorry I was triggered and brought this up - this last oops report was not meant as a request to help me solve my problem, but to show how bad the user experience really is, both with btrfs and with this list. Seriously, when I mention I have a reproducible problem on multiple kernel versions on multiple very different computers (which I reported in May), then it is simply not appreciated to be told it's probably a hardware problem, even if, however inconceivable it might be, it possibly could be a hardware problem.

> As for the ENOSPC, also clearly a bug. But not clear why or where.

So at least it wasn't immediately obvious to you either - it took me a while to figure out the "obvious", namely that one disk with free space is not enough for raid1 metadata. The issue here is not df potentially misreporting, but the fact that btrfs simply has no tool to do much about it in obvious ways - yet the btrfs list tells me I need to fix things first. Great advice.

> > Please, don't always chalk it up to hardware problems - btrfs is a
> > wonderful filesystem for many reasons, one reason I like is that it can
> > detect corruption much earlier than other filesystems. This feature alone
> > makes it impossible for me to go back to xfs.
> > However, I had corruption
> > on ext4, xfs, reiserfs over the years, but btrfs *is* simply way buggier
> > still than those - before btrfs (and even now) I kept md5sums of all
> > archived files (~200TB), and xfs and ext4 _do_ a much better job at not
> > corrupting data than btrfs on the same hardware - while I get filesystem
> > problems about every half a year with btrfs, I had (silent) corruption
> > problems maybe once every three to four years with xfs or ext4 (and not
> > yet on the boxes I use currently).
>
> I can't really parse the suggestion that you are seeing md5 mismatches
> (indicating data changes) on Btrfs,

Me neither - where do you think I suggested that?

> where Btrfs doesn't produce a csum warning along with EIO on those
> files? Are these files nodatacow,

Well, first of all, the default is a relatively weak crc checksum (fortunately, current kernels already offer more, another good point in favour of btrfs). Second, nocow data isn't checksummed. Third, I didn't say btrfs claims good checksums on md5 mismatches, I claimed it corrupts data.

For me, btrfs telling me instantly that a file is unreadable due to a checksum error is much preferable to me having to manually checksum it to find it, something I do maybe once a year.

What I am saying is that I lose way more files and data on btrfs than on any other filesystem I used or use. And that's a fact - we can now speculate on why that is. I have a few data points - most of the problems I ran into have been other kernel bugs (such as the 4.11 corruption bug, various dmcache corruption bugs and so on). Some of these have been btrfs bugs that have been fixed. Some of these have been operator errors. And some are unknown, but it's easy to chalk it up to bugs in, say, dmcache, which is another big unknown. I don't report any of these issues because I have no useful data to report.
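(Editorial aside: the "md5sums of all archived files" workflow described above is a few lines of shell; directory and file names below are made-up examples.)

```shell
# Build a checksum manifest for an archive tree, then verify it later.
# Paths are hypothetical; a real archive would use a fixed directory.
dir=$(mktemp -d)
echo "archived payload" > "$dir/file1"
( cd "$dir" && find . -type f ! -name MD5SUMS -exec md5sum {} + > MD5SUMS )
# Later (e.g. yearly): md5sum -c exits non-zero on any mismatch, which
# catches silent corruption the filesystem itself never reported.
( cd "$dir" && md5sum -c --quiet MD5SUMS ) && echo "all checksums OK"
```

On btrfs the kernel's own csum verification usually reports corruption first (as EIO plus a kernel log line), so an external manifest like this mainly matters for nocow files and for cross-filesystem comparisons of the kind made above.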
(And before you ask, the write cache of the dmcache backing SSD is switched off, although I believe the crucial SSDs that I use are some of the rare devices which actually honor FUA, and have data-at-rest protection.)

The big advantage of btrfs is that I can often mount it and make a full backup (which only stats files to see if they changed) of the disk before recreating the filesystem, and even if it fails, it usually can and does loudly tell me so. I had a way worse experience with, say, reiserfs long ago, for example :)

> A mechanism explaining this anecdote isn't clear. Not even crc32c
> checksum collision would explain more than maybe one instance of it.
>
> I'm curious what Zygo thinks about this.

I think you are reading way too much extra stuff into what I wrote - I really feel you are replying to a superficial version of what I actually reported, because you are used to other such reports, maybe?

> The problem with the anecdotal method of arguing in favor of software
> bugs as the explanation?

Is this your method? Because it is certainly not mine - or where do you see that? I have carefully presented facts together with supporting evidence that it is unlikely to be a hardware problem. Nowhere did I exclude hardware problems, nowhere did I exclude operator errors, nowhere did I exclude errors in other parts of the kernel (the most common source of corruption problems I have encountered). Nowhere have I claimed it must be a software problem.

> also anecdote. I've had no problems that I can attribute to Btrfs. All
> were hardware or user sabotage. And I've had zero data loss, outside
> of user sabotage.

That's great for you. I have ~100 active disks here with >>200TB of data; the vast majority nowadays runs btrfs. The filesystems tend to be very busy, with very different use cases, and I think, as far as anecdotes go, I have far better chances of running into btrfs (or other!) bugs than most other people, possibly (I don't know) even you.
> I have seen device UNC read errors, corrected automatically by Btrfs.

Yeah, me too, but only on a USB stick that I knew was badly borked. I currently see about one unreadable sector every 1-1.5 years, and in most cases, raid5 scrubbing takes care of that, or I identify the file, delete it, and move on. I had a single case of a sudden drive death in my whole 3x-year career (maybe I am lucky).

Of course these are just anecdotes. But hundreds of disks give a much better statistical base than, say, a single drive in somebody's home computer. But of course I only have a single multi-device btrfs filesystem (as an experiment), so my statistical basis here is very thin...

> And I have seen devices return bad data that Btrfs caught, that would
> otherwise have been silent corruption of either metadata or data, and

Me too, me too. Wonderful feature, that. I have also seen btrfs fail to use the mirror copy because of a broken block, even though the mirror would have been fine - quite a number of these have been fixed in recent years, did you know that? Until quite recently, btrfs only believed in the checksum, and if that was good, dup/raid1 was of no use...

Until recently, btrfs did practically nothing about corruption introduced by itself or the kernel (or bad RAM, for example). It's great that this changed, even though I had a few filesystems that the new stricter 5.4.x checker refused to mount. It's painful, but clearly progress.

I really think you confuse me with some other person who mindlessly complains. I think my complaints do have substance, though, and you chide unfairly :)

> But in your case, practically every server and laptop? That's weird and
> unexpected.

Not if you have some google fu. btrfs corruption problems are extremely common, but it's often very hard to guess what caused it.
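(Editorial aside on "I identify the file, delete it, and move on" above: the inode in a btrfs csum-error kernel log line can be mapped back to a path. A sketch with a sample log line - the exact message wording varies by kernel version, and the `btrfs inspect-internal` / `btrfs scrub` calls are commented out since they need a real mount point, shown here as the hypothetical /mnt.)

```shell
# Sample kernel log line of the kind btrfs emits on a checksum failure
# (illustrative; real lines differ slightly between kernel versions):
line='BTRFS warning (device dm-3): csum failed root 5 ino 257 off 0 csum 0x8941f998 expected csum 0x9d857a13 mirror 1'
# Extract the inode number from the " ino NNN " field:
ino=$(printf '%s\n' "$line" | sed -n 's/.* ino \([0-9]*\) .*/\1/p')
echo "inode: $ino"
# On the real (hypothetical) mount, resolve the inode to a path:
#   btrfs inspect-internal inode-resolve "$ino" /mnt
# With redundant profiles, a scrub can then repair from the good copy:
#   btrfs scrub start -B /mnt
```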
I also think you are super badly informed about btrfs (or pretend to be, to defend btrfs against all reason) - recent kernels report a lot of scary corruption messages and refuse mounts, with no hint as to what could be the problem (a stricter checker - your fs might work fine under previous kernels). If btrfs declares my fs as corrupt, I consider that filesystem corruption. It took me quite a while to realise it's stricter checking and not necessarily an indication of a real problem. I quietly reformatted those affected partitions.

Seriously, if you think claims of btrfs corruption are so far-fetched and unexpected, you live in a parallel universe. Just go through the kernel changelogs and count the btrfs bugfixes that could somehow lead to corruption on somebody's system - btrfs received a lot of bugfixes (which is good!), but it also means there certainly were a lot of bugs in it.

And must I remind you of the raid5 issues - I never ran into these, because careful reading of the documentation clearly told me not to use it, but it certainly caused a lot of btrfs corruption - let's chalk that up to user errors, though.

> And it makes me wonder what's in common. Btrfs is much fussier than
> other file systems because the by far largest target for corruption,
> isn't file system metadata, but data. The actual payload

I don't think this is true. File data might offer more surface, but there are many workloads (some of mine included) where metadata is shuffled around a lot more than data, and there is a lot less that can go wrong with actual data - btrfs just has to copy and checksum it - while for metadata, very complicated algorithms are in use. Maybe actual data is the largest target, but I don't think you can substantiate that claim in this generality.

> of a file system isn't the file system. And Btrfs is the only Linux
> native file system that checksums data. The other file systems check

Which is exactly why I am using it.
I had a single case of a file that was silently corrupted on xfs in the last decade, and I only noticed because I had a backup of it, and the backup had a good copy, also practically proving that it was silent data corruption.

> only metadata, and only somewhat recently, depending on the
> distribution you're using.

I think that is a clear case of fanboyism - ext4 has had metadata checksums for almost 8 years in the standard kernel now, and is probably the most commonly used fs. XFS has had metadata checksums in the standard kernel for more than 6 years. I am not sure how stable for production btrfs was when ext4 introduced these. Sure, metadata checksums are for noobs, but let's not make other filesystems look worse than they really are.

> For sure. But take the contrary case that other file systems have
> depended on for more than a decade: assuming the hardware is returning
> valid data. This is intrinsic to their design. And go back before they
> had metadata checksumming, and you'd see it stated on their lists that
> they do assume this, and if your devices return any bad data, it's not
> the file system's fault at all. Not even the lack of reporting any
> kind of problem whatsoever. How is that better?

Sure, but I have considerable empirical data about devices returning bad data over decades, as I, remember, keep md5sums of practically all my files and more or less regularly compare them, and in many cases have backups, so I can even investigate what exactly was corrupted and how.

I am sorry to bring you the bad news, but outside of known broken hardware (e.g. the CMD640 corruption, which I suffered from, if anybody is old enough to remember those), devices returning bad data happens, but is _exceedingly_ rare. Unreadable sectors are by far more common on spinning rust, and in my experience quite rare unless there was "an incident" (such as a head crash).
The most common source of data corruption is not bad hardware, especially on hardware that otherwise works fine (e.g. survives a few hours of memtest etc. and keeps file data in general), but software bugs. The by far most common source of data loss for me is kernel bugs, especially in recent years. The second most common source of data loss is operator error, at least for me. Having backups certainly made me careless.

> Well indeed, not long after Btrfs was demonstrating these are actually
> more common problems than suspected, metadata checksumming started

Sure they are more common than suspected, but that's trivial, since practically nobody expected them. And I know these problems exist, having suffered from them. But they are still an insignificant issue.

Maybe we come from different backgrounds - practically all my data is on hardware raid5 or better, and an unreadable sector is not something my filesystem usually has to take care of. They happen. Also, hardware has become both better (e.g. checksumming on transfer) and worse (as in, cheaper and much closer to physical limits). Yet still, disks silently returning other data is exceedingly rare (even if you include controller firmware problems - I have no doubt that LSI controller firmwares are a complete bugfest).

However, what you are seeing is "btrfs is reporting bad checksums", and you wrongly seem to ascribe all these cases to hardware, while probably many of these cases are driver bugs, mm bugs or even filesystem bugs (a checksum will also fail if a metadata block is outdated or points to the wrong data block, for example, which can easily happen if something goes wrong during tree management). I think that is not warranted without further evidence, which you don't seem to have.

> while ago on XFS, and very recently on ext4). And they are catching a
> lot of these same kinds of layer and hardware bugs.
> Hardware does not
> just mean the drive, it can be power supply, logic board, controller,
> cables, drive write caches, drive firmware, and other drive internals.

And it can also be software. And the filesystem. On what grounds do you exclude btrfs from this list, for example? It clearly had a lot of bugs, and like every complex piece of software, it surely has a lot of bugs left.

> And the only way any problem can be fixed, is to understand how, when
> and where it happened.

Yes, and you can't understand how if you simply exclude the filesystem because it probably was the hardware anyway and ignore the problem. "Could not reproduce this in the current kernel anymore, maybe it's fixed, closing bug."

A disclaimer: this mail (and your mail!) was way too long. If I can't fully participate in any (potential) further discussion, it is because I lack the time, not for other reasons.

Greetings,

--
The choice of a Deliantra, the free code+content MORPG -----==- _GNU_ http://www.deliantra.net ----==-- _ generation ---==---(_)__ __ ____ __ Marc Lehmann --==---/ / _ \/ // /\ \/ / schmorp@schmorp.de -=====/_/_//_/\_,_/ /_/\_\
* Re: btrfs dev del not transaction protected? 2019-12-20 20:24 ` Chris Murphy 2019-12-20 23:30 ` Marc Lehmann @ 2019-12-21 20:06 ` Zygo Blaxell 1 sibling, 0 replies; 18+ messages in thread From: Zygo Blaxell @ 2019-12-21 20:06 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS, Qu Wenruo, Marc Lehmann [-- Attachment #1: Type: text/plain, Size: 17482 bytes --] On Fri, Dec 20, 2019 at 01:24:02PM -0700, Chris Murphy wrote: > On Fri, Dec 20, 2019 at 9:53 AM Marc Lehmann <schmorp@schmorp.de> wrote: > > > > On Fri, Dec 20, 2019 at 09:41:15PM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > Consider all these insane things, I tend to believe there is some > > > FUA/FLUSH related hardware problem. > > > > Please don't - I honestly think btrfs developers are way to fast to blame > > hardware for problems. > > That's because they have a lot of evidence of this, in a way that's > only inferable with other file systems. This has long been suspected > by, and demonstrated, well before Btrfs with ZFS development. > > A reasonable criticism of Btrfs development is the state of the file > system check repair, which still has danger warnings. But it's also a > case of damned if they do, and damned if they don't provide it. It > might be the best chance of recovery, so why not provide it? > Conversely, the reality is that the file system is complicated enough, > and the file system checker too slow, that the effort needs to be on > (what I call) file system autopsy tools, to figure out why the > corruption happened, and prevent that from happening. The repair is > often too difficult. > > Take, for example, the recent 5.2.0-5.2.14 corruption bug. That was > self-reported once it was discovered and fixed, which took longer than > usual, and developers apologized. What else can they do? It's not like > the developers are blaming hardware for their own bugs. They have > consistently taken responsibility for Btrfs bugs. 
> > > > I currently lose btrfs filesystems about once every > > 6 months, and other than the occasional user error, it's always the kernel > > (e.g. 4.11 corrupting data, dmcache and/or bcache corrupting things, > > low-memory situations etc. - none of these seem to be centric to btrfs, > > but none of those are hardware errors either). I know its the kernel in > > most cases because in those cases, I can identify the fix in a later > > kernel, or the mitigating circumstances don't appear (e.g. freezes). > > Usually Btrfs developers do mention the possibility of other software > layers contributing to the problem, it's a valid observation that this > possibility be stated. Also note that not all btrfs developers will agree on a failure analysis. Some patience is required. Be prepared to support your bug report with working reproducers and relevant evidence, possibly many times, with fresh backtraces on each new kernel release in which the bug still appears. > However, if it's exclusively a software problem, then it should be > reproducible on other systems. > > > > In any case if it is a hardware problem, then linux and/or btrfs has > > to work around them, because it affects many different controllers on > > different boards: > > How do you propose Btrfs work around it? In particular when there are > additional software layers over which it has no control? > > Have you tried disabling the (drives') write cache? Apparently many sysadmins disable write cache proactively on all drives, instead of waiting until the drive drops some data to learn that there's a problem with the firmware. That's a reasonable tradeoff for btrfs, which already has a heavily optimized write path (most of the IO time in btrfs commit is spent _reading_ metadata). > > Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs > > filesystem I restored to went into readonly mode with ENOSPC. Another > > hardware problem? 
> > > [41801.618887] CPU: 2 PID: 5713 Comm: kworker/u8:15 Tainted: P OE 5.4.5-050405-generic #201912181630 > > Why is this kernel tainted? The point of pointing this out isn't to > blame whatever it tainting the kernel, but to point out that > identifying the cause of your problems is made a lot more difficult. I > think you need to simplify the setup, a lot, in order to reduce the > surface area of possible problems. Any bug hunt is made way harder > when there's complication. > > > > > [41801.618888] Hardware name: MSI MS-7816/Z97-G43 (MS-7816), BIOS V17.8 12/24/2014 > > [41801.618903] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] > > [41801.618916] RIP: 0010:btrfs_finish_ordered_io+0x730/0x820 [btrfs] > > [41801.618917] Code: 49 8b 46 50 f0 48 0f ba a8 40 ce 00 00 02 72 1c 8b 45 b0 83 f8 fb 0f 84 d4 00 00 00 89 c6 48 c7 c7 48 33 62 c0 e8 eb 9c 91 d5 <0f> 0b 8b 4d b0 ba 57 0c 00 00 48 c7 c6 40 67 61 c0 4c 89 f7 bb 01 > > [41801.618918] RSP: 0018:ffffc18b40edfd80 EFLAGS: 00010282 > > [41801.618921] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left > > [41801.618922] RAX: 0000000000000000 RBX: ffff9f8b7b2e3800 RCX: 0000000000000006 > > [41801.618922] BTRFS info (device dm-35): forced readonly > > [41801.618924] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9f8bbeb17440 > > [41801.618924] RBP: ffffc18b40edfdf8 R08: 00000000000005a6 R09: ffffffff979a4d90 > > [41801.618925] R10: ffffffff97983fa8 R11: ffffc18b40edfbe8 R12: ffff9f8ad8b4ab60 > > [41801.618926] R13: ffff9f867ddb53c0 R14: ffff9f8bbb0446e8 R15: 0000000000000000 > > [41801.618927] FS: 0000000000000000(0000) GS:ffff9f8bbeb00000(0000) knlGS:0000000000000000 > > [41801.618928] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [41801.618929] CR2: 00007f8ab728fc30 CR3: 000000049080a002 CR4: 00000000001606e0 > > [41801.618930] Call Trace: > > [41801.618943] finish_ordered_fn+0x15/0x20 [btrfs] > > [41801.618957] normal_work_helper+0xbd/0x2f0 [btrfs] > > 
[41801.618959] ? __schedule+0x2eb/0x740 > > [41801.618973] btrfs_endio_write_helper+0x12/0x20 [btrfs] > > [41801.618975] process_one_work+0x1ec/0x3a0 > > [41801.618977] worker_thread+0x4d/0x400 > > [41801.618979] kthread+0x104/0x140 > > [41801.618980] ? process_one_work+0x3a0/0x3a0 > > [41801.618982] ? kthread_park+0x90/0x90 > > [41801.618984] ret_from_fork+0x1f/0x40 > > [41801.618985] ---[ end trace 35086266bf39c897 ]--- > > [41801.618987] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left > > > > unmount/remount seems to make it work again, and it is full (df) yet has > > 3TB of unallocated space left. No clue what to do now, do I have to start > > over restoring again? > > > > Filesystem Size Used Avail Use% Mounted on > > /dev/mapper/xmnt-cold15 27T 23T 0 100% /cold1 > > Clearly a bug, possibly more than one. This problem is being discussed > in other threads on df misreporting with recent kernels, and a fix is > pending. > > As for the ENOSPC, also clearly a bug. But not clear why or where. > > > > Please, don't always chalk it up to hardware problems - btrfs is a > > wonderful filesystem for many reasons, one reason I like is that it can > > detect corruption much earlier than other filesystems. This featire alone > > makes it impossible for me to go back to xfs. However, I had corruption > > on ext4, xfs, reiserfs over the years, but btrfs *is* simply way buggier > > still than those - before btrfs (and even now) I kept md5sums of all > > archived files (~200TB), and xfs and ext4 _do_ a much better job at not > > corrupting data than btrfs on the same hardware - while I get filesystem > > problems about every half a year with btrfs, I had (silent) corruption > > problems maybe once every three to four years with xfs or ext4 (and not > > yet on the bxoes I use currently). 
> > I can't really parse the suggestion that you are seeing md5 mismatches > (indicating data changes) on Btrfs, where Btrfs doesn't produce a csum > warning along with EIO on those files? Are these files nodatacow, > either by mount option nodatasum or nodatacow, or using chattr +C on > these files? > > A mechanism explaining this anecdote isn't clear. Not even crc32c > checksum collision would explain more than maybe one instance of it. > > I'm curious what Zygo thinks about this. Hardware bugs and failures are certainly common, and fleetwide hardware failures do happen. They're also recognizable as hardware bugs--some specific failure modes (e.g. single-bit data value errors, parent transid verify failure after crashes) are definitely hardware and can be easily spotted with only a few lines of kernel logs. Some components of btrfs (e.g. scrubs, csum verification, raid1 corruption recovery) are very reliable detectors of hardware or firmware misbehavior (although sometimes it is not trivial to identify _which_ hardware is at fault). Some parts of btrfs (like free space management) are completely btrfs, and cannot be affected by hardware failures without destroying the entire filesystem. On the other hand, it's not like btrfs or the Linux kernel has been bug free either, and a lot of serious but hard to detect bugs are 5-10 years old when they get fixed. All kernels before 5.1 had silent data corruption bugs for compressed data at hole boundaries. Kernels 5.1 to 5.4 have use-after-free bugs in btrfs that lead to metadata corruption (5.1), transaction aborts due to self-detected metadata corruption (5.2), and crashes (5.3 and 5.4). 5.2 also had a second metadata corruption with deadlock bug. Other parts of the kernel are hard on data as well: somewhere around 4.7 a year-old kernel memory corruption bug was found in the r8169 network driver, and 4.0, 4.19, and 5.1 all had famous block-layer bugs that would destroy any filesystem under certain conditions. 
I test every upstream kernel release thoroughly before deploying to production, because every upstream Linux kernel release has thousands of bugs (btrfs is usually about 1-2% of those). I am still waiting for the very first upstream kernel release for btrfs that can run our full production stress test workload without any backported fixes and without crashing or corrupting data or metadata for 30 days. So far that goal has never been met. We upgrade kernels when a new release gets better than an old one, but the median uptime under stress is still an order of magnitude short of the 30 day mark, and our testing on 5.4.5+fixes isn't done yet. Unfortunately, due to the nature of crashing bugs, we can only work on the most frequently occurring bug at any time, and each one has to be fixed before the next most frequently occurring bug can be discovered, making these fixes a very sequential process. Then there's the two-month lag to get patches from the mailing list into stable kernels, which is plenty of time for new regressions to appear, and we start over again with a fresh set of bugs to fix. btrfs dev del bugs are not crashing bugs, so they are so far down my priority list that I haven't bothered to test for them, or even to report them when I find one accidentally. There are a few bugs there though, especially if you are low on metadata space (which is a likely event if you just removed an entire disk's worth of storage) or btrfs has a bug in that kernel version that just makes btrfs _think_ it is low on metadata space, and the transaction aborts during the delete. 
Occasionally I hit one of these in an array and work around it with a patch like this one:

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 56e35d2e957c..b16539fd2c23 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7350,6 +7350,8 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 #if 0
 		ret = -EINVAL;
 		goto error;
+#else
+		btrfs_set_super_num_devices(fs_info->super_copy, total_dev);
 #endif
 	}
 	if (btrfs_super_total_bytes(fs_info->super_copy) <

Probably not a good idea for general use, but it may solve an immediate problem if the problem is simply that the wrong number of devices is stored in the superblock.

> > Please take these issues seriously - the trend of "it's a hardware
> > problem" will not remove the "unstable" stigma from btrfs as long as btrfs
> > is clearly more buggy than other filesystems.
> > Sorry to be so blunt, but I am a bit sensitive with always being told
> > "it's probably a hardware problem" when it clearly affects practically any
> > server and any laptop I administrate. I believe in btrfs, and detecting
> > corruption early is a feature to me.
>
> The problem with the anecdotal method of arguing in favor of software
> bugs as the explanation? It directly goes against my own experience,
> also anecdote. I've had no problems that I can attribute to Btrfs. All
> were hardware or user sabotage. And I've had zero data loss, outside
> of user sabotage.

You are definitely not testing hard enough. ;)

At one point in 2016 there were 145 bugs active that are known today. About 10 of those 145 were discovered in the last few months alone (i.e. it was broken in 2016, and we only know now, after 3 years of hindsight, how broken it was then). https://imgur.com/a/A2sXcQB

Thankfully, many of those bugs were mostly harmless, but some were not: I've found at least 5 distinct data or metadata corrupting bugs since 2014, and confirmed the existence of several more in regression testing.
> I have seen device UNC read errors, corrected automatically by Btrfs. > And I have seen devices return bad data that Btrfs caught, that would > otherwise have been silent corruption of either metadata or data, and > this was corrected in the raid1 cases, and merely reported in the > non-raid cases. And I've also seen considerable corruption reported > upon SD Cards in the midst of implosion and becoming read only. But > even read only, I was able to get all the data out. btrfs data recovery on raid1 from csum and UNC sector failures is excellent. I've seen no issues there since 3.18ish. I do test that from time to time with VMs and fault injection and also with real disk failures. btrfs on raid5 (internal or external raid5 implementation), device delete, and some unfortunate degraded mode behaviors still need some work. > But in your case, practically ever server and laptop? That's weird and > unexpected. And it makes me wonder what's in common. Btrfs is much > fussier than other file systems because the by far largest target for > corruption, isn't file system metadata, but data. The actual payload > of a file system isn't the file system. And Btrfs is the only Linux > native file system that checksums data. The other file systems check > only metadata, and only somewhat recently, depending on the > distribution you're using. If the "corruption" consists of large quantities of zeros, the problem might be using the (default) noflushoncommit mount option, or using applications that don't fsync() religiously. This is correct filesystem behavior, though maybe not behavior any application developer wants. If the corruption affects compressed data adjacent to holes, then it's a known problem fixed in 5.1 and later. If the corruption is specifically and only parent transid verify failures after a crash, UNC sector read, or power failure, then we'd be looking for drive firmware issues or non-default kernel settings to get a fleetwide effect. 
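(Editorial aside on the noflushoncommit/fsync point above: both knobs can be sketched in a few lines. The mount point and device below are placeholders; the `sync FILE` form requires GNU coreutils, where sync with a file operand issues an fsync on that file.)

```shell
# Mounting with flushoncommit flushes dirty data at each transaction
# commit, narrowing the window in which a crash leaves zero-filled files
# (commit=30 is the default commit interval in seconds):
#   mount -o flushoncommit,commit=30 /dev/sdX /mnt
#
# Applications can protect themselves regardless of mount options by
# fsyncing before they rely on data being durable; from the shell:
f=$(mktemp)
echo "important data" > "$f"
sync "$f"            # GNU coreutils sync(1) with a file operand fsyncs it
echo "synced"
```

The tradeoff is throughput: flushoncommit turns every commit into a data flush, which is why it is not the default.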
If the corruption is general metadata corruption without metadata page csum failures, then it could be host RAM failure, general kernel memory corruption (i.e. you have to look at all the other device drivers in the system), or known bugs in btrfs kernel 5.1 and later. If the corruption is all csum failures, then there's a long list of drive issues that could cause it, or the partition could be trampled by other software (BIOSes are sometimes surprisingly bad at this). > > I understand it can be frustrating to be confronted with hard to explain > > accidents, and I understand if you can't find the bug with the sparse info > > I gave, especially as the bug might not even be in btrfs. But keep in mind > > that the people who boldly/dumbly use btrfs in production and restore > > dozens of terabytes from backup every so and so many months are also being > > frustrated if they present evidence from multiple machines and get told > > "its probably a hardware problem". > > For sure. But take the contrary case that other file systems have > depended on for more than a decade: assuming the hardware is returning > valid data. This is intrinsic to their design. And go back before they > had metadata checksumming, and you'd see it stated on their lists that > they do assume this, and if your devices return any bad data, it's not > the file system's fault at all. Not even the lack of reporting any > kind of problem whatsoever. How is that better? > > Well indeed, not long after Btrfs was demonstrating these are actually > more common problems that suspected, metadata checksumming started > creeping into other file systems, finally becoming the default (a > while ago on XFS, and very recently on ext4). And they are catching a > lot of these same kinds of layer and hardware bugs. Hardware does not > just mean the drive, it can be power supply, logic board, controller, > cables, drive write caches, drive firmware, and other drive internals. 
> And the only way any problem can be fixed, is to understand how, when
> and where it happened.
>
> --
> Chris Murphy
* Re: btrfs dev del not transaction protected? 2019-12-20 16:53 ` Marc Lehmann 2019-12-20 17:24 ` Remi Gauvin 2019-12-20 20:24 ` Chris Murphy @ 2019-12-21 1:32 ` Qu Wenruo 2 siblings, 0 replies; 18+ messages in thread From: Qu Wenruo @ 2019-12-21 1:32 UTC (permalink / raw) To: Marc Lehmann, Josef Bacik; +Cc: linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 10518 bytes --] On 2019/12/21 上午12:53, Marc Lehmann wrote: > On Fri, Dec 20, 2019 at 09:41:15PM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> BTW, that chunk number is very small, and since it has 0 tolerance, it >> looks like to be SINGLE chunk. >> >> In that case, it looks like a temporary chunk from older mkfs, and it >> should contain no data/metadata at all, thus brings no data loss. > > Well, there indeed should not have been any data or metadata left as the > btrfs dev del succeeded after lengthy copying. > >> BTW, "btrfs ins dump-tree -t chunk <dev>" would help a lot. >> That would directly tell us if the devid 1 device is in chunk tree. > > Apologies if I wasn't too clear about it - I already had to mkfs and > redo the filesystem. I understand that makes tracking this down hard or > impossible, but I did need that machine and filesystem. > >>> And if you want to hear more "insane" things, after I hard-reset >>> my desktop machine (5.2.21) two days ago I had to "btrfs rescue >>> fix-device-size" to be able to mount (can't find the kernel error atm.). >> >> Consider all these insane things, I tend to believe there is some >> FUA/FLUSH related hardware problem. > > Please don't - I honestly think btrfs developers are way to fast to blame > hardware for problems. I currently lose btrfs filesystems about once every > 6 months, and other than the occasional user error, it's always the kernel > (e.g. 4.11 corrupting data, dmcache and/or bcache corrupting things, > low-memory situations etc. - none of these seem to be centric to btrfs, > but none of those are hardware errors either). 
> I know it's the kernel in
> most cases because in those cases I can identify the fix in a later
> kernel, or the mitigating circumstances don't appear (e.g. freezes).
>
> In any case, if it is a hardware problem, then linux and/or btrfs has
> to work around it, because it affects many different controllers on
> different boards:
>
> - dell perc h740 on "doom" and "cerebro"
> - intel series 9 controller on "doom" and "cerebro"
> - samsung nvme controller on "yoyo" and "yuna"
> - marvell sata controller on "doom"
>
> Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs
> filesystem I restored to went into readonly mode with ENOSPC. Another
> hardware problem?
>
> [41801.618772] ------------[ cut here ]------------
> [41801.618776] BTRFS: Transaction aborted (error -28)

According to your later replies, this bug turns out to be a problem in the
over-commit calculation. It doesn't take per-device space requirements into
consideration, and thus can't handle cases like a 3-disk RAID1 with 2 full
disks. Right now it acts just as if we were using DUP profiles, which
causes the problem.

To Josef: any idea how to fix it? I guess we could go the complex statfs()
way and calculate how many bytes can really be allocated. Or hugely reduce
the over-commit threshold?
Thanks,
Qu

> [41801.618843] WARNING: CPU: 2 PID: 5713 at fs/btrfs/inode.c:3159 btrfs_finish_ordered_io+0x730/0x820 [btrfs]
> [41801.618844] Modules linked in: nfsv3 nfs fscache nvidia_modeset(POE) nvidia(POE) btusb algif_skcipher af_alg dm_crypt nfsd auth_rpcgss nfs_acl lockd grace cls_fw sch_htb sit tunnel4 ip_tunnel hidp act_police cls_u32 sch_ingress sch_tbf 8021q garp mrp stp llc ip6t_REJECT nf_reject_ipv6 xt_CT xt_MASQUERADE xt_nat xt_REDIRECT nft_chain_nat nf_nat xt_owner xt_TCPMSS xt_DSCP xt_mark nf_log_ipv4 nf_log_common xt_LOG xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_length xt_mac xt_tcpudp nft_compat nft_counter nf_tables xfrm_user xfrm_algo nfnetlink cmac uhid bnep tda10021 snd_hda_codec_hdmi binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass tda827x tda10023 crct10dif_pclmul mei_hdcp crc32_pclmul btrtl btbcm rc_tt_1500 ghash_clmulni_intel snd_emu10k1 btintel snd_util_mem snd_ac97_codec aesni_intel bluetooth snd_hda_intel budget_av snd_rawmidi snd_intel_nhlt crypto_simd saa7146_vv
> [41801.618864] snd_hda_codec videobuf_dma_sg budget_ci videobuf_core snd_seq_device budget_core cryptd ttpci_eeprom glue_helper snd_hda_core saa7146 dvb_core intel_cstate ac97_bus snd_hwdep rc_core snd_pcm intel_rapl_perf mxm_wmi cdc_acm pcspkr videodev snd_timer ecdh_generic snd emu10k1_gp ecc mc gameport soundcore mei_me mei mac_hid acpi_pad tcp_bbr drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt ipmi_devintf ipmi_msghandler hid_generic usbhid hid usbkbd coretemp nct6775 hwmon_vid sunrpc parport_pc ppdev lp parport msr ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio libcrc32c ahci megaraid_sas i2c_i801 libahci lpc_ich r8169 realtek wmi video [last unloaded:
nvidia]
> [41801.618887] CPU: 2 PID: 5713 Comm: kworker/u8:15 Tainted: P OE 5.4.5-050405-generic #201912181630
> [41801.618888] Hardware name: MSI MS-7816/Z97-G43 (MS-7816), BIOS V17.8 12/24/2014
> [41801.618903] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> [41801.618916] RIP: 0010:btrfs_finish_ordered_io+0x730/0x820 [btrfs]
> [41801.618917] Code: 49 8b 46 50 f0 48 0f ba a8 40 ce 00 00 02 72 1c 8b 45 b0 83 f8 fb 0f 84 d4 00 00 00 89 c6 48 c7 c7 48 33 62 c0 e8 eb 9c 91 d5 <0f> 0b 8b 4d b0 ba 57 0c 00 00 48 c7 c6 40 67 61 c0 4c 89 f7 bb 01
> [41801.618918] RSP: 0018:ffffc18b40edfd80 EFLAGS: 00010282
> [41801.618921] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left
> [41801.618922] RAX: 0000000000000000 RBX: ffff9f8b7b2e3800 RCX: 0000000000000006
> [41801.618922] BTRFS info (device dm-35): forced readonly
> [41801.618924] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9f8bbeb17440
> [41801.618924] RBP: ffffc18b40edfdf8 R08: 00000000000005a6 R09: ffffffff979a4d90
> [41801.618925] R10: ffffffff97983fa8 R11: ffffc18b40edfbe8 R12: ffff9f8ad8b4ab60
> [41801.618926] R13: ffff9f867ddb53c0 R14: ffff9f8bbb0446e8 R15: 0000000000000000
> [41801.618927] FS: 0000000000000000(0000) GS:ffff9f8bbeb00000(0000) knlGS:0000000000000000
> [41801.618928] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [41801.618929] CR2: 00007f8ab728fc30 CR3: 000000049080a002 CR4: 00000000001606e0
> [41801.618930] Call Trace:
> [41801.618943] finish_ordered_fn+0x15/0x20 [btrfs]
> [41801.618957] normal_work_helper+0xbd/0x2f0 [btrfs]
> [41801.618959] ? __schedule+0x2eb/0x740
> [41801.618973] btrfs_endio_write_helper+0x12/0x20 [btrfs]
> [41801.618975] process_one_work+0x1ec/0x3a0
> [41801.618977] worker_thread+0x4d/0x400
> [41801.618979] kthread+0x104/0x140
> [41801.618980] ? process_one_work+0x3a0/0x3a0
> [41801.618982] ?
kthread_park+0x90/0x90
> [41801.618984] ret_from_fork+0x1f/0x40
> [41801.618985] ---[ end trace 35086266bf39c897 ]---
> [41801.618987] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left
>
> unmount/remount seems to make it work again, and it is full (df) yet has
> 3TB of unallocated space left. No clue what to do now, do I have to start
> over restoring again?
>
> Filesystem               Size  Used Avail Use% Mounted on
> /dev/mapper/xmnt-cold15   27T   23T     0 100% /cold1
>
> Overall:
>     Device size:            24216.49GiB
>     Device allocated:       20894.89GiB
>     Device unallocated:      3321.60GiB
>     Device missing:             0.00GiB
>     Used:                   20893.68GiB
>     Free (estimated):        3322.73GiB  (min: 1661.93GiB)
>     Data ratio:                    1.00
>     Metadata ratio:                2.00
>     Global reserve:             0.50GiB  (used: 0.00GiB)
>
> Data,single: Size:20839.01GiB, Used:20837.88GiB (99.99%)
>    /dev/mapper/xmnt-cold15  9288.01GiB
>    /dev/mapper/xmnt-cold12  7427.00GiB
>    /dev/mapper/xmnt-cold13  4124.00GiB
>
> Metadata,RAID1: Size:27.91GiB, Used:27.90GiB (99.97%)
>    /dev/mapper/xmnt-cold15    25.44GiB
>    /dev/mapper/xmnt-cold12    24.46GiB
>    /dev/mapper/xmnt-cold13     5.91GiB
>
> System,RAID1: Size:0.03GiB, Used:0.00GiB (6.69%)
>    /dev/mapper/xmnt-cold15     0.03GiB
>    /dev/mapper/xmnt-cold12     0.03GiB
>
> Unallocated:
>    /dev/mapper/xmnt-cold15     0.01GiB
>    /dev/mapper/xmnt-cold12     0.00GiB
>    /dev/mapper/xmnt-cold13  3321.59GiB
>
> Please, don't always chalk it up to hardware problems - btrfs is a
> wonderful filesystem for many reasons, and one reason I like it is that
> it can detect corruption much earlier than other filesystems. This
> feature alone makes it impossible for me to go back to xfs.
> However, I had corruption
> on ext4, xfs, reiserfs over the years, but btrfs *is* simply way buggier
> still than those - before btrfs (and even now) I kept md5sums of all
> archived files (~200TB), and xfs and ext4 _do_ a much better job at not
> corrupting data than btrfs on the same hardware - while I get filesystem
> problems about every half a year with btrfs, I had (silent) corruption
> problems maybe once every three to four years with xfs or ext4 (and not
> yet on the boxes I use currently).
>
> Please take these issues seriously - the trend of "it's a hardware
> problem" will not remove the "unstable" stigma from btrfs as long as btrfs
> is clearly buggier than other filesystems.
>
> Sorry to be so blunt, but I am a bit sensitive about always being told
> "it's probably a hardware problem" when it clearly affects practically any
> server and any laptop I administrate. I believe in btrfs, and detecting
> corruption early is a feature to me.
>
> I understand it can be frustrating to be confronted with hard-to-explain
> accidents, and I understand if you can't find the bug with the sparse info
> I gave, especially as the bug might not even be in btrfs. But keep in mind
> that the people who boldly/dumbly use btrfs in production and restore
> dozens of terabytes from backup every so many months are also frustrated
> if they present evidence from multiple machines and get told "it's
> probably a hardware problem".
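Qu's diagnosis (over-commit treating a 3-disk RAID1 with 2 full disks as if it were DUP) can be illustrated with a small back-of-the-envelope calculation. This is an illustrative sketch, not kernel code; the function name and the use of the unallocated figures from the thread are assumptions for demonstration only:

```python
def raid1_allocatable(unallocated):
    """Upper bound, in bytes, on new RAID1 chunk space.

    Each RAID1 chunk occupies equal space on two distinct devices,
    so greedily pairing the two freest devices gives the closed form
    min(total // 2, total - biggest).
    """
    total = sum(unallocated)
    biggest = max(unallocated)
    # If one device dominates, it can only mirror against all the others.
    return min(total // 2, total - biggest)

GiB = 1 << 30

# Roughly the situation in the thread: two devices nearly full,
# one with ~3321.59 GiB unallocated.
devices = [int(0.01 * GiB), 0, int(3321.59 * GiB)]

print(raid1_allocatable(devices))  # → 10737418 bytes, i.e. only ~0.01 GiB
# A naive "total unallocated / 2" view (how DUP would behave) instead
# suggests roughly 1.66 TiB is still allocatable:
print(sum(devices) // 2)
```

With the big device unable to find a mirror partner, RAID1 metadata allocation fails almost immediately even though terabytes are nominally unallocated, matching the ENOSPC seen above.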
* Re: btrfs dev del not transaction protected?
From: Marc Lehmann @ 2019-12-20 17:07 UTC
To: Qu Wenruo; +Cc: linux-btrfs

> Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs
> filesystem I restored to went into readonly mode with ENOSPC. Another
> hardware problem?

btrfs check gave me a possible hint:

Checking filesystem on /dev/mapper/xmnt-cold15
UUID: 6e035cfe-5b47-406a-998f-b8ee6567abbc
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[no other errors]

But mounting with clear_cache,space_cache=v2 didn't help; df still shows 0
bytes free, and "btrfs f us" still shows 3TB unallocated. I'll play around
with it more...

--
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\
* Re: btrfs dev del not transaction protected?
From: Qu Wenruo @ 2019-12-21 1:23 UTC
To: Marc Lehmann; +Cc: linux-btrfs

On 2019/12/21 1:07 AM, Marc Lehmann wrote:
>> Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs
>> filesystem I restored to went into readonly mode with ENOSPC. Another
>> hardware problem?
>
> btrfs check gave me a possible hint:
>
> Checking filesystem on /dev/mapper/xmnt-cold15
> UUID: 6e035cfe-5b47-406a-998f-b8ee6567abbc
> [1/7] checking root items
> [2/7] checking extents
> [3/7] checking free space tree
> cache and super generation don't match, space cache will be invalidated

That's common, and not a problem at all. Btrfs will rebuild the free space
tree.

> [4/7] checking fs roots
> [no other errors]
>
> But mounting with clear_cache,space_cache=v2 didn't help, df still shows 0
> bytes free, "btrfs f us" still shows 3TB unallocated. I'll play around with
> it more...

Df reporting 0 bytes available is a bug, caused by pinned-down space:
btrfs_statfs() can't cooperate with the latest over-commit behavior. This
happens when there are metadata operations queued. It's completely a
runtime false alert; nothing on-disk is incorrect.

I have a fix submitted for it:
https://patchwork.kernel.org/patch/11293419/

Thanks,
Qu
* Re: btrfs dev del not transaction protected?
From: Marc Lehmann @ 2019-12-20 17:20 UTC
To: Qu Wenruo; +Cc: linux-btrfs

>> Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs
>> filesystem I restored to went into readonly mode with ENOSPC. Another
>> hardware problem?
>
> But mounting with clear_cache,space_cache=v2 didn't help, df still shows 0
> bytes free, "btrfs f us" still shows 3TB unallocated. I'll play around with
> it more...

clear_cache didn't work, but btrfsck --clear-space-cache v1 and .. v2 did
work:

Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/xmnt-cold15   27T   23T  3.6T  87% /cold1

Which is rather insane, as I can't see how this filesystem was ever mounted
without -o space_cache=v2. Looking at btrfs f u again...

Metadata,single: Size:1.22GiB, Used:0.00B (0.00%)
   /dev/mapper/xmnt-cold13     1.22GiB

Metadata,RAID1: Size:27.92GiB, Used:27.90GiB (99.91%)
   /dev/mapper/xmnt-cold15    25.46GiB
   /dev/mapper/xmnt-cold12    24.46GiB
   /dev/mapper/xmnt-cold13     5.92GiB

System,RAID1: Size:32.00MiB, Used:2.16MiB (6.74%)
   /dev/mapper/xmnt-cold15    32.00MiB
   /dev/mapper/xmnt-cold12    32.00MiB

Unallocated:
   /dev/mapper/xmnt-cold15     1.00MiB
   /dev/mapper/xmnt-cold12     1.00MiB
   /dev/mapper/xmnt-cold13     3.24TiB

Did this happen because metadata is RAID1 and two of the disks were full,
and for some reason btrfsck freed up a tiny bit of space somewhere?

--
Marc Lehmann <schmorp@schmorp.de>
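The closing question can be checked against the "Unallocated" figures printed above. Metadata is RAID1, so every new metadata chunk needs unallocated space on two different devices, while single-profile data can use any one device. The sketch below is illustrative only (the helpers are not a btrfs API, and the per-device byte values are rounded from the thread's output):

```python
MiB = 1 << 20
GiB = 1 << 30
TiB = 1 << 40

# Unallocated space per device, from the "btrfs f u" output above.
unallocated = {
    "xmnt-cold15": 1 * MiB,
    "xmnt-cold12": 1 * MiB,
    "xmnt-cold13": int(3.24 * TiB),
}

def single_allocatable(unalloc):
    """single profile: one copy, so any device may hold a new chunk."""
    return sum(unalloc.values())

def raid1_allocatable(unalloc):
    """RAID1: each chunk takes equal space on two distinct devices."""
    total = sum(unalloc.values())
    biggest = max(unalloc.values())
    return min(total // 2, total - biggest)

print(single_allocatable(unallocated) // GiB)  # thousands of GiB for single data
print(raid1_allocatable(unallocated) // MiB)   # → 2, i.e. ~2 MiB for RAID1 metadata
```

So yes: with cold15 and cold12 down to about 1 MiB unallocated each, essentially no new RAID1 metadata chunk could be paired, even though cold13 alone had 3.24 TiB free for single-profile data chunks.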
Thread overview: 18+ messages
2019-12-20  4:05 btrfs dev del not transaction protected? Marc Lehmann
2019-12-20  5:24 ` Qu Wenruo
2019-12-20  6:37 ` Marc Lehmann
2019-12-20  7:10 ` Qu Wenruo
2019-12-20 13:27 ` Marc Lehmann
2019-12-20 13:41 ` Qu Wenruo
2019-12-20 16:53 ` Marc Lehmann
2019-12-20 17:24 ` Remi Gauvin
2019-12-20 17:50 ` Marc Lehmann
2019-12-20 18:00 ` Marc Lehmann
2019-12-20 18:28 ` Eli V
2019-12-20 20:24 ` Chris Murphy
2019-12-20 23:30 ` Marc Lehmann
2019-12-21 20:06 ` Zygo Blaxell
2019-12-21  1:32 ` Qu Wenruo
2019-12-20 17:07 ` Marc Lehmann
2019-12-21  1:23 ` Qu Wenruo
2019-12-20 17:20 ` Marc Lehmann