* Problem with file system
From: Fred Van Andel @ 2017-04-24 15:27 UTC
To: linux-btrfs

I have a btrfs file system with a few thousand snapshots. When I
attempted to delete 20 or so of them, the problems started.

The disks are being read, but except for the first few minutes there
are no writes.

Memory usage keeps growing until all the memory (24 GB) is used in a
few hours. Eventually the system crashes with out-of-memory errors.

The CPU load is low (<5%), but iowait is around 30 to 50%.

The drives are mounted, but any process that attempts to access them
just hangs, so I cannot access any data on the drives.

Smartctl does not show any issues with the drives.

The problem restarts after a reboot, once you mount the drives.

I tried to zero the log, hoping the problem wouldn't restart after a
reboot, but that didn't work.

I am assuming that the attempt to remove the snapshots caused this
problem. How do I interrupt the process so I can access the
filesystem again?

# uname -a
Linux Backup 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux

# btrfs --version
btrfs-progs v4.9.1

# btrfs fi show
Label: none  uuid: 79ba7374-bf77-4868-bb64-656ff5736c44
        Total devices 6 FS bytes used 5.65TiB
        devid    1 size 1.82TiB used 1.29TiB path /dev/sdb
        devid    2 size 1.82TiB used 1.29TiB path /dev/sdc
        devid    3 size 1.82TiB used 1.29TiB path /dev/sdd
        devid    4 size 1.82TiB used 1.29TiB path /dev/sde
        devid    5 size 3.64TiB used 3.11TiB path /dev/sdf
        devid    6 size 3.64TiB used 3.11TiB path /dev/sdg

# btrfs fi df /pubroot
Data, RAID1: total=5.58TiB, used=5.58TiB
System, RAID1: total=32.00MiB, used=828.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=104.00GiB, used=70.64GiB
GlobalReserve, single: total=512.00MiB, used=28.51MiB

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Problem with file system
From: Chris Murphy @ 2017-04-24 17:02 UTC
To: Fred Van Andel; +Cc: Btrfs BTRFS

On Mon, Apr 24, 2017 at 9:27 AM, Fred Van Andel <vanandel@gmail.com> wrote:
> I have a btrfs file system with a few thousand snapshots. When I
> attempted to delete 20 or so of them the problems started.
>
> The disks are being read but except for the first few minutes there
> are no writes.
>
> Memory usage keeps growing until all the memory (24 Gb) is used in a
> few hours. Eventually the system will crash with out of memory errors.

Boot with this boot parameter:

    log_buf_len=1M

I find it easier to log in remotely from another computer to capture
problems, in case of a crash where I can't save things locally. So on
the remote computer, use:

    journalctl -kf -o short-monotonic

Either on the first computer, or from an additional ssh connection from
the second:

    echo 1 > /proc/sys/kernel/sysrq
    btrfs fi show   # you need the UUID for the volume you're going to
                    # mount; best to have it in advance

Mount the file system normally, and once it's starting to have the
problem (I guess it happens pretty quickly?):

    echo t > /proc/sysrq-trigger
    grep . -IR /sys/fs/btrfs/UUID/allocation/

Paste in the UUID from fi show. If the computer is hanging due to
running out of memory, each of these commands can take a while to
complete. So it's best to have them all ready to go before you mount
the file system and the problem starts happening.

Best if you can issue the commands more than once as the problem gets
worse, if you can keep them all organized and labeled. Then attach them
(rather than pasting them into the message).
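[Editorial note: the capture steps above can be kept ready as a small
helper before mounting. This is a sketch under the thread's assumptions;
the function name and output file naming are illustrative, not from the
original messages.]

```shell
# Sketch of the capture sequence described above, wrapped in a function so
# it can sit ready in a root shell before the filesystem is mounted. The
# UUID argument is whatever 'btrfs fi show' reports for the volume; the
# output file name is an assumption for illustration.
capture_btrfs_state() {
    uuid=$1
    # Dump all task states (including stuck btrfs threads) to the kernel
    # log, where the remote 'journalctl -kf' session will pick them up.
    echo t > /proc/sysrq-trigger
    # Snapshot the per-filesystem allocation counters, timestamped so
    # repeated captures stay organized and labeled.
    grep . -IR "/sys/fs/btrfs/$uuid/allocation/" > "alloc-$(date +%s).txt"
}

# Intended use (as root; shown but not executed here):
#   echo 1 > /proc/sys/kernel/sysrq     # enable sysrq before mounting
#   mount /dev/sdb /pubroot             # mount normally; problem starts soon
#   capture_btrfs_state 79ba7374-bf77-4868-bb64-656ff5736c44
#   ...repeat the capture as the problem worsens, then attach the files.
```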
> I tried to zero the log hoping it wouldn't restart after a reboot but
> that didn't work

Yeah, don't just start randomly hitting the fs with a hammer like
zeroing the log tree. That's for a specific problem, and this isn't it.

> I am assuming that the attempt to remove the snapshots caused this
> problem. How do I interrupt the process so I can access the
> filesystem again?

Snapshot creation is essentially free. Snapshot removal is expensive.
There's no way to answer your questions because your email doesn't even
include a call trace. So a developer will need at least the call trace,
but there might be some other useful information in a sysrq+t, as well
as in the allocation states.

> # btrfs fi df /pubroot
> Data, RAID1: total=5.58TiB, used=5.58TiB
> System, RAID1: total=32.00MiB, used=828.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=104.00GiB, used=70.64GiB
> GlobalReserve, single: total=512.00MiB, used=28.51MiB

Later, after this problem is solved, you'll want to get rid of that
single system chunk. It isn't being used, but it might cause a problem
in a device failure:

    sudo btrfs balance start -mconvert=raid1,soft <mp>

--
Chris Murphy
* Re: Problem with file system
From: Duncan @ 2017-04-25 4:05 UTC
To: linux-btrfs

Chris Murphy posted on Mon, 24 Apr 2017 11:02:02 -0600 as excerpted:

> On Mon, Apr 24, 2017 at 9:27 AM, Fred Van Andel <vanandel@gmail.com>
> wrote:
>> I have a btrfs file system with a few thousand snapshots. When I
>> attempted to delete 20 or so of them the problems started.
>>
>> Memory usage keeps growing until all the memory (24 Gb) is used in a
>> few hours. Eventually the system will crash with out of memory errors.

In addition to what CMurphy and QW suggested (both valid), I have a
couple of other suggestions/pointers. They won't help you get out of
the current situation, but they might help you stay out of it in the
future.

1) A "few thousand snapshots", but no mention of how many subvolumes
those snapshots are of, or how many per subvolume.

As CMurphy says, but I'll expand on it here: taking a snapshot is
nearly free, just a bit of metadata to write, because btrfs is
COW-based and all a snapshot does is lock down a copy of everything in
the subvolume as it currently exists, which the filesystem is already
tracking. Removal, however, is expensive, because btrfs must go through
and check everything to see whether it can actually be deleted (no
other snapshots referencing the block) or not (something else still
referencing it).

Obviously, this checking gets much more complicated the more snapshots
of the same subvolume exist. IOW, it's a scaling issue.

The same scaling issue applies to various other btrfs maintenance
tasks, including btrfs check (aka btrfsck) and btrfs balance (and thus
btrfs device remove, which does an implicit balance). Both of these
take *far* longer if the number of snapshots per subvolume is allowed
to get out of hand.
Due to this scaling issue, the recommendation is no more than 200-300
snapshots per subvolume, and keeping it down to 50-100 max is even
better, if you can do it reasonably. That helps keep scaling issues,
and thus the time for any necessary maintenance, manageable.
Otherwise... well, we've had reports of device removes (aka balances)
that would take /months/ to finish at the rate they were going.
Obviously, well before it gets to that point it's far faster to simply
blow away the filesystem and restore from backups.[1]

It follows that if you have an automated system doing the snapshots,
it's equally important to have an automated system doing snapshot
thinning as well, keeping the number of snapshots per subvolume within
manageable scaling limits.

So if that's "a few thousand snapshots", I hope that's across (at
least) a double-digit number of subvolumes, keeping the number of
snapshots per subvolume under 300, and under 100 if your snapshot
rotation schedule will allow it.

2) As Qu suggests, btrfs quotas increase the scaling issues
significantly. Additionally, there have been and continue to be
accuracy issues with certain quota corner cases, so they can't be
entirely relied upon anyway.

Generally, people using btrfs quotas fall into three categories:

a) Those who know the problems and are working with Qu and the other
devs to report and trace issues so they will eventually work well,
ideally with less of a scaling issue as well. Bless them! Keep it
up! =:^)

b) Those who have a use-case that really depends on quotas. Because
btrfs quotas are buggy and not entirely reliable now, not to mention
the scaling issues, these users are almost certainly better served
using more mature filesystems with mature and dependable quotas.

c) Those who don't really care about quotas specifically, and are just
using them because it's a nice feature. This likely includes some who
are simply running distros that enable quotas.
My recommendation for these users is to simply turn btrfs quotas off
for now, as they're presently in general more trouble than they're
worth, due to both the accuracy and scaling issues. Hopefully quotas
will be stable in a couple of years, and with developer and tester hard
work perhaps the scaling issues will have been reduced as well, and
that recommendation can change. But for now, if you don't really need
them, leaving quotas off will significantly reduce scaling issues. And
if you do need them, they're not yet reliable on btrfs anyway, so
you're better off using something more mature where they actually work.

3) Similarly (tho unlikely to apply in your case), beware of the
scaling implications of the various reflink-based copying and dedup
utilities, which work via the same copy-on-write and reflinking
technology that's behind snapshotting. Tho snapshotting is effectively
reflinking /everything/ in the subvolume, so the scaling issues
compound much faster there than they will with a more trivial level of
reflinking.

Of course, when it comes to dedup, a more trivial level of reflinking
means less benefit from doing the dedup in the first place, so there's
a limit to the effectiveness of dedup before it starts having the same
scaling issues that snapshots do. But if you have exactly two copies of
/everything/ in a subvolume, and dedup it down to a single copy, that's
the same effect as a single snapshot, so it does take a lot of
reflink-based deduping to get to the same level as a couple hundred
snapshots. It's something to think about, though, if you're planning to
dedup say 1000 copies of a bunch of stuff by making them all reflinks
to the same single copy.

Bottom line: if those "few thousand snapshots" are all of the same
subvol or two, /especially/ if you're running btrfs quotas on top of
that... that's very likely your problem right there.
Keep your number of snapshots per subvolume under 300, and turn off
btrfs quotas, and you'll very likely find the problem disappears.

---
[1] Backups: Sysadmin's first rule of backups, simple form: If you
don't have a backup, you are by lack thereof defining the data at risk
as worth less than the time/hassle/resources to do that backup.
Because if it was worth more than the time/hassle/resources necessary
for the backup, by definition, it would /be/ backed up.

It's your choice to make, but no redefining after the fact. If you
lost the primary copy due to whatever reason and didn't have that
backup, you simply defined the data as not worth enough to have a
backup, and get to be happy because you saved what your actions, or
lack thereof, defined as of most value to you: the
time/hassle/resources you would have otherwise spent doing that
backup.

Sysadmin's second rule of backups: A backup isn't complete until it
has been tested restorable. Until then, it's simply a would-be backup,
because you don't actually know if it worked or not.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
* Re: Problem with file system
From: Qu Wenruo @ 2017-04-25 0:26 UTC
To: Fred Van Andel, linux-btrfs

At 04/24/2017 11:27 PM, Fred Van Andel wrote:
> I have a btrfs file system with a few thousand snapshots. When I
> attempted to delete 20 or so of them the problems started.
>
> The disks are being read but except for the first few minutes there
> are no writes.
>
> Memory usage keeps growing until all the memory (24 Gb) is used in a
> few hours. Eventually the system will crash with out of memory errors.

Are you using qgroup/quota?

IIRC, qgroup handling for subvolume deletion will cause a full subtree
rescan, which can consume tons of memory.

Thanks,
Qu

> The CPU load is low (<5%) but iowait is around 30 to 50%
>
> The drives are mounted but any process that attempts to access them
> will just hang so I cannot access any data on the drives.
>
> Smartctl does not show any issues with the drives.
>
> The problem restarts after a reboot once you mount the drives.
>
> I tried to zero the log hoping it wouldn't restart after a reboot but
> that didn't work
>
> I am assuming that the attempt to remove the snapshots caused this
> problem. How do I interrupt the process so I can access the
> filesystem again?
>
> # uname -a
> Linux Backup 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC
> 2017 x86_64 x86_64 x86_64 GNU/Linux
>
> # btrfs --version
> btrfs-progs v4.9.1
>
> # btrfs fi show
> Label: none  uuid: 79ba7374-bf77-4868-bb64-656ff5736c44
>         Total devices 6 FS bytes used 5.65TiB
>         devid    1 size 1.82TiB used 1.29TiB path /dev/sdb
>         devid    2 size 1.82TiB used 1.29TiB path /dev/sdc
>         devid    3 size 1.82TiB used 1.29TiB path /dev/sdd
>         devid    4 size 1.82TiB used 1.29TiB path /dev/sde
>         devid    5 size 3.64TiB used 3.11TiB path /dev/sdf
>         devid    6 size 3.64TiB used 3.11TiB path /dev/sdg
>
> # btrfs fi df /pubroot
> Data, RAID1: total=5.58TiB, used=5.58TiB
> System, RAID1: total=32.00MiB, used=828.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=104.00GiB, used=70.64GiB
> GlobalReserve, single: total=512.00MiB, used=28.51MiB
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Problem with file system
From: Marat Khalili @ 2017-04-25 5:33 UTC
To: linux-btrfs

On 25/04/17 03:26, Qu Wenruo wrote:
> IIRC qgroup for subvolume deletion will cause full subtree rescan
> which can cause tons of memory.

Could it be this bad, 24GB of RAM for a 5.6TB volume? What does it even
use this absurd amount of memory for? Is it swappable?

I haven't read about RAM limitations for running qgroups before, only
about CPU load (which, importantly, only requires patience and does not
crash servers).

--
With Best Regards,
Marat Khalili
* Re: Problem with file system
From: Qu Wenruo @ 2017-04-25 6:13 UTC
To: Marat Khalili, linux-btrfs

At 04/25/2017 01:33 PM, Marat Khalili wrote:
> On 25/04/17 03:26, Qu Wenruo wrote:
>> IIRC qgroup for subvolume deletion will cause full subtree rescan
>> which can cause tons of memory.
> Could it be this bad, 24GB of RAM for a 5.6TB volume? What does it even
> use this absurd amount of memory for? Is it swappable?

The memory is used for two things:

1) Recording which extents need to be traced.
   These records are freed at transaction commit. We need a better idea
   for handling them. Maybe create a new tree so that we can write them
   to disk? Or another qgroup rework?

2) Recording the current roots referring to each extent.
   Only after v4.10, IIRC.

The allocated memory is not swappable.

How much memory it uses depends on the number of extents of that
subvolume. It's 56 bytes per extent, for both tree blocks and data
extents.

To use up 16G of RAM, that's about 300 million extents. For a 5.6T
volume, the implied average extent size is about 20K. It seems that
your volume is highly fragmented, though.

If that's the problem, disabling qgroup may be the best workaround.

Thanks,
Qu

> Haven't read about RAM limitations for running qgroups before, only
> about CPU load (which importantly only requires patience, does not crash
> servers).
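[Editorial note: Qu's figures can be checked with quick arithmetic. The
56-bytes-per-extent figure is from his message; the RAM and volume sizes
are the approximations used in the thread, not measurements.]

```shell
# Back-of-envelope check: 56 bytes tracked per extent, 16 GiB of RAM,
# and a ~5.6 TiB volume (taken here as 5600 GiB for simplicity).
BYTES_PER_EXTENT=56
RAM_BYTES=$((16 * 1024 * 1024 * 1024))        # 16 GiB
EXTENTS=$((RAM_BYTES / BYTES_PER_EXTENT))     # extents needed to fill it
echo "extents to fill 16 GiB: $EXTENTS"       # ~307 million

FS_BYTES=$((5600 * 1024 * 1024 * 1024))       # ~5.6 TiB as 5600 GiB
AVG_EXTENT=$((FS_BYTES / EXTENTS))            # implied average extent size
echo "implied average extent: $AVG_EXTENT bytes"  # ~20 KB
```

This matches the "about 300 million extents" and "average extent size
about 20K" figures in the message above.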
* Re: Problem with file system
From: Fred Van Andel @ 2017-04-26 16:43 UTC
To: linux-btrfs

Yes, I was running qgroups.
Yes, the filesystem is highly fragmented.
Yes, I have way too many snapshots.

I think it's clear that the problem is on my end. I simply placed too
many demands on the filesystem without fully understanding the
implications. Now I have to deal with the consequences.

It was decided today to replace this computer due to its age. I will
use the recover command to pull the needed data off this system and
onto the new one.

Thank you everyone for your assistance and the education.

Fred
* Re: Problem with file system
From: Dave @ 2017-10-30 3:31 UTC
To: Linux fs Btrfs; +Cc: Fred Van Andel

This is a very helpful thread. I want to share an interesting related
story.

We have a machine with 4 btrfs volumes and 4 Snapper configs. I
recently discovered that Snapper timeline cleanup had been turned off
for 3 of those volumes. In the Snapper configs I found this setting:

    TIMELINE_CLEANUP="no"

Normally that would be set to "yes". So I corrected the issue and set
it to "yes" for the 3 volumes where it had not been set correctly. I
suppose it was turned off temporarily and then somebody forgot to turn
it back on.

What I did not know, and did not realize was a critical piece of
information, was how long timeline cleanup had been turned off and how
many snapshots had accumulated on each volume in that time.

I naively re-enabled Snapper timeline cleanup. The instant I started
the snapper-cleanup.service, the system was hosed. The ssh session
became unresponsive, no other ssh sessions could be established, and
it was impossible to log into the system at the console.

My subsequent investigation showed that the root filesystem volume had
accumulated more than 3000 btrfs snapshots. The two other affected
volumes also had very large numbers of snapshots.

Deleting a single snapshot in that situation would likely require
hours. (I set up a test, but I ran out of patience before I was able
to delete even a single snapshot.) My guess is that if we had been
patient enough to wait for all the snapshots to be deleted, the
process would have finished in some number of months (or maybe a
year).

We did not know most of this at the time, so we did what we usually do
when a system becomes totally unresponsive: we did a hard reset. Of
course, we could never get the system to boot up again.
Since we had backups, the easiest option became to replace that system
-- not unlike what the OP decided to do. In our case, the hardware was
not old, so we simply reformatted the drives and reinstalled Linux.

That's a drastic consequence of changing TIMELINE_CLEANUP="no" to
TIMELINE_CLEANUP="yes" in the snapper config.

It's all part of the process of gaining critical experience with
BTRFS. Whether or not BTRFS is ready for production use is (it seems
to me) mostly a question of how knowledgeable and experienced the
people administering it are.

In the various online discussions on this topic, all the focus is on
whether or not BTRFS itself is production-ready. At the current
maturity level of BTRFS, I think that's the wrong focus. The right
focus is on how production-ready the admin person or team is (with
respect to their BTRFS knowledge and experience). When a filesystem
has been around for decades, most of the critical admin issues become
fairly common knowledge, fairly widely known and easy to find. When a
filesystem is newer, far fewer people understand the gotchas. Also, in
older or widely used filesystems, when someone hits a gotcha, the
response isn't "that filesystem is not ready for production". Instead
the response is, "you should have known not to do that."

On Wed, Apr 26, 2017 at 12:43 PM, Fred Van Andel <vanandel@gmail.com> wrote:
> Yes I was running qgroups.
> Yes the filesystem is highly fragmented.
> Yes I have way too many snapshots.
>
> I think it's clear that the problem is on my end. I simply placed too
> many demands on the filesystem without fully understanding the
> implications. Now I have to deal with the consequences.
>
> It was decided today to replace this computer due to its age. I will
> use the recover command to pull the needed data off this system and
> onto the new one.
>
> Thank you everyone for your assistance and the education.
>
> Fred
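[Editorial note: for reference, a snapper config with timeline cleanup
enabled, of the kind Dave describes, looks roughly like the excerpt
below. The variable names are standard snapper settings; the specific
limit values are illustrative assumptions chosen to keep the snapshot
count within the per-subvolume guidance discussed earlier in the
thread, not recommendations from the thread itself.]

```shell
# Excerpt of a snapper config (e.g. /etc/snapper/configs/root).
# Timeline snapshots are created hourly; cleanup prunes them down to
# the retention limits below, so counts stay in the tens, not thousands.
TIMELINE_CREATE="yes"
TIMELINE_CLEANUP="yes"
TIMELINE_LIMIT_HOURLY="10"
TIMELINE_LIMIT_DAILY="10"
TIMELINE_LIMIT_WEEKLY="4"
TIMELINE_LIMIT_MONTHLY="3"
TIMELINE_LIMIT_YEARLY="0"
```

With limits like these, the total timeline snapshot count per subvolume
stays well under the few-hundred threshold where deletion and balance
times start to degrade.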
* Re: Problem with file system
From: Chris Murphy @ 2017-10-30 21:37 UTC
To: Dave; +Cc: Linux fs Btrfs, Fred Van Andel

On Mon, Oct 30, 2017 at 4:31 AM, Dave <davestechshop@gmail.com> wrote:
> This is a very helpful thread. I want to share an interesting related story.
>
> We have a machine with 4 btrfs volumes and 4 Snapper configs. I
> recently discovered that Snapper timeline cleanup been turned off for
> 3 of those volumes. In the Snapper configs I found this setting:
>
> TIMELINE_CLEANUP="no"
>
[...]
>
> I naively re-enabled Snapper timeline cleanup. The instant I started
> the snapper-cleanup.service the system was hosed. The ssh session
> became unresponsive, no other ssh sessions could be established and it
> was impossible to log into the system at the console.
>
> My subsequent investigation showed that the root filesystem volume
> accumulated more than 3000 btrfs snapshots. The two other affected
> volumes also had very large numbers of snapshots.
>
> Deleting a single snapshot in that situation would likely require
> hours. (I set up a test, but I ran out of patience before I was able
> to delete even a single snapshot.) My guess is that if we had been
> patient enough to wait for all the snapshots to be deleted, the
> process would have finished in some number of months (or maybe a
> year).
>
> We did not know most of this at the time, so we did what we usually do
> when a system becomes totally unresponsive -- we did a hard reset. Of
> course, we could never get the system to boot up again.
>
> That's a drastic consequence of changing TIMELINE_CLEANUP="no" to
> TIMELINE_CLEANUP="yes" in the snapper config.

Without a complete autopsy on the file system, it's unclear whether it
was fixable with the available tools, why it wouldn't mount normally,
or whether it could have done its own autorecovery with one of the
available backup roots. But offhand it sounds like the hardware was
sabotaging the expected write ordering. How to test a given hardware
setup for that, I think, is really overdue. It affects literally every
file system and every piece of Linux storage technology.

It kinda sounds to me like something other than the supers is being
overwritten too soon, and that's why none of the backup roots can find
a valid root tree: all four possible root trees either haven't actually
been written yet (still), or they've been overwritten, even though the
super is updated. But again, this is speculation; we don't actually
know why your system was no longer mountable.

> It's all part of the process of gaining critical experience with
> BTRFS. Whether or not BTRFS is ready for production use is (it seems
> to me) mostly a question of how knowledgeable and experienced are the
> people administering it.

"Btrfs is a copy on write filesystem for Linux aimed at implementing
advanced features while focusing on fault tolerance, repair and easy
administration."

That has been the descriptive text at
Documentation/filesystems/btrfs.txt for some time now.
> In the various online discussions on this topic, all the focus is on
> whether or not BTRFS itself is production-ready. At the current
> maturity level of BTRFS, I think that's the wrong focus. The right
> focus is on how production-ready is the admin person or team (with
> respect to their BTRFS knowledge and experience). When a filesystem
> has been around for decades, most of the critical admin issues become
> fairly common knowledge, fairly widely known and easy to find. When a
> filesystem is newer, far fewer people understand the gotchas. Also, in
> older or widely used filesystems, when someone hits a gotcha, the
> response isn't "that filesystem is not ready for production". Instead
> the response is, "you should have known not to do that."

That is not a general purpose file system. It's a file system for
admins who understand where the bodies are buried.

--
Chris Murphy
* Re: Problem with file system
From: Marat Khalili @ 2017-10-31 5:57 UTC
To: Chris Murphy, Dave; +Cc: Linux fs Btrfs, Fred Van Andel

On 31/10/17 00:37, Chris Murphy wrote:
> But off hand it sounds like hardware was sabotaging the expected write
> ordering. How to test a given hardware setup for that, I think, is
> really overdue. It affects literally every file system, and Linux
> storage technology.
>
> It kinda sounds like to me something other than supers is being
> overwritten too soon, and that's why it's possible for none of the
> backup roots to find a valid root tree, because all four possible root
> trees either haven't actually been written yet (still) or they've been
> overwritten, even though the super is updated. But again, it's
> speculation, we don't actually know why your system was no longer
> mountable.

Just a detached view: I know hardware should respect ordering/barriers
and such, but how hard would it really be to avoid overwriting at
least one complete metadata tree for half an hour (even better, yet
another one for a day)? Just metadata, not data extents.

--
With Best Regards,
Marat Khalili
* Re: Problem with file system
From: Austin S. Hemmelgarn @ 2017-10-31 11:28 UTC
To: Marat Khalili, Chris Murphy, Dave; +Cc: Linux fs Btrfs, Fred Van Andel

On 2017-10-31 01:57, Marat Khalili wrote:
> On 31/10/17 00:37, Chris Murphy wrote:
>> But off hand it sounds like hardware was sabotaging the expected write
>> ordering. [...]
> Just a detached view: I know hardware should respect ordering/barriers
> and such, but how hard is it really to avoid overwriting at least one
> complete metadata tree for half an hour (even better, yet another one
> for a day)? Just metadata, not data extents.

If you're running on an SSD (or thinly provisioned storage, or
something else which supports discards) and have the 'discard' mount
option enabled, then there is no backup metadata tree (this issue was
mentioned on the list a while ago, but nobody ever replied), because
it has already been discarded. This is ideally something which should
be addressed (we need some sort of discard queue for handling in-line
discards), but it's not easy to address.
Otherwise, it becomes a question of space usage on the filesystem, and
this is just another reason to keep some extra slack space on the FS
(though that doesn't help _much_, it does help). This, in theory,
could be addressed, but it probably can't be applied across mounts of
a filesystem without an on-disk format change.
* Re: Problem with file system
From: Kai Krakow @ 2017-11-03 7:42 UTC
To: linux-btrfs

On Tue, 31 Oct 2017 07:28:58 -0400, "Austin S. Hemmelgarn"
<ahferroin7@gmail.com> wrote:

> On 2017-10-31 01:57, Marat Khalili wrote:
>> Just a detached view: I know hardware should respect
>> ordering/barriers and such, but how hard is it really to avoid
>> overwriting at least one complete metadata tree for half an hour
>> (even better, yet another one for a day)? Just metadata, not data
>> extents.
> If you're running on an SSD (or thinly provisioned storage, or
> something else which supports discards) and have the 'discard' mount
> option enabled, then there is no backup metadata tree (this issue was
> mentioned on the list a while ago, but nobody ever replied), because
> it's already been discarded. This is ideally something which should
> be addressed (we need some sort of discard queue for handling in-line
> discards), but it's not easy to address.
> > Otherwise, it becomes a question of space usage on the filesystem, > and this is just another reason to keep some extra slack space on the > FS (though that doesn't help _much_, it does help). This, in theory, > could be addressed, but it probably can't be applied across mounts of > a filesystem without an on-disk format change. Well, maybe inline discard is working at the wrong level. It should kick in when the reference through any of the backup roots is dropped, not when the current instance is dropped. Without knowledge of the internals, I guess discards could be added to a queue within a new tree in btrfs, and only added to that queue when dropped from the last backup root referencing it. But this will probably add some bad performance spikes. I wonder how a regular fstrim run through cron applies to this problem? -- Regards, Kai Replies to list-only preferred. ^ permalink raw reply [flat|nested] 33+ messages in thread
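The cron-based fstrim Kai mentions would look something like the fragment below. The mount point, schedule, and file name are examples for illustration, not taken from the thread:

```shell
# /etc/cron.d/btrfs-trim -- hypothetical weekly batched trim, used instead
# of the 'discard' mount option; adjust the path and schedule to the system.
# m h dom mon dow user command
0 3 * * 0 root /sbin/fstrim -v /mnt/btrfs
```

On systemd systems, util-linux also ships an fstrim.timer unit that provides the same weekly batching. As the follow-ups in the thread note, this trades the continuous exposure of inline discard for a short window around each trim run.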
* Re: Problem with file system 2017-11-03 7:42 ` Kai Krakow @ 2017-11-03 11:33 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 33+ messages in thread From: Austin S. Hemmelgarn @ 2017-11-03 11:33 UTC (permalink / raw) To: linux-btrfs On 2017-11-03 03:42, Kai Krakow wrote: > Am Tue, 31 Oct 2017 07:28:58 -0400 > schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>: > >> On 2017-10-31 01:57, Marat Khalili wrote: >>> On 31/10/17 00:37, Chris Murphy wrote: >>>> But off hand it sounds like hardware was sabotaging the expected >>>> write ordering. How to test a given hardware setup for that, I >>>> think, is really overdue. It affects literally every file system, >>>> and Linux storage technology. >>>> >>>> It kinda sounds like to me something other than supers is being >>>> overwritten too soon, and that's why it's possible for none of the >>>> backup roots to find a valid root tree, because all four possible >>>> root trees either haven't actually been written yet (still) or >>>> they've been overwritten, even though the super is updated. But >>>> again, it's speculation, we don't actually know why your system >>>> was no longer mountable. >>> Just a detached view: I know hardware should respect >>> ordering/barriers and such, but how hard is it really to avoid >>> overwriting at least one complete metadata tree for half an hour >>> (even better, yet another one for a day)? Just metadata, not data >>> extents. >> If you're running on an SSD (or thinly provisioned storage, or >> something else which supports discards) and have the 'discard' mount >> option enabled, then there is no backup metadata tree (this issue was >> mentioned on the list a while ago, but nobody ever replied), because >> it's already been discarded. This is ideally something which should >> be addressed (we need some sort of discard queue for handling in-line >> discards), but it's not easy to address. 
>> >> Otherwise, it becomes a question of space usage on the filesystem, >> and this is just another reason to keep some extra slack space on the >> FS (though that doesn't help _much_, it does help). This, in theory, >> could be addressed, but it probably can't be applied across mounts of >> a filesystem without an on-disk format change. > > Well, maybe inline discard is working at the wrong level. It should > kick in when the reference through any of the backup roots is dropped, > not when the current instance is dropped. Indeed. > > Without knowledge of the internals, I guess discards could be added to > a queue within a new tree in btrfs, and only added to that queue when > dropped from the last backup root referencing it. But this will > probably add some bad performance spikes. Inline discards can already cause bad performance spikes. > > I wonder how a regular fstrim run through cron applies to this problem? You functionally lose any old (freed) trees, they just get kept around until you call fstrim. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Problem with file system 2017-10-31 11:28 ` Austin S. Hemmelgarn 2017-11-03 7:42 ` Kai Krakow @ 2017-11-03 22:03 ` Chris Murphy 2017-11-04 4:46 ` Adam Borowski 1 sibling, 1 reply; 33+ messages in thread From: Chris Murphy @ 2017-11-03 22:03 UTC (permalink / raw) To: Austin S. Hemmelgarn Cc: Marat Khalili, Chris Murphy, Dave, Linux fs Btrfs, Fred Van Andel On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > If you're running on an SSD (or thinly provisioned storage, or something > else which supports discards) and have the 'discard' mount option enabled, > then there is no backup metadata tree (this issue was mentioned on the list > a while ago, but nobody ever replied), This is a really good point. I've been running discard mount option for some time now without problems, in a laptop with Samsung Electronics Co Ltd NVMe SSD Controller SM951/PM951. However, just trying btrfs-debug-tree -b on a specific block address for any of the backup root trees listed in the super, only the current one returns a valid result. All others fail with checksum errors. And even the good one fails with checksum errors within seconds as a new tree is created, the super updated, and Btrfs considers the old root tree disposable and subject to discard. So absolutely if I were to have a problem, probably no rollback for me. This seems to totally obviate a fundamental part of Btrfs design. because it's already been discarded. > This is ideally something which should be addressed (we need some sort of > discard queue for handling in-line discards), but it's not easy to address. Discard data extents, don't discard metadata extents? Or put them on a substantial delay. -- Chris Murphy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Problem with file system 2017-11-03 22:03 ` Chris Murphy @ 2017-11-04 4:46 ` Adam Borowski 2017-11-04 12:00 ` Marat Khalili 2017-11-04 17:14 ` Chris Murphy 0 siblings, 2 replies; 33+ messages in thread From: Adam Borowski @ 2017-11-04 4:46 UTC (permalink / raw) To: Chris Murphy Cc: Austin S. Hemmelgarn, Marat Khalili, Dave, Linux fs Btrfs, Fred Van Andel On Fri, Nov 03, 2017 at 04:03:44PM -0600, Chris Murphy wrote: > On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: > > > If you're running on an SSD (or thinly provisioned storage, or something > > else which supports discards) and have the 'discard' mount option enabled, > > then there is no backup metadata tree (this issue was mentioned on the list > > a while ago, but nobody ever replied), > > > This is a really good point. I've been running discard mount option > for some time now without problems, in a laptop with Samsung > Electronics Co Ltd NVMe SSD Controller SM951/PM951. > > However, just trying btrfs-debug-tree -b on a specific block address > for any of the backup root trees listed in the super, only the current > one returns a valid result. All others fail with checksum errors. And > even the good one fails with checksum errors within seconds as a new > tree is created, the super updated, and Btrfs considers the old root > tree disposable and subject to discard. > > So absolutely if I were to have a problem, probably no rollback for > me. This seems to totally obviate a fundamental part of Btrfs design. How is this an issue? Discard is issued only once we're positive there's no reference to the freed blocks anywhere. At that point, they're also open for reuse, thus they can be arbitrarily scribbled upon. Unless your hardware is seriously broken (such as lying about barriers, which is nearly-guaranteed data loss on btrfs anyway), there's no way the filesystem will ever reference such blocks. 
The corpses of old trees that are left lying around with no discard can at most be used for manual forensics, but whether a given block will have been overwritten or not is a matter of pure luck. For rollbacks, there are snapshots. Once a transaction has been fully committed, the old version is considered gone. > because it's already been discarded. > > This is ideally something which should be addressed (we need some sort of > > discard queue for handling in-line discards), but it's not easy to address. > > Discard data extents, don't discard metadata extents? Or put them on a > substantial delay. Why would you special-case metadata? Metadata that points to overwritten or discarded blocks is of no use either. Meow! -- ⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. ⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift ⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters ⠈⠳⣄⠀⠀⠀⠀ relevant to duties], shall be punished by death by shooting. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Problem with file system 2017-11-04 4:46 ` Adam Borowski @ 2017-11-04 12:00 ` Marat Khalili 2017-11-04 17:14 ` Chris Murphy 1 sibling, 0 replies; 33+ messages in thread From: Marat Khalili @ 2017-11-04 12:00 UTC (permalink / raw) To: Adam Borowski, Chris Murphy Cc: Austin S. Hemmelgarn, Dave, Linux fs Btrfs, Fred Van Andel >How is this an issue? Discard is issued only once we're positive >there's no >reference to the freed blocks anywhere. At that point, they're also >open >for reuse, thus they can be arbitrarily scribbled upon. Point was, how about keeping this reference for some time period? >Unless your hardware is seriously broken (such as lying about barriers, >which is nearly-guaranteed data loss on btrfs anyway), there's no way >the >filesystem will ever reference such blocks. Buggy hardware happens. So do buggy filesystems ;) Besides, most filesystems let the user recover most data after losing just one sector; it would be a pity if BTRFS, with all its COW coolness, didn't. >Why would you special-case metadata? Metadata that points to >overwritten or >discarded blocks is of no use either. It takes significant time to overwrite a noticeable portion of the data on disk, but a loss of metadata makes it all gone in a moment. Moreover, the user is usually prepared to lose some recently changed data in a crash, but not data that wasn't even touched. -- With Best Regards, Marat Khalili ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Problem with file system 2017-11-04 4:46 ` Adam Borowski 2017-11-04 12:00 ` Marat Khalili @ 2017-11-04 17:14 ` Chris Murphy 2017-11-06 13:29 ` Austin S. Hemmelgarn 1 sibling, 1 reply; 33+ messages in thread From: Chris Murphy @ 2017-11-04 17:14 UTC (permalink / raw) To: Adam Borowski Cc: Chris Murphy, Austin S. Hemmelgarn, Marat Khalili, Dave, Linux fs Btrfs, Fred Van Andel On Fri, Nov 3, 2017 at 10:46 PM, Adam Borowski <kilobyte@angband.pl> wrote: > On Fri, Nov 03, 2017 at 04:03:44PM -0600, Chris Murphy wrote: >> On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn >> <ahferroin7@gmail.com> wrote: >> >> > If you're running on an SSD (or thinly provisioned storage, or something >> > else which supports discards) and have the 'discard' mount option enabled, >> > then there is no backup metadata tree (this issue was mentioned on the list >> > a while ago, but nobody ever replied), >> >> >> This is a really good point. I've been running discard mount option >> for some time now without problems, in a laptop with Samsung >> Electronics Co Ltd NVMe SSD Controller SM951/PM951. >> >> However, just trying btrfs-debug-tree -b on a specific block address >> for any of the backup root trees listed in the super, only the current >> one returns a valid result. All others fail with checksum errors. And >> even the good one fails with checksum errors within seconds as a new >> tree is created, the super updated, and Btrfs considers the old root >> tree disposable and subject to discard. >> >> So absolutely if I were to have a problem, probably no rollback for >> me. This seems to totally obviate a fundamental part of Btrfs design. > > How is this an issue? Discard is issued only once we're positive there's no > reference to the freed blocks anywhere. At that point, they're also open > for reuse, thus they can be arbitrarily scribbled upon. If it's not an issue, then no one should ever need those backup slots in the super and we should just remove them. 
But in fact, we know people end up in situations where they're needed for either automatic recovery at mount time or explicitly calling --usebackuproot. And in some cases we're seeing users using discard who have a borked root tree, and none of the backup roots are present so they're fucked. Their file system is fucked. Now again, maybe this means the hardware is misbehaving, and honored the discard out of order, and did that and wrote the new supers before it had completely committed all the metadata? I have no idea, but the evidence is present in the list that some people run into this and when they do the file system is beyond repair even though it can usually be scraped with btrfs restore. > Unless your hardware is seriously broken (such as lying about barriers, > which is nearly-guaranteed data loss on btrfs anyway), there's no way the > filesystem will ever reference such blocks. The corpses of old trees that > are left lying around with no discard can at most be used for manual > forensics, but whether a given block will have been overwritten or not is > a matter of pure luck. File systems that overwrite, are hinting the intent in the journal what's about to happen. So if there's a partial overwrite of metadata, it's fine. The journal can help recover. But Btrfs without a journal, has a major piece of information required to bootstrap the file system at mount time, that's damaged, and then every backup has been discarded. So it actually makes Btrfs more fragile than other file systems in the same situation. > > For rollbacks, there are snapshots. Once a transaction has been fully > committed, the old version is considered gone. Yeah well snapshots do not cause root trees to stick around. > >> because it's already been discarded. >> > This is ideally something which should be addressed (we need some sort of >> > discard queue for handling in-line discards), but it's not easy to address. >> >> Discard data extents, don't discard metadata extents? 
>> Or put them on a >> substantial delay. > > Why would you special-case metadata? Metadata that points to overwritten or > discarded blocks is of no use either. I would rather lose 30 seconds, 1 minute, or even 2 minutes of writes, than lose an entire file system. That's why. Anyway right now I consider discard mount option fundamentally broken on Btrfs for SSDs. I haven't tested this on LVM thinp, maybe it's broken there too. Even fstrim leaves a tiny window open for a few minutes every time it gets called, where if the root tree is corrupted for any reason, you're fucked because all the backup roots are already gone. -- Chris Murphy ^ permalink raw reply [flat|nested] 33+ messages in thread
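Chris's "substantial delay" idea can be sketched in miniature. The following is a toy model with invented names, not btrfs code: blocks freed in a transaction are parked for a configurable number of further commits before any real TRIM would be issued, so the most recent old root trees stay readable for recovery:

```python
# Toy model (hypothetical, not btrfs internals) of a delayed-discard queue:
# blocks freed while committing generation G only become eligible for a real
# discard once `delay` further transactions have committed.
from collections import deque

class DelayedDiscardQueue:
    def __init__(self, delay=2):
        self.delay = delay          # commits to hold freed blocks back
        self.pending = deque()      # (generation, [block, ...]) oldest first
        self.discarded = []         # blocks actually handed to the device

    def free(self, generation, blocks):
        """Record blocks freed while committing `generation`."""
        self.pending.append((generation, list(blocks)))

    def commit(self, generation):
        """After `generation` commits, discard only old-enough blocks."""
        while self.pending and self.pending[0][0] <= generation - self.delay:
            _, blocks = self.pending.popleft()
            self.discarded.extend(blocks)   # a real FS would issue TRIM here

q = DelayedDiscardQueue(delay=2)
q.free(100, [10, 11])   # old root tree blocks freed in generation 100
q.commit(100)           # too fresh: still recoverable
q.commit(101)           # still held
q.commit(102)           # now two commits old: trimmed
print(q.discarded)      # [10, 11]
```

With delay=2 here, the blocks freed in generation 100 survive until generation 102 commits, which is roughly the window the thread is asking for; a real implementation would have to persist the queue across mounts, which is the on-disk format problem Austin raises.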
* Re: Problem with file system 2017-11-04 17:14 ` Chris Murphy @ 2017-11-06 13:29 ` Austin S. Hemmelgarn 2017-11-06 18:45 ` Chris Murphy 0 siblings, 1 reply; 33+ messages in thread From: Austin S. Hemmelgarn @ 2017-11-06 13:29 UTC (permalink / raw) To: Chris Murphy, Adam Borowski Cc: Marat Khalili, Dave, Linux fs Btrfs, Fred Van Andel On 2017-11-04 13:14, Chris Murphy wrote: > On Fri, Nov 3, 2017 at 10:46 PM, Adam Borowski <kilobyte@angband.pl> wrote: >> On Fri, Nov 03, 2017 at 04:03:44PM -0600, Chris Murphy wrote: >>> On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn >>> <ahferroin7@gmail.com> wrote: >>> >>>> If you're running on an SSD (or thinly provisioned storage, or something >>>> else which supports discards) and have the 'discard' mount option enabled, >>>> then there is no backup metadata tree (this issue was mentioned on the list >>>> a while ago, but nobody ever replied), >>> >>> >>> This is a really good point. I've been running discard mount option >>> for some time now without problems, in a laptop with Samsung >>> Electronics Co Ltd NVMe SSD Controller SM951/PM951. >>> >>> However, just trying btrfs-debug-tree -b on a specific block address >>> for any of the backup root trees listed in the super, only the current >>> one returns a valid result. All others fail with checksum errors. And >>> even the good one fails with checksum errors within seconds as a new >>> tree is created, the super updated, and Btrfs considers the old root >>> tree disposable and subject to discard. >>> >>> So absolutely if I were to have a problem, probably no rollback for >>> me. This seems to totally obviate a fundamental part of Btrfs design. >> >> How is this an issue? Discard is issued only once we're positive there's no >> reference to the freed blocks anywhere. At that point, they're also open >> for reuse, thus they can be arbitrarily scribbled upon. 
> > If it's not an issue, then no one should ever need those backup slots > in the super and we should just remove them. > > But in fact, we know people end up situations where they're needed for > either automatic recovery at mount time or explicitly calling > --usebackuproot. And in some cases we're seeing users using discard > who have a borked root tree, and none of the backup roots are present > so they're fucked. Their file system is fucked. > > Now again, maybe this means the hardware is misbehaving, and honored > the discard out of order, and did that and wrote the new supers before > it had completely committed all the metadata? I have no idea, but the > evidence is present in the list that some people run into this and > when they do the file system is beyond repair even though it can > usually be scraped with btrfs restore. With ATA devices (including SATA), except on newer SSD's, TRIM commands can't be queued, so by definition they can't become unordered (the kernel ends up having to flush the device queue prior to the discard and then flush the write cache, so it's functionally equivalent to a write barrier, just more expensive, which is why inline discard performance sucks in most cases). I'm not sure about SCSI (I'm pretty sure UNMAP can be queued and is handled just like any other write in terms of ordering), MMC/SD (Though I'm also not sure if the block layer and the MMC driver properly handle discard BIO's on MMC devices), or NVMe (which I think handles things similarly to SCSI). > > >> Unless your hardware is seriously broken (such as lying about barriers, >> which is nearly-guaranteed data loss on btrfs anyway), there's no way the >> filesystem will ever reference such blocks. The corpses of old trees that >> are left lying around with no discard can at most be used for manual >> forensics, but whether a given block will have been overwritten or not is >> a matter of pure luck. 
> > File systems that overwrite, are hinting the intent in the journal > what's about to happen. So if there's a partial overwrite of metadata, > it's fine. The journal can help recover. But Btrfs without a journal, > has a major piece of information required to bootstrap the file system > at mount time, that's damaged, and then every backup has been > discarded. So it actually makes Btrfs more fragile than other file > systems in the same situation. Indeed. Unless I'm seriously misunderstanding the code, there's a pretty high chance that any given old metadata block will get overwritten reasonably soon on an active filesystem. I'm not 100% certain about this, but I'm pretty sure that BTRFS will avoid allocating new chunks to write into just to preserve old copies of metadata, which in turn means that it will overwrite things pretty fast if the metadata chunks are mostly full. > >> >> For rollbacks, there are snapshots. Once a transaction has been fully >> committed, the old version is considered gone. > > Yeah well snapshots do not cause root trees to stick around. > > >> >>> because it's already been discarded. >>>> This is ideally something which should be addressed (we need some sort of >>>> discard queue for handling in-line discards), but it's not easy to address. >>> >>> Discard data extents, don't discard metadata extents? Or put them on a >>> substantial delay. >> >> Why would you special-case metadata? Metadata that points to overwritten or >> discarded blocks is of no use either. > > I would rather lose 30 seconds, 1 minute, or even 2 minutes of writes, > than lose an entire file system. That's why. And outside of very specific use cases, this is something you'll hear from almost any sysadmin. > > Anyway right now I consider discard mount option fundamentally broken > on Btrfs for SSDs. I haven't tested this on LVM thinp, maybe it's > broken there too. 
For LVM thinp, discard there deallocates the blocks, and unallocated regions read back as zeroes, just like in a sparse file (in fact, if you just think of LVM thinp as a sparse file with reflinking for snapshots, you get remarkably close to how it's actually implemented from a semantic perspective), so it is broken there. In fact, it's guaranteed broken on any block device that has the discard_zeroes_data flag set, and theoretically broken on many things that don't have that flag (although block devices that don't have that flag are inherently broken from a security perspective anyway, but that's orthogonal to this discussion). > > Even fstrim leaves a tiny window open for a few minutes every time it > gets called, where if the root tree is corrupted for any reason, > you're fucked because all the backup roots are already gone. For this particular case, I'm pretty sure you can minimize this window by calling `btrfs filesystem sync` on the filesystem after calling fstrim. It likely won't eliminate the window, but should significantly shorten it. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Problem with file system 2017-11-06 13:29 ` Austin S. Hemmelgarn @ 2017-11-06 18:45 ` Chris Murphy 2017-11-06 19:12 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 33+ messages in thread From: Chris Murphy @ 2017-11-06 18:45 UTC (permalink / raw) To: Austin S. Hemmelgarn Cc: Chris Murphy, Adam Borowski, Marat Khalili, Dave, Linux fs Btrfs, Fred Van Andel On Mon, Nov 6, 2017 at 6:29 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > > With ATA devices (including SATA), except on newer SSD's, TRIM commands > can't be queued, SATA spec 3.1 includes queued trim. There are SATA spec 3.1 products on the market claiming to do queued trim. Some of them fuck up, and have been black listed in the kernel for queued trim. >> >> >> Anyway right now I consider discard mount option fundamentally broken >> on Btrfs for SSDs. I haven't tested this on LVM thinp, maybe it's >> broken there too. > > For LVM thinp, discard there deallocates the blocks, and unallocated regions > read back as zeroes, just like in a sparse file (in fact, if you just think > of LVM thinp as a sparse file with reflinking for snapshots, you get > remarkably close to how it's actually implemented from a semantic > perspective), so it is broken there. In fact, it's guaranteed broken on any > block device that has the discard_zeroes_data flag set, and theoretically > broken on many things that don't have that flag (although block devices that > don't have that flag are inherently broken from a security perspective > anyway, but that's orthogonal to this discussion). So this is really only solvable by having Btrfs delay, possibly substantially, the discarding of metadata blocks. Aside from physical device trim, there are benefits in thin provisioning for trim and some use cases will require file system discard, being unable to rely on periodic fstrim. -- Chris Murphy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Problem with file system 2017-11-06 18:45 ` Chris Murphy @ 2017-11-06 19:12 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 33+ messages in thread From: Austin S. Hemmelgarn @ 2017-11-06 19:12 UTC (permalink / raw) To: Chris Murphy Cc: Adam Borowski, Marat Khalili, Dave, Linux fs Btrfs, Fred Van Andel On 2017-11-06 13:45, Chris Murphy wrote: > On Mon, Nov 6, 2017 at 6:29 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: > >> >> With ATA devices (including SATA), except on newer SSD's, TRIM commands >> can't be queued, > > SATA spec 3.1 includes queued trim. There are SATA spec 3.1 products > on the market claiming to do queued trim. Some of them fuck up, and > have been black listed in the kernel for queued trim. > Yes, but some still work, and they are invariably very new devices by most people's definitions. >>> Anyway right now I consider discard mount option fundamentally broken >>> on Btrfs for SSDs. I haven't tested this on LVM thinp, maybe it's >>> broken there too. >> >> For LVM thinp, discard there deallocates the blocks, and unallocated regions >> read back as zeroes, just like in a sparse file (in fact, if you just think >> of LVM thinp as a sparse file with reflinking for snapshots, you get >> remarkably close to how it's actually implemented from a semantic >> perspective), so it is broken there. In fact, it's guaranteed broken on any >> block device that has the discard_zeroes_data flag set, and theoretically >> broken on many things that don't have that flag (although block devices that >> don't have that flag are inherently broken from a security perspective >> anyway, but that's orthogonal to this discussion). > > So this is really only solvable by having Btrfs delay, possibly > substantially, the discarding of metadata blocks. Aside from physical > device trim, there are benefits in thin provisioning for trim and some > use cases will require file system discard, being unable to rely on > periodic fstrim. Yes. 
However, from a simplicity of implementation perspective, it makes more sense to keep some number of old trees instead of keeping old trees for some amount of time. That would remove the need to track timing info in the filesystem, provide sufficient protection, and probably be a bit easier to explain in the documentation. Such logic could also be applied to regular block devices that don't support discard to provide a better guarantee that you won't overwrite old trees that might be useful for recovery. ^ permalink raw reply [flat|nested] 33+ messages in thread
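Austin's count-based retention can be sketched the same way. Again a toy model with hypothetical names, not btrfs code: the last `keep` root trees are pinned, and a block only becomes reusable (or discardable) once no retained root still owns it:

```python
# Toy sketch of count-based root retention: keep the last `keep` root trees
# intact; blocks exclusively owned by an evicted root become reusable.
# The default mirrors the four backup_roots slots in the btrfs superblock.
class RootRetention:
    def __init__(self, keep=4):
        self.keep = keep
        self.roots = []        # (generation, owned_blocks), newest last
        self.reusable = set()  # blocks free for overwrite or TRIM

    def commit_root(self, generation, owned_blocks):
        self.roots.append((generation, set(owned_blocks)))
        while len(self.roots) > self.keep:
            _, dropped = self.roots.pop(0)
            # Blocks still shared with any retained root must survive.
            retained = set().union(*(blocks for _, blocks in self.roots))
            self.reusable |= dropped - retained

r = RootRetention(keep=2)
r.commit_root(1, {100, 101})
r.commit_root(2, {101, 102})
r.commit_root(3, {103})
print(sorted(r.reusable))   # [100] -- block 101 is still owned by root 2
```

Note how this needs no timing information at all, only the generation ordering the filesystem already maintains, which is why it would also extend naturally to non-discard block devices as Austin suggests.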
* Re: Problem with file system 2017-10-30 21:37 ` Chris Murphy 2017-10-31 5:57 ` Marat Khalili @ 2017-11-04 7:26 ` Dave 2017-11-04 17:25 ` Chris Murphy 1 sibling, 1 reply; 33+ messages in thread From: Dave @ 2017-11-04 7:26 UTC (permalink / raw) To: Chris Murphy; +Cc: Linux fs Btrfs On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy <lists@colorremedies.com> wrote: > > That is not a general purpose file system. It's a file system for admins who understand where the bodies are buried. I'm not sure I understand your comment... Are you saying BTRFS is not a general purpose file system? If btrfs isn't able to serve as a general purpose file system for Linux going forward, which file system(s) would you suggest can fill that role? (I can't think of any that are clearly all-around better than btrfs now, or that will be in the next few years.) Or maybe you meant something else? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Problem with file system 2017-11-04 7:26 ` Dave @ 2017-11-04 17:25 ` Chris Murphy 2017-11-07 7:01 ` Dave 0 siblings, 1 reply; 33+ messages in thread From: Chris Murphy @ 2017-11-04 17:25 UTC (permalink / raw) To: Dave; +Cc: Chris Murphy, Linux fs Btrfs On Sat, Nov 4, 2017 at 1:26 AM, Dave <davestechshop@gmail.com> wrote: > On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy <lists@colorremedies.com> wrote: >> >> That is not a general purpose file system. It's a file system for admins who understand where the bodies are buried. > > I'm not sure I understand your comment... > > Are you saying BTRFS is not a general purpose file system? I'm suggesting that any file system that burdens the user with more knowledge to stay out of trouble than the widely considered general purpose file systems of the day, is not a general purpose file system. And yes, I'm suggesting that Btrfs is at risk of being neither general purpose, and not meeting its design goals as stated in Btrfs documentation. It is not easy to admin *when things go wrong*. It's great before then. It's a butt ton easier to resize, replace devices, take snapshots, and so on. But when it comes to fixing it when it goes wrong? It is a goddamn Choose Your Own Adventure book. It's way, way more complicated than any other file system I'm aware of. > If btrfs isn't able to serve as a general purpose file system for > Linux going forward, which file system(s) would you suggest can fill > that role? (I can't think of any that are clearly all-around better > than btrfs now, or that will be in the next few years.) ext4 and XFS are clearly the file systems to beat. They almost always recover from crashes with just a normal journal replay at mount time, file system repair is not often needed. When it is needed, it usually works, and there is just the one option to repair and go with it. Btrfs has piles of repair options, mount time options, btrfs check has options, btrfs rescue has options, it's a bit nutty honestly. 
And there's zero guidance in the available docs what order to try things in, not least of which some of these repair tools are still considered dangerous at least in the man page text, and the order depends on the failure. The user is burdened with way too much. Even as much as I know about Btrfs having used it since 2008 and my list activity, I routinely have WTF moments when people post problems, what order to try to get things going again. Easy to admin? Yeah for the most part. But stability is still a problem, and it's coming up on a 10 year anniversary soon. If I were equally familiar with ZFS on Linux as I am with Btrfs, I'd use ZoL hands down. But I'm not, I'm much more familiar with Btrfs and where the bodies are buried, so I continue to use Btrfs. -- Chris Murphy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Problem with file system 2017-11-04 17:25 ` Chris Murphy @ 2017-11-07 7:01 ` Dave 2017-11-07 13:02 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 33+ messages in thread From: Dave @ 2017-11-07 7:01 UTC (permalink / raw) To: Linux fs Btrfs; +Cc: Chris Murphy On Sat, Nov 4, 2017 at 1:25 PM, Chris Murphy <lists@colorremedies.com> wrote: > > On Sat, Nov 4, 2017 at 1:26 AM, Dave <davestechshop@gmail.com> wrote: > > On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy <lists@colorremedies.com> wrote: > >> > >> That is not a general purpose file system. It's a file system for admins who understand where the bodies are buried. > > > > I'm not sure I understand your comment... > > > > Are you saying BTRFS is not a general purpose file system? > > I'm suggesting that any file system that burdens the user with more > knowledge to stay out of trouble than the widely considered general > purpose file systems of the day, is not a general purpose file system. > > And yes, I'm suggesting that Btrfs is at risk of being neither general > purpose, and not meeting its design goals as stated in Btrfs > documentation. It is not easy to admin *when things go wrong*. It's > great before then. It's a butt ton easier to resize, replace devices, > take snapshots, and so on. But when it comes to fixing it when it goes > wrong? It is a goddamn Choose Your Own Adventure book. It's way, way > more complicated than any other file system I'm aware of. It sounds like a large part of that could be addressed with better documentation. I know that documentation such as what you are suggesting would be really valuable to me! > > If btrfs isn't able to serve as a general purpose file system for > > Linux going forward, which file system(s) would you suggest can fill > > that role? (I can't think of any that are clearly all-around better > > than btrfs now, or that will be in the next few years.) > > ext4 and XFS are clearly the file systems to beat. 
> They almost always > recover from crashes with just a normal journal replay at mount time, > file system repair is not often needed. When it is needed, it usually > works, and there is just the one option to repair and go with it. > Btrfs has piles of repair options, mount time options, btrfs check has > options, btrfs rescue has options, it's a bit nutty honestly. And > there's zero guidance in the available docs what order to try things > in, not least of which some of these repair tools are still considered > dangerous at least in the man page text, and the order depends on the > failure. The user is burdened with way too much. Neither one of those file systems offers snapshots. (And when I compared LVM snapshots vs BTRFS snapshots, I got the impression BTRFS is the clear winner.) Snapshots and volumes have a lot of value to me and I would not enjoy going back to a file system without those features. > Even as much as I know about Btrfs having used it since 2008 and my > list activity, I routinely have WTF moments when people post problems, > what order to try to get things going again. Easy to admin? Yeah for > the most part. But stability is still a problem, and it's coming up on > a 10 year anniversary soon. > > If I were equally familiar with ZFS on Linux as I am with Btrfs, I'd > use ZoL hands down. Might it be the case that if you were equally familiar with ZFS, you would become aware of more of its pitfalls? And that greater knowledge could always lead to a different decision (such as favoring BTRFS). In my experience the grass is always greener when I am less familiar with the field. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Problem with file system 2017-11-07 7:01 ` Dave @ 2017-11-07 13:02 ` Austin S. Hemmelgarn 2017-11-08 4:50 ` Chris Murphy 0 siblings, 1 reply; 33+ messages in thread From: Austin S. Hemmelgarn @ 2017-11-07 13:02 UTC (permalink / raw) To: Dave, Linux fs Btrfs; +Cc: Chris Murphy On 2017-11-07 02:01, Dave wrote: > On Sat, Nov 4, 2017 at 1:25 PM, Chris Murphy <lists@colorremedies.com> wrote: >> >> On Sat, Nov 4, 2017 at 1:26 AM, Dave <davestechshop@gmail.com> wrote: >>> On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy <lists@colorremedies.com> wrote: >>>> >>>> That is not a general purpose file system. It's a file system for admins who understand where the bodies are buried. >>> >>> I'm not sure I understand your comment... >>> >>> Are you saying BTRFS is not a general purpose file system? >> >> I'm suggesting that any file system that burdens the user with more >> knowledge to stay out of trouble than the widely considered general >> purpose file systems of the day, is not a general purpose file system. >> >> And yes, I'm suggesting that Btrfs is at risk of being neither general >> purpose, and not meeting its design goals as stated in Btrfs >> documentation. It is not easy to admin *when things go wrong*. It's >> great before then. It's a butt ton easier to resize, replace devices, >> take snapshots, and so on. But when it comes to fixing it when it goes >> wrong? It is a goddamn Choose Your Own Adventure book. It's way, way >> more complicated than any other file system I'm aware of. > > It sounds like a large part of that could be addressed with better > documentation. I know that documentation such as what you are > suggesting would be really valuable to me! Documentation would help, but most of the problem is a lack of automation for things that could be automated (and that users reasonably expect to be automated, given how LVM and ZFS work), including but not limited to: * Handling of device failures.
In particular, BTRFS has absolutely zero hot-spare support currently (though there are patches to add this), which is considered a mandatory feature in almost all large scale data storage situations. * Handling of chunk-level allocation exhaustion. Ideally, when we can't allocate a chunk, we should try to free up space from the other chunk type through repacking of data. Handling this better would significantly improve things around one of the biggest pitfalls with BTRFS, namely filling up a filesystem completely (which many end users seem to think is perfectly fine, despite that not being the case for pretty much any filesystem). * Optional automatic correction of errors detected during normal usage. Right now, you have to run a scrub to correct errors. Such a design makes sense with MD and LVM, where you don't know which copy is correct, but BTRFS does know which copy is correct (or how to rebuild the correct data), and it therefore makes sense to have an option to automatically rebuild data that is detected to be incorrect. > >>> If btrfs isn't able to serve as a general purpose file system for >>> Linux going forward, which file system(s) would you suggest can fill >>> that role? (I can't think of any that are clearly all-around better >>> than btrfs now, or that will be in the next few years.) >> >> ext4 and XFS are clearly the file systems to beat. They almost always >> recover from crashes with just a normal journal replay at mount time, >> file system repair is not often needed. When it is needed, it usually >> works, and there is just the one option to repair and go with it. >> Btrfs has piles of repair options, mount time options, btrfs check has >> options, btrfs rescue has options, it's a bit nutty honestly. And >> there's zero guidance in the available docs what order to try things >> in, not least of which some of these repair tools are still considered >> dangerous at least in the man page text, and the order depends on the >> failure. 
The user is burdened with way too much. > > Neither one of those file systems offers snapshots. (And when I > compared LVM snapshots vs BTRFS snapshots, I got the impression BTRFS > is the clear winner.) > > Snapshots and volumes have a lot of value to me and I would not enjoy > going back to a file system without those features. While that is true, that's not exactly the point Chris was trying to make. The point is that if you install a system with XFS, you don't have to do pretty much anything to keep the filesystem running correctly, and ext4 is almost as good about not needing user intervention (repairs for ext4 are a bit more involved, and you have to watch inode usage because it uses static inode tables). In contrast, you have to essentially treat BTRFS like a small child and keep an eye on it almost constantly to make sure it works correctly. > >> Even as much as I know about Btrfs having used it since 2008 and my >> list activity, I routinely have WTF moments when people post problems, >> what order to try to get things going again. Easy to admin? Yeah for >> the most part. But stability is still a problem, and it's coming up on >> a 10 year anniversary soon. >> >> If I were equally familiar with ZFS on Linux as I am with Btrfs, I'd >> use ZoL hands down. > > Might it be the case that if you were equally familiar with ZFS, you > would become aware of more of its pitfalls? And that greater knowledge > could always lead to a different decision (such as favoring BTRFS).. > In my experience the grass is always greener when I am less familiar > with the field. Quick summary of the big differences, with ZFS parts based on my experience using it with FreeNAS at work: BTRFS: * Natively supported by the mainline kernel, unlike ZFS which can't ever be included in the mainline kernel due to licensing issues. 
This is pretty much the only significant reason I stick with BTRFS over ZFS, as it greatly simplifies updates (and means I don't have to wait as long for kernel upgrades). * Subvolumes are implicitly rooted in the filesystem hierarchy, unlike ZFS datasets which always have to be explicitly mounted. This is largely cosmetic to be honest. * Able to group subvolumes for quotas without having to replicate the grouping with parent subvolumes, unlike ZFS which requires a common parent dataset if you want to group datasets for quotas. This is very useful as it reduces the complexity needed in the subvolume hierarchy. * Has native support for most forms of fallocate(), while ZFS doesn't. This isn't all that significant for most users, but it does provide some significant benefit if you use lots of large sparse files (you have to do batch deduplication on ZFS to make them 'sparse' again, whereas you just call fallocate to punch holes on BTRFS, which takes far less time). ZFS: * Provides native support for exposing virtual block devices (zvols), unlike BTRFS which just provides filesystem functionality. This is really big for NAS usage, as it's much more efficient to expose a zvol as an iSCSI, ATAoE, or NBD device than it is to expose a regular file as one. * Includes hot-spare and automatic rebuild support, unlike BTRFS which does not (but we are working on this). Really important for enterprise usage and high availability. * Provides the ability to control stripe width for parity RAID modes, unlike BTRFS. This is extremely important when dealing with large filesystems, by using reduced stripe width, you improve rebuild times for a given stripe, and in theory can sustain more lost disks before losing data. * Has a much friendlier scrub mechanism that doesn't have anywhere near as much impact on other things accessing the device as BTRFS does. ^ permalink raw reply [flat|nested] 33+ messages in thread
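One concrete difference in the list above is hole punching for sparse files. As a hedged illustration (plain Python, nothing btrfs-specific; the helper name is invented), this is what "sparse" means at the filesystem level — a file whose apparent size far exceeds its allocated blocks, which is exactly the state that punching holes with fallocate() restores:

```python
import os
import tempfile

# Illustrative only: create a sparse file by seeking past the end before
# writing. On filesystems with hole support (ext4, XFS, btrfs, tmpfs) the
# hole consumes no disk blocks; fallocate(FALLOC_FL_PUNCH_HOLE) re-creates
# such holes in an existing file, which is the operation ZFS lacks.

def make_sparse(path, hole_bytes, data=b"tail"):
    with open(path, "wb") as f:
        f.seek(hole_bytes)   # leave a hole of hole_bytes
        f.write(data)        # only this tail occupies real blocks

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "sparse.bin")
    make_sparse(p, 8 * 1024 * 1024)   # 8 MiB hole, then 4 bytes of data
    st = os.stat(p)
    apparent = st.st_size             # logical length: hole + tail
    allocated = st.st_blocks * 512    # st_blocks is in 512-byte units
    print(apparent, allocated)
```

On a filesystem without hole support the allocated figure would instead approach the apparent size, which is why batch dedup is the only way to get the space back on ZFS.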
* Re: Problem with file system 2017-11-07 13:02 ` Austin S. Hemmelgarn @ 2017-11-08 4:50 ` Chris Murphy 2017-11-08 12:13 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 33+ messages in thread From: Chris Murphy @ 2017-11-08 4:50 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Dave, Linux fs Btrfs, Chris Murphy On Tue, Nov 7, 2017 at 6:02 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > * Optional automatic correction of errors detected during normal usage. > Right now, you have to run a scrub to correct errors. Such a design makes > sense with MD and LVM, where you don't know which copy is correct, but BTRFS > does know which copy is correct (or how to rebuild the correct data), and it > therefore makes sense to have an option to automatically rebuild data that > is detected to be incorrect. ? It definitely does fix ups during normal operations. During reads, if there's a UNC or there's corruption detected, Btrfs gets the good copy, and does a (I think it's an overwrite, not COW) fixup. Fixups don't just happen with scrubbing. Even raid56 supports these kinds of passive fixups back to disk. -- Chris Murphy ^ permalink raw reply [flat|nested] 33+ messages in thread
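To make the mechanism Chris describes concrete: because btrfs stores a checksum per block, it can tell *which* mirror is bad, not merely that the mirrors disagree. A toy sketch (illustrative Python, not btrfs code; crc32 stands in for the real checksum) of checksum-directed read repair:

```python
import zlib

# Illustrative sketch: with a stored per-block checksum, a mirrored
# filesystem can identify the good copy during a normal read and
# overwrite the bad one in place -- the "passive fixup" described above.

def read_with_fixup(mirrors, stored_csum):
    """mirrors: list of bytearray copies of one block."""
    for copy in mirrors:
        if zlib.crc32(bytes(copy)) == stored_csum:
            # Found a good copy; repair any mirror failing the checksum.
            for other in mirrors:
                if zlib.crc32(bytes(other)) != stored_csum:
                    other[:] = copy        # in-place overwrite "fixup"
            return bytes(copy)
    raise IOError("all mirrors fail checksum: unrecoverable")

block = b"hello btrfs" * 100
csum = zlib.crc32(block)
mirrors = [bytearray(block), bytearray(block)]
mirrors[0][5] ^= 0xFF                      # corrupt mirror 0
data = read_with_fixup(mirrors, csum)
print(data == block, mirrors[0] == mirrors[1])   # read succeeds, mirror repaired
```

Contrast with MD/LVM mirroring, where no per-block checksum exists, so a mismatch only tells you the copies differ, not which one to trust.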
* Re: Problem with file system 2017-11-08 4:50 ` Chris Murphy @ 2017-11-08 12:13 ` Austin S. Hemmelgarn 2017-11-08 17:17 ` Chris Murphy 0 siblings, 1 reply; 33+ messages in thread From: Austin S. Hemmelgarn @ 2017-11-08 12:13 UTC (permalink / raw) To: Chris Murphy; +Cc: Dave, Linux fs Btrfs On 2017-11-07 23:50, Chris Murphy wrote: > On Tue, Nov 7, 2017 at 6:02 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: > >> * Optional automatic correction of errors detected during normal usage. >> Right now, you have to run a scrub to correct errors. Such a design makes >> sense with MD and LVM, where you don't know which copy is correct, but BTRFS >> does know which copy is correct (or how to rebuild the correct data), and it >> therefore makes sense to have an option to automatically rebuild data that >> is detected to be incorrect. > > ? > > It definitely does fix ups during normal operations. During reads, if > there's a UNC or there's corruption detected, Btrfs gets the good > copy, and does a (I think it's an overwrite, not COW) fixup. Fixups > don't just happen with scrubbing. Even raid56 supports these kinds of > passive fixups back to disk. I could have sworn it didn't rewrite the data on-disk during normal usage. I mean, I know for certain that it will return the correct data to userspace if at all possible, but I was under the impression it will just log the error during normal operation. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Problem with file system 2017-11-08 12:13 ` Austin S. Hemmelgarn @ 2017-11-08 17:17 ` Chris Murphy 2017-11-08 17:22 ` Hugo Mills 0 siblings, 1 reply; 33+ messages in thread From: Chris Murphy @ 2017-11-08 17:17 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Dave, Linux fs Btrfs On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: >> It definitely does fix ups during normal operations. During reads, if >> there's a UNC or there's corruption detected, Btrfs gets the good >> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups >> don't just happen with scrubbing. Even raid56 supports these kinds of >> passive fixups back to disk. > > I could have sworn it didn't rewrite the data on-disk during normal usage. > I mean, I know for certain that it will return the correct data to userspace > if at all possible, but I was under the impression it will just log the > error during normal operation. No, everything except raid56 has had it since a long time, I can't even think how far back, maybe even before 3.0. Whereas raid56 got it in 4.12. -- Chris Murphy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Problem with file system 2017-11-08 17:17 ` Chris Murphy @ 2017-11-08 17:22 ` Hugo Mills 2017-11-08 17:54 ` Chris Murphy 0 siblings, 1 reply; 33+ messages in thread From: Hugo Mills @ 2017-11-08 17:22 UTC (permalink / raw) To: Chris Murphy; +Cc: Austin S. Hemmelgarn, Dave, Linux fs Btrfs On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote: > On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: > > >> It definitely does fix ups during normal operations. During reads, if > >> there's a UNC or there's corruption detected, Btrfs gets the good > >> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups > >> don't just happen with scrubbing. Even raid56 supports these kinds of > >> passive fixups back to disk. > > > > I could have sworn it didn't rewrite the data on-disk during normal usage. > > I mean, I know for certain that it will return the correct data to userspace > > if at all possible, but I was under the impression it will just log the > > error during normal operation. > > No, everything except raid56 has had it since a long time, I can't > even think how far back, maybe even before 3.0. Whereas raid56 got it > in 4.12. Yes, I'm pretty sure it's been like that ever since I've been using btrfs (somewhere around the early neolithic). Hugo. -- Hugo Mills | Turning, pages turning in the widening bath, hugo@... carfax.org.uk | The spine cannot bear the humidity. http://carfax.org.uk/ | Books fall apart; the binding cannot hold. PGP: E2AB1DE4 | Page 129 is loosed upon the world. Zarf ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Problem with file system 2017-11-08 17:22 ` Hugo Mills @ 2017-11-08 17:54 ` Chris Murphy 2017-11-08 18:10 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 33+ messages in thread From: Chris Murphy @ 2017-11-08 17:54 UTC (permalink / raw) To: Hugo Mills, Chris Murphy, Austin S. Hemmelgarn, Dave, Linux fs Btrfs On Wed, Nov 8, 2017 at 10:22 AM, Hugo Mills <hugo@carfax.org.uk> wrote: > On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote: >> On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn >> <ahferroin7@gmail.com> wrote: >> >> >> It definitely does fix ups during normal operations. During reads, if >> >> there's a UNC or there's corruption detected, Btrfs gets the good >> >> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups >> >> don't just happen with scrubbing. Even raid56 supports these kinds of >> >> passive fixups back to disk. >> > >> > I could have sworn it didn't rewrite the data on-disk during normal usage. >> > I mean, I know for certain that it will return the correct data to userspace >> > if at all possible, but I was under the impression it will just log the >> > error during normal operation. >> >> No, everything except raid56 has had it since a long time, I can't >> even think how far back, maybe even before 3.0. Whereas raid56 got it >> in 4.12. > > Yes, I'm pretty sure it's been like that ever since I've been using > btrfs (somewhere around the early neolithic). > Yeah, around the original code for multiple devices I think. Anyway, this is what the fixups look like between scrub and normal read on raid1. Hilariously the error reporting is radically different. These are the kernel messages from a scrub detecting and repairing corruption in a data file. This was 5120 bytes corrupted, so all of one block and part of another. 
[244964.589522] BTRFS warning (device dm-6): checksum error at logical 1103626240 on dev /dev/mapper/vg-2, sector 2116608, root 5, inode 257, offset 0, length 4096, links 1 (path: test.bin) [244964.589685] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 [244964.650239] BTRFS error (device dm-6): fixed up error at logical 1103626240 on dev /dev/mapper/vg-2 [244964.650612] BTRFS warning (device dm-6): checksum error at logical 1103630336 on dev /dev/mapper/vg-2, sector 2116616, root 5, inode 257, offset 4096, length 4096, links 1 (path: test.bin) [244964.650757] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0 [244964.683586] BTRFS error (device dm-6): fixed up error at logical 1103630336 on dev /dev/mapper/vg-2 [root@f26s test]# Exact same corruption (same device and offset), but normal read of the file. [245721.613806] BTRFS warning (device dm-6): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1 [245721.614416] BTRFS warning (device dm-6): csum failed root 5 ino 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1 [245721.630131] BTRFS warning (device dm-6): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1 [245721.630656] BTRFS warning (device dm-6): csum failed root 5 ino 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1 [245721.638901] BTRFS info (device dm-6): read error corrected: ino 257 off 0 (dev /dev/mapper/vg-2 sector 2116608) [245721.639608] BTRFS info (device dm-6): read error corrected: ino 257 off 4096 (dev /dev/mapper/vg-2 sector 2116616) [245747.280718] scrub considers the fixup an error, normal read considers it info; but there's more useful information in the scrub output I think. I'd really like to see the warning make it clear whether this is metadata or data corruption though. From the above you have to infer it, because of the inode reference. 
-- Chris Murphy ^ permalink raw reply [flat|nested] 33+ messages in thread
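For what it's worth, the two formats Chris pasted can be mechanically correlated. A quick sketch (the regexes are ad hoc, matched against the sample lines quoted above) showing that the scrub and normal-read messages point at the same inode and sector despite their different wording:

```python
import re

# The sample lines from the message above, one per reporting style.
scrub_line = ("BTRFS warning (device dm-6): checksum error at logical "
              "1103626240 on dev /dev/mapper/vg-2, sector 2116608, root 5, "
              "inode 257, offset 0, length 4096, links 1 (path: test.bin)")
read_line = ("BTRFS info (device dm-6): read error corrected: ino 257 "
             "off 0 (dev /dev/mapper/vg-2 sector 2116608)")

# Ad hoc patterns for the two message shapes.
scrub_re = re.compile(r"sector (\d+), root \d+, inode (\d+), offset (\d+)")
read_re = re.compile(r"ino (\d+) off (\d+) \(dev \S+ sector (\d+)\)")

s = scrub_re.search(scrub_line)
r = read_re.search(read_line)
scrub_sector, scrub_ino, scrub_off = s.group(1), s.group(2), s.group(3)
read_ino, read_off, read_sector = r.group(1), r.group(2), r.group(3)
print(scrub_sector == read_sector, scrub_ino == read_ino)
```

The scrub line carries extra context (root, length, path) that the normal-read line lacks, which matches the observation that the scrub output is the more useful of the two.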
* Re: Problem with file system 2017-11-08 17:54 ` Chris Murphy @ 2017-11-08 18:10 ` Austin S. Hemmelgarn 2017-11-08 18:31 ` Chris Murphy 0 siblings, 1 reply; 33+ messages in thread From: Austin S. Hemmelgarn @ 2017-11-08 18:10 UTC (permalink / raw) To: Chris Murphy, Hugo Mills, Dave, Linux fs Btrfs On 2017-11-08 12:54, Chris Murphy wrote: > On Wed, Nov 8, 2017 at 10:22 AM, Hugo Mills <hugo@carfax.org.uk> wrote: >> On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote: >>> On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn >>> <ahferroin7@gmail.com> wrote: >>> >>>>> It definitely does fix ups during normal operations. During reads, if >>>>> there's a UNC or there's corruption detected, Btrfs gets the good >>>>> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups >>>>> don't just happen with scrubbing. Even raid56 supports these kinds of >>>>> passive fixups back to disk. >>>> >>>> I could have sworn it didn't rewrite the data on-disk during normal usage. >>>> I mean, I know for certain that it will return the correct data to userspace >>>> if at all possible, but I was under the impression it will just log the >>>> error during normal operation. >>> >>> No, everything except raid56 has had it since a long time, I can't >>> even think how far back, maybe even before 3.0. Whereas raid56 got it >>> in 4.12. >> >> Yes, I'm pretty sure it's been like that ever since I've been using >> btrfs (somewhere around the early neolithic). >> > > Yeah, around the original code for multiple devices I think. Anyway, > this is what the fixups look like between scrub and normal read on > raid1. Hilariously the error reporting is radically different. > > This is kernel messages of what a scrub finding data file corruption > detection and repair looks like. This was 5120 bytes corrupted so all > of one block and partial of anther. 
> > > [244964.589522] BTRFS warning (device dm-6): checksum error at logical > 1103626240 on dev /dev/mapper/vg-2, sector 2116608, root 5, inode 257, > offset 0, length 4096, links 1 (path: test.bin) > [244964.589685] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs: > wr 0, rd 0, flush 0, corrupt 1, gen 0 > [244964.650239] BTRFS error (device dm-6): fixed up error at logical > 1103626240 on dev /dev/mapper/vg-2 > [244964.650612] BTRFS warning (device dm-6): checksum error at logical > 1103630336 on dev /dev/mapper/vg-2, sector 2116616, root 5, inode 257, > offset 4096, length 4096, links 1 (path: test.bin) > [244964.650757] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs: > wr 0, rd 0, flush 0, corrupt 2, gen 0 > [244964.683586] BTRFS error (device dm-6): fixed up error at logical > 1103630336 on dev /dev/mapper/vg-2 > [root@f26s test]# > > > Exact same corruption (same device and offset), but normal read of the file. > > [245721.613806] BTRFS warning (device dm-6): csum failed root 5 ino > 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1 > [245721.614416] BTRFS warning (device dm-6): csum failed root 5 ino > 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1 > [245721.630131] BTRFS warning (device dm-6): csum failed root 5 ino > 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1 > [245721.630656] BTRFS warning (device dm-6): csum failed root 5 ino > 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1 > [245721.638901] BTRFS info (device dm-6): read error corrected: ino > 257 off 0 (dev /dev/mapper/vg-2 sector 2116608) > [245721.639608] BTRFS info (device dm-6): read error corrected: ino > 257 off 4096 (dev /dev/mapper/vg-2 sector 2116616) > [245747.280718] > > > scrub considers the fixup an error, normal read considers it info; but > there's more useful information in the scrub output I think. I'd > really like to see the warning make it clear whether this is metadata > or data corruption though. 
From the above you have to infer it, > because of the inode reference. OK, that actually explains why I had this incorrect assumption. I've not delved all that deep into that code, so I have no reference there, but looking at the two messages, the scrub message makes it very clear that the error was fixed, whereas the phrasing in the case of a normal read is kind of ambiguous (as I see it, 'read error corrected' could mean that it was actually repaired (fixed as scrub says), or that the error was corrected in BTRFS by falling back to the old copy, and I assumed the second case given the context). As far as the whole warning versus info versus error thing, I actually think _that_ makes some sense. If things got fixed, it's not exactly an error, even though it would be nice to have some consistency there. For scrub however, it makes sense to have it all be labeled as an 'error' because otherwise the log entries will be incomplete if dmesg is not set to report anything less than an error (and the three lines are functionally _one_ entry). I can also kind of understand scrub reporting error counts, but regular reads not doing so (scrub is a diagnostic and repair tool, regular reads aren't). ^ permalink raw reply [flat|nested] 33+ messages in thread
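Austin's loglevel argument can be checked with a toy filter. Using the severities actually seen in the two logs above (scrub: warning/error/error per corrupt block; normal read: four warnings plus two info lines), a console that only shows errors still records the scrub fixups but drops the normal-read fixup entirely. Illustrative Python with kernel-style numeric levels (smaller means more severe):

```python
# Kernel printk severities relevant here; the console shows a message
# when its level is strictly below console_loglevel.
SEVERITY = {"error": 3, "warning": 4, "info": 6}

def visible(levels, console_loglevel):
    """Return the messages a console at this loglevel would display."""
    return [l for l in levels if SEVERITY[l] < console_loglevel]

scrub = ["warning", "error", "error"] * 2      # the six scrub lines above
normal_read = ["warning"] * 4 + ["info"] * 2   # the six normal-read lines

# At loglevel 4 (errors only): scrub keeps its 'bdev errs' and 'fixed up'
# lines, while the normal-read fixup vanishes from the console entirely.
print(len(visible(scrub, 4)), len(visible(normal_read, 4)))
```

This is only a model of the filtering, not of btrfs itself, but it shows why mixed-severity multi-line reports risk losing context under restrictive log settings.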
* Re: Problem with file system 2017-11-08 18:10 ` Austin S. Hemmelgarn @ 2017-11-08 18:31 ` Chris Murphy 2017-11-08 19:29 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 33+ messages in thread From: Chris Murphy @ 2017-11-08 18:31 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Hugo Mills, Dave, Linux fs Btrfs On Wed, Nov 8, 2017 at 11:10 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2017-11-08 12:54, Chris Murphy wrote: >> >> On Wed, Nov 8, 2017 at 10:22 AM, Hugo Mills <hugo@carfax.org.uk> wrote: >>> >>> On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote: >>>> >>>> On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn >>>> <ahferroin7@gmail.com> wrote: >>>> >>>>>> It definitely does fix ups during normal operations. During reads, if >>>>>> there's a UNC or there's corruption detected, Btrfs gets the good >>>>>> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups >>>>>> don't just happen with scrubbing. Even raid56 supports these kinds of >>>>>> passive fixups back to disk. >>>>> >>>>> >>>>> I could have sworn it didn't rewrite the data on-disk during normal >>>>> usage. >>>>> I mean, I know for certain that it will return the correct data to >>>>> userspace >>>>> if at all possible, but I was under the impression it will just log the >>>>> error during normal operation. >>>> >>>> >>>> No, everything except raid56 has had it since a long time, I can't >>>> even think how far back, maybe even before 3.0. Whereas raid56 got it >>>> in 4.12. >>> >>> >>> Yes, I'm pretty sure it's been like that ever since I've been using >>> btrfs (somewhere around the early neolithic). >>> >> >> Yeah, around the original code for multiple devices I think. Anyway, >> this is what the fixups look like between scrub and normal read on >> raid1. Hilariously the error reporting is radically different. >> >> This is kernel messages of what a scrub finding data file corruption >> detection and repair looks like. 
This was 5120 bytes corrupted so all >> of one block and partial of anther. >> >> >> [244964.589522] BTRFS warning (device dm-6): checksum error at logical >> 1103626240 on dev /dev/mapper/vg-2, sector 2116608, root 5, inode 257, >> offset 0, length 4096, links 1 (path: test.bin) >> [244964.589685] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs: >> wr 0, rd 0, flush 0, corrupt 1, gen 0 >> [244964.650239] BTRFS error (device dm-6): fixed up error at logical >> 1103626240 on dev /dev/mapper/vg-2 >> [244964.650612] BTRFS warning (device dm-6): checksum error at logical >> 1103630336 on dev /dev/mapper/vg-2, sector 2116616, root 5, inode 257, >> offset 4096, length 4096, links 1 (path: test.bin) >> [244964.650757] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs: >> wr 0, rd 0, flush 0, corrupt 2, gen 0 >> [244964.683586] BTRFS error (device dm-6): fixed up error at logical >> 1103630336 on dev /dev/mapper/vg-2 >> [root@f26s test]# >> >> >> Exact same corruption (same device and offset), but normal read of the >> file. >> >> [245721.613806] BTRFS warning (device dm-6): csum failed root 5 ino >> 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1 >> [245721.614416] BTRFS warning (device dm-6): csum failed root 5 ino >> 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1 >> [245721.630131] BTRFS warning (device dm-6): csum failed root 5 ino >> 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1 >> [245721.630656] BTRFS warning (device dm-6): csum failed root 5 ino >> 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1 >> [245721.638901] BTRFS info (device dm-6): read error corrected: ino >> 257 off 0 (dev /dev/mapper/vg-2 sector 2116608) >> [245721.639608] BTRFS info (device dm-6): read error corrected: ino >> 257 off 4096 (dev /dev/mapper/vg-2 sector 2116616) >> [245747.280718] >> >> >> scrub considers the fixup an error, normal read considers it info; but >> there's more useful information in the scrub output I think. 
I'd >> really like to see the warning make it clear whether this is metadata >> or data corruption though. From the above you have to infer it, >> because of the inode reference. > > OK, that actually explains why I had this incorrect assumption. I've not > delved all that deep into that code, so I have no reference there, but > looking at the two messages, the scrub message makes it very clear that the > error was fixed, whereas the phrasing in the case of a normal read is kind > of ambiguous (as I see it, 'read error corrected' could mean that it was > actually repaired (fixed as scrub says), or that the error was corrected in > BTRFS by falling back to the old copy, and I assumed the second case given > the context). > > As far as the whole warning versus info versus error thing, I actually think > _that_ makes some sense. If things got fixed, it's not exactly an error, > even though it would be nice to have some consistency there. For scrub > however, it makes sense to have it all be labeled as an 'error' because > otherwise the log entries will be incomplete if dmesg is not set to report > anything less than an error (and the three lines are functionally _one_ > entry). I can also kind of understand scrub reporting error counts, but > regular reads not doing so (scrub is a diagnostic and repair tool, regular > reads aren't). I just did those corruptions as a test, and following the normal read fixup, a subsequent scrub finds no problems. And in both cases debug-tree shows pretty much identical metadata, at least the same chunks are intact and the tree the file is located in has the same logical address for the file in question. So this is not a COW fix up, it's an overwrite. (Something tells me that raid56 fixes corruptions differently, they may be cow). -- Chris Murphy ^ permalink raw reply [flat|nested] 33+ messages in thread
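The overwrite-versus-COW distinction Chris verified with debug-tree can be modeled in a few lines: an in-place repair rewrites the existing physical location and leaves the logical-to-physical mapping untouched, while a COW repair would allocate a new extent and move the mapping. The structures below are invented for illustration, not btrfs internals:

```python
# Toy extent model: 'extent_map' maps logical offsets to physical slots,
# 'disk' holds the data at each physical slot.

class Fs:
    def __init__(self):
        self.next_phys = 1000
        self.extent_map = {}   # logical -> physical
        self.disk = {}         # physical -> data

    def write(self, logical, data):
        phys = self.next_phys
        self.next_phys += 1
        self.extent_map[logical] = phys
        self.disk[phys] = data

    def repair_overwrite(self, logical, good):
        self.disk[self.extent_map[logical]] = good   # same physical slot

    def repair_cow(self, logical, good):
        self.write(logical, good)                    # new slot, map updated

fs = Fs()
fs.write(4096, b"good")
before = fs.extent_map[4096]
fs.disk[before] = b"bad!"                 # simulate on-disk corruption
fs.repair_overwrite(4096, b"good")
overwrite_keeps_mapping = fs.extent_map[4096] == before   # unchanged

fs.repair_cow(4096, b"good")
cow_keeps_mapping = fs.extent_map[4096] == before         # moved
print(overwrite_keeps_mapping, cow_keeps_mapping)
```

In the overwrite case the metadata is byte-for-byte identical afterwards, which is consistent with debug-tree showing the same logical address for the file before and after the fixup.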
* Re: Problem with file system 2017-11-08 18:31 ` Chris Murphy @ 2017-11-08 19:29 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 33+ messages in thread From: Austin S. Hemmelgarn @ 2017-11-08 19:29 UTC (permalink / raw) To: Chris Murphy; +Cc: Hugo Mills, Dave, Linux fs Btrfs On 2017-11-08 13:31, Chris Murphy wrote: > On Wed, Nov 8, 2017 at 11:10 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: >> On 2017-11-08 12:54, Chris Murphy wrote: >>> >>> On Wed, Nov 8, 2017 at 10:22 AM, Hugo Mills <hugo@carfax.org.uk> wrote: >>>> >>>> On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote: >>>>> >>>>> On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn >>>>> <ahferroin7@gmail.com> wrote: >>>>> >>>>>>> It definitely does fix ups during normal operations. During reads, if >>>>>>> there's a UNC or there's corruption detected, Btrfs gets the good >>>>>>> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups >>>>>>> don't just happen with scrubbing. Even raid56 supports these kinds of >>>>>>> passive fixups back to disk. >>>>>> >>>>>> >>>>>> I could have sworn it didn't rewrite the data on-disk during normal >>>>>> usage. >>>>>> I mean, I know for certain that it will return the correct data to >>>>>> userspace >>>>>> if at all possible, but I was under the impression it will just log the >>>>>> error during normal operation. >>>>> >>>>> >>>>> No, everything except raid56 has had it since a long time, I can't >>>>> even think how far back, maybe even before 3.0. Whereas raid56 got it >>>>> in 4.12. >>>> >>>> >>>> Yes, I'm pretty sure it's been like that ever since I've been using >>>> btrfs (somewhere around the early neolithic). >>>> >>> >>> Yeah, around the original code for multiple devices I think. Anyway, >>> this is what the fixups look like between scrub and normal read on >>> raid1. Hilariously the error reporting is radically different. 
>>> >>> This is kernel messages of what a scrub finding data file corruption >>> detection and repair looks like. This was 5120 bytes corrupted so all >>> of one block and partial of anther. >>> >>> >>> [244964.589522] BTRFS warning (device dm-6): checksum error at logical >>> 1103626240 on dev /dev/mapper/vg-2, sector 2116608, root 5, inode 257, >>> offset 0, length 4096, links 1 (path: test.bin) >>> [244964.589685] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs: >>> wr 0, rd 0, flush 0, corrupt 1, gen 0 >>> [244964.650239] BTRFS error (device dm-6): fixed up error at logical >>> 1103626240 on dev /dev/mapper/vg-2 >>> [244964.650612] BTRFS warning (device dm-6): checksum error at logical >>> 1103630336 on dev /dev/mapper/vg-2, sector 2116616, root 5, inode 257, >>> offset 4096, length 4096, links 1 (path: test.bin) >>> [244964.650757] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs: >>> wr 0, rd 0, flush 0, corrupt 2, gen 0 >>> [244964.683586] BTRFS error (device dm-6): fixed up error at logical >>> 1103630336 on dev /dev/mapper/vg-2 >>> [root@f26s test]# >>> >>> >>> Exact same corruption (same device and offset), but normal read of the >>> file. 
>>> >>> [245721.613806] BTRFS warning (device dm-6): csum failed root 5 ino >>> 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1 >>> [245721.614416] BTRFS warning (device dm-6): csum failed root 5 ino >>> 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1 >>> [245721.630131] BTRFS warning (device dm-6): csum failed root 5 ino >>> 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1 >>> [245721.630656] BTRFS warning (device dm-6): csum failed root 5 ino >>> 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1 >>> [245721.638901] BTRFS info (device dm-6): read error corrected: ino >>> 257 off 0 (dev /dev/mapper/vg-2 sector 2116608) >>> [245721.639608] BTRFS info (device dm-6): read error corrected: ino >>> 257 off 4096 (dev /dev/mapper/vg-2 sector 2116616) >>> [245747.280718] >>> >>> >>> scrub considers the fixup an error, normal read considers it info; but >>> there's more useful information in the scrub output I think. I'd >>> really like to see the warning make it clear whether this is metadata >>> or data corruption though. From the above you have to infer it, >>> because of the inode reference. >> >> OK, that actually explains why I had this incorrect assumption. I've not >> delved all that deep into that code, so I have no reference there, but >> looking at the two messages, the scrub message makes it very clear that the >> error was fixed, whereas the phrasing in the case of a normal read is kind >> of ambiguous (as I see it, 'read error corrected' could mean that it was >> actually repaired (fixed as scrub says), or that the error was corrected in >> BTRFS by falling back to the old copy, and I assumed the second case given >> the context). >> >> As far as the whole warning versus info versus error thing, I actually think >> _that_ makes some sense. If things got fixed, it's not exactly an error, >> even though it would be nice to have some consistency there. 
>> For scrub,
>> however, it makes sense to have it all be labeled as an 'error',
>> because otherwise the log entries will be incomplete if dmesg is not
>> set to report anything less than an error (and the three lines are
>> functionally _one_ entry). I can also kind of understand scrub
>> reporting error counts but regular reads not doing so (scrub is a
>> diagnostic and repair tool, regular reads aren't).
>
> I just did those corruptions as a test, and following the normal-read
> fixup, a subsequent scrub finds no problems. And in both cases
> debug-tree shows pretty much identical metadata; at least the same
> chunks are intact, and the tree the file is located in has the same
> logical address for the file in question. So this is not a COW fixup,
> it's an overwrite. (Something tells me that raid56 fixes corruptions
> differently; they may be COW.)

I would think that this is the only case where it makes sense to
unconditionally _not_ do a COW update. In the event that the write gets
interrupted, we're no worse off than we already were (the checksum will
still fail), so there's not much point in incurring the overhead of a
COW operation, except possibly with parity involved (because you might
run the risk of both bogus parity _and_ a bogus checksum).

^ permalink raw reply	[flat|nested] 33+ messages in thread
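The observation above — that the fixup is a plain in-place overwrite of the bad sectors rather than a COW update — can be modeled roughly as follows. Everything here (the mirror byte arrays, the `read_with_repair` helper, and the `zlib.crc32` stand-in for btrfs's crc32c) is a hypothetical sketch of the behavior, not btrfs code:

```python
# Model of RAID1 read-repair as a non-COW fixup: a read that fails its
# checksum falls back to the other mirror, and if that copy verifies,
# the bad copy is rewritten in place at the same offset (no new
# allocation). zlib.crc32 stands in for the real crc32c checksum.
import zlib

def read_with_repair(mirrors, offset, length, stored_csum):
    for i, dev in enumerate(mirrors):
        block = bytes(dev[offset:offset + length])
        if zlib.crc32(block) == stored_csum:
            # Overwrite any bad mirrors in place -- the fixup is a
            # rewrite of the same sectors, not a COW of the extent.
            for j, other in enumerate(mirrors):
                bad = zlib.crc32(bytes(other[offset:offset + length]))
                if j != i and bad != stored_csum:
                    other[offset:offset + length] = block
            return block
    raise IOError("all mirrors failed checksum")

data = b"A" * 4096
csum = zlib.crc32(data)
m1 = bytearray(b"X" + data[1:])   # corrupted copy
m2 = bytearray(data)              # good copy
out = read_with_repair([m1, m2], 0, 4096, csum)
assert out == data
assert bytes(m1) == data          # bad mirror overwritten in place
```

If the in-place rewrite is interrupted, the checksum for those sectors still fails on the next read, so nothing is lost relative to the starting state — which is the argument made above for skipping COW in this one case.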
* Re: Problem with file system
  2017-10-30  3:31           ` Dave
  2017-10-30 21:37             ` Chris Murphy
@ 2017-10-31  1:58           ` Duncan
  1 sibling, 0 replies; 33+ messages in thread
From: Duncan @ 2017-10-31  1:58 UTC (permalink / raw)
  To: linux-btrfs

Dave posted on Sun, 29 Oct 2017 23:31:57 -0400 as excerpted:

> It's all part of the process of gaining critical experience with BTRFS.
> Whether or not BTRFS is ready for production use is (it seems to me)
> mostly a question of how knowledgeable and experienced the people
> administering it are.
>
> In the various online discussions on this topic, all the focus is on
> whether or not BTRFS itself is production-ready. At the current
> maturity level of BTRFS, I think that's the wrong focus. The right
> focus is on how production-ready the admin person or team is (with
> respect to their BTRFS knowledge and experience). When a filesystem has
> been around for decades, most of the critical admin issues become
> fairly common knowledge, fairly widely known and easy to find. When a
> filesystem is newer, far fewer people understand the gotchas. Also,
> with older or widely used filesystems, when someone hits a gotcha the
> response isn't "that filesystem is not ready for production". Instead
> the response is, "you should have known not to do that."

That's a view I hadn't seen before, but it seems reasonable and I like
it.

Indeed, there were/are a few reasonably widely known caveats with both
ext3 and reiserfs, for instance, and certainly some that apply to
fat/vfat/fat32 -- the three filesystems other than btrfs I know most
about -- and if anything they're past their prime, /not/ "still
maturing", as btrfs is typically described.

For example, setting either of the first two to writeback journaling and
then losing data results in something akin to "you should have known not
to do that unless you were prepared for the risk, as it's definitely a
well known one."
Which was of course my own reaction when Linus and the other powers that
be decided to set ext3 to writeback journaling by default for a few
kernel cycles. Having lived thru that on reiserfs, I /knew/ where /that/
was headed, and sure enough...

Similarly, ext3's performance problems with fsync are well known,
because it effectively forces a full filesystem sync, not just a file
sync, as are the risks of storing a reiserfs in a loopback file on
reiserfs and then trying to run a tree restore on the host, since the
restore is known to mix up the two filesystems in that case.

It's thus a reasonable viewpoint to consider some of the btrfs quirks to
be in the same category. Of course, btrfs being the first COW-based fs
most will have had experience with, and the first filesystem most will
have experienced that handles raid, snapshotting, etc, it's definitely
rather different and more complex than the filesystems most people are
familiar with, and thus can only be expected to have rather different
and more complex caveats as well.

OTOH, there's definitely some known low-hanging fruit in terms of ease
of use remaining to be implemented, tho I'd argue that we've reached the
point where general stability has allowed the focus to gradually tilt
toward implementing some of it over the last year or so, and we're
beginning to see the loose ends tied up in the documentation, for
instance.

I'd say we are getting close, and your viewpoint is a definite argument
in support of that.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
end of thread, other threads:[~2017-11-08 19:29 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-24 15:27 Problem with file system Fred Van Andel
2017-04-24 17:02 ` Chris Murphy
2017-04-25  4:05   ` Duncan
2017-04-25  0:26 ` Qu Wenruo
2017-04-25  5:33   ` Marat Khalili
2017-04-25  6:13     ` Qu Wenruo
2017-04-26 16:43       ` Fred Van Andel
2017-10-30  3:31         ` Dave
2017-10-30 21:37           ` Chris Murphy
2017-10-31  5:57             ` Marat Khalili
2017-10-31 11:28               ` Austin S. Hemmelgarn
2017-11-03  7:42                 ` Kai Krakow
2017-11-03 11:33                   ` Austin S. Hemmelgarn
2017-11-03 22:03                     ` Chris Murphy
2017-11-04  4:46                       ` Adam Borowski
2017-11-04 12:00                         ` Marat Khalili
2017-11-04 17:14                           ` Chris Murphy
2017-11-06 13:29                             ` Austin S. Hemmelgarn
2017-11-06 18:45                               ` Chris Murphy
2017-11-06 19:12                                 ` Austin S. Hemmelgarn
2017-11-04  7:26                       ` Dave
2017-11-04 17:25                         ` Chris Murphy
2017-11-07  7:01                           ` Dave
2017-11-07 13:02                             ` Austin S. Hemmelgarn
2017-11-08  4:50                               ` Chris Murphy
2017-11-08 12:13                                 ` Austin S. Hemmelgarn
2017-11-08 17:17                                   ` Chris Murphy
2017-11-08 17:22                                     ` Hugo Mills
2017-11-08 17:54                                       ` Chris Murphy
2017-11-08 18:10                                         ` Austin S. Hemmelgarn
2017-11-08 18:31                                           ` Chris Murphy
2017-11-08 19:29                                             ` Austin S. Hemmelgarn
2017-10-31  1:58           ` Duncan