* Unexpected raid1 behaviour @ 2017-12-16 19:50 Dark Penguin 2017-12-17 11:58 ` Duncan ` (2 more replies) 0 siblings, 3 replies; 61+ messages in thread
From: Dark Penguin @ 2017-12-16 19:50 UTC (permalink / raw)
To: linux-btrfs

Could someone please point me towards some reading about how btrfs handles multiple devices? Namely, kicking faulty devices and re-adding them.

I've been using btrfs on single devices for a while, but now I want to start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and tried to see how it handles various situations. The experience left me very surprised; I've tried a number of things, all of which produced unexpected results.

I create a btrfs raid1 filesystem on two hard drives and mount it.

- When I pull one of the drives out (simulating a simple cable failure, which happens pretty often to me), the filesystem sometimes goes read-only. ???
- But only after a while, and not always. ???
- When I fix the cable problem (plug the device back), it's immediately "re-added" back. But I see no replication of the data I've written onto the degraded filesystem... Nothing shows any problems, so "my filesystem must be ok". ???
- If I unmount the filesystem and then mount it back, I see all my recent changes lost (everything I wrote during the "degraded" period).
- If I continue working with a degraded raid1 filesystem (even without damaging it further by re-adding the faulty device), after a while it won't mount at all, even with "-o degraded".

I can't wrap my head around all this. Either the kicked device should not be re-added, or it should be re-added "properly", or it should at least show some errors and not pretend nothing happened, right?..

I must be missing something. Is there an explanation somewhere about what's really going on in those situations? Also, do I understand correctly that upon detecting a faulty device (a write error), nothing is done about it except logging an error into the 'btrfs device stats' report? No device kicking, no notification?.. And what about degraded filesystems - is it absolutely forbidden to work with them without converting them to a "single" filesystem first?..

On Ubuntu 17.10, there's Linux 4.13.0-16 and btrfs-progs 4.12-1.

--
darkpenguin

^ permalink raw reply	[flat|nested] 61+ messages in thread
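For reference, the test setup described above boils down to roughly the following; /dev/sdb, /dev/sdc and /mnt are placeholder names, and the sysfs "delete" write simulates the pulled cable without touching the hardware:

    # create and mount a two-device raid1 filesystem (data and metadata)
    mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
    mount /dev/sdb /mnt

    # simulate one member disappearing, roughly like a failed cable
    echo 1 > /sys/block/sdc/device/delete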
* Re: Unexpected raid1 behaviour 2017-12-16 19:50 Unexpected raid1 behaviour Dark Penguin @ 2017-12-17 11:58 ` Duncan 2017-12-17 15:48 ` Peter Grandi 2017-12-18 5:11 ` Anand Jain 2017-12-18 1:20 ` Qu Wenruo 2017-12-18 13:31 ` Austin S. Hemmelgarn 2 siblings, 2 replies; 61+ messages in thread
From: Duncan @ 2017-12-17 11:58 UTC (permalink / raw)
To: linux-btrfs

Dark Penguin posted on Sat, 16 Dec 2017 22:50:33 +0300 as excerpted:

> Could someone please point me towards some read about how btrfs handles
> multiple devices? Namely, kicking faulty devices and re-adding them.
>
> I've been using btrfs on single devices for a while, but now I want to
> start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and
> tried to see how does it handle various situations. The experience left
> me very surprised; I've tried a number of things, all of which produced
> unexpected results.
>
> I create a btrfs raid1 filesystem on two hard drives and mount it.
>
> - When I pull one of the drives out (simulating a simple cable failure,
> which happens pretty often to me), the filesystem sometimes goes
> read-only. ???
> - But only after a while, and not always. ???
> - When I fix the cable problem (plug the device back), it's immediately
> "re-added" back. But I see no replication of the data I've written onto
> a degraded filesystem... Nothing shows any problems, so "my filesystem
> must be ok". ???
> - If I unmount the filesystem and then mount it back, I see all my
> recent changes lost (everything I wrote during the "degraded" period).
> - If I continue working with a degraded raid1 filesystem (even without
> damaging it further by re-adding the faulty device), after a while it
> won't mount at all, even with "-o degraded".
>
> I can't wrap my head about all this. Either the kicked device should not
> be re-added, or it should be re-added "properly", or it should at least
> show some errors and not pretend nothing happened, right?..
>
> I must be missing something. Is there an explanation somewhere about
> what's really going on during those situations? Also, do I understand
> correctly that upon detecting a faulty device (a write error), nothing
> is done about it except logging an error into the 'btrfs device stats'
> report? No device kicking, no notification?.. And what about degraded
> filesystems - is it absolutely forbidden to work with them without
> converting them to a "single" filesystem first?..
>
> On Ubuntu 17.10, there's Linux 4.13.0-16 and btrfs-progs 4.12-1 .

Btrfs device handling at this point is still "development level" and very rough, but there's a patch set in active review ATM that should improve things dramatically, perhaps as soon as 4.16 (4.15 is already well on the way).

Basically, at this point btrfs doesn't have "dynamic" device handling. That is, if a device disappears, it doesn't know it. So it continues attempting to write to (and read from, but the reads are redirected) the missing device until things go bad enough that it kicks to read-only for safety.

If a device is added back, the kernel normally shuffles device names and assigns a new one. Btrfs will see it and list the new device, but it's still trying to use the old one internally. =:^(

Thus, if a device disappears, to get it back you really have to reboot, or at least unload/reload the btrfs kernel module, in order to clear the stale device state and have btrfs rescan and reassociate devices with the matching filesystems.
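In practice the unload/reload cycle described above looks roughly like this (a sketch only: it assumes no other btrfs filesystem is mounted, and /dev/sdb and /mnt are placeholders):

    umount /mnt
    modprobe -r btrfs        # fails if any btrfs filesystem is still mounted
    modprobe btrfs
    btrfs device scan        # re-associate member devices with their filesystems
    mount /dev/sdb /mnt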
Meanwhile, once a device goes stale -- other devices in the filesystem have data that should have been written to the stale one but it was gone so the data couldn't get to it -- once you do the module unload/reload or reboot cycle and btrfs picks up the device again, you should immediately do a btrfs scrub, which will detect and "catch up" the differences.

Btrfs tracks atomic filesystem updates via a monotonically increasing generation number, aka transaction-id (transid). When a device goes offline, its generation number of course gets stuck at the point it went offline, while the other devices continue to update their generation numbers. When a stale device is readded, btrfs should automatically find and use the device with the latest generation, but the old one isn't automatically caught up -- a scrub is the mechanism by which you do this.

One thing you do **NOT** want to do is degraded-writable mount one device, then the other device, of a raid1 pair, because that'll diverge the two with new data on each, and that's no longer simple to correct. If you /have/ to degraded-writable mount a raid1, always make sure it's the same one mounted writable if you want to combine them again. If you /do/ need to recombine two diverged raid1 devices, the only safe way to do so is to wipe the one so btrfs has only the one copy of the data to go on, and add the wiped device back as a new device.

Meanwhile, until /very/ recently... 4.13 may not be current enough... if you mounted a two-device raid1 degraded-writable, btrfs would try to write and note that it couldn't do raid1 because there wasn't a second device, so it would create single chunks to write into. And the older filesystem safe-mount mechanism would see those single chunks on a raid1 and decide it wasn't safe to mount the filesystem writable at all after that, even if all the single chunks were actually present on the remaining device.

The effect was that if a device died, you had exactly one degraded-writable mount to replace it successfully. If you didn't complete the replace in that single chance writable mount, the filesystem would refuse to mount writable again, and thus it was impossible to repair the filesystem since that required a writable mount and that was no longer possible! Fortunately the filesystem could still be mounted degraded-readonly (unless there was some other problem), allowing people to at least get at the read-only data to copy it elsewhere.

With a new enough btrfs, while btrfs will still create those single chunks on a degraded-writable mount of a raid1, it's at least smart enough to do per-chunk checks to see if they're all available on existing devices (none only on the missing device), and will continue to allow degraded-writable mounting if so. But once the filesystem is back to multi-device (with writable space on at least two devices), a balance-convert of those single chunks to raid1 should be done, otherwise if the device with them on it goes...

And there's work on allowing it to do only single-copy, thus incomplete-raid1, chunk writes as well. This should prevent the single mode chunks entirely, thus eliminating the need for the balance-convert, tho a scrub would still be needed to fully sync back up. But I'm not sure what the status is on that.

Meanwhile, as mentioned above, there's active work on proper dynamic btrfs device tracking and management.
It may or may not be ready for 4.16, but once it goes in, btrfs should properly detect a device going away and react accordingly, and it should detect a device coming back as a different device too. As I write this it occurs to me that I've not read close enough to know if it actually initiates scrub/resync on its own in the current patch set, but that's obviously an eventual goal if not.

Longer term, there's further patches that will provide a hot-spare functionality, automatically bringing in a device pre-configured as a hot-spare if a device disappears, but that of course requires that btrfs properly recognize devices disappearing and coming back first, so one thing at a time.

Tho as originally presented, that hot-spare functionality was a bit limited -- it was a global hot-spare list, and with multiple btrfs of different sizes and multiple hot-spare devices also of different sizes, it would always just pick the first spare on the list for the first btrfs needing one, regardless of whether the size was appropriate for that filesystem or not. By the time the feature actually gets merged it may have changed some, and regardless, it should eventually get less limited, but that's _eventually_, with a target time likely still in years, so don't hold your breath.

I think that answers most of your questions. Basically, you have to be quite careful with btrfs raid1 today, as btrfs simply doesn't have the automated functionality to handle it yet. It's still possible to do two-device-only raid1 and replace a failed device when you're down to one, but it's not as easy or automated as more mature raid options such as mdraid, and you do have to keep on top of it as a result.

But it can and does work reasonably well for those (like me) who use btrfs raid1 as their "daily driver", as long as you /do/ keep on top of it... and don't try to use raid1 as a replacement for real backups, because it's *not* a backup! =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

^ permalink raw reply	[flat|nested] 61+ messages in thread
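Putting Duncan's advice together, a rough recovery sequence after a formerly missing raid1 member reappears might look like the following sketch (assuming the filesystem is mounted at /mnt; adjust to the local setup):

    # catch the stale device up with the current generation
    btrfs scrub start -Bd /mnt

    # on kernels that wrote "single" chunks while degraded, convert them
    # back to raid1; "soft" only touches chunks that are not already raid1
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt

    # verify the result
    btrfs filesystem df /mnt
    btrfs device stats /mnt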
* Re: Unexpected raid1 behaviour 2017-12-17 11:58 ` Duncan @ 2017-12-17 15:48 ` Peter Grandi 2017-12-17 20:42 ` Chris Murphy ` (2 more replies) 2017-12-18 5:11 ` Anand Jain 1 sibling, 3 replies; 61+ messages in thread
From: Peter Grandi @ 2017-12-17 15:48 UTC (permalink / raw)
To: Linux fs Btrfs

"Duncan"'s reply is slightly optimistic in parts, so some further information... [ ... ]

> Basically, at this point btrfs doesn't have "dynamic" device
> handling. That is, if a device disappears, it doesn't know
> it.

That's just the consequence of what is a completely broken conceptual model: the current way most multi-device profiles are designed is that block-devices can only be "added" or "removed", and cannot be "broken"/"missing". Therefore if IO fails, that is just one IO failing, not the entire block-device going away. The time when a block-device is noticed as sort-of missing is when it is not available for "add"-ing at start.

Put another way, the multi-device design is/was based on the demented idea that block-devices that are missing are/should be "remove"d, so that a 2-device volume with a 'raid1' profile becomes a 1-device volume with a 'single'/'dup' profile, and not a 2-device volume with a missing block-device and an incomplete 'raid1' profile, even if things have been awkwardly moving in that direction in recent years.

Note the above is not totally accurate today because various hacks have been introduced to work around the various issues.

> Thus, if a device disappears, to get it back you really have
> to reboot, or at least unload/reload the btrfs kernel module,
> in order to clear the stale device state and have btrfs
> rescan and reassociate devices with the matching filesystems.

IIRC that is not quite accurate: a "missing" device can nowadays be "replace"d (by "devid") or "remove"d, the latter possibly implying profile changes:

https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete

Terrible tricks like this also work:

https://www.spinics.net/lists/linux-btrfs/msg48394.html

> Meanwhile, as mentioned above, there's active work on proper
> dynamic btrfs device tracking and management. It may or may
> not be ready for 4.16, but once it goes in, btrfs should
> properly detect a device going away and react accordingly,

I haven't seen that, but I doubt that it is the radical redesign of the multi-device layer of Btrfs that is needed to give it operational semantics similar to those of MD RAID, and that I have vaguely described previously.

> and it should detect a device coming back as a different
> device too.

That is disagreeable because of poor terminology: I guess that what was intended is that it should be able to detect a previous member block-device becoming available again as a different device inode, which currently is very dangerous in some vital situations.

> Longer term, there's further patches that will provide a
> hot-spare functionality, automatically bringing in a device
> pre-configured as a hot-spare if a device disappears, but
> that of course requires that btrfs properly recognize devices
> disappearing and coming back first, so one thing at a time.

That would be trivial if the complete redesign of block-device states of the Btrfs multi-device layer happened, adding an "active" flag and an "accessible" flag to describe new member states, for example.
My guess is that while logically consistent, the current multi-device logic is fundamentally broken from an operational point of view, and needs a complete replacement instead of fixes.

^ permalink raw reply	[flat|nested] 61+ messages in thread
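As a concrete illustration of the "replace (by devid) or remove" options Peter mentions above, a sketch assuming the missing member is devid 2 and /dev/sdd is the replacement disk (the filesystem must be mounted, degraded if necessary):

    btrfs filesystem show /mnt               # the missing member is listed with its devid
    btrfs replace start -B 2 /dev/sdd /mnt   # rebuild missing devid 2 onto /dev/sdd

    # alternatively, add a new device first and then drop the missing one
    btrfs device add /dev/sdd /mnt
    btrfs device remove missing /mnt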
* Re: Unexpected raid1 behaviour 2017-12-17 15:48 ` Peter Grandi @ 2017-12-17 20:42 ` Chris Murphy 2017-12-18 8:49 ` Anand Jain 2017-12-18 8:49 ` Anand Jain 2017-12-18 13:06 ` Austin S. Hemmelgarn 2 siblings, 1 reply; 61+ messages in thread From: Chris Murphy @ 2017-12-17 20:42 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux fs Btrfs On Sun, Dec 17, 2017 at 8:48 AM, Peter Grandi <pg@btrfs.list.sabi.co.uk> wrote: > "Duncan"'s reply is slightly optimistic in parts, so some > further information... >> and it should detect a device coming back as a different >> device too. > > That is disagreeable because of poor terminology: I guess that > what was intended that it should be able to detect a previous > member block-device becoming available again as a different > device inode, which currently is very dangerous in some vital > situations. Duncan probably means if the device reappears with different enumeration (was /dev/sdb1 but comes back as /dev/sde1), that Btrfs can recover from this by using the Btrfs specific dev.uuid to recognize the device. And also by knowing generation it in effect has a virtual write intent bitmap to use to catch up that device for missing commits, which is something that doesn't currently happen automatically; it requires either a scrub or balance to catch up a formerly missing device - a very big penalty because the whole array has to be done to catch it up for what might be only a few minutes of missing time. -- Chris Murphy ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-17 20:42 ` Chris Murphy @ 2017-12-18 8:49 ` Anand Jain 0 siblings, 0 replies; 61+ messages in thread
From: Anand Jain @ 2017-12-18 8:49 UTC (permalink / raw)
To: Chris Murphy, Peter Grandi; +Cc: Linux fs Btrfs

> formerly missing device - a very big penalty because the whole array
> has to be done to catch it up for what might be only a few minutes of
> missing time.

For raid1, the cli at [1] will pick only the new chunks.

[1] btrfs bal start -dprofiles=single -mprofiles=single <mnt>

Thanks, Anand

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-17 15:48 ` Peter Grandi 2017-12-17 20:42 ` Chris Murphy @ 2017-12-18 8:49 ` Anand Jain 2017-12-18 10:36 ` Peter Grandi ` (2 more replies) 2017-12-18 13:06 ` Austin S. Hemmelgarn 2 siblings, 3 replies; 61+ messages in thread From: Anand Jain @ 2017-12-18 8:49 UTC (permalink / raw) To: Peter Grandi, Linux fs Btrfs > Put another way, the multi-device design is/was based on the > demented idea that block-devices that are missing are/should be > "remove"d, so that a 2-device volume with a 'raid1' profile > becomes a 1-device volume with a 'single'/'dup' profile, and not > a 2-device volume with a missing block-device and an incomplete > 'raid1' profile, Agreed. IMO degraded-raid1-single-chunk is an accidental feature caused by [1], which we should revert back, since.. - balance (to raid1 chunk) may fail if FS is near full - recovery (to raid1 chunk) will take more writes as compared to recovery under degraded raid1 chunks [1] commit 95669976bd7d30ae265db938ecb46a6b7f8cb893 Btrfs: don't consider the missing device when allocating new chunks There is an attempt to fix it [2], but will certainly takes time as there are many things to fix around this. [2] [PATCH RFC] btrfs: create degraded-RAID1 chunks > even if things have been awkwardly moving in > that direction in recent years. > Note the above is not totally accurate today because various > hacks have been introduced to work around the various issues. May be you are talking about [3]. Pls note its a workaround patch (which I mentioned in its original patch). Its nice that we fixed the availability issue through this patch and the helper function it added also helps the other developments. But for long term we need to work on [2]. [3] btrfs: Introduce a function to check if all chunks a OK for degraded rw mount >> Thus, if a device disappears, to get it back you really have >> to reboot, or at least unload/reload the btrfs kernel module, >> in ordered to clear the stale device state and have btrfs >> rescan and reassociate devices with the matching filesystems. > > IIRC that is not quite accurate: a "missing" device can be > nowadays "replace"d (by "devid") or "remove"d, the latter > possibly implying profile changes: > > https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete > > Terrible tricks like this also work: > > https://www.spinics.net/lists/linux-btrfs/msg48394.html Its replace, which isn't about bringing back a missing disk. >> Meanwhile, as mentioned above, there's active work on proper >> dynamic btrfs device tracking and management. It may or may >> not be ready for 4.16, but once it goes in, btrfs should >> properly detect a device going away and react accordingly, > > I haven't seen that, but I doubt that it is the radical redesign > of the multi-device layer of Btrfs that is needed to give it > operational semantics similar to those of MD RAID, and that I > have vaguely described previously. I agree that btrfs volume manager is incomplete in view of data center RAS requisites, there are couple of critical bugs and inconsistent design between raid profiles, but I doubt if it needs a radical redesign. Pls take a look at [4], comments are appreciated as usual. I have experimented with two approaches and both are reasonable. - There isn't any harm to leave failed disk opened (but stop any new IO to it). And there will be udev 'btrfs dev forget --mounted <dev>' call when device disappears so that we can close the device. 
In the 2nd approach, close the failed device right away when disk write fails, so that we continue to have only two device states. I like the latter. >> and it should detect a device coming back as a different >> device too. > > That is disagreeable because of poor terminology: I guess that > what was intended that it should be able to detect a previous > member block-device becoming available again as a different > device inode, which currently is very dangerous in some vital > situations. If device disappears, the patch [4] will completely take out the device from btrfs, and continues to RW in degraded mode. When it reappears then [5] will bring it back to the RW list. [4] btrfs: introduce device dynamic state transition to failed [5] btrfs: handle dynamically reappearing missing device From the btrfs original design, it always depends on device SB fsid:uuid:devid so it does not matter about the device path or device inode or device transport layer. For eg. Dynamically you can bring a device under different transport and it will work without any down time. > That would be trivial if the complete redesign of block-device > states of the Btrfs multi-device layer happened, adding an > "active" flag to an "accessible" flag to describe new member > states, for example. I think you are talking about BTRFS_DEV_STATE.. But I think Duncan is talking about the patches which I included in my reply. Thanks, Anand ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-18 8:49 ` Anand Jain @ 2017-12-18 10:36 ` Peter Grandi 2017-12-18 12:10 ` Nikolay Borisov 2017-12-18 22:28 ` Chris Murphy 2 siblings, 0 replies; 61+ messages in thread
From: Peter Grandi @ 2017-12-18 10:36 UTC (permalink / raw)
To: Linux fs Btrfs

>> I haven't seen that, but I doubt that it is the radical
>> redesign of the multi-device layer of Btrfs that is needed to
>> give it operational semantics similar to those of MD RAID,
>> and that I have vaguely described previously.

> I agree that btrfs volume manager is incomplete in view of
> data center RAS requisites, there are couple of critical
> bugs and inconsistent design between raid profiles, but I
> doubt if it needs a radical redesign.

Well it needs a radical redesign because the original design was based on an entirely consistent and logical concept that was quite different from that required for sensible operations, and then special-case code was added (and keeps being added) to fix the consequences.

But I suspect that it does not need a radical *recoding*, because most if not all of the needed code is already there. All that needs changing most likely is the member state-machine; that's the bit that needs a radical redesign, and it is a relatively small part of the whole. The closer the member state-machine design is to the MD RAID one the better, as it is a very workable, proven model.

Sometimes I suspect that the design needs to be changed to also add a formal notion of "stripe" to the Btrfs internals, where a "stripe" is a collection of chunks that are "related" (and something like that is already part of the 'raid10' profile), but I think that need not be user-visible.

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-18 8:49 ` Anand Jain 2017-12-18 10:36 ` Peter Grandi @ 2017-12-18 12:10 ` Nikolay Borisov 2017-12-18 13:43 ` Anand Jain 2017-12-18 22:28 ` Chris Murphy 2 siblings, 1 reply; 61+ messages in thread From: Nikolay Borisov @ 2017-12-18 12:10 UTC (permalink / raw) To: Anand Jain, Peter Grandi, Linux fs Btrfs On 18.12.2017 10:49, Anand Jain wrote: > > >> Put another way, the multi-device design is/was based on the >> demented idea that block-devices that are missing are/should be >> "remove"d, so that a 2-device volume with a 'raid1' profile >> becomes a 1-device volume with a 'single'/'dup' profile, and not >> a 2-device volume with a missing block-device and an incomplete >> 'raid1' profile, > > Agreed. IMO degraded-raid1-single-chunk is an accidental feature > caused by [1], which we should revert back, since.. > - balance (to raid1 chunk) may fail if FS is near full > - recovery (to raid1 chunk) will take more writes as compared > to recovery under degraded raid1 chunks > > [1] > commit 95669976bd7d30ae265db938ecb46a6b7f8cb893 > Btrfs: don't consider the missing device when allocating new chunks > > There is an attempt to fix it [2], but will certainly takes time as > there are many things to fix around this. > > [2] > [PATCH RFC] btrfs: create degraded-RAID1 chunks > >> even if things have been awkwardly moving in >> that direction in recent years. >> Note the above is not totally accurate today because various >> hacks have been introduced to work around the various issues. > May be you are talking about [3]. Pls note its a workaround > patch (which I mentioned in its original patch). Its nice that > we fixed the availability issue through this patch and the > helper function it added also helps the other developments. > But for long term we need to work on [2]. > > [3] > btrfs: Introduce a function to check if all chunks a OK for degraded rw > mount > >>> Thus, if a device disappears, to get it back you really have >>> to reboot, or at least unload/reload the btrfs kernel module, >>> in ordered to clear the stale device state and have btrfs >>> rescan and reassociate devices with the matching filesystems. >> >> IIRC that is not quite accurate: a "missing" device can be >> nowadays "replace"d (by "devid") or "remove"d, the latter >> possibly implying profile changes: >> >> >> https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete >> >> >> Terrible tricks like this also work: >> >> https://www.spinics.net/lists/linux-btrfs/msg48394.html > > Its replace, which isn't about bringing back a missing disk. > > >>> Meanwhile, as mentioned above, there's active work on proper >>> dynamic btrfs device tracking and management. It may or may >>> not be ready for 4.16, but once it goes in, btrfs should >>> properly detect a device going away and react accordingly, >> >> I haven't seen that, but I doubt that it is the radical redesign >> of the multi-device layer of Btrfs that is needed to give it >> operational semantics similar to those of MD RAID, and that I >> have vaguely described previously. > > I agree that btrfs volume manager is incomplete in view of > data center RAS requisites, there are couple of critical > bugs and inconsistent design between raid profiles, but I > doubt if it needs a radical redesign. > > Pls take a look at [4], comments are appreciated as usual. > I have experimented with two approaches and both are reasonable. 
- > There isn't any harm to leave failed disk opened (but stop any > new IO to it). And there will be udev > 'btrfs dev forget --mounted <dev>' call when device disappears > so that we can close the device. > In the 2nd approach, close the failed device right away when disk > write fails, so that we continue to have only two device states. > I like the latter. > >>> and it should detect a device coming back as a different >>> device too. >> >> That is disagreeable because of poor terminology: I guess that >> what was intended that it should be able to detect a previous >> member block-device becoming available again as a different >> device inode, which currently is very dangerous in some vital >> situations. > > If device disappears, the patch [4] will completely take out the > device from btrfs, and continues to RW in degraded mode. > When it reappears then [5] will bring it back to the RW list. but [5] relies on someone from userspace (presumably udev) actually invoking BTRFS_IOC_SCAN_DEV/IOSC_DEVICES_READY, no ? Because device_list_add is only ever called from btrfs_scan_one_device, which in turn is called by either of the aforementioned IOCTLS or during mount (which is not at play here). > > [4] > btrfs: introduce device dynamic state transition to failed > [5] > btrfs: handle dynamically reappearing missing device > > From the btrfs original design, it always depends on device SB > fsid:uuid:devid so it does not matter about the device > path or device inode or device transport layer. For eg. Dynamically > you can bring a device under different transport and it will work > without any down time. > > >> That would be trivial if the complete redesign of block-device >> states of the Btrfs multi-device layer happened, adding an >> "active" flag to an "accessible" flag to describe new member >> states, for example. > > I think you are talking about BTRFS_DEV_STATE.. But I think > Duncan is talking about the patches which I included in my > reply. > > Thanks, Anand > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-18 12:10 ` Nikolay Borisov @ 2017-12-18 13:43 ` Anand Jain 0 siblings, 0 replies; 61+ messages in thread
From: Anand Jain @ 2017-12-18 13:43 UTC (permalink / raw)
To: Nikolay Borisov, Peter Grandi, Linux fs Btrfs

>>> what was intended that it should be able to detect a previous
>>> member block-device becoming available again as a different
>>> device inode, which currently is very dangerous in some vital
>>> situations.

Peter, what's the dangerous part here?

>> If device disappears, the patch [4] will completely take out the
>> device from btrfs, and continues to RW in degraded mode.
>> When it reappears then [5] will bring it back to the RW list.

> but [5] relies on someone from userspace (presumably udev) actually
> invoking BTRFS_IOC_SCAN_DEV/IOSC_DEVICES_READY, no ?

Nikolay, yes. Most distros' udev already does that. udev calls btrfs dev scan when the SB is overwritten from userland or when a device with a primary btrfs SB (re)appears.

> Because
> device_list_add is only ever called from btrfs_scan_one_device, which in
> turn is called by either of the aforementioned IOCTLS or during mount
> (which is not at play here).

Hm. As above.

Thanks, Anand

^ permalink raw reply	[flat|nested] 61+ messages in thread
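For completeness, what the udev rules trigger here can also be done by hand, e.g. when testing without udev in place (/dev/sdc is a placeholder):

    btrfs device scan              # scan all block devices for btrfs superblocks
    btrfs device scan /dev/sdc     # or register just the reappeared device
    btrfs device ready /dev/sdc    # exit status says whether all members of its fs are now known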
* Re: Unexpected raid1 behaviour 2017-12-18 8:49 ` Anand Jain 2017-12-18 10:36 ` Peter Grandi 2017-12-18 12:10 ` Nikolay Borisov @ 2017-12-18 22:28 ` Chris Murphy 2017-12-18 22:29 ` Chris Murphy ` (3 more replies) 2 siblings, 4 replies; 61+ messages in thread From: Chris Murphy @ 2017-12-18 22:28 UTC (permalink / raw) To: Anand Jain; +Cc: Peter Grandi, Linux fs Btrfs On Mon, Dec 18, 2017 at 1:49 AM, Anand Jain <anand.jain@oracle.com> wrote: > Agreed. IMO degraded-raid1-single-chunk is an accidental feature > caused by [1], which we should revert back, since.. > - balance (to raid1 chunk) may fail if FS is near full > - recovery (to raid1 chunk) will take more writes as compared > to recovery under degraded raid1 chunks The advantage of writing single chunks when degraded, is in the case where a missing device returns (is readded, intact). Catching up that device with the first drive, is a manual but simple invocation of 'btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft' The alternative is a full balance or full scrub. It's pretty tedious for big arrays. mdadm uses bitmap=internal for any array larger than 100GB for this reason, avoiding full resync. 'btrfs sub find' will list all *added* files since an arbitrarily specified generation; but not deletions. -- Chris Murphy ^ permalink raw reply [flat|nested] 61+ messages in thread
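The command Chris is alluding to is 'btrfs subvolume find-new'; a usage sketch, with /mnt and the generation number as placeholders:

    # print the current generation ("transid marker") without listing files
    btrfs subvolume find-new /mnt 9999999

    # later, list files written since that generation
    btrfs subvolume find-new /mnt <generation>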
* Re: Unexpected raid1 behaviour 2017-12-18 22:28 ` Chris Murphy @ 2017-12-18 22:29 ` Chris Murphy 2017-12-19 12:30 ` Adam Borowski ` (2 subsequent siblings) 3 siblings, 0 replies; 61+ messages in thread From: Chris Murphy @ 2017-12-18 22:29 UTC (permalink / raw) To: Chris Murphy; +Cc: Anand Jain, Peter Grandi, Linux fs Btrfs On Mon, Dec 18, 2017 at 3:28 PM, Chris Murphy <lists@colorremedies.com> wrote: > On Mon, Dec 18, 2017 at 1:49 AM, Anand Jain <anand.jain@oracle.com> wrote: > >> Agreed. IMO degraded-raid1-single-chunk is an accidental feature >> caused by [1], which we should revert back, since.. >> - balance (to raid1 chunk) may fail if FS is near full >> - recovery (to raid1 chunk) will take more writes as compared >> to recovery under degraded raid1 chunks > > > The advantage of writing single chunks when degraded, is in the case > where a missing device returns (is readded, intact). Catching up that > device with the first drive, is a manual but simple invocation of > 'btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft' The > alternative is a full balance or full scrub. It's pretty tedious for > big arrays. > > mdadm uses bitmap=internal for any array larger than 100GB for this > reason, avoiding full resync. > > 'btrfs sub find' will list all *added* files since an arbitrarily > specified generation; but not deletions. Looks like LVM raid types (the non-legacy ones that use md driver) also use a bitmap by default. -- Chris Murphy ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-18 22:28 ` Chris Murphy 2017-12-18 22:29 ` Chris Murphy @ 2017-12-19 12:30 ` Adam Borowski 2017-12-19 12:54 ` Andrei Borzenkov 2017-12-19 12:59 ` Peter Grandi 3 siblings, 0 replies; 61+ messages in thread From: Adam Borowski @ 2017-12-19 12:30 UTC (permalink / raw) To: Chris Murphy; +Cc: Anand Jain, Peter Grandi, Linux fs Btrfs On Mon, Dec 18, 2017 at 03:28:14PM -0700, Chris Murphy wrote: > On Mon, Dec 18, 2017 at 1:49 AM, Anand Jain <anand.jain@oracle.com> wrote: > > Agreed. IMO degraded-raid1-single-chunk is an accidental feature > > caused by [1], which we should revert back, since.. > > - balance (to raid1 chunk) may fail if FS is near full > > - recovery (to raid1 chunk) will take more writes as compared > > to recovery under degraded raid1 chunks > > The advantage of writing single chunks when degraded, is in the case > where a missing device returns (is readded, intact). Catching up that > device with the first drive, is a manual but simple invocation of > 'btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft' The > alternative is a full balance or full scrub. It's pretty tedious for > big arrays. > > mdadm uses bitmap=internal for any array larger than 100GB for this > reason, avoiding full resync. > > 'btrfs sub find' will list all *added* files since an arbitrarily > specified generation; but not deletions. This is fine as scrub cares about extents not files. The newer generation of metadata doesn't have a reference to the deleted extent anymore. Selective scrub hasn't been implemented, but it should be pretty straightforward -- unless nocow is involved. Correct me if I'm wrong, but I believe there's no way to tell which copy of a nocow extent is the good one. Meow! -- // If you believe in so-called "intellectual property", please immediately // cease using counterfeit alphabets. Instead, contact the nearest temple // of Amon, whose priests will provide you with scribal services for all // your writing needs, for Reasonable And Non-Discriminatory prices. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-18 22:28 ` Chris Murphy 2017-12-18 22:29 ` Chris Murphy 2017-12-19 12:30 ` Adam Borowski @ 2017-12-19 12:54 ` Andrei Borzenkov 2017-12-19 12:59 ` Peter Grandi 3 siblings, 0 replies; 61+ messages in thread
From: Andrei Borzenkov @ 2017-12-19 12:54 UTC (permalink / raw)
To: Chris Murphy; +Cc: Anand Jain, Peter Grandi, Linux fs Btrfs

On Tue, Dec 19, 2017 at 1:28 AM, Chris Murphy <lists@colorremedies.com> wrote:
> On Mon, Dec 18, 2017 at 1:49 AM, Anand Jain <anand.jain@oracle.com> wrote:
>
>> Agreed. IMO degraded-raid1-single-chunk is an accidental feature
>> caused by [1], which we should revert back, since..
>> - balance (to raid1 chunk) may fail if FS is near full
>> - recovery (to raid1 chunk) will take more writes as compared
>> to recovery under degraded raid1 chunks
>
> The advantage of writing single chunks when degraded, is in the case
> where a missing device returns (is readded, intact). Catching up that
> device with the first drive, is a manual but simple invocation of
> 'btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft' The
> alternative is a full balance or full scrub. It's pretty tedious for
> big arrays.

The alternative would be to introduce a new "resilver" operation that would allocate a second copy for every degraded chunk. And it could even be started automatically when enough redundancy is present again.

> mdadm uses bitmap=internal for any array larger than 100GB for this
> reason, avoiding full resync.

ZFS manages to avoid full sync in this case quite efficiently.

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-18 22:28 ` Chris Murphy ` (2 preceding siblings ...) 2017-12-19 12:54 ` Andrei Borzenkov @ 2017-12-19 12:59 ` Peter Grandi 3 siblings, 0 replies; 61+ messages in thread From: Peter Grandi @ 2017-12-19 12:59 UTC (permalink / raw) To: Linux fs Btrfs [ ... ] > The advantage of writing single chunks when degraded, is in > the case where a missing device returns (is readded, > intact). Catching up that device with the first drive, is a > manual but simple invocation of 'btrfs balance start > -dconvert=raid1,soft -mconvert=raid1,soft' The alternative is > a full balance or full scrub. It's pretty tedious for big > arrays. That is merely an after-the-fact rationalization for a design that is at the same time entirely logical and quite broken: that the intended replication factor is the same as the current number of members of the volume, so if a volume has (currently) only one member, than only "single" chunks gets created. A design that would work better for operations would be to have "profiles" to be a concept entirely independent of number of members, or perhaps more precisely to have the "desired" profile of a chunk be distinct from the "actual" profile (dependent on the actual number of members of a volume) of that chunk, so that if a volume has only one member chunks could be created that have "desired" profile 'raid1' but "actual" profile 'single', or perhaps more sensibly 'raid1-with-missing-mirror', with checks that "actual" profile be usable else the volume is not mountable. Note: ideally every chunk would have both a static desired profile and a desired stripe width, and a computed actual profile and a actual stripe width. Or perhaps the desired profile and width would be properties of the volume (for each of the three types of data). For example in MD RAID it is perfectly legitimate to create a RAID6 set with "desired" width of 6 and "actual" width of 4 (in which case it can be activated as degraded) or a RAID5 set with "desired" width of 5 and actual width of 3 (in which case it cannot be activated at all until at least another member is added). The difference with MD RAID is that in MD RAID there is (except in one case , during conversion) an exact match between "desired" profile stripe width and number of members, while at least in principle a Btrfs volume can have any number of chunks of any profile of any desired stripe size (except that current implementation is not so flexible in most profiles). That would require scanning all chunks to determine whether a volume is mountable at all or mountable only as degraded, while MD RAID can just count the members. Apparently recent versions of the Btrfs 'raid1' profile do just that. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-17 15:48 ` Peter Grandi 2017-12-17 20:42 ` Chris Murphy 2017-12-18 8:49 ` Anand Jain @ 2017-12-18 13:06 ` Austin S. Hemmelgarn 2017-12-18 19:43 ` Tomasz Pala 2 siblings, 1 reply; 61+ messages in thread From: Austin S. Hemmelgarn @ 2017-12-18 13:06 UTC (permalink / raw) To: Peter Grandi, Linux fs Btrfs On 2017-12-17 10:48, Peter Grandi wrote: > "Duncan"'s reply is slightly optimistic in parts, so some > further information... > > [ ... ] > >> Basically, at this point btrfs doesn't have "dynamic" device >> handling. That is, if a device disappears, it doesn't know >> it. > > That's just the consequence of what is a completely broken > conceptual model: the current way most multi-device profiles are > designed is that block-devices and only be "added" or "removed", > and cannot be "broken"/"missing". Therefore if IO fails, that is > just one IO failing, not the entire block-device going away. > The time when a block-device is noticed as sort-of missing is > when it is not available for "add"-ing at start. > > Put another way, the multi-device design is/was based on the > demented idea that block-devices that are missing are/should be > "remove"d, so that a 2-device volume with a 'raid1' profile > becomes a 1-device volume with a 'single'/'dup' profile, and not > a 2-device volume with a missing block-device and an incomplete > 'raid1' profile, even if things have been awkwardly moving in > that direction in recent years. > > Note the above is not totally accurate today because various > hacks have been introduced to work around the various issues. You do realize you just restated exactly what Duncan said, just in a much more verbose (and aggressively negative) manner... > >> Thus, if a device disappears, to get it back you really have >> to reboot, or at least unload/reload the btrfs kernel module, >> in ordered to clear the stale device state and have btrfs >> rescan and reassociate devices with the matching filesystems. > > IIRC that is not quite accurate: a "missing" device can be > nowadays "replace"d (by "devid") or "remove"d, the latter > possibly implying profile changes: > > https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete > > Terrible tricks like this also work: > > https://www.spinics.net/lists/linux-btrfs/msg48394.html While that is all true, none of that _fixes_ the issue of a device disappearing and then being reconnected. In theory, you can use `btrfs device replace` to force BTRFS to acknowledge the new name (by 'replacing' the missing device with the now returned device), but doing so is horribly inefficient as to not be worth it unless you have no other choice. > >> Meanwhile, as mentioned above, there's active work on proper >> dynamic btrfs device tracking and management. It may or may >> not be ready for 4.16, but once it goes in, btrfs should >> properly detect a device going away and react accordingly, > > I haven't seen that, but I doubt that it is the radical redesign > of the multi-device layer of Btrfs that is needed to give it > operational semantics similar to those of MD RAID, and that I > have vaguely described previously. Anand has been working on hot spare support, and as part of that has done some work on handling of missing devices. > >> and it should detect a device coming back as a different >> device too. 
> > That is disagreeable because of poor terminology: I guess that > what was intended that it should be able to detect a previous > member block-device becoming available again as a different > device inode, which currently is very dangerous in some vital > situations. How exactly is this dangerous? The only situation I can think of is if a bogus device is hot-plugged and happens to perfectly match all the required identifiers, and at that point you've either got someone attacking your system who already has sufficient access to do whatever the hell they want with it, or you did something exceedingly stupid, and both cases are dangerous by themselves. > >> Longer term, there's further patches that will provide a >> hot-spare functionality, automatically bringing in a device >> pre-configured as a hot- spare if a device disappears, but >> that of course requires that btrfs properly recognize devices >> disappearing and coming back first, so one thing at a time. > > That would be trivial if the complete redesign of block-device > states of the Btrfs multi-device layer happened, adding an > "active" flag to an "accessible" flag to describe new member > states, for example. No, it wouldn't be trivial, because a complete redesign of part of the filesystem would be needed. > > My guess that while logically consistent, the current > multi-device logic is fundamentally broken from an operational > point of view, and needs a complete replacement instead of > fixes. Then why don't you go write up some patches yourself if you feel so strongly about it? The fact is, the only cases where this is really an issue is if you've either got intermittently bad hardware, or are dealing with external storage devices. For the majority of people who are using multi-device setups, the common case is internally connected fixed storage devices with properly working hardware, and for that use case, it works perfectly fine. In fact, the only people I've seen any reports of issues from are either: 1. Testing the behavior of device management (such as the OP), in which case, yes it doesn't work if you do things that aren't reasonably expected of working hardware. 2. Trying to do multi-device on USB, which is a bad idea regardless of what you're using to create a single volume, because USB has pretty serious reliability issues. Neither case is 'normal' usage of a multi-device volume though. Yes, the second case could be better supported, but that's likely going to require some help from the block layer, and verification of writes. As far as handling of other marginal hardware, I'm very inclined to say that BTRFS should not care. At the point at which a device is dropping off the bus and reappearing with enough regularity for this to be an issue, you have absolutely no idea how else it's corrupting your data, and support of such a situation is beyond any filesystem (including ZFS). ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-18 13:06 ` Austin S. Hemmelgarn @ 2017-12-18 19:43 ` Tomasz Pala 2017-12-18 22:01 ` Peter Grandi 2017-12-19 12:25 ` Austin S. Hemmelgarn 0 siblings, 2 replies; 61+ messages in thread From: Tomasz Pala @ 2017-12-18 19:43 UTC (permalink / raw) To: Linux fs Btrfs On Mon, Dec 18, 2017 at 08:06:57 -0500, Austin S. Hemmelgarn wrote: > The fact is, the only cases where this is really an issue is if you've > either got intermittently bad hardware, or are dealing with external Well, the RAID1+ is all about the failing hardware. > storage devices. For the majority of people who are using multi-device > setups, the common case is internally connected fixed storage devices > with properly working hardware, and for that use case, it works > perfectly fine. If you're talking about "RAID"-0 or storage pools (volume management) that is true. But if you imply, that RAID1+ "works perfectly fine as long as hardware works fine" this is fundamentally wrong. If the hardware needs to work properly for the RAID to work properly, noone would need this RAID in the first place. > that BTRFS should not care. At the point at which a device is dropping > off the bus and reappearing with enough regularity for this to be an > issue, you have absolutely no idea how else it's corrupting your data, > and support of such a situation is beyond any filesystem (including ZFS). Support for such situation is exactly what RAID performs. So don't blame people for expecting this to be handled as long as you call the filesystem feature a 'RAID'. If this feature is not going to mitigate hardware hiccups by design (as opposed to "not implemented yet, needs some time", which is perfectly understandable), just don't call it 'RAID'. All the features currently working, like bit-rot mitigation for duplicated data (dup/raid*) using checksums, are something different than RAID itself. RAID means "survive failure of N devices/controllers" - I got one "RAID1" stuck in r/o after degraded mount, not nice... Not _expected_ to happen after single disk failure (without any reappearing). -- Tomasz Pala <gotar@pld-linux.org> ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-18 19:43 ` Tomasz Pala @ 2017-12-18 22:01 ` Peter Grandi 2017-12-19 12:46 ` Austin S. Hemmelgarn 2017-12-19 12:25 ` Austin S. Hemmelgarn 1 sibling, 1 reply; 61+ messages in thread From: Peter Grandi @ 2017-12-18 22:01 UTC (permalink / raw) To: Linux fs Btrfs >> The fact is, the only cases where this is really an issue is >> if you've either got intermittently bad hardware, or are >> dealing with external > Well, the RAID1+ is all about the failing hardware. >> storage devices. For the majority of people who are using >> multi-device setups, the common case is internally connected >> fixed storage devices with properly working hardware, and for >> that use case, it works perfectly fine. > If you're talking about "RAID"-0 or storage pools (volume > management) that is true. But if you imply, that RAID1+ "works > perfectly fine as long as hardware works fine" this is > fundamentally wrong. I really agree with this, the argument about "properly working hardware" is utterly ridiculous. I'll to this: apparently I am not the first one to discover the "anomalies" in the "RAID" profiles, but I may have been the first to document some of them, e.g. the famous issues with the 'raid1' profile. How did I discover them? Well, I had used Btrfs in single device mode for a bit, and wanted to try multi-device, and the docs seemed "strange", so I did tests before trying it out. The tests were simply on a spare PC with a bunch of old disks to create two block devices (partitions), put them in 'raid1' first natively, then by adding a new member to an existing partition, and then 'remove' one, or simply unplug it (actually 'echo 1 > /sys/block/.../device/delete') initially. I wanted to check exactly what happened, resync times, speed, behaviour and speed when degraded, just ordinary operational tasks. Well I found significant problems after less than one hour. I can't imagine anyone with some experience of hw or sw RAID (especially hw RAID, as hw RAID firmware is often fantastically buggy especially as to RAID operations) that wouldn't have done the same tests before operational use, and would not have found the same issues too straight away. The only guess I could draw is that whover designed the "RAID" profile had zero operational system administration experience. > If the hardware needs to work properly for the RAID to work > properly, noone would need this RAID in the first place. It is not just that, but some maintenance operations are needed even if the hardware works properly: for example preventive maintenance, replacing drives that are becoming too old, expanding capacity, testing periodically hardware bits. Systems engineers don't just say "it works, let's assume it continues to work properly, why worry". My impression is that multi-device and "chunks" were designed in one way by someone, and someone else did not understand the intent, and confused them with "RAID", and based the 'raid' profiles on that confusion. For example the 'raid10' profile seems the least confused to me, and that's I think because the "RAID" aspect is kept more distinct from the "multi-device" aspect. But perhaps I am an optimist... 
To simplify a longer discussion to have "RAID" one needs an explicit design concept of "stripe", which in Btrfs needs to be quite different from that of "set of member devices" and "chunks", so that for example adding/removing to a "stripe" is not quite the same thing as adding/removing members to a volume, plus to make a distinction between online and offline members, not just added and removed ones, and well-defined state machine transitions (e.g. in response to hardware problems) among all those, like in MD RAID. But the importance of such distinctions may not be apparent to everybody. But I may have read comments in which "block device" (a data container on some medium), "block device inode" (a descriptor for that) and "block device name" (a path to a "block device inode") were hopelessly confused, so I don't hold a lot of hope. :-( ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-18 22:01 ` Peter Grandi @ 2017-12-19 12:46 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 61+ messages in thread From: Austin S. Hemmelgarn @ 2017-12-19 12:46 UTC (permalink / raw) To: Peter Grandi, Linux fs Btrfs On 2017-12-18 17:01, Peter Grandi wrote: >>> The fact is, the only cases where this is really an issue is >>> if you've either got intermittently bad hardware, or are >>> dealing with external > >> Well, the RAID1+ is all about the failing hardware. > >>> storage devices. For the majority of people who are using >>> multi-device setups, the common case is internally connected >>> fixed storage devices with properly working hardware, and for >>> that use case, it works perfectly fine. > >> If you're talking about "RAID"-0 or storage pools (volume >> management) that is true. But if you imply, that RAID1+ "works >> perfectly fine as long as hardware works fine" this is >> fundamentally wrong. > > I really agree with this, the argument about "properly working > hardware" is utterly ridiculous. I'll to this: apparently I am > not the first one to discover the "anomalies" in the "RAID" > profiles, but I may have been the first to document some of > them, e.g. the famous issues with the 'raid1' profile. How did I > discover them? Well, I had used Btrfs in single device mode for > a bit, and wanted to try multi-device, and the docs seemed > "strange", so I did tests before trying it out. > > The tests were simply on a spare PC with a bunch of old disks to > create two block devices (partitions), put them in 'raid1' first > natively, then by adding a new member to an existing partition, > and then 'remove' one, or simply unplug it (actually 'echo 1 > > /sys/block/.../device/delete') initially. I wanted to check > exactly what happened, resync times, speed, behaviour and speed > when degraded, just ordinary operational tasks. > > Well I found significant problems after less than one hour. I > can't imagine anyone with some experience of hw or sw RAID > (especially hw RAID, as hw RAID firmware is often fantastically > buggy especially as to RAID operations) that wouldn't have done > the same tests before operational use, and would not have found > the same issues too straight away. The only guess I could draw > is that whover designed the "RAID" profile had zero operational > system administration experience. Or possibly that you didn't read the documentation thoroughly at all, which any reasonable system administrator would do before even starting to test something. Unless you were doing stupid stuff like running for extended periods of time with half an array or not trying at all to repair things after the device reappeared, then none of what you described should have caused any issues. > >> If the hardware needs to work properly for the RAID to work >> properly, noone would need this RAID in the first place. > > It is not just that, but some maintenance operations are needed > even if the hardware works properly: for example preventive > maintenance, replacing drives that are becoming too old, > expanding capacity, testing periodically hardware bits. Systems > engineers don't just say "it works, let's assume it continues to > work properly, why worry". Really? So replacing hard drives just doesn't work on BTRFS? Hmm... Then that means that all the testing I do regularly of reshaping arrays and replacing devices that is consistently working (except for raid5 and raid6, but those have other issues too right now) must be a complete fluke. 
I guess I have to go check my hardware and the QEMU sources to figure out how those are broken such that all of this is working successfully... Seriously though, did you even _test_ replacing devices using the procedures described in the documentation, or did you just see that things didn't work in the couple of cases you thought were most important and assume nothing else worked? > > My impression is that multi-device and "chunks" were designed in > one way by someone, and someone else did not understand the > intent, and confused them with "RAID", and based the 'raid' > profiles on that confusion. For example the 'raid10' profile > seems the least confused to me, and that's I think because the > "RAID" aspect is kept more distinct from the "multi-device" > aspect. But perhaps I am an optimist... Then names were a stupid choice intended to convey the basic behavior in a way that idiots who have no business being sysadmins could understand (and yes, the raid1 profiles do behave as someone with a naive understanding of RAID1 as simple replication would expect). Unfortunately, we're stuck with them now, and there's no point in complaining beyond just acknowledging that the names were a poor choice. > > To simplify a longer discussion to have "RAID" one needs an > explicit design concept of "stripe", which in Btrfs needs to be > quite different from that of "set of member devices" and > "chunks", so that for example adding/removing to a "stripe" is > not quite the same thing as adding/removing members to a volume, > plus to make a distinction between online and offline members, > not just added and removed ones, and well-defined state machine > transitions (e.g. in response to hardware problems) among all > those, like in MD RAID. But the importance of such distinctions > may not be apparent to everybody. Or maybe people are sensible and don't care about such distinctions as long as things work within the defined parameters? It's only engineers and scientists that care about how and why (or stuffy bureaucrats who want control over things). Regular users, and even some developers don't care about the exact implementation provided it works how they need it to work. > [Obviously intentionally inflammatory comment removed] ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-18 19:43 ` Tomasz Pala 2017-12-18 22:01 ` Peter Grandi @ 2017-12-19 12:25 ` Austin S. Hemmelgarn 2017-12-19 14:46 ` Tomasz Pala 1 sibling, 1 reply; 61+ messages in thread From: Austin S. Hemmelgarn @ 2017-12-19 12:25 UTC (permalink / raw) To: Tomasz Pala, Linux fs Btrfs On 2017-12-18 14:43, Tomasz Pala wrote: > On Mon, Dec 18, 2017 at 08:06:57 -0500, Austin S. Hemmelgarn wrote: > >> The fact is, the only cases where this is really an issue is if you've >> either got intermittently bad hardware, or are dealing with external > > Well, the RAID1+ is all about the failing hardware. About catastrophically failing hardware, not intermittent failure. > >> storage devices. For the majority of people who are using multi-device >> setups, the common case is internally connected fixed storage devices >> with properly working hardware, and for that use case, it works >> perfectly fine. > > If you're talking about "RAID"-0 or storage pools (volume management) > that is true. > But if you imply, that RAID1+ "works perfectly fine as long as hardware > works fine" this is fundamentally wrong. If the hardware needs to work > properly for the RAID to work properly, noone would need this RAID in > the first place. I never said the hardware needed to not fail, just that it needed to fail in a consistent manner. BTRFS handles catastrophic failures of storage devices just fine right now. It has issues with intermittent failures, but so does hardware RAID, and so do MD and LVM to a lesser degree. > >> that BTRFS should not care. At the point at which a device is dropping >> off the bus and reappearing with enough regularity for this to be an >> issue, you have absolutely no idea how else it's corrupting your data, >> and support of such a situation is beyond any filesystem (including ZFS). > > Support for such situation is exactly what RAID performs. So don't blame > people for expecting this to be handled as long as you call the > filesystem feature a 'RAID'. No, classical RAID (other than RAID0) is supposed to handle catastrophic failure of component devices. That is the entirety of the original design purpose, and that is the entirety of what you should be using it for in production. The point at which you are getting random corruption on a disk and you're using anything but BTRFS for replication, you _NEED_ to replace that disk, and if you don't you risk it causing corruption on the other disk. As of right now, BTRFS is no different in that respect, but I agree that it _should_ be able to handle such a situation eventually. > > If this feature is not going to mitigate hardware hiccups by design (as > opposed to "not implemented yet, needs some time", which is perfectly > understandable), just don't call it 'RAID'. It shouldn't have been called RAID in the first place, that we can agree on (even if for different reasons). > > All the features currently working, like bit-rot mitigation for > duplicated data (dup/raid*) using checksums, are something different > than RAID itself. RAID means "survive failure of N devices/controllers" > - I got one "RAID1" stuck in r/o after degraded mount, not nice... Not > _expected_ to happen after single disk failure (without any reappearing). And that's a known bug on older kernels (not to mention that you should not be mounting writable and degraded for any purpose other than fixing the volume). ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-19 12:25 ` Austin S. Hemmelgarn @ 2017-12-19 14:46 ` Tomasz Pala 2017-12-19 16:35 ` Austin S. Hemmelgarn ` (2 more replies) 0 siblings, 3 replies; 61+ messages in thread From: Tomasz Pala @ 2017-12-19 14:46 UTC (permalink / raw) To: Linux fs Btrfs On Tue, Dec 19, 2017 at 07:25:49 -0500, Austin S. Hemmelgarn wrote: >> Well, the RAID1+ is all about the failing hardware. > About catastrophically failing hardware, not intermittent failure. It shouldn't matter - as long as a disk that fails once is kicked out of the array *if possible*. Or reattached in write-only mode as a best effort, meaning "will try to keep your *redundancy* copy, but won't trust it to be read from". As you see, the "failure level handled" is determined not by the definition, but by the implementation. *if possible* == when there are other volume members having the same data /or/ there are spare members that could take over the failing ones. > I never said the hardware needed to not fail, just that it needed to > fail in a consistent manner. BTRFS handles catastrophic failures of > storage devices just fine right now. It has issues with intermittent > failures, but so does hardware RAID, and so do MD and LVM to a lesser > degree. When planning hardware failovers/backups I can't predict the failure pattern. So first of all - every *known* shortcoming should be documented somehow. Secondly - permanent failures are not handled "just fine", as there is (1) no automatic mount as degraded, so the machine won't reboot properly and (2) the r/w degraded mount is[*] a one-timer. Again, this should be: 1. documented in the manpage, as a comment to the profiles, not on a wiki page or in the linux-btrfs archives, 2. printed on screen when creating/converting the "RAID1" profile (by the btrfs tools), 3. blown into one's face when doing an r/w degraded mount (by the kernel). [*] yes, I know the recent kernels handle this, but the last LTS (4.14) is just too young. I'm not aware of the issues with MD you're referring to - I got drives kicked off many times and they were *never* causing any problems despite being visible in the system. Moreover, since 4.10 there is FAILFAST, which would do this even faster. There is also no problem with mounting a degraded MD array automatically, so saying that btrfs is doing "just fine" is, well... not even theoretically close. And in my practice it never saved the day, but has already ruined a few... It's not right for the protection to make more problems than it solves. > No, classical RAID (other than RAID0) is supposed to handle catastrophic > failure of component devices. That is the entirety of the original > design purpose, and that is the entirety of what you should be using it > for in production. 1. no, it's not: https://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf 2. even if it were, a single I/O failure (e.g. one bad block) might be interpreted as "catastrophic" and the entire drive should be kicked off then. 3. if the sysadmin doesn't request any kind of device autobinding, the device that has already failed doesn't matter anymore - regardless of its current state or reappearances. > The point at which you are getting random corruption > on a disk and you're using anything but BTRFS for replication, you > _NEED_ to replace that disk, and if you don't you risk it causing > corruption on the other disk. Not only BTRFS, there are hardware solutions like T10 PI/DIF. Guess what a RAID controller should do in such a situation? Fail the drive immediately after the first CRC mismatch?
BTW do you consider "random corruption" as a catastrophic failure? > As of right now, BTRFS is no different in > that respect, but I agree that it _should_ be able to handle such a > situation eventually. The first step should be to realize that there are some tunables required if you want to handle many different situations. Having said that, let's get back to reality: Classical RAID is about keeping the system functional - trashing a single drive from a RAID1 should be fully ignorable by the sysadmin. The system must reboot properly, work properly and there MUST NOT be ANY functional differences compared to non-degraded mode except for a slower read rate (and having no more redundancy, obviously). - not having this == not having RAID1. > It shouldn't have been called RAID in the first place, that we can agree > on (even if for different reasons). The misnaming would be much less of a problem if it were documented properly (man page, btrfs-progs and finally kernel screaming). >> - I got one "RAID1" stuck in r/o after degraded mount, not nice... Not >> _expected_ to happen after single disk failure (without any reappearing). > And that's a known bug on older kernels (not to mention that you should > not be mounting writable and degraded for any purpose other than fixing > the volume). Yes, ...but: 1. "known" only to the people that have already stepped into it, meaning too late - it should be "COMMONLY known", i.e. documented, 2. "older kernels" are not so old, the newest mature LTS (4.9) is still affected, 3. I was about to fix the volume when the machine accidentally rebooted. Which should do no harm if I had a RAID1. 4. As already said before, using r/w degraded RAID1 is FULLY ACCEPTABLE, as long as you accept "no more redundancy"... 4a. ...or had an N-way mirror and there is still some redundancy if N>2. Since we agree that btrfs RAID != common RAID, as there are/were different design principles and some features are in WIP state at best, the current behaviour should be better documented. That's it. -- Tomasz Pala <gotar@pld-linux.org> ^ permalink raw reply [flat|nested] 61+ messages in thread
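For anyone hitting the "one-timer" behaviour discussed above: on kernels without the newer per-chunk degraded checks, a raid1 volume written to while a device is missing allocates the new chunks with the single profile, and the next mount then refuses degraded read-write because those chunks have no second copy. A rough recovery sketch, with device names and the mount point as placeholder assumptions:

  mount -o degraded /dev/sdb /mnt
  btrfs device add /dev/sdd /mnt        # bring in a replacement first
  btrfs device delete missing /mnt      # then drop the dead member
  # convert back any chunks that were written as 'single' while degraded
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt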
* Re: Unexpected raid1 behaviour 2017-12-19 14:46 ` Tomasz Pala @ 2017-12-19 16:35 ` Austin S. Hemmelgarn 2017-12-19 17:56 ` Tomasz Pala 2017-12-19 18:31 ` George Mitchell 2017-12-19 19:35 ` Chris Murphy 2 siblings, 1 reply; 61+ messages in thread From: Austin S. Hemmelgarn @ 2017-12-19 16:35 UTC (permalink / raw) To: Tomasz Pala, Linux fs Btrfs On 2017-12-19 09:46, Tomasz Pala wrote: > On Tue, Dec 19, 2017 at 07:25:49 -0500, Austin S. Hemmelgarn wrote: > >>> Well, the RAID1+ is all about the failing hardware. >> About catastrophically failing hardware, not intermittent failure. > > It shouldn't matter - as long as disk failing once is kicked out of the > array *if possible*. Or reattached in write-only mode as a best effort, > meaning "will try to keep your *redundancy* copy, but won't trust it to > be read from". > As you see, the "failure level handled" is not by definition, but by implementation. > > *if possible* == when there are other volume members having the same > data /or/ there are spare members that could take over the failing ones. Actually, it very much does matter, at least with hardware RAID. The exact failure mode that causes issues for BTRFS (intermittent disconnects at the bus level) causes just as many issues with most hardware RAID controllers (though the exact issues are not quite the same), and is in and of itself an indicator that something else is wrong. > >> I never said the hardware needed to not fail, just that it needed to >> fail in a consistent manner. BTRFS handles catastrophic failures of >> storage devices just fine right now. It has issues with intermittent >> failures, but so does hardware RAID, and so do MD and LVM to a lesser >> degree. > > When planning hardware failovers/backups I can't predict the failing > pattern. So first of all - every *known* shortcoming should be > documented somehow. Secondly - permanent failures are not handled "just > fine", as there is (1) no automatic mount as degraded, so the machine > won't reboot properly and (2) the r/w degraded mount is[*] one-timer. > Again, this should be: > 1. documented in manpage, as a comment to profiles, not wiki page or > linux-btrfs archives, Agreed, our documentation needs consolidated in general (I would absolutely love to see it just be the man pages, and have those up on the wiki like some other software does). > 2. printed on screen when creating/converting "RAID1" profile (by btrfs tools), I don't agree on this one. It is in no way unreasonable to expect that someone has read the documentation _before_ trying to use something. > 3. blown into one's face when doing r/w degraded mount (by kernel). Agreed here though. > > [*] yes, I know the recent kernels handle this, but the last LTS (4.14) > is just too young. 4.14 should have gotten that patch last I checked. > > I'm now aware of issues with MD you're referring to - I got drives > kicked off many times and they were *never* causing any problems despite > being visible in the system. Moreover, since 4.10 there is FAILFAST > which would do this even faster. There is also no problem with mounting > degraded MD array automatically, so telling that btrfs is doing "just > fine" is, well... not even theoretically close. And in my practice it > never saved the day, but already ruined a few ones... It's not right for > the protection to make more problems than it solves. 
Regarding handling of degraded mounts, BTRFS _is_ working just fine, we just chose a different default behavior from MD and LVM (we make certain the user knows about the issue without having to look through syslog). > >> No, classical RAID (other than RAID0) is supposed to handle catastrophic >> failure of component devices. That is the entirety of the original >> design purpose, and that is the entirety of what you should be using it >> for in production. > > 1. no, it's not: https://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf OK, so I see here performance as a motivation, but listed secondarily to reliability, and all the discussion of reliability assumes that either: 1. Disks fail catastrophically. or: 2. Disks return read or write errors when there is a problem. Following just those constraints, RAID is not designed to handle devices that randomly drop off the bus and reappear or exhibit silent data corruption, so my original statement largely was accurate, the primary design intent was handling of catastrophic failures. > > 2. even if there was, the single I/O failure (e.g. one bad block) might > be interpreted as "catastrophic" and the entire drive should be kicked off then. This I will agree with, given that it's common behavior in many RAID implementations. As people are quick to point out BTRFS _IS NOT_ RAID, the devs just made a poor choice in the original naming of the 2-way replication implementation, and it stuck. > > 3. if sysadmin doesn't request any kind of device autobinding, the > device that were already failed doesn't matter anymore - regardless of > it's current state or reappearences. You have to explicitly disable automatic binding of drivers to hot-plugged devices though, so that's rather irrelevant. Yes, you can do so yourself if you want, and it will mitigate one of the issues with BTRFS to a limited degree (we still don't 'kick-out' old devices, even if we should). > >> The point at which you are getting random corruption >> on a disk and you're using anything but BTRFS for replication, you >> _NEED_ to replace that disk, and if you don't you risk it causing >> corruption on the other disk. > > Not only BTRFS, there are hardware solutions like T10 PI/DIF. > Guess what should RAID controller do in such situation? Fail > drive immediately after the first CRC mismatch? If it's more than single errors, yes, it should fail the drive. If you're getting any kind of recurring corruption, it's time to replace the drive, whether the error gets corrected or not. > > BTW do you consider "random corruption" as a catastrophic failure? No, catastrophic failure in reference to hard drives is (usually) mechanical failure rendering the drive unusable (such as a head crash for example), or a complete controller failure (for example, the drive won't enumerate at all). To use a (possibly strained) analogy: Catastrophic failure is like a handgun blowing up when you try to fire it, you won't be able to use it ever again. Random corruption is equivalent to not consistently feeding new rounds from the magazine properly, it still technically works, and can (theoretically) be fixed, but it's usually just simpler (and significantly safer) to replace the gun than it is to try and jury rig things so that it works reliably. > >> As of right now, BTRFS is no different in >> that respect, but I agree that it _should_ be able to handle such a >> situation eventually. 
> > The first step should be to realize, that there are some tunables > required if you want to handle many different situation. > > Having said that, let's back to reallity: > > > The classical RAID is about keeping the system functional - trashing a > single drive from RAID1 should be fully-ignorable by sysadmin. The > system must reboot properly, work properly and there MUST NOT by ANY > functional differences compared to non-degraded mode except for slower > read rate (and having no more redundancy obviously). 'No functional differences' isn't even a standard that MD or LVM achieve, and it's definitely not one that most hardware RAID controllers have. > > - not having this == not having RAID1. Again, BTRFS _IS NOT_ RAID. > >> It shouldn't have been called RAID in the first place, that we can agree >> on (even if for different reasons). > > The misnaming would be much less of a problem if it were documented > properly (man page, btrfs-progs and finally kernel screaming). Yes, our documentation could be significantly better. > >>> - I got one "RAID1" stuck in r/o after degraded mount, not nice... Not >>> _expected_ to happen after single disk failure (without any reappearing). >> And that's a known bug on older kernels (not to mention that you should >> not be mounting writable and degraded for any purpose other than fixing >> the volume). > > Yes, ...but: > > 1. "known" only to the people that already stepped into it, meaning too > late - it should be "COMMONLY known", i.e. documented, And also known to people who have done proper research. > 2. "older kernels" are not so old, the newest mature LTS (4.9) is still > affected, I really don't see this as a valid excuse. It's pretty well documented that you absolutely should be running the most recent kernel if you're using BTRFS. > 3. I was about to fix the volume, accidentally the machine has rebooted. > Which should do no harm if I had a RAID1. Agreed. > 4. As already said before, using r/w degraded RAID1 is FULLY ACCEPTABLE, > as long as you accept "no more redundancy"... This is a matter of opinion. I still contend that running half a two device array for an extended period of time without reshaping it to be a single device is a bad idea for cases other than BTRFS. The fewer layers of code you're going through, the safer you are. > 4a. ...or had an N-way mirror and there is still some redundancy if N>2. N-way mirroring is still on the list of things to implement, believe me, many people want it. > > > Since we agree, that btrfs RAID != common RAID, as there are/were > different design principles and some features are in WIP state at best, > the current behaviour should be better documented. That's it. Patches would be gratefully accepted. It's really not hard to update the documentation, it's just that nobody has had the time to do it. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-19 16:35 ` Austin S. Hemmelgarn @ 2017-12-19 17:56 ` Tomasz Pala 2017-12-19 19:47 ` Chris Murphy 2017-12-19 20:11 ` Austin S. Hemmelgarn 0 siblings, 2 replies; 61+ messages in thread From: Tomasz Pala @ 2017-12-19 17:56 UTC (permalink / raw) To: Linux fs Btrfs On Tue, Dec 19, 2017 at 11:35:02 -0500, Austin S. Hemmelgarn wrote: >> 2. printed on screen when creating/converting "RAID1" profile (by btrfs tools), > I don't agree on this one. It is in no way unreasonable to expect that > someone has read the documentation _before_ trying to use something. Provided there are: - decent documentation AND - an appropriate[*] level of "common knowledge" AND - stable behaviour and mature code (kernel, tools etc.) BTRFS lacks all of these - there are major functional changes in current kernels, reaching far beyond the LTS ones. All the knowledge YOU have here, on this mailing list, should be 'engraved' into btrfs-progs, as there are people still using kernels with serious malfunctions. btrfs-progs could easily check the kernel version and print an appropriate warning - consider this "software quirks" handling. [*] by 'appropriate' I mean knowledge as common as the real-world usage itself. Moreover, the fact that I've read the documentation and did comprehensive[**] research today doesn't mean I should have to do it again after a kernel change, for example. [**] apparently what I thought was comprehensive wasn't at all. Most of the btrfs quirks I've found HERE. As a regular user, not an fs developer, I shouldn't even be looking at this list. BTW, doesn't SuSE use btrfs by default? Would you expect everyone using this distro to research every component used? >> [*] yes, I know the recent kernels handle this, but the last LTS (4.14) >> is just too young. > 4.14 should have gotten that patch last I checked. I meant too young to be widely adopted yet. This requires some countermeasures in the part of the toolkit that is easier to upgrade, like userspace. > Regarding handling of degraded mounts, BTRFS _is_ working just fine, we > just chose a different default behavior from MD and LVM (we make certain > the user knows about the issue without having to look through syslog). I'm not arguing about the behaviour - apparently there were some technical reasons. But IF the reasons are not technical, but philosophical, I'd like to have either a mount option (allow_degraded) or even a kernel-level configuration knob for this to happen RAID-style. Now, if the current kernels won't toggle a degraded RAID1 to ro, can I safely add "degraded" to the mount options? My primary concern is the machine UPTIME. I care less about the data, as they are backed up to some remote location and losing a day or week of changes is acceptable, split-brain as well, while every hour of downtime costs me real money. Meanwhile I can't fix a broken server using 'remote hands' - mounting a degraded volume means using a physical keyboard or KVM, which might not be available at the site. Current btrfs behaviour requires physical presence AND downtime (if a machine rebooted) for fixing things that could be fixed remotely and on-line. Anyway, users shouldn't have to look through syslog; device status should be reported by some monitoring tool. A deviation this big (relative to common RAID1 scenarios) deserves to be documented. Or renamed... > reliability, and all the discussion of reliability assumes that either: > 1. Disks fail catastrophically. > or: > 2. Disks return read or write errors when there is a problem.
> > Following just those constraints, RAID is not designed to handle devices > that randomly drop off the bus and reappear If it drops, there would be I/O errors eventually. Without the errors - agreed. > implementations. As people are quick to point out BTRFS _IS NOT_ RAID, > the devs just made a poor choice in the original naming of the 2-way > replication implementation, and it stuck. Well, the question is: either it is not RAID YET, or maybe it's time to consider renaming? >> 3. if the sysadmin doesn't request any kind of device autobinding, the >> device that has already failed doesn't matter anymore - regardless of >> its current state or reappearances. > You have to explicitly disable automatic binding of drivers to > hot-plugged devices though, so that's rather irrelevant. Yes, you can Ha! I got this disabled on every bus (although for different reasons) after boot completes. Lucky me :) >> 1. "known" only to the people that have already stepped into it, meaning too >> late - it should be "COMMONLY known", i.e. documented, > And also known to people who have done proper research. All the OpenSUSE userbase? ;) >> 2. "older kernels" are not so old, the newest mature LTS (4.9) is still >> affected, > I really don't see this as a valid excuse. It's pretty well documented > that you absolutely should be running the most recent kernel if you're > using BTRFS. Good point. >> 4. As already said before, using r/w degraded RAID1 is FULLY ACCEPTABLE, >> as long as you accept "no more redundancy"... > This is a matter of opinion. Sure! And the particular opinion depends on the system being affected. I'd rather not have any split-brain scenario under my database servers, but also won't mind data loss on a BGP router as long as it keeps running and is fully operational. > I still contend that running half a two > device array for an extended period of time without reshaping it to be a > single device is a bad idea for cases other than BTRFS. The fewer > layers of code you're going through, the safer you are. I create a single-device degraded MD RAID1 when I attach one disk for deployment (usually test machines), which are going to be converted to dual-disk (production) in the future - attaching the second disk to the array is much easier and faster than messing with device nodes (or labels or anything). The same applies to LVM, it's better to have it even when not used at the moment. In the case of btrfs there is no need for such preparations, as devices are added without renaming. However, sometimes the systems end up without the second disk attached. Either due to their low importance, sometimes power usage, or because others need to be quiet. One might ask why I don't attach the second disk before initial system creation - the answer is simple: I usually use the same drive models in a RAID1, but it happens that drives bought from the same production lot fail simultaneously, so this approach mitigates the problem and gives more time to react. > Patches would be gratefully accepted. It's really not hard to update > the documentation, it's just that nobody has had the time to do it. Writing accurate documentation requires deep understanding of the internals. Me - for example, I know some of the results: "don't do this", "if X happens, Y should be done", "Z doesn't work yet, but there were some patches", "V was fixed in some recent kernel, but no idea which commit it was exactly", "W was severely broken in kernel I.J.K" etc. Not the hard data that could be posted without creating the impression that it's all about compiling a complaint list.
Not to mention I'm absolutely not familiar with current patches, WIP and many, many other corner cases or usage scenarios. In fact, not only the internals, but the motivation and design principles must be well understood to write a piece of documentation. Otherwise some "fake news" propaganda is created, just like https://suckless.org/sucks/systemd or other systemd-haters who haven't spent a day of their life writing SysV init scripts or managing a bunch of mission-critical machines with handcrafted supervisors. -- Tomasz Pala <gotar@pld-linux.org> ^ permalink raw reply [flat|nested] 61+ messages in thread
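For comparison, the MD workflow described above (start a mirror on one disk, attach the second one later) is just the following, with device names as placeholder assumptions:

  # create a RAID1 with the second slot deliberately left empty
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 missing

  # later, when the second disk is installed
  mdadm --manage /dev/md0 --add /dev/sdb2

mdadm then resyncs the new member in the background; the btrfs equivalent (device add plus a convert balance) comes up later in the thread.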
* Re: Unexpected raid1 behaviour 2017-12-19 17:56 ` Tomasz Pala @ 2017-12-19 19:47 ` Chris Murphy 2017-12-19 21:17 ` Tomasz Pala 2017-12-20 16:53 ` Andrei Borzenkov 2017-12-19 20:11 ` Austin S. Hemmelgarn 1 sibling, 2 replies; 61+ messages in thread From: Chris Murphy @ 2017-12-19 19:47 UTC (permalink / raw) To: Tomasz Pala; +Cc: Linux fs Btrfs On Tue, Dec 19, 2017 at 10:56 AM, Tomasz Pala <gotar@polanet.pl> wrote: > On Tue, Dec 19, 2017 at 11:35:02 -0500, Austin S. Hemmelgarn wrote: > >>> 2. printed on screen when creating/converting "RAID1" profile (by btrfs tools), >> I don't agree on this one. It is in no way unreasonable to expect that >> someone has read the documentation _before_ trying to use something. > > Provided there are: > - a decent documentation AND > - appropriate[*] level of "common knowledge" AND > - stable behaviour and mature code (kernel, tools etc.) > > BTRFS lacks all of these - there are major functional changes in current > kernels and it reaches far beyond LTS. All the knowledge YOU have here, > on this maillist, should be 'engraved' into btrfs-progs, as there are > people still using kernels with serious malfunctions. btrfs-progs could > easily check kernel version and print appropriate warning - consider > this a "software quirks". The more verbose man pages are, the more likely it is that information gets stale. We already see this with the Btrfs Wiki. So are you volunteering to do the btrfs-progs work to easily check kernel versions and print appropriate warnings? Or is this a case of complaining about what other people aren't doing with their time? > > BTW, doesn't SuSE use btrfs by default? Would you expect everyone using > this distro to research every component used? As far as I'm aware, only Btrfs single device stuff is "supported". The multiple device stuff is definitely not supported on openSUSE, but I have no idea to what degree they support it with enterprise license, no doubt that support must come with caveats. > >>> [*] yes, I know the recent kernels handle this, but the last LTS (4.14) >>> is just too young. >> 4.14 should have gotten that patch last I checked. > > I meant too young to be widely adopted yet. This requires some > countermeasures in the toolkit that is easier to upgrade, like userspace. > >> Regarding handling of degraded mounts, BTRFS _is_ working just fine, we >> just chose a different default behavior from MD and LVM (we make certain >> the user knows about the issue without having to look through syslog). > > I'm not arguing about the behaviour - apparently there were some > technical reasons. But IF the reasons are not technical, but > philosophical, I'd like to have either mount option (allow_degraded) or > even kernel-level configuration knob for this to happen RAID-style. They are technical, which then runs into the philosophical. Giving users a hurt me button is not ethical programming. > > Now, if the current kernels won't toggle degraded RAID1 as ro, can I > safely add "degraded" to the mount options? My primary concern is the > machine UPTIME. I care less about the data, as they are backed up to > some remote location and loosing day or week of changes is acceptable, > brain-split as well, while every hour of downtime costs me a real money. Btrfs simply is not ready for this use case. If you need to depend on degraded raid1 booting, you need to use mdadm or LVM or hardware raid. Complaining about the lack of maturity in this area? Get in line. 
Or propose a design and scope of work that needs to be completed to enable it. > Meanwhile I can't fix broken server using 'remote hands' - mounting degraded > volume means using physical keyboard or KVM which might be not available > at a site. Current btrfs behavious requires physical presence AND downtime > (if a machine rebooted) for fixing things, that could be fixed remotely > an on-line. Right. It's not ready for this use case. Complaining about this fact isn't going to make it ready for this use case. What will make it ready for the use case is a design, a lot of work, and testing. > Anyway, users shouldn't look through syslog, device status should be > reported by some monitoring tool. Yes. And it doesn't exist yet. -- Chris Murphy ^ permalink raw reply [flat|nested] 61+ messages in thread
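As an aside, the "btrfs-progs could check the kernel version" idea from earlier in the thread is easy to prototype outside btrfs-progs; a purely hypothetical wrapper might look like this (the version list is illustrative only, not an authoritative set of bad kernels):

  #!/bin/sh
  # warn when running a kernel with known btrfs multi-device quirks
  kver=$(uname -r | cut -d- -f1)
  case "$kver" in
      4.9.*|4.13.*)
          echo "WARNING: kernel $kver has known btrfs raid1 degraded-mount quirks" >&2
          ;;
  esac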
* Re: Unexpected raid1 behaviour 2017-12-19 19:47 ` Chris Murphy @ 2017-12-19 21:17 ` Tomasz Pala 2017-12-20 0:08 ` Chris Murphy 2017-12-20 16:53 ` Andrei Borzenkov 1 sibling, 1 reply; 61+ messages in thread From: Tomasz Pala @ 2017-12-19 21:17 UTC (permalink / raw) To: Linux fs Btrfs On Tue, Dec 19, 2017 at 12:47:33 -0700, Chris Murphy wrote: > The more verbose man pages are, the more likely it is that information > gets stale. We already see this with the Btrfs Wiki. So are you True. The same applies to git documentation (3rd paragraph): https://stevebennett.me/2012/02/24/10-things-i-hate-about-git/ Fortunately this CAN be done properly, one of the greatest documentations I've seen is systemd one. What I don't like about documentation is lack of objectivity: $ zgrep -i bugs /usr/share/man/man8/*btrfs*.8.gz | grep -v bugs.debian.org Nothing. The old-school manuals all had BUGS section even if it was empty. Seriously, nothing appropriate to be put in there? Documentation must be symmetric - if it mentions feature X, it must mention at least the most common caveats. > volunteering to do the btrfs-progs work to easily check kernel > versions and print appropriate warnings? Or is this a case of > complaining about what other people aren't doing with their time? This is definitely the second case. You see, I got my issues with btrfs, I already know where to use it and when not. I've learned HARD and still didn't fully recovered (some dangling r/o, some ENOSPACE due to fragmentation etc). What I /MIGHT/ help to the community is to share my opinions and suggestions. And it's all up to you, what would you do with this. Either you blame me for complaining or you ignore me - you should realize, that _I_do_not_care_, because I already know things that I write. At least some other guy, some other day would read this thread and my opinions might save HIS day. After all, using btrfs should be preceded with research. No offence, just trying to be honest with you. Because the other thing that I've learned hard in my life is to listen regular users of my products and appreciate any feedback, even if it doesn't suit me. >> Now, if the current kernels won't toggle degraded RAID1 as ro, can I >> safely add "degraded" to the mount options? My primary concern is the [...] > Btrfs simply is not ready for this use case. If you need to depend on > degraded raid1 booting, you need to use mdadm or LVM or hardware raid. > Complaining about the lack of maturity in this area? Get in line. Or > propose a design and scope of work that needs to be completed to > enable it. I thought the work was already done if current kernel handles degraded RAID1 without switching to r/o, doesn't it? Or something else is missing? -- Tomasz Pala <gotar@pld-linux.org> ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-19 21:17 ` Tomasz Pala @ 2017-12-20 0:08 ` Chris Murphy 2017-12-23 4:08 ` Tomasz Pala 0 siblings, 1 reply; 61+ messages in thread From: Chris Murphy @ 2017-12-20 0:08 UTC (permalink / raw) To: Tomasz Pala; +Cc: Linux fs Btrfs On Tue, Dec 19, 2017 at 2:17 PM, Tomasz Pala <gotar@polanet.pl> wrote: > On Tue, Dec 19, 2017 at 12:47:33 -0700, Chris Murphy wrote: > >> The more verbose man pages are, the more likely it is that information >> gets stale. We already see this with the Btrfs Wiki. So are you > > True. The same applies to git documentation (3rd paragraph): > > https://stevebennett.me/2012/02/24/10-things-i-hate-about-git/ > > Fortunately this CAN be done properly, one of the greatest > documentations I've seen is systemd one. > > What I don't like about documentation is lack of objectivity: > > $ zgrep -i bugs /usr/share/man/man8/*btrfs*.8.gz | grep -v bugs.debian.org > > Nothing. The old-school manuals all had BUGS section even if it was > empty. Seriously, nothing appropriate to be put in there? Documentation > must be symmetric - if it mentions feature X, it must mention at least the > most common caveats. It's reasonable to have a known bugs section in the man page, so long as people are willing to do the work adding to it and deleting it when bugs are fixed. > >> volunteering to do the btrfs-progs work to easily check kernel >> versions and print appropriate warnings? Or is this a case of >> complaining about what other people aren't doing with their time? > > This is definitely the second case. You see, I got my issues with btrfs, I > already know where to use it and when not. I've learned HARD and still > didn't fully recovered (some dangling r/o, some ENOSPACE due to > fragmentation etc). What I /MIGHT/ help to the community is to share my > opinions and suggestions. And it's all up to you, what would you do with > this. Either you blame me for complaining or you ignore me - you > should realize, that _I_do_not_care_, because I already know things that > I write. At least some other guy, some other day would read this thread and my > opinions might save HIS day. After all, using btrfs should be preceded > with research. > > No offence, just trying to be honest with you. Because the other thing > that I've learned hard in my life is to listen regular users of my > products and appreciate any feedback, even if it doesn't suit me. Btrfs development has definitely been a lot more fractured and wild west and people who do the work get to dictate the direction, than perhaps other Linux file systems, and certainly that's true compared to ZFS which has a small team with a very clear ideology and direction established from the outset. > >>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I >>> safely add "degraded" to the mount options? My primary concern is the > [...] >> Btrfs simply is not ready for this use case. If you need to depend on >> degraded raid1 booting, you need to use mdadm or LVM or hardware raid. >> Complaining about the lack of maturity in this area? Get in line. Or >> propose a design and scope of work that needs to be completed to >> enable it. > > I thought the work was already done if current kernel handles degraded RAID1 > without switching to r/o, doesn't it? Or something else is missing? Well it only does rw once, then the next degraded is ro - there are patches dealing with this better but I don't know the state. 
And there's no resync code that I'm aware of, absolutely it's not good enough to just kick off a full scrub - that has huge performance implications and I'd consider it a regression compared to functionality in LVM and mdadm RAID by default with the write intent bitmap. Without some equivalent short cut, automatic degraded means a decently likely scenario where a slightly late assembly at boot time will end up requiring a full scrub. That's not an improvement over manual degraded so people aren't hit with even more silent consequences of their unfortunate situation. -- Chris Murphy ^ permalink raw reply [flat|nested] 61+ messages in thread
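For reference, the write-intent bitmap Chris mentions is an existing mdadm feature and can be added to an array after the fact:

  # add an internal write-intent bitmap so a briefly-absent member
  # only needs its dirty regions resynced, not a full rebuild
  mdadm --grow --bitmap=internal /dev/md0

Nothing comparable exists in btrfs today, which is the gap being pointed at here.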
* Re: Unexpected raid1 behaviour 2017-12-20 0:08 ` Chris Murphy @ 2017-12-23 4:08 ` Tomasz Pala 2017-12-23 5:23 ` Duncan 0 siblings, 1 reply; 61+ messages in thread From: Tomasz Pala @ 2017-12-23 4:08 UTC (permalink / raw) To: Linux fs Btrfs On Tue, Dec 19, 2017 at 17:08:28 -0700, Chris Murphy wrote: >>>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I >>>> safely add "degraded" to the mount options? My primary concern is the >> [...] > > Well it only does rw once, then the next degraded is ro - there are > patches dealing with this better but I don't know the state. And > there's no resync code that I'm aware of, absolutely it's not good > enough to just kick off a full scrub - that has huge performance > implications and I'd consider it a regression compared to > functionality in LVM and mdadm RAID by default with the write intent > bitmap. Without some equivalent short cut, automatic degraded means a I read about 'scrub' all the time here, so let me ask this directly, as it is also not documented clearly: 1. is a full scrub required after ANY desync (like a degraded mount followed by re-adding the old device)? 2. if the scrub is omitted - is it possible that btrfs returns invalid data (from the desynced and re-added drive)? 3. is the scrub required to be scheduled on a regular basis? By 'required' I mean by design/implementation issues/quirks, _not_ related to possible hardware malfunctions. -- Tomasz Pala <gotar@pld-linux.org> ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-23 4:08 ` Tomasz Pala @ 2017-12-23 5:23 ` Duncan 0 siblings, 0 replies; 61+ messages in thread From: Duncan @ 2017-12-23 5:23 UTC (permalink / raw) To: linux-btrfs Tomasz Pala posted on Sat, 23 Dec 2017 05:08:16 +0100 as excerpted: > On Tue, Dec 19, 2017 at 17:08:28 -0700, Chris Murphy wrote: > >>>>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I >>>>> safely add "degraded" to the mount options? My primary concern is >>>>> the >>> [...] >> >> Well it only does rw once, then the next degraded is ro - there are >> patches dealing with this better but I don't know the state. And >> there's no resync code that I'm aware of, absolutely it's not good >> enough to just kick off a full scrub - that has huge performance >> implications and I'd consider it a regression compared to functionality >> in LVM and mdadm RAID by default with the write intent bitmap. Without >> some equivalent short cut, automatic degraded means a > > I read about the 'scrub' all over the time here, so let me ask this > directly, as this is also not documented clearly: > > 1. is the full scrub required after ANY desync? (like: degraded mount > followed by readding old device)? It is very strongly recommended. > 2. if the scrub is omitted - is it possible that btrfs return invalid > data (from the desynced and readded drive)? Were invalid data returned it would be a bug. However, a reasonably common refrain here is that btrfs is "still stabilizing, not yet fully stable and mature", so occasional bugs can be expected, tho both the ideal and experience suggests that they're gradually reducing in frequency and severity as time goes on and we get closer to "fully stable and mature". Which of course is why both having usable and tested backups, and keeping current with the kernel, are strongly recommended as well, the first in case one of those bugs does hit and it's severe enough to take out your working btrfs, the second because later kernels have fewer known bugs in the first place. Functioning as designed as as intent-coded, in the case of a desync, btrfs will use the copy with the latest generation/transid serial, and thus should never return older data from the desynced device. Further, btrfs is designed to be self-healing and will transparently rewrite the out-of-sync copy, syncing it in the process, as it comes across each stale block. But the only way to be sure everything's consistent again is that scrub, and of course if something should happen to the only current copy while the desync still has the other copy stale, /then/ you lose data. And as I said, that's functioning as designed and intent-coded, assuming no bugs, an explicitly unsafe assumption given btrfs' "still stabilizing" state. So... "strongly recommended" indeed, tho in theory it shouldn't be absolutely required as long as unlucky fate doesn't strike before the data is transparently synced in normal usage. YMMV, but I definitely do those scrubs here. > 3. is the scrub required to be scheduled on regular basis? By 'required' > I mean by design/implementation issues/quirks, _not_ related to possible > hardware malfunctions. Perhaps I'm tempting fate, but I don't do scheduled/regular scrubs here. Only if I have an ungraceful shutdown or see complaints in the log (which I tail to a system status dashboard so I'd be likely to notice a problem one way or the other pretty quickly). 
But I do keep those backups, and while it has been quite some time (over a year, I'd say about 18 months to two years, and I was actually able to use btrfs restore and avoid having to use the backups themselves the last time it happened even 18 months or whatever ago) now since I had to use them, I /did/ actually spend some significant money upgrading my backups to all-SSD in ordered to make updating those backups easier and encourage me to keep them much more current than I had been (btrfs restore saved me more trouble than I'm comfortable admitting, given that I /did/ have backups, but they weren't the freshest at the time). If as some people I had my backups offsite and would have to download them if I actually needed them, I'd potentially be rather stricter and schedule regular scrubs. So by design and intention-coding, no, regularly scheduled scrubs aren't "required". But I'd treat them the same as I would on non-btrfs raid, or a bit stricter given the above discussed btrfs stability status. If you'd be uncomfortable not scheduling regular scrubs on your non-btrfs raid, you better be uncomfortable not scheduling them on btrfs as well! And as always, btrfs or no btrfs, scrub or no scrub, have your backups or you are literally defining your data as not worth the time/trouble/ resources necessary to do them, and some day, maybe 10 minutes from now, maybe 10 years from now, fate's going to call you on that definition! (Yes, I know /you/ know that or we'd not have this thread, which demonstrates that you /do/ care about your data. But it's as much about the lurkers and googlers coming across the thread later as it is the direct participants...) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 61+ messages in thread
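In concrete terms, the scrub-after-desync recommended above is just the following (the mount point is a placeholder):

  # foreground scrub with per-device statistics
  btrfs scrub start -Bd /mnt
  btrfs scrub status /mnt
  # and check the per-device error counters while at it
  btrfs device stats /mnt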
* Re: Unexpected raid1 behaviour 2017-12-19 19:47 ` Chris Murphy 2017-12-19 21:17 ` Tomasz Pala @ 2017-12-20 16:53 ` Andrei Borzenkov 2017-12-20 16:57 ` Austin S. Hemmelgarn 2017-12-20 20:02 ` Chris Murphy 1 sibling, 2 replies; 61+ messages in thread From: Andrei Borzenkov @ 2017-12-20 16:53 UTC (permalink / raw) To: Chris Murphy, Tomasz Pala; +Cc: Linux fs Btrfs 19.12.2017 22:47, Chris Murphy пишет: > >> >> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using >> this distro to research every component used? > > As far as I'm aware, only Btrfs single device stuff is "supported". > The multiple device stuff is definitely not supported on openSUSE, but > I have no idea to what degree they support it with enterprise license, > no doubt that support must come with caveats. > I was rather surprised seeing RAID1 and RAID10 listed as supported in SLES 12.x release notes, especially as there is no support for multi-device btrfs in YaST and hence no way to even install on such filesystem. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-20 16:53 ` Andrei Borzenkov @ 2017-12-20 16:57 ` Austin S. Hemmelgarn 2017-12-20 20:02 ` Chris Murphy 1 sibling, 0 replies; 61+ messages in thread From: Austin S. Hemmelgarn @ 2017-12-20 16:57 UTC (permalink / raw) To: Andrei Borzenkov, Chris Murphy, Tomasz Pala; +Cc: Linux fs Btrfs On 2017-12-20 11:53, Andrei Borzenkov wrote: > 19.12.2017 22:47, Chris Murphy пишет: >> >>> >>> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using >>> this distro to research every component used? >> >> As far as I'm aware, only Btrfs single device stuff is "supported". >> The multiple device stuff is definitely not supported on openSUSE, but >> I have no idea to what degree they support it with enterprise license, >> no doubt that support must come with caveats. >> > > I was rather surprised seeing RAID1 and RAID10 listed as supported in > SLES 12.x release notes, especially as there is no support for > multi-device btrfs in YaST and hence no way to even install on such > filesystem. That's the beauty of it all though, you don't need to install on such a setup directly like you would need to with hardware RAID, you can install in single-device mode and then convert the system on-line to use multiple devices, and that will (usually) be faster than a direct install if you're using replication (unless you're using RAID10 and have a _lot_ of disks). ^ permalink raw reply [flat|nested] 61+ messages in thread
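The on-line conversion Austin describes amounts to roughly this, assuming a second disk /dev/sdb is being added to a single-device root filesystem:

  btrfs device add /dev/sdb /
  # convert both data and metadata to the raid1 profile
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /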
* Re: Unexpected raid1 behaviour 2017-12-20 16:53 ` Andrei Borzenkov 2017-12-20 16:57 ` Austin S. Hemmelgarn @ 2017-12-20 20:02 ` Chris Murphy 2017-12-20 20:07 ` Chris Murphy 1 sibling, 1 reply; 61+ messages in thread From: Chris Murphy @ 2017-12-20 20:02 UTC (permalink / raw) To: Andrei Borzenkov; +Cc: Chris Murphy, Tomasz Pala, Linux fs Btrfs On Wed, Dec 20, 2017 at 9:53 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote: > 19.12.2017 22:47, Chris Murphy пишет: >> >>> >>> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using >>> this distro to research every component used? >> >> As far as I'm aware, only Btrfs single device stuff is "supported". >> The multiple device stuff is definitely not supported on openSUSE, but >> I have no idea to what degree they support it with enterprise license, >> no doubt that support must come with caveats. >> > > I was rather surprised seeing RAID1 and RAID10 listed as supported in > SLES 12.x release notes, especially as there is no support for > multi-device btrfs in YaST and hence no way to even install on such > filesystem. Haha. OK well I'm at a loss then. And they use systemd which is going to run into the udev rule that prevents systemd from even attempting to mount rootfs if one or more devices are missing. So I don't know how it really gets supported. At the dracut prompt, manually mount using -o degraded to /sysroot and then exit? I guess? -- Chris Murphy ^ permalink raw reply [flat|nested] 61+ messages in thread
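For completeness, the manual escape hatch Chris describes at the dracut emergency prompt would look roughly like this; the device name and any subvol= option are assumptions that depend on the distribution's layout:

  # one raid1 member is missing, so the normal mount of /sysroot failed
  mount -o degraded /dev/sda2 /sysroot
  exit    # let dracut carry on switching root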
* Re: Unexpected raid1 behaviour 2017-12-20 20:02 ` Chris Murphy @ 2017-12-20 20:07 ` Chris Murphy 2017-12-20 20:14 ` Austin S. Hemmelgarn 2017-12-21 11:49 ` Andrei Borzenkov 0 siblings, 2 replies; 61+ messages in thread From: Chris Murphy @ 2017-12-20 20:07 UTC (permalink / raw) To: Chris Murphy; +Cc: Andrei Borzenkov, Tomasz Pala, Linux fs Btrfs On Wed, Dec 20, 2017 at 1:02 PM, Chris Murphy <lists@colorremedies.com> wrote: > On Wed, Dec 20, 2017 at 9:53 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote: >> 19.12.2017 22:47, Chris Murphy пишет: >>> >>>> >>>> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using >>>> this distro to research every component used? >>> >>> As far as I'm aware, only Btrfs single device stuff is "supported". >>> The multiple device stuff is definitely not supported on openSUSE, but >>> I have no idea to what degree they support it with enterprise license, >>> no doubt that support must come with caveats. >>> >> >> I was rather surprised seeing RAID1 and RAID10 listed as supported in >> SLES 12.x release notes, especially as there is no support for >> multi-device btrfs in YaST and hence no way to even install on such >> filesystem. > > Haha. OK well I'm at a loss then. And they use systemd which is going > to run into the udev rule that prevents systemd from even attempting > to mount rootfs if one or more devices are missing. So I don't know > how it really gets supported. At the dracut prompt, manually mount > using -o degraded to /sysroot and then exit? I guess? There is an irony here: YaST doesn't have Btrfs raid1 or raid10 options; and also won't do encrypted root with Btrfs either because YaST enforces LVM to do LUKS encryption for some weird reason; and it also enforces NOT putting Btrfs on LVM. Meanwhile, Fedora/Red Hat's Anaconda installer has supported both of these use cases for something like 5 years (does support Btrfs raid1 and raid10 layouts; and also supports Btrfs directly on dmcrypt without LVM) - with the caveat that it enforces /boot to be on ext4. -- Chris Murphy ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-20 20:07 ` Chris Murphy @ 2017-12-20 20:14 ` Austin S. Hemmelgarn 2017-12-21 1:34 ` Chris Murphy 2017-12-21 11:49 ` Andrei Borzenkov 1 sibling, 1 reply; 61+ messages in thread From: Austin S. Hemmelgarn @ 2017-12-20 20:14 UTC (permalink / raw) To: Chris Murphy; +Cc: Andrei Borzenkov, Tomasz Pala, Linux fs Btrfs On 2017-12-20 15:07, Chris Murphy wrote: > On Wed, Dec 20, 2017 at 1:02 PM, Chris Murphy <lists@colorremedies.com> wrote: >> On Wed, Dec 20, 2017 at 9:53 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote: >>> 19.12.2017 22:47, Chris Murphy пишет: >>>> >>>>> >>>>> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using >>>>> this distro to research every component used? >>>> >>>> As far as I'm aware, only Btrfs single device stuff is "supported". >>>> The multiple device stuff is definitely not supported on openSUSE, but >>>> I have no idea to what degree they support it with enterprise license, >>>> no doubt that support must come with caveats. >>>> >>> >>> I was rather surprised seeing RAID1 and RAID10 listed as supported in >>> SLES 12.x release notes, especially as there is no support for >>> multi-device btrfs in YaST and hence no way to even install on such >>> filesystem. >> >> Haha. OK well I'm at a loss then. And they use systemd which is going >> to run into the udev rule that prevents systemd from even attempting >> to mount rootfs if one or more devices are missing. So I don't know >> how it really gets supported. At the dracut prompt, manually mount >> using -o degraded to /sysroot and then exit? I guess? > > > There is an irony here: > > YaST doesn't have Btrfs raid1 or raid10 options; and also won't do > encrypted root with Btrfs either because YaST enforces LVM to do LUKS > encryption for some weird reason; and it also enforces NOT putting > Btrfs on LVM. The 'LUKS must use LVM' thing is likely historical. The BCP for using LUKS is that it's at the bottom level (so you leak absolutely nothing about how your storage stack is structured), and if that's the case you need something on top to support separate filesystems, which up until BTRFS came around has solely been LVM. The 'No BTRFS on LVM' thing is likely for sanity reasons. Using BTRFS on SuSE means allocating /boot and swap, and the entire rest of the disk is BTRFS. They only support a single PV or a single BTRFS volume at the bottom level per-disk for /. > > Meanwhile, Fedora/Red Hat's Anaconda installer has supported both of > these use cases for something like 5 years (does support Btrfs raid1 > and raid10 layouts; and also supports Btrfs directly on dmcrypt > without LVM) - with the caveat that it enforces /boot to be on ext4. And this caveat is because for some reason Fedora has chosen not to integrate BTRFS support into their version of GRUB. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-20 20:14 ` Austin S. Hemmelgarn @ 2017-12-21 1:34 ` Chris Murphy 0 siblings, 0 replies; 61+ messages in thread From: Chris Murphy @ 2017-12-21 1:34 UTC (permalink / raw) To: Austin S. Hemmelgarn Cc: Chris Murphy, Andrei Borzenkov, Tomasz Pala, Linux fs Btrfs On Wed, Dec 20, 2017 at 1:14 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2017-12-20 15:07, Chris Murphy wrote: >> There is an irony here: >> >> YaST doesn't have Btrfs raid1 or raid10 options; and also won't do >> encrypted root with Btrfs either because YaST enforces LVM to do LUKS >> encryption for some weird reason; and it also enforces NOT putting >> Btrfs on LVM. > > The 'LUKS must use LVM' thing is likely historical. The BCP for using LUKS > is that it's at the bottom level (so you leak absolutely nothing about how > your storage stack is structured), and if that's the case you need something > on top to support separate filesystems, which up until BTRFS came around has > solely been LVM. *shrug* Anaconda has supported plain partition LUKS without device-mapper for ext3/4 and XFS since forever, even before the rewrite. >> Meanwhile, Fedora/Red Hat's Anaconda installer has supported both of >> these use cases for something like 5 years (does support Btrfs raid1 >> and raid10 layouts; and also supports Btrfs directly on dmcrypt >> without LVM) - with the caveat that it enforces /boot to be on ext4. > > And this caveat is because for some reason Fedora has chosen not to > integrate BTRFS support into their version of GRUB. No. The Fedora patchset for upstream GRUB doesn't remove Btrfs support. However, they don't use grub-mkconfig to rewrite the grub.cfg when a new kernel is installed. Instead, they use an unrelated project called grubby, which modifies the existing grub.cfg (and also supports most all other configs like syslinux/extlinux, yaboot, uboot, lilo, and others). And grubby gets confused [1] if the grub.cfg is on a subvolume (other than ID 5). If the grub.cfg is in the ID 5 subvolume, in a normal directory structure, it works fine. Chris Murphy [1] Gory details The central part of the confusion appears to be this sequence of comments in this insanely long bug: https://bugzilla.redhat.com/show_bug.cgi?id=864198#c3 https://bugzilla.redhat.com/show_bug.cgi?id=864198#c5 https://bugzilla.redhat.com/show_bug.cgi?id=864198#c6 https://bugzilla.redhat.com/show_bug.cgi?id=864198#c7 The comments from Gene Czarcinski (now deceased, that's how old this bug is) try to negotiate understanding the problem and he had a fix but it didn't meet some upstream grubby requirement, and so the patch wasn't accepted. Grubby is sufficiently messy that near as I can tell no other distribution uses it, and no one really cares to maintain it until something in RHEL breaks and then *that* gets attention. Upstream bug https://github.com/rhboot/grubby/issues/22 ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-20 20:07 ` Chris Murphy 2017-12-20 20:14 ` Austin S. Hemmelgarn @ 2017-12-21 11:49 ` Andrei Borzenkov 1 sibling, 0 replies; 61+ messages in thread From: Andrei Borzenkov @ 2017-12-21 11:49 UTC (permalink / raw) To: Chris Murphy; +Cc: Tomasz Pala, Linux fs Btrfs On Wed, Dec 20, 2017 at 11:07 PM, Chris Murphy <lists@colorremedies.com> wrote: > > YaST doesn't have Btrfs raid1 or raid10 options; and also won't do > encrypted root with Btrfs either because YaST enforces LVM to do LUKS > encryption for some weird reason; and it also enforces NOT putting > Btrfs on LVM. > That's incorrect, btrfs on LVM is default on some SLES flavors and one of the three standard proposals (where you do not need to go in expert mode) - normal partitions, LVM, encrypted LVM - even on openSUSE. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-19 17:56 ` Tomasz Pala 2017-12-19 19:47 ` Chris Murphy @ 2017-12-19 20:11 ` Austin S. Hemmelgarn 2017-12-19 21:58 ` Tomasz Pala 2017-12-19 23:53 ` Chris Murphy 1 sibling, 2 replies; 61+ messages in thread From: Austin S. Hemmelgarn @ 2017-12-19 20:11 UTC (permalink / raw) To: Tomasz Pala, Linux fs Btrfs On 2017-12-19 12:56, Tomasz Pala wrote: > On Tue, Dec 19, 2017 at 11:35:02 -0500, Austin S. Hemmelgarn wrote: > >>> 2. printed on screen when creating/converting "RAID1" profile (by btrfs tools), >> I don't agree on this one. It is in no way unreasonable to expect that >> someone has read the documentation _before_ trying to use something. > > Provided there are: > - a decent documentation AND > - appropriate[*] level of "common knowledge" AND > - stable behaviour and mature code (kernel, tools etc.) > > BTRFS lacks all of these - there are major functional changes in current > kernels and it reaches far beyond LTS. All the knowledge YOU have here, > on this maillist, should be 'engraved' into btrfs-progs, as there are > people still using kernels with serious malfunctions. btrfs-progs could > easily check kernel version and print appropriate warning - consider > this a "software quirks". Except the systems running on those ancient kernel versions are not necessarily using a recent version of btrfs-progs. It might be possible to write up a script to check the kernel version and report known issues with it, but I don't think having it tightly integrated will be much help, at least not for quite some time. > > [*] by 'appropriate' I mean knowledge so common, as the real word usage > itself. > > Moreover, the fact that I've read the documentation and did a > comprehensive[**] reseach today, doesn't mean I should do this again > after kernel change for example. > > [**] apparently what I thought was comprehensive, wasn't at all. Most of > the btrfs quirks I've found HERE. As a regular user, not fs developer, I > shouldn't be even looking at this list. That last bit is debatable. BTRFS doesn't have separate developer and user lists, so this list serves both purposes (though IRC also serves some of the function of a user list). I'll agree that searching the archives shouldn't be needed to get a baseline of knowledge. > > BTW, doesn't SuSE use btrfs by default? Would you expect everyone using > this distro to research every component used? SuSE also provides very good support by themselves. > >>> [*] yes, I know the recent kernels handle this, but the last LTS (4.14) >>> is just too young. >> 4.14 should have gotten that patch last I checked. > > I meant too young to be widely adopted yet. This requires some > countermeasures in the toolkit that is easier to upgrade, like userspace. So in other words, spend the time to write up code for btrfs-progs that will then be run by a significant minority of users because people using old kernels usually use old userspace, and people using new kernels won't have to care, instead of working on other bugs that are still affecting people? > >> Regarding handling of degraded mounts, BTRFS _is_ working just fine, we >> just chose a different default behavior from MD and LVM (we make certain >> the user knows about the issue without having to look through syslog). > > I'm not arguing about the behaviour - apparently there were some > technical reasons. 
But IF the reasons are not technical, but > philosophical, I'd like to have either mount option (allow_degraded) or > even kernel-level configuration knob for this to happen RAID-style. > > Now, if the current kernels won't toggle degraded RAID1 as ro, can I > safely add "degraded" to the mount options? My primary concern is the > machine UPTIME. I care less about the data, as they are backed up to > some remote location and loosing day or week of changes is acceptable, > brain-split as well, while every hour of downtime costs me a real money. In which case you shouldn't be relying on _ANY_ kind of RAID by itself, let alone BTRFS. If you care that much about uptime, you should be investing in a HA setup and going from there. If downtime costs you money, you need to be accounting for kernel updates and similar things, and therefore should have things set up such that you can reboot a system with no issues. > > > Meanwhile I can't fix broken server using 'remote hands' - mounting degraded > volume means using physical keyboard or KVM which might be not available > at a site. Current btrfs behavious requires physical presence AND downtime > (if a machine rebooted) for fixing things, that could be fixed remotely > an on-line. Assuming you have a sensibly designed system and are able to do remote management, physical presence should only be required for handling of an issue with the root filesystem, and downtime should only be needed long enough for other filesystem to get them into a sensible enough state that you can repair them the rest of the way online. There's not really anything you can do about the root filesystem, but sensible organization of application data can mitigate the issues for other filesystems. > > Anyway, users shouldn't look through syslog, device status should be > reported by some monitoring tool. This is a common complaint, and based on developer response, I think the consensus is that it's out of scope for the time being. There have been some people starting work on such things, but nobody really got anywhere because most of the users who care enough about monitoring to be interested are already using some external monitoring tool that it's easy to hook into. TBH, you essentially need external monitoring in most RAID situations anyway unless you've got some pre-built purpose specific system that already includes it (see FreeNAS for an example). > > Deviation so big (respectively to common RAID1 scenarios) deserves being documented. > Or renamed... Really? Some examples of where MD and LVM provide direct monitoring without needing third party software please. LVM technically has the ability to handle it though dmeventd, but it's decidedly non-trivial to monitor state with that directly, and as a result almost everyone uses third party software there.. MD I don't have as much background with (I prefer the flexibility LVM offers), but anything I've seen regarding that requires manual setup of some external software as well. > >> reliability, and all the discussion of reliability assumes that either: >> 1. Disks fail catastrophically. >> or: >> 2. Disks return read or write errors when there is a problem. >> >> Following just those constraints, RAID is not designed to handle devices >> that randomly drop off the bus and reappear > > If it drops, there would be I/O errors eventually. Without the errors - agreed. Classical hardware RAID will kick the device when it drops, and then never re-add it, just like BTRFS functionally does. 
The only difference is how they then treat the 'failed' disk. Hardware RAID will stop using it, BTRFS will keep trying to use it. > >> implementations. As people are quick to point out BTRFS _IS NOT_ RAID, >> the devs just made a poor choice in the original naming of the 2-way >> replication implementation, and it stuck. > > Well, the question is: either it is not raid YET, or maybe it's time to consider renaming? Again, the naming is too ingrained. At a minimum, you will have to keep the old naming, and at that point you're just wasting time and making things _more_ confusing because some documentation will use the old naming and some will use the new (keep in mind that third-party documentation rarely gets updated). > >>> 3. if sysadmin doesn't request any kind of device autobinding, the >>> device that were already failed doesn't matter anymore - regardless of >>> it's current state or reappearences. >> You have to explicitly disable automatic binding of drivers to >> hot-plugged devices though, so that's rather irrelevant. Yes, you can > > Ha! I got this disabled on every bus (although for different reasons) > after boot completes. Lucky me:) Security I'm guessing (my laptop behaves like that for USB devices for that exact reason)? It's a viable option on systems that are tightly controlled. Once you look at consumer devices though, it's just impractical. People expect hardware to just work when they plug it in these days. > >>> 1. "known" only to the people that already stepped into it, meaning too >>> late - it should be "COMMONLY known", i.e. documented, >> And also known to people who have done proper research. > > All the OpenSUSE userbase? ;) I don't think you quite understand what the SuSE business model is. SuSE does the research, and then provides support for customers so they don't have to. Red Hat has a similar model. Most normal distros however, do not, and those people using them need to be doing proper research. > >>> 2. "older kernels" are not so old, the newest mature LTS (4.9) is still >>> affected, >> I really don't see this as a valid excuse. It's pretty well documented >> that you absolutely should be running the most recent kernel if you're >> using BTRFS. > > Good point. > >>> 4. As already said before, using r/w degraded RAID1 is FULLY ACCEPTABLE, >>> as long as you accept "no more redundancy"... >> This is a matter of opinion. > > Sure! And the particular opinion depends on system being affected. I'd > rather not have any brain-split scenario under my database servers, but > also won't mind data loss on BGP router as long as it keeps running and > is fully operational. > >> I still contend that running half a two >> device array for an extended period of time without reshaping it to be a >> single device is a bad idea for cases other than BTRFS. The fewer >> layers of code you're going through, the safer you are. > > I create single-device degraded MD RAID1 when I attach one disk for > deployment (usually test machines), which are going to be converted into > dual (production) in a future - attaching second disk to array is much > easier and faster than messing with device nodes (or labels or > anything). The same applies to LVM, it's better to have it even when not > used at a moment. In case of btrfs there is no need for such > preparations, as the devices are added without renaming. 
Unless you're pulling some complex black magic, you're not running degraded, you're running both in single device mode (which is not the same as a degraded two device RAID1 array) and converting to two device RAID1 later, which is a perfectly normal use case I have absolutely no issues with. > > However, sometimes the systems end up without second disk attached. > Either due to their low importance, sometimes power usage, others > need to be quiet. > > One might ask, why don't I attach second disk before initial system > creation - the answer is simple: I usually use the same drive models in > RAID1, but it happens that drives bought from the same production lot > fail simultaneously, so this approach mitigates the problem and gives > more time to react. You appear to be misunderstanding me here. I'm not saying I think running with a single disk is bad, I'm saying that I feel that running with a single disk and not telling the storage stack that the other one isn't coming back any time soon is bad. IOW, if I lose a disk in a two device BTRFS volume set up for replication, I'll mount it degraded, and convert it from the raid1 profile to the single profile and then remove the missing disk from the volume. Similarly, for a 2 device LVM RAID1 LV, I would use lvconvert to a regular linear LV. Going through the multi-device code in BTRFS or the DM-RAID code in LVM when you've only got one actual device is a waste of processing power, and ads another layer where things can go wrong. > >> Patches would be gratefully accepted. It's really not hard to update >> the documentation, it's just that nobody has had the time to do it. > > Writing accurate documentation requires deep undestanding of internals. > Me - for example, I know some of the results: "don't do this", "if X happens, Y > should be done", "Z doesn't work yet, but there were some patches", "V > was fixed in some recent kernel, but no idea which commit was it > exactly", "W was severly broken in kernel I.J.K" etc. Not the hard data > that could be posted without creating the impression, that it's all > about creating complain-list. Not to mention I'm absolutely not familiar > with current patches, WIP and many many other corner cases or usage > scenarios. In a fact, not only the internals, but motivation and design > principles must be well understood to write piece of documentation. Writing up something like that is near useless, it would only be valid for upstream kernels (And if you're using upstream kernels and following the advice of keeping up to date, what does it matter anyway? The moment a new btrfs-progs gets released, you're already going to be on a kernel that fixes the issues it reports.), because distros do whatever the hell they want with version numbers (RHEL for example is notorious for using _ancient_ version numbers bug having bunches of stuff back-ported, and most other big distros that aren't Arch, Gentoo, or Slackware derived do so too to a lesser degree), and it would require constant curation to keep up to date. Only for long-term known issues does it make sense, but those absolutely should be documented in the regular documentation, and doing that really isn't that hard if you just go for current issues. > > Otherwise some "fake news" propaganda is being created, just like > https://suckless.org/sucks/systemd or other systemd-haters that haven't > spent a day in their life for writing SysV init scripts or managing a > bunch of mission critical machines with handcrafted supervisors. I hate to tell you that: 1. 
This type of thing happens regardless. Systemd has just garnered a lot of hatred because it redesigned everything from the ground up and was then functionally forced on most of the Linux community. 2. There are quite a few of us who dislike systemd who have had to handle actual systems administration before (and quite a few such individuals are primarily complaining about other aspects of systemd, like the journal crap or how it handles manually mounted filesystems for which mount units exist (namely, if it thinks the underlying device isn't ready, it will unmount them immediately, even if the user just manually mounted them), not the service files replacing init scripts). ^ permalink raw reply [flat|nested] 61+ messages in thread
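As a concrete sketch of the recovery workflow Austin describes above (mount degraded, convert away from raid1, then drop the missing device), something along these lines should work; the device name and mount point are placeholders, and the exact flag requirements can vary by kernel and btrfs-progs version:

    # mount the surviving member writable, in degraded mode
    mount -o degraded /dev/sda /mnt

    # convert data and metadata chunks from raid1 to the single profile;
    # -f is needed because this reduces metadata redundancy
    btrfs balance start -f -dconvert=single -mconvert=single /mnt

    # with no chunk needing two devices any more, remove the absent one
    btrfs device delete missing /mnt

(Metadata can optionally be converted to dup afterwards, once the volume is single-device again.)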
* Re: Unexpected raid1 behaviour 2017-12-19 20:11 ` Austin S. Hemmelgarn @ 2017-12-19 21:58 ` Tomasz Pala 2017-12-20 13:10 ` Austin S. Hemmelgarn 2017-12-19 23:53 ` Chris Murphy 1 sibling, 1 reply; 61+ messages in thread From: Tomasz Pala @ 2017-12-19 21:58 UTC (permalink / raw) To: Linux fs Btrfs On Tue, Dec 19, 2017 at 15:11:22 -0500, Austin S. Hemmelgarn wrote: > Except the systems running on those ancient kernel versions are not > necessarily using a recent version of btrfs-progs. Still much easier to update a userspace tools than kernel (consider binary drivers for various hardware). > So in other words, spend the time to write up code for btrfs-progs that > will then be run by a significant minority of users because people using > old kernels usually use old userspace, and people using new kernels > won't have to care, instead of working on other bugs that are still > affecting people? I am aware of the dillema and the answer is: that depends. Depends on expected usefulness of such infrastructure regarding _future_ changes and possible bugs. In case of stable/mature/frozen projects this doesn't make much sense, as the possible incompatibilities would be very rare. Wheter this makes sense for btrfs? I don't know - it's not mature, but if the quirk rate would be too high to track appropriate kernel versions it might be really better to officially state "DO USE 4.14+ kernel, REALLY". This might be accomplished very easy - when releasing new btrfs-progs check currently available LTS kernel and use it as a base reference for warning. After all, "giving users a hurt me button is not ethical programming." >> Now, if the current kernels won't toggle degraded RAID1 as ro, can I >> safely add "degraded" to the mount options? My primary concern is the >> machine UPTIME. I care less about the data, as they are backed up to >> some remote location and loosing day or week of changes is acceptable, >> brain-split as well, while every hour of downtime costs me a real money. > In which case you shouldn't be relying on _ANY_ kind of RAID by itself, > let alone BTRFS. If you care that much about uptime, you should be > investing in a HA setup and going from there. If downtime costs you I got this handled and don't use btrfs there - the question remains: in a situation as described above, is it safe now to add "degraded"? To rephrase the question: can degraded RAID1 run permanently as rw without some *internal* damage? >> Anyway, users shouldn't look through syslog, device status should be >> reported by some monitoring tool. > This is a common complaint, and based on developer response, I think the > consensus is that it's out of scope for the time being. There have been > some people starting work on such things, but nobody really got anywhere > because most of the users who care enough about monitoring to be > interested are already using some external monitoring tool that it's > easy to hook into. I agree, the btrfs code should only emit events, so SomeUserspaceGUIWhatever could display blinking exclamation mark. >> Well, the question is: either it is not raid YET, or maybe it's time to consider renaming? > Again, the naming is too ingrained. At a minimum, you will have to keep > the old naming, and at that point you're just wasting time and making > things _more_ confusing because some documentation will use the old True, but realizing that documentation is already flawed it gets easier. But I still don't know if it is going to be RAID some day? Or won't be "by design"? >> Ha! 
I got this disabled on every bus (although for different reasons) >> after boot completes. Lucky me:) > Security I'm guessing (my laptop behaves like that for USB devices for > that exact reason)? It's a viable option on systems that are tightly Yes, machines are locked and only authorized devices are allowed during boot. > IOW, if I lose a disk in a two device BTRFS volume set up for > replication, I'll mount it degraded, and convert it from the raid1 > profile to the single profile and then remove the missing disk from the > volume. I was about to do the same with my r/o-stuck btrfs system, unfortunatelly unplugged the wrong cable... >> Writing accurate documentation requires deep undestanding of internals. [...] > Writing up something like that is near useless, it would only be valid > for upstream kernels (And if you're using upstream kernels and following > the advice of keeping up to date, what does it matter anyway? The [...] > kernel that fixes the issues it reports.), because distros do whatever > the hell they want with version numbers (RHEL for example is notorious > for using _ancient_ version numbers bug having bunches of stuff > back-ported, and most other big distros that aren't Arch, Gentoo, or > Slackware derived do so too to a lesser degree), and it would require > constant curation to keep up to date. Only for long-term known issues OK, you've convinced me that kernel-vs-feature list is overhead. So maybe other approach: just like systemd sets the system time (when no time source available) to it's own release date, maybe btrfs-progs should take the version of the kernel on which it was build? -- Tomasz Pala <gotar@pld-linux.org> ^ permalink raw reply [flat|nested] 61+ messages in thread
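The quirk-warning idea floated in this exchange could be prototyped entirely in userspace; a rough sketch follows, with the cut-off version (4.14 here) being exactly the arbitrary part under debate:

    #!/bin/sh
    # hypothetical wrapper illustrating the "warn on old kernels" proposal
    min=4.14
    cur=$(uname -r)
    # sort -V puts the lower version first; if that is not $min, we are below it
    if [ "$(printf '%s\n' "$min" "$cur" | sort -V | head -n1)" != "$min" ]; then
        echo "warning: kernel $cur predates $min; known btrfs raid1 issues may apply" >&2
    fi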
* Re: Unexpected raid1 behaviour 2017-12-19 21:58 ` Tomasz Pala @ 2017-12-20 13:10 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 61+ messages in thread From: Austin S. Hemmelgarn @ 2017-12-20 13:10 UTC (permalink / raw) To: Tomasz Pala, Linux fs Btrfs On 2017-12-19 16:58, Tomasz Pala wrote: > On Tue, Dec 19, 2017 at 15:11:22 -0500, Austin S. Hemmelgarn wrote: > >> Except the systems running on those ancient kernel versions are not >> necessarily using a recent version of btrfs-progs. > > Still much easier to update a userspace tools than kernel (consider > binary drivers for various hardware). OK, let's look at this objectively: Current version of btrfs-progs is 4.14, released last month, and current kernel is 4.14.8 (or a 4.15 RC release). In various distributions: * Arch Linux: btrfs-progs version is 4.14-2 kernel version is 4.14.6-1 * Alpine Linux: btrfs-progs version is 4.10.2-r0 kernel version is 4.9.32-0 * Debian Sid: btrfs-progs version is 4.13.3-1 kernel version is 4.14.0-1 * Debian 9.3: btrfs-progs version is 4.7.3-1 kernel version is 4.9.0-4 * Fedora 27: btrfs-progs version is 4.11.3-3 kernel version is 4.14.6-300 * Gentoo ~amd64 (equivalent of Debian Sid or Fedora Rawhide): btrfs-progs version is 4.14 kernel version is 4.14.7 * Gentoo stable: btrfs-progs version is 4.10.2 kernel version is 4.14.7 * Manjaro (a somewhat popular Arch Linux derivative): btrfs-progs version is 4.14-1 kernel version is 4.11.12-1-rt16 * OpenSUSE Leap 42.3: btrfs-progs version is 4.5.3+20160729 kernel version is 4.4.103-36 * OpenSUSE Tumbleweed: btrfs-progs version is 4.13.3 kernel version is 4.14.6-1 * Ubuntu 17.10: btrfs-progs version is 4.12-1 kernel version is 4.13.0-19 * Ubuntu 16.04.3: btrfs-progs version is 4.4-1ubuntu1 kernel version is 4.4.0-104 Based on this, it looks like Alpine, Manjaro, and OpenSUSE Leap are the only distros for which it was easier to upgrade the userspace than the kernel, and Alpine and Manjaro are the only two that it even makes sense for that to be the case given that they use GRSecurity and RT patches respectively. The fact is that most people use whatever version their distro packages, and don't install software themselves through other means, so for most people, it is easier to upgrade the kernel. Even as a 'power user' using Gentoo (where it's really easy to install stuff from external sources because you have all the development tools pre-installed), I almost never pull anything that's beyond the main repositories or the small handful of user repositories that I've got enabled, and that's only for stuff I can't get in a repository. > >> So in other words, spend the time to write up code for btrfs-progs that >> will then be run by a significant minority of users because people using >> old kernels usually use old userspace, and people using new kernels >> won't have to care, instead of working on other bugs that are still >> affecting people? > > I am aware of the dillema and the answer is: that depends. > Depends on expected usefulness of such infrastructure regarding _future_ > changes and possible bugs. > In case of stable/mature/frozen projects this doesn't make much sense, > as the possible incompatibilities would be very rare. > Wheter this makes sense for btrfs? I don't know - it's not mature, but if the quirk rate > would be too high to track appropriate kernel versions it might be > really better to officially state "DO USE 4.14+ kernel, REALLY". 
> > This might be accomplished very easy - when releasing new btrfs-progs > check currently available LTS kernel and use it as a base reference for > warning. > > After all, "giving users a hurt me button is not ethical programming." Scaring users needlessly is also not ethical programming. As an example: 4.9 is the current LTS release (4.9.71 as of right now). Dozens of bugs have been fixed since then. If we were to start doing as you propose, then we'd be spitting out potentially bogus warnings for everything up through current kernels. > >>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I >>> safely add "degraded" to the mount options? My primary concern is the >>> machine UPTIME. I care less about the data, as they are backed up to >>> some remote location and loosing day or week of changes is acceptable, >>> brain-split as well, while every hour of downtime costs me a real money. >> In which case you shouldn't be relying on _ANY_ kind of RAID by itself, >> let alone BTRFS. If you care that much about uptime, you should be >> investing in a HA setup and going from there. If downtime costs you > > I got this handled and don't use btrfs there - the question remains: > in a situation as described above, is it safe now to add "degraded"? > > To rephrase the question: can degraded RAID1 run permanently as rw > without some *internal* damage? Not on kernels that don't have the patch that's been mentioned a couple of times in this thread, with the caveat that 'internal damage' means that it won't mount on such kernels after the first time (but will mount on newer kernels that have been patched). > >>> Anyway, users shouldn't look through syslog, device status should be >>> reported by some monitoring tool. >> This is a common complaint, and based on developer response, I think the >> consensus is that it's out of scope for the time being. There have been >> some people starting work on such things, but nobody really got anywhere >> because most of the users who care enough about monitoring to be >> interested are already using some external monitoring tool that it's >> easy to hook into. > > I agree, the btrfs code should only emit events, so > SomeUserspaceGUIWhatever could display blinking exclamation mark. No, it shouldn't _only_ emit events. It damn well should be logging to the kernel log even if it's emitting events, LVM does so, MD does so, ZFS does so, why the hell should BTRFS _NOT_ do so? > >>> Well, the question is: either it is not raid YET, or maybe it's time to consider renaming? >> Again, the naming is too ingrained. At a minimum, you will have to keep >> the old naming, and at that point you're just wasting time and making >> things _more_ confusing because some documentation will use the old > > True, but realizing that documentation is already flawed it gets easier. > But I still don't know if it is going to be RAID some day? Or won't be > "by design"? > >>> Ha! I got this disabled on every bus (although for different reasons) >>> after boot completes. Lucky me:) >> Security I'm guessing (my laptop behaves like that for USB devices for >> that exact reason)? It's a viable option on systems that are tightly > > Yes, machines are locked and only authorized devices are allowed during > boot. > >> IOW, if I lose a disk in a two device BTRFS volume set up for >> replication, I'll mount it degraded, and convert it from the raid1 >> profile to the single profile and then remove the missing disk from the >> volume. 
> > I was about to do the same with my r/o-stuck btrfs system, unfortunatelly > unplugged the wrong cable... > >>> Writing accurate documentation requires deep undestanding of internals. > [...] >> Writing up something like that is near useless, it would only be valid >> for upstream kernels (And if you're using upstream kernels and following >> the advice of keeping up to date, what does it matter anyway? The > [...] >> kernel that fixes the issues it reports.), because distros do whatever >> the hell they want with version numbers (RHEL for example is notorious >> for using _ancient_ version numbers bug having bunches of stuff >> back-ported, and most other big distros that aren't Arch, Gentoo, or >> Slackware derived do so too to a lesser degree), and it would require >> constant curation to keep up to date. Only for long-term known issues > > OK, you've convinced me that kernel-vs-feature list is overhead. > > So maybe other approach: just like systemd sets the system time (when no > time source available) to it's own release date, maybe btrfs-progs > should take the version of the kernel on which it was build? The systemd thing works because it knows the current time can't be older than when it was built (short of time-travel, but that's probably irrelevant right now). Grabbing the kernel version of the build system and then using that as our own version absolutely does not because: 1. The kernel on the build system has (or should have) zero impact on how btrfs-progs work on the target system. The only bit that matters is the UAPI headers that are installed, and if those mismatch then btrfs-progs won't run at all. 2. While the code in btrfs-progs is developed in-concert with kernel code, it is not directly dependent on it for most of it's operation. As a couple of people are apt to point out, kernel version matters mostly for regular operation, btrfs-progs version matters mostly for recovery and repair. However, _both_ do matter, so just displaying one is a bad idea. As of right now, the versioning of btrfs-progs is largely linked to whatever the current stable kernel version is at the time of release. That provides a good enough indication of the vintage that most people have no issues just running: btrfs --version uname -r to figure out what they have, though of course `uname -r` is essentially useless for outsiders on RHEL or OEL systems (and there is nothing the btrfs community can do about that). ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-19 20:11 ` Austin S. Hemmelgarn 2017-12-19 21:58 ` Tomasz Pala @ 2017-12-19 23:53 ` Chris Murphy 2017-12-20 13:12 ` Austin S. Hemmelgarn 1 sibling, 1 reply; 61+ messages in thread From: Chris Murphy @ 2017-12-19 23:53 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Tomasz Pala, Linux fs Btrfs On Tue, Dec 19, 2017 at 1:11 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2017-12-19 12:56, Tomasz Pala wrote: >> BTRFS lacks all of these - there are major functional changes in current >> kernels and it reaches far beyond LTS. All the knowledge YOU have here, >> on this maillist, should be 'engraved' into btrfs-progs, as there are >> people still using kernels with serious malfunctions. btrfs-progs could >> easily check kernel version and print appropriate warning - consider >> this a "software quirks". > > Except the systems running on those ancient kernel versions are not > necessarily using a recent version of btrfs-progs. Indeed it is much more common to find old user space tools, for whatever reason, compared to the kernel version. -- Chris Murphy ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-19 23:53 ` Chris Murphy @ 2017-12-20 13:12 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 61+ messages in thread From: Austin S. Hemmelgarn @ 2017-12-20 13:12 UTC (permalink / raw) To: Chris Murphy; +Cc: Tomasz Pala, Linux fs Btrfs On 2017-12-19 18:53, Chris Murphy wrote: > On Tue, Dec 19, 2017 at 1:11 PM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: >> On 2017-12-19 12:56, Tomasz Pala wrote: > >>> BTRFS lacks all of these - there are major functional changes in current >>> kernels and it reaches far beyond LTS. All the knowledge YOU have here, >>> on this maillist, should be 'engraved' into btrfs-progs, as there are >>> people still using kernels with serious malfunctions. btrfs-progs could >>> easily check kernel version and print appropriate warning - consider >>> this a "software quirks". >> >> Except the systems running on those ancient kernel versions are not >> necessarily using a recent version of btrfs-progs. > > Indeed it is much more common to find old user space tools, for > whatever reason, compared to the kernel version. Most distros have infrastructure in place to handle quick updates to the kernel, and tend to keep the kernel up to date to fix hardware issues that affect people who may not be using BTRFS. In contrast, btrfs-progs updates generally aren't high priority, because they benefit a much smaller user base (unless you're SuSE). ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-19 14:46 ` Tomasz Pala 2017-12-19 16:35 ` Austin S. Hemmelgarn @ 2017-12-19 18:31 ` George Mitchell 2017-12-19 20:28 ` Tomasz Pala 2017-12-19 19:35 ` Chris Murphy 2 siblings, 1 reply; 61+ messages in thread From: George Mitchell @ 2017-12-19 18:31 UTC (permalink / raw) To: linux-btrfs On 12/19/2017 06:46 AM, Tomasz Pala wrote: > On Tue, Dec 19, 2017 at 07:25:49 -0500, Austin S. Hemmelgarn wrote: > >>> Well, the RAID1+ is all about the failing hardware. >> About catastrophically failing hardware, not intermittent failure. > It shouldn't matter - as long as disk failing once is kicked out of the > array *if possible*. Or reattached in write-only mode as a best effort, > meaning "will try to keep your *redundancy* copy, but won't trust it to > be read from". > As you see, the "failure level handled" is not by definition, but by implementation. > > *if possible* == when there are other volume members having the same > data /or/ there are spare members that could take over the failing ones. > >> I never said the hardware needed to not fail, just that it needed to >> fail in a consistent manner. BTRFS handles catastrophic failures of >> storage devices just fine right now. It has issues with intermittent >> failures, but so does hardware RAID, and so do MD and LVM to a lesser >> degree. > When planning hardware failovers/backups I can't predict the failing > pattern. So first of all - every *known* shortcoming should be > documented somehow. Secondly - permanent failures are not handled "just > fine", as there is (1) no automatic mount as degraded, so the machine > won't reboot properly and (2) the r/w degraded mount is[*] one-timer. > Again, this should be: > 1. documented in manpage, as a comment to profiles, not wiki page or > linux-btrfs archives, > 2. printed on screen when creating/converting "RAID1" profile (by btrfs tools), > 3. blown into one's face when doing r/w degraded mount (by kernel). > > [*] yes, I know the recent kernels handle this, but the last LTS (4.14) > is just too young. > > I'm now aware of issues with MD you're referring to - I got drives > kicked off many times and they were *never* causing any problems despite > being visible in the system. Moreover, since 4.10 there is FAILFAST > which would do this even faster. There is also no problem with mounting > degraded MD array automatically, so telling that btrfs is doing "just > fine" is, well... not even theoretically close. And in my practice it > never saved the day, but already ruined a few ones... It's not right for > the protection to make more problems than it solves. > >> No, classical RAID (other than RAID0) is supposed to handle catastrophic >> failure of component devices. That is the entirety of the original >> design purpose, and that is the entirety of what you should be using it >> for in production. > 1. no, it's not: https://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf > > 2. even if there was, the single I/O failure (e.g. one bad block) might > be interpreted as "catastrophic" and the entire drive should be kicked off then. > > 3. if sysadmin doesn't request any kind of device autobinding, the > device that were already failed doesn't matter anymore - regardless of > it's current state or reappearences. > >> The point at which you are getting random corruption >> on a disk and you're using anything but BTRFS for replication, you >> _NEED_ to replace that disk, and if you don't you risk it causing >> corruption on the other disk. 
> Not only BTRFS, there are hardware solutions like T10 PI/DIF. > Guess what should RAID controller do in such situation? Fail > drive immediately after the first CRC mismatch? > > BTW do you consider "random corruption" as a catastrophic failure? > >> As of right now, BTRFS is no different in >> that respect, but I agree that it _should_ be able to handle such a >> situation eventually. > The first step should be to realize, that there are some tunables > required if you want to handle many different situation. > > Having said that, let's back to reallity: > > > The classical RAID is about keeping the system functional - trashing a > single drive from RAID1 should be fully-ignorable by sysadmin. The > system must reboot properly, work properly and there MUST NOT by ANY > functional differences compared to non-degraded mode except for slower > read rate (and having no more redundancy obviously). > > > - not having this == not having RAID1. > >> It shouldn't have been called RAID in the first place, that we can agree >> on (even if for different reasons). > The misnaming would be much less of a problem if it were documented > properly (man page, btrfs-progs and finally kernel screaming). > >>> - I got one "RAID1" stuck in r/o after degraded mount, not nice... Not >>> _expected_ to happen after single disk failure (without any reappearing). >> And that's a known bug on older kernels (not to mention that you should >> not be mounting writable and degraded for any purpose other than fixing >> the volume). > Yes, ...but: > > 1. "known" only to the people that already stepped into it, meaning too > late - it should be "COMMONLY known", i.e. documented, > 2. "older kernels" are not so old, the newest mature LTS (4.9) is still > affected, > 3. I was about to fix the volume, accidentally the machine has rebooted. > Which should do no harm if I had a RAID1. > 4. As already said before, using r/w degraded RAID1 is FULLY ACCEPTABLE, > as long as you accept "no more redundancy"... > 4a. ...or had an N-way mirror and there is still some redundancy if N>2. > > > Since we agree, that btrfs RAID != common RAID, as there are/were > different design principles and some features are in WIP state at best, > the current behaviour should be better documented. That's it. > > I have significant experience as a user of raid1. I spent years using software raid1 and then more years using hardware (3ware) raid1 and now around 3 years using btrfs raid1. I have not found btrfs raid1 to be less reliable than any of the previous implementations of raid. I have found that any implementation of raid whether it be software, hardware, or filesystem, is not infallible. I have also found that when you have a failure, you don't just plug things back in and expect it to be fixed without seriously investigating what has gone wrong and potential unexpected consequences. I have found that even with hardware raid you can find ways to screw things up to the point that you lose your data. I have had situations where I reconnected a drive on hardware raid1 only to find that the array would not sync and from there on I ended up having to directly attach one of the drives and recover the partition table with test disk in order to regain access to my data. So NO FORM of raid is a replacement for backups and NO FORM of raid is a replacement for due diligence in recovery from failure mode. Raid gives you a second chance when things go wrong, it does not make failures transparent which is seemingly what we sometimes expect from raid. 
And I doubt that we will ever achieve that goal no matter how much effort we put into making it happen. Even with hardware raid things can happen that were not foreseen by the designers. So I think we have to be careful when we compare various raid (or "raid-like") implementations. There is no such thing as "foolproof" raid and likely never will be. And with that I will end my rant. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-19 18:31 ` George Mitchell @ 2017-12-19 20:28 ` Tomasz Pala 0 siblings, 0 replies; 61+ messages in thread From: Tomasz Pala @ 2017-12-19 20:28 UTC (permalink / raw) To: George Mitchell; +Cc: linux-btrfs On Tue, Dec 19, 2017 at 10:31:40 -0800, George Mitchell wrote: > I have significant experience as a user of raid1. I spent years using > software raid1 and then more years using hardware (3ware) raid1 and now > around 3 years using btrfs raid1. I have not found btrfs raid1 to be > less reliable than any of the previous implementations of raid. I have You are aware that in order to prove something one needs only one example? Degraded r/o is such an example, QED. It doesn't matter how long you have ridden on top of any RAID implementation unless you have seen it in action, i.e. had an actual drive malfunction. Did you have a broken drive under btrfs raid? > a failure, you don't just plug things back in and expect it to be fixed > without seriously investigating what has gone wrong and potential > unexpected consequences. I have found that even with hardware raid you > can find ways to screw things up to the point that you lose your data. Everything can be screwed up beyond comprehension, but we're talking about PRIMARY objectives. In the case of RAID1+ it seems to be obvious: https://en.oxforddictionaries.com/definition/redundancy - unplugging ANY SINGLE drive MUST NOT render the system unusable. It is really as simple as that. > I have had situations where I reconnected a drive on hardware raid1 only > to find that the array would not sync and from there on I ended up > having to directly attach one of the drives and recover the partition I had a situation where replugging a drive started a sync of older data over the newer. So what? This doesn't change a thing - the drive reappearance or resync is the RECOVERY part. RECOVERY scenarios are an entirely different thing from REDUNDANCY itself. The RECOVERY phase in some implementations could be an entirely off-line process and it would still be RAID. Remove the REDUNDANCY part and it's not RAID anymore. If one names a thing an apple, one shouldn't be surprised if others compare it to apples, not oranges. > table with test disk in order to regain access to my data. So NO FORM > of raid is a replacement for backups and NO FORM of raid is a > replacement for due diligence in recovery from failure mode. Raid gives And who said it is? > you a second chance when things go wrong, it does not make failures > transparent which is seemingly what we sometimes expect from raid. And I wouldn't want to worry you, but properly managed RAIDs make I/J-of-K trivial failures transparent, just like ECC protects N/M bits transparently. Investigating the reasons is the sysadmin's job, just like other maintenance, including restoring the protection level. -- Tomasz Pala <gotar@pld-linux.org> ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-19 14:46 ` Tomasz Pala 2017-12-19 16:35 ` Austin S. Hemmelgarn 2017-12-19 18:31 ` George Mitchell @ 2017-12-19 19:35 ` Chris Murphy 2017-12-19 20:41 ` Tomasz Pala 2 siblings, 1 reply; 61+ messages in thread From: Chris Murphy @ 2017-12-19 19:35 UTC (permalink / raw) To: Tomasz Pala; +Cc: Linux fs Btrfs On Tue, Dec 19, 2017 at 7:46 AM, Tomasz Pala <gotar@polanet.pl> wrote: >Secondly - permanent failures are not handled "just > fine", as there is (1) no automatic mount as degraded, so the machine > won't reboot properly and (2) the r/w degraded mount is[*] one-timer. > Again, this should be: One of the reasons for problem 1 is problem 2. If we had automatic degraded mount, people would run into problem 2 and now they're stuck with a read only file system. Another reason is the kernel code and udev rule for device "readiness" means the volume is not "ready" until all member devices are present. And while the volume is not "ready" systemd will not even attempt to mount. Solving this requires kernel and udev work, or possibly a helper, to wait an appropriate amount of time. I also think it's a bad idea to implement automatic degraded mounts unless there's an API for user space to receive either a push or request notification for degradedness state, so desktop environments can inform the user of degradedness. There is no amount of documentation that makes up for these deficiencies enough to enable automatic degraded mounts by default. I would consider it a high order betrayal of user trust to do it. -- Chris Murphy ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-19 19:35 ` Chris Murphy @ 2017-12-19 20:41 ` Tomasz Pala 2017-12-19 20:47 ` Austin S. Hemmelgarn 2017-12-19 23:59 ` Chris Murphy 0 siblings, 2 replies; 61+ messages in thread From: Tomasz Pala @ 2017-12-19 20:41 UTC (permalink / raw) To: Linux fs Btrfs On Tue, Dec 19, 2017 at 12:35:20 -0700, Chris Murphy wrote: > with a read only file system. Another reason is the kernel code and > udev rule for device "readiness" means the volume is not "ready" until > all member devices are present. And while the volume is not "ready" > systemd will not even attempt to mount. Solving this requires kernel > and udev work, or possibly a helper, to wait an appropriate amount of Sth like this? I got such problem a few months ago, my solution was accepted upstream: https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f Rationale is in referred ticket, udev would not support any more btrfs logic, so unless btrfs handles this itself on kernel level (daemon?), that is all that can be done. > time. I also think it's a bad idea to implement automatic degraded > mounts unless there's an API for user space to receive either a push [...] > There is no amount of documentation that makes up for these > deficiencies enough to enable automatic degraded mounts by default. I > would consider it a high order betrayal of user trust to do it. It doesn't have to be default, might be kernel compile-time knob, module parameter or anything else to make the *R*aid work. -- Tomasz Pala <gotar@pld-linux.org> ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-19 20:41 ` Tomasz Pala @ 2017-12-19 20:47 ` Austin S. Hemmelgarn 2017-12-19 22:23 ` Tomasz Pala 2017-12-21 11:44 ` Andrei Borzenkov 2017-12-19 23:59 ` Chris Murphy 1 sibling, 2 replies; 61+ messages in thread From: Austin S. Hemmelgarn @ 2017-12-19 20:47 UTC (permalink / raw) To: Tomasz Pala, Linux fs Btrfs On 2017-12-19 15:41, Tomasz Pala wrote: > On Tue, Dec 19, 2017 at 12:35:20 -0700, Chris Murphy wrote: > >> with a read only file system. Another reason is the kernel code and >> udev rule for device "readiness" means the volume is not "ready" until >> all member devices are present. And while the volume is not "ready" >> systemd will not even attempt to mount. Solving this requires kernel >> and udev work, or possibly a helper, to wait an appropriate amount of > > Sth like this? I got such problem a few months ago, my solution was > accepted upstream: > https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f > > Rationale is in referred ticket, udev would not support any more btrfs > logic, so unless btrfs handles this itself on kernel level (daemon?), > that is all that can be done. Or maybe systemd can quit trying to treat BTRFS like a volume manager (which it isn't) and just try to mount the requested filesystem with the requested options? Then you would just be able to specify 'degraded' in your mount options, and you don't have to care that the kernel refuses to mount degraded filesystems without being explicitly asked to. > >> time. I also think it's a bad idea to implement automatic degraded >> mounts unless there's an API for user space to receive either a push > [...] >> There is no amount of documentation that makes up for these >> deficiencies enough to enable automatic degraded mounts by default. I >> would consider it a high order betrayal of user trust to do it. > > It doesn't have to be default, might be kernel compile-time knob, module > parameter or anything else to make the *R*aid work. There's a mount option for it per-filesystem. Just add that to all your mount calls, and you get exactly the same effect. ^ permalink raw reply [flat|nested] 61+ messages in thread
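For completeness, "add that to all your mount calls" amounts to something like the following (the UUID and paths are placeholders; whether mounting degraded unconditionally is wise is precisely what is being debated in this thread):

    # one-off, by hand
    mount -o degraded /dev/sda /mnt

    # or persistently via /etc/fstab
    UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /data  btrfs  degraded,noatime  0  0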
* Re: Unexpected raid1 behaviour 2017-12-19 20:47 ` Austin S. Hemmelgarn @ 2017-12-19 22:23 ` Tomasz Pala 2017-12-20 13:33 ` Austin S. Hemmelgarn 2017-12-21 11:44 ` Andrei Borzenkov 1 sibling, 1 reply; 61+ messages in thread From: Tomasz Pala @ 2017-12-19 22:23 UTC (permalink / raw) To: Linux fs Btrfs On Tue, Dec 19, 2017 at 15:47:03 -0500, Austin S. Hemmelgarn wrote: >> Sth like this? I got such problem a few months ago, my solution was >> accepted upstream: >> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f >> >> Rationale is in referred ticket, udev would not support any more btrfs >> logic, so unless btrfs handles this itself on kernel level (daemon?), >> that is all that can be done. > Or maybe systemd can quit trying to treat BTRFS like a volume manager > (which it isn't) and just try to mount the requested filesystem with the > requested options? Tried that before ("just mount my filesystem, stupid"), it is a no-go. The problem source is not within systemd treating BTRFS differently, but in btrfs kernel logic that it uses. Just to show it: 1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb, 2. reboot the system into clean state (init=/bin/sh), (or remove btrfs-scan tool), 3. try mount /dev/sda /test - fails mount /dev/sdb /test - works 4. reboot again and try in reversed order mount /dev/sdb /test - fails mount /dev/sda /test - works THIS readiness is exposed via udev to systemd. And it must be used for multi-layer setups to work (consider stacked LUKS, LVM, MD, iSCSI, FC etc). In short: until *something* scans all the btrfs components, so the kernel makes it ready, systemd won't even try to mount it. > Then you would just be able to specify 'degraded' in > your mount options, and you don't have to care that the kernel refuses > to mount degraded filesystems without being explicitly asked to. Exactly. But since LP refused to try mounting despite kernel "not-ready" state - it is the kernel that must emit 'ready'. So the question is: how can I make kernel to mark degraded array as "ready"? The obvious answer is: do it via kernel command line, just like mdadm does: rootflags=device=/dev/sda,device=/dev/sdb rootflags=device=/dev/sda,device=missing rootflags=device=/dev/sda,device=/dev/sdb,degraded If only btrfs.ko recognized this, kernel would be able to assemble multivolume btrfs itself. Not only this would allow automated degraded mounts, it would also allow using initrd-less kernels on such volumes. >> It doesn't have to be default, might be kernel compile-time knob, module >> parameter or anything else to make the *R*aid work. > There's a mount option for it per-filesystem. Just add that to all your > mount calls, and you get exactly the same effect. If only they were passed... -- Tomasz Pala <gotar@pld-linux.org> ^ permalink raw reply [flat|nested] 61+ messages in thread
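From userspace the same hint can already be passed per mount call with the device= mount option, which is roughly what the rootflags= lines above would map to if btrfs.ko parsed them at boot; the device names are examples only:

    # name both members explicitly instead of relying on a prior scan
    mount -o device=/dev/sda,device=/dev/sdb /dev/sda /test

    # degraded variant when one member is absent
    mount -o degraded,device=/dev/sda /dev/sda /test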
* Re: Unexpected raid1 behaviour 2017-12-19 22:23 ` Tomasz Pala @ 2017-12-20 13:33 ` Austin S. Hemmelgarn 2017-12-20 17:28 ` Duncan 0 siblings, 1 reply; 61+ messages in thread From: Austin S. Hemmelgarn @ 2017-12-20 13:33 UTC (permalink / raw) To: Tomasz Pala, Linux fs Btrfs On 2017-12-19 17:23, Tomasz Pala wrote: > On Tue, Dec 19, 2017 at 15:47:03 -0500, Austin S. Hemmelgarn wrote: > >>> Sth like this? I got such problem a few months ago, my solution was >>> accepted upstream: >>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f >>> >>> Rationale is in referred ticket, udev would not support any more btrfs >>> logic, so unless btrfs handles this itself on kernel level (daemon?), >>> that is all that can be done. >> Or maybe systemd can quit trying to treat BTRFS like a volume manager >> (which it isn't) and just try to mount the requested filesystem with the >> requested options? > > Tried that before ("just mount my filesystem, stupid"), it is a no-go. > The problem source is not within systemd treating BTRFS differently, but > in btrfs kernel logic that it uses. Just to show it: > > 1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb, > 2. reboot the system into clean state (init=/bin/sh), (or remove btrfs-scan tool), > 3. try > mount /dev/sda /test - fails > mount /dev/sdb /test - works > 4. reboot again and try in reversed order > mount /dev/sdb /test - fails > mount /dev/sda /test - works > > THIS readiness is exposed via udev to systemd. And it must be used for > multi-layer setups to work (consider stacked LUKS, LVM, MD, iSCSI, FC etc). Except BTRFS _IS NOT MULTIPLE LAYERS_. It's one layer at the filesystem layer, and handles the other 'layers' internally. > > In short: until *something* scans all the btrfs components, so the > kernel makes it ready, systemd won't even try to mount it. Which is the problem here. Systemd needs to treat BTRFS differently, even if the ioctl it's using gets 'fixed', currently it's treating it like LVM or MD, when it needs to be treated as just a filesystem with an extra wait condition prior to mount (and needs to trust that the user knows what they are doing when they mount something by hand). The IOCTL systemd is using was poorly named, what it really does is say that the FS is ready to mount normally (that is, without needing 'device=' or 'degraded' mount options). Aside from this being problematic with degraded volumes, it's got an inherent TOCTOU race condition (so do the checks with all the other block layers you mentioned FWIW). If systemd would just treat BTRFS like a filesystem instead of a volume manager, and try to mount the volume with the specified options (after waiting for udev to report that it's done scanning everything) instead of asking the kernel if it's ready, none of this would be an issue. Put slightly differently: I use OpenRC and sysv init. I have a script that runs right after udev starts and directly scans all fixed disks for BTRFS signatures, and that's _all_ that I need to do to get multi-device BTRFS working properly with the standard local filesystem mount script in Gentoo. I don't have to deal with any of this crap that systemd users do because Gentoo's OpenRC script for mounting local filesystems treats BTRFS like any other filesystem, and (sensibly) assumes that if the call to mount succeeds, things are ready and working. 
> >> Then you would just be able to specify 'degraded' in >> your mount options, and you don't have to care that the kernel refuses >> to mount degraded filesystems without being explicitly asked to. > > Exactly. But since LP refused to try mounting despite kernel "not-ready" > state - it is the kernel that must emit 'ready'. So the > question is: how can I make kernel to mark degraded array as "ready"? You can't, because the DEVICE_READY IOCTL is coded to mark the volume ready when all component devices are ready. IOW, it's there to say 'this mount will work without needing -o degraded or specifying any devices in the mount options'. The issue is the interaction here, not the kernel behavior by itself, since the kernel behavior produces no issues whatsoever for other init systems (though I will acknowledge that the ioctl itself is really only used by systemd, but I contend that that's because everything else is sensible enough to understand that the ioctl is functionally useless and just avoid it). > > The obvious answer is: do it via kernel command line, just like mdadm > does: > rootflags=device=/dev/sda,device=/dev/sdb > rootflags=device=/dev/sda,device=missing > rootflags=device=/dev/sda,device=/dev/sdb,degraded > > If only btrfs.ko recognized this, kernel would be able to assemble > multivolume btrfs itself. Not only this would allow automated degraded > mounts, it would also allow using initrd-less kernels on such volumes. Last I checked, the 'device=' options work on upstream kernels just fine, though I've never tried the degraded option. Of course, I'm also not using systemd, so it may be some interaction with systemd that's causing them to not work (and yes, I understand that I'm inclined to blame systemd most of the time based on significant past experience with systemd creating issues that never existed before). > >>> It doesn't have to be default, might be kernel compile-time knob, module >>> parameter or anything else to make the *R*aid work. >> There's a mount option for it per-filesystem. Just add that to all your >> mount calls, and you get exactly the same effect. > > If only they were passed... > ^ permalink raw reply [flat|nested] 61+ messages in thread
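The boot-time scan Austin mentions does not need to be elaborate; a sketch of such a hook, run once early in boot (this is an assumption about its shape, not a copy of his actual script):

    #!/bin/sh
    # wait for udev to finish creating device nodes, then register every
    # btrfs member with the kernel so later plain mount calls just work
    udevadm settle
    btrfs device scan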
* Re: Unexpected raid1 behaviour 2017-12-20 13:33 ` Austin S. Hemmelgarn @ 2017-12-20 17:28 ` Duncan 0 siblings, 0 replies; 61+ messages in thread From: Duncan @ 2017-12-20 17:28 UTC (permalink / raw) To: linux-btrfs Austin S. Hemmelgarn posted on Wed, 20 Dec 2017 08:33:03 -0500 as excerpted: >> The obvious answer is: do it via kernel command line, just like mdadm >> does: >> rootflags=device=/dev/sda,device=/dev/sdb >> rootflags=device=/dev/sda,device=missing >> rootflags=device=/dev/sda,device=/dev/sdb,degraded >> >> If only btrfs.ko recognized this, kernel would be able to assemble >> multivolume btrfs itself. Not only this would allow automated degraded >> mounts, it would also allow using initrd-less kernels on such volumes. > Last I checked, the 'device=' options work on upstream kernels just > fine, though I've never tried the degraded option. Of course, I'm also > not using systemd, so it may be some interaction with systemd that's > causing them to not work (and yes, I understand that I'm inclined to > blame systemd most of the time based on significant past experience with > systemd creating issues that never existed before). Has the bug where rootflags=device=/dev/sda1,device=/dev/sdb1 failed been fixed? Last I knew (which was ancient history in btrfs terms, but I've not seen mention of a patch for it in all that time either), device= on the userspace commandline worked, and device= on the kernel commandline worked if there was just one device, but it would fail for more than one device. Mounting degraded (on a pair-device raid1) would then of course work, since it would just use the one device=, but that's simply dangerous for routine use regardless of whether it actually assembled or not, thus effectively forcing an initr* for multi-device btrfs root in order to get it mounted properly. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-19 20:47 ` Austin S. Hemmelgarn 2017-12-19 22:23 ` Tomasz Pala @ 2017-12-21 11:44 ` Andrei Borzenkov 2017-12-21 12:27 ` Austin S. Hemmelgarn 1 sibling, 1 reply; 61+ messages in thread From: Andrei Borzenkov @ 2017-12-21 11:44 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Tomasz Pala, Linux fs Btrfs On Tue, Dec 19, 2017 at 11:47 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2017-12-19 15:41, Tomasz Pala wrote: >> >> On Tue, Dec 19, 2017 at 12:35:20 -0700, Chris Murphy wrote: >> >>> with a read only file system. Another reason is the kernel code and >>> udev rule for device "readiness" means the volume is not "ready" until >>> all member devices are present. And while the volume is not "ready" >>> systemd will not even attempt to mount. Solving this requires kernel >>> and udev work, or possibly a helper, to wait an appropriate amount of >> >> >> Sth like this? I got such problem a few months ago, my solution was >> accepted upstream: >> >> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f >> >> Rationale is in referred ticket, udev would not support any more btrfs >> logic, so unless btrfs handles this itself on kernel level (daemon?), >> that is all that can be done. > > Or maybe systemd can quit trying to treat BTRFS like a volume manager (which > it isn't) and just try to mount the requested filesystem with the requested > options? You can't mount a filesystem until a sufficient number of devices is present, and not waiting (or at least attempting to wait) for them opens you up to races on startup. So far systemd's position has been that it is up to the filesystem to give it something to wait on. And while apparently everyone agrees that the current "btrfs device ready" does not fit the bill, it is the only thing we have. This integration issue has so far been silently ignored by both btrfs and systemd developers. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-21 11:44 ` Andrei Borzenkov @ 2017-12-21 12:27 ` Austin S. Hemmelgarn 2017-12-22 16:05 ` Tomasz Pala 0 siblings, 1 reply; 61+ messages in thread From: Austin S. Hemmelgarn @ 2017-12-21 12:27 UTC (permalink / raw) To: Andrei Borzenkov; +Cc: Tomasz Pala, Linux fs Btrfs On 2017-12-21 06:44, Andrei Borzenkov wrote: > On Tue, Dec 19, 2017 at 11:47 PM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: >> On 2017-12-19 15:41, Tomasz Pala wrote: >>> >>> On Tue, Dec 19, 2017 at 12:35:20 -0700, Chris Murphy wrote: >>> >>>> with a read only file system. Another reason is the kernel code and >>>> udev rule for device "readiness" means the volume is not "ready" until >>>> all member devices are present. And while the volume is not "ready" >>>> systemd will not even attempt to mount. Solving this requires kernel >>>> and udev work, or possibly a helper, to wait an appropriate amount of >>> >>> >>> Sth like this? I got such problem a few months ago, my solution was >>> accepted upstream: >>> >>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f >>> >>> Rationale is in referred ticket, udev would not support any more btrfs >>> logic, so unless btrfs handles this itself on kernel level (daemon?), >>> that is all that can be done. >> >> Or maybe systemd can quit trying to treat BTRFS like a volume manager (which >> it isn't) and just try to mount the requested filesystem with the requested >> options? > > You can't mount filesystem until sufficient number of devices are > present and not waiting (at least attempting to wait) for them opens > you to races on startup. So far systemd position was - it is up to > filesystem to give it something to wait on. And while apparently > everyone agrees that current "btrfs device ready" does not fit the > bill, this is the only thing we have. No, it isn't. You can just make the damn mount call with the supplied options. If it succeeds, the volume was ready, if it fails, it wasn't, it's that simple, and there's absolutely no reason that systemd can't just do that in a loop until it succeeds or a timeout is reached. That isn't any more racy than waiting on them is (waiting on them to be ready and then mounting them is a TOCTOU race condition), and it doesn't have any of these issues with the volume being completely unusable in a degraded state. Also, it's not 'up to the filesystem', it's 'up to the underlying device'. LUKS, LVM, MD, and everything else that's an actual device layer is what systemd waits on. XFS, ext4, and any other filesystem except BTRFS (and possibly ZFS, but I'm not 100% sure about that) provides absolutely _NOTHING_ to wait on. Systemd just chose to handle BTRFS like a device layer, and not a filesystem, so we have this crap to deal with, as well as the fact that it makes it impossible to manually mount a BTRFS volume with missing or failed devices in degraded mode under systemd (because it unmounts it damn near instantly because it somehow thinks it knows better than the user what the user wants to do). > > This integration issue was so far silently ignored both by btrfs and > systemd developers. It's been ignored by BTRFS devs because there is _nothing_ wrong on this side other than the naming choice for the ioctl. Systemd is _THE ONLY_ init system which has this issue, every other one works just fine. 
As for the systemd side, I have no idea why they are ignoring it, though I suspect it's the usual spoiled-brat mentality that seems to surround everything people complain about regarding systemd. ^ permalink raw reply [flat|nested] 61+ messages in thread
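For illustration only, the retry-until-timeout approach described above could look roughly like the following shell loop; the device, mount point and 90-second timeout are placeholders, and this is a sketch of the idea rather than anything systemd actually implements.

  # Keep retrying the mount with the options from fstab until it succeeds
  # or the deadline expires.
  deadline=$(( $(date +%s) + 90 ))
  until mount /dev/sda /mnt; do
      [ "$(date +%s)" -ge "$deadline" ] && { echo "mount timed out" >&2; exit 1; }
      sleep 2
  done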
* Re: Unexpected raid1 behaviour 2017-12-21 12:27 ` Austin S. Hemmelgarn @ 2017-12-22 16:05 ` Tomasz Pala 2017-12-22 21:04 ` Chris Murphy 0 siblings, 1 reply; 61+ messages in thread From: Tomasz Pala @ 2017-12-22 16:05 UTC (permalink / raw) To: Linux fs Btrfs On Thu, Dec 21, 2017 at 07:27:23 -0500, Austin S. Hemmelgarn wrote: > No, it isn't. You can just make the damn mount call with the supplied > options. If it succeeds, the volume was ready, if it fails, it wasn't, > it's that simple, and there's absolutely no reason that systemd can't > just do that in a loop until it succeeds or a timeout is reached. That There is no such loop, so if mount would happen before all the required devices show up, it would either definitely fail, or if there were 'degraded' in fstab, just start degraded. > any of these issues with the volume being completely unusable in a > degraded state. > > Also, it's not 'up to the filesystem', it's 'up to the underlying > device'. LUKS, LVM, MD, and everything else that's an actual device > layer is what systemd waits on. XFS, ext4, and any other filesystem > except BTRFS (and possibly ZFS, but I'm not 100% sure about that) > provides absolutely _NOTHING_ to wait on. Systemd just chose to handle You wait for all the devices to settle. One might have dozen of drives including some attached via network and it might take a time to become available. Since systemd knows nothing about underlying components, it simply waits for the btrfs itself to announce it's ready. > BTRFS like a device layer, and not a filesystem, so we have this crap to As btrfs handles many devices in "lower part", this effectively is device layer. Mounting /dev/sda happens to mount various other /dev/sd* that are _not_ explicitly exposed, so there is really not an alternative. Except for the 'mount loop' which is a no-go. > deal with, as well as the fact that it makes it impossible to manually > mount a BTRFS volume with missing or failed devices in degraded mode > under systemd (because it unmounts it damn near instantly because it > somehow thinks it knows better than the user what the user wants to do). This seems to be some distro-specific misconfiguration, didn't happen to me on plain systemd/udev. What is the reproducing scenario? >> This integration issue was so far silently ignored both by btrfs and >> systemd developers. > It's been ignored by BTRFS devs because there is _nothing_ wrong on this > side other than the naming choice for the ioctl. Systemd is _THE ONLY_ > init system which has this issue, every other one works just fine. Not true - mounting btrfs without "btrfs device scan" doesn't work at all without udev rules (that mimic behaviour of the command). Let me repeat example from Dec 19th: 1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb, 2. reboot the system into clean state (init=/bin/sh), (or remove btrfs-scan tool), 3. try mount /dev/sda /test - fails mount /dev/sdb /test - works 4. reboot again and try in reversed order mount /dev/sdb /test - fails mount /dev/sda /test - works > As far as the systemd side, I have no idea why they are ignoring it, > though I suspect it's the usual spoiled brat mentality that seems to be > present about everything that people complain about regarding systemd. Explanation above. This is the point when _you_ need to stop ignoring the fact, that you simply cannot just try mounting devices in a loop as this would render any NAS/FC/iSCSI-backed or more complicated systems unusable or hide problems in case of temporary problems with connection. 
systemd waits for the _underlying_ devices - unless btrfs exposes them as a list of _actual_ devices to wait for, there is nothing systemd can do except wait for btrfs itself. -- Tomasz Pala <gotar@pld-linux.org> ^ permalink raw reply [flat|nested] 61+ messages in thread
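For reference, the minimal-environment reproducer above can be worked around by registering the member devices by hand before mounting; a sketch, with the device names taken from the example:

  # In a bare environment (init=/bin/sh, no udev rules), register all btrfs
  # member devices with the kernel module first:
  modprobe btrfs
  btrfs device scan        # this is what the udev rule mimics
  mount /dev/sda /test     # now works no matter which member is named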
* Re: Unexpected raid1 behaviour 2017-12-22 16:05 ` Tomasz Pala @ 2017-12-22 21:04 ` Chris Murphy 2017-12-23 2:52 ` Tomasz Pala 0 siblings, 1 reply; 61+ messages in thread From: Chris Murphy @ 2017-12-22 21:04 UTC (permalink / raw) To: Tomasz Pala; +Cc: Linux fs Btrfs On Fri, Dec 22, 2017 at 9:05 AM, Tomasz Pala <gotar@polanet.pl> wrote: > On Thu, Dec 21, 2017 at 07:27:23 -0500, Austin S. Hemmelgarn wrote: > >> Also, it's not 'up to the filesystem', it's 'up to the underlying >> device'. LUKS, LVM, MD, and everything else that's an actual device >> layer is what systemd waits on. XFS, ext4, and any other filesystem >> except BTRFS (and possibly ZFS, but I'm not 100% sure about that) >> provides absolutely _NOTHING_ to wait on. Systemd just chose to handle > > You wait for all the devices to settle. One might have dozen of drives > including some attached via network and it might take a time to become > available. Since systemd knows nothing about underlying components, it > simply waits for the btrfs itself to announce it's ready. I'm pretty sure degraded boot timeout policy is handled by dracut. The kernel doesn't just automatically assemble an md array as soon as it's possible (degraded) and then switch to normal operation as other devices appear. I have no idea how LVM manages the delay policy for multiple devices. I don't think the delay policy belongs in the kernel. It's pie in the sky, and unicorns, but it sure would be nice to have standardization rather than everyone rolling their own solution. The Red Hat Stratis folks will need something to do this for their solution so yet another one is about to be developed... -- Chris Murphy ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-22 21:04 ` Chris Murphy @ 2017-12-23 2:52 ` Tomasz Pala 2017-12-23 5:40 ` Duncan 0 siblings, 1 reply; 61+ messages in thread From: Tomasz Pala @ 2017-12-23 2:52 UTC (permalink / raw) To: Linux fs Btrfs On Fri, Dec 22, 2017 at 14:04:43 -0700, Chris Murphy wrote: > I'm pretty sure degraded boot timeout policy is handled by dracut. The Well, last time I've checked dracut on systemd-system couldn't even generate systemd-less image. > kernel doesn't just automatically assemble an md array as soon as it's > possible (degraded) and then switch to normal operation as other MD devices are explicitly listed in mdadm.conf (for mdadm --assemble --scan) or kernel command line or metadata of autodetected partitions (fd). > devices appear. I have no idea how LVM manages the delay policy for > multiple devices. I *guess* it's not about waiting, but simply being executed after the devices are ready. And there is a VERY long history of various init systems having problems to boot systems using multi-layer setups (LVM/MD under or above LUKS, not to mention remote ones that need networking to be set up). All of this works reasonably well under systemd - except for the btrfs that uses single device node to match entire group of devices. Which is convenient for living person (no need to switch between /dev/mdX and /dev/sdX), but impossible to guess automatically by userspace tools. There is only probe IOCTL which doesn't handle degraded mode. > I don't think the delay policy belongs in the kernel. That is exactly why the systemd waits for appropriate udev state. > It's pie in the sky, and unicorns, but it sure would be nice to have > standardization rather than everyone rolling their own solution. The There was a de facto standard I think - expose component devices or require them to be specified. Apparently no such thing in btrfs, so it must be handled in btrfs-way. Also note that MD can be assembled by kernel itself, while btrfs cannot (so initrd is required for rootfs). -- Tomasz Pala <gotar@pld-linux.org> ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-23 2:52 ` Tomasz Pala @ 2017-12-23 5:40 ` Duncan 0 siblings, 0 replies; 61+ messages in thread From: Duncan @ 2017-12-23 5:40 UTC (permalink / raw) To: linux-btrfs Tomasz Pala posted on Sat, 23 Dec 2017 03:52:47 +0100 as excerpted: > On Fri, Dec 22, 2017 at 14:04:43 -0700, Chris Murphy wrote: > >> I'm pretty sure degraded boot timeout policy is handled by dracut. The > > Well, last time I've checked dracut on systemd-system couldn't even > generate systemd-less image. ?? Unless it changed recently (I /chose/ a systemd-based dracut setup here, so I'd not be aware if it did), dracut can indeed do systemd-less initr* images. Dracut is modular, and systemd is one of the modules, enabled by default on a systemd system, but not required, as I know, because I had dracut setup without the systemd module for some time after I switched to systemd for my main sysinit, and I verified it didn't install systemd in the initr* until I activated the systemd module. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 61+ messages in thread
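A one-line sketch of building such an image, assuming dracut's documented --omit option and its 'systemd' module name; the image path is a placeholder and whether the resulting image suits a given system is untested here.

  # Build a test initramfs with dracut's systemd module left out:
  dracut --force --omit "systemd" /boot/initramfs-nosystemd.img "$(uname -r)"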
* Re: Unexpected raid1 behaviour 2017-12-19 20:41 ` Tomasz Pala 2017-12-19 20:47 ` Austin S. Hemmelgarn @ 2017-12-19 23:59 ` Chris Murphy 2017-12-20 8:34 ` Tomasz Pala 1 sibling, 1 reply; 61+ messages in thread From: Chris Murphy @ 2017-12-19 23:59 UTC (permalink / raw) To: Tomasz Pala; +Cc: Linux fs Btrfs On Tue, Dec 19, 2017 at 1:41 PM, Tomasz Pala <gotar@polanet.pl> wrote: > On Tue, Dec 19, 2017 at 12:35:20 -0700, Chris Murphy wrote: > >> with a read only file system. Another reason is the kernel code and >> udev rule for device "readiness" means the volume is not "ready" until >> all member devices are present. And while the volume is not "ready" >> systemd will not even attempt to mount. Solving this requires kernel >> and udev work, or possibly a helper, to wait an appropriate amount of > > Sth like this? I got such problem a few months ago, my solution was > accepted upstream: > https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f I can't parse this commit. In particular I can't tell how long it waits, or what triggers the end to waiting. > > Rationale is in referred ticket, udev would not support any more btrfs > logic, so unless btrfs handles this itself on kernel level (daemon?), > that is all that can be done. > >> time. I also think it's a bad idea to implement automatic degraded >> mounts unless there's an API for user space to receive either a push > [...] >> There is no amount of documentation that makes up for these >> deficiencies enough to enable automatic degraded mounts by default. I >> would consider it a high order betrayal of user trust to do it. > > It doesn't have to be default, might be kernel compile-time knob, module > parameter or anything else to make the *R*aid work. OK but that's cart before the horse. The horse is proper recovery behavior once a delayed/missing drive appears, i.e. resync. -- Chris Murphy ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-19 23:59 ` Chris Murphy @ 2017-12-20 8:34 ` Tomasz Pala 2017-12-20 8:51 ` Tomasz Pala 2017-12-20 19:49 ` Chris Murphy 0 siblings, 2 replies; 61+ messages in thread From: Tomasz Pala @ 2017-12-20 8:34 UTC (permalink / raw) To: Linux fs Btrfs On Tue, Dec 19, 2017 at 16:59:39 -0700, Chris Murphy wrote: >> Sth like this? I got such problem a few months ago, my solution was >> accepted upstream: >> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f > > I can't parse this commit. In particular I can't tell how long it > waits, or what triggers the end to waiting. The point is - it doesn't wait at all. Instead, every 'ready' btrfs device triggers an event on all the pending devices. Consider a 3-device filesystem consisting of /dev/sd[abd] with /dev/sdc being a different, standalone btrfs: /dev/sda -> 'not ready' /dev/sdb -> 'not ready' /dev/sdc -> 'ready', triggers /dev/sda -> 'not ready' and /dev/sdb - still 'not ready' /dev/sdc -> kernel says 'ready', triggers /dev/sda - 'ready' and /dev/sdb -> 'ready' This way all the parts of a volume are marked as ready, so systemd won't refuse mounting using legacy device nodes like /dev/sda. This particular solution depends on the kernel returning 'btrfs ready', which would obviously not work for degraded arrays unless btrfs.ko handled some 'missing' or 'mount_degraded' kernel cmdline options _before_ actually _trying_ to mount it with -o degraded. And there is a logical problem with this - _which_ array components should be ignored? Consider: volume1: /dev/sda /dev/sdb volume2: /dev/sdc /dev/sdd-broken If /dev/sdd is missing from the system, it would never be scanned, so /dev/sdc would be pending. It cannot be assembled just in time of scanning alone, because the same would happen with /dev/sda and there would be a desync with /dev/sdb, which IS available - a few moments later. This is the place for the timeout you've mentioned - there should be *some* decent timeout allowing all the devices to show up (udev waits for 90 seconds by default, or x-systemd.device-timeout=N from fstab). After such a timeout, I'd like to tell the kernel: "no more devices, give me all the remaining btrfs volumes in degraded mode if possible". By "give me btrfs volumes" I mean "mark them as 'ready'" so that udev could fire its rules. And if there were anything for udev to distinguish 'ready' from 'ready-degraded', one could easily compose some notification scripting on top of it, including sending e-mail to the sysadmin. Is there anything that would make the kernel do the above? -- Tomasz Pala <gotar@pld-linux.org> ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Unexpected raid1 behaviour 2017-12-20 8:34 ` Tomasz Pala @ 2017-12-20 8:51 ` Tomasz Pala 2017-12-20 19:49 ` Chris Murphy 0 siblings, 0 replies; 61+ messages in thread From: Tomasz Pala @ 2017-12-20 8:51 UTC (permalink / raw) To: Linux fs Btrfs Errata: On Wed, Dec 20, 2017 at 09:34:48 +0100, Tomasz Pala wrote: > /dev/sda -> 'not ready' > /dev/sdb -> 'not ready' > /dev/sdc -> 'ready', triggers /dev/sda -> 'not ready' and /dev/sdb - still 'not ready' > /dev/sdc -> kernel says 'ready', triggers /dev/sda - 'ready' and /dev/sdb -> 'ready' The last line should start with /dev/sdd. > After such timeout, I'd like to tell the kernel: "no more devices, give > me all the remaining btrfs volumes in degraded mode if possible". By Actually "if possible" means both: - if technically possible (i.e. the required data is available, like half of a RAID1), - AND if allowed for the specific volume, as there might be different policies. For example - one might allow rootfs to be started in degraded-rw mode so the system can boot up, mount /home degraded read-only so the users keep access to their files, and not mount /srv degraded at all. A failed mount can be made non-critical with the 'nofail' fstab flag. > "give me btrfs vulumes" I mean "mark them as 'ready'" so the udev could > fire it's rules. And if there would be anything for udev to distinguish > 'ready' from 'ready-degraded' one could easily compose some notification > scripting on top of it, including sending e-mail to sysadmin. -- Tomasz Pala <gotar@pld-linux.org> ^ permalink raw reply [flat|nested] 61+ messages in thread
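A sketch of how such per-volume policies might be expressed in /etc/fstab; the UUIDs are placeholders, and - as discussed elsewhere in this thread - systemd may still never attempt the degraded mounts while the volume is reported not-ready, which is exactly the problem under debate.

  # rootfs: allow a degraded read-write mount so the system can still boot
  UUID=<root-uuid>  /      btrfs  defaults,degraded                             0 0
  # /home: allow degraded, but read-only, so users keep access to their files
  UUID=<home-uuid>  /home  btrfs  defaults,degraded,ro,nofail                   0 0
  # /srv: never mount degraded; do not fail the boot if it cannot be mounted
  UUID=<srv-uuid>   /srv   btrfs  defaults,nofail,x-systemd.device-timeout=90s  0 0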
* Re: Unexpected raid1 behaviour 2017-12-20 8:34 ` Tomasz Pala 2017-12-20 8:51 ` Tomasz Pala @ 2017-12-20 19:49 ` Chris Murphy 1 sibling, 0 replies; 61+ messages in thread From: Chris Murphy @ 2017-12-20 19:49 UTC (permalink / raw) To: Tomasz Pala; +Cc: Linux fs Btrfs On Wed, Dec 20, 2017 at 1:34 AM, Tomasz Pala <gotar@polanet.pl> wrote: > On Tue, Dec 19, 2017 at 16:59:39 -0700, Chris Murphy wrote: > >>> Sth like this? I got such problem a few months ago, my solution was >>> accepted upstream: >>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f >> >> I can't parse this commit. In particular I can't tell how long it >> waits, or what triggers the end to waiting. > > The point is - it doesn't wait at all. Instead, every 'ready' btrfs > device triggers event on all the pending devices. Consider 3-device > filesystem consisting of /dev/sd[abd] with /dev/sdc being different, > standalone btrfs: > > /dev/sda -> 'not ready' > /dev/sdb -> 'not ready' > /dev/sdc -> 'ready', triggers /dev/sda -> 'not ready' and /dev/sdb - still 'not ready' > /dev/sdc -> kernel says 'ready', triggers /dev/sda - 'ready' and /dev/sdb -> 'ready' > > This way all the parts of a volume are marked as ready, so systemd won't > refuse mounting using legacy device nodes like /dev/sda. > > > This particular solution depends on kernel returning 'btrfs ready', > which would obviously not work for degraded arrays unless the btrfs.ko > handles some 'missing' or 'mount_degraded' kernel cmdline options > _before_ actually _trying_ to mount it with -o degraded. The thing that is valuing a Btrfs's "readiness" is udev. The kernel doesn't care, it still instantiates a volume UUID. And if you pass -o degraded mount to a non-ready Btrfs volume, the kernel code will try to mount that volume in degraded mode (assuming it passes tests for the minimum number of devices, can find all the supers it needs, and bootstrap the chunk tree, etc) If the udev rule were smarter, it could infer "non-ready" Btrfs volume to mean it should wait (and complaining might be nice so we know why it's waiting) for some period of time, and then if it's still not ready to try to mount with -o degraded. I don't know where teaching system about degraded attempts belongs, whether the udev rule can tell systemd to add that mount option if a volume is still not ready, of if systemd needs hard coded understanding of this mount option for Btrfs. There is no risk of using -o degraded on a Btrfs volume if it's missing too many devices, such a degraded mount will simply fail. > After such timeout, I'd like to tell the kernel: "no more devices, give > me all the remaining btrfs volumes in degraded mode if possible". By > "give me btrfs vulumes" I mean "mark them as 'ready'" so the udev could > fire it's rules. And if there would be anything for udev to distinguish > 'ready' from 'ready-degraded' one could easily compose some notification > scripting on top of it, including sending e-mail to sysadmin. I think the linguistics of "btrfs devices ready" is confusing because what we really care about is whether the volume/array can be mounted normally (not degraded). The BTRFS_IOC_DEVICES_READY ioctl is pointed to any one of the volume's devices, and you get a pass/fail. If it passes (ready), all other devices are present. If it fails (not ready), one or more devices are missing. It's not necessary to hit every device with this ioctl to understand what's going on. 
If the question can be answered with: ready, ready-degraded - it's highly likely that you always get ready-degraded as the answer for all btrfs multiple-device volumes. So if udev were to get ready-degraded, will it still wait to see if the state goes to ready? How long does it wait? Seems like it still should wait 90 seconds. In which case it's going to try to mount with -o degraded. So I see zero advantage and multiple disadvantages to having the kernel do a degradedness test well before the mount will be attempted. I think this is asking for a race condition. -- Chris Murphy ^ permalink raw reply [flat|nested] 61+ messages in thread
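A rough sketch of the 'wait, then fall back to degraded' helper being discussed, written as plain shell; the device, mount point and 90-second timeout are placeholders, and no such helper ships with btrfs-progs or systemd today.

  dev=/dev/sda; mnt=/mnt; deadline=$(( $(date +%s) + 90 ))
  # Wait until all members are registered, or give up and try degraded.
  while ! btrfs device ready "$dev"; do
      if [ "$(date +%s)" -ge "$deadline" ]; then
          echo "volume still not ready, attempting degraded mount" >&2
          exec mount -o degraded "$dev" "$mnt"
      fi
      sleep 2
  done
  mount "$dev" "$mnt"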
* Re: Unexpected raid1 behaviour 2017-12-17 11:58 ` Duncan 2017-12-17 15:48 ` Peter Grandi @ 2017-12-18 5:11 ` Anand Jain 1 sibling, 0 replies; 61+ messages in thread From: Anand Jain @ 2017-12-18 5:11 UTC (permalink / raw) To: Duncan, linux-btrfs Nice status update about btrfs volume manager. Thanks. Below I have added the names of the patch in ML/wip addressing the current limitations. On 12/17/2017 07:58 PM, Duncan wrote: > Dark Penguin posted on Sat, 16 Dec 2017 22:50:33 +0300 as excerpted: > >> Could someone please point me towards some read about how btrfs handles >> multiple devices? Namely, kicking faulty devices and re-adding them. >> >> I've been using btrfs on single devices for a while, but now I want to >> start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and >> tried to see how does it handle various situations. The experience left >> me very surprised; I've tried a number of things, all of which produced >> unexpected results. >> >> I create a btrfs raid1 filesystem on two hard drives and mount it. >> >> - When I pull one of the drives out (simulating a simple cable failure, >> which happens pretty often to me), the filesystem sometimes goes >> read-only. ??? >> - But only after a while, and not always. ??? >> - When I fix the cable problem (plug the device back), it's immediately >> "re-added" back. But I see no replication of the data I've written onto >> a degraded filesystem... Nothing shows any problems, so "my filesystem >> must be ok". ??? >> - If I unmount the filesystem and then mount it back, I see all my >> recent changes lost (everything I wrote during the "degraded" period). - >> If I continue working with a degraded raid1 filesystem (even without >> damaging it further by re-adding the faulty device), after a while it >> won't mount at all, even with "-o degraded". >> >> I can't wrap my head about all this. Either the kicked device should not >> be re-added, or it should be re-added "properly", or it should at least >> show some errors and not pretend nothing happened, right?.. >> >> I must be missing something. Is there an explanation somewhere about >> what's really going on during those situations? Also, do I understand >> correctly that upon detecting a faulty device (a write error), nothing >> is done about it except logging an error into the 'btrfs device stats' >> report? No device kicking, no notification?.. And what about degraded >> filesystems - is it absolutely forbidden to work with them without >> converting them to a "single" filesystem first?.. >> >> On Ubuntu 17.10, there's Linux 4.13.0-16 and btrfs-progs 4.12-1 . > > Btrfs device handling at this point is still "development level" and very > rough, but there's a patch set in active review ATM that should improve > things dramatically, perhaps as soon as 4.16 (4.15 is already well on the > way). > > Basically, at this point btrfs doesn't have "dynamic" device handling. > That is, if a device disappears, it doesn't know it. So it continues > attempting to write to (and read from, but the reads are redirected) the > missing device until things go bad enough it kicks to read-only for > safety. btrfs: introduce device dynamic state transition to failed > If a device is added back, the kernel normally shuffles device names and > assigns a new one. Btrfs will see it and list the new device, but it's > still trying to use the old one internally. 
=:^( btrfs: handle dynamically reappearing missing device > Thus, if a device disappears, to get it back you really have to reboot, > or at least unload/reload the btrfs kernel module, in ordered to clear > the stale device state and have btrfs rescan and reassociate devices with > the matching filesystems. > > Meanwhile, once a device goes stale -- other devices in the filesystem > have data that should have been written to the stale one but it was gone > so the data couldn't get to it -- once you do the module unload/reload or > reboot cycle and btrfs picks up the device again, you should immediately > do a btrfs scrub, which will detect and "catch up" the differences. > > Btrfs tracks atomic filesystem updates via a monotonically increasing > generation number, aka transaction-id (transid). When a device goes > offline, its generation number of course gets stuck at the point it went > offline, while the other devices continue to update their generation > numbers. > > When a stale device is readded, btrfs should automatically find and use > the device with the latest generation, but the old one isn't > automatically caught up -- a scrub is the mechanism by which you do this. > > One thing you do **NOT** want to do is degraded-writable mount one > device, then the other device, of a raid1 pair, because that'll diverge > the two with new data on each, and that's no longer simple to correct. > If you /have/ to degraded-writable mount a raid1, always make sure it's > the same one mounted writable if you want to combine them again. If you > /do/ need to recombine two diverged raid1 devices, the only safe way to > do so is to wipe the one so btrfs has only the one copy of the data to go > on, and add the wiped device back as a new device. btrfs: handle volume split brain scenario > Meanwhile, until /very/ recently... 4.13 may not be current enough... if > you mounted a two-device raid1 degraded-writable, btrfs would try to > write and note that it couldn't do raid1 because there wasn't a second > device, so it would create single chunks to write into. > > And the older filesystem safe-mount mechanism would see those single > chunks on a raid1 and decide it wasn't safe to mount the filesystem > writable at all after that, even if all the single chunks were actually > present on the remaining device. > > The effect was that if a device died, you had exactly one degraded- > writable mount to replace it successfully. If you didn't complete the > replace in that single chance writable mount, the filesystem would refuse > to mount writable again, and thus it was impossible to repair the > filesystem since that required a writable mount and that was no longer > possible! Fortunately the filesystem could still be mounted degraded- > readonly (unless there was some other problem), allowing people to at > least get at the read-only data to copy it elsewhere. > > With a new enough btrfs, while btrfs will still create those single > chunks on a degraded-writable mount of a raid1, it's at least smart > enough to do per-chunk checks to see if they're all available on existing > devices (none only on the missing device), and will continue to allow > degraded-writable mounting if so. (v4.14) btrfs: Introduce a function to check if all chunks a OK for degraded rw mount > But once the filesystem is back to multi-device (with writable space on > at least two devices), a balance-convert of those single chunks to raid1 > should be done, otherwise if the device with them on it goes... 
> > And there's work on allowing it to do only single-copy, thus incomplete- > raid1, chunk writes as well. This should prevent the single mode chunks > entirely, thus eliminating the need for the balance-convert, tho a scrub > would still be needed to fully sync back up. But I'm not sure what the > status is on that. btrfs: create degraded-RAID1 chunks (Patch is wip still. There is a good workaround). > Meanwhile, as mentioned above, there's active work on proper dynamic > btrfs device tracking and management. btrfs: Introduce device pool sysfs attributes (needs revival) > It may or may not be ready for > 4.16, but once it goes in, btrfs should properly detect a device going > away and react accordingly, and it should detect a device coming back as > a different device too. As I write this it occurs to me that I've not > read close enough to know if it actually initiates scrub/resync on its > own in the current patch set, but that's obviously an eventual goal if > not. Right. It doesn't as of now, its in my list of things to fix. > Longer term, there's further patches that will provide a hot-spare > functionality, automatically bringing in a device pre-configured as a hot- > spare if a device disappears, but that of course requires that btrfs > properly recognize devices disappearing and coming back first, so one > thing at a time. Tho as originally presented, that hot-spare > functionality was a bit limited -- it was a global hot-spare list, and > with multiple btrfs of different sizes and multiple hot-spare devices > also of different sizes, it would always just pick the first spare on the > list for the first btrfs needing one, regardless of whether the size was > appropriate for that filesystem or not. By the time the feature actually > gets merged it may have changed some, and regardless, it should > eventually get less limited, but that's _eventually_, with a target time > likely still in years, so don't hold your breath. hah. - Its not that difficult to pick up a suitable sized disk from the global hot spare list. - A CLI can show which fsid/volume a global hot spare can be the candidate for the potential replacement. - An auto replace priority can be at the fsid/volume end or we could still dedicate a global hot spare device to a fsid/volume. Related patches (needs revival): btrfs: block incompatible optional features at scan btrfs: introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV btrfs: add check not to mount a spare device btrfs: support btrfs dev scan for spare device btrfs: provide framework to get and put a spare device btrfs: introduce helper functions to perform hot replace btrfs: check for failed device and hot replace > I think that answers most of your questions. Basically, you have to be > quite careful with btrfs raid1 today, as btrfs simply doesn't have the > automated functionality to handle it yet. It's still possible to do two- > device-only raid1 and replace a failed device when you're down to one, > but it's not as easy or automated as more mature raid options such as > mdraid, and you do have to keep on top of it as a result. But it can and > does work reasonably well for those (like me) who use btrfs raid1 as > their "daily driver", as long as you /do/ keep on top of it... and don't > try to use raid1 as a replacement for real backups, because it's *not* a > backup! =:^) > Thanks, Anand ^ permalink raw reply [flat|nested] 61+ messages in thread
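A hedged sketch of the manual recovery steps described above (re-scan after the reboot or module reload, catch the stale device up with a scrub, then convert any single chunks created during a degraded-writable period back to raid1); the device name and mount point are placeholders.

  # After the previously-missing device is visible again:
  btrfs device scan
  mount /dev/sda /mnt

  # Catch the formerly-stale device up with the current generation:
  btrfs scrub start -Bd /mnt

  # If the filesystem was ever mounted degraded-writable, convert the
  # single-profile chunks it created back to raid1 ('soft' skips chunks
  # that are already raid1):
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt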
* Re: Unexpected raid1 behaviour 2017-12-16 19:50 Unexpected raid1 behaviour Dark Penguin 2017-12-17 11:58 ` Duncan @ 2017-12-18 1:20 ` Qu Wenruo 2017-12-18 13:31 ` Austin S. Hemmelgarn 2 siblings, 0 replies; 61+ messages in thread From: Qu Wenruo @ 2017-12-18 1:20 UTC (permalink / raw) To: Dark Penguin, linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 2469 bytes --] On 2017年12月17日 03:50, Dark Penguin wrote: > Could someone please point me towards some read about how btrfs handles > multiple devices? Namely, kicking faulty devices and re-adding them. > > I've been using btrfs on single devices for a while, but now I want to > start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and > tried to see how does it handle various situations. > The experience left > me very surprised; I've tried a number of things, all of which produced > unexpected results. > > I create a btrfs raid1 filesystem on two hard drives and mount it. Initial info like "btrfs fi df" will help us to dig this further. > > - When I pull one of the drives out (simulating a simple cable failure, > which happens pretty often to me), the filesystem sometimes goes > read-only. ??? Please provide the kernel message. > - But only after a while, and not always. ??? > - When I fix the cable problem (plug the device back), it's immediately > "re-added" back. But I see no replication of the data I've written onto > a degraded filesystem... Nothing shows any problems, so "my filesystem > must be ok". ??? Needs extra info like "btrfs fi df" > - If I unmount the filesystem and then mount it back, I see all my > recent changes lost (everything I wrote during the "degraded" period). > - If I continue working with a degraded raid1 filesystem (even without > damaging it further by re-adding the faulty device), after a while it > won't mount at all, even with "-o degraded". Please provide kernel message too. Although I doubt about the usefulness, it's still better than none. Thanks, Qu > > I can't wrap my head about all this. Either the kicked device should not > be re-added, or it should be re-added "properly", or it should at least > show some errors and not pretend nothing happened, right?.. > > I must be missing something. Is there an explanation somewhere about > what's really going on during those situations? Also, do I understand > correctly that upon detecting a faulty device (a write error), nothing > is done about it except logging an error into the 'btrfs device stats' > report? No device kicking, no notification?.. And what about degraded > filesystems - is it absolutely forbidden to work with them without > converting them to a "single" filesystem first?.. > > On Ubuntu 17.10, there's Linux 4.13.0-16 and btrfs-progs 4.12-1 . > > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 520 bytes --] ^ permalink raw reply [flat|nested] 61+ messages in thread
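For anyone reproducing this, a short sketch of collecting the information requested above; the mount point is a placeholder.

  # Allocation and profile information:
  btrfs fi show /mnt
  btrfs fi df /mnt

  # Per-device error counters:
  btrfs device stats /mnt

  # Kernel messages around the unplug/replug events:
  dmesg | grep -iE 'btrfs|ata|sd[a-z]'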
* Re: Unexpected raid1 behaviour 2017-12-16 19:50 Unexpected raid1 behaviour Dark Penguin 2017-12-17 11:58 ` Duncan 2017-12-18 1:20 ` Qu Wenruo @ 2017-12-18 13:31 ` Austin S. Hemmelgarn 2018-01-12 12:26 ` Dark Penguin 2 siblings, 1 reply; 61+ messages in thread From: Austin S. Hemmelgarn @ 2017-12-18 13:31 UTC (permalink / raw) To: Dark Penguin, linux-btrfs On 2017-12-16 14:50, Dark Penguin wrote: > Could someone please point me towards some read about how btrfs handles > multiple devices? Namely, kicking faulty devices and re-adding them. > > I've been using btrfs on single devices for a while, but now I want to > start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and > tried to see how does it handle various situations. The experience left > me very surprised; I've tried a number of things, all of which produced > unexpected results. Expounding a bit on Duncan's answer with some more specific info. > > I create a btrfs raid1 filesystem on two hard drives and mount it. > > - When I pull one of the drives out (simulating a simple cable failure, > which happens pretty often to me), the filesystem sometimes goes > read-only. ??? > - But only after a while, and not always. ??? The filesystem won't go read-only until it hits an I/O error, and it's non-deterministic how long it will be before that happens on an idle filesystem that only sees read access (because if all the files that are being read are in the page cache). > - When I fix the cable problem (plug the device back), it's immediately > "re-added" back. But I see no replication of the data I've written onto > a degraded filesystem... Nothing shows any problems, so "my filesystem > must be ok". ??? One of two things happens in this case, and why there is no re-sync is dependent on which happens, but both ultimately have to do with the fact that BTRFS assumes I/O errors are from device failures, and are at worst transient. Either: 1. The device reappears with the same name. This happens if the time it was disconnected is less than the kernel's command timeout (30 seconds by default). In this case, BTRFS may not even notice that the device was gone (and if it doesn't, then a re-sync isn't necessary, since it will retry all the writes it needs to). In this case, BTRFS assumes the I/O errors were temporary, and keeps using the device after logging the errors. If this happens, then you need to manually re-sync things by scrubbing the filesystem (or balancing, but scrubbing is preferred as it should run quicker and will only re-write what is actually needed). 2. The device reappears with a different name. In this case, the device was gone long enough that the block layer is certain it was disconnected, and thus when it reappears and BTRFS still holds open references to the old device node, it gets a new device node. In this case, if the 'new' device is scanned, BTRFS will recognize it as part of the FS, but will keep using the old device node. The correct fix here is to unmount the filesystem, re-scan all devices, and then remount the filesystem and manually re-sync with a scrub. > - If I unmount the filesystem and then mount it back, I see all my > recent changes lost (everything I wrote during the "degraded" period). I'm not quite sure about this, but I think BTRFS is rolling back to the last common generation number for some reason. > - If I continue working with a degraded raid1 filesystem (even without > damaging it further by re-adding the faulty device), after a while it > won't mount at all, even with "-o degraded". 
This is (probably) a known bug relating to chunk handling. In a two-device volume using a raid1 profile with a missing device, older kernels (I don't remember when the fix went in, but I could have sworn it was in 4.13) will (erroneously) generate single-profile chunks when they need to allocate new chunks. When you then go to mount the filesystem, the check for the degraded mount-ability of the FS fails because there is a device missing and single profile chunks. Now, even without that bug, it's never a good idea to run a storage array degraded for any extended period of time, regardless of what type of array it is (BTRFS, ZFS, MD, LVM, or even hardware RAID). By keeping it in 'degraded' mode, you're essentially telling the system that the array will be fixed in a reasonably short time-frame, which impacts how it handles the array. If you're not going to fix it almost immediately, you should almost always reshape the array to account for the missing device if at all possible, as that will improve relative data safety and generally get you better performance than running degraded will. > > I can't wrap my head about all this. Either the kicked device should not > be re-added, or it should be re-added "properly", or it should at least > show some errors and not pretend nothing happened, right?.. BTRFS is not the best at error reporting at the moment. If you check the output of `btrfs device stats` for that filesystem though, it should show non-zero values in the error counters (note that these counters are cumulative, so they are counts since the last time they were reset (or when the FS was created if they have never been reset)). Similarly, scrub should report errors, there should be error messages in the kernel log, and switching the FS to read-only mode _is_ technically reporting an error, as that's standard error behavior for most sensible filesystems (ext[234] being the notable exception, they just continue as if nothing happened). > > I must be missing something. Is there an explanation somewhere about > what's really going on during those situations? Also, do I understand > correctly that upon detecting a faulty device (a write error), nothing > is done about it except logging an error into the 'btrfs device stats' > report? No device kicking, no notification?.. And what about degraded > filesystems - is it absolutely forbidden to work with them without > converting them to a "single" filesystem first?.. As mentioned above, going read-only _is_ a notification that something is wrong. Translating that (and the error counter increase, and the kernel log messages) into a user visible notification is not really the job of BTRFS, especially considering that no other filesystem or device manager does so either (yes, you can get nice notifications from LVM, but they aren't _from_ LVM itself, they're from other software that watches for errors, and the same type of software works just fine for BTRFS too). If you're this worried about it and don't want to keep on top of it yourself by monitoring things manually, you really need to look into a tool like monit [1] that can handle this for you. [1] https://mmonit.com/monit/ ^ permalink raw reply [flat|nested] 61+ messages in thread
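A minimal sketch of the kind of external monitoring described above, using only the error counters from 'btrfs device stats'; the mount point and mail command are placeholders, and a real deployment would more likely use monit or similar.

  #!/bin/sh
  # Cron-able check: alert if any btrfs error counter is non-zero.
  mnt=/mnt
  errors=$(btrfs device stats "$mnt" | awk '$2 != 0')
  if [ -n "$errors" ]; then
      printf 'btrfs error counters on %s:\n%s\n' "$mnt" "$errors" | mail -s "btrfs errors" root
  fi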
* Re: Unexpected raid1 behaviour 2017-12-18 13:31 ` Austin S. Hemmelgarn @ 2018-01-12 12:26 ` Dark Penguin 0 siblings, 0 replies; 61+ messages in thread From: Dark Penguin @ 2018-01-12 12:26 UTC (permalink / raw) To: Austin S. Hemmelgarn, linux-btrfs On 18/12/17 16:31, Austin S. Hemmelgarn wrote: > On 2017-12-16 14:50, Dark Penguin wrote: >> Could someone please point me towards some read about how btrfs handles >> multiple devices? Namely, kicking faulty devices and re-adding them. >> >> I've been using btrfs on single devices for a while, but now I want to >> start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and >> tried to see how does it handle various situations. The experience left >> me very surprised; I've tried a number of things, all of which produced >> unexpected results. > Expounding a bit on Duncan's answer with some more specific info. >> >> I create a btrfs raid1 filesystem on two hard drives and mount it. >> >> - When I pull one of the drives out (simulating a simple cable failure, >> which happens pretty often to me), the filesystem sometimes goes >> read-only. ??? > - But only after a while, and not always. ??? > The filesystem won't go read-only until it hits an I/O error, and it's > non-deterministic how long it will be before that happens on an idle > filesystem that only sees read access (because if all the files that are > being read are in the page cache). >> - When I fix the cable problem (plug the device back), it's immediately >> "re-added" back. But I see no replication of the data I've written onto >> a degraded filesystem... Nothing shows any problems, so "my filesystem >> must be ok". ??? > One of two things happens in this case, and why there is no re-sync is > dependent on which happens, but both ultimately have to do with the fact > that BTRFS assumes I/O errors are from device failures, and are at worst > transient. Either: > > 1. The device reappears with the same name. This happens if the time it > was disconnected is less than the kernel's command timeout (30 seconds > by default). In this case, BTRFS may not even notice that the device > was gone (and if it doesn't, then a re-sync isn't necessary, since it > will retry all the writes it needs to). In this case, BTRFS assumes the > I/O errors were temporary, and keeps using the device after logging the > errors. If this happens, then you need to manually re-sync things by > scrubbing the filesystem (or balancing, but scrubbing is preferred as it > should run quicker and will only re-write what is actually needed). > 2. The device reappears with a different name. In this case, the device > was gone long enough that the block layer is certain it was > disconnected, and thus when it reappears and BTRFS still holds open > references to the old device node, it gets a new device node. In this > case, if the 'new' device is scanned, BTRFS will recognize it as part of > the FS, but will keep using the old device node. The correct fix here > is to unmount the filesystem, re-scan all devices, and then remount the > filesystem and manually re-sync with a scrub. > >> - If I unmount the filesystem and then mount it back, I see all my >> recent changes lost (everything I wrote during the "degraded" period). > I'm not quite sure about this, but I think BTRFS is rolling back to the > last common generation number for some reason. 
> >> - If I continue working with a degraded raid1 filesystem (even without >> damaging it further by re-adding the faulty device), after a while it >> won't mount at all, even with "-o degraded". > This is (probably) a known bug relating to chunk handling. In a two > device volume using a raid1 profile with a missing device, older kernels > (I don't remember when the fix went in, but I could have sworn it was in > 4.13) will (erroneously) generate single-profile chunks when they need > to allocate new chunks. When you then go to mount the filesystem, the > check for the degraded mount-ability of the FS fails because there is a > device missing and single profile chunks. > > Now, even without that bug, it's never a good idea t0o run a storage > array degraded for any extended period of time, regardless of what type > of array it is (BTRFS, ZFS, MD, LVM, or even hardware RAID). By keeping > it in 'degraded' mode, you're essentially telling the system that the > array will be fixed in a reasonably short time-frame, which impacts how > it handles the array. If you're not going to fix it almost immediately, > you should almost always reshape the array to account for the missing > device if at all possible, as that will improve relative data safety and > generally get you better performance than running degraded will. >> >> I can't wrap my head about all this. Either the kicked device should not >> be re-added, or it should be re-added "properly", or it should at least >> show some errors and not pretend nothing happened, right?.. > BTRFS is not the best at error reporting at the moment. If you check > the output of `btrfs device stats` for that filesystem though, it should > show non-zero values in the error counters (note that these counters are > cumulative, so they are counts since the last time they were reset (or > when the FS was created if they have never been reset). Similarly, > scrub should report errors, there should be error messages in the kernel > log, and switching the FS to read-only mode _is_ technically reporting > an error, as that's standard error behavior for most sensible > filesystems (ext[234] being the notable exception, they just continue as > if nothing happened). >> >> I must be missing something. Is there an explanation somewhere about >> what's really going on during those situations? Also, do I understand >> correctly that upon detecting a faulty device (a write error), nothing >> is done about it except logging an error into the 'btrfs device stats' >> report? No device kicking, no notification?.. And what about degraded >> filesystems - is it absolutely forbidden to work with them without >> converting them to a "single" filesystem first?.. > As mentioned above, going read-only _is_ a notification that something > is wrong. Translating that (and the error counter increase, and the > kernel log messages) into a user visible notification is not really the > job of BTRFS, especially considering that no other filesystem or device > manager does so either (yes, you can get nice notifications from LVM, > but they aren't _from_ LVM itself, they're from other software that > watches for errors, and the same type of software works just fine for > BTRFS too). If you're this worried about it and don't want to keep on > top of it yourself by monitoring things manually, you really need to > look into a tool like monit [1] that can handle this for you. > > > [1] https://mmonit.com/monit/ Thank you! That was a really detailed explanation! 
I was using MD for a long time, so I was expecting kind of the same behaviour - like refusing to add the failed device back without resyncing, kicking faulty devices from the array, sending email warnings, being able to use the array in degraded mode with no problems (in case of RAID1) and so on. But I guess a few things are different in the btrfs mindset. It behaves more like a filesystem, so it doesn't force you to ensure data integrity; noticing errors and fixing them is up to you, like with any normal filesystem. The test I did was a "try to break btrfs and see if it survives" test, which mdadm would have passed (probably), but now I understand that btrfs was not made for that. However, with some error-reporting tools, it's probably possible to make it reasonably reliable. -- darkpenguin -- darkpenguin ^ permalink raw reply [flat|nested] 61+ messages in thread