* btrfs as / filesystem in RAID1 @ 2019-02-01 10:28 Stefan K 2019-02-01 19:13 ` Hans van Kranenburg 2019-02-02 23:35 ` Chris Murphy 0 siblings, 2 replies; 32+ messages in thread From: Stefan K @ 2019-02-01 10:28 UTC (permalink / raw) To: linux-btrfs Hello, I've installed Debian Stretch with / on btrfs in raid1 across 2 SSDs. Today I wanted to test whether it works. It works fine while the server is running: if an SSD breaks, I can replace it. But it looks like it does not work if the SSD has already failed before a restart. I got an error that one of the disks can't be read and was dropped to an initramfs prompt; I expected it to keep running like mdraid and just report that something is missing. My question is: is it possible to configure btrfs/fstab/grub so that it still boots? (That is what I expect from a RAID1.) best regards Stefan ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: btrfs as / filesystem in RAID1 2019-02-01 10:28 btrfs as / filesystem in RAID1 Stefan K @ 2019-02-01 19:13 ` Hans van Kranenburg 2019-02-07 11:04 ` Stefan K 2019-02-02 23:35 ` Chris Murphy 1 sibling, 1 reply; 32+ messages in thread From: Hans van Kranenburg @ 2019-02-01 19:13 UTC (permalink / raw) To: Stefan K, linux-btrfs Hi Stefan, On 2/1/19 11:28 AM, Stefan K wrote: > > I've installed my Debian Stretch to have / on btrfs with raid1 on 2 > SSDs. Today I want test if it works, it works fine until the server > is running and the SSD get broken and I can change this, but it looks > like that it does not work if the SSD fails until restart. I got the > error, that one of the Disks can't be read and I got a initramfs > prompt, I expected that it still runs like mdraid and said something > is missing. > > My question is, is it possible to configure btrfs/fstab/grub that it > still boot? (that is what I expected from a RAID1) Yes. I'm not the expert in this area, but I see you haven't got a reply today yet, so I'll try. What you see happening is correct. This is the default behavior. To be able to boot into your system with a missing disk, you can add... rootflags=degraded ...to the linux kernel command line by editing it on the fly when you are in the GRUB menu. This allows the filesystem to start in 'degraded' mode this one time. The only thing you should be doing when the system is booted is have a new disk present already in place and fix the btrfs situation. This means things like cloning the partition table of the disk that's still working, doing whatever else is needed in your situation and then running btrfs replace to replace the missing disk with the new one, and then making sure you don't have "single" block groups left (using btrfs balance), which might have been created for new writes when the filesystem was running in degraded mode. -- Hans van Kranenburg
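[Editor's note] The procedure Hans describes can be written out as a command transcript. This is an illustrative sketch only: the device names (/dev/sda surviving, /dev/sdb new), the devid of the missing device, the partition numbers, and the use of sgdisk for GPT cloning are all assumptions, not details from the thread.

```shell
# At the GRUB menu, press 'e' and append to the 'linux' line (one boot only):
#   rootflags=degraded
# After booting degraded, with the replacement disk installed as /dev/sdb
# (hypothetical device names; adjust for your system):

sgdisk -R /dev/sdb /dev/sda        # clone partition table of the surviving disk
sgdisk -G /dev/sdb                 # give the clone fresh partition GUIDs

btrfs filesystem show /            # note the devid of the missing device (say, 2)
btrfs replace start 2 /dev/sdb2 /  # rebuild onto the new disk, addressed by devid
btrfs replace status /             # wait until the rebuild reports finished

# Convert any "single" block groups created while degraded back to raid1:
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /
btrfs filesystem df /              # verify no "single" block groups remain
```

The "soft" filter makes balance touch only chunks that do not already match the target profile, so the convert pass after the replace stays cheap.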
* Re: btrfs as / filesystem in RAID1 2019-02-01 19:13 ` Hans van Kranenburg @ 2019-02-07 11:04 ` Stefan K 2019-02-07 12:18 ` Austin S. Hemmelgarn ` (2 more replies) 0 siblings, 3 replies; 32+ messages in thread From: Stefan K @ 2019-02-07 11:04 UTC (permalink / raw) To: linux-btrfs Thanks! With degraded as a kernel parameter and also in the fstab options it works as expected. That should be the normal behaviour: a server must be up and running, and I don't care about a device loss; that's why I use RAID1. I can fix the device-loss problem later, but it's important that the server is up and running. I get informed at boot time and in the log files that a device is missing, and I also see it if I use a monitoring program. So please change the default behavior. On Friday, February 1, 2019 7:13:16 PM CET Hans van Kranenburg wrote: > [...full quote of the previous message trimmed...]
* Re: btrfs as / filesystem in RAID1 2019-02-07 11:04 ` Stefan K @ 2019-02-07 12:18 ` Austin S. Hemmelgarn 2019-02-07 18:53 ` waxhead 2019-02-07 17:15 ` Chris Murphy 2019-02-11 9:30 ` Anand Jain 2 siblings, 1 reply; 32+ messages in thread From: Austin S. Hemmelgarn @ 2019-02-07 12:18 UTC (permalink / raw) To: Stefan K, linux-btrfs On 2019-02-07 06:04, Stefan K wrote: > Thanks, with degraded as kernel parameter and also ind the fstab it works like expected > > That should be the normal behaviour, cause a server must be up and running, and I don't care about a device loss, thats why I use a RAID1. The device-loss problem can I fix later, but its important that a server is up and running, i got informed at boot time and also in the logs files that a device is missing, also I see that if you use a monitoring program. No, it shouldn't be the default, because: * Normal desktop users _never_ look at the log files or boot info, and rarely run monitoring programs, so they as a general rule won't notice until it's already too late. BTRFS isn't just a server filesystem, so it needs to be safe for regular users too. * It's easily possible to end up mounting degraded by accident if one of the constituent devices is slow to enumerate, and this can easily result in a split-brain scenario where all devices have diverged and the volume can only be repaired by recreating it from scratch. * We have _ZERO_ automatic recovery from this situation. This makes both of the above mentioned issues far more dangerous. * It just plain does not work with most systemd setups, because systemd will hang waiting on all the devices to appear due to the fact that they refuse to acknowledge that the only way to correctly know if a BTRFS volume will mount is to just try and mount it. 
* Given that new kernels still don't properly generate half-raid1 chunks when a device is missing in a two-device raid1 setup, there's a very real possibility that users will have trouble recovering filesystems with old recovery media (IOW, any recovery environment running a kernel before 4.14 will not mount the volume correctly). * You shouldn't be mounting writable and degraded for any reason other than fixing the volume (or converting it to a single profile until you can fix it), even aside from the other issues. > > So please change the normal behavior > > On Friday, February 1, 2019 7:13:16 PM CET Hans van Kranenburg wrote: >> [...full quote trimmed...]
* Re: btrfs as / filesystem in RAID1 2019-02-07 12:18 ` Austin S. Hemmelgarn @ 2019-02-07 18:53 ` waxhead 2019-02-07 19:39 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 32+ messages in thread From: waxhead @ 2019-02-07 18:53 UTC (permalink / raw) To: Austin S. Hemmelgarn, Stefan K, linux-btrfs Austin S. Hemmelgarn wrote: > On 2019-02-07 06:04, Stefan K wrote: >> Thanks, with degraded as kernel parameter and also ind the fstab it >> works like expected >> >> That should be the normal behaviour, cause a server must be up and >> running, and I don't care about a device loss, thats why I use a >> RAID1. The device-loss problem can I fix later, but its important that >> a server is up and running, i got informed at boot time and also in >> the logs files that a device is missing, also I see that if you use a >> monitoring program. > No, it shouldn't be the default, because: > > * Normal desktop users _never_ look at the log files or boot info, and > rarely run monitoring programs, so they as a general rule won't notice > until it's already too late. BTRFS isn't just a server filesystem, so > it needs to be safe for regular users too. I am willing to argue that whatever you refer to as normal users don't have a clue how to make a raid1 filesystem, nor do they care about what underlying filesystem their computer runs. I can't quite see how a limping system would be worse than a failing system in this case. Besides "normal" desktop users use Windows anyway, people that run on penguin powered stuff generally have at least some technical knowledge. > * It's easily possible to end up mounting degraded by accident if one of > the constituent devices is slow to enumerate, and this can easily result > in a split-brain scenario where all devices have diverged and the volume > can only be repaired by recreating it from scratch. Am I wrong or would not the remaining disk have the generation number bumped on every commit? 
would it not make sense to ignore (previously) stale disks and require a manual "re-add" of the failed disks? From the perspective of a user with some C coding knowledge this sounds to me (in principle) quite simple. E.g. if the superblock UUID matches for all devices and one (or more) devices has a lower generation number than the other(s), then the disk(s) with the newest generation number should be considered good and the other disks with a lower generation number should be marked as failed. > * We have _ZERO_ automatic recovery from this situation. This makes > both of the above mentioned issues far more dangerous. See above, would this not be as simple as auto-deleting disks from the pool that have a matching UUID and a mismatch for the superblock generation number? Not exactly a recovery, but the system should be able to limp along. > * It just plain does not work with most systemd setups, because systemd > will hang waiting on all the devices to appear due to the fact that they > refuse to acknowledge that the only way to correctly know if a BTRFS > volume will mount is to just try and mount it. As far as I have understood, BTRFS refuses to mount even in redundant setups without the degraded flag. Why?! This is just plain useless. If anything, the degraded mount option should be replaced with something like failif=X, where X would be anything from 'never', which should get a 2-disk system up with exclusively raid1 profiles even if only one device is working, to 'always', in case any device is failed, or even 'atrisk', when loss of one more device would break any raid chunk profile guarantee. (This gets admittedly complex in a multi-disk raid1 setup, or when subvolumes perhaps can be mounted with different "raid" profiles....)
> * Given that new kernels still don't properly generate half-raid1 chunks > when a device is missing in a two-device raid1 setup, there's a very > real possibility that users will have trouble recovering filesystems > with old recovery media (IOW, any recovery environment running a kernel > before 4.14 will not mount the volume correctly). Sometimes you have to break a few eggs to make an omelette, right? If people want to recover their data they should have backups, and if they are really interested in recovering their data (and don't have backups) then they will probably find this on the web by searching anyway... > * You shouldn't be mounting writable and degraded for any reason other > than fixing the volume (or converting it to a single profile until you > can fix it), even aside from the other issues. Well, in my opinion the degraded mount option is counterintuitive. Unless otherwise asked for, the system should mount and work as long as it can guarantee the data can be read and written somehow (regardless of whether any redundancy guarantee is met). If the user is willing to accept more or less risk they should configure it!
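[Editor's note] waxhead's proposed rule, that among devices sharing one filesystem UUID the highest superblock generation wins and lower generations are marked stale, can be sketched in a few lines of shell. This is a toy model only, not btrfs code, and the device:generation pairs are made up.

```shell
#!/bin/sh
# Toy model of the proposal: among devices sharing one filesystem UUID,
# the highest superblock generation is "good"; lower generations are stale.
pick_good() {
    best_dev=''; best_gen=-1
    for pair in "$@"; do
        dev=${pair%%:*}; gen=${pair##*:}
        if [ "$gen" -gt "$best_gen" ]; then
            best_gen=$gen; best_dev=$dev
        fi
    done
    for pair in "$@"; do
        dev=${pair%%:*}; gen=${pair##*:}
        [ "$gen" -lt "$best_gen" ] && echo "stale: $dev (gen $gen < $best_gen)"
    done
    echo "good: $best_dev (gen $best_gen)"
}

# Hypothetical two-device raid1 where sdb2 missed some commits:
pick_good sda2:1042 sdb2:988
# -> stale: sdb2 (gen 988 < 1042)
# -> good: sda2 (gen 1042)
```

As the reply below points out, this rule is only sound when exactly one device has advanced past the common ancestor; generation numbers alone cannot prove that one device is a strict prior state of the other.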
* Re: btrfs as / filesystem in RAID1 2019-02-07 18:53 ` waxhead @ 2019-02-07 19:39 ` Austin S. Hemmelgarn 2019-02-07 21:21 ` Remi Gauvin ` (3 more replies) 0 siblings, 4 replies; 32+ messages in thread From: Austin S. Hemmelgarn @ 2019-02-07 19:39 UTC (permalink / raw) To: waxhead, Stefan K, linux-btrfs On 2019-02-07 13:53, waxhead wrote: > > > Austin S. Hemmelgarn wrote: >> On 2019-02-07 06:04, Stefan K wrote: >>> Thanks, with degraded as kernel parameter and also ind the fstab it >>> works like expected >>> >>> That should be the normal behaviour, cause a server must be up and >>> running, and I don't care about a device loss, thats why I use a >>> RAID1. The device-loss problem can I fix later, but its important >>> that a server is up and running, i got informed at boot time and also >>> in the logs files that a device is missing, also I see that if you >>> use a monitoring program. >> No, it shouldn't be the default, because: >> >> * Normal desktop users _never_ look at the log files or boot info, and >> rarely run monitoring programs, so they as a general rule won't notice >> until it's already too late. BTRFS isn't just a server filesystem, so >> it needs to be safe for regular users too. > > I am willing to argue that whatever you refer to as normal users don't > have a clue how to make a raid1 filesystem, nor do they care about what > underlying filesystem their computer runs. I can't quite see how a > limping system would be worse than a failing system in this case. > Besides "normal" desktop users use Windows anyway, people that run on > penguin powered stuff generally have at least some technical knowledge. Once you get into stuff like Arch or Gentoo, yeah, people tend to have enough technical knowledge to handle this type of thing, but if you're talking about the big distros like Ubuntu or Fedora, not so much. 
Yes, I might be a bit pessimistic here, but that pessimism is based on personal experience over many years of providing technical support for people. Put differently, human nature is to ignore things that aren't immediately relevant. Kernel logs don't matter until you see something wrong. Boot messages don't matter unless you happen to see them while the system is booting (and most people don't). Monitoring is the only way here, but most people won't invest the time in proper monitoring until they have problems. Even as a seasoned sysadmin, I never look at kernel logs until I see any problem, I rarely see boot messages on most of the systems I manage (because I'm rarely sitting at the console when they boot up, and when I am I'm usually handling startup of a dozen or so systems simultaneously after a network-wide outage), and I only monitor things that I know for certain need to be monitored. > >> * It's easily possible to end up mounting degraded by accident if one >> of the constituent devices is slow to enumerate, and this can easily >> result in a split-brain scenario where all devices have diverged and >> the volume can only be repaired by recreating it from scratch. > > Am I wrong or would not the remaining disk have the generation number > bumped on every commit? would it not make sense to ignore (previously) > stale disks and require a manual "re-add" of the failed disks. From a > users perspective with some C coding knowledge this sounds to me (in > principle) like something as quite simple. > E.g. if the superblock UUID match for all devices and one (or more) > devices has a lower generation number than the other(s) then the disk(s) > with the newest generation number should be considered good and the > other disks with a lower generation number should be marked as failed. The problem is that if you're defaulting to this behavior, you can have multiple disks diverge from the base. 
Imagine, for example, a system with two devices in a raid1 setup with degraded mounts enabled by default, and either device randomly taking longer than normal to enumerate. It's very possible for one boot to have one device delay during enumeration on one boot, then the other on the next boot, and if not handled _exactly_ right by the user, this will result in both devices having a higher generation number than they started with, but neither one being 'wrong'. It's like trying to merge branches in git that both have different changes to a binary file, there's no sane way to handle it without user input. Realistically, we can only safely recover from divergence correctly if we can prove that all devices are true prior states of the current highest generation, which is not currently possible to do reliably because of how BTRFS operates. Also, LVM and MD have the exact same issue, it's just not as significant because they re-add and re-sync missing devices automatically when they reappear, which makes such split-brain scenarios much less likely. > >> * We have _ZERO_ automatic recovery from this situation. This makes >> both of the above mentioned issues far more dangerous. > > See above, would this not be as simple as auto-deleting disks from the > pool that has a matching UUID and a mismatch for the superblock > generation number? Not exactly a recovery, but the system should be able > to limp along. > >> * It just plain does not work with most systemd setups, because >> systemd will hang waiting on all the devices to appear due to the fact >> that they refuse to acknowledge that the only way to correctly know if >> a BTRFS volume will mount is to just try and mount it. > > As far as I have understood this BTRFS refuses to mount even in > redundant setups without the degraded flag. Why?! This is just plain > useless. 
If anything the degraded mount option should be replaced with > something like failif=X where X would be anything from 'never' which > should get a 2 disk system up with exclusively raid1 profiles even if > only one device is working. 'always' in case any device is failed or > even 'atrisk' when loss of one more device would keep any raid chunk > profile guarantee. (this get admittedly complex in a multi disk raid1 > setup or when subvolumes perhaps can be mounted with different "raid" > profiles....) The issue with systemd is that if you pass 'degraded' on most systemd systems, and devices are missing when the system tries to mount the volume, systemd won't mount it because it doesn't see all the devices. It doesn't even _try_ to mount it because it doesn't see all the devices. Changing to degraded by default won't fix this, because it's a systemd problem. The same issue also makes it a serious pain in the arse to recover degraded BTRFS volumes on systemd systems, because if the volume is supposed to mount normally on that system, systemd will unmount it if it doesn't see all the devices, regardless of how it got mounted in the first place. IOW, there's a special case with systemd that makes even mounting BTRFS volumes that have missing devices degraded not work. > >> * Given that new kernels still don't properly generate half-raid1 >> chunks when a device is missing in a two-device raid1 setup, there's a >> very real possibility that users will have trouble recovering >> filesystems with old recovery media (IOW, any recovery environment >> running a kernel before 4.14 will not mount the volume correctly). > Sometimes you have to break a few eggs to make an omelette right? If > people want to recover their data they should have backups, and if they > are really interested in recovering their data (and don't have backups) > then they will probably find this on the web by searching anyway... Backups aren't the type of recovery I'm talking about. 
I'm talking about people booting to things like SystemRescueCD to fix system configuration or do offline maintenance without having to nuke the system and restore from backups. Such recovery environments often don't get updated for a _long_ time, and such usage is not atypical as a first step in trying to fix a broken system in situations where downtime really is a serious issue. > >> * You shouldn't be mounting writable and degraded for any reason other >> than fixing the volume (or converting it to a single profile until you >> can fix it), even aside from the other issues. > > Well in my opinion the degraded mount option is counter intuitive. > Unless otherwise asked for the system should mount and work as long as > it can guarantee the data can be read and written somehow (regardless if > any redundancy guarantee is not met). If the user is willing to accept > more or less risk they should configure it! Again, BTRFS mounting degraded is significantly riskier than LVM or MD doing the same thing. Most users don't properly research things (When's the last time you did a complete cost/benefit analysis before deciding to use a particular piece of software on a system?), and would not know they were taking on significantly higher risk by using BTRFS without configuring it to behave safely until it actually caused them problems, at which point most people would then complain about the resulting data loss instead of trying to figure out why it happened and prevent it in the first place. I don't know about you, but I for one would rather BTRFS have a reputation for being over-aggressively safe by default than risking users data by default.
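[Editor's note] The split-brain scenario Austin describes earlier in this message (each device mounted degraded on a different boot) can be modeled with trivial arithmetic. This is a minimal sketch with made-up generation numbers; it models only the bookkeeping, not btrfs itself.

```shell
# Two-device raid1, both at generation 100 (the common ancestor).
genA=100; genB=100

# Boot 1: device B is slow to enumerate; A is mounted degraded and written to.
genA=$((genA + 5))

# Boot 2: device A is slow to enumerate; B is mounted degraded and written to.
genB=$((genB + 3))

# Both devices now sit above the ancestor (100) but hold different data:
# neither is a strict prior state of the other, so "highest generation
# wins" cannot pick a correct survivor.
echo "A=$genA B=$genB ancestor=100"
# -> A=105 B=103 ancestor=100
```

This is the binary-file git-merge analogy from the message above: two diverged histories with no common content to reconcile automatically.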
* Re: btrfs as / filesystem in RAID1 2019-02-07 19:39 ` Austin S. Hemmelgarn @ 2019-02-07 21:21 ` Remi Gauvin 2019-02-08 4:51 ` Andrei Borzenkov ` (2 subsequent siblings) 3 siblings, 0 replies; 32+ messages in thread From: Remi Gauvin @ 2019-02-07 21:21 UTC (permalink / raw) To: linux-btrfs On 2019-02-07 2:39 p.m., Austin S. Hemmelgarn wrote: > Again, BTRFS mounting degraded is significantly riskier than LVM or MD > doing the same thing. Most users don't properly research things (When's > the last time you did a complete cost/benefit analysis before deciding > to use a particular piece of software on a system?), and would not know > they were taking on significantly higher risk by using BTRFS without > configuring it to behave safely until it actually caused them problems, > at which point most people would then complain about the resulting data > loss instead of trying to figure out why it happened and prevent it in > the first place. I don't know about you, but I for one would rather > BTRFS have a reputation for being over-aggressively safe by default than > risking users data by default. Another important consideration is that BTRFS has practically zero tolerance for corruption in the metadata. Most other FS's can, at least on surface appearance, either continue working despite bits of scrambled data, or have repair utilities that are pretty good at figuring out what the scrambled data should be and make a best-guess effort that more or less works (leaving aside for now statistics as to how often that might cause undetected data corruption, which possibly propagates to backups etc.). BTRFS is almost entirely reliant on the duplicate copy of metadata, which is missing when running degraded. That makes it much more likely for a simple error to break the FS entirely. BTRFS default configuration prioritizes data integrity over uptime, and I think a very good argument can be made that, for any FS, that *should* be the default.
* Re: btrfs as / filesystem in RAID1 2019-02-07 19:39 ` Austin S. Hemmelgarn 2019-02-07 21:21 ` Remi Gauvin @ 2019-02-08 4:51 ` Andrei Borzenkov 2019-02-08 12:54 ` Austin S. Hemmelgarn 2019-02-08 7:15 ` Stefan K 2019-02-08 18:10 ` waxhead 3 siblings, 1 reply; 32+ messages in thread From: Andrei Borzenkov @ 2019-02-08 4:51 UTC (permalink / raw) To: Austin S. Hemmelgarn, waxhead, Stefan K, linux-btrfs 07.02.2019 22:39, Austin S. Hemmelgarn writes: > The issue with systemd is that if you pass 'degraded' on most systemd > systems, and devices are missing when the system tries to mount the > volume, systemd won't mount it because it doesn't see all the devices. > It doesn't even _try_ to mount it because it doesn't see all the > devices. Changing to degraded by default won't fix this, because it's a > systemd problem. > Oh no, not again. It was discussed millions of times already - systemd is using information that btrfs provides. > The same issue also makes it a serious pain in the arse to recover > degraded BTRFS volumes on systemd systems, because if the volume is > supposed to mount normally on that system, systemd will unmount it if it > doesn't see all the devices, regardless of how it got mounted in the > first place. > *That* would indeed be a systemd issue. If someone can reliably reproduce it, a systemd bug report would certainly be in order.
* Re: btrfs as / filesystem in RAID1 2019-02-08 4:51 ` Andrei Borzenkov @ 2019-02-08 12:54 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 32+ messages in thread From: Austin S. Hemmelgarn @ 2019-02-08 12:54 UTC (permalink / raw) To: Andrei Borzenkov, linux-btrfs; +Cc: waxhead, Stefan K On 2019-02-07 23:51, Andrei Borzenkov wrote: > 07.02.2019 22:39, Austin S. Hemmelgarn пишет: >> The issue with systemd is that if you pass 'degraded' on most systemd >> systems, and devices are missing when the system tries to mount the >> volume, systemd won't mount it because it doesn't see all the devices. >> It doesn't even _try_ to mount it because it doesn't see all the >> devices. Changing to degraded by default won't fix this, because it's a >> systemd problem. >> > > Oh no, not again. It was discussed millions of times already - systemd > is using information that btrfs provides. And we've already told the systemd developers to quit using the ioctl they're using because it causes this issue and also introduces a TOCTOU race condition that can be avoided by just trying to mount the volume with the provided options. > >> The same issue also makes it a serious pain in the arse to recover >> degraded BTRFS volumes on systemd systems, because if the volume is >> supposed to mount normally on that system, systemd will unmount it if it >> doesn't see all the devices, regardless of how it got mounted in the >> first place. >> > > *That* would be systemd issue indeed. If someone can reliably reproduce > it, systemd bug report would certainly be in order. > It's been a few months since I dealt with it last (I don't use systemd on my everyday systems, because of this and a bunch of other issues I have with it (mostly design complaints, not bugs FWIW)), but the general process is as follows: 1. Configure a multi-device BTRFS volume such that removal of one device will cause the DEVICE_READY ioctl to return false. 2. 
Set it up in `/etc/fstab` or as a mount unit such that it will normally get mounted at boot, but won't prevent the system from booting if it fails. 3. Reboot with one of the devices missing. 4. Attempt to manually mount the volume using the regular `mount` command with the `degraded` option. 5. Check the mount table, there should be no entry for the volume you just mounted in it. After dealing with this the first time (multiple years ago now), I took the time to trace system calls, and found that systemd was unmounting the volume immediately after I mounted it.
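[Editor's note] For anyone trying to produce the reproduction Andrei asked for, Austin's five steps can be sketched as a transcript. Everything here is an assumption for illustration (loop-device backing files, the mount point, the fstab line); it needs root on a systemd-based system, and whether step 5 shows the unmount depends on the systemd version in use.

```shell
# 1. Build a two-device raid1 volume on loop devices (hypothetical paths),
#    so that removing one device makes the "devices ready" check fail.
truncate -s 1G /tmp/d1.img /tmp/d2.img
DEV1=$(losetup --show -f /tmp/d1.img)
DEV2=$(losetup --show -f /tmp/d2.img)
mkfs.btrfs -d raid1 -m raid1 "$DEV1" "$DEV2"

# 2. Arrange for it to mount at boot without blocking boot on failure, e.g.
#    an /etc/fstab line like:
#      UUID=<fs-uuid>  /mnt/test  btrfs  degraded,nofail  0 0

# 3. Reboot with one device missing (here simulated by detaching one loop):
#      losetup -d "$DEV2"

# 4. Try to mount the survivor degraded by hand:
mount -o degraded "$DEV1" /mnt/test

# 5. Check the mount table; on affected setups the entry is already gone
#    because systemd unmounted the volume it considered not ready:
grep /mnt/test /proc/mounts || echo "volume was unmounted again"
```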
* Re: btrfs as / filesystem in RAID1 2019-02-07 19:39 ` Austin S. Hemmelgarn 2019-02-07 21:21 ` Remi Gauvin 2019-02-08 4:51 ` Andrei Borzenkov @ 2019-02-08 7:15 ` Stefan K 2019-02-08 12:58 ` Austin S. Hemmelgarn 2019-02-08 16:56 ` Chris Murphy 2019-02-08 18:10 ` waxhead 3 siblings, 2 replies; 32+ messages in thread From: Stefan K @ 2019-02-08 7:15 UTC (permalink / raw) To: linux-btrfs > * Normal desktop users _never_ look at the log files or boot info, and > rarely run monitoring programs, so they as a general rule won't notice > until it's already too late. BTRFS isn't just a server filesystem, so > it needs to be safe for regular users too. I guess a normal desktop user wouldn't create a RAID1 or any other RAID setup, right? So an admin takes care of a RAID and monitors it (it doesn't matter whether it is a hardware RAID, mdraid, ZFS raid or whatever), and degraded only applies to RAID setups; it's not relevant for single-disk usage, right? > Also, LVM and MD have the exact same issue, it's just not as significant > because they re-add and re-sync missing devices automatically when they > reappear, which makes such split-brain scenarios much less likely. Why doesn't btrfs do that? On Thursday, February 7, 2019 2:39:34 PM CET Austin S. Hemmelgarn wrote: > [...full quote trimmed...]
> > Also, LVM and MD have the exact same issue, it's just not as significant > because they re-add and re-sync missing devices automatically when they > reappear, which makes such split-brain scenarios much less likely. > > > >> * We have _ZERO_ automatic recovery from this situation. This makes > >> both of the above mentioned issues far more dangerous. > > > > See above, would this not be as simple as auto-deleting disks from the > > pool that has a matching UUID and a mismatch for the superblock > > generation number? Not exactly a recovery, but the system should be able > > to limp along. > > > >> * It just plain does not work with most systemd setups, because > >> systemd will hang waiting on all the devices to appear due to the fact > >> that they refuse to acknowledge that the only way to correctly know if > >> a BTRFS volume will mount is to just try and mount it. > > > > As far as I have understood this BTRFS refuses to mount even in > > redundant setups without the degraded flag. Why?! This is just plain > > useless. If anything the degraded mount option should be replaced with > > something like failif=X where X would be anything from 'never' which > > should get a 2 disk system up with exclusively raid1 profiles even if > > only one device is working. 'always' in case any device is failed or > > even 'atrisk' when loss of one more device would keep any raid chunk > > profile guarantee. (this get admittedly complex in a multi disk raid1 > > setup or when subvolumes perhaps can be mounted with different "raid" > > profiles....) > The issue with systemd is that if you pass 'degraded' on most systemd > systems, and devices are missing when the system tries to mount the > volume, systemd won't mount it because it doesn't see all the devices. > It doesn't even _try_ to mount it because it doesn't see all the > devices. Changing to degraded by default won't fix this, because it's a > systemd problem. 
> > The same issue also makes it a serious pain in the arse to recover > degraded BTRFS volumes on systemd systems, because if the volume is > supposed to mount normally on that system, systemd will unmount it if it > doesn't see all the devices, regardless of how it got mounted in the > first place. > > IOW, there's a special case with systemd that makes even mounting BTRFS > volumes that have missing devices degraded not work. > > > >> * Given that new kernels still don't properly generate half-raid1 > >> chunks when a device is missing in a two-device raid1 setup, there's a > >> very real possibility that users will have trouble recovering > >> filesystems with old recovery media (IOW, any recovery environment > >> running a kernel before 4.14 will not mount the volume correctly). > > Sometimes you have to break a few eggs to make an omelette right? If > > people want to recover their data they should have backups, and if they > > are really interested in recovering their data (and don't have backups) > > then they will probably find this on the web by searching anyway... > Backups aren't the type of recovery I'm talking about. I'm talking > about people booting to things like SystemRescueCD to fix system > configuration or do offline maintenance without having to nuke the > system and restore from backups. Such recovery environments often don't > get updated for a _long_ time, and such usage is not atypical as a first > step in trying to fix a broken system in situations where downtime > really is a serious issue. > > > >> * You shouldn't be mounting writable and degraded for any reason other > >> than fixing the volume (or converting it to a single profile until you > >> can fix it), even aside from the other issues. > > > > Well in my opinion the degraded mount option is counter intuitive. 
> > Unless otherwise asked for the system should mount and work as long as > > it can guarantee the data can be read and written somehow (regardless if > > any redundancy guarantee is not met). If the user is willing to accept > > more or less risk they should configure it! > Again, BTRFS mounting degraded is significantly riskier than LVM or MD > doing the same thing. Most users don't properly research things (When's > the last time you did a complete cost/benefit analysis before deciding > to use a particular piece of software on a system?), and would not know > they were taking on significantly higher risk by using BTRFS without > configuring it to behave safely until it actually caused them problems, > at which point most people would then complain about the resulting data > loss instead of trying to figure out why it happened and prevent it in > the first place. I don't know about you, but I for one would rather > BTRFS have a reputation for being over-aggressively safe by default than > risking users data by default. > ^ permalink raw reply [flat|nested] 32+ messages in thread
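Austin's split-brain scenario can be illustrated with a toy model. This is emphatically not how btrfs stores state (the class and helper below are invented for illustration); it only shows why a "highest generation wins" rule cannot pick a safe winner once each device has been mounted degraded on a different boot:

```python
# Toy model of the split-brain scenario: two devices in a raid1 volume,
# each boot mounting degraded with a different device missing. Purely
# illustrative -- not btrfs's actual on-disk format.

class Device:
    def __init__(self, name):
        self.name = name
        # history: list of (generation, payload) pairs, oldest first
        self.history = [(1, "base")]

    @property
    def generation(self):
        return self.history[-1][0]

    def commit(self, payload):
        self.history.append((self.generation + 1, payload))

def is_prior_state(old, new):
    """True if old's history is a strict prefix of new's history."""
    return new.history[:len(old.history)] == old.history

a, b = Device("A"), Device("B")

# Boot 1: device A is slow to enumerate, volume mounts degraded on B.
b.commit("writes from boot 1")

# Boot 2: device B is slow to enumerate, volume mounts degraded on A.
a.commit("writes from boot 2")

# Both devices now carry a generation higher than the base...
assert a.generation == 2 and b.generation == 2
# ...but neither is a prior state of the other, so "newest generation
# wins" has no safe answer -- like merging two git branches that both
# changed the same binary file.
assert not is_prior_state(a, b) and not is_prior_state(b, a)
print("diverged:", a.history[-1], "vs", b.history[-1])
```

Safe automatic recovery would require proving one device's history is a prefix of the other's, which is exactly the check Austin says btrfs cannot currently do reliably.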
* Re: btrfs as / filesystem in RAID1 2019-02-08 7:15 ` Stefan K @ 2019-02-08 12:58 ` Austin S. Hemmelgarn 2019-02-08 16:56 ` Chris Murphy 1 sibling, 0 replies; 32+ messages in thread From: Austin S. Hemmelgarn @ 2019-02-08 12:58 UTC (permalink / raw) To: linux-btrfs On 2019-02-08 02:15, Stefan K wrote: >> * Normal desktop users _never_ look at the log files or boot info, and >> rarely run monitoring programs, so they as a general rule won't notice >> until it's already too late. BTRFS isn't just a server filesystem, so >> it needs to be safe for regular users too. > I guess a normal desktop user wouldn't create a RAID1 nor other RAID-things, right? You would think that would be the case, but it generally isn't in my experience. Such desktop users also tend to be the worst offenders in the 'RAID is my backup' camp as well in my experience. > So an admin take care of a RAID and monitor it (it doesn't matter if it a hardwareraid, mdraid, zfs raid or what ever) > and degraded works only with RAID-things, its not relevant for single-disk usage, right? Correct, but because it's never relevant for single-disk usage, you don't have to worry about any of this. > >> Also, LVM and MD have the exact same issue, it's just not as significant >> because they re-add and re-sync missing devices automatically when they >> reappear, which makes such split-brain scenarios much less likely. > why does btrfs don't do that? Because we currently don't have any code that does it. Part of the problem is that we're a lot more tolerant of intermittent I/O errors than LVM and MD are, so we can't reliably tell if a device is truly gone or not. > > > On Thursday, February 7, 2019 2:39:34 PM CET Austin S. Hemmelgarn wrote: >> On 2019-02-07 13:53, waxhead wrote: >>> >>> >>> Austin S. 
Hemmelgarn wrote: >>>> On 2019-02-07 06:04, Stefan K wrote: >>>>> Thanks, with degraded as kernel parameter and also ind the fstab it >>>>> works like expected >>>>> >>>>> That should be the normal behaviour, cause a server must be up and >>>>> running, and I don't care about a device loss, thats why I use a >>>>> RAID1. The device-loss problem can I fix later, but its important >>>>> that a server is up and running, i got informed at boot time and also >>>>> in the logs files that a device is missing, also I see that if you >>>>> use a monitoring program. >>>> No, it shouldn't be the default, because: >>>> >>>> * Normal desktop users _never_ look at the log files or boot info, and >>>> rarely run monitoring programs, so they as a general rule won't notice >>>> until it's already too late. BTRFS isn't just a server filesystem, so >>>> it needs to be safe for regular users too. >>> >>> I am willing to argue that whatever you refer to as normal users don't >>> have a clue how to make a raid1 filesystem, nor do they care about what >>> underlying filesystem their computer runs. I can't quite see how a >>> limping system would be worse than a failing system in this case. >>> Besides "normal" desktop users use Windows anyway, people that run on >>> penguin powered stuff generally have at least some technical knowledge. >> Once you get into stuff like Arch or Gentoo, yeah, people tend to have >> enough technical knowledge to handle this type of thing, but if you're >> talking about the big distros like Ubuntu or Fedora, not so much. Yes, >> I might be a bit pessimistic here, but that pessimism is based on >> personal experience over many years of providing technical support for >> people. >> >> Put differently, human nature is to ignore things that aren't >> immediately relevant. Kernel logs don't matter until you see something >> wrong. Boot messages don't matter unless you happen to see them while >> the system is booting (and most people don't). 
Monitoring is the only >> way here, but most people won't invest the time in proper monitoring >> until they have problems. Even as a seasoned sysadmin, I never look at >> kernel logs until I see any problem, I rarely see boot messages on most >> of the systems I manage (because I'm rarely sitting at the console when >> they boot up, and when I am I'm usually handling startup of a dozen or >> so systems simultaneously after a network-wide outage), and I only >> monitor things that I know for certain need to be monitored. >>> >>>> * It's easily possible to end up mounting degraded by accident if one >>>> of the constituent devices is slow to enumerate, and this can easily >>>> result in a split-brain scenario where all devices have diverged and >>>> the volume can only be repaired by recreating it from scratch. >>> >>> Am I wrong or would not the remaining disk have the generation number >>> bumped on every commit? would it not make sense to ignore (previously) >>> stale disks and require a manual "re-add" of the failed disks. From a >>> users perspective with some C coding knowledge this sounds to me (in >>> principle) like something as quite simple. >>> E.g. if the superblock UUID match for all devices and one (or more) >>> devices has a lower generation number than the other(s) then the disk(s) >>> with the newest generation number should be considered good and the >>> other disks with a lower generation number should be marked as failed. >> The problem is that if you're defaulting to this behavior, you can have >> multiple disks diverge from the base. Imagine, for example, a system >> with two devices in a raid1 setup with degraded mounts enabled by >> default, and either device randomly taking longer than normal to >> enumerate. 
It's very possible for one boot to have one device delay >> during enumeration on one boot, then the other on the next boot, and if >> not handled _exactly_ right by the user, this will result in both >> devices having a higher generation number than they started with, but >> neither one being 'wrong'. It's like trying to merge branches in git >> that both have different changes to a binary file, there's no sane way >> to handle it without user input. >> >> Realistically, we can only safely recover from divergence correctly if >> we can prove that all devices are true prior states of the current >> highest generation, which is not currently possible to do reliably >> because of how BTRFS operates. >> >> Also, LVM and MD have the exact same issue, it's just not as significant >> because they re-add and re-sync missing devices automatically when they >> reappear, which makes such split-brain scenarios much less likely. >>> >>>> * We have _ZERO_ automatic recovery from this situation. This makes >>>> both of the above mentioned issues far more dangerous. >>> >>> See above, would this not be as simple as auto-deleting disks from the >>> pool that has a matching UUID and a mismatch for the superblock >>> generation number? Not exactly a recovery, but the system should be able >>> to limp along. >>> >>>> * It just plain does not work with most systemd setups, because >>>> systemd will hang waiting on all the devices to appear due to the fact >>>> that they refuse to acknowledge that the only way to correctly know if >>>> a BTRFS volume will mount is to just try and mount it. >>> >>> As far as I have understood this BTRFS refuses to mount even in >>> redundant setups without the degraded flag. Why?! This is just plain >>> useless. If anything the degraded mount option should be replaced with >>> something like failif=X where X would be anything from 'never' which >>> should get a 2 disk system up with exclusively raid1 profiles even if >>> only one device is working. 
'always' in case any device is failed or >>> even 'atrisk' when loss of one more device would keep any raid chunk >>> profile guarantee. (this get admittedly complex in a multi disk raid1 >>> setup or when subvolumes perhaps can be mounted with different "raid" >>> profiles....) >> The issue with systemd is that if you pass 'degraded' on most systemd >> systems, and devices are missing when the system tries to mount the >> volume, systemd won't mount it because it doesn't see all the devices. >> It doesn't even _try_ to mount it because it doesn't see all the >> devices. Changing to degraded by default won't fix this, because it's a >> systemd problem. >> >> The same issue also makes it a serious pain in the arse to recover >> degraded BTRFS volumes on systemd systems, because if the volume is >> supposed to mount normally on that system, systemd will unmount it if it >> doesn't see all the devices, regardless of how it got mounted in the >> first place. >> >> IOW, there's a special case with systemd that makes even mounting BTRFS >> volumes that have missing devices degraded not work. >>> >>>> * Given that new kernels still don't properly generate half-raid1 >>>> chunks when a device is missing in a two-device raid1 setup, there's a >>>> very real possibility that users will have trouble recovering >>>> filesystems with old recovery media (IOW, any recovery environment >>>> running a kernel before 4.14 will not mount the volume correctly). >>> Sometimes you have to break a few eggs to make an omelette right? If >>> people want to recover their data they should have backups, and if they >>> are really interested in recovering their data (and don't have backups) >>> then they will probably find this on the web by searching anyway... >> Backups aren't the type of recovery I'm talking about. 
I'm talking >> about people booting to things like SystemRescueCD to fix system >> configuration or do offline maintenance without having to nuke the >> system and restore from backups. Such recovery environments often don't >> get updated for a _long_ time, and such usage is not atypical as a first >> step in trying to fix a broken system in situations where downtime >> really is a serious issue. >>> >>>> * You shouldn't be mounting writable and degraded for any reason other >>>> than fixing the volume (or converting it to a single profile until you >>>> can fix it), even aside from the other issues. >>> >>> Well in my opinion the degraded mount option is counter intuitive. >>> Unless otherwise asked for the system should mount and work as long as >>> it can guarantee the data can be read and written somehow (regardless if >>> any redundancy guarantee is not met). If the user is willing to accept >>> more or less risk they should configure it! >> Again, BTRFS mounting degraded is significantly riskier than LVM or MD >> doing the same thing. Most users don't properly research things (When's >> the last time you did a complete cost/benefit analysis before deciding >> to use a particular piece of software on a system?), and would not know >> they were taking on significantly higher risk by using BTRFS without >> configuring it to behave safely until it actually caused them problems, >> at which point most people would then complain about the resulting data >> loss instead of trying to figure out why it happened and prevent it in >> the first place. I don't know about you, but I for one would rather >> BTRFS have a reputation for being over-aggressively safe by default than >> risking users data by default. >> > ^ permalink raw reply [flat|nested] 32+ messages in thread
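For reference, the "fixing the volume" Austin mentions is the procedure Hans gave at the top of the thread (replace the missing device, then convert any single chunks back). A rough sketch; the devid, device name, and mount point are invented placeholders, so check `btrfs filesystem show` for the real values:

```
# 1. Identify the missing devid.
btrfs filesystem show /

# 2. Replace the missing device (assumed devid 2 here) with the new
#    disk, in the foreground (-B).
btrfs replace start -B 2 /dev/sdb1 /

# 3. Convert any 'single' chunks written while degraded back to raid1;
#    'soft' only touches chunks that aren't already raid1.
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /

# 4. Verify no 'single' block groups remain.
btrfs filesystem df /
```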
* Re: btrfs as / filesystem in RAID1 2019-02-08 7:15 ` Stefan K 2019-02-08 12:58 ` Austin S. Hemmelgarn @ 2019-02-08 16:56 ` Chris Murphy 1 sibling, 0 replies; 32+ messages in thread From: Chris Murphy @ 2019-02-08 16:56 UTC (permalink / raw) To: Btrfs BTRFS On Fri, Feb 8, 2019 at 12:15 AM Stefan K <shadow_7@gmx.net> wrote: > > > * Normal desktop users _never_ look at the log files or boot info, and > > rarely run monitoring programs, so they as a general rule won't notice > > until it's already too late. BTRFS isn't just a server filesystem, so > > it needs to be safe for regular users too. > I guess a normal desktop user wouldn't create a RAID1 or other RAID-things, right? So an admin takes care of a RAID and monitors it (it doesn't matter if it's a hardware RAID, mdraid, ZFS RAID or whatever) > and degraded works only with RAID-things; it's not relevant for single-disk usage, right? The point is that persistently setting the degraded mount option as a boot param has a chance of causing a degraded mount even if your array is not degraded. Also, there is no such thing as transitioning from normal mount to degraded mount. If a device fails, the array is not strictly degraded, it's a normal mount with a huge number of kernel errors being generated by Btrfs due to the bad/missing device. I'm pretty sure there are unmerged patches that add something like the concept of an md faulty device; I'm not sure what the logic is, but my understanding is they're not well enough tested yet (?) to get merged. If your system log is directed to write to this same volume, that causes even more errors due to additional failing writes, which then have to be logged. So now you're depending on kernel printk rate limiting being set well below the water line to make sure Btrfs errors don't cause so much disk contention that the system gets stuck (not difficult if sysroot is a hard drive). 
> > > Also, LVM and MD have the exact same issue, it's just not as significant > > because they re-add and re-sync missing devices automatically when they > > reappear, which makes such split-brain scenarios much less likely. > why does btrfs don't do that? It's a fair question but the simplest answer is, features don't grow on trees, they're written by developers and no one has yet done that work. -- Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
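Given Chris's point that a persistent boot parameter can cause a degraded mount even when the array is healthy, the safer variant is the one Hans gave in his first reply: apply `degraded` for a single boot by editing the kernel command line at the GRUB menu. The kernel path and UUID below are placeholders:

```
# At the GRUB menu, press 'e' on the boot entry and append
# rootflags=degraded to the 'linux' line, for this one boot only:
linux /boot/vmlinuz-4.9.0-8-amd64 root=UUID=<your-root-uuid> ro rootflags=degraded
# Boot with Ctrl-x, fix the array, and do NOT persist this in
# GRUB_CMDLINE_LINUX in /etc/default/grub.
```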
* Re: btrfs as / filesystem in RAID1 2019-02-07 19:39 ` Austin S. Hemmelgarn ` (2 preceding siblings ...) 2019-02-08 7:15 ` Stefan K @ 2019-02-08 18:10 ` waxhead 2019-02-08 19:17 ` Austin S. Hemmelgarn 2019-02-08 20:17 ` Chris Murphy 3 siblings, 2 replies; 32+ messages in thread From: waxhead @ 2019-02-08 18:10 UTC (permalink / raw) To: Austin S. Hemmelgarn, Stefan K, linux-btrfs Austin S. Hemmelgarn wrote: > On 2019-02-07 13:53, waxhead wrote: >> >> >> Austin S. Hemmelgarn wrote: >>> On 2019-02-07 06:04, Stefan K wrote: >>>> Thanks, with degraded as kernel parameter and also ind the fstab it >>>> works like expected >>>> >>>> That should be the normal behaviour, cause a server must be up and >>>> running, and I don't care about a device loss, thats why I use a >>>> RAID1. The device-loss problem can I fix later, but its important >>>> that a server is up and running, i got informed at boot time and >>>> also in the logs files that a device is missing, also I see that if >>>> you use a monitoring program. >>> No, it shouldn't be the default, because: >>> >>> * Normal desktop users _never_ look at the log files or boot info, >>> and rarely run monitoring programs, so they as a general rule won't >>> notice until it's already too late. BTRFS isn't just a server >>> filesystem, so it needs to be safe for regular users too. >> >> I am willing to argue that whatever you refer to as normal users don't >> have a clue how to make a raid1 filesystem, nor do they care about >> what underlying filesystem their computer runs. I can't quite see how >> a limping system would be worse than a failing system in this case. >> Besides "normal" desktop users use Windows anyway, people that run on >> penguin powered stuff generally have at least some technical knowledge. > Once you get into stuff like Arch or Gentoo, yeah, people tend to have > enough technical knowledge to handle this type of thing, but if you're > talking about the big distros like Ubuntu or Fedora, not so much. 
Yes, > I might be a bit pessimistic here, but that pessimism is based on > personal experience over many years of providing technical support for > people. > > Put differently, human nature is to ignore things that aren't > immediately relevant. Kernel logs don't matter until you see something > wrong. Boot messages don't matter unless you happen to see them while > the system is booting (and most people don't). Monitoring is the only > way here, but most people won't invest the time in proper monitoring > until they have problems. Even as a seasoned sysadmin, I never look at > kernel logs until I see any problem, I rarely see boot messages on most > of the systems I manage (because I'm rarely sitting at the console when > they boot up, and when I am I'm usually handling startup of a dozen or > so systems simultaneously after a network-wide outage), and I only > monitor things that I know for certain need to be monitored. So what you are saying here is that distros that use btrfs by default should be responsible enough to provide some monitoring solution if they allow non-technical users to create a "raid"1-like btrfs filesystem in the first place. I don't think that many distros install some S.M.A.R.T. monitoring solution either... in which case you are worse off with a non-checksumming filesystem. Since the users you refer to basically ignore the filesystem anyway I can't see why this would be an argument at all... >> >>> * It's easily possible to end up mounting degraded by accident if one >>> of the constituent devices is slow to enumerate, and this can easily >>> result in a split-brain scenario where all devices have diverged and >>> the volume can only be repaired by recreating it from scratch. >> >> Am I wrong or would not the remaining disk have the generation number >> bumped on every commit? would it not make sense to ignore (previously) >> stale disks and require a manual "re-add" of the failed disks. 
From a >> users perspective with some C coding knowledge this sounds to me (in >> principle) like something as quite simple. >> E.g. if the superblock UUID match for all devices and one (or more) >> devices has a lower generation number than the other(s) then the >> disk(s) with the newest generation number should be considered good >> and the other disks with a lower generation number should be marked as >> failed. > The problem is that if you're defaulting to this behavior, you can have > multiple disks diverge from the base. Imagine, for example, a system > with two devices in a raid1 setup with degraded mounts enabled by > default, and either device randomly taking longer than normal to > enumerate. It's very possible for one boot to have one device delay > during enumeration on one boot, then the other on the next boot, and if > not handled _exactly_ right by the user, this will result in both > devices having a higher generation number than they started with, but > neither one being 'wrong'. It's like trying to merge branches in git > that both have different changes to a binary file, there's no sane way > to handle it without user input. > So why does BTRFS hurry to mount itself even if devices are missing? And if BTRFS can still mount, why would it blindly accept a non-existing disk as part of the pool?! > Realistically, we can only safely recover from divergence correctly if > we can prove that all devices are true prior states of the current > highest generation, which is not currently possible to do reliably > because of how BTRFS operates. > So what you are saying is that the generation number does not represent a true frozen state of the filesystem at that point? > Also, LVM and MD have the exact same issue, it's just not as significant > because they re-add and re-sync missing devices automatically when they > reappear, which makes such split-brain scenarios much less likely. 
Which means marking the entire device as invalid, then re-adding it from scratch more or less... >> >>> * We have _ZERO_ automatic recovery from this situation. This makes >>> both of the above mentioned issues far more dangerous. >> >> See above, would this not be as simple as auto-deleting disks from the >> pool that has a matching UUID and a mismatch for the superblock >> generation number? Not exactly a recovery, but the system should be >> able to limp along. >> >>> * It just plain does not work with most systemd setups, because >>> systemd will hang waiting on all the devices to appear due to the >>> fact that they refuse to acknowledge that the only way to correctly >>> know if a BTRFS volume will mount is to just try and mount it. >> >> As far as I have understood this BTRFS refuses to mount even in >> redundant setups without the degraded flag. Why?! This is just plain >> useless. If anything the degraded mount option should be replaced with >> something like failif=X where X would be anything from 'never' which >> should get a 2 disk system up with exclusively raid1 profiles even if >> only one device is working. 'always' in case any device is failed or >> even 'atrisk' when loss of one more device would keep any raid chunk >> profile guarantee. (this get admittedly complex in a multi disk raid1 >> setup or when subvolumes perhaps can be mounted with different "raid" >> profiles....) > The issue with systemd is that if you pass 'degraded' on most systemd > systems, and devices are missing when the system tries to mount the > volume, systemd won't mount it because it doesn't see all the devices. > It doesn't even _try_ to mount it because it doesn't see all the > devices. Changing to degraded by default won't fix this, because it's a > systemd problem. 
> > The same issue also makes it a serious pain in the arse to recover > degraded BTRFS volumes on systemd systems, because if the volume is > supposed to mount normally on that system, systemd will unmount it if it > doesn't see all the devices, regardless of how it got mounted in the > first place. > Why does systemd concern itself with what devices a btrfs volume consists of? Please educate me, I am curious. > IOW, there's a special case with systemd that makes even mounting BTRFS > volumes that have missing devices degraded not work. Well I use systemd on Debian and have not had that issue. In what situation does this fail? >> >>> * Given that new kernels still don't properly generate half-raid1 >>> chunks when a device is missing in a two-device raid1 setup, there's >>> a very real possibility that users will have trouble recovering >>> filesystems with old recovery media (IOW, any recovery environment >>> running a kernel before 4.14 will not mount the volume correctly). >> Sometimes you have to break a few eggs to make an omelette right? If >> people want to recover their data they should have backups, and if >> they are really interested in recovering their data (and don't have >> backups) then they will probably find this on the web by searching >> anyway... > Backups aren't the type of recovery I'm talking about. I'm talking > about people booting to things like SystemRescueCD to fix system > configuration or do offline maintenance without having to nuke the > system and restore from backups. Such recovery environments often don't > get updated for a _long_ time, and such usage is not atypical as a first > step in trying to fix a broken system in situations where downtime > really is a serious issue. I would say that if downtime is such a serious issue you have a failover and a working tested backup. 
>> >>> * You shouldn't be mounting writable and degraded for any reason >>> other than fixing the volume (or converting it to a single profile >>> until you can fix it), even aside from the other issues. >> >> Well in my opinion the degraded mount option is counter intuitive. >> Unless otherwise asked for the system should mount and work as long as >> it can guarantee the data can be read and written somehow (regardless >> if any redundancy guarantee is not met). If the user is willing to >> accept more or less risk they should configure it! > Again, BTRFS mounting degraded is significantly riskier than LVM or MD > doing the same thing. Most users don't properly research things (When's > the last time you did a complete cost/benefit analysis before deciding > to use a particular piece of software on a system?), and would not know > they were taking on significantly higher risk by using BTRFS without > configuring it to behave safely until it actually caused them problems, > at which point most people would then complain about the resulting data > loss instead of trying to figure out why it happened and prevent it in > the first place. I don't know about you, but I for one would rather > BTRFS have a reputation for being over-aggressively safe by default than > risking users data by default. Well I don't do cost/benefit analysis since I run free software. I do however try my best to ensure that whatever software I install don't cause more drawbacks than benefits. I would also like for BTRFS to be over-aggressively safe, but I also want it to be over-aggressively always running or even limping if that is what it needs to do. ^ permalink raw reply [flat|nested] 32+ messages in thread
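On waxhead's question of why systemd concerns itself with btrfs member devices at all: udev ships a btrfs rule that marks each member device as "not ready" until the kernel reports the whole filesystem is complete, and systemd's device units follow that flag. A hedged sketch (the rule path and exact wording vary by distro and version):

```
# /lib/udev/rules.d/64-btrfs.rules (paraphrased) does roughly:
#   - register each btrfs device with the kernel and ask whether the
#     filesystem is complete: IMPORT{builtin}="btrfs ready $devnode"
#   - if it is not complete, mark the device: ENV{SYSTEMD_READY}="0"
# systemd then won't consider the mount's device dependency ready --
# and so won't try the mount -- until every member has appeared.

# The equivalent check from userspace: exit status 0 means the
# filesystem this device belongs to has all of its devices present.
btrfs device ready /dev/sda2 ; echo $?
```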
* Re: btrfs as / filesystem in RAID1 2019-02-08 18:10 ` waxhead @ 2019-02-08 19:17 ` Austin S. Hemmelgarn 2019-02-09 12:13 ` waxhead 2019-02-08 20:17 ` Chris Murphy 1 sibling, 1 reply; 32+ messages in thread From: Austin S. Hemmelgarn @ 2019-02-08 19:17 UTC (permalink / raw) To: waxhead, Stefan K, linux-btrfs On 2019-02-08 13:10, waxhead wrote: > Austin S. Hemmelgarn wrote: >> On 2019-02-07 13:53, waxhead wrote: >>> >>> >>> Austin S. Hemmelgarn wrote: >>>> On 2019-02-07 06:04, Stefan K wrote: >>>>> Thanks, with degraded as a kernel parameter and also in the fstab >>>>> it works as expected >>>>> >>>>> That should be the normal behaviour, because a server must be up and >>>>> running, and I don't care about a device loss, that's why I use a >>>>> RAID1. The device-loss problem I can fix later, but it's important >>>>> that a server is up and running. I get informed at boot time and >>>>> also in the log files that a device is missing, and I also see that if >>>>> you use a monitoring program. >>>> No, it shouldn't be the default, because: >>>> >>>> * Normal desktop users _never_ look at the log files or boot info, >>>> and rarely run monitoring programs, so they as a general rule won't >>>> notice until it's already too late. BTRFS isn't just a server >>>> filesystem, so it needs to be safe for regular users too. >>> >>> I am willing to argue that whatever you refer to as normal users >>> don't have a clue how to make a raid1 filesystem, nor do they care >>> about what underlying filesystem their computer runs. I can't quite >>> see how a limping system would be worse than a failing system in this >>> case. Besides, "normal" desktop users use Windows anyway; people who >>> run on penguin powered stuff generally have at least some technical >>> knowledge. >> Once you get into stuff like Arch or Gentoo, yeah, people tend to have >> enough technical knowledge to handle this type of thing, but if you're >> talking about the big distros like Ubuntu or Fedora, not so much. 
>> Yes, I might be a bit pessimistic here, but that pessimism is based on >> personal experience over many years of providing technical support for >> people. >> >> Put differently, human nature is to ignore things that aren't >> immediately relevant. Kernel logs don't matter until you see >> something wrong. Boot messages don't matter unless you happen to see >> them while the system is booting (and most people don't). Monitoring >> is the only way here, but most people won't invest the time in proper >> monitoring until they have problems. Even as a seasoned sysadmin, I >> never look at kernel logs until I see any problem, I rarely see boot >> messages on most of the systems I manage (because I'm rarely sitting >> at the console when they boot up, and when I am I'm usually handling >> startup of a dozen or so systems simultaneously after a network-wide >> outage), and I only monitor things that I know for certain need to be >> monitored. > > So what you are saying here is that distro's that use btrfs by default > should be responsible enough to make some monitoring solution if they > allow non-technical users to create a "raid"1 like btrfs filesystem in > the first place. I don't think that many distros install some S.M.A.R.T. > monitoring solution either... in which case you are worse off with a > non-checksumming filesystem. Actually, more than you probably realize do (Windows does by default these days, so the big distros that want to compete for desktop users need to as well), and many have trivial to set up monitoring for MD and LVM arrays as well. > Since the users you refer to basically ignores the filesystem anyway I > can't see why this would be an argument at all... My argument here is that we shouldn't assume users will know what they're doing. 
It's the same logic behind the saner distros not defaulting to using BTRFS for installation, if they do and BTRFS causes the user to lose data, the distro will usually get blamed, even if it was not at all their fault. Similarly, if a user chooses to use BTRFS without doing their research, it's very likely that any data loss, even if it's caused by the user themself not doing things sensibly, will be blamed on BTRFS. > >>> >>>> * It's easily possible to end up mounting degraded by accident if >>>> one of the constituent devices is slow to enumerate, and this can >>>> easily result in a split-brain scenario where all devices have >>>> diverged and the volume can only be repaired by recreating it from >>>> scratch. >>> >>> Am I wrong or would not the remaining disk have the generation number >>> bumped on every commit? would it not make sense to ignore >>> (previously) stale disks and require a manual "re-add" of the failed >>> disks. From a users perspective with some C coding knowledge this >>> sounds to me (in principle) like something as quite simple. >>> E.g. if the superblock UUID match for all devices and one (or more) >>> devices has a lower generation number than the other(s) then the >>> disk(s) with the newest generation number should be considered good >>> and the other disks with a lower generation number should be marked >>> as failed. >> The problem is that if you're defaulting to this behavior, you can >> have multiple disks diverge from the base. Imagine, for example, a >> system with two devices in a raid1 setup with degraded mounts enabled >> by default, and either device randomly taking longer than normal to >> enumerate. It's very possible for one boot to have one device delay >> during enumeration on one boot, then the other on the next boot, and >> if not handled _exactly_ right by the user, this will result in both >> devices having a higher generation number than they started with, but >> neither one being 'wrong'. 
It's like trying to merge branches in git >> that both have different changes to a binary file, there's no sane way >> to handle it without user input. >> > So why does BTRFS hurry to mount itself even if devices are missing? And > if BTRFS still can mount, why would it blindly accept a non-existing > disk to take part of the pool?! It doesn't unless you tell it to, and that behavior is exactly what I'm arguing against making the default here. > >> Realistically, we can only safely recover from divergence correctly if >> we can prove that all devices are true prior states of the current >> highest generation, which is not currently possible to do reliably >> because of how BTRFS operates. >> > So what you are saying is that the generation number does not represent > a true frozen state of the filesystem at that point? It does _only_ for those devices which were present at the time of the commit that incremented it. As an example (don't do this with any BTRFS volume you care about, it will break it), take a BTRFS volume with two devices configured for raid1. Mount the volume with only one of the devices present, issue a single write to it, then unmount it. Now do the same with only the other device. Both devices should show the same generation number right now (but it should be one higher than when you started), but the generation number on each device refers to a different volume state. > >> Also, LVM and MD have the exact same issue, it's just not as >> significant because they re-add and re-sync missing devices >> automatically when they reappear, which makes such split-brain >> scenarios much less likely. > Which means marking the entire device as invalid, then re-adding it from > scratch more or less... Actually, it doesn't. For LVM and MD, they track what regions of the remaining device have changed, and sync only those regions when the missing device comes back. 
For BTRFS, the same thing happens implicitly because of the COW structure, and you can manually reproduce similar behavior to LVM or MD by scrubbing the volume and then using balance with the 'soft' filter to ensure all the chunks are the correct type. In both cases though, you still get into trouble if each of the devices gets used separately from each other before being re-synced (though BTRFS at least has the decency in that situation to not lose any data, LVM or MD will just blindly sync whichever mirror they happen to pick over the others). > >>> >>>> * We have _ZERO_ automatic recovery from this situation. This makes >>>> both of the above mentioned issues far more dangerous. >>> >>> See above, would this not be as simple as auto-deleting disks from >>> the pool that has a matching UUID and a mismatch for the superblock >>> generation number? Not exactly a recovery, but the system should be >>> able to limp along. >>> >>>> * It just plain does not work with most systemd setups, because >>>> systemd will hang waiting on all the devices to appear due to the >>>> fact that they refuse to acknowledge that the only way to correctly >>>> know if a BTRFS volume will mount is to just try and mount it. >>> >>> As far as I have understood this BTRFS refuses to mount even in >>> redundant setups without the degraded flag. Why?! This is just plain >>> useless. If anything the degraded mount option should be replaced >>> with something like failif=X where X would be anything from 'never' >>> which should get a 2 disk system up with exclusively raid1 profiles >>> even if only one device is working. 'always' in case any device is >>> failed or even 'atrisk' when loss of one more device would keep any >>> raid chunk profile guarantee. (this get admittedly complex in a multi >>> disk raid1 setup or when subvolumes perhaps can be mounted with >>> different "raid" profiles....) 
>> The issue with systemd is that if you pass 'degraded' on most systemd >> systems, and devices are missing when the system tries to mount the >> volume, systemd won't mount it because it doesn't see all the devices. >> It doesn't even _try_ to mount it because it doesn't see all the >> devices. Changing to degraded by default won't fix this, because it's >> a systemd problem. >> >> The same issue also makes it a serious pain in the arse to recover >> degraded BTRFS volumes on systemd systems, because if the volume is >> supposed to mount normally on that system, systemd will unmount it if >> it doesn't see all the devices, regardless of how it got mounted in >> the first place. >> > Why does systemd concern itself about what devices btrfs consist of. > Please educate me, I am curious. For the same reason that it concerns itself with what devices make up a LVM volume or an MD array. In essence, it comes down to a couple of specific things: * It is almost always preferable to delay boot-up while waiting for a missing device to reappear than it is to start using a volume that depends on it while it's missing. The overall impact on the system from taking a few seconds longer to boot is generally less than the impact of having to resync the device when it reappears while the system is still booting up. * Systemd allows mounts to not block the system booting while still allowing certain services to depend on those mounts being active. This is extremely useful for remote management reasons, and is actually supported by most service managers these days. Systemd extends this all the way down the storage stack though, which is even more useful, because it lets disk failures properly cascade up the storage stack and translate into the volumes they were part of showing up as degraded (or getting unmounted if you choose to configure it that way). 
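The mount-dependency behavior described above can be sketched with a hedged example; the mount point, UUID placeholder, and service name below are hypothetical illustrations, not from this thread:

```
# /etc/fstab - 'nofail' lets boot continue even if /data cannot mount;
# x-systemd.device-timeout bounds how long systemd waits for the device
UUID=<fs-uuid>  /data  btrfs  defaults,nofail,x-systemd.device-timeout=30s  0 0

# /etc/systemd/system/myapp.service (excerpt) - the service still will
# not start until /data is actually mounted
[Unit]
RequiresMountsFor=/data
```

With a setup like this, a delayed disk stalls only the services that genuinely need the mount rather than the whole boot, which is the trade-off being described.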
> >> IOW, there's a special case with systemd that makes even mounting >> BTRFS volumes that have missing devices degraded not work. > Well I use systemd on Debian and have not had that issue. In what > situation does this fail? At one point, if you tried to manually mount a volume that systemd did not see all the constituent devices present for, it would get unmounted almost instantly by systemd itself. This may not be the case anymore, or it may have been how the distros I've used with systemd on them happened to behave, but either way it's a pain in the arse when you want to fix a BTRFS volume. > >>> >>>> * Given that new kernels still don't properly generate half-raid1 >>>> chunks when a device is missing in a two-device raid1 setup, there's >>>> a very real possibility that users will have trouble recovering >>>> filesystems with old recovery media (IOW, any recovery environment >>>> running a kernel before 4.14 will not mount the volume correctly). >>> Sometimes you have to break a few eggs to make an omelette right? If >>> people want to recover their data they should have backups, and if >>> they are really interested in recovering their data (and don't have >>> backups) then they will probably find this on the web by searching >>> anyway... >> Backups aren't the type of recovery I'm talking about. I'm talking >> about people booting to things like SystemRescueCD to fix system >> configuration or do offline maintenance without having to nuke the >> system and restore from backups. Such recovery environments often >> don't get updated for a _long_ time, and such usage is not atypical as >> a first step in trying to fix a broken system in situations where >> downtime really is a serious issue. > I would say that if downtime is such a serious issue you have a failover > and a working tested backup. Generally yes, but restoring a volume completely from scratch is almost always going to take longer than just fixing what's broken unless it's _really_ broken. 
Would you really want to nuke a system and rebuild it from scratch just because you accidentally pulled out the wrong disk when hot-swapping drives to rebuild an array? > >>> >>>> * You shouldn't be mounting writable and degraded for any reason >>>> other than fixing the volume (or converting it to a single profile >>>> until you can fix it), even aside from the other issues. >>> >>> Well in my opinion the degraded mount option is counter intuitive. >>> Unless otherwise asked for the system should mount and work as long >>> as it can guarantee the data can be read and written somehow >>> (regardless if any redundancy guarantee is not met). If the user is >>> willing to accept more or less risk they should configure it! >> Again, BTRFS mounting degraded is significantly riskier than LVM or MD >> doing the same thing. Most users don't properly research things >> (When's the last time you did a complete cost/benefit analysis before >> deciding to use a particular piece of software on a system?), and >> would not know they were taking on significantly higher risk by using >> BTRFS without configuring it to behave safely until it actually caused >> them problems, at which point most people would then complain about >> the resulting data loss instead of trying to figure out why it >> happened and prevent it in the first place. I don't know about you, >> but I for one would rather BTRFS have a reputation for being >> over-aggressively safe by default than risking users data by default. > Well I don't do cost/benefit analysis since I run free software. I do > however try my best to ensure that whatever software I install don't > cause more drawbacks than benefits. Which is essentially a CBA. The cost doesn't have to equate to money, it could be time, or even limitations in what you can do with the system. > I would also like for BTRFS to be over-aggressively safe, but I also > want it to be over-aggressively always running or even limping if that > is what it needs to do. 
And you can have it do that, we just prefer not to by default. ^ permalink raw reply [flat|nested] 32+ messages in thread
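The opt-in being referred to, as Stefan described earlier in the thread, amounts to roughly the following on a Debian-style system (a hedged sketch, not a recommendation; the UUID is a placeholder):

```
# /etc/default/grub - allow the root filesystem to mount degraded at boot
GRUB_CMDLINE_LINUX="rootflags=degraded"
# then regenerate grub.cfg, e.g. with update-grub on Debian

# /etc/fstab - the same opt-in for the normal mount
UUID=<fs-uuid>  /  btrfs  defaults,degraded  0  0
```

Anyone configuring this takes on exactly the split-brain risks discussed in this thread.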
* Re: btrfs as / filesystem in RAID1 2019-02-08 19:17 ` Austin S. Hemmelgarn @ 2019-02-09 12:13 ` waxhead 2019-02-10 18:34 ` Chris Murphy 0 siblings, 1 reply; 32+ messages in thread From: waxhead @ 2019-02-09 12:13 UTC (permalink / raw) To: Austin S. Hemmelgarn, Stefan K, linux-btrfs Austin S. Hemmelgarn wrote: > On 2019-02-08 13:10, waxhead wrote: >> Austin S. Hemmelgarn wrote: >>> On 2019-02-07 13:53, waxhead wrote: >>>> >>>> >>>> Austin S. Hemmelgarn wrote: >>> >> So why does BTRFS hurry to mount itself even if devices are missing? And >> if BTRFS still can mount, why would it blindly accept a non-existing >> disk to take part of the pool?! > It doesn't unless you tell it to, and that behavior is exactly what I'm > arguing against making the default here. Understood, but that is not quite what I meant - let me rephrase... If BTRFS still can't mount, why would it blindly accept a previously non-existing disk to take part of the pool?! E.g. if you have "disk" A+B and suddenly at one boot B is not there. Now you have only A, and one would think that A should register that B has been missing. Now on the next boot you have A+B, in which case B is likely to have diverged from A since A has been mounted without B present - so even if both devices are present, why would btrfs blindly accept that both A+B are good to go when it should be perfectly possible to register in A that B was gone? And if you have B without A it should be the same story, right? >> >>> Realistically, we can only safely recover from divergence correctly >>> if we can prove that all devices are true prior states of the current >>> highest generation, which is not currently possible to do reliably >>> because of how BTRFS operates. >>> >> So what you are saying is that the generation number does not >> represent a true frozen state of the filesystem at that point? > It does _only_ for those devices which were present at the time of the > commit that incremented it. 
> So in other words devices that are not present can easily be marked / defined as such at a later time? > As an example (don't do this with any BTRFS volume you care about, it > will break it), take a BTRFS volume with two devices configured for > raid1. Mount the volume with only one of the devices present, issue a > single write to it, then unmounted it. Now do the same with only the > other device. Both devices should show the same generation number right > now (but it should be one higher than when you started), but the > generation number on each device refers to a different volume state. >> >>> Also, LVM and MD have the exact same issue, it's just not as >>> significant because they re-add and re-sync missing devices >>> automatically when they reappear, which makes such split-brain >>> scenarios much less likely. >> Which means marking the entire device as invalid, then re-adding it >> from scratch more or less... > Actually, it doesn't. > > For LVM and MD, they track what regions of the remaining device have > changed, and sync only those regions when the missing device comes back. > For MD , if you have the bitmap enabled yes... > For BTRFS, the same thing happens implicitly because of the COW > structure, and you can manually reproduce similar behavior to LVM or MD > by scrubbing the volume and then using balance with the 'soft' filter to > ensure all the chunks are the correct type. > Understood. >> Why does systemd concern itself about what devices btrfs consist of. >> Please educate me, I am curious. > For the same reason that it concerns itself with what devices make up a > LVM volume or an MD array. In essence, it comes down to a couple of > specific things: > > * It is almost always preferable to delay boot-up while waiting for a > missing device to reappear than it is to start using a volume that > depends on it while it's missing. 
The overall impact on the system from > taking a few seconds longer to boot is generally less than the impact of > having to resync the device when it reappears while the system is still > booting up. > > * Systemd allows mounts to not block the system booting while still > allowing certain services to depend on those mounts being active. This > is extremely useful for remote management reasons, and is actually > supported by most service managers these days. Systemd extends this all > the way down the storage stack though, which is even more useful, > because it lets disk failures properly cascade up the storage stack and > translate into the volumes they were part of showing up as degraded (or > getting unmounted if you choose to configure it that way). Ok, not sure I still understand how/why systemd knows what devices are part of btrfs (or md or lvm for that matter). I'll try to research this a bit - thanks for the info! >> >>> IOW, there's a special case with systemd that makes even mounting >>> BTRFS volumes that have missing devices degraded not work. >> Well I use systemd on Debian and have not had that issue. In what >> situation does this fail? > At one point, if you tried to manually mount a volume that systemd did > not see all the constituent devices present for, it would get unmounted > almost instantly by systemd itself. This may not be the case anymore, > or it may have been how the distros I've used with systemd on them > happened to behave, but either way it's a pain in the arse when you want > to fix a BTRFS volume. I can see that, but from my "toying around" with btrfs I have not run into any issues while mounting degraded. 
>> >>>> >>>>> * Given that new kernels still don't properly generate half-raid1 >>>>> chunks when a device is missing in a two-device raid1 setup, >>>>> there's a very real possibility that users will have trouble >>>>> recovering filesystems with old recovery media (IOW, any recovery >>>>> environment running a kernel before 4.14 will not mount the volume >>>>> correctly). >>>> Sometimes you have to break a few eggs to make an omelette right? If >>>> people want to recover their data they should have backups, and if >>>> they are really interested in recovering their data (and don't have >>>> backups) then they will probably find this on the web by searching >>>> anyway... >>> Backups aren't the type of recovery I'm talking about. I'm talking >>> about people booting to things like SystemRescueCD to fix system >>> configuration or do offline maintenance without having to nuke the >>> system and restore from backups. Such recovery environments often >>> don't get updated for a _long_ time, and such usage is not atypical >>> as a first step in trying to fix a broken system in situations where >>> downtime really is a serious issue. >> I would say that if downtime is such a serious issue you have a >> failover and a working tested backup. > Generally yes, but restoring a volume completely from scratch is almost > always going to take longer than just fixing what's broken unless it's > _really_ broken. Would you really want to nuke a system and rebuild it > from scratch just because you accidentally pulled out the wrong disk > when hot-swapping drives to rebuild an array? Absolutely not , but in this case I would not even want to use a rescue disk in the first place. >>>> >>>>> * You shouldn't be mounting writable and degraded for any reason >>>>> other than fixing the volume (or converting it to a single profile >>>>> until you can fix it), even aside from the other issues. >>>> >>>> Well in my opinion the degraded mount option is counter intuitive. 
>>>> Unless otherwise asked for the system should mount and work as long >>>> as it can guarantee the data can be read and written somehow >>>> (regardless if any redundancy guarantee is not met). If the user is >>>> willing to accept more or less risk they should configure it! >>> Again, BTRFS mounting degraded is significantly riskier than LVM or >>> MD doing the same thing. Most users don't properly research things >>> (When's the last time you did a complete cost/benefit analysis before >>> deciding to use a particular piece of software on a system?), and >>> would not know they were taking on significantly higher risk by using >>> BTRFS without configuring it to behave safely until it actually >>> caused them problems, at which point most people would then complain >>> about the resulting data loss instead of trying to figure out why it >>> happened and prevent it in the first place. I don't know about you, >>> but I for one would rather BTRFS have a reputation for being >>> over-aggressively safe by default than risking users data by default. >> Well I don't do cost/benefit analysis since I run free software. I do >> however try my best to ensure that whatever software I install don't >> cause more drawbacks than benefits. > Which is essentially a CBA. The cost doesn't have to equate to money, > it could be time, or even limitations in what you can do with the system. > >> I would also like for BTRFS to be over-aggressively safe, but I also >> want it to be over-aggressively always running or even limping if that >> is what it needs to do. > And you can have it do that, we just prefer not to by default. Got it! ^ permalink raw reply [flat|nested] 32+ messages in thread
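For reference, the manual re-sync Austin described (a scrub, then a balance with the 'soft' filter) can be written out as a short command sequence; the mount point is hypothetical, and the previously missing device must already be back in place:

```sh
# Rewrite any stale or bad copies from the good mirror (-B waits
# for the scrub to finish before returning)
btrfs scrub start -B /mnt

# Convert any chunks that are not raid1 (e.g. 'single' chunks created
# while degraded); 'soft' skips chunks that already match the target
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt
```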
* Re: btrfs as / filesystem in RAID1 2019-02-09 12:13 ` waxhead @ 2019-02-10 18:34 ` Chris Murphy 2019-02-11 12:17 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 32+ messages in thread From: Chris Murphy @ 2019-02-10 18:34 UTC (permalink / raw) To: waxhead; +Cc: Austin S. Hemmelgarn, Stefan K, Btrfs BTRFS On Sat, Feb 9, 2019 at 5:13 AM waxhead <waxhead@dirtcellar.net> wrote: > Understood, but that is not quite what I meant - let me rephrase... > If BTRFS still can't mount, why would it blindly accept a previously > non-existing disk to take part of the pool?! It doesn't do it blindly. It only ever mounts when the user specifies the degraded mount option, which is not a default mount option. >E.g. if you have "disk" A+B > and suddenly at one boot B is not there. Now you have only A and one > would think that A should register that B has been missing. Now on the > next boot you have AB , in which case B is likely to have diverged from > A since A has been mounted without B present - so even if both devices > are present why would btrfs blindly accept that both A+B are good to go > even if it should be perfectly possible to register in A that B was > gone. And if you have B without A it should be the same story right? OK no, you haven't gone far enough to setup the split brain scenario where there is a partially legitimate complaint. Prior to split brain, it's entirely reasonable for Btrfs to mount *when you use the degraded mount option* - it does not blindly mount. And if you've ever done exactly what you wrote in the above paragraph, you'd see Btrfs *complains vociferously* about all the errors it's passively finding and fixing. If you want a more active method of getting device B caught up with A automatically - that's completely reasonable, and something people have been saying for some time, but it takes a design proposal, and code. 
As for split brain scenario, it is only the user's manual intervention with multiple 'degraded' mount options (which again, is not the default) that caused the volume to arrive in such a state. Would it be wise to have some additional error checking? Sure. Someone would need to step up with a design and to do code work, same as any other feature. Maybe a rudimentary check would be comparing the timestamps for leaves or nodes ostensibly with the same transid, but in any case that doesn't just happen for free. > >> So what you are saying is that the generation number does not > >> represent a true frozen state of the filesystem at that point? > > It does _only_ for those devices which were present at the time of the > > commit that incremented it. > > > So in other words devices that are not present can easily be marked / > defined as such at a later time? That isn't how it currently works. When stale device B is subsequently mounted (normally) along with device A, it's only passively fixed up. Part of the point of non-automatic degraded mounts that require user intervention is the lack of anything beyond simple error handling and fixups. > Ok, not sure I still understand how/why systemd knows what devices are > part of btrfs (or md or lvm for that matter). I'll try to research this > a bit - thanks for the info! It doesn't, not directly. It's from the previously mentioned udev rule. For md, the assembly, delays, and fall back to running degraded, are handled in dracut. But the reason why this is in udev is to prevent a mount failure just because one or more devices are delayed; basically it inserts a pause until the devices appear, and then systemd issues the mount command. -- Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
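A rudimentary version of the check mentioned above can already be approximated by hand: compare the superblock generation recorded on each member device before mounting (device names are hypothetical):

```sh
btrfs inspect-internal dump-super /dev/sdx1 | grep '^generation'
btrfs inspect-internal dump-super /dev/sdy1 | grep '^generation'
```

A mismatch shows one device is stale; as discussed earlier in the thread, matching numbers unfortunately do not prove the states are identical, which is why a real check needs design work beyond this comparison.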
* Re: btrfs as / filesystem in RAID1 2019-02-10 18:34 ` Chris Murphy @ 2019-02-11 12:17 ` Austin S. Hemmelgarn 2019-02-11 21:15 ` Chris Murphy 0 siblings, 1 reply; 32+ messages in thread From: Austin S. Hemmelgarn @ 2019-02-11 12:17 UTC (permalink / raw) To: Chris Murphy, waxhead; +Cc: Stefan K, Btrfs BTRFS On 2019-02-10 13:34, Chris Murphy wrote: > On Sat, Feb 9, 2019 at 5:13 AM waxhead <waxhead@dirtcellar.net> wrote: > >> Understood, but that is not quite what I meant - let me rephrase... >> If BTRFS still can't mount, why would it blindly accept a previously >> non-existing disk to take part of the pool?! > > It doesn't do it blindly. It only ever mounts when the user specifies > the degraded mount option, which is not a default mount option. > >> E.g. if you have "disk" A+B >> and suddenly at one boot B is not there. Now you have only A and one >> would think that A should register that B has been missing. Now on the >> next boot you have AB , in which case B is likely to have diverged from >> A since A has been mounted without B present - so even if both devices >> are present why would btrfs blindly accept that both A+B are good to go >> even if it should be perfectly possible to register in A that B was >> gone. And if you have B without A it should be the same story right? > > OK no, you haven't gone far enough to setup the split brain scenario > where there is a partially legitimate complaint. Prior to split brain, > it's entirely reasonable for Btrfs to mount *when you use the degraded > mount option* - it does not blindly mount. And if you've ever done > exactly what you wrote in the above paragraph, you'd see Btrfs > *complains vociferously* about all the errors it's passively finding > and fixing. If you want a more active method of getting device B > caught up with A automatically - that's completely reasonable, and > something people have been saying for some time, but it takes a design > proposal, and code. 
> > As for split brain scenario, it is only the user's manual intervention > with multiple 'degraded' mount options (which again, is not the > default) that caused the volume to arrive in such a state. Would it be > wise to have some additional error checking? Sure. Someone would need > to step up with a design and to do code work, same as any other > feature. Maybe a rudimentary check would be comparing the timestamps > for leaves or nodes ostensibly with the same transid, but in any case > that doesn't just happen for free. And even then it couldn't be made truly reliable, because data from old transactions may be arbitrarily overwritten at any point after the next transaction (and is just plain gone if you're using the `discard` mount option). > > >>>> So what you are saying is that the generation number does not >>>> represent a true frozen state of the filesystem at that point? >>> It does _only_ for those devices which were present at the time of the >>> commit that incremented it. >>> >> So in other words devices that are not present can easily be marked / >> defined as such at a later time? > > That isn't how it currently works. When stale device B is subsequently > mounted (normally) along with device A, it's only passively fixed up. > Part of the point of non-automatic degraded mounts that require user > intervention is the lack of anything beyond simple error handling and > fixups. > >> Ok, not sure I still understand how/why systemd knows what devices are >> part of btrfs (or md or lvm for that matter). I'll try to research this >> a bit - thanks for the info! > > It doesn't, not directly. It's from the previously mentioned udev > rule. For md, the assembly, delays, and fall back to running degraded, > are handled in dracut. But the reason why this is in udev is to > prevent a mount failure just because one or more devices are delayed; > basically it inserts a pause until the devices appear, and then > systemd issues the mount command. 
Last I knew, it was systemd itself doing the pause, because we provide no real device for udev to wait on appearing. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: btrfs as / filesystem in RAID1 2019-02-11 12:17 ` Austin S. Hemmelgarn @ 2019-02-11 21:15 ` Chris Murphy 0 siblings, 0 replies; 32+ messages in thread From: Chris Murphy @ 2019-02-11 21:15 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Chris Murphy, waxhead, Stefan K, Btrfs BTRFS On Mon, Feb 11, 2019 at 5:17 AM Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > > Last I knew, it was systemd itself doing the pause, because we provide > no real device for udev to wait on appearing. Well there's more than one thing responsible for the net behavior. The most central thing waiting is the kernel. And that's because 'btrfs device ready' simply waits until all devices are found (by kernel code). That's the command that /usr/lib/udev/rules.d/64-btrfs.rules calls. So it is also udev that doesn't return from that, indefinitely as far as I know. And therefore systemd won't issue a mount command for sysroot. -- Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
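The chain described here can be poked at by hand; a small sketch (the device path is a placeholder for this example; `btrfs device ready` is the check the udev rule performs via its builtin):

```shell
# Exits 0 once the kernel has registered every member device of the
# filesystem that /dev/sda2 belongs to; non-zero while any member is
# still missing. While this fails, udev does not mark the device
# ready and systemd never gets as far as issuing the mount for sysroot.
btrfs device ready /dev/sda2
echo "exit status: $?"
```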
* Re: btrfs as / filesystem in RAID1 2019-02-08 18:10 ` waxhead 2019-02-08 19:17 ` Austin S. Hemmelgarn @ 2019-02-08 20:17 ` Chris Murphy 1 sibling, 0 replies; 32+ messages in thread From: Chris Murphy @ 2019-02-08 20:17 UTC (permalink / raw) To: waxhead; +Cc: Austin S. Hemmelgarn, Stefan K, Btrfs BTRFS On Fri, Feb 8, 2019 at 11:10 AM waxhead <waxhead@dirtcellar.net> wrote: > So what you are saying here is that distro's that use btrfs by default > should be responsible enough to make some monitoring solution if they > allow non-technical users to create a "raid"1 like btrfs filesystem in > the first place. None do this by default. I'm only aware of one that makes it possible in custom partitioning which is widely regarded as "you're on your own" land. I am of the opinion that GUI installers have a high burden to protect users from themselves but it's just an opinion; I see plenty of fail danger GUI software. > So why do BTRFS hurry to mount itself even if devices are missing? It isn't and it doesn't. You have to specify 'degraded' mount option, which is not the default, which right now with the present design means you intend for an immediate successful mount if there's a missing device and it's still possible to mount anyway. >and > if BTRFS still can mount , why whould it blindly accept a non-existing > disk to take part of the pool?! I can't parse this question. I think the answer is, it doesn't do that. > > Realistically, we can only safely recover from divergence correctly if > > we can prove that all devices are true prior states of the current > > highest generation, which is not currently possible to do reliably > > because of how BTRFS operates. > > > So what you are saying is that the generation number does not represent > a true frozen state of the filesystem at that point? You have a two device raid1, and their generation is 100. You mount one device by itself with degraded mount option. 
And you start adding and deleting files, no snapshots, and those changes are all under generation 101. You now unmount it, and you degraded mount the other device, and you add and delete some different files, and those changes are all under generation 101 too. How do you merge them? I personally think that scenario is user sabotage and they're just screwed. Start over. They had to intentionally, manually, mount those two drives *separately* with a non-default 'degraded' flag. It's crazy to expect Btrfs to sort this out - but it's entirely reasonable for it to faceplant read only the instant it becomes confused; and reasonable to expect and design it to quickly become confused in such a case, to keep damage from making both separated mirrors so corrupted they can't be mounted even read only. > Why does systemd concern itself about what devices btrfs consist of. > Please educate me, I am curious. I'm not sure of the history of: /usr/lib/udev/rules.d/64-btrfs-dm.rules /usr/lib/udev/rules.d/64-btrfs.rules But I think they were submitted to udev by Btrfs developers long ago, which was then later subsumed into systemd. It would be ideal if this rule had some sort of timeout; I think instead it will indefinitely wait for all devices to appear. Anyway, without that rule, if a device is merely delayed, and systemd tries to mount, mount immediately fails and thus boot fails. There is no such thing in systemd as reattempting to mount after a mount failure, and if sysroot fails to mount, it's a fatal startup error. > I would also like for BTRFS to be over-aggressively safe, but I also > want it to be over-aggressively always running or even limping if that > is what it needs to do. While I understand that's a metaphor, someone limping along is not a stable situation. They are more likely to trip and fall. -- Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: btrfs as / filesystem in RAID1 2019-02-07 11:04 ` Stefan K 2019-02-07 12:18 ` Austin S. Hemmelgarn @ 2019-02-07 17:15 ` Chris Murphy 2019-02-07 17:37 ` Martin Steigerwald 2019-02-11 9:30 ` Anand Jain 2 siblings, 1 reply; 32+ messages in thread From: Chris Murphy @ 2019-02-07 17:15 UTC (permalink / raw) To: Stefan K; +Cc: Btrfs BTRFS On Thu, Feb 7, 2019 at 4:04 AM Stefan K <shadow_7@gmx.net> wrote: > > Thanks, with degraded as kernel parameter and also ind the fstab it works like expected > That should be the normal behaviour, cause a server must be up and running, and I don't care about a device loss, thats why I use a RAID1. You managed to completely ignore all the warnings associated with doing this, and then conclude that it's a good idea to subject normal users to possible data loss or corruption... > So please change the normal behavior In the case of no device loss, but device delay, with 'degraded' set in fstab you risk a non-deterministic degraded mount. And there is no automatic balance (sync) after recovering from a degraded mount. And as far as I know there's no automatic transition from degraded to normal operation upon later discovery of a previously missing device. It's just begging for data loss. That's why it's not the default. That's why it's not recommended. -- Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
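For reference, the configuration Stefan describes amounts to something like the following (a sketch; the filesystem UUID is a placeholder, and per the warning above this trades boot availability for the risk of a silent, non-deterministic degraded mount):

```shell
# /etc/default/grub -- degraded flag passed to the root fs at boot
# (on Debian, run update-grub afterwards to regenerate grub.cfg):
GRUB_CMDLINE_LINUX_DEFAULT="quiet rootflags=degraded"

# /etc/fstab -- same flag as a mount option:
# UUID=<fs-uuid>  /  btrfs  defaults,degraded  0  0
```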
* Re: btrfs as / filesystem in RAID1 2019-02-07 17:15 ` Chris Murphy @ 2019-02-07 17:37 ` Martin Steigerwald 2019-02-07 22:19 ` Chris Murphy 0 siblings, 1 reply; 32+ messages in thread From: Martin Steigerwald @ 2019-02-07 17:37 UTC (permalink / raw) To: Chris Murphy; +Cc: Stefan K, Btrfs BTRFS Chris Murphy - 07.02.19, 18:15: > > So please change the normal behavior > > In the case of no device loss, but device delay, with 'degraded' set > in fstab you risk a non-deterministic degraded mount. And there is no > automatic balance (sync) after recovering from a degraded mount. And > as far as I know there's no automatic transition from degraded to > normal operation upon later discovery of a previously missing device. > It's just begging for data loss. That's why it's not the default. > That's why it's not recommended. Still the current behavior is not really user-friendly. And does not meet expectations that users usually have about how RAID 1 works. I know BTRFS RAID 1 is no RAID 1, although it is called like this. I also somewhat get that with the current state of BTRFS the current behavior of not allowing a degraded mount may be better… however… I see clearly room for improvement here. And there very likely will be discussions like this on this list… until BTRFS acts in a more user friendly way here. I faced this myself during recovery from a failure of one SSD of a dual SSD BTRFS RAID 1 and it caused me having to spend *hours* instead of what in my eyes could be minutes to recover the machine to a working state again. Luckily the SSDs I use do not tend to fail all that often. And the Intel SSD 320 that has this "Look, I am 8 MiB big and all your data is gone" firmware bug – even with the firmware version that was supposed to fix this issue – is out of service now. Although I was able to bring it back to a working (but blank) state with a secure erase, I am just not going to use such a SSD for anything serious. 
Thanks, -- Martin ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: btrfs as / filesystem in RAID1 2019-02-07 17:37 ` Martin Steigerwald @ 2019-02-07 22:19 ` Chris Murphy 2019-02-07 23:02 ` Remi Gauvin 2019-02-08 7:33 ` Stefan K 0 siblings, 2 replies; 32+ messages in thread From: Chris Murphy @ 2019-02-07 22:19 UTC (permalink / raw) To: Martin Steigerwald; +Cc: Chris Murphy, Stefan K, Btrfs BTRFS On Thu, Feb 7, 2019 at 10:37 AM Martin Steigerwald <martin@lichtvoll.de> wrote: > > Chris Murphy - 07.02.19, 18:15: > > > So please change the normal behavior > > > > In the case of no device loss, but device delay, with 'degraded' set > > in fstab you risk a non-deterministic degraded mount. And there is no > > automatic balance (sync) after recovering from a degraded mount. And > > as far as I know there's no automatic transition from degraded to > > normal operation upon later discovery of a previously missing device. > > It's just begging for data loss. That's why it's not the default. > > That's why it's not recommended. > > Still the current behavior is not really user-friendly. And does not > meet expectations that users usually have about how RAID 1 works. I know > BTRFS RAID 1 is no RAID 1, although it is called like this. I mentioned the user experience is not good, in both my Feb 2 and Feb 5 responses, compared to mdadm and lvm raid1 in the same situation. However the raid1 term only describes replication. It doesn't describe any policy. And whether to fail to mount or mount degraded by default, is a policy. Whether and how to transition from degraded to normal operation when a formerly missing device reappears, is a policy. And whether, and how, and when to rebuild data after resuming normal operation is a policy. A big part of why these policies are MIA is because they require features that just don't exist yet. And perhaps don't even belong in btrfs kernel code or user space tools; but rather a system service or daemon that manages such policies. However, none of that means Btrfs raid1 is not raid1. 
There's a wrong assumption being made about policies and features in mdadm and LVM, that they are somehow attached to the definition of raid1, but they aren't. > I also somewhat get that with the current state of BTRFS the current > behavior of not allowing a degraded mount may be better… however… I see > clearly room for improvement here. And there very likely will be > discussions like this on this list… until BTRFS acts in a more user > friendly way here. And it's completely appropriate if someone wants to update the Btrfs status page to make more clear what features/behaviors/policies apply to Btrfs raid of all types, or to have a page that summarizes their differences among mdadm and/or LVM raid levels, so users can better assess their risk taking, and choose the best Linux storage technology for their use case. But at least developers know this is the case. And actually, you could mitigate some decent amount of Btrfs missing features with server monitoring tools; including parsing kernel messages. Because right now you aren't even informed of read or write errors, device or csum mismatches or fixups, unless you're checking kernel messages. Where mdadm has the option for emailing notifications to an admin for such things, and lvm has a monitor that I guess does something; I haven't used it. Literally Btrfs will only complain about failed writes that would cause immediate ejection of the device by md. -- Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: btrfs as / filesystem in RAID1 2019-02-07 22:19 ` Chris Murphy @ 2019-02-07 23:02 ` Remi Gauvin 2019-02-08 7:33 ` Stefan K 1 sibling, 0 replies; 32+ messages in thread From: Remi Gauvin @ 2019-02-07 23:02 UTC (permalink / raw) To: Btrfs BTRFS On 2019-02-07 5:19 p.m., Chris Murphy wrote: > And actually, you could mitigate some decent amount of Btrfs missing > features with server monitoring tools; including parsing kernel > messages. Because right now you aren't even informed of read or write > errors, device or csums mismatches or fixups, unless you're checking > kernel messages. Where mdadm has the option for emailing notifications > to an admin for such things, and lvm has a monitor that I guess does > something I haven't used it. Literally Btrfs will only complain about > failed writes that would cause immediate ejection of the device by md. You can, and probably should, have an hourly cron job that does something like btrfs dev stats -c / || Command to sound sysadmin alarm The only difference here is that this is not, at this time, already baked into distros by default. I think I saw mention of a project recently to build a package that automates common btrfs maintenance tasks? ^ permalink raw reply [flat|nested] 32+ messages in thread
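A minimal sketch of such a cron check, wrapped in a function so the alert hook stays separate (the mountpoint and the mail command are placeholders; `--check` is the long spelling of the `-c` flag mentioned above, which makes `btrfs device stats` exit non-zero when any error counter is above zero):

```shell
#!/bin/sh
# check_btrfs_stats MOUNTPOINT
# Returns 0 when all device error counters are zero; otherwise prints
# the stats report and returns 1 so the caller can trigger an alert.
check_btrfs_stats() {
    out=$(btrfs device stats --check "$1" 2>&1)
    if [ $? -ne 0 ]; then
        printf 'btrfs device errors on %s:\n%s\n' "$1" "$out"
        return 1
    fi
    return 0
}

# Hypothetical hourly cron entry, with a placeholder alert command:
#   check_btrfs_stats / || mail -s "btrfs errors on $(hostname)" root
```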
* Re: btrfs as / filesystem in RAID1 2019-02-07 22:19 ` Chris Murphy 2019-02-07 23:02 ` Remi Gauvin @ 2019-02-08 7:33 ` Stefan K 2019-02-08 17:26 ` Chris Murphy 1 sibling, 1 reply; 32+ messages in thread From: Stefan K @ 2019-02-08 7:33 UTC (permalink / raw) To: linux-btrfs > However the raid1 term only describes replication. It doesn't describe > any policy. Yep, you're right, but most sysadmins expect some 'policies'. If I use RAID1 I expect that if one drive fails, I can still boot _without_ boot issues, just some warnings etc., because I use raid1 to have simple one-device tolerance if one fails (which can happen). I can check/monitor the BTRFS RAID status with 'btrfs fi show' (or 'btrfs dev stats'). I also expect that if a device comes back it will sync automatically, and if I replace a device it will automatically rebalance the raid1 (which btrfs does, so far). I think a lot of sysadmins feel the same way. On Thursday, February 7, 2019 3:19:01 PM CET Chris Murphy wrote: > On Thu, Feb 7, 2019 at 10:37 AM Martin Steigerwald <martin@lichtvoll.de> wrote: > > > > Chris Murphy - 07.02.19, 18:15: > > > > So please change the normal behavior > > > > > > In the case of no device loss, but device delay, with 'degraded' set > > > in fstab you risk a non-deterministic degraded mount. And there is no > > > automatic balance (sync) after recovering from a degraded mount. And > > > as far as I know there's no automatic transition from degraded to > > > normal operation upon later discovery of a previously missing device. > > > It's just begging for data loss. That's why it's not the default. > > > That's why it's not recommended. > > > > Still the current behavior is not really user-friendly. And does not > > meet expectations that users usually have about how RAID 1 works. I know > > BTRFS RAID 1 is no RAID 1, although it is called like this. 
> > I mentioned the user experience is not good, in both my Feb 2 and Feb > 5 responses, compared to mdadm and lvm raid1 in the same situation. > > However the raid1 term only describes replication. It doesn't describe > any policy. And whether to fail to mount or mount degraded by default, > is a policy. Whether and how to transition from degraded to normal > operation when a formerly missing device reappears, is a policy. And > whether, and how, and when to rebuild data after resuming normal > operation is a policy. A big part of why these policies are MIA is > because they require features that just don't exist yet. And perhaps > don't even belong in btrfs kernel code or user space tools; but rather > a system service or daemon that manages such policies. However, none > of that means Btrfs raid1 is not raid1. There's a wrong assumption > being made about policies and features in mdadm and LVM, that they are > somehow attached to the definition of raid1, but they aren't. > > > > I also somewhat get that with the current state of BTRFS the current > > behavior of not allowing a degraded mount may be better… however… I see > > clearly room for improvement here. And there very likely will be > > discussions like this on this list… until BTRFS acts in a more user > > friendly way here. > > And it's completely appropriate if someone wants to update the Btrfs > status page to make more clear what features/behaviors/policies apply > to Btrfs raid of all types, or to have a page that summarizes their > differences among mdadm and/or LVM raid levels, so users can better > assess their risk taking, and choose the best Linux storage technology > for their use case. > > But at least developers know this is the case. > > And actually, you could mitigate some decent amount of Btrfs missing > features with server monitoring tools; including parsing kernel > messages. 
Because right now you aren't even informed of read or write > errors, device or csums mismatches or fixups, unless you're checking > kernel messages. Where mdadm has the option for emailing notifications > to an admin for such things, and lvm has a monitor that I guess does > something I haven't used it. Literally Btrfs will only complain about > failed writes that would cause immediate ejection of the device by md. > > > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: btrfs as / filesystem in RAID1 2019-02-08 7:33 ` Stefan K @ 2019-02-08 17:26 ` Chris Murphy 0 siblings, 0 replies; 32+ messages in thread From: Chris Murphy @ 2019-02-08 17:26 UTC (permalink / raw) To: Btrfs BTRFS On Fri, Feb 8, 2019 at 12:33 AM Stefan K <shadow_7@gmx.net> wrote: > > > However the raid1 term only describes replication. It doesn't describe > > any policy. > yep you're right, but the most sysadmin expect some 'policies'. A sysadmin expecting policies is fine, but assuming they exist makes them a questionable sysadmin. >> If I use RAID1 I expect that if one drive failed, I can still boot _without_ boot issues, just some warnings etc, because I use raid1 to have simple 1device tolerance if one fails (which can happen). OK and we've already explained that btrfs doesn't work that way yet, which is why it has the defaults it has, but then you go on to assert that Btrfs should have the defaults YOU want based on YOUR assumptions. It's absurd. >I can check/monitor the BTRFS RAID status by 'btrfs fi sh' or '(or by 'btrfs dev stat'). I also expect that if a device came back it will sync automatically and if I replace a device it will automatically rebalance the raid1 (which btrfs does, so far). I think a lot of sysadmins feel the same way. OK what you just wrote there is sufficiently incomplete that it's wrong. I and others have already described part of this behavior so if you were really comprehending what people are saying, you wouldn't have just written the above paragraph. If a missing device reappears, it is not synced automatically. If you have a two device raid1 with a missing device, and mounted degraded, data is highly likely to get written to the single remaining drive as single profile chunks; which means when you do either 'btrfs replace' or 'btrfs device add' followed by 'btrfs device remove' the data in those single chunks will *not* be replicated automatically to the replacement drive. 
You will have to do a manual balance and explicitly convert single chunks to raid1. If it's 3+ drives, a device replacement (of either method) should cause data to be replicated. I see a lot of sysadmins make the wrong assumptions on the linux-raid list and on LVM list, and I often read about data loss when they do that. What matters is how things actually work. When you make assumptions about how they work, you're unwittingly begging for user induced data loss, and all the complaining about missing features won't help get the data back. Over and over again telling people, you didn't understand how it worked, you didn't understand what you were doing, and yeah sorry the data is just gone. It's your responsibility to understand how things really work and fail. It isn't possible for the code to understand your expectations and act accordingly. At least you're discovering the limitations before you end up in trouble. The job of a sysadmin is to find out the difference between expectations and actual feature set, because maybe the technology being evaluated isn't a good match for the use case. -- Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: btrfs as / filesystem in RAID1 2019-02-07 11:04 ` Stefan K 2019-02-07 12:18 ` Austin S. Hemmelgarn 2019-02-07 17:15 ` Chris Murphy @ 2019-02-11 9:30 ` Anand Jain 2 siblings, 0 replies; 32+ messages in thread From: Anand Jain @ 2019-02-11 9:30 UTC (permalink / raw) To: Stefan K, linux-btrfs On 2/7/19 7:04 PM, Stefan K wrote: > Thanks, with degraded as kernel parameter and also ind the fstab it works like expected > > That should be the normal behaviour, IMO in the long term it will be. But before that we have few items to fix around this, such as the serviceability part. -Anand > cause a server must be up and running, and I don't care about a device loss, thats why I use a RAID1. The device-loss problem can I fix later, but its important that a server is up and running, i got informed at boot time and also in the logs files that a device is missing, also I see that if you use a monitoring program. > > So please change the normal behavior > > On Friday, February 1, 2019 7:13:16 PM CET Hans van Kranenburg wrote: >> Hi Stefan, >> >> On 2/1/19 11:28 AM, Stefan K wrote: >>> >>> I've installed my Debian Stretch to have / on btrfs with raid1 on 2 >>> SSDs. Today I want test if it works, it works fine until the server >>> is running and the SSD get broken and I can change this, but it looks >>> like that it does not work if the SSD fails until restart. I got the >>> error, that one of the Disks can't be read and I got a initramfs >>> prompt, I expected that it still runs like mdraid and said something >>> is missing. >>> >>> My question is, is it possible to configure btrfs/fstab/grub that it >>> still boot? (that is what I expected from a RAID1) >> >> Yes. I'm not the expert in this area, but I see you haven't got a reply >> today yet, so I'll try. >> >> What you see happening is correct. This is the default behavior. >> >> To be able to boot into your system with a missing disk, you can add... 
>> rootflags=degraded >> ...to the linux kernel command line by editing it on the fly when you >> are in the GRUB menu. >> >> This allows the filesystem to start in 'degraded' mode this one time. >> The only thing you should be doing when the system is booted is have a >> new disk present already in place and fix the btrfs situation. This >> means things like cloning the partition table of the disk that's still >> working, doing whatever else is needed in your situation and then >> running btrfs replace to replace the missing disk with the new one, and >> then making sure you don't have "single" block groups left (using btrfs >> balance), which might have been created for new writes when the >> filesystem was running in degraded mode. >> >> -- >> Hans van Kranenburg >> > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: btrfs as / filesystem in RAID1 2019-02-01 10:28 btrfs as / filesystem in RAID1 Stefan K 2019-02-01 19:13 ` Hans van Kranenburg @ 2019-02-02 23:35 ` Chris Murphy 2019-02-04 17:47 ` Patrik Lundquist 1 sibling, 1 reply; 32+ messages in thread From: Chris Murphy @ 2019-02-02 23:35 UTC (permalink / raw) To: Stefan K; +Cc: Btrfs BTRFS On Fri, Feb 1, 2019 at 3:28 AM Stefan K <shadow_7@gmx.net> wrote: > > Hello, > > I've installed my Debian Stretch to have / on btrfs with raid1 on 2 SSDs. Today I want test if it works, it works fine until the server is running and the SSD get broken and I can change this, but it looks like that it does not work if the SSD fails until restart. I got the error, that one of the Disks can't be read and I got a initramfs prompt, I expected that it still runs like mdraid and said something is missing. > > My question is, is it possible to configure btrfs/fstab/grub that it still boot? (that is what I expected from a RAID1) It's not reliable for unattended use. There are two issues: 1. /usr/lib/udev/rules.d/64-btrfs.rules means mount won't even be attempted if all Btrfs devices are not found. 2. Degraded mounts don't happen automatically or by default; instead mount fails. It might seem like you can have a grub boot param 'rootflags=degraded' set all the time. While it's ignored if all devices are found at mount time, the problem is if one device is just delayed, you get an undesirable degraded mount. Three additional problems come from degraded mounts: 1. At least with raid1/10, a particular device can only be mounted rw,degraded one time and from then on it fails, and can only be ro mounted. There are patches for this but I don't think they've been merged still. 2. There is no automatic "catch up" repair once the old device returns. md and lvm raid will do a partial sync based on the write-intent bitmap, so it doesn't have to do a full sync. 
Btrfs should have all available information to see how far behind a mirror device is (more correctly, a stripe of a mirror chunk) and to do a catch-up so the mirrors are all the same again; however there's no mechanism to do a partial scrub, nor to do a scrub of any kind automatically. It takes manual intervention to make them the same again. This affects raid 1/10/5/6. 3. At least with raid1/10, if more than one device of a mirrored volume is mounted rw degraded - it's hosed. If you have a two device raid1, with devices A and B; if A is mounted rw degraded and then later B is (separately) mounted rw degraded, they each have different states than the other, and those states are equally valid, and there's no way to merge them. Further, I'm pretty sure Btrfs still has no check for this, and will corrupt itself if you mount the volume rw (with all devices present, i.e. not degraded). I think there are patches for this (?) but in any case I don't think they've been merged either. So the bottom line is that the sysadmin has to handhold a Btrfs raid1. It really can't be used for unattended access. -- Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
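As a concrete sketch of that handholding for a two-device raid1 with one failed disk (the device paths and the devid are placeholders for this example):

```shell
# One-time degraded mount of the surviving device; a plain mount
# would fail because a device is missing.
mount -o degraded /dev/sda2 /mnt

# Identify the devid of the missing device ("missing" in the output).
btrfs filesystem show /mnt

# Replace missing devid 2 with the new disk, then watch progress.
btrfs replace start 2 /dev/sdb2 /mnt
btrfs replace status /mnt
```

Any chunks written as single profile while degraded still need a balance convert afterwards, as discussed elsewhere in the thread.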
* Re: btrfs as / filesystem in RAID1 2019-02-02 23:35 ` Chris Murphy @ 2019-02-04 17:47 ` Patrik Lundquist 2019-02-04 17:55 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 32+ messages in thread From: Patrik Lundquist @ 2019-02-04 17:47 UTC (permalink / raw) To: Chris Murphy; +Cc: Stefan K, Btrfs BTRFS On Sun, 3 Feb 2019 at 01:24, Chris Murphy <lists@colorremedies.com> wrote: > > 1. At least with raid1/10, a particular device can only be mounted > rw,degraded one time and from then on it fails, and can only be ro > mounted. There are patches for this but I don't think they've been > merged still. That should be fixed since Linux 4.14. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: btrfs as / filesystem in RAID1 2019-02-04 17:47 ` Patrik Lundquist @ 2019-02-04 17:55 ` Austin S. Hemmelgarn 2019-02-04 22:19 ` Patrik Lundquist 0 siblings, 1 reply; 32+ messages in thread From: Austin S. Hemmelgarn @ 2019-02-04 17:55 UTC (permalink / raw) To: Patrik Lundquist, Chris Murphy; +Cc: Stefan K, Btrfs BTRFS On 2019-02-04 12:47, Patrik Lundquist wrote: > On Sun, 3 Feb 2019 at 01:24, Chris Murphy <lists@colorremedies.com> wrote: >> >> 1. At least with raid1/10, a particular device can only be mounted >> rw,degraded one time and from then on it fails, and can only be ro >> mounted. There are patches for this but I don't think they've been >> merged still. > > That should be fixed since Linux 4.14. > Did the patches that fixed chunk generation land too? Last I knew, 4.14 had the patch that fixed mounting volumes that had this particular issue, but not the patches that prevented a writable degraded mount from producing the issue on-disk in the first place. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: btrfs as / filesystem in RAID1 2019-02-04 17:55 ` Austin S. Hemmelgarn @ 2019-02-04 22:19 ` Patrik Lundquist 2019-02-05 6:46 ` Chris Murphy 0 siblings, 1 reply; 32+ messages in thread From: Patrik Lundquist @ 2019-02-04 22:19 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Stefan K, Btrfs BTRFS On Mon, 4 Feb 2019 at 18:55, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > > On 2019-02-04 12:47, Patrik Lundquist wrote: > > On Sun, 3 Feb 2019 at 01:24, Chris Murphy <lists@colorremedies.com> wrote: > >> > >> 1. At least with raid1/10, a particular device can only be mounted > >> rw,degraded one time and from then on it fails, and can only be ro > >> mounted. There are patches for this but I don't think they've been > >> merged still. > > > > That should be fixed since Linux 4.14. > > > > Did the patches that fixed chunk generation land too? Last I knew, 4.14 > had the patch that fixed mounting volumes that had this particular > issue, but not the patches that prevented a writable degraded mount from > producing the issue on-disk in the first place. A very good question and at least 4.19.12 creates single chunks instead of raid1 chunks if I rip out one disk of two in a raid1 setup and mount it degraded. So a balance from single chunks to raid1 chunks is still needed after the failed device has been replaced. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: btrfs as / filesystem in RAID1 2019-02-04 22:19 ` Patrik Lundquist @ 2019-02-05 6:46 ` Chris Murphy 2019-02-05 7:37 ` Chris Murphy 0 siblings, 1 reply; 32+ messages in thread From: Chris Murphy @ 2019-02-05 6:46 UTC (permalink / raw) To: Patrik Lundquist Cc: Austin S. Hemmelgarn, Chris Murphy, Stefan K, Btrfs BTRFS On Mon, Feb 4, 2019 at 3:19 PM Patrik Lundquist <patrik.lundquist@gmail.com> wrote: > > On Mon, 4 Feb 2019 at 18:55, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > > > > On 2019-02-04 12:47, Patrik Lundquist wrote: > > > On Sun, 3 Feb 2019 at 01:24, Chris Murphy <lists@colorremedies.com> wrote: > > >> > > >> 1. At least with raid1/10, a particular device can only be mounted > > >> rw,degraded one time and from then on it fails, and can only be ro > > >> mounted. There are patches for this but I don't think they've been > > >> merged still. > > > > > > That should be fixed since Linux 4.14. > > > > > > > Did the patches that fixed chunk generation land too? Last I knew, 4.14 > > had the patch that fixed mounting volumes that had this particular > > issue, but not the patches that prevented a writable degraded mount from > > producing the issue on-disk in the first place. > > A very good question and at least 4.19.12 creates single chunks > instead of raid1 chunks if I rip out one disk of two in a raid1 setup > and mount it degraded. So a balance from single chunks to raid1 chunks > is still needed after the failed device has been replaced. Kernel 4.20.3 I can confirm that I can do at least three rw,degraded mounts, adding data each mount, on a two device raid1 with a missing device. When rw,degraded, it's writing data to single profile chunks, and to raid1 metadata chunks. There's no warning about this. After remounting both devices and scrubbing, it's dog slow. 14 minutes to scrub a 4GiB file system, complaining the whole time about checksums on the files not replicated. 
All it appears to be doing is replicating metadata at a snail's pace, less than 2MB/s. That's unexpected. But while it's expected that single data is not magically converted to raid1, the fact that it's single profile just because it's a degraded raid1 is not expected, and not warned about. I don't like this behavior - so now the user has to do a balance convert to get back to the replicated state they thought they had when formatting? -- Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: btrfs as / filesystem in RAID1
  2019-02-05  6:46 ` Chris Murphy
@ 2019-02-05  7:37 ` Chris Murphy
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Murphy @ 2019-02-05 7:37 UTC (permalink / raw)
To: Btrfs BTRFS

On Mon, Feb 4, 2019 at 11:46 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> After remounting both devices and scrubbing, it's dog slow. 14 minutes
> to scrub a 4GiB file system, complaining the whole time about
> checksums on the files not replicated. All it appears to be doing is
> replicating metadata at a snail's pace, less than 2MB/s.

OK, I see what's going on. The raid1 data chunk was not full, so the
initial rw,degraded writes went there. New writes went to a single
chunk. Upon unmounting, restoring the missing device, and mounting
normally:

Data,single: Size:13.00GiB, Used:12.91GiB
   /dev/mapper/vg-test2   13.00GiB

Data,RAID1: Size:2.00GiB, Used:1.98GiB
   /dev/mapper/vg-test1    2.00GiB
   /dev/mapper/vg-test2    2.00GiB

Metadata,single: Size:1.00GiB, Used:0.00B
   /dev/mapper/vg-test2    1.00GiB

Metadata,RAID1: Size:1.00GiB, Used:15.91MiB
   /dev/mapper/vg-test1    1.00GiB
   /dev/mapper/vg-test2    1.00GiB

System,single: Size:32.00MiB, Used:16.00KiB
   /dev/mapper/vg-test2   32.00MiB

System,RAID1: Size:8.00MiB, Used:0.00B
   /dev/mapper/vg-test1    8.00MiB
   /dev/mapper/vg-test2    8.00MiB

So it has demoted the system chunk to single profile, and the new data
chunk is also single profile. And even though it created a single
profile metadata chunk, it's not using it; instead it continues to use
the not-yet-full raid1 profile metadata chunks, presumably until those
are full, and only then will newly allocated metadata chunks be single
profile.

mdadm and LVM, upon assembly once all devices are present again, detect
the stale device from its lower event count, know which blocks to
replicate from the write-intent bitmap, and start this sync/replication
right away - before even mounting the file system.
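[For contrast, the mdadm behavior referred to above can be sketched as follows — illustrative only; the device names /dev/sdb1 and /dev/sdc1 and array name /dev/md0 are placeholders:]

```shell
# mdadm raid1 with an internal write-intent bitmap: re-adding a device
# that went temporarily missing triggers an automatic resync of only
# the blocks marked dirty in the bitmap.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      --bitmap=internal /dev/sdb1 /dev/sdc1

# Simulate a device dropping out, with writes continuing degraded.
mdadm /dev/md0 --fail /dev/sdc1
mdadm /dev/md0 --remove /dev/sdc1
# ... degraded writes happen here ...

# On re-add, the kernel resyncs the stale blocks immediately, before
# any filesystem on /dev/md0 is even mounted.
mdadm /dev/md0 --re-add /dev/sdc1
cat /proc/mdstat
```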
So Btrfs is neither automatic, nor is it obvious that you have to do a
*balance* rather than a scrub in this case, which looks like it only
happens in the single-device-degraded case (I assume that if it were a
3-device array with a missing device, raid1 chunks could still be
created and this situation wouldn't happen).

With a very new file system, perhaps most of the data written while
mounted rw,degraded goes to single profile chunks. That permits use of
the soft filter when converting, to avoid a full balance (a full sync).
However, that's not certain. So the safest single option is
unfortunately a full balance with the convert filter only. The most
efficient is to use both the convert and soft filters (for data only;
metadata must be hard converted), followed by a scrub.

*sigh* It's non-obvious that the user must intervene, and then what
they need to do is also non-obvious. For sure mdadm and LVM are better
in this case, simply because they do the right thing and re-establish
the expected replication automatically.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread
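[The recovery sequence described above boils down to a balance with the convert filter. A sketch of both options, assuming the missing device has already been replaced (e.g. with `btrfs replace`) and /mnt is a placeholder mount point:]

```shell
# Safest, slowest: hard-convert all data and metadata back to raid1.
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

# More efficient alternative: 'soft' skips data chunks that are already
# raid1 (soft applies to data only here; metadata is hard-converted),
# then scrub to verify and repair checksums.
btrfs balance start -dconvert=raid1,soft -mconvert=raid1 /mnt
btrfs scrub start -B /mnt

# Confirm no single-profile chunks remain.
btrfs filesystem usage /mnt
```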
end of thread, other threads:[~2019-02-11 21:15 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-01 10:28 btrfs as / filesystem in RAID1 Stefan K
2019-02-01 19:13 ` Hans van Kranenburg
2019-02-07 11:04 ` Stefan K
2019-02-07 12:18 ` Austin S. Hemmelgarn
2019-02-07 18:53 ` waxhead
2019-02-07 19:39 ` Austin S. Hemmelgarn
2019-02-07 21:21 ` Remi Gauvin
2019-02-08  4:51 ` Andrei Borzenkov
2019-02-08 12:54 ` Austin S. Hemmelgarn
2019-02-08  7:15 ` Stefan K
2019-02-08 12:58 ` Austin S. Hemmelgarn
2019-02-08 16:56 ` Chris Murphy
2019-02-08 18:10 ` waxhead
2019-02-08 19:17 ` Austin S. Hemmelgarn
2019-02-09 12:13 ` waxhead
2019-02-10 18:34 ` Chris Murphy
2019-02-11 12:17 ` Austin S. Hemmelgarn
2019-02-11 21:15 ` Chris Murphy
2019-02-08 20:17 ` Chris Murphy
2019-02-07 17:15 ` Chris Murphy
2019-02-07 17:37 ` Martin Steigerwald
2019-02-07 22:19 ` Chris Murphy
2019-02-07 23:02 ` Remi Gauvin
2019-02-08  7:33 ` Stefan K
2019-02-08 17:26 ` Chris Murphy
2019-02-11  9:30 ` Anand Jain
2019-02-02 23:35 ` Chris Murphy
2019-02-04 17:47 ` Patrik Lundquist
2019-02-04 17:55 ` Austin S. Hemmelgarn
2019-02-04 22:19 ` Patrik Lundquist
2019-02-05  6:46 ` Chris Murphy
2019-02-05  7:37 ` Chris Murphy
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).