linux-btrfs.vger.kernel.org archive mirror
* btrfs as / filesystem in RAID1
@ 2019-02-01 10:28 Stefan K
  2019-02-01 19:13 ` Hans van Kranenburg
  2019-02-02 23:35 ` Chris Murphy
  0 siblings, 2 replies; 32+ messages in thread
From: Stefan K @ 2019-02-01 10:28 UTC (permalink / raw)
  To: linux-btrfs

Hello,

I've installed Debian Stretch with / on btrfs with raid1 on 2 SSDs. Today I wanted to test whether it really works. It works fine while the server is running: if an SSD breaks, I can replace it. But it apparently does not work if the SSD has already failed before a restart. I got an error that one of the disks can't be read and was dropped to an initramfs prompt; I expected it to keep running like mdraid does and just report that something is missing.

My question is: is it possible to configure btrfs/fstab/grub so that it still boots? (That is what I expected from a RAID1.)

best regards
Stefan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-01 10:28 btrfs as / filesystem in RAID1 Stefan K
@ 2019-02-01 19:13 ` Hans van Kranenburg
  2019-02-07 11:04   ` Stefan K
  2019-02-02 23:35 ` Chris Murphy
  1 sibling, 1 reply; 32+ messages in thread
From: Hans van Kranenburg @ 2019-02-01 19:13 UTC (permalink / raw)
  To: Stefan K, linux-btrfs

Hi Stefan,

On 2/1/19 11:28 AM, Stefan K wrote:
> 
> I've installed my Debian Stretch to have / on btrfs with raid1 on 2
> SSDs. Today I want test if it works, it works fine until the server
> is running and the SSD get broken and I can change this, but it looks
> like that it does not work if the SSD fails until restart. I got the
> error, that one of the Disks can't be read and I got a initramfs
> prompt, I expected that it still runs like mdraid and said something
> is missing.
> 
> My question is, is it possible to configure btrfs/fstab/grub that it
> still boot? (that is what I expected from a RAID1)

Yes. I'm not the expert in this area, but I see you haven't got a reply
today yet, so I'll try.

What you see happening is correct. This is the default behavior.

To be able to boot into your system with a missing disk, you can add...
    rootflags=degraded
...to the linux kernel command line by editing it on the fly when you
are in the GRUB menu.
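
For example, the line you edit in the GRUB menu ends up looking roughly
like this (the kernel path and root UUID here are just placeholders):

    linux /boot/vmlinuz-<version> root=UUID=<your-root-uuid> ro rootflags=degraded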

This allows the filesystem to start in 'degraded' mode this one time.
The only thing you should be doing once the system is booted is to have
a new disk already in place and fix the btrfs situation. That means
things like cloning the partition table of the disk that's still
working, doing whatever else is needed in your situation, then running
btrfs replace to replace the missing disk with the new one, and finally
making sure you don't have "single" block groups left (using btrfs
balance), since those might have been created for new writes while the
filesystem was running in degraded mode.
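
In rough commands, that sequence could look something like this (the
device names, devid and mount point are placeholders for your setup):

    # copy the partition table from the surviving disk (/dev/sda) to the new disk (/dev/sdb)
    sgdisk --replicate=/dev/sdb /dev/sda
    sgdisk --randomize-guids /dev/sdb
    # replace the missing device; its devid (here 2) is shown by 'btrfs filesystem show'
    btrfs replace start 2 /dev/sdb2 /
    btrfs replace status /
    # convert any "single" block groups created while degraded back to raid1
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1 /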

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-01 10:28 btrfs as / filesystem in RAID1 Stefan K
  2019-02-01 19:13 ` Hans van Kranenburg
@ 2019-02-02 23:35 ` Chris Murphy
  2019-02-04 17:47   ` Patrik Lundquist
  1 sibling, 1 reply; 32+ messages in thread
From: Chris Murphy @ 2019-02-02 23:35 UTC (permalink / raw)
  To: Stefan K; +Cc: Btrfs BTRFS

On Fri, Feb 1, 2019 at 3:28 AM Stefan K <shadow_7@gmx.net> wrote:
>
> Hello,
>
> I've installed my Debian Stretch to have / on btrfs with raid1 on 2 SSDs. Today I want test if it works, it works fine until the server is running and the SSD get broken and I can change this, but it looks like that it does not work if the SSD fails until restart. I got the error, that one of the Disks can't be read and I got a initramfs prompt, I expected that it still runs like mdraid and said something is missing.
>
> My question is, is it possible to configure btrfs/fstab/grub that it still boot? (that is what I expected from a RAID1)

It's not reliable for unattended use. There are two issues:
1. /usr/lib/udev/rules.d/64-btrfs.rules means a mount won't even be
attempted unless all Btrfs devices have been found.
2. Degraded mounts don't happen automatically or by default; instead
the mount fails.

It might seem like you could have the GRUB boot parameter
'rootflags=degraded' set all the time. While it's ignored if all devices
are found at mount time, the problem is that if one device is merely
delayed, you get an undesirable degraded mount. Three additional
problems come from degraded mounts:

1. At least with raid1/10, a particular device can only be mounted
rw,degraded one time; from then on the mount fails and the device can
only be mounted ro. There are patches for this, but I don't think
they've been merged yet.
2. There is no automatic "catch up" repair once the old device
returns. md and lvm raid do a partial sync based on the write-intent
bitmap, so they don't have to do a full sync. Btrfs should have all the
information available to see how far behind a mirror device is (more
correctly, a stripe of a mirror chunk) and to do a catch-up so the
mirrors are all the same again; however, there's no mechanism to do a
partial scrub, nor to do a scrub of any kind automatically. It takes
manual intervention to make them the same again. This affects
raid1/10/5/6.
3. At least with raid1/10, if more than one device of a mirrored volume
is mounted rw,degraded separately, the volume is hosed. If you have a
two-device raid1 with devices A and B, and A is mounted rw,degraded and
then later B is (separately) mounted rw,degraded, each has a state that
differs from the other, both states are equally valid, and there's no
way to merge them. Further, I'm pretty sure Btrfs still has no check
for this, and will corrupt itself if you then mount the volume rw (with
all devices present, i.e. not degraded). I think there are patches for
this (?), but in any case I don't think they've been merged either.

So the bottom line is that the sysadmin has to hand-hold a Btrfs raid1.
It really can't be used for unattended use.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-02 23:35 ` Chris Murphy
@ 2019-02-04 17:47   ` Patrik Lundquist
  2019-02-04 17:55     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 32+ messages in thread
From: Patrik Lundquist @ 2019-02-04 17:47 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Stefan K, Btrfs BTRFS

On Sun, 3 Feb 2019 at 01:24, Chris Murphy <lists@colorremedies.com> wrote:
>
> 1. At least with raid1/10, a particular device can only be mounted
> rw,degraded one time and from then on it fails, and can only be ro
> mounted. There are patches for this but I don't think they've been
> merged still.

That should be fixed since Linux 4.14.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-04 17:47   ` Patrik Lundquist
@ 2019-02-04 17:55     ` Austin S. Hemmelgarn
  2019-02-04 22:19       ` Patrik Lundquist
  0 siblings, 1 reply; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2019-02-04 17:55 UTC (permalink / raw)
  To: Patrik Lundquist, Chris Murphy; +Cc: Stefan K, Btrfs BTRFS

On 2019-02-04 12:47, Patrik Lundquist wrote:
> On Sun, 3 Feb 2019 at 01:24, Chris Murphy <lists@colorremedies.com> wrote:
>>
>> 1. At least with raid1/10, a particular device can only be mounted
>> rw,degraded one time and from then on it fails, and can only be ro
>> mounted. There are patches for this but I don't think they've been
>> merged still.
> 
> That should be fixed since Linux 4.14.
> 

Did the patches that fixed chunk generation land too?  Last I knew, 4.14 
had the patch that fixed mounting volumes that had this particular 
issue, but not the patches that prevented a writable degraded mount from 
producing the issue on-disk in the first place.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-04 17:55     ` Austin S. Hemmelgarn
@ 2019-02-04 22:19       ` Patrik Lundquist
  2019-02-05  6:46         ` Chris Murphy
  0 siblings, 1 reply; 32+ messages in thread
From: Patrik Lundquist @ 2019-02-04 22:19 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Stefan K, Btrfs BTRFS

On Mon, 4 Feb 2019 at 18:55, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>
> On 2019-02-04 12:47, Patrik Lundquist wrote:
> > On Sun, 3 Feb 2019 at 01:24, Chris Murphy <lists@colorremedies.com> wrote:
> >>
> >> 1. At least with raid1/10, a particular device can only be mounted
> >> rw,degraded one time and from then on it fails, and can only be ro
> >> mounted. There are patches for this but I don't think they've been
> >> merged still.
> >
> > That should be fixed since Linux 4.14.
> >
>
> Did the patches that fixed chunk generation land too?  Last I knew, 4.14
> had the patch that fixed mounting volumes that had this particular
> issue, but not the patches that prevented a writable degraded mount from
> producing the issue on-disk in the first place.

A very good question. At least 4.19.12 still creates single chunks
instead of raid1 chunks if I rip out one of the two disks in a raid1
setup and mount it degraded. So a balance from single chunks to raid1
chunks is still needed after the failed device has been replaced.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-04 22:19       ` Patrik Lundquist
@ 2019-02-05  6:46         ` Chris Murphy
  2019-02-05  7:37           ` Chris Murphy
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Murphy @ 2019-02-05  6:46 UTC (permalink / raw)
  To: Patrik Lundquist
  Cc: Austin S. Hemmelgarn, Chris Murphy, Stefan K, Btrfs BTRFS

On Mon, Feb 4, 2019 at 3:19 PM Patrik Lundquist
<patrik.lundquist@gmail.com> wrote:
>
> On Mon, 4 Feb 2019 at 18:55, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
> >
> > On 2019-02-04 12:47, Patrik Lundquist wrote:
> > > On Sun, 3 Feb 2019 at 01:24, Chris Murphy <lists@colorremedies.com> wrote:
> > >>
> > >> 1. At least with raid1/10, a particular device can only be mounted
> > >> rw,degraded one time and from then on it fails, and can only be ro
> > >> mounted. There are patches for this but I don't think they've been
> > >> merged still.
> > >
> > > That should be fixed since Linux 4.14.
> > >
> >
> > Did the patches that fixed chunk generation land too?  Last I knew, 4.14
> > had the patch that fixed mounting volumes that had this particular
> > issue, but not the patches that prevented a writable degraded mount from
> > producing the issue on-disk in the first place.
>
> A very good question and at least 4.19.12 creates single chunks
> instead of raid1 chunks if I rip out one disk of two in a raid1 setup
> and mount it degraded. So a balance from single chunks to raid1 chunks
> is still needed after the failed device has been replaced.

With kernel 4.20.3 I can confirm that I can do at least three
rw,degraded mounts, adding data on each mount, on a two-device raid1
with a missing device. When mounted rw,degraded, it writes data to
single profile chunks and metadata to raid1 chunks. There's no warning
about this.

After remounting with both devices and scrubbing, it's dog slow: 14
minutes to scrub a 4GiB file system, complaining the whole time about
checksums on the files that aren't replicated. All it appears to be
doing is replicating metadata at a snail's pace, less than 2MB/s. That's
unexpected. And while it's expected that single data is not magically
converted to raid1, the fact that it's single profile at all just
because the raid1 was degraded is not expected, and not warned about. I
don't like this behavior - so now the user has to do a balance convert
to get back to the replicated state they thought they had when
formatting?

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-05  6:46         ` Chris Murphy
@ 2019-02-05  7:37           ` Chris Murphy
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Murphy @ 2019-02-05  7:37 UTC (permalink / raw)
  To: Btrfs BTRFS

On Mon, Feb 4, 2019 at 11:46 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> After remounting both devices and scrubbing, it's dog slow. 14 minutes
> to scrub a 4GiB file system, complaining the whole time about
> checksums on the files not replicated. All it appears to be doing is
> replicating metadata at a snails pace, less than 2MB/s.


OK, I see what's going on. The raid1 data chunk was not full, so the
initial rw,degraded writes went there. New writes went to a single
chunk. Upon unmounting, restoring the missing device, and mounting
normally:

Data,single: Size:13.00GiB, Used:12.91GiB
   /dev/mapper/vg-test2      13.00GiB

Data,RAID1: Size:2.00GiB, Used:1.98GiB
   /dev/mapper/vg-test1       2.00GiB
   /dev/mapper/vg-test2       2.00GiB

Metadata,single: Size:1.00GiB, Used:0.00B
   /dev/mapper/vg-test2       1.00GiB

Metadata,RAID1: Size:1.00GiB, Used:15.91MiB
   /dev/mapper/vg-test1       1.00GiB
   /dev/mapper/vg-test2       1.00GiB

System,single: Size:32.00MiB, Used:16.00KiB
   /dev/mapper/vg-test2      32.00MiB

System,RAID1: Size:8.00MiB, Used:0.00B
   /dev/mapper/vg-test1       8.00MiB
   /dev/mapper/vg-test2       8.00MiB


So it has demoted the system chunk to single profile, and the new data
chunk is also single profile. And even though it created a single
profile metadata chunk, it's not using it; instead it continues to use
the raid1 metadata chunks that aren't full yet, presumably until they
are all full, and only once new chunks need to be allocated will those
be single profile.

mdadm and LVM, upon assembly once all devices are present again, detect
the stale device from its lower event count, know which blocks to
replicate from the write-intent bitmap, and start that sync/replication
right away - before the file system is even mounted. With Btrfs it's
neither automatic, nor obvious that you have to do a *balance* rather
than a scrub in this case, which looks like it only happens in the
single-device degraded case (I assume that with a 3-device array and a
missing device, raid1 chunks can still be created and thus this
situation doesn't happen).

With a very new file system, perhaps most of the data written while
mounted rw,degraded goes to single profile chunks. That permits use of
the soft filter when converting, to avoid a full balance (a full sync).
However, that's not certain. So the safest single option is
unfortunately a full balance with the convert filter only. The most
efficient is to use both convert and the soft filter (for data only;
metadata must be hard converted), followed by a scrub.
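
In commands, those two options look roughly like this (the mount point
is a placeholder):

    # safest: full balance with only the convert filter
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
    # more efficient: soft filter for data, hard convert for metadata
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1 /mnt
    # followed by a scrub
    btrfs scrub start -Bd /mnt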

*sigh* It's non-obvious that the user must intervene, and what they then
need to do is also non-obvious. mdadm and LVM are definitely better in
this case, simply because they do the right thing and re-establish the
expected replication automatically.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-01 19:13 ` Hans van Kranenburg
@ 2019-02-07 11:04   ` Stefan K
  2019-02-07 12:18     ` Austin S. Hemmelgarn
                       ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Stefan K @ 2019-02-07 11:04 UTC (permalink / raw)
  To: linux-btrfs

Thanks, with degraded as a kernel parameter and also in the fstab it works as expected.
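
For reference, this is roughly what that looks like here (the UUID is a
placeholder for the real filesystem UUID):

    # kernel command line
    rootflags=degraded

    # /etc/fstab
    UUID=<filesystem-uuid>  /  btrfs  defaults,degraded  0  0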

That should be the normal behaviour, because a server must be up and running and I don't care about the loss of one device - that's why I use a RAID1. The device-loss problem I can fix later, but it's important that the server is up and running. I get informed at boot time and in the log files that a device is missing, and I also see it if I use a monitoring program.

So please change the normal behaviour.

On Friday, February 1, 2019 7:13:16 PM CET Hans van Kranenburg wrote:
> Hi Stefan,
> 
> On 2/1/19 11:28 AM, Stefan K wrote:
> > 
> > I've installed my Debian Stretch to have / on btrfs with raid1 on 2
> > SSDs. Today I want test if it works, it works fine until the server
> > is running and the SSD get broken and I can change this, but it looks
> > like that it does not work if the SSD fails until restart. I got the
> > error, that one of the Disks can't be read and I got a initramfs
> > prompt, I expected that it still runs like mdraid and said something
> > is missing.
> > 
> > My question is, is it possible to configure btrfs/fstab/grub that it
> > still boot? (that is what I expected from a RAID1)
> 
> Yes. I'm not the expert in this area, but I see you haven't got a reply
> today yet, so I'll try.
> 
> What you see happening is correct. This is the default behavior.
> 
> To be able to boot into your system with a missing disk, you can add...
>     rootflags=degraded
> ...to the linux kernel command line by editing it on the fly when you
> are in the GRUB menu.
> 
> This allows the filesystem to start in 'degraded' mode this one time.
> The only thing you should be doing when the system is booted is have a
> new disk present already in place and fix the btrfs situation. This
> means things like cloning the partition table of the disk that's still
> working, doing whatever else is needed in your situation and then
> running btrfs replace to replace the missing disk with the new one, and
> then making sure you don't have "single" block groups left (using btrfs
> balance), which might have been created for new writes when the
> filesystem was running in degraded mode.
> 
> -- 
> Hans van Kranenburg
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-07 11:04   ` Stefan K
@ 2019-02-07 12:18     ` Austin S. Hemmelgarn
  2019-02-07 18:53       ` waxhead
  2019-02-07 17:15     ` Chris Murphy
  2019-02-11  9:30     ` Anand Jain
  2 siblings, 1 reply; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2019-02-07 12:18 UTC (permalink / raw)
  To: Stefan K, linux-btrfs

On 2019-02-07 06:04, Stefan K wrote:
> Thanks, with degraded  as kernel parameter and also ind the fstab it works like expected
> 
> That should be the normal behaviour, cause a server must be up and running, and I don't care about a device loss, thats why I use a RAID1. The device-loss problem can I fix later, but its important that a server is up and running, i got informed at boot time and also in the logs files that a device is missing, also I see that if you use a monitoring program.
No, it shouldn't be the default, because:

* Normal desktop users _never_ look at the log files or boot info, and 
rarely run monitoring programs, so they as a general rule won't notice 
until it's already too late.  BTRFS isn't just a server filesystem, so 
it needs to be safe for regular users too.
* It's easily possible to end up mounting degraded by accident if one of 
the constituent devices is slow to enumerate, and this can easily result 
in a split-brain scenario where all devices have diverged and the volume 
can only be repaired by recreating it from scratch.
* We have _ZERO_ automatic recovery from this situation.  This makes 
both of the above mentioned issues far more dangerous.
* It just plain does not work with most systemd setups, because systemd 
will hang waiting on all the devices to appear due to the fact that they 
refuse to acknowledge that the only way to correctly know if a BTRFS 
volume will mount is to just try and mount it.
* Given that new kernels still don't properly generate half-raid1 chunks 
when a device is missing in a two-device raid1 setup, there's a very 
real possibility that users will have trouble recovering filesystems 
with old recovery media (IOW, any recovery environment running a kernel 
before 4.14 will not mount the volume correctly).
* You shouldn't be mounting writable and degraded for any reason other 
than fixing the volume (or converting it to a single profile until you 
can fix it), even aside from the other issues.
> 
> So please change the normal behavior
> 
> On Friday, February 1, 2019 7:13:16 PM CET Hans van Kranenburg wrote:
>> Hi Stefan,
>>
>> On 2/1/19 11:28 AM, Stefan K wrote:
>>>
>>> I've installed my Debian Stretch to have / on btrfs with raid1 on 2
>>> SSDs. Today I want test if it works, it works fine until the server
>>> is running and the SSD get broken and I can change this, but it looks
>>> like that it does not work if the SSD fails until restart. I got the
>>> error, that one of the Disks can't be read and I got a initramfs
>>> prompt, I expected that it still runs like mdraid and said something
>>> is missing.
>>>
>>> My question is, is it possible to configure btrfs/fstab/grub that it
>>> still boot? (that is what I expected from a RAID1)
>>
>> Yes. I'm not the expert in this area, but I see you haven't got a reply
>> today yet, so I'll try.
>>
>> What you see happening is correct. This is the default behavior.
>>
>> To be able to boot into your system with a missing disk, you can add...
>>      rootflags=degraded
>> ...to the linux kernel command line by editing it on the fly when you
>> are in the GRUB menu.
>>
>> This allows the filesystem to start in 'degraded' mode this one time.
>> The only thing you should be doing when the system is booted is have a
>> new disk present already in place and fix the btrfs situation. This
>> means things like cloning the partition table of the disk that's still
>> working, doing whatever else is needed in your situation and then
>> running btrfs replace to replace the missing disk with the new one, and
>> then making sure you don't have "single" block groups left (using btrfs
>> balance), which might have been created for new writes when the
>> filesystem was running in degraded mode.
>>
>> -- 
>> Hans van Kranenburg
>>
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-07 11:04   ` Stefan K
  2019-02-07 12:18     ` Austin S. Hemmelgarn
@ 2019-02-07 17:15     ` Chris Murphy
  2019-02-07 17:37       ` Martin Steigerwald
  2019-02-11  9:30     ` Anand Jain
  2 siblings, 1 reply; 32+ messages in thread
From: Chris Murphy @ 2019-02-07 17:15 UTC (permalink / raw)
  To: Stefan K; +Cc: Btrfs BTRFS

On Thu, Feb 7, 2019 at 4:04 AM Stefan K <shadow_7@gmx.net> wrote:
>
> Thanks, with degraded  as kernel parameter and also ind the fstab it works like expected
> That should be the normal behaviour, cause a server must be up and running, and I don't care about a device loss, thats why I use a RAID1.

You managed to completely ignore all the warnings associated with
doing this, and then conclude that it's a good idea to subject normal
users to possible data loss or corruption...

> So please change the normal behavior

In the case of no device loss, but device delay, with 'degraded' set
in fstab you risk a non-deterministic degraded mount. And there is no
automatic balance (sync) after recovering from a degraded mount. And
as far as I know there's no automatic transition from degraded to
normal operation upon later discovery of a previously missing device.
It's just begging for data loss. That's why it's not the default.
That's why it's not recommended.



--
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-07 17:15     ` Chris Murphy
@ 2019-02-07 17:37       ` Martin Steigerwald
  2019-02-07 22:19         ` Chris Murphy
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Steigerwald @ 2019-02-07 17:37 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Stefan K, Btrfs BTRFS

Chris Murphy - 07.02.19, 18:15:
> > So please change the normal behavior
> 
> In the case of no device loss, but device delay, with 'degraded' set
> in fstab you risk a non-deterministic degraded mount. And there is no
> automatic balance (sync) after recovering from a degraded mount. And
> as far as I know there's no automatic transition from degraded to
> normal operation upon later discovery of a previously missing device.
> It's just begging for data loss. That's why it's not the default.
> That's why it's not recommended.

Still, the current behavior is not really user-friendly, and it does not
meet the expectations that users usually have about how RAID 1 works. I
know BTRFS RAID 1 is not a traditional RAID 1, even though it is called
that.

I also somewhat get that, with the current state of BTRFS, the current
behavior of not allowing a degraded mount may be better… however… I
clearly see room for improvement here. And there will very likely be
discussions like this on this list… until BTRFS acts in a more
user-friendly way here.

I faced this myself during recovery from the failure of one SSD in a
dual-SSD BTRFS RAID 1, and it meant I had to spend *hours* instead of
what in my eyes could have been minutes to get the machine back to a
working state. Luckily the SSDs I use do not tend to fail all that
often. And the Intel SSD 320 that has this "Look, I am 8 MiB big and all
your data is gone" firmware bug – even with the firmware version that
was supposed to fix this issue – is out of service now. Although I was
able to bring it back to a working (but blank) state with a secure
erase, I am just not going to use such an SSD for anything serious.

Thanks,
-- 
Martin



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-07 12:18     ` Austin S. Hemmelgarn
@ 2019-02-07 18:53       ` waxhead
  2019-02-07 19:39         ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 32+ messages in thread
From: waxhead @ 2019-02-07 18:53 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Stefan K, linux-btrfs



Austin S. Hemmelgarn wrote:
> On 2019-02-07 06:04, Stefan K wrote:
>> Thanks, with degraded  as kernel parameter and also ind the fstab it 
>> works like expected
>>
>> That should be the normal behaviour, cause a server must be up and 
>> running, and I don't care about a device loss, thats why I use a 
>> RAID1. The device-loss problem can I fix later, but its important that 
>> a server is up and running, i got informed at boot time and also in 
>> the logs files that a device is missing, also I see that if you use a 
>> monitoring program.
> No, it shouldn't be the default, because:
> 
> * Normal desktop users _never_ look at the log files or boot info, and 
> rarely run monitoring programs, so they as a general rule won't notice 
> until it's already too late.  BTRFS isn't just a server filesystem, so 
> it needs to be safe for regular users too.

I am willing to argue that whoever you refer to as normal users don't
have a clue how to make a raid1 filesystem, nor do they care what
underlying filesystem their computer runs. I can't quite see how a
limping system would be worse than a failing system in this case.
Besides, "normal" desktop users use Windows anyway; people who run
penguin-powered stuff generally have at least some technical knowledge.

> * It's easily possible to end up mounting degraded by accident if one of 
> the constituent devices is slow to enumerate, and this can easily result 
> in a split-brain scenario where all devices have diverged and the volume 
> can only be repaired by recreating it from scratch.

Am I wrong, or would not the remaining disk have its generation number
bumped on every commit? Would it not make sense to ignore (previously)
stale disks and require a manual "re-add" of the failed disks? From a
user's perspective with some C coding knowledge this sounds to me (in
principle) quite simple.
E.g. if the superblock UUID matches for all devices and one (or more)
devices have a lower generation number than the other(s), then the
disk(s) with the newest generation number should be considered good and
the other disks with a lower generation number should be marked as
failed.
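
(For what it's worth, you can already inspect those generation numbers
by hand today; the device paths here are just examples:)

    btrfs inspect-internal dump-super /dev/sda2 | grep -w generation
    btrfs inspect-internal dump-super /dev/sdb2 | grep -w generation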

> * We have _ZERO_ automatic recovery from this situation.  This makes 
> both of the above mentioned issues far more dangerous.

See above - would this not be as simple as auto-deleting disks from the
pool that have a matching UUID and a mismatching superblock generation
number? Not exactly a recovery, but the system should be able to limp
along.

> * It just plain does not work with most systemd setups, because systemd 
> will hang waiting on all the devices to appear due to the fact that they 
> refuse to acknowledge that the only way to correctly know if a BTRFS 
> volume will mount is to just try and mount it.

As far as I have understood, BTRFS refuses to mount without the degraded
flag even in redundant setups. Why?! This is just plain useless. If
anything, the degraded mount option should be replaced with something
like failif=X, where X could be anything from 'never', which should get
a 2-disk system with exclusively raid1 profiles up even if only one
device is working; to 'always', in case any device has failed; or even
'atrisk', when the loss of one more device would still keep any raid
chunk profile guarantee. (This admittedly gets complex in a multi-disk
raid1 setup, or when subvolumes can perhaps be mounted with different
"raid" profiles....)

> * Given that new kernels still don't properly generate half-raid1 chunks 
> when a device is missing in a two-device raid1 setup, there's a very 
> real possibility that users will have trouble recovering filesystems 
> with old recovery media (IOW, any recovery environment running a kernel 
> before 4.14 will not mount the volume correctly).
Sometimes you have to break a few eggs to make an omelette, right? If
people want to recover their data they should have backups, and if they
are really interested in recovering their data (and don't have backups)
then they will probably find this on the web by searching anyway...

> * You shouldn't be mounting writable and degraded for any reason other 
> than fixing the volume (or converting it to a single profile until you 
> can fix it), even aside from the other issues.

Well, in my opinion the degraded mount option is counter-intuitive.
Unless asked otherwise, the system should mount and work as long as it
can guarantee the data can be read and written somehow (regardless of
whether any redundancy guarantee is met). If the user is willing to
accept more or less risk, they should configure that!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-07 18:53       ` waxhead
@ 2019-02-07 19:39         ` Austin S. Hemmelgarn
  2019-02-07 21:21           ` Remi Gauvin
                             ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2019-02-07 19:39 UTC (permalink / raw)
  To: waxhead, Stefan K, linux-btrfs

On 2019-02-07 13:53, waxhead wrote:
> 
> 
> Austin S. Hemmelgarn wrote:
>> On 2019-02-07 06:04, Stefan K wrote:
>>> Thanks, with degraded  as kernel parameter and also ind the fstab it 
>>> works like expected
>>>
>>> That should be the normal behaviour, cause a server must be up and 
>>> running, and I don't care about a device loss, thats why I use a 
>>> RAID1. The device-loss problem can I fix later, but its important 
>>> that a server is up and running, i got informed at boot time and also 
>>> in the logs files that a device is missing, also I see that if you 
>>> use a monitoring program.
>> No, it shouldn't be the default, because:
>>
>> * Normal desktop users _never_ look at the log files or boot info, and 
>> rarely run monitoring programs, so they as a general rule won't notice 
>> until it's already too late.  BTRFS isn't just a server filesystem, so 
>> it needs to be safe for regular users too.
> 
> I am willing to argue that whatever you refer to as normal users don't 
> have a clue how to make a raid1 filesystem, nor do they care about what 
> underlying filesystem their computer runs. I can't quite see how a 
> limping system would be worse than a failing system in this case. 
> Besides "normal" desktop users use Windows anyway, people that run on 
> penguin powered stuff generally have at least some technical knowledge.
Once you get into stuff like Arch or Gentoo, yeah, people tend to have 
enough technical knowledge to handle this type of thing, but if you're 
talking about the big distros like Ubuntu or Fedora, not so much.  Yes, 
I might be a bit pessimistic here, but that pessimism is based on 
personal experience over many years of providing technical support for 
people.

Put differently, human nature is to ignore things that aren't 
immediately relevant.  Kernel logs don't matter until you see something 
wrong.  Boot messages don't matter unless you happen to see them while 
the system is booting (and most people don't).  Monitoring is the only 
way here, but most people won't invest the time in proper monitoring 
until they have problems.  Even as a seasoned sysadmin, I never look at 
kernel logs until I see a problem; I rarely see boot messages on most of 
the systems I manage (because I'm rarely sitting at the console when 
they boot up, and when I am, I'm usually handling startup of a dozen or 
so systems simultaneously after a network-wide outage), and I only 
monitor things that I know for certain need to be monitored.
> 
>> * It's easily possible to end up mounting degraded by accident if one 
>> of the constituent devices is slow to enumerate, and this can easily 
>> result in a split-brain scenario where all devices have diverged and 
>> the volume can only be repaired by recreating it from scratch.
> 
> Am I wrong or would not the remaining disk have the generation number 
> bumped on every commit? would it not make sense to ignore (previously) 
> stale disks and require a manual "re-add" of the failed disks. From a 
> users perspective with some C coding knowledge this sounds to me (in 
> principle) like something as quite simple.
> E.g. if the superblock UUID match for all devices and one (or more) 
> devices has a lower generation number than the other(s) then the disk(s) 
> with the newest generation number should be considered good and the 
> other disks with a lower generation number should be marked as failed.
The problem is that if you default to this behavior, you can have 
multiple disks diverge from the base.  Imagine, for example, a system 
with two devices in a raid1 setup with degraded mounts enabled by 
default, and either device randomly taking longer than normal to 
enumerate.  It's very possible for one device to be delayed during 
enumeration on one boot and for the other to be delayed on the next 
boot, and if this isn't handled _exactly_ right by the user, it will 
result in both devices having a higher generation number than they 
started with, but neither one being 'wrong'.  It's like trying to merge 
branches in git that both have different changes to a binary file; 
there's no sane way to handle it without user input.

Realistically, we can only safely recover from divergence correctly if 
we can prove that all devices are true prior states of the current 
highest generation, which is not currently possible to do reliably 
because of how BTRFS operates.

Also, LVM and MD have the exact same issue, it's just not as significant 
because they re-add and re-sync missing devices automatically when they 
reappear, which makes such split-brain scenarios much less likely.
> 
>> * We have _ZERO_ automatic recovery from this situation.  This makes 
>> both of the above mentioned issues far more dangerous.
> 
> See above, would this not be as simple as auto-deleting disks from the 
> pool that has a matching UUID and a mismatch for the superblock 
> generation number? Not exactly a recovery, but the system should be able 
> to limp along.
> 
>> * It just plain does not work with most systemd setups, because 
>> systemd will hang waiting on all the devices to appear due to the fact 
>> that they refuse to acknowledge that the only way to correctly know if 
>> a BTRFS volume will mount is to just try and mount it.
> 
> As far as I have understood this BTRFS refuses to mount even in 
> redundant setups without the degraded flag. Why?! This is just plain 
> useless. If anything the degraded mount option should be replaced with 
> something like failif=X where X would be anything from 'never' which 
> should get a 2 disk system up with exclusively raid1 profiles even if 
> only one device is working. 'always' in case any device is failed or 
> even 'atrisk' when loss of one more device would keep any raid chunk 
> profile guarantee. (this get admittedly complex in a multi disk raid1 
> setup or when subvolumes perhaps can be mounted with different "raid" 
> profiles....)
The issue with systemd is that if you pass 'degraded' on most systemd 
systems,  and devices are missing when the system tries to mount the 
volume, systemd won't mount it because it doesn't see all the devices. 
It doesn't even _try_ to mount it because it doesn't see all the 
devices.  Changing to degraded by default won't fix this, because it's a 
systemd problem.

The same issue also makes it a serious pain in the arse to recover 
degraded BTRFS volumes on systemd systems, because if the volume is 
supposed to mount normally on that system, systemd will unmount it if it 
doesn't see all the devices, regardless of how it got mounted in the 
first place.

IOW, there's a special case with systemd that makes even mounting BTRFS 
volumes that have missing devices degraded not work.
> 
>> * Given that new kernels still don't properly generate half-raid1 
>> chunks when a device is missing in a two-device raid1 setup, there's a 
>> very real possibility that users will have trouble recovering 
>> filesystems with old recovery media (IOW, any recovery environment 
>> running a kernel before 4.14 will not mount the volume correctly).
> Sometimes you have to break a few eggs to make an omelette right? If 
> people want to recover their data they should have backups, and if they 
> are really interested in recovering their data (and don't have backups) 
> then they will probably find this on the web by searching anyway...
Backups aren't the type of recovery I'm talking about.  I'm talking 
about people booting to things like SystemRescueCD to fix system 
configuration or do offline maintenance without having to nuke the 
system and restore from backups.  Such recovery environments often don't 
get updated for a _long_ time, and such usage is not atypical as a first 
step in trying to fix a broken system in situations where downtime 
really is a serious issue.
> 
>> * You shouldn't be mounting writable and degraded for any reason other 
>> than fixing the volume (or converting it to a single profile until you 
>> can fix it), even aside from the other issues.
> 
> Well in my opinion the degraded mount option is counter intuitive. 
> Unless otherwise asked for the system should mount and work as long as 
> it can guarantee the data can be read and written somehow (regardless if 
> any redundancy guarantee is not met). If the user is willing to accept 
> more or less risk they should configure it!
Again, BTRFS mounting degraded is significantly riskier than LVM or MD 
doing the same thing.  Most users don't properly research things (When's 
the last time you did a complete cost/benefit analysis before deciding 
to use a particular piece of software on a system?), and would not know 
they were taking on significantly higher risk by using BTRFS without 
configuring it to behave safely until it actually caused them problems, 
at which point most people would then complain about the resulting data 
loss instead of trying to figure out why it happened and prevent it in 
the first place.  I don't know about you, but I for one would rather 
BTRFS have a reputation for being over-aggressively safe by default than 
risk users' data by default.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-07 19:39         ` Austin S. Hemmelgarn
@ 2019-02-07 21:21           ` Remi Gauvin
  2019-02-08  4:51           ` Andrei Borzenkov
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Remi Gauvin @ 2019-02-07 21:21 UTC (permalink / raw)
  To: linux-btrfs


On 2019-02-07 2:39 p.m., Austin S. Hemmelgarn wrote:


> Again, BTRFS mounting degraded is significantly riskier than LVM or MD
> doing the same thing.  Most users don't properly research things (When's
> the last time you did a complete cost/benefit analysis before deciding
> to use a particular piece of software on a system?), and would not know
> they were taking on significantly higher risk by using BTRFS without
> configuring it to behave safely until it actually caused them problems,
> at which point most people would then complain about the resulting data
> loss instead of trying to figure out why it happened and prevent it in
> the first place.  I don't know about you, but I for one would rather
> BTRFS have a reputation for being over-aggressively safe by default than
> risking users data by default.


Another important consideration is that BTRFS has practically zero
tolerance for corruption in the metadata.  Most other FSes can, at least
on the surface, either continue working despite bits of scrambled data,
or have repair utilities that are pretty good at figuring out what the
scrambled data should be, making a best-guess effort that more or less
works (leaving aside for now statistics on how often that might cause
undetected data corruption, which possibly propagates to backups etc.).

BTRFS is almost entirely reliant on the duplicate copy of metadata,
which is missing when running degraded.  That makes it much more likely
for a simple error to break the FS entirely.

BTRFS's default configuration prioritizes data integrity over uptime,
and I think a very good argument can be made that this is what the
default *should* be for any FS.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-07 17:37       ` Martin Steigerwald
@ 2019-02-07 22:19         ` Chris Murphy
  2019-02-07 23:02           ` Remi Gauvin
  2019-02-08  7:33           ` Stefan K
  0 siblings, 2 replies; 32+ messages in thread
From: Chris Murphy @ 2019-02-07 22:19 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Chris Murphy, Stefan K, Btrfs BTRFS

On Thu, Feb 7, 2019 at 10:37 AM Martin Steigerwald <martin@lichtvoll.de> wrote:
>
> Chris Murphy - 07.02.19, 18:15:
> > > So please change the normal behavior
> >
> > In the case of no device loss, but device delay, with 'degraded' set
> > in fstab you risk a non-deterministic degraded mount. And there is no
> > automatic balance (sync) after recovering from a degraded mount. And
> > as far as I know there's no automatic transition from degraded to
> > normal operation upon later discovery of a previously missing device.
> > It's just begging for data loss. That's why it's not the default.
> > That's why it's not recommended.
>
> Still the current behavior is not really user-friendly. And does not
> meet expectations that users usually have about how RAID 1 works. I know
> BTRFS RAID 1 is no RAID 1, although it is called like this.

I mentioned in both my Feb 2 and Feb 5 responses that the user
experience is not good, compared to mdadm and lvm raid1 in the same
situation.

However, the raid1 term only describes replication. It doesn't describe
any policy. Whether to fail to mount or to mount degraded by default is
a policy. Whether and how to transition from degraded to normal
operation when a formerly missing device reappears is a policy. And
whether, how, and when to rebuild data after resuming normal operation
is a policy. A big part of why these policies are MIA is that they
require features that just don't exist yet, and that perhaps don't even
belong in btrfs kernel code or user space tools, but rather in a system
service or daemon that manages such policies. However, none of that
means Btrfs raid1 is not raid1. There's a wrong assumption being made
that the policies and features in mdadm and LVM are somehow attached to
the definition of raid1, but they aren't.


> I also somewhat get that with the current state of BTRFS the current
> behavior of not allowing a degraded mount may be better… however… I see
> clearly room for improvement here. And there very likely will be
> discussions like this on this list… until BTRFS acts in a more user
> friendly way here.

And it's completely appropriate if someone wants to update the Btrfs
status page to make clearer what features/behaviors/policies apply to
Btrfs raid of all types, or to have a page that summarizes the
differences compared to mdadm and/or LVM raid levels, so users can
better assess the risk they're taking and choose the best Linux storage
technology for their use case.

But at least developers know this is the case.

And actually, you could mitigate a decent amount of Btrfs's missing
features with server monitoring tools, including parsing kernel
messages. Because right now you aren't even informed of read or write
errors, device or csum mismatches, or fixups, unless you're checking
kernel messages. Whereas mdadm has the option of emailing notifications
to an admin for such things, and lvm has a monitor that I guess does
something - I haven't used it. Btrfs will literally only complain about
failed writes that would cause immediate ejection of the device by md.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-07 22:19         ` Chris Murphy
@ 2019-02-07 23:02           ` Remi Gauvin
  2019-02-08  7:33           ` Stefan K
  1 sibling, 0 replies; 32+ messages in thread
From: Remi Gauvin @ 2019-02-07 23:02 UTC (permalink / raw)
  To: Btrfs BTRFS


On 2019-02-07 5:19 p.m., Chris Murphy wrote:

> And actually, you could mitigate some decent amount of Btrfs missing
> features with server monitoring tools; including parsing kernel
> messages. Because right now you aren't even informed of read or write
> errors, device or csums mismatches or fixups, unless you're checking
> kernel messages. Where mdadm has the option for emailing notifications
> to an admin for such things, and lvm has a monitor that I guess does
> something I haven't used it. Literally Btrfs will only complain about
> failed writes that would cause immediate ejection of the device by md.


You can, and probably should, have an hourly cron job that does
something like:

    btrfs dev stats -c / || <command to sound the sysadmin alarm>
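
Fleshed out a bit, for example (the mail command at the end is just a
placeholder for whatever alerting mechanism you use):

    #!/bin/sh
    # e.g. /etc/cron.hourly/btrfs-stats: alert if any btrfs error counter is non-zero
    btrfs dev stats -c / >/dev/null || \
        echo "btrfs device error counters are non-zero on /" | mail -s "btrfs alert" root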

The only difference here is that this is not, at this time, already
baked into distros by default.  I think I saw mention of a project
recently to build a package that automates common btrfs maintenance
tasks?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-07 19:39         ` Austin S. Hemmelgarn
  2019-02-07 21:21           ` Remi Gauvin
@ 2019-02-08  4:51           ` Andrei Borzenkov
  2019-02-08 12:54             ` Austin S. Hemmelgarn
  2019-02-08  7:15           ` Stefan K
  2019-02-08 18:10           ` waxhead
  3 siblings, 1 reply; 32+ messages in thread
From: Andrei Borzenkov @ 2019-02-08  4:51 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, waxhead, Stefan K, linux-btrfs

07.02.2019 22:39, Austin S. Hemmelgarn wrote:
> The issue with systemd is that if you pass 'degraded' on most systemd
> systems,  and devices are missing when the system tries to mount the
> volume, systemd won't mount it because it doesn't see all the devices.
> It doesn't even _try_ to mount it because it doesn't see all the
> devices.  Changing to degraded by default won't fix this, because it's a
> systemd problem.
> 

Oh no, not again. This has been discussed millions of times already -
systemd is using the information that btrfs provides.

> The same issue also makes it a serious pain in the arse to recover
> degraded BTRFS volumes on systemd systems, because if the volume is
> supposed to mount normally on that system, systemd will unmount it if it
> doesn't see all the devices, regardless of how it got mounted in the
> first place.
> 

*That* would indeed be a systemd issue. If someone can reliably
reproduce it, a systemd bug report would certainly be in order.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-07 19:39         ` Austin S. Hemmelgarn
  2019-02-07 21:21           ` Remi Gauvin
  2019-02-08  4:51           ` Andrei Borzenkov
@ 2019-02-08  7:15           ` Stefan K
  2019-02-08 12:58             ` Austin S. Hemmelgarn
  2019-02-08 16:56             ` Chris Murphy
  2019-02-08 18:10           ` waxhead
  3 siblings, 2 replies; 32+ messages in thread
From: Stefan K @ 2019-02-08  7:15 UTC (permalink / raw)
  To: linux-btrfs

> * Normal desktop users _never_ look at the log files or boot info, and 
> rarely run monitoring programs, so they as a general rule won't notice 
> until it's already too late.  BTRFS isn't just a server filesystem, so 
> it needs to be safe for regular users too.
I guess a normal desktop user wouldn't create a RAID1 or any other RAID setup, right? So an admin takes care of a RAID and monitors it (it doesn't matter whether it's a hardware RAID, mdraid, ZFS raid or whatever).
And degraded only applies to RAID setups; it's not relevant for single-disk usage, right?

> Also, LVM and MD have the exact same issue, it's just not as significant 
> because they re-add and re-sync missing devices automatically when they 
> reappear, which makes such split-brain scenarios much less likely.
Why doesn't btrfs do that?


On Thursday, February 7, 2019 2:39:34 PM CET Austin S. Hemmelgarn wrote:
> On 2019-02-07 13:53, waxhead wrote:
> > 
> > 
> > Austin S. Hemmelgarn wrote:
> >> On 2019-02-07 06:04, Stefan K wrote:
> >>> Thanks, with degraded  as kernel parameter and also ind the fstab it 
> >>> works like expected
> >>>
> >>> That should be the normal behaviour, cause a server must be up and 
> >>> running, and I don't care about a device loss, thats why I use a 
> >>> RAID1. The device-loss problem can I fix later, but its important 
> >>> that a server is up and running, i got informed at boot time and also 
> >>> in the logs files that a device is missing, also I see that if you 
> >>> use a monitoring program.
> >> No, it shouldn't be the default, because:
> >>
> >> * Normal desktop users _never_ look at the log files or boot info, and 
> >> rarely run monitoring programs, so they as a general rule won't notice 
> >> until it's already too late.  BTRFS isn't just a server filesystem, so 
> >> it needs to be safe for regular users too.
> > 
> > I am willing to argue that whatever you refer to as normal users don't 
> > have a clue how to make a raid1 filesystem, nor do they care about what 
> > underlying filesystem their computer runs. I can't quite see how a 
> > limping system would be worse than a failing system in this case. 
> > Besides "normal" desktop users use Windows anyway, people that run on 
> > penguin powered stuff generally have at least some technical knowledge.
> Once you get into stuff like Arch or Gentoo, yeah, people tend to have 
> enough technical knowledge to handle this type of thing, but if you're 
> talking about the big distros like Ubuntu or Fedora, not so much.  Yes, 
> I might be a bit pessimistic here, but that pessimism is based on 
> personal experience over many years of providing technical support for 
> people.
> 
> Put differently, human nature is to ignore things that aren't 
> immediately relevant.  Kernel logs don't matter until you see something 
> wrong.  Boot messages don't matter unless you happen to see them while 
> the system is booting (and most people don't).  Monitoring is the only 
> way here, but most people won't invest the time in proper monitoring 
> until they have problems.  Even as a seasoned sysadmin, I never look at 
> kernel logs until I see any problem, I rarely see boot messages on most 
> of the systems I manage (because I'm rarely sitting at the console when 
> they boot up, and when I am I'm usually handling startup of a dozen or 
> so systems simultaneously after a network-wide outage), and I only 
> monitor things that I know for certain need to be monitored.
> > 
> >> * It's easily possible to end up mounting degraded by accident if one 
> >> of the constituent devices is slow to enumerate, and this can easily 
> >> result in a split-brain scenario where all devices have diverged and 
> >> the volume can only be repaired by recreating it from scratch.
> > 
> > Am I wrong or would not the remaining disk have the generation number 
> > bumped on every commit? would it not make sense to ignore (previously) 
> > stale disks and require a manual "re-add" of the failed disks. From a 
> > users perspective with some C coding knowledge this sounds to me (in 
> > principle) like something as quite simple.
> > E.g. if the superblock UUID match for all devices and one (or more) 
> > devices has a lower generation number than the other(s) then the disk(s) 
> > with the newest generation number should be considered good and the 
> > other disks with a lower generation number should be marked as failed.
> The problem is that if you're defaulting to this behavior, you can have 
> multiple disks diverge from the base.  Imagine, for example, a system 
> with two devices in a raid1 setup with degraded mounts enabled by 
> default, and either device randomly taking longer than normal to 
> enumerate.  It's very possible for one boot to have one device delay 
> during enumeration on one boot, then the other on the next boot, and if 
> not handled _exactly_ right by the user, this will result in both 
> devices having a higher generation number than they started with, but 
> neither one being 'wrong'.  It's like trying to merge branches in git 
> that both have different changes to a binary file, there's no sane way 
> to handle it without user input.
> 
> Realistically, we can only safely recover from divergence correctly if 
> we can prove that all devices are true prior states of the current 
> highest generation, which is not currently possible to do reliably 
> because of how BTRFS operates.
> 
> Also, LVM and MD have the exact same issue, it's just not as significant 
> because they re-add and re-sync missing devices automatically when they 
> reappear, which makes such split-brain scenarios much less likely.
> > 
> >> * We have _ZERO_ automatic recovery from this situation.  This makes 
> >> both of the above mentioned issues far more dangerous.
> > 
> > See above, would this not be as simple as auto-deleting disks from the 
> > pool that has a matching UUID and a mismatch for the superblock 
> > generation number? Not exactly a recovery, but the system should be able 
> > to limp along.
> > 
> >> * It just plain does not work with most systemd setups, because 
> >> systemd will hang waiting on all the devices to appear due to the fact 
> >> that they refuse to acknowledge that the only way to correctly know if 
> >> a BTRFS volume will mount is to just try and mount it.
> > 
> > As far as I have understood this BTRFS refuses to mount even in 
> > redundant setups without the degraded flag. Why?! This is just plain 
> > useless. If anything the degraded mount option should be replaced with 
> > something like failif=X where X would be anything from 'never' which 
> > should get a 2 disk system up with exclusively raid1 profiles even if 
> > only one device is working. 'always' in case any device is failed or 
> > even 'atrisk' when loss of one more device would keep any raid chunk 
> > profile guarantee. (this get admittedly complex in a multi disk raid1 
> > setup or when subvolumes perhaps can be mounted with different "raid" 
> > profiles....)
> The issue with systemd is that if you pass 'degraded' on most systemd 
> systems,  and devices are missing when the system tries to mount the 
> volume, systemd won't mount it because it doesn't see all the devices. 
> It doesn't even _try_ to mount it because it doesn't see all the 
> devices.  Changing to degraded by default won't fix this, because it's a 
> systemd problem.
> 
> The same issue also makes it a serious pain in the arse to recover 
> degraded BTRFS volumes on systemd systems, because if the volume is 
> supposed to mount normally on that system, systemd will unmount it if it 
> doesn't see all the devices, regardless of how it got mounted in the 
> first place.
> 
> IOW, there's a special case with systemd that makes even mounting BTRFS 
> volumes that have missing devices degraded not work.
> > 
> >> * Given that new kernels still don't properly generate half-raid1 
> >> chunks when a device is missing in a two-device raid1 setup, there's a 
> >> very real possibility that users will have trouble recovering 
> >> filesystems with old recovery media (IOW, any recovery environment 
> >> running a kernel before 4.14 will not mount the volume correctly).
> > Sometimes you have to break a few eggs to make an omelette right? If 
> > people want to recover their data they should have backups, and if they 
> > are really interested in recovering their data (and don't have backups) 
> > then they will probably find this on the web by searching anyway...
> Backups aren't the type of recovery I'm talking about.  I'm talking 
> about people booting to things like SystemRescueCD to fix system 
> configuration or do offline maintenance without having to nuke the 
> system and restore from backups.  Such recovery environments often don't 
> get updated for a _long_ time, and such usage is not atypical as a first 
> step in trying to fix a broken system in situations where downtime 
> really is a serious issue.
> > 
> >> * You shouldn't be mounting writable and degraded for any reason other 
> >> than fixing the volume (or converting it to a single profile until you 
> >> can fix it), even aside from the other issues.
> > 
> > Well in my opinion the degraded mount option is counter intuitive. 
> > Unless otherwise asked for the system should mount and work as long as 
> > it can guarantee the data can be read and written somehow (regardless if 
> > any redundancy guarantee is not met). If the user is willing to accept 
> > more or less risk they should configure it!
> Again, BTRFS mounting degraded is significantly riskier than LVM or MD 
> doing the same thing.  Most users don't properly research things (When's 
> the last time you did a complete cost/benefit analysis before deciding 
> to use a particular piece of software on a system?), and would not know 
> they were taking on significantly higher risk by using BTRFS without 
> configuring it to behave safely until it actually caused them problems, 
> at which point most people would then complain about the resulting data 
> loss instead of trying to figure out why it happened and prevent it in 
> the first place.  I don't know about you, but I for one would rather 
> BTRFS have a reputation for being over-aggressively safe by default than 
> risking users data by default.
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-07 22:19         ` Chris Murphy
  2019-02-07 23:02           ` Remi Gauvin
@ 2019-02-08  7:33           ` Stefan K
  2019-02-08 17:26             ` Chris Murphy
  1 sibling, 1 reply; 32+ messages in thread
From: Stefan K @ 2019-02-08  7:33 UTC (permalink / raw)
  To: linux-btrfs

> However the raid1 term only describes replication. It doesn't describe
> any policy.
yep you're right, but most sysadmins expect some 'policies'.

If I use RAID1 I expect that if one drive fails, I can still boot _without_ boot issues, just some warnings etc., because I use RAID1 to have simple one-device fault tolerance in case one fails (which can happen). I can check/monitor the BTRFS RAID status with 'btrfs fi sh' (or 'btrfs dev stat'). I also expect that if a device comes back it will sync automatically, and that if I replace a device it will automatically rebalance the raid1 (which btrfs does, so far). I think a lot of sysadmins feel the same way.
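
For reference, the kind of check I mean looks roughly like this (the path
'/' is just an example, and the exact output depends on the btrfs-progs
version):

    # list the devices of the filesystem; a failed one shows up as missing
    btrfs filesystem show /
    # per-device counters for write/read/flush/corruption/generation errors
    btrfs device stats /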


On Thursday, February 7, 2019 3:19:01 PM CET Chris Murphy wrote:
> On Thu, Feb 7, 2019 at 10:37 AM Martin Steigerwald <martin@lichtvoll.de> wrote:
> >
> > Chris Murphy - 07.02.19, 18:15:
> > > > So please change the normal behavior
> > >
> > > In the case of no device loss, but device delay, with 'degraded' set
> > > in fstab you risk a non-deterministic degraded mount. And there is no
> > > automatic balance (sync) after recovering from a degraded mount. And
> > > as far as I know there's no automatic transition from degraded to
> > > normal operation upon later discovery of a previously missing device.
> > > It's just begging for data loss. That's why it's not the default.
> > > That's why it's not recommended.
> >
> > Still the current behavior is not really user-friendly. And does not
> > meet expectations that users usually have about how RAID 1 works. I know
> > BTRFS RAID 1 is not really RAID 1, although it is called that.
> 
> I mentioned the user experience is not good, in both my Feb 2 and Feb
> 5 responses, compared to mdadm and lvm raid1 in the same situation.
> 
> However the raid1 term only describes replication. It doesn't describe
> any policy. And whether to fail to mount or mount degraded by default,
> is a policy. Whether and how to transition from degraded to normal
> operation when a formerly missing device reappears, is a policy. And
> whether, and how, and when to rebuild data after resuming normal
> operation is a policy. A big part of why these policies are MIA is
> because they require features that just don't exist yet. And perhaps
> don't even belong in btrfs kernel code or user space tools; but rather
> a system service or daemon that manages such policies. However, none
> of that means Btrfs raid1 is not raid1. There's a wrong assumption
> being made about policies and features in mdadm and LVM, that they are
> somehow attached to the definition of raid1, but they aren't.
> 
> 
> > I also somewhat get that with the current state of BTRFS the current
> > behavior of not allowing a degraded mount may be better… however… I see
> > clearly room for improvement here. And there very likely will be
> > discussions like this on this list… until BTRFS acts in a more user
> > friendly way here.
> 
> And it's completely appropriate if someone wants to  update the Btrfs
> status page to make more clear what features/behaviors/policies apply
> to Btrfs raid of all types, or to have a page that summarizes their
> differences among mdadm and/or LVM raid levels, so users can better
> assess their risk taking, and choose the best Linux storage technology
> for their use case.
> 
> But at least developers know this is the case.
> 
> And actually, you could mitigate some decent amount of Btrfs missing
> features with server monitoring tools; including parsing kernel
> messages. Because right now you aren't even informed of read or write
> errors, device or csum mismatches, or fixups, unless you're checking
> kernel messages. mdadm, by contrast, has the option of emailing
> notifications to an admin for such things, and lvm has a monitor that
> I guess does something similar, but I haven't used it. Btrfs will
> literally only complain about failed writes that would cause immediate
> ejection of the device by md.
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-08  4:51           ` Andrei Borzenkov
@ 2019-02-08 12:54             ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2019-02-08 12:54 UTC (permalink / raw)
  To: Andrei Borzenkov, linux-btrfs; +Cc: waxhead, Stefan K

On 2019-02-07 23:51, Andrei Borzenkov wrote:
> 07.02.2019 22:39, Austin S. Hemmelgarn пишет:
>> The issue with systemd is that if you pass 'degraded' on most systemd
>> systems,  and devices are missing when the system tries to mount the
>> volume, systemd won't mount it because it doesn't see all the devices.
>> It doesn't even _try_ to mount it because it doesn't see all the
>> devices.  Changing to degraded by default won't fix this, because it's a
>> systemd problem.
>>
> 
> Oh no, not again. It was discussed millions of times already - systemd
> is using information that btrfs provides.
And we've already told the systemd developers to quit using the ioctl 
they're using because it causes this issue and also introduces a TOCTOU 
race condition that can be avoided by just trying to mount the volume 
with the provided options.
> 
>> The same issue also makes it a serious pain in the arse to recover
>> degraded BTRFS volumes on systemd systems, because if the volume is
>> supposed to mount normally on that system, systemd will unmount it if it
>> doesn't see all the devices, regardless of how it got mounted in the
>> first place.
>>
> 
> *That* would be systemd issue indeed. If someone can reliably reproduce
> it, systemd bug report would certainly be in order.
> 
It's been a few months since I dealt with it last (I don't use systemd 
on my everyday systems, because of this and a bunch of other issues I 
have with it (mostly design complaints, not bugs FWIW)), but the general 
process is as follows:

1. Configure a multi-device BTRFS volume such that removal of one device 
will cause the DEVICE_READY ioctl to return false.
2. Set it up in `/etc/fstab` or as a mount unit such that it will 
normally get mounted at boot, but won't prevent the system from booting 
if it fails.
3. Reboot with one of the devices missing.
4. Attempt to manually mount the volume using the regular `mount` 
command with the `degraded` option.
5. Check the mount table, there should be no entry for the volume you 
just mounted in it.

After dealing with this the first time (multiple years ago now), I took 
the time to trace system calls, and found that systemd was unmounting 
the volume immediately after I mounted it.
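
In concrete terms, steps 2, 4 and 5 look something like this (the UUID,
device name and mount point are placeholders):

    # step 2: /etc/fstab entry that mounts at boot but doesn't block booting
    UUID=<fs-uuid>  /data  btrfs  defaults,nofail  0  0

    # step 4: after rebooting with one device missing, mount it by hand
    mount -o degraded /dev/sdb1 /data

    # step 5: check whether the mount actually stuck around
    findmnt /data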

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-08  7:15           ` Stefan K
@ 2019-02-08 12:58             ` Austin S. Hemmelgarn
  2019-02-08 16:56             ` Chris Murphy
  1 sibling, 0 replies; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2019-02-08 12:58 UTC (permalink / raw)
  To: linux-btrfs

On 2019-02-08 02:15, Stefan K wrote:
>> * Normal desktop users _never_ look at the log files or boot info, and
>> rarely run monitoring programs, so they as a general rule won't notice
>> until it's already too late.  BTRFS isn't just a server filesystem, so
>> it needs to be safe for regular users too.
> I guess a normal desktop user wouldn't create a RAID1 or other RAID things, right?
You would think that would be the case, but in my experience it
generally isn't.  Such desktop users also tend to be the worst offenders
in the 'RAID is my backup' camp.

> So an admin takes care of a RAID and monitors it (it doesn't matter if it is a hardware RAID, mdraid, ZFS raid or whatever)
> and degraded works only with RAID things; it's not relevant for single-disk usage, right?
Correct, but because it's never relevant for single-disk usage, you 
don't have to worry about any of this.
> 
>> Also, LVM and MD have the exact same issue, it's just not as significant
>> because they re-add and re-sync missing devices automatically when they
>> reappear, which makes such split-brain scenarios much less likely.
> why doesn't btrfs do that?
Because we currently don't have any code that does it.  Part of the 
problem is that we're a lot more tolerant of intermittent I/O errors 
than LVM and MD are, so we can't reliably tell if a device is truly gone 
or not.
> 
> 
> On Thursday, February 7, 2019 2:39:34 PM CET Austin S. Hemmelgarn wrote:
>> On 2019-02-07 13:53, waxhead wrote:
>>>
>>>
>>> Austin S. Hemmelgarn wrote:
>>>> On 2019-02-07 06:04, Stefan K wrote:
>>>>> Thanks, with degraded as a kernel parameter and also in the fstab it
>>>>> works as expected.
>>>>>
>>>>> That should be the normal behaviour, because a server must be up and
>>>>> running, and I don't care about a device loss; that's why I use a
>>>>> RAID1. I can fix the device-loss problem later, but it's important
>>>>> that a server is up and running. I get informed at boot time and also
>>>>> in the log files that a device is missing, and I also see it if I
>>>>> use a monitoring program.
>>>> No, it shouldn't be the default, because:
>>>>
>>>> * Normal desktop users _never_ look at the log files or boot info, and
>>>> rarely run monitoring programs, so they as a general rule won't notice
>>>> until it's already too late.  BTRFS isn't just a server filesystem, so
>>>> it needs to be safe for regular users too.
>>>
>>> I am willing to argue that whatever you refer to as normal users don't
>>> have a clue how to make a raid1 filesystem, nor do they care about what
>>> underlying filesystem their computer runs. I can't quite see how a
>>> limping system would be worse than a failing system in this case.
>>> Besides "normal" desktop users use Windows anyway, people that run on
>>> penguin powered stuff generally have at least some technical knowledge.
>> Once you get into stuff like Arch or Gentoo, yeah, people tend to have
>> enough technical knowledge to handle this type of thing, but if you're
>> talking about the big distros like Ubuntu or Fedora, not so much.  Yes,
>> I might be a bit pessimistic here, but that pessimism is based on
>> personal experience over many years of providing technical support for
>> people.
>>
>> Put differently, human nature is to ignore things that aren't
>> immediately relevant.  Kernel logs don't matter until you see something
>> wrong.  Boot messages don't matter unless you happen to see them while
>> the system is booting (and most people don't).  Monitoring is the only
>> way here, but most people won't invest the time in proper monitoring
>> until they have problems.  Even as a seasoned sysadmin, I never look at
>> kernel logs until I see any problem, I rarely see boot messages on most
>> of the systems I manage (because I'm rarely sitting at the console when
>> they boot up, and when I am I'm usually handling startup of a dozen or
>> so systems simultaneously after a network-wide outage), and I only
>> monitor things that I know for certain need to be monitored.
>>>
>>>> * It's easily possible to end up mounting degraded by accident if one
>>>> of the constituent devices is slow to enumerate, and this can easily
>>>> result in a split-brain scenario where all devices have diverged and
>>>> the volume can only be repaired by recreating it from scratch.
>>>
>>> Am I wrong or would not the remaining disk have the generation number
>>> bumped on every commit? would it not make sense to ignore (previously)
>>> stale disks and require a manual "re-add" of the failed disks. From a
>>> users perspective with some C coding knowledge this sounds to me (in
>>> principle) like something as quite simple.
>>> E.g. if the superblock UUID match for all devices and one (or more)
>>> devices has a lower generation number than the other(s) then the disk(s)
>>> with the newest generation number should be considered good and the
>>> other disks with a lower generation number should be marked as failed.
>> The problem is that if you're defaulting to this behavior, you can have
>> multiple disks diverge from the base.  Imagine, for example, a system
>> with two devices in a raid1 setup with degraded mounts enabled by
>> default, and either device randomly taking longer than normal to
>> enumerate.  It's very possible for one boot to have one device delay
>> during enumeration on one boot, then the other on the next boot, and if
>> not handled _exactly_ right by the user, this will result in both
>> devices having a higher generation number than they started with, but
>> neither one being 'wrong'.  It's like trying to merge branches in git
>> that both have different changes to a binary file, there's no sane way
>> to handle it without user input.
>>
>> Realistically, we can only safely recover from divergence correctly if
>> we can prove that all devices are true prior states of the current
>> highest generation, which is not currently possible to do reliably
>> because of how BTRFS operates.
>>
>> Also, LVM and MD have the exact same issue, it's just not as significant
>> because they re-add and re-sync missing devices automatically when they
>> reappear, which makes such split-brain scenarios much less likely.
>>>
>>>> * We have _ZERO_ automatic recovery from this situation.  This makes
>>>> both of the above mentioned issues far more dangerous.
>>>
>>> See above, would this not be as simple as auto-deleting disks from the
>>> pool that has a matching UUID and a mismatch for the superblock
>>> generation number? Not exactly a recovery, but the system should be able
>>> to limp along.
>>>
>>>> * It just plain does not work with most systemd setups, because
>>>> systemd will hang waiting on all the devices to appear due to the fact
>>>> that they refuse to acknowledge that the only way to correctly know if
>>>> a BTRFS volume will mount is to just try and mount it.
>>>
>>> As far as I have understood this BTRFS refuses to mount even in
>>> redundant setups without the degraded flag. Why?! This is just plain
>>> useless. If anything the degraded mount option should be replaced with
>>> something like failif=X where X would be anything from 'never' which
>>> should get a 2 disk system up with exclusively raid1 profiles even if
>>> only one device is working. 'always' in case any device is failed or
>>> even 'atrisk' when loss of one more device would keep any raid chunk
>>> profile guarantee. (this get admittedly complex in a multi disk raid1
>>> setup or when subvolumes perhaps can be mounted with different "raid"
>>> profiles....)
>> The issue with systemd is that if you pass 'degraded' on most systemd
>> systems,  and devices are missing when the system tries to mount the
>> volume, systemd won't mount it because it doesn't see all the devices.
>> It doesn't even _try_ to mount it because it doesn't see all the
>> devices.  Changing to degraded by default won't fix this, because it's a
>> systemd problem.
>>
>> The same issue also makes it a serious pain in the arse to recover
>> degraded BTRFS volumes on systemd systems, because if the volume is
>> supposed to mount normally on that system, systemd will unmount it if it
>> doesn't see all the devices, regardless of how it got mounted in the
>> first place.
>>
>> IOW, there's a special case with systemd that makes even mounting BTRFS
>> volumes that have missing devices degraded not work.
>>>
>>>> * Given that new kernels still don't properly generate half-raid1
>>>> chunks when a device is missing in a two-device raid1 setup, there's a
>>>> very real possibility that users will have trouble recovering
>>>> filesystems with old recovery media (IOW, any recovery environment
>>>> running a kernel before 4.14 will not mount the volume correctly).
>>> Sometimes you have to break a few eggs to make an omelette right? If
>>> people want to recover their data they should have backups, and if they
>>> are really interested in recovering their data (and don't have backups)
>>> then they will probably find this on the web by searching anyway...
>> Backups aren't the type of recovery I'm talking about.  I'm talking
>> about people booting to things like SystemRescueCD to fix system
>> configuration or do offline maintenance without having to nuke the
>> system and restore from backups.  Such recovery environments often don't
>> get updated for a _long_ time, and such usage is not atypical as a first
>> step in trying to fix a broken system in situations where downtime
>> really is a serious issue.
>>>
>>>> * You shouldn't be mounting writable and degraded for any reason other
>>>> than fixing the volume (or converting it to a single profile until you
>>>> can fix it), even aside from the other issues.
>>>
>>> Well in my opinion the degraded mount option is counter intuitive.
>>> Unless otherwise asked for the system should mount and work as long as
>>> it can guarantee the data can be read and written somehow (regardless if
>>> any redundancy guarantee is not met). If the user is willing to accept
>>> more or less risk they should configure it!
>> Again, BTRFS mounting degraded is significantly riskier than LVM or MD
>> doing the same thing.  Most users don't properly research things (When's
>> the last time you did a complete cost/benefit analysis before deciding
>> to use a particular piece of software on a system?), and would not know
>> they were taking on significantly higher risk by using BTRFS without
>> configuring it to behave safely until it actually caused them problems,
>> at which point most people would then complain about the resulting data
>> loss instead of trying to figure out why it happened and prevent it in
>> the first place.  I don't know about you, but I for one would rather
>> BTRFS have a reputation for being over-aggressively safe by default than
>> risking users data by default.
>>
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-08  7:15           ` Stefan K
  2019-02-08 12:58             ` Austin S. Hemmelgarn
@ 2019-02-08 16:56             ` Chris Murphy
  1 sibling, 0 replies; 32+ messages in thread
From: Chris Murphy @ 2019-02-08 16:56 UTC (permalink / raw)
  To: Btrfs BTRFS

On Fri, Feb 8, 2019 at 12:15 AM Stefan K <shadow_7@gmx.net> wrote:
>
> > * Normal desktop users _never_ look at the log files or boot info, and
> > rarely run monitoring programs, so they as a general rule won't notice
> > until it's already too late.  BTRFS isn't just a server filesystem, so
> > it needs to be safe for regular users too.
> I guess a normal desktop user wouldn't create a RAID1 or other RAID things, right? So an admin takes care of a RAID and monitors it (it doesn't matter if it is a hardware RAID, mdraid, ZFS raid or whatever)
> and degraded works only with RAID things; it's not relevant for single-disk usage, right?

The point is that persistently setting the degraded mount option as a
boot param has a chance of causing a degraded mount even when your array
is actually healthy, e.g. when a device is merely slow to show up.
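
To make that concrete, 'persistently setting it' means something like the
following (a sketch; the exact files depend on your distro and bootloader,
and the UUID is a placeholder):

    # /etc/default/grub -- puts the option on the kernel command line every boot
    GRUB_CMDLINE_LINUX="rootflags=degraded"

    # or as a mount option in /etc/fstab
    UUID=<fs-uuid>  /  btrfs  defaults,degraded  0  0

With either of these in place, a device that merely enumerates slowly can
be enough to trigger a degraded mount of an otherwise healthy array.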

Also, there is no such thing as transitioning from a normal mount to a
degraded mount. If a device fails, the array is not strictly degraded;
it's still a normal mount, just with a huge number of kernel errors being
generated by Btrfs due to the bad/missing device. I'm pretty sure there
are unmerged patches that add something like the concept of an md faulty
device, but I'm not sure what the logic is, and my understanding is
they're not well enough tested yet (?) to get merged.

If your system log is directed to write to this same volume, that
causes even more errors due to additional failing writes, which then
have to be logged. So now you're depending on kernel printk rate
limiting being set well below the water line to make sure Btrfs errors
don't cause so much disk contention that the system gets stuck (not
difficult if sysroot is a hard drive).
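
For reference, the rate limiting in question is the ordinary kernel printk
ratelimit, i.e. roughly:

    # seconds enforced between messages, and the allowed burst size
    sysctl kernel.printk_ratelimit kernel.printk_ratelimit_burst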

>
> > Also, LVM and MD have the exact same issue, it's just not as significant
> > because they re-add and re-sync missing devices automatically when they
> > reappear, which makes such split-brain scenarios much less likely.
> why doesn't btrfs do that?

It's a fair question, but the simplest answer is that features don't grow
on trees; they're written by developers, and no one has yet done that
work.


--
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-08  7:33           ` Stefan K
@ 2019-02-08 17:26             ` Chris Murphy
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Murphy @ 2019-02-08 17:26 UTC (permalink / raw)
  To: Btrfs BTRFS

On Fri, Feb 8, 2019 at 12:33 AM Stefan K <shadow_7@gmx.net> wrote:
>
> > However the raid1 term only describes replication. It doesn't describe
> > any policy.
> yep you're right, but most sysadmins expect some 'policies'.

A sysadmin expecting policies is fine, but assuming they exist makes
them a questionable sysadmin.

>> If I use RAID1 I expect that if one drive fails, I can still boot _without_ boot issues, just some warnings etc., because I use RAID1 to have simple one-device fault tolerance in case one fails (which can happen).

OK and we've already explained that btrfs doesn't work that way yet,
which is why it has the defaults it has, but then you go on to assert
that Btrfs should have the defaults YOU want based on YOUR
assumptions. It's absurd.


>I can check/monitor the BTRFS RAID status with 'btrfs fi sh' (or 'btrfs dev stat'). I also expect that if a device comes back it will sync automatically, and that if I replace a device it will automatically rebalance the raid1 (which btrfs does, so far). I think a lot of sysadmins feel the same way.

OK what you just wrote there is sufficiently incomplete that it's
wrong. I and others have already described part of this behavior so if
you were really comprehending what people are saying, you wouldn't
have just written the above paragraph.

If a missing device reappears, it is not synced automatically.

If you have a two-device raid1 with a missing device, and it is mounted
degraded, data is highly likely to get written to the single remaining
drive as single-profile chunks, which means that when you do either 'btrfs
replace' or 'btrfs device add' followed by 'btrfs device remove', the
data in those single chunks will *not* be replicated automatically to
the replacement drive. You will have to do a manual balance and
explicitly convert single chunks to raid1. If it's 3+ drives, a device
replacement (by either method) should cause data to be replicated.
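
A sketch of that recovery sequence for the two-device case (the devid,
device path and mount point are just examples):

    # replace the missing device (here devid 2) with the new drive
    btrfs replace start 2 /dev/sdc /mnt
    # then convert any chunks that were written as 'single' while degraded
    # back to raid1; 'soft' skips chunks that are already raid1
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt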

I see a lot of sysadmins make the wrong assumptions on the linux-raid
list and on the LVM list, and I often read about data loss when they do
that. What matters is how things actually work. When you make
assumptions about how they work, you're unwittingly begging for
user-induced data loss, and all the complaining about missing features
won't help get the data back. Over and over again we end up telling
people: you didn't understand how it worked, you didn't understand what
you were doing, and yeah, sorry, the data is just gone. It's your
responsibility to understand how things really work and fail. It isn't
possible for the code to understand your expectations and act
accordingly.

At least you're discovering the limitations before you end up in
trouble. The job of a sysadmin is to find out the difference between
expectations and actual feature set, because maybe the technology
being evaluated isn't a good match for the use case.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-07 19:39         ` Austin S. Hemmelgarn
                             ` (2 preceding siblings ...)
  2019-02-08  7:15           ` Stefan K
@ 2019-02-08 18:10           ` waxhead
  2019-02-08 19:17             ` Austin S. Hemmelgarn
  2019-02-08 20:17             ` Chris Murphy
  3 siblings, 2 replies; 32+ messages in thread
From: waxhead @ 2019-02-08 18:10 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Stefan K, linux-btrfs

Austin S. Hemmelgarn wrote:
> On 2019-02-07 13:53, waxhead wrote:
>>
>>
>> Austin S. Hemmelgarn wrote:
>>> On 2019-02-07 06:04, Stefan K wrote:
>>>> Thanks, with degraded as a kernel parameter and also in the fstab it 
>>>> works as expected.
>>>>
>>>> That should be the normal behaviour, because a server must be up and 
>>>> running, and I don't care about a device loss; that's why I use a 
>>>> RAID1. I can fix the device-loss problem later, but it's important 
>>>> that a server is up and running. I get informed at boot time and 
>>>> also in the log files that a device is missing, and I also see it 
>>>> if I use a monitoring program.
>>> No, it shouldn't be the default, because:
>>>
>>> * Normal desktop users _never_ look at the log files or boot info, 
>>> and rarely run monitoring programs, so they as a general rule won't 
>>> notice until it's already too late.  BTRFS isn't just a server 
>>> filesystem, so it needs to be safe for regular users too.
>>
>> I am willing to argue that whatever you refer to as normal users don't 
>> have a clue how to make a raid1 filesystem, nor do they care about 
>> what underlying filesystem their computer runs. I can't quite see how 
>> a limping system would be worse than a failing system in this case. 
>> Besides "normal" desktop users use Windows anyway, people that run on 
>> penguin powered stuff generally have at least some technical knowledge.
> Once you get into stuff like Arch or Gentoo, yeah, people tend to have 
> enough technical knowledge to handle this type of thing, but if you're 
> talking about the big distros like Ubuntu or Fedora, not so much.  Yes, 
> I might be a bit pessimistic here, but that pessimism is based on 
> personal experience over many years of providing technical support for 
> people.
> 
> Put differently, human nature is to ignore things that aren't 
> immediately relevant.  Kernel logs don't matter until you see something 
> wrong.  Boot messages don't matter unless you happen to see them while 
> the system is booting (and most people don't).  Monitoring is the only 
> way here, but most people won't invest the time in proper monitoring 
> until they have problems.  Even as a seasoned sysadmin, I never look at 
> kernel logs until I see any problem, I rarely see boot messages on most 
> of the systems I manage (because I'm rarely sitting at the console when 
> they boot up, and when I am I'm usually handling startup of a dozen or 
> so systems simultaneously after a network-wide outage), and I only 
> monitor things that I know for certain need to be monitored.

So what you are saying here is that distros that use btrfs by default 
should be responsible enough to provide some monitoring solution if they 
allow non-technical users to create a "raid1"-like btrfs filesystem in 
the first place. I don't think that many distros install a S.M.A.R.T. 
monitoring solution either... in which case you are worse off with a 
non-checksumming filesystem.
Since the users you refer to basically ignore the filesystem anyway, I 
can't see why this would be an argument at all...

>>
>>> * It's easily possible to end up mounting degraded by accident if one 
>>> of the constituent devices is slow to enumerate, and this can easily 
>>> result in a split-brain scenario where all devices have diverged and 
>>> the volume can only be repaired by recreating it from scratch.
>>
>> Am I wrong or would not the remaining disk have the generation number 
>> bumped on every commit? would it not make sense to ignore (previously) 
>> stale disks and require a manual "re-add" of the failed disks. From a 
>> users perspective with some C coding knowledge this sounds to me (in 
>> principle) like something as quite simple.
>> E.g. if the superblock UUID match for all devices and one (or more) 
>> devices has a lower generation number than the other(s) then the 
>> disk(s) with the newest generation number should be considered good 
>> and the other disks with a lower generation number should be marked as 
>> failed.
> The problem is that if you're defaulting to this behavior, you can have 
> multiple disks diverge from the base.  Imagine, for example, a system 
> with two devices in a raid1 setup with degraded mounts enabled by 
> default, and either device randomly taking longer than normal to 
> enumerate.  It's very possible for one boot to have one device delay 
> during enumeration on one boot, then the other on the next boot, and if 
> not handled _exactly_ right by the user, this will result in both 
> devices having a higher generation number than they started with, but 
> neither one being 'wrong'.  It's like trying to merge branches in git 
> that both have different changes to a binary file, there's no sane way 
> to handle it without user input.
> 
So why does BTRFS hurry to mount itself even if devices are missing? And 
if BTRFS still can mount, why would it blindly accept a non-existing 
disk to take part of the pool?!

> Realistically, we can only safely recover from divergence correctly if 
> we can prove that all devices are true prior states of the current 
> highest generation, which is not currently possible to do reliably 
> because of how BTRFS operates.
> 
So what you are saying is that the generation number does not represent 
a true frozen state of the filesystem at that point?

> Also, LVM and MD have the exact same issue, it's just not as significant 
> because they re-add and re-sync missing devices automatically when they 
> reappear, which makes such split-brain scenarios much less likely.
Which means marking the entire device as invalid, then re-adding it from 
scratch more or less...

>>
>>> * We have _ZERO_ automatic recovery from this situation.  This makes 
>>> both of the above mentioned issues far more dangerous.
>>
>> See above, would this not be as simple as auto-deleting disks from the 
>> pool that has a matching UUID and a mismatch for the superblock 
>> generation number? Not exactly a recovery, but the system should be 
>> able to limp along.
>>
>>> * It just plain does not work with most systemd setups, because 
>>> systemd will hang waiting on all the devices to appear due to the 
>>> fact that they refuse to acknowledge that the only way to correctly 
>>> know if a BTRFS volume will mount is to just try and mount it.
>>
>> As far as I have understood this BTRFS refuses to mount even in 
>> redundant setups without the degraded flag. Why?! This is just plain 
>> useless. If anything the degraded mount option should be replaced with 
>> something like failif=X where X would be anything from 'never' which 
>> should get a 2 disk system up with exclusively raid1 profiles even if 
>> only one device is working. 'always' in case any device is failed or 
>> even 'atrisk' when loss of one more device would keep any raid chunk 
>> profile guarantee. (this get admittedly complex in a multi disk raid1 
>> setup or when subvolumes perhaps can be mounted with different "raid" 
>> profiles....)
> The issue with systemd is that if you pass 'degraded' on most systemd 
> systems,  and devices are missing when the system tries to mount the 
> volume, systemd won't mount it because it doesn't see all the devices. 
> It doesn't even _try_ to mount it because it doesn't see all the 
> devices.  Changing to degraded by default won't fix this, because it's a 
> systemd problem.
> 
> The same issue also makes it a serious pain in the arse to recover 
> degraded BTRFS volumes on systemd systems, because if the volume is 
> supposed to mount normally on that system, systemd will unmount it if it 
> doesn't see all the devices, regardless of how it got mounted in the 
> first place.
> 
Why does systemd concern itself with what devices a btrfs filesystem 
consists of? Please educate me, I am curious.

> IOW, there's a special case with systemd that makes even mounting BTRFS 
> volumes that have missing devices degraded not work.
Well I use systemd on Debian and have not had that issue. In what 
situation does this fail?

>>
>>> * Given that new kernels still don't properly generate half-raid1 
>>> chunks when a device is missing in a two-device raid1 setup, there's 
>>> a very real possibility that users will have trouble recovering 
>>> filesystems with old recovery media (IOW, any recovery environment 
>>> running a kernel before 4.14 will not mount the volume correctly).
>> Sometimes you have to break a few eggs to make an omelette right? If 
>> people want to recover their data they should have backups, and if 
>> they are really interested in recovering their data (and don't have 
>> backups) then they will probably find this on the web by searching 
>> anyway...
> Backups aren't the type of recovery I'm talking about.  I'm talking 
> about people booting to things like SystemRescueCD to fix system 
> configuration or do offline maintenance without having to nuke the 
> system and restore from backups.  Such recovery environments often don't 
> get updated for a _long_ time, and such usage is not atypical as a first 
> step in trying to fix a broken system in situations where downtime 
> really is a serious issue.
I would say that if downtime is such a serious issue you have a failover 
and a working tested backup.

>>
>>> * You shouldn't be mounting writable and degraded for any reason 
>>> other than fixing the volume (or converting it to a single profile 
>>> until you can fix it), even aside from the other issues.
>>
>> Well in my opinion the degraded mount option is counter intuitive. 
>> Unless otherwise asked for the system should mount and work as long as 
>> it can guarantee the data can be read and written somehow (regardless 
>> if any redundancy guarantee is not met). If the user is willing to 
>> accept more or less risk they should configure it!
> Again, BTRFS mounting degraded is significantly riskier than LVM or MD 
> doing the same thing.  Most users don't properly research things (When's 
> the last time you did a complete cost/benefit analysis before deciding 
> to use a particular piece of software on a system?), and would not know 
> they were taking on significantly higher risk by using BTRFS without 
> configuring it to behave safely until it actually caused them problems, 
> at which point most people would then complain about the resulting data 
> loss instead of trying to figure out why it happened and prevent it in 
> the first place.  I don't know about you, but I for one would rather 
> BTRFS have a reputation for being over-aggressively safe by default than 
> risking users data by default.
Well I don't do cost/benefit analysis since I run free software. I do 
however try my best to ensure that whatever software I install doesn't 
cause more drawbacks than benefits.
I would also like for BTRFS to be over-aggressively safe, but I also 
want it to be over-aggressively always running or even limping if that 
is what it needs to do.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-08 18:10           ` waxhead
@ 2019-02-08 19:17             ` Austin S. Hemmelgarn
  2019-02-09 12:13               ` waxhead
  2019-02-08 20:17             ` Chris Murphy
  1 sibling, 1 reply; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2019-02-08 19:17 UTC (permalink / raw)
  To: waxhead, Stefan K, linux-btrfs

On 2019-02-08 13:10, waxhead wrote:
> Austin S. Hemmelgarn wrote:
>> On 2019-02-07 13:53, waxhead wrote:
>>>
>>>
>>> Austin S. Hemmelgarn wrote:
>>>> On 2019-02-07 06:04, Stefan K wrote:
>>>>> Thanks, with degraded as a kernel parameter and also in the fstab 
>>>>> it works as expected.
>>>>>
>>>>> That should be the normal behaviour, because a server must be up and 
>>>>> running, and I don't care about a device loss; that's why I use a 
>>>>> RAID1. I can fix the device-loss problem later, but it's important 
>>>>> that a server is up and running. I get informed at boot time and 
>>>>> also in the log files that a device is missing, and I also see it 
>>>>> if I use a monitoring program.
>>>> No, it shouldn't be the default, because:
>>>>
>>>> * Normal desktop users _never_ look at the log files or boot info, 
>>>> and rarely run monitoring programs, so they as a general rule won't 
>>>> notice until it's already too late.  BTRFS isn't just a server 
>>>> filesystem, so it needs to be safe for regular users too.
>>>
>>> I am willing to argue that whatever you refer to as normal users 
>>> don't have a clue how to make a raid1 filesystem, nor do they care 
>>> about what underlying filesystem their computer runs. I can't quite 
>>> see how a limping system would be worse than a failing system in this 
>>> case. Besides "normal" desktop users use Windows anyway, people that 
>>> run on penguin powered stuff generally have at least some technical 
>>> knowledge.
>> Once you get into stuff like Arch or Gentoo, yeah, people tend to have 
>> enough technical knowledge to handle this type of thing, but if you're 
>> talking about the big distros like Ubuntu or Fedora, not so much.  
>> Yes, I might be a bit pessimistic here, but that pessimism is based on 
>> personal experience over many years of providing technical support for 
>> people.
>>
>> Put differently, human nature is to ignore things that aren't 
>> immediately relevant.  Kernel logs don't matter until you see 
>> something wrong.  Boot messages don't matter unless you happen to see 
>> them while the system is booting (and most people don't).  Monitoring 
>> is the only way here, but most people won't invest the time in proper 
>> monitoring until they have problems.  Even as a seasoned sysadmin, I 
>> never look at kernel logs until I see any problem, I rarely see boot 
>> messages on most of the systems I manage (because I'm rarely sitting 
>> at the console when they boot up, and when I am I'm usually handling 
>> startup of a dozen or so systems simultaneously after a network-wide 
>> outage), and I only monitor things that I know for certain need to be 
>> monitored.
> 
> So what you are saying here is that distros that use btrfs by default 
> should be responsible enough to provide some monitoring solution if they 
> allow non-technical users to create a "raid1"-like btrfs filesystem in 
> the first place. I don't think that many distros install a S.M.A.R.T. 
> monitoring solution either... in which case you are worse off with a 
> non-checksumming filesystem.
Actually, more of them than you probably realize do (Windows ships 
S.M.A.R.T. monitoring by default these days, so the big distros that want 
to compete for desktop users need to as well), and many also have 
trivial-to-set-up monitoring for MD and LVM arrays.
> Since the users you refer to basically ignore the filesystem anyway, I 
> can't see why this would be an argument at all...
My argument here is that we shouldn't assume users will know what 
they're doing.  It's the same logic behind the saner distros not 
defaulting to using BTRFS for installation: if they do and BTRFS causes 
the user to lose data, the distro will usually get blamed, even if it 
was not at all their fault.  Similarly, if a user chooses to use BTRFS 
without doing their research, it's very likely that any data loss, even 
if it's caused by the user themself not doing things sensibly, will be 
blamed on BTRFS.
> 
>>>
>>>> * It's easily possible to end up mounting degraded by accident if 
>>>> one of the constituent devices is slow to enumerate, and this can 
>>>> easily result in a split-brain scenario where all devices have 
>>>> diverged and the volume can only be repaired by recreating it from 
>>>> scratch.
>>>
>>> Am I wrong or would not the remaining disk have the generation number 
>>> bumped on every commit? would it not make sense to ignore 
>>> (previously) stale disks and require a manual "re-add" of the failed 
>>> disks. From a users perspective with some C coding knowledge this 
>>> sounds to me (in principle) like something as quite simple.
>>> E.g. if the superblock UUID match for all devices and one (or more) 
>>> devices has a lower generation number than the other(s) then the 
>>> disk(s) with the newest generation number should be considered good 
>>> and the other disks with a lower generation number should be marked 
>>> as failed.
>> The problem is that if you're defaulting to this behavior, you can 
>> have multiple disks diverge from the base.  Imagine, for example, a 
>> system with two devices in a raid1 setup with degraded mounts enabled 
>> by default, and either device randomly taking longer than normal to 
>> enumerate.  It's very possible for one boot to have one device delay 
>> during enumeration on one boot, then the other on the next boot, and 
>> if not handled _exactly_ right by the user, this will result in both 
>> devices having a higher generation number than they started with, but 
>> neither one being 'wrong'.  It's like trying to merge branches in git 
>> that both have different changes to a binary file, there's no sane way 
>> to handle it without user input.
>>
> So why does BTRFS hurry to mount itself even if devices are missing? And 
> if BTRFS still can mount, why would it blindly accept a non-existing 
> disk to take part of the pool?!
It doesn't unless you tell it to, and that behavior is exactly what I'm 
arguing against making the default here.
> 
>> Realistically, we can only safely recover from divergence correctly if 
>> we can prove that all devices are true prior states of the current 
>> highest generation, which is not currently possible to do reliably 
>> because of how BTRFS operates.
>>
> So what you are saying is that the generation number does not represent 
> a true frozen state of the filesystem at that point?
It does _only_ for those devices which were present at the time of the 
commit that incremented it.

As an example (don't do this with any BTRFS volume you care about, it 
will break it), take a BTRFS volume with two devices configured for 
raid1.  Mount the volume with only one of the devices present, issue a 
single write to it, then unmount it.  Now do the same with only the 
other device.  Both devices should show the same generation number right 
now (but it should be one higher than when you started), but the 
generation number on each device refers to a different volume state.
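
Roughly, with only one device physically attached at a time (again: don't
try this on a volume you care about; the device names and mount point are
placeholders):

    # with only the first device attached
    mount -o degraded /dev/sda1 /mnt && touch /mnt/a && umount /mnt
    # with only the second device attached
    mount -o degraded /dev/sdb1 /mnt && touch /mnt/b && umount /mnt
    # compare the superblock generations afterwards
    btrfs inspect-internal dump-super /dev/sda1 | grep '^generation'
    btrfs inspect-internal dump-super /dev/sdb1 | grep '^generation'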
> 
>> Also, LVM and MD have the exact same issue, it's just not as 
>> significant because they re-add and re-sync missing devices 
>> automatically when they reappear, which makes such split-brain 
>> scenarios much less likely.
> Which means marking the entire device as invalid, then re-adding it from 
> scratch more or less...
Actually, it doesn't.

For LVM and MD, they track what regions of the remaining device have 
changed, and sync only those regions when the missing device comes back.

For BTRFS, the same thing happens implicitly because of the COW 
structure, and you can manually reproduce similar behavior to LVM or MD 
by scrubbing the volume and then using balance with the 'soft' filter to 
ensure all the chunks are the correct type.
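
A sketch of that manual equivalent, once the previously missing device is
back (the mount point is a placeholder):

    # rewrite stale or bad copies from the good mirror
    btrfs scrub start -B /mnt
    # 'soft' only touches chunks that don't already match the target profile
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt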

In both cases though, you still get into trouble if each of the devices 
gets used separately from each other before being re-synced (though 
BTRFS at least has the decency in that situation to not lose any data, 
LVM or MD will just blindly sync whichever mirror they happen to pick 
over the others).
> 
>>>
>>>> * We have _ZERO_ automatic recovery from this situation.  This makes 
>>>> both of the above mentioned issues far more dangerous.
>>>
>>> See above, would this not be as simple as auto-deleting disks from 
>>> the pool that has a matching UUID and a mismatch for the superblock 
>>> generation number? Not exactly a recovery, but the system should be 
>>> able to limp along.
>>>
>>>> * It just plain does not work with most systemd setups, because 
>>>> systemd will hang waiting on all the devices to appear due to the 
>>>> fact that they refuse to acknowledge that the only way to correctly 
>>>> know if a BTRFS volume will mount is to just try and mount it.
>>>
>>> As far as I have understood this BTRFS refuses to mount even in 
>>> redundant setups without the degraded flag. Why?! This is just plain 
>>> useless. If anything the degraded mount option should be replaced 
>>> with something like failif=X where X would be anything from 'never' 
>>> which should get a 2 disk system up with exclusively raid1 profiles 
>>> even if only one device is working. 'always' in case any device is 
>>> failed or even 'atrisk' when loss of one more device would keep any 
>>> raid chunk profile guarantee. (this get admittedly complex in a multi 
>>> disk raid1 setup or when subvolumes perhaps can be mounted with 
>>> different "raid" profiles....)
>> The issue with systemd is that if you pass 'degraded' on most systemd 
>> systems,  and devices are missing when the system tries to mount the 
>> volume, systemd won't mount it because it doesn't see all the devices. 
>> It doesn't even _try_ to mount it because it doesn't see all the 
>> devices.  Changing to degraded by default won't fix this, because it's 
>> a systemd problem.
>>
>> The same issue also makes it a serious pain in the arse to recover 
>> degraded BTRFS volumes on systemd systems, because if the volume is 
>> supposed to mount normally on that system, systemd will unmount it if 
>> it doesn't see all the devices, regardless of how it got mounted in 
>> the first place.
>>
> Why does systemd concern itself with what devices a btrfs filesystem 
> consists of? Please educate me, I am curious.
For the same reason that it concerns itself with what devices make up a 
LVM volume or an MD array.  In essence, it comes down to a couple of 
specific things:

* It is almost always preferable to delay boot-up while waiting for a 
missing device to reappear than it is to start using a volume that 
depends on it while it's missing.  The overall impact on the system from 
taking a few seconds longer to boot is generally less than the impact of 
having to resync the device when it reappears while the system is still 
booting up.

* Systemd allows mounts to not block the system booting while still 
allowing certain services to depend on those mounts being active.  This 
is extremely useful for remote management reasons, and is actually 
supported by most service managers these days.  Systemd extends this all 
the way down the storage stack though, which is even more useful, 
because it lets disk failures properly cascade up the storage stack and 
translate into the volumes they were part of showing up as degraded (or 
getting unmounted if you choose to configure it that way).
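
As a concrete example, that per-mount policy is what fstab options like
these express (a sketch; the UUID and mount point are placeholders):

    # mount at boot if possible, don't fail the boot if it can't be mounted,
    # and only wait a bounded time for the underlying device(s) to appear
    UUID=<fs-uuid>  /srv/data  btrfs  defaults,nofail,x-systemd.device-timeout=90s  0  0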
> 
>> IOW, there's a special case with systemd that makes even mounting 
>> BTRFS volumes that have missing devices degraded not work.
> Well I use systemd on Debian and have not had that issue. In what 
> situation does this fail?
At one point, if you tried to manually mount a volume that systemd did 
not see all the constituent devices present for, it would get unmounted 
almost instantly by systemd itself.  This may not be the case anymore, 
or it may have been how the distros I've used with systemd on them 
happened to behave, but either way it's a pain in the arse when you want 
to fix a BTRFS volume.
> 
>>>
>>>> * Given that new kernels still don't properly generate half-raid1 
>>>> chunks when a device is missing in a two-device raid1 setup, there's 
>>>> a very real possibility that users will have trouble recovering 
>>>> filesystems with old recovery media (IOW, any recovery environment 
>>>> running a kernel before 4.14 will not mount the volume correctly).
>>> Sometimes you have to break a few eggs to make an omelette right? If 
>>> people want to recover their data they should have backups, and if 
>>> they are really interested in recovering their data (and don't have 
>>> backups) then they will probably find this on the web by searching 
>>> anyway...
>> Backups aren't the type of recovery I'm talking about.  I'm talking 
>> about people booting to things like SystemRescueCD to fix system 
>> configuration or do offline maintenance without having to nuke the 
>> system and restore from backups.  Such recovery environments often 
>> don't get updated for a _long_ time, and such usage is not atypical as 
>> a first step in trying to fix a broken system in situations where 
>> downtime really is a serious issue.
> I would say that if downtime is such a serious issue you have a failover 
> and a working tested backup.
Generally yes, but restoring a volume completely from scratch is almost 
always going to take longer than just fixing what's broken unless it's 
_really_ broken.  Would you really want to nuke a system and rebuild it 
from scratch just because you accidentally pulled out the wrong disk 
when hot-swapping drives to rebuild an array?
> 
>>>
>>>> * You shouldn't be mounting writable and degraded for any reason 
>>>> other than fixing the volume (or converting it to a single profile 
>>>> until you can fix it), even aside from the other issues.
>>>
>>> Well in my opinion the degraded mount option is counter intuitive. 
>>> Unless otherwise asked for the system should mount and work as long 
>>> as it can guarantee the data can be read and written somehow 
>>> (regardless if any redundancy guarantee is not met). If the user is 
>>> willing to accept more or less risk they should configure it!
>> Again, BTRFS mounting degraded is significantly riskier than LVM or MD 
>> doing the same thing.  Most users don't properly research things 
>> (When's the last time you did a complete cost/benefit analysis before 
>> deciding to use a particular piece of software on a system?), and 
>> would not know they were taking on significantly higher risk by using 
>> BTRFS without configuring it to behave safely until it actually caused 
>> them problems, at which point most people would then complain about 
>> the resulting data loss instead of trying to figure out why it 
>> happened and prevent it in the first place.  I don't know about you, 
>> but I for one would rather BTRFS have a reputation for being 
>> over-aggressively safe by default than risking users data by default.
> Well I don't do cost/benefit analysis since I run free software. I do 
> however try my best to ensure that whatever software I install doesn't 
> cause more drawbacks than benefits.
Which is essentially a CBA.  The cost doesn't have to equate to money, 
it could be time, or even limitations in what you can do with the system.

> I would also like for BTRFS to be over-aggressively safe, but I also 
> want it to be over-aggressively always running or even limping if that 
> is what it needs to do.
And you can have it do that, we just prefer not to by default.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-08 18:10           ` waxhead
  2019-02-08 19:17             ` Austin S. Hemmelgarn
@ 2019-02-08 20:17             ` Chris Murphy
  1 sibling, 0 replies; 32+ messages in thread
From: Chris Murphy @ 2019-02-08 20:17 UTC (permalink / raw)
  To: waxhead; +Cc: Austin S. Hemmelgarn, Stefan K, Btrfs BTRFS

On Fri, Feb 8, 2019 at 11:10 AM waxhead <waxhead@dirtcellar.net> wrote:

> So what you are saying here is that distros that use btrfs by default
> should be responsible enough to provide some monitoring solution if they
> allow non-technical users to create a "raid1"-like btrfs filesystem in
> the first place.

None do this by default. I'm only aware of one that makes it possible
in custom partitioning, which is widely regarded as "you're on your
own" land.

I am of the opinion that GUI installers have a high burden to protect
users from themselves, but it's just an opinion; I see plenty of
fail-dangerous GUI software.


> So why does BTRFS hurry to mount itself even if devices are missing?

It isn't and it doesn't. You have to specify the 'degraded' mount
option, which is not the default, and which, with the present design,
means you intend for an immediate successful mount if there's a
missing device and it's still possible to mount anyway.

>And
> if BTRFS still can mount, why would it blindly accept a non-existing
> disk to take part of the pool?!

I can't parse this question. I think the answer is, it doesn't do that.


> > Realistically, we can only safely recover from divergence correctly if
> > we can prove that all devices are true prior states of the current
> > highest generation, which is not currently possible to do reliably
> > because of how BTRFS operates.
> >
> So what you are saying is that the generation number does not represent
> a true frozen state of the filesystem at that point?

You have a two-device raid1, and their generation is 100. You mount
one device by itself with the degraded mount option, and you start
adding and deleting files, no snapshots, and those changes are all under
generation 101. You now unmount it, and you degraded-mount the other
device, and you add and delete some different files, and those changes
are all under generation 101 too.

How do you merge them? I personally think that scenario is user
sabotage and they're just screwed. Start over. They had to
intentionally, manually, mount those two drives *separately* with a
non-default 'degraded' flag. It's crazy to expect Btrfs to sort this
out - but it's entirely reasonable for it to faceplant read-only the
instant it becomes confused; and reasonable to expect and design it to
quickly become confused in such a case, to keep damage from making
both separated mirrors so corrupted they can't be mounted even
read-only.


> Why does systemd concern itself with what devices a btrfs filesystem
> consists of? Please educate me, I am curious.

I'm not sure of the history of:
/usr/lib/udev/rules.d/64-btrfs-dm.rules
/usr/lib/udev/rules.d/64-btrfs.rules

But I think they were submitted to udev by Btrfs developers long ago,
and udev was later subsumed into systemd. It would be ideal if this
rule had some sort of timeout; I think instead it will wait
indefinitely for all devices to appear. Anyway, without that rule, if
a device is merely delayed and systemd tries to mount, the mount
immediately fails and thus boot fails. There is no such thing in
systemd as reattempting to mount after a mount failure, and if sysroot
fails to mount, it's a fatal startup error.
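One partial mitigation, assuming systemd is doing the mounting and the
volume is a non-root entry in /etc/fstab, is to cap how long systemd
waits for the missing device, e.g. (UUID and mount point are
placeholders):

    UUID=<fs-uuid>  /data  btrfs  defaults,nofail,x-systemd.device-timeout=90s  0  0

systemd then gives up on the device after 90 seconds instead of waiting
indefinitely; the mount itself is still refused unless 'degraded' is
also given, and 'nofail' keeps the failed mount from blocking boot.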

> I would also like for BTRFS to be over-aggressively safe, but I also
> want it to be over-aggressively always running or even limping if that
> is what it needs to do.

While I understand that's a metaphor, someone limping along is not a
stable situation. They are more likely to trip and fall.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-08 19:17             ` Austin S. Hemmelgarn
@ 2019-02-09 12:13               ` waxhead
  2019-02-10 18:34                 ` Chris Murphy
  0 siblings, 1 reply; 32+ messages in thread
From: waxhead @ 2019-02-09 12:13 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Stefan K, linux-btrfs



Austin S. Hemmelgarn wrote:
> On 2019-02-08 13:10, waxhead wrote:
>> Austin S. Hemmelgarn wrote:
>>> On 2019-02-07 13:53, waxhead wrote:
>>>>
>>>>
>>>> Austin S. Hemmelgarn wrote:
>>>
>> So why does BTRFS hurry to mount itself even if devices are missing? and 
>> if BTRFS still can mount, why would it blindly accept a non-existing 
>> disk to take part of the pool?!
It doesn't unless you tell it to, and that behavior is exactly what I'm 
arguing against making the default here.
Understood, but that is not quite what I meant - let me rephrase...
If BTRFS still can't mount, why would it blindly accept a previously 
non-existing disk to take part of the pool?! E.g. if you have "disk" A+B 
and suddenly at one boot B is not there. Now you have only A, and one 
would think that A should register that B has been missing. Now on the 
next boot you have A+B, in which case B is likely to have diverged from 
A since A has been mounted without B present - so even if both devices 
are present, why would btrfs blindly accept that both A+B are good to go, 
even though it should be perfectly possible to register in A that B was 
gone? And if you have B without A it should be the same story, right?

>>
>>> Realistically, we can only safely recover from divergence correctly 
>>> if we can prove that all devices are true prior states of the current 
>>> highest generation, which is not currently possible to do reliably 
>>> because of how BTRFS operates.
>>>
>> So what you are saying is that the generation number does not 
>> represent a true frozen state of the filesystem at that point?
> It does _only_ for those devices which were present at the time of the 
> commit that incremented it.
> 
So in other words devices that are not present can easily be marked / 
defined as such at a later time?

> As an example (don't do this with any BTRFS volume you care about, it 
> will break it), take a BTRFS volume with two devices configured for 
> raid1.  Mount the volume with only one of the devices present, issue a 
> single write to it, then unmount it.  Now do the same with only the 
> other device.  Both devices should show the same generation number right 
> now (but it should be one higher than when you started), but the 
> generation number on each device refers to a different volume state.
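For what it's worth, you can see this directly in the superblocks, e.g. 
(device paths illustrative):

    btrfs inspect-internal dump-super /dev/sda1 | grep generation
    btrfs inspect-internal dump-super /dev/sdb1 | grep generation

After an experiment like the one above, both report the same generation 
even though they describe different volume states.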
>>
>>> Also, LVM and MD have the exact same issue, it's just not as 
>>> significant because they re-add and re-sync missing devices 
>>> automatically when they reappear, which makes such split-brain 
>>> scenarios much less likely.
>> Which means marking the entire device as invalid, then re-adding it 
>> from scratch more or less...
> Actually, it doesn't.
> 
> For LVM and MD, they track what regions of the remaining device have 
> changed, and sync only those regions when the missing device comes back.
> 
For MD, if you have the write-intent bitmap enabled, yes...

> For BTRFS, the same thing happens implicitly because of the COW 
> structure, and you can manually reproduce similar behavior to LVM or MD 
> by scrubbing the volume and then using balance with the 'soft' filter to 
> ensure all the chunks are the correct type.
> 
Understood.
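(For reference, that sequence is roughly the following, assuming the 
volume is mounted at /mnt:

    btrfs scrub start -B /mnt
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt

The scrub repairs stale or mismatched copies from the good mirror, and 
the 'soft' filter only converts chunks that are not already raid1.)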

>> Why does systemd concern itself with what devices btrfs consists of? 
>> Please educate me, I am curious.
> For the same reason that it concerns itself with what devices make up a 
> LVM volume or an MD array.  In essence, it comes down to a couple of 
> specific things:
> 
> * It is almost always preferable to delay boot-up while waiting for a 
> missing device to reappear than it is to start using a volume that 
> depends on it while it's missing.  The overall impact on the system from 
> taking a few seconds longer to boot is generally less than the impact of 
> having to resync the device when it reappears while the system is still 
> booting up.
> 
> * Systemd allows mounts to not block the system booting while still 
> allowing certain services to depend on those mounts being active.  This 
> is extremely useful for remote management reasons, and is actually 
> supported by most service managers these days.  Systemd extends this all 
> the way down the storage stack though, which is even more useful, 
> because it lets disk failures properly cascade up the storage stack and 
> translate into the volumes they were part of showing up as degraded (or 
> getting unmounted if you choose to configure it that way).
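A sketch of that pattern, with made-up unit and path names (fstab 
'nofail' plus a mount dependency in the service):

    # /etc/fstab
    UUID=<data-uuid>  /srv/data  btrfs  defaults,nofail  0  0

    # /etc/systemd/system/myapp.service (excerpt)
    [Unit]
    RequiresMountsFor=/srv/data

The system still boots if /srv/data cannot be mounted, but 
myapp.service only starts once the mount is actually active.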
Ok, I'm still not sure I understand how/why systemd knows what devices 
are part of btrfs (or md or lvm for that matter). I'll try to research 
this a bit - thanks for the info!

>>
>>> IOW, there's a special case with systemd that makes even mounting 
>>> BTRFS volumes that have missing devices degraded not work.
>> Well I use systemd on Debian and have not had that issue. In what 
>> situation does this fail?
> At one point, if you tried to manually mount a volume that systemd did 
> not see all the constituent devices present for, it would get unmounted 
> almost instantly by systemd itself.  This may not be the case anymore, 
> or it may have been how the distros I've used with systemd on them 
> happened to behave, but either way it's a pain in the arse when you want 
> to fix a BTRFS volume.
I can see that, but from my "toying around" with btrfs I have not run 
into any issues while mounting degraded.

>>
>>>>
>>>>> * Given that new kernels still don't properly generate half-raid1 
>>>>> chunks when a device is missing in a two-device raid1 setup, 
>>>>> there's a very real possibility that users will have trouble 
>>>>> recovering filesystems with old recovery media (IOW, any recovery 
>>>>> environment running a kernel before 4.14 will not mount the volume 
>>>>> correctly).
>>>> Sometimes you have to break a few eggs to make an omelette right? If 
>>>> people want to recover their data they should have backups, and if 
>>>> they are really interested in recovering their data (and don't have 
>>>> backups) then they will probably find this on the web by searching 
>>>> anyway...
>>> Backups aren't the type of recovery I'm talking about.  I'm talking 
>>> about people booting to things like SystemRescueCD to fix system 
>>> configuration or do offline maintenance without having to nuke the 
>>> system and restore from backups.  Such recovery environments often 
>>> don't get updated for a _long_ time, and such usage is not atypical 
>>> as a first step in trying to fix a broken system in situations where 
>>> downtime really is a serious issue.
>> I would say that if downtime is such a serious issue you have a 
>> failover and a working tested backup.
> Generally yes, but restoring a volume completely from scratch is almost 
> always going to take longer than just fixing what's broken unless it's 
> _really_ broken.  Would you really want to nuke a system and rebuild it 
> from scratch just because you accidentally pulled out the wrong disk 
> when hot-swapping drives to rebuild an array?
Absolutely not, but in this case I would not even want to use a rescue 
disk in the first place.

>>>>
>>>>> * You shouldn't be mounting writable and degraded for any reason 
>>>>> other than fixing the volume (or converting it to a single profile 
>>>>> until you can fix it), even aside from the other issues.
>>>>
>>>> Well in my opinion the degraded mount option is counterintuitive. 
>>>> Unless otherwise asked for, the system should mount and work as long 
>>>> as it can guarantee the data can be read and written somehow 
>>>> (regardless of whether any redundancy guarantee is met). If the user 
>>>> is willing to accept more or less risk they should configure it!
>>> Again, BTRFS mounting degraded is significantly riskier than LVM or 
>>> MD doing the same thing.  Most users don't properly research things 
>>> (When's the last time you did a complete cost/benefit analysis before 
>>> deciding to use a particular piece of software on a system?), and 
>>> would not know they were taking on significantly higher risk by using 
>>> BTRFS without configuring it to behave safely until it actually 
>>> caused them problems, at which point most people would then complain 
>>> about the resulting data loss instead of trying to figure out why it 
>>> happened and prevent it in the first place.  I don't know about you, 
>>> but I for one would rather BTRFS have a reputation for being 
>>> over-aggressively safe by default than risking users' data by default.
>> Well I don't do cost/benefit analysis since I run free software. I do 
>> however try my best to ensure that whatever software I install doesn't 
>> cause more drawbacks than benefits.
> Which is essentially a CBA.  The cost doesn't have to equate to money, 
> it could be time, or even limitations in what you can do with the system.
> 
>> I would also like for BTRFS to be over-aggressively safe, but I also 
>> want it to be over-aggressively always running or even limping if that 
>> is what it needs to do.
> And you can have it do that, we just prefer not to by default.
Got it!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-09 12:13               ` waxhead
@ 2019-02-10 18:34                 ` Chris Murphy
  2019-02-11 12:17                   ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Murphy @ 2019-02-10 18:34 UTC (permalink / raw)
  To: waxhead; +Cc: Austin S. Hemmelgarn, Stefan K, Btrfs BTRFS

On Sat, Feb 9, 2019 at 5:13 AM waxhead <waxhead@dirtcellar.net> wrote:

> Understood, but that is not quite what I meant - let me rephrase...
> If BTRFS still can't mount, why would it blindly accept a previously
> non-existing disk to take part of the pool?!

It doesn't do it blindly. It only ever mounts when the user specifies
the degraded mount option, which is not a default mount option.

>E.g. if you have "disk" A+B
> and suddenly at one boot B is not there. Now you have only A and one
> would think that A should register that B has been missing. Now on the
> next boot you have AB , in which case B is likely to have diverged from
> A since A has been mounted without B present - so even if both devices
> are present why would btrfs blindly accept that both A+B are good to go
> even if it should be perfectly possible to register in A that B was
> gone. And if you have B without A it should be the same story right?

OK no, you haven't gone far enough to set up the split-brain scenario
where there is a partially legitimate complaint. Prior to split brain,
it's entirely reasonable for Btrfs to mount *when you use the degraded
mount option* - it does not blindly mount. And if you've ever done
exactly what you wrote in the above paragraph, you'd see Btrfs
*complains vociferously* about all the errors it's passively finding
and fixing. If you want a more active method of getting device B
caught up with A automatically - that's completely reasonable, and
something people have been saying for some time, but it takes a design
proposal, and code.

As for the split-brain scenario, it is only the user's manual intervention
with multiple 'degraded' mount options (which again, is not the
default) that caused the volume to arrive in such a state. Would it be
wise to have some additional error checking? Sure. Someone would need
to step up with a design and to do code work, same as any other
feature. Maybe a rudimentary check would be comparing the timestamps
for leaves or nodes ostensibly with the same transid, but in any case
that doesn't just happen for free.


> >> So what you are saying is that the generation number does not
> >> represent a true frozen state of the filesystem at that point?
> > It does _only_ for those devices which were present at the time of the
> > commit that incremented it.
> >
> So in other words devices that are not present can easily be marked /
> defined as such at a later time?

That isn't how it currently works. When stale device B is subsequently
mounted (normally) along with device A, it's only passively fixed up.
Part of the point of non-automatic degraded mounts that require user
intervention is the lack of anything beyond simple error handling and
fixups.

> Ok, I'm still not sure I understand how/why systemd knows what devices are
> part of btrfs (or md or lvm for that matter). I'll try to research this
> a bit - thanks for the info!

It doesn't, not directly. It's from the previously mentioned udev
rule. For md, the assembly, delays, and fallback to running degraded
are handled in dracut. But the reason why this is in udev is to
prevent a mount failure just because one or more devices are delayed;
basically it inserts a pause until the devices appear, and then
systemd issues the mount command.


--
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-07 11:04   ` Stefan K
  2019-02-07 12:18     ` Austin S. Hemmelgarn
  2019-02-07 17:15     ` Chris Murphy
@ 2019-02-11  9:30     ` Anand Jain
  2 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2019-02-11  9:30 UTC (permalink / raw)
  To: Stefan K, linux-btrfs



On 2/7/19 7:04 PM, Stefan K wrote:
> Thanks, with degraded as a kernel parameter and also in the fstab it works as expected
> 
> That should be the normal behaviour,

  IMO in the long term it will be. But before that we have a few items 
to fix around this, such as the serviceability part.
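  For reference, the setup Stefan describes amounts to roughly the 
following (the UUID is a placeholder), with the caveats discussed 
elsewhere in this thread:

    # /etc/fstab
    UUID=<fs-uuid>  /  btrfs  defaults,degraded  0  0

    # /etc/default/grub - append to the existing line, then run update-grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet rootflags=degraded"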

-Anand


> because a server must be up and running, and I don't care about a device loss, that's why I use RAID1. I can fix the device-loss problem later, but it's important that the server is up and running; I get informed at boot time and also in the log files that a device is missing, and I also see it if I use a monitoring program.
> 
> So please change the normal behavior
> 
> On Friday, February 1, 2019 7:13:16 PM CET Hans van Kranenburg wrote:
>> Hi Stefan,
>>
>> On 2/1/19 11:28 AM, Stefan K wrote:
>>>
>>> I've installed my Debian Stretch to have / on btrfs with raid1 on 2
>>> SSDs. Today I want test if it works, it works fine until the server
>>> is running and the SSD get broken and I can change this, but it looks
>>> like that it does not work if the SSD fails until restart. I got the
>>> error, that one of the Disks can't be read and I got a initramfs
>>> prompt, I expected that it still runs like mdraid and said something
>>> is missing.
>>>
>>> My question is, is it possible to configure btrfs/fstab/grub that it
>>> still boot? (that is what I expected from a RAID1)
>>
>> Yes. I'm not the expert in this area, but I see you haven't got a reply
>> today yet, so I'll try.
>>
>> What you see happening is correct. This is the default behavior.
>>
>> To be able to boot into your system with a missing disk, you can add...
>>      rootflags=degraded
>> ...to the linux kernel command line by editing it on the fly when you
>> are in the GRUB menu.
>>
>> This allows the filesystem to start in 'degraded' mode this one time.
>> The only thing you should be doing when the system is booted is have a
>> new disk present already in place and fix the btrfs situation. This
>> means things like cloning the partition table of the disk that's still
>> working, doing whatever else is needed in your situation and then
>> running btrfs replace to replace the missing disk with the new one, and
>> then making sure you don't have "single" block groups left (using btrfs
>> balance), which might have been created for new writes when the
>> filesystem was running in degraded mode.
>>
>> -- 
>> Hans van Kranenburg
>>
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-10 18:34                 ` Chris Murphy
@ 2019-02-11 12:17                   ` Austin S. Hemmelgarn
  2019-02-11 21:15                     ` Chris Murphy
  0 siblings, 1 reply; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2019-02-11 12:17 UTC (permalink / raw)
  To: Chris Murphy, waxhead; +Cc: Stefan K, Btrfs BTRFS

On 2019-02-10 13:34, Chris Murphy wrote:
> On Sat, Feb 9, 2019 at 5:13 AM waxhead <waxhead@dirtcellar.net> wrote:
> 
>> Understood, but that is not quite what I meant - let me rephrase...
>> If BTRFS still can't mount, why would it blindly accept a previously
>> non-existing disk to take part of the pool?!
> 
> It doesn't do it blindly. It only ever mounts when the user specifies
> the degraded mount option, which is not a default mount option.
> 
>> E.g. if you have "disk" A+B
>> and suddenly at one boot B is not there. Now you have only A and one
>> would think that A should register that B has been missing. Now on the
>> next boot you have AB , in which case B is likely to have diverged from
>> A since A has been mounted without B present - so even if both devices
>> are present why would btrfs blindly accept that both A+B are good to go
>> even if it should be perfectly possible to register in A that B was
>> gone. And if you have B without A it should be the same story right?
> 
> OK no, you haven't gone far enough to setup the split brain scenario
> where there is a partially legitimate complaint. Prior to split brain,
> it's entirely reasonable for Btrfs to mount *when you use the degraded
> mount option* - it does not blindly mount. And if you've ever done
> exactly what you wrote in the above paragraph, you'd see Btrfs
> *complains vociferously* about all the errors it's passively finding
> and fixing. If you want a more active method of getting device B
> caught up with A automatically - that's completely reasonable, and
> something people have been saying for some time, but it takes a design
> proposal, and code.
> 
> As for split brain scenario, it is only the user's manual intervention
> with multiple 'degraded' mount options (which again, is not the
> default) that caused the volume to arrive in such a state. Would it be
> wise to have some additional error checking? Sure. Someone would need
> to step up with a design and to do code work, same as any other
> feature. Maybe a rudimentary check would be comparing the timestamps
> for leaves or nodes ostensibly with the same transid, but in any case
> that doesn't just happen for free.
And even then it couldn't be made truly reliable, because data from old 
transactions may be arbitrarily overwritten at any point after the next 
transaction (and is just plain gone if you're using the `discard` mount 
option).
> 
> 
>>>> So what you are saying is that the generation number does not
>>>> represent a true frozen state of the filesystem at that point?
>>> It does _only_ for those devices which were present at the time of the
>>> commit that incremented it.
>>>
>> So in other words devices that are not present can easily be marked /
>> defined as such at a later time?
> 
> That isn't how it currently works. When stale device B is subsequently
> mounted (normally) along with device A, it's only passively fixed up.
> Part of the point of non-automatic degraded mounts that require user
> intervention is the lack of anything beyond simple error handling and
> fixups.
> 
>> Ok, not sure I still understand how/why systemd knows what devices are
>> part of btrfs (or md or lvm for that matter). I'll try to research this
>> a bit - thanks for the info!
> 
> It doesn't, not directly. It's from the previously mentioned udev
> rule. For md, the assembly, delays, and fall back to running degraded,
> are handled in dracut. But the reason why this is in udev is to
> prevent a mount failure just because one or more devices are delayed;
> basically it inserts a pause until the devices appear, and then
> systemd issues the mount command.
Last I knew, it was systemd itself doing the pause, because we provide 
no real device for udev to wait on appearing.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs as / filesystem in RAID1
  2019-02-11 12:17                   ` Austin S. Hemmelgarn
@ 2019-02-11 21:15                     ` Chris Murphy
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Murphy @ 2019-02-11 21:15 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, waxhead, Stefan K, Btrfs BTRFS

On Mon, Feb 11, 2019 at 5:17 AM Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
>
> Last I knew, it was systemd itself doing the pause, because we provide
> no real device for udev to wait on appearing.

Well there's more than one thing responsible for the net behavior. The
most central thing waiting is the kernel. And that's because 'btrfs
device ready' simply waits until all devices are found (by kernel
code). That's the command that /usr/lib/udev/rules.d/64-btrfs.rules
calls. So it is also udev that doesn't return from that, indefinitely
as far as I know. And therefore systemd won't issue a mount command
for sysroot.
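You can poke at that check by hand, e.g. (device path illustrative):

    btrfs device ready /dev/sdb2 || echo "not all member devices present"

It only returns success once every member of that filesystem has been 
scanned and registered with the kernel.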


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2019-02-11 21:15 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-01 10:28 btrfs as / filesystem in RAID1 Stefan K
2019-02-01 19:13 ` Hans van Kranenburg
2019-02-07 11:04   ` Stefan K
2019-02-07 12:18     ` Austin S. Hemmelgarn
2019-02-07 18:53       ` waxhead
2019-02-07 19:39         ` Austin S. Hemmelgarn
2019-02-07 21:21           ` Remi Gauvin
2019-02-08  4:51           ` Andrei Borzenkov
2019-02-08 12:54             ` Austin S. Hemmelgarn
2019-02-08  7:15           ` Stefan K
2019-02-08 12:58             ` Austin S. Hemmelgarn
2019-02-08 16:56             ` Chris Murphy
2019-02-08 18:10           ` waxhead
2019-02-08 19:17             ` Austin S. Hemmelgarn
2019-02-09 12:13               ` waxhead
2019-02-10 18:34                 ` Chris Murphy
2019-02-11 12:17                   ` Austin S. Hemmelgarn
2019-02-11 21:15                     ` Chris Murphy
2019-02-08 20:17             ` Chris Murphy
2019-02-07 17:15     ` Chris Murphy
2019-02-07 17:37       ` Martin Steigerwald
2019-02-07 22:19         ` Chris Murphy
2019-02-07 23:02           ` Remi Gauvin
2019-02-08  7:33           ` Stefan K
2019-02-08 17:26             ` Chris Murphy
2019-02-11  9:30     ` Anand Jain
2019-02-02 23:35 ` Chris Murphy
2019-02-04 17:47   ` Patrik Lundquist
2019-02-04 17:55     ` Austin S. Hemmelgarn
2019-02-04 22:19       ` Patrik Lundquist
2019-02-05  6:46         ` Chris Murphy
2019-02-05  7:37           ` Chris Murphy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).