Re: btrfs as / filesystem in RAID1

From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Chris Murphy <lists@colorremedies.com>, waxhead <waxhead@dirtcellar.net>
Cc: Stefan K <shadow_7@gmx.net>, Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: btrfs as / filesystem in RAID1
Date: Mon, 11 Feb 2019 07:17:42 -0500
Message-ID: <a8e00ae7-9e18-ba74-5521-a2db7b525e51@gmail.com>
In-Reply-To: <CAJCQCtQ-nLkOYE5ARk+rjT4JBxR6Atn1gU-+U8gAT0sb7Mduow@mail.gmail.com>

On 2019-02-10 13:34, Chris Murphy wrote:
> On Sat, Feb 9, 2019 at 5:13 AM waxhead <waxhead@dirtcellar.net> wrote:
> 
>> Understood, but that is not quite what I meant - let me rephrase...
>> If BTRFS still can't mount, why would it blindly accept a previously
>> non-existing disk to take part of the pool?!
> 
> It doesn't do it blindly. It only ever mounts when the user specifies
> the degraded mount option, which is not a default mount option.
> 
>> E.g. if you have "disk" A+B
>> and suddenly at one boot B is not there. Now you have only A and one
>> would think that A should register that B has been missing. Now on the
>> next boot you have A+B, in which case B is likely to have diverged from
>> A since A has been mounted without B present - so even if both devices
>> are present why would btrfs blindly accept that both A+B are good to go
>> even if it should be perfectly possible to register in A that B was
>> gone. And if you have B without A it should be the same story right?
> 
> OK no, you haven't gone far enough to set up the split brain scenario
> where there is a partially legitimate complaint. Prior to split brain,
> it's entirely reasonable for Btrfs to mount *when you use the degraded
> mount option* - it does not blindly mount. And if you've ever done
> exactly what you wrote in the above paragraph, you'd see Btrfs
> *complains vociferously* about all the errors it's passively finding
> and fixing. If you want a more active method of getting device B
> caught up with A automatically - that's completely reasonable, and
> something people have been saying for some time, but it takes a design
> proposal, and code.
> 
> As for split brain scenario, it is only the user's manual intervention
> with multiple 'degraded' mount options (which again, is not the
> default) that caused the volume to arrive in such a state. Would it be
> wise to have some additional error checking? Sure. Someone would need
> to step up with a design and to do code work, same as any other
> feature. Maybe a rudimentary check would be comparing the timestamps
> for leaves or nodes ostensibly with the same transid, but in any case
> that doesn't just happen for free.
And even then it couldn't be made truly reliable, because data from old 
transactions may be arbitrarily overwritten at any point after the next 
transaction (and is just plain gone if you're using the `discard` mount 
option).
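
To make that concrete, here is a toy model of the kind of check being suggested. It is plain Python and not btrfs code: `Device`, `blocks`, and `diverged` are made-up names, the timestamps are invented, and it ignores how metadata is really laid out. It is only meant to show why such a check can miss a split brain once old blocks are gone.

```python
# Toy model only: none of these names correspond to real btrfs structures.
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    # Hypothetical map: transid -> write timestamp of a tree block that is
    # still readable on this device.
    blocks: dict = field(default_factory=dict)


def diverged(a: Device, b: Device) -> bool:
    """True if any transid found on both devices carries different timestamps."""
    shared = a.blocks.keys() & b.blocks.keys()
    return any(a.blocks[t] != b.blocks[t] for t in shared)


# A and B were each mounted degraded on their own, so both wrote a transid 101.
a = Device("A", {100: 10.0, 101: 20.0})
b = Device("B", {100: 10.0, 101: 35.0})
print(diverged(a, b))  # True: transid 101 was written at different times

# A keeps running; its block from transid 101 is overwritten (or discarded).
del a.blocks[101]
a.blocks[102] = 40.0
print(diverged(a, b))  # False: the evidence of divergence no longer exists
```

The second result is the problem case: the two devices really have diverged, but nothing with a shared transid is left to compare.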
> 
> 
>>>> So what you are saying is that the generation number does not
>>>> represent a true frozen state of the filesystem at that point?
>>> It does _only_ for those devices which were present at the time of the
>>> commit that incremented it.
>>>
>> So in other words devices that are not present can easily be marked /
>> defined as such at a later time?
> 
> That isn't how it currently works. When stale device B is subsequently
> mounted (normally) along with device A, it's only passively fixed up.
> Part of the point of non-automatic degraded mounts that require user
> intervention is the lack of anything beyond simple error handling and
> fixups.
> 
>> Ok, not sure I still understand how/why systemd knows what devices are
>> part of btrfs (or md or lvm for that matter). I'll try to research this
>> a bit - thanks for the info!
> 
> It doesn't, not directly. It's from the previously mentioned udev
> rule. For md, the assembly, delays, and fall back to running degraded,
> are handled in dracut. But the reason why this is in udev is to
> prevent a mount failure just because one or more devices are delayed;
> basically it inserts a pause until the devices appear, and then
> systemd issues the mount command.
Last I knew, it was systemd itself doing the pause, because we provide
no real device whose appearance udev could wait on.