From: waxhead <waxhead@dirtcellar.net>
To: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
Stefan K <shadow_7@gmx.net>,
linux-btrfs@vger.kernel.org
Subject: Re: btrfs as / filesystem in RAID1
Date: Sat, 9 Feb 2019 13:13:44 +0100
Message-ID: <f67c6a69-4fc8-e33a-543e-9b97adf54438@dirtcellar.net>
In-Reply-To: <f41063e8-b9b1-f929-7954-8a96e673bd2e@gmail.com>
Austin S. Hemmelgarn wrote:
> On 2019-02-08 13:10, waxhead wrote:
>> Austin S. Hemmelgarn wrote:
>>> On 2019-02-07 13:53, waxhead wrote:
>>>>
>>>>
>>>> Austin S. Hemmelgarn wrote:
>>>
>> So why does BTRFS hurry to mount itself even if devices are missing? And
>> if BTRFS still can mount, why would it blindly accept a non-existing
>> disk to take part of the pool?!
> It doesn't unless you tell it to, and that behavior is exactly what I'm
> arguing against making the default here.
Understood, but that is not quite what I meant - let me rephrase...
Even if BTRFS can still mount (degraded), why would it later blindly
accept a previously missing disk back into the pool?! E.g. if you have
disks A+B and suddenly, on one boot, B is not there. Now you have only
A, and one would think that A should register that B has been missing.
On the next boot you have A+B again, in which case B is likely to have
diverged from A, since A has been mounted without B present. So even if
both devices are present, why would BTRFS blindly accept that both A+B
are good to go, when it should be perfectly possible to register in A
that B was gone? And if you have B without A it should be the same
story, right?
>>
>>> Realistically, we can only safely recover from divergence correctly
>>> if we can prove that all devices are true prior states of the current
>>> highest generation, which is not currently possible to do reliably
>>> because of how BTRFS operates.
>>>
>> So what you are saying is that the generation number does not
>> represent a true frozen state of the filesystem at that point?
> It does _only_ for those devices which were present at the time of the
> commit that incremented it.
>
So, in other words, devices that are not present could easily be marked /
defined as such at a later time?
> As an example (don't do this with any BTRFS volume you care about, it
> will break it), take a BTRFS volume with two devices configured for
> raid1. Mount the volume with only one of the devices present, issue a
> single write to it, then unmount it. Now do the same with only the
> other device. Both devices should show the same generation number right
> now (but it should be one higher than when you started), but the
> generation number on each device refers to a different volume state.
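In case anyone wants to see that for themselves (on scratch devices you
don't care about!), a minimal sketch along these lines - the device
paths are only placeholders - prints the superblock generation of each
member after the experiment above; equal numbers despite different
volume states is exactly the problem:

#!/usr/bin/env python3
# Rough sketch: dump each member's superblock with btrfs-progs and pull
# out the "generation" field.  Run as root; /dev/sdb1 and /dev/sdc1 are
# placeholders for the two raid1 members.
import re
import subprocess
import sys

def generation(dev):
    out = subprocess.run(["btrfs", "inspect-internal", "dump-super", dev],
                         capture_output=True, text=True, check=True).stdout
    return int(re.search(r"^generation\s+(\d+)", out, re.M).group(1))

for dev in (sys.argv[1:] or ["/dev/sdb1", "/dev/sdc1"]):
    print(dev, generation(dev))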
>>
>>> Also, LVM and MD have the exact same issue, it's just not as
>>> significant because they re-add and re-sync missing devices
>>> automatically when they reappear, which makes such split-brain
>>> scenarios much less likely.
>> Which means marking the entire device as invalid, then re-adding it
>> from scratch more or less...
> Actually, it doesn't.
>
> For LVM and MD, they track what regions of the remaining device have
> changed, and sync only those regions when the missing device comes back.
>
For MD, yes - if you have the write-intent bitmap enabled...
> For BTRFS, the same thing happens implicitly because of the COW
> structure, and you can manually reproduce similar behavior to LVM or MD
> by scrubbing the volume and then using balance with the 'soft' filter to
> ensure all the chunks are the correct type.
>
Understood.
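For my own notes, the manual re-sync you describe would then be roughly
the sequence below - assuming the target profile is raid1 and a mount
point of /mnt/pool (both placeholders):

#!/usr/bin/env python3
# Sketch of the manual re-sync described above: scrub repairs stale/bad
# blocks from the good copy, then a balance restricted by the 'soft'
# filter rewrites only the chunks that are not yet in the raid1 profile.
import subprocess

MNT = "/mnt/pool"  # placeholder mount point
subprocess.run(["btrfs", "scrub", "start", "-B", MNT], check=True)
subprocess.run(["btrfs", "balance", "start",
                "-dconvert=raid1,soft", "-mconvert=raid1,soft", MNT],
               check=True)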
>> Why does systemd concern itself with what devices a btrfs volume
>> consists of? Please educate me, I am curious.
> For the same reason that it concerns itself with what devices make up an
> LVM volume or an MD array. In essence, it comes down to a couple of
> specific things:
>
> * It is almost always preferable to delay boot-up while waiting for a
> missing device to reappear than it is to start using a volume that
> depends on it while it's missing. The overall impact on the system from
> taking a few seconds longer to boot is generally less than the impact of
> having to resync the device when it reappears while the system is still
> booting up.
>
> * Systemd allows mounts to not block the system booting while still
> allowing certain services to depend on those mounts being active. This
> is extremely useful for remote management reasons, and is actually
> supported by most service managers these days. Systemd extends this all
> the way down the storage stack though, which is even more useful,
> because it lets disk failures properly cascade up the storage stack and
> translate into the volumes they were part of showing up as degraded (or
> getting unmounted if you choose to configure it that way).
OK, I'm still not sure I understand how/why systemd knows what devices
are part of a btrfs volume (or MD or LVM for that matter). I'll try to
research this a bit - thanks for the info!
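From what I can tell so far (an assumption on my part, not something I
have verified), udev's 64-btrfs.rules runs a "btrfs ready" test on every
btrfs device that appears, and systemd holds the filesystem's device
unit back until that test passes. The same check can be run by hand,
roughly like this (the device path is a placeholder):

#!/usr/bin/env python3
# Sketch only: `btrfs device ready <dev>` exits 0 once the kernel has
# seen every member device of the filesystem <dev> belongs to - the same
# test the udev rule relies on.  /dev/sdb1 is a placeholder.
import subprocess
import sys

dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sdb1"
rc = subprocess.run(["btrfs", "device", "ready", dev]).returncode
print(dev, "all member devices present" if rc == 0 else "still missing devices")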
>>
>>> IOW, there's a special case with systemd that makes even mounting
>>> BTRFS volumes that have missing devices degraded not work.
>> Well I use systemd on Debian and have not had that issue. In what
>> situation does this fail?
> At one point, if you tried to manually mount a volume that systemd did
> not see all the constituent devices present for, it would get unmounted
> almost instantly by systemd itself. This may not be the case anymore,
> or it may have been how the distros I've used with systemd on them
> happened to behave, but either way it's a pain in the arse when you want
> to fix a BTRFS volume.
I can see that, but from my "toying around" with btrfs I have not run
into any issues while mounting degraded.
>>
>>>>
>>>>> * Given that new kernels still don't properly generate half-raid1
>>>>> chunks when a device is missing in a two-device raid1 setup,
>>>>> there's a very real possibility that users will have trouble
>>>>> recovering filesystems with old recovery media (IOW, any recovery
>>>>> environment running a kernel before 4.14 will not mount the volume
>>>>> correctly).
>>>> Sometimes you have to break a few eggs to make an omelette right? If
>>>> people want to recover their data they should have backups, and if
>>>> they are really interested in recovering their data (and don't have
>>>> backups) then they will probably find this on the web by searching
>>>> anyway...
>>> Backups aren't the type of recovery I'm talking about. I'm talking
>>> about people booting to things like SystemRescueCD to fix system
>>> configuration or do offline maintenance without having to nuke the
>>> system and restore from backups. Such recovery environments often
>>> don't get updated for a _long_ time, and such usage is not atypical
>>> as a first step in trying to fix a broken system in situations where
>>> downtime really is a serious issue.
>> I would say that if downtime is such a serious issue you have a
>> failover and a working tested backup.
> Generally yes, but restoring a volume completely from scratch is almost
> always going to take longer than just fixing what's broken unless it's
> _really_ broken. Would you really want to nuke a system and rebuild it
> from scratch just because you accidentally pulled out the wrong disk
> when hot-swapping drives to rebuild an array?
Absolutely not, but in this case I would not even want to use a rescue
disk in the first place.
>>>>
>>>>> * You shouldn't be mounting writable and degraded for any reason
>>>>> other than fixing the volume (or converting it to a single profile
>>>>> until you can fix it), even aside from the other issues.
>>>>
>>>> Well in my opinion the degraded mount option is counter intuitive.
>>>> Unless otherwise asked for the system should mount and work as long
>>>> as it can guarantee the data can be read and written somehow
>>>> (regardless if any redundancy guarantee is not met). If the user is
>>>> willing to accept more or less risk they should configure it!
>>> Again, BTRFS mounting degraded is significantly riskier than LVM or
>>> MD doing the same thing. Most users don't properly research things
>>> (When's the last time you did a complete cost/benefit analysis before
>>> deciding to use a particular piece of software on a system?), and
>>> would not know they were taking on significantly higher risk by using
>>> BTRFS without configuring it to behave safely until it actually
>>> caused them problems, at which point most people would then complain
>>> about the resulting data loss instead of trying to figure out why it
>>> happened and prevent it in the first place. I don't know about you,
>>> but I for one would rather BTRFS have a reputation for being
>>> over-aggressively safe by default than risking users' data by default.
>> Well I don't do cost/benefit analysis since I run free software. I do,
>> however, try my best to ensure that whatever software I install doesn't
>> cause more drawbacks than benefits.
> Which is essentially a CBA. The cost doesn't have to equate to money,
> it could be time, or even limitations in what you can do with the system.
>
>> I would also like for BTRFS to be over-aggressively safe, but I also
>> want it to be over-aggressively always running or even limping if that
>> is what it needs to do.
> And you can have it do that, we just prefer not to by default.
Got it!