From: waxhead <waxhead@dirtcellar.net>
To: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
	Stefan K <shadow_7@gmx.net>,
	linux-btrfs@vger.kernel.org
Subject: Re: btrfs as / filesystem in RAID1
Date: Sat, 9 Feb 2019 13:13:44 +0100	[thread overview]
Message-ID: <f67c6a69-4fc8-e33a-543e-9b97adf54438@dirtcellar.net> (raw)
In-Reply-To: <f41063e8-b9b1-f929-7954-8a96e673bd2e@gmail.com>



Austin S. Hemmelgarn wrote:
> On 2019-02-08 13:10, waxhead wrote:
>> Austin S. Hemmelgarn wrote:
>>> On 2019-02-07 13:53, waxhead wrote:
>>>>
>>>>
>>>> Austin S. Hemmelgarn wrote:
>>>
>> So why does BTRFS hurry to mount itself even if devices are missing?
>> And if BTRFS still can mount, why would it blindly accept a
>> non-existing disk to take part of the pool?!
> It doesn't unless you tell it to, and that behavior is exactly what I'm
> arguing against making the default here.
Understood, but that is not quite what I meant - let me rephrase...
If BTRFS can still mount with a device missing, why would it later 
blindly accept the previously missing disk back into the pool? E.g. you 
have "disks" A+B and suddenly, on one boot, B is not there. Now you have 
only A, and one would think that A should record that B has been 
missing. On the next boot you have A+B again, in which case B has likely 
diverged from A, since A was mounted without B present. So even though 
both devices are present, why would btrfs blindly accept that both A+B 
are good to go, when it should be perfectly possible to record in A that 
B was gone? And if you have B without A it should be the same story, right?
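For what it's worth, the per-device view of this can be inspected by 
hand. A rough sketch (the device names below are only examples):

  # Print the superblock generation each member device last recorded.
  # If one device was mounted while the other was missing, the numbers -
  # and the volume state they describe - can differ between the two.
  btrfs inspect-internal dump-super /dev/sdb | grep '^generation'
  btrfs inspect-internal dump-super /dev/sdc | grep '^generation'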

>>
>>> Realistically, we can only safely recover from divergence correctly 
>>> if we can prove that all devices are true prior states of the current 
>>> highest generation, which is not currently possible to do reliably 
>>> because of how BTRFS operates.
>>>
>> So what you are saying is that the generation number does not 
>> represent a true frozen state of the filesystem at that point?
> It does _only_ for those devices which were present at the time of the 
> commit that incremented it.
> 
So in other words, devices that were not present at that commit could 
just as well be marked / flagged as missing at a later time?

> As an example (don't do this with any BTRFS volume you care about, it 
> will break it), take a BTRFS volume with two devices configured for 
> raid1.  Mount the volume with only one of the devices present, issue a 
> single write to it, then unmount it.  Now do the same with only the 
> other device.  Both devices should show the same generation number right 
> now (but it should be one higher than when you started), but the 
> generation number on each device refers to a different volume state.
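(For anyone who wants to see this for themselves on throw-away devices: 
as far as I understand it, the experiment above boils down to roughly 
the following. The loop devices, paths and losetup juggling are only an 
illustration:)

  # Scratch two-device raid1 volume - do NOT do this to data you care about.
  mkfs.btrfs -f -d raid1 -m raid1 /dev/loop0 /dev/loop1

  # "Remove" the second device, mount degraded, write once, unmount.
  losetup -d /dev/loop1
  mount -o degraded /dev/loop0 /mnt
  touch /mnt/only-loop0 ; umount /mnt

  # Re-attach the second backing file and detach the first
  # (losetup -fP <backing-file>, losetup -d /dev/loop0), then repeat
  # the same single write from the other side.
  mount -o degraded /dev/loop1 /mnt
  touch /mnt/only-loop1 ; umount /mnt

  # Both superblocks should now show the same generation number while
  # actually describing two different, diverged volume states:
  btrfs inspect-internal dump-super /dev/loop0 | grep '^generation'
  btrfs inspect-internal dump-super /dev/loop1 | grep '^generation'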
>>
>>> Also, LVM and MD have the exact same issue, it's just not as 
>>> significant because they re-add and re-sync missing devices 
>>> automatically when they reappear, which makes such split-brain 
>>> scenarios much less likely.
>> Which means marking the entire device as invalid, then re-adding it 
>> from scratch more or less...
> Actually, it doesn't.
> 
> For LVM and MD, they track what regions of the remaining device have 
> changed, and sync only those regions when the missing device comes back.
> 
For MD, yes - if you have the write-intent bitmap enabled...

> For BTRFS, the same thing happens implicitly because of the COW 
> structure, and you can manually reproduce similar behavior to LVM or MD 
> by scrubbing the volume and then using balance with the 'soft' filter to 
> ensure all the chunks are the correct type.
> 
Understood.
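(For the archives: my reading of that is something along the lines of 
the commands below, run once the previously-missing device is back. The 
mount point is just an example:)

  # Rewrite/repair anything that is stale on the returned device...
  btrfs scrub start -B /mnt
  # ...and re-create raid1 copies for chunks written while the volume
  # was degraded ('soft' only touches chunks that are not already in
  # the target profile):
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt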

>> Why does systemd concern itself with what devices a btrfs volume 
>> consists of? Please educate me, I am curious.
> For the same reason that it concerns itself with what devices make up a 
> LVM volume or an MD array.  In essence, it comes down to a couple of 
> specific things:
> 
> * It is almost always preferable to delay boot-up while waiting for a 
> missing device to reappear than it is to start using a volume that 
> depends on it while it's missing.  The overall impact on the system from 
> taking a few seconds longer to boot is generally less than the impact of 
> having to resync the device when it reappears while the system is still 
> booting up.
> 
> * Systemd allows mounts to not block the system booting while still 
> allowing certain services to depend on those mounts being active.  This 
> is extremely useful for remote management reasons, and is actually 
> supported by most service managers these days.  Systemd extends this all 
> the way down the storage stack though, which is even more useful, 
> because it lets disk failures properly cascade up the storage stack and 
> translate into the volumes they were part of showing up as degraded (or 
> getting unmounted if you choose to configure it that way).
Ok, I'm still not sure I understand how/why systemd knows what devices 
are part of a btrfs volume (or an MD array or LVM volume, for that 
matter). I'll try to research this a bit - thanks for the info!
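(Partly as a note to self: the plumbing seems to be a udev rule that 
only marks a btrfs device as "ready" once all member devices have shown 
up, plus the dependencies on the generated mount unit. Something like 
the following shows it, assuming a filesystem mounted at /data:)

  # Rule shipped with systemd/udev (path may vary by distro):
  cat /usr/lib/udev/rules.d/64-btrfs.rules

  # Dependencies systemd attached to the generated mount unit:
  systemctl list-dependencies data.mount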

>>
>>> IOW, there's a special case with systemd that makes even mounting 
>>> BTRFS volumes that have missing devices degraded not work.
>> Well I use systemd on Debian and have not had that issue. In what 
>> situation does this fail?
> At one point, if you tried to manually mount a volume that systemd did 
> not see all the constituent devices present for, it would get unmounted 
> almost instantly by systemd itself.  This may not be the case anymore, 
> or it may have been how the distros I've used with systemd on them 
> happened to behave, but either way it's a pain in the arse when you want 
> to fix a BTRFS volume.
I can see that, but from my "toying around" with btrfs I have not run 
into any issues while mounting degraded.
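(In case it is useful to anyone reading along, the repair-only degraded 
mount I have been toying with looks roughly like this - the device 
names and the devid are only examples:)

  # Mount the surviving device writable, but only to fix the volume:
  mount -o degraded /dev/sdb /mnt

  # Then either re-add and scrub the missing device when it comes back,
  # or replace it outright (here devid 2 stands in for the missing disk):
  btrfs replace start 2 /dev/sdd /mnt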

>>
>>>>
>>>>> * Given that new kernels still don't properly generate half-raid1 
>>>>> chunks when a device is missing in a two-device raid1 setup, 
>>>>> there's a very real possibility that users will have trouble 
>>>>> recovering filesystems with old recovery media (IOW, any recovery 
>>>>> environment running a kernel before 4.14 will not mount the volume 
>>>>> correctly).
>>>> Sometimes you have to break a few eggs to make an omelette, right? If 
>>>> people want to recover their data they should have backups, and if 
>>>> they are really interested in recovering their data (and don't have 
>>>> backups) then they will probably find this on the web by searching 
>>>> anyway...
>>> Backups aren't the type of recovery I'm talking about.  I'm talking 
>>> about people booting to things like SystemRescueCD to fix system 
>>> configuration or do offline maintenance without having to nuke the 
>>> system and restore from backups.  Such recovery environments often 
>>> don't get updated for a _long_ time, and such usage is not atypical 
>>> as a first step in trying to fix a broken system in situations where 
>>> downtime really is a serious issue.
>> I would say that if downtime is such a serious issue you have a 
>> failover and a working tested backup.
> Generally yes, but restoring a volume completely from scratch is almost 
> always going to take longer than just fixing what's broken unless it's 
> _really_ broken.  Would you really want to nuke a system and rebuild it 
> from scratch just because you accidentally pulled out the wrong disk 
> when hot-swapping drives to rebuild an array?
Absolutely not, but in this case I would not even want to use a rescue 
disk in the first place.

>>>>
>>>>> * You shouldn't be mounting writable and degraded for any reason 
>>>>> other than fixing the volume (or converting it to a single profile 
>>>>> until you can fix it), even aside from the other issues.
>>>>
>>>> Well in my opinion the degraded mount option is counter-intuitive. 
>>>> Unless otherwise asked for, the system should mount and work as long 
>>>> as it can guarantee the data can be read and written somehow 
>>>> (regardless of whether any redundancy guarantee is met). If the user 
>>>> is willing to accept more or less risk, they should configure that!
>>> Again, BTRFS mounting degraded is significantly riskier than LVM or 
>>> MD doing the same thing.  Most users don't properly research things 
>>> (When's the last time you did a complete cost/benefit analysis before 
>>> deciding to use a particular piece of software on a system?), and 
>>> would not know they were taking on significantly higher risk by using 
>>> BTRFS without configuring it to behave safely until it actually 
>>> caused them problems, at which point most people would then complain 
>>> about the resulting data loss instead of trying to figure out why it 
>>> happened and prevent it in the first place.  I don't know about you, 
>>> but I for one would rather BTRFS have a reputation for being 
>>> over-aggressively safe by default than risking users data by default.
>> Well I don't do cost/benefit analyses since I run free software. I do, 
>> however, try my best to ensure that whatever software I install doesn't 
>> cause more drawbacks than benefits.
> Which is essentially a CBA.  The cost doesn't have to equate to money, 
> it could be time, or even limitations in what you can do with the system.
> 
>> I would also like for BTRFS to be over-aggressively safe, but I also 
>> want it to be over-aggressively always running or even limping if that 
>> is what it needs to do.
> And you can have it do that, we just prefer not to by default.
Got it!
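(And for completeness, since the thread started out about / on raid1: my 
understanding is that opting in to the "keep limping" behaviour is 
roughly the following - the UUID placeholder and exact option spelling 
are only an illustration:)

  # /etc/fstab - let the volume mount even with a member missing:
  UUID=<fs-uuid>  /  btrfs  defaults,degraded  0  0

  # and for the root filesystem itself, also on the kernel command line:
  #   rootflags=degraded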

Thread overview: 32+ messages
2019-02-01 10:28 btrfs as / filesystem in RAID1 Stefan K
2019-02-01 19:13 ` Hans van Kranenburg
2019-02-07 11:04   ` Stefan K
2019-02-07 12:18     ` Austin S. Hemmelgarn
2019-02-07 18:53       ` waxhead
2019-02-07 19:39         ` Austin S. Hemmelgarn
2019-02-07 21:21           ` Remi Gauvin
2019-02-08  4:51           ` Andrei Borzenkov
2019-02-08 12:54             ` Austin S. Hemmelgarn
2019-02-08  7:15           ` Stefan K
2019-02-08 12:58             ` Austin S. Hemmelgarn
2019-02-08 16:56             ` Chris Murphy
2019-02-08 18:10           ` waxhead
2019-02-08 19:17             ` Austin S. Hemmelgarn
2019-02-09 12:13               ` waxhead [this message]
2019-02-10 18:34                 ` Chris Murphy
2019-02-11 12:17                   ` Austin S. Hemmelgarn
2019-02-11 21:15                     ` Chris Murphy
2019-02-08 20:17             ` Chris Murphy
2019-02-07 17:15     ` Chris Murphy
2019-02-07 17:37       ` Martin Steigerwald
2019-02-07 22:19         ` Chris Murphy
2019-02-07 23:02           ` Remi Gauvin
2019-02-08  7:33           ` Stefan K
2019-02-08 17:26             ` Chris Murphy
2019-02-11  9:30     ` Anand Jain
2019-02-02 23:35 ` Chris Murphy
2019-02-04 17:47   ` Patrik Lundquist
2019-02-04 17:55     ` Austin S. Hemmelgarn
2019-02-04 22:19       ` Patrik Lundquist
2019-02-05  6:46         ` Chris Murphy
2019-02-05  7:37           ` Chris Murphy
