* degraded permanent mount option
@ 2018-01-26 14:02 Christophe Yayon
  2018-01-26 14:18 ` Austin S. Hemmelgarn
  2018-01-26 21:54 ` Chris Murphy
  0 siblings, 2 replies; 54+ messages in thread
From: Christophe Yayon @ 2018-01-26 14:02 UTC (permalink / raw)
  To: Majordomo vger.kernel.org

Hi all,

I don't know if this is the right place to ask. Sorry if it's not...

Just a little question about the "degraded" mount option. Is it a good idea to add this option permanently in fstab and in the GRUB rootflags for a raid1/10 array, just to allow the system to boot again if a single hdd fails?

Of course, I have some cron jobs to check my array's health.

Thanks.

-- 
  Christophe Yayon
  cyayon-list@nbux.org

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-26 14:02 degraded permanent mount option Christophe Yayon
@ 2018-01-26 14:18 ` Austin S. Hemmelgarn
  2018-01-26 14:47   ` Christophe Yayon
  2018-01-26 21:54 ` Chris Murphy
  1 sibling, 1 reply; 54+ messages in thread
From: Austin S. Hemmelgarn @ 2018-01-26 14:18 UTC (permalink / raw)
  To: Christophe Yayon, Majordomo vger.kernel.org

On 2018-01-26 09:02, Christophe Yayon wrote:
> Hi all,
> 
> I don't know if this is the right place to ask. Sorry if it's not...
No, it's just fine to ask here.  Questions like this are part of why the 
mailing list exists.
> 
> Just a little question about the "degraded" mount option. Is it a good idea to add this option permanently in fstab and in the GRUB rootflags for a raid1/10 array, just to allow the system to boot again if a single hdd fails?
Some people will disagree with me on this, but I would personally 
suggest not doing this.  I'm of the opinion that running an array 
degraded for any period of time beyond the bare minimum required to fix 
it is a bad idea, given that:
* It's not a widely tested configuration, so you are statistically more 
likely to run into previously unknown bugs.  Even aside from that, there 
are probably some edge cases that people have not yet found.
* There are some issues with older kernel versions trying to access the 
array after it's been mounted writable and degraded when it's only two 
devices in raid1 mode.  This in turn is a good example of the above 
point about not being widely tested, as it took quite a while for this 
problem to come up on the mailing list.
* Running degraded is liable to be slower, because the filesystem has to 
account for the fact that the missing device might reappear at any 
moment.  This is actually true of any replication system, not just BTRFS.
* For a 2 device raid1 volume, there is no functional advantage to 
running degraded with one device compared to converting to just use a 
single device (this is only true of BTRFS because of the fact that it's 
trivial to convert things, while for MD and LVM it is extremely 
complicated to do so online).

Additionally, adding the `degraded` mount option won't actually let you 
mount the root filesystem if you're using systemd as an init system, 
because systemd refuses to mount BTRFS volumes which have devices missing.

Assuming that the systemd thing isn't an issue for you, I would suggest 
instead creating a separate GRUB entry with the option set in rootflags. 
This will allow you to manually boot the system if the array is 
degraded, but will make sure you notice during boot (in my case, I don't 
even do that, but I'm sufficiently used to tweaking kernel parameters 
from GRUB prior to booting that a dedicated entry would just waste 
space).
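
For reference, a rough sketch of what such an extra entry could look like 
(e.g. dropped into /etc/grub.d/40_custom); the filesystem UUID and the 
kernel/initramfs file names below are placeholders, adjust them for your 
own layout:

  menuentry 'Linux (btrfs raid1, degraded fallback)' {
      insmod btrfs
      # placeholder filesystem UUID and Arch-style file names
      search --no-floppy --fs-uuid --set=root 0123abcd-placeholder-uuid
      linux /boot/vmlinuz-linux root=UUID=0123abcd-placeholder-uuid rw rootflags=degraded
      initrd /boot/initramfs-linux.img
  }
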
> 
> Of course, I have some cron jobs to check my array's health.
It's good to hear that you're taking the initiative to monitor things, 
however this fact doesn't really change my assessment above.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-26 14:18 ` Austin S. Hemmelgarn
@ 2018-01-26 14:47   ` Christophe Yayon
  2018-01-26 14:55     ` Austin S. Hemmelgarn
  2018-01-27  5:50     ` Andrei Borzenkov
  0 siblings, 2 replies; 54+ messages in thread
From: Christophe Yayon @ 2018-01-26 14:47 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Majordomo vger.kernel.org

Hi Austin,

Thanks for your answer. It was my opinion too, as "degraded" seems to be flagged as "Mostly OK" on the btrfs wiki status page. I am running Archlinux with a recent kernel on all my servers (because I use btrfs as my main filesystem, I need a recent kernel).

Your idea to add a separate entry in grub.cfg with rootflags=degraded is attractive; I will do this...

Just a last question: I thought that it was necessary to add the "degraded" option in grub.cfg AND fstab to allow booting in degraded mode. I am not sure that grub.cfg alone is sufficient...
Yesterday I did some tests and booted a system with only 1 of 2 drives in my root raid1 array. No problem with systemd, but I added both the rootflags and the fstab option. I didn't test with only rootflags.
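
For the record, the two places I touched look roughly like this (the UUID 
is a shortened placeholder):

  # /etc/fstab
  UUID=0123abcd-placeholder  /  btrfs  defaults,degraded  0  0

  # kernel command line in grub.cfg
  linux /boot/vmlinuz-linux root=UUID=0123abcd-placeholder rw rootflags=degraded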

Thanks. 


-- 
  Christophe Yayon
  cyayon-list@nbux.org

On Fri, Jan 26, 2018, at 15:18, Austin S. Hemmelgarn wrote:
> On 2018-01-26 09:02, Christophe Yayon wrote:
> > Hi all,
> > 
> > I don't know if it the right place to ask. Sorry it's not...
> No, it's just fine to ask here.  Questions like this are part of why the 
> mailing list exists.
> > 
> > Just a little question about "degraded" mount option. Is it a good idea to add this option (permanent) in fstab and grub rootflags for raid1/10 array ? Just to allow the system to boot again if a single hdd fail.
> Some people will disagree with me on this, but I would personally 
> suggest not doing this.  I'm of the opinion that running an array 
> degraded for any period of time beyond the bare minimum required to fix 
> it is a bad idea, given that:
> * It's not a widely tested configuration, so you are statistically more 
> likely to run into previously unknown bugs.  Even aside from that, there 
> are probably some edge cases that people have not yet found.
> * There are some issues with older kernel versions trying to access the 
> array after it's been mounted writable and degraded when it's only two 
> devices in raid1 mode.  This in turn is a good example of the above 
> point about not being widely tested, as it took quite a while for this 
> problem to come up on the mailing list.
> * Running degraded is liable to be slower, because the filesystem has to 
> account for the fact that the missing device might reappear at any 
> moment.  This is actually true of any replication system, not just BTRFS.
> * For a 2 device raid1 volume, there is no functional advantage to 
> running degraded with one device compared to converting to just use a 
> single device (this is only true of BTRFS because of the fact that it's 
> trivial to convert things, while for MD and LVM it is extremely 
> complicated to do so online).
> 
> Additionally, adding the `degraded` mount option won't actually let you 
> mount the root filesystem if you're using systemd as an init system, 
> because systemd refuses to mount BTRFS volumes which have devices missing.
> 
> Assuming that the systemd thing isn't an issue for you, I would suggest 
> instead creating a separate GRUB entry with the option set in rootflags. 
>   This will allow you to manually boot the system if the array is 
> degraded, but will make sure you notice during boot (in my case, I don't 
> even do that, but I'm also reasonably used to tweaking kernel parameters 
> from GRUB prior to booting the system that it would end up just wasting 
> space).
> > 
> > Of course, i have some cron jobs to check my array health.
> It's good to hear that you're taking the initiative to monitor things, 
> however this fact doesn't really change my assessment above.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-26 14:47   ` Christophe Yayon
@ 2018-01-26 14:55     ` Austin S. Hemmelgarn
  2018-01-27  5:50     ` Andrei Borzenkov
  1 sibling, 0 replies; 54+ messages in thread
From: Austin S. Hemmelgarn @ 2018-01-26 14:55 UTC (permalink / raw)
  To: Christophe Yayon, Majordomo vger.kernel.org

On 2018-01-26 09:47, Christophe Yayon wrote:
> Hi Austin,
> 
> Thanks for your answer. It was my opinion too, as "degraded" seems to be flagged as "Mostly OK" on the btrfs wiki status page. I am running Archlinux with a recent kernel on all my servers (because I use btrfs as my main filesystem, I need a recent kernel).
> 
> Your idea to add a separate entry in grub.cfg with rootflags=degraded is attractive; I will do this...
> 
> Just a last question: I thought that it was necessary to add the "degraded" option in grub.cfg AND fstab to allow booting in degraded mode. I am not sure that grub.cfg alone is sufficient...
> Yesterday I did some tests and booted a system with only 1 of 2 drives in my root raid1 array. No problem with systemd, but I added both the rootflags and the fstab option. I didn't test with only rootflags.
Hmm...  I'm pretty sure that you only need degraded in rootflags for a 
degraded boot without systemd involved.  Not sure about with systemd 
involved, though the fact that it worked with systemd at all is 
interesting, as last I knew systemd doesn't do any special casing for 
BTRFS and just looks at whether all the devices are registered with the 
kernel or not.

Also, as far as I know, `degraded` in the mount options won't cause any 
change in behavior if there is no device missing, so you're not really 
going to be running 'degraded' if you've got all your devices (though 
depending on how long it takes to scan devices, you may end up with some 
issues during boot when they're technically all present and working).

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-26 14:02 degraded permanent mount option Christophe Yayon
  2018-01-26 14:18 ` Austin S. Hemmelgarn
@ 2018-01-26 21:54 ` Chris Murphy
  2018-01-26 22:03   ` Christophe Yayon
  1 sibling, 1 reply; 54+ messages in thread
From: Chris Murphy @ 2018-01-26 21:54 UTC (permalink / raw)
  To: Christophe Yayon; +Cc: Majordomo vger.kernel.org

On Fri, Jan 26, 2018 at 7:02 AM, Christophe Yayon <cyayon-list@nbux.org> wrote:

> Just a little question about the "degraded" mount option. Is it a good idea to add this option permanently in fstab and in the GRUB rootflags for a raid1/10 array, just to allow the system to boot again if a single hdd fails?

No because it's going to open a window where a delayed member drive
will mean the volume is mounted degraded, which will happen silently.
And with current behavior in such a case, any new writes go to single
chunks. Again it's silent. When the delayed drive appears, it's not
going to be added, the volume is still treated as degraded. And even
when you remount to bring them all together in a normal mount, Btrfs
will not automatically sync the drives, so you will still have some
single chunk writes on one drive not the other. So you have a window
of time where there can be data loss if a real failure occurs, and you
need degraded mounting. Further, right now Btrfs will only do one
degraded rw mount, and you *must* fix that degradedness before it is
umounted or else you will only ever be able to mount it again ro.
There are unmerged patches to work around this, so you'd need to
commit to building your own kernel. I can't see any way of reliably
using Btrfs in production for the described use case otherwise. You
can't depend on getting the delayed or replacement drive restored, and
the volume made healthy again, because ostensibly the whole point of
the setup is having good uptime and you won't have that assurance
unless you carry these patches.
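
(As an aside, if you do end up in that situation, the stray single chunks 
are at least easy to spot with the usual reporting commands; the mount 
point is a placeholder:)

  btrfs filesystem df /mnt      # a "Data, single" line shows up if single chunks were created
  btrfs filesystem usage /mnt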

Also note that there are two kinds of degraded writes.

a.) The drive was missing at mount time and the volume is mounted
degraded; for raid1 volumes you get single chunks written. To sync once
the missing drive appears, you do a btrfs balance -dconvert=raid1,soft
-mconvert=raid1,soft, which should be fairly fast.

b.) The drive goes missing after a normal mount; Btrfs continues to
write out raid1 chunks. To sync once the missing drive appears, you have
to do a full scrub or balance of the entire volume; there's no shortcut.
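
Spelled out with a placeholder mount point, the respective resync commands 
are roughly:

  # case a: raid1 chunks were written as single while mounted degraded
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt

  # case b: the device dropped out after a normal mount; full resync needed
  btrfs scrub start /mnt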

Anyway, for the described use case I think you're better off with
mdadm or LVM raid1 or raid10, and then format with Btrfs and DUP
metadata (the mkfs default), in which case you get full data error
detection, plus metadata error detection and correction, as well as the
uptime you want.
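
A rough sketch of that layout, with made-up device names:

  mdadm --create /dev/md0 --level=raid1 --raid-devices=2 /dev/sda2 /dev/sdb2
  mkfs.btrfs /dev/md0    # single-device btrfs; metadata defaults to DUP on rotational media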

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-26 21:54 ` Chris Murphy
@ 2018-01-26 22:03   ` Christophe Yayon
  0 siblings, 0 replies; 54+ messages in thread
From: Christophe Yayon @ 2018-01-26 22:03 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Christophe Yayon, Majordomo vger.kernel.org

Hi Chris,

Thanks for this complete answer.

I have to do some benchmarks with mdadm raid and btrfs native raid...

Thanks

--
Christophe Yayon

> On 26 Jan 2018, at 22:54, Chris Murphy <lists@colorremedies.com> wrote:
> 
>> On Fri, Jan 26, 2018 at 7:02 AM, Christophe Yayon <cyayon-list@nbux.org> wrote:
>> 
>> Just a little question about "degraded" mount option. Is it a good idea to add this option (permanent) in fstab and grub rootflags for raid1/10 array ? Just to allow the system to boot again if a single hdd fail.
> 
> No because it's going to open a window where a delayed member drive
> will mean the volume is mounted degraded, which will happen silently.
> And current behavior in such a case, any new writes go to single
> chunks. Again it's silent. When the delayed drive appears, it's not
> going to be added, the volume is still treated as degraded. And even
> when you remount to bring them all together in a normal mount, Btrfs
> will not automatically sync the drives, so you will still have some
> single chunk writes on one drive not the other. So you have a window
> of time where there can be data loss if a real failure occurs, and you
> need degraded mounting. Further, right now Btrfs will only do one
> degraded rw mount, and you *must* fix that degradedness before it is
> umounted or else you will only ever be able to mount it again ro.
> There are unmerged patches to work around this, so you'd need to
> commit to building your own kernel. I can't see any way of reliably
> using Btrfs in production for the described use case otherwise. You
> can't depend on getting the delayed or replacement drive restored, and
> the volume made healthy again, because ostensibly the whole point of
> the setup is having good uptime and you won't have that assurance
> unless you carry these patches.
> 
> Also note that there are two kinds of degraded writes. a.) drive was
> missing at mount time, and volume is mounted degraded, for raid1
> volumes you get single chunks written; to sync once the missing drive
> appears you do a btrfs balance -dconvert=raid1,soft
> -mconvert=raid1,soft which should be fairly fast; b.) if the drive
> goes missing after a normal mount, Btrfs continues to write out raid1
> chunks; to sync once the missing drive appears you have to do a full
> scrub or balance of the entire volume there's no shortcut.
> 
> Anyway, for the described use case I think you're better off with
> mdadm or LVM raid1 or raid10, and then format with Btrfs and DUP
> metadata (default mkfs) in which case you get full error detection and
> metadata error detection and correction, as well as the uptime you
> want.
> 
> -- 
> Chris Murphy


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-26 14:47   ` Christophe Yayon
  2018-01-26 14:55     ` Austin S. Hemmelgarn
@ 2018-01-27  5:50     ` Andrei Borzenkov
       [not found]       ` <1517035210.1252874.1249880112.19FABD13@webmail.messagingengine.com>
  1 sibling, 1 reply; 54+ messages in thread
From: Andrei Borzenkov @ 2018-01-27  5:50 UTC (permalink / raw)
  To: Christophe Yayon, Austin S. Hemmelgarn, Majordomo vger.kernel.org

26.01.2018 17:47, Christophe Yayon пишет:
> Hi Austin,
> 
> Thanks for your answer. It was my opinion too, as "degraded" seems to be flagged as "Mostly OK" on the btrfs wiki status page. I am running Archlinux with a recent kernel on all my servers (because I use btrfs as my main filesystem, I need a recent kernel).
> 
> Your idea to add a separate entry in grub.cfg with rootflags=degraded is attractive; I will do this...
> 
> Just a last question: I thought that it was necessary to add the "degraded" option in grub.cfg AND fstab to allow booting in degraded mode. I am not sure that grub.cfg alone is sufficient...
> Yesterday I did some tests and booted a system with only 1 of 2 drives in my root raid1 array. No problem with systemd,

Are you using systemd in your initramfs (whatever implementation you are
using)? I just tested with dracut using the systemd dracut module and it
does not work - it hangs forever waiting for the device. Of course, there is
no way to abort it and go into the command line ...

Oh, wait - what device names are you using? I'm using mount by UUID and
this is where the problem starts - /dev/disk/by-uuid/xxx will not appear
unless all devices have been seen once ...
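
(The symlink in question is the one udev creates for the filesystem UUID, 
which every member device reports; with made-up device names you can see 
it like this:)

  blkid -s UUID -o value /dev/sda1 /dev/sdb1   # same filesystem UUID on both members
  ls /dev/disk/by-uuid/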

... and it still does not work even if I change it to root=/dev/sda1
explicitly because sda1 will *not* be announced as "present" to systemd
until all devices have been seen once ...

So no, it does not work with systemd *in initramfs*. Absolutely.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
       [not found]       ` <1517035210.1252874.1249880112.19FABD13@webmail.messagingengine.com>
@ 2018-01-27  6:43         ` Andrei Borzenkov
  2018-01-27  6:48           ` Christophe Yayon
  0 siblings, 1 reply; 54+ messages in thread
From: Andrei Borzenkov @ 2018-01-27  6:43 UTC (permalink / raw)
  To: Christophe Yayon, Austin S. Hemmelgarn, Majordomo vger.kernel.org

27.01.2018 09:40, Christophe Yayon пишет:
> Hi, 
> 
> I am using Archlinux with kernel 4.14; there is a btrfs module in the initrd.
> In fstab, root is mounted via UUID. As far as I know the UUID is the same
> for all devices in a raid array.
> The system boots with no problem with degraded and only 1 of the 2 root devices.

Then your initramfs does not use systemd.

> --
>   Christophe Yayon
>   cyayon-list@nbux.org
> 
> 
> 
> On Sat, Jan 27, 2018, at 06:50, Andrei Borzenkov wrote:
>> 26.01.2018 17:47, Christophe Yayon пишет:
>>> Hi Austin,
>>>
>>> Thanks for your answer. It was my opinion too, as "degraded" seems
>>> to be flagged as "Mostly OK" on the btrfs wiki status page. I am
>>> running Archlinux with a recent kernel on all my servers (because I
>>> use btrfs as my main filesystem, I need a recent kernel).
>>>
>>> Your idea to add a separate entry in grub.cfg with
>>> rootflags=degraded is attractive; I will do this...
>>>
>>> Just a last question: I thought that it was necessary to add the
>>> "degraded" option in grub.cfg AND fstab to allow booting in degraded
>>> mode. I am not sure that grub.cfg alone is sufficient...
>>> Yesterday I did some tests and booted a system with only 1 of
>>> 2 drives in my root raid1 array. No problem with systemd,
>>
>> Are you using systemd in your initramfs (whatever implementation you
>> are using)? I just tested with dracut using the systemd dracut module
>> and it does not work - it hangs forever waiting for the device. Of
>> course, there is no way to abort it and go into the command line ...
>>
>> Oh, wait - what device names are you using? I'm using mount by UUID
>> and this is where the problem starts - /dev/disk/by-uuid/xxx will not
>> appear unless all devices have been seen once ...
>>
>> ... and it still does not work even if I change it to root=/dev/sda1
>> explicitly because sda1 will *not* be announced as "present" to
>> systemd until all devices have been seen once ...
>>
>> So no, it does not work with systemd *in initramfs*. Absolutely.
> 
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27  6:43         ` Andrei Borzenkov
@ 2018-01-27  6:48           ` Christophe Yayon
  2018-01-27 10:08             ` Christophe Yayon
  0 siblings, 1 reply; 54+ messages in thread
From: Christophe Yayon @ 2018-01-27  6:48 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Majordomo vger.kernel.org

I think you are right; I do not see any systemd message when the degraded option is missing, and I have to remount manually with degraded.

It seems it is better to use mdadm for raid and btrfs on top of it, as I understand. Even with a recent kernel?
I have to do some benchmarks and compare...

Thanks

--
Christophe Yayon

> On 27 Jan 2018, at 07:43, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> 
> 27.01.2018 09:40, Christophe Yayon пишет:
>> Hi, 
>> 
>> I am using archlinux with kernel 4.14, there is btrfs module in initrd.
>> In fstab root is mounted via UUID. As far as I know the UUID is the same
>> for all devices in raid array.
>> The system boot with no problem with degraded and only 1/2 root device.
> 
> Then your initramfs does not use systemd.
> 
>> --
>>  Christophe Yayon
>>  cyayon-list@nbux.org
>> 
>> 
>> 
>>> On Sat, Jan 27, 2018, at 06:50, Andrei Borzenkov wrote:
>>> 26.01.2018 17:47, Christophe Yayon пишет:
>>>> Hi Austin,
>>>> 
>>>> Thanks for your answer. It was my opinion too, as "degraded" seems
>>>> to be flagged as "Mostly OK" on the btrfs wiki status page. I am
>>>> running Archlinux with a recent kernel on all my servers (because I
>>>> use btrfs as my main filesystem, I need a recent kernel).
>>>>
>>>> Your idea to add a separate entry in grub.cfg with
>>>> rootflags=degraded is attractive; I will do this...
>>>>
>>>> Just a last question: I thought that it was necessary to add the
>>>> "degraded" option in grub.cfg AND fstab to allow booting in degraded
>>>> mode. I am not sure that grub.cfg alone is sufficient...
>>>> Yesterday I did some tests and booted a system with only 1 of
>>>> 2 drives in my root raid1 array. No problem with systemd,
>>>
>>> Are you using systemd in your initramfs (whatever implementation you
>>> are using)? I just tested with dracut using the systemd dracut module
>>> and it does not work - it hangs forever waiting for the device. Of
>>> course, there is no way to abort it and go into the command line ...
>>>
>>> Oh, wait - what device names are you using? I'm using mount by UUID
>>> and this is where the problem starts - /dev/disk/by-uuid/xxx will not
>>> appear unless all devices have been seen once ...
>>>
>>> ... and it still does not work even if I change it to root=/dev/sda1
>>> explicitly because sda1 will *not* be announced as "present" to
>>> systemd until all devices have been seen once ...
>>>
>>> So no, it does not work with systemd *in initramfs*. Absolutely.
>> 
>> 
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27  6:48           ` Christophe Yayon
@ 2018-01-27 10:08             ` Christophe Yayon
  2018-01-27 10:26               ` Andrei Borzenkov
  0 siblings, 1 reply; 54+ messages in thread
From: Christophe Yayon @ 2018-01-27 10:08 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Majordomo vger.kernel.org

I just tested booting with a single drive (raid1 degraded); even with the degraded option in fstab and grub, it is unable to boot! The boot process stops in the initramfs.

Is there a solution to boot with systemd and a degraded array?

Thanks 

--
Christophe Yayon

> On 27 Jan 2018, at 07:48, Christophe Yayon <cyayon@nbux.org> wrote:
> 
> I think you are right, i do not see any systemd message when degraded option is missing and have to remount manually with degraded.
> 
> It seems it is better to use mdadm for raid and btrfs over it as i understand. Even in recent kernel ?
> I hav me to do some bench and compare...
> 
> Thanks
> 
> --
> Christophe Yayon
> 
>> On 27 Jan 2018, at 07:43, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>> 
>> 27.01.2018 09:40, Christophe Yayon пишет:
>>> Hi, 
>>> 
>>> I am using archlinux with kernel 4.14, there is btrfs module in initrd.
>>> In fstab root is mounted via UUID. As far as I know the UUID is the same
>>> for all devices in raid array.
>>> The system boot with no problem with degraded and only 1/2 root device.
>> 
>> Then your initramfs does not use systemd.
>> 
>>> --
>>> Christophe Yayon
>>> cyayon-list@nbux.org
>>> 
>>> 
>>> 
>>>>> On Sat, Jan 27, 2018, at 06:50, Andrei Borzenkov wrote:
>>>>> 26.01.2018 17:47, Christophe Yayon пишет:
>>>>> Hi Austin,
>>>>> 
>>>>> Thanks for your answer. It was my opinion too, as "degraded" seems
>>>>> to be flagged as "Mostly OK" on the btrfs wiki status page. I am
>>>>> running Archlinux with a recent kernel on all my servers (because I
>>>>> use btrfs as my main filesystem, I need a recent kernel).
>>>>>
>>>>> Your idea to add a separate entry in grub.cfg with
>>>>> rootflags=degraded is attractive; I will do this...
>>>>>
>>>>> Just a last question: I thought that it was necessary to add the
>>>>> "degraded" option in grub.cfg AND fstab to allow booting in degraded
>>>>> mode. I am not sure that grub.cfg alone is sufficient...
>>>>> Yesterday I did some tests and booted a system with only 1 of
>>>>> 2 drives in my root raid1 array. No problem with systemd,
>>>>
>>>> Are you using systemd in your initramfs (whatever implementation you
>>>> are using)? I just tested with dracut using the systemd dracut module
>>>> and it does not work - it hangs forever waiting for the device. Of
>>>> course, there is no way to abort it and go into the command line ...
>>>>
>>>> Oh, wait - what device names are you using? I'm using mount by UUID
>>>> and this is where the problem starts - /dev/disk/by-uuid/xxx will not
>>>> appear unless all devices have been seen once ...
>>>>
>>>> ... and it still does not work even if I change it to root=/dev/sda1
>>>> explicitly because sda1 will *not* be announced as "present" to
>>>> systemd until all devices have been seen once ...
>>>>
>>>> So no, it does not work with systemd *in initramfs*. Absolutely.
>>> 
>>> 
>> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27 10:08             ` Christophe Yayon
@ 2018-01-27 10:26               ` Andrei Borzenkov
  2018-01-27 11:06                 ` Tomasz Pala
  0 siblings, 1 reply; 54+ messages in thread
From: Andrei Borzenkov @ 2018-01-27 10:26 UTC (permalink / raw)
  To: Christophe Yayon; +Cc: Majordomo vger.kernel.org

27.01.2018 13:08, Christophe Yayon пишет:
> I just tested booting with a single drive (raid1 degraded); even with the degraded option in fstab and grub, it is unable to boot! The boot process stops in the initramfs.
> 
> Is there a solution to boot with systemd and a degraded array?

No. It is finger pointing. Both btrfs and systemd developers say
everything is fine from their point of view.

> 
> Thanks 
> 
> --
> Christophe Yayon
> 
>> On 27 Jan 2018, at 07:48, Christophe Yayon <cyayon@nbux.org> wrote:
>>
>> I think you are right, i do not see any systemd message when degraded option is missing and have to remount manually with degraded.
>>
>> It seems it is better to use mdadm for raid and btrfs over it as i understand. Even in recent kernel ?
>> I hav me to do some bench and compare...
>>
>> Thanks
>>
>> --
>> Christophe Yayon
>>
>>> On 27 Jan 2018, at 07:43, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>>>
>>> 27.01.2018 09:40, Christophe Yayon пишет:
>>>> Hi, 
>>>>
>>>> I am using archlinux with kernel 4.14, there is btrfs module in initrd.
>>>> In fstab root is mounted via UUID. As far as I know the UUID is the same
>>>> for all devices in raid array.
>>>> The system boot with no problem with degraded and only 1/2 root device.
>>>
>>> Then your initramfs does not use systemd.
>>>
>>>> --
>>>> Christophe Yayon
>>>> cyayon-list@nbux.org
>>>>
>>>>
>>>>
>>>>>> On Sat, Jan 27, 2018, at 06:50, Andrei Borzenkov wrote:
>>>>>> 26.01.2018 17:47, Christophe Yayon пишет:
>>>>>> Hi Austin,
>>>>>>
>>>>>> Thanks for your answer. It was my opinion too, as "degraded" seems
>>>>>> to be flagged as "Mostly OK" on the btrfs wiki status page. I am
>>>>>> running Archlinux with a recent kernel on all my servers (because I
>>>>>> use btrfs as my main filesystem, I need a recent kernel).
>>>>>>
>>>>>> Your idea to add a separate entry in grub.cfg with
>>>>>> rootflags=degraded is attractive; I will do this...
>>>>>>
>>>>>> Just a last question: I thought that it was necessary to add the
>>>>>> "degraded" option in grub.cfg AND fstab to allow booting in degraded
>>>>>> mode. I am not sure that grub.cfg alone is sufficient...
>>>>>> Yesterday I did some tests and booted a system with only 1 of
>>>>>> 2 drives in my root raid1 array. No problem with systemd,
>>>>>
>>>>> Are you using systemd in your initramfs (whatever implementation you
>>>>> are using)? I just tested with dracut using the systemd dracut module
>>>>> and it does not work - it hangs forever waiting for the device. Of
>>>>> course, there is no way to abort it and go into the command line ...
>>>>>
>>>>> Oh, wait - what device names are you using? I'm using mount by UUID
>>>>> and this is where the problem starts - /dev/disk/by-uuid/xxx will not
>>>>> appear unless all devices have been seen once ...
>>>>>
>>>>> ... and it still does not work even if I change it to root=/dev/sda1
>>>>> explicitly because sda1 will *not* be announced as "present" to
>>>>> systemd until all devices have been seen once ...
>>>>>
>>>>> So no, it does not work with systemd *in initramfs*. Absolutely.
>>>>
>>>>
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27 10:26               ` Andrei Borzenkov
@ 2018-01-27 11:06                 ` Tomasz Pala
  2018-01-27 13:26                   ` Adam Borowski
  2018-01-27 20:57                   ` Chris Murphy
  0 siblings, 2 replies; 54+ messages in thread
From: Tomasz Pala @ 2018-01-27 11:06 UTC (permalink / raw)
  To: Majordomo vger.kernel.org

On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:

>> I just tested booting with a single drive (raid1 degraded); even with the degraded option in fstab and grub, it is unable to boot! The boot process stops in the initramfs.
>> 
>> Is there a solution to boot with systemd and a degraded array?
> 
> No. It is finger pointing. Both btrfs and systemd developers say
> everything is fine from their point of view.

Treating a btrfs volume as ready by systemd would open a window of
opportunity in which the volume would be mounted degraded _despite_ all
the components being (meaning: "soon to be") ready - just like Chris
Murphy wrote; provided there is -o degraded somewhere.

This is not a systemd issue, but apparently a btrfs design choice to allow
using any single component device name also as the volume name itself.

IF a volume has the degraded flag, then it is btrfs's job to mark it as ready:

>>>>>> ... and it still does not work even if I change it to root=/dev/sda1
>>>>>> explicitly because sda1 will *not* be announced as "present" to
>>>>>> systemd> until all devices have been seen once ...

...so this scenario would obviously and magically start working.

As for the regular by-UUID mounts: these links are created by udev WHEN
the underlying devices appear. Does the btrfs volume appear? No.

If btrfs pretends to be a device manager, it should expose more states,
especially "ready to be mounted, but not fully populated" (i.e.
"degraded mount possible"). Then systemd could _fall back_ after timing
out to a degraded mount automatically, according to some systemd-level
option.

Unless there is *some* signalling from btrfs, there is really not much
systemd can *safely* do.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27 11:06                 ` Tomasz Pala
@ 2018-01-27 13:26                   ` Adam Borowski
  2018-01-27 14:36                     ` Goffredo Baroncelli
                                       ` (3 more replies)
  2018-01-27 20:57                   ` Chris Murphy
  1 sibling, 4 replies; 54+ messages in thread
From: Adam Borowski @ 2018-01-27 13:26 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: Majordomo vger.kernel.org

On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
> 
> >> I just tested to boot with a single drive (raid1 degraded), even with
> >> degraded option in fstab and grub, unable to boot !  The boot process
> >> stop on initramfs.
> >> 
> >> Is there a solution to boot with systemd and degraded array ?
> > 
> > No. It is finger pointing. Both btrfs and systemd developers say
> > everything is fine from their point of view.

It's quite obvious who's the culprit: every single remaining rc system
manages to mount degraded btrfs without problems.  They just don't try to
outsmart the kernel.

> Treating btrfs volume as ready by systemd would open a window of
> opportunity when volume would be mounted degraded _despite_ all the
> components are (meaning: "would soon") be ready - just like Chris Murphy
> wrote; provided there is -o degraded somewhere.

For this reason, currently hardcoding -o degraded isn't a wise choice.  This
might change once autoresync and devices coming back at runtime are
implemented.

> This is not a systemd issue, but apparently btrfs design choice to allow
> using any single component device name also as volume name itself.

And what other user interface would you propose?  The only alternative I see
is inventing a device manager (like you're implying below that btrfs does),
which would needlessly complicate the usual, single-device, case.
 
> If btrfs pretends to be device manager it should expose more states,

But it doesn't pretend to.

> especially "ready to be mounted, but not fully populated" (i.e.
> "degraded mount possible"). Then systemd could _fallback_ after timing
> out to degraded mount automatically according to some systemd-level
> option.

You're assuming that btrfs somehow knows this itself.  Contrary to the bogus
assumption systemd makes - that by counting devices you can know whether a
degraded or non-degraded mount is possible - it is in general not possible to
know whether a mount attempt will succeed without actually trying.

Compare with the 4.14 chunk check patchset by Qu -- in the past, btrfs did
naive counting of this kind, it had to be replaced by actually checking
whether at least one copy of every block group is actually present.

An example scenario: you have a 3-device filesystem, sda sdb sdc.  Suddenly,
sda goes offline due to a loose cable, controller hiccup, evil fairies, or
something of this kind.  The sysadmin notices this, rushes in with an
USB-attached disk (sdd), rebalances.  After reboot, sda works well (or got
its cable reseated, etc), while sdd either got accidentally removed or is
just slow to initialize (USB...).  So, systemd asks sda how many devices
there are, answer is "3" (sdb and sdc would answer the same, BTW).  It can
even ask for UUIDs -- all devices are present.  So, mount will succeed,
right?
 
> Unless there is *some* signalling from btrfs, there is really not much
> systemd can *safely* do.

Btrfs already tells everything it knows.  To learn more, you need to do most
of the mount process (whether you continue or abort is another matter). 
This can't be done sanely from outside the kernel.  Adding finer control
would be reasonable ("wait and block" vs "try and return immediately") but
that's about all.  It's be also wrong to have a different interface for
daemon X than for humans.

Ie, the thing systemd can safely do is to stop trying to rule everything,
and refrain from telling the user whether he can mount something or not.
And especially, refrain from unmounting after the user mounts manually...


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄⠀⠀⠀⠀ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27 13:26                   ` Adam Borowski
@ 2018-01-27 14:36                     ` Goffredo Baroncelli
  2018-01-27 15:38                       ` Adam Borowski
  2018-01-27 15:22                     ` Duncan
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 54+ messages in thread
From: Goffredo Baroncelli @ 2018-01-27 14:36 UTC (permalink / raw)
  To: Adam Borowski, Tomasz Pala; +Cc: Majordomo vger.kernel.org

On 01/27/2018 02:26 PM, Adam Borowski wrote:
> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>>
>>>> I just tested to boot with a single drive (raid1 degraded), even with
>>>> degraded option in fstab and grub, unable to boot !  The boot process
>>>> stop on initramfs.
>>>>
>>>> Is there a solution to boot with systemd and degraded array ?
>>>
>>> No. It is finger pointing. Both btrfs and systemd developers say
>>> everything is fine from their point of view.
> 
> It's quite obvious who's the culprit: every single remaining rc system
> manages to mount degraded btrfs without problems.  They just don't try to
> outsmart the kernel.

I think that the real problem is that mounting a btrfs filesystem cannot be a responsibility of systemd (or whichever rc-system). Unfortunately, in the past it was thought that it would be sufficient to assemble a device list in the kernel, then issue a simple mount...

I think that all the possible scenarios of a btrfs filesystem are a lot wider than for a conventional one, and this approach is too basic.

Systemd is another factor (which spreads the responsibilities); but it is not the real problem.

In the past[*] I proposed a mount helper, which would perform all the device registering and mounting in degraded mode (depending on the option). My idea is that all the policies should be placed in only one place. Now some policies are in the kernel, some in udev, some in systemd... It is a mess. And if something goes wrong, you have to look at several logs to understand what/where the problem is..

I have to point out that there is no sane default for mounting in degraded mode or not. It may be that RAID1/10 is now "mount-degraded" friendly, so it would be a sane default; but for the others (raid5/6) I think that this is not mature enough. And hybrid filesystems (both RAID1/10 and RAID5/6) can exist.

Mounting in degraded mode would be better for a root filesystem than for a non-root one (think about a remote machine)....

BR
G.Baroncelli

[*]
https://www.spinics.net/lists/linux-btrfs/msg39706.html




> 
>> Treating btrfs volume as ready by systemd would open a window of
>> opportunity when volume would be mounted degraded _despite_ all the
>> components are (meaning: "would soon") be ready - just like Chris Murphy
>> wrote; provided there is -o degraded somewhere.
> 
> For this reason, currently hardcoding -o degraded isn't a wise choice.  This
> might chance once autoresync and devices coming back at runtime are
> implemented.
> 
>> This is not a systemd issue, but apparently btrfs design choice to allow
>> using any single component device name also as volume name itself.
> 
> And what other user interface would you propose?  The only alternative I see
> is inventing a device manager (like you're implying below that btrfs does),
> which would needlessly complicate the usual, single-device, case.
>  
>> If btrfs pretends to be device manager it should expose more states,
> 
> But it doesn't pretend to.
> 
>> especially "ready to be mounted, but not fully populated" (i.e.
>> "degraded mount possible"). Then systemd could _fallback_ after timing
>> out to degraded mount automatically according to some systemd-level
>> option.
> 
> You're assuming that btrfs somehow knows this itself.  Unlike the bogus
> assumption systemd does that by counting devices you can know whether a
> degraded or non-degraded mount is possible, it is in general not possible to
> know whether a mount attempt will succeed without actually trying.
> 
> Compare with the 4.14 chunk check patchset by Qu -- in the past, btrfs did
> naive counting of this kind, it had to be replaced by actually checking
> whether at least one copy of every block group is actually present.
> 
> An example scenario: you have a 3-device filesystem, sda sdb sdc.  Suddenly,
> sda goes offline due to a loose cable, controller hiccup, evil fairies, or
> something of this kind.  The sysadmin notices this, rushes in with an
> USB-attached disk (sdd), rebalances.  After reboot, sda works well (or got
> its cable reseated, etc), while sdd either got accidentally removed or is
> just slow to initialize (USB...).  So, systemd asks sda how many devices
> there are, answer is "3" (sdb and sdc would answer the same, BTW).  It can
> even ask for UUIDs -- all devices are present.  So, mount will succeed,
> right?
>  
>> Unless there is *some* signalling from btrfs, there is really not much
>> systemd can *safely* do.
> 
> Btrfs already tells everything it knows.  To learn more, you need to do most
> of the mount process (whether you continue or abort is another matter). 
> This can't be done sanely from outside the kernel.  Adding finer control
> would be reasonable ("wait and block" vs "try and return immediately") but
> that's about all.  It's be also wrong to have a different interface for
> daemon X than for humans.
> 
> Ie, the thing systemd can safely do, is to stop trying to rule everything,
> and refrain from telling the user whether he can mount something or not.
> And especially, unmounting after the user mounts manually...
> 
> 
> Meow!
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27 13:26                   ` Adam Borowski
  2018-01-27 14:36                     ` Goffredo Baroncelli
@ 2018-01-27 15:22                     ` Duncan
  2018-01-28  0:39                       ` Tomasz Pala
  2018-01-28  8:06                       ` Andrei Borzenkov
  2018-01-27 21:12                     ` Chris Murphy
  2018-01-27 22:42                     ` Tomasz Pala
  3 siblings, 2 replies; 54+ messages in thread
From: Duncan @ 2018-01-27 15:22 UTC (permalink / raw)
  To: linux-btrfs

Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:

> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>> 
>> >> I just tested to boot with a single drive (raid1 degraded), even
>> >> with degraded option in fstab and grub, unable to boot !  The boot
>> >> process stop on initramfs.
>> >> 
>> >> Is there a solution to boot with systemd and degraded array ?
>> > 
>> > No. It is finger pointing. Both btrfs and systemd developers say
>> > everything is fine from their point of view.
> 
> It's quite obvious who's the culprit: every single remaining rc system
> manages to mount degraded btrfs without problems.  They just don't try
> to outsmart the kernel.

No kidding.

All systemd has to do is leave the mount alone that the kernel has 
already done, instead of insisting it knows what's going on better than 
the kernel does, and immediately umounting it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27 14:36                     ` Goffredo Baroncelli
@ 2018-01-27 15:38                       ` Adam Borowski
  0 siblings, 0 replies; 54+ messages in thread
From: Adam Borowski @ 2018-01-27 15:38 UTC (permalink / raw)
  To: kreijack; +Cc: Tomasz Pala, Majordomo vger.kernel.org

On Sat, Jan 27, 2018 at 03:36:48PM +0100, Goffredo Baroncelli wrote:
> I think that the real problem relies that the mounting a btrfs filesystem
> cannot be a responsibility of systemd (or whichever rc-system). 
> Unfortunately in the past it was thought that it would be sufficient to
> assemble a devices list in the kernel, then issue a simple mount...

Yeah... every device that comes online may have its own idea what devices
are part of the filesystem.  There's also a quite separate question whether
we have enough chunks for a degraded mount (implemented by Qu), which
requires reading the chunk tree.

> In the past[*] I proposed a mount helper, which would perform all the
> device registering and mounting in degraded mode (depending by the
> option).  My idea is that all the policies should be placed only in one
> place.  Now some policies are in the kernel, some in udev, some in
> systemd...  It is a mess.  And if something goes wrong, you have to look
> to several logs to understand which/where is the problem..

Since most of the logic needs to be in the kernel anyway, I believe it'd be
best to keep as much as possible in the kernel, and let the userspace
request at most "try regular/degraded mount, block/don't block".  Anything
else would be duplicating functionality.

> I have to point out that there is not a sane default for mounting in
> degraded mode or not.  May be that now RAID1/10 are "mount-degraded"
> friendly, so it would be a sane default; but for other (raid5/6) I think
> that this is not mature enough.  And it is possible to exist hybrid
> filesystem (both RAID1/10 and RAID5/6)

Not yet: if one of the devices comes a bit late, btrfs won't let it into the
filesystem yet (patches to do so have been proposed), and if you run
degraded for even a moment, a very lengthy action is required.  That lengthy
action could be improved -- we can note the last generation when the raid
was complete[1], and scrub/balance only extents newer than that[2] -- but
that's a SMOC then SMOR, and I don't see volunteers yet.

Thus, auto-degrading without a hearty timeout first is currently sitting
strongly in the "do not want" land.

> Mounting in degraded mode would be better for a root filesystem, than a
> non-root one (think about remote machine)....

I for one use ext4-on-md for root, and btrfs raid for the actual data.  It's
not like production servers see much / churn anyway.


Meow!

[1]. Extra fun for raid6 (or possible future raid1×N where N>2 modes):
there's "fully complete", "degraded missing A", "degraded missing B",
"degraded missing A and B".

[2]. NOCOW extents would require an artificial generation bump upon writing
to whenever the level of degradeness changes.
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄⠀⠀⠀⠀ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27 11:06                 ` Tomasz Pala
  2018-01-27 13:26                   ` Adam Borowski
@ 2018-01-27 20:57                   ` Chris Murphy
  2018-01-28  0:00                     ` Tomasz Pala
  1 sibling, 1 reply; 54+ messages in thread
From: Chris Murphy @ 2018-01-27 20:57 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: Majordomo vger.kernel.org

On Sat, Jan 27, 2018 at 4:06 AM, Tomasz Pala <gotar@polanet.pl> wrote:

> As for the regular by-UUID mounts: these links are created by udev WHEN
> underlying devices appear. Does btrfs volume appear? No.

If I boot with rd.break=pre-mount I can absolutely mount a multi-device
Btrfs volume that has a missing device by UUID with the --uuid flag, or
by /dev/sdXY, along with -o degraded. And I can then use the exit
command to continue the startup process. In fact I can try to mount
without -o degraded, and the mount command "works" in that it does not
complain about an invalid device node or UUID.

The Btrfs systemd udev rule is a sledgehammer because it has no
timeout. It neither times out and tries to mount anyway, nor does it
time out and just drop to a dracut prompt. There are a number of
things in systemd startups that have timeouts; I have no idea how they
get defined, but that single thing would make this a lot better. Right
now the Btrfs udev rule means that if all devices aren't available, it
hangs indefinitely.
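
For reference, the rule in question (64-btrfs.rules as shipped with 
systemd; quoted from memory, so treat it as approximate) is nothing but a 
readiness gate, with no timer anywhere:

  SUBSYSTEM!="block", GOTO="btrfs_end"
  ACTION=="remove", GOTO="btrfs_end"
  ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

  # ask the kernel (BTRFS_IOC_DEVICES_READY) whether all member devices are present
  IMPORT{builtin}="btrfs ready $devnode"

  # if not, tell systemd the device is not usable yet
  ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

  LABEL="btrfs_end"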

I don't know systemd or systemd-udev well enough at all to know if
this rule can have a timer. Service units absolutely can have timers,
so maybe there's a way to marry a udev rule with a service which has a
timer. The absolute dumbest thing that's better than now, is at the
timer just fail and drop to a dracut prompt. Better would be to try a
normal mount anyway, which also fails to a dracut prompt, but
additionally gives us a kernel error for Btrfs (the missing device
open ctree error you'd expect to get when mounting without -o degraded
when you're missing a device). And even better would be a way for the
user to edit the service unit to indicate "upon timeout being reached,
use mount -o degraded rather than just mount". This is the simplest of
Boolean logic, so I'd be surprised if systemd doesn't offer a way for
us to do exactly what I'm describing.

Again, the central problem is that the udev rule now means "wait for device
to appear" with no timed fallback.

The mdadm case has this, and it's done by dracut. At this same stage
of startup with a missing device, there is in fact no fs volume UUID
yet because the array hasn't started. Dracut+mdadm knows there's a
missing device so it's just iterating: look, sleep 3, look, sleep 3,
look, sleep 3. It's on a loop. And after that loop hits something like
100, the script says f it, start array anyway, so now there is a
degraded array, and for the first time the fs volume UUID appears, and
systemd goes "ahaha! mount that!" and it does it normally.

So the timer and timeout and what happens at the timeout is defined by
dracut. That's probably why the systemd folks say "not our problem"
and why the kernel folks say "not our problem".


> If btrfs pretends to be device manager it should expose more states,
> especially "ready to be mounted, but not fully populated" (i.e.
> "degraded mount possible"). Then systemd could _fallback_ after timing
> out to degraded mount automatically according to some systemd-level
> option.

No, mdadm is a device manager and it has no such facility. Something
issues a command to start the array anyway, and only then do you find
out if there are enough devices to start it. I don't understand the
value of knowing whether it is possible. Just try to mount it degraded
and then if it fails we fail; nothing can be done automatically, it's
up to an admin.

And even if you had this "degraded mount possible" state, you still
need a timer. So just build the timer.

If all devices ready ioctl is true, the timer doesn't start, it means
all devices are available, mount normally.
If all devices ready ioctl is false, the timer starts, if all devices
appear later the ioctl goes to true, the timer is belayed, mount
normally.
If all devices ready ioctl is false, the timer starts, when the timer
times out, mount normally which fails and gives us a shell to
troubleshoot at.
OR
If all devices ready ioctl is false, the timer starts, when the timer
times out, mount with -o degraded which either succeeds and we boot or
it fails and we have a troubleshooting shell.


The central problem is the lack of a timer and timeout.


> Unless there is *some* signalling from btrfs, there is really not much
> systemd can *safely* do.

That is not true. It's not how mdadm works anyway.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27 13:26                   ` Adam Borowski
  2018-01-27 14:36                     ` Goffredo Baroncelli
  2018-01-27 15:22                     ` Duncan
@ 2018-01-27 21:12                     ` Chris Murphy
  2018-01-28  0:16                       ` Tomasz Pala
  2018-01-27 22:42                     ` Tomasz Pala
  3 siblings, 1 reply; 54+ messages in thread
From: Chris Murphy @ 2018-01-27 21:12 UTC (permalink / raw)
  To: Adam Borowski; +Cc: Tomasz Pala, Majordomo vger.kernel.org

On Sat, Jan 27, 2018 at 6:26 AM, Adam Borowski <kilobyte@angband.pl> wrote:

> You're assuming that btrfs somehow knows this itself.  Unlike the bogus
> assumption systemd does that by counting devices you can know whether a
> degraded or non-degraded mount is possible, it is in general not possible to
> know whether a mount attempt will succeed without actually trying.

That's right, although a small clarification is in order: systemd
doesn't count devices itself. The Btrfs systemd udev rule defers to
Btrfs kernel code by using BTRFS_IOC_DEVICES_READY. And it's totally
binary. Either they are all ready, in which case it exits 0, or they
aren't, in which case it exits 1.
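
(The same check is exposed as a plain subcommand, so it is easy to poke at 
by hand; the device name is made up:)

  btrfs device ready /dev/sda2 && echo "all members present" || echo "not (yet) complete"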

But yes, mounting whether degraded or not is sufficiently complicated
that you just have to try it. I don't get the point of wanting to know
whether it's possible without trying. Why would this information be
useful if you were NOT going to mount it?


> Ie, the thing systemd can safely do, is to stop trying to rule everything,
> and refrain from telling the user whether he can mount something or not.

Right. The open question is whether the timer and timeout can be
implemented in the systemd world, and I don't see why not; I certainly
see it put various services on timers, some of which are indefinite,
some are 1m30s and others are 3m. Pretty much every unit gets a
discrete boot line with a green dot or red cylon eye as it waits. I
don't see why at the very least we don't have that for Btrfs rootfs
mounts because *that* alone would at least clue in a user why their
startup is totally hung indefinitely.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27 13:26                   ` Adam Borowski
                                       ` (2 preceding siblings ...)
  2018-01-27 21:12                     ` Chris Murphy
@ 2018-01-27 22:42                     ` Tomasz Pala
  2018-01-29 13:42                       ` Austin S. Hemmelgarn
  3 siblings, 1 reply; 54+ messages in thread
From: Tomasz Pala @ 2018-01-27 22:42 UTC (permalink / raw)
  To: Majordomo vger.kernel.org

On Sat, Jan 27, 2018 at 14:26:41 +0100, Adam Borowski wrote:

> It's quite obvious who's the culprit: every single remaining rc system
> manages to mount degraded btrfs without problems.  They just don't try to
> outsmart the kernel.

Yes. They are stupid enough to fail miserably with any more complicated
setups, like stacking volume managers, crypto layer, network attached
storage etc.
Recently I've started mdadm on top of a bunch of LVM volumes, with others
using btrfs and others prepared for crypto. And you know what? systemd
assembled everything just fine.

So, with an argument just like yours:

It's quite obvious who's the culprit: every single remaining filesystem
manages to mount under systemd without problems. They just expose
information about their state.

>> This is not a systemd issue, but apparently btrfs design choice to allow
>> using any single component device name also as volume name itself.
> 
> And what other user interface would you propose? The only alternative I see
> is inventing a device manager (like you're implying below that btrfs does),
> which would needlessly complicate the usual, single-device, case.

The 'needless complication', as you named it, usually should be the default
to use. Avoiding LVM? Then take care of repartitioning. Avoiding mdadm?
No easy way to RAID the drive (there are device-mapper tricks, they are
just way more complicated). Even attaching an SSD cache is not trivial
without preparations (for bcache the preparation is absolutely necessary;
it is much easier with LVM in place).

>> If btrfs pretends to be device manager it should expose more states,
> 
> But it doesn't pretend to.

Why does mounting sda2 require sdb2 in my setup, then?

>> especially "ready to be mounted, but not fully populated" (i.e.
>> "degraded mount possible"). Then systemd could _fallback_ after timing
>> out to degraded mount automatically according to some systemd-level
>> option.
> 
> You're assuming that btrfs somehow knows this itself.

"It's quite obvious who's the culprit: every single volume manager keeps
track of it's component devices".

>  Unlike the bogus
> assumption systemd does that by counting devices you can know whether a
> degraded or non-degraded mount is possible, it is in general not possible to
> know whether a mount attempt will succeed without actually trying.

There is a term for such a situation: broken by design.

> Compare with the 4.14 chunk check patchset by Qu -- in the past, btrfs did
> naive counting of this kind, it had to be replaced by actually checking
> whether at least one copy of every block group is actually present.

And you still blame systemd for using BTRFS_IOC_DEVICES_READY?

[...]
> just slow to initialize (USB...).  So, systemd asks sda how many devices
> there are, answer is "3" (sdb and sdc would answer the same, BTW).  It can
> even ask for UUIDs -- all devices are present.  So, mount will succeed,
> right?

Systemd doesn't count anything, it asks BTRFS_IOC_DEVICES_READY as
implemented in btrfs/super.c.

> Ie, the thing systemd can safely do, is to stop trying to rule everything,
> and refrain from telling the user whether he can mount something or not.

Just change the BTRFS_IOC_DEVICES_READY handler to always return READY.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27 20:57                   ` Chris Murphy
@ 2018-01-28  0:00                     ` Tomasz Pala
  2018-01-28 10:43                       ` Tomasz Pala
  0 siblings, 1 reply; 54+ messages in thread
From: Tomasz Pala @ 2018-01-28  0:00 UTC (permalink / raw)
  To: Majordomo vger.kernel.org

On Sat, Jan 27, 2018 at 13:57:29 -0700, Chris Murphy wrote:

> The Btrfs systemd udev rule is a sledghammer because it has no
> timeout. It neither times out and tries to mount anyway, nor does it
> time out and just drop to a dracut prompt. There are a number of
> things in systemd startups that have timeouts, I have no idea how they
> get defined, but that single thing would make this a lot better. Right
> now the Btrfs udev rule means if all devices aren't available, hang
> indefinitely.

You are mixing up udev and systemd:
- udev doesn't wait for anything - it REACTS to events. Blame the part
  that doesn't emit the event you want.
- systemd WAITS for a device to appear AND SETTLE. The timeout for devices
  is 90 seconds by default and can be changed in fstab with
  x-systemd.device-timeout.

It cannot "just try-mount this-or-that ON boot" as there is NO state of
"booting" or "shutting down" in systemd flow, there is only a chain of
events.

Mounting a device happens when it becomes available. If the device is
not crucial for booting up, just add the "nofail". If you want to lie
about the device being available, put the appropriate code into device
handler (btrfs could parse kernel cmdline and return READY for degraded).
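
For illustration, a minimal fstab sketch of both knobs mentioned above
(the UUID, mount point and 30s value are placeholders, not taken from
this thread):

# /etc/fstab - illustrative only
# non-critical btrfs volume: don't fail the boot if it's absent,
# and only wait 30 seconds for the device instead of the default 90
UUID=<your-fs-uuid>  /data  btrfs  defaults,nofail,x-systemd.device-timeout=30s  0 0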

> this rule can have a timer. Service units absolutely can have timers,
> so maybe there's a way to marry a udev rule with a service which has a
> timer. The absolute dumbest thing that's better than now, is at the
> timer just fail and drop to a dracut prompt. Better would be to try a
> normal mount anyway, which also fails to a dracut prompt, but
> additionally gives us a kernel error for Btrfs (the missing device
> open ctree error you'd expect to get when mounting without -o degraded
> when you're missing a device). And even better would be a way for the
> user to edit the service unit to indicate "upon timeout being reached,
> use mount -o degraded rather than just mount". This is the simplest of
> Boolean logic, so I'd be surprised if systemd doesn't offer a way for
> us to do exactly what I'm describing.

Any fallback to degraded mode requires the volume manager to handle it
gracefully first. Until btrfs is degraded-safe, systemd cannot offer to
mount it degraded in _any_ way.

> Again the central problem is the udev rule now means "wait for device
> to appear" with no timed fallback.

If I have photos on an external drive, is it also the udev rule that waits
for me to plug it in?

You should blame btrfs.ko for not "plugging in" (emitting an event to udev).

> The mdadm case has this, and it's done by dracut. At this same stage
> of startup with a  missing device, there is in fact no fs colume UUID
> yet because the array hasn't started. Dracut+mdadm knows there's a
> missing device so it's just iterating: look, sleep 3, look, sleep 3,
> look, sleep 3. It's on a loop. And after that loop hits something like
> 100, the script says f it, start array anyway, so now there is a

So you need some btrfsd to iterate and eventually say "go degraded",
as systemd isn't the volume manager here.

> degraded array, and for the first time the fs volume UUID appears, and
> systemd goes "ahaha! mount that!" and it does it normally.

You can see for yourself: systemd mounts the ready md device. It doesn't
handle the 'ready-or-not-yet' timing-out logic.

> So the timer and timeout and what happens at the timeout is defined by
> dracut. That's probably why the systemd folks say "not our problem"
> and why the kernel folks say "not our problem".

And they are both right - there should be a btrfsd for handling this.
Well, except for the kernel cmdline, which should be parsed by the kernel
guys. But since it is impossible to assemble a multi-device btrfs by the
kernel itself anyway (i.e. without an initrd), this could all go into the
daemon.

>> If btrfs pretends to be device manager it should expose more states,
>> especially "ready to be mounted, but not fully populated" (i.e.
>> "degraded mount possible"). Then systemd could _fallback_ after timing
>> out to degraded mount automatically according to some systemd-level
>> option.
> 
> No, mdadm is a device manager and it has no such facility. Something

It has - the ultimate "signalling" in the case of mdadm is the appearance
of the /dev/mdX device. Until the device comes up, systemd obviously
won't mount it.
In the case of btrfs the situation is abnormal - there IS a /dev/sda1
device available, but in fact it might not be available. So there is the
IOCTL to check whether the available device is really available. And guess
what - it returns NOT_READY... And you want systemd to mount this
ready-not_ready? The same way you could ask systemd to mount an MD/LVM
device that doesn't exist.

This is btrfs' fault for:
1. reusing the device node for different purposes [*],
2. lacking timing-out/degraded logic implemented somewhere.

> issues a command to start the array anyway, and only then do you find
> out if there are enough devices to start it. I don't understand the
> value of knowing whether it is possible. Just try to mount it degraded
> and then if it fails we fail, nothing can be done automatically it's
> up to an admin.

It can't mount degraded, because the "missing" device might go online a
few seconds ago.

> And even if you had this "degraded mount possible" state, you still
> need a timer. So just build the timer.

Exactly! This "timer" is btrfs-specific daemon that should be shipped
with btrfs-tools. Well, maybe not the actual daemon, as btrfs handles
incremental assembly on it's own, just the appropriate units and
signalling.

For mdadm there is --incremental, used for gradual assembly via a udev rule:
https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/udev-md-raid-assembly.rules
(this also fires the timer)

and the systemd part for timing-out and degraded fallback:
https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/systemd
mdadm-last-resort@.timer
mdadm-last-resort@.service
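
For reference, the gist of that timer/service pair (paraphrased from
memory, not quoted verbatim - check the link above for the real files)
is roughly:

# mdadm-last-resort@.timer (approximate)
[Unit]
Description=Timer to wait for more drives before activating degraded array
DefaultDependencies=no
Conflicts=sys-devices-virtual-block-%i.device

[Timer]
OnActiveSec=30

# mdadm-last-resort@.service (approximate)
[Unit]
Description=Activate md array even though degraded
DefaultDependencies=no
Conflicts=sys-devices-virtual-block-%i.device

[Service]
Type=oneshot
ExecStart=/sbin/mdadm --run /dev/%i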

There is appropriate code in LVM as well, using lvmetad, but this one is easier.

So, let's go through your proposal step by step:

> If all devices ready ioctl is true, the timer doesn't start, it means
> all devices are available, mount normally.

sure

> If all devices ready ioctl is false, the timer starts, if all devices
> appear later the ioctl goes to true, the timer is belayed, mount
> normally.

sure

> If all devices ready ioctl is false, the timer starts, when the timer
> times out, mount normally which fails and gives us a shell to
> troubleshoot at.
> OR
> If all devices ready ioctl is false, the timer starts, when the timer
> times out, mount with -o degraded which either succeeds and we boot or
> it fails and we have a troubleshooting shell.

Don't mix layers - just imagine your /dev/sda1 is not there and you simply
cannot even try to mount it; this should be done like this:

If the all-devices-ready ioctl is false, the timer starts; when the timer
times out, TELL THE KERNEL THAT WE WANT A DEGRADED MOUNT. This in turn
should switch the IOCTL response to "OK, go degraded", which in turn would
make the udev rule raise the flag[*], and then systemd could mount it.

It is important that it is the kernel that gets instructed this way, not
the last component in the chain, as this gives the chance to pass the
degraded option on the kernel cmdline as well.

> The central problem is the lack of a timer and time out.

You've got mdadm-last-resort@.timer/service above; if btrfs doesn't lack
anything, as you all state here, it should be easy to make this work.
Go ahead, please.

>> Unless there is *some* signalling from btrfs, there is really not much
>> systemd can *safely* do.
> 
> That is not true. It's not how mdadm works anyway.

Yes, it does. You can't mount an md array until /dev/mdX appears, which
happens when the array gets fully assembled *OR* the wait times out and the
kernel gets instructed to run the array degraded, which results in /dev/mdX
appearing. There is NO additional logic in systemd.

It is NOT systemd that assembles a degraded md array; it is mdadm that
tells the kernel to assemble it, and systemd mounts the READY md device.
Moreover, systemd gives you a set of tools that you can use for timers.

[*] the udev flag is required to distinguish the /dev/sda1 block device
from the /dev/sda1 btrfs-volume being ready. If a separate device node were
created, there would be no need for this entire IOCTL.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27 21:12                     ` Chris Murphy
@ 2018-01-28  0:16                       ` Tomasz Pala
  0 siblings, 0 replies; 54+ messages in thread
From: Tomasz Pala @ 2018-01-28  0:16 UTC (permalink / raw)
  To: Majordomo vger.kernel.org

On Sat, Jan 27, 2018 at 14:12:01 -0700, Chris Murphy wrote:

> doesn't count devices itself. The Btrfs systemd udev rule defers to
> Btrfs kernel code by using BTRFS_IOC_DEVICES_READY. And it's totally
> binary. Either they are all ready, in which case it exits 0, and if
> they aren't all ready it exits 1.
> 
> But yes, mounting whether degraded or not is sufficiently complicated
> that you just have to try it. I don't get the point of wanting to know
> whether it's possible without trying. Why would this information be

If you want to blind-try it, just tell btrfs.ko to flip the IOCTL bit.

No shortcuts please - do it legitimately, where it belongs.

>> Ie, the thing systemd can safely do, is to stop trying to rule everything,
>> and refrain from telling the user whether he can mount something or not.
> 
> Right. Open question is whether the timer and timeout can be
> implemented in the systemd world and I don't see why not, I certainly

It can. The reasons why it's not already there are:

1. no one has created udev rules and systemd units for btrfs-progs yet
   (that part is trivial),
2. btrfs is not degraded-safe yet (the rules would have to check that the
   filesystem won't get stuck in read-only mode, for example - this is NOT
   trivial),
3. there is no way to tell the kernel that we want degraded (probably
   some new IOCTL) - this is the path the timer would use to trigger the
   udev event releasing the systemd mount.

Let me repeat this so it's clear: this is NOT going to work as some
systemd shortcut that simply runs "mount -o degraded"; it must go through
the kernel IOCTL -> udev -> systemd path, i.e.:

timer expires -> an IOCTL says "OK, give me degraded /dev/blah" ->
BTRFS_IOC_DEVICES_READY returns "READY" (or a new value "DEGRADED") -> udev
catches the event and changes SYSTEMD_READY -> systemd mounts the volume.


This is really simple. All you need to do is pass "degraded" to
btrfs.ko, so that BTRFS_IOC_DEVICES_READY returns "go ahead".
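
For what it's worth, a purely hypothetical sketch of what point 1 above
could look like, modeled loosely on mdadm's last-resort pair. None of
these files exist in btrfs-progs, and the flag the service passes does
not exist either - which is exactly point 3:

# 64-btrfs-degraded.rules (hypothetical)
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_WANTS}+="btrfs-last-resort@%k.timer"

# btrfs-last-resort@.timer (hypothetical)
[Timer]
OnActiveSec=30

# btrfs-last-resort@.service (hypothetical)
[Service]
Type=oneshot
# "--degraded" is made up; today there is no way to tell btrfs.ko this
ExecStart=/usr/bin/btrfs device ready --degraded /dev/%i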

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27 15:22                     ` Duncan
@ 2018-01-28  0:39                       ` Tomasz Pala
  2018-01-28 20:02                         ` Chris Murphy
  2018-01-28  8:06                       ` Andrei Borzenkov
  1 sibling, 1 reply; 54+ messages in thread
From: Tomasz Pala @ 2018-01-28  0:39 UTC (permalink / raw)
  To: linux-btrfs

On Sat, Jan 27, 2018 at 15:22:38 +0000, Duncan wrote:

>> manages to mount degraded btrfs without problems.  They just don't try
>> to outsmart the kernel.
> 
> No kidding.
> 
> All systemd has to do is leave the mount alone that the kernel has 
> already done, instead of insisting it knows what's going on better than 
> the kernel does, and immediately umounting it.

Tell me please, if you mount -o degraded btrfs - what would
BTRFS_IOC_DEVICES_READY return?

This is not "outsmarting" nor "knowing better", on the contrary, this is "FOLLOWING the
kernel-returned data". The umounting case is simply a bug in btrfs.ko
that should change to READY state *if* someone has tried and apparently
succeeded mounting the not-ready volume.

Otherwise - how should any part of the system behave when you detach some
drive? Insist that "the kernel has already mounted it" and ignore the
kernel screaming "the device is (not yet there/gone)"?


Just update the internal state after a successful mount and this
particular problem is gone. Unless there is some race condition, in which
case the state should be changed before the mount is announced to userspace.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27 15:22                     ` Duncan
  2018-01-28  0:39                       ` Tomasz Pala
@ 2018-01-28  8:06                       ` Andrei Borzenkov
  2018-01-28 10:27                         ` Tomasz Pala
                                           ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: Andrei Borzenkov @ 2018-01-28  8:06 UTC (permalink / raw)
  To: Duncan, linux-btrfs

27.01.2018 18:22, Duncan wrote:
> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
> 
>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>>>
>>>>> I just tested to boot with a single drive (raid1 degraded), even
>>>>> with degraded option in fstab and grub, unable to boot !  The boot
>>>>> process stop on initramfs.
>>>>>
>>>>> Is there a solution to boot with systemd and degraded array ?
>>>>
>>>> No. It is finger pointing. Both btrfs and systemd developers say
>>>> everything is fine from their point of view.
>>
>> It's quite obvious who's the culprit: every single remaining rc system
>> manages to mount degraded btrfs without problems.  They just don't try
>> to outsmart the kernel.
> 
> No kidding.
> 
> All systemd has to do is leave the mount alone that the kernel has 
> already done,

Are you sure you really understand the problem? No mount happens because
systemd waits for indication that it can mount and it never gets this
indication.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-28  8:06                       ` Andrei Borzenkov
@ 2018-01-28 10:27                         ` Tomasz Pala
  2018-01-28 15:57                         ` Duncan
  2018-01-28 20:28                         ` Chris Murphy
  2 siblings, 0 replies; 54+ messages in thread
From: Tomasz Pala @ 2018-01-28 10:27 UTC (permalink / raw)
  To: linux-btrfs

On Sun, Jan 28, 2018 at 11:06:06 +0300, Andrei Borzenkov wrote:

>> All systemd has to do is leave the mount alone that the kernel has 
>> already done,
> 
> Are you sure you really understand the problem? No mount happens because
> systemd waits for indication that it can mount and it never gets this
> indication.

And even after a successful manual mount (with -o degraded), btrfs.ko
insists that the device is not ready.

That schizophrenia makes systemd unmount it immediately, because that
is the only proper way to handle missing devices (only the failed ones
should go r/o). And there is really nothing systemd can do about this
until the underlying code stops lying - unless we're going back to the
1990s, when devices were never unplugged or detached during system uptime.
But even floppies could be ejected without a system reboot.

BTRFS is no exception here - when it is marked as 'not available',
don't expect it to be kept in use. Just fix the code to match reality.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-28  0:00                     ` Tomasz Pala
@ 2018-01-28 10:43                       ` Tomasz Pala
  0 siblings, 0 replies; 54+ messages in thread
From: Tomasz Pala @ 2018-01-28 10:43 UTC (permalink / raw)
  To: Majordomo vger.kernel.org

On Sun, Jan 28, 2018 at 01:00:16 +0100, Tomasz Pala wrote:

> It can't mount degraded, because the "missing" device might go online a
> few seconds ago.

s/ago/after/

>> The central problem is the lack of a timer and time out.
> 
> You got mdadm-last-resort@.timer/service above, if btrfs doesn't lack
> anything, as you all state here, this should be easy to make this work.
> Go ahead please.

And just to make it even easier - this is how you can react to events
inside udev (this would eliminate the need for the btrfs-scan tool, which
sucks anyway):

https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f

One could even try to trick systemd by SETTING (note the single '=')

ENV{ID_BTRFS_READY}="0"

- which would probably break as soon as btrfs.ko emits the next 'changed' event.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-28  8:06                       ` Andrei Borzenkov
  2018-01-28 10:27                         ` Tomasz Pala
@ 2018-01-28 15:57                         ` Duncan
  2018-01-28 16:51                           ` Andrei Borzenkov
  2018-01-28 20:28                         ` Chris Murphy
  2 siblings, 1 reply; 54+ messages in thread
From: Duncan @ 2018-01-28 15:57 UTC (permalink / raw)
  To: linux-btrfs

Andrei Borzenkov posted on Sun, 28 Jan 2018 11:06:06 +0300 as excerpted:

> 27.01.2018 18:22, Duncan wrote:
>> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
>> 
>>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>>>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>>>>
>>>>>> I just tested to boot with a single drive (raid1 degraded), even
>>>>>> with degraded option in fstab and grub, unable to boot !  The boot
>>>>>> process stop on initramfs.
>>>>>>
>>>>>> Is there a solution to boot with systemd and degraded array ?
>>>>>
>>>>> No. It is finger pointing. Both btrfs and systemd developers say
>>>>> everything is fine from their point of view.
>>>
>>> It's quite obvious who's the culprit: every single remaining rc system
>>> manages to mount degraded btrfs without problems.  They just don't try
>>> to outsmart the kernel.
>> 
>> No kidding.
>> 
>> All systemd has to do is leave the mount alone that the kernel has
>> already done,
> 
> Are you sure you really understand the problem? No mount happens because
> systemd waits for indication that it can mount and it never gets this
> indication.

As Tomaz indicates, I'm talking about manual mounting (after the initr* 
drops to a maintenance prompt if it's root being mounted, or on manual 
mount later if it's an optional mount) here.  The kernel accepts the 
degraded mount and it's mounted for a fraction of a second, but systemd 
actually undoes the successful work of the kernel to mount it, so by the 
time the prompt returns and a user can check, the filesystem is unmounted 
again, with the only indication that it was mounted at all being the log.

He says that's because the kernel still says it's not ready, but that's 
for /normal/ mounting.  The kernel accepted the degraded mount and 
actually mounted the filesystem, but systemd undoes that.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-28 15:57                         ` Duncan
@ 2018-01-28 16:51                           ` Andrei Borzenkov
  0 siblings, 0 replies; 54+ messages in thread
From: Andrei Borzenkov @ 2018-01-28 16:51 UTC (permalink / raw)
  To: Duncan, linux-btrfs

28.01.2018 18:57, Duncan wrote:
> Andrei Borzenkov posted on Sun, 28 Jan 2018 11:06:06 +0300 as excerpted:
> 
>> 27.01.2018 18:22, Duncan wrote:
>>> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
>>>
>>>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>>>>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>>>>>
>>>>>>> I just tested to boot with a single drive (raid1 degraded), even
>>>>>>> with degraded option in fstab and grub, unable to boot !  The boot
>>>>>>> process stop on initramfs.
>>>>>>>
>>>>>>> Is there a solution to boot with systemd and degraded array ?
>>>>>>
>>>>>> No. It is finger pointing. Both btrfs and systemd developers say
>>>>>> everything is fine from their point of view.
>>>>
>>>> It's quite obvious who's the culprit: every single remaining rc system
>>>> manages to mount degraded btrfs without problems.  They just don't try
>>>> to outsmart the kernel.
>>>
>>> No kidding.
>>>
>>> All systemd has to do is leave the mount alone that the kernel has
>>> already done,
>>
>> Are you sure you really understand the problem? No mount happens because
>> systemd waits for indication that it can mount and it never gets this
>> indication.
> 
> As Tomaz indicates, I'm talking about manual mounting (after the initr* 
> drops to a maintenance prompt if it's root being mounted, or on manual 
> mount later if it's an optional mount) here.  The kernel accepts the 
> degraded mount and it's mounted for a fraction of a second, but systemd 
> actually undoes the successful work of the kernel to mount it, so by the 
> time the prompt returns and a user can check, the filesystem is unmounted 
> again, with the only indication that it was mounted at all being the log.
> 

This is fixed in current systemd (and has been for quite some time). If
you still observe it with a more or less recent systemd, report a bug.

> He says that's because the kernel still says it's not ready, but that's 
> for /normal/ mounting.  The kernel accepted the degraded mount and 
> actually mounted the filesystem, but systemd undoes that.
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-28  0:39                       ` Tomasz Pala
@ 2018-01-28 20:02                         ` Chris Murphy
  2018-01-28 22:39                           ` Tomasz Pala
  0 siblings, 1 reply; 54+ messages in thread
From: Chris Murphy @ 2018-01-28 20:02 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: Btrfs BTRFS

On Sat, Jan 27, 2018 at 5:39 PM, Tomasz Pala <gotar@polanet.pl> wrote:
> On Sat, Jan 27, 2018 at 15:22:38 +0000, Duncan wrote:
>
>>> manages to mount degraded btrfs without problems.  They just don't try
>>> to outsmart the kernel.
>>
>> No kidding.
>>
>> All systemd has to do is leave the mount alone that the kernel has
>> already done, instead of insisting it knows what's going on better than
>> the kernel does, and immediately umounting it.
>
> Tell me please, if you mount -o degraded btrfs - what would
> BTRFS_IOC_DEVICES_READY return?


case BTRFS_IOC_DEVICES_READY:
    ret = btrfs_scan_one_device(vol->name, FMODE_READ,
                    &btrfs_fs_type, &fs_devices);
    if (ret)
        break;
    ret = !(fs_devices->num_devices == fs_devices->total_devices);
    break;


All it cares about is whether the number of devices found is the same
as the number of devices any of that volume's supers claim make up
that volume. That's it.


>
> This is not "outsmarting" nor "knowing better", on the contrary, this is "FOLLOWING the
> kernel-returned data". The umounting case is simply a bug in btrfs.ko
> that should change to READY state *if* someone has tried and apparently
> succeeded mounting the not-ready volume.


Nope. That is not what the ioctl does.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-28  8:06                       ` Andrei Borzenkov
  2018-01-28 10:27                         ` Tomasz Pala
  2018-01-28 15:57                         ` Duncan
@ 2018-01-28 20:28                         ` Chris Murphy
  2018-01-28 23:13                           ` Tomasz Pala
  2 siblings, 1 reply; 54+ messages in thread
From: Chris Murphy @ 2018-01-28 20:28 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Duncan, Btrfs BTRFS

On Sun, Jan 28, 2018 at 1:06 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> 27.01.2018 18:22, Duncan wrote:
>> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
>>
>>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>>>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>>>>
>>>>>> I just tested to boot with a single drive (raid1 degraded), even
>>>>>> with degraded option in fstab and grub, unable to boot !  The boot
>>>>>> process stop on initramfs.
>>>>>>
>>>>>> Is there a solution to boot with systemd and degraded array ?
>>>>>
>>>>> No. It is finger pointing. Both btrfs and systemd developers say
>>>>> everything is fine from their point of view.
>>>
>>> It's quite obvious who's the culprit: every single remaining rc system
>>> manages to mount degraded btrfs without problems.  They just don't try
>>> to outsmart the kernel.
>>
>> No kidding.
>>
>> All systemd has to do is leave the mount alone that the kernel has
>> already done,
>
> Are you sure you really understand the problem? No mount happens because
> systemd waits for indication that it can mount and it never gets this
> indication.

"not ready" is rather vague terminology but yes that's how systemd
ends up using the ioctl this rule depends on, even though the rule has
nothing to do with readiness per se. If all devices for a volume
aren't found, we can correctly conclude a normal mount attempt *will*
fail. But that's all we can conclude. What I can't parse in all of
this is if the udev rule is a one shot, if the ioctl is a one shot, if
something is constantly waiting for "not all devices are found" to
transition to "all devices are found" or what. I can't actually parse
the two critical lines in this rule. I


$ cat /usr/lib/udev/rules.d/64-btrfs.rules
# do not edit this file, it will be overwritten on update

SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

# let the kernel know about this btrfs filesystem, and check if it is complete
IMPORT{builtin}="btrfs ready $devnode"

# mark the device as not ready to be used by the system
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

LABEL="btrfs_end"


----

And udev builtin btrfs, which I guess the above rule is referring to:

https://github.com/systemd/systemd/blob/master/src/udev/udev-builtin-btrfs.c

I think the Btrfs ioctl is a one-shot: either all the devices are present
or they aren't. The waiting is a policy of the systemd udev rule, as near
as I can tell.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-28 20:02                         ` Chris Murphy
@ 2018-01-28 22:39                           ` Tomasz Pala
  2018-01-29  0:00                             ` Chris Murphy
  0 siblings, 1 reply; 54+ messages in thread
From: Tomasz Pala @ 2018-01-28 22:39 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sun, Jan 28, 2018 at 13:02:08 -0700, Chris Murphy wrote:

>> Tell me please, if you mount -o degraded btrfs - what would
>> BTRFS_IOC_DEVICES_READY return?
> 
> case BTRFS_IOC_DEVICES_READY:
>     ret = btrfs_scan_one_device(vol->name, FMODE_READ,
>                     &btrfs_fs_type, &fs_devices);
>     if (ret)
>         break;
>     ret = !(fs_devices->num_devices == fs_devices->total_devices);
>     break;
> 
> 
> All it cares about is whether the number of devices found is the same
> as the number of devices any of that volume's supers claim make up
> that volume. That's it.
>
>> This is not "outsmarting" nor "knowing better", on the contrary, this is "FOLLOWING the
>> kernel-returned data". The umounting case is simply a bug in btrfs.ko
>> that should change to READY state *if* someone has tried and apparently
>> succeeded mounting the not-ready volume.
> 
> Nope. That is not what the ioctl does.

So who is to blame for creating utterly useless code? Userspace
shouldn't depend on some statistic (the number of devices is nothing more
than that), but on overall _availability_.

I do not care whether there are 2, 5 or 100 devices. I do care whether
there are ENOUGH devices to run regularly (including N-way mirroring and
hot spares) and, if not, whether there are ENOUGH devices to run degraded.
Having ALL the devices is just the edge case.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-28 20:28                         ` Chris Murphy
@ 2018-01-28 23:13                           ` Tomasz Pala
  0 siblings, 0 replies; 54+ messages in thread
From: Tomasz Pala @ 2018-01-28 23:13 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sun, Jan 28, 2018 at 13:28:55 -0700, Chris Murphy wrote:

>> Are you sure you really understand the problem? No mount happens because
>> systemd waits for indication that it can mount and it never gets this
>> indication.
> 
> "not ready" is rather vague terminology but yes that's how systemd
> ends up using the ioctl this rule depends on, even though the rule has
> nothing to do with readiness per se. If all devices for a volume

If you avoid using THIS ioctl, then you'd have nothing to fire the rule
on at all. One way or another, it is btrfs that must emit _some_ event or
be polled _somehow_.

> aren't found, we can correctly conclude a normal mount attempt *will*
> fail. But that's all we can conclude. What I can't parse in all of
> this is if the udev rule is a one shot, if the ioctl is a one shot, if
> something is constantly waiting for "not all devices are found" to
> transition to "all devices are found" or what. I can't actually parse

It's not a one-shot. It works like this:

sda1 appears -> udev catches event -> udev detects btrfs and IOCTLs => not ready
sdb1 appears -> udev catches event -> udev detects btrfs and IOCTLs => ready

The end.

If there were some other device appearing after assembly, like /dev/md1,
or if there were some event generated by btrfs code itself, udev could
catch this and follow. Now, if you unplug sdb1, there's no such event at
all.

Since this IOCTL is the *only* thing that udev can rely on, it cannot be
removed from the logic. So even if you create a timer to force assembly,
you must do it by influencing the IOCTL response.

Or creating some other IOCTL for this purpose, or creating some
userspace daemon or whatever.

> the two critical lines in this rule. I
> 
> # let the kernel know about this btrfs filesystem, and check if it is complete
> IMPORT{builtin}="btrfs ready $devnode"

This sends IOCTL.

> # mark the device as not ready to be used by the system
> ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"
      ^^^^^^^^^^^^^^this is IOCTL response being checked

and SYSTEMD_READY set to 0 prevents systemd from mounting.

> I think the Btrfs ioctl is a one shot. Either they are all present or not.

The rules are called once per (block) device.
So once btrfs has scanned all the devices and returns READY, the volume
finally becomes systemd-ready. It is trivial to re-trigger the udev rule
(udevadm trigger), but there is no way to force btrfs to return READY
after any timeout.
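
(For illustration: something like "udevadm trigger --action=change /dev/sdb1"
re-runs the rules for that device, but the re-imported ID_BTRFS_READY still
comes back 0 for as long as a device is missing.)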

> The waiting is a policy by systemd udev rule near as I can tell.

There is no problem with waiting or re-triggering - that can be done in
~10 lines of rules. The problem is that the IOCTL won't EVER return READY
until ALL the components are present.

It's as simple as that: there MUST be some mechanism at the device-manager
level that tells whether a compound device is mountable, degraded or not;
upper layers (systemd-mount) do not care about degradation - handling
redundancy/mirrors/chunks/stripes/spares is not their job.
It (systemd) can (easily!) handle an expiration timer to push a pending
compound device to be force-assembled, but currently there is no way to
push.


If the IOCTL were extended to return TRYING_DEGRADED (when instructed to
do so after an expired timeout), systemd could handle additional
per-filesystem fstab options, like x-systemd.allow-degraded.

Then it would be possible to have a best-effort policy for the rootfs (to
make the machine boot) and a stricter one for crucial data (do not mount
it when there is no redundancy; wait for operator intervention).
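
To illustrate, if such an option existed, an fstab could express those two
policies like this (x-systemd.allow-degraded is the hypothetical option
proposed above - systemd does not implement it; the UUIDs are placeholders):

# best effort: boot even if the rootfs pool is missing a device
UUID=<rootfs-uuid>  /      btrfs  defaults,x-systemd.allow-degraded  0 0
# strict: never auto-mount the data pool degraded, but don't block the boot
UUID=<data-uuid>    /data  btrfs  defaults,nofail                    0 0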

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-28 22:39                           ` Tomasz Pala
@ 2018-01-29  0:00                             ` Chris Murphy
  2018-01-29  8:54                               ` Tomasz Pala
  0 siblings, 1 reply; 54+ messages in thread
From: Chris Murphy @ 2018-01-29  0:00 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: Btrfs BTRFS

On Sun, Jan 28, 2018 at 3:39 PM, Tomasz Pala <gotar@polanet.pl> wrote:
> On Sun, Jan 28, 2018 at 13:02:08 -0700, Chris Murphy wrote:
>
>>> Tell me please, if you mount -o degraded btrfs - what would
>>> BTRFS_IOC_DEVICES_READY return?
>>
>> case BTRFS_IOC_DEVICES_READY:
>>     ret = btrfs_scan_one_device(vol->name, FMODE_READ,
>>                     &btrfs_fs_type, &fs_devices);
>>     if (ret)
>>         break;
>>     ret = !(fs_devices->num_devices == fs_devices->total_devices);
>>     break;
>>
>>
>> All it cares about is whether the number of devices found is the same
>> as the number of devices any of that volume's supers claim make up
>> that volume. That's it.
>>
>>> This is not "outsmarting" nor "knowing better", on the contrary, this is "FOLLOWING the
>>> kernel-returned data". The umounting case is simply a bug in btrfs.ko
>>> that should change to READY state *if* someone has tried and apparently
>>> succeeded mounting the not-ready volume.
>>
>> Nope. That is not what the ioctl does.
>
> So who is to blame for creating utterly useless code? Userspace
> shouldn't depend on some stats (as number of devices is nothing more
> than that), but overall _availability_.

There's quite a lot missing. Btrfs doesn't even really have a degraded
state concept. It has a degraded mount option, but that is not a state.
E.g. if you have a normally mounted volume and a drive dies or vanishes,
there's no way for the user to know the array is degraded. They can only
infer that it's degraded from a.) metric f-tons of read/write errors to a
bdev while b.) the application layer isn't pissed off about it; or, in
lieu of a.), they see via 'btrfs fi show' that a device is missing.
Likewise, when a device is failing to read and write, Btrfs doesn't
consider it faulty and boot it out of the array; it just keeps on trying,
the spew of which can cause disk contention if those errors are written
to a log on spinning rust.

Anyway, the fact that many state features are missing doesn't mean the
necessary information to do the right thing is missing.



> I do not care if there are 2, 5 or 100 devices. I do care if there is
> ENOUGH devices to run regular (including N-way mirroring and hot spares)
> and if not - if there is ENOUGH devices to run degraded. Having ALL the
> devices is just the edge case.

systemd can't possibly need to know more information than a person
does in the exact same situation in order to do the right thing. No
human would wait 10 minutes, let alone literally the heat death of the
planet for "all devices have appeared" but systemd will. And it does
that by its own choice, its own policy. That's the complaint. It's
choosing to do something a person wouldn't do, given identical
available information. There's nothing the kernel is doing that's
telling systemd to wait for goddamn ever.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-29  0:00                             ` Chris Murphy
@ 2018-01-29  8:54                               ` Tomasz Pala
  2018-01-29 11:24                                 ` Adam Borowski
  2018-01-30  4:44                                 ` Chris Murphy
  0 siblings, 2 replies; 54+ messages in thread
From: Tomasz Pala @ 2018-01-29  8:54 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sun, Jan 28, 2018 at 17:00:46 -0700, Chris Murphy wrote:

> systemd can't possibly need to know more information than a person
> does in the exact same situation in order to do the right thing. No
> human would wait 10 minutes, let alone literally the heat death of the
> planet for "all devices have appeared" but systemd will. And it does

We're repeating ourselves already - systemd waits for THE btrfs compound
device, not for ALL the block devices. Just like it 'waits' for someone
to plug a USB pendrive in.

It is a btrfs choice not to expose the compound device as a separate one
(like every other device manager does), it is a btrfs drawback that it
doesn't provide anything else except this IOCTL with its logic, it is a
btrfs drawback that there is nothing to push the assembly into an "OK,
going degraded" state, it is a btrfs drawback that there are no states...

I've said it already - pretend the /dev/sda1 device doesn't exist until
assembled. If this overlapping usage was designed with 'easier mounting'
in mind, it is simply bad design.

> that by its own choice, its own policy. That's the complaint. It's
> choosing to do something a person wouldn't do, given identical
> available information.

You are expecting systemd to mix in the functions of the kernel and udev.
There is NO concept of 'assembled stuff' in systemd AT ALL.
There is NO concept of 'waiting' in udev AT ALL.
If you want some crazy inter-layer shortcuts, just implement a btrfsd.

> There's nothing the kernel is doing that's
> telling systemd to wait for goddamn ever.

There's nothing the kernel is doing that's
telling udev there IS a degraded device assembled to be used.

There's nothing a userspace-thing is doing that's
telling udev to mark degraded device as mountable.

There is NO DEVICE to be mounted, so systemd doesn't mount it.

The difference is:

YOU think that the sda1 device is ephemeral, as it's covered by an sda1
btrfs device that COULD BE mounted.

I think that there is a real sda1 device, following the Linux rules of
system registration, which CAN be taken over by an ephemeral btrfs
compound device. Can I mount that thing on top of the sda1 block device?
ONLY when it's properly registered in the system.

Does the btrfs compound device register in the system? - Yes, but only
when fully populated.

Just don't expect people to break their code with broken designs just
to work around your own limitations. If you want systemd to mount a
degraded btrfs volume, just MAKE IT REGISTER in the system.

How can a degraded btrfs register in the system? Either via some
userspace daemon handling btrfs volume states (which are missing from
the kernel), or via some IOCTLs altering the in-kernel state.


So for the last time: nobody will break his own code to patch missing
code from another (actively maintained) subsystem.

If you expect degraded mounts, there are 2 choices:

1. implement degraded STATE _some_where_ - udev would handle falling
   back to degraded mount after specified timeout,

2. change this IOCTL to _always_ return 1 - udev would register any
   btrfs device, but you will get random behaviour of mounting
   degraded/populated. But you should expect that since there is no
   concept of any state below.


Actually, this is ridiculous - you expect degradation to be handled
in some 3rd-party software?! In an init system? When the only thing
you've got is the 'degraded' mount option?!
What next - moving the MD and LVM logic into systemd?

This is not systemd's job - there are
btrfs-specific kernel cmdline options to be parsed (allowing degraded
volumes), and tracking of volume health is required.
Yes, a device manager needs to track its components, and a RAID
controller needs to track the minimum required redundancy. It's not only
about mounting. But doing the degraded mounting is easy - only this one
particular ioctl needs to be fixed:

1. counted devices<all	=> not_ready

2. counted devices<all BUT
- 'go degraded' received from userspace or kernel cmdline OR
- volume IS mounted and doesn't report errors (i.e. mount -o degraded
  DID succeed)	=> ok_degraded

3. counted devices==all => ok


If btrfs DISTINGUISHED these two states, systemd would be able to use them.
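
A minimal C sketch of the tri-state check being proposed (illustration
only - this is NOT what the current ioctl does, and enough_copies_present
stands in for the nontrivial per-chunk availability check):

/* Hypothetical readiness states - not actual btrfs code. */
enum ready_state { NOT_READY, OK_DEGRADED, OK };

struct devices_view {
	unsigned int num_devices;	/* devices scanned so far */
	unsigned int total_devices;	/* devices recorded in the superblock */
};

static enum ready_state devices_ready_state(const struct devices_view *d,
					    int degraded_requested,
					    int enough_copies_present)
{
	if (d->num_devices == d->total_devices)
		return OK;		/* 3. all devices present */
	if (degraded_requested && enough_copies_present)
		return OK_DEGRADED;	/* 2. degraded, but mountable */
	return NOT_READY;		/* 1. too few devices */
}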


You might ask why it is important for this state to be kept inside some
btrfs-related piece, like the kernel or a btrfsd, when a systemd timer
could do the same and 'just mount degraded'. The answer is simple: a
systemd timer is just a sane default CONFIGURATION that can be EASILY
changed by the system administrator. But somewhere, sometime, someone
will have a NEED for a totally different set of rules for handling
degraded volumes, just like MD or LVM allow. It would be totally
irresponsible to hardcode any mount-degraded rule inside systemd itself.

That is exactly why this must go through udev - udev is responsible
for handling devices in the Linux world. How can I register a btrfs
device in udev, given that it overlaps the block device? I can't - the
ioctl is one-way and doesn't accept any userspace feedback.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-29  8:54                               ` Tomasz Pala
@ 2018-01-29 11:24                                 ` Adam Borowski
  2018-01-29 13:05                                   ` Austin S. Hemmelgarn
                                                     ` (2 more replies)
  2018-01-30  4:44                                 ` Chris Murphy
  1 sibling, 3 replies; 54+ messages in thread
From: Adam Borowski @ 2018-01-29 11:24 UTC (permalink / raw)
  To: Btrfs BTRFS

On Mon, Jan 29, 2018 at 09:54:04AM +0100, Tomasz Pala wrote:
> On Sun, Jan 28, 2018 at 17:00:46 -0700, Chris Murphy wrote:
> 
> > systemd can't possibly need to know more information than a person
> > does in the exact same situation in order to do the right thing. No
> > human would wait 10 minutes, let alone literally the heat death of the
> > planet for "all devices have appeared" but systemd will. And it does
> 
> We're already repeating - systemd waits for THE btrfs-compound-device,
> not ALL the block-devices.

Because there is NO compound device.  You can't wait for something that
doesn't exist.  The user wants a filesystem, not some mythical compound
device, and as knowing whether we have enough devices requires doing most
of the mount work, we may as well complete the mount instead of backing
off and reporting, only for you to then racily repeat the work.

> Just like it 'waits' for someone to plug USB pendrive in.

Plugging in a USB pendrive is an event -- there's no user request.  On the
other hand, we already know we want to mount -- the user requested it
either by booting ("please mount everything in fstab") or by an explicit
mount command.

So any event (the user's request) has already happened.  An rc system, of
which systemd is one, knows whether we have reached the "want root
filesystem" or "want secondary filesystems" stage.  Once you're there, you
can issue the mount() call and let the kernel do the work.

> It is a btrfs choice to not expose compound device as separate one (like
> every other device manager does)

Btrfs is not a device manager, it's a filesystem.

> it is a btrfs drawback that doesn't provice anything else except for this
> IOCTL with it's logic

How can it provide you with something it doesn't yet have?  If you want the
information, call mount().  And as others in this thread have mentioned,
what, pray tell, would you want to know "would a mount succeed?" for if you
don't want to mount?

> it is a btrfs drawback that there is nothing to push assembling into "OK,
> going degraded" state

The way to do so is to time out, then retry with -o degraded.

> I've told already - pretend the /dev/sda1 device doesn't
> exist until assembled.

It does... you're confusing a block device (a _part_ of the filesystem)
with the filesystem itself.  MD takes a bunch of such block devices and
provides you with another block device; btrfs takes a bunch of block
devices and provides you with a filesystem.

> If this overlapping usage was designed with 'easier mounting' on mind,
> this is simply bad design.

No other rc system but systemd has a problem.

> > that by its own choice, its own policy. That's the complaint. It's
> > choosing to do something a person wouldn't do, given identical
> > available information.
> 
> You are expecting systemd to mix in functions of kernel and udev.
> There is NO concept of 'assembled stuff' in systemd AT ALL.
> There is NO concept of 'waiting' in udev AT ALL.
> If you want to do some crazy interlayer shortcuts just implement btrfsd.

No, I don't want systemd, or any userspace daemon, trying to know kernel
stuff better than the kernel.  Just call mount(), and that's it.

Let me explain via a car analogy.  There is a flood that covers many roads,
the phone network is unreliable, and you want to drive to help relatives at
place X.

You can ask someone who was there yesterday how to get there (ie, ask a
device; it can tell you "when I was a part of the filesystem last time, its
layout was such and such").  Usually, this is reliable (you don't reshape an
array every day), but if there's flooding (you're contemplating a degraded
mount), yesterday's data being stale shouldn't be a surprise.

So, you climb into the car and drive.  It's possible that the road you
wanted to take has changed; it's also possible that some other roads you
didn't even know about are now driveable.  Once you have X in sight, do
you retrace all the way home to tell your mom (systemd), who's worrying
but has no way to help, that the road is clear, and only then get to X?
Or do you stop and search for a spot with working phone coverage to phone
mom asking for advice, despite her having no information you don't have?
The reasonable thing to do (and what all other rc systems do) is to get
to X, help the relatives, and only then tell mom that all is ok.

But with mom wanting to control everything, things can go wrong.  If you,
without mom's prior knowledge (the user typed "mount" by hand), manage to
find a side road to X, she shouldn't tell you "I hear you telling me you're
at X -- as the road is flooded, that's impossible, so get home this instant"
(ie, systemd deciding the filesystem cannot be complete, despite it already
being mounted).

> > There's nothing the kernel is doing that's
> > telling systemd to wait for goddamn ever.
> 
> There's nothing the kernel is doing that's
> telling udev there IS a degraded device assembled to be used.

Because there is no device.

> There's nothing a userspace-thing is doing that's
> telling udev to mark degraded device as mountable.
> 
> There is NO DEVICE to be mounted, so systemd doesn't mount it.
> 
> The difference is:
> 
> YOU think that sda1 device is ephemeral, as it's covered by sda1 btrfs
> device that COULD BE mounted.

sda1 is there, it's not ephemeral.  You also shouldn't label filesystems by
whatever device was used for the initial mount, as this can change at
runtime -- and, if it does change, it's likely the admin will reuse sda1
for something else -- perhaps another btrfs filesystem.

> Just don't expect people will break their code with broken designs just
> to overcome your own limitations. If you want systemd to mount degraded
> btrfs volume, just MAKE IT REGISTER in the system.

Sorry but my crystal ball is broken.  I don't know whether the mount will
succeed yet.  And per the car analogy above, it's pointless to go back and
report that the device is mountable, if all we care about is to mount it.

> So for the last time: nobody will break his own code to patch missing
> code from other (actively maintained) subsystem.

I expect that an rc system doesn't get nosy, trying to know things it has
no reason to know about.  No other rc system cares, so why should systemd
be different?

> If you expect degraded mounts, there are 2 choices:
> 
> 1. implement degraded STATE _some_where_ - udev would handle falling
>    back to degraded mount after specified timeout,

STATE of what?  The filesystem doesn't exist yet.

> 2. change this IOCTL to _always_ return 1 - udev would register any
>    btrfs device, but you will get random behaviour of mounting
>    degraded/populated. But you should expect that since there is no
>    concept of any state below.

If the ioctl, which can offer only a vague guess, doesn't do what you
want, don't call it.  As it's btrfs-specific already, there's no extra
special-casing on your part.

> Actually, this is ridiculous - you expect the degradation to be handled
> in some 3rd party software?! In init system? With the only thing you got
> is 'degraded' mount option?!
> What next - moving MD and LVM logic into systemd?

It's not the init system's job.  So it shouldn't try to micromanage; it
should just mount().

> This is not systemd's job - there are
> btrfs-specific kernel cmdline options to be parsed (allowing degraded
> volumes), there is tracking of volume health required.
> Yes, device-manager needs to track it's components, RAID controller
> needs to track minimum required redundancy. It's not only about
> mounting. But doing the degraded mounting is easy, only this one
> particular ioctl needs to be fixed:
> 
> 1. counted devices<all	=> not_ready

The count is unreliable.  It usually gives a good answer, but if you're
contemplating mounting degraded, this is precisely the case where it might
be wrong.

> 2. counted devices<all BUT
> - 'go degraded' received from userspace or kernel cmdline OR
> - volume IS mounted and doesn't report errors (i.e. mount -o degraded
>   DID succeeded)	=> ok_degraded

Then you don't want that ioctl, but mount().  And what would you even want
to use that hypothetical "ok_degraded" state for?

> 3. counted devices==all => ok
> 
> 
> If btrfs DISTINGUISHES these two states, systemd would be able to use them.

As per the car analogy above, mom doesn't need to know whether all roads
were dry, merely whether you are at the relatives' house.  The filesystem
either is mounted or it isn't.

> You might ask why this is important for the state to be kept inside some
> btrfs-related stuff, like kernel or btrfsd, while the systemd timer
> could do the same and 'just mount degraded'. The answear is simple:
> systemd.timer is just a sane default CONFIGURATION, that can be EASILY
> changed by system administrator. But somewhere, sometime, someone would
> have a NEED for totally different set of rules for handling degraded
> volumes, just like MD or LVM does. This would be totally irresponsible
> to hardcode any mount-degraded rule inside systemd itself.

It's not rocket science to edit an init script if the knobs it exposes are
not configurable enough for your needs.  If systemd decides to hide this
functionality, it needs to provide the admin with some way to override it.

We're talking about issuing a mount call; it's not _that_ complicated.

> That is exactly why this must go through the udev - udev is responsible
> for handling devices in Linux world. How can I register btrfs device
> in udev, since it's overlapping the block device? I can't - the ioctl
> is one-way, doesn't accept any userspace feedback.

But there's no device to register.  There's a filesystem, and those do have
a well-defined interface: they appear in /proc/mounts and a bunch of other
places.  That cow is not a duck, so it shouldn't quack.  ext4 or xfs don't
quack either, and no one considers them buggy for not quacking.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄⠀⠀⠀⠀ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-29 11:24                                 ` Adam Borowski
@ 2018-01-29 13:05                                   ` Austin S. Hemmelgarn
  2018-01-30 13:46                                     ` Tomasz Pala
  2018-01-29 17:58                                   ` Andrei Borzenkov
  2018-01-30 13:36                                   ` Tomasz Pala
  2 siblings, 1 reply; 54+ messages in thread
From: Austin S. Hemmelgarn @ 2018-01-29 13:05 UTC (permalink / raw)
  To: Btrfs BTRFS

On 2018-01-29 06:24, Adam Borowski wrote:
> On Mon, Jan 29, 2018 at 09:54:04AM +0100, Tomasz Pala wrote:
>> it is a btrfs drawback that doesn't provice anything else except for this
>> IOCTL with it's logic
> 
> How can it provide you with something it doesn't yet have?  If you want the
> information, call mount().  And as others in this thread have mentioned,
> what, pray tell, would you want to know "would a mount succeed?" for if you
> don't want to mount?
And more importantly, WHY THE HELL DO YOU _WANT_ A TOCTOU RACE CONDITION 
INVOLVED?

Seriously, _THERE IS A RACE CONDITION IN SYSTEMD'S CURRENT HANDLING OF 
THIS_.  It's functionally no different than prefacing an attempt to send 
a signal to a process by checking if the process exists, or trying to 
see if some other process is using a file that might be locked by 
scanning /proc instead of just trying to lock the file yourself, or 
scheduling something to check if a RAID array is out of sync before even 
trying to start a scrub.  No sane programmer would do any of that 
(although a lot of rather poorly educated sysadmins do the third), 
because _IT'S NOT RELIABLE_.  The process you're trying to send a signal 
to might disappear after checking for it (or worse, might be a different 
process), the file might get locked by something with a low PID while 
you're busy scanning /proc, or the array could completely die right 
after you check if it's OK.
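
To make the pattern concrete, a minimal C sketch of the first example
above (check-then-signal vs. just-try); this is illustration only, not
code from any project:

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Racy check-then-act: the process can exit (and its PID even be reused)
 * between the existence check and the actual signal. */
void signal_racy(pid_t pid)
{
	if (kill(pid, 0) == 0)		/* "does the process exist?" */
		kill(pid, SIGTERM);	/* ...it may be gone or reused by now */
}

/* Non-racy: just attempt the operation and handle the failure. */
void signal_safe(pid_t pid)
{
	if (kill(pid, SIGTERM) == -1 && errno == ESRCH)
		fprintf(stderr, "process %ld is already gone\n", (long)pid);
}

The same logic applies to mounting: attempting the mount and handling the
error is race-free; asking "would a mount succeed?" first is not.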

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-27 22:42                     ` Tomasz Pala
@ 2018-01-29 13:42                       ` Austin S. Hemmelgarn
  2018-01-30 15:09                         ` Tomasz Pala
  0 siblings, 1 reply; 54+ messages in thread
From: Austin S. Hemmelgarn @ 2018-01-29 13:42 UTC (permalink / raw)
  To: Tomasz Pala, Majordomo vger.kernel.org

On 2018-01-27 17:42, Tomasz Pala wrote:
> On Sat, Jan 27, 2018 at 14:26:41 +0100, Adam Borowski wrote:
> 
>> It's quite obvious who's the culprit: every single remaining rc system
>> manages to mount degraded btrfs without problems.  They just don't try to
>> outsmart the kernel.
> 
> Yes. They are stupid enough to fail miserably with any more complicated
> setups, like stacking volume managers, crypto layer, network attached
> storage etc.
I think you mean any setup that isn't sensibly layered.  BCP for over a 
decade has been to put multipathing at the bottom, then crypto, then 
software RAID, then LVM, and then whatever filesystem you're using. 
Multipathing has to be the bottom layer for a given node because it 
interacts directly with hardware topology which gets obscured by the 
other layers.  Crypto essentially has to be next, otherwise you leak 
info about the storage stack.  Swapping LVM and software RAID ends up 
giving you a setup which is difficult for most people to understand and 
therefore is hard to reliably maintain.

Other init systems enforce things being this way because it maintains 
people's sanity, not because they have significant difficulty doing 
things differently (and in fact, it is _trivial_ to change the ordering 
in some of them, OpenRC on Gentoo for example quite literally requires 
exactly N-1 lines to change in each of N files when re-ordering N 
layers), provided each layer occurs exactly once for a given device and 
the relative ordering is the same on all devices.  And you know what? 
Given my own experience with systemd, it has exactly the same constraint 
on relative ordering.  I've tried to run split setups with LVM and 
dm-crypt where one device had dm-crypt as the bottom layer and the other 
had it as the top layer, and things locked up during boot on _every_ 
generalized init system I tried.

> Recently I've started mdadm on top of bunch of LVM volumes, with others
> using btrfs and others prepared for crypto. And you know what? systemd
> assembled everything just fine.
> 
> So with argument just like yours:
> 
> It's quite obvious who's the culprit: every single remaining filesystem
> manages to mount under systemd without problems. They just expose
> informations about their state.
No, they don't (except ZFS).  There is no 'state' to expose for anything 
but BTRFS (and ZFS) except possibly whether the filesystem needs to be 
checked or not.  You're conflating filesystems and volume management.

The alternative way of putting what you just said is:
Every single remaining filesystem manages to mount under systemd without 
problems, because it doesn't try to treat them as a block layer.
> 
>>> This is not a systemd issue, but apparently btrfs design choice to allow
>>> using any single component device name also as volume name itself.
>>
>> And what other user interface would you propose? The only alternative I see
>> is inventing a device manager (like you're implying below that btrfs does),
>> which would needlessly complicate the usual, single-device, case.
> 
> The 'needless complication', as you named it, usually should be the default
> to use. Avoiding LVM? Then take care of repartitioning. Avoiding mdadm?
> No easy way to RAID the drive (there are device-mapper tricks, they are
> just way more complicated). Even attaching SSD cache is not trivial
> without preparations (for bcache being the absolutely necessary, much
> easier with LVM in place).
For a bog-standard client system, all of those _ARE_ overkill (and 
actually, so is BTRFS in many cases too, it's just that we're the only 
option for main-line filesystem-level snapshots at the moment).
> 
>>> If btrfs pretends to be device manager it should expose more states,
>>
>> But it doesn't pretend to.
> 
> Why mounting sda2 requires sdb2 in my setup then?
First off, it shouldn't, unless you're using a profile that doesn't 
tolerate any missing devices, provided you have passed the `degraded` 
mount option.  It doesn't work in your case because you are using systemd.

Second, BTRFS is not a volume manager, it's a filesystem with 
multi-device support.  The difference is that it's not a block layer, 
despite the fact that systemd is treating it as such.   Yes, BTRFS has 
failure modes that result in regular operations being refused based on 
what storage devices are present, but so does every single distributed 
filesystem in existence, and none of those are volume managers either.
> 
>>> especially "ready to be mounted, but not fully populated" (i.e.
>>> "degraded mount possible"). Then systemd could _fallback_ after timing
>>> out to degraded mount automatically according to some systemd-level
>>> option.
>>
>> You're assuming that btrfs somehow knows this itself.
> 
> "It's quite obvious who's the culprit: every single volume manager keeps
> track of it's component devices".
> 
>>   Unlike the bogus
>> assumption systemd does that by counting devices you can know whether a
>> degraded or non-degraded mount is possible, it is in general not possible to
>> know whether a mount attempt will succeed without actually trying.
> 
> There is a term for such situation: broken by design.
So in other words, it's broken by design to try to connect to a remote 
host without pinging it first to see if it's online?  Or to try to send 
a signal to a given process without first checking that it's still 
running, or to open a file without first checking if we have permission 
to read it, or to try to mount any other filesystem without first 
checking if the superblock is valid?

In all of those cases, there is no advantage to trying to figure out if 
what you're trying to do is going to work before doing it, because every 
one of those operations is functionally atomic (it either happens or it 
doesn't, period), and has a clear-cut return code that tells you 
directly if it succeeded or not.

There's a name for the type of design you're saying we should have here: 
it's called a time-of-check to time-of-use (TOCTOU) race condition.  It's 
one of the easiest types of race conditions to find, and also one of the 
easiest to fix.  Ask any sane programmer, and he will say that _that_ is 
broken by design.
> 
>> Compare with the 4.14 chunk check patchset by Qu -- in the past, btrfs did
>> naive counting of this kind, it had to be replaced by actually checking
>> whether at least one copy of every block group is actually present.
> 
> And you still blame systemd for using BTRFS_IOC_DEVICES_READY?
Given that it's been proven that it doesn't work, and the developers 
responsible for its usage don't want to accept that it doesn't work?  Yes.
> 
> [...]
>> just slow to initialize (USB...).  So, systemd asks sda how many devices
>> there are, answer is "3" (sdb and sdc would answer the same, BTW).  It can
>> even ask for UUIDs -- all devices are present.  So, mount will succeed,
>> right?
> 
> Systemd doesn't count anything, it asks BTRFS_IOC_DEVICES_READY as
> implemented in btrfs/super.c.
> 
>> Ie, the thing systemd can safely do, is to stop trying to rule everything,
>> and refrain from telling the user whether he can mount something or not.
> 
> Just change the BTRFS_IOC_DEVICES_READY handler to always return READY.
> 
Or maybe we should just remove it completely, because checking it _IS 
WRONG_, which is why no other init system does it, and in fact no 
_human_ who has any kind of basic knowledge of how BTRFS operates does 
it either.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-29 11:24                                 ` Adam Borowski
  2018-01-29 13:05                                   ` Austin S. Hemmelgarn
@ 2018-01-29 17:58                                   ` Andrei Borzenkov
  2018-01-29 19:00                                     ` Austin S. Hemmelgarn
  2018-01-30 13:36                                   ` Tomasz Pala
  2 siblings, 1 reply; 54+ messages in thread
From: Andrei Borzenkov @ 2018-01-29 17:58 UTC (permalink / raw)
  To: Adam Borowski, Btrfs BTRFS

29.01.2018 14:24, Adam Borowski пишет:
...
> 
> So any event (the user's request) has already happened.  A rc system, of
> which systemd is one, knows whether we reached the "want root filesystem" or
> "want secondary filesystems" stage.  Once you're there, you can issue the
> mount() call and let the kernel do the work.
> 
>> It is a btrfs choice to not expose compound device as separate one (like
>> every other device manager does)
> 
> Btrfs is not a device manager, it's a filesystem.
> 
>> it is a btrfs drawback that doesn't provice anything else except for this
>> IOCTL with it's logic
> 
> How can it provide you with something it doesn't yet have?  If you want the
> information, call mount().  And as others in this thread have mentioned,
> what, pray tell, would you want to know "would a mount succeed?" for if you
> don't want to mount?
> 
>> it is a btrfs drawback that there is nothing to push assembling into "OK,
>> going degraded" state
> 
> The way to do so is to timeout, then retry with -o degraded.
> 

That's a possible way to solve it. This likely requires support from
mount.btrfs (or btrfs.ko) to return a proper indication that the filesystem is
incomplete, so the caller can decide whether to retry or to try a degraded mount.

Or maybe mount.btrfs should implement this logic internally. This would
really be the simplest way to make it acceptable to the other side by
not needing to accept anything :)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-29 17:58                                   ` Andrei Borzenkov
@ 2018-01-29 19:00                                     ` Austin S. Hemmelgarn
  2018-01-29 21:54                                       ` waxhead
  2018-01-30 15:24                                       ` Tomasz Pala
  0 siblings, 2 replies; 54+ messages in thread
From: Austin S. Hemmelgarn @ 2018-01-29 19:00 UTC (permalink / raw)
  To: Andrei Borzenkov, Adam Borowski, Btrfs BTRFS

On 2018-01-29 12:58, Andrei Borzenkov wrote:
> 29.01.2018 14:24, Adam Borowski пишет:
> ...
>>
>> So any event (the user's request) has already happened.  A rc system, of
>> which systemd is one, knows whether we reached the "want root filesystem" or
>> "want secondary filesystems" stage.  Once you're there, you can issue the
>> mount() call and let the kernel do the work.
>>
>>> It is a btrfs choice to not expose compound device as separate one (like
>>> every other device manager does)
>>
>> Btrfs is not a device manager, it's a filesystem.
>>
>>> it is a btrfs drawback that doesn't provice anything else except for this
>>> IOCTL with it's logic
>>
>> How can it provide you with something it doesn't yet have?  If you want the
>> information, call mount().  And as others in this thread have mentioned,
>> what, pray tell, would you want to know "would a mount succeed?" for if you
>> don't want to mount?
>>
>>> it is a btrfs drawback that there is nothing to push assembling into "OK,
>>> going degraded" state
>>
>> The way to do so is to timeout, then retry with -o degraded.
>>
> 
> That's possible way to solve it. This likely requires support from
> mount.btrfs (or btrfs.ko) to return proper indication that filesystem is
> incomplete so caller can decide whether to retry or to try degraded mount.
We already do so in the accepted standard manner.  If the mount fails 
because of a missing device, you get a very specific message in the 
kernel log about it, as is the case for most other common errors (for 
uncommon ones you usually just get a generic open_ctree error).  This is 
really the only option too, as the mount() syscall (which the mount 
command calls) returns only 0 on success or -1 and an appropriate errno 
value on failure, and we can't exactly go about creating a half dozen 
new error numbers just for this (well, technically we could, but I very 
much doubt that they would be accepted upstream, which defeats the purpose).
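
For reference, a minimal sketch of what the caller actually sees (the
paths are invented for the example): the syscall hands back success or
an errno, and the reason only shows up in the kernel log.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

int main(void)
{
	if (mount("/dev/sda2", "/mnt", "btrfs", 0, "") == -1) {
		/* All we get back is -1 plus errno; whether the cause was
		 * a missing device or something else entirely is only
		 * visible in dmesg (the open_ctree / missing-device lines). */
		fprintf(stderr, "mount failed: %s\n", strerror(errno));
		return 1;
	}
	return 0;
}
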
> 
> Or may be mount.btrfs should implement this logic internally. This would
> really be the most simple way to make it acceptable to the other side by
> not needing to accept anything :)
And that would also be another layering violation, one which would require a 
proliferation of extra mount options to control the mount command itself 
and adjust the timeout handling.

This has been done before with mount.nfs, but for slightly different 
reasons (primarily to allow nested NFS mounts, since the local directory 
that the filesystem is being mounted on not yet being present is treated 
like a mount timeout), and it had near zero control.  It works there 
because the helper keeps the complicated policy decisions out (namely, 
there is no support for retrying with different options or trying a 
different server).

With what you're proposing for BTRFS however, _everything_ is a 
complicated decision, namely:
1. Do you retry at all?  During boot, the answer should usually be yes, 
but during normal system operation it should normally be no (because we 
should be letting the user handle issues at that point).
2. How long should you wait before you retry?  There is no right answer 
here that will work in all cases (I've seen systems which take multiple 
minutes for devices to become available on boot), especially considering 
those of us who would rather have things fail early.
3. If the retry fails, do you retry again?  How many times before it 
just outright fails?  This is going to be system specific policy.  On 
systems where devices may take a while to come online, the answer is 
probably yes and some reasonably large number, while on systems where 
devices are known to reliably be online immediately, it makes no sense 
to retry more than once or twice.
4. If you are going to retry, should you try a degraded mount?  Again, 
this is going to be system specific policy (regular users would probably 
want this to be a yes, while people who care about data integrity over 
availability would likely want it to be a no).
5. Assuming you do retry with the degraded mount, how many times should 
a normal mount fail before things go degraded?  This ties in with 3 and 
has the same arguments about variability I gave there.
6. How many times do you try a degraded mount before just giving up? 
Again, similar variability to 3.
7. Should each attempt try first a regular mount and then a degraded 
one, or do you try just normal a couple times and then switch to 
degraded, or even start out trying normal and then start alternating? 
Any of those patterns has valid arguments both for and against it, so 
this again needs to be user configurable policy.

Altogether, that's a total of 7 policy decisions that should be user 
configurable.  Having a config file other than /etc/fstab for the mount 
command should probably be avoided for sanity reasons (again, BTRFS is a 
filesystem, not a volume manager), so they would all have to be handled 
through mount options.  The kernel will additionally have to understand 
that those options need to be ignored (things do try to mount 
filesystems without calling a mount helper, most notably the kernel when 
it mounts the root filesystem on boot if you're not using an initramfs). 
  All in all, this type of thing gets out of hand _very_ fast.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-29 19:00                                     ` Austin S. Hemmelgarn
@ 2018-01-29 21:54                                       ` waxhead
  2018-01-30 13:46                                         ` Austin S. Hemmelgarn
  2018-01-30 15:24                                       ` Tomasz Pala
  1 sibling, 1 reply; 54+ messages in thread
From: waxhead @ 2018-01-29 21:54 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Andrei Borzenkov, Adam Borowski, Btrfs BTRFS



Austin S. Hemmelgarn wrote:
> On 2018-01-29 12:58, Andrei Borzenkov wrote:
>> 29.01.2018 14:24, Adam Borowski пишет:
>> ...
>>>
>>> So any event (the user's request) has already happened.  A rc system, of
>>> which systemd is one, knows whether we reached the "want root 
>>> filesystem" or
>>> "want secondary filesystems" stage.  Once you're there, you can issue 
>>> the
>>> mount() call and let the kernel do the work.
>>>
>>>> It is a btrfs choice to not expose compound device as separate one 
>>>> (like
>>>> every other device manager does)
>>>
>>> Btrfs is not a device manager, it's a filesystem.
>>>
>>>> it is a btrfs drawback that doesn't provice anything else except for 
>>>> this
>>>> IOCTL with it's logic
>>>
>>> How can it provide you with something it doesn't yet have?  If you 
>>> want the
>>> information, call mount().  And as others in this thread have mentioned,
>>> what, pray tell, would you want to know "would a mount succeed?" for 
>>> if you
>>> don't want to mount?
>>>
>>>> it is a btrfs drawback that there is nothing to push assembling into 
>>>> "OK,
>>>> going degraded" state
>>>
>>> The way to do so is to timeout, then retry with -o degraded.
>>>
>>
>> That's possible way to solve it. This likely requires support from
>> mount.btrfs (or btrfs.ko) to return proper indication that filesystem is
>> incomplete so caller can decide whether to retry or to try degraded 
>> mount.
> We already do so in the accepted standard manner.  If the mount fails 
> because of a missing device, you get a very specific message in the 
> kernel log about it, as is the case for most other common errors (for 
> uncommon ones you usually just get a generic open_ctree error).  This is 
> really the only option too, as the mount() syscall (which the mount 
> command calls) returns only 0 on success or -1 and an appropriate errno 
> value on failure, and we can't exactly go about creating a half dozen 
> new error numbers just for this (well, technically we could, but I very 
> much doubt that they would be accepted upstream, which defeats the 
> purpose).
>>
>> Or may be mount.btrfs should implement this logic internally. This would
>> really be the most simple way to make it acceptable to the other side by
>> not needing to accept anything :)
> And would also be another layering violation which would require a 
> proliferation of extra mount options to control the mount command itself 
> and adjust the timeout handling.
> 
> This has been done before with mount.nfs, but for slightly different 
> reasons (primarily to allow nested NFS mounts, since the local directory 
> that the filesystem is being mounted on not being present is treated 
> like a mount timeout), and it had near zero control.  It works there 
> because they push the complicated policy decisions to userspace (namely, 
> there is no support for retrying with different options or trying a 
> different server).
> 
I just felt like commenting a bit on this from a regular user's point of 
view.

Remember that at some point BTRFS will probably be the default 
filesystem for the average penguin.
BTRFS's big selling point is redundancy and a guarantee that whatever you 
write is the same as what you will read back sometime later.

Many users will probably build their BTRFS system on a redundant array 
of storage devices. As long as sufficient (not necessarily 
all) storage devices are present they expect their system to come up and 
work. If the system is not able to come up in a fully operative state it 
must at least be able to limp along until the issue is fixed.

Starting an argument about which init system is the most sane or most 
shiny is not helping. The truth is that systemd is not going away 
any time soon, and one might as well try to become friends, if nothing 
else then for the sake of having things working, which should be a common 
goal regardless of religion.

I personally think the degraded mount option is a mistake, as it 
assumes that a lightly degraded system is not able to work, which is false.
If the system can mount to some working state then it should mount, 
regardless of whether it is fully operative or not. If the array is in a bad 
state you need to learn about it by issuing a command or something. The 
same goes for an MD array (and yes, I am aware of the block layer vs 
filesystem thing here).

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-29  8:54                               ` Tomasz Pala
  2018-01-29 11:24                                 ` Adam Borowski
@ 2018-01-30  4:44                                 ` Chris Murphy
  2018-01-30 15:40                                   ` Tomasz Pala
  1 sibling, 1 reply; 54+ messages in thread
From: Chris Murphy @ 2018-01-30  4:44 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: Btrfs BTRFS

On Mon, Jan 29, 2018 at 1:54 AM, Tomasz Pala <gotar@polanet.pl> wrote:
> On Sun, Jan 28, 2018 at 17:00:46 -0700, Chris Murphy wrote:
>
>> systemd can't possibly need to know more information than a person
>> does in the exact same situation in order to do the right thing. No
>> human would wait 10 minutes, let alone literally the heat death of the
>> planet for "all devices have appeared" but systemd will. And it does
>
> We're already repeating - systemd waits for THE btrfs-compound-device,
> not ALL the block-devices. Just like it 'waits' for someone to plug USB
> pendrive in.

Btrfs is orthogonal to systemd's willingness to wait forever while
making no progress. It doesn't matter what it is, it shouldn't wait
forever.

It occurs to me there are such systemd service units specifically for
waiting, for example:

systemd-networkd-wait-online.service, systemd-networkd-wait-online -
Wait for network to come online

chrony-wait.service - Wait for chrony to synchronize system clock

NetworkManager has a version of this. I don't see why there can't be a
wait for Btrfs to mount normally: just try to mount; if it fails,
wait 10, try again, wait 10, try again. And then fail the unit so we
end up at a prompt. Or some people can optionally ask for a mount -o
degraded instead of a fail, and then if that also doesn't work, the
unit fails. Of course service units can have such conditionals rather
than waiting forever.
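
As a rough sketch of exactly that policy (not an existing tool; the
device path, mount point, retry count and the per-volume "allow
degraded" flag are all invented for the example), the whole loop is
only a few lines of C:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

/* Invented policy knobs, not anything systemd or btrfs-progs ship today. */
#define RETRIES      3
#define RETRY_DELAY  10          /* seconds */

static int try_mount(const char *opts)
{
	return mount("/dev/sda2", "/mnt", "btrfs", 0, opts);
}

int main(void)
{
	bool allow_degraded = true;   /* hypothetical per-volume setting */
	int i;

	for (i = 0; i < RETRIES; i++) {
		if (try_mount("") == 0)
			return 0;                /* normal mount worked */
		fprintf(stderr, "mount failed (%s), retrying in %d s\n",
			strerror(errno), RETRY_DELAY);
		sleep(RETRY_DELAY);
	}

	if (allow_degraded && try_mount("degraded") == 0)
		return 0;                        /* degraded fallback worked */

	return 1;    /* give up: the unit fails and we end up at a prompt */
}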





-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-29 11:24                                 ` Adam Borowski
  2018-01-29 13:05                                   ` Austin S. Hemmelgarn
  2018-01-29 17:58                                   ` Andrei Borzenkov
@ 2018-01-30 13:36                                   ` Tomasz Pala
  2 siblings, 0 replies; 54+ messages in thread
From: Tomasz Pala @ 2018-01-30 13:36 UTC (permalink / raw)
  To: Btrfs BTRFS

As I won't repeat myself, I've cut all the stuff I've already described
in detail before. Just read the previous mails.


On Mon, Jan 29, 2018 at 12:24:56 +0100, Adam Borowski wrote:

> How can it provide you with something it doesn't yet have?  If you want the
> information, call mount().  And as others in this thread have mentioned,
> what, pray tell, would you want to know "would a mount succeed?" for if you
> don't want to mount?

First of all I need the instruction: "SHOULD I TRY TO FORCE-MOUNT".

Don't you see the obvious difference between "following _environmental_
rules" and "reinventing the logic for every possible mounting initiator"?

>> it is a btrfs drawback that there is nothing to push assembling into "OK,
>> going degraded" state
> 
> The way to do so is to timeout, then retry with -o degraded.

There is no such logic in systemd core.
Systemd won't implement device management and fallbacks.

You might just send the appropriate event for such a request. I've already said how
to write such a timer and how to implement fallback logic in udev/systemd
rules. systemd won't write units for every piece of software in the
wild. mdadm provides its own rules (I gave links), LVM provides its own,
so JUST WRITE THE RULES.

Then you'll see what's missing inside btrfs for them to be effective.

> It does... you're confusing a block device (a _part_ of the filesystem) with
> the filesystem itself.  MD takes a bunch of such block devices and provides
> you with another block devices, btrfs takes a bunch of block devices and
> provides you with a filesystem.

It comes with consequences. Described before, so ENOREPEAT.

>> If this overlapping usage was designed with 'easier mounting' on mind,
>> this is simply bad design.
> 
> No other rc system but systemd has a problem.

This statement is ultimately FALSE.
1. My SysV init didn't handle this - until it was patched with btrfs-specific code,
2. SysV init systems in other distros didn't handle this - until they were reworked with btrfs in mind,
3. my geninitrd doesn't handle this - no one was willing to write the ADDITIONAL btrfs-specific code,
4. dracut didn't handle this - at least not without extra code for the btrfs case.

Systemd won't reimplement this in a 5th, 10th, 20th place - there is a
SINGLE place that should implement the state machine, just like this IS
handled by MD or LVM.

> No, I don't want systemd, or any userspace daemon, to try knowing kernel
> stuff better than the kernel.  Just call mount(), and that's it.

mount fails. Now it's YOUR job to retrigger degraded.

> Let me explain via a car analogy.  There is a flood that covers many roads,

Yeah, the great car analogies.
Do you have a separate engine for every wheel?
Do you have a separate brake pedal for every wheel?
Do you have separate door-opening logic (and a different key) for every door?

>> There's nothing the kernel is doing that's
>> telling udev there IS a degraded device assembled to be used.
> 
> Because there is no device.

I call it an ephemeral device, you call it an assembled volume - the naming
convention doesn't matter. But it eases understanding what others are
trying to explain to you.

>> YOU think that sda1 device is ephemeral, as it's covered by sda1 btrfs
>> device that COULD BE mounted.
> 
> sda1 is there, it's not ephemeral.

And after appearing it DOES NOT MOUNT. Quest failed. That's what's
happening.

>> So for the last time: nobody will break his own code to patch missing
>> code from other (actively maintained) subsystem.
> 
> I expect that a rc system doesn't get nosy trying to know things it has no
> reason to know about.  All other rc systems don't care, why should systemd
> be different?

All other rc systems fail miserably or have tons of code repeating
functionality. How many rc systems have you been involved in?

>> 1. implement degraded STATE _some_where_ - udev would handle falling
>>    back to degraded mount after specified timeout,
> 
> STATE of what?  The filesystem doesn't exist yet.

Paraphrasing you: how can I mount something that doesn't exist?

>> 2. change this IOCTL to _always_ return 1 - udev would register any
>>    btrfs device, but you will get random behaviour of mounting
>>    degraded/populated. But you should expect that since there is no
>>    concept of any state below.
> 
> If the ioctl, which has only a vague guess, doesn't do what you want, don't
> call it.  As it's btrfs specific already, there's no special casing on your
> part.

Without the "waiting for IOCTL OK response" btrfs mount would fail after
first device appears in system and mount happens before the next
components are available.

This COULD be done in systemd - provided there is some btrfsd that
retriggers mounting later (with or without degraded).

>> Actually, this is ridiculous - you expect the degradation to be handled
>> in some 3rd party software?! In init system? With the only thing you got
>> is 'degraded' mount option?!
>> What next - moving MD and LVM logic into systemd?
> 
> It's not init system's job.  So it shouldn't try to micromanage, but just
> mount().

As described above this would randomly fail for every multidevice setup.

You might either PREVENT race conditions (waiting for the IOCTL), or make
these races irrelevant (by keeping track of components and retriggering
mounts).

Btrfs makes both impossible and expects systemd to implement its own
logic.

>> 1. counted devices<all	=> not_ready
> 
> Count is unreliable.  It usually gives a good answer, but if you're
> contemplating mounting degraded, this is precisely the case it might be
> wrong.

Could you please respond after reading what's written below?

>> 2. counted devices<all BUT
>> - 'go degraded' received from userspace or kernel cmdline OR
>> - volume IS mounted and doesn't report errors (i.e. mount -o degraded
>>   DID succeeded)	=> ok_degraded
> 
> Then you don't want that ioctl, but mount().

There is no mount LOOP. mount() is called ONCE per device; if it fails,
then it is considered FAILED until REtriggered.

Just stop this flame and write 10 lines of retrigger rules. That would
greatly improve your comprehension of the problem.

> And what would you even want to use that hypothetical "ok_degraded" state for?

Could you just look into mdadm rules? This is obvious: for the same
purpose as mdadm invokes last-resort.

> It's not rocket science to edit an init script if knobs it exposes are not
> configurable enough for your needs. 

How many init scripts have you been involved in?

> If systemd decides to hide this
> functionality, it needs to provide the admin with some way to override.

There is - the udev rules and systemd units I've mentioned. Just use them.

> We're talking about issuing a mount call, it's not _that_ complicated.

So just do it! https://github.com/systemd/systemd

Please, go ahead with some PoC implementation, as it is REALLY hard to 
discuss init system/script corner cases with someone who has
apparently never written a single line of such code.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-29 21:54                                       ` waxhead
@ 2018-01-30 13:46                                         ` Austin S. Hemmelgarn
  2018-01-30 19:50                                           ` Tomasz Pala
  0 siblings, 1 reply; 54+ messages in thread
From: Austin S. Hemmelgarn @ 2018-01-30 13:46 UTC (permalink / raw)
  To: waxhead, Andrei Borzenkov, Adam Borowski, Btrfs BTRFS

On 2018-01-29 16:54, waxhead wrote:
> 
> 
> Austin S. Hemmelgarn wrote:
>> On 2018-01-29 12:58, Andrei Borzenkov wrote:
>>> 29.01.2018 14:24, Adam Borowski пишет:
>>> ...
>>>>
>>>> So any event (the user's request) has already happened.  A rc 
>>>> system, of
>>>> which systemd is one, knows whether we reached the "want root 
>>>> filesystem" or
>>>> "want secondary filesystems" stage.  Once you're there, you can 
>>>> issue the
>>>> mount() call and let the kernel do the work.
>>>>
>>>>> It is a btrfs choice to not expose compound device as separate one 
>>>>> (like
>>>>> every other device manager does)
>>>>
>>>> Btrfs is not a device manager, it's a filesystem.
>>>>
>>>>> it is a btrfs drawback that doesn't provice anything else except 
>>>>> for this
>>>>> IOCTL with it's logic
>>>>
>>>> How can it provide you with something it doesn't yet have?  If you 
>>>> want the
>>>> information, call mount().  And as others in this thread have 
>>>> mentioned,
>>>> what, pray tell, would you want to know "would a mount succeed?" for 
>>>> if you
>>>> don't want to mount?
>>>>
>>>>> it is a btrfs drawback that there is nothing to push assembling 
>>>>> into "OK,
>>>>> going degraded" state
>>>>
>>>> The way to do so is to timeout, then retry with -o degraded.
>>>>
>>>
>>> That's possible way to solve it. This likely requires support from
>>> mount.btrfs (or btrfs.ko) to return proper indication that filesystem is
>>> incomplete so caller can decide whether to retry or to try degraded 
>>> mount.
>> We already do so in the accepted standard manner.  If the mount fails 
>> because of a missing device, you get a very specific message in the 
>> kernel log about it, as is the case for most other common errors (for 
>> uncommon ones you usually just get a generic open_ctree error).  This 
>> is really the only option too, as the mount() syscall (which the mount 
>> command calls) returns only 0 on success or -1 and an appropriate 
>> errno value on failure, and we can't exactly go about creating a half 
>> dozen new error numbers just for this (well, technically we could, but 
>> I very much doubt that they would be accepted upstream, which defeats 
>> the purpose).
>>>
>>> Or may be mount.btrfs should implement this logic internally. This would
>>> really be the most simple way to make it acceptable to the other side by
>>> not needing to accept anything :)
>> And would also be another layering violation which would require a 
>> proliferation of extra mount options to control the mount command 
>> itself and adjust the timeout handling.
>>
>> This has been done before with mount.nfs, but for slightly different 
>> reasons (primarily to allow nested NFS mounts, since the local 
>> directory that the filesystem is being mounted on not being present is 
>> treated like a mount timeout), and it had near zero control.  It works 
>> there because they push the complicated policy decisions to userspace 
>> (namely, there is no support for retrying with different options or 
>> trying a different server).
>>
> I just felt like commenting a bit on this from a regular users point of 
> view.
> 
> Remember that at some point BTRFS will probably be the default 
> filesystem for the average penguin.
> BTRFS big selling point is redundance and a guarantee that whatever you 
> write is the same that you will read sometime later.
> 
> Many users will probably build their BTRFS system on a redundant array 
> of storage devices. As long as there are sufficient (not necessarily 
> all) storage devices present they expect their system to come up and 
> work. If the system is not able to come up in a fully operative state it 
> must at least be able to limp until the issue is fixed.
> 
> Starting a argument about what init system is the most sane or most 
> shiny is not helping. The truth is that systemd is not going away 
> sometime soon and one might as well try to become friends if nothing 
> else for the sake of having things working which should be a common goal 
> regardless of the religion.
FWIW, I don't care that it's systemd in this case, I care that people 
are arguing for the forced use of a coding anti-pattern that ends up 
being covered as bad practice in first year computer science courses 
(no, seriously, every professional programmer I've asked about this had 
time-of-check-time-of-use race conditions covered in one of their 
first-year CS classes) or the enforcement of an event-based model that 
really doesn't make any sense for this (OK, it makes a little sense for 
handling of devices reappearing, but systemd doesn't need to be involved 
in that beyond telling the kernel that the device reappeared, except 
that that's udev's job).
> 
> I personally think the degraded mount option is a mistake as this 
> assumes that a lightly degraded system is not able to work which is false.
> If the system can mount to some working state then it should mount 
> regardless if it is fully operative or not. If the array is in a bad 
> state you need to learn about it by issuing a command or something. The 
> same goes for a MD array (and yes, I am aware of the block layer vs 
> filesystem thing here).
The problem with this is that right now, it is not safe to run a BTRFS 
volume degraded and writable, but for an even remotely usable system 
with pretty much any modern distro, you need your root filesystem to be 
writable (or you need to have jumped through the hoops to make sure /var 
and /tmp are writable even if / isn't).

Long-term, yes, I do think that such behavior should be an option (yes, 
specifically optional; there are people out there who, like me, would 
rather the system just not boot, so we know immediately that something is 
wrong and can fix it right then).

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-29 13:05                                   ` Austin S. Hemmelgarn
@ 2018-01-30 13:46                                     ` Tomasz Pala
  2018-01-30 15:05                                       ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 54+ messages in thread
From: Tomasz Pala @ 2018-01-30 13:46 UTC (permalink / raw)
  To: Btrfs BTRFS

On Mon, Jan 29, 2018 at 08:05:42 -0500, Austin S. Hemmelgarn wrote:

> Seriously, _THERE IS A RACE CONDITION IN SYSTEMD'S CURRENT HANDLING OF 
> THIS_.  It's functionally no different than prefacing an attempt to send 
> a signal to a process by checking if the process exists, or trying to 
> see if some other process is using a file that might be locked by 

Seriously, there is a race condition at train stations. People check if
the train has stopped and opened its doors before they move their legs to
get in, but the train might already be gone - so this is pointless.

Instead, they should move their legs continuously and if the train is
not at the station yet, just climb back out and retry.


See the difference? I hope now you see what a race condition is.
It is a condition where the CONSEQUENCES are fatal.


Mounting BEFORE the volume is complete is FATAL - since no userspace daemon
would ever retrigger the mount and the system won't come up. Provide a
btrfsd volume manager and systemd could probably switch to using it.

Mounting AFTER the volume is complete is FINE - and if the "pseudo-race" happens
and the volume disappears, then this was either some operator action, so the
umount SHOULD happen, or we are facing some MALFUNCTION, which is fatal
in itself, not by being a "race condition".

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-30 13:46                                     ` Tomasz Pala
@ 2018-01-30 15:05                                       ` Austin S. Hemmelgarn
  2018-01-30 16:07                                         ` Tomasz Pala
  0 siblings, 1 reply; 54+ messages in thread
From: Austin S. Hemmelgarn @ 2018-01-30 15:05 UTC (permalink / raw)
  To: Tomasz Pala, Btrfs BTRFS

On 2018-01-30 08:46, Tomasz Pala wrote:
> On Mon, Jan 29, 2018 at 08:05:42 -0500, Austin S. Hemmelgarn wrote:
> 
>> Seriously, _THERE IS A RACE CONDITION IN SYSTEMD'S CURRENT HANDLING OF
>> THIS_.  It's functionally no different than prefacing an attempt to send
>> a signal to a process by checking if the process exists, or trying to
>> see if some other process is using a file that might be locked by
> 
> Seriously, there is a race condition on train stations. People check if
> the train has stopped and opened the door before they move their legs to
> get in, but the train might be already gone - so this is pointless.
> 
> Instead, they should move their legs continuously and if the train is
> not on the station yet, just climb back and retry.
No, that's really not a good analogy given the fact that that check for 
the presence of a train takes a normal person milliseconds while the 
event being raced against (the train departing) takes minutes.  In the 
case being discussed, the check takes milliseconds and the event being 
raced against also takes milliseconds.  The scale here is drastically 
different.
>
> See the difference? I hope now you know what is the race condition.
> It is the condition, where CONSEQUENCES are fatal.
Yes, the consequences of the condition being discussed functionally are 
fatal (you completely fail to mount the volume), because systemd doesn't 
retry mounting the root filesystem, it just breaks, which is absolutely 
at odds with the whole 'just works' mentality I always hear from the 
systemd fanboys and developers.

You're already looping forever _waiting_ for the volume to appear.  How 
is that any different from looping forever trying to _mount_ the volume 
instead, given that failing to mount the volume is not going to damage 
things?  The issue here is that systemd refuses to implement any method 
of actually retrying things that fail during startup.
>
> mounting BEFORE volume is complete is FATAL - since no userspace daemon
> would ever retrigger the mount and the system won't came up. Provide one
> btrfsd volume manager and systemd could probably switch to using it.
And here you've lost any respect I might have had for you.

**YOU DO NOT NEED A DAEMON TO DO EVERY LAST TASK ON THE SYSTEM**

Period, end of story.

<rant>
This is one of the two biggest things I hate about systemd (the journal 
is the other one for those who care).  You don't need some special 
daemon to set the time, or to set the hostname, or to fetch account 
data, or even to track who's logged in (though I understand that the 
last one is not systemd's fault originally).

As much as it may surprise the systemd developers, people got on just 
fine handling setting the system time, setting the hostname, fetching 
account info, tracking active users, and any number of myriad other 
tasks before systemd decided they needed to have their own special daemon.
</rant>

In this particular case, you don't need a daemon because the kernel does 
the state tracking.  It only checks that state completely though _when 
you ask it to mount the filesystem_ because it requires doing 99% of the 
work of mounting the filesystem (quite literally, you're doing pretty 
much everything short of actually hooking things up in the VFS layer). 
We are not a case like MD where there's just a tiny bit of metadata to 
parse to check what the state is supposed to be.  Imagine if LVM 
required you to unconditionally activate all the LV's in a VG when you 
activate the VG and what logic would be required to validate the VG 
then, and you're pretty close to what's needed to check state for a 
BTRFS volume (translating LV's to chunks and the VG to the filesystem as 
a whole).  There is no point in trying to parse that data every time a 
new device shows up, it's a waste of time (at a minimum, you're almost 
doubling the amount of time it takes to mount a volume if you are doing 
this each time a device shows up), energy, and resources in general.
> 
> mounting AFTER volume is complete is FINE - and if the "pseudo-race" happens
> and volume disappears, then this was either some operator action, so the
> umount SHOULD happen, or we are facing some MALFUNCION, which is fatal
> itself, not by being a "race condition".
Short of catastrophic failure, the _volume_ doesn't disappear, a 
component device does, and that is where the problem lies, especially 
given that the ioctl only tracks that each component device has been 
seen, not that all are present at the moment the ioctl is invoked.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-29 13:42                       ` Austin S. Hemmelgarn
@ 2018-01-30 15:09                         ` Tomasz Pala
  2018-01-30 16:22                           ` Tomasz Pala
  2018-01-30 16:30                           ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 54+ messages in thread
From: Tomasz Pala @ 2018-01-30 15:09 UTC (permalink / raw)
  To: Majordomo vger.kernel.org

On Mon, Jan 29, 2018 at 08:42:32 -0500, Austin S. Hemmelgarn wrote:

>> Yes. They are stupid enough to fail miserably with any more complicated
>> setups, like stacking volume managers, crypto layer, network attached
>> storage etc.
> I think you mean any setup that isn't sensibly layered.

No, I mean any setup that wasn't considered by init system authors.
Your 'sensibly' is not sensible for me.

> BCP for over a 
> decade has been to put multipathing at the bottom, then crypto, then 
> software RAID, than LVM, and then whatever filesystem you're using. 

Really? Let's enumerate some caveats of this:

- crypto below software RAID means double-encryption (wasted CPU),

- RAID below LVM means you're stuck with the same RAID-profile for all
  the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
  system and RAID0 for various system caches (like ccache on software
  builder machine) or transient LVM-level snapshots.

- RAID below the filesystem means losing btrfs-RAID extra functionality,
  like recovering data from a different mirror when a CRC mismatch happens,

- crypto below LVM means encrypting everything, including data that is
  not sensitive - more CPU wasted,

- RAID below LVM means no way to use SSD acceleration of part of the HDD
  space using MD write-mostly functionality.

What you present is only some sane default, which doesn't mean it covers
all the real-world cases.

My recent server is using:
- raw partitioning for base volumes,
- LVM,
- MD on top of some LVs (varying levels),
- partitioned SSD cache attached to specific VGs,
- crypto on top of selected LV/MD,
- btrfs RAID1 on top of non-MDed LVs.

> Multipathing has to be the bottom layer for a given node because it 
> interacts directly with hardware topology which gets obscured by the 
> other layers.

It is the bottom layer, but it might be attached to volumes at virtually
any place in the logical topology tree. E.g. a bare network drive added as
a device-mapper mirror target for on-line volume cloning.

> Crypto essentially has to be next, otherwise you leak
> info about the storage stack.

I'm encrypting only the containers that require block-level encryption.
Others might have more effective filesystem-level encryption or even be
some TrueCrypt/whatever images.

> Swapping LVM and software RAID ends up 
> giving you a setup which is difficult for most people to understand and 
> therefore is hard to reliably maintain.

It's more difficult, as you need to manually maintain two (or more) separate VGs with
matching LVs inside. Harder, but more flexible.

> Other init systems enforce things being this way because it maintains 
> people's sanity, not because they have significant difficulty doing 
> things differently (and in fact, it is _trivial_ to change the ordering 
> in some of them, OpenRC on Gentoo for example quite literally requires 
> exactly N-1 lines to change in each of N files when re-ordering N 
> layers), provided each layer occurs exactly once for a given device and 
> the relative ordering is the same on all devices.  And you know what? 

The point is: maintaining all of this logic is NOT the job of an init system.
With systemd you need exactly N-N=0 lines of code to make this work.

The appropriate unit files are provided by MD and LVM upstream.
And they include a fallback mechanism for degraded volumes.

> Given my own experience with systemd, it has exactly the same constraint 
> on relative ordering.  I've tried to run split setups with LVM and 
> dm-crypt where one device had dm-crypt as the bottom layer and the other 
> had it as the top layer, and things locked up during boot on _every_ 
> generalized init system I tried.

Hard to tell without access to the failing system, but this MIGHT have been:

- old/missing/broken-by-distro-maintainers-who-know-better LVM rules,
- old/bugged systemd, possibly with broken/old cryptsetup rules.

>> It's quite obvious who's the culprit: every single remaining filesystem
>> manages to mount under systemd without problems. They just expose
>> informations about their state.
> No, they don't (except ZFS).

They don't expose information (as there is none), but they DO mount.

> There is no 'state' to expose for anything but BTRFS (and ZFS)

Does ZFS expose its state or not?

> except possibly if the filesystem needs checked or 
> not.  You're conflating filesystems and volume management.

btrfs is a filesystem, a device manager and a volume manager.
I might add a DEVICE to a btrfs-thingy.
I might mount the same btrfs-thingy selecting a different VOLUME (subVOL=something_other).

> The alternative way of putting what you just said is:
> Every single remaining filesystem manages to mount under systemd without 
> problems, because it doesn't try to treat them as a block layer.

Or: every other volume manager exposes separate block devices.

Anyway - however we put this into words, it is btrfs that behaves differently.

>> The 'needless complication', as you named it, usually should be the default
>> to use. Avoiding LVM? Then take care of repartitioning. Avoiding mdadm?
>> No easy way to RAID the drive (there are device-mapper tricks, they are
>> just way more complicated). Even attaching SSD cache is not trivial
>> without preparations (for bcache being the absolutely necessary, much
>> easier with LVM in place).
> For a bog-standard client system, all of those _ARE_ overkill (and 
> actually, so is BTRFS in many cases too, it's just that we're the only 
> option for main-line filesystem-level snapshots at the moment).

Such standard systems don't have multidevice btrfs volumes either, so
they are outside the problem discussed here.

>>>> If btrfs pretends to be device manager it should expose more states,
>>>
>>> But it doesn't pretend to.
>> 
>> Why mounting sda2 requires sdb2 in my setup then?
> First off, it shouldn't unless you're using a profile that doesn't 
> tolerate any missing devices and have provided the `degraded` mount 
> option.  It doesn't in your case because you are using systemd.

I have written this previously (19-22 Dec, "Unexpected raid1 behaviour"):

1. create a 2-device btrfs, e.g. /dev/sda and /dev/sdb,
2. reboot the system into a clean state (init=/bin/sh) (or remove the btrfs-scan tool),
3. try
mount /dev/sda /test - fails
mount /dev/sdb /test - works
4. reboot again and try in reversed order
mount /dev/sdb /test - fails
mount /dev/sda /test - works

mounting btrfs without "btrfs device scan" doesn't work at
all without the udev rules (which mimic the behaviour of that command).

> Second, BTRFS is not a volume manager, it's a filesystem with 
> multi-device support.

What is the designatum difference between 'volume' and 'subvolume'?

> The difference is that it's not a block layer, 

As a de facto design choice only.

> despite the fact that systemd is treating it as such.   Yes, BTRFS has 
> failure modes that result in regular operations being refused based on 
> what storage devices are present, but so does every single distributed 
> filesystem in existence, and none of those are volume managers either.

Great example - how does systemd mount distributed/network filesystems?
Does it mount them blindly, in a loop, or does it fire some checks against
_plausible_ availability?

In other words, is it:
- systemd that treats btrfs WORSE than distributed filesystems, OR
- btrfs that requires systemd to treat it BETTER than other filesystems?

>> There is a term for such situation: broken by design.
> So in other words, it's broken by design to try to connect to a remote 
> host without pinging it first to see if it's online?

Trying to connect to a remote host without checking if OUR network is
already up and if the remote target MIGHT be reachable using OUR routes.

systemd checks LOCAL conditions: being online in case of network, being
online in case of hardware, being online in case of virtual devices.

> In all of those cases, there is no advantage to trying to figure out if 
> what you're trying to do is going to work before doing it, because every 

...provided there are some measures taken for the premature operation to be
repeated. There are none in the btrfs ecosystem.

> There's a name for the type of design you're saying we should have here, 
> it's called a time of check time of use (TOCTOU) race condition.  It's 
> one of the easiest types of race conditions to find, and also one of the 
> easiest to fix.  Ask any sane programmer, and he will say that _that_ is 
> broken by design.

Explained before.

>> And you still blame systemd for using BTRFS_IOC_DEVICES_READY?
> Given that it's been proven that it doesn't work and the developers 
> responsible for it's usage don't want to accept that it doesn't work?  Yes.

Remove it then.

>> Just change the BTRFS_IOC_DEVICES_READY handler to always return READY.
>> 
> Or maybe we should just remove it completely, because checking it _IS 
> WRONG_,

That's right. But before committing upstream, check for consequences.
I've already described a few today, pointed to the source and gave some
possible alternative solutions.

> which is why no other init system does it, and in fact no 

Other init systems either fail at mounting degraded btrfs just like
systemd does, or have buggy workarounds reimplemented in each of them
just to handle something that should be centrally organized.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-29 19:00                                     ` Austin S. Hemmelgarn
  2018-01-29 21:54                                       ` waxhead
@ 2018-01-30 15:24                                       ` Tomasz Pala
  1 sibling, 0 replies; 54+ messages in thread
From: Tomasz Pala @ 2018-01-30 15:24 UTC (permalink / raw)
  To: Btrfs BTRFS

On Mon, Jan 29, 2018 at 14:00:53 -0500, Austin S. Hemmelgarn wrote:

> We already do so in the accepted standard manner.  If the mount fails 
> because of a missing device, you get a very specific message in the 
> kernel log about it, as is the case for most other common errors (for 
> uncommon ones you usually just get a generic open_ctree error).  This is 
> really the only option too, as the mount() syscall (which the mount 
> command calls) returns only 0 on success or -1 and an appropriate errno 
> value on failure, and we can't exactly go about creating a half dozen 
> new error numbers just for this (well, technically we could, but I very 
> much doubt that they would be accepted upstream, which defeats the purpose).

This is exactly why a separate communication channel - the ioctl - is
currently used. And I really don't understand why you fight against
expanding this ioctl's response.
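
For anyone following along, this is approximately how that channel is
consumed today: a stripped-down sketch of the check the udev builtin
performs against /dev/btrfs-control (error handling and the property
export are omitted, and the default device path is made up).

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

/* Ask the kernel, via /dev/btrfs-control, whether all devices belonging
 * to the filesystem that 'devpath' is part of have been seen yet. */
int main(int argc, char **argv)
{
	struct btrfs_ioctl_vol_args args = { 0 };
	const char *devpath = (argc > 1) ? argv[1] : "/dev/sda2";
	int fd, r;

	fd = open("/dev/btrfs-control", O_RDWR | O_CLOEXEC);
	if (fd < 0) {
		perror("open /dev/btrfs-control");
		return 1;
	}
	strncpy(args.name, devpath, sizeof(args.name) - 1);
	r = ioctl(fd, BTRFS_IOC_DEVICES_READY, &args);
	close(fd);

	/* 0: all registered devices present, 1: some not yet seen, <0: error.
	 * As discussed in this thread, "present" really means "has been
	 * scanned at some point", which is exactly the bone of contention. */
	printf("%s: %s\n", devpath, r == 0 ? "ready" : "not (yet) ready");
	return (r < 0) ? 1 : 0;
}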

> With what you're proposing for BTRFS however, _everything_ is a 
> complicated decision, namely:
> 1. Do you retry at all?  During boot, the answer should usually be yes, 
> but during normal system operation it should normally be no (because we 
> should be letting the user handle issues at that point).

This is exactly why I propose to introduce an ioctl in btrfs.ko that
accepts userspace-configured expectations (as a per-volume policy).

> 2. How long should you wait before you retry?  There is no right answer 
> here that will work in all cases (I've seen systems which take multiple 
> minutes for devices to become available on boot), especially considering 
> those of us who would rather have things fail early.

A btrfs-last-resort@.timer, by analogy to mdadm-last-resort@.timer.

> 3. If the retry fails, do you retry again?  How many times before it 
> just outright fails?  This is going to be system specific policy.  On 
> systems where devices may take a while to come online, the answer is 
> probably yes and some reasonably large number, while on systems where 
> devices are known to reliably be online immediately, it makes no sense 
> to retry more than once or twice.

All of this is the job of systemd timers/services.

> 4. If you are going to retry, should you try a degraded mount?  Again, 
> this is going to be system specific policy (regular users would probably 
> want this to be a yes, while people who care about data integrity over 
> availability would likely want it to be a no).

Just like above - easily user-configurable in systemd timers/services.

> 5. Assuming you do retry with the degraded mount, how many times should 
> a normal mount fail before things go degraded?  This ties in with 3 and 
> has the same arguments about variability I gave there.

As above.

> 6. How many times do you try a degraded mount before just giving up? 
> Again, similar variability to 3.
> 7. Should each attempt try first a regular mount and then a degraded 
> one, or do you try just normal a couple times and then switch to 
> degraded, or even start out trying normal and then start alternating? 
> Any of those patterns has valid arguments both for and against it, so 
> this again needs to be user configurable policy.
> 
> Altogether, that's a total of 7 policy decisions that should be user 
> configurable. 

All of them would be easy to implement if btrfs.ko could accept an
'allow-degraded' per-volume instruction and return 'try-degraded' from
the ioctl.

> Having a config file other than /etc/fstab for the mount 
> command should probably be avoided for sanity reasons (again, BTRFS is a 
> filesystem, not a volume manager), so they would all have to be handled 
> through mount options.  The kernel will additionally have to understand 
> that those options need to be ignored (things do try to mount 
> filesystems without calling a mount helper, most notably the kernel when 
> it mounts the root filesystem on boot if you're not using an initramfs). 
>   All in all, this type of thing gets out of hand _very_ fast.

You need to think about the two separately (a hypothetical sketch follows below):
1. tracking STATE - this is remembering the 'allow-degraded' option for now,
2. configured POLICY - this is to be handled by the init system.
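
To be explicit: nothing like this exists in btrfs.ko or btrfs-progs
today. Purely as a hypothetical illustration of that split (every name
below is invented), the kernel-reported STATE would be a small enum and
the retry/degraded decisions would stay in userspace-configured POLICY:

#include <stdbool.h>
#include <stdio.h>

/* HYPOTHETICAL -- no such interface exists in btrfs today.  It only
 * illustrates the state/policy split argued for above. */
enum vol_state {                 /* reported by the kernel (STATE)      */
	VOL_INCOMPLETE,          /* not enough devices for any mount    */
	VOL_DEGRADED_POSSIBLE,   /* would mount with -o degraded        */
	VOL_COMPLETE,            /* all devices present                 */
};

struct mount_policy {            /* configured in units/fstab (POLICY)  */
	bool allow_degraded;
	int  max_retries;
};

/* With the state exposed, the init system's decision is a table lookup. */
const char *decide(enum vol_state s, const struct mount_policy *p, int attempts)
{
	switch (s) {
	case VOL_COMPLETE:
		return "mount";
	case VOL_DEGRADED_POSSIBLE:
		return p->allow_degraded ? "mount -o degraded" : "keep waiting";
	default:
		return (attempts < p->max_retries) ? "keep waiting" : "fail the unit";
	}
}

int main(void)
{
	struct mount_policy p = { .allow_degraded = true, .max_retries = 3 };

	printf("%s\n", decide(VOL_DEGRADED_POSSIBLE, &p, 1));  /* mount -o degraded */
	return 0;
}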

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-30  4:44                                 ` Chris Murphy
@ 2018-01-30 15:40                                   ` Tomasz Pala
  0 siblings, 0 replies; 54+ messages in thread
From: Tomasz Pala @ 2018-01-30 15:40 UTC (permalink / raw)
  To: Btrfs BTRFS

On Mon, Jan 29, 2018 at 21:44:23 -0700, Chris Murphy wrote:

> Btrfs is orthogonal to systemd's willingness to wait forever while
> making no progress. It doesn't matter what it is, it shouldn't wait
> forever.

It times out after 90 seconds (by default) and then it fails the mount
entirely.

> It occurs to me there are such systemd service units specifically for
> waiting for example
> 
> systemd-networkd-wait-online.service, systemd-networkd-wait-online -
> Wait for network to
>        come online
> 
>  chrony-wait.service - Wait for chrony to synchronize system clock
> 
> NetworkManager has a version of this. I don't see why there can't be a
> wait for Btrfs to normally mount,

Because mounting a degraded btrfs without -o degraded won't WAIT for
anything - it just immediately returns failure.
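
That is, with one raid1 member missing, roughly:

  mount /dev/sda2 /mnt               # fails immediately, no waiting
  mount -o degraded /dev/sda2 /mnt   # may succeed, if the profile tolerates the missing device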

> just simply try to mount, it fails, wait 10, try again, wait 10 try again.

For the last time:

No
	Such
		Logic
			In
				Systemd
						CORE

Every wait/repeat is done using UNITS - as you already noticed
yourself. And these are plain, regular UNITS.

Is there anything that prevents YOU, Chris, from writing these UNITS for
btrfs?

I know what makes ME stop writing these units - it's the lack of feedback
from the btrfs.ko ioctl handler. Without it I am unable to write UNITS
handling fstab mount entries, because the logic would PROBABLY have to
be hardcoded inside systemd-fstab-generator.

And such logic MUST NOT be hardcoded - this MUST be user-configurable,
i.e. made on UNITS level.

You might argue that some distros' SysV units or Gentoo's OpenRC have
support for this, and that if you want to change anything it is only a few
lines of shell code to alter. But systemd-fstab-generator is a
compiled binary and so WON'T allow the behaviour to be user-configurable.

> And then fail the unit so we end up at a prompt.

This can also be easily done, just like the emergency shell spawns
when so configured. If only btrfs could accept and keep the information
that a volume is allowed to be mounted degraded.


OK, to be honest I _can_ write such rules now, keeping the
'allow-degraded' state somewhere else (in a file, for example).

But since this is a non-standardized side channel, such code won't
be accepted in systemd upstream, especially because it requires the
current udev rule to be slightly changed.
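
For the record, a sketch of such a side channel (the marker-file path is
made up, and whether this plays nicely with the stock 64-btrfs.rules
ordering is exactly the sort of thing that would need the upstream rule
tweaked):

  # /etc/udev/rules.d/65-btrfs-allow-degraded.rules (sketch, not upstreamable)
  SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="btrfs", ENV{ID_BTRFS_READY}=="0", \
    TEST=="/etc/btrfs/allow-degraded/$env{ID_FS_UUID}", ENV{SYSTEMD_READY}="1"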

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-30 15:05                                       ` Austin S. Hemmelgarn
@ 2018-01-30 16:07                                         ` Tomasz Pala
  0 siblings, 0 replies; 54+ messages in thread
From: Tomasz Pala @ 2018-01-30 16:07 UTC (permalink / raw)
  To: Btrfs BTRFS

On Tue, Jan 30, 2018 at 10:05:34 -0500, Austin S. Hemmelgarn wrote:

>> Instead, they should move their legs continuously and if the train is
>> not at the station yet, just climb back and retry.
> No, that's really not a good analogy given the fact that that check for 
> the presence of a train takes a normal person milliseconds while the 
> event being raced against (the train departing) takes minutes.  In the 

OMG... preventing races by "this would always take longer"? Seriously?

> You're already looping forever _waiting_ for the volume to appear.  How 

udev is waiting for events, not systemd. Nobody will do some crazy
cross-layered shortcuts to overcome others' laziness.

> is that any different from looping forever trying to _mount_ the volume 

Yes, because udev doesn't mount anything, ever. Not this binary dude!

> instead given that failing to mount the volume is not going to damage 
> things. 

A failed premature attempt to mount prevents the system from booting even WHEN
the devices are ready - this is fatal. The system boots randomly under racy
conditions.

But hey, "the devices will always appear faster, than the init attempt
to do the mount"!

Have you ever had some hardware RAID controller? Never heard about
devices appearing after 5 minutes of warming up?

> The issue here is that systemd refuses to implement any method 
> of actually retrying things that fail during startup.

1. Such methods are trivial and I've already mentioned them a dozen times.
2. They should be implemented in btrfs upstream, not systemd upstream,
   but I personally would happily help with writing them.
3. They require a full-circle path for 'allow-degraded' to be passed
   through the btrfs code.

>> mounting BEFORE volume is complete is FATAL - since no userspace daemon
>> would ever retrigger the mount and the system won't come up. Provide one
>> btrfsd volume manager and systemd could probably switch to using it.
> And here you've lost any respect I might have had for you.

Going personal? So thank you for the discussion and goodbye.

Please refrain from answering me, I'm not going to discuss this any
further with you.

> **YOU DO NOT NEED A DAEMON TO DO EVERY LAST TASK ON THE SYSTEM**

Sorry dude, but I won't repeat all the alternatives for the 5th time.

You *all* refuse to step into ANY possible solution mentioned.
You *all* expect systemd to do ALL the work, just like other init
systems were forced to do, against good design principles.

Good luck having btrfs degraded mount under systemd.

> <rant>
> This is one of the two biggest things I hate about systemd(the journal 
> is the other one for those who care).

The journal currently has *many* drawbacks, but this is not 'by design',
just 'by appropriate code missing for now'. The same applies to btrfs,
doesn't it?

> You don't need some special daemon to set the time,

Ever heard about NTP?

> or to set the hostname,

FUD - no such daemon

> or to fetch account data,

FUD

> or even to track who's logged in

FUD

> As much as it may surprise the systemd developers, people got on just 
> fine handling setting the system time, setting the hostname, fetching 
> account info, tracking active users, and any number of myriad other 
> tasks before systemd decided they needed to have their own special daemon.
> </rant>

Sure, in a myriad of scattered, distro-specific files. The only
reason systemd stepped in for some of these is that nobody else could
introduce and enforce a Linux-wide consensus. And if anyone had succeeded,
there would be some Austins blaming them for 'turning the good old
trash heap into a coherent de facto standard.'

> In this particular case, you don't need a daemon because the kernel does 
> the state tracking. 

Sure, MD doesn't require a daemon and neither does LVM. But they
do provide one - I know, they are all wrong.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-30 15:09                         ` Tomasz Pala
@ 2018-01-30 16:22                           ` Tomasz Pala
  2018-01-30 16:30                           ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 54+ messages in thread
From: Tomasz Pala @ 2018-01-30 16:22 UTC (permalink / raw)
  To: Majordomo vger.kernel.org

On Tue, Jan 30, 2018 at 16:09:50 +0100, Tomasz Pala wrote:

>> BCP for over a 
>> decade has been to put multipathing at the bottom, then crypto, then 
>> software RAID, than LVM, and then whatever filesystem you're using. 
> 
> Really? Let's enumerate some caveats of this:
> 
> - crypto below software RAID means double-encryption (wasted CPU),
> 
> - RAID below LVM means you're stuck with the same RAID-profile for all
>   the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
>   system and RAID0 for various system caches (like ccache on software
>   builder machine) or transient LVM-level snapshots.
> 
> - RAID below filesystem means losing btrfs-RAID extra functionality,
>   like recovering data from different mirror when CRC mismatch happens,
> 
> - crypto below LVM means encrypting everything, including data that is
>   not sensitive - more CPU wasted,

And, what is much worse - encrypting everything using the same secret.
BIG show-stopper.

I would shred such a BCP as ineffective and insecure for both data
integrity and confidentiality.

> - RAID below LVM means no way to use SSD acceleration of part of the HDD
>   space using MD write-mostly functionality.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-30 15:09                         ` Tomasz Pala
  2018-01-30 16:22                           ` Tomasz Pala
@ 2018-01-30 16:30                           ` Austin S. Hemmelgarn
  2018-01-30 19:24                             ` Tomasz Pala
  2018-01-30 19:40                             ` Tomasz Pala
  1 sibling, 2 replies; 54+ messages in thread
From: Austin S. Hemmelgarn @ 2018-01-30 16:30 UTC (permalink / raw)
  To: Tomasz Pala, Majordomo vger.kernel.org

On 2018-01-30 10:09, Tomasz Pala wrote:
> On Mon, Jan 29, 2018 at 08:42:32 -0500, Austin S. Hemmelgarn wrote:
> 
>>> Yes. They are stupid enough to fail miserably with any more complicated
>>> setups, like stacking volume managers, crypto layer, network attached
>>> storage etc.
>> I think you mean any setup that isn't sensibly layered.
> 
No, I mean any setup that wasn't considered by the init system's authors.
Your 'sensibly' is not sensible to me.
> 
>> BCP for over a
>> decade has been to put multipathing at the bottom, then crypto, then
>> software RAID, than LVM, and then whatever filesystem you're using.
> 
> Really? Let's enumerate some caveats of this:
> 
> - crypto below software RAID means double-encryption (wasted CPU),
It also means you leak no information about your storage stack.  If 
you're sufficiently worried about data protection that you're using 
block-level encryption, you should be thinking _very_ hard about whether 
or not that's an acceptable risk (and it usually isn't).
> 
> - RAID below LVM means you're stuck with the same RAID-profile for all
>    the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
>    system and RAID0 for various system caches (like ccache on software
>    builder machine) or transient LVM-level snapshots.
Then you skip MD and do the RAID work in LVM with DM-RAID (which 
technically _is_ MD, just with a different frontend).
> 
> - RAID below filesystem means losing btrfs-RAID extra functionality,
>    like recovering data from different mirror when CRC mismatch happens,
That depends on your choice of RAID and the exact configuration of the 
storage stack.  As long as you expose two RAID devices, BTRFS 
replication works just fine on top of them.
> 
> - crypto below LVM means encrypting everything, including data that is
>    not sensitive - more CPU wasted,
Encrypting only sensitive data is never a good idea unless you can prove 
with certainty that you will keep it properly segregated, and even then 
it's still a pretty bad idea because it makes it obvious exactly where 
the information you consider sensitive is stored.
> 
> - RAID below LVM means no way to use SSD acceleration of part of the HDD
>    space using MD write-mostly functionality.
Again, just use LVM's DM-RAID and throw in DM-cache.  Also, there were 
some patches just posted for BTRFS that indirectly allow for this 
(specifically, they let you change the read-selection algorithm, with 
the option of specifying to preferentially read from a specific device).
> 
> What you present is only some sane default, which doesn't mean it covers
> all the real-world cases.
> 
> My recent server is using:
> - raw partitioning for base volumes,
> - LVM,
> - MD on top of some LVs (varying levels),
> - partitioned SSD cache attached to specific VGs,
> - crypto on top of selected LV/MD,
> - btrfs RAID1 on top of non-MDed LVs.
> 
>> Multipathing has to be the bottom layer for a given node because it
>> interacts directly with hardware topology which gets obscured by the
>> other layers.
> 
> It is the bottom layer, but I might be attached into volumes at virtually
> any place of the logical topology tree. E.g. bare network drive added as
> device-mapper mirror target for on-line volume cloning.
And you seriously think that that's going to be a persistent setup? 
One-shot stuff like that is almost never an issue unless your init 
system is absolutely brain-dead _and_ you need it working as it was 
immediately (and a live-clone of a device doesn't if you're doing it right).
> 
>> Crypto essentially has to be next, otherwise you leak
>> info about the storage stack.
> 
> I'm encrypting only the containers that require block-level encryption.
> Others might have more effective filesystem-level encryption or even be
> some TrueCrypt/whatever images.
Again, you're leaking information by doing so.  At a minimum, you're 
leaking info about where the data you consider sensitive is stored, and 
that's not counting volume names (exposed by LVM), container 
configuration (possibly exposed depending on how your container stack 
handles it), and other storage stack configuration info (exposed by the 
metadata of the various layers and possibly by files in /etc if you 
don't have your root filesystem encrypted).
> 
>> Swapping LVM and software RAID ends up
>> giving you a setup which is difficult for most people to understand and
>> therefore is hard to reliably maintain.
> 
> It's more difficult, as you need to maintain manually two (or more) separate VGs with
> matching LVs inside. Harder, but more flexible.
And it could also be trivially simplified by eliminating MD and using LVM's 
native support for DM-RAID, which provides essentially the same 
functionality because DM-RAID is largely just a DM frontend for MD.
> 
>> Other init systems enforce things being this way because it maintains
>> people's sanity, not because they have significant difficulty doing
>> things differently (and in fact, it is _trivial_ to change the ordering
>> in some of them, OpenRC on Gentoo for example quite literally requires
>> exactly N-1 lines to change in each of N files when re-ordering N
>> layers), provided each layer occurs exactly once for a given device and
>> the relative ordering is the same on all devices.  And you know what?
> 
> The point is: maintaining all of this logic is NOT the job for an init system.
> With systemd you need exactly N-N=0 lines of code to make this work.
So, I find it very hard to believe that systemd requires absolutely zero 
configuration of per-device dependencies.  If it really doesn't, then 
that's just more reason I will never use it, as auto-detection opens you 
up to some quite nasty physical attacks on the system.
> 
> The appropriate unit files are provided by MD and LVM upstream.
> And they include fallback mechanism for degrading volumes.
> 
>> Given my own experience with systemd, it has exactly the same constraint
>> on relative ordering.  I've tried to run split setups with LVM and
>> dm-crypt where one device had dm-crypt as the bottom layer and the other
>> had it as the top layer, and things locked up during boot on _every_
>> generalized init system I tried.
> 
> Hard to tell without access to the failing system, but this MIGHT have been:
> 
> - old/missing/broken-by-distro-maintainers-who-know-better LVM rules,
> - old/bugged systemd, possibly with broken/old cryptsetup rules.
> 
>>> It's quite obvious who's the culprit: every single remaining filesystem
>>> manages to mount under systemd without problems. They just expose
>>> informations about their state.
>> No, they don't (except ZFS).
> 
> They don't expose information (as there is none), but they DO mount.
> 
>> There is no 'state' to expose for anything but BTRFS (and ZFS)
> 
> Does ZFS expose it's state or not?
Yes, but I'm not quite sure exactly how much.  I assume it exposes 
enough to check if datasets can be mounted, but it's also not quite the 
same situation as BTRFS, because you can start a ZFS volume with half a 
pool and selectively mount only those datasets that are completely 
provided by the set of devices you do have.
> 
>> except possibly if the filesystem needs checked or
>> not.  You're conflating filesystems and volume management.
> 
> btrfs is a filesystem, device manager and volume manager.
BTRFS is a filesystem, it does not manage volumes except in the very 
limited sense that MD or hardware RAID do, and it does not manage 
devices (the kernel and udev do so).

> I might add DEVICE to a btrfs-thingy.
> I might mount the same btrfs-thingy selecting different VOLUME (subVOL=something_other)
Except subvolumes aren't really applicable here because they're all or 
nothing.  If you don't have the base filesystem, you don't have any 
subvolumes (because what mounting a subvolume actually does is mount the 
root of the filesystem, and then bind-mount the subvolume onto the 
specified mount-point).
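
Conceptually (with made-up paths):

  mount -o subvol=data /dev/sda2 /srv/data
  # is roughly equivalent to:
  mount /dev/sda2 /mnt/btrfs-top
  mount --bind /mnt/btrfs-top/data /srv/data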
> 
>> The alternative way of putting what you just said is:
>> Every single remaining filesystem manages to mount under systemd without
>> problems, because it doesn't try to treat them as a block layer.
> 
> Or: every other volume manager exposes separate block devices.
> 
> Anyway - however we put this into words, it is btrfs that behaves differently.
> 
>>> The 'needless complication', as you named it, usually should be the default
>>> to use. Avoiding LVM? Then take care of repartitioning. Avoiding mdadm?
>>> No easy way to RAID the drive (there are device-mapper tricks, they are
>>> just way more complicated). Even attaching SSD cache is not trivial
>>> without preparations (for bcache being the absolutely necessary, much
>>> easier with LVM in place).
>> For a bog-standard client system, all of those _ARE_ overkill (and
>> actually, so is BTRFS in many cases too, it's just that we're the only
>> option for main-line filesystem-level snapshots at the moment).
> 
> Such standard systems don't have multidevice btrfs volumes neither, so
> they are beyond the problem discussed here.
> 
>>>>> If btrfs pretends to be device manager it should expose more states,
>>>>
>>>> But it doesn't pretend to.
>>>
>>> Why mounting sda2 requires sdb2 in my setup then?
>> First off, it shouldn't unless you're using a profile that doesn't
>> tolerate any missing devices and have provided the `degraded` mount
>> option.  It doesn't in your case because you are using systemd.
> 
> I have written this previously (19-22 Dec, "Unexpected raid1 behaviour"):
> 
> 1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
> 2. reboot the system into clean state (init=/bin/sh), (or remove btrfs-scan tool),
> 3. try
> mount /dev/sda /test - fails
> mount /dev/sdb /test - works
> 4. reboot again and try in reversed order
> mount /dev/sdb /test - fails
> mount /dev/sda /test - works
> 
> mounting btrfs without "btrfs device scan" doesn't work at
> all without udev rules (that mimic behaviour of the command).
Actually, try your first mount command above with `-o 
device=/dev/sda,device=/dev/sdb` and it will work.  You don't need 
global scanning or the udev rules unless you want auto-detection.  The 
thing is, using this mount option (which effectively triggers the scan 
code directly on the specified devices as part of the mount call) makes 
it work in pretty much all init systems except systemd (which still 
tries to check with udev regardless).
> 
>> Second, BTRFS is not a volume manager, it's a filesystem with
>> multi-device support.
> 
> What is the designatum difference between 'volume' and 'subvolume'?
This is largely orthogonal to my comment above, but:

A volume is an entirely independent data set.  So, the following are all 
volumes:
* A partition on a storage device containing a filesystem that needs no 
other devices.
* A device-mapper target exposed by LVM.
* A /dev/md* device exposed by MDADM.
* The internal device mapping used by BTRFS (which is not exposed 
_anywhere_ outside of the given filesystem).
* A ZFS storage pool.

A sub-volume is a BTRFS-specific concept referring to a mostly 
independent filesystem tree within a BTRFS volume that still depends on 
the super-blocks, chunk-tree, and a couple of other internal structures 
from the main filesystem.  It's directly equivalent to the ZFS concept 
of a dataset, with the caveat that subvolumes are implicitly rooted at 
paths within their hierarchy (that is, if you have a subvolume at 
/something and mount the root subvolume, you will be able to access the 
contents of /something as well from that mount), while ZFS datasets are 
not (they have to each be explicitly mounted, and the mount hierarchy 
doesn't have to match the actual dataset hierarchy (but almost always 
does for sanity reasons)).  Furthermore, subvolumes are all-or-nothing 
dependent on the state of the filesystem as a whole (in theory, this 
could be changed, but it would be so invasive to do so that it's likely 
to never happen).
> 
>> The difference is that it's not a block layer,
> 
> As a de facto design choice only.
Not really...

ZFS is really the only comparable design to BTRFS out there, and even 
looking at their code it was decidedly non-trivial to implement zvols 
and have them play nice with everything else.
> 
>> despite the fact that systemd is treating it as such.   Yes, BTRFS has
>> failure modes that result in regular operations being refused based on
>> what storage devices are present, but so does every single distributed
>> filesystem in existence, and none of those are volume managers either.
> 
> Great example - how is systemd mounting distributed/network filesystems?
> Does it mount them blindly, in a loop, or fires some checks against
> _plausible_ availability?
Yes, but availability there is a boolean value.  In BTRFS it's tri-state 
(as of right now, possibly four to six states in the future depending on 
what gets merged), and the intermediate (not true or false) state can't 
be checked in a trivial manner.
> 
> In other words, is it:
> - the systemd that treats btrfs WORSE than distributed filesystems, OR
> - btrfs that requires from systemd to be treated BETTER than other fss?
Or maybe it's both?  I'm more than willing to admit that what BTRFS does 
expose currently is crap in terms of usability.  The reason it hasn't 
changed is that we (that is, the BTRFS people and the systemd people) 
can't agree on what it should look like.
> 
>>> There is a term for such situation: broken by design.
>> So in other words, it's broken by design to try to connect to a remote
>> host without pinging it first to see if it's online?
> 
> Trying to connect to remote host without checking if OUR network is
> already up and if the remote target MIGHT be reachable using OUR routes.
> 
> systemd checks LOCAL conditions: being online in case of network, being
> online in case of hardware, being online in case of virtual devices.
> 
>> In all of those cases, there is no advantage to trying to figure out if
>> what you're trying to do is going to work before doing it, because every
> 
> ...provided there are some measures taken for the premature operation to be
>> repeated. There are none in the btrfs ecosystem.
Yes, because we expect the user to do so, just like LVM, and MD, and 
pretty much every other block layer you're claiming we should be 
behaving like.
> 
>> There's a name for the type of design you're saying we should have here,
>> it's called a time of check time of use (TOCTOU) race condition.  It's
>> one of the easiest types of race conditions to find, and also one of the
>> easiest to fix.  Ask any sane programmer, and he will say that _that_ is
>> broken by design.
> 
> Explained before.
> 
>>> And you still blame systemd for using BTRFS_IOC_DEVICES_READY?
>> Given that it's been proven that it doesn't work and the developers
>> responsible for it's usage don't want to accept that it doesn't work?  Yes.
> 
> Remove it then.
As much as I would love to, we can't because <insert usual stable 
userspace API rant from Linus and co. here>.
> 
>>> Just change the BTRFS_IOC_DEVICES_READY handler to always return READY.
>>>
>> Or maybe we should just remove it completely, because checking it _IS
>> WRONG_,
> 
That's right. But before committing upstream, check for consequences.
I've already described a few today, pointed to the source and gave some
possible alternative solutions.
> 
>> which is why no other init system does it, and in fact no
> 
> Other init systems either fail at mounting degraded btrfs just like
> systemd does, or have buggy workarounds in their code reimplemented in
> each of them just to handle a thing that should be centrally organized.
> 
Really? So the fact that I can mount a 2-device volume with RAID1 
profiles degraded using OpenRC without needing anything more than adding 
rootflags=degraded to the kernel parameters must be a fluke then...

The thing is, it primarily breaks if there are hardware issues, 
regardless of the init system being used, but at least the other init 
systems _give you an error message_ (even if it's really the kernel 
spitting it out) instead of just hanging there forever with no 
indication of what's going on like systemd does.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-30 16:30                           ` Austin S. Hemmelgarn
@ 2018-01-30 19:24                             ` Tomasz Pala
  2018-01-30 19:40                             ` Tomasz Pala
  1 sibling, 0 replies; 54+ messages in thread
From: Tomasz Pala @ 2018-01-30 19:24 UTC (permalink / raw)
  To: Majordomo vger.kernel.org

On Tue, Jan 30, 2018 at 11:30:31 -0500, Austin S. Hemmelgarn wrote:

>> - crypto below software RAID means double-encryption (wasted CPU),
> It also means you leak no information about your storage stack.  If 

JBOD

> you're sufficiently worried about data protection that you're using 
> block-level encryption, you should be thinking _very_ hard about whether 
> or not that's an acceptable risk (and it usually isn't).

Nonsense. Block-level encryption is the last-resort protection; your
primary concern is to encrypt at the highest level possible. Anyway,
I don't need to care at all about encryption, but one of my customers might.
Just stop extending the justification of your narrow usage pattern to the
rest of the world.

BTW, if YOU are sufficiently worried about data protection you need to
use some hardware solution, like OPAL, and completely avoid using
consumer-grade (especially SSD) drives. This also saves CPU cycles,
but let's not discuss the gory details here.

If you can't imagine people have different requirements than you, then
this is your mental problem, go solve it somewhere else.

>> - RAID below LVM means you're stuck with the same RAID-profile for all
>>    the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
>>    system and RAID0 for various system caches (like ccache on software
>>    builder machine) or transient LVM-level snapshots.
> Then you skip MD and do the RAID work in LVM with DM-RAID (which 
> technically _is_ MD, just with a different frontend).

1. how is write-mostly handled by LVM-initiated RAID1?
2. how can one split an LVM RAID1 into separate volumes in a bit-rot
situation that requires manual intervention to recover a specific copy of
the data (just like btrfs checksumming does automatically in raid1 mode)?

>> - RAID below filesystem means losing btrfs-RAID extra functionality,
>>    like recovering data from different mirror when CRC mismatch happens,
> That depends on your choice of RAID and the exact configuration of the 

There is no data checksumming in MD-RAID, there is no voting in MD-RAID.
There is FEC mode in dm-verity.

> storage stack.  As long as you expose two RAID devices, BTRFS 
> replication works just fine on top of them.

Taking up 4 times the space? Or going crazy with 2*MD-RAID0?

>> - crypto below LVM means encrypting everything, including data that is
>>    not sensitive - more CPU wasted,
> Encrypting only sensitive data is never a good idea unless you can prove 

Encrypting the sensitive data _AT_or_ABOVE_ the filesystem level is
crucial for any really sensitive data.

> with certainty that you will keep it properly segregated, and even then 
> it's still a pretty bad idea because it makes it obvious exactly where 
> the information you consider sensitive is stored.

ROTFL

Do you really think this would make breaking the XTS easier than it
would be if the _entire_ drive were encrypted using THE SAME secret?
With the attacker having access to the plaintexts _AND_ the ciphertexts?

Wow... - stop doing the crypto, seriously. You do this wrong.

Do you think that my customer or a cooperative would happily share THEIR
secret with mine, just because we're running on the same server?

Have you ever heard about zero-knowledge databases?
Can you imagine that someone might want to do the decryption remotely,
because he doesn't trust me as the owner of the machine?

How does me KNOWING that their data is encrypted ease the attack?

>> - RAID below LVM means no way to use SSD acceleration of part of the HDD
>>    space using MD write-mostly functionality.
> Again, just use LVM's DM-RAID and throw in DM-cache.  Also, there were 

Obviously you've never used write-mostly, as you're apparently not aware
of the difference in maintenance burden.

> some patches just posted for BTRFS that indirectly allow for this 
> (specifically, they let you change the read-selection algorithm, with 
> the option of specifying to preferentially read from a specific device).

When they become available in an LTS kernel, some will definitely use them
and create even more complicated stacks.

>> It is the bottom layer, but I might be attached into volumes at virtually
>> any place of the logical topology tree. E.g. bare network drive added as
>> device-mapper mirror target for on-line volume cloning.
> And you seriously think that that's going to be a persistent setup? 

Persistent setups are archeology in IT.

> One-shot stuff like that is almost never an issue unless your init 
> system is absolutely brain-dead _and_ you need it working as it was 
> immediately (and a live-clone of a device doesn't if you're doing it right).

Brain-dead is a state of mind in which you reject usage scenarios that you
completely fail to understand, hopefully only due to limited experience.

>> The point is: maintaining all of this logic is NOT the job for an init system.
>> With systemd you need exactly N-N=0 lines of code to make this work.
> So, I find it very hard to believe that systemd requires absolutely zero 
> configuration of per-device dependencies.

You might resolve your religious doubts in the church of your choice, but
technical issues are better verified by experiment.

> If it really doesn't, then 
> that's just more reason I will never use it, as auto-detection opens you 
> up to some quite nasty physical attacks on the system.

ROTFL
There is no auto-detection, but - read my lips: METADATA. The same
metadata that allows btrfs to mount a filesystem after scanning all the
components. The same metadata that incrementally assembles MD.
The same metadata that is called UUID or DEVPATH (of various types).

Ever peeked into /dev/disk, /dev/mapper or /dev/dm-*?

>>> There is no 'state' to expose for anything but BTRFS (and ZFS)
>> 
>> Does ZFS expose it's state or not?
> Yes,

Morons! They could have made a dozen init-script maintainers handle
the logic inside a bunch of shell scripts!

> but I'm not quite sure exactly how much.

Well, if more than btrfs (i.e. if ANY) - are they morons?

> I assume it exposes enough to check if datasets can be mounted,

Oh, they are sooo-morons!

> but it's also not quite the 
> same situation as BTRFS, because you can start a ZFS volume with half a 
> pool and selectively mount only those datasets that are completely 
> provided by the set of devices you do have.

Isn't that only a temporary btrfs limitation? It was supposed to allow
per-subvolume mount options and per-object profiles.

>> btrfs is a filesystem, device manager and volume manager.
> BTRFS is a filesystem, it does not manage volumes except in the very 
> limited sense that MD or hardware RAID do, and it does not manage 
> devices (the kernel and udev do so).
> 
>> I might add DEVICE to a btrfs-thingy.
>> I might mount the same btrfs-thingy selecting different VOLUME (subVOL=something_other)
> Except subvolumes aren't really applicable here because they're all or 
> nothing.  If you don't have the base filesystem, you don't have any 
> subvolumes (because what mounting a subvolume actually does is mount the 
> root of the filesystem, and then bind-mount the subvolume onto the 
> specified mount-point).

A technical detail - without a single subvolume there is no btrfs
filesystem at all.

>> 1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
>> 2. reboot the system into clean state (init=/bin/sh), (or remove btrfs-scan tool),
>> 3. try
>> mount /dev/sda /test - fails
>> mount /dev/sdb /test - works
>> 4. reboot again and try in reversed order
>> mount /dev/sdb /test - fails
>> mount /dev/sda /test - works
>> 
>> mounting btrfs without "btrfs device scan" doesn't work at
>> all without udev rules (that mimic behaviour of the command).
> Actually, try your first mount command above with `-o 
> device=/dev/sda,device=/dev/sdb` and it will work.

Dude... stop writing this bullshit. I got this in fstab _AND_ in the
rootflags of the kernel cmdline and it DIDN'T work. I haven't seen any
commits improving this behaviour since - did I miss one?!

btrfs can NOT be assembled by a rootflags=device=... cmdline, in contrast
to MD RAID (and only with 0.9 metadata BTW, not the 1.0+ ones).

> You don't need 
> global scanning or the udev rules unless you want auto-detection.  The 

You've probably forgotten to disable udev, or simply didn't check this at all.

> thing is, using this mount option (which effectively triggers the scan 
> code directly on the specified devices as part of the mount call) makes 

Show me the code, in case it's really me writing the bullshit here.

> it work in pretty much all init systems except systemd (which still 
> tries to check with udev regardless).

Oh, this is definitely bullshit - btrfs-scanning the devices changes the
state returned by the IOCTL, so it has no other option than to work under systemd.

Have you actually ever used systemd? Or just read too much of the
systemd flaming by various wannabes?

>>> Second, BTRFS is not a volume manager, it's a filesystem with
>>> multi-device support.
>> 
>> What is the designatum difference between 'volume' and 'subvolume'?
> This is largely orthogonal to my comment above, but:
> 
> A volume is an entirely independent data set.  So, the following are all 
> volumes:
> * A partition on a storage device containing a filesystem that needs no 
> other devices.
> * A device-mapper target exposed by LVM.
> * A /dev/md* device exposed by MDADM.
> * The internal device mapping used by BTRFS (which is not exposed 
> _anywhere_ outside of the given filesystem).
> * A ZFS storage pool.

Those are technical differences - what is the designatum difference?

> A sub-volume is a BTRFS-specific concept referring to a mostly 
> independent filesystem tree within a BTRFS volume that still depends on 
> the super-blocks, chunk-tree, and a couple of other internal structures 
> from the main filesystem.

LVM volumes also depend on VG metadata. The main btrfs 'volume' that
holds the other subvolumes is only a technical difference.

>> Great example - how is systemd mounting distributed/network filesystems?
>> Does it mount them blindly, in a loop, or fires some checks against
>> _plausible_ availability?
> Yes, but availability there is a boolean value.

No, systemd won't try to mount remote filesystems until network is up.

> In BTRFS it's tri-state 
> (as of right now, possibly four to six states in the future depending on 
> what gets merged), and the intermediate (not true or false) state can't 
> be checked in a trivial manner.

All that udev needs is: "am I ALLOWED to force-mount this, even if degraded".

And this 'permission' must change after a user-supplied timeout.

>> In other words, is it:
>> - the systemd that treats btrfs WORSE than distributed filesystems, OR
>> - btrfs that requires from systemd to be treated BETTER than other fss?
> Or maybe it's both?  I'm more than willing to admit that what BTRFS does 
> expose currently is crap in terms of usability.  The reason it hasn't 
> changed is that we (that is, the BTRFS people and the systemd people) 
> can't agree on what it should look like.

This might be done ANY way that allows udev to work just like it works with MD.

>> ...provided there are some measures taken for the premature operation to be
>> repeated. There are none in the btrfs ecosystem.
> Yes, because we expect the user to do so, just like LVM, and MD, and 
> pretty much every other block layer you're claiming we should be 
> behaving like.

MD and LVM export their state, so the userspace CAN react. btrfs doesn't.

>> Other init systems either fail at mounting degraded btrfs just like
>> systemd does, or have buggy workarounds in their code reimplemented in
>> each of them just to handle a thing that should be centrally organized.
>> 
> Really? So the fact that I can mount a 2-device volume with RAID1 
> profiles degraded using OpenRC without needing anything more than adding 
> rootflags=degraded to the kernel parameters must be a fluke then...

We are talking about automatic fallback after a timeout, not about
manually casting magic spells! Since OpenRC doesn't read rootflags at all:

grep -iE 'rootflags|degraded|btrfs' openrc/**/*

it won't support this without some extra code.

> The thing is, it primarily breaks if there are hardware issues, 
> regardless of the init system being used, but at least the other init 
> systems _give you an error message_ (even if it's really the kernel 
> spitting it out) instead of just hanging there forever with no 
> indication of what's going on like systemd does.

If your systemd waits forever and you get no error messages, report a bug
to your distro maintainer, as he is probably the one to blame for 'fixing'
what wasn't broken.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-30 16:30                           ` Austin S. Hemmelgarn
  2018-01-30 19:24                             ` Tomasz Pala
@ 2018-01-30 19:40                             ` Tomasz Pala
  1 sibling, 0 replies; 54+ messages in thread
From: Tomasz Pala @ 2018-01-30 19:40 UTC (permalink / raw)
  To: Majordomo vger.kernel.org

Just one final word, as everything has already been said:

On Tue, Jan 30, 2018 at 11:30:31 -0500, Austin S. Hemmelgarn wrote:

>> In other words, is it:
>> - the systemd that treats btrfs WORSE than distributed filesystems, OR
>> - btrfs that requires from systemd to be treated BETTER than other fss?
> Or maybe it's both?  I'm more than willing to admit that what BTRFS does 
> expose currently is crap in terms of usability.  The reason it hasn't 
> changed is that we (that is, the BTRFS people and the systemd people) 
> can't agree on what it should look like.

Hard to agree with someone who refuses to do _anything_.

You can choose to follow whatever - MD, LVM, ZFS - invent something
totally different, write a custom daemon or put timeout logic inside the
kernel itself. It doesn't matter. You know the ecosystem - it is
udev that must be signalled somehow, and systemd WILL follow.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-30 13:46                                         ` Austin S. Hemmelgarn
@ 2018-01-30 19:50                                           ` Tomasz Pala
  2018-01-30 20:40                                             ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 54+ messages in thread
From: Tomasz Pala @ 2018-01-30 19:50 UTC (permalink / raw)
  To: Btrfs BTRFS

On Tue, Jan 30, 2018 at 08:46:32 -0500, Austin S. Hemmelgarn wrote:

>> I personally think the degraded mount option is a mistake as this 
>> assumes that a lightly degraded system is not able to work which is false.
>> If the system can mount to some working state then it should mount 
>> regardless if it is fully operative or not. If the array is in a bad 
>> state you need to learn about it by issuing a command or something. The 
>> same goes for a MD array (and yes, I am aware of the block layer vs 
>> filesystem thing here).
> The problem with this is that right now, it is not safe to run a BTRFS 
> volume degraded and writable, but for an even remotely usable system 

Mounting read-only is still better than not mounting at all.

For example, my emergency.target has limited network access and starts an
ssh server so I can recover from this situation remotely.
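
Roughly along these lines - a minimal drop-in sketch (unit names vary by
distro, and whatever gets pulled in must be able to run without
local-fs.target):

  # /etc/systemd/system/emergency.target.d/remote-rescue.conf (sketch)
  [Unit]
  Wants=systemd-networkd.service sshd.service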

> with pretty much any modern distro, you need your root filesystem to be 
> writable (or you need to have jumped through the hoops to make sure /var 
> and /tmp are writable even if / isn't).

Easy to handle by systemd. Not only this, but much more is planned:

http://0pointer.net/blog/projects/stateless.html

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: degraded permanent mount option
  2018-01-30 19:50                                           ` Tomasz Pala
@ 2018-01-30 20:40                                             ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 54+ messages in thread
From: Austin S. Hemmelgarn @ 2018-01-30 20:40 UTC (permalink / raw)
  To: Tomasz Pala, Btrfs BTRFS

On 2018-01-30 14:50, Tomasz Pala wrote:
> On Tue, Jan 30, 2018 at 08:46:32 -0500, Austin S. Hemmelgarn wrote:
> 
>>> I personally think the degraded mount option is a mistake as this
>>> assumes that a lightly degraded system is not able to work which is false.
>>> If the system can mount to some working state then it should mount
>>> regardless if it is fully operative or not. If the array is in a bad
>>> state you need to learn about it by issuing a command or something. The
>>> same goes for a MD array (and yes, I am aware of the block layer vs
>>> filesystem thing here).
>> The problem with this is that right now, it is not safe to run a BTRFS
>> volume degraded and writable, but for an even remotely usable system
> 
> Mounting read-only is still better than not mounting at all.
Agreed, but what most people who are asking about this actually want is 
for the system to just keep running with a drive missing.
> 
> For example, my emergency.target has limited network access and starts
> ssh server so I could recover from this situation remotely.
> 
>> with pretty much any modern distro, you need your root filesystem to be
>> writable (or you need to have jumped through the hoops to make sure /var
>> and /tmp are writable even if / isn't).
> 
> Easy to handle by systemd. Not only this, but much more is planned:
> 
> http://0pointer.net/blog/projects/stateless.html
> 
It's reasonably easy to handle even in a normal init system.  The issue 
is that most distros don't really support it well.  Arch and Gentoo make 
it trivial, but they let you configure storage however the hell you 
want.  Pretty much everybody else is mostly designed to assume that /var 
is a part of /, they mostly work if it's not, but certain odd things 
cause problems, and you have to go through somewhat unfriendly 
configuration work during install to get a system set up that way (well, 
unfriendly if you're a regular user, it's perfectly fine for a seasoned 
sysadmin).

Also, slightly OT, but has anyone involved in the development described 
in the article you linked ever looked beyond the typical Fedora/Debian 
environment for any of the stuff the conclusions section says you're 
trying to achieve?  Just curious, since NixOS can do almost all of it 
with near zero effort except for the vendor data part (NixOS still 
stores its config in /etc, but it can work with just one or two files), 
and a handful of the other specific items have reasonably easy ways to 
implement them that just aren't widely supported (for example, factory 
resets have at least three options already, OverlayFS (bottom layer is 
your base image, stored in a read-only verified manner, top layer is 
writable for user customization), BTRFS seed devices (similar to an 
overlay, just at the block level), and bootable, self-installing, 
compressed system images).
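
For the seed-device variant, the read-only base plus writable overlay is
already just a few commands (device names below are placeholders):

  btrfstune -S 1 /dev/base             # mark the read-only base image as a seed
  mount /dev/base /mnt
  btrfs device add /dev/overlay /mnt   # add a writable device on top
  mount -o remount,rw /mnt             # new writes now land on /dev/overlay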

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread

Thread overview: 54+ messages
2018-01-26 14:02 degraded permanent mount option Christophe Yayon
2018-01-26 14:18 ` Austin S. Hemmelgarn
2018-01-26 14:47   ` Christophe Yayon
2018-01-26 14:55     ` Austin S. Hemmelgarn
2018-01-27  5:50     ` Andrei Borzenkov
     [not found]       ` <1517035210.1252874.1249880112.19FABD13@webmail.messagingengine.com>
2018-01-27  6:43         ` Andrei Borzenkov
2018-01-27  6:48           ` Christophe Yayon
2018-01-27 10:08             ` Christophe Yayon
2018-01-27 10:26               ` Andrei Borzenkov
2018-01-27 11:06                 ` Tomasz Pala
2018-01-27 13:26                   ` Adam Borowski
2018-01-27 14:36                     ` Goffredo Baroncelli
2018-01-27 15:38                       ` Adam Borowski
2018-01-27 15:22                     ` Duncan
2018-01-28  0:39                       ` Tomasz Pala
2018-01-28 20:02                         ` Chris Murphy
2018-01-28 22:39                           ` Tomasz Pala
2018-01-29  0:00                             ` Chris Murphy
2018-01-29  8:54                               ` Tomasz Pala
2018-01-29 11:24                                 ` Adam Borowski
2018-01-29 13:05                                   ` Austin S. Hemmelgarn
2018-01-30 13:46                                     ` Tomasz Pala
2018-01-30 15:05                                       ` Austin S. Hemmelgarn
2018-01-30 16:07                                         ` Tomasz Pala
2018-01-29 17:58                                   ` Andrei Borzenkov
2018-01-29 19:00                                     ` Austin S. Hemmelgarn
2018-01-29 21:54                                       ` waxhead
2018-01-30 13:46                                         ` Austin S. Hemmelgarn
2018-01-30 19:50                                           ` Tomasz Pala
2018-01-30 20:40                                             ` Austin S. Hemmelgarn
2018-01-30 15:24                                       ` Tomasz Pala
2018-01-30 13:36                                   ` Tomasz Pala
2018-01-30  4:44                                 ` Chris Murphy
2018-01-30 15:40                                   ` Tomasz Pala
2018-01-28  8:06                       ` Andrei Borzenkov
2018-01-28 10:27                         ` Tomasz Pala
2018-01-28 15:57                         ` Duncan
2018-01-28 16:51                           ` Andrei Borzenkov
2018-01-28 20:28                         ` Chris Murphy
2018-01-28 23:13                           ` Tomasz Pala
2018-01-27 21:12                     ` Chris Murphy
2018-01-28  0:16                       ` Tomasz Pala
2018-01-27 22:42                     ` Tomasz Pala
2018-01-29 13:42                       ` Austin S. Hemmelgarn
2018-01-30 15:09                         ` Tomasz Pala
2018-01-30 16:22                           ` Tomasz Pala
2018-01-30 16:30                           ` Austin S. Hemmelgarn
2018-01-30 19:24                             ` Tomasz Pala
2018-01-30 19:40                             ` Tomasz Pala
2018-01-27 20:57                   ` Chris Murphy
2018-01-28  0:00                     ` Tomasz Pala
2018-01-28 10:43                       ` Tomasz Pala
2018-01-26 21:54 ` Chris Murphy
2018-01-26 22:03   ` Christophe Yayon
