From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-btrfs-owner@vger.kernel.org>
Received: from plane.gmane.org ([80.91.229.3]:48001 "EHLO plane.gmane.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1750737AbaEUAD0 (ORCPT <rfc822;linux-btrfs@vger.kernel.org>);
	Tue, 20 May 2014 20:03:26 -0400
Received: from list by plane.gmane.org with local (Exim 4.69)
	(envelope-from <gcfb-btrfs-devel-moved1@m.gmane.org>)
	id 1Wmu0A-0003JS-1M
	for linux-btrfs@vger.kernel.org; Wed, 21 May 2014 02:03:22 +0200
Received: from ip68-231-22-224.ph.ph.cox.net ([68.231.22.224])
        by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
        id 1AlnuQ-0007hv-00
        for <linux-btrfs@vger.kernel.org>; Wed, 21 May 2014 02:03:22 +0200
Received: from 1i5t5.duncan by ip68-231-22-224.ph.ph.cox.net with local (Gmexim 0.1 (Debian))
        id 1AlnuQ-0007hv-00
        for <linux-btrfs@vger.kernel.org>; Wed, 21 May 2014 02:03:22 +0200
To: linux-btrfs@vger.kernel.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: Re: problem with degraded boot and systemd
Date: Wed, 21 May 2014 00:03:08 +0000 (UTC)
Message-ID: <pan$6f6ec$5ba926ea$9860d25c$5d86cd20@cox.net>
References: <45D5C607-ED9D-49BB-BA60-CA2B0E94223D@colorremedies.com>
	<537BD078.7070504@libero.it> <20140520222609.GD1756@carfax.org.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Sender: linux-btrfs-owner@vger.kernel.org
List-ID: <linux-btrfs.vger.kernel.org>

Hugo Mills posted on Tue, 20 May 2014 23:26:09 +0100 as excerpted:

> On Wed, May 21, 2014 at 12:00:24AM +0200, Goffredo Baroncelli wrote:
>> On 05/19/2014 02:54 AM, Chris Murphy wrote:
>>> 
>>> It's insufficient to pass rootflags=degraded to get the system root
>>> to mount when a device is missing. It looks like when a device is
>>> missing, udev doesn't [...]
>>> 
>>> This is the current udev rule:
>>> 
>>> # cat /usr/lib/udev/rules.d/64-btrfs.rules 
>>> # do not edit this file, it will be overwritten on update
>>> 
>>> SUBSYSTEM!="block", GOTO="btrfs_end"
>>> ACTION=="remove", GOTO="btrfs_end"
>>> ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"
>>> 
>>> # let the kernel know about this btrfs filesystem, and check if it is
>>> # complete
>>> IMPORT{builtin}="btrfs ready $devnode"
>>> 
>>> # mark the device as not ready to be used by the system
>>> ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"
>>> 
>>> LABEL="btrfs_end"
>> 
>> The key is the line
>> 
>> 	IMPORT{builtin}="btrfs ready $devnode"
>> 
>> This line sets ID_BTRFS_READY=0 if a filesystem is not ready; otherwise
>> set ID_BTRFS_READY=1 [1].
>> The next line
>> 
>> 	ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"
>> 
>> sets SYSTEMD_READY=0 if the filesystem is not ready so the "plug" event
>> is not raised to systemd.
>> 
>> This is my understanding.

Looks correct to me. =:^)
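
For anyone wanting to see what the rule actually set on a live system, 
here is a minimal sketch. The helper name btrfs_ready_value and the 
device name in the usage comment are my own assumptions for illustration, 
not anything from udev or btrfs-progs.

```shell
# Reads udev properties (KEY=VALUE lines) on stdin and prints the value
# of ID_BTRFS_READY if the property is present; prints nothing otherwise.
btrfs_ready_value() {
    sed -n 's/^ID_BTRFS_READY=//p'
}

# Usage (device name is an assumption, adjust to your system):
#   udevadm info --query=property --name=/dev/sda2 | btrfs_ready_value
```

A value of 0 would mean the rule has also set SYSTEMD_READY=0 for that 
device.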

>>> How this works with raid:
>>> 
>>> RAID assembly is separate from filesystem mount. The volume UUID
>>> isn't available until the RAID is successfully assembled.
>>> 
>>> On at least Fedora (dracut) systems with the system root on an md
>>> device, the initramfs contains 30-parse-md.sh [with a sleep loop and 
>>> a timeout]
>> 
>>> The approximate Btrfs equivalent down the road would be a similar
>>> initrd script, or maybe a user space daemon, that causes btrfs device
>>> ready to confirm/deny all devices are present. And after x number of
>>> failures, then it's issue an equivalent to mdadm -R which right now
>>> we don't seem to have.
>> 
>> I suggest to implement a mount.btrfs command, which waits all the
>> needed disks until a timeout expires. After this timeout it could try a
>> "degraded" mount until a second timeout. Only then it fails.
>> 
>> Each time a device appear, the system may start mount.btrfs. Each
>> invocation has to test if there is another instance of mount.btrfs
>> related to the same filesystem; if so it ends, otherwise it follows the
>> above behavior.
> 
> Don't we already have something approaching this functionality with
> btrfs device ready? (i.e. this is exactly what it was designed for).

Well, sort of.

btrfs device ready is used directly in the udev rule quoted above.  And 
in the non-degraded case it works as intended, checking if the filesystem 
is complete and only letting the udev plug event complete when all 
devices are available.

But this thread is about a degraded state mount, with devices missing.  
In that case the missing devices never appear, so the plug event never 
happens and systemd never mounts the filesystem, even though degraded was 
specifically passed as an option, indicating that the admin wants the 
mount to happen anyway.

In dracut[1] (on gentoo), the result is an eventual timeout waiting for 
the rootfs to appear and a drop to the initr* rescue shell prompt, where 
an admin can manually mount using the degraded option and continue from 
there.
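
As a rough sketch of that manual step: the device name and the /sysroot 
mount point below are assumptions (adjust both to the system at hand), 
and degraded_mount and RUN are hypothetical names of mine. By default the 
helper only prints the command; set RUN to an empty string to actually 
run it.

```shell
# degraded_mount DEVICE MOUNTPOINT
# Dry run by default: with RUN unset, the mount command is echoed.
# With RUN set to the empty string, the mount command is executed.
degraded_mount() {
    ${RUN-echo} mount -t btrfs -o degraded "$1" "$2"
}

# Usage at the rescue prompt (device and mount point are assumptions):
#   RUN= degraded_mount /dev/sda2 /sysroot
```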

I'd actually argue that's functioning as it should, since I see forced 
manual intervention in order to mount degraded as a FEATURE, NOT A BUG.

But nevertheless, being able to effectively pass degraded either as 
part of rootflags or in the fstab that dracut (and systemd in dracut) 
use, such that a degraded mount could still be automated, could I suppose 
be seen as a feature, to some.

To do that would require a script with a countdown and timeout: first 
wait for undegraded ready (and thus mount); then, if not all devices 
appear, bypass the ready test and plug it anyway, letting mount try it in 
case the degraded option was passed; and only if THAT fails, fall back to 
the emergency shell prompt.

Note that such a script wouldn't have to actually check for degraded in 
the mount options, only fall back to plugging without all devices if the 
complete timeout triggered, since mount would then take care of success/
failure on its own based on whether the degraded option was passed, just 
as it does if a mount is attempted on an incomplete btrfs at other times.
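
A minimal sketch of that logic, under stated assumptions: it calls mount 
directly rather than releasing the udev/systemd plug event, and READY_CMD, 
MOUNT_CMD and POLL_INTERVAL are hypothetical knobs of mine for 
illustration and testing, not dracut or systemd settings. It is not a 
drop-in initr* module.

```shell
#!/bin/sh
# Sketch only. READY_CMD, MOUNT_CMD and POLL_INTERVAL are hypothetical
# knobs for illustration; they default to the real commands.

# wait_for_btrfs DEVICE TRIES
# Poll "btrfs device ready" up to TRIES times. Returns 0 as soon as the
# filesystem reports complete, 1 if the countdown runs out.
wait_for_btrfs() {
    w_dev=$1
    w_tries=$2
    while [ "$w_tries" -gt 0 ]; do
        if ${READY_CMD:-btrfs device ready} "$w_dev" >/dev/null 2>&1; then
            return 0
        fi
        w_tries=$((w_tries - 1))
        sleep "${POLL_INTERVAL:-1}"
    done
    return 1
}

# mount_with_fallback DEVICE MOUNTPOINT OPTIONS TRIES
# Wait for a complete filesystem; on timeout, attempt the mount anyway.
# The options are passed through untouched, so mount itself decides:
# on an incomplete filesystem it only succeeds if "degraded" was given.
mount_with_fallback() {
    m_dev=$1
    m_mnt=$2
    m_opts=$3
    m_tries=$4
    if ! wait_for_btrfs "$m_dev" "$m_tries"; then
        echo "btrfs on $m_dev still incomplete, trying mount anyway" >&2
    fi
    ${MOUNT_CMD:-mount} -t btrfs -o "$m_opts" "$m_dev" "$m_mnt"
}
```

The caller would then drop to the emergency shell only if 
mount_with_fallback returns nonzero, for example:

    mount_with_fallback /dev/sda2 /sysroot "$rootflags" 30 || emergency

where the device, the rootflags variable and the emergency handler are 
again placeholders.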

---
[1] dracut: I use it here on gentoo as well, because my rootfs is a multi-
device btrfs and a kernel rootflags=device= line won't parse correctly, 
apparently due to splitting at the wrong =, so I must use an initr* 
despite my preference for a direct initr*-less boot, and I use dracut to 
generate it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman