* problem with degraded boot and systemd
@ 2014-05-19  0:54 Chris Murphy
From: Chris Murphy @ 2014-05-19  0:54 UTC (permalink / raw)
  To: Btrfs BTRFS

Summary:

Passing rootflags=degraded is insufficient to get the system root to mount when a device is missing. It looks like, when a device is missing, udev doesn't create the /dev/disk/by-uuid symlink that would cause systemd to change the device state from dead to plugged. Only once the device is plugged will systemd attempt to mount the volume. This issue was brought up on systemd-devel under the subject "timed out waiting for device dev-disk-by\x2duuid", for those who want details.

Workaround:

I tested systemd 208-16.fc20 and 212-4.fc21. Both wait indefinitely for dev-disk-by\x2duuid and fail to drop to a dracut shell for a manual recovery attempt. That seems like a bug to me, so I filed it here:
https://bugzilla.redhat.com/show_bug.cgi?id=1096910

Therefore, the system must first be forced to shut down, then rebooted with the boot parameter "rd.break=pre-mount" to get to a dracut shell before the wait for the root device by UUID begins. Then:

# mount -o subvol=root,ro,degraded <device> /sysroot
# exit
# exit

And then it boots normally. Fortunately btrfs fi show still works, so you can mount with -U or with a /dev node that isn't missing.
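As a concrete illustration of the -U variant, the sequence from the dracut shell might look like this (the UUID here is the example one from the logs further down; a real system's will differ):

# btrfs fi show
# mount -o subvol=root,ro,degraded -U 9ff63135-ce42-4447-a6de-d7c9b4fb6d66 /sysroot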


What's going on:

Example of a 2 device Btrfs raid1 volume, using sda3 and sdb3.

Since the boot parameter root=UUID= is used, systemd expects to issue the mount command referencing that particular volume UUID. When all devices are available, systemd-udevd produces entries like this for each device:

[    2.168697] localhost.localdomain systemd-udevd[109]: creating link '/dev/disk/by-uuid/9ff63135-ce42-4447-a6de-d7c9b4fb6d66' to '/dev/sda3'
[    2.170232] localhost.localdomain systemd-udevd[135]: creating link '/dev/disk/by-uuid/9ff63135-ce42-4447-a6de-d7c9b4fb6d66' to '/dev/sdb3'
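As an aside, whether udev created those links for a given device can be verified from a shell; something like the following should show them (sda3 is just the example device from above):

# ls -l /dev/disk/by-uuid/
# udevadm info --query=symlink --name=/dev/sda3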

But when even one device is missing, neither link is created by udev, and that's the showstopper.

When all devices are present, the links are created, and systemd changes the dev-disk-by-uuid device from dead to plugged like this:

[    2.176280] localhost.localdomain systemd[1]: dev-disk-by\x2duuid-9ff63135\x2dce42\x2d4447\x2da6de\x2dd7c9b4fb6d66.device changed dead -> plugged

And then systemd will initiate the command to mount it.

[    2.177501] localhost.localdomain systemd[1]: Job dev-disk-by\x2duuid-9ff63135\x2dce42\x2d4447\x2da6de\x2dd7c9b4fb6d66.device/start finished, result=done
[    2.586488] localhost.localdomain systemd[152]: Executing: /bin/mount /dev/disk/by-uuid/9ff63135-ce42-4447-a6de-d7c9b4fb6d66 /sysroot -t auto -o ro,ro,subvol=root

I think the key problem is either a limitation of udev or a problem with the existing udev rule that prevents link creation for any remaining btrfs device. Or maybe it's intentional; I'm not a udev expert. This is the current udev rule:

# cat /usr/lib/udev/rules.d/64-btrfs.rules
# do not edit this file, it will be overwritten on update

SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

# let the kernel know about this btrfs filesystem, and check if it is complete
IMPORT{builtin}="btrfs ready $devnode"

# mark the device as not ready to be used by the system
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

LABEL="btrfs_end"
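For reference, the readiness check that the "btrfs ready" builtin performs can also be run by hand with btrfs device ready, which as far as I can tell asks the kernel the same question: it exits 0 when all of the volume's devices have been scanned and non-zero otherwise. For example, against one member that is present:

# btrfs device ready /dev/sda3
# echo $?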


How this works with raid:

RAID assembly is separate from filesystem mount. The volume UUID isn't available until the RAID is successfully assembled. 

On at least Fedora (dracut) systems with the system root on an md device, the initramfs contains 30-parse-md.sh which includes a loop to check for the volume UUID. If it's not found, the script sleeps for 0.5 seconds, and then looks for it again, up to 240 times. If it's still not found at attempt 240, then the script executes mdadm -R to forcibly run the array with fewer than all devices present (degraded assembly). Now the volume UUID exists, udevd creates the linkage, systemd picks this up and changes device state from dead to plugged, and then executes a normal mount command.

The approximate Btrfs equivalent down the road would be a similar initrd script, or maybe a user space daemon, that uses btrfs device ready to confirm/deny that all devices are present. And after x number of failures, it would issue an equivalent of mdadm -R, which right now we don't seem to have.
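A minimal sketch of what such an initrd-side loop could look like, loosely modeled on 30-parse-md.sh (the retry count is borrowed from the md script, and the final fallback step is an assumption; no such script exists today):

devnode="$1"      # a btrfs member device handed to the hook, e.g. /dev/sda3
retries=0
# poll until the kernel reports the volume as complete, like 30-parse-md.sh polls for the md UUID
until btrfs device ready "$devnode"; do
    retries=$((retries + 1))
    if [ "$retries" -ge 240 ]; then
        # give up waiting; this is where an mdadm -R equivalent would go,
        # e.g. permitting a degraded mount of the incomplete volume
        break
    fi
    sleep 0.5
done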

That equivalent might be a decoupling of degraded as a mount option, such that a user space tool deals with degradedness and the mount command remains a normal mount command (without the degraded option). For example, something like "btrfs filesystem allowdegraded -U <uuid>" would run some logic to confirm/deny that degraded mounting is even possible, such as having the minimum number of devices available. If it succeeds, btrfs device ready will report that all devices are in fact present, enabling udevd to create the links by volume UUID, which then allows systemd to trigger a normal mount command. Further, the btrfs allowdegraded command would set appropriate metadata on the filesystem such that a normal mount command will succeed.
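To make that concrete, the logic such a hypothetical allowdegraded step might wrap could be roughly this (the subcommand and the two helper steps are invented for illustration; only btrfs device ready and mount exist today):

# hypothetical: btrfs filesystem allowdegraded <member-device>
devnode="$1"                                   # a member device that is present, e.g. /dev/sda3
if btrfs device ready "$devnode"; then
    : # volume is complete; nothing to do, a normal mount will work
elif enough_devices_present "$devnode"; then   # hypothetical check: the raid profile's minimum device count is met
    mark_allow_degraded "$devnode"             # hypothetical: record on the fs that a plain mount may proceed
else
    echo "too few devices even for a degraded mount" >&2
    exit 1
fi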

Or something like that.


Chris Murphy


