From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-oi0-f46.google.com ([209.85.218.46]:41166 "EHLO
 mail-oi0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1751467AbeA0U5a (ORCPT ); Sat, 27 Jan 2018 15:57:30 -0500
Received: by mail-oi0-f46.google.com with SMTP id m83so2522129oik.8 for ;
 Sat, 27 Jan 2018 12:57:30 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <20180127110619.GA10472@polanet.pl>
References: <1516975360.4083556.1249069832.1B287A04@webmail.messagingengine.com>
 <5d342036-0de0-9bf7-3e9e-4885b62d8100@gmail.com>
 <1516978054.4103196.1249114200.76EC1546@webmail.messagingengine.com>
 <84c23047-522d-2529-5b16-d07ed8c28fc6@gmail.com>
 <1517035210.1252874.1249880112.19FABD13@webmail.messagingengine.com>
 <8607255b-98e7-5623-6f62-75d6f7cf23db@gmail.com>
 <569AC15F-174E-4C78-8FE5-6CE9E0BED479@yayon.me>
 <111ca301-f631-694d-93eb-b73a790f57d4@gmail.com>
 <20180127110619.GA10472@polanet.pl>
From: Chris Murphy
Date: Sat, 27 Jan 2018 13:57:29 -0700
Message-ID:
Subject: Re: degraded permanent mount option
To: Tomasz Pala
Cc: "Majordomo vger.kernel.org"
Content-Type: text/plain; charset="UTF-8"
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

On Sat, Jan 27, 2018 at 4:06 AM, Tomasz Pala wrote:
> As for the regular by-UUID mounts: these links are created by udev WHEN
> underlying devices appear. Does btrfs volume appear? No.

If I boot with rd.break=pre-mount I can absolutely mount a Btrfs
multiple device volume that has a missing device, by UUID with the
--uuid flag or by /dev/sdXY, along with -o degraded. And I can then use
the exit command to continue the startup process. In fact I can try to
mount without -o degraded, and the mount command "works" in that it
does not complain about an invalid node or UUID.

The Btrfs systemd udev rule is a sledgehammer because it has no
timeout. It neither times out and tries to mount anyway, nor does it
time out and just drop to a dracut prompt.

There are a number of things in systemd startups that have timeouts. I
have no idea how they get defined, but that single thing would make
this a lot better. Right now the Btrfs udev rule means: if all devices
aren't available, hang indefinitely.

I don't know systemd or systemd-udev well enough to know if this rule
can have a timer. Service units absolutely can have timers, so maybe
there's a way to marry a udev rule with a service which has a timer.
The absolute dumbest thing that's better than now is, when the timer
expires, to just fail and drop to a dracut prompt. Better would be to
try a normal mount anyway, which also fails to a dracut prompt, but
additionally gives us a kernel error for Btrfs (the missing device
open ctree error you'd expect to get when mounting without -o degraded
when you're missing a device). And even better would be a way for the
user to edit the service unit to indicate "upon timeout being reached,
use mount -o degraded rather than just mount".

This is the simplest of Boolean logic, so I'd be surprised if systemd
doesn't offer a way for us to do exactly what I'm describing. Again,
the central problem is that the udev rule now means "wait for device to
appear" with no timed fallback.

The mdadm case has this, and it's done by dracut. At this same stage of
startup with a missing device, there is in fact no fs volume UUID yet
because the array hasn't started. Dracut+mdadm knows there's a missing
device so it's just iterating: look, sleep 3, look, sleep 3, look,
sleep 3. It's on a loop. And after that loop hits something like 100,
the script says f it, start the array anyway, so now there is a
degraded array, and for the first time the fs volume UUID appears, and
systemd goes "ahaha! mount that!" and it does it normally.

So the timer, the timeout, and what happens at the timeout are all
defined by dracut. That's probably why the systemd folks say "not our
problem" and why the kernel folks say "not our problem".

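A minimal sketch of the "look, sleep 3" loop with a timed fallback, in
Python purely for illustration. Nothing here is dracut or systemd code:
wait_then_mount, its parameters, and the returned strings are
hypothetical names, and the callables stand in for the "all devices
ready" check and the mount call. The defaults of 100 tries and 3
seconds only mirror the rough numbers mentioned for the mdadm loop.

```python
from typing import Callable, List


def wait_then_mount(
    all_devices_ready: Callable[[], bool],  # hypothetical readiness probe
    mount: Callable[[List[str]], bool],     # hypothetical: mount with options
    sleep: Callable[[float], None],         # injected so the loop is testable
    max_tries: int = 100,
    interval: float = 3.0,
    degraded_on_timeout: bool = False,
) -> str:
    """Poll for devices; on timeout, fall back instead of hanging forever.

    Returns "mounted", "mounted-degraded", or "shell" (troubleshooting prompt).
    """
    for _ in range(max_tries):
        if all_devices_ready():
            # Every device is present, so no fallback is needed: mount normally.
            return "mounted" if mount([]) else "shell"
        sleep(interval)  # look, sleep, look, sleep ...

    # Timeout reached: stop waiting and try something rather than hang.
    options = ["degraded"] if degraded_on_timeout else []
    if mount(options):
        return "mounted-degraded" if degraded_on_timeout else "mounted"
    return "shell"
```

With degraded_on_timeout left at False, the timeout path attempts a
normal mount, which would be expected to fail and land at the shell
with the kernel error visible; setting it to True is the "use mount -o
degraded at timeout" variant.
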
> If btrfs pretends to be device manager it should expose more states,
> especially "ready to be mounted, but not fully populated" (i.e.
> "degraded mount possible"). Then systemd could _fallback_ after timing
> out to degraded mount automatically according to some systemd-level
> option.

No, mdadm is a device manager and it has no such facility. Something
issues a command to start the array anyway, and only then do you find
out if there are enough devices to start it. I don't understand the
value of knowing whether it is possible. Just try to mount it degraded
and then if it fails we fail; nothing can be done automatically, it's
up to an admin.

And even if you had this "degraded mount possible" state, you still
need a timer. So just build the timer.

If all devices ready ioctl is true, the timer doesn't start, it means
all devices are available, mount normally.

If all devices ready ioctl is false, the timer starts, if all devices
appear later the ioctl goes to true, the timer is belayed, mount
normally.

If all devices ready ioctl is false, the timer starts, when the timer
times out, mount normally which fails and gives us a shell to
troubleshoot at.

OR

If all devices ready ioctl is false, the timer starts, when the timer
times out, mount with -o degraded which either succeeds and we boot or
it fails and we have a troubleshooting shell.

The central problem is the lack of a timer and time out.

> Unless there is *some* signalling from btrfs, there is really not much
> systemd can *safely* do.

That is not true. It's not how mdadm works anyway.

-- 
Chris Murphy