From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail.kernel.org ([198.145.29.99]:57932 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727399AbeHLCCT (ORCPT ); Sat, 11 Aug 2018 22:02:19 -0400
Received: from mail-wm0-f47.google.com (mail-wm0-f47.google.com [74.125.82.47]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id C4F2C21A67 for ; Sat, 11 Aug 2018 23:26:29 +0000 (UTC)
Received: by mail-wm0-f47.google.com with SMTP id c14-v6so5279683wmb.4 for ; Sat, 11 Aug 2018 16:26:29 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <5152.1534018808@warthog.procyon.org.uk>
References: <153313703562.13253.5766498657900728120.stgit@warthog.procyon.org.uk>
 <153313723557.13253.9055982745313603422.stgit@warthog.procyon.org.uk>
 <87in4n9zg0.fsf@xmission.com>
 <27374.1533824694@warthog.procyon.org.uk>
 <5152.1534018808@warthog.procyon.org.uk>
From: Andy Lutomirski
Date: Sat, 11 Aug 2018 16:26:07 -0700
Message-ID:
Subject: Re: [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #11]
To: David Howells
Cc: Miklos Szeredi , "Eric W. Biederman" , Al Viro , Linux API , Linus Torvalds , Linux FS Devel , LKML
Content-Type: text/plain; charset="UTF-8"
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Sat, Aug 11, 2018 at 1:20 PM, David Howells wrote:
> Miklos Szeredi wrote:
>
>> You can determine at fsopen() time whether the filesystem is able to
>> support the O_EXCL behavior? If so, then it's trivial to enable this
>> conditionally. I think that's what Eric is asking for, it's obviously
>> not fair to ask for a change in behavior of the legacy interface.
>
> It's not trivial, see btrfs and nfs :-/
>

I'm not convinced that btrfs and nfs are the same situation. As far as I can tell, in NFS's case, NFS shares superblocks as an implementation detail.
With Al's example, someone can do:

    mount -t nfs4 wank.example.org:/foo/bar /mnt/a
    mount -t nfs4 wank.example.org:/baz/barf /mnt/b
    mount -t nfs4 wank.example.org:/foo/bar -o wsize=16384 /mnt/c

or equivalently create three fscontexts and FSCONFIG_CMD_CREATE all of them, and the kernel creates one superblock for /mnt/a and /mnt/b and a second one for /mnt/c. That seems like a good optimization, but I think it really is just an optimization. In any sane implementation, all three calls should succeed, and it should in general be possible to create as many totally fresh mounts of the same network file system as anyone wants.

Given this example, I think that it may be important to give FSCONFIG_CMD_RECONFIGURE a very clear definition, and possibly a definition that doesn't use the word superblock. After all, if someone does FSCONFIG_CMD_RECONFIGURE on /mnt/a, if it really reconfigures a *superblock*, then it will change /mnt/b as a side effect but will not change /mnt/c. This seems like a mistake.

But I think that btrfs is quite a bit different. With btrfs, I can do:

    mount -t btrfs /dev/sda1 -o subvol=a /mnt/a
    mount -t btrfs /dev/sda1 -o subvol=b /mnt/b

and I get two mounts, each pointing at a different subvolume, that (I'm pretty sure) share a superblock. If I then do:

    mount -t btrfs /dev/sda1 -o subvol=c,foo=bar /mnt/c

where foo is a per-superblock option, it probably gets ignored. If I set up /dev/mapper/foo as a linear alias for /dev/sda1 and I do:

    mount -t btrfs /dev/mapper/foo -o subvol=d /mnt/d

then I get a fresh superblock. If /dev/sda1 is still mounted and the various O_EXCL-like checks don't catch it, then I get massive corruption.

The btrfs case seems quite fragile to me, and it seems like a bit of an abuse of mount(2). (Of course, basically everything anyone does with mount(2) is a bit of an abuse.) I would hope that the new fs mounting API would clean this up.
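
To make the NFS side effect described above concrete, here is a toy model in Python. It is only a sketch and not kernel code: ToyNfs, Superblock, and reconfigure_superblock are hypothetical names, and the sharing rule (same server and same options means same superblock) is an assumption chosen to reproduce the three-mount example, not a description of what the NFS client actually compares.

```python
from dataclasses import dataclass, field


@dataclass
class Superblock:
    """Toy stand-in for a superblock; it only holds per-superblock options."""
    server: str
    options: dict = field(default_factory=dict)


class ToyNfs:
    """Hypothetical model: mounts share a superblock when the server and
    the mount options match (an assumption made for illustration)."""

    def __init__(self):
        self._superblocks = {}  # (server, sorted options) -> Superblock
        self.mounts = {}        # mountpoint -> (Superblock, exported path)

    def mount(self, source, mountpoint, **options):
        server, path = source.split(":", 1)
        key = (server, tuple(sorted(options.items())))
        if key not in self._superblocks:
            self._superblocks[key] = Superblock(server, dict(options))
        sb = self._superblocks[key]
        self.mounts[mountpoint] = (sb, path)
        return sb

    def reconfigure_superblock(self, mountpoint, **options):
        # Reconfigures the superblock behind the mount, not the mount itself.
        sb, _path = self.mounts[mountpoint]
        sb.options.update(options)


nfs = ToyNfs()
sb_a = nfs.mount("wank.example.org:/foo/bar", "/mnt/a")
sb_b = nfs.mount("wank.example.org:/baz/barf", "/mnt/b")
sb_c = nfs.mount("wank.example.org:/foo/bar", "/mnt/c", wsize=16384)

# Reconfiguring "through" /mnt/a silently changes /mnt/b, but not /mnt/c.
nfs.reconfigure_superblock("/mnt/a", rsize=8192)
opts_b = nfs.mounts["/mnt/b"][0].options
opts_c = nfs.mounts["/mnt/c"][0].options
```

In this model /mnt/a and /mnt/b end up backed by the same object, so a superblock-level reconfigure issued against one is visible through the other, while /mnt/c keeps its own options. That is the behavior a user-facing definition of FSCONFIG_CMD_RECONFIGURE would have to either document or avoid.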
The NFS case seems just fine, but for btrfs, it seems like maybe the whole CMD_CREATE operation should be more fine-grained. There seem to be *two* actions going on in a btrfs mount. First there's the act of instantiating the filesystem driver backed by the device (I think this is open_ctree()), and *then* there's the act of instantiating a dentry tree pointing at some subvolume, etc.

ZFS seems to handle this quite nicely. First you fire up a zpool, and then you start mounting its volumes.
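
The two-step split suggested above could be sketched roughly as follows. This is a toy Python model, not a proposal for the actual syscall interface: ToyFsRegistry, FsInstance, open_filesystem, and mount_subvolume are all hypothetical names, and the exclusivity check is keyed on the device name only, so it does not model the /dev/mapper alias problem.

```python
class DeviceBusy(Exception):
    """Raised when a device already backs a live filesystem instance."""


class FsInstance:
    """Step one result: a filesystem driver instance bound to one device."""

    def __init__(self, device, options):
        self.device = device
        self.options = options  # per-filesystem options, fixed at open time
        self.mounts = {}        # mountpoint -> subvolume name

    def mount_subvolume(self, subvol, mountpoint):
        # Step two: attach a tree for one subvolume. No per-filesystem
        # options are accepted here, so none can be silently ignored.
        self.mounts[mountpoint] = subvol


class ToyFsRegistry:
    def __init__(self):
        self._open = {}  # device name -> FsInstance

    def open_filesystem(self, device, **options):
        # Step one is exclusive: a second open of the same device fails
        # instead of quietly sharing or duplicating the instance.
        if device in self._open:
            raise DeviceBusy(device)
        instance = FsInstance(device, dict(options))
        self._open[device] = instance
        return instance


registry = ToyFsRegistry()
fs = registry.open_filesystem("/dev/sda1", foo="bar")
fs.mount_subvolume("a", "/mnt/a")
fs.mount_subvolume("b", "/mnt/b")

try:
    registry.open_filesystem("/dev/sda1")
    second_open_rejected = False
except DeviceBusy:
    second_open_rejected = True
```

Separating the steps this way makes it explicit which operation owns the device and which options belong to the filesystem as a whole, in the same spirit as starting a zpool before mounting its volumes.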