Re: [PATCH] fs: open the block device after allocation the super_block

From: Christian Brauner <brauner@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: viro@zeniv.linux.org.uk, jack@suse.cz, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] fs: open the block device after allocation the super_block
Date: Tue, 25 Jul 2023 18:32:05 +0200	[thread overview]
Message-ID: <20230725-einnahmen-warnschilder-17779aec0a97@brauner> (raw)
In-Reply-To: <20230725-tagebuch-gerede-a28f8fd8084a@brauner>

On Tue, Jul 25, 2023 at 02:35:17PM +0200, Christian Brauner wrote:
> On Mon, Jul 24, 2023 at 10:51:45AM -0700, Christoph Hellwig wrote:
> > From: Jan Kara <jack@suse.cz>
> > 
> > Currently get_tree_bdev and mount_bdev open the block device before
> > commiting to allocating a super block.  This means the block device
> > is opened even for bind mounts and other reuses of the super_block.
> > 
> > That creates problems for restricting the number of writers to a device,
> > and also leads to a unusual and not very helpful holder (the fs_type).
> > 
> > Reorganize the mount code to first look whether the superblock for a
> > particular device is already mounted and open the block device only if
> > it is not.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > [hch: port to before the bdev_handle changes,
> >       duplicate the bdev read-only check from blkdev_get_by_path,
> >       extend the fsfree_mutex coverage to protect against freezes,
> >       fix an open bdev leak when the bdev is frozen,
> >       use the bdev local variable more,
> >       rename the s variable to sb to be more descriptive]
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> > 
> > So I promised to get a series that builds on top of this ready, but
> > I'm way to busy and this will take a while.  Getting this reworked
> > version of Jan's patch out for everyone to use it as a based given
> > that Christian is back from vacation, and I think Jan should be about
> > back now as well.
> 
> I'm in the middle of reviewing this. You're probably aware, but both
> btrfs and nilfs at least still open the devices first since they
> open-code their bdev and sb handling.

I've removed the references to bind mounts from the commit message.
I mentioned in [1] and [2] that this problem is really related to
superblocks at it's core. It's just that technically a bind-mount would
be created in the following scenario where two processes race to create
a superblock:

P1								P2
fd_fs = fsopen("ext4");						fd_fs = fsopen("ext4");
fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda");	fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda");

// wins and creates superblock
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...)
								// finds compatible superblock of P1
								// spins until P1 sets SB_BORN and grabs a reference
								fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...)

P1								P2
fd_mnt1 = fsmount(fd_fs);					fd_mnt2 = fsmount(fd_fs);
move_mount(fd_mnt1, "/A")                                       move_mount(fd_mnt2, "/B")

If we forbid writes to a mounted block device like Jan's other patch did
then before your's and Jan's patch P2 would fail at FSCONFIG_CMD_CREATE
iirc.

But bind-mounting itself isn't really broken. For example, even after P2
failed to create the superblock it could very well still do:

mount --bind /A /C
mount --bind /A /D

or whatever as that never goes into get_tree again. The problem really
is that stuff like:

mount -t ext4 /dev/sda /E

would be broken but the problem is that this request is arguably
ambiguous when seen from userspace. It either means:

(1) create a superblock and mount it at /E
(2) create a bind-mount for an existing superblock at /E

For P1 (1) is the case but for P2 (2) is the case.

Most of the time users really want (1). Right now, we have no way
for a user to be sure that (1) did take place aka that they did indeed
create the superblock. That can be a problem in environments where you
need to be sure that you did create the superblock with the correct
options. Because for P2 all mount options that they set may well be
completely ignored unless e.g., P1 did request rw and P2 did request ro.

This is why I'd like to add something like FSCONFIG_CMD_CREATE_EXCL
which would fail if the caller didn't actually create the superblock but
found an existing one instead.

[1]: https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner
[2]: https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner