linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Christian Brauner <brauner@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: viro@zeniv.linux.org.uk, jack@suse.cz, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] fs: open the block device after allocation the super_block
Date: Tue, 25 Jul 2023 18:32:05 +0200	[thread overview]
Message-ID: <20230725-einnahmen-warnschilder-17779aec0a97@brauner> (raw)
In-Reply-To: <20230725-tagebuch-gerede-a28f8fd8084a@brauner>

On Tue, Jul 25, 2023 at 02:35:17PM +0200, Christian Brauner wrote:
> On Mon, Jul 24, 2023 at 10:51:45AM -0700, Christoph Hellwig wrote:
> > From: Jan Kara <jack@suse.cz>
> > 
> > Currently get_tree_bdev and mount_bdev open the block device before
> > commiting to allocating a super block.  This means the block device
> > is opened even for bind mounts and other reuses of the super_block.
> > 
> > That creates problems for restricting the number of writers to a device,
> > and also leads to a unusual and not very helpful holder (the fs_type).
> > 
> > Reorganize the mount code to first look whether the superblock for a
> > particular device is already mounted and open the block device only if
> > it is not.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > [hch: port to before the bdev_handle changes,
> >       duplicate the bdev read-only check from blkdev_get_by_path,
> >       extend the fsfree_mutex coverage to protect against freezes,
> >       fix an open bdev leak when the bdev is frozen,
> >       use the bdev local variable more,
> >       rename the s variable to sb to be more descriptive]
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> > 
> > So I promised to get a series that builds on top of this ready, but
> > I'm way to busy and this will take a while.  Getting this reworked
> > version of Jan's patch out for everyone to use it as a based given
> > that Christian is back from vacation, and I think Jan should be about
> > back now as well.
> 
> I'm in the middle of reviewing this. You're probably aware, but both
> btrfs and nilfs at least still open the devices first since they
> open-code their bdev and sb handling.

I've removed the references to bind mounts from the commit message.
I mentioned in [1] and [2] that this problem is really related to
superblocks at it's core. It's just that technically a bind-mount would
be created in the following scenario where two processes race to create
a superblock:

P1								P2
fd_fs = fsopen("ext4");						fd_fs = fsopen("ext4");
fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda");	fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda");

// wins and creates superblock
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...)
								// finds compatible superblock of P1
								// spins until P1 sets SB_BORN and grabs a reference
								fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...)

P1								P2
fd_mnt1 = fsmount(fd_fs);					fd_mnt2 = fsmount(fd_fs);
move_mount(fd_mnt1, "/A")                                       move_mount(fd_mnt2, "/B")

If we forbid writes to a mounted block device like Jan's other patch did
then before your's and Jan's patch P2 would fail at FSCONFIG_CMD_CREATE
iirc.

But bind-mounting itself isn't really broken. For example, even after P2
failed to create the superblock it could very well still do:

mount --bind /A /C
mount --bind /A /D

or whatever as that never goes into get_tree again. The problem really
is that stuff like:

mount -t ext4 /dev/sda /E

would be broken but the problem is that this request is arguably
ambiguous when seen from userspace. It either means:

(1) create a superblock and mount it at /E
(2) create a bind-mount for an existing superblock at /E

For P1 (1) is the case but for P2 (2) is the case.

Most of the time users really want (1). Right now, we have no way
for a user to be sure that (1) did take place aka that they did indeed
create the superblock. That can be a problem in environments where you
need to be sure that you did create the superblock with the correct
options. Because for P2 all mount options that they set may well be
completely ignored unless e.g., P1 did request rw and P2 did request ro.

This is why I'd like to add something like FSCONFIG_CMD_CREATE_EXCL
which would fail if the caller didn't actually create the superblock but
found an existing one instead.

[1]: https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner
[2]: https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner

  reply	other threads:[~2023-07-25 16:32 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-24 17:51 [PATCH] fs: open the block device after allocation the super_block Christoph Hellwig
2023-07-25 12:35 ` Christian Brauner
2023-07-25 16:32   ` Christian Brauner [this message]
2023-07-26 12:51     ` Christoph Hellwig
2023-07-26 12:57       ` Christian Brauner
2023-07-25 15:53 ` Christian Brauner
2023-08-15 14:43 ` Christian Brauner
2023-08-16  7:29   ` Christian Brauner
2023-08-16 21:39     ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230725-einnahmen-warnschilder-17779aec0a97@brauner \
    --to=brauner@kernel.org \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).