* [PATCH RFC 0/6] fs,block: yield devices
@ 2023-10-24 14:53 Christian Brauner
  2023-10-24 14:53 ` [PATCH RFC 1/6] fs: simplify setup_bdev_super() calls Christian Brauner
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Christian Brauner @ 2023-10-24 14:53 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig; +Cc: linux-fsdevel, Christian Brauner

Hey,

This is a mechanism that allows the holder of a block device to yield
device access before actually closing the block device.

If someone yields a device then any concurrent opener claiming the
device exclusively with the same blk_holder_ops as the current owner can
wait for the device to be given up. Filesystems by default use
fs_holder_ops and so can wait on each other.

This mechanism allows us to simplify superblock handling quite a bit at
the expense of requiring filesystems to yield devices. A filesystem must
yield devices under s_umount. This allows costly work to be done outside
of s_umount.
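
To illustrate, the intended flow is roughly this (a sketch of the call
sequence in this series, not literal code):

  /* umount path, under sb->s_umount */
  generic_shutdown_super(sb)
    -> sop->yield_devices(sb)      /* cheap: mark bdevs BD_CLAIM_YIELD */
    -> remove sb from the instance list

  /* later, without s_umount held */
  bdev_release(handle)
    -> bd_end_claim()
       -> bd_clear_claiming()      /* wake concurrent claimers */

A concurrent opener using the same holder ops isn't rejected with
EBUSY; it waits in bd_prepare_to_claim() until the device has actually
been given up.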

There's nothing wrong with the way we currently do things but this does
allow us to simplify things and kills a whole class of theoretical UAF
when walking the superblock list.

I had originally considered doing it this way but wasn't able
(time-wise) to code it up. Since we recently had that discussion again,
here it is.

Survives both xfstests and blktests. I also tested this by introducing
custom delays into kill_block_super() to widen the race window between
the superblock being removed from the instance list and the device
being fully closed and synced.

Based on vfs.super and the freezer work sent out earlier. The commit
messages are very barebones and the series is less analyzed than usual
for possible side-effects.

Thanks!
Christian

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (6):
      fs: simplify setup_bdev_super() calls
      xfs: simplify device handling
      ext4: simplify device handling
      bdev: simplify waiting for concurrent claimers
      block: mark device as about to be released
      fs: add ->yield_devices()

 block/bdev.c              | 54 +++++++++++++++++++++++++++-----------
 fs/ext4/super.c           | 15 ++++++++---
 fs/super.c                | 67 ++++++++++++++---------------------------------
 fs/xfs/xfs_super.c        | 46 +++++++++++++++++++++-----------
 include/linux/blk_types.h |  8 +++++-
 include/linux/blkdev.h    |  1 +
 include/linux/fs.h        |  1 +
 7 files changed, 109 insertions(+), 83 deletions(-)
---
base-commit: c6cc4b13e95115c13433136a17150768d562a54c
change-id: 20231024-vfs-super-rework-ca447a3240c9



* [PATCH RFC 1/6] fs: simplify setup_bdev_super() calls
  2023-10-24 14:53 [PATCH RFC 0/6] fs,block: yield devices Christian Brauner
@ 2023-10-24 14:53 ` Christian Brauner
  2023-10-25 15:29   ` Jan Kara
  2023-10-27  6:42   ` Christoph Hellwig
  2023-10-24 14:53 ` [PATCH RFC 2/6] xfs: simplify device handling Christian Brauner
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: Christian Brauner @ 2023-10-24 14:53 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig; +Cc: linux-fsdevel, Christian Brauner

There's no need to drop s_umount anymore now that we removed all sources
where s_umount is taken beneath open_mutex or bd_holder_lock.
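
For reference, the (now gone) ordering that forced the unlock/relock
dance was roughly:

  blkdev_put()
    mutex_lock(&bdev->open_mutex)
      bdev_mark_dead()
        takes sb->s_umount (via the holder ops)

i.e. open_mutex ranked above s_umount, so s_umount couldn't be held
across setup_bdev_super().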

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index b26b302f870d..4edde92d5e8f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1613,15 +1613,7 @@ int get_tree_bdev(struct fs_context *fc,
 			return -EBUSY;
 		}
 	} else {
-		/*
-		 * We drop s_umount here because we need to open the bdev and
-		 * bdev->open_mutex ranks above s_umount (blkdev_put() ->
-		 * bdev_mark_dead()). It is safe because we have active sb
-		 * reference and SB_BORN is not set yet.
-		 */
-		super_unlock_excl(s);
 		error = setup_bdev_super(s, fc->sb_flags, fc);
-		__super_lock_excl(s);
 		if (!error)
 			error = fill_super(s, fc);
 		if (error) {
@@ -1665,15 +1657,7 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
 			return ERR_PTR(-EBUSY);
 		}
 	} else {
-		/*
-		 * We drop s_umount here because we need to open the bdev and
-		 * bdev->open_mutex ranks above s_umount (blkdev_put() ->
-		 * bdev_mark_dead()). It is safe because we have active sb
-		 * reference and SB_BORN is not set yet.
-		 */
-		super_unlock_excl(s);
 		error = setup_bdev_super(s, flags, NULL);
-		__super_lock_excl(s);
 		if (!error)
 			error = fill_super(s, data, flags & SB_SILENT ? 1 : 0);
 		if (error) {

-- 
2.34.1



* [PATCH RFC 2/6] xfs: simplify device handling
  2023-10-24 14:53 [PATCH RFC 0/6] fs,block: yield devices Christian Brauner
  2023-10-24 14:53 ` [PATCH RFC 1/6] fs: simplify setup_bdev_super() calls Christian Brauner
@ 2023-10-24 14:53 ` Christian Brauner
  2023-10-25 15:30   ` Jan Kara
  2023-10-27  6:42   ` Christoph Hellwig
  2023-10-24 14:53 ` [PATCH RFC 3/6] ext4: " Christian Brauner
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: Christian Brauner @ 2023-10-24 14:53 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig; +Cc: linux-fsdevel, Christian Brauner

We removed all codepaths where s_umount is taken beneath open_mutex and
bd_holder_lock so don't make things more complicated than they need to
be and hold s_umount over block device opening.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/xfs/xfs_super.c | 19 +++----------------
 1 file changed, 3 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index f0ae07828153..84107d162e41 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -437,19 +437,13 @@ xfs_open_devices(
 	struct bdev_handle	*logdev_handle = NULL, *rtdev_handle = NULL;
 	int			error;
 
-	/*
-	 * blkdev_put() can't be called under s_umount, see the comment
-	 * in get_tree_bdev() for more details
-	 */
-	up_write(&sb->s_umount);
-
 	/*
 	 * Open real time and log devices - order is important.
 	 */
 	if (mp->m_logname) {
 		error = xfs_blkdev_get(mp, mp->m_logname, &logdev_handle);
 		if (error)
-			goto out_relock;
+			return error;
 	}
 
 	if (mp->m_rtname) {
@@ -492,10 +486,7 @@ xfs_open_devices(
 			bdev_release(logdev_handle);
 	}
 
-	error = 0;
-out_relock:
-	down_write(&sb->s_umount);
-	return error;
+	return 0;
 
  out_free_rtdev_targ:
 	if (mp->m_rtdev_targp)
@@ -508,7 +499,7 @@ xfs_open_devices(
  out_close_logdev:
 	if (logdev_handle)
 		bdev_release(logdev_handle);
-	goto out_relock;
+	return error;
 }
 
 /*
@@ -758,10 +749,6 @@ static void
 xfs_mount_free(
 	struct xfs_mount	*mp)
 {
-	/*
-	 * Free the buftargs here because blkdev_put needs to be called outside
-	 * of sb->s_umount, which is held around the call to ->put_super.
-	 */
 	if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp)
 		xfs_free_buftarg(mp->m_logdev_targp);
 	if (mp->m_rtdev_targp)

-- 
2.34.1



* [PATCH RFC 3/6] ext4: simplify device handling
  2023-10-24 14:53 [PATCH RFC 0/6] fs,block: yield devices Christian Brauner
  2023-10-24 14:53 ` [PATCH RFC 1/6] fs: simplify setup_bdev_super() calls Christian Brauner
  2023-10-24 14:53 ` [PATCH RFC 2/6] xfs: simplify device handling Christian Brauner
@ 2023-10-24 14:53 ` Christian Brauner
  2023-10-25 15:30   ` Jan Kara
  2023-10-27  6:42   ` Christoph Hellwig
  2023-10-24 14:53 ` [PATCH RFC 4/6] bdev: simplify waiting for concurrent claimers Christian Brauner
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: Christian Brauner @ 2023-10-24 14:53 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig; +Cc: linux-fsdevel, Christian Brauner

We removed all codepaths where s_umount is taken beneath open_mutex and
bd_holder_lock so don't make things more complicated than they need to
be and hold s_umount over block device opening.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/ext4/super.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d43f8324242a..e94df97ea440 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5855,11 +5855,8 @@ static struct bdev_handle *ext4_get_journal_blkdev(struct super_block *sb,
 	struct ext4_super_block *es;
 	int errno;
 
-	/* see get_tree_bdev why this is needed and safe */
-	up_write(&sb->s_umount);
 	bdev_handle = bdev_open_by_dev(j_dev, BLK_OPEN_READ | BLK_OPEN_WRITE,
 				       sb, &fs_holder_ops);
-	down_write(&sb->s_umount);
 	if (IS_ERR(bdev_handle)) {
 		ext4_msg(sb, KERN_ERR,
 			 "failed to open journal device unknown-block(%u,%u) %ld",

-- 
2.34.1



* [PATCH RFC 4/6] bdev: simplify waiting for concurrent claimers
  2023-10-24 14:53 [PATCH RFC 0/6] fs,block: yield devices Christian Brauner
                   ` (2 preceding siblings ...)
  2023-10-24 14:53 ` [PATCH RFC 3/6] ext4: " Christian Brauner
@ 2023-10-24 14:53 ` Christian Brauner
  2023-10-25 15:54   ` Jan Kara
  2023-10-24 14:53 ` [PATCH RFC 5/6] block: mark device as about to be released Christian Brauner
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2023-10-24 14:53 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig; +Cc: linux-fsdevel, Christian Brauner

Simplify the mechanism to wait for concurrent block device claimers
and make it possible to introduce an additional state in the following
patches.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 block/bdev.c              | 34 ++++++++++++++++++----------------
 include/linux/blk_types.h |  7 ++++++-
 2 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 9deacd346192..7d19e04a8df8 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -482,6 +482,14 @@ static bool bd_may_claim(struct block_device *bdev, void *holder,
 	return true;
 }
 
+static bool wait_claimable(const struct block_device *bdev)
+{
+	enum bd_claim bd_claim;
+
+	bd_claim = smp_load_acquire(&bdev->bd_claim);
+	return bd_claim == BD_CLAIM_DEFAULT;
+}
+
 /**
  * bd_prepare_to_claim - claim a block device
  * @bdev: block device of interest
@@ -490,7 +498,7 @@ static bool bd_may_claim(struct block_device *bdev, void *holder,
  *
  * Claim @bdev.  This function fails if @bdev is already claimed by another
  * holder and waits if another claiming is in progress. return, the caller
- * has ownership of bd_claiming and bd_holder[s].
+ * has ownership of bd_claim and bd_holder[s].
  *
  * RETURNS:
  * 0 if @bdev can be claimed, -EBUSY otherwise.
@@ -511,31 +519,25 @@ int bd_prepare_to_claim(struct block_device *bdev, void *holder,
 	}
 
 	/* if claiming is already in progress, wait for it to finish */
-	if (whole->bd_claiming) {
-		wait_queue_head_t *wq = bit_waitqueue(&whole->bd_claiming, 0);
-		DEFINE_WAIT(wait);
-
-		prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
+	if (whole->bd_claim) {
 		mutex_unlock(&bdev_lock);
-		schedule();
-		finish_wait(wq, &wait);
+		wait_var_event(&whole->bd_claim, wait_claimable(whole));
 		goto retry;
 	}
 
 	/* yay, all mine */
-	whole->bd_claiming = holder;
+	whole->bd_claim = BD_CLAIM_ACQUIRE;
 	mutex_unlock(&bdev_lock);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(bd_prepare_to_claim); /* only for the loop driver */
 
-static void bd_clear_claiming(struct block_device *whole, void *holder)
+static void bd_clear_claiming(struct block_device *whole)
 {
 	lockdep_assert_held(&bdev_lock);
-	/* tell others that we're done */
-	BUG_ON(whole->bd_claiming != holder);
-	whole->bd_claiming = NULL;
-	wake_up_bit(&whole->bd_claiming, 0);
+	smp_store_release(&whole->bd_claim, BD_CLAIM_DEFAULT);
+	smp_mb();
+	wake_up_var(&whole->bd_claim);
 }
 
 /**
@@ -565,7 +567,7 @@ static void bd_finish_claiming(struct block_device *bdev, void *holder,
 	bdev->bd_holder = holder;
 	bdev->bd_holder_ops = hops;
 	mutex_unlock(&bdev->bd_holder_lock);
-	bd_clear_claiming(whole, holder);
+	bd_clear_claiming(whole);
 	mutex_unlock(&bdev_lock);
 }
 
@@ -581,7 +583,7 @@ static void bd_finish_claiming(struct block_device *bdev, void *holder,
 void bd_abort_claiming(struct block_device *bdev, void *holder)
 {
 	mutex_lock(&bdev_lock);
-	bd_clear_claiming(bdev_whole(bdev), holder);
+	bd_clear_claiming(bdev_whole(bdev));
 	mutex_unlock(&bdev_lock);
 }
 EXPORT_SYMBOL(bd_abort_claiming);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 749203277fee..cbef041fd868 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -37,6 +37,11 @@ struct bio_crypt_ctx;
 #define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
 #define SECTOR_MASK		(PAGE_SECTORS - 1)
 
+enum bd_claim {
+	BD_CLAIM_DEFAULT = 0,
+	BD_CLAIM_ACQUIRE = 1,
+};
+
 struct block_device {
 	sector_t		bd_start_sect;
 	sector_t		bd_nr_sectors;
@@ -52,7 +57,7 @@ struct block_device {
 	atomic_t		bd_openers;
 	spinlock_t		bd_size_lock; /* for bd_inode->i_size updates */
 	struct inode *		bd_inode;	/* will die */
-	void *			bd_claiming;
+	enum bd_claim		bd_claim;
 	void *			bd_holder;
 	const struct blk_holder_ops *bd_holder_ops;
 	struct mutex		bd_holder_lock;

-- 
2.34.1



* [PATCH RFC 5/6] block: mark device as about to be released
  2023-10-24 14:53 [PATCH RFC 0/6] fs,block: yield devices Christian Brauner
                   ` (3 preceding siblings ...)
  2023-10-24 14:53 ` [PATCH RFC 4/6] bdev: simplify waiting for concurrent claimers Christian Brauner
@ 2023-10-24 14:53 ` Christian Brauner
  2023-10-24 14:53 ` [PATCH RFC 6/6] fs: add ->yield_devices() Christian Brauner
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Christian Brauner @ 2023-10-24 14:53 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig; +Cc: linux-fsdevel, Christian Brauner

Make it possible for the exclusive holder of a block device to mark it
as about to be closed, with exclusive ownership about to be given up.
Any concurrent
opener trying to claim the device with the same holder ops can wait
until the device is free to be reclaimed. Requiring the same holder ops
makes it possible to easily define groups of openers that can wait for
each other.
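
Roughly (a sketch of the intended interplay, not a literal trace):

  /* current holder, under sb->s_umount */
  bdev_yield(handle);           /* whole->bd_claim = BD_CLAIM_YIELD */

  /* concurrent opener with the same holder ops */
  bd_prepare_to_claim()
    bd_may_claim()              /* true: BD_CLAIM_YIELD && same hops */
    wait_var_event(...)         /* bd_claim != BD_CLAIM_DEFAULT */

  /* current holder, after s_umount has been dropped */
  bdev_release(handle)
    bd_end_claim()
      bd_clear_claiming(whole)  /* wakes the opener, claim proceeds */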

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 block/bdev.c              | 20 ++++++++++++++++++++
 include/linux/blk_types.h |  1 +
 include/linux/blkdev.h    |  1 +
 3 files changed, 22 insertions(+)

diff --git a/block/bdev.c b/block/bdev.c
index 7d19e04a8df8..943c7a188bb3 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -469,6 +469,11 @@ static bool bd_may_claim(struct block_device *bdev, void *holder,
 				return false;
 			return true;
 		}
+
+		if ((whole->bd_claim == BD_CLAIM_YIELD) &&
+		    (bdev->bd_holder_ops == hops))
+			return true;
+
 		return false;
 	}
 
@@ -608,6 +613,7 @@ static void bd_end_claim(struct block_device *bdev, void *holder)
 		mutex_unlock(&bdev->bd_holder_lock);
 		if (bdev->bd_write_holder)
 			unblock = true;
+		bd_clear_claiming(whole);
 	}
 	if (!whole->bd_holders)
 		whole->bd_holder = NULL;
@@ -954,6 +960,20 @@ void bdev_release(struct bdev_handle *handle)
 }
 EXPORT_SYMBOL(bdev_release);
 
+void bdev_yield(struct bdev_handle *handle)
+{
+	struct block_device *bdev = handle->bdev;
+	struct block_device *whole = bdev_whole(bdev);
+
+	mutex_lock(&bdev_lock);
+	WARN_ON_ONCE(bdev->bd_holders == 0);
+	WARN_ON_ONCE(bdev->bd_holder != handle->holder);
+	WARN_ON_ONCE(whole->bd_claim);
+	whole->bd_claim = BD_CLAIM_YIELD;
+	mutex_unlock(&bdev_lock);
+}
+EXPORT_SYMBOL(bdev_yield);
+
 /**
  * lookup_bdev() - Look up a struct block_device by name.
  * @pathname: Name of the block device in the filesystem.
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cbef041fd868..54cf274a436c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -40,6 +40,7 @@ struct bio_crypt_ctx;
 enum bd_claim {
 	BD_CLAIM_DEFAULT = 0,
 	BD_CLAIM_ACQUIRE = 1,
+	BD_CLAIM_YIELD   = 2,
 };
 
 struct block_device {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index abf71cce785c..b15129afcdbe 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1513,6 +1513,7 @@ int bd_prepare_to_claim(struct block_device *bdev, void *holder,
 void bd_abort_claiming(struct block_device *bdev, void *holder);
 void blkdev_put(struct block_device *bdev, void *holder);
 void bdev_release(struct bdev_handle *handle);
+void bdev_yield(struct bdev_handle *handle);
 
 /* just for blk-cgroup, don't use elsewhere */
 struct block_device *blkdev_get_no_open(dev_t dev);

-- 
2.34.1



* [PATCH RFC 6/6] fs: add ->yield_devices()
  2023-10-24 14:53 [PATCH RFC 0/6] fs,block: yield devices Christian Brauner
                   ` (4 preceding siblings ...)
  2023-10-24 14:53 ` [PATCH RFC 5/6] block: mark device as about to be released Christian Brauner
@ 2023-10-24 14:53 ` Christian Brauner
  2023-10-25 17:20 ` [PATCH RFC 0/6] fs,block: yield devices Jan Kara
  2023-10-26 11:50 ` (subset) " Christian Brauner
  7 siblings, 0 replies; 24+ messages in thread
From: Christian Brauner @ 2023-10-24 14:53 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig; +Cc: linux-fsdevel, Christian Brauner

and allow filesystems to mark devices as about to be released, allowing
concurrent openers to wait until the device is reclaimable.
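
For a filesystem with only the main block device this is as simple as
(illustrative sketch only, "foofs" is made up):

	static void foofs_yield_devices(struct super_block *sb)
	{
		/* Called under sb->s_umount; this only marks the device. */
		bdev_yield(sb->s_bdev_handle);
	}

	static const struct super_operations foofs_sops = {
		/* ... other ops ... */
		.yield_devices	= foofs_yield_devices,
	};

The actual bdev_release() calls stay wherever the filesystem currently
does them; yielding only marks the devices so concurrent openers can
wait instead of failing with EBUSY.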

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/ext4/super.c    | 12 ++++++++++++
 fs/super.c         | 51 ++++++++++++++++++++-------------------------------
 fs/xfs/xfs_super.c | 27 +++++++++++++++++++++++++++
 include/linux/fs.h |  1 +
 4 files changed, 60 insertions(+), 31 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e94df97ea440..45f550801329 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1477,6 +1477,17 @@ static void ext4_shutdown(struct super_block *sb)
        ext4_force_shutdown(sb, EXT4_GOING_FLAGS_NOLOGFLUSH);
 }
 
+static void ext4_yield_devices(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = sb->s_fs_info;
+	struct bdev_handle *journal_bdev_handle =
+		sbi ? sbi->s_journal_bdev_handle : NULL;
+
+	if (journal_bdev_handle)
+		bdev_yield(journal_bdev_handle);
+	bdev_yield(sb->s_bdev_handle);
+}
+
 static void init_once(void *foo)
 {
 	struct ext4_inode_info *ei = foo;
@@ -1638,6 +1649,7 @@ static const struct super_operations ext4_sops = {
 	.statfs		= ext4_statfs,
 	.show_options	= ext4_show_options,
 	.shutdown	= ext4_shutdown,
+	.yield_devices	= ext4_yield_devices,
 #ifdef CONFIG_QUOTA
 	.quota_read	= ext4_quota_read,
 	.quota_write	= ext4_quota_write,
diff --git a/fs/super.c b/fs/super.c
index 4edde92d5e8f..7e24bbd65be2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -87,10 +87,10 @@ static inline bool wait_born(struct super_block *sb)
 
 	/*
 	 * Pairs with smp_store_release() in super_wake() and ensures
-	 * that we see SB_BORN or SB_DYING after we're woken.
+	 * that we see SB_BORN or SB_DEAD after we're woken.
 	 */
 	flags = smp_load_acquire(&sb->s_flags);
-	return flags & (SB_BORN | SB_DYING);
+	return flags & (SB_BORN | SB_DEAD);
 }
 
 /**
@@ -101,12 +101,12 @@ static inline bool wait_born(struct super_block *sb)
  * If the superblock has neither passed through vfs_get_tree() or
  * generic_shutdown_super() yet wait for it to happen. Either superblock
  * creation will succeed and SB_BORN is set by vfs_get_tree() or we're
- * woken and we'll see SB_DYING.
+ * woken and we'll see SB_DEAD.
  *
  * The caller must have acquired a temporary reference on @sb->s_count.
  *
  * Return: The function returns true if SB_BORN was set and with
- *         s_umount held. The function returns false if SB_DYING was
+ *         s_umount held. The function returns false if SB_DEAD was
  *         set and without s_umount held.
  */
 static __must_check bool super_lock(struct super_block *sb, bool excl)
@@ -122,7 +122,7 @@ static __must_check bool super_lock(struct super_block *sb, bool excl)
 	 * @sb->s_root is NULL and @sb->s_active is 0. No one needs to
 	 * grab a reference to this. Tell them so.
 	 */
-	if (sb->s_flags & SB_DYING) {
+	if (sb->s_flags & SB_DEAD) {
 		super_unlock(sb, excl);
 		return false;
 	}
@@ -137,7 +137,7 @@ static __must_check bool super_lock(struct super_block *sb, bool excl)
 	wait_var_event(&sb->s_flags, wait_born(sb));
 
 	/*
-	 * Neither SB_BORN nor SB_DYING are ever unset so we never loop.
+	 * Neither SB_BORN nor SB_DEAD are ever unset so we never loop.
 	 * Just reacquire @sb->s_umount for the caller.
 	 */
 	goto relock;
@@ -439,18 +439,17 @@ void put_super(struct super_block *sb)
 
 static void kill_super_notify(struct super_block *sb)
 {
-	lockdep_assert_not_held(&sb->s_umount);
+	const struct super_operations *sop = sb->s_op;
 
-	/* already notified earlier */
-	if (sb->s_flags & SB_DEAD)
-		return;
+	lockdep_assert_held(&sb->s_umount);
+
+	/* Allow openers to wait for the devices to be cleaned up. */
+	if (sop->yield_devices)
+		sop->yield_devices(sb);
 
 	/*
 	 * Remove it from @fs_supers so it isn't found by new
-	 * sget{_fc}() walkers anymore. Any concurrent mounter still
-	 * managing to grab a temporary reference is guaranteed to
-	 * already see SB_DYING and will wait until we notify them about
-	 * SB_DEAD.
+	 * sget{_fc}() walkers anymore.
 	 */
 	spin_lock(&sb_lock);
 	hlist_del_init(&sb->s_instances);
@@ -459,7 +458,7 @@ static void kill_super_notify(struct super_block *sb)
 	/*
 	 * Let concurrent mounts know that this thing is really dead.
 	 * We don't need @sb->s_umount here as every concurrent caller
-	 * will see SB_DYING and either discard the superblock or wait
+	 * will see SB_DEAD and either discard the superblock or wait
 	 * for SB_DEAD.
 	 */
 	super_wake(sb, SB_DEAD);
@@ -483,8 +482,6 @@ void deactivate_locked_super(struct super_block *s)
 		unregister_shrinker(&s->s_shrink);
 		fs->kill_sb(s);
 
-		kill_super_notify(s);
-
 		/*
 		 * Since list_lru_destroy() may sleep, we cannot call it from
 		 * put_super(), where we hold the sb_lock. Therefore we destroy
@@ -583,7 +580,7 @@ static bool grab_super(struct super_block *sb)
 bool super_trylock_shared(struct super_block *sb)
 {
 	if (down_read_trylock(&sb->s_umount)) {
-		if (!(sb->s_flags & SB_DYING) && sb->s_root &&
+		if (!(sb->s_flags & SB_DEAD) && sb->s_root &&
 		    (sb->s_flags & SB_BORN))
 			return true;
 		super_unlock_shared(sb);
@@ -689,16 +686,9 @@ void generic_shutdown_super(struct super_block *sb)
 			spin_unlock(&sb->s_inode_list_lock);
 		}
 	}
-	/*
-	 * Broadcast to everyone that grabbed a temporary reference to this
-	 * superblock before we removed it from @fs_supers that the superblock
-	 * is dying. Every walker of @fs_supers outside of sget{_fc}() will now
-	 * discard this superblock and treat it as dead.
-	 *
-	 * We leave the superblock on @fs_supers so it can be found by
-	 * sget{_fc}() until we passed sb->kill_sb().
-	 */
-	super_wake(sb, SB_DYING);
+
+	kill_super_notify(sb);
+
 	super_unlock_excl(sb);
 	if (sb->s_bdi != &noop_backing_dev_info) {
 		if (sb->s_iflags & SB_I_PERSB_BDI)
@@ -790,7 +780,7 @@ struct super_block *sget_fc(struct fs_context *fc,
 	/*
 	 * Make the superblock visible on @super_blocks and @fs_supers.
 	 * It's in a nascent state and users should wait on SB_BORN or
-	 * SB_DYING to be set.
+	 * SB_DEAD to be set.
 	 */
 	list_add_tail(&s->s_list, &super_blocks);
 	hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
@@ -906,7 +896,7 @@ static void __iterate_supers(void (*f)(struct super_block *))
 	spin_lock(&sb_lock);
 	list_for_each_entry(sb, &super_blocks, s_list) {
 		/* Pairs with memory marrier in super_wake(). */
-		if (smp_load_acquire(&sb->s_flags) & SB_DYING)
+		if (smp_load_acquire(&sb->s_flags) & SB_DEAD)
 			continue;
 		sb->s_count++;
 		spin_unlock(&sb_lock);
@@ -1248,7 +1238,6 @@ void kill_anon_super(struct super_block *sb)
 {
 	dev_t dev = sb->s_dev;
 	generic_shutdown_super(sb);
-	kill_super_notify(sb);
 	free_anon_bdev(dev);
 }
 EXPORT_SYMBOL(kill_anon_super);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 84107d162e41..f7a0cb92c7c0 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1170,6 +1170,32 @@ xfs_fs_shutdown(
 	xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REMOVED);
 }
 
+static void xfs_fs_bdev_yield(struct bdev_handle *handle,
+			      struct super_block *sb)
+{
+	if (handle != sb->s_bdev_handle)
+		bdev_yield(handle);
+}
+
+static void
+xfs_fs_yield_devices(
+	struct super_block *sb)
+{
+	struct xfs_mount	*mp = XFS_M(sb);
+
+	if (mp) {
+		if (mp->m_logdev_targp &&
+		    mp->m_logdev_targp != mp->m_ddev_targp)
+			xfs_fs_bdev_yield(mp->m_logdev_targp->bt_bdev_handle, sb);
+		if (mp->m_rtdev_targp)
+			xfs_fs_bdev_yield(mp->m_rtdev_targp->bt_bdev_handle, sb);
+		if (mp->m_ddev_targp)
+			xfs_fs_bdev_yield(mp->m_ddev_targp->bt_bdev_handle, sb);
+	}
+
+	bdev_yield(sb->s_bdev_handle);
+}
+
 static const struct super_operations xfs_super_operations = {
 	.alloc_inode		= xfs_fs_alloc_inode,
 	.destroy_inode		= xfs_fs_destroy_inode,
@@ -1184,6 +1210,7 @@ static const struct super_operations xfs_super_operations = {
 	.nr_cached_objects	= xfs_fs_nr_cached_objects,
 	.free_cached_objects	= xfs_fs_free_cached_objects,
 	.shutdown		= xfs_fs_shutdown,
+	.yield_devices		= xfs_fs_yield_devices,
 };
 
 static int
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5174e821d451..f0278bf4ca03 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2026,6 +2026,7 @@ struct super_operations {
 	long (*free_cached_objects)(struct super_block *,
 				    struct shrink_control *);
 	void (*shutdown)(struct super_block *sb);
+	void (*yield_devices)(struct super_block *sb);
 };
 
 /*

-- 
2.34.1



* Re: [PATCH RFC 1/6] fs: simplify setup_bdev_super() calls
  2023-10-24 14:53 ` [PATCH RFC 1/6] fs: simplify setup_bdev_super() calls Christian Brauner
@ 2023-10-25 15:29   ` Jan Kara
  2023-10-27  6:42   ` Christoph Hellwig
  1 sibling, 0 replies; 24+ messages in thread
From: Jan Kara @ 2023-10-25 15:29 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel

On Tue 24-10-23 16:53:39, Christian Brauner wrote:
> There's no need to drop s_umount anymore now that we removed all sources
> where s_umount is taken beneath open_mutex or bd_holder_lock.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Yay. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/super.c | 16 ----------------
>  1 file changed, 16 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index b26b302f870d..4edde92d5e8f 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1613,15 +1613,7 @@ int get_tree_bdev(struct fs_context *fc,
>  			return -EBUSY;
>  		}
>  	} else {
> -		/*
> -		 * We drop s_umount here because we need to open the bdev and
> -		 * bdev->open_mutex ranks above s_umount (blkdev_put() ->
> -		 * bdev_mark_dead()). It is safe because we have active sb
> -		 * reference and SB_BORN is not set yet.
> -		 */
> -		super_unlock_excl(s);
>  		error = setup_bdev_super(s, fc->sb_flags, fc);
> -		__super_lock_excl(s);
>  		if (!error)
>  			error = fill_super(s, fc);
>  		if (error) {
> @@ -1665,15 +1657,7 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
>  			return ERR_PTR(-EBUSY);
>  		}
>  	} else {
> -		/*
> -		 * We drop s_umount here because we need to open the bdev and
> -		 * bdev->open_mutex ranks above s_umount (blkdev_put() ->
> -		 * bdev_mark_dead()). It is safe because we have active sb
> -		 * reference and SB_BORN is not set yet.
> -		 */
> -		super_unlock_excl(s);
>  		error = setup_bdev_super(s, flags, NULL);
> -		__super_lock_excl(s);
>  		if (!error)
>  			error = fill_super(s, data, flags & SB_SILENT ? 1 : 0);
>  		if (error) {
> 
> -- 
> 2.34.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH RFC 2/6] xfs: simplify device handling
  2023-10-24 14:53 ` [PATCH RFC 2/6] xfs: simplify device handling Christian Brauner
@ 2023-10-25 15:30   ` Jan Kara
  2023-10-27  6:42   ` Christoph Hellwig
  1 sibling, 0 replies; 24+ messages in thread
From: Jan Kara @ 2023-10-25 15:30 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel

On Tue 24-10-23 16:53:40, Christian Brauner wrote:
> We removed all codepaths where s_umount is taken beneath open_mutex and
> bd_holder_lock so don't make things more complicated than they need to
> be and hold s_umount over block device opening.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/xfs/xfs_super.c | 19 +++----------------
>  1 file changed, 3 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index f0ae07828153..84107d162e41 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -437,19 +437,13 @@ xfs_open_devices(
>  	struct bdev_handle	*logdev_handle = NULL, *rtdev_handle = NULL;
>  	int			error;
>  
> -	/*
> -	 * blkdev_put() can't be called under s_umount, see the comment
> -	 * in get_tree_bdev() for more details
> -	 */
> -	up_write(&sb->s_umount);
> -
>  	/*
>  	 * Open real time and log devices - order is important.
>  	 */
>  	if (mp->m_logname) {
>  		error = xfs_blkdev_get(mp, mp->m_logname, &logdev_handle);
>  		if (error)
> -			goto out_relock;
> +			return error;
>  	}
>  
>  	if (mp->m_rtname) {
> @@ -492,10 +486,7 @@ xfs_open_devices(
>  			bdev_release(logdev_handle);
>  	}
>  
> -	error = 0;
> -out_relock:
> -	down_write(&sb->s_umount);
> -	return error;
> +	return 0;
>  
>   out_free_rtdev_targ:
>  	if (mp->m_rtdev_targp)
> @@ -508,7 +499,7 @@ xfs_open_devices(
>   out_close_logdev:
>  	if (logdev_handle)
>  		bdev_release(logdev_handle);
> -	goto out_relock;
> +	return error;
>  }
>  
>  /*
> @@ -758,10 +749,6 @@ static void
>  xfs_mount_free(
>  	struct xfs_mount	*mp)
>  {
> -	/*
> -	 * Free the buftargs here because blkdev_put needs to be called outside
> -	 * of sb->s_umount, which is held around the call to ->put_super.
> -	 */
>  	if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp)
>  		xfs_free_buftarg(mp->m_logdev_targp);
>  	if (mp->m_rtdev_targp)
> 
> -- 
> 2.34.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH RFC 3/6] ext4: simplify device handling
  2023-10-24 14:53 ` [PATCH RFC 3/6] ext4: " Christian Brauner
@ 2023-10-25 15:30   ` Jan Kara
  2023-10-27  6:42   ` Christoph Hellwig
  1 sibling, 0 replies; 24+ messages in thread
From: Jan Kara @ 2023-10-25 15:30 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel

On Tue 24-10-23 16:53:41, Christian Brauner wrote:
> We removed all codepaths where s_umount is taken beneath open_mutex and
> bd_holder_lock so don't make things more complicated than they need to
> be and hold s_umount over block device opening.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Nice. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/super.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index d43f8324242a..e94df97ea440 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5855,11 +5855,8 @@ static struct bdev_handle *ext4_get_journal_blkdev(struct super_block *sb,
>  	struct ext4_super_block *es;
>  	int errno;
>  
> -	/* see get_tree_bdev why this is needed and safe */
> -	up_write(&sb->s_umount);
>  	bdev_handle = bdev_open_by_dev(j_dev, BLK_OPEN_READ | BLK_OPEN_WRITE,
>  				       sb, &fs_holder_ops);
> -	down_write(&sb->s_umount);
>  	if (IS_ERR(bdev_handle)) {
>  		ext4_msg(sb, KERN_ERR,
>  			 "failed to open journal device unknown-block(%u,%u) %ld",
> 
> -- 
> 2.34.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH RFC 4/6] bdev: simplify waiting for concurrent claimers
  2023-10-24 14:53 ` [PATCH RFC 4/6] bdev: simplify waiting for concurrent claimers Christian Brauner
@ 2023-10-25 15:54   ` Jan Kara
  2023-10-27  7:21     ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2023-10-25 15:54 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel

On Tue 24-10-23 16:53:42, Christian Brauner wrote:
> Simplify the mechanism to wait for concurrent block device claimers
> and make it possible to introduce an additional state in the following
> patches.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

The simplification looks good but a few notes below:

> diff --git a/block/bdev.c b/block/bdev.c
> index 9deacd346192..7d19e04a8df8 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -482,6 +482,14 @@ static bool bd_may_claim(struct block_device *bdev, void *holder,
>  	return true;
>  }
>  
> +static bool wait_claimable(const struct block_device *bdev)
> +{
> +	enum bd_claim bd_claim;
> +
> +	bd_claim = smp_load_acquire(&bdev->bd_claim);
> +	return bd_claim == BD_CLAIM_DEFAULT;
> +}

Aren't you overdoing it here a bit? Given this is used only in a retry
loop and all the checks that need to be reliable are done under bdev_lock,
I'd say having:

	return READ_ONCE(bdev->bd_claim) == BD_CLAIM_DEFAULT;

should be fine here? And probably just inline that into the
wait_var_event() call...
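
IOW something like:

	wait_var_event(&whole->bd_claim,
		       READ_ONCE(whole->bd_claim) == BD_CLAIM_DEFAULT);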

> @@ -511,31 +519,25 @@ int bd_prepare_to_claim(struct block_device *bdev, void *holder,
>  	}
>  
>  	/* if claiming is already in progress, wait for it to finish */
> -	if (whole->bd_claiming) {
> -		wait_queue_head_t *wq = bit_waitqueue(&whole->bd_claiming, 0);
> -		DEFINE_WAIT(wait);
> -
> -		prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
> +	if (whole->bd_claim) {

This test implicitly assumes that 0 is BD_CLAIM_DEFAULT. I guess that's
fine although I somewhat prefer explicit value test like:

	if (whole->bd_claim != BD_CLAIM_DEFAULT)

>  		mutex_unlock(&bdev_lock);
> -		schedule();
> -		finish_wait(wq, &wait);
> +		wait_var_event(&whole->bd_claim, wait_claimable(whole));
>  		goto retry;
>  	}
>  
>  	/* yay, all mine */
> -	whole->bd_claiming = holder;
> +	whole->bd_claim = BD_CLAIM_ACQUIRE;

Here I'd use WRITE_ONCE() to avoid KCSAN warnings and having to think
whether this can race with wait_claimable() or not.
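I.e.:

	WRITE_ONCE(whole->bd_claim, BD_CLAIM_ACQUIRE);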

>  	mutex_unlock(&bdev_lock);
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(bd_prepare_to_claim); /* only for the loop driver */
>  
> -static void bd_clear_claiming(struct block_device *whole, void *holder)
> +static void bd_clear_claiming(struct block_device *whole)
>  {
>  	lockdep_assert_held(&bdev_lock);
> -	/* tell others that we're done */
> -	BUG_ON(whole->bd_claiming != holder);
> -	whole->bd_claiming = NULL;
> -	wake_up_bit(&whole->bd_claiming, 0);
> +	smp_store_release(&whole->bd_claim, BD_CLAIM_DEFAULT);
> +	smp_mb();
> +	wake_up_var(&whole->bd_claim);

And here since we are under bdev_lock and the waiter is going to check
under bdev_lock as well, we should be able to do:

	WRITE_ONCE(whole->bd_claim, BD_CLAIM_DEFAULT);
	/* Pairs with barrier in prepare_to_wait_event() -> set_current_state() */
	smp_mb();
	wake_up_var(&whole->bd_claim);

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH RFC 0/6] fs,block: yield devices
  2023-10-24 14:53 [PATCH RFC 0/6] fs,block: yield devices Christian Brauner
                   ` (5 preceding siblings ...)
  2023-10-24 14:53 ` [PATCH RFC 6/6] fs: add ->yield_devices() Christian Brauner
@ 2023-10-25 17:20 ` Jan Kara
  2023-10-25 20:46   ` Christian Brauner
  2023-10-26 11:50 ` (subset) " Christian Brauner
  7 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2023-10-25 17:20 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel

Hello!

On Tue 24-10-23 16:53:38, Christian Brauner wrote:
> This is a mechanism that allows the holder of a block device to yield
> device access before actually closing the block device.
> 
> If someone yields a device then any concurrent opener claiming the
> device exclusively with the same blk_holder_ops as the current owner can
> wait for the device to be given up. Filesystems by default use
> fs_holder_ops and so can wait on each other.
> 
> This mechanism allows us to simplify superblock handling quite a bit at
> the expense of requiring filesystems to yield devices. A filesystem must
> yield devices under s_umount. This allows costly work to be done outside
> of s_umount.
> 
> There's nothing wrong with the way we currently do things but this does
> allow us to simplify things and kills a whole class of theoretical UAF
> when walking the superblock list.

I'm not sure why it is better to create a new ->yield callback called under
sb->s_umount rather than just move blkdev_put() calls back into
->put_super. Or at least yielding could be done in ->put_super instead of
a new callback if for some reason that is better than releasing the devices
right away in ->put_super.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH RFC 0/6] fs,block: yield devices
  2023-10-25 17:20 ` [PATCH RFC 0/6] fs,block: yield devices Jan Kara
@ 2023-10-25 20:46   ` Christian Brauner
  2023-10-26 10:35     ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2023-10-25 20:46 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel

On Wed, Oct 25, 2023 at 07:20:57PM +0200, Jan Kara wrote:
> Hello!
> 
> On Tue 24-10-23 16:53:38, Christian Brauner wrote:
> > This is a mechanism that allows the holder of a block device to yield
> > device access before actually closing the block device.
> > 
> > If someone yields a device then any concurrent opener claiming the
> > device exclusively with the same blk_holder_ops as the current owner can
> > wait for the device to be given up. Filesystems by default use
> > fs_holder_ops and so can wait on each other.
> > 
> > This mechanism allows us to simplify superblock handling quite a bit at
> > the expense of requiring filesystems to yield devices. A filesystem must
> > yield devices under s_umount. This allows costly work to be done outside
> > of s_umount.
> > 
> > There's nothing wrong with the way we currently do things but this does
> > allow us to simplify things and kills a whole class of theoretical UAF
> > when walking the superblock list.
> 
> I'm not sure why it is better to create a new ->yield callback called under
> sb->s_umount rather than just move blkdev_put() calls back into
> ->put_super. Or at least yielding could be done in ->put_super instead of

The main reason was to not call potentially expensive
blkdev_put()/bdev_release() under s_umount. If we don't care about this
though then this shouldn't be a problem. And yes, then we need to move
blkdev_put()/bdev_release() under s_umount including the main block
device. IOW, we need to ensure that all bdev calls are done under
s_umount before we remove the superblock from the instance list. I think
that should be fine but I wanted to propose an alternative to that as
well: cheap mark-for-release under s_umount and heavy-duty without
s_umount. But I guess it doesn't matter because most filesystems used to
close devices under s_umount before anyway. Let me know what you
think makes the most sense.


* Re: [PATCH RFC 0/6] fs,block: yield devices
  2023-10-25 20:46   ` Christian Brauner
@ 2023-10-26 10:35     ` Jan Kara
  2023-10-26 12:07       ` Christian Brauner
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2023-10-26 10:35 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel

On Wed 25-10-23 22:46:29, Christian Brauner wrote:
> On Wed, Oct 25, 2023 at 07:20:57PM +0200, Jan Kara wrote:
> > Hello!
> > 
> > On Tue 24-10-23 16:53:38, Christian Brauner wrote:
> > > This is a mechanism that allows the holder of a block device to yield
> > > device access before actually closing the block device.
> > > 
> > > If someone yields a device then any concurrent opener claiming the
> > > device exclusively with the same blk_holder_ops as the current owner can
> > > wait for the device to be given up. Filesystems by default use
> > > fs_holder_ops and so can wait on each other.
> > > 
> > > This mechanism allows us to simplify superblock handling quite a bit at
> > > the expense of requiring filesystems to yield devices. A filesystem must
> > > yield devices under s_umount. This allows costly work to be done outside
> > > of s_umount.
> > > 
> > > There's nothing wrong with the way we currently do things but this does
> > > allow us to simplify things and kills a whole class of theoretical UAF
> > > when walking the superblock list.
> > 
> > I'm not sure why it is better to create a new ->yield callback called under
> > sb->s_umount rather than just move blkdev_put() calls back into
> > ->put_super. Or at least yielding could be done in ->put_super instead of
> 
> The main reason was to not call potentially expensive
> blkdev_put()/bdev_release() under s_umount. If we don't care about this
> though then this shouldn't be a problem.

So I would not be really concerned about performance here. The superblock
is dying, nobody can do anything about that until the superblock is fully
dead and cleaned up. Maybe some places could skip such superblocks more
quickly but so far I'm not convinced it matters in practice (e.g. writeback
holds s_umount over the whole sync(1) time and nobody complains). And as
you mention below, we have been doing this for a long time without anybody
really complaining.

> And yes, then we need to move
> blkdev_put()/bdev_release() under s_umount including the main block
> device. IOW, we need to ensure that all bdev calls are done under
> s_umount before we remove the superblock from the instance list.

This is about those seemingly spurious "device busy" errors when the
superblock hasn't closed its devices yet, isn't it?  But we now remove
superblock from s_instances list in kill_super_notify() and until that
moment SB_DYING is protecting us from racing mounts. So is there some other
problem?

> I think
> that should be fine but I wanted to propose an alternative to that as
> well: cheap mark-for-release under s_umount and heavy-duty without
> s_umount. But I guess it doesn't matter because most filesystems used to
> close devices under s_umount before anyway. Let me know what you
> think makes the most sense.

I think we should make it as simple as possible for filesystems. As I said
above I don't think s_umount hold time really matters so the only things
that limit us are locking constraints. During superblock shutdown the only
thing that currently cannot be done under s_umount (that I'm aware of) is the
teardown of the sb->s_bdi because that waits for writeback threads and
those can be blocked waiting for s_umount.

So if we wanted we could just move ext4 & xfs bdev closing back into
->put_super to avoid ext4_kill_sb() and somewhat slim down xfs_mount_free()
but otherwise I don't see much space for simplification or need for adding
more callbacks :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: (subset) [PATCH RFC 0/6] fs,block: yield devices
  2023-10-24 14:53 [PATCH RFC 0/6] fs,block: yield devices Christian Brauner
                   ` (6 preceding siblings ...)
  2023-10-25 17:20 ` [PATCH RFC 0/6] fs,block: yield devices Jan Kara
@ 2023-10-26 11:50 ` Christian Brauner
  7 siblings, 0 replies; 24+ messages in thread
From: Christian Brauner @ 2023-10-26 11:50 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig; +Cc: Christian Brauner, linux-fsdevel

On Tue, 24 Oct 2023 16:53:38 +0200, Christian Brauner wrote:
> Hey,
> 
> This is a mechanism that allows the holder of a block device to yield
> device access before actually closing the block device.
> 
> If someone yields a device then any concurrent opener claiming the
> device exclusively with the same blk_holder_ops as the current owner can
> wait for the device to be given up. Filesystems by default use
> fs_holder_ops and so can wait on each other.
> 
> [...]

for v6.8

---

Applied to the vfs.super branch of the vfs/vfs.git tree.
Patches in the vfs.super branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.super

[1/6] fs: simplify setup_bdev_super() calls
      https://git.kernel.org/vfs/vfs/c/f15112134111
[2/6] xfs: simplify device handling
      https://git.kernel.org/vfs/vfs/c/cbded4a095bd
[3/6] ext4: simplify device handling
      https://git.kernel.org/vfs/vfs/c/ac85cf49ed93


* Re: [PATCH RFC 0/6] fs,block: yield devices
  2023-10-26 10:35     ` Jan Kara
@ 2023-10-26 12:07       ` Christian Brauner
  2023-10-26 13:04         ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2023-10-26 12:07 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel

On Thu, Oct 26, 2023 at 12:35:03PM +0200, Jan Kara wrote:
> On Wed 25-10-23 22:46:29, Christian Brauner wrote:
> > On Wed, Oct 25, 2023 at 07:20:57PM +0200, Jan Kara wrote:
> > > Hello!
> > > 
> > > On Tue 24-10-23 16:53:38, Christian Brauner wrote:
> > > > This is a mechanism that allows the holder of a block device to yield
> > > > device access before actually closing the block device.
> > > > 
> > > > If someone yields a device then any concurrent opener claiming the
> > > > device exclusively with the same blk_holder_ops as the current owner can
> > > > wait for the device to be given up. Filesystems by default use
> > > > fs_holder_ops and so can wait on each other.
> > > > 
> > > > This mechanism allows us to simplify superblock handling quite a bit at
> > > > the expense of requiring filesystems to yield devices. A filesystem must
> > > > yield devices under s_umount. This allows costly work to be done outside
> > > > of s_umount.
> > > > 
> > > > There's nothing wrong with the way we currently do things but this does
> > > > allow us to simplify things and kills a whole class of theoretical UAF
> > > > when walking the superblock list.
> > > 
> > > I'm not sure why it is better to create a new ->yield callback called under
> > > sb->s_umount rather than just move blkdev_put() calls back into
> > > ->put_super. Or at least yielding could be done in ->put_super instead of
> > 
> > The main reason was to not call potentially expensive
> > blkdev_put()/bdev_release() under s_umount. If we don't care about this
> > though then this shouldn't be a problem.
> 
> So I would not be really concerned about performance here. The superblock
> is dying, nobody can do anything about that until the superblock is fully
> dead and cleaned up. Maybe some places could skip such superblocks more
> quickly but so far I'm not convinced it matters in practice (e.g. writeback
> holds s_umount over the whole sync(1) time and nobody complains). And as
> you mention below, we have been doing this for a long time without anybody
> really complaining.

Ok, that's a good point.

> 
> > And yes, then we need to move
> > blkdev_put()/bdev_release() under s_umount including the main block
> > device. IOW, we need to ensure that all bdev calls are done under
> > s_umount before we remove the superblock from the instance list.
> 
> This is about those seemingly spurious "device busy" errors when the
> superblock hasn't closed its devices yet, isn't it?  But we now remove

Yes, because we tie superblock and block device neatly together.

> superblock from s_instances list in kill_super_notify() and until that
> moment SB_DYING is protecting us from racing mounts. So is there some other
> problem?

No, there isn't a problem at all. It's all working fine but it was
initially a little annoying as we had to update filesystems to ensure
that sb->s_fs_info is kept alive. But it's all fixed.

The possible advantage is that if we drop all block devices under
s_umount then we can remove the superblock from fs_type->s_instances in
the old location again. I'm not convinced it's worth it but it's a
possible simplification. I'm not even arguing it's objectively better; I
think it's a matter of taste in the end.

> 
> > I think
> > that should be fine but I wanted to propose an alternative to that as
> > well: cheap mark-for-release under s_umount and heavy-duty without
> > s_umount. But I guess it doesn't matter because most filesystems used to
> > close devices under s_umount before anyway. Let me know what you
> > think makes the most sense.
> 
> I think we should make it as simple as possible for filesystems. As I said

Yes, I fully agree.

The really good thing about the current mechanism is that it's
completely vfs-only. With the s_umount/open_mutex ordering fixed,
filesystems can now close block devices wherever they like and the vfs
will ensure that you don't get spurious EBUSY. And the filesystem
doesn't have to know a
thing about it or take any care.

> above I don't think s_umount hold time really matters so the only thing
> that limits us are locking constraints. During superblock shutdown the only
> thing that currently cannot be done under s_umount (that I'm aware of) is the
> teardown of the sb->s_bdi because that waits for writeback threads and
> those can be blocked waiting for s_umount.

Which we don't need to change if it's working fine.

> 
> So if we wanted we could just move ext4 & xfs bdev closing back into
> ->put_super to avoid ext4_kill_sb() and somewhat slim down xfs_mount_free()
> but otherwise I don't see much space for simplification or need for adding
> more callbacks :)

I mean, I would only think that is useful if we really wanted to close
all block devices under s_umount to possibly remove the waiting
mechanism we have right now. Otherwise I think it's churn for the sake
of churn and I quite like what we have right now.


* Re: [PATCH RFC 0/6] fs,block: yield devices
  2023-10-26 12:07       ` Christian Brauner
@ 2023-10-26 13:04         ` Jan Kara
  2023-10-26 15:08           ` Christian Brauner
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2023-10-26 13:04 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel

On Thu 26-10-23 14:07:38, Christian Brauner wrote:
> On Thu, Oct 26, 2023 at 12:35:03PM +0200, Jan Kara wrote:
> > This is about those seemingly spurious "device busy" errors when the
> > superblock hasn't closed its devices yet, isn't it?  But we now remove
> 
> Yes, because we tie superblock and block device neatly together.
> 
> > superblock from s_instances list in kill_super_notify() and until that
> > moment SB_DYING is protecting us from racing mounts. So is there some other
> > problem?
> 
> No, there isn't a problem at all. It's all working fine but it was
> initially a little annoying as we had to update filesystems to ensure
> that sb->s_fs_info is kept alive. But it's all fixed.
> 
> The possible advantage is that if we drop all block devices under
> s_umount then we can remove the superblock from fs_type->s_instances in
> the old location again. I'm not convinced it's worth it but it's a
> possible simplification. I'm not even arguing it's objectively better; I
> think it's a matter of taste in the end.

Yes. But dropping the main bdev under s_umount is not easy to do with the
current callback structure. We have kill_block_super() calling into
generic_shutdown_super(). Now logically generic_shutdown_super() wants to
tear down the bdi (for which it needs to drop s_umount) but the bdev is closed only
in kill_block_super() because that's the sb-on-bdev specific call.
Previously we got away with this without spurious EBUSY errors because the
bdev holder was the filesystem type and not the superblock. But after
changing the holder it isn't fine anymore and we have to play these games
with s_instances and SB_DYING.

What we could probably do is to have something like:

	if (sb->s_bdev) {
		blkdev_put(sb->s_bdev);
		sb->s_bdev = NULL;
	}
	if (sb->s_mtd) {
		put_mtd_device(sb->s_mtd);
		sb->s_mtd = NULL;
	}

in generic_shutdown_super() and then remove sb from s_instances, drop
s_umount and cleanup bdi.
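
So the tail of generic_shutdown_super() would become roughly (sketch
only, untested):

	/* all bdevs closed above, still under sb->s_umount */
	spin_lock(&sb_lock);
	hlist_del_init(&sb->s_instances);
	spin_unlock(&sb_lock);
	super_unlock_excl(sb);
	/* only now tear down sb->s_bdi, without s_umount held */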

Then we can get rid of SB_DYING, kill_super_notify() and stuff but based on
your taste it could also be viewed as kind of a layering violation so I'm not
100% convinced this is definitely a way to go.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH RFC 0/6] fs,block: yield devices
  2023-10-26 13:04         ` Jan Kara
@ 2023-10-26 15:08           ` Christian Brauner
  2023-10-26 15:58             ` Jan Kara
  2023-10-27  7:24             ` Christoph Hellwig
  0 siblings, 2 replies; 24+ messages in thread
From: Christian Brauner @ 2023-10-26 15:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel

> your taste it could also be viewed as kind of a layering violation so I'm not
> 100% convinced this is definitely a way to go.

Yeah, I'm not convinced either. As I said, I really like that right now
it's a vfs thing only and we don't have specific requirements about how
devices are closed, which is really nice. So let's just leave it as is?


* Re: [PATCH RFC 0/6] fs,block: yield devices
  2023-10-26 15:08           ` Christian Brauner
@ 2023-10-26 15:58             ` Jan Kara
  2023-10-27  7:24             ` Christoph Hellwig
  1 sibling, 0 replies; 24+ messages in thread
From: Jan Kara @ 2023-10-26 15:58 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel

On Thu 26-10-23 17:08:00, Christian Brauner wrote:
> > your taste it could also be viewed as kind of a layering violation so I'm not
> > 100% convinced this is definitely a way to go.
> 
> Yeah, I'm not convinced either. As I said, I really like that right now
> it's a vfs thing only and we don't have specific requirements about how
> devices are closed, which is really nice. So let's just leave it as is?

Yes, fine by me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH RFC 1/6] fs: simplify setup_bdev_super() calls
  2023-10-24 14:53 ` [PATCH RFC 1/6] fs: simplify setup_bdev_super() calls Christian Brauner
  2023-10-25 15:29   ` Jan Kara
@ 2023-10-27  6:42   ` Christoph Hellwig
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2023-10-27  6:42 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel

So while I still think we should not be holding s_umount when setting
up a new unborn sb to start with, there is no point in dropping it
now:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH RFC 2/6] xfs: simplify device handling
  2023-10-24 14:53 ` [PATCH RFC 2/6] xfs: simplify device handling Christian Brauner
  2023-10-25 15:30   ` Jan Kara
@ 2023-10-27  6:42   ` Christoph Hellwig
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2023-10-27  6:42 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH RFC 3/6] ext4: simplify device handling
  2023-10-24 14:53 ` [PATCH RFC 3/6] ext4: " Christian Brauner
  2023-10-25 15:30   ` Jan Kara
@ 2023-10-27  6:42   ` Christoph Hellwig
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2023-10-27  6:42 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH RFC 4/6] bdev: simplify waiting for concurrent claimers
  2023-10-25 15:54   ` Jan Kara
@ 2023-10-27  7:21     ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2023-10-27  7:21 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christian Brauner, Christoph Hellwig, linux-fsdevel

On Wed, Oct 25, 2023 at 05:54:39PM +0200, Jan Kara wrote:
> This test implicitly assumes that 0 is BD_CLAIM_DEFAULT. I guess that's
> fine although I somewhat prefer explicit value test like:
>
> 	if (whole->bd_claim != BD_CLAIM_DEFAULT)

I find the BD_CLAIM_DEFAULT confusing to be honest.  I'd expect null
to just be checked as:

 	if (whole->bd_claim)

That being said, instead of doing all the manual atomic magic, why
not add an

	unsigned long		bd_state;

to struct block_device instead of bd_claim, then define a single
bit for a device being claimed and simplify everything while also
giving us space for more bits if we ever need them?
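
Something like this (entirely untested sketch):

	/* flag bits for block_device->bd_state */
	enum {
		BD_CLAIMING,	/* exclusive claim in progress */
	};

	/* bd_prepare_to_claim() */
	if (test_bit(BD_CLAIMING, &whole->bd_state)) {
		mutex_unlock(&bdev_lock);
		wait_on_bit(&whole->bd_state, BD_CLAIMING,
			    TASK_UNINTERRUPTIBLE);
		goto retry;
	}
	set_bit(BD_CLAIMING, &whole->bd_state);

	/* bd_clear_claiming() */
	clear_bit_unlock(BD_CLAIMING, &whole->bd_state);
	smp_mb__after_atomic();
	wake_up_bit(&whole->bd_state, BD_CLAIMING);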


* Re: [PATCH RFC 0/6] fs,block: yield devices
  2023-10-26 15:08           ` Christian Brauner
  2023-10-26 15:58             ` Jan Kara
@ 2023-10-27  7:24             ` Christoph Hellwig
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2023-10-27  7:24 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel

On Thu, Oct 26, 2023 at 05:08:00PM +0200, Christian Brauner wrote:
> > your taste it could also be viewed as kind of a layering violation so I'm not
> > 100% convinced this is definitely a way to go.
> 
> Yeah, I'm not convinced either. As I said, I really like that right now
> it's a vfs thing only and we don't have specific requirements about how
> devices are closed, which is really nice. So let's just leave it as is?

Yes, I can't really get excited about this change in any way.

