yet another approach to fix the loop lock order inversions v5

All of lore.kernel.org
 help / color / mirror / Atom feed

* yet another approach to fix the loop lock order inversions v5
@ 2022-03-25  6:39 Christoph Hellwig
  2022-03-25  6:39 ` [PATCH 01/14] nbd: use the correct block_device in nbd_bdev_reset Christoph Hellwig
                   ` (14 more replies)
  0 siblings, 15 replies; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

Hi all,

this series uses the approach from Tetsuo to delay the destroy_workueue
call, extended by a delayed teardown of the workers to fix a potential
race window then the workqueue can be still round after finishing the
commands. 

Changes since v4:
 - keep the (questionable) __invalidate_device call in nbd as-is for now
 - suppress uevents while reconfiguring

Changes since v3:
 - change bd_openers into a atomic_t, including a bunch of cleanups
   and fix found while adding those

Changes since v2:
 - rebased to the lastest block for-next tree, which has the async
   clear reverted and ->free_disk
 - impkement ->free_disk for loop to handle open vs delete races
   more gracefully
 - get rid of lo_refcnt entirely

Changes since v1:
 - add comments to document the lo_refcnt synchronization
 - fix comment typos

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 01/14] nbd: use the correct block_device in nbd_bdev_reset
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
@ 2022-03-25  6:39 ` Christoph Hellwig
  2022-03-25  9:48   ` Jan Kara
  2022-03-25  6:39 ` [PATCH 02/14] zram: cleanup reset_store Christoph Hellwig
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

The bdev parameter to ->ioctl contains the block device that the ioctl
is called on, which can be the partition.  But the openers check in
nbd_bdev_reset really needs to check use the whole device, so switch to
using that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/block/nbd.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 5a1f98494dddf..2f29da41fbc80 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1217,11 +1217,11 @@ static int nbd_reconnect_socket(struct nbd_device *nbd, unsigned long arg)
 	return -ENOSPC;
 }
 
-static void nbd_bdev_reset(struct block_device *bdev)
+static void nbd_bdev_reset(struct nbd_device *nbd)
 {
-	if (bdev->bd_openers > 1)
+	if (nbd->disk->part0->bd_openers > 1)
 		return;
-	set_capacity(bdev->bd_disk, 0);
+	set_capacity(nbd->disk, 0);
 }
 
 static void nbd_parse_flags(struct nbd_device *nbd)
@@ -1389,7 +1389,7 @@ static int nbd_start_device(struct nbd_device *nbd)
 	return nbd_set_size(nbd, config->bytesize, nbd_blksize(config));
 }
 
-static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *bdev)
+static int nbd_start_device_ioctl(struct nbd_device *nbd)
 {
 	struct nbd_config *config = nbd->config;
 	int ret;
@@ -1408,7 +1408,7 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
 	flush_workqueue(nbd->recv_workq);
 
 	mutex_lock(&nbd->config_lock);
-	nbd_bdev_reset(bdev);
+	nbd_bdev_reset(nbd);
 	/* user requested, ignore socket errors */
 	if (test_bit(NBD_RT_DISCONNECT_REQUESTED, &config->runtime_flags))
 		ret = 0;
@@ -1422,7 +1422,7 @@ static void nbd_clear_sock_ioctl(struct nbd_device *nbd,
 {
 	sock_shutdown(nbd);
 	__invalidate_device(bdev, true);
-	nbd_bdev_reset(bdev);
+	nbd_bdev_reset(nbd);
 	if (test_and_clear_bit(NBD_RT_HAS_CONFIG_REF,
 			       &nbd->config->runtime_flags))
 		nbd_config_put(nbd);
@@ -1468,7 +1468,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		config->flags = arg;
 		return 0;
 	case NBD_DO_IT:
-		return nbd_start_device_ioctl(nbd, bdev);
+		return nbd_start_device_ioctl(nbd);
 	case NBD_CLEAR_QUE:
 		/*
 		 * This is for compatibility only.  The queue is always cleared
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 02/14] zram: cleanup reset_store
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
  2022-03-25  6:39 ` [PATCH 01/14] nbd: use the correct block_device in nbd_bdev_reset Christoph Hellwig
@ 2022-03-25  6:39 ` Christoph Hellwig
  2022-03-25  6:39 ` [PATCH 03/14] zram: cleanup zram_remove Christoph Hellwig
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

Use a local variable for the gendisk instead of the part0 block_device,
as the gendisk is what this function actually operates on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 drivers/block/zram/zram_drv.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index e9474b02012de..fd83fad59beb1 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1786,7 +1786,7 @@ static ssize_t reset_store(struct device *dev,
 	int ret;
 	unsigned short do_reset;
 	struct zram *zram;
-	struct block_device *bdev;
+	struct gendisk *disk;
 
 	ret = kstrtou16(buf, 10, &do_reset);
 	if (ret)
@@ -1796,26 +1796,26 @@ static ssize_t reset_store(struct device *dev,
 		return -EINVAL;
 
 	zram = dev_to_zram(dev);
-	bdev = zram->disk->part0;
+	disk = zram->disk;
 
-	mutex_lock(&bdev->bd_disk->open_mutex);
+	mutex_lock(&disk->open_mutex);
 	/* Do not reset an active device or claimed device */
-	if (bdev->bd_openers || zram->claim) {
-		mutex_unlock(&bdev->bd_disk->open_mutex);
+	if (disk->part0->bd_openers || zram->claim) {
+		mutex_unlock(&disk->open_mutex);
 		return -EBUSY;
 	}
 
 	/* From now on, anyone can't open /dev/zram[0-9] */
 	zram->claim = true;
-	mutex_unlock(&bdev->bd_disk->open_mutex);
+	mutex_unlock(&disk->open_mutex);
 
 	/* Make sure all the pending I/O are finished */
-	sync_blockdev(bdev);
+	sync_blockdev(disk->part0);
 	zram_reset_device(zram);
 
-	mutex_lock(&bdev->bd_disk->open_mutex);
+	mutex_lock(&disk->open_mutex);
 	zram->claim = false;
-	mutex_unlock(&bdev->bd_disk->open_mutex);
+	mutex_unlock(&disk->open_mutex);
 
 	return len;
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 03/14] zram: cleanup zram_remove
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
  2022-03-25  6:39 ` [PATCH 01/14] nbd: use the correct block_device in nbd_bdev_reset Christoph Hellwig
  2022-03-25  6:39 ` [PATCH 02/14] zram: cleanup reset_store Christoph Hellwig
@ 2022-03-25  6:39 ` Christoph Hellwig
  2022-03-25  6:39 ` [PATCH 04/14] block: add a disk_openers helper Christoph Hellwig
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

Remove the bdev variable and just use the gendisk pointed to by the
zram_device directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 drivers/block/zram/zram_drv.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fd83fad59beb1..863606f1722b1 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1987,19 +1987,18 @@ static int zram_add(void)
 
 static int zram_remove(struct zram *zram)
 {
-	struct block_device *bdev = zram->disk->part0;
 	bool claimed;
 
-	mutex_lock(&bdev->bd_disk->open_mutex);
-	if (bdev->bd_openers) {
-		mutex_unlock(&bdev->bd_disk->open_mutex);
+	mutex_lock(&zram->disk->open_mutex);
+	if (zram->disk->part0->bd_openers) {
+		mutex_unlock(&zram->disk->open_mutex);
 		return -EBUSY;
 	}
 
 	claimed = zram->claim;
 	if (!claimed)
 		zram->claim = true;
-	mutex_unlock(&bdev->bd_disk->open_mutex);
+	mutex_unlock(&zram->disk->open_mutex);
 
 	zram_debugfs_unregister(zram);
 
@@ -2011,7 +2010,7 @@ static int zram_remove(struct zram *zram)
 		;
 	} else {
 		/* Make sure all the pending I/O are finished */
-		sync_blockdev(bdev);
+		sync_blockdev(zram->disk->part0);
 		zram_reset_device(zram);
 	}
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 04/14] block: add a disk_openers helper
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2022-03-25  6:39 ` [PATCH 03/14] zram: cleanup zram_remove Christoph Hellwig
@ 2022-03-25  6:39 ` Christoph Hellwig
  2022-03-25  6:39 ` [PATCH 05/14] block: turn bdev->bd_openers into an atomic_t Christoph Hellwig
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

Add a helper that returns the openers for a given gendisk to avoid having
drivers poke into disk->part0 to get at this information in a somewhat
cumbersome way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 drivers/block/nbd.c           |  4 ++--
 drivers/block/zram/zram_drv.c |  4 ++--
 include/linux/blkdev.h        | 15 +++++++++++++++
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 2f29da41fbc80..7473f3d4e1270 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1219,7 +1219,7 @@ static int nbd_reconnect_socket(struct nbd_device *nbd, unsigned long arg)
 
 static void nbd_bdev_reset(struct nbd_device *nbd)
 {
-	if (nbd->disk->part0->bd_openers > 1)
+	if (disk_openers(nbd->disk) > 1)
 		return;
 	set_capacity(nbd->disk, 0);
 }
@@ -1579,7 +1579,7 @@ static void nbd_release(struct gendisk *disk, fmode_t mode)
 	struct nbd_device *nbd = disk->private_data;
 
 	if (test_bit(NBD_RT_DISCONNECT_ON_CLOSE, &nbd->config->runtime_flags) &&
-			disk->part0->bd_openers == 0)
+			disk_openers(disk) == 0)
 		nbd_disconnect_and_put(nbd);
 
 	nbd_config_put(nbd);
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 863606f1722b1..2362385f782a9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1800,7 +1800,7 @@ static ssize_t reset_store(struct device *dev,
 
 	mutex_lock(&disk->open_mutex);
 	/* Do not reset an active device or claimed device */
-	if (disk->part0->bd_openers || zram->claim) {
+	if (disk_openers(disk) || zram->claim) {
 		mutex_unlock(&disk->open_mutex);
 		return -EBUSY;
 	}
@@ -1990,7 +1990,7 @@ static int zram_remove(struct zram *zram)
 	bool claimed;
 
 	mutex_lock(&zram->disk->open_mutex);
-	if (zram->disk->part0->bd_openers) {
+	if (disk_openers(zram->disk)) {
 		mutex_unlock(&zram->disk->open_mutex);
 		return -EBUSY;
 	}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index eb27312a1b8f3..9824ebc9b4d31 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -176,6 +176,21 @@ static inline bool disk_live(struct gendisk *disk)
 	return !inode_unhashed(disk->part0->bd_inode);
 }
 
+/**
+ * disk_openers - returns how many openers are there for a disk
+ * @disk: disk to check
+ *
+ * This returns the number of openers for a disk.  Note that this value is only
+ * stable if disk->open_mutex is held.
+ *
+ * Note: Due to a quirk in the block layer open code, each open partition is
+ * only counted once even if there are multiple openers.
+ */
+static inline unsigned int disk_openers(struct gendisk *disk)
+{
+	return disk->part0->bd_openers;
+}
+
 /*
  * The gendisk is refcounted by the part0 block_device, and the bd_device
  * therein is also used for device model presentation in sysfs.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 05/14] block: turn bdev->bd_openers into an atomic_t
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2022-03-25  6:39 ` [PATCH 04/14] block: add a disk_openers helper Christoph Hellwig
@ 2022-03-25  6:39 ` Christoph Hellwig
  2022-03-25  6:39 ` [PATCH 06/14] loop: de-duplicate the idle worker freeing code Christoph Hellwig
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

All manipulation of bd_openers is under disk->open_mutex and will remain
so for the foreseeable future.  But at least one place reads it without
the lock (blkdev_get) and there are more to be added.  So make sure the
compiler does not do turn the increments and decrements into non-atomic
sequences by using an atomic_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 block/bdev.c              | 16 ++++++++--------
 block/partitions/core.c   |  2 +-
 include/linux/blk_types.h |  2 +-
 include/linux/blkdev.h    |  2 +-
 4 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 13de871fa8169..7bf88e591aaf3 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -673,17 +673,17 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
 		}
 	}
 
-	if (!bdev->bd_openers)
+	if (!atomic_read(&bdev->bd_openers))
 		set_init_blocksize(bdev);
 	if (test_bit(GD_NEED_PART_SCAN, &disk->state))
 		bdev_disk_changed(disk, false);
-	bdev->bd_openers++;
+	atomic_inc(&bdev->bd_openers);
 	return 0;
 }
 
 static void blkdev_put_whole(struct block_device *bdev, fmode_t mode)
 {
-	if (!--bdev->bd_openers)
+	if (atomic_dec_and_test(&bdev->bd_openers))
 		blkdev_flush_mapping(bdev);
 	if (bdev->bd_disk->fops->release)
 		bdev->bd_disk->fops->release(bdev->bd_disk, mode);
@@ -694,7 +694,7 @@ static int blkdev_get_part(struct block_device *part, fmode_t mode)
 	struct gendisk *disk = part->bd_disk;
 	int ret;
 
-	if (part->bd_openers)
+	if (atomic_read(&part->bd_openers))
 		goto done;
 
 	ret = blkdev_get_whole(bdev_whole(part), mode);
@@ -708,7 +708,7 @@ static int blkdev_get_part(struct block_device *part, fmode_t mode)
 	disk->open_partitions++;
 	set_init_blocksize(part);
 done:
-	part->bd_openers++;
+	atomic_inc(&part->bd_openers);
 	return 0;
 
 out_blkdev_put:
@@ -720,7 +720,7 @@ static void blkdev_put_part(struct block_device *part, fmode_t mode)
 {
 	struct block_device *whole = bdev_whole(part);
 
-	if (--part->bd_openers)
+	if (!atomic_dec_and_test(&part->bd_openers))
 		return;
 	blkdev_flush_mapping(part);
 	whole->bd_disk->open_partitions--;
@@ -899,7 +899,7 @@ void blkdev_put(struct block_device *bdev, fmode_t mode)
 	 * of the world and we want to avoid long (could be several minute)
 	 * syncs while holding the mutex.
 	 */
-	if (bdev->bd_openers == 1)
+	if (atomic_read(&bdev->bd_openers) == 1)
 		sync_blockdev(bdev);
 
 	mutex_lock(&disk->open_mutex);
@@ -1044,7 +1044,7 @@ void sync_bdevs(bool wait)
 		bdev = I_BDEV(inode);
 
 		mutex_lock(&bdev->bd_disk->open_mutex);
-		if (!bdev->bd_openers) {
+		if (!atomic_read(&bdev->bd_openers)) {
 			; /* skip */
 		} else if (wait) {
 			/*
diff --git a/block/partitions/core.c b/block/partitions/core.c
index 2ef8dfa1e5c85..373ed748dcf26 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -486,7 +486,7 @@ int bdev_del_partition(struct gendisk *disk, int partno)
 		goto out_unlock;
 
 	ret = -EBUSY;
-	if (part->bd_openers)
+	if (atomic_read(&part->bd_openers))
 		goto out_unlock;
 
 	delete_partition(part);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 0c3563b45fe90..b1ced43ed0d3f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -44,7 +44,7 @@ struct block_device {
 	unsigned long		bd_stamp;
 	bool			bd_read_only;	/* read-only policy */
 	dev_t			bd_dev;
-	int			bd_openers;
+	atomic_t		bd_openers;
 	struct inode *		bd_inode;	/* will die */
 	struct super_block *	bd_super;
 	void *			bd_claiming;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9824ebc9b4d31..6b7c5af1d01df 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -188,7 +188,7 @@ static inline bool disk_live(struct gendisk *disk)
  */
 static inline unsigned int disk_openers(struct gendisk *disk)
 {
-	return disk->part0->bd_openers;
+	return atomic_read(&disk->part0->bd_openers);
 }
 
 /*
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 06/14] loop: de-duplicate the idle worker freeing code
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2022-03-25  6:39 ` [PATCH 05/14] block: turn bdev->bd_openers into an atomic_t Christoph Hellwig
@ 2022-03-25  6:39 ` Christoph Hellwig
  2022-03-25  6:39 ` [PATCH 07/14] loop: initialize the worker tracking fields once Christoph Hellwig
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

Use a common helper for both timer based and uncoditional freeing of idle
workers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Darrick J. Wong <djwong@kernel.org>
---
 drivers/block/loop.c | 73 +++++++++++++++++++++-----------------------
 1 file changed, 35 insertions(+), 38 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 3e636a75c83a8..762f0a18295d7 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -809,7 +809,6 @@ struct loop_worker {
 
 static void loop_workfn(struct work_struct *work);
 static void loop_rootcg_workfn(struct work_struct *work);
-static void loop_free_idle_workers(struct timer_list *timer);
 
 #ifdef CONFIG_BLK_CGROUP
 static inline int queue_on_root_worker(struct cgroup_subsys_state *css)
@@ -893,6 +892,39 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 	spin_unlock_irq(&lo->lo_work_lock);
 }
 
+static void loop_set_timer(struct loop_device *lo)
+{
+	timer_reduce(&lo->timer, jiffies + LOOP_IDLE_WORKER_TIMEOUT);
+}
+
+static void loop_free_idle_workers(struct loop_device *lo, bool delete_all)
+{
+	struct loop_worker *pos, *worker;
+
+	spin_lock_irq(&lo->lo_work_lock);
+	list_for_each_entry_safe(worker, pos, &lo->idle_worker_list,
+				idle_list) {
+		if (!delete_all &&
+		    time_is_after_jiffies(worker->last_ran_at +
+					  LOOP_IDLE_WORKER_TIMEOUT))
+			break;
+		list_del(&worker->idle_list);
+		rb_erase(&worker->rb_node, &lo->worker_tree);
+		css_put(worker->blkcg_css);
+		kfree(worker);
+	}
+	if (!list_empty(&lo->idle_worker_list))
+		loop_set_timer(lo);
+	spin_unlock_irq(&lo->lo_work_lock);
+}
+
+static void loop_free_idle_workers_timer(struct timer_list *timer)
+{
+	struct loop_device *lo = container_of(timer, struct loop_device, timer);
+
+	return loop_free_idle_workers(lo, false);
+}
+
 static void loop_update_rotational(struct loop_device *lo)
 {
 	struct file *file = lo->lo_backing_file;
@@ -1027,7 +1059,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	INIT_LIST_HEAD(&lo->rootcg_cmd_list);
 	INIT_LIST_HEAD(&lo->idle_worker_list);
 	lo->worker_tree = RB_ROOT;
-	timer_setup(&lo->timer, loop_free_idle_workers,
+	timer_setup(&lo->timer, loop_free_idle_workers_timer,
 		TIMER_DEFERRABLE);
 	lo->use_dio = lo->lo_flags & LO_FLAGS_DIRECT_IO;
 	lo->lo_device = bdev;
@@ -1091,7 +1123,6 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 {
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
-	struct loop_worker *pos, *worker;
 
 	/*
 	 * Flush loop_configure() and loop_change_fd(). It is acceptable for
@@ -1121,15 +1152,7 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	blk_mq_freeze_queue(lo->lo_queue);
 
 	destroy_workqueue(lo->workqueue);
-	spin_lock_irq(&lo->lo_work_lock);
-	list_for_each_entry_safe(worker, pos, &lo->idle_worker_list,
-				idle_list) {
-		list_del(&worker->idle_list);
-		rb_erase(&worker->rb_node, &lo->worker_tree);
-		css_put(worker->blkcg_css);
-		kfree(worker);
-	}
-	spin_unlock_irq(&lo->lo_work_lock);
+	loop_free_idle_workers(lo, true);
 	del_timer_sync(&lo->timer);
 
 	spin_lock_irq(&lo->lo_lock);
@@ -1887,11 +1910,6 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
 	}
 }
 
-static void loop_set_timer(struct loop_device *lo)
-{
-	timer_reduce(&lo->timer, jiffies + LOOP_IDLE_WORKER_TIMEOUT);
-}
-
 static void loop_process_work(struct loop_worker *worker,
 			struct list_head *cmd_list, struct loop_device *lo)
 {
@@ -1940,27 +1958,6 @@ static void loop_rootcg_workfn(struct work_struct *work)
 	loop_process_work(NULL, &lo->rootcg_cmd_list, lo);
 }
 
-static void loop_free_idle_workers(struct timer_list *timer)
-{
-	struct loop_device *lo = container_of(timer, struct loop_device, timer);
-	struct loop_worker *pos, *worker;
-
-	spin_lock_irq(&lo->lo_work_lock);
-	list_for_each_entry_safe(worker, pos, &lo->idle_worker_list,
-				idle_list) {
-		if (time_is_after_jiffies(worker->last_ran_at +
-						LOOP_IDLE_WORKER_TIMEOUT))
-			break;
-		list_del(&worker->idle_list);
-		rb_erase(&worker->rb_node, &lo->worker_tree);
-		css_put(worker->blkcg_css);
-		kfree(worker);
-	}
-	if (!list_empty(&lo->idle_worker_list))
-		loop_set_timer(lo);
-	spin_unlock_irq(&lo->lo_work_lock);
-}
-
 static const struct blk_mq_ops loop_mq_ops = {
 	.queue_rq       = loop_queue_rq,
 	.complete	= lo_complete_rq,
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 07/14] loop: initialize the worker tracking fields once
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2022-03-25  6:39 ` [PATCH 06/14] loop: de-duplicate the idle worker freeing code Christoph Hellwig
@ 2022-03-25  6:39 ` Christoph Hellwig
  2022-03-25  6:39 ` [PATCH 08/14] loop: remove the racy bd_inode->i_mapping->nrpages asserts Christoph Hellwig
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd, Chaitanya Kulkarni

There is no need to reinitialize idle_worker_list, worker_tree and timer
every time a loop device is configured.  Just initialize them once at
allocation time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Darrick J. Wong <djwong@kernel.org>
---
 drivers/block/loop.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 762f0a18295d7..d1c1086beedce 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1057,10 +1057,6 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 
 	INIT_WORK(&lo->rootcg_work, loop_rootcg_workfn);
 	INIT_LIST_HEAD(&lo->rootcg_cmd_list);
-	INIT_LIST_HEAD(&lo->idle_worker_list);
-	lo->worker_tree = RB_ROOT;
-	timer_setup(&lo->timer, loop_free_idle_workers_timer,
-		TIMER_DEFERRABLE);
 	lo->use_dio = lo->lo_flags & LO_FLAGS_DIRECT_IO;
 	lo->lo_device = bdev;
 	lo->lo_backing_file = file;
@@ -1973,6 +1969,9 @@ static int loop_add(int i)
 	lo = kzalloc(sizeof(*lo), GFP_KERNEL);
 	if (!lo)
 		goto out;
+	lo->worker_tree = RB_ROOT;
+	INIT_LIST_HEAD(&lo->idle_worker_list);
+	timer_setup(&lo->timer, loop_free_idle_workers_timer, TIMER_DEFERRABLE);
 	lo->lo_state = Lo_unbound;
 
 	err = mutex_lock_killable(&loop_ctl_mutex);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 08/14] loop: remove the racy bd_inode->i_mapping->nrpages asserts
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2022-03-25  6:39 ` [PATCH 07/14] loop: initialize the worker tracking fields once Christoph Hellwig
@ 2022-03-25  6:39 ` Christoph Hellwig
  2022-03-25  6:39 ` [PATCH 09/14] loop: don't freeze the queue in lo_release Christoph Hellwig
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

Nothing prevents a file system or userspace opener of the block device
from redirtying the page right afte sync_blockdev returned.  Fortunately
data in the page cache during a block device change is mostly harmless
anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Darrick J. Wong <djwong@kernel.org>
---
 drivers/block/loop.c | 20 --------------------
 1 file changed, 20 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d1c1086beedce..25a71fd7b59da 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1276,15 +1276,6 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 	/* I/O need to be drained during transfer transition */
 	blk_mq_freeze_queue(lo->lo_queue);
 
-	if (size_changed && lo->lo_device->bd_inode->i_mapping->nrpages) {
-		/* If any pages were dirtied after invalidate_bdev(), try again */
-		err = -EAGAIN;
-		pr_warn("%s: loop%d (%s) still has dirty pages (nrpages=%lu)\n",
-			__func__, lo->lo_number, lo->lo_file_name,
-			lo->lo_device->bd_inode->i_mapping->nrpages);
-		goto out_unfreeze;
-	}
-
 	prev_lo_flags = lo->lo_flags;
 
 	err = loop_set_status_from_info(lo, info);
@@ -1495,21 +1486,10 @@ static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
 	invalidate_bdev(lo->lo_device);
 
 	blk_mq_freeze_queue(lo->lo_queue);
-
-	/* invalidate_bdev should have truncated all the pages */
-	if (lo->lo_device->bd_inode->i_mapping->nrpages) {
-		err = -EAGAIN;
-		pr_warn("%s: loop%d (%s) still has dirty pages (nrpages=%lu)\n",
-			__func__, lo->lo_number, lo->lo_file_name,
-			lo->lo_device->bd_inode->i_mapping->nrpages);
-		goto out_unfreeze;
-	}
-
 	blk_queue_logical_block_size(lo->lo_queue, arg);
 	blk_queue_physical_block_size(lo->lo_queue, arg);
 	blk_queue_io_min(lo->lo_queue, arg);
 	loop_update_dio(lo);
-out_unfreeze:
 	blk_mq_unfreeze_queue(lo->lo_queue);
 
 	return err;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 09/14] loop: don't freeze the queue in lo_release
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2022-03-25  6:39 ` [PATCH 08/14] loop: remove the racy bd_inode->i_mapping->nrpages asserts Christoph Hellwig
@ 2022-03-25  6:39 ` Christoph Hellwig
  2022-03-25  6:39 ` [PATCH 10/14] loop: only freeze the queue in __loop_clr_fd when needed Christoph Hellwig
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

By the time the final ->release is called there can't be outstanding I/O.
For non-final ->release there is no need for driver action at all.

Thus remove the useless queue freeze.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Darrick J. Wong <djwong@kernel.org>
---
 drivers/block/loop.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 25a71fd7b59da..c57acbcf9be6a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1753,13 +1753,6 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 		 */
 		__loop_clr_fd(lo, true);
 		return;
-	} else if (lo->lo_state == Lo_bound) {
-		/*
-		 * Otherwise keep thread (if running) and config,
-		 * but flush possible ongoing bios in thread.
-		 */
-		blk_mq_freeze_queue(lo->lo_queue);
-		blk_mq_unfreeze_queue(lo->lo_queue);
 	}
 
 out_unlock:
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 10/14] loop: only freeze the queue in __loop_clr_fd when needed
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2022-03-25  6:39 ` [PATCH 09/14] loop: don't freeze the queue in lo_release Christoph Hellwig
@ 2022-03-25  6:39 ` Christoph Hellwig
  2022-03-25  6:39 ` [PATCH 11/14] loop: implement ->free_disk Christoph Hellwig
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

->release is only called after all outstanding I/O has completed, so only
freeze the queue when clearing the backing file of a live loop device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Darrick J. Wong <djwong@kernel.org>
---
 drivers/block/loop.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c57acbcf9be6a..a5dd259958ee2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1144,8 +1144,13 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
 		blk_queue_write_cache(lo->lo_queue, false, false);
 
-	/* freeze request queue during the transition */
-	blk_mq_freeze_queue(lo->lo_queue);
+	/*
+	 * Freeze the request queue when unbinding on a live file descriptor and
+	 * thus an open device.  When called from ->release we are guaranteed
+	 * that there is no I/O in progress already.
+	 */
+	if (!release)
+		blk_mq_freeze_queue(lo->lo_queue);
 
 	destroy_workqueue(lo->workqueue);
 	loop_free_idle_workers(lo, true);
@@ -1170,7 +1175,8 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	mapping_set_gfp_mask(filp->f_mapping, gfp);
 	/* This is safe: open() is still holding a reference. */
 	module_put(THIS_MODULE);
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	if (!release)
+		blk_mq_unfreeze_queue(lo->lo_queue);
 
 	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 11/14] loop: implement ->free_disk
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
                   ` (9 preceding siblings ...)
  2022-03-25  6:39 ` [PATCH 10/14] loop: only freeze the queue in __loop_clr_fd when needed Christoph Hellwig
@ 2022-03-25  6:39 ` Christoph Hellwig
  2022-03-25 10:42   ` Tetsuo Handa
  2022-03-25  6:39 ` [PATCH 12/14] loop: suppress uevents while reconfiguring the device Christoph Hellwig
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

Ensure that the lo_device which is stored in the gendisk private
data is valid until the gendisk is freed.  Currently the loop driver
uses a lot of effort to make sure a device is not freed when it is
still in use, but to to fix a potential deadlock this will be relaxed
a bit soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 drivers/block/loop.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index a5dd259958ee2..b3170e8cdbe95 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1765,6 +1765,14 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 	mutex_unlock(&lo->lo_mutex);
 }
 
+static void lo_free_disk(struct gendisk *disk)
+{
+	struct loop_device *lo = disk->private_data;
+
+	mutex_destroy(&lo->lo_mutex);
+	kfree(lo);
+}
+
 static const struct block_device_operations lo_fops = {
 	.owner =	THIS_MODULE,
 	.open =		lo_open,
@@ -1773,6 +1781,7 @@ static const struct block_device_operations lo_fops = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl =	lo_compat_ioctl,
 #endif
+	.free_disk =	lo_free_disk,
 };
 
 /*
@@ -2064,15 +2073,14 @@ static void loop_remove(struct loop_device *lo)
 {
 	/* Make this loop device unreachable from pathname. */
 	del_gendisk(lo->lo_disk);
-	blk_cleanup_disk(lo->lo_disk);
+	blk_cleanup_queue(lo->lo_disk->queue);
 	blk_mq_free_tag_set(&lo->tag_set);
 
 	mutex_lock(&loop_ctl_mutex);
 	idr_remove(&loop_index_idr, lo->lo_number);
 	mutex_unlock(&loop_ctl_mutex);
-	/* There is no route which can find this loop device. */
-	mutex_destroy(&lo->lo_mutex);
-	kfree(lo);
+
+	put_disk(lo->lo_disk);
 }
 
 static void loop_probe(dev_t dev)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 12/14] loop: suppress uevents while reconfiguring the device
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
                   ` (10 preceding siblings ...)
  2022-03-25  6:39 ` [PATCH 11/14] loop: implement ->free_disk Christoph Hellwig
@ 2022-03-25  6:39 ` Christoph Hellwig
  2022-03-25  9:49   ` Jan Kara
  2022-03-25  6:39 ` [PATCH 13/14] loop: remove lo_refcount and avoid lo_mutex in ->open / ->release Christoph Hellwig
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

Currently, udev change event is generated for a loop device before the
device is ready for IO. Due to serialization on lo->lo_mutex in
lo_open() this does not matter because anybody is able to open the
device and do IO only after the configuration is finished. However this
synchronization in lo_open() is going away so make sure userspace
reacting to the change event will see the new device state by generating
the event only when the device is setup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/block/loop.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b3170e8cdbe95..bfd21af7aa38b 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -572,6 +572,10 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
 
 	if (!file)
 		return -EBADF;
+
+	/* suppress uevents while reconfiguring the device */
+	dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 1);
+
 	is_loop = is_loop_device(file);
 	error = loop_global_lock_killable(lo, is_loop);
 	if (error)
@@ -626,13 +630,18 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
 	fput(old_file);
 	if (partscan)
 		loop_reread_partitions(lo);
-	return 0;
+
+	error = 0;
+done:
+	/* enable and uncork uevent now that we are done */
+	dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);
+	return error;
 
 out_err:
 	loop_global_unlock(lo, is_loop);
 out_putf:
 	fput(file);
-	return error;
+	goto done;
 }
 
 /* loop sysfs attributes */
@@ -999,6 +1008,9 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	/* This is safe, since we have a reference from open(). */
 	__module_get(THIS_MODULE);
 
+	/* suppress uevents while reconfiguring the device */
+	dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 1);
+
 	/*
 	 * If we don't hold exclusive handle for the device, upgrade to it
 	 * here to avoid changing device under exclusive owner.
@@ -1101,7 +1113,12 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 		loop_reread_partitions(lo);
 	if (!(mode & FMODE_EXCL))
 		bd_abort_claiming(bdev, loop_configure);
-	return 0;
+
+	error = 0;
+done:
+	/* enable and uncork uevent now that we are done */
+	dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);
+	return error;
 
 out_unlock:
 	loop_global_unlock(lo, is_loop);
@@ -1112,7 +1129,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	fput(file);
 	/* This is safe: open() is still holding a reference. */
 	module_put(THIS_MODULE);
-	return error;
+	goto done;
 }
 
 static void __loop_clr_fd(struct loop_device *lo, bool release)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 13/14] loop: remove lo_refcount and avoid lo_mutex in ->open / ->release
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
                   ` (11 preceding siblings ...)
  2022-03-25  6:39 ` [PATCH 12/14] loop: suppress uevents while reconfiguring the device Christoph Hellwig
@ 2022-03-25  6:39 ` Christoph Hellwig
  2022-03-26  2:52   ` Tetsuo Handa
  2022-03-25  6:39 ` [PATCH 14/14] loop: don't destroy lo->workqueue in __loop_clr_fd Christoph Hellwig
  2022-03-29 15:36 ` [PATCH 15/14] loop: avoid loop_validate_mutex/lo_mutex in ->release Tetsuo Handa
  14 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

lo_refcount counts how many openers a loop device has, but that count
is already provided by the block layer in the bd_openers field of the
whole-disk block_device.  Remove lo_refcount and allow opens to
succeed even on devices beeing deleted - now that ->free_disk is
implemented we can handle that race gracefull and all I/O on it will
just fail. Similarly there is a small race window now where
loop_control_remove does not synchronize the delete vs the remove
due do bd_openers not being under lo_mutex protection, but we can
handle that just as gracefully.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 drivers/block/loop.c | 37 +++++++------------------------------
 drivers/block/loop.h |  1 -
 2 files changed, 7 insertions(+), 31 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index bfd21af7aa38b..8ad8cfffdcbdc 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1261,7 +1261,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	 * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
 	 * command to fail with EBUSY.
 	 */
-	if (atomic_read(&lo->lo_refcnt) > 1) {
+	if (disk_openers(lo->lo_disk) > 1) {
 		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
 		mutex_unlock(&lo->lo_mutex);
 		return 0;
@@ -1741,33 +1741,15 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 }
 #endif
 
-static int lo_open(struct block_device *bdev, fmode_t mode)
-{
-	struct loop_device *lo = bdev->bd_disk->private_data;
-	int err;
-
-	err = mutex_lock_killable(&lo->lo_mutex);
-	if (err)
-		return err;
-	if (lo->lo_state == Lo_deleting)
-		err = -ENXIO;
-	else
-		atomic_inc(&lo->lo_refcnt);
-	mutex_unlock(&lo->lo_mutex);
-	return err;
-}
-
 static void lo_release(struct gendisk *disk, fmode_t mode)
 {
 	struct loop_device *lo = disk->private_data;
 
-	mutex_lock(&lo->lo_mutex);
-	if (atomic_dec_return(&lo->lo_refcnt))
-		goto out_unlock;
+	if (disk_openers(disk) > 0)
+		return;
 
-	if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
-		if (lo->lo_state != Lo_bound)
-			goto out_unlock;
+	mutex_lock(&lo->lo_mutex);
+	if (lo->lo_state == Lo_bound && (lo->lo_flags & LO_FLAGS_AUTOCLEAR)) {
 		lo->lo_state = Lo_rundown;
 		mutex_unlock(&lo->lo_mutex);
 		/*
@@ -1777,8 +1759,6 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 		__loop_clr_fd(lo, true);
 		return;
 	}
-
-out_unlock:
 	mutex_unlock(&lo->lo_mutex);
 }
 
@@ -1792,7 +1772,6 @@ static void lo_free_disk(struct gendisk *disk)
 
 static const struct block_device_operations lo_fops = {
 	.owner =	THIS_MODULE,
-	.open =		lo_open,
 	.release =	lo_release,
 	.ioctl =	lo_ioctl,
 #ifdef CONFIG_COMPAT
@@ -2046,7 +2025,6 @@ static int loop_add(int i)
 	 */
 	if (!part_shift)
 		disk->flags |= GENHD_FL_NO_PART;
-	atomic_set(&lo->lo_refcnt, 0);
 	mutex_init(&lo->lo_mutex);
 	lo->lo_number		= i;
 	spin_lock_init(&lo->lo_lock);
@@ -2136,13 +2114,12 @@ static int loop_control_remove(int idx)
 	ret = mutex_lock_killable(&lo->lo_mutex);
 	if (ret)
 		goto mark_visible;
-	if (lo->lo_state != Lo_unbound ||
-	    atomic_read(&lo->lo_refcnt) > 0) {
+	if (lo->lo_state != Lo_unbound || disk_openers(lo->lo_disk) > 0) {
 		mutex_unlock(&lo->lo_mutex);
 		ret = -EBUSY;
 		goto mark_visible;
 	}
-	/* Mark this loop device no longer open()-able. */
+	/* Mark this loop device as no more bound, but not quite unbound yet */
 	lo->lo_state = Lo_deleting;
 	mutex_unlock(&lo->lo_mutex);
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 082d4b6bfc6a6..449d562738c52 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -28,7 +28,6 @@ struct loop_func_table;
 
 struct loop_device {
 	int		lo_number;
-	atomic_t	lo_refcnt;
 	loff_t		lo_offset;
 	loff_t		lo_sizelimit;
 	int		lo_flags;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 14/14] loop: don't destroy lo->workqueue in __loop_clr_fd
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
                   ` (12 preceding siblings ...)
  2022-03-25  6:39 ` [PATCH 13/14] loop: remove lo_refcount and avoid lo_mutex in ->open / ->release Christoph Hellwig
@ 2022-03-25  6:39 ` Christoph Hellwig
  2022-03-29 15:36 ` [PATCH 15/14] loop: avoid loop_validate_mutex/lo_mutex in ->release Tetsuo Handa
  14 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:39 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Tetsuo Handa, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd, syzbot+6479585dfd4dedd3f7e1

There is no need to destroy the workqueue when clearing unbinding
a loop device from a backing file.  Not doing so on the other hand
avoid creating a complex lock dependency chain involving the global
system_transition_mutex.

Based on a patch from Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>.

Reported-by: syzbot+6479585dfd4dedd3f7e1@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+6479585dfd4dedd3f7e1@syzkaller.appspotmail.com
---
 drivers/block/loop.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 8ad8cfffdcbdc..2043d3efbc491 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -817,7 +817,6 @@ struct loop_worker {
 };
 
 static void loop_workfn(struct work_struct *work);
-static void loop_rootcg_workfn(struct work_struct *work);
 
 #ifdef CONFIG_BLK_CGROUP
 static inline int queue_on_root_worker(struct cgroup_subsys_state *css)
@@ -1055,20 +1054,19 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	    !file->f_op->write_iter)
 		lo->lo_flags |= LO_FLAGS_READ_ONLY;
 
-	lo->workqueue = alloc_workqueue("loop%d",
-					WQ_UNBOUND | WQ_FREEZABLE,
-					0,
-					lo->lo_number);
 	if (!lo->workqueue) {
-		error = -ENOMEM;
-		goto out_unlock;
+		lo->workqueue = alloc_workqueue("loop%d",
+						WQ_UNBOUND | WQ_FREEZABLE,
+						0, lo->lo_number);
+		if (!lo->workqueue) {
+			error = -ENOMEM;
+			goto out_unlock;
+		}
 	}
 
 	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
 	set_disk_ro(lo->lo_disk, (lo->lo_flags & LO_FLAGS_READ_ONLY) != 0);
 
-	INIT_WORK(&lo->rootcg_work, loop_rootcg_workfn);
-	INIT_LIST_HEAD(&lo->rootcg_cmd_list);
 	lo->use_dio = lo->lo_flags & LO_FLAGS_DIRECT_IO;
 	lo->lo_device = bdev;
 	lo->lo_backing_file = file;
@@ -1169,10 +1167,6 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	if (!release)
 		blk_mq_freeze_queue(lo->lo_queue);
 
-	destroy_workqueue(lo->workqueue);
-	loop_free_idle_workers(lo, true);
-	del_timer_sync(&lo->timer);
-
 	spin_lock_irq(&lo->lo_lock);
 	filp = lo->lo_backing_file;
 	lo->lo_backing_file = NULL;
@@ -1766,6 +1760,10 @@ static void lo_free_disk(struct gendisk *disk)
 {
 	struct loop_device *lo = disk->private_data;
 
+	if (lo->workqueue)
+		destroy_workqueue(lo->workqueue);
+	loop_free_idle_workers(lo, true);
+	del_timer_sync(&lo->timer);
 	mutex_destroy(&lo->lo_mutex);
 	kfree(lo);
 }
@@ -2029,6 +2027,8 @@ static int loop_add(int i)
 	lo->lo_number		= i;
 	spin_lock_init(&lo->lo_lock);
 	spin_lock_init(&lo->lo_work_lock);
+	INIT_WORK(&lo->rootcg_work, loop_rootcg_workfn);
+	INIT_LIST_HEAD(&lo->rootcg_cmd_list);
 	disk->major		= LOOP_MAJOR;
 	disk->first_minor	= i << part_shift;
 	disk->minors		= 1 << part_shift;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 01/14] nbd: use the correct block_device in nbd_bdev_reset
  2022-03-25  6:39 ` [PATCH 01/14] nbd: use the correct block_device in nbd_bdev_reset Christoph Hellwig
@ 2022-03-25  9:48   ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2022-03-25  9:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta, Tetsuo Handa,
	Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce, linux-block,
	nbd

On Fri 25-03-22 07:39:16, Christoph Hellwig wrote:
> The bdev parameter to ->ioctl contains the block device that the ioctl
> is called on, which can be the partition.  But the openers check in
> nbd_bdev_reset really needs to check use the whole device, so switch to
> using that.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  drivers/block/nbd.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index 5a1f98494dddf..2f29da41fbc80 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -1217,11 +1217,11 @@ static int nbd_reconnect_socket(struct nbd_device *nbd, unsigned long arg)
>  	return -ENOSPC;
>  }
>  
> -static void nbd_bdev_reset(struct block_device *bdev)
> +static void nbd_bdev_reset(struct nbd_device *nbd)
>  {
> -	if (bdev->bd_openers > 1)
> +	if (nbd->disk->part0->bd_openers > 1)
>  		return;
> -	set_capacity(bdev->bd_disk, 0);
> +	set_capacity(nbd->disk, 0);
>  }
>  
>  static void nbd_parse_flags(struct nbd_device *nbd)
> @@ -1389,7 +1389,7 @@ static int nbd_start_device(struct nbd_device *nbd)
>  	return nbd_set_size(nbd, config->bytesize, nbd_blksize(config));
>  }
>  
> -static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *bdev)
> +static int nbd_start_device_ioctl(struct nbd_device *nbd)
>  {
>  	struct nbd_config *config = nbd->config;
>  	int ret;
> @@ -1408,7 +1408,7 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
>  	flush_workqueue(nbd->recv_workq);
>  
>  	mutex_lock(&nbd->config_lock);
> -	nbd_bdev_reset(bdev);
> +	nbd_bdev_reset(nbd);
>  	/* user requested, ignore socket errors */
>  	if (test_bit(NBD_RT_DISCONNECT_REQUESTED, &config->runtime_flags))
>  		ret = 0;
> @@ -1422,7 +1422,7 @@ static void nbd_clear_sock_ioctl(struct nbd_device *nbd,
>  {
>  	sock_shutdown(nbd);
>  	__invalidate_device(bdev, true);
> -	nbd_bdev_reset(bdev);
> +	nbd_bdev_reset(nbd);
>  	if (test_and_clear_bit(NBD_RT_HAS_CONFIG_REF,
>  			       &nbd->config->runtime_flags))
>  		nbd_config_put(nbd);
> @@ -1468,7 +1468,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
>  		config->flags = arg;
>  		return 0;
>  	case NBD_DO_IT:
> -		return nbd_start_device_ioctl(nbd, bdev);
> +		return nbd_start_device_ioctl(nbd);
>  	case NBD_CLEAR_QUE:
>  		/*
>  		 * This is for compatibility only.  The queue is always cleared
> -- 
> 2.30.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 12/14] loop: suppress uevents while reconfiguring the device
  2022-03-25  6:39 ` [PATCH 12/14] loop: suppress uevents while reconfiguring the device Christoph Hellwig
@ 2022-03-25  9:49   ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2022-03-25  9:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta, Tetsuo Handa,
	Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce, linux-block,
	nbd

On Fri 25-03-22 07:39:27, Christoph Hellwig wrote:
> Currently, udev change event is generated for a loop device before the
> device is ready for IO. Due to serialization on lo->lo_mutex in
> lo_open() this does not matter because anybody is able to open the
> device and do IO only after the configuration is finished. However this
> synchronization in lo_open() is going away so make sure userspace
> reacting to the change event will see the new device state by generating
> the event only when the device is setup.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  drivers/block/loop.c | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index b3170e8cdbe95..bfd21af7aa38b 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -572,6 +572,10 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
>  
>  	if (!file)
>  		return -EBADF;
> +
> +	/* suppress uevents while reconfiguring the device */
> +	dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 1);
> +
>  	is_loop = is_loop_device(file);
>  	error = loop_global_lock_killable(lo, is_loop);
>  	if (error)
> @@ -626,13 +630,18 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
>  	fput(old_file);
>  	if (partscan)
>  		loop_reread_partitions(lo);
> -	return 0;
> +
> +	error = 0;
> +done:
> +	/* enable and uncork uevent now that we are done */
> +	dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);
> +	return error;
>  
>  out_err:
>  	loop_global_unlock(lo, is_loop);
>  out_putf:
>  	fput(file);
> -	return error;
> +	goto done;
>  }
>  
>  /* loop sysfs attributes */
> @@ -999,6 +1008,9 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
>  	/* This is safe, since we have a reference from open(). */
>  	__module_get(THIS_MODULE);
>  
> +	/* suppress uevents while reconfiguring the device */
> +	dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 1);
> +
>  	/*
>  	 * If we don't hold exclusive handle for the device, upgrade to it
>  	 * here to avoid changing device under exclusive owner.
> @@ -1101,7 +1113,12 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
>  		loop_reread_partitions(lo);
>  	if (!(mode & FMODE_EXCL))
>  		bd_abort_claiming(bdev, loop_configure);
> -	return 0;
> +
> +	error = 0;
> +done:
> +	/* enable and uncork uevent now that we are done */
> +	dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);
> +	return error;
>  
>  out_unlock:
>  	loop_global_unlock(lo, is_loop);
> @@ -1112,7 +1129,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
>  	fput(file);
>  	/* This is safe: open() is still holding a reference. */
>  	module_put(THIS_MODULE);
> -	return error;
> +	goto done;
>  }
>  
>  static void __loop_clr_fd(struct loop_device *lo, bool release)
> -- 
> 2.30.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 11/14] loop: implement ->free_disk
  2022-03-25  6:39 ` [PATCH 11/14] loop: implement ->free_disk Christoph Hellwig
@ 2022-03-25 10:42   ` Tetsuo Handa
  2022-03-25 15:10     ` Tetsuo Handa
  0 siblings, 1 reply; 26+ messages in thread
From: Tetsuo Handa @ 2022-03-25 10:42 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce, linux-block, nbd

On 2022/03/25 15:39, Christoph Hellwig wrote:
> Ensure that the lo_device which is stored in the gendisk private
> data is valid until the gendisk is freed.  Currently the loop driver
> uses a lot of effort to make sure a device is not freed when it is
> still in use, but to to fix a potential deadlock this will be relaxed
> a bit soon.

This patch breaks blk_cleanup_disk() into blk_cleanup_queue() and put_disk() on
loop_remove() side only. But there is blk_cleanup_disk() in the error path of
loop_add() side. Don't we need to rewrite the error path of loop_add() side, for
put_disk() from blk_cleanup_disk() from loop_add() calls kfree() via lo_free_disk()
but out_cleanup_disk: label falls through to blk_mq_free_tag_set() (which seems to
be UAF read) and kfree() (which seems to be double kfree()) ?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 11/14] loop: implement ->free_disk
  2022-03-25 10:42   ` Tetsuo Handa
@ 2022-03-25 15:10     ` Tetsuo Handa
  0 siblings, 0 replies; 26+ messages in thread
From: Tetsuo Handa @ 2022-03-25 15:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce, linux-block, nbd

On 2022/03/25 19:42, Tetsuo Handa wrote:
> On 2022/03/25 15:39, Christoph Hellwig wrote:
>> Ensure that the lo_device which is stored in the gendisk private
>> data is valid until the gendisk is freed.  Currently the loop driver
>> uses a lot of effort to make sure a device is not freed when it is
>> still in use, but to to fix a potential deadlock this will be relaxed
>> a bit soon.
> 
> This patch breaks blk_cleanup_disk() into blk_cleanup_queue() and put_disk() on
> loop_remove() side only. But there is blk_cleanup_disk() in the error path of
> loop_add() side. Don't we need to rewrite the error path of loop_add() side, for
> put_disk() from blk_cleanup_disk() from loop_add() calls kfree() via lo_free_disk()
> but out_cleanup_disk: label falls through to blk_mq_free_tag_set() (which seems to
> be UAF read) and kfree() (which seems to be double kfree()) ?
> 

Ah, since set_bit(GD_ADDED, &disk->state) is not called unless
device_add_disk() from add_disk() succeeds, disk->fops->free_disk
will not be called unless add_disk() succeeds. Thus, it is OK for
the error path of loop_add(). Tricky call...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 13/14] loop: remove lo_refcount and avoid lo_mutex in ->open / ->release
  2022-03-25  6:39 ` [PATCH 13/14] loop: remove lo_refcount and avoid lo_mutex in ->open / ->release Christoph Hellwig
@ 2022-03-26  2:52   ` Tetsuo Handa
  2022-03-29  6:52     ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Tetsuo Handa @ 2022-03-26  2:52 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta
  Cc: Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce, linux-block, nbd

Since __loop_clr_fd() is currently holding loop_validate_mutex and lo->lo_mutex,
"avoid lo_mutex in ->release" part is incomplete.

The intent of holding loop_validate_mutex (which causes disk->open_mutex =>
loop_validate_mutex => lo->lo_mutex => blk_mq_freeze_queue()/blk_mq_unfreeze_queue()
chain) is to make loop_validate_file() traversal safe.

Since ->release is called with disk->open_mutex held, and __loop_clr_fd() from
lo_release() is called via ->release when disk_openers() == 0, we are guaranteed
that "struct file" which will be passed to loop_validate_file() via fget() cannot
be the loop device __loop_clr_fd(lo, true) will clear. Thus, there is no need to
hold loop_validate_mutex and lo->lo_mutex from __loop_clr_fd() if release == true.

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 2506193a4fd1..d47f3d86dd55 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1136,25 +1136,26 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	gfp_t gfp = lo->old_gfp_mask;

 	/*
-	 * Flush loop_configure() and loop_change_fd(). It is acceptable for
-	 * loop_validate_file() to succeed, for actual clear operation has not
-	 * started yet.
-	 */
-	mutex_lock(&loop_validate_mutex);
-	mutex_unlock(&loop_validate_mutex);
-	/*
-	 * loop_validate_file() now fails because l->lo_state != Lo_bound
-	 * became visible.
+	 * Make sure that Lo_rundown state becomes visible to loop_configure()
+	 * and loop_change_fd(). When called from ->release, we are guaranteed
+	 * that the "struct file" which loop_configure()/loop_change_fd() found
+	 * via fget() is not this loop device.
 	 */
+	if (!release) {
+		mutex_lock(&loop_validate_mutex);
+		mutex_unlock(&loop_validate_mutex);
+	}

 	/*
-	 * Since this function is called upon "ioctl(LOOP_CLR_FD)" xor "close()
-	 * after ioctl(LOOP_CLR_FD)", it is a sign of something going wrong if
-	 * lo->lo_state has changed while waiting for lo->lo_mutex.
+	 * It is a sign of something going wrong if lo->lo_state has changed
+	 * while waiting for lo->lo_mutex. When called from ->release, we are
+	 * guaranteed that the nobody is using this loop device.
 	 */
-	mutex_lock(&lo->lo_mutex);
-	BUG_ON(lo->lo_state != Lo_rundown);
-	mutex_unlock(&lo->lo_mutex);
+	if (!release) {
+		mutex_lock(&lo->lo_mutex);
+		BUG_ON(lo->lo_state != Lo_rundown);
+		mutex_unlock(&lo->lo_mutex);
+	}

 	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
 		blk_queue_write_cache(lo->lo_queue, false, false);

But thinking again about release == false case, which I wrote "It is acceptable
for loop_validate_file() to succeed, for actual clear operation has not started
yet.", I came to feel why it is acceptable to succeed.

Even if loop_validate_file() was safely traversed due to serialization via
loop_validate_mutex, I/O requests after loop_configure()/loop_change_fd() completed
will fail. Is this behavior what we want?

If we don't want I/O requests after loop_configure()/loop_change_fd() completed
fail due to __loop_clr_fd(), it is not acceptable for loop_validate_file() to
succeed. We should hold loop_validate_mutex before setting Lo_rundown in order to
make sure that loop_validate_file() will see Lo_rundown state. That is, something
like below will be expected?

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 2506193a4fd1..a4ff94ca654f 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1135,27 +1135,6 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;

-	/*
-	 * Flush loop_configure() and loop_change_fd(). It is acceptable for
-	 * loop_validate_file() to succeed, for actual clear operation has not
-	 * started yet.
-	 */
-	mutex_lock(&loop_validate_mutex);
-	mutex_unlock(&loop_validate_mutex);
-	/*
-	 * loop_validate_file() now fails because l->lo_state != Lo_bound
-	 * became visible.
-	 */
-
-	/*
-	 * Since this function is called upon "ioctl(LOOP_CLR_FD)" xor "close()
-	 * after ioctl(LOOP_CLR_FD)", it is a sign of something going wrong if
-	 * lo->lo_state has changed while waiting for lo->lo_mutex.
-	 */
-	mutex_lock(&lo->lo_mutex);
-	BUG_ON(lo->lo_state != Lo_rundown);
-	mutex_unlock(&lo->lo_mutex);
-
 	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
 		blk_queue_write_cache(lo->lo_queue, false, false);

@@ -1238,11 +1217,18 @@ static int loop_clr_fd(struct loop_device *lo)
 {
 	int err;

-	err = mutex_lock_killable(&lo->lo_mutex);
+	/*
+	 * Use global lock when setting Lo_rundown state in order to make sure
+	 * that loop_validate_file() will fail if the "struct file" which
+	 * loop_configure()/loop_change_fd() found via fget() was this loop
+	 * device. The disk_openers(lo->lo_disk) > 1 test below guarantees that
+	 * fget() did not return this loop device.
+	 */
+	err = loop_global_lock_killable(lo, true);
 	if (err)
 		return err;
 	if (lo->lo_state != Lo_bound) {
-		mutex_unlock(&lo->lo_mutex);
+		loop_global_unlock(lo, true);
 		return -ENXIO;
 	}
 	/*
@@ -1257,11 +1243,11 @@ static int loop_clr_fd(struct loop_device *lo)
 	 */
 	if (disk_openers(lo->lo_disk) > 1) {
 		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-		mutex_unlock(&lo->lo_mutex);
+		loop_global_unlock(lo, true);
 		return 0;
 	}
 	lo->lo_state = Lo_rundown;
-	mutex_unlock(&lo->lo_mutex);
+	loop_global_unlock(lo, true);

 	__loop_clr_fd(lo, false);
 	return 0;

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 13/14] loop: remove lo_refcount and avoid lo_mutex in ->open / ->release
  2022-03-26  2:52   ` Tetsuo Handa
@ 2022-03-29  6:52     ` Christoph Hellwig
  2022-03-29 13:25       ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-29  6:52 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Christoph Hellwig, Jens Axboe, Josef Bacik, Minchan Kim,
	Nitin Gupta, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

Hi Tetsuo,

I'm a bit confused. Partially due to the two patches in one mail, but
also because I can't actually find a try that they cleanly apply to.

But let me add my thoughts here:

 - I think the change 

On Sat, Mar 26, 2022 at 11:52:36AM +0900, Tetsuo Handa wrote:
> Since __loop_clr_fd() is currently holding loop_validate_mutex and lo->lo_mutex,
> "avoid lo_mutex in ->release" part is incomplete.
> 
> The intent of holding loop_validate_mutex (which causes disk->open_mutex =>
> loop_validate_mutex => lo->lo_mutex => blk_mq_freeze_queue()/blk_mq_unfreeze_queue()
> chain) is to make loop_validate_file() traversal safe.
> 
> Since ->release is called with disk->open_mutex held, and __loop_clr_fd() from
> lo_release() is called via ->release when disk_openers() == 0, we are guaranteed
> that "struct file" which will be passed to loop_validate_file() via fget() cannot
> be the loop device __loop_clr_fd(lo, true) will clear. Thus, there is no need to
> hold loop_validate_mutex and lo->lo_mutex from __loop_clr_fd() if release == true.

This part looks reasonable.  Can you give a signoff and proper commit
log and I'll slot it in before the lo_refcount removal.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 13/14] loop: remove lo_refcount and avoid lo_mutex in ->open / ->release
  2022-03-29  6:52     ` Christoph Hellwig
@ 2022-03-29 13:25       ` Christoph Hellwig
  2022-03-29 14:02         ` Tetsuo Handa
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-29 13:25 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Christoph Hellwig, Jens Axboe, Josef Bacik, Minchan Kim,
	Nitin Gupta, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

Thinking a bit more, I really don't think the existing any refcount
check protects us against a different tread modififying the backing
file.  When a process has a file descriptor to a loop device open
and is multithreaded (or forks) we can still have multiple threads
manipulating the loop state.

That being said I do not think we really need that refcount check
at all - once loop_clr_fd set lo->lo_state to Lo_rundown under the
global lock we know that loop_validate_file will error out on it
due to the lo_state != Lo_bound check.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 13/14] loop: remove lo_refcount and avoid lo_mutex in ->open / ->release
  2022-03-29 13:25       ` Christoph Hellwig
@ 2022-03-29 14:02         ` Tetsuo Handa
  2022-03-29 14:49           ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Tetsuo Handa @ 2022-03-29 14:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta, Jan Kara,
	Darrick J . Wong, Ming Lei, Matteo Croce, linux-block, nbd

On 2022/03/29 22:25, Christoph Hellwig wrote:
> Thinking a bit more, I really don't think the existing any refcount
> check protects us against a different tread modifying the backing
> file.  When a process has a file descriptor to a loop device open
> and is multithreaded (or forks) we can still have multiple threads
> manipulating the loop state.

Yes, I came to that answer. But

> 
> That being said I do not think we really need that refcount check
> at all - once loop_clr_fd set lo->lo_state to Lo_rundown under the
> global lock we know that loop_validate_file will error out on it
> due to the lo_state != Lo_bound check.

if I think about non "a file descriptor to a loop device" case, I
came to the opposite answer. Rather, shouldn't we check the refcount
strictly like below?

[PATCH] loop: avoid loop_validate_mutex/lo_mutex in ->release

Since ->release is called with disk->open_mutex held, and __loop_clr_fd()
 from lo_release() is called via ->release when disk_openers() == 0, we are
guaranteed that "struct file" which will be passed to loop_validate_file()
via fget() cannot be the loop device __loop_clr_fd(lo, true) will clear.
Thus, there is no need to hold loop_validate_mutex from __loop_clr_fd()
if release == true.

When I made commit 3ce6e1f662a91097 ("loop: reintroduce global lock for
safe loop_validate_file() traversal"), I wrote "It is acceptable for
loop_validate_file() to succeed, for actual clear operation has not started
yet.". But now I came to feel why it is acceptable to succeed.

It seems that the loop driver was added in Linux 1.3.68, and

  if (lo->lo_refcnt > 1)
    return -EBUSY;

check in loop_clr_fd() was there from the beginning. The intent of this
check was unclear. But now I think that current

  disk_openers(lo->lo_disk) > 1

form is there for three reasons.

(1) Avoid I/O errors when some process which opens and reads from this
    loop device in response to uevent notification (e.g. systemd-udevd),
    as described in commit a1ecac3b0656a682 ("loop: Make explicit loop
    device destruction lazy"). This opener is short-lived because it is
    likely that the file descriptor used by that process is closed soon.

(2) Avoid I/O errors caused by underlying layer of stacked loop devices
    (i.e. ioctl(some_loop_fd, LOOP_SET_FD, other_loop_fd)) being suddenly
    disappeared. This opener is long-lived because this reference is
    associated with not a file descriptor but lo->lo_backing_file.

(3) Avoid I/O errors caused by underlying layer of mounted loop device
    (i.e. mount(some_loop_device, some_mount_point)) being suddenly
    disappeared. This opener is long-lived because this reference is
    associated with not a file descriptor but mount.

While race in (1) might be acceptable, (2) and (3) should be checked
racelessly. That is, make sure that __loop_clr_fd() will not run if
loop_validate_file() succeeds, by doing refcount check with global lock
held when explicit loop device destruction is requested.

As a result of no longer waiting for lo->lo_mutex after setting Lo_rundown,
we can remove pointless BUG_ON(lo->lo_state != Lo_rundown) check.

Not-yet-signed-off. ;-)
---
 drivers/block/loop.c | 38 +++++++++++++-------------------------
 1 file changed, 13 insertions(+), 25 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 2506193a4fd1..6b813c592159 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1135,27 +1135,6 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
 
-	/*
-	 * Flush loop_configure() and loop_change_fd(). It is acceptable for
-	 * loop_validate_file() to succeed, for actual clear operation has not
-	 * started yet.
-	 */
-	mutex_lock(&loop_validate_mutex);
-	mutex_unlock(&loop_validate_mutex);
-	/*
-	 * loop_validate_file() now fails because l->lo_state != Lo_bound
-	 * became visible.
-	 */
-
-	/*
-	 * Since this function is called upon "ioctl(LOOP_CLR_FD)" xor "close()
-	 * after ioctl(LOOP_CLR_FD)", it is a sign of something going wrong if
-	 * lo->lo_state has changed while waiting for lo->lo_mutex.
-	 */
-	mutex_lock(&lo->lo_mutex);
-	BUG_ON(lo->lo_state != Lo_rundown);
-	mutex_unlock(&lo->lo_mutex);
-
 	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
 		blk_queue_write_cache(lo->lo_queue, false, false);
 
@@ -1238,11 +1217,20 @@ static int loop_clr_fd(struct loop_device *lo)
 {
 	int err;
 
-	err = mutex_lock_killable(&lo->lo_mutex);
+	/*
+	 * Since lo_ioctl() is called without locks held, it is possible that
+	 * loop_configure()/loop_change_fd() and loop_clr_fd() run in parallel.
+	 *
+	 * Therefore, use global lock when setting Lo_rundown state in order to
+	 * make sure that loop_validate_file() will fail if the "struct file"
+	 * which loop_configure()/loop_change_fd() found via fget() was this
+	 * loop device.
+	 */
+	err = loop_global_lock_killable(lo, true);
 	if (err)
 		return err;
 	if (lo->lo_state != Lo_bound) {
-		mutex_unlock(&lo->lo_mutex);
+		loop_global_unlock(lo, true);
 		return -ENXIO;
 	}
 	/*
@@ -1257,11 +1245,11 @@ static int loop_clr_fd(struct loop_device *lo)
 	 */
 	if (disk_openers(lo->lo_disk) > 1) {
 		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-		mutex_unlock(&lo->lo_mutex);
+		loop_global_unlock(lo, true);
 		return 0;
 	}
 	lo->lo_state = Lo_rundown;
-	mutex_unlock(&lo->lo_mutex);
+	loop_global_unlock(lo, true);
 
 	__loop_clr_fd(lo, false);
 	return 0;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 13/14] loop: remove lo_refcount and avoid lo_mutex in ->open / ->release
  2022-03-29 14:02         ` Tetsuo Handa
@ 2022-03-29 14:49           ` Christoph Hellwig
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2022-03-29 14:49 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Christoph Hellwig, Jens Axboe, Josef Bacik, Minchan Kim,
	Nitin Gupta, Jan Kara, Darrick J . Wong, Ming Lei, Matteo Croce,
	linux-block, nbd

On Tue, Mar 29, 2022 at 11:02:15PM +0900, Tetsuo Handa wrote:
> It seems that the loop driver was added in Linux 1.3.68, and
> 
>   if (lo->lo_refcnt > 1)
>     return -EBUSY;
> 
> check in loop_clr_fd() was there from the beginning. The intent of this
> check was unclear.

Yes.

> But now I think that current
> 
>   disk_openers(lo->lo_disk) > 1
> 
> form is there for three reasons.
> 
> (1) Avoid I/O errors when some process which opens and reads from this
>     loop device in response to uevent notification (e.g. systemd-udevd),
>     as described in commit a1ecac3b0656a682 ("loop: Make explicit loop
>     device destruction lazy"). This opener is short-lived because it is
>     likely that the file descriptor used by that process is closed soon.

Well.  With the the uevent supression in the current series there won't
be uevents until the capacity has been set to 0.  More importantly
anything that listens to theses kinds of uevents needs to be able to
deal with I/O errors like this.

> (2) Avoid I/O errors caused by underlying layer of stacked loop devices
>     (i.e. ioctl(some_loop_fd, LOOP_SET_FD, other_loop_fd)) being suddenly
>     disappeared. This opener is long-lived because this reference is
>     associated with not a file descriptor but lo->lo_backing_file.

Again, if you clear the FD expecting I/O errors is the logical consequence.
This is like saying we should work around seeing I/O errors when hot
removing a physical device.

> (3) Avoid I/O errors caused by underlying layer of mounted loop device
>     (i.e. mount(some_loop_device, some_mount_point)) being suddenly
>     disappeared. This opener is long-lived because this reference is
>     associated with not a file descriptor but mount.

Same I/O error story.  If you hot remove a nvme SSD you do expect
error in the file system.  This is a pretty clear action -> consequence
relation.

> While race in (1) might be acceptable, (2) and (3) should be checked
> racelessly. That is, make sure that __loop_clr_fd() will not run if
> loop_validate_file() succeeds, by doing refcount check with global lock
> held when explicit loop device destruction is requested.
> 
> As a result of no longer waiting for lo->lo_mutex after setting Lo_rundown,
> we can remove pointless BUG_ON(lo->lo_state != Lo_rundown) check.

I really do like this patch.  And I think based on your description that
we both agree that the disk_openers check is not needed for functional
correctness as a malicious userspace can do concurrent operations even
without the openers check.  You want a protection against "I/O errors"
when the FD is cleared on a live device, and with your patch we get
that with the disk_openers check.  I'm perfectly fine with that state
for this series as it keeps the status quo.  I just think this check
that goes all the way back is actually a really bad idea that just
provides some false security.  But that isn't something we need to
discuss here and now.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 15/14] loop: avoid loop_validate_mutex/lo_mutex in ->release
  2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
                   ` (13 preceding siblings ...)
  2022-03-25  6:39 ` [PATCH 14/14] loop: don't destroy lo->workqueue in __loop_clr_fd Christoph Hellwig
@ 2022-03-29 15:36 ` Tetsuo Handa
  2022-03-30  8:14   ` Jan Kara
  14 siblings, 1 reply; 26+ messages in thread
From: Tetsuo Handa @ 2022-03-29 15:36 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara
  Cc: Darrick J . Wong, Ming Lei, Matteo Croce, linux-block, nbd,
	Jens Axboe, Josef Bacik, Minchan Kim, Nitin Gupta

Since ->release is called with disk->open_mutex held, and __loop_clr_fd()
 from lo_release() is called via ->release when disk_openers() == 0, we are
guaranteed that "struct file" which will be passed to loop_validate_file()
via fget() cannot be the loop device __loop_clr_fd(lo, true) will clear.
Thus, there is no need to hold loop_validate_mutex from __loop_clr_fd()
if release == true.

When I made commit 3ce6e1f662a91097 ("loop: reintroduce global lock for
safe loop_validate_file() traversal"), I wrote "It is acceptable for
loop_validate_file() to succeed, for actual clear operation has not started
yet.". But now I came to feel why it is acceptable to succeed.

It seems that the loop driver was added in Linux 1.3.68, and

  if (lo->lo_refcnt > 1)
    return -EBUSY;

check in loop_clr_fd() was there from the beginning. The intent of this
check was unclear. But now I think that current

  disk_openers(lo->lo_disk) > 1

form is there for three reasons.

(1) Avoid I/O errors when some process which opens and reads from this
    loop device in response to uevent notification (e.g. systemd-udevd),
    as described in commit a1ecac3b0656a682 ("loop: Make explicit loop
    device destruction lazy"). This opener is short-lived because it is
    likely that the file descriptor used by that process is closed soon.

(2) Avoid I/O errors caused by underlying layer of stacked loop devices
    (i.e. ioctl(some_loop_fd, LOOP_SET_FD, other_loop_fd)) being suddenly
    disappeared. This opener is long-lived because this reference is
    associated with not a file descriptor but lo->lo_backing_file.

(3) Avoid I/O errors caused by underlying layer of mounted loop device
    (i.e. mount(some_loop_device, some_mount_point)) being suddenly
    disappeared. This opener is long-lived because this reference is
    associated with not a file descriptor but mount.

While race in (1) might be acceptable, (2) and (3) should be checked
racelessly. That is, make sure that __loop_clr_fd() will not run if
loop_validate_file() succeeds, by doing refcount check with global lock
held when explicit loop device destruction is requested.

As a result of no longer waiting for lo->lo_mutex after setting Lo_rundown,
we can remove pointless BUG_ON(lo->lo_state != Lo_rundown) check.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/block/loop.c | 38 +++++++++++++-------------------------
 1 file changed, 13 insertions(+), 25 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 2506193a4fd1..6b813c592159 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1135,27 +1135,6 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
 
-	/*
-	 * Flush loop_configure() and loop_change_fd(). It is acceptable for
-	 * loop_validate_file() to succeed, for actual clear operation has not
-	 * started yet.
-	 */
-	mutex_lock(&loop_validate_mutex);
-	mutex_unlock(&loop_validate_mutex);
-	/*
-	 * loop_validate_file() now fails because l->lo_state != Lo_bound
-	 * became visible.
-	 */
-
-	/*
-	 * Since this function is called upon "ioctl(LOOP_CLR_FD)" xor "close()
-	 * after ioctl(LOOP_CLR_FD)", it is a sign of something going wrong if
-	 * lo->lo_state has changed while waiting for lo->lo_mutex.
-	 */
-	mutex_lock(&lo->lo_mutex);
-	BUG_ON(lo->lo_state != Lo_rundown);
-	mutex_unlock(&lo->lo_mutex);
-
 	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
 		blk_queue_write_cache(lo->lo_queue, false, false);
 
@@ -1238,11 +1217,20 @@ static int loop_clr_fd(struct loop_device *lo)
 {
 	int err;
 
-	err = mutex_lock_killable(&lo->lo_mutex);
+	/*
+	 * Since lo_ioctl() is called without locks held, it is possible that
+	 * loop_configure()/loop_change_fd() and loop_clr_fd() run in parallel.
+	 *
+	 * Therefore, use global lock when setting Lo_rundown state in order to
+	 * make sure that loop_validate_file() will fail if the "struct file"
+	 * which loop_configure()/loop_change_fd() found via fget() was this
+	 * loop device.
+	 */
+	err = loop_global_lock_killable(lo, true);
 	if (err)
 		return err;
 	if (lo->lo_state != Lo_bound) {
-		mutex_unlock(&lo->lo_mutex);
+		loop_global_unlock(lo, true);
 		return -ENXIO;
 	}
 	/*
@@ -1257,11 +1245,11 @@ static int loop_clr_fd(struct loop_device *lo)
 	 */
 	if (disk_openers(lo->lo_disk) > 1) {
 		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-		mutex_unlock(&lo->lo_mutex);
+		loop_global_unlock(lo, true);
 		return 0;
 	}
 	lo->lo_state = Lo_rundown;
-	mutex_unlock(&lo->lo_mutex);
+	loop_global_unlock(lo, true);
 
 	__loop_clr_fd(lo, false);
 	return 0;
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 15/14] loop: avoid loop_validate_mutex/lo_mutex in ->release
  2022-03-29 15:36 ` [PATCH 15/14] loop: avoid loop_validate_mutex/lo_mutex in ->release Tetsuo Handa
@ 2022-03-30  8:14   ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2022-03-30  8:14 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Christoph Hellwig, Jan Kara, Darrick J . Wong, Ming Lei,
	Matteo Croce, linux-block, nbd, Jens Axboe, Josef Bacik,
	Minchan Kim, Nitin Gupta

On Wed 30-03-22 00:36:49, Tetsuo Handa wrote:
> Since ->release is called with disk->open_mutex held, and __loop_clr_fd()
>  from lo_release() is called via ->release when disk_openers() == 0, we are
> guaranteed that "struct file" which will be passed to loop_validate_file()
> via fget() cannot be the loop device __loop_clr_fd(lo, true) will clear.
> Thus, there is no need to hold loop_validate_mutex from __loop_clr_fd()
> if release == true.
> 
> When I made commit 3ce6e1f662a91097 ("loop: reintroduce global lock for
> safe loop_validate_file() traversal"), I wrote "It is acceptable for
> loop_validate_file() to succeed, for actual clear operation has not started
> yet.". But now I came to feel why it is acceptable to succeed.
> 
> It seems that the loop driver was added in Linux 1.3.68, and
> 
>   if (lo->lo_refcnt > 1)
>     return -EBUSY;
> 
> check in loop_clr_fd() was there from the beginning. The intent of this
> check was unclear. But now I think that current
> 
>   disk_openers(lo->lo_disk) > 1
> 
> form is there for three reasons.
> 
> (1) Avoid I/O errors when some process which opens and reads from this
>     loop device in response to uevent notification (e.g. systemd-udevd),
>     as described in commit a1ecac3b0656a682 ("loop: Make explicit loop
>     device destruction lazy"). This opener is short-lived because it is
>     likely that the file descriptor used by that process is closed soon.
> 
> (2) Avoid I/O errors caused by underlying layer of stacked loop devices
>     (i.e. ioctl(some_loop_fd, LOOP_SET_FD, other_loop_fd)) being suddenly
>     disappeared. This opener is long-lived because this reference is
>     associated with not a file descriptor but lo->lo_backing_file.
> 
> (3) Avoid I/O errors caused by underlying layer of mounted loop device
>     (i.e. mount(some_loop_device, some_mount_point)) being suddenly
>     disappeared. This opener is long-lived because this reference is
>     associated with not a file descriptor but mount.
> 
> While race in (1) might be acceptable, (2) and (3) should be checked
> racelessly. That is, make sure that __loop_clr_fd() will not run if
> loop_validate_file() succeeds, by doing refcount check with global lock
> held when explicit loop device destruction is requested.
> 
> As a result of no longer waiting for lo->lo_mutex after setting Lo_rundown,
> we can remove pointless BUG_ON(lo->lo_state != Lo_rundown) check.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Yeah, the patch makes sense to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  drivers/block/loop.c | 38 +++++++++++++-------------------------
>  1 file changed, 13 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 2506193a4fd1..6b813c592159 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1135,27 +1135,6 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
>  	struct file *filp;
>  	gfp_t gfp = lo->old_gfp_mask;
>  
> -	/*
> -	 * Flush loop_configure() and loop_change_fd(). It is acceptable for
> -	 * loop_validate_file() to succeed, for actual clear operation has not
> -	 * started yet.
> -	 */
> -	mutex_lock(&loop_validate_mutex);
> -	mutex_unlock(&loop_validate_mutex);
> -	/*
> -	 * loop_validate_file() now fails because l->lo_state != Lo_bound
> -	 * became visible.
> -	 */
> -
> -	/*
> -	 * Since this function is called upon "ioctl(LOOP_CLR_FD)" xor "close()
> -	 * after ioctl(LOOP_CLR_FD)", it is a sign of something going wrong if
> -	 * lo->lo_state has changed while waiting for lo->lo_mutex.
> -	 */
> -	mutex_lock(&lo->lo_mutex);
> -	BUG_ON(lo->lo_state != Lo_rundown);
> -	mutex_unlock(&lo->lo_mutex);
> -
>  	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
>  		blk_queue_write_cache(lo->lo_queue, false, false);
>  
> @@ -1238,11 +1217,20 @@ static int loop_clr_fd(struct loop_device *lo)
>  {
>  	int err;
>  
> -	err = mutex_lock_killable(&lo->lo_mutex);
> +	/*
> +	 * Since lo_ioctl() is called without locks held, it is possible that
> +	 * loop_configure()/loop_change_fd() and loop_clr_fd() run in parallel.
> +	 *
> +	 * Therefore, use global lock when setting Lo_rundown state in order to
> +	 * make sure that loop_validate_file() will fail if the "struct file"
> +	 * which loop_configure()/loop_change_fd() found via fget() was this
> +	 * loop device.
> +	 */
> +	err = loop_global_lock_killable(lo, true);
>  	if (err)
>  		return err;
>  	if (lo->lo_state != Lo_bound) {
> -		mutex_unlock(&lo->lo_mutex);
> +		loop_global_unlock(lo, true);
>  		return -ENXIO;
>  	}
>  	/*
> @@ -1257,11 +1245,11 @@ static int loop_clr_fd(struct loop_device *lo)
>  	 */
>  	if (disk_openers(lo->lo_disk) > 1) {
>  		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
> -		mutex_unlock(&lo->lo_mutex);
> +		loop_global_unlock(lo, true);
>  		return 0;
>  	}
>  	lo->lo_state = Lo_rundown;
> -	mutex_unlock(&lo->lo_mutex);
> +	loop_global_unlock(lo, true);
>  
>  	__loop_clr_fd(lo, false);
>  	return 0;
> -- 
> 2.32.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2022-03-30  8:14 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-25  6:39 yet another approach to fix the loop lock order inversions v5 Christoph Hellwig
2022-03-25  6:39 ` [PATCH 01/14] nbd: use the correct block_device in nbd_bdev_reset Christoph Hellwig
2022-03-25  9:48   ` Jan Kara
2022-03-25  6:39 ` [PATCH 02/14] zram: cleanup reset_store Christoph Hellwig
2022-03-25  6:39 ` [PATCH 03/14] zram: cleanup zram_remove Christoph Hellwig
2022-03-25  6:39 ` [PATCH 04/14] block: add a disk_openers helper Christoph Hellwig
2022-03-25  6:39 ` [PATCH 05/14] block: turn bdev->bd_openers into an atomic_t Christoph Hellwig
2022-03-25  6:39 ` [PATCH 06/14] loop: de-duplicate the idle worker freeing code Christoph Hellwig
2022-03-25  6:39 ` [PATCH 07/14] loop: initialize the worker tracking fields once Christoph Hellwig
2022-03-25  6:39 ` [PATCH 08/14] loop: remove the racy bd_inode->i_mapping->nrpages asserts Christoph Hellwig
2022-03-25  6:39 ` [PATCH 09/14] loop: don't freeze the queue in lo_release Christoph Hellwig
2022-03-25  6:39 ` [PATCH 10/14] loop: only freeze the queue in __loop_clr_fd when needed Christoph Hellwig
2022-03-25  6:39 ` [PATCH 11/14] loop: implement ->free_disk Christoph Hellwig
2022-03-25 10:42   ` Tetsuo Handa
2022-03-25 15:10     ` Tetsuo Handa
2022-03-25  6:39 ` [PATCH 12/14] loop: suppress uevents while reconfiguring the device Christoph Hellwig
2022-03-25  9:49   ` Jan Kara
2022-03-25  6:39 ` [PATCH 13/14] loop: remove lo_refcount and avoid lo_mutex in ->open / ->release Christoph Hellwig
2022-03-26  2:52   ` Tetsuo Handa
2022-03-29  6:52     ` Christoph Hellwig
2022-03-29 13:25       ` Christoph Hellwig
2022-03-29 14:02         ` Tetsuo Handa
2022-03-29 14:49           ` Christoph Hellwig
2022-03-25  6:39 ` [PATCH 14/14] loop: don't destroy lo->workqueue in __loop_clr_fd Christoph Hellwig
2022-03-29 15:36 ` [PATCH 15/14] loop: avoid loop_validate_mutex/lo_mutex in ->release Tetsuo Handa
2022-03-30  8:14   ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.