* [PATCH v3 1/5] task_work: export task_work_add()
@ 2022-01-21 11:40 Tetsuo Handa
  2022-01-21 11:40 ` [PATCH v3 2/5] loop: revert "make autoclear operation asynchronous" Tetsuo Handa
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Tetsuo Handa @ 2022-01-21 11:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Jan Kara; +Cc: linux-block, Tetsuo Handa

Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
silenced a circular locking dependency warning by moving the autoclear
operation to WQ context.

Then, it was reported that WQ context is too late to run the autoclear
operation; some userspace programs (e.g. xfstests) assume that the
autoclear operation has already completed by the time close() returns to
user mode, so that they can immediately call umount() of a partition
containing a backing file which the autoclear operation should have
closed.

Then, Jan Kara found that the fundamental problem is that waiting for
I/O completion (from blk_mq_freeze_queue() or flush_workqueue()) with
disk->open_mutex held has the possibility of deadlock.

Then, I found that since the disk->open_mutex => lo->lo_mutex dependency
is recorded by lo_open() and lo_release(), and blk_mq_freeze_queue()
called from e.g. loop_set_status() waits for I/O completion with
lo->lo_mutex held, from the locking dependency chain perspective we need
to avoid holding lo->lo_mutex in lo_open() and lo_release(). We can
avoid holding lo->lo_mutex in lo_open(), for we can instead use a
spinlock dedicated to the Lo_deleting check.

But we cannot avoid holding lo->lo_mutex in lo_release(), for WQ context
was too late to run the autoclear operation. We need to make the whole
lo_release() operation start without disk->open_mutex held and complete
before returning to user mode. One approach that can meet this
requirement is to use task_work context. Thus, export task_work_add()
for the loop driver.
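
For reference, the calling pattern the loop driver will use looks
roughly like this (a minimal sketch with hypothetical names; the actual
caller is added later in this series):

	static void my_post_release(struct callback_head *cb)
	{
		/* Runs from task_work context before returning to user mode. */
	}

	/* In the release path, with "cb" embedded in a per-call structure: */
	init_task_work(&cb, my_post_release);
	if (task_work_add(current, &cb, TWA_RESUME)) {
		/* The task is exiting; fall back to e.g. WQ context. */
	}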

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 kernel/task_work.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 1698fbe6f0e1..2a1644189182 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -60,6 +60,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(task_work_add);
 
 /**
  * task_work_cancel_match - cancel a pending work added by task_work_add()
-- 
2.32.0



* [PATCH v3 2/5] loop: revert "make autoclear operation asynchronous"
  2022-01-21 11:40 [PATCH v3 1/5] task_work: export task_work_add() Tetsuo Handa
@ 2022-01-21 11:40 ` Tetsuo Handa
  2022-01-21 11:40 ` [PATCH v3 3/5] loop: don't hold lo->lo_mutex from lo_open() Tetsuo Handa
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Tetsuo Handa @ 2022-01-21 11:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Jan Kara
  Cc: linux-block, Tetsuo Handa, kernel test robot

The kernel test robot is reporting that the xfstests

  umount ext2 on xfs
  umount xfs

sequence started failing, for commit 322c4293ecc58110 ("loop: make
autoclear operation asynchronous") removed the guarantee that the fput()
of the backing file is processed before close() (and hence lo_release())
returns to user mode.

As a preparation for retrying with the task_work_add() approach, first
make a clean revert.

Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/block/loop.c | 65 ++++++++++++++++++++------------------------
 drivers/block/loop.h |  1 -
 2 files changed, 29 insertions(+), 37 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b1b05c45c07c..e52a8a5e8cbc 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1082,7 +1082,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo)
+static void __loop_clr_fd(struct loop_device *lo, bool release)
 {
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
@@ -1144,6 +1144,8 @@ static void __loop_clr_fd(struct loop_device *lo)
 	/* let user-space know about this change */
 	kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
 	mapping_set_gfp_mask(filp->f_mapping, gfp);
+	/* This is safe: open() is still holding a reference. */
+	module_put(THIS_MODULE);
 	blk_mq_unfreeze_queue(lo->lo_queue);
 
 	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
@@ -1151,52 +1153,44 @@ static void __loop_clr_fd(struct loop_device *lo)
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN) {
 		int err;
 
-		mutex_lock(&lo->lo_disk->open_mutex);
+		/*
+		 * open_mutex has been held already in release path, so don't
+		 * acquire it if this function is called in such case.
+		 *
+		 * If the reread partition isn't from release path, lo_refcnt
+		 * must be at least one and it can only become zero when the
+		 * current holder is released.
+		 */
+		if (!release)
+			mutex_lock(&lo->lo_disk->open_mutex);
 		err = bdev_disk_changed(lo->lo_disk, false);
-		mutex_unlock(&lo->lo_disk->open_mutex);
+		if (!release)
+			mutex_unlock(&lo->lo_disk->open_mutex);
 		if (err)
 			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
 				__func__, lo->lo_number, err);
 		/* Device is gone, no point in returning error */
 	}
 
+	/*
+	 * lo->lo_state is set to Lo_unbound here after above partscan has
+	 * finished. There cannot be anybody else entering __loop_clr_fd() as
+	 * Lo_rundown state protects us from all the other places trying to
+	 * change the 'lo' device.
+	 */
 	lo->lo_flags = 0;
 	if (!part_shift)
 		lo->lo_disk->flags |= GENHD_FL_NO_PART;
-
-	fput(filp);
-}
-
-static void loop_rundown_completed(struct loop_device *lo)
-{
 	mutex_lock(&lo->lo_mutex);
 	lo->lo_state = Lo_unbound;
 	mutex_unlock(&lo->lo_mutex);
-	module_put(THIS_MODULE);
-}
-
-static void loop_rundown_workfn(struct work_struct *work)
-{
-	struct loop_device *lo = container_of(work, struct loop_device,
-					      rundown_work);
-	struct block_device *bdev = lo->lo_device;
-	struct gendisk *disk = lo->lo_disk;
-
-	__loop_clr_fd(lo);
-	kobject_put(&bdev->bd_device.kobj);
-	module_put(disk->fops->owner);
-	loop_rundown_completed(lo);
-}
 
-static void loop_schedule_rundown(struct loop_device *lo)
-{
-	struct block_device *bdev = lo->lo_device;
-	struct gendisk *disk = lo->lo_disk;
-
-	__module_get(disk->fops->owner);
-	kobject_get(&bdev->bd_device.kobj);
-	INIT_WORK(&lo->rundown_work, loop_rundown_workfn);
-	queue_work(system_long_wq, &lo->rundown_work);
+	/*
+	 * Need not hold lo_mutex to fput backing file. Calling fput holding
+	 * lo_mutex triggers a circular lock dependency possibility warning as
+	 * fput can take open_mutex which is usually taken before lo_mutex.
+	 */
+	fput(filp);
 }
 
 static int loop_clr_fd(struct loop_device *lo)
@@ -1228,8 +1222,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_state = Lo_rundown;
 	mutex_unlock(&lo->lo_mutex);
 
-	__loop_clr_fd(lo);
-	loop_rundown_completed(lo);
+	__loop_clr_fd(lo, false);
 	return 0;
 }
 
@@ -1754,7 +1747,7 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 		 * In autoclear mode, stop the loop thread
 		 * and remove configuration after last close.
 		 */
-		loop_schedule_rundown(lo);
+		__loop_clr_fd(lo, true);
 		return;
 	} else if (lo->lo_state == Lo_bound) {
 		/*
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 918a7a2dc025..082d4b6bfc6a 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -56,7 +56,6 @@ struct loop_device {
 	struct gendisk		*lo_disk;
 	struct mutex		lo_mutex;
 	bool			idr_visible;
-	struct work_struct      rundown_work;
 };
 
 struct loop_cmd {
-- 
2.32.0



* [PATCH v3 3/5] loop: don't hold lo->lo_mutex from lo_open()
  2022-01-21 11:40 [PATCH v3 1/5] task_work: export task_work_add() Tetsuo Handa
  2022-01-21 11:40 ` [PATCH v3 2/5] loop: revert "make autoclear operation asynchronous" Tetsuo Handa
@ 2022-01-21 11:40 ` Tetsuo Handa
  2022-01-21 11:40 ` [PATCH v3 4/5] loop: don't hold lo->lo_mutex from lo_release() Tetsuo Handa
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Tetsuo Handa @ 2022-01-21 11:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Jan Kara; +Cc: linux-block, Tetsuo Handa

Waiting for I/O completion with disk->open_mutex held has the
possibility of deadlock. Since the disk->open_mutex => lo->lo_mutex
dependency is recorded by lo_open(), and blk_mq_freeze_queue() called
from e.g. loop_set_status() waits for I/O completion with lo->lo_mutex
held, from the locking dependency chain perspective, waiting for I/O
completion with disk->open_mutex held still remains possible.

Introduce loop_delete_spinlock, dedicated to protecting the lo->lo_state
versus lo->lo_refcnt race in lo_open() and loop_control_remove().

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/block/loop.c | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index e52a8a5e8cbc..5ce8ac2dfa4c 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -89,6 +89,7 @@
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 static DEFINE_MUTEX(loop_validate_mutex);
+static DEFINE_SPINLOCK(loop_delete_spinlock);
 
 /**
  * loop_global_lock_killable() - take locks for safe loop_validate_file() test
@@ -1717,16 +1718,15 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 static int lo_open(struct block_device *bdev, fmode_t mode)
 {
 	struct loop_device *lo = bdev->bd_disk->private_data;
-	int err;
+	int err = 0;
 
-	err = mutex_lock_killable(&lo->lo_mutex);
-	if (err)
-		return err;
-	if (lo->lo_state == Lo_deleting)
+	spin_lock(&loop_delete_spinlock);
+	/* lo->lo_state may be changed to any Lo_* but Lo_deleting. */
+	if (data_race(lo->lo_state) == Lo_deleting)
 		err = -ENXIO;
 	else
 		atomic_inc(&lo->lo_refcnt);
-	mutex_unlock(&lo->lo_mutex);
+	spin_unlock(&loop_delete_spinlock);
 	return err;
 }
 
@@ -2112,19 +2112,18 @@ static int loop_control_remove(int idx)
 	ret = mutex_lock_killable(&lo->lo_mutex);
 	if (ret)
 		goto mark_visible;
-	if (lo->lo_state != Lo_unbound ||
-	    atomic_read(&lo->lo_refcnt) > 0) {
-		mutex_unlock(&lo->lo_mutex);
+	spin_lock(&loop_delete_spinlock);
+	/* Mark this loop device no longer open()-able if nobody is using. */
+	if (lo->lo_state != Lo_unbound || atomic_read(&lo->lo_refcnt) > 0)
 		ret = -EBUSY;
-		goto mark_visible;
-	}
-	/* Mark this loop device no longer open()-able. */
-	lo->lo_state = Lo_deleting;
+	else
+		lo->lo_state = Lo_deleting;
+	spin_unlock(&loop_delete_spinlock);
 	mutex_unlock(&lo->lo_mutex);
-
-	loop_remove(lo);
-	return 0;
-
+	if (!ret) {
+		loop_remove(lo);
+		return 0;
+	}
 mark_visible:
 	/* Show this loop device again. */
 	mutex_lock(&loop_ctl_mutex);
-- 
2.32.0



* [PATCH v3 4/5] loop: don't hold lo->lo_mutex from lo_release()
  2022-01-21 11:40 [PATCH v3 1/5] task_work: export task_work_add() Tetsuo Handa
  2022-01-21 11:40 ` [PATCH v3 2/5] loop: revert "make autoclear operation asynchronous" Tetsuo Handa
  2022-01-21 11:40 ` [PATCH v3 3/5] loop: don't hold lo->lo_mutex from lo_open() Tetsuo Handa
@ 2022-01-21 11:40 ` Tetsuo Handa
  2022-01-21 11:40 ` [PATCH v3 5/5] loop: add workaround for racy loop device reuse logic in /bin/mount Tetsuo Handa
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Tetsuo Handa @ 2022-01-21 11:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Jan Kara; +Cc: linux-block, Tetsuo Handa

This is a retry of commit 322c4293ecc58110 ("loop: make autoclear operation
asynchronous").

Since it turned out that we need to avoid waiting for I/O completion with
disk->open_mutex held, move the whole lo_release() operation to task_work
context (when possible) or WQ context (otherwise).

Refcount management in lo_release() and loop_release_workfn() needs to be
updated in sync with blkdev_put(), for blkdev_put() has already dropped
its references by the time loop_release_callbackfn() is invoked.

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/block/loop.c | 151 +++++++++++++++++++++++++++++++------------
 drivers/block/loop.h |   1 +
 2 files changed, 111 insertions(+), 41 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5ce8ac2dfa4c..74d919e98a6b 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1083,7 +1083,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo, bool release)
+static void __loop_clr_fd(struct loop_device *lo)
 {
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
@@ -1145,8 +1145,6 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	/* let user-space know about this change */
 	kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
 	mapping_set_gfp_mask(filp->f_mapping, gfp);
-	/* This is safe: open() is still holding a reference. */
-	module_put(THIS_MODULE);
 	blk_mq_unfreeze_queue(lo->lo_queue);
 
 	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
@@ -1154,37 +1152,18 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN) {
 		int err;
 
-		/*
-		 * open_mutex has been held already in release path, so don't
-		 * acquire it if this function is called in such case.
-		 *
-		 * If the reread partition isn't from release path, lo_refcnt
-		 * must be at least one and it can only become zero when the
-		 * current holder is released.
-		 */
-		if (!release)
-			mutex_lock(&lo->lo_disk->open_mutex);
+		mutex_lock(&lo->lo_disk->open_mutex);
 		err = bdev_disk_changed(lo->lo_disk, false);
-		if (!release)
-			mutex_unlock(&lo->lo_disk->open_mutex);
+		mutex_unlock(&lo->lo_disk->open_mutex);
 		if (err)
 			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
 				__func__, lo->lo_number, err);
 		/* Device is gone, no point in returning error */
 	}
 
-	/*
-	 * lo->lo_state is set to Lo_unbound here after above partscan has
-	 * finished. There cannot be anybody else entering __loop_clr_fd() as
-	 * Lo_rundown state protects us from all the other places trying to
-	 * change the 'lo' device.
-	 */
 	lo->lo_flags = 0;
 	if (!part_shift)
 		lo->lo_disk->flags |= GENHD_FL_NO_PART;
-	mutex_lock(&lo->lo_mutex);
-	lo->lo_state = Lo_unbound;
-	mutex_unlock(&lo->lo_mutex);
 
 	/*
 	 * Need not hold lo_mutex to fput backing file. Calling fput holding
@@ -1192,6 +1171,10 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	 * fput can take open_mutex which is usually taken before lo_mutex.
 	 */
 	fput(filp);
+	mutex_lock(&lo->lo_mutex);
+	lo->lo_state = Lo_unbound;
+	mutex_unlock(&lo->lo_mutex);
+	module_put(THIS_MODULE);
 }
 
 static int loop_clr_fd(struct loop_device *lo)
@@ -1223,7 +1206,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_state = Lo_rundown;
 	mutex_unlock(&lo->lo_mutex);
 
-	__loop_clr_fd(lo, false);
+	__loop_clr_fd(lo);
 	return 0;
 }
 
@@ -1715,10 +1698,31 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 }
 #endif
 
+struct loop_release_task {
+	union {
+		struct list_head head;
+		struct callback_head cb;
+		struct work_struct ws;
+	};
+	struct loop_device *lo;
+};
+
+static LIST_HEAD(release_task_spool);
+static DEFINE_SPINLOCK(release_task_spool_spinlock);
+
 static int lo_open(struct block_device *bdev, fmode_t mode)
 {
 	struct loop_device *lo = bdev->bd_disk->private_data;
 	int err = 0;
+	/*
+	 * In order to avoid doing a __GFP_NOFAIL allocation from lo_release(),
+	 * reserve memory for calling lo_post_release() from lo_open().
+	 */
+	struct loop_release_task *lrt =
+		kmalloc(sizeof(*lrt), GFP_KERNEL | __GFP_NOWARN);
+
+	if (!lrt)
+		return -ENOMEM;
 
 	spin_lock(&loop_delete_spinlock);
 	/* lo->lo_state may be changed to any Lo_* but Lo_deleting. */
@@ -1727,33 +1731,40 @@ static int lo_open(struct block_device *bdev, fmode_t mode)
 	else
 		atomic_inc(&lo->lo_refcnt);
 	spin_unlock(&loop_delete_spinlock);
-	return err;
+	if (err) {
+		kfree(lrt);
+		return err;
+	}
+	spin_lock(&release_task_spool_spinlock);
+	list_add(&lrt->head, &release_task_spool);
+	spin_unlock(&release_task_spool_spinlock);
+	return 0;
 }
 
-static void lo_release(struct gendisk *disk, fmode_t mode)
+static void lo_post_release(struct gendisk *disk)
 {
 	struct loop_device *lo = disk->private_data;
 
 	mutex_lock(&lo->lo_mutex);
-	if (atomic_dec_return(&lo->lo_refcnt))
-		goto out_unlock;
 
+	/* Check whether this loop device can be cleared. */
+	if (atomic_dec_return(&lo->lo_refcnt) || lo->lo_state != Lo_bound)
+		goto out_unlock;
+	/*
+	 * Clear this loop device since nobody is using it. Note that since
+	 * lo_open() increments lo->lo_refcnt without holding lo->lo_mutex,
+	 * we might no longer be the last user by now, but there was a moment
+	 * when no user existed.
+	 *
+	 * In autoclear mode, destroy WQ and remove configuration.
+	 * Otherwise flush possible ongoing bios in WQ and keep configuration.
+	 */
 	if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
-		if (lo->lo_state != Lo_bound)
-			goto out_unlock;
 		lo->lo_state = Lo_rundown;
 		mutex_unlock(&lo->lo_mutex);
-		/*
-		 * In autoclear mode, stop the loop thread
-		 * and remove configuration after last close.
-		 */
-		__loop_clr_fd(lo, true);
+		__loop_clr_fd(lo);
 		return;
-	} else if (lo->lo_state == Lo_bound) {
-		/*
-		 * Otherwise keep thread (if running) and config,
-		 * but flush possible ongoing bios in thread.
-		 */
+	} else {
 		blk_mq_freeze_queue(lo->lo_queue);
 		blk_mq_unfreeze_queue(lo->lo_queue);
 	}
@@ -1762,6 +1773,60 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 	mutex_unlock(&lo->lo_mutex);
 }
 
+static void loop_release_workfn(struct work_struct *work)
+{
+	struct loop_release_task *lrt =
+		container_of(work, struct loop_release_task, ws);
+	struct loop_device *lo = lrt->lo;
+	struct gendisk *disk = lo->lo_disk;
+
+	lo_post_release(disk);
+	/* Drop references which will be dropped after lo_release(). */
+	kobject_put(&disk_to_dev(disk)->kobj);
+	module_put(disk->fops->owner);
+	kfree(lrt);
+	atomic_dec(&lo->async_pending);
+}
+
+static void loop_release_callbackfn(struct callback_head *callback)
+{
+	struct loop_release_task *lrt =
+		container_of(callback, struct loop_release_task, cb);
+
+	loop_release_workfn(&lrt->ws);
+}
+
+static void lo_release(struct gendisk *disk, fmode_t mode)
+{
+	struct loop_device *lo = disk->private_data;
+	struct loop_release_task *lrt;
+
+	atomic_inc(&lo->async_pending);
+	/*
+	 * Fetch from spool. Since a successful lo_open() call is coupled with
+	 * a lo_release() call, we are guaranteed that the spool is not empty.
+	 */
+	spin_lock(&release_task_spool_spinlock);
+	lrt = list_first_entry(&release_task_spool, typeof(*lrt), head);
+	list_del(&lrt->head);
+	spin_unlock(&release_task_spool_spinlock);
+	/* Hold references which will be dropped after lo_release(). */
+	__module_get(disk->fops->owner);
+	kobject_get(&disk_to_dev(disk)->kobj);
+	/*
+	 * Prefer task work so that clear operation completes
+	 * before close() returns to user mode.
+	 */
+	lrt->lo = lo;
+	if (!(current->flags & PF_KTHREAD)) {
+		init_task_work(&lrt->cb, loop_release_callbackfn);
+		if (!task_work_add(current, &lrt->cb, TWA_RESUME))
+			return;
+	}
+	INIT_WORK(&lrt->ws, loop_release_workfn);
+	queue_work(system_long_wq, &lrt->ws);
+}
+
 static const struct block_device_operations lo_fops = {
 	.owner =	THIS_MODULE,
 	.open =		lo_open,
@@ -2023,6 +2088,7 @@ static int loop_add(int i)
 	if (!part_shift)
 		disk->flags |= GENHD_FL_NO_PART;
 	atomic_set(&lo->lo_refcnt, 0);
+	atomic_set(&lo->async_pending, 0);
 	mutex_init(&lo->lo_mutex);
 	lo->lo_number		= i;
 	spin_lock_init(&lo->lo_lock);
@@ -2064,6 +2130,9 @@ static int loop_add(int i)
 
 static void loop_remove(struct loop_device *lo)
 {
+	/* Wait for task work and/or WQ to complete. */
+	while (atomic_read(&lo->async_pending))
+		schedule_timeout_uninterruptible(1);
 	/* Make this loop device unreachable from pathname. */
 	del_gendisk(lo->lo_disk);
 	blk_cleanup_disk(lo->lo_disk);
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 082d4b6bfc6a..20fc5eebe455 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -56,6 +56,7 @@ struct loop_device {
 	struct gendisk		*lo_disk;
 	struct mutex		lo_mutex;
 	bool			idr_visible;
+	atomic_t		async_pending;
 };
 
 struct loop_cmd {
-- 
2.32.0



* [PATCH v3 5/5] loop: add workaround for racy loop device reuse logic in /bin/mount
  2022-01-21 11:40 [PATCH v3 1/5] task_work: export task_work_add() Tetsuo Handa
                   ` (2 preceding siblings ...)
  2022-01-21 11:40 ` [PATCH v3 4/5] loop: don't hold lo->lo_mutex from lo_release() Tetsuo Handa
@ 2022-01-21 11:40 ` Tetsuo Handa
  2022-01-25 15:47 ` [PATCH v3 1/5] task_work: export task_work_add() Christoph Hellwig
  2022-01-25 21:37 ` Darrick J. Wong
  5 siblings, 0 replies; 13+ messages in thread
From: Tetsuo Handa @ 2022-01-21 11:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Jan Kara
  Cc: linux-block, Tetsuo Handa, Jan Stancek, Mike Galbraith

Since lo_open() and lo_release() were previously serialized via
disk->open_mutex, new file descriptors returned by open() never reached
a loop device in Lo_rundown state unless ioctl(LOOP_CLR_FD) was inside
__loop_clr_fd(). But now that lo_open() and lo_release() no longer hold
lo->lo_mutex in order to kill the disk->open_mutex => lo->lo_mutex
dependency, new file descriptors returned by open() can easily reach a
loop device in Lo_rundown state.

So far, Jan Stancek and Mike Galbraith found that LTP's isofs testcase,
which does mount/umount in close succession, started failing. The root
cause is that the loop device reuse logic in /bin/mount is racy, and Jan
Kara posted a patch for fixing one of the two bugs [1].

But we need some migration period for allowing users to update their
util-linux package; not everybody can use the latest userspace programs.
Therefore, add a switch that allows emulating serialization between
lo_open() and lo_release() without using disk->open_mutex. This emulation
is disabled by default, and will be removed eventually. Since this
emulation runs from task_work context, we don't need to worry about the
locking dependency problem.
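
For example, the workaround can be toggled at runtime through the sysfs
knob created by module_param() (a usage sketch; the path follows the
usual /sys/module layout):

  # echo 1 > /sys/module/loop/parameters/open_waits_rundown_device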

Link: https://lkml.kernel.org/r/20220120114705.25342-1-jack@suse.cz [1]
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Mike Galbraith <efault@gmx.de>
Analyzed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
This hack is not popular, but without enabling this workaround, about 20% of
mount requests fail. If this workaround is enabled, no mount request fails.
I think we need this hack for a while.

root@fuzz:/mnt# time for i in $(seq 1 100); do mount -o loop,ro isofs.iso isofs/ && umount isofs/; done
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: isofs/: operation permitted for root only.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: isofs/: operation permitted for root only.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.

real    0m9.896s
user    0m0.161s
sys     0m8.523s

 drivers/block/loop.c | 58 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 74d919e98a6b..844471213494 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -90,6 +90,7 @@ static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 static DEFINE_MUTEX(loop_validate_mutex);
 static DEFINE_SPINLOCK(loop_delete_spinlock);
+static DECLARE_WAIT_QUEUE_HEAD(loop_rundown_wait);
 
 /**
  * loop_global_lock_killable() - take locks for safe loop_validate_file() test
@@ -1174,6 +1175,7 @@ static void __loop_clr_fd(struct loop_device *lo)
 	mutex_lock(&lo->lo_mutex);
 	lo->lo_state = Lo_unbound;
 	mutex_unlock(&lo->lo_mutex);
+	wake_up_all(&loop_rundown_wait);
 	module_put(THIS_MODULE);
 }
 
@@ -1710,6 +1712,38 @@ struct loop_release_task {
 static LIST_HEAD(release_task_spool);
 static DEFINE_SPINLOCK(release_task_spool_spinlock);
 
+/* Workaround code for racy loop device reuse logic in /bin/mount. */
+static bool open_waits_rundown_device;
+module_param(open_waits_rundown_device, bool, 0644);
+MODULE_PARM_DESC(open_waits_rundown_device, "Please report if you need to enable this option.");
+
+struct loop_open_task {
+	struct callback_head cb;
+	struct loop_device *lo;
+};
+
+static void lo_post_open(struct gendisk *disk)
+{
+	struct loop_device *lo = disk->private_data;
+
+	/* Wait for lo_post_release() to leave lo->lo_mutex section. */
+	if (mutex_lock_killable(&lo->lo_mutex) == 0)
+		mutex_unlock(&lo->lo_mutex);
+	/* Also wait for __loop_clr_fd() to complete if Lo_rundown was set. */
+	wait_event_killable(loop_rundown_wait, data_race(lo->lo_state) != Lo_rundown);
+	atomic_dec(&lo->async_pending);
+}
+
+static void loop_open_callbackfn(struct callback_head *callback)
+{
+	struct loop_open_task *lot =
+		container_of(callback, struct loop_open_task, cb);
+	struct gendisk *disk = lot->lo->lo_disk;
+
+	lo_post_open(disk);
+	kfree(lot);
+}
+
 static int lo_open(struct block_device *bdev, fmode_t mode)
 {
 	struct loop_device *lo = bdev->bd_disk->private_data;
@@ -1738,6 +1772,30 @@ static int lo_open(struct block_device *bdev, fmode_t mode)
 	spin_lock(&release_task_spool_spinlock);
 	list_add(&lrt->head, &release_task_spool);
 	spin_unlock(&release_task_spool_spinlock);
+
+	/*
+	 * Try to avoid accessing Lo_rundown loop device.
+	 *
+	 * Since the task_work list is LIFO, lo_post_release() scheduled by
+	 * lo_release() can run before lo_post_open() scheduled by lo_open()
+	 * runs, if an error occurred and fput() scheduled lo_release() before
+	 * returning to user mode. This means that lo->lo_refcnt may already be
+	 * 0 when lo_post_open() runs. Therefore, use lo->async_pending in order
+	 * to prevent loop_remove() from releasing this loop device.
+	 */
+	if (open_waits_rundown_device && !(current->flags & PF_KTHREAD)) {
+		struct loop_open_task *lot =
+			kmalloc(sizeof(*lot), GFP_KERNEL | __GFP_NOWARN);
+
+		if (!lot)
+			return 0;
+		lot->lo = lo;
+		init_task_work(&lot->cb, loop_open_callbackfn);
+		if (task_work_add(current, &lot->cb, TWA_RESUME))
+			kfree(lot);
+		else
+			atomic_inc(&lo->async_pending);
+	}
 	return 0;
 }
 
-- 
2.32.0



* Re: [PATCH v3 1/5] task_work: export task_work_add()
  2022-01-21 11:40 [PATCH v3 1/5] task_work: export task_work_add() Tetsuo Handa
                   ` (3 preceding siblings ...)
  2022-01-21 11:40 ` [PATCH v3 5/5] loop: add workaround for racy loop device reuse logic in /bin/mount Tetsuo Handa
@ 2022-01-25 15:47 ` Christoph Hellwig
  2022-01-25 23:47   ` Tetsuo Handa
  2022-01-25 21:37 ` Darrick J. Wong
  5 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2022-01-25 15:47 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Jens Axboe, Christoph Hellwig, Jan Kara, linux-block

I'm sometimes a little slow, but I still fail to understand why we need
all this.  Your cut down patch that moves the destroy_workqueue call
and the work_struct fixed all the known lockdep issues, right?

And the only other problem we're thinking of is that blk_mq_freeze_queue
could have the same effect, except that lockdep doesn't track it and
we've not seen it in the wild.

As far as I can tell we do not need the freeze at all, given that
by the time release is called I/O is quiesced.  So with a little prep
work we should not need any of that.  See below (this is actually
three patches, with the last one being a very slight rebase of your
patch):

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 01cbbfc4e9e24..91b56761392e2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1006,10 +1006,10 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	    !file->f_op->write_iter)
 		lo->lo_flags |= LO_FLAGS_READ_ONLY;
 
-	lo->workqueue = alloc_workqueue("loop%d",
-					WQ_UNBOUND | WQ_FREEZABLE,
-					0,
-					lo->lo_number);
+	if (!lo->workqueue)
+		lo->workqueue = alloc_workqueue("loop%d",
+						WQ_UNBOUND | WQ_FREEZABLE,
+						0, lo->lo_number);
 	if (!lo->workqueue) {
 		error = -ENOMEM;
 		goto out_unlock;
@@ -1082,7 +1082,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo)
+static void __loop_clr_fd(struct loop_device *lo, bool release)
 {
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
@@ -1112,10 +1112,14 @@ static void __loop_clr_fd(struct loop_device *lo)
 	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
 		blk_queue_write_cache(lo->lo_queue, false, false);
 
-	/* freeze request queue during the transition */
-	blk_mq_freeze_queue(lo->lo_queue);
+	/*
+	 * Freeze the request queue when unbinding on a live file descriptor and
+	 * thus an open device.  When called from ->release we are guaranteed
+	 * that there is no I/O in progress already.
+	 */
+	if (!release)
+		blk_mq_freeze_queue(lo->lo_queue);
 
-	destroy_workqueue(lo->workqueue);
 	spin_lock_irq(&lo->lo_work_lock);
 	list_for_each_entry_safe(worker, pos, &lo->idle_worker_list,
 				idle_list) {
@@ -1144,16 +1148,19 @@ static void __loop_clr_fd(struct loop_device *lo)
 	/* let user-space know about this change */
 	kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
 	mapping_set_gfp_mask(filp->f_mapping, gfp);
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	if (!release)
+		blk_mq_unfreeze_queue(lo->lo_queue);
 
 	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
 
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN) {
 		int err;
 
-		mutex_lock(&lo->lo_disk->open_mutex);
+		if (!release)
+			mutex_lock(&lo->lo_disk->open_mutex);
 		err = bdev_disk_changed(lo->lo_disk, false);
-		mutex_unlock(&lo->lo_disk->open_mutex);
+		if (!release)
+			mutex_unlock(&lo->lo_disk->open_mutex);
 		if (err)
 			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
 				__func__, lo->lo_number, err);
@@ -1164,39 +1171,17 @@ static void __loop_clr_fd(struct loop_device *lo)
 	if (!part_shift)
 		lo->lo_disk->flags |= GENHD_FL_NO_PART;
 
-	fput(filp);
-}
-
-static void loop_rundown_completed(struct loop_device *lo)
-{
 	mutex_lock(&lo->lo_mutex);
 	lo->lo_state = Lo_unbound;
 	mutex_unlock(&lo->lo_mutex);
 	module_put(THIS_MODULE);
-}
-
-static void loop_rundown_workfn(struct work_struct *work)
-{
-	struct loop_device *lo = container_of(work, struct loop_device,
-					      rundown_work);
-	struct block_device *bdev = lo->lo_device;
-	struct gendisk *disk = lo->lo_disk;
-
-	__loop_clr_fd(lo);
-	kobject_put(&bdev->bd_device.kobj);
-	module_put(disk->fops->owner);
-	loop_rundown_completed(lo);
-}
 
-static void loop_schedule_rundown(struct loop_device *lo)
-{
-	struct block_device *bdev = lo->lo_device;
-	struct gendisk *disk = lo->lo_disk;
-
-	__module_get(disk->fops->owner);
-	kobject_get(&bdev->bd_device.kobj);
-	INIT_WORK(&lo->rundown_work, loop_rundown_workfn);
-	queue_work(system_long_wq, &lo->rundown_work);
+	/*
+	 * Need not hold lo_mutex to fput backing file. Calling fput holding
+	 * lo_mutex triggers a circular lock dependency possibility warning as
+	 * fput can take open_mutex which is usually taken before lo_mutex.
+	 */
+	fput(filp);
 }
 
 static int loop_clr_fd(struct loop_device *lo)
@@ -1228,8 +1213,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_state = Lo_rundown;
 	mutex_unlock(&lo->lo_mutex);
 
-	__loop_clr_fd(lo);
-	loop_rundown_completed(lo);
+	__loop_clr_fd(lo, false);
 	return 0;
 }
 
@@ -1754,15 +1738,8 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 		 * In autoclear mode, stop the loop thread
 		 * and remove configuration after last close.
 		 */
-		loop_schedule_rundown(lo);
+		__loop_clr_fd(lo, true);
 		return;
-	} else if (lo->lo_state == Lo_bound) {
-		/*
-		 * Otherwise keep thread (if running) and config,
-		 * but flush possible ongoing bios in thread.
-		 */
-		blk_mq_freeze_queue(lo->lo_queue);
-		blk_mq_unfreeze_queue(lo->lo_queue);
 	}
 
 out_unlock:
@@ -2080,6 +2057,8 @@ static void loop_remove(struct loop_device *lo)
 	mutex_unlock(&loop_ctl_mutex);
 	/* There is no route which can find this loop device. */
 	mutex_destroy(&lo->lo_mutex);
+	if (lo->workqueue)
+		destroy_workqueue(lo->workqueue);
 	kfree(lo);
 }
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 918a7a2dc0259..082d4b6bfc6a6 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -56,7 +56,6 @@ struct loop_device {
 	struct gendisk		*lo_disk;
 	struct mutex		lo_mutex;
 	bool			idr_visible;
-	struct work_struct      rundown_work;
 };
 
 struct loop_cmd {


* Re: [PATCH v3 1/5] task_work: export task_work_add()
  2022-01-21 11:40 [PATCH v3 1/5] task_work: export task_work_add() Tetsuo Handa
                   ` (4 preceding siblings ...)
  2022-01-25 15:47 ` [PATCH v3 1/5] task_work: export task_work_add() Christoph Hellwig
@ 2022-01-25 21:37 ` Darrick J. Wong
  5 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2022-01-25 21:37 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Jens Axboe, Christoph Hellwig, Jan Kara, linux-block, xfs

On Fri, Jan 21, 2022 at 08:40:02PM +0900, Tetsuo Handa wrote:
> Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
> silenced a circular locking dependency warning by moving the autoclear
> operation to WQ context.
> 
> Then, it was reported that WQ context is too late to run the autoclear
> operation; some userspace programs (e.g. xfstests) assume that the
> autoclear operation has already completed by the time close() returns to
> user mode, so that they can immediately call umount() of a partition
> containing a backing file which the autoclear operation should have
> closed.
> 
> Then, Jan Kara found that the fundamental problem is that waiting for
> I/O completion (from blk_mq_freeze_queue() or flush_workqueue()) with
> disk->open_mutex held has the possibility of deadlock.
> 
> Then, I found that since the disk->open_mutex => lo->lo_mutex dependency
> is recorded by lo_open() and lo_release(), and blk_mq_freeze_queue()
> called from e.g. loop_set_status() waits for I/O completion with
> lo->lo_mutex held, from the locking dependency chain perspective we need
> to avoid holding lo->lo_mutex in lo_open() and lo_release(). We can
> avoid holding lo->lo_mutex in lo_open(), for we can instead use a
> spinlock dedicated to the Lo_deleting check.
> 
> But we cannot avoid holding lo->lo_mutex in lo_release(), for WQ context
> was too late to run the autoclear operation. We need to make the whole
> lo_release() operation start without disk->open_mutex held and complete
> before returning to user mode. One approach that can meet this
> requirement is to use task_work context. Thus, export task_work_add()
> for the loop driver.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

5.17-rc1 came out and I saw regressions across the board when xfs/049
and xfs/073 ran.

xfs/049 formats XFS on a block device, mounts it, creates a sparse file
inside the XFS fs, formats the sparse file, mounts that (via automatic
loop device), does some work, and then unmounts both the sparse file
filesystem and the outer XFS filesystem.

The first unmount no longer waits for the loop device's asynchronous
release, so the unmount of the outer fs fails because it's still in use.

So this series fixes xfs/049, but xfs/073 is still broken.  xfs/073
creates a sparse file containing an XFS filesystem and then does this in
rapid succession:

mount -o loop <mount options that guarantee mount failure>
mount -o loop <mount options that should work>

Whereas with 5.16 this worked fine,

 [U] try mount 1
 loop0: detected capacity change from 0 to 83968
 XFS (loop0): Filesystem has duplicate UUID 924e8033-a130-4f9c-a11f-52f892c268e9 - can't mount
 [U] try mount 2
 loop0: detected capacity change from 0 to 83968
 XFS (loop0): Mounting V5 Filesystem
 XFS (loop0): resetting quota flags
 XFS (loop0): Ending clean mount

in 5.17-rc1 it fails like this:

 [U] try mount 1
 loop0: detected capacity change from 0 to 83968
 XFS (loop0): Filesystem has duplicate UUID 0b0afdac-5c9c-4d94-9b8d-fe85a2eb1143 - can't mount
 [U] try mount 2
 I/O error, dev loop0, sector 0 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
 XFS (loop0): SB validate failed with error -5.
 [U] fail mount 2

I guess this means that mount can grab the loop device after the backing
file has been released but before a new one has been plumbed in?  Or,
seeing the lack of "detected capacity change" for mount 2, maybe the
backing file never gets added?

--D

> ---
>  kernel/task_work.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 1698fbe6f0e1..2a1644189182 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -60,6 +60,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>  
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(task_work_add);
>  
>  /**
>   * task_work_cancel_match - cancel a pending work added by task_work_add()
> -- 
> 2.32.0
> 


* Re: [PATCH v3 1/5] task_work: export task_work_add()
  2022-01-25 15:47 ` [PATCH v3 1/5] task_work: export task_work_add() Christoph Hellwig
@ 2022-01-25 23:47   ` Tetsuo Handa
  2022-01-26  5:21     ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Tetsuo Handa @ 2022-01-25 23:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Jan Kara, linux-block

On 2022/01/26 0:47, Christoph Hellwig wrote:
> I'm sometimes a little slow, but I still fail to understand why we need
> all this.  Your cut down patch that moves the destroy_workqueue call
> and the work_struct fixed all the known lockdep issues, right?

Right. Moving destroy_workqueue() (together with blk_mq_freeze_queue())
in __loop_clr_fd() to WQ context fixed a known lockdep issue.

> 
> And the only other problem we're thinking of is that blk_mq_freeze_queue
> could have the same effect,

Right. blk_mq_freeze_queue() is still called with disk->open_mutex held, for
there is

	} else if (lo->lo_state == Lo_bound) {
		/*
		 * Otherwise keep thread (if running) and config,
		 * but flush possible ongoing bios in thread.
		 */
		blk_mq_freeze_queue(lo->lo_queue);
		blk_mq_unfreeze_queue(lo->lo_queue);
	}

path in lo_release().

>                             except that lockdep doesn't track it and
> we've not seen it in the wild.

It is difficult to test. Fuzzers cannot test fsfreeze paths, for failing to
issue an unfreeze request leads to unresponsive virtual machines.

> 
> As far as I can tell we do not need the freeze at all, given that
> by the time release is called I/O is quiesced.

Why? lo_release() is called when close() is called. But (periodically-scheduled
or triggered-on-demand) writeback of previously executed buffered write() calls
can start while lo_release() or __loop_clr_fd() is running. Then, why not
wait for I/O requests to complete? Isn't that the reason of

	} else if (lo->lo_state == Lo_bound) {
		/*
		 * Otherwise keep thread (if running) and config,
		 * but flush possible ongoing bios in thread.
		 */
		blk_mq_freeze_queue(lo->lo_queue);
		blk_mq_unfreeze_queue(lo->lo_queue);
	}

path in lo_release() being there?



* Re: [PATCH v3 1/5] task_work: export task_work_add()
  2022-01-25 23:47   ` Tetsuo Handa
@ 2022-01-26  5:21     ` Christoph Hellwig
  2022-01-26  7:18       ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2022-01-26  5:21 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, linux-block, Ming Lei

On Wed, Jan 26, 2022 at 08:47:17AM +0900, Tetsuo Handa wrote:
> > As far as I can tell we do not need the freeze at all, given that
> > by the time release is called I/O is quiesced.
> 
> Why? lo_release() is called when close() is called. But (periodically-scheduled
> or triggered-on-demand) writeback of previously executed buffered write() calls
> can start while lo_release() or __loop_clr_fd() is running. Then, why not
> wait for I/O requests to complete?

Let's refine my wording: the above should be "... by the time the final
lo_release is called".  blkdev_put_whole ensures all writeback has finished
and all buffers are gone by writing all data back and waiting for it and then
truncating the pages from blkdev_flush_mapping.

> Isn't that the reason of
> 
> 	} else if (lo->lo_state == Lo_bound) {
> 		/*
> 		 * Otherwise keep thread (if running) and config,
> 		 * but flush possible ongoing bios in thread.
> 		 */
> 		blk_mq_freeze_queue(lo->lo_queue);
> 		blk_mq_unfreeze_queue(lo->lo_queue);
> 	}
> 
> path in lo_release() being there?

This looks completely spurious to me.  Adding Ming who added it.


* Re: [PATCH v3 1/5] task_work: export task_work_add()
  2022-01-26  5:21     ` Christoph Hellwig
@ 2022-01-26  7:18       ` Ming Lei
  2022-01-26 10:27         ` Tetsuo Handa
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2022-01-26  7:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Tetsuo Handa, Jens Axboe, Jan Kara, linux-block

On Wed, Jan 26, 2022 at 06:21:59AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 26, 2022 at 08:47:17AM +0900, Tetsuo Handa wrote:
> > > As far as I can tell we do not need the freeze at all, given that
> > > by the time release is called I/O is quiesced.
> > 
> > Why? lo_release() is called when close() is called. But (periodically-scheduled
> > or triggered-on-demand) writeback of previously executed buffered write() calls
> > can start while lo_release() or __loop_clr_fd() is running. Then, why not
> > wait for I/O requests to complete?
> 
> Let's refine my wording, the above should be "... by the time the final
> lo_release is called".  blkdev_put_whole ensures all writeback has finished
> and all buffers are gone by writing all data back and waiting for it and then
> truncating the pages from blkdev_flush_mapping.
> 
> > Isn't that the reason of
> > 
> > 	} else if (lo->lo_state == Lo_bound) {
> > 		/*
> > 		 * Otherwise keep thread (if running) and config,
> > 		 * but flush possible ongoing bios in thread.
> > 		 */
> > 		blk_mq_freeze_queue(lo->lo_queue);
> > 		blk_mq_unfreeze_queue(lo->lo_queue);
> > 	}
> > 
> > path in lo_release() being there?
> 
> This looks completely spurious to me.  Adding Ming who added it.

It was added when converting to blk-mq.

I remember it was to replace the original loop_flush(), which used
wait_for_completion() for draining all inflight bios, but it seems
the flush isn't needed in lo_release().


Thanks,
Ming



* Re: [PATCH v3 1/5] task_work: export task_work_add()
  2022-01-26  7:18       ` Ming Lei
@ 2022-01-26 10:27         ` Tetsuo Handa
  2022-01-26 13:11           ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Tetsuo Handa @ 2022-01-26 10:27 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig; +Cc: Jens Axboe, Jan Kara, linux-block

On 2022/01/26 16:18, Ming Lei wrote:
>>> Isn't that the reason of
>>>
>>> 	} else if (lo->lo_state == Lo_bound) {
>>> 		/*
>>> 		 * Otherwise keep thread (if running) and config,
>>> 		 * but flush possible ongoing bios in thread.
>>> 		 */
>>> 		blk_mq_freeze_queue(lo->lo_queue);
>>> 		blk_mq_unfreeze_queue(lo->lo_queue);
>>> 	}
>>>
>>> path in lo_release() being there?
>>
>> This looks completely spurious to me.  Adding Ming who added it.
> 
> It was added when converting to blk-mq.

Well, commit b5dd2f6047ca1080 ("block: loop: improve performance via blk-mq") in v4.0-rc1.
That's where "thread" in the block comment above became outdated, and a global WQ_MEM_RECLAIM
workqueue was used. We need to update both.

> 
> I remember it was to replace the original loop_flush(), which used
> wait_for_completion() for draining all inflight bios, but it seems
> the flush isn't needed in lo_release().

Even if we can remove blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from lo_release(), we
cannot remove blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from e.g. loop_set_status(), right?

Then, lo_release(), which is called with disk->open_mutex held, can still be
blocked at mutex_lock(&lo->lo_mutex) waiting for e.g. loop_set_status() to call
mutex_unlock(&lo->lo_mutex). That is, lo_open() from e.g. /sys/power/resume can
still wait for I/O completion with disk->open_mutex held.

How can we kill this dependency? Don't we need to make sure that
lo_open()/lo_release() are not blocked on lo->lo_mutex (like [PATCH v3 3/5]
and [PATCH v3 4/5] do)?



* Re: [PATCH v3 1/5] task_work: export task_work_add()
  2022-01-26 10:27         ` Tetsuo Handa
@ 2022-01-26 13:11           ` Jan Kara
  2022-01-26 13:35             ` Tetsuo Handa
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2022-01-26 13:11 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Ming Lei, Christoph Hellwig, Jens Axboe, Jan Kara, linux-block

On Wed 26-01-22 19:27:17, Tetsuo Handa wrote:
> On 2022/01/26 16:18, Ming Lei wrote:
> > I remember it was to replace original loop_flush() which uses
> > wait_for_completion() for draining all inflight bios, but seems
> > the flush isn't needed in lo_release().
> 
> Even if we can remove blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from
> lo_release(), we cannot remove
> blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from e.g.
> loop_set_status(), right?

Correct AFAICT.

> Then, lo_release(), which is called with disk->open_mutex held, can still
> be blocked at mutex_lock(&lo->lo_mutex) waiting for e.g.
> loop_set_status() to call mutex_unlock(&lo->lo_mutex).  That is,
> lo_open() from e.g. /sys/power/resume can still wait for I/O completion
> with disk->open_mutex held.

I don't think this is a real problem. If someone is calling
loop_set_status() it means the loop device is open and thus lo_release()
cannot be running in parallel, can it?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v3 1/5] task_work: export task_work_add()
  2022-01-26 13:11           ` Jan Kara
@ 2022-01-26 13:35             ` Tetsuo Handa
  0 siblings, 0 replies; 13+ messages in thread
From: Tetsuo Handa @ 2022-01-26 13:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ming Lei, Christoph Hellwig, Jens Axboe, linux-block

On 2022/01/26 22:11, Jan Kara wrote:
> On Wed 26-01-22 19:27:17, Tetsuo Handa wrote:
>> Even if we can remove blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from
>> lo_release(), we cannot remove
>> blk_mq_freeze_queue()/blk_mq_unfreeze_queue() from e.g.
>> loop_set_status(), right?
> 
> Correct AFAICT.

OK.

> 
>> Then, lo_release(), which is called with disk->open_mutex held, can still
>> be blocked at mutex_lock(&lo->lo_mutex) waiting for e.g.
>> loop_set_status() to call mutex_unlock(&lo->lo_mutex).  That is,
>> lo_open() from e.g. /sys/power/resume can still wait for I/O completion
>> with disk->open_mutex held.
> 
> I don't think this is a real problem. If someone is calling
> loop_set_status() it means the loop device is open and thus lo_release()
> cannot be running in parallel, can it?

lo_release() is called when a file descriptor is close()d.
That is, loop_set_status() and lo_release() can run in parallel, can't they?

  Process-A                               Process-B

  int fd1 = open("/dev/loop0", O_RDWR);   int fd2 = open("/dev/loop0", O_RDWR);
  ioctl(fd1, LOOP_SET_STATUS64, ...);     close(fd2);

If lo_release() (which is called with disk->open_mutex held) waits for ioctl()
(which waits for I/O completion with lo->lo_mutex held), there is a
disk->open_mutex => lo->lo_mutex => I/O completion dependency.

And the possibility of a deadlock (when mixed with sysfs and fsfreeze) arises
at lo_open() as well. If lo_open() (which is called with disk->open_mutex held)
waits for ioctl() (which waits for I/O completion with lo->lo_mutex held), the
same disk->open_mutex => lo->lo_mutex => I/O completion dependency exists.


end of thread (newest: 2022-01-26 13:35 UTC)

Thread overview: 13+ messages
2022-01-21 11:40 [PATCH v3 1/5] task_work: export task_work_add() Tetsuo Handa
2022-01-21 11:40 ` [PATCH v3 2/5] loop: revert "make autoclear operation asynchronous" Tetsuo Handa
2022-01-21 11:40 ` [PATCH v3 3/5] loop: don't hold lo->lo_mutex from lo_open() Tetsuo Handa
2022-01-21 11:40 ` [PATCH v3 4/5] loop: don't hold lo->lo_mutex from lo_release() Tetsuo Handa
2022-01-21 11:40 ` [PATCH v3 5/5] loop: add workaround for racy loop device reuse logic in /bin/mount Tetsuo Handa
2022-01-25 15:47 ` [PATCH v3 1/5] task_work: export task_work_add() Christoph Hellwig
2022-01-25 23:47   ` Tetsuo Handa
2022-01-26  5:21     ` Christoph Hellwig
2022-01-26  7:18       ` Ming Lei
2022-01-26 10:27         ` Tetsuo Handa
2022-01-26 13:11           ` Jan Kara
2022-01-26 13:35             ` Tetsuo Handa
2022-01-25 21:37 ` Darrick J. Wong
