* [syzbot] possible deadlock in blkdev_put (3)
@ 2022-03-12 16:25 syzbot
  2022-03-12 17:08 ` Tetsuo Handa
  2022-03-14  4:45 ` [syzbot] possible deadlock in blkdev_put (3) syzbot
  0 siblings, 2 replies; 10+ messages in thread
From: syzbot @ 2022-03-12 16:25 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    68453767131a ARM: Spectre-BHB: provide empty stub for non-..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=17bd4709700000
kernel config:  https://syzkaller.appspot.com/x/.config?x=16438642a37fea1
dashboard link: https://syzkaller.appspot.com/bug?extid=6479585dfd4dedd3f7e1
compiler:       Debian clang version 11.0.1-2, GNU ld (GNU Binutils for Debian) 2.35.2

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+6479585dfd4dedd3f7e1@syzkaller.appspotmail.com

======================================================
WARNING: possible circular locking dependency detected
5.17.0-rc7-syzkaller-00227-g68453767131a #0 Not tainted
------------------------------------------------------
udevd/6428 is trying to acquire lock:
ffff88801d66a938 ((wq_completion)loop3){+.+.}-{0:0}, at: flush_workqueue+0x172/0x16b0 kernel/workqueue.c:2827

but task is already holding lock:
ffff88801ab90918 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0xf8/0x7a0 block/bdev.c:902

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #6 (&disk->open_mutex){+.+.}-{3:3}:
       lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5639
       __mutex_lock_common+0x1d5/0x2590 kernel/locking/mutex.c:600
       __mutex_lock kernel/locking/mutex.c:733 [inline]
       mutex_lock_nested+0x1a/0x20 kernel/locking/mutex.c:785
       blkdev_get_by_dev+0x169/0xd70 block/bdev.c:804
       swsusp_check+0xb0/0x3f0 kernel/power/swap.c:1526
       software_resume+0xc8/0x3d0 kernel/power/hibernate.c:979
       resume_store+0xdc/0x120 kernel/power/hibernate.c:1181
       kernfs_fop_write_iter+0x3b6/0x510 fs/kernfs/file.c:296
       call_write_iter include/linux/fs.h:2074 [inline]
       new_sync_write fs/read_write.c:503 [inline]
       vfs_write+0xb11/0xe90 fs/read_write.c:590
       ksys_write+0x18f/0x2c0 fs/read_write.c:643
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #5 (system_transition_mutex/1){+.+.}-{3:3}:
       lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5639
       __mutex_lock_common+0x1d5/0x2590 kernel/locking/mutex.c:600
       __mutex_lock kernel/locking/mutex.c:733 [inline]
       mutex_lock_nested+0x1a/0x20 kernel/locking/mutex.c:785
       software_resume+0x7a/0x3d0 kernel/power/hibernate.c:934
       resume_store+0xdc/0x120 kernel/power/hibernate.c:1181
       kernfs_fop_write_iter+0x3b6/0x510 fs/kernfs/file.c:296
       call_write_iter include/linux/fs.h:2074 [inline]
       new_sync_write fs/read_write.c:503 [inline]
       vfs_write+0xb11/0xe90 fs/read_write.c:590
       ksys_write+0x18f/0x2c0 fs/read_write.c:643
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #4 (&of->mutex){+.+.}-{3:3}:
       lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5639
       __mutex_lock_common+0x1d5/0x2590 kernel/locking/mutex.c:600
       __mutex_lock kernel/locking/mutex.c:733 [inline]
       mutex_lock_nested+0x1a/0x20 kernel/locking/mutex.c:785
       kernfs_seq_start+0x50/0x3b0 fs/kernfs/file.c:112
       seq_read_iter+0x3cd/0xd30 fs/seq_file.c:225
       call_read_iter include/linux/fs.h:2068 [inline]
       new_sync_read fs/read_write.c:400 [inline]
       vfs_read+0xaf9/0xe60 fs/read_write.c:481
       ksys_read+0x18f/0x2c0 fs/read_write.c:619
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #3 (&p->lock){+.+.}-{3:3}:
       lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5639
       __mutex_lock_common+0x1d5/0x2590 kernel/locking/mutex.c:600
       __mutex_lock kernel/locking/mutex.c:733 [inline]
       mutex_lock_nested+0x1a/0x20 kernel/locking/mutex.c:785
       seq_read_iter+0xad/0xd30 fs/seq_file.c:182
       call_read_iter include/linux/fs.h:2068 [inline]
       generic_file_splice_read+0x482/0x760 fs/splice.c:311
       do_splice_to fs/splice.c:796 [inline]
       splice_direct_to_actor+0x45f/0xd00 fs/splice.c:870
       do_splice_direct+0x291/0x3e0 fs/splice.c:979
       do_sendfile+0x6fe/0x1040 fs/read_write.c:1245
       __do_sys_sendfile64 fs/read_write.c:1310 [inline]
       __se_sys_sendfile64+0x171/0x1d0 fs/read_write.c:1296
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #2 (sb_writers#3){.+.+}-{0:0}:
       lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5639
       percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
       __sb_start_write include/linux/fs.h:1722 [inline]
       sb_start_write include/linux/fs.h:1792 [inline]
       file_start_write include/linux/fs.h:2937 [inline]
       lo_write_bvec drivers/block/loop.c:243 [inline]
       lo_write_simple drivers/block/loop.c:266 [inline]
       do_req_filebacked drivers/block/loop.c:495 [inline]
       loop_handle_cmd drivers/block/loop.c:1852 [inline]
       loop_process_work+0x167f/0x22b0 drivers/block/loop.c:1892
       process_one_work+0x86c/0x1190 kernel/workqueue.c:2307
       worker_thread+0xab1/0x1300 kernel/workqueue.c:2454
       kthread+0x2a3/0x2d0 kernel/kthread.c:377
       ret_from_fork+0x1f/0x30

-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
       lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5639
       process_one_work+0x83c/0x1190 kernel/workqueue.c:2283
       worker_thread+0xab1/0x1300 kernel/workqueue.c:2454
       kthread+0x2a3/0x2d0 kernel/kthread.c:377
       ret_from_fork+0x1f/0x30

-> #0 ((wq_completion)loop3){+.+.}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:3063 [inline]
       check_prevs_add kernel/locking/lockdep.c:3186 [inline]
       validate_chain+0x1dfb/0x8250 kernel/locking/lockdep.c:3801
       __lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5027
       lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5639
       flush_workqueue+0x18e/0x16b0 kernel/workqueue.c:2827
       drain_workqueue+0xc3/0x3a0 kernel/workqueue.c:2992
       destroy_workqueue+0x7d/0xed0 kernel/workqueue.c:4429
       __loop_clr_fd+0x1bd/0x980 drivers/block/loop.c:1124
       blkdev_put+0x5a7/0x7a0
       blkdev_close+0x58/0x80 block/fops.c:517
       __fput+0x3fc/0x870 fs/file_table.c:317
       task_work_run+0x146/0x1c0 kernel/task_work.c:164
       tracehook_notify_resume include/linux/tracehook.h:188 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
       exit_to_user_mode_prepare+0x209/0x220 kernel/entry/common.c:207
       __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
       syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
       do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
       entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
  (wq_completion)loop3 --> system_transition_mutex/1 --> &disk->open_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&disk->open_mutex);
                               lock(system_transition_mutex/1);
                               lock(&disk->open_mutex);
  lock((wq_completion)loop3);

 *** DEADLOCK ***

1 lock held by udevd/6428:
 #0: ffff88801ab90918 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0xf8/0x7a0 block/bdev.c:902

stack backtrace:
CPU: 0 PID: 6428 Comm: udevd Not tainted 5.17.0-rc7-syzkaller-00227-g68453767131a #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1dc/0x2d8 lib/dump_stack.c:106
 check_noncircular+0x2f9/0x3b0 kernel/locking/lockdep.c:2143
 check_prev_add kernel/locking/lockdep.c:3063 [inline]
 check_prevs_add kernel/locking/lockdep.c:3186 [inline]
 validate_chain+0x1dfb/0x8250 kernel/locking/lockdep.c:3801
 __lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5027
 lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5639
 flush_workqueue+0x18e/0x16b0 kernel/workqueue.c:2827
 drain_workqueue+0xc3/0x3a0 kernel/workqueue.c:2992
 destroy_workqueue+0x7d/0xed0 kernel/workqueue.c:4429
 __loop_clr_fd+0x1bd/0x980 drivers/block/loop.c:1124
 blkdev_put+0x5a7/0x7a0
 blkdev_close+0x58/0x80 block/fops.c:517
 __fput+0x3fc/0x870 fs/file_table.c:317
 task_work_run+0x146/0x1c0 kernel/task_work.c:164
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
 exit_to_user_mode_prepare+0x209/0x220 kernel/entry/common.c:207
 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
 syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
 do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f3819885fc3
Code: 48 ff ff ff b8 ff ff ff ff e9 3e ff ff ff 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 83 ec 18 89 7c 24 0c e8
RSP: 002b:00007ffc29e6f138 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 00007f381972e6a8 RCX: 00007f3819885fc3
RDX: 000000000000001c RSI: 00007ffc29e6e938 RDI: 0000000000000008
RBP: 000055f83c052190 R08: 0000000000000007 R09: 000055f83c057f00
R10: 00007f3819914fc0 R11: 0000000000000246 R12: 0000000000000002
R13: 000055f83c04c1b0 R14: 0000000000000008 R15: 000055f83c028910
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [syzbot] possible deadlock in blkdev_put (3)
  2022-03-12 16:25 [syzbot] possible deadlock in blkdev_put (3) syzbot
@ 2022-03-12 17:08 ` Tetsuo Handa
  2022-03-13 15:15   ` [PATCH v2] loop: don't hold lo->lo_mutex from lo_open() and lo_release() Tetsuo Handa
  2022-03-14  4:45 ` [syzbot] possible deadlock in blkdev_put (3) syzbot
  1 sibling, 1 reply; 10+ messages in thread
From: Tetsuo Handa @ 2022-03-12 17:08 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig; +Cc: syzbot, axboe, linux-block

Hello.

On 2022/03/13 1:25, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    68453767131a ARM: Spectre-BHB: provide empty stub for non-..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=17bd4709700000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=16438642a37fea1
> dashboard link: https://syzkaller.appspot.com/bug?extid=6479585dfd4dedd3f7e1
> compiler:       Debian clang version 11.0.1-2, GNU ld (GNU Binutils for Debian) 2.35.2
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+6479585dfd4dedd3f7e1@syzkaller.appspotmail.com

Now that it is time to retry, I'm waiting for your response on
https://lkml.kernel.org/r/cf61ca83-d9b0-2e5d-eb3b-018e16753749@I-love.SAKURA.ne.jp .


* [PATCH v2] loop: don't hold lo->lo_mutex from lo_open() and lo_release()
  2022-03-12 17:08 ` Tetsuo Handa
@ 2022-03-13 15:15   ` Tetsuo Handa
  2022-03-14 15:23     ` Jan Kara
  0 siblings, 1 reply; 10+ messages in thread
From: Tetsuo Handa @ 2022-03-13 15:15 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig; +Cc: syzbot, axboe, linux-block

syzbot is reporting a circular locking problem at __loop_clr_fd() [1]: since
commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of kworker"),
destroy_workqueue() is called with disk->open_mutex held.

Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
tried to fix this problem by deferring __loop_clr_fd() from lo_release()
to a WQ context, but that commit was reverted because some userspace
programs depend on __loop_clr_fd() having completed by the time close()
returns to user mode.

This time, try to fix this problem by deferring __loop_clr_fd() from
lo_release() to a task work context. Since task_work_add() is not exported
and Christoph does not want to export it, this patch uses the anonymous
inode interface as a wrapper for reaching task_work_add() via fput().

Although Jan found two bugs in /bin/mount, one of which was fixed in the
util-linux package [2], I found that not holding lo->lo_mutex in
lo_open() causes occasional I/O errors [3] (due to the

	if (lo->lo_state != Lo_bound)
		return BLK_STS_IOERR;

check in loop_queue_rq()) when systemd-udevd opens a loop device and
reads from it before loop_configure() updates lo->lo_state to Lo_bound.

That is, we not only need to wait for __loop_clr_fd() upon the close()
event (in order to avoid unexpected umount() failures) but also need to
wait for lo->lo_mutex upon the open() event (in order to avoid unexpected
I/O errors). Therefore, schedule a task work from both lo_open() and
lo_release().

Link: https://syzkaller.appspot.com/bug?extid=6479585dfd4dedd3f7e1 [1]
Link: https://github.com/util-linux/util-linux/commit/3e1fc3bbee99384c [2]
Link: https://lkml.kernel.org/r/a72c59c6-298b-e4ba-b1f5-2275afab49a1@I-love.SAKURA.ne.jp [3]
Reported-by: syzbot <syzbot+6479585dfd4dedd3f7e1@syzkaller.appspotmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/block/loop.c | 244 +++++++++++++++++++++++++++++++------------
 drivers/block/loop.h |   1 +
 2 files changed, 180 insertions(+), 65 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 19fe19eaa50e..2f60356ad428 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -80,6 +80,7 @@
 #include <linux/blk-cgroup.h>
 #include <linux/sched/mm.h>
 #include <linux/statfs.h>
+#include <linux/anon_inodes.h>
 
 #include "loop.h"
 
@@ -90,6 +91,8 @@
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 static DEFINE_MUTEX(loop_validate_mutex);
+static DEFINE_SPINLOCK(loop_delete_spinlock);
+static DECLARE_WAIT_QUEUE_HEAD(loop_rundown_wait);
 
 /**
  * loop_global_lock_killable() - take locks for safe loop_validate_file() test
@@ -1088,7 +1091,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo, bool release)
+static void __loop_clr_fd(struct loop_device *lo)
 {
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
@@ -1150,8 +1153,6 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	/* let user-space know about this change */
 	kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
 	mapping_set_gfp_mask(filp->f_mapping, gfp);
-	/* This is safe: open() is still holding a reference. */
-	module_put(THIS_MODULE);
 	blk_mq_unfreeze_queue(lo->lo_queue);
 
 	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
@@ -1159,37 +1160,18 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN) {
 		int err;
 
-		/*
-		 * open_mutex has been held already in release path, so don't
-		 * acquire it if this function is called in such case.
-		 *
-		 * If the reread partition isn't from release path, lo_refcnt
-		 * must be at least one and it can only become zero when the
-		 * current holder is released.
-		 */
-		if (!release)
-			mutex_lock(&lo->lo_disk->open_mutex);
+		mutex_lock(&lo->lo_disk->open_mutex);
 		err = bdev_disk_changed(lo->lo_disk, false);
-		if (!release)
-			mutex_unlock(&lo->lo_disk->open_mutex);
+		mutex_unlock(&lo->lo_disk->open_mutex);
 		if (err)
 			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
 				__func__, lo->lo_number, err);
 		/* Device is gone, no point in returning error */
 	}
 
-	/*
-	 * lo->lo_state is set to Lo_unbound here after above partscan has
-	 * finished. There cannot be anybody else entering __loop_clr_fd() as
-	 * Lo_rundown state protects us from all the other places trying to
-	 * change the 'lo' device.
-	 */
 	lo->lo_flags = 0;
 	if (!part_shift)
 		lo->lo_disk->flags |= GENHD_FL_NO_PART;
-	mutex_lock(&lo->lo_mutex);
-	lo->lo_state = Lo_unbound;
-	mutex_unlock(&lo->lo_mutex);
 
 	/*
 	 * Need not hold lo_mutex to fput backing file. Calling fput holding
@@ -1197,6 +1179,11 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	 * fput can take open_mutex which is usually taken before lo_mutex.
 	 */
 	fput(filp);
+	mutex_lock(&lo->lo_mutex);
+	lo->lo_state = Lo_unbound;
+	mutex_unlock(&lo->lo_mutex);
+	wake_up_all(&loop_rundown_wait);
+	module_put(THIS_MODULE);
 }
 
 static int loop_clr_fd(struct loop_device *lo)
@@ -1228,7 +1215,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_state = Lo_rundown;
 	mutex_unlock(&lo->lo_mutex);
 
-	__loop_clr_fd(lo, false);
+	__loop_clr_fd(lo);
 	return 0;
 }
 
@@ -1720,52 +1707,176 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 }
 #endif
 
-static int lo_open(struct block_device *bdev, fmode_t mode)
+static int lo_no_open(struct inode *inode, struct file *file)
 {
-	struct loop_device *lo = bdev->bd_disk->private_data;
-	int err;
+	return -ENXIO;
+}
 
-	err = mutex_lock_killable(&lo->lo_mutex);
-	if (err)
-		return err;
-	if (lo->lo_state == Lo_deleting)
-		err = -ENXIO;
-	else
-		atomic_inc(&lo->lo_refcnt);
-	mutex_unlock(&lo->lo_mutex);
-	return err;
+static int lo_post_open(struct inode *inode, struct file *file)
+{
+	struct loop_device *lo = file->private_data;
+
+	/* Nothing to do if lo_open() failed. */
+	if (!lo)
+		return 0;
+	/* Wait for loop_configure()/loop_post_release() to complete. */
+	if (mutex_lock_killable(&lo->lo_mutex) == 0)
+		mutex_unlock(&lo->lo_mutex);
+	/*
+	 * Also wait for __loop_clr_fd() to complete because
+	 * loop_post_release() releases lo->lo_mutex before __loop_clr_fd()
+	 * is called.
+	 */
+	wait_event_killable(loop_rundown_wait, data_race(lo->lo_state) != Lo_rundown);
+	atomic_dec(&lo->async_pending);
+	return 0;
 }
 
-static void lo_release(struct gendisk *disk, fmode_t mode)
+static const struct file_operations loop_open_fops = {
+	.open = lo_no_open,
+	.release = lo_post_open,
+};
+
+static int lo_post_release(struct inode *inode, struct file *file)
 {
-	struct loop_device *lo = disk->private_data;
+	struct gendisk *disk = file->private_data;
+	struct loop_device *lo;
 
+	/* Nothing to do if lo_open() failed. */
+	if (!disk)
+		return 0;
+	lo = disk->private_data;
 	mutex_lock(&lo->lo_mutex);
-	if (atomic_dec_return(&lo->lo_refcnt))
+	/* Check whether this loop device can be cleared. */
+	if (atomic_dec_return(&lo->lo_refcnt) || lo->lo_state != Lo_bound)
 		goto out_unlock;
-
+	/*
+	 * Clear this loop device since nobody is using it. Note that since
+	 * lo_open() increments lo->lo_refcnt without holding lo->lo_mutex,
+	 * we might no longer be the last user by now, but there was at
+	 * least a moment with no user.
+	 *
+	 * In autoclear mode, destroy WQ and remove configuration.
+	 * Otherwise flush possible ongoing bios in WQ and keep configuration.
+	 */
 	if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
-		if (lo->lo_state != Lo_bound)
-			goto out_unlock;
 		lo->lo_state = Lo_rundown;
 		mutex_unlock(&lo->lo_mutex);
-		/*
-		 * In autoclear mode, stop the loop thread
-		 * and remove configuration after last close.
-		 */
-		__loop_clr_fd(lo, true);
-		return;
-	} else if (lo->lo_state == Lo_bound) {
-		/*
-		 * Otherwise keep thread (if running) and config,
-		 * but flush possible ongoing bios in thread.
-		 */
+		__loop_clr_fd(lo);
+		goto done;
+	} else {
 		blk_mq_freeze_queue(lo->lo_queue);
 		blk_mq_unfreeze_queue(lo->lo_queue);
 	}
 
 out_unlock:
 	mutex_unlock(&lo->lo_mutex);
+done:
+	/* Drop references which would have been dropped after lo_release(). */
+	kobject_put(&disk_to_dev(disk)->kobj);
+	module_put(disk->fops->owner);
+	atomic_dec(&lo->async_pending);
+	return 0;
+}
+
+static const struct file_operations loop_close_fops = {
+	.open = lo_no_open,
+	.release = lo_post_release,
+};
+
+struct release_event {
+	struct list_head head;
+	struct file *file;
+};
+
+static LIST_HEAD(release_event_spool);
+static DEFINE_SPINLOCK(release_event_spinlock);
+
+static int lo_open(struct block_device *bdev, fmode_t mode)
+{
+	struct loop_device *lo = bdev->bd_disk->private_data;
+	int err = 0;
+	/*
+	 * Reserve an anon_inode for calling lo_post_open() before returning to
+	 * user mode. Since fput() by a kernel thread defers actual cleanup to WQ
+	 * context, there is no point in calling lo_post_open() from kernel
+	 * threads.
+	 */
+	struct file *open = (current->flags & PF_KTHREAD) ? NULL :
+		anon_inode_getfile("loop.open", &loop_open_fops, NULL, O_CLOEXEC);
+	/*
+	 * Reserve an anon_inode for calling lo_post_release() from
+	 * lo_release() before returning to user mode.
+	 */
+	struct file *close = anon_inode_getfile("loop.close", &loop_close_fops, NULL, O_CLOEXEC);
+	struct release_event *ev = kmalloc(sizeof(*ev), GFP_KERNEL | __GFP_NOWARN);
+
+	if (!ev || IS_ERR(open) || IS_ERR(close)) {
+		err = -ENOMEM;
+		goto cleanup;
+	}
+
+	spin_lock(&loop_delete_spinlock);
+	/* lo->lo_state may be changed to any Lo_* but Lo_deleting. */
+	if (data_race(lo->lo_state) == Lo_deleting)
+		err = -ENXIO;
+	else
+		atomic_inc(&lo->lo_refcnt);
+	spin_unlock(&loop_delete_spinlock);
+	if (err)
+		goto cleanup;
+
+	/* close->private_data is set when lo_release() is called. */
+	ev->file = close;
+	spin_lock(&release_event_spinlock);
+	list_add(&ev->head, &release_event_spool);
+	spin_unlock(&release_event_spinlock);
+
+	/*
+	 * Try to avoid accessing Lo_rundown loop device.
+	 *
+	 * Since the task_work list is LIFO, lo_post_release() will run before
+	 * lo_post_open() runs if an error occurs between returning from
+	 * lo_open() and returning to user mode. This means that lo->lo_refcnt may
+	 * be already 0 when lo_post_open() runs. Therefore, use
+	 * lo->async_pending in order to prevent loop_remove() from releasing
+	 * this loop device.
+	 */
+	if (open) {
+		open->private_data = lo;
+		atomic_inc(&lo->async_pending);
+		fput(open);
+	}
+	return 0;
+ cleanup:
+	if (!IS_ERR_OR_NULL(open))
+		fput(open);
+	if (!IS_ERR(close))
+		fput(close);
+	kfree(ev);
+	return err;
+}
+
+static void lo_release(struct gendisk *disk, fmode_t mode)
+{
+	struct loop_device *lo = disk->private_data;
+	struct release_event *ev;
+
+	atomic_inc(&lo->async_pending);
+	/*
+	 * Fetch from spool. Since a successful lo_open() call is coupled with
+	 * a lo_release() call, we are guaranteed that spool is not empty.
+	 */
+	spin_lock(&release_event_spinlock);
+	ev = list_first_entry(&release_event_spool, typeof(*ev), head);
+	list_del(&ev->head);
+	spin_unlock(&release_event_spinlock);
+	/* Hold references which will be dropped after lo_release(). */
+	__module_get(disk->fops->owner);
+	kobject_get(&disk_to_dev(disk)->kobj);
+	ev->file->private_data = disk;
+	fput(ev->file);
+	kfree(ev);
 }
 
 static const struct block_device_operations lo_fops = {
@@ -2029,6 +2140,7 @@ static int loop_add(int i)
 	if (!part_shift)
 		disk->flags |= GENHD_FL_NO_PART;
 	atomic_set(&lo->lo_refcnt, 0);
+	atomic_set(&lo->async_pending, 0);
 	mutex_init(&lo->lo_mutex);
 	lo->lo_number		= i;
 	spin_lock_init(&lo->lo_lock);
@@ -2070,6 +2182,9 @@ static int loop_add(int i)
 
 static void loop_remove(struct loop_device *lo)
 {
+	/* Wait for task work and/or WQ to complete. */
+	while (atomic_read(&lo->async_pending))
+		schedule_timeout_uninterruptible(1);
 	/* Make this loop device unreachable from pathname. */
 	del_gendisk(lo->lo_disk);
 	blk_cleanup_disk(lo->lo_disk);
@@ -2118,19 +2233,18 @@ static int loop_control_remove(int idx)
 	ret = mutex_lock_killable(&lo->lo_mutex);
 	if (ret)
 		goto mark_visible;
-	if (lo->lo_state != Lo_unbound ||
-	    atomic_read(&lo->lo_refcnt) > 0) {
-		mutex_unlock(&lo->lo_mutex);
+	spin_lock(&loop_delete_spinlock);
+	/* Mark this loop device no longer open()-able if nobody is using. */
+	if (lo->lo_state != Lo_unbound || atomic_read(&lo->lo_refcnt) > 0)
 		ret = -EBUSY;
-		goto mark_visible;
-	}
-	/* Mark this loop device no longer open()-able. */
-	lo->lo_state = Lo_deleting;
+	else
+		lo->lo_state = Lo_deleting;
+	spin_unlock(&loop_delete_spinlock);
 	mutex_unlock(&lo->lo_mutex);
-
-	loop_remove(lo);
-	return 0;
-
+	if (!ret) {
+		loop_remove(lo);
+		return 0;
+	}
 mark_visible:
 	/* Show this loop device again. */
 	mutex_lock(&loop_ctl_mutex);
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 082d4b6bfc6a..20fc5eebe455 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -56,6 +56,7 @@ struct loop_device {
 	struct gendisk		*lo_disk;
 	struct mutex		lo_mutex;
 	bool			idr_visible;
+	atomic_t		async_pending;
 };
 
 struct loop_cmd {
-- 
2.32.0



* Re: [syzbot] possible deadlock in blkdev_put (3)
  2022-03-12 16:25 [syzbot] possible deadlock in blkdev_put (3) syzbot
  2022-03-12 17:08 ` Tetsuo Handa
@ 2022-03-14  4:45 ` syzbot
  1 sibling, 0 replies; 10+ messages in thread
From: syzbot @ 2022-03-14  4:45 UTC (permalink / raw)
  To: axboe, hch, jack, linux-block, linux-kernel, penguin-kernel,
	syzkaller-bugs

syzbot has found a reproducer for the following issue on:

HEAD commit:    f0e18b03fcaf Merge tag 'x86_urgent_for_v5.17_rc8' of git:/..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1547fb03700000
kernel config:  https://syzkaller.appspot.com/x/.config?x=aba0ab2928a512c2
dashboard link: https://syzkaller.appspot.com/bug?extid=6479585dfd4dedd3f7e1
compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1704bd29700000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=14f78d41700000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+6479585dfd4dedd3f7e1@syzkaller.appspotmail.com

======================================================
WARNING: possible circular locking dependency detected
5.17.0-rc7-syzkaller-00241-gf0e18b03fcaf #0 Not tainted
------------------------------------------------------
udevd/3652 is trying to acquire lock:
ffff888018c7a938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0xe1/0x13a0 kernel/workqueue.c:2824

but task is already holding lock:
ffff88801a0fa918 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x99/0x950 block/bdev.c:902

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #6 (&disk->open_mutex){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:600 [inline]
       __mutex_lock+0x12f/0x12f0 kernel/locking/mutex.c:733
       blkdev_get_by_dev.part.0+0x40e/0xc70 block/bdev.c:804
       blkdev_get_by_dev+0x6b/0x80 block/bdev.c:847
       swsusp_check+0x97/0x420 kernel/power/swap.c:1526
       software_resume.part.0+0x102/0x1f0 kernel/power/hibernate.c:979
       software_resume kernel/power/hibernate.c:86 [inline]
       resume_store+0x161/0x190 kernel/power/hibernate.c:1181
       kobj_attr_store+0x50/0x80 lib/kobject.c:856
       sysfs_kf_write+0x110/0x160 fs/sysfs/file.c:136
       kernfs_fop_write_iter+0x3f8/0x610 fs/kernfs/file.c:296
       call_write_iter include/linux/fs.h:2074 [inline]
       new_sync_write+0x431/0x660 fs/read_write.c:503
       vfs_write+0x7cd/0xae0 fs/read_write.c:590
       ksys_write+0x12d/0x250 fs/read_write.c:643
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #5 (system_transition_mutex/1){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:600 [inline]
       __mutex_lock+0x12f/0x12f0 kernel/locking/mutex.c:733
       software_resume.part.0+0x19/0x1f0 kernel/power/hibernate.c:934
       software_resume kernel/power/hibernate.c:86 [inline]
       resume_store+0x161/0x190 kernel/power/hibernate.c:1181
       kobj_attr_store+0x50/0x80 lib/kobject.c:856
       sysfs_kf_write+0x110/0x160 fs/sysfs/file.c:136
       kernfs_fop_write_iter+0x3f8/0x610 fs/kernfs/file.c:296
       call_write_iter include/linux/fs.h:2074 [inline]
       new_sync_write+0x431/0x660 fs/read_write.c:503
       vfs_write+0x7cd/0xae0 fs/read_write.c:590
       ksys_write+0x12d/0x250 fs/read_write.c:643
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #4 (&of->mutex){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:600 [inline]
       __mutex_lock+0x12f/0x12f0 kernel/locking/mutex.c:733
       kernfs_seq_start+0x47/0x470 fs/kernfs/file.c:112
       seq_read_iter+0x2c6/0x1280 fs/seq_file.c:225
       kernfs_fop_read_iter+0x514/0x6f0 fs/kernfs/file.c:241
       call_read_iter include/linux/fs.h:2068 [inline]
       new_sync_read+0x429/0x6e0 fs/read_write.c:400
       vfs_read+0x35c/0x600 fs/read_write.c:481
       ksys_read+0x12d/0x250 fs/read_write.c:619
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #3 (&p->lock){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:600 [inline]
       __mutex_lock+0x12f/0x12f0 kernel/locking/mutex.c:733
       seq_read_iter+0xdf/0x1280 fs/seq_file.c:182
       call_read_iter include/linux/fs.h:2068 [inline]
       generic_file_splice_read+0x45b/0x6d0 fs/splice.c:311
       do_splice_to+0x1bf/0x250 fs/splice.c:796
       splice_direct_to_actor+0x2c2/0x8c0 fs/splice.c:870
       do_splice_direct+0x1b3/0x280 fs/splice.c:979
       do_sendfile+0xaf2/0x1250 fs/read_write.c:1245
       __do_sys_sendfile64 fs/read_write.c:1310 [inline]
       __se_sys_sendfile64 fs/read_write.c:1296 [inline]
       __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1296
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #2 (sb_writers#3){.+.+}-{0:0}:
       percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
       __sb_start_write include/linux/fs.h:1722 [inline]
       sb_start_write include/linux/fs.h:1792 [inline]
       file_start_write include/linux/fs.h:2937 [inline]
       lo_write_bvec drivers/block/loop.c:243 [inline]
       lo_write_simple drivers/block/loop.c:266 [inline]
       do_req_filebacked drivers/block/loop.c:495 [inline]
       loop_handle_cmd drivers/block/loop.c:1852 [inline]
       loop_process_work+0x1499/0x1db0 drivers/block/loop.c:1892
       process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
       worker_thread+0x657/0x1110 kernel/workqueue.c:2454
       kthread+0x2e9/0x3a0 kernel/kthread.c:377
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
       process_one_work+0x91b/0x1650 kernel/workqueue.c:2283
       worker_thread+0x657/0x1110 kernel/workqueue.c:2454
       kthread+0x2e9/0x3a0 kernel/kthread.c:377
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:3063 [inline]
       check_prevs_add kernel/locking/lockdep.c:3186 [inline]
       validate_chain kernel/locking/lockdep.c:3801 [inline]
       __lock_acquire+0x2ad4/0x56c0 kernel/locking/lockdep.c:5027
       lock_acquire kernel/locking/lockdep.c:5639 [inline]
       lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604
       flush_workqueue+0x110/0x13a0 kernel/workqueue.c:2827
       drain_workqueue+0x1a5/0x3c0 kernel/workqueue.c:2992
       destroy_workqueue+0x71/0x800 kernel/workqueue.c:4429
       __loop_clr_fd+0x1ab/0xe20 drivers/block/loop.c:1124
       lo_release+0x1ac/0x1f0 drivers/block/loop.c:1756
       blkdev_put_whole block/bdev.c:689 [inline]
       blkdev_put+0x2de/0x950 block/bdev.c:944
       blkdev_close+0x6a/0x80 block/fops.c:517
       __fput+0x286/0x9f0 fs/file_table.c:317
       task_work_run+0xdd/0x1a0 kernel/task_work.c:164
       tracehook_notify_resume include/linux/tracehook.h:188 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
       exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207
       __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
       syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
       do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
       entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
  (wq_completion)loop0 --> system_transition_mutex/1 --> &disk->open_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&disk->open_mutex);
                               lock(system_transition_mutex/1);
                               lock(&disk->open_mutex);
  lock((wq_completion)loop0);

 *** DEADLOCK ***

1 lock held by udevd/3652:
 #0: ffff88801a0fa918 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x99/0x950 block/bdev.c:902

stack backtrace:
CPU: 0 PID: 3652 Comm: udevd Not tainted 5.17.0-rc7-syzkaller-00241-gf0e18b03fcaf #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 check_noncircular+0x25f/0x2e0 kernel/locking/lockdep.c:2143
 check_prev_add kernel/locking/lockdep.c:3063 [inline]
 check_prevs_add kernel/locking/lockdep.c:3186 [inline]
 validate_chain kernel/locking/lockdep.c:3801 [inline]
 __lock_acquire+0x2ad4/0x56c0 kernel/locking/lockdep.c:5027
 lock_acquire kernel/locking/lockdep.c:5639 [inline]
 lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604
 flush_workqueue+0x110/0x13a0 kernel/workqueue.c:2827
 drain_workqueue+0x1a5/0x3c0 kernel/workqueue.c:2992
 destroy_workqueue+0x71/0x800 kernel/workqueue.c:4429
 __loop_clr_fd+0x1ab/0xe20 drivers/block/loop.c:1124
 lo_release+0x1ac/0x1f0 drivers/block/loop.c:1756
 blkdev_put_whole block/bdev.c:689 [inline]
 blkdev_put+0x2de/0x950 block/bdev.c:944
 blkdev_close+0x6a/0x80 block/fops.c:517
 __fput+0x286/0x9f0 fs/file_table.c:317
 task_work_run+0xdd/0x1a0 kernel/task_work.c:164
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
 exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207
 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
 syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f0ca7b90fc3
Code: 48 ff ff ff b8 ff ff ff ff e9 3e ff ff ff 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 83 ec 18 89 7c 24 0c e8
RSP: 002b:00007ffcf6cd76d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 00007f0ca7a396a8 RCX: 00007f0ca7b90fc3
RDX: 000000000000001c RSI: 00007ffcf6cd6ed8 RDI: 0000000000000008
RBP: 0000564a5512feb0 R08: 0000000000000007 R09: 0000564a55126a00
R10: 0000000002423870 R11: 0000000000000246 R12: 0000000000000002
R13: 0000564a55116f80 R14: 0000000000000008 R15: 0000564a550e82c0
 </TASK>
I/O error, dev loop0, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
I/O error, dev loop0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Buffer I/O error on dev loop0, logical block 0, async page read


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] loop: don't hold lo->lo_mutex from lo_open() and lo_release()
  2022-03-13 15:15   ` [PATCH v2] loop: don't hold lo->lo_mutex from lo_open() and lo_release() Tetsuo Handa
@ 2022-03-14 15:23     ` Jan Kara
  2022-03-15  8:44       ` Christoph Hellwig
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2022-03-14 15:23 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Jan Kara, Christoph Hellwig, syzbot, axboe, linux-block

On Mon 14-03-22 00:15:22, Tetsuo Handa wrote:
> syzbot is reporting a circular locking problem at __loop_clr_fd() [1],
> because commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of
> kworker") calls destroy_workqueue() with disk->open_mutex held.
> 
> Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
> tried to fix this problem by deferring __loop_clr_fd() from lo_release()
> to a WQ context, but that commit was reverted because there are userspace
> programs which depend on __loop_clr_fd() completing by the moment
> close() returns to user mode.
> 
> This time, we try to fix this problem by deferring __loop_clr_fd() from
> lo_release() to a task work context. Since task_work_add() is not exported
> and Christoph does not want to export it, this patch uses the anonymous
> inode interface as a wrapper for calling task_work_add() via fput().
> 
> Although Jan found two bugs in /bin/mount, one of which was fixed in the
> util-linux package [2], I found that not holding lo->lo_mutex from
> lo_open() causes occasional I/O errors [3] (due to the
> 
> 	if (lo->lo_state != Lo_bound)
> 		return BLK_STS_IOERR;
> 
> check in loop_queue_rq()) when systemd-udevd opens a loop device and
> reads from it before loop_configure() updates lo->lo_state to Lo_bound.

Thanks for the detailed explanation here. Finally, I can see what the
problem with the open path is. A few questions here:

1) Can you please remind me why we need to also remove the lo_mutex from
lo_open()? The original lockdep problem was only for __loop_clr_fd(),
wasn't it?

2) Can't we just move the generation of the DISK_EVENT_MEDIA_CHANGE event
to after the moment the loop device is configured? That way systemd won't
start probing the new loop device until it is fully configured.

Because the fewer task_work hacks we have, the better.

Also I don't think the deadlock is really fixed when you wait for
__loop_clr_fd() during open. It is maybe just hidden from lockdep. But the
fundamental problem of the cycle syzbot constructed is that flushing of IO
to a loop device may end up depending on open of that loop device to
finish. So if pending __loop_clr_fd() is a problem for somebody (is it?),
we still need to let open finish and deal with waiting for __loop_clr_fd()
in other places where that actually causes problem (and make sure
Lo_rundown is either treated as Lo_unbound or wait for Lo_rundown to
transition to Lo_unbound).

> That is, not only do we need to wait for __loop_clr_fd() upon the close()
> event (in order to avoid an unexpected umount() failure), but we also
> need to wait for lo->lo_mutex upon the open() event (in order to avoid
> unexpected I/O errors). Therefore, schedule a task work from both
> lo_open() and
> lo_release().

Honestly, the anon inode trick makes the code pretty much unreadable. As
far as I remember, Christoph was not in principle against using task_work.
He just wanted the task_work API to be improved so that it is easier to use.

								Honza

 
> Link: https://syzkaller.appspot.com/bug?extid=6479585dfd4dedd3f7e1 [1]
> Link: https://github.com/util-linux/util-linux/commit/3e1fc3bbee99384c [2]
> Link: https://lkml.kernel.org/r/a72c59c6-298b-e4ba-b1f5-2275afab49a1@I-love.SAKURA.ne.jp [3]
> Reported-by: syzbot <syzbot+6479585dfd4dedd3f7e1@syzkaller.appspotmail.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  drivers/block/loop.c | 244 +++++++++++++++++++++++++++++++------------
>  drivers/block/loop.h |   1 +
>  2 files changed, 180 insertions(+), 65 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 19fe19eaa50e..2f60356ad428 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -80,6 +80,7 @@
>  #include <linux/blk-cgroup.h>
>  #include <linux/sched/mm.h>
>  #include <linux/statfs.h>
> +#include <linux/anon_inodes.h>
>  
>  #include "loop.h"
>  
> @@ -90,6 +91,8 @@
>  static DEFINE_IDR(loop_index_idr);
>  static DEFINE_MUTEX(loop_ctl_mutex);
>  static DEFINE_MUTEX(loop_validate_mutex);
> +static DEFINE_SPINLOCK(loop_delete_spinlock);
> +static DECLARE_WAIT_QUEUE_HEAD(loop_rundown_wait);
>  
>  /**
>   * loop_global_lock_killable() - take locks for safe loop_validate_file() test
> @@ -1088,7 +1091,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
>  	return error;
>  }
>  
> -static void __loop_clr_fd(struct loop_device *lo, bool release)
> +static void __loop_clr_fd(struct loop_device *lo)
>  {
>  	struct file *filp;
>  	gfp_t gfp = lo->old_gfp_mask;
> @@ -1150,8 +1153,6 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
>  	/* let user-space know about this change */
>  	kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
>  	mapping_set_gfp_mask(filp->f_mapping, gfp);
> -	/* This is safe: open() is still holding a reference. */
> -	module_put(THIS_MODULE);
>  	blk_mq_unfreeze_queue(lo->lo_queue);
>  
>  	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
> @@ -1159,37 +1160,18 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
>  	if (lo->lo_flags & LO_FLAGS_PARTSCAN) {
>  		int err;
>  
> -		/*
> -		 * open_mutex has been held already in release path, so don't
> -		 * acquire it if this function is called in such case.
> -		 *
> -		 * If the reread partition isn't from release path, lo_refcnt
> -		 * must be at least one and it can only become zero when the
> -		 * current holder is released.
> -		 */
> -		if (!release)
> -			mutex_lock(&lo->lo_disk->open_mutex);
> +		mutex_lock(&lo->lo_disk->open_mutex);
>  		err = bdev_disk_changed(lo->lo_disk, false);
> -		if (!release)
> -			mutex_unlock(&lo->lo_disk->open_mutex);
> +		mutex_unlock(&lo->lo_disk->open_mutex);
>  		if (err)
>  			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
>  				__func__, lo->lo_number, err);
>  		/* Device is gone, no point in returning error */
>  	}
>  
> -	/*
> -	 * lo->lo_state is set to Lo_unbound here after above partscan has
> -	 * finished. There cannot be anybody else entering __loop_clr_fd() as
> -	 * Lo_rundown state protects us from all the other places trying to
> -	 * change the 'lo' device.
> -	 */
>  	lo->lo_flags = 0;
>  	if (!part_shift)
>  		lo->lo_disk->flags |= GENHD_FL_NO_PART;
> -	mutex_lock(&lo->lo_mutex);
> -	lo->lo_state = Lo_unbound;
> -	mutex_unlock(&lo->lo_mutex);
>  
>  	/*
>  	 * Need not hold lo_mutex to fput backing file. Calling fput holding
> @@ -1197,6 +1179,11 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
>  	 * fput can take open_mutex which is usually taken before lo_mutex.
>  	 */
>  	fput(filp);
> +	mutex_lock(&lo->lo_mutex);
> +	lo->lo_state = Lo_unbound;
> +	mutex_unlock(&lo->lo_mutex);
> +	wake_up_all(&loop_rundown_wait);
> +	module_put(THIS_MODULE);
>  }
>  
>  static int loop_clr_fd(struct loop_device *lo)
> @@ -1228,7 +1215,7 @@ static int loop_clr_fd(struct loop_device *lo)
>  	lo->lo_state = Lo_rundown;
>  	mutex_unlock(&lo->lo_mutex);
>  
> -	__loop_clr_fd(lo, false);
> +	__loop_clr_fd(lo);
>  	return 0;
>  }
>  
> @@ -1720,52 +1707,176 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
>  }
>  #endif
>  
> -static int lo_open(struct block_device *bdev, fmode_t mode)
> +static int lo_no_open(struct inode *inode, struct file *file)
>  {
> -	struct loop_device *lo = bdev->bd_disk->private_data;
> -	int err;
> +	return -ENXIO;
> +}
>  
> -	err = mutex_lock_killable(&lo->lo_mutex);
> -	if (err)
> -		return err;
> -	if (lo->lo_state == Lo_deleting)
> -		err = -ENXIO;
> -	else
> -		atomic_inc(&lo->lo_refcnt);
> -	mutex_unlock(&lo->lo_mutex);
> -	return err;
> +static int lo_post_open(struct inode *inode, struct file *file)
> +{
> +	struct loop_device *lo = file->private_data;
> +
> +	/* Nothing to do if lo_open() failed. */
> +	if (!lo)
> +		return 0;
> +	/* Wait for loop_configure()/loop_post_release() to complete. */
> +	if (mutex_lock_killable(&lo->lo_mutex) == 0)
> +		mutex_unlock(&lo->lo_mutex);
> +	/*
> +	 * Also wait for __loop_clr_fd() to complete because
> +	 * loop_post_release() releases lo->lo_mutex before __loop_clr_fd()
> +	 * is called.
> +	 */
> +	wait_event_killable(loop_rundown_wait, data_race(lo->lo_state) != Lo_rundown);
> +	atomic_dec(&lo->async_pending);
> +	return 0;
>  }
>  
> -static void lo_release(struct gendisk *disk, fmode_t mode)
> +static const struct file_operations loop_open_fops = {
> +	.open = lo_no_open,
> +	.release = lo_post_open,
> +};
> +
> +static int lo_post_release(struct inode *inode, struct file *file)
>  {
> -	struct loop_device *lo = disk->private_data;
> +	struct gendisk *disk = file->private_data;
> +	struct loop_device *lo;
>  
> +	/* Nothing to do if lo_open() failed. */
> +	if (!disk)
> +		return 0;
> +	lo = disk->private_data;
>  	mutex_lock(&lo->lo_mutex);
> -	if (atomic_dec_return(&lo->lo_refcnt))
> +	/* Check whether this loop device can be cleared. */
> +	if (atomic_dec_return(&lo->lo_refcnt) || lo->lo_state != Lo_bound)
>  		goto out_unlock;
> -
> +	/*
> +	 * Clear this loop device since nobody is using it. Note that since
> +	 * lo_open() increments lo->lo_refcnt without holding lo->lo_mutex,
> +	 * we might no longer be the last user by now, but there was a moment
> +	 * when there was no user.
> +	 *
> +	 * In autoclear mode, destroy WQ and remove configuration.
> +	 * Otherwise flush possible ongoing bios in WQ and keep configuration.
> +	 */
>  	if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
> -		if (lo->lo_state != Lo_bound)
> -			goto out_unlock;
>  		lo->lo_state = Lo_rundown;
>  		mutex_unlock(&lo->lo_mutex);
> -		/*
> -		 * In autoclear mode, stop the loop thread
> -		 * and remove configuration after last close.
> -		 */
> -		__loop_clr_fd(lo, true);
> -		return;
> -	} else if (lo->lo_state == Lo_bound) {
> -		/*
> -		 * Otherwise keep thread (if running) and config,
> -		 * but flush possible ongoing bios in thread.
> -		 */
> +		__loop_clr_fd(lo);
> +		goto done;
> +	} else {
>  		blk_mq_freeze_queue(lo->lo_queue);
>  		blk_mq_unfreeze_queue(lo->lo_queue);
>  	}
>  
>  out_unlock:
>  	mutex_unlock(&lo->lo_mutex);
> +done:
> +	/* Drop references which would have been dropped after lo_release(). */
> +	kobject_put(&disk_to_dev(disk)->kobj);
> +	module_put(disk->fops->owner);
> +	atomic_dec(&lo->async_pending);
> +	return 0;
> +}
> +
> +static const struct file_operations loop_close_fops = {
> +	.open = lo_no_open,
> +	.release = lo_post_release,
> +};
> +
> +struct release_event {
> +	struct list_head head;
> +	struct file *file;
> +};
> +
> +static LIST_HEAD(release_event_spool);
> +static DEFINE_SPINLOCK(release_event_spinlock);
> +
> +static int lo_open(struct block_device *bdev, fmode_t mode)
> +{
> +	struct loop_device *lo = bdev->bd_disk->private_data;
> +	int err = 0;
> +	/*
> +	 * Reserve an anon_inode for calling lo_post_open() before returning to
> +	 * user mode. Since fput() by kernel thread defers actual cleanup to WQ
> +	 * context, there is no point in calling lo_post_open() from kernel
> +	 * threads.
> +	 */
> +	struct file *open = (current->flags & PF_KTHREAD) ? NULL :
> +		anon_inode_getfile("loop.open", &loop_open_fops, NULL, O_CLOEXEC);
> +	/*
> +	 * Reserve an anon_inode for calling lo_post_release() from
> +	 * lo_release() before returning to user mode.
> +	 */
> +	struct file *close = anon_inode_getfile("loop.close", &loop_close_fops, NULL, O_CLOEXEC);
> +	struct release_event *ev = kmalloc(sizeof(*ev), GFP_KERNEL | __GFP_NOWARN);
> +
> +	if (!ev || IS_ERR(open) || IS_ERR(close)) {
> +		err = -ENOMEM;
> +		goto cleanup;
> +	}
> +
> +	spin_lock(&loop_delete_spinlock);
> +	/* lo->lo_state may be changed to any Lo_* but Lo_deleting. */
> +	if (data_race(lo->lo_state) == Lo_deleting)
> +		err = -ENXIO;
> +	else
> +		atomic_inc(&lo->lo_refcnt);
> +	spin_unlock(&loop_delete_spinlock);
> +	if (err)
> +		goto cleanup;
> +
> +	/* close->private_data is set when lo_release() is called. */
> +	ev->file = close;
> +	spin_lock(&release_event_spinlock);
> +	list_add(&ev->head, &release_event_spool);
> +	spin_unlock(&release_event_spinlock);
> +
> +	/*
> +	 * Try to avoid accessing Lo_rundown loop device.
> +	 *
> +	 * Since the task_work list is LIFO, lo_post_release() will run before
> +	 * lo_post_open() runs if an error occurs between returning from
> +	 * lo_open() and returning to user mode. This means that lo->lo_refcnt
> +	 * may already be 0 when lo_post_open() runs. Therefore, use
> +	 * lo->async_pending in order to prevent loop_remove() from releasing
> +	 * this loop device.
> +	 */
> +	if (open) {
> +		open->private_data = lo;
> +		atomic_inc(&lo->async_pending);
> +		fput(open);
> +	}
> +	return 0;
> + cleanup:
> +	if (!IS_ERR_OR_NULL(open))
> +		fput(open);
> +	if (!IS_ERR(close))
> +		fput(close);
> +	kfree(ev);
> +	return err;
> +}
> +
> +static void lo_release(struct gendisk *disk, fmode_t mode)
> +{
> +	struct loop_device *lo = disk->private_data;
> +	struct release_event *ev;
> +
> +	atomic_inc(&lo->async_pending);
> +	/*
> +	 * Fetch from spool. Since a successful lo_open() call is coupled with
> +	 * a lo_release() call, we are guaranteed that spool is not empty.
> +	 */
> +	spin_lock(&release_event_spinlock);
> +	ev = list_first_entry(&release_event_spool, typeof(*ev), head);
> +	list_del(&ev->head);
> +	spin_unlock(&release_event_spinlock);
> +	/* Hold references which will be dropped after lo_release(). */
> +	__module_get(disk->fops->owner);
> +	kobject_get(&disk_to_dev(disk)->kobj);
> +	ev->file->private_data = disk;
> +	fput(ev->file);
> +	kfree(ev);
>  }
>  
>  static const struct block_device_operations lo_fops = {
> @@ -2029,6 +2140,7 @@ static int loop_add(int i)
>  	if (!part_shift)
>  		disk->flags |= GENHD_FL_NO_PART;
>  	atomic_set(&lo->lo_refcnt, 0);
> +	atomic_set(&lo->async_pending, 0);
>  	mutex_init(&lo->lo_mutex);
>  	lo->lo_number		= i;
>  	spin_lock_init(&lo->lo_lock);
> @@ -2070,6 +2182,9 @@ static int loop_add(int i)
>  
>  static void loop_remove(struct loop_device *lo)
>  {
> +	/* Wait for task work and/or WQ to complete. */
> +	while (atomic_read(&lo->async_pending))
> +		schedule_timeout_uninterruptible(1);
>  	/* Make this loop device unreachable from pathname. */
>  	del_gendisk(lo->lo_disk);
>  	blk_cleanup_disk(lo->lo_disk);
> @@ -2118,19 +2233,18 @@ static int loop_control_remove(int idx)
>  	ret = mutex_lock_killable(&lo->lo_mutex);
>  	if (ret)
>  		goto mark_visible;
> -	if (lo->lo_state != Lo_unbound ||
> -	    atomic_read(&lo->lo_refcnt) > 0) {
> -		mutex_unlock(&lo->lo_mutex);
> +	spin_lock(&loop_delete_spinlock);
> +	/* Mark this loop device no longer open()-able if nobody is using. */
> +	if (lo->lo_state != Lo_unbound || atomic_read(&lo->lo_refcnt) > 0)
>  		ret = -EBUSY;
> -		goto mark_visible;
> -	}
> -	/* Mark this loop device no longer open()-able. */
> -	lo->lo_state = Lo_deleting;
> +	else
> +		lo->lo_state = Lo_deleting;
> +	spin_unlock(&loop_delete_spinlock);
>  	mutex_unlock(&lo->lo_mutex);
> -
> -	loop_remove(lo);
> -	return 0;
> -
> +	if (!ret) {
> +		loop_remove(lo);
> +		return 0;
> +	}
>  mark_visible:
>  	/* Show this loop device again. */
>  	mutex_lock(&loop_ctl_mutex);
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index 082d4b6bfc6a..20fc5eebe455 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -56,6 +56,7 @@ struct loop_device {
>  	struct gendisk		*lo_disk;
>  	struct mutex		lo_mutex;
>  	bool			idr_visible;
> +	atomic_t		async_pending;
>  };
>  
>  struct loop_cmd {
> -- 
> 2.32.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2] loop: don't hold lo->lo_mutex from lo_open() and lo_release()
  2022-03-14 15:23     ` Jan Kara
@ 2022-03-15  8:44       ` Christoph Hellwig
  2022-03-15 10:57         ` Tetsuo Handa
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2022-03-15  8:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: Tetsuo Handa, Christoph Hellwig, syzbot, axboe, linux-block

On Mon, Mar 14, 2022 at 04:23:18PM +0100, Jan Kara wrote:
> Honestly, the anon inode trick makes the code pretty much unreadable. As
> far as I remember Christoph was not pricipially against using task_work. He
> just wanted the task_work API to be improved so that it is easier to use.

This whole patch is awful.  And no, I don't think task_work_add really has
any business here.  We need to properly unwind the mess instead of piling
things higher and higher.


* Re: [PATCH v2] loop: don't hold lo->lo_mutex from lo_open() and lo_release()
  2022-03-15  8:44       ` Christoph Hellwig
@ 2022-03-15 10:57         ` Tetsuo Handa
  2022-03-16  8:44           ` Christoph Hellwig
  2022-03-16 12:26           ` Jan Kara
  0 siblings, 2 replies; 10+ messages in thread
From: Tetsuo Handa @ 2022-03-15 10:57 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara; +Cc: syzbot, axboe, linux-block

On 2022/03/15 17:44, Christoph Hellwig wrote:
> On Mon, Mar 14, 2022 at 04:23:18PM +0100, Jan Kara wrote:
>> Honestly, the anon inode trick makes the code pretty much unreadable. As
>> far as I remember Christoph was not pricipially against using task_work. He
>> just wanted the task_work API to be improved so that it is easier to use.
> 
> This whole patch is awful.  And no, I don't think task_work_add really has
> any business here.  We need to properly unwind the mess instead of piling
> things higher and higher.

How do you plan to unwind the mess?



On 2022/03/15 0:23, Jan Kara wrote:
> On Mon 14-03-22 00:15:22, Tetsuo Handa wrote:
>> syzbot is reporting circular locking problem at __loop_clr_fd() [1], for
>> commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of kworker")
>> is calling destroy_workqueue() with disk->open_mutex held.
>>
>> Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
>> tried to fix this problem by deferring __loop_clr_fd() from lo_release()
>> to a WQ context, but that commit was reverted because there are userspace
>> programs which depend on that __loop_clr_fd() completes by the moment
>> close() returns to user mode.
>>
>> This time, we try to fix this problem by deferring __loop_clr_fd() from
>> lo_release() to a task work context. Since task_work_add() is not exported
>> but Christoph does not want to export it, this patch uses anonymous inode
>> interface as a wrapper for calling task_work_add() via fput().
>>
>> Although Jan has found two bugs in /bin/mount where one of these was fixed
>> in util-linux package [2], I found that not holding lo->lo_mutex from
>> lo_open() causes occasional I/O error [3] (due to
>>
>> 	if (lo->lo_state != Lo_bound)
>> 		return BLK_STS_IOERR;
>>
>> check in loop_queue_rq()) when systemd-udevd opens a loop device and
>> reads from it before loop_configure() updates lo->lo_state to Lo_bound.
> 
> Thanks for detailed explanation here. Finally, I can see what is the
> problem with the open path. A few questions here:
> 
> 1) Can you please remind me why do we need to also remove the lo_mutex from
> lo_open()? The original lockdep problem was only for __loop_clr_fd(),
> wasn't it?

Right, but from a locking dependency chain perspective, holding lo->lo_mutex
in lo_open()/lo_release() is treated as if lo_open()/lo_release() waits for a
flush of I/O with lo->lo_mutex held, even if it is not lo_open()/lo_release()
that actually flushes I/O with lo->lo_mutex held.

We need to avoid holding lo->lo_mutex from lo_open()/lo_release() while
avoiding problems caused by not waiting for loop_configure()/__loop_clr_fd().

> 
> 2) Cannot we just move the generation of DISK_EVENT_MEDIA_CHANGE event
> after the moment the loop device is configured? That way systemd won't
> start probing the new loop device until it is fully configured.

Do you mean

--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1024,7 +1024,6 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
                goto out_unlock;
        }

-       disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
        set_disk_ro(lo->lo_disk, (lo->lo_flags & LO_FLAGS_READ_ONLY) != 0);

        INIT_WORK(&lo->rootcg_work, loop_rootcg_workfn);
@@ -1066,6 +1065,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
        wmb();

        lo->lo_state = Lo_bound;
+       disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
        if (part_shift)
                lo->lo_flags |= LO_FLAGS_PARTSCAN;
        partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;

? Maybe possible, but it looks strange. Such a change will defer only the
open()->read() sequence triggered via kobject_uevent(KOBJ_CHANGE).
My patch defers all open() calls from userspace, as we did until 5.11.

> 
> Because the less hacks with task_work we have the better.
> 
> Also I don't think the deadlock is really fixed when you wait for
> __loop_clr_fd() during open. It is maybe just hidden from lockdep.

Since lo_post_open() is called with no locks held, how can waiting
for __loop_clr_fd() after lo_open() cause a deadlock? The reason to
choose a task_work context is to synchronize without any other locks held.

Until 5.11, lo_open() and loop_configure() were serialized using
loop_ctl_mutex. In 5.12, lo_open() and loop_configure() started using
lo->lo_mutex, and serialization among multiple loop devices was fixed
in 5.13 using loop_validate_mutex. Therefore, basically we had been
waiting for loop_configure() and __loop_clr_fd() during open. My patch
just moves waiting for loop_configure() and __loop_clr_fd() in open
to outside of system_transition_mutex/1 and disk->open_mutex.

>                                                                    But the
> fundamental problem of the cycle syzbot constructed is that flushing of IO
> to a loop device may end up depending on open of that loop device to
> finish.

How can my patch make anything depend on the open of that loop device? The wait runs in task_work context.

>         So if pending __loop_clr_fd() is a problem for somebody (is it?),
> we still need to let open finish and deal with waiting for __loop_clr_fd()
> in other places where that actually causes problem (and make sure
> Lo_rundown is either treated as Lo_unbound or wait for Lo_rundown to
> transition to Lo_unbound).

lo_post_open() lets open finish and deals with waiting for __loop_clr_fd().



* Re: [PATCH v2] loop: don't hold lo->lo_mutex from lo_open() and lo_release()
  2022-03-15 10:57         ` Tetsuo Handa
@ 2022-03-16  8:44           ` Christoph Hellwig
  2022-03-16 12:26           ` Jan Kara
  1 sibling, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2022-03-16  8:44 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Christoph Hellwig, Jan Kara, syzbot, axboe, linux-block

On Tue, Mar 15, 2022 at 07:57:25PM +0900, Tetsuo Handa wrote:
> On 2022/03/15 17:44, Christoph Hellwig wrote:
> > On Mon, Mar 14, 2022 at 04:23:18PM +0100, Jan Kara wrote:
> >> Honestly, the anon inode trick makes the code pretty much unreadable. As
> >> far as I remember Christoph was not pricipially against using task_work. He
> >> just wanted the task_work API to be improved so that it is easier to use.
> > 
> > This whole patch is awful.  And no, I don't think task_work_add really has
> > any business here.  We need to properly unwind the mess instead of piling
> > things higher and higher.
> 
> How do you plan to unwind the mess?

Yes.  I'll send another respin of the previous approach we had last
time.


* Re: [PATCH v2] loop: don't hold lo->lo_mutex from lo_open() and lo_release()
  2022-03-15 10:57         ` Tetsuo Handa
  2022-03-16  8:44           ` Christoph Hellwig
@ 2022-03-16 12:26           ` Jan Kara
  2022-03-16 14:29             ` Tetsuo Handa
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Kara @ 2022-03-16 12:26 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Christoph Hellwig, Jan Kara, syzbot, axboe, linux-block

On Tue 15-03-22 19:57:25, Tetsuo Handa wrote:
> On 2022/03/15 17:44, Christoph Hellwig wrote:
> On 2022/03/15 0:23, Jan Kara wrote:
> > On Mon 14-03-22 00:15:22, Tetsuo Handa wrote:
> >> syzbot is reporting circular locking problem at __loop_clr_fd() [1], for
> >> commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of kworker")
> >> is calling destroy_workqueue() with disk->open_mutex held.
> >>
> >> Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
> >> tried to fix this problem by deferring __loop_clr_fd() from lo_release()
> >> to a WQ context, but that commit was reverted because there are userspace
> >> programs which depend on that __loop_clr_fd() completes by the moment
> >> close() returns to user mode.
> >>
> >> This time, we try to fix this problem by deferring __loop_clr_fd() from
> >> lo_release() to a task work context. Since task_work_add() is not exported
> >> but Christoph does not want to export it, this patch uses anonymous inode
> >> interface as a wrapper for calling task_work_add() via fput().
> >>
> >> Although Jan has found two bugs in /bin/mount where one of these was fixed
> >> in util-linux package [2], I found that not holding lo->lo_mutex from
> >> lo_open() causes occasional I/O error [3] (due to
> >>
> >> 	if (lo->lo_state != Lo_bound)
> >> 		return BLK_STS_IOERR;
> >>
> >> check in loop_queue_rq()) when systemd-udevd opens a loop device and
> >> reads from it before loop_configure() updates lo->lo_state to Lo_bound.
> > 
> > Thanks for detailed explanation here. Finally, I can see what is the
> > problem with the open path. A few questions here:
> > 
> > 1) Can you please remind me why do we need to also remove the lo_mutex from
> > lo_open()? The original lockdep problem was only for __loop_clr_fd(),
> > wasn't it?
> 
> Right, but from locking dependency chain perspective, holding
> lo->lo_mutex from lo_open()/lo_release() is treated as if
> lo_open()/lo_release() waits for flush of I/O with lo->lo_mutex held,
> even if it is not lo_open()/lo_release() which actually flushes I/O with
> lo->lo_mutex held.
> 
> We need to avoid holding lo->lo_mutex from lo_open()/lo_release() while
> avoiding problems caused by not waiting for loop_configure()/__loop_clr_fd().

I thought the whole point of these patches is to move waiting for IO from
under lo->lo_mutex and disk->open_mutex? And if we do that, there's no
problem with open needing either of these mutexes? What am I missing?

Anyway, I have no problem with removing lo->lo_mutex from lo_open() as e.g.
Christoph has done. I just disliked the task_work magic you did there...

> > 2) Cannot we just move the generation of DISK_EVENT_MEDIA_CHANGE event
> > after the moment the loop device is configured? That way systemd won't
> > start probing the new loop device until it is fully configured.
> 
> Do you mean
> 
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1024,7 +1024,6 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
>                 goto out_unlock;
>         }
> 
> -       disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
>         set_disk_ro(lo->lo_disk, (lo->lo_flags & LO_FLAGS_READ_ONLY) != 0);
> 
>         INIT_WORK(&lo->rootcg_work, loop_rootcg_workfn);
> @@ -1066,6 +1065,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
>         wmb();
> 
>         lo->lo_state = Lo_bound;
> +       disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
>         if (part_shift)
>                 lo->lo_flags |= LO_FLAGS_PARTSCAN;
>         partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;
> 
> ? Maybe possible, but looks strange.

Yes, something like that. But disk_force_media_change() has some
invalidation in it as well and that may need to stay in the original place
- needs a bit more thought & validation.

> Such a change will defer only the open()->read() sequence triggered via
> kobject_uevent(KOBJ_CHANGE). My patch defers all open() calls from userspace,
> like we did until 5.11.

Correct. But an EIO error is the correct return if you access an unbound loop
device. It is only after we tell userspace the device exists (i.e., after
we send the event) that we should make sure the IO succeeds.

> > Because the fewer hacks with task_work we have, the better.
> > 
> > Also I don't think the deadlock is really fixed when you wait for
> > __loop_clr_fd() during open. It may just be hidden from lockdep.
> 
> Since lo_post_open() is called with no locks held, how can waiting
> for __loop_clr_fd() after lo_open() cause a deadlock? The reason to
> choose the task_work context is to synchronize without any other locks held.

Sorry, I was wrong here. It will not cause a deadlock. But in-kernel openers
of the block device like swsusp_check() in kernel/power/swap.c (using
blkdev_get_by_dev()) will now be operating on a loop device without waiting
for __loop_clr_fd(). Which is, I guess, fine, but the inconsistency between
in-kernel and userspace bdev openers would be a bit weird. Anyway,
Christoph's patches look like a cleaner way forward to me...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2] loop: don't hold lo->lo_mutex from lo_open() and lo_release()
  2022-03-16 12:26           ` Jan Kara
@ 2022-03-16 14:29             ` Tetsuo Handa
  0 siblings, 0 replies; 10+ messages in thread
From: Tetsuo Handa @ 2022-03-16 14:29 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, syzbot, axboe, linux-block

On 2022/03/16 21:26, Jan Kara wrote:
>>> 1) Can you please remind me why we need to also remove the lo_mutex from
>>> lo_open()? The original lockdep problem was only for __loop_clr_fd(),
>>> wasn't it?
>>
>> Right, but from locking dependency chain perspective, holding
>> lo->lo_mutex from lo_open()/lo_release() is treated as if
>> lo_open()/lo_release() waits for flush of I/O with lo->lo_mutex held,
>> even if it is not lo_open()/lo_release() which actually flushes I/O with
>> lo->lo_mutex held.
>>
>> We need to avoid holding lo->lo_mutex from lo_open()/lo_release() while
>> avoiding problems caused by not waiting for loop_configure()/__loop_clr_fd().
> 
> I thought the whole point of these patches is to move waiting for IO from
> under lo->lo_mutex and disk->open_mutex?

Excuse me? I couldn't tell whether you meant "remove ... from ..." or "move ... from ... to ...".

The point of my patch is to avoid the

  disk->open_mutex => lo->lo_mutex

  disk->open_mutex => loop_validate_mutex => lo->lo_mutex

dependency chains, because there is a

  system_transition_mutex/1 => disk->open_mutex

dependency chain and

  blk_mq_freeze_queue()/blk_mq_unfreeze_queue()

are called with lo->lo_mutex held.

Christoph's "[PATCH 4/8] loop: don't freeze the queue in lo_release" and
"[PATCH 5/8] loop: only freeze the queue in __loop_clr_fd when needed" stop
calling blk_mq_freeze_queue()/blk_mq_unfreeze_queue() with lo->lo_mutex held
only from lo_release(). There are still locations (e.g. __loop_update_dio())
which call blk_mq_freeze_queue()/blk_mq_unfreeze_queue() with lo->lo_mutex held.

>                                          And if we do that, there's no
> problem with open needing either of these mutexes? What am I missing?
> 
> Anyway, I have no problem with removing lo->lo_mutex from lo_open() as e.g.
> Christoph did. I just disliked the task work magic you did there...
> 

Hmm, does "these patches" refer to "yet another approach to fix the loop lock order inversions v3"
rather than my "[PATCH v2] loop: don't hold lo->lo_mutex from lo_open() and lo_release()"?



>>> 2) Can't we just move the generation of the DISK_EVENT_MEDIA_CHANGE event
>>> to after the moment the loop device is configured? That way systemd won't
>>> start probing the new loop device until it is fully configured.
>>
> Yes, something like that. But disk_force_media_change() has some
> invalidation in it as well and that may need to stay in the original place
> - needs a bit more thought & validation.

Yes, please. I don't know about the invalidation.
But can't we instead remove

	if (lo->lo_state != Lo_bound)
		return BLK_STS_IOERR;

from loop_queue_rq()? There is some mechanism which can guarantee that a crash
does not happen even if loop_queue_work() is called (blk_mq_*freeze_queue()?),
isn't there?

> 
>> Such a change will defer only the open()->read() sequence triggered via
>> kobject_uevent(KOBJ_CHANGE). My patch defers all open() calls from userspace,
>> like we did until 5.11.
> 
> Correct. But an EIO error is the correct return if you access an unbound loop
> device. It is only after we tell userspace the device exists (i.e., after
> we send the event) that we should make sure the IO succeeds.

Userspace may open this loop device with arbitrary delay; this is an asynchronous
notification. By the time some userspace process opens this loop device and
issues a read request in response to "this device is ready", loop_clr_fd()
can be called (because lo_open() no longer holds lo->lo_mutex) by some other
process and this loop device can become unbound. There must be some mechanism
which guarantees that a crash does not happen.



> Sorry, I was wrong here. It will not cause a deadlock. But in-kernel openers
> of the block device like swsusp_check() in kernel/power/swap.c (using
> blkdev_get_by_dev()) will now be operating on a loop device without waiting
> for __loop_clr_fd(). Which is, I guess, fine, but the inconsistency between
> in-kernel and userspace bdev openers would be a bit weird.

It looks weird, but I think it is unavoidable in order to keep the kernel
locking dependency chain simple, for in-kernel openers might be holding
e.g. system_transition_mutex/1.

>                                                            Anyway,
> Christoph's patches look like a cleaner way forward to me...

My patch behaves as if

  int fd = open("/dev/loopX", 3);
  ioctl(fd, LOOP_CLR_FD);
  close(fd);

is automatically called when close() by the last user of /dev/loopX returns,
in order to keep the kernel locking dependency chain short and simple.
Nothing difficult.



end of thread, other threads:[~2022-03-16 14:29 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-12 16:25 [syzbot] possible deadlock in blkdev_put (3) syzbot
2022-03-12 17:08 ` Tetsuo Handa
2022-03-13 15:15   ` [PATCH v2] loop: don't hold lo->lo_mutex from lo_open() and lo_release() Tetsuo Handa
2022-03-14 15:23     ` Jan Kara
2022-03-15  8:44       ` Christoph Hellwig
2022-03-15 10:57         ` Tetsuo Handa
2022-03-16  8:44           ` Christoph Hellwig
2022-03-16 12:26           ` Jan Kara
2022-03-16 14:29             ` Tetsuo Handa
2022-03-14  4:45 ` [syzbot] possible deadlock in blkdev_put (3) syzbot
