* [syzbot] possible deadlock in blkdev_put (2)
@ 2021-11-27  7:10 syzbot
  2021-11-27 11:27 ` Tetsuo Handa
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: syzbot @ 2021-11-27  7:10 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    f81e94e91878 Add linux-next specific files for 20211125
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=16366216b00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=be9183de0824e4d7
dashboard link: https://syzkaller.appspot.com/bug?extid=643e4ce4b6ad1347d372
compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+643e4ce4b6ad1347d372@syzkaller.appspotmail.com

======================================================
WARNING: possible circular locking dependency detected
5.16.0-rc2-next-20211125-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.3/8204 is trying to acquire lock:
ffff88807df21138 ((wq_completion)loop3){+.+.}-{0:0}, at: flush_workqueue+0xe1/0x15b0 kernel/workqueue.c:2816

but task is already holding lock:
ffff88801a822918 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x97/0x9e0 block/bdev.c:913

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #6 (&disk->open_mutex){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:607 [inline]
       __mutex_lock+0x12f/0x12f0 kernel/locking/mutex.c:740
       blkdev_get_by_dev.part.0+0x24e/0xb50 block/bdev.c:818
       blkdev_get_by_dev+0x6b/0x80 block/bdev.c:858
       swsusp_check+0x97/0x2f0 kernel/power/swap.c:1520
       software_resume.part.0+0x102/0x1f0 kernel/power/hibernate.c:979
       software_resume kernel/power/hibernate.c:86 [inline]
       resume_store+0x161/0x190 kernel/power/hibernate.c:1181
       kobj_attr_store+0x50/0x80 lib/kobject.c:856
       sysfs_kf_write+0x110/0x160 fs/sysfs/file.c:136
       kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
       call_write_iter include/linux/fs.h:2162 [inline]
       new_sync_write+0x429/0x660 fs/read_write.c:503
       vfs_write+0x7cd/0xae0 fs/read_write.c:590
       ksys_write+0x12d/0x250 fs/read_write.c:643
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #5 (system_transition_mutex/1){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:607 [inline]
       __mutex_lock+0x12f/0x12f0 kernel/locking/mutex.c:740
       software_resume.part.0+0x19/0x1f0 kernel/power/hibernate.c:934
       software_resume kernel/power/hibernate.c:86 [inline]
       resume_store+0x161/0x190 kernel/power/hibernate.c:1181
       kobj_attr_store+0x50/0x80 lib/kobject.c:856
       sysfs_kf_write+0x110/0x160 fs/sysfs/file.c:136
       kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
       call_write_iter include/linux/fs.h:2162 [inline]
       new_sync_write+0x429/0x660 fs/read_write.c:503
       vfs_write+0x7cd/0xae0 fs/read_write.c:590
       ksys_write+0x12d/0x250 fs/read_write.c:643
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #4 (&of->mutex){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:607 [inline]
       __mutex_lock+0x12f/0x12f0 kernel/locking/mutex.c:740
       kernfs_seq_start+0x4b/0x260 fs/kernfs/file.c:112
       seq_read_iter+0x2c7/0x1240 fs/seq_file.c:225
       kernfs_fop_read_iter+0x44f/0x5f0 fs/kernfs/file.c:241
       call_read_iter include/linux/fs.h:2156 [inline]
       new_sync_read+0x421/0x6e0 fs/read_write.c:400
       vfs_read+0x35c/0x600 fs/read_write.c:481
       ksys_read+0x12d/0x250 fs/read_write.c:619
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #3 (&p->lock){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:607 [inline]
       __mutex_lock+0x12f/0x12f0 kernel/locking/mutex.c:740
       seq_read_iter+0xdf/0x1240 fs/seq_file.c:182
       kernfs_fop_read_iter+0x44f/0x5f0 fs/kernfs/file.c:241
       call_read_iter include/linux/fs.h:2156 [inline]
       generic_file_splice_read+0x453/0x6d0 fs/splice.c:311
       do_splice_to+0x1bf/0x250 fs/splice.c:796
       splice_direct_to_actor+0x2c2/0x8c0 fs/splice.c:870
       do_splice_direct+0x1b3/0x280 fs/splice.c:979
       do_sendfile+0xaf2/0x1250 fs/read_write.c:1245
       __do_sys_sendfile64 fs/read_write.c:1310 [inline]
       __se_sys_sendfile64 fs/read_write.c:1296 [inline]
       __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1296
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #2 (sb_writers#3){.+.+}-{0:0}:
       percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
       __sb_start_write include/linux/fs.h:1810 [inline]
       sb_start_write include/linux/fs.h:1880 [inline]
       file_start_write include/linux/fs.h:3008 [inline]
       lo_write_bvec drivers/block/loop.c:242 [inline]
       lo_write_simple drivers/block/loop.c:265 [inline]
       do_req_filebacked drivers/block/loop.c:494 [inline]
       loop_handle_cmd drivers/block/loop.c:1846 [inline]
       loop_process_work+0x1499/0x1db0 drivers/block/loop.c:1886
       process_one_work+0x9b2/0x1690 kernel/workqueue.c:2299
       worker_thread+0x658/0x11f0 kernel/workqueue.c:2446
       kthread+0x405/0x4f0 kernel/kthread.c:345
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
       process_one_work+0x921/0x1690 kernel/workqueue.c:2275
       worker_thread+0x658/0x11f0 kernel/workqueue.c:2446
       kthread+0x405/0x4f0 kernel/kthread.c:345
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

-> #0 ((wq_completion)loop3){+.+.}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:3063 [inline]
       check_prevs_add kernel/locking/lockdep.c:3186 [inline]
       validate_chain kernel/locking/lockdep.c:3801 [inline]
       __lock_acquire+0x2a07/0x54a0 kernel/locking/lockdep.c:5027
       lock_acquire kernel/locking/lockdep.c:5637 [inline]
       lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602
       flush_workqueue+0x110/0x15b0 kernel/workqueue.c:2819
       drain_workqueue+0x1a5/0x3c0 kernel/workqueue.c:2984
       destroy_workqueue+0x71/0x800 kernel/workqueue.c:4421
       __loop_clr_fd+0x1ab/0xe20 drivers/block/loop.c:1118
       lo_release+0x1ac/0x1f0 drivers/block/loop.c:1750
       blkdev_put_whole block/bdev.c:694 [inline]
       blkdev_put+0x2fb/0x9e0 block/bdev.c:955
       deactivate_locked_super+0x94/0x160 fs/super.c:335
       deactivate_super+0xad/0xd0 fs/super.c:366
       cleanup_mnt+0x3a2/0x540 fs/namespace.c:1137
       task_work_run+0xdd/0x1a0 kernel/task_work.c:164
       tracehook_notify_resume include/linux/tracehook.h:189 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
       exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207
       __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
       syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
       do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
       entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
  (wq_completion)loop3 --> system_transition_mutex/1 --> &disk->open_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&disk->open_mutex);
                               lock(system_transition_mutex/1);
                               lock(&disk->open_mutex);
  lock((wq_completion)loop3);

 *** DEADLOCK ***

1 lock held by syz-executor.3/8204:
 #0: ffff88801a822918 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x97/0x9e0 block/bdev.c:913

stack backtrace:
CPU: 0 PID: 8204 Comm: syz-executor.3 Not tainted 5.16.0-rc2-next-20211125-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 check_noncircular+0x25f/0x2e0 kernel/locking/lockdep.c:2143
 check_prev_add kernel/locking/lockdep.c:3063 [inline]
 check_prevs_add kernel/locking/lockdep.c:3186 [inline]
 validate_chain kernel/locking/lockdep.c:3801 [inline]
 __lock_acquire+0x2a07/0x54a0 kernel/locking/lockdep.c:5027
 lock_acquire kernel/locking/lockdep.c:5637 [inline]
 lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602
 flush_workqueue+0x110/0x15b0 kernel/workqueue.c:2819
 drain_workqueue+0x1a5/0x3c0 kernel/workqueue.c:2984
 destroy_workqueue+0x71/0x800 kernel/workqueue.c:4421
 __loop_clr_fd+0x1ab/0xe20 drivers/block/loop.c:1118
 lo_release+0x1ac/0x1f0 drivers/block/loop.c:1750
 blkdev_put_whole block/bdev.c:694 [inline]
 blkdev_put+0x2fb/0x9e0 block/bdev.c:955
 deactivate_locked_super+0x94/0x160 fs/super.c:335
 deactivate_super+0xad/0xd0 fs/super.c:366
 cleanup_mnt+0x3a2/0x540 fs/namespace.c:1137
 task_work_run+0xdd/0x1a0 kernel/task_work.c:164
 tracehook_notify_resume include/linux/tracehook.h:189 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
 exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207
 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
 syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fcdf7077f57
Code: ff ff ff f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc24b336f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007fcdf7077f57
RDX: 00007ffc24b337ca RSI: 000000000000000a RDI: 00007ffc24b337c0
RBP: 00007ffc24b337c0 R08: 00000000ffffffff R09: 00007ffc24b33590
R10: 0000555555c00903 R11: 0000000000000246 R12: 00007fcdf70d0105
R13: 00007ffc24b34880 R14: 0000555555c00810 R15: 00007ffc24b348c0
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


* Re: [syzbot] possible deadlock in blkdev_put (2)
  2021-11-27  7:10 [syzbot] possible deadlock in blkdev_put (2) syzbot
@ 2021-11-27 11:27 ` Tetsuo Handa
  2021-11-28  5:32   ` Tetsuo Handa
  2021-12-08  5:05 ` [syzbot] possible deadlock in blkdev_put (2) syzbot
  2021-12-08 11:42 ` syzbot
  2 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2021-11-27 11:27 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara; +Cc: linux-block, Jens Axboe

Hello.

> HEAD commit:    f81e94e91878 Add linux-next specific files for 20211125
> git tree:       linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=16366216b00000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=be9183de0824e4d7
> dashboard link: https://syzkaller.appspot.com/bug?extid=643e4ce4b6ad1347d372
> compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2

This looks unsolvable as long as flush_workqueue() is called with disk->open_mutex
held. I think we need to call flush_workqueue() without holding disk->open_mutex.
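
For readability, the reported chain condenses to something like the following
(a sketch of my reading of the lockdep output above, not verbatim code):

	T1: close() path (this report)         T2: write to /sys/power/resume
	blkdev_put()
	  mutex_lock(&disk->open_mutex);       mutex_lock(&system_transition_mutex/1);
	  lo_release()                         blkdev_get_by_dev()
	    __loop_clr_fd()                      mutex_lock(&disk->open_mutex); /* waits for T1 */
	      destroy_workqueue()
	        flush_workqueue();  /* waits for a loop work item */

	The loop work item needs sb_start_write(), and a holder of sb_writers can
	(via the #3..#5 links above) end up waiting for system_transition_mutex/1,
	which T2 holds while waiting for T1's disk->open_mutex, closing the cycle.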

Since disk->fops->open == lo_open and bdev->bd_disk->fops->release == lo_release
and blkdev_put() calls bdev->bd_disk->fops->release() right before dropping
disk->open_mutex, is dropping disk->open_mutex inside lo_release (something like
below) possible?

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 3dfb39d38235..0fde1842686a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1057,17 +1057,17 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 		bd_abort_claiming(bdev, loop_configure);
 out_putf:
 	fput(file);
 	/* This is safe: open() is still holding a reference. */
 	module_put(THIS_MODULE);
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo, bool release)
+static void __loop_clr_fd(struct loop_device *lo)
 {
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
 	struct loop_worker *pos, *worker;
 
 	/*
 	 * Since this function is called upon "ioctl(LOOP_CLR_FD)" xor "close()
 	 * after ioctl(LOOP_CLR_FD)", it is a sign of something going wrong if
@@ -1117,28 +1117,26 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	blk_mq_unfreeze_queue(lo->lo_queue);
 
 	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
 
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN) {
 		int err;
 
 		/*
-		 * open_mutex has been held already in release path, so don't
-		 * acquire it if this function is called in such case.
+		 * If the reread partition is from release path, lo_refcnt is
+		 * already zero.
 		 *
 		 * If the reread partition isn't from release path, lo_refcnt
 		 * must be at least one and it can only become zero when the
 		 * current holder is released.
 		 */
-		if (!release)
-			mutex_lock(&lo->lo_disk->open_mutex);
+		mutex_lock(&lo->lo_disk->open_mutex);
 		err = bdev_disk_changed(lo->lo_disk, false);
-		if (!release)
-			mutex_unlock(&lo->lo_disk->open_mutex);
+		mutex_unlock(&lo->lo_disk->open_mutex);
 		if (err)
 			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
 				__func__, lo->lo_number, err);
 		/* Device is gone, no point in returning error */
 	}
 
 	/*
 	 * lo->lo_state is set to Lo_unbound here after above partscan has
@@ -1183,17 +1181,17 @@ static int loop_clr_fd(struct loop_device *lo)
 	if (atomic_read(&lo->lo_refcnt) > 1) {
 		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
 		mutex_unlock(&lo->lo_mutex);
 		return 0;
 	}
 	loop_update_state_locked(lo, Lo_rundown);
 	mutex_unlock(&lo->lo_mutex);
 
-	__loop_clr_fd(lo, false);
+	__loop_clr_fd(lo);
 	return 0;
 }
 
 static int
 loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 {
 	int err;
 	int prev_lo_flags;
@@ -1683,17 +1681,17 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 static int lo_open(struct block_device *bdev, fmode_t mode)
 {
 	struct loop_device *lo = bdev->bd_disk->private_data;
 	int err;
 
 	err = mutex_lock_killable(&lo->lo_mutex);
 	if (err)
 		return err;
-	if (lo->lo_state == Lo_deleting)
+	if (lo->lo_state == Lo_deleting || lo->lo_state == Lo_rundown)
 		err = -ENXIO;
 	else
 		atomic_inc(&lo->lo_refcnt);
 	mutex_unlock(&lo->lo_mutex);
 	return err;
 }
 
 static void lo_release(struct gendisk *disk, fmode_t mode)
@@ -1706,18 +1704,27 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 
 	if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
 		if (!loop_try_update_state_locked(lo, Lo_bound, Lo_rundown))
 			goto out_unlock;
 		mutex_unlock(&lo->lo_mutex);
 		/*
 		 * In autoclear mode, stop the loop thread
 		 * and remove configuration after last close.
+		 *
+		 * Since calling flush_workqueue() with open_mutex held causes
+		 * circular locking dependency problem, call __loop_clr_fd()
+		 * without open_mutex. Use lo->lo_state == Lo_rundown condition
+		 * for preventing lo_open() from incrementing lo_refcnt, for
+		 * calling __loop_clr_fd() without open_mutex might result in
+		 * entering lo_open() before lo_release() completes.
 		 */
-		__loop_clr_fd(lo, true);
+		mutex_unlock(&lo->lo_disk->open_mutex);
+		__loop_clr_fd(lo);
+		mutex_lock(&lo->lo_disk->open_mutex);
 		return;
 	} else if (lo->lo_state == Lo_bound) {
 		/*
 		 * Otherwise keep thread (if running) and config,
 		 * but flush possible ongoing bios in thread.
 		 */
 		blk_mq_freeze_queue(lo->lo_queue);
 		blk_mq_unfreeze_queue(lo->lo_queue);



* Re: [syzbot] possible deadlock in blkdev_put (2)
  2021-11-27 11:27 ` Tetsuo Handa
@ 2021-11-28  5:32   ` Tetsuo Handa
  2021-11-28  7:42     ` Tetsuo Handa
  0 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2021-11-28  5:32 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara, Dave Chinner; +Cc: linux-block, Jens Axboe

Hello.

On 2021/11/27 20:27, Tetsuo Handa wrote:
> Hello.
> 
>> HEAD commit:    f81e94e91878 Add linux-next specific files for 20211125
>> git tree:       linux-next
>> console output: https://syzkaller.appspot.com/x/log.txt?x=16366216b00000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=be9183de0824e4d7
>> dashboard link: https://syzkaller.appspot.com/bug?extid=643e4ce4b6ad1347d372
>> compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> 
> This looks unsolvable as long as flush_workqueue() is called with disk->open_mutex
> held. I think we need to call flush_workqueue() without holding disk->open_mutex.

Commit a1ecac3b0656a682 ("loop: Make explicit loop device destruction lazy") changed

	if (lo->lo_refcnt > 1)  /* we needed one fd for the ioctl */
		return -EBUSY;

in loop_clr_fd() to

	if (lo->lo_refcnt > 1) {
		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
		mutex_unlock(&lo->lo_ctl_mutex);
		return 0;
	}

but how does lo_refcnt matter here?
What happens if we ignore the lo->lo_refcnt > 1 check (i.e. allow this loop device
to be torn down even if somebody else (likely blkid via udev) is still accessing
it)?

lo_open() increments lo->lo_refcnt with lo->lo_mutex held. However, since
loop_clr_fd() releases lo->lo_mutex before calling __loop_clr_fd(), setting
the LO_FLAGS_AUTOCLEAR flag based on whether somebody else is accessing
this loop device is racy. That is, __loop_clr_fd() might already have started
by the time blkid via udev calls lo_open(). That lo_open() will succeed, and
blkid will access a loop device in Lo_rundown state.
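
A condensed interleaving of this race (illustration only, not verbatim code):

	loop_clr_fd()                          lo_open() (blkid via udev)
	mutex_lock(&lo->lo_mutex);
	/* lo_refcnt == 1, no autoclear */
	lo->lo_state = Lo_rundown;
	mutex_unlock(&lo->lo_mutex);
	                                       mutex_lock(&lo->lo_mutex);
	                                       /* only Lo_deleting is rejected */
	                                       atomic_inc(&lo->lo_refcnt);
	                                       mutex_unlock(&lo->lo_mutex);
	__loop_clr_fd(lo);                     /* blkid now reads a device
	                                          in Lo_rundown state */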

That is, userspace programs can tolerate accessing a loop device in Lo_rundown
state. Then, why not unconditionally start __loop_clr_fd() instead of unreliably
setting the LO_FLAGS_AUTOCLEAR flag (i.e. force loop device destruction instead of
making loop device destruction lazy)?

If we can unconditionally start __loop_clr_fd() upon ioctl(LOOP_CLR_FD), I think
we can avoid circular locking between disk->open_mutex and flush_workqueue().

---
 drivers/block/loop.c | 86 +++++++-------------------------------------
 1 file changed, 12 insertions(+), 74 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 3dfb39d38235..cd82c8a5c8d5 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -680,9 +680,7 @@ static ssize_t loop_attr_sizelimit_show(struct loop_device *lo, char *buf)
 
 static ssize_t loop_attr_autoclear_show(struct loop_device *lo, char *buf)
 {
-	int autoclear = (lo->lo_flags & LO_FLAGS_AUTOCLEAR);
-
-	return sprintf(buf, "%s\n", autoclear ? "1" : "0");
+	return sprintf(buf, "%s\n", "0");
 }
 
 static ssize_t loop_attr_partscan_show(struct loop_device *lo, char *buf)
@@ -1062,20 +1060,13 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo, bool release)
+static int loop_clr_fd(struct loop_device *lo)
 {
 	struct file *filp;
-	gfp_t gfp = lo->old_gfp_mask;
 	struct loop_worker *pos, *worker;
 
-	/*
-	 * Since this function is called upon "ioctl(LOOP_CLR_FD)" xor "close()
-	 * after ioctl(LOOP_CLR_FD)", it is a sign of something going wrong if
-	 * lo->lo_state has changed while waiting for lo->lo_mutex.
-	 */
-	mutex_lock(&lo->lo_mutex);
-	BUG_ON(lo->lo_state != Lo_rundown);
-	mutex_unlock(&lo->lo_mutex);
+	if (!loop_try_update_state(lo, Lo_bound, Lo_rundown))
+		return -ENXIO;
 
 	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
 		blk_queue_write_cache(lo->lo_queue, false, false);
@@ -1111,7 +1102,7 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	loop_sysfs_exit(lo);
 	/* let user-space know about this change */
 	kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
-	mapping_set_gfp_mask(filp->f_mapping, gfp);
+	mapping_set_gfp_mask(filp->f_mapping, lo->old_gfp_mask);
 	/* This is safe: open() is still holding a reference. */
 	module_put(THIS_MODULE);
 	blk_mq_unfreeze_queue(lo->lo_queue);
@@ -1122,18 +1113,12 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 		int err;
 
 		/*
-		 * open_mutex has been held already in release path, so don't
-		 * acquire it if this function is called in such case.
-		 *
-		 * If the reread partition isn't from release path, lo_refcnt
-		 * must be at least one and it can only become zero when the
-		 * current holder is released.
+		 * lo_refcnt must be at least one and it can only become zero
+		 * when the current holder is released.
 		 */
-		if (!release)
-			mutex_lock(&lo->lo_disk->open_mutex);
+		mutex_lock(&lo->lo_disk->open_mutex);
 		err = bdev_disk_changed(lo->lo_disk, false);
-		if (!release)
-			mutex_unlock(&lo->lo_disk->open_mutex);
+		mutex_unlock(&lo->lo_disk->open_mutex);
 		if (err)
 			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
 				__func__, lo->lo_number, err);
@@ -1142,7 +1127,7 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 
 	/*
 	 * lo->lo_state is set to Lo_unbound here after above partscan has
-	 * finished. There cannot be anybody else entering __loop_clr_fd() as
+	 * finished. There cannot be anybody else entering loop_clr_fd() as
 	 * Lo_rundown state protects us from all the other places trying to
 	 * change the 'lo' device.
 	 */
@@ -1157,38 +1142,6 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	 * fput can take open_mutex which is usually taken before lo_mutex.
 	 */
 	fput(filp);
-}
-
-static int loop_clr_fd(struct loop_device *lo)
-{
-	int err;
-
-	err = mutex_lock_killable(&lo->lo_mutex);
-	if (err)
-		return err;
-	if (lo->lo_state != Lo_bound) {
-		mutex_unlock(&lo->lo_mutex);
-		return -ENXIO;
-	}
-	/*
-	 * If we've explicitly asked to tear down the loop device,
-	 * and it has an elevated reference count, set it for auto-teardown when
-	 * the last reference goes away. This stops $!~#$@ udev from
-	 * preventing teardown because it decided that it needs to run blkid on
-	 * the loopback device whenever they appear. xfstests is notorious for
-	 * failing tests because blkid via udev races with a losetup
-	 * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
-	 * command to fail with EBUSY.
-	 */
-	if (atomic_read(&lo->lo_refcnt) > 1) {
-		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-		mutex_unlock(&lo->lo_mutex);
-		return 0;
-	}
-	loop_update_state_locked(lo, Lo_rundown);
-	mutex_unlock(&lo->lo_mutex);
-
-	__loop_clr_fd(lo, false);
 	return 0;
 }
 
@@ -1703,26 +1656,11 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 	mutex_lock(&lo->lo_mutex);
 	if (atomic_dec_return(&lo->lo_refcnt))
 		goto out_unlock;
-
-	if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
-		if (!loop_try_update_state_locked(lo, Lo_bound, Lo_rundown))
-			goto out_unlock;
-		mutex_unlock(&lo->lo_mutex);
-		/*
-		 * In autoclear mode, stop the loop thread
-		 * and remove configuration after last close.
-		 */
-		__loop_clr_fd(lo, true);
-		return;
-	} else if (lo->lo_state == Lo_bound) {
-		/*
-		 * Otherwise keep thread (if running) and config,
-		 * but flush possible ongoing bios in thread.
-		 */
+	/* If still bound, flush request queue. */
+	if (lo->lo_state == Lo_bound) {
 		blk_mq_freeze_queue(lo->lo_queue);
 		blk_mq_unfreeze_queue(lo->lo_queue);
 	}
-
 out_unlock:
 	mutex_unlock(&lo->lo_mutex);
 }
-- 
2.18.4



* Re: [syzbot] possible deadlock in blkdev_put (2)
  2021-11-28  5:32   ` Tetsuo Handa
@ 2021-11-28  7:42     ` Tetsuo Handa
  2021-11-29 10:21       ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2021-11-28  7:42 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara, Dave Chinner; +Cc: linux-block, Jens Axboe

On 2021/11/28 14:32, Tetsuo Handa wrote:
> If we can unconditionally start __loop_clr_fd() upon ioctl(LOOP_CLR_FD), I think
> we can avoid circular locking between disk->open_mutex and flush_workqueue().

Too bad. There is ioctl(LOOP_SET_STATUS) which allows forcing __loop_clr_fd() to be
called without ioctl(LOOP_CLR_FD). We have to support __loop_clr_fd() upon lo_release().
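
For illustration, a hedged userspace sketch of that path (it assumes the loop
device is already bound to a backing file; a real caller would fetch the
current status with LOOP_GET_STATUS64 first, only flip lo_flags, and check
for errors):

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/loop.h>

	/* Arm LO_FLAGS_AUTOCLEAR without ever issuing LOOP_CLR_FD; the final
	 * close() then runs __loop_clr_fd() from lo_release(). */
	static void arm_autoclear(const char *loopdev)
	{
		struct loop_info64 info = { .lo_flags = LO_FLAGS_AUTOCLEAR };
		int fd = open(loopdev, O_RDWR);

		ioctl(fd, LOOP_SET_STATUS64, &info);
		close(fd);	/* last close triggers autoclear */
	}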

Is dropping disk->open_mutex inside lo_release()
( https://lkml.kernel.org/r/e4bdc6b1-701d-6cc1-5d42-65564d2aa089@I-love.SAKURA.ne.jp ) possible?



* Re: [syzbot] possible deadlock in blkdev_put (2)
  2021-11-28  7:42     ` Tetsuo Handa
@ 2021-11-29 10:21       ` Christoph Hellwig
  2021-11-29 10:36         ` Tetsuo Handa
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2021-11-29 10:21 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Christoph Hellwig, Jan Kara, Dave Chinner, linux-block, Jens Axboe

On Sun, Nov 28, 2021 at 04:42:35PM +0900, Tetsuo Handa wrote:
> On 2021/11/28 14:32, Tetsuo Handa wrote:
> > If we can unconditionally start __loop_clr_fd() upon ioctl(LOOP_CLR_FD), I think
> > we can avoid circular locking between disk->open_mutex and flush_workqueue().
> 
> Too bad. There is ioctl(LOOP_SET_STATUS) which allows forcing __loop_clr_fd() to be
> called without ioctl(LOOP_CLR_FD). We have to support __loop_clr_fd() upon lo_release().
> 
> Is dropping disk->open_mutex inside lo_release()
> ( https://lkml.kernel.org/r/e4bdc6b1-701d-6cc1-5d42-65564d2aa089@I-love.SAKURA.ne.jp ) possible?

I don't think we can drop open_mutex inside ->release. What is the
problem with offloading the clearing to a different context than the
one that calls ->release?


* Re: [syzbot] possible deadlock in blkdev_put (2)
  2021-11-29 10:21       ` Christoph Hellwig
@ 2021-11-29 10:36         ` Tetsuo Handa
  2021-11-29 14:13           ` Tetsuo Handa
  2021-11-30 12:57           ` Christoph Hellwig
  0 siblings, 2 replies; 19+ messages in thread
From: Tetsuo Handa @ 2021-11-29 10:36 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, Dave Chinner, linux-block, Jens Axboe

On 2021/11/29 19:21, Christoph Hellwig wrote:
> On Sun, Nov 28, 2021 at 04:42:35PM +0900, Tetsuo Handa wrote:
>> Is dropping disk->open_mutex inside lo_release()
>> ( https://lkml.kernel.org/r/e4bdc6b1-701d-6cc1-5d42-65564d2aa089@I-love.SAKURA.ne.jp ) possible?
> 
> I don't think we can drop open_mutex inside ->release. What is the
> problem with offloading the clearing to a different context than the
> one that calls ->release?
> 

Offloading to a WQ context?

If the caller just wants to call ioctl(LOOP_CTL_GET_FREE) followed by
ioctl(LOOP_CONFIGURE), deferring __loop_clr_fd() would be fine.

But the caller might want to unmount as soon as fput(filp) from __loop_clr_fd() completes.
I think we need to wait for __loop_clr_fd() from lo_release() to complete.


* Re: [syzbot] possible deadlock in blkdev_put (2)
  2021-11-29 10:36         ` Tetsuo Handa
@ 2021-11-29 14:13           ` Tetsuo Handa
  2021-11-30 12:57           ` Christoph Hellwig
  1 sibling, 0 replies; 19+ messages in thread
From: Tetsuo Handa @ 2021-11-29 14:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, Dave Chinner, linux-block, Jens Axboe

On 2021/11/29 19:36, Tetsuo Handa wrote:
> On 2021/11/29 19:21, Christoph Hellwig wrote:
>> On Sun, Nov 28, 2021 at 04:42:35PM +0900, Tetsuo Handa wrote:
>>> Is dropping disk->open_mutex inside lo_release()
>>> ( https://lkml.kernel.org/r/e4bdc6b1-701d-6cc1-5d42-65564d2aa089@I-love.SAKURA.ne.jp ) possible?
>>
>> I don't think we can drop open_mutex inside ->release. What is the
>> problem with offloading the clearing to a different context than the
>> one that calls ->release?
>>
> 
> Offloading to a WQ context?
> 
> If the caller just want to call ioctl(LOOP_CTL_GET_FREE) followed by
> ioctl(LOOP_CONFIGURE), deferring __loop_clr_fd() would be fine.
> 
> But the caller might want to unmount as soon as fput(filp) from __loop_clr_fd() completes.
> I think we need to wait for __loop_clr_fd() from lo_release() to complete.
> 

Something like this?

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 0a4416ef4fbf..2d54416abe24 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1210,10 +1210,16 @@ struct block_device_operations {
 	int (*get_unique_id)(struct gendisk *disk, u8 id[16],
 			enum blk_unique_id id_type);
 	struct module *owner;
 	const struct pr_ops *pr_ops;
 
+	/*
+	 * Special callback for waiting for completion of release callback
+	 * without disk->open_mutex held. Used by loop driver.
+	 */
+	void (*wait_release)(struct gendisk *disk);
+
 	/*
 	 * Special callback for probing GPT entry at a given sector.
 	 * Needed by Android devices, used by GPT scanner and MMC blk
 	 * driver.
 	 */
diff --git a/block/bdev.c b/block/bdev.c
index ae063041f291..edadc3fc1df3 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -952,10 +952,12 @@ void blkdev_put(struct block_device *bdev, fmode_t mode)
 	if (bdev_is_partition(bdev))
 		blkdev_put_part(bdev, mode);
 	else
 		blkdev_put_whole(bdev, mode);
 	mutex_unlock(&disk->open_mutex);
+	if (bdev->bd_disk->fops->wait_release)
+		bdev->bd_disk->fops->wait_release(bdev->bd_disk);
 
 	blkdev_put_no_open(bdev);
 }
 EXPORT_SYMBOL(blkdev_put);
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 56b9392737b2..858bb6b4ceea 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -55,10 +55,11 @@ struct loop_device {
 	struct request_queue	*lo_queue;
 	struct blk_mq_tag_set	tag_set;
 	struct gendisk		*lo_disk;
 	struct mutex		lo_mutex;
 	bool			idr_visible;
+	struct work_struct      rundown_work;
 };
 
 struct loop_cmd {
 	struct list_head list_entry;
 	bool use_aio; /* use AIO interface to handle I/O */
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 3dfb39d38235..8f1db8ca0eb8 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1060,11 +1060,11 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	/* This is safe: open() is still holding a reference. */
 	module_put(THIS_MODULE);
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo, bool release)
+static void __loop_clr_fd(struct loop_device *lo)
 {
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
 	struct loop_worker *pos, *worker;
 
@@ -1119,23 +1119,13 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
 
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN) {
 		int err;
 
-		/*
-		 * open_mutex has been held already in release path, so don't
-		 * acquire it if this function is called in such case.
-		 *
-		 * If the reread partition isn't from release path, lo_refcnt
-		 * must be at least one and it can only become zero when the
-		 * current holder is released.
-		 */
-		if (!release)
-			mutex_lock(&lo->lo_disk->open_mutex);
+		mutex_lock(&lo->lo_disk->open_mutex);
 		err = bdev_disk_changed(lo->lo_disk, false);
-		if (!release)
-			mutex_unlock(&lo->lo_disk->open_mutex);
+		mutex_unlock(&lo->lo_disk->open_mutex);
 		if (err)
 			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
 				__func__, lo->lo_number, err);
 		/* Device is gone, no point in returning error */
 	}
@@ -1186,11 +1176,11 @@ static int loop_clr_fd(struct loop_device *lo)
 		return 0;
 	}
 	loop_update_state_locked(lo, Lo_rundown);
 	mutex_unlock(&lo->lo_mutex);
 
-	__loop_clr_fd(lo, false);
+	__loop_clr_fd(lo);
 	return 0;
 }
 
 static int
 loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
@@ -1694,10 +1684,17 @@ static int lo_open(struct block_device *bdev, fmode_t mode)
 		atomic_inc(&lo->lo_refcnt);
 	mutex_unlock(&lo->lo_mutex);
 	return err;
 }
 
+static void loop_rundown_workfn(struct work_struct *work)
+{
+	struct loop_device *lo = container_of(work, struct loop_device, rundown_work);
+
+	__loop_clr_fd(lo);
+}
+
 static void lo_release(struct gendisk *disk, fmode_t mode)
 {
 	struct loop_device *lo = disk->private_data;
 
 	mutex_lock(&lo->lo_mutex);
@@ -1710,11 +1707,12 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 		mutex_unlock(&lo->lo_mutex);
 		/*
 		 * In autoclear mode, stop the loop thread
 		 * and remove configuration after last close.
 		 */
-		__loop_clr_fd(lo, true);
+		INIT_WORK(&lo->rundown_work, loop_rundown_workfn);
+		queue_work(system_long_wq, &lo->rundown_work);
 		return;
 	} else if (lo->lo_state == Lo_bound) {
 		/*
 		 * Otherwise keep thread (if running) and config,
 		 * but flush possible ongoing bios in thread.
@@ -1725,14 +1723,22 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 
 out_unlock:
 	mutex_unlock(&lo->lo_mutex);
 }
 
+static void lo_wait_release(struct gendisk *disk)
+{
+	struct loop_device *lo = disk->private_data;
+
+	flush_workqueue(system_long_wq);
+}
+
 static const struct block_device_operations lo_fops = {
 	.owner =	THIS_MODULE,
 	.open =		lo_open,
 	.release =	lo_release,
+	.wait_release = lo_wait_release,
 	.ioctl =	lo_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl =	lo_compat_ioctl,
 #endif
 };


* Re: [syzbot] possible deadlock in blkdev_put (2)
  2021-11-29 10:36         ` Tetsuo Handa
  2021-11-29 14:13           ` Tetsuo Handa
@ 2021-11-30 12:57           ` Christoph Hellwig
  2021-12-01 14:41             ` [PATCH] loop: make autoclear operation asynchronous Tetsuo Handa
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2021-11-30 12:57 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Christoph Hellwig, Jan Kara, Dave Chinner, linux-block, Jens Axboe

On Mon, Nov 29, 2021 at 07:36:27PM +0900, Tetsuo Handa wrote:
>> If the caller just wants to call ioctl(LOOP_CTL_GET_FREE) followed by
> ioctl(LOOP_CONFIGURE), deferring __loop_clr_fd() would be fine.
> 
> But the caller might want to unmount as soon as fput(filp) from __loop_clr_fd() completes.
> I think we need to wait for __loop_clr_fd() from lo_release() to complete.

Anything else could have a reference to this or other files as well.
So I can't see how deferring the clear to a different context can be
any kind of problem in practice.


* [PATCH] loop: make autoclear operation asynchronous
  2021-11-30 12:57           ` Christoph Hellwig
@ 2021-12-01 14:41             ` Tetsuo Handa
  2021-12-02  7:22               ` Christoph Hellwig
  2021-12-02 12:16               ` Jan Kara
  0 siblings, 2 replies; 19+ messages in thread
From: Tetsuo Handa @ 2021-12-01 14:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, Dave Chinner, linux-block, Jens Axboe

On 2021/11/30 21:57, Christoph Hellwig wrote:
> On Mon, Nov 29, 2021 at 07:36:27PM +0900, Tetsuo Handa wrote:
>> If the caller just wants to call ioctl(LOOP_CTL_GET_FREE) followed by
>> ioctl(LOOP_CONFIGURE), deferring __loop_clr_fd() would be fine.
>>
>> But the caller might want to unmount as soon as fput(filp) from __loop_clr_fd() completes.
>> I think we need to wait for __loop_clr_fd() from lo_release() to complete.
> 
> Anything else could have a reference to this or other files as well.
> So I can't see how deferring the clear to a different context can be
> any kind of problem in practice.
> 

OK. Here is a patch.
Is this better than temporarily dropping disk->open_mutex?

From 1405d604f1a0aa153de595f607726f0dcbe5c784 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 1 Dec 2021 23:31:20 +0900
Subject: [PATCH] loop: make autoclear operation asynchronous

syzbot is reporting a circular locking problem at __loop_clr_fd() [1], because
commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of kworker")
calls destroy_workqueue() with disk->open_mutex held.

This circular dependency cannot be broken unless we call __loop_clr_fd()
without holding disk->open_mutex. There are two approaches.

One is to temporarily drop disk->open_mutex when calling __loop_clr_fd().

  -  __loop_clr_fd(lo, true);
  +  mutex_unlock(&lo->lo_disk->open_mutex);
  +  __loop_clr_fd(lo, false);
  +  mutex_lock(&lo->lo_disk->open_mutex);

This should work because

  (a) __loop_clr_fd() can be called without disk->open_mutex held, and
      takes disk->open_mutex if needed when called by ioctl(LOOP_CLR_FD)

  (b) lo_release() is called by blkdev_put_whole() via
      bdev->bd_disk->fops->release from blkdev_put() (maybe via
      blkdev_put_part()) immediately before dropping disk->open_mutex

  (c) there is no resource to protect after dropping disk->open_mutex
      till blkdev_put() completes

are true.

The other is to defer __loop_clr_fd() to a WQ context. This should work
given that

  (d) refcount on resources accessed by __loop_clr_fd() are taken before
      blkdev_put() drops refcount

  (e) refcount on resources accessed by __loop_clr_fd() are dropped after
      __loop_clr_fd() completes

  (f) the caller is not trying to e.g. unmount as soon as it returns from
      lo_release()

  (g) the WQ context does not introduce new locking problems

are true. This patch implements (d) and (e).

Link: https://syzkaller.appspot.com/bug?extid=643e4ce4b6ad1347d372 [1]
Reported-by: syzbot <syzbot+643e4ce4b6ad1347d372@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/block/loop.c | 65 ++++++++++++++++++++++++--------------------
 drivers/block/loop.h |  1 +
 2 files changed, 37 insertions(+), 29 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ba76319b5544..7f4ea06534c2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1082,7 +1082,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo, bool release)
+static void __loop_clr_fd(struct loop_device *lo)
 {
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
@@ -1144,8 +1144,6 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	/* let user-space know about this change */
 	kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
 	mapping_set_gfp_mask(filp->f_mapping, gfp);
-	/* This is safe: open() is still holding a reference. */
-	module_put(THIS_MODULE);
 	blk_mq_unfreeze_queue(lo->lo_queue);
 
 	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
@@ -1153,44 +1151,52 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN) {
 		int err;
 
-		/*
-		 * open_mutex has been held already in release path, so don't
-		 * acquire it if this function is called in such case.
-		 *
-		 * If the reread partition isn't from release path, lo_refcnt
-		 * must be at least one and it can only become zero when the
-		 * current holder is released.
-		 */
-		if (!release)
-			mutex_lock(&lo->lo_disk->open_mutex);
+		mutex_lock(&lo->lo_disk->open_mutex);
 		err = bdev_disk_changed(lo->lo_disk, false);
-		if (!release)
-			mutex_unlock(&lo->lo_disk->open_mutex);
+		mutex_unlock(&lo->lo_disk->open_mutex);
 		if (err)
 			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
 				__func__, lo->lo_number, err);
 		/* Device is gone, no point in returning error */
 	}
 
-	/*
-	 * lo->lo_state is set to Lo_unbound here after above partscan has
-	 * finished. There cannot be anybody else entering __loop_clr_fd() as
-	 * Lo_rundown state protects us from all the other places trying to
-	 * change the 'lo' device.
-	 */
 	lo->lo_flags = 0;
 	if (!part_shift)
 		lo->lo_disk->flags |= GENHD_FL_NO_PART;
+
+	fput(filp);
+}
+
+static void loop_rundown_completed(struct loop_device *lo)
+{
 	mutex_lock(&lo->lo_mutex);
 	lo->lo_state = Lo_unbound;
 	mutex_unlock(&lo->lo_mutex);
+	module_put(THIS_MODULE);
+}
 
-	/*
-	 * Need not hold lo_mutex to fput backing file. Calling fput holding
-	 * lo_mutex triggers a circular lock dependency possibility warning as
-	 * fput can take open_mutex which is usually taken before lo_mutex.
-	 */
-	fput(filp);
+static void loop_rundown_workfn(struct work_struct *work)
+{
+	struct loop_device *lo = container_of(work, struct loop_device,
+					      rundown_work);
+	struct block_device *bdev = lo->lo_device;
+	struct gendisk *disk = lo->lo_disk;
+
+	__loop_clr_fd(lo);
+	kobject_put(&bdev->bd_device.kobj);
+	module_put(disk->fops->owner);
+	loop_rundown_completed(lo);
+}
+
+static void loop_schedule_rundown(struct loop_device *lo)
+{
+	struct block_device *bdev = lo->lo_device;
+	struct gendisk *disk = lo->lo_disk;
+
+	__module_get(disk->fops->owner);
+	kobject_get(&bdev->bd_device.kobj);
+	INIT_WORK(&lo->rundown_work, loop_rundown_workfn);
+	queue_work(system_long_wq, &lo->rundown_work);
 }
 
 static int loop_clr_fd(struct loop_device *lo)
@@ -1222,7 +1228,8 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_state = Lo_rundown;
 	mutex_unlock(&lo->lo_mutex);
 
-	__loop_clr_fd(lo, false);
+	__loop_clr_fd(lo);
+	loop_rundown_completed(lo);
 	return 0;
 }
 
@@ -1747,7 +1754,7 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 		 * In autoclear mode, stop the loop thread
 		 * and remove configuration after last close.
 		 */
-		__loop_clr_fd(lo, true);
+		loop_schedule_rundown(lo);
 		return;
 	} else if (lo->lo_state == Lo_bound) {
 		/*
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 082d4b6bfc6a..918a7a2dc025 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -56,6 +56,7 @@ struct loop_device {
 	struct gendisk		*lo_disk;
 	struct mutex		lo_mutex;
 	bool			idr_visible;
+	struct work_struct      rundown_work;
 };
 
 struct loop_cmd {
-- 
2.18.4


* Re: [PATCH] loop: make autoclear operation asynchronous
  2021-12-01 14:41             ` [PATCH] loop: make autoclear operation asynchronous Tetsuo Handa
@ 2021-12-02  7:22               ` Christoph Hellwig
  2021-12-02 11:03                 ` Tetsuo Handa
  2021-12-02 12:16               ` Jan Kara
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2021-12-02  7:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Christoph Hellwig, Jan Kara, Dave Chinner, linux-block, Jens Axboe

On Wed, Dec 01, 2021 at 11:41:23PM +0900, Tetsuo Handa wrote:
> OK. Here is a patch.
> Is this better than temporarily dropping disk->open_mutex?

This looks much better, and also cleans up the horrible locking warts
in __loop_clr_fd.


* Re: [PATCH] loop: make autoclear operation asynchronous
  2021-12-02  7:22               ` Christoph Hellwig
@ 2021-12-02 11:03                 ` Tetsuo Handa
  0 siblings, 0 replies; 19+ messages in thread
From: Tetsuo Handa @ 2021-12-02 11:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, Dave Chinner, linux-block, Jens Axboe

On 2021/12/02 16:22, Christoph Hellwig wrote:
> On Wed, Dec 01, 2021 at 11:41:23PM +0900, Tetsuo Handa wrote:
>> OK. Here is a patch.
>> Is this better than temporarily dropping disk->open_mutex?
> 
> This looks much better, and also cleans up the horrible locking warts
> in __loop_clr_fd.
> 

What "the horrible locking warts" refer to? Below approach temporarily
drops disk->open_mutex. I think there is no locking difference between
synchronous and asynchronous...

Anyway, I resent
https://lkml.kernel.org/r/d1f760f9-cdb2-f40d-33d8-bfa517c731be@i-love.sakura.ne.jp
so that it can be applied before "loop: replace loop_validate_mutex with loop_validate_spinlock".

---
 drivers/block/loop.c | 33 +++++++--------------------------
 1 file changed, 7 insertions(+), 26 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ba76319b5544..31d3fbe67fea 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1082,7 +1082,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo, bool release)
+static void __loop_clr_fd(struct loop_device *lo)
 {
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
@@ -1153,31 +1153,15 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN) {
 		int err;
 
-		/*
-		 * open_mutex has been held already in release path, so don't
-		 * acquire it if this function is called in such case.
-		 *
-		 * If the reread partition isn't from release path, lo_refcnt
-		 * must be at least one and it can only become zero when the
-		 * current holder is released.
-		 */
-		if (!release)
-			mutex_lock(&lo->lo_disk->open_mutex);
+		mutex_lock(&lo->lo_disk->open_mutex);
 		err = bdev_disk_changed(lo->lo_disk, false);
-		if (!release)
-			mutex_unlock(&lo->lo_disk->open_mutex);
+		mutex_unlock(&lo->lo_disk->open_mutex);
 		if (err)
 			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
 				__func__, lo->lo_number, err);
 		/* Device is gone, no point in returning error */
 	}
 
-	/*
-	 * lo->lo_state is set to Lo_unbound here after above partscan has
-	 * finished. There cannot be anybody else entering __loop_clr_fd() as
-	 * Lo_rundown state protects us from all the other places trying to
-	 * change the 'lo' device.
-	 */
 	lo->lo_flags = 0;
 	if (!part_shift)
 		lo->lo_disk->flags |= GENHD_FL_NO_PART;
@@ -1185,11 +1169,6 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	lo->lo_state = Lo_unbound;
 	mutex_unlock(&lo->lo_mutex);
 
-	/*
-	 * Need not hold lo_mutex to fput backing file. Calling fput holding
-	 * lo_mutex triggers a circular lock dependency possibility warning as
-	 * fput can take open_mutex which is usually taken before lo_mutex.
-	 */
 	fput(filp);
 }
 
@@ -1222,7 +1201,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_state = Lo_rundown;
 	mutex_unlock(&lo->lo_mutex);
 
-	__loop_clr_fd(lo, false);
+	__loop_clr_fd(lo);
 	return 0;
 }
 
@@ -1747,7 +1726,9 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 		 * In autoclear mode, stop the loop thread
 		 * and remove configuration after last close.
 		 */
-		__loop_clr_fd(lo, true);
+		mutex_unlock(&lo->lo_disk->open_mutex);
+		__loop_clr_fd(lo);
+		mutex_lock(&lo->lo_disk->open_mutex);
 		return;
 	} else if (lo->lo_state == Lo_bound) {
 		/*
-- 
2.18.4



* Re: [PATCH] loop: make autoclear operation asynchronous
  2021-12-01 14:41             ` [PATCH] loop: make autoclear operation asynchronous Tetsuo Handa
  2021-12-02  7:22               ` Christoph Hellwig
@ 2021-12-02 12:16               ` Jan Kara
  2021-12-02 14:39                 ` Tetsuo Handa
  1 sibling, 1 reply; 19+ messages in thread
From: Jan Kara @ 2021-12-02 12:16 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Christoph Hellwig, Jan Kara, Dave Chinner, linux-block, Jens Axboe

On Wed 01-12-21 23:41:23, Tetsuo Handa wrote:
> On 2021/11/30 21:57, Christoph Hellwig wrote:
> > On Mon, Nov 29, 2021 at 07:36:27PM +0900, Tetsuo Handa wrote:
> >> If the caller just wants to call ioctl(LOOP_CTL_GET_FREE) followed by
> >> ioctl(LOOP_CONFIGURE), deferring __loop_clr_fd() would be fine.
> >>
> >> But the caller might want to unmount as soon as fput(filp) from __loop_clr_fd() completes.
> >> I think we need to wait for __loop_clr_fd() from lo_release() to complete.
> > 
> > Anything else could have a reference to this or other files as well.
> > So I can't see how deferring the clear to a different context can be
> > any kind of problem in practice.
> > 
> 
> OK. Here is a patch.
> Is this better than temporarily dropping disk->open_mutex?

The patch looks good to me. Just one suggestion for improvement:

> +static void loop_schedule_rundown(struct loop_device *lo)
> +{
> +	struct block_device *bdev = lo->lo_device;
> +	struct gendisk *disk = lo->lo_disk;
> +
> +	__module_get(disk->fops->owner);
> +	kobject_get(&bdev->bd_device.kobj);
> +	INIT_WORK(&lo->rundown_work, loop_rundown_workfn);
> +	queue_work(system_long_wq, &lo->rundown_work);
>  }

Why not schedule this using task_work_add()? It solves the locking
context problems, has generally lower overhead than normal work (no need to
schedule), and avoids possible unexpected side-effects of releasing the
loopback device later. Also, task work is specifically designed so that one
task work can queue another task work, so we should be fine using it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH] loop: make autoclear operation asynchronous
  2021-12-02 12:16               ` Jan Kara
@ 2021-12-02 14:39                 ` Tetsuo Handa
  2021-12-02 18:05                   ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2021-12-02 14:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, Dave Chinner, linux-block, Jens Axboe

On 2021/12/02 21:16, Jan Kara wrote:
> Why not schedule this using task_work_add()? It solves the locking
> context problems, has generally lower overhead than normal work (no need to
> schedule), and avoids possible unexpected side-effects of releasing the
> loopback device later. Also, task work is specifically designed so that one
> task work can queue another task work, so we should be fine using it.

Indeed. But that will really make no difference between the synchronous approach
( https://lkml.kernel.org/r/fb6adcdc-fb56-3b90-355b-3f5a81220f2b@i-love.sakura.ne.jp )
and the asynchronous approach
( https://lkml.kernel.org/r/d1f760f9-cdb2-f40d-33d8-bfa517c731be@i-love.sakura.ne.jp ), for
disk->open_mutex is the only lock held when lo_release() is called.

Both approaches allow __loop_clr_fd() to run with no lock held, and both approaches
need to be aware of what actions are taken by blkdev_put() before and after dropping
disk->open_mutex. And bdev->bd_disk->fops->release() is the last action taken before
dropping disk->open_mutex.

What makes taking extra references to defer what blkdev_put() does after dropping
disk->open_mutex (i.e. __module_get() + kobject_get() before blkdev_put() calls
kobject_put() + module_put(), and kobject_put() + module_put() upon task_work_run())
better than simply doing the things that can be done without disk->open_mutex
(i.e. calling __loop_clr_fd() without disk->open_mutex)?



* Re: [PATCH] loop: make autoclear operation asynchronous
  2021-12-02 14:39                 ` Tetsuo Handa
@ 2021-12-02 18:05                   ` Jan Kara
  2021-12-03  6:50                     ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2021-12-02 18:05 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jan Kara, Christoph Hellwig, Dave Chinner, linux-block, Jens Axboe

On Thu 02-12-21 23:39:42, Tetsuo Handa wrote:
> On 2021/12/02 21:16, Jan Kara wrote:
> > Why not schedule this using task_work_add()? It solves the locking
> > context problems, has generally lower overhead than normal work (no need to
> > schedule), and avoids possible unexpected side-effects of releasing the
> > loopback device later. Also, task work is specifically designed so that one
> > task work can queue another task work, so we should be fine using it.
> 
> Indeed. But that will really make no difference between the synchronous approach
> ( https://lkml.kernel.org/r/fb6adcdc-fb56-3b90-355b-3f5a81220f2b@i-love.sakura.ne.jp )
> and the asynchronous approach
> ( https://lkml.kernel.org/r/d1f760f9-cdb2-f40d-33d8-bfa517c731be@i-love.sakura.ne.jp ), for
> disk->open_mutex is the only lock held when lo_release() is called.
> 
> Both approaches allow __loop_clr_fd() to run with no lock held, and both
> approaches need to be aware of what actions are taken by blkdev_put()
> before and after dropping disk->open_mutex. And
> bdev->bd_disk->fops->release() is the last action taken before dropping
> disk->open_mutex.
> 
> What makes taking extra references to defer what blkdev_put() does after
> dropping disk->open_mutex (i.e. __module_get() + kobject_get() before
> blkdev_put() calls kobject_put() + module_put(), and kobject_put() +
> module_put() upon task_work_run()) better than simply doing the things
> that can be done without disk->open_mutex (i.e. calling __loop_clr_fd()
> without disk->open_mutex)?

So the advantage of using task work instead of just dropping open_mutex
before calling __loop_clr_fd() is that if something in block/bdev.c ever
changes and starts relying on open_mutex being held throughout blkdev_put()
then loop device handling will not suddenly become broken. Generally it is
a bad practice to drop locks (even temporarily) that upper layers have acquired.
Sometimes it is inevitable, but in this case we can avoid that... So I'd
prefer if we used task work instead of dropping open_mutex inside the loop
driver. Not sure what's Christoph's opinion though, I don't feel *that*
strongly about it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH] loop: make autoclear operation asynchronous
  2021-12-02 18:05                   ` Jan Kara
@ 2021-12-03  6:50                     ` Christoph Hellwig
  2021-12-03 11:01                       ` Tetsuo Handa
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2021-12-03  6:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tetsuo Handa, Christoph Hellwig, Dave Chinner, linux-block, Jens Axboe

On Thu, Dec 02, 2021 at 07:05:00PM +0100, Jan Kara wrote:
> So the advantage of using task work instead of just dropping open_mutex
> before calling __loop_clr_fd() is that if something in block/bdev.c ever
> changes and starts relying on open_mutex being held throughout blkdev_put()
> then loop device handling will not suddenly become broken. Generally it is
> a bad practice to drop locks (even temporarily) that upper layers have acquired.
> Sometimes it is inevitable, but in this case we can avoid that... So I'd
> prefer if we used task work instead of dropping open_mutex inside the loop
> driver. Not sure what's Christoph's opinion though, I don't feel *that*
> strongly about it.

Dropping the lock is a complete no-go, as it doesn't allow proper
reasoning about the locking scheme in the block layer.

task_work_add sounds nice, but it is currently not exported which might
be for a reason (I don't really have any experience with it).


* Re: [PATCH] loop: make autoclear operation asynchronous
  2021-12-03  6:50                     ` Christoph Hellwig
@ 2021-12-03 11:01                       ` Tetsuo Handa
  2021-12-08  9:56                         ` Tetsuo Handa
  0 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2021-12-03 11:01 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara; +Cc: Dave Chinner, linux-block, Jens Axboe

On 2021/12/03 15:50, Christoph Hellwig wrote:
> task_work_add sounds nice, but it is currently not exported which might
> be for a reason (I don't really have any experience with it).

I didn't find a reason not to export it. But generally task_work_add() users
seem to implement a fallback which uses a WQ in case task_work_add() fails
(i.e. exit_task_work() was already called from do_exit()) or task_work_add()
cannot be used (e.g. the caller is a kernel thread).
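
A sketch of that fallback pattern (my illustration, not a tested patch; it
assumes a hypothetical "struct callback_head rundown_cb" member in struct
loop_device, and that task_work_add() gets exported):

	#include <linux/task_work.h>

	static void loop_rundown_twfn(struct callback_head *cb)
	{
		__loop_clr_fd(container_of(cb, struct loop_device, rundown_cb));
	}

	/* in lo_release(), instead of unconditionally using the WQ: */
	init_task_work(&lo->rundown_cb, loop_rundown_twfn);
	if ((current->flags & PF_KTHREAD) ||
	    task_work_add(current, &lo->rundown_cb, TWA_RESUME)) {
		/* kernel thread, or task already past exit_task_work() */
		INIT_WORK(&lo->rundown_work, loop_rundown_workfn);
		queue_work(system_long_wq, &lo->rundown_work);
	}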

I don't know whether there is a possibility that a kernel thread calls blkdev_put(),
but implementing the fallback path requires a WQ anyway. Thus, I think that
starting with the WQ only and seeing if something breaks is fine.



* Re: [syzbot] possible deadlock in blkdev_put (2)
  2021-11-27  7:10 [syzbot] possible deadlock in blkdev_put (2) syzbot
  2021-11-27 11:27 ` Tetsuo Handa
@ 2021-12-08  5:05 ` syzbot
  2021-12-08 11:42 ` syzbot
  2 siblings, 0 replies; 19+ messages in thread
From: syzbot @ 2021-12-08  5:05 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel, syzkaller-bugs

syzbot has found a reproducer for the following issue on:

HEAD commit:    04fe99a8d936 Add linux-next specific files for 20211207
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1315e2b9b00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=4589399873466942
dashboard link: https://syzkaller.appspot.com/bug?extid=643e4ce4b6ad1347d372
compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=11bb96ceb00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15ad1badb00000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+643e4ce4b6ad1347d372@syzkaller.appspotmail.com

======================================================
WARNING: possible circular locking dependency detected
5.16.0-rc4-next-20211207-syzkaller #0 Not tainted
------------------------------------------------------
systemd-udevd/7081 is trying to acquire lock:
ffff88802358c938 ((wq_completion)loop3){+.+.}-{0:0}, at: flush_workqueue+0xe1/0x15b0 kernel/workqueue.c:2835

but task is already holding lock:
ffff88801a57a118 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x99/0x980 block/bdev.c:907

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #6 (&disk->open_mutex){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:607 [inline]
       __mutex_lock+0x12f/0x12f0 kernel/locking/mutex.c:740
       blkdev_get_by_dev.part.0+0x40e/0xc70 block/bdev.c:809
       blkdev_get_by_dev+0x6b/0x80 block/bdev.c:852
       swsusp_check+0x97/0x2f0 kernel/power/swap.c:1520
       software_resume.part.0+0x102/0x1f0 kernel/power/hibernate.c:979
       software_resume kernel/power/hibernate.c:86 [inline]
       resume_store+0x161/0x190 kernel/power/hibernate.c:1181
       kobj_attr_store+0x50/0x80 lib/kobject.c:856
       sysfs_kf_write+0x110/0x160 fs/sysfs/file.c:136
       kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
       call_write_iter include/linux/fs.h:2079 [inline]
       new_sync_write+0x429/0x660 fs/read_write.c:503
       vfs_write+0x7cd/0xae0 fs/read_write.c:590
       ksys_write+0x12d/0x250 fs/read_write.c:643
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #5 (system_transition_mutex/1){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:607 [inline]
       __mutex_lock+0x12f/0x12f0 kernel/locking/mutex.c:740
       software_resume.part.0+0x19/0x1f0 kernel/power/hibernate.c:934
       software_resume kernel/power/hibernate.c:86 [inline]
       resume_store+0x161/0x190 kernel/power/hibernate.c:1181
       kobj_attr_store+0x50/0x80 lib/kobject.c:856
       sysfs_kf_write+0x110/0x160 fs/sysfs/file.c:136
       kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
       call_write_iter include/linux/fs.h:2079 [inline]
       new_sync_write+0x429/0x660 fs/read_write.c:503
       vfs_write+0x7cd/0xae0 fs/read_write.c:590
       ksys_write+0x12d/0x250 fs/read_write.c:643
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #4 (&of->mutex){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:607 [inline]
       __mutex_lock+0x12f/0x12f0 kernel/locking/mutex.c:740
       kernfs_seq_start+0x4b/0x260 fs/kernfs/file.c:112
       seq_read_iter+0x2c7/0x1240 fs/seq_file.c:225
       kernfs_fop_read_iter+0x44f/0x5f0 fs/kernfs/file.c:241
       call_read_iter include/linux/fs.h:2073 [inline]
       new_sync_read+0x421/0x6e0 fs/read_write.c:400
       vfs_read+0x35c/0x600 fs/read_write.c:481
       ksys_read+0x12d/0x250 fs/read_write.c:619
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #3 (&p->lock){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:607 [inline]
       __mutex_lock+0x12f/0x12f0 kernel/locking/mutex.c:740
       seq_read_iter+0xdf/0x1240 fs/seq_file.c:182
       call_read_iter include/linux/fs.h:2073 [inline]
       generic_file_splice_read+0x453/0x6d0 fs/splice.c:311
       do_splice_to+0x1bf/0x250 fs/splice.c:796
       splice_direct_to_actor+0x2c2/0x8c0 fs/splice.c:870
       do_splice_direct+0x1b3/0x280 fs/splice.c:979
       do_sendfile+0xaf2/0x1250 fs/read_write.c:1245
       __do_sys_sendfile64 fs/read_write.c:1310 [inline]
       __se_sys_sendfile64 fs/read_write.c:1296 [inline]
       __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1296
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #2 (sb_writers#3){.+.+}-{0:0}:
       percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
       __sb_start_write include/linux/fs.h:1727 [inline]
       sb_start_write include/linux/fs.h:1797 [inline]
       file_start_write include/linux/fs.h:2942 [inline]
       lo_write_bvec drivers/block/loop.c:242 [inline]
       lo_write_simple drivers/block/loop.c:265 [inline]
       do_req_filebacked drivers/block/loop.c:494 [inline]
       loop_handle_cmd drivers/block/loop.c:1846 [inline]
       loop_process_work+0x1499/0x1db0 drivers/block/loop.c:1886
       process_one_work+0x9b2/0x1690 kernel/workqueue.c:2318
       worker_thread+0x658/0x11f0 kernel/workqueue.c:2465
       kthread+0x405/0x4f0 kernel/kthread.c:345
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
       process_one_work+0x921/0x1690 kernel/workqueue.c:2294
       worker_thread+0x658/0x11f0 kernel/workqueue.c:2465
       kthread+0x405/0x4f0 kernel/kthread.c:345
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

-> #0 ((wq_completion)loop3){+.+.}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:3063 [inline]
       check_prevs_add kernel/locking/lockdep.c:3186 [inline]
       validate_chain kernel/locking/lockdep.c:3801 [inline]
       __lock_acquire+0x2a07/0x54a0 kernel/locking/lockdep.c:5027
       lock_acquire kernel/locking/lockdep.c:5637 [inline]
       lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602
       flush_workqueue+0x110/0x15b0 kernel/workqueue.c:2838
       drain_workqueue+0x1a5/0x3c0 kernel/workqueue.c:3003
       destroy_workqueue+0x71/0x800 kernel/workqueue.c:4440
       __loop_clr_fd+0x1ab/0xe20 drivers/block/loop.c:1118
       lo_release+0x1ac/0x1f0 drivers/block/loop.c:1750
       blkdev_put_whole block/bdev.c:694 [inline]
       blkdev_put+0x2de/0x980 block/bdev.c:949
       blkdev_close+0x6a/0x80 block/fops.c:516
       __fput+0x286/0x9f0 fs/file_table.c:311
       task_work_run+0xdd/0x1a0 kernel/task_work.c:164
       tracehook_notify_resume include/linux/tracehook.h:189 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
       exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207
       __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
       syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
       do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
       entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
  (wq_completion)loop3 --> system_transition_mutex/1 --> &disk->open_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&disk->open_mutex);
                               lock(system_transition_mutex/1);
                               lock(&disk->open_mutex);
  lock((wq_completion)loop3);

 *** DEADLOCK ***

1 lock held by systemd-udevd/7081:
 #0: ffff88801a57a118 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x99/0x980 block/bdev.c:907

stack backtrace:
CPU: 1 PID: 7081 Comm: systemd-udevd Not tainted 5.16.0-rc4-next-20211207-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 check_noncircular+0x25f/0x2e0 kernel/locking/lockdep.c:2143
 check_prev_add kernel/locking/lockdep.c:3063 [inline]
 check_prevs_add kernel/locking/lockdep.c:3186 [inline]
 validate_chain kernel/locking/lockdep.c:3801 [inline]
 __lock_acquire+0x2a07/0x54a0 kernel/locking/lockdep.c:5027
 lock_acquire kernel/locking/lockdep.c:5637 [inline]
 lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602
 flush_workqueue+0x110/0x15b0 kernel/workqueue.c:2838
 drain_workqueue+0x1a5/0x3c0 kernel/workqueue.c:3003
 destroy_workqueue+0x71/0x800 kernel/workqueue.c:4440
 __loop_clr_fd+0x1ab/0xe20 drivers/block/loop.c:1118
 lo_release+0x1ac/0x1f0 drivers/block/loop.c:1750
 blkdev_put_whole block/bdev.c:694 [inline]
 blkdev_put+0x2de/0x980 block/bdev.c:949
 blkdev_close+0x6a/0x80 block/fops.c:516
 __fput+0x286/0x9f0 fs/file_table.c:311
 task_work_run+0xdd/0x1a0 kernel/task_work.c:164
 tracehook_notify_resume include/linux/tracehook.h:189 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
 exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207
 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
 syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f5387c2b270
Code: 73 01 c3 48 8b 0d 38 7d 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c1 20 00 00 75 10 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ee fb ff ff 48 89 04 24
RSP: 002b:00007ffce06abcb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000007 RCX: 00007f5387c2b270
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000007
RBP: 00007f5388ae5710 R08: 00005621f86a7af0 R09: 00005621f86b5700
R10: 00007f5388ae58c0 R11: 0000000000000246 R12: 0000000000000000
R13: 00005621f86a3e40 R14: 0000000000000003 R15: 000000000000000e
 </TASK>
I/O error, dev loop2, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
I/O error, dev loop2, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
I/O error, dev loop2, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Buffer I/O error on dev loop2, logical block 0, async page read
I/O error, dev loop3, sector 1 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Buffer I/O error on dev loop3, logical block 1, async page read
I/O error, dev loop3, sector 1 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Buffer I/O error on dev loop3, logical block 1, async page read


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] loop: make autoclear operation asynchronous
  2021-12-03 11:01                       ` Tetsuo Handa
@ 2021-12-08  9:56                         ` Tetsuo Handa
  0 siblings, 0 replies; 19+ messages in thread
From: Tetsuo Handa @ 2021-12-08  9:56 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara; +Cc: Dave Chinner, linux-block, Jens Axboe

Can we apply https://lkml.kernel.org/r/d1f760f9-cdb2-f40d-33d8-bfa517c731be@i-love.sakura.ne.jp ?

On 2021/12/03 20:01, Tetsuo Handa wrote:
> On 2021/12/03 15:50, Christoph Hellwig wrote:
>> task_work_add sounds nice, but it is currently not exported, which might
>> be for a reason (I don't really have any experience with it).
> 
> I didn't find a reason not to export it. But generally, task_work_add() users
> seem to implement a fallback which uses a WQ, for the case where
> task_work_add() fails (i.e. exit_task_work() was already called from
> do_exit()) or where task_work_add() cannot be used at all (e.g. the caller is
> a kernel thread).
> 
> I don't know whether there is a possibility that a kernel thread calls
> blkdev_put(), but implementing the fallback path requires a WQ anyway. Thus, I
> think that starting with WQ only and seeing if something breaks is fine.
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [syzbot] possible deadlock in blkdev_put (2)
  2021-11-27  7:10 [syzbot] possible deadlock in blkdev_put (2) syzbot
  2021-11-27 11:27 ` Tetsuo Handa
  2021-12-08  5:05 ` [syzbot] possible deadlock in blkdev_put (2) syzbot
@ 2021-12-08 11:42 ` syzbot
  2 siblings, 0 replies; 19+ messages in thread
From: syzbot @ 2021-12-08 11:42 UTC (permalink / raw)
  To: axboe, coreteam, davem, dsahern, fw, kadlec, kuba, linux-block,
	linux-kernel, netdev, netfilter-devel, pablo, penguin-kernel,
	syzkaller-bugs, yoshfuji

syzbot has bisected this issue to:

commit f9006acc8dfe59e25aa75729728ac57a8d84fc32
Author: Florian Westphal <fw@strlen.de>
Date:   Wed Apr 21 07:51:08 2021 +0000

    netfilter: arp_tables: pass table pointer via nf_hook_ops

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=12500ae5b00000
start commit:   04fe99a8d936 Add linux-next specific files for 20211207
git tree:       linux-next
final oops:     https://syzkaller.appspot.com/x/report.txt?x=11500ae5b00000
console output: https://syzkaller.appspot.com/x/log.txt?x=16500ae5b00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=4589399873466942
dashboard link: https://syzkaller.appspot.com/bug?extid=643e4ce4b6ad1347d372
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=11bb96ceb00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15ad1badb00000

Reported-by: syzbot+643e4ce4b6ad1347d372@syzkaller.appspotmail.com
Fixes: f9006acc8dfe ("netfilter: arp_tables: pass table pointer via nf_hook_ops")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-12-08 11:42 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-27  7:10 [syzbot] possible deadlock in blkdev_put (2) syzbot
2021-11-27 11:27 ` Tetsuo Handa
2021-11-28  5:32   ` Tetsuo Handa
2021-11-28  7:42     ` Tetsuo Handa
2021-11-29 10:21       ` Christoph Hellwig
2021-11-29 10:36         ` Tetsuo Handa
2021-11-29 14:13           ` Tetsuo Handa
2021-11-30 12:57           ` Christoph Hellwig
2021-12-01 14:41             ` [PATCH] loop: make autoclear operation asynchronous Tetsuo Handa
2021-12-02  7:22               ` Christoph Hellwig
2021-12-02 11:03                 ` Tetsuo Handa
2021-12-02 12:16               ` Jan Kara
2021-12-02 14:39                 ` Tetsuo Handa
2021-12-02 18:05                   ` Jan Kara
2021-12-03  6:50                     ` Christoph Hellwig
2021-12-03 11:01                       ` Tetsuo Handa
2021-12-08  9:56                         ` Tetsuo Handa
2021-12-08  5:05 ` [syzbot] possible deadlock in blkdev_put (2) syzbot
2021-12-08 11:42 ` syzbot
