From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
To: Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>,
	linux-block@vger.kernel.org, Ming Lei <ming.lei@redhat.com>
Subject: Re: yet another approach to fix loop autoclear for xfstets xfs/049
Date: Thu, 27 Jan 2022 10:05:45 +0900	[thread overview]
Message-ID: <7bebf860-2415-7eb6-55a1-47dc4439d9e9@I-love.SAKURA.ne.jp> (raw)
In-Reply-To: <20220126155040.1190842-1-hch@lst.de>

On 2022/01/27 0:50, Christoph Hellwig wrote:
> Hi Jens, hi Tetsuo,
> 
> this series uses the approach from Tetsuo to delay the destroy_workqueue
> call, extended by a delayed teardown of the workers to fix a potential
> race window where the workqueue can still be around after finishing the
> commands.  It then also removes the queue freeing in release that is
> not needed, to fix the dependency chain for that (which can't be
> reported by lockdep) as well.

I want to remove the disk->open_mutex => lo->lo_mutex => I/O completion chain itself
from /proc/lockdep. Even if a real deadlock cannot happen thanks to the lo->lo_refcnt
exclusion, I consider it a bad sign that the disk->open_mutex => lo->lo_mutex dependency
gets recorded at all: it makes it difficult to spot real deadlock possibilities when
things change in the future. I think that lo_release() is doing too many things under
disk->open_mutex.
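
To make the ordering concrete, here is a minimal user-space sketch (pthreads standing
in for the kernel mutexes; open_mutex/lo_mutex/release_path are hypothetical names for
illustration, not the kernel API) of the point in the release path where lockdep would
record the open_mutex => lo_mutex dependency:

```c
#include <pthread.h>

/* Stand-ins for disk->open_mutex and lo->lo_mutex. */
static pthread_mutex_t open_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lo_mutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * Models the current lo_release() path: lo_mutex is acquired while
 * open_mutex is already held, which is exactly the acquisition that
 * makes lockdep record the open_mutex => lo_mutex dependency.
 * Returns 0 on success.
 */
static int release_path(void)
{
	if (pthread_mutex_lock(&open_mutex))
		return -1;
	if (pthread_mutex_lock(&lo_mutex)) {	/* dependency recorded here */
		pthread_mutex_unlock(&open_mutex);
		return -1;
	}
	pthread_mutex_unlock(&lo_mutex);
	pthread_mutex_unlock(&open_mutex);
	return 0;
}
```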

I tried to defer taking lo->lo_mutex in lo_release() as much as possible, but this
chain still remains.

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cf7830a68113..a9abd213b38d 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1706,12 +1706,19 @@ static int lo_open(struct block_device *bdev, fmode_t mode)
 	struct loop_device *lo = bdev->bd_disk->private_data;
 	int err;
 
+	/*
+	 * Since lo_open() and lo_release() are serialized by disk->open_mutex,
+	 * we can racelessly increment/decrement lo->lo_refcnt without holding
+	 * lo->lo_mutex.
+	 */
 	if (atomic_inc_return(&lo->lo_refcnt) > 1)
 		return 0;
 
 	err = mutex_lock_killable(&lo->lo_mutex);
-	if (err)
+	if (err) {
+		atomic_dec(&lo->lo_refcnt);
 		return err;
+	}
 	if (lo->lo_state == Lo_deleting) {
 		atomic_dec(&lo->lo_refcnt);
 		err = -ENXIO;
@@ -1727,16 +1734,18 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 	if (!atomic_dec_and_test(&lo->lo_refcnt))
 		return;
 
-	mutex_lock(&lo->lo_mutex);
-	if (lo->lo_state == Lo_bound &&
-	    (lo->lo_flags & LO_FLAGS_AUTOCLEAR)) {
-		lo->lo_state = Lo_rundown;
-		mutex_unlock(&lo->lo_mutex);
-		__loop_clr_fd(lo, true);
+	/*
+	 * Since lo_open() and lo_release() are serialized by disk->open_mutex,
+	 * and lo->lo_refcnt == 0 means nobody is using this device, we can read
+	 * lo->lo_state and lo->lo_flags without holding lo->lo_mutex.
+	 */
+	if (lo->lo_state != Lo_bound || !(lo->lo_flags & LO_FLAGS_AUTOCLEAR))
 		return;
-	}
-
+	mutex_lock(&lo->lo_mutex);
+	lo->lo_state = Lo_rundown;
 	mutex_unlock(&lo->lo_mutex);
+	__loop_clr_fd(lo, true);
+	return;
 }
 
 static const struct block_device_operations lo_fops = {

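The refcount transitions this relies on can be sketched in user space (C11 atomics;
sketch_open()/sketch_release() are hypothetical names, not the kernel API). Since
lo_open() and lo_release() are serialized by disk->open_mutex, the 0 -> 1 and 1 -> 0
transitions never race, so the last closer may inspect state without lo->lo_mutex:

```c
#include <stdatomic.h>

/*
 * Mirrors the patch's use of lo->lo_refcnt: the outer disk->open_mutex
 * serializes open/release, so these transitions are never concurrent.
 */
static atomic_int lo_refcnt;

/*
 * Returns 1 for the first opener (who must then take lo_mutex),
 * 0 for every later opener (who may return immediately).
 */
static int sketch_open(void)
{
	return atomic_fetch_add(&lo_refcnt, 1) == 0;
}

/*
 * Returns 1 for the last closer (who may then read lo_state and
 * lo_flags locklessly), 0 otherwise.
 */
static int sketch_release(void)
{
	return atomic_fetch_sub(&lo_refcnt, 1) == 1;
}
```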
In __loop_clr_fd(), we wait for loop_validate_mutex. loop_validate_mutex can be
held while loop_change_fd() calls blk_mq_freeze_queue(). That is, loop recursion
interacts with other loop devices.

While each loop device takes care of only a single backing file, we can use multiple
loop devices for multiple backing files within the same mount point (e.g. /dev/loop0
attached to /mnt/file0 and /dev/loop1 attached to /mnt/file1), can't we? But fsfreeze
operates per mount point (e.g. /mnt/). That is, fsfreeze also interacts with other
loop devices, doesn't it?

disk->open_mutex is per-loop-device serialization, whereas loop_validate_mutex and
fsfreeze are global serialization. I'm worried about nesting global serialization
under per-device serialization when we already know that the combination of sysfs +
fsfreeze can cause deadlocks. I would prefer that lo_open()/lo_release() run without
holding disk->open_mutex.



Thread overview: 31+ messages
2022-01-26 15:50 yet another approach to fix loop autoclear for xfstets xfs/049 Christoph Hellwig
2022-01-26 15:50 ` [PATCH 1/8] loop: de-duplicate the idle worker freeing code Christoph Hellwig
2022-01-27  9:36   ` Jan Kara
2022-01-26 15:50 ` [PATCH 2/8] loop: initialize the worker tracking fields once Christoph Hellwig
2022-01-27  9:45   ` Jan Kara
2022-01-26 15:50 ` [PATCH 3/8] block: remove the racy bd_inode->i_mapping->nrpages asserts Christoph Hellwig
2022-01-27  9:47   ` Jan Kara
2022-01-27  9:49     ` Christoph Hellwig
2022-01-27 12:23       ` Jan Kara
2022-01-28  7:26         ` Christoph Hellwig
2022-01-28 11:45           ` Jan Kara
2022-01-26 15:50 ` [PATCH 4/8] loop: only take lo_mutex for the last reference in lo_release Christoph Hellwig
2022-01-27  9:48   ` Jan Kara
2022-01-27 10:19     ` Jan Kara
2022-01-27 10:28       ` Christoph Hellwig
2022-01-26 15:50 ` [PATCH 5/8] loop: only take lo_mutex for the first reference in lo_open Christoph Hellwig
2022-01-27 10:28   ` Jan Kara
2022-01-27 10:31     ` Tetsuo Handa
2022-01-26 15:50 ` [PATCH 6/8] loop: don't freeze the queue in lo_release Christoph Hellwig
2022-01-27 10:42   ` Jan Kara
2022-01-28  6:46     ` Christoph Hellwig
2022-01-26 15:50 ` [PATCH 7/8] loop: only freeze the queue in __loop_clr_fd when needed Christoph Hellwig
2022-01-27 11:01   ` Jan Kara
2022-01-28  6:48     ` Christoph Hellwig
2022-01-26 15:50 ` [PATCH 8/8] loop: make autoclear operation synchronous again Christoph Hellwig
2022-01-27 11:04   ` Jan Kara
2022-01-26 19:38 ` yet another approach to fix loop autoclear for xfstets xfs/049 Darrick J. Wong
2022-01-27 16:50   ` Darrick J. Wong
2022-01-27  1:05 ` Tetsuo Handa [this message]
2022-01-28  7:08   ` Christoph Hellwig
2022-01-28  9:52     ` Tetsuo Handa
