* fix tag freeing use after free and debugfs name reuse
@ 2022-06-14  7:48 Christoph Hellwig
  2022-06-14  7:48 ` [PATCH 1/4] block: disable the elevator int del_gendisk Christoph Hellwig
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Christoph Hellwig @ 2022-06-14  7:48 UTC (permalink / raw)
  To: axboe; +Cc: shinichiro.kawasaki, dan.j.williams, yukuai3, ming.lei, linux-block

Hi all,

the first patch fixes a use after free, and the others deal with debugfs
name reuse that spews warnings and makes debugfs unusable for quickly
reused gendisk instances.  Both issues are rooted in sloppy lifetime
rules for block device teardown.

Compared to the previous separate postings this adds a missing queue
quiesce and documents debugfs_mutex better.

Diffstat:
 block/blk-core.c        |   13 -------------
 block/blk-mq-debugfs.c  |   29 ++++++++++++++++++-----------
 block/blk-mq-debugfs.h  |   10 ----------
 block/blk-mq-sched.c    |   11 +++++++++++
 block/blk-rq-qos.c      |    2 --
 block/blk-rq-qos.h      |    7 ++++++-
 block/blk-sysfs.c       |   30 ++++++++++++++----------------
 block/genhd.c           |   42 ++++++++++++------------------------------
 include/linux/blkdev.h  |    8 ++++----
 kernel/trace/blktrace.c |    3 ---
 10 files changed, 65 insertions(+), 90 deletions(-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/4] block: disable the elevator int del_gendisk
  2022-06-14  7:48 fix tag freeing use after free and debugfs name reuse Christoph Hellwig
@ 2022-06-14  7:48 ` Christoph Hellwig
  2022-06-14  8:23   ` Ming Lei
  2022-06-14  7:48 ` [PATCH 2/4] block: serialize all debugfs operations using q->debugfs_mutex Christoph Hellwig
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2022-06-14  7:48 UTC (permalink / raw)
  To: axboe
  Cc: shinichiro.kawasaki, dan.j.williams, yukuai3, ming.lei,
	linux-block, syzbot+3e3f419f4a7816471838

The elevator is only used for file system requests, which are stopped in
del_gendisk.  Move disabling the elevator and freeing the scheduler tags
to the end of del_gendisk instead of doing that work in disk_release and
blk_cleanup_queue to avoid a use after free on q->tag_set from
disk_release as the tag_set might not be alive at that point.

Move the rq_qos_exit call as well, as it just depends on the elevator
exit and would be the only reason to keep the not exactly cheap queue
freeze in disk_release.

Fixes: e155b0c238b2 ("blk-mq: Use shared tags for shared sbitmap support")
Reported-by: syzbot+3e3f419f4a7816471838@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+3e3f419f4a7816471838@syzkaller.appspotmail.com
---
 block/blk-core.c | 13 -------------
 block/genhd.c    | 39 +++++++++++----------------------------
 2 files changed, 11 insertions(+), 41 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 06ff5bbfe8f66..27fb1357ad4b8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -322,19 +322,6 @@ void blk_cleanup_queue(struct request_queue *q)
 		blk_mq_exit_queue(q);
 	}
 
-	/*
-	 * In theory, request pool of sched_tags belongs to request queue.
-	 * However, the current implementation requires tag_set for freeing
-	 * requests, so free the pool now.
-	 *
-	 * Queue has become frozen, there can't be any in-queue requests, so
-	 * it is safe to free requests now.
-	 */
-	mutex_lock(&q->sysfs_lock);
-	if (q->elevator)
-		blk_mq_sched_free_rqs(q);
-	mutex_unlock(&q->sysfs_lock);
-
 	/* @q is and will stay empty, shutdown and put */
 	blk_put_queue(q);
 }
diff --git a/block/genhd.c b/block/genhd.c
index 27205ae47d593..e0675772178b0 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -652,6 +652,17 @@ void del_gendisk(struct gendisk *disk)
 
 	blk_sync_queue(q);
 	blk_flush_integrity();
+	blk_mq_cancel_work_sync(q);
+
+	blk_mq_quiesce_queue(q);
+	if (q->elevator) {
+		mutex_lock(&q->sysfs_lock);
+		elevator_exit(q);
+		mutex_unlock(&q->sysfs_lock);
+	}
+	rq_qos_exit(q);
+	blk_mq_unquiesce_queue(q);
+
 	/*
 	 * Allow using passthrough request again after the queue is torn down.
 	 */
@@ -1120,31 +1131,6 @@ static const struct attribute_group *disk_attr_groups[] = {
 	NULL
 };
 
-static void disk_release_mq(struct request_queue *q)
-{
-	blk_mq_cancel_work_sync(q);
-
-	/*
-	 * There can't be any non non-passthrough bios in flight here, but
-	 * requests stay around longer, including passthrough ones so we
-	 * still need to freeze the queue here.
-	 */
-	blk_mq_freeze_queue(q);
-
-	/*
-	 * Since the I/O scheduler exit code may access cgroup information,
-	 * perform I/O scheduler exit before disassociating from the block
-	 * cgroup controller.
-	 */
-	if (q->elevator) {
-		mutex_lock(&q->sysfs_lock);
-		elevator_exit(q);
-		mutex_unlock(&q->sysfs_lock);
-	}
-	rq_qos_exit(q);
-	__blk_mq_unfreeze_queue(q, true);
-}
-
 /**
  * disk_release - releases all allocated resources of the gendisk
  * @dev: the device representing this disk
@@ -1166,9 +1152,6 @@ static void disk_release(struct device *dev)
 	might_sleep();
 	WARN_ON_ONCE(disk_live(disk));
 
-	if (queue_is_mq(disk->queue))
-		disk_release_mq(disk->queue);
-
 	blkcg_exit_queue(disk->queue);
 
 	disk_release_events(disk);
-- 
2.30.2



* [PATCH 2/4] block: serialize all debugfs operations using q->debugfs_mutex
  2022-06-14  7:48 fix tag freeing use after free and debugfs name reuse Christoph Hellwig
  2022-06-14  7:48 ` [PATCH 1/4] block: disable the elevator int del_gendisk Christoph Hellwig
@ 2022-06-14  7:48 ` Christoph Hellwig
  2022-06-14  7:48 ` [PATCH 3/4] block: remove per-disk debugfs files in blk_unregister_queue Christoph Hellwig
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2022-06-14  7:48 UTC (permalink / raw)
  To: axboe; +Cc: shinichiro.kawasaki, dan.j.williams, yukuai3, ming.lei, linux-block

Various places like I/O schedulers or the QOS infrastructure try to
register debugfs files on demand, which can race with creating and
removing the main queue debugfs directory.  Use the existing
debugfs_mutex to serialize all debugfs operations that rely on
q->debugfs_dir or the directories hanging off it.

To make the teardown code a little simpler, declare all debugfs dentry
pointers, not just the main one, unconditionally in blkdev.h.

Move debugfs_mutex next to the dentries that it protects and document
what it is used for.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq-debugfs.c  | 25 ++++++++++++++++++++-----
 block/blk-mq-debugfs.h  |  5 -----
 block/blk-mq-sched.c    | 11 +++++++++++
 block/blk-rq-qos.c      |  2 ++
 block/blk-rq-qos.h      |  7 ++++++-
 block/blk-sysfs.c       | 20 +++++++++-----------
 include/linux/blkdev.h  |  8 ++++----
 kernel/trace/blktrace.c |  3 ---
 8 files changed, 52 insertions(+), 29 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 7e4136a60e1cc..f0fcfe1387cbc 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -711,11 +711,6 @@ void blk_mq_debugfs_register(struct request_queue *q)
 	}
 }
 
-void blk_mq_debugfs_unregister(struct request_queue *q)
-{
-	q->sched_debugfs_dir = NULL;
-}
-
 static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
 					struct blk_mq_ctx *ctx)
 {
@@ -746,6 +741,8 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
 
 void blk_mq_debugfs_unregister_hctx(struct blk_mq_hw_ctx *hctx)
 {
+	if (!hctx->queue->debugfs_dir)
+		return;
 	debugfs_remove_recursive(hctx->debugfs_dir);
 	hctx->sched_debugfs_dir = NULL;
 	hctx->debugfs_dir = NULL;
@@ -773,6 +770,8 @@ void blk_mq_debugfs_register_sched(struct request_queue *q)
 {
 	struct elevator_type *e = q->elevator->type;
 
+	lockdep_assert_held(&q->debugfs_mutex);
+
 	/*
 	 * If the parent directory has not been created yet, return, we will be
 	 * called again later on and the directory/files will be created then.
@@ -790,6 +789,8 @@ void blk_mq_debugfs_register_sched(struct request_queue *q)
 
 void blk_mq_debugfs_unregister_sched(struct request_queue *q)
 {
+	lockdep_assert_held(&q->debugfs_mutex);
+
 	debugfs_remove_recursive(q->sched_debugfs_dir);
 	q->sched_debugfs_dir = NULL;
 }
@@ -811,6 +812,10 @@ static const char *rq_qos_id_to_name(enum rq_qos_id id)
 
 void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
 {
+	lockdep_assert_held(&rqos->q->debugfs_mutex);
+
+	if (!rqos->q->debugfs_dir)
+		return;
 	debugfs_remove_recursive(rqos->debugfs_dir);
 	rqos->debugfs_dir = NULL;
 }
@@ -820,6 +825,8 @@ void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
 	struct request_queue *q = rqos->q;
 	const char *dir_name = rq_qos_id_to_name(rqos->id);
 
+	lockdep_assert_held(&q->debugfs_mutex);
+
 	if (rqos->debugfs_dir || !rqos->ops->debugfs_attrs)
 		return;
 
@@ -835,6 +842,8 @@ void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
 
 void blk_mq_debugfs_unregister_queue_rqos(struct request_queue *q)
 {
+	lockdep_assert_held(&q->debugfs_mutex);
+
 	debugfs_remove_recursive(q->rqos_debugfs_dir);
 	q->rqos_debugfs_dir = NULL;
 }
@@ -844,6 +853,8 @@ void blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
 {
 	struct elevator_type *e = q->elevator->type;
 
+	lockdep_assert_held(&q->debugfs_mutex);
+
 	/*
 	 * If the parent debugfs directory has not been created yet, return;
 	 * We will be called again later on with appropriate parent debugfs
@@ -863,6 +874,10 @@ void blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
 
 void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx)
 {
+	lockdep_assert_held(&hctx->queue->debugfs_mutex);
+
+	if (!hctx->queue->debugfs_dir)
+		return;
 	debugfs_remove_recursive(hctx->sched_debugfs_dir);
 	hctx->sched_debugfs_dir = NULL;
 }
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index 69918f4170d69..771d458328788 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -21,7 +21,6 @@ int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
 int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);
 
 void blk_mq_debugfs_register(struct request_queue *q);
-void blk_mq_debugfs_unregister(struct request_queue *q);
 void blk_mq_debugfs_register_hctx(struct request_queue *q,
 				  struct blk_mq_hw_ctx *hctx);
 void blk_mq_debugfs_unregister_hctx(struct blk_mq_hw_ctx *hctx);
@@ -42,10 +41,6 @@ static inline void blk_mq_debugfs_register(struct request_queue *q)
 {
 }
 
-static inline void blk_mq_debugfs_unregister(struct request_queue *q)
-{
-}
-
 static inline void blk_mq_debugfs_register_hctx(struct request_queue *q,
 						struct blk_mq_hw_ctx *hctx)
 {
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 9e56a69422b65..e84bec39fd3ad 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -593,7 +593,9 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	if (ret)
 		goto err_free_map_and_rqs;
 
+	mutex_lock(&q->debugfs_mutex);
 	blk_mq_debugfs_register_sched(q);
+	mutex_unlock(&q->debugfs_mutex);
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (e->ops.init_hctx) {
@@ -606,7 +608,9 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 				return ret;
 			}
 		}
+		mutex_lock(&q->debugfs_mutex);
 		blk_mq_debugfs_register_sched_hctx(q, hctx);
+		mutex_unlock(&q->debugfs_mutex);
 	}
 
 	return 0;
@@ -647,14 +651,21 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 	unsigned int flags = 0;
 
 	queue_for_each_hw_ctx(q, hctx, i) {
+		mutex_lock(&q->debugfs_mutex);
 		blk_mq_debugfs_unregister_sched_hctx(hctx);
+		mutex_unlock(&q->debugfs_mutex);
+
 		if (e->type->ops.exit_hctx && hctx->sched_data) {
 			e->type->ops.exit_hctx(hctx, i);
 			hctx->sched_data = NULL;
 		}
 		flags = hctx->flags;
 	}
+
+	mutex_lock(&q->debugfs_mutex);
 	blk_mq_debugfs_unregister_sched(q);
+	mutex_unlock(&q->debugfs_mutex);
+
 	if (e->type->ops.exit_sched)
 		e->type->ops.exit_sched(e);
 	blk_mq_sched_tags_teardown(q, flags);
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index e83af7bc75919..249a6f05dd3bd 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -294,7 +294,9 @@ void rq_qos_wait(struct rq_wait *rqw, void *private_data,
 
 void rq_qos_exit(struct request_queue *q)
 {
+	mutex_lock(&q->debugfs_mutex);
 	blk_mq_debugfs_unregister_queue_rqos(q);
+	mutex_unlock(&q->debugfs_mutex);
 
 	while (q->rq_qos) {
 		struct rq_qos *rqos = q->rq_qos;
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 68267007da1c6..0e46052b018a4 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -104,8 +104,11 @@ static inline void rq_qos_add(struct request_queue *q, struct rq_qos *rqos)
 
 	blk_mq_unfreeze_queue(q);
 
-	if (rqos->ops->debugfs_attrs)
+	if (rqos->ops->debugfs_attrs) {
+		mutex_lock(&q->debugfs_mutex);
 		blk_mq_debugfs_register_rqos(rqos);
+		mutex_unlock(&q->debugfs_mutex);
+	}
 }
 
 static inline void rq_qos_del(struct request_queue *q, struct rq_qos *rqos)
@@ -129,7 +132,9 @@ static inline void rq_qos_del(struct request_queue *q, struct rq_qos *rqos)
 
 	blk_mq_unfreeze_queue(q);
 
+	mutex_lock(&q->debugfs_mutex);
 	blk_mq_debugfs_unregister_rqos(rqos);
+	mutex_unlock(&q->debugfs_mutex);
 }
 
 typedef bool (acquire_inflight_cb_t)(struct rq_wait *rqw, void *private_data);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 88bd41d4cb593..6e4801b217a79 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -779,14 +779,13 @@ static void blk_release_queue(struct kobject *kobj)
 	if (queue_is_mq(q))
 		blk_mq_release(q);
 
-	blk_trace_shutdown(q);
 	mutex_lock(&q->debugfs_mutex);
+	blk_trace_shutdown(q);
 	debugfs_remove_recursive(q->debugfs_dir);
+	q->debugfs_dir = NULL;
+	q->sched_debugfs_dir = NULL;
 	mutex_unlock(&q->debugfs_mutex);
 
-	if (queue_is_mq(q))
-		blk_mq_debugfs_unregister(q);
-
 	bioset_exit(&q->bio_split);
 
 	if (blk_queue_has_srcu(q))
@@ -836,17 +835,16 @@ int blk_register_queue(struct gendisk *disk)
 		goto unlock;
 	}
 
+	if (queue_is_mq(q))
+		__blk_mq_register_dev(dev, q);
+	mutex_lock(&q->sysfs_lock);
+
 	mutex_lock(&q->debugfs_mutex);
 	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
 					    blk_debugfs_root);
-	mutex_unlock(&q->debugfs_mutex);
-
-	if (queue_is_mq(q)) {
-		__blk_mq_register_dev(dev, q);
+	if (queue_is_mq(q))
 		blk_mq_debugfs_register(q);
-	}
-
-	mutex_lock(&q->sysfs_lock);
+	mutex_unlock(&q->debugfs_mutex);
 
 	ret = disk_register_independent_access_ranges(disk, NULL);
 	if (ret)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 608d577734c29..0fc912d756f94 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -482,7 +482,6 @@ struct request_queue {
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 	int			node;
-	struct mutex		debugfs_mutex;
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	struct blk_trace __rcu	*blk_trace;
 #endif
@@ -526,11 +525,12 @@ struct request_queue {
 	struct bio_set		bio_split;
 
 	struct dentry		*debugfs_dir;
-
-#ifdef CONFIG_BLK_DEBUG_FS
 	struct dentry		*sched_debugfs_dir;
 	struct dentry		*rqos_debugfs_dir;
-#endif
+	/*
+	 * Serializes all debugfs metadata operations using the above dentries.
+	 */
+	struct mutex		debugfs_mutex;
 
 	bool			mq_sysfs_init_done;
 
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 10a32b0f2deb6..fe04c6f96ca5d 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -770,14 +770,11 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
  **/
 void blk_trace_shutdown(struct request_queue *q)
 {
-	mutex_lock(&q->debugfs_mutex);
 	if (rcu_dereference_protected(q->blk_trace,
 				      lockdep_is_held(&q->debugfs_mutex))) {
 		__blk_trace_startstop(q, 0);
 		__blk_trace_remove(q);
 	}
-
-	mutex_unlock(&q->debugfs_mutex);
 }
 
 #ifdef CONFIG_BLK_CGROUP
-- 
2.30.2



* [PATCH 3/4] block: remove per-disk debugfs files in blk_unregister_queue
  2022-06-14  7:48 fix tag freeing use after free and debugfs name reuse Christoph Hellwig
  2022-06-14  7:48 ` [PATCH 1/4] block: disable the elevator int del_gendisk Christoph Hellwig
  2022-06-14  7:48 ` [PATCH 2/4] block: serialize all debugfs operations using q->debugfs_mutex Christoph Hellwig
@ 2022-06-14  7:48 ` Christoph Hellwig
  2022-06-14  7:48 ` [PATCH 4/4] block: freeze the queue earlier in del_gendisk Christoph Hellwig
  2022-06-17 13:31 ` fix tag freeing use after free and debugfs name reuse Jens Axboe
  4 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2022-06-14  7:48 UTC (permalink / raw)
  To: axboe; +Cc: shinichiro.kawasaki, dan.j.williams, yukuai3, ming.lei, linux-block

The block debugfs files are created in blk_register_queue, which is
called by add_disk and uses a naming scheme based on the disk_name.
After del_gendisk returns, that name can be reused and thus we must not
leave these debugfs files around, otherwise the kernel is unhappy
and spews messages like:

	Directory XXXXX with parent 'block' already present!

and the newly created devices will not have working debugfs files.

Move the unregistration to blk_unregister_queue instead (which matches
the sysfs unregistration) to make sure the debugfs lifetime rules match
those of the disk name.

As part of the move also make sure the whole debugfs unregistration is
inside a single debugfs_mutex critical section.

Note that this breaks blktests block/002, which checks that the debugfs
directory has not been removed while blktests is running, but that
particular check should simply be removed from the test case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq-debugfs.c |  8 --------
 block/blk-mq-debugfs.h |  5 -----
 block/blk-rq-qos.c     |  4 ----
 block/blk-sysfs.c      | 16 ++++++++--------
 4 files changed, 8 insertions(+), 25 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index f0fcfe1387cbc..4d1ce9ef43187 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -840,14 +840,6 @@ void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
 	debugfs_create_files(rqos->debugfs_dir, rqos, rqos->ops->debugfs_attrs);
 }
 
-void blk_mq_debugfs_unregister_queue_rqos(struct request_queue *q)
-{
-	lockdep_assert_held(&q->debugfs_mutex);
-
-	debugfs_remove_recursive(q->rqos_debugfs_dir);
-	q->rqos_debugfs_dir = NULL;
-}
-
 void blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
 					struct blk_mq_hw_ctx *hctx)
 {
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index 771d458328788..9c7d4b6117d41 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -35,7 +35,6 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_debugfs_register_rqos(struct rq_qos *rqos);
 void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos);
-void blk_mq_debugfs_unregister_queue_rqos(struct request_queue *q);
 #else
 static inline void blk_mq_debugfs_register(struct request_queue *q)
 {
@@ -82,10 +81,6 @@ static inline void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
 static inline void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
 {
 }
-
-static inline void blk_mq_debugfs_unregister_queue_rqos(struct request_queue *q)
-{
-}
 #endif
 
 #ifdef CONFIG_BLK_DEBUG_FS_ZONED
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 249a6f05dd3bd..d3a75693adbf4 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -294,10 +294,6 @@ void rq_qos_wait(struct rq_wait *rqw, void *private_data,
 
 void rq_qos_exit(struct request_queue *q)
 {
-	mutex_lock(&q->debugfs_mutex);
-	blk_mq_debugfs_unregister_queue_rqos(q);
-	mutex_unlock(&q->debugfs_mutex);
-
 	while (q->rq_qos) {
 		struct rq_qos *rqos = q->rq_qos;
 		q->rq_qos = rqos->next;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6e4801b217a79..9b905e9443e49 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -779,13 +779,6 @@ static void blk_release_queue(struct kobject *kobj)
 	if (queue_is_mq(q))
 		blk_mq_release(q);
 
-	mutex_lock(&q->debugfs_mutex);
-	blk_trace_shutdown(q);
-	debugfs_remove_recursive(q->debugfs_dir);
-	q->debugfs_dir = NULL;
-	q->sched_debugfs_dir = NULL;
-	mutex_unlock(&q->debugfs_mutex);
-
 	bioset_exit(&q->bio_split);
 
 	if (blk_queue_has_srcu(q))
@@ -946,8 +939,15 @@ void blk_unregister_queue(struct gendisk *disk)
 	/* Now that we've deleted all child objects, we can delete the queue. */
 	kobject_uevent(&q->kobj, KOBJ_REMOVE);
 	kobject_del(&q->kobj);
-
 	mutex_unlock(&q->sysfs_dir_lock);
 
+	mutex_lock(&q->debugfs_mutex);
+	blk_trace_shutdown(q);
+	debugfs_remove_recursive(q->debugfs_dir);
+	q->debugfs_dir = NULL;
+	q->sched_debugfs_dir = NULL;
+	q->rqos_debugfs_dir = NULL;
+	mutex_unlock(&q->debugfs_mutex);
+
 	kobject_put(&disk_to_dev(disk)->kobj);
 }
-- 
2.30.2



* [PATCH 4/4] block: freeze the queue earlier in del_gendisk
  2022-06-14  7:48 fix tag freeing use after free and debugfs name reuse Christoph Hellwig
                   ` (2 preceding siblings ...)
  2022-06-14  7:48 ` [PATCH 3/4] block: remove per-disk debugfs files in blk_unregister_queue Christoph Hellwig
@ 2022-06-14  7:48 ` Christoph Hellwig
  2022-07-08  5:41   ` REGRESSION: " Logan Gunthorpe
  2022-06-17 13:31 ` fix tag freeing use after free and debugfs name reuse Jens Axboe
  4 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2022-06-14  7:48 UTC (permalink / raw)
  To: axboe; +Cc: shinichiro.kawasaki, dan.j.williams, yukuai3, ming.lei, linux-block

Freeze the queue earlier in del_gendisk so that the state does not
change while we remove debugfs and sysfs files.

Ming mentioned that being able to observe requests in debugfs might
be useful while the queue is being frozen in del_gendisk, which is
made possible by this change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/genhd.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index e0675772178b0..278227ba1d531 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -623,6 +623,7 @@ void del_gendisk(struct gendisk *disk)
 	 * Prevent new I/O from crossing bio_queue_enter().
 	 */
 	blk_queue_start_drain(q);
+	blk_mq_freeze_queue_wait(q);
 
 	if (!(disk->flags & GENHD_FL_HIDDEN)) {
 		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
@@ -646,8 +647,6 @@ void del_gendisk(struct gendisk *disk)
 	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
 	device_del(disk_to_dev(disk));
 
-	blk_mq_freeze_queue_wait(q);
-
 	blk_throtl_cancel_bios(disk->queue);
 
 	blk_sync_queue(q);
-- 
2.30.2



* Re: [PATCH 1/4] block: disable the elevator int del_gendisk
  2022-06-14  7:48 ` [PATCH 1/4] block: disable the elevator int del_gendisk Christoph Hellwig
@ 2022-06-14  8:23   ` Ming Lei
  2022-06-14  8:34     ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2022-06-14  8:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, shinichiro.kawasaki, dan.j.williams, yukuai3, linux-block,
	syzbot+3e3f419f4a7816471838, ming.lei

On Tue, Jun 14, 2022 at 09:48:24AM +0200, Christoph Hellwig wrote:
> The elevator is only used for file system requests, which are stopped in
> del_gendisk.  Move disabling the elevator and freeing the scheduler tags
> to the end of del_gendisk instead of doing that work in disk_release and
> blk_cleanup_queue to avoid a use after free on q->tag_set from
> disk_release as the tag_set might not be alive at that point.
> 
> Move the blk_qos_exit call as well, as it just depends on the elevator
> exit and would be the only reason to keep the not exactly cheap queue
> freeze in disk_release.
> 
> Fixes: e155b0c238b2 ("blk-mq: Use shared tags for shared sbitmap support")
> Reported-by: syzbot+3e3f419f4a7816471838@syzkaller.appspotmail.com
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Tested-by: syzbot+3e3f419f4a7816471838@syzkaller.appspotmail.com
> ---
>  block/blk-core.c | 13 -------------
>  block/genhd.c    | 39 +++++++++++----------------------------
>  2 files changed, 11 insertions(+), 41 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 06ff5bbfe8f66..27fb1357ad4b8 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -322,19 +322,6 @@ void blk_cleanup_queue(struct request_queue *q)
>  		blk_mq_exit_queue(q);
>  	}
>  
> -	/*
> -	 * In theory, request pool of sched_tags belongs to request queue.
> -	 * However, the current implementation requires tag_set for freeing
> -	 * requests, so free the pool now.
> -	 *
> -	 * Queue has become frozen, there can't be any in-queue requests, so
> -	 * it is safe to free requests now.
> -	 */
> -	mutex_lock(&q->sysfs_lock);
> -	if (q->elevator)
> -		blk_mq_sched_free_rqs(q);
> -	mutex_unlock(&q->sysfs_lock);
> -
>  	/* @q is and will stay empty, shutdown and put */
>  	blk_put_queue(q);
>  }
> diff --git a/block/genhd.c b/block/genhd.c
> index 27205ae47d593..e0675772178b0 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -652,6 +652,17 @@ void del_gendisk(struct gendisk *disk)
>  
>  	blk_sync_queue(q);
>  	blk_flush_integrity();
> +	blk_mq_cancel_work_sync(q);
> +
> +	blk_mq_quiesce_queue(q);

quiescing the queue adds a fairly long delay to del_gendisk; I am not sure
whether this may cause a regression on big machines with lots of disks.

> +	if (q->elevator) {
> +		mutex_lock(&q->sysfs_lock);
> +		elevator_exit(q);
> +		mutex_unlock(&q->sysfs_lock);
> +	}
> +	rq_qos_exit(q);
> +	blk_mq_unquiesce_queue(q);

Also, tearing down the elevator here has to be done carefully: any
elevator reference has to hold the rcu read lock or .q_usage_counter,
and it has to be checked in the meantime, otherwise a use-after-free
may be caused.

Unfortunately, there are some cases which look unsafe, such as
__blk_mq_update_nr_hw_queues() and blk_mq_has_sqsched().

Another example is bfq_insert_request()<-bfq_insert_requests():

static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
                               bool at_head)
{
		...
        spin_unlock_irq(&bfqd->lock);

        bfq_update_insert_stats(q, bfqq, idle_timer_disabled,
                                cmd_flags);
}

If the last 'rq' completes between unlocking bfqd->lock and calling
bfq_update_insert_stats(), del_gendisk() may tear down the elevator,
causing a use-after-free.


Thanks,
Ming



* Re: [PATCH 1/4] block: disable the elevator int del_gendisk
  2022-06-14  8:23   ` Ming Lei
@ 2022-06-14  8:34     ` Christoph Hellwig
  2022-06-14 11:27       ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2022-06-14  8:34 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, axboe, shinichiro.kawasaki, dan.j.williams,
	yukuai3, linux-block, syzbot+3e3f419f4a7816471838

On Tue, Jun 14, 2022 at 04:23:36PM +0800, Ming Lei wrote:
> >  	blk_sync_queue(q);
> >  	blk_flush_integrity();
> > +	blk_mq_cancel_work_sync(q);
> > +
> > +	blk_mq_quiesce_queue(q);
> 
> quiesce queue adds a bit long delay in del_gendisk, not sure if this way may
> cause regression in big machines with lots of disks.

It does.  But at least we remove a freeze in the queue teardown path.
But either way I'd really like to get things correct first before
looking into optimizations.

> 
> > +	if (q->elevator) {
> > +		mutex_lock(&q->sysfs_lock);
> > +		elevator_exit(q);
> > +		mutex_unlock(&q->sysfs_lock);
> > +	}
> > +	rq_qos_exit(q);
> > +	blk_mq_unquiesce_queue(q);
> 
> Also tearing down elevator here has to be carefully, that means any
> elevator reference has to hold rcu read lock or .q_usage_counter,
> meantime it has to be checked, otherwise use-after-free may be caused.

This is not a new pattern.  We have the same locking here as a
sysfs-induced change of the elevator to none which also clears
q->elevator under a queue that is frozen and quiesced.

But unlike that path, we do at least fail all requests that could have
been queued in the scheduler before unfreezing here.


* Re: [PATCH 1/4] block: disable the elevator int del_gendisk
  2022-06-14  8:34     ` Christoph Hellwig
@ 2022-06-14 11:27       ` Ming Lei
  2022-06-17 12:50         ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2022-06-14 11:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, shinichiro.kawasaki, dan.j.williams, yukuai3, linux-block,
	syzbot+3e3f419f4a7816471838

On Tue, Jun 14, 2022 at 10:34:53AM +0200, Christoph Hellwig wrote:
> On Tue, Jun 14, 2022 at 04:23:36PM +0800, Ming Lei wrote:
> > >  	blk_sync_queue(q);
> > >  	blk_flush_integrity();
> > > +	blk_mq_cancel_work_sync(q);
> > > +
> > > +	blk_mq_quiesce_queue(q);
> > 
> > quiesce queue adds a bit long delay in del_gendisk, not sure if this way may
> > cause regression in big machines with lots of disks.
> 
> It does.  But at least we remove a freeze in the queue teardown path.
> But either way I'd really like to get things correct first before
> looking into optimizations.

The removed freeze works in atomic mode and is super fast.

> 
> > 
> > > +	if (q->elevator) {
> > > +		mutex_lock(&q->sysfs_lock);
> > > +		elevator_exit(q);
> > > +		mutex_unlock(&q->sysfs_lock);
> > > +	}
> > > +	rq_qos_exit(q);
> > > +	blk_mq_unquiesce_queue(q);
> > 
> > Also tearing down elevator here has to be carefully, that means any
> > elevator reference has to hold rcu read lock or .q_usage_counter,
> > meantime it has to be checked, otherwise use-after-free may be caused.
> 
> This is not a new pattern.  We have the same locking here as a
> sysfs-induced change of the elevator to none which also clears
> q->elevator under a queue that is frozen and quiesced.

Then it looks like this pattern has problems dealing with the examples
I mentioned.

The elevator usage in __blk_mq_update_nr_hw_queues() looks like an old
problem, but it is easy to fix by protecting it via sysfs_lock.

And fixing blk_mq_has_sqsched() should be easy too.

I will send patches later.


Thanks,
Ming



* Re: [PATCH 1/4] block: disable the elevator int del_gendisk
  2022-06-14 11:27       ` Ming Lei
@ 2022-06-17 12:50         ` Jens Axboe
  2022-06-17 13:26           ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2022-06-17 12:50 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig
  Cc: shinichiro.kawasaki, dan.j.williams, yukuai3, linux-block,
	syzbot+3e3f419f4a7816471838

On 6/14/22 5:27 AM, Ming Lei wrote:
> On Tue, Jun 14, 2022 at 10:34:53AM +0200, Christoph Hellwig wrote:
>> On Tue, Jun 14, 2022 at 04:23:36PM +0800, Ming Lei wrote:
>>>>  	blk_sync_queue(q);
>>>>  	blk_flush_integrity();
>>>> +	blk_mq_cancel_work_sync(q);
>>>> +
>>>> +	blk_mq_quiesce_queue(q);
>>>
>>> quiesce queue adds a bit long delay in del_gendisk, not sure if this way may
>>> cause regression in big machines with lots of disks.
>>
>> It does.  But at least we remove a freeze in the queue teardown path.
>> But either way I'd really like to get things correct first before
>> looking into optimizations.
> 
> The removed one works at atomic mode and it is super fast.
> 
>>
>>>
>>>> +	if (q->elevator) {
>>>> +		mutex_lock(&q->sysfs_lock);
>>>> +		elevator_exit(q);
>>>> +		mutex_unlock(&q->sysfs_lock);
>>>> +	}
>>>> +	rq_qos_exit(q);
>>>> +	blk_mq_unquiesce_queue(q);
>>>
>>> Also tearing down elevator here has to be carefully, that means any
>>> elevator reference has to hold rcu read lock or .q_usage_counter,
>>> meantime it has to be checked, otherwise use-after-free may be caused.
>>
>> This is not a new pattern.  We have the same locking here as a
>> sysfs-induced change of the elevator to none which also clears
>> q->elevator under a queue that is frozen and quiesced.
> 
> Then it looks like this pattern has a problem dealing with the examples
> I mentioned.
> 
> And the elevator usage in __blk_mq_update_nr_hw_queues() looks like an
> old problem, but it is easy to fix by protecting it with sysfs_lock.
> 
> And fixing blk_mq_has_sqsched() should be easy too.
> 
> I will send patches later.

Just checking in on this series, Ming did you make any progress?

-- 
Jens Axboe



* Re: [PATCH 1/4] block: disable the elevator int del_gendisk
  2022-06-17 12:50         ` Jens Axboe
@ 2022-06-17 13:26           ` Christoph Hellwig
  2022-06-17 13:27             ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2022-06-17 13:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, Christoph Hellwig, shinichiro.kawasaki, dan.j.williams,
	yukuai3, linux-block, syzbot+3e3f419f4a7816471838

On Fri, Jun 17, 2022 at 06:50:50AM -0600, Jens Axboe wrote:
> > Then it looks like this pattern has a problem dealing with the
> > examples I mentioned.
> > 
> > And the elevator usage in __blk_mq_update_nr_hw_queues() looks like an
> > old problem, but it is easy to fix by protecting it with sysfs_lock.
> > 
> > And fixing blk_mq_has_sqsched() should be easy too.
> > 
> > I will send patches later.
> 
> Just checking in on this series, Ming did you make any progress?

He sent the patches in the "blk-mq: three misc patches" series and you
already applied them.


* Re: [PATCH 1/4] block: disable the elevator int del_gendisk
  2022-06-17 13:26           ` Christoph Hellwig
@ 2022-06-17 13:27             ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2022-06-17 13:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, shinichiro.kawasaki, dan.j.williams, yukuai3,
	linux-block, syzbot+3e3f419f4a7816471838

On 6/17/22 7:26 AM, Christoph Hellwig wrote:
> On Fri, Jun 17, 2022 at 06:50:50AM -0600, Jens Axboe wrote:
>>> Then it looks like this pattern has a problem dealing with the
>>> examples I mentioned.
>>>
>>> And the elevator usage in __blk_mq_update_nr_hw_queues() looks like an
>>> old problem, but it is easy to fix by protecting it with sysfs_lock.
>>>
>>> And fixing blk_mq_has_sqsched() should be easy too.
>>>
>>> I will send patches later.
>>
>> Just checking in on this series, Ming did you make any progress?
> 
> He sent the patches in the "blk-mq: three misc patches" series and you
> already applied them.

Ah ok, guess they are already queued up for 5.19. Hard to keep track
with the backlog.

-- 
Jens Axboe



* Re: fix tag freeing use after free and debugfs name reuse
  2022-06-14  7:48 fix tag freeing use after free and debugfs name reuse Christoph Hellwig
                   ` (3 preceding siblings ...)
  2022-06-14  7:48 ` [PATCH 4/4] block: freeze the queue earlier in del_gendisk Christoph Hellwig
@ 2022-06-17 13:31 ` Jens Axboe
  4 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2022-06-17 13:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: yukuai3, linux-block, dan.j.williams, ming.lei, shinichiro.kawasaki

On Tue, 14 Jun 2022 09:48:23 +0200, Christoph Hellwig wrote:
> the first patch fixes a use after free, and the others deal with debugfs
> name reuse that spews warnings and makes debugfs use impossible for
> quickly reused gendisk instances.  Both of those are rooted in sloppy
> life time rules for block device tear down.
> 
> Compared to the previous separate postings this adds a missing queue
> quiesce and documents debugfs_mutex better.
> 
> [...]

Applied, thanks!

[1/4] block: disable the elevator int del_gendisk
      commit: 50e34d78815e474d410f342fbe783b18192ca518
[2/4] block: serialize all debugfs operations using q->debugfs_mutex
      commit: 5cf9c91ba927119fc6606b938b1895bb2459d3bc
[3/4] block: remove per-disk debugfs files in blk_unregister_queue
      commit: 99d055b4fd4bbb309c6cdb51a0d420669f777944
[4/4] block: freeze the queue earlier in del_gendisk
      commit: a09b314005f3a0956ebf56e01b3b80339df577cc

Best regards,
-- 
Jens Axboe




* REGRESSION: [PATCH 4/4] block: freeze the queue earlier in del_gendisk
  2022-06-14  7:48 ` [PATCH 4/4] block: freeze the queue earlier in del_gendisk Christoph Hellwig
@ 2022-07-08  5:41   ` Logan Gunthorpe
  2022-07-08  6:01     ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Logan Gunthorpe @ 2022-07-08  5:41 UTC (permalink / raw)
  To: Christoph Hellwig, axboe
  Cc: shinichiro.kawasaki, dan.j.williams, yukuai3, ming.lei,
	linux-block, linux-raid, Song Liu

Hi,

On 2022-06-14 01:48, Christoph Hellwig wrote:
> Freeze the queue earlier in del_gendisk so that the state does not
> change while we remove debugfs and sysfs files.
> 
> Ming mentioned that being able to observe requests in debugfs might
> be useful while the queue is being frozen in del_gendisk, which is
> made possible by this change.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

I'm not really sure why this is yet, but this patch in rc4 causes some
random failures with mdadm tests.

It seems the 11spare-migration test starts failing roughly every other
run because the block device is not quite cleaned up after mdadm --stop
by the time the next mdadm --create command starts, or rather there
appears to be a race now between the newly created device and the one
being cleaned up. This results in an infrequent sysfs panic with a
duplicate filename error (see the end of this email).

I managed to bisect this and found a09b314005f3a09 to be the problematic
commit.

Reverting seems to fix it.

Thanks,

Logan

> ---
>  block/genhd.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index e0675772178b0..278227ba1d531 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -623,6 +623,7 @@ void del_gendisk(struct gendisk *disk)
>  	 * Prevent new I/O from crossing bio_queue_enter().
>  	 */
>  	blk_queue_start_drain(q);
> +	blk_mq_freeze_queue_wait(q);
>  
>  	if (!(disk->flags & GENHD_FL_HIDDEN)) {
>  		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
> @@ -646,8 +647,6 @@ void del_gendisk(struct gendisk *disk)
>  	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
>  	device_del(disk_to_dev(disk));
>  
> -	blk_mq_freeze_queue_wait(q);
> -
>  	blk_throtl_cancel_bios(disk->queue);
>  
>  	blk_sync_queue(q);


[ 1026.373014] sysfs: cannot create duplicate filename
'/devices/virtual/block/md124'
[ 1026.374616] CPU: 1 PID: 11046 Comm: mdadm Not tainted
5.19.0-rc4-eid-vmlocalyes-dbg-00065-gff4ec5f79108 #2430
[ 1026.376546] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.14.0-2 04/01/2014
[ 1026.378000] Call Trace:
[ 1026.378461]  <TASK>
[ 1026.378867]  dump_stack_lvl+0x5a/0x74
[ 1026.379558]  dump_stack+0x10/0x12
[ 1026.380178]  sysfs_warn_dup.cold+0x17/0x27
[ 1026.380944]  sysfs_create_dir_ns+0x17d/0x190
[ 1026.381765]  ? sysfs_create_mount_point+0x80/0x80
[ 1026.382638]  ? __kasan_check_read+0x11/0x20
[ 1026.383401]  ? class_dir_child_ns_type+0x23/0x30
[ 1026.384267]  kobject_add_internal+0x145/0x460
[ 1026.385084]  kobject_add+0xf3/0x150
[ 1026.385733]  ? kset_create_and_add+0xe0/0xe0
[ 1026.386529]  ? __kasan_check_read+0x11/0x20
[ 1026.387306]  ? mutex_unlock+0x12/0x20
[ 1026.387986]  ? device_add+0x1da/0xf20
[ 1026.388676]  device_add+0x224/0xf20
[ 1026.389316]  ? kobject_set_name_vargs+0x95/0xb0
[ 1026.390163]  ? __fw_devlink_link_to_suppliers+0x180/0x180
[ 1026.391157]  ? sprintf+0xae/0xe0
[ 1026.391784]  device_add_disk+0x1b8/0x5f0
[ 1026.392520]  md_alloc+0x4c9/0x800
[ 1026.393131]  ? __kasan_check_read+0x11/0x20
[ 1026.393921]  md_probe+0x24/0x30
[ 1026.394506]  blk_request_module+0x9a/0x100
[ 1026.395268]  blkdev_get_no_open+0x66/0xa0
[ 1026.395993]  blkdev_get_by_dev.part.0+0x24/0x570
[ 1026.396854]  ? devcgroup_check_permission+0xed/0x240
[ 1026.397770]  blkdev_get_by_dev+0x51/0x60
[ 1026.398497]  blkdev_open+0xa4/0x140
[ 1026.399146]  do_dentry_open+0x2a7/0x6e0
[ 1026.399854]  ? blkdev_close+0x50/0x50
[ 1026.400546]  vfs_open+0x58/0x60
[ 1026.401125]  path_openat+0x77e/0x13f0
[ 1026.401830]  ? lookup_open.isra.0+0xaf0/0xaf0
[ 1026.402615]  ? kvm_sched_clock_read+0x18/0x40
[ 1026.403441]  ? sched_autogroup_detach+0x20/0x20
[ 1026.404267]  ? __this_cpu_preempt_check+0x13/0x20
[ 1026.405141]  do_filp_open+0x154/0x280
[ 1026.405833]  ? may_open_dev+0x60/0x60
[ 1026.406558]  ? __kasan_check_read+0x11/0x20
[ 1026.407302]  ? do_raw_spin_unlock+0x98/0x100
[ 1026.408067]  ? alloc_fd+0x183/0x340
[ 1026.408718]  do_sys_openat2+0x119/0x2c0
[ 1026.409437]  ? kmem_cache_free+0x156/0x690
[ 1026.410167]  ? dput+0x29/0x750
[ 1026.410730]  ? build_open_flags+0x280/0x280
[ 1026.411607]  ? putname+0x7c/0x90
[ 1026.412164]  __x64_sys_openat+0xe7/0x160
[ 1026.412919]  ? __ia32_compat_sys_open+0x130/0x130
[ 1026.413630]  ? syscall_enter_from_user_mode+0x21/0x60
[ 1026.414271]  ? lockdep_hardirqs_on+0x82/0x110
[ 1026.414828]  ? trace_hardirqs_on+0x3d/0x100
[ 1026.415376]  do_syscall_64+0x3b/0x90
[ 1026.415837]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 1026.416473] RIP: 0033:0x7fab048f3be7
[ 1026.416938] Code: 25 00 00 41 00 3d 00 00 41 00 74 47 64 8b 04 25 18
00 00 00 85 c0 75 6b 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
05 <48> 3d 00 f0 ff ff 0f 87 95 00 00 00 48 8b 4c 24 28 64 48 2b 0c 25
[ 1026.419219] RSP: 002b:00007ffc3c3d7ae0 EFLAGS: 00000246 ORIG_RAX:
0000000000000101
[ 1026.420162] RAX: ffffffffffffffda RBX: 00000000000003e8 RCX:
00007fab048f3be7
[ 1026.421045] RDX: 0000000000004082 RSI: 00007ffc3c3d7b70 RDI:
00000000ffffff9c
[ 1026.421928] RBP: 00007ffc3c3d7b70 R08: 0000000000000000 R09:
00007ffc3c3d79f0
[ 1026.422809] R10: 0000000000000000 R11: 0000000000000246 R12:
0000000000004082
[ 1026.423696] R13: 0000000000000009 R14: 00007ffc3c3d7b68 R15:
0000556054ddd970
[ 1026.424608]  </TASK>
[ 1026.424982] kobject_add_internal failed for md124 with -EEXIST, don't
try to register things with the same name in the same directory.



* Re: REGRESSION: [PATCH 4/4] block: freeze the queue earlier in del_gendisk
  2022-07-08  5:41   ` REGRESSION: " Logan Gunthorpe
@ 2022-07-08  6:01     ` Christoph Hellwig
  2022-07-08 15:55       ` Logan Gunthorpe
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2022-07-08  6:01 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, axboe, shinichiro.kawasaki, dan.j.williams,
	yukuai3, ming.lei, linux-block, linux-raid, Song Liu

On Thu, Jul 07, 2022 at 11:41:40PM -0600, Logan Gunthorpe wrote:
> I'm not really sure why this is yet, but this patch in rc4 causes some
> random failures with mdadm tests.
> 
> It seems the 11spare-migration test starts failing roughly every other
> run because the block device is not quite cleaned up after mdadm --stop
> by the time the next mdadm --create command starts, or rather there
> appears to be a race now between the newly created device and the one
> being cleaned up. This results in an infrequent sysfs panic with a
> duplicate filename error (see the end of this email).
> 
> I managed to bisect this and found a09b314005f3a09 to be the problematic
> commit.

Taking a look at the mddev code, this commit just seems to widen the
race window for hitting horrible lifetime problems in md, but I'll also
try to reproduce and verify it myself.

Take a look at how md searches for a duplicate name in md_alloc,
mddev_alloc_unit and mddev_find_locked based on the all_mddevs list,
and how the mddev gets dropped from all_mddevs very early and long
before the gendisk is gone in mddev_put.  I think what needs to be
done is to implement a free_disk method and drop the mddev (and free it)
from that.  But given how much intricate mess is based on all_mddevs
we'll have to be very careful about that.
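
A rough sketch of the free_disk approach suggested above (hypothetical and pseudocode-level, not the actual upstream series; it assumes the mddev is reachable from disk->private_data and that the all_mddevs bookkeeping can simply move into the callback):

```c
/*
 * Hypothetical sketch: keep the mddev on all_mddevs until the gendisk
 * itself is truly gone, so md_alloc() cannot hand out the same name
 * while teardown of the previous instance is still in flight.
 */
static void md_free_disk(struct gendisk *disk)
{
	struct mddev *mddev = disk->private_data;

	spin_lock(&all_mddevs_lock);
	list_del(&mddev->all_mddevs);
	spin_unlock(&all_mddevs_lock);

	kfree(mddev);
}

static const struct block_device_operations md_fops = {
	/* ... existing methods ... */
	.free_disk	= md_free_disk,
};
```

With this ordering, the duplicate-name check in md_alloc() would fail cleanly with -EEXIST instead of racing sysfs registration, which matches the behavior the tests implicitly rely on.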


* Re: REGRESSION: [PATCH 4/4] block: freeze the queue earlier in del_gendisk
  2022-07-08  6:01     ` Christoph Hellwig
@ 2022-07-08 15:55       ` Logan Gunthorpe
  2022-07-09  8:17         ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Logan Gunthorpe @ 2022-07-08 15:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, shinichiro.kawasaki, dan.j.williams, yukuai3, ming.lei,
	linux-block, linux-raid, Song Liu



On 2022-07-08 00:01, Christoph Hellwig wrote:
> On Thu, Jul 07, 2022 at 11:41:40PM -0600, Logan Gunthorpe wrote:
>> I'm not really sure why this is yet, but this patch in rc4 causes some
>> random failures with mdadm tests.
>>
>> It seems the 11spare-migration test starts failing roughly every other
>> run because the block device is not quite cleaned up after mdadm --stop
>> by the time the next mdadm --create command starts, or rather there
>> appears to be a race now between the newly created device and the one
>> being cleaned up. This results in an infrequent sysfs panic with a
>> duplicate filename error (see the end of this email).
>>
>> I managed to bisect this and found a09b314005f3a09 to be the problematic
>> commit.
> 
> Taking a look at the mddev code, this commit just seems to widen the
> race window for hitting horrible lifetime problems in md, but I'll also
> try to reproduce and verify it myself.
> 
> Take a look at how md searches for a duplicate name in md_alloc,
> mddev_alloc_unit and mddev_find_locked based on the all_mddevs list,
> and how the mddev gets dropped from all_mddevs very early and long
> before the gendisk is gone in mddev_put.  I think what needs to be
> done is to implement a free_disk method and drop the mddev (and free it)
> from that.  But given how much intricate mess is based on all_mddevs
> we'll have to be very careful about that.

I agree it's a mess, probably buggy, and could use a cleanup with a
free_disk method. But I'm not sure the all_mddevs lifetime issues are the
problem here. If the entry in all_mddevs outlasts the disk, then
md_alloc() will just fail earlier. Many test scripts rely on the fact
that you can stop an mddev and recreate it immediately after. We need
some way of ensuring any deleted disks are fully deleted before trying
to make a new mddev, in case the new one has the same name as one being
deleted.
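
The requirement above can be illustrated with a tiny userspace model (plain C, no kernel code; all names and structures here are made up for illustration): a name stays "taken" until the deferred teardown actually runs, so re-creating it too early fails with -EEXIST, exactly like the duplicate-filename splat quoted earlier in the thread.

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/*
 * Toy single-threaded model of the lifetime problem discussed above.
 * registry[] models names visible in sysfs; deferred[] marks teardown
 * work that has been queued (like mddev_delayed_delete) but not run.
 */
#define MAX_DEVS 4
static const char *registry[MAX_DEVS];
static int deferred[MAX_DEVS];

/* device_add_disk() model: fails while the old name still lingers */
static int create_dev(const char *name)
{
	for (int i = 0; i < MAX_DEVS; i++)
		if (registry[i] && strcmp(registry[i], name) == 0)
			return -17; /* -EEXIST */
	for (int i = 0; i < MAX_DEVS; i++)
		if (!registry[i]) {
			registry[i] = name;
			deferred[i] = 0;
			return 0;
		}
	return -28; /* -ENOSPC */
}

/* del_gendisk() model: queue the final teardown instead of doing it now */
static void delete_dev(const char *name)
{
	for (int i = 0; i < MAX_DEVS; i++)
		if (registry[i] && strcmp(registry[i], name) == 0)
			deferred[i] = 1;
}

/* flush_workqueue() model: only now does the name actually go away */
static void flush_deferred(void)
{
	for (int i = 0; i < MAX_DEVS; i++)
		if (deferred[i]) {
			registry[i] = NULL;
			deferred[i] = 0;
		}
}
```

In this model, create/delete/create on "md124" returns -EEXIST on the second create, while running flush_deferred() in between makes it succeed; the real bug is that the kernel-side equivalent of that flush no longer guarantees the teardown has finished.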

The md code deletes the disk in mddev_delayed_delete(), a workqueue item
on md_misc_wq. That queue is flushed first, but somehow some of the disk
teardown is still not complete by the time flush_workqueue() returns.
I'm not sure why that would be.

Logan


* Re: REGRESSION: [PATCH 4/4] block: freeze the queue earlier in del_gendisk
  2022-07-08 15:55       ` Logan Gunthorpe
@ 2022-07-09  8:17         ` Christoph Hellwig
  2022-07-11  3:33           ` Logan Gunthorpe
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2022-07-09  8:17 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, axboe, shinichiro.kawasaki, dan.j.williams,
	yukuai3, ming.lei, linux-block, linux-raid, Song Liu

On Fri, Jul 08, 2022 at 09:55:55AM -0600, Logan Gunthorpe wrote:
> I agree it's a mess, probably buggy, and could use a cleanup with a
> free_disk method. But I'm not sure the all_mddevs lifetime issues are the
> problem here. If the entry in all_mddevs outlasts the disk, then
> md_alloc() will just fail earlier. Many test scripts rely on the fact
> that you can stop an mddev and recreate it immediately after. We need
> some way of ensuring any deleted disks are fully deleted before trying
> to make a new mddev, in case the new one has the same name as one being
> deleted.

I think those tests are broken.  But fortunately that is just an
assumption in the tests, while device name reuse is a real problem.

I could not reproduce your problem, but on the for-5.20/block
tree I see a hang in 10ddf-geometry when running the tests.

The branch here:

    git://git.infradead.org/users/hch/block.git md-lifetime-fixes
    http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/md-lifetime-fixes

fixes that for me and does not introduce new regressions.  Can you
check if that helps your problem?


* Re: REGRESSION: [PATCH 4/4] block: freeze the queue earlier in del_gendisk
  2022-07-09  8:17         ` Christoph Hellwig
@ 2022-07-11  3:33           ` Logan Gunthorpe
  2022-07-11  4:33             ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Logan Gunthorpe @ 2022-07-11  3:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, shinichiro.kawasaki, dan.j.williams, yukuai3, ming.lei,
	linux-block, linux-raid, Song Liu




On 2022-07-09 02:17, Christoph Hellwig wrote:
> On Fri, Jul 08, 2022 at 09:55:55AM -0600, Logan Gunthorpe wrote:
>> I agree it's a mess, probably buggy, and could use a cleanup with a
>> free_disk method. But I'm not sure the all_mddevs lifetime issues are the
>> problem here. If the entry in all_mddevs outlasts the disk, then
>> md_alloc() will just fail earlier. Many test scripts rely on the fact
>> that you can stop an mddev and recreate it immediately after. We need
>> some way of ensuring any deleted disks are fully deleted before trying
>> to make a new mddev, in case the new one has the same name as one being
>> deleted.
> 
> I think those tests are broken.  But fortunately that is just an
> assumption in the tests, while device name reuse is a real problem.
> 
> I could not reproduce your problem, but on the for-5.20/block
> tree I see a hang in 10ddf-geometry when running the tests.

Ah, strange. I never saw an issue with that test, though I didn't run
that one repeatedly with the latest branch. So maybe it was intermittent,
like the failure I saw with 11spare-migration, and I just missed it.

> The branch here:
> 
>     git://git.infradead.org/users/hch/block.git md-lifetime-fixes
>     http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/md-lifetime-fixes
> 
> fixes that for me and does not introduce new regressions.  Can you
> check if that helps your problem?

Yes, I can confirm those patches fix the bug I was seeing too.


I did a fairly quick review of the patches:

 - In the patch that introduces md_free_disk() it looks like md_free()
can still be called from the error path of md_alloc() before
mddev->gendisk is set, which seems to make things rather complicated,
seeing as we then can't use free_disk to finish the cleanup if the disk
hasn't been created yet. I probably need to take a closer look at this,
but it might make more sense for the cleanup to remain in md_free() and
to call kobject_put() in md_free_disk() and del_gendisk() in
mddev_delayed_delete(). Then md_alloc() can still use kobject_put() in
the error path, and it makes a little more sense since we'd still be
freeing the kobject stuff in its own release method instead of in the
disk's free_disk method.

 - In the patch with md_rdevs_overlap, it looks like we remove the
rcu_read_lock(), which definitely seems out of place and probably isn't
correct. But the comment that was recreated still references RCU, so it
probably should be changed.

 - The last patch has a typo in the title (disk is *a* freed).

Thanks!

Logan


* Re: REGRESSION: [PATCH 4/4] block: freeze the queue earlier in del_gendisk
  2022-07-11  3:33           ` Logan Gunthorpe
@ 2022-07-11  4:33             ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2022-07-11  4:33 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, axboe, shinichiro.kawasaki, dan.j.williams,
	yukuai3, ming.lei, linux-block, linux-raid, Song Liu

On Sun, Jul 10, 2022 at 09:33:51PM -0600, Logan Gunthorpe wrote:
> I did a fairly quick review of the patches:
> 
>  - In the patch that introduces md_free_disk() it looks like md_free()
> can still be called from the error path of md_alloc() before
> mddev->gendisk is set, which seems to make things rather complicated,
> seeing as we then can't use free_disk to finish the cleanup if the disk
> hasn't been created yet. I probably need to take a closer look at this,
> but it might make more sense for the cleanup to remain in md_free() and
> to call kobject_put() in md_free_disk() and del_gendisk() in
> mddev_delayed_delete(). Then md_alloc() can still use kobject_put() in
> the error path, and it makes a little more sense since we'd still be
> freeing the kobject stuff in its own release method instead of in the
> disk's free_disk method.

Uww, yes.  I suspect the best fix is to actually stop the kobject
from taking part in the md lifetime directly.  Because the kobject
contributes a reference to the disk until it is deleted, we might
as well stop messing with the refcounts entirely and just call
kobject_del on it just before del_gendisk.  Let me see what I can do
there.
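
A rough sketch of that ordering (hypothetical, pseudocode-level, not compilable standalone; it assumes the deferred teardown work item stays in place):

```c
/*
 * Hypothetical sketch: delete the sysfs kobject explicitly before
 * del_gendisk instead of letting its refcount drive the mddev lifetime,
 * so the kobject no longer participates in the md lifetime rules.
 */
static void mddev_delayed_delete(struct work_struct *ws)
{
	struct mddev *mddev = container_of(ws, struct mddev, del_work);

	kobject_del(&mddev->kobj);	/* drop the sysfs presence first */
	del_gendisk(mddev->gendisk);	/* then tear down the disk */
}
```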

> 
>  - In the patch with md_rdevs_overlap, it looks like we remove the
> rcu_read_lock(), which definitely seems out of place and probably isn't
> correct. But the comment that was recreated still references RCU, so it
> probably should be changed.

Fixed.

>  - The last patch has a typo in the title (disk is *a* freed).

Fixed.

