linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 0/8] blktrace: fix debugfs use after free
@ 2020-06-19 20:47 Luis Chamberlain
  2020-06-19 20:47 ` [PATCH v7 1/8] block: add docs for gendisk / request_queue refcount helpers Luis Chamberlain
                   ` (8 more replies)
  0 siblings, 9 replies; 26+ messages in thread
From: Luis Chamberlain @ 2020-06-19 20:47 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Luis Chamberlain

Its been a fun ride, but all patch series come to an end. My hope is
that this is it. The simplification of the fix is considerable now,
with only a few lines of code and with no data structure changes.

We were only creating the debugfs_dir upon initialization only if
you had CONFIG_BLK_DEBUG_FS for for make_request block drivers
(multiqueue). That's where the UAF bug could happen. Folks liked
the idea of open coding the debugfs initialization even if
CONFIG_BLK_DEBUG_FS was disabled, given that debugfs code will
simply ignore that code if debugfs is disabled, but to make
the fix easier to backport, that shift is done now in another
patch. Likewise, although we were only creating the debugfs_dir
only for make_request block drivers (multiqueue), the same new
additional patch also creates the debugfs_dir for request-based
block drivers. That *begged* us to just rename the mutex to
clarify its for the debugfs_dir, blktrace then just becomes
its biggest user.

The only patches changed here is the last one from the last series,
which actually fixed the UAF oops, and that one is now split in 3
patches, which makes a secondary fix much clearer.

I've waited a while to post these, to let 0-day give me its blessings,
both for Linus' tree and linux-next. No issues have been found. I've
also taken time to run blktests prior and after this series and I have
found no regressions. In fact, I think I should just extend blktests
with the break-blktrace tests, I'll do that later.

Luis Chamberlain (8):
  block: add docs for gendisk / request_queue refcount helpers
  block: clarify context for refcount increment helpers
  block: revert back to synchronous request_queue removal
  blktrace: annotate required lock on do_blk_trace_setup()
  loop: be paranoid on exit and prevent new additions / removals
  blktrace: fix debugfs use after free
  blktrace: ensure our debugfs dir exists
  block: create the request_queue debugfs_dir on registration

 block/blk-core.c        | 31 +++++++++++++----
 block/blk-mq-debugfs.c  |  5 ---
 block/blk-sysfs.c       | 52 ++++++++++++++++------------
 block/blk.h             |  2 --
 block/genhd.c           | 73 ++++++++++++++++++++++++++++++++++++++-
 drivers/block/loop.c    |  4 +++
 include/linux/blkdev.h  |  7 ++--
 kernel/trace/blktrace.c | 76 ++++++++++++++++++++++++-----------------
 8 files changed, 179 insertions(+), 71 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v7 1/8] block: add docs for gendisk / request_queue refcount helpers
  2020-06-19 20:47 [PATCH v7 0/8] blktrace: fix debugfs use after free Luis Chamberlain
@ 2020-06-19 20:47 ` Luis Chamberlain
  2020-06-19 20:47 ` [PATCH v7 2/8] block: clarify context for refcount increment helpers Luis Chamberlain
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2020-06-19 20:47 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Luis Chamberlain,
	Christoph Hellwig

This adds documentation for the gendisk / request_queue refcount
helpers.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-core.c | 13 +++++++++++++
 block/genhd.c    | 50 +++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 62a4904db921..a0760aac110a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -321,6 +321,13 @@ void blk_clear_pm_only(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_clear_pm_only);
 
+/**
+ * blk_put_queue - decrement the request_queue refcount
+ * @q: the request_queue structure to decrement the refcount for
+ *
+ * Decrements the refcount of the request_queue kobject. When this reaches 0
+ * we'll have blk_release_queue() called.
+ */
 void blk_put_queue(struct request_queue *q)
 {
 	kobject_put(&q->kobj);
@@ -598,6 +605,12 @@ struct request_queue *blk_alloc_queue(make_request_fn make_request, int node_id)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
+/**
+ * blk_get_queue - increment the request_queue refcount
+ * @q: the request_queue structure to increment the refcount for
+ *
+ * Increment the refcount of the request_queue kobject.
+ */
 bool blk_get_queue(struct request_queue *q)
 {
 	if (likely(!blk_queue_dying(q))) {
diff --git a/block/genhd.c b/block/genhd.c
index 1a7659327664..f741613d731f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -876,6 +876,20 @@ static void invalidate_partition(struct gendisk *disk, int partno)
 	bdput(bdev);
 }
 
+/**
+ * del_gendisk - remove the gendisk
+ * @disk: the struct gendisk to remove
+ *
+ * Removes the gendisk and all its associated resources. This deletes the
+ * partitions associated with the gendisk, and unregisters the associated
+ * request_queue.
+ *
+ * This is the counter to the respective __device_add_disk() call.
+ *
+ * The final removal of the struct gendisk happens when its refcount reaches 0
+ * with put_disk(), which should be called after del_gendisk(), if
+ * __device_add_disk() was used.
+ */
 void del_gendisk(struct gendisk *disk)
 {
 	struct disk_part_iter piter;
@@ -1514,6 +1528,23 @@ int disk_expand_part_tbl(struct gendisk *disk, int partno)
 	return 0;
 }
 
+/**
+ * disk_release - releases all allocated resources of the gendisk
+ * @dev: the device representing this disk
+ *
+ * This function releases all allocated resources of the gendisk.
+ *
+ * The struct gendisk refcount is incremented with get_gendisk() or
+ * get_disk_and_module(), and its refcount is decremented with
+ * put_disk_and_module() or put_disk(). Once the refcount reaches 0 this
+ * function is called.
+ *
+ * Drivers which used __device_add_disk() have a gendisk with a request_queue
+ * assigned. Since the request_queue sits on top of the gendisk for these
+ * drivers we also call blk_put_queue() for them, and we expect the
+ * request_queue refcount to reach 0 at this point, and so the request_queue
+ * will also be freed prior to the disk.
+ */
 static void disk_release(struct device *dev)
 {
 	struct gendisk *disk = dev_to_disk(dev);
@@ -1727,6 +1758,13 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
 }
 EXPORT_SYMBOL(__alloc_disk_node);
 
+/**
+ * get_disk_and_module - increments the gendisk and gendisk fops module refcount
+ * @disk: the struct gendisk to to increment the refcount for
+ *
+ * This increments the refcount for the struct gendisk, and the gendisk's
+ * fops module owner.
+ */
 struct kobject *get_disk_and_module(struct gendisk *disk)
 {
 	struct module *owner;
@@ -1747,6 +1785,13 @@ struct kobject *get_disk_and_module(struct gendisk *disk)
 }
 EXPORT_SYMBOL(get_disk_and_module);
 
+/**
+ * put_disk - decrements the gendisk refcount
+ * @disk: the struct gendisk to to decrement the refcount for
+ *
+ * This decrements the refcount for the struct gendisk. When this reaches 0
+ * we'll have disk_release() called.
+ */
 void put_disk(struct gendisk *disk)
 {
 	if (disk)
@@ -1754,7 +1799,10 @@ void put_disk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(put_disk);
 
-/*
+/**
+ * put_disk_and_module - decrements the module and gendisk refcount
+ * @disk: the struct gendisk to to decrement the refcount for
+ *
  * This is a counterpart of get_disk_and_module() and thus also of
  * get_gendisk().
  */
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 2/8] block: clarify context for refcount increment helpers
  2020-06-19 20:47 [PATCH v7 0/8] blktrace: fix debugfs use after free Luis Chamberlain
  2020-06-19 20:47 ` [PATCH v7 1/8] block: add docs for gendisk / request_queue refcount helpers Luis Chamberlain
@ 2020-06-19 20:47 ` Luis Chamberlain
  2020-06-19 20:47 ` [PATCH v7 3/8] block: revert back to synchronous request_queue removal Luis Chamberlain
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2020-06-19 20:47 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Luis Chamberlain,
	Christoph Hellwig

Let us clarify the context under which the helpers to increment the
refcount for the gendisk and request_queue can be called under. We
make this explicit on the places where we may sleep with might_sleep().

We don't address the decrement context yet, as that needs some extra
work and fixes, but will be addressed in the next patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-core.c | 2 ++
 block/genhd.c    | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index a0760aac110a..14c09daf55f3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -610,6 +610,8 @@ EXPORT_SYMBOL(blk_alloc_queue);
  * @q: the request_queue structure to increment the refcount for
  *
  * Increment the refcount of the request_queue kobject.
+ *
+ * Context: Any context.
  */
 bool blk_get_queue(struct request_queue *q)
 {
diff --git a/block/genhd.c b/block/genhd.c
index f741613d731f..1be86b1f43ec 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -985,11 +985,15 @@ static ssize_t disk_badblocks_store(struct device *dev,
  *
  * This function gets the structure containing partitioning
  * information for the given device @devt.
+ *
+ * Context: can sleep
  */
 struct gendisk *get_gendisk(dev_t devt, int *partno)
 {
 	struct gendisk *disk = NULL;
 
+	might_sleep();
+
 	if (MAJOR(devt) != BLOCK_EXT_MAJOR) {
 		struct kobject *kobj;
 
@@ -1764,6 +1768,8 @@ EXPORT_SYMBOL(__alloc_disk_node);
  *
  * This increments the refcount for the struct gendisk, and the gendisk's
  * fops module owner.
+ *
+ * Context: Any context.
  */
 struct kobject *get_disk_and_module(struct gendisk *disk)
 {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 3/8] block: revert back to synchronous request_queue removal
  2020-06-19 20:47 [PATCH v7 0/8] blktrace: fix debugfs use after free Luis Chamberlain
  2020-06-19 20:47 ` [PATCH v7 1/8] block: add docs for gendisk / request_queue refcount helpers Luis Chamberlain
  2020-06-19 20:47 ` [PATCH v7 2/8] block: clarify context for refcount increment helpers Luis Chamberlain
@ 2020-06-19 20:47 ` Luis Chamberlain
  2020-06-19 20:47 ` [PATCH v7 4/8] blktrace: annotate required lock on do_blk_trace_setup() Luis Chamberlain
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2020-06-19 20:47 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Luis Chamberlain,
	Omar Sandoval, Hannes Reinecke, Michal Hocko, Christoph Hellwig

Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
v4.12 moved the work behind blk_release_queue() into a workqueue after a
splat floated around which indicated some work on blk_release_queue()
could sleep in blk_exit_rl(). This splat would be possible when a driver
called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
as its final call) from an atomic context.

blk_put_queue() decrements the refcount for the request_queue kobject, and
upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() is
now removed through commit db6d99523560 ("block: remove request_list code")
on v5.0, we reserve the right to be able to sleep within
blk_release_queue() context.

The last reference for the request_queue must not be called from atomic
context. *When* the last reference to the request_queue reaches 0 varies,
and so let's take the opportunity to document when that is expected to
happen and also document the context of the related calls as best as
possible so we can avoid future issues, and with the hopes that the
synchronous request_queue removal sticks.

We revert back to synchronous request_queue removal because asynchronous
removal creates a regression with expected userspace interaction with
several drivers. An example is when removing the loopback driver, one
uses ioctls from userspace to do so, but upon return and if successful,
one expects the device to be removed. Likewise if one races to add another
device the new one may not be added as it is still being removed. This was
expected behavior before and it now fails as the device is still present
and busy still. Moving to asynchronous request_queue removal could have
broken many scripts which relied on the removal to have been completed if
there was no error. Document this expectation as well so that this
doesn't regress userspace again.

Using asynchronous request_queue removal however has helped us find
other bugs. In the future we can test what could break with this
arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.

While at it, update the docs with the context expectations for the
request_queue / gendisk refcount decrement, and make these
expectations explicit by using might_sleep().

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Suggested-by: Nicolai Stange <nstange@suse.de>
Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-core.c       |  8 ++++++++
 block/blk-sysfs.c      | 43 +++++++++++++++++++++---------------------
 block/genhd.c          | 17 +++++++++++++++++
 include/linux/blkdev.h |  2 --
 4 files changed, 47 insertions(+), 23 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 14c09daf55f3..a5126c0be777 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -327,6 +327,9 @@ EXPORT_SYMBOL_GPL(blk_clear_pm_only);
  *
  * Decrements the refcount of the request_queue kobject. When this reaches 0
  * we'll have blk_release_queue() called.
+ *
+ * Context: Any context, but the last reference must not be dropped from
+ *          atomic context.
  */
 void blk_put_queue(struct request_queue *q)
 {
@@ -359,9 +362,14 @@ EXPORT_SYMBOL_GPL(blk_set_queue_dying);
  *
  * Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
  * put it.  All future requests will be failed immediately with -ENODEV.
+ *
+ * Context: can sleep
  */
 void blk_cleanup_queue(struct request_queue *q)
 {
+	/* cannot be called from atomic context */
+	might_sleep();
+
 	WARN_ON_ONCE(blk_queue_registered(q));
 
 	/* mark @q DYING, no new request or merges will be allowed afterwards */
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 02643e149d5e..561624d4cc4e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -873,22 +873,32 @@ static void blk_exit_queue(struct request_queue *q)
 	bdi_put(q->backing_dev_info);
 }
 
-
 /**
- * __blk_release_queue - release a request queue
- * @work: pointer to the release_work member of the request queue to be released
+ * blk_release_queue - releases all allocated resources of the request_queue
+ * @kobj: pointer to a kobject, whose container is a request_queue
+ *
+ * This function releases all allocated resources of the request queue.
+ *
+ * The struct request_queue refcount is incremented with blk_get_queue() and
+ * decremented with blk_put_queue(). Once the refcount reaches 0 this function
+ * is called.
+ *
+ * For drivers that have a request_queue on a gendisk and added with
+ * __device_add_disk() the refcount to request_queue will reach 0 with
+ * the last put_disk() called by the driver. For drivers which don't use
+ * __device_add_disk() this happens with blk_cleanup_queue().
  *
- * Description:
- *     This function is called when a block device is being unregistered. The
- *     process of releasing a request queue starts with blk_cleanup_queue, which
- *     set the appropriate flags and then calls blk_put_queue, that decrements
- *     the reference counter of the request queue. Once the reference counter
- *     of the request queue reaches zero, blk_release_queue is called to release
- *     all allocated resources of the request queue.
+ * Drivers exist which depend on the release of the request_queue to be
+ * synchronous, it should not be deferred.
+ *
+ * Context: can sleep
  */
-static void __blk_release_queue(struct work_struct *work)
+static void blk_release_queue(struct kobject *kobj)
 {
-	struct request_queue *q = container_of(work, typeof(*q), release_work);
+	struct request_queue *q =
+		container_of(kobj, struct request_queue, kobj);
+
+	might_sleep();
 
 	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
 		blk_stat_remove_callback(q, q->poll_cb);
@@ -917,15 +927,6 @@ static void __blk_release_queue(struct work_struct *work)
 	call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
 
-static void blk_release_queue(struct kobject *kobj)
-{
-	struct request_queue *q =
-		container_of(kobj, struct request_queue, kobj);
-
-	INIT_WORK(&q->release_work, __blk_release_queue);
-	schedule_work(&q->release_work);
-}
-
 static const struct sysfs_ops queue_sysfs_ops = {
 	.show	= queue_attr_show,
 	.store	= queue_attr_store,
diff --git a/block/genhd.c b/block/genhd.c
index 1be86b1f43ec..60ae4e1b4d38 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -889,12 +889,19 @@ static void invalidate_partition(struct gendisk *disk, int partno)
  * The final removal of the struct gendisk happens when its refcount reaches 0
  * with put_disk(), which should be called after del_gendisk(), if
  * __device_add_disk() was used.
+ *
+ * Drivers exist which depend on the release of the gendisk to be synchronous,
+ * it should not be deferred.
+ *
+ * Context: can sleep
  */
 void del_gendisk(struct gendisk *disk)
 {
 	struct disk_part_iter piter;
 	struct hd_struct *part;
 
+	might_sleep();
+
 	blk_integrity_del(disk);
 	disk_del_events(disk);
 
@@ -1548,11 +1555,15 @@ int disk_expand_part_tbl(struct gendisk *disk, int partno)
  * drivers we also call blk_put_queue() for them, and we expect the
  * request_queue refcount to reach 0 at this point, and so the request_queue
  * will also be freed prior to the disk.
+ *
+ * Context: can sleep
  */
 static void disk_release(struct device *dev)
 {
 	struct gendisk *disk = dev_to_disk(dev);
 
+	might_sleep();
+
 	blk_free_devt(dev->devt);
 	disk_release_events(disk);
 	kfree(disk->random);
@@ -1797,6 +1808,9 @@ EXPORT_SYMBOL(get_disk_and_module);
  *
  * This decrements the refcount for the struct gendisk. When this reaches 0
  * we'll have disk_release() called.
+ *
+ * Context: Any context, but the last reference must not be dropped from
+ *          atomic context.
  */
 void put_disk(struct gendisk *disk)
 {
@@ -1811,6 +1825,9 @@ EXPORT_SYMBOL(put_disk);
  *
  * This is a counterpart of get_disk_and_module() and thus also of
  * get_gendisk().
+ *
+ * Context: Any context, but the last reference must not be dropped from
+ *          atomic context.
  */
 void put_disk_and_module(struct gendisk *disk)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6e067dca94cf..780d6fa94746 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -584,8 +584,6 @@ struct request_queue {
 
 	size_t			cmd_size;
 
-	struct work_struct	release_work;
-
 #define BLK_MAX_WRITE_HINTS	5
 	u64			write_hints[BLK_MAX_WRITE_HINTS];
 };
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 4/8] blktrace: annotate required lock on do_blk_trace_setup()
  2020-06-19 20:47 [PATCH v7 0/8] blktrace: fix debugfs use after free Luis Chamberlain
                   ` (2 preceding siblings ...)
  2020-06-19 20:47 ` [PATCH v7 3/8] block: revert back to synchronous request_queue removal Luis Chamberlain
@ 2020-06-19 20:47 ` Luis Chamberlain
  2020-06-20 17:02   ` Bart Van Assche
  2020-06-19 20:47 ` [PATCH v7 5/8] loop: be paranoid on exit and prevent new additions / removals Luis Chamberlain
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Luis Chamberlain @ 2020-06-19 20:47 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Luis Chamberlain

Ensure it is clear which lock is required on do_blk_trace_setup().

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/trace/blktrace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 7f60029bdaff..7ff2ea5cd05e 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -483,6 +483,8 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	struct dentry *dir = NULL;
 	int ret;
 
+	lockdep_assert_held(&q->blk_trace_mutex);
+
 	if (!buts->buf_size || !buts->buf_nr)
 		return -EINVAL;
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 5/8] loop: be paranoid on exit and prevent new additions / removals
  2020-06-19 20:47 [PATCH v7 0/8] blktrace: fix debugfs use after free Luis Chamberlain
                   ` (3 preceding siblings ...)
  2020-06-19 20:47 ` [PATCH v7 4/8] blktrace: annotate required lock on do_blk_trace_setup() Luis Chamberlain
@ 2020-06-19 20:47 ` Luis Chamberlain
  2020-06-20 17:11   ` Bart Van Assche
  2020-06-19 20:47 ` [PATCH v7 6/8] blktrace: fix debugfs use after free Luis Chamberlain
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Luis Chamberlain @ 2020-06-19 20:47 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Luis Chamberlain,
	Christoph Hellwig

Be pedantic on removal as well and hold the mutex.
This should prevent uses of addition while we exit.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/block/loop.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c33bbbfd1bd9..d55e1b52f076 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2402,6 +2402,8 @@ static void __exit loop_exit(void)
 
 	range = max_loop ? max_loop << part_shift : 1UL << MINORBITS;
 
+	mutex_lock(&loop_ctl_mutex);
+
 	idr_for_each(&loop_index_idr, &loop_exit_cb, NULL);
 	idr_destroy(&loop_index_idr);
 
@@ -2409,6 +2411,8 @@ static void __exit loop_exit(void)
 	unregister_blkdev(LOOP_MAJOR, "loop");
 
 	misc_deregister(&loop_misc);
+
+	mutex_unlock(&loop_ctl_mutex);
 }
 
 module_init(loop_init);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 6/8] blktrace: fix debugfs use after free
  2020-06-19 20:47 [PATCH v7 0/8] blktrace: fix debugfs use after free Luis Chamberlain
                   ` (4 preceding siblings ...)
  2020-06-19 20:47 ` [PATCH v7 5/8] loop: be paranoid on exit and prevent new additions / removals Luis Chamberlain
@ 2020-06-19 20:47 ` Luis Chamberlain
  2020-06-20  7:29   ` Christoph Hellwig
  2020-06-20 17:31   ` Bart Van Assche
  2020-06-19 20:47 ` [PATCH v7 7/8] blktrace: ensure our debugfs dir exists Luis Chamberlain
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 26+ messages in thread
From: Luis Chamberlain @ 2020-06-19 20:47 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Luis Chamberlain,
	Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
merged on v4.12 Omar fixed the original blktrace code for request-based
drivers (multiqueue). This however left in place a possible crash, if you
happen to abuse blktrace while racing to remove / add a device.

We used to use asynchronous removal of the request_queue, and with that
the issue was easier to reproduce. Now that we have reverted to
synchronous removal of the request_queue, the issue is still possible to
reproduce, its however just a bit more difficult.

We essentially run two instances of break-blktrace which add/remove
a loop device, and setup a blktrace and just never tear the blktrace
down. We do this twice in parallel. This is easily reproduced with the
script run_0004.sh from break-blktrace [0].

We can end up with two types of panics each reflecting where we
race, one a failed blktrace setup:

[  252.426751] debugfs: Directory 'loop0' with parent 'block' already present!
[  252.432265] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[  252.436592] #PF: supervisor write access in kernel mode
[  252.439822] #PF: error_code(0x0002) - not-present page
[  252.442967] PGD 0 P4D 0
[  252.444656] Oops: 0002 [#1] SMP NOPTI
[  252.446972] CPU: 10 PID: 1153 Comm: break-blktrace Tainted: G            E     5.7.0-rc2-next-20200420+ #164
[  252.452673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[  252.456343] RIP: 0010:down_write+0x15/0x40
[  252.458146] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
               cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
               00 00 <f0> 48 0f b1 55 00 75 0f 48 8b 04 25 c0 8b 01 00 48 89
               45 08 5d
[  252.463638] RSP: 0018:ffffa626415abcc8 EFLAGS: 00010246
[  252.464950] RAX: 0000000000000000 RBX: ffff958c25f0f5c0 RCX: ffffff8100000000
[  252.466727] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[  252.468482] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000001
[  252.470014] R10: 0000000000000000 R11: ffff958d1f9227ff R12: 0000000000000000
[  252.471473] R13: ffff958c25ea5380 R14: ffffffff8cce15f1 R15: 00000000000000a0
[  252.473346] FS:  00007f2e69dee540(0000) GS:ffff958c2fc80000(0000) knlGS:0000000000000000
[  252.475225] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  252.476267] CR2: 00000000000000a0 CR3: 0000000427d10004 CR4: 0000000000360ee0
[  252.477526] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  252.478776] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  252.479866] Call Trace:
[  252.480322]  simple_recursive_removal+0x4e/0x2e0
[  252.481078]  ? debugfs_remove+0x60/0x60
[  252.481725]  ? relay_destroy_buf+0x77/0xb0
[  252.482662]  debugfs_remove+0x40/0x60
[  252.483518]  blk_remove_buf_file_callback+0x5/0x10
[  252.484328]  relay_close_buf+0x2e/0x60
[  252.484930]  relay_open+0x1ce/0x2c0
[  252.485520]  do_blk_trace_setup+0x14f/0x2b0
[  252.486187]  __blk_trace_setup+0x54/0xb0
[  252.486803]  blk_trace_ioctl+0x90/0x140
[  252.487423]  ? do_sys_openat2+0x1ab/0x2d0
[  252.488053]  blkdev_ioctl+0x4d/0x260
[  252.488636]  block_ioctl+0x39/0x40
[  252.489139]  ksys_ioctl+0x87/0xc0
[  252.489675]  __x64_sys_ioctl+0x16/0x20
[  252.490380]  do_syscall_64+0x52/0x180
[  252.491032]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

And the other on the device removal:

[  128.528940] debugfs: Directory 'loop0' with parent 'block' already present!
[  128.615325] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[  128.619537] #PF: supervisor write access in kernel mode
[  128.622700] #PF: error_code(0x0002) - not-present page
[  128.625842] PGD 0 P4D 0
[  128.627585] Oops: 0002 [#1] SMP NOPTI
[  128.629871] CPU: 12 PID: 544 Comm: break-blktrace Tainted: G            E     5.7.0-rc2-next-20200420+ #164
[  128.635595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[  128.640471] RIP: 0010:down_write+0x15/0x40
[  128.643041] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
               cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
               00 00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89
               45 08 5d
[  128.650180] RSP: 0018:ffffa9c3c05ebd78 EFLAGS: 00010246
[  128.651820] RAX: 0000000000000000 RBX: ffff8ae9a6370240 RCX: ffffff8100000000
[  128.653942] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[  128.655720] RBP: 00000000000000a0 R08: 0000000000000002 R09: ffff8ae9afd2d3d0
[  128.657400] R10: 0000000000000056 R11: 0000000000000000 R12: 0000000000000000
[  128.659099] R13: 0000000000000000 R14: 0000000000000003 R15: 00000000000000a0
[  128.660500] FS:  00007febfd995540(0000) GS:ffff8ae9afd00000(0000) knlGS:0000000000000000
[  128.662204] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  128.663426] CR2: 00000000000000a0 CR3: 0000000420042003 CR4: 0000000000360ee0
[  128.664776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  128.666022] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  128.667282] Call Trace:
[  128.667801]  simple_recursive_removal+0x4e/0x2e0
[  128.668663]  ? debugfs_remove+0x60/0x60
[  128.669368]  debugfs_remove+0x40/0x60
[  128.669985]  blk_trace_free+0xd/0x50
[  128.670593]  __blk_trace_remove+0x27/0x40
[  128.671274]  blk_trace_shutdown+0x30/0x40
[  128.671935]  blk_release_queue+0x95/0xf0
[  128.672589]  kobject_put+0xa5/0x1b0
[  128.673188]  disk_release+0xa2/0xc0
[  128.673786]  device_release+0x28/0x80
[  128.674376]  kobject_put+0xa5/0x1b0
[  128.674915]  loop_remove+0x39/0x50 [loop]
[  128.675511]  loop_control_ioctl+0x113/0x130 [loop]
[  128.676199]  ksys_ioctl+0x87/0xc0
[  128.676708]  __x64_sys_ioctl+0x16/0x20
[  128.677274]  do_syscall_64+0x52/0x180
[  128.677823]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The common theme here is:

debugfs: Directory 'loop0' with parent 'block' already present

This crash happens because of how blktrace uses the debugfs directory
where it places its files. Upon init we always create the same directory
which would be needed by blktrace but we only do this for make_request
drivers (multiqueue) block drivers. When you race a removal of these
devices with a blktrace setup you end up in a situation where the
make_request recursive debugfs removal will sweep away the blktrace
files and then later blktrace will also try to remove individual
dentries which are already NULL. The inverse is also possible and hence
the two types of use after frees.

We don't create the block debugfs directory on init for these types of
block devices:

  * request-based block driver block devices
  * every possible partition
  * scsi-generic

And so, this race should in theory only be possible with make_request
drivers.

We can fix the UAF by simply re-using the debugfs directory for
make_request drivers (multiqueue) and only creating the ephemeral
directory for the other type of block devices. The new clarifications
on relying on the q->blk_trace_mutex *and* also checking for q->blk_trace
*prior* to processing a blktrace ensures the debugfs directories are
only created if no possible directory name clashes are possible.

This goes tested with:

  o nvme partitions
  o ISCSI with tgt, and blktracing against scsi-generic with:
    o block
    o tape
    o cdrom
    o media changer
  o blktests

This patch is part of the work which disputes the severity of
CVE-2019-19770 which shows this issue is not a core debugfs issue, but
a misuse of debugfs within blktace.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: "James E.J. Bottomley" <jejb@linux.ibm.com>
Cc: yu kuai <yukuai3@huawei.com>
Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com
Fixes: 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/trace/blktrace.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 7ff2ea5cd05e..e6e2d25fdbd6 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -524,10 +524,18 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	if (!bt->msg_data)
 		goto err;
 
-	ret = -ENOENT;
-
-	dir = debugfs_lookup(buts->name, blk_debugfs_root);
-	if (!dir)
+#ifdef CONFIG_BLK_DEBUG_FS
+	/*
+	 * When tracing whole make_request drivers (multiqueue) block devices,
+	 * reuse the existing debugfs directory created by the block layer on
+	 * init. For request-based block devices, all partitions block devices,
+	 * and scsi-generic block devices we create a temporary new debugfs
+	 * directory that will be removed once the trace ends.
+	 */
+	if (queue_is_mq(q) && bdev && bdev == bdev->bd_contains)
+		dir = q->debugfs_dir;
+	else
+#endif
 		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
 
 	bt->dev = dev;
@@ -565,8 +573,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 
 	ret = 0;
 err:
-	if (dir && !bt->dir)
-		dput(dir);
 	if (ret)
 		blk_trace_free(bt);
 	return ret;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 7/8] blktrace: ensure our debugfs dir exists
  2020-06-19 20:47 [PATCH v7 0/8] blktrace: fix debugfs use after free Luis Chamberlain
                   ` (5 preceding siblings ...)
  2020-06-19 20:47 ` [PATCH v7 6/8] blktrace: fix debugfs use after free Luis Chamberlain
@ 2020-06-19 20:47 ` Luis Chamberlain
  2020-06-20  7:29   ` Christoph Hellwig
  2020-06-20 17:33   ` Bart Van Assche
  2020-06-19 20:47 ` [PATCH v7 8/8] block: create the request_queue debugfs_dir on registration Luis Chamberlain
  2020-06-20 21:18 ` [PATCH v7 0/8] blktrace: fix debugfs use after free Jens Axboe
  8 siblings, 2 replies; 26+ messages in thread
From: Luis Chamberlain @ 2020-06-19 20:47 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Luis Chamberlain

We make an assumption that a debugfs directory exists, but since
this can fail ensure it exists before allowing blktrace setup to
complete. Otherwise we end up stuffing blktrace files on the debugfs
root directory. In the worst case scenario this *in theory* can create
an eventual panic *iff* in the future a similarly named file is created
prior on the debugfs root directory. This theoretical crash can happen
due to a recursive removal followed by a specific dentry removal.

This doesn't fix any known crash, however I have seen the files
go into the main debugfs root directory in cases where the debugfs
directory was not created due to other internal bugs with blktrace
now fixed.

blktrace is also completely useless without this directory, so
this ensures to userspace we only setup blktrace if the kernel
can stuff files where they are supposed to go into.

debugfs directory creations typically aren't checked for, and we have
maintainers doing sweep removals of these checks, but since we need this
check to ensure proper userspace blktrace functionality we make sure
to annotate the justification for the check.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/trace/blktrace.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index e6e2d25fdbd6..098780ec018f 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -538,6 +538,18 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 #endif
 		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
 
+	/*
+	 * As blktrace relies on debugfs for its interface the debugfs directory
+	 * is required, contrary to the usual mantra of not checking for debugfs
+	 * files or directories.
+	 */
+	if (IS_ERR_OR_NULL(dir)) {
+		pr_warn("debugfs_dir not present for %s so skipping\n",
+			buts->name);
+		ret = -ENOENT;
+		goto err;
+	}
+
 	bt->dev = dev;
 	atomic_set(&bt->dropped, 0);
 	INIT_LIST_HEAD(&bt->running_list);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 8/8] block: create the request_queue debugfs_dir on registration
  2020-06-19 20:47 [PATCH v7 0/8] blktrace: fix debugfs use after free Luis Chamberlain
                   ` (6 preceding siblings ...)
  2020-06-19 20:47 ` [PATCH v7 7/8] blktrace: ensure our debugfs dir exists Luis Chamberlain
@ 2020-06-19 20:47 ` Luis Chamberlain
  2020-06-20  7:30   ` Christoph Hellwig
  2020-06-20 18:07   ` Bart Van Assche
  2020-06-20 21:18 ` [PATCH v7 0/8] blktrace: fix debugfs use after free Jens Axboe
  8 siblings, 2 replies; 26+ messages in thread
From: Luis Chamberlain @ 2020-06-19 20:47 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Luis Chamberlain

We were only creating the request_queue debugfs_dir only
for make_request block drivers (multiqueue), but never for
request-based block drivers. We did this as we were only
creating non-blktrace additional debugfs files on that directory
for make_request drivers. However, since blktrace *always* creates
that directory anyway, we special-case the use of that directory
on blktrace. Other than this being an eye-sore, this exposes
request-based block drivers to the same debugfs fragile
race that used to exist with make_request block drivers
where if we start adding files onto that directory we can later
run a race with a double removal of dentries on the directory
if we don't deal with this carefully on blktrace.

Instead, just simplify things by always creating the request_queue
debugfs_dir on request_queue registration. Rename the mutex also to
reflect the fact that this is used outside of the blktrace context.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-core.c        |  8 +-----
 block/blk-mq-debugfs.c  |  5 ----
 block/blk-sysfs.c       |  9 +++++++
 block/blk.h             |  2 --
 include/linux/blkdev.h  |  5 ++--
 kernel/trace/blktrace.c | 58 ++++++++++++++++++-----------------------
 6 files changed, 39 insertions(+), 48 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a5126c0be777..72b102a389a5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -51,9 +51,7 @@
 #include "blk-pm.h"
 #include "blk-rq-qos.h"
 
-#ifdef CONFIG_DEBUG_FS
 struct dentry *blk_debugfs_root;
-#endif
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
@@ -555,9 +553,7 @@ struct request_queue *__blk_alloc_queue(int node_id)
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
 
-#ifdef CONFIG_BLK_DEV_IO_TRACE
-	mutex_init(&q->blk_trace_mutex);
-#endif
+	mutex_init(&q->debugfs_mutex);
 	mutex_init(&q->sysfs_lock);
 	mutex_init(&q->sysfs_dir_lock);
 	spin_lock_init(&q->queue_lock);
@@ -1937,9 +1933,7 @@ int __init blk_dev_init(void)
 	blk_requestq_cachep = kmem_cache_create("request_queue",
 			sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
 
-#ifdef CONFIG_DEBUG_FS
 	blk_debugfs_root = debugfs_create_dir("block", NULL);
-#endif
 
 	return 0;
 }
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 15df3a36e9fa..a2800bc56fb4 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -824,9 +824,6 @@ void blk_mq_debugfs_register(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
-	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
-					    blk_debugfs_root);
-
 	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
 
 	/*
@@ -857,9 +854,7 @@ void blk_mq_debugfs_register(struct request_queue *q)
 
 void blk_mq_debugfs_unregister(struct request_queue *q)
 {
-	debugfs_remove_recursive(q->debugfs_dir);
 	q->sched_debugfs_dir = NULL;
-	q->debugfs_dir = NULL;
 }
 
 static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 561624d4cc4e..be67952e7be2 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -11,6 +11,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/blk-mq.h>
 #include <linux/blk-cgroup.h>
+#include <linux/debugfs.h>
 
 #include "blk.h"
 #include "blk-mq.h"
@@ -917,6 +918,9 @@ static void blk_release_queue(struct kobject *kobj)
 		blk_mq_release(q);
 
 	blk_trace_shutdown(q);
+	mutex_lock(&q->debugfs_mutex);
+	debugfs_remove_recursive(q->debugfs_dir);
+	mutex_unlock(&q->debugfs_mutex);
 
 	if (queue_is_mq(q))
 		blk_mq_debugfs_unregister(q);
@@ -989,6 +993,11 @@ int blk_register_queue(struct gendisk *disk)
 		goto unlock;
 	}
 
+	mutex_lock(&q->debugfs_mutex);
+	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
+					    blk_debugfs_root);
+	mutex_unlock(&q->debugfs_mutex);
+
 	if (queue_is_mq(q)) {
 		__blk_mq_register_dev(dev, q);
 		blk_mq_debugfs_register(q);
diff --git a/block/blk.h b/block/blk.h
index b5d1f0fc6547..499308c6ab3b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -14,9 +14,7 @@
 /* Max future timer expiry for timeouts */
 #define BLK_MAX_TIMEOUT		(5 * HZ)
 
-#ifdef CONFIG_DEBUG_FS
 extern struct dentry *blk_debugfs_root;
-#endif
 
 struct blk_flush_queue {
 	unsigned int		flush_pending_idx:1;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 780d6fa94746..e70e03d2cd8c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -528,9 +528,9 @@ struct request_queue {
 	unsigned int		sg_timeout;
 	unsigned int		sg_reserved_size;
 	int			node;
+	struct mutex		debugfs_mutex;
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	struct blk_trace __rcu	*blk_trace;
-	struct mutex		blk_trace_mutex;
 #endif
 	/*
 	 * for flush operations
@@ -574,8 +574,9 @@ struct request_queue {
 	struct list_head	tag_set_list;
 	struct bio_set		bio_split;
 
-#ifdef CONFIG_BLK_DEBUG_FS
 	struct dentry		*debugfs_dir;
+
+#ifdef CONFIG_BLK_DEBUG_FS
 	struct dentry		*sched_debugfs_dir;
 	struct dentry		*rqos_debugfs_dir;
 #endif
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 098780ec018f..12e1d52f1d17 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -348,7 +348,7 @@ static int __blk_trace_remove(struct request_queue *q)
 	struct blk_trace *bt;
 
 	bt = rcu_replace_pointer(q->blk_trace, NULL,
-				 lockdep_is_held(&q->blk_trace_mutex));
+				 lockdep_is_held(&q->debugfs_mutex));
 	if (!bt)
 		return -EINVAL;
 
@@ -362,9 +362,9 @@ int blk_trace_remove(struct request_queue *q)
 {
 	int ret;
 
-	mutex_lock(&q->blk_trace_mutex);
+	mutex_lock(&q->debugfs_mutex);
 	ret = __blk_trace_remove(q);
-	mutex_unlock(&q->blk_trace_mutex);
+	mutex_unlock(&q->debugfs_mutex);
 
 	return ret;
 }
@@ -483,14 +483,11 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	struct dentry *dir = NULL;
 	int ret;
 
-	lockdep_assert_held(&q->blk_trace_mutex);
+	lockdep_assert_held(&q->debugfs_mutex);
 
 	if (!buts->buf_size || !buts->buf_nr)
 		return -EINVAL;
 
-	if (!blk_debugfs_root)
-		return -ENOENT;
-
 	strncpy(buts->name, name, BLKTRACE_BDEV_SIZE);
 	buts->name[BLKTRACE_BDEV_SIZE - 1] = '\0';
 
@@ -505,7 +502,7 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	 * we can be.
 	 */
 	if (rcu_dereference_protected(q->blk_trace,
-				      lockdep_is_held(&q->blk_trace_mutex))) {
+				      lockdep_is_held(&q->debugfs_mutex))) {
 		pr_warn("Concurrent blktraces are not allowed on %s\n",
 			buts->name);
 		return -EBUSY;
@@ -524,18 +521,15 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	if (!bt->msg_data)
 		goto err;
 
-#ifdef CONFIG_BLK_DEBUG_FS
 	/*
-	 * When tracing whole make_request drivers (multiqueue) block devices,
-	 * reuse the existing debugfs directory created by the block layer on
-	 * init. For request-based block devices, all partitions block devices,
+	 * When tracing the whole disk reuse the existing debugfs directory
+	 * created by the block layer on init. For partitions block devices,
 	 * and scsi-generic block devices we create a temporary new debugfs
 	 * directory that will be removed once the trace ends.
 	 */
-	if (queue_is_mq(q) && bdev && bdev == bdev->bd_contains)
+	if (bdev && bdev == bdev->bd_contains)
 		dir = q->debugfs_dir;
 	else
-#endif
 		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
 
 	/*
@@ -617,9 +611,9 @@ int blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 {
 	int ret;
 
-	mutex_lock(&q->blk_trace_mutex);
+	mutex_lock(&q->debugfs_mutex);
 	ret = __blk_trace_setup(q, name, dev, bdev, arg);
-	mutex_unlock(&q->blk_trace_mutex);
+	mutex_unlock(&q->debugfs_mutex);
 
 	return ret;
 }
@@ -665,7 +659,7 @@ static int __blk_trace_startstop(struct request_queue *q, int start)
 	struct blk_trace *bt;
 
 	bt = rcu_dereference_protected(q->blk_trace,
-				       lockdep_is_held(&q->blk_trace_mutex));
+				       lockdep_is_held(&q->debugfs_mutex));
 	if (bt == NULL)
 		return -EINVAL;
 
@@ -705,9 +699,9 @@ int blk_trace_startstop(struct request_queue *q, int start)
 {
 	int ret;
 
-	mutex_lock(&q->blk_trace_mutex);
+	mutex_lock(&q->debugfs_mutex);
 	ret = __blk_trace_startstop(q, start);
-	mutex_unlock(&q->blk_trace_mutex);
+	mutex_unlock(&q->debugfs_mutex);
 
 	return ret;
 }
@@ -736,7 +730,7 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
 	if (!q)
 		return -ENXIO;
 
-	mutex_lock(&q->blk_trace_mutex);
+	mutex_lock(&q->debugfs_mutex);
 
 	switch (cmd) {
 	case BLKTRACESETUP:
@@ -763,7 +757,7 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
 		break;
 	}
 
-	mutex_unlock(&q->blk_trace_mutex);
+	mutex_unlock(&q->debugfs_mutex);
 	return ret;
 }
 
@@ -774,14 +768,14 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
  **/
 void blk_trace_shutdown(struct request_queue *q)
 {
-	mutex_lock(&q->blk_trace_mutex);
+	mutex_lock(&q->debugfs_mutex);
 	if (rcu_dereference_protected(q->blk_trace,
-				      lockdep_is_held(&q->blk_trace_mutex))) {
+				      lockdep_is_held(&q->debugfs_mutex))) {
 		__blk_trace_startstop(q, 0);
 		__blk_trace_remove(q);
 	}
 
-	mutex_unlock(&q->blk_trace_mutex);
+	mutex_unlock(&q->debugfs_mutex);
 }
 
 #ifdef CONFIG_BLK_CGROUP
@@ -1662,7 +1656,7 @@ static int blk_trace_remove_queue(struct request_queue *q)
 	struct blk_trace *bt;
 
 	bt = rcu_replace_pointer(q->blk_trace, NULL,
-				 lockdep_is_held(&q->blk_trace_mutex));
+				 lockdep_is_held(&q->debugfs_mutex));
 	if (bt == NULL)
 		return -EINVAL;
 
@@ -1837,10 +1831,10 @@ static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
 	if (q == NULL)
 		goto out_bdput;
 
-	mutex_lock(&q->blk_trace_mutex);
+	mutex_lock(&q->debugfs_mutex);
 
 	bt = rcu_dereference_protected(q->blk_trace,
-				       lockdep_is_held(&q->blk_trace_mutex));
+				       lockdep_is_held(&q->debugfs_mutex));
 	if (attr == &dev_attr_enable) {
 		ret = sprintf(buf, "%u\n", !!bt);
 		goto out_unlock_bdev;
@@ -1858,7 +1852,7 @@ static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
 		ret = sprintf(buf, "%llu\n", bt->end_lba);
 
 out_unlock_bdev:
-	mutex_unlock(&q->blk_trace_mutex);
+	mutex_unlock(&q->debugfs_mutex);
 out_bdput:
 	bdput(bdev);
 out:
@@ -1901,10 +1895,10 @@ static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
 	if (q == NULL)
 		goto out_bdput;
 
-	mutex_lock(&q->blk_trace_mutex);
+	mutex_lock(&q->debugfs_mutex);
 
 	bt = rcu_dereference_protected(q->blk_trace,
-				       lockdep_is_held(&q->blk_trace_mutex));
+				       lockdep_is_held(&q->debugfs_mutex));
 	if (attr == &dev_attr_enable) {
 		if (!!value == !!bt) {
 			ret = 0;
@@ -1921,7 +1915,7 @@ static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
 	if (bt == NULL) {
 		ret = blk_trace_setup_queue(q, bdev);
 		bt = rcu_dereference_protected(q->blk_trace,
-				lockdep_is_held(&q->blk_trace_mutex));
+				lockdep_is_held(&q->debugfs_mutex));
 	}
 
 	if (ret == 0) {
@@ -1936,7 +1930,7 @@ static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
 	}
 
 out_unlock_bdev:
-	mutex_unlock(&q->blk_trace_mutex);
+	mutex_unlock(&q->debugfs_mutex);
 out_bdput:
 	bdput(bdev);
 out:
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 6/8] blktrace: fix debugfs use after free
  2020-06-19 20:47 ` [PATCH v7 6/8] blktrace: fix debugfs use after free Luis Chamberlain
@ 2020-06-20  7:29   ` Christoph Hellwig
  2020-06-20 17:31   ` Bart Van Assche
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2020-06-20  7:29 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, martin.petersen, jejb,
	linux-block, linux-fsdevel, linux-mm, linux-kernel,
	Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

Still not a fan of the wall of text in the commit log, but the changes
look good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 7/8] blktrace: ensure our debugfs dir exists
  2020-06-19 20:47 ` [PATCH v7 7/8] blktrace: ensure our debugfs dir exists Luis Chamberlain
@ 2020-06-20  7:29   ` Christoph Hellwig
  2020-06-20 17:33   ` Bart Van Assche
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2020-06-20  7:29 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, martin.petersen, jejb,
	linux-block, linux-fsdevel, linux-mm, linux-kernel

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 8/8] block: create the request_queue debugfs_dir on registration
  2020-06-19 20:47 ` [PATCH v7 8/8] block: create the request_queue debugfs_dir on registration Luis Chamberlain
@ 2020-06-20  7:30   ` Christoph Hellwig
  2020-06-20 18:07   ` Bart Van Assche
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2020-06-20  7:30 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, martin.petersen, jejb,
	linux-block, linux-fsdevel, linux-mm, linux-kernel

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 4/8] blktrace: annotate required lock on do_blk_trace_setup()
  2020-06-19 20:47 ` [PATCH v7 4/8] blktrace: annotate required lock on do_blk_trace_setup() Luis Chamberlain
@ 2020-06-20 17:02   ` Bart Van Assche
  0 siblings, 0 replies; 26+ messages in thread
From: Bart Van Assche @ 2020-06-20 17:02 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

On 2020-06-19 13:47, Luis Chamberlain wrote:
> Ensure it is clear which lock is required on do_blk_trace_setup().

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 5/8] loop: be paranoid on exit and prevent new additions / removals
  2020-06-19 20:47 ` [PATCH v7 5/8] loop: be paranoid on exit and prevent new additions / removals Luis Chamberlain
@ 2020-06-20 17:11   ` Bart Van Assche
  2020-06-22 12:27     ` Luis Chamberlain
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2020-06-20 17:11 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Christoph Hellwig

On 2020-06-19 13:47, Luis Chamberlain wrote:
> Be pedantic on removal as well and hold the mutex.
> This should prevent uses of addition while we exit.
> 
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  drivers/block/loop.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index c33bbbfd1bd9..d55e1b52f076 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -2402,6 +2402,8 @@ static void __exit loop_exit(void)
>  
>  	range = max_loop ? max_loop << part_shift : 1UL << MINORBITS;
>  
> +	mutex_lock(&loop_ctl_mutex);
> +
>  	idr_for_each(&loop_index_idr, &loop_exit_cb, NULL);
>  	idr_destroy(&loop_index_idr);
>  
> @@ -2409,6 +2411,8 @@ static void __exit loop_exit(void)
>  	unregister_blkdev(LOOP_MAJOR, "loop");
>  
>  	misc_deregister(&loop_misc);
> +
> +	mutex_unlock(&loop_ctl_mutex);
>  }
>  
>  module_init(loop_init);

Is try_module_get(fops->owner) called before a loop device is opened and
is module_put(fops->owner) called after a loop device is closed? Does
that mean that it is impossible to unload the loop driver while a loop
device is open? Does that mean that the above patch is not necessary or
did I perhaps miss something?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 6/8] blktrace: fix debugfs use after free
  2020-06-19 20:47 ` [PATCH v7 6/8] blktrace: fix debugfs use after free Luis Chamberlain
  2020-06-20  7:29   ` Christoph Hellwig
@ 2020-06-20 17:31   ` Bart Van Assche
  2020-06-22 12:36     ` Luis Chamberlain
  1 sibling, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2020-06-20 17:31 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Omar Sandoval,
	Hannes Reinecke, Michal Hocko, syzbot+603294af2d01acfdd6da

On 2020-06-19 13:47, Luis Chamberlain wrote:
> This goes tested with:
       ^^^^
       got?

> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 7ff2ea5cd05e..e6e2d25fdbd6 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -524,10 +524,18 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
>  	if (!bt->msg_data)
>  		goto err;
>  
> -	ret = -ENOENT;
> -
> -	dir = debugfs_lookup(buts->name, blk_debugfs_root);
> -	if (!dir)
> +#ifdef CONFIG_BLK_DEBUG_FS
> +	/*
> +	 * When tracing whole make_request drivers (multiqueue) block devices,
> +	 * reuse the existing debugfs directory created by the block layer on
> +	 * init. For request-based block devices, all partitions block devices,
                                                  ^^^^^^^^^^^^^^^^^^^^^
It seems like a word is missing from the comment? Or did you perhaps
want to refer to "all partition block devices"?

> +	 * and scsi-generic block devices we create a temporary new debugfs
> +	 * directory that will be removed once the trace ends.
> +	 */
> +	if (queue_is_mq(q) && bdev && bdev == bdev->bd_contains)
> +		dir = q->debugfs_dir;
> +	else
> +#endif
>  		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);

Can it happen that two different threads each try to set up block
tracing and hence that debugfs_create_dir() fails because a directory
with name buts->name already exists?

>  	bt->dev = dev;
> @@ -565,8 +573,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
>  
>  	ret = 0;
>  err:
> -	if (dir && !bt->dir)
> -		dput(dir);
>  	if (ret)
>  		blk_trace_free(bt);
>  	return ret;

Shouldn't bt->dir be removed in this error path for make_request drivers?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 7/8] blktrace: ensure our debugfs dir exists
  2020-06-19 20:47 ` [PATCH v7 7/8] blktrace: ensure our debugfs dir exists Luis Chamberlain
  2020-06-20  7:29   ` Christoph Hellwig
@ 2020-06-20 17:33   ` Bart Van Assche
  1 sibling, 0 replies; 26+ messages in thread
From: Bart Van Assche @ 2020-06-20 17:33 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

On 2020-06-19 13:47, Luis Chamberlain wrote:
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index e6e2d25fdbd6..098780ec018f 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -538,6 +538,18 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
>  #endif
>  		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
>  
> +	/*
> +	 * As blktrace relies on debugfs for its interface the debugfs directory
> +	 * is required, contrary to the usual mantra of not checking for debugfs
> +	 * files or directories.
> +	 */
> +	if (IS_ERR_OR_NULL(dir)) {
> +		pr_warn("debugfs_dir not present for %s so skipping\n",
> +			buts->name);

Maybe mention in the message what is being skipped (block tracing)?

> +		ret = -ENOENT;
> +		goto err;
> +	}
> +
>  	bt->dev = dev;
>  	atomic_set(&bt->dropped, 0);
>  	INIT_LIST_HEAD(&bt->running_list);

Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 8/8] block: create the request_queue debugfs_dir on registration
  2020-06-19 20:47 ` [PATCH v7 8/8] block: create the request_queue debugfs_dir on registration Luis Chamberlain
  2020-06-20  7:30   ` Christoph Hellwig
@ 2020-06-20 18:07   ` Bart Van Assche
  2020-06-22 12:42     ` Luis Chamberlain
  1 sibling, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2020-06-20 18:07 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

On 2020-06-19 13:47, Luis Chamberlain wrote:
> We were only creating the request_queue debugfs_dir only
> for make_request block drivers (multiqueue), but never for
> request-based block drivers. We did this as we were only
> creating non-blktrace additional debugfs files on that directory
> for make_request drivers. However, since blktrace *always* creates
> that directory anyway, we special-case the use of that directory
> on blktrace. Other than this being an eye-sore, this exposes
> request-based block drivers to the same debugfs fragile
> race that used to exist with make_request block drivers
> where if we start adding files onto that directory we can later
> run a race with a double removal of dentries on the directory
> if we don't deal with this carefully on blktrace.
> 
> Instead, just simplify things by always creating the request_queue
> debugfs_dir on request_queue registration. Rename the mutex also to
> reflect the fact that this is used outside of the blktrace context.

There are two changes in this patch: a bug fix and a rename of a mutex.
I don't like it to see two changes in a single patch.

Additionally, is the new mutex name really better than the old name? The
proper way to use mutexes is to use mutexes to protect data instead of
code. Where is the documentation that mentions which member variable(s)
of which data structures are protected by the mutex formerly called
blk_trace_mutex? Since the new name makes it even less clear which data
is protected by this mutex, is the new name really better than the old name?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 0/8] blktrace: fix debugfs use after free
  2020-06-19 20:47 [PATCH v7 0/8] blktrace: fix debugfs use after free Luis Chamberlain
                   ` (7 preceding siblings ...)
  2020-06-19 20:47 ` [PATCH v7 8/8] block: create the request_queue debugfs_dir on registration Luis Chamberlain
@ 2020-06-20 21:18 ` Jens Axboe
  8 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2020-06-20 21:18 UTC (permalink / raw)
  To: Luis Chamberlain, viro, bvanassche, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

On 6/19/20 2:47 PM, Luis Chamberlain wrote:
> Its been a fun ride, but all patch series come to an end. My hope is
> that this is it. The simplification of the fix is considerable now,
> with only a few lines of code and with no data structure changes.
> 
> We were only creating the debugfs_dir upon initialization only if
> you had CONFIG_BLK_DEBUG_FS for for make_request block drivers
> (multiqueue). That's where the UAF bug could happen. Folks liked
> the idea of open coding the debugfs initialization even if
> CONFIG_BLK_DEBUG_FS was disabled, given that debugfs code will
> simply ignore that code if debugfs is disabled, but to make
> the fix easier to backport, that shift is done now in another
> patch. Likewise, although we were only creating the debugfs_dir
> only for make_request block drivers (multiqueue), the same new
> additional patch also creates the debugfs_dir for request-based
> block drivers. That *begged* us to just rename the mutex to
> clarify its for the debugfs_dir, blktrace then just becomes
> its biggest user.
> 
> The only patches changed here is the last one from the last series,
> which actually fixed the UAF oops, and that one is now split in 3
> patches, which makes a secondary fix much clearer.
> 
> I've waited a while to post these, to let 0-day give me its blessings,
> both for Linus' tree and linux-next. No issues have been found. I've
> also taken time to run blktests prior and after this series and I have
> found no regressions. In fact, I think I should just extend blktests
> with the break-blktrace tests, I'll do that later.
> 
> Luis Chamberlain (8):
>   block: add docs for gendisk / request_queue refcount helpers
>   block: clarify context for refcount increment helpers
>   block: revert back to synchronous request_queue removal
>   blktrace: annotate required lock on do_blk_trace_setup()
>   loop: be paranoid on exit and prevent new additions / removals
>   blktrace: fix debugfs use after free
>   blktrace: ensure our debugfs dir exists
>   block: create the request_queue debugfs_dir on registration

Applied for 5.9, thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 5/8] loop: be paranoid on exit and prevent new additions / removals
  2020-06-20 17:11   ` Bart Van Assche
@ 2020-06-22 12:27     ` Luis Chamberlain
  2020-06-23  2:16       ` Bart Van Assche
  2020-06-23 17:05       ` Johannes Thumshirn
  0 siblings, 2 replies; 26+ messages in thread
From: Luis Chamberlain @ 2020-06-22 12:27 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Christoph Hellwig

On Sat, Jun 20, 2020 at 10:11:46AM -0700, Bart Van Assche wrote:
> On 2020-06-19 13:47, Luis Chamberlain wrote:
> > Be pedantic on removal as well and hold the mutex.
> > This should prevent uses of addition while we exit.
> > 
> > Reviewed-by: Ming Lei <ming.lei@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >  drivers/block/loop.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index c33bbbfd1bd9..d55e1b52f076 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -2402,6 +2402,8 @@ static void __exit loop_exit(void)
> >  
> >  	range = max_loop ? max_loop << part_shift : 1UL << MINORBITS;
> >  
> > +	mutex_lock(&loop_ctl_mutex);
> > +
> >  	idr_for_each(&loop_index_idr, &loop_exit_cb, NULL);
> >  	idr_destroy(&loop_index_idr);
> >  
> > @@ -2409,6 +2411,8 @@ static void __exit loop_exit(void)
> >  	unregister_blkdev(LOOP_MAJOR, "loop");
> >  
> >  	misc_deregister(&loop_misc);
> > +
> > +	mutex_unlock(&loop_ctl_mutex);
> >  }
> >  
> >  module_init(loop_init);
> 
> Is try_module_get(fops->owner) called before a loop device is opened and
> is module_put(fops->owner) called after a loop device is closed? Does
> that mean that it is impossible to unload the loop driver while a loop
> device is open? Does that mean that the above patch is not necessary or
> did I perhaps miss something?

That's not the only way to add or remove the loop module though.

You may add/remove it manually. And again, as mentioned in the commit log,
I couldn't trigger a race myself, however this seemed the more pedantic
and careful strategy we can take.

Note: this will bring you sanity if you try to figure out *why* we still
get:

[235530.144343] debugfs: Directory 'loop0' with parent 'block' already present!
[235530.149477] blktrace: debugfs_dir not present for loop0 so skipping
[235530.232328] debugfs: Directory 'loop0' with parent 'block' already present!
[235530.238962] blktrace: debugfs_dir not present for loop0 so skipping

If you run run_0004.sh from break-blktrace [0]. Even with all my patches
merged we still run into this. And so the bug lies within the block
layer or on the driver. I haven't been able to find the issue yet.

[0] https://github.com/mcgrof/break-blktrace

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 6/8] blktrace: fix debugfs use after free
  2020-06-20 17:31   ` Bart Van Assche
@ 2020-06-22 12:36     ` Luis Chamberlain
  0 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2020-06-22 12:36 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Omar Sandoval,
	Hannes Reinecke, Michal Hocko, syzbot+603294af2d01acfdd6da

On Sat, Jun 20, 2020 at 10:31:51AM -0700, Bart Van Assche wrote:
> On 2020-06-19 13:47, Luis Chamberlain wrote:
> > This goes tested with:
>        ^^^^
>        got?
> 
> > diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> > index 7ff2ea5cd05e..e6e2d25fdbd6 100644
> > --- a/kernel/trace/blktrace.c
> > +++ b/kernel/trace/blktrace.c
> > @@ -524,10 +524,18 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
> >  	if (!bt->msg_data)
> >  		goto err;
> >  
> > -	ret = -ENOENT;
> > -
> > -	dir = debugfs_lookup(buts->name, blk_debugfs_root);
> > -	if (!dir)
> > +#ifdef CONFIG_BLK_DEBUG_FS
> > +	/*
> > +	 * When tracing whole make_request drivers (multiqueue) block devices,
> > +	 * reuse the existing debugfs directory created by the block layer on
> > +	 * init. For request-based block devices, all partitions block devices,
>                                                   ^^^^^^^^^^^^^^^^^^^^^
> It seems like a word is missing from the comment? Or did you perhaps
> want to refer to "all partition block devices"?

Yes, the later.

> > +	 * and scsi-generic block devices we create a temporary new debugfs
> > +	 * directory that will be removed once the trace ends.
> > +	 */
> > +	if (queue_is_mq(q) && bdev && bdev == bdev->bd_contains)
> > +		dir = q->debugfs_dir;
> > +	else
> > +#endif
> >  		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> 
> Can it happen that two different threads each try to set up block
> tracing and hence that debugfs_create_dir() fails because a directory
> with name buts->name already exists?

Great question, the answer is no. The reason is that we first use the
mutex and then we check for q->blk_trace. If you hold the lock *and*
you have checked for q->blk_trace and its NULL, you are sure you should
not have a duplicate.

Its why the commit log mentioned:

  The new clarifications on relying on the q->blk_trace_mutex *and* also
  checking for q->blk_trace *prior* to processing a blktrace ensures the
  debugfs directories are only created if no possible directory name
  clashes are possible.

These clarifications were prompted through discussions with Jan Kara
on the patches he posted which you CC'd me on. I agreed with his
patch *but* I suggested it would hold true only if check for the
q->blk_trace first, and this is why my patch titled "blktrace: break
out of blktrace setup on concurrent calls" got merged prior to Jan
Kara's "blktrace: Avoid sparse warnings when assigning q->blk_trace".

> >  	bt->dev = dev;
> > @@ -565,8 +573,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
> >  
> >  	ret = 0;
> >  err:
> > -	if (dir && !bt->dir)
> > -		dput(dir);
> >  	if (ret)
> >  		blk_trace_free(bt);
> >  	return ret;
> 
> Shouldn't bt->dir be removed in this error path for make_request drivers?

If there is an error, bt->dir will be removed still, as I never modified
the removal of bt->dir in this patch.

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 8/8] block: create the request_queue debugfs_dir on registration
  2020-06-20 18:07   ` Bart Van Assche
@ 2020-06-22 12:42     ` Luis Chamberlain
  2020-06-23  3:18       ` Bart Van Assche
  0 siblings, 1 reply; 26+ messages in thread
From: Luis Chamberlain @ 2020-06-22 12:42 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

On Sat, Jun 20, 2020 at 11:07:43AM -0700, Bart Van Assche wrote:
> On 2020-06-19 13:47, Luis Chamberlain wrote:
> > We were only creating the request_queue debugfs_dir only
> > for make_request block drivers (multiqueue), but never for
> > request-based block drivers. We did this as we were only
> > creating non-blktrace additional debugfs files on that directory
> > for make_request drivers. However, since blktrace *always* creates
> > that directory anyway, we special-case the use of that directory
> > on blktrace. Other than this being an eye-sore, this exposes
> > request-based block drivers to the same debugfs fragile
> > race that used to exist with make_request block drivers
> > where if we start adding files onto that directory we can later
> > run a race with a double removal of dentries on the directory
> > if we don't deal with this carefully on blktrace.
> > 
> > Instead, just simplify things by always creating the request_queue
> > debugfs_dir on request_queue registration. Rename the mutex also to
> > reflect the fact that this is used outside of the blktrace context.
> 
> There are two changes in this patch: a bug fix and a rename of a mutex.
> I don't like it to see two changes in a single patch.

I thought about doing the split first, and I did it at first, but
then I could hear Christoph yelling at me for it. So I merged the
two together. Although it makes it more difficult for review,
the changes do go together.

Kind of late to split this as its already merged now.

> Additionally, is the new mutex name really better than the old name? The
> proper way to use mutexes is to use mutexes to protect data instead of
> code. Where is the documentation that mentions which member variable(s)
> of which data structures are protected by the mutex formerly called
> blk_trace_mutex?

It does not exist, and that is the point. The debugfs_dir use after
free showed us *when* that UAF can happen, and so care must be taken
if we are to use the mutex to protect the debugfs_dir but also re-use
the same directory for other block core shenanigans.

> Since the new name makes it even less clear which data
> is protected by this mutex, is the new name really better than the old name?

I thought the new name makes it crystal clear what is being protected. I
can however add a comment to explain that the q->debugfs_mutex protects
the q->debugfs_dir if it is created, otherwise it protects the ephemeral
debugfs_dir directory which would otherwise be created in lieue of
q->debugfs_dir, however the patch still lies under <debugfs_root>/block/.

Let me know if you think that will help.

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 5/8] loop: be paranoid on exit and prevent new additions / removals
  2020-06-22 12:27     ` Luis Chamberlain
@ 2020-06-23  2:16       ` Bart Van Assche
  2020-06-23 16:56         ` Luis Chamberlain
  2020-06-23 17:05       ` Johannes Thumshirn
  1 sibling, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2020-06-23  2:16 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Christoph Hellwig

On 2020-06-22 05:27, Luis Chamberlain wrote:
> Note: this will bring you sanity if you try to figure out *why* we still
> get:
> 
> [235530.144343] debugfs: Directory 'loop0' with parent 'block' already present!
> [235530.149477] blktrace: debugfs_dir not present for loop0 so skipping
> [235530.232328] debugfs: Directory 'loop0' with parent 'block' already present!
> [235530.238962] blktrace: debugfs_dir not present for loop0 so skipping
> 
> If you run run_0004.sh from break-blktrace [0]. Even with all my patches
> merged we still run into this. And so the bug lies within the block
> layer or on the driver. I haven't been able to find the issue yet.
> 
> [0] https://github.com/mcgrof/break-blktrace

Thanks Luis for having shared this information. If I can find the time I
will have a look into this myself.

Bart.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 8/8] block: create the request_queue debugfs_dir on registration
  2020-06-22 12:42     ` Luis Chamberlain
@ 2020-06-23  3:18       ` Bart Van Assche
  0 siblings, 0 replies; 26+ messages in thread
From: Bart Van Assche @ 2020-06-23  3:18 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

On 2020-06-22 05:42, Luis Chamberlain wrote:
> On Sat, Jun 20, 2020 at 11:07:43AM -0700, Bart Van Assche wrote:
>> On 2020-06-19 13:47, Luis Chamberlain wrote:
>>> We were only creating the request_queue debugfs_dir only
>>> for make_request block drivers (multiqueue), but never for
>>> request-based block drivers. We did this as we were only
>>> creating non-blktrace additional debugfs files on that directory
>>> for make_request drivers. However, since blktrace *always* creates
>>> that directory anyway, we special-case the use of that directory
>>> on blktrace. Other than this being an eye-sore, this exposes
>>> request-based block drivers to the same debugfs fragile
>>> race that used to exist with make_request block drivers
>>> where if we start adding files onto that directory we can later
>>> run a race with a double removal of dentries on the directory
>>> if we don't deal with this carefully on blktrace.
>>>
>>> Instead, just simplify things by always creating the request_queue
>>> debugfs_dir on request_queue registration. Rename the mutex also to
>>> reflect the fact that this is used outside of the blktrace context.
>>
>> There are two changes in this patch: a bug fix and a rename of a mutex.
>> I don't like it to see two changes in a single patch.
> 
> I thought about doing the split first, and I did it at first, but
> then I could hear Christoph yelling at me for it. So I merged the
> two together. Although it makes it more difficult for review,
> the changes do go together.

During the past weeks I have been more busy than usual. I will try to
make sure that in the future I have the time to read all comments on the
previous versions of a patch series before replying to the latest
version of a patch series.

>> Additionally, is the new mutex name really better than the old name? The
>> proper way to use mutexes is to use mutexes to protect data instead of
>> code. Where is the documentation that mentions which member variable(s)
>> of which data structures are protected by the mutex formerly called
>> blk_trace_mutex?
> 
> It does not exist, and that is the point. The debugfs_dir use after
> free showed us *when* that UAF can happen, and so care must be taken
> if we are to use the mutex to protect the debugfs_dir but also re-use
> the same directory for other block core shenanigans.
> 
>> Since the new name makes it even less clear which data
>> is protected by this mutex, is the new name really better than the old name?
> 
> I thought the new name makes it crystal clear what is being protected. I
> can however add a comment to explain that the q->debugfs_mutex protects
> the q->debugfs_dir if it is created, otherwise it protects the ephemeral
> debugfs_dir directory which would otherwise be created in lieue of
> q->debugfs_dir, however the patch still lies under <debugfs_root>/block/.
> 
> Let me know if you think that will help.

My concern is that q->debugfs_mutex would evolve the same way as
q->sysfs_lock: at the time of introduction the role of a mutex is very
clear but over time the number of use cases grows to a point where it is
no longer possible to recognize the original purpose. I think there are
two possible approaches: either a comment is added now that explains the
role of q->debugfs_mutex or someone who has followed this conversation
yells when someone tries to use q->debugfs_mutex for another purpose
than what it was intended for.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 5/8] loop: be paranoid on exit and prevent new additions / removals
  2020-06-23  2:16       ` Bart Van Assche
@ 2020-06-23 16:56         ` Luis Chamberlain
  0 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2020-06-23 16:56 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Christoph Hellwig

On Mon, Jun 22, 2020 at 07:16:15PM -0700, Bart Van Assche wrote:
> On 2020-06-22 05:27, Luis Chamberlain wrote:
> > Note: this will bring you sanity if you try to figure out *why* we still
> > get:
> > 
> > [235530.144343] debugfs: Directory 'loop0' with parent 'block' already present!
> > [235530.149477] blktrace: debugfs_dir not present for loop0 so skipping
> > [235530.232328] debugfs: Directory 'loop0' with parent 'block' already present!
> > [235530.238962] blktrace: debugfs_dir not present for loop0 so skipping
> > 
> > If you run run_0004.sh from break-blktrace [0]. Even with all my patches
> > merged we still run into this. And so the bug lies within the block
> > layer or on the driver. I haven't been able to find the issue yet.
> > 
> > [0] https://github.com/mcgrof/break-blktrace
> 
> Thanks Luis for having shared this information. If I can find the time I
> will have a look into this myself.

Let's keep track of it:

https://bugzilla.kernel.org/show_bug.cgi?id=208301

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 5/8] loop: be paranoid on exit and prevent new additions / removals
  2020-06-22 12:27     ` Luis Chamberlain
  2020-06-23  2:16       ` Bart Van Assche
@ 2020-06-23 17:05       ` Johannes Thumshirn
  2020-06-24 12:08         ` Luis Chamberlain
  1 sibling, 1 reply; 26+ messages in thread
From: Johannes Thumshirn @ 2020-06-23 17:05 UTC (permalink / raw)
  To: Luis Chamberlain, Bart Van Assche
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, martin.petersen, jejb, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Christoph Hellwig

On 22/06/2020 14:27, Luis Chamberlain wrote:
[...]> If you run run_0004.sh from break-blktrace [0]. Even with all my patches
> merged we still run into this. And so the bug lies within the block
> layer or on the driver. I haven't been able to find the issue yet.
> 
> [0] https://github.com/mcgrof/break-blktrace

Would it be a good idea to merge this into blktests? Maybe start a blktrace 
section for it, which could host other regression test for blktrace.

Thoughts?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 5/8] loop: be paranoid on exit and prevent new additions / removals
  2020-06-23 17:05       ` Johannes Thumshirn
@ 2020-06-24 12:08         ` Luis Chamberlain
  0 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2020-06-24 12:08 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Bart Van Assche, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm, mhocko, yukuai3, martin.petersen, jejb,
	linux-block, linux-fsdevel, linux-mm, linux-kernel,
	Christoph Hellwig

On Tue, Jun 23, 2020 at 05:05:01PM +0000, Johannes Thumshirn wrote:
> On 22/06/2020 14:27, Luis Chamberlain wrote:
> [...]> If you run run_0004.sh from break-blktrace [0]. Even with all my patches
> > merged we still run into this. And so the bug lies within the block
> > layer or on the driver. I haven't been able to find the issue yet.
> > 
> > [0] https://github.com/mcgrof/break-blktrace
> 
> Would it be a good idea to merge this into blktests? Maybe start a blktrace 
> section for it, which could host other regression test for blktrace.
> 
> Thoughts?

Absolutely! That is the goal as well.

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-06-24 12:08 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-19 20:47 [PATCH v7 0/8] blktrace: fix debugfs use after free Luis Chamberlain
2020-06-19 20:47 ` [PATCH v7 1/8] block: add docs for gendisk / request_queue refcount helpers Luis Chamberlain
2020-06-19 20:47 ` [PATCH v7 2/8] block: clarify context for refcount increment helpers Luis Chamberlain
2020-06-19 20:47 ` [PATCH v7 3/8] block: revert back to synchronous request_queue removal Luis Chamberlain
2020-06-19 20:47 ` [PATCH v7 4/8] blktrace: annotate required lock on do_blk_trace_setup() Luis Chamberlain
2020-06-20 17:02   ` Bart Van Assche
2020-06-19 20:47 ` [PATCH v7 5/8] loop: be paranoid on exit and prevent new additions / removals Luis Chamberlain
2020-06-20 17:11   ` Bart Van Assche
2020-06-22 12:27     ` Luis Chamberlain
2020-06-23  2:16       ` Bart Van Assche
2020-06-23 16:56         ` Luis Chamberlain
2020-06-23 17:05       ` Johannes Thumshirn
2020-06-24 12:08         ` Luis Chamberlain
2020-06-19 20:47 ` [PATCH v7 6/8] blktrace: fix debugfs use after free Luis Chamberlain
2020-06-20  7:29   ` Christoph Hellwig
2020-06-20 17:31   ` Bart Van Assche
2020-06-22 12:36     ` Luis Chamberlain
2020-06-19 20:47 ` [PATCH v7 7/8] blktrace: ensure our debugfs dir exists Luis Chamberlain
2020-06-20  7:29   ` Christoph Hellwig
2020-06-20 17:33   ` Bart Van Assche
2020-06-19 20:47 ` [PATCH v7 8/8] block: create the request_queue debugfs_dir on registration Luis Chamberlain
2020-06-20  7:30   ` Christoph Hellwig
2020-06-20 18:07   ` Bart Van Assche
2020-06-22 12:42     ` Luis Chamberlain
2020-06-23  3:18       ` Bart Van Assche
2020-06-20 21:18 ` [PATCH v7 0/8] blktrace: fix debugfs use after free Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).