Linux-Block Archive on lore.kernel.org
* [PATCH v5 0/7] block: fix blktrace debugfs use after free
@ 2020-05-16  3:19 Luis Chamberlain
  2020-05-16  3:19 ` [PATCH v5 1/7] block: add docs for gendisk / request_queue refcount helpers Luis Chamberlain
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Luis Chamberlain @ 2020-05-16  3:19 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain

In this v5 I've split the first patch into three: one for comments,
another for the context / might_sleep() updates, and the last for the big
revert back to synchronous request_queue removal. In the second patch I
didn't update the context for the put / decrement helpers for gendisk &
request_queue, as those are updated in the third patch.

Since the first 3 patches are a reflection of the original one, I've
left the Reviewed-by's collected in place.

I've changed the kzalloc() / snprintf() to just kasprintf() as requested
by Bart. Since it was not clear that we don't have the bdev on
do_blk_trace_setup() for the patch titled "blktrace: break out of
blktrace setup on concurrent calls", I've added a comment so that
someone doesn't later try to add a dev_printk() or the like.
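
For readers unfamiliar with the kzalloc() / snprintf() to kasprintf()
conversion mentioned above, here is a minimal userspace sketch of the idea:
measure the formatted length first, then allocate exactly that much and
format into it. demo_asprintf() is a hypothetical stand-in written for this
note; it is not the kernel helper and not code from the series.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for the kernel's kasprintf(): format
 * into a buffer allocated to exactly the required size. Returns NULL on
 * failure; the caller frees the result.
 */
static char *demo_asprintf(const char *fmt, ...)
{
	va_list ap, ap2;
	int len;
	char *buf;

	va_start(ap, fmt);
	va_copy(ap2, ap);
	len = vsnprintf(NULL, 0, fmt, ap);	/* first pass: measure only */
	va_end(ap);
	if (len < 0) {
		va_end(ap2);
		return NULL;
	}

	buf = malloc((size_t)len + 1);
	if (buf)
		vsnprintf(buf, (size_t)len + 1, fmt, ap2);	/* second pass */
	va_end(ap2);
	return buf;
}
```

A call such as demo_asprintf("%s%d", "sg", 1) returns a heap string holding
"sg1", with no fixed-size buffer to get wrong. That is the appeal of
replacing a separate allocation plus snprintf() with a single call.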

I've also addressed a compilation issue with debugfs disabled, reported
by 0-day on the patch titled "blktrace: fix debugfs use after free". It
was missing a "static inline" on a function. I've also moved the new
declarations underneath the "#ifdef CONFIG_BLOCK" in include/linux/genhd.h;
I previously had them outside of this block.
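
As a sketch of the header stub pattern involved (the exact 0-day report is
not reproduced here, so treat this as an illustration under assumptions),
the following shows why the stub for a compiled-out feature is normally
"static inline". DEMO_DEBUG_FS, demo_debugfs_register() and
demo_subsystem_init() are hypothetical names standing in for a Kconfig
symbol and the helpers it guards.

```c
/*
 * DEMO_DEBUG_FS is a hypothetical stand-in for a Kconfig symbol such as
 * CONFIG_DEBUG_FS. It is left undefined here, so the stub branch is used.
 */
#ifdef DEMO_DEBUG_FS
int demo_debugfs_register(void);	/* real implementation lives in a .c file */
#else
/*
 * "static inline" gives each includer its own private copy, which the
 * compiler can drop when unused. A plain external definition in a shared
 * header would instead be emitted by every file that includes it and
 * clash at link time.
 */
static inline int demo_debugfs_register(void)
{
	return 0;	/* nothing to register when the feature is compiled out */
}
#endif

/* Callers stay free of #ifdefs and simply call the helper. */
static int demo_subsystem_init(void)
{
	return demo_debugfs_register();
}
```

With the feature disabled, demo_subsystem_init() compiles down to the empty
stub and returns 0; with it enabled, the same call site picks up the real
implementation.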

I've left in place the scsi-generic blktrace support given I didn't receive any
feedback to kill it. This ensures this works as it used to.

Since these are minor changes I've given this a spin with break-blktrace
tests I have written and also ran blktrace with a scsi-generic media
changer. Both sg0 (the controller) and sg1 worked as expected.

These changes are based on linux-next tag next-20200515, and can also be
found on my git tree:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20200515-blktrace-fixes

Luis Chamberlain (7):
  block: add docs for gendisk / request_queue refcount helpers
  block: clarify context for gendisk / request_queue refcount increment
    helpers
  block: revert back to synchronous request_queue removal
  block: move main block debugfs initialization to its own file
  blktrace: fix debugfs use after free
  blktrace: break out of blktrace setup on concurrent calls
  loop: be paranoid on exit and prevent new additions / removals

 block/Makefile               |  10 +-
 block/blk-core.c             |  32 ++++--
 block/blk-debugfs.c          | 197 +++++++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.c       |   5 -
 block/blk-sysfs.c            |  46 ++++----
 block/blk.h                  |  24 +++++
 block/bsg.c                  |   2 +
 block/genhd.c                |  73 ++++++++++++-
 block/partitions/core.c      |   9 ++
 drivers/block/loop.c         |   4 +
 drivers/scsi/ch.c            |   1 +
 drivers/scsi/sg.c            |  75 +++++++++++++
 drivers/scsi/st.c            |   2 +
 include/linux/blkdev.h       |   6 +-
 include/linux/blktrace_api.h |   1 -
 include/linux/genhd.h        |  69 ++++++++++++
 kernel/trace/blktrace.c      |  37 +++++--
 17 files changed, 545 insertions(+), 48 deletions(-)
 create mode 100644 block/blk-debugfs.c

-- 
2.26.2



* [PATCH v5 1/7] block: add docs for gendisk / request_queue refcount helpers
  2020-05-16  3:19 [PATCH v5 0/7] block: fix blktrace debugfs use after free Luis Chamberlain
@ 2020-05-16  3:19 ` Luis Chamberlain
  2020-05-16  3:19 ` [PATCH v5 2/7] block: clarify context for gendisk / request_queue refcount increment helpers Luis Chamberlain
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Luis Chamberlain @ 2020-05-16  3:19 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain, Christoph Hellwig

This adds documentation for the gendisk / request_queue refcount
helpers.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-core.c | 13 +++++++++++++
 block/genhd.c    | 50 +++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 5847993738f1..e438c3b0815b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -321,6 +321,13 @@ void blk_clear_pm_only(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_clear_pm_only);
 
+/**
+ * blk_put_queue - decrement the request_queue refcount
+ * @q: the request_queue structure to decrement the refcount for
+ *
+ * Decrements the refcount of the request_queue kobject. When this reaches 0
+ * we'll have blk_release_queue() called.
+ */
 void blk_put_queue(struct request_queue *q)
 {
 	kobject_put(&q->kobj);
@@ -598,6 +605,12 @@ struct request_queue *blk_alloc_queue(make_request_fn make_request, int node_id)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
+/**
+ * blk_get_queue - increment the request_queue refcount
+ * @q: the request_queue structure to increment the refcount for
+ *
+ * Increment the refcount of the request_queue kobject.
+ */
 bool blk_get_queue(struct request_queue *q)
 {
 	if (likely(!blk_queue_dying(q))) {
diff --git a/block/genhd.c b/block/genhd.c
index afdb2c3e5b22..af910e6a0233 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -908,6 +908,20 @@ static void invalidate_partition(struct gendisk *disk, int partno)
 	bdput(bdev);
 }
 
+/**
+ * del_gendisk - remove the gendisk
+ * @disk: the struct gendisk to remove
+ *
+ * Removes the gendisk and all its associated resources. This deletes the
+ * partitions associated with the gendisk, and unregisters the associated
+ * request_queue.
+ *
+ * This is the counter to the respective __device_add_disk() call.
+ *
+ * The final removal of the struct gendisk happens when its refcount reaches 0
+ * with put_disk(), which should be called after del_gendisk(), if
+ * __device_add_disk() was used.
+ */
 void del_gendisk(struct gendisk *disk)
 {
 	struct disk_part_iter piter;
@@ -1539,6 +1553,23 @@ int disk_expand_part_tbl(struct gendisk *disk, int partno)
 	return 0;
 }
 
+/**
+ * disk_release - releases all allocated resources of the gendisk
+ * @dev: the device representing this disk
+ *
+ * This function releases all allocated resources of the gendisk.
+ *
+ * The struct gendisk refcount is incremented with get_gendisk() or
+ * get_disk_and_module(), and its refcount is decremented with
+ * put_disk_and_module() or put_disk(). Once the refcount reaches 0 this
+ * function is called.
+ *
+ * Drivers which used __device_add_disk() have a gendisk with a request_queue
+ * assigned. Since the request_queue sits on top of the gendisk for these
+ * drivers we also call blk_put_queue() for them, and we expect the
+ * request_queue refcount to reach 0 at this point, and so the request_queue
+ * will also be freed prior to the disk.
+ */
 static void disk_release(struct device *dev)
 {
 	struct gendisk *disk = dev_to_disk(dev);
@@ -1748,6 +1779,13 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
 }
 EXPORT_SYMBOL(__alloc_disk_node);
 
+/**
+ * get_disk_and_module - increments the gendisk and gendisk fops module refcount
+ * @disk: the struct gendisk to increment the refcount for
+ *
+ * This increments the refcount for the struct gendisk, and the gendisk's
+ * fops module owner.
+ */
 struct kobject *get_disk_and_module(struct gendisk *disk)
 {
 	struct module *owner;
@@ -1768,6 +1806,13 @@ struct kobject *get_disk_and_module(struct gendisk *disk)
 }
 EXPORT_SYMBOL(get_disk_and_module);
 
+/**
+ * put_disk - decrements the gendisk refcount
+ * @disk: the struct gendisk to decrement the refcount for
+ *
+ * This decrements the refcount for the struct gendisk. When this reaches 0
+ * we'll have disk_release() called.
+ */
 void put_disk(struct gendisk *disk)
 {
 	if (disk)
@@ -1775,7 +1820,10 @@ void put_disk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(put_disk);
 
-/*
+/**
+ * put_disk_and_module - decrements the module and gendisk refcount
+ * @disk: the struct gendisk to decrement the refcount for
+ *
  * This is a counterpart of get_disk_and_module() and thus also of
  * get_gendisk().
  */
-- 
2.26.2



* [PATCH v5 2/7] block: clarify context for gendisk / request_queue refcount increment helpers
  2020-05-16  3:19 [PATCH v5 0/7] block: fix blktrace debugfs use after free Luis Chamberlain
  2020-05-16  3:19 ` [PATCH v5 1/7] block: add docs for gendisk / request_queue refcount helpers Luis Chamberlain
@ 2020-05-16  3:19 ` Luis Chamberlain
  2020-05-16  3:19 ` [PATCH v5 3/7] block: revert back to synchronous request_queue removal Luis Chamberlain
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Luis Chamberlain @ 2020-05-16  3:19 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain, Christoph Hellwig

Let us clarify the context in which the helpers to increment the
refcount for the gendisk and request_queue can be called. We make this
explicit in the places where we may sleep by using might_sleep().

We don't address the decrement context yet, as that needs some extra
work and fixes, but will be addressed in the next patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-core.c | 2 ++
 block/genhd.c    | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index e438c3b0815b..94216fa16a05 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -610,6 +610,8 @@ EXPORT_SYMBOL(blk_alloc_queue);
  * @q: the request_queue structure to increment the refcount for
  *
  * Increment the refcount of the request_queue kobject.
+ *
+ * Context: Any context.
  */
 bool blk_get_queue(struct request_queue *q)
 {
diff --git a/block/genhd.c b/block/genhd.c
index af910e6a0233..598bd32ad28c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1017,11 +1017,15 @@ static ssize_t disk_badblocks_store(struct device *dev,
  *
  * This function gets the structure containing partitioning
  * information for the given device @devt.
+ *
+ * Context: can sleep
  */
 struct gendisk *get_gendisk(dev_t devt, int *partno)
 {
 	struct gendisk *disk = NULL;
 
+	might_sleep();
+
 	if (MAJOR(devt) != BLOCK_EXT_MAJOR) {
 		struct kobject *kobj;
 
@@ -1785,6 +1789,8 @@ EXPORT_SYMBOL(__alloc_disk_node);
  *
  * This increments the refcount for the struct gendisk, and the gendisk's
  * fops module owner.
+ *
+ * Context: Any context.
  */
 struct kobject *get_disk_and_module(struct gendisk *disk)
 {
-- 
2.26.2



* [PATCH v5 3/7] block: revert back to synchronous request_queue removal
  2020-05-16  3:19 [PATCH v5 0/7] block: fix blktrace debugfs use after free Luis Chamberlain
  2020-05-16  3:19 ` [PATCH v5 1/7] block: add docs for gendisk / request_queue refcount helpers Luis Chamberlain
  2020-05-16  3:19 ` [PATCH v5 2/7] block: clarify context for gendisk / request_queue refcount increment helpers Luis Chamberlain
@ 2020-05-16  3:19 ` Luis Chamberlain
  2020-05-16  3:19 ` [PATCH v5 4/7] block: move main block debugfs initialization to its own file Luis Chamberlain
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Luis Chamberlain @ 2020-05-16  3:19 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, Christoph Hellwig

Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
v4.12 moved the work behind blk_release_queue() into a workqueue after a
splat floated around which indicated some work on blk_release_queue()
could sleep in blk_exit_rl(). This splat would be possible when a driver
called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
as its final call) from an atomic context.

blk_put_queue() decrements the refcount for the request_queue kobject,
and upon reaching 0 blk_release_queue() is called. Although blk_exit_rl()
is now removed through commit db6d9952356 ("block: remove request_list code")
on v5.0, we reserve the right to be able to sleep within blk_release_queue()
context.

The last reference to the request_queue must not be dropped from atomic
context. *When* the last reference to the request_queue reaches 0 varies,
so let's take the opportunity to document when that is expected to
happen, and also document the context of the related calls as best as
possible, so we can avoid future issues, in the hope that the synchronous
request_queue removal sticks.

We revert back to synchronous request_queue removal because asynchronous
removal creates a regression in the userspace interaction expected with
several drivers. An example is the loopback driver: one uses ioctls from
userspace to remove a device, and upon a successful return one expects
the device to be gone. Likewise, if one races to add another device, the
new one may not be added as the old one is still being removed. Synchronous
removal was the expected behavior before, and such users now fail as the
device is still present and busy. Moving to asynchronous request_queue
removal could have broken many scripts which relied on the removal having
completed if there was no error. Document this expectation as well so
that this doesn't regress userspace again.
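
To illustrate the guarantee userspace relies on, here is a minimal
single-threaded userspace sketch of a refcounted object whose release runs
synchronously on the final put. All demo_* names are hypothetical and the
plain int counter is an assumption made for brevity; the real code uses a
kobject and its release callback.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for an object with an embedded refcount (not a kobject). */
struct demo_queue {
	int refcount;
	bool *released;	/* set by the release function so callers can observe it */
};

/* Runs exactly once, when the last reference is dropped. */
static void demo_release_queue(struct demo_queue *q)
{
	*q->released = true;
	free(q);
}

/* Allocate an object holding one initial reference. */
static struct demo_queue *demo_alloc_queue(bool *released)
{
	struct demo_queue *q = malloc(sizeof(*q));

	if (!q)
		return NULL;
	q->refcount = 1;
	q->released = released;
	*released = false;
	return q;
}

static void demo_get_queue(struct demo_queue *q)
{
	q->refcount++;
}

/*
 * Synchronous put: the release runs in the caller's context before this
 * function returns. A deferred variant would instead queue the release
 * and return immediately, leaving the object around for a while longer.
 */
static void demo_put_queue(struct demo_queue *q)
{
	if (--q->refcount == 0)
		demo_release_queue(q);
}
```

With this arrangement, a caller that drops the last reference can rely on
the object being gone as soon as demo_put_queue() returns, which mirrors
what scripts expect after a successful removal ioctl. The cost is that the
last put must happen in a context where the release is allowed to sleep.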

Using asynchronous request_queue removal however has helped us find
other bugs. In the future we can test what could break with this
arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.

While at it, update the docs with the context expectations for the
request_queue / gendisk refcount decrement, and make these
expectations explicit by using might_sleep().

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Suggested-by: Nicolai Stange <nstange@suse.de>
Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-core.c       |  8 ++++++++
 block/blk-sysfs.c      | 43 +++++++++++++++++++++---------------------
 block/genhd.c          | 17 +++++++++++++++++
 include/linux/blkdev.h |  2 --
 4 files changed, 47 insertions(+), 23 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 94216fa16a05..8a785d16033b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -327,6 +327,9 @@ EXPORT_SYMBOL_GPL(blk_clear_pm_only);
  *
  * Decrements the refcount of the request_queue kobject. When this reaches 0
  * we'll have blk_release_queue() called.
+ *
+ * Context: Any context, but the last reference must not be dropped from
+ *          atomic context.
  */
 void blk_put_queue(struct request_queue *q)
 {
@@ -359,9 +362,14 @@ EXPORT_SYMBOL_GPL(blk_set_queue_dying);
  *
  * Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
  * put it.  All future requests will be failed immediately with -ENODEV.
+ *
+ * Context: can sleep
  */
 void blk_cleanup_queue(struct request_queue *q)
 {
+	/* cannot be called from atomic context */
+	might_sleep();
+
 	WARN_ON_ONCE(blk_queue_registered(q));
 
 	/* mark @q DYING, no new request or merges will be allowed afterwards */
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 02643e149d5e..561624d4cc4e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -873,22 +873,32 @@ static void blk_exit_queue(struct request_queue *q)
 	bdi_put(q->backing_dev_info);
 }
 
-
 /**
- * __blk_release_queue - release a request queue
- * @work: pointer to the release_work member of the request queue to be released
+ * blk_release_queue - releases all allocated resources of the request_queue
+ * @kobj: pointer to a kobject, whose container is a request_queue
+ *
+ * This function releases all allocated resources of the request queue.
+ *
+ * The struct request_queue refcount is incremented with blk_get_queue() and
+ * decremented with blk_put_queue(). Once the refcount reaches 0 this function
+ * is called.
+ *
+ * For drivers that have a request_queue on a gendisk and added with
+ * __device_add_disk() the refcount to request_queue will reach 0 with
+ * the last put_disk() called by the driver. For drivers which don't use
+ * __device_add_disk() this happens with blk_cleanup_queue().
  *
- * Description:
- *     This function is called when a block device is being unregistered. The
- *     process of releasing a request queue starts with blk_cleanup_queue, which
- *     set the appropriate flags and then calls blk_put_queue, that decrements
- *     the reference counter of the request queue. Once the reference counter
- *     of the request queue reaches zero, blk_release_queue is called to release
- *     all allocated resources of the request queue.
+ * Drivers exist which depend on the release of the request_queue to be
+ * synchronous, it should not be deferred.
+ *
+ * Context: can sleep
  */
-static void __blk_release_queue(struct work_struct *work)
+static void blk_release_queue(struct kobject *kobj)
 {
-	struct request_queue *q = container_of(work, typeof(*q), release_work);
+	struct request_queue *q =
+		container_of(kobj, struct request_queue, kobj);
+
+	might_sleep();
 
 	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
 		blk_stat_remove_callback(q, q->poll_cb);
@@ -917,15 +927,6 @@ static void __blk_release_queue(struct work_struct *work)
 	call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
 
-static void blk_release_queue(struct kobject *kobj)
-{
-	struct request_queue *q =
-		container_of(kobj, struct request_queue, kobj);
-
-	INIT_WORK(&q->release_work, __blk_release_queue);
-	schedule_work(&q->release_work);
-}
-
 static const struct sysfs_ops queue_sysfs_ops = {
 	.show	= queue_attr_show,
 	.store	= queue_attr_store,
diff --git a/block/genhd.c b/block/genhd.c
index 598bd32ad28c..ea6abfadb7f5 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -921,12 +921,19 @@ static void invalidate_partition(struct gendisk *disk, int partno)
  * The final removal of the struct gendisk happens when its refcount reaches 0
  * with put_disk(), which should be called after del_gendisk(), if
  * __device_add_disk() was used.
+ *
+ * Drivers exist which depend on the release of the gendisk to be synchronous,
+ * it should not be deferred.
+ *
+ * Context: can sleep
  */
 void del_gendisk(struct gendisk *disk)
 {
 	struct disk_part_iter piter;
 	struct hd_struct *part;
 
+	might_sleep();
+
 	blk_integrity_del(disk);
 	disk_del_events(disk);
 
@@ -1573,11 +1580,15 @@ int disk_expand_part_tbl(struct gendisk *disk, int partno)
  * drivers we also call blk_put_queue() for them, and we expect the
  * request_queue refcount to reach 0 at this point, and so the request_queue
  * will also be freed prior to the disk.
+ *
+ * Context: can sleep
  */
 static void disk_release(struct device *dev)
 {
 	struct gendisk *disk = dev_to_disk(dev);
 
+	might_sleep();
+
 	blk_free_devt(dev->devt);
 	disk_release_events(disk);
 	kfree(disk->random);
@@ -1818,6 +1829,9 @@ EXPORT_SYMBOL(get_disk_and_module);
  *
  * This decrements the refcount for the struct gendisk. When this reaches 0
  * we'll have disk_release() called.
+ *
+ * Context: Any context, but the last reference must not be dropped from
+ *          atomic context.
  */
 void put_disk(struct gendisk *disk)
 {
@@ -1832,6 +1846,9 @@ EXPORT_SYMBOL(put_disk);
  *
  * This is a counterpart of get_disk_and_module() and thus also of
  * get_gendisk().
+ *
+ * Context: Any context, but the last reference must not be dropped from
+ *          atomic context.
  */
 void put_disk_and_module(struct gendisk *disk)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2b33166b9daf..8801f3d7cf4a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -584,8 +584,6 @@ struct request_queue {
 
 	size_t			cmd_size;
 
-	struct work_struct	release_work;
-
 #define BLK_MAX_WRITE_HINTS	5
 	u64			write_hints[BLK_MAX_WRITE_HINTS];
 };
-- 
2.26.2



* [PATCH v5 4/7] block: move main block debugfs initialization to its own file
  2020-05-16  3:19 [PATCH v5 0/7] block: fix blktrace debugfs use after free Luis Chamberlain
                   ` (2 preceding siblings ...)
  2020-05-16  3:19 ` [PATCH v5 3/7] block: revert back to synchronous request_queue removal Luis Chamberlain
@ 2020-05-16  3:19 ` Luis Chamberlain
  2020-05-19 15:33   ` Christoph Hellwig
  2020-05-16  3:19 ` [PATCH v5 5/7] blktrace: fix debugfs use after free Luis Chamberlain
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Luis Chamberlain @ 2020-05-16  3:19 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

make_request-based drivers and request-based drivers share some
debugfs code. Moving this into its own file makes it easier
to expand and audit this shared code.

This patch contains no functional changes.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/Makefile      | 10 +++++++---
 block/blk-core.c    |  9 +--------
 block/blk-debugfs.c | 15 +++++++++++++++
 block/blk.h         |  8 ++++++++
 4 files changed, 31 insertions(+), 11 deletions(-)
 create mode 100644 block/blk-debugfs.c

diff --git a/block/Makefile b/block/Makefile
index 78719169fb2a..ec4b17f9dd93 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -8,7 +8,8 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-sysfs.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
 			blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
 			blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
-			genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o
+			genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \
+			debugfs.o
 
 obj-$(CONFIG_BOUNCE)		+= bounce.o
 obj-$(CONFIG_BLK_SCSI_REQUEST)	+= scsi_ioctl.o
@@ -32,8 +33,11 @@ obj-$(CONFIG_BLK_MQ_VIRTIO)	+= blk-mq-virtio.o
 obj-$(CONFIG_BLK_MQ_RDMA)	+= blk-mq-rdma.o
 obj-$(CONFIG_BLK_DEV_ZONED)	+= blk-zoned.o
 obj-$(CONFIG_BLK_WBT)		+= blk-wbt.o
-obj-$(CONFIG_BLK_DEBUG_FS)	+= blk-mq-debugfs.o
-obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
+
+debugfs-$(CONFIG_DEBUG_FS)		+= blk-debugfs.o
+debugfs-$(CONFIG_BLK_DEBUG_FS)		+= blk-mq-debugfs.o
+debugfs-$(CONFIG_BLK_DEBUG_FS_ZONED)	+= blk-mq-debugfs-zoned.o
+
 obj-$(CONFIG_BLK_SED_OPAL)	+= sed-opal.o
 obj-$(CONFIG_BLK_PM)		+= blk-pm.o
 obj-$(CONFIG_BLK_INLINE_ENCRYPTION)	+= keyslot-manager.o blk-crypto.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 8a785d16033b..d40648958767 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -51,10 +51,6 @@
 #include "blk-pm.h"
 #include "blk-rq-qos.h"
 
-#ifdef CONFIG_DEBUG_FS
-struct dentry *blk_debugfs_root;
-#endif
-
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
@@ -1880,10 +1876,7 @@ int __init blk_dev_init(void)
 
 	blk_requestq_cachep = kmem_cache_create("request_queue",
 			sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
-
-#ifdef CONFIG_DEBUG_FS
-	blk_debugfs_root = debugfs_create_dir("block", NULL);
-#endif
+	blk_debugfs_register();
 
 	return 0;
 }
diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
new file mode 100644
index 000000000000..19091e1effc0
--- /dev/null
+++ b/block/blk-debugfs.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Shared request-based / make_request-based functionality
+ */
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/debugfs.h>
+
+struct dentry *blk_debugfs_root;
+
+void blk_debugfs_register(void)
+{
+	blk_debugfs_root = debugfs_create_dir("block", NULL);
+}
diff --git a/block/blk.h b/block/blk.h
index fc00537026a0..ee309233f95e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -458,4 +458,12 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		unsigned int max_sectors, bool *same_page);
 
+#ifdef CONFIG_DEBUG_FS
+void blk_debugfs_register(void);
+#else
+static inline void blk_debugfs_register(void)
+{
+}
+#endif /* CONFIG_DEBUG_FS */
+
 #endif /* BLK_INTERNAL_H */
-- 
2.26.2



* [PATCH v5 5/7] blktrace: fix debugfs use after free
  2020-05-16  3:19 [PATCH v5 0/7] block: fix blktrace debugfs use after free Luis Chamberlain
                   ` (3 preceding siblings ...)
  2020-05-16  3:19 ` [PATCH v5 4/7] block: move main block debugfs initialization to its own file Luis Chamberlain
@ 2020-05-16  3:19 ` Luis Chamberlain
  2020-05-19 14:44   ` Greg KH
  2020-05-19 16:37   ` Christoph Hellwig
  2020-05-16  3:19 ` [PATCH v5 6/7] blktrace: break out of blktrace setup on concurrent calls Luis Chamberlain
  2020-05-16  3:19 ` [PATCH v5 7/7] loop: be paranoid on exit and prevent new additions / removals Luis Chamberlain
  6 siblings, 2 replies; 24+ messages in thread
From: Luis Chamberlain @ 2020-05-16  3:19 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da

In commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory"),
merged in v4.12, Omar fixed the original blktrace code for request-based
drivers (multiqueue). This however left in place a possible crash if you
happen to abuse blktrace while racing to remove / add a device.

We used to use asynchronous removal of the request_queue, and with that
the issue was easier to reproduce. Now that we have reverted to
synchronous removal of the request_queue, the issue is still possible to
reproduce; it is however just a bit more difficult.

To reproduce, we run two instances of break-blktrace in parallel. Each
adds/removes a loop device and sets up a blktrace which it never tears
down. This is easily reproduced with the break-blktrace run_0004.sh
script.

We can end up with two types of panics each reflecting where we
race, one a failed blktrace setup:

[  252.426751] debugfs: Directory 'loop0' with parent 'block' already present!
[  252.432265] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[  252.436592] #PF: supervisor write access in kernel mode
[  252.439822] #PF: error_code(0x0002) - not-present page
[  252.442967] PGD 0 P4D 0
[  252.444656] Oops: 0002 [#1] SMP NOPTI
[  252.446972] CPU: 10 PID: 1153 Comm: break-blktrace Tainted: G            E     5.7.0-rc2-next-20200420+ #164
[  252.452673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[  252.456343] RIP: 0010:down_write+0x15/0x40
[  252.458146] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
               cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
               00 00 <f0> 48 0f b1 55 00 75 0f 48 8b 04 25 c0 8b 01 00 48 89
               45 08 5d
[  252.463638] RSP: 0018:ffffa626415abcc8 EFLAGS: 00010246
[  252.464950] RAX: 0000000000000000 RBX: ffff958c25f0f5c0 RCX: ffffff8100000000
[  252.466727] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[  252.468482] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000001
[  252.470014] R10: 0000000000000000 R11: ffff958d1f9227ff R12: 0000000000000000
[  252.471473] R13: ffff958c25ea5380 R14: ffffffff8cce15f1 R15: 00000000000000a0
[  252.473346] FS:  00007f2e69dee540(0000) GS:ffff958c2fc80000(0000) knlGS:0000000000000000
[  252.475225] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  252.476267] CR2: 00000000000000a0 CR3: 0000000427d10004 CR4: 0000000000360ee0
[  252.477526] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  252.478776] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  252.479866] Call Trace:
[  252.480322]  simple_recursive_removal+0x4e/0x2e0
[  252.481078]  ? debugfs_remove+0x60/0x60
[  252.481725]  ? relay_destroy_buf+0x77/0xb0
[  252.482662]  debugfs_remove+0x40/0x60
[  252.483518]  blk_remove_buf_file_callback+0x5/0x10
[  252.484328]  relay_close_buf+0x2e/0x60
[  252.484930]  relay_open+0x1ce/0x2c0
[  252.485520]  do_blk_trace_setup+0x14f/0x2b0
[  252.486187]  __blk_trace_setup+0x54/0xb0
[  252.486803]  blk_trace_ioctl+0x90/0x140
[  252.487423]  ? do_sys_openat2+0x1ab/0x2d0
[  252.488053]  blkdev_ioctl+0x4d/0x260
[  252.488636]  block_ioctl+0x39/0x40
[  252.489139]  ksys_ioctl+0x87/0xc0
[  252.489675]  __x64_sys_ioctl+0x16/0x20
[  252.490380]  do_syscall_64+0x52/0x180
[  252.491032]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

And the other on the device removal:

[  128.528940] debugfs: Directory 'loop0' with parent 'block' already present!
[  128.615325] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[  128.619537] #PF: supervisor write access in kernel mode
[  128.622700] #PF: error_code(0x0002) - not-present page
[  128.625842] PGD 0 P4D 0
[  128.627585] Oops: 0002 [#1] SMP NOPTI
[  128.629871] CPU: 12 PID: 544 Comm: break-blktrace Tainted: G            E     5.7.0-rc2-next-20200420+ #164
[  128.635595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[  128.640471] RIP: 0010:down_write+0x15/0x40
[  128.643041] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
               cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
               00 00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89
               45 08 5d
[  128.650180] RSP: 0018:ffffa9c3c05ebd78 EFLAGS: 00010246
[  128.651820] RAX: 0000000000000000 RBX: ffff8ae9a6370240 RCX: ffffff8100000000
[  128.653942] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[  128.655720] RBP: 00000000000000a0 R08: 0000000000000002 R09: ffff8ae9afd2d3d0
[  128.657400] R10: 0000000000000056 R11: 0000000000000000 R12: 0000000000000000
[  128.659099] R13: 0000000000000000 R14: 0000000000000003 R15: 00000000000000a0
[  128.660500] FS:  00007febfd995540(0000) GS:ffff8ae9afd00000(0000) knlGS:0000000000000000
[  128.662204] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  128.663426] CR2: 00000000000000a0 CR3: 0000000420042003 CR4: 0000000000360ee0
[  128.664776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  128.666022] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  128.667282] Call Trace:
[  128.667801]  simple_recursive_removal+0x4e/0x2e0
[  128.668663]  ? debugfs_remove+0x60/0x60
[  128.669368]  debugfs_remove+0x40/0x60
[  128.669985]  blk_trace_free+0xd/0x50
[  128.670593]  __blk_trace_remove+0x27/0x40
[  128.671274]  blk_trace_shutdown+0x30/0x40
[  128.671935]  blk_release_queue+0x95/0xf0
[  128.672589]  kobject_put+0xa5/0x1b0
[  128.673188]  disk_release+0xa2/0xc0
[  128.673786]  device_release+0x28/0x80
[  128.674376]  kobject_put+0xa5/0x1b0
[  128.674915]  loop_remove+0x39/0x50 [loop]
[  128.675511]  loop_control_ioctl+0x113/0x130 [loop]
[  128.676199]  ksys_ioctl+0x87/0xc0
[  128.676708]  __x64_sys_ioctl+0x16/0x20
[  128.677274]  do_syscall_64+0x52/0x180
[  128.677823]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The common theme here is:

debugfs: Directory 'loop0' with parent 'block' already present

This crash happens because of how blktrace uses the debugfs directory
where it places its files. Upon init we always create the same directory
which would be needed by blktrace, but we only do this for make_request
(multiqueue) block drivers, never for request-based block drivers.
Furthermore, that directory is only created on init for the entire disk.
This means that if you use blktrace on a partition, we'll always be
creating a new directory, regardless of whether you are running blktrace
on a make_request (multiqueue) driver or a request-based block driver.

These directory creations are only associated with a path, and so
when a debugfs_remove() is called it removes everything in its way.
A device removal will remove all blktrace files, and so if a blktrace
is still present a cleanup of blktrace files later will end up trying
to remove dentries pointing to NULL.

We can fix the UAF by using a debugfs directory which, moving forward,
will always be accessible if debugfs is enabled, for both make_request
drivers (multiqueue) and request-based block drivers, *and* for all
partitions upon creation. This ensures that removal of the directories
only happens on device removal, and removes the race where files are
removed underneath an active blktrace.

For partitions we simply symlink to the whole disk's debugfs_dir, as the
debugfs_dir is shared anyway and this limits us to only run one blktrace
for the entire disk.

We special-case a solution for scsi-generic, which got blktrace support
added by Christof via commit 6da127ad0918 ("blktrace: Add blktrace
ioctls to SCSI generic devices"), upstream since v2.6.25. scsi-generic
drives use a character device, however behind the scenes we have a scsi
device with a request_queue. How this is used varies by class of driver
(TYPE_DISK, TYPE_TAPE, etc). Care has to be taken because scsi drivers
will probe asynchronously, but the scsi-generic class_interface
sg_add_device() will complete before that probe does. This means
sd_probe() will use device_add_disk() for TYPE_DISK and have its
debugfs_dir created *after* the scsi-generic device is created.

For scsi-generic then we symlink to the real debugfs_dir only during a
blktrace ioctl, but we do this only once. We also have to special-case
yet another solution for drivers which use the bsg queue.

This was tested with:

  o nvme partitions
  o iSCSI with tgt, and blktracing against scsi-generic with:
    o block
    o tape
    o cdrom
    o media changer

Here is what the debugfs directory for block looks like after running
blktrace on a system with sg0, which is a raid controller, and sg1,
which is the media changer:

 # ls -l /sys/kernel/debug/block
total 0
drwxr-xr-x  3 root root 0 May  9 02:31 bsg
drwxr-xr-x 19 root root 0 May  9 02:31 nvme0n1
drwxr-xr-x 19 root root 0 May  9 02:31 nvme1n1
lrwxrwxrwx  1 root root 0 May  9 02:31 nvme1n1p1 -> nvme1n1
lrwxrwxrwx  1 root root 0 May  9 02:31 nvme1n1p2 -> nvme1n1
lrwxrwxrwx  1 root root 0 May  9 02:31 nvme1n1p3 -> nvme1n1
lrwxrwxrwx  1 root root 0 May  9 02:31 nvme1n1p5 -> nvme1n1
lrwxrwxrwx  1 root root 0 May  9 02:31 nvme1n1p6 -> nvme1n1
drwxr-xr-x  2 root root 0 May  9 02:33 sch0
lrwxrwxrwx  1 root root 0 May  9 02:33 sg0 -> bsg/2:0:0:0
lrwxrwxrwx  1 root root 0 May  9 02:33 sg1 -> sch0
drwxr-xr-x  5 root root 0 May  9 02:31 vda
lrwxrwxrwx  1 root root 0 May  9 02:31 vda1 -> vda

Code for handling the debugfs_dir did get more complicated for
scsi-generic, but this is technical debt. For the other types of devices,
this simplifies the code considerably, with the only penalty now being
that we're always creating the request queue debugfs directory for the
request-based block device drivers.

The symlink use also makes it clearer when the request_queue is shared.

This patch is part of the work which disputes the severity of
CVE-2019-19770, and which shows this issue is not a core debugfs issue,
but a misuse of debugfs within blktrace.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com
Fixes: 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-debugfs.c          | 182 +++++++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.c       |   5 -
 block/blk-sysfs.c            |   3 +
 block/blk.h                  |  16 +++
 block/bsg.c                  |   2 +
 block/partitions/core.c      |   9 ++
 drivers/scsi/ch.c            |   1 +
 drivers/scsi/sg.c            |  75 +++++++++++++++
 drivers/scsi/st.c            |   2 +
 include/linux/blkdev.h       |   4 +-
 include/linux/blktrace_api.h |   1 -
 include/linux/genhd.h        |  69 +++++++++++++
 kernel/trace/blktrace.c      |  24 +++--
 13 files changed, 380 insertions(+), 13 deletions(-)

diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
index 19091e1effc0..8121f297eaba 100644
--- a/block/blk-debugfs.c
+++ b/block/blk-debugfs.c
@@ -8,8 +8,190 @@
 #include <linux/debugfs.h>
 
 struct dentry *blk_debugfs_root;
+struct dentry *blk_debugfs_bsg = NULL;
+
+/**
+ * enum blk_debugfs_dir_type - block device debugfs directory type
+ * @BLK_DBG_DIR_BASE: the block device debugfs_dir exists on the base
+ * 	system <system-debugfs-dir>/block/ debugfs directory.
+ * @BLK_DBG_DIR_BSG: the block device debugfs_dir is under the directory
+ * 	<system-debugfs-dir>/block/bsg/
+ */
+enum blk_debugfs_dir_type {
+	BLK_DBG_DIR_BASE = 1,
+	BLK_DBG_DIR_BSG,
+};
 
 void blk_debugfs_register(void)
 {
 	blk_debugfs_root = debugfs_create_dir("block", NULL);
 }
+
+static struct dentry *queue_get_base_dir(enum blk_debugfs_dir_type type)
+{
+	switch (type) {
+	case BLK_DBG_DIR_BASE:
+		return blk_debugfs_root;
+	case BLK_DBG_DIR_BSG:
+		return blk_debugfs_bsg;
+	}
+	return NULL;
+}
+
+static void queue_debugfs_register_type(struct request_queue *q,
+					const char *name,
+					enum blk_debugfs_dir_type type)
+{
+	struct dentry *base_dir = queue_get_base_dir(type);
+
+	q->debugfs_dir = debugfs_create_dir(name, base_dir);
+}
+
+/**
+ * blk_queue_debugfs_register - register the debugfs_dir for the block device
+ * @q: the associated request_queue of the block device
+ * @name: the name of the block device exposed
+ *
+ * This is used to create the debugfs_dir used by the block layer and blktrace.
+ * Drivers which use any of the *add_disk*() calls or variants have this called
+ * automatically for them. This directory is removed automatically on
+ * blk_release_queue() once the request_queue reference count reaches 0.
+ */
+void blk_queue_debugfs_register(struct request_queue *q, const char *name)
+{
+	queue_debugfs_register_type(q, name, BLK_DBG_DIR_BASE);
+}
+EXPORT_SYMBOL_GPL(blk_queue_debugfs_register);
+
+/**
+ * blk_queue_debugfs_unregister - remove the debugfs_dir for the block device
+ * @q: the associated request_queue of the block device
+ *
+ * Removes the debugfs_dir for the request_queue on the associated block device.
+ * This is handled for you on blk_release_queue(), and that should only be
+ * called once.
+ *
+ * Since we don't care where the debugfs_dir was created this is used for all
+ * types of enum blk_debugfs_dir_type.
+ */
+void blk_queue_debugfs_unregister(struct request_queue *q)
+{
+	debugfs_remove_recursive(q->debugfs_dir);
+}
+
+static struct dentry *queue_debugfs_symlink_type(struct request_queue *q,
+						 const char *src,
+						 const char *dst,
+						 enum blk_debugfs_dir_type type)
+{
+	struct dentry *dentry = ERR_PTR(-EINVAL);
+	char *dir_dst = NULL;
+
+	switch (type) {
+	case BLK_DBG_DIR_BASE:
+		if (dst)
+			dir_dst = kasprintf(GFP_KERNEL, "%s", dst);
+		else if (!IS_ERR_OR_NULL(q->debugfs_dir))
+			dir_dst = kasprintf(GFP_KERNEL, "%s",
+					    q->debugfs_dir->d_name.name);
+		else
+			goto out;
+		break;
+	case BLK_DBG_DIR_BSG:
+		if (dst)
+			dir_dst = kasprintf(GFP_KERNEL, "bsg/%s", dst);
+		else
+			goto out;
+		break;
+	}
+
+	/*
+	 * The base block debugfs directory is always used for the symlinks,
+	 * their target is what changes.
+	 */
+	dentry = debugfs_create_symlink(src, blk_debugfs_root, dir_dst);
+	kfree(dir_dst);
+out:
+	return dentry;
+}
+
+/**
+ * blk_queue_debugfs_symlink - symlink to the real block device debugfs_dir
+ * @q: the request queue where we know the debugfs_dir exists or will exist
+ *     eventually. Cannot be NULL.
+ * @src: name of the exposed device we wish to associate to the block device
+ * @dst: the name of the directory to which we want to symlink to, may be NULL
+ *	 if you do not know what this may be, but only if your base block device
+ *	 is not bsg. If you set this to NULL, we will have no other option but
+ *	 to look at the request_queue to infer the name, but you must ensure
+ *	 it is already set; be mindful of asynchronous probes.
+ *
+ * Some devices don't have a request_queue of their own, however, they have an
+ * association to one and have historically supported using the same
+ * debugfs_dir which has been used to represent the whole disk for blktrace
+ * functionality. Such is the case for partitions and for scsi-generic devices.
+ * They share the same request_queue and debugfs_dir as with the whole disk for
+ * blktrace purposes.  This helper allows such association to be made explicit
+ * and enable blktrace functionality for them. scsi-generic devices representing
+ * scsi device such as block, cdrom, tape, media changer register their own
+ * debug_dir already and share the same request_queue as with scsi-generic, as
+ * such the respective scsi-generic debugfs_dir is just a symlink to these
+ * driver's debugfs_dir.
+ *
+ * To remove use debugfs_remove() on the symlink dentry returned by this
+ * function. The block layer will not clean this up for you, you must remove
+ * it yourself in case of device removal.
+ */
+struct dentry *blk_queue_debugfs_symlink(struct request_queue *q,
+					 const char *src,
+					 const char *dst)
+{
+	return queue_debugfs_symlink_type(q, src, dst, BLK_DBG_DIR_BASE);
+}
+EXPORT_SYMBOL_GPL(blk_queue_debugfs_symlink);
+
+#ifdef CONFIG_BLK_DEV_BSG
+
+void blk_debugfs_register_bsg(void)
+{
+	blk_debugfs_bsg = debugfs_create_dir("bsg", blk_debugfs_root);
+}
+
+/**
+ * blk_queue_debugfs_register_bsg - create the debugfs_dir for bsg block devices
+ * @q: the associated request_queue of the block device
+ * @name: the name of the block device exposed
+ *
+ * This is used to create the debugfs_dir used by the Block layer SCSI generic
+ * (bsg) driver. This is to be used only by the scsi-generic driver on behalf
+ * of scsi devices which work as scsi controllers or transports.
+ *
+ * This directory is cleaned up for all drivers automatically on
+ * blk_release_queue() once the request_queue reference count reaches 0.
+ */
+void blk_queue_debugfs_register_bsg(struct request_queue *q, const char *name)
+{
+	queue_debugfs_register_type(q, name, BLK_DBG_DIR_BSG);
+}
+EXPORT_SYMBOL_GPL(blk_queue_debugfs_register_bsg);
+
+/**
+ * blk_queue_debugfs_bsg_symlink - symlink to the bsg debugfs_dir
+ * @q: the request queue where we know the debugfs_dir exists or will exist
+ *     eventually. Cannot be NULL.
+ * @src: name of the scsi-generic device we wish to associate to the bsg
+ * 	request_queue.
+ * @dst: the name of the bsg request_queue debugfs_dir to which we want to
+ *	 symlink to. This cannot be NULL.
+ *
+ * This is used by scsi-generic devices representing raid controllers /
+ * transport drivers.
+ */
+struct dentry *blk_queue_debugfs_bsg_symlink(struct request_queue *q,
+					     const char *src,
+					     const char *dst)
+{
+	return queue_debugfs_symlink_type(q, src, dst, BLK_DBG_DIR_BSG);
+}
+EXPORT_SYMBOL_GPL(blk_queue_debugfs_bsg_symlink);
+#endif /* CONFIG_BLK_DEV_BSG */
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 96b7a35c898a..08edc3a54114 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -822,9 +822,6 @@ void blk_mq_debugfs_register(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
-	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
-					    blk_debugfs_root);
-
 	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
 
 	/*
@@ -855,9 +852,7 @@ void blk_mq_debugfs_register(struct request_queue *q)
 
 void blk_mq_debugfs_unregister(struct request_queue *q)
 {
-	debugfs_remove_recursive(q->debugfs_dir);
 	q->sched_debugfs_dir = NULL;
-	q->debugfs_dir = NULL;
 }
 
 static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 561624d4cc4e..4e0c00a88c99 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -918,6 +918,7 @@ static void blk_release_queue(struct kobject *kobj)
 
 	blk_trace_shutdown(q);
 
+	blk_queue_debugfs_unregister(q);
 	if (queue_is_mq(q))
 		blk_mq_debugfs_unregister(q);
 
@@ -989,6 +990,8 @@ int blk_register_queue(struct gendisk *disk)
 		goto unlock;
 	}
 
+	blk_queue_debugfs_register(q, kobject_name(q->kobj.parent));
+
 	if (queue_is_mq(q)) {
 		__blk_mq_register_dev(dev, q);
 		blk_mq_debugfs_register(q);
diff --git a/block/blk.h b/block/blk.h
index ee309233f95e..300b8526066b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -460,10 +460,26 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 
 #ifdef CONFIG_DEBUG_FS
 void blk_debugfs_register(void);
+void blk_queue_debugfs_unregister(struct request_queue *q);
+void blk_part_debugfs_register(struct hd_struct *p, const char *name);
+void blk_part_debugfs_unregister(struct hd_struct *p);
 #else
 static inline void blk_debugfs_register(void)
 {
 }
+
+static inline void blk_queue_debugfs_unregister(struct request_queue *q)
+{
+}
+
+static inline void blk_part_debugfs_register(struct hd_struct *p,
+					     const char *name)
+{
+}
+
+static inline void blk_part_debugfs_unregister(struct hd_struct *p)
+{
+}
 #endif /* CONFIG_DEBUG_FS */
 
 #endif /* BLK_INTERNAL_H */
diff --git a/block/bsg.c b/block/bsg.c
index d7bae94b64d9..bfb1036858c4 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -503,6 +503,8 @@ static int __init bsg_init(void)
 	if (ret)
 		goto unregister_chrdev;
 
+	blk_debugfs_register_bsg();
+
 	printk(KERN_INFO BSG_DESCRIPTION " version " BSG_VERSION
 	       " loaded (major %d)\n", bsg_major);
 	return 0;
diff --git a/block/partitions/core.c b/block/partitions/core.c
index 297004fd2264..4d2a130e6055 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -10,6 +10,7 @@
 #include <linux/vmalloc.h>
 #include <linux/blktrace_api.h>
 #include <linux/raid/detect.h>
+#include <linux/debugfs.h>
 #include "check.h"
 
 static int (*check_part[])(struct parsed_partitions *) = {
@@ -320,6 +321,9 @@ void delete_partition(struct gendisk *disk, struct hd_struct *part)
 	 *  we have to hold the disk device
 	 */
 	get_device(disk_to_dev(part_to_disk(part)));
+#ifdef CONFIG_DEBUG_FS
+	debugfs_remove(part->debugfs_sym);
+#endif
 	rcu_assign_pointer(ptbl->part[part->partno], NULL);
 	kobject_put(part->holder_dir);
 	device_del(part_to_dev(part));
@@ -460,6 +464,11 @@ static struct hd_struct *add_partition(struct gendisk *disk, int partno,
 	/* everything is up and running, commence */
 	rcu_assign_pointer(ptbl->part[partno], p);
 
+#ifdef CONFIG_DEBUG_FS
+	p->debugfs_sym = blk_queue_debugfs_symlink(disk->queue, dev_name(pdev),
+						   disk->disk_name);
+#endif
+
 	/* suppress uevent if the disk suppresses it */
 	if (!dev_get_uevent_suppress(ddev))
 		kobject_uevent(&pdev->kobj, KOBJ_ADD);
diff --git a/drivers/scsi/ch.c b/drivers/scsi/ch.c
index cb74ab1ae5a4..5dfabc04bfef 100644
--- a/drivers/scsi/ch.c
+++ b/drivers/scsi/ch.c
@@ -971,6 +971,7 @@ static int ch_probe(struct device *dev)
 
 	mutex_unlock(&ch->lock);
 	dev_set_drvdata(dev, ch);
+	blk_queue_debugfs_register(sd->request_queue, dev_name(class_dev));
 	sdev_printk(KERN_INFO, sd, "Attached scsi changer %s\n", ch->name);
 
 	return 0;
diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 20472aaaf630..6fa201086e59 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -47,6 +47,7 @@ static int sg_version_num = 30536;	/* 2 digits for each component */
 #include <linux/ratelimit.h>
 #include <linux/uio.h>
 #include <linux/cred.h> /* for sg_check_file_access() */
+#include <linux/debugfs.h>
 
 #include "scsi.h"
 #include <scsi/scsi_dbg.h>
@@ -169,6 +170,10 @@ typedef struct sg_device { /* holds the state of each scsi generic device */
 	struct gendisk *disk;
 	struct cdev * cdev;	/* char_dev [sysfs: /sys/cdev/major/sg<n>] */
 	struct kref d_ref;
+#ifdef CONFIG_DEBUG_FS
+	bool debugfs_set;
+	struct dentry *debugfs_sym;
+#endif
 } Sg_device;
 
 /* tasklet or soft irq callback */
@@ -914,6 +919,72 @@ static int put_compat_request_table(struct compat_sg_req_info __user *o,
 }
 #endif
 
+#ifdef CONFIG_DEBUG_FS
+/*
+ * For scsi-generic devices like TYPE_DISK will re-use the scsi_device
+ * request_queue on their driver for their disk and later device_add_disk() it,
+ * we want its respective scsi-generic debugfs_dir to just be a symlink to the
+ * one created on the real scsi device probe.
+ *
+ * We use this on the ioctl path instead of sg_add_device() since some driver
+ * probes can run asynchronously. Such is the case for scsi devices of
+ * TYPE_DISK, and the class interface currently has no callbacks once a device
+ * driver probe has completed its probe. We don't use wait_for_device_probe()
+ * on sg_add_device() as that would defeat the purpose of using asynchronous
+ * probe.
+ */
+static void sg_init_blktrace_setup(Sg_device *sdp)
+{
+	struct scsi_device *scsidp = sdp->device;
+	struct device *scsi_dev = &scsidp->sdev_gendev;
+	struct gendisk *sg_disk = sdp->disk;
+	struct request_queue *q = scsidp->request_queue;
+
+	/*
+	 * Although debugfs is used for debugging purposes and we
+	 * typically don't care about the return value, we do here
+	 * because we use it for userspace to ensure blktrace works.
+	 *
+	 * Instead of always just checking for the return value though,
+	 * just try setting this once, if the first time failed we don't
+	 * try again.
+	 */
+	if (sdp->debugfs_set)
+		return;
+
+	switch (sdp->device->type) {
+	case TYPE_RAID:
+		/*
+		 * We do the registration for bsg here to keep bsg scsi_device
+		 * opaque. If bsg is disabled we just create the debugfs_dir on
+		 * the base block debugfs_dir and scsi-generic symlinks to it.
+		 */
+		blk_queue_debugfs_register_bsg(q, dev_name(scsi_dev));
+		sdp->debugfs_sym =
+			blk_queue_debugfs_bsg_symlink(q,
+						      sg_disk->disk_name,
+						      dev_name(scsi_dev));
+		break;
+	default:
+		/*
+		 * We don't know scsi_device probed device name (this is
+		 * different from the scsi_device name). This is opaque to
+		 * scsi-generic, so we use the request_queue to infer the name
+		 * based on the set debugfs_dir.
+		 */
+		sdp->debugfs_sym = blk_queue_debugfs_symlink(q,
+							     sg_disk->disk_name,
+							     NULL);
+		break;
+	}
+	sdp->debugfs_set = true;
+}
+#else
+static void sg_init_blktrace_setup(Sg_device *sdp)
+{
+}
+#endif
+
 static long
 sg_ioctl_common(struct file *filp, Sg_device *sdp, Sg_fd *sfp,
 		unsigned int cmd_in, void __user *p)
@@ -1117,6 +1188,7 @@ sg_ioctl_common(struct file *filp, Sg_device *sdp, Sg_fd *sfp,
 		return put_user(max_sectors_bytes(sdp->device->request_queue),
 				ip);
 	case BLKTRACESETUP:
+		sg_init_blktrace_setup(sdp);
 		return blk_trace_setup(sdp->device->request_queue,
 				       sdp->disk->disk_name,
 				       MKDEV(SCSI_GENERIC_MAJOR, sdp->index),
@@ -1644,6 +1716,9 @@ sg_remove_device(struct device *cl_dev, struct class_interface *cl_intf)
 
 	sysfs_remove_link(&scsidp->sdev_gendev.kobj, "generic");
 	device_destroy(sg_sysfs_class, MKDEV(SCSI_GENERIC_MAJOR, sdp->index));
+#ifdef CONFIG_DEBUG_FS
+	debugfs_remove(sdp->debugfs_sym);
+#endif
 	cdev_del(sdp->cdev);
 	sdp->cdev = NULL;
 
diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index 4bf4ab3b70f4..fb3c0546803a 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -4417,6 +4417,8 @@ static int st_probe(struct device *dev)
 	if (error)
 		goto out_remove_devs;
 	scsi_autopm_put_device(SDp);
+	blk_queue_debugfs_register(tpnt->device->request_queue,
+				   tape_name(tpnt));
 
 	sdev_printk(KERN_NOTICE, SDp,
 		    "Attached scsi tape %s\n", tape_name(tpnt));
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8801f3d7cf4a..0e6dff9c4233 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -574,8 +574,10 @@ struct request_queue {
 	struct list_head	tag_set_list;
 	struct bio_set		bio_split;
 
-#ifdef CONFIG_BLK_DEBUG_FS
+#ifdef CONFIG_DEBUG_FS
 	struct dentry		*debugfs_dir;
+#endif
+#ifdef CONFIG_BLK_DEBUG_FS
 	struct dentry		*sched_debugfs_dir;
 	struct dentry		*rqos_debugfs_dir;
 #endif
diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index 3b6ff5902edc..eb6db276e293 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -22,7 +22,6 @@ struct blk_trace {
 	u64 end_lba;
 	u32 pid;
 	u32 dev;
-	struct dentry *dir;
 	struct dentry *dropped_file;
 	struct dentry *msg_file;
 	struct list_head running_list;
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index a9384449465a..60ce3d8e4acd 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -89,6 +89,9 @@ struct hd_struct {
 	int make_it_fail;
 #endif
 	struct rcu_work rcu_work;
+#ifdef CONFIG_DEBUG_FS
+	struct dentry *debugfs_sym;
+#endif
 };
 
 /**
@@ -390,6 +393,72 @@ extern void blk_unregister_region(dev_t devt, unsigned long range);
 
 #define alloc_disk(minors) alloc_disk_node(minors, NUMA_NO_NODE)
 
+#ifdef CONFIG_DEBUG_FS
+void blk_queue_debugfs_register(struct request_queue *q, const char *name);
+struct dentry *blk_queue_debugfs_symlink(struct request_queue *q,
+					 const char *src,
+					 const char *dst);
+#ifdef CONFIG_BLK_DEV_BSG
+void blk_debugfs_register_bsg(void);
+void blk_queue_debugfs_register_bsg(struct request_queue *q, const char *name);
+struct dentry *blk_queue_debugfs_bsg_symlink(struct request_queue *q,
+					     const char *src,
+					     const char *dst);
+#else
+
+static inline void blk_debugfs_register_bsg(void)
+{
+}
+
+/* If bsg is not enabled we use the base directory */
+static inline void blk_queue_debugfs_register_bsg(struct request_queue *q,
+						  const char *name)
+{
+	blk_queue_debugfs_register(q, name);
+}
+
+static inline
+struct dentry *blk_queue_debugfs_bsg_symlink(struct request_queue *q,
+					     const char *src,
+					     const char *dst)
+{
+	return blk_queue_debugfs_symlink(q, src, dst);
+}
+
+#endif /* CONFIG_BLK_DEV_BSG */
+#else  /* ! CONFIG_DEBUG_FS */
+static inline void blk_queue_debugfs_register(struct request_queue *q,
+					      const char *name)
+{
+}
+
+static inline struct dentry *blk_queue_debugfs_symlink(struct request_queue *q,
+						       const char *src,
+						       const char *dst)
+{
+	return ERR_PTR(-ENODEV);
+}
+
+#ifdef CONFIG_BLK_DEV_BSG
+static inline void blk_debugfs_register_bsg(void)
+{
+}
+#endif /* CONFIG_BLK_DEV_BSG */
+
+static inline void blk_queue_debugfs_register_bsg(struct request_queue *q,
+						  const char *name)
+{
+}
+
+static inline
+struct dentry *blk_queue_debugfs_bsg_symlink(struct request_queue *q,
+					     const char *src,
+					     const char *dst)
+{
+	return ERR_PTR(-ENODEV);
+}
+#endif /* CONFIG_DEBUG_FS */
+
 #else /* CONFIG_BLOCK */
 
 static inline void printk_all_partitions(void) { }
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index ca39dc3230cb..6c10a1427de2 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -311,7 +311,6 @@ static void blk_trace_free(struct blk_trace *bt)
 	debugfs_remove(bt->msg_file);
 	debugfs_remove(bt->dropped_file);
 	relay_close(bt->rchan);
-	debugfs_remove(bt->dir);
 	free_percpu(bt->sequence);
 	free_percpu(bt->msg_data);
 	kfree(bt);
@@ -509,9 +508,24 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 
 	ret = -ENOENT;
 
-	dir = debugfs_lookup(buts->name, blk_debugfs_root);
-	if (!dir)
-		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
+	dir = q->debugfs_dir;
+
+	/*
+	 * Although the directory here is from debugfs, and we typically do not
+	 * care about NULL dirs as debugfs is typically only used for debugging,
+	 * we rely on the directory to exist to place files which we then use
+	 * for blktrace userspace functionality. Without this directory
+	 * blktrace would not work. Enabling blktrace functionality enables
+	 * debugfs too, as such, we *really* do want to check for this and must
+	 * ensure it was set before chugging on. If NULL were used below, we'd
+	 * also end up creating the debugfs files under the block root
+	 * directory, which we definitely do not want.
+	 */
+	if (IS_ERR_OR_NULL(dir)) {
+		pr_warn("debugfs_dir not present for %s so skipping\n",
+			buts->name);
+		goto err;
+	}
 
 	bt->dev = dev;
 	atomic_set(&bt->dropped, 0);
@@ -551,8 +565,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 
 	ret = 0;
 err:
-	if (dir && !bt->dir)
-		dput(dir);
 	if (ret)
 		blk_trace_free(bt);
 	return ret;
-- 
2.26.2



* [PATCH v5 6/7] blktrace: break out of blktrace setup on concurrent calls
  2020-05-16  3:19 [PATCH v5 0/7] block: fix blktrace debugfs use after free Luis Chamberlain
                   ` (4 preceding siblings ...)
  2020-05-16  3:19 ` [PATCH v5 5/7] blktrace: fix debugfs use after free Luis Chamberlain
@ 2020-05-16  3:19 ` Luis Chamberlain
  2020-05-19 15:37   ` Christoph Hellwig
  2020-05-19 16:10   ` Bart Van Assche
  2020-05-16  3:19 ` [PATCH v5 7/7] loop: be paranoid on exit and prevent new additions / removals Luis Chamberlain
  6 siblings, 2 replies; 24+ messages in thread
From: Luis Chamberlain @ 2020-05-16  3:19 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain

We use one blktrace per request_queue, that means one per the entire
disk.  So we cannot run one blktrace on say /dev/vda and then /dev/vda1,
or just two calls on /dev/vda.

We check for concurrent setup only at the very end of the blktrace setup though.

If we try to run two concurrent blktraces on the same block device the
second one will fail, and the first one seems to go on. However, when
one tries to kill the first one, problems show up.

The kernel will show these:

```
debugfs: File 'dropped' in directory 'nvme1n1' already present!
debugfs: File 'msg' in directory 'nvme1n1' already present!
debugfs: File 'trace0' in directory 'nvme1n1' already present!
```

And userspace just sees this error message for the second call:

```
blktrace /dev/nvme1n1
BLKTRACESETUP(2) /dev/nvme1n1 failed: 5/Input/output error
```
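
For reference, the userspace side of such a call can be sketched as
below. This is a minimal illustration, not how the blktrace tool itself
is written: try_blktrace_setup() is a hypothetical helper, and the buffer
size, buffer count and action mask are illustrative values. Running a
real setup needs root and a block device. The error shown above is errno
5 (EIO); the early check added by this patch returns -EBUSY instead.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>            /* BLKTRACESETUP, BLKTRACETEARDOWN */
#include <linux/blktrace_api.h>  /* struct blk_user_trace_setup */

/*
 * Hypothetical helper: try to set up a blktrace on @path.
 * Returns 0 on success or a negative errno value. On success the
 * trace is torn down again right away, since this is only a probe.
 */
static int try_blktrace_setup(const char *path)
{
	struct blk_user_trace_setup buts;
	int fd, ret = 0;

	fd = open(path, O_RDONLY | O_NONBLOCK);
	if (fd < 0)
		return -errno;

	memset(&buts, 0, sizeof(buts));
	buts.buf_size = 512 * 1024;	/* illustrative values */
	buts.buf_nr = 4;
	buts.act_mask = 0xffff;		/* trace all actions */

	if (ioctl(fd, BLKTRACESETUP, &buts) < 0)
		ret = -errno;		/* save errno before close() */
	else
		ioctl(fd, BLKTRACETEARDOWN);

	close(fd);
	return ret;
}
```

A caller would report the failure in the same errno/strerror form as the
message above, for example with strerror(-ret).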

The first userspace process #1 will also claim that the files
were taken from underneath its nose. The files are taken away
from the first process because when the second blktrace fails,
it will follow up with a BLKTRACESTOP and BLKTRACETEARDOWN.
This means that even if go-happy process #1 is waiting for blktrace
data, we *have* been asked to tear down the blktrace.

This can easily be reproduced with break-blktrace [0] run_0005.sh test.

Just break out early if we know we're already going to fail. This will
prevent us from trying to create the files all over again, which we know
still exist.
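
The idea can be sketched as a small single-threaded userspace model.
struct queue, struct trace, trace_setup() and trace_teardown() are
hypothetical stand-ins for illustration only, and the sketch ignores
locking entirely, so it is not a substitute for the check that already
exists at the very end of the real setup path.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct blk_trace / struct request_queue. */
struct trace { int dev; };
struct queue { struct trace *blk_trace; };

/*
 * Returns 0 on success, -EBUSY if a trace is already attached to the
 * queue, or -ENOMEM if the allocation fails.
 */
static int trace_setup(struct queue *q, int dev)
{
	struct trace *bt;

	/* Bail out before allocating anything or creating any files. */
	if (q->blk_trace)
		return -EBUSY;

	bt = calloc(1, sizeof(*bt));
	if (!bt)
		return -ENOMEM;

	bt->dev = dev;
	q->blk_trace = bt;
	return 0;
}

/* Detach and free the trace, if any, so a later setup can succeed. */
static void trace_teardown(struct queue *q)
{
	free(q->blk_trace);
	q->blk_trace = NULL;
}
```

With this ordering a failed second setup has nothing of its own to clean
up, so in the model it never touches state owned by the first trace.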

[0] https://github.com/mcgrof/break-blktrace
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/trace/blktrace.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 6c10a1427de2..ac6650828d49 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -3,6 +3,9 @@
  * Copyright (C) 2006 Jens Axboe <axboe@kernel.dk>
  *
  */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/kernel.h>
 #include <linux/blkdev.h>
 #include <linux/blktrace_api.h>
@@ -493,6 +496,16 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	 */
 	strreplace(buts->name, '/', '_');
 
+	/*
+	 * bdev can be NULL, as with scsi-generic, so this is as helpful as
+	 * we can be.
+	 */
+	if (q->blk_trace) {
+		pr_warn("Concurrent blktraces are not allowed on %s\n",
+			buts->name);
+		return -EBUSY;
+	}
+
 	bt = kzalloc(sizeof(*bt), GFP_KERNEL);
 	if (!bt)
 		return -ENOMEM;
-- 
2.26.2



* [PATCH v5 7/7] loop: be paranoid on exit and prevent new additions / removals
  2020-05-16  3:19 [PATCH v5 0/7] block: fix blktrace debugfs use after free Luis Chamberlain
                   ` (5 preceding siblings ...)
  2020-05-16  3:19 ` [PATCH v5 6/7] blktrace: break out of blktrace setup on concurrent calls Luis Chamberlain
@ 2020-05-16  3:19 ` Luis Chamberlain
  2020-05-19 15:36   ` Christoph Hellwig
  6 siblings, 1 reply; 24+ messages in thread
From: Luis Chamberlain @ 2020-05-16  3:19 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain

Be pedantic on removal as well and hold the mutex.
This should prevent additions while we exit.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/block/loop.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 14372df0f354..54fbcbd930de 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2333,6 +2333,8 @@ static void __exit loop_exit(void)
 
 	range = max_loop ? max_loop << part_shift : 1UL << MINORBITS;
 
+	mutex_lock(&loop_ctl_mutex);
+
 	idr_for_each(&loop_index_idr, &loop_exit_cb, NULL);
 	idr_destroy(&loop_index_idr);
 
@@ -2340,6 +2342,8 @@ static void __exit loop_exit(void)
 	unregister_blkdev(LOOP_MAJOR, "loop");
 
 	misc_deregister(&loop_misc);
+
+	mutex_unlock(&loop_ctl_mutex);
 }
 
 module_init(loop_init);
-- 
2.26.2



* Re: [PATCH v5 5/7] blktrace: fix debugfs use after free
  2020-05-16  3:19 ` [PATCH v5 5/7] blktrace: fix debugfs use after free Luis Chamberlain
@ 2020-05-19 14:44   ` Greg KH
  2020-05-19 15:52     ` Luis Chamberlain
  2020-05-19 16:37   ` Christoph Hellwig
  1 sibling, 1 reply; 24+ messages in thread
From: Greg KH @ 2020-05-19 14:44 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Sat, May 16, 2020 at 03:19:54AM +0000, Luis Chamberlain wrote:
>  struct dentry *blk_debugfs_root;
> +struct dentry *blk_debugfs_bsg = NULL;

checkpatch didn't complain about "= NULL;"?

> +
> +/**
> + * enum blk_debugfs_dir_type - block device debugfs directory type
> + * @BLK_DBG_DIR_BASE: the block device debugfs_dir exists on the base
> + * 	system <system-debugfs-dir>/block/ debugfs directory.
> + * @BLK_DBG_DIR_BSG: the block device debugfs_dir is under the directory
> + * 	<system-debugfs-dir>/block/bsg/
> + */
> +enum blk_debugfs_dir_type {
> +	BLK_DBG_DIR_BASE = 1,
> +	BLK_DBG_DIR_BSG,
> +};
>  
>  void blk_debugfs_register(void)
>  {
>  	blk_debugfs_root = debugfs_create_dir("block", NULL);
>  }
> +
> +static struct dentry *queue_get_base_dir(enum blk_debugfs_dir_type type)
> +{
> +	switch (type) {
> +	case BLK_DBG_DIR_BASE:
> +		return blk_debugfs_root;
> +	case BLK_DBG_DIR_BSG:
> +		return blk_debugfs_bsg;
> +	}
> +	return NULL;
> +}

This "function" is used once, here:

> +static void queue_debugfs_register_type(struct request_queue *q,
> +					const char *name,
> +					enum blk_debugfs_dir_type type)
> +{
> +	struct dentry *base_dir = queue_get_base_dir(type);

And it could be a simple if statement instead.

Oh well, I don't have to maintain this :)

> +
> +	q->debugfs_dir = debugfs_create_dir(name, base_dir);
> +}
> +
> +/**
> + * blk_queue_debugfs_register - register the debugfs_dir for the block device
> + * @q: the associated request_queue of the block device
> + * @name: the name of the block device exposed
> + *
> + * This is used to create the debugfs_dir used by the block layer and blktrace.
> + * Drivers which use any of the *add_disk*() calls or variants have this called
> + * automatically for them. This directory is removed automatically on
> + * blk_release_queue() once the request_queue reference count reaches 0.
> + */
> +void blk_queue_debugfs_register(struct request_queue *q, const char *name)
> +{
> +	queue_debugfs_register_type(q, name, BLK_DBG_DIR_BASE);
> +}
> +EXPORT_SYMBOL_GPL(blk_queue_debugfs_register);
> +
> +/**
> + * blk_queue_debugfs_unregister - remove the debugfs_dir for the block device
> + * @q: the associated request_queue of the block device
> + *
> + * Removes the debugfs_dir for the request_queue on the associated block device.
> + * This is handled for you on blk_release_queue(), and that should only be
> + * called once.
> + *
> + * Since we don't care where the debugfs_dir was created this is used for all
> + * types of of enum blk_debugfs_dir_type.
> + */
> +void blk_queue_debugfs_unregister(struct request_queue *q)
> +{
> +	debugfs_remove_recursive(q->debugfs_dir);
> +}

Why is register needed to be exported, but unregister does not?  Does
some driver not properly clean things up?

> +
> +static struct dentry *queue_debugfs_symlink_type(struct request_queue *q,
> +						 const char *src,
> +						 const char *dst,
> +						 enum blk_debugfs_dir_type type)
> +{
> +	struct dentry *dentry = ERR_PTR(-EINVAL);
> +	char *dir_dst = NULL;
> +
> +	switch (type) {
> +	case BLK_DBG_DIR_BASE:
> +		if (dst)
> +			dir_dst = kasprintf(GFP_KERNEL, "%s", dst);
> +		else if (!IS_ERR_OR_NULL(q->debugfs_dir))
> +			dir_dst = kasprintf(GFP_KERNEL, "%s",
> +					    q->debugfs_dir->d_name.name);

There really is no other way to get the name of the directory other than
from the dentry?  It's not in the queue itself somewhere?

Anyway, not a big deal, just trying to not expose debugfs internals
here.

> +		else
> +			goto out;
> +		break;
> +	case BLK_DBG_DIR_BSG:
> +		if (dst)
> +			dir_dst = kasprintf(GFP_KERNEL, "bsg/%s", dst);
> +		else
> +			goto out;
> +		break;
> +	}
> +
> +	/*
> +	 * The base block debugfs directory is always used for the symlinks,
> +	 * their target is what changes.
> +	 */
> +	dentry = debugfs_create_symlink(src, blk_debugfs_root, dir_dst);
> +	kfree(dir_dst);
> +out:
> +	return dentry;
> +}
> +
> +/**
> + * blk_queue_debugfs_symlink - symlink to the real block device debugfs_dir
> + * @q: the request queue where we know the debugfs_dir exists or will exist
> + *     eventually. Cannot be NULL.
> + * @src: name of the exposed device we wish to associate to the block device
> + * @dst: the name of the directory to which we want to symlink to, may be NULL
> + *	 if you do not know what this may be, but only if your base block device
> + *	 is not bsg. If you set this to NULL, we will have no other option but
> + *	 to look at the request_queue to infer the name, but you must ensure
> + *	 it is already be set, be mindful of asynchronous probes.
> + *
> + * Some devices don't have a request_queue of their own, however, they have an
> + * association to one and have historically supported using the same
> + * debugfs_dir which has been used to represent the whole disk for blktrace
> + * functionality. Such is the case for partitions and for scsi-generic devices.
> + * They share the same request_queue and debugfs_dir as with the whole disk for
> + * blktrace purposes.  This helper allows such association to be made explicit
> + * and enable blktrace functionality for them. scsi-generic devices representing
> + * scsi device such as block, cdrom, tape, media changer register their own
> + * debug_dir already and share the same request_queue as with scsi-generic, as
> + * such the respective scsi-generic debugfs_dir is just a symlink to these
> + * driver's debugfs_dir.
> + *
> + * To remove use debugfs_remove() on the symlink dentry returned by this
> + * function. The block layer will not clean this up for you, you must remove
> + * it yourself in case of device removal.
> + */
> +struct dentry *blk_queue_debugfs_symlink(struct request_queue *q,
> +					 const char *src,
> +					 const char *dst)
> +{
> +	return queue_debugfs_symlink_type(q, src, dst, BLK_DBG_DIR_BASE);
> +}
> +EXPORT_SYMBOL_GPL(blk_queue_debugfs_symlink);
> +
> +#ifdef CONFIG_BLK_DEV_BSG
> +
> +void blk_debugfs_register_bsg(void)
> +{
> +	blk_debugfs_bsg = debugfs_create_dir("bsg", blk_debugfs_root);
> +}
> +
> +/**
> + * blk_queue_debugfs_register_bsg - create the debugfs_dir for bsg block devices
> + * @q: the associated request_queue of the block device
> + * @name: the name of the block device exposed
> + *
> + * This is used to create the debugfs_dir used by the Block layer SCSI generic
> + * (bsg) driver. This is to be used only by the scsi-generic driver on behalf
> + * of scsi devices which work as scsi controllers or transports.
> + *
> + * This directory is cleaned up for all drivers automatically on
> + * blk_release_queue() once the request_queue reference count reaches 0.
> + */
> +void blk_queue_debugfs_register_bsg(struct request_queue *q, const char *name)
> +{
> +	queue_debugfs_register_type(q, name, BLK_DBG_DIR_BSG);
> +}
> +EXPORT_SYMBOL_GPL(blk_queue_debugfs_register_bsg);
> +
> +/**
> + * blk_queue_debugfs_symlink_bsg - symlink to the bsg debugfs_dir
> + * @q: the request queue where we know the debugfs_dir exists or will exist
> + *     eventually. Cannot be NULL.
> + * @src: name of the scsi-generic device we wish to associate to the bsg
> + * 	request_queue.
> + * @dst: the name of the bsg request_queue debugfs_dir to which we want to
> + *	 symlink to. This cannot be NULL.
> + *
> + * This is used by scsi-generic devices representing raid controllers /
> + * transport drivers.
> + */
> +struct dentry *blk_queue_debugfs_bsg_symlink(struct request_queue *q,
> +					     const char *src,
> +					     const char *dst)
> +{
> +	return queue_debugfs_symlink_type(q, src, dst, BLK_DBG_DIR_BSG);
> +}
> +EXPORT_SYMBOL_GPL(blk_queue_debugfs_bsg_symlink);
> +#endif /* CONFIG_BLK_DEV_BSG */
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 96b7a35c898a..08edc3a54114 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -822,9 +822,6 @@ void blk_mq_debugfs_register(struct request_queue *q)
>  	struct blk_mq_hw_ctx *hctx;
>  	int i;
>  
> -	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> -					    blk_debugfs_root);
> -
>  	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
>  
>  	/*
> @@ -855,9 +852,7 @@ void blk_mq_debugfs_register(struct request_queue *q)
>  
>  void blk_mq_debugfs_unregister(struct request_queue *q)
>  {
> -	debugfs_remove_recursive(q->debugfs_dir);
>  	q->sched_debugfs_dir = NULL;
> -	q->debugfs_dir = NULL;
>  }
>  
>  static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 561624d4cc4e..4e0c00a88c99 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -918,6 +918,7 @@ static void blk_release_queue(struct kobject *kobj)
>  
>  	blk_trace_shutdown(q);
>  
> +	blk_queue_debugfs_unregister(q);
>  	if (queue_is_mq(q))
>  		blk_mq_debugfs_unregister(q);
>  
> @@ -989,6 +990,8 @@ int blk_register_queue(struct gendisk *disk)
>  		goto unlock;
>  	}
>  
> +	blk_queue_debugfs_register(q, kobject_name(q->kobj.parent));
> +
>  	if (queue_is_mq(q)) {
>  		__blk_mq_register_dev(dev, q);
>  		blk_mq_debugfs_register(q);
> diff --git a/block/blk.h b/block/blk.h
> index ee309233f95e..300b8526066b 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -460,10 +460,26 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
>  
>  #ifdef CONFIG_DEBUG_FS
>  void blk_debugfs_register(void);
> +void blk_queue_debugfs_unregister(struct request_queue *q);
> +void blk_part_debugfs_register(struct hd_struct *p, const char *name);
> +void blk_part_debugfs_unregister(struct hd_struct *p);
>  #else
>  static inline void blk_debugfs_register(void)
>  {
>  }
> +
> +static inline void blk_queue_debugfs_unregister(struct request_queue *q)
> +{
> +}
> +
> +static inline void blk_part_debugfs_register(struct hd_struct *p,
> +					     const char *name)
> +{
> +}
> +
> +static inline void blk_part_debugfs_unregister(struct hd_struct *p)
> +{
> +}
>  #endif /* CONFIG_DEBUG_FS */
>  
>  #endif /* BLK_INTERNAL_H */
> diff --git a/block/bsg.c b/block/bsg.c
> index d7bae94b64d9..bfb1036858c4 100644
> --- a/block/bsg.c
> +++ b/block/bsg.c
> @@ -503,6 +503,8 @@ static int __init bsg_init(void)
>  	if (ret)
>  		goto unregister_chrdev;
>  
> +	blk_debugfs_register_bsg();
> +
>  	printk(KERN_INFO BSG_DESCRIPTION " version " BSG_VERSION
>  	       " loaded (major %d)\n", bsg_major);
>  	return 0;
> diff --git a/block/partitions/core.c b/block/partitions/core.c
> index 297004fd2264..4d2a130e6055 100644
> --- a/block/partitions/core.c
> +++ b/block/partitions/core.c
> @@ -10,6 +10,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/blktrace_api.h>
>  #include <linux/raid/detect.h>
> +#include <linux/debugfs.h>
>  #include "check.h"
>  
>  static int (*check_part[])(struct parsed_partitions *) = {
> @@ -320,6 +321,9 @@ void delete_partition(struct gendisk *disk, struct hd_struct *part)
>  	 *  we have to hold the disk device
>  	 */
>  	get_device(disk_to_dev(part_to_disk(part)));
> +#ifdef CONFIG_DEBUG_FS
> +	debugfs_remove(part->debugfs_sym);
> +#endif

Why is the #ifdef needed?  It shouldn't be.

And why not recursive?

>  	rcu_assign_pointer(ptbl->part[part->partno], NULL);
>  	kobject_put(part->holder_dir);
>  	device_del(part_to_dev(part));
> @@ -460,6 +464,11 @@ static struct hd_struct *add_partition(struct gendisk *disk, int partno,
>  	/* everything is up and running, commence */
>  	rcu_assign_pointer(ptbl->part[partno], p);
>  
> +#ifdef CONFIG_DEBUG_FS
> +	p->debugfs_sym = blk_queue_debugfs_symlink(disk->queue, dev_name(pdev),
> +						   disk->disk_name);
> +#endif

Again, no #ifdef should be needed here, just provide the "empty"
function in the .h file.

You know this stuff :)

> +
>  	/* suppress uevent if the disk suppresses it */
>  	if (!dev_get_uevent_suppress(ddev))
>  		kobject_uevent(&pdev->kobj, KOBJ_ADD);
> diff --git a/drivers/scsi/ch.c b/drivers/scsi/ch.c
> index cb74ab1ae5a4..5dfabc04bfef 100644
> --- a/drivers/scsi/ch.c
> +++ b/drivers/scsi/ch.c
> @@ -971,6 +971,7 @@ static int ch_probe(struct device *dev)
>  
>  	mutex_unlock(&ch->lock);
>  	dev_set_drvdata(dev, ch);
> +	blk_queue_debugfs_register(sd->request_queue, dev_name(class_dev));
>  	sdev_printk(KERN_INFO, sd, "Attached scsi changer %s\n", ch->name);
>  
>  	return 0;
> diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
> index 20472aaaf630..6fa201086e59 100644
> --- a/drivers/scsi/sg.c
> +++ b/drivers/scsi/sg.c
> @@ -47,6 +47,7 @@ static int sg_version_num = 30536;	/* 2 digits for each component */
>  #include <linux/ratelimit.h>
>  #include <linux/uio.h>
>  #include <linux/cred.h> /* for sg_check_file_access() */
> +#include <linux/debugfs.h>
>  
>  #include "scsi.h"
>  #include <scsi/scsi_dbg.h>
> @@ -169,6 +170,10 @@ typedef struct sg_device { /* holds the state of each scsi generic device */
>  	struct gendisk *disk;
>  	struct cdev * cdev;	/* char_dev [sysfs: /sys/cdev/major/sg<n>] */
>  	struct kref d_ref;
> +#ifdef CONFIG_DEBUG_FS
> +	bool debugfs_set;
> +	struct dentry *debugfs_sym;
> +#endif
>  } Sg_device;
>  
>  /* tasklet or soft irq callback */
> @@ -914,6 +919,72 @@ static int put_compat_request_table(struct compat_sg_req_info __user *o,
>  }
>  #endif
>  
> +#ifdef CONFIG_DEBUG_FS
> +/*
> + * For scsi-generic devices like TYPE_DISK will re-use the scsi_device
> + * request_queue on their driver for their disk and later device_add_disk() it,
> + * we want its respective scsi-generic debugfs_dir to just be a symlink to the
> + * one created on the real scsi device probe.
> + *
> + * We use this on the ioctl path instead of sg_add_device() since some driver
> + * probes can run asynchronously. Such is the case for scsi devices of
> + * TYPE_DISK, and the class interface currently has no callbacks once a device
> + * driver probe has completed its probe. We don't use wait_for_device_probe()
> + * on sg_add_device() as that would defeat the purpose of using asynchronous
> + * probe.
> + */
> +static void sg_init_blktrace_setup(Sg_device *sdp)
> +{
> +	struct scsi_device *scsidp = sdp->device;
> +	struct device *scsi_dev = &scsidp->sdev_gendev;
> +	struct gendisk *sg_disk = sdp->disk;
> +	struct request_queue *q = scsidp->request_queue;
> +
> +	/*
> +	 * Although debugfs is used for debugging purposes and we
> +	 * typically don't care about the return value, we do here
> +	 * because we use it for userspace to ensure blktrace works.
> +	 *
> +	 * Instead of always just checking for the return value though,
> +	 * just try setting this once, if the first time failed we don't
> +	 * try again.
> +	 */
> +	if (sdp->debugfs_set)
> +		return;
> +
> +	switch (sdp->device->type) {
> +	case TYPE_RAID:
> +		/*
> +		 * We do the registration for bsg here to keep bsg scsi_device
> +		 * opaque. If bsg is disabled we just create the debugfs_dir on
> +		 * the base block debugfs_dir and scsi-generic symlinks to it.
> +		 */
> +		blk_queue_debugfs_register_bsg(q, dev_name(scsi_dev));
> +		sdp->debugfs_sym =
> +			blk_queue_debugfs_bsg_symlink(q,
> +						      sg_disk->disk_name,
> +						      dev_name(scsi_dev));
> +		break;
> +	default:
> +		/*
> +		 * We don't know scsi_device probed device name (this is
> +		 * different from the scsi_device name). This is opaque to
> +		 * scsi-generic, so we use the request_queue to infer the name
> +		 * based on the set debugfs_dir.
> +		 */
> +		sdp->debugfs_sym = blk_queue_debugfs_symlink(q,
> +							     sg_disk->disk_name,
> +							     NULL);
> +		break;
> +	}
> +	sdp->debugfs_set = true;
> +}
> +#else
> +static void sg_init_blktrace_setup(Sg_device *sdp)
> +{
> +}
> +#endif
> +
>  static long
>  sg_ioctl_common(struct file *filp, Sg_device *sdp, Sg_fd *sfp,
>  		unsigned int cmd_in, void __user *p)
> @@ -1117,6 +1188,7 @@ sg_ioctl_common(struct file *filp, Sg_device *sdp, Sg_fd *sfp,
>  		return put_user(max_sectors_bytes(sdp->device->request_queue),
>  				ip);
>  	case BLKTRACESETUP:
> +		sg_init_blktrace_setup(sdp);
>  		return blk_trace_setup(sdp->device->request_queue,
>  				       sdp->disk->disk_name,
>  				       MKDEV(SCSI_GENERIC_MAJOR, sdp->index),
> @@ -1644,6 +1716,9 @@ sg_remove_device(struct device *cl_dev, struct class_interface *cl_intf)
>  
>  	sysfs_remove_link(&scsidp->sdev_gendev.kobj, "generic");
>  	device_destroy(sg_sysfs_class, MKDEV(SCSI_GENERIC_MAJOR, sdp->index));
> +#ifdef CONFIG_DEBUG_FS
> +	debugfs_remove(sdp->debugfs_sym);
> +#endif

Again, no need for the #ifdef.

If you are worried about the variable not being there, just always put
it in the structure, it's only a pointer, for something that there are
not a lot of, right?

thanks,

greg k-h
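
The "empty function in the .h file" pattern asked for above can be sketched
in plain user-space C. This is only an illustration: DEMO_CONFIG_DEBUG_FS,
struct demo_dentry, struct demo_part and the demo_* helpers are hypothetical
stand-ins, not kernel symbols. The pointer stays in the structure
unconditionally, so the caller needs no #ifdef.

```c
#include <stddef.h>

/* Hypothetical stand-in for a Kconfig option; remove to build the stub. */
#define DEMO_CONFIG_DEBUG_FS 1

/* Hypothetical stand-in for a debugfs dentry. */
struct demo_dentry {
	const char *name;
	int removed;
};

struct demo_part {
	int partno;
	/* Always present: it is only a pointer, so no #ifdef around it. */
	struct demo_dentry *debugfs_sym;
};

#ifdef DEMO_CONFIG_DEBUG_FS
/* Real variant: drop the symlink and forget the pointer. */
static inline void demo_part_debugfs_unregister(struct demo_part *p)
{
	if (p->debugfs_sym) {
		p->debugfs_sym->removed = 1;
		p->debugfs_sym = NULL;
	}
}
#else
/* Empty stub: compiles away when the feature is disabled. */
static inline void demo_part_debugfs_unregister(struct demo_part *p)
{
	(void)p;
}
#endif

/* The caller is identical in both configurations. */
static void demo_delete_partition(struct demo_part *p)
{
	demo_part_debugfs_unregister(p);
	p->partno = -1;
}
```

With the option disabled the stub leaves the pointer untouched, which is
harmless because nothing ever set it in that configuration.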

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 4/7] block: move main block debugfs initialization to its own file
  2020-05-16  3:19 ` [PATCH v5 4/7] block: move main block debugfs initialization to its own file Luis Chamberlain
@ 2020-05-19 15:33   ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2020-05-19 15:33 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 7/7] loop: be paranoid on exit and prevent new additions / removals
  2020-05-16  3:19 ` [PATCH v5 7/7] loop: be paranoid on exit and prevent new additions / removals Luis Chamberlain
@ 2020-05-19 15:36   ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2020-05-19 15:36 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel

On Sat, May 16, 2020 at 03:19:56AM +0000, Luis Chamberlain wrote:
> Be pedantic on removal as well and hold the mutex.
> This should prevent uses of addition while we exit.
> 
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 6/7] blktrace: break out of blktrace setup on concurrent calls
  2020-05-16  3:19 ` [PATCH v5 6/7] blktrace: break out of blktrace setup on concurrent calls Luis Chamberlain
@ 2020-05-19 15:37   ` Christoph Hellwig
  2020-05-19 16:10   ` Bart Van Assche
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2020-05-19 15:37 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel

On Sat, May 16, 2020 at 03:19:55AM +0000, Luis Chamberlain wrote:
> We use one blktrace per request_queue, that means one per the entire
> disk.  So we cannot run one blktrace on say /dev/vda and then /dev/vda1,
> or just two calls on /dev/vda.
> 
> We check for concurrent setup only at the very end of the blktrace setup though.

Too long line in the changelog.

Otherwise this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 5/7] blktrace: fix debugfs use after free
  2020-05-19 14:44   ` Greg KH
@ 2020-05-19 15:52     ` Luis Chamberlain
  2020-05-19 17:03       ` Greg KH
  0 siblings, 1 reply; 24+ messages in thread
From: Luis Chamberlain @ 2020-05-19 15:52 UTC (permalink / raw)
  To: Greg KH
  Cc: axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Tue, May 19, 2020 at 04:44:08PM +0200, Greg KH wrote:
> On Sat, May 16, 2020 at 03:19:54AM +0000, Luis Chamberlain wrote:
> >  struct dentry *blk_debugfs_root;
> > +struct dentry *blk_debugfs_bsg = NULL;
> 
> checkpatch didn't complain about "= NULL;"?

Will remove.

> > +static void queue_debugfs_register_type(struct request_queue *q,
> > +					const char *name,
> > +					enum blk_debugfs_dir_type type)
> > +{
> > +	struct dentry *base_dir = queue_get_base_dir(type);
> 
> And it could be a simple if statement instead.
> 
> Oh well, I don't have to maintain this :)

I'll just use that, but yeah I think it's a matter of preference.

> > +/**
> > + * blk_queue_debugfs_register - register the debugfs_dir for the block device
> > + * @q: the associated request_queue of the block device
> > + * @name: the name of the block device exposed
> > + *
> > + * This is used to create the debugfs_dir used by the block layer and blktrace.
> > + * Drivers which use any of the *add_disk*() calls or variants have this called
> > + * automatically for them. This directory is removed automatically on
> > + * blk_release_queue() once the request_queue reference count reaches 0.
> > + */
> > +void blk_queue_debugfs_register(struct request_queue *q, const char *name)
> > +{
> > +	queue_debugfs_register_type(q, name, BLK_DBG_DIR_BASE);
> > +}
> > +EXPORT_SYMBOL_GPL(blk_queue_debugfs_register);
> > +
> > +/**
> > + * blk_queue_debugfs_unregister - remove the debugfs_dir for the block device
> > + * @q: the associated request_queue of the block device
> > + *
> > + * Removes the debugfs_dir for the request_queue on the associated block device.
> > + * This is handled for you on blk_release_queue(), and that should only be
> > + * called once.
> > + *
> > + * Since we don't care where the debugfs_dir was created this is used for all
> > + * types of of enum blk_debugfs_dir_type.
> > + */
> > +void blk_queue_debugfs_unregister(struct request_queue *q)
> > +{
> > +	debugfs_remove_recursive(q->debugfs_dir);
> > +}
> 
> Why is register needed to be exported, but unregister does not?  Does
> some driver not properly clean things up?

Is the comment on blk_queue_debugfs_register() not sufficient?
I thought I was going overboard with how clear this was.  Should I also
add a note here on unregister?

> > +
> > +static struct dentry *queue_debugfs_symlink_type(struct request_queue *q,
> > +						 const char *src,
> > +						 const char *dst,
> > +						 enum blk_debugfs_dir_type type)
> > +{
> > +	struct dentry *dentry = ERR_PTR(-EINVAL);
> > +	char *dir_dst = NULL;
> > +
> > +	switch (type) {
> > +	case BLK_DBG_DIR_BASE:
> > +		if (dst)
> > +			dir_dst = kasprintf(GFP_KERNEL, "%s", dst);
> > +		else if (!IS_ERR_OR_NULL(q->debugfs_dir))
> > +			dir_dst = kasprintf(GFP_KERNEL, "%s",
> > +					    q->debugfs_dir->d_name.name);
> 
> There really is no other way to get the name of the directory other than
> from the dentry?  It's not in the queue itself somewhere?

Nope, beyond that, the problem is that the caller can be scsi-generic
and the queue name instantiation is opaque to what happens below, and
the name of that target directory is only set when the async probe
completes, much after the class_interface sg_add_device(). That is, the
request_queue is shared between scsi-generic device and another driver
which depends on the scsi driver type: TYPE_DISK, TYPE_TAPE, etc. The
sg_add_device() gets called before the debugfs_dir name is even determined
and set. This is why I punted setting the symlink to the ioctl on
scsi-generic.
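
A rough user-space sketch of that deferral, assuming a one-shot flag like the
debugfs_set member in the patch: the symlink target is resolved on the first
BLKTRACESETUP-style call rather than at device-add time, and never retried.
All demo_* names are hypothetical and only model the control flow.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical queue: the directory name is NULL until the probe is done. */
struct demo_queue {
	const char *debugfs_dir_name;
};

struct demo_sg_device {
	struct demo_queue *q;
	bool debugfs_set;           /* one-shot guard */
	const char *symlink_target; /* what the symlink would point at */
	int setup_attempts;
};

/* Runs its body at most once per device, whether or not it succeeded. */
static void demo_sg_init_blktrace_setup(struct demo_sg_device *sdp)
{
	if (sdp->debugfs_set)
		return;

	sdp->setup_attempts++;
	/* Infer the target name from the shared queue, as the default case does. */
	sdp->symlink_target = sdp->q->debugfs_dir_name;
	sdp->debugfs_set = true;
}

/* Models the ioctl path: set up lazily, then proceed with tracing. */
static int demo_sg_ioctl_blktrace_setup(struct demo_sg_device *sdp)
{
	demo_sg_init_blktrace_setup(sdp);
	return sdp->symlink_target ? 0 : -ENOENT;
}
```

The trade-off is visible in the guard: if the first call happens before the
name is set, the failure sticks for the lifetime of the device.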

If we add a post-probe class_interface callback and make scsi-generic
use it, we could set the symlink at a better time: during initialization,
right after the async probe completes, instead of on the ioctl. And if
the class_interface is then handed the now-probed device, we *could*
use device_name() instead of the dentry name.

I thought this would be a welcomed change, but I see this as an
evolution.  In particular older kernels will have to use this format,
unless they want to carry extensions to the class_interface as well.

> Anyway, not a big deal, just trying to not expose debugfs internals
> here.

I'm with you on this, I'd personally prefer to see an extension to the
class_interface as an evolution, that way these fixes can be backported
without much hassle, and the *need* for the new class_interface call is
clearer.

But I'll yield to what folks prefer here.

> > diff --git a/block/partitions/core.c b/block/partitions/core.c
> > index 297004fd2264..4d2a130e6055 100644
> > --- a/block/partitions/core.c
> > +++ b/block/partitions/core.c
> > @@ -10,6 +10,7 @@
> >  #include <linux/vmalloc.h>
> >  #include <linux/blktrace_api.h>
> >  #include <linux/raid/detect.h>
> > +#include <linux/debugfs.h>
> >  #include "check.h"
> >  
> >  static int (*check_part[])(struct parsed_partitions *) = {
> > @@ -320,6 +321,9 @@ void delete_partition(struct gendisk *disk, struct hd_struct *part)
> >  	 *  we have to hold the disk device
> >  	 */
> >  	get_device(disk_to_dev(part_to_disk(part)));
> > +#ifdef CONFIG_DEBUG_FS
> > +	debugfs_remove(part->debugfs_sym);
> > +#endif
> 
> Why is the #ifdef needed?  It shouldn't be.

Because debugfs_sym is a member which is only present in the structure
if CONFIG_DEBUG_FS is defined.

> And why not recursive?

recursive seems odd for a symlink.

> >  	rcu_assign_pointer(ptbl->part[part->partno], NULL);
> >  	kobject_put(part->holder_dir);
> >  	device_del(part_to_dev(part));
> > @@ -460,6 +464,11 @@ static struct hd_struct *add_partition(struct gendisk *disk, int partno,
> >  	/* everything is up and running, commence */
> >  	rcu_assign_pointer(ptbl->part[partno], p);
> >  
> > +#ifdef CONFIG_DEBUG_FS
> > +	p->debugfs_sym = blk_queue_debugfs_symlink(disk->queue, dev_name(pdev),
> > +						   disk->disk_name);
> > +#endif
> 
> Again, no #ifdef should be needed here, just provide the "empty"
> function in the .h file.
> 
> You know this stuff :)

Well it was only *one* function, if we want the boilerplate stuff to
not deal with it, fine, I'll wrap it around and provide a helper for
these. It just seemed overkill.

> > @@ -1644,6 +1716,9 @@ sg_remove_device(struct device *cl_dev, struct class_interface *cl_intf)
> >  
> >  	sysfs_remove_link(&scsidp->sdev_gendev.kobj, "generic");
> >  	device_destroy(sg_sysfs_class, MKDEV(SCSI_GENERIC_MAJOR, sdp->index));
> > +#ifdef CONFIG_DEBUG_FS
> > +	debugfs_remove(sdp->debugfs_sym);
> > +#endif
> 
> Again, no need for the #ifdef.
> 
> If you are worried about the variable not being there, just always put
> it in the structure, it's only a pointer, for something that there are
> not a lot of, right?

Alright, will use wrappers.

  Luis

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 6/7] blktrace: break out of blktrace setup on concurrent calls
  2020-05-16  3:19 ` [PATCH v5 6/7] blktrace: break out of blktrace setup on concurrent calls Luis Chamberlain
  2020-05-19 15:37   ` Christoph Hellwig
@ 2020-05-19 16:10   ` Bart Van Assche
  1 sibling, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2020-05-19 16:10 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm, linux-kernel

On 2020-05-15 20:19, Luis Chamberlain wrote:
> [ ... ]

Once Christoph's comments are addressed, feel free to add:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 5/7] blktrace: fix debugfs use after free
  2020-05-16  3:19 ` [PATCH v5 5/7] blktrace: fix debugfs use after free Luis Chamberlain
  2020-05-19 14:44   ` Greg KH
@ 2020-05-19 16:37   ` Christoph Hellwig
  2020-05-19 16:54     ` Greg KH
  2020-05-27  3:12     ` Luis Chamberlain
  1 sibling, 2 replies; 24+ messages in thread
From: Christoph Hellwig @ 2020-05-19 16:37 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da

I don't think we need any of that symlink stuff.  Even if we want it
(which I don't), it should not be in a bug fix patch.

In fact to fix the blktrace race I think we only need something like
this fairly trivial patch (completely untested so far) below.

(and with that we can also drop the previous patch, as blk-debugfs.c
becomes rather pointless)
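
The check-then-create idea used in the patch below can be modelled in user
space roughly as follows. This is a sketch under assumptions, not the kernel
code: a pthread mutex stands in for the queue's trace mutex, and struct
demo_queue, demo_create_dir() and demo_queue_get_debugfs_dir() are
hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical queue: one directory, created by whoever gets there first. */
struct demo_queue {
	pthread_mutex_t trace_mutex;
	const char *debugfs_dir; /* NULL until created */
	int create_calls;
};

/* Stand-in for directory creation; counts how often it really runs. */
static const char *demo_create_dir(struct demo_queue *q, const char *name)
{
	q->create_calls++;
	return name;
}

/*
 * Both the disk-registration path and the tracing path would call this.
 * The test and the assignment happen under one lock, so two racing callers
 * cannot both create the directory.
 */
static const char *demo_queue_get_debugfs_dir(struct demo_queue *q,
					      const char *name)
{
	const char *dir;

	pthread_mutex_lock(&q->trace_mutex);
	if (!q->debugfs_dir)
		q->debugfs_dir = demo_create_dir(q, name);
	dir = q->debugfs_dir;
	pthread_mutex_unlock(&q->trace_mutex);

	return dir;
}
```

Whichever caller arrives first picks the name; later callers reuse the
existing directory, and removal is left to queue teardown.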


diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 15df3a36e9fa4..a2800bc56fb4d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -824,9 +824,6 @@ void blk_mq_debugfs_register(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
-	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
-					    blk_debugfs_root);
-
 	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
 
 	/*
@@ -857,9 +854,7 @@ void blk_mq_debugfs_register(struct request_queue *q)
 
 void blk_mq_debugfs_unregister(struct request_queue *q)
 {
-	debugfs_remove_recursive(q->debugfs_dir);
 	q->sched_debugfs_dir = NULL;
-	q->debugfs_dir = NULL;
 }
 
 static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 561624d4cc4e7..8e6ea4a13f550 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -11,6 +11,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/blk-mq.h>
 #include <linux/blk-cgroup.h>
+#include <linux/debugfs.h>
 
 #include "blk.h"
 #include "blk-mq.h"
@@ -918,6 +919,7 @@ static void blk_release_queue(struct kobject *kobj)
 
 	blk_trace_shutdown(q);
 
+	debugfs_remove_recursive(q->debugfs_dir);
 	if (queue_is_mq(q))
 		blk_mq_debugfs_unregister(q);
 
@@ -989,6 +991,27 @@ int blk_register_queue(struct gendisk *disk)
 		goto unlock;
 	}
 
+	/*
+	 * Blktrace needs a debugfs name even for queues that don't register
+	 * a gendisk, so it lazily registers the debugfs directory.  But that
+	 * can get us into a situation where a SCSI device is found, with no
+	 * driver for it (yet).  Then blktrace is used on the device, creating
+	 * the debugfs directory, and only after that a driver is loaded. In
+	 * that case we might already have a debugfs directory registered here.
+	 * Even worse we could be racing with blktrace to register it.
+	 */
+#ifdef CONFIG_BLK_DEV_IO_TRACE
+	mutex_lock(&q->blk_trace_mutex);
+	if (!q->debugfs_dir) {
+		q->debugfs_dir =
+			debugfs_create_dir(kobject_name(q->kobj.parent),
+				blk_debugfs_root);
+	}
+	mutex_unlock(&q->blk_trace_mutex);
+#else
+	blk_queue_debugfs_register(q);
+#endif
+
 	if (queue_is_mq(q)) {
 		__blk_mq_register_dev(dev, q);
 		blk_mq_debugfs_register(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8801f3d7cf4a3..7a4de524f408f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -574,8 +574,8 @@ struct request_queue {
 	struct list_head	tag_set_list;
 	struct bio_set		bio_split;
 
-#ifdef CONFIG_BLK_DEBUG_FS
 	struct dentry		*debugfs_dir;
+#ifdef CONFIG_BLK_DEBUG_FS
 	struct dentry		*sched_debugfs_dir;
 	struct dentry		*rqos_debugfs_dir;
 #endif
diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index 3b6ff5902edce..eb6db276e2931 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -22,7 +22,6 @@ struct blk_trace {
 	u64 end_lba;
 	u32 pid;
 	u32 dev;
-	struct dentry *dir;
 	struct dentry *dropped_file;
 	struct dentry *msg_file;
 	struct list_head running_list;
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index ca39dc3230cb3..1b622e970cede 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -311,7 +311,6 @@ static void blk_trace_free(struct blk_trace *bt)
 	debugfs_remove(bt->msg_file);
 	debugfs_remove(bt->dropped_file);
 	relay_close(bt->rchan);
-	debugfs_remove(bt->dir);
 	free_percpu(bt->sequence);
 	free_percpu(bt->msg_data);
 	kfree(bt);
@@ -476,15 +475,11 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 			      struct blk_user_trace_setup *buts)
 {
 	struct blk_trace *bt = NULL;
-	struct dentry *dir = NULL;
 	int ret;
 
 	if (!buts->buf_size || !buts->buf_nr)
 		return -EINVAL;
 
-	if (!blk_debugfs_root)
-		return -ENOENT;
-
 	strncpy(buts->name, name, BLKTRACE_BDEV_SIZE);
 	buts->name[BLKTRACE_BDEV_SIZE - 1] = '\0';
 
@@ -494,6 +489,25 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	 */
 	strreplace(buts->name, '/', '_');
 
+	/*
+	 * For queues that do not have a gendisk attached to them, the debugfs
+	 * directory will not have been created at setup time.  Create it here
+	 * lazily, it will only be removed when the queue is torn down.
+	 *
+	 * As blktrace relies on debugfs for its interface the debugfs directory
+	 * is required, contrary to the usual mantra of not checking for debugfs
+	 * files or directories.
+	 */
+	if (!q->debugfs_dir) {
+		q->debugfs_dir =
+			debugfs_create_dir(buts->name, blk_debugfs_root);
+	}
+	if (IS_ERR_OR_NULL(q->debugfs_dir)) {
+		pr_warn("debugfs_dir not present for %s so skipping\n",
+			buts->name);
+		return -ENOENT;
+	}
+
 	bt = kzalloc(sizeof(*bt), GFP_KERNEL);
 	if (!bt)
 		return -ENOMEM;
@@ -507,23 +521,18 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	if (!bt->msg_data)
 		goto err;
 
-	ret = -ENOENT;
-
-	dir = debugfs_lookup(buts->name, blk_debugfs_root);
-	if (!dir)
-		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
-
 	bt->dev = dev;
 	atomic_set(&bt->dropped, 0);
 	INIT_LIST_HEAD(&bt->running_list);
 
 	ret = -EIO;
-	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
-					       &blk_dropped_fops);
+	bt->dropped_file = debugfs_create_file("dropped", 0444, q->debugfs_dir,
+					       bt, &blk_dropped_fops);
 
-	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
+	bt->msg_file = debugfs_create_file("msg", 0222, q->debugfs_dir, bt,
+					   &blk_msg_fops);
 
-	bt->rchan = relay_open("trace", dir, buts->buf_size,
+	bt->rchan = relay_open("trace", q->debugfs_dir, buts->buf_size,
 				buts->buf_nr, &blk_relay_callbacks, bt);
 	if (!bt->rchan)
 		goto err;
@@ -551,8 +560,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 
 	ret = 0;
 err:
-	if (dir && !bt->dir)
-		dput(dir);
 	if (ret)
 		blk_trace_free(bt);
 	return ret;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 5/7] blktrace: fix debugfs use after free
  2020-05-19 16:37   ` Christoph Hellwig
@ 2020-05-19 16:54     ` Greg KH
  2020-05-27  3:12     ` Luis Chamberlain
  1 sibling, 0 replies; 24+ messages in thread
From: Greg KH @ 2020-05-19 16:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Luis Chamberlain, axboe, viro, bvanassche, rostedt, mingo, jack,
	ming.lei, nstange, akpm, mhocko, yukuai3, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Omar Sandoval,
	Hannes Reinecke, Michal Hocko, syzbot+603294af2d01acfdd6da

On Tue, May 19, 2020 at 09:37:13AM -0700, Christoph Hellwig wrote:
> I don't think we need any of that symlink stuff.  Even if we want it
> (which I don't), it should not be in a bug fix patch.

I agree, why are the symlinks even needed?  This is debugfs, the
files/contents there can change whenever they want to, no userspace code
should depend on this stuff...

> In fact to fix the blktrace race I think we only need something like
> this fairly trivial patch (completely untested so far) below.
> 
> (and with that we can also drop the previous patch, as blk-debugfs.c
> becomes rather pointless)

Patch looks much more sane than Luis's one.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 5/7] blktrace: fix debugfs use after free
  2020-05-19 15:52     ` Luis Chamberlain
@ 2020-05-19 17:03       ` Greg KH
  0 siblings, 0 replies; 24+ messages in thread
From: Greg KH @ 2020-05-19 17:03 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Tue, May 19, 2020 at 03:52:10PM +0000, Luis Chamberlain wrote:
> On Tue, May 19, 2020 at 04:44:08PM +0200, Greg KH wrote:
> > On Sat, May 16, 2020 at 03:19:54AM +0000, Luis Chamberlain wrote:
> > >  struct dentry *blk_debugfs_root;
> > > +struct dentry *blk_debugfs_bsg = NULL;
> > 
> > checkpatch didn't complain about "= NULL;"?
> 
> Will remove.
> 
> > > +static void queue_debugfs_register_type(struct request_queue *q,
> > > +					const char *name,
> > > +					enum blk_debugfs_dir_type type)
> > > +{
> > > +	struct dentry *base_dir = queue_get_base_dir(type);
> > 
> > And it could be a simple if statement instead.
> > 
> > Oh well, I don't have to maintain this :)
> 
> I'll just use that, but yeah I think its a matter of preference.
> 
> > > +/**
> > > + * blk_queue_debugfs_register - register the debugfs_dir for the block device
> > > + * @q: the associated request_queue of the block device
> > > + * @name: the name of the block device exposed
> > > + *
> > > + * This is used to create the debugfs_dir used by the block layer and blktrace.
> > > + * Drivers which use any of the *add_disk*() calls or variants have this called
> > > + * automatically for them. This directory is removed automatically on
> > > + * blk_release_queue() once the request_queue reference count reaches 0.
> > > + */
> > > +void blk_queue_debugfs_register(struct request_queue *q, const char *name)
> > > +{
> > > +	queue_debugfs_register_type(q, name, BLK_DBG_DIR_BASE);
> > > +}
> > > +EXPORT_SYMBOL_GPL(blk_queue_debugfs_register);
> > > +
> > > +/**
> > > + * blk_queue_debugfs_unregister - remove the debugfs_dir for the block device
> > > + * @q: the associated request_queue of the block device
> > > + *
> > > + * Removes the debugfs_dir for the request_queue on the associated block device.
> > > + * This is handled for you on blk_release_queue(), and that should only be
> > > + * called once.
> > > + *
> > > + * Since we don't care where the debugfs_dir was created this is used for all
> > > + * types of of enum blk_debugfs_dir_type.
> > > + */
> > > +void blk_queue_debugfs_unregister(struct request_queue *q)
> > > +{
> > > +	debugfs_remove_recursive(q->debugfs_dir);
> > > +}
> > 
> > Why is register needed to be exported, but unregister does not?  Does
> > some driver not properly clean things up?
> 
> Is the comment on blk_queue_debugfs_register() not sufficient?

Ah, hm, ok, I guess so.

> I thought I was going overboard with how clear this was.  Should I also
> add a note here on unregister?

Not really, it's fine, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 5/7] blktrace: fix debugfs use after free
  2020-05-19 16:37   ` Christoph Hellwig
  2020-05-19 16:54     ` Greg KH
@ 2020-05-27  3:12     ` Luis Chamberlain
  2020-05-28  1:15       ` Bart Van Assche
  2020-06-01 17:05       ` Luis Chamberlain
  1 sibling, 2 replies; 24+ messages in thread
From: Luis Chamberlain @ 2020-05-27  3:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da

On Tue, May 19, 2020 at 09:37:13AM -0700, Christoph Hellwig wrote:
> I don't think we need any of that symlink stuff.  Even if we want it
> (which I don't), it should not be in a bug fix patch.
> 
> In fact to fix the blktrace race I think we only need something like
> this fairly trivial patch (completely untested so far) below.
> 
> (and with that we can also drop the previous patch, as blk-debugfs.c
> becomes rather pointless)

You forgot to deal with partitions. Putting similar lipstick on the pig,
this is what I end up with, let me know if this seems agreeable:

diff --git a/block/blk-core.c b/block/blk-core.c
index 2096373bd16d..67cd9ddac822 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -51,9 +51,7 @@
 #include "blk-pm.h"
 #include "blk-rq-qos.h"
 
-#ifdef CONFIG_DEBUG_FS
 struct dentry *blk_debugfs_root;
-#endif
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
@@ -1877,9 +1875,7 @@ int __init blk_dev_init(void)
 	blk_requestq_cachep = kmem_cache_create("request_queue",
 			sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
 
-#ifdef CONFIG_DEBUG_FS
 	blk_debugfs_root = debugfs_create_dir("block", NULL);
-#endif
 
 	return 0;
 }
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 96b7a35c898a..08edc3a54114 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -822,9 +822,6 @@ void blk_mq_debugfs_register(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
-	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
-					    blk_debugfs_root);
-
 	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
 
 	/*
@@ -855,9 +852,7 @@ void blk_mq_debugfs_register(struct request_queue *q)
 
 void blk_mq_debugfs_unregister(struct request_queue *q)
 {
-	debugfs_remove_recursive(q->debugfs_dir);
 	q->sched_debugfs_dir = NULL;
-	q->debugfs_dir = NULL;
 }
 
 static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 561624d4cc4e..5babb6547f48 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -11,6 +11,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/blk-mq.h>
 #include <linux/blk-cgroup.h>
+#include <linux/debugfs.h>
 
 #include "blk.h"
 #include "blk-mq.h"
@@ -918,6 +919,7 @@ static void blk_release_queue(struct kobject *kobj)
 
 	blk_trace_shutdown(q);
 
+	debugfs_remove_recursive(q->debugfs_dir);
 	if (queue_is_mq(q))
 		blk_mq_debugfs_unregister(q);
 
@@ -989,6 +991,28 @@ int blk_register_queue(struct gendisk *disk)
 		goto unlock;
 	}
 
+	/*
+	 * Blktrace needs a debugsfs name even for queues that don't register
+	 * a gendisk, so it lazily registers the debugfs directory.  But that
+	 * can get us into a situation where a SCSI device is found, with no
+	 * driver for it (yet).  Then blktrace is used on the device, creating
+	 * the debugfs directory, and only after that a drivers is loaded. In
+	 * that case we might already have a debugfs directory registered here.
+	 * Even worse we could be racing with blktrace to register it.
+	 */
+#ifdef CONFIG_BLK_DEV_IO_TRACE
+	mutex_lock(&q->blk_trace_mutex);
+	if (!q->debugfs_dir) {
+		q->debugfs_dir =
+			debugfs_create_dir(kobject_name(q->kobj.parent),
+				blk_debugfs_root);
+	}
+	mutex_unlock(&q->blk_trace_mutex);
+#else
+	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
+					    blk_debugfs_root);
+#endif
+
 	if (queue_is_mq(q)) {
 		__blk_mq_register_dev(dev, q);
 		blk_mq_debugfs_register(q);
diff --git a/block/blk.h b/block/blk.h
index 5db4ec1e85f7..f11f79295419 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -14,9 +14,7 @@
 /* Max future timer expiry for timeouts */
 #define BLK_MAX_TIMEOUT		(5 * HZ)
 
-#ifdef CONFIG_DEBUG_FS
 extern struct dentry *blk_debugfs_root;
-#endif
 
 struct blk_flush_queue {
 	unsigned int		flush_pending_idx:1;
diff --git a/block/partitions/core.c b/block/partitions/core.c
index 297004fd2264..95f9019aac83 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -10,6 +10,7 @@
 #include <linux/vmalloc.h>
 #include <linux/blktrace_api.h>
 #include <linux/raid/detect.h>
+#include <linux/debugfs.h>
 #include "check.h"
 
 static int (*check_part[])(struct parsed_partitions *) = {
@@ -322,6 +323,7 @@ void delete_partition(struct gendisk *disk, struct hd_struct *part)
 	get_device(disk_to_dev(part_to_disk(part)));
 	rcu_assign_pointer(ptbl->part[part->partno], NULL);
 	kobject_put(part->holder_dir);
+	debugfs_remove_recursive(part->debugfs_dir);
 	device_del(part_to_dev(part));
 
 	/*
@@ -443,6 +445,7 @@ static struct hd_struct *add_partition(struct gendisk *disk, int partno,
 	if (!p->holder_dir)
 		goto out_del;
 
+	p->debugfs_dir = debugfs_create_dir(dev_name(pdev), blk_debugfs_root);
 	dev_set_uevent_suppress(pdev, 0);
 	if (flags & ADDPART_FLAG_WHOLEDISK) {
 		err = device_create_file(pdev, &dev_attr_whole_disk);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 20e378b428b8..737467c29a31 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -574,8 +574,8 @@ struct request_queue {
 	struct list_head	tag_set_list;
 	struct bio_set		bio_split;
 
-#ifdef CONFIG_BLK_DEBUG_FS
 	struct dentry		*debugfs_dir;
+#ifdef CONFIG_BLK_DEBUG_FS
 	struct dentry		*sched_debugfs_dir;
 	struct dentry		*rqos_debugfs_dir;
 #endif
diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index 3b6ff5902edc..eb6db276e293 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -22,7 +22,6 @@ struct blk_trace {
 	u64 end_lba;
 	u32 pid;
 	u32 dev;
-	struct dentry *dir;
 	struct dentry *dropped_file;
 	struct dentry *msg_file;
 	struct list_head running_list;
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index a9384449465a..7ff4c4c06140 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -89,6 +89,7 @@ struct hd_struct {
 	int make_it_fail;
 #endif
 	struct rcu_work rcu_work;
+	struct dentry *debugfs_dir;
 };
 
 /**
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index ea47f2084087..8209d41dec18 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -311,7 +311,6 @@ static void blk_trace_free(struct blk_trace *bt)
 	debugfs_remove(bt->msg_file);
 	debugfs_remove(bt->dropped_file);
 	relay_close(bt->rchan);
-	debugfs_remove(bt->dir);
 	free_percpu(bt->sequence);
 	free_percpu(bt->msg_data);
 	kfree(bt);
@@ -482,9 +481,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	if (!buts->buf_size || !buts->buf_nr)
 		return -EINVAL;
 
-	if (!blk_debugfs_root)
-		return -ENOENT;
-
 	strncpy(buts->name, name, BLKTRACE_BDEV_SIZE);
 	buts->name[BLKTRACE_BDEV_SIZE - 1] = '\0';
 
@@ -494,6 +490,38 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	 */
 	strreplace(buts->name, '/', '_');
 
+	/*
+	 * We also have to use a partition directory if a partition is
+	 * being worked on, even though the same request_queue is shared.
+	 */
+	if (bdev && bdev != bdev->bd_contains)
+		dir = bdev->bd_part->debugfs_dir;
+	else {
+		/*
+		 * For queues that do not have a gendisk attached to them, the
+		 * debugfs directory will not have been created at setup time.
+		 * Create it here lazily, it will only be removed when the
+		 * queue is torn down.
+		 */
+		if (!q->debugfs_dir) {
+			q->debugfs_dir =
+				debugfs_create_dir(buts->name,
+						   blk_debugfs_root);
+		}
+		dir = q->debugfs_dir;
+	}
+
+	/*
+	 * As blktrace relies on debugfs for its interface the debugfs directory
+	 * is required, contrary to the usual mantra of not checking for debugfs
+	 * files or directories.
+	 */
+	if (IS_ERR_OR_NULL(q->debugfs_dir)) {
+		pr_warn("debugfs_dir not present for %s so skipping\n",
+			buts->name);
+		return -ENOENT;
+	}
+
 	bt = kzalloc(sizeof(*bt), GFP_KERNEL);
 	if (!bt)
 		return -ENOMEM;
@@ -507,12 +535,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	if (!bt->msg_data)
 		goto err;
 
-	ret = -ENOENT;
-
-	dir = debugfs_lookup(buts->name, blk_debugfs_root);
-	if (!dir)
-		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
-
 	bt->dev = dev;
 	atomic_set(&bt->dropped, 0);
 	INIT_LIST_HEAD(&bt->running_list);
@@ -551,8 +573,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 
 	ret = 0;
 err:
-	if (dir && !bt->dir)
-		dput(dir);
 	if (ret)
 		blk_trace_free(bt);
 	return ret;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 5/7] blktrace: fix debugfs use after free
  2020-05-27  3:12     ` Luis Chamberlain
@ 2020-05-28  1:15       ` Bart Van Assche
  2020-05-29  7:56         ` Luis Chamberlain
  2020-06-01 17:05       ` Luis Chamberlain
  1 sibling, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2020-05-28  1:15 UTC (permalink / raw)
  To: Luis Chamberlain, Christoph Hellwig
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On 2020-05-26 20:12, Luis Chamberlain wrote:
> +	/*
> +	 * Blktrace needs a debugsfs name even for queues that don't register
> +	 * a gendisk, so it lazily registers the debugfs directory.  But that
> +	 * can get us into a situation where a SCSI device is found, with no
> +	 * driver for it (yet).  Then blktrace is used on the device, creating
> +	 * the debugfs directory, and only after that a drivers is loaded. In
                                                        ^^^^^^^
                                                        driver?

> @@ -494,6 +490,38 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
>  	 */
>  	strreplace(buts->name, '/', '_');
>  
> +	/*
> +	 * We also have to use a partition directory if a partition is
> +	 * being worked on, even though the same request_queue is shared.
> +	 */
> +	if (bdev && bdev != bdev->bd_contains)
> +		dir = bdev->bd_part->debugfs_dir;

Please balance braces in if-statements as required by the kernel coding style.

> +	else {
> +		/*
> +		 * For queues that do not have a gendisk attached to them, the
> +		 * debugfs directory will not have been created at setup time.
> +		 * Create it here lazily, it will only be removed when the
> +		 * queue is torn down.
> +		 */

Is the above comment perhaps a reference to blk_register_queue()? If so, please
mention the name of that function explicitly.

> +		if (!q->debugfs_dir) {
> +			q->debugfs_dir =
> +				debugfs_create_dir(buts->name,
> +						   blk_debugfs_root);
> +		}
> +		dir = q->debugfs_dir;
> +	}
> +
> +	/*
> +	 * As blktrace relies on debugfs for its interface the debugfs directory
> +	 * is required, contrary to the usual mantra of not checking for debugfs
> +	 * files or directories.
> +	 */
> +	if (IS_ERR_OR_NULL(q->debugfs_dir)) {
> +		pr_warn("debugfs_dir not present for %s so skipping\n",
> +			buts->name);
> +		return -ENOENT;
> +	}

How are do_blk_trace_setup() calls serialized against the debugfs directory
creation code in blk_register_queue()? Perhaps via q->blk_trace_mutex? Are
mutex lock and unlock calls for that mutex perhaps missing from
compat_blk_trace_setup()?

How about adding a lockdep_assert_held(&q->blk_trace_mutex) statement in
do_blk_trace_setup()?

Thanks,

Bart.
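
The serialization asked about above can be modelled in userspace. This is
a minimal sketch, not kernel code: struct queue_model,
model_register_queue() and model_trace_setup() are hypothetical names, a
pthread mutex stands in for q->blk_trace_mutex, and a string pointer stands
in for the dentry. It only illustrates that when both paths take the same
mutex around the NULL check, whichever path runs first creates the
directory and the other one reuses it.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct request_queue. */
struct queue_model {
	pthread_mutex_t trace_mutex;	/* plays the role of q->blk_trace_mutex */
	const char *debugfs_dir;	/* NULL until a directory is "created" */
	int create_calls;		/* how often creation actually happened */
};

#define QUEUE_MODEL_INIT { PTHREAD_MUTEX_INITIALIZER, NULL, 0 }

/* Create the directory only if it does not exist; caller holds the mutex. */
static void model_create_dir_locked(struct queue_model *q, const char *name)
{
	if (!q->debugfs_dir) {
		q->debugfs_dir = name;
		q->create_calls++;
	}
}

/* Models the directory creation done when a gendisk registers its queue. */
static void model_register_queue(struct queue_model *q, const char *disk_name)
{
	pthread_mutex_lock(&q->trace_mutex);
	model_create_dir_locked(q, disk_name);
	pthread_mutex_unlock(&q->trace_mutex);
}

/* Models the lazy creation done when blktrace is set up first. */
static void model_trace_setup(struct queue_model *q, const char *trace_name)
{
	pthread_mutex_lock(&q->trace_mutex);
	model_create_dir_locked(q, trace_name);
	pthread_mutex_unlock(&q->trace_mutex);
}
```

If model_trace_setup() runs before model_register_queue(), the directory
keeps the name given by the trace path and create_calls stays at 1, which
mirrors the "SCSI device found before its driver is loaded" ordering
described in the patch comment.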

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 5/7] blktrace: fix debugfs use after free
  2020-05-28  1:15       ` Bart Van Assche
@ 2020-05-29  7:56         ` Luis Chamberlain
  2020-05-29 14:09           ` Bart Van Assche
  0 siblings, 1 reply; 24+ messages in thread
From: Luis Chamberlain @ 2020-05-29  7:56 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm, mhocko, yukuai3, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Omar Sandoval,
	Hannes Reinecke, Michal Hocko, syzbot+603294af2d01acfdd6da

On Wed, May 27, 2020 at 06:15:10PM -0700, Bart Van Assche wrote:
> On 2020-05-26 20:12, Luis Chamberlain wrote:
> > +	/*
> > +	 * Blktrace needs a debugsfs name even for queues that don't register
> > +	 * a gendisk, so it lazily registers the debugfs directory.  But that
> > +	 * can get us into a situation where a SCSI device is found, with no
> > +	 * driver for it (yet).  Then blktrace is used on the device, creating
> > +	 * the debugfs directory, and only after that a drivers is loaded. In
>                                                         ^^^^^^^
>                                                         driver?

Fixed.

> > @@ -494,6 +490,38 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
> >  	 */
> >  	strreplace(buts->name, '/', '_');
> >  
> > +	/*
> > +	 * We also have to use a partition directory if a partition is
> > +	 * being worked on, even though the same request_queue is shared.
> > +	 */
> > +	if (bdev && bdev != bdev->bd_contains)
> > +		dir = bdev->bd_part->debugfs_dir;
> 
> Please balance braces in if-statements as required by the kernel coding style.

Sure thing.

> > +	else {
> > +		/*
> > +		 * For queues that do not have a gendisk attached to them, the
> > +		 * debugfs directory will not have been created at setup time.
> > +		 * Create it here lazily, it will only be removed when the
> > +		 * queue is torn down.
> > +		 */
> 
> Is the above comment perhaps a reference to blk_register_queue()? If so, please
> mention the name of that function explicitly.

No, it actually is in reference to *add_disk()* helpers, so I'll add
that there. scsi-generic is the ugly child we have which we don't talk
too much about, not sure if we have a proper name for *non* add_disk()
related use of the request_queue... oh and mmc I think?

I've changed this to (ignore spaces, I'll adjust):

* For queues that do not have a gendisk attached to them, that is those
* which do not use *add_disk*() or similar, the debugfs directory will
* not have been created at setup time.  This is the case for
* scsi-generic drivers.  Create it here lazily, it will only be removed
* when the queue is torn down.

> > +		if (!q->debugfs_dir) {
> > +			q->debugfs_dir =
> > +				debugfs_create_dir(buts->name,
> > +						   blk_debugfs_root);
> > +		}
> > +		dir = q->debugfs_dir;
> > +	}
> > +
> > +	/*
> > +	 * As blktrace relies on debugfs for its interface the debugfs directory
> > +	 * is required, contrary to the usual mantra of not checking for debugfs
> > +	 * files or directories.
> > +	 */
> > +	if (IS_ERR_OR_NULL(q->debugfs_dir)) {
> > +		pr_warn("debugfs_dir not present for %s so skipping\n",
> > +			buts->name);
> > +		return -ENOENT;
> > +	}
> 
> How are do_blk_trace_setup() calls serialized against the debugfs directory
> creation code in blk_register_queue()? Perhaps via q->blk_trace_mutex?

Yes, hence the mutex lock that Christoph added as an alternative to
the whole symlink stuff for scsi-generic and addressing this on the
class interface driver.

> Are
> mutex lock and unlock calls for that mutex perhaps missing from
> compat_blk_trace_setup()?

No, because that is called from blk_trace_ioctl(), and that holds the
mutex.

> How about adding a lockdep_assert_held(&q->blk_trace_mutex) statement in
> do_blk_trace_setup()?

Sure, however that doesn't seem part of the fix. How about adding that
as a separate patch?

  Luis
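
The "caller already holds the mutex" convention discussed here can be
sketched in userspace as follows. This is only an illustration under
stated assumptions: struct traced_queue and the tq_*() helpers are
hypothetical, and a plain flag plus assert() stands in for what
lockdep_assert_held() checks in the kernel. Unlike lockdep, the flag does
not record which thread holds the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of a queue with a trace mutex. */
struct traced_queue {
	pthread_mutex_t lock;	/* stands in for q->blk_trace_mutex */
	bool lock_held;		/* bookkeeping, only changed under lock */
	int setups;		/* number of completed setup calls */
};

static void tq_lock(struct traced_queue *q)
{
	pthread_mutex_lock(&q->lock);
	q->lock_held = true;
}

static void tq_unlock(struct traced_queue *q)
{
	q->lock_held = false;
	pthread_mutex_unlock(&q->lock);
}

/* Inner helper: documents and checks that the caller took the lock. */
static int tq_do_setup(struct traced_queue *q)
{
	assert(q->lock_held);	/* userspace analogue of a held-lock assertion */
	q->setups++;
	return 0;
}

/* Outer entry point: takes the lock once, then calls the inner helper. */
static int tq_ioctl_setup(struct traced_queue *q)
{
	int ret;

	tq_lock(q);
	ret = tq_do_setup(q);
	tq_unlock(q);
	return ret;
}
```

Calling tq_do_setup() directly without tq_lock() would trip the assertion,
which is the kind of misuse such a check is meant to catch early.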

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 5/7] blktrace: fix debugfs use after free
  2020-05-29  7:56         ` Luis Chamberlain
@ 2020-05-29 14:09           ` Bart Van Assche
  0 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2020-05-29 14:09 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm, mhocko, yukuai3, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Omar Sandoval,
	Hannes Reinecke, Michal Hocko, syzbot+603294af2d01acfdd6da

On 2020-05-29 00:56, Luis Chamberlain wrote:
> On Wed, May 27, 2020 at 06:15:10PM -0700, Bart Van Assche wrote:
>> How about adding a lockdep_assert_held(&q->blk_trace_mutex) statement in
>> do_blk_trace_setup()?
> 
> Sure, however that doesn't seem part of the fix. How about adding that
> as a separate patch?

That sounds good to me.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 5/7] blktrace: fix debugfs use after free
  2020-05-27  3:12     ` Luis Chamberlain
  2020-05-28  1:15       ` Bart Van Assche
@ 2020-06-01 17:05       ` Luis Chamberlain
  2020-06-05  4:48         ` Bart Van Assche
  1 sibling, 1 reply; 24+ messages in thread
From: Luis Chamberlain @ 2020-06-01 17:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da

On Wed, May 27, 2020 at 03:12:02AM +0000, Luis Chamberlain wrote:
> You forgot to deal with partitions. Putting similar lipstick on the pig,
> this is what I end up with, let me know if this seems agreeable:

So even with the partition stuff in place, this approach still doesn't
allow multiple uses of blktrace against a scsi-generic device and its
backend real block device, say TYPE_DISK. A simple example: with a SCSI
drive hooked up, users used to be able to run blktrace /dev/sda *and*
blktrace /dev/sg0, but with the proposed change /dev/sg0 no longer
works because the dentry pertains to the '/dev/sda' name, not
'/dev/sg0'.

We can shoehorn in a solution following the style proposed as follows.
We can keep this only slightly cleaner if we don't care about the
extra dentry even if a user disables CONFIG_CHR_DEV_SG. The cost
would just be an extra dentry on the request_queue.

I'll run this through 0-day and then post a new, hopefully final, series,
but if you don't think this is agreeable or would prefer something else,
please let me know.

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 86c107de2836..f46bdc7f6509 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -920,6 +920,9 @@ static void blk_release_queue(struct kobject *kobj)
 	blk_trace_shutdown(q);
 
 	debugfs_remove_recursive(q->debugfs_dir);
+#if defined(CONFIG_CHR_DEV_SG) || defined(CONFIG_CHR_DEV_SG_MODULE)
+	debugfs_remove_recursive(q->sg_debugfs_dir);
+#endif
 	if (queue_is_mq(q))
 		blk_mq_debugfs_unregister(q);
 
@@ -939,6 +942,21 @@ struct kobj_type blk_queue_ktype = {
 	.release	= blk_release_queue,
 };
 
+#if defined(CONFIG_CHR_DEV_SG) || defined(CONFIG_CHR_DEV_SG_MODULE)
+/**
+ * blk_sg_debugfs_init - initialize debugfs for scsi-generic
+ * @q: the associated queue
+ * @name: name of the scsi-generic device
+ *
+ * To be used by scsi-generic for allowing it to use blktrace.
+ */
+void blk_sg_debugfs_init(struct request_queue *q, const char *name)
+{
+	q->sg_debugfs_dir = debugfs_create_dir(name, blk_debugfs_root);
+}
+EXPORT_SYMBOL_GPL(blk_sg_debugfs_init);
+#endif
+
 /**
  * blk_register_queue - register a block layer queue with sysfs
  * @disk: Disk of which the request queue should be registered with sysfs.
diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 20472aaaf630..c87fe1923f3d 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -1519,6 +1519,7 @@ static int
 sg_add_device(struct device *cl_dev, struct class_interface *cl_intf)
 {
 	struct scsi_device *scsidp = to_scsi_device(cl_dev->parent);
+	struct request_queue *q = scsidp->request_queue;
 	struct gendisk *disk;
 	Sg_device *sdp = NULL;
 	struct cdev * cdev = NULL;
@@ -1573,6 +1574,7 @@ sg_add_device(struct device *cl_dev, struct class_interface *cl_intf)
 	} else
 		pr_warn("%s: sg_sys Invalid\n", __func__);
 
+	blk_sg_debugfs_init(q, disk->disk_name);
 	sdev_printk(KERN_NOTICE, scsidp, "Attached scsi generic sg%d "
 		    "type %d\n", sdp->index, scsidp->type);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5877b03b8117..be5a40d59f60 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -575,6 +575,9 @@ struct request_queue {
 	struct bio_set		bio_split;
 
 	struct dentry		*debugfs_dir;
+#if defined(CONFIG_CHR_DEV_SG) || defined(CONFIG_CHR_DEV_SG_MODULE)
+	struct dentry		*sg_debugfs_dir;
+#endif
 #ifdef CONFIG_BLK_DEBUG_FS
 	struct dentry		*sched_debugfs_dir;
 	struct dentry		*rqos_debugfs_dir;
@@ -858,6 +861,14 @@ static inline void rq_flush_dcache_pages(struct request *rq)
 
 extern int blk_register_queue(struct gendisk *disk);
 extern void blk_unregister_queue(struct gendisk *disk);
+#if defined(CONFIG_CHR_DEV_SG) || defined(CONFIG_CHR_DEV_SG_MODULE)
+extern void blk_sg_debugfs_init(struct request_queue *q, const char *name);
+#else
+static inline void blk_sg_debugfs_init(struct request_queue *q,
+				       const char *name)
+{
+}
+#endif
 extern blk_qc_t generic_make_request(struct bio *bio);
 extern blk_qc_t direct_make_request(struct bio *bio);
 extern void blk_rq_init(struct request_queue *q, struct request *rq);
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index a55cbfd060f5..5b0310f38e11 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -511,6 +511,11 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	 */
 	if (bdev && bdev != bdev->bd_contains) {
 		dir = bdev->bd_part->debugfs_dir;
+	} else if (q->sg_debugfs_dir &&
+		   strlen(buts->name) == strlen(q->sg_debugfs_dir->d_name.name)
+		   && strcmp(buts->name, q->sg_debugfs_dir->d_name.name) == 0) {
+		/* scsi-generic requires use of its own directory */
+		dir = q->sg_debugfs_dir;
 	} else {
 		/*
 		 * For queues that do not have a gendisk attached to them, that

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 5/7] blktrace: fix debugfs use after free
  2020-06-01 17:05       ` Luis Chamberlain
@ 2020-06-05  4:48         ` Bart Van Assche
  2020-06-05 22:33           ` Luis Chamberlain
  0 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2020-06-05  4:48 UTC (permalink / raw)
  To: Luis Chamberlain, Christoph Hellwig
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On 2020-06-01 10:05, Luis Chamberlain wrote:
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index a55cbfd060f5..5b0310f38e11 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -511,6 +511,11 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
>  	 */
>  	if (bdev && bdev != bdev->bd_contains) {
>  		dir = bdev->bd_part->debugfs_dir;
> +	} else if (q->sg_debugfs_dir &&
> +		   strlen(buts->name) == strlen(q->sg_debugfs_dir->d_name.name)
> +		   && strcmp(buts->name, q->sg_debugfs_dir->d_name.name) == 0) {
> +		/* scsi-generic requires use of its own directory */
> +		dir = q->sg_debugfs_dir;
>  	} else {
>  		/*
>  		 * For queues that do not have a gendisk attached to them, that
> 

Please Cc Martin Petersen for patches that modify SCSI code.

The string comparison check looks fragile to me. Is the purpose of that
check perhaps to verify whether tracing is being activated through the
SCSI generic interface? If so, how about changing that test into
something like the following?

	MAJOR(dev) == SCSI_GENERIC_MAJOR

Thanks,

Bart.
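
The difference between the two checks can be sketched in userspace. This
is a hedged illustration, not the kernel change itself:
pick_trace_dir(), enum trace_dir_kind and MODEL_SCSI_GENERIC_MAJOR are
hypothetical names, the glibc major()/makedev() macros stand in for the
kernel's MAJOR()/MKDEV(), and the constant 21 repeats the value of
SCSI_GENERIC_MAJOR from <linux/major.h> so the example builds on its own.

```c
#include <stdbool.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Same value as SCSI_GENERIC_MAJOR in <linux/major.h>; repeated here
 * only to keep the sketch self-contained. */
#define MODEL_SCSI_GENERIC_MAJOR 21

/* Which debugfs directory a trace setup would use in this model. */
enum trace_dir_kind {
	TRACE_DIR_PARTITION,	/* per-partition directory */
	TRACE_DIR_SG,		/* scsi-generic directory */
	TRACE_DIR_QUEUE,	/* whole-disk request_queue directory */
};

/*
 * Select the directory from the device number instead of comparing the
 * requested name against a dentry name.
 */
static enum trace_dir_kind pick_trace_dir(dev_t dev, bool is_partition)
{
	if (is_partition)
		return TRACE_DIR_PARTITION;
	if (major(dev) == MODEL_SCSI_GENERIC_MAJOR)
		return TRACE_DIR_SG;
	return TRACE_DIR_QUEUE;
}
```

Keying on the major number avoids depending on how the device happens to
be named, which is the fragility raised about the strlen()/strcmp() test.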

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 5/7] blktrace: fix debugfs use after free
  2020-06-05  4:48         ` Bart Van Assche
@ 2020-06-05 22:33           ` Luis Chamberlain
  0 siblings, 0 replies; 24+ messages in thread
From: Luis Chamberlain @ 2020-06-05 22:33 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm, mhocko, yukuai3, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Omar Sandoval,
	Hannes Reinecke, Michal Hocko, syzbot+603294af2d01acfdd6da

On Thu, Jun 04, 2020 at 09:48:43PM -0700, Bart Van Assche wrote:
> On 2020-06-01 10:05, Luis Chamberlain wrote:
> > diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> > index a55cbfd060f5..5b0310f38e11 100644
> > --- a/kernel/trace/blktrace.c
> > +++ b/kernel/trace/blktrace.c
> > @@ -511,6 +511,11 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
> >  	 */
> >  	if (bdev && bdev != bdev->bd_contains) {
> >  		dir = bdev->bd_part->debugfs_dir;
> > +	} else if (q->sg_debugfs_dir &&
> > +		   strlen(buts->name) == strlen(q->sg_debugfs_dir->d_name.name)
> > +		   && strcmp(buts->name, q->sg_debugfs_dir->d_name.name) == 0) {
> > +		/* scsi-generic requires use of its own directory */
> > +		dir = q->sg_debugfs_dir;
> >  	} else {
> >  		/*
> >  		 * For queues that do not have a gendisk attached to them, that
> > 
> 
> Please Cc Martin Petersen for patches that modify SCSI code.

Sure thing.

> The string comparison check looks fragile to me. Is the purpose of that
> check perhaps to verify whether tracing is being activated through the
> SCSI generic interface?

Yes.

> If so, how about changing that test into
> something like the following?
> 
> 	MAJOR(dev) == SCSI_GENERIC_MAJOR

Sure.

  Luis
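
For illustration only, the check agreed on above can be modelled in a small
userspace sketch. This is not the code from the patch: MINORBITS, MAJOR(),
MKDEV() and SCSI_GENERIC_MAJOR are redefined locally with the values the
kernel headers use, while enum trace_dir and pick_trace_dir() are
hypothetical names standing in for the three debugfs directories chosen in
do_blk_trace_setup(). The is_partition flag is an assumed simplification of
the "bdev && bdev != bdev->bd_contains" test.

```c
#include <stdbool.h>

/* Local copies of kernel definitions so the sketch builds in userspace. */
#define MINORBITS		20
#define MAJOR(dev)		((unsigned int)((dev) >> MINORBITS))
#define MKDEV(ma, mi)		(((ma) << MINORBITS) | (mi))
#define SCSI_GENERIC_MAJOR	21

/* Hypothetical stand-ins for the debugfs directories used by the patch. */
enum trace_dir { DIR_PARTITION, DIR_SG, DIR_QUEUE };

/*
 * Pick the directory from the device number instead of comparing
 * buts->name against a dentry name.
 */
static enum trace_dir pick_trace_dir(unsigned int dev, bool is_partition)
{
	if (is_partition)
		return DIR_PARTITION;
	/* scsi-generic requires use of its own directory */
	if (MAJOR(dev) == SCSI_GENERIC_MAJOR)
		return DIR_SG;
	return DIR_QUEUE;
}
```

With this shape the decision depends only on the dev_t passed in, so a
device with major 21 selects the scsi-generic directory regardless of the
name supplied from userspace.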

Thread overview: 24+ messages
2020-05-16  3:19 [PATCH v5 0/7] block: fix blktrace debugfs use after free Luis Chamberlain
2020-05-16  3:19 ` [PATCH v5 1/7] block: add docs for gendisk / request_queue refcount helpers Luis Chamberlain
2020-05-16  3:19 ` [PATCH v5 2/7] block: clarify context for gendisk / request_queue refcount increment helpers Luis Chamberlain
2020-05-16  3:19 ` [PATCH v5 3/7] block: revert back to synchronous request_queue removal Luis Chamberlain
2020-05-16  3:19 ` [PATCH v5 4/7] block: move main block debugfs initialization to its own file Luis Chamberlain
2020-05-19 15:33   ` Christoph Hellwig
2020-05-16  3:19 ` [PATCH v5 5/7] blktrace: fix debugfs use after free Luis Chamberlain
2020-05-19 14:44   ` Greg KH
2020-05-19 15:52     ` Luis Chamberlain
2020-05-19 17:03       ` Greg KH
2020-05-19 16:37   ` Christoph Hellwig
2020-05-19 16:54     ` Greg KH
2020-05-27  3:12     ` Luis Chamberlain
2020-05-28  1:15       ` Bart Van Assche
2020-05-29  7:56         ` Luis Chamberlain
2020-05-29 14:09           ` Bart Van Assche
2020-06-01 17:05       ` Luis Chamberlain
2020-06-05  4:48         ` Bart Van Assche
2020-06-05 22:33           ` Luis Chamberlain
2020-05-16  3:19 ` [PATCH v5 6/7] blktrace: break out of blktrace setup on concurrent calls Luis Chamberlain
2020-05-19 15:37   ` Christoph Hellwig
2020-05-19 16:10   ` Bart Van Assche
2020-05-16  3:19 ` [PATCH v5 7/7] loop: be paranoid on exit and prevent new additions / removals Luis Chamberlain
2020-05-19 15:36   ` Christoph Hellwig
