linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 00/10] block: fix blktrace debugfs use after free
@ 2020-04-19 19:45 Luis Chamberlain
  2020-04-19 19:45 ` [PATCH v2 01/10] block: move main block debugfs initialization to its own file Luis Chamberlain
                   ` (10 more replies)
  0 siblings, 11 replies; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-19 19:45 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain

Upstream kernel.org korg#205713 [0] states that there is a UAF in the
core debugfs debugfs_remove() function, and has gone through pushing for
a CVE for this, CVE-2019-19770 [1].

This patch series disputes that CVE, and shows how the issue was just
a complex misuse of debugfs within blktrace and fixes it.

On this v2 I've dropped two patches from my last series which are not
needed to ensure we can move back to a synchronous request_queue
removal. I've also addressed Ming's feedback on ensuring we keep
functionality working for when paritions are used for a blktrace.  That
effort lead me to ensuring we don't try to overwrite the request_queue
debugfs_dir, and add sanity checks in place so that what we give back
only what is expected.

Although my v1 patches also had fixed the kernel splat we get when we
try to reproduce the issue:

debugfs: Directory 'loop0' with parent 'block' already present!

This v2 series now provides a clear explanation for *why* this was
ultimately one of the reasons why we ended up with a crash.

The commit log for the actual fix, patch 3/10, "blktrace: fix debugfs
use after free" has also been extended to provide a better explanation
as to *how* overwriting the debugfs_dir leads to an eventual panic with
blktrace. I hope that helps, as it seems the root cause was still not
well explained in the commit log.

To make review easier, I've also added some helper functions with no
functional changes at first, and only extended them later.

Also changed is blk_queue_debugfs_register() to return an int, we do
this to not make the fact that we don't check for errors on
register_disk() or add_disk() any worse.

After the patch 3/10 "blktrace: fix debugfs use after free", is applied,
with the pr_warns(), and prior to reverting back to synchronous request_queue
removal, if we try to reproduce the issue with break-blktrace [2]'s
./run_0001.sh script, we'd see::

blk_debugfs: loop0 : registering request_queue debugfs directory twice is not allowed
blktrace: loop0: request_queue parent is gone

And sometimes only:

blk_debugfs: loop0 : registering request_queue debugfs directory twice is not allowed

After we revert back to synchronous request_queue removal this should no
longer be possible, and if it is, we want to hear about it. To help with
this two patches are added which change pr_warn() to BUG_ON()s after we flip
back to synchronous request_queue removal.

Note that on patch 6/10 "blk-debugfs: upgrade warns to BUG_ON() if
directory" I explain the syfs layout between a gendisk and the
request_queue. By reverting back to synchronous request_queue removal,
if someone manages to figure out a way to create a clash with
registering block devices, we expect to see a sysfs clash now instead of
a clash with debugfs, as the debugfs directory is removed now always
first, prior to clearing out the sysfs dir. *If* there are races
possible in these areas, we want to hear about them, and the BUG_ON()s
should make it clearer *where* the real issue is coming from.

Having an asynchronous request_queue removal has exposed other bugs
lingering around, however most importantly I think its revealing more
the value of adding error handling for __device_add_disk() and friends.
If its encouraged I could take a stab at finally addressing that for
good.

You can find this code on my git tree, on the 20200417-blktrace-fixes
branch, which is based on linux-next tag next-20200417 [3].

[0] https://bugzilla.kernel.org/show_bug.cgi?id=205713                          
[1] https://nvd.nist.gov/vuln/detail/CVE-2019-19770 
[2] https://github.com/mcgrof/break-blktrace
[3] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20200417-blktrace-fixes

Luis Chamberlain (10):
  block: move main block debugfs initialization to its own file
  blktrace: move blktrace debugfs creation to helper function
  blktrace: fix debugfs use after free
  block: revert back to synchronous request_queue removal
  blktrace: upgrade warns to BUG_ON() on unexpected circmunstances
  blk-debugfs: upgrade warns to BUG_ON() if directory is already found
  blktrace: move debugfs file creation to its own function
  blktrace: add checks for created debugfs files on setup
  block: panic if block debugfs dir is not created
  block: put_device() if device_add() fails

 block/Makefile               |   1 +
 block/blk-core.c             |  28 +++++---
 block/blk-debugfs.c          |  39 ++++++++++++
 block/blk-mq-debugfs.c       |   5 --
 block/blk-sysfs.c            |  47 ++++++++------
 block/blk.h                  |  18 ++++++
 block/genhd.c                |   4 +-
 include/linux/blkdev.h       |   7 +-
 include/linux/blktrace_api.h |   1 +
 kernel/trace/blktrace.c      | 120 +++++++++++++++++++++++++++++++----
 10 files changed, 218 insertions(+), 52 deletions(-)
 create mode 100644 block/blk-debugfs.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v2 01/10] block: move main block debugfs initialization to its own file
  2020-04-19 19:45 [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain
@ 2020-04-19 19:45 ` Luis Chamberlain
  2020-04-19 21:06   ` Bart Van Assche
  2020-04-19 19:45 ` [PATCH v2 02/10] blktrace: move blktrace debugfs creation to helper function Luis Chamberlain
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-19 19:45 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

make_request-based drivers and and request-based drivers share some
some debugfs code. By moving this into its own file it makes it easier
to expand and audit this shared code.

This patch contains no functional changes.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/Makefile      |  1 +
 block/blk-core.c    |  9 +--------
 block/blk-debugfs.c | 15 +++++++++++++++
 block/blk.h         |  7 +++++++
 4 files changed, 24 insertions(+), 8 deletions(-)
 create mode 100644 block/blk-debugfs.c

diff --git a/block/Makefile b/block/Makefile
index 206b96e9387f..1d3ab20505d8 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-sysfs.o \
 			blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
 			genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o
 
+obj-$(CONFIG_DEBUG_FS)		+= blk-debugfs.o
 obj-$(CONFIG_BOUNCE)		+= bounce.o
 obj-$(CONFIG_BLK_SCSI_REQUEST)	+= scsi_ioctl.o
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 7e4a1da0715e..5aaae7a1b338 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -48,10 +48,6 @@
 #include "blk-pm.h"
 #include "blk-rq-qos.h"
 
-#ifdef CONFIG_DEBUG_FS
-struct dentry *blk_debugfs_root;
-#endif
-
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
@@ -1796,10 +1792,7 @@ int __init blk_dev_init(void)
 
 	blk_requestq_cachep = kmem_cache_create("request_queue",
 			sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
-
-#ifdef CONFIG_DEBUG_FS
-	blk_debugfs_root = debugfs_create_dir("block", NULL);
-#endif
+	blk_debugfs_register();
 
 	return 0;
 }
diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
new file mode 100644
index 000000000000..19091e1effc0
--- /dev/null
+++ b/block/blk-debugfs.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Shared request-based / make_request-based functionality
+ */
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/debugfs.h>
+
+struct dentry *blk_debugfs_root;
+
+void blk_debugfs_register(void)
+{
+	blk_debugfs_root = debugfs_create_dir("block", NULL);
+}
diff --git a/block/blk.h b/block/blk.h
index 0a94ec68af32..86a66b614f08 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -487,5 +487,12 @@ struct request_queue *__blk_alloc_queue(int node_id);
 int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		bool *same_page);
+#ifdef CONFIG_DEBUG_FS
+void blk_debugfs_register(void);
+#else
+static inline void blk_debugfs_register(void)
+{
+}
+#endif /* CONFIG_DEBUG_FS */
 
 #endif /* BLK_INTERNAL_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 02/10] blktrace: move blktrace debugfs creation to helper function
  2020-04-19 19:45 [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain
  2020-04-19 19:45 ` [PATCH v2 01/10] block: move main block debugfs initialization to its own file Luis Chamberlain
@ 2020-04-19 19:45 ` Luis Chamberlain
  2020-04-19 21:11   ` Bart Van Assche
  2020-04-22  7:12   ` Christoph Hellwig
  2020-04-19 19:45 ` [PATCH v2 03/10] blktrace: fix debugfs use after free Luis Chamberlain
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-19 19:45 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain

Move the work to create the debugfs directory used into a helper.
It will make further checks easier to read. This commit introduces
no functional changes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/trace/blktrace.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index ca39dc3230cb..2c6e6c386ace 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -468,6 +468,18 @@ static void blk_trace_setup_lba(struct blk_trace *bt,
 	}
 }
 
+static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
+					    struct blk_trace *bt)
+{
+	struct dentry *dir = NULL;
+
+	dir = debugfs_lookup(buts->name, blk_debugfs_root);
+	if (!dir)
+		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
+
+	return dir;
+}
+
 /*
  * Setup everything required to start tracing
  */
@@ -509,9 +521,7 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 
 	ret = -ENOENT;
 
-	dir = debugfs_lookup(buts->name, blk_debugfs_root);
-	if (!dir)
-		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
+	dir = blk_trace_debugfs_dir(buts, bt);
 
 	bt->dev = dev;
 	atomic_set(&bt->dropped, 0);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-19 19:45 [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain
  2020-04-19 19:45 ` [PATCH v2 01/10] block: move main block debugfs initialization to its own file Luis Chamberlain
  2020-04-19 19:45 ` [PATCH v2 02/10] blktrace: move blktrace debugfs creation to helper function Luis Chamberlain
@ 2020-04-19 19:45 ` Luis Chamberlain
  2020-04-19 21:55   ` Bart Van Assche
                     ` (2 more replies)
  2020-04-19 19:45 ` [PATCH v2 04/10] block: revert back to synchronous request_queue removal Luis Chamberlain
                   ` (7 subsequent siblings)
  10 siblings, 3 replies; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-19 19:45 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da

On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
merged on v4.12 Omar fixed the original blktrace code for request-based
drivers (multiqueue). This however left in place a possible crash, if you
happen to abuse blktrace in a way it was not intended, and even more so
with our current asynchronous request_queue removal.

Namely, if you loop adding a device, setup the blktrace with BLKTRACESETUP,
forget to BLKTRACETEARDOWN, and then just remove the device you end up
with a panic:

[  107.193134] debugfs: Directory 'loop0' with parent 'block' already present!
[  107.254615] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[  107.258785] #PF: supervisor write access in kernel mode
[  107.262035] #PF: error_code(0x0002) - not-present page
[  107.264106] PGD 0 P4D 0
[  107.264404] Oops: 0002 [#1] SMP NOPTI
[  107.264803] CPU: 8 PID: 674 Comm: kworker/8:2 Tainted: G            E 5.6.0-rc7-next-20200327 #1
[  107.265712] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[  107.266553] Workqueue: events __blk_release_queue
[  107.267051] RIP: 0010:down_write+0x15/0x40
[  107.267488] Code: eb ca e8 ee a5 8d ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 55 00 75 0f  65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
[  107.269300] RSP: 0018:ffff9927c06efda8 EFLAGS: 00010246
[  107.269841] RAX: 0000000000000000 RBX: ffff8be7e73b0600 RCX: ffffff8100000000
[  107.270559] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[  107.271281] RBP: 00000000000000a0 R08: ffff8be7ebc80fa8 R09: ffff8be7ebc80fa8
[  107.272001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  107.272722] R13: ffff8be7efc30400 R14: ffff8be7e0571200 R15: 00000000000000a0
[  107.273475] FS:  0000000000000000(0000) GS:ffff8be7efc00000(0000) knlGS:0000000000000000
[  107.274346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  107.274968] CR2: 00000000000000a0 CR3: 000000042abee003 CR4: 0000000000360ee0
[  107.275710] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  107.276465] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  107.277214] Call Trace:
[  107.277532]  simple_recursive_removal+0x4e/0x2e0
[  107.278049]  ? debugfs_remove+0x60/0x60
[  107.278493]  debugfs_remove+0x40/0x60
[  107.278922]  blk_trace_free+0xd/0x50
[  107.279339]  __blk_trace_remove+0x27/0x40
[  107.279797]  blk_trace_shutdown+0x30/0x40
[  107.280256]  __blk_release_queue+0xab/0x110
[  107.280734]  process_one_work+0x1b4/0x380
[  107.281194]  worker_thread+0x50/0x3c0
[  107.281622]  kthread+0xf9/0x130
[  107.281994]  ? process_one_work+0x380/0x380
[  107.282467]  ? kthread_park+0x90/0x90
[  107.282895]  ret_from_fork+0x1f/0x40
[  107.283316] Modules linked in: loop(E) <etc>
[  107.288562] CR2: 00000000000000a0
[  107.288957] ---[ end trace b885d243d441bbce ]---

This splat happens to be very similar to the one reported via
kernel.org korg#205713, only that korg#205713 was for v4.19.83
and the above now includes the simple_recursive_removal() introduced
via commit a3d1e7eb5abe ("simple_recursive_removal(): kernel-side rm
-rf for ramfs-style filesystems") merged on v5.6.

korg#205713 then was used to create CVE-2019-19770 and claims that
the bug is in a use-after-free in the debugfs core code. The
implications of this being a generic UAF on debugfs would be
much more severe, as it would imply parent dentries can sometimes
not be positive, which we hold by design is just not possible.

Below is the splat explained with a bit more details, explaining
what is happening in userspace, kernel, and a print of the CPU on,
which the code runs on:

load loopback module
[   13.603371] == blk_mq_debugfs_register(12) start
[   13.604040] == blk_mq_debugfs_register(12) q->debugfs_dir created
[   13.604934] == blk_mq_debugfs_register(12) end
[   13.627382] == blk_mq_debugfs_register(12) start
[   13.628041] == blk_mq_debugfs_register(12) q->debugfs_dir created
[   13.629240] == blk_mq_debugfs_register(12) end
[   13.651667] == blk_mq_debugfs_register(12) start
[   13.652836] == blk_mq_debugfs_register(12) q->debugfs_dir created
[   13.655107] == blk_mq_debugfs_register(12) end
[   13.684917] == blk_mq_debugfs_register(12) start
[   13.687876] == blk_mq_debugfs_register(12) q->debugfs_dir created
[   13.691588] == blk_mq_debugfs_register(13) end
[   13.707320] == blk_mq_debugfs_register(13) start
[   13.707863] == blk_mq_debugfs_register(13) q->debugfs_dir created
[   13.708856] == blk_mq_debugfs_register(13) end
[   13.735623] == blk_mq_debugfs_register(13) start
[   13.736656] == blk_mq_debugfs_register(13) q->debugfs_dir created
[   13.738411] == blk_mq_debugfs_register(13) end
[   13.763326] == blk_mq_debugfs_register(13) start
[   13.763972] == blk_mq_debugfs_register(13) q->debugfs_dir created
[   13.765167] == blk_mq_debugfs_register(13) end
[   13.779510] == blk_mq_debugfs_register(13) start
[   13.780522] == blk_mq_debugfs_register(13) q->debugfs_dir created
[   13.782338] == blk_mq_debugfs_register(13) end
[   13.783521] loop: module loaded

LOOP_CTL_DEL(loop0) #1
[   13.803550] = __blk_release_queue(4) start
[   13.807772] == blk_trace_shutdown(4) start
[   13.810749] == blk_trace_shutdown(4) end
[   13.813437] = __blk_release_queue(4) calling blk_mq_debugfs_unregister()
[   13.817593] ==== blk_mq_debugfs_unregister(4) begin
[   13.817621] ==== blk_mq_debugfs_unregister(4) debugfs_remove_recursive(q->debugfs_dir)
[   13.821203] ==== blk_mq_debugfs_unregister(4) end q->debugfs_dir is NULL
[   13.826166] = __blk_release_queue(4) blk_mq_debugfs_unregister() end
[   13.832992] = __blk_release_queue(4) end

LOOP_CTL_ADD(loop0) #1
[   13.843742] == blk_mq_debugfs_register(7) start
[   13.845569] == blk_mq_debugfs_register(7) q->debugfs_dir created
[   13.848628] == blk_mq_debugfs_register(7) end

BLKTRACE_SETUP(loop0) #1
[   13.850924] == blk_trace_ioctl(7, BLKTRACESETUP) start
[   13.852852] === do_blk_trace_setup(7) start
[   13.854580] === do_blk_trace_setup(7) creating directory
[   13.856620] === do_blk_trace_setup(7) using what debugfs_lookup() gave
[   13.860635] === do_blk_trace_setup(7) end with ret: 0
[   13.862615] == blk_trace_ioctl(7, BLKTRACESETUP) end

LOOP_CTL_DEL(loop0) #2
[   13.883304] = __blk_release_queue(7) start
[   13.885324] == blk_trace_shutdown(7) start
[   13.887197] == blk_trace_shutdown(7) calling __blk_trace_remove()
[   13.889807] == __blk_trace_remove(7) start
[   13.891669] === blk_trace_cleanup(7) start
[   13.911656] ====== blk_trace_free(7) start

This is a scheduled __blk_release_queue(). Note that at this point
we know that the old device is gone as the gendisk removal is
synchronous.

LOOP_CTL_ADD(loop0) #2
[   13.912709] == blk_mq_debugfs_register(2) start

This is a *new* device, as we know the last one was removed synchronously.

---> From LOOP_CTL_DEL(loop0) #2
[   13.915887] ====== blk_trace_free(7) end

---> From LOOP_CTL_ADD(loop0) #2
[   13.918359] debugfs: Directory 'loop0' with parent 'block' already present!

Above we have a case where the upon LOOP_CTL_ADD(loop) #2 the
debugfs_create_dir() start_creating() call will have found an existing dentry,
and return ERR_PTR(-EEXIST), and so the q->debugfs_dir will be initialized to
NULL. This clash happens because an asynchronous __blk_release_queue() has
been scheduled from the above LOOP_CTL_DEL(loop0) #2 call, and it has not yet
removed the old loop0 debugfs directory.

[   13.926433] == blk_mq_debugfs_register(2) q->debugfs_dir created

This is a huge assumption, since we never check for errors on
blk_mq_debugfs_register(), and so the child files we need to
create *can* simply be created using NULL as as the parent, in
which case debugfs would use debugfs_mount->mnt_root as the parent.

[   13.930373] == blk_mq_debugfs_register(2) end

BLKTRACE_SETUP(loop0) #2
[   13.933961] == blk_trace_ioctl(2, BLKTRACESETUP) start
[   13.936758] === do_blk_trace_setup(2) start
[   13.938944] === do_blk_trace_setup(2) creating directory
[   13.941029] === do_blk_trace_setup(2) using what debugfs_lookup() gave

Here debugfs_lookup() will have found the old debugfs directory, even
though q->debugfs_dir was empty when initializing this device. So the
files created can actually be created where they should have.

---> From LOOP_CTL_DEL(loop0) #2
[   13.971046] === blk_trace_cleanup(7) end
[   13.973175] == __blk_trace_remove(7) end
[   13.975352] == blk_trace_shutdown(7) end
[   13.977415] = __blk_release_queue(7) calling blk_mq_debugfs_unregister()
[   13.980645] ==== blk_mq_debugfs_unregister(7) begin
[   13.980696] ==== blk_mq_debugfs_unregister(7) debugfs_remove_recursive(q->debugfs_dir)

This is a continuation from the old scheduled  __blk_release_queue(). It
is using another request_queue structure than what the currently activated
blktrace setup is using.

Here now the dentry where the blktrace setup files were created on top
of is gone.  In fact, since debugfs_remove_recursive() is used, *all* of
those blktrace debugfs files should have been removed as well, including
the relay. This is despite the fact that the removal and blktrace setup
are working on two different request_queue structures.

[   13.983118] ==== blk_mq_debugfs_unregister(7) end q->debugfs_dir is NULL
[   13.986945] = __blk_release_queue(7) blk_mq_debugfs_unregister() end
[   13.993155] = __blk_release_queue(7) end

The scheduled __blk_release_queue completes finally.

---> From BLKTRACE_SETUP(loop0) #2
[   13.995928] === do_blk_trace_setup(2) end with ret: 0
[   13.997623] == blk_trace_ioctl(2, BLKTRACESETUP) end

At this point the old blktrace *setup* is complete though, and let us
recall that the old dentry is still used. But recall that we just had
our old device removed, and since debugfs_remove_recursive() was used,
so are all of its files. This request_queue however is still active.

LOOP_CTL_DEL(loop0) #3
[   14.035119] = __blk_release_queue(2) start
[   14.036925] == blk_trace_shutdown(2) start
[   14.038518] == blk_trace_shutdown(2) calling __blk_trace_remove()
[   14.040829] == __blk_trace_remove(2) start
[   14.042413] === blk_trace_cleanup(2) start

This removal will cleanup the blktrace setup. Note the trap though,
the blktrace setup will already have been cleaned up since the
scheduled removal from instance #2 removed the files on the same
debugfs path.

The __blk_release_queue() is scheduled and is asynchronous so can
continue later, and that is where the crash happens, on CPU 2.

LOOP_CTL_ADD(loop0) #3
[   14.072522] == blk_mq_debugfs_register(6) start

---> From LOOP_CTL_DEL(loop0) #3
[   14.075151] ====== blk_trace_free(2) start

---> From LOOP_CTL_ADD(loop0) #3
[   14.075882] == blk_mq_debugfs_register(6) q->debugfs_dir created

---> From LOOP_CTL_DEL(loop0) #3
The panic happens on CPU 2 when we continue with the old scheduled
removal of the request_queue, which will run into a blktrace setup
with NULL pointers.

[   14.078624] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[   14.084332] == blk_mq_debugfs_register(6) end
[   14.086971] #PF: supervisor write access in kernel mode
[   14.086974] #PF: error_code(0x0002) - not-present page
[   14.086977] PGD 0 P4D 0
[   14.086984] Oops: 0002 [#1] SMP NOPTI
[   14.086990] CPU: 2 PID: 287 Comm: kworker/2:2 Tainted: G            E 5.6.0-next-20200403+ #54
[   14.086991] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[   14.087002] Workqueue: events __blk_release_queue
[   14.087011] RIP: 0010:down_write+0x15/0x40
[   14.090300] == blk_trace_ioctl(6, BLKTRACESETUP) start
[   14.093277] Code: eb ca e8 3e 34 8d ff cc cc cc cc cc cc cc cc cc cc
cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00
00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
[   14.093280] RSP: 0018:ffffc28a00533da8 EFLAGS: 00010246
[   14.093284] RAX: 0000000000000000 RBX: ffff9f7a24d07980 RCX: ffffff8100000000
[   14.093286] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[   14.093287] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000019
[   14.093289] R10: 0000000000000774 R11: 0000000000000000 R12: 0000000000000000
[   14.093291] R13: ffff9f7a2fab0400 R14: ffff9f7a21dd1140 R15: 00000000000000a0
[   14.093294] FS:  0000000000000000(0000) GS:ffff9f7a2fa80000(0000) knlGS:0000000000000000
[   14.093296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.093298] CR2: 00000000000000a0 CR3: 00000004293d2003 CR4: 0000000000360ee0
[   14.093307] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   14.093308] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   14.093310] Call Trace:
[   14.093324]  simple_recursive_removal+0x4e/0x2e0
[   14.093330]  ? debugfs_remove+0x60/0x60
[   14.093334]  debugfs_remove+0x40/0x60
[   14.093339]  blk_trace_free+0x20/0x70
[   14.093346]  __blk_trace_remove+0x54/0x90
[   14.096704] === do_blk_trace_setup(6) start
[   14.098534]  blk_trace_shutdown+0x74/0x80
[   14.100958] === do_blk_trace_setup(6) creating directory
[   14.104575]  __blk_release_queue+0xbe/0x160
[   14.104580]  process_one_work+0x1b4/0x380
[   14.104585]  worker_thread+0x50/0x3c0
[   14.104589]  kthread+0xf9/0x130
[   14.104593]  ? process_one_work+0x380/0x380
[   14.104596]  ? kthread_park+0x90/0x90
[   14.104599]  ret_from_fork+0x1f/0x40
[   14.104603] Modules linked in: loop(E) xfs(E) libcrc32c(E)
crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) joydev(E)
serio_raw(E) aesni_intel(E) glue_helper(E) virtio_balloon(E) evdev(E)
crypto_simd(E) pcspkr(E) cryptd(E) i6300esb(E) button(E) ip_tables(E)
x_tables(E) autofs4(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E)
jbd2(E) virtio_net(E) net_failover(E) failover(E) virtio_blk(E)
ata_generic(E) uhci_hcd(E) ata_piix(E) ehci_hcd(E) nvme(E) libata(E)
crc32c_intel(E) usbcore(E) psmouse(E) nvme_core(E) virtio_pci(E)
scsi_mod(E) virtio_ring(E) t10_pi(E) virtio(E) i2c_piix4(E) floppy(E)
[   14.107400] === do_blk_trace_setup(6) using what debugfs_lookup() gave
[   14.108939] CR2: 00000000000000a0
[   14.110589] === do_blk_trace_setup(6) end with ret: 0
[   14.111592] ---[ end trace 7a783b33b9614db9 ]---

The root cause to this issue is that debugfs_lookup() can find a
previous incarnation's dir of the same name which is about to get
removed from a not yet schedule work. When that happens, the the files
are taken underneath the nose of the blktrace, and when it comes time to
cleanup, these dentries are long gone because of a scheduled removal.

This issue is happening because of two reasons:

  1) The request_queue is currently removed asynchronously as of commit
     dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
     v4.12, this allows races with userspace which were not possible
     before unless as removal of a block device used to happen
     synchronously with its request_queue. One could however still
     parallelize blksetup calls while one loops on device addition and
     removal.

  2) There are no errors checks when we create the debugfs directory,
     be it on init or for blktrace. The concept of when the directory
     *should* really exist is further complicated by the fact that
     we have asynchronous request_queue removal. And, we have no
     real sanity checks to ensure we don't re-create the queue debugfs
     directory.

We can fix the UAF by using a debugfs directory which moving forward
will always be accessible if debugfs is enabled, this way, its allocated
and avaialble always for both request-based block drivers or
make_request drivers (multiqueue) block drivers.

We also put sanity checks in place to ensure that if the directory is
found with debugfs_lookup() it is the dentry we expect. When doing a
blktrace against a parition, we will always be creating a temporary
debugfs directory, so ensure that only exists once as well to avoid
issues against concurrent blktrace runs.

Lastly, since we are now always creating the needed request_queue
debugfs directory upon init, we can also take the initiative to
proactively check against errors. We currently do not check for
errors on add_disk() and friends, but we shouldn't make the issue
any worse.

This also simplifies the code considerably, with the only penalty now
being that we're always creating the request queue debugfs directory for
the request-based block device drivers.

The UAF then is not a core debugfs issue, but instead a complex misuse
of debugfs, and this issue can only be triggered if you are root.

This issue can be reproduced with break-blktrace [2] using:

  break-blktrace -c 10 -d -s

This patch fixes this issue. Note that there is also another
respective UAF but from the ioctl path [3], this should also fix
that issue.

This patch then also disputes the severity of CVE-2019-19770 as
this issue is only possible by being root and using blktrace.

It is not a core debugfs issue.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=205713
[1] https://nvd.nist.gov/vuln/detail/CVE-2019-19770
[2] https://github.com/mcgrof/break-blktrace
[3] https://lore.kernel.org/lkml/000000000000ec635b059f752700@google.com/

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com
Fixes: 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-debugfs.c          | 30 +++++++++++
 block/blk-mq-debugfs.c       |  5 --
 block/blk-sysfs.c            |  9 ++++
 block/blk.h                  | 11 ++++
 include/linux/blkdev.h       |  5 +-
 include/linux/blktrace_api.h |  1 +
 kernel/trace/blktrace.c      | 98 +++++++++++++++++++++++++++++++++---
 7 files changed, 146 insertions(+), 13 deletions(-)

diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
index 19091e1effc0..d84038bce0a5 100644
--- a/block/blk-debugfs.c
+++ b/block/blk-debugfs.c
@@ -3,6 +3,9 @@
 /*
  * Shared request-based / make_request-based functionality
  */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/kernel.h>
 #include <linux/blkdev.h>
 #include <linux/debugfs.h>
@@ -13,3 +16,30 @@ void blk_debugfs_register(void)
 {
 	blk_debugfs_root = debugfs_create_dir("block", NULL);
 }
+
+int __must_check blk_queue_debugfs_register(struct request_queue *q)
+{
+	struct dentry *dir = NULL;
+
+	/* This can happen if we have a bug in the lower layers */
+	dir = debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root);
+	if (dir) {
+		pr_warn("%s: registering request_queue debugfs directory twice is not allowed\n",
+			kobject_name(q->kobj.parent));
+		dput(dir);
+		return -EALREADY;
+	}
+
+	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
+					    blk_debugfs_root);
+	if (!q->debugfs_dir)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void blk_queue_debugfs_unregister(struct request_queue *q)
+{
+	debugfs_remove_recursive(q->debugfs_dir);
+	q->debugfs_dir = NULL;
+}
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..bda9378eab90 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -823,9 +823,6 @@ void blk_mq_debugfs_register(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
-	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
-					    blk_debugfs_root);
-
 	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
 
 	/*
@@ -856,9 +853,7 @@ void blk_mq_debugfs_register(struct request_queue *q)
 
 void blk_mq_debugfs_unregister(struct request_queue *q)
 {
-	debugfs_remove_recursive(q->debugfs_dir);
 	q->sched_debugfs_dir = NULL;
-	q->debugfs_dir = NULL;
 }
 
 static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..7072f408e69a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -895,6 +895,7 @@ static void __blk_release_queue(struct work_struct *work)
 
 	blk_trace_shutdown(q);
 
+	blk_queue_debugfs_unregister(q);
 	if (queue_is_mq(q))
 		blk_mq_debugfs_unregister(q);
 
@@ -975,6 +976,14 @@ int blk_register_queue(struct gendisk *disk)
 		goto unlock;
 	}
 
+	ret = blk_queue_debugfs_register(q);
+	if (ret) {
+		blk_trace_remove_sysfs(dev);
+		kobject_del(&q->kobj);
+		kobject_put(&dev->kobj);
+		goto unlock;
+	}
+
 	if (queue_is_mq(q)) {
 		__blk_mq_register_dev(dev, q);
 		blk_mq_debugfs_register(q);
diff --git a/block/blk.h b/block/blk.h
index 86a66b614f08..0cae5a92fe0b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -489,10 +489,21 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		bool *same_page);
 #ifdef CONFIG_DEBUG_FS
 void blk_debugfs_register(void);
+int __must_check blk_queue_debugfs_register(struct request_queue *q);
+void blk_queue_debugfs_unregister(struct request_queue *q);
 #else
 static inline void blk_debugfs_register(void)
 {
 }
+
+static inline int __must_check blk_queue_debugfs_register(struct request_queue *q)
+{
+	return 0;
+}
+
+static inline void blk_queue_debugfs_unregister(struct request_queue *q)
+{
+}
 #endif /* CONFIG_DEBUG_FS */
 
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32868fbedc9e..cc43c8e6516c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -569,8 +569,11 @@ struct request_queue {
 	struct list_head	tag_set_list;
 	struct bio_set		bio_split;
 
-#ifdef CONFIG_BLK_DEBUG_FS
+#ifdef CONFIG_DEBUG_FS
+	/* Used by block/blk-*debugfs.c and kernel/trace/blktrace.c */
 	struct dentry		*debugfs_dir;
+#endif
+#ifdef CONFIG_BLK_DEBUG_FS
 	struct dentry		*sched_debugfs_dir;
 	struct dentry		*rqos_debugfs_dir;
 #endif
diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index 3b6ff5902edc..43a6f98840dc 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -23,6 +23,7 @@ struct blk_trace {
 	u32 pid;
 	u32 dev;
 	struct dentry *dir;
+	struct dentry *backing_dir;
 	struct dentry *dropped_file;
 	struct dentry *msg_file;
 	struct list_head running_list;
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 2c6e6c386ace..8f87979d0971 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -3,6 +3,9 @@
  * Copyright (C) 2006 Jens Axboe <axboe@kernel.dk>
  *
  */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/kernel.h>
 #include <linux/blkdev.h>
 #include <linux/blktrace_api.h>
@@ -311,7 +314,15 @@ static void blk_trace_free(struct blk_trace *bt)
 	debugfs_remove(bt->msg_file);
 	debugfs_remove(bt->dropped_file);
 	relay_close(bt->rchan);
-	debugfs_remove(bt->dir);
+	/*
+	 * backing_dir is set when we use the request_queue debugfs directory.
+	 * Otherwise we are using a temporary directory created only for the
+	 * purpose of blktrace.
+	 */
+	if (bt->backing_dir)
+		dput(bt->backing_dir);
+	else
+		debugfs_remove(bt->dir);
 	free_percpu(bt->sequence);
 	free_percpu(bt->msg_data);
 	kfree(bt);
@@ -468,16 +479,89 @@ static void blk_trace_setup_lba(struct blk_trace *bt,
 	}
 }
 
+static bool blk_trace_target_disk(const char *target, const char *diskname)
+{
+	if (strlen(target) != strlen(diskname))
+		return false;
+
+	if (!strncmp(target, diskname,
+		     min_t(size_t, strlen(target), strlen(diskname))))
+		return true;
+
+	return false;
+}
+
 static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
+					    struct request_queue *q,
 					    struct blk_trace *bt)
 {
 	struct dentry *dir = NULL;
 
+	/* This can only happen if we have a bug on our lower layers */
+	if (!q->kobj.parent) {
+		pr_warn("%s: request_queue parent is gone\n", buts->name);
+		return NULL;
+	}
+
+	/*
+	 * From a sysfs kobject perspective, the request_queue sits on top of
+	 * the gendisk, which has the name of the disk. We always create a
+	 * debugfs directory upon init for this gendisk kobject, so we re-use
+	 * that if blktrace is going to be done for it.
+	 */
+	if (blk_trace_target_disk(buts->name, kobject_name(q->kobj.parent))) {
+		if (!q->debugfs_dir) {
+			pr_warn("%s: expected request_queue debugfs_dir is not set\n",
+				buts->name);
+			return NULL;
+		}
+		/*
+		 * debugfs_lookup() is used to ensure the directory is not
+		 * taken from underneath us. We must dput() it later once
+		 * done with it within blktrace.
+		 */
+		dir = debugfs_lookup(buts->name, blk_debugfs_root);
+		if (!dir) {
+			pr_warn("%s: expected request_queue debugfs_dir dentry is gone\n",
+				buts->name);
+			return NULL;
+		}
+		 /*
+		 * This is a reaffirmation that debugfs_lookup() shall always
+		 * return the same dentry if it was already set.
+		 */
+		if (dir != q->debugfs_dir) {
+			dput(dir);
+			pr_warn("%s: expected dentry dir != q->debugfs_dir\n",
+				buts->name);
+			return NULL;
+		}
+		bt->backing_dir = q->debugfs_dir;
+		return bt->backing_dir;
+	}
+
+	/*
+	 * If not using blktrace on the gendisk, we are going to create a
+	 * temporary debugfs directory for it, however this cannot be shared
+	 * between two concurrent blktraces since the path is not unique, so
+	 * ensure this is only done once.
+	 */
 	dir = debugfs_lookup(buts->name, blk_debugfs_root);
-	if (!dir)
-		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
+	if (dir) {
+		pr_warn("%s: temporary blktrace debugfs directory already present\n",
+			buts->name);
+		dput(dir);
+		return NULL;
+	}
+
+	bt->dir = debugfs_create_dir(buts->name, blk_debugfs_root);
+	if (!bt->dir) {
+		pr_warn("%s: temporary blktrace debugfs directory could not be created\n",
+			buts->name);
+		return NULL;
+	}
 
-	return dir;
+	return bt->dir;
 }
 
 /*
@@ -521,7 +605,9 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 
 	ret = -ENOENT;
 
-	dir = blk_trace_debugfs_dir(buts, bt);
+	dir = blk_trace_debugfs_dir(buts, q, bt);
+	if (!dir)
+		goto err;
 
 	bt->dev = dev;
 	atomic_set(&bt->dropped, 0);
@@ -561,8 +647,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 
 	ret = 0;
 err:
-	if (dir && !bt->dir)
-		dput(dir);
 	if (ret)
 		blk_trace_free(bt);
 	return ret;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 04/10] block: revert back to synchronous request_queue removal
  2020-04-19 19:45 [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain
                   ` (2 preceding siblings ...)
  2020-04-19 19:45 ` [PATCH v2 03/10] blktrace: fix debugfs use after free Luis Chamberlain
@ 2020-04-19 19:45 ` Luis Chamberlain
  2020-04-19 22:23   ` Bart Van Assche
  2020-04-19 19:45 ` [PATCH v2 05/10] blktrace: upgrade warns to BUG_ON() on unexpected circmunstances Luis Chamberlain
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-19 19:45 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
v4.12 moved the work behind blk_release_queue() into a workqueue after a
splat floated around which indicated some work on blk_release_queue()
could sleep in blk_exit_rl(). This splat would be possible when a driver
called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
as its final call) from an atomic context.

blk_put_queue() decrements the refcount for the request_queue
kobject, and upon reaching 0 blk_release_queue() is called. Although
blk_exit_rl() is now removed through commit db6d9952356 ("block: remove
request_list code"), we reserve the right to be able to sleep within
blk_release_queue() context. If you see no other way and *have* be
in atomic context when you driver calls the last blk_put_queue()
you can always just increase your block device's reference count with
bdgrab() as this can be done in atomic context and the request_queue
removal would be left to upper layers later. We document this bit of
tribal knowledge as well now, and adjust kdoc format a bit.

We revert back to synchronous request_queue removal because asynchronous
removal creates a regression with expected userspace interaction with
several drivers. An example is when removing the loopback driver and
issues ioctl from userspace to do so, upon return and if successful one
expects the device to be removed. Moving to asynchronous request_queue
removal could have broken many scripts which relied on the removal to
have been completed if there was no error.

Using asynchronous request_queue removal however has helped us find
other bugs, in the future we can test what could break with this
arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Suggested-by: Nicolai Stange <nstange@suse.de>
Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression")
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-core.c       | 19 +++++++++++++++++++
 block/blk-sysfs.c      | 38 +++++++++++++++++---------------------
 include/linux/blkdev.h |  2 --
 3 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 5aaae7a1b338..19e24d7f40ef 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -301,6 +301,17 @@ void blk_clear_pm_only(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_clear_pm_only);
 
+/**
+ * blk_put_queue - decrement the request_queue refcount
+ *
+ * @q: the request_queue structure to decrement the refcount for
+ *
+ * Decrements the refcount to the request_queue kobject, when this reaches
+ * 0 we'll have blk_release_queue() called. You should avoid calling
+ * this function in atomic context but if you really have to ensure you
+ * first refcount the block device with bdgrab() / bdput() so that the
+ * last decrement happens in blk_cleanup_queue().
+ */
 void blk_put_queue(struct request_queue *q)
 {
 	kobject_put(&q->kobj);
@@ -328,13 +339,21 @@ EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
 /**
  * blk_cleanup_queue - shutdown a request queue
+ *
  * @q: request queue to shutdown
  *
  * Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
  * put it.  All future requests will be failed immediately with -ENODEV.
+ *
+ * You should not call this function in atomic context. If you need to
+ * refcount a request_queue in atomic context, instead refcount the
+ * block device with bdgrab() / bdput().
  */
 void blk_cleanup_queue(struct request_queue *q)
 {
+	/* cannot be called from atomic context */
+	might_sleep();
+
 	WARN_ON_ONCE(blk_queue_registered(q));
 
 	/* mark @q DYING, no new request or merges will be allowed afterwards */
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 7072f408e69a..dc7985b7e4c5 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -860,22 +860,27 @@ static void blk_exit_queue(struct request_queue *q)
 	bdi_put(q->backing_dev_info);
 }
 
-
 /**
- * __blk_release_queue - release a request queue
- * @work: pointer to the release_work member of the request queue to be released
+ * blk_release_queue - release a request queue
+ *
+ * This function is called as part of the process when a block device is being
+ * unregistered. Releasing a request queue starts with blk_cleanup_queue(),
+ * which set the appropriate flags and then calls blk_put_queue() as the last
+ * step. blk_put_queue() decrements the reference counter of the request queue
+ * and once the reference counter reaches zero, this function is called to
+ * release all allocated resources of the request queue.
  *
- * Description:
- *     This function is called when a block device is being unregistered. The
- *     process of releasing a request queue starts with blk_cleanup_queue, which
- *     set the appropriate flags and then calls blk_put_queue, that decrements
- *     the reference counter of the request queue. Once the reference counter
- *     of the request queue reaches zero, blk_release_queue is called to release
- *     all allocated resources of the request queue.
+ * This function can sleep, and so we must ensure that the very last
+ * blk_put_queue() is never called from atomic context.
+ *
+ * @kobj: pointer to a kobject, who's container is a request_queue
  */
-static void __blk_release_queue(struct work_struct *work)
+static void blk_release_queue(struct kobject *kobj)
 {
-	struct request_queue *q = container_of(work, typeof(*q), release_work);
+	struct request_queue *q =
+		container_of(kobj, struct request_queue, kobj);
+
+	might_sleep();
 
 	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
 		blk_stat_remove_callback(q, q->poll_cb);
@@ -905,15 +910,6 @@ static void __blk_release_queue(struct work_struct *work)
 	call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
 
-static void blk_release_queue(struct kobject *kobj)
-{
-	struct request_queue *q =
-		container_of(kobj, struct request_queue, kobj);
-
-	INIT_WORK(&q->release_work, __blk_release_queue);
-	schedule_work(&q->release_work);
-}
-
 static const struct sysfs_ops queue_sysfs_ops = {
 	.show	= queue_attr_show,
 	.store	= queue_attr_store,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cc43c8e6516c..81f7ddb1587e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -582,8 +582,6 @@ struct request_queue {
 
 	size_t			cmd_size;
 
-	struct work_struct	release_work;
-
 #define BLK_MAX_WRITE_HINTS	5
 	u64			write_hints[BLK_MAX_WRITE_HINTS];
 };
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 05/10] blktrace: upgrade warns to BUG_ON() on unexpected circmunstances
  2020-04-19 19:45 [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain
                   ` (3 preceding siblings ...)
  2020-04-19 19:45 ` [PATCH v2 04/10] block: revert back to synchronous request_queue removal Luis Chamberlain
@ 2020-04-19 19:45 ` Luis Chamberlain
  2020-04-19 22:50   ` Bart Van Assche
  2020-04-19 19:45 ` [PATCH v2 06/10] blk-debugfs: upgrade warns to BUG_ON() if directory is already found Luis Chamberlain
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-19 19:45 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain

Now that the request_queue removal is scheduled synchronously again,
we have certain expectations on when debugfs directories used for
blktrace are used. Any violation of these expecations should reflect
core bugs we want to hear about.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/trace/blktrace.c | 32 ++++++++------------------------
 1 file changed, 8 insertions(+), 24 deletions(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 8f87979d0971..909db597b551 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -498,10 +498,7 @@ static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
 	struct dentry *dir = NULL;
 
 	/* This can only happen if we have a bug on our lower layers */
-	if (!q->kobj.parent) {
-		pr_warn("%s: request_queue parent is gone\n", buts->name);
-		return NULL;
-	}
+	BUG_ON(!q->kobj.parent);
 
 	/*
 	 * From a sysfs kobject perspective, the request_queue sits on top of
@@ -510,32 +507,19 @@ static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
 	 * that if blktrace is going to be done for it.
 	 */
 	if (blk_trace_target_disk(buts->name, kobject_name(q->kobj.parent))) {
-		if (!q->debugfs_dir) {
-			pr_warn("%s: expected request_queue debugfs_dir is not set\n",
-				buts->name);
-			return NULL;
-		}
+		BUG_ON(!q->debugfs_dir);
+
 		/*
 		 * debugfs_lookup() is used to ensure the directory is not
 		 * taken from underneath us. We must dput() it later once
 		 * done with it within blktrace.
+		 *
+		 * This is also a reaffirmation that debugfs_lookup() shall
+		 * always return the same dentry if it was already set.
 		 */
 		dir = debugfs_lookup(buts->name, blk_debugfs_root);
-		if (!dir) {
-			pr_warn("%s: expected request_queue debugfs_dir dentry is gone\n",
-				buts->name);
-			return NULL;
-		}
-		 /*
-		 * This is a reaffirmation that debugfs_lookup() shall always
-		 * return the same dentry if it was already set.
-		 */
-		if (dir != q->debugfs_dir) {
-			dput(dir);
-			pr_warn("%s: expected dentry dir != q->debugfs_dir\n",
-				buts->name);
-			return NULL;
-		}
+		BUG_ON(!dir || dir != q->debugfs_dir);
+
 		bt->backing_dir = q->debugfs_dir;
 		return bt->backing_dir;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 06/10] blk-debugfs: upgrade warns to BUG_ON() if directory is already found
  2020-04-19 19:45 [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain
                   ` (4 preceding siblings ...)
  2020-04-19 19:45 ` [PATCH v2 05/10] blktrace: upgrade warns to BUG_ON() on unexpected circmunstances Luis Chamberlain
@ 2020-04-19 19:45 ` Luis Chamberlain
  2020-04-20 11:36   ` Greg KH
  2020-04-19 19:45 ` [PATCH v2 07/10] blktrace: move debugfs file creation to its own function Luis Chamberlain
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-19 19:45 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain

Now that we have moved release_queue from being asynchronous to
synchronous, and fixed how we use the debugfs directory with blktrace
we should no longer have expected races with device removal/addition
and other operations with the debugfs directory.

If races do happen however, we want to be informed of *how* this races
happens rather than dealing with a debugfs splat, so upgrading this to a
BUG_ON() should capture better information about how this can happen
in the future.

This is specially true these days with funky reproducers in userspace
for which we have no access to, but only a bug splat.

Note that on addition the gendisk kobject is used as the parent for the
request_queue kobject, and upon removal, now that request_queue removal
is synchronous, blk_unregister_queue() is called prior to the gendisk
device_del(). This means we expect to see a sysfs clash first now prior
to running into a race with the debugfs dentry; so this bug would be
considered highly unlikely.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-debugfs.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
index d84038bce0a5..761318dcbf40 100644
--- a/block/blk-debugfs.c
+++ b/block/blk-debugfs.c
@@ -19,16 +19,8 @@ void blk_debugfs_register(void)
 
 int __must_check blk_queue_debugfs_register(struct request_queue *q)
 {
-	struct dentry *dir = NULL;
-
 	/* This can happen if we have a bug in the lower layers */
-	dir = debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root);
-	if (dir) {
-		pr_warn("%s: registering request_queue debugfs directory twice is not allowed\n",
-			kobject_name(q->kobj.parent));
-		dput(dir);
-		return -EALREADY;
-	}
+	BUG_ON(debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root));
 
 	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
 					    blk_debugfs_root);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 07/10] blktrace: move debugfs file creation to its own function
  2020-04-19 19:45 [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain
                   ` (5 preceding siblings ...)
  2020-04-19 19:45 ` [PATCH v2 06/10] blk-debugfs: upgrade warns to BUG_ON() if directory is already found Luis Chamberlain
@ 2020-04-19 19:45 ` Luis Chamberlain
  2020-04-19 22:55   ` Bart Van Assche
  2020-04-19 19:45 ` [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup Luis Chamberlain
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-19 19:45 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain

Move creating the blktrace debugfs files into its own function.
We'll expand on this later.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/trace/blktrace.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 909db597b551..9cc0153849c3 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -548,6 +548,25 @@ static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
 	return bt->dir;
 }
 
+static int blk_trace_create_debugfs_files(struct blk_user_trace_setup *buts,
+					  struct dentry *dir,
+					  struct blk_trace *bt)
+{
+	int ret = -EIO;
+
+	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
+					       &blk_dropped_fops);
+
+	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
+
+	bt->rchan = relay_open("trace", dir, buts->buf_size,
+				buts->buf_nr, &blk_relay_callbacks, bt);
+	if (!bt->rchan)
+		return ret;
+
+	return 0;
+}
+
 /*
  * Setup everything required to start tracing
  */
@@ -597,15 +616,8 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	atomic_set(&bt->dropped, 0);
 	INIT_LIST_HEAD(&bt->running_list);
 
-	ret = -EIO;
-	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
-					       &blk_dropped_fops);
-
-	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
-
-	bt->rchan = relay_open("trace", dir, buts->buf_size,
-				buts->buf_nr, &blk_relay_callbacks, bt);
-	if (!bt->rchan)
+	ret = blk_trace_create_debugfs_files(buts, dir, bt);
+	if (ret)
 		goto err;
 
 	bt->act_mask = buts->act_mask;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup
  2020-04-19 19:45 [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain
                   ` (6 preceding siblings ...)
  2020-04-19 19:45 ` [PATCH v2 07/10] blktrace: move debugfs file creation to its own function Luis Chamberlain
@ 2020-04-19 19:45 ` Luis Chamberlain
  2020-04-19 22:57   ` Bart Van Assche
  2020-04-20 11:39   ` Greg KH
  2020-04-19 19:45 ` [PATCH v2 09/10] block: panic if block debugfs dir is not created Luis Chamberlain
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-19 19:45 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain

Even though debugfs can be disabled, enabling BLK_DEV_IO_TRACE will
select DEBUG_FS, and blktrace exposes an API which userspace uses
relying on certain files created in debugfs. If files are not created
blktrace will not work correctly, so we do want to ensure that a
blktrace setup creates these files properly, and otherwise inform
userspace.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/trace/blktrace.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 9cc0153849c3..fc32a8665ce8 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -552,17 +552,19 @@ static int blk_trace_create_debugfs_files(struct blk_user_trace_setup *buts,
 					  struct dentry *dir,
 					  struct blk_trace *bt)
 {
-	int ret = -EIO;
-
 	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
 					       &blk_dropped_fops);
+	if (!bt->dropped_file)
+		return -ENOMEM;
 
 	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
+	if (!bt->msg_file)
+		return -ENOMEM;
 
 	bt->rchan = relay_open("trace", dir, buts->buf_size,
 				buts->buf_nr, &blk_relay_callbacks, bt);
 	if (!bt->rchan)
-		return ret;
+		return -EIO;
 
 	return 0;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 09/10] block: panic if block debugfs dir is not created
  2020-04-19 19:45 [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain
                   ` (7 preceding siblings ...)
  2020-04-19 19:45 ` [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup Luis Chamberlain
@ 2020-04-19 19:45 ` Luis Chamberlain
  2020-04-19 23:08   ` Bart Van Assche
  2020-04-20 11:38   ` Greg KH
  2020-04-19 19:45 ` [PATCH v2 10/10] block: put_device() if device_add() fails Luis Chamberlain
  2020-04-19 19:48 ` [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain
  10 siblings, 2 replies; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-19 19:45 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain

If DEBUG_FS is disabled we have another inline
blk_debugfs_register() which just returns 0.

If BLK_DEV_IO_TRACE is enabled we rely on the block debugfs
directory to have been created. If BLK_DEV_IO_TRACE is not enabled
though, but if debugfs is still enabled we will always create a
debugfs directory for a request_queue. Instead of special-casing
this just for BLK_DEV_IO_TRACE, ensure this block debugfs dir is
always created at boot if we have enabled debugfs.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-debugfs.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
index 761318dcbf40..d6ec980e7531 100644
--- a/block/blk-debugfs.c
+++ b/block/blk-debugfs.c
@@ -15,6 +15,8 @@ struct dentry *blk_debugfs_root;
 void blk_debugfs_register(void)
 {
 	blk_debugfs_root = debugfs_create_dir("block", NULL);
+	if (!blk_debugfs_root)
+		panic("Failed to create block debugfs directory\n");
 }
 
 int __must_check blk_queue_debugfs_register(struct request_queue *q)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 10/10] block: put_device() if device_add() fails
  2020-04-19 19:45 [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain
                   ` (8 preceding siblings ...)
  2020-04-19 19:45 ` [PATCH v2 09/10] block: panic if block debugfs dir is not created Luis Chamberlain
@ 2020-04-19 19:45 ` Luis Chamberlain
  2020-04-19 23:40   ` Bart Van Assche
  2020-04-19 19:48 ` [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain
  10 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-19 19:45 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain

Through code inspection I've found that we don't put_device() if
device_add() fails, and this must be done to decrement its refcount.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/genhd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index 06b642b23a07..c52095a74792 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -721,8 +721,10 @@ static void register_disk(struct device *parent, struct gendisk *disk,
 		WARN_ON(ddev->groups);
 		ddev->groups = groups;
 	}
-	if (device_add(ddev))
+	if (device_add(ddev)) {
+		put_device(ddev);
 		return;
+	}
 	if (!sysfs_deprecated) {
 		err = sysfs_create_link(block_depr, &ddev->kobj,
 					kobject_name(&ddev->kobj));
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 00/10] block: fix blktrace debugfs use after free
  2020-04-19 19:45 [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain
                   ` (9 preceding siblings ...)
  2020-04-19 19:45 ` [PATCH v2 10/10] block: put_device() if device_add() fails Luis Chamberlain
@ 2020-04-19 19:48 ` Luis Chamberlain
  10 siblings, 0 replies; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-19 19:48 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm, linux-kernel

I forgot to mention that I've dropped Reviewed-by tags to patches
I have changed considerably like patch 3/10 which actually fixes
the race. So if you had provided a review before, a new review
would be appreciated.

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 01/10] block: move main block debugfs initialization to its own file
  2020-04-19 19:45 ` [PATCH v2 01/10] block: move main block debugfs initialization to its own file Luis Chamberlain
@ 2020-04-19 21:06   ` Bart Van Assche
  0 siblings, 0 replies; 57+ messages in thread
From: Bart Van Assche @ 2020-04-19 21:06 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> make_request-based drivers and and request-based drivers share some
> some debugfs code. By moving this into its own file it makes it easier
> to expand and audit this shared code.

If this patch is reposted, please change "some some" into "some".

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 02/10] blktrace: move blktrace debugfs creation to helper function
  2020-04-19 19:45 ` [PATCH v2 02/10] blktrace: move blktrace debugfs creation to helper function Luis Chamberlain
@ 2020-04-19 21:11   ` Bart Van Assche
  2020-04-22  7:12   ` Christoph Hellwig
  1 sibling, 0 replies; 57+ messages in thread
From: Bart Van Assche @ 2020-04-19 21:11 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm, linux-kernel

On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> Move the work to create the debugfs directory used into a helper.
> It will make further checks easier to read. This commit introduces
> no functional changes.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-19 19:45 ` [PATCH v2 03/10] blktrace: fix debugfs use after free Luis Chamberlain
@ 2020-04-19 21:55   ` Bart Van Assche
  2020-04-20  0:04     ` Luis Chamberlain
  2020-04-20 20:16   ` Greg KH
  2020-04-22  7:27   ` Christoph Hellwig
  2 siblings, 1 reply; 57+ messages in thread
From: Bart Van Assche @ 2020-04-19 21:55 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> +int __must_check blk_queue_debugfs_register(struct request_queue *q)
> +{
> +	struct dentry *dir = NULL;
> +
> +	/* This can happen if we have a bug in the lower layers */

What does "this" refer to? Which layers does "lower layers" refer to? 
Most software developers consider a module that calls directly into 
another module as a higher layer (callbacks through function pointers do 
not count; see also https://en.wikipedia.org/wiki/Modular_programming). 
According to that definition block drivers are a software layer 
immediately above the block layer core.

How about changing that comment into the following to make it 
unambiguous (if this is what you meant)?

	/*
	 * Check whether the debugfs directory already exists. This can
	 * only happen as the result of a bug in a block driver.
	 */

> +	dir = debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root);
> +	if (dir) {
> +		pr_warn("%s: registering request_queue debugfs directory twice is not allowed\n",
> +			kobject_name(q->kobj.parent));
> +		dput(dir);
> +		return -EALREADY;
> +	}
> +
> +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> +					    blk_debugfs_root);
> +	if (!q->debugfs_dir)
> +		return -ENOMEM;
> +
> +	return 0;
> +}

kobject_name(q->kobj.parent) is used three times in the above function. 
How about introducing a local variable that holds the result of that 
expression?

> +static bool blk_trace_target_disk(const char *target, const char *diskname)
> +{
> +	if (strlen(target) != strlen(diskname))
> +		return false;
> +
> +	if (!strncmp(target, diskname,
> +		     min_t(size_t, strlen(target), strlen(diskname))))
> +		return true;
> +
> +	return false;
> +}

The above code looks weird to me. When the second if-statement is 
reached, it is guaranteed that 'target' and 'diskname' have the same 
length. So why to calculate the minimum length in the second 
if-statement of two strings that have the same length?

Independent of what the purpose of the above code is, can that code be 
rewritten such that it does not depend on the details of how names are 
assigned to disks and partitions? Would disk_get_part() be useful here?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 04/10] block: revert back to synchronous request_queue removal
  2020-04-19 19:45 ` [PATCH v2 04/10] block: revert back to synchronous request_queue removal Luis Chamberlain
@ 2020-04-19 22:23   ` Bart Van Assche
  2020-04-20 18:59     ` Luis Chamberlain
  0 siblings, 1 reply; 57+ messages in thread
From: Bart Van Assche @ 2020-04-19 22:23 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> +/**
> + * blk_put_queue - decrement the request_queue refcount
> + *
> + * @q: the request_queue structure to decrement the refcount for
> + *

How about following the example from 
Documentation/doc-guide/kernel-doc.rst and not leaving a blank line 
above the function argument documentation?

> + * Decrements the refcount to the request_queue kobject, when this reaches
                               ^^
                               of?
> + * 0 we'll have blk_release_queue() called. You should avoid calling
> + * this function in atomic context but if you really have to ensure you
> + * first refcount the block device with bdgrab() / bdput() so that the
> + * last decrement happens in blk_cleanup_queue().
> + */

Is calling bdgrab() and bdput() an option from a context in which it is 
not guaranteed that the block device is open?

Does every context that calls blk_put_queue() also call blk_cleanup_queue()?

How about avoiding confusion by changing the last sentence of that 
comment into something like the following: "The last reference must not 
be dropped from atomic context. If it is necessary to call 
blk_put_queue() from atomic context, make sure that that call does not 
decrease the request queue refcount to zero."

>   /**
>    * blk_cleanup_queue - shutdown a request queue
> + *
>    * @q: request queue to shutdown
>    *

How about following the example from 
Documentation/doc-guide/kernel-doc.rst and not leaving a blank line 
above the function argument documentation?

>    * Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
>    * put it.  All future requests will be failed immediately with -ENODEV.
> + *
> + * You should not call this function in atomic context. If you need to
> + * refcount a request_queue in atomic context, instead refcount the
> + * block device with bdgrab() / bdput().

Surrounding blk_cleanup_queue() with bdgrab() / bdput() does not help. 
This blk_cleanup_queue() must not be called from atomic context.

>   /**
> - * __blk_release_queue - release a request queue
> - * @work: pointer to the release_work member of the request queue to be released
> + * blk_release_queue - release a request queue
> + *
> + * This function is called as part of the process when a block device is being
> + * unregistered. Releasing a request queue starts with blk_cleanup_queue(),
> + * which set the appropriate flags and then calls blk_put_queue() as the last
> + * step. blk_put_queue() decrements the reference counter of the request queue
> + * and once the reference counter reaches zero, this function is called to
> + * release all allocated resources of the request queue.
>    *
> - * Description:
> - *     This function is called when a block device is being unregistered. The
> - *     process of releasing a request queue starts with blk_cleanup_queue, which
> - *     set the appropriate flags and then calls blk_put_queue, that decrements
> - *     the reference counter of the request queue. Once the reference counter
> - *     of the request queue reaches zero, blk_release_queue is called to release
> - *     all allocated resources of the request queue.
> + * This function can sleep, and so we must ensure that the very last
> + * blk_put_queue() is never called from atomic context.
> + *
> + * @kobj: pointer to a kobject, who's container is a request_queue
>    */

Please follow the style used elsewhere in the kernel and move function 
argument documentation just below the line with the function name.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 05/10] blktrace: upgrade warns to BUG_ON() on unexpected circmunstances
  2020-04-19 19:45 ` [PATCH v2 05/10] blktrace: upgrade warns to BUG_ON() on unexpected circmunstances Luis Chamberlain
@ 2020-04-19 22:50   ` Bart Van Assche
  2020-04-19 23:07     ` Luis Chamberlain
  0 siblings, 1 reply; 57+ messages in thread
From: Bart Van Assche @ 2020-04-19 22:50 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm, linux-kernel

On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> @@ -498,10 +498,7 @@ static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
>   	struct dentry *dir = NULL;
>   
>   	/* This can only happen if we have a bug on our lower layers */
> -	if (!q->kobj.parent) {
> -		pr_warn("%s: request_queue parent is gone\n", buts->name);
> -		return NULL;
> -	}
> +	BUG_ON(!q->kobj.parent);

Does the following quote from Linus also apply to this patch: "there is 
NO F*CKING EXCUSE to knowingly kill the kernel." See also 
https://lkml.org/lkml/2016/10/4/1.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 07/10] blktrace: move debugfs file creation to its own function
  2020-04-19 19:45 ` [PATCH v2 07/10] blktrace: move debugfs file creation to its own function Luis Chamberlain
@ 2020-04-19 22:55   ` Bart Van Assche
  2020-04-20 11:37     ` Greg KH
  0 siblings, 1 reply; 57+ messages in thread
From: Bart Van Assche @ 2020-04-19 22:55 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm, linux-kernel

On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> +static int blk_trace_create_debugfs_files(struct blk_user_trace_setup *buts,
> +					  struct dentry *dir,
> +					  struct blk_trace *bt)
> +{
> +	int ret = -EIO;
> +
> +	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
> +					       &blk_dropped_fops);
> +
> +	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
> +
> +	bt->rchan = relay_open("trace", dir, buts->buf_size,
> +				buts->buf_nr, &blk_relay_callbacks, bt);
> +	if (!bt->rchan)
> +		return ret;
> +
> +	return 0;
> +}

How about adding IS_ERR() checks for the debugfs_create_file() return 
values?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup
  2020-04-19 19:45 ` [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup Luis Chamberlain
@ 2020-04-19 22:57   ` Bart Van Assche
  2020-04-19 23:05     ` Luis Chamberlain
  2020-04-20 11:39   ` Greg KH
  1 sibling, 1 reply; 57+ messages in thread
From: Bart Van Assche @ 2020-04-19 22:57 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm, linux-kernel

On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> Even though debugfs can be disabled, enabling BLK_DEV_IO_TRACE will
> select DEBUG_FS, and blktrace exposes an API which userspace uses
> relying on certain files created in debugfs. If files are not created
> blktrace will not work correctly, so we do want to ensure that a
> blktrace setup creates these files properly, and otherwise inform
> userspace.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   kernel/trace/blktrace.c | 8 +++++---
>   1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 9cc0153849c3..fc32a8665ce8 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -552,17 +552,19 @@ static int blk_trace_create_debugfs_files(struct blk_user_trace_setup *buts,
>   					  struct dentry *dir,
>   					  struct blk_trace *bt)
>   {
> -	int ret = -EIO;
> -
>   	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
>   					       &blk_dropped_fops);
> +	if (!bt->dropped_file)
> +		return -ENOMEM;
>   
>   	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
> +	if (!bt->msg_file)
> +		return -ENOMEM;
>   
>   	bt->rchan = relay_open("trace", dir, buts->buf_size,
>   				buts->buf_nr, &blk_relay_callbacks, bt);
>   	if (!bt->rchan)
> -		return ret;
> +		return -EIO;
>   
>   	return 0;
>   }

I should have had a look at this patch before I replied to the previous 
patch.

Do you agree that the following code can be triggered by 
debugfs_create_file() and also that debugfs_create_file() never returns 
NULL?

static struct dentry *failed_creating(struct dentry *dentry)
{
	inode_unlock(d_inode(dentry->d_parent));
	dput(dentry);
	simple_release_fs(&debugfs_mount, &debugfs_mount_count);
	return ERR_PTR(-ENOMEM);
}

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup
  2020-04-19 22:57   ` Bart Van Assche
@ 2020-04-19 23:05     ` Luis Chamberlain
  2020-04-19 23:17       ` Bart Van Assche
  0 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-19 23:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Sun, Apr 19, 2020 at 03:57:58PM -0700, Bart Van Assche wrote:
> On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> > Even though debugfs can be disabled, enabling BLK_DEV_IO_TRACE will
> > select DEBUG_FS, and blktrace exposes an API which userspace uses
> > relying on certain files created in debugfs. If files are not created
> > blktrace will not work correctly, so we do want to ensure that a
> > blktrace setup creates these files properly, and otherwise inform
> > userspace.
> > 
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >   kernel/trace/blktrace.c | 8 +++++---
> >   1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> > index 9cc0153849c3..fc32a8665ce8 100644
> > --- a/kernel/trace/blktrace.c
> > +++ b/kernel/trace/blktrace.c
> > @@ -552,17 +552,19 @@ static int blk_trace_create_debugfs_files(struct blk_user_trace_setup *buts,
> >   					  struct dentry *dir,
> >   					  struct blk_trace *bt)
> >   {
> > -	int ret = -EIO;
> > -
> >   	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
> >   					       &blk_dropped_fops);
> > +	if (!bt->dropped_file)
> > +		return -ENOMEM;
> >   	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
> > +	if (!bt->msg_file)
> > +		return -ENOMEM;
> >   	bt->rchan = relay_open("trace", dir, buts->buf_size,
> >   				buts->buf_nr, &blk_relay_callbacks, bt);
> >   	if (!bt->rchan)
> > -		return ret;
> > +		return -EIO;
> >   	return 0;
> >   }
> 
> I should have had a look at this patch before I replied to the previous
> patch.
> 
> Do you agree that the following code can be triggered by
> debugfs_create_file() and also that debugfs_create_file() never returns
> NULL?

If debugfs is enabled, and not that we know it is in this blktrace code,
as we select it, it can return ERR_PTR(-ERROR) if an error occurs.

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 05/10] blktrace: upgrade warns to BUG_ON() on unexpected circmunstances
  2020-04-19 22:50   ` Bart Van Assche
@ 2020-04-19 23:07     ` Luis Chamberlain
  2020-04-20 23:20       ` Steven Rostedt
  0 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-19 23:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Sun, Apr 19, 2020 at 03:50:13PM -0700, Bart Van Assche wrote:
> On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> > @@ -498,10 +498,7 @@ static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
> >   	struct dentry *dir = NULL;
> >   	/* This can only happen if we have a bug on our lower layers */
> > -	if (!q->kobj.parent) {
> > -		pr_warn("%s: request_queue parent is gone\n", buts->name);
> > -		return NULL;
> > -	}
> > +	BUG_ON(!q->kobj.parent);
> 
> Does the following quote from Linus also apply to this patch: "there is NO
> F*CKING EXCUSE to knowingly kill the kernel." See also
> https://lkml.org/lkml/2016/10/4/1.

We can use WARN_ON() and keep the return NULL, sure.

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 09/10] block: panic if block debugfs dir is not created
  2020-04-19 19:45 ` [PATCH v2 09/10] block: panic if block debugfs dir is not created Luis Chamberlain
@ 2020-04-19 23:08   ` Bart Van Assche
  2020-04-20 11:38   ` Greg KH
  1 sibling, 0 replies; 57+ messages in thread
From: Bart Van Assche @ 2020-04-19 23:08 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm, linux-kernel

On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> --- a/block/blk-debugfs.c
> +++ b/block/blk-debugfs.c
> @@ -15,6 +15,8 @@ struct dentry *blk_debugfs_root;
>   void blk_debugfs_register(void)
>   {
>   	blk_debugfs_root = debugfs_create_dir("block", NULL);
> +	if (!blk_debugfs_root)
> +		panic("Failed to create block debugfs directory\n");
>   }

Does Linus' "don't kill the kernel" rant apply to this patch?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup
  2020-04-19 23:05     ` Luis Chamberlain
@ 2020-04-19 23:17       ` Bart Van Assche
  2020-04-20 11:40         ` Greg KH
  0 siblings, 1 reply; 57+ messages in thread
From: Bart Van Assche @ 2020-04-19 23:17 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On 4/19/20 4:05 PM, Luis Chamberlain wrote:
> On Sun, Apr 19, 2020 at 03:57:58PM -0700, Bart Van Assche wrote:
>> On 4/19/20 12:45 PM, Luis Chamberlain wrote:
>>> Even though debugfs can be disabled, enabling BLK_DEV_IO_TRACE will
>>> select DEBUG_FS, and blktrace exposes an API which userspace uses
>>> relying on certain files created in debugfs. If files are not created
>>> blktrace will not work correctly, so we do want to ensure that a
>>> blktrace setup creates these files properly, and otherwise inform
>>> userspace.
>>>
>>> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
>>> ---
>>>    kernel/trace/blktrace.c | 8 +++++---
>>>    1 file changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
>>> index 9cc0153849c3..fc32a8665ce8 100644
>>> --- a/kernel/trace/blktrace.c
>>> +++ b/kernel/trace/blktrace.c
>>> @@ -552,17 +552,19 @@ static int blk_trace_create_debugfs_files(struct blk_user_trace_setup *buts,
>>>    					  struct dentry *dir,
>>>    					  struct blk_trace *bt)
>>>    {
>>> -	int ret = -EIO;
>>> -
>>>    	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
>>>    					       &blk_dropped_fops);
>>> +	if (!bt->dropped_file)
>>> +		return -ENOMEM;
>>>    	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
>>> +	if (!bt->msg_file)
>>> +		return -ENOMEM;
>>>    	bt->rchan = relay_open("trace", dir, buts->buf_size,
>>>    				buts->buf_nr, &blk_relay_callbacks, bt);
>>>    	if (!bt->rchan)
>>> -		return ret;
>>> +		return -EIO;
>>>    	return 0;
>>>    }
>>
>> I should have had a look at this patch before I replied to the previous
>> patch.
>>
>> Do you agree that the following code can be triggered by
>> debugfs_create_file() and also that debugfs_create_file() never returns
>> NULL?
> 
> If debugfs is enabled, and not that we know it is in this blktrace code,
> as we select it, it can return ERR_PTR(-ERROR) if an error occurs.

This is what I found in include/linux/debugfs.h in case debugfs is disabled:

static inline struct dentry *debugfs_create_file(const char *name,
	umode_t mode, struct dentry *parent, void *data,
	const struct file_operations *fops)
{
	return ERR_PTR(-ENODEV);
}

I have not found any code path that can cause debugfs_create_file() to 
return NULL. Did I perhaps overlook something? If not, it's not clear to 
me why the above patch adds checks that check whether 
debugfs_create_file() returns NULL?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 10/10] block: put_device() if device_add() fails
  2020-04-19 19:45 ` [PATCH v2 10/10] block: put_device() if device_add() fails Luis Chamberlain
@ 2020-04-19 23:40   ` Bart Van Assche
  2020-04-24 22:32     ` Luis Chamberlain
  0 siblings, 1 reply; 57+ messages in thread
From: Bart Van Assche @ 2020-04-19 23:40 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm, linux-kernel

On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> Through code inspection I've found that we don't put_device() if
> device_add() fails, and this must be done to decrement its refcount.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-19 21:55   ` Bart Van Assche
@ 2020-04-20  0:04     ` Luis Chamberlain
  2020-04-20  0:38       ` Bart Van Assche
  0 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-20  0:04 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Sun, Apr 19, 2020 at 02:55:42PM -0700, Bart Van Assche wrote:
> On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> > +int __must_check blk_queue_debugfs_register(struct request_queue *q)
> > +{
> > +	struct dentry *dir = NULL;
> > +
> > +	/* This can happen if we have a bug in the lower layers */
> 
> What does "this" refer to? Which layers does "lower layers" refer to? Most
> software developers consider a module that calls directly into another
> module as a higher layer (callbacks through function pointers do not count;
> see also https://en.wikipedia.org/wiki/Modular_programming). According to
> that definition block drivers are a software layer immediately above the
> block layer core.
> 
> How about changing that comment into the following to make it unambiguous
> (if this is what you meant)?
> 
> 	/*
> 	 * Check whether the debugfs directory already exists. This can
> 	 * only happen as the result of a bug in a block driver.
> 	 */

But I didn't mean on a block driver. I meant the block core. In our
case, the async request_queue removal is an example. There is nothing
that block drivers could have done to help much with that.

I could just change "lower layers" to "block layer" ?

> > +	dir = debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root);
> > +	if (dir) {
> > +		pr_warn("%s: registering request_queue debugfs directory twice is not allowed\n",
> > +			kobject_name(q->kobj.parent));
> > +		dput(dir);
> > +		return -EALREADY;
> > +	}
> > +
> > +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > +					    blk_debugfs_root);
> > +	if (!q->debugfs_dir)
> > +		return -ENOMEM;
> > +
> > +	return 0;
> > +}
> 
> kobject_name(q->kobj.parent) is used three times in the above function. How
> about introducing a local variable that holds the result of that expression?

Sure.

> > +static bool blk_trace_target_disk(const char *target, const char *diskname)
> > +{
> > +	if (strlen(target) != strlen(diskname))
> > +		return false;
> > +
> > +	if (!strncmp(target, diskname,
> > +		     min_t(size_t, strlen(target), strlen(diskname))))
> > +		return true;
> > +
> > +	return false;
> > +}
> 
> The above code looks weird to me. When the second if-statement is reached,
> it is guaranteed that 'target' and 'diskname' have the same length. So why
> to calculate the minimum length in the second if-statement of two strings
> that have the same length?

True, no need that that point. Will fix.

> Independent of what the purpose of the above code is, can that code be
> rewritten such that it does not depend on the details of how names are
> assigned to disks and partitions? Would disk_get_part() be useful here?

I did try, but couldn't figure out a way. I'll keep looking but likewise
let me know if you find a way.

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-20  0:04     ` Luis Chamberlain
@ 2020-04-20  0:38       ` Bart Van Assche
  2020-04-20 18:46         ` Luis Chamberlain
  0 siblings, 1 reply; 57+ messages in thread
From: Bart Van Assche @ 2020-04-20  0:38 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On 4/19/20 5:04 PM, Luis Chamberlain wrote:
> On Sun, Apr 19, 2020 at 02:55:42PM -0700, Bart Van Assche wrote:
>> On 4/19/20 12:45 PM, Luis Chamberlain wrote:
>>> +int __must_check blk_queue_debugfs_register(struct request_queue *q)
>>> +{
>>> +	struct dentry *dir = NULL;
>>> +
>>> +	/* This can happen if we have a bug in the lower layers */
>>
>> What does "this" refer to? Which layers does "lower layers" refer to? Most
>> software developers consider a module that calls directly into another
>> module as a higher layer (callbacks through function pointers do not count;
>> see also https://en.wikipedia.org/wiki/Modular_programming). According to
>> that definition block drivers are a software layer immediately above the
>> block layer core.
>>
>> How about changing that comment into the following to make it unambiguous
>> (if this is what you meant)?
>>
>> 	/*
>> 	 * Check whether the debugfs directory already exists. This can
>> 	 * only happen as the result of a bug in a block driver.
>> 	 */
> 
> But I didn't mean on a block driver. I meant the block core. In our
> case, the async request_queue removal is an example. There is nothing
> that block drivers could have done to help much with that.
> 
> I could just change "lower layers" to "block layer" ?

That sounds good to me.

>> Independent of what the purpose of the above code is, can that code be
>> rewritten such that it does not depend on the details of how names are
>> assigned to disks and partitions? Would disk_get_part() be useful here?
> 
> I did try, but couldn't figure out a way. I'll keep looking but likewise
> let me know if you find a way.

How about making blk_trace_setup() pass the result of the following 
expression as an additional argument to blk_trace_setup():

	bdev != bdev->bd_contains

I think that is a widely used approach to verify after a block device 
has been opened whether or not 'bdev' refers to a partition (bdev != 
bdev->bd_contains means that 'bdev' represents a partition).

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 06/10] blk-debugfs: upgrade warns to BUG_ON() if directory is already found
  2020-04-19 19:45 ` [PATCH v2 06/10] blk-debugfs: upgrade warns to BUG_ON() if directory is already found Luis Chamberlain
@ 2020-04-20 11:36   ` Greg KH
  0 siblings, 0 replies; 57+ messages in thread
From: Greg KH @ 2020-04-20 11:36 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Sun, Apr 19, 2020 at 07:45:25PM +0000, Luis Chamberlain wrote:
> Now that we have moved release_queue from being asynchronous to
> synchronous, and fixed how we use the debugfs directory with blktrace
> we should no longer have expected races with device removal/addition
> and other operations with the debugfs directory.
> 
> If races do happen however, we want to be informed of *how* this races
> happens rather than dealing with a debugfs splat, so upgrading this to a
> BUG_ON() should capture better information about how this can happen
> in the future.
> 
> This is specially true these days with funky reproducers in userspace
> for which we have no access to, but only a bug splat.
> 
> Note that on addition the gendisk kobject is used as the parent for the
> request_queue kobject, and upon removal, now that request_queue removal
> is synchronous, blk_unregister_queue() is called prior to the gendisk
> device_del(). This means we expect to see a sysfs clash first now prior
> to running into a race with the debugfs dentry; so this bug would be
> considered highly unlikely.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  block/blk-debugfs.c | 10 +---------
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
> index d84038bce0a5..761318dcbf40 100644
> --- a/block/blk-debugfs.c
> +++ b/block/blk-debugfs.c
> @@ -19,16 +19,8 @@ void blk_debugfs_register(void)
>  
>  int __must_check blk_queue_debugfs_register(struct request_queue *q)
>  {
> -	struct dentry *dir = NULL;
> -
>  	/* This can happen if we have a bug in the lower layers */
> -	dir = debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root);
> -	if (dir) {
> -		pr_warn("%s: registering request_queue debugfs directory twice is not allowed\n",
> -			kobject_name(q->kobj.parent));
> -		dput(dir);
> -		return -EALREADY;
> -	}
> +	BUG_ON(debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root));

So you are willing to crash the whole kernel and throw all of
userspace's data away if this happens?

Ick, no, don't do that, handle the issue correctly and move on.

As proof you shouldn't be doing this, that BUG_ON will trigger if
debugfs is not enabled, which might be a bit mean for all users of those
kernels :(

Hard NAK from me, sorry.

greg k-h

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 07/10] blktrace: move debugfs file creation to its own function
  2020-04-19 22:55   ` Bart Van Assche
@ 2020-04-20 11:37     ` Greg KH
  0 siblings, 0 replies; 57+ messages in thread
From: Greg KH @ 2020-04-20 11:37 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Luis Chamberlain, axboe, viro, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel

On Sun, Apr 19, 2020 at 03:55:15PM -0700, Bart Van Assche wrote:
> On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> > +static int blk_trace_create_debugfs_files(struct blk_user_trace_setup *buts,
> > +					  struct dentry *dir,
> > +					  struct blk_trace *bt)
> > +{
> > +	int ret = -EIO;
> > +
> > +	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
> > +					       &blk_dropped_fops);
> > +
> > +	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
> > +
> > +	bt->rchan = relay_open("trace", dir, buts->buf_size,
> > +				buts->buf_nr, &blk_relay_callbacks, bt);
> > +	if (!bt->rchan)
> > +		return ret;
> > +
> > +	return 0;
> > +}
> 
> How about adding IS_ERR() checks for the debugfs_create_file() return
> values?

No, please no, I have been removing all of that nonsense from the kernel
for the past 3-4 releases.  You should never care about the return value
from that call, just save it off and move on.

as it is, this is correct.

greg k-h

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 09/10] block: panic if block debugfs dir is not created
  2020-04-19 19:45 ` [PATCH v2 09/10] block: panic if block debugfs dir is not created Luis Chamberlain
  2020-04-19 23:08   ` Bart Van Assche
@ 2020-04-20 11:38   ` Greg KH
  1 sibling, 0 replies; 57+ messages in thread
From: Greg KH @ 2020-04-20 11:38 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Sun, Apr 19, 2020 at 07:45:28PM +0000, Luis Chamberlain wrote:
> If DEBUG_FS is disabled we have another inline
> blk_debugfs_register() which just returns 0.
> 
> If BLK_DEV_IO_TRACE is enabled we rely on the block debugfs
> directory to have been created. If BLK_DEV_IO_TRACE is not enabled
> though, but if debugfs is still enabled we will always create a
> debugfs directory for a request_queue. Instead of special-casing
> this just for BLK_DEV_IO_TRACE, ensure this block debugfs dir is
> always created at boot if we have enabled debugfs.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  block/blk-debugfs.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
> index 761318dcbf40..d6ec980e7531 100644
> --- a/block/blk-debugfs.c
> +++ b/block/blk-debugfs.c
> @@ -15,6 +15,8 @@ struct dentry *blk_debugfs_root;
>  void blk_debugfs_register(void)
>  {
>  	blk_debugfs_root = debugfs_create_dir("block", NULL);
> +	if (!blk_debugfs_root)
> +		panic("Failed to create block debugfs directory\n");

How rude, never crash the kernel for something so trivial as that.

Heck, never do ANYTHING different in the kernel if debugfs fails to do
something you think it should do.  This is debugging code, nothing
should ever depend on it, so just save the value (if you need it) and
move on.  Never check the value, as it means nothing to you.

greg k-h

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup
  2020-04-19 19:45 ` [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup Luis Chamberlain
  2020-04-19 22:57   ` Bart Van Assche
@ 2020-04-20 11:39   ` Greg KH
  1 sibling, 0 replies; 57+ messages in thread
From: Greg KH @ 2020-04-20 11:39 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Sun, Apr 19, 2020 at 07:45:27PM +0000, Luis Chamberlain wrote:
> Even though debugfs can be disabled, enabling BLK_DEV_IO_TRACE will
> select DEBUG_FS, and blktrace exposes an API which userspace uses
> relying on certain files created in debugfs. If files are not created
> blktrace will not work correctly, so we do want to ensure that a
> blktrace setup creates these files properly, and otherwise inform
> userspace.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  kernel/trace/blktrace.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 9cc0153849c3..fc32a8665ce8 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -552,17 +552,19 @@ static int blk_trace_create_debugfs_files(struct blk_user_trace_setup *buts,
>  					  struct dentry *dir,
>  					  struct blk_trace *bt)
>  {
> -	int ret = -EIO;
> -
>  	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
>  					       &blk_dropped_fops);
> +	if (!bt->dropped_file)
> +		return -ENOMEM;

No, this is wrong, please do not ever check the return value of a
debugfs call.  See the zillions of patches I've been doing to the kernel
for this type of thing over the past year for examples of why.

the code is fine as-is.

greg k-h

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup
  2020-04-19 23:17       ` Bart Van Assche
@ 2020-04-20 11:40         ` Greg KH
  2020-04-20 18:44           ` Luis Chamberlain
  0 siblings, 1 reply; 57+ messages in thread
From: Greg KH @ 2020-04-20 11:40 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Luis Chamberlain, axboe, viro, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel

On Sun, Apr 19, 2020 at 04:17:46PM -0700, Bart Van Assche wrote:
> On 4/19/20 4:05 PM, Luis Chamberlain wrote:
> > On Sun, Apr 19, 2020 at 03:57:58PM -0700, Bart Van Assche wrote:
> > > On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> > > > Even though debugfs can be disabled, enabling BLK_DEV_IO_TRACE will
> > > > select DEBUG_FS, and blktrace exposes an API which userspace uses
> > > > relying on certain files created in debugfs. If files are not created
> > > > blktrace will not work correctly, so we do want to ensure that a
> > > > blktrace setup creates these files properly, and otherwise inform
> > > > userspace.
> > > > 
> > > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > > > ---
> > > >    kernel/trace/blktrace.c | 8 +++++---
> > > >    1 file changed, 5 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> > > > index 9cc0153849c3..fc32a8665ce8 100644
> > > > --- a/kernel/trace/blktrace.c
> > > > +++ b/kernel/trace/blktrace.c
> > > > @@ -552,17 +552,19 @@ static int blk_trace_create_debugfs_files(struct blk_user_trace_setup *buts,
> > > >    					  struct dentry *dir,
> > > >    					  struct blk_trace *bt)
> > > >    {
> > > > -	int ret = -EIO;
> > > > -
> > > >    	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
> > > >    					       &blk_dropped_fops);
> > > > +	if (!bt->dropped_file)
> > > > +		return -ENOMEM;
> > > >    	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
> > > > +	if (!bt->msg_file)
> > > > +		return -ENOMEM;
> > > >    	bt->rchan = relay_open("trace", dir, buts->buf_size,
> > > >    				buts->buf_nr, &blk_relay_callbacks, bt);
> > > >    	if (!bt->rchan)
> > > > -		return ret;
> > > > +		return -EIO;
> > > >    	return 0;
> > > >    }
> > > 
> > > I should have had a look at this patch before I replied to the previous
> > > patch.
> > > 
> > > Do you agree that the following code can be triggered by
> > > debugfs_create_file() and also that debugfs_create_file() never returns
> > > NULL?
> > 
> > If debugfs is enabled, and not that we know it is in this blktrace code,
> > as we select it, it can return ERR_PTR(-ERROR) if an error occurs.
> 
> This is what I found in include/linux/debugfs.h in case debugfs is disabled:
> 
> static inline struct dentry *debugfs_create_file(const char *name,
> 	umode_t mode, struct dentry *parent, void *data,
> 	const struct file_operations *fops)
> {
> 	return ERR_PTR(-ENODEV);
> }
> 
> I have not found any code path that can cause debugfs_create_file() to
> return NULL. Did I perhaps overlook something? If not, it's not clear to me
> why the above patch adds checks that check whether debugfs_create_file()
> returns NULL?

Short answer, yes, it can return NULL.  Correct answer is, you don't
care, don't check the value and don't do anything about it.  It's
debugging code, userspace doesn't care, so just keep moving on.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup
  2020-04-20 11:40         ` Greg KH
@ 2020-04-20 18:44           ` Luis Chamberlain
  2020-04-20 20:11             ` Greg KH
  0 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-20 18:44 UTC (permalink / raw)
  To: Greg KH
  Cc: Bart Van Assche, axboe, viro, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel

On Mon, Apr 20, 2020 at 01:40:38PM +0200, Greg KH wrote:
> On Sun, Apr 19, 2020 at 04:17:46PM -0700, Bart Van Assche wrote:
> > On 4/19/20 4:05 PM, Luis Chamberlain wrote:
> > > On Sun, Apr 19, 2020 at 03:57:58PM -0700, Bart Van Assche wrote:
> > > > On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> > > > > Even though debugfs can be disabled, enabling BLK_DEV_IO_TRACE will
> > > > > select DEBUG_FS, and blktrace exposes an API which userspace uses
> > > > > relying on certain files created in debugfs. If files are not created
> > > > > blktrace will not work correctly, so we do want to ensure that a
> > > > > blktrace setup creates these files properly, and otherwise inform
> > > > > userspace.
> > > > > 
> > > > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > > > > ---
> > > > >    kernel/trace/blktrace.c | 8 +++++---
> > > > >    1 file changed, 5 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> > > > > index 9cc0153849c3..fc32a8665ce8 100644
> > > > > --- a/kernel/trace/blktrace.c
> > > > > +++ b/kernel/trace/blktrace.c
> > > > > @@ -552,17 +552,19 @@ static int blk_trace_create_debugfs_files(struct blk_user_trace_setup *buts,
> > > > >    					  struct dentry *dir,
> > > > >    					  struct blk_trace *bt)
> > > > >    {
> > > > > -	int ret = -EIO;
> > > > > -
> > > > >    	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
> > > > >    					       &blk_dropped_fops);
> > > > > +	if (!bt->dropped_file)
> > > > > +		return -ENOMEM;
> > > > >    	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
> > > > > +	if (!bt->msg_file)
> > > > > +		return -ENOMEM;
> > > > >    	bt->rchan = relay_open("trace", dir, buts->buf_size,
> > > > >    				buts->buf_nr, &blk_relay_callbacks, bt);
> > > > >    	if (!bt->rchan)
> > > > > -		return ret;
> > > > > +		return -EIO;
> > > > >    	return 0;
> > > > >    }
> > > > 
> > > > I should have had a look at this patch before I replied to the previous
> > > > patch.
> > > > 
> > > > Do you agree that the following code can be triggered by
> > > > debugfs_create_file() and also that debugfs_create_file() never returns
> > > > NULL?
> > > 
> > > If debugfs is enabled, and not that we know it is in this blktrace code,
> > > as we select it, it can return ERR_PTR(-ERROR) if an error occurs.
> > 
> > This is what I found in include/linux/debugfs.h in case debugfs is disabled:
> > 
> > static inline struct dentry *debugfs_create_file(const char *name,
> > 	umode_t mode, struct dentry *parent, void *data,
> > 	const struct file_operations *fops)
> > {
> > 	return ERR_PTR(-ENODEV);
> > }
> > 
> > I have not found any code path that can cause debugfs_create_file() to
> > return NULL. Did I perhaps overlook something? If not, it's not clear to me
> > why the above patch adds checks that check whether debugfs_create_file()
> > returns NULL?
> 
> Short answer, yes, it can return NULL.  Correct answer is, you don't
> care, don't check the value and don't do anything about it.  It's
> debugging code, userspace doesn't care, so just keep moving on.

Thing is this code *exposes* knobs to userspace for an API that *does*
exepect those files to exist. That is, blktrace *relies* on these
debugfs files to exist. So the kconfig which enables blktrace
CONFIG_BLK_DEV_IO_TRACE selects DEBUG_FS.

So typically we don't care if these files were created or not on regular
drivers, but in this case this code is only compiled when debugfs is
enabled and CONFIG_BLK_DEV_IO_TRACE, and the userspace interaction with
debugfs *expects* these files.

So what do you recommend?

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-20  0:38       ` Bart Van Assche
@ 2020-04-20 18:46         ` Luis Chamberlain
  0 siblings, 0 replies; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-20 18:46 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Sun, Apr 19, 2020 at 05:38:59PM -0700, Bart Van Assche wrote:
> On 4/19/20 5:04 PM, Luis Chamberlain wrote:
> > > Independent of what the purpose of the above code is, can that code be
> > > rewritten such that it does not depend on the details of how names are
> > > assigned to disks and partitions? Would disk_get_part() be useful here?
> > 
> > I did try, but couldn't figure out a way. I'll keep looking but likewise
> > let me know if you find a way.
> 
> How about making blk_trace_setup() pass the result of the following
> expression as an additional argument to blk_trace_setup():
> 
> 	bdev != bdev->bd_contains
> 
> I think that is a widely used approach to verify after a block device has
> been opened whether or not 'bdev' refers to a partition (bdev !=
> bdev->bd_contains means that 'bdev' represents a partition).

Sweet, that should simplify this considerbly, thanks!

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 04/10] block: revert back to synchronous request_queue removal
  2020-04-19 22:23   ` Bart Van Assche
@ 2020-04-20 18:59     ` Luis Chamberlain
  2020-04-20 21:11       ` Bart Van Assche
  0 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-20 18:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On Sun, Apr 19, 2020 at 03:23:31PM -0700, Bart Van Assche wrote:
> On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> > +/**
> > + * blk_put_queue - decrement the request_queue refcount
> > + *
> > + * @q: the request_queue structure to decrement the refcount for
> > + *
> 
> How about following the example from Documentation/doc-guide/kernel-doc.rst
> and not leaving a blank line above the function argument documentation?

Sure.

> > + * Decrements the refcount to the request_queue kobject, when this reaches
>                               ^^
>                               of?
> > + * 0 we'll have blk_release_queue() called. You should avoid calling
> > + * this function in atomic context but if you really have to ensure you
> > + * first refcount the block device with bdgrab() / bdput() so that the
> > + * last decrement happens in blk_cleanup_queue().
> > + */
> 
> Is calling bdgrab() and bdput() an option from a context in which it is not
> guaranteed that the block device is open?

If the block device is not open, nope. For that blk_get_queue() can
be used, and is used by the block layer. This begs the question:

Do we have *drivers* which requires access to the request_queue from
atomic context when the block device is not open?

> Does every context that calls blk_put_queue() also call blk_cleanup_queue()?

Nope.

> How about avoiding confusion by changing the last sentence of that comment
> into something like the following: "The last reference must not be dropped
> from atomic context. If it is necessary to call blk_put_queue() from atomic
> context, make sure that that call does not decrease the request queue
> refcount to zero."

This would be fine, if not for the fact that it seems worthy to also ask
ourselves if we even need blk_get_queue() / blk_put_queue() exported for
drivers.

I haven't yet finalized my review of this, but planting the above
comment cements the idea further that it is possible. Granted, I think
its fine as -- that is our current use case and best practice. Removing
the export for blk_get_queue() / blk_put_queue() should entail reviewing
each driver caller and ensuring that it is not needed. And that is not
done yet, and should be considered a separate effort.

> >   /**
> >    * blk_cleanup_queue - shutdown a request queue
> > + *
> >    * @q: request queue to shutdown
> >    *
> 
> How about following the example from Documentation/doc-guide/kernel-doc.rst
> and not leaving a blank line above the function argument documentation?

Will do.

> >    * Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
> >    * put it.  All future requests will be failed immediately with -ENODEV.
> > + *
> > + * You should not call this function in atomic context. If you need to
> > + * refcount a request_queue in atomic context, instead refcount the
> > + * block device with bdgrab() / bdput().
> 
> Surrounding blk_cleanup_queue() with bdgrab() / bdput() does not help. This
> blk_cleanup_queue() must not be called from atomic context.

I'll just remove that.

> 
> >   /**
> > - * __blk_release_queue - release a request queue
> > - * @work: pointer to the release_work member of the request queue to be released
> > + * blk_release_queue - release a request queue
> > + *
> > + * This function is called as part of the process when a block device is being
> > + * unregistered. Releasing a request queue starts with blk_cleanup_queue(),
> > + * which set the appropriate flags and then calls blk_put_queue() as the last
> > + * step. blk_put_queue() decrements the reference counter of the request queue
> > + * and once the reference counter reaches zero, this function is called to
> > + * release all allocated resources of the request queue.
> >    *
> > - * Description:
> > - *     This function is called when a block device is being unregistered. The
> > - *     process of releasing a request queue starts with blk_cleanup_queue, which
> > - *     set the appropriate flags and then calls blk_put_queue, that decrements
> > - *     the reference counter of the request queue. Once the reference counter
> > - *     of the request queue reaches zero, blk_release_queue is called to release
> > - *     all allocated resources of the request queue.
> > + * This function can sleep, and so we must ensure that the very last
> > + * blk_put_queue() is never called from atomic context.
> > + *
> > + * @kobj: pointer to a kobject, who's container is a request_queue
> >    */
> 
> Please follow the style used elsewhere in the kernel and move function
> argument documentation just below the line with the function name.

Sure, thanks for the review.

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup
  2020-04-20 18:44           ` Luis Chamberlain
@ 2020-04-20 20:11             ` Greg KH
  2020-04-20 20:20               ` Luis Chamberlain
  0 siblings, 1 reply; 57+ messages in thread
From: Greg KH @ 2020-04-20 20:11 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Bart Van Assche, axboe, viro, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel

On Mon, Apr 20, 2020 at 06:44:45PM +0000, Luis Chamberlain wrote:
> On Mon, Apr 20, 2020 at 01:40:38PM +0200, Greg KH wrote:
> > On Sun, Apr 19, 2020 at 04:17:46PM -0700, Bart Van Assche wrote:
> > > On 4/19/20 4:05 PM, Luis Chamberlain wrote:
> > > > On Sun, Apr 19, 2020 at 03:57:58PM -0700, Bart Van Assche wrote:
> > > > > On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> > > > > > Even though debugfs can be disabled, enabling BLK_DEV_IO_TRACE will
> > > > > > select DEBUG_FS, and blktrace exposes an API which userspace uses
> > > > > > relying on certain files created in debugfs. If files are not created
> > > > > > blktrace will not work correctly, so we do want to ensure that a
> > > > > > blktrace setup creates these files properly, and otherwise inform
> > > > > > userspace.
> > > > > > 
> > > > > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > > > > > ---
> > > > > >    kernel/trace/blktrace.c | 8 +++++---
> > > > > >    1 file changed, 5 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> > > > > > index 9cc0153849c3..fc32a8665ce8 100644
> > > > > > --- a/kernel/trace/blktrace.c
> > > > > > +++ b/kernel/trace/blktrace.c
> > > > > > @@ -552,17 +552,19 @@ static int blk_trace_create_debugfs_files(struct blk_user_trace_setup *buts,
> > > > > >    					  struct dentry *dir,
> > > > > >    					  struct blk_trace *bt)
> > > > > >    {
> > > > > > -	int ret = -EIO;
> > > > > > -
> > > > > >    	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
> > > > > >    					       &blk_dropped_fops);
> > > > > > +	if (!bt->dropped_file)
> > > > > > +		return -ENOMEM;
> > > > > >    	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
> > > > > > +	if (!bt->msg_file)
> > > > > > +		return -ENOMEM;
> > > > > >    	bt->rchan = relay_open("trace", dir, buts->buf_size,
> > > > > >    				buts->buf_nr, &blk_relay_callbacks, bt);
> > > > > >    	if (!bt->rchan)
> > > > > > -		return ret;
> > > > > > +		return -EIO;
> > > > > >    	return 0;
> > > > > >    }
> > > > > 
> > > > > I should have had a look at this patch before I replied to the previous
> > > > > patch.
> > > > > 
> > > > > Do you agree that the following code can be triggered by
> > > > > debugfs_create_file() and also that debugfs_create_file() never returns
> > > > > NULL?
> > > > 
> > > > If debugfs is enabled, and not that we know it is in this blktrace code,
> > > > as we select it, it can return ERR_PTR(-ERROR) if an error occurs.
> > > 
> > > This is what I found in include/linux/debugfs.h in case debugfs is disabled:
> > > 
> > > static inline struct dentry *debugfs_create_file(const char *name,
> > > 	umode_t mode, struct dentry *parent, void *data,
> > > 	const struct file_operations *fops)
> > > {
> > > 	return ERR_PTR(-ENODEV);
> > > }
> > > 
> > > I have not found any code path that can cause debugfs_create_file() to
> > > return NULL. Did I perhaps overlook something? If not, it's not clear to me
> > > why the above patch adds checks that check whether debugfs_create_file()
> > > returns NULL?
> > 
> > Short answer, yes, it can return NULL.  Correct answer is, you don't
> > care, don't check the value and don't do anything about it.  It's
> > debugging code, userspace doesn't care, so just keep moving on.
> 
> Thing is this code *exposes* knobs to userspace for an API that *does*
> exepect those files to exist. That is, blktrace *relies* on these
> debugfs files to exist. So the kconfig which enables blktrace
> CONFIG_BLK_DEV_IO_TRACE selects DEBUG_FS.

That's nice, but again, no kernel code should do anything different
depending on what debugfs happens to be doing at that point in time.

> So typically we don't care if these files were created or not on regular
> drivers, but in this case this code is only compiled when debugfs is
> enabled and CONFIG_BLK_DEV_IO_TRACE, and the userspace interaction with
> debugfs *expects* these files.
> 
> So what do you recommend?

Make sure that userspace can handle the files not being there and keep
on working properly if they aren't.

As you can't "recover" from debugfs failing, there's no need to check
anything with it.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-19 19:45 ` [PATCH v2 03/10] blktrace: fix debugfs use after free Luis Chamberlain
  2020-04-19 21:55   ` Bart Van Assche
@ 2020-04-20 20:16   ` Greg KH
  2020-04-20 20:41     ` Luis Chamberlain
  2020-04-22  7:27   ` Christoph Hellwig
  2 siblings, 1 reply; 57+ messages in thread
From: Greg KH @ 2020-04-20 20:16 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Sun, Apr 19, 2020 at 07:45:22PM +0000, Luis Chamberlain wrote:
> On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
> merged on v4.12 Omar fixed the original blktrace code for request-based
> drivers (multiqueue). This however left in place a possible crash, if you
> happen to abuse blktrace in a way it was not intended, and even more so
> with our current asynchronous request_queue removal.
> 
> Namely, if you loop adding a device, setup the blktrace with BLKTRACESETUP,
> forget to BLKTRACETEARDOWN, and then just remove the device you end up
> with a panic:
> 
> [  107.193134] debugfs: Directory 'loop0' with parent 'block' already present!
> [  107.254615] BUG: kernel NULL pointer dereference, address: 00000000000000a0
> [  107.258785] #PF: supervisor write access in kernel mode
> [  107.262035] #PF: error_code(0x0002) - not-present page
> [  107.264106] PGD 0 P4D 0
> [  107.264404] Oops: 0002 [#1] SMP NOPTI
> [  107.264803] CPU: 8 PID: 674 Comm: kworker/8:2 Tainted: G            E 5.6.0-rc7-next-20200327 #1
> [  107.265712] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> [  107.266553] Workqueue: events __blk_release_queue
> [  107.267051] RIP: 0010:down_write+0x15/0x40
> [  107.267488] Code: eb ca e8 ee a5 8d ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 55 00 75 0f  65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
> [  107.269300] RSP: 0018:ffff9927c06efda8 EFLAGS: 00010246
> [  107.269841] RAX: 0000000000000000 RBX: ffff8be7e73b0600 RCX: ffffff8100000000
> [  107.270559] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
> [  107.271281] RBP: 00000000000000a0 R08: ffff8be7ebc80fa8 R09: ffff8be7ebc80fa8
> [  107.272001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [  107.272722] R13: ffff8be7efc30400 R14: ffff8be7e0571200 R15: 00000000000000a0
> [  107.273475] FS:  0000000000000000(0000) GS:ffff8be7efc00000(0000) knlGS:0000000000000000
> [  107.274346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  107.274968] CR2: 00000000000000a0 CR3: 000000042abee003 CR4: 0000000000360ee0
> [  107.275710] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  107.276465] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  107.277214] Call Trace:
> [  107.277532]  simple_recursive_removal+0x4e/0x2e0
> [  107.278049]  ? debugfs_remove+0x60/0x60
> [  107.278493]  debugfs_remove+0x40/0x60
> [  107.278922]  blk_trace_free+0xd/0x50
> [  107.279339]  __blk_trace_remove+0x27/0x40
> [  107.279797]  blk_trace_shutdown+0x30/0x40
> [  107.280256]  __blk_release_queue+0xab/0x110
> [  107.280734]  process_one_work+0x1b4/0x380
> [  107.281194]  worker_thread+0x50/0x3c0
> [  107.281622]  kthread+0xf9/0x130
> [  107.281994]  ? process_one_work+0x380/0x380
> [  107.282467]  ? kthread_park+0x90/0x90
> [  107.282895]  ret_from_fork+0x1f/0x40
> [  107.283316] Modules linked in: loop(E) <etc>
> [  107.288562] CR2: 00000000000000a0
> [  107.288957] ---[ end trace b885d243d441bbce ]---
> 
> This splat happens to be very similar to the one reported via
> kernel.org korg#205713, only that korg#205713 was for v4.19.83
> and the above now includes the simple_recursive_removal() introduced
> via commit a3d1e7eb5abe ("simple_recursive_removal(): kernel-side rm
> -rf for ramfs-style filesystems") merged on v5.6.
> 
> korg#205713 then was used to create CVE-2019-19770 and claims that
> the bug is in a use-after-free in the debugfs core code. The
> implications of this being a generic UAF on debugfs would be
> much more severe, as it would imply parent dentries can sometimes
> not be positive, which we hold by design is just not possible.
> 
> Below is the splat explained with a bit more details, explaining
> what is happening in userspace, kernel, and a print of the CPU on,
> which the code runs on:
> 
> load loopback module
> [   13.603371] == blk_mq_debugfs_register(12) start
> [   13.604040] == blk_mq_debugfs_register(12) q->debugfs_dir created
> [   13.604934] == blk_mq_debugfs_register(12) end
> [   13.627382] == blk_mq_debugfs_register(12) start
> [   13.628041] == blk_mq_debugfs_register(12) q->debugfs_dir created
> [   13.629240] == blk_mq_debugfs_register(12) end
> [   13.651667] == blk_mq_debugfs_register(12) start
> [   13.652836] == blk_mq_debugfs_register(12) q->debugfs_dir created
> [   13.655107] == blk_mq_debugfs_register(12) end
> [   13.684917] == blk_mq_debugfs_register(12) start
> [   13.687876] == blk_mq_debugfs_register(12) q->debugfs_dir created
> [   13.691588] == blk_mq_debugfs_register(13) end
> [   13.707320] == blk_mq_debugfs_register(13) start
> [   13.707863] == blk_mq_debugfs_register(13) q->debugfs_dir created
> [   13.708856] == blk_mq_debugfs_register(13) end
> [   13.735623] == blk_mq_debugfs_register(13) start
> [   13.736656] == blk_mq_debugfs_register(13) q->debugfs_dir created
> [   13.738411] == blk_mq_debugfs_register(13) end
> [   13.763326] == blk_mq_debugfs_register(13) start
> [   13.763972] == blk_mq_debugfs_register(13) q->debugfs_dir created
> [   13.765167] == blk_mq_debugfs_register(13) end
> [   13.779510] == blk_mq_debugfs_register(13) start
> [   13.780522] == blk_mq_debugfs_register(13) q->debugfs_dir created
> [   13.782338] == blk_mq_debugfs_register(13) end
> [   13.783521] loop: module loaded
> 
> LOOP_CTL_DEL(loop0) #1
> [   13.803550] = __blk_release_queue(4) start
> [   13.807772] == blk_trace_shutdown(4) start
> [   13.810749] == blk_trace_shutdown(4) end
> [   13.813437] = __blk_release_queue(4) calling blk_mq_debugfs_unregister()
> [   13.817593] ==== blk_mq_debugfs_unregister(4) begin
> [   13.817621] ==== blk_mq_debugfs_unregister(4) debugfs_remove_recursive(q->debugfs_dir)
> [   13.821203] ==== blk_mq_debugfs_unregister(4) end q->debugfs_dir is NULL
> [   13.826166] = __blk_release_queue(4) blk_mq_debugfs_unregister() end
> [   13.832992] = __blk_release_queue(4) end
> 
> LOOP_CTL_ADD(loop0) #1
> [   13.843742] == blk_mq_debugfs_register(7) start
> [   13.845569] == blk_mq_debugfs_register(7) q->debugfs_dir created
> [   13.848628] == blk_mq_debugfs_register(7) end
> 
> BLKTRACE_SETUP(loop0) #1
> [   13.850924] == blk_trace_ioctl(7, BLKTRACESETUP) start
> [   13.852852] === do_blk_trace_setup(7) start
> [   13.854580] === do_blk_trace_setup(7) creating directory
> [   13.856620] === do_blk_trace_setup(7) using what debugfs_lookup() gave
> [   13.860635] === do_blk_trace_setup(7) end with ret: 0
> [   13.862615] == blk_trace_ioctl(7, BLKTRACESETUP) end
> 
> LOOP_CTL_DEL(loop0) #2
> [   13.883304] = __blk_release_queue(7) start
> [   13.885324] == blk_trace_shutdown(7) start
> [   13.887197] == blk_trace_shutdown(7) calling __blk_trace_remove()
> [   13.889807] == __blk_trace_remove(7) start
> [   13.891669] === blk_trace_cleanup(7) start
> [   13.911656] ====== blk_trace_free(7) start
> 
> This is a scheduled __blk_release_queue(). Note that at this point
> we know that the old device is gone as the gendisk removal is
> synchronous.
> 
> LOOP_CTL_ADD(loop0) #2
> [   13.912709] == blk_mq_debugfs_register(2) start
> 
> This is a *new* device, as we know the last one was removed synchronously.
> 
> ---> From LOOP_CTL_DEL(loop0) #2
> [   13.915887] ====== blk_trace_free(7) end
> 
> ---> From LOOP_CTL_ADD(loop0) #2
> [   13.918359] debugfs: Directory 'loop0' with parent 'block' already present!
> 
> Above we have a case where the upon LOOP_CTL_ADD(loop) #2 the
> debugfs_create_dir() start_creating() call will have found an existing dentry,
> and return ERR_PTR(-EEXIST), and so the q->debugfs_dir will be initialized to
> NULL. This clash happens because an asynchronous __blk_release_queue() has
> been scheduled from the above LOOP_CTL_DEL(loop0) #2 call, and it has not yet
> removed the old loop0 debugfs directory.
> 
> [   13.926433] == blk_mq_debugfs_register(2) q->debugfs_dir created
> 
> This is a huge assumption, since we never check for errors on
> blk_mq_debugfs_register(), and so the child files we need to
> create *can* simply be created using NULL as as the parent, in
> which case debugfs would use debugfs_mount->mnt_root as the parent.
> 
> [   13.930373] == blk_mq_debugfs_register(2) end
> 
> BLKTRACE_SETUP(loop0) #2
> [   13.933961] == blk_trace_ioctl(2, BLKTRACESETUP) start
> [   13.936758] === do_blk_trace_setup(2) start
> [   13.938944] === do_blk_trace_setup(2) creating directory
> [   13.941029] === do_blk_trace_setup(2) using what debugfs_lookup() gave
> 
> Here debugfs_lookup() will have found the old debugfs directory, even
> though q->debugfs_dir was empty when initializing this device. So the
> files created can actually be created where they should have.
> 
> ---> From LOOP_CTL_DEL(loop0) #2
> [   13.971046] === blk_trace_cleanup(7) end
> [   13.973175] == __blk_trace_remove(7) end
> [   13.975352] == blk_trace_shutdown(7) end
> [   13.977415] = __blk_release_queue(7) calling blk_mq_debugfs_unregister()
> [   13.980645] ==== blk_mq_debugfs_unregister(7) begin
> [   13.980696] ==== blk_mq_debugfs_unregister(7) debugfs_remove_recursive(q->debugfs_dir)
> 
> This is a continuation from the old scheduled  __blk_release_queue(). It
> is using another request_queue structure than what the currently activated
> blktrace setup is using.
> 
> Here now the dentry where the blktrace setup files were created on top
> of is gone.  In fact, since debugfs_remove_recursive() is used, *all* of
> those blktrace debugfs files should have been removed as well, including
> the relay. This is despite the fact that the removal and blktrace setup
> are working on two different request_queue structures.
> 
> [   13.983118] ==== blk_mq_debugfs_unregister(7) end q->debugfs_dir is NULL
> [   13.986945] = __blk_release_queue(7) blk_mq_debugfs_unregister() end
> [   13.993155] = __blk_release_queue(7) end
> 
> The scheduled __blk_release_queue completes finally.
> 
> ---> From BLKTRACE_SETUP(loop0) #2
> [   13.995928] === do_blk_trace_setup(2) end with ret: 0
> [   13.997623] == blk_trace_ioctl(2, BLKTRACESETUP) end
> 
> At this point the old blktrace *setup* is complete though, and let us
> recall that the old dentry is still used. But recall that we just had
> our old device removed, and since debugfs_remove_recursive() was used,
> so are all of its files. This request_queue however is still active.
> 
> LOOP_CTL_DEL(loop0) #3
> [   14.035119] = __blk_release_queue(2) start
> [   14.036925] == blk_trace_shutdown(2) start
> [   14.038518] == blk_trace_shutdown(2) calling __blk_trace_remove()
> [   14.040829] == __blk_trace_remove(2) start
> [   14.042413] === blk_trace_cleanup(2) start
> 
> This removal will cleanup the blktrace setup. Note the trap though,
> the blktrace setup will already have been cleaned up since the
> scheduled removal from instance #2 removed the files on the same
> debugfs path.
> 
> The __blk_release_queue() is scheduled and is asynchronous so can
> continue later, and that is where the crash happens, on CPU 2.
> 
> LOOP_CTL_ADD(loop0) #3
> [   14.072522] == blk_mq_debugfs_register(6) start
> 
> ---> From LOOP_CTL_DEL(loop0) #3
> [   14.075151] ====== blk_trace_free(2) start
> 
> ---> From LOOP_CTL_ADD(loop0) #3
> [   14.075882] == blk_mq_debugfs_register(6) q->debugfs_dir created
> 
> ---> From LOOP_CTL_DEL(loop0) #3
> The panic happens on CPU 2 when we continue with the old scheduled
> removal of the request_queue, which will run into a blktrace setup
> with NULL pointers.
> 
> [   14.078624] BUG: kernel NULL pointer dereference, address: 00000000000000a0
> [   14.084332] == blk_mq_debugfs_register(6) end
> [   14.086971] #PF: supervisor write access in kernel mode
> [   14.086974] #PF: error_code(0x0002) - not-present page
> [   14.086977] PGD 0 P4D 0
> [   14.086984] Oops: 0002 [#1] SMP NOPTI
> [   14.086990] CPU: 2 PID: 287 Comm: kworker/2:2 Tainted: G            E 5.6.0-next-20200403+ #54
> [   14.086991] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> [   14.087002] Workqueue: events __blk_release_queue
> [   14.087011] RIP: 0010:down_write+0x15/0x40
> [   14.090300] == blk_trace_ioctl(6, BLKTRACESETUP) start
> [   14.093277] Code: eb ca e8 3e 34 8d ff cc cc cc cc cc cc cc cc cc cc
> cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00
> 00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
> [   14.093280] RSP: 0018:ffffc28a00533da8 EFLAGS: 00010246
> [   14.093284] RAX: 0000000000000000 RBX: ffff9f7a24d07980 RCX: ffffff8100000000
> [   14.093286] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
> [   14.093287] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000019
> [   14.093289] R10: 0000000000000774 R11: 0000000000000000 R12: 0000000000000000
> [   14.093291] R13: ffff9f7a2fab0400 R14: ffff9f7a21dd1140 R15: 00000000000000a0
> [   14.093294] FS:  0000000000000000(0000) GS:ffff9f7a2fa80000(0000) knlGS:0000000000000000
> [   14.093296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   14.093298] CR2: 00000000000000a0 CR3: 00000004293d2003 CR4: 0000000000360ee0
> [   14.093307] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   14.093308] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   14.093310] Call Trace:
> [   14.093324]  simple_recursive_removal+0x4e/0x2e0
> [   14.093330]  ? debugfs_remove+0x60/0x60
> [   14.093334]  debugfs_remove+0x40/0x60
> [   14.093339]  blk_trace_free+0x20/0x70
> [   14.093346]  __blk_trace_remove+0x54/0x90
> [   14.096704] === do_blk_trace_setup(6) start
> [   14.098534]  blk_trace_shutdown+0x74/0x80
> [   14.100958] === do_blk_trace_setup(6) creating directory
> [   14.104575]  __blk_release_queue+0xbe/0x160
> [   14.104580]  process_one_work+0x1b4/0x380
> [   14.104585]  worker_thread+0x50/0x3c0
> [   14.104589]  kthread+0xf9/0x130
> [   14.104593]  ? process_one_work+0x380/0x380
> [   14.104596]  ? kthread_park+0x90/0x90
> [   14.104599]  ret_from_fork+0x1f/0x40
> [   14.104603] Modules linked in: loop(E) xfs(E) libcrc32c(E)
> crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) joydev(E)
> serio_raw(E) aesni_intel(E) glue_helper(E) virtio_balloon(E) evdev(E)
> crypto_simd(E) pcspkr(E) cryptd(E) i6300esb(E) button(E) ip_tables(E)
> x_tables(E) autofs4(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E)
> jbd2(E) virtio_net(E) net_failover(E) failover(E) virtio_blk(E)
> ata_generic(E) uhci_hcd(E) ata_piix(E) ehci_hcd(E) nvme(E) libata(E)
> crc32c_intel(E) usbcore(E) psmouse(E) nvme_core(E) virtio_pci(E)
> scsi_mod(E) virtio_ring(E) t10_pi(E) virtio(E) i2c_piix4(E) floppy(E)
> [   14.107400] === do_blk_trace_setup(6) using what debugfs_lookup() gave
> [   14.108939] CR2: 00000000000000a0
> [   14.110589] === do_blk_trace_setup(6) end with ret: 0
> [   14.111592] ---[ end trace 7a783b33b9614db9 ]---
> 
> The root cause to this issue is that debugfs_lookup() can find a
> previous incarnation's dir of the same name which is about to get
> removed from a not yet schedule work. When that happens, the the files
> are taken underneath the nose of the blktrace, and when it comes time to
> cleanup, these dentries are long gone because of a scheduled removal.
> 
> This issue is happening because of two reasons:
> 
>   1) The request_queue is currently removed asynchronously as of commit
>      dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
>      v4.12, this allows races with userspace which were not possible
>      before unless as removal of a block device used to happen
>      synchronously with its request_queue. One could however still
>      parallelize blksetup calls while one loops on device addition and
>      removal.
> 
>   2) There are no errors checks when we create the debugfs directory,
>      be it on init or for blktrace. The concept of when the directory
>      *should* really exist is further complicated by the fact that
>      we have asynchronous request_queue removal. And, we have no
>      real sanity checks to ensure we don't re-create the queue debugfs
>      directory.
> 
> We can fix the UAF by using a debugfs directory which moving forward
> will always be accessible if debugfs is enabled, this way, its allocated
> and avaialble always for both request-based block drivers or
> make_request drivers (multiqueue) block drivers.
> 
> We also put sanity checks in place to ensure that if the directory is
> found with debugfs_lookup() it is the dentry we expect. When doing a
> blktrace against a parition, we will always be creating a temporary
> debugfs directory, so ensure that only exists once as well to avoid
> issues against concurrent blktrace runs.
> 
> Lastly, since we are now always creating the needed request_queue
> debugfs directory upon init, we can also take the initiative to
> proactively check against errors. We currently do not check for
> errors on add_disk() and friends, but we shouldn't make the issue
> any worse.
> 
> This also simplifies the code considerably, with the only penalty now
> being that we're always creating the request queue debugfs directory for
> the request-based block device drivers.
> 
> The UAF then is not a core debugfs issue, but instead a complex misuse
> of debugfs, and this issue can only be triggered if you are root.
> 
> This issue can be reproduced with break-blktrace [2] using:
> 
>   break-blktrace -c 10 -d -s
> 
> This patch fixes this issue. Note that there is also another
> respective UAF but from the ioctl path [3], this should also fix
> that issue.
> 
> This patch then also disputes the severity of CVE-2019-19770 as
> this issue is only possible by being root and using blktrace.
> 
> It is not a core debugfs issue.
> 
> [0] https://bugzilla.kernel.org/show_bug.cgi?id=205713
> [1] https://nvd.nist.gov/vuln/detail/CVE-2019-19770
> [2] https://github.com/mcgrof/break-blktrace
> [3] https://lore.kernel.org/lkml/000000000000ec635b059f752700@google.com/
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Nicolai Stange <nstange@suse.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: yu kuai <yukuai3@huawei.com>
> Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com
> Fixes: 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  block/blk-debugfs.c          | 30 +++++++++++
>  block/blk-mq-debugfs.c       |  5 --
>  block/blk-sysfs.c            |  9 ++++
>  block/blk.h                  | 11 ++++
>  include/linux/blkdev.h       |  5 +-
>  include/linux/blktrace_api.h |  1 +
>  kernel/trace/blktrace.c      | 98 +++++++++++++++++++++++++++++++++---
>  7 files changed, 146 insertions(+), 13 deletions(-)

This patch triggered gmail's spam detection, your changelog text is
whack...

> diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
> index 19091e1effc0..d84038bce0a5 100644
> --- a/block/blk-debugfs.c
> +++ b/block/blk-debugfs.c
> @@ -3,6 +3,9 @@
>  /*
>   * Shared request-based / make_request-based functionality
>   */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
>  #include <linux/kernel.h>
>  #include <linux/blkdev.h>
>  #include <linux/debugfs.h>
> @@ -13,3 +16,30 @@ void blk_debugfs_register(void)
>  {
>  	blk_debugfs_root = debugfs_create_dir("block", NULL);
>  }
> +
> +int __must_check blk_queue_debugfs_register(struct request_queue *q)
> +{
> +	struct dentry *dir = NULL;
> +
> +	/* This can happen if we have a bug in the lower layers */
> +	dir = debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root);
> +	if (dir) {
> +		pr_warn("%s: registering request_queue debugfs directory twice is not allowed\n",
> +			kobject_name(q->kobj.parent));
> +		dput(dir);
> +		return -EALREADY;
> +	}
> +
> +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> +					    blk_debugfs_root);
> +	if (!q->debugfs_dir)
> +		return -ENOMEM;

Why doesn't the directory just live in the request queue, or somewhere
else, so that you save it when it is created and then that's it.  No
need to "look it up" anywhere else.

Or do you do that in later patches?  I only see this one at the moment,
sorry...

>  static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
> +					    struct request_queue *q,
>  					    struct blk_trace *bt)
>  {
>  	struct dentry *dir = NULL;
>  
> +	/* This can only happen if we have a bug on our lower layers */
> +	if (!q->kobj.parent) {
> +		pr_warn("%s: request_queue parent is gone\n", buts->name);

A kobject always has a parent, unless it has not been registered yet, so
I don't know what you are testing could ever happen.



> +		return NULL;
> +	}
> +
> +	/*
> +	 * From a sysfs kobject perspective, the request_queue sits on top of
> +	 * the gendisk, which has the name of the disk. We always create a
> +	 * debugfs directory upon init for this gendisk kobject, so we re-use
> +	 * that if blktrace is going to be done for it.
> +	 */
> +	if (blk_trace_target_disk(buts->name, kobject_name(q->kobj.parent))) {
> +		if (!q->debugfs_dir) {
> +			pr_warn("%s: expected request_queue debugfs_dir is not set\n",
> +				buts->name);

What is userspace supposed to be able to do if they see this warning?

> +			return NULL;
> +		}
> +		/*
> +		 * debugfs_lookup() is used to ensure the directory is not
> +		 * taken from underneath us. We must dput() it later once
> +		 * done with it within blktrace.
> +		 */
> +		dir = debugfs_lookup(buts->name, blk_debugfs_root);
> +		if (!dir) {
> +			pr_warn("%s: expected request_queue debugfs_dir dentry is gone\n",
> +				buts->name);

Again, can't we just save the pointer when we create it and not have to
look it up again?

> +			return NULL;
> +		}
> +		 /*
> +		 * This is a reaffirmation that debugfs_lookup() shall always
> +		 * return the same dentry if it was already set.
> +		 */

I'm all for reaffirmation and the like, but really, is this needed???

> +		if (dir != q->debugfs_dir) {
> +			dput(dir);
> +			pr_warn("%s: expected dentry dir != q->debugfs_dir\n",
> +				buts->name);
> +			return NULL;

Why are you testing to see if debugfs really is working properly?
SHould all users do crazy things like this (hint, rhetorical
question...)

> +		}
> +		bt->backing_dir = q->debugfs_dir;
> +		return bt->backing_dir;
> +	}
> +
> +	/*
> +	 * If not using blktrace on the gendisk, we are going to create a
> +	 * temporary debugfs directory for it, however this cannot be shared
> +	 * between two concurrent blktraces since the path is not unique, so
> +	 * ensure this is only done once.
> +	 */
>  	dir = debugfs_lookup(buts->name, blk_debugfs_root);
> -	if (!dir)
> -		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> +	if (dir) {
> +		pr_warn("%s: temporary blktrace debugfs directory already present\n",
> +			buts->name);
> +		dput(dir);
> +		return NULL;
> +	}
> +
> +	bt->dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> +	if (!bt->dir) {
> +		pr_warn("%s: temporary blktrace debugfs directory could not be created\n",
> +			buts->name);

Again, do not test the return value, you do not care.  I've been
removing these checks from everywhere.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup
  2020-04-20 20:11             ` Greg KH
@ 2020-04-20 20:20               ` Luis Chamberlain
  2020-04-21  6:55                 ` Greg KH
  0 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-20 20:20 UTC (permalink / raw)
  To: Greg KH
  Cc: Bart Van Assche, axboe, viro, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel

On Mon, Apr 20, 2020 at 10:11:01PM +0200, Greg KH wrote:
> On Mon, Apr 20, 2020 at 06:44:45PM +0000, Luis Chamberlain wrote:
> > On Mon, Apr 20, 2020 at 01:40:38PM +0200, Greg KH wrote:
> > > On Sun, Apr 19, 2020 at 04:17:46PM -0700, Bart Van Assche wrote:
> > > > On 4/19/20 4:05 PM, Luis Chamberlain wrote:
> > > > > On Sun, Apr 19, 2020 at 03:57:58PM -0700, Bart Van Assche wrote:
> > > > > > On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> > > > > > > Even though debugfs can be disabled, enabling BLK_DEV_IO_TRACE will
> > > > > > > select DEBUG_FS, and blktrace exposes an API which userspace uses
> > > > > > > relying on certain files created in debugfs. If files are not created
> > > > > > > blktrace will not work correctly, so we do want to ensure that a
> > > > > > > blktrace setup creates these files properly, and otherwise inform
> > > > > > > userspace.
> > > > > > > 
> > > > > > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > > > > > > ---
> > > > > > >    kernel/trace/blktrace.c | 8 +++++---
> > > > > > >    1 file changed, 5 insertions(+), 3 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> > > > > > > index 9cc0153849c3..fc32a8665ce8 100644
> > > > > > > --- a/kernel/trace/blktrace.c
> > > > > > > +++ b/kernel/trace/blktrace.c
> > > > > > > @@ -552,17 +552,19 @@ static int blk_trace_create_debugfs_files(struct blk_user_trace_setup *buts,
> > > > > > >    					  struct dentry *dir,
> > > > > > >    					  struct blk_trace *bt)
> > > > > > >    {
> > > > > > > -	int ret = -EIO;
> > > > > > > -
> > > > > > >    	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
> > > > > > >    					       &blk_dropped_fops);
> > > > > > > +	if (!bt->dropped_file)
> > > > > > > +		return -ENOMEM;
> > > > > > >    	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
> > > > > > > +	if (!bt->msg_file)
> > > > > > > +		return -ENOMEM;
> > > > > > >    	bt->rchan = relay_open("trace", dir, buts->buf_size,
> > > > > > >    				buts->buf_nr, &blk_relay_callbacks, bt);
> > > > > > >    	if (!bt->rchan)
> > > > > > > -		return ret;
> > > > > > > +		return -EIO;
> > > > > > >    	return 0;
> > > > > > >    }
> > > > > > 
> > > > > > I should have had a look at this patch before I replied to the previous
> > > > > > patch.
> > > > > > 
> > > > > > Do you agree that the following code can be triggered by
> > > > > > debugfs_create_file() and also that debugfs_create_file() never returns
> > > > > > NULL?
> > > > > 
> > > > > If debugfs is enabled, and not that we know it is in this blktrace code,
> > > > > as we select it, it can return ERR_PTR(-ERROR) if an error occurs.
> > > > 
> > > > This is what I found in include/linux/debugfs.h in case debugfs is disabled:
> > > > 
> > > > static inline struct dentry *debugfs_create_file(const char *name,
> > > > 	umode_t mode, struct dentry *parent, void *data,
> > > > 	const struct file_operations *fops)
> > > > {
> > > > 	return ERR_PTR(-ENODEV);
> > > > }
> > > > 
> > > > I have not found any code path that can cause debugfs_create_file() to
> > > > return NULL. Did I perhaps overlook something? If not, it's not clear to me
> > > > why the above patch adds checks that check whether debugfs_create_file()
> > > > returns NULL?
> > > 
> > > Short answer, yes, it can return NULL.  Correct answer is, you don't
> > > care, don't check the value and don't do anything about it.  It's
> > > debugging code, userspace doesn't care, so just keep moving on.
> > 
> > Thing is this code *exposes* knobs to userspace for an API that *does*
> > exepect those files to exist. That is, blktrace *relies* on these
> > debugfs files to exist. So the kconfig which enables blktrace
> > CONFIG_BLK_DEV_IO_TRACE selects DEBUG_FS.
> 
> That's nice, but again, no kernel code should do anything different
> depending on what debugfs happens to be doing at that point in time.

So even if the debugfs files were *not* created, and this code executes only
if DEBUG_FS, you don't think we should inform userspace if the blktrace
setup ioctl, which sets up these debugfs, didn't happen?

The "recovery" here would just be to destroy the blktrace setup, and
inform userspace that the blktrace setup ioctl failed.

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-20 20:16   ` Greg KH
@ 2020-04-20 20:41     ` Luis Chamberlain
  2020-04-21  7:00       ` Greg KH
  2020-04-22  7:29       ` Christoph Hellwig
  0 siblings, 2 replies; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-20 20:41 UTC (permalink / raw)
  To: Greg KH
  Cc: axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Mon, Apr 20, 2020 at 10:16:15PM +0200, Greg KH wrote:
> On Sun, Apr 19, 2020 at 07:45:22PM +0000, Luis Chamberlain wrote:
> 
> This patch triggered gmail's spam detection, your changelog text is
> whack...

Oh? What do you think triggered it?

> > diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
> > index 19091e1effc0..d84038bce0a5 100644
> > --- a/block/blk-debugfs.c
> > +++ b/block/blk-debugfs.c
> > @@ -3,6 +3,9 @@
> >  /*
> >   * Shared request-based / make_request-based functionality
> >   */
> > +
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +
> >  #include <linux/kernel.h>
> >  #include <linux/blkdev.h>
> >  #include <linux/debugfs.h>
> > @@ -13,3 +16,30 @@ void blk_debugfs_register(void)
> >  {
> >  	blk_debugfs_root = debugfs_create_dir("block", NULL);
> >  }
> > +
> > +int __must_check blk_queue_debugfs_register(struct request_queue *q)
> > +{
> > +	struct dentry *dir = NULL;
> > +
> > +	/* This can happen if we have a bug in the lower layers */
> > +	dir = debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root);
> > +	if (dir) {
> > +		pr_warn("%s: registering request_queue debugfs directory twice is not allowed\n",
> > +			kobject_name(q->kobj.parent));
> > +		dput(dir);
> > +		return -EALREADY;
> > +	}
> > +
> > +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > +					    blk_debugfs_root);
> > +	if (!q->debugfs_dir)
> > +		return -ENOMEM;
> 
> Why doesn't the directory just live in the request queue, or somewhere
> else, so that you save it when it is created and then that's it.  No
> need to "look it up" anywhere else.

Its already there. And yes, after my changes it is technically possible
to just re-use it directly. But this is complicated by a few things. One
is that at this point in time, asynchronous request_queue removal is
still possible, and so a race was exposed where a requeust_queue may be
lingering but its old device is gone. That race is fixed by reverting us
back to synchronous request_queue removal, therefore ensuring that the
debugfs dir exists so long as the device does.

I can remove the debugfs_lookup() *after* we revert to synchronous
request_queue removal, or we just re-order the patches so that the
revert happens first. That should simplify this patch.

The code in this patch was designed to help dispute the logic behind
the CVE, in particular it shows exactly where debugfs_dir *is* the
one found by debugfs_lookup(), and shows the real issue behind the
removal.

But yeah, now that that is done, I hope its clear to all, and I think
this patch can be simplified if we just revert the async requeust_queue
removal first.

> Or do you do that in later patches?  I only see this one at the moment,
> sorry...
> 
> >  static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
> > +					    struct request_queue *q,
> >  					    struct blk_trace *bt)
> >  {
> >  	struct dentry *dir = NULL;
> >  
> > +	/* This can only happen if we have a bug on our lower layers */
> > +	if (!q->kobj.parent) {
> > +		pr_warn("%s: request_queue parent is gone\n", buts->name);
> 
> A kobject always has a parent, unless it has not been registered yet, so
> I don't know what you are testing could ever happen.

Or it has been kobject_del()'d?

A deferred requeust_queue removal shows this is possible, the parent is
taken underneath from us because the refcounting of this kobject is
already kobject_del()'d, and its actual removal scheduled for later.
The parent is taken underneath from us prior to the scheduled removal
completing.

> 
> > +		return NULL;
> > +	}
> > +
> > +	/*
> > +	 * From a sysfs kobject perspective, the request_queue sits on top of
> > +	 * the gendisk, which has the name of the disk. We always create a
> > +	 * debugfs directory upon init for this gendisk kobject, so we re-use
> > +	 * that if blktrace is going to be done for it.
> > +	 */
> > +	if (blk_trace_target_disk(buts->name, kobject_name(q->kobj.parent))) {
> > +		if (!q->debugfs_dir) {
> > +			pr_warn("%s: expected request_queue debugfs_dir is not set\n",
> > +				buts->name);
> 
> What is userspace supposed to be able to do if they see this warning?

Userspace doesn't parse warnings, but the NULL ensures it won't crash
the kernel. The warn informs the kernel of a possible block layer bug.

> > +			return NULL;
> > +		}
> > +		/*
> > +		 * debugfs_lookup() is used to ensure the directory is not
> > +		 * taken from underneath us. We must dput() it later once
> > +		 * done with it within blktrace.
> > +		 */
> > +		dir = debugfs_lookup(buts->name, blk_debugfs_root);
> > +		if (!dir) {
> > +			pr_warn("%s: expected request_queue debugfs_dir dentry is gone\n",
> > +				buts->name);
> 
> Again, can't we just save the pointer when we create it and not have to
> look it up again?

Only if we do the revert of the requeust_queue removal first.

> > +			return NULL;
> > +		}
> > +		 /*
> > +		 * This is a reaffirmation that debugfs_lookup() shall always
> > +		 * return the same dentry if it was already set.
> > +		 */
> 
> I'm all for reaffirmation and the like, but really, is this needed???

To those who were still not sure that the issue was not a debugfs issue
I hoped this to make it clear. But indeed, if we revert back to
synchronous request_queue removal, that should simplify this.

> > +		if (dir != q->debugfs_dir) {
> > +			dput(dir);
> > +			pr_warn("%s: expected dentry dir != q->debugfs_dir\n",
> > +				buts->name);
> > +			return NULL;
> 
> Why are you testing to see if debugfs really is working properly?
> SHould all users do crazy things like this (hint, rhetorical
> question...)

No, this can happen with the race I mentioned above.

> > +		}
> > +		bt->backing_dir = q->debugfs_dir;
> > +		return bt->backing_dir;
> > +	}
> > +
> > +	/*
> > +	 * If not using blktrace on the gendisk, we are going to create a
> > +	 * temporary debugfs directory for it, however this cannot be shared
> > +	 * between two concurrent blktraces since the path is not unique, so
> > +	 * ensure this is only done once.
> > +	 */
> >  	dir = debugfs_lookup(buts->name, blk_debugfs_root);
> > -	if (!dir)
> > -		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> > +	if (dir) {
> > +		pr_warn("%s: temporary blktrace debugfs directory already present\n",
> > +			buts->name);
> > +		dput(dir);
> > +		return NULL;
> > +	}
> > +
> > +	bt->dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> > +	if (!bt->dir) {
> > +		pr_warn("%s: temporary blktrace debugfs directory could not be created\n",
> > +			buts->name);
> 
> Again, do not test the return value, you do not care.  I've been
> removing these checks from everywhere.

Sure, the question still stands on what *should* the kernel do if the
blktrace setup failed to create the debugfs files.

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 04/10] block: revert back to synchronous request_queue removal
  2020-04-20 18:59     ` Luis Chamberlain
@ 2020-04-20 21:11       ` Bart Van Assche
  2020-04-20 21:51         ` Luis Chamberlain
  0 siblings, 1 reply; 57+ messages in thread
From: Bart Van Assche @ 2020-04-20 21:11 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On 4/20/20 11:59 AM, Luis Chamberlain wrote:
> On Sun, Apr 19, 2020 at 03:23:31PM -0700, Bart Van Assche wrote:
>> On 4/19/20 12:45 PM, Luis Chamberlain wrote:
>>> + * Decrements the refcount to the request_queue kobject, when this reaches
>>> + * 0 we'll have blk_release_queue() called. You should avoid calling
>>> + * this function in atomic context but if you really have to ensure you
>>> + * first refcount the block device with bdgrab() / bdput() so that the
>>> + * last decrement happens in blk_cleanup_queue().
>>> + */
>>
>> Is calling bdgrab() and bdput() an option from a context in which it is not
>> guaranteed that the block device is open?
> 
> If the block device is not open, nope. For that blk_get_queue() can
> be used, and is used by the block layer. This begs the question:
> 
> Do we have *drivers* which requires access to the request_queue from
> atomic context when the block device is not open?

Instead of trying to answer that question, how about changing the 
references to bdgrab() and bdput() into references to blk_get_queue() 
and blk_put_queue()? I think if that change is made that we won't have 
to research what the answer to the bdgrab()/bdput() question is.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 04/10] block: revert back to synchronous request_queue removal
  2020-04-20 21:11       ` Bart Van Assche
@ 2020-04-20 21:51         ` Luis Chamberlain
  0 siblings, 0 replies; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-20 21:51 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On Mon, Apr 20, 2020 at 02:11:13PM -0700, Bart Van Assche wrote:
> On 4/20/20 11:59 AM, Luis Chamberlain wrote:
> > On Sun, Apr 19, 2020 at 03:23:31PM -0700, Bart Van Assche wrote:
> > > On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> > > > + * Decrements the refcount to the request_queue kobject, when this reaches
> > > > + * 0 we'll have blk_release_queue() called. You should avoid calling
> > > > + * this function in atomic context but if you really have to ensure you
> > > > + * first refcount the block device with bdgrab() / bdput() so that the
> > > > + * last decrement happens in blk_cleanup_queue().
> > > > + */
> > > 
> > > Is calling bdgrab() and bdput() an option from a context in which it is not
> > > guaranteed that the block device is open?
> > 
> > If the block device is not open, nope. For that blk_get_queue() can
> > be used, and is used by the block layer. This begs the question:
> > 
> > Do we have *drivers* which requires access to the request_queue from
> > atomic context when the block device is not open?
> 
> Instead of trying to answer that question, how about changing the references
> to bdgrab() and bdput() into references to blk_get_queue() and
> blk_put_queue()? I think if that change is made that we won't have to
> research what the answer to the bdgrab()/bdput() question is.

Yeah that's fine, now at least we'd have documented what should be avoided.

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 05/10] blktrace: upgrade warns to BUG_ON() on unexpected circmunstances
  2020-04-19 23:07     ` Luis Chamberlain
@ 2020-04-20 23:20       ` Steven Rostedt
  0 siblings, 0 replies; 57+ messages in thread
From: Steven Rostedt @ 2020-04-20 23:20 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Bart Van Assche, axboe, viro, gregkh, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel

On Sun, 19 Apr 2020 23:07:30 +0000
Luis Chamberlain <mcgrof@kernel.org> wrote:

> On Sun, Apr 19, 2020 at 03:50:13PM -0700, Bart Van Assche wrote:
> > On 4/19/20 12:45 PM, Luis Chamberlain wrote:  
> > > @@ -498,10 +498,7 @@ static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
> > >   	struct dentry *dir = NULL;
> > >   	/* This can only happen if we have a bug on our lower layers */
> > > -	if (!q->kobj.parent) {
> > > -		pr_warn("%s: request_queue parent is gone\n", buts->name);
> > > -		return NULL;
> > > -	}
> > > +	BUG_ON(!q->kobj.parent);  
> > 
> > Does the following quote from Linus also apply to this patch: "there is NO
> > F*CKING EXCUSE to knowingly kill the kernel." See also
> > https://lkml.org/lkml/2016/10/4/1.  
> 
> We can use WARN_ON() and keep the return NULL, sure.
> 

Yes please. This is definitely not something that should kill the system.

-- Steve

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup
  2020-04-20 20:20               ` Luis Chamberlain
@ 2020-04-21  6:55                 ` Greg KH
  0 siblings, 0 replies; 57+ messages in thread
From: Greg KH @ 2020-04-21  6:55 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Bart Van Assche, axboe, viro, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel

On Mon, Apr 20, 2020 at 08:20:46PM +0000, Luis Chamberlain wrote:
> On Mon, Apr 20, 2020 at 10:11:01PM +0200, Greg KH wrote:
> > On Mon, Apr 20, 2020 at 06:44:45PM +0000, Luis Chamberlain wrote:
> > > On Mon, Apr 20, 2020 at 01:40:38PM +0200, Greg KH wrote:
> > > > On Sun, Apr 19, 2020 at 04:17:46PM -0700, Bart Van Assche wrote:
> > > > > On 4/19/20 4:05 PM, Luis Chamberlain wrote:
> > > > > > On Sun, Apr 19, 2020 at 03:57:58PM -0700, Bart Van Assche wrote:
> > > > > > > On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> > > > > > > > Even though debugfs can be disabled, enabling BLK_DEV_IO_TRACE will
> > > > > > > > select DEBUG_FS, and blktrace exposes an API which userspace uses
> > > > > > > > relying on certain files created in debugfs. If files are not created
> > > > > > > > blktrace will not work correctly, so we do want to ensure that a
> > > > > > > > blktrace setup creates these files properly, and otherwise inform
> > > > > > > > userspace.
> > > > > > > > 
> > > > > > > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > > > > > > > ---
> > > > > > > >    kernel/trace/blktrace.c | 8 +++++---
> > > > > > > >    1 file changed, 5 insertions(+), 3 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> > > > > > > > index 9cc0153849c3..fc32a8665ce8 100644
> > > > > > > > --- a/kernel/trace/blktrace.c
> > > > > > > > +++ b/kernel/trace/blktrace.c
> > > > > > > > @@ -552,17 +552,19 @@ static int blk_trace_create_debugfs_files(struct blk_user_trace_setup *buts,
> > > > > > > >    					  struct dentry *dir,
> > > > > > > >    					  struct blk_trace *bt)
> > > > > > > >    {
> > > > > > > > -	int ret = -EIO;
> > > > > > > > -
> > > > > > > >    	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
> > > > > > > >    					       &blk_dropped_fops);
> > > > > > > > +	if (!bt->dropped_file)
> > > > > > > > +		return -ENOMEM;
> > > > > > > >    	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
> > > > > > > > +	if (!bt->msg_file)
> > > > > > > > +		return -ENOMEM;
> > > > > > > >    	bt->rchan = relay_open("trace", dir, buts->buf_size,
> > > > > > > >    				buts->buf_nr, &blk_relay_callbacks, bt);
> > > > > > > >    	if (!bt->rchan)
> > > > > > > > -		return ret;
> > > > > > > > +		return -EIO;
> > > > > > > >    	return 0;
> > > > > > > >    }
> > > > > > > 
> > > > > > > I should have had a look at this patch before I replied to the previous
> > > > > > > patch.
> > > > > > > 
> > > > > > > Do you agree that the following code can be triggered by
> > > > > > > debugfs_create_file() and also that debugfs_create_file() never returns
> > > > > > > NULL?
> > > > > > 
> > > > > > If debugfs is enabled, and not that we know it is in this blktrace code,
> > > > > > as we select it, it can return ERR_PTR(-ERROR) if an error occurs.
> > > > > 
> > > > > This is what I found in include/linux/debugfs.h in case debugfs is disabled:
> > > > > 
> > > > > static inline struct dentry *debugfs_create_file(const char *name,
> > > > > 	umode_t mode, struct dentry *parent, void *data,
> > > > > 	const struct file_operations *fops)
> > > > > {
> > > > > 	return ERR_PTR(-ENODEV);
> > > > > }
> > > > > 
> > > > > I have not found any code path that can cause debugfs_create_file() to
> > > > > return NULL. Did I perhaps overlook something? If not, it's not clear to me
> > > > > why the above patch adds checks that check whether debugfs_create_file()
> > > > > returns NULL?
> > > > 
> > > > Short answer, yes, it can return NULL.  Correct answer is, you don't
> > > > care, don't check the value and don't do anything about it.  It's
> > > > debugging code, userspace doesn't care, so just keep moving on.
> > > 
> > > Thing is this code *exposes* knobs to userspace for an API that *does*
> > > exepect those files to exist. That is, blktrace *relies* on these
> > > debugfs files to exist. So the kconfig which enables blktrace
> > > CONFIG_BLK_DEV_IO_TRACE selects DEBUG_FS.
> > 
> > That's nice, but again, no kernel code should do anything different
> > depending on what debugfs happens to be doing at that point in time.
> 
> So even if the debugfs files were *not* created, and this code executes only
> if DEBUG_FS, you don't think we should inform userspace if the blktrace
> setup ioctl, which sets up these debugfs, didn't happen?
> 
> The "recovery" here would just be to destroy the blktrace setup, and
> inform userspace that the blktrace setup ioctl failed.

Hm, ok, but comment the heck out of this saying _why_ you are testing
the return value, and how that differs from 99% of the other users of
this function in the kernel tree please.

Otherwise I will end up removing the checks again with my semi-regular
sweep of the tree...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-20 20:41     ` Luis Chamberlain
@ 2020-04-21  7:00       ` Greg KH
  2020-04-22  7:28         ` Luis Chamberlain
  2020-04-22  7:29       ` Christoph Hellwig
  1 sibling, 1 reply; 57+ messages in thread
From: Greg KH @ 2020-04-21  7:00 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Mon, Apr 20, 2020 at 08:41:56PM +0000, Luis Chamberlain wrote:
> On Mon, Apr 20, 2020 at 10:16:15PM +0200, Greg KH wrote:
> > On Sun, Apr 19, 2020 at 07:45:22PM +0000, Luis Chamberlain wrote:
> > 
> > This patch triggered gmail's spam detection, your changelog text is
> > whack...
> 
> Oh? What do you think triggered it?

No idea.

> 
> > > diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
> > > index 19091e1effc0..d84038bce0a5 100644
> > > --- a/block/blk-debugfs.c
> > > +++ b/block/blk-debugfs.c
> > > @@ -3,6 +3,9 @@
> > >  /*
> > >   * Shared request-based / make_request-based functionality
> > >   */
> > > +
> > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > +
> > >  #include <linux/kernel.h>
> > >  #include <linux/blkdev.h>
> > >  #include <linux/debugfs.h>
> > > @@ -13,3 +16,30 @@ void blk_debugfs_register(void)
> > >  {
> > >  	blk_debugfs_root = debugfs_create_dir("block", NULL);
> > >  }
> > > +
> > > +int __must_check blk_queue_debugfs_register(struct request_queue *q)
> > > +{
> > > +	struct dentry *dir = NULL;
> > > +
> > > +	/* This can happen if we have a bug in the lower layers */
> > > +	dir = debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root);
> > > +	if (dir) {
> > > +		pr_warn("%s: registering request_queue debugfs directory twice is not allowed\n",
> > > +			kobject_name(q->kobj.parent));
> > > +		dput(dir);
> > > +		return -EALREADY;
> > > +	}
> > > +
> > > +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > > +					    blk_debugfs_root);
> > > +	if (!q->debugfs_dir)
> > > +		return -ENOMEM;
> > 
> > Why doesn't the directory just live in the request queue, or somewhere
> > else, so that you save it when it is created and then that's it.  No
> > need to "look it up" anywhere else.
> 
> Its already there. And yes, after my changes it is technically possible
> to just re-use it directly. But this is complicated by a few things. One
> is that at this point in time, asynchronous request_queue removal is
> still possible, and so a race was exposed where a requeust_queue may be
> lingering but its old device is gone. That race is fixed by reverting us
> back to synchronous request_queue removal, therefore ensuring that the
> debugfs dir exists so long as the device does.
> 
> I can remove the debugfs_lookup() *after* we revert to synchronous
> request_queue removal, or we just re-order the patches so that the
> revert happens first. That should simplify this patch.
> 
> The code in this patch was designed to help dispute the logic behind
> the CVE, in particular it shows exactly where debugfs_dir *is* the
> one found by debugfs_lookup(), and shows the real issue behind the
> removal.
> 
> But yeah, now that that is done, I hope its clear to all, and I think
> this patch can be simplified if we just revert the async requeust_queue
> removal first.

Don't try to "dispute" crazyness, that's not what kernel code is for.
Just do the right thing, and simply saving off the pointer to the
debugfs file when created is the "correct" thing to do, no matter what.
No race conditions or anything else can happen when you do that.

> > Or do you do that in later patches?  I only see this one at the moment,
> > sorry...
> > 
> > >  static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
> > > +					    struct request_queue *q,
> > >  					    struct blk_trace *bt)
> > >  {
> > >  	struct dentry *dir = NULL;
> > >  
> > > +	/* This can only happen if we have a bug on our lower layers */
> > > +	if (!q->kobj.parent) {
> > > +		pr_warn("%s: request_queue parent is gone\n", buts->name);
> > 
> > A kobject always has a parent, unless it has not been registered yet, so
> > I don't know what you are testing could ever happen.
> 
> Or it has been kobject_del()'d?

If that happened, how in the world are you in this function anyway, as
the request_queue is an invalid pointer at that point in time???

> A deferred requeust_queue removal shows this is possible, the parent is
> taken underneath from us because the refcounting of this kobject is
> already kobject_del()'d, and its actual removal scheduled for later.
> The parent is taken underneath from us prior to the scheduled removal
> completing.

No, a parent's reference is always valid while the child pointer is
alive.

There is an odd race condition right now that we are working on fixing
if you notice another lkml thread, but that race will soon be fixed.  So
no need for every single user in the kernel to try to test for something
like this (hint, this check is still wrong as with this logic, what
could prevent parent from going away right _after_ you check it???)

Just remove this invalid check please.

> > > +		return NULL;
> > > +	}
> > > +
> > > +	/*
> > > +	 * From a sysfs kobject perspective, the request_queue sits on top of
> > > +	 * the gendisk, which has the name of the disk. We always create a
> > > +	 * debugfs directory upon init for this gendisk kobject, so we re-use
> > > +	 * that if blktrace is going to be done for it.
> > > +	 */
> > > +	if (blk_trace_target_disk(buts->name, kobject_name(q->kobj.parent))) {
> > > +		if (!q->debugfs_dir) {
> > > +			pr_warn("%s: expected request_queue debugfs_dir is not set\n",
> > > +				buts->name);
> > 
> > What is userspace supposed to be able to do if they see this warning?
> 
> Userspace doesn't parse warnings, but the NULL ensures it won't crash
> the kernel. The warn informs the kernel of a possible block layer bug.

Again, the code should not care if the pointer is NULL, as only debugfs
deals with those pointers (and it can handle a NULL pointer just fine.)

> > > +			return NULL;
> > > +		}
> > > +		/*
> > > +		 * debugfs_lookup() is used to ensure the directory is not
> > > +		 * taken from underneath us. We must dput() it later once
> > > +		 * done with it within blktrace.
> > > +		 */
> > > +		dir = debugfs_lookup(buts->name, blk_debugfs_root);
> > > +		if (!dir) {
> > > +			pr_warn("%s: expected request_queue debugfs_dir dentry is gone\n",
> > > +				buts->name);
> > 
> > Again, can't we just save the pointer when we create it and not have to
> > look it up again?
> 
> Only if we do the revert of the requeust_queue removal first.

Your end goal should be no more debugfs_lookup() calls.  Hopefully by
the end of this patch series, that is the result.

> > > +			return NULL;
> > > +		}
> > > +		 /*
> > > +		 * This is a reaffirmation that debugfs_lookup() shall always
> > > +		 * return the same dentry if it was already set.
> > > +		 */
> > 
> > I'm all for reaffirmation and the like, but really, is this needed???
> 
> To those who were still not sure that the issue was not a debugfs issue
> I hoped this to make it clear. But indeed, if we revert back to
> synchronous request_queue removal, that should simplify this.

Again, don't pander to crazies :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 02/10] blktrace: move blktrace debugfs creation to helper function
  2020-04-19 19:45 ` [PATCH v2 02/10] blktrace: move blktrace debugfs creation to helper function Luis Chamberlain
  2020-04-19 21:11   ` Bart Van Assche
@ 2020-04-22  7:12   ` Christoph Hellwig
  1 sibling, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2020-04-22  7:12 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel

On Sun, Apr 19, 2020 at 07:45:21PM +0000, Luis Chamberlain wrote:
> Move the work to create the debugfs directory used into a helper.
> It will make further checks easier to read. This commit introduces
> no functional changes.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  kernel/trace/blktrace.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index ca39dc3230cb..2c6e6c386ace 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -468,6 +468,18 @@ static void blk_trace_setup_lba(struct blk_trace *bt,
>  	}
>  }
>  
> +static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
> +					    struct blk_trace *bt)
> +{
> +	struct dentry *dir = NULL;
> +
> +	dir = debugfs_lookup(buts->name, blk_debugfs_root);
> +	if (!dir)
> +		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> +

This creates an > 80 char line.  But I also think it is rather confusing
anyway, why not:

	dir = debugfs_lookup(buts->name, blk_debugfs_root);
	if (dir)
		return dir;
	bt->dir = debugfs_create_dir(buts->name, blk_debugfs_root);
	return bt->dir;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-19 19:45 ` [PATCH v2 03/10] blktrace: fix debugfs use after free Luis Chamberlain
  2020-04-19 21:55   ` Bart Van Assche
  2020-04-20 20:16   ` Greg KH
@ 2020-04-22  7:27   ` Christoph Hellwig
  2020-04-22  7:48     ` Luis Chamberlain
  2 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2020-04-22  7:27 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da

On Sun, Apr 19, 2020 at 07:45:22PM +0000, Luis Chamberlain wrote:
> On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
> merged on v4.12 Omar fixed the original blktrace code for request-based
> drivers (multiqueue). This however left in place a possible crash, if you
> happen to abuse blktrace in a way it was not intended, and even more so
> with our current asynchronous request_queue removal.
> 
> Namely, if you loop adding a device, setup the blktrace with BLKTRACESETUP,
> forget to BLKTRACETEARDOWN, and then just remove the device you end up
> with a panic:

FYI, I find all this backtrace garbage not hepful at all.  It requires
me to scroll for so long that I've forgot what was written above by
the time I'm past it.

> This splat happens to be very similar to the one reported via
> kernel.org korg#205713, only that korg#205713 was for v4.19.83
> and the above now includes the simple_recursive_removal() introduced
> via commit a3d1e7eb5abe ("simple_recursive_removal(): kernel-side rm
> -rf for ramfs-style filesystems") merged on v5.6.
> 
> korg#205713 then was used to create CVE-2019-19770 and claims that
> the bug is in a use-after-free in the debugfs core code. The
> implications of this being a generic UAF on debugfs would be
> much more severe, as it would imply parent dentries can sometimes
> not be positive, which we hold by design is just not possible.
> 
> Below is the splat explained with a bit more details, explaining
> what is happening in userspace, kernel, and a print of the CPU on,
> which the code runs on:
> 
> load loopback module
> [   13.603371] == blk_mq_debugfs_register(12) start
> [   13.604040] == blk_mq_debugfs_register(12) q->debugfs_dir created

Same for this..  I think the real valuable changelog only stars below
this 'trace'.

> The root cause to this issue is that debugfs_lookup() can find a
> previous incarnation's dir of the same name which is about to get
> removed from a not yet schedule work. When that happens, the the files
> are taken underneath the nose of the blktrace, and when it comes time to
> cleanup, these dentries are long gone because of a scheduled removal.
> 
> This issue is happening because of two reasons:
> 
>   1) The request_queue is currently removed asynchronously as of commit
>      dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
>      v4.12, this allows races with userspace which were not possible
>      before unless as removal of a block device used to happen
>      synchronously with its request_queue. One could however still
>      parallelize blksetup calls while one loops on device addition and
>      removal.
> 
>   2) There are no errors checks when we create the debugfs directory,
>      be it on init or for blktrace. The concept of when the directory
>      *should* really exist is further complicated by the fact that
>      we have asynchronous request_queue removal. And, we have no
>      real sanity checks to ensure we don't re-create the queue debugfs
>      directory.
> 
> We can fix the UAF by using a debugfs directory which moving forward
> will always be accessible if debugfs is enabled, this way, its allocated
> and avaialble always for both request-based block drivers or
> make_request drivers (multiqueue) block drivers.
> 
> We also put sanity checks in place to ensure that if the directory is
> found with debugfs_lookup() it is the dentry we expect. When doing a
> blktrace against a parition, we will always be creating a temporary
> debugfs directory, so ensure that only exists once as well to avoid
> issues against concurrent blktrace runs.
> 
> Lastly, since we are now always creating the needed request_queue
> debugfs directory upon init, we can also take the initiative to
> proactively check against errors. We currently do not check for
> errors on add_disk() and friends, but we shouldn't make the issue
> any worse.
> 
> This also simplifies the code considerably, with the only penalty now
> being that we're always creating the request queue debugfs directory for
> the request-based block device drivers.
> 
> The UAF then is not a core debugfs issue, but instead a complex misuse
> of debugfs, and this issue can only be triggered if you are root.
> 
> This issue can be reproduced with break-blktrace [2] using:
> 
>   break-blktrace -c 10 -d -s
> 
> This patch fixes this issue. Note that there is also another
> respective UAF but from the ioctl path [3], this should also fix
> that issue.
> 
> This patch then also disputes the severity of CVE-2019-19770 as
> this issue is only possible by being root and using blktrace.
> 
> It is not a core debugfs issue.
> 
> [0] https://bugzilla.kernel.org/show_bug.cgi?id=205713
> [1] https://nvd.nist.gov/vuln/detail/CVE-2019-19770
> [2] https://github.com/mcgrof/break-blktrace
> [3] https://lore.kernel.org/lkml/000000000000ec635b059f752700@google.com/

> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

This looks like an unrelated change.

> +
>  #include <linux/kernel.h>
>  #include <linux/blkdev.h>
>  #include <linux/debugfs.h>
> @@ -13,3 +16,30 @@ void blk_debugfs_register(void)
>  {
>  	blk_debugfs_root = debugfs_create_dir("block", NULL);
>  }
> +
> +int __must_check blk_queue_debugfs_register(struct request_queue *q)

__must_check for a function with a single caller looks silly.

> +{
> +	struct dentry *dir = NULL;
> +
> +	/* This can happen if we have a bug in the lower layers */
> +	dir = debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root);
> +	if (dir) {
> +		pr_warn("%s: registering request_queue debugfs directory twice is not allowed\n",
> +			kobject_name(q->kobj.parent));
> +		dput(dir);
> +		return -EALREADY;
> +	}

I don't see why we need this check.  If it is valueable enough we
should have a debugfs_create_dir_exclusive or so that retunrns an error
for an exsting directory, instead of reimplementing it in the caller in
a racy way.  But I'm not really sure we need it to start with.

> +
> +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> +					    blk_debugfs_root);
> +	if (!q->debugfs_dir)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +void blk_queue_debugfs_unregister(struct request_queue *q)
> +{
> +	debugfs_remove_recursive(q->debugfs_dir);
> +	q->debugfs_dir = NULL;
> +}

Which to me suggests we can just fold these two into the callers,
with an IS_ENABLED for the creation case given that we check for errors
and the stub will always return an error.

>  	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
>  
>  	/*
> @@ -856,9 +853,7 @@ void blk_mq_debugfs_register(struct request_queue *q)
>  
>  void blk_mq_debugfs_unregister(struct request_queue *q)
>  {
> -	debugfs_remove_recursive(q->debugfs_dir);
>  	q->sched_debugfs_dir = NULL;
> -	q->debugfs_dir = NULL;
>  }

This function is weird - the sched dir gets removed by the
debugfs_remove_recursive, so just leaving a function that clears
a pointer is rather odd.  In fact I don't think we need to clear
either sched_debugfs_dir or debugfs_dir anywhere.

>  
> @@ -975,6 +976,14 @@ int blk_register_queue(struct gendisk *disk)
>  		goto unlock;
>  	}
>  
> +	ret = blk_queue_debugfs_register(q);
> +	if (ret) {
> +		blk_trace_remove_sysfs(dev);
> +		kobject_del(&q->kobj);
> +		kobject_put(&dev->kobj);
> +		goto unlock;
> +	}
> +

Please use a goto label to consolidate the common cleanup code.

Also I think these generic debugfs changes probably should be separate
to the blktrace changes.

> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
>  #include <linux/kernel.h>
>  #include <linux/blkdev.h>
>  #include <linux/blktrace_api.h>
> @@ -311,7 +314,15 @@ static void blk_trace_free(struct blk_trace *bt)
>  	debugfs_remove(bt->msg_file);
>  	debugfs_remove(bt->dropped_file);
>  	relay_close(bt->rchan);
> -	debugfs_remove(bt->dir);
> +	/*
> +	 * backing_dir is set when we use the request_queue debugfs directory.
> +	 * Otherwise we are using a temporary directory created only for the
> +	 * purpose of blktrace.
> +	 */
> +	if (bt->backing_dir)
> +		dput(bt->backing_dir);
> +	else
> +		debugfs_remove(bt->dir);
>  	free_percpu(bt->sequence);
>  	free_percpu(bt->msg_data);
>  	kfree(bt);
> @@ -468,16 +479,89 @@ static void blk_trace_setup_lba(struct blk_trace *bt,
>  	}
>  }
>  
> +static bool blk_trace_target_disk(const char *target, const char *diskname)
> +{
> +	if (strlen(target) != strlen(diskname))
> +		return false;
> +
> +	if (!strncmp(target, diskname,
> +		     min_t(size_t, strlen(target), strlen(diskname))))
> +		return true;
> +
> +	return false;
> +}
> +
>  static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
> +					    struct request_queue *q,
>  					    struct blk_trace *bt)
>  {
>  	struct dentry *dir = NULL;
>  
> +	/* This can only happen if we have a bug on our lower layers */
> +	if (!q->kobj.parent) {
> +		pr_warn("%s: request_queue parent is gone\n", buts->name);
> +		return NULL;
> +	}

Why is this not simply a WARN_ON_ONCE()?

> +	/*
> +	 * From a sysfs kobject perspective, the request_queue sits on top of
> +	 * the gendisk, which has the name of the disk. We always create a
> +	 * debugfs directory upon init for this gendisk kobject, so we re-use
> +	 * that if blktrace is going to be done for it.
> +	 */

-EPARSE.

> +	if (blk_trace_target_disk(buts->name, kobject_name(q->kobj.parent))) {
> +		if (!q->debugfs_dir) {
> +			pr_warn("%s: expected request_queue debugfs_dir is not set\n",
> +				buts->name);
> +			return NULL;
> +		}
> +		/*
> +		 * debugfs_lookup() is used to ensure the directory is not
> +		 * taken from underneath us. We must dput() it later once
> +		 * done with it within blktrace.
> +		 */
> +		dir = debugfs_lookup(buts->name, blk_debugfs_root);
> +		if (!dir) {
> +			pr_warn("%s: expected request_queue debugfs_dir dentry is gone\n",
> +				buts->name);
> +			return NULL;
> +		}
> +		 /*
> +		 * This is a reaffirmation that debugfs_lookup() shall always
> +		 * return the same dentry if it was already set.
> +		 */
> +		if (dir != q->debugfs_dir) {
> +			dput(dir);
> +			pr_warn("%s: expected dentry dir != q->debugfs_dir\n",
> +				buts->name);
> +			return NULL;
> +		}
> +		bt->backing_dir = q->debugfs_dir;
> +		return bt->backing_dir;
> +	}

Even with the gigantic commit log I don't get the point of this
code.  It looks rather sketchy and I can't find a rationale for it.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-21  7:00       ` Greg KH
@ 2020-04-22  7:28         ` Luis Chamberlain
  2020-04-22  9:43           ` Ming Lei
  0 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-22  7:28 UTC (permalink / raw)
  To: Greg KH
  Cc: axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Tue, Apr 21, 2020 at 09:00:48AM +0200, Greg KH wrote:
> On Mon, Apr 20, 2020 at 08:41:56PM +0000, Luis Chamberlain wrote:
> > On Mon, Apr 20, 2020 at 10:16:15PM +0200, Greg KH wrote:
> > > On Sun, Apr 19, 2020 at 07:45:22PM +0000, Luis Chamberlain wrote:
> > > 
> > > This patch triggered gmail's spam detection, your changelog text is
> > > whack...
> > 
> > Oh? What do you think triggered it?
> 
> No idea.

Alright, well I'm going to move most of the analysis to the bug report
and be as concise as possible on the commit log.

> > > > diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
> > > > index 19091e1effc0..d84038bce0a5 100644
> > > > --- a/block/blk-debugfs.c
> > > > +++ b/block/blk-debugfs.c
> > > > @@ -3,6 +3,9 @@
> > > >  /*
> > > >   * Shared request-based / make_request-based functionality
> > > >   */
> > > > +
> > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > > +
> > > >  #include <linux/kernel.h>
> > > >  #include <linux/blkdev.h>
> > > >  #include <linux/debugfs.h>
> > > > @@ -13,3 +16,30 @@ void blk_debugfs_register(void)
> > > >  {
> > > >  	blk_debugfs_root = debugfs_create_dir("block", NULL);
> > > >  }
> > > > +
> > > > +int __must_check blk_queue_debugfs_register(struct request_queue *q)
> > > > +{
> > > > +	struct dentry *dir = NULL;
> > > > +
> > > > +	/* This can happen if we have a bug in the lower layers */
> > > > +	dir = debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root);
> > > > +	if (dir) {
> > > > +		pr_warn("%s: registering request_queue debugfs directory twice is not allowed\n",
> > > > +			kobject_name(q->kobj.parent));
> > > > +		dput(dir);
> > > > +		return -EALREADY;
> > > > +	}
> > > > +
> > > > +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > > > +					    blk_debugfs_root);
> > > > +	if (!q->debugfs_dir)
> > > > +		return -ENOMEM;
> > > 
> > > Why doesn't the directory just live in the request queue, or somewhere
> > > else, so that you save it when it is created and then that's it.  No
> > > need to "look it up" anywhere else.
> > 
> > Its already there. And yes, after my changes it is technically possible
> > to just re-use it directly. But this is complicated by a few things. One
> > is that at this point in time, asynchronous request_queue removal is
> > still possible, and so a race was exposed where a requeust_queue may be
> > lingering but its old device is gone. That race is fixed by reverting us
> > back to synchronous request_queue removal, therefore ensuring that the
> > debugfs dir exists so long as the device does.
> > 
> > I can remove the debugfs_lookup() *after* we revert to synchronous
> > request_queue removal, or we just re-order the patches so that the
> > revert happens first. That should simplify this patch.
> > 
> > The code in this patch was designed to help dispute the logic behind
> > the CVE, in particular it shows exactly where debugfs_dir *is* the
> > one found by debugfs_lookup(), and shows the real issue behind the
> > removal.
> > 
> > But yeah, now that that is done, I hope its clear to all, and I think
> > this patch can be simplified if we just revert the async requeust_queue
> > removal first.
> 
> Don't try to "dispute" crazyness, that's not what kernel code is for.
> Just do the right thing, and simply saving off the pointer to the
> debugfs file when created is the "correct" thing to do, no matter what.
> No race conditions or anything else can happen when you do that.

Nope, races are still possible even if we move revert back to sync
request_queue removal, but I do believe that just reveals a bug
elsewhere, which I'll just fix as I think I know where this is.

> > > Or do you do that in later patches?  I only see this one at the moment,
> > > sorry...
> > > 
> > > >  static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
> > > > +					    struct request_queue *q,
> > > >  					    struct blk_trace *bt)
> > > >  {
> > > >  	struct dentry *dir = NULL;
> > > >  
> > > > +	/* This can only happen if we have a bug on our lower layers */
> > > > +	if (!q->kobj.parent) {
> > > > +		pr_warn("%s: request_queue parent is gone\n", buts->name);
> > > 
> > > A kobject always has a parent, unless it has not been registered yet, so
> > > I don't know what you are testing could ever happen.
> > 
> > Or it has been kobject_del()'d?
> 
> If that happened, how in the world are you in this function anyway, as
> the request_queue is an invalid pointer at that point in time???

Nope, the block layer still finishes some work on it.

Drivers are allowed to cleanup a block device in this order, this
example,  is from the loop block driver:

static void loop_remove(struct loop_device *lo)                                 
{
	del_gendisk(lo->lo_disk);
	blk_cleanup_queue(lo->lo_queue);
	blk_mq_free_tag_set(&lo->tag_set);
	put_disk(lo->lo_disk);
	kfree(lo);
}   

At this point in time patch-wise we still haven't reverted back to
synchronous request_queue removal. Considering this, a race with the
parent disappearing can happen because the request_queue removal is
deferred, that is, the request_queue's kobject's release() call used
schedule_work() to finish off its removal. We expect the last
blk_put_queue() to be called at the end of blk_cleanup_queue(). Since
that is deferred and device_del() is called also at the end of
del_gendisk(), it means the release of the queue can happen in a
context where the disk is gone.

Although this issue is triggerable easily with the current async
request_queue removal, I can think of other ways to trigger an issue
here and one of them was suggested as possible by Christoph on the last
v1 patch series.

blk_queue_get() is not atomic and so what it returns can be incorrect:

bool blk_get_queue(struct request_queue *q)
{
	if (likely(!blk_queue_dying(q))) {
		__blk_get_queue(q);
		return true;
	}
	----> we can schedule here easily and move the queue to dying
	----> and race with blk_cleanup_queue() which will then allow
	----> code paths to incorrectly trigger the release of the queue
	----> in places we don't want
	return false;
}
EXPORT_SYMBOL(blk_get_queue);

Another area of concern I am seeing through code inspection is that
since the request_queue life depends on the disk, it seemse odd we call
device_del() before we remove the request_queue. If this is indeed
wrong, we could move the device_del() from del_gendisk() to
disk_release() triggered by the last put_disk().

I have a test now which shows that even if we revert back to synchronous
request_queue removal I can trigger a panic on the same use-after-free
case on debugfs on blktrace, and this is because we are overwriting the
same debugfs directory, and I think the above are its root causes.

I can fix this bug, but that just moves the bug to conflicts within
two sysfs objects already existing, and this is because add_disk()
(which calls __device_add_disk() doesn't return errors). This is both
a blk layer bug in the sense we never check for error and a driver bug
for allowing conflicts. All this just needs to be fixed, and although I
thought this could be done later, I think I'm just going to fix all this
now.

I have reproducer now which enables the same race with synchronous
request_queue removal, I'll work my way towards fixing all this...

> > A deferred requeust_queue removal shows this is possible, the parent is
> > taken underneath from us because the refcounting of this kobject is
> > already kobject_del()'d, and its actual removal scheduled for later.
> > The parent is taken underneath from us prior to the scheduled removal
> > completing.
> 
> No, a parent's reference is always valid while the child pointer is
> alive.

OK I'll keep thsi in mind then.

> There is an odd race condition right now that we are working on fixing
> if you notice another lkml thread, but that race will soon be fixed.

Got a pointer?

> So
> no need for every single user in the kernel to try to test for something
> like this (hint, this check is still wrong as with this logic, what
> could prevent parent from going away right _after_ you check it???)
> 
> Just remove this invalid check please.

Yeah sure, make sense.

> > > > +		return NULL;
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * From a sysfs kobject perspective, the request_queue sits on top of
> > > > +	 * the gendisk, which has the name of the disk. We always create a
> > > > +	 * debugfs directory upon init for this gendisk kobject, so we re-use
> > > > +	 * that if blktrace is going to be done for it.
> > > > +	 */
> > > > +	if (blk_trace_target_disk(buts->name, kobject_name(q->kobj.parent))) {
> > > > +		if (!q->debugfs_dir) {
> > > > +			pr_warn("%s: expected request_queue debugfs_dir is not set\n",
> > > > +				buts->name);
> > > 
> > > What is userspace supposed to be able to do if they see this warning?
> > 
> > Userspace doesn't parse warnings, but the NULL ensures it won't crash
> > the kernel. The warn informs the kernel of a possible block layer bug.
> 
> Again, the code should not care if the pointer is NULL, as only debugfs
> deals with those pointers (and it can handle a NULL pointer just fine.)

Alright.

> > > > +			return NULL;
> > > > +		}
> > > > +		/*
> > > > +		 * debugfs_lookup() is used to ensure the directory is not
> > > > +		 * taken from underneath us. We must dput() it later once
> > > > +		 * done with it within blktrace.
> > > > +		 */
> > > > +		dir = debugfs_lookup(buts->name, blk_debugfs_root);
> > > > +		if (!dir) {
> > > > +			pr_warn("%s: expected request_queue debugfs_dir dentry is gone\n",
> > > > +				buts->name);
> > > 
> > > Again, can't we just save the pointer when we create it and not have to
> > > look it up again?
> > 
> > Only if we do the revert of the requeust_queue removal first.
> 
> Your end goal should be no more debugfs_lookup() calls.  Hopefully by
> the end of this patch series, that is the result.

Alrighty.

> > > > +			return NULL;
> > > > +		}
> > > > +		 /*
> > > > +		 * This is a reaffirmation that debugfs_lookup() shall always
> > > > +		 * return the same dentry if it was already set.
> > > > +		 */
> > > 
> > > I'm all for reaffirmation and the like, but really, is this needed???
> > 
> > To those who were still not sure that the issue was not a debugfs issue
> > I hoped this to make it clear. But indeed, if we revert back to
> > synchronous request_queue removal, that should simplify this.
> 
> Again, don't pander to crazies :)

At least I didn't say it :)

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-20 20:41     ` Luis Chamberlain
  2020-04-21  7:00       ` Greg KH
@ 2020-04-22  7:29       ` Christoph Hellwig
  2020-04-22  7:34         ` Luis Chamberlain
  1 sibling, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2020-04-22  7:29 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Greg KH, axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da

On Mon, Apr 20, 2020 at 08:41:56PM +0000, Luis Chamberlain wrote:
> Its already there. And yes, after my changes it is technically possible
> to just re-use it directly. But this is complicated by a few things. One
> is that at this point in time, asynchronous request_queue removal is
> still possible, and so a race was exposed where a requeust_queue may be
> lingering but its old device is gone. That race is fixed by reverting us
> back to synchronous request_queue removal, therefore ensuring that the
> debugfs dir exists so long as the device does.
> 
> I can remove the debugfs_lookup() *after* we revert to synchronous
> request_queue removal, or we just re-order the patches so that the
> revert happens first. That should simplify this patch.

Yes, please move the synchronous removal first instead of working around
the current problems.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-22  7:29       ` Christoph Hellwig
@ 2020-04-22  7:34         ` Luis Chamberlain
  0 siblings, 0 replies; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-22  7:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Greg KH, axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da

On Wed, Apr 22, 2020 at 12:29:42AM -0700, Christoph Hellwig wrote:
> On Mon, Apr 20, 2020 at 08:41:56PM +0000, Luis Chamberlain wrote:
> > Its already there. And yes, after my changes it is technically possible
> > to just re-use it directly. But this is complicated by a few things. One
> > is that at this point in time, asynchronous request_queue removal is
> > still possible, and so a race was exposed where a requeust_queue may be
> > lingering but its old device is gone. That race is fixed by reverting us
> > back to synchronous request_queue removal, therefore ensuring that the
> > debugfs dir exists so long as the device does.
> > 
> > I can remove the debugfs_lookup() *after* we revert to synchronous
> > request_queue removal, or we just re-order the patches so that the
> > revert happens first. That should simplify this patch.
> 
> Yes, please move the synchronous removal first instead of working around
> the current problems.

Sounds good. At first it was questionable, now its understood we need it.

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-22  7:27   ` Christoph Hellwig
@ 2020-04-22  7:48     ` Luis Chamberlain
  2020-04-22  8:10       ` Christoph Hellwig
  0 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-22  7:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da

On Wed, Apr 22, 2020 at 12:27:15AM -0700, Christoph Hellwig wrote:
> On Sun, Apr 19, 2020 at 07:45:22PM +0000, Luis Chamberlain wrote:
> > +{
> > +	struct dentry *dir = NULL;
> > +
> > +	/* This can happen if we have a bug in the lower layers */
> > +	dir = debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root);
> > +	if (dir) {
> > +		pr_warn("%s: registering request_queue debugfs directory twice is not allowed\n",
> > +			kobject_name(q->kobj.parent));
> > +		dput(dir);
> > +		return -EALREADY;
> > +	}
> 
> I don't see why we need this check.  If it is valueable enough we
> should have a debugfs_create_dir_exclusive or so that retunrns an error
> for an exsting directory, instead of reimplementing it in the caller in
> a racy way.  But I'm not really sure we need it to start with.

In short races, and even with synchronous request_queue removal I'm
seeing the race is still possible, but that's due to some other races
I'm going to chase down now.

The easier solution really is to just have a debugfs dir created for
each partition if debugfs is enabled, this way the directory will
always be there, and the lookups are gone.

> > +
> > +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > +					    blk_debugfs_root);
> > +	if (!q->debugfs_dir)
> > +		return -ENOMEM;
> > +
> > +	return 0;
> > +}
> > +
> > +void blk_queue_debugfs_unregister(struct request_queue *q)
> > +{
> > +	debugfs_remove_recursive(q->debugfs_dir);
> > +	q->debugfs_dir = NULL;
> > +}
> 
> Which to me suggests we can just fold these two into the callers,
> with an IS_ENABLED for the creation case given that we check for errors
> and the stub will always return an error.

Sorry not sure I follow this.

> >  	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
> >  
> >  	/*
> > @@ -856,9 +853,7 @@ void blk_mq_debugfs_register(struct request_queue *q)
> >  
> >  void blk_mq_debugfs_unregister(struct request_queue *q)
> >  {
> > -	debugfs_remove_recursive(q->debugfs_dir);
> >  	q->sched_debugfs_dir = NULL;
> > -	q->debugfs_dir = NULL;
> >  }
> 
> This function is weird - the sched dir gets removed by the
> debugfs_remove_recursive, so just leaving a function that clears
> a pointer is rather odd.  In fact I don't think we need to clear
> either sched_debugfs_dir or debugfs_dir anywhere.

Indeed. Will clean it up.

> > @@ -975,6 +976,14 @@ int blk_register_queue(struct gendisk *disk)
> >  		goto unlock;
> >  	}
> >  
> > +	ret = blk_queue_debugfs_register(q);
> > +	if (ret) {
> > +		blk_trace_remove_sysfs(dev);
> > +		kobject_del(&q->kobj);
> > +		kobject_put(&dev->kobj);
> > +		goto unlock;
> > +	}
> > +
> 
> Please use a goto label to consolidate the common cleanup code.

Sure.

> Also I think these generic debugfs changes probably should be separate
> to the blktrace changes.

I'll try to do that.

> >  static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
> > +					    struct request_queue *q,
> >  					    struct blk_trace *bt)
> >  {
> >  	struct dentry *dir = NULL;
> >  
> > +	/* This can only happen if we have a bug on our lower layers */
> > +	if (!q->kobj.parent) {
> > +		pr_warn("%s: request_queue parent is gone\n", buts->name);
> > +		return NULL;
> > +	}
> 
> Why is this not simply a WARN_ON_ONCE()?

I'll actually remove it and instead fix the race where it happens.

> > +	if (blk_trace_target_disk(buts->name, kobject_name(q->kobj.parent))) {
> > +		if (!q->debugfs_dir) {
> > +			pr_warn("%s: expected request_queue debugfs_dir is not set\n",
> > +				buts->name);
> > +			return NULL;
> > +		}
> > +		/*
> > +		 * debugfs_lookup() is used to ensure the directory is not
> > +		 * taken from underneath us. We must dput() it later once
> > +		 * done with it within blktrace.
> > +		 */
> > +		dir = debugfs_lookup(buts->name, blk_debugfs_root);
> > +		if (!dir) {
> > +			pr_warn("%s: expected request_queue debugfs_dir dentry is gone\n",
> > +				buts->name);
> > +			return NULL;
> > +		}
> > +		 /*
> > +		 * This is a reaffirmation that debugfs_lookup() shall always
> > +		 * return the same dentry if it was already set.
> > +		 */
> > +		if (dir != q->debugfs_dir) {
> > +			dput(dir);
> > +			pr_warn("%s: expected dentry dir != q->debugfs_dir\n",
> > +				buts->name);
> > +			return NULL;
> > +		}
> > +		bt->backing_dir = q->debugfs_dir;
> > +		return bt->backing_dir;
> > +	}
> 
> Even with the gigantic commit log I don't get the point of this
> code.  It looks rather sketchy and I can't find a rationale for it.

Yeah I think this is going to be much easier on the eyes with the
revert to synchronous request_queue removal first.

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-22  7:48     ` Luis Chamberlain
@ 2020-04-22  8:10       ` Christoph Hellwig
  2020-04-22  8:26         ` Luis Chamberlain
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2020-04-22  8:10 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, axboe, viro, bvanassche, gregkh, rostedt,
	mingo, jack, ming.lei, nstange, akpm, mhocko, yukuai3,
	linux-block, linux-fsdevel, linux-mm, linux-kernel,
	Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Wed, Apr 22, 2020 at 07:48:02AM +0000, Luis Chamberlain wrote:
> > I don't see why we need this check.  If it is valueable enough we
> > should have a debugfs_create_dir_exclusive or so that retunrns an error
> > for an exsting directory, instead of reimplementing it in the caller in
> > a racy way.  But I'm not really sure we need it to start with.
> 
> In short races, and even with synchronous request_queue removal I'm
> seeing the race is still possible, but that's due to some other races
> I'm going to chase down now.
> 
> The easier solution really is to just have a debugfs dir created for
> each partition if debugfs is enabled, this way the directory will
> always be there, and the lookups are gone.

That sounds like the best plan to me.

> 
> > > +
> > > +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > > +					    blk_debugfs_root);
> > > +	if (!q->debugfs_dir)
> > > +		return -ENOMEM;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +void blk_queue_debugfs_unregister(struct request_queue *q)
> > > +{
> > > +	debugfs_remove_recursive(q->debugfs_dir);
> > > +	q->debugfs_dir = NULL;
> > > +}
> > 
> > Which to me suggests we can just fold these two into the callers,
> > with an IS_ENABLED for the creation case given that we check for errors
> > and the stub will always return an error.
> 
> Sorry not sure I follow this.

Don't both with the two above functions and just open code them in
the callers.  IFF you still want to check for errors after the
discussion with Greg, wrap the call in a

	if (IS_ENABLED(CONFIG_DEBUG_FS))

to ensure that you don't fail queue creation in the !DEBUG_FS
case.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-22  8:10       ` Christoph Hellwig
@ 2020-04-22  8:26         ` Luis Chamberlain
  0 siblings, 0 replies; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-22  8:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da

On Wed, Apr 22, 2020 at 01:10:11AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 22, 2020 at 07:48:02AM +0000, Luis Chamberlain wrote:
> > > I don't see why we need this check.  If it is valueable enough we
> > > should have a debugfs_create_dir_exclusive or so that retunrns an error
> > > for an exsting directory, instead of reimplementing it in the caller in
> > > a racy way.  But I'm not really sure we need it to start with.
> > 
> > In short races, and even with synchronous request_queue removal I'm
> > seeing the race is still possible, but that's due to some other races
> > I'm going to chase down now.
> > 
> > The easier solution really is to just have a debugfs dir created for
> > each partition if debugfs is enabled, this way the directory will
> > always be there, and the lookups are gone.
> 
> That sounds like the best plan to me.

Groovy.

> > > > +
> > > > +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > > > +					    blk_debugfs_root);
> > > > +	if (!q->debugfs_dir)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +void blk_queue_debugfs_unregister(struct request_queue *q)
> > > > +{
> > > > +	debugfs_remove_recursive(q->debugfs_dir);
> > > > +	q->debugfs_dir = NULL;
> > > > +}
> > > 
> > > Which to me suggests we can just fold these two into the callers,
> > > with an IS_ENABLED for the creation case given that we check for errors
> > > and the stub will always return an error.
> > 
> > Sorry not sure I follow this.
> 
> Don't both with the two above functions and just open code them in
> the callers.  IFF you still want to check for errors after the
> discussion with Greg, wrap the call in a
> 
> 	if (IS_ENABLED(CONFIG_DEBUG_FS))
> 
> to ensure that you don't fail queue creation in the !DEBUG_FS
> case.

Got it, thanks.

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-22  7:28         ` Luis Chamberlain
@ 2020-04-22  9:43           ` Ming Lei
  2020-04-22 10:31             ` Luis Chamberlain
  2020-04-24 23:47             ` Luis Chamberlain
  0 siblings, 2 replies; 57+ messages in thread
From: Ming Lei @ 2020-04-22  9:43 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Greg KH, axboe, viro, bvanassche, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Wed, Apr 22, 2020 at 07:28:59AM +0000, Luis Chamberlain wrote:
> On Tue, Apr 21, 2020 at 09:00:48AM +0200, Greg KH wrote:
> > On Mon, Apr 20, 2020 at 08:41:56PM +0000, Luis Chamberlain wrote:
> > > On Mon, Apr 20, 2020 at 10:16:15PM +0200, Greg KH wrote:
> > > > On Sun, Apr 19, 2020 at 07:45:22PM +0000, Luis Chamberlain wrote:
> > > > 
> > > > This patch triggered gmail's spam detection, your changelog text is
> > > > whack...
> > > 
> > > Oh? What do you think triggered it?
> > 
> > No idea.
> 
> Alright, well I'm going to move most of the analysis to the bug report
> and be as concise as possible on the commit log.
> 
> > > > > diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
> > > > > index 19091e1effc0..d84038bce0a5 100644
> > > > > --- a/block/blk-debugfs.c
> > > > > +++ b/block/blk-debugfs.c
> > > > > @@ -3,6 +3,9 @@
> > > > >  /*
> > > > >   * Shared request-based / make_request-based functionality
> > > > >   */
> > > > > +
> > > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > > > +
> > > > >  #include <linux/kernel.h>
> > > > >  #include <linux/blkdev.h>
> > > > >  #include <linux/debugfs.h>
> > > > > @@ -13,3 +16,30 @@ void blk_debugfs_register(void)
> > > > >  {
> > > > >  	blk_debugfs_root = debugfs_create_dir("block", NULL);
> > > > >  }
> > > > > +
> > > > > +int __must_check blk_queue_debugfs_register(struct request_queue *q)
> > > > > +{
> > > > > +	struct dentry *dir = NULL;
> > > > > +
> > > > > +	/* This can happen if we have a bug in the lower layers */
> > > > > +	dir = debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root);
> > > > > +	if (dir) {
> > > > > +		pr_warn("%s: registering request_queue debugfs directory twice is not allowed\n",
> > > > > +			kobject_name(q->kobj.parent));
> > > > > +		dput(dir);
> > > > > +		return -EALREADY;
> > > > > +	}
> > > > > +
> > > > > +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > > > > +					    blk_debugfs_root);
> > > > > +	if (!q->debugfs_dir)
> > > > > +		return -ENOMEM;
> > > > 
> > > > Why doesn't the directory just live in the request queue, or somewhere
> > > > else, so that you save it when it is created and then that's it.  No
> > > > need to "look it up" anywhere else.
> > > 
> > > Its already there. And yes, after my changes it is technically possible
> > > to just re-use it directly. But this is complicated by a few things. One
> > > is that at this point in time, asynchronous request_queue removal is
> > > still possible, and so a race was exposed where a requeust_queue may be
> > > lingering but its old device is gone. That race is fixed by reverting us
> > > back to synchronous request_queue removal, therefore ensuring that the
> > > debugfs dir exists so long as the device does.
> > > 
> > > I can remove the debugfs_lookup() *after* we revert to synchronous
> > > request_queue removal, or we just re-order the patches so that the
> > > revert happens first. That should simplify this patch.
> > > 
> > > The code in this patch was designed to help dispute the logic behind
> > > the CVE, in particular it shows exactly where debugfs_dir *is* the
> > > one found by debugfs_lookup(), and shows the real issue behind the
> > > removal.
> > > 
> > > But yeah, now that that is done, I hope its clear to all, and I think
> > > this patch can be simplified if we just revert the async requeust_queue
> > > removal first.
> > 
> > Don't try to "dispute" crazyness, that's not what kernel code is for.
> > Just do the right thing, and simply saving off the pointer to the
> > debugfs file when created is the "correct" thing to do, no matter what.
> > No race conditions or anything else can happen when you do that.
> 
> Nope, races are still possible even if we move revert back to sync
> request_queue removal, but I do believe that just reveals a bug
> elsewhere, which I'll just fix as I think I know where this is.
> 
> > > > Or do you do that in later patches?  I only see this one at the moment,
> > > > sorry...
> > > > 
> > > > >  static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
> > > > > +					    struct request_queue *q,
> > > > >  					    struct blk_trace *bt)
> > > > >  {
> > > > >  	struct dentry *dir = NULL;
> > > > >  
> > > > > +	/* This can only happen if we have a bug on our lower layers */
> > > > > +	if (!q->kobj.parent) {
> > > > > +		pr_warn("%s: request_queue parent is gone\n", buts->name);
> > > > 
> > > > A kobject always has a parent, unless it has not been registered yet, so
> > > > I don't know what you are testing could ever happen.
> > > 
> > > Or it has been kobject_del()'d?
> > 
> > If that happened, how in the world are you in this function anyway, as
> > the request_queue is an invalid pointer at that point in time???
> 
> Nope, the block layer still finishes some work on it.
> 
> Drivers are allowed to cleanup a block device in this order, this
> example,  is from the loop block driver:
> 
> static void loop_remove(struct loop_device *lo)                                 
> {
> 	del_gendisk(lo->lo_disk);
> 	blk_cleanup_queue(lo->lo_queue);
> 	blk_mq_free_tag_set(&lo->tag_set);
> 	put_disk(lo->lo_disk);
> 	kfree(lo);
> }   
> 
> At this point in time patch-wise we still haven't reverted back to
> synchronous request_queue removal. Considering this, a race with the
> parent disappearing can happen because the request_queue removal is
> deferred, that is, the request_queue's kobject's release() call used
> schedule_work() to finish off its removal. We expect the last
> blk_put_queue() to be called at the end of blk_cleanup_queue(). Since

Actually no, we expect that request queue is released after disk is
released. Don't forget that gendisk does hold one extra refcount of
request queue.

> that is deferred and device_del() is called also at the end of
> del_gendisk(), it means the release of the queue can happen in a
> context where the disk is gone.
> 
> Although this issue is triggerable easily with the current async
> request_queue removal, I can think of other ways to trigger an issue
> here and one of them was suggested as possible by Christoph on the last
> v1 patch series.
> 
> blk_queue_get() is not atomic and so what it returns can be incorrect:
> 
> bool blk_get_queue(struct request_queue *q)
> {
> 	if (likely(!blk_queue_dying(q))) {
> 		__blk_get_queue(q);
> 		return true;
> 	}
> 	----> we can schedule here easily and move the queue to dying
> 	----> and race with blk_cleanup_queue() which will then allow
> 	----> code paths to incorrectly trigger the release of the queue
> 	----> in places we don't want

Right, actually caller of blk_get_queue() has to guarantee that
the request queue is alive.

Some users of blk_get_queue() aren't necessary, such as rbd and mmc.

> 	return false;
> }
> EXPORT_SYMBOL(blk_get_queue);
> 
> Another area of concern I am seeing through code inspection is that
> since the request_queue life depends on the disk, it seemse odd we call
> device_del() before we remove the request_queue. If this is indeed
> wrong, we could move the device_del() from del_gendisk() to
> disk_release() triggered by the last put_disk().

Why do you think it is odd and want to change this way? Disk has to be
removed first for draining dirty pages to queue, then we can cleanup
queue. Once we start to clean up request queue, all data may not land
disk any more.

> 
> I have a test now which shows that even if we revert back to synchronous
> request_queue removal I can trigger a panic on the same use-after-free
> case on debugfs on blktrace, and this is because we are overwriting the
> same debugfs directory, and I think the above are its root causes.

The reason should be shared debugfs dir between blktrace and blk-mq
debugfs.

> 
> I can fix this bug, but that just moves the bug to conflicts within
> two sysfs objects already existing, and this is because add_disk()
> (which calls __device_add_disk() doesn't return errors). This is both
> a blk layer bug in the sense we never check for error and a driver bug
> for allowing conflicts. All this just needs to be fixed, and although I
> thought this could be done later, I think I'm just going to fix all this
> now.

Yeah, we talked that several times, looks no one tries to post patch to
fix that.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-22  9:43           ` Ming Lei
@ 2020-04-22 10:31             ` Luis Chamberlain
  2020-04-24 23:47             ` Luis Chamberlain
  1 sibling, 0 replies; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-22 10:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: Greg KH, axboe, viro, bvanassche, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Wed, Apr 22, 2020 at 05:43:20PM +0800, Ming Lei wrote:
> On Wed, Apr 22, 2020 at 07:28:59AM +0000, Luis Chamberlain wrote:
> > On Tue, Apr 21, 2020 at 09:00:48AM +0200, Greg KH wrote:
> > > On Mon, Apr 20, 2020 at 08:41:56PM +0000, Luis Chamberlain wrote:
> > > > On Mon, Apr 20, 2020 at 10:16:15PM +0200, Greg KH wrote:
> > > > > On Sun, Apr 19, 2020 at 07:45:22PM +0000, Luis Chamberlain wrote:
> > > > > 
> > > > > This patch triggered gmail's spam detection, your changelog text is
> > > > > whack...
> > > > 
> > > > Oh? What do you think triggered it?
> > > 
> > > No idea.
> > 
> > Alright, well I'm going to move most of the analysis to the bug report
> > and be as concise as possible on the commit log.
> > 
> > > > > > diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
> > > > > > index 19091e1effc0..d84038bce0a5 100644
> > > > > > --- a/block/blk-debugfs.c
> > > > > > +++ b/block/blk-debugfs.c
> > > > > > @@ -3,6 +3,9 @@
> > > > > >  /*
> > > > > >   * Shared request-based / make_request-based functionality
> > > > > >   */
> > > > > > +
> > > > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > > > > +
> > > > > >  #include <linux/kernel.h>
> > > > > >  #include <linux/blkdev.h>
> > > > > >  #include <linux/debugfs.h>
> > > > > > @@ -13,3 +16,30 @@ void blk_debugfs_register(void)
> > > > > >  {
> > > > > >  	blk_debugfs_root = debugfs_create_dir("block", NULL);
> > > > > >  }
> > > > > > +
> > > > > > +int __must_check blk_queue_debugfs_register(struct request_queue *q)
> > > > > > +{
> > > > > > +	struct dentry *dir = NULL;
> > > > > > +
> > > > > > +	/* This can happen if we have a bug in the lower layers */
> > > > > > +	dir = debugfs_lookup(kobject_name(q->kobj.parent), blk_debugfs_root);
> > > > > > +	if (dir) {
> > > > > > +		pr_warn("%s: registering request_queue debugfs directory twice is not allowed\n",
> > > > > > +			kobject_name(q->kobj.parent));
> > > > > > +		dput(dir);
> > > > > > +		return -EALREADY;
> > > > > > +	}
> > > > > > +
> > > > > > +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > > > > > +					    blk_debugfs_root);
> > > > > > +	if (!q->debugfs_dir)
> > > > > > +		return -ENOMEM;
> > > > > 
> > > > > Why doesn't the directory just live in the request queue, or somewhere
> > > > > else, so that you save it when it is created and then that's it.  No
> > > > > need to "look it up" anywhere else.
> > > > 
> > > > Its already there. And yes, after my changes it is technically possible
> > > > to just re-use it directly. But this is complicated by a few things. One
> > > > is that at this point in time, asynchronous request_queue removal is
> > > > still possible, and so a race was exposed where a requeust_queue may be
> > > > lingering but its old device is gone. That race is fixed by reverting us
> > > > back to synchronous request_queue removal, therefore ensuring that the
> > > > debugfs dir exists so long as the device does.
> > > > 
> > > > I can remove the debugfs_lookup() *after* we revert to synchronous
> > > > request_queue removal, or we just re-order the patches so that the
> > > > revert happens first. That should simplify this patch.
> > > > 
> > > > The code in this patch was designed to help dispute the logic behind
> > > > the CVE, in particular it shows exactly where debugfs_dir *is* the
> > > > one found by debugfs_lookup(), and shows the real issue behind the
> > > > removal.
> > > > 
> > > > But yeah, now that that is done, I hope its clear to all, and I think
> > > > this patch can be simplified if we just revert the async requeust_queue
> > > > removal first.
> > > 
> > > Don't try to "dispute" crazyness, that's not what kernel code is for.
> > > Just do the right thing, and simply saving off the pointer to the
> > > debugfs file when created is the "correct" thing to do, no matter what.
> > > No race conditions or anything else can happen when you do that.
> > 
> > Nope, races are still possible even if we move revert back to sync
> > request_queue removal, but I do believe that just reveals a bug
> > elsewhere, which I'll just fix as I think I know where this is.
> > 
> > > > > Or do you do that in later patches?  I only see this one at the moment,
> > > > > sorry...
> > > > > 
> > > > > >  static struct dentry *blk_trace_debugfs_dir(struct blk_user_trace_setup *buts,
> > > > > > +					    struct request_queue *q,
> > > > > >  					    struct blk_trace *bt)
> > > > > >  {
> > > > > >  	struct dentry *dir = NULL;
> > > > > >  
> > > > > > +	/* This can only happen if we have a bug on our lower layers */
> > > > > > +	if (!q->kobj.parent) {
> > > > > > +		pr_warn("%s: request_queue parent is gone\n", buts->name);
> > > > > 
> > > > > A kobject always has a parent, unless it has not been registered yet, so
> > > > > I don't know what you are testing could ever happen.
> > > > 
> > > > Or it has been kobject_del()'d?
> > > 
> > > If that happened, how in the world are you in this function anyway, as
> > > the request_queue is an invalid pointer at that point in time???
> > 
> > Nope, the block layer still finishes some work on it.
> > 
> > Drivers are allowed to cleanup a block device in this order, this
> > example,  is from the loop block driver:
> > 
> > static void loop_remove(struct loop_device *lo)                                 
> > {
> > 	del_gendisk(lo->lo_disk);
> > 	blk_cleanup_queue(lo->lo_queue);
> > 	blk_mq_free_tag_set(&lo->tag_set);
> > 	put_disk(lo->lo_disk);
> > 	kfree(lo);
> > }   
> > 
> > At this point in time patch-wise we still haven't reverted back to
> > synchronous request_queue removal. Considering this, a race with the
> > parent disappearing can happen because the request_queue removal is
> > deferred, that is, the request_queue's kobject's release() call used
> > schedule_work() to finish off its removal. We expect the last
> > blk_put_queue() to be called at the end of blk_cleanup_queue(). Since
> 
> Actually no, we expect that request queue is released after disk is
> released. Don't forget that gendisk does hold one extra refcount of
> request queue.

Ah yes.

Still the device_del() occurs before, this means the sysfs path
is cleared for a new device to come in as well, and this can happen
even with synchronous request_queue removal.

I have some changes to try to help address this now.

> > that is deferred and device_del() is called also at the end of
> > del_gendisk(), it means the release of the queue can happen in a
> > context where the disk is gone.
> > 
> > Although this issue is triggerable easily with the current async
> > request_queue removal, I can think of other ways to trigger an issue
> > here and one of them was suggested as possible by Christoph on the last
> > v1 patch series.
> > 
> > blk_queue_get() is not atomic and so what it returns can be incorrect:
> > 
> > bool blk_get_queue(struct request_queue *q)
> > {
> > 	if (likely(!blk_queue_dying(q))) {
> > 		__blk_get_queue(q);
> > 		return true;
> > 	}
> > 	----> we can schedule here easily and move the queue to dying
> > 	----> and race with blk_cleanup_queue() which will then allow
> > 	----> code paths to incorrectly trigger the release of the queue
> > 	----> in places we don't want
> 
> Right, actually caller of blk_get_queue() has to guarantee that
> the request queue is alive.

Sure.

> Some users of blk_get_queue() aren't necessary, such as rbd and mmc.

Are you saying we can remove those calls on rbd / mmc or something else?

> 
> > 	return false;
> > }
> > EXPORT_SYMBOL(blk_get_queue);
> > 
> > Another area of concern I am seeing through code inspection is that
> > since the request_queue life depends on the disk, it seemse odd we call
> > device_del() before we remove the request_queue. If this is indeed
> > wrong, we could move the device_del() from del_gendisk() to
> > disk_release() triggered by the last put_disk().
> 
> Why do you think it is odd and want to change this way? Disk has to be
> removed first for draining dirty pages to queue, then we can cleanup
> queue. Once we start to clean up request queue, all data may not land
> disk any more.

I think that all that can be done as-is today, an issue I suspect exists
is that we remove the disk from syfs hierarchy prior to the request
queue, whereas we expect this to be in the inverse order.

> > I have a test now which shows that even if we revert back to synchronous
> > request_queue removal I can trigger a panic on the same use-after-free
> > case on debugfs on blktrace, and this is because we are overwriting the
> > same debugfs directory, and I think the above are its root causes.
> 
> The reason should be shared debugfs dir between blktrace and blk-mq
> debugfs.

I think its something else.

> > I can fix this bug, but that just moves the bug to conflicts within
> > two sysfs objects already existing, and this is because add_disk()
> > (which calls __device_add_disk() doesn't return errors). This is both
> > a blk layer bug in the sense we never check for error and a driver bug
> > for allowing conflicts. All this just needs to be fixed, and although I
> > thought this could be done later, I think I'm just going to fix all this
> > now.
> 
> Yeah, we talked that several times, looks no one tries to post patch to
> fix that.

Consider this done, just need to brush it up now and send for review.

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 10/10] block: put_device() if device_add() fails
  2020-04-19 23:40   ` Bart Van Assche
@ 2020-04-24 22:32     ` Luis Chamberlain
  2020-04-25  1:58       ` Bart Van Assche
  0 siblings, 1 reply; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-24 22:32 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Sun, Apr 19, 2020 at 04:40:45PM -0700, Bart Van Assche wrote:
> On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> > Through code inspection I've found that we don't put_device() if
> > device_add() fails, and this must be done to decrement its refcount.
> 
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Turns out this is wrong, as bdi needs it still, we have can only remove
it once all users are done, which should be at the disk_release() path.

I've found this while adding the errors paths missing.

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 03/10] blktrace: fix debugfs use after free
  2020-04-22  9:43           ` Ming Lei
  2020-04-22 10:31             ` Luis Chamberlain
@ 2020-04-24 23:47             ` Luis Chamberlain
  1 sibling, 0 replies; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-24 23:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Greg KH, axboe, viro, bvanassche, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Wed, Apr 22, 2020 at 05:43:20PM +0800, Ming Lei wrote:
> On Wed, Apr 22, 2020 at 07:28:59AM +0000, Luis Chamberlain wrote:
> > At this point in time patch-wise we still haven't reverted back to
> > synchronous request_queue removal. Considering this, a race with the
> > parent disappearing can happen because the request_queue removal is
> > deferred, that is, the request_queue's kobject's release() call used
> > schedule_work() to finish off its removal. We expect the last
> > blk_put_queue() to be called at the end of blk_cleanup_queue(). Since
> 
> Actually no, we expect that request queue is released after disk is
> released. Don't forget that gendisk does hold one extra refcount of
> request queue.

Then by all means using blk_put_queue() from everywhere should be safe
in atomic context, as we have control over that blk_put_queue() on the
block layer.

(Modulo, we accept the races possible today on blk_get_queue(), which
I'll try to address).

  Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 10/10] block: put_device() if device_add() fails
  2020-04-24 22:32     ` Luis Chamberlain
@ 2020-04-25  1:58       ` Bart Van Assche
  2020-04-25  2:12         ` Luis Chamberlain
  0 siblings, 1 reply; 57+ messages in thread
From: Bart Van Assche @ 2020-04-25  1:58 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On 2020-04-24 15:32, Luis Chamberlain wrote:
> On Sun, Apr 19, 2020 at 04:40:45PM -0700, Bart Van Assche wrote:
>> On 4/19/20 12:45 PM, Luis Chamberlain wrote:
>>> Through code inspection I've found that we don't put_device() if
>>> device_add() fails, and this must be done to decrement its refcount.
>>
>> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> 
> Turns out this is wrong, as bdi needs it still, we have can only remove
> it once all users are done, which should be at the disk_release() path.
> 
> I've found this while adding the errors paths missing.

Hi Luis,

I had a look at the comments above device_add() before I added my
Reviewed-by. Now that I've had another look at these comments and also
at the device_add() implementation I agree that we don't need this patch.

Bart.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 10/10] block: put_device() if device_add() fails
  2020-04-25  1:58       ` Bart Van Assche
@ 2020-04-25  2:12         ` Luis Chamberlain
  0 siblings, 0 replies; 57+ messages in thread
From: Luis Chamberlain @ 2020-04-25  2:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Fri, Apr 24, 2020 at 06:58:23PM -0700, Bart Van Assche wrote:
> On 2020-04-24 15:32, Luis Chamberlain wrote:
> > On Sun, Apr 19, 2020 at 04:40:45PM -0700, Bart Van Assche wrote:
> >> On 4/19/20 12:45 PM, Luis Chamberlain wrote:
> >>> Through code inspection I've found that we don't put_device() if
> >>> device_add() fails, and this must be done to decrement its refcount.
> >>
> >> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> > 
> > Turns out this is wrong, as bdi needs it still, we have can only remove
> > it once all users are done, which should be at the disk_release() path.
> > 
> > I've found this while adding the errors paths missing.
> 
> Hi Luis,
> 
> I had a look at the comments above device_add() before I added my
> Reviewed-by. Now that I've had another look at these comments and also
> at the device_add() implementation I agree that we don't need this patch.

Thanks for the confirmation. And just to note, we don't do then
put_device() because we don't handle error paths properly. Once we do,
we'll need to ensure we put_disk() just at the right place. I'm working
on putting some final brush strokes on that now.

   Luis

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2020-04-25  2:12 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-19 19:45 [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain
2020-04-19 19:45 ` [PATCH v2 01/10] block: move main block debugfs initialization to its own file Luis Chamberlain
2020-04-19 21:06   ` Bart Van Assche
2020-04-19 19:45 ` [PATCH v2 02/10] blktrace: move blktrace debugfs creation to helper function Luis Chamberlain
2020-04-19 21:11   ` Bart Van Assche
2020-04-22  7:12   ` Christoph Hellwig
2020-04-19 19:45 ` [PATCH v2 03/10] blktrace: fix debugfs use after free Luis Chamberlain
2020-04-19 21:55   ` Bart Van Assche
2020-04-20  0:04     ` Luis Chamberlain
2020-04-20  0:38       ` Bart Van Assche
2020-04-20 18:46         ` Luis Chamberlain
2020-04-20 20:16   ` Greg KH
2020-04-20 20:41     ` Luis Chamberlain
2020-04-21  7:00       ` Greg KH
2020-04-22  7:28         ` Luis Chamberlain
2020-04-22  9:43           ` Ming Lei
2020-04-22 10:31             ` Luis Chamberlain
2020-04-24 23:47             ` Luis Chamberlain
2020-04-22  7:29       ` Christoph Hellwig
2020-04-22  7:34         ` Luis Chamberlain
2020-04-22  7:27   ` Christoph Hellwig
2020-04-22  7:48     ` Luis Chamberlain
2020-04-22  8:10       ` Christoph Hellwig
2020-04-22  8:26         ` Luis Chamberlain
2020-04-19 19:45 ` [PATCH v2 04/10] block: revert back to synchronous request_queue removal Luis Chamberlain
2020-04-19 22:23   ` Bart Van Assche
2020-04-20 18:59     ` Luis Chamberlain
2020-04-20 21:11       ` Bart Van Assche
2020-04-20 21:51         ` Luis Chamberlain
2020-04-19 19:45 ` [PATCH v2 05/10] blktrace: upgrade warns to BUG_ON() on unexpected circmunstances Luis Chamberlain
2020-04-19 22:50   ` Bart Van Assche
2020-04-19 23:07     ` Luis Chamberlain
2020-04-20 23:20       ` Steven Rostedt
2020-04-19 19:45 ` [PATCH v2 06/10] blk-debugfs: upgrade warns to BUG_ON() if directory is already found Luis Chamberlain
2020-04-20 11:36   ` Greg KH
2020-04-19 19:45 ` [PATCH v2 07/10] blktrace: move debugfs file creation to its own function Luis Chamberlain
2020-04-19 22:55   ` Bart Van Assche
2020-04-20 11:37     ` Greg KH
2020-04-19 19:45 ` [PATCH v2 08/10] blktrace: add checks for created debugfs files on setup Luis Chamberlain
2020-04-19 22:57   ` Bart Van Assche
2020-04-19 23:05     ` Luis Chamberlain
2020-04-19 23:17       ` Bart Van Assche
2020-04-20 11:40         ` Greg KH
2020-04-20 18:44           ` Luis Chamberlain
2020-04-20 20:11             ` Greg KH
2020-04-20 20:20               ` Luis Chamberlain
2020-04-21  6:55                 ` Greg KH
2020-04-20 11:39   ` Greg KH
2020-04-19 19:45 ` [PATCH v2 09/10] block: panic if block debugfs dir is not created Luis Chamberlain
2020-04-19 23:08   ` Bart Van Assche
2020-04-20 11:38   ` Greg KH
2020-04-19 19:45 ` [PATCH v2 10/10] block: put_device() if device_add() fails Luis Chamberlain
2020-04-19 23:40   ` Bart Van Assche
2020-04-24 22:32     ` Luis Chamberlain
2020-04-25  1:58       ` Bart Van Assche
2020-04-25  2:12         ` Luis Chamberlain
2020-04-19 19:48 ` [PATCH v2 00/10] block: fix blktrace debugfs use after free Luis Chamberlain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).