linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] blktrace: fix use after free
@ 2020-04-14  4:18 Luis Chamberlain
  2020-04-14  4:18 ` [PATCH 1/5] block: move main block debugfs initialization to its own file Luis Chamberlain
                   ` (5 more replies)
  0 siblings, 6 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-14  4:18 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain

After two iterations of RFCs I think this is ready now. I've taken
the feedback from the last series, both on code and commit log.
I've also extended the commit log on the last patch to also explain
how the original shift to async request_queue removal turned out
to actually be a userspace regression and added a respective fixes
tag for it.

You can find these patches on my 20200414-dev-blkqueue-defer-removal-patch-v1
branch based on linux-next tag next-20200414 on kernel.org [0].

Further review and rants are appreciated.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20200414-dev-blkqueue-defer-removal

Luis Chamberlain (5):
  block: move main block debugfs initialization to its own file
  blktrace: fix debugfs use after free
  blktrace: refcount the request_queue during ioctl
  mm/swapfile: refcount block and queue before using
    blkcg_schedule_throttle()
  block: revert back to synchronous request_queue removal

 block/Makefile               |  1 +
 block/blk-core.c             | 28 ++++++++++++++++--------
 block/blk-debugfs.c          | 27 ++++++++++++++++++++++++
 block/blk-mq-debugfs.c       |  5 -----
 block/blk-sysfs.c            | 41 ++++++++++++++++++------------------
 block/blk.h                  | 17 +++++++++++++++
 include/linux/blkdev.h       |  7 +++---
 include/linux/blktrace_api.h |  1 -
 kernel/trace/blktrace.c      | 25 ++++++++++++----------
 mm/swapfile.c                | 14 ++++++++++--
 10 files changed, 114 insertions(+), 52 deletions(-)
 create mode 100644 block/blk-debugfs.c

-- 
2.25.1



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 1/5] block: move main block debugfs initialization to its own file
  2020-04-14  4:18 [PATCH 0/5] blktrace: fix use after free Luis Chamberlain
@ 2020-04-14  4:18 ` Luis Chamberlain
  2020-04-14  7:35   ` Greg KH
  2020-04-15  2:44   ` Bart Van Assche
  2020-04-14  4:18 ` [PATCH 2/5] blktrace: fix debugfs use after free Luis Chamberlain
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-14  4:18 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

make_request-based drivers and and request-based drivers share some
some debugfs code. By moving this into its own file it makes it easier
to expand and audit this shared code.

This patch contains no functional changes.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/Makefile      |  1 +
 block/blk-core.c    |  9 +--------
 block/blk-debugfs.c | 15 +++++++++++++++
 block/blk.h         |  7 +++++++
 4 files changed, 24 insertions(+), 8 deletions(-)
 create mode 100644 block/blk-debugfs.c

diff --git a/block/Makefile b/block/Makefile
index 206b96e9387f..1d3ab20505d8 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-sysfs.o \
 			blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
 			genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o
 
+obj-$(CONFIG_DEBUG_FS)		+= blk-debugfs.o
 obj-$(CONFIG_BOUNCE)		+= bounce.o
 obj-$(CONFIG_BLK_SCSI_REQUEST)	+= scsi_ioctl.o
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 7e4a1da0715e..5aaae7a1b338 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -48,10 +48,6 @@
 #include "blk-pm.h"
 #include "blk-rq-qos.h"
 
-#ifdef CONFIG_DEBUG_FS
-struct dentry *blk_debugfs_root;
-#endif
-
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
@@ -1796,10 +1792,7 @@ int __init blk_dev_init(void)
 
 	blk_requestq_cachep = kmem_cache_create("request_queue",
 			sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
-
-#ifdef CONFIG_DEBUG_FS
-	blk_debugfs_root = debugfs_create_dir("block", NULL);
-#endif
+	blk_debugfs_register();
 
 	return 0;
 }
diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
new file mode 100644
index 000000000000..19091e1effc0
--- /dev/null
+++ b/block/blk-debugfs.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Shared request-based / make_request-based functionality
+ */
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/debugfs.h>
+
+struct dentry *blk_debugfs_root;
+
+void blk_debugfs_register(void)
+{
+	blk_debugfs_root = debugfs_create_dir("block", NULL);
+}
diff --git a/block/blk.h b/block/blk.h
index 0a94ec68af32..86a66b614f08 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -487,5 +487,12 @@ struct request_queue *__blk_alloc_queue(int node_id);
 int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		bool *same_page);
+#ifdef CONFIG_DEBUG_FS
+void blk_debugfs_register(void);
+#else
+static inline void blk_debugfs_register(void)
+{
+}
+#endif /* CONFIG_DEBUG_FS */
 
 #endif /* BLK_INTERNAL_H */
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-14  4:18 [PATCH 0/5] blktrace: fix use after free Luis Chamberlain
  2020-04-14  4:18 ` [PATCH 1/5] block: move main block debugfs initialization to its own file Luis Chamberlain
@ 2020-04-14  4:18 ` Luis Chamberlain
  2020-04-14  7:37   ` Greg KH
                     ` (4 more replies)
  2020-04-14  4:19 ` [PATCH 3/5] blktrace: refcount the request_queue during ioctl Luis Chamberlain
                   ` (3 subsequent siblings)
  5 siblings, 5 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-14  4:18 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da

On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
merged on v4.12 Omar fixed the original blktrace code for request-based
drivers (multiqueue). This however left in place a possible crash, if you
happen to abuse blktrace in a way it was not intended.

Namely, if you loop adding a device, setup the blktrace with BLKTRACESETUP,
forget to BLKTRACETEARDOWN, and then just remove the device you end up
with a panic:

[  107.193134] debugfs: Directory 'loop0' with parent 'block' already present!
[  107.254615] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[  107.258785] #PF: supervisor write access in kernel mode
[  107.262035] #PF: error_code(0x0002) - not-present page
[  107.264106] PGD 0 P4D 0
[  107.264404] Oops: 0002 [#1] SMP NOPTI
[  107.264803] CPU: 8 PID: 674 Comm: kworker/8:2 Tainted: G            E 5.6.0-rc7-next-20200327 #1
[  107.265712] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[  107.266553] Workqueue: events __blk_release_queue
[  107.267051] RIP: 0010:down_write+0x15/0x40
[  107.267488] Code: eb ca e8 ee a5 8d ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 55 00 75 0f  65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
[  107.269300] RSP: 0018:ffff9927c06efda8 EFLAGS: 00010246
[  107.269841] RAX: 0000000000000000 RBX: ffff8be7e73b0600 RCX: ffffff8100000000
[  107.270559] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[  107.271281] RBP: 00000000000000a0 R08: ffff8be7ebc80fa8 R09: ffff8be7ebc80fa8
[  107.272001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  107.272722] R13: ffff8be7efc30400 R14: ffff8be7e0571200 R15: 00000000000000a0
[  107.273475] FS:  0000000000000000(0000) GS:ffff8be7efc00000(0000) knlGS:0000000000000000
[  107.274346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  107.274968] CR2: 00000000000000a0 CR3: 000000042abee003 CR4: 0000000000360ee0
[  107.275710] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  107.276465] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  107.277214] Call Trace:
[  107.277532]  simple_recursive_removal+0x4e/0x2e0
[  107.278049]  ? debugfs_remove+0x60/0x60
[  107.278493]  debugfs_remove+0x40/0x60
[  107.278922]  blk_trace_free+0xd/0x50
[  107.279339]  __blk_trace_remove+0x27/0x40
[  107.279797]  blk_trace_shutdown+0x30/0x40
[  107.280256]  __blk_release_queue+0xab/0x110
[  107.280734]  process_one_work+0x1b4/0x380
[  107.281194]  worker_thread+0x50/0x3c0
[  107.281622]  kthread+0xf9/0x130
[  107.281994]  ? process_one_work+0x380/0x380
[  107.282467]  ? kthread_park+0x90/0x90
[  107.282895]  ret_from_fork+0x1f/0x40
[  107.283316] Modules linked in: loop(E) <etc>
[  107.288562] CR2: 00000000000000a0
[  107.288957] ---[ end trace b885d243d441bbce ]---

This splat happens to be very similar to the one reported via
kernel.org korg#205713, only that korg#205713 was for v4.19.83
and the above now includes the simple_recursive_removal() introduced
via commit a3d1e7eb5abe ("simple_recursive_removal(): kernel-side rm
-rf for ramfs-style filesystems") merged on v5.6.

korg#205713 then was used to create CVE-2019-19770 and claims that
the bug is in a use-after-free in the debugfs core code. The
implications of this being a generic UAF on debugfs would be
much more severe, as it would imply parent dentries can sometimes
not be positive, which we hold by design is just not possible.

Below is the splat explained with a bit more details, explaining
what is happening in userspace, kernel, and a print of the CPU on,
which the code runs on:

load loopback module
[   13.603371] == blk_mq_debugfs_register(12) start
[   13.604040] == blk_mq_debugfs_register(12) q->debugfs_dir created
[   13.604934] == blk_mq_debugfs_register(12) end
[   13.627382] == blk_mq_debugfs_register(12) start
[   13.628041] == blk_mq_debugfs_register(12) q->debugfs_dir created
[   13.629240] == blk_mq_debugfs_register(12) end
[   13.651667] == blk_mq_debugfs_register(12) start
[   13.652836] == blk_mq_debugfs_register(12) q->debugfs_dir created
[   13.655107] == blk_mq_debugfs_register(12) end
[   13.684917] == blk_mq_debugfs_register(12) start
[   13.687876] == blk_mq_debugfs_register(12) q->debugfs_dir created
[   13.691588] == blk_mq_debugfs_register(13) end
[   13.707320] == blk_mq_debugfs_register(13) start
[   13.707863] == blk_mq_debugfs_register(13) q->debugfs_dir created
[   13.708856] == blk_mq_debugfs_register(13) end
[   13.735623] == blk_mq_debugfs_register(13) start
[   13.736656] == blk_mq_debugfs_register(13) q->debugfs_dir created
[   13.738411] == blk_mq_debugfs_register(13) end
[   13.763326] == blk_mq_debugfs_register(13) start
[   13.763972] == blk_mq_debugfs_register(13) q->debugfs_dir created
[   13.765167] == blk_mq_debugfs_register(13) end
[   13.779510] == blk_mq_debugfs_register(13) start
[   13.780522] == blk_mq_debugfs_register(13) q->debugfs_dir created
[   13.782338] == blk_mq_debugfs_register(13) end
[   13.783521] loop: module loaded

LOOP_CTL_DEL(loop0) #1
[   13.803550] = __blk_release_queue(4) start
[   13.807772] == blk_trace_shutdown(4) start
[   13.810749] == blk_trace_shutdown(4) end
[   13.813437] = __blk_release_queue(4) calling blk_mq_debugfs_unregister()
[   13.817593] ==== blk_mq_debugfs_unregister(4) begin
[   13.817621] ==== blk_mq_debugfs_unregister(4) debugfs_remove_recursive(q->debugfs_dir)
[   13.821203] ==== blk_mq_debugfs_unregister(4) end q->debugfs_dir is NULL
[   13.826166] = __blk_release_queue(4) blk_mq_debugfs_unregister() end
[   13.832992] = __blk_release_queue(4) end

LOOP_CTL_ADD(loop0) #1
[   13.843742] == blk_mq_debugfs_register(7) start
[   13.845569] == blk_mq_debugfs_register(7) q->debugfs_dir created
[   13.848628] == blk_mq_debugfs_register(7) end

BLKTRACE_SETUP(loop0) #1
[   13.850924] == blk_trace_ioctl(7, BLKTRACESETUP) start
[   13.852852] === do_blk_trace_setup(7) start
[   13.854580] === do_blk_trace_setup(7) creating directory
[   13.856620] === do_blk_trace_setup(7) using what debugfs_lookup() gave
[   13.860635] === do_blk_trace_setup(7) end with ret: 0
[   13.862615] == blk_trace_ioctl(7, BLKTRACESETUP) end

LOOP_CTL_DEL(loop0) #2
[   13.883304] = __blk_release_queue(7) start
[   13.885324] == blk_trace_shutdown(7) start
[   13.887197] == blk_trace_shutdown(7) calling __blk_trace_remove()
[   13.889807] == __blk_trace_remove(7) start
[   13.891669] === blk_trace_cleanup(7) start
[   13.911656] ====== blk_trace_free(7) start

LOOP_CTL_ADD(loop0) #2
[   13.912709] == blk_mq_debugfs_register(2) start

---> From LOOP_CTL_DEL(loop0) #2
[   13.915887] ====== blk_trace_free(7) end

---> From LOOP_CTL_ADD(loop0) #2
[   13.918359] debugfs: Directory 'loop0' with parent 'block' already present!
[   13.926433] == blk_mq_debugfs_register(2) q->debugfs_dir created
[   13.930373] == blk_mq_debugfs_register(2) end

BLKTRACE_SETUP(loop0) #2
[   13.933961] == blk_trace_ioctl(2, BLKTRACESETUP) start
[   13.936758] === do_blk_trace_setup(2) start
[   13.938944] === do_blk_trace_setup(2) creating directory
[   13.941029] === do_blk_trace_setup(2) using what debugfs_lookup() gave

---> From LOOP_CTL_DEL(loop0) #2
[   13.971046] === blk_trace_cleanup(7) end
[   13.973175] == __blk_trace_remove(7) end
[   13.975352] == blk_trace_shutdown(7) end
[   13.977415] = __blk_release_queue(7) calling blk_mq_debugfs_unregister()
[   13.980645] ==== blk_mq_debugfs_unregister(7) begin
[   13.980696] ==== blk_mq_debugfs_unregister(7) debugfs_remove_recursive(q->debugfs_dir)
[   13.983118] ==== blk_mq_debugfs_unregister(7) end q->debugfs_dir is NULL
[   13.986945] = __blk_release_queue(7) blk_mq_debugfs_unregister() end
[   13.993155] = __blk_release_queue(7) end

---> From BLKTRACE_SETUP(loop0) #2
[   13.995928] === do_blk_trace_setup(2) end with ret: 0
[   13.997623] == blk_trace_ioctl(2, BLKTRACESETUP) end

LOOP_CTL_DEL(loop0) #3
[   14.035119] = __blk_release_queue(2) start
[   14.036925] == blk_trace_shutdown(2) start
[   14.038518] == blk_trace_shutdown(2) calling __blk_trace_remove()
[   14.040829] == __blk_trace_remove(2) start
[   14.042413] === blk_trace_cleanup(2) start

LOOP_CTL_ADD(loop0) #3
[   14.072522] == blk_mq_debugfs_register(6) start

---> From LOOP_CTL_DEL(loop0) #3
[   14.075151] ====== blk_trace_free(2) start

---> From LOOP_CTL_ADD(loop0) #3
[   14.075882] == blk_mq_debugfs_register(6) q->debugfs_dir created

---> From LOOP_CTL_DEL(loop0) #3
[   14.078624] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[   14.084332] == blk_mq_debugfs_register(6) end
[   14.086971] #PF: supervisor write access in kernel mode
[   14.086974] #PF: error_code(0x0002) - not-present page
[   14.086977] PGD 0 P4D 0
[   14.086984] Oops: 0002 [#1] SMP NOPTI
[   14.086990] CPU: 2 PID: 287 Comm: kworker/2:2 Tainted: G            E 5.6.0-next-20200403+ #54
[   14.086991] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[   14.087002] Workqueue: events __blk_release_queue
[   14.087011] RIP: 0010:down_write+0x15/0x40
[   14.090300] == blk_trace_ioctl(6, BLKTRACESETUP) start
[   14.093277] Code: eb ca e8 3e 34 8d ff cc cc cc cc cc cc cc cc cc cc
cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00
00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
[   14.093280] RSP: 0018:ffffc28a00533da8 EFLAGS: 00010246
[   14.093284] RAX: 0000000000000000 RBX: ffff9f7a24d07980 RCX: ffffff8100000000
[   14.093286] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[   14.093287] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000019
[   14.093289] R10: 0000000000000774 R11: 0000000000000000 R12: 0000000000000000
[   14.093291] R13: ffff9f7a2fab0400 R14: ffff9f7a21dd1140 R15: 00000000000000a0
[   14.093294] FS:  0000000000000000(0000) GS:ffff9f7a2fa80000(0000) knlGS:0000000000000000
[   14.093296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.093298] CR2: 00000000000000a0 CR3: 00000004293d2003 CR4: 0000000000360ee0
[   14.093307] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   14.093308] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   14.093310] Call Trace:
[   14.093324]  simple_recursive_removal+0x4e/0x2e0
[   14.093330]  ? debugfs_remove+0x60/0x60
[   14.093334]  debugfs_remove+0x40/0x60
[   14.093339]  blk_trace_free+0x20/0x70
[   14.093346]  __blk_trace_remove+0x54/0x90
[   14.096704] === do_blk_trace_setup(6) start
[   14.098534]  blk_trace_shutdown+0x74/0x80
[   14.100958] === do_blk_trace_setup(6) creating directory
[   14.104575]  __blk_release_queue+0xbe/0x160
[   14.104580]  process_one_work+0x1b4/0x380
[   14.104585]  worker_thread+0x50/0x3c0
[   14.104589]  kthread+0xf9/0x130
[   14.104593]  ? process_one_work+0x380/0x380
[   14.104596]  ? kthread_park+0x90/0x90
[   14.104599]  ret_from_fork+0x1f/0x40
[   14.104603] Modules linked in: loop(E) xfs(E) libcrc32c(E)
crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) joydev(E)
serio_raw(E) aesni_intel(E) glue_helper(E) virtio_balloon(E) evdev(E)
crypto_simd(E) pcspkr(E) cryptd(E) i6300esb(E) button(E) ip_tables(E)
x_tables(E) autofs4(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E)
jbd2(E) virtio_net(E) net_failover(E) failover(E) virtio_blk(E)
ata_generic(E) uhci_hcd(E) ata_piix(E) ehci_hcd(E) nvme(E) libata(E)
crc32c_intel(E) usbcore(E) psmouse(E) nvme_core(E) virtio_pci(E)
scsi_mod(E) virtio_ring(E) t10_pi(E) virtio(E) i2c_piix4(E) floppy(E)
[   14.107400] === do_blk_trace_setup(6) using what debugfs_lookup() gave
[   14.108939] CR2: 00000000000000a0
[   14.110589] === do_blk_trace_setup(6) end with ret: 0
[   14.111592] ---[ end trace 7a783b33b9614db9 ]---

The root cause to this issue is that debugfs_lookup() can find a
previous incarnation's dir of the same name which is about to get
removed from a not yet schedule work.

We can fix the UAF by simply using a debugfs directory which moving
forward will always be accessible if debugfs is enabled, this way,
its allocated and avaialble always for both request-based block
drivers or make_request drivers (multiqueue) block drivers.

This simplifies the code considerably, with the only penalty now being
that we're always creating the request queue debugfs directory for the
request-based block device drivers.

The UAF then is not a core debugfs issue, but instead a misuse of
debugfs, and this issue can only be triggered if you are root, and
misuse blktrace.

This issue can be reproduced with break-blktrace [2] using:

  break-blktrace -c 10 -d -s

This patch fixes this issue. Note that there is also another
respective UAF but from the ioctl path [3], this should also fix
that issue.

This patch then also disputes the severity of CVE-2019-19770 as
this issue is only possible by being root and using blktrace.

It is not a core debugfs issue.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=205713
[1] https://nvd.nist.gov/vuln/detail/CVE-2019-19770
[2] https://github.com/mcgrof/break-blktrace
[3] https://lore.kernel.org/lkml/000000000000ec635b059f752700@google.com/

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com
Fixes: 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-debugfs.c          | 12 ++++++++++++
 block/blk-mq-debugfs.c       |  5 -----
 block/blk-sysfs.c            |  3 +++
 block/blk.h                  | 10 ++++++++++
 include/linux/blkdev.h       |  5 ++++-
 include/linux/blktrace_api.h |  1 -
 kernel/trace/blktrace.c      | 19 ++++++++-----------
 7 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
index 19091e1effc0..db982688cf46 100644
--- a/block/blk-debugfs.c
+++ b/block/blk-debugfs.c
@@ -13,3 +13,15 @@ void blk_debugfs_register(void)
 {
 	blk_debugfs_root = debugfs_create_dir("block", NULL);
 }
+
+void blk_queue_debugfs_register(struct request_queue *q)
+{
+	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
+					    blk_debugfs_root);
+}
+
+void blk_queue_debugfs_unregister(struct request_queue *q)
+{
+	debugfs_remove_recursive(q->debugfs_dir);
+	q->debugfs_dir = NULL;
+}
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..bda9378eab90 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -823,9 +823,6 @@ void blk_mq_debugfs_register(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
-	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
-					    blk_debugfs_root);
-
 	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
 
 	/*
@@ -856,9 +853,7 @@ void blk_mq_debugfs_register(struct request_queue *q)
 
 void blk_mq_debugfs_unregister(struct request_queue *q)
 {
-	debugfs_remove_recursive(q->debugfs_dir);
 	q->sched_debugfs_dir = NULL;
-	q->debugfs_dir = NULL;
 }
 
 static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..0285d67e1e4c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -895,6 +895,7 @@ static void __blk_release_queue(struct work_struct *work)
 
 	blk_trace_shutdown(q);
 
+	blk_queue_debugfs_unregister(q);
 	if (queue_is_mq(q))
 		blk_mq_debugfs_unregister(q);
 
@@ -975,6 +976,8 @@ int blk_register_queue(struct gendisk *disk)
 		goto unlock;
 	}
 
+	blk_queue_debugfs_register(q);
+
 	if (queue_is_mq(q)) {
 		__blk_mq_register_dev(dev, q);
 		blk_mq_debugfs_register(q);
diff --git a/block/blk.h b/block/blk.h
index 86a66b614f08..813b8513fc1a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -489,10 +489,20 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		bool *same_page);
 #ifdef CONFIG_DEBUG_FS
 void blk_debugfs_register(void);
+void blk_queue_debugfs_register(struct request_queue *q);
+void blk_queue_debugfs_unregister(struct request_queue *q);
 #else
 static inline void blk_debugfs_register(void)
 {
 }
+
+static inline void blk_queue_debugfs_register(struct request_queue *q)
+{
+}
+
+static inline void blk_queue_debugfs_unregister(struct request_queue *q)
+{
+}
 #endif /* CONFIG_DEBUG_FS */
 
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32868fbedc9e..cc43c8e6516c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -569,8 +569,11 @@ struct request_queue {
 	struct list_head	tag_set_list;
 	struct bio_set		bio_split;
 
-#ifdef CONFIG_BLK_DEBUG_FS
+#ifdef CONFIG_DEBUG_FS
+	/* Used by block/blk-*debugfs.c and kernel/trace/blktrace.c */
 	struct dentry		*debugfs_dir;
+#endif
+#ifdef CONFIG_BLK_DEBUG_FS
 	struct dentry		*sched_debugfs_dir;
 	struct dentry		*rqos_debugfs_dir;
 #endif
diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index 3b6ff5902edc..eb6db276e293 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -22,7 +22,6 @@ struct blk_trace {
 	u64 end_lba;
 	u32 pid;
 	u32 dev;
-	struct dentry *dir;
 	struct dentry *dropped_file;
 	struct dentry *msg_file;
 	struct list_head running_list;
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index ca39dc3230cb..15086227592f 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -311,7 +311,6 @@ static void blk_trace_free(struct blk_trace *bt)
 	debugfs_remove(bt->msg_file);
 	debugfs_remove(bt->dropped_file);
 	relay_close(bt->rchan);
-	debugfs_remove(bt->dir);
 	free_percpu(bt->sequence);
 	free_percpu(bt->msg_data);
 	kfree(bt);
@@ -476,7 +475,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 			      struct blk_user_trace_setup *buts)
 {
 	struct blk_trace *bt = NULL;
-	struct dentry *dir = NULL;
 	int ret;
 
 	if (!buts->buf_size || !buts->buf_nr)
@@ -485,6 +483,9 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	if (!blk_debugfs_root)
 		return -ENOENT;
 
+	if (!q->debugfs_dir)
+		return -ENOENT;
+
 	strncpy(buts->name, name, BLKTRACE_BDEV_SIZE);
 	buts->name[BLKTRACE_BDEV_SIZE - 1] = '\0';
 
@@ -509,21 +510,19 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 
 	ret = -ENOENT;
 
-	dir = debugfs_lookup(buts->name, blk_debugfs_root);
-	if (!dir)
-		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
-
 	bt->dev = dev;
 	atomic_set(&bt->dropped, 0);
 	INIT_LIST_HEAD(&bt->running_list);
 
 	ret = -EIO;
-	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
+	bt->dropped_file = debugfs_create_file("dropped", 0444,
+					       q->debugfs_dir, bt,
 					       &blk_dropped_fops);
 
-	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
+	bt->msg_file = debugfs_create_file("msg", 0222, q->debugfs_dir,
+					   bt, &blk_msg_fops);
 
-	bt->rchan = relay_open("trace", dir, buts->buf_size,
+	bt->rchan = relay_open("trace", q->debugfs_dir, buts->buf_size,
 				buts->buf_nr, &blk_relay_callbacks, bt);
 	if (!bt->rchan)
 		goto err;
@@ -551,8 +550,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 
 	ret = 0;
 err:
-	if (dir && !bt->dir)
-		dput(dir);
 	if (ret)
 		blk_trace_free(bt);
 	return ret;
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-14  4:18 [PATCH 0/5] blktrace: fix use after free Luis Chamberlain
  2020-04-14  4:18 ` [PATCH 1/5] block: move main block debugfs initialization to its own file Luis Chamberlain
  2020-04-14  4:18 ` [PATCH 2/5] blktrace: fix debugfs use after free Luis Chamberlain
@ 2020-04-14  4:19 ` Luis Chamberlain
  2020-04-14 15:40   ` Christoph Hellwig
  2020-04-16  2:31   ` Ming Lei
  2020-04-14  4:19 ` [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle() Luis Chamberlain
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-14  4:19 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

Ensure that the request_queue is refcounted during its full
ioctl cycle. This avoids possible races against removal, given
blk_get_queue() also checks to ensure the queue is not dying.

This small race is possible if you defer removal of the request_queue
and userspace fires off an ioctl for the device in the meantime.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/trace/blktrace.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 15086227592f..17e144d15779 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -701,6 +701,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
 	if (!q)
 		return -ENXIO;
 
+	if (!blk_get_queue(q))
+		return -ENXIO;
+
 	mutex_lock(&q->blk_trace_mutex);
 
 	switch (cmd) {
@@ -729,6 +732,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
 	}
 
 	mutex_unlock(&q->blk_trace_mutex);
+
+	blk_put_queue(q);
+
 	return ret;
 }
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle()
  2020-04-14  4:18 [PATCH 0/5] blktrace: fix use after free Luis Chamberlain
                   ` (2 preceding siblings ...)
  2020-04-14  4:19 ` [PATCH 3/5] blktrace: refcount the request_queue during ioctl Luis Chamberlain
@ 2020-04-14  4:19 ` Luis Chamberlain
  2020-04-14 15:44   ` Christoph Hellwig
  2020-04-16  6:22   ` Ming Lei
  2020-04-14  4:19 ` [PATCH 5/5] block: revert back to synchronous request_queue removal Luis Chamberlain
  2020-04-14  7:38 ` [PATCH 0/5] blktrace: fix use after free Greg KH
  5 siblings, 2 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-14  4:19 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

block devices are refcounted so to ensure once its final user goes away it
can be cleaned up by the lower layers properly. The block device's
request_queue structure is also refcounted, however, if the last
blk_put_queue() is called under atomic context the block layer has
to defer removal.

By refcounting the block device during the use of blkcg_schedule_throttle(),
we ensure ensure two things:

1) the block device remains available during the call
2) we ensure avoid having to deal with the fact we're using the
   request_queue structure in atomic context, since the last
   blk_put_queue() will be called upon disk_release(), *after*
   our own bdput().

This means this code path is *not* going to remove the request_queue
structure, as we are ensuring some later upper layer disk_release()
will be the one to release the request_queue structure for us.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/swapfile.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6659ab563448..9285ff6030ca 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3753,6 +3753,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
 void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
 				  gfp_t gfp_mask)
 {
+	struct block_device *bdev;
 	struct swap_info_struct *si, *next;
 	if (!(gfp_mask & __GFP_IO) || !memcg)
 		return;
@@ -3771,8 +3772,17 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
 	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
 				  avail_lists[node]) {
 		if (si->bdev) {
-			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
-						true);
+			bdev = bdgrab(si->bdev);
+			if (!bdev)
+				continue;
+			/*
+			 * By adding our own bdgrab() we ensure the queue
+			 * sticks around until disk_release(), and so we ensure
+			 * our release of the request_queue does not happen in
+			 * atomic context.
+			 */
+			blkcg_schedule_throttle(bdev_get_queue(bdev), true);
+			bdput(bdev);
 			break;
 		}
 	}
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 5/5] block: revert back to synchronous request_queue removal
  2020-04-14  4:18 [PATCH 0/5] blktrace: fix use after free Luis Chamberlain
                   ` (3 preceding siblings ...)
  2020-04-14  4:19 ` [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle() Luis Chamberlain
@ 2020-04-14  4:19 ` Luis Chamberlain
  2020-04-14 15:47   ` Christoph Hellwig
  2020-04-16  2:36   ` Ming Lei
  2020-04-14  7:38 ` [PATCH 0/5] blktrace: fix use after free Greg KH
  5 siblings, 2 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-14  4:19 UTC (permalink / raw)
  To: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Luis Chamberlain, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
v4.12 moved the work behind blk_release_queue() into a workqueue after a
splat floated around which indicated some work on blk_release_queue()
could sleep in blk_exit_rl(). This splat would be possible when a driver
called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
as its final call) from an atomic context.

blk_put_queue() decrements the refcount for the request_queue
kobject, and upon reaching 0 blk_release_queue() is called. Although
blk_exit_rl() is now removed through commit db6d9952356 ("block: remove
request_list code"), we reserve the right to be able to sleep within
blk_release_queue() context. If you see no other way and *have* be
in atomic context when you driver calls the last blk_put_queue()
you can always just increase your block device's reference count with
bdgrab() as this can be done in atomic context and the request_queue
removal would be left to upper layers later. We document this bit of
tribal knowledge as well now, and adjust kdoc format a bit.

We revert back to synchronous request_queue removal because asynchronous
removal creates a regression with expected userspace interaction with
several drivers. An example is when removing the loopback driver and
issues ioctl from userspace to do so, upon return and if successful one
expects the device to be removed. Moving to asynchronous request_queue
removal could have broken many scripts which relied on the removal to
have been completed if there was no error.

Using asynchronous request_queue removal however has helped us find
other bugs, in the future we can test what could break with this
arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Suggested-by: Nicolai Stange <nstange@suse.de>
Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression")
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-core.c       | 19 ++++++++++++++++++-
 block/blk-sysfs.c      | 38 +++++++++++++++++---------------------
 include/linux/blkdev.h |  2 --
 3 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 5aaae7a1b338..8346c7c59ee6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -301,6 +301,17 @@ void blk_clear_pm_only(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_clear_pm_only);
 
+/**
+ * blk_put_queue - decrement the request_queue refcount
+ *
+ * Decrements the refcount to the request_queue kobject, when this reaches
+ * 0 we'll have blk_release_queue() called. You should avoid calling
+ * this function in atomic context but if you really have to ensure you
+ * first refcount the block device with bdgrab() / bdput() so that the
+ * last decrement happens in blk_cleanup_queue().
+ *
+ * @q: the request_queue structure to decrement the refcount for
+ */
 void blk_put_queue(struct request_queue *q)
 {
 	kobject_put(&q->kobj);
@@ -328,10 +339,16 @@ EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
 /**
  * blk_cleanup_queue - shutdown a request queue
- * @q: request queue to shutdown
  *
  * Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
  * put it.  All future requests will be failed immediately with -ENODEV.
+ *
+ * You should not call this function in atomic context. If you need to
+ * refcount a request_queue in atomic context, instead refcount the
+ * block device with bdgrab() / bdput().
+ *
+ * @q: request queue to shutdown
+ *
  */
 void blk_cleanup_queue(struct request_queue *q)
 {
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0285d67e1e4c..859911191ebc 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -860,22 +860,27 @@ static void blk_exit_queue(struct request_queue *q)
 	bdi_put(q->backing_dev_info);
 }
 
-
 /**
- * __blk_release_queue - release a request queue
- * @work: pointer to the release_work member of the request queue to be released
+ * blk_release_queue - release a request queue
+ *
+ * This function is called as part of the process when a block device is being
+ * unregistered. Releasing a request queue starts with blk_cleanup_queue(),
+ * which set the appropriate flags and then calls blk_put_queue() as the last
+ * step. blk_put_queue() decrements the reference counter of the request queue
+ * and once the reference counter reaches zero, this function is called to
+ * release all allocated resources of the request queue.
  *
- * Description:
- *     This function is called when a block device is being unregistered. The
- *     process of releasing a request queue starts with blk_cleanup_queue, which
- *     set the appropriate flags and then calls blk_put_queue, that decrements
- *     the reference counter of the request queue. Once the reference counter
- *     of the request queue reaches zero, blk_release_queue is called to release
- *     all allocated resources of the request queue.
+ * This function can sleep, and so we must ensure that the very last
+ * blk_put_queue() is never called from atomic context.
+ *
+ * @kobj: pointer to a kobject, who's container is a request_queue
  */
-static void __blk_release_queue(struct work_struct *work)
+static void blk_release_queue(struct kobject *kobj)
 {
-	struct request_queue *q = container_of(work, typeof(*q), release_work);
+	struct request_queue *q =
+		container_of(kobj, struct request_queue, kobj);
+
+	might_sleep();
 
 	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
 		blk_stat_remove_callback(q, q->poll_cb);
@@ -905,15 +910,6 @@ static void __blk_release_queue(struct work_struct *work)
 	call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
 
-static void blk_release_queue(struct kobject *kobj)
-{
-	struct request_queue *q =
-		container_of(kobj, struct request_queue, kobj);
-
-	INIT_WORK(&q->release_work, __blk_release_queue);
-	schedule_work(&q->release_work);
-}
-
 static const struct sysfs_ops queue_sysfs_ops = {
 	.show	= queue_attr_show,
 	.store	= queue_attr_store,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cc43c8e6516c..81f7ddb1587e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -582,8 +582,6 @@ struct request_queue {
 
 	size_t			cmd_size;
 
-	struct work_struct	release_work;
-
 #define BLK_MAX_WRITE_HINTS	5
 	u64			write_hints[BLK_MAX_WRITE_HINTS];
 };
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/5] block: move main block debugfs initialization to its own file
  2020-04-14  4:18 ` [PATCH 1/5] block: move main block debugfs initialization to its own file Luis Chamberlain
@ 2020-04-14  7:35   ` Greg KH
  2020-04-15  2:44   ` Bart Van Assche
  1 sibling, 0 replies; 53+ messages in thread
From: Greg KH @ 2020-04-14  7:35 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On Tue, Apr 14, 2020 at 04:18:58AM +0000, Luis Chamberlain wrote:
> make_request-based drivers and and request-based drivers share some
> some debugfs code. By moving this into its own file it makes it easier
> to expand and audit this shared code.
> 
> This patch contains no functional changes.
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Nicolai Stange <nstange@suse.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: yu kuai <yukuai3@huawei.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-14  4:18 ` [PATCH 2/5] blktrace: fix debugfs use after free Luis Chamberlain
@ 2020-04-14  7:37   ` Greg KH
  2020-04-14 15:38   ` Christoph Hellwig
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 53+ messages in thread
From: Greg KH @ 2020-04-14  7:37 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Tue, Apr 14, 2020 at 04:18:59AM +0000, Luis Chamberlain wrote:
> On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
> merged on v4.12 Omar fixed the original blktrace code for request-based
> drivers (multiqueue). This however left in place a possible crash, if you
> happen to abuse blktrace in a way it was not intended.
> 
> Namely, if you loop adding a device, setup the blktrace with BLKTRACESETUP,
> forget to BLKTRACETEARDOWN, and then just remove the device you end up
> with a panic:
> 
> [  107.193134] debugfs: Directory 'loop0' with parent 'block' already present!
> [  107.254615] BUG: kernel NULL pointer dereference, address: 00000000000000a0
> [  107.258785] #PF: supervisor write access in kernel mode
> [  107.262035] #PF: error_code(0x0002) - not-present page
> [  107.264106] PGD 0 P4D 0
> [  107.264404] Oops: 0002 [#1] SMP NOPTI
> [  107.264803] CPU: 8 PID: 674 Comm: kworker/8:2 Tainted: G            E 5.6.0-rc7-next-20200327 #1
> [  107.265712] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> [  107.266553] Workqueue: events __blk_release_queue
> [  107.267051] RIP: 0010:down_write+0x15/0x40
> [  107.267488] Code: eb ca e8 ee a5 8d ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 55 00 75 0f  65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
> [  107.269300] RSP: 0018:ffff9927c06efda8 EFLAGS: 00010246
> [  107.269841] RAX: 0000000000000000 RBX: ffff8be7e73b0600 RCX: ffffff8100000000
> [  107.270559] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
> [  107.271281] RBP: 00000000000000a0 R08: ffff8be7ebc80fa8 R09: ffff8be7ebc80fa8
> [  107.272001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [  107.272722] R13: ffff8be7efc30400 R14: ffff8be7e0571200 R15: 00000000000000a0
> [  107.273475] FS:  0000000000000000(0000) GS:ffff8be7efc00000(0000) knlGS:0000000000000000
> [  107.274346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  107.274968] CR2: 00000000000000a0 CR3: 000000042abee003 CR4: 0000000000360ee0
> [  107.275710] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  107.276465] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  107.277214] Call Trace:
> [  107.277532]  simple_recursive_removal+0x4e/0x2e0
> [  107.278049]  ? debugfs_remove+0x60/0x60
> [  107.278493]  debugfs_remove+0x40/0x60
> [  107.278922]  blk_trace_free+0xd/0x50
> [  107.279339]  __blk_trace_remove+0x27/0x40
> [  107.279797]  blk_trace_shutdown+0x30/0x40
> [  107.280256]  __blk_release_queue+0xab/0x110
> [  107.280734]  process_one_work+0x1b4/0x380
> [  107.281194]  worker_thread+0x50/0x3c0
> [  107.281622]  kthread+0xf9/0x130
> [  107.281994]  ? process_one_work+0x380/0x380
> [  107.282467]  ? kthread_park+0x90/0x90
> [  107.282895]  ret_from_fork+0x1f/0x40
> [  107.283316] Modules linked in: loop(E) <etc>
> [  107.288562] CR2: 00000000000000a0
> [  107.288957] ---[ end trace b885d243d441bbce ]---
> 
> This splat happens to be very similar to the one reported via
> kernel.org korg#205713, only that korg#205713 was for v4.19.83
> and the above now includes the simple_recursive_removal() introduced
> via commit a3d1e7eb5abe ("simple_recursive_removal(): kernel-side rm
> -rf for ramfs-style filesystems") merged on v5.6.
> 
> korg#205713 then was used to create CVE-2019-19770 and claims that
> the bug is in a use-after-free in the debugfs core code. The
> implications of this being a generic UAF on debugfs would be
> much more severe, as it would imply parent dentries can sometimes
> not be positive, which we hold by design is just not possible.
> 
> Below is the splat explained with a bit more details, explaining
> what is happening in userspace, kernel, and a print of the CPU on,
> which the code runs on:
> 
> load loopback module
> [   13.603371] == blk_mq_debugfs_register(12) start
> [   13.604040] == blk_mq_debugfs_register(12) q->debugfs_dir created
> [   13.604934] == blk_mq_debugfs_register(12) end
> [   13.627382] == blk_mq_debugfs_register(12) start
> [   13.628041] == blk_mq_debugfs_register(12) q->debugfs_dir created
> [   13.629240] == blk_mq_debugfs_register(12) end
> [   13.651667] == blk_mq_debugfs_register(12) start
> [   13.652836] == blk_mq_debugfs_register(12) q->debugfs_dir created
> [   13.655107] == blk_mq_debugfs_register(12) end
> [   13.684917] == blk_mq_debugfs_register(12) start
> [   13.687876] == blk_mq_debugfs_register(12) q->debugfs_dir created
> [   13.691588] == blk_mq_debugfs_register(13) end
> [   13.707320] == blk_mq_debugfs_register(13) start
> [   13.707863] == blk_mq_debugfs_register(13) q->debugfs_dir created
> [   13.708856] == blk_mq_debugfs_register(13) end
> [   13.735623] == blk_mq_debugfs_register(13) start
> [   13.736656] == blk_mq_debugfs_register(13) q->debugfs_dir created
> [   13.738411] == blk_mq_debugfs_register(13) end
> [   13.763326] == blk_mq_debugfs_register(13) start
> [   13.763972] == blk_mq_debugfs_register(13) q->debugfs_dir created
> [   13.765167] == blk_mq_debugfs_register(13) end
> [   13.779510] == blk_mq_debugfs_register(13) start
> [   13.780522] == blk_mq_debugfs_register(13) q->debugfs_dir created
> [   13.782338] == blk_mq_debugfs_register(13) end
> [   13.783521] loop: module loaded
> 
> LOOP_CTL_DEL(loop0) #1
> [   13.803550] = __blk_release_queue(4) start
> [   13.807772] == blk_trace_shutdown(4) start
> [   13.810749] == blk_trace_shutdown(4) end
> [   13.813437] = __blk_release_queue(4) calling blk_mq_debugfs_unregister()
> [   13.817593] ==== blk_mq_debugfs_unregister(4) begin
> [   13.817621] ==== blk_mq_debugfs_unregister(4) debugfs_remove_recursive(q->debugfs_dir)
> [   13.821203] ==== blk_mq_debugfs_unregister(4) end q->debugfs_dir is NULL
> [   13.826166] = __blk_release_queue(4) blk_mq_debugfs_unregister() end
> [   13.832992] = __blk_release_queue(4) end
> 
> LOOP_CTL_ADD(loop0) #1
> [   13.843742] == blk_mq_debugfs_register(7) start
> [   13.845569] == blk_mq_debugfs_register(7) q->debugfs_dir created
> [   13.848628] == blk_mq_debugfs_register(7) end
> 
> BLKTRACE_SETUP(loop0) #1
> [   13.850924] == blk_trace_ioctl(7, BLKTRACESETUP) start
> [   13.852852] === do_blk_trace_setup(7) start
> [   13.854580] === do_blk_trace_setup(7) creating directory
> [   13.856620] === do_blk_trace_setup(7) using what debugfs_lookup() gave
> [   13.860635] === do_blk_trace_setup(7) end with ret: 0
> [   13.862615] == blk_trace_ioctl(7, BLKTRACESETUP) end
> 
> LOOP_CTL_DEL(loop0) #2
> [   13.883304] = __blk_release_queue(7) start
> [   13.885324] == blk_trace_shutdown(7) start
> [   13.887197] == blk_trace_shutdown(7) calling __blk_trace_remove()
> [   13.889807] == __blk_trace_remove(7) start
> [   13.891669] === blk_trace_cleanup(7) start
> [   13.911656] ====== blk_trace_free(7) start
> 
> LOOP_CTL_ADD(loop0) #2
> [   13.912709] == blk_mq_debugfs_register(2) start
> 
> ---> From LOOP_CTL_DEL(loop0) #2
> [   13.915887] ====== blk_trace_free(7) end
> 
> ---> From LOOP_CTL_ADD(loop0) #2
> [   13.918359] debugfs: Directory 'loop0' with parent 'block' already present!
> [   13.926433] == blk_mq_debugfs_register(2) q->debugfs_dir created
> [   13.930373] == blk_mq_debugfs_register(2) end
> 
> BLKTRACE_SETUP(loop0) #2
> [   13.933961] == blk_trace_ioctl(2, BLKTRACESETUP) start
> [   13.936758] === do_blk_trace_setup(2) start
> [   13.938944] === do_blk_trace_setup(2) creating directory
> [   13.941029] === do_blk_trace_setup(2) using what debugfs_lookup() gave
> 
> ---> From LOOP_CTL_DEL(loop0) #2
> [   13.971046] === blk_trace_cleanup(7) end
> [   13.973175] == __blk_trace_remove(7) end
> [   13.975352] == blk_trace_shutdown(7) end
> [   13.977415] = __blk_release_queue(7) calling blk_mq_debugfs_unregister()
> [   13.980645] ==== blk_mq_debugfs_unregister(7) begin
> [   13.980696] ==== blk_mq_debugfs_unregister(7) debugfs_remove_recursive(q->debugfs_dir)
> [   13.983118] ==== blk_mq_debugfs_unregister(7) end q->debugfs_dir is NULL
> [   13.986945] = __blk_release_queue(7) blk_mq_debugfs_unregister() end
> [   13.993155] = __blk_release_queue(7) end
> 
> ---> From BLKTRACE_SETUP(loop0) #2
> [   13.995928] === do_blk_trace_setup(2) end with ret: 0
> [   13.997623] == blk_trace_ioctl(2, BLKTRACESETUP) end
> 
> LOOP_CTL_DEL(loop0) #3
> [   14.035119] = __blk_release_queue(2) start
> [   14.036925] == blk_trace_shutdown(2) start
> [   14.038518] == blk_trace_shutdown(2) calling __blk_trace_remove()
> [   14.040829] == __blk_trace_remove(2) start
> [   14.042413] === blk_trace_cleanup(2) start
> 
> LOOP_CTL_ADD(loop0) #3
> [   14.072522] == blk_mq_debugfs_register(6) start
> 
> ---> From LOOP_CTL_DEL(loop0) #3
> [   14.075151] ====== blk_trace_free(2) start
> 
> ---> From LOOP_CTL_ADD(loop0) #3
> [   14.075882] == blk_mq_debugfs_register(6) q->debugfs_dir created
> 
> ---> From LOOP_CTL_DEL(loop0) #3
> [   14.078624] BUG: kernel NULL pointer dereference, address: 00000000000000a0
> [   14.084332] == blk_mq_debugfs_register(6) end
> [   14.086971] #PF: supervisor write access in kernel mode
> [   14.086974] #PF: error_code(0x0002) - not-present page
> [   14.086977] PGD 0 P4D 0
> [   14.086984] Oops: 0002 [#1] SMP NOPTI
> [   14.086990] CPU: 2 PID: 287 Comm: kworker/2:2 Tainted: G            E 5.6.0-next-20200403+ #54
> [   14.086991] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> [   14.087002] Workqueue: events __blk_release_queue
> [   14.087011] RIP: 0010:down_write+0x15/0x40
> [   14.090300] == blk_trace_ioctl(6, BLKTRACESETUP) start
> [   14.093277] Code: eb ca e8 3e 34 8d ff cc cc cc cc cc cc cc cc cc cc
> cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00
> 00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
> [   14.093280] RSP: 0018:ffffc28a00533da8 EFLAGS: 00010246
> [   14.093284] RAX: 0000000000000000 RBX: ffff9f7a24d07980 RCX: ffffff8100000000
> [   14.093286] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
> [   14.093287] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000019
> [   14.093289] R10: 0000000000000774 R11: 0000000000000000 R12: 0000000000000000
> [   14.093291] R13: ffff9f7a2fab0400 R14: ffff9f7a21dd1140 R15: 00000000000000a0
> [   14.093294] FS:  0000000000000000(0000) GS:ffff9f7a2fa80000(0000) knlGS:0000000000000000
> [   14.093296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   14.093298] CR2: 00000000000000a0 CR3: 00000004293d2003 CR4: 0000000000360ee0
> [   14.093307] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   14.093308] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   14.093310] Call Trace:
> [   14.093324]  simple_recursive_removal+0x4e/0x2e0
> [   14.093330]  ? debugfs_remove+0x60/0x60
> [   14.093334]  debugfs_remove+0x40/0x60
> [   14.093339]  blk_trace_free+0x20/0x70
> [   14.093346]  __blk_trace_remove+0x54/0x90
> [   14.096704] === do_blk_trace_setup(6) start
> [   14.098534]  blk_trace_shutdown+0x74/0x80
> [   14.100958] === do_blk_trace_setup(6) creating directory
> [   14.104575]  __blk_release_queue+0xbe/0x160
> [   14.104580]  process_one_work+0x1b4/0x380
> [   14.104585]  worker_thread+0x50/0x3c0
> [   14.104589]  kthread+0xf9/0x130
> [   14.104593]  ? process_one_work+0x380/0x380
> [   14.104596]  ? kthread_park+0x90/0x90
> [   14.104599]  ret_from_fork+0x1f/0x40
> [   14.104603] Modules linked in: loop(E) xfs(E) libcrc32c(E)
> crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) joydev(E)
> serio_raw(E) aesni_intel(E) glue_helper(E) virtio_balloon(E) evdev(E)
> crypto_simd(E) pcspkr(E) cryptd(E) i6300esb(E) button(E) ip_tables(E)
> x_tables(E) autofs4(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E)
> jbd2(E) virtio_net(E) net_failover(E) failover(E) virtio_blk(E)
> ata_generic(E) uhci_hcd(E) ata_piix(E) ehci_hcd(E) nvme(E) libata(E)
> crc32c_intel(E) usbcore(E) psmouse(E) nvme_core(E) virtio_pci(E)
> scsi_mod(E) virtio_ring(E) t10_pi(E) virtio(E) i2c_piix4(E) floppy(E)
> [   14.107400] === do_blk_trace_setup(6) using what debugfs_lookup() gave
> [   14.108939] CR2: 00000000000000a0
> [   14.110589] === do_blk_trace_setup(6) end with ret: 0
> [   14.111592] ---[ end trace 7a783b33b9614db9 ]---
> 
> The root cause to this issue is that debugfs_lookup() can find a
> previous incarnation's dir of the same name which is about to get
> removed from a not yet schedule work.
> 
> We can fix the UAF by simply using a debugfs directory which moving
> forward will always be accessible if debugfs is enabled, this way,
> its allocated and avaialble always for both request-based block
> drivers or make_request drivers (multiqueue) block drivers.
> 
> This simplifies the code considerably, with the only penalty now being
> that we're always creating the request queue debugfs directory for the
> request-based block device drivers.
> 
> The UAF then is not a core debugfs issue, but instead a misuse of
> debugfs, and this issue can only be triggered if you are root, and
> misuse blktrace.
> 
> This issue can be reproduced with break-blktrace [2] using:
> 
>   break-blktrace -c 10 -d -s
> 
> This patch fixes this issue. Note that there is also another
> respective UAF but from the ioctl path [3], this should also fix
> that issue.
> 
> This patch then also disputes the severity of CVE-2019-19770 as
> this issue is only possible by being root and using blktrace.
> 
> It is not a core debugfs issue.
> 
> [0] https://bugzilla.kernel.org/show_bug.cgi?id=205713
> [1] https://nvd.nist.gov/vuln/detail/CVE-2019-19770
> [2] https://github.com/mcgrof/break-blktrace
> [3] https://lore.kernel.org/lkml/000000000000ec635b059f752700@google.com/
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Nicolai Stange <nstange@suse.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: yu kuai <yukuai3@huawei.com>
> Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com
> Fixes: 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 0/5] blktrace: fix use after free
  2020-04-14  4:18 [PATCH 0/5] blktrace: fix use after free Luis Chamberlain
                   ` (4 preceding siblings ...)
  2020-04-14  4:19 ` [PATCH 5/5] block: revert back to synchronous request_queue removal Luis Chamberlain
@ 2020-04-14  7:38 ` Greg KH
  5 siblings, 0 replies; 53+ messages in thread
From: Greg KH @ 2020-04-14  7:38 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Tue, Apr 14, 2020 at 04:18:57AM +0000, Luis Chamberlain wrote:
> After two iterations of RFCs I think this is ready now. I've taken
> the feedback from the last series, both on code and commit log.
> I've also extended the commit log on the last patch to also explain
> how the original shift to async request_queue removal turned out
> to actually be a userspace regression and added a respective fixes
> tag for it.
> 
> You can find these patches on my 20200414-dev-blkqueue-defer-removal-patch-v1
> branch based on linux-next tag next-20200414 on kernel.org [0].
> 
> Further review and rants are appreciated.
> 
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20200414-dev-blkqueue-defer-removal
> 
> Luis Chamberlain (5):
>   block: move main block debugfs initialization to its own file
>   blktrace: fix debugfs use after free
>   blktrace: refcount the request_queue during ioctl
>   mm/swapfile: refcount block and queue before using
>     blkcg_schedule_throttle()
>   block: revert back to synchronous request_queue removal

Looks good from a debugfs point of view, thanks for doing this cleanup.

greg k-h


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-14  4:18 ` [PATCH 2/5] blktrace: fix debugfs use after free Luis Chamberlain
  2020-04-14  7:37   ` Greg KH
@ 2020-04-14 15:38   ` Christoph Hellwig
  2020-04-15  2:46   ` Bart Van Assche
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2020-04-14 15:38 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-14  4:19 ` [PATCH 3/5] blktrace: refcount the request_queue during ioctl Luis Chamberlain
@ 2020-04-14 15:40   ` Christoph Hellwig
  2020-04-15  6:16     ` Luis Chamberlain
  2020-04-16  2:31   ` Ming Lei
  1 sibling, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2020-04-14 15:40 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

On Tue, Apr 14, 2020 at 04:19:00AM +0000, Luis Chamberlain wrote:
> Ensure that the request_queue is refcounted during its full
> ioctl cycle. This avoids possible races against removal, given
> blk_get_queue() also checks to ensure the queue is not dying.
> 
> This small race is possible if you defer removal of the request_queue
> and userspace fires off an ioctl for the device in the meantime.

Hmm, where exactly does the race come in so that it can only happen
after where you take the reference, but not before it?  I'm probably
missing something, but that just means it needs to be explained a little
better :)


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle()
  2020-04-14  4:19 ` [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle() Luis Chamberlain
@ 2020-04-14 15:44   ` Christoph Hellwig
  2020-04-15  5:42     ` Luis Chamberlain
  2020-04-16  6:22   ` Ming Lei
  1 sibling, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2020-04-14 15:44 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

On Tue, Apr 14, 2020 at 04:19:01AM +0000, Luis Chamberlain wrote:
> block devices are refcounted so to ensure once its final user goes away it
> can be cleaned up by the lower layers properly. The block device's
> request_queue structure is also refcounted, however, if the last
> blk_put_queue() is called under atomic context the block layer has
> to defer removal.
> 
> By refcounting the block device during the use of blkcg_schedule_throttle(),
> we ensure ensure two things:
> 
> 1) the block device remains available during the call
> 2) we ensure avoid having to deal with the fact we're using the
>    request_queue structure in atomic context, since the last
>    blk_put_queue() will be called upon disk_release(), *after*
>    our own bdput().
> 
> This means this code path is *not* going to remove the request_queue
> structure, as we are ensuring some later upper layer disk_release()
> will be the one to release the request_queue structure for us.
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Nicolai Stange <nstange@suse.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: yu kuai <yukuai3@huawei.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  mm/swapfile.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6659ab563448..9285ff6030ca 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3753,6 +3753,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
>  void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
>  				  gfp_t gfp_mask)
>  {
> +	struct block_device *bdev;
>  	struct swap_info_struct *si, *next;
>  	if (!(gfp_mask & __GFP_IO) || !memcg)
>  		return;
> @@ -3771,8 +3772,17 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
>  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
>  				  avail_lists[node]) {
>  		if (si->bdev) {
> -			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
> -						true);
> +			bdev = bdgrab(si->bdev);
> +			if (!bdev)
> +				continue;
> +			/*
> +			 * By adding our own bdgrab() we ensure the queue
> +			 * sticks around until disk_release(), and so we ensure
> +			 * our release of the request_queue does not happen in
> +			 * atomic context.
> +			 */
> +			blkcg_schedule_throttle(bdev_get_queue(bdev), true);
> +			bdput(bdev);

I don't understand the atomic part of the comment.  How does
bdgrab/bdput help us there?


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/5] block: revert back to synchronous request_queue removal
  2020-04-14  4:19 ` [PATCH 5/5] block: revert back to synchronous request_queue removal Luis Chamberlain
@ 2020-04-14 15:47   ` Christoph Hellwig
  2020-04-14 20:58     ` Luis Chamberlain
  2020-04-16  2:36   ` Ming Lei
  1 sibling, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2020-04-14 15:47 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

On Tue, Apr 14, 2020 at 04:19:02AM +0000, Luis Chamberlain wrote:
> Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
> v4.12 moved the work behind blk_release_queue() into a workqueue after a
> splat floated around which indicated some work on blk_release_queue()
> could sleep in blk_exit_rl(). This splat would be possible when a driver
> called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
> as its final call) from an atomic context.
> 
> blk_put_queue() decrements the refcount for the request_queue
> kobject, and upon reaching 0 blk_release_queue() is called. Although
> blk_exit_rl() is now removed through commit db6d9952356 ("block: remove
> request_list code"), we reserve the right to be able to sleep within
> blk_release_queue() context. If you see no other way and *have* be
> in atomic context when you driver calls the last blk_put_queue()
> you can always just increase your block device's reference count with
> bdgrab() as this can be done in atomic context and the request_queue
> removal would be left to upper layers later. We document this bit of
> tribal knowledge as well now, and adjust kdoc format a bit.
> 
> We revert back to synchronous request_queue removal because asynchronous
> removal creates a regression with expected userspace interaction with
> several drivers. An example is when removing the loopback driver and
> issues ioctl from userspace to do so, upon return and if successful one
> expects the device to be removed. Moving to asynchronous request_queue
> removal could have broken many scripts which relied on the removal to
> have been completed if there was no error.
> 
> Using asynchronous request_queue removal however has helped us find
> other bugs, in the future we can test what could break with this
> arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Nicolai Stange <nstange@suse.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: yu kuai <yukuai3@huawei.com>
> Suggested-by: Nicolai Stange <nstange@suse.de>
> Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression")
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  block/blk-core.c       | 19 ++++++++++++++++++-
>  block/blk-sysfs.c      | 38 +++++++++++++++++---------------------
>  include/linux/blkdev.h |  2 --
>  3 files changed, 35 insertions(+), 24 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 5aaae7a1b338..8346c7c59ee6 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -301,6 +301,17 @@ void blk_clear_pm_only(struct request_queue *q)
>  }
>  EXPORT_SYMBOL_GPL(blk_clear_pm_only);
>  
> +/**
> + * blk_put_queue - decrement the request_queue refcount
> + *
> + * Decrements the refcount to the request_queue kobject, when this reaches
> + * 0 we'll have blk_release_queue() called. You should avoid calling
> + * this function in atomic context but if you really have to ensure you
> + * first refcount the block device with bdgrab() / bdput() so that the
> + * last decrement happens in blk_cleanup_queue().
> + *
> + * @q: the request_queue structure to decrement the refcount for
> + */
>  void blk_put_queue(struct request_queue *q)
>  {
>  	kobject_put(&q->kobj);
> @@ -328,10 +339,16 @@ EXPORT_SYMBOL_GPL(blk_set_queue_dying);
>  
>  /**
>   * blk_cleanup_queue - shutdown a request queue
> - * @q: request queue to shutdown
>   *
>   * Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
>   * put it.  All future requests will be failed immediately with -ENODEV.
> + *
> + * You should not call this function in atomic context. If you need to
> + * refcount a request_queue in atomic context, instead refcount the
> + * block device with bdgrab() / bdput().

I think this needs a WARN_ON thrown in to enforece the calling context.

> + *
> + * @q: request queue to shutdown

Moving the argument documentation seems against the usual kerneldoc
style.

Otherwise this look good, I hope it sticks :)


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/5] block: revert back to synchronous request_queue removal
  2020-04-14 15:47   ` Christoph Hellwig
@ 2020-04-14 20:58     ` Luis Chamberlain
  2020-04-15  6:46       ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-14 20:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

On Tue, Apr 14, 2020 at 08:47:25AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 14, 2020 at 04:19:02AM +0000, Luis Chamberlain wrote:
> > Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
> > v4.12 moved the work behind blk_release_queue() into a workqueue after a
> > splat floated around which indicated some work on blk_release_queue()
> > could sleep in blk_exit_rl(). This splat would be possible when a driver
> > called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
> > as its final call) from an atomic context.
> > 
> > blk_put_queue() decrements the refcount for the request_queue
> > kobject, and upon reaching 0 blk_release_queue() is called. Although
> > blk_exit_rl() is now removed through commit db6d9952356 ("block: remove
> > request_list code"), we reserve the right to be able to sleep within
> > blk_release_queue() context. If you see no other way and *have* be
> > in atomic context when you driver calls the last blk_put_queue()
> > you can always just increase your block device's reference count with
> > bdgrab() as this can be done in atomic context and the request_queue
> > removal would be left to upper layers later. We document this bit of
> > tribal knowledge as well now, and adjust kdoc format a bit.
> > 
> > We revert back to synchronous request_queue removal because asynchronous
> > removal creates a regression with expected userspace interaction with
> > several drivers. An example is when removing the loopback driver and
> > issues ioctl from userspace to do so, upon return and if successful one
> > expects the device to be removed. Moving to asynchronous request_queue
> > removal could have broken many scripts which relied on the removal to
> > have been completed if there was no error.
> > 
> > Using asynchronous request_queue removal however has helped us find
> > other bugs, in the future we can test what could break with this
> > arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.
> > 
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Omar Sandoval <osandov@fb.com>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Nicolai Stange <nstange@suse.de>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: yu kuai <yukuai3@huawei.com>
> > Suggested-by: Nicolai Stange <nstange@suse.de>
> > Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression")
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >  block/blk-core.c       | 19 ++++++++++++++++++-
> >  block/blk-sysfs.c      | 38 +++++++++++++++++---------------------
> >  include/linux/blkdev.h |  2 --
> >  3 files changed, 35 insertions(+), 24 deletions(-)
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 5aaae7a1b338..8346c7c59ee6 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -301,6 +301,17 @@ void blk_clear_pm_only(struct request_queue *q)
> >  }
> >  EXPORT_SYMBOL_GPL(blk_clear_pm_only);
> >  
> > +/**
> > + * blk_put_queue - decrement the request_queue refcount
> > + *
> > + * Decrements the refcount to the request_queue kobject, when this reaches
> > + * 0 we'll have blk_release_queue() called. You should avoid calling
> > + * this function in atomic context but if you really have to ensure you
> > + * first refcount the block device with bdgrab() / bdput() so that the
> > + * last decrement happens in blk_cleanup_queue().
> > + *
> > + * @q: the request_queue structure to decrement the refcount for
> > + */
> >  void blk_put_queue(struct request_queue *q)
> >  {
> >  	kobject_put(&q->kobj);
> > @@ -328,10 +339,16 @@ EXPORT_SYMBOL_GPL(blk_set_queue_dying);
> >  
> >  /**
> >   * blk_cleanup_queue - shutdown a request queue
> > - * @q: request queue to shutdown
> >   *
> >   * Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
> >   * put it.  All future requests will be failed immediately with -ENODEV.
> > + *
> > + * You should not call this function in atomic context. If you need to
> > + * refcount a request_queue in atomic context, instead refcount the
> > + * block device with bdgrab() / bdput().
> 
> I think this needs a WARN_ON thrown in to enforece the calling context.

I considered adding a might_sleep() but upon review with Bart, he noted
that this function already has a mutex_lock(), and if you look under the
hood of mutex_lock(), it has a might_sleep() at the very top. The
warning then is implicit.

> > + *
> > + * @q: request queue to shutdown
> 
> Moving the argument documentation seems against the usual kerneldoc
> style.

Would you look at that, Documentation/doc-guide/kernel-doc.rst does
say to keep the argument at the top as it was in place before, OK will
revert that. Sorry, I used include/net/mac80211.h as my base for style.

> Otherwise this look good, I hope it sticks :)

I hope that the kdocs / might_sleep() sprinkled should make it stick now.
But hey, this uncovered wonderful obscure bugs, it was fun. I'll add a
selftest also later to ensure we don't regress on some of this later
once again.

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/5] block: move main block debugfs initialization to its own file
  2020-04-14  4:18 ` [PATCH 1/5] block: move main block debugfs initialization to its own file Luis Chamberlain
  2020-04-14  7:35   ` Greg KH
@ 2020-04-15  2:44   ` Bart Van Assche
  1 sibling, 0 replies; 53+ messages in thread
From: Bart Van Assche @ 2020-04-15  2:44 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On 2020-04-13 21:18, Luis Chamberlain wrote:
> make_request-based drivers and and request-based drivers share some
> some debugfs code. By moving this into its own file it makes it easier
> to expand and audit this shared code.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-14  4:18 ` [PATCH 2/5] blktrace: fix debugfs use after free Luis Chamberlain
  2020-04-14  7:37   ` Greg KH
  2020-04-14 15:38   ` Christoph Hellwig
@ 2020-04-15  2:46   ` Bart Van Assche
  2020-04-15 17:38   ` Eric Sandeen
  2020-04-16  2:10   ` Ming Lei
  4 siblings, 0 replies; 53+ messages in thread
From: Bart Van Assche @ 2020-04-15  2:46 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On 2020-04-13 21:18, Luis Chamberlain wrote:
> On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
> merged on v4.12 Omar fixed the original blktrace code for request-based
> drivers (multiqueue). This however left in place a possible crash, if you
> happen to abuse blktrace in a way it was not intended.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle()
  2020-04-14 15:44   ` Christoph Hellwig
@ 2020-04-15  5:42     ` Luis Chamberlain
  2020-04-15  7:27       ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-15  5:42 UTC (permalink / raw)
  To: Christoph Hellwig, Alan Jenkins
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

On Tue, Apr 14, 2020 at 08:44:47AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 14, 2020 at 04:19:01AM +0000, Luis Chamberlain wrote:
> > block devices are refcounted so to ensure once its final user goes away it
> > can be cleaned up by the lower layers properly. The block device's
> > request_queue structure is also refcounted, however, if the last
> > blk_put_queue() is called under atomic context the block layer has
> > to defer removal.
> > 
> > By refcounting the block device during the use of blkcg_schedule_throttle(),
> > we ensure ensure two things:
> > 
> > 1) the block device remains available during the call
> > 2) we ensure avoid having to deal with the fact we're using the
> >    request_queue structure in atomic context, since the last
> >    blk_put_queue() will be called upon disk_release(), *after*
> >    our own bdput().
> > 
> > This means this code path is *not* going to remove the request_queue
> > structure, as we are ensuring some later upper layer disk_release()
> > will be the one to release the request_queue structure for us.
> > 
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Omar Sandoval <osandov@fb.com>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Nicolai Stange <nstange@suse.de>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: yu kuai <yukuai3@huawei.com>
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >  mm/swapfile.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 6659ab563448..9285ff6030ca 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -3753,6 +3753,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
> >  void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
> >  				  gfp_t gfp_mask)
> >  {
> > +	struct block_device *bdev;
> >  	struct swap_info_struct *si, *next;
> >  	if (!(gfp_mask & __GFP_IO) || !memcg)
> >  		return;
> > @@ -3771,8 +3772,17 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
> >  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
> >  				  avail_lists[node]) {
> >  		if (si->bdev) {
> > -			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
> > -						true);
> > +			bdev = bdgrab(si->bdev);
> > +			if (!bdev)
> > +				continue;
> > +			/*
> > +			 * By adding our own bdgrab() we ensure the queue
> > +			 * sticks around until disk_release(), and so we ensure
> > +			 * our release of the request_queue does not happen in
> > +			 * atomic context.
> > +			 */
> > +			blkcg_schedule_throttle(bdev_get_queue(bdev), true);
> > +			bdput(bdev);
> 
> I don't understand the atomic part of the comment.  How does
> bdgrab/bdput help us there?

The commit log above did a better job at explaining this in terms of our
goal to use the request_queue and how this use would prevent the risk of
releasing the request_queue, which could sleep.

If its not clear still, at least why we'd escape the sleep potential
of the request_queue, we can just see its up to the disk_release()
to call the last blk_put_queue():

static void __device_add_disk(struct device *parent, struct gendisk disk,      
			      const struct attribute_group **groups,            
			      bool register_queue)                              
{   
	...
        /*                                                                      
	 * Take an extra ref on queue which will be put on disk_release()
	 * so that it sticks around as long as @disk is there.                  
	 */                                                                     
	WARN_ON_ONCE(!blk_get_queue(disk->queue));

	disk_add_events(disk);
	blk_integrity_add(disk);
}

static void disk_release(struct device *dev)                                    
{                                                                               
	struct gendisk *disk = dev_to_disk(dev);

	blk_free_devt(dev->devt);
	disk_release_events(disk);
	kfree(disk->random);
	disk_replace_part_tbl(disk, NULL);
	hd_free_part(&disk->part0);
	if (disk->queue)
		blk_put_queue(disk->queue);
	kfree(disk);
}     

I admit that all this however it did a poor job at explaining why
bdgrab()/bdput() was safe in atomic context other than the implicit
reasoning that we already do that elsewhere in atomic context.

bdgrab() specifically was added to be able to refcount a block device in
atomic context via commit dddac6a7b445 ("("PM / Hibernate: Replace bdget
call with simple atomic_inc of i_count"). In its latest incarnation we
have:

/**
 * bdgrab -- Grab a reference to an already referenced block device
 * @bdev:       Block device to grab a reference to.                            
 */
struct block_device *bdgrab(struct block_device *bdev)
{
	ihold(bdev->bd_inode);                                                  
	return bdev;                                                            
}                                                                               
EXPORT_SYMBOL(bdgrab); 

And this in turn:

/*                                                                              
 * get additional reference to inode; caller must already hold one.
 */
void ihold(struct inode *inode)                                                 
{                                                                               
	WARN_ON(atomic_inc_return(&inode->i_count) < 2);                        
}                                                                               
EXPORT_SYMBOL(ihold);

However... I'd eventure to say we don't have tribal knowledge documented
about why bdput() is safe in atomic context when used with bdgrab(),
including the commit log which added it. So the only thing backing its
safety is that we already use this combo in atomic context, and if its
incorrect, other areas would be incorrect as well.

But looking underneath the hood, I see at the end of __blkdev_get():

static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)  
{
	...
	disk = bdev_get_gendisk(bdev, &partno);
	...
        /* only one opener holds refs to the module and disk */                 
	if (!first_open)                                                        
		put_disk_and_module(disk); 
	...
}

So ihold() seems like a way to ensure the caller of bdgrab() won't be
this first opener. If only one opener holds the ref to the disk and it
was not us, we musn't be the one to decrease it, and if the disk is
held, it should mean the block device should be refcounted by it as
well. More review on this later part is appreciated though.

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-14 15:40   ` Christoph Hellwig
@ 2020-04-15  6:16     ` Luis Chamberlain
  2020-04-15  7:14       ` Christoph Hellwig
  2020-04-15 14:45       ` Bart Van Assche
  0 siblings, 2 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-15  6:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

On Tue, Apr 14, 2020 at 08:40:44AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 14, 2020 at 04:19:00AM +0000, Luis Chamberlain wrote:
> > Ensure that the request_queue is refcounted during its full
> > ioctl cycle. This avoids possible races against removal, given
> > blk_get_queue() also checks to ensure the queue is not dying.
> > 
> > This small race is possible if you defer removal of the request_queue
> > and userspace fires off an ioctl for the device in the meantime.
> 
> Hmm, where exactly does the race come in so that it can only happen
> after where you take the reference, but not before it?  I'm probably
> missing something, but that just means it needs to be explained a little
> better :)

From the trace on patch 2/5:

    BLKTRACE_SETUP(loop0) #2
    [   13.933961] == blk_trace_ioctl(2, BLKTRACESETUP) start
    [   13.936758] === do_blk_trace_setup(2) start
    [   13.938944] === do_blk_trace_setup(2) creating directory
    [   13.941029] === do_blk_trace_setup(2) using what debugfs_lookup() gave
    
    ---> From LOOP_CTL_DEL(loop0) #2
    [   13.971046] === blk_trace_cleanup(7) end
    [   13.973175] == __blk_trace_remove(7) end
    [   13.975352] == blk_trace_shutdown(7) end
    [   13.977415] = __blk_release_queue(7) calling blk_mq_debugfs_unregister()
    [   13.980645] ==== blk_mq_debugfs_unregister(7) begin
    [   13.980696] ==== blk_mq_debugfs_unregister(7) debugfs_remove_recursive(q->debugfs_dir)
    [   13.983118] ==== blk_mq_debugfs_unregister(7) end q->debugfs_dir is NULL
    [   13.986945] = __blk_release_queue(7) blk_mq_debugfs_unregister() end
    [   13.993155] = __blk_release_queue(7) end
    
    ---> From BLKTRACE_SETUP(loop0) #2
    [   13.995928] === do_blk_trace_setup(2) end with ret: 0
    [   13.997623] == blk_trace_ioctl(2, BLKTRACESETUP) end

The BLKTRACESETUP above works on request_queue which later
LOOP_CTL_DEL races on and sweeps the debugfs dir underneath us.
If you use this commit alone though, this doesn't fix the race issue
however, and that's because of both still the debugfs_lookup() use
and that we're still using asynchronous removal at this point.

refcounting will just ensure we don't take the request_queue underneath
our noses.

Should I just add this to the commit log?

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/5] block: revert back to synchronous request_queue removal
  2020-04-14 20:58     ` Luis Chamberlain
@ 2020-04-15  6:46       ` Christoph Hellwig
  2020-04-15 13:20         ` Luis Chamberlain
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2020-04-15  6:46 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, axboe, viro, bvanassche, gregkh, rostedt,
	mingo, jack, ming.lei, nstange, akpm, mhocko, yukuai3,
	linux-block, linux-fsdevel, linux-mm, linux-kernel,
	Omar Sandoval, Hannes Reinecke, Michal Hocko

On Tue, Apr 14, 2020 at 08:58:52PM +0000, Luis Chamberlain wrote:
> > I think this needs a WARN_ON thrown in to enforece the calling context.
> 
> I considered adding a might_sleep() but upon review with Bart, he noted
> that this function already has a mutex_lock(), and if you look under the
> hood of mutex_lock(), it has a might_sleep() at the very top. The
> warning then is implicit.

It might just be a personal preference, but I think the documentation
value of a WARN_ON_ONCE or might_sleep with a comment at the top of
the function is much higher than a blurb in a long kerneldoc text and
a later mutex_lock.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-15  6:16     ` Luis Chamberlain
@ 2020-04-15  7:14       ` Christoph Hellwig
  2020-04-15 12:34         ` Luis Chamberlain
  2020-04-15 14:45       ` Bart Van Assche
  1 sibling, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2020-04-15  7:14 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, axboe, viro, bvanassche, gregkh, rostedt,
	mingo, jack, ming.lei, nstange, akpm, mhocko, yukuai3,
	linux-block, linux-fsdevel, linux-mm, linux-kernel,
	Omar Sandoval, Hannes Reinecke, Michal Hocko

On Wed, Apr 15, 2020 at 06:16:49AM +0000, Luis Chamberlain wrote:
> The BLKTRACESETUP above works on request_queue which later
> LOOP_CTL_DEL races on and sweeps the debugfs dir underneath us.
> If you use this commit alone though, this doesn't fix the race issue
> however, and that's because of both still the debugfs_lookup() use
> and that we're still using asynchronous removal at this point.
> 
> refcounting will just ensure we don't take the request_queue underneath
> our noses.
> 
> Should I just add this to the commit log?

That sounds much more useful than the trace.

Btw, Isn't blk_get_queue racy as well?  Shouldn't we check
blk_queue_dying after getting the reference and undo it if the queue is
indeeed dying?


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle()
  2020-04-15  5:42     ` Luis Chamberlain
@ 2020-04-15  7:27       ` Christoph Hellwig
  2020-04-15  7:34         ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2020-04-15  7:27 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Alan Jenkins, axboe, viro, bvanassche, gregkh,
	rostedt, mingo, jack, ming.lei, nstange, akpm, mhocko, yukuai3,
	linux-block, linux-fsdevel, linux-mm, linux-kernel,
	Omar Sandoval, Hannes Reinecke, Michal Hocko

On Wed, Apr 15, 2020 at 05:42:34AM +0000, Luis Chamberlain wrote:
> > I don't understand the atomic part of the comment.  How does
> > bdgrab/bdput help us there?
> 
> The commit log above did a better job at explaining this in terms of our
> goal to use the request_queue and how this use would prevent the risk of
> releasing the request_queue, which could sleep.

So bdput eventually does and iput, but what leads to an out of context
offload?

But anyway, isn't the original problem better solved by simply not
releasing the queue from atomic context to start with?  There isn't
really any good reason we keep holding the spinlock once we have a
reference on the queue, so something like this (not even compile tested)
should do the work:

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c5dc833212e1..45faa851f789 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1673,6 +1673,17 @@ void blkcg_maybe_throttle_current(void)
 	blk_put_queue(q);
 }
 
+/* consumes a reference on q */
+void __blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay)
+{
+	if (current->throttle_queue)
+		blk_put_queue(current->throttle_queue);
+	current->throttle_queue = q;
+	if (use_memdelay)
+		current->use_memdelay = use_memdelay;
+	set_notify_resume(current);
+}
+
 /**
  * blkcg_schedule_throttle - this task needs to check for throttling
  * @q: the request queue IO was submitted on
@@ -1694,16 +1705,8 @@ void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay)
 {
 	if (unlikely(current->flags & PF_KTHREAD))
 		return;
-
-	if (!blk_get_queue(q))
-		return;
-
-	if (current->throttle_queue)
-		blk_put_queue(current->throttle_queue);
-	current->throttle_queue = q;
-	if (use_memdelay)
-		current->use_memdelay = use_memdelay;
-	set_notify_resume(current);
+	if (blk_get_queue(q))
+		__blkcg_schedule_throttle(q, use_memdelay);
 }
 
 /**
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 35f8ffe92b70..68440cb3ea9e 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -679,6 +679,7 @@ static inline void blkcg_clear_delay(struct blkcg_gq *blkg)
 
 void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta);
 void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay);
+void __blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay);
 void blkcg_maybe_throttle_current(void);
 #else	/* CONFIG_BLK_CGROUP */
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5871a2aa86a5..4c6aa59ee593 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3749,9 +3749,10 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
 				  gfp_t gfp_mask)
 {
 	struct swap_info_struct *si, *next;
+	struct request_queue *q = NULL;
+
 	if (!(gfp_mask & __GFP_IO) || !memcg)
 		return;
-
 	if (!blk_cgroup_congested())
 		return;
 
@@ -3761,17 +3762,21 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
 	 */
 	if (current->throttle_queue)
 		return;
+	if (unlikely(current->flags & PF_KTHREAD))
+		return;
 
 	spin_lock(&swap_avail_lock);
 	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
 				  avail_lists[node]) {
 		if (si->bdev) {
-			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
-						true);
+			if (blk_get_queue(dev_get_queue(si->bdev)))
+				q = dev_get_queue(si->bdev);
 			break;
 		}
 	}
 	spin_unlock(&swap_avail_lock);
+	if (q)
+		__blkcg_schedule_throttle(q, true);
 }
 #endif
 


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle()
  2020-04-15  7:27       ` Christoph Hellwig
@ 2020-04-15  7:34         ` Christoph Hellwig
  2020-04-15 13:19           ` Luis Chamberlain
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2020-04-15  7:34 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Alan Jenkins, axboe, viro, bvanassche, gregkh,
	rostedt, mingo, jack, ming.lei, nstange, akpm, mhocko, yukuai3,
	linux-block, linux-fsdevel, linux-mm, linux-kernel,
	Omar Sandoval, Hannes Reinecke, Michal Hocko

On Wed, Apr 15, 2020 at 12:27:12AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 15, 2020 at 05:42:34AM +0000, Luis Chamberlain wrote:
> > > I don't understand the atomic part of the comment.  How does
> > > bdgrab/bdput help us there?
> > 
> > The commit log above did a better job at explaining this in terms of our
> > goal to use the request_queue and how this use would prevent the risk of
> > releasing the request_queue, which could sleep.
> 
> So bdput eventually does and iput, but what leads to an out of context
> offload?
> 
> But anyway, isn't the original problem better solved by simply not
> releasing the queue from atomic context to start with?  There isn't
> really any good reason we keep holding the spinlock once we have a
> reference on the queue, so something like this (not even compile tested)
> should do the work:

Actually - mem_cgroup_throttle_swaprate already checks for a non-NULL
current->throttle_queue above, so we should never even call
blk_put_queue here.  Was this found by code inspection, or was there
a real report?

In the latter case we need to figure out what protects >throttle_queue,
as the way blkcg_schedule_throttle first put the reference and only then
assign a value to it already looks like it introduces a tiny race
window.

Otherwise just open coding the applicable part ofblkcg_schedule_throttle
in mem_cgroup_throttle_swaprate seems easiest:

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5871a2aa86a5..e16051ef074c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3761,15 +3761,20 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
 	 */
 	if (current->throttle_queue)
 		return;
+	if (unlikely(current->flags & PF_KTHREAD))
+		return;
 
 	spin_lock(&swap_avail_lock);
 	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
 				  avail_lists[node]) {
-		if (si->bdev) {
-			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
-						true);
-			break;
+		if (!si->bdev)
+			continue;
+		if (blk_get_queue(dev_get_queue(si->bdev))) {
+			current->throttle_queue = dev_get_queue(si->bdev);
+			current->use_memdelay = true;
+			set_notify_resume(current);
 		}
+		break;
 	}
 	spin_unlock(&swap_avail_lock);
 }


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-15  7:14       ` Christoph Hellwig
@ 2020-04-15 12:34         ` Luis Chamberlain
  2020-04-15 12:39           ` Christoph Hellwig
  2020-04-15 14:18           ` Bart Van Assche
  0 siblings, 2 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-15 12:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

On Wed, Apr 15, 2020 at 12:14:25AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 15, 2020 at 06:16:49AM +0000, Luis Chamberlain wrote:
> > The BLKTRACESETUP above works on request_queue which later
> > LOOP_CTL_DEL races on and sweeps the debugfs dir underneath us.
> > If you use this commit alone though, this doesn't fix the race issue
> > however, and that's because of both still the debugfs_lookup() use
> > and that we're still using asynchronous removal at this point.
> > 
> > refcounting will just ensure we don't take the request_queue underneath
> > our noses.
> > 
> > Should I just add this to the commit log?
> 
> That sounds much more useful than the trace.
> 
> Btw, Isn't blk_get_queue racy as well?  Shouldn't we check
> blk_queue_dying after getting the reference and undo it if the queue is
> indeeed dying?

Yes that race should be possible:

bool blk_get_queue(struct request_queue *q)                                     
{                                                                               
	if (likely(!blk_queue_dying(q))) {
       ----------> we can get the queue to go dying here <---------
		__blk_get_queue(q);
		return true;
	}                                                                       

	return false;
}                                                                               
EXPORT_SYMBOL(blk_get_queue);

I'll pile up a fix. I've also considered doing a full review of callers
outside of the core block layer using it, and maybe just unexporting
this. It was originally exported due to commit d86e0e83b ("block: export
blk_{get,put}_queue()") to fix a scsi bug, but I can't find such
respective fix. I suspec that using bdgrab()/bdput() seems more likely
what drivers should be using. That would allow us to keep this
functionality internal.

Think that's worthy review?

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-15 12:34         ` Luis Chamberlain
@ 2020-04-15 12:39           ` Christoph Hellwig
  2020-04-15 13:25             ` Luis Chamberlain
  2020-04-15 14:18           ` Bart Van Assche
  1 sibling, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2020-04-15 12:39 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, axboe, viro, bvanassche, gregkh, rostedt,
	mingo, jack, ming.lei, nstange, akpm, mhocko, yukuai3,
	linux-block, linux-fsdevel, linux-mm, linux-kernel,
	Omar Sandoval, Hannes Reinecke, Michal Hocko

On Wed, Apr 15, 2020 at 12:34:34PM +0000, Luis Chamberlain wrote:
> I'll pile up a fix. I've also considered doing a full review of callers
> outside of the core block layer using it, and maybe just unexporting
> this. It was originally exported due to commit d86e0e83b ("block: export
> blk_{get,put}_queue()") to fix a scsi bug, but I can't find such
> respective fix. I suspec that using bdgrab()/bdput() seems more likely
> what drivers should be using. That would allow us to keep this
> functionality internal.
> 
> Think that's worthy review?

Probably.  I did in fact very quickly look into that but then gave
up due to the fair amount of modular users.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle()
  2020-04-15  7:34         ` Christoph Hellwig
@ 2020-04-15 13:19           ` Luis Chamberlain
  2020-04-16  6:10             ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-15 13:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alan Jenkins, axboe, viro, bvanassche, gregkh, rostedt, mingo,
	jack, ming.lei, nstange, akpm, mhocko, yukuai3, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Omar Sandoval,
	Hannes Reinecke, Michal Hocko

On Wed, Apr 15, 2020 at 12:34:43AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 15, 2020 at 12:27:12AM -0700, Christoph Hellwig wrote:
> > On Wed, Apr 15, 2020 at 05:42:34AM +0000, Luis Chamberlain wrote:
> > > > I don't understand the atomic part of the comment.  How does
> > > > bdgrab/bdput help us there?
> > > 
> > > The commit log above did a better job at explaining this in terms of our
> > > goal to use the request_queue and how this use would prevent the risk of
> > > releasing the request_queue, which could sleep.
> > 
> > So bdput eventually does and iput, but what leads to an out of context
> > offload?
> > 
> > But anyway, isn't the original problem better solved by simply not
> > releasing the queue from atomic context to start with?  There isn't
> > really any good reason we keep holding the spinlock once we have a
> > reference on the queue, so something like this (not even compile tested)
> > should do the work:
> 
> Actually - mem_cgroup_throttle_swaprate already checks for a non-NULL
> current->throttle_queue above, so we should never even call
> blk_put_queue here.  Was this found by code inspection, or was there
> a real report?

No but report, this code path came up as a possible use of
blk_put_queue() which already exists in atomic context. So yes, it
already uses blk_put_queue(), but how are we *sure* its not going to be
the last one? Because if it is that mean the release will happen in
atomic context, defeating the goal of the last patch in this series.

Using bdgrab() however ensures that during the lifecycle of this path,
blk_put_queue() won't be the last if used after, but instead we are
sure it will be upon disk release.

In fact, with bdgrab() isn't the blk_get_queue() on
blkcg_schedule_throttle() no longer needed?

> In the latter case we need to figure out what protects >throttle_queue,
> as the way blkcg_schedule_throttle first put the reference and only then
> assign a value to it already looks like it introduces a tiny race
> window.
> 
> Otherwise just open coding the applicable part of blkcg_schedule_throttle
> in mem_cgroup_throttle_swaprate seems easiest:
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5871a2aa86a5..e16051ef074c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3761,15 +3761,20 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
>  	 */
>  	if (current->throttle_queue)
>  		return;
> +	if (unlikely(current->flags & PF_KTHREAD))
> +		return;
>  
>  	spin_lock(&swap_avail_lock);
>  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
>  				  avail_lists[node]) {
> -		if (si->bdev) {
> -			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
> -						true);
> -			break;
> +		if (!si->bdev)
> +			continue;
> +		if (blk_get_queue(dev_get_queue(si->bdev))) {
> +			current->throttle_queue = dev_get_queue(si->bdev);
> +			current->use_memdelay = true;
> +			set_notify_resume(current);
>  		}
> +		break;
>  	}
>  	spin_unlock(&swap_avail_lock);
>  }

Sorry, its not clear to me  who calls the respective blk_put_queue()
here?

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/5] block: revert back to synchronous request_queue removal
  2020-04-15  6:46       ` Christoph Hellwig
@ 2020-04-15 13:20         ` Luis Chamberlain
  0 siblings, 0 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-15 13:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

On Tue, Apr 14, 2020 at 11:46:44PM -0700, Christoph Hellwig wrote:
> On Tue, Apr 14, 2020 at 08:58:52PM +0000, Luis Chamberlain wrote:
> > > I think this needs a WARN_ON thrown in to enforece the calling context.
> > 
> > I considered adding a might_sleep() but upon review with Bart, he noted
> > that this function already has a mutex_lock(), and if you look under the
> > hood of mutex_lock(), it has a might_sleep() at the very top. The
> > warning then is implicit.
> 
> It might just be a personal preference, but I think the documentation
> value of a WARN_ON_ONCE or might_sleep with a comment at the top of
> the function is much higher than a blurb in a long kerneldoc text and
> a later mutex_lock.

Well I'm a fan of making this explicit, so sure will just sprinkle a
might_sleep(), even though we have a mutex_lock().

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-15 12:39           ` Christoph Hellwig
@ 2020-04-15 13:25             ` Luis Chamberlain
  0 siblings, 0 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-15 13:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko

On Wed, Apr 15, 2020 at 05:39:25AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 15, 2020 at 12:34:34PM +0000, Luis Chamberlain wrote:
> > I'll pile up a fix. I've also considered doing a full review of callers
> > outside of the core block layer using it, and maybe just unexporting
> > this. It was originally exported due to commit d86e0e83b ("block: export
> > blk_{get,put}_queue()") to fix a scsi bug, but I can't find such
> > respective fix. I suspec that using bdgrab()/bdput() seems more likely
> > what drivers should be using. That would allow us to keep this
> > functionality internal.
> > 
> > Think that's worthy review?
> 
> Probably.  I did in fact very quickly look into that but then gave
> up due to the fair amount of modular users.

Alright, then might as well then verify if the existing practice of
bdgrab()/bdput() is indeed valid logic, as otherwise we'd be puting
the atomic context / sleep concern to bdput(). As noted earlier I
am able to confirm easily that bdgrab() can be called in atomic contex,
however I cannot easily yet vet for *why* this was a safe assumption for
bdput().

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-15 12:34         ` Luis Chamberlain
  2020-04-15 12:39           ` Christoph Hellwig
@ 2020-04-15 14:18           ` Bart Van Assche
  2020-04-16  1:12             ` Luis Chamberlain
  1 sibling, 1 reply; 53+ messages in thread
From: Bart Van Assche @ 2020-04-15 14:18 UTC (permalink / raw)
  To: Luis Chamberlain, Christoph Hellwig
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On 2020-04-15 05:34, Luis Chamberlain wrote:
> On Wed, Apr 15, 2020 at 12:14:25AM -0700, Christoph Hellwig wrote:
>> Btw, Isn't blk_get_queue racy as well?  Shouldn't we check
>> blk_queue_dying after getting the reference and undo it if the queue is
>> indeeed dying?
> 
> Yes that race should be possible:
> 
> bool blk_get_queue(struct request_queue *q)                                     
> {                                                                               
> 	if (likely(!blk_queue_dying(q))) {
>        ----------> we can get the queue to go dying here <---------
> 		__blk_get_queue(q);
> 		return true;
> 	}                                                                       
> 
> 	return false;
> }                                                                               
> EXPORT_SYMBOL(blk_get_queue);
> 
> I'll pile up a fix. I've also considered doing a full review of callers
> outside of the core block layer using it, and maybe just unexporting
> this. It was originally exported due to commit d86e0e83b ("block: export
> blk_{get,put}_queue()") to fix a scsi bug, but I can't find such
> respective fix. I suspec that using bdgrab()/bdput() seems more likely
> what drivers should be using. That would allow us to keep this
> functionality internal.

blk_get_queue() prevents concurrent freeing of struct request_queue but
does not prevent concurrent blk_cleanup_queue() calls. Callers of
blk_get_queue() may encounter a change of the queue state from normal
into dying any time during the blk_get_queue() call or after
blk_get_queue() has finished. Maybe I'm overlooking something but I
doubt that modifying blk_get_queue() will help.

Bart.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-15  6:16     ` Luis Chamberlain
  2020-04-15  7:14       ` Christoph Hellwig
@ 2020-04-15 14:45       ` Bart Van Assche
  2020-04-16  1:17         ` Luis Chamberlain
  1 sibling, 1 reply; 53+ messages in thread
From: Bart Van Assche @ 2020-04-15 14:45 UTC (permalink / raw)
  To: Luis Chamberlain, Christoph Hellwig
  Cc: axboe, viro, gregkh, rostedt, mingo, jack, ming.lei, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On 2020-04-14 23:16, Luis Chamberlain wrote:
> On Tue, Apr 14, 2020 at 08:40:44AM -0700, Christoph Hellwig wrote:
>> Hmm, where exactly does the race come in so that it can only happen
>> after where you take the reference, but not before it?  I'm probably
>> missing something, but that just means it needs to be explained a little
>> better :)
> 
>>From the trace on patch 2/5:
> 
>     BLKTRACE_SETUP(loop0) #2
>     [   13.933961] == blk_trace_ioctl(2, BLKTRACESETUP) start
>     [   13.936758] === do_blk_trace_setup(2) start
>     [   13.938944] === do_blk_trace_setup(2) creating directory
>     [   13.941029] === do_blk_trace_setup(2) using what debugfs_lookup() gave
>     
>     ---> From LOOP_CTL_DEL(loop0) #2
>     [   13.971046] === blk_trace_cleanup(7) end
>     [   13.973175] == __blk_trace_remove(7) end
>     [   13.975352] == blk_trace_shutdown(7) end
>     [   13.977415] = __blk_release_queue(7) calling blk_mq_debugfs_unregister()
>     [   13.980645] ==== blk_mq_debugfs_unregister(7) begin
>     [   13.980696] ==== blk_mq_debugfs_unregister(7) debugfs_remove_recursive(q->debugfs_dir)
>     [   13.983118] ==== blk_mq_debugfs_unregister(7) end q->debugfs_dir is NULL
>     [   13.986945] = __blk_release_queue(7) blk_mq_debugfs_unregister() end
>     [   13.993155] = __blk_release_queue(7) end
>     
>     ---> From BLKTRACE_SETUP(loop0) #2
>     [   13.995928] === do_blk_trace_setup(2) end with ret: 0
>     [   13.997623] == blk_trace_ioctl(2, BLKTRACESETUP) end
> 
> The BLKTRACESETUP above works on request_queue which later
> LOOP_CTL_DEL races on and sweeps the debugfs dir underneath us.
> If you use this commit alone though, this doesn't fix the race issue
> however, and that's because of both still the debugfs_lookup() use
> and that we're still using asynchronous removal at this point.
> 
> refcounting will just ensure we don't take the request_queue underneath
> our noses.

I think the above trace reveals a bug in the loop driver. The loop
driver shouldn't allow the associated request queue to disappear while
the loop device is open. One may want to have a look at sd_open() in the
sd driver. The scsi_disk_get() call in that function not only increases
the reference count of the SCSI disk but also of the underlying SCSI device.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-14  4:18 ` [PATCH 2/5] blktrace: fix debugfs use after free Luis Chamberlain
                     ` (2 preceding siblings ...)
  2020-04-15  2:46   ` Bart Van Assche
@ 2020-04-15 17:38   ` Eric Sandeen
  2020-04-15 21:48     ` Bart Van Assche
  2020-04-16  0:56     ` Luis Chamberlain
  2020-04-16  2:10   ` Ming Lei
  4 siblings, 2 replies; 53+ messages in thread
From: Eric Sandeen @ 2020-04-15 17:38 UTC (permalink / raw)
  To: Luis Chamberlain, axboe, viro, bvanassche, gregkh, rostedt,
	mingo, jack, ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On 4/13/20 11:18 PM, Luis Chamberlain wrote:
> On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
> merged on v4.12 Omar fixed the original blktrace code for request-based
> drivers (multiqueue). This however left in place a possible crash, if you
> happen to abuse blktrace in a way it was not intended.
> 
> Namely, if you loop adding a device, setup the blktrace with BLKTRACESETUP,
> forget to BLKTRACETEARDOWN, and then just remove the device you end up
> with a panic:

I think this patch makes this all cleaner anyway, but - without the apparent
loop bug mentioned by Bart which allows removal of the loop device while blktrace
is active (if I read that right), can this still happen?

-Eric

> [  107.193134] debugfs: Directory 'loop0' with parent 'block' already present!
> [  107.254615] BUG: kernel NULL pointer dereference, address: 00000000000000a0
> [  107.258785] #PF: supervisor write access in kernel mode
> [  107.262035] #PF: error_code(0x0002) - not-present page
> [  107.264106] PGD 0 P4D 0
> [  107.264404] Oops: 0002 [#1] SMP NOPTI
> [  107.264803] CPU: 8 PID: 674 Comm: kworker/8:2 Tainted: G            E 5.6.0-rc7-next-20200327 #1
> [  107.265712] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> [  107.266553] Workqueue: events __blk_release_queue
> [  107.267051] RIP: 0010:down_write+0x15/0x40
> [  107.267488] Code: eb ca e8 ee a5 8d ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 55 00 75 0f  65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
> [  107.269300] RSP: 0018:ffff9927c06efda8 EFLAGS: 00010246
> [  107.269841] RAX: 0000000000000000 RBX: ffff8be7e73b0600 RCX: ffffff8100000000
> [  107.270559] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
> [  107.271281] RBP: 00000000000000a0 R08: ffff8be7ebc80fa8 R09: ffff8be7ebc80fa8
> [  107.272001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [  107.272722] R13: ffff8be7efc30400 R14: ffff8be7e0571200 R15: 00000000000000a0
> [  107.273475] FS:  0000000000000000(0000) GS:ffff8be7efc00000(0000) knlGS:0000000000000000
> [  107.274346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  107.274968] CR2: 00000000000000a0 CR3: 000000042abee003 CR4: 0000000000360ee0
> [  107.275710] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  107.276465] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  107.277214] Call Trace:
> [  107.277532]  simple_recursive_removal+0x4e/0x2e0
> [  107.278049]  ? debugfs_remove+0x60/0x60
> [  107.278493]  debugfs_remove+0x40/0x60
> [  107.278922]  blk_trace_free+0xd/0x50
> [  107.279339]  __blk_trace_remove+0x27/0x40
> [  107.279797]  blk_trace_shutdown+0x30/0x40
> [  107.280256]  __blk_release_queue+0xab/0x110
> [  107.280734]  process_one_work+0x1b4/0x380
> [  107.281194]  worker_thread+0x50/0x3c0
> [  107.281622]  kthread+0xf9/0x130
> [  107.281994]  ? process_one_work+0x380/0x380
> [  107.282467]  ? kthread_park+0x90/0x90
> [  107.282895]  ret_from_fork+0x1f/0x40
> [  107.283316] Modules linked in: loop(E) <etc>
> [  107.288562] CR2: 00000000000000a0
> [  107.288957] ---[ end trace b885d243d441bbce ]---



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-15 17:38   ` Eric Sandeen
@ 2020-04-15 21:48     ` Bart Van Assche
  2020-04-16  0:56     ` Luis Chamberlain
  1 sibling, 0 replies; 53+ messages in thread
From: Bart Van Assche @ 2020-04-15 21:48 UTC (permalink / raw)
  To: Eric Sandeen, Luis Chamberlain, axboe, viro, gregkh, rostedt,
	mingo, jack, ming.lei, nstange, akpm
  Cc: mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On 2020-04-15 10:38, Eric Sandeen wrote:
> On 4/13/20 11:18 PM, Luis Chamberlain wrote:
>> On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
>> merged on v4.12 Omar fixed the original blktrace code for request-based
>> drivers (multiqueue). This however left in place a possible crash, if you
>> happen to abuse blktrace in a way it was not intended.
>>
>> Namely, if you loop adding a device, setup the blktrace with BLKTRACESETUP,
>> forget to BLKTRACETEARDOWN, and then just remove the device you end up
>> with a panic:
> 
> I think this patch makes this all cleaner anyway, but - without the apparent
> loop bug mentioned by Bart which allows removal of the loop device while blktrace
> is active (if I read that right), can this still happen?

That's a great question. Even if the loop driver fix would be sufficient
to fix the blktrace debugfs use-after free I think the block layer
patches from this series are still very valuable. As explained in the
cover letter this patch series fixes more than only the blktrace debugfs
use-after-free.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-15 17:38   ` Eric Sandeen
  2020-04-15 21:48     ` Bart Van Assche
@ 2020-04-16  0:56     ` Luis Chamberlain
  2020-04-16  1:02       ` Eric Sandeen
  1 sibling, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-16  0:56 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da

On Wed, Apr 15, 2020 at 12:38:26PM -0500, Eric Sandeen wrote:
> On 4/13/20 11:18 PM, Luis Chamberlain wrote:
> > On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
> > merged on v4.12 Omar fixed the original blktrace code for request-based
> > drivers (multiqueue). This however left in place a possible crash, if you
> > happen to abuse blktrace in a way it was not intended.
> > 
> > Namely, if you loop adding a device, setup the blktrace with BLKTRACESETUP,
> > forget to BLKTRACETEARDOWN, and then just remove the device you end up
> > with a panic:
> 
> I think this patch makes this all cleaner anyway, but - without the apparent
> loop bug mentioned by Bart which allows removal of the loop device while blktrace
> is active (if I read that right), can this still happen?

I have not tested that, but some modifications of the break-blktrace
program could enable us to test that, however I don't think the race
would be possible after patch 3/5 "blktrace: refcount the request_queue
during ioctl" is merged, as removal then a pending blktrace would
refcount the request_queue and the removal would have to wait until
the refcount is decremeneted, until after the blktrace ioctl.

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-16  0:56     ` Luis Chamberlain
@ 2020-04-16  1:02       ` Eric Sandeen
  2020-04-16  1:20         ` Luis Chamberlain
  0 siblings, 1 reply; 53+ messages in thread
From: Eric Sandeen @ 2020-04-16  1:02 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da



On 4/15/20 7:56 PM, Luis Chamberlain wrote:
> On Wed, Apr 15, 2020 at 12:38:26PM -0500, Eric Sandeen wrote:
>> On 4/13/20 11:18 PM, Luis Chamberlain wrote:
>>> On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
>>> merged on v4.12 Omar fixed the original blktrace code for request-based
>>> drivers (multiqueue). This however left in place a possible crash, if you
>>> happen to abuse blktrace in a way it was not intended.
>>>
>>> Namely, if you loop adding a device, setup the blktrace with BLKTRACESETUP,
>>> forget to BLKTRACETEARDOWN, and then just remove the device you end up
>>> with a panic:
>>
>> I think this patch makes this all cleaner anyway, but - without the apparent
>> loop bug mentioned by Bart which allows removal of the loop device while blktrace
>> is active (if I read that right), can this still happen?
> 
> I have not tested that, but some modifications of the break-blktrace
> program could enable us to test that

FWIW, I modified it to modprobe & rmmod scsi_debug instead of the loop ioctls,
and the module can't be unloaded after the blktrace is started since it's busy.

Not sure that's equivalent tho.

> however I don't think the race
> would be possible after patch 3/5 "blktrace: refcount the request_queue
> during ioctl" is merged, as removal then a pending blktrace would
> refcount the request_queue and the removal would have to wait until
> the refcount is decremeneted, until after the blktrace ioctl.

I'm out of my depth in the block layer, not sure who's supposed to take
refs on what. ;)

-Eric


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-15 14:18           ` Bart Van Assche
@ 2020-04-16  1:12             ` Luis Chamberlain
  2020-04-16  3:43               ` Bart Van Assche
  0 siblings, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-16  1:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm, mhocko, yukuai3, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Omar Sandoval,
	Hannes Reinecke, Michal Hocko

On Wed, Apr 15, 2020 at 07:18:22AM -0700, Bart Van Assche wrote:
> On 2020-04-15 05:34, Luis Chamberlain wrote:
> > On Wed, Apr 15, 2020 at 12:14:25AM -0700, Christoph Hellwig wrote:
> >> Btw, Isn't blk_get_queue racy as well?  Shouldn't we check
> >> blk_queue_dying after getting the reference and undo it if the queue is
> >> indeeed dying?
> > 
> > Yes that race should be possible:
> > 
> > bool blk_get_queue(struct request_queue *q)                                     
> > {                                                                               
> > 	if (likely(!blk_queue_dying(q))) {
> >        ----------> we can get the queue to go dying here <---------
> > 		__blk_get_queue(q);
> > 		return true;
> > 	}                                                                       
> > 
> > 	return false;
> > }                                                                               
> > EXPORT_SYMBOL(blk_get_queue);
> > 
> > I'll pile up a fix. I've also considered doing a full review of callers
> > outside of the core block layer using it, and maybe just unexporting
> > this. It was originally exported due to commit d86e0e83b ("block: export
> > blk_{get,put}_queue()") to fix a scsi bug, but I can't find such
> > respective fix. I suspec that using bdgrab()/bdput() seems more likely
> > what drivers should be using. That would allow us to keep this
> > functionality internal.
> 
> blk_get_queue() prevents concurrent freeing of struct request_queue but
> does not prevent concurrent blk_cleanup_queue() calls.

Wouldn't concurrent blk_cleanup_queue() calls be a bug? If so should
I make it clear that it would be or simply prevent it?

> Callers of
> blk_get_queue() may encounter a change of the queue state from normal
> into dying any time during the blk_get_queue() call or after
> blk_get_queue() has finished. Maybe I'm overlooking something but I
> doubt that modifying blk_get_queue() will help.

Good point, to fix that race described by Christoph we'd have to take
into consideration refcounts of the request_queue to prevent queues from
changing state to dying if the refcount is > 1, however that'd also
would  mean not allowing the request_queue from ever dying.

One way we could resolve this could be to to keep track of a
quiesce/dying request, then at that point prevent blk_get_queue() from
allowing increments, and once the refcount is down to 1, flip the switch
to dying.

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-15 14:45       ` Bart Van Assche
@ 2020-04-16  1:17         ` Luis Chamberlain
  0 siblings, 0 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-16  1:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm, mhocko, yukuai3, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Omar Sandoval,
	Hannes Reinecke, Michal Hocko

On Wed, Apr 15, 2020 at 07:45:18AM -0700, Bart Van Assche wrote:
> On 2020-04-14 23:16, Luis Chamberlain wrote:
> > On Tue, Apr 14, 2020 at 08:40:44AM -0700, Christoph Hellwig wrote:
> >> Hmm, where exactly does the race come in so that it can only happen
> >> after where you take the reference, but not before it?  I'm probably
> >> missing something, but that just means it needs to be explained a little
> >> better :)
> > 
> >>From the trace on patch 2/5:
> > 
> >     BLKTRACE_SETUP(loop0) #2
> >     [   13.933961] == blk_trace_ioctl(2, BLKTRACESETUP) start
> >     [   13.936758] === do_blk_trace_setup(2) start
> >     [   13.938944] === do_blk_trace_setup(2) creating directory
> >     [   13.941029] === do_blk_trace_setup(2) using what debugfs_lookup() gave
> >     
> >     ---> From LOOP_CTL_DEL(loop0) #2
> >     [   13.971046] === blk_trace_cleanup(7) end
> >     [   13.973175] == __blk_trace_remove(7) end
> >     [   13.975352] == blk_trace_shutdown(7) end
> >     [   13.977415] = __blk_release_queue(7) calling blk_mq_debugfs_unregister()
> >     [   13.980645] ==== blk_mq_debugfs_unregister(7) begin
> >     [   13.980696] ==== blk_mq_debugfs_unregister(7) debugfs_remove_recursive(q->debugfs_dir)
> >     [   13.983118] ==== blk_mq_debugfs_unregister(7) end q->debugfs_dir is NULL
> >     [   13.986945] = __blk_release_queue(7) blk_mq_debugfs_unregister() end
> >     [   13.993155] = __blk_release_queue(7) end
> >     
> >     ---> From BLKTRACE_SETUP(loop0) #2
> >     [   13.995928] === do_blk_trace_setup(2) end with ret: 0
> >     [   13.997623] == blk_trace_ioctl(2, BLKTRACESETUP) end
> > 
> > The BLKTRACESETUP above works on request_queue which later
> > LOOP_CTL_DEL races on and sweeps the debugfs dir underneath us.
> > If you use this commit alone though, this doesn't fix the race issue
> > however, and that's because of both still the debugfs_lookup() use
> > and that we're still using asynchronous removal at this point.
> > 
> > refcounting will just ensure we don't take the request_queue underneath
> > our noses.
> 
> I think the above trace reveals a bug in the loop driver. The loop
> driver shouldn't allow the associated request queue to disappear while
> the loop device is open.

The bug was *not* in the driver, the bug was in that deferal of removal
was allowed to be asynchronous, therefore the removal from a userspace
perspective *finishes*, but its not actually really done. Back when
the removal was synchronous, the loop driver waited on cleanup, and
didn't return to userspace until it was really removed.

This is why I annotated that the move to asynch removal turns out to
actually be a userspace API regression.

> One may want to have a look at sd_open() in the
> sd driver. The scsi_disk_get() call in that function not only increases
> the reference count of the SCSI disk but also of the underlying SCSI device.

Are you saying to use this as a template for what a driver should do or
do you suspect there is a bug there? Not sure what you mean here.

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-16  1:02       ` Eric Sandeen
@ 2020-04-16  1:20         ` Luis Chamberlain
  0 siblings, 0 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-16  1:20 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, ming.lei,
	nstange, akpm, mhocko, yukuai3, linux-block, linux-fsdevel,
	linux-mm, linux-kernel, Omar Sandoval, Hannes Reinecke,
	Michal Hocko, syzbot+603294af2d01acfdd6da

On Wed, Apr 15, 2020 at 08:02:04PM -0500, Eric Sandeen wrote:
> 
> 
> On 4/15/20 7:56 PM, Luis Chamberlain wrote:
> > On Wed, Apr 15, 2020 at 12:38:26PM -0500, Eric Sandeen wrote:
> >> On 4/13/20 11:18 PM, Luis Chamberlain wrote:
> >>> On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
> >>> merged on v4.12 Omar fixed the original blktrace code for request-based
> >>> drivers (multiqueue). This however left in place a possible crash, if you
> >>> happen to abuse blktrace in a way it was not intended.
> >>>
> >>> Namely, if you loop adding a device, setup the blktrace with BLKTRACESETUP,
> >>> forget to BLKTRACETEARDOWN, and then just remove the device you end up
> >>> with a panic:
> >>
> >> I think this patch makes this all cleaner anyway, but - without the apparent
> >> loop bug mentioned by Bart which allows removal of the loop device while blktrace
> >> is active (if I read that right), can this still happen?
> > 
> > I have not tested that, but some modifications of the break-blktrace
> > program could enable us to test that
> 
> FWIW, I modified it to modprobe & rmmod scsi_debug instead of the loop ioctls,
> and the module can't be unloaded after the blktrace is started since it's busy.
> 
> Not sure that's equivalent tho.

Given what Bart mentioned about sd_open, that might be the saving grace
for blocking async behaviour on scsi. If it only refcounted the
request_queue as the loop driver it would have been exposed to the
same bug.

> > however I don't think the race
> > would be possible after patch 3/5 "blktrace: refcount the request_queue
> > during ioctl" is merged, as removal then a pending blktrace would
> > refcount the request_queue and the removal would have to wait until
> > the refcount is decremeneted, until after the blktrace ioctl.
> 
> I'm out of my depth in the block layer, not sure who's supposed to take
> refs on what. ;)

I'm new to all this code, only been a few weeks looking into it, but am
trying to do my best ot make sense of it. So the above is what I can
tell would be needed.

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-14  4:18 ` [PATCH 2/5] blktrace: fix debugfs use after free Luis Chamberlain
                     ` (3 preceding siblings ...)
  2020-04-15 17:38   ` Eric Sandeen
@ 2020-04-16  2:10   ` Ming Lei
  2020-04-16  5:25     ` Luis Chamberlain
  4 siblings, 1 reply; 53+ messages in thread
From: Ming Lei @ 2020-04-16  2:10 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Tue, Apr 14, 2020 at 04:18:59AM +0000, Luis Chamberlain wrote:
> On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
> merged on v4.12 Omar fixed the original blktrace code for request-based
> drivers (multiqueue). This however left in place a possible crash, if you
> happen to abuse blktrace in a way it was not intended.
> 
> Namely, if you loop adding a device, setup the blktrace with BLKTRACESETUP,
> forget to BLKTRACETEARDOWN, and then just remove the device you end up
> with a panic:
> 
> [  107.193134] debugfs: Directory 'loop0' with parent 'block' already present!
> [  107.254615] BUG: kernel NULL pointer dereference, address: 00000000000000a0
> [  107.258785] #PF: supervisor write access in kernel mode
> [  107.262035] #PF: error_code(0x0002) - not-present page
> [  107.264106] PGD 0 P4D 0
> [  107.264404] Oops: 0002 [#1] SMP NOPTI
> [  107.264803] CPU: 8 PID: 674 Comm: kworker/8:2 Tainted: G            E 5.6.0-rc7-next-20200327 #1
> [  107.265712] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> [  107.266553] Workqueue: events __blk_release_queue
> [  107.267051] RIP: 0010:down_write+0x15/0x40
> [  107.267488] Code: eb ca e8 ee a5 8d ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 55 00 75 0f  65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
> [  107.269300] RSP: 0018:ffff9927c06efda8 EFLAGS: 00010246
> [  107.269841] RAX: 0000000000000000 RBX: ffff8be7e73b0600 RCX: ffffff8100000000
> [  107.270559] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
> [  107.271281] RBP: 00000000000000a0 R08: ffff8be7ebc80fa8 R09: ffff8be7ebc80fa8
> [  107.272001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [  107.272722] R13: ffff8be7efc30400 R14: ffff8be7e0571200 R15: 00000000000000a0
> [  107.273475] FS:  0000000000000000(0000) GS:ffff8be7efc00000(0000) knlGS:0000000000000000
> [  107.274346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  107.274968] CR2: 00000000000000a0 CR3: 000000042abee003 CR4: 0000000000360ee0
> [  107.275710] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  107.276465] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  107.277214] Call Trace:
> [  107.277532]  simple_recursive_removal+0x4e/0x2e0
> [  107.278049]  ? debugfs_remove+0x60/0x60
> [  107.278493]  debugfs_remove+0x40/0x60
> [  107.278922]  blk_trace_free+0xd/0x50
> [  107.279339]  __blk_trace_remove+0x27/0x40
> [  107.279797]  blk_trace_shutdown+0x30/0x40
> [  107.280256]  __blk_release_queue+0xab/0x110
> [  107.280734]  process_one_work+0x1b4/0x380
> [  107.281194]  worker_thread+0x50/0x3c0
> [  107.281622]  kthread+0xf9/0x130
> [  107.281994]  ? process_one_work+0x380/0x380
> [  107.282467]  ? kthread_park+0x90/0x90
> [  107.282895]  ret_from_fork+0x1f/0x40
> [  107.283316] Modules linked in: loop(E) <etc>
> [  107.288562] CR2: 00000000000000a0
> [  107.288957] ---[ end trace b885d243d441bbce ]---
> 
> This splat happens to be very similar to the one reported via
> kernel.org korg#205713, only that korg#205713 was for v4.19.83
> and the above now includes the simple_recursive_removal() introduced
> via commit a3d1e7eb5abe ("simple_recursive_removal(): kernel-side rm
> -rf for ramfs-style filesystems") merged on v5.6.
> 
> korg#205713 then was used to create CVE-2019-19770 and claims that
> the bug is in a use-after-free in the debugfs core code. The
> implications of this being a generic UAF on debugfs would be
> much more severe, as it would imply parent dentries can sometimes
> not be positive, which we hold by design is just not possible.
> 
> Below is the splat explained with a bit more details, explaining
> what is happening in userspace, kernel, and a print of the CPU on,
> which the code runs on:
> 
> load loopback module
> [   13.603371] == blk_mq_debugfs_register(12) start
> [   13.604040] == blk_mq_debugfs_register(12) q->debugfs_dir created
> [   13.604934] == blk_mq_debugfs_register(12) end
> [   13.627382] == blk_mq_debugfs_register(12) start
> [   13.628041] == blk_mq_debugfs_register(12) q->debugfs_dir created
> [   13.629240] == blk_mq_debugfs_register(12) end
> [   13.651667] == blk_mq_debugfs_register(12) start
> [   13.652836] == blk_mq_debugfs_register(12) q->debugfs_dir created
> [   13.655107] == blk_mq_debugfs_register(12) end
> [   13.684917] == blk_mq_debugfs_register(12) start
> [   13.687876] == blk_mq_debugfs_register(12) q->debugfs_dir created
> [   13.691588] == blk_mq_debugfs_register(13) end
> [   13.707320] == blk_mq_debugfs_register(13) start
> [   13.707863] == blk_mq_debugfs_register(13) q->debugfs_dir created
> [   13.708856] == blk_mq_debugfs_register(13) end
> [   13.735623] == blk_mq_debugfs_register(13) start
> [   13.736656] == blk_mq_debugfs_register(13) q->debugfs_dir created
> [   13.738411] == blk_mq_debugfs_register(13) end
> [   13.763326] == blk_mq_debugfs_register(13) start
> [   13.763972] == blk_mq_debugfs_register(13) q->debugfs_dir created
> [   13.765167] == blk_mq_debugfs_register(13) end
> [   13.779510] == blk_mq_debugfs_register(13) start
> [   13.780522] == blk_mq_debugfs_register(13) q->debugfs_dir created
> [   13.782338] == blk_mq_debugfs_register(13) end
> [   13.783521] loop: module loaded
> 
> LOOP_CTL_DEL(loop0) #1
> [   13.803550] = __blk_release_queue(4) start
> [   13.807772] == blk_trace_shutdown(4) start
> [   13.810749] == blk_trace_shutdown(4) end
> [   13.813437] = __blk_release_queue(4) calling blk_mq_debugfs_unregister()
> [   13.817593] ==== blk_mq_debugfs_unregister(4) begin
> [   13.817621] ==== blk_mq_debugfs_unregister(4) debugfs_remove_recursive(q->debugfs_dir)
> [   13.821203] ==== blk_mq_debugfs_unregister(4) end q->debugfs_dir is NULL
> [   13.826166] = __blk_release_queue(4) blk_mq_debugfs_unregister() end
> [   13.832992] = __blk_release_queue(4) end
> 
> LOOP_CTL_ADD(loop0) #1
> [   13.843742] == blk_mq_debugfs_register(7) start
> [   13.845569] == blk_mq_debugfs_register(7) q->debugfs_dir created
> [   13.848628] == blk_mq_debugfs_register(7) end
> 
> BLKTRACE_SETUP(loop0) #1
> [   13.850924] == blk_trace_ioctl(7, BLKTRACESETUP) start
> [   13.852852] === do_blk_trace_setup(7) start
> [   13.854580] === do_blk_trace_setup(7) creating directory
> [   13.856620] === do_blk_trace_setup(7) using what debugfs_lookup() gave
> [   13.860635] === do_blk_trace_setup(7) end with ret: 0
> [   13.862615] == blk_trace_ioctl(7, BLKTRACESETUP) end
> 
> LOOP_CTL_DEL(loop0) #2
> [   13.883304] = __blk_release_queue(7) start
> [   13.885324] == blk_trace_shutdown(7) start
> [   13.887197] == blk_trace_shutdown(7) calling __blk_trace_remove()
> [   13.889807] == __blk_trace_remove(7) start
> [   13.891669] === blk_trace_cleanup(7) start
> [   13.911656] ====== blk_trace_free(7) start
> 
> LOOP_CTL_ADD(loop0) #2
> [   13.912709] == blk_mq_debugfs_register(2) start
> 
> ---> From LOOP_CTL_DEL(loop0) #2
> [   13.915887] ====== blk_trace_free(7) end
> 
> ---> From LOOP_CTL_ADD(loop0) #2
> [   13.918359] debugfs: Directory 'loop0' with parent 'block' already present!
> [   13.926433] == blk_mq_debugfs_register(2) q->debugfs_dir created
> [   13.930373] == blk_mq_debugfs_register(2) end
> 
> BLKTRACE_SETUP(loop0) #2
> [   13.933961] == blk_trace_ioctl(2, BLKTRACESETUP) start
> [   13.936758] === do_blk_trace_setup(2) start
> [   13.938944] === do_blk_trace_setup(2) creating directory
> [   13.941029] === do_blk_trace_setup(2) using what debugfs_lookup() gave
> 
> ---> From LOOP_CTL_DEL(loop0) #2
> [   13.971046] === blk_trace_cleanup(7) end
> [   13.973175] == __blk_trace_remove(7) end
> [   13.975352] == blk_trace_shutdown(7) end
> [   13.977415] = __blk_release_queue(7) calling blk_mq_debugfs_unregister()
> [   13.980645] ==== blk_mq_debugfs_unregister(7) begin
> [   13.980696] ==== blk_mq_debugfs_unregister(7) debugfs_remove_recursive(q->debugfs_dir)
> [   13.983118] ==== blk_mq_debugfs_unregister(7) end q->debugfs_dir is NULL
> [   13.986945] = __blk_release_queue(7) blk_mq_debugfs_unregister() end
> [   13.993155] = __blk_release_queue(7) end
> 
> ---> From BLKTRACE_SETUP(loop0) #2
> [   13.995928] === do_blk_trace_setup(2) end with ret: 0
> [   13.997623] == blk_trace_ioctl(2, BLKTRACESETUP) end
> 
> LOOP_CTL_DEL(loop0) #3
> [   14.035119] = __blk_release_queue(2) start
> [   14.036925] == blk_trace_shutdown(2) start
> [   14.038518] == blk_trace_shutdown(2) calling __blk_trace_remove()
> [   14.040829] == __blk_trace_remove(2) start
> [   14.042413] === blk_trace_cleanup(2) start
> 
> LOOP_CTL_ADD(loop0) #3
> [   14.072522] == blk_mq_debugfs_register(6) start
> 
> ---> From LOOP_CTL_DEL(loop0) #3
> [   14.075151] ====== blk_trace_free(2) start
> 
> ---> From LOOP_CTL_ADD(loop0) #3
> [   14.075882] == blk_mq_debugfs_register(6) q->debugfs_dir created
> 
> ---> From LOOP_CTL_DEL(loop0) #3
> [   14.078624] BUG: kernel NULL pointer dereference, address: 00000000000000a0
> [   14.084332] == blk_mq_debugfs_register(6) end
> [   14.086971] #PF: supervisor write access in kernel mode
> [   14.086974] #PF: error_code(0x0002) - not-present page
> [   14.086977] PGD 0 P4D 0
> [   14.086984] Oops: 0002 [#1] SMP NOPTI
> [   14.086990] CPU: 2 PID: 287 Comm: kworker/2:2 Tainted: G            E 5.6.0-next-20200403+ #54
> [   14.086991] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> [   14.087002] Workqueue: events __blk_release_queue
> [   14.087011] RIP: 0010:down_write+0x15/0x40
> [   14.090300] == blk_trace_ioctl(6, BLKTRACESETUP) start
> [   14.093277] Code: eb ca e8 3e 34 8d ff cc cc cc cc cc cc cc cc cc cc
> cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00
> 00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
> [   14.093280] RSP: 0018:ffffc28a00533da8 EFLAGS: 00010246
> [   14.093284] RAX: 0000000000000000 RBX: ffff9f7a24d07980 RCX: ffffff8100000000
> [   14.093286] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
> [   14.093287] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000019
> [   14.093289] R10: 0000000000000774 R11: 0000000000000000 R12: 0000000000000000
> [   14.093291] R13: ffff9f7a2fab0400 R14: ffff9f7a21dd1140 R15: 00000000000000a0
> [   14.093294] FS:  0000000000000000(0000) GS:ffff9f7a2fa80000(0000) knlGS:0000000000000000
> [   14.093296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   14.093298] CR2: 00000000000000a0 CR3: 00000004293d2003 CR4: 0000000000360ee0
> [   14.093307] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   14.093308] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   14.093310] Call Trace:
> [   14.093324]  simple_recursive_removal+0x4e/0x2e0
> [   14.093330]  ? debugfs_remove+0x60/0x60
> [   14.093334]  debugfs_remove+0x40/0x60
> [   14.093339]  blk_trace_free+0x20/0x70
> [   14.093346]  __blk_trace_remove+0x54/0x90
> [   14.096704] === do_blk_trace_setup(6) start
> [   14.098534]  blk_trace_shutdown+0x74/0x80
> [   14.100958] === do_blk_trace_setup(6) creating directory
> [   14.104575]  __blk_release_queue+0xbe/0x160
> [   14.104580]  process_one_work+0x1b4/0x380
> [   14.104585]  worker_thread+0x50/0x3c0
> [   14.104589]  kthread+0xf9/0x130
> [   14.104593]  ? process_one_work+0x380/0x380
> [   14.104596]  ? kthread_park+0x90/0x90
> [   14.104599]  ret_from_fork+0x1f/0x40
> [   14.104603] Modules linked in: loop(E) xfs(E) libcrc32c(E)
> crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) joydev(E)
> serio_raw(E) aesni_intel(E) glue_helper(E) virtio_balloon(E) evdev(E)
> crypto_simd(E) pcspkr(E) cryptd(E) i6300esb(E) button(E) ip_tables(E)
> x_tables(E) autofs4(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E)
> jbd2(E) virtio_net(E) net_failover(E) failover(E) virtio_blk(E)
> ata_generic(E) uhci_hcd(E) ata_piix(E) ehci_hcd(E) nvme(E) libata(E)
> crc32c_intel(E) usbcore(E) psmouse(E) nvme_core(E) virtio_pci(E)
> scsi_mod(E) virtio_ring(E) t10_pi(E) virtio(E) i2c_piix4(E) floppy(E)
> [   14.107400] === do_blk_trace_setup(6) using what debugfs_lookup() gave
> [   14.108939] CR2: 00000000000000a0
> [   14.110589] === do_blk_trace_setup(6) end with ret: 0
> [   14.111592] ---[ end trace 7a783b33b9614db9 ]---
> 
> The root cause to this issue is that debugfs_lookup() can find a
> previous incarnation's dir of the same name which is about to get
> removed from a not yet schedule work.
> 
> We can fix the UAF by simply using a debugfs directory which moving
> forward will always be accessible if debugfs is enabled, this way,
> its allocated and avaialble always for both request-based block
> drivers or make_request drivers (multiqueue) block drivers.
> 
> This simplifies the code considerably, with the only penalty now being
> that we're always creating the request queue debugfs directory for the
> request-based block device drivers.

This patch is indeed a big simplification, just a minor comment.

> 
> The UAF then is not a core debugfs issue, but instead a misuse of
> debugfs, and this issue can only be triggered if you are root, and
> misuse blktrace.
> 
> This issue can be reproduced with break-blktrace [2] using:
> 
>   break-blktrace -c 10 -d -s
> 
> This patch fixes this issue. Note that there is also another
> respective UAF but from the ioctl path [3], this should also fix
> that issue.
> 
> This patch then also disputes the severity of CVE-2019-19770 as
> this issue is only possible by being root and using blktrace.
> 
> It is not a core debugfs issue.
> 
> [0] https://bugzilla.kernel.org/show_bug.cgi?id=205713
> [1] https://nvd.nist.gov/vuln/detail/CVE-2019-19770
> [2] https://github.com/mcgrof/break-blktrace
> [3] https://lore.kernel.org/lkml/000000000000ec635b059f752700@google.com/
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Nicolai Stange <nstange@suse.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: yu kuai <yukuai3@huawei.com>
> Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com
> Fixes: 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  block/blk-debugfs.c          | 12 ++++++++++++
>  block/blk-mq-debugfs.c       |  5 -----
>  block/blk-sysfs.c            |  3 +++
>  block/blk.h                  | 10 ++++++++++
>  include/linux/blkdev.h       |  5 ++++-
>  include/linux/blktrace_api.h |  1 -
>  kernel/trace/blktrace.c      | 19 ++++++++-----------
>  7 files changed, 37 insertions(+), 18 deletions(-)
> 
> diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
> index 19091e1effc0..db982688cf46 100644
> --- a/block/blk-debugfs.c
> +++ b/block/blk-debugfs.c
> @@ -13,3 +13,15 @@ void blk_debugfs_register(void)
>  {
>  	blk_debugfs_root = debugfs_create_dir("block", NULL);
>  }
> +
> +void blk_queue_debugfs_register(struct request_queue *q)
> +{
> +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> +					    blk_debugfs_root);
> +}
> +
> +void blk_queue_debugfs_unregister(struct request_queue *q)
> +{
> +	debugfs_remove_recursive(q->debugfs_dir);
> +	q->debugfs_dir = NULL;
> +}
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index b3f2ba483992..bda9378eab90 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -823,9 +823,6 @@ void blk_mq_debugfs_register(struct request_queue *q)
>  	struct blk_mq_hw_ctx *hctx;
>  	int i;
>  
> -	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> -					    blk_debugfs_root);
> -
>  	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
>  
>  	/*
> @@ -856,9 +853,7 @@ void blk_mq_debugfs_register(struct request_queue *q)
>  
>  void blk_mq_debugfs_unregister(struct request_queue *q)
>  {
> -	debugfs_remove_recursive(q->debugfs_dir);
>  	q->sched_debugfs_dir = NULL;
> -	q->debugfs_dir = NULL;
>  }
>  
>  static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index fca9b158f4a0..0285d67e1e4c 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -895,6 +895,7 @@ static void __blk_release_queue(struct work_struct *work)
>  
>  	blk_trace_shutdown(q);
>  
> +	blk_queue_debugfs_unregister(q);
>  	if (queue_is_mq(q))
>  		blk_mq_debugfs_unregister(q);
>  
> @@ -975,6 +976,8 @@ int blk_register_queue(struct gendisk *disk)
>  		goto unlock;
>  	}
>  
> +	blk_queue_debugfs_register(q);
> +
>  	if (queue_is_mq(q)) {
>  		__blk_mq_register_dev(dev, q);
>  		blk_mq_debugfs_register(q);
> diff --git a/block/blk.h b/block/blk.h
> index 86a66b614f08..813b8513fc1a 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -489,10 +489,20 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
>  		bool *same_page);
>  #ifdef CONFIG_DEBUG_FS
>  void blk_debugfs_register(void);
> +void blk_queue_debugfs_register(struct request_queue *q);
> +void blk_queue_debugfs_unregister(struct request_queue *q);
>  #else
>  static inline void blk_debugfs_register(void)
>  {
>  }
> +
> +static inline void blk_queue_debugfs_register(struct request_queue *q)
> +{
> +}
> +
> +static inline void blk_queue_debugfs_unregister(struct request_queue *q)
> +{
> +}
>  #endif /* CONFIG_DEBUG_FS */
>  
>  #endif /* BLK_INTERNAL_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 32868fbedc9e..cc43c8e6516c 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -569,8 +569,11 @@ struct request_queue {
>  	struct list_head	tag_set_list;
>  	struct bio_set		bio_split;
>  
> -#ifdef CONFIG_BLK_DEBUG_FS
> +#ifdef CONFIG_DEBUG_FS
> +	/* Used by block/blk-*debugfs.c and kernel/trace/blktrace.c */
>  	struct dentry		*debugfs_dir;
> +#endif
> +#ifdef CONFIG_BLK_DEBUG_FS
>  	struct dentry		*sched_debugfs_dir;
>  	struct dentry		*rqos_debugfs_dir;
>  #endif
> diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
> index 3b6ff5902edc..eb6db276e293 100644
> --- a/include/linux/blktrace_api.h
> +++ b/include/linux/blktrace_api.h
> @@ -22,7 +22,6 @@ struct blk_trace {
>  	u64 end_lba;
>  	u32 pid;
>  	u32 dev;
> -	struct dentry *dir;
>  	struct dentry *dropped_file;
>  	struct dentry *msg_file;
>  	struct list_head running_list;
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index ca39dc3230cb..15086227592f 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -311,7 +311,6 @@ static void blk_trace_free(struct blk_trace *bt)
>  	debugfs_remove(bt->msg_file);
>  	debugfs_remove(bt->dropped_file);
>  	relay_close(bt->rchan);
> -	debugfs_remove(bt->dir);
>  	free_percpu(bt->sequence);
>  	free_percpu(bt->msg_data);
>  	kfree(bt);
> @@ -476,7 +475,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
>  			      struct blk_user_trace_setup *buts)
>  {
>  	struct blk_trace *bt = NULL;
> -	struct dentry *dir = NULL;
>  	int ret;
>  
>  	if (!buts->buf_size || !buts->buf_nr)
> @@ -485,6 +483,9 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
>  	if (!blk_debugfs_root)
>  		return -ENOENT;
>  
> +	if (!q->debugfs_dir)
> +		return -ENOENT;
> +
>  	strncpy(buts->name, name, BLKTRACE_BDEV_SIZE);
>  	buts->name[BLKTRACE_BDEV_SIZE - 1] = '\0';
>  
> @@ -509,21 +510,19 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
>  
>  	ret = -ENOENT;
>  
> -	dir = debugfs_lookup(buts->name, blk_debugfs_root);
> -	if (!dir)
> -		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> -
>  	bt->dev = dev;
>  	atomic_set(&bt->dropped, 0);
>  	INIT_LIST_HEAD(&bt->running_list);
>  
>  	ret = -EIO;
> -	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
> +	bt->dropped_file = debugfs_create_file("dropped", 0444,
> +					       q->debugfs_dir, bt,
>  					       &blk_dropped_fops);
>  
> -	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
> +	bt->msg_file = debugfs_create_file("msg", 0222, q->debugfs_dir,
> +					   bt, &blk_msg_fops);
>  
> -	bt->rchan = relay_open("trace", dir, buts->buf_size,
> +	bt->rchan = relay_open("trace", q->debugfs_dir, buts->buf_size,
>  				buts->buf_nr, &blk_relay_callbacks, bt);

In theory, multiple partitions can be traced concurrently, but looks
it never works, so it won't cause trouble for multiple partition trace.

One userspace visible change is that blktrace debugfs dir name is switched 
to disk name from partition name in case of partition trace, will it
break some utilities?

Thanks,
Ming



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-14  4:19 ` [PATCH 3/5] blktrace: refcount the request_queue during ioctl Luis Chamberlain
  2020-04-14 15:40   ` Christoph Hellwig
@ 2020-04-16  2:31   ` Ming Lei
  2020-04-16  5:36     ` Luis Chamberlain
  1 sibling, 1 reply; 53+ messages in thread
From: Ming Lei @ 2020-04-16  2:31 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On Tue, Apr 14, 2020 at 04:19:00AM +0000, Luis Chamberlain wrote:
> Ensure that the request_queue is refcounted during its full
> ioctl cycle. This avoids possible races against removal, given
> blk_get_queue() also checks to ensure the queue is not dying.
> 
> This small race is possible if you defer removal of the request_queue
> and userspace fires off an ioctl for the device in the meantime.
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Nicolai Stange <nstange@suse.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: yu kuai <yukuai3@huawei.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  kernel/trace/blktrace.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 15086227592f..17e144d15779 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -701,6 +701,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
>  	if (!q)
>  		return -ENXIO;
>  
> +	if (!blk_get_queue(q))
> +		return -ENXIO;
> +
>  	mutex_lock(&q->blk_trace_mutex);
>  
>  	switch (cmd) {
> @@ -729,6 +732,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
>  	}
>  
>  	mutex_unlock(&q->blk_trace_mutex);
> +
> +	blk_put_queue(q);
> +
>  	return ret;
>  }

Actually when bdev is opened, one extra refcount is held on gendisk, so
gendisk won't go away. And __device_add_disk() does grab one extra
refcount on request queue, so request queue shouldn't go away when ioctl
is running.

Can you describe a bit more what the issue is to be addressed by this
patch?

Thanks,
Ming



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/5] block: revert back to synchronous request_queue removal
  2020-04-14  4:19 ` [PATCH 5/5] block: revert back to synchronous request_queue removal Luis Chamberlain
  2020-04-14 15:47   ` Christoph Hellwig
@ 2020-04-16  2:36   ` Ming Lei
  1 sibling, 0 replies; 53+ messages in thread
From: Ming Lei @ 2020-04-16  2:36 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On Tue, Apr 14, 2020 at 04:19:02AM +0000, Luis Chamberlain wrote:
> Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
> v4.12 moved the work behind blk_release_queue() into a workqueue after a
> splat floated around which indicated some work on blk_release_queue()
> could sleep in blk_exit_rl(). This splat would be possible when a driver
> called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
> as its final call) from an atomic context.
> 
> blk_put_queue() decrements the refcount for the request_queue
> kobject, and upon reaching 0 blk_release_queue() is called. Although
> blk_exit_rl() is now removed through commit db6d9952356 ("block: remove
> request_list code"), we reserve the right to be able to sleep within
> blk_release_queue() context. If you see no other way and *have* be
> in atomic context when you driver calls the last blk_put_queue()
> you can always just increase your block device's reference count with
> bdgrab() as this can be done in atomic context and the request_queue
> removal would be left to upper layers later. We document this bit of
> tribal knowledge as well now, and adjust kdoc format a bit.
> 
> We revert back to synchronous request_queue removal because asynchronous
> removal creates a regression with expected userspace interaction with
> several drivers. An example is when removing the loopback driver and
> issues ioctl from userspace to do so, upon return and if successful one
> expects the device to be removed. Moving to asynchronous request_queue
> removal could have broken many scripts which relied on the removal to
> have been completed if there was no error.
> 
> Using asynchronous request_queue removal however has helped us find
> other bugs, in the future we can test what could break with this
> arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Nicolai Stange <nstange@suse.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: yu kuai <yukuai3@huawei.com>
> Suggested-by: Nicolai Stange <nstange@suse.de>
> Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression")
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  block/blk-core.c       | 19 ++++++++++++++++++-
>  block/blk-sysfs.c      | 38 +++++++++++++++++---------------------
>  include/linux/blkdev.h |  2 --
>  3 files changed, 35 insertions(+), 24 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 5aaae7a1b338..8346c7c59ee6 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -301,6 +301,17 @@ void blk_clear_pm_only(struct request_queue *q)
>  }
>  EXPORT_SYMBOL_GPL(blk_clear_pm_only);
>  
> +/**
> + * blk_put_queue - decrement the request_queue refcount
> + *
> + * Decrements the refcount to the request_queue kobject, when this reaches
> + * 0 we'll have blk_release_queue() called. You should avoid calling
> + * this function in atomic context but if you really have to ensure you
> + * first refcount the block device with bdgrab() / bdput() so that the
> + * last decrement happens in blk_cleanup_queue().
> + *
> + * @q: the request_queue structure to decrement the refcount for
> + */
>  void blk_put_queue(struct request_queue *q)
>  {
>  	kobject_put(&q->kobj);
> @@ -328,10 +339,16 @@ EXPORT_SYMBOL_GPL(blk_set_queue_dying);
>  
>  /**
>   * blk_cleanup_queue - shutdown a request queue
> - * @q: request queue to shutdown
>   *
>   * Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
>   * put it.  All future requests will be failed immediately with -ENODEV.
> + *
> + * You should not call this function in atomic context. If you need to
> + * refcount a request_queue in atomic context, instead refcount the
> + * block device with bdgrab() / bdput().
> + *
> + * @q: request queue to shutdown
> + *
>   */
>  void blk_cleanup_queue(struct request_queue *q)
>  {
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 0285d67e1e4c..859911191ebc 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -860,22 +860,27 @@ static void blk_exit_queue(struct request_queue *q)
>  	bdi_put(q->backing_dev_info);
>  }
>  
> -
>  /**
> - * __blk_release_queue - release a request queue
> - * @work: pointer to the release_work member of the request queue to be released
> + * blk_release_queue - release a request queue
> + *
> + * This function is called as part of the process when a block device is being
> + * unregistered. Releasing a request queue starts with blk_cleanup_queue(),
> + * which set the appropriate flags and then calls blk_put_queue() as the last
> + * step. blk_put_queue() decrements the reference counter of the request queue
> + * and once the reference counter reaches zero, this function is called to
> + * release all allocated resources of the request queue.
>   *
> - * Description:
> - *     This function is called when a block device is being unregistered. The
> - *     process of releasing a request queue starts with blk_cleanup_queue, which
> - *     set the appropriate flags and then calls blk_put_queue, that decrements
> - *     the reference counter of the request queue. Once the reference counter
> - *     of the request queue reaches zero, blk_release_queue is called to release
> - *     all allocated resources of the request queue.
> + * This function can sleep, and so we must ensure that the very last
> + * blk_put_queue() is never called from atomic context.
> + *
> + * @kobj: pointer to a kobject, who's container is a request_queue
>   */
> -static void __blk_release_queue(struct work_struct *work)
> +static void blk_release_queue(struct kobject *kobj)
>  {
> -	struct request_queue *q = container_of(work, typeof(*q), release_work);
> +	struct request_queue *q =
> +		container_of(kobj, struct request_queue, kobj);
> +
> +	might_sleep();
>  
>  	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
>  		blk_stat_remove_callback(q, q->poll_cb);
> @@ -905,15 +910,6 @@ static void __blk_release_queue(struct work_struct *work)
>  	call_rcu(&q->rcu_head, blk_free_queue_rcu);
>  }
>  
> -static void blk_release_queue(struct kobject *kobj)
> -{
> -	struct request_queue *q =
> -		container_of(kobj, struct request_queue, kobj);
> -
> -	INIT_WORK(&q->release_work, __blk_release_queue);
> -	schedule_work(&q->release_work);
> -}
> -
>  static const struct sysfs_ops queue_sysfs_ops = {
>  	.show	= queue_attr_show,
>  	.store	= queue_attr_store,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index cc43c8e6516c..81f7ddb1587e 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -582,8 +582,6 @@ struct request_queue {
>  
>  	size_t			cmd_size;
>  
> -	struct work_struct	release_work;
> -
>  #define BLK_MAX_WRITE_HINTS	5
>  	u64			write_hints[BLK_MAX_WRITE_HINTS];
>  };
> -- 
> 2.25.1
> 

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-16  1:12             ` Luis Chamberlain
@ 2020-04-16  3:43               ` Bart Van Assche
  2020-04-16  5:29                 ` Luis Chamberlain
  0 siblings, 1 reply; 53+ messages in thread
From: Bart Van Assche @ 2020-04-16  3:43 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm, mhocko, yukuai3, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Omar Sandoval,
	Hannes Reinecke, Michal Hocko

On 2020-04-15 18:12, Luis Chamberlain wrote:
> On Wed, Apr 15, 2020 at 07:18:22AM -0700, Bart Van Assche wrote:
>> blk_get_queue() prevents concurrent freeing of struct request_queue but
>> does not prevent concurrent blk_cleanup_queue() calls.
> 
> Wouldn't concurrent blk_cleanup_queue() calls be a bug? If so should
> I make it clear that it would be or simply prevent it?

I think calling blk_cleanup_queue() while the queue refcount > 0 is well
established behavior. At least the SCSI core triggers that behavior
since a very long time. I prefer not to change that behavior.

Regarding patch 3/5: how about dropping that patch? If the queue
refcount can drop to zero while blk_trace_ioctl() is in progress I think
that should be fixed in the block_device_operations.open callback
instead of in blk_trace_ioctl().

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-16  2:10   ` Ming Lei
@ 2020-04-16  5:25     ` Luis Chamberlain
  2020-04-16  5:47       ` Ming Lei
  0 siblings, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-16  5:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Thu, Apr 16, 2020 at 10:10:36AM +0800, Ming Lei wrote:
> In theory, multiple partitions can be traced concurrently, but looks
> it never works, so it won't cause trouble for multiple partition trace.
> 
> One userspace visible change is that blktrace debugfs dir name is switched 
> to disk name from partition name in case of partition trace, will it
> break some utilities?

How is this possible, its not clear to me, we go from:

-	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
-					    blk_debugfs_root);

To this:

+	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
+					    blk_debugfs_root);


Maybe I am overlooking something.

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-16  3:43               ` Bart Van Assche
@ 2020-04-16  5:29                 ` Luis Chamberlain
  0 siblings, 0 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-16  5:29 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, axboe, viro, gregkh, rostedt, mingo, jack,
	ming.lei, nstange, akpm, mhocko, yukuai3, linux-block,
	linux-fsdevel, linux-mm, linux-kernel, Omar Sandoval,
	Hannes Reinecke, Michal Hocko

On Wed, Apr 15, 2020 at 08:43:32PM -0700, Bart Van Assche wrote:
> On 2020-04-15 18:12, Luis Chamberlain wrote:
> > On Wed, Apr 15, 2020 at 07:18:22AM -0700, Bart Van Assche wrote:
> >> blk_get_queue() prevents concurrent freeing of struct request_queue but
> >> does not prevent concurrent blk_cleanup_queue() calls.
> > 
> > Wouldn't concurrent blk_cleanup_queue() calls be a bug? If so should
> > I make it clear that it would be or simply prevent it?
> 
> I think calling blk_cleanup_queue() while the queue refcount > 0 is well
> established behavior. At least the SCSI core triggers that behavior
> since a very long time. I prefer not to change that behavior.

I see. An alternative is to simply check if we already are cleaning up
and if so abort early on the blk_cleanup_queue(). That would allow
re-entrant calls, and just be a no-op to the additional calls. Or is
the re-entrant, two attemps to really do all the work
blk_cleanup_queue() expected functionality already?

> Regarding patch 3/5: how about dropping that patch? If the queue
> refcount can drop to zero while blk_trace_ioctl() is in progress I think
> that should be fixed in the block_device_operations.open callback
> instead of in blk_trace_ioctl().

I'll take a look, thanks!

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/5] blktrace: refcount the request_queue during ioctl
  2020-04-16  2:31   ` Ming Lei
@ 2020-04-16  5:36     ` Luis Chamberlain
  0 siblings, 0 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-16  5:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On Thu, Apr 16, 2020 at 10:31:22AM +0800, Ming Lei wrote:
> On Tue, Apr 14, 2020 at 04:19:00AM +0000, Luis Chamberlain wrote:
> > Ensure that the request_queue is refcounted during its full
> > ioctl cycle. This avoids possible races against removal, given
> > blk_get_queue() also checks to ensure the queue is not dying.
> > 
> > This small race is possible if you defer removal of the request_queue
> > and userspace fires off an ioctl for the device in the meantime.
> > 
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Omar Sandoval <osandov@fb.com>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Nicolai Stange <nstange@suse.de>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: yu kuai <yukuai3@huawei.com>
> > Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >  kernel/trace/blktrace.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> > index 15086227592f..17e144d15779 100644
> > --- a/kernel/trace/blktrace.c
> > +++ b/kernel/trace/blktrace.c
> > @@ -701,6 +701,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
> >  	if (!q)
> >  		return -ENXIO;
> >  
> > +	if (!blk_get_queue(q))
> > +		return -ENXIO;
> > +
> >  	mutex_lock(&q->blk_trace_mutex);
> >  
> >  	switch (cmd) {
> > @@ -729,6 +732,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
> >  	}
> >  
> >  	mutex_unlock(&q->blk_trace_mutex);
> > +
> > +	blk_put_queue(q);
> > +
> >  	return ret;
> >  }
> 
> Actually when bdev is opened, one extra refcount is held on gendisk, so
> gendisk won't go away. And __device_add_disk() does grab one extra
> refcount on request queue, so request queue shouldn't go away when ioctl
> is running.

Alright, then yes, this should not be needed.

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-16  5:25     ` Luis Chamberlain
@ 2020-04-16  5:47       ` Ming Lei
  2020-04-16  6:09         ` Ming Lei
  2020-04-16  6:20         ` Luis Chamberlain
  0 siblings, 2 replies; 53+ messages in thread
From: Ming Lei @ 2020-04-16  5:47 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Thu, Apr 16, 2020 at 05:25:24AM +0000, Luis Chamberlain wrote:
> On Thu, Apr 16, 2020 at 10:10:36AM +0800, Ming Lei wrote:
> > In theory, multiple partitions can be traced concurrently, but looks
> > it never works, so it won't cause trouble for multiple partition trace.
> > 
> > One userspace visible change is that blktrace debugfs dir name is switched 
> > to disk name from partition name in case of partition trace, will it
> > break some utilities?
> 
> How is this possible, its not clear to me, we go from:
> 
> -	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> -					    blk_debugfs_root);
> 
> To this:
> 
> +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> +					    blk_debugfs_root);
> 
> 
> Maybe I am overlooking something.

Your patch removes the blktrace debugfs dir:

do_blk_trace_setup()

-       dir = debugfs_lookup(buts->name, blk_debugfs_root);
-       if (!dir)
-               bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
-

Then create blktrace attributes under the dir of q->debugfs_dir.

However, buts->name could be one partition device name, but
q->debugfs_dir has to be disk name.

This change is visible to blktrace utilities.

Thanks,
Ming



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-16  5:47       ` Ming Lei
@ 2020-04-16  6:09         ` Ming Lei
  2020-04-16  6:22           ` Luis Chamberlain
  2020-04-16  6:20         ` Luis Chamberlain
  1 sibling, 1 reply; 53+ messages in thread
From: Ming Lei @ 2020-04-16  6:09 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Thu, Apr 16, 2020 at 01:47:50PM +0800, Ming Lei wrote:
> On Thu, Apr 16, 2020 at 05:25:24AM +0000, Luis Chamberlain wrote:
> > On Thu, Apr 16, 2020 at 10:10:36AM +0800, Ming Lei wrote:
> > > In theory, multiple partitions can be traced concurrently, but looks
> > > it never works, so it won't cause trouble for multiple partition trace.
> > > 
> > > One userspace visible change is that blktrace debugfs dir name is switched 
> > > to disk name from partition name in case of partition trace, will it
> > > break some utilities?
> > 
> > How is this possible, its not clear to me, we go from:
> > 
> > -	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > -					    blk_debugfs_root);
> > 
> > To this:
> > 
> > +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > +					    blk_debugfs_root);
> > 
> > 
> > Maybe I am overlooking something.
> 
> Your patch removes the blktrace debugfs dir:
> 
> do_blk_trace_setup()
> 
> -       dir = debugfs_lookup(buts->name, blk_debugfs_root);
> -       if (!dir)
> -               bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> -
> 
> Then create blktrace attributes under the dir of q->debugfs_dir.
> 
> However, buts->name could be one partition device name, but
> q->debugfs_dir has to be disk name.
> 
> This change is visible to blktrace utilities.

Just test the 1st two patches via "blktrace /dev/sda2", follows the
result, so this way can't be accepted.

[root@ktest-01 ~]# blktrace /dev/sda2
Thread 0 failed open /sys/kernel/debug/block/sda2/trace0: 2/No such file or directory
Thread 4 failed open /sys/kernel/debug/block/sda2/trace4: 2/No such file or directory
Thread 1 failed open /sys/kernel/debug/block/sda2/trace1: 2/No such file or directory
Thread 2 failed open /sys/kernel/debug/block/sda2/trace2: 2/No such file or directory
Thread 5 failed open /sys/kernel/debug/block/sda2/trace5: 2/No such file or directory
Thread 3 failed open /sys/kernel/debug/block/sda2/trace3: 2/No such file or directory
Thread 6 failed open /sys/kernel/debug/block/sda2/trace6: 2/No such file or directory
Thread 7 failed open /sys/kernel/debug/block/sda2/trace7: 2/No such file or directory
FAILED to start thread on CPU 0: 1/Operation not permitted
FAILED to start thread on CPU 1: 1/Operation not permitted
FAILED to start thread on CPU 2: 1/Operation not permitted
FAILED to start thread on CPU 3: 1/Operation not permitted
FAILED to start thread on CPU 4: 1/Operation not permitted
FAILED to start thread on CPU 5: 1/Operation not permitted
FAILED to start thread on CPU 6: 1/Operation not permitted
FAILED to start thread on CPU 7: 1/Operation not permitted



Thanks, 
Ming



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle()
  2020-04-15 13:19           ` Luis Chamberlain
@ 2020-04-16  6:10             ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2020-04-16  6:10 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Alan Jenkins, axboe, viro, bvanassche, gregkh,
	rostedt, mingo, jack, ming.lei, nstange, akpm, mhocko, yukuai3,
	linux-block, linux-fsdevel, linux-mm, linux-kernel,
	Omar Sandoval, Hannes Reinecke, Michal Hocko

On Wed, Apr 15, 2020 at 01:19:15PM +0000, Luis Chamberlain wrote:
> >  	if (current->throttle_queue)
> >  		return;
> > +	if (unlikely(current->flags & PF_KTHREAD))
> > +		return;
> >  
> >  	spin_lock(&swap_avail_lock);
> >  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
> >  				  avail_lists[node]) {
> > -		if (si->bdev) {
> > -			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
> > -						true);
> > -			break;
> > +		if (!si->bdev)
> > +			continue;
> > +		if (blk_get_queue(dev_get_queue(si->bdev))) {
> > +			current->throttle_queue = dev_get_queue(si->bdev);
> > +			current->use_memdelay = true;
> > +			set_notify_resume(current);
> >  		}
> > +		break;
> >  	}
> >  	spin_unlock(&swap_avail_lock);
> >  }
> 
> Sorry, its not clear to me  who calls the respective blk_put_queue()
> here?

If you look at blkcg_schedule_throttle, it only puts the queue that
was in current->throttle_queue.  But mem_cgroup_throttle_swaprate
exits early when current->throttle_queue is non-zero (first two lines
quote above).  So when called from mem_cgroup_throttle_swaprate,
blkcg_schedule_throttle should never actually put a queue.  Open
coding the few relevant lines from blkcg_schedule_throttle in
mem_cgroup_throttle_swaprate makes that obvious.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-16  5:47       ` Ming Lei
  2020-04-16  6:09         ` Ming Lei
@ 2020-04-16  6:20         ` Luis Chamberlain
  2020-04-16  6:28           ` Ming Lei
  1 sibling, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-16  6:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Thu, Apr 16, 2020 at 01:47:50PM +0800, Ming Lei wrote:
> On Thu, Apr 16, 2020 at 05:25:24AM +0000, Luis Chamberlain wrote:
> > On Thu, Apr 16, 2020 at 10:10:36AM +0800, Ming Lei wrote:
> > > In theory, multiple partitions can be traced concurrently, but looks
> > > it never works, so it won't cause trouble for multiple partition trace.
> > > 
> > > One userspace visible change is that blktrace debugfs dir name is switched 
> > > to disk name from partition name in case of partition trace, will it
> > > break some utilities?
> > 
> > How is this possible, its not clear to me, we go from:
> > 
> > -	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > -					    blk_debugfs_root);
> > 
> > To this:
> > 
> > +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > +					    blk_debugfs_root);
> > 
> > 
> > Maybe I am overlooking something.
> 
> Your patch removes the blktrace debugfs dir:
> 
> do_blk_trace_setup()
> 
> -       dir = debugfs_lookup(buts->name, blk_debugfs_root);
> -       if (!dir)
> -               bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> -
> 
> Then create blktrace attributes under the dir of q->debugfs_dir.
> 
> However, buts->name could be one partition device name, but

I can see how buts->name is set to bdevname() which expands to
disk_name(bdev->bd_disk, bdev->bd_part->partno, buf).

> q->debugfs_dir has to be disk name.

I can't see this, can you point me to where it is clear the
request_queue kobject's parent is sure to be the disk name?

If it is different, the issue I don't think should be debugfs, but
the bigger issue would be that blktrace on two different partitions
would clash.

Also, the *old* lookup intent on partitions always would fail on mq
and we'd end up creating a directory.

I think we'd need to create a directory per partition, even when we
don't use blktrace. That makes this more complex than I'd hope for.

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle()
  2020-04-14  4:19 ` [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle() Luis Chamberlain
  2020-04-14 15:44   ` Christoph Hellwig
@ 2020-04-16  6:22   ` Ming Lei
  2020-04-16  6:25     ` Luis Chamberlain
  1 sibling, 1 reply; 53+ messages in thread
From: Ming Lei @ 2020-04-16  6:22 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On Tue, Apr 14, 2020 at 04:19:01AM +0000, Luis Chamberlain wrote:
> block devices are refcounted so to ensure once its final user goes away it
> can be cleaned up by the lower layers properly. The block device's
> request_queue structure is also refcounted, however, if the last
> blk_put_queue() is called under atomic context the block layer has
> to defer removal.
> 
> By refcounting the block device during the use of blkcg_schedule_throttle(),
> we ensure ensure two things:
> 
> 1) the block device remains available during the call
> 2) we ensure avoid having to deal with the fact we're using the
>    request_queue structure in atomic context, since the last
>    blk_put_queue() will be called upon disk_release(), *after*
>    our own bdput().
> 
> This means this code path is *not* going to remove the request_queue
> structure, as we are ensuring some later upper layer disk_release()
> will be the one to release the request_queue structure for us.
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Nicolai Stange <nstange@suse.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: yu kuai <yukuai3@huawei.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  mm/swapfile.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6659ab563448..9285ff6030ca 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3753,6 +3753,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
>  void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
>  				  gfp_t gfp_mask)
>  {
> +	struct block_device *bdev;
>  	struct swap_info_struct *si, *next;
>  	if (!(gfp_mask & __GFP_IO) || !memcg)
>  		return;
> @@ -3771,8 +3772,17 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
>  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
>  				  avail_lists[node]) {
>  		if (si->bdev) {
> -			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
> -						true);
> +			bdev = bdgrab(si->bdev);

When swapon, the block_device has been opened in claim_swapfile(),
so no need to worry about the queue being gone here.


Thanks,
Ming



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-16  6:09         ` Ming Lei
@ 2020-04-16  6:22           ` Luis Chamberlain
  0 siblings, 0 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-16  6:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Thu, Apr 16, 2020 at 02:09:21PM +0800, Ming Lei wrote:
> On Thu, Apr 16, 2020 at 01:47:50PM +0800, Ming Lei wrote:
> > On Thu, Apr 16, 2020 at 05:25:24AM +0000, Luis Chamberlain wrote:
> > > On Thu, Apr 16, 2020 at 10:10:36AM +0800, Ming Lei wrote:
> > > > In theory, multiple partitions can be traced concurrently, but looks
> > > > it never works, so it won't cause trouble for multiple partition trace.
> > > > 
> > > > One userspace visible change is that blktrace debugfs dir name is switched 
> > > > to disk name from partition name in case of partition trace, will it
> > > > break some utilities?
> > > 
> > > How is this possible, its not clear to me, we go from:
> > > 
> > > -	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > > -					    blk_debugfs_root);
> > > 
> > > To this:
> > > 
> > > +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > > +					    blk_debugfs_root);
> > > 
> > > 
> > > Maybe I am overlooking something.
> > 
> > Your patch removes the blktrace debugfs dir:
> > 
> > do_blk_trace_setup()
> > 
> > -       dir = debugfs_lookup(buts->name, blk_debugfs_root);
> > -       if (!dir)
> > -               bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> > -
> > 
> > Then create blktrace attributes under the dir of q->debugfs_dir.
> > 
> > However, buts->name could be one partition device name, but
> > q->debugfs_dir has to be disk name.
> > 
> > This change is visible to blktrace utilities.
> 
> Just test the 1st two patches via "blktrace /dev/sda2", follows the
> result, so this way can't be accepted.
> 
> [root@ktest-01 ~]# blktrace /dev/sda2
> Thread 0 failed open /sys/kernel/debug/block/sda2/trace0: 2/No such file or directory
> Thread 4 failed open /sys/kernel/debug/block/sda2/trace4: 2/No such file or directory
> Thread 1 failed open /sys/kernel/debug/block/sda2/trace1: 2/No such file or directory
> Thread 2 failed open /sys/kernel/debug/block/sda2/trace2: 2/No such file or directory
> Thread 5 failed open /sys/kernel/debug/block/sda2/trace5: 2/No such file or directory
> Thread 3 failed open /sys/kernel/debug/block/sda2/trace3: 2/No such file or directory
> Thread 6 failed open /sys/kernel/debug/block/sda2/trace6: 2/No such file or directory
> Thread 7 failed open /sys/kernel/debug/block/sda2/trace7: 2/No such file or directory
> FAILED to start thread on CPU 0: 1/Operation not permitted
> FAILED to start thread on CPU 1: 1/Operation not permitted
> FAILED to start thread on CPU 2: 1/Operation not permitted
> FAILED to start thread on CPU 3: 1/Operation not permitted
> FAILED to start thread on CPU 4: 1/Operation not permitted
> FAILED to start thread on CPU 5: 1/Operation not permitted
> FAILED to start thread on CPU 6: 1/Operation not permitted
> FAILED to start thread on CPU 7: 1/Operation not permitted

Thanks, as I noted, I think we'd need to pre-create the directories per
parition. Let me know if you think of a better alternative.

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle()
  2020-04-16  6:22   ` Ming Lei
@ 2020-04-16  6:25     ` Luis Chamberlain
  2020-04-16  6:34       ` Ming Lei
  0 siblings, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-16  6:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On Thu, Apr 16, 2020 at 02:22:22PM +0800, Ming Lei wrote:
> On Tue, Apr 14, 2020 at 04:19:01AM +0000, Luis Chamberlain wrote:
> > block devices are refcounted so to ensure once its final user goes away it
> > can be cleaned up by the lower layers properly. The block device's
> > request_queue structure is also refcounted, however, if the last
> > blk_put_queue() is called under atomic context the block layer has
> > to defer removal.
> > 
> > By refcounting the block device during the use of blkcg_schedule_throttle(),
> > we ensure ensure two things:
> > 
> > 1) the block device remains available during the call
> > 2) we ensure avoid having to deal with the fact we're using the
> >    request_queue structure in atomic context, since the last
> >    blk_put_queue() will be called upon disk_release(), *after*
> >    our own bdput().
> > 
> > This means this code path is *not* going to remove the request_queue
> > structure, as we are ensuring some later upper layer disk_release()
> > will be the one to release the request_queue structure for us.
> > 
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Omar Sandoval <osandov@fb.com>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Nicolai Stange <nstange@suse.de>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: yu kuai <yukuai3@huawei.com>
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >  mm/swapfile.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 6659ab563448..9285ff6030ca 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -3753,6 +3753,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
> >  void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
> >  				  gfp_t gfp_mask)
> >  {
> > +	struct block_device *bdev;
> >  	struct swap_info_struct *si, *next;
> >  	if (!(gfp_mask & __GFP_IO) || !memcg)
> >  		return;
> > @@ -3771,8 +3772,17 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
> >  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
> >  				  avail_lists[node]) {
> >  		if (si->bdev) {
> > -			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
> > -						true);
> > +			bdev = bdgrab(si->bdev);
> 
> When swapon, the block_device has been opened in claim_swapfile(),
> so no need to worry about the queue being gone here.

Thanks, so why bdev_get_queue() before?

   Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-16  6:20         ` Luis Chamberlain
@ 2020-04-16  6:28           ` Ming Lei
  2020-04-17  4:09             ` Luis Chamberlain
  0 siblings, 1 reply; 53+ messages in thread
From: Ming Lei @ 2020-04-16  6:28 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko,
	syzbot+603294af2d01acfdd6da

On Thu, Apr 16, 2020 at 06:20:54AM +0000, Luis Chamberlain wrote:
> On Thu, Apr 16, 2020 at 01:47:50PM +0800, Ming Lei wrote:
> > On Thu, Apr 16, 2020 at 05:25:24AM +0000, Luis Chamberlain wrote:
> > > On Thu, Apr 16, 2020 at 10:10:36AM +0800, Ming Lei wrote:
> > > > In theory, multiple partitions can be traced concurrently, but looks
> > > > it never works, so it won't cause trouble for multiple partition trace.
> > > > 
> > > > One userspace visible change is that blktrace debugfs dir name is switched 
> > > > to disk name from partition name in case of partition trace, will it
> > > > break some utilities?
> > > 
> > > How is this possible, its not clear to me, we go from:
> > > 
> > > -	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > > -					    blk_debugfs_root);
> > > 
> > > To this:
> > > 
> > > +	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > > +					    blk_debugfs_root);
> > > 
> > > 
> > > Maybe I am overlooking something.
> > 
> > Your patch removes the blktrace debugfs dir:
> > 
> > do_blk_trace_setup()
> > 
> > -       dir = debugfs_lookup(buts->name, blk_debugfs_root);
> > -       if (!dir)
> > -               bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> > -
> > 
> > Then create blktrace attributes under the dir of q->debugfs_dir.
> > 
> > However, buts->name could be one partition device name, but
> 
> I can see how buts->name is set to bdevname() which expands to
> disk_name(bdev->bd_disk, bdev->bd_part->partno, buf).
> 
> > q->debugfs_dir has to be disk name.
> 
> I can't see this, can you point me to where it is clear the
> request_queue kobject's parent is sure to be the disk name?

blk_register_queue():
	...
	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
	...
> 
> If it is different, the issue I don't think should be debugfs, but
> the bigger issue would be that blktrace on two different partitions
> would clash.
> 
> Also, the *old* lookup intent on partitions always would fail on mq
> and we'd end up creating a directory.
> 
> I think we'd need to create a directory per partition, even when we
> don't use blktrace. That makes this more complex than I'd hope for.

Anyway, the current ABI can't be broken, also I'd suggest to understand
how the userpace utility uses blktrace syscall interfaces first before
figuring any new approach.

Thanks, 
Ming



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle()
  2020-04-16  6:25     ` Luis Chamberlain
@ 2020-04-16  6:34       ` Ming Lei
  0 siblings, 0 replies; 53+ messages in thread
From: Ming Lei @ 2020-04-16  6:34 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: axboe, viro, bvanassche, gregkh, rostedt, mingo, jack, nstange,
	akpm, mhocko, yukuai3, linux-block, linux-fsdevel, linux-mm,
	linux-kernel, Omar Sandoval, Hannes Reinecke, Michal Hocko

On Thu, Apr 16, 2020 at 06:25:32AM +0000, Luis Chamberlain wrote:
> On Thu, Apr 16, 2020 at 02:22:22PM +0800, Ming Lei wrote:
> > On Tue, Apr 14, 2020 at 04:19:01AM +0000, Luis Chamberlain wrote:
> > > block devices are refcounted so to ensure once its final user goes away it
> > > can be cleaned up by the lower layers properly. The block device's
> > > request_queue structure is also refcounted, however, if the last
> > > blk_put_queue() is called under atomic context the block layer has
> > > to defer removal.
> > > 
> > > By refcounting the block device during the use of blkcg_schedule_throttle(),
> > > we ensure ensure two things:
> > > 
> > > 1) the block device remains available during the call
> > > 2) we ensure avoid having to deal with the fact we're using the
> > >    request_queue structure in atomic context, since the last
> > >    blk_put_queue() will be called upon disk_release(), *after*
> > >    our own bdput().
> > > 
> > > This means this code path is *not* going to remove the request_queue
> > > structure, as we are ensuring some later upper layer disk_release()
> > > will be the one to release the request_queue structure for us.
> > > 
> > > Cc: Bart Van Assche <bvanassche@acm.org>
> > > Cc: Omar Sandoval <osandov@fb.com>
> > > Cc: Hannes Reinecke <hare@suse.com>
> > > Cc: Nicolai Stange <nstange@suse.de>
> > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > Cc: Michal Hocko <mhocko@kernel.org>
> > > Cc: yu kuai <yukuai3@huawei.com>
> > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > > ---
> > >  mm/swapfile.c | 14 ++++++++++++--
> > >  1 file changed, 12 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 6659ab563448..9285ff6030ca 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -3753,6 +3753,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
> > >  void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
> > >  				  gfp_t gfp_mask)
> > >  {
> > > +	struct block_device *bdev;
> > >  	struct swap_info_struct *si, *next;
> > >  	if (!(gfp_mask & __GFP_IO) || !memcg)
> > >  		return;
> > > @@ -3771,8 +3772,17 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
> > >  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
> > >  				  avail_lists[node]) {
> > >  		if (si->bdev) {
> > > -			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
> > > -						true);
> > > +			bdev = bdgrab(si->bdev);
> > 
> > When swapon, the block_device has been opened in claim_swapfile(),
> > so no need to worry about the queue being gone here.
> 
> Thanks, so why bdev_get_queue() before?

bdev_get_queue() returns the request queue associated with the
the block device, and it is just that blkcg_schedule_throttle() needs
it.

Maybe I misunderstood your question, if yes, please explain it in
a bit detail.

Thanks,
Ming



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5] blktrace: fix debugfs use after free
  2020-04-16  6:28           ` Ming Lei
@ 2020-04-17  4:09             ` Luis Chamberlain
  0 siblings, 0 replies; 53+ messages in thread
From: Luis Chamberlain @ 2020-04-17  4:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Al Viro, Bart Van Assche, Greg Kroah-Hartman,
	Steven Rostedt, Ingo Molnar, Jan Kara, Nicolai Stange,
	Andrew Morton, Michal Hocko, yu kuai, linux-block,
	Linux FS Devel, linux-mm, linux-kernel, Omar Sandoval,
	Hannes Reinecke, Michal Hocko, syzbot+603294af2d01acfdd6da

On Thu, Apr 16, 2020 at 12:29 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Thu, Apr 16, 2020 at 06:20:54AM +0000, Luis Chamberlain wrote:
> > On Thu, Apr 16, 2020 at 01:47:50PM +0800, Ming Lei wrote:
> > > On Thu, Apr 16, 2020 at 05:25:24AM +0000, Luis Chamberlain wrote:
> > > > On Thu, Apr 16, 2020 at 10:10:36AM +0800, Ming Lei wrote:
> > > > > In theory, multiple partitions can be traced concurrently, but looks
> > > > > it never works, so it won't cause trouble for multiple partition trace.
> > > > >
> > > > > One userspace visible change is that blktrace debugfs dir name is switched
> > > > > to disk name from partition name in case of partition trace, will it
> > > > > break some utilities?
> > > >
> > > > How is this possible, its not clear to me, we go from:
> > > >
> > > > - q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > > > -                                     blk_debugfs_root);
> > > >
> > > > To this:
> > > >
> > > > + q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > > > +                                     blk_debugfs_root);
> > > >
> > > >
> > > > Maybe I am overlooking something.
> > >
> > > Your patch removes the blktrace debugfs dir:
> > >
> > > do_blk_trace_setup()
> > >
> > > -       dir = debugfs_lookup(buts->name, blk_debugfs_root);
> > > -       if (!dir)
> > > -               bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> > > -
> > >
> > > Then create blktrace attributes under the dir of q->debugfs_dir.
> > >
> > > However, buts->name could be one partition device name, but
> >
> > I can see how buts->name is set to bdevname() which expands to
> > disk_name(bdev->bd_disk, bdev->bd_part->partno, buf).
> >
> > > q->debugfs_dir has to be disk name.
> >
> > I can't see this, can you point me to where it is clear the
> > request_queue kobject's parent is sure to be the disk name?
>
> blk_register_queue():
>         ...
>         ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
>         ...

Alright, I have a fix for this now, and I do have also a further
explanation as to *why* the debugfs_lookup() doesn't help us here.
I'll follow up with more patches.

  Luis


^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2020-04-17  4:09 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-14  4:18 [PATCH 0/5] blktrace: fix use after free Luis Chamberlain
2020-04-14  4:18 ` [PATCH 1/5] block: move main block debugfs initialization to its own file Luis Chamberlain
2020-04-14  7:35   ` Greg KH
2020-04-15  2:44   ` Bart Van Assche
2020-04-14  4:18 ` [PATCH 2/5] blktrace: fix debugfs use after free Luis Chamberlain
2020-04-14  7:37   ` Greg KH
2020-04-14 15:38   ` Christoph Hellwig
2020-04-15  2:46   ` Bart Van Assche
2020-04-15 17:38   ` Eric Sandeen
2020-04-15 21:48     ` Bart Van Assche
2020-04-16  0:56     ` Luis Chamberlain
2020-04-16  1:02       ` Eric Sandeen
2020-04-16  1:20         ` Luis Chamberlain
2020-04-16  2:10   ` Ming Lei
2020-04-16  5:25     ` Luis Chamberlain
2020-04-16  5:47       ` Ming Lei
2020-04-16  6:09         ` Ming Lei
2020-04-16  6:22           ` Luis Chamberlain
2020-04-16  6:20         ` Luis Chamberlain
2020-04-16  6:28           ` Ming Lei
2020-04-17  4:09             ` Luis Chamberlain
2020-04-14  4:19 ` [PATCH 3/5] blktrace: refcount the request_queue during ioctl Luis Chamberlain
2020-04-14 15:40   ` Christoph Hellwig
2020-04-15  6:16     ` Luis Chamberlain
2020-04-15  7:14       ` Christoph Hellwig
2020-04-15 12:34         ` Luis Chamberlain
2020-04-15 12:39           ` Christoph Hellwig
2020-04-15 13:25             ` Luis Chamberlain
2020-04-15 14:18           ` Bart Van Assche
2020-04-16  1:12             ` Luis Chamberlain
2020-04-16  3:43               ` Bart Van Assche
2020-04-16  5:29                 ` Luis Chamberlain
2020-04-15 14:45       ` Bart Van Assche
2020-04-16  1:17         ` Luis Chamberlain
2020-04-16  2:31   ` Ming Lei
2020-04-16  5:36     ` Luis Chamberlain
2020-04-14  4:19 ` [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle() Luis Chamberlain
2020-04-14 15:44   ` Christoph Hellwig
2020-04-15  5:42     ` Luis Chamberlain
2020-04-15  7:27       ` Christoph Hellwig
2020-04-15  7:34         ` Christoph Hellwig
2020-04-15 13:19           ` Luis Chamberlain
2020-04-16  6:10             ` Christoph Hellwig
2020-04-16  6:22   ` Ming Lei
2020-04-16  6:25     ` Luis Chamberlain
2020-04-16  6:34       ` Ming Lei
2020-04-14  4:19 ` [PATCH 5/5] block: revert back to synchronous request_queue removal Luis Chamberlain
2020-04-14 15:47   ` Christoph Hellwig
2020-04-14 20:58     ` Luis Chamberlain
2020-04-15  6:46       ` Christoph Hellwig
2020-04-15 13:20         ` Luis Chamberlain
2020-04-16  2:36   ` Ming Lei
2020-04-14  7:38 ` [PATCH 0/5] blktrace: fix use after free Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).