* v4.20-rc6: Sporadic use-after-free in bt_iter()
@ 2018-12-19 23:24 Bart Van Assche
From: Bart Van Assche @ 2018-12-19 23:24 UTC (permalink / raw)
To: linux-block
Hello,
If I run the srp blktests in a loop then I see the below call stack appearing
sporadically. I have not yet had the time to analyze this but I'm reporting
this here in case someone else would already have had a look at this.
Bart.
==================================================================
BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
Read of size 8 at addr ffff88803b335240 by task fio/21412
CPU: 0 PID: 21412 Comm: fio Tainted: G W 4.20.0-rc6-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
dump_stack+0x86/0xca
print_address_description+0x71/0x239
kasan_report.cold.5+0x242/0x301
__asan_load8+0x54/0x90
bt_iter+0x86/0xf0
blk_mq_queue_tag_busy_iter+0x373/0x5e0
blk_mq_in_flight+0x96/0xb0
part_in_flight+0x40/0x140
part_round_stats+0x18e/0x370
blk_account_io_start+0x3d7/0x670
blk_mq_bio_to_request+0x19c/0x3a0
blk_mq_make_request+0x7a9/0xcb0
generic_make_request+0x41d/0x960
submit_bio+0x9b/0x250
do_blockdev_direct_IO+0x435c/0x4c70
__blockdev_direct_IO+0x79/0x88
ext4_direct_IO+0x46c/0xc00
generic_file_direct_write+0x119/0x210
__generic_file_write_iter+0x11c/0x280
ext4_file_write_iter+0x1b8/0x6f0
aio_write+0x204/0x310
io_submit_one+0x9d3/0xe80
__x64_sys_io_submit+0x115/0x340
do_syscall_64+0x71/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f02cf043219
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 47 fc 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007f02a1df78b8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
RAX: ffffffffffffffda RBX: 00007f02a1df8ef8 RCX: 00007f02cf043219
RDX: 00007f029804a7c0 RSI: 0000000000000001 RDI: 00007f02c4f67000
RBP: 00007f02c4f67000 R08: 00007f0298007af0 R09: 00007f02a362f0f0
R10: 00007f029804a9c0 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 00007f029804a7c0 R15: 00007f0298049f60
The buggy address belongs to the page:
page:ffffea0000eccd40 count:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x1fff000000000000()
raw: 1fff000000000000 0000000000000000 ffffffff00ec0201 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88803b335100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88803b335180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88803b335200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88803b335280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88803b335300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
From: Jens Axboe @ 2018-12-19 23:27 UTC (permalink / raw)
To: Bart Van Assche, linux-block

On 12/19/18 4:24 PM, Bart Van Assche wrote:
> Hello,
>
> If I run the srp blktests in a loop then I see the below call stack appearing
> sporadically. I have not yet had the time to analyze this but I'm reporting
> this here in case someone else would already have had a look at this.
>
> Bart.
>
> ==================================================================
> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
> Read of size 8 at addr ffff88803b335240 by task fio/21412
> [...]

I've seen this one before as well, it's not a new thing. As far as I can
tell, it's a false positive. There should be no possibility for a
use-after-free iterating the static tags/requests.

--
Jens Axboe
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
From: Bart Van Assche @ 2018-12-20 0:16 UTC (permalink / raw)
To: Jens Axboe, linux-block

On Wed, 2018-12-19 at 16:27 -0700, Jens Axboe wrote:
> On 12/19/18 4:24 PM, Bart Van Assche wrote:
> > If I run the srp blktests in a loop then I see the below call stack appearing
> > sporadically. [...]
> >
> > BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
> > Read of size 8 at addr ffff88803b335240 by task fio/21412
> > [...]
>
> I've seen this one before as well, it's not a new thing. As far as I can
> tell, it's a false positive. There should be no possibility for a
> use-after-free iterating the static tags/requests.

Are you sure this is a false positive? I have not yet encountered any false
positive KASAN complaints. According to the following gdb output this complaint
refers to reading rq->q:

(gdb) list *(bt_iter+0x86)
0xffffffff816b9346 is in bt_iter (block/blk-mq-tag.c:237).
232
233             /*
234              * We can hit rq == NULL here, because the tagging functions
235              * test and set the bit before assigning ->rqs[].
236              */
237             if (rq && rq->q == hctx->queue)
238                     iter_data->fn(hctx, rq, iter_data->data, reserved);
239             return true;
240     }
241

From the disassembly output:

237             if (rq && rq->q == hctx->queue)
   0xffffffff816b9339 <+121>:  test   %r12,%r12
   0xffffffff816b933c <+124>:  je     0xffffffff816b935f <bt_iter+159>
   0xffffffff816b933e <+126>:  mov    %r12,%rdi
   0xffffffff816b9341 <+129>:  callq  0xffffffff813bd3e0 <__asan_load8>
   0xffffffff816b9346 <+134>:  lea    0x138(%r13),%rdi
   0xffffffff816b934d <+141>:  mov    (%r12),%r14
   0xffffffff816b9351 <+145>:  callq  0xffffffff813bd3e0 <__asan_load8>
   0xffffffff816b9356 <+150>:  cmp    0x138(%r13),%r14
   0xffffffff816b935d <+157>:  je     0xffffffff816b936f <bt_iter+175>

BTW, rq may but does not have to refer to tags->static_rqs[...]. It may also
refer to hctx->fq.flush_rq.

Bart.
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
From: Jens Axboe @ 2018-12-20 3:17 UTC (permalink / raw)
To: Bart Van Assche, linux-block

On 12/19/18 5:16 PM, Bart Van Assche wrote:
> On Wed, 2018-12-19 at 16:27 -0700, Jens Axboe wrote:
>> I've seen this one before as well, it's not a new thing. As far as I can
>> tell, it's a false positive. There should be no possibility for a
>> use-after-free iterating the static tags/requests.
>
> Are you sure this is a false positive?

No I'm not, but the few times I have seen it, I haven't been able to
make much sense of it. It goes back quite a bit.

> I have not yet encountered any false positive KASAN complaints. According
> to the following gdb output this complaint refers to reading rq->q:
> [...]
>
> BTW, rq may but does not have to refer to tags->static_rqs[...]. It may also
> refer to hctx->fq.flush_rq.

But even those are persistent for the lifetime of the queue... But since
kasan complains it belongs to a specific page, I'm guessing it's one
of the regular requests since those are out of a chopped up page. Which
means it makes even less sense.

Is this happening while devices are being actively torn down? And
are you using shared tags? That's the only way I could see this
triggering.

--
Jens Axboe
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
From: jianchao.wang @ 2018-12-20 3:24 UTC (permalink / raw)
To: Jens Axboe, Bart Van Assche, linux-block

On 12/20/18 11:17 AM, Jens Axboe wrote:
> [...]
>
> Is this happening while devices are being actively torn down? And
> are you using shared tags? That's the only way I could see this
> triggering.

Or could it be caused by a stale request left in a hctx->tags->rqs[] slot?
We don't clear the slot after freeing the request.

And there could be a scenario like the following. There used to be an io
scheduler attached; after some workload, the io scheduler is detached, so
there could be rqs allocated by the io scheduler left in hctx->tags->rqs:

blk_mq_get_request                  blk_mq_queue_tag_busy_iter
  -> blk_mq_get_tag
                                      -> bt_for_each
                                        -> bt_iter
                                          -> rq = tags->rqs[]
                                          -> rq->q
  -> blk_mq_rq_ctx_init
    -> data->hctx->tags->rqs[rq->tag] = rq;

If the scenario is possible, maybe we could fix it as follows.

---
 block/blk-mq.c | 1 +
 block/blk-mq.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6a75662..ad55226 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -477,6 +477,7 @@ static void __blk_mq_free_request(struct request *rq)
 	const int sched_tag = rq->internal_tag;
 
 	blk_pm_mark_last_busy(rq);
+	hctx->tags->rqs[rq->tag] = NULL;
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47..675b681 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -190,6 +190,7 @@ static inline void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
 	if (rq->tag == -1 || rq->internal_tag == -1)
 		return;
 
+	hctx->tags->rqs[rq->tag] = NULL;
 	__blk_mq_put_driver_tag(hctx, rq);
 }
 
@@ -201,6 +202,7 @@ static inline void blk_mq_put_driver_tag(struct request *rq)
 		return;
 
 	hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);
+	hctx->tags->rqs[rq->tag] = NULL;
 	__blk_mq_put_driver_tag(hctx, rq);
 }
--
2.7.4

Thanks
Jianchao
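[Editor's note] The stale-slot scenario above can be modeled outside the kernel. The sketch below is a user-space illustration only; the struct and function names are made up for the example and are not the real blk-mq types. It shows that freeing a request without clearing its lookup slot leaves a dangling pointer an iterator can still observe, while clearing the slot first, as the proposed patch does, removes that window.

```c
#include <assert.h>
#include <stdlib.h>

/* User-space model of the stale hctx->tags->rqs[] slot. All names are
 * illustrative; none of this is the real kernel code. */

struct request { int q; int tag; };

#define NR_TAGS 4

static struct request *rqs[NR_TAGS];   /* models hctx->tags->rqs[] */

/* Models blk_mq_rq_ctx_init(): publish the request in the lookup array. */
static struct request *get_request(int tag, int q)
{
	struct request *rq = malloc(sizeof(*rq));
	rq->tag = tag;
	rq->q = q;
	rqs[tag] = rq;
	return rq;
}

/* Buggy free: releases the request but leaves rqs[tag] dangling. */
static void free_request_buggy(struct request *rq)
{
	free(rq);
}

/* Fixed free: clears the lookup slot first, as in the patch above. */
static void free_request_fixed(struct request *rq)
{
	rqs[rq->tag] = NULL;
	free(rq);
}

/* Models the bt_iter() walk: counts slots still holding a request
 * pointer (the real iterator would then dereference rq->q). */
static int count_busy(void)
{
	int n = 0;
	for (int i = 0; i < NR_TAGS; i++)
		if (rqs[i])
			n++;
	return n;
}
```

With the fixed free the iterator finds no stale entry; with the buggy free it still sees the dangling pointer, which is exactly the window in which KASAN would flag a use-after-free read of rq->q.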
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
From: Jens Axboe @ 2018-12-20 4:19 UTC (permalink / raw)
To: jianchao.wang, Bart Van Assche, linux-block

On 12/19/18 8:24 PM, jianchao.wang wrote:
> [...]
>
> Or could it be caused by the stale request in hctx->tags->rqs[] slot ?
> We don't clear it after free the requests.
> [...]
>
> If the scenario is possible, maybe we could fix it as following.

Ah yes, good point, I bet that's what it is. But we just had this exact
discussion in another thread, and my point there was that we should
clear these when they go away, not inline. So how about clearing entries
when the sched tags go away?

--
Jens Axboe
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
From: jianchao.wang @ 2018-12-20 4:32 UTC (permalink / raw)
To: Jens Axboe, Bart Van Assche, linux-block

On 12/20/18 12:19 PM, Jens Axboe wrote:
> [...]
>
> Ah yes, good point, I bet that's what it is. But we just had this exact
> discussion in another thread, and my point there was that we should
> clear these when they go away, not inline. So how about clearing entries
> when the sched tags go away?

I guess it should be OK. :)

Thanks
Jianchao
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 4:32 ` jianchao.wang @ 2018-12-20 4:48 ` Jens Axboe 2018-12-20 5:03 ` jianchao.wang 0 siblings, 1 reply; 34+ messages in thread From: Jens Axboe @ 2018-12-20 4:48 UTC (permalink / raw) To: jianchao.wang, Bart Van Assche, linux-block On 12/19/18 9:32 PM, jianchao.wang wrote: > > > On 12/20/18 12:19 PM, Jens Axboe wrote: >> On 12/19/18 8:24 PM, jianchao.wang wrote: >>> >>> >>> On 12/20/18 11:17 AM, Jens Axboe wrote: >>>> On 12/19/18 5:16 PM, Bart Van Assche wrote: >>>>> On Wed, 2018-12-19 at 16:27 -0700, Jens Axboe wrote: >>>>>> On 12/19/18 4:24 PM, Bart Van Assche wrote: >>>>>>> Hello, >>>>>>> >>>>>>> If I run the srp blktests in a loop then I see the below call stack appearing >>>>>>> sporadically. I have not yet had the time to analyze this but I'm reporting >>>>>>> this here in case someone else would already have had a look at this. >>>>>>> >>>>>>> Bart. >>>>>>> >>>>>>> ================================================================== >>>>>>> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0 >>>>>>> Read of size 8 at addr ffff88803b335240 by task fio/21412 >>>>>>> >>>>>>> CPU: 0 PID: 21412 Comm: fio Tainted: G W 4.20.0-rc6-dbg+ #3 >>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 >>>>>>> Call Trace: >>>>>>> dump_stack+0x86/0xca >>>>>>> print_address_description+0x71/0x239 >>>>>>> kasan_report.cold.5+0x242/0x301 >>>>>>> __asan_load8+0x54/0x90 >>>>>>> bt_iter+0x86/0xf0 >>>>>>> blk_mq_queue_tag_busy_iter+0x373/0x5e0 >>>>>>> blk_mq_in_flight+0x96/0xb0 >>>>>>> part_in_flight+0x40/0x140 >>>>>>> part_round_stats+0x18e/0x370 >>>>>>> blk_account_io_start+0x3d7/0x670 >>>>>>> blk_mq_bio_to_request+0x19c/0x3a0 >>>>>>> blk_mq_make_request+0x7a9/0xcb0 >>>>>>> generic_make_request+0x41d/0x960 >>>>>>> submit_bio+0x9b/0x250 >>>>>>> do_blockdev_direct_IO+0x435c/0x4c70 >>>>>>> __blockdev_direct_IO+0x79/0x88 >>>>>>> ext4_direct_IO+0x46c/0xc00 >>>>>>> 
generic_file_direct_write+0x119/0x210 >>>>>>> __generic_file_write_iter+0x11c/0x280 >>>>>>> ext4_file_write_iter+0x1b8/0x6f0 >>>>>>> aio_write+0x204/0x310 >>>>>>> io_submit_one+0x9d3/0xe80 >>>>>>> __x64_sys_io_submit+0x115/0x340 >>>>>>> do_syscall_64+0x71/0x210 >>>>>>> entry_SYSCALL_64_after_hwframe+0x49/0xbe >>>>>>> RIP: 0033:0x7f02cf043219 >>>>>> >>>>>> I've seen this one before as well, it's not a new thing. As far as I can >>>>>> tell, it's a false positive. There should be no possibility for a >>>>>> use-after-free iterating the static tags/requests. >>>>> >>>>> Are you sure this is a false positive? >>>> >>>> No I'm not, but the few times I have seen it, I haven't been able to >>>> make much sense of it. It goes back quite a bit. >>>> >>>> I have not yet encountered any false >>>>> positive KASAN complaints. According to the following gdb output this complaint >>>>> refers to reading rq->q: >>>>> >>>>> (gdb) list *(bt_iter+0x86) >>>>> 0xffffffff816b9346 is in bt_iter (block/blk-mq-tag.c:237). >>>>> 232 >>>>> 233 /* >>>>> 234 * We can hit rq == NULL here, because the tagging functions >>>>> 235 * test and set the bit before assigning ->rqs[]. >>>>> 236 */ >>>>> 237 if (rq && rq->q == hctx->queue) >>>>> 238 iter_data->fn(hctx, rq, iter_data->data, reserved); >>>>> 239 return true; >>>>> 240 } >>>>> 241 >>>>> >>>>> From the disassembly output: >>>>> >>>>> 232 >>>>> 233 /* >>>>> 234 * We can hit rq == NULL here, because the tagging functions >>>>> 235 * test and set the bit before assigning ->rqs[]. 
>>>>> 236 */ >>>>> 237 if (rq && rq->q == hctx->queue) >>>>> 0xffffffff816b9339 <+121>: test %r12,%r12 >>>>> 0xffffffff816b933c <+124>: je 0xffffffff816b935f <bt_iter+159> >>>>> 0xffffffff816b933e <+126>: mov %r12,%rdi >>>>> 0xffffffff816b9341 <+129>: callq 0xffffffff813bd3e0 <__asan_load8> >>>>> 0xffffffff816b9346 <+134>: lea 0x138(%r13),%rdi >>>>> 0xffffffff816b934d <+141>: mov (%r12),%r14 >>>>> 0xffffffff816b9351 <+145>: callq 0xffffffff813bd3e0 <__asan_load8> >>>>> 0xffffffff816b9356 <+150>: cmp 0x138(%r13),%r14 >>>>> 0xffffffff816b935d <+157>: je 0xffffffff816b936f <bt_iter+175> >>>>> >>>>> BTW, rq may but does not have to refer to tags->static_rqs[...]. It may also >>>>> refer to hctx->fq.flush_rq. >>>> >>>> But even those are persistent for the lifetime of the queue... But since >>>> kasan complains it belongs to a specific page, I'm guessing it's one >>>> of the regular requests since those are out of a chopped up page. Which >>>> means it makes even less sense. >>>> >>>> Is this happening while devices are being actively torn down? And >>>> are you using shared tags? That's the only way I could see this >>>> triggering. >>>> >>> >>> Or could it be caused by the stale request in hctx->tags->rqs[] slot ? >>> We don't clear it after free the requests. >>> >>> And there could be a scenario like, >>> There used to be a io scheduler attached. >>> After some workload, the io scheduler is detached. >>> So there could be rqs allocated by the io scheduler left in hctx->tags->rqs. >>> >>> blk_mq_get_request blk_mq_queue_tag_busy_iter >>> -> blk_mq_get_tag >>> -> bt_for_each >>> -> bt_iter >>> -> rq = taags->rqs[] >>> -> rq->q >>> -> blk_mq_rq_ctx_init >>> -> data->hctx->tags->rqs[rq->tag] = rq; >>> >>> If the scenario is possible, maybe we could fix it as following. >> >> Ah yes, good point, I bet that's what it is. But we just had this exact >> discussion in another thread, and my point there was that we should >> clear these when they go away, not inline. 
So how about clearing entries >> when the sched tags go away? >> > I guess it should be OK. :) Something like this. Totally untested... And I wonder if there's a more efficient way to do this, not that it matters THAT much. But still. diff --git a/block/blk-mq.c b/block/blk-mq.c index 2de972857496..341cb8b9cfb7 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2025,16 +2025,21 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, { struct page *page; - if (tags->rqs && set->ops->exit_request) { - int i; + if (tags->rqs) { + int i, j; for (i = 0; i < tags->nr_tags; i++) { struct request *rq = tags->static_rqs[i]; if (!rq) continue; - set->ops->exit_request(set, rq, hctx_idx); + if (set->ops->exit_request) + set->ops->exit_request(set, rq, hctx_idx); tags->static_rqs[i] = NULL; + + for (j = 0; j < tags->nr_tags; j++) + if (tags->rqs[j] == rq) + tags->rqs[j] = NULL; } } -- Jens Axboe ^ permalink raw reply related [flat|nested] 34+ messages in thread
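[Editorial note: the clearing idea in the patch above can be modeled in user space. The structs and `NR_TAGS` below are simplified placeholders, not the kernel's `struct blk_mq_tags`; this is only a sketch of the quadratic inner loop that NULLs out stale `rqs[]` slots when the static requests are freed.]

```c
#include <assert.h>
#include <stddef.h>

#define NR_TAGS 4  /* hypothetical tiny tag map for illustration */

struct request { int tag; };

/* Simplified stand-in for struct blk_mq_tags */
struct blk_mq_tags_model {
	struct request *static_rqs[NR_TAGS]; /* owned, persistent requests  */
	struct request *rqs[NR_TAGS];        /* per-dispatch lookup slots   */
};

/*
 * Mirrors the patch: for each freed static request, scan rqs[] and
 * clear any slot still pointing at it, so iterators cannot find a
 * dangling pointer afterwards. This is the O(n^2) part Jens mentions.
 */
static void free_rqs_model(struct blk_mq_tags_model *tags)
{
	int i, j;

	for (i = 0; i < NR_TAGS; i++) {
		struct request *rq = tags->static_rqs[i];

		if (!rq)
			continue;
		tags->static_rqs[i] = NULL;

		for (j = 0; j < NR_TAGS; j++)
			if (tags->rqs[j] == rq)
				tags->rqs[j] = NULL;
	}
}
```

As jianchao points out in the next message, this clears the wrong array when `tags` is `hctx->sched_tags`: the stale pointers live in the driver-side `hctx->tags->rqs[]`, which this loop never touches.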
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 4:48 ` Jens Axboe @ 2018-12-20 5:03 ` jianchao.wang 2018-12-20 13:02 ` Jens Axboe 0 siblings, 1 reply; 34+ messages in thread From: jianchao.wang @ 2018-12-20 5:03 UTC (permalink / raw) To: Jens Axboe, Bart Van Assche, linux-block On 12/20/18 12:48 PM, Jens Axboe wrote: > On 12/19/18 9:32 PM, jianchao.wang wrote: >> >> >> On 12/20/18 12:19 PM, Jens Axboe wrote: >>> On 12/19/18 8:24 PM, jianchao.wang wrote: >>>> >>>> >>>> On 12/20/18 11:17 AM, Jens Axboe wrote: >>>>> On 12/19/18 5:16 PM, Bart Van Assche wrote: >>>>>> On Wed, 2018-12-19 at 16:27 -0700, Jens Axboe wrote: >>>>>>> On 12/19/18 4:24 PM, Bart Van Assche wrote: >>>>>>>> Hello, >>>>>>>> >>>>>>>> If I run the srp blktests in a loop then I see the below call stack appearing >>>>>>>> sporadically. I have not yet had the time to analyze this but I'm reporting >>>>>>>> this here in case someone else would already have had a look at this. >>>>>>>> >>>>>>>> Bart. >>>>>>>> >>>>>>>> ================================================================== >>>>>>>> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0 >>>>>>>> Read of size 8 at addr ffff88803b335240 by task fio/21412 >>>>>>>> >>>>>>>> CPU: 0 PID: 21412 Comm: fio Tainted: G W 4.20.0-rc6-dbg+ #3 >>>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 >>>>>>>> Call Trace: >>>>>>>> dump_stack+0x86/0xca >>>>>>>> print_address_description+0x71/0x239 >>>>>>>> kasan_report.cold.5+0x242/0x301 >>>>>>>> __asan_load8+0x54/0x90 >>>>>>>> bt_iter+0x86/0xf0 >>>>>>>> blk_mq_queue_tag_busy_iter+0x373/0x5e0 >>>>>>>> blk_mq_in_flight+0x96/0xb0 >>>>>>>> part_in_flight+0x40/0x140 >>>>>>>> part_round_stats+0x18e/0x370 >>>>>>>> blk_account_io_start+0x3d7/0x670 >>>>>>>> blk_mq_bio_to_request+0x19c/0x3a0 >>>>>>>> blk_mq_make_request+0x7a9/0xcb0 >>>>>>>> generic_make_request+0x41d/0x960 >>>>>>>> submit_bio+0x9b/0x250 >>>>>>>> do_blockdev_direct_IO+0x435c/0x4c70 >>>>>>>> 
__blockdev_direct_IO+0x79/0x88 >>>>>>>> ext4_direct_IO+0x46c/0xc00 >>>>>>>> generic_file_direct_write+0x119/0x210 >>>>>>>> __generic_file_write_iter+0x11c/0x280 >>>>>>>> ext4_file_write_iter+0x1b8/0x6f0 >>>>>>>> aio_write+0x204/0x310 >>>>>>>> io_submit_one+0x9d3/0xe80 >>>>>>>> __x64_sys_io_submit+0x115/0x340 >>>>>>>> do_syscall_64+0x71/0x210 >>>>>>>> entry_SYSCALL_64_after_hwframe+0x49/0xbe >>>>>>>> RIP: 0033:0x7f02cf043219 >>>>>>> >>>>>>> I've seen this one before as well, it's not a new thing. As far as I can >>>>>>> tell, it's a false positive. There should be no possibility for a >>>>>>> use-after-free iterating the static tags/requests. >>>>>> >>>>>> Are you sure this is a false positive? >>>>> >>>>> No I'm not, but the few times I have seen it, I haven't been able to >>>>> make much sense of it. It goes back quite a bit. >>>>> >>>>> I have not yet encountered any false >>>>>> positive KASAN complaints. According to the following gdb output this complaint >>>>>> refers to reading rq->q: >>>>>> >>>>>> (gdb) list *(bt_iter+0x86) >>>>>> 0xffffffff816b9346 is in bt_iter (block/blk-mq-tag.c:237). >>>>>> 232 >>>>>> 233 /* >>>>>> 234 * We can hit rq == NULL here, because the tagging functions >>>>>> 235 * test and set the bit before assigning ->rqs[]. >>>>>> 236 */ >>>>>> 237 if (rq && rq->q == hctx->queue) >>>>>> 238 iter_data->fn(hctx, rq, iter_data->data, reserved); >>>>>> 239 return true; >>>>>> 240 } >>>>>> 241 >>>>>> >>>>>> From the disassembly output: >>>>>> >>>>>> 232 >>>>>> 233 /* >>>>>> 234 * We can hit rq == NULL here, because the tagging functions >>>>>> 235 * test and set the bit before assigning ->rqs[]. 
>>>>>> 236 */ >>>>>> 237 if (rq && rq->q == hctx->queue) >>>>>> 0xffffffff816b9339 <+121>: test %r12,%r12 >>>>>> 0xffffffff816b933c <+124>: je 0xffffffff816b935f <bt_iter+159> >>>>>> 0xffffffff816b933e <+126>: mov %r12,%rdi >>>>>> 0xffffffff816b9341 <+129>: callq 0xffffffff813bd3e0 <__asan_load8> >>>>>> 0xffffffff816b9346 <+134>: lea 0x138(%r13),%rdi >>>>>> 0xffffffff816b934d <+141>: mov (%r12),%r14 >>>>>> 0xffffffff816b9351 <+145>: callq 0xffffffff813bd3e0 <__asan_load8> >>>>>> 0xffffffff816b9356 <+150>: cmp 0x138(%r13),%r14 >>>>>> 0xffffffff816b935d <+157>: je 0xffffffff816b936f <bt_iter+175> >>>>>> >>>>>> BTW, rq may but does not have to refer to tags->static_rqs[...]. It may also >>>>>> refer to hctx->fq.flush_rq. >>>>> >>>>> But even those are persistent for the lifetime of the queue... But since >>>>> kasan complains it belongs to a specific page, I'm guessing it's one >>>>> of the regular requests since those are out of a chopped up page. Which >>>>> means it makes even less sense. >>>>> >>>>> Is this happening while devices are being actively torn down? And >>>>> are you using shared tags? That's the only way I could see this >>>>> triggering. >>>>> >>>> >>>> Or could it be caused by the stale request in hctx->tags->rqs[] slot ? >>>> We don't clear it after free the requests. >>>> >>>> And there could be a scenario like, >>>> There used to be a io scheduler attached. >>>> After some workload, the io scheduler is detached. >>>> So there could be rqs allocated by the io scheduler left in hctx->tags->rqs. >>>> >>>> blk_mq_get_request blk_mq_queue_tag_busy_iter >>>> -> blk_mq_get_tag >>>> -> bt_for_each >>>> -> bt_iter >>>> -> rq = taags->rqs[] >>>> -> rq->q >>>> -> blk_mq_rq_ctx_init >>>> -> data->hctx->tags->rqs[rq->tag] = rq; >>>> >>>> If the scenario is possible, maybe we could fix it as following. >>> >>> Ah yes, good point, I bet that's what it is. 
But we just had this exact >>> discussion in another thread, and my point there was that we should >>> clear these when they go away, not inline. So how about clearing entries >>> when the sched tags go away? >>> >> I guess it should be OK. :) > > Something like this. Totally untested... And I wonder if there's a more > efficient way to do this, not that it matters THAT much. But still. > > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 2de972857496..341cb8b9cfb7 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -2025,16 +2025,21 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, > { > struct page *page; > > - if (tags->rqs && set->ops->exit_request) { > - int i; > + if (tags->rqs) { > + int i, j; > > for (i = 0; i < tags->nr_tags; i++) { > struct request *rq = tags->static_rqs[i]; > > if (!rq) > continue; > - set->ops->exit_request(set, rq, hctx_idx); > + if (set->ops->exit_request) > + set->ops->exit_request(set, rq, hctx_idx); > tags->static_rqs[i] = NULL; > + > + for (j = 0; j < tags->nr_tags; j++) > + if (tags->rqs[j] == rq) > + tags->rqs[j] = NULL; > } > } > > I'm afraid this cannot work. The 'tags' here could be the hctx->sched_tags, but what we need to clear is hctx->tags->rqs[]. Thanks Jianchao ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 5:03 ` jianchao.wang @ 2018-12-20 13:02 ` Jens Axboe 2018-12-20 13:07 ` Jens Axboe 0 siblings, 1 reply; 34+ messages in thread From: Jens Axboe @ 2018-12-20 13:02 UTC (permalink / raw) To: jianchao.wang, Bart Van Assche, linux-block On 12/19/18 10:03 PM, jianchao.wang wrote: > > > On 12/20/18 12:48 PM, Jens Axboe wrote: >> On 12/19/18 9:32 PM, jianchao.wang wrote: >>> >>> >>> On 12/20/18 12:19 PM, Jens Axboe wrote: >>>> On 12/19/18 8:24 PM, jianchao.wang wrote: >>>>> >>>>> >>>>> On 12/20/18 11:17 AM, Jens Axboe wrote: >>>>>> On 12/19/18 5:16 PM, Bart Van Assche wrote: >>>>>>> On Wed, 2018-12-19 at 16:27 -0700, Jens Axboe wrote: >>>>>>>> On 12/19/18 4:24 PM, Bart Van Assche wrote: >>>>>>>>> Hello, >>>>>>>>> >>>>>>>>> If I run the srp blktests in a loop then I see the below call stack appearing >>>>>>>>> sporadically. I have not yet had the time to analyze this but I'm reporting >>>>>>>>> this here in case someone else would already have had a look at this. >>>>>>>>> >>>>>>>>> Bart. 
>>>>>>>>> >>>>>>>>> ================================================================== >>>>>>>>> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0 >>>>>>>>> Read of size 8 at addr ffff88803b335240 by task fio/21412 >>>>>>>>> >>>>>>>>> CPU: 0 PID: 21412 Comm: fio Tainted: G W 4.20.0-rc6-dbg+ #3 >>>>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 >>>>>>>>> Call Trace: >>>>>>>>> dump_stack+0x86/0xca >>>>>>>>> print_address_description+0x71/0x239 >>>>>>>>> kasan_report.cold.5+0x242/0x301 >>>>>>>>> __asan_load8+0x54/0x90 >>>>>>>>> bt_iter+0x86/0xf0 >>>>>>>>> blk_mq_queue_tag_busy_iter+0x373/0x5e0 >>>>>>>>> blk_mq_in_flight+0x96/0xb0 >>>>>>>>> part_in_flight+0x40/0x140 >>>>>>>>> part_round_stats+0x18e/0x370 >>>>>>>>> blk_account_io_start+0x3d7/0x670 >>>>>>>>> blk_mq_bio_to_request+0x19c/0x3a0 >>>>>>>>> blk_mq_make_request+0x7a9/0xcb0 >>>>>>>>> generic_make_request+0x41d/0x960 >>>>>>>>> submit_bio+0x9b/0x250 >>>>>>>>> do_blockdev_direct_IO+0x435c/0x4c70 >>>>>>>>> __blockdev_direct_IO+0x79/0x88 >>>>>>>>> ext4_direct_IO+0x46c/0xc00 >>>>>>>>> generic_file_direct_write+0x119/0x210 >>>>>>>>> __generic_file_write_iter+0x11c/0x280 >>>>>>>>> ext4_file_write_iter+0x1b8/0x6f0 >>>>>>>>> aio_write+0x204/0x310 >>>>>>>>> io_submit_one+0x9d3/0xe80 >>>>>>>>> __x64_sys_io_submit+0x115/0x340 >>>>>>>>> do_syscall_64+0x71/0x210 >>>>>>>>> entry_SYSCALL_64_after_hwframe+0x49/0xbe >>>>>>>>> RIP: 0033:0x7f02cf043219 >>>>>>>> >>>>>>>> I've seen this one before as well, it's not a new thing. As far as I can >>>>>>>> tell, it's a false positive. There should be no possibility for a >>>>>>>> use-after-free iterating the static tags/requests. >>>>>>> >>>>>>> Are you sure this is a false positive? >>>>>> >>>>>> No I'm not, but the few times I have seen it, I haven't been able to >>>>>> make much sense of it. It goes back quite a bit. >>>>>> >>>>>> I have not yet encountered any false >>>>>>> positive KASAN complaints. 
According to the following gdb output this complaint >>>>>>> refers to reading rq->q: >>>>>>> >>>>>>> (gdb) list *(bt_iter+0x86) >>>>>>> 0xffffffff816b9346 is in bt_iter (block/blk-mq-tag.c:237). >>>>>>> 232 >>>>>>> 233 /* >>>>>>> 234 * We can hit rq == NULL here, because the tagging functions >>>>>>> 235 * test and set the bit before assigning ->rqs[]. >>>>>>> 236 */ >>>>>>> 237 if (rq && rq->q == hctx->queue) >>>>>>> 238 iter_data->fn(hctx, rq, iter_data->data, reserved); >>>>>>> 239 return true; >>>>>>> 240 } >>>>>>> 241 >>>>>>> >>>>>>> From the disassembly output: >>>>>>> >>>>>>> 232 >>>>>>> 233 /* >>>>>>> 234 * We can hit rq == NULL here, because the tagging functions >>>>>>> 235 * test and set the bit before assigning ->rqs[]. >>>>>>> 236 */ >>>>>>> 237 if (rq && rq->q == hctx->queue) >>>>>>> 0xffffffff816b9339 <+121>: test %r12,%r12 >>>>>>> 0xffffffff816b933c <+124>: je 0xffffffff816b935f <bt_iter+159> >>>>>>> 0xffffffff816b933e <+126>: mov %r12,%rdi >>>>>>> 0xffffffff816b9341 <+129>: callq 0xffffffff813bd3e0 <__asan_load8> >>>>>>> 0xffffffff816b9346 <+134>: lea 0x138(%r13),%rdi >>>>>>> 0xffffffff816b934d <+141>: mov (%r12),%r14 >>>>>>> 0xffffffff816b9351 <+145>: callq 0xffffffff813bd3e0 <__asan_load8> >>>>>>> 0xffffffff816b9356 <+150>: cmp 0x138(%r13),%r14 >>>>>>> 0xffffffff816b935d <+157>: je 0xffffffff816b936f <bt_iter+175> >>>>>>> >>>>>>> BTW, rq may but does not have to refer to tags->static_rqs[...]. It may also >>>>>>> refer to hctx->fq.flush_rq. >>>>>> >>>>>> But even those are persistent for the lifetime of the queue... But since >>>>>> kasan complains it belongs to a specific page, I'm guessing it's one >>>>>> of the regular requests since those are out of a chopped up page. Which >>>>>> means it makes even less sense. >>>>>> >>>>>> Is this happening while devices are being actively torn down? And >>>>>> are you using shared tags? That's the only way I could see this >>>>>> triggering. 
>>>>>> >>>>> >>>>> Or could it be caused by the stale request in hctx->tags->rqs[] slot ? >>>>> We don't clear it after free the requests. >>>>> >>>>> And there could be a scenario like, >>>>> There used to be a io scheduler attached. >>>>> After some workload, the io scheduler is detached. >>>>> So there could be rqs allocated by the io scheduler left in hctx->tags->rqs. >>>>> >>>>> blk_mq_get_request blk_mq_queue_tag_busy_iter >>>>> -> blk_mq_get_tag >>>>> -> bt_for_each >>>>> -> bt_iter >>>>> -> rq = taags->rqs[] >>>>> -> rq->q >>>>> -> blk_mq_rq_ctx_init >>>>> -> data->hctx->tags->rqs[rq->tag] = rq; >>>>> >>>>> If the scenario is possible, maybe we could fix it as following. >>>> >>>> Ah yes, good point, I bet that's what it is. But we just had this exact >>>> discussion in another thread, and my point there was that we should >>>> clear these when they go away, not inline. So how about clearing entries >>>> when the sched tags go away? >>>> >>> I guess it should be OK. :) >> >> Something like this. Totally untested... And I wonder if there's a more >> efficient way to do this, not that it matters THAT much. But still. >> >> >> diff --git a/block/blk-mq.c b/block/blk-mq.c >> index 2de972857496..341cb8b9cfb7 100644 >> --- a/block/blk-mq.c >> +++ b/block/blk-mq.c >> @@ -2025,16 +2025,21 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >> { >> struct page *page; >> >> - if (tags->rqs && set->ops->exit_request) { >> - int i; >> + if (tags->rqs) { >> + int i, j; >> >> for (i = 0; i < tags->nr_tags; i++) { >> struct request *rq = tags->static_rqs[i]; >> >> if (!rq) >> continue; >> - set->ops->exit_request(set, rq, hctx_idx); >> + if (set->ops->exit_request) >> + set->ops->exit_request(set, rq, hctx_idx); >> tags->static_rqs[i] = NULL; >> + >> + for (j = 0; j < tags->nr_tags; j++) >> + if (tags->rqs[j] == rq) >> + tags->rqs[j] = NULL; >> } >> } >> >> > > I'm afraid this cannot work. 
> > The 'tags' here could be the hctx->sched_tags, but what we need to > clear is hctx->tags->rqs[]. You are right, of course, a bit too quick on the trigger. This one should work better, and also avoids that silly quadratic loop. I don't think we need the tag == -1 check, but probably best to be safe. diff --git a/block/blk-mq.c b/block/blk-mq.c index 2de972857496..151891eb6fbd 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2025,16 +2025,22 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, { struct page *page; - if (tags->rqs && set->ops->exit_request) { - int i; + if (tags->rqs) { + int i, j; for (i = 0; i < tags->nr_tags; i++) { struct request *rq = tags->static_rqs[i]; if (!rq) continue; - set->ops->exit_request(set, rq, hctx_idx); + if (set->ops->exit_request) + set->ops->exit_request(set, rq, hctx_idx); tags->static_rqs[i] = NULL; + + if (rq->tag == -1) + continue; + if (set->tags[hctx_idx]->rqs[rq->tag] == rq) + set->tags[hctx_idx]->rqs[rq->tag] = NULL; } } -- Jens Axboe ^ permalink raw reply related [flat|nested] 34+ messages in thread
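[Editorial note: the corrected patch above can be sketched the same way. Here `sched_tags` stands in for the tags being freed (which own `static_rqs[]`) and `driver_tags` for `set->tags[hctx_idx]`, the array that actually holds the stale pointers. Indexing directly by `rq->tag` replaces the quadratic scan; `tag == -1` marks a request that never got a driver tag. The struct names are simplified placeholders, not the kernel's.]

```c
#include <assert.h>
#include <stddef.h>

#define NR_TAGS 4  /* hypothetical tiny tag map for illustration */

struct request { int tag; };

struct tags_model {
	struct request *static_rqs[NR_TAGS];
	struct request *rqs[NR_TAGS];
};

/*
 * Mirrors the fixed patch: free sched_tags' static requests, and for
 * each one that holds a driver tag, clear the matching slot in the
 * *driver* tag map in O(1) per request.
 */
static void free_sched_rqs_model(struct tags_model *sched_tags,
				 struct tags_model *driver_tags)
{
	int i;

	for (i = 0; i < NR_TAGS; i++) {
		struct request *rq = sched_tags->static_rqs[i];

		if (!rq)
			continue;
		sched_tags->static_rqs[i] = NULL;

		if (rq->tag == -1)
			continue; /* never assigned a driver tag */
		if (driver_tags->rqs[rq->tag] == rq)
			driver_tags->rqs[rq->tag] = NULL;
	}
}
```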
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 13:02 ` Jens Axboe @ 2018-12-20 13:07 ` Jens Axboe 2018-12-20 18:01 ` Bart Van Assche 0 siblings, 1 reply; 34+ messages in thread From: Jens Axboe @ 2018-12-20 13:07 UTC (permalink / raw) To: jianchao.wang, Bart Van Assche, linux-block On 12/20/18 6:02 AM, Jens Axboe wrote: >> I'm afraid this cannot work. >> >> The 'tags' here could be the hctx->sched_tags, but what we need to >> clear is hctx->tags->rqs[]. > > You are right, of course, a bit too quick on the trigger. This one > should work better, and also avoids that silly quadratic loop. I don't > think we need the tag == -1 check, but probably best to be safe. Sent out the wrong one, here it is. Bart, if you can reproduce, can you give it a whirl? diff --git a/block/blk-mq.c b/block/blk-mq.c index 2de972857496..fc04bb648f18 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2025,7 +2025,7 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, { struct page *page; - if (tags->rqs && set->ops->exit_request) { + if (tags->rqs) { int i; for (i = 0; i < tags->nr_tags; i++) { @@ -2033,8 +2033,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, if (!rq) continue; - set->ops->exit_request(set, rq, hctx_idx); + if (set->ops->exit_request) + set->ops->exit_request(set, rq, hctx_idx); tags->static_rqs[i] = NULL; + + if (rq->tag == -1) + continue; + if (set->tags[hctx_idx]->rqs[rq->tag] == rq) + set->tags[hctx_idx]->rqs[rq->tag] = NULL; } } @@ -2113,6 +2119,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq, return ret; } + rq->tag = -1; WRITE_ONCE(rq->state, MQ_RQ_IDLE); return 0; } -- Jens Axboe ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 13:07 ` Jens Axboe @ 2018-12-20 18:01 ` Bart Van Assche 2018-12-20 18:21 ` Jens Axboe 0 siblings, 1 reply; 34+ messages in thread From: Bart Van Assche @ 2018-12-20 18:01 UTC (permalink / raw) To: Jens Axboe, jianchao.wang, linux-block On Thu, 2018-12-20 at 06:07 -0700, Jens Axboe wrote: > On 12/20/18 6:02 AM, Jens Axboe wrote: > > > I'm afraid this cannot work. > > > > > > The 'tags' here could be the hctx->sched_tags, but what we need to > > > clear is hctx->tags->rqs[]. > > > > You are right, of course, a bit too quick on the trigger. This one > > should work better, and also avoids that silly quadratic loop. I don't > > think we need the tag == -1 check, but probably best to be safe. > > Sent out the wrong one, here it is. Bart, if you can reproduce, can you > give it a whirl? > > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 2de972857496..fc04bb648f18 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -2025,7 +2025,7 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, > { > struct page *page; > > - if (tags->rqs && set->ops->exit_request) { > + if (tags->rqs) { > int i; > > for (i = 0; i < tags->nr_tags; i++) { > @@ -2033,8 +2033,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, > > if (!rq) > continue; > - set->ops->exit_request(set, rq, hctx_idx); > + if (set->ops->exit_request) > + set->ops->exit_request(set, rq, hctx_idx); > tags->static_rqs[i] = NULL; > + > + if (rq->tag == -1) > + continue; > + if (set->tags[hctx_idx]->rqs[rq->tag] == rq) > + set->tags[hctx_idx]->rqs[rq->tag] = NULL; > } > } > > @@ -2113,6 +2119,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq, > return ret; > } > > + rq->tag = -1; > WRITE_ONCE(rq->state, MQ_RQ_IDLE); > return 0; > } Hi Jens, Are you sure this is sufficient? My understanding is that blk_mq_queue_tag_busy_iter() iterates over all tags in the tag set. 
So if the request queue on which part_in_flight() is called and the request queue for which blk_mq_free_rqs() is called share their tag set then part_in_flight() and blk_mq_free_rqs() can run concurrently. That can cause ugly race conditions. Do you think it would be a good idea to modify the inflight accounting code such that it only considers the requests of a single request queue instead of all requests for a given tag set? Thanks, Bart. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 18:01 ` Bart Van Assche @ 2018-12-20 18:21 ` Jens Axboe 2018-12-20 18:33 ` Jens Axboe 0 siblings, 1 reply; 34+ messages in thread From: Jens Axboe @ 2018-12-20 18:21 UTC (permalink / raw) To: Bart Van Assche, jianchao.wang, linux-block On 12/20/18 11:01 AM, Bart Van Assche wrote: > On Thu, 2018-12-20 at 06:07 -0700, Jens Axboe wrote: >> On 12/20/18 6:02 AM, Jens Axboe wrote: >>>> I'm afraid this cannot work. >>>> >>>> The 'tags' here could be the hctx->sched_tags, but what we need to >>>> clear is hctx->tags->rqs[]. >>> >>> You are right, of course, a bit too quick on the trigger. This one >>> should work better, and also avoids that silly quadratic loop. I don't >>> think we need the tag == -1 check, but probably best to be safe. >> >> Sent out the wrong one, here it is. Bart, if you can reproduce, can you >> give it a whirl? >> >> >> diff --git a/block/blk-mq.c b/block/blk-mq.c >> index 2de972857496..fc04bb648f18 100644 >> --- a/block/blk-mq.c >> +++ b/block/blk-mq.c >> @@ -2025,7 +2025,7 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >> { >> struct page *page; >> >> - if (tags->rqs && set->ops->exit_request) { >> + if (tags->rqs) { >> int i; >> >> for (i = 0; i < tags->nr_tags; i++) { >> @@ -2033,8 +2033,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >> >> if (!rq) >> continue; >> - set->ops->exit_request(set, rq, hctx_idx); >> + if (set->ops->exit_request) >> + set->ops->exit_request(set, rq, hctx_idx); >> tags->static_rqs[i] = NULL; >> + >> + if (rq->tag == -1) >> + continue; >> + if (set->tags[hctx_idx]->rqs[rq->tag] == rq) >> + set->tags[hctx_idx]->rqs[rq->tag] = NULL; >> } >> } >> >> @@ -2113,6 +2119,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq, >> return ret; >> } >> >> + rq->tag = -1; >> WRITE_ONCE(rq->state, MQ_RQ_IDLE); >> return 0; >> } > > Hi Jens, > > Are you sure this is sufficient? 
No, I don't think it is. > My understanding is that > blk_mq_queue_tag_busy_iter() iterates over all tags in the tag set. So if the > request queue on which part_in_flight() is called and the request queue for > which blk_mq_free_rqs() is called share their tag set then part_in_flight() > and blk_mq_free_rqs() can run concurrently. That can cause ugly race > conditions. Do you think it would be a good idea to modify the inflight > accounting code such that it only considers the requests of a single request > queue instead of all requests for a given tag set? That would of course solve it, the question is how to do it. One option would be to have ->rqs[] be: struct rq_entry { struct request_queue *q; struct request *rq; }; instead of just a request, since then you could check the queue without having to dereference the request. The current race is inherent in that we set ->rqs[] AFTER having acquired the tag, so there's a window where you could find a stale entry. That's not normally an issue since requests are persistent, but for shared tag maps and queues disappearing it can pose a problem. -- Jens Axboe ^ permalink raw reply [flat|nested] 34+ messages in thread
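[Editorial note: the `rq_entry` idea proposed above — storing the owning queue alongside the request pointer — can be sketched in user space as follows. The point is that the iterator compares `entry->queue` (plain data in the tag map, valid for the tag set's lifetime) before ever dereferencing `entry->rq`, so a stale request from another queue is filtered out without touching freed memory. Names here are illustrative, not the final kernel API.]

```c
#include <assert.h>
#include <stddef.h>

struct request_queue { int id; };
struct request { struct request_queue *q; };

/* Queue and request are written together when the tag is assigned. */
struct rq_entry {
	struct request_queue *queue;
	struct request *rq;
};

/*
 * Iterator-side check: only dereference rq after the queue matches.
 * An entry whose bit is set but whose rq is not yet assigned (the
 * NULL window the comment in bt_iter() describes) is also skipped.
 */
static int entry_matches(const struct rq_entry *e,
			 const struct request_queue *q)
{
	if (e->queue != q)
		return 0;
	return e->rq != NULL;
}
```

This also answers Bart's concern about shared tag sets: an in-flight walk for queue A never inspects a request that belongs (or belonged) to queue B.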
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 18:21 ` Jens Axboe @ 2018-12-20 18:33 ` Jens Axboe 2018-12-20 20:56 ` Bart Van Assche 0 siblings, 1 reply; 34+ messages in thread From: Jens Axboe @ 2018-12-20 18:33 UTC (permalink / raw) To: Bart Van Assche, jianchao.wang, linux-block On 12/20/18 11:21 AM, Jens Axboe wrote: > On 12/20/18 11:01 AM, Bart Van Assche wrote: >> On Thu, 2018-12-20 at 06:07 -0700, Jens Axboe wrote: >>> On 12/20/18 6:02 AM, Jens Axboe wrote: >>>>> I'm afraid this cannot work. >>>>> >>>>> The 'tags' here could be the hctx->sched_tags, but what we need to >>>>> clear is hctx->tags->rqs[]. >>>> >>>> You are right, of course, a bit too quick on the trigger. This one >>>> should work better, and also avoids that silly quadratic loop. I don't >>>> think we need the tag == -1 check, but probably best to be safe. >>> >>> Sent out the wrong one, here it is. Bart, if you can reproduce, can you >>> give it a whirl? >>> >>> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>> index 2de972857496..fc04bb648f18 100644 >>> --- a/block/blk-mq.c >>> +++ b/block/blk-mq.c >>> @@ -2025,7 +2025,7 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >>> { >>> struct page *page; >>> >>> - if (tags->rqs && set->ops->exit_request) { >>> + if (tags->rqs) { >>> int i; >>> >>> for (i = 0; i < tags->nr_tags; i++) { >>> @@ -2033,8 +2033,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >>> >>> if (!rq) >>> continue; >>> - set->ops->exit_request(set, rq, hctx_idx); >>> + if (set->ops->exit_request) >>> + set->ops->exit_request(set, rq, hctx_idx); >>> tags->static_rqs[i] = NULL; >>> + >>> + if (rq->tag == -1) >>> + continue; >>> + if (set->tags[hctx_idx]->rqs[rq->tag] == rq) >>> + set->tags[hctx_idx]->rqs[rq->tag] = NULL; >>> } >>> } >>> >>> @@ -2113,6 +2119,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq, >>> return ret; >>> } >>> >>> + rq->tag = -1; >>> 
WRITE_ONCE(rq->state, MQ_RQ_IDLE); >>> return 0; >>> } >> >> Hi Jens, >> >> Are you sure this is sufficient? > > No, I don't think it is. > >> My understanding is that >> blk_mq_queue_tag_busy_iter() iterates over all tags in the tag set. So if the >> request queue on which part_in_flight() is called and the request queue for >> which blk_mq_free_rqs() is called share their tag set then part_in_flight() >> and blk_mq_free_rqs() can run concurrently. That can cause ugly race >> conditions. Do you think it would be a good idea to modify the inflight >> accounting code such that it only considers the requests of a single request >> queue instead of all requests for a given tag set? > > That would of course solve it, the question is how to do it. One option > would be to have ->rqs[] be: > > struct rq_entry { > struct request_queue *q; > struct request *rq; > }; > > instead of just a request, since then you could check the queue without > having to dereference the request. The current race is inherent in that > we set ->rqs[] AFTER having acquired the tag, so there's a window where > you could find a stale entry. That's not normally an issue since > requests are persistent, but for shared tag maps and queues disappearing > it can pose a problem. Something like this, totally untested. If the queue matches, we know it's safe to dereference it. A safer API to export as well... diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 2089c6c62f44..78192b544fa2 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) if (!reserved) bitnr += tags->nr_reserved_tags; - rq = tags->rqs[bitnr]; + if (tags->rqs[bitnr].queue != hctx->queue) + return true; /* * We can hit rq == NULL here, because the tagging functions * test and set the bit before assigning ->rqs[]. 
*/ - if (rq && rq->q == hctx->queue) + rq = tags->rqs[bitnr].rq; + if (rq) return iter_data->fn(hctx, rq, iter_data->data, reserved); return true; } @@ -268,6 +270,7 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt, struct bt_tags_iter_data { struct blk_mq_tags *tags; + struct blk_mq_hw_ctx *hctx; busy_tag_iter_fn *fn; void *data; bool reserved; @@ -287,7 +290,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) * We can hit rq == NULL here, because the tagging functions * test and set the bit before assining ->rqs[]. */ - rq = tags->rqs[bitnr]; + rq = tags->rqs[bitnr].rq; if (rq && blk_mq_request_started(rq)) return iter_data->fn(rq, iter_data->data, reserved); diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h index 61deab0b5a5a..bb84d1f099c7 100644 --- a/block/blk-mq-tag.h +++ b/block/blk-mq-tag.h @@ -4,6 +4,11 @@ #include "blk-mq.h" +struct rq_tag_entry { + struct request_queue *queue; + struct request *rq; +}; + /* * Tag address space map. 
*/ @@ -16,7 +21,7 @@ struct blk_mq_tags { struct sbitmap_queue bitmap_tags; struct sbitmap_queue breserved_tags; - struct request **rqs; + struct rq_tag_entry *rqs; struct request **static_rqs; struct list_head page_list; }; @@ -78,7 +83,8 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx) static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx, unsigned int tag, struct request *rq) { - hctx->tags->rqs[tag] = rq; + hctx->tags->rqs[tag].queue = hctx->queue; + hctx->tags->rqs[tag].rq = rq; } static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, diff --git a/block/blk-mq.c b/block/blk-mq.c index 3ba37b9e15e9..4f194946dbd9 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -298,13 +298,16 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, rq->tag = -1; rq->internal_tag = tag; } else { - if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) { + struct blk_mq_hw_ctx *hctx = data->hctx; + + if (hctx->flags & BLK_MQ_F_TAG_SHARED) { rq_flags = RQF_MQ_INFLIGHT; - atomic_inc(&data->hctx->nr_active); + atomic_inc(&hctx->nr_active); } rq->tag = tag; rq->internal_tag = -1; - data->hctx->tags->rqs[rq->tag] = rq; + hctx->tags->rqs[rq->tag].queue = hctx->queue; + hctx->tags->rqs[rq->tag].rq = rq; } /* csd/requeue_work/fifo_time is initialized before use */ @@ -797,8 +800,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list); struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag) { if (tag < tags->nr_tags) { - prefetch(tags->rqs[tag]); - return tags->rqs[tag]; + prefetch(tags->rqs[tag].rq); + return tags->rqs[tag].rq; } return NULL; @@ -809,10 +812,11 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq, void *priv, bool reserved) { /* - * If we find a request that is inflight and the queue matches, - * we know the queue is busy. Return false to stop the iteration. + * We're only called here if the queue matches. If the rq state is + * inflight, we know the queue is busy. 
Return false to stop the + * iteration. */ - if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) { + if (rq->state == MQ_RQ_IN_FLIGHT) { bool *busy = priv; *busy = true; @@ -1049,11 +1053,14 @@ bool blk_mq_get_driver_tag(struct request *rq) shared = blk_mq_tag_busy(data.hctx); rq->tag = blk_mq_get_tag(&data); if (rq->tag >= 0) { + struct blk_mq_hw_ctx *hctx = data.hctx; + if (shared) { rq->rq_flags |= RQF_MQ_INFLIGHT; atomic_inc(&data.hctx->nr_active); } - data.hctx->tags->rqs[rq->tag] = rq; + hctx->tags->rqs[rq->tag].queue = hctx->queue; + hctx->tags->rqs[rq->tag].rq = rq; } done: @@ -2069,7 +2076,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, if (!tags) return NULL; - tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *), + tags->rqs = kcalloc_node(nr_tags, sizeof(struct rq_tag_entry), GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node); if (!tags->rqs) { -- Jens Axboe ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 18:33 ` Jens Axboe @ 2018-12-20 20:56 ` Bart Van Assche 2018-12-20 21:00 ` Jens Axboe 0 siblings, 1 reply; 34+ messages in thread From: Bart Van Assche @ 2018-12-20 20:56 UTC (permalink / raw) To: Jens Axboe, jianchao.wang, linux-block On Thu, 2018-12-20 at 11:33 -0700, Jens Axboe wrote: > [ ... ] > Something like this, totally untested. If the queue matches, we know it's > safe to dereference it. > > A safer API to export as well... > > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c > index 2089c6c62f44..78192b544fa2 100644 > --- a/block/blk-mq-tag.c > +++ b/block/blk-mq-tag.c > @@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) > > if (!reserved) > bitnr += tags->nr_reserved_tags; > - rq = tags->rqs[bitnr]; > + if (tags->rqs[bitnr].queue != hctx->queue) > + return true; > > /* > * We can hit rq == NULL here, because the tagging functions > * test and set the bit before assigning ->rqs[]. > */ > - if (rq && rq->q == hctx->queue) > + rq = tags->rqs[bitnr].rq; > + if (rq) > return iter_data->fn(hctx, rq, iter_data->data, reserved); > return true; > } > @@ -268,6 +270,7 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt, > > struct bt_tags_iter_data { > struct blk_mq_tags *tags; > + struct blk_mq_hw_ctx *hctx; > busy_tag_iter_fn *fn; > void *data; > bool reserved; > @@ -287,7 +290,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) > * We can hit rq == NULL here, because the tagging functions > * test and set the bit before assining ->rqs[]. 
> */ > - rq = tags->rqs[bitnr]; > + rq = tags->rqs[bitnr].rq; > if (rq && blk_mq_request_started(rq)) > return iter_data->fn(rq, iter_data->data, reserved); > > diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h > index 61deab0b5a5a..bb84d1f099c7 100644 > --- a/block/blk-mq-tag.h > +++ b/block/blk-mq-tag.h > @@ -4,6 +4,11 @@ > > #include "blk-mq.h" > > +struct rq_tag_entry { > + struct request_queue *queue; > + struct request *rq; > +}; > + > /* > * Tag address space map. > */ > @@ -16,7 +21,7 @@ struct blk_mq_tags { > struct sbitmap_queue bitmap_tags; > struct sbitmap_queue breserved_tags; > > - struct request **rqs; > + struct rq_tag_entry *rqs; > struct request **static_rqs; > struct list_head page_list; > }; > @@ -78,7 +83,8 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx) > static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx, > unsigned int tag, struct request *rq) > { > - hctx->tags->rqs[tag] = rq; > + hctx->tags->rqs[tag].queue = hctx->queue; > + hctx->tags->rqs[tag].rq = rq; > } > > static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 3ba37b9e15e9..4f194946dbd9 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -298,13 +298,16 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, > rq->tag = -1; > rq->internal_tag = tag; > } else { > - if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) { > + struct blk_mq_hw_ctx *hctx = data->hctx; > + > + if (hctx->flags & BLK_MQ_F_TAG_SHARED) { > rq_flags = RQF_MQ_INFLIGHT; > - atomic_inc(&data->hctx->nr_active); > + atomic_inc(&hctx->nr_active); > } > rq->tag = tag; > rq->internal_tag = -1; > - data->hctx->tags->rqs[rq->tag] = rq; > + hctx->tags->rqs[rq->tag].queue = hctx->queue; > + hctx->tags->rqs[rq->tag].rq = rq; > } > > /* csd/requeue_work/fifo_time is initialized before use */ > @@ -797,8 +800,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list); > struct request *blk_mq_tag_to_rq(struct 
blk_mq_tags *tags, unsigned int tag) > { > if (tag < tags->nr_tags) { > - prefetch(tags->rqs[tag]); > - return tags->rqs[tag]; > + prefetch(tags->rqs[tag].rq); > + return tags->rqs[tag].rq; > } > > return NULL; > @@ -809,10 +812,11 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq, > void *priv, bool reserved) > { > /* > - * If we find a request that is inflight and the queue matches, > - * we know the queue is busy. Return false to stop the iteration. > + * We're only called here if the queue matches. If the rq state is > + * inflight, we know the queue is busy. Return false to stop the > + * iteration. > */ > - if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) { > + if (rq->state == MQ_RQ_IN_FLIGHT) { > bool *busy = priv; > > *busy = true; > @@ -1049,11 +1053,14 @@ bool blk_mq_get_driver_tag(struct request *rq) > shared = blk_mq_tag_busy(data.hctx); > rq->tag = blk_mq_get_tag(&data); > if (rq->tag >= 0) { > + struct blk_mq_hw_ctx *hctx = data.hctx; > + > if (shared) { > rq->rq_flags |= RQF_MQ_INFLIGHT; > atomic_inc(&data.hctx->nr_active); > } > - data.hctx->tags->rqs[rq->tag] = rq; > + hctx->tags->rqs[rq->tag].queue = hctx->queue; > + hctx->tags->rqs[rq->tag].rq = rq; > } > > done: > @@ -2069,7 +2076,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, > if (!tags) > return NULL; > > - tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *), > + tags->rqs = kcalloc_node(nr_tags, sizeof(struct rq_tag_entry), > GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, > node); > if (!tags->rqs) { Hi Jens, How about the patch below? Its behavior should be similar to your patch but I think this patch is simpler. Thanks, Bart. 
--- block/blk-mq.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 6a7566244de3..c57611b1f2a8 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -86,6 +86,7 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx, } struct mq_inflight { + struct request_queue *q; struct hd_struct *part; unsigned int *inflight; }; @@ -96,6 +97,9 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, { struct mq_inflight *mi = priv; + if (rq->q != mi->q) + return; + /* * index[0] counts the specific partition that was asked for. index[1] * counts the ones that are active on the whole device, so increment @@ -110,7 +114,11 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part, unsigned int inflight[2]) { - struct mq_inflight mi = { .part = part, .inflight = inflight, }; + struct mq_inflight mi = { + .q = q, + .part = part, + .inflight = inflight, + }; inflight[0] = inflight[1] = 0; blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi); @@ -129,7 +137,11 @@ static void blk_mq_check_inflight_rw(struct blk_mq_hw_ctx *hctx, void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part, unsigned int inflight[2]) { - struct mq_inflight mi = { .part = part, .inflight = inflight, }; + struct mq_inflight mi = { + .q = q, + .part = part, + .inflight = inflight, + }; inflight[0] = inflight[1] = 0; blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_rw, &mi); @@ -477,6 +489,8 @@ static void __blk_mq_free_request(struct request *rq) const int sched_tag = rq->internal_tag; blk_pm_mark_last_busy(rq); + /* Avoid that blk_mq_queue_tag_busy_iter() picks up this request. */ + rq->q = NULL; if (rq->tag != -1) blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag); if (sched_tag != -1) ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 20:56 ` Bart Van Assche @ 2018-12-20 21:00 ` Jens Axboe 2018-12-20 21:23 ` Bart Van Assche 0 siblings, 1 reply; 34+ messages in thread From: Jens Axboe @ 2018-12-20 21:00 UTC (permalink / raw) To: Bart Van Assche, jianchao.wang, linux-block On 12/20/18 1:56 PM, Bart Van Assche wrote: > @@ -96,6 +97,9 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, > { > struct mq_inflight *mi = priv; > > + if (rq->q != mi->q) > + return; > + Aren't you back to square one with this one, if the tags are shared? You can't dereference it before you know it matches. -- Jens Axboe
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 21:00 ` Jens Axboe @ 2018-12-20 21:23 ` Bart Van Assche 2018-12-20 21:26 ` Jens Axboe 0 siblings, 1 reply; 34+ messages in thread From: Bart Van Assche @ 2018-12-20 21:23 UTC (permalink / raw) To: Jens Axboe, jianchao.wang, linux-block On Thu, 2018-12-20 at 14:00 -0700, Jens Axboe wrote: > On 12/20/18 1:56 PM, Bart Van Assche wrote: > > @@ -96,6 +97,9 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, > > { > > struct mq_inflight *mi = priv; > > > > + if (rq->q != mi->q) > > + return; > > Aren't you back to square one with this one, if the tags are shared? You > can't dereference it before you know it matches. My patch can only work if the new rq->q = NULL assignment in __blk_mq_free_request() is executed before the request tag is freed and if freeing a tag does not happen concurrently with any bt_iter() call. Would you accept that I add a seqlock to avoid this scenario? Thanks, Bart.
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 21:23 ` Bart Van Assche @ 2018-12-20 21:26 ` Jens Axboe 2018-12-20 21:31 ` Bart Van Assche 0 siblings, 1 reply; 34+ messages in thread From: Jens Axboe @ 2018-12-20 21:26 UTC (permalink / raw) To: Bart Van Assche, jianchao.wang, linux-block On 12/20/18 2:23 PM, Bart Van Assche wrote: > On Thu, 2018-12-20 at 14:00 -0700, Jens Axboe wrote: > > On 12/20/18 1:56 PM, Bart Van Assche wrote: > > > @@ -96,6 +97,9 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, > > > { > > > struct mq_inflight *mi = priv; > > > > > > + if (rq->q != mi->q) > > > + return; > > > > Aren't you back to square one with this one, if the tags are shared? You > > can't dereference it before you know it matches. > > My patch can only work if the new rq->q = NULL assignment in __blk_mq_free_request() > is executed before the request tag is freed and if freeing a tag does not happen > concurrently with any bt_iter() call. Would you accept that I add a seqlock to avoid > this scenario? Ugh no, let's not go that far. Why not just use my approach that avoids any need to dereference rq, unless we know it belongs to the queue in question? I think that's cheaper than resetting ->queue as well when the rq completes; I'm always wary of adding new stores in the completion path. -- Jens Axboe
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 21:26 ` Jens Axboe @ 2018-12-20 21:31 ` Bart Van Assche 2018-12-20 21:34 ` Jens Axboe 0 siblings, 1 reply; 34+ messages in thread From: Bart Van Assche @ 2018-12-20 21:31 UTC (permalink / raw) To: Jens Axboe, jianchao.wang, linux-block On Thu, 2018-12-20 at 14:26 -0700, Jens Axboe wrote: > On 12/20/18 2:23 PM, Bart Van Assche wrote: > > On Thu, 2018-12-20 at 14:00 -0700, Jens Axboe wrote: > > > On 12/20/18 1:56 PM, Bart Van Assche wrote: > > > > @@ -96,6 +97,9 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, > > > > { > > > > struct mq_inflight *mi = priv; > > > > > > > > + if (rq->q != mi->q) > > > > + return; > > > > > > Aren't you back to square one with this one, if the tags are shared? You > > > can't dereference it before you know it matches. > > > > My patch can only work if the new rq->q = NULL assignment in __blk_mq_free_request() > > is executed before the request tag is freed and if freeing a tag does not happen > > concurrently with any bt_iter() call. Would you accept that I add a seqlock to avoid > > this scenario? > > Ugh no, let's not go that far. Why not just use my approach that avoids > any need to dereference rq, unless we know it belongs to the queue in > question? I think that's cheaper than resetting ->queue as well when the > rq completes, I'm always wary of adding new stores in the completion > path. I think there is a race condition in bt_iter() in your approach: tags->rqs[bitnr].queue can change after it has been read and that can cause a request that is not associated with hctx->queue to be passed to iter_data->fn(). Since 'fn' will access '*rq' I think that with your patch a use-after-free can occur similar to the one reported at the start of this e-mail thread. Your patch may make it harder to trigger that issue though. Bart.
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 21:31 ` Bart Van Assche @ 2018-12-20 21:34 ` Jens Axboe 2018-12-20 21:40 ` Bart Van Assche 0 siblings, 1 reply; 34+ messages in thread From: Jens Axboe @ 2018-12-20 21:34 UTC (permalink / raw) To: Bart Van Assche, jianchao.wang, linux-block On 12/20/18 2:31 PM, Bart Van Assche wrote: > On Thu, 2018-12-20 at 14:26 -0700, Jens Axboe wrote: >> On 12/20/18 2:23 PM, Bart Van Assche wrote: >>> On Thu, 2018-12-20 at 14:00 -0700, Jens Axboe wrote: >>>> On 12/20/18 1:56 PM, Bart Van Assche wrote: >>>>> @@ -96,6 +97,9 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, >>>>> { >>>>> struct mq_inflight *mi = priv; >>>>> >>>>> + if (rq->q != mi->q) >>>>> + return; >>>> >>>> Aren't you back to square one with this one, if the tags are shared? You >>>> can't dereference it before you know it matches. >>> >>> My patch can only work if the new rq->q = NULL assignment in __blk_mq_free_request() >>> is executed before the request tag is freed and if freeing a tag does not happen >>> concurrently with any bt_iter() call. Would you accept that I add a seqlock to avoid >>> this scenario? >> >> Ugh no, let's not go that far. Why not just use my approach that avoids >> any need to dereference rq, unless we know it belongs to the queue in >> question? I think that's cheaper than resetting ->queue as well when the >> rq completes, I'm always wary of adding new stores in the completion >> path. > > I think there is a race condition in bt_iter() in your approach: > tags->rqs[bitnr].queue can change after it has been read and that can > cause a request that is not associated with hctx->queue to be passed > to iter_data->fn(). Since 'fn' will access '*rq' I think that with > your patch a use-after-free can occur similar to the one reported at > the start of this e-mail thread. Your patch may make it harder to > trigger that issue though. Yeah, I don't think it's bullet proof either, it just closes the gap. 
I'm fine with fiddling with the tag iteration. On top of what I sent, we could have tag iteration hold the RCU read lock, and then we just need to ensure that the tags are freed with RCU. -- Jens Axboe
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 21:34 ` Jens Axboe @ 2018-12-20 21:40 ` Bart Van Assche 2018-12-20 21:44 ` Jens Axboe 0 siblings, 1 reply; 34+ messages in thread From: Bart Van Assche @ 2018-12-20 21:40 UTC (permalink / raw) To: Jens Axboe, jianchao.wang, linux-block On Thu, 2018-12-20 at 14:34 -0700, Jens Axboe wrote: > Yeah, I don't think it's bullet proof either, it just closes the gap. > I'm fine with fiddling with the tag iteration. On top of what I sent, we > could have tag iteration hold the RCU read lock, and then we just need > to ensure that the tags are freed with RCU. Do you mean using call_rcu() to free tags? Would that require adding a struct rcu_head to every request? Would it be acceptable to increase the size of struct request with an rcu_head? Additionally, could that reduce the queue depth if the time between grace periods is larger than the time between I/O submissions? Thanks, Bart.
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 21:40 ` Bart Van Assche @ 2018-12-20 21:44 ` Jens Axboe 2018-12-20 21:48 ` Jens Axboe 0 siblings, 1 reply; 34+ messages in thread From: Jens Axboe @ 2018-12-20 21:44 UTC (permalink / raw) To: Bart Van Assche, jianchao.wang, linux-block On 12/20/18 2:40 PM, Bart Van Assche wrote: > On Thu, 2018-12-20 at 14:34 -0700, Jens Axboe wrote: >> Yeah, I don't think it's bullet proof either, it just closes the gap. >> I'm fine with fiddling with the tag iteration. On top of what I sent, we >> could have tag iteration hold the RCU read lock, and then we just need >> to ensure that the tags are freed with RCU. > > Do you mean using call_rcu() to free tags? Would that require to add a > struct rcu_head to every request? Would it be acceptable to increase the > size of struct request with an rcu_head? Additionally, could that reduce > the queue depth if the time between grace periods is larger than the time > between I/O submissions? No no, the requests are out of full pages, we just need one per blk_mq_tags. Something like the below... And then the tag iteration needs to grab the RCU read lock to prevent the page freeing from happening. This should essentially be free, as rcu read lock for tag iteration is basically a no-op. Need to handle the flush request on top of this as well. diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 2089c6c62f44..c39b58391ae8 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) if (!reserved) bitnr += tags->nr_reserved_tags; - rq = tags->rqs[bitnr]; + if (tags->rqs[bitnr].queue != hctx->queue) + return true; /* * We can hit rq == NULL here, because the tagging functions * test and set the bit before assigning ->rqs[]. 
*/ - if (rq && rq->q == hctx->queue) + rq = tags->rqs[bitnr].rq; + if (rq) return iter_data->fn(hctx, rq, iter_data->data, reserved); return true; } @@ -263,7 +265,9 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt, .reserved = reserved, }; + rcu_read_lock(); sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data); + rcu_read_unlock(); } struct bt_tags_iter_data { @@ -287,7 +291,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) * We can hit rq == NULL here, because the tagging functions * test and set the bit before assining ->rqs[]. */ - rq = tags->rqs[bitnr]; + rq = tags->rqs[bitnr].rq; if (rq && blk_mq_request_started(rq)) return iter_data->fn(rq, iter_data->data, reserved); @@ -317,8 +321,11 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt, .reserved = reserved, }; - if (tags->rqs) + if (tags->rqs) { + rcu_read_lock(); sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data); + rcu_read_unlock(); + } } /** diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h index 61deab0b5a5a..bdd1bfc08c21 100644 --- a/block/blk-mq-tag.h +++ b/block/blk-mq-tag.h @@ -4,6 +4,11 @@ #include "blk-mq.h" +struct rq_tag_entry { + struct request_queue *queue; + struct request *rq; +}; + /* * Tag address space map. 
*/ @@ -16,9 +21,11 @@ struct blk_mq_tags { struct sbitmap_queue bitmap_tags; struct sbitmap_queue breserved_tags; - struct request **rqs; + struct rq_tag_entry *rqs; struct request **static_rqs; struct list_head page_list; + + struct rcu_work rcu_work; }; @@ -78,7 +85,8 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx) static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx, unsigned int tag, struct request *rq) { - hctx->tags->rqs[tag] = rq; + hctx->tags->rqs[tag].queue = hctx->queue; + hctx->tags->rqs[tag].rq = rq; } static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, diff --git a/block/blk-mq.c b/block/blk-mq.c index 2de972857496..4decd1e7d2d9 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -298,13 +298,16 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, rq->tag = -1; rq->internal_tag = tag; } else { - if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) { + struct blk_mq_hw_ctx *hctx = data->hctx; + + if (hctx->flags & BLK_MQ_F_TAG_SHARED) { rq_flags = RQF_MQ_INFLIGHT; - atomic_inc(&data->hctx->nr_active); + atomic_inc(&hctx->nr_active); } rq->tag = tag; rq->internal_tag = -1; - data->hctx->tags->rqs[rq->tag] = rq; + hctx->tags->rqs[rq->tag].queue = hctx->queue; + hctx->tags->rqs[rq->tag].rq = rq; } /* csd/requeue_work/fifo_time is initialized before use */ @@ -797,8 +800,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list); struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag) { if (tag < tags->nr_tags) { - prefetch(tags->rqs[tag]); - return tags->rqs[tag]; + prefetch(tags->rqs[tag].rq); + return tags->rqs[tag].rq; } return NULL; @@ -809,10 +812,11 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq, void *priv, bool reserved) { /* - * If we find a request that is inflight and the queue matches, - * we know the queue is busy. Return false to stop the iteration. + * We're only called here if the queue matches. 
If the rq state is + * inflight, we know the queue is busy. Return false to stop the + * iteration. */ - if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) { + if (rq->state == MQ_RQ_IN_FLIGHT) { bool *busy = priv; *busy = true; @@ -1049,11 +1053,14 @@ bool blk_mq_get_driver_tag(struct request *rq) shared = blk_mq_tag_busy(data.hctx); rq->tag = blk_mq_get_tag(&data); if (rq->tag >= 0) { + struct blk_mq_hw_ctx *hctx = data.hctx; + if (shared) { rq->rq_flags |= RQF_MQ_INFLIGHT; atomic_inc(&data.hctx->nr_active); } - data.hctx->tags->rqs[rq->tag] = rq; + hctx->tags->rqs[rq->tag].queue = hctx->queue; + hctx->tags->rqs[rq->tag].rq = rq; } done: @@ -2020,11 +2027,27 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) return cookie; } -void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, - unsigned int hctx_idx) +static void blk_mq_rcu_free_pages(struct work_struct *work) { + struct blk_mq_tags *tags = container_of(to_rcu_work(work), + struct blk_mq_tags, rcu_work); struct page *page; + while (!list_empty(&tags->page_list)) { + page = list_first_entry(&tags->page_list, struct page, lru); + list_del_init(&page->lru); + /* + * Remove kmemleak object previously allocated in + * blk_mq_init_rq_map(). + */ + kmemleak_free(page_address(page)); + __free_pages(page, page->private); + } +} + +void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, + unsigned int hctx_idx) +{ if (tags->rqs && set->ops->exit_request) { int i; @@ -2038,16 +2061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, } } - while (!list_empty(&tags->page_list)) { - page = list_first_entry(&tags->page_list, struct page, lru); - list_del_init(&page->lru); - /* - * Remove kmemleak object previously allocated in - * blk_mq_init_rq_map(). 
- */ - kmemleak_free(page_address(page)); - __free_pages(page, page->private); - } + /* Punt to RCU free, so we don't race with tag iteration */ + INIT_RCU_WORK(&tags->rcu_work, blk_mq_rcu_free_pages); + queue_rcu_work(system_wq, &tags->rcu_work); } void blk_mq_free_rq_map(struct blk_mq_tags *tags) @@ -2077,7 +2093,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, if (!tags) return NULL; - tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *), + tags->rqs = kcalloc_node(nr_tags, sizeof(struct rq_tag_entry), GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node); if (!tags->rqs) { -- Jens Axboe ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 21:44 ` Jens Axboe @ 2018-12-20 21:48 ` Jens Axboe 2018-12-20 22:19 ` Bart Van Assche 0 siblings, 1 reply; 34+ messages in thread From: Jens Axboe @ 2018-12-20 21:48 UTC (permalink / raw) To: Bart Van Assche, jianchao.wang, linux-block On 12/20/18 2:44 PM, Jens Axboe wrote: > On 12/20/18 2:40 PM, Bart Van Assche wrote: >> On Thu, 2018-12-20 at 14:34 -0700, Jens Axboe wrote: >>> Yeah, I don't think it's bullet proof either, it just closes the gap. >>> I'm fine with fiddling with the tag iteration. On top of what I sent, we >>> could have tag iteration hold the RCU read lock, and then we just need >>> to ensure that the tags are freed with RCU. >> >> Do you mean using call_rcu() to free tags? Would that require to add a >> struct rcu_head to every request? Would it be acceptable to increase the >> size of struct request with an rcu_head? Additionally, could that reduce >> the queue depth if the time between grace periods is larger than the time >> between I/O submissions? > > No no, the requests are out of full pages, we just need one per blk_mq_tags. > Something like the below... And then the tag iteration needs to grab the > RCU read lock to prevent the page freeing from happening. > > This should essentially be free, as rcu read lock for tag iteration is > basically a no-op. > > Need to handle the flush request on top of this as well. With FQ handled as well. 
diff --git a/block/blk-flush.c b/block/blk-flush.c index a3fc7191c694..23e1f5fb091f 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -491,12 +491,22 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q, return NULL; } +static void blk_fq_rcu_free(struct work_struct *work) +{ + struct blk_flush_queue *fq = container_of(to_rcu_work(work), + struct blk_flush_queue, + rcu_work); + + kfree(fq->flush_rq); + kfree(fq); +} + void blk_free_flush_queue(struct blk_flush_queue *fq) { /* bio based request queue hasn't flush queue */ if (!fq) return; - kfree(fq->flush_rq); - kfree(fq); + INIT_RCU_WORK(&fq->rcu_work, blk_fq_rcu_free); + queue_rcu_work(system_wq, &fq->rcu_work); } diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 2089c6c62f44..c39b58391ae8 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) if (!reserved) bitnr += tags->nr_reserved_tags; - rq = tags->rqs[bitnr]; + if (tags->rqs[bitnr].queue != hctx->queue) + return true; /* * We can hit rq == NULL here, because the tagging functions * test and set the bit before assigning ->rqs[]. */ - if (rq && rq->q == hctx->queue) + rq = tags->rqs[bitnr].rq; + if (rq) return iter_data->fn(hctx, rq, iter_data->data, reserved); return true; } @@ -263,7 +265,9 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt, .reserved = reserved, }; + rcu_read_lock(); sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data); + rcu_read_unlock(); } struct bt_tags_iter_data { @@ -287,7 +291,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) * We can hit rq == NULL here, because the tagging functions * test and set the bit before assining ->rqs[]. 
*/ - rq = tags->rqs[bitnr]; + rq = tags->rqs[bitnr].rq; if (rq && blk_mq_request_started(rq)) return iter_data->fn(rq, iter_data->data, reserved); @@ -317,8 +321,11 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt, .reserved = reserved, }; - if (tags->rqs) + if (tags->rqs) { + rcu_read_lock(); sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data); + rcu_read_unlock(); + } } /** diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h index 61deab0b5a5a..bdd1bfc08c21 100644 --- a/block/blk-mq-tag.h +++ b/block/blk-mq-tag.h @@ -4,6 +4,11 @@ #include "blk-mq.h" +struct rq_tag_entry { + struct request_queue *queue; + struct request *rq; +}; + /* * Tag address space map. */ @@ -16,9 +21,11 @@ struct blk_mq_tags { struct sbitmap_queue bitmap_tags; struct sbitmap_queue breserved_tags; - struct request **rqs; + struct rq_tag_entry *rqs; struct request **static_rqs; struct list_head page_list; + + struct rcu_work rcu_work; }; @@ -78,7 +85,8 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx) static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx, unsigned int tag, struct request *rq) { - hctx->tags->rqs[tag] = rq; + hctx->tags->rqs[tag].queue = hctx->queue; + hctx->tags->rqs[tag].rq = rq; } static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, diff --git a/block/blk-mq.c b/block/blk-mq.c index 2de972857496..4decd1e7d2d9 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -298,13 +298,16 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, rq->tag = -1; rq->internal_tag = tag; } else { - if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) { + struct blk_mq_hw_ctx *hctx = data->hctx; + + if (hctx->flags & BLK_MQ_F_TAG_SHARED) { rq_flags = RQF_MQ_INFLIGHT; - atomic_inc(&data->hctx->nr_active); + atomic_inc(&hctx->nr_active); } rq->tag = tag; rq->internal_tag = -1; - data->hctx->tags->rqs[rq->tag] = rq; + hctx->tags->rqs[rq->tag].queue = hctx->queue; + hctx->tags->rqs[rq->tag].rq = rq; } /* 
csd/requeue_work/fifo_time is initialized before use */ @@ -797,8 +800,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list); struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag) { if (tag < tags->nr_tags) { - prefetch(tags->rqs[tag]); - return tags->rqs[tag]; + prefetch(tags->rqs[tag].rq); + return tags->rqs[tag].rq; } return NULL; @@ -809,10 +812,11 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq, void *priv, bool reserved) { /* - * If we find a request that is inflight and the queue matches, - * we know the queue is busy. Return false to stop the iteration. + * We're only called here if the queue matches. If the rq state is + * inflight, we know the queue is busy. Return false to stop the + * iteration. */ - if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) { + if (rq->state == MQ_RQ_IN_FLIGHT) { bool *busy = priv; *busy = true; @@ -1049,11 +1053,14 @@ bool blk_mq_get_driver_tag(struct request *rq) shared = blk_mq_tag_busy(data.hctx); rq->tag = blk_mq_get_tag(&data); if (rq->tag >= 0) { + struct blk_mq_hw_ctx *hctx = data.hctx; + if (shared) { rq->rq_flags |= RQF_MQ_INFLIGHT; atomic_inc(&data.hctx->nr_active); } - data.hctx->tags->rqs[rq->tag] = rq; + hctx->tags->rqs[rq->tag].queue = hctx->queue; + hctx->tags->rqs[rq->tag].rq = rq; } done: @@ -2020,11 +2027,27 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) return cookie; } -void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, - unsigned int hctx_idx) +static void blk_mq_rcu_free_pages(struct work_struct *work) { + struct blk_mq_tags *tags = container_of(to_rcu_work(work), + struct blk_mq_tags, rcu_work); struct page *page; + while (!list_empty(&tags->page_list)) { + page = list_first_entry(&tags->page_list, struct page, lru); + list_del_init(&page->lru); + /* + * Remove kmemleak object previously allocated in + * blk_mq_init_rq_map(). 
+ */ + kmemleak_free(page_address(page)); + __free_pages(page, page->private); + } +} + +void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, + unsigned int hctx_idx) +{ if (tags->rqs && set->ops->exit_request) { int i; @@ -2038,16 +2061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, } } - while (!list_empty(&tags->page_list)) { - page = list_first_entry(&tags->page_list, struct page, lru); - list_del_init(&page->lru); - /* - * Remove kmemleak object previously allocated in - * blk_mq_init_rq_map(). - */ - kmemleak_free(page_address(page)); - __free_pages(page, page->private); - } + /* Punt to RCU free, so we don't race with tag iteration */ + INIT_RCU_WORK(&tags->rcu_work, blk_mq_rcu_free_pages); + queue_rcu_work(system_wq, &tags->rcu_work); } void blk_mq_free_rq_map(struct blk_mq_tags *tags) @@ -2077,7 +2093,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, if (!tags) return NULL; - tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *), + tags->rqs = kcalloc_node(nr_tags, sizeof(struct rq_tag_entry), GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node); if (!tags->rqs) { diff --git a/block/blk.h b/block/blk.h index 848278c52030..785207fc8a30 100644 --- a/block/blk.h +++ b/block/blk.h @@ -29,6 +29,8 @@ struct blk_flush_queue { */ struct request *orig_rq; spinlock_t mq_flush_lock; + + struct rcu_work rcu_work; }; extern struct kmem_cache *blk_requestq_cachep; -- Jens Axboe ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 21:48 ` Jens Axboe @ 2018-12-20 22:19 ` Bart Van Assche 2018-12-20 22:23 ` Jens Axboe 0 siblings, 1 reply; 34+ messages in thread From: Bart Van Assche @ 2018-12-20 22:19 UTC (permalink / raw) To: Jens Axboe, jianchao.wang, linux-block On Thu, 2018-12-20 at 14:48 -0700, Jens Axboe wrote: > -void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, > - unsigned int hctx_idx) > +static void blk_mq_rcu_free_pages(struct work_struct *work) > { > + struct blk_mq_tags *tags = container_of(to_rcu_work(work), > + struct blk_mq_tags, rcu_work); > struct page *page; > > + while (!list_empty(&tags->page_list)) { > + page = list_first_entry(&tags->page_list, struct page, lru); > + list_del_init(&page->lru); > + /* > + * Remove kmemleak object previously allocated in > + * blk_mq_init_rq_map(). > + */ > + kmemleak_free(page_address(page)); > + __free_pages(page, page->private); > + } > +} > + > +void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, > + unsigned int hctx_idx) > +{ > if (tags->rqs && set->ops->exit_request) { > int i; > > @@ -2038,16 +2061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, > } > } > > - while (!list_empty(&tags->page_list)) { > - page = list_first_entry(&tags->page_list, struct page, lru); > - list_del_init(&page->lru); > - /* > - * Remove kmemleak object previously allocated in > - * blk_mq_init_rq_map(). > - */ > - kmemleak_free(page_address(page)); > - __free_pages(page, page->private); > - } > + /* Punt to RCU free, so we don't race with tag iteration */ > + INIT_RCU_WORK(&tags->rcu_work, blk_mq_rcu_free_pages); > + queue_rcu_work(system_wq, &tags->rcu_work); > } This can only work correctly if blk_mq_rcu_free_pages() is called before INIT_RCU_WORK() is called a second time for the same blk_mq_tags structure and if blk_mq_rcu_free_pages() is called before struct blk_mq_tags is freed.
What provides these guarantees? Did I perhaps miss something? Thanks, Bart. ^ permalink raw reply [flat|nested] 34+ messages in thread
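Bart's lifetime question hinges on a detail of the patch above: the deferred callback recovers its state via container_of() on the embedded rcu_work member, so the enclosing blk_mq_tags must stay alive until the callback has run. A minimal userspace sketch of that recovery step follows; the struct layouts and names (mock_work, mock_tags, tags_from_work) are illustrative stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace model of the kernel's container_of() idiom: recover a
 * pointer to the outer structure from a pointer to one of its members.
 */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct mock_work { int pending; };

struct mock_tags {
	unsigned int nr_tags;
	struct mock_work rcu_work;	/* embedded, like struct rcu_work */
};

/* What the deferred callback does first: recover the outer object. */
static struct mock_tags *tags_from_work(struct mock_work *work)
{
	return container_of(work, struct mock_tags, rcu_work);
}
```

Because the callback computes the outer pointer from the member's address, nothing pins the outer object: if blk_mq_tags is freed before the work runs, the computed pointer is dangling, which is exactly the guarantee Bart is asking about.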
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 22:19 ` Bart Van Assche @ 2018-12-20 22:23 ` Jens Axboe 2018-12-20 22:33 ` Jens Axboe 0 siblings, 1 reply; 34+ messages in thread From: Jens Axboe @ 2018-12-20 22:23 UTC (permalink / raw) To: Bart Van Assche, jianchao.wang, linux-block On 12/20/18 3:19 PM, Bart Van Assche wrote: > On Thu, 2018-12-20 at 14:48 -0700, Jens Axboe wrote: >> -void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >> - unsigned int hctx_idx) >> +static void blk_mq_rcu_free_pages(struct work_struct *work) >> { >> + struct blk_mq_tags *tags = container_of(to_rcu_work(work), >> + struct blk_mq_tags, rcu_work); >> struct page *page; >> >> + while (!list_empty(&tags->page_list)) { >> + page = list_first_entry(&tags->page_list, struct page, lru); >> + list_del_init(&page->lru); >> + /* >> + * Remove kmemleak object previously allocated in >> + * blk_mq_init_rq_map(). >> + */ >> + kmemleak_free(page_address(page)); >> + __free_pages(page, page->private); >> + } >> +} >> + >> +void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >> + unsigned int hctx_idx) >> +{ >> if (tags->rqs && set->ops->exit_request) { >> int i; >> >> @@ -2038,16 +2061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >> } >> } >> >> - while (!list_empty(&tags->page_list)) { >> - page = list_first_entry(&tags->page_list, struct page, lru); >> - list_del_init(&page->lru); >> - /* >> - * Remove kmemleak object previously allocated in >> - * blk_mq_init_rq_map(). 
>> - */ >> - kmemleak_free(page_address(page)); >> - __free_pages(page, page->private); >> - } >> + /* Punt to RCU free, so we don't race with tag iteration */ >> + INIT_RCU_WORK(&tags->rcu_work, blk_mq_rcu_free_pages); >> + queue_rcu_work(system_wq, &tags->rcu_work); >> } > > This can only work correctly if blk_mq_rcu_free_pages() is called before > INIT_RCU_WORK() is called a second time for the same bkl_mq_tags structure > and if blk_mq_rcu_free_pages() is called before struct blk_mq_tags is freed. > What provides these guarantees? Did I perhaps miss something? We don't call it twice. Protecting against that is outside the scope of this function. But we call and clear from the regular shutdown path, and the rest is error handling on setup. But yes, we do need to ensure that 'tags' doesn't go away. Probably best to rework it to shove it somewhere else for freeing for that case, and leave the rest of the shutdown alone. I'll update the patch. -- Jens Axboe ^ permalink raw reply [flat|nested] 34+ messages in thread
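The rework Jens describes ("shove it somewhere else for freeing") amounts to splicing the page list into a separately allocated carrier, so the deferred free no longer depends on the lifetime of the tags structure. A hedged userspace sketch of that ownership transfer, with illustrative names (mock_page, detach_page_list) in place of the kernel's list API:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-ins for struct page and struct blk_mq_tags. */
struct mock_page { struct mock_page *next; };

struct mock_tags { struct mock_page *page_list; };

/* Carrier that owns everything the deferred callback will need. */
struct page_carrier { struct mock_page *pages; };

/*
 * Steal the whole page list from tags into a new carrier, leaving
 * tags empty (the role list_splice_init() plays in the real patch).
 * After this, tags may be freed at any time; only the carrier has
 * to survive until the deferred free runs.
 */
static struct page_carrier *detach_page_list(struct mock_tags *tags)
{
	struct page_carrier *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	c->pages = tags->page_list;
	tags->page_list = NULL;
	return c;
}
```

The design point is that the carrier, not the tags structure, becomes the container_of() anchor for the rcu_work, which answers Bart's lifetime objection for the common path.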
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 22:23 ` Jens Axboe @ 2018-12-20 22:33 ` Jens Axboe 2018-12-20 22:47 ` Jens Axboe 0 siblings, 1 reply; 34+ messages in thread From: Jens Axboe @ 2018-12-20 22:33 UTC (permalink / raw) To: Bart Van Assche, jianchao.wang, linux-block On 12/20/18 3:23 PM, Jens Axboe wrote: > On 12/20/18 3:19 PM, Bart Van Assche wrote: >> On Thu, 2018-12-20 at 14:48 -0700, Jens Axboe wrote: >>> -void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >>> - unsigned int hctx_idx) >>> +static void blk_mq_rcu_free_pages(struct work_struct *work) >>> { >>> + struct blk_mq_tags *tags = container_of(to_rcu_work(work), >>> + struct blk_mq_tags, rcu_work); >>> struct page *page; >>> >>> + while (!list_empty(&tags->page_list)) { >>> + page = list_first_entry(&tags->page_list, struct page, lru); >>> + list_del_init(&page->lru); >>> + /* >>> + * Remove kmemleak object previously allocated in >>> + * blk_mq_init_rq_map(). >>> + */ >>> + kmemleak_free(page_address(page)); >>> + __free_pages(page, page->private); >>> + } >>> +} >>> + >>> +void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >>> + unsigned int hctx_idx) >>> +{ >>> if (tags->rqs && set->ops->exit_request) { >>> int i; >>> >>> @@ -2038,16 +2061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >>> } >>> } >>> >>> - while (!list_empty(&tags->page_list)) { >>> - page = list_first_entry(&tags->page_list, struct page, lru); >>> - list_del_init(&page->lru); >>> - /* >>> - * Remove kmemleak object previously allocated in >>> - * blk_mq_init_rq_map(). 
>>> - */ >>> - kmemleak_free(page_address(page)); >>> - __free_pages(page, page->private); >>> - } >>> + /* Punt to RCU free, so we don't race with tag iteration */ >>> + INIT_RCU_WORK(&tags->rcu_work, blk_mq_rcu_free_pages); >>> + queue_rcu_work(system_wq, &tags->rcu_work); >>> } >> >> This can only work correctly if blk_mq_rcu_free_pages() is called before >> INIT_RCU_WORK() is called a second time for the same bkl_mq_tags structure >> and if blk_mq_rcu_free_pages() is called before struct blk_mq_tags is freed. >> What provides these guarantees? Did I perhaps miss something? > > We don't call it twice. Protecting against that is outside the scope > of this function. But we call and clear form the regular shutdown path, > and the rest is error handling on setup. > > But yes, we do need to ensure that 'tags' doesn't go away. Probably best > to rework it to shove it somewhere else for freeing for that case, and > leave the rest of the shutdown alone. I'll update the patch. Here's a version that doesn't rely on tags, we just dynamically allocate the structure. For the very odd case where we fail, we just punt to an on-stack structure and use the big synchronize_rcu() hammer first. 
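The fallback logic in this version, try to allocate the carrier and on failure fall back to an on-stack copy plus a synchronous grace-period wait, can be sketched as plain control flow. This is only a model: synchronize_rcu() is replaced by a no-op placeholder, and alloc_ok stands in for whether the kmalloc() succeeded.

```c
#include <assert.h>
#include <stdbool.h>

/* Which path ended up freeing the pages. */
enum free_path { FREED_DEFERRED, FREED_SYNC };

/* Placeholder for synchronize_rcu(): wait until all readers are done. */
static void mock_synchronize_rcu(void) { }

/*
 * Model of blk_mq_free_rqs()'s two paths: with a heap carrier the
 * free can be deferred past an RCU grace period via queue_rcu_work();
 * without one we must wait synchronously before freeing from the
 * on-stack copy, since the stack frame cannot outlive the function.
 */
static enum free_path free_pages_model(bool alloc_ok)
{
	if (alloc_ok) {
		/* queue_rcu_work(): free runs later, after a grace period */
		return FREED_DEFERRED;
	}
	/* OOM fallback: block until readers are done, then free inline */
	mock_synchronize_rcu();
	return FREED_SYNC;
}
```

The trade-off is latency: the fallback path may block the caller for a full grace period, but it only triggers when a small GFP_NOIO allocation fails.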
diff --git a/block/blk-flush.c b/block/blk-flush.c index a3fc7191c694..827e3d2180d8 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -491,12 +491,22 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q, return NULL; } +static void blk_fq_rcu_free(struct work_struct *work) +{ + struct blk_flush_queue *fq = container_of(to_rcu_work(work), + struct blk_flush_queue, + rcu_work); + + kfree(fq->flush_rq); + kfree(fq); +} + void blk_free_flush_queue(struct blk_flush_queue *fq) { /* bio based request queue hasn't flush queue */ if (!fq) return; - kfree(fq->flush_rq); - kfree(fq); + INIT_RCU_WORK(&fq->rcu_work, blk_fq_rcu_free); + queue_rcu_work(system_wq, &fq->rcu_work); } diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 2089c6c62f44..c39b58391ae8 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) if (!reserved) bitnr += tags->nr_reserved_tags; - rq = tags->rqs[bitnr]; + if (tags->rqs[bitnr].queue != hctx->queue) + return true; /* * We can hit rq == NULL here, because the tagging functions * test and set the bit before assigning ->rqs[]. */ - if (rq && rq->q == hctx->queue) + rq = tags->rqs[bitnr].rq; + if (rq) return iter_data->fn(hctx, rq, iter_data->data, reserved); return true; } @@ -263,7 +265,9 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt, .reserved = reserved, }; + rcu_read_lock(); sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data); + rcu_read_unlock(); } struct bt_tags_iter_data { @@ -287,7 +291,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) * We can hit rq == NULL here, because the tagging functions * test and set the bit before assining ->rqs[]. 
*/ - rq = tags->rqs[bitnr]; + rq = tags->rqs[bitnr].rq; if (rq && blk_mq_request_started(rq)) return iter_data->fn(rq, iter_data->data, reserved); @@ -317,8 +321,11 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt, .reserved = reserved, }; - if (tags->rqs) + if (tags->rqs) { + rcu_read_lock(); sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data); + rcu_read_unlock(); + } } /** diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h index 61deab0b5a5a..bb84d1f099c7 100644 --- a/block/blk-mq-tag.h +++ b/block/blk-mq-tag.h @@ -4,6 +4,11 @@ #include "blk-mq.h" +struct rq_tag_entry { + struct request_queue *queue; + struct request *rq; +}; + /* * Tag address space map. */ @@ -16,7 +21,7 @@ struct blk_mq_tags { struct sbitmap_queue bitmap_tags; struct sbitmap_queue breserved_tags; - struct request **rqs; + struct rq_tag_entry *rqs; struct request **static_rqs; struct list_head page_list; }; @@ -78,7 +83,8 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx) static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx, unsigned int tag, struct request *rq) { - hctx->tags->rqs[tag] = rq; + hctx->tags->rqs[tag].queue = hctx->queue; + hctx->tags->rqs[tag].rq = rq; } static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, diff --git a/block/blk-mq.c b/block/blk-mq.c index 2de972857496..d104bcb4308f 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -298,13 +298,16 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, rq->tag = -1; rq->internal_tag = tag; } else { - if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) { + struct blk_mq_hw_ctx *hctx = data->hctx; + + if (hctx->flags & BLK_MQ_F_TAG_SHARED) { rq_flags = RQF_MQ_INFLIGHT; - atomic_inc(&data->hctx->nr_active); + atomic_inc(&hctx->nr_active); } rq->tag = tag; rq->internal_tag = -1; - data->hctx->tags->rqs[rq->tag] = rq; + hctx->tags->rqs[rq->tag].queue = hctx->queue; + hctx->tags->rqs[rq->tag].rq = rq; } /* csd/requeue_work/fifo_time is 
initialized before use */ @@ -797,8 +800,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list); struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag) { if (tag < tags->nr_tags) { - prefetch(tags->rqs[tag]); - return tags->rqs[tag]; + prefetch(tags->rqs[tag].rq); + return tags->rqs[tag].rq; } return NULL; @@ -809,10 +812,11 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq, void *priv, bool reserved) { /* - * If we find a request that is inflight and the queue matches, - * we know the queue is busy. Return false to stop the iteration. + * We're only called here if the queue matches. If the rq state is + * inflight, we know the queue is busy. Return false to stop the + * iteration. */ - if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) { + if (rq->state == MQ_RQ_IN_FLIGHT) { bool *busy = priv; *busy = true; @@ -1049,11 +1053,14 @@ bool blk_mq_get_driver_tag(struct request *rq) shared = blk_mq_tag_busy(data.hctx); rq->tag = blk_mq_get_tag(&data); if (rq->tag >= 0) { + struct blk_mq_hw_ctx *hctx = data.hctx; + if (shared) { rq->rq_flags |= RQF_MQ_INFLIGHT; atomic_inc(&data.hctx->nr_active); } - data.hctx->tags->rqs[rq->tag] = rq; + hctx->tags->rqs[rq->tag].queue = hctx->queue; + hctx->tags->rqs[rq->tag].rq = rq; } done: @@ -2020,10 +2027,34 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) return cookie; } +struct rq_page_list { + struct rcu_work rcu_work; + struct list_head list; +}; + +static void blk_mq_rcu_free_pages(struct work_struct *work) +{ + struct rq_page_list *rpl = container_of(to_rcu_work(work), + struct rq_page_list, rcu_work); + struct page *page; + + while (!list_empty(&rpl->list)) { + page = list_first_entry(&rpl->list, struct page, lru); + list_del_init(&page->lru); + /* + * Remove kmemleak object previously allocated in + * blk_mq_init_rq_map(). 
+ */ + kmemleak_free(page_address(page)); + __free_pages(page, page->private); + } + kfree(rpl); +} + void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, unsigned int hctx_idx) { - struct page *page; + struct rq_page_list *rpl; if (tags->rqs && set->ops->exit_request) { int i; @@ -2038,15 +2069,26 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, } } - while (!list_empty(&tags->page_list)) { - page = list_first_entry(&tags->page_list, struct page, lru); - list_del_init(&page->lru); + if (list_empty(&tags->page_list)) + return; + + rpl = kmalloc(sizeof(*rpl), GFP_NOIO); + if (rpl) { + list_splice_init(&tags->page_list, &rpl->list); + + /* Punt to RCU free, so we don't race with tag iteration */ + INIT_RCU_WORK(&rpl->rcu_work, blk_mq_rcu_free_pages); + queue_rcu_work(system_wq, &rpl->rcu_work); + } else { + struct rq_page_list stack_rpl; + /* - * Remove kmemleak object previously allocated in - * blk_mq_init_rq_map(). + * Fail alloc, punt to on-stack, we just have to synchronize + * RCU first to ensure readers are done. */ - kmemleak_free(page_address(page)); - __free_pages(page, page->private); + list_splice_init(&tags->page_list, &stack_rpl.list); + synchronize_rcu(); + blk_mq_rcu_free_pages(&stack_rpl.rcu_work.work); } } @@ -2077,7 +2119,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, if (!tags) return NULL; - tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *), + tags->rqs = kcalloc_node(nr_tags, sizeof(struct rq_tag_entry), GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node); if (!tags->rqs) { diff --git a/block/blk.h b/block/blk.h index 848278c52030..785207fc8a30 100644 --- a/block/blk.h +++ b/block/blk.h @@ -29,6 +29,8 @@ struct blk_flush_queue { */ struct request *orig_rq; spinlock_t mq_flush_lock; + + struct rcu_work rcu_work; }; extern struct kmem_cache *blk_requestq_cachep; -- Jens Axboe ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 22:33 ` Jens Axboe @ 2018-12-20 22:47 ` Jens Axboe 2018-12-20 22:50 ` Jens Axboe 0 siblings, 1 reply; 34+ messages in thread From: Jens Axboe @ 2018-12-20 22:47 UTC (permalink / raw) To: Bart Van Assche, jianchao.wang, linux-block On 12/20/18 3:33 PM, Jens Axboe wrote: > On 12/20/18 3:23 PM, Jens Axboe wrote: >> On 12/20/18 3:19 PM, Bart Van Assche wrote: >>> On Thu, 2018-12-20 at 14:48 -0700, Jens Axboe wrote: >>>> -void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >>>> - unsigned int hctx_idx) >>>> +static void blk_mq_rcu_free_pages(struct work_struct *work) >>>> { >>>> + struct blk_mq_tags *tags = container_of(to_rcu_work(work), >>>> + struct blk_mq_tags, rcu_work); >>>> struct page *page; >>>> >>>> + while (!list_empty(&tags->page_list)) { >>>> + page = list_first_entry(&tags->page_list, struct page, lru); >>>> + list_del_init(&page->lru); >>>> + /* >>>> + * Remove kmemleak object previously allocated in >>>> + * blk_mq_init_rq_map(). >>>> + */ >>>> + kmemleak_free(page_address(page)); >>>> + __free_pages(page, page->private); >>>> + } >>>> +} >>>> + >>>> +void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >>>> + unsigned int hctx_idx) >>>> +{ >>>> if (tags->rqs && set->ops->exit_request) { >>>> int i; >>>> >>>> @@ -2038,16 +2061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >>>> } >>>> } >>>> >>>> - while (!list_empty(&tags->page_list)) { >>>> - page = list_first_entry(&tags->page_list, struct page, lru); >>>> - list_del_init(&page->lru); >>>> - /* >>>> - * Remove kmemleak object previously allocated in >>>> - * blk_mq_init_rq_map(). 
>>>> - */ >>>> - kmemleak_free(page_address(page)); >>>> - __free_pages(page, page->private); >>>> - } >>>> + /* Punt to RCU free, so we don't race with tag iteration */ >>>> + INIT_RCU_WORK(&tags->rcu_work, blk_mq_rcu_free_pages); >>>> + queue_rcu_work(system_wq, &tags->rcu_work); >>>> } >>> >>> This can only work correctly if blk_mq_rcu_free_pages() is called before >>> INIT_RCU_WORK() is called a second time for the same bkl_mq_tags structure >>> and if blk_mq_rcu_free_pages() is called before struct blk_mq_tags is freed. >>> What provides these guarantees? Did I perhaps miss something? >> >> We don't call it twice. Protecting against that is outside the scope >> of this function. But we call and clear form the regular shutdown path, >> and the rest is error handling on setup. >> >> But yes, we do need to ensure that 'tags' doesn't go away. Probably best >> to rework it to shove it somewhere else for freeing for that case, and >> leave the rest of the shutdown alone. I'll update the patch. > > Here's a version that doesn't rely on tags, we just dynamically allocate > the structure. For the very odd case where we fail, we just punt to > an on-stack structure and use the big synchronize_rcu() hammer first. Forgot to init the lists... This one I actually booted, works for me. Outside of that, not tested, in terms of verifying the use-after-free is gone for tag iteration. 
diff --git a/block/blk-flush.c b/block/blk-flush.c index a3fc7191c694..827e3d2180d8 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -491,12 +491,22 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q, return NULL; } +static void blk_fq_rcu_free(struct work_struct *work) +{ + struct blk_flush_queue *fq = container_of(to_rcu_work(work), + struct blk_flush_queue, + rcu_work); + + kfree(fq->flush_rq); + kfree(fq); +} + void blk_free_flush_queue(struct blk_flush_queue *fq) { /* bio based request queue hasn't flush queue */ if (!fq) return; - kfree(fq->flush_rq); - kfree(fq); + INIT_RCU_WORK(&fq->rcu_work, blk_fq_rcu_free); + queue_rcu_work(system_wq, &fq->rcu_work); } diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 2089c6c62f44..c39b58391ae8 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) if (!reserved) bitnr += tags->nr_reserved_tags; - rq = tags->rqs[bitnr]; + if (tags->rqs[bitnr].queue != hctx->queue) + return true; /* * We can hit rq == NULL here, because the tagging functions * test and set the bit before assigning ->rqs[]. */ - if (rq && rq->q == hctx->queue) + rq = tags->rqs[bitnr].rq; + if (rq) return iter_data->fn(hctx, rq, iter_data->data, reserved); return true; } @@ -263,7 +265,9 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt, .reserved = reserved, }; + rcu_read_lock(); sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data); + rcu_read_unlock(); } struct bt_tags_iter_data { @@ -287,7 +291,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) * We can hit rq == NULL here, because the tagging functions * test and set the bit before assining ->rqs[]. 
*/ - rq = tags->rqs[bitnr]; + rq = tags->rqs[bitnr].rq; if (rq && blk_mq_request_started(rq)) return iter_data->fn(rq, iter_data->data, reserved); @@ -317,8 +321,11 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt, .reserved = reserved, }; - if (tags->rqs) + if (tags->rqs) { + rcu_read_lock(); sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data); + rcu_read_unlock(); + } } /** diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h index 61deab0b5a5a..bb84d1f099c7 100644 --- a/block/blk-mq-tag.h +++ b/block/blk-mq-tag.h @@ -4,6 +4,11 @@ #include "blk-mq.h" +struct rq_tag_entry { + struct request_queue *queue; + struct request *rq; +}; + /* * Tag address space map. */ @@ -16,7 +21,7 @@ struct blk_mq_tags { struct sbitmap_queue bitmap_tags; struct sbitmap_queue breserved_tags; - struct request **rqs; + struct rq_tag_entry *rqs; struct request **static_rqs; struct list_head page_list; }; @@ -78,7 +83,8 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx) static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx, unsigned int tag, struct request *rq) { - hctx->tags->rqs[tag] = rq; + hctx->tags->rqs[tag].queue = hctx->queue; + hctx->tags->rqs[tag].rq = rq; } static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, diff --git a/block/blk-mq.c b/block/blk-mq.c index 2de972857496..c17103c00362 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -298,13 +298,16 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, rq->tag = -1; rq->internal_tag = tag; } else { - if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) { + struct blk_mq_hw_ctx *hctx = data->hctx; + + if (hctx->flags & BLK_MQ_F_TAG_SHARED) { rq_flags = RQF_MQ_INFLIGHT; - atomic_inc(&data->hctx->nr_active); + atomic_inc(&hctx->nr_active); } rq->tag = tag; rq->internal_tag = -1; - data->hctx->tags->rqs[rq->tag] = rq; + hctx->tags->rqs[rq->tag].queue = hctx->queue; + hctx->tags->rqs[rq->tag].rq = rq; } /* csd/requeue_work/fifo_time is 
initialized before use */ @@ -797,8 +800,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list); struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag) { if (tag < tags->nr_tags) { - prefetch(tags->rqs[tag]); - return tags->rqs[tag]; + prefetch(tags->rqs[tag].rq); + return tags->rqs[tag].rq; } return NULL; @@ -809,10 +812,11 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq, void *priv, bool reserved) { /* - * If we find a request that is inflight and the queue matches, - * we know the queue is busy. Return false to stop the iteration. + * We're only called here if the queue matches. If the rq state is + * inflight, we know the queue is busy. Return false to stop the + * iteration. */ - if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) { + if (rq->state == MQ_RQ_IN_FLIGHT) { bool *busy = priv; *busy = true; @@ -1049,11 +1053,14 @@ bool blk_mq_get_driver_tag(struct request *rq) shared = blk_mq_tag_busy(data.hctx); rq->tag = blk_mq_get_tag(&data); if (rq->tag >= 0) { + struct blk_mq_hw_ctx *hctx = data.hctx; + if (shared) { rq->rq_flags |= RQF_MQ_INFLIGHT; atomic_inc(&data.hctx->nr_active); } - data.hctx->tags->rqs[rq->tag] = rq; + hctx->tags->rqs[rq->tag].queue = hctx->queue; + hctx->tags->rqs[rq->tag].rq = rq; } done: @@ -2020,10 +2027,34 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) return cookie; } +struct rq_page_list { + struct rcu_work rcu_work; + struct list_head list; +}; + +static void blk_mq_rcu_free_pages(struct work_struct *work) +{ + struct rq_page_list *rpl = container_of(to_rcu_work(work), + struct rq_page_list, rcu_work); + struct page *page; + + while (!list_empty(&rpl->list)) { + page = list_first_entry(&rpl->list, struct page, lru); + list_del_init(&page->lru); + /* + * Remove kmemleak object previously allocated in + * blk_mq_init_rq_map(). 
+ */ + kmemleak_free(page_address(page)); + __free_pages(page, page->private); + } + kfree(rpl); +} + void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, unsigned int hctx_idx) { - struct page *page; + struct rq_page_list *rpl; if (tags->rqs && set->ops->exit_request) { int i; @@ -2038,15 +2069,28 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, } } - while (!list_empty(&tags->page_list)) { - page = list_first_entry(&tags->page_list, struct page, lru); - list_del_init(&page->lru); + if (list_empty(&tags->page_list)) + return; + + rpl = kmalloc(sizeof(*rpl), GFP_NOIO); + if (rpl) { + INIT_LIST_HEAD(&rpl->list); + list_splice_init(&tags->page_list, &rpl->list); + + /* Punt to RCU free, so we don't race with tag iteration */ + INIT_RCU_WORK(&rpl->rcu_work, blk_mq_rcu_free_pages); + queue_rcu_work(system_wq, &rpl->rcu_work); + } else { + struct rq_page_list stack_rpl; + /* - * Remove kmemleak object previously allocated in - * blk_mq_init_rq_map(). + * Fail alloc, punt to on-stack, we just have to synchronize + * RCU first to ensure readers are done. 
*/ - kmemleak_free(page_address(page)); - __free_pages(page, page->private); + INIT_LIST_HEAD(&stack_rpl.list); + list_splice_init(&tags->page_list, &stack_rpl.list); + synchronize_rcu(); + blk_mq_rcu_free_pages(&stack_rpl.rcu_work.work); } } @@ -2077,7 +2121,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, if (!tags) return NULL; - tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *), + tags->rqs = kcalloc_node(nr_tags, sizeof(struct rq_tag_entry), GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node); if (!tags->rqs) { diff --git a/block/blk.h b/block/blk.h index 848278c52030..785207fc8a30 100644 --- a/block/blk.h +++ b/block/blk.h @@ -29,6 +29,8 @@ struct blk_flush_queue { */ struct request *orig_rq; spinlock_t mq_flush_lock; + + struct rcu_work rcu_work; }; extern struct kmem_cache *blk_requestq_cachep; -- Jens Axboe ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 22:47 ` Jens Axboe @ 2018-12-20 22:50 ` Jens Axboe 2019-02-14 23:36 ` Bart Van Assche 2019-02-15 2:57 ` jianchao.wang 0 siblings, 2 replies; 34+ messages in thread From: Jens Axboe @ 2018-12-20 22:50 UTC (permalink / raw) To: Bart Van Assche, jianchao.wang, linux-block On 12/20/18 3:47 PM, Jens Axboe wrote: > On 12/20/18 3:33 PM, Jens Axboe wrote: >> On 12/20/18 3:23 PM, Jens Axboe wrote: >>> On 12/20/18 3:19 PM, Bart Van Assche wrote: >>>> On Thu, 2018-12-20 at 14:48 -0700, Jens Axboe wrote: >>>>> -void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >>>>> - unsigned int hctx_idx) >>>>> +static void blk_mq_rcu_free_pages(struct work_struct *work) >>>>> { >>>>> + struct blk_mq_tags *tags = container_of(to_rcu_work(work), >>>>> + struct blk_mq_tags, rcu_work); >>>>> struct page *page; >>>>> >>>>> + while (!list_empty(&tags->page_list)) { >>>>> + page = list_first_entry(&tags->page_list, struct page, lru); >>>>> + list_del_init(&page->lru); >>>>> + /* >>>>> + * Remove kmemleak object previously allocated in >>>>> + * blk_mq_init_rq_map(). >>>>> + */ >>>>> + kmemleak_free(page_address(page)); >>>>> + __free_pages(page, page->private); >>>>> + } >>>>> +} >>>>> + >>>>> +void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >>>>> + unsigned int hctx_idx) >>>>> +{ >>>>> if (tags->rqs && set->ops->exit_request) { >>>>> int i; >>>>> >>>>> @@ -2038,16 +2061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >>>>> } >>>>> } >>>>> >>>>> - while (!list_empty(&tags->page_list)) { >>>>> - page = list_first_entry(&tags->page_list, struct page, lru); >>>>> - list_del_init(&page->lru); >>>>> - /* >>>>> - * Remove kmemleak object previously allocated in >>>>> - * blk_mq_init_rq_map(). 
>>>>> - */ >>>>> - kmemleak_free(page_address(page)); >>>>> - __free_pages(page, page->private); >>>>> - } >>>>> + /* Punt to RCU free, so we don't race with tag iteration */ >>>>> + INIT_RCU_WORK(&tags->rcu_work, blk_mq_rcu_free_pages); >>>>> + queue_rcu_work(system_wq, &tags->rcu_work); >>>>> } >>>> >>>> This can only work correctly if blk_mq_rcu_free_pages() is called before >>>> INIT_RCU_WORK() is called a second time for the same bkl_mq_tags structure >>>> and if blk_mq_rcu_free_pages() is called before struct blk_mq_tags is freed. >>>> What provides these guarantees? Did I perhaps miss something? >>> >>> We don't call it twice. Protecting against that is outside the scope >>> of this function. But we call and clear form the regular shutdown path, >>> and the rest is error handling on setup. >>> >>> But yes, we do need to ensure that 'tags' doesn't go away. Probably best >>> to rework it to shove it somewhere else for freeing for that case, and >>> leave the rest of the shutdown alone. I'll update the patch. >> >> Here's a version that doesn't rely on tags, we just dynamically allocate >> the structure. For the very odd case where we fail, we just punt to >> an on-stack structure and use the big synchronize_rcu() hammer first. > > Forgot to init the lists... This one I actually booted, works for me. > Outside of that, not tested, in terms of verifying the use-after-free is > gone for tag iteration. Oh, and we probably shouldn't free it unless we actually allocated it... 
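The final fix, "shouldn't free it unless we actually allocated it", is the on_stack flag added to rq_page_list in the patch below: the shared callback must kfree() only heap-allocated carriers, never the on-stack one from the OOM fallback. A small sketch of that guard, with illustrative names (mock_carrier, release_carrier):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct mock_carrier {
	bool on_stack;	/* set true by the OOM fallback path */
};

/*
 * Shared teardown used by both the deferred (heap) and synchronous
 * (stack) paths; only the heap case owns the carrier's memory.
 * Returns whether the carrier was freed, purely for illustration.
 */
static bool release_carrier(struct mock_carrier *c)
{
	if (c->on_stack)
		return false;	/* caller's stack frame owns it */
	free(c);		/* stands in for kfree(rpl) */
	return true;
}
```

Without the flag, the stack-fallback path would pass an automatic variable to kfree(), which is the bug this revision of the patch closes.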
diff --git a/block/blk-flush.c b/block/blk-flush.c index a3fc7191c694..827e3d2180d8 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -491,12 +491,22 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q, return NULL; } +static void blk_fq_rcu_free(struct work_struct *work) +{ + struct blk_flush_queue *fq = container_of(to_rcu_work(work), + struct blk_flush_queue, + rcu_work); + + kfree(fq->flush_rq); + kfree(fq); +} + void blk_free_flush_queue(struct blk_flush_queue *fq) { /* bio based request queue hasn't flush queue */ if (!fq) return; - kfree(fq->flush_rq); - kfree(fq); + INIT_RCU_WORK(&fq->rcu_work, blk_fq_rcu_free); + queue_rcu_work(system_wq, &fq->rcu_work); } diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 2089c6c62f44..c39b58391ae8 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) if (!reserved) bitnr += tags->nr_reserved_tags; - rq = tags->rqs[bitnr]; + if (tags->rqs[bitnr].queue != hctx->queue) + return true; /* * We can hit rq == NULL here, because the tagging functions * test and set the bit before assigning ->rqs[]. */ - if (rq && rq->q == hctx->queue) + rq = tags->rqs[bitnr].rq; + if (rq) return iter_data->fn(hctx, rq, iter_data->data, reserved); return true; } @@ -263,7 +265,9 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt, .reserved = reserved, }; + rcu_read_lock(); sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data); + rcu_read_unlock(); } struct bt_tags_iter_data { @@ -287,7 +291,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) * We can hit rq == NULL here, because the tagging functions * test and set the bit before assining ->rqs[]. 
*/ - rq = tags->rqs[bitnr]; + rq = tags->rqs[bitnr].rq; if (rq && blk_mq_request_started(rq)) return iter_data->fn(rq, iter_data->data, reserved); @@ -317,8 +321,11 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt, .reserved = reserved, }; - if (tags->rqs) + if (tags->rqs) { + rcu_read_lock(); sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data); + rcu_read_unlock(); + } } /** diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h index 61deab0b5a5a..bb84d1f099c7 100644 --- a/block/blk-mq-tag.h +++ b/block/blk-mq-tag.h @@ -4,6 +4,11 @@ #include "blk-mq.h" +struct rq_tag_entry { + struct request_queue *queue; + struct request *rq; +}; + /* * Tag address space map. */ @@ -16,7 +21,7 @@ struct blk_mq_tags { struct sbitmap_queue bitmap_tags; struct sbitmap_queue breserved_tags; - struct request **rqs; + struct rq_tag_entry *rqs; struct request **static_rqs; struct list_head page_list; }; @@ -78,7 +83,8 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx) static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx, unsigned int tag, struct request *rq) { - hctx->tags->rqs[tag] = rq; + hctx->tags->rqs[tag].queue = hctx->queue; + hctx->tags->rqs[tag].rq = rq; } static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, diff --git a/block/blk-mq.c b/block/blk-mq.c index 2de972857496..82341a78a0c0 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -298,13 +298,16 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, rq->tag = -1; rq->internal_tag = tag; } else { - if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) { + struct blk_mq_hw_ctx *hctx = data->hctx; + + if (hctx->flags & BLK_MQ_F_TAG_SHARED) { rq_flags = RQF_MQ_INFLIGHT; - atomic_inc(&data->hctx->nr_active); + atomic_inc(&hctx->nr_active); } rq->tag = tag; rq->internal_tag = -1; - data->hctx->tags->rqs[rq->tag] = rq; + hctx->tags->rqs[rq->tag].queue = hctx->queue; + hctx->tags->rqs[rq->tag].rq = rq; } /* csd/requeue_work/fifo_time is 
initialized before use */ @@ -797,8 +800,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list); struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag) { if (tag < tags->nr_tags) { - prefetch(tags->rqs[tag]); - return tags->rqs[tag]; + prefetch(tags->rqs[tag].rq); + return tags->rqs[tag].rq; } return NULL; @@ -809,10 +812,11 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq, void *priv, bool reserved) { /* - * If we find a request that is inflight and the queue matches, - * we know the queue is busy. Return false to stop the iteration. + * We're only called here if the queue matches. If the rq state is + * inflight, we know the queue is busy. Return false to stop the + * iteration. */ - if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) { + if (rq->state == MQ_RQ_IN_FLIGHT) { bool *busy = priv; *busy = true; @@ -1049,11 +1053,14 @@ bool blk_mq_get_driver_tag(struct request *rq) shared = blk_mq_tag_busy(data.hctx); rq->tag = blk_mq_get_tag(&data); if (rq->tag >= 0) { + struct blk_mq_hw_ctx *hctx = data.hctx; + if (shared) { rq->rq_flags |= RQF_MQ_INFLIGHT; atomic_inc(&data.hctx->nr_active); } - data.hctx->tags->rqs[rq->tag] = rq; + hctx->tags->rqs[rq->tag].queue = hctx->queue; + hctx->tags->rqs[rq->tag].rq = rq; } done: @@ -2020,10 +2027,36 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) return cookie; } +struct rq_page_list { + struct rcu_work rcu_work; + struct list_head list; + bool on_stack; +}; + +static void blk_mq_rcu_free_pages(struct work_struct *work) +{ + struct rq_page_list *rpl = container_of(to_rcu_work(work), + struct rq_page_list, rcu_work); + struct page *page; + + while (!list_empty(&rpl->list)) { + page = list_first_entry(&rpl->list, struct page, lru); + list_del_init(&page->lru); + /* + * Remove kmemleak object previously allocated in + * blk_mq_init_rq_map(). 
+ */ + kmemleak_free(page_address(page)); + __free_pages(page, page->private); + } + if (!rpl->on_stack) + kfree(rpl); +} + void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, unsigned int hctx_idx) { - struct page *page; + struct rq_page_list *rpl; if (tags->rqs && set->ops->exit_request) { int i; @@ -2038,15 +2071,30 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, } } - while (!list_empty(&tags->page_list)) { - page = list_first_entry(&tags->page_list, struct page, lru); - list_del_init(&page->lru); + if (list_empty(&tags->page_list)) + return; + + rpl = kmalloc(sizeof(*rpl), GFP_NOIO); + if (rpl) { + INIT_LIST_HEAD(&rpl->list); + list_splice_init(&tags->page_list, &rpl->list); + + /* Punt to RCU free, so we don't race with tag iteration */ + INIT_RCU_WORK(&rpl->rcu_work, blk_mq_rcu_free_pages); + rpl->on_stack = false; + queue_rcu_work(system_wq, &rpl->rcu_work); + } else { + struct rq_page_list stack_rpl; + /* - * Remove kmemleak object previously allocated in - * blk_mq_init_rq_map(). + * Fail alloc, punt to on-stack, we just have to synchronize + * RCU first to ensure readers are done. 
*/ - kmemleak_free(page_address(page)); - __free_pages(page, page->private); + INIT_LIST_HEAD(&stack_rpl.list); + list_splice_init(&tags->page_list, &stack_rpl.list); + stack_rpl.on_stack = true; + synchronize_rcu(); + blk_mq_rcu_free_pages(&stack_rpl.rcu_work.work); } } @@ -2077,7 +2125,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, if (!tags) return NULL; - tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *), + tags->rqs = kcalloc_node(nr_tags, sizeof(struct rq_tag_entry), GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node); if (!tags->rqs) { diff --git a/block/blk.h b/block/blk.h index 848278c52030..785207fc8a30 100644 --- a/block/blk.h +++ b/block/blk.h @@ -29,6 +29,8 @@ struct blk_flush_queue { */ struct request *orig_rq; spinlock_t mq_flush_lock; + + struct rcu_work rcu_work; }; extern struct kmem_cache *blk_requestq_cachep; -- Jens Axboe ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 22:50 ` Jens Axboe @ 2019-02-14 23:36 ` Bart Van Assche 2019-02-15 18:29 ` Evan Green 2019-02-15 2:57 ` jianchao.wang 1 sibling, 1 reply; 34+ messages in thread From: Bart Van Assche @ 2019-02-14 23:36 UTC (permalink / raw) To: Jens Axboe, jianchao.wang, linux-block; +Cc: Evan Green On Thu, 2018-12-20 at 15:50 -0700, Jens Axboe wrote: > +static void blk_fq_rcu_free(struct work_struct *work) > +{ > + struct blk_flush_queue *fq = container_of(to_rcu_work(work), > + struct blk_flush_queue, > + rcu_work); > + > + kfree(fq->flush_rq); > + kfree(fq); > +} > + > void blk_free_flush_queue(struct blk_flush_queue *fq) > { > /* bio based request queue hasn't flush queue */ > if (!fq) > return; > > - kfree(fq->flush_rq); > - kfree(fq); > + INIT_RCU_WORK(&fq->rcu_work, blk_fq_rcu_free); > + queue_rcu_work(system_wq, &fq->rcu_work); > } Can INIT_RCU_WORK() + queue_rcu_work() be changed into call_rcu()? The latter namely uses a smaller data structure. > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c > index 2089c6c62f44..c39b58391ae8 100644 > --- a/block/blk-mq-tag.c > +++ b/block/blk-mq-tag.c > @@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) > > if (!reserved) > bitnr += tags->nr_reserved_tags; > - rq = tags->rqs[bitnr]; > + if (tags->rqs[bitnr].queue != hctx->queue) > + return true; Since blk_mq_tag_set_rq() is not serialized against this function I doubt that the tags->rqs[bitnr].queue != hctx->queue check helps. Can it be left out? > +struct rq_tag_entry { > + struct request_queue *queue; > + struct request *rq; If the new test can be left out from bt_iter(), can this new data structure be left out too? In other words, keep the existing approach of only storing the request pointer and not the queue pointer. Thanks, Bart. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2019-02-14 23:36 ` Bart Van Assche @ 2019-02-15 18:29 ` Evan Green 2019-02-19 16:48 ` Bart Van Assche 0 siblings, 1 reply; 34+ messages in thread From: Evan Green @ 2019-02-15 18:29 UTC (permalink / raw) To: Bart Van Assche; +Cc: Jens Axboe, jianchao.wang, linux-block Hi Jens, I got all turned around while trying to understand this fix, and I'll admit it's probably just me. It looks like you're trying to use an rcu lock to prevent the iter functions from racing with free. Is that true? But then the race at least as I understand it wasn't that there was a free in-flight, it's that the free had happened a long time ago, and there was still a stale value in tags->rqs[bitnr]. So maybe there's some other issue that part is solving? And then it looks like you added a new struct where tags->rqs was so that you could compare hctx->queue without reaching through rq. I have no idea if that's sufficient to prevent stale accesses through rq. Since you're changing multiple values I think where you populate that structure you'd at least need to do something like: clear rq, barrier, set hctx, barrier, set rq. But like Bart said, that's probably not the right way to go. Finally, I didn't really get why the conditional in blk_mq_rq_inflight() changed. Is that guarding against some other identified problem, or just an additional check you can do when you convert rqs into a struct? It looks like blk_mq_free_rqs() might be the magic function I was looking for earlier. Would it be possible to just clear tags[rq->tag] for each static_rq? Or is it possible for rqs from one set to end up in the tags array of another set? (Which would make what I just suggested insufficient). -Evan ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2019-02-15 18:29 ` Evan Green @ 2019-02-19 16:48 ` Bart Van Assche 2019-02-21 20:54 ` Evan Green 0 siblings, 1 reply; 34+ messages in thread From: Bart Van Assche @ 2019-02-19 16:48 UTC (permalink / raw) To: Evan Green; +Cc: Jens Axboe, jianchao.wang, linux-block On Fri, 2019-02-15 at 10:29 -0800, Evan Green wrote: > I got all turned around while trying to understand this fix, and I'll > admit it's probably just me. It looks like you're trying to use an rcu > lock to prevent the iter functions from racing with free. Is that > true? But then the race at least as I understand it wasn't that there > was a free in-flight, it's that the free had happened a long time ago, > and there was still a stale value in tags->rqs[bitnr]. So maybe > there's some other issue that part is solving? > > And then it looks like you added a new struct where tags->rqs was so > that you could compare hctx->queue without reaching through rq. I have > no idea if that's sufficient to prevent stale accesses through rq. > Since you're changing multiple values I think where you populate that > structure you'd at least need to do something like: clear rq, barrier, > set hctx, barrier, set rq. But like Bart said, that's probably not the > right way to go. > > Finally, I didn't really get why the conditional in > blk_mq_rq_inflight() changed. Is that guarding against some other > identified problem, or just an additional check you can do when you > convert rqs into a struct? > > It looks like blk_mq_free_rqs() might be the magic function I was > looking for earlier. Would it be possible to just clear tags[rq->tag] > for each static_rq? Or is it possible for rqs from one set to end up > in the tags array of another set? (Which would make what I just > suggested insufficient). Hi Evan, Multiple request queues can share a single tag set. 
Examples of use cases are NVMe namespaces associated with the same NVMe controller and SCSI LUNs associated with the same SCSI host. The code that allocates a request not only marks a tag as allocated but also updates the rqs[] array in the tag set (see also blk_mq_get_request()). The code that iterates over a tag set examines both the tag bitmap and the rqs[] array. The code that allocates requests and the code that iterates over requests are not serialized against each other. That is why the association of a tag with a request queue can change while iterating over a tag set. Another race condition is that a request can be allocated or freed while iterating over a tag set. Fixing these race conditions would require a locking mechanism and I think the performance overhead of such a locking mechanism is larger than acceptable. So code that iterates over requests either must be able to handle such race conditions or must suspend request processing before it iterates over requests. This is why I'm in favor of only modifying blk_mq_free_rqs() such that the memory in which the tags are stored is delayed until a grace period has occurred. Bart. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2019-02-19 16:48 ` Bart Van Assche @ 2019-02-21 20:54 ` Evan Green 0 siblings, 0 replies; 34+ messages in thread From: Evan Green @ 2019-02-21 20:54 UTC (permalink / raw) To: Bart Van Assche; +Cc: Jens Axboe, jianchao.wang, linux-block On Tue, Feb 19, 2019 at 8:48 AM Bart Van Assche <bvanassche@acm.org> wrote: > > On Fri, 2019-02-15 at 10:29 -0800, Evan Green wrote: > > I got all turned around while trying to understand this fix, and I'll > > admit it's probably just me. It looks like you're trying to use an rcu > > lock to prevent the iter functions from racing with free. Is that > > true? But then the race at least as I understand it wasn't that there > > was a free in-flight, it's that the free had happened a long time ago, > > and there was still a stale value in tags->rqs[bitnr]. So maybe > > there's some other issue that part is solving? > > > > And then it looks like you added a new struct where tags->rqs was so > > that you could compare hctx->queue without reaching through rq. I have > > no idea if that's sufficient to prevent stale accesses through rq. > > Since you're changing multiple values I think where you populate that > > structure you'd at least need to do something like: clear rq, barrier, > > set hctx, barrier, set rq. But like Bart said, that's probably not the > > right way to go. > > > > Finally, I didn't really get why the conditional in > > blk_mq_rq_inflight() changed. Is that guarding against some other > > identified problem, or just an additional check you can do when you > > convert rqs into a struct? > > > > It looks like blk_mq_free_rqs() might be the magic function I was > > looking for earlier. Would it be possible to just clear tags[rq->tag] > > for each static_rq? Or is it possible for rqs from one set to end up > > in the tags array of another set? (Which would make what I just > > suggested insufficient). > > Hi Evan, > > Multiple request queues can share a single tag set. 
Examples of use cases > are NVMe namespaces associated with the same NVMe controller and SCSI LUNs > associated with the same SCSI host. The code that allocates a request not > only marks a tag as allocated but also updates the rqs[] array in the > tag set (see also blk_mq_get_request()). The code that iterates over a > tag set examines both the tag bitmap and the rqs[] array. The code that > allocates requests and the code that iterates over requests are not > serialized against each other. That is why the association of a tag with a > request queue can change while iterating over a tag set. > > Another race condition is that a request can be allocated or freed while > iterating over a tag set. Oh, true. The iter code was trying (unsuccessfully) to guard against allocations with its if (rq && rq->q ...), but I hadn't thought about in-progress frees. > > Fixing these race conditions would require a locking mechanism and I think > the performance overhead of such a locking mechanism is larger than > acceptable. So code that iterates over requests either must be able to > handle such race conditions or must suspend request processing before it > iterates over requests. > > This is why I'm in favor of only modifying blk_mq_free_rqs() such that the > memory in which the tags are stored is delayed until a grace period has > occurred. This makes sense to me for the purpose of not racing with an in-progress free. I just can't fully prove to myself that it solves the stale tags entry, since unless we clear the entry in the tags array it seems like there could maybe still be stale entries. Maybe I'll get it in the next change. Jens, keep me in the loop if you send another patch please. -Evan ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-20 22:50 ` Jens Axboe 2019-02-14 23:36 ` Bart Van Assche @ 2019-02-15 2:57 ` jianchao.wang 1 sibling, 0 replies; 34+ messages in thread From: jianchao.wang @ 2019-02-15 2:57 UTC (permalink / raw) To: Jens Axboe, Bart Van Assche, linux-block On 12/21/18 6:50 AM, Jens Axboe wrote: ... > @@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) > > if (!reserved) > bitnr += tags->nr_reserved_tags; > - rq = tags->rqs[bitnr]; > + if (tags->rqs[bitnr].queue != hctx->queue) > + return true; > > /* > * We can hit rq == NULL here, because the tagging functions > * test and set the bit before assigning ->rqs[]. > */ > - if (rq && rq->q == hctx->queue) > + rq = tags->rqs[bitnr].rq; > + if (rq) > return iter_data->fn(hctx, rq, iter_data->data, reserved); > return true; > } > @@ -263,7 +265,9 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt, > .reserved = reserved, > }; > > + rcu_read_lock(); > sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data); > + rcu_read_unlock(); > } > > struct bt_tags_iter_data { ... > @@ -2038,15 +2071,30 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, > } > } Just checking request_queue match in bt_iter seems not enough. The blk_mq_free_rqs could be invoked by switching io scheduler. At the moment, request_queue could match. Maybe we still need to clear the associated tags->rqs[] entries. With cleaning the associated tags->rqs[] entries and synchronize_rcu after that, bt_iter would only see some stable request with idle state but not a freed one. 
Thanks Jianchao > > - while (!list_empty(&tags->page_list)) { > - page = list_first_entry(&tags->page_list, struct page, lru); > - list_del_init(&page->lru); > + if (list_empty(&tags->page_list)) > + return; > + > + rpl = kmalloc(sizeof(*rpl), GFP_NOIO); > + if (rpl) { > + INIT_LIST_HEAD(&rpl->list); > + list_splice_init(&tags->page_list, &rpl->list); > + > + /* Punt to RCU free, so we don't race with tag iteration */ > + INIT_RCU_WORK(&rpl->rcu_work, blk_mq_rcu_free_pages); > + rpl->on_stack = false; > + queue_rcu_work(system_wq, &rpl->rcu_work); > + } else { ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: v4.20-rc6: Sporadic use-after-free in bt_iter() 2018-12-19 23:24 v4.20-rc6: Sporadic use-after-free in bt_iter() Bart Van Assche 2018-12-19 23:27 ` Jens Axboe @ 2018-12-20 4:06 ` Ming Lei 1 sibling, 0 replies; 34+ messages in thread From: Ming Lei @ 2018-12-20 4:06 UTC (permalink / raw) To: Bart Van Assche; +Cc: linux-block On Thu, Dec 20, 2018 at 9:22 AM Bart Van Assche <bvanassche@acm.org> wrote: > > Hello, > > If I run the srp blktests in a loop then I see the below call stack appearing > sporadically. I have not yet had the time to analyze this but I'm reporting > this here in case someone else would already have had a look at this. > > Bart. > > ================================================================== > BUG: KASAN: use-after-free in bt_iter+0x86/0xf0 > Read of size 8 at addr ffff88803b335240 by task fio/21412 > > CPU: 0 PID: 21412 Comm: fio Tainted: G W 4.20.0-rc6-dbg+ #3 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 > Call Trace: > dump_stack+0x86/0xca > print_address_description+0x71/0x239 > kasan_report.cold.5+0x242/0x301 > __asan_load8+0x54/0x90 > bt_iter+0x86/0xf0 > blk_mq_queue_tag_busy_iter+0x373/0x5e0 > blk_mq_in_flight+0x96/0xb0 > part_in_flight+0x40/0x140 > part_round_stats+0x18e/0x370 > blk_account_io_start+0x3d7/0x670 > blk_mq_bio_to_request+0x19c/0x3a0 > blk_mq_make_request+0x7a9/0xcb0 > generic_make_request+0x41d/0x960 > submit_bio+0x9b/0x250 > do_blockdev_direct_IO+0x435c/0x4c70 > __blockdev_direct_IO+0x79/0x88 > ext4_direct_IO+0x46c/0xc00 > generic_file_direct_write+0x119/0x210 > __generic_file_write_iter+0x11c/0x280 > ext4_file_write_iter+0x1b8/0x6f0 > aio_write+0x204/0x310 > io_submit_one+0x9d3/0xe80 > __x64_sys_io_submit+0x115/0x340 > do_syscall_64+0x71/0x210 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > RIP: 0033:0x7f02cf043219 > Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 
8b 0d 47 fc 0c 00 f7 d8 64 89 01 48 > RSP: 002b:00007f02a1df78b8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1 > RAX: ffffffffffffffda RBX: 00007f02a1df8ef8 RCX: 00007f02cf043219 > RDX: 00007f029804a7c0 RSI: 0000000000000001 RDI: 00007f02c4f67000 > RBP: 00007f02c4f67000 R08: 00007f0298007af0 R09: 00007f02a362f0f0 > R10: 00007f029804a9c0 R11: 0000000000000246 R12: 0000000000000001 > R13: 0000000000000000 R14: 00007f029804a7c0 R15: 00007f0298049f60 > > The buggy address belongs to the page: > page:ffffea0000eccd40 count:0 mapcount:0 mapping:0000000000000000 index:0x0 > flags: 0x1fff000000000000() > raw: 1fff000000000000 0000000000000000 ffffffff00ec0201 0000000000000000 > raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 > page dumped because: kasan: bad access detected > > Memory state around the buggy address: > ffff88803b335100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff > ffff88803b335180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff > >ffff88803b335200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff > ^ > ffff88803b335280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff > ffff88803b335300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff > ================================================================== It isn't a new issue, and I reported it on v4.16 actually: https://lists.openwall.net/linux-ext4/2018/04/04/6 Thanks, Ming Lei ^ permalink raw reply [flat|nested] 34+ messages in thread