linux-block.vger.kernel.org archive mirror
* v4.20-rc6: Sporadic use-after-free in bt_iter()
@ 2018-12-19 23:24 Bart Van Assche
  2018-12-19 23:27 ` Jens Axboe
  2018-12-20  4:06 ` Ming Lei
  0 siblings, 2 replies; 34+ messages in thread
From: Bart Van Assche @ 2018-12-19 23:24 UTC (permalink / raw)
  To: linux-block

Hello,

If I run the srp blktests in a loop then I see the below call stack appearing
sporadically. I have not yet had the time to analyze this, but I'm reporting
it here in case someone else has already had a look at it.

Bart.

==================================================================
BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
Read of size 8 at addr ffff88803b335240 by task fio/21412

CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
 dump_stack+0x86/0xca
 print_address_description+0x71/0x239
 kasan_report.cold.5+0x242/0x301
 __asan_load8+0x54/0x90
 bt_iter+0x86/0xf0
 blk_mq_queue_tag_busy_iter+0x373/0x5e0
 blk_mq_in_flight+0x96/0xb0
 part_in_flight+0x40/0x140
 part_round_stats+0x18e/0x370
 blk_account_io_start+0x3d7/0x670
 blk_mq_bio_to_request+0x19c/0x3a0
 blk_mq_make_request+0x7a9/0xcb0
 generic_make_request+0x41d/0x960
 submit_bio+0x9b/0x250
 do_blockdev_direct_IO+0x435c/0x4c70
 __blockdev_direct_IO+0x79/0x88
 ext4_direct_IO+0x46c/0xc00
 generic_file_direct_write+0x119/0x210
 __generic_file_write_iter+0x11c/0x280
 ext4_file_write_iter+0x1b8/0x6f0
 aio_write+0x204/0x310
 io_submit_one+0x9d3/0xe80
 __x64_sys_io_submit+0x115/0x340
 do_syscall_64+0x71/0x210
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f02cf043219
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 47 fc 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007f02a1df78b8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
RAX: ffffffffffffffda RBX: 00007f02a1df8ef8 RCX: 00007f02cf043219
RDX: 00007f029804a7c0 RSI: 0000000000000001 RDI: 00007f02c4f67000
RBP: 00007f02c4f67000 R08: 00007f0298007af0 R09: 00007f02a362f0f0
R10: 00007f029804a9c0 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 00007f029804a7c0 R15: 00007f0298049f60

The buggy address belongs to the page:
page:ffffea0000eccd40 count:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x1fff000000000000()
raw: 1fff000000000000 0000000000000000 ffffffff00ec0201 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88803b335100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff88803b335180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88803b335200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                           ^
 ffff88803b335280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff88803b335300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-19 23:24 v4.20-rc6: Sporadic use-after-free in bt_iter() Bart Van Assche
@ 2018-12-19 23:27 ` Jens Axboe
  2018-12-20  0:16   ` Bart Van Assche
  2018-12-20  4:06 ` Ming Lei
  1 sibling, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-19 23:27 UTC (permalink / raw)
  To: Bart Van Assche, linux-block

On 12/19/18 4:24 PM, Bart Van Assche wrote:
> Hello,
> 
> If I run the srp blktests in a loop then I see the below call stack appearing
> sporadically. I have not yet had the time to analyze this, but I'm reporting
> it here in case someone else has already had a look at it.
> 
> Bart.
> 
> ==================================================================
> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
> Read of size 8 at addr ffff88803b335240 by task fio/21412
> 
> CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> Call Trace:
>  dump_stack+0x86/0xca
>  print_address_description+0x71/0x239
>  kasan_report.cold.5+0x242/0x301
>  __asan_load8+0x54/0x90
>  bt_iter+0x86/0xf0
>  blk_mq_queue_tag_busy_iter+0x373/0x5e0
>  blk_mq_in_flight+0x96/0xb0
>  part_in_flight+0x40/0x140
>  part_round_stats+0x18e/0x370
>  blk_account_io_start+0x3d7/0x670
>  blk_mq_bio_to_request+0x19c/0x3a0
>  blk_mq_make_request+0x7a9/0xcb0
>  generic_make_request+0x41d/0x960
>  submit_bio+0x9b/0x250
>  do_blockdev_direct_IO+0x435c/0x4c70
>  __blockdev_direct_IO+0x79/0x88
>  ext4_direct_IO+0x46c/0xc00
>  generic_file_direct_write+0x119/0x210
>  __generic_file_write_iter+0x11c/0x280
>  ext4_file_write_iter+0x1b8/0x6f0
>  aio_write+0x204/0x310
>  io_submit_one+0x9d3/0xe80
>  __x64_sys_io_submit+0x115/0x340
>  do_syscall_64+0x71/0x210
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x7f02cf043219

I've seen this one before as well, it's not a new thing. As far as I can
tell, it's a false positive. There should be no possibility for a
use-after-free iterating the static tags/requests.

-- 
Jens Axboe



* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-19 23:27 ` Jens Axboe
@ 2018-12-20  0:16   ` Bart Van Assche
  2018-12-20  3:17     ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2018-12-20  0:16 UTC (permalink / raw)
  To: Jens Axboe, linux-block

On Wed, 2018-12-19 at 16:27 -0700, Jens Axboe wrote:
> On 12/19/18 4:24 PM, Bart Van Assche wrote:
> > Hello,
> > 
> > If I run the srp blktests in a loop then I see the below call stack appearing
> > sporadically. I have not yet had the time to analyze this, but I'm reporting
> > it here in case someone else has already had a look at it.
> > 
> > Bart.
> > 
> > ==================================================================
> > BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
> > Read of size 8 at addr ffff88803b335240 by task fio/21412
> > 
> > CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> > Call Trace:
> >  dump_stack+0x86/0xca
> >  print_address_description+0x71/0x239
> >  kasan_report.cold.5+0x242/0x301
> >  __asan_load8+0x54/0x90
> >  bt_iter+0x86/0xf0
> >  blk_mq_queue_tag_busy_iter+0x373/0x5e0
> >  blk_mq_in_flight+0x96/0xb0
> >  part_in_flight+0x40/0x140
> >  part_round_stats+0x18e/0x370
> >  blk_account_io_start+0x3d7/0x670
> >  blk_mq_bio_to_request+0x19c/0x3a0
> >  blk_mq_make_request+0x7a9/0xcb0
> >  generic_make_request+0x41d/0x960
> >  submit_bio+0x9b/0x250
> >  do_blockdev_direct_IO+0x435c/0x4c70
> >  __blockdev_direct_IO+0x79/0x88
> >  ext4_direct_IO+0x46c/0xc00
> >  generic_file_direct_write+0x119/0x210
> >  __generic_file_write_iter+0x11c/0x280
> >  ext4_file_write_iter+0x1b8/0x6f0
> >  aio_write+0x204/0x310
> >  io_submit_one+0x9d3/0xe80
> >  __x64_sys_io_submit+0x115/0x340
> >  do_syscall_64+0x71/0x210
> >  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > RIP: 0033:0x7f02cf043219
> 
> I've seen this one before as well, it's not a new thing. As far as I can
> tell, it's a false positive. There should be no possibility for a
> use-after-free iterating the static tags/requests.

Are you sure this is a false positive? I have not yet encountered any false
positive KASAN complaints. According to the following gdb output this complaint
refers to reading rq->q:

(gdb) list *(bt_iter+0x86)
0xffffffff816b9346 is in bt_iter (block/blk-mq-tag.c:237).
232
233             /*
234              * We can hit rq == NULL here, because the tagging functions
235              * test and set the bit before assigning ->rqs[].
236              */
237             if (rq && rq->q == hctx->queue)
238                     iter_data->fn(hctx, rq, iter_data->data, reserved);
239             return true;
240     }
241

From the disassembly output:

232
233             /*
234              * We can hit rq == NULL here, because the tagging functions
235              * test and set the bit before assigning ->rqs[].
236              */
237             if (rq && rq->q == hctx->queue)
   0xffffffff816b9339 <+121>:   test   %r12,%r12
   0xffffffff816b933c <+124>:   je     0xffffffff816b935f <bt_iter+159>
   0xffffffff816b933e <+126>:   mov    %r12,%rdi
   0xffffffff816b9341 <+129>:   callq  0xffffffff813bd3e0 <__asan_load8>
   0xffffffff816b9346 <+134>:   lea    0x138(%r13),%rdi
   0xffffffff816b934d <+141>:   mov    (%r12),%r14
   0xffffffff816b9351 <+145>:   callq  0xffffffff813bd3e0 <__asan_load8>
   0xffffffff816b9356 <+150>:   cmp    0x138(%r13),%r14
   0xffffffff816b935d <+157>:   je     0xffffffff816b936f <bt_iter+175>

BTW, rq may but does not have to refer to tags->static_rqs[...]. It may also
refer to hctx->fq.flush_rq.

Bart.



* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20  0:16   ` Bart Van Assche
@ 2018-12-20  3:17     ` Jens Axboe
  2018-12-20  3:24       ` jianchao.wang
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20  3:17 UTC (permalink / raw)
  To: Bart Van Assche, linux-block

On 12/19/18 5:16 PM, Bart Van Assche wrote:
> On Wed, 2018-12-19 at 16:27 -0700, Jens Axboe wrote:
>> On 12/19/18 4:24 PM, Bart Van Assche wrote:
>>> Hello,
>>>
>>> If I run the srp blktests in a loop then I see the below call stack appearing
>>> sporadically. I have not yet had the time to analyze this, but I'm reporting
>>> it here in case someone else has already had a look at it.
>>>
>>> Bart.
>>>
>>> ==================================================================
>>> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
>>> Read of size 8 at addr ffff88803b335240 by task fio/21412
>>>
>>> CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
>>> Call Trace:
>>>  dump_stack+0x86/0xca
>>>  print_address_description+0x71/0x239
>>>  kasan_report.cold.5+0x242/0x301
>>>  __asan_load8+0x54/0x90
>>>  bt_iter+0x86/0xf0
>>>  blk_mq_queue_tag_busy_iter+0x373/0x5e0
>>>  blk_mq_in_flight+0x96/0xb0
>>>  part_in_flight+0x40/0x140
>>>  part_round_stats+0x18e/0x370
>>>  blk_account_io_start+0x3d7/0x670
>>>  blk_mq_bio_to_request+0x19c/0x3a0
>>>  blk_mq_make_request+0x7a9/0xcb0
>>>  generic_make_request+0x41d/0x960
>>>  submit_bio+0x9b/0x250
>>>  do_blockdev_direct_IO+0x435c/0x4c70
>>>  __blockdev_direct_IO+0x79/0x88
>>>  ext4_direct_IO+0x46c/0xc00
>>>  generic_file_direct_write+0x119/0x210
>>>  __generic_file_write_iter+0x11c/0x280
>>>  ext4_file_write_iter+0x1b8/0x6f0
>>>  aio_write+0x204/0x310
>>>  io_submit_one+0x9d3/0xe80
>>>  __x64_sys_io_submit+0x115/0x340
>>>  do_syscall_64+0x71/0x210
>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> RIP: 0033:0x7f02cf043219
>>
>> I've seen this one before as well, it's not a new thing. As far as I can
>> tell, it's a false positive. There should be no possibility for a
>> use-after-free iterating the static tags/requests.
> 
> Are you sure this is a false positive?

No I'm not, but the few times I have seen it, I haven't been able to
make much sense of it. It goes back quite a bit.

> I have not yet encountered any false
> positive KASAN complaints. According to the following gdb output this complaint
> refers to reading rq->q:
> 
> (gdb) list *(bt_iter+0x86)
> 0xffffffff816b9346 is in bt_iter (block/blk-mq-tag.c:237).
> 232
> 233             /*
> 234              * We can hit rq == NULL here, because the tagging functions
> 235              * test and set the bit before assigning ->rqs[].
> 236              */
> 237             if (rq && rq->q == hctx->queue)
> 238                     iter_data->fn(hctx, rq, iter_data->data, reserved);
> 239             return true;
> 240     }
> 241
> 
> From the disassembly output:
> 
> 232
> 233             /*
> 234              * We can hit rq == NULL here, because the tagging functions
> 235              * test and set the bit before assigning ->rqs[].
> 236              */
> 237             if (rq && rq->q == hctx->queue)
>    0xffffffff816b9339 <+121>:   test   %r12,%r12
>    0xffffffff816b933c <+124>:   je     0xffffffff816b935f <bt_iter+159>
>    0xffffffff816b933e <+126>:   mov    %r12,%rdi
>    0xffffffff816b9341 <+129>:   callq  0xffffffff813bd3e0 <__asan_load8>
>    0xffffffff816b9346 <+134>:   lea    0x138(%r13),%rdi
>    0xffffffff816b934d <+141>:   mov    (%r12),%r14
>    0xffffffff816b9351 <+145>:   callq  0xffffffff813bd3e0 <__asan_load8>
>    0xffffffff816b9356 <+150>:   cmp    0x138(%r13),%r14
>    0xffffffff816b935d <+157>:   je     0xffffffff816b936f <bt_iter+175>
> 
> BTW, rq may but does not have to refer to tags->static_rqs[...]. It may also
> refer to hctx->fq.flush_rq.

But even those are persistent for the lifetime of the queue... But since
kasan complains it belongs to a specific page, I'm guessing it's one
of the regular requests since those are out of a chopped up page. Which
means it makes even less sense.

Is this happening while devices are being actively torn down? And
are you using shared tags? That's the only way I could see this
triggering.

-- 
Jens Axboe



* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20  3:17     ` Jens Axboe
@ 2018-12-20  3:24       ` jianchao.wang
  2018-12-20  4:19         ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: jianchao.wang @ 2018-12-20  3:24 UTC (permalink / raw)
  To: Jens Axboe, Bart Van Assche, linux-block



On 12/20/18 11:17 AM, Jens Axboe wrote:
> On 12/19/18 5:16 PM, Bart Van Assche wrote:
>> On Wed, 2018-12-19 at 16:27 -0700, Jens Axboe wrote:
>>> On 12/19/18 4:24 PM, Bart Van Assche wrote:
>>>> Hello,
>>>>
>>>> If I run the srp blktests in a loop then I see the below call stack appearing
>>>> sporadically. I have not yet had the time to analyze this, but I'm reporting
>>>> it here in case someone else has already had a look at it.
>>>>
>>>> Bart.
>>>>
>>>> ==================================================================
>>>> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
>>>> Read of size 8 at addr ffff88803b335240 by task fio/21412
>>>>
>>>> CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
>>>> Call Trace:
>>>>  dump_stack+0x86/0xca
>>>>  print_address_description+0x71/0x239
>>>>  kasan_report.cold.5+0x242/0x301
>>>>  __asan_load8+0x54/0x90
>>>>  bt_iter+0x86/0xf0
>>>>  blk_mq_queue_tag_busy_iter+0x373/0x5e0
>>>>  blk_mq_in_flight+0x96/0xb0
>>>>  part_in_flight+0x40/0x140
>>>>  part_round_stats+0x18e/0x370
>>>>  blk_account_io_start+0x3d7/0x670
>>>>  blk_mq_bio_to_request+0x19c/0x3a0
>>>>  blk_mq_make_request+0x7a9/0xcb0
>>>>  generic_make_request+0x41d/0x960
>>>>  submit_bio+0x9b/0x250
>>>>  do_blockdev_direct_IO+0x435c/0x4c70
>>>>  __blockdev_direct_IO+0x79/0x88
>>>>  ext4_direct_IO+0x46c/0xc00
>>>>  generic_file_direct_write+0x119/0x210
>>>>  __generic_file_write_iter+0x11c/0x280
>>>>  ext4_file_write_iter+0x1b8/0x6f0
>>>>  aio_write+0x204/0x310
>>>>  io_submit_one+0x9d3/0xe80
>>>>  __x64_sys_io_submit+0x115/0x340
>>>>  do_syscall_64+0x71/0x210
>>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>> RIP: 0033:0x7f02cf043219
>>>
>>> I've seen this one before as well, it's not a new thing. As far as I can
>>> tell, it's a false positive. There should be no possibility for a
>>> use-after-free iterating the static tags/requests.
>>
>> Are you sure this is a false positive?
> 
> No I'm not, but the few times I have seen it, I haven't been able to
> make much sense of it. It goes back quite a bit.
> 
>> I have not yet encountered any false
>> positive KASAN complaints. According to the following gdb output this complaint
>> refers to reading rq->q:
>>
>> (gdb) list *(bt_iter+0x86)
>> 0xffffffff816b9346 is in bt_iter (block/blk-mq-tag.c:237).
>> 232
>> 233             /*
>> 234              * We can hit rq == NULL here, because the tagging functions
>> 235              * test and set the bit before assigning ->rqs[].
>> 236              */
>> 237             if (rq && rq->q == hctx->queue)
>> 238                     iter_data->fn(hctx, rq, iter_data->data, reserved);
>> 239             return true;
>> 240     }
>> 241
>>
>> From the disassembly output:
>>
>> 232
>> 233             /*
>> 234              * We can hit rq == NULL here, because the tagging functions
>> 235              * test and set the bit before assigning ->rqs[].
>> 236              */
>> 237             if (rq && rq->q == hctx->queue)
>>    0xffffffff816b9339 <+121>:   test   %r12,%r12
>>    0xffffffff816b933c <+124>:   je     0xffffffff816b935f <bt_iter+159>
>>    0xffffffff816b933e <+126>:   mov    %r12,%rdi
>>    0xffffffff816b9341 <+129>:   callq  0xffffffff813bd3e0 <__asan_load8>
>>    0xffffffff816b9346 <+134>:   lea    0x138(%r13),%rdi
>>    0xffffffff816b934d <+141>:   mov    (%r12),%r14
>>    0xffffffff816b9351 <+145>:   callq  0xffffffff813bd3e0 <__asan_load8>
>>    0xffffffff816b9356 <+150>:   cmp    0x138(%r13),%r14
>>    0xffffffff816b935d <+157>:   je     0xffffffff816b936f <bt_iter+175>
>>
>> BTW, rq may but does not have to refer to tags->static_rqs[...]. It may also
>> refer to hctx->fq.flush_rq.
> 
> But even those are persistent for the lifetime of the queue... But since
> kasan complains it belongs to a specific page, I'm guessing it's one
> of the regular requests since those are out of a chopped up page. Which
> means it makes even less sense.
> 
> Is this happening while devices are being actively torn down? And
> are you using shared tags? That's the only way I could see this
> triggering.
> 

Or could it be caused by a stale request left in an hctx->tags->rqs[] slot?
We don't clear the slot after freeing the request.

And there could be a scenario like this:
there used to be an io scheduler attached;
after some workload, the io scheduler is detached;
so there could be rqs allocated by the io scheduler left in hctx->tags->rqs.

blk_mq_get_request                            blk_mq_queue_tag_busy_iter
  -> blk_mq_get_tag
                                                -> bt_for_each
                                                  -> bt_iter
                                                    -> rq = tags->rqs[]
                                                    -> rq->q
  -> blk_mq_rq_ctx_init
    -> data->hctx->tags->rqs[rq->tag] = rq;

If this scenario is possible, maybe we could fix it as follows.


---
 block/blk-mq.c | 4 +++-
 block/blk-mq.h | 2 ++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6a75662..ad55226 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -477,6 +477,8 @@ static void __blk_mq_free_request(struct request *rq)
 	const int sched_tag = rq->internal_tag;
 
 	blk_pm_mark_last_busy(rq);
-	if (rq->tag != -1)
+	if (rq->tag != -1) {
+		hctx->tags->rqs[rq->tag] = NULL;
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+	}
 	if (sched_tag != -1)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47..675b681 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -190,6 +190,7 @@ static inline void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
 	if (rq->tag == -1 || rq->internal_tag == -1)
 		return;
 
+	hctx->tags->rqs[rq->tag] = NULL;
 	__blk_mq_put_driver_tag(hctx, rq);
 }
 
@@ -201,6 +202,7 @@ static inline void blk_mq_put_driver_tag(struct request *rq)
 		return;
 
 	hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);
+	hctx->tags->rqs[rq->tag] = NULL;
 	__blk_mq_put_driver_tag(hctx, rq);
 }
 
-- 
2.7.4


Thanks
Jianchao


* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-19 23:24 v4.20-rc6: Sporadic use-after-free in bt_iter() Bart Van Assche
  2018-12-19 23:27 ` Jens Axboe
@ 2018-12-20  4:06 ` Ming Lei
  1 sibling, 0 replies; 34+ messages in thread
From: Ming Lei @ 2018-12-20  4:06 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block

On Thu, Dec 20, 2018 at 9:22 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> Hello,
>
> If I run the srp blktests in a loop then I see the below call stack appearing
> sporadically. I have not yet had the time to analyze this, but I'm reporting
> it here in case someone else has already had a look at it.
>
> Bart.
>
> ==================================================================
> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
> Read of size 8 at addr ffff88803b335240 by task fio/21412
>
> CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> Call Trace:
>  dump_stack+0x86/0xca
>  print_address_description+0x71/0x239
>  kasan_report.cold.5+0x242/0x301
>  __asan_load8+0x54/0x90
>  bt_iter+0x86/0xf0
>  blk_mq_queue_tag_busy_iter+0x373/0x5e0
>  blk_mq_in_flight+0x96/0xb0
>  part_in_flight+0x40/0x140
>  part_round_stats+0x18e/0x370
>  blk_account_io_start+0x3d7/0x670
>  blk_mq_bio_to_request+0x19c/0x3a0
>  blk_mq_make_request+0x7a9/0xcb0
>  generic_make_request+0x41d/0x960
>  submit_bio+0x9b/0x250
>  do_blockdev_direct_IO+0x435c/0x4c70
>  __blockdev_direct_IO+0x79/0x88
>  ext4_direct_IO+0x46c/0xc00
>  generic_file_direct_write+0x119/0x210
>  __generic_file_write_iter+0x11c/0x280
>  ext4_file_write_iter+0x1b8/0x6f0
>  aio_write+0x204/0x310
>  io_submit_one+0x9d3/0xe80
>  __x64_sys_io_submit+0x115/0x340
>  do_syscall_64+0x71/0x210
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x7f02cf043219
> Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 47 fc 0c 00 f7 d8 64 89 01 48
> RSP: 002b:00007f02a1df78b8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
> RAX: ffffffffffffffda RBX: 00007f02a1df8ef8 RCX: 00007f02cf043219
> RDX: 00007f029804a7c0 RSI: 0000000000000001 RDI: 00007f02c4f67000
> RBP: 00007f02c4f67000 R08: 00007f0298007af0 R09: 00007f02a362f0f0
> R10: 00007f029804a9c0 R11: 0000000000000246 R12: 0000000000000001
> R13: 0000000000000000 R14: 00007f029804a7c0 R15: 00007f0298049f60
>
> The buggy address belongs to the page:
> page:ffffea0000eccd40 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> flags: 0x1fff000000000000()
> raw: 1fff000000000000 0000000000000000 ffffffff00ec0201 0000000000000000
> raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>  ffff88803b335100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>  ffff88803b335180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> >ffff88803b335200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>                                            ^
>  ffff88803b335280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>  ffff88803b335300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ==================================================================

It isn't a new issue, and I reported it on v4.16 actually:

https://lists.openwall.net/linux-ext4/2018/04/04/6

Thanks,
Ming Lei


* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20  3:24       ` jianchao.wang
@ 2018-12-20  4:19         ` Jens Axboe
  2018-12-20  4:32           ` jianchao.wang
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20  4:19 UTC (permalink / raw)
  To: jianchao.wang, Bart Van Assche, linux-block

On 12/19/18 8:24 PM, jianchao.wang wrote:
> 
> 
> On 12/20/18 11:17 AM, Jens Axboe wrote:
>> On 12/19/18 5:16 PM, Bart Van Assche wrote:
>>> On Wed, 2018-12-19 at 16:27 -0700, Jens Axboe wrote:
>>>> On 12/19/18 4:24 PM, Bart Van Assche wrote:
>>>>> Hello,
>>>>>
>>>>> If I run the srp blktests in a loop then I see the below call stack appearing
>>>>> sporadically. I have not yet had the time to analyze this, but I'm reporting
>>>>> it here in case someone else has already had a look at it.
>>>>>
>>>>> Bart.
>>>>>
>>>>> ==================================================================
>>>>> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
>>>>> Read of size 8 at addr ffff88803b335240 by task fio/21412
>>>>>
>>>>> CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
>>>>> Call Trace:
>>>>>  dump_stack+0x86/0xca
>>>>>  print_address_description+0x71/0x239
>>>>>  kasan_report.cold.5+0x242/0x301
>>>>>  __asan_load8+0x54/0x90
>>>>>  bt_iter+0x86/0xf0
>>>>>  blk_mq_queue_tag_busy_iter+0x373/0x5e0
>>>>>  blk_mq_in_flight+0x96/0xb0
>>>>>  part_in_flight+0x40/0x140
>>>>>  part_round_stats+0x18e/0x370
>>>>>  blk_account_io_start+0x3d7/0x670
>>>>>  blk_mq_bio_to_request+0x19c/0x3a0
>>>>>  blk_mq_make_request+0x7a9/0xcb0
>>>>>  generic_make_request+0x41d/0x960
>>>>>  submit_bio+0x9b/0x250
>>>>>  do_blockdev_direct_IO+0x435c/0x4c70
>>>>>  __blockdev_direct_IO+0x79/0x88
>>>>>  ext4_direct_IO+0x46c/0xc00
>>>>>  generic_file_direct_write+0x119/0x210
>>>>>  __generic_file_write_iter+0x11c/0x280
>>>>>  ext4_file_write_iter+0x1b8/0x6f0
>>>>>  aio_write+0x204/0x310
>>>>>  io_submit_one+0x9d3/0xe80
>>>>>  __x64_sys_io_submit+0x115/0x340
>>>>>  do_syscall_64+0x71/0x210
>>>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>> RIP: 0033:0x7f02cf043219
>>>>
>>>> I've seen this one before as well, it's not a new thing. As far as I can
>>>> tell, it's a false positive. There should be no possibility for a
>>>> use-after-free iterating the static tags/requests.
>>>
>>> Are you sure this is a false positive?
>>
>> No I'm not, but the few times I have seen it, I haven't been able to
>> make much sense of it. It goes back quite a bit.
>>
>>> I have not yet encountered any false
>>> positive KASAN complaints. According to the following gdb output this complaint
>>> refers to reading rq->q:
>>>
>>> (gdb) list *(bt_iter+0x86)
>>> 0xffffffff816b9346 is in bt_iter (block/blk-mq-tag.c:237).
>>> 232
>>> 233             /*
>>> 234              * We can hit rq == NULL here, because the tagging functions
>>> 235              * test and set the bit before assigning ->rqs[].
>>> 236              */
>>> 237             if (rq && rq->q == hctx->queue)
>>> 238                     iter_data->fn(hctx, rq, iter_data->data, reserved);
>>> 239             return true;
>>> 240     }
>>> 241
>>>
>>> From the disassembly output:
>>>
>>> 232
>>> 233             /*
>>> 234              * We can hit rq == NULL here, because the tagging functions
>>> 235              * test and set the bit before assigning ->rqs[].
>>> 236              */
>>> 237             if (rq && rq->q == hctx->queue)
>>>    0xffffffff816b9339 <+121>:   test   %r12,%r12
>>>    0xffffffff816b933c <+124>:   je     0xffffffff816b935f <bt_iter+159>
>>>    0xffffffff816b933e <+126>:   mov    %r12,%rdi
>>>    0xffffffff816b9341 <+129>:   callq  0xffffffff813bd3e0 <__asan_load8>
>>>    0xffffffff816b9346 <+134>:   lea    0x138(%r13),%rdi
>>>    0xffffffff816b934d <+141>:   mov    (%r12),%r14
>>>    0xffffffff816b9351 <+145>:   callq  0xffffffff813bd3e0 <__asan_load8>
>>>    0xffffffff816b9356 <+150>:   cmp    0x138(%r13),%r14
>>>    0xffffffff816b935d <+157>:   je     0xffffffff816b936f <bt_iter+175>
>>>
>>> BTW, rq may but does not have to refer to tags->static_rqs[...]. It may also
>>> refer to hctx->fq.flush_rq.
>>
>> But even those are persistent for the lifetime of the queue... But since
>> kasan complains it belongs to a specific page, I'm guessing it's one
>> of the regular requests since those are out of a chopped up page. Which
>> means it makes even less sense.
>>
>> Is this happening while devices are being actively torn down? And
>> are you using shared tags? That's the only way I could see this
>> triggering.
>>
> 
> Or could it be caused by a stale request left in an hctx->tags->rqs[] slot?
> We don't clear the slot after freeing the request.
> 
> And there could be a scenario like this:
> there used to be an io scheduler attached;
> after some workload, the io scheduler is detached;
> so there could be rqs allocated by the io scheduler left in hctx->tags->rqs.
> 
> blk_mq_get_request                            blk_mq_queue_tag_busy_iter
>   -> blk_mq_get_tag
>                                                 -> bt_for_each
>                                                   -> bt_iter
>                                                     -> rq = tags->rqs[]
>                                                     -> rq->q
>   -> blk_mq_rq_ctx_init
>     -> data->hctx->tags->rqs[rq->tag] = rq;
> 
> If this scenario is possible, maybe we could fix it as follows.

Ah yes, good point, I bet that's what it is. But we just had this exact
discussion in another thread, and my point there was that we should
clear these when they go away, not inline. So how about clearing entries
when the sched tags go away?

-- 
Jens Axboe



* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20  4:19         ` Jens Axboe
@ 2018-12-20  4:32           ` jianchao.wang
  2018-12-20  4:48             ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: jianchao.wang @ 2018-12-20  4:32 UTC (permalink / raw)
  To: Jens Axboe, Bart Van Assche, linux-block



On 12/20/18 12:19 PM, Jens Axboe wrote:
> On 12/19/18 8:24 PM, jianchao.wang wrote:
>>
>>
>> On 12/20/18 11:17 AM, Jens Axboe wrote:
>>> On 12/19/18 5:16 PM, Bart Van Assche wrote:
>>>> On Wed, 2018-12-19 at 16:27 -0700, Jens Axboe wrote:
>>>>> On 12/19/18 4:24 PM, Bart Van Assche wrote:
>>>>>> Hello,
>>>>>>
>>>>>> If I run the srp blktests in a loop then I see the below call stack appearing
>>>>>> sporadically. I have not yet had the time to analyze this, but I'm reporting
>>>>>> it here in case someone else has already had a look at it.
>>>>>>
>>>>>> Bart.
>>>>>>
>>>>>> ==================================================================
>>>>>> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
>>>>>> Read of size 8 at addr ffff88803b335240 by task fio/21412
>>>>>>
>>>>>> CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
>>>>>> Call Trace:
>>>>>>  dump_stack+0x86/0xca
>>>>>>  print_address_description+0x71/0x239
>>>>>>  kasan_report.cold.5+0x242/0x301
>>>>>>  __asan_load8+0x54/0x90
>>>>>>  bt_iter+0x86/0xf0
>>>>>>  blk_mq_queue_tag_busy_iter+0x373/0x5e0
>>>>>>  blk_mq_in_flight+0x96/0xb0
>>>>>>  part_in_flight+0x40/0x140
>>>>>>  part_round_stats+0x18e/0x370
>>>>>>  blk_account_io_start+0x3d7/0x670
>>>>>>  blk_mq_bio_to_request+0x19c/0x3a0
>>>>>>  blk_mq_make_request+0x7a9/0xcb0
>>>>>>  generic_make_request+0x41d/0x960
>>>>>>  submit_bio+0x9b/0x250
>>>>>>  do_blockdev_direct_IO+0x435c/0x4c70
>>>>>>  __blockdev_direct_IO+0x79/0x88
>>>>>>  ext4_direct_IO+0x46c/0xc00
>>>>>>  generic_file_direct_write+0x119/0x210
>>>>>>  __generic_file_write_iter+0x11c/0x280
>>>>>>  ext4_file_write_iter+0x1b8/0x6f0
>>>>>>  aio_write+0x204/0x310
>>>>>>  io_submit_one+0x9d3/0xe80
>>>>>>  __x64_sys_io_submit+0x115/0x340
>>>>>>  do_syscall_64+0x71/0x210
>>>>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>> RIP: 0033:0x7f02cf043219
>>>>>
>>>>> I've seen this one before as well, it's not a new thing. As far as I can
>>>>> tell, it's a false positive. There should be no possibility for a
>>>>> use-after-free iterating the static tags/requests.
>>>>
>>>> Are you sure this is a false positive?
>>>
>>> No I'm not, but the few times I have seen it, I haven't been able to
>>> make much sense of it. It goes back quite a bit.
>>>
>>>> I have not yet encountered any false
>>>> positive KASAN complaints. According to the following gdb output this complaint
>>>> refers to reading rq->q:
>>>>
>>>> (gdb) list *(bt_iter+0x86)
>>>> 0xffffffff816b9346 is in bt_iter (block/blk-mq-tag.c:237).
>>>> 232
>>>> 233             /*
>>>> 234              * We can hit rq == NULL here, because the tagging functions
>>>> 235              * test and set the bit before assigning ->rqs[].
>>>> 236              */
>>>> 237             if (rq && rq->q == hctx->queue)
>>>> 238                     iter_data->fn(hctx, rq, iter_data->data, reserved);
>>>> 239             return true;
>>>> 240     }
>>>> 241
>>>>
>>>> From the disassembly output:
>>>>
>>>> 232
>>>> 233             /*
>>>> 234              * We can hit rq == NULL here, because the tagging functions
>>>> 235              * test and set the bit before assigning ->rqs[].
>>>> 236              */
>>>> 237             if (rq && rq->q == hctx->queue)
>>>>    0xffffffff816b9339 <+121>:   test   %r12,%r12
>>>>    0xffffffff816b933c <+124>:   je     0xffffffff816b935f <bt_iter+159>
>>>>    0xffffffff816b933e <+126>:   mov    %r12,%rdi
>>>>    0xffffffff816b9341 <+129>:   callq  0xffffffff813bd3e0 <__asan_load8>
>>>>    0xffffffff816b9346 <+134>:   lea    0x138(%r13),%rdi
>>>>    0xffffffff816b934d <+141>:   mov    (%r12),%r14
>>>>    0xffffffff816b9351 <+145>:   callq  0xffffffff813bd3e0 <__asan_load8>
>>>>    0xffffffff816b9356 <+150>:   cmp    0x138(%r13),%r14
>>>>    0xffffffff816b935d <+157>:   je     0xffffffff816b936f <bt_iter+175>
>>>>
>>>> BTW, rq may but does not have to refer to tags->static_rqs[...]. It may also
>>>> refer to hctx->fq.flush_rq.
>>>
>>> But even those are persistent for the lifetime of the queue... But since
>>> kasan complains it belongs to a specific page, I'm guessing it's one
>>> of the regular requests since those are out of a chopped up page. Which
>>> means it makes even less sense.
>>>
>>> Is this happening while devices are being actively torn down? And
>>> are you using shared tags? That's the only way I could see this
>>> triggering.
>>>
>>
>> Or could it be caused by a stale request left in an hctx->tags->rqs[] slot?
>> We don't clear it after freeing the requests.
>>
>> And there could be a scenario like the following:
>> there used to be an io scheduler attached.
>> After some workload, the io scheduler is detached.
>> So there could be rqs allocated by the io scheduler left in hctx->tags->rqs[].
>>
>> blk_mq_get_request                            blk_mq_queue_tag_busy_iter
>>   -> blk_mq_get_tag
>>                                                 -> bt_for_each
>>                                                   -> bt_iter
>>                                                     -> rq = tags->rqs[]
>>                                                     -> rq->q
>>   -> blk_mq_rq_ctx_init
>>     -> data->hctx->tags->rqs[rq->tag] = rq;
>>
>> If this scenario is possible, maybe we could fix it as follows.
> 
> Ah yes, good point, I bet that's what it is. But we just had this exact
> discussion in another thread, and my point there was that we should
> clear these when they go away, not inline. So how about clearing entries
> when the sched tags go away?
> 
I guess it should be OK. :)

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20  4:32           ` jianchao.wang
@ 2018-12-20  4:48             ` Jens Axboe
  2018-12-20  5:03               ` jianchao.wang
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20  4:48 UTC (permalink / raw)
  To: jianchao.wang, Bart Van Assche, linux-block

On 12/19/18 9:32 PM, jianchao.wang wrote:
> 
> 
> On 12/20/18 12:19 PM, Jens Axboe wrote:
>> On 12/19/18 8:24 PM, jianchao.wang wrote:
>>>
>>>
>>> On 12/20/18 11:17 AM, Jens Axboe wrote:
>>>> On 12/19/18 5:16 PM, Bart Van Assche wrote:
>>>>> On Wed, 2018-12-19 at 16:27 -0700, Jens Axboe wrote:
>>>>>> On 12/19/18 4:24 PM, Bart Van Assche wrote:
>>>>>>> [...]
>>>>>>
>>>>>> I've seen this one before as well, it's not a new thing. As far as I can
>>>>>> tell, it's a false positive. There should be no possibility for a
>>>>>> use-after-free iterating the static tags/requests.
>>>>>
>>>>> Are you sure this is a false positive?
>>>>
>>>> No I'm not, but the few times I have seen it, I haven't been able to
>>>> make much sense of it. It goes back quite a bit.
>>>>
>>>>> I have not yet encountered any false
>>>>> positive KASAN complaints. According to the following gdb output this complaint
>>>>> refers to reading rq->q:
>>>>>
>>>>> [...]
>>>>>
>>>>> BTW, rq may but does not have to refer to tags->static_rqs[...]. It may also
>>>>> refer to hctx->fq.flush_rq.
>>>>
>>>> But even those are persistent for the lifetime of the queue... But since
>>>> kasan complains it belongs to a specific page, I'm guessing it's one
>>>> of the regular requests since those are out of a chopped up page. Which
>>>> means it makes even less sense.
>>>>
>>>> Is this happening while devices are being actively torn down? And
>>>> are you using shared tags? That's the only way I could see this
>>>> triggering.
>>>>
>>>
>>> Or could it be caused by a stale request left in an hctx->tags->rqs[] slot?
>>> We don't clear it after freeing the requests.
>>>
>>> And there could be a scenario like the following:
>>> there used to be an io scheduler attached.
>>> After some workload, the io scheduler is detached.
>>> So there could be rqs allocated by the io scheduler left in hctx->tags->rqs[].
>>>
>>> blk_mq_get_request                            blk_mq_queue_tag_busy_iter
>>>   -> blk_mq_get_tag
>>>                                                 -> bt_for_each
>>>                                                   -> bt_iter
>>>                                                     -> rq = tags->rqs[]
>>>                                                     -> rq->q
>>>   -> blk_mq_rq_ctx_init
>>>     -> data->hctx->tags->rqs[rq->tag] = rq;
>>>
>>> If this scenario is possible, maybe we could fix it as follows.
>>
>> Ah yes, good point, I bet that's what it is. But we just had this exact
>> discussion in another thread, and my point there was that we should
>> clear these when they go away, not inline. So how about clearing entries
>> when the sched tags go away?
>>
> I guess it should be OK. :)

Something like this. Totally untested... And I wonder if there's a more
efficient way to do this, not that it matters THAT much. But still.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2de972857496..341cb8b9cfb7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2025,16 +2025,21 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 {
 	struct page *page;
 
-	if (tags->rqs && set->ops->exit_request) {
-		int i;
+	if (tags->rqs) {
+		int i, j;
 
 		for (i = 0; i < tags->nr_tags; i++) {
 			struct request *rq = tags->static_rqs[i];
 
 			if (!rq)
 				continue;
-			set->ops->exit_request(set, rq, hctx_idx);
+			if (set->ops->exit_request)
+				set->ops->exit_request(set, rq, hctx_idx);
 			tags->static_rqs[i] = NULL;
+
+			for (j = 0; j < tags->nr_tags; j++)
+				if (tags->rqs[j] == rq)
+					tags->rqs[j] = NULL;
 		}
 	}
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20  4:48             ` Jens Axboe
@ 2018-12-20  5:03               ` jianchao.wang
  2018-12-20 13:02                 ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: jianchao.wang @ 2018-12-20  5:03 UTC (permalink / raw)
  To: Jens Axboe, Bart Van Assche, linux-block



On 12/20/18 12:48 PM, Jens Axboe wrote:
> On 12/19/18 9:32 PM, jianchao.wang wrote:
>>
>>
>> On 12/20/18 12:19 PM, Jens Axboe wrote:
>>> On 12/19/18 8:24 PM, jianchao.wang wrote:
>>>>
>>>>
>>>> On 12/20/18 11:17 AM, Jens Axboe wrote:
>>>>> On 12/19/18 5:16 PM, Bart Van Assche wrote:
>>>>>> On Wed, 2018-12-19 at 16:27 -0700, Jens Axboe wrote:
>>>>>>> On 12/19/18 4:24 PM, Bart Van Assche wrote:
>>>>>>>> [...]
>>>>>>>
>>>>>>> I've seen this one before as well, it's not a new thing. As far as I can
>>>>>>> tell, it's a false positive. There should be no possibility for a
>>>>>>> use-after-free iterating the static tags/requests.
>>>>>>
>>>>>> Are you sure this is a false positive?
>>>>>
>>>>> No I'm not, but the few times I have seen it, I haven't been able to
>>>>> make much sense of it. It goes back quite a bit.
>>>>>
>>>>>> I have not yet encountered any false
>>>>>> positive KASAN complaints. According to the following gdb output this complaint
>>>>>> refers to reading rq->q:
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>> BTW, rq may but does not have to refer to tags->static_rqs[...]. It may also
>>>>>> refer to hctx->fq.flush_rq.
>>>>>
>>>>> But even those are persistent for the lifetime of the queue... But since
>>>>> kasan complains it belongs to a specific page, I'm guessing it's one
>>>>> of the regular requests since those are out of a chopped up page. Which
>>>>> means it makes even less sense.
>>>>>
>>>>> Is this happening while devices are being actively torn down? And
>>>>> are you using shared tags? That's the only way I could see this
>>>>> triggering.
>>>>>
>>>>
>>>> Or could it be caused by a stale request left in an hctx->tags->rqs[] slot?
>>>> We don't clear it after freeing the requests.
>>>>
>>>> And there could be a scenario like the following:
>>>> there used to be an io scheduler attached.
>>>> After some workload, the io scheduler is detached.
>>>> So there could be rqs allocated by the io scheduler left in hctx->tags->rqs[].
>>>>
>>>> blk_mq_get_request                            blk_mq_queue_tag_busy_iter
>>>>   -> blk_mq_get_tag
>>>>                                                 -> bt_for_each
>>>>                                                   -> bt_iter
>>>>                                                     -> rq = tags->rqs[]
>>>>                                                     -> rq->q
>>>>   -> blk_mq_rq_ctx_init
>>>>     -> data->hctx->tags->rqs[rq->tag] = rq;
>>>>
>>>> If this scenario is possible, maybe we could fix it as follows.
>>>
>>> Ah yes, good point, I bet that's what it is. But we just had this exact
>>> discussion in another thread, and my point there was that we should
>>> clear these when they go away, not inline. So how about clearing entries
>>> when the sched tags go away?
>>>
>> I guess it should be OK. :)
> 
> Something like this. Totally untested... And I wonder if there's a more
> efficient way to do this, not that it matters THAT much. But still.
> 
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2de972857496..341cb8b9cfb7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2025,16 +2025,21 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  {
>  	struct page *page;
>  
> -	if (tags->rqs && set->ops->exit_request) {
> -		int i;
> +	if (tags->rqs) {
> +		int i, j;
>  
>  		for (i = 0; i < tags->nr_tags; i++) {
>  			struct request *rq = tags->static_rqs[i];
>  
>  			if (!rq)
>  				continue;
> -			set->ops->exit_request(set, rq, hctx_idx);
> +			if (set->ops->exit_request)
> +				set->ops->exit_request(set, rq, hctx_idx);
>  			tags->static_rqs[i] = NULL;
> +
> +			for (j = 0; j < tags->nr_tags; j++)
> +				if (tags->rqs[j] == rq)
> +					tags->rqs[j] = NULL;
>  		}
>  	}
>  
> 

I'm afraid this cannot work.

The 'tags' here could be the hctx->sched_tags, but what we need to clear is hctx->tags->rqs[].

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20  5:03               ` jianchao.wang
@ 2018-12-20 13:02                 ` Jens Axboe
  2018-12-20 13:07                   ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20 13:02 UTC (permalink / raw)
  To: jianchao.wang, Bart Van Assche, linux-block

On 12/19/18 10:03 PM, jianchao.wang wrote:
> 
> 
> On 12/20/18 12:48 PM, Jens Axboe wrote:
>> On 12/19/18 9:32 PM, jianchao.wang wrote:
>>>
>>>
>>> On 12/20/18 12:19 PM, Jens Axboe wrote:
>>>> On 12/19/18 8:24 PM, jianchao.wang wrote:
>>>>>
>>>>>
>>>>> On 12/20/18 11:17 AM, Jens Axboe wrote:
>>>>>> On 12/19/18 5:16 PM, Bart Van Assche wrote:
>>>>>>> On Wed, 2018-12-19 at 16:27 -0700, Jens Axboe wrote:
>>>>>>>> On 12/19/18 4:24 PM, Bart Van Assche wrote:
>>>>>>>>> [...]
>>>>>>>>
>>>>>>>> I've seen this one before as well, it's not a new thing. As far as I can
>>>>>>>> tell, it's a false positive. There should be no possibility for a
>>>>>>>> use-after-free iterating the static tags/requests.
>>>>>>>
>>>>>>> Are you sure this is a false positive?
>>>>>>
>>>>>> No I'm not, but the few times I have seen it, I haven't been able to
>>>>>> make much sense of it. It goes back quite a bit.
>>>>>>
>>>>>>> I have not yet encountered any false
>>>>>>> positive KASAN complaints. According to the following gdb output this complaint
>>>>>>> refers to reading rq->q:
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>> BTW, rq may but does not have to refer to tags->static_rqs[...]. It may also
>>>>>>> refer to hctx->fq.flush_rq.
>>>>>>
>>>>>> But even those are persistent for the lifetime of the queue... But since
>>>>>> kasan complains it belongs to a specific page, I'm guessing it's one
>>>>>> of the regular requests since those are out of a chopped up page. Which
>>>>>> means it makes even less sense.
>>>>>>
>>>>>> Is this happening while devices are being actively torn down? And
>>>>>> are you using shared tags? That's the only way I could see this
>>>>>> triggering.
>>>>>>
>>>>>
>>>>> Or could it be caused by a stale request left in an hctx->tags->rqs[] slot?
>>>>> We don't clear it after freeing the requests.
>>>>>
>>>>> And there could be a scenario like the following:
>>>>> there used to be an io scheduler attached.
>>>>> After some workload, the io scheduler is detached.
>>>>> So there could be rqs allocated by the io scheduler left in hctx->tags->rqs[].
>>>>>
>>>>> blk_mq_get_request                            blk_mq_queue_tag_busy_iter
>>>>>   -> blk_mq_get_tag
>>>>>                                                 -> bt_for_each
>>>>>                                                   -> bt_iter
>>>>>                                                     -> rq = tags->rqs[]
>>>>>                                                     -> rq->q
>>>>>   -> blk_mq_rq_ctx_init
>>>>>     -> data->hctx->tags->rqs[rq->tag] = rq;
>>>>>
>>>>> If this scenario is possible, maybe we could fix it as follows.
>>>>
>>>> Ah yes, good point, I bet that's what it is. But we just had this exact
>>>> discussion in another thread, and my point there was that we should
>>>> clear these when they go away, not inline. So how about clearing entries
>>>> when the sched tags go away?
>>>>
>>> I guess it should be OK. :)
>>
>> Something like this. Totally untested... And I wonder if there's a more
>> efficient way to do this, not that it matters THAT much. But still.
>>
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 2de972857496..341cb8b9cfb7 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2025,16 +2025,21 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>  {
>>  	struct page *page;
>>  
>> -	if (tags->rqs && set->ops->exit_request) {
>> -		int i;
>> +	if (tags->rqs) {
>> +		int i, j;
>>  
>>  		for (i = 0; i < tags->nr_tags; i++) {
>>  			struct request *rq = tags->static_rqs[i];
>>  
>>  			if (!rq)
>>  				continue;
>> -			set->ops->exit_request(set, rq, hctx_idx);
>> +			if (set->ops->exit_request)
>> +				set->ops->exit_request(set, rq, hctx_idx);
>>  			tags->static_rqs[i] = NULL;
>> +
>> +			for (j = 0; j < tags->nr_tags; j++)
>> +				if (tags->rqs[j] == rq)
>> +					tags->rqs[j] = NULL;
>>  		}
>>  	}
>>  
>>
> 
> I'm afraid this cannot work.
> 
> The 'tags' here could be the hctx->sched_tags, but what we need to
> clear is hctx->tags->rqs[].

You are right, of course, a bit too quick on the trigger. This one
should work better, and also avoids that silly quadratic loop. I don't
think we need the tag == -1 check, but probably best to be safe.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2de972857496..151891eb6fbd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2025,16 +2025,22 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 {
 	struct page *page;
 
-	if (tags->rqs && set->ops->exit_request) {
-		int i;
+	if (tags->rqs) {
+		int i, j;
 
 		for (i = 0; i < tags->nr_tags; i++) {
 			struct request *rq = tags->static_rqs[i];
 
 			if (!rq)
 				continue;
-			set->ops->exit_request(set, rq, hctx_idx);
+			if (set->ops->exit_request)
+				set->ops->exit_request(set, rq, hctx_idx);
 			tags->static_rqs[i] = NULL;
+
+			if (rq->tag == -1)
+				continue;
+			if (set->tags[hctx_idx]->rqs[rq->tag] == rq)
+				set->tags[hctx_idx]->rqs[rq->tag] = NULL;
 		}
 	}
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 13:02                 ` Jens Axboe
@ 2018-12-20 13:07                   ` Jens Axboe
  2018-12-20 18:01                     ` Bart Van Assche
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20 13:07 UTC (permalink / raw)
  To: jianchao.wang, Bart Van Assche, linux-block

On 12/20/18 6:02 AM, Jens Axboe wrote:
>> I'm afraid this cannot work.
>>
>> The 'tags' here could be the hctx->sched_tags, but what we need to
>> clear is hctx->tags->rqs[].
> 
> You are right, of course, a bit too quick on the trigger. This one
> should work better, and also avoids that silly quadratic loop. I don't
> think we need the tag == -1 check, but probably best to be safe.

Sent out the wrong one, here it is. Bart, if you can reproduce, can you
give it a whirl?


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2de972857496..fc04bb648f18 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2025,7 +2025,7 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 {
 	struct page *page;
 
-	if (tags->rqs && set->ops->exit_request) {
+	if (tags->rqs) {
 		int i;
 
 		for (i = 0; i < tags->nr_tags; i++) {
@@ -2033,8 +2033,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 
 			if (!rq)
 				continue;
-			set->ops->exit_request(set, rq, hctx_idx);
+			if (set->ops->exit_request)
+				set->ops->exit_request(set, rq, hctx_idx);
 			tags->static_rqs[i] = NULL;
+
+			if (rq->tag == -1)
+				continue;
+			if (set->tags[hctx_idx]->rqs[rq->tag] == rq)
+				set->tags[hctx_idx]->rqs[rq->tag] = NULL;
 		}
 	}
 
@@ -2113,6 +2119,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			return ret;
 	}
 
+	rq->tag = -1;
 	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
 	return 0;
 }

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 13:07                   ` Jens Axboe
@ 2018-12-20 18:01                     ` Bart Van Assche
  2018-12-20 18:21                       ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2018-12-20 18:01 UTC (permalink / raw)
  To: Jens Axboe, jianchao.wang, linux-block

On Thu, 2018-12-20 at 06:07 -0700, Jens Axboe wrote:
> On 12/20/18 6:02 AM, Jens Axboe wrote:
> > > I'm afraid this cannot work.
> > > 
> > > The 'tags' here could be the hctx->sched_tags, but what we need to
> > > clear is hctx->tags->rqs[].
> > 
> > You are right, of course, a bit too quick on the trigger. This one
> > should work better, and also avoids that silly quadratic loop. I don't
> > think we need the tag == -1 check, but probably best to be safe.
> 
> Sent out the wrong one, here it is. Bart, if you can reproduce, can you
> give it a whirl?
> 
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2de972857496..fc04bb648f18 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2025,7 +2025,7 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  {
>  	struct page *page;
>  
> -	if (tags->rqs && set->ops->exit_request) {
> +	if (tags->rqs) {
>  		int i;
>  
>  		for (i = 0; i < tags->nr_tags; i++) {
> @@ -2033,8 +2033,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  
>  			if (!rq)
>  				continue;
> -			set->ops->exit_request(set, rq, hctx_idx);
> +			if (set->ops->exit_request)
> +				set->ops->exit_request(set, rq, hctx_idx);
>  			tags->static_rqs[i] = NULL;
> +
> +			if (rq->tag == -1)
> +				continue;
> +			if (set->tags[hctx_idx]->rqs[rq->tag] == rq)
> +				set->tags[hctx_idx]->rqs[rq->tag] = NULL;
>  		}
>  	}
>  
> @@ -2113,6 +2119,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
>  			return ret;
>  	}
>  
> +	rq->tag = -1;
>  	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
>  	return 0;
>  }

Hi Jens,

Are you sure this is sufficient? My understanding is that
blk_mq_queue_tag_busy_iter() iterates over all tags in the tag set. So if the
request queue on which part_in_flight() is called and the request queue for
which blk_mq_free_rqs() is called share their tag set then part_in_flight()
and blk_mq_free_rqs() can run concurrently. That can cause ugly race
conditions. Do you think it would be a good idea to modify the inflight
accounting code such that it only considers the requests of a single request
queue instead of all requests for a given tag set?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 18:01                     ` Bart Van Assche
@ 2018-12-20 18:21                       ` Jens Axboe
  2018-12-20 18:33                         ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20 18:21 UTC (permalink / raw)
  To: Bart Van Assche, jianchao.wang, linux-block

On 12/20/18 11:01 AM, Bart Van Assche wrote:
> On Thu, 2018-12-20 at 06:07 -0700, Jens Axboe wrote:
>> On 12/20/18 6:02 AM, Jens Axboe wrote:
>>>> I'm afraid this cannot work.
>>>>
>>>> The 'tags' here could be the hctx->sched_tags, but what we need to
>>>> clear is hctx->tags->rqs[].
>>>
>>> You are right, of course, a bit too quick on the trigger. This one
>>> should work better, and also avoids that silly quadratic loop. I don't
>>> think we need the tag == -1 check, but probably best to be safe.
>>
>> Sent out the wrong one, here it is. Bart, if you can reproduce, can you
>> give it a whirl?
>>
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 2de972857496..fc04bb648f18 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2025,7 +2025,7 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>  {
>>  	struct page *page;
>>  
>> -	if (tags->rqs && set->ops->exit_request) {
>> +	if (tags->rqs) {
>>  		int i;
>>  
>>  		for (i = 0; i < tags->nr_tags; i++) {
>> @@ -2033,8 +2033,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>  
>>  			if (!rq)
>>  				continue;
>> -			set->ops->exit_request(set, rq, hctx_idx);
>> +			if (set->ops->exit_request)
>> +				set->ops->exit_request(set, rq, hctx_idx);
>>  			tags->static_rqs[i] = NULL;
>> +
>> +			if (rq->tag == -1)
>> +				continue;
>> +			if (set->tags[hctx_idx]->rqs[rq->tag] == rq)
>> +				set->tags[hctx_idx]->rqs[rq->tag] = NULL;
>>  		}
>>  	}
>>  
>> @@ -2113,6 +2119,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
>>  			return ret;
>>  	}
>>  
>> +	rq->tag = -1;
>>  	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
>>  	return 0;
>>  }
> 
> Hi Jens,
> 
> Are you sure this is sufficient?

No, I don't think it is.

> My understanding is that
> blk_mq_queue_tag_busy_iter() iterates over all tags in the tag set. So if the
> request queue on which part_in_flight() is called and the request queue for
> which blk_mq_free_rqs() is called share their tag set then part_in_flight()
> and blk_mq_free_rqs() can run concurrently. That can cause ugly race
> conditions. Do you think it would be a good idea to modify the inflight
> accounting code such that it only considers the requests of a single request
> queue instead of all requests for a given tag set?

That would of course solve it, the question is how to do it. One option
would be to have ->rqs[] be:

struct rq_entry {
	struct request_queue *q;
	struct request *rq;
};

instead of just a request, since then you could check the queue without
having to dereference the request. The current race is inherent in that
we set ->rqs[] AFTER having acquired the tag, so there's a window where
you could find a stale entry. That's not normally an issue since
requests are persistent, but for shared tag maps and queues disappearing
it can pose a problem.


-- 
Jens Axboe



* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 18:21                       ` Jens Axboe
@ 2018-12-20 18:33                         ` Jens Axboe
  2018-12-20 20:56                           ` Bart Van Assche
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20 18:33 UTC (permalink / raw)
  To: Bart Van Assche, jianchao.wang, linux-block

On 12/20/18 11:21 AM, Jens Axboe wrote:
> On 12/20/18 11:01 AM, Bart Van Assche wrote:
>> On Thu, 2018-12-20 at 06:07 -0700, Jens Axboe wrote:
>>> On 12/20/18 6:02 AM, Jens Axboe wrote:
>>>>> I'm afraid this cannot work.
>>>>>
>>>>> The 'tags' here could be the hctx->sched_tags, but what we need to
>>>>> clear is hctx->tags->rqs[].
>>>>
>>>> You are right, of course, a bit too quick on the trigger. This one
>>>> should work better, and also avoids that silly quadratic loop. I don't
>>>> think we need the tag == -1 check, but probably best to be safe.
>>>
>>> Sent out the wrong one, here it is. Bart, if you can reproduce, can you
>>> give it a whirl?
>>>
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 2de972857496..fc04bb648f18 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -2025,7 +2025,7 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>>  {
>>>  	struct page *page;
>>>  
>>> -	if (tags->rqs && set->ops->exit_request) {
>>> +	if (tags->rqs) {
>>>  		int i;
>>>  
>>>  		for (i = 0; i < tags->nr_tags; i++) {
>>> @@ -2033,8 +2033,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>>  
>>>  			if (!rq)
>>>  				continue;
>>> -			set->ops->exit_request(set, rq, hctx_idx);
>>> +			if (set->ops->exit_request)
>>> +				set->ops->exit_request(set, rq, hctx_idx);
>>>  			tags->static_rqs[i] = NULL;
>>> +
>>> +			if (rq->tag == -1)
>>> +				continue;
>>> +			if (set->tags[hctx_idx]->rqs[rq->tag] == rq)
>>> +				set->tags[hctx_idx]->rqs[rq->tag] = NULL;
>>>  		}
>>>  	}
>>>  
>>> @@ -2113,6 +2119,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
>>>  			return ret;
>>>  	}
>>>  
>>> +	rq->tag = -1;
>>>  	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
>>>  	return 0;
>>>  }
>>
>> Hi Jens,
>>
>> Are you sure this is sufficient?
> 
> No, I don't think it is.
> 
>> My understanding is that
>> blk_mq_queue_tag_busy_iter() iterates over all tags in the tag set. So if the
>> request queue on which part_in_flight() is called and the request queue for
>> which blk_mq_free_rqs() is called share their tag set then part_in_flight()
>> and blk_mq_free_rqs() can run concurrently. That can cause ugly race
>> conditions. Do you think it would be a good idea to modify the inflight
>> accounting code such that it only considers the requests of a single request
>> queue instead of all requests for a given tag set?
> 
> That would of course solve it, the question is how to do it. One option
> would be to have ->rqs[] be:
> 
> struct rq_entry {
> 	struct request_queue *q;
> 	struct request *rq;
> };
> 
> instead of just a request, since then you could check the queue without
> having to dereference the request. The current race is inherent in that
> we set ->rqs[] AFTER having acquired the tag, so there's a window where
> you could find a stale entry. That's not normally an issue since
> requests are persistent, but for shared tag maps and queues disappearing
> it can pose a problem.

Something like this, totally untested. If the queue matches, we know it's
safe to dereference it.

A safer API to export as well...


diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 2089c6c62f44..78192b544fa2 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
-	rq = tags->rqs[bitnr];
+	if (tags->rqs[bitnr].queue != hctx->queue)
+		return true;
 
 	/*
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
-	if (rq && rq->q == hctx->queue)
+	rq = tags->rqs[bitnr].rq;
+	if (rq)
 		return iter_data->fn(hctx, rq, iter_data->data, reserved);
 	return true;
 }
@@ -268,6 +270,7 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
 
 struct bt_tags_iter_data {
 	struct blk_mq_tags *tags;
+	struct blk_mq_hw_ctx *hctx;
 	busy_tag_iter_fn *fn;
 	void *data;
 	bool reserved;
@@ -287,7 +290,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assining ->rqs[].
 	 */
-	rq = tags->rqs[bitnr];
+	rq = tags->rqs[bitnr].rq;
 	if (rq && blk_mq_request_started(rq))
 		return iter_data->fn(rq, iter_data->data, reserved);
 
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..bb84d1f099c7 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -4,6 +4,11 @@
 
 #include "blk-mq.h"
 
+struct rq_tag_entry {
+	struct request_queue *queue;
+	struct request *rq;
+};
+
 /*
  * Tag address space map.
  */
@@ -16,7 +21,7 @@ struct blk_mq_tags {
 	struct sbitmap_queue bitmap_tags;
 	struct sbitmap_queue breserved_tags;
 
-	struct request **rqs;
+	struct rq_tag_entry *rqs;
 	struct request **static_rqs;
 	struct list_head page_list;
 };
@@ -78,7 +83,8 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx,
 		unsigned int tag, struct request *rq)
 {
-	hctx->tags->rqs[tag] = rq;
+	hctx->tags->rqs[tag].queue = hctx->queue;
+	hctx->tags->rqs[tag].rq = rq;
 }
 
 static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ba37b9e15e9..4f194946dbd9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -298,13 +298,16 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		rq->tag = -1;
 		rq->internal_tag = tag;
 	} else {
-		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
+		struct blk_mq_hw_ctx *hctx = data->hctx;
+
+		if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
 			rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
+			atomic_inc(&hctx->nr_active);
 		}
 		rq->tag = tag;
 		rq->internal_tag = -1;
-		data->hctx->tags->rqs[rq->tag] = rq;
+		hctx->tags->rqs[rq->tag].queue = hctx->queue;
+		hctx->tags->rqs[rq->tag].rq = rq;
 	}
 
 	/* csd/requeue_work/fifo_time is initialized before use */
@@ -797,8 +800,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 {
 	if (tag < tags->nr_tags) {
-		prefetch(tags->rqs[tag]);
-		return tags->rqs[tag];
+		prefetch(tags->rqs[tag].rq);
+		return tags->rqs[tag].rq;
 	}
 
 	return NULL;
@@ -809,10 +812,11 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq,
 			       void *priv, bool reserved)
 {
 	/*
-	 * If we find a request that is inflight and the queue matches,
-	 * we know the queue is busy. Return false to stop the iteration.
+	 * We're only called here if the queue matches. If the rq state is
+	 * inflight, we know the queue is busy. Return false to stop the
+	 * iteration.
 	 */
-	if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) {
+	if (rq->state == MQ_RQ_IN_FLIGHT) {
 		bool *busy = priv;
 
 		*busy = true;
@@ -1049,11 +1053,14 @@ bool blk_mq_get_driver_tag(struct request *rq)
 	shared = blk_mq_tag_busy(data.hctx);
 	rq->tag = blk_mq_get_tag(&data);
 	if (rq->tag >= 0) {
+		struct blk_mq_hw_ctx *hctx = data.hctx;
+
 		if (shared) {
 			rq->rq_flags |= RQF_MQ_INFLIGHT;
 			atomic_inc(&data.hctx->nr_active);
 		}
-		data.hctx->tags->rqs[rq->tag] = rq;
+		hctx->tags->rqs[rq->tag].queue = hctx->queue;
+		hctx->tags->rqs[rq->tag].rq = rq;
 	}
 
 done:
@@ -2069,7 +2076,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 	if (!tags)
 		return NULL;
 
-	tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *),
+	tags->rqs = kcalloc_node(nr_tags, sizeof(struct rq_tag_entry),
 				 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 				 node);
 	if (!tags->rqs) {

-- 
Jens Axboe



* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 18:33                         ` Jens Axboe
@ 2018-12-20 20:56                           ` Bart Van Assche
  2018-12-20 21:00                             ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2018-12-20 20:56 UTC (permalink / raw)
  To: Jens Axboe, jianchao.wang, linux-block

On Thu, 2018-12-20 at 11:33 -0700, Jens Axboe wrote:
> [ ... ]
> Something like this, totally untested. If the queue matches, we know it's
> safe to dereference it.
> 
> A safer API to export as well...
> 
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 2089c6c62f44..78192b544fa2 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  
>  	if (!reserved)
>  		bitnr += tags->nr_reserved_tags;
> -	rq = tags->rqs[bitnr];
> +	if (tags->rqs[bitnr].queue != hctx->queue)
> +		return true;
>  
>  	/*
>  	 * We can hit rq == NULL here, because the tagging functions
>  	 * test and set the bit before assigning ->rqs[].
>  	 */
> -	if (rq && rq->q == hctx->queue)
> +	rq = tags->rqs[bitnr].rq;
> +	if (rq)
>  		return iter_data->fn(hctx, rq, iter_data->data, reserved);
>  	return true;
>  }
> @@ -268,6 +270,7 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
>  
>  struct bt_tags_iter_data {
>  	struct blk_mq_tags *tags;
> +	struct blk_mq_hw_ctx *hctx;
>  	busy_tag_iter_fn *fn;
>  	void *data;
>  	bool reserved;
> @@ -287,7 +290,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  	 * We can hit rq == NULL here, because the tagging functions
>  	 * test and set the bit before assining ->rqs[].
>  	 */
> -	rq = tags->rqs[bitnr];
> +	rq = tags->rqs[bitnr].rq;
>  	if (rq && blk_mq_request_started(rq))
>  		return iter_data->fn(rq, iter_data->data, reserved);
>  
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 61deab0b5a5a..bb84d1f099c7 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -4,6 +4,11 @@
>  
>  #include "blk-mq.h"
>  
> +struct rq_tag_entry {
> +	struct request_queue *queue;
> +	struct request *rq;
> +};
> +
>  /*
>   * Tag address space map.
>   */
> @@ -16,7 +21,7 @@ struct blk_mq_tags {
>  	struct sbitmap_queue bitmap_tags;
>  	struct sbitmap_queue breserved_tags;
>  
> -	struct request **rqs;
> +	struct rq_tag_entry *rqs;
>  	struct request **static_rqs;
>  	struct list_head page_list;
>  };
> @@ -78,7 +83,8 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>  static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx,
>  		unsigned int tag, struct request *rq)
>  {
> -	hctx->tags->rqs[tag] = rq;
> +	hctx->tags->rqs[tag].queue = hctx->queue;
> +	hctx->tags->rqs[tag].rq = rq;
>  }
>  
>  static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ba37b9e15e9..4f194946dbd9 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -298,13 +298,16 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>  		rq->tag = -1;
>  		rq->internal_tag = tag;
>  	} else {
> -		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
> +		struct blk_mq_hw_ctx *hctx = data->hctx;
> +
> +		if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
>  			rq_flags = RQF_MQ_INFLIGHT;
> -			atomic_inc(&data->hctx->nr_active);
> +			atomic_inc(&hctx->nr_active);
>  		}
>  		rq->tag = tag;
>  		rq->internal_tag = -1;
> -		data->hctx->tags->rqs[rq->tag] = rq;
> +		hctx->tags->rqs[rq->tag].queue = hctx->queue;
> +		hctx->tags->rqs[rq->tag].rq = rq;
>  	}
>  
>  	/* csd/requeue_work/fifo_time is initialized before use */
> @@ -797,8 +800,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
>  struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
>  {
>  	if (tag < tags->nr_tags) {
> -		prefetch(tags->rqs[tag]);
> -		return tags->rqs[tag];
> +		prefetch(tags->rqs[tag].rq);
> +		return tags->rqs[tag].rq;
>  	}
>  
>  	return NULL;
> @@ -809,10 +812,11 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  			       void *priv, bool reserved)
>  {
>  	/*
> -	 * If we find a request that is inflight and the queue matches,
> -	 * we know the queue is busy. Return false to stop the iteration.
> +	 * We're only called here if the queue matches. If the rq state is
> +	 * inflight, we know the queue is busy. Return false to stop the
> +	 * iteration.
>  	 */
> -	if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) {
> +	if (rq->state == MQ_RQ_IN_FLIGHT) {
>  		bool *busy = priv;
>  
>  		*busy = true;
> @@ -1049,11 +1053,14 @@ bool blk_mq_get_driver_tag(struct request *rq)
>  	shared = blk_mq_tag_busy(data.hctx);
>  	rq->tag = blk_mq_get_tag(&data);
>  	if (rq->tag >= 0) {
> +		struct blk_mq_hw_ctx *hctx = data.hctx;
> +
>  		if (shared) {
>  			rq->rq_flags |= RQF_MQ_INFLIGHT;
>  			atomic_inc(&data.hctx->nr_active);
>  		}
> -		data.hctx->tags->rqs[rq->tag] = rq;
> +		hctx->tags->rqs[rq->tag].queue = hctx->queue;
> +		hctx->tags->rqs[rq->tag].rq = rq;
>  	}
>  
>  done:
> @@ -2069,7 +2076,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
>  	if (!tags)
>  		return NULL;
>  
> -	tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *),
> +	tags->rqs = kcalloc_node(nr_tags, sizeof(struct rq_tag_entry),
>  				 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
>  				 node);
>  	if (!tags->rqs) {

Hi Jens,

How about the patch below? Its behavior should be similar to that of your
patch, but I think this one is simpler.

Thanks,

Bart.


---
 block/blk-mq.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6a7566244de3..c57611b1f2a8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -86,6 +86,7 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 }
 
 struct mq_inflight {
+	struct request_queue *q;
 	struct hd_struct *part;
 	unsigned int *inflight;
 };
@@ -96,6 +97,9 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
 {
 	struct mq_inflight *mi = priv;
 
+	if (rq->q != mi->q)
+		return;
+
 	/*
 	 * index[0] counts the specific partition that was asked for. index[1]
 	 * counts the ones that are active on the whole device, so increment
@@ -110,7 +114,11 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
 void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
 		      unsigned int inflight[2])
 {
-	struct mq_inflight mi = { .part = part, .inflight = inflight, };
+	struct mq_inflight mi = {
+		.q = q,
+		.part = part,
+		.inflight = inflight,
+	};
 
 	inflight[0] = inflight[1] = 0;
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
@@ -129,7 +137,11 @@ static void blk_mq_check_inflight_rw(struct blk_mq_hw_ctx *hctx,
 void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 			 unsigned int inflight[2])
 {
-	struct mq_inflight mi = { .part = part, .inflight = inflight, };
+	struct mq_inflight mi = {
+		.q = q,
+		.part = part,
+		.inflight = inflight,
+	};
 
 	inflight[0] = inflight[1] = 0;
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_rw, &mi);
@@ -477,6 +489,8 @@ static void __blk_mq_free_request(struct request *rq)
 	const int sched_tag = rq->internal_tag;
 
 	blk_pm_mark_last_busy(rq);
+	/* Avoid that blk_mq_queue_tag_busy_iter() picks up this request. */
+	rq->q = NULL;
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)




* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 20:56                           ` Bart Van Assche
@ 2018-12-20 21:00                             ` Jens Axboe
  2018-12-20 21:23                               ` Bart Van Assche
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20 21:00 UTC (permalink / raw)
  To: Bart Van Assche, jianchao.wang, linux-block

On 12/20/18 1:56 PM, Bart Van Assche wrote:
> @@ -96,6 +97,9 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>  {
>  	struct mq_inflight *mi = priv;
>  
> +	if (rq->q != mi->q)
> +		return;
> +

Aren't you back to square one with this one, if the tags are shared? You
can't dereference it before you know it matches.


-- 
Jens Axboe



* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 21:00                             ` Jens Axboe
@ 2018-12-20 21:23                               ` Bart Van Assche
  2018-12-20 21:26                                 ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2018-12-20 21:23 UTC (permalink / raw)
  To: Jens Axboe, jianchao.wang, linux-block

On Thu, 2018-12-20 at 14:00 -0700, Jens Axboe wrote:
> On 12/20/18 1:56 PM, Bart Van Assche wrote:
> > @@ -96,6 +97,9 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
> >  {
> >  	struct mq_inflight *mi = priv;
> >  
> > +	if (rq->q != mi->q)
> > +		return;
> 
> Aren't you back to square one with this one, if the tags are shared? You
> can't dereference it before you know it matches.

My patch can only work if the new rq->q = NULL assignment in __blk_mq_free_request()
is executed before the request tag is freed and if freeing a tag does not happen
concurrently with any bt_iter() call. Would you accept that I add a seqlock to avoid
this scenario?

Thanks,

Bart.


* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 21:23                               ` Bart Van Assche
@ 2018-12-20 21:26                                 ` Jens Axboe
  2018-12-20 21:31                                   ` Bart Van Assche
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20 21:26 UTC (permalink / raw)
  To: Bart Van Assche, jianchao.wang, linux-block

On 12/20/18 2:23 PM, Bart Van Assche wrote:
> On Thu, 2018-12-20 at 14:00 -0700, Jens Axboe wrote:
>> On 12/20/18 1:56 PM, Bart Van Assche wrote:
>>> @@ -96,6 +97,9 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>>>  {
>>>  	struct mq_inflight *mi = priv;
>>>  
>>> +	if (rq->q != mi->q)
>>> +		return;
>>
>> Aren't you back to square one with this one, if the tags are shared? You
>> can't dereference it before you know it matches.
> 
> My patch can only work if the new rq->q = NULL assignment in __blk_mq_free_request()
> is executed before the request tag is freed and if freeing a tag does not happen
> concurrently with any bt_iter() call. Would you accept that I add a seqlock to avoid
> this scenario?

Ugh no, let's not go that far. Why not just use my approach that avoids
any need to dereference rq, unless we know it belongs to the queue in
question? I think that's cheaper than resetting ->queue as well when the
rq completes, I'm always wary of adding new stores in the completion
path.

-- 
Jens Axboe



* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 21:26                                 ` Jens Axboe
@ 2018-12-20 21:31                                   ` Bart Van Assche
  2018-12-20 21:34                                     ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2018-12-20 21:31 UTC (permalink / raw)
  To: Jens Axboe, jianchao.wang, linux-block

On Thu, 2018-12-20 at 14:26 -0700, Jens Axboe wrote:
> On 12/20/18 2:23 PM, Bart Van Assche wrote:
> > On Thu, 2018-12-20 at 14:00 -0700, Jens Axboe wrote:
> > > On 12/20/18 1:56 PM, Bart Van Assche wrote:
> > > > @@ -96,6 +97,9 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
> > > >  {
> > > >  	struct mq_inflight *mi = priv;
> > > >  
> > > > +	if (rq->q != mi->q)
> > > > +		return;
> > > 
> > > Aren't you back to square one with this one, if the tags are shared? You
> > > can't dereference it before you know it matches.
> > 
> > My patch can only work if the new rq->q = NULL assignment in __blk_mq_free_request()
> > is executed before the request tag is freed and if freeing a tag does not happen
> > concurrently with any bt_iter() call. Would you accept that I add a seqlock to avoid
> > this scenario?
> 
> Ugh no, let's not go that far. Why not just use my approach that avoids
> any need to dereference rq, unless we know it belongs to the queue in
> question? I think that's cheaper than resetting ->queue as well when the
> rq completes, I'm always wary of adding new stores in the completion
> path.

I think there is a race condition in bt_iter() in your approach: tags->rqs[bitnr].queue
can change after it has been read and that can cause a request that is not associated
with hctx->queue to be passed to iter_data->fn(). Since 'fn' will access '*rq' I think
that with your patch a use-after-free can occur similar to the one reported at the
start of this e-mail thread. Your patch may make it harder to trigger that issue though.

Bart.



* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 21:31                                   ` Bart Van Assche
@ 2018-12-20 21:34                                     ` Jens Axboe
  2018-12-20 21:40                                       ` Bart Van Assche
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20 21:34 UTC (permalink / raw)
  To: Bart Van Assche, jianchao.wang, linux-block

On 12/20/18 2:31 PM, Bart Van Assche wrote:
> On Thu, 2018-12-20 at 14:26 -0700, Jens Axboe wrote:
>> On 12/20/18 2:23 PM, Bart Van Assche wrote:
>>> On Thu, 2018-12-20 at 14:00 -0700, Jens Axboe wrote:
>>>> On 12/20/18 1:56 PM, Bart Van Assche wrote:
>>>>> @@ -96,6 +97,9 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>>>>>  {
>>>>>  	struct mq_inflight *mi = priv;
>>>>>  
>>>>> +	if (rq->q != mi->q)
>>>>> +		return;
>>>>
>>>> Aren't you back to square one with this one, if the tags are shared? You
>>>> can't dereference it before you know it matches.
>>>
>>> My patch can only work if the new rq->q = NULL assignment in __blk_mq_free_request()
>>> is executed before the request tag is freed and if freeing a tag does not happen
>>> concurrently with any bt_iter() call. Would you accept that I add a seqlock to avoid
>>> this scenario?
>>
>> Ugh no, let's not go that far. Why not just use my approach that avoids
>> any need to dereference rq, unless we know it belongs to the queue in
>> question? I think that's cheaper than resetting ->queue as well when the
>> rq completes, I'm always wary of adding new stores in the completion
>> path.
> 
> I think there is a race condition in bt_iter() in your approach:
> tags->rqs[bitnr].queue can change after it has been read and that can
> cause a request that is not associated with hctx->queue to be passed
> to iter_data->fn(). Since 'fn' will access '*rq' I think that with
> your patch a use-after-free can occur similar to the one reported at
> the start of this e-mail thread. Your patch may make it harder to
> trigger that issue though.

Yeah, I don't think it's bulletproof either; it just closes the gap.
I'm fine with fiddling with the tag iteration. On top of what I sent, we
could have tag iteration hold the RCU read lock, and then we just need
to ensure that the tags are freed with RCU.

-- 
Jens Axboe



* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 21:34                                     ` Jens Axboe
@ 2018-12-20 21:40                                       ` Bart Van Assche
  2018-12-20 21:44                                         ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2018-12-20 21:40 UTC (permalink / raw)
  To: Jens Axboe, jianchao.wang, linux-block

On Thu, 2018-12-20 at 14:34 -0700, Jens Axboe wrote:
> Yeah, I don't think it's bullet proof either, it just closes the gap.
> I'm fine with fiddling with the tag iteration. On top of what I sent, we
> could have tag iteration hold the RCU read lock, and then we just need
> to ensure that the tags are freed with RCU.

Do you mean using call_rcu() to free tags? Would that require adding a
struct rcu_head to every request? Would it be acceptable to increase the
size of struct request with an rcu_head? Additionally, could that reduce
the queue depth if the time between grace periods is larger than the time
between I/O submissions?

Thanks,

Bart.


* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 21:40                                       ` Bart Van Assche
@ 2018-12-20 21:44                                         ` Jens Axboe
  2018-12-20 21:48                                           ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20 21:44 UTC (permalink / raw)
  To: Bart Van Assche, jianchao.wang, linux-block

On 12/20/18 2:40 PM, Bart Van Assche wrote:
> On Thu, 2018-12-20 at 14:34 -0700, Jens Axboe wrote:
>> Yeah, I don't think it's bullet proof either, it just closes the gap.
>> I'm fine with fiddling with the tag iteration. On top of what I sent, we
>> could have tag iteration hold the RCU read lock, and then we just need
>> to ensure that the tags are freed with RCU.
> 
> Do you mean using call_rcu() to free tags? Would that require adding a
> struct rcu_head to every request? Would it be acceptable to increase the
> size of struct request with an rcu_head? Additionally, could that reduce
> the queue depth if the time between grace periods is larger than the time
> between I/O submissions?

No no, the requests are allocated out of full pages; we just need one
rcu_work per blk_mq_tags. Something like the below... And then the tag
iteration needs to grab the RCU read lock to prevent the page freeing from
happening while it runs.

This should essentially be free, as rcu read lock for tag iteration is
basically a no-op.

Need to handle the flush request on top of this as well.


diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 2089c6c62f44..c39b58391ae8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
-	rq = tags->rqs[bitnr];
+	if (tags->rqs[bitnr].queue != hctx->queue)
+		return true;
 
 	/*
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
-	if (rq && rq->q == hctx->queue)
+	rq = tags->rqs[bitnr].rq;
+	if (rq)
 		return iter_data->fn(hctx, rq, iter_data->data, reserved);
 	return true;
 }
@@ -263,7 +265,9 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
 		.reserved = reserved,
 	};
 
+	rcu_read_lock();
 	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
+	rcu_read_unlock();
 }
 
 struct bt_tags_iter_data {
@@ -287,7 +291,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assining ->rqs[].
 	 */
-	rq = tags->rqs[bitnr];
+	rq = tags->rqs[bitnr].rq;
 	if (rq && blk_mq_request_started(rq))
 		return iter_data->fn(rq, iter_data->data, reserved);
 
@@ -317,8 +321,11 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
 		.reserved = reserved,
 	};
 
-	if (tags->rqs)
+	if (tags->rqs) {
+		rcu_read_lock();
 		sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data);
+		rcu_read_unlock();
+	}
 }
 
 /**
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..bdd1bfc08c21 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -4,6 +4,11 @@
 
 #include "blk-mq.h"
 
+struct rq_tag_entry {
+	struct request_queue *queue;
+	struct request *rq;
+};
+
 /*
  * Tag address space map.
  */
@@ -16,9 +21,11 @@ struct blk_mq_tags {
 	struct sbitmap_queue bitmap_tags;
 	struct sbitmap_queue breserved_tags;
 
-	struct request **rqs;
+	struct rq_tag_entry *rqs;
 	struct request **static_rqs;
 	struct list_head page_list;
+
+	struct rcu_work rcu_work;
 };
 
 
@@ -78,7 +85,8 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx,
 		unsigned int tag, struct request *rq)
 {
-	hctx->tags->rqs[tag] = rq;
+	hctx->tags->rqs[tag].queue = hctx->queue;
+	hctx->tags->rqs[tag].rq = rq;
 }
 
 static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2de972857496..4decd1e7d2d9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -298,13 +298,16 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		rq->tag = -1;
 		rq->internal_tag = tag;
 	} else {
-		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
+		struct blk_mq_hw_ctx *hctx = data->hctx;
+
+		if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
 			rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
+			atomic_inc(&hctx->nr_active);
 		}
 		rq->tag = tag;
 		rq->internal_tag = -1;
-		data->hctx->tags->rqs[rq->tag] = rq;
+		hctx->tags->rqs[rq->tag].queue = hctx->queue;
+		hctx->tags->rqs[rq->tag].rq = rq;
 	}
 
 	/* csd/requeue_work/fifo_time is initialized before use */
@@ -797,8 +800,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 {
 	if (tag < tags->nr_tags) {
-		prefetch(tags->rqs[tag]);
-		return tags->rqs[tag];
+		prefetch(tags->rqs[tag].rq);
+		return tags->rqs[tag].rq;
 	}
 
 	return NULL;
@@ -809,10 +812,11 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq,
 			       void *priv, bool reserved)
 {
 	/*
-	 * If we find a request that is inflight and the queue matches,
-	 * we know the queue is busy. Return false to stop the iteration.
+	 * We're only called here if the queue matches. If the rq state is
+	 * inflight, we know the queue is busy. Return false to stop the
+	 * iteration.
 	 */
-	if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) {
+	if (rq->state == MQ_RQ_IN_FLIGHT) {
 		bool *busy = priv;
 
 		*busy = true;
@@ -1049,11 +1053,14 @@ bool blk_mq_get_driver_tag(struct request *rq)
 	shared = blk_mq_tag_busy(data.hctx);
 	rq->tag = blk_mq_get_tag(&data);
 	if (rq->tag >= 0) {
+		struct blk_mq_hw_ctx *hctx = data.hctx;
+
 		if (shared) {
 			rq->rq_flags |= RQF_MQ_INFLIGHT;
 			atomic_inc(&data.hctx->nr_active);
 		}
-		data.hctx->tags->rqs[rq->tag] = rq;
+		hctx->tags->rqs[rq->tag].queue = hctx->queue;
+		hctx->tags->rqs[rq->tag].rq = rq;
 	}
 
 done:
@@ -2020,11 +2027,27 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	return cookie;
 }
 
-void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
-		     unsigned int hctx_idx)
+static void blk_mq_rcu_free_pages(struct work_struct *work)
 {
+	struct blk_mq_tags *tags = container_of(to_rcu_work(work),
+						struct blk_mq_tags, rcu_work);
 	struct page *page;
 
+	while (!list_empty(&tags->page_list)) {
+		page = list_first_entry(&tags->page_list, struct page, lru);
+		list_del_init(&page->lru);
+		/*
+		 * Remove kmemleak object previously allocated in
+		 * blk_mq_init_rq_map().
+		 */
+		kmemleak_free(page_address(page));
+		__free_pages(page, page->private);
+	}
+}
+
+void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+		     unsigned int hctx_idx)
+{
 	if (tags->rqs && set->ops->exit_request) {
 		int i;
 
@@ -2038,16 +2061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		}
 	}
 
-	while (!list_empty(&tags->page_list)) {
-		page = list_first_entry(&tags->page_list, struct page, lru);
-		list_del_init(&page->lru);
-		/*
-		 * Remove kmemleak object previously allocated in
-		 * blk_mq_init_rq_map().
-		 */
-		kmemleak_free(page_address(page));
-		__free_pages(page, page->private);
-	}
+	/* Punt to RCU free, so we don't race with tag iteration */
+	INIT_RCU_WORK(&tags->rcu_work, blk_mq_rcu_free_pages);
+	queue_rcu_work(system_wq, &tags->rcu_work);
 }
 
 void blk_mq_free_rq_map(struct blk_mq_tags *tags)
@@ -2077,7 +2093,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 	if (!tags)
 		return NULL;
 
-	tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *),
+	tags->rqs = kcalloc_node(nr_tags, sizeof(struct rq_tag_entry),
 				 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 				 node);
 	if (!tags->rqs) {

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 21:44                                         ` Jens Axboe
@ 2018-12-20 21:48                                           ` Jens Axboe
  2018-12-20 22:19                                             ` Bart Van Assche
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20 21:48 UTC (permalink / raw)
  To: Bart Van Assche, jianchao.wang, linux-block

On 12/20/18 2:44 PM, Jens Axboe wrote:
> On 12/20/18 2:40 PM, Bart Van Assche wrote:
>> On Thu, 2018-12-20 at 14:34 -0700, Jens Axboe wrote:
>>> Yeah, I don't think it's bullet proof either, it just closes the gap.
>>> I'm fine with fiddling with the tag iteration. On top of what I sent, we
>>> could have tag iteration hold the RCU read lock, and then we just need
>>> to ensure that the tags are freed with RCU.
>>
>> Do you mean using call_rcu() to free tags? Would that require to add a
>> struct rcu_head to every request? Would it be acceptable to increase the
>> size of struct request with an rcu_head? Additionally, could that reduce
>> the queue depth if the time between grace periods is larger than the time
>> between I/O submissions?
> 
> No no, the requests are out of full pages, we just need one per blk_mq_tags.
> Something like the below... And then the tag iteration needs to grab the
> RCU read lock to prevent the page freeing from happening.
> 
> This should essentially be free, as rcu read lock for tag iteration is
> basically a no-op.
> 
> Need to handle the flush request on top of this as well.

With FQ handled as well.


diff --git a/block/blk-flush.c b/block/blk-flush.c
index a3fc7191c694..23e1f5fb091f 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -491,12 +491,22 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
 	return NULL;
 }
 
+static void blk_fq_rcu_free(struct work_struct *work)
+{
+	struct blk_flush_queue *fq = container_of(to_rcu_work(work),
+							struct blk_flush_queue,
+							rcu_work);
+
+	kfree(fq->flush_rq);
+	kfree(fq);
+}
+
 void blk_free_flush_queue(struct blk_flush_queue *fq)
 {
 	/* bio based request queue hasn't flush queue */
 	if (!fq)
 		return;
 
-	kfree(fq->flush_rq);
-	kfree(fq);
+	INIT_RCU_WORK(&fq->rcu_work, blk_fq_rcu_free);
+	queue_rcu_work(system_wq, &fq->rcu_work);
 }
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 2089c6c62f44..c39b58391ae8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
-	rq = tags->rqs[bitnr];
+	if (tags->rqs[bitnr].queue != hctx->queue)
+		return true;
 
 	/*
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
-	if (rq && rq->q == hctx->queue)
+	rq = tags->rqs[bitnr].rq;
+	if (rq)
 		return iter_data->fn(hctx, rq, iter_data->data, reserved);
 	return true;
 }
@@ -263,7 +265,9 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
 		.reserved = reserved,
 	};
 
+	rcu_read_lock();
 	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
+	rcu_read_unlock();
 }
 
 struct bt_tags_iter_data {
@@ -287,7 +291,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assining ->rqs[].
 	 */
-	rq = tags->rqs[bitnr];
+	rq = tags->rqs[bitnr].rq;
 	if (rq && blk_mq_request_started(rq))
 		return iter_data->fn(rq, iter_data->data, reserved);
 
@@ -317,8 +321,11 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
 		.reserved = reserved,
 	};
 
-	if (tags->rqs)
+	if (tags->rqs) {
+		rcu_read_lock();
 		sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data);
+		rcu_read_unlock();
+	}
 }
 
 /**
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..bdd1bfc08c21 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -4,6 +4,11 @@
 
 #include "blk-mq.h"
 
+struct rq_tag_entry {
+	struct request_queue *queue;
+	struct request *rq;
+};
+
 /*
  * Tag address space map.
  */
@@ -16,9 +21,11 @@ struct blk_mq_tags {
 	struct sbitmap_queue bitmap_tags;
 	struct sbitmap_queue breserved_tags;
 
-	struct request **rqs;
+	struct rq_tag_entry *rqs;
 	struct request **static_rqs;
 	struct list_head page_list;
+
+	struct rcu_work rcu_work;
 };
 
 
@@ -78,7 +85,8 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx,
 		unsigned int tag, struct request *rq)
 {
-	hctx->tags->rqs[tag] = rq;
+	hctx->tags->rqs[tag].queue = hctx->queue;
+	hctx->tags->rqs[tag].rq = rq;
 }
 
 static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2de972857496..4decd1e7d2d9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -298,13 +298,16 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		rq->tag = -1;
 		rq->internal_tag = tag;
 	} else {
-		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
+		struct blk_mq_hw_ctx *hctx = data->hctx;
+
+		if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
 			rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
+			atomic_inc(&hctx->nr_active);
 		}
 		rq->tag = tag;
 		rq->internal_tag = -1;
-		data->hctx->tags->rqs[rq->tag] = rq;
+		hctx->tags->rqs[rq->tag].queue = hctx->queue;
+		hctx->tags->rqs[rq->tag].rq = rq;
 	}
 
 	/* csd/requeue_work/fifo_time is initialized before use */
@@ -797,8 +800,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 {
 	if (tag < tags->nr_tags) {
-		prefetch(tags->rqs[tag]);
-		return tags->rqs[tag];
+		prefetch(tags->rqs[tag].rq);
+		return tags->rqs[tag].rq;
 	}
 
 	return NULL;
@@ -809,10 +812,11 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq,
 			       void *priv, bool reserved)
 {
 	/*
-	 * If we find a request that is inflight and the queue matches,
-	 * we know the queue is busy. Return false to stop the iteration.
+	 * We're only called here if the queue matches. If the rq state is
+	 * inflight, we know the queue is busy. Return false to stop the
+	 * iteration.
 	 */
-	if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) {
+	if (rq->state == MQ_RQ_IN_FLIGHT) {
 		bool *busy = priv;
 
 		*busy = true;
@@ -1049,11 +1053,14 @@ bool blk_mq_get_driver_tag(struct request *rq)
 	shared = blk_mq_tag_busy(data.hctx);
 	rq->tag = blk_mq_get_tag(&data);
 	if (rq->tag >= 0) {
+		struct blk_mq_hw_ctx *hctx = data.hctx;
+
 		if (shared) {
 			rq->rq_flags |= RQF_MQ_INFLIGHT;
 			atomic_inc(&data.hctx->nr_active);
 		}
-		data.hctx->tags->rqs[rq->tag] = rq;
+		hctx->tags->rqs[rq->tag].queue = hctx->queue;
+		hctx->tags->rqs[rq->tag].rq = rq;
 	}
 
 done:
@@ -2020,11 +2027,27 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	return cookie;
 }
 
-void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
-		     unsigned int hctx_idx)
+static void blk_mq_rcu_free_pages(struct work_struct *work)
 {
+	struct blk_mq_tags *tags = container_of(to_rcu_work(work),
+						struct blk_mq_tags, rcu_work);
 	struct page *page;
 
+	while (!list_empty(&tags->page_list)) {
+		page = list_first_entry(&tags->page_list, struct page, lru);
+		list_del_init(&page->lru);
+		/*
+		 * Remove kmemleak object previously allocated in
+		 * blk_mq_init_rq_map().
+		 */
+		kmemleak_free(page_address(page));
+		__free_pages(page, page->private);
+	}
+}
+
+void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+		     unsigned int hctx_idx)
+{
 	if (tags->rqs && set->ops->exit_request) {
 		int i;
 
@@ -2038,16 +2061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		}
 	}
 
-	while (!list_empty(&tags->page_list)) {
-		page = list_first_entry(&tags->page_list, struct page, lru);
-		list_del_init(&page->lru);
-		/*
-		 * Remove kmemleak object previously allocated in
-		 * blk_mq_init_rq_map().
-		 */
-		kmemleak_free(page_address(page));
-		__free_pages(page, page->private);
-	}
+	/* Punt to RCU free, so we don't race with tag iteration */
+	INIT_RCU_WORK(&tags->rcu_work, blk_mq_rcu_free_pages);
+	queue_rcu_work(system_wq, &tags->rcu_work);
 }
 
 void blk_mq_free_rq_map(struct blk_mq_tags *tags)
@@ -2077,7 +2093,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 	if (!tags)
 		return NULL;
 
-	tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *),
+	tags->rqs = kcalloc_node(nr_tags, sizeof(struct rq_tag_entry),
 				 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 				 node);
 	if (!tags->rqs) {
diff --git a/block/blk.h b/block/blk.h
index 848278c52030..785207fc8a30 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -29,6 +29,8 @@ struct blk_flush_queue {
 	 */
 	struct request		*orig_rq;
 	spinlock_t		mq_flush_lock;
+
+	struct rcu_work		rcu_work;
 };
 
 extern struct kmem_cache *blk_requestq_cachep;

-- 
Jens Axboe



* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 21:48                                           ` Jens Axboe
@ 2018-12-20 22:19                                             ` Bart Van Assche
  2018-12-20 22:23                                               ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2018-12-20 22:19 UTC (permalink / raw)
  To: Jens Axboe, jianchao.wang, linux-block

On Thu, 2018-12-20 at 14:48 -0700, Jens Axboe wrote:
> -void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> -		     unsigned int hctx_idx)
> +static void blk_mq_rcu_free_pages(struct work_struct *work)
>  {
> +	struct blk_mq_tags *tags = container_of(to_rcu_work(work),
> +						struct blk_mq_tags, rcu_work);
>  	struct page *page;
>  
> +	while (!list_empty(&tags->page_list)) {
> +		page = list_first_entry(&tags->page_list, struct page, lru);
> +		list_del_init(&page->lru);
> +		/*
> +		 * Remove kmemleak object previously allocated in
> +		 * blk_mq_init_rq_map().
> +		 */
> +		kmemleak_free(page_address(page));
> +		__free_pages(page, page->private);
> +	}
> +}
> +
> +void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> +		     unsigned int hctx_idx)
> +{
>  	if (tags->rqs && set->ops->exit_request) {
>  		int i;
>  
> @@ -2038,16 +2061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  		}
>  	}
>  
> -	while (!list_empty(&tags->page_list)) {
> -		page = list_first_entry(&tags->page_list, struct page, lru);
> -		list_del_init(&page->lru);
> -		/*
> -		 * Remove kmemleak object previously allocated in
> -		 * blk_mq_init_rq_map().
> -		 */
> -		kmemleak_free(page_address(page));
> -		__free_pages(page, page->private);
> -	}
> +	/* Punt to RCU free, so we don't race with tag iteration */
> +	INIT_RCU_WORK(&tags->rcu_work, blk_mq_rcu_free_pages);
> +	queue_rcu_work(system_wq, &tags->rcu_work);
>  }

This can only work correctly if blk_mq_rcu_free_pages() is called before
INIT_RCU_WORK() is called a second time for the same blk_mq_tags structure
and if blk_mq_rcu_free_pages() is called before struct blk_mq_tags is freed.
What provides these guarantees? Did I perhaps miss something?

Thanks,

Bart.


* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 22:19                                             ` Bart Van Assche
@ 2018-12-20 22:23                                               ` Jens Axboe
  2018-12-20 22:33                                                 ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20 22:23 UTC (permalink / raw)
  To: Bart Van Assche, jianchao.wang, linux-block

On 12/20/18 3:19 PM, Bart Van Assche wrote:
> On Thu, 2018-12-20 at 14:48 -0700, Jens Axboe wrote:
>> -void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>> -		     unsigned int hctx_idx)
>> +static void blk_mq_rcu_free_pages(struct work_struct *work)
>>  {
>> +	struct blk_mq_tags *tags = container_of(to_rcu_work(work),
>> +						struct blk_mq_tags, rcu_work);
>>  	struct page *page;
>>  
>> +	while (!list_empty(&tags->page_list)) {
>> +		page = list_first_entry(&tags->page_list, struct page, lru);
>> +		list_del_init(&page->lru);
>> +		/*
>> +		 * Remove kmemleak object previously allocated in
>> +		 * blk_mq_init_rq_map().
>> +		 */
>> +		kmemleak_free(page_address(page));
>> +		__free_pages(page, page->private);
>> +	}
>> +}
>> +
>> +void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>> +		     unsigned int hctx_idx)
>> +{
>>  	if (tags->rqs && set->ops->exit_request) {
>>  		int i;
>>  
>> @@ -2038,16 +2061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>  		}
>>  	}
>>  
>> -	while (!list_empty(&tags->page_list)) {
>> -		page = list_first_entry(&tags->page_list, struct page, lru);
>> -		list_del_init(&page->lru);
>> -		/*
>> -		 * Remove kmemleak object previously allocated in
>> -		 * blk_mq_init_rq_map().
>> -		 */
>> -		kmemleak_free(page_address(page));
>> -		__free_pages(page, page->private);
>> -	}
>> +	/* Punt to RCU free, so we don't race with tag iteration */
>> +	INIT_RCU_WORK(&tags->rcu_work, blk_mq_rcu_free_pages);
>> +	queue_rcu_work(system_wq, &tags->rcu_work);
>>  }
> 
> This can only work correctly if blk_mq_rcu_free_pages() is called before
> INIT_RCU_WORK() is called a second time for the same blk_mq_tags structure
> and if blk_mq_rcu_free_pages() is called before struct blk_mq_tags is freed.
> What provides these guarantees? Did I perhaps miss something?

We don't call it twice. Protecting against that is outside the scope
of this function. But we call and clear from the regular shutdown path,
and the rest is error handling on setup.

But yes, we do need to ensure that 'tags' doesn't go away. Probably best
to rework it to shove it somewhere else for freeing for that case, and
leave the rest of the shutdown alone. I'll update the patch.
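
[Editor's aside: the ownership hand-off being discussed — moving the page
list out of blk_mq_tags into a container owned by the deferred work itself,
so the tags structure can be freed immediately — can be sketched in plain
userspace C. This is an illustrative model only; all names (pagenode,
deferred_free, hand_off) are invented, and the deferred callback runs
inline here instead of after an RCU grace period.]

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a page in tags->page_list. */
struct pagenode {
	struct pagenode *next;
};

/* Container owned by the deferred work; takes the pages from 'tags'. */
struct deferred_free {
	struct pagenode *pages;
};

static int pages_freed;

/* Models the rcu_work callback: free the pages, then the container. */
static void deferred_free_fn(struct deferred_free *df)
{
	while (df->pages) {
		struct pagenode *p = df->pages;

		df->pages = p->next;
		free(p);
		pages_freed++;
	}
	free(df);
}

/* Transfer ownership of the page list; the caller's list becomes empty,
 * so the caller's structure may be freed independently of the pages. */
static struct deferred_free *hand_off(struct pagenode **tags_pages)
{
	struct deferred_free *df = malloc(sizeof(*df));

	if (!df)
		return NULL;
	df->pages = *tags_pages;
	*tags_pages = NULL;
	return df;
}

/* Build npages pages, hand them off, run the deferred free inline. */
int demo_deferred_free(int npages)
{
	struct pagenode *list = NULL;
	struct deferred_free *df;
	int i;

	pages_freed = 0;
	for (i = 0; i < npages; i++) {
		struct pagenode *p = malloc(sizeof(*p));

		p->next = list;
		list = p;
	}
	df = hand_off(&list);
	/* In the kernel this would run after an RCU grace period. */
	deferred_free_fn(df);
	return pages_freed;
}
```

After hand_off() returns, the original owner no longer references the
pages, which is what lets blk_mq_tags itself be freed without waiting.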

-- 
Jens Axboe



* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 22:23                                               ` Jens Axboe
@ 2018-12-20 22:33                                                 ` Jens Axboe
  2018-12-20 22:47                                                   ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20 22:33 UTC (permalink / raw)
  To: Bart Van Assche, jianchao.wang, linux-block

On 12/20/18 3:23 PM, Jens Axboe wrote:
> On 12/20/18 3:19 PM, Bart Van Assche wrote:
>> On Thu, 2018-12-20 at 14:48 -0700, Jens Axboe wrote:
>>> -void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>> -		     unsigned int hctx_idx)
>>> +static void blk_mq_rcu_free_pages(struct work_struct *work)
>>>  {
>>> +	struct blk_mq_tags *tags = container_of(to_rcu_work(work),
>>> +						struct blk_mq_tags, rcu_work);
>>>  	struct page *page;
>>>  
>>> +	while (!list_empty(&tags->page_list)) {
>>> +		page = list_first_entry(&tags->page_list, struct page, lru);
>>> +		list_del_init(&page->lru);
>>> +		/*
>>> +		 * Remove kmemleak object previously allocated in
>>> +		 * blk_mq_init_rq_map().
>>> +		 */
>>> +		kmemleak_free(page_address(page));
>>> +		__free_pages(page, page->private);
>>> +	}
>>> +}
>>> +
>>> +void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>> +		     unsigned int hctx_idx)
>>> +{
>>>  	if (tags->rqs && set->ops->exit_request) {
>>>  		int i;
>>>  
>>> @@ -2038,16 +2061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>>  		}
>>>  	}
>>>  
>>> -	while (!list_empty(&tags->page_list)) {
>>> -		page = list_first_entry(&tags->page_list, struct page, lru);
>>> -		list_del_init(&page->lru);
>>> -		/*
>>> -		 * Remove kmemleak object previously allocated in
>>> -		 * blk_mq_init_rq_map().
>>> -		 */
>>> -		kmemleak_free(page_address(page));
>>> -		__free_pages(page, page->private);
>>> -	}
>>> +	/* Punt to RCU free, so we don't race with tag iteration */
>>> +	INIT_RCU_WORK(&tags->rcu_work, blk_mq_rcu_free_pages);
>>> +	queue_rcu_work(system_wq, &tags->rcu_work);
>>>  }
>>
>> This can only work correctly if blk_mq_rcu_free_pages() is called before
>> INIT_RCU_WORK() is called a second time for the same blk_mq_tags structure
>> and if blk_mq_rcu_free_pages() is called before struct blk_mq_tags is freed.
>> What provides these guarantees? Did I perhaps miss something?
> 
> We don't call it twice. Protecting against that is outside the scope
> of this function. But we call and clear from the regular shutdown path,
> and the rest is error handling on setup.
> 
> But yes, we do need to ensure that 'tags' doesn't go away. Probably best
> to rework it to shove it somewhere else for freeing for that case, and
> leave the rest of the shutdown alone. I'll update the patch.

Here's a version that doesn't rely on tags, we just dynamically allocate
the structure. For the very odd case where we fail, we just punt to
an on-stack structure and use the big synchronize_rcu() hammer first.
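
[Editor's aside: the fallback path described above — heap-allocate a
container for deferred freeing, and on allocation failure splice to an
on-stack container and free synchronously after waiting for readers — can
be modeled in userspace C. Allocation failure is simulated with a flag,
and synchronize() stands in for synchronize_rcu(); all names are
illustrative, not kernel API.]

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	struct node *next;
};

/* Models struct rq_page_list from the patch below. */
struct rq_page_list {
	struct node *list;
};

static int simulate_alloc_failure;
static int synchronized;	/* set when we waited before freeing */

/* Stand-in for synchronize_rcu(): record that readers were drained. */
static void synchronize(void)
{
	synchronized = 1;
}

static int free_list(struct node *n)
{
	int freed = 0;

	while (n) {
		struct node *next = n->next;

		free(n);
		n = next;
		freed++;
	}
	return freed;
}

/* Models blk_mq_free_rqs(): returns the number of pages freed.
 * The "deferred" path runs inline here for simplicity. */
int free_rqs(struct node **pages)
{
	struct rq_page_list *rpl;
	int freed;

	rpl = simulate_alloc_failure ? NULL : malloc(sizeof(*rpl));
	if (rpl) {
		rpl->list = *pages;	/* ownership moves to rpl */
		*pages = NULL;
		/* kernel: INIT_RCU_WORK + queue_rcu_work */
		freed = free_list(rpl->list);
		free(rpl);
	} else {
		struct rq_page_list stack_rpl;	/* on-stack fallback */

		stack_rpl.list = *pages;
		*pages = NULL;
		synchronize();		/* wait for readers first */
		freed = free_list(stack_rpl.list);
	}
	return freed;
}
```

The design point is that the slow synchronize() only runs on the rare
allocation-failure path; the common path stays asynchronous.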


diff --git a/block/blk-flush.c b/block/blk-flush.c
index a3fc7191c694..827e3d2180d8 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -491,12 +491,22 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
 	return NULL;
 }
 
+static void blk_fq_rcu_free(struct work_struct *work)
+{
+	struct blk_flush_queue *fq = container_of(to_rcu_work(work),
+							struct blk_flush_queue,
+							rcu_work);
+
+	kfree(fq->flush_rq);
+	kfree(fq);
+}
+
 void blk_free_flush_queue(struct blk_flush_queue *fq)
 {
 	/* bio based request queue hasn't flush queue */
 	if (!fq)
 		return;
 
-	kfree(fq->flush_rq);
-	kfree(fq);
+	INIT_RCU_WORK(&fq->rcu_work, blk_fq_rcu_free);
+	queue_rcu_work(system_wq, &fq->rcu_work);
 }
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 2089c6c62f44..c39b58391ae8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
-	rq = tags->rqs[bitnr];
+	if (tags->rqs[bitnr].queue != hctx->queue)
+		return true;
 
 	/*
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
-	if (rq && rq->q == hctx->queue)
+	rq = tags->rqs[bitnr].rq;
+	if (rq)
 		return iter_data->fn(hctx, rq, iter_data->data, reserved);
 	return true;
 }
@@ -263,7 +265,9 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
 		.reserved = reserved,
 	};
 
+	rcu_read_lock();
 	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
+	rcu_read_unlock();
 }
 
 struct bt_tags_iter_data {
@@ -287,7 +291,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assining ->rqs[].
 	 */
-	rq = tags->rqs[bitnr];
+	rq = tags->rqs[bitnr].rq;
 	if (rq && blk_mq_request_started(rq))
 		return iter_data->fn(rq, iter_data->data, reserved);
 
@@ -317,8 +321,11 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
 		.reserved = reserved,
 	};
 
-	if (tags->rqs)
+	if (tags->rqs) {
+		rcu_read_lock();
 		sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data);
+		rcu_read_unlock();
+	}
 }
 
 /**
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..bb84d1f099c7 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -4,6 +4,11 @@
 
 #include "blk-mq.h"
 
+struct rq_tag_entry {
+	struct request_queue *queue;
+	struct request *rq;
+};
+
 /*
  * Tag address space map.
  */
@@ -16,7 +21,7 @@ struct blk_mq_tags {
 	struct sbitmap_queue bitmap_tags;
 	struct sbitmap_queue breserved_tags;
 
-	struct request **rqs;
+	struct rq_tag_entry *rqs;
 	struct request **static_rqs;
 	struct list_head page_list;
 };
@@ -78,7 +83,8 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx,
 		unsigned int tag, struct request *rq)
 {
-	hctx->tags->rqs[tag] = rq;
+	hctx->tags->rqs[tag].queue = hctx->queue;
+	hctx->tags->rqs[tag].rq = rq;
 }
 
 static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2de972857496..d104bcb4308f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -298,13 +298,16 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		rq->tag = -1;
 		rq->internal_tag = tag;
 	} else {
-		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
+		struct blk_mq_hw_ctx *hctx = data->hctx;
+
+		if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
 			rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
+			atomic_inc(&hctx->nr_active);
 		}
 		rq->tag = tag;
 		rq->internal_tag = -1;
-		data->hctx->tags->rqs[rq->tag] = rq;
+		hctx->tags->rqs[rq->tag].queue = hctx->queue;
+		hctx->tags->rqs[rq->tag].rq = rq;
 	}
 
 	/* csd/requeue_work/fifo_time is initialized before use */
@@ -797,8 +800,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 {
 	if (tag < tags->nr_tags) {
-		prefetch(tags->rqs[tag]);
-		return tags->rqs[tag];
+		prefetch(tags->rqs[tag].rq);
+		return tags->rqs[tag].rq;
 	}
 
 	return NULL;
@@ -809,10 +812,11 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq,
 			       void *priv, bool reserved)
 {
 	/*
-	 * If we find a request that is inflight and the queue matches,
-	 * we know the queue is busy. Return false to stop the iteration.
+	 * We're only called here if the queue matches. If the rq state is
+	 * inflight, we know the queue is busy. Return false to stop the
+	 * iteration.
 	 */
-	if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) {
+	if (rq->state == MQ_RQ_IN_FLIGHT) {
 		bool *busy = priv;
 
 		*busy = true;
@@ -1049,11 +1053,14 @@ bool blk_mq_get_driver_tag(struct request *rq)
 	shared = blk_mq_tag_busy(data.hctx);
 	rq->tag = blk_mq_get_tag(&data);
 	if (rq->tag >= 0) {
+		struct blk_mq_hw_ctx *hctx = data.hctx;
+
 		if (shared) {
 			rq->rq_flags |= RQF_MQ_INFLIGHT;
 			atomic_inc(&data.hctx->nr_active);
 		}
-		data.hctx->tags->rqs[rq->tag] = rq;
+		hctx->tags->rqs[rq->tag].queue = hctx->queue;
+		hctx->tags->rqs[rq->tag].rq = rq;
 	}
 
 done:
@@ -2020,10 +2027,34 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	return cookie;
 }
 
+struct rq_page_list {
+	struct rcu_work rcu_work;
+	struct list_head list;
+};
+
+static void blk_mq_rcu_free_pages(struct work_struct *work)
+{
+	struct rq_page_list *rpl = container_of(to_rcu_work(work),
+						struct rq_page_list, rcu_work);
+	struct page *page;
+
+	while (!list_empty(&rpl->list)) {
+		page = list_first_entry(&rpl->list, struct page, lru);
+		list_del_init(&page->lru);
+		/*
+		 * Remove kmemleak object previously allocated in
+		 * blk_mq_init_rq_map().
+		 */
+		kmemleak_free(page_address(page));
+		__free_pages(page, page->private);
+	}
+	kfree(rpl);
+}
+
 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx)
 {
-	struct page *page;
+	struct rq_page_list *rpl;
 
 	if (tags->rqs && set->ops->exit_request) {
 		int i;
@@ -2038,15 +2069,26 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		}
 	}
 
-	while (!list_empty(&tags->page_list)) {
-		page = list_first_entry(&tags->page_list, struct page, lru);
-		list_del_init(&page->lru);
+	if (list_empty(&tags->page_list))
+		return;
+
+	rpl = kmalloc(sizeof(*rpl), GFP_NOIO);
+	if (rpl) {
+		list_splice_init(&tags->page_list, &rpl->list);
+
+		/* Punt to RCU free, so we don't race with tag iteration */
+		INIT_RCU_WORK(&rpl->rcu_work, blk_mq_rcu_free_pages);
+		queue_rcu_work(system_wq, &rpl->rcu_work);
+	} else {
+		struct rq_page_list stack_rpl;
+
 		/*
-		 * Remove kmemleak object previously allocated in
-		 * blk_mq_init_rq_map().
+		 * Fail alloc, punt to on-stack, we just have to synchronize
+		 * RCU first to ensure readers are done.
 		 */
-		kmemleak_free(page_address(page));
-		__free_pages(page, page->private);
+		list_splice_init(&tags->page_list, &stack_rpl.list);
+		synchronize_rcu();
+		blk_mq_rcu_free_pages(&stack_rpl.rcu_work.work);
 	}
 }
 
@@ -2077,7 +2119,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 	if (!tags)
 		return NULL;
 
-	tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *),
+	tags->rqs = kcalloc_node(nr_tags, sizeof(struct rq_tag_entry),
 				 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 				 node);
 	if (!tags->rqs) {
diff --git a/block/blk.h b/block/blk.h
index 848278c52030..785207fc8a30 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -29,6 +29,8 @@ struct blk_flush_queue {
 	 */
 	struct request		*orig_rq;
 	spinlock_t		mq_flush_lock;
+
+	struct rcu_work		rcu_work;
 };
 
 extern struct kmem_cache *blk_requestq_cachep;


-- 
Jens Axboe




* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 22:33                                                 ` Jens Axboe
@ 2018-12-20 22:47                                                   ` Jens Axboe
  2018-12-20 22:50                                                     ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2018-12-20 22:47 UTC (permalink / raw)
  To: Bart Van Assche, jianchao.wang, linux-block

On 12/20/18 3:33 PM, Jens Axboe wrote:
> On 12/20/18 3:23 PM, Jens Axboe wrote:
>> On 12/20/18 3:19 PM, Bart Van Assche wrote:
>>> On Thu, 2018-12-20 at 14:48 -0700, Jens Axboe wrote:
>>>> -void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>>> -		     unsigned int hctx_idx)
>>>> +static void blk_mq_rcu_free_pages(struct work_struct *work)
>>>>  {
>>>> +	struct blk_mq_tags *tags = container_of(to_rcu_work(work),
>>>> +						struct blk_mq_tags, rcu_work);
>>>>  	struct page *page;
>>>>  
>>>> +	while (!list_empty(&tags->page_list)) {
>>>> +		page = list_first_entry(&tags->page_list, struct page, lru);
>>>> +		list_del_init(&page->lru);
>>>> +		/*
>>>> +		 * Remove kmemleak object previously allocated in
>>>> +		 * blk_mq_init_rq_map().
>>>> +		 */
>>>> +		kmemleak_free(page_address(page));
>>>> +		__free_pages(page, page->private);
>>>> +	}
>>>> +}
>>>> +
>>>> +void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>>> +		     unsigned int hctx_idx)
>>>> +{
>>>>  	if (tags->rqs && set->ops->exit_request) {
>>>>  		int i;
>>>>  
>>>> @@ -2038,16 +2061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>>>  		}
>>>>  	}
>>>>  
>>>> -	while (!list_empty(&tags->page_list)) {
>>>> -		page = list_first_entry(&tags->page_list, struct page, lru);
>>>> -		list_del_init(&page->lru);
>>>> -		/*
>>>> -		 * Remove kmemleak object previously allocated in
>>>> -		 * blk_mq_init_rq_map().
>>>> -		 */
>>>> -		kmemleak_free(page_address(page));
>>>> -		__free_pages(page, page->private);
>>>> -	}
>>>> +	/* Punt to RCU free, so we don't race with tag iteration */
>>>> +	INIT_RCU_WORK(&tags->rcu_work, blk_mq_rcu_free_pages);
>>>> +	queue_rcu_work(system_wq, &tags->rcu_work);
>>>>  }
>>>
>>> This can only work correctly if blk_mq_rcu_free_pages() is called before
>>> INIT_RCU_WORK() is called a second time for the same blk_mq_tags structure
>>> and if blk_mq_rcu_free_pages() is called before struct blk_mq_tags is freed.
>>> What provides these guarantees? Did I perhaps miss something?
>>
>> We don't call it twice. Protecting against that is outside the scope
>> of this function. But we call and clear from the regular shutdown path,
>> and the rest is error handling on setup.
>>
>> But yes, we do need to ensure that 'tags' doesn't go away. Probably best
>> to rework it to shove it somewhere else for freeing for that case, and
>> leave the rest of the shutdown alone. I'll update the patch.
> 
> Here's a version that doesn't rely on tags, we just dynamically allocate
> the structure. For the very odd case where we fail, we just punt to
> an on-stack structure and use the big synchronize_rcu() hammer first.

Forgot to init the lists... This one I actually booted, works for me.
Outside of that, not tested, in terms of verifying the use-after-free is
gone for tag iteration.
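[Editorial note: the alloc-or-fall-back-to-stack pattern in the patch below can be modeled in a small user-space sketch. All types and names here are simplified stand-ins, not the kernel's, and the container is only freed when it was heap-allocated (the refinement Jens adds in the follow-up message):]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for the list of pages hanging off blk_mq_tags. */
struct page_list { int npages; };

/* Stand-in for struct rq_page_list. */
struct rq_page_list {
	struct page_list pages;
};

static int pages_freed;

/* Models blk_mq_rcu_free_pages(): release the pages, then release the
 * container itself only if it was heap-allocated. */
static void rcu_free_pages(struct rq_page_list *rpl, bool heap_allocated)
{
	pages_freed += rpl->pages.npages;
	rpl->pages.npages = 0;
	if (heap_allocated)
		free(rpl);
}

/* Models the tail of blk_mq_free_rqs(). 'alloc_fails' simulates kmalloc()
 * failure so both paths can be exercised. Returns total pages freed. */
static int free_rqs(struct page_list *tags_pages, bool alloc_fails)
{
	struct rq_page_list *rpl = alloc_fails ? NULL : malloc(sizeof(*rpl));

	if (rpl) {
		rpl->pages = *tags_pages;	/* list_splice_init() analogue */
		tags_pages->npages = 0;
		/* kernel: INIT_RCU_WORK() + queue_rcu_work() runs this later,
		 * after an RCU grace period */
		rcu_free_pages(rpl, true);
	} else {
		struct rq_page_list stack_rpl;

		stack_rpl.pages = *tags_pages;
		tags_pages->npages = 0;
		/* kernel: synchronize_rcu() here to wait out tag iterators */
		rcu_free_pages(&stack_rpl, false);
	}
	return pages_freed;
}
```

Either way the pages outlive any RCU reader: the common path defers the free past a grace period, and the allocation-failure path pays for a synchronous grace-period wait instead.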


diff --git a/block/blk-flush.c b/block/blk-flush.c
index a3fc7191c694..827e3d2180d8 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -491,12 +491,22 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
 	return NULL;
 }
 
+static void blk_fq_rcu_free(struct work_struct *work)
+{
+	struct blk_flush_queue *fq = container_of(to_rcu_work(work),
+							struct blk_flush_queue,
+							rcu_work);
+
+	kfree(fq->flush_rq);
+	kfree(fq);
+}
+
 void blk_free_flush_queue(struct blk_flush_queue *fq)
 {
 	/* bio based request queue hasn't flush queue */
 	if (!fq)
 		return;
 
-	kfree(fq->flush_rq);
-	kfree(fq);
+	INIT_RCU_WORK(&fq->rcu_work, blk_fq_rcu_free);
+	queue_rcu_work(system_wq, &fq->rcu_work);
 }
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 2089c6c62f44..c39b58391ae8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
-	rq = tags->rqs[bitnr];
+	if (tags->rqs[bitnr].queue != hctx->queue)
+		return true;
 
 	/*
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
-	if (rq && rq->q == hctx->queue)
+	rq = tags->rqs[bitnr].rq;
+	if (rq)
 		return iter_data->fn(hctx, rq, iter_data->data, reserved);
 	return true;
 }
@@ -263,7 +265,9 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
 		.reserved = reserved,
 	};
 
+	rcu_read_lock();
 	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
+	rcu_read_unlock();
 }
 
 struct bt_tags_iter_data {
@@ -287,7 +291,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
-	rq = tags->rqs[bitnr];
+	rq = tags->rqs[bitnr].rq;
 	if (rq && blk_mq_request_started(rq))
 		return iter_data->fn(rq, iter_data->data, reserved);
 
@@ -317,8 +321,11 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
 		.reserved = reserved,
 	};
 
-	if (tags->rqs)
+	if (tags->rqs) {
+		rcu_read_lock();
 		sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data);
+		rcu_read_unlock();
+	}
 }
 
 /**
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..bb84d1f099c7 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -4,6 +4,11 @@
 
 #include "blk-mq.h"
 
+struct rq_tag_entry {
+	struct request_queue *queue;
+	struct request *rq;
+};
+
 /*
  * Tag address space map.
  */
@@ -16,7 +21,7 @@ struct blk_mq_tags {
 	struct sbitmap_queue bitmap_tags;
 	struct sbitmap_queue breserved_tags;
 
-	struct request **rqs;
+	struct rq_tag_entry *rqs;
 	struct request **static_rqs;
 	struct list_head page_list;
 };
@@ -78,7 +83,8 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx,
 		unsigned int tag, struct request *rq)
 {
-	hctx->tags->rqs[tag] = rq;
+	hctx->tags->rqs[tag].queue = hctx->queue;
+	hctx->tags->rqs[tag].rq = rq;
 }
 
 static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2de972857496..c17103c00362 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -298,13 +298,16 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		rq->tag = -1;
 		rq->internal_tag = tag;
 	} else {
-		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
+		struct blk_mq_hw_ctx *hctx = data->hctx;
+
+		if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
 			rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
+			atomic_inc(&hctx->nr_active);
 		}
 		rq->tag = tag;
 		rq->internal_tag = -1;
-		data->hctx->tags->rqs[rq->tag] = rq;
+		hctx->tags->rqs[rq->tag].queue = hctx->queue;
+		hctx->tags->rqs[rq->tag].rq = rq;
 	}
 
 	/* csd/requeue_work/fifo_time is initialized before use */
@@ -797,8 +800,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 {
 	if (tag < tags->nr_tags) {
-		prefetch(tags->rqs[tag]);
-		return tags->rqs[tag];
+		prefetch(tags->rqs[tag].rq);
+		return tags->rqs[tag].rq;
 	}
 
 	return NULL;
@@ -809,10 +812,11 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq,
 			       void *priv, bool reserved)
 {
 	/*
-	 * If we find a request that is inflight and the queue matches,
-	 * we know the queue is busy. Return false to stop the iteration.
+	 * We're only called here if the queue matches. If the rq state is
+	 * inflight, we know the queue is busy. Return false to stop the
+	 * iteration.
 	 */
-	if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) {
+	if (rq->state == MQ_RQ_IN_FLIGHT) {
 		bool *busy = priv;
 
 		*busy = true;
@@ -1049,11 +1053,14 @@ bool blk_mq_get_driver_tag(struct request *rq)
 	shared = blk_mq_tag_busy(data.hctx);
 	rq->tag = blk_mq_get_tag(&data);
 	if (rq->tag >= 0) {
+		struct blk_mq_hw_ctx *hctx = data.hctx;
+
 		if (shared) {
 			rq->rq_flags |= RQF_MQ_INFLIGHT;
 			atomic_inc(&data.hctx->nr_active);
 		}
-		data.hctx->tags->rqs[rq->tag] = rq;
+		hctx->tags->rqs[rq->tag].queue = hctx->queue;
+		hctx->tags->rqs[rq->tag].rq = rq;
 	}
 
 done:
@@ -2020,10 +2027,34 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	return cookie;
 }
 
+struct rq_page_list {
+	struct rcu_work rcu_work;
+	struct list_head list;
+};
+
+static void blk_mq_rcu_free_pages(struct work_struct *work)
+{
+	struct rq_page_list *rpl = container_of(to_rcu_work(work),
+						struct rq_page_list, rcu_work);
+	struct page *page;
+
+	while (!list_empty(&rpl->list)) {
+		page = list_first_entry(&rpl->list, struct page, lru);
+		list_del_init(&page->lru);
+		/*
+		 * Remove kmemleak object previously allocated in
+		 * blk_mq_init_rq_map().
+		 */
+		kmemleak_free(page_address(page));
+		__free_pages(page, page->private);
+	}
+	kfree(rpl);
+}
+
 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx)
 {
-	struct page *page;
+	struct rq_page_list *rpl;
 
 	if (tags->rqs && set->ops->exit_request) {
 		int i;
@@ -2038,15 +2069,28 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		}
 	}
 
-	while (!list_empty(&tags->page_list)) {
-		page = list_first_entry(&tags->page_list, struct page, lru);
-		list_del_init(&page->lru);
+	if (list_empty(&tags->page_list))
+		return;
+
+	rpl = kmalloc(sizeof(*rpl), GFP_NOIO);
+	if (rpl) {
+		INIT_LIST_HEAD(&rpl->list);
+		list_splice_init(&tags->page_list, &rpl->list);
+
+		/* Punt to RCU free, so we don't race with tag iteration */
+		INIT_RCU_WORK(&rpl->rcu_work, blk_mq_rcu_free_pages);
+		queue_rcu_work(system_wq, &rpl->rcu_work);
+	} else {
+		struct rq_page_list stack_rpl;
+
 		/*
-		 * Remove kmemleak object previously allocated in
-		 * blk_mq_init_rq_map().
+		 * Fail alloc, punt to on-stack, we just have to synchronize
+		 * RCU first to ensure readers are done.
 		 */
-		kmemleak_free(page_address(page));
-		__free_pages(page, page->private);
+		INIT_LIST_HEAD(&stack_rpl.list);
+		list_splice_init(&tags->page_list, &stack_rpl.list);
+		synchronize_rcu();
+		blk_mq_rcu_free_pages(&stack_rpl.rcu_work.work);
 	}
 }
 
@@ -2077,7 +2121,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 	if (!tags)
 		return NULL;
 
-	tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *),
+	tags->rqs = kcalloc_node(nr_tags, sizeof(struct rq_tag_entry),
 				 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 				 node);
 	if (!tags->rqs) {
diff --git a/block/blk.h b/block/blk.h
index 848278c52030..785207fc8a30 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -29,6 +29,8 @@ struct blk_flush_queue {
 	 */
 	struct request		*orig_rq;
 	spinlock_t		mq_flush_lock;
+
+	struct rcu_work		rcu_work;
 };
 
 extern struct kmem_cache *blk_requestq_cachep;

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 22:47                                                   ` Jens Axboe
@ 2018-12-20 22:50                                                     ` Jens Axboe
  2019-02-14 23:36                                                       ` Bart Van Assche
  2019-02-15  2:57                                                       ` jianchao.wang
  0 siblings, 2 replies; 34+ messages in thread
From: Jens Axboe @ 2018-12-20 22:50 UTC (permalink / raw)
  To: Bart Van Assche, jianchao.wang, linux-block

On 12/20/18 3:47 PM, Jens Axboe wrote:
> On 12/20/18 3:33 PM, Jens Axboe wrote:
>> On 12/20/18 3:23 PM, Jens Axboe wrote:
>>> On 12/20/18 3:19 PM, Bart Van Assche wrote:
>>>> On Thu, 2018-12-20 at 14:48 -0700, Jens Axboe wrote:
>>>>> -void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>>>> -		     unsigned int hctx_idx)
>>>>> +static void blk_mq_rcu_free_pages(struct work_struct *work)
>>>>>  {
>>>>> +	struct blk_mq_tags *tags = container_of(to_rcu_work(work),
>>>>> +						struct blk_mq_tags, rcu_work);
>>>>>  	struct page *page;
>>>>>  
>>>>> +	while (!list_empty(&tags->page_list)) {
>>>>> +		page = list_first_entry(&tags->page_list, struct page, lru);
>>>>> +		list_del_init(&page->lru);
>>>>> +		/*
>>>>> +		 * Remove kmemleak object previously allocated in
>>>>> +		 * blk_mq_init_rq_map().
>>>>> +		 */
>>>>> +		kmemleak_free(page_address(page));
>>>>> +		__free_pages(page, page->private);
>>>>> +	}
>>>>> +}
>>>>> +
>>>>> +void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>>>> +		     unsigned int hctx_idx)
>>>>> +{
>>>>>  	if (tags->rqs && set->ops->exit_request) {
>>>>>  		int i;
>>>>>  
>>>>> @@ -2038,16 +2061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>>>>  		}
>>>>>  	}
>>>>>  
>>>>> -	while (!list_empty(&tags->page_list)) {
>>>>> -		page = list_first_entry(&tags->page_list, struct page, lru);
>>>>> -		list_del_init(&page->lru);
>>>>> -		/*
>>>>> -		 * Remove kmemleak object previously allocated in
>>>>> -		 * blk_mq_init_rq_map().
>>>>> -		 */
>>>>> -		kmemleak_free(page_address(page));
>>>>> -		__free_pages(page, page->private);
>>>>> -	}
>>>>> +	/* Punt to RCU free, so we don't race with tag iteration */
>>>>> +	INIT_RCU_WORK(&tags->rcu_work, blk_mq_rcu_free_pages);
>>>>> +	queue_rcu_work(system_wq, &tags->rcu_work);
>>>>>  }
>>>>
>>>> This can only work correctly if blk_mq_rcu_free_pages() is called before
>>>> INIT_RCU_WORK() is called a second time for the same blk_mq_tags structure
>>>> and if blk_mq_rcu_free_pages() is called before struct blk_mq_tags is freed.
>>>> What provides these guarantees? Did I perhaps miss something?
>>>
>>> We don't call it twice. Protecting against that is outside the scope
>>> of this function. But we call and clear from the regular shutdown path,
>>> and the rest is error handling on setup.
>>>
>>> But yes, we do need to ensure that 'tags' doesn't go away. Probably best
>>> to rework it to shove it somewhere else for freeing for that case, and
>>> leave the rest of the shutdown alone. I'll update the patch.
>>
>> Here's a version that doesn't rely on tags, we just dynamically allocate
>> the structure. For the very odd case where we fail, we just punt to
>> an on-stack structure and use the big synchronize_rcu() hammer first.
> 
> Forgot to init the lists... This one I actually booted, works for me.
> Outside of that, not tested, in terms of verifying the use-after-free is
> gone for tag iteration.

Oh, and we probably shouldn't free it unless we actually allocated it...


diff --git a/block/blk-flush.c b/block/blk-flush.c
index a3fc7191c694..827e3d2180d8 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -491,12 +491,22 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
 	return NULL;
 }
 
+static void blk_fq_rcu_free(struct work_struct *work)
+{
+	struct blk_flush_queue *fq = container_of(to_rcu_work(work),
+							struct blk_flush_queue,
+							rcu_work);
+
+	kfree(fq->flush_rq);
+	kfree(fq);
+}
+
 void blk_free_flush_queue(struct blk_flush_queue *fq)
 {
 	/* bio based request queue hasn't flush queue */
 	if (!fq)
 		return;
 
-	kfree(fq->flush_rq);
-	kfree(fq);
+	INIT_RCU_WORK(&fq->rcu_work, blk_fq_rcu_free);
+	queue_rcu_work(system_wq, &fq->rcu_work);
 }
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 2089c6c62f44..c39b58391ae8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
-	rq = tags->rqs[bitnr];
+	if (tags->rqs[bitnr].queue != hctx->queue)
+		return true;
 
 	/*
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
-	if (rq && rq->q == hctx->queue)
+	rq = tags->rqs[bitnr].rq;
+	if (rq)
 		return iter_data->fn(hctx, rq, iter_data->data, reserved);
 	return true;
 }
@@ -263,7 +265,9 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
 		.reserved = reserved,
 	};
 
+	rcu_read_lock();
 	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
+	rcu_read_unlock();
 }
 
 struct bt_tags_iter_data {
@@ -287,7 +291,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
-	rq = tags->rqs[bitnr];
+	rq = tags->rqs[bitnr].rq;
 	if (rq && blk_mq_request_started(rq))
 		return iter_data->fn(rq, iter_data->data, reserved);
 
@@ -317,8 +321,11 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
 		.reserved = reserved,
 	};
 
-	if (tags->rqs)
+	if (tags->rqs) {
+		rcu_read_lock();
 		sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data);
+		rcu_read_unlock();
+	}
 }
 
 /**
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..bb84d1f099c7 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -4,6 +4,11 @@
 
 #include "blk-mq.h"
 
+struct rq_tag_entry {
+	struct request_queue *queue;
+	struct request *rq;
+};
+
 /*
  * Tag address space map.
  */
@@ -16,7 +21,7 @@ struct blk_mq_tags {
 	struct sbitmap_queue bitmap_tags;
 	struct sbitmap_queue breserved_tags;
 
-	struct request **rqs;
+	struct rq_tag_entry *rqs;
 	struct request **static_rqs;
 	struct list_head page_list;
 };
@@ -78,7 +83,8 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx,
 		unsigned int tag, struct request *rq)
 {
-	hctx->tags->rqs[tag] = rq;
+	hctx->tags->rqs[tag].queue = hctx->queue;
+	hctx->tags->rqs[tag].rq = rq;
 }
 
 static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2de972857496..82341a78a0c0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -298,13 +298,16 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		rq->tag = -1;
 		rq->internal_tag = tag;
 	} else {
-		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
+		struct blk_mq_hw_ctx *hctx = data->hctx;
+
+		if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
 			rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
+			atomic_inc(&hctx->nr_active);
 		}
 		rq->tag = tag;
 		rq->internal_tag = -1;
-		data->hctx->tags->rqs[rq->tag] = rq;
+		hctx->tags->rqs[rq->tag].queue = hctx->queue;
+		hctx->tags->rqs[rq->tag].rq = rq;
 	}
 
 	/* csd/requeue_work/fifo_time is initialized before use */
@@ -797,8 +800,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 {
 	if (tag < tags->nr_tags) {
-		prefetch(tags->rqs[tag]);
-		return tags->rqs[tag];
+		prefetch(tags->rqs[tag].rq);
+		return tags->rqs[tag].rq;
 	}
 
 	return NULL;
@@ -809,10 +812,11 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq,
 			       void *priv, bool reserved)
 {
 	/*
-	 * If we find a request that is inflight and the queue matches,
-	 * we know the queue is busy. Return false to stop the iteration.
+	 * We're only called here if the queue matches. If the rq state is
+	 * inflight, we know the queue is busy. Return false to stop the
+	 * iteration.
 	 */
-	if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) {
+	if (rq->state == MQ_RQ_IN_FLIGHT) {
 		bool *busy = priv;
 
 		*busy = true;
@@ -1049,11 +1053,14 @@ bool blk_mq_get_driver_tag(struct request *rq)
 	shared = blk_mq_tag_busy(data.hctx);
 	rq->tag = blk_mq_get_tag(&data);
 	if (rq->tag >= 0) {
+		struct blk_mq_hw_ctx *hctx = data.hctx;
+
 		if (shared) {
 			rq->rq_flags |= RQF_MQ_INFLIGHT;
 			atomic_inc(&data.hctx->nr_active);
 		}
-		data.hctx->tags->rqs[rq->tag] = rq;
+		hctx->tags->rqs[rq->tag].queue = hctx->queue;
+		hctx->tags->rqs[rq->tag].rq = rq;
 	}
 
 done:
@@ -2020,10 +2027,36 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	return cookie;
 }
 
+struct rq_page_list {
+	struct rcu_work rcu_work;
+	struct list_head list;
+	bool on_stack;
+};
+
+static void blk_mq_rcu_free_pages(struct work_struct *work)
+{
+	struct rq_page_list *rpl = container_of(to_rcu_work(work),
+						struct rq_page_list, rcu_work);
+	struct page *page;
+
+	while (!list_empty(&rpl->list)) {
+		page = list_first_entry(&rpl->list, struct page, lru);
+		list_del_init(&page->lru);
+		/*
+		 * Remove kmemleak object previously allocated in
+		 * blk_mq_init_rq_map().
+		 */
+		kmemleak_free(page_address(page));
+		__free_pages(page, page->private);
+	}
+	if (!rpl->on_stack)
+		kfree(rpl);
+}
+
 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx)
 {
-	struct page *page;
+	struct rq_page_list *rpl;
 
 	if (tags->rqs && set->ops->exit_request) {
 		int i;
@@ -2038,15 +2071,30 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		}
 	}
 
-	while (!list_empty(&tags->page_list)) {
-		page = list_first_entry(&tags->page_list, struct page, lru);
-		list_del_init(&page->lru);
+	if (list_empty(&tags->page_list))
+		return;
+
+	rpl = kmalloc(sizeof(*rpl), GFP_NOIO);
+	if (rpl) {
+		INIT_LIST_HEAD(&rpl->list);
+		list_splice_init(&tags->page_list, &rpl->list);
+
+		/* Punt to RCU free, so we don't race with tag iteration */
+		INIT_RCU_WORK(&rpl->rcu_work, blk_mq_rcu_free_pages);
+		rpl->on_stack = false;
+		queue_rcu_work(system_wq, &rpl->rcu_work);
+	} else {
+		struct rq_page_list stack_rpl;
+
 		/*
-		 * Remove kmemleak object previously allocated in
-		 * blk_mq_init_rq_map().
+		 * Fail alloc, punt to on-stack, we just have to synchronize
+		 * RCU first to ensure readers are done.
 		 */
-		kmemleak_free(page_address(page));
-		__free_pages(page, page->private);
+		INIT_LIST_HEAD(&stack_rpl.list);
+		list_splice_init(&tags->page_list, &stack_rpl.list);
+		stack_rpl.on_stack = true;
+		synchronize_rcu();
+		blk_mq_rcu_free_pages(&stack_rpl.rcu_work.work);
 	}
 }
 
@@ -2077,7 +2125,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 	if (!tags)
 		return NULL;
 
-	tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *),
+	tags->rqs = kcalloc_node(nr_tags, sizeof(struct rq_tag_entry),
 				 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 				 node);
 	if (!tags->rqs) {
diff --git a/block/blk.h b/block/blk.h
index 848278c52030..785207fc8a30 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -29,6 +29,8 @@ struct blk_flush_queue {
 	 */
 	struct request		*orig_rq;
 	spinlock_t		mq_flush_lock;
+
+	struct rcu_work		rcu_work;
 };
 
 extern struct kmem_cache *blk_requestq_cachep;

-- 
Jens Axboe



* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 22:50                                                     ` Jens Axboe
@ 2019-02-14 23:36                                                       ` Bart Van Assche
  2019-02-15 18:29                                                         ` Evan Green
  2019-02-15  2:57                                                       ` jianchao.wang
  1 sibling, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2019-02-14 23:36 UTC (permalink / raw)
  To: Jens Axboe, jianchao.wang, linux-block; +Cc: Evan Green

On Thu, 2018-12-20 at 15:50 -0700, Jens Axboe wrote:
> +static void blk_fq_rcu_free(struct work_struct *work)
> +{
> +	struct blk_flush_queue *fq = container_of(to_rcu_work(work),
> +							struct blk_flush_queue,
> +							rcu_work);
> +
> +	kfree(fq->flush_rq);
> +	kfree(fq);
> +}
> +
>  void blk_free_flush_queue(struct blk_flush_queue *fq)
>  {
>  	/* bio based request queue hasn't flush queue */
>  	if (!fq)
>  		return;
>  
> -	kfree(fq->flush_rq);
> -	kfree(fq);
> +	INIT_RCU_WORK(&fq->rcu_work, blk_fq_rcu_free);
> +	queue_rcu_work(system_wq, &fq->rcu_work);
>  }

Can INIT_RCU_WORK() + queue_rcu_work() be changed into call_rcu()? The latter
uses a smaller data structure.

> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 2089c6c62f44..c39b58391ae8 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  
>  	if (!reserved)
>  		bitnr += tags->nr_reserved_tags;
> -	rq = tags->rqs[bitnr];
> +	if (tags->rqs[bitnr].queue != hctx->queue)
> +		return true;

Since blk_mq_tag_set_rq() is not serialized against this function, I doubt that
the tags->rqs[bitnr].queue != hctx->queue check helps. Can it be left out?

> +struct rq_tag_entry {
> +       struct request_queue *queue;
> +       struct request *rq;

If the new test can be left out from bt_iter(), can this new data structure be
left out too? In other words, keep the existing approach of only storing the
request pointer and not the queue pointer.

Thanks,

Bart.


* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2018-12-20 22:50                                                     ` Jens Axboe
  2019-02-14 23:36                                                       ` Bart Van Assche
@ 2019-02-15  2:57                                                       ` jianchao.wang
  1 sibling, 0 replies; 34+ messages in thread
From: jianchao.wang @ 2019-02-15  2:57 UTC (permalink / raw)
  To: Jens Axboe, Bart Van Assche, linux-block



On 12/21/18 6:50 AM, Jens Axboe wrote:
...
> @@ -228,13 +228,15 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  
>  	if (!reserved)
>  		bitnr += tags->nr_reserved_tags;
> -	rq = tags->rqs[bitnr];
> +	if (tags->rqs[bitnr].queue != hctx->queue)
> +		return true;
>  
>  	/*
>  	 * We can hit rq == NULL here, because the tagging functions
>  	 * test and set the bit before assigning ->rqs[].
>  	 */
> -	if (rq && rq->q == hctx->queue)
> +	rq = tags->rqs[bitnr].rq;
> +	if (rq)
>  		return iter_data->fn(hctx, rq, iter_data->data, reserved);
>  	return true;
>  }
> @@ -263,7 +265,9 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
>  		.reserved = reserved,
>  	};
>  
> +	rcu_read_lock();
>  	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
> +	rcu_read_unlock();
>  }
>  
>  struct bt_tags_iter_data {
...
> @@ -2038,15 +2071,30 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  		}
>  	}

Just checking for a request_queue match in bt_iter() seems not enough.
blk_mq_free_rqs() can also be invoked when switching the I/O scheduler,
and at that moment the request_queue still matches. Maybe we also need to
clear the associated tags->rqs[] entries.

With the associated tags->rqs[] entries cleared, and a synchronize_rcu()
after that, bt_iter() would only ever see a stable request in the idle
state, never a freed one.
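[Editorial note: the clearing step suggested here can be sketched in user space. The types and the clear_queue_entries() helper are hypothetical simplifications, not kernel code:]

```c
#include <assert.h>
#include <stddef.h>

#define NR_TAGS 4

/* Simplified stand-ins for the kernel structures. */
struct request_queue { int id; };
struct request { struct request_queue *q; };

struct rq_tag_entry {
	struct request_queue *queue;
	struct request *rq;
};

/* Before freeing a queue's requests, walk the shared tags->rqs[] array
 * and clear every entry still pointing at that queue, so a later
 * iteration cannot chase a stale request pointer. In the kernel, a
 * synchronize_rcu() after this loop would let in-flight bt_iter() calls
 * drain before the backing pages are freed. */
static void clear_queue_entries(struct rq_tag_entry *rqs, int nr_tags,
				struct request_queue *q)
{
	for (int i = 0; i < nr_tags; i++) {
		if (rqs[i].queue == q) {
			rqs[i].rq = NULL;
			rqs[i].queue = NULL;
		}
	}
}

/* Count how many entries still reference a given queue. */
static int entries_for_queue(struct rq_tag_entry *rqs, int nr_tags,
			     struct request_queue *q)
{
	int n = 0;

	for (int i = 0; i < nr_tags; i++)
		if (rqs[i].queue == q)
			n++;
	return n;
}
```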


Thanks
Jianchao

>  
> -	while (!list_empty(&tags->page_list)) {
> -		page = list_first_entry(&tags->page_list, struct page, lru);
> -		list_del_init(&page->lru);
> +	if (list_empty(&tags->page_list))
> +		return;
> +
> +	rpl = kmalloc(sizeof(*rpl), GFP_NOIO);
> +	if (rpl) {
> +		INIT_LIST_HEAD(&rpl->list);
> +		list_splice_init(&tags->page_list, &rpl->list);
> +
> +		/* Punt to RCU free, so we don't race with tag iteration */
> +		INIT_RCU_WORK(&rpl->rcu_work, blk_mq_rcu_free_pages);
> +		rpl->on_stack = false;
> +		queue_rcu_work(system_wq, &rpl->rcu_work);
> +	} else {



* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2019-02-14 23:36                                                       ` Bart Van Assche
@ 2019-02-15 18:29                                                         ` Evan Green
  2019-02-19 16:48                                                           ` Bart Van Assche
  0 siblings, 1 reply; 34+ messages in thread
From: Evan Green @ 2019-02-15 18:29 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Jens Axboe, jianchao.wang, linux-block

Hi Jens,
I got all turned around while trying to understand this fix, and I'll
admit it's probably just me. It looks like you're trying to use an rcu
lock to prevent the iter functions from racing with free. Is that
true? But then the race at least as I understand it wasn't that there
was a free in-flight, it's that the free had happened a long time ago,
and there was still a stale value in tags->rqs[bitnr]. So maybe
there's some other issue that part is solving?

And then it looks like you added a new struct where tags->rqs was so
that you could compare hctx->queue without reaching through rq. I have
no idea if that's sufficient to prevent stale accesses through rq.
Since you're changing multiple values I think where you populate that
structure you'd at least need to do something like: clear rq, barrier,
set hctx, barrier, set rq. But like Bart said, that's probably not the
right way to go.

Finally, I didn't really get why the conditional in
blk_mq_rq_inflight() changed. Is that guarding against some other
identified problem, or just an additional check you can do when you
convert rqs into a struct?

It looks like blk_mq_free_rqs() might be the magic function I was
looking for earlier. Would it be possible to just clear tags[rq->tag]
for each static_rq? Or is it possible for rqs from one set to end up
in the tags array of another set? (Which would make what I just
suggested insufficient).
-Evan


* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2019-02-15 18:29                                                         ` Evan Green
@ 2019-02-19 16:48                                                           ` Bart Van Assche
  2019-02-21 20:54                                                             ` Evan Green
  0 siblings, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2019-02-19 16:48 UTC (permalink / raw)
  To: Evan Green; +Cc: Jens Axboe, jianchao.wang, linux-block

On Fri, 2019-02-15 at 10:29 -0800, Evan Green wrote:
> I got all turned around while trying to understand this fix, and I'll
> admit it's probably just me. It looks like you're trying to use an rcu
> lock to prevent the iter functions from racing with free. Is that
> true? But then the race at least as I understand it wasn't that there
> was a free in-flight, it's that the free had happened a long time ago,
> and there was still a stale value in tags->rqs[bitnr]. So maybe
> there's some other issue that part is solving?
> 
> And then it looks like you added a new struct where tags->rqs was so
> that you could compare hctx->queue without reaching through rq. I have
> no idea if that's sufficient to prevent stale accesses through rq.
> Since you're changing multiple values I think where you populate that
> structure you'd at least need to do something like: clear rq, barrier,
> set hctx, barrier, set rq. But like Bart said, that's probably not the
> right way to go.
> 
> Finally, I didn't really get why the conditional in
> blk_mq_rq_inflight() changed. Is that guarding against some other
> identified problem, or just an additional check you can do when you
> convert rqs into a struct?
> 
> It looks like blk_mq_free_rqs() might be the magic function I was
> looking for earlier. Would it be possible to just clear tags[rq->tag]
> for each static_rq? Or is it possible for rqs from one set to end up
> in the tags array of another set? (Which would make what I just
> suggested insufficient).

Hi Evan,

Multiple request queues can share a single tag set. Examples of use cases
are NVMe namespaces associated with the same NVMe controller and SCSI LUNs
associated with the same SCSI host. The code that allocates a request not
only marks a tag as allocated but also updates the rqs[] array in the
tag set (see also blk_mq_get_request()). The code that iterates over a
tag set examines both the tag bitmap and the rqs[] array. The code that
allocates requests and the code that iterates over requests are not
serialized against each other. That is why the association of a tag with a
request queue can change while iterating over a tag set.

Another race condition is that a request can be allocated or freed while
iterating over a tag set.

Fixing these race conditions would require a locking mechanism and I think
the performance overhead of such a locking mechanism is larger than
acceptable. So code that iterates over requests either must be able to
handle such race conditions or must suspend request processing before it
iterates over requests.

This is why I'm in favor of only modifying blk_mq_free_rqs() such that the
memory in which the tags are stored is delayed until a grace period has
occurred.

Bart.
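[Editorial note: the ordering Bart describes — the tag bit is set before ->rqs[] is assigned — can be modeled in a single-threaded user-space sketch. An iterator can therefore observe a set bit whose rqs[] slot is still NULL (or belongs to another queue) and must tolerate both. All names here are simplified stand-ins, not the kernel's:]

```c
#include <assert.h>
#include <stddef.h>

#define NR_TAGS 4

/* Simplified stand-ins for the kernel structures. */
struct request_queue { int id; };
struct request { struct request_queue *q; };

static unsigned long bitmap;
static struct request *rqs[NR_TAGS];

/* Step 1 of allocation: claim a free tag bit (a test-and-set operation
 * in the kernel). Returns the tag, or -1 if none are free. */
static int claim_tag(void)
{
	for (int tag = 0; tag < NR_TAGS; tag++) {
		if (!(bitmap & (1UL << tag))) {
			bitmap |= 1UL << tag;
			return tag;
		}
	}
	return -1;
}

/* Step 2, possibly much later: publish the request pointer. */
static void publish_rq(int tag, struct request *rq)
{
	rqs[tag] = rq;
}

/* Models bt_iter()-style iteration: skip set bits whose slot is still
 * NULL or points at a different queue. */
static int count_queue_rqs(struct request_queue *q)
{
	int n = 0;

	for (int tag = 0; tag < NR_TAGS; tag++) {
		struct request *rq;

		if (!(bitmap & (1UL << tag)))
			continue;
		rq = rqs[tag];
		if (rq && rq->q == q)
			n++;
	}
	return n;
}
```

Because allocation and iteration are not serialized, the window between claim_tag() and publish_rq() is exactly where an unlocked iterator can see a set bit with a NULL or stale slot.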


* Re: v4.20-rc6: Sporadic use-after-free in bt_iter()
  2019-02-19 16:48                                                           ` Bart Van Assche
@ 2019-02-21 20:54                                                             ` Evan Green
  0 siblings, 0 replies; 34+ messages in thread
From: Evan Green @ 2019-02-21 20:54 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Jens Axboe, jianchao.wang, linux-block

On Tue, Feb 19, 2019 at 8:48 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On Fri, 2019-02-15 at 10:29 -0800, Evan Green wrote:
> > I got all turned around while trying to understand this fix, and I'll
> > admit it's probably just me. It looks like you're trying to use an rcu
> > lock to prevent the iter functions from racing with free. Is that
> > true? But then the race at least as I understand it wasn't that there
> > was a free in-flight, it's that the free had happened a long time ago,
> > and there was still a stale value in tags->rqs[bitnr]. So maybe
> > there's some other issue that part is solving?
> >
> > And then it looks like you added a new struct where tags->rqs was so
> > that you could compare hctx->queue without reaching through rq. I have
> > no idea if that's sufficient to prevent stale accesses through rq.
> > Since you're changing multiple values I think where you populate that
> > structure you'd at least need to do something like: clear rq, barrier,
> > set hctx, barrier, set rq. But like Bart said, that's probably not the
> > right way to go.
> >
> > Finally, I didn't really get why the conditional in
> > blk_mq_rq_inflight() changed. Is that guarding against some other
> > identified problem, or just an additional check you can do when you
> > convert rqs into a struct?
> >
> > It looks like blk_mq_free_rqs() might be the magic function I was
> > looking for earlier. Would it be possible to just clear tags[rq->tag]
> > for each static_rq? Or is it possible for rqs from one set to end up
> > in the tags array of another set? (Which would make what I just
> > suggested insufficient).
>
> Hi Evan,
>
> Multiple request queues can share a single tag set. Examples of use cases
> are NVMe namespaces associated with the same NVMe controller and SCSI LUNs
> associated with the same SCSI host. The code that allocates a request not
> only marks a tag as allocated but also updates the rqs[] array in the
> tag set (see also blk_mq_get_request()). The code that iterates over a
> tag set examines both the tag bitmap and the rqs[] array. The code that
> allocates requests and the code that iterates over requests are not
> serialized against each other. That is why the association of a tag with a
> request queue can change while iterating over a tag set.
>
> Another race condition is that a request can be allocated or freed while
> iterating over a tag set.

Oh, true. The iter code was trying (unsuccessfully) to guard against
allocations with its if (rq && rq->q ...), but I hadn't thought about
in-progress frees.

>
> Fixing these race conditions would require a locking mechanism and I think
> the performance overhead of such a locking mechanism is larger than
> acceptable. So code that iterates over requests either must be able to
> handle such race conditions or must suspend request processing before it
> iterates over requests.
>
> This is why I'm in favor of only modifying blk_mq_free_rqs() such that
> freeing the memory in which the tags are stored is delayed until a grace
> period has elapsed.

This makes sense to me for the purpose of not racing with an
in-progress free. I just can't fully convince myself that it solves the
stale tags-array entry problem: unless we clear the entry in the rqs[]
array, it seems like stale entries could still be left behind. Maybe
I'll get it in the next change.

Jens, keep me in the loop if you send another patch please.
-Evan


end of thread, other threads:[~2019-02-21 20:55 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-19 23:24 v4.20-rc6: Sporadic use-after-free in bt_iter() Bart Van Assche
2018-12-19 23:27 ` Jens Axboe
2018-12-20  0:16   ` Bart Van Assche
2018-12-20  3:17     ` Jens Axboe
2018-12-20  3:24       ` jianchao.wang
2018-12-20  4:19         ` Jens Axboe
2018-12-20  4:32           ` jianchao.wang
2018-12-20  4:48             ` Jens Axboe
2018-12-20  5:03               ` jianchao.wang
2018-12-20 13:02                 ` Jens Axboe
2018-12-20 13:07                   ` Jens Axboe
2018-12-20 18:01                     ` Bart Van Assche
2018-12-20 18:21                       ` Jens Axboe
2018-12-20 18:33                         ` Jens Axboe
2018-12-20 20:56                           ` Bart Van Assche
2018-12-20 21:00                             ` Jens Axboe
2018-12-20 21:23                               ` Bart Van Assche
2018-12-20 21:26                                 ` Jens Axboe
2018-12-20 21:31                                   ` Bart Van Assche
2018-12-20 21:34                                     ` Jens Axboe
2018-12-20 21:40                                       ` Bart Van Assche
2018-12-20 21:44                                         ` Jens Axboe
2018-12-20 21:48                                           ` Jens Axboe
2018-12-20 22:19                                             ` Bart Van Assche
2018-12-20 22:23                                               ` Jens Axboe
2018-12-20 22:33                                                 ` Jens Axboe
2018-12-20 22:47                                                   ` Jens Axboe
2018-12-20 22:50                                                     ` Jens Axboe
2019-02-14 23:36                                                       ` Bart Van Assche
2019-02-15 18:29                                                         ` Evan Green
2019-02-19 16:48                                                           ` Bart Van Assche
2019-02-21 20:54                                                             ` Evan Green
2019-02-15  2:57                                                       ` jianchao.wang
2018-12-20  4:06 ` Ming Lei
