Re: [PATCH] blk-mq: fix use-after-free in blk_mq_exit_sched

From: John Garry <john.garry@huawei.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@lst.de>,
	<linux-block@vger.kernel.org>,
	<syzbot+77ba3d171a25c56756ea@syzkaller.appspotmail.com>
Subject: Re: [PATCH] blk-mq: fix use-after-free in blk_mq_exit_sched
Date: Wed, 9 Jun 2021 12:42:52 +0100	[thread overview]
Message-ID: <81c38feb-9d3e-7adc-57e6-54bccf0d3142@huawei.com> (raw)
In-Reply-To: <YMCU+iuHs4ULN0lb@T590>

On 09/06/2021 11:16, Ming Lei wrote:
> On Wed, Jun 09, 2021 at 09:59:43AM +0100, John Garry wrote:
>> On 09/06/2021 07:30, Ming Lei wrote:
>>
>> Thanks for the fix
>>
>>> tagset can't be used after blk_cleanup_queue() is returned because
>>> freeing tagset usually follows blk_clenup_queue(). Commit d97e594c5166
>>> ("blk-mq: Use request queue-wide tags for tagset-wide sbitmap") adds
>>> check on q->tag_set->flags in blk_mq_exit_sched(), and causes
>>> use-after-free.
>>>
>>> Fixes it by using hctx->flags.
>>>
>>
>> The tagset is a member of the Scsi_Host structure. So it is true that this
>> memory may be freed before the request_queue is exited?
> 
> Yeah, please see commit c3e2219216c9 ("block: free sched's request pool in
> blk_cleanup_queue")

JFYI, I could recreate with the following simple steps:

root@(none)$ mount /dev/sda1 mnt
[   27.252887] FAT-fs (sda1): Volume was not properly unmounted. Some 
data may be corrupt. Please run fsck.
_hw/unbind)$ echo HISI0162:01 > ./sys/bus/platform/drivers/hisi_sas_v2
[   31.262274] sas: ex 500e004aaaaaaa1f phys DID NOT change
[   31.270314] sas: ex 500e004aaaaaaa1f phys DID NOT change
[   31.278262] sas: ex 500e004aaaaaaa1f phys DID NOT change
[   31.286245] sas: ex 500e004aaaaaaa1f phys DID NOT change
[   31.294164] sas: ex 500e004aaaaaaa1f phys DID NOT change
[   31.302143] sas: ex 500e004aaaaaaa1f phys DID NOT change
[   31.310097] sas: ex 500e004aaaaaaa1f phys DID NOT change
[   31.321599] hisi_sas_v2_hw HISI0162:01: dev[9:1] is gone
[   31.429245] hisi_sas_v2_hw HISI0162:01: dev[8:1] is gone
[   31.533461] hisi_sas_v2_hw HISI0162:01: dev[7:1] is gone
[   31.637338] hisi_sas_v2_hw HISI0162:01: dev[6:1] is gone
[   31.740840] hisi_sas_v2_hw HISI0162:01: dev[5:1] is gone
[   31.750659] sd 0:0:3:0: [sdd] Synchronizing SCSI cache
[   31.833500] hisi_sas_v2_hw HISI0162:01: dev[4:1] is gone
[   31.937351] hisi_sas_v2_hw HISI0162:01: dev[3:1] is gone
[   31.947749] sd 0:0:1:0: [sdb] Synchronizing SCSI cache
[   31.953195] sd 0:0:1:0: [sdb] Stopping disk

[   32.690815] hisi_sas_v2_hw HISI0162:01: dev[2:5] is gone
[   32.771526] hisi_sas_v2_hw HISI0162:01: dev[1:1] is gone
[   32.790406] hisi_sas_v2_hw HISI0162:01: dev[0:2] is gone

root@(none)$
root@(none)$
root@(none)$ umount mnt
[   37.323039] 
==================================================================
[   37.330262] BUG: KASAN: use-after-free in blk_mq_exit_sched+0x110/0x1c8
[   37.336880] Read of size 4 at addr ffff001051e80100 by task umount/547
[   37.343401]
[   37.344884] CPU: 4 PID: 547 Comm: umount Not tainted 
5.13.0-rc5-next-20210608 #80
[   37.352362] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon 
D05 IT21 Nemo 2.0 RC0 04/18/2018
[   37.361486] Call trace:
[   37.363924]  dump_backtrace+0x0/0x2d0
[   37.367586]  show_stack+0x18/0x28
[   37.370898]  dump_stack_lvl+0xfc/0x138
[   37.374643]  print_address_description.constprop.13+0x78/0x314
[   37.380472]  kasan_report+0x1e0/0x248
[   37.384131]  __asan_load4+0x9c/0xd8
[   37.387615]  blk_mq_exit_sched+0x110/0x1c8
[   37.391706]  __elevator_exit+0x34/0x58
[   37.395451]  blk_release_queue+0x108/0x1d8
[   37.399545]  kobject_put+0xa8/0x180
[   37.403029]  blk_put_queue+0x14/0x20
[   37.406601]  disk_release+0xcc/0x100
[   37.410171]  device_release+0x94/0x110
[   37.413918]  kobject_put+0xa8/0x180
[   37.417401]  put_device+0x14/0x28
[   37.420712]  put_disk+0x2c/0x40
[   37.423848]  blkdev_put_no_open+0x54/0x78
[   37.427853]  blkdev_put+0x108/0x258
[   37.431335]  kill_block_super+0x5c/0x78
[   37.435166]  deactivate_locked_super+0x6c/0xd0
[   37.439605]  deactivate_super+0x8c/0xa8
[   37.443435]  cleanup_mnt+0x110/0x1c0
[   37.447007]  __cleanup_mnt+0x14/0x20
[   37.450578]  task_work_run+0xbc/0x1a8
[   37.454236]  do_notify_resume+0x2cc/0x590
[   37.458242]  work_pending+0xc/0x3c8
[   37.461725]
[   37.463207] The buggy address belongs to the page:
[   37.467990] page:(____ptrval____) refcount:0 mapcount:-128 
mapping:0000000000000000 index:0x0 pfn:0x1051e80
[   37.477724] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
[   37.484164] raw: 0bfffc0000000000 fffffc00415a9008 ffff0017fbffebb0 
0000000000000000
[   37.491900] raw: 0000000000000000 0000000000000006 00000000ffffff7f 
0000000000000000
[   37.499635] page dumped because: kasan: bad access detected
[   37.505198]
[   37.506680] Memory state around the buggy address:
[   37.511463]  ffff001051e80000: ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff
[   37.518677]  ffff001051e80080: ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff
[   37.525891] >ffff001051e80100: ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff
[   37.533104]^
[   37.536324]  ffff001051e80180: ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff
[   37.543538]  ffff001051e80200: ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff
[   37.550751] 
==================================================================
[   37.557963] Disabling lock debugging due to kernel taint
root@(none)$
root@(none)$

And this patch fixes it:
Tested-by: John Garry <john.garry@huawei.com>

> 
>>
>>> Reported-by: syzbot+77ba3d171a25c56756ea@syzkaller.appspotmail.com
>>> Fixes: d97e594c5166 ("blk-mq: Use request queue-wide tags for tagset-wide sbitmap")
>>> Cc: John Garry <john.garry@huawei.com>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>    block/blk-mq-sched.c | 4 +++-
>>>    1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>>> index a9182d2f8ad3..80273245d11a 100644
>>> --- a/block/blk-mq-sched.c
>>> +++ b/block/blk-mq-sched.c
>>> @@ -680,6 +680,7 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
>>>    {
>>>    	struct blk_mq_hw_ctx *hctx;
>>>    	unsigned int i;
>>> +	unsigned int flags = 0;
>>>    	queue_for_each_hw_ctx(q, hctx, i) {
>>>    		blk_mq_debugfs_unregister_sched_hctx(hctx);
>>> @@ -687,12 +688,13 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
>>>    			e->type->ops.exit_hctx(hctx, i);
>>>    			hctx->sched_data = NULL;
>>>    		}
>>> +		flags = hctx->flags;
>>
>> I know the choice is limited, but it is unfortunate that we must set flags
>> in a loop
> 
> Does it matter?

It's just a nit on the coding style: it's not an especially good 
practice to set the same value in a loop.

But, as I said, choice is limited.

> 
>>
>>>    	}
>>>    	blk_mq_debugfs_unregister_sched(q);
>>>    	if (e->type->ops.exit_sched)
>>>    		e->type->ops.exit_sched(e);
>>>    	blk_mq_sched_tags_teardown(q);
>>> -	if (blk_mq_is_sbitmap_shared(q->tag_set->flags))
>>> +	if (blk_mq_is_sbitmap_shared(flags))
>>>    		blk_mq_exit_sched_shared_sbitmap(q);
>>
>> this is
>>
>> blk_mq_exit_sched_shared_sbitmap(struct request_queue *queue)
>> {
>> 	sbitmap_queue_free(&queue->sched_bitmap_tags);
>> 	..
>> }
>>
>> And isn't it safe to call sbitmap_queue_free() when
>> sbitmap_queue_init_node() has not been called?
>>
>> I'm just wondering if we can always call blk_mq_exit_sched_shared_sbitmap()?
>> I know it's not an ideal choice either.
> 
> So far it may work, not sure if it can in future, I suggest to follow
> the traditional alloc & free pattern.
> 
> 

Fine

Thanks,
John