Re: [PATCH] block: free sched's request pool in blk_cleanup_queue

From: Yi Zhang <yi.zhang@redhat.com>
To: Ming Lei <ming.lei@redhat.com>, Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org, Bart Van Assche <bvanassche@acm.org>,
	Christoph Hellwig <hch@lst.de>,
	kernel test robot <rong.a.chen@intel.com>
Subject: Re: [PATCH] block: free sched's request pool in blk_cleanup_queue
Date: Wed, 5 Jun 2019 09:27:27 +0800	[thread overview]
Message-ID: <ffae2084-3a2a-5e46-5809-ff9d8bc3741e@redhat.com> (raw)
In-Reply-To: <20190604130802.17076-1-ming.lei@redhat.com>

Tested-by: Yi Zhang <yi.zhang@redhat.com>

This patch fixed bellow issue which triggered by blktests block/006, thanks.

Kernel 5.2.0-rc3 on an x86_64

[  567.066143] run blktests block/006 at 2019-06-05 03:18:04
[  567.089457] null: module loaded
[  593.937235] BUG: unable to handle page fault for address: 
ffffffffc0738110
[  593.938442] #PF: supervisor read access in kernel mode
[  593.939230] #PF: error_code(0x0000) - not-present page
[  593.940011] PGD 11760f067 P4D 11760f067 PUD 117611067 PMD 12b61f067 PTE 0
[  593.941048] Oops: 0000 [#1] SMP PTI
[  593.941607] CPU: 0 PID: 1106 Comm: kworker/0:0 Not tainted 5.2.0-rc3 #1
[  593.942607] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  593.943511] Workqueue: events __blk_release_queue
[  593.944240] RIP: 0010:blk_mq_free_rqs+0x21/0xd0
[  593.944930] Code: c6 e8 83 f0 ff ff eb 9f 90 0f 1f 44 00 00 41 56 41 
55 41 54 55 48 89 f5 53 48 83 be 90 00 00 00 00 74 56 48 8b 47 38 49 89 
fd <48> 83 78 50 00 74 48 8b 06 85 c0 74 42 41 89 d6 31 db 48 8b 85 98
[  593.947773] RSP: 0018:ffffa21300e7fde8 EFLAGS: 00010286
[  593.948562] RAX: ffffffffc07380c0 RBX: 0000000000000000 RCX: 
0000000000002ab8
[  593.949737] RDX: 0000000000000000 RSI: ffff9296971089c0 RDI: 
ffff9296975fb438
[  593.950811] RBP: ffff9296971089c0 R08: 000000000002f0e0 R09: 
ffffffffb6427890
[  593.951878] R10: ffffd41084a3fd80 R11: 0000000000000001 R12: 
ffff9296a91e58b0
[  593.952947] R13: ffff9296975fb438 R14: ffff9296ab5b2600 R15: 
ffff9296a91e6080
[  593.954011] FS:  0000000000000000(0000) GS:ffff9296bba00000(0000) 
knlGS:0000000000000000
[  593.955252] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  593.956147] CR2: ffffffffc0738110 CR3: 000000011760a003 CR4: 
00000000003606f0
[  593.957249] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[  593.958354] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[  593.959452] Call Trace:
[  593.959877]  blk_mq_sched_tags_teardown+0x40/0x70
[  593.960628]  blk_mq_exit_sched+0x88/0xa0
[  593.961246]  elevator_exit+0x30/0x50
[  593.961817]  __blk_release_queue+0x5c/0x100
[  593.962497]  process_one_work+0x1a1/0x3a0
[  593.963138]  worker_thread+0x30/0x380
[  593.963714]  ? pwq_unbound_release_workfn+0xd0/0xd0
[  593.964479]  kthread+0x112/0x130
[  593.964992]  ? __kthread_parkme+0x70/0x70
[  593.965641]  ret_from_fork+0x35/0x40
[  593.966215] Modules linked in: sunrpc snd_hda_codec_generic 
ledtrig_audio snd_hda_intel snd_hda_codec nfit snd_hda_core libnvdimm 
snd_hwdep snd_seq crct10dif_pclmul snd_seq_device crc32_pclmul snd_pcm 
ghash_clmulni_intel snd_timer snd pcspkr joydev virtio_balloon i2c_piix4 
soundcore ip_tables xfs libcrc32c qxl drm_kms_helper ata_generic 
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata 
virtio_net net_failover crc32c_intel serio_raw virtio_blk virtio_console 
failover dm_mirror dm_region_hash dm_log dm_mod [last unloaded: null_blk]
[  593.973735] CR2: ffffffffc0738110
[  593.974270] ---[ end trace 782d8ae6cfec57e7 ]---
[  593.974983] RIP: 0010:blk_mq_free_rqs+0x21/0xd0
[  593.975691] Code: c6 e8 83 f0 ff ff eb 9f 90 0f 1f 44 00 00 41 56 41 
55 41 54 55 48 89 f5 53 48 83 be 90 00 00 00 00 74 56 48 8b 47 38 49 89 
fd <48> 83 78 50 00 74 48 8b 06 85 c0 74 42 41 89 d6 31 db 48 8b 85 98
[  593.978548] RSP: 0018:ffffa21300e7fde8 EFLAGS: 00010286
[  593.979466] RAX: ffffffffc07380c0 RBX: 0000000000000000 RCX: 
0000000000002ab8
[  593.980578] RDX: 0000000000000000 RSI: ffff9296971089c0 RDI: 
ffff9296975fb438
[  593.981677] RBP: ffff9296971089c0 R08: 000000000002f0e0 R09: 
ffffffffb6427890
[  593.982780] R10: ffffd41084a3fd80 R11: 0000000000000001 R12: 
ffff9296a91e58b0
[  593.983883] R13: ffff9296975fb438 R14: ffff9296ab5b2600 R15: 
ffff9296a91e6080
[  593.984982] FS:  0000000000000000(0000) GS:ffff9296bba00000(0000) 
knlGS:0000000000000000
[  593.986239] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  593.987138] CR2: ffffffffc0738110 CR3: 000000011760a003 CR4: 
00000000003606f0
[  593.988241] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[  593.989344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[  593.990447] Kernel panic - not syncing: Fatal exception
[  593.992292] Kernel Offset: 0x35000000 from 0xffffffff81000000 
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  593.994147] ---[ end Kernel panic - not syncing: Fatal exception ]---

On 6/4/19 9:08 PM, Ming Lei wrote:
> In theory, IO scheduler belongs to request queue, and the request pool
> of sched tags belongs to the request queue too.
>
> However, the current tags allocation interfaces are re-used for both
> driver tags and sched tags, and driver tags is definitely host wide,
> and doesn't belong to any request queue, same with its request pool.
> So we need tagset instance for freeing request of sched tags.
>
> Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue() in case
> of non-BLK_MQ_F_TAG_SHARED, this way requires that request pool of sched
> tags to be freed before calling blk_mq_free_tag_set().
>
> Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
> moves blk_exit_queue into __blk_release_queue for simplying the fast
> path in generic_make_request(), then causes oops during freeing requests
> of sched tags in __blk_release_queue().
>
> Fix the above issue by move freeing request pool of sched tags into
> blk_cleanup_queue(), this way is safe becasue queue has been frozen and no any
> in-queue requests at that time. Freeing sched tags has to be kept in queue's
> release handler becasue there might be un-completed dispatch activity
> which might refer to sched tags.
>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
> Reported-by: kernel test robot <rong.a.chen@intel.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c     | 13 +++++++++++++
>   block/blk-mq-sched.c | 30 +++++++++++++++++++++++++++---
>   block/blk-mq-sched.h |  1 +
>   block/blk-sysfs.c    |  2 +-
>   block/blk.h          | 10 +++++++++-
>   block/elevator.c     |  2 +-
>   6 files changed, 52 insertions(+), 6 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index ee1b35fe8572..8340f69670d8 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -320,6 +320,19 @@ void blk_cleanup_queue(struct request_queue *q)
>   	if (queue_is_mq(q))
>   		blk_mq_exit_queue(q);
>   
> +	/*
> +	 * In theory, request pool of sched_tags belongs to request queue.
> +	 * However, the current implementation requires tag_set for freeing
> +	 * requests, so free the pool now.
> +	 *
> +	 * Queue has become frozen, there can't be any in-queue requests, so
> +	 * it is safe to free requests now.
> +	 */
> +	mutex_lock(&q->sysfs_lock);
> +	if (q->elevator)
> +		blk_mq_sched_free_requests(q);
> +	mutex_unlock(&q->sysfs_lock);
> +
>   	percpu_ref_exit(&q->q_usage_counter);
>   
>   	/* @q is and will stay empty, shutdown and put */
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 74c6bb871f7e..500cb04901cc 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -475,14 +475,18 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
>   	return ret;
>   }
>   
> +/* called in queue's release handler, tagset has gone away */
>   static void blk_mq_sched_tags_teardown(struct request_queue *q)
>   {
> -	struct blk_mq_tag_set *set = q->tag_set;
>   	struct blk_mq_hw_ctx *hctx;
>   	int i;
>   
> -	queue_for_each_hw_ctx(q, hctx, i)
> -		blk_mq_sched_free_tags(set, hctx, i);
> +	queue_for_each_hw_ctx(q, hctx, i) {
> +		if (hctx->sched_tags) {
> +			blk_mq_free_rq_map(hctx->sched_tags);
> +			hctx->sched_tags = NULL;
> +		}
> +	}
>   }
>   
>   int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
> @@ -523,6 +527,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
>   			ret = e->ops.init_hctx(hctx, i);
>   			if (ret) {
>   				eq = q->elevator;
> +				blk_mq_sched_free_requests(q);
>   				blk_mq_exit_sched(q, eq);
>   				kobject_put(&eq->kobj);
>   				return ret;
> @@ -534,11 +539,30 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
>   	return 0;
>   
>   err:
> +	blk_mq_sched_free_requests(q);
>   	blk_mq_sched_tags_teardown(q);
>   	q->elevator = NULL;
>   	return ret;
>   }
>   
> +/*
> + * called in either blk_queue_cleanup or elevator_switch, tagset
> + * is required for freeing requests
> + */
> +void blk_mq_sched_free_requests(struct request_queue *q)
> +{
> +	struct blk_mq_hw_ctx *hctx;
> +	int i;
> +
> +	lockdep_assert_held(&q->sysfs_lock);
> +	WARN_ON(!q->elevator);
> +
> +	queue_for_each_hw_ctx(q, hctx, i) {
> +		if (hctx->sched_tags)
> +			blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i);
> +	}
> +}
> +
>   void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
>   {
>   	struct blk_mq_hw_ctx *hctx;
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index c7bdb52367ac..3cf92cbbd8ac 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -28,6 +28,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
>   
>   int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
>   void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
> +void blk_mq_sched_free_requests(struct request_queue *q);
>   
>   static inline bool
>   blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 75b5281cc577..977c659dcd18 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -850,7 +850,7 @@ static void blk_exit_queue(struct request_queue *q)
>   	 */
>   	if (q->elevator) {
>   		ioc_clear_queue(q);
> -		elevator_exit(q, q->elevator);
> +		__elevator_exit(q, q->elevator);
>   		q->elevator = NULL;
>   	}
>   
> diff --git a/block/blk.h b/block/blk.h
> index 91b3581b7c7a..7814aa207153 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -6,6 +6,7 @@
>   #include <linux/blk-mq.h>
>   #include <xen/xen.h>
>   #include "blk-mq.h"
> +#include "blk-mq-sched.h"
>   
>   /* Max future timer expiry for timeouts */
>   #define BLK_MAX_TIMEOUT		(5 * HZ)
> @@ -176,10 +177,17 @@ void blk_insert_flush(struct request *rq);
>   int elevator_init_mq(struct request_queue *q);
>   int elevator_switch_mq(struct request_queue *q,
>   			      struct elevator_type *new_e);
> -void elevator_exit(struct request_queue *, struct elevator_queue *);
> +void __elevator_exit(struct request_queue *, struct elevator_queue *);
>   int elv_register_queue(struct request_queue *q);
>   void elv_unregister_queue(struct request_queue *q);
>   
> +static inline void elevator_exit(struct request_queue *q,
> +		struct elevator_queue *e)
> +{
> +	blk_mq_sched_free_requests(q);
> +	__elevator_exit(q, e);
> +}
> +
>   struct hd_struct *__disk_get_part(struct gendisk *disk, int partno);
>   
>   #ifdef CONFIG_FAIL_IO_TIMEOUT
> diff --git a/block/elevator.c b/block/elevator.c
> index ec55d5fc0b3e..2f17d66d0e61 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -178,7 +178,7 @@ static void elevator_release(struct kobject *kobj)
>   	kfree(e);
>   }
>   
> -void elevator_exit(struct request_queue *q, struct elevator_queue *e)
> +void __elevator_exit(struct request_queue *q, struct elevator_queue *e)
>   {
>   	mutex_lock(&e->sysfs_lock);
>   	if (e->type->ops.exit_sched)