From: Tianxianting <tian.xianting@h3c.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: "linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH] [v2] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()
Date: Sat, 26 Sep 2020 02:52:50 +0000	[thread overview]
Message-ID: <ffd885d363074710b1ada87701619c0d@h3c.com> (raw)
In-Reply-To: <c55b365d-c6b0-6207-f326-6d58dd113d18@kernel.dk>

Hi Jens,
Thanks a lot for the comments.
I think it is not a hot path; it is only called at system startup or on device hot-plug.
So I submitted a v3 patch for your review :)
https://lkml.org/lkml/2020/9/25/1543

-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Saturday, September 26, 2020 3:26 AM
To: tianxianting (RD) <tian.xianting@h3c.com>
Cc: linux-block@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] [v2] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()

On 9/17/20 2:13 AM, Xianting Tian wrote:
> We found that blk_mq_alloc_rq_maps() takes a long time in kernel space
> when testing nvme hot-plugging. The test and analysis are below.
> 
> Debug code:
> 1. blk_mq_alloc_rq_maps():
>         u64 start, end;
>         depth = set->queue_depth;
>         start = ktime_get_ns();
>         pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
>                         current->pid, current->comm, current->nvcsw, current->nivcsw,
>                         set->queue_depth, set->nr_hw_queues);
>         do {
>                 err = __blk_mq_alloc_rq_maps(set);
>                 if (!err)
>                         break;
> 
>                 set->queue_depth >>= 1;
>                 if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
>                         err = -ENOMEM;
>                         break;
>                 }
>         } while (set->queue_depth);
>         end = ktime_get_ns();
>         pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
>                         current->pid, current->comm,
>                         current->nvcsw, current->nivcsw, end - start);
> 
> 2. __blk_mq_alloc_rq_maps():
>         u64 start, end;
>         for (i = 0; i < set->nr_hw_queues; i++) {
>                 start = ktime_get_ns();
>                 if (!__blk_mq_alloc_rq_map(set, i))
>                         goto out_unwind;
>                 end = ktime_get_ns();
>                 pr_err("hw queue %d init cost time %lld\n", i, end - start);
>         }
> 
> Testing nvme hot-plugging with the above debug code, we found it costs
> more than 3ms in total in kernel space, without being scheduled out,
> to allocate rqs for all 16 hw queues with depth 1024; each hw queue
> costs about 140-250us. The time cost grows as the number of hw queues
> and the queue depth increase. And if __blk_mq_alloc_rq_maps() returns
> -ENOMEM, it retries with "queue_depth >>= 1", so even more time is
> consumed.
> 	[  428.428771] nvme nvme0: pci function 10000:01:00.0
> 	[  428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
> 	[  428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
> 	[  428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
> 	[  432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
> 	[  432.593404] hw queue 0 init cost time 22883 ns
> 	[  432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
> 	[  432.595953] nvme nvme0: 16/0/0 default/read/poll queues
> 	[  432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
> 	[  432.596203] hw queue 0 init cost time 242630 ns
> 	[  432.596441] hw queue 1 init cost time 235913 ns
> 	[  432.596659] hw queue 2 init cost time 216461 ns
> 	[  432.596877] hw queue 3 init cost time 215851 ns
> 	[  432.597107] hw queue 4 init cost time 228406 ns
> 	[  432.597336] hw queue 5 init cost time 227298 ns
> 	[  432.597564] hw queue 6 init cost time 224633 ns
> 	[  432.597785] hw queue 7 init cost time 219954 ns
> 	[  432.597937] hw queue 8 init cost time 150930 ns
> 	[  432.598082] hw queue 9 init cost time 143496 ns
> 	[  432.598231] hw queue 10 init cost time 147261 ns
> 	[  432.598397] hw queue 11 init cost time 164522 ns
> 	[  432.598542] hw queue 12 init cost time 143401 ns
> 	[  432.598692] hw queue 13 init cost time 148934 ns
> 	[  432.598841] hw queue 14 init cost time 147194 ns
> 	[  432.598991] hw queue 15 init cost time 148942 ns
> 	[  432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
> 	[  432.602611]  nvme0n1: p1
> 
> So use this patch to trigger a reschedule between each hw queue init,
> to avoid other threads getting stuck. We call cond_resched() only when
> "queue depth >= 512". We are not in atomic context when executing
> __blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().
> 
> Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
> ---
>  block/blk-mq.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b3d2785ee..5a71fe53a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3255,11 +3255,16 @@ void blk_mq_exit_queue(struct request_queue *q)
>  static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
>  {
>  	int i;
> +	unsigned int depth = set->queue_depth;
>  
> -	for (i = 0; i < set->nr_hw_queues; i++)
> +	for (i = 0; i < set->nr_hw_queues; i++) {
>  		if (!__blk_mq_alloc_map_and_request(set, i))
>  			goto out_unwind;
>  
> +		if (depth >= 512)
> +			cond_resched();
> +	}
> +
>  	return 0;

I generally dislike

if (some_num)
	cond_resched();

particularly when it's not a true hot path. How about just making the
cond_resched() unconditional? I suspect that will be just fine.

--
Jens Axboe


Thread overview: 4+ messages
2020-09-17  8:13 [PATCH] [v2] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps() Xianting Tian
2020-09-25 10:11 ` Tianxianting
2020-09-25 19:25 ` Jens Axboe
2020-09-26  2:52   ` Tianxianting [this message]
