* [PATCH -next] blk-mq: fix boot time regression for scsi drives with multiple hctx
@ 2022-06-14  7:14 Yu Kuai
From: Yu Kuai @ 2022-06-14  7:14 UTC
  To: axboe, ming.lei, djeffery, bvanassche
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

We found that boot time increased by about 8s after upgrading the kernel
from v4.19 to v5.10 (megaraid-sas is used in the environment).

Following is where the extra time is spent:

scsi_probe_and_add_lun
 __scsi_remove_device
  blk_cleanup_queue
   blk_mq_exit_queue
    blk_mq_exit_hw_queues
     blk_mq_exit_hctx
      blk_mq_clear_flush_rq_mapping -> function latency is 0.1ms
       cmpxchg

There are three reasons:
1) megaraid-sas uses multiple hctxs in v5.10, so blk_mq_exit_hctx()
is called many more times in v5.10 than in v4.19.
2) scsi scans each target, so __scsi_remove_device() is called
many times.
3) blk_mq_clear_flush_rq_mapping() was introduced after v4.19; it
calls cmpxchg() for each request, and its latency is about 0.1ms per
call (a rough estimate of how this adds up is sketched below).
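
As a rough back-of-the-envelope check, here is a sketch of how the numbers
add up (the nr_hw_queues and teardown counts are assumptions taken from
figures that come up later in this thread, not measurements from this mail):

#include <stdio.h>

int main(void)
{
	double per_call_ms = 0.1;	/* observed blk_mq_clear_flush_rq_mapping() latency */
	int nr_hw_queues   = 128;	/* hctxs per request queue (assumed) */
	int nr_teardowns   = 640;	/* __scsi_remove_device()/blk_cleanup_queue() calls (assumed) */

	/* blk_mq_clear_flush_rq_mapping() runs once per hctx for every queue torn down */
	double per_queue_ms = per_call_ms * nr_hw_queues;	/* ~12.8 ms per blk_cleanup_queue() */
	double total_s = per_queue_ms * nr_teardowns / 1000.0;	/* ~8.2 s over the whole scan */

	printf("per queue: ~%.1f ms, total: ~%.1f s\n", per_queue_ms, total_s);
	return 0;
}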

Since blk_mq_clear_flush_rq_mapping() is only called after the
queue has already been frozen, which means there are no inflight
requests, it's safe to set 'tags->rqs[]' to NULL directly instead of
using cmpxchg(). Tests show that with this change, the function latency
of blk_mq_clear_flush_rq_mapping() is about 1us, and boot time is no
longer increased.

Fixes: 364b61818f65 ("blk-mq: clearing flush request reference in tags->rqs[]")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c | 2 +-
 block/blk-mq.c     | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 2dcd738c6952..d002eefcdaf5 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -253,7 +253,7 @@ static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
 	unsigned long flags;
 
 	spin_lock_irqsave(&tags->lock, flags);
-	rq = tags->rqs[bitnr];
+	rq = READ_ONCE(tags->rqs[bitnr]);
 	if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
 		rq = NULL;
 	spin_unlock_irqrestore(&tags->lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e9bf950983c7..21ace698d3be 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3421,7 +3421,7 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
 	WARN_ON_ONCE(req_ref_read(flush_rq) != 0);
 
 	for (i = 0; i < queue_depth; i++)
-		cmpxchg(&tags->rqs[i], flush_rq, NULL);
+		WRITE_ONCE(tags->rqs[i], NULL);
 
 	/*
 	 * Wait until all pending iteration is done.
-- 
2.31.1



* Re: [PATCH -next] blk-mq: fix boot time regression for scsi drives with multiple hctx
@ 2022-06-14  7:31 Ming Lei
From: Ming Lei @ 2022-06-14  7:31 UTC
  To: Yu Kuai
  Cc: axboe, djeffery, bvanassche, linux-block, linux-kernel, yi.zhang,
	ming.lei

On Tue, Jun 14, 2022 at 03:14:10PM +0800, Yu Kuai wrote:
> We found that boot time increased by about 8s after upgrading the kernel
> from v4.19 to v5.10 (megaraid-sas is used in the environment).

But 'blk-mq: clearing flush request reference in tags->rqs[]' was merged
to v5.14, :-)

> 
> Following is where the extra time is spent:
> 
> scsi_probe_and_add_lun
>  __scsi_remove_device
>   blk_cleanup_queue
>    blk_mq_exit_queue
>     blk_mq_exit_hw_queues
>      blk_mq_exit_hctx
>       blk_mq_clear_flush_rq_mapping -> function latency is 0.1ms

So queue_depth looks pretty large, is it 4k?

But if it is 0.1ms, how can an 8s delay be caused? That would require 80K hw queues
to take that long, so I guess there must be some other delay added by the
BLK_MQ_F_TAG_HCTX_SHARED feature.

>        cmpxchg
> 
> There are three reasons:
> 1) megaraid-sas uses multiple hctxs in v5.10, so blk_mq_exit_hctx()
> is called many more times in v5.10 than in v4.19.
> 2) scsi scans each target, so __scsi_remove_device() is called
> many times.
> 3) blk_mq_clear_flush_rq_mapping() was introduced after v4.19; it
> calls cmpxchg() for each request, and its latency is about 0.1ms.
> 
> Since blk_mq_clear_flush_rq_mapping() is only called after the
> queue has already been frozen, which means there are no inflight
> requests, it's safe to set 'tags->rqs[]' to NULL directly instead of
> using cmpxchg(). Tests show that with this change, the function latency
> of blk_mq_clear_flush_rq_mapping() is about 1us, and boot time is no
> longer increased.

tags is shared among all LUNs attached to the host, so freezing a single
request queue here means nothing, and your patch doesn't work.
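
To make that concrete, here is a minimal userspace model (an assumption-laden
sketch, not kernel code) of the case the original cmpxchg() guards against:
a sibling queue sharing the tags can re-use the tag and install its own
request in the same tags->rqs[] slot, so the clear must be conditional on the
slot still holding flush_rq; an unconditional store would wipe a live mapping.

#include <stdio.h>
#include <stdatomic.h>

struct request { int tag; };

int main(void)
{
	struct request flush_rq = { .tag = 0 };
	struct request other_rq = { .tag = 0 };	/* request from a sibling queue sharing the tags */
	_Atomic(struct request *) slot = &flush_rq;	/* models tags->rqs[0] */

	/* The sibling queue (which is not frozen) re-uses tag 0 for its own request. */
	atomic_store(&slot, &other_rq);

	/* cmpxchg-style clear: only clears the slot if it still points at flush_rq. */
	struct request *expected = &flush_rq;
	if (!atomic_compare_exchange_strong(&slot, &expected, (struct request *)NULL))
		printf("slot kept, live request preserved: %p\n",
		       (void *)atomic_load(&slot));

	/*
	 * An unconditional store - the WRITE_ONCE(tags->rqs[i], NULL) proposed
	 * above - would have thrown away the live mapping instead.
	 */
	return 0;
}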

Please test the following patch, and see if it can improve boot delay for
your case.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e9bf950983c7..1463076a527c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3443,8 +3443,9 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	if (blk_mq_hw_queue_mapped(hctx))
 		blk_mq_tag_idle(hctx);
 
-	blk_mq_clear_flush_rq_mapping(set->tags[hctx_idx],
-			set->queue_depth, flush_rq);
+	if (blk_queue_init_done(q))
+		blk_mq_clear_flush_rq_mapping(set->tags[hctx_idx],
+				set->queue_depth, flush_rq);
 	if (set->ops->exit_request)
 		set->ops->exit_request(set, flush_rq, hctx_idx);
 


Thanks,
Ming



* Re: [PATCH -next] blk-mq: fix boot time regression for scsi drives with multiple hctx
@ 2022-06-14 13:15 Yu Kuai
From: Yu Kuai @ 2022-06-14 13:15 UTC
  To: Ming Lei; +Cc: axboe, djeffery, bvanassche, linux-block, linux-kernel, yi.zhang

On 2022/06/14 15:31, Ming Lei wrote:
> On Tue, Jun 14, 2022 at 03:14:10PM +0800, Yu Kuai wrote:
>> We found that boot time increased by about 8s after upgrading the kernel
>> from v4.19 to v5.10 (megaraid-sas is used in the environment).
> 
> But 'blk-mq: clearing flush request reference in tags->rqs[]' was merged
> to v5.14, :-)
Hi,

Yes, but this patch was applied to 5.10 stable, so we backported it into
our v5.10. Sorry that I didn't mention that.

> 
>>
>> Following is where the extra time is spent:
>>
>> 
>>   __scsi_remove_device
>>    blk_cleanup_queue
>>     blk_mq_exit_queue
>>      blk_mq_exit_hw_queues
>>       blk_mq_exit_hctx
>>        blk_mq_clear_flush_rq_mapping -> function latency is 0.1ms
> 
> So queue_depth looks pretty large, is it 4k?
No, in the environment, it's just 32, and nr_hw_queues is 128, which
means each blk_cleanup_queue() will cost about 10-20 ms.

> 
> But if it is 0.1ms, how can an 8s delay be caused? That would require 80K hw queues
> to take that long, so I guess there must be some other delay added by the
> BLK_MQ_F_TAG_HCTX_SHARED feature.

Please see the details in reason 2): the scsi scan will call
__scsi_remove_device() a lot of times (each host, each channel, each
target).
> 
>>         cmpxchg
>>
>> There are three reasons:
>> 1) megaraid-sas uses multiple hctxs in v5.10, so blk_mq_exit_hctx()
>> is called many more times in v5.10 than in v4.19.
>> 2) scsi scans each target, so __scsi_remove_device() is called
>> many times.
>> 3) blk_mq_clear_flush_rq_mapping() was introduced after v4.19; it
>> calls cmpxchg() for each request, and its latency is about 0.1ms.
>>
>> Since blk_mq_clear_flush_rq_mapping() is only called after the
>> queue has already been frozen, which means there are no inflight
>> requests, it's safe to set 'tags->rqs[]' to NULL directly instead of
>> using cmpxchg(). Tests show that with this change, the function latency
>> of blk_mq_clear_flush_rq_mapping() is about 1us, and boot time is no
>> longer increased.
> 
> tags is shared among all LUNs attached to the host, so freezing a single
> request queue here means nothing, and your patch doesn't work.

You're right, I forgot that tags can be shared.

> 
> Please test the following patch, and see if it can improve boot delay for
> your case.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e9bf950983c7..1463076a527c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3443,8 +3443,9 @@ static void blk_mq_exit_hctx(struct request_queue *q,
>   	if (blk_mq_hw_queue_mapped(hctx))
>   		blk_mq_tag_idle(hctx);
>   
> -	blk_mq_clear_flush_rq_mapping(set->tags[hctx_idx],
> -			set->queue_depth, flush_rq);
> +	if (blk_queue_init_done(q))
> +		blk_mq_clear_flush_rq_mapping(set->tags[hctx_idx],
> +				set->queue_depth, flush_rq);
>   	if (set->ops->exit_request)
>   		set->ops->exit_request(set, flush_rq, hctx_idx);
>   

Thanks for the patch, I tested it and the boot delay is fixed.

Kuai
> 
> 
> Thanks,
> Ming
> 
> .
> 


* Re: [PATCH -next] blk-mq: fix boot time regression for scsi drives with multiple hctx
@ 2022-06-14 14:26 Ming Lei
From: Ming Lei @ 2022-06-14 14:26 UTC
  To: Yu Kuai; +Cc: axboe, djeffery, bvanassche, linux-block, linux-kernel, yi.zhang

On Tue, Jun 14, 2022 at 09:15:36PM +0800, Yu Kuai wrote:
> On 2022/06/14 15:31, Ming Lei wrote:
> > On Tue, Jun 14, 2022 at 03:14:10PM +0800, Yu Kuai wrote:
> > > We found that boot time increased by about 8s after upgrading the kernel
> > > from v4.19 to v5.10 (megaraid-sas is used in the environment).
> > 
> > But 'blk-mq: clearing flush request reference in tags->rqs[]' was merged
> > to v5.14, :-)
> Hi,
> 
> Yes, but this patch was applied to 5.10 stable, so we backported it into
> our v5.10. Sorry that I didn't mention that.
> 
> > 
> > > 
> > > Following is where the extra time is spent:
> > > 
> > > 
> > >   __scsi_remove_device
> > >    blk_cleanup_queue
> > >     blk_mq_exit_queue
> > >      blk_mq_exit_hw_queues
> > >       blk_mq_exit_hctx
> > >        blk_mq_clear_flush_rq_mapping -> function latency is 0.1ms
> > 
> > So queue_depth looks pretty large, is it 4k?
> No, in the environment, it's just 32, and nr_hw_queues is 128, which
> means each blk_cleanup_queue() will cost about 10-20 ms.

So 32 * cmpxchg takes 100us, which is really crazy. What is the arch
and processor?
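
For reference, here is a minimal userspace sketch (the iteration count is an
arbitrary assumption) to sanity-check the raw cost of a cache-hot, uncontended
compare-and-swap; build with e.g. "gcc -O2":

#include <stdio.h>
#include <time.h>

int main(void)
{
	long slot = 0, expected;
	const long iters = 10 * 1000 * 1000;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++) {
		expected = 0;
		/* always-successful CAS on a cache-hot line */
		__atomic_compare_exchange_n(&slot, &expected, 0L, 0,
					    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.1f ns per cmpxchg\n", ns / iters);
	return 0;
}

On typical recent x86 machines this reports on the order of tens of
nanoseconds per operation, nowhere near ~3us per cmpxchg, so the 0.1ms per
call presumably includes more than the bare CAS loop (cold cache lines,
taking tags->lock, and the "wait until all pending iteration is done" step
visible in the hunk quoted earlier in the thread).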

> 
> > 
> > But if it is 0.1ms, how can an 8s delay be caused? That would require 80K hw queues
> > to take that long, so I guess there must be some other delay added by the
> > BLK_MQ_F_TAG_HCTX_SHARED feature.
> 
> Please see the details in reason 2): the scsi scan will call
> __scsi_remove_device() a lot of times (each host, each channel, each
> target).

80K / 128 = 640 LUNs.

OK, that can be true.

> > 
> > >         cmpxchg
> > > 
> > > There are three reasons:
> > > 1) megaraid-sas uses multiple hctxs in v5.10, so blk_mq_exit_hctx()
> > > is called many more times in v5.10 than in v4.19.
> > > 2) scsi scans each target, so __scsi_remove_device() is called
> > > many times.
> > > 3) blk_mq_clear_flush_rq_mapping() was introduced after v4.19; it
> > > calls cmpxchg() for each request, and its latency is about 0.1ms.
> > > 
> > > Since blk_mq_clear_flush_rq_mapping() is only called after the
> > > queue has already been frozen, which means there are no inflight
> > > requests, it's safe to set 'tags->rqs[]' to NULL directly instead of
> > > using cmpxchg(). Tests show that with this change, the function latency
> > > of blk_mq_clear_flush_rq_mapping() is about 1us, and boot time is no
> > > longer increased.
> > 
> > tags is shared among all LUNs attached to the host, so freezing a single
> > request queue here means nothing, and your patch doesn't work.
> 
> You're right, I forgot that tags can be shared.
> 
> > 
> > Please test the following patch, and see if it can improve boot delay for
> > your case.
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index e9bf950983c7..1463076a527c 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -3443,8 +3443,9 @@ static void blk_mq_exit_hctx(struct request_queue *q,
> >   	if (blk_mq_hw_queue_mapped(hctx))
> >   		blk_mq_tag_idle(hctx);
> > -	blk_mq_clear_flush_rq_mapping(set->tags[hctx_idx],
> > -			set->queue_depth, flush_rq);
> > +	if (blk_queue_init_done(q))
> > +		blk_mq_clear_flush_rq_mapping(set->tags[hctx_idx],
> > +				set->queue_depth, flush_rq);
> >   	if (set->ops->exit_request)
> >   		set->ops->exit_request(set, flush_rq, hctx_idx);
> 
> Thanks for the patch, I tested it and the boot delay is fixed.

Thanks for the test, and I will send it out tomorrow.

Thanks,
Ming



* Re: [PATCH -next] blk-mq: fix boot time regression for scsi drives with multiple hctx
@ 2022-06-15  6:21 Christoph Hellwig
From: Christoph Hellwig @ 2022-06-15  6:21 UTC
  To: Yu Kuai
  Cc: Ming Lei, axboe, djeffery, bvanassche, linux-block, linux-kernel,
	yi.zhang

On Tue, Jun 14, 2022 at 09:15:36PM +0800, Yu Kuai wrote:
> > to take that long, so I guess there must be some other delay added by the
> > BLK_MQ_F_TAG_HCTX_SHARED feature.
> 
> Please see the details in reason 2): the scsi scan will call
> __scsi_remove_device() a lot of times (each host, each channel, each
> target).

Well, if the SCSI core does that, you are falling back to a sequential
scan instead of using REPORT LUNS.  That seems like a bug on its own
and is worth investigating.

