linux-scsi.vger.kernel.org archive mirror
* [PATCH V5] scsi: core: only re-run queue in scsi_end_request() if device queue is busy
@ 2020-09-07  7:10 Ming Lei
  2020-09-07 16:52 ` Bart Van Assche
  0 siblings, 1 reply; 5+ messages in thread
From: Ming Lei @ 2020-09-07  7:10 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Martin K . Petersen
  Cc: Ming Lei, Ewan D . Milne, Kashyap Desai, Hannes Reinecke,
	Bart Van Assche, Long Li, John Garry, linux-block

Currently the request queue is run unconditionally in scsi_end_request() if
both the target queue and the host queue are ready. However, the queue only
needs to be re-run for restarting this LUN after its device queue has been
busy.

Recently, Long Li reported that the cost of running the queue can be very
high at high queue depth. Improve this situation by only running the
request queue when this LUN is busy.

Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Long Li <longli@microsoft.com>
Cc: John Garry <john.garry@huawei.com>
Cc: linux-block@vger.kernel.org
Tested-by: Long Li <longli@microsoft.com>
Reported-by: Long Li <longli@microsoft.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
V5:
	- patch style && comment change, as suggested by Bart
	- add reviewed-by & tested-by tag
V4:
	- fix one race reported by Kashyap, and simplify the implementation
	a bit; also passes both Kashyap's functional and performance tests
V3:
	- add one smp_mb() in scsi_mq_get_budget() and comment

V2:
	- commit log change, no code change
	- add reported-by tag
 drivers/scsi/scsi_lib.c    | 48 +++++++++++++++++++++++++++++++++++---
 include/scsi/scsi_device.h |  1 +
 2 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 7affaaf8b98e..a05e431ee62a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -551,8 +551,25 @@ static void scsi_run_queue_async(struct scsi_device *sdev)
 	if (scsi_target(sdev)->single_lun ||
 	    !list_empty(&sdev->host->starved_list))
 		kblockd_schedule_work(&sdev->requeue_work);
-	else
-		blk_mq_run_hw_queues(sdev->request_queue, true);
+	else {
+		/*
+		 * smp_mb() implied in either rq->end_io or blk_mq_free_request
+		 * is for ordering writing .device_busy in scsi_device_unbusy()
+		 * and reading sdev->restarts.
+		 */
+		int old = atomic_read(&sdev->restarts);
+
+		/*
+		 * ->restarts has to be kept as non-zero if there new budget
+		 *  contention comes.
+		 *
+		 *  No need to run queue when either another re-run
+		 *  queue wins in updating ->restarts or one new budget
+		 *  contention comes.
+		 */
+		if (old && (atomic_cmpxchg(&sdev->restarts, old, 0) == old))
+			blk_mq_run_hw_queues(sdev->request_queue, true);
+	}
 }
 
 /* Returns false when no more bytes to process, true if there are more */
@@ -1611,8 +1628,33 @@ static void scsi_mq_put_budget(struct request_queue *q)
 static bool scsi_mq_get_budget(struct request_queue *q)
 {
 	struct scsi_device *sdev = q->queuedata;
+	int ret = scsi_dev_queue_ready(q, sdev);
 
-	return scsi_dev_queue_ready(q, sdev);
+	if (ret)
+		return true;
+
+	atomic_inc(&sdev->restarts);
+
+	/*
+	 * Order writing .restarts and reading .device_busy. Its pair is
+	 * implied by __blk_mq_end_request() in scsi_end_request() for
+	 * ordering writing .device_busy in scsi_device_unbusy() and
+	 * reading .restarts.
+	 */
+	smp_mb__after_atomic();
+
+	/*
+	 * If all in-flight requests originated from this LUN are completed
+	 * before setting .restarts, sdev->device_busy will be observed as
+	 * zero, then blk_mq_delay_run_hw_queues() will dispatch this request
+	 * soon. Otherwise, completion of one of these requests will observe
+	 * the .restarts flag, and the request queue will be run for handling
+	 * this request, see scsi_end_request().
+	 */
+	if (unlikely(atomic_read(&sdev->device_busy) == 0 &&
+				!scsi_device_blocked(sdev)))
+		blk_mq_delay_run_hw_queues(sdev->request_queue, SCSI_QUEUE_DELAY);
+	return false;
 }
 
 static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index bc5909033d13..1a5c9a3df6d6 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -109,6 +109,7 @@ struct scsi_device {
 	atomic_t device_busy;		/* commands actually active on LLDD */
 	atomic_t device_blocked;	/* Device returned QUEUE_FULL. */
 
+	atomic_t restarts;
 	spinlock_t list_lock;
 	struct list_head starved_entry;
 	unsigned short queue_depth;	/* How deep of a queue we want */
-- 
2.25.2

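[Editorial illustration, not part of the original posting: the handoff the
patch relies on can be modeled outside the kernel. Below is a minimal
single-threaded Python sketch — all names, including the B1/B2/C1/C2 step
labels and run_interleaving(), are invented for this sketch and are not
kernel identifiers. It enumerates every interleaving of the budget-miss
path in scsi_mq_get_budget() with the completion path in
scsi_run_queue_async(), assuming a single in-flight request and atomic
steps; in every interleaving at least one side ends up running the queue.]

```python
from itertools import combinations

def run_interleaving(schedule):
    """Execute one interleaving of the budget-miss path (steps B1, B2)
    and the completion path (steps C1, C2) against shared state."""
    s = {"device_busy": 1, "restarts": 0, "runs": 0, "delayed_runs": 0}

    def b1():  # scsi_mq_get_budget(): atomic_inc(&sdev->restarts)
        s["restarts"] += 1

    def b2():  # scsi_mq_get_budget(): read .device_busy, delay-run if idle
        if s["device_busy"] == 0:
            s["delayed_runs"] += 1

    def c1():  # scsi_device_unbusy(): atomic_dec(&sdev->device_busy)
        s["device_busy"] -= 1

    def c2():  # scsi_run_queue_async(): reset ->restarts, run the queue
        old = s["restarts"]
        if old and s["restarts"] == old:
            s["restarts"] = 0
            s["runs"] += 1

    steps = {"B1": b1, "B2": b2, "C1": c1, "C2": c2}
    for name in schedule:
        steps[name]()
    return s

def all_interleavings():
    """Yield every schedule that keeps B1 before B2 and C1 before C2."""
    for b_slots in combinations(range(4), 2):
        sched = [None] * 4
        sched[b_slots[0]], sched[b_slots[1]] = "B1", "B2"
        c_steps = iter(["C1", "C2"])
        for i in range(4):
            if sched[i] is None:
                sched[i] = next(c_steps)
        yield sched
```

In the one interleaving where both completion steps finish before B1, C2
sees restarts == 0 and skips the re-run, but B2 then observes
device_busy == 0 and schedules the delayed run — the case the
blk_mq_delay_run_hw_queues() call in the patch exists for.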


* Re: [PATCH V5] scsi: core: only re-run queue in scsi_end_request() if device queue is busy
  2020-09-07  7:10 [PATCH V5] scsi: core: only re-run queue in scsi_end_request() if device queue is busy Ming Lei
@ 2020-09-07 16:52 ` Bart Van Assche
  2020-09-08  1:47   ` Ming Lei
  0 siblings, 1 reply; 5+ messages in thread
From: Bart Van Assche @ 2020-09-07 16:52 UTC (permalink / raw)
  To: Ming Lei, James Bottomley, linux-scsi, Martin K . Petersen
  Cc: Ewan D . Milne, Kashyap Desai, Hannes Reinecke, Long Li,
	John Garry, linux-block

On 2020-09-07 00:10, Ming Lei wrote:
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 7affaaf8b98e..a05e431ee62a 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -551,8 +551,25 @@ static void scsi_run_queue_async(struct scsi_device *sdev)
>  	if (scsi_target(sdev)->single_lun ||
>  	    !list_empty(&sdev->host->starved_list))
>  		kblockd_schedule_work(&sdev->requeue_work);
> -	else
> -		blk_mq_run_hw_queues(sdev->request_queue, true);
> +	else {

Please follow the Linux kernel coding style and balance braces.

> +		/*
> +		 * smp_mb() implied in either rq->end_io or blk_mq_free_request
> +		 * is for ordering writing .device_busy in scsi_device_unbusy()
> +		 * and reading sdev->restarts.
> +		 */
> +		int old = atomic_read(&sdev->restarts);

scsi_run_queue_async() has two callers: scsi_end_request() and scsi_queue_rq().
I don't see how ordering between scsi_device_unbusy() and the above atomic_read()
could be guaranteed if this function is called from scsi_queue_rq()?

Regarding the I/O completion path, my understanding is that the I/O completion
path is as follows if rq->end_io == NULL:

scsi_mq_done()
  blk_mq_complete_request()
    rq->q->mq_ops->complete(rq) = scsi_softirq_done
      scsi_finish_command()
        scsi_device_unbusy()
        scsi_cmd_to_driver(cmd)->done(cmd)
        scsi_io_completion()
          scsi_end_request()
            blk_update_request()
            scsi_mq_uninit_cmd()
            __blk_mq_end_request()
              blk_mq_free_request()
                __blk_mq_free_request()
                  blk_queue_exit()
            scsi_run_queue_async()

I haven't found any store memory barrier between the .device_busy change in
scsi_device_unbusy() and the scsi_run_queue_async() call? Did I perhaps overlook
something?

> +		/*
> +		 * ->restarts has to be kept as non-zero if there new budget
> +		 *  contention comes.

Please fix the grammar in the above sentence.

> +	/*
> +	 * Order writing .restarts and reading .device_busy. Its pair is
> +	 * implied by __blk_mq_end_request() in scsi_end_request() for
> +	 * ordering writing .device_busy in scsi_device_unbusy() and
> +	 * reading .restarts.
> +	 */
> +	smp_mb__after_atomic();

What does "its pair is implied" mean? Please make the above comment
unambiguous.

> +	/*
> +	 * If all in-flight requests originated from this LUN are completed
> +	 * before setting .restarts, sdev->device_busy will be observed as
> +	 * zero, then blk_mq_delay_run_hw_queues() will dispatch this request
> +	 * soon. Otherwise, completion of one of these request will observe
> +	 * the .restarts flag, and the request queue will be run for handling
> +	 * this request, see scsi_end_request().
> +	 */
> +	if (unlikely(atomic_read(&sdev->device_busy) == 0 &&
> +				!scsi_device_blocked(sdev)))
> +		blk_mq_delay_run_hw_queues(sdev->request_queue, SCSI_QUEUE_DELAY);
> +	return false;

What will happen if all in-flight requests complete after
scsi_run_queue_async() has read .restarts and before it executes
atomic_cmpxchg()? Will that cause the queue to be run after a delay
although it should be run immediately?

Thanks,

Bart.


* Re: [PATCH V5] scsi: core: only re-run queue in scsi_end_request() if device queue is busy
  2020-09-07 16:52 ` Bart Van Assche
@ 2020-09-08  1:47   ` Ming Lei
  2020-09-08  3:45     ` Bart Van Assche
  0 siblings, 1 reply; 5+ messages in thread
From: Ming Lei @ 2020-09-08  1:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: James Bottomley, linux-scsi, Martin K . Petersen, Ewan D . Milne,
	Kashyap Desai, Hannes Reinecke, Long Li, John Garry, linux-block

On Mon, Sep 07, 2020 at 09:52:42AM -0700, Bart Van Assche wrote:
> On 2020-09-07 00:10, Ming Lei wrote:
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 7affaaf8b98e..a05e431ee62a 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -551,8 +551,25 @@ static void scsi_run_queue_async(struct scsi_device *sdev)
> >  	if (scsi_target(sdev)->single_lun ||
> >  	    !list_empty(&sdev->host->starved_list))
> >  		kblockd_schedule_work(&sdev->requeue_work);
> > -	else
> > -		blk_mq_run_hw_queues(sdev->request_queue, true);
> > +	else {
> 
> Please follow the Linux kernel coding style and balance braces.

Could you point me to a document describing that style? The patch does
pass checkpatch, but I am happy to follow your suggestion if checkpatch
is updated to enforce it.

> 
> > +		/*
> > +		 * smp_mb() implied in either rq->end_io or blk_mq_free_request
> > +		 * is for ordering writing .device_busy in scsi_device_unbusy()
> > +		 * and reading sdev->restarts.
> > +		 */
> > +		int old = atomic_read(&sdev->restarts);
> 
> scsi_run_queue_async() has two callers: scsi_end_request() and scsi_queue_rq().
> I don't see how ordering between scsi_device_unbusy() and the above atomic_read()
> could be guaranteed if this function is called from scsi_queue_rq()?
> 
> Regarding the I/O completion path, my understanding is that the I/O completion
> path is as follows if rq->end_io == NULL:
> 
> scsi_mq_done()
>   blk_mq_complete_request()
>     rq->q->mq_ops->complete(rq) = scsi_softirq_done
>       scsi_finish_command()
>         scsi_device_unbusy()

scsi_device_unbusy()
	atomic_dec(&sdev->device_busy);

>         scsi_cmd_to_driver(cmd)->done(cmd)
>         scsi_io_completion()
>           scsi_end_request()
>             blk_update_request()
>             scsi_mq_uninit_cmd()
>             __blk_mq_end_request()
>               blk_mq_free_request()
>                 __blk_mq_free_request()

__blk_mq_free_request()
	blk_mq_put_tag
		smp_mb__after_atomic()

>                   blk_queue_exit()
>             scsi_run_queue_async()
> 
> I haven't found any store memory barrier between the .device_busy change in
> scsi_device_unbusy() and the scsi_run_queue_async() call? Did I perhaps overlook
> something?
> 
> > +		/*
> > +		 * ->restarts has to be kept as non-zero if there new budget
> > +		 *  contention comes.
> 
> Please fix the grammar in the above sentence.

OK.

> 
> > +	/*
> > +	 * Order writing .restarts and reading .device_busy. Its pair is
> > +	 * implied by __blk_mq_end_request() in scsi_end_request() for
> > +	 * ordering writing .device_busy in scsi_device_unbusy() and
> > +	 * reading .restarts.
> > +	 */
> > +	smp_mb__after_atomic();
> 
> What does "its pair is implied" mean? Please make the above comment
> unambiguous.

See comment in scsi_run_queue_async().

> 
> > +	/*
> > +	 * If all in-flight requests originated from this LUN are completed
> > +	 * before setting .restarts, sdev->device_busy will be observed as
> > +	 * zero, then blk_mq_delay_run_hw_queues() will dispatch this request
> > +	 * soon. Otherwise, completion of one of these request will observe
> > +	 * the .restarts flag, and the request queue will be run for handling
> > +	 * this request, see scsi_end_request().
> > +	 */
> > +	if (unlikely(atomic_read(&sdev->device_busy) == 0 &&
> > +				!scsi_device_blocked(sdev)))
> > +		blk_mq_delay_run_hw_queues(sdev->request_queue, SCSI_QUEUE_DELAY);
> > +	return false;
> 
> What will happen if all in-flight requests complete after
> scsi_run_queue_async() has read .restarts and before it executes
> atomic_cmpxchg()?

One of these completions will run atomic_cmpxchg() successfully, and the
queue is re-run immediately from scsi_run_queue_async().

> Will that cause the queue to be run after a delay
> although it should be run immediately?

Yeah, blk_mq_delay_run_hw_queues() will be called, however:

If scsi_run_queue_async() has already scheduled a queue run, this code path
won't queue a dwork successfully. On the other hand, if
blk_mq_delay_run_hw_queues(SCSI_QUEUE_DELAY) has queued a dwork,
scsi_run_queue_async() can still queue the dwork successfully, since the
delay timer can be deactivated easily, see try_to_grab_pending(). In short,
the case you described is an extremely unlikely event, and even if it
happens, forward progress is still guaranteed.
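
[Editorial illustration, not part of the original posting: the
winner-takes-all step can be shown with a toy single-threaded Python model.
cmpxchg() and completion_rerun() are invented names for this sketch, not
kernel code. Both completions snapshot old == 1, only the first
atomic_cmpxchg() succeeds, and the queue is re-run exactly once.]

```python
def cmpxchg(state, key, old, new):
    """Single-threaded model of atomic_cmpxchg(): swap only on a match,
    always return the value that was observed."""
    seen = state[key]
    if seen == old:
        state[key] = new
    return seen

def completion_rerun(state, old):
    """Model of the re-run decision in scsi_run_queue_async(): run the
    queue only if this completion wins resetting ->restarts to zero."""
    if old and cmpxchg(state, "restarts", old, 0) == old:
        state["runs"] += 1

state = {"restarts": 1, "runs": 0}
# Both completions read ->restarts before either executes the cmpxchg.
old_a = old_b = state["restarts"]
completion_rerun(state, old_a)  # wins: restarts 1 -> 0, runs the queue
completion_rerun(state, old_b)  # loses: cmpxchg observes 0 != 1, no-op
```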


Thanks,
Ming



* Re: [PATCH V5] scsi: core: only re-run queue in scsi_end_request() if device queue is busy
  2020-09-08  1:47   ` Ming Lei
@ 2020-09-08  3:45     ` Bart Van Assche
  2020-09-08 10:01       ` Ming Lei
  0 siblings, 1 reply; 5+ messages in thread
From: Bart Van Assche @ 2020-09-08  3:45 UTC (permalink / raw)
  To: Ming Lei
  Cc: James Bottomley, linux-scsi, Martin K . Petersen, Ewan D . Milne,
	Kashyap Desai, Hannes Reinecke, Long Li, John Garry, linux-block

On 2020-09-07 18:47, Ming Lei wrote:
> On Mon, Sep 07, 2020 at 09:52:42AM -0700, Bart Van Assche wrote:
>> On 2020-09-07 00:10, Ming Lei wrote:
>>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>>> index 7affaaf8b98e..a05e431ee62a 100644
>>> --- a/drivers/scsi/scsi_lib.c
>>> +++ b/drivers/scsi/scsi_lib.c
>>> @@ -551,8 +551,25 @@ static void scsi_run_queue_async(struct scsi_device *sdev)
>>>  	if (scsi_target(sdev)->single_lun ||
>>>  	    !list_empty(&sdev->host->starved_list))
>>>  		kblockd_schedule_work(&sdev->requeue_work);
>>> -	else
>>> -		blk_mq_run_hw_queues(sdev->request_queue, true);
>>> +	else {
>>
>> Please follow the Linux kernel coding style and balance braces.
> 
> Could you provide one document about such style? The patch does pass
> checkpatch, or I am happy to follow your suggestion if checkpatch is
> updated to this way.

Apparently the checkpatch script only warns about unbalanced braces with the
option --strict. From commit e4c5babd32f9 ("checkpatch: notice unbalanced
else braces in a patch") # v4.11:

    checkpatch: notice unbalanced else braces in a patch

    Patches that add or modify code like

            } else
                    <foo>
    or
            else {
                    <bar>

    where one branch appears to have a brace and the other branch does not
    have a brace should emit a --strict style message.

[ ... ]

+# check for single line unbalanced braces
+		if ($sline =~ /.\s*\}\s*else\s*$/ ||
+		    $sline =~ /.\s*else\s*\{\s*$/) {
+			CHK("BRACES", "Unbalanced braces around else statement\n" . $herecurr);
+		}
+

Anyway, I think the following output makes it clear that there are many more
balanced than non-balanced else statements:

$ git grep -c "} else {" | awk 'BEGIN {FS=":"} {total+=$2} END {print total}'
66944
$ git grep -Ec "$(printf "\t")else \{|\} else$" | awk 'BEGIN {FS=":"} {total+=$2} END {print total}'
12289

>>> +		/*
>>> +		 * smp_mb() implied in either rq->end_io or blk_mq_free_request
>>> +		 * is for ordering writing .device_busy in scsi_device_unbusy()
>>> +		 * and reading sdev->restarts.
>>> +		 */
>>> +		int old = atomic_read(&sdev->restarts);
>>
>> scsi_run_queue_async() has two callers: scsi_end_request() and scsi_queue_rq().
>> I don't see how ordering between scsi_device_unbusy() and the above atomic_read()
>> could be guaranteed if this function is called from scsi_queue_rq()?
>>
>> Regarding the I/O completion path, my understanding is that the I/O completion
>> path is as follows if rq->end_io == NULL:
>>
>> scsi_mq_done()
>>   blk_mq_complete_request()
>>     rq->q->mq_ops->complete(rq) = scsi_softirq_done
>>       scsi_finish_command()
>>         scsi_device_unbusy()
> 
> scsi_device_unbusy()
> 	atomic_dec(&sdev->device_busy);
> 
>>         scsi_cmd_to_driver(cmd)->done(cmd)
>>         scsi_io_completion()
>>           scsi_end_request()
>>             blk_update_request()
>>             scsi_mq_uninit_cmd()
>>             __blk_mq_end_request()
>>               blk_mq_free_request()
>>                 __blk_mq_free_request()
> 
> __blk_mq_free_request()
> 	blk_mq_put_tag
> 		smp_mb__after_atomic()
> 

Thanks for the clarification. How about changing the text "implied in either
rq->end_io or blk_mq_free_request" into "present in sbitmap_queue_clear()"
such that the person who reads the comment does not have to look up where
the barrier occurs?

>>
>>> +	/*
>>> +	 * Order writing .restarts and reading .device_busy. Its pair is
>>> +	 * implied by __blk_mq_end_request() in scsi_end_request() for
>>> +	 * ordering writing .device_busy in scsi_device_unbusy() and
>>> +	 * reading .restarts.
>>> +	 */
>>> +	smp_mb__after_atomic();
>>
>> What does "its pair is implied" mean? Please make the above comment
>> unambiguous.
> 
> See comment in scsi_run_queue_async().

How about making the above comment clearer by changing it into the following?
/*
 * Orders atomic_inc(&sdev->restarts) and atomic_read(&sdev->device_busy).
 * .restarts must be incremented before .device_busy is read because the code
 * in scsi_run_queue_async() depends on the order of these operations.
 */

>> Will that cause the queue to be run after a delay
>> although it should be run immediately?
> 
> Yeah, blk_mq_delay_run_hw_queues() will be called, however:
> 
> If scsi_run_queue_async() has scheduled run queue already, this code path
> won't queue a dwork successfully. On the other hand, if
> blk_mq_delay_run_hw_queues(SCSI_QUEUE_DELAY) has queued a dwork,
> scsi_run_queue_async() still can queue the dwork successfully, since the delay
> timer can be deactivated easily, see try_to_grab_pending(). In short, the case
> you described is an extremely unlikely event. Even though it happens,
> forward progress is still guaranteed.

I think I would sleep better if that race were fixed. I'm concerned
that sooner or later someone will run a workload that triggers that scenario
systematically ...

Thanks,

Bart.


* Re: [PATCH V5] scsi: core: only re-run queue in scsi_end_request() if device queue is busy
  2020-09-08  3:45     ` Bart Van Assche
@ 2020-09-08 10:01       ` Ming Lei
  0 siblings, 0 replies; 5+ messages in thread
From: Ming Lei @ 2020-09-08 10:01 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: James Bottomley, linux-scsi, Martin K . Petersen, Ewan D . Milne,
	Kashyap Desai, Hannes Reinecke, Long Li, John Garry, linux-block

On Mon, Sep 07, 2020 at 08:45:32PM -0700, Bart Van Assche wrote:
> On 2020-09-07 18:47, Ming Lei wrote:
> > On Mon, Sep 07, 2020 at 09:52:42AM -0700, Bart Van Assche wrote:
> >> On 2020-09-07 00:10, Ming Lei wrote:
> >>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> >>> index 7affaaf8b98e..a05e431ee62a 100644
> >>> --- a/drivers/scsi/scsi_lib.c
> >>> +++ b/drivers/scsi/scsi_lib.c
> >>> @@ -551,8 +551,25 @@ static void scsi_run_queue_async(struct scsi_device *sdev)
> >>>  	if (scsi_target(sdev)->single_lun ||
> >>>  	    !list_empty(&sdev->host->starved_list))
> >>>  		kblockd_schedule_work(&sdev->requeue_work);
> >>> -	else
> >>> -		blk_mq_run_hw_queues(sdev->request_queue, true);
> >>> +	else {
> >>
> >> Please follow the Linux kernel coding style and balance braces.
> > 
> > Could you provide one document about such style? The patch does pass
> > checkpatch, or I am happy to follow your suggestion if checkpatch is
> > updated to this way.
> 
> Apparently the checkpatch script only warns about unbalanced braces with the
> option --strict. From commit e4c5babd32f9 ("checkpatch: notice unbalanced
> else braces in a patch") # v4.11:
> 
>     checkpatch: notice unbalanced else braces in a patch
> 
>     Patches that add or modify code like
> 
>             } else
>                     <foo>
>     or
>             else {
>                     <bar>
> 
>     where one branch appears to have a brace and the other branch does not
>     have a brace should emit a --strict style message.
> 
> [ ... ]
> 
> +# check for single line unbalanced braces
> +		if ($sline =~ /.\s*\}\s*else\s*$/ ||
> +		    $sline =~ /.\s*else\s*\{\s*$/) {
> +			CHK("BRACES", "Unbalanced braces around else statement\n" . $herecurr);
> +		}
> +
> 
> Anyway, I think the following output makes it clear that there are many more
> balanced than non-balanced else statements:
> 
> $ git grep -c "} else {" | awk 'BEGIN {FS=":"} {total+=$2} END {print total}'
> 66944
> $ git grep -Ec "$(printf "\t")else \{|\} else$" | awk 'BEGIN {FS=":"} {total+=$2} END {print total}'
> 12289

OK, it looks like this is still not something that must be obeyed, but I do
not want to waste time on this, so I will switch to balanced braces.

> 
> >>> +		/*
> >>> +		 * smp_mb() implied in either rq->end_io or blk_mq_free_request
> >>> +		 * is for ordering writing .device_busy in scsi_device_unbusy()
> >>> +		 * and reading sdev->restarts.
> >>> +		 */
> >>> +		int old = atomic_read(&sdev->restarts);
> >>
> >> scsi_run_queue_async() has two callers: scsi_end_request() and scsi_queue_rq().
> >> I don't see how ordering between scsi_device_unbusy() and the above atomic_read()
> >> could be guaranteed if this function is called from scsi_queue_rq()?
> >>
> >> Regarding the I/O completion path, my understanding is that the I/O completion
> >> path is as follows if rq->end_io == NULL:
> >>
> >> scsi_mq_done()
> >>   blk_mq_complete_request()
> >>     rq->q->mq_ops->complete(rq) = scsi_softirq_done
> >>       scsi_finish_command()
> >>         scsi_device_unbusy()
> > 
> > scsi_device_unbusy()
> > 	atomic_dec(&sdev->device_busy);
> > 
> >>         scsi_cmd_to_driver(cmd)->done(cmd)
> >>         scsi_io_completion()
> >>           scsi_end_request()
> >>             blk_update_request()
> >>             scsi_mq_uninit_cmd()
> >>             __blk_mq_end_request()
> >>               blk_mq_free_request()
> >>                 __blk_mq_free_request()
> > 
> > __blk_mq_free_request()
> > 	blk_mq_put_tag
> > 		smp_mb__after_atomic()
> > 
> 
> Thanks for the clarification. How about changing the text "implied in either
> rq->end_io or blk_mq_free_request" into "present in sbitmap_queue_clear()"
> such that the person who reads the comment does not have to look up where
> the barrier occurs?

Fine.

> >>
> >>> +	/*
> >>> +	 * Order writing .restarts and reading .device_busy. Its pair is
> >>> +	 * implied by __blk_mq_end_request() in scsi_end_request() for
> >>> +	 * ordering writing .device_busy in scsi_device_unbusy() and
> >>> +	 * reading .restarts.
> >>> +	 */
> >>> +	smp_mb__after_atomic();
> >>
> >> What does "its pair is implied" mean? Please make the above comment
> >> unambiguous.
> > 
> > See comment in scsi_run_queue_async().
> 
> How about making the above comment clearer by changing it into the following?
> /*
>  * Orders atomic_inc(&sdev->restarts) and atomic_read(&sdev->device_busy).
>  * .restarts must be incremented before .device_busy is read because the code
>  * in scsi_run_queue_async() depends on the order of these operations.
>  */

OK.

> 
> >> Will that cause the queue to be run after a delay
> >> although it should be run immediately?
> > 
> > Yeah, blk_mq_delay_run_hw_queues() will be called, however:
> > 
> > If scsi_run_queue_async() has scheduled run queue already, this code path
> > won't queue a dwork successfully. On the other hand, if
> > blk_mq_delay_run_hw_queues(SCSI_QUEUE_DELAY) has queued a dwork,
> > scsi_run_queue_async() still can queue the dwork successfully, since the delay
> > timer can be deactivated easily, see try_to_grab_pending(). In short, the case
> > you described is an extremely unlikely event. Even though it happens,
> > forward progress is still guaranteed.
> 
> I think I would sleep better if that race would be fixed. I'm concerned

blk-mq already handles several similar cases; see the delayed queue runs in
the following functions:

	__blk_mq_do_dispatch_sched()
	blk_mq_do_dispatch_ctx()
	blk_mq_dispatch_rq_list()

> that sooner or later someone will run a workload that triggers that scenario
> systematically ...

After taking a close look at mod_delayed_work_on() and
try_to_grab_pending(), I do not think the issue you are worrying about can
happen.

mod_delayed_work_on() calls try_to_grab_pending() in the following way:

        do {
                ret = try_to_grab_pending(&dwork->work, true, &flags);
        } while (unlikely(ret == -EAGAIN));

The only two negative return values from try_to_grab_pending are -EAGAIN
and -ENOENT.

Both blk_mq_run_hw_queues() (called from scsi_run_queue_async()) and
blk_mq_delay_run_hw_queues() (called from scsi_mq_get_budget()) ultimately
call mod_delayed_work_on() via kblockd_mod_delayed_work_on().

Both calls of mod_delayed_work_on() end up in __queue_delayed_work().
blk_mq_run_hw_queues() will call __queue_work() to queue the work
immediately because its delay is zero, while blk_mq_delay_run_hw_queues()
just calls add_timer() to schedule a timer that runs __queue_work() after
the 3ms delay.

So the earlier blk_mq_delay_run_hw_queues() call will _not_ delay the
queue run.
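
[Editorial illustration, not part of the original posting: the
timer-stealing behaviour described above can be sketched with a toy event
model. The class and its methods are invented for this sketch and are not
the workqueue API. A run queued with delay 0 at t=1 overrides a pending
3ms delayed run queued at t=0, so the work fires at the earlier time.]

```python
SCSI_QUEUE_DELAY = 3  # ms, as in the SCSI core

class ToyDelayedWork:
    """Toy model of kblockd_mod_delayed_work_on() semantics: re-queueing
    pending work steals its timer (cf. try_to_grab_pending()) and re-arms
    it, so a later delay=0 request overrides an earlier delayed one."""

    def __init__(self):
        self.deadline = None   # absolute time the work will fire, or None
        self.run_times = []

    def mod_delayed_work(self, now, delay):
        # Steal any pending timer and re-arm with the new delay.
        self.deadline = now + delay

    def tick(self, now):
        # Fire the work once its deadline has passed.
        if self.deadline is not None and now >= self.deadline:
            self.run_times.append(now)
            self.deadline = None

w = ToyDelayedWork()
w.mod_delayed_work(now=0, delay=SCSI_QUEUE_DELAY)  # delayed run at t=3
w.mod_delayed_work(now=1, delay=0)                 # immediate run wins
for t in range(6):
    w.tick(t)
```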

Thanks,
Ming


