linux-block.vger.kernel.org archive mirror
* [PATCH v2 0/2] blk-mq: Fix two causes of IO stalls found in reboot testing
@ 2020-04-02 15:51 Douglas Anderson
  2020-04-02 15:51 ` [PATCH v2 1/2] blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick Douglas Anderson
  2020-04-02 15:51 ` [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention Douglas Anderson
  0 siblings, 2 replies; 15+ messages in thread
From: Douglas Anderson @ 2020-04-02 15:51 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen
  Cc: paolo.valente, sqazi, linux-block, linux-scsi, Ming Lei, groeck,
	Douglas Anderson, Ajay Joshi, Arnd Bergmann, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov,
	Tejun Heo, linux-kernel

While doing reboot testing, I found that occasionally my device would
trigger the hung task detector.  Many tasks were stuck waiting for a
blkdev mutex, but at least one task in the system was always sitting
waiting for IO to complete (and holding the blkdev mutex).  One
example of a task that was just waiting for its IO to complete on one
reboot:

 udevd           D    0  2177    306 0x00400209
 Call trace:
  __switch_to+0x15c/0x17c
  __schedule+0x6e0/0x928
  schedule+0x8c/0xbc
  schedule_timeout+0x9c/0xfc
  io_schedule_timeout+0x24/0x48
  do_wait_for_common+0xd0/0x160
  wait_for_completion_io_timeout+0x54/0x74
  blk_execute_rq+0x9c/0xd8
  __scsi_execute+0x104/0x198
  scsi_test_unit_ready+0xa0/0x154
  sd_check_events+0xb4/0x164
  disk_check_events+0x58/0x154
  disk_clear_events+0x74/0x110
  check_disk_change+0x28/0x6c
  sd_open+0x5c/0x130
  __blkdev_get+0x20c/0x3d4
  blkdev_get+0x74/0x170
  blkdev_open+0x94/0xa8
  do_dentry_open+0x268/0x3a0
  vfs_open+0x34/0x40
  path_openat+0x39c/0xdf4
  do_filp_open+0x90/0x10c
  do_sys_open+0x150/0x3c8
  ...

I've reproduced this on two systems: one boots from an internal UFS
disk and one from eMMC.  Each has a card reader attached via USB with
an SD card plugged in.  On the USB-attached SD card is a disk with 12
partitions (a Chrome OS test image), if it matters.  The system
doesn't do much with the USB disk other than probe it (it's plugged
into my system to help me recover).

From digging, I believe that there are two separate but related
issues.  Both issues relate to the SCSI code saying that there is no
budget.

I have done testing with only one or the other of the two patches in
this series and found that I could still encounter hung tasks if only
one of the two patches was applied.  This deserves a bit of
explanation.  To me, it's fairly obvious that the first fix wouldn't
fix the problems talked about in the second patch.  However, it's less
obvious why the second patch doesn't fix the problems in
blk_mq_dispatch_rq_list().  It turns out that it _almost_ does
(problems become much more rare), but I did manage to get a single
trace where the "kick" scheduled by the second patch happened really
quickly.  The scheduled kick then ran and found nothing to do.  This
happened in parallel to a task running in blk_mq_dispatch_rq_list()
which hadn't gotten around to splicing the list back into
hctx->dispatch.  This is why we need both fixes or a heavier hammer
where we always kick whenever two threads request budget at the same
time.

Most of my testing has been atop Chrome OS 5.4's kernel tree which
currently has v5.4.28 merged in.  The Chrome OS 5.4 tree also has a
patch by Salman Qazi, namely ("block: Limit number of items taken from
the I/O scheduler in one go").  Reverting that patch didn't make the
hung tasks go away, so I kept it in for most of my testing.

I have also done some testing on mainline Linux (git describe says I'm
on v5.6-rc7-227-gf3e69428b5e2) even without Salman's patch.  I found
that I could reproduce the problems there and that traces looked about
the same as I saw on the downstream branch.  These patches were also
confirmed to fix the problems on mainline.

Chrome OS is currently set up to use the BFQ scheduler and I found that
I couldn't reproduce the problems without BFQ.  As discussed in the
second patch this is believed to be because BFQ sometimes returns
"true" from has_work() but then NULL from dispatch_request().

I'll insert my usual caveat that I'm sending patches to code that I
know very little about.  If I'm making a total bozo patch here, please
help me figure out how I should fix the problems I found in a better
way.

If you want to see a totally ridiculous amount of chatter where I
stumbled around a whole bunch trying to figure out what was wrong and
how to fix it, feel free to read <https://crbug.com/1061950>.  I
promise it will make your eyes glaze over right away if this cover
letter didn't already do that.  Specifically comment 79 in that bug
includes a link to my ugly prototype of making BFQ's has_work() more
exact (I only managed it by actually defining _both_ an exact and
inexact function to avoid circular locking problems when it was called
directly from blk_mq_hctx_has_pending()).  Comment 79 also has more
thoughts about alternatives considered.

I don't know whether the problems these patches fix are a regression
of some sort or have always been there.  As per above I could only
reproduce with BFQ enabled, which makes it nearly impossible to bisect
very far back.  I haven't listed
any "Fixes" tags here, but if someone felt it was appropriate to
backport this to some stable trees that seems like it'd be nice.
Presumably at least 5.4 stable would make sense.

Thanks to Salman Qazi, Paolo Valente, and Guenter Roeck who spent a
bunch of time helping me trawl through some of this code and reviewing
early versions of this patch.

Changes in v2:
- Replace ("scsi: core: Fix stall...") w/ ("blk-mq: Rerun dispatch...")

Douglas Anderson (2):
  blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick
  blk-mq: Rerun dispatching in the case of budget contention

 block/blk-mq-sched.c   | 26 ++++++++++++++++++++++++--
 block/blk-mq.c         | 14 +++++++++++---
 include/linux/blkdev.h |  2 ++
 3 files changed, 37 insertions(+), 5 deletions(-)

-- 
2.26.0.rc2.310.g2932bb562d-goog


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/2] blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick
  2020-04-02 15:51 [PATCH v2 0/2] blk-mq: Fix two causes of IO stalls found in reboot testing Douglas Anderson
@ 2020-04-02 15:51 ` Douglas Anderson
  2020-04-03  1:55   ` Ming Lei
  2020-04-02 15:51 ` [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention Douglas Anderson
  1 sibling, 1 reply; 15+ messages in thread
From: Douglas Anderson @ 2020-04-02 15:51 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen
  Cc: paolo.valente, sqazi, linux-block, linux-scsi, Ming Lei, groeck,
	Douglas Anderson, linux-kernel

In blk_mq_dispatch_rq_list(), if blk_mq_sched_needs_restart() returns
true and the driver returns BLK_STS_RESOURCE then we'll kick the
queue.  However, there's another case where we might need to kick it.
If we were unable to get budget we can be in much the same state as
when the driver returns BLK_STS_RESOURCE, so we should treat it the
same.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

Changes in v2: None

 block/blk-mq.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d92088dec6c3..2cd8d2b49ff4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1189,6 +1189,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	bool no_tag = false;
 	int errors, queued;
 	blk_status_t ret = BLK_STS_OK;
+	bool no_budget_avail = false;
 
 	if (list_empty(list))
 		return false;
@@ -1205,8 +1206,10 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		rq = list_first_entry(list, struct request, queuelist);
 
 		hctx = rq->mq_hctx;
-		if (!got_budget && !blk_mq_get_dispatch_budget(hctx))
+		if (!got_budget && !blk_mq_get_dispatch_budget(hctx)) {
+			no_budget_avail = true;
 			break;
+		}
 
 		if (!blk_mq_get_driver_tag(rq)) {
 			/*
@@ -1311,13 +1314,15 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		 *
 		 * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
 		 * bit is set, run queue after a delay to avoid IO stalls
-		 * that could otherwise occur if the queue is idle.
+		 * that could otherwise occur if the queue is idle.  We'll do
+		 * similar if we couldn't get budget and SCHED_RESTART is set.
 		 */
 		needs_restart = blk_mq_sched_needs_restart(hctx);
 		if (!needs_restart ||
 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
 			blk_mq_run_hw_queue(hctx, true);
-		else if (needs_restart && (ret == BLK_STS_RESOURCE))
+		else if (needs_restart && (ret == BLK_STS_RESOURCE ||
+					   no_budget_avail))
 			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
 
 		blk_mq_update_dispatch_busy(hctx, true);
-- 
2.26.0.rc2.310.g2932bb562d-goog



* [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-02 15:51 [PATCH v2 0/2] blk-mq: Fix two causes of IO stalls found in reboot testing Douglas Anderson
  2020-04-02 15:51 ` [PATCH v2 1/2] blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick Douglas Anderson
@ 2020-04-02 15:51 ` Douglas Anderson
  2020-04-03  1:33   ` Ming Lei
  1 sibling, 1 reply; 15+ messages in thread
From: Douglas Anderson @ 2020-04-02 15:51 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen
  Cc: paolo.valente, sqazi, linux-block, linux-scsi, Ming Lei, groeck,
	Douglas Anderson, Ajay Joshi, Arnd Bergmann, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov,
	Tejun Heo, linux-kernel

It is possible for two threads to be running
blk_mq_do_dispatch_sched() at the same time with the same "hctx".
This is because there can be more than one caller to
__blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't
prevent more than one thread from entering.

If more than one thread is running blk_mq_do_dispatch_sched() at the
same time with the same "hctx", they may have contention acquiring
budget.  The blk_mq_get_dispatch_budget() can eventually translate
into scsi_mq_get_budget().  If the device's "queue_depth" is 1 (not
uncommon) then only one of the two threads will be the one to
increment "device_busy" to 1 and get the budget.

The losing thread will break out of blk_mq_do_dispatch_sched() and
will stop dispatching requests.  The assumption is that when more
budget is available later (when existing transactions finish) the
queue will be kicked again, perhaps in scsi_end_request().

The winning thread now has budget and can go on to call
dispatch_request().  If dispatch_request() returns NULL here then we
have a potential problem.  Specifically we'll now call
blk_mq_put_dispatch_budget() which translates into
scsi_mq_put_budget().  That will mark the device as no longer busy but
doesn't do anything to kick the queue.  This violates the assumption
that the queue would be kicked when more budget was available.

Pictorially:

Thread A                          Thread B
================================= ==================================
blk_mq_get_dispatch_budget() => 1
dispatch_request() => NULL
                                  blk_mq_get_dispatch_budget() => 0
                                  // because Thread A marked
                                  // "device_busy" in scsi_device
blk_mq_put_dispatch_budget()

The above case was observed in reboot tests and caused a task to hang
forever waiting for IO to complete.  Traces showed that in fact two
tasks were running blk_mq_do_dispatch_sched() at the same time with
the same "hctx".  The task that got the budget did in fact see
dispatch_request() return NULL.  Both tasks returned and the system
went on for several minutes (until the hung task delay kicked in)
without the given "hctx" showing up again in traces.

Let's attempt to fix this problem by detecting if there was contention
for the budget in the case where we put the budget without dispatching
anything.  If we saw contention we kick all hctx's associated with the
queue where there was contention.  We do this without any locking by
adding a double-check for budget and accepting a small amount of faux
contention if the 2nd check gives us budget but then we don't dispatch
anything (we'll look like we contended with ourselves).

A few extra notes:

- This whole thing is only a problem due to the inexact nature of
  has_work().  Specifically if has_work() always guaranteed that a
  "true" return meant that dispatch_request() would return non-NULL
  then we wouldn't have this problem.  That's because we only grab the
  budget if has_work() returned true.  If we had the non-NULL
  guarantee then at least one of the threads would actually dispatch
  work and when the work was done then queues would be kicked
  normally.

- One specific I/O scheduler that trips this problem quite a bit is
  BFQ which definitely returns "true" for has_work() in cases where it
  wouldn't dispatch.  Making BFQ's has_work() more exact requires that
  has_work() becomes a much heavier function, including figuring out
  how to acquire spinlocks in has_work() without tripping circular
  lock dependencies.  This is prototyped but it's unclear if it's
  really the way to go when the problem can be solved with a
  relatively lightweight contention detection mechanism.

- Because this problem only trips with inexact has_work() it's
  believed that blk_mq_do_dispatch_ctx() does not need a similar
  change.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

Changes in v2:
- Replace ("scsi: core: Fix stall...") w/ ("blk-mq: Rerun dispatch...")

 block/blk-mq-sched.c   | 26 ++++++++++++++++++++++++--
 block/blk-mq.c         |  3 +++
 include/linux/blkdev.h |  2 ++
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 74cedea56034..0195d75f5f96 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
 			break;
 
-		if (!blk_mq_get_dispatch_budget(hctx))
-			break;
+
+		if (!blk_mq_get_dispatch_budget(hctx)) {
+			/*
+			 * We didn't get budget so set contention.  If
+			 * someone else had the budget but didn't dispatch
+			 * they'll kick everything.  NOTE: we check one
+			 * extra time _after_ setting contention to fully
+			 * close the race.  If we don't actually dispatch
 +			 * we'll detect faux contention (with ourselves)
+			 * but that should be rare.
+			 */
+			atomic_set(&q->budget_contention, 1);
+
+			if (!blk_mq_get_dispatch_budget(hctx))
+				break;
+		}
 
 		rq = e->type->ops.dispatch_request(hctx);
 		if (!rq) {
 			blk_mq_put_dispatch_budget(hctx);
+
+			/*
+			 * We've released the budget but us holding it might
+			 * have prevented someone else from dispatching.
+			 * Detect that case and run all queues again.
+			 */
+			if (atomic_read(&q->budget_contention))
+				blk_mq_run_hw_queues(q, true);
 			break;
 		}
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2cd8d2b49ff4..6163c43ceca5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
+	/* We're running the queues, so clear the contention detector */
+	atomic_set(&q->budget_contention, 0);
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (blk_mq_hctx_stopped(hctx))
 			continue;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f629d40c645c..07f21e45d993 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -583,6 +583,8 @@ struct request_queue {
 
 #define BLK_MAX_WRITE_HINTS	5
 	u64			write_hints[BLK_MAX_WRITE_HINTS];
+
+	atomic_t		budget_contention;
 };
 
 #define QUEUE_FLAG_STOPPED	0	/* queue is stopped */
-- 
2.26.0.rc2.310.g2932bb562d-goog



* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-02 15:51 ` [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention Douglas Anderson
@ 2020-04-03  1:33   ` Ming Lei
  2020-04-03 15:10     ` Doug Anderson
  0 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2020-04-03  1:33 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: axboe, jejb, martin.petersen, paolo.valente, sqazi, linux-block,
	linux-scsi, groeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov,
	Tejun Heo, linux-kernel

On Thu, Apr 02, 2020 at 08:51:30AM -0700, Douglas Anderson wrote:
> It is possible for two threads to be running
> blk_mq_do_dispatch_sched() at the same time with the same "hctx".
> This is because there can be more than one caller to
> __blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't
> prevent more than one thread from entering.
> 
> If more than one thread is running blk_mq_do_dispatch_sched() at the
> same time with the same "hctx", they may have contention acquiring
> budget.  The blk_mq_get_dispatch_budget() can eventually translate
> into scsi_mq_get_budget().  If the device's "queue_depth" is 1 (not
> uncommon) then only one of the two threads will be the one to
> increment "device_busy" to 1 and get the budget.
> 
> The losing thread will break out of blk_mq_do_dispatch_sched() and
> will stop dispatching requests.  The assumption is that when more
> budget is available later (when existing transactions finish) the
> queue will be kicked again, perhaps in scsi_end_request().
> 
> The winning thread now has budget and can go on to call
> dispatch_request().  If dispatch_request() returns NULL here then we
> have a potential problem.  Specifically we'll now call

As I mentioned before, it is a BFQ-specific issue: BFQ tells blk-mq
that it has work to do, and even though the budget is assigned to the
winning thread, dispatch_request() still returns NULL.

> blk_mq_put_dispatch_budget() which translates into
> scsi_mq_put_budget().  That will mark the device as no longer busy but
> doesn't do anything to kick the queue.  This violates the assumption
> that the queue would be kicked when more budget was available.

The queue is still kicked by BFQ in its idle timer, however that
timer doesn't make forward progress.

Without this idle timer, it is simply a BFQ issue.

> 
> Pictorially:
> 
> Thread A                          Thread B
> ================================= ==================================
> blk_mq_get_dispatch_budget() => 1
> dispatch_request() => NULL
>                                   blk_mq_get_dispatch_budget() => 0
>                                   // because Thread A marked
>                                   // "device_busy" in scsi_device
> blk_mq_put_dispatch_budget()

What if there is only thread A? You need to mention that thread B
is from BFQ.

> 
> The above case was observed in reboot tests and caused a task to hang
> forever waiting for IO to complete.  Traces showed that in fact two
> tasks were running blk_mq_do_dispatch_sched() at the same time with
> the same "hctx".  The task that got the budget did in fact see
> dispatch_request() return NULL.  Both tasks returned and the system
> went on for several minutes (until the hung task delay kicked in)
> without the given "hctx" showing up again in traces.
> 
> Let's attempt to fix this problem by detecting if there was contention
> for the budget in the case where we put the budget without dispatching
> anything.  If we saw contention we kick all hctx's associated with the
> queue where there was contention.  We do this without any locking by
> adding a double-check for budget and accepting a small amount of faux
> contention if the 2nd check gives us budget but then we don't dispatch
> anything (we'll look like we contended with ourselves).
> 
> A few extra notes:
> 
> - This whole thing is only a problem due to the inexact nature of
>   has_work().  Specifically if has_work() always guaranteed that a
>   "true" return meant that dispatch_request() would return non-NULL
>   then we wouldn't have this problem.  That's because we only grab the
>   budget if has_work() returned true.  If we had the non-NULL
>   guarantee then at least one of the threads would actually dispatch
>   work and when the work was done then queues would be kicked
>   normally.
> 
> - One specific I/O scheduler that trips this problem quite a bit is
>   BFQ which definitely returns "true" for has_work() in cases where it
>   wouldn't dispatch.  Making BFQ's has_work() more exact requires that
>   has_work() becomes a much heavier function, including figuring out
>   how to acquire spinlocks in has_work() without tripping circular
>   lock dependencies.  This is prototyped but it's unclear if it's
>   really the way to go when the problem can be solved with a
>   relatively lightweight contention detection mechanism.
> 
> - Because this problem only trips with inexact has_work() it's
>   believed that blk_mq_do_dispatch_ctx() does not need a similar
>   change.

Right, I prefer to fix it in BFQ: given it isn't a generic issue,
it isn't worth a generic solution.

> 
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> 
> Changes in v2:
> - Replace ("scsi: core: Fix stall...") w/ ("blk-mq: Rerun dispatch...")
> 
>  block/blk-mq-sched.c   | 26 ++++++++++++++++++++++++--
>  block/blk-mq.c         |  3 +++
>  include/linux/blkdev.h |  2 ++
>  3 files changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 74cedea56034..0195d75f5f96 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
>  			break;
>  
> -		if (!blk_mq_get_dispatch_budget(hctx))
> -			break;
> +
> +		if (!blk_mq_get_dispatch_budget(hctx)) {
> +			/*
> +			 * We didn't get budget so set contention.  If
> +			 * someone else had the budget but didn't dispatch
> +			 * they'll kick everything.  NOTE: we check one
> +			 * extra time _after_ setting contention to fully
> +			 * close the race.  If we don't actually dispatch
> +			 * we'll detect faux contention (with ourselves)
> +			 * but that should be rare.
> +			 */
> +			atomic_set(&q->budget_contention, 1);
> +
> +			if (!blk_mq_get_dispatch_budget(hctx))

scsi_mq_get_budget() implies a smp_mb(), so the barrier can order
between blk_mq_get_dispatch_budget() and atomic_set(&q->budget_contention, 0|1).

> +				break;
> +		}
>  
>  		rq = e->type->ops.dispatch_request(hctx);
>  		if (!rq) {
>  			blk_mq_put_dispatch_budget(hctx);
> +
> +			/*
> +			 * We've released the budget but us holding it might
> +			 * have prevented someone else from dispatching.
> +			 * Detect that case and run all queues again.
> +			 */
> +			if (atomic_read(&q->budget_contention))

scsi_mq_put_budget() doesn't imply smp_mb(), so one smp_mb__after_atomic()
is needed between the above two ops.

> +				blk_mq_run_hw_queues(q, true);
>  			break;
>  		}
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2cd8d2b49ff4..6163c43ceca5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
>  	struct blk_mq_hw_ctx *hctx;
>  	int i;
>  
> +	/* We're running the queues, so clear the contention detector */
> +	atomic_set(&q->budget_contention, 0);
> +

You add extra cost in fast path.

>  	queue_for_each_hw_ctx(q, hctx, i) {
>  		if (blk_mq_hctx_stopped(hctx))
>  			continue;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index f629d40c645c..07f21e45d993 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -583,6 +583,8 @@ struct request_queue {
>  
>  #define BLK_MAX_WRITE_HINTS	5
>  	u64			write_hints[BLK_MAX_WRITE_HINTS];
> +
> +	atomic_t		budget_contention;

It needn't be an atomic variable; a simple 'unsigned int' should be
fine.  What matters is the ordering between R/W of this flag and
get/put budget.


thanks,
Ming



* Re: [PATCH v2 1/2] blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick
  2020-04-02 15:51 ` [PATCH v2 1/2] blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick Douglas Anderson
@ 2020-04-03  1:55   ` Ming Lei
  0 siblings, 0 replies; 15+ messages in thread
From: Ming Lei @ 2020-04-03  1:55 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: axboe, jejb, martin.petersen, paolo.valente, sqazi, linux-block,
	linux-scsi, groeck, linux-kernel

On Thu, Apr 02, 2020 at 08:51:29AM -0700, Douglas Anderson wrote:
> In blk_mq_dispatch_rq_list(), if blk_mq_sched_needs_restart() returns
> true and the driver returns BLK_STS_RESOURCE then we'll kick the
> queue.  However, there's another case where we might need to kick it.
> If we were unable to get budget we can be in much the same state as
> when the driver returns BLK_STS_RESOURCE, so we should treat it the
> same.

The situation is really similar, especially since the restart handler
may not see the request, which may not have been added to
hctx->dispatch yet, so

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks, 
Ming



* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-03  1:33   ` Ming Lei
@ 2020-04-03 15:10     ` Doug Anderson
  2020-04-03 15:49       ` Doug Anderson
  0 siblings, 1 reply; 15+ messages in thread
From: Doug Anderson @ 2020-04-03 15:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, James E.J. Bottomley, Martin K. Petersen,
	Paolo Valente, Salman Qazi, linux-block, linux-scsi,
	Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov,
	Tejun Heo, LKML

Hi,

On Thu, Apr 2, 2020 at 6:34 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Thu, Apr 02, 2020 at 08:51:30AM -0700, Douglas Anderson wrote:
> > It is possible for two threads to be running
> > blk_mq_do_dispatch_sched() at the same time with the same "hctx".
> > This is because there can be more than one caller to
> > __blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't
> > prevent more than one thread from entering.
> >
> > If more than one thread is running blk_mq_do_dispatch_sched() at the
> > same time with the same "hctx", they may have contention acquiring
> > budget.  The blk_mq_get_dispatch_budget() can eventually translate
> > into scsi_mq_get_budget().  If the device's "queue_depth" is 1 (not
> > uncommon) then only one of the two threads will be the one to
> > increment "device_busy" to 1 and get the budget.
> >
> > The losing thread will break out of blk_mq_do_dispatch_sched() and
> > will stop dispatching requests.  The assumption is that when more
> > budget is available later (when existing transactions finish) the
> > queue will be kicked again, perhaps in scsi_end_request().
> >
> > The winning thread now has budget and can go on to call
> > dispatch_request().  If dispatch_request() returns NULL here then we
> > have a potential problem.  Specifically we'll now call
>
> As I mentioned before, it is a BFQ-specific issue: BFQ tells blk-mq
> that it has work to do, and even though the budget is assigned to the
> winning thread, dispatch_request() still returns NULL.

Correct that it only happens with BFQ, but whether it's a BFQ bug or
not just depends on how you define the has_work() API.  If has_work()
is allowed to be inexact then it's either a blk-mq bug or a SCSI bug
depending on how you cut it.  If has_work() must be exact then it is
certainly a BFQ bug.  If has_work() doesn't need to be exact then it's
not a BFQ bug.  I believe that a sane API could be defined either way.
Either has_work() can be defined as a lightweight hint to trigger
heavier code or it can be defined as something exact.  It's really up
to blk-mq to say how they define it.

From his response on the SCSI patch [1], it sounded like Jens was OK
with has_work() being a lightweight hint as long as BFQ ensures that
the queues run later.  ...but, as my investigation found, I believe
that BFQ _does_ try to ensure that the queue is run at a later time by
calling blk_mq_run_hw_queues().  The irony is that due to the race
we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to
be reliable if has_work() is inexact.  :(  One way to address this is
to make blk_mq_run_hw_queues() reliable even if has_work() is inexact.

...so Jens: care to clarify how you'd like has_work() to be defined?


> > - Because this problem only trips with inexact has_work() it's
> >   believed that blk_mq_do_dispatch_ctx() does not need a similar
> >   change.
>
> Right, I prefer to fix it in BFQ: given it isn't a generic issue,
> it isn't worth a generic solution.

Just to confirm: it sounds like you are saying that BFQ is not a first
class citizen here because not everyone uses BFQ.  Is that really the
best policy?  Couldn't I say that SCSI shouldn't be a first class
citizen because not everyone uses SCSI?  In such a case I could argue
that we should speed up the blk-mq layer by removing all the budget
code since SCSI is the only user.  I'm not actually suggesting it,
only pointing out that sometimes we need to make allowances in the
code.


> > @@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >               if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
> >                       break;
> >
> > -             if (!blk_mq_get_dispatch_budget(hctx))
> > -                     break;
> > +
> > +             if (!blk_mq_get_dispatch_budget(hctx)) {
> > +                     /*
> > +                      * We didn't get budget so set contention.  If
> > +                      * someone else had the budget but didn't dispatch
> > +                      * they'll kick everything.  NOTE: we check one
> > +                      * extra time _after_ setting contention to fully
> > +                      * close the race.  If we don't actually dispatch
> > +                      * we'll detect faux contention (with ourselves)
> > +                      * but that should be rare.
> > +                      */
> > +                     atomic_set(&q->budget_contention, 1);
> > +
> > +                     if (!blk_mq_get_dispatch_budget(hctx))
>
> scsi_mq_get_budget() implies a smp_mb(), so the barrier can order
> between blk_mq_get_dispatch_budget() and atomic_set(&q->budget_contention, 0|1).

I always struggle when working with memory barriers.  I think you are
saying that this section of code is OK, though, presumably because the
SCSI code does "atomic_inc_return" which implies this barrier.

Do you happen to know if it's documented that anyone who implements
get_budget() for "struct blk_mq_ops" will have a memory barrier here?
I know SCSI is the only existing user, but we'd want to keep it
generic.


> > +                             break;
> > +             }
> >
> >               rq = e->type->ops.dispatch_request(hctx);
> >               if (!rq) {
> >                       blk_mq_put_dispatch_budget(hctx);
> > +
> > +                     /*
> > +                      * We've released the budget but us holding it might
> > +                      * have prevented someone else from dispatching.
> > +                      * Detect that case and run all queues again.
> > +                      */
> > +                     if (atomic_read(&q->budget_contention))
>
> scsi_mq_put_budget() doesn't imply smp_mb(), so one smp_mb__after_atomic()
> is needed between the above two ops.

OK, thanks.  I will add that if we decide to move forward with this
patch.  Again I'd wonder if this type of thing should be documented.


> > +                             blk_mq_run_hw_queues(q, true);
> >                       break;
> >               }
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 2cd8d2b49ff4..6163c43ceca5 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
> >       struct blk_mq_hw_ctx *hctx;
> >       int i;
> >
> > +     /* We're running the queues, so clear the contention detector */
> > +     atomic_set(&q->budget_contention, 0);
> > +
>
> You add extra cost in fast path.

Yes, but it is a fairly minor cost.  It's called once, in a place
where we're unlikely to be looping and where we're about to do much
heavier operations (perhaps in a loop).  This doesn't feel like a
dealbreaker.


> >       queue_for_each_hw_ctx(q, hctx, i) {
> >               if (blk_mq_hctx_stopped(hctx))
> >                       continue;
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index f629d40c645c..07f21e45d993 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -583,6 +583,8 @@ struct request_queue {
> >
> >  #define BLK_MAX_WRITE_HINTS  5
> >       u64                     write_hints[BLK_MAX_WRITE_HINTS];
> > +
> > +     atomic_t                budget_contention;
>
> It needn't be an atomic variable; a simple 'unsigned int' should be
> fine.  What matters is the ordering between reading/writing this flag
> and getting/putting the budget.

Apparently I'm the only one who feels atomic_t is more self-documenting
here.  I will cede to the will of the majority and change it to
'unsigned int' if we decide to move forward with this patch.

-Doug

[1] https://lore.kernel.org/r/d6af2344-11f7-5862-daed-e21cbd496d92@kernel.dk

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-03 15:10     ` Doug Anderson
@ 2020-04-03 15:49       ` Doug Anderson
  2020-04-05  9:14         ` Ming Lei
  0 siblings, 1 reply; 15+ messages in thread
From: Doug Anderson @ 2020-04-03 15:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, James E.J. Bottomley, Martin K. Petersen,
	Paolo Valente, Salman Qazi, linux-block, linux-scsi,
	Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov,
	Tejun Heo, LKML

Hi,

On Fri, Apr 3, 2020 at 8:10 AM Doug Anderson <dianders@chromium.org> wrote:
>
> Correct that it only happens with BFQ, but whether it's a BFQ bug or
> not just depends on how you define the has_work() API.  If has_work()
> is allowed to be inexact then it's either a blk-mq bug or a SCSI bug
> depending on how you cut it.  If has_work() must be exact then it is
> certainly a BFQ bug.  If has_work() doesn't need to be exact then it's
> not a BFQ bug.  I believe that a sane API could be defined either way.
> Either has_work() can be defined as a lightweight hint to trigger
> heavier code or it can be defined as something exact.  It's really up
> to blk-mq to say how they define it.
>
> From his response on the SCSI patch [1], it sounded like Jens was OK
> with has_work() being a lightweight hint as long as BFQ ensures that
> the queues run later.  ...but, as my investigation found, I believe
> that BFQ _does_ try to ensure that the queue is run at a later time by
> calling blk_mq_run_hw_queues().  The irony is that due to the race
> we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to
> be reliable if has_work() is inexact.  :(  One way to address this is
> to make blk_mq_run_hw_queues() reliable even if has_work() is inexact.
>
> ...so Jens: care to clarify how you'd like has_work() to be defined?

Sorry to reply so quickly after my prior reply, but I might have found
an extreme corner case where we can still run into the same race even
if has_work() is exact.  This is all theoretical from code analysis.
Maybe you can poke a hole in my scenario or tell me it's so
implausible that we don't care, but it seems like it's theoretically
possible.  For this example I'll assume a budget of 1 (AKA only one
thread can get budget for a given queue):

* Threads A and B both run has_work() at the same time with the same
  "hctx".  has_work() is exact but there's no lock, so it's OK if
  Thread A and B both get back true.

* Thread B gets interrupted for a long time right after it decides
  that there is work.  Maybe its CPU gets an interrupt and the
  interrupt handler is slow.

* Thread A runs, gets budget, dispatches work.

* Thread A's work finishes and budget is released.

* Thread B finally runs again and gets budget.

* Since Thread A already took care of the work and no new work has
  come in, Thread B will get NULL from dispatch_request().  I believe
  this is specifically why dispatch_request() is allowed to return
  NULL in the first place if has_work() must be exact.

* Thread B will now be holding the budget and is about to call
  put_budget(), but hasn't called it yet.

* Thread B gets interrupted for a long time (again).  Dang interrupts.

* Now Thread C (with a different "hctx" but the same queue) comes
  along and runs blk_mq_do_dispatch_sched().

* Thread C won't do anything because it can't get budget.

* Finally Thread B will run again and put the budget without kicking
  any queues.

Now we have a potential I/O stall because nobody will ever kick the
queues.


I think the above example could happen even on non-BFQ systems and I
think it would also be fixed by an approach like the one in my patch.

-Doug


* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-03 15:49       ` Doug Anderson
@ 2020-04-05  9:14         ` Ming Lei
  2020-04-05 14:00           ` Doug Anderson
  2020-04-05 16:26           ` Doug Anderson
  0 siblings, 2 replies; 15+ messages in thread
From: Ming Lei @ 2020-04-05  9:14 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Jens Axboe, James E.J. Bottomley, Martin K. Petersen,
	Paolo Valente, Salman Qazi, linux-block, linux-scsi,
	Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov,
	Tejun Heo, LKML

On Fri, Apr 03, 2020 at 08:49:54AM -0700, Doug Anderson wrote:
> Hi,
> 
> On Fri, Apr 3, 2020 at 8:10 AM Doug Anderson <dianders@chromium.org> wrote:
> >
> > Correct that it only happens with BFQ, but whether it's a BFQ bug or
> > not just depends on how you define the has_work() API.  If has_work()
> > is allowed to be in-exact then it's either a blk-mq bug or a SCSI bug
> > depending on how you cut it.  If has_work() must be exact then it is
> > certainly a BFQ bug.  If has_work() doesn't need to be exact then it's
> > not a BFQ bug.  I believe that a sane API could be defined either way.
> > Either has_work() can be defined as a lightweight hint to trigger
> > heavier code or it can be defined as something exact.  It's really up
> > to blk-mq to say how they define it.
> >
> > From his response on the SCSI patch [1], it sounded like Jens was OK
> > with has_work() being a lightweight hint as long as BFQ ensures that
> > the queues run later.  ...but, as my investigation found, I believe
> > that BFQ _does_ try to ensure that the queue is run at a later time by
> > calling blk_mq_run_hw_queues().  The irony is that due to the race
> > we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to
> > be reliable if has_work() is inexact.  :(  One way to address this is
> > to make blk_mq_run_hw_queues() reliable even if has_work() is inexact.
> >
> > ...so Jens: care to clarify how you'd like has_work() to be defined?
> 
> Sorry to reply so quickly after my prior reply, but I might have found
> an extreme corner case where we can still run into the same race even
> if has_work() is exact.  This is all theoretical from code analysis.
> Maybe you can poke a hole in my scenario or tell me it's so
> implausible that we don't care, but it seems like it's theoretically
> possible.  For this example I'll assume a budget of 1 (AKA only one
> thread can get budget for a given queue):
> 
> * Threads A and B both run has_work() at the same time with the same
>   "hctx".  has_work() is exact but there's no lock, so it's OK if
>   Thread A and B both get back true.
> 
> * Thread B gets interrupted for a long time right after it decides
>   that there is work.  Maybe its CPU gets an interrupt and the
>   interrupt handler is slow.
> 
> * Thread A runs, gets budget, dispatches work.
> 
> * Thread A's work finishes and budget is released.
> 
> * Thread B finally runs again and gets budget.
> 
> * Since Thread A already took care of the work and no new work has
>   come in, Thread B will get NULL from dispatch_request().  I believe
>   this is specifically why dispatch_request() is allowed to return
>   NULL in the first place if has_work() must be exact.
> 
> * Thread B will now be holding the budget and is about to call
>   put_budget(), but hasn't called it yet.
> 
> * Thread B gets interrupted for a long time (again).  Dang interrupts.
> 
> * Now Thread C (with a different "hctx" but the same queue) comes
>   along and runs blk_mq_do_dispatch_sched().
> 
> * Thread C won't do anything because it can't get budget.
> 
> * Finally Thread B will run again and put the budget without kicking
>   any queues.
> 
> Now we have a potential I/O stall because nobody will ever kick the
> queues.
> 
> 
> I think the above example could happen even on non-BFQ systems and I
> think it would also be fixed by an approach like the one in my patch.

OK, it looks like this isn't specific to BFQ any more.

Here is another candidate approach for this issue.  Given that it is so
hard to trigger, we can make things more reliable by rerunning the queue
when has_work() returns true after ops->dispatch_request() returns NULL.

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 74cedea56034..4408e5d4fcd8 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
        blk_mq_run_hw_queue(hctx, true);
 }

+#define BLK_MQ_BUDGET_DELAY    3               /* ms units */
 /*
  * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
  * its queue by itself in its completion handler, so we don't need to
@@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
                rq = e->type->ops.dispatch_request(hctx);
                if (!rq) {
                        blk_mq_put_dispatch_budget(hctx);
+
+                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
+                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
                        break;
                }


Thanks, 
Ming



* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-05  9:14         ` Ming Lei
@ 2020-04-05 14:00           ` Doug Anderson
  2020-04-05 14:57             ` Paolo Valente
  2020-04-05 16:26           ` Doug Anderson
  1 sibling, 1 reply; 15+ messages in thread
From: Doug Anderson @ 2020-04-05 14:00 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, James E.J. Bottomley, Martin K. Petersen,
	Paolo Valente, Salman Qazi, linux-block, linux-scsi,
	Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov,
	Tejun Heo, LKML

Hi,

On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> OK, looks it isn't specific on BFQ any more.
>
> Follows another candidate approach for this issue, given it is so hard
> to trigger, we can make it more reliable by rerun queue when has_work()
> returns true after ops->dispath_request() returns NULL.
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 74cedea56034..4408e5d4fcd8 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
>         blk_mq_run_hw_queue(hctx, true);
>  }
>
> +#define BLK_MQ_BUDGET_DELAY    3               /* ms units */
>  /*
>   * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
>   * its queue by itself in its completion handler, so we don't need to
> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>                 rq = e->type->ops.dispatch_request(hctx);
>                 if (!rq) {
>                         blk_mq_put_dispatch_budget(hctx);
> +
> +                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
> +                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);

I agree that your patch should solve the race.  With the current BFQ's
has_work() it's a bit of a disaster though. It will essentially put
blk-mq into a busy-wait loop (with a 3 ms delay between each poll)
while BFQ's has_work() says "true" but BFQ doesn't dispatch anything.

...so I guess the question that still needs to be answered: does
has_work() need to be exact?  If so then we need the patch you propose
plus one to BFQ.  If not, we should continue along the lines of my
patch.

-Doug


* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-05 14:00           ` Doug Anderson
@ 2020-04-05 14:57             ` Paolo Valente
  2020-04-05 16:16               ` Doug Anderson
  0 siblings, 1 reply; 15+ messages in thread
From: Paolo Valente @ 2020-04-05 14:57 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Ming Lei, Jens Axboe, James E.J. Bottomley, Martin K. Petersen,
	Salman Qazi, linux-block, linux-scsi, Guenter Roeck, Ajay Joshi,
	Arnd Bergmann, Bart Van Assche, Chaitanya Kulkarni,
	Damien Le Moal, Hou Tao, Pavel Begunkov, Tejun Heo, LKML



> Il giorno 5 apr 2020, alle ore 16:00, Doug Anderson <dianders@chromium.org> ha scritto:
> 
> Hi,
> 
> On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote:
>> 
>> OK, looks it isn't specific on BFQ any more.
>> 
>> Follows another candidate approach for this issue, given it is so hard
>> to trigger, we can make it more reliable by rerun queue when has_work()
>> returns true after ops->dispath_request() returns NULL.
>> 
>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>> index 74cedea56034..4408e5d4fcd8 100644
>> --- a/block/blk-mq-sched.c
>> +++ b/block/blk-mq-sched.c
>> @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
>>        blk_mq_run_hw_queue(hctx, true);
>> }
>> 
>> +#define BLK_MQ_BUDGET_DELAY    3               /* ms units */
>> /*
>>  * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
>>  * its queue by itself in its completion handler, so we don't need to
>> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>>                rq = e->type->ops.dispatch_request(hctx);
>>                if (!rq) {
>>                        blk_mq_put_dispatch_budget(hctx);
>> +
>> +                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
>> +                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
> 
> I agree that your patch should solve the race.  With the current BFQ's
> has_work() it's a bit of a disaster though. It will essentially put
> blk-mq into a busy-wait loop (with a 3 ms delay between each poll)
> while BFQ's has_work() says "true" but BFQ doesn't dispatch anything.
> 
> ...so I guess the question that still needs to be answered: does
> has_work() need to be exact?  If so then we need the patch you propose
> plus one to BFQ.  If not, we should continue along the lines of my
> patch.
> 

Some more comments.  BFQ's I/O plugging lasts 9 ms by default.  So,
with Ming's last patch, BFQ may happen to be polled every 3 ms, for at
most three times.

On the opposite end, making bfq_has_work() plugging-aware costs more
complexity, and possibly one more lock.  While it avoids the above
occasional polling, it may imply a lot of overhead or CPU stalls on
every dispatch.

Paolo
 
> -Doug



* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-05 14:57             ` Paolo Valente
@ 2020-04-05 16:16               ` Doug Anderson
  2020-04-07 10:43                 ` Paolo Valente
  0 siblings, 1 reply; 15+ messages in thread
From: Doug Anderson @ 2020-04-05 16:16 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Ming Lei, Jens Axboe, James E.J. Bottomley, Martin K. Petersen,
	Salman Qazi, linux-block, linux-scsi, Guenter Roeck, Ajay Joshi,
	Arnd Bergmann, Bart Van Assche, Chaitanya Kulkarni,
	Damien Le Moal, Hou Tao, Pavel Begunkov, Tejun Heo, LKML

Hi,

On Sun, Apr 5, 2020 at 7:55 AM Paolo Valente <paolo.valente@linaro.org> wrote:
>
> > Il giorno 5 apr 2020, alle ore 16:00, Doug Anderson <dianders@chromium.org> ha scritto:
> >
> > Hi,
> >
> > On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote:
> >>
> >> OK, looks it isn't specific on BFQ any more.
> >>
> >> Follows another candidate approach for this issue, given it is so hard
> >> to trigger, we can make it more reliable by rerun queue when has_work()
> >> returns true after ops->dispath_request() returns NULL.
> >>
> >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> >> index 74cedea56034..4408e5d4fcd8 100644
> >> --- a/block/blk-mq-sched.c
> >> +++ b/block/blk-mq-sched.c
> >> @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
> >>        blk_mq_run_hw_queue(hctx, true);
> >> }
> >>
> >> +#define BLK_MQ_BUDGET_DELAY    3               /* ms units */
> >> /*
> >>  * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
> >>  * its queue by itself in its completion handler, so we don't need to
> >> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >>                rq = e->type->ops.dispatch_request(hctx);
> >>                if (!rq) {
> >>                        blk_mq_put_dispatch_budget(hctx);
> >> +
> >> +                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
> >> +                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
> >
> > I agree that your patch should solve the race.  With the current BFQ's
> > has_work() it's a bit of a disaster though. It will essentially put
> > blk-mq into a busy-wait loop (with a 3 ms delay between each poll)
> > while BFQ's has_work() says "true" but BFQ doesn't dispatch anything.
> >
> > ...so I guess the question that still needs to be answered: does
> > has_work() need to be exact?  If so then we need the patch you propose
> > plus one to BFQ.  If not, we should continue along the lines of my
> > patch.
> >
>
> Some more comments.  BFQ's I/O plugging lasts 9 ms by default.  So,
> with this last Ming's patch, BFQ may happen to be polled every 3ms,
> for at most three times.

Ah!  I did not know this.  OK, then Ming's patch seems like it should
work.  If nothing else it should fix the problem.  If this ends up
making BFQ chew up too much CPU time then presumably someone will
notice and BFQ's has_work() can be improved.

Ming: how do you want to proceed?  Do you want to formally post the
patch?  Do you want me to post a v3 of my series where I replace patch
#2 with your patch?  Do you want authorship (which implies adding your
Signed-off-by)?


> On the opposite end, making bfq_has_work plugging aware costs more
> complexity, and possibly one more lock.  While avoiding the above
> occasional polling, this may imply a lot of overhead or CPU stalls on
> every dispatch.

I still think it would be interesting to run performance tests with my
proof-of-concept solution for has_work().  Even if it's not ideal,
knowing whether performance increased, decreased, or stayed the same
would give information about how much more effort should be put into
this.

-Doug


* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-05  9:14         ` Ming Lei
  2020-04-05 14:00           ` Doug Anderson
@ 2020-04-05 16:26           ` Doug Anderson
  2020-04-07  2:14             ` Ming Lei
  1 sibling, 1 reply; 15+ messages in thread
From: Doug Anderson @ 2020-04-05 16:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, James E.J. Bottomley, Martin K. Petersen,
	Paolo Valente, Salman Qazi, linux-block, linux-scsi,
	Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov,
	Tejun Heo, LKML

Hi,

On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>                 rq = e->type->ops.dispatch_request(hctx);
>                 if (!rq) {
>                         blk_mq_put_dispatch_budget(hctx);
> +
> +                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
> +                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);

To really close the race, don't we need to run all the queues
associated with the hctx?  I haven't traced it through, but I've been
assuming that the multiple "hctx"s associated with the same queue will
have the same budget associated with them and thus they can block each
other out.

-Doug


* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-05 16:26           ` Doug Anderson
@ 2020-04-07  2:14             ` Ming Lei
  2020-04-07 22:04               ` Doug Anderson
  0 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2020-04-07  2:14 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Jens Axboe, James E.J. Bottomley, Martin K. Petersen,
	Paolo Valente, Salman Qazi, linux-block, linux-scsi,
	Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov,
	Tejun Heo, LKML

On Sun, Apr 05, 2020 at 09:26:39AM -0700, Doug Anderson wrote:
> Hi,
> 
> On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >                 rq = e->type->ops.dispatch_request(hctx);
> >                 if (!rq) {
> >                         blk_mq_put_dispatch_budget(hctx);
> > +
> > +                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
> > +                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
> 
> To really close the race, don't we need to run all the queues
> associated with the hctx?  I haven't traced it through, but I've been
> assuming that the multiple "hctx"s associated with the same queue will
> have the same budget associated with them and thus they can block each
> other out.

Yeah, we should run all hctxs which share the same budget space.

Also, in theory, we don't have to add the delay; however, BFQ may plug
the dispatch for 9 ms, so it looks like a delayed queue run is required.

thanks,
Ming



* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-05 16:16               ` Doug Anderson
@ 2020-04-07 10:43                 ` Paolo Valente
  0 siblings, 0 replies; 15+ messages in thread
From: Paolo Valente @ 2020-04-07 10:43 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Ming Lei, Jens Axboe, James E.J. Bottomley, Martin K. Petersen,
	Salman Qazi, linux-block, linux-scsi, Guenter Roeck, Ajay Joshi,
	Arnd Bergmann, Bart Van Assche, Chaitanya Kulkarni,
	Damien Le Moal, Hou Tao, Pavel Begunkov, Tejun Heo, LKML



> Il giorno 5 apr 2020, alle ore 18:16, Doug Anderson <dianders@chromium.org> ha scritto:
> 
> Hi,
> 
> On Sun, Apr 5, 2020 at 7:55 AM Paolo Valente <paolo.valente@linaro.org> wrote:
>> 
>>> Il giorno 5 apr 2020, alle ore 16:00, Doug Anderson <dianders@chromium.org> ha scritto:
>>> 
>>> Hi,
>>> 
>>> On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote:
>>>> 
>>>> OK, looks it isn't specific on BFQ any more.
>>>> 
>>>> Follows another candidate approach for this issue, given it is so hard
>>>> to trigger, we can make it more reliable by rerun queue when has_work()
>>>> returns true after ops->dispath_request() returns NULL.
>>>> 
>>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>>>> index 74cedea56034..4408e5d4fcd8 100644
>>>> --- a/block/blk-mq-sched.c
>>>> +++ b/block/blk-mq-sched.c
>>>> @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
>>>>       blk_mq_run_hw_queue(hctx, true);
>>>> }
>>>> 
>>>> +#define BLK_MQ_BUDGET_DELAY    3               /* ms units */
>>>> /*
>>>> * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
>>>> * its queue by itself in its completion handler, so we don't need to
>>>> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>>>>               rq = e->type->ops.dispatch_request(hctx);
>>>>               if (!rq) {
>>>>                       blk_mq_put_dispatch_budget(hctx);
>>>> +
>>>> +                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
>>>> +                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
>>> 
>>> I agree that your patch should solve the race.  With the current BFQ's
>>> has_work() it's a bit of a disaster though. It will essentially put
>>> blk-mq into a busy-wait loop (with a 3 ms delay between each poll)
>>> while BFQ's has_work() says "true" but BFQ doesn't dispatch anything.
>>> 
>>> ...so I guess the question that still needs to be answered: does
>>> has_work() need to be exact?  If so then we need the patch you propose
>>> plus one to BFQ.  If not, we should continue along the lines of my
>>> patch.
>>> 
>> 
>> Some more comments.  BFQ's I/O plugging lasts 9 ms by default.  So,
>> with this last Ming's patch, BFQ may happen to be polled every 3ms,
>> for at most three times.
> 
> Ah!  I did not know this.  OK, then Ming's patch seems like it should
> work.  If nothing else it should fix the problem.  If this ends up
> making BFQ chew up too much CPU time then presumably someone will
> notice and BFQ's has_work() can be improved.
> 
> Ming: how do you want to proceed?  Do you want to formally post the
> patch?  Do you want me to post a v3 of my series where I place patch
> #2 with your patch?  Do you want authorship (which implies adding your
> Signed-off-by)?
> 
> 
>> On the opposite end, making bfq_has_work plugging aware costs more
>> complexity, and possibly one more lock.  While avoiding the above
>> occasional polling, this may imply a lot of overhead or CPU stalls on
>> every dispatch.
> 
> I still think it would be interesting to run performance tests with my
> proof-of-concept solution for has_work().  Even if it's not ideal,
> knowing whether performance increased, decreased, or stayed the same
> would give information about how much more effort should be put into
> this.
> 

Why not?  It is, however, hard to hope that we would add only negligible
overhead and CPU stalls if we move from one lock-protected section per
I/O-request dispatch to two or more lock-protected sections per request
(has_work() may be invoked several times per request).

At any rate, if useful, one of the scripts in my S benchmark suite can
also measure max IOPS (when limited only by I/O processing) [1].  The
script is for Linux distros; I don't know whether it works in your
environments of interest, Doug.

Paolo

[1] https://github.com/Algodev-github/S/blob/master/throughput-sync/throughput-sync.sh

> -Doug



* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-07  2:14             ` Ming Lei
@ 2020-04-07 22:04               ` Doug Anderson
  0 siblings, 0 replies; 15+ messages in thread
From: Doug Anderson @ 2020-04-07 22:04 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, James E.J. Bottomley, Martin K. Petersen,
	Paolo Valente, Salman Qazi, linux-block, linux-scsi,
	Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov,
	Tejun Heo, LKML

Hi,

On Mon, Apr 6, 2020 at 7:14 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Sun, Apr 05, 2020 at 09:26:39AM -0700, Doug Anderson wrote:
> > Hi,
> >
> > On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > >                 rq = e->type->ops.dispatch_request(hctx);
> > >                 if (!rq) {
> > >                         blk_mq_put_dispatch_budget(hctx);
> > > +
> > > +                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
> > > +                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
> >
> > To really close the race, don't we need to run all the queues
> > associated with the hctx?  I haven't traced it through, but I've been
> > assuming that the multiple "hctx"s associated with the same queue will
> > have the same budget associated with them and thus they can block each
> > other out.
>
> Yeah, we should run all hctxs which share the same budget space.
>
> Also, in theory, we don't have to add the delay, however BFQ may plug the
> dispatch for 9 ms, so looks delay run queue is required.

I have posted up v3.  A few notes:

* Since we should run all "hctxs" I took out the check for has_work()
before kicking the queue.

* As far as I can tell the theoretical race happens for _anyone_ who
puts budget, so I added it to blk_mq_put_dispatch_budget().  Feel free
to shout at me in response to v3 if you believe I'm wrong about that.

Thanks for all your reviews and suggestions so far!

-Doug


end of thread, other threads:[~2020-04-07 22:04 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
2020-04-02 15:51 [PATCH v2 0/2] blk-mq: Fix two causes of IO stalls found in reboot testing Douglas Anderson
2020-04-02 15:51 ` [PATCH v2 1/2] blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick Douglas Anderson
2020-04-03  1:55   ` Ming Lei
2020-04-02 15:51 ` [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention Douglas Anderson
2020-04-03  1:33   ` Ming Lei
2020-04-03 15:10     ` Doug Anderson
2020-04-03 15:49       ` Doug Anderson
2020-04-05  9:14         ` Ming Lei
2020-04-05 14:00           ` Doug Anderson
2020-04-05 14:57             ` Paolo Valente
2020-04-05 16:16               ` Doug Anderson
2020-04-07 10:43                 ` Paolo Valente
2020-04-05 16:26           ` Doug Anderson
2020-04-07  2:14             ` Ming Lei
2020-04-07 22:04               ` Doug Anderson
