Linux-Block Archive on lore.kernel.org
 help / color / Atom feed
* Hung tasks with multiple partitions
@ 2020-01-30 19:34 Salman Qazi
  2020-01-30 20:49 ` Bart Van Assche
  0 siblings, 1 reply; 18+ messages in thread
From: Salman Qazi @ 2020-01-30 19:34 UTC (permalink / raw)
  To: Jens Axboe, Linux Kernel Mailing List, linux-block
  Cc: Jesse Barnes, Gwendal Grignou

Hi,

I am writing on behalf of the Chromium OS team at Google.  We found
the root cause for some hung tasks we were experiencing and we would
like to get your opinion on potential solutions.  The bugs were
encountered on 4.19 kernel.
However my reading of the code suggests that the relevant portions of the
code have not changed since then.

We have an eMMC flash drive that has been carved into partitions on an
8 CPU system.  The repro case that we came up with, is to use 8
threaded fio write-mostly workload against one partition, let the
system use the other partition as the read-write filesystem (i.e. just
background activity) and then run the following loop:

while true; do sync; sleep 1 ; done

The hung task stack traces look like the following:

[  128.994891] jbd2/dm-1-8     D    0   367      2 0x00000028
last_sleep: 96340206998.  last_runnable: 96340140151
[  128.994898] Call trace:
[  128.994903]  __switch_to+0x120/0x13c
[  128.994909]  __schedule+0x60c/0x7dc
[  128.994914]  schedule+0x74/0x94
[  128.994919]  io_schedule+0x1c/0x40
[  128.994925]  bit_wait_io+0x18/0x58
[  128.994930]  __wait_on_bit+0x78/0xdc
[  128.994935]  out_of_line_wait_on_bit+0xa0/0xcc
[  128.994943]  __wait_on_buffer+0x48/0x54
[  128.994948]  jbd2_journal_commit_transaction+0x1198/0x1a4c
[  128.994956]  kjournald2+0x19c/0x268
[  128.994961]  kthread+0x120/0x130
[  128.994967]  ret_from_fork+0x10/0x18

I added some more information to trace points to understand what was
going on.  It turns out that blk_mq_sched_dispatch_requests had
checked hctx->dispatch, found it empty, and then began consuming
requests from the io scheduler (in blk_mq_do_dispatch_sched).
Unfortunately, the deluge from the I/O scheduler (BFQ in our case)
doesn't stop for 30 seconds and there is no mechanism present in
blk_mq_do_dispatch_sched to terminate early or reconsider
hctx->dispatch contents.  In the meantime, a flush command arrives in
hctx->dispatch (via insertion in  blk_mq_sched_bypass_insert) and
languishes there.  Eventually the thread waiting on the flush triggers
the hung task watchdog.

The solution that comes to mind is to periodically check
hctx->dispatch in blk_mq_do_dispatch_sched and exit early if it is
non-empty.  However, not being an expert in this subsystem, I am not
sure if there would be other consequences.

Any help is appreciated,

Salman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Hung tasks with multiple partitions
  2020-01-30 19:34 Hung tasks with multiple partitions Salman Qazi
@ 2020-01-30 20:49 ` Bart Van Assche
  2020-01-30 21:02   ` Salman Qazi
  0 siblings, 1 reply; 18+ messages in thread
From: Bart Van Assche @ 2020-01-30 20:49 UTC (permalink / raw)
  To: Salman Qazi, Jens Axboe, Linux Kernel Mailing List, linux-block
  Cc: Jesse Barnes, Gwendal Grignou

[-- Attachment #1: Type: text/plain, Size: 2815 bytes --]

On 1/30/20 11:34 AM, Salman Qazi wrote:
> I am writing on behalf of the Chromium OS team at Google.  We found
> the root cause for some hung tasks we were experiencing and we would
> like to get your opinion on potential solutions.  The bugs were
> encountered on 4.19 kernel.
> However my reading of the code suggests that the relevant portions of the
> code have not changed since then.
> 
> We have an eMMC flash drive that has been carved into partitions on an
> 8 CPU system.  The repro case that we came up with, is to use 8
> threaded fio write-mostly workload against one partition, let the
> system use the other partition as the read-write filesystem (i.e. just
> background activity) and then run the following loop:
> 
> while true; do sync; sleep 1 ; done
> 
> The hung task stack traces look like the following:
> 
> [  128.994891] jbd2/dm-1-8     D    0   367      2 0x00000028
> last_sleep: 96340206998.  last_runnable: 96340140151
> [  128.994898] Call trace:
> [  128.994903]  __switch_to+0x120/0x13c
> [  128.994909]  __schedule+0x60c/0x7dc
> [  128.994914]  schedule+0x74/0x94
> [  128.994919]  io_schedule+0x1c/0x40
> [  128.994925]  bit_wait_io+0x18/0x58
> [  128.994930]  __wait_on_bit+0x78/0xdc
> [  128.994935]  out_of_line_wait_on_bit+0xa0/0xcc
> [  128.994943]  __wait_on_buffer+0x48/0x54
> [  128.994948]  jbd2_journal_commit_transaction+0x1198/0x1a4c
> [  128.994956]  kjournald2+0x19c/0x268
> [  128.994961]  kthread+0x120/0x130
> [  128.994967]  ret_from_fork+0x10/0x18
> 
> I added some more information to trace points to understand what was
> going on.  It turns out that blk_mq_sched_dispatch_requests had
> checked hctx->dispatch, found it empty, and then began consuming
> requests from the io scheduler (in blk_mq_do_dispatch_sched).
> Unfortunately, the deluge from the I/O scheduler (BFQ in our case)
> doesn't stop for 30 seconds and there is no mechanism present in
> blk_mq_do_dispatch_sched to terminate early or reconsider
> hctx->dispatch contents.  In the meantime, a flush command arrives in
> hctx->dispatch (via insertion in  blk_mq_sched_bypass_insert) and
> languishes there.  Eventually the thread waiting on the flush triggers
> the hung task watchdog.
> 
> The solution that comes to mind is to periodically check
> hctx->dispatch in blk_mq_do_dispatch_sched and exit early if it is
> non-empty.  However, not being an expert in this subsystem, I am not
> sure if there would be other consequences.

The call stack shown in your e-mail usually means that an I/O request 
got stuck. How about determining first whether this is caused by the BFQ 
scheduler or by the eMMC driver? I think the developers of these 
software components need that information anyway before they can step in.

The attached script may help to identify which requests got stuck.

Bart.

[-- Attachment #2: list-pending-block-requests --]
[-- Type: text/plain, Size: 1501 bytes --]

#!/bin/bash

show_state() {
    local a dev=$1

    for a in device/state queue/scheduler; do
	[ -e "$dev/$a" ] && grep -aH . "$dev/$a"
    done
}

if [ -e /sys/kernel/debug/block ]; then
    devs=($(cd /sys/kernel/debug/block && echo ./*))
else
    devs=($(cd /sys/class/block && echo ./*))
fi

cd /sys/class/block || exit $?
for dev in "${devs[@]}"; do
    dev="${dev#./}"
    echo "$dev"
    pending=0
    if [ -e "$dev/mq" ]; then
	for f in "$dev"/mq/*/{pending,*/rq_list}; do
	    [ -e "$f" ] || continue
	    if { read -r line1 && read -r line2; } <"$f"; then
		echo "$f"
		echo "$line1 $line2" >/dev/null
		head -n 9 "$f"
		((pending++))
	    fi
	done
    fi
    (
	busy=0
	cd /sys/kernel/debug/block >&/dev/null &&
	    { grep -aH . $dev/requeue_list; true; } &&
	    for d in "$dev"/mq/hctx* "$dev"/hctx*; do
		[ ! -d "$d" ] && continue
		{ [ ! -e "$d/tags" ] ||
		      grep -q '^busy=0$' "$d/tags"; } &&
		    { [ ! -e "$d/sched_tags" ] ||
			  [ "$(<"$d/sched_tags")" = "" ] ||
			  grep -q '^busy=0$' "$d/sched_tags"; } && continue
		((busy++))
	        for f in "$d"/{active,busy,dispatch,flags,requeue_list,sched_tags,state,tags*,cpu*/rq_list,sched/*rqs}; do
		    [ -e "$f" ] && grep -aH . "$f"
		done
	    done
	exit $busy
    )
    pending=$((pending+$?))
    if [ "$pending" -gt 0 ]; then
	(
	    cd /sys/kernel/debug/block >&/dev/null &&
		if [ -e "$dev/mq/state" ]; then
		    grep -aH . "$dev/mq/state"
		else
		    grep -aH . "$dev/state"
		fi
	)
	show_state "$dev"
    fi
done

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Hung tasks with multiple partitions
  2020-01-30 20:49 ` Bart Van Assche
@ 2020-01-30 21:02   ` Salman Qazi
       [not found]     ` <20200203204554.119849-1-sqazi@google.com>
  0 siblings, 1 reply; 18+ messages in thread
From: Salman Qazi @ 2020-01-30 21:02 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block, Jesse Barnes,
	Gwendal Grignou

On Thu, Jan 30, 2020 at 12:49 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 1/30/20 11:34 AM, Salman Qazi wrote:
> > I am writing on behalf of the Chromium OS team at Google.  We found
> > the root cause for some hung tasks we were experiencing and we would
> > like to get your opinion on potential solutions.  The bugs were
> > encountered on 4.19 kernel.
> > However my reading of the code suggests that the relevant portions of the
> > code have not changed since then.
> >
> > We have an eMMC flash drive that has been carved into partitions on an
> > 8 CPU system.  The repro case that we came up with, is to use 8
> > threaded fio write-mostly workload against one partition, let the
> > system use the other partition as the read-write filesystem (i.e. just
> > background activity) and then run the following loop:
> >
> > while true; do sync; sleep 1 ; done
> >
> > The hung task stack traces look like the following:
> >
> > [  128.994891] jbd2/dm-1-8     D    0   367      2 0x00000028
> > last_sleep: 96340206998.  last_runnable: 96340140151
> > [  128.994898] Call trace:
> > [  128.994903]  __switch_to+0x120/0x13c
> > [  128.994909]  __schedule+0x60c/0x7dc
> > [  128.994914]  schedule+0x74/0x94
> > [  128.994919]  io_schedule+0x1c/0x40
> > [  128.994925]  bit_wait_io+0x18/0x58
> > [  128.994930]  __wait_on_bit+0x78/0xdc
> > [  128.994935]  out_of_line_wait_on_bit+0xa0/0xcc
> > [  128.994943]  __wait_on_buffer+0x48/0x54
> > [  128.994948]  jbd2_journal_commit_transaction+0x1198/0x1a4c
> > [  128.994956]  kjournald2+0x19c/0x268
> > [  128.994961]  kthread+0x120/0x130
> > [  128.994967]  ret_from_fork+0x10/0x18
> >
> > I added some more information to trace points to understand what was
> > going on.  It turns out that blk_mq_sched_dispatch_requests had
> > checked hctx->dispatch, found it empty, and then began consuming
> > requests from the io scheduler (in blk_mq_do_dispatch_sched).
> > Unfortunately, the deluge from the I/O scheduler (BFQ in our case)
> > doesn't stop for 30 seconds and there is no mechanism present in
> > blk_mq_do_dispatch_sched to terminate early or reconsider
> > hctx->dispatch contents.  In the meantime, a flush command arrives in
> > hctx->dispatch (via insertion in  blk_mq_sched_bypass_insert) and
> > languishes there.  Eventually the thread waiting on the flush triggers
> > the hung task watchdog.
> >
> > The solution that comes to mind is to periodically check
> > hctx->dispatch in blk_mq_do_dispatch_sched and exit early if it is
> > non-empty.  However, not being an expert in this subsystem, I am not
> > sure if there would be other consequences.
>
> The call stack shown in your e-mail usually means that an I/O request
> got stuck. How about determining first whether this is caused by the BFQ
> scheduler or by the eMMC driver? I think the developers of these
> software components need that information anyway before they can step in.

As I mentioned in my previous email, I did use trace points to arrive
at my conclusion.  I added trace points in
blk_mq_sched_dispatch_requests to
detect both the start and the end of that function, as well as where
the dispatched commands were picked from.  I also traced
blk_mq_sched_bypass_insert and saw a flush enter hctx->dispatch after
blk_mq_sched_dispatch_requests had started but before it
finished.  After reaching my conclusion, I also tried a simple fix by
introducing an exit path in blk_mq_do_dispatch_sched, if
we detect that hctx->dispatch has become non-empty.   This made the
problem go away.

>
> The attached script may help to identify which requests got stuck.
>
> Bart.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] block: Limit number of items taken from the I/O scheduler in one go
       [not found]     ` <20200203204554.119849-1-sqazi@google.com>
@ 2020-02-03 20:59       ` Salman Qazi
  2020-02-04  3:47         ` Bart Van Assche
  2020-02-04  9:20         ` Ming Lei
  0 siblings, 2 replies; 18+ messages in thread
From: Salman Qazi @ 2020-02-03 20:59 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe, linux-block, linux-kernel
  Cc: Jesse Barnes, Gwendal Grignou, Salman Qazi

We observed that it is possible for a flush to bypass the I/O
scheduler and get added to hctx->dispatch in blk_mq_sched_bypass_insert.
This can happen while a kworker is running blk_mq_do_dispatch_sched call
in blk_mq_sched_dispatch_requests.

However, the blk_mq_do_dispatch_sched call doesn't end in bounded time.
As a result, the flush can sit there indefinitely, as the I/O scheduler
feeds an arbitrary number of requests to the hardware.

The solution is to periodically poll hctx->dispatch in
blk_mq_do_dispatch_sched, to put a bound on the latency of the commands
sitting there.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 block/blk-mq-sched.c   |  6 ++++++
 block/blk-mq.c         |  4 ++++
 block/blk-sysfs.c      | 33 +++++++++++++++++++++++++++++++++
 include/linux/blkdev.h |  2 ++
 4 files changed, 45 insertions(+)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ca22afd47b3d..75cdec64b9c7 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -90,6 +90,7 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	LIST_HEAD(rq_list);
+	int count = 0;
 
 	do {
 		struct request *rq;
@@ -97,6 +98,10 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
 			break;
 
+		if (count > 0 && count % q->max_sched_batch == 0 &&
+		    !list_empty_careful(&hctx->dispatch))
+			break;
+
 		if (!blk_mq_get_dispatch_budget(hctx))
 			break;
 
@@ -112,6 +117,7 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		 * in blk_mq_dispatch_rq_list().
 		 */
 		list_add(&rq->queuelist, &rq_list);
+		count++;
 	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a12b1763508d..7cb13aa72a94 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -40,6 +40,8 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
+#define BLK_MQ_DEFAULT_MAX_SCHED_BATCH	100
+
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
@@ -2934,6 +2936,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	 */
 	q->poll_nsec = BLK_MQ_POLL_CLASSIC;
 
+	q->max_sched_batch = BLK_MQ_DEFAULT_MAX_SCHED_BATCH;
+
 	blk_mq_init_cpu_queues(q, set->nr_hw_queues);
 	blk_mq_add_queue_tag_set(set, q);
 	blk_mq_map_swqueue(q);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..dd7b58a1bd35 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -390,6 +390,32 @@ static ssize_t queue_poll_delay_store(struct request_queue *q, const char *page,
 	return count;
 }
 
+static ssize_t queue_max_sched_batch_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%d\n", q->max_sched_batch);
+}
+
+static ssize_t queue_max_sched_batch_store(struct request_queue *q,
+					   const char *page,
+					   size_t count)
+{
+	int err, val;
+
+	if (!q->mq_ops)
+		return -EINVAL;
+
+	err = kstrtoint(page, 10, &val);
+	if (err < 0)
+		return err;
+
+	if (val <= 0)
+		return -EINVAL;
+
+	q->max_sched_batch = val;
+
+	return count;
+}
+
 static ssize_t queue_poll_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(test_bit(QUEUE_FLAG_POLL, &q->queue_flags), page);
@@ -691,6 +717,12 @@ static struct queue_sysfs_entry queue_poll_delay_entry = {
 	.store = queue_poll_delay_store,
 };
 
+static struct queue_sysfs_entry queue_max_sched_batch_entry = {
+	.attr = {.name = "max_sched_batch", .mode = 0644 },
+	.show = queue_max_sched_batch_show,
+	.store = queue_max_sched_batch_store,
+};
+
 static struct queue_sysfs_entry queue_wc_entry = {
 	.attr = {.name = "write_cache", .mode = 0644 },
 	.show = queue_wc_show,
@@ -763,6 +795,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_wb_lat_entry.attr,
 	&queue_poll_delay_entry.attr,
 	&queue_io_timeout_entry.attr,
+	&queue_max_sched_batch_entry.attr,
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 	&throtl_sample_time_entry.attr,
 #endif
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 053ea4b51988..68e7d29d4dd4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -477,6 +477,8 @@ struct request_queue {
 	unsigned int		rq_timeout;
 	int			poll_nsec;
 
+	int			max_sched_batch;
+
 	struct blk_stat_callback	*poll_cb;
 	struct blk_rq_stat	poll_stat[BLK_MQ_POLL_STATS_BKTS];
 
-- 
2.25.0.341.g760bfbb309-goog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go
  2020-02-03 20:59       ` [PATCH] block: Limit number of items taken from the I/O scheduler in one go Salman Qazi
@ 2020-02-04  3:47         ` Bart Van Assche
  2020-02-04  9:20         ` Ming Lei
  1 sibling, 0 replies; 18+ messages in thread
From: Bart Van Assche @ 2020-02-04  3:47 UTC (permalink / raw)
  To: Salman Qazi, Jens Axboe, linux-block, linux-kernel
  Cc: Jesse Barnes, Gwendal Grignou, Christoph Hellwig, Ming Lei,
	Hannes Reinecke

On 2020-02-03 12:59, Salman Qazi wrote:
> We observed that it is possible for a flush to bypass the I/O
> scheduler and get added to hctx->dispatch in blk_mq_sched_bypass_insert.
> This can happen while a kworker is running blk_mq_do_dispatch_sched call
> in blk_mq_sched_dispatch_requests.
> 
> However, the blk_mq_do_dispatch_sched call doesn't end in bounded time.
> As a result, the flush can sit there indefinitely, as the I/O scheduler
> feeds an arbitrary number of requests to the hardware.
> 
> The solution is to periodically poll hctx->dispatch in
> blk_mq_do_dispatch_sched, to put a bound on the latency of the commands
> sitting there.

(added Christoph, Ming and Hannes to the Cc-list)

Thank you for having posted a patch; that really helps.

I see that my name occurs first in the "To:" list. Since Jens is the
block layer maintainer I think Jens should have been mentioned first.

In version v4.20 of the Linux kernel I found the following in the legacy
block layer code:
* From blk_insert_flush():
	list_add_tail(&rq->queuelist, &q->queue_head);
* From elv_next_request():
	list_for_each_entry(rq, &q->queue_head, queuelist)

I think this means that the legacy block layer sent flush requests to
the scheduler instead of directly to the block driver. How about
modifying the blk-mq code such that it mimics that approach? I'm asking
this because this patch, although the code looks clean, doesn't seem the
best solution to me.

> +		if (count > 0 && count % q->max_sched_batch == 0 &&
> +		    !list_empty_careful(&hctx->dispatch))
> +			break;

A modulo operation in the hot path? Please don't do that.

> +static ssize_t queue_max_sched_batch_store(struct request_queue *q,
> +					   const char *page,
> +					   size_t count)
> +{
> +	int err, val;
> +
> +	if (!q->mq_ops)
> +		return -EINVAL;
> +
> +	err = kstrtoint(page, 10, &val);
> +	if (err < 0)
> +		return err;
> +
> +	if (val <= 0)
> +		return -EINVAL;
> +
> +	q->max_sched_batch = val;
> +
> +	return count;
> +}

Has it been considered to use kstrtouint() instead of checking whether
the value returned by kstrtoint() is positive?

> +	int			max_sched_batch;

unsigned int?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go
  2020-02-03 20:59       ` [PATCH] block: Limit number of items taken from the I/O scheduler in one go Salman Qazi
  2020-02-04  3:47         ` Bart Van Assche
@ 2020-02-04  9:20         ` Ming Lei
  2020-02-04 18:26           ` Salman Qazi
  1 sibling, 1 reply; 18+ messages in thread
From: Ming Lei @ 2020-02-04  9:20 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Bart Van Assche, Jens Axboe, linux-block, linux-kernel,
	Jesse Barnes, Gwendal Grignou

On Mon, Feb 03, 2020 at 12:59:50PM -0800, Salman Qazi wrote:
> We observed that it is possible for a flush to bypass the I/O
> scheduler and get added to hctx->dispatch in blk_mq_sched_bypass_insert.

We always bypass io scheduler for flush request.

> This can happen while a kworker is running blk_mq_do_dispatch_sched call
> in blk_mq_sched_dispatch_requests.
> 
> However, the blk_mq_do_dispatch_sched call doesn't end in bounded time.
> As a result, the flush can sit there indefinitely, as the I/O scheduler
> feeds an arbitrary number of requests to the hardware.

blk-mq supposes to handle requests in hctx->dispatch with higher
priority, see blk_mq_sched_dispatch_requests().

However, there is single hctx->run_work, so async run queue for dispatching
flush request can only be run until another async run queue from scheduler
is done.

> 
> The solution is to periodically poll hctx->dispatch in
> blk_mq_do_dispatch_sched, to put a bound on the latency of the commands
> sitting there.
> 
> Signed-off-by: Salman Qazi <sqazi@google.com>
> ---
>  block/blk-mq-sched.c   |  6 ++++++
>  block/blk-mq.c         |  4 ++++
>  block/blk-sysfs.c      | 33 +++++++++++++++++++++++++++++++++
>  include/linux/blkdev.h |  2 ++
>  4 files changed, 45 insertions(+)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index ca22afd47b3d..75cdec64b9c7 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -90,6 +90,7 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  	struct request_queue *q = hctx->queue;
>  	struct elevator_queue *e = q->elevator;
>  	LIST_HEAD(rq_list);
> +	int count = 0;
>  
>  	do {
>  		struct request *rq;
> @@ -97,6 +98,10 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
>  			break;
>  
> +		if (count > 0 && count % q->max_sched_batch == 0 &&
> +		    !list_empty_careful(&hctx->dispatch))
> +			break;

q->max_sched_batch may not be needed, and it is reasonable to always
disaptch requests in hctx->dispatch first.

Also such trick is missed in dispatch from sw queue.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go
  2020-02-04  9:20         ` Ming Lei
@ 2020-02-04 18:26           ` Salman Qazi
  2020-02-04 19:37             ` Salman Qazi
  0 siblings, 1 reply; 18+ messages in thread
From: Salman Qazi @ 2020-02-04 18:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Jens Axboe, linux-block,
	Linux Kernel Mailing List, Jesse Barnes, Gwendal Grignou

On Tue, Feb 4, 2020 at 1:20 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Feb 03, 2020 at 12:59:50PM -0800, Salman Qazi wrote:
> > We observed that it is possible for a flush to bypass the I/O
> > scheduler and get added to hctx->dispatch in blk_mq_sched_bypass_insert.
>
> We always bypass io scheduler for flush request.
>
> > This can happen while a kworker is running blk_mq_do_dispatch_sched call
> > in blk_mq_sched_dispatch_requests.
> >
> > However, the blk_mq_do_dispatch_sched call doesn't end in bounded time.
> > As a result, the flush can sit there indefinitely, as the I/O scheduler
> > feeds an arbitrary number of requests to the hardware.
>
> blk-mq supposes to handle requests in hctx->dispatch with higher
> priority, see blk_mq_sched_dispatch_requests().
>
> However, there is single hctx->run_work, so async run queue for dispatching
> flush request can only be run until another async run queue from scheduler
> is done.
>
> >
> > The solution is to periodically poll hctx->dispatch in
> > blk_mq_do_dispatch_sched, to put a bound on the latency of the commands
> > sitting there.
> >
> > Signed-off-by: Salman Qazi <sqazi@google.com>
> > ---
> >  block/blk-mq-sched.c   |  6 ++++++
> >  block/blk-mq.c         |  4 ++++
> >  block/blk-sysfs.c      | 33 +++++++++++++++++++++++++++++++++
> >  include/linux/blkdev.h |  2 ++
> >  4 files changed, 45 insertions(+)
> >
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index ca22afd47b3d..75cdec64b9c7 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -90,6 +90,7 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >       struct request_queue *q = hctx->queue;
> >       struct elevator_queue *e = q->elevator;
> >       LIST_HEAD(rq_list);
> > +     int count = 0;
> >
> >       do {
> >               struct request *rq;
> > @@ -97,6 +98,10 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >               if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
> >                       break;
> >
> > +             if (count > 0 && count % q->max_sched_batch == 0 &&
> > +                 !list_empty_careful(&hctx->dispatch))
> > +                     break;
>
> q->max_sched_batch may not be needed, and it is reasonable to always
> disaptch requests in hctx->dispatch first.
>
> Also such trick is missed in dispatch from sw queue.

I will update the commit message, drop max_sched_batch and just turn
it into a simple check and add the same
thing to the dispatch from the sw queue.

>
>
> Thanks,
> Ming
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] block: Limit number of items taken from the I/O scheduler in one go
  2020-02-04 18:26           ` Salman Qazi
@ 2020-02-04 19:37             ` Salman Qazi
  2020-02-05  4:55               ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Salman Qazi @ 2020-02-04 19:37 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Bart Van Assche, linux-block, linux-kernel
  Cc: Jesse Barnes, Gwendal Grignou, Hannes Reinecke,
	Christoph Hellwig, Salman Qazi

Flushes bypass the I/O scheduler and get added to hctx->dispatch
in blk_mq_sched_bypass_insert.  This can happen while a kworker is running
hctx->run_work work item and is past the point in
blk_mq_sched_dispatch_requests where hctx->dispatch is checked.

The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
because the I/O scheduler can feed an arbitrary number of commands.

Since we have only one hctx->run_work, the commands waiting in
hctx->dispatch will wait an arbitrary length of time for run_work to be
rerun.

A similar phenomenon exists with dispatches from the software queue.

The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
blk_mq_do_dispatch_ctx and return from the run_work handler and let it
rerun.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 block/blk-mq-sched.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ca22afd47b3d..d1b8b31bc3d4 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -97,6 +97,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
 			break;
 
+		if (!list_empty_careful(&hctx->dispatch))
+			break;
+
 		if (!blk_mq_get_dispatch_budget(hctx))
 			break;
 
@@ -140,6 +143,9 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 	do {
 		struct request *rq;
 
+		if (!list_empty_careful(&hctx->dispatch))
+			break;
+
 		if (!sbitmap_any_bit_set(&hctx->ctx_map))
 			break;
 
-- 
2.25.0.341.g760bfbb309-goog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go
  2020-02-04 19:37             ` Salman Qazi
@ 2020-02-05  4:55               ` Ming Lei
  2020-02-05 19:57                 ` Salman Qazi
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2020-02-05  4:55 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Jens Axboe, Bart Van Assche, linux-block, linux-kernel,
	Jesse Barnes, Gwendal Grignou, Hannes Reinecke,
	Christoph Hellwig

On Tue, Feb 04, 2020 at 11:37:11AM -0800, Salman Qazi wrote:
> Flushes bypass the I/O scheduler and get added to hctx->dispatch
> in blk_mq_sched_bypass_insert.  This can happen while a kworker is running
> hctx->run_work work item and is past the point in
> blk_mq_sched_dispatch_requests where hctx->dispatch is checked.
> 
> The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
> because the I/O scheduler can feed an arbitrary number of commands.
> 
> Since we have only one hctx->run_work, the commands waiting in
> hctx->dispatch will wait an arbitrary length of time for run_work to be
> rerun.
> 
> A similar phenomenon exists with dispatches from the software queue.
> 
> The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
> blk_mq_do_dispatch_ctx and return from the run_work handler and let it
> rerun.
> 
> Signed-off-by: Salman Qazi <sqazi@google.com>
> ---
>  block/blk-mq-sched.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index ca22afd47b3d..d1b8b31bc3d4 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -97,6 +97,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
>  			break;
>  
> +		if (!list_empty_careful(&hctx->dispatch))
> +			break;
> +
>  		if (!blk_mq_get_dispatch_budget(hctx))
>  			break;
>  
> @@ -140,6 +143,9 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
>  	do {
>  		struct request *rq;
>  
> +		if (!list_empty_careful(&hctx->dispatch))
> +			break;
> +
>  		if (!sbitmap_any_bit_set(&hctx->ctx_map))
>  			break;

This approach looks good, given actually we retrieved request this way in
legacy IO request path, see __elv_next_request().

However,  blk_mq_request_bypass_insert() may be run at the same time, so
this patch may cause requests stalled in scheduler queue.

How about returning if there is request available in hctx->dispatch from
the two helpers, then re-dispatch requests in blk_mq_sched_dispatch_requests()
if yes.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] block: Limit number of items taken from the I/O scheduler in one go
  2020-02-05  4:55               ` Ming Lei
@ 2020-02-05 19:57                 ` Salman Qazi
  2020-02-06 10:18                   ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Salman Qazi @ 2020-02-05 19:57 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Bart Van Assche, linux-block, linux-kernel
  Cc: Jesse Barnes, Gwendal Grignou, Hannes Reinecke,
	Christoph Hellwig, Salman Qazi

Flushes bypass the I/O scheduler and get added to hctx->dispatch
in blk_mq_sched_bypass_insert.  This can happen while a kworker is running
hctx->run_work work item and is past the point in
blk_mq_sched_dispatch_requests where hctx->dispatch is checked.

The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
because the I/O scheduler can feed an arbitrary number of commands.

Since we have only one hctx->run_work, the commands waiting in
hctx->dispatch will wait an arbitrary length of time for run_work to be
rerun.

A similar phenomenon exists with dispatches from the software queue.

The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
blk_mq_do_dispatch_ctx and return from the run_work handler and let it
rerun.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 block/blk-mq-sched.c | 37 +++++++++++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ca22afd47b3d..52249fddeb66 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -84,12 +84,16 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
  * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
  * its queue by itself in its completion handler, so we don't need to
  * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
+ *
+ * Returns true if hctx->dispatch was found non-empty and
+ * run_work has to be run again.
  */
-static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	LIST_HEAD(rq_list);
+	bool ret = false;
 
 	do {
 		struct request *rq;
@@ -97,6 +101,11 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
 			break;
 
+		if (!list_empty_careful(&hctx->dispatch)) {
+			ret = true;
+			break;
+		}
+
 		if (!blk_mq_get_dispatch_budget(hctx))
 			break;
 
@@ -113,6 +122,8 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		 */
 		list_add(&rq->queuelist, &rq_list);
 	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
+
+	return ret;
 }
 
 static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
@@ -130,16 +141,25 @@ static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
  * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
  * its queue by itself in its completion handler, so we don't need to
  * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
+ *
+ * Returns true if hctx->dispatch was found non-empty and
+ * run_work has to be run again.
  */
-static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 	LIST_HEAD(rq_list);
 	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+	bool ret = false;
 
 	do {
 		struct request *rq;
 
+		if (!list_empty_careful(&hctx->dispatch)) {
+			ret = true;
+			break;
+		}
+
 		if (!sbitmap_any_bit_set(&hctx->ctx_map))
 			break;
 
@@ -165,6 +185,7 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
 
 	WRITE_ONCE(hctx->dispatch_from, ctx);
+	return ret;
 }
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
@@ -172,6 +193,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.dispatch_request;
+	bool run_again = false;
 	LIST_HEAD(rq_list);
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -208,19 +230,22 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 		blk_mq_sched_mark_restart_hctx(hctx);
 		if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
 			if (has_sched_dispatch)
-				blk_mq_do_dispatch_sched(hctx);
+				run_again = blk_mq_do_dispatch_sched(hctx);
 			else
-				blk_mq_do_dispatch_ctx(hctx);
+				run_again = blk_mq_do_dispatch_ctx(hctx);
 		}
 	} else if (has_sched_dispatch) {
-		blk_mq_do_dispatch_sched(hctx);
+		run_again = blk_mq_do_dispatch_sched(hctx);
 	} else if (hctx->dispatch_busy) {
 		/* dequeue request one by one from sw queue if queue is busy */
-		blk_mq_do_dispatch_ctx(hctx);
+		run_again = blk_mq_do_dispatch_ctx(hctx);
 	} else {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list, false);
 	}
+
+	if (run_again)
+		blk_mq_run_hw_queue(hctx, true);
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-- 
2.25.0.341.g760bfbb309-goog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go
  2020-02-05 19:57                 ` Salman Qazi
@ 2020-02-06 10:18                   ` Ming Lei
  2020-02-06 21:12                     ` Salman Qazi
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2020-02-06 10:18 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Jens Axboe, Bart Van Assche, linux-block, linux-kernel,
	Jesse Barnes, Gwendal Grignou, Hannes Reinecke,
	Christoph Hellwig

On Wed, Feb 05, 2020 at 11:57:06AM -0800, Salman Qazi wrote:
> Flushes bypass the I/O scheduler and get added to hctx->dispatch
> in blk_mq_sched_bypass_insert.  This can happen while a kworker is running
> hctx->run_work work item and is past the point in
> blk_mq_sched_dispatch_requests where hctx->dispatch is checked.
> 
> The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
> because the I/O scheduler can feed an arbitrary number of commands.
> 
> Since we have only one hctx->run_work, the commands waiting in
> hctx->dispatch will wait an arbitrary length of time for run_work to be
> rerun.
> 
> A similar phenomenon exists with dispatches from the software queue.
> 
> The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
> blk_mq_do_dispatch_ctx and return from the run_work handler and let it
> rerun.
> 
> Signed-off-by: Salman Qazi <sqazi@google.com>
> ---
>  block/blk-mq-sched.c | 37 +++++++++++++++++++++++++++++++------
>  1 file changed, 31 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index ca22afd47b3d..52249fddeb66 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -84,12 +84,16 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
>   * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
>   * its queue by itself in its completion handler, so we don't need to
>   * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
> + *
> + * Returns true if hctx->dispatch was found non-empty and
> + * run_work has to be run again.
>   */
> -static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> +static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct request_queue *q = hctx->queue;
>  	struct elevator_queue *e = q->elevator;
>  	LIST_HEAD(rq_list);
> +	bool ret = false;
>  
>  	do {
>  		struct request *rq;
> @@ -97,6 +101,11 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
>  			break;
>  
> +		if (!list_empty_careful(&hctx->dispatch)) {
> +			ret = true;
> +			break;
> +		}
> +
>  		if (!blk_mq_get_dispatch_budget(hctx))
>  			break;
>  
> @@ -113,6 +122,8 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  		 */
>  		list_add(&rq->queuelist, &rq_list);
>  	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
> +
> +	return ret;
>  }
>  
>  static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> @@ -130,16 +141,25 @@ static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
>   * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
>   * its queue by itself in its completion handler, so we don't need to
>   * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
> + *
> + * Returns true if hctx->dispatch was found non-empty and
> + * run_work has to be run again.
>   */
> -static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
> +static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct request_queue *q = hctx->queue;
>  	LIST_HEAD(rq_list);
>  	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> +	bool ret = false;
>  
>  	do {
>  		struct request *rq;
>  
> +		if (!list_empty_careful(&hctx->dispatch)) {
> +			ret = true;
> +			break;
> +		}
> +
>  		if (!sbitmap_any_bit_set(&hctx->ctx_map))
>  			break;
>  
> @@ -165,6 +185,7 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
>  	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
>  
>  	WRITE_ONCE(hctx->dispatch_from, ctx);
> +	return ret;
>  }
>  
>  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> @@ -172,6 +193,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	struct request_queue *q = hctx->queue;
>  	struct elevator_queue *e = q->elevator;
>  	const bool has_sched_dispatch = e && e->type->ops.dispatch_request;
> +	bool run_again = false;
>  	LIST_HEAD(rq_list);
>  
>  	/* RCU or SRCU read lock is needed before checking quiesced flag */
> @@ -208,19 +230,22 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  		blk_mq_sched_mark_restart_hctx(hctx);
>  		if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
>  			if (has_sched_dispatch)
> -				blk_mq_do_dispatch_sched(hctx);
> +				run_again = blk_mq_do_dispatch_sched(hctx);
>  			else
> -				blk_mq_do_dispatch_ctx(hctx);
> +				run_again = blk_mq_do_dispatch_ctx(hctx);
>  		}
>  	} else if (has_sched_dispatch) {
> -		blk_mq_do_dispatch_sched(hctx);
> +		run_again = blk_mq_do_dispatch_sched(hctx);
>  	} else if (hctx->dispatch_busy) {
>  		/* dequeue request one by one from sw queue if queue is busy */
> -		blk_mq_do_dispatch_ctx(hctx);
> +		run_again = blk_mq_do_dispatch_ctx(hctx);
>  	} else {
>  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
>  		blk_mq_dispatch_rq_list(q, &rq_list, false);
>  	}
> +
> +	if (run_again)
> +		blk_mq_run_hw_queue(hctx, true);

One improvement may be to run again locally in this function by
limited times(such as 1) first, then switch to blk_mq_run_hw_queue()
if run again is still needed.

This way may save one async run hw queue.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] block: Limit number of items taken from the I/O scheduler in one go
  2020-02-06 10:18                   ` Ming Lei
@ 2020-02-06 21:12                     ` Salman Qazi
  2020-02-07  2:07                       ` Ming Lei
  2020-02-07 15:26                       ` Bart Van Assche
  0 siblings, 2 replies; 18+ messages in thread
From: Salman Qazi @ 2020-02-06 21:12 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Bart Van Assche, linux-block, linux-kernel
  Cc: Jesse Barnes, Gwendal Grignou, Hannes Reinecke,
	Christoph Hellwig, Salman Qazi

Flushes bypass the I/O scheduler and get added to hctx->dispatch
in blk_mq_sched_bypass_insert.  This can happen while a kworker is running
hctx->run_work work item and is past the point in
blk_mq_sched_dispatch_requests where hctx->dispatch is checked.

The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
because the I/O scheduler can feed an arbitrary number of commands.

Since we have only one hctx->run_work, the commands waiting in
hctx->dispatch will wait an arbitrary length of time for run_work to be
rerun.

A similar phenomenon exists with dispatches from the software queue.

The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
blk_mq_do_dispatch_ctx and return from the run_work handler and let it
rerun.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 block/blk-mq-sched.c | 47 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ca22afd47b3d..84dde147f646 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -84,12 +84,16 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
  * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
  * its queue by itself in its completion handler, so we don't need to
  * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
+ *
+ * Returns true if hctx->dispatch was found non-empty and
+ * run_work has to be run again.
  */
-static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	LIST_HEAD(rq_list);
+	bool ret = false;
 
 	do {
 		struct request *rq;
@@ -97,6 +101,11 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
 			break;
 
+		if (!list_empty_careful(&hctx->dispatch)) {
+			ret = true;
+			break;
+		}
+
 		if (!blk_mq_get_dispatch_budget(hctx))
 			break;
 
@@ -113,6 +122,8 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		 */
 		list_add(&rq->queuelist, &rq_list);
 	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
+
+	return ret;
 }
 
 static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
@@ -130,16 +141,25 @@ static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
  * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
  * its queue by itself in its completion handler, so we don't need to
  * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
+ *
+ * Returns true if hctx->dispatch was found non-empty and
+ * run_work has to be run again.
  */
-static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 	LIST_HEAD(rq_list);
 	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+	bool ret = false;
 
 	do {
 		struct request *rq;
 
+		if (!list_empty_careful(&hctx->dispatch)) {
+			ret = true;
+			break;
+		}
+
 		if (!sbitmap_any_bit_set(&hctx->ctx_map))
 			break;
 
@@ -165,6 +185,7 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
 
 	WRITE_ONCE(hctx->dispatch_from, ctx);
+	return ret;
 }
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
@@ -172,6 +193,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.dispatch_request;
+	bool run_again;
+	bool restarted = false;
 	LIST_HEAD(rq_list);
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -180,6 +203,9 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 
 	hctx->run++;
 
+again:
+	run_again = false;
+
 	/*
 	 * If we have previous entries on our dispatch list, grab them first for
 	 * more fair dispatch.
@@ -208,19 +234,28 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 		blk_mq_sched_mark_restart_hctx(hctx);
 		if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
 			if (has_sched_dispatch)
-				blk_mq_do_dispatch_sched(hctx);
+				run_again = blk_mq_do_dispatch_sched(hctx);
 			else
-				blk_mq_do_dispatch_ctx(hctx);
+				run_again = blk_mq_do_dispatch_ctx(hctx);
 		}
 	} else if (has_sched_dispatch) {
-		blk_mq_do_dispatch_sched(hctx);
+		run_again = blk_mq_do_dispatch_sched(hctx);
 	} else if (hctx->dispatch_busy) {
 		/* dequeue request one by one from sw queue if queue is busy */
-		blk_mq_do_dispatch_ctx(hctx);
+		run_again = blk_mq_do_dispatch_ctx(hctx);
 	} else {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list, false);
 	}
+
+	if (run_again) {
+		if (!restarted) {
+			restarted = true;
+			goto again;
+		}
+
+		blk_mq_run_hw_queue(hctx, true);
+	}
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-- 
2.25.0.341.g760bfbb309-goog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go
  2020-02-06 21:12                     ` Salman Qazi
@ 2020-02-07  2:07                       ` Ming Lei
  2020-02-07 15:26                       ` Bart Van Assche
  1 sibling, 0 replies; 18+ messages in thread
From: Ming Lei @ 2020-02-07  2:07 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Jens Axboe, Bart Van Assche, linux-block, linux-kernel,
	Jesse Barnes, Gwendal Grignou, Hannes Reinecke,
	Christoph Hellwig

On Thu, Feb 06, 2020 at 01:12:22PM -0800, Salman Qazi wrote:
> Flushes bypass the I/O scheduler and get added to hctx->dispatch
> in blk_mq_sched_bypass_insert.  This can happen while a kworker is running
> hctx->run_work work item and is past the point in
> blk_mq_sched_dispatch_requests where hctx->dispatch is checked.
> 
> The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
> because the I/O scheduler can feed an arbitrary number of commands.
> 
> Since we have only one hctx->run_work, the commands waiting in
> hctx->dispatch will wait an arbitrary length of time for run_work to be
> rerun.
> 
> A similar phenomenon exists with dispatches from the software queue.
> 
> The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
> blk_mq_do_dispatch_ctx and return from the run_work handler and let it
> rerun.
> 
> Signed-off-by: Salman Qazi <sqazi@google.com>
> ---
>  block/blk-mq-sched.c | 47 ++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 41 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index ca22afd47b3d..84dde147f646 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -84,12 +84,16 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
>   * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
>   * its queue by itself in its completion handler, so we don't need to
>   * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
> + *
> + * Returns true if hctx->dispatch was found non-empty and
> + * run_work has to be run again.
>   */
> -static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> +static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct request_queue *q = hctx->queue;
>  	struct elevator_queue *e = q->elevator;
>  	LIST_HEAD(rq_list);
> +	bool ret = false;
>  
>  	do {
>  		struct request *rq;
> @@ -97,6 +101,11 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
>  			break;
>  
> +		if (!list_empty_careful(&hctx->dispatch)) {
> +			ret = true;
> +			break;
> +		}
> +
>  		if (!blk_mq_get_dispatch_budget(hctx))
>  			break;
>  
> @@ -113,6 +122,8 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  		 */
>  		list_add(&rq->queuelist, &rq_list);
>  	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
> +
> +	return ret;
>  }
>  
>  static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> @@ -130,16 +141,25 @@ static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
>   * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
>   * its queue by itself in its completion handler, so we don't need to
>   * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
> + *
> + * Returns true if hctx->dispatch was found non-empty and
> + * run_work has to be run again.
>   */
> -static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
> +static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct request_queue *q = hctx->queue;
>  	LIST_HEAD(rq_list);
>  	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> +	bool ret = false;
>  
>  	do {
>  		struct request *rq;
>  
> +		if (!list_empty_careful(&hctx->dispatch)) {
> +			ret = true;
> +			break;
> +		}
> +
>  		if (!sbitmap_any_bit_set(&hctx->ctx_map))
>  			break;
>  
> @@ -165,6 +185,7 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
>  	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
>  
>  	WRITE_ONCE(hctx->dispatch_from, ctx);
> +	return ret;
>  }
>  
>  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> @@ -172,6 +193,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	struct request_queue *q = hctx->queue;
>  	struct elevator_queue *e = q->elevator;
>  	const bool has_sched_dispatch = e && e->type->ops.dispatch_request;
> +	bool run_again;
> +	bool restarted = false;
>  	LIST_HEAD(rq_list);
>  
>  	/* RCU or SRCU read lock is needed before checking quiesced flag */
> @@ -180,6 +203,9 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  
>  	hctx->run++;
>  
> +again:
> +	run_again = false;
> +
>  	/*
>  	 * If we have previous entries on our dispatch list, grab them first for
>  	 * more fair dispatch.
> @@ -208,19 +234,28 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  		blk_mq_sched_mark_restart_hctx(hctx);
>  		if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
>  			if (has_sched_dispatch)
> -				blk_mq_do_dispatch_sched(hctx);
> +				run_again = blk_mq_do_dispatch_sched(hctx);
>  			else
> -				blk_mq_do_dispatch_ctx(hctx);
> +				run_again = blk_mq_do_dispatch_ctx(hctx);
>  		}
>  	} else if (has_sched_dispatch) {
> -		blk_mq_do_dispatch_sched(hctx);
> +		run_again = blk_mq_do_dispatch_sched(hctx);
>  	} else if (hctx->dispatch_busy) {
>  		/* dequeue request one by one from sw queue if queue is busy */
> -		blk_mq_do_dispatch_ctx(hctx);
> +		run_again = blk_mq_do_dispatch_ctx(hctx);
>  	} else {
>  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
>  		blk_mq_dispatch_rq_list(q, &rq_list, false);
>  	}
> +
> +	if (run_again) {
> +		if (!restarted) {
> +			restarted = true;
> +			goto again;
> +		}
> +
> +		blk_mq_run_hw_queue(hctx, true);
> +	}
>  }
>  
>  bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> -- 
> 2.25.0.341.g760bfbb309-goog
> 

Reviewed-by: Ming Lei <ming.lei@redhat.com>

-- 
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go
  2020-02-06 21:12                     ` Salman Qazi
  2020-02-07  2:07                       ` Ming Lei
@ 2020-02-07 15:26                       ` Bart Van Assche
  2020-02-07 18:45                         ` Salman Qazi
  1 sibling, 1 reply; 18+ messages in thread
From: Bart Van Assche @ 2020-02-07 15:26 UTC (permalink / raw)
  To: Salman Qazi, Jens Axboe, Ming Lei, linux-block, linux-kernel
  Cc: Jesse Barnes, Gwendal Grignou, Hannes Reinecke, Christoph Hellwig

On 2020-02-06 13:12, Salman Qazi wrote:
> + *
> + * Returns true if hctx->dispatch was found non-empty and
> + * run_work has to be run again.

Please elaborate this comment and explain why this is necessary (to
avoid that flush processing is postponed forever).

> + * Returns true if hctx->dispatch was found non-empty and
> + * run_work has to be run again.

Same comment here.

> +again:
> +	run_again = false;
> +
>  	/*
>  	 * If we have previous entries on our dispatch list, grab them first for
>  	 * more fair dispatch.
> @@ -208,19 +234,28 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  		blk_mq_sched_mark_restart_hctx(hctx);
>  		if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
>  			if (has_sched_dispatch)
> -				blk_mq_do_dispatch_sched(hctx);
> +				run_again = blk_mq_do_dispatch_sched(hctx);
>  			else
> -				blk_mq_do_dispatch_ctx(hctx);
> +				run_again = blk_mq_do_dispatch_ctx(hctx);
>  		}
>  	} else if (has_sched_dispatch) {
> -		blk_mq_do_dispatch_sched(hctx);
> +		run_again = blk_mq_do_dispatch_sched(hctx);
>  	} else if (hctx->dispatch_busy) {
>  		/* dequeue request one by one from sw queue if queue is busy */
> -		blk_mq_do_dispatch_ctx(hctx);
> +		run_again = blk_mq_do_dispatch_ctx(hctx);
>  	} else {
>  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
>  		blk_mq_dispatch_rq_list(q, &rq_list, false);
>  	}
> +
> +	if (run_again) {
> +		if (!restarted) {
> +			restarted = true;
> +			goto again;
> +		}
> +
> +		blk_mq_run_hw_queue(hctx, true);
> +	}

So this patch changes blk_mq_sched_dispatch_requests() such that it
iterates at most two times? How about implementing that loop with an
explicit for-loop? I think that will make
blk_mq_sched_dispatch_requests() easier to read. As you may know forward
goto's are accepted in kernel code but backward goto's are frowned upon.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go
  2020-02-07 15:26                       ` Bart Van Assche
@ 2020-02-07 18:45                         ` Salman Qazi
  2020-02-07 19:04                           ` Salman Qazi
  2020-02-07 20:19                           ` Bart Van Assche
  0 siblings, 2 replies; 18+ messages in thread
From: Salman Qazi @ 2020-02-07 18:45 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Ming Lei, linux-block, Linux Kernel Mailing List,
	Jesse Barnes, Gwendal Grignou, Hannes Reinecke,
	Christoph Hellwig

On Fri, Feb 7, 2020 at 7:26 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2020-02-06 13:12, Salman Qazi wrote:
> > + *
> > + * Returns true if hctx->dispatch was found non-empty and
> > + * run_work has to be run again.
>
> Please elaborate this comment and explain why this is necessary (to
> avoid that flush processing is postponed forever).
>
> > + * Returns true if hctx->dispatch was found non-empty and
> > + * run_work has to be run again.
>
> Same comment here.

Will do.

>
> > +again:
> > +     run_again = false;
> > +
> >       /*
> >        * If we have previous entries on our dispatch list, grab them first for
> >        * more fair dispatch.
> > @@ -208,19 +234,28 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >               blk_mq_sched_mark_restart_hctx(hctx);
> >               if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
> >                       if (has_sched_dispatch)
> > -                             blk_mq_do_dispatch_sched(hctx);
> > +                             run_again = blk_mq_do_dispatch_sched(hctx);
> >                       else
> > -                             blk_mq_do_dispatch_ctx(hctx);
> > +                             run_again = blk_mq_do_dispatch_ctx(hctx);
> >               }
> >       } else if (has_sched_dispatch) {
> > -             blk_mq_do_dispatch_sched(hctx);
> > +             run_again = blk_mq_do_dispatch_sched(hctx);
> >       } else if (hctx->dispatch_busy) {
> >               /* dequeue request one by one from sw queue if queue is busy */
> > -             blk_mq_do_dispatch_ctx(hctx);
> > +             run_again = blk_mq_do_dispatch_ctx(hctx);
> >       } else {
> >               blk_mq_flush_busy_ctxs(hctx, &rq_list);
> >               blk_mq_dispatch_rq_list(q, &rq_list, false);
> >       }
> > +
> > +     if (run_again) {
> > +             if (!restarted) {
> > +                     restarted = true;
> > +                     goto again;
> > +             }
> > +
> > +             blk_mq_run_hw_queue(hctx, true);
> > +     }
>
> So this patch changes blk_mq_sched_dispatch_requests() such that it
> iterates at most two times? How about implementing that loop with an
> explicit for-loop? I think that will make
> blk_mq_sched_dispatch_requests() easier to read. As you may know forward
> goto's are accepted in kernel code but backward goto's are frowned upon.
>

About the goto, I don't know if backwards gotos in general are frowned
upon.  There are plenty of examples
in the kernel source.  This particular label, 'again' for instance:

$ grep -r again: mm/|wc -l
22
$ grep -r again: block/|wc -l
4

But, just because others have done it doesn't mean I should.  So, I
will attempt to explain why I think this is a good idea.
If I were to write this as a for-loop, it will look like this:

for (i = 0; i == 0 || (run_again && i < 2); i++) {
/* another level of 8 character wide indentation */
    run_again = false;
   /* a bunch of code that possibly sets run_again to true
}

if (run_again)
    blk_mq_run_hw_queue(hctx, true);

[Another alternative is to set run_again to true, and simplify the for-loop
condition to run_again && i < 2.  But, again, lots of verbiage and a boolean
in the for-loop condition.]

The for-loop is far from idiomatic.  It's not clear what it does when
you first look at it.
It distracts from the common path of the code, which is something that
almost always
runs exactly once.  There is now an additional level of indentation.
The readers of the
code aren't any better off, because they still have to figure out what
run_again is and if
they care about it.  And the only way to do that is to read the entire
body of the loop, and
comments at the top of the functions.

The goto in this case preserves the intent of the code better.  It is
dealing with an exceptional
and unusual case.  Indeed this kind of use is not unusual in the
kernel, for instance to deal
with possible but unlikely races.

Just my $0.02.

> Thanks,
>
> Bart.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] block: Limit number of items taken from the I/O scheduler in one go
  2020-02-07 18:45                         ` Salman Qazi
@ 2020-02-07 19:04                           ` Salman Qazi
  2020-02-07 20:19                           ` Bart Van Assche
  1 sibling, 0 replies; 18+ messages in thread
From: Salman Qazi @ 2020-02-07 19:04 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Bart Van Assche, linux-block, linux-kernel
  Cc: Jesse Barnes, Gwendal Grignou, Hannes Reinecke,
	Christoph Hellwig, Salman Qazi

Flushes bypass the I/O scheduler and get added to hctx->dispatch
in blk_mq_sched_bypass_insert.  This can happen while a kworker is running
hctx->run_work work item and is past the point in
blk_mq_sched_dispatch_requests where hctx->dispatch is checked.

The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
because the I/O scheduler can feed an arbitrary number of commands.

Since we have only one hctx->run_work, the commands waiting in
hctx->dispatch will wait an arbitrary length of time for run_work to be
rerun.

A similar phenomenon exists with dispatches from the software queue.

The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
blk_mq_do_dispatch_ctx and return from the run_work handler and let it
rerun.

Signed-off-by: Salman Qazi <sqazi@google.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 49 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 43 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ca22afd47b3d..3e78c5bbb4d9 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -84,12 +84,17 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
  * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
  * its queue by itself in its completion handler, so we don't need to
  * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
+ *
+ * Returns true if hctx->dispatch was found non-empty and
+ * run_work has to be run again.  This is necessary to avoid
+ * starving flushes.
  */
-static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	LIST_HEAD(rq_list);
+	bool ret = false;
 
 	do {
 		struct request *rq;
@@ -97,6 +102,11 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
 			break;
 
+		if (!list_empty_careful(&hctx->dispatch)) {
+			ret = true;
+			break;
+		}
+
 		if (!blk_mq_get_dispatch_budget(hctx))
 			break;
 
@@ -113,6 +123,8 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		 */
 		list_add(&rq->queuelist, &rq_list);
 	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
+
+	return ret;
 }
 
 static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
@@ -130,16 +142,26 @@ static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
  * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
  * its queue by itself in its completion handler, so we don't need to
  * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
+ *
+ * Returns true if hctx->dispatch was found non-empty and
+ * run_work has to be run again.  This is necessary to avoid
+ * starving flushes.
  */
-static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 	LIST_HEAD(rq_list);
 	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+	bool ret = false;
 
 	do {
 		struct request *rq;
 
+		if (!list_empty_careful(&hctx->dispatch)) {
+			ret = true;
+			break;
+		}
+
 		if (!sbitmap_any_bit_set(&hctx->ctx_map))
 			break;
 
@@ -165,6 +187,7 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
 
 	WRITE_ONCE(hctx->dispatch_from, ctx);
+	return ret;
 }
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
@@ -172,6 +195,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.dispatch_request;
+	bool run_again;
+	bool restarted = false;
 	LIST_HEAD(rq_list);
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -180,6 +205,9 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 
 	hctx->run++;
 
+again:
+	run_again = false;
+
 	/*
 	 * If we have previous entries on our dispatch list, grab them first for
 	 * more fair dispatch.
@@ -208,19 +236,28 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 		blk_mq_sched_mark_restart_hctx(hctx);
 		if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
 			if (has_sched_dispatch)
-				blk_mq_do_dispatch_sched(hctx);
+				run_again = blk_mq_do_dispatch_sched(hctx);
 			else
-				blk_mq_do_dispatch_ctx(hctx);
+				run_again = blk_mq_do_dispatch_ctx(hctx);
 		}
 	} else if (has_sched_dispatch) {
-		blk_mq_do_dispatch_sched(hctx);
+		run_again = blk_mq_do_dispatch_sched(hctx);
 	} else if (hctx->dispatch_busy) {
 		/* dequeue request one by one from sw queue if queue is busy */
-		blk_mq_do_dispatch_ctx(hctx);
+		run_again = blk_mq_do_dispatch_ctx(hctx);
 	} else {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list, false);
 	}
+
+	if (run_again) {
+		if (!restarted) {
+			restarted = true;
+			goto again;
+		}
+
+		blk_mq_run_hw_queue(hctx, true);
+	}
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-- 
2.25.0.341.g760bfbb309-goog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go
  2020-02-07 18:45                         ` Salman Qazi
  2020-02-07 19:04                           ` Salman Qazi
@ 2020-02-07 20:19                           ` Bart Van Assche
  2020-02-07 20:37                             ` Salman Qazi
  1 sibling, 1 reply; 18+ messages in thread
From: Bart Van Assche @ 2020-02-07 20:19 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Jens Axboe, Ming Lei, linux-block, Linux Kernel Mailing List,
	Jesse Barnes, Gwendal Grignou, Hannes Reinecke,
	Christoph Hellwig

On 2/7/20 10:45 AM, Salman Qazi wrote:
> If I were to write this as a for-loop, it will look like this:
> 
> for (i = 0; i == 0 || (run_again && i < 2); i++) {
> /* another level of 8 character wide indentation */
>      run_again = false;
>     /* a bunch of code that possibly sets run_again to true
> }
> 
> if (run_again)
>      blk_mq_run_hw_queue(hctx, true);

That's not what I meant. What I meant is a loop that iterates at most 
two times and also to break out of the loop if run_again == false.

BTW, I share your concern about the additional indentation by eight 
positions. How about avoiding deeper indentation by introducing a new 
function?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go
  2020-02-07 20:19                           ` Bart Van Assche
@ 2020-02-07 20:37                             ` Salman Qazi
  0 siblings, 0 replies; 18+ messages in thread
From: Salman Qazi @ 2020-02-07 20:37 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Ming Lei, linux-block, Linux Kernel Mailing List,
	Jesse Barnes, Gwendal Grignou, Hannes Reinecke,
	Christoph Hellwig

On Fri, Feb 7, 2020 at 12:19 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2/7/20 10:45 AM, Salman Qazi wrote:
> > If I were to write this as a for-loop, it will look like this:
> >
> > for (i = 0; i == 0 || (run_again && i < 2); i++) {
> > /* another level of 8 character wide indentation */
> >      run_again = false;
> >     /* a bunch of code that possibly sets run_again to true
> > }
> >
> > if (run_again)
> >      blk_mq_run_hw_queue(hctx, true);
>
> That's not what I meant. What I meant is a loop that iterates at most
> two times and also to break out of the loop if run_again == false.
>

I picked the most compact variant to demonstrate the problem.  Adding
breaks isn't
really helping the readability.

for (i = 0; i < 2; i++) {
  run_again = false;
/* bunch of code that possibly sets it to true */
...
 if (!run_again)
    break;
}
if (run_again)
    blk_mq_run_hw_queue(hctx, true);

When I read this, I initially assume that the loop in general runs
twice and that this is the common case.  It has the
same problem with conveying intent.  Perhaps, more importantly, the
point of using programming constructs is to shorten and simplify the
code.
There are still two if-statements in addition to the loop. We haven't
gained much by introducing the loop.

> BTW, I share your concern about the additional indentation by eight
> positions. How about avoiding deeper indentation by introducing a new
> function?

If there was a benefit to introducing the loop, this would be a good
call.  But the way I see it, the introduction of another
function is yet another way in which the introduction of the loop
makes the code less readable.

This is not a hill I want to die on.  If the maintainer agrees with
you on this point, I will use a loop.
>
> Thanks,
>
> Bart.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, back to index

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-30 19:34 Hung tasks with multiple partitions Salman Qazi
2020-01-30 20:49 ` Bart Van Assche
2020-01-30 21:02   ` Salman Qazi
     [not found]     ` <20200203204554.119849-1-sqazi@google.com>
2020-02-03 20:59       ` [PATCH] block: Limit number of items taken from the I/O scheduler in one go Salman Qazi
2020-02-04  3:47         ` Bart Van Assche
2020-02-04  9:20         ` Ming Lei
2020-02-04 18:26           ` Salman Qazi
2020-02-04 19:37             ` Salman Qazi
2020-02-05  4:55               ` Ming Lei
2020-02-05 19:57                 ` Salman Qazi
2020-02-06 10:18                   ` Ming Lei
2020-02-06 21:12                     ` Salman Qazi
2020-02-07  2:07                       ` Ming Lei
2020-02-07 15:26                       ` Bart Van Assche
2020-02-07 18:45                         ` Salman Qazi
2020-02-07 19:04                           ` Salman Qazi
2020-02-07 20:19                           ` Bart Van Assche
2020-02-07 20:37                             ` Salman Qazi

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org
	public-inbox-index linux-block

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git