All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2 RFC] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
@ 2020-12-18 21:44 Jan Kara
  2020-12-18 21:44 ` [PATCH 1/2] Revert "blk-mq, elevator: Count requests per hctx to improve performance" Jan Kara
  2020-12-18 21:44 ` [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues Jan Kara
  0 siblings, 2 replies; 14+ messages in thread
From: Jan Kara @ 2020-12-18 21:44 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, hare, Kashyap Desai, Jan Kara

Hello!

This patch series aims to fix a regression we've noticed on our test grid when
support for multiple HW queues in megaraid_sas driver was added during the 5.10
cycle (103fbf8e4020 scsi: megaraid_sas: Added support for shared host tagset
for cpuhotplug). The commit was reverted in the end for other reasons but I
believe the fundamental problem still exists for any other similar setup. The
problem manifests when the storage card supports multiple hardware queues
however storage behind it is slow (single rotating disk in our case) and so
using IO scheduler such as BFQ is desirable. See the second patch for details.

FWIW I'm not 100% sold on this approach, in particular I'm not sure it cannot
cause some issues in higher-end setups but I want expect mq-deadline or BFQ
to be used there. Anyway that's why I'm sending this as an RFC.

								Honza

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/2] Revert "blk-mq, elevator: Count requests per hctx to improve performance"
  2020-12-18 21:44 [PATCH 0/2 RFC] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues Jan Kara
@ 2020-12-18 21:44 ` Jan Kara
  2020-12-18 21:44 ` [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues Jan Kara
  1 sibling, 0 replies; 14+ messages in thread
From: Jan Kara @ 2020-12-18 21:44 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, hare, Kashyap Desai, Jan Kara

This reverts commit b445547ec1bbd3e7bf4b1c142550942f70527d95.

Since both mq-deadline and BFQ completely ignore hctx they are passed to
their dispatch function and dispatch whatever request they deem fit
checking whether any request for a particular hctx is queued is just
pointless since we'll very likely get a request from a different hctx
anyway. In the following commit we'll deal with lock contention in these
IO schedulers in presence of multiple HW queues in a different way.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/bfq-iosched.c    | 5 -----
 block/blk-mq.c         | 1 -
 block/mq-deadline.c    | 6 ------
 include/linux/blk-mq.h | 4 ----
 4 files changed, 16 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 9e81d1052091..a99dfaa75a8c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4640,9 +4640,6 @@ static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
 {
 	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
 
-	if (!atomic_read(&hctx->elevator_queued))
-		return false;
-
 	/*
 	 * Avoiding lock: a race on bfqd->busy_queues should cause at
 	 * most a call to dispatch for nothing
@@ -5557,7 +5554,6 @@ static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
 		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		bfq_insert_request(hctx, rq, at_head);
-		atomic_inc(&hctx->elevator_queued);
 	}
 }
 
@@ -5925,7 +5921,6 @@ static void bfq_finish_requeue_request(struct request *rq)
 
 		bfq_completed_request(bfqq, bfqd);
 		bfq_finish_requeue_request_body(bfqq);
-		atomic_dec(&rq->mq_hctx->elevator_queued);
 
 		spin_unlock_irqrestore(&bfqd->lock, flags);
 	} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b09ce00cc6af..57d0461f2be5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2670,7 +2670,6 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 		goto free_hctx;
 
 	atomic_set(&hctx->nr_active, 0);
-	atomic_set(&hctx->elevator_queued, 0);
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 	hctx->numa_node = node;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 800ac902809b..b57470e154c8 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -386,8 +386,6 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	spin_lock(&dd->lock);
 	rq = __dd_dispatch_request(dd);
 	spin_unlock(&dd->lock);
-	if (rq)
-		atomic_dec(&rq->mq_hctx->elevator_queued);
 
 	return rq;
 }
@@ -535,7 +533,6 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		dd_insert_request(hctx, rq, at_head);
-		atomic_inc(&hctx->elevator_queued);
 	}
 	spin_unlock(&dd->lock);
 }
@@ -582,9 +579,6 @@ static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
 
-	if (!atomic_read(&hctx->elevator_queued))
-		return false;
-
 	return !list_empty_careful(&dd->dispatch) ||
 		!list_empty_careful(&dd->fifo_list[0]) ||
 		!list_empty_careful(&dd->fifo_list[1]);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 47b021952ac7..8296811a031e 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,10 +140,6 @@ struct blk_mq_hw_ctx {
 	 * shared across request queues.
 	 */
 	atomic_t		nr_active;
-	/**
-	 * @elevator_queued: Number of queued requests on hctx.
-	 */
-	atomic_t                elevator_queued;
 
 	/** @cpuhp_online: List to store request if CPU is going to die */
 	struct hlist_node	cpuhp_online;
-- 
2.16.4


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
  2020-12-18 21:44 [PATCH 0/2 RFC] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues Jan Kara
  2020-12-18 21:44 ` [PATCH 1/2] Revert "blk-mq, elevator: Count requests per hctx to improve performance" Jan Kara
@ 2020-12-18 21:44 ` Jan Kara
  2020-12-19  3:14   ` Ming Lei
  1 sibling, 1 reply; 14+ messages in thread
From: Jan Kara @ 2020-12-18 21:44 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, hare, Kashyap Desai, Jan Kara

Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for
a queue with multiple HW queues, the performance it rather bad. The
problem is that these IO schedulers use queue-wide locking and their
dispatch function does not respect the hctx it is passed in and returns
any request it finds appropriate. Thus locality of request access is
broken and dispatch from multiple CPUs just contends on IO scheduler
locks. For these IO schedulers there's little point in dispatching from
multiple CPUs. Instead dispatch always only from a single CPU to limit
contention.

Below is a comparison of dbench runs on XFS filesystem where the storage
is a raid card with 64 HW queues and to it attached a single rotating
disk. BFQ is used as IO scheduler:

      clients           MQ                     SQ             MQ-Patched
Amean 1      39.12 (0.00%)       43.29 * -10.67%*       36.09 *   7.74%*
Amean 2     128.58 (0.00%)      101.30 *  21.22%*       96.14 *  25.23%*
Amean 4     577.42 (0.00%)      494.47 *  14.37%*      508.49 *  11.94%*
Amean 8     610.95 (0.00%)      363.86 *  40.44%*      362.12 *  40.73%*
Amean 16    391.78 (0.00%)      261.49 *  33.25%*      282.94 *  27.78%*
Amean 32    324.64 (0.00%)      267.71 *  17.54%*      233.00 *  28.23%*
Amean 64    295.04 (0.00%)      253.02 *  14.24%*      242.37 *  17.85%*
Amean 512 10281.61 (0.00%)    10211.16 *   0.69%*    10447.53 *  -1.61%*

Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
advertises just a single HW queue. MQ-Patched is a kernel with this
patch applied.

You can see multiple hardware queues heavily hurt performance in
combination with BFQ. The patch restores the performance.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/blk-mq.c           | 37 +++++++++++++++++++++++++++++++++++++
 block/kyber-iosched.c    |  1 +
 include/linux/elevator.h |  2 ++
 3 files changed, 40 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 57d0461f2be5..6d80054c231b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1663,6 +1663,31 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 }
 EXPORT_SYMBOL(blk_mq_run_hw_queue);
 
+static struct blk_mq_hw_ctx *blk_mq_sq_iosched_hctx(struct request_queue *q)
+{
+	struct elevator_queue *e = q->elevator;
+	struct blk_mq_hw_ctx *hctx;
+
+	/*
+	 * The queue has multiple hardware queues but uses IO scheduler that
+	 * does not respect hardware queues when dispatching? This is not a
+	 * great setup but it can be sensible when we have a single rotational
+	 * disk behind a raid card. Just don't bother with multiple HW queues
+	 * and dispatch from hctx for the current CPU since running multiple
+	 * queues just causes lock contention inside the scheduler and
+	 * pointless cache bouncing because the hctx is not respected by the IO
+	 * scheduler's dispatch function anyway.
+	 */
+	if (q->nr_hw_queues > 1 && e && e->type->ops.dispatch_request &&
+	    !(e->type->elevator_features & ELEVATOR_F_MQ_AWARE)) {
+		hctx = blk_mq_map_queue_type(q, HCTX_TYPE_DEFAULT,
+					     raw_smp_processor_id());
+		if (!blk_mq_hctx_stopped(hctx))
+			return hctx;
+	}
+	return NULL;
+}
+
 /**
  * blk_mq_run_hw_queues - Run all hardware queues in a request queue.
  * @q: Pointer to the request queue to run.
@@ -1673,6 +1698,12 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
+	hctx = blk_mq_sq_iosched_hctx(q);
+	if (hctx) {
+		blk_mq_run_hw_queue(hctx, async);
+		return;
+	}
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (blk_mq_hctx_stopped(hctx))
 			continue;
@@ -1692,6 +1723,12 @@ void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
+	hctx = blk_mq_sq_iosched_hctx(q);
+	if (hctx) {
+		blk_mq_delay_run_hw_queue(hctx, msecs);
+		return;
+	}
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (blk_mq_hctx_stopped(hctx))
 			continue;
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index dc89199bc8c6..c25c41d0d061 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -1029,6 +1029,7 @@ static struct elevator_type kyber_sched = {
 #endif
 	.elevator_attrs = kyber_sched_attrs,
 	.elevator_name = "kyber",
+	.elevator_features = ELEVATOR_F_MQ_AWARE,
 	.elevator_owner = THIS_MODULE,
 };
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index bacc40a0bdf3..1fe8e105b83b 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -172,6 +172,8 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
 
 /* Supports zoned block devices sequential write constraint */
 #define ELEVATOR_F_ZBD_SEQ_WRITE	(1U << 0)
+/* Supports scheduling on multiple hardware queues */
+#define ELEVATOR_F_MQ_AWARE		(1U << 1)
 
 #endif /* CONFIG_BLOCK */
 #endif
-- 
2.16.4


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
  2020-12-18 21:44 ` [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues Jan Kara
@ 2020-12-19  3:14   ` Ming Lei
  2020-12-22 10:18     ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Ming Lei @ 2020-12-19  3:14 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, linux-block, hare, Kashyap Desai

On Fri, Dec 18, 2020 at 10:44:12PM +0100, Jan Kara wrote:
> Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for
> a queue with multiple HW queues, the performance it rather bad. The
> problem is that these IO schedulers use queue-wide locking and their
> dispatch function does not respect the hctx it is passed in and returns
> any request it finds appropriate. Thus locality of request access is
> broken and dispatch from multiple CPUs just contends on IO scheduler
> locks. For these IO schedulers there's little point in dispatching from
> multiple CPUs. Instead dispatch always only from a single CPU to limit
> contention.
> 
> Below is a comparison of dbench runs on XFS filesystem where the storage
> is a raid card with 64 HW queues and to it attached a single rotating
> disk. BFQ is used as IO scheduler:
> 
>       clients           MQ                     SQ             MQ-Patched
> Amean 1      39.12 (0.00%)       43.29 * -10.67%*       36.09 *   7.74%*
> Amean 2     128.58 (0.00%)      101.30 *  21.22%*       96.14 *  25.23%*
> Amean 4     577.42 (0.00%)      494.47 *  14.37%*      508.49 *  11.94%*
> Amean 8     610.95 (0.00%)      363.86 *  40.44%*      362.12 *  40.73%*
> Amean 16    391.78 (0.00%)      261.49 *  33.25%*      282.94 *  27.78%*
> Amean 32    324.64 (0.00%)      267.71 *  17.54%*      233.00 *  28.23%*
> Amean 64    295.04 (0.00%)      253.02 *  14.24%*      242.37 *  17.85%*
> Amean 512 10281.61 (0.00%)    10211.16 *   0.69%*    10447.53 *  -1.61%*
> 
> Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
> the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
> advertises just a single HW queue. MQ-Patched is a kernel with this
> patch applied.
> 
> You can see multiple hardware queues heavily hurt performance in
> combination with BFQ. The patch restores the performance.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  block/blk-mq.c           | 37 +++++++++++++++++++++++++++++++++++++
>  block/kyber-iosched.c    |  1 +
>  include/linux/elevator.h |  2 ++
>  3 files changed, 40 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 57d0461f2be5..6d80054c231b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1663,6 +1663,31 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>  }
>  EXPORT_SYMBOL(blk_mq_run_hw_queue);
>  
> +static struct blk_mq_hw_ctx *blk_mq_sq_iosched_hctx(struct request_queue *q)
> +{
> +	struct elevator_queue *e = q->elevator;
> +	struct blk_mq_hw_ctx *hctx;
> +
> +	/*
> +	 * The queue has multiple hardware queues but uses IO scheduler that
> +	 * does not respect hardware queues when dispatching? This is not a
> +	 * great setup but it can be sensible when we have a single rotational
> +	 * disk behind a raid card. Just don't bother with multiple HW queues
> +	 * and dispatch from hctx for the current CPU since running multiple
> +	 * queues just causes lock contention inside the scheduler and
> +	 * pointless cache bouncing because the hctx is not respected by the IO
> +	 * scheduler's dispatch function anyway.
> +	 */
> +	if (q->nr_hw_queues > 1 && e && e->type->ops.dispatch_request &&
> +	    !(e->type->elevator_features & ELEVATOR_F_MQ_AWARE)) {
> +		hctx = blk_mq_map_queue_type(q, HCTX_TYPE_DEFAULT,
> +					     raw_smp_processor_id());
> +		if (!blk_mq_hctx_stopped(hctx))
> +			return hctx;
> +	}
> +	return NULL;
> +}
> +
>  /**
>   * blk_mq_run_hw_queues - Run all hardware queues in a request queue.
>   * @q: Pointer to the request queue to run.
> @@ -1673,6 +1698,12 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
>  	struct blk_mq_hw_ctx *hctx;
>  	int i;
>  
> +	hctx = blk_mq_sq_iosched_hctx(q);
> +	if (hctx) {
> +		blk_mq_run_hw_queue(hctx, async);
> +		return;
> +	}
> +

This approach looks reasonable, just wondering which code path is wrt.
blk_mq_run_hw_queues() improvement by this patch.

Since ed5dd6a67d5e ("scsi: core: Only re-run queue in scsi_end_request() if device
queue is busy") is merged, blk_mq_run_hw_queues() is only called from scsi_end_request()
when the scsi device is busy for megaraid.

Another one is bfq_schedule_dispatch(), in which blk_mq_run_hw_queues()
is still be called, if that is the reason, maybe it is easier to optimize
bfq_schedule_dispatch() by avoiding to call blk_mq_run_hw_queues().


thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
  2020-12-19  3:14   ` Ming Lei
@ 2020-12-22 10:18     ` Jan Kara
  2020-12-22 16:55       ` Jan Kara
  2020-12-23  3:33       ` Ming Lei
  0 siblings, 2 replies; 14+ messages in thread
From: Jan Kara @ 2020-12-22 10:18 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jan Kara, Jens Axboe, linux-block, hare, Kashyap Desai

On Sat 19-12-20 11:14:27, Ming Lei wrote:
> On Fri, Dec 18, 2020 at 10:44:12PM +0100, Jan Kara wrote:
> > Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for
> > a queue with multiple HW queues, the performance it rather bad. The
> > problem is that these IO schedulers use queue-wide locking and their
> > dispatch function does not respect the hctx it is passed in and returns
> > any request it finds appropriate. Thus locality of request access is
> > broken and dispatch from multiple CPUs just contends on IO scheduler
> > locks. For these IO schedulers there's little point in dispatching from
> > multiple CPUs. Instead dispatch always only from a single CPU to limit
> > contention.
> > 
> > Below is a comparison of dbench runs on XFS filesystem where the storage
> > is a raid card with 64 HW queues and to it attached a single rotating
> > disk. BFQ is used as IO scheduler:
> > 
> >       clients           MQ                     SQ             MQ-Patched
> > Amean 1      39.12 (0.00%)       43.29 * -10.67%*       36.09 *   7.74%*
> > Amean 2     128.58 (0.00%)      101.30 *  21.22%*       96.14 *  25.23%*
> > Amean 4     577.42 (0.00%)      494.47 *  14.37%*      508.49 *  11.94%*
> > Amean 8     610.95 (0.00%)      363.86 *  40.44%*      362.12 *  40.73%*
> > Amean 16    391.78 (0.00%)      261.49 *  33.25%*      282.94 *  27.78%*
> > Amean 32    324.64 (0.00%)      267.71 *  17.54%*      233.00 *  28.23%*
> > Amean 64    295.04 (0.00%)      253.02 *  14.24%*      242.37 *  17.85%*
> > Amean 512 10281.61 (0.00%)    10211.16 *   0.69%*    10447.53 *  -1.61%*
> > 
> > Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
> > the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
> > advertises just a single HW queue. MQ-Patched is a kernel with this
> > patch applied.
> > 
> > You can see multiple hardware queues heavily hurt performance in
> > combination with BFQ. The patch restores the performance.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  block/blk-mq.c           | 37 +++++++++++++++++++++++++++++++++++++
> >  block/kyber-iosched.c    |  1 +
> >  include/linux/elevator.h |  2 ++
> >  3 files changed, 40 insertions(+)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 57d0461f2be5..6d80054c231b 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1663,6 +1663,31 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
> >  }
> >  EXPORT_SYMBOL(blk_mq_run_hw_queue);
> >  
> > +static struct blk_mq_hw_ctx *blk_mq_sq_iosched_hctx(struct request_queue *q)
> > +{
> > +	struct elevator_queue *e = q->elevator;
> > +	struct blk_mq_hw_ctx *hctx;
> > +
> > +	/*
> > +	 * The queue has multiple hardware queues but uses IO scheduler that
> > +	 * does not respect hardware queues when dispatching? This is not a
> > +	 * great setup but it can be sensible when we have a single rotational
> > +	 * disk behind a raid card. Just don't bother with multiple HW queues
> > +	 * and dispatch from hctx for the current CPU since running multiple
> > +	 * queues just causes lock contention inside the scheduler and
> > +	 * pointless cache bouncing because the hctx is not respected by the IO
> > +	 * scheduler's dispatch function anyway.
> > +	 */
> > +	if (q->nr_hw_queues > 1 && e && e->type->ops.dispatch_request &&
> > +	    !(e->type->elevator_features & ELEVATOR_F_MQ_AWARE)) {
> > +		hctx = blk_mq_map_queue_type(q, HCTX_TYPE_DEFAULT,
> > +					     raw_smp_processor_id());
> > +		if (!blk_mq_hctx_stopped(hctx))
> > +			return hctx;
> > +	}
> > +	return NULL;
> > +}
> > +
> >  /**
> >   * blk_mq_run_hw_queues - Run all hardware queues in a request queue.
> >   * @q: Pointer to the request queue to run.
> > @@ -1673,6 +1698,12 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
> >  	struct blk_mq_hw_ctx *hctx;
> >  	int i;
> >  
> > +	hctx = blk_mq_sq_iosched_hctx(q);
> > +	if (hctx) {
> > +		blk_mq_run_hw_queue(hctx, async);
> > +		return;
> > +	}
> > +
> 
> This approach looks reasonable, just wondering which code path is wrt.
> blk_mq_run_hw_queues() improvement by this patch.
> 
> Since ed5dd6a67d5e ("scsi: core: Only re-run queue in scsi_end_request() if device
> queue is busy") is merged, blk_mq_run_hw_queues() is only called from scsi_end_request()
> when the scsi device is busy for megaraid.
> 
> Another one is bfq_schedule_dispatch(), in which blk_mq_run_hw_queues()
> is still be called, if that is the reason, maybe it is easier to optimize
> bfq_schedule_dispatch() by avoiding to call blk_mq_run_hw_queues().

That's a good question. Tracing shows that with dbench I'm seeing *lots*
(about 23000/s) blk_mq_delay_run_hw_queues() calls, mostly from
__blk_mq_do_dispatch_sched(). This drops to "only" about 2000 calls/s with
my patches applied.

So it means BFQ decided not to dispatch any request (e.g. because it is
idling for more IO from the same process) and that triggers that path in
__blk_mq_do_dispatch_sched() that just queues the dispatch again. So blk-mq
ends up polling BFQ rather heavily for requests it doesn't want to give out
:). In this sense my patch just makes the real problem less severe.

I've noticed that if ->has_work() returned false, we would not end up
calling blk_mq_delay_run_hw_queues(). But for BFQ ->has_work() often
returns true because it has requests queued but ->dispatch_request()
doesn't dispatch anything because of other scheduling constraints. And so
we end up calling blk_mq_delay_run_hw_queues() because if we allocated
dispatch budget and didn't dispatch in the end, we could have blocked
dispatch from another hctx and so now need to rerun that hctx to dispatch
possibly queued requests.

I was thinking how we could possibly improve this. One obvious possibility
is to modify IO schedulers so that their ->has_work() does not return true
if they later decide not to dispatch anything. However this can happen both
to mq-deadline and BFQ and for either of them determining whether they will
dispatch a request or not is about as expensive as dispatching it. So it
doesn't seem very appealing for these IO schedulers to do the work twice or
to somehow cache the request found. What seems more workable would be for
blk_mq_put_dispatch_budget() to return whether rerunning the queue might be
needed or not (for SCSI, which is the only subsystem using budgeting, this
means returning whether we were currently at queue_depth) and use that
information in __blk_mq_do_dispatch_sched(). I'll experiment with a patch I
guess...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
  2020-12-22 10:18     ` Jan Kara
@ 2020-12-22 16:55       ` Jan Kara
  2020-12-23  3:43         ` Ming Lei
  2020-12-23  3:33       ` Ming Lei
  1 sibling, 1 reply; 14+ messages in thread
From: Jan Kara @ 2020-12-22 16:55 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jan Kara, Jens Axboe, linux-block, hare, Kashyap Desai

On Tue 22-12-20 11:18:22, Jan Kara wrote:
> On Sat 19-12-20 11:14:27, Ming Lei wrote:
> > On Fri, Dec 18, 2020 at 10:44:12PM +0100, Jan Kara wrote:
> > > Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for
> > > a queue with multiple HW queues, the performance it rather bad. The
> > > problem is that these IO schedulers use queue-wide locking and their
> > > dispatch function does not respect the hctx it is passed in and returns
> > > any request it finds appropriate. Thus locality of request access is
> > > broken and dispatch from multiple CPUs just contends on IO scheduler
> > > locks. For these IO schedulers there's little point in dispatching from
> > > multiple CPUs. Instead dispatch always only from a single CPU to limit
> > > contention.
> > > 
> > > Below is a comparison of dbench runs on XFS filesystem where the storage
> > > is a raid card with 64 HW queues and to it attached a single rotating
> > > disk. BFQ is used as IO scheduler:
> > > 
> > >       clients           MQ                     SQ             MQ-Patched
> > > Amean 1      39.12 (0.00%)       43.29 * -10.67%*       36.09 *   7.74%*
> > > Amean 2     128.58 (0.00%)      101.30 *  21.22%*       96.14 *  25.23%*
> > > Amean 4     577.42 (0.00%)      494.47 *  14.37%*      508.49 *  11.94%*
> > > Amean 8     610.95 (0.00%)      363.86 *  40.44%*      362.12 *  40.73%*
> > > Amean 16    391.78 (0.00%)      261.49 *  33.25%*      282.94 *  27.78%*
> > > Amean 32    324.64 (0.00%)      267.71 *  17.54%*      233.00 *  28.23%*
> > > Amean 64    295.04 (0.00%)      253.02 *  14.24%*      242.37 *  17.85%*
> > > Amean 512 10281.61 (0.00%)    10211.16 *   0.69%*    10447.53 *  -1.61%*
> > > 
> > > Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
> > > the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
> > > advertises just a single HW queue. MQ-Patched is a kernel with this
> > > patch applied.
> > > 
> > > You can see multiple hardware queues heavily hurt performance in
> > > combination with BFQ. The patch restores the performance.
> > > 
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  block/blk-mq.c           | 37 +++++++++++++++++++++++++++++++++++++
> > >  block/kyber-iosched.c    |  1 +
> > >  include/linux/elevator.h |  2 ++
> > >  3 files changed, 40 insertions(+)
> > > 
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index 57d0461f2be5..6d80054c231b 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -1663,6 +1663,31 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
> > >  }
> > >  EXPORT_SYMBOL(blk_mq_run_hw_queue);
> > >  
> > > +static struct blk_mq_hw_ctx *blk_mq_sq_iosched_hctx(struct request_queue *q)
> > > +{
> > > +	struct elevator_queue *e = q->elevator;
> > > +	struct blk_mq_hw_ctx *hctx;
> > > +
> > > +	/*
> > > +	 * The queue has multiple hardware queues but uses IO scheduler that
> > > +	 * does not respect hardware queues when dispatching? This is not a
> > > +	 * great setup but it can be sensible when we have a single rotational
> > > +	 * disk behind a raid card. Just don't bother with multiple HW queues
> > > +	 * and dispatch from hctx for the current CPU since running multiple
> > > +	 * queues just causes lock contention inside the scheduler and
> > > +	 * pointless cache bouncing because the hctx is not respected by the IO
> > > +	 * scheduler's dispatch function anyway.
> > > +	 */
> > > +	if (q->nr_hw_queues > 1 && e && e->type->ops.dispatch_request &&
> > > +	    !(e->type->elevator_features & ELEVATOR_F_MQ_AWARE)) {
> > > +		hctx = blk_mq_map_queue_type(q, HCTX_TYPE_DEFAULT,
> > > +					     raw_smp_processor_id());
> > > +		if (!blk_mq_hctx_stopped(hctx))
> > > +			return hctx;
> > > +	}
> > > +	return NULL;
> > > +}
> > > +
> > >  /**
> > >   * blk_mq_run_hw_queues - Run all hardware queues in a request queue.
> > >   * @q: Pointer to the request queue to run.
> > > @@ -1673,6 +1698,12 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
> > >  	struct blk_mq_hw_ctx *hctx;
> > >  	int i;
> > >  
> > > +	hctx = blk_mq_sq_iosched_hctx(q);
> > > +	if (hctx) {
> > > +		blk_mq_run_hw_queue(hctx, async);
> > > +		return;
> > > +	}
> > > +
> > 
> > This approach looks reasonable, just wondering which code path is wrt.
> > blk_mq_run_hw_queues() improvement by this patch.
> > 
> > Since ed5dd6a67d5e ("scsi: core: Only re-run queue in scsi_end_request() if device
> > queue is busy") is merged, blk_mq_run_hw_queues() is only called from scsi_end_request()
> > when the scsi device is busy for megaraid.
> > 
> > Another one is bfq_schedule_dispatch(), in which blk_mq_run_hw_queues()
> > is still be called, if that is the reason, maybe it is easier to optimize
> > bfq_schedule_dispatch() by avoiding to call blk_mq_run_hw_queues().
> 
> That's a good question. Tracing shows that with dbench I'm seeing *lots*
> (about 23000/s) blk_mq_delay_run_hw_queues() calls, mostly from
> __blk_mq_do_dispatch_sched(). This drops to "only" about 2000 calls/s with
> my patches applied.
> 
> So it means BFQ decided not to dispatch any request (e.g. because it is
> idling for more IO from the same process) and that triggers that path in
> __blk_mq_do_dispatch_sched() that just queues the dispatch again. So blk-mq
> ends up polling BFQ rather heavily for requests it doesn't want to give out
> :). In this sense my patch just makes the real problem less severe.
> 
> I've noticed that if ->has_work() returned false, we would not end up
> calling blk_mq_delay_run_hw_queues(). But for BFQ ->has_work() often
> returns true because it has requests queued but ->dispatch_request()
> doesn't dispatch anything because of other scheduling constraints. And so
> we end up calling blk_mq_delay_run_hw_queues() because if we allocated
> dispatch budget and didn't dispatch in the end, we could have blocked
> dispatch from another hctx and so now need to rerun that hctx to dispatch
> possibly queued requests.
> 
> I was thinking how we could possibly improve this. One obvious possibility
> is to modify IO schedulers so that their ->has_work() does not return true
> if they later decide not to dispatch anything. However this can happen both
> to mq-deadline and BFQ and for either of them determining whether they will
> dispatch a request or not is about as expensive as dispatching it. So it
> doesn't seem very appealing for these IO schedulers to do the work twice or
> to somehow cache the request found. What seems more workable would be for
> blk_mq_put_dispatch_budget() to return whether rerunning the queue might be
> needed or not (for SCSI, which is the only subsystem using budgeting, this
> means returning whether we were currently at queue_depth) and use that
> information in __blk_mq_do_dispatch_sched(). I'll experiment with a patch I
> guess...

OK, so I was experimenting more with this. I've implemented my idea of
->put_budget() returning whether it is needed to rerun the queue (and then
using this inside __blk_mq_do_dispatch_sched()). This indeed mostly got rid
of calls of blk_mq_delay_run_hw_queues() from __blk_mq_do_dispatch_sched().
This patch on its own however was not enough to fix the regression when
megaraid_sas started using multiple HW queues - it got back about half of
the regression. There was still enough contention and cache bouncing from
the remaining blk_mq_run_hw_queues() calls - mostly called when a request was
completed and there was no request currently running (this happens in
bfq_completed_request -> bfq_schedule_dispatch path). And these calls seem
to be really needed AFAICT. Also I did test run with both the previous
series and "budget" patch applied and it didn't show any significant
difference to just the previous series. So although conceptually the patch
makes sense and reduces calls to blk_mq_delay_run_hw_queues(), I'm not sure
it is worth it given it brings no measurable benefit. What do people think?

								Honza

PS: I've noticed that my original patch is slightly buggy and we probably
need to run not only the hctx on current CPU but also any hctx with
non-empty ->dispatch list. Otherwise we could stall some requests. I'll fix
that up on next posting.

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
  2020-12-22 10:18     ` Jan Kara
  2020-12-22 16:55       ` Jan Kara
@ 2020-12-23  3:33       ` Ming Lei
  1 sibling, 0 replies; 14+ messages in thread
From: Ming Lei @ 2020-12-23  3:33 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, linux-block, hare, Kashyap Desai

On Tue, Dec 22, 2020 at 11:18:22AM +0100, Jan Kara wrote:
> On Sat 19-12-20 11:14:27, Ming Lei wrote:
> > On Fri, Dec 18, 2020 at 10:44:12PM +0100, Jan Kara wrote:
> > > Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for
> > > a queue with multiple HW queues, the performance it rather bad. The
> > > problem is that these IO schedulers use queue-wide locking and their
> > > dispatch function does not respect the hctx it is passed in and returns
> > > any request it finds appropriate. Thus locality of request access is
> > > broken and dispatch from multiple CPUs just contends on IO scheduler
> > > locks. For these IO schedulers there's little point in dispatching from
> > > multiple CPUs. Instead dispatch always only from a single CPU to limit
> > > contention.
> > > 
> > > Below is a comparison of dbench runs on XFS filesystem where the storage
> > > is a raid card with 64 HW queues and to it attached a single rotating
> > > disk. BFQ is used as IO scheduler:
> > > 
> > >       clients           MQ                     SQ             MQ-Patched
> > > Amean 1      39.12 (0.00%)       43.29 * -10.67%*       36.09 *   7.74%*
> > > Amean 2     128.58 (0.00%)      101.30 *  21.22%*       96.14 *  25.23%*
> > > Amean 4     577.42 (0.00%)      494.47 *  14.37%*      508.49 *  11.94%*
> > > Amean 8     610.95 (0.00%)      363.86 *  40.44%*      362.12 *  40.73%*
> > > Amean 16    391.78 (0.00%)      261.49 *  33.25%*      282.94 *  27.78%*
> > > Amean 32    324.64 (0.00%)      267.71 *  17.54%*      233.00 *  28.23%*
> > > Amean 64    295.04 (0.00%)      253.02 *  14.24%*      242.37 *  17.85%*
> > > Amean 512 10281.61 (0.00%)    10211.16 *   0.69%*    10447.53 *  -1.61%*
> > > 
> > > Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
> > > the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
> > > advertises just a single HW queue. MQ-Patched is a kernel with this
> > > patch applied.
> > > 
> > > You can see multiple hardware queues heavily hurt performance in
> > > combination with BFQ. The patch restores the performance.
> > > 
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  block/blk-mq.c           | 37 +++++++++++++++++++++++++++++++++++++
> > >  block/kyber-iosched.c    |  1 +
> > >  include/linux/elevator.h |  2 ++
> > >  3 files changed, 40 insertions(+)
> > > 
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index 57d0461f2be5..6d80054c231b 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -1663,6 +1663,31 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
> > >  }
> > >  EXPORT_SYMBOL(blk_mq_run_hw_queue);
> > >  
> > > +static struct blk_mq_hw_ctx *blk_mq_sq_iosched_hctx(struct request_queue *q)
> > > +{
> > > +	struct elevator_queue *e = q->elevator;
> > > +	struct blk_mq_hw_ctx *hctx;
> > > +
> > > +	/*
> > > +	 * The queue has multiple hardware queues but uses IO scheduler that
> > > +	 * does not respect hardware queues when dispatching? This is not a
> > > +	 * great setup but it can be sensible when we have a single rotational
> > > +	 * disk behind a raid card. Just don't bother with multiple HW queues
> > > +	 * and dispatch from hctx for the current CPU since running multiple
> > > +	 * queues just causes lock contention inside the scheduler and
> > > +	 * pointless cache bouncing because the hctx is not respected by the IO
> > > +	 * scheduler's dispatch function anyway.
> > > +	 */
> > > +	if (q->nr_hw_queues > 1 && e && e->type->ops.dispatch_request &&
> > > +	    !(e->type->elevator_features & ELEVATOR_F_MQ_AWARE)) {
> > > +		hctx = blk_mq_map_queue_type(q, HCTX_TYPE_DEFAULT,
> > > +					     raw_smp_processor_id());
> > > +		if (!blk_mq_hctx_stopped(hctx))
> > > +			return hctx;
> > > +	}
> > > +	return NULL;
> > > +}
> > > +
> > >  /**
> > >   * blk_mq_run_hw_queues - Run all hardware queues in a request queue.
> > >   * @q: Pointer to the request queue to run.
> > > @@ -1673,6 +1698,12 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
> > >  	struct blk_mq_hw_ctx *hctx;
> > >  	int i;
> > >  
> > > +	hctx = blk_mq_sq_iosched_hctx(q);
> > > +	if (hctx) {
> > > +		blk_mq_run_hw_queue(hctx, async);
> > > +		return;
> > > +	}
> > > +
> > 
> > This approach looks reasonable, just wondering which code path is wrt.
> > blk_mq_run_hw_queues() improvement by this patch.
> > 
> > Since ed5dd6a67d5e ("scsi: core: Only re-run queue in scsi_end_request() if device
> > queue is busy") is merged, blk_mq_run_hw_queues() is only called from scsi_end_request()
> > when the scsi device is busy for megaraid.
> > 
> > Another one is bfq_schedule_dispatch(), in which blk_mq_run_hw_queues()
> > is still be called, if that is the reason, maybe it is easier to optimize
> > bfq_schedule_dispatch() by avoiding to call blk_mq_run_hw_queues().
> 
> That's a good question. Tracing shows that with dbench I'm seeing *lots*
> (about 23000/s) blk_mq_delay_run_hw_queues() calls, mostly from
> __blk_mq_do_dispatch_sched(). This drops to "only" about 2000 calls/s with
> my patches applied.

There is only one blk_mq_delay_run_hw_queues(BLK_MQ_BUDGET_DELAY) in
__blk_mq_do_dispatch_sched() with 3ms delay, so in theory there will be at most
(300 * nr_hw_queues) calls/s, but nr_hw_queues could be big as nr_cpus.

> 
> So it means BFQ decided not to dispatch any request (e.g. because it is
> idling for more IO from the same process) and that triggers that path in
> __blk_mq_do_dispatch_sched() that just queues the dispatch again. So blk-mq
> ends up polling BFQ rather heavily for requests it doesn't want to give out
> :). In this sense my patch just makes the real problem less severe.
> 
> I've noticed that if ->has_work() returned false, we would not end up
> calling blk_mq_delay_run_hw_queues(). But for BFQ ->has_work() often
> returns true because it has requests queued but ->dispatch_request()
> doesn't dispatch anything because of other scheduling constraints. And so
> we end up calling blk_mq_delay_run_hw_queues() because if we allocated
> dispatch budget and didn't dispatch in the end, we could have blocked
> dispatch from another hctx and so now need to rerun that hctx to dispatch
> possibly queued requests.

Yeah, it is one BFQ specific behavior, and BFQ just said there is work
to do, but it can't be dispatched immediately.

> 
> I was thinking how we could possibly improve this. One obvious possibility
> is to modify IO schedulers so that their ->has_work() does not return true
> if they later decide not to dispatch anything. However this can happen both
> to mq-deadline and BFQ and for either of them determining whether they will
> dispatch a request or not is about as expensive as dispatching it. So it
> doesn't seem very appealing for these IO schedulers to do the work twice or
> to somehow cache the request found. What seems more workable would be for
> blk_mq_put_dispatch_budget() to return whether rerunning the queue might be
> needed or not (for SCSI, which is the only subsystem using budgeting, this

You may refer to commit a0823421a4d7("blk-mq: Rerun dispatching in the case of
budget contention")

BTW, I think the approach in your patch is good enough for handling the
issue. For BFQ and mq-deadline, it is enough to just run one hctx in
blk_mq_delay_run_hw_queues() and blk_mq_run_hw_queues().


Thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
  2020-12-22 16:55       ` Jan Kara
@ 2020-12-23  3:43         ` Ming Lei
  0 siblings, 0 replies; 14+ messages in thread
From: Ming Lei @ 2020-12-23  3:43 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, linux-block, hare, Kashyap Desai

On Tue, Dec 22, 2020 at 05:55:42PM +0100, Jan Kara wrote:
> On Tue 22-12-20 11:18:22, Jan Kara wrote:
> > On Sat 19-12-20 11:14:27, Ming Lei wrote:
> > > On Fri, Dec 18, 2020 at 10:44:12PM +0100, Jan Kara wrote:
> > > > Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for
> > > > a queue with multiple HW queues, the performance it rather bad. The
> > > > problem is that these IO schedulers use queue-wide locking and their
> > > > dispatch function does not respect the hctx it is passed in and returns
> > > > any request it finds appropriate. Thus locality of request access is
> > > > broken and dispatch from multiple CPUs just contends on IO scheduler
> > > > locks. For these IO schedulers there's little point in dispatching from
> > > > multiple CPUs. Instead dispatch always only from a single CPU to limit
> > > > contention.
> > > > 
> > > > Below is a comparison of dbench runs on XFS filesystem where the storage
> > > > is a raid card with 64 HW queues and to it attached a single rotating
> > > > disk. BFQ is used as IO scheduler:
> > > > 
> > > >       clients           MQ                     SQ             MQ-Patched
> > > > Amean 1      39.12 (0.00%)       43.29 * -10.67%*       36.09 *   7.74%*
> > > > Amean 2     128.58 (0.00%)      101.30 *  21.22%*       96.14 *  25.23%*
> > > > Amean 4     577.42 (0.00%)      494.47 *  14.37%*      508.49 *  11.94%*
> > > > Amean 8     610.95 (0.00%)      363.86 *  40.44%*      362.12 *  40.73%*
> > > > Amean 16    391.78 (0.00%)      261.49 *  33.25%*      282.94 *  27.78%*
> > > > Amean 32    324.64 (0.00%)      267.71 *  17.54%*      233.00 *  28.23%*
> > > > Amean 64    295.04 (0.00%)      253.02 *  14.24%*      242.37 *  17.85%*
> > > > Amean 512 10281.61 (0.00%)    10211.16 *   0.69%*    10447.53 *  -1.61%*
> > > > 
> > > > Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
> > > > the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
> > > > advertises just a single HW queue. MQ-Patched is a kernel with this
> > > > patch applied.
> > > > 
> > > > You can see multiple hardware queues heavily hurt performance in
> > > > combination with BFQ. The patch restores the performance.
> > > > 
> > > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > > ---
> > > >  block/blk-mq.c           | 37 +++++++++++++++++++++++++++++++++++++
> > > >  block/kyber-iosched.c    |  1 +
> > > >  include/linux/elevator.h |  2 ++
> > > >  3 files changed, 40 insertions(+)
> > > > 
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index 57d0461f2be5..6d80054c231b 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -1663,6 +1663,31 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
> > > >  }
> > > >  EXPORT_SYMBOL(blk_mq_run_hw_queue);
> > > >  
> > > > +static struct blk_mq_hw_ctx *blk_mq_sq_iosched_hctx(struct request_queue *q)
> > > > +{
> > > > +	struct elevator_queue *e = q->elevator;
> > > > +	struct blk_mq_hw_ctx *hctx;
> > > > +
> > > > +	/*
> > > > +	 * The queue has multiple hardware queues but uses IO scheduler that
> > > > +	 * does not respect hardware queues when dispatching? This is not a
> > > > +	 * great setup but it can be sensible when we have a single rotational
> > > > +	 * disk behind a raid card. Just don't bother with multiple HW queues
> > > > +	 * and dispatch from hctx for the current CPU since running multiple
> > > > +	 * queues just causes lock contention inside the scheduler and
> > > > +	 * pointless cache bouncing because the hctx is not respected by the IO
> > > > +	 * scheduler's dispatch function anyway.
> > > > +	 */
> > > > +	if (q->nr_hw_queues > 1 && e && e->type->ops.dispatch_request &&
> > > > +	    !(e->type->elevator_features & ELEVATOR_F_MQ_AWARE)) {
> > > > +		hctx = blk_mq_map_queue_type(q, HCTX_TYPE_DEFAULT,
> > > > +					     raw_smp_processor_id());
> > > > +		if (!blk_mq_hctx_stopped(hctx))
> > > > +			return hctx;
> > > > +	}
> > > > +	return NULL;
> > > > +}
> > > > +
> > > >  /**
> > > >   * blk_mq_run_hw_queues - Run all hardware queues in a request queue.
> > > >   * @q: Pointer to the request queue to run.
> > > > @@ -1673,6 +1698,12 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
> > > >  	struct blk_mq_hw_ctx *hctx;
> > > >  	int i;
> > > >  
> > > > +	hctx = blk_mq_sq_iosched_hctx(q);
> > > > +	if (hctx) {
> > > > +		blk_mq_run_hw_queue(hctx, async);
> > > > +		return;
> > > > +	}
> > > > +
> > > 
> > > This approach looks reasonable, just wondering which code path is wrt.
> > > blk_mq_run_hw_queues() improvement by this patch.
> > > 
> > > Since ed5dd6a67d5e ("scsi: core: Only re-run queue in scsi_end_request() if device
> > > queue is busy") is merged, blk_mq_run_hw_queues() is only called from scsi_end_request()
> > > when the scsi device is busy for megaraid.
> > > 
> > > Another one is bfq_schedule_dispatch(), in which blk_mq_run_hw_queues()
> > > is still be called, if that is the reason, maybe it is easier to optimize
> > > bfq_schedule_dispatch() by avoiding to call blk_mq_run_hw_queues().
> > 
> > That's a good question. Tracing shows that with dbench I'm seeing *lots*
> > (about 23000/s) blk_mq_delay_run_hw_queues() calls, mostly from
> > __blk_mq_do_dispatch_sched(). This drops to "only" about 2000 calls/s with
> > my patches applied.
> > 
> > So it means BFQ decided not to dispatch any request (e.g. because it is
> > idling for more IO from the same process) and that triggers that path in
> > __blk_mq_do_dispatch_sched() that just queues the dispatch again. So blk-mq
> > ends up polling BFQ rather heavily for requests it doesn't want to give out
> > :). In this sense my patch just makes the real problem less severe.
> > 
> > I've noticed that if ->has_work() returned false, we would not end up
> > calling blk_mq_delay_run_hw_queues(). But for BFQ ->has_work() often
> > returns true because it has requests queued but ->dispatch_request()
> > doesn't dispatch anything because of other scheduling constraints. And so
> > we end up calling blk_mq_delay_run_hw_queues() because if we allocated
> > dispatch budget and didn't dispatch in the end, we could have blocked
> > dispatch from another hctx and so now need to rerun that hctx to dispatch
> > possibly queued requests.
> > 
> > I was thinking how we could possibly improve this. One obvious possibility
> > is to modify IO schedulers so that their ->has_work() does not return true
> > if they later decide not to dispatch anything. However this can happen both
> > to mq-deadline and BFQ and for either of them determining whether they will
> > dispatch a request or not is about as expensive as dispatching it. So it
> > doesn't seem very appealing for these IO schedulers to do the work twice or
> > to somehow cache the request found. What seems more workable would be for
> > blk_mq_put_dispatch_budget() to return whether rerunning the queue might be
> > needed or not (for SCSI, which is the only subsystem using budgeting, this
> > means returning whether we were currently at queue_depth) and use that
> > information in __blk_mq_do_dispatch_sched(). I'll experiment with a patch I
> > guess...
> 
> OK, so I was experimenting more with this. I've implemented my idea of
> ->put_budget() returning whether it is needed to rerun the queue (and then
> using this inside __blk_mq_do_dispatch_sched()). This indeed mostly got rid
> of calls of blk_mq_delay_run_hw_queues() from __blk_mq_do_dispatch_sched().
> This patch on its own however was not enough to fix the regression when
> megaraid_sas started using multiple HW queues - it got back about half of
> the regression. There was still enough contention and cache bouncing from
> the remaining blk_mq_run_hw_queues() calls - mostly called when a request was
> completed and there was no request currently running (this happens in
> bfq_completed_request -> bfq_schedule_dispatch path). And these calls seem
> to be really needed AFAICT. Also I did test run with both the previous
> series and "budget" patch applied and it didn't show any significant
> difference to just the previous series. So although conceptually the patch
> makes sense and reduces calls to blk_mq_delay_run_hw_queues(), I'm not sure
> it is worth it given it brings no measurable benefit. What do people think?

I think it is good to fix the regression by the MQ_AWARE way first.

> 
> 								Honza
> 
> PS: I've noticed that my original patch is slightly buggy and we probably
> need to run not only the hctx on current CPU but also any hctx with
> non-empty ->dispatch list. Otherwise we could stall some requests. I'll fix
> that up on next posting.

Yeah, the MQ_AWARE way should only be applied to dispatch from scheduler
queue.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
  2021-01-11 16:47 ` [PATCH 2/2] " Jan Kara
@ 2021-01-12  2:15   ` Ming Lei
  0 siblings, 0 replies; 14+ messages in thread
From: Ming Lei @ 2021-01-12  2:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, linux-block

On Mon, Jan 11, 2021 at 05:47:17PM +0100, Jan Kara wrote:
> Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for
> a queue with multiple HW queues, the performance it rather bad. The
> problem is that these IO schedulers use queue-wide locking and their
> dispatch function does not respect the hctx it is passed in and returns
> any request it finds appropriate. Thus locality of request access is
> broken and dispatch from multiple CPUs just contends on IO scheduler
> locks. For these IO schedulers there's little point in dispatching from
> multiple CPUs. Instead dispatch always only from a single CPU to limit
> contention.
> 
> Below is a comparison of dbench runs on XFS filesystem where the storage
> is a raid card with 64 HW queues and to it attached a single rotating
> disk. BFQ is used as IO scheduler:
> 
>       clients           MQ                     SQ             MQ-Patched
> Amean 1      39.12 (0.00%)       43.29 * -10.67%*       36.09 *   7.74%*
> Amean 2     128.58 (0.00%)      101.30 *  21.22%*       96.14 *  25.23%*
> Amean 4     577.42 (0.00%)      494.47 *  14.37%*      508.49 *  11.94%*
> Amean 8     610.95 (0.00%)      363.86 *  40.44%*      362.12 *  40.73%*
> Amean 16    391.78 (0.00%)      261.49 *  33.25%*      282.94 *  27.78%*
> Amean 32    324.64 (0.00%)      267.71 *  17.54%*      233.00 *  28.23%*
> Amean 64    295.04 (0.00%)      253.02 *  14.24%*      242.37 *  17.85%*
> Amean 512 10281.61 (0.00%)    10211.16 *   0.69%*    10447.53 *  -1.61%*
> 
> Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
> the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
> advertises just a single HW queue. MQ-Patched is a kernel with this
> patch applied.
> 
> You can see multiple hardware queues heavily hurt performance in
> combination with BFQ. The patch restores the performance.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  block/blk-mq.c           | 66 ++++++++++++++++++++++++++++++++++++----
>  block/kyber-iosched.c    |  1 +
>  include/linux/elevator.h |  2 ++
>  3 files changed, 63 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 57f3482b2c26..580601787f7a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1646,6 +1646,42 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>  }
>  EXPORT_SYMBOL(blk_mq_run_hw_queue);
>  
> +/*
> + * Is the request queue handled by an IO scheduler that does not respect
> + * hardware queues when dispatching?
> + */
> +static bool blk_mq_has_sqsched(struct request_queue *q)
> +{
> +	struct elevator_queue *e = q->elevator;
> +
> +	if (e && e->type->ops.dispatch_request &&
> +	    !(e->type->elevator_features & ELEVATOR_F_MQ_AWARE))
> +		return true;
> +	return false;
> +}
> +
> +/*
> + * Return prefered queue to dispatch from (if any) for non-mq aware IO
> + * scheduler.
> + */
> +static struct blk_mq_hw_ctx *blk_mq_get_sq_hctx(struct request_queue *q)
> +{
> +	struct blk_mq_hw_ctx *hctx;
> +
> +	/*
> +	 * If the IO scheduler does not respect hardware queues when
> +	 * dispatching, we just don't bother with multiple HW queues and
> +	 * dispatch from hctx for the current CPU since running multiple queues
> +	 * just causes lock contention inside the scheduler and pointless cache
> +	 * bouncing.
> +	 */
> +	hctx = blk_mq_map_queue_type(q, HCTX_TYPE_DEFAULT,
> +				     raw_smp_processor_id());
> +	if (!blk_mq_hctx_stopped(hctx))
> +		return hctx;
> +	return NULL;
> +}
> +
>  /**
>   * blk_mq_run_hw_queues - Run all hardware queues in a request queue.
>   * @q: Pointer to the request queue to run.
> @@ -1653,14 +1689,23 @@ EXPORT_SYMBOL(blk_mq_run_hw_queue);
>   */
>  void blk_mq_run_hw_queues(struct request_queue *q, bool async)
>  {
> -	struct blk_mq_hw_ctx *hctx;
> +	struct blk_mq_hw_ctx *hctx, *sq_hctx;
>  	int i;
>  
> +	sq_hctx = NULL;
> +	if (blk_mq_has_sqsched(q))
> +		sq_hctx = blk_mq_get_sq_hctx(q);
>  	queue_for_each_hw_ctx(q, hctx, i) {
>  		if (blk_mq_hctx_stopped(hctx))
>  			continue;
> -
> -		blk_mq_run_hw_queue(hctx, async);
> +		/*
> +		 * Dispatch from this hctx either if there's no hctx preferred
> +		 * by IO scheduler or if it has requests that bypass the
> +		 * scheduler.
> +		 */
> +		if (!sq_hctx || sq_hctx == hctx ||
> +		    !list_empty_careful(&hctx->dispatch))
> +			blk_mq_run_hw_queue(hctx, async);
>  	}
>  }
>  EXPORT_SYMBOL(blk_mq_run_hw_queues);
> @@ -1672,14 +1717,23 @@ EXPORT_SYMBOL(blk_mq_run_hw_queues);
>   */
>  void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
>  {
> -	struct blk_mq_hw_ctx *hctx;
> +	struct blk_mq_hw_ctx *hctx, *sq_hctx;
>  	int i;
>  
> +	sq_hctx = NULL;
> +	if (blk_mq_has_sqsched(q))
> +		sq_hctx = blk_mq_get_sq_hctx(q);
>  	queue_for_each_hw_ctx(q, hctx, i) {
>  		if (blk_mq_hctx_stopped(hctx))
>  			continue;
> -
> -		blk_mq_delay_run_hw_queue(hctx, msecs);
> +		/*
> +		 * Dispatch from this hctx either if there's no hctx preferred
> +		 * by IO scheduler or if it has requests that bypass the
> +		 * scheduler.
> +		 */
> +		if (!sq_hctx || sq_hctx == hctx ||
> +		    !list_empty_careful(&hctx->dispatch))
> +			blk_mq_delay_run_hw_queue(hctx, msecs);
>  	}
>  }
>  EXPORT_SYMBOL(blk_mq_delay_run_hw_queues);
> diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
> index dc89199bc8c6..c25c41d0d061 100644
> --- a/block/kyber-iosched.c
> +++ b/block/kyber-iosched.c
> @@ -1029,6 +1029,7 @@ static struct elevator_type kyber_sched = {
>  #endif
>  	.elevator_attrs = kyber_sched_attrs,
>  	.elevator_name = "kyber",
> +	.elevator_features = ELEVATOR_F_MQ_AWARE,
>  	.elevator_owner = THIS_MODULE,
>  };
>  
> diff --git a/include/linux/elevator.h b/include/linux/elevator.h
> index bacc40a0bdf3..1fe8e105b83b 100644
> --- a/include/linux/elevator.h
> +++ b/include/linux/elevator.h
> @@ -172,6 +172,8 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
>  
>  /* Supports zoned block devices sequential write constraint */
>  #define ELEVATOR_F_ZBD_SEQ_WRITE	(1U << 0)
> +/* Supports scheduling on multiple hardware queues */
> +#define ELEVATOR_F_MQ_AWARE		(1U << 1)
>  
>  #endif /* CONFIG_BLOCK */
>  #endif
> -- 
> 2.26.2
> 

Reviewed-by: Ming Lei <ming.lei@redhat.com>

-- 
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
  2021-01-11 16:47 [PATCH 0/2 v3] " Jan Kara
@ 2021-01-11 16:47 ` Jan Kara
  2021-01-12  2:15   ` Ming Lei
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2021-01-11 16:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Ming Lei, Jan Kara

Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for
a queue with multiple HW queues, the performance it rather bad. The
problem is that these IO schedulers use queue-wide locking and their
dispatch function does not respect the hctx it is passed in and returns
any request it finds appropriate. Thus locality of request access is
broken and dispatch from multiple CPUs just contends on IO scheduler
locks. For these IO schedulers there's little point in dispatching from
multiple CPUs. Instead dispatch always only from a single CPU to limit
contention.

Below is a comparison of dbench runs on XFS filesystem where the storage
is a raid card with 64 HW queues and to it attached a single rotating
disk. BFQ is used as IO scheduler:

      clients           MQ                     SQ             MQ-Patched
Amean 1      39.12 (0.00%)       43.29 * -10.67%*       36.09 *   7.74%*
Amean 2     128.58 (0.00%)      101.30 *  21.22%*       96.14 *  25.23%*
Amean 4     577.42 (0.00%)      494.47 *  14.37%*      508.49 *  11.94%*
Amean 8     610.95 (0.00%)      363.86 *  40.44%*      362.12 *  40.73%*
Amean 16    391.78 (0.00%)      261.49 *  33.25%*      282.94 *  27.78%*
Amean 32    324.64 (0.00%)      267.71 *  17.54%*      233.00 *  28.23%*
Amean 64    295.04 (0.00%)      253.02 *  14.24%*      242.37 *  17.85%*
Amean 512 10281.61 (0.00%)    10211.16 *   0.69%*    10447.53 *  -1.61%*

Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
advertises just a single HW queue. MQ-Patched is a kernel with this
patch applied.

You can see multiple hardware queues heavily hurt performance in
combination with BFQ. The patch restores the performance.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/blk-mq.c           | 66 ++++++++++++++++++++++++++++++++++++----
 block/kyber-iosched.c    |  1 +
 include/linux/elevator.h |  2 ++
 3 files changed, 63 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 57f3482b2c26..580601787f7a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1646,6 +1646,42 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 }
 EXPORT_SYMBOL(blk_mq_run_hw_queue);
 
+/*
+ * Is the request queue handled by an IO scheduler that does not respect
+ * hardware queues when dispatching?
+ */
+static bool blk_mq_has_sqsched(struct request_queue *q)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->ops.dispatch_request &&
+	    !(e->type->elevator_features & ELEVATOR_F_MQ_AWARE))
+		return true;
+	return false;
+}
+
+/*
+ * Return prefered queue to dispatch from (if any) for non-mq aware IO
+ * scheduler.
+ */
+static struct blk_mq_hw_ctx *blk_mq_get_sq_hctx(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	/*
+	 * If the IO scheduler does not respect hardware queues when
+	 * dispatching, we just don't bother with multiple HW queues and
+	 * dispatch from hctx for the current CPU since running multiple queues
+	 * just causes lock contention inside the scheduler and pointless cache
+	 * bouncing.
+	 */
+	hctx = blk_mq_map_queue_type(q, HCTX_TYPE_DEFAULT,
+				     raw_smp_processor_id());
+	if (!blk_mq_hctx_stopped(hctx))
+		return hctx;
+	return NULL;
+}
+
 /**
  * blk_mq_run_hw_queues - Run all hardware queues in a request queue.
  * @q: Pointer to the request queue to run.
@@ -1653,14 +1689,23 @@ EXPORT_SYMBOL(blk_mq_run_hw_queue);
  */
 void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx, *sq_hctx;
 	int i;
 
+	sq_hctx = NULL;
+	if (blk_mq_has_sqsched(q))
+		sq_hctx = blk_mq_get_sq_hctx(q);
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (blk_mq_hctx_stopped(hctx))
 			continue;
-
-		blk_mq_run_hw_queue(hctx, async);
+		/*
+		 * Dispatch from this hctx either if there's no hctx preferred
+		 * by IO scheduler or if it has requests that bypass the
+		 * scheduler.
+		 */
+		if (!sq_hctx || sq_hctx == hctx ||
+		    !list_empty_careful(&hctx->dispatch))
+			blk_mq_run_hw_queue(hctx, async);
 	}
 }
 EXPORT_SYMBOL(blk_mq_run_hw_queues);
@@ -1672,14 +1717,23 @@ EXPORT_SYMBOL(blk_mq_run_hw_queues);
  */
 void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx, *sq_hctx;
 	int i;
 
+	sq_hctx = NULL;
+	if (blk_mq_has_sqsched(q))
+		sq_hctx = blk_mq_get_sq_hctx(q);
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (blk_mq_hctx_stopped(hctx))
 			continue;
-
-		blk_mq_delay_run_hw_queue(hctx, msecs);
+		/*
+		 * Dispatch from this hctx either if there's no hctx preferred
+		 * by IO scheduler or if it has requests that bypass the
+		 * scheduler.
+		 */
+		if (!sq_hctx || sq_hctx == hctx ||
+		    !list_empty_careful(&hctx->dispatch))
+			blk_mq_delay_run_hw_queue(hctx, msecs);
 	}
 }
 EXPORT_SYMBOL(blk_mq_delay_run_hw_queues);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index dc89199bc8c6..c25c41d0d061 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -1029,6 +1029,7 @@ static struct elevator_type kyber_sched = {
 #endif
 	.elevator_attrs = kyber_sched_attrs,
 	.elevator_name = "kyber",
+	.elevator_features = ELEVATOR_F_MQ_AWARE,
 	.elevator_owner = THIS_MODULE,
 };
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index bacc40a0bdf3..1fe8e105b83b 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -172,6 +172,8 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
 
 /* Supports zoned block devices sequential write constraint */
 #define ELEVATOR_F_ZBD_SEQ_WRITE	(1U << 0)
+/* Supports scheduling on multiple hardware queues */
+#define ELEVATOR_F_MQ_AWARE		(1U << 1)
 
 #endif /* CONFIG_BLOCK */
 #endif
-- 
2.26.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
  2021-01-07 11:18     ` Jan Kara
@ 2021-01-07 12:06       ` Ming Lei
  0 siblings, 0 replies; 14+ messages in thread
From: Ming Lei @ 2021-01-07 12:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, linux-block

On Thu, Jan 07, 2021 at 12:18:15PM +0100, Jan Kara wrote:
> On Thu 07-01-21 14:19:18, Ming Lei wrote:
> > On Wed, Jan 06, 2021 at 11:24:28AM +0100, Jan Kara wrote:
> > > +/* Check if there are requests queued in hctx lists. */
> > > +static bool blk_mq_hctx_has_queued_rq(struct blk_mq_hw_ctx *hctx)
> > > +{
> > > +	return !list_empty_careful(&hctx->dispatch) ||
> > > +		sbitmap_any_bit_set(&hctx->ctx_map);
> > > +}
> > > +
> > 
> > blk_mq_hctx_mark_pending() is only called in case of none scheduler, so
> > looks not necessary to check hctx->ctx_map in blk_mq_hctx_has_queued_rq()
> > which is supposed to be used when real io scheduler is attached to MQ queue.
> 
> Yes, I know. I just wanted to make the code less fragile... In particular I
> was somewhat uneasy that we'd rely on the implicit behavior that
> blk_mq_get_sqsched_hctx() can return non-NULL only if sbitmap_any_bit_set()
> is not needed. But maybe we could structure the code like:

BTW, I mentioned the point because sbitmap_any_bit_set(hctx->ctx_map) may take
some CPU cycle in case that nr_cpu_ids is big.

> 
> 	sq_hctx = NULL;
> 	if (blk_mq_has_sqsched(q))
> 		sq_hctx = blk_mq_get_sq_hctx(q);
> 	queue_for_each_hw_ctx(q, hctx, i) {
> 		...
> 		if (!sq_hctx || sq_hctx == hctx ||
> 		    !list_empty_careful(&hctx->dispatch))
> 			... run ...
> 	}
> 
> Because then it is kind of obvious that sq_hctx is set only if there's IO
> scheduler for the queue and thus ctx_map is unused. What do you think?

IMO, the above is more readable and efficient.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
  2021-01-07  6:19   ` Ming Lei
@ 2021-01-07 11:18     ` Jan Kara
  2021-01-07 12:06       ` Ming Lei
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2021-01-07 11:18 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jan Kara, Jens Axboe, linux-block

On Thu 07-01-21 14:19:18, Ming Lei wrote:
> On Wed, Jan 06, 2021 at 11:24:28AM +0100, Jan Kara wrote:
> > +/* Check if there are requests queued in hctx lists. */
> > +static bool blk_mq_hctx_has_queued_rq(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	return !list_empty_careful(&hctx->dispatch) ||
> > +		sbitmap_any_bit_set(&hctx->ctx_map);
> > +}
> > +
> 
> blk_mq_hctx_mark_pending() is only called in case of none scheduler, so
> looks not necessary to check hctx->ctx_map in blk_mq_hctx_has_queued_rq()
> which is supposed to be used when real io scheduler is attached to MQ queue.

Yes, I know. I just wanted to make the code less fragile... In particular I
was somewhat uneasy that we'd rely on the implicit behavior that
blk_mq_get_sqsched_hctx() can return non-NULL only if sbitmap_any_bit_set()
is not needed. But maybe we could structure the code like:

	sq_hctx = NULL;
	if (blk_mq_has_sqsched(q))
		sq_hctx = blk_mq_get_sq_hctx(q);
	queue_for_each_hw_ctx(q, hctx, i) {
		...
		if (!sq_hctx || sq_hctx == hctx ||
		    !list_empty_careful(&hctx->dispatch))
			... run ...
	}

Because then it is kind of obvious that sq_hctx is set only if there's IO
scheduler for the queue and thus ctx_map is unused. What do you think?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
  2021-01-06 10:24 ` [PATCH 2/2] " Jan Kara
@ 2021-01-07  6:19   ` Ming Lei
  2021-01-07 11:18     ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Ming Lei @ 2021-01-07  6:19 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, linux-block

On Wed, Jan 06, 2021 at 11:24:28AM +0100, Jan Kara wrote:
> Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for
> a queue with multiple HW queues, the performance it rather bad. The
> problem is that these IO schedulers use queue-wide locking and their
> dispatch function does not respect the hctx it is passed in and returns
> any request it finds appropriate. Thus locality of request access is
> broken and dispatch from multiple CPUs just contends on IO scheduler
> locks. For these IO schedulers there's little point in dispatching from
> multiple CPUs. Instead dispatch always only from a single CPU to limit
> contention.
> 
> Below is a comparison of dbench runs on XFS filesystem where the storage
> is a raid card with 64 HW queues and to it attached a single rotating
> disk. BFQ is used as IO scheduler:
> 
>       clients           MQ                     SQ             MQ-Patched
> Amean 1      39.12 (0.00%)       43.29 * -10.67%*       36.09 *   7.74%*
> Amean 2     128.58 (0.00%)      101.30 *  21.22%*       96.14 *  25.23%*
> Amean 4     577.42 (0.00%)      494.47 *  14.37%*      508.49 *  11.94%*
> Amean 8     610.95 (0.00%)      363.86 *  40.44%*      362.12 *  40.73%*
> Amean 16    391.78 (0.00%)      261.49 *  33.25%*      282.94 *  27.78%*
> Amean 32    324.64 (0.00%)      267.71 *  17.54%*      233.00 *  28.23%*
> Amean 64    295.04 (0.00%)      253.02 *  14.24%*      242.37 *  17.85%*
> Amean 512 10281.61 (0.00%)    10211.16 *   0.69%*    10447.53 *  -1.61%*
> 
> Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
> the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
> advertises just a single HW queue. MQ-Patched is a kernel with this
> patch applied.
> 
> You can see multiple hardware queues heavily hurt performance in
> combination with BFQ. The patch restores the performance.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  block/blk-mq.c           | 62 ++++++++++++++++++++++++++++++++++------
>  block/kyber-iosched.c    |  1 +
>  include/linux/elevator.h |  2 ++
>  3 files changed, 56 insertions(+), 9 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 57f3482b2c26..26e0f6e64a3a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -63,15 +63,20 @@ static int blk_mq_poll_stats_bkt(const struct request *rq)
>  	return bucket;
>  }
>  
> +/* Check if there are requests queued in hctx lists. */
> +static bool blk_mq_hctx_has_queued_rq(struct blk_mq_hw_ctx *hctx)
> +{
> +	return !list_empty_careful(&hctx->dispatch) ||
> +		sbitmap_any_bit_set(&hctx->ctx_map);
> +}
> +

blk_mq_hctx_mark_pending() is only called in case of none scheduler, so
looks not necessary to check hctx->ctx_map in blk_mq_hctx_has_queued_rq()
which is supposed to be used when real io scheduler is attached to MQ queue.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
  2021-01-06 10:24 [PATCH 0/2 v2] " Jan Kara
@ 2021-01-06 10:24 ` Jan Kara
  2021-01-07  6:19   ` Ming Lei
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2021-01-06 10:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Ming Lei, Jan Kara

Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for
a queue with multiple HW queues, the performance it rather bad. The
problem is that these IO schedulers use queue-wide locking and their
dispatch function does not respect the hctx it is passed in and returns
any request it finds appropriate. Thus locality of request access is
broken and dispatch from multiple CPUs just contends on IO scheduler
locks. For these IO schedulers there's little point in dispatching from
multiple CPUs. Instead dispatch always only from a single CPU to limit
contention.

Below is a comparison of dbench runs on XFS filesystem where the storage
is a raid card with 64 HW queues and to it attached a single rotating
disk. BFQ is used as IO scheduler:

      clients           MQ                     SQ             MQ-Patched
Amean 1      39.12 (0.00%)       43.29 * -10.67%*       36.09 *   7.74%*
Amean 2     128.58 (0.00%)      101.30 *  21.22%*       96.14 *  25.23%*
Amean 4     577.42 (0.00%)      494.47 *  14.37%*      508.49 *  11.94%*
Amean 8     610.95 (0.00%)      363.86 *  40.44%*      362.12 *  40.73%*
Amean 16    391.78 (0.00%)      261.49 *  33.25%*      282.94 *  27.78%*
Amean 32    324.64 (0.00%)      267.71 *  17.54%*      233.00 *  28.23%*
Amean 64    295.04 (0.00%)      253.02 *  14.24%*      242.37 *  17.85%*
Amean 512 10281.61 (0.00%)    10211.16 *   0.69%*    10447.53 *  -1.61%*

Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
advertises just a single HW queue. MQ-Patched is a kernel with this
patch applied.

You can see multiple hardware queues heavily hurt performance in
combination with BFQ. The patch restores the performance.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/blk-mq.c           | 62 ++++++++++++++++++++++++++++++++++------
 block/kyber-iosched.c    |  1 +
 include/linux/elevator.h |  2 ++
 3 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 57f3482b2c26..26e0f6e64a3a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -63,15 +63,20 @@ static int blk_mq_poll_stats_bkt(const struct request *rq)
 	return bucket;
 }
 
+/* Check if there are requests queued in hctx lists. */
+static bool blk_mq_hctx_has_queued_rq(struct blk_mq_hw_ctx *hctx)
+{
+	return !list_empty_careful(&hctx->dispatch) ||
+		sbitmap_any_bit_set(&hctx->ctx_map);
+}
+
 /*
  * Check if any of the ctx, dispatch list or elevator
  * have pending work in this hardware queue.
  */
 static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
 {
-	return !list_empty_careful(&hctx->dispatch) ||
-		sbitmap_any_bit_set(&hctx->ctx_map) ||
-			blk_mq_sched_has_work(hctx);
+	return blk_mq_hctx_has_queued_rq(hctx) || blk_mq_sched_has_work(hctx);
 }
 
 /*
@@ -1646,6 +1651,31 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 }
 EXPORT_SYMBOL(blk_mq_run_hw_queue);
 
+static struct blk_mq_hw_ctx *blk_mq_get_sqsched_hctx(struct request_queue *q)
+{
+	struct elevator_queue *e = q->elevator;
+	struct blk_mq_hw_ctx *hctx;
+
+	/*
+	 * The queue has multiple hardware queues but uses IO scheduler that
+	 * does not respect hardware queues when dispatching? This is not a
+	 * great setup but it can be sensible when we have a single rotational
+	 * disk behind a raid card. Just don't bother with multiple HW queues
+	 * and dispatch from hctx for the current CPU since running multiple
+	 * queues just causes lock contention inside the scheduler and
+	 * pointless cache bouncing because the hctx is not respected by the IO
+	 * scheduler's dispatch function anyway.
+	 */
+	if (q->nr_hw_queues > 1 && e && e->type->ops.dispatch_request &&
+	    !(e->type->elevator_features & ELEVATOR_F_MQ_AWARE)) {
+		hctx = blk_mq_map_queue_type(q, HCTX_TYPE_DEFAULT,
+					     raw_smp_processor_id());
+		if (!blk_mq_hctx_stopped(hctx))
+			return hctx;
+	}
+	return NULL;
+}
+
 /**
  * blk_mq_run_hw_queues - Run all hardware queues in a request queue.
  * @q: Pointer to the request queue to run.
@@ -1653,14 +1683,21 @@ EXPORT_SYMBOL(blk_mq_run_hw_queue);
  */
 void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx, *sq_hctx;
 	int i;
 
+	sq_hctx = blk_mq_get_sqsched_hctx(q);
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (blk_mq_hctx_stopped(hctx))
 			continue;
-
-		blk_mq_run_hw_queue(hctx, async);
+		/*
+		 * Dispatch from this hctx either if there's no hctx preferred
+		 * by IO scheduler or if it has requests that bypass the
+		 * scheduler.
+		 */
+		if (!sq_hctx || blk_mq_hctx_has_queued_rq(hctx) ||
+		   sq_hctx == hctx)
+			blk_mq_run_hw_queue(hctx, async);
 	}
 }
 EXPORT_SYMBOL(blk_mq_run_hw_queues);
@@ -1672,14 +1709,21 @@ EXPORT_SYMBOL(blk_mq_run_hw_queues);
  */
 void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx, *sq_hctx;
 	int i;
 
+	sq_hctx = blk_mq_get_sqsched_hctx(q);
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (blk_mq_hctx_stopped(hctx))
 			continue;
-
-		blk_mq_delay_run_hw_queue(hctx, msecs);
+		/*
+		 * Dispatch from this hctx either if there's no hctx preferred
+		 * by IO scheduler or if it has requests that bypass the
+		 * scheduler.
+		 */
+		if (!sq_hctx || blk_mq_hctx_has_queued_rq(hctx) ||
+		   sq_hctx == hctx)
+			blk_mq_delay_run_hw_queue(hctx, msecs);
 	}
 }
 EXPORT_SYMBOL(blk_mq_delay_run_hw_queues);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index dc89199bc8c6..c25c41d0d061 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -1029,6 +1029,7 @@ static struct elevator_type kyber_sched = {
 #endif
 	.elevator_attrs = kyber_sched_attrs,
 	.elevator_name = "kyber",
+	.elevator_features = ELEVATOR_F_MQ_AWARE,
 	.elevator_owner = THIS_MODULE,
 };
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index bacc40a0bdf3..1fe8e105b83b 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -172,6 +172,8 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
 
 /* Supports zoned block devices sequential write constraint */
 #define ELEVATOR_F_ZBD_SEQ_WRITE	(1U << 0)
+/* Supports scheduling on multiple hardware queues */
+#define ELEVATOR_F_MQ_AWARE		(1U << 1)
 
 #endif /* CONFIG_BLOCK */
 #endif
-- 
2.26.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-01-12  2:17 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-18 21:44 [PATCH 0/2 RFC] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues Jan Kara
2020-12-18 21:44 ` [PATCH 1/2] Revert "blk-mq, elevator: Count requests per hctx to improve performance" Jan Kara
2020-12-18 21:44 ` [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues Jan Kara
2020-12-19  3:14   ` Ming Lei
2020-12-22 10:18     ` Jan Kara
2020-12-22 16:55       ` Jan Kara
2020-12-23  3:43         ` Ming Lei
2020-12-23  3:33       ` Ming Lei
2021-01-06 10:24 [PATCH 0/2 v2] " Jan Kara
2021-01-06 10:24 ` [PATCH 2/2] " Jan Kara
2021-01-07  6:19   ` Ming Lei
2021-01-07 11:18     ` Jan Kara
2021-01-07 12:06       ` Ming Lei
2021-01-11 16:47 [PATCH 0/2 v3] " Jan Kara
2021-01-11 16:47 ` [PATCH 2/2] " Jan Kara
2021-01-12  2:15   ` Ming Lei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.