* [PATCH 0/2 v3] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
@ 2021-01-11 16:47 Jan Kara
From: Jan Kara @ 2021-01-11 16:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Ming Lei, Jan Kara

Hello!

This patch series aims to fix a regression we've noticed on our test grid when
support for multiple HW queues was added to the megaraid_sas driver during the
5.10 cycle (103fbf8e4020 scsi: megaraid_sas: Added support for shared host
tagset for cpuhotplug). The commit was reverted in the end for other reasons
but I believe the fundamental problem still exists for any other similar
setup. The problem manifests when the storage card supports multiple hardware
queues but the storage behind it is slow (a single rotating disk in our case),
so using an IO scheduler such as BFQ is desirable. See the second patch for
details.

Changes since v2
* Modified code to avoid useless sbitmap_any_set() calls on submit path

Changes since v1
* Fixed queue running code so that it doesn't leave behind pending requests
  that bypass the IO scheduler.


								Honza


* [PATCH 1/2] Revert "blk-mq, elevator: Count requests per hctx to improve performance"
@ 2021-01-11 16:47 ` Jan Kara
From: Jan Kara @ 2021-01-11 16:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Ming Lei, Jan Kara

This reverts commit b445547ec1bbd3e7bf4b1c142550942f70527d95.

Since both mq-deadline and BFQ completely ignore the hctx they are passed
to their dispatch function and dispatch whatever request they deem fit,
checking whether any request for a particular hctx is queued is just
pointless: we will very likely get a request from a different hctx anyway.
In the following commit we'll deal with lock contention in these IO
schedulers in the presence of multiple HW queues in a different way.
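To make the problem concrete, here is a minimal toy sketch (plain
userspace C, not kernel code; all names are made up for the
illustration): a queue-wide scheduler keeps one shared FIFO, so a
request accounted to one hctx at insert time is happily returned when
dispatch runs for another hctx, and a per-hctx counter therefore
predicts nothing.

#include <stdio.h>

#define NR_HCTX 2

static int elevator_queued[NR_HCTX];    /* per-hctx insert counters */
static int shared_fifo_len;             /* the scheduler's single queue */

static void insert_request(int hctx)
{
        elevator_queued[hctx]++;        /* accounted to this hctx... */
        shared_fifo_len++;              /* ...but queued on the shared FIFO */
}

/* dispatch ignores which hctx it was called for, like BFQ / mq-deadline */
static int dispatch_request(int hctx)
{
        (void)hctx;
        if (!shared_fifo_len)
                return 0;
        shared_fifo_len--;
        return 1;
}

int main(void)
{
        insert_request(0);              /* request inserted via hctx 0 */

        /* a counter-based has_work check on hctx 1 claims "no work"... */
        printf("hctx1 counter sees work: %d\n", elevator_queued[1] != 0);
        /* ...yet a dispatch run for hctx 1 returns the request anyway */
        printf("hctx1 dispatch got rq:   %d\n", dispatch_request(1));
        return 0;
}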

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/bfq-iosched.c    | 5 -----
 block/blk-mq.c         | 1 -
 block/mq-deadline.c    | 6 ------
 include/linux/blk-mq.h | 4 ----
 4 files changed, 16 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 9e81d1052091..a99dfaa75a8c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4640,9 +4640,6 @@ static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
 {
 	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
 
-	if (!atomic_read(&hctx->elevator_queued))
-		return false;
-
 	/*
 	 * Avoiding lock: a race on bfqd->busy_queues should cause at
 	 * most a call to dispatch for nothing
@@ -5557,7 +5554,6 @@ static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
 		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		bfq_insert_request(hctx, rq, at_head);
-		atomic_inc(&hctx->elevator_queued);
 	}
 }
 
@@ -5925,7 +5921,6 @@ static void bfq_finish_requeue_request(struct request *rq)
 
 		bfq_completed_request(bfqq, bfqd);
 		bfq_finish_requeue_request_body(bfqq);
-		atomic_dec(&rq->mq_hctx->elevator_queued);
 
 		spin_unlock_irqrestore(&bfqd->lock, flags);
 	} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f285a9123a8b..57f3482b2c26 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2653,7 +2653,6 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 		goto free_hctx;
 
 	atomic_set(&hctx->nr_active, 0);
-	atomic_set(&hctx->elevator_queued, 0);
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 	hctx->numa_node = node;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 800ac902809b..b57470e154c8 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -386,8 +386,6 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	spin_lock(&dd->lock);
 	rq = __dd_dispatch_request(dd);
 	spin_unlock(&dd->lock);
-	if (rq)
-		atomic_dec(&rq->mq_hctx->elevator_queued);
 
 	return rq;
 }
@@ -535,7 +533,6 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		dd_insert_request(hctx, rq, at_head);
-		atomic_inc(&hctx->elevator_queued);
 	}
 	spin_unlock(&dd->lock);
 }
@@ -582,9 +579,6 @@ static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
 
-	if (!atomic_read(&hctx->elevator_queued))
-		return false;
-
 	return !list_empty_careful(&dd->dispatch) ||
 		!list_empty_careful(&dd->fifo_list[0]) ||
 		!list_empty_careful(&dd->fifo_list[1]);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d705b174d346..b4b9604bbfd7 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,10 +140,6 @@ struct blk_mq_hw_ctx {
 	 * shared across request queues.
 	 */
 	atomic_t		nr_active;
-	/**
-	 * @elevator_queued: Number of queued requests on hctx.
-	 */
-	atomic_t                elevator_queued;
 
 	/** @cpuhp_online: List to store request if CPU is going to die */
 	struct hlist_node	cpuhp_online;
-- 
2.26.2



* [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
@ 2021-01-11 16:47 ` Jan Kara
From: Jan Kara @ 2021-01-11 16:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Ming Lei, Jan Kara

Currently, when a non-mq-aware IO scheduler (BFQ, mq-deadline) is used for
a queue with multiple HW queues, the performance is rather bad. The
problem is that these IO schedulers use queue-wide locking and their
dispatch function does not respect the hctx it is passed in; it returns
whatever request it finds appropriate. Thus the locality of request access
is broken and dispatch from multiple CPUs just contends on the IO
scheduler locks. For these IO schedulers there's little point in
dispatching from multiple CPUs. Instead, always dispatch from a single CPU
only, to limit contention.

Below is a comparison of dbench runs on an XFS filesystem where the storage
is a RAID card with 64 HW queues and a single rotating disk attached to it.
BFQ is used as the IO scheduler:

      clients           MQ                     SQ             MQ-Patched
Amean 1      39.12 (0.00%)       43.29 * -10.67%*       36.09 *   7.74%*
Amean 2     128.58 (0.00%)      101.30 *  21.22%*       96.14 *  25.23%*
Amean 4     577.42 (0.00%)      494.47 *  14.37%*      508.49 *  11.94%*
Amean 8     610.95 (0.00%)      363.86 *  40.44%*      362.12 *  40.73%*
Amean 16    391.78 (0.00%)      261.49 *  33.25%*      282.94 *  27.78%*
Amean 32    324.64 (0.00%)      267.71 *  17.54%*      233.00 *  28.23%*
Amean 64    295.04 (0.00%)      253.02 *  14.24%*      242.37 *  17.85%*
Amean 512 10281.61 (0.00%)    10211.16 *   0.69%*    10447.53 *  -1.61%*

Numbers are times, so lower is better. MQ is the stock 5.10-rc6 kernel. SQ
is the same kernel booted with megaraid_sas.host_tagset_enable=0 so that
the card advertises just a single HW queue. MQ-Patched is a kernel with
this patch applied.
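
As an aside, here is a small helper for verifying such a setup (a
hedged sketch only; it assumes the standard blk-mq sysfs layout, and
the device name "sda" is just an example to adjust for your system):

#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        const char *dev = "sda";        /* example device name */
        char path[128], line[256];
        struct dirent *e;
        int nr_hctx = 0;
        FILE *f;
        DIR *d;

        /* active IO scheduler, e.g. "mq-deadline kyber [bfq] none" */
        snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", dev);
        f = fopen(path, "r");
        if (f) {
                if (fgets(line, sizeof(line), f))
                        printf("scheduler: %s", line);
                fclose(f);
        }

        /* one directory per HW dispatch queue under /sys/block/<dev>/mq */
        snprintf(path, sizeof(path), "/sys/block/%s/mq", dev);
        d = opendir(path);
        if (d) {
                while ((e = readdir(d)) != NULL)
                        if (strcmp(e->d_name, ".") && strcmp(e->d_name, ".."))
                                nr_hctx++;
                closedir(d);
        }
        printf("hw queues: %d\n", nr_hctx);
        return 0;
}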

You can see that multiple hardware queues heavily hurt performance in
combination with BFQ. The patch restores the performance.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/blk-mq.c           | 66 ++++++++++++++++++++++++++++++++++++----
 block/kyber-iosched.c    |  1 +
 include/linux/elevator.h |  2 ++
 3 files changed, 63 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 57f3482b2c26..580601787f7a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1646,6 +1646,42 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 }
 EXPORT_SYMBOL(blk_mq_run_hw_queue);
 
+/*
+ * Is the request queue handled by an IO scheduler that does not respect
+ * hardware queues when dispatching?
+ */
+static bool blk_mq_has_sqsched(struct request_queue *q)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->ops.dispatch_request &&
+	    !(e->type->elevator_features & ELEVATOR_F_MQ_AWARE))
+		return true;
+	return false;
+}
+
+/*
+ * Return preferred queue to dispatch from (if any) for non-mq aware IO
+ * scheduler.
+ */
+static struct blk_mq_hw_ctx *blk_mq_get_sq_hctx(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	/*
+	 * If the IO scheduler does not respect hardware queues when
+	 * dispatching, we just don't bother with multiple HW queues and
+	 * dispatch from hctx for the current CPU since running multiple queues
+	 * just causes lock contention inside the scheduler and pointless cache
+	 * bouncing.
+	 */
+	hctx = blk_mq_map_queue_type(q, HCTX_TYPE_DEFAULT,
+				     raw_smp_processor_id());
+	if (!blk_mq_hctx_stopped(hctx))
+		return hctx;
+	return NULL;
+}
+
 /**
  * blk_mq_run_hw_queues - Run all hardware queues in a request queue.
  * @q: Pointer to the request queue to run.
@@ -1653,14 +1689,23 @@ EXPORT_SYMBOL(blk_mq_run_hw_queue);
  */
 void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx, *sq_hctx;
 	int i;
 
+	sq_hctx = NULL;
+	if (blk_mq_has_sqsched(q))
+		sq_hctx = blk_mq_get_sq_hctx(q);
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (blk_mq_hctx_stopped(hctx))
 			continue;
-
-		blk_mq_run_hw_queue(hctx, async);
+		/*
+		 * Dispatch from this hctx either if there's no hctx preferred
+		 * by IO scheduler or if it has requests that bypass the
+		 * scheduler.
+		 */
+		if (!sq_hctx || sq_hctx == hctx ||
+		    !list_empty_careful(&hctx->dispatch))
+			blk_mq_run_hw_queue(hctx, async);
 	}
 }
 EXPORT_SYMBOL(blk_mq_run_hw_queues);
@@ -1672,14 +1717,23 @@ EXPORT_SYMBOL(blk_mq_run_hw_queues);
  */
 void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx, *sq_hctx;
 	int i;
 
+	sq_hctx = NULL;
+	if (blk_mq_has_sqsched(q))
+		sq_hctx = blk_mq_get_sq_hctx(q);
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (blk_mq_hctx_stopped(hctx))
 			continue;
-
-		blk_mq_delay_run_hw_queue(hctx, msecs);
+		/*
+		 * Dispatch from this hctx either if there's no hctx preferred
+		 * by IO scheduler or if it has requests that bypass the
+		 * scheduler.
+		 */
+		if (!sq_hctx || sq_hctx == hctx ||
+		    !list_empty_careful(&hctx->dispatch))
+			blk_mq_delay_run_hw_queue(hctx, msecs);
 	}
 }
 EXPORT_SYMBOL(blk_mq_delay_run_hw_queues);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index dc89199bc8c6..c25c41d0d061 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -1029,6 +1029,7 @@ static struct elevator_type kyber_sched = {
 #endif
 	.elevator_attrs = kyber_sched_attrs,
 	.elevator_name = "kyber",
+	.elevator_features = ELEVATOR_F_MQ_AWARE,
 	.elevator_owner = THIS_MODULE,
 };
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index bacc40a0bdf3..1fe8e105b83b 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -172,6 +172,8 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
 
 /* Supports zoned block devices sequential write constraint */
 #define ELEVATOR_F_ZBD_SEQ_WRITE	(1U << 0)
+/* Supports scheduling on multiple hardware queues */
+#define ELEVATOR_F_MQ_AWARE		(1U << 1)
 
 #endif /* CONFIG_BLOCK */
 #endif
-- 
2.26.2



* Re: [PATCH 1/2] Revert "blk-mq, elevator: Count requests per hctx to improve performance"
@ 2021-01-12  2:14   ` Ming Lei
From: Ming Lei @ 2021-01-12  2:14 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, linux-block

On Mon, Jan 11, 2021 at 05:47:16PM +0100, Jan Kara wrote:
> This reverts commit b445547ec1bbd3e7bf4b1c142550942f70527d95.
> 
> [...]

Reviewed-by: Ming Lei <ming.lei@redhat.com>

-- 
Ming



* Re: [PATCH 2/2] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
@ 2021-01-12  2:15   ` Ming Lei
From: Ming Lei @ 2021-01-12  2:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, linux-block

On Mon, Jan 11, 2021 at 05:47:17PM +0100, Jan Kara wrote:
> Currently, when a non-mq-aware IO scheduler (BFQ, mq-deadline) is used for
> a queue with multiple HW queues, the performance is rather bad.
> 
> [...]

Reviewed-by: Ming Lei <ming.lei@redhat.com>

-- 
Ming



* Re: [PATCH 0/2 v3] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
@ 2021-01-22 14:11 ` Jan Kara
From: Jan Kara @ 2021-01-22 14:11 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Ming Lei, Jan Kara

On Mon 11-01-21 17:47:15, Jan Kara wrote:
> Hello!
> 
> This patch series aims to fix a regression we've noticed on our test grid
> [...]

Jens, can you please pick up these patches? Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/2 v3] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
@ 2021-01-25  1:20 ` Jens Axboe
From: Jens Axboe @ 2021-01-25  1:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-block, Ming Lei

On 1/11/21 9:47 AM, Jan Kara wrote:
> Hello!
> 
> This patch series aims to fix a regression we've noticed on our test grid
> [...]

Applied, thanks.

-- 
Jens Axboe



* Re: [PATCH 0/2 v3] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
@ 2021-02-03 16:18   ` John Garry
From: John Garry @ 2021-02-03 16:18 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara; +Cc: linux-block, Ming Lei

On 25/01/2021 01:20, Jens Axboe wrote:
> On 1/11/21 9:47 AM, Jan Kara wrote:
>> Hello!
>>
>> This patch series aims to fix a regression we've noticed on our test grid
>> when support for multiple HW queues was added to the megaraid_sas driver
>> during the 5.10 cycle (103fbf8e4020 scsi: megaraid_sas: Added support for
>> shared host tagset for cpuhotplug). The commit was reverted in the end for
>> other reasons but I believe the fundamental problem still exists for any
>> other similar setup.

That commit made it into 5.11-rc, and other SCSI HBAs expose HW queues in 
5.10 and earlier. But then this series is targeted at 5.12.

Question: can we consider backporting this series just due to a performance 
regression? I'd say no, but maybe someone strongly disagrees with me ...

Thanks,
John


* Re: [PATCH 0/2 v3] blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
  2021-02-03 16:18   ` John Garry
@ 2021-02-03 16:54     ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2021-02-03 16:54 UTC (permalink / raw)
  To: John Garry; +Cc: Jens Axboe, Jan Kara, linux-block, Ming Lei

On Wed 03-02-21 16:18:59, John Garry wrote:
> On 25/01/2021 01:20, Jens Axboe wrote:
> > On 1/11/21 9:47 AM, Jan Kara wrote:
> > > [...]
> 
> That commit made it into 5.11-rc, and other SCSI HBAs expose HW queues in
> 5.10 and earlier. But then this series is targeted at 5.12.
> 
> Question: can we consider backporting this series just due to a performance
> regression? I'd say no, but maybe someone strongly disagrees with me ...

I also don't consider my series stable tree material. The setup of
multiple HW queues with the BFQ IO scheduler isn't really very common...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

