* [PATCH v3 0/8] blk-mq: various fixes and cleanups
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Rebase of v2 [1] onto block-next. As with v2,

- Patch 1 is the new fix for a hang that Josef reported after trying v1.
- Patches 2-6 are the original series. Patch 5 now has Christoph's and
  Sagi's Reviewed-by.
- Patches 7 and 8 are trivial cleanups.

Patches 1-5 should probably go into 4.11, and 6-8 are for 4.12.

1: http://marc.info/?l=linux-block&m=149141696306216&w=2

Omar Sandoval (8):
  blk-mq: use the right hctx when getting a driver tag fails
  blk-mq-sched: refactor scheduler initialization
  blk-mq-sched: set up scheduler tags when bringing up new queues
  blk-mq-sched: fix crash in switch error path
  blk-mq: remap queues when adding/removing hardware queues
  blk-mq-sched: provide hooks for initializing hardware queue data
  blk-mq: make driver tag failure path easier to follow
  blk-mq: use true instead of 1 for blk_mq_queue_data.last

 block/blk-mq-sched.c     | 187 +++++++++++++++++++++++++++++------------------
 block/blk-mq-sched.h     |  13 ++--
 block/blk-mq.c           |  71 ++++++++++--------
 block/blk-mq.h           |   2 +-
 block/blk-sysfs.c        |   2 +-
 block/elevator.c         | 114 +++++++++++++++--------------
 include/linux/elevator.h |   4 +-
 7 files changed, 227 insertions(+), 166 deletions(-)

-- 
2.12.2

* [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
  2017-04-05 19:01 [PATCH v3 0/8] blk-mq: various fixes and cleanups Omar Sandoval
@ 2017-04-05 19:01 ` Omar Sandoval
  2017-04-06  4:31   ` Ming Lei
  2017-04-05 19:01 ` [PATCH v3 2/8] blk-mq-sched: refactor scheduler initialization Omar Sandoval
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.

The fix is twofold:

1. Make sure we save which hardware queue we were trying to get a
   request for in blk_mq_get_driver_tag() regardless of whether it
   succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
   blk_mq_hw_queue to make it clear that it must handle multiple
   hardware queues, since I've already messed this up on a couple of
   occasions.

This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
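
To make the first part of the fix concrete, here is a simplified sketch
(not the literal kernel code) of how the dispatch loop behaves once
blk_mq_get_driver_tag() updates *hctx on both the success and failure
paths:

	static void dispatch_sketch(struct request_queue *q,
				    struct list_head *list)
	{
		struct blk_mq_hw_ctx *hctx = NULL;
		struct request *rq;

		while (!list_empty(list)) {
			rq = list_first_entry(list, struct request, queuelist);
			if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
				/*
				 * hctx now refers to rq's hardware queue, not
				 * the previous request's, so the correct queue
				 * ends up waiting for a tag and holding the
				 * leftover requests.
				 */
				spin_lock(&hctx->lock);
				list_splice_init(list, &hctx->dispatch);
				spin_unlock(&hctx->lock);
				break;
			}
			list_del_init(&rq->queuelist);
			/* ... hand rq to the driver ... */
		}
	}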

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c |  9 +++++----
 block/blk-mq.c       | 25 +++++++++++++------------
 block/blk-mq.h       |  2 +-
 3 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff18719..fc00f00898d3 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -171,7 +171,8 @@ void blk_mq_sched_put_request(struct request *rq)
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
-	struct elevator_queue *e = hctx->queue->elevator;
+	struct request_queue *q = hctx->queue;
+	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
 	bool did_work = false;
 	LIST_HEAD(rq_list);
@@ -203,10 +204,10 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		did_work = blk_mq_dispatch_rq_list(hctx, &rq_list);
+		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
 	} else if (!has_sched_dispatch) {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
-		blk_mq_dispatch_rq_list(hctx, &rq_list);
+		blk_mq_dispatch_rq_list(q, &rq_list);
 	}
 
 	/*
@@ -222,7 +223,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 			if (!rq)
 				break;
 			list_add(&rq->queuelist, &rq_list);
-		} while (blk_mq_dispatch_rq_list(hctx, &rq_list));
+		} while (blk_mq_dispatch_rq_list(q, &rq_list));
 	}
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f7cd3208bcdf..09cff6d1ba76 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -863,12 +863,8 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 		.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
 	};
 
-	if (rq->tag != -1) {
-done:
-		if (hctx)
-			*hctx = data.hctx;
-		return true;
-	}
+	if (rq->tag != -1)
+		goto done;
 
 	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
 		data.flags |= BLK_MQ_REQ_RESERVED;
@@ -880,10 +876,12 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 			atomic_inc(&data.hctx->nr_active);
 		}
 		data.hctx->tags->rqs[rq->tag] = rq;
-		goto done;
 	}
 
-	return false;
+done:
+	if (hctx)
+		*hctx = data.hctx;
+	return rq->tag != -1;
 }
 
 static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
@@ -980,17 +978,20 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
-bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
+bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 {
-	struct request_queue *q = hctx->queue;
+	struct blk_mq_hw_ctx *hctx;
 	struct request *rq;
 	int errors, queued, ret = BLK_MQ_RQ_QUEUE_OK;
 
+	if (list_empty(list))
+		return false;
+
 	/*
 	 * Now process all the entries, sending them to the driver.
 	 */
 	errors = queued = 0;
-	while (!list_empty(list)) {
+	do {
 		struct blk_mq_queue_data bd;
 
 		rq = list_first_entry(list, struct request, queuelist);
@@ -1053,7 +1054,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 
 		if (ret == BLK_MQ_RQ_QUEUE_BUSY)
 			break;
-	}
+	} while (!list_empty(list));
 
 	hctx->dispatched[queued_to_index(queued)]++;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8d49c06fc520..7e6f2e467696 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -30,7 +30,7 @@ void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
-bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
+bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
-- 
2.12.2

* [PATCH v3 2/8] blk-mq-sched: refactor scheduler initialization
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Preparation cleanup for the next couple of fixes: push the
blk_mq_sched_setup() and e->ops.mq.init_sched() logic into a single
helper, blk_mq_init_sched().
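
After this patch, the call structure is roughly:

	elevator_init() / elevator_switch()
	    if (e->uses_mq)
		blk_mq_init_sched(q, e);    /* alloc sched tags per hctx, then
					       e->ops.mq.init_sched() */
	    else
		e->ops.sq.elevator_init_fn(q, e);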

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c | 82 ++++++++++++++++++++++++++++------------------------
 block/blk-mq-sched.h |  2 +-
 block/elevator.c     | 32 ++++++++------------
 3 files changed, 57 insertions(+), 59 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index fc00f00898d3..6bd1758ea29b 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -432,11 +432,45 @@ static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
 	}
 }
 
-int blk_mq_sched_setup(struct request_queue *q)
+static int blk_mq_sched_alloc_tags(struct request_queue *q,
+				   struct blk_mq_hw_ctx *hctx,
+				   unsigned int hctx_idx)
+{
+	struct blk_mq_tag_set *set = q->tag_set;
+	int ret;
+
+	hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
+					       set->reserved_tags);
+	if (!hctx->sched_tags)
+		return -ENOMEM;
+
+	ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests);
+	if (ret)
+		blk_mq_sched_free_tags(set, hctx, hctx_idx);
+
+	return ret;
+}
+
+void blk_mq_sched_teardown(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
-	int ret, i;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i)
+		blk_mq_sched_free_tags(set, hctx, i);
+}
+
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+	int ret;
+
+	if (!e) {
+		q->elevator = NULL;
+		return 0;
+	}
 
 	/*
 	 * Default to 256, since we don't split into sync/async like the
@@ -444,49 +478,21 @@ int blk_mq_sched_setup(struct request_queue *q)
 	 */
 	q->nr_requests = 2 * BLKDEV_MAX_RQ;
 
-	/*
-	 * We're switching to using an IO scheduler, so setup the hctx
-	 * scheduler tags and switch the request map from the regular
-	 * tags to scheduler tags. First allocate what we need, so we
-	 * can safely fail and fallback, if needed.
-	 */
-	ret = 0;
 	queue_for_each_hw_ctx(q, hctx, i) {
-		hctx->sched_tags = blk_mq_alloc_rq_map(set, i,
-				q->nr_requests, set->reserved_tags);
-		if (!hctx->sched_tags) {
-			ret = -ENOMEM;
-			break;
-		}
-		ret = blk_mq_alloc_rqs(set, hctx->sched_tags, i, q->nr_requests);
+		ret = blk_mq_sched_alloc_tags(q, hctx, i);
 		if (ret)
-			break;
+			goto err;
 	}
 
-	/*
-	 * If we failed, free what we did allocate
-	 */
-	if (ret) {
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->sched_tags)
-				continue;
-			blk_mq_sched_free_tags(set, hctx, i);
-		}
-
-		return ret;
-	}
+	ret = e->ops.mq.init_sched(q, e);
+	if (ret)
+		goto err;
 
 	return 0;
-}
 
-void blk_mq_sched_teardown(struct request_queue *q)
-{
-	struct blk_mq_tag_set *set = q->tag_set;
-	struct blk_mq_hw_ctx *hctx;
-	int i;
-
-	queue_for_each_hw_ctx(q, hctx, i)
-		blk_mq_sched_free_tags(set, hctx, i);
+err:
+	blk_mq_sched_teardown(q);
+	return ret;
 }
 
 int blk_mq_sched_init(struct request_queue *q)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index a75b16b123f7..873f9af5a35b 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -32,7 +32,7 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 			struct list_head *rq_list,
 			struct request *(*get_rq)(struct blk_mq_hw_ctx *));
 
-int blk_mq_sched_setup(struct request_queue *q);
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
 void blk_mq_sched_teardown(struct request_queue *q);
 
 int blk_mq_sched_init(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 01139f549b5b..f236ef1d2be9 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -242,17 +242,12 @@ int elevator_init(struct request_queue *q, char *name)
 		}
 	}
 
-	if (e->uses_mq) {
-		err = blk_mq_sched_setup(q);
-		if (!err)
-			err = e->ops.mq.init_sched(q, e);
-	} else
+	if (e->uses_mq)
+		err = blk_mq_init_sched(q, e);
+	else
 		err = e->ops.sq.elevator_init_fn(q, e);
-	if (err) {
-		if (e->uses_mq)
-			blk_mq_sched_teardown(q);
+	if (err)
 		elevator_put(e);
-	}
 	return err;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -987,21 +982,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	}
 
 	/* allocate, init and register new elevator */
-	if (new_e) {
-		if (new_e->uses_mq) {
-			err = blk_mq_sched_setup(q);
-			if (!err)
-				err = new_e->ops.mq.init_sched(q, new_e);
-		} else
-			err = new_e->ops.sq.elevator_init_fn(q, new_e);
-		if (err)
-			goto fail_init;
+	if (q->mq_ops)
+		err = blk_mq_init_sched(q, new_e);
+	else
+		err = new_e->ops.sq.elevator_init_fn(q, new_e);
+	if (err)
+		goto fail_init;
 
+	if (new_e) {
 		err = elv_register_queue(q);
 		if (err)
 			goto fail_register;
-	} else
-		q->elevator = NULL;
+	}
 
 	/* done, kill the old one and finish */
 	if (old) {
-- 
2.12.2

* [PATCH v3 3/8] blk-mq-sched: set up scheduler tags when bringing up new queues
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

If a new hardware queue is added at runtime, we don't allocate scheduler
tags for it, leading to a crash. This hooks up the scheduler framework
to blk_mq_{init,exit}_hctx() to make sure everything gets properly
initialized/freed.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c | 22 ++++++++++++++++++++++
 block/blk-mq-sched.h |  5 +++++
 block/blk-mq.c       |  9 ++++++++-
 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 6bd1758ea29b..0bb13bb51daa 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -461,6 +461,28 @@ void blk_mq_sched_teardown(struct request_queue *q)
 		blk_mq_sched_free_tags(set, hctx, i);
 }
 
+int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			   unsigned int hctx_idx)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (!e)
+		return 0;
+
+	return blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+}
+
+void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			    unsigned int hctx_idx)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (!e)
+		return;
+
+	blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
+}
+
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct blk_mq_hw_ctx *hctx;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 873f9af5a35b..19db25e0c95a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -35,6 +35,11 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
 void blk_mq_sched_teardown(struct request_queue *q);
 
+int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			   unsigned int hctx_idx);
+void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			    unsigned int hctx_idx);
+
 int blk_mq_sched_init(struct request_queue *q);
 
 static inline bool
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 09cff6d1ba76..672430c8c342 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1823,6 +1823,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 				       hctx->fq->flush_rq, hctx_idx,
 				       flush_start_tag + hctx_idx);
 
+	blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
+
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
@@ -1889,9 +1891,12 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
 		goto free_bitmap;
 
+	if (blk_mq_sched_init_hctx(q, hctx, hctx_idx))
+		goto exit_hctx;
+
 	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size);
 	if (!hctx->fq)
-		goto exit_hctx;
+		goto sched_exit_hctx;
 
 	if (set->ops->init_request &&
 	    set->ops->init_request(set->driver_data,
@@ -1906,6 +1911,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
 
  free_fq:
 	kfree(hctx->fq);
+ sched_exit_hctx:
+	blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
  exit_hctx:
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
-- 
2.12.2

* [PATCH v3 4/8] blk-mq-sched: fix crash in switch error path
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
back to the original scheduler. However, at this point, we've already
torn down the original scheduler's tags, so this causes a crash. Doing
the fallback the way the legacy elevator path does is much harder for mq,
so fix it by just falling back to none instead.
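
Condensed, the new mq switch path behaves like this sketch (simplified
from the elevator_switch_mq() added below):

	static int elevator_switch_mq_sketch(struct request_queue *q,
					     struct elevator_type *new_e)
	{
		int ret;

		if (q->elevator)
			elevator_exit(q, q->elevator);	/* old sched tags freed */

		ret = blk_mq_init_sched(q, new_e);
		if (ret) {
			/*
			 * No fallback to the old elevator: its tags are
			 * already gone, so we stay on "none"
			 * (q->elevator == NULL).
			 */
			return ret;
		}

		if (new_e)
			ret = elv_register_queue(q);
		return ret;
	}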

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c     | 13 +++++--
 block/blk-mq-sched.h     |  2 +-
 block/blk-mq.c           |  2 --
 block/blk-sysfs.c        |  2 +-
 block/elevator.c         | 94 +++++++++++++++++++++++++++---------------------
 include/linux/elevator.h |  2 +-
 6 files changed, 67 insertions(+), 48 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 0bb13bb51daa..e8c2ed654ef0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -451,7 +451,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
 	return ret;
 }
 
-void blk_mq_sched_teardown(struct request_queue *q)
+static void blk_mq_sched_tags_teardown(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
@@ -513,10 +513,19 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	return 0;
 
 err:
-	blk_mq_sched_teardown(q);
+	blk_mq_sched_tags_teardown(q);
+	q->elevator = NULL;
 	return ret;
 }
 
+void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
+{
+	if (e->type->ops.mq.exit_sched)
+		e->type->ops.mq.exit_sched(e);
+	blk_mq_sched_tags_teardown(q);
+	q->elevator = NULL;
+}
+
 int blk_mq_sched_init(struct request_queue *q)
 {
 	int ret;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 19db25e0c95a..e704956e0862 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -33,7 +33,7 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 			struct request *(*get_rq)(struct blk_mq_hw_ctx *));
 
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
-void blk_mq_sched_teardown(struct request_queue *q);
+void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
 
 int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			   unsigned int hctx_idx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 672430c8c342..147cc53134d8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2137,8 +2137,6 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
-	blk_mq_sched_teardown(q);
-
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 45854266e398..c47db43a40cc 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -803,7 +803,7 @@ static void blk_release_queue(struct kobject *kobj)
 
 	if (q->elevator) {
 		ioc_clear_queue(q);
-		elevator_exit(q->elevator);
+		elevator_exit(q, q->elevator);
 	}
 
 	blk_free_queue_stats(q->stats);
diff --git a/block/elevator.c b/block/elevator.c
index f236ef1d2be9..dbeecf7be719 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -252,11 +252,11 @@ int elevator_init(struct request_queue *q, char *name)
 }
 EXPORT_SYMBOL(elevator_init);
 
-void elevator_exit(struct elevator_queue *e)
+void elevator_exit(struct request_queue *q, struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
 	if (e->uses_mq && e->type->ops.mq.exit_sched)
-		e->type->ops.mq.exit_sched(e);
+		blk_mq_exit_sched(q, e);
 	else if (!e->uses_mq && e->type->ops.sq.elevator_exit_fn)
 		e->type->ops.sq.elevator_exit_fn(e);
 	mutex_unlock(&e->sysfs_lock);
@@ -941,6 +941,45 @@ void elv_unregister(struct elevator_type *e)
 }
 EXPORT_SYMBOL_GPL(elv_unregister);
 
+static int elevator_switch_mq(struct request_queue *q,
+			      struct elevator_type *new_e)
+{
+	int ret;
+
+	blk_mq_freeze_queue(q);
+	blk_mq_quiesce_queue(q);
+
+	if (q->elevator) {
+		if (q->elevator->registered)
+			elv_unregister_queue(q);
+		ioc_clear_queue(q);
+		elevator_exit(q, q->elevator);
+	}
+
+	ret = blk_mq_init_sched(q, new_e);
+	if (ret)
+		goto out;
+
+	if (new_e) {
+		ret = elv_register_queue(q);
+		if (ret) {
+			elevator_exit(q, q->elevator);
+			goto out;
+		}
+	}
+
+	if (new_e)
+		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	else
+		blk_add_trace_msg(q, "elv switch: none");
+
+out:
+	blk_mq_unfreeze_queue(q);
+	blk_mq_start_stopped_hw_queues(q, true);
+	return ret;
+
+}
+
 /*
  * switch to new_e io scheduler. be careful not to introduce deadlocks -
  * we don't free the old io scheduler, before we have allocated what we
@@ -953,10 +992,8 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	bool old_registered = false;
 	int err;
 
-	if (q->mq_ops) {
-		blk_mq_freeze_queue(q);
-		blk_mq_quiesce_queue(q);
-	}
+	if (q->mq_ops)
+		return elevator_switch_mq(q, new_e);
 
 	/*
 	 * Turn on BYPASS and drain all requests w/ elevator private data.
@@ -968,11 +1005,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (old) {
 		old_registered = old->registered;
 
-		if (old->uses_mq)
-			blk_mq_sched_teardown(q);
-
-		if (!q->mq_ops)
-			blk_queue_bypass_start(q);
+		blk_queue_bypass_start(q);
 
 		/* unregister and clear all auxiliary data of the old elevator */
 		if (old_registered)
@@ -982,53 +1015,32 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	}
 
 	/* allocate, init and register new elevator */
-	if (q->mq_ops)
-		err = blk_mq_init_sched(q, new_e);
-	else
-		err = new_e->ops.sq.elevator_init_fn(q, new_e);
+	err = new_e->ops.sq.elevator_init_fn(q, new_e);
 	if (err)
 		goto fail_init;
 
-	if (new_e) {
-		err = elv_register_queue(q);
-		if (err)
-			goto fail_register;
-	}
+	err = elv_register_queue(q);
+	if (err)
+		goto fail_register;
 
 	/* done, kill the old one and finish */
 	if (old) {
-		elevator_exit(old);
-		if (!q->mq_ops)
-			blk_queue_bypass_end(q);
+		elevator_exit(q, old);
+		blk_queue_bypass_end(q);
 	}
 
-	if (q->mq_ops) {
-		blk_mq_unfreeze_queue(q);
-		blk_mq_start_stopped_hw_queues(q, true);
-	}
-
-	if (new_e)
-		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
-	else
-		blk_add_trace_msg(q, "elv switch: none");
+	blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
 
 	return 0;
 
 fail_register:
-	if (q->mq_ops)
-		blk_mq_sched_teardown(q);
-	elevator_exit(q->elevator);
+	elevator_exit(q, q->elevator);
 fail_init:
 	/* switch failed, restore and re-register old elevator */
 	if (old) {
 		q->elevator = old;
 		elv_register_queue(q);
-		if (!q->mq_ops)
-			blk_queue_bypass_end(q);
-	}
-	if (q->mq_ops) {
-		blk_mq_unfreeze_queue(q);
-		blk_mq_start_stopped_hw_queues(q, true);
+		blk_queue_bypass_end(q);
 	}
 
 	return err;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index aebecc4ed088..22d39e8d4de1 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -211,7 +211,7 @@ extern ssize_t elv_iosched_show(struct request_queue *, char *);
 extern ssize_t elv_iosched_store(struct request_queue *, const char *, size_t);
 
 extern int elevator_init(struct request_queue *, char *);
-extern void elevator_exit(struct elevator_queue *);
+extern void elevator_exit(struct request_queue *, struct elevator_queue *);
 extern int elevator_change(struct request_queue *, const char *);
 extern bool elv_bio_merge_ok(struct request *, struct bio *);
 extern struct elevator_queue *elevator_alloc(struct request_queue *,
-- 
2.12.2

* [PATCH v3 5/8] blk-mq: remap queues when adding/removing hardware queues
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
behavior that drivers expect. However, commit 4e68a011428a changed
blk_mq_queue_reinit() to not remap queues for the case of CPU
hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
queues as well. This breaks, for example, NBD's multi-connection mode,
leaving the added hardware queues unused. Fix it by making
blk_mq_update_nr_hw_queues() explicitly remap the queues.
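
For context, the driver side looks roughly like this (hypothetical
nbd-style call site, not part of this patch):

	/*
	 * When the number of connections changes, the driver asks the core to
	 * reallocate hardware queues; after this patch, the core also remaps
	 * CPUs to the new set of queues.
	 */
	blk_mq_update_nr_hw_queues(&nbd->tag_set, num_connections);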

Fixes: 4e68a011428a ("blk-mq: don't redistribute hardware queues on a CPU hotplug event")
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 147cc53134d8..39e9af3498ac 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2467,6 +2467,14 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 	return 0;
 }
 
+static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
+{
+	if (set->ops->map_queues)
+		return set->ops->map_queues(set);
+	else
+		return blk_mq_map_queues(set);
+}
+
 /*
  * Alloc a tag set to be associated with one or more request queues.
  * May fail with EINVAL for various error conditions. May adjust the
@@ -2521,10 +2529,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->mq_map)
 		goto out_free_tags;
 
-	if (set->ops->map_queues)
-		ret = set->ops->map_queues(set);
-	else
-		ret = blk_mq_map_queues(set);
+	ret = blk_mq_update_queue_map(set);
 	if (ret)
 		goto out_free_mq_map;
 
@@ -2616,6 +2621,7 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 		blk_mq_freeze_queue(q);
 
 	set->nr_hw_queues = nr_hw_queues;
+	blk_mq_update_queue_map(set);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_realloc_hw_ctxs(set, q);
 		blk_mq_queue_reinit(q, cpu_online_mask);
-- 
2.12.2

* [PATCH v3 6/8] blk-mq-sched: provide hooks for initializing hardware queue data
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Schedulers need to be informed when a hardware queue is added or removed
at runtime so they can allocate/free per-hardware queue data. So,
replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
at init time, with .init_hctx() and .exit_hctx() hooks.
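
As an illustration, a scheduler would implement the new hooks along these
lines (hypothetical example; the example_* names and structure are made
up, not part of this series):

	struct example_hctx_data {
		spinlock_t lock;
		struct list_head rq_list;
	};

	static int example_init_hctx(struct blk_mq_hw_ctx *hctx,
				     unsigned int hctx_idx)
	{
		struct example_hctx_data *ehd;

		ehd = kmalloc_node(sizeof(*ehd), GFP_KERNEL, hctx->numa_node);
		if (!ehd)
			return -ENOMEM;

		spin_lock_init(&ehd->lock);
		INIT_LIST_HEAD(&ehd->rq_list);
		hctx->sched_data = ehd;
		return 0;
	}

	static void example_exit_hctx(struct blk_mq_hw_ctx *hctx,
				      unsigned int hctx_idx)
	{
		kfree(hctx->sched_data);
		hctx->sched_data = NULL;
	}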

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c     | 81 +++++++++++++++++++++++++-----------------------
 block/blk-mq-sched.h     |  4 ---
 include/linux/elevator.h |  2 ++
 3 files changed, 45 insertions(+), 42 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e8c2ed654ef0..9d7f6d6ca693 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -30,43 +30,6 @@ void blk_mq_sched_free_hctx_data(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
 
-int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
-				int (*init)(struct blk_mq_hw_ctx *),
-				void (*exit)(struct blk_mq_hw_ctx *))
-{
-	struct blk_mq_hw_ctx *hctx;
-	int ret;
-	int i;
-
-	queue_for_each_hw_ctx(q, hctx, i) {
-		hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
-		if (!hctx->sched_data) {
-			ret = -ENOMEM;
-			goto error;
-		}
-
-		if (init) {
-			ret = init(hctx);
-			if (ret) {
-				/*
-				 * We don't want to give exit() a partially
-				 * initialized sched_data. init() must clean up
-				 * if it fails.
-				 */
-				kfree(hctx->sched_data);
-				hctx->sched_data = NULL;
-				goto error;
-			}
-		}
-	}
-
-	return 0;
-error:
-	blk_mq_sched_free_hctx_data(q, exit);
-	return ret;
-}
-EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data);
-
 static void __blk_mq_sched_assign_ioc(struct request_queue *q,
 				      struct request *rq,
 				      struct bio *bio,
@@ -465,11 +428,24 @@ int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			   unsigned int hctx_idx)
 {
 	struct elevator_queue *e = q->elevator;
+	int ret;
 
 	if (!e)
 		return 0;
 
-	return blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+	ret = blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+	if (ret)
+		return ret;
+
+	if (e->type->ops.mq.init_hctx) {
+		ret = e->type->ops.mq.init_hctx(hctx, hctx_idx);
+		if (ret) {
+			blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
+			return ret;
+		}
+	}
+
+	return 0;
 }
 
 void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
@@ -480,12 +456,18 @@ void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 	if (!e)
 		return;
 
+	if (e->type->ops.mq.exit_hctx && hctx->sched_data) {
+		e->type->ops.mq.exit_hctx(hctx, hctx_idx);
+		hctx->sched_data = NULL;
+	}
+
 	blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
 }
 
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct blk_mq_hw_ctx *hctx;
+	struct elevator_queue *eq;
 	unsigned int i;
 	int ret;
 
@@ -510,6 +492,18 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	if (ret)
 		goto err;
 
+	if (e->ops.mq.init_hctx) {
+		queue_for_each_hw_ctx(q, hctx, i) {
+			ret = e->ops.mq.init_hctx(hctx, i);
+			if (ret) {
+				eq = q->elevator;
+				blk_mq_exit_sched(q, eq);
+				kobject_put(&eq->kobj);
+				return ret;
+			}
+		}
+	}
+
 	return 0;
 
 err:
@@ -520,6 +514,17 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 
 void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 {
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+
+	if (e->type->ops.mq.exit_hctx) {
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (hctx->sched_data) {
+				e->type->ops.mq.exit_hctx(hctx, i);
+				hctx->sched_data = NULL;
+			}
+		}
+	}
 	if (e->type->ops.mq.exit_sched)
 		e->type->ops.mq.exit_sched(e);
 	blk_mq_sched_tags_teardown(q);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index e704956e0862..c6e760df0fb4 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -4,10 +4,6 @@
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
-int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
-				int (*init)(struct blk_mq_hw_ctx *),
-				void (*exit)(struct blk_mq_hw_ctx *));
-
 void blk_mq_sched_free_hctx_data(struct request_queue *q,
 				 void (*exit)(struct blk_mq_hw_ctx *));
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 22d39e8d4de1..b7ec315ee7e7 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -93,6 +93,8 @@ struct blk_mq_hw_ctx;
 struct elevator_mq_ops {
 	int (*init_sched)(struct request_queue *, struct elevator_type *);
 	void (*exit_sched)(struct elevator_queue *);
+	int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
+	void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
 
 	bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
 	bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
-- 
2.12.2

* [PATCH v3 7/8] blk-mq: make driver tag failure path easier to follow
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Minor cleanup that makes it easier to figure out what's going on in the
driver tag allocation failure path of blk_mq_dispatch_rq_list().

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 39e9af3498ac..de57e7727c52 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1003,17 +1003,16 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 			 * The initial allocation attempt failed, so we need to
 			 * rerun the hardware queue when a tag is freed.
 			 */
-			if (blk_mq_dispatch_wait_add(hctx)) {
-				/*
-				 * It's possible that a tag was freed in the
-				 * window between the allocation failure and
-				 * adding the hardware queue to the wait queue.
-				 */
-				if (!blk_mq_get_driver_tag(rq, &hctx, false))
-					break;
-			} else {
+			if (!blk_mq_dispatch_wait_add(hctx))
+				break;
+
+			/*
+			 * It's possible that a tag was freed in the window
+			 * between the allocation failure and adding the
+			 * hardware queue to the wait queue.
+			 */
+			if (!blk_mq_get_driver_tag(rq, &hctx, false))
 				break;
-			}
 		}
 
 		list_del_init(&rq->queuelist);
-- 
2.12.2

* [PATCH v3 8/8] blk-mq: use true instead of 1 for blk_mq_queue_data.last
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Trivial cleanup.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index de57e7727c52..c82fcb923ff8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1435,7 +1435,7 @@ static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
 	struct request_queue *q = rq->q;
 	struct blk_mq_queue_data bd = {
 		.rq = rq,
-		.last = 1
+		.last = true,
 	};
 	struct blk_mq_hw_ctx *hctx;
 	blk_qc_t new_cookie;
-- 
2.12.2

* Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Ming Lei @ 2017-04-06  4:31 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Jens Axboe, linux-block, kernel-team

On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> While dispatching requests, if we fail to get a driver tag, we mark the
> hardware queue as waiting for a tag and put the requests on a
> hctx->dispatch list to be run later when a driver tag is freed. However,
> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> queues if using a single-queue scheduler with a multiqueue device. If

Using an SQ scheduler on an MQ device can't perform well, so I'm just
curious why someone would want to do it this way :-)

I guess you mean that ops.mq.dispatch_request() may dispatch requests
from other hardware queues in blk_mq_sched_dispatch_requests() instead
of the current hctx.

If that is true, it looks like an issue in how the I/O scheduler is used,
since the mq-deadline scheduler just queues requests in one per-request_queue
linked list; for an MQ device, the scheduler queue should be per-hctx.

> blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
> are processing. This means we end up using the hardware queue of the
> previous request, which may or may not be the same as that of the
> current request. If it isn't, the wrong hardware queue will end up
> waiting for a tag, and the requests will be on the wrong dispatch list,
> leading to a hang.
> 
> The fix is twofold:
> 
> 1. Make sure we save which hardware queue we were trying to get a
>    request for in blk_mq_get_driver_tag() regardless of whether it
>    succeeds or not.

That shouldn't be needed: once rq->mq_ctx is set, the hctx is fixed
already, which means we can just pass the hctx to blk_mq_get_driver_tag().

> [...]

-- 
Ming

* Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Omar Sandoval @ 2017-04-06  7:57 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, kernel-team

On Thu, Apr 06, 2017 at 12:31:18PM +0800, Ming Lei wrote:
> On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > While dispatching requests, if we fail to get a driver tag, we mark the
> > hardware queue as waiting for a tag and put the requests on a
> > hctx->dispatch list to be run later when a driver tag is freed. However,
> > blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> > queues if using a single-queue scheduler with a multiqueue device. If
> 
> Using an SQ scheduler on an MQ device can't perform well, so I'm just
> curious why someone would want to do it this way :-)

I don't know why anyone would want to, but it has to work :) The only
reason we noticed this is because when the NBD device is created, it
only has a single queue, so we automatically assign mq-deadline to it.
Later, we update the number of queues, but it's still using mq-deadline.

> I guess you mean that ops.mq.dispatch_request() may dispatch requests
> from other hardware queues in blk_mq_sched_dispatch_requests() instead
> of the current hctx.

Yup, that's right. It's weird, and I talked to Jens about just forcing
the MQ device into an SQ mode when using an SQ scheduler, but this way
works fine more or less.

> If that is true, it looks like an issue in how the I/O scheduler is used,
> since the mq-deadline scheduler just queues requests in one per-request_queue
> linked list; for an MQ device, the scheduler queue should be per-hctx.

That's an option, but that's a different scheduling policy. Again, I
agree that it's strange, but it's reasonable behavior.

> > blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
> > are processing. This means we end up using the hardware queue of the
> > previous request, which may or may not be the same as that of the
> > current request. If it isn't, the wrong hardware queue will end up
> > waiting for a tag, and the requests will be on the wrong dispatch list,
> > leading to a hang.
> > 
> > The fix is twofold:
> > 
> > 1. Make sure we save which hardware queue we were trying to get a
> >    request for in blk_mq_get_driver_tag() regardless of whether it
> >    succeeds or not.
> 
> That shouldn't be needed: once rq->mq_ctx is set, the hctx is fixed
> already, which means we can just pass the hctx to blk_mq_get_driver_tag().

We still need to do the blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) step to
get the hctx. We could do that at every callsite of
blk_mq_get_driver_tag(), but this way it's just factored into
blk_mq_get_driver_tag().
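
(For reference, the step in question is just:

	hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);

which is what blk_mq_get_driver_tag() does internally when it sets up its
allocation data.)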

Thanks!

> > [...]

* Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Ming Lei @ 2017-04-06  8:23 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Jens Axboe, linux-block, kernel-team

On Thu, Apr 06, 2017 at 12:57:51AM -0700, Omar Sandoval wrote:
> On Thu, Apr 06, 2017 at 12:31:18PM +0800, Ming Lei wrote:
> > On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > While dispatching requests, if we fail to get a driver tag, we mark the
> > > hardware queue as waiting for a tag and put the requests on a
> > > hctx->dispatch list to be run later when a driver tag is freed. However,
> > > blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> > > queues if using a single-queue scheduler with a multiqueue device. If
> > 
> > Using an SQ scheduler on an MQ device can't perform well, so I'm just
> > curious why someone would want to do it this way :-)
> 
> I don't know why anyone would want to, but it has to work :) The only
> reason we noticed this is because when the NBD device is created, it
> only has a single queue, so we automatically assign mq-deadline to it.
> Later, we update the number of queues, but it's still using mq-deadline.
> 
> > I guess you mean that ops.mq.dispatch_request() may dispatch requests
> > from other hardware queues in blk_mq_sched_dispatch_requests() instead
> > of the current hctx.
> 
> Yup, that's right. It's weird, and I talked to Jens about just forcing
> the MQ device into an SQ mode when using an SQ scheduler, but this way
> works fine more or less.

Or just switch the elevator to the MQ default one when the device becomes
MQ? Or let mq-deadline's .dispatch_request() just return reqs in the
current hctx?

> 
> > If that is true, it looks like an issue in how the I/O scheduler is used,
> > since the mq-deadline scheduler just queues requests in one per-request_queue
> > linked list; for an MQ device, the scheduler queue should be per-hctx.
> 
> That's an option, but that's a different scheduling policy. Again, I
> agree that it's strange, but it's reasonable behavior.

IMO, the current mq-deadline isn't good/ready for MQ devices, and it
doesn't make sense to use it for MQ.

> 
> > > blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
> > > are processing. This means we end up using the hardware queue of the
> > > previous request, which may or may not be the same as that of the
> > > current request. If it isn't, the wrong hardware queue will end up
> > > waiting for a tag, and the requests will be on the wrong dispatch list,
> > > leading to a hang.
> > > 
> > > The fix is twofold:
> > > 
> > > 1. Make sure we save which hardware queue we were trying to get a
> > >    request for in blk_mq_get_driver_tag() regardless of whether it
> > >    succeeds or not.
> > 
> > That shouldn't be needed: once rq->mq_ctx is set, the hctx is fixed
> > already, which means we can just pass the hctx to blk_mq_get_driver_tag().
> 
> We still need to do the blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) step to
> get the hctx. We could do that at every callsite of
> blk_mq_get_driver_tag(), but this way it's just factored into
> blk_mq_get_driver_tag().

We can just use the hctx passed to blk_mq_dispatch_rq_list().

> 
> Thanks!
> 
> > > 2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
> > >    blk_mq_hw_queue to make it clear that it must handle multiple
> > >    hardware queues, since I've already messed this up on a couple of
> > >    occasions.

The 2nd part makes the code less readable too, and it would have been
enough to pass 'hctx' into blk_mq_dispatch_rq_list() if we made sure
that ops.mq.dispatch_request() only returned requests from the current hw
queue, since we have passed 'hctx' to the elevator already.


Thanks,
Ming

* Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Jens Axboe @ 2017-04-06 19:29 UTC (permalink / raw)
  To: Ming Lei, Omar Sandoval; +Cc: linux-block, kernel-team

On 04/06/2017 02:23 AM, Ming Lei wrote:
> On Thu, Apr 06, 2017 at 12:57:51AM -0700, Omar Sandoval wrote:
>> On Thu, Apr 06, 2017 at 12:31:18PM +0800, Ming Lei wrote:
>>> On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
>>>> From: Omar Sandoval <osandov@fb.com>
>>>>
>>>> While dispatching requests, if we fail to get a driver tag, we mark the
>>>> hardware queue as waiting for a tag and put the requests on a
>>>> hctx->dispatch list to be run later when a driver tag is freed. However,
>>>> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
>>>> queues if using a single-queue scheduler with a multiqueue device. If
>>>
>>> Using an SQ scheduler on an MQ device can't perform well, so I'm just
>>> curious why someone would want to do it this way :-)
>>
>> I don't know why anyone would want to, but it has to work :) The only
>> reason we noticed this is because when the NBD device is created, it
>> only has a single queue, so we automatically assign mq-deadline to it.
>> Later, we update the number of queues, but it's still using mq-deadline.
>>
>>> I guess you mean that ops.mq.dispatch_request() may dispatch requests
>>> from other hardware queues in blk_mq_sched_dispatch_requests() instead
>>> of the current hctx.
>>
>> Yup, that's right. It's weird, and I talked to Jens about just forcing
>> the MQ device into an SQ mode when using an SQ scheduler, but this way
>> works fine more or less.
> 
> Or just switch the elevator to the MQ default one when the device becomes
> MQ? Or let mq-deadline's .dispatch_request() just return reqs in the
> current hctx?

No, that would be a really bad idea imho. First of all, I don't want
kernel driven scheduler changes. Secondly, the framework should work
with a non-direct mapping between hardware dispatch queues and
scheduling queues.

While we could force a single queue usage to make that a 1:1 mapping
always, that loses big benefits on eg nbd, which uses multiple hardware
queues to up the bandwidth. Similarly on nvme, for example, we still
scale better with N submission queues and 1 scheduling queue compared to
having just 1 submission queue.

>>> If that is true, it looks like an issue in how the I/O scheduler is used,
>>> since the mq-deadline scheduler just queues requests in one per-request_queue
>>> linked list; for an MQ device, the scheduler queue should be per-hctx.
>>
>> That's an option, but that's a different scheduling policy. Again, I
>> agree that it's strange, but it's reasonable behavior.
> 
> > IMO, the current mq-deadline isn't good/ready for MQ devices, and it
> doesn't make sense to use it for MQ.

I don't think that's true at all. I do agree that it's somewhat quirky
since it does introduce scheduling dependencies between the hardware
queues, and we have to work at making that well understood and explicit,
so as not to introduce bugs due to that. But in reality, all the multiqueue
hardware we are dealing with is mapped to a single resource. As such,
it makes a lot of sense to schedule it as such. Hence I don't think that
a single queue deadline approach is necessarily a bad idea for even fast
storage.

-- 
Jens Axboe

* Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Ming Lei @ 2017-04-07  3:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Omar Sandoval, linux-block, kernel-team

Hi Jens,

Thanks for your comment!

On Thu, Apr 06, 2017 at 01:29:26PM -0600, Jens Axboe wrote:
> On 04/06/2017 02:23 AM, Ming Lei wrote:
> > On Thu, Apr 06, 2017 at 12:57:51AM -0700, Omar Sandoval wrote:
> >> On Thu, Apr 06, 2017 at 12:31:18PM +0800, Ming Lei wrote:
> >>> On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
> >>>> From: Omar Sandoval <osandov@fb.com>
> >>>>
> >>>> While dispatching requests, if we fail to get a driver tag, we mark the
> >>>> hardware queue as waiting for a tag and put the requests on a
> >>>> hctx->dispatch list to be run later when a driver tag is freed. However,
> >>>> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> >>>> queues if using a single-queue scheduler with a multiqueue device. If
> >>>
> >>> Using an SQ scheduler on an MQ device can't perform well, so I'm just
> >>> curious why someone would want to do it this way :-)
> >>
> >> I don't know why anyone would want to, but it has to work :) The only
> >> reason we noticed this is because when the NBD device is created, it
> >> only has a single queue, so we automatically assign mq-deadline to it.
> >> Later, we update the number of queues, but it's still using mq-deadline.
> >>
> >>> I guess you mean that ops.mq.dispatch_request() may dispatch requests
> >>> from other hardware queues in blk_mq_sched_dispatch_requests() instead
> >>> of the current hctx.
> >>
> >> Yup, that's right. It's weird, and I talked to Jens about just forcing
> >> the MQ device into an SQ mode when using an SQ scheduler, but this way
> >> works fine more or less.
> > 
> > Or just switch the elevator to the MQ default one when the device becomes
> > MQ? Or let mq-deadline's .dispatch_request() just return reqs in the current
> > hctx?
> 
> No, that would be a really bad idea imho. First of all, I don't want
> kernel driven scheduler changes. Secondly, the framework should work
> with a non-direct mapping between hardware dispatch queues and
> scheduling queues.
> 
> While we could force a single queue usage to make that a 1:1 mapping
> always, that loses big benefits on eg nbd, which uses multiple hardware
> queues to up the bandwidth. Similarly on nvme, for example, we still
> scale better with N submission queues and 1 scheduling queue compared to
> having just 1 submission queue.

That isn't quite what I meant. My 2nd point is to make mq-deadline's
dispatch_request(hctx) return only requests mapped to the hw queue of
'hctx'; then we can avoid messing up blk-mq.c and blk-mq-sched.c.
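
Roughly what I have in mind, as a totally untested sketch (pick_next_rq()
and requeue_at_head() are made-up stand-ins for mq-deadline's existing
selection and insertion code, not real functions):

static struct request *dd_dispatch_from_hctx(struct blk_mq_hw_ctx *hctx)
{
	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
	struct request *rq;

	spin_lock(&dd->lock);
	rq = pick_next_rq(dd);
	if (rq && blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) != hctx) {
		/* not mapped to this hw queue: put it back untouched */
		requeue_at_head(dd, rq);
		rq = NULL;
	}
	spin_unlock(&dd->lock);
	return rq;
}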

> 
> >>> If that is true, it looks like an issue in the usage of the I/O
> >>> scheduler, since the mq-deadline scheduler just queues requests in one
> >>> per-request_queue linked list; for an MQ device, the scheduler queue
> >>> should have been per-hctx.
> >>
> >> That's an option, but that's a different scheduling policy. Again, I
> >> agree that it's strange, but it's reasonable behavior.
> > 
> > IMO, the current mq-deadline isn't good/ready for an MQ device, and it
> > doesn't make sense to use it for MQ.
> 
> I don't think that's true at all. I do agree that it's somewhat quirky
> since it does introduce scheduling dependencies between the hardware
> queues, and we have to work at making that well understood and explicit,
> so as not to introduce bugs due to that. But in reality, all multiqueue
> hardware we are dealing with is mapped to a single resource. As such,
> it makes a lot of sense to schedule it as such. Hence I don't think that
> a single-queue deadline approach is necessarily a bad idea even for fast
> storage.

When we map all of the mq hw queues into one single deadline queue, it
can't scale well. For example, I just ran a simple write test (libaio,
dio, bs:4k, 4 jobs) over one commodity NVMe drive in a dual-socket
(four cores) system:

	IO scheduler | CPU utilization
	-------------+----------------
	none         | 60%
	mq-deadline  | 100%

And IO throughput is basically the same in the two cases.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
  2017-04-07  3:23           ` Ming Lei
@ 2017-04-07 14:45             ` Jens Axboe
  2017-04-11  2:12               ` Ming Lei
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2017-04-07 14:45 UTC (permalink / raw)
  To: Ming Lei; +Cc: Omar Sandoval, linux-block, kernel-team

On 04/06/2017 09:23 PM, Ming Lei wrote:
> Hi Jens,
> 
> Thanks for your comment!
> 
> On Thu, Apr 06, 2017 at 01:29:26PM -0600, Jens Axboe wrote:
>> On 04/06/2017 02:23 AM, Ming Lei wrote:
>>> On Thu, Apr 06, 2017 at 12:57:51AM -0700, Omar Sandoval wrote:
>>>> On Thu, Apr 06, 2017 at 12:31:18PM +0800, Ming Lei wrote:
>>>>> On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
>>>>>> From: Omar Sandoval <osandov@fb.com>
>>>>>>
>>>>>> While dispatching requests, if we fail to get a driver tag, we mark the
>>>>>> hardware queue as waiting for a tag and put the requests on a
>>>>>> hctx->dispatch list to be run later when a driver tag is freed. However,
>>>>>> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
>>>>>> queues if using a single-queue scheduler with a multiqueue device. If
>>>>>
>>>>> It can't perform well using an SQ scheduler on an MQ device, so I'm just
>>>>> curious why someone would want to do that in this way, :-)
>>>>
>>>> I don't know why anyone would want to, but it has to work :) The only
>>>> reason we noticed this is because when the NBD device is created, it
>>>> only has a single queue, so we automatically assign mq-deadline to it.
>>>> Later, we update the number of queues, but it's still using mq-deadline.
>>>>
>>>>> I guess you mean that ops.mq.dispatch_request() may dispatch requests
>>>>> from other hardware queues in blk_mq_sched_dispatch_requests() instead
>>>>> of the current hctx.
>>>>
>>>> Yup, that's right. It's weird, and I talked to Jens about just forcing
>>>> the MQ device into an SQ mode when using an SQ scheduler, but this way
>>>> works fine more or less.
>>>
>>> Or just switch the elevator to the MQ default one when the device becomes
>>> MQ? Or let mq-deadline's .dispatch_request() just return reqs in the current
>>> hctx?
>>
>> No, that would be a really bad idea imho. First of all, I don't want
>> kernel driven scheduler changes. Secondly, the framework should work
>> with a non-direct mapping between hardware dispatch queues and
>> scheduling queues.
>>
>> While we could force a single queue usage to make that a 1:1 mapping
>> always, that loses big benefits on eg nbd, which uses multiple hardware
>> queues to up the bandwidth. Similarly on nvme, for example, we still
>> scale better with N submission queues and 1 scheduling queue compared to
>> having just 1 submission queue.
> 
> That isn't quite what I meant. My 2nd point is to make mq-deadline's
> dispatch_request(hctx) return only requests mapped to the hw queue of
> 'hctx'; then we can avoid messing up blk-mq.c and blk-mq-sched.c.

That would indeed be better. But let's assume that we have 4 hardware
queues and we're asked to run queue X, while the FIFO dictates that the
first request that should run is on queue Y, since it's expired. What do
we do? We'd have to abort dispatching on queue X and run queue Y
appropriately.

This shuffle can happen all the time. I think it'd be worthwhile to work
towards the goal of improving mq-deadline to deal with this, and
subsequently cleaning up the interface. It would be a cleaner
implementation, though efficiency might suffer further.
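
To make the shuffle concrete, a per-hctx dispatch would need something
like the following untested sketch (peek_fifo_head(), fifo_expired() and
pick_next_rq() are made up here):

static struct request *dd_dispatch(struct blk_mq_hw_ctx *hctx)
{
	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
	struct request *rq;

	spin_lock(&dd->lock);
	rq = peek_fifo_head(dd);
	if (rq && fifo_expired(dd, rq)) {
		struct blk_mq_hw_ctx *owner;

		owner = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);
		if (owner != hctx) {
			/* the expired request is on queue Y, not us:
			 * abort dispatching on X and kick Y instead */
			spin_unlock(&dd->lock);
			blk_mq_run_hw_queue(owner, true);
			return NULL;
		}
	}
	rq = pick_next_rq(dd);
	spin_unlock(&dd->lock);
	return rq;
}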

>>>>> If that is true, it looks like an issue in the usage of the I/O
>>>>> scheduler, since the mq-deadline scheduler just queues requests in one
>>>>> per-request_queue linked list; for an MQ device, the scheduler queue
>>>>> should have been per-hctx.
>>>>
>>>> That's an option, but that's a different scheduling policy. Again, I
>>>> agree that it's strange, but it's reasonable behavior.
>>>
>>> IMO, the current mq-deadline isn't good/ready for an MQ device, and it
>>> doesn't make sense to use it for MQ.
>>
>> I don't think that's true at all. I do agree that it's somewhat quirky
>> since it does introduce scheduling dependencies between the hardware
>> queues, and we have to work at making that well understood and explicit,
>> so as not to introduce bugs due to that. But in reality, all multiqueue
>> hardware we are dealing with is mapped to a single resource. As such,
>> it makes a lot of sense to schedule it as such. Hence I don't think that
>> a single-queue deadline approach is necessarily a bad idea even for fast
>> storage.
> 
> When we map all of the mq hw queues into one single deadline queue, it
> can't scale well. For example, I just ran a simple write test (libaio,
> dio, bs:4k, 4 jobs) over one commodity NVMe drive in a dual-socket
> (four cores) system:
> 
> 	IO scheduler | CPU utilization
> 	-------------+----------------
> 	none         | 60%
> 	mq-deadline  | 100%
> 
> And IO throughput is basically the same in the two cases.
> 
That's a given: for low blocksize and high iops, we'll be hitting the
deadline lock pretty hard. We dispatch one request at a time, to
optimize for rotational storage, since it improves merging a lot. If we
modified e->type->ops.mq.dispatch_request() to take a "dump this many
requests" parameter and ramped up the depth quickly, then we could
drastically reduce the overhead for your case. The initial version
dumped the whole thing. It had lower overhead on flash, but didn't reach
the performance of old deadline on !mq since we continually emptied the
queue and eliminated merging opportunities.

blk_mq_sched_dispatch_requests() could add logic that brought back the
old behavior if NONROT is set.
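
Purely as an untested sketch of that idea (dd_dispatch_batch() and the
'max' parameter don't exist today, and pick_next_rq() again stands in
for the current selection logic):

static int dd_dispatch_batch(struct blk_mq_hw_ctx *hctx,
			     struct list_head *list, unsigned int max)
{
	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
	unsigned int count = 0;
	struct request *rq;

	spin_lock(&dd->lock);
	/* dump up to 'max' requests per lock grab */
	while (count < max && (rq = pick_next_rq(dd)) != NULL) {
		list_add_tail(&rq->queuelist, list);
		count++;
	}
	spin_unlock(&dd->lock);
	return count;
}

with the caller doing something like:

	/* keep single-request dispatch for rotational storage to help
	 * merging, use bigger batches when NONROT is set */
	unsigned int max = blk_queue_nonrot(hctx->queue) ? 16 : 1;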

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
  2017-04-07 14:45             ` Jens Axboe
@ 2017-04-11  2:12               ` Ming Lei
  2017-04-11  2:15                 ` Ming Lei
  0 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2017-04-11  2:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ming Lei, Omar Sandoval, linux-block, kernel-team

On Fri, Apr 07, 2017 at 08:45:41AM -0600, Jens Axboe wrote:
> On 04/06/2017 09:23 PM, Ming Lei wrote:
> > Hi Jens,
> > 
> > Thanks for your comment!
> > 
> > On Thu, Apr 06, 2017 at 01:29:26PM -0600, Jens Axboe wrote:
> >> On 04/06/2017 02:23 AM, Ming Lei wrote:
> >>> On Thu, Apr 06, 2017 at 12:57:51AM -0700, Omar Sandoval wrote:
> >>>> On Thu, Apr 06, 2017 at 12:31:18PM +0800, Ming Lei wrote:
> >>>>> On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
> >>>>>> From: Omar Sandoval <osandov@fb.com>
> >>>>>>
> >>>>>> While dispatching requests, if we fail to get a driver tag, we mark the
> >>>>>> hardware queue as waiting for a tag and put the requests on a
> >>>>>> hctx->dispatch list to be run later when a driver tag is freed. However,
> >>>>>> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> >>>>>> queues if using a single-queue scheduler with a multiqueue device. If
> >>>>>
> >>>>> It can't perform well using an SQ scheduler on an MQ device, so I'm just
> >>>>> curious why someone would want to do that in this way, :-)
> >>>>
> >>>> I don't know why anyone would want to, but it has to work :) The only
> >>>> reason we noticed this is because when the NBD device is created, it
> >>>> only has a single queue, so we automatically assign mq-deadline to it.
> >>>> Later, we update the number of queues, but it's still using mq-deadline.
> >>>>
> >>>>> I guess you mean that ops.mq.dispatch_request() may dispatch requests
> >>>>> from other hardware queues in blk_mq_sched_dispatch_requests() instead
> >>>>> of the current hctx.
> >>>>
> >>>> Yup, that's right. It's weird, and I talked to Jens about just forcing
> >>>> the MQ device into an SQ mode when using an SQ scheduler, but this way
> >>>> works fine more or less.
> >>>
> >>> Or just switch the elevator to the MQ default one when the device becomes
> >>> MQ? Or let mq-deadline's .dispatch_request() just return reqs in the current
> >>> hctx?
> >>
> >> No, that would be a really bad idea imho. First of all, I don't want
> >> kernel driven scheduler changes. Secondly, the framework should work
> >> with a non-direct mapping between hardware dispatch queues and
> >> scheduling queues.
> >>
> >> While we could force a single queue usage to make that a 1:1 mapping
> >> always, that loses big benefits on eg nbd, which uses multiple hardware
> >> queues to up the bandwidth. Similarly on nvme, for example, we still
> >> scale better with N submission queues and 1 scheduling queue compared to
> >> having just 1 submission queue.
> > 
> > That isn't quite what I meant. My 2nd point is to make mq-deadline's
> > dispatch_request(hctx) return only requests mapped to the hw queue of
> > 'hctx'; then we can avoid messing up blk-mq.c and blk-mq-sched.c.
> 
> That would indeed be better. But let's assume that we have 4 hardware
> queues and we're asked to run queue X, while the FIFO dictates that the
> first request that should run is on queue Y, since it's expired. What do
> we do? We'd have to abort dispatching on queue X and run queue Y
> appropriately.

The previous rule was to run queue Y on the CPUs (sw queues) mapped to
queue Y, and that should still work if we keep following it. But this
patch is starting to break that rule, :-(

> 
> This shuffle can happen all the time. I think it'd be worthwhile to work
> towards the goal of improving mq-deadline to deal with this, and
> subsequently cleaning up the interface. It would be a cleaner
> implementation, though efficiency might suffer further.
> 
> >>>>> If that is true, it looks like an issue in the usage of the I/O
> >>>>> scheduler, since the mq-deadline scheduler just queues requests in one
> >>>>> per-request_queue linked list; for an MQ device, the scheduler queue
> >>>>> should have been per-hctx.
> >>>>
> >>>> That's an option, but that's a different scheduling policy. Again, I
> >>>> agree that it's strange, but it's reasonable behavior.
> >>>
> >>> IMO, the current mq-deadline isn't good/ready for an MQ device, and it
> >>> doesn't make sense to use it for MQ.
> >>
> >> I don't think that's true at all. I do agree that it's somewhat quirky
> >> since it does introduce scheduling dependencies between the hardware
> >> queues, and we have to work at making that well understood and explicit,
> >> so as not to introduce bugs due to that. But in reality, all multiqueue
> >> hardware we are dealing with is mapped to a single resource. As such,
> >> it makes a lot of sense to schedule it as such. Hence I don't think that
> >> a single-queue deadline approach is necessarily a bad idea even for fast
> >> storage.
> > 
> > When we map all of the mq hw queues into one single deadline queue, it
> > can't scale well. For example, I just ran a simple write test (libaio,
> > dio, bs:4k, 4 jobs) over one commodity NVMe drive in a dual-socket
> > (four cores) system:
> > 
> > 	IO scheduler | CPU utilization
> > 	-------------+----------------
> > 	none         | 60%
> > 	mq-deadline  | 100%
> > 
> > And IO throughput is basically the same in the two cases.
> > 
> That's a given: for low blocksize and high iops, we'll be hitting the
> deadline lock pretty hard. We dispatch one request at a time, to
> optimize for rotational storage, since it improves merging a lot. If we
> modified e->type->ops.mq.dispatch_request() to take a "dump this many
> requests" parameter and ramped up the depth quickly, then we could
> drastically reduce the overhead for your case. The initial version
> dumped the whole thing. It had lower overhead on flash, but didn't reach
> the performance of old deadline on !mq since we continually emptied the
> queue and eliminated merging opportunities.
> 
> blk_mq_sched_dispatch_requests() could add logic that brought back the
> old behavior if NONROT is set.

From my further testing, it looks like the issue is related to the
single mq-deadline queue (including its single lock), which needs to be
accessed from different NUMA nodes.

The following tree implements a per-hctx mq-deadline queue, and
basically makes q->last_merge, elevator_queue->hash and 'struct
deadline_data' per-hctx. With these patches, the issue disappeared.
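
For reference, a minimal sketch of the per-hctx setup, assuming the
init_hctx hook from patch 6 of this series (only the core deadline_data
fields are shown, and the helper below is illustrative):

static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
{
	struct deadline_data *dd;

	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, hctx->numa_node);
	if (!dd)
		return -ENOMEM;

	INIT_LIST_HEAD(&dd->fifo_list[READ]);
	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
	/* one lock per hctx, so no cross-NUMA contention on it */
	spin_lock_init(&dd->lock);
	hctx->sched_data = dd;
	return 0;
}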


Thanks
Ming

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
  2017-04-11  2:12               ` Ming Lei
@ 2017-04-11  2:15                 ` Ming Lei
  0 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2017-04-11  2:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ming Lei, Omar Sandoval, linux-block, kernel-team

On Tue, Apr 11, 2017 at 10:12:49AM +0800, Ming Lei wrote:
> On Fri, Apr 07, 2017 at 08:45:41AM -0600, Jens Axboe wrote:
> > On 04/06/2017 09:23 PM, Ming Lei wrote:
> > > Hi Jens,
> > > 
> > > Thanks for your comment!
> > > 
> > > On Thu, Apr 06, 2017 at 01:29:26PM -0600, Jens Axboe wrote:
> > >> On 04/06/2017 02:23 AM, Ming Lei wrote:
> > >>> On Thu, Apr 06, 2017 at 12:57:51AM -0700, Omar Sandoval wrote:
> > >>>> On Thu, Apr 06, 2017 at 12:31:18PM +0800, Ming Lei wrote:
> > >>>>> On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
> > >>>>>> From: Omar Sandoval <osandov@fb.com>
> > >>>>>>
> > >>>>>> While dispatching requests, if we fail to get a driver tag, we mark the
> > >>>>>> hardware queue as waiting for a tag and put the requests on a
> > >>>>>> hctx->dispatch list to be run later when a driver tag is freed. However,
> > >>>>>> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> > >>>>>> queues if using a single-queue scheduler with a multiqueue device. If
> > >>>>>
> > >>>>> It can't perform well using an SQ scheduler on an MQ device, so I'm just
> > >>>>> curious why someone would want to do that in this way, :-)
> > >>>>
> > >>>> I don't know why anyone would want to, but it has to work :) The only
> > >>>> reason we noticed this is because when the NBD device is created, it
> > >>>> only has a single queue, so we automatically assign mq-deadline to it.
> > >>>> Later, we update the number of queues, but it's still using mq-deadline.
> > >>>>
> > >>>>> I guess you mean that ops.mq.dispatch_request() may dispatch requests
> > >>>>> from other hardware queues in blk_mq_sched_dispatch_requests() instead
> > >>>>> of the current hctx.
> > >>>>
> > >>>> Yup, that's right. It's weird, and I talked to Jens about just forcing
> > >>>> the MQ device into an SQ mode when using an SQ scheduler, but this way
> > >>>> works fine more or less.
> > >>>
> > >>> Or just switch the elevator to the MQ default one when the device becomes
> > >>> MQ? Or let mq-deadline's .dispatch_request() just return reqs in the current
> > >>> hctx?
> > >>
> > >> No, that would be a really bad idea imho. First of all, I don't want
> > >> kernel driven scheduler changes. Secondly, the framework should work
> > >> with a non-direct mapping between hardware dispatch queues and
> > >> scheduling queues.
> > >>
> > >> While we could force a single queue usage to make that a 1:1 mapping
> > >> always, that loses big benefits on eg nbd, which uses multiple hardware
> > >> queues to up the bandwidth. Similarly on nvme, for example, we still
> > >> scale better with N submission queues and 1 scheduling queue compared to
> > >> having just 1 submission queue.
> > > 
> > > That isn't quite what I meant. My 2nd point is to make mq-deadline's
> > > dispatch_request(hctx) return only requests mapped to the hw queue of
> > > 'hctx'; then we can avoid messing up blk-mq.c and blk-mq-sched.c.
> > 
> > That would indeed be better. But let's assume that we have 4 hardware
> > queues and we're asked to run queue X, while the FIFO dictates that the
> > first request that should run is on queue Y, since it's expired. What do
> > we do? We'd have to abort dispatching on queue X and run queue Y
> > appropriately.
> 
> The previous rule was to run queue Y on the CPUs (sw queues) mapped to
> queue Y, and that should still work if we keep following it. But this
> patch is starting to break that rule, :-(
> 
> > 
> > This shuffle can happen all the time. I think it'd be worthwhile to work
> > towards the goal of improving mq-deadline to deal with this, and
> > subsequently cleaning up the interface. It would be a cleaner
> > implementation, though efficiency might suffer further.
> > 
> > >>>>> If that is true, it looks like an issue in the usage of the I/O
> > >>>>> scheduler, since the mq-deadline scheduler just queues requests in one
> > >>>>> per-request_queue linked list; for an MQ device, the scheduler queue
> > >>>>> should have been per-hctx.
> > >>>>
> > >>>> That's an option, but that's a different scheduling policy. Again, I
> > >>>> agree that it's strange, but it's reasonable behavior.
> > >>>
> > >>> IMO, the current mq-deadline isn't good/ready for an MQ device, and it
> > >>> doesn't make sense to use it for MQ.
> > >>
> > >> I don't think that's true at all. I do agree that it's somewhat quirky
> > >> since it does introduce scheduling dependencies between the hardware
> > >> queues, and we have to work at making that well understood and explicit,
> > >> so as not to introduce bugs due to that. But in reality, all multiqueue
> > >> hardware we are dealing with is mapped to a single resource. As such,
> > >> it makes a lot of sense to schedule it as such. Hence I don't think that
> > >> a single-queue deadline approach is necessarily a bad idea even for fast
> > >> storage.
> > > 
> > > When we map all of the mq hw queues into one single deadline queue, it
> > > can't scale well. For example, I just ran a simple write test (libaio,
> > > dio, bs:4k, 4 jobs) over one commodity NVMe drive in a dual-socket
> > > (four cores) system:
> > > 
> > > 	IO scheduler | CPU utilization
> > > 	-------------+----------------
> > > 	none         | 60%
> > > 	mq-deadline  | 100%
> > > 
> > > And IO throughput is basically the same in the two cases.
> > > 
> > That's a given: for low blocksize and high iops, we'll be hitting the
> > deadline lock pretty hard. We dispatch one request at a time, to
> > optimize for rotational storage, since it improves merging a lot. If we
> > modified e->type->ops.mq.dispatch_request() to take a "dump this many
> > requests" parameter and ramped up the depth quickly, then we could
> > drastically reduce the overhead for your case. The initial version
> > dumped the whole thing. It had lower overhead on flash, but didn't reach
> > the performance of old deadline on !mq since we continually emptied the
> > queue and eliminated merging opportunities.
> > 
> > blk_mq_sched_dispatch_requests() could add logic that brought back the
> > old behavior if NONROT is set.
> 
> From my further testing, it looks like the issue is related to the
> single mq-deadline queue (including its single lock), which needs to be
> accessed from different NUMA nodes.
> 
> The following tree implements a per-hctx mq-deadline queue, and
> basically makes q->last_merge, elevator_queue->hash and 'struct
> deadline_data' per-hctx. With these patches, the issue disappeared.

Sorry for missing the link:

	https://github.com/ming1/linux/commits/my_v4.11-rcX

Thanks,
Ming

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-04-11  2:16 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-05 19:01 [PATCH v3 0/8] blk-mq: various fixes and cleanups Omar Sandoval
2017-04-05 19:01 ` [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails Omar Sandoval
2017-04-06  4:31   ` Ming Lei
2017-04-06  7:57     ` Omar Sandoval
2017-04-06  8:23       ` Ming Lei
2017-04-06 19:29         ` Jens Axboe
2017-04-07  3:23           ` Ming Lei
2017-04-07 14:45             ` Jens Axboe
2017-04-11  2:12               ` Ming Lei
2017-04-11  2:15                 ` Ming Lei
2017-04-05 19:01 ` [PATCH v3 2/8] blk-mq-sched: refactor scheduler initialization Omar Sandoval
2017-04-05 19:01 ` [PATCH v3 3/8] blk-mq-sched: set up scheduler tags when bringing up new queues Omar Sandoval
2017-04-05 19:01 ` [PATCH v3 4/8] blk-mq-sched: fix crash in switch error path Omar Sandoval
2017-04-05 19:01 ` [PATCH v3 5/8] blk-mq: remap queues when adding/removing hardware queues Omar Sandoval
2017-04-05 19:01 ` [PATCH v3 6/8] blk-mq-sched: provide hooks for initializing hardware queue data Omar Sandoval
2017-04-05 19:01 ` [PATCH v3 7/8] blk-mq: make driver tag failure path easier to follow Omar Sandoval
2017-04-05 19:01 ` [PATCH v3 8/8] blk-mq: use true instead of 1 for blk_mq_queue_data.last Omar Sandoval
