* [PATCH v2 0/8] blk-mq: various fixes and cleanups
@ 2017-04-05 18:28 Omar Sandoval
  2017-04-05 18:28 ` [PATCH v2 1/8] blk-mq: use the right hctx when getting a driver tag fails Omar Sandoval
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

This is v2 of my series from a couple of days ago [1], with one extra
fix and two extra cleanups.

- Patch 1 is the new fix for a hang that Josef reported after trying v1.
- Patches 2-6 are the original series. Patch 5 now has Christoph's and
  Sagi's Reviewed-by.
- Patches 7 and 8 are trivial cleanups.

Patches 1-5 should probably go into 4.11, and 6-8 are for 4.12.

1: http://marc.info/?l=linux-block&m=149125578724683&w=2

Omar Sandoval (8):
  blk-mq: use the right hctx when getting a driver tag fails
  blk-mq-sched: refactor scheduler initialization
  blk-mq-sched: set up scheduler tags when bringing up new queues
  blk-mq-sched: fix crash in switch error path
  blk-mq: remap queues when adding/removing hardware queues
  blk-mq-sched: provide hooks for initializing hardware queue data
  blk-mq: make driver tag failure path easier to follow
  blk-mq: clean up direct issue blk_mq_queue_data initialization

 block/blk-mq-sched.c     | 187 +++++++++++++++++++++++++++++------------------
 block/blk-mq-sched.h     |  13 ++--
 block/blk-mq.c           |  72 ++++++++++--------
 block/blk-mq.h           |   2 +-
 block/blk-sysfs.c        |   2 +-
 block/elevator.c         | 114 +++++++++++++++--------------
 include/linux/elevator.h |   4 +-
 7 files changed, 227 insertions(+), 167 deletions(-)

-- 
2.12.2

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/8] blk-mq: use the right hctx when getting a driver tag fails
  2017-04-05 18:28 [PATCH v2 0/8] blk-mq: various fixes and cleanups Omar Sandoval
@ 2017-04-05 18:28 ` Omar Sandoval
  2017-04-05 18:33   ` Bart Van Assche
  2017-04-05 18:28 ` [PATCH v2 2/8] blk-mq-sched: refactor scheduler initialization Omar Sandoval
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 13+ messages in thread
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.

The fix is twofold:

1. Make sure we save which hardware queue we were trying to get a
   request for in blk_mq_get_driver_tag() regardless of whether it
   succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
   blk_mq_hw_queue to make it clear that it must handle multiple
   hardware queues, since I've already messed this up on a couple of
   occasions.

This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c |  9 +++++----
 block/blk-mq.c       | 25 +++++++++++++------------
 block/blk-mq.h       |  2 +-
 3 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff18719..fc00f00898d3 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -171,7 +171,8 @@ void blk_mq_sched_put_request(struct request *rq)
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
-	struct elevator_queue *e = hctx->queue->elevator;
+	struct request_queue *q = hctx->queue;
+	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
 	bool did_work = false;
 	LIST_HEAD(rq_list);
@@ -203,10 +204,10 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		did_work = blk_mq_dispatch_rq_list(hctx, &rq_list);
+		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
 	} else if (!has_sched_dispatch) {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
-		blk_mq_dispatch_rq_list(hctx, &rq_list);
+		blk_mq_dispatch_rq_list(q, &rq_list);
 	}
 
 	/*
@@ -222,7 +223,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 			if (!rq)
 				break;
 			list_add(&rq->queuelist, &rq_list);
-		} while (blk_mq_dispatch_rq_list(hctx, &rq_list));
+		} while (blk_mq_dispatch_rq_list(q, &rq_list));
 	}
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 061fc2cc88d3..6c1bedc23b5a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -863,12 +863,8 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 		.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
 	};
 
-	if (rq->tag != -1) {
-done:
-		if (hctx)
-			*hctx = data.hctx;
-		return true;
-	}
+	if (rq->tag != -1)
+		goto done;
 
 	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
 		data.flags |= BLK_MQ_REQ_RESERVED;
@@ -880,10 +876,12 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 			atomic_inc(&data.hctx->nr_active);
 		}
 		data.hctx->tags->rqs[rq->tag] = rq;
-		goto done;
 	}
 
-	return false;
+done:
+	if (hctx)
+		*hctx = data.hctx;
+	return rq->tag != -1;
 }
 
 static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
@@ -980,14 +978,17 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
-bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
+bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 {
-	struct request_queue *q = hctx->queue;
+	struct blk_mq_hw_ctx *hctx;
 	struct request *rq;
 	LIST_HEAD(driver_list);
 	struct list_head *dptr;
 	int errors, queued, ret = BLK_MQ_RQ_QUEUE_OK;
 
+	if (list_empty(list))
+		return false;
+
 	/*
 	 * Start off with dptr being NULL, so we start the first request
 	 * immediately, even if we have more pending.
@@ -998,7 +999,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 	 * Now process all the entries, sending them to the driver.
 	 */
 	errors = queued = 0;
-	while (!list_empty(list)) {
+	do {
 		struct blk_mq_queue_data bd;
 
 		rq = list_first_entry(list, struct request, queuelist);
@@ -1069,7 +1070,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 		 */
 		if (!dptr && list->next != list->prev)
 			dptr = &driver_list;
-	}
+	} while (!list_empty(list));
 
 	hctx->dispatched[queued_to_index(queued)]++;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8d49c06fc520..7e6f2e467696 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -30,7 +30,7 @@ void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
-bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
+bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
-- 
2.12.2

* [PATCH v2 2/8] blk-mq-sched: refactor scheduler initialization
  2017-04-05 18:28 [PATCH v2 0/8] blk-mq: various fixes and cleanups Omar Sandoval
  2017-04-05 18:28 ` [PATCH v2 1/8] blk-mq: use the right hctx when getting a driver tag fails Omar Sandoval
@ 2017-04-05 18:28 ` Omar Sandoval
  2017-04-05 18:28 ` [PATCH v2 3/8] blk-mq-sched: set up scheduler tags when bringing up new queues Omar Sandoval
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

As preparatory cleanup for the next couple of fixes, push
blk_mq_sched_setup() and e->ops.mq.init_sched() into a single helper.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c | 82 ++++++++++++++++++++++++++++------------------------
 block/blk-mq-sched.h |  2 +-
 block/elevator.c     | 32 ++++++++------------
 3 files changed, 57 insertions(+), 59 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index fc00f00898d3..6bd1758ea29b 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -432,11 +432,45 @@ static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
 	}
 }
 
-int blk_mq_sched_setup(struct request_queue *q)
+static int blk_mq_sched_alloc_tags(struct request_queue *q,
+				   struct blk_mq_hw_ctx *hctx,
+				   unsigned int hctx_idx)
+{
+	struct blk_mq_tag_set *set = q->tag_set;
+	int ret;
+
+	hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
+					       set->reserved_tags);
+	if (!hctx->sched_tags)
+		return -ENOMEM;
+
+	ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests);
+	if (ret)
+		blk_mq_sched_free_tags(set, hctx, hctx_idx);
+
+	return ret;
+}
+
+void blk_mq_sched_teardown(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
-	int ret, i;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i)
+		blk_mq_sched_free_tags(set, hctx, i);
+}
+
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+	int ret;
+
+	if (!e) {
+		q->elevator = NULL;
+		return 0;
+	}
 
 	/*
 	 * Default to 256, since we don't split into sync/async like the
@@ -444,49 +478,21 @@ int blk_mq_sched_setup(struct request_queue *q)
 	 */
 	q->nr_requests = 2 * BLKDEV_MAX_RQ;
 
-	/*
-	 * We're switching to using an IO scheduler, so setup the hctx
-	 * scheduler tags and switch the request map from the regular
-	 * tags to scheduler tags. First allocate what we need, so we
-	 * can safely fail and fallback, if needed.
-	 */
-	ret = 0;
 	queue_for_each_hw_ctx(q, hctx, i) {
-		hctx->sched_tags = blk_mq_alloc_rq_map(set, i,
-				q->nr_requests, set->reserved_tags);
-		if (!hctx->sched_tags) {
-			ret = -ENOMEM;
-			break;
-		}
-		ret = blk_mq_alloc_rqs(set, hctx->sched_tags, i, q->nr_requests);
+		ret = blk_mq_sched_alloc_tags(q, hctx, i);
 		if (ret)
-			break;
+			goto err;
 	}
 
-	/*
-	 * If we failed, free what we did allocate
-	 */
-	if (ret) {
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->sched_tags)
-				continue;
-			blk_mq_sched_free_tags(set, hctx, i);
-		}
-
-		return ret;
-	}
+	ret = e->ops.mq.init_sched(q, e);
+	if (ret)
+		goto err;
 
 	return 0;
-}
 
-void blk_mq_sched_teardown(struct request_queue *q)
-{
-	struct blk_mq_tag_set *set = q->tag_set;
-	struct blk_mq_hw_ctx *hctx;
-	int i;
-
-	queue_for_each_hw_ctx(q, hctx, i)
-		blk_mq_sched_free_tags(set, hctx, i);
+err:
+	blk_mq_sched_teardown(q);
+	return ret;
 }
 
 int blk_mq_sched_init(struct request_queue *q)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index a75b16b123f7..873f9af5a35b 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -32,7 +32,7 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 			struct list_head *rq_list,
 			struct request *(*get_rq)(struct blk_mq_hw_ctx *));
 
-int blk_mq_sched_setup(struct request_queue *q);
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
 void blk_mq_sched_teardown(struct request_queue *q);
 
 int blk_mq_sched_init(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 01139f549b5b..f236ef1d2be9 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -242,17 +242,12 @@ int elevator_init(struct request_queue *q, char *name)
 		}
 	}
 
-	if (e->uses_mq) {
-		err = blk_mq_sched_setup(q);
-		if (!err)
-			err = e->ops.mq.init_sched(q, e);
-	} else
+	if (e->uses_mq)
+		err = blk_mq_init_sched(q, e);
+	else
 		err = e->ops.sq.elevator_init_fn(q, e);
-	if (err) {
-		if (e->uses_mq)
-			blk_mq_sched_teardown(q);
+	if (err)
 		elevator_put(e);
-	}
 	return err;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -987,21 +982,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	}
 
 	/* allocate, init and register new elevator */
-	if (new_e) {
-		if (new_e->uses_mq) {
-			err = blk_mq_sched_setup(q);
-			if (!err)
-				err = new_e->ops.mq.init_sched(q, new_e);
-		} else
-			err = new_e->ops.sq.elevator_init_fn(q, new_e);
-		if (err)
-			goto fail_init;
+	if (q->mq_ops)
+		err = blk_mq_init_sched(q, new_e);
+	else
+		err = new_e->ops.sq.elevator_init_fn(q, new_e);
+	if (err)
+		goto fail_init;
 
+	if (new_e) {
 		err = elv_register_queue(q);
 		if (err)
 			goto fail_register;
-	} else
-		q->elevator = NULL;
+	}
 
 	/* done, kill the old one and finish */
 	if (old) {
-- 
2.12.2

* [PATCH v2 3/8] blk-mq-sched: set up scheduler tags when bringing up new queues
  2017-04-05 18:28 [PATCH v2 0/8] blk-mq: various fixes and cleanups Omar Sandoval
  2017-04-05 18:28 ` [PATCH v2 1/8] blk-mq: use the right hctx when getting a driver tag fails Omar Sandoval
  2017-04-05 18:28 ` [PATCH v2 2/8] blk-mq-sched: refactor scheduler initialization Omar Sandoval
@ 2017-04-05 18:28 ` Omar Sandoval
  2017-04-05 18:28 ` [PATCH v2 4/8] blk-mq-sched: fix crash in switch error path Omar Sandoval
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

If a new hardware queue is added at runtime, we don't allocate scheduler
tags for it, leading to a crash. This hooks up the scheduler framework
to blk_mq_{init,exit}_hctx() to make sure everything gets properly
initialized/freed.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c | 22 ++++++++++++++++++++++
 block/blk-mq-sched.h |  5 +++++
 block/blk-mq.c       |  9 ++++++++-
 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 6bd1758ea29b..0bb13bb51daa 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -461,6 +461,28 @@ void blk_mq_sched_teardown(struct request_queue *q)
 		blk_mq_sched_free_tags(set, hctx, i);
 }
 
+int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			   unsigned int hctx_idx)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (!e)
+		return 0;
+
+	return blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+}
+
+void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			    unsigned int hctx_idx)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (!e)
+		return;
+
+	blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
+}
+
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct blk_mq_hw_ctx *hctx;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 873f9af5a35b..19db25e0c95a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -35,6 +35,11 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
 void blk_mq_sched_teardown(struct request_queue *q);
 
+int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			   unsigned int hctx_idx);
+void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			    unsigned int hctx_idx);
+
 int blk_mq_sched_init(struct request_queue *q);
 
 static inline bool
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6c1bedc23b5a..cf158a6fc997 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1840,6 +1840,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 				       hctx->fq->flush_rq, hctx_idx,
 				       flush_start_tag + hctx_idx);
 
+	blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
+
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
@@ -1906,9 +1908,12 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
 		goto free_bitmap;
 
+	if (blk_mq_sched_init_hctx(q, hctx, hctx_idx))
+		goto exit_hctx;
+
 	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size);
 	if (!hctx->fq)
-		goto exit_hctx;
+		goto sched_exit_hctx;
 
 	if (set->ops->init_request &&
 	    set->ops->init_request(set->driver_data,
@@ -1923,6 +1928,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
 
  free_fq:
 	kfree(hctx->fq);
+ sched_exit_hctx:
+	blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
  exit_hctx:
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
-- 
2.12.2

* [PATCH v2 4/8] blk-mq-sched: fix crash in switch error path
  2017-04-05 18:28 [PATCH v2 0/8] blk-mq: various fixes and cleanups Omar Sandoval
                   ` (2 preceding siblings ...)
  2017-04-05 18:28 ` [PATCH v2 3/8] blk-mq-sched: set up scheduler tags when bringing up new queues Omar Sandoval
@ 2017-04-05 18:28 ` Omar Sandoval
  2017-04-05 18:28 ` [PATCH v2 5/8] blk-mq: remap queues when adding/removing hardware queues Omar Sandoval
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
back to the original scheduler. However, at this point, we've already
torn down the original scheduler's tags, so this causes a crash. Doing
the fallback like the legacy elevator path is much harder for mq, so fix
it by just falling back to none instead.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c     | 13 +++++--
 block/blk-mq-sched.h     |  2 +-
 block/blk-mq.c           |  2 --
 block/blk-sysfs.c        |  2 +-
 block/elevator.c         | 94 +++++++++++++++++++++++++++---------------------
 include/linux/elevator.h |  2 +-
 6 files changed, 67 insertions(+), 48 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 0bb13bb51daa..e8c2ed654ef0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -451,7 +451,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
 	return ret;
 }
 
-void blk_mq_sched_teardown(struct request_queue *q)
+static void blk_mq_sched_tags_teardown(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
@@ -513,10 +513,19 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	return 0;
 
 err:
-	blk_mq_sched_teardown(q);
+	blk_mq_sched_tags_teardown(q);
+	q->elevator = NULL;
 	return ret;
 }
 
+void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
+{
+	if (e->type->ops.mq.exit_sched)
+		e->type->ops.mq.exit_sched(e);
+	blk_mq_sched_tags_teardown(q);
+	q->elevator = NULL;
+}
+
 int blk_mq_sched_init(struct request_queue *q)
 {
 	int ret;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 19db25e0c95a..e704956e0862 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -33,7 +33,7 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 			struct request *(*get_rq)(struct blk_mq_hw_ctx *));
 
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
-void blk_mq_sched_teardown(struct request_queue *q);
+void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
 
 int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			   unsigned int hctx_idx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index cf158a6fc997..53c5c6097be9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2154,8 +2154,6 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
-	blk_mq_sched_teardown(q);
-
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 45854266e398..c47db43a40cc 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -803,7 +803,7 @@ static void blk_release_queue(struct kobject *kobj)
 
 	if (q->elevator) {
 		ioc_clear_queue(q);
-		elevator_exit(q->elevator);
+		elevator_exit(q, q->elevator);
 	}
 
 	blk_free_queue_stats(q->stats);
diff --git a/block/elevator.c b/block/elevator.c
index f236ef1d2be9..dbeecf7be719 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -252,11 +252,11 @@ int elevator_init(struct request_queue *q, char *name)
 }
 EXPORT_SYMBOL(elevator_init);
 
-void elevator_exit(struct elevator_queue *e)
+void elevator_exit(struct request_queue *q, struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
 	if (e->uses_mq && e->type->ops.mq.exit_sched)
-		e->type->ops.mq.exit_sched(e);
+		blk_mq_exit_sched(q, e);
 	else if (!e->uses_mq && e->type->ops.sq.elevator_exit_fn)
 		e->type->ops.sq.elevator_exit_fn(e);
 	mutex_unlock(&e->sysfs_lock);
@@ -941,6 +941,45 @@ void elv_unregister(struct elevator_type *e)
 }
 EXPORT_SYMBOL_GPL(elv_unregister);
 
+static int elevator_switch_mq(struct request_queue *q,
+			      struct elevator_type *new_e)
+{
+	int ret;
+
+	blk_mq_freeze_queue(q);
+	blk_mq_quiesce_queue(q);
+
+	if (q->elevator) {
+		if (q->elevator->registered)
+			elv_unregister_queue(q);
+		ioc_clear_queue(q);
+		elevator_exit(q, q->elevator);
+	}
+
+	ret = blk_mq_init_sched(q, new_e);
+	if (ret)
+		goto out;
+
+	if (new_e) {
+		ret = elv_register_queue(q);
+		if (ret) {
+			elevator_exit(q, q->elevator);
+			goto out;
+		}
+	}
+
+	if (new_e)
+		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	else
+		blk_add_trace_msg(q, "elv switch: none");
+
+out:
+	blk_mq_unfreeze_queue(q);
+	blk_mq_start_stopped_hw_queues(q, true);
+	return ret;
+
+}
+
 /*
  * switch to new_e io scheduler. be careful not to introduce deadlocks -
  * we don't free the old io scheduler, before we have allocated what we
@@ -953,10 +992,8 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	bool old_registered = false;
 	int err;
 
-	if (q->mq_ops) {
-		blk_mq_freeze_queue(q);
-		blk_mq_quiesce_queue(q);
-	}
+	if (q->mq_ops)
+		return elevator_switch_mq(q, new_e);
 
 	/*
 	 * Turn on BYPASS and drain all requests w/ elevator private data.
@@ -968,11 +1005,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (old) {
 		old_registered = old->registered;
 
-		if (old->uses_mq)
-			blk_mq_sched_teardown(q);
-
-		if (!q->mq_ops)
-			blk_queue_bypass_start(q);
+		blk_queue_bypass_start(q);
 
 		/* unregister and clear all auxiliary data of the old elevator */
 		if (old_registered)
@@ -982,53 +1015,32 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	}
 
 	/* allocate, init and register new elevator */
-	if (q->mq_ops)
-		err = blk_mq_init_sched(q, new_e);
-	else
-		err = new_e->ops.sq.elevator_init_fn(q, new_e);
+	err = new_e->ops.sq.elevator_init_fn(q, new_e);
 	if (err)
 		goto fail_init;
 
-	if (new_e) {
-		err = elv_register_queue(q);
-		if (err)
-			goto fail_register;
-	}
+	err = elv_register_queue(q);
+	if (err)
+		goto fail_register;
 
 	/* done, kill the old one and finish */
 	if (old) {
-		elevator_exit(old);
-		if (!q->mq_ops)
-			blk_queue_bypass_end(q);
+		elevator_exit(q, old);
+		blk_queue_bypass_end(q);
 	}
 
-	if (q->mq_ops) {
-		blk_mq_unfreeze_queue(q);
-		blk_mq_start_stopped_hw_queues(q, true);
-	}
-
-	if (new_e)
-		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
-	else
-		blk_add_trace_msg(q, "elv switch: none");
+	blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
 
 	return 0;
 
 fail_register:
-	if (q->mq_ops)
-		blk_mq_sched_teardown(q);
-	elevator_exit(q->elevator);
+	elevator_exit(q, q->elevator);
 fail_init:
 	/* switch failed, restore and re-register old elevator */
 	if (old) {
 		q->elevator = old;
 		elv_register_queue(q);
-		if (!q->mq_ops)
-			blk_queue_bypass_end(q);
-	}
-	if (q->mq_ops) {
-		blk_mq_unfreeze_queue(q);
-		blk_mq_start_stopped_hw_queues(q, true);
+		blk_queue_bypass_end(q);
 	}
 
 	return err;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index aebecc4ed088..22d39e8d4de1 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -211,7 +211,7 @@ extern ssize_t elv_iosched_show(struct request_queue *, char *);
 extern ssize_t elv_iosched_store(struct request_queue *, const char *, size_t);
 
 extern int elevator_init(struct request_queue *, char *);
-extern void elevator_exit(struct elevator_queue *);
+extern void elevator_exit(struct request_queue *, struct elevator_queue *);
 extern int elevator_change(struct request_queue *, const char *);
 extern bool elv_bio_merge_ok(struct request *, struct bio *);
 extern struct elevator_queue *elevator_alloc(struct request_queue *,
-- 
2.12.2

* [PATCH v2 5/8] blk-mq: remap queues when adding/removing hardware queues
  2017-04-05 18:28 [PATCH v2 0/8] blk-mq: various fixes and cleanups Omar Sandoval
                   ` (3 preceding siblings ...)
  2017-04-05 18:28 ` [PATCH v2 4/8] blk-mq-sched: fix crash in switch error path Omar Sandoval
@ 2017-04-05 18:28 ` Omar Sandoval
  2017-04-05 18:28 ` [PATCH v2 6/8] blk-mq-sched: provide hooks for initializing hardware queue data Omar Sandoval
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
behavior that drivers expect. However, commit 4e68a011428a changed
blk_mq_queue_reinit() to not remap queues for the case of CPU
hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
queues as well. This breaks, for example, NBD's multi-connection mode,
leaving the added hardware queues unused. Fix it by making
blk_mq_update_nr_hw_queues() explicitly remap the queues.

Fixes: 4e68a011428a ("blk-mq: don't redistribute hardware queues on a CPU hotplug event")
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 53c5c6097be9..4aa3148befe1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2484,6 +2484,14 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 	return 0;
 }
 
+static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
+{
+	if (set->ops->map_queues)
+		return set->ops->map_queues(set);
+	else
+		return blk_mq_map_queues(set);
+}
+
 /*
  * Alloc a tag set to be associated with one or more request queues.
  * May fail with EINVAL for various error conditions. May adjust the
@@ -2538,10 +2546,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->mq_map)
 		goto out_free_tags;
 
-	if (set->ops->map_queues)
-		ret = set->ops->map_queues(set);
-	else
-		ret = blk_mq_map_queues(set);
+	ret = blk_mq_update_queue_map(set);
 	if (ret)
 		goto out_free_mq_map;
 
@@ -2633,6 +2638,7 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 		blk_mq_freeze_queue(q);
 
 	set->nr_hw_queues = nr_hw_queues;
+	blk_mq_update_queue_map(set);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_realloc_hw_ctxs(set, q);
 		blk_mq_queue_reinit(q, cpu_online_mask);
-- 
2.12.2

* [PATCH v2 6/8] blk-mq-sched: provide hooks for initializing hardware queue data
  2017-04-05 18:28 [PATCH v2 0/8] blk-mq: various fixes and cleanups Omar Sandoval
                   ` (4 preceding siblings ...)
  2017-04-05 18:28 ` [PATCH v2 5/8] blk-mq: remap queues when adding/removing hardware queues Omar Sandoval
@ 2017-04-05 18:28 ` Omar Sandoval
  2017-04-05 18:28 ` [PATCH v2 7/8] blk-mq: make driver tag failure path easier to follow Omar Sandoval
  2017-04-05 18:28 ` [PATCH v2 8/8] blk-mq: clean up direct issue blk_mq_queue_data initialization Omar Sandoval
  7 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Schedulers need to be informed when a hardware queue is added or removed
at runtime so they can allocate/free per-hardware queue data. So,
replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
at init time, with .init_hctx() and .exit_hctx() hooks.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c     | 81 +++++++++++++++++++++++++-----------------------
 block/blk-mq-sched.h     |  4 ---
 include/linux/elevator.h |  2 ++
 3 files changed, 45 insertions(+), 42 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e8c2ed654ef0..9d7f6d6ca693 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -30,43 +30,6 @@ void blk_mq_sched_free_hctx_data(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
 
-int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
-				int (*init)(struct blk_mq_hw_ctx *),
-				void (*exit)(struct blk_mq_hw_ctx *))
-{
-	struct blk_mq_hw_ctx *hctx;
-	int ret;
-	int i;
-
-	queue_for_each_hw_ctx(q, hctx, i) {
-		hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
-		if (!hctx->sched_data) {
-			ret = -ENOMEM;
-			goto error;
-		}
-
-		if (init) {
-			ret = init(hctx);
-			if (ret) {
-				/*
-				 * We don't want to give exit() a partially
-				 * initialized sched_data. init() must clean up
-				 * if it fails.
-				 */
-				kfree(hctx->sched_data);
-				hctx->sched_data = NULL;
-				goto error;
-			}
-		}
-	}
-
-	return 0;
-error:
-	blk_mq_sched_free_hctx_data(q, exit);
-	return ret;
-}
-EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data);
-
 static void __blk_mq_sched_assign_ioc(struct request_queue *q,
 				      struct request *rq,
 				      struct bio *bio,
@@ -465,11 +428,24 @@ int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			   unsigned int hctx_idx)
 {
 	struct elevator_queue *e = q->elevator;
+	int ret;
 
 	if (!e)
 		return 0;
 
-	return blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+	ret = blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+	if (ret)
+		return ret;
+
+	if (e->type->ops.mq.init_hctx) {
+		ret = e->type->ops.mq.init_hctx(hctx, hctx_idx);
+		if (ret) {
+			blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
+			return ret;
+		}
+	}
+
+	return 0;
 }
 
 void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
@@ -480,12 +456,18 @@ void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 	if (!e)
 		return;
 
+	if (e->type->ops.mq.exit_hctx && hctx->sched_data) {
+		e->type->ops.mq.exit_hctx(hctx, hctx_idx);
+		hctx->sched_data = NULL;
+	}
+
 	blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
 }
 
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct blk_mq_hw_ctx *hctx;
+	struct elevator_queue *eq;
 	unsigned int i;
 	int ret;
 
@@ -510,6 +492,18 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	if (ret)
 		goto err;
 
+	if (e->ops.mq.init_hctx) {
+		queue_for_each_hw_ctx(q, hctx, i) {
+			ret = e->ops.mq.init_hctx(hctx, i);
+			if (ret) {
+				eq = q->elevator;
+				blk_mq_exit_sched(q, eq);
+				kobject_put(&eq->kobj);
+				return ret;
+			}
+		}
+	}
+
 	return 0;
 
 err:
@@ -520,6 +514,17 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 
 void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 {
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+
+	if (e->type->ops.mq.exit_hctx) {
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (hctx->sched_data) {
+				e->type->ops.mq.exit_hctx(hctx, i);
+				hctx->sched_data = NULL;
+			}
+		}
+	}
 	if (e->type->ops.mq.exit_sched)
 		e->type->ops.mq.exit_sched(e);
 	blk_mq_sched_tags_teardown(q);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index e704956e0862..c6e760df0fb4 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -4,10 +4,6 @@
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
-int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
-				int (*init)(struct blk_mq_hw_ctx *),
-				void (*exit)(struct blk_mq_hw_ctx *));
-
 void blk_mq_sched_free_hctx_data(struct request_queue *q,
 				 void (*exit)(struct blk_mq_hw_ctx *));
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 22d39e8d4de1..b7ec315ee7e7 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -93,6 +93,8 @@ struct blk_mq_hw_ctx;
 struct elevator_mq_ops {
 	int (*init_sched)(struct request_queue *, struct elevator_type *);
 	void (*exit_sched)(struct elevator_queue *);
+	int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
+	void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
 
 	bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
 	bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
-- 
2.12.2


* [PATCH v2 7/8] blk-mq: make driver tag failure path easier to follow
  2017-04-05 18:28 [PATCH v2 0/8] blk-mq: various fixes and cleanups Omar Sandoval
                   ` (5 preceding siblings ...)
  2017-04-05 18:28 ` [PATCH v2 6/8] blk-mq-sched: provide hooks for initializing hardware queue data Omar Sandoval
@ 2017-04-05 18:28 ` Omar Sandoval
  2017-04-05 18:28 ` [PATCH v2 8/8] blk-mq: clean up direct issue blk_mq_queue_data initialization Omar Sandoval
  7 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Minor cleanup that makes it easier to figure out what's going on in the
driver tag allocation failure path of blk_mq_dispatch_rq_list().

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4aa3148befe1..71dc8608f3a8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1011,17 +1011,16 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 			 * The initial allocation attempt failed, so we need to
 			 * rerun the hardware queue when a tag is freed.
 			 */
-			if (blk_mq_dispatch_wait_add(hctx)) {
-				/*
-				 * It's possible that a tag was freed in the
-				 * window between the allocation failure and
-				 * adding the hardware queue to the wait queue.
-				 */
-				if (!blk_mq_get_driver_tag(rq, &hctx, false))
-					break;
-			} else {
+			if (!blk_mq_dispatch_wait_add(hctx))
+				break;
+
+			/*
+			 * It's possible that a tag was freed in the window
+			 * between the allocation failure and adding the
+			 * hardware queue to the wait queue.
+			 */
+			if (!blk_mq_get_driver_tag(rq, &hctx, false))
 				break;
-			}
 		}
 
 		list_del_init(&rq->queuelist);
-- 
2.12.2


* [PATCH v2 8/8] blk-mq: clean up direct issue blk_mq_queue_data initialization
  2017-04-05 18:28 [PATCH v2 0/8] blk-mq: various fixes and cleanups Omar Sandoval
                   ` (6 preceding siblings ...)
  2017-04-05 18:28 ` [PATCH v2 7/8] blk-mq: make driver tag failure path easier to follow Omar Sandoval
@ 2017-04-05 18:28 ` Omar Sandoval
  2017-04-05 18:35   ` Bart Van Assche
  7 siblings, 1 reply; 13+ messages in thread
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Trivial cleanup.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 71dc8608f3a8..779249a5999b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1451,8 +1451,7 @@ static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
 	struct request_queue *q = rq->q;
 	struct blk_mq_queue_data bd = {
 		.rq = rq,
-		.list = NULL,
-		.last = 1
+		.last = true,
 	};
 	struct blk_mq_hw_ctx *hctx;
 	blk_qc_t new_cookie;
-- 
2.12.2


* Re: [PATCH v2 1/8] blk-mq: use the right hctx when getting a driver tag fails
  2017-04-05 18:28 ` [PATCH v2 1/8] blk-mq: use the right hctx when getting a driver tag fails Omar Sandoval
@ 2017-04-05 18:33   ` Bart Van Assche
  2017-04-05 18:42     ` Omar Sandoval
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2017-04-05 18:33 UTC (permalink / raw)
  To: osandov, linux-block, axboe; +Cc: kernel-team

On Wed, 2017-04-05 at 11:28 -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> While dispatching requests, if we fail to get a driver tag, we mark the
> hardware queue as waiting for a tag and put the requests on a
> hctx->dispatch list to be run later when a driver tag is freed. However,
> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> queues if using a single-queue scheduler with a multiqueue device. If
> blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
> are processing. This means we end up using the hardware queue of the
> previous request, which may or may not be the same as that of the
> current request. If it isn't, the wrong hardware queue will end up
> waiting for a tag, and the requests will be on the wrong dispatch list,
> leading to a hang.
> 
> The fix is twofold:
>=20
> 1. Make sure we save which hardware queue we were trying to get a
>    request for in blk_mq_get_driver_tag() regardless of whether it
>    succeeds or not.
> 2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
>    blk_mq_hw_queue to make it clear that it must handle multiple
>    hardware queues, since I've already messed this up on a couple of
>    occasions.
> 
> This didn't appear in testing with nvme and mq-deadline because nvme has
> more driver tags than the default number of scheduler tags. However,
> with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.

Would the patch below be a valid alternative?

Thanks,

Bart.

[PATCH] blk-mq: Simplify blk_mq_get_driver_tag()

The blk_mq_get_driver_tag() callers either assume that *hctx is not
modified or that it points to a valid hctx pointer upon return if
tag allocation succeeded. Avoid this confusion by returning the hctx
pointer if and only if tag allocation succeeded and by only storing
the return value into hctx in those blk_mq_get_driver_tag() callers
for which the hctx pointer had not yet been computed before the
blk_mq_get_driver_tag() call.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
---
 block/blk-mq-sched.c |  4 +++-
 block/blk-mq.c       | 24 ++++++++++--------------
 block/blk-mq.h       |  3 +--
 3 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index a85d939ef450..bfb8bdb95a87 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -386,7 +386,9 @@ void blk_mq_sched_restart_a_queue(struct blk_mq_hw_ctx *hctx)
 static void blk_mq_sched_insert_flush(struct blk_mq_hw_ctx *hctx,
 				      struct request *rq, bool can_block)
 {
-	if (blk_mq_get_driver_tag(rq, &hctx, can_block)) {
+	WARN_ON_ONCE(!hctx);
+
+	if (blk_mq_get_driver_tag(rq, can_block)) {
 		blk_insert_flush(rq);
 		blk_mq_run_hw_queue(hctx, true);
 	} else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 45e7f597cea3..c8e0c02dc8ca 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -857,8 +857,7 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
-bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
-			   bool wait)
+struct blk_mq_hw_ctx *blk_mq_get_driver_tag(struct request *rq, bool wait)
 {
 	struct blk_mq_alloc_data data = {
 		.q = rq->q,
@@ -866,12 +865,8 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 		.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
 	};
 
-	if (rq->tag != -1) {
-done:
-		if (hctx)
-			*hctx = data.hctx;
-		return true;
-	}
+	if (rq->tag != -1)
+		return data.hctx;
 
 	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
 		data.flags |= BLK_MQ_REQ_RESERVED;
@@ -883,10 +878,10 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 			atomic_inc(&data.hctx->nr_active);
 		}
 		data.hctx->tags->rqs[rq->tag] = rq;
-		goto done;
+		return data.hctx;
 	}
 
-	return false;
+	return NULL;
 }
 
 static void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
@@ -985,7 +980,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 		struct blk_mq_queue_data bd;
 
 		rq = list_first_entry(list, struct request, queuelist);
-		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
+		if (!blk_mq_get_driver_tag(rq, false)) {
 			if (!queued && reorder_tags_to_front(list))
 				continue;
 
@@ -999,7 +994,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 				 * window between the allocation failure and
 				 * adding the hardware queue to the wait queue.
 				 */
-				if (!blk_mq_get_driver_tag(rq, &hctx, false))
+				if (!blk_mq_get_driver_tag(rq, false))
 					break;
 			} else {
 				break;
@@ -1020,7 +1015,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 			struct request *nxt;
 
 			nxt = list_first_entry(list, struct request, queuelist);
-			bd.last = !blk_mq_get_driver_tag(nxt, NULL, false);
+			bd.last = !blk_mq_get_driver_tag(nxt, false);
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
@@ -1435,7 +1430,8 @@ static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
 	if (q->elevator)
 		goto insert;
 
-	if (!blk_mq_get_driver_tag(rq, &hctx, false))
+	hctx = blk_mq_get_driver_tag(rq, false);
+	if (!hctx)
 		goto insert;
 
 	new_cookie = request_to_qc_t(hctx, rq);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8d49c06fc520..b1917fbe955c 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -33,8 +33,7 @@ void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
-bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
-				bool wait);
+struct blk_mq_hw_ctx *blk_mq_get_driver_tag(struct request *rq, bool wait);
 
 /*
  * Internal helpers for allocating/freeing the request map
-- 
2.12.0


* Re: [PATCH v2 8/8] blk-mq: clean up direct issue blk_mq_queue_data initialization
  2017-04-05 18:28 ` [PATCH v2 8/8] blk-mq: clean up direct issue blk_mq_queue_data initialization Omar Sandoval
@ 2017-04-05 18:35   ` Bart Van Assche
  2017-04-05 18:37     ` Omar Sandoval
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2017-04-05 18:35 UTC (permalink / raw)
  To: osandov, linux-block, axboe; +Cc: kernel-team

On Wed, 2017-04-05 at 11:28 -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Trivial cleanup.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  block/blk-mq.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 71dc8608f3a8..779249a5999b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1451,8 +1451,7 @@ static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
>  	struct request_queue *q = rq->q;
>  	struct blk_mq_queue_data bd = {
>  		.rq = rq,
> -		.list = NULL,
> -		.last = 1
> +		.last = true,
>  	};
>  	struct blk_mq_hw_ctx *hctx;
>  	blk_qc_t new_cookie;

Hello Omar,

Although this patch looks fine to me, I'm afraid that as of this morning
it no longer applies to Jens' for-next branch.

Bart.


* Re: [PATCH v2 8/8] blk-mq: clean up direct issue blk_mq_queue_data initialization
  2017-04-05 18:35   ` Bart Van Assche
@ 2017-04-05 18:37     ` Omar Sandoval
  0 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2017-04-05 18:37 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, axboe, kernel-team

On Wed, Apr 05, 2017 at 06:35:58PM +0000, Bart Van Assche wrote:
> On Wed, 2017-04-05 at 11:28 -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Trivial cleanup.
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  block/blk-mq.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 71dc8608f3a8..779249a5999b 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1451,8 +1451,7 @@ static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
> >  	struct request_queue *q = rq->q;
> >  	struct blk_mq_queue_data bd = {
> >  		.rq = rq,
> > -		.list = NULL,
> > -		.last = 1
> > +		.last = true,
> >  	};
> >  	struct blk_mq_hw_ctx *hctx;
> >  	blk_qc_t new_cookie;
> 
> Hello Omar,
> 
> Although this patch looks fine to me I'm afraid that since this morning
> it no longer applies to Jens' for-next branch.
> 
> Bart.

Oh, I forgot to rebase. Thanks, Bart, I'll send out an updated version.


* Re: [PATCH v2 1/8] blk-mq: use the right hctx when getting a driver tag fails
  2017-04-05 18:33   ` Bart Van Assche
@ 2017-04-05 18:42     ` Omar Sandoval
  0 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2017-04-05 18:42 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, axboe, kernel-team

On Wed, Apr 05, 2017 at 06:33:14PM +0000, Bart Van Assche wrote:
> On Wed, 2017-04-05 at 11:28 -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > While dispatching requests, if we fail to get a driver tag, we mark the
> > hardware queue as waiting for a tag and put the requests on a
> > hctx->dispatch list to be run later when a driver tag is freed. However,
> > blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> > queues if using a single-queue scheduler with a multiqueue device. If
> > blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
> > are processing. This means we end up using the hardware queue of the
> > previous request, which may or may not be the same as that of the
> > current request. If it isn't, the wrong hardware queue will end up
> > waiting for a tag, and the requests will be on the wrong dispatch list,
> > leading to a hang.
> > 
> > The fix is twofold:
> > 
> > 1. Make sure we save which hardware queue we were trying to get a
> >    request for in blk_mq_get_driver_tag() regardless of whether it
> >    succeeds or not.
> > 2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
> >    blk_mq_hw_queue to make it clear that it must handle multiple
> >    hardware queues, since I've already messed this up on a couple of
> >    occasions.
> > 
> > This didn't appear in testing with nvme and mq-deadline because nvme has
> > more driver tags than the default number of scheduler tags. However,
> > with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
> 
> Would the patch below be a valid alternative?
> 
> Thanks,
> 
> Bart.

Hi, Bart,

This actually has the same bug as the original code, see below.

> [PATCH] blk-mq: Simplify blk_mq_get_driver_tag()
> 
> The blk_mq_get_driver_tag() callers either assume that *hctx is not
> modified or that it points to a valid hctx pointer upon return if
> tag allocation succeeded. Avoid this confusion by returning the hctx
> pointer if and only if tag allocation succeeded and by only storing
> the return value into hctx in those blk_mq_get_driver_tag() callers
> for which the hctx pointer had not yet been computed before the
> blk_mq_get_driver_tag() call.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> ---
>  block/blk-mq-sched.c |  4 +++-
>  block/blk-mq.c       | 24 ++++++++++--------------
>  block/blk-mq.h       |  3 +--
>  3 files changed, 14 insertions(+), 17 deletions(-)
> 

[snip]

>  static void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
> @@ -985,7 +980,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
>  		struct blk_mq_queue_data bd;
> 
>  		rq = list_first_entry(list, struct request, queuelist);
> -		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
> +		if (!blk_mq_get_driver_tag(rq, false)) {

Here, we want to know what hardware queue we attempted the tag
allocation on, so this won't work.

Thanks for taking a look!


