* [PATCHSET/RFC v2] blk-mq scheduling framework
@ 2016-12-08 20:13 Jens Axboe
  2016-12-08 20:13 ` [PATCH 1/7] blk-mq: add blk_mq_start_stopped_hw_queue() Jens Axboe
                   ` (7 more replies)
  0 siblings, 8 replies; 38+ messages in thread
From: Jens Axboe @ 2016-12-08 20:13 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

As a followup to this posting from yesterday:

https://marc.info/?l=linux-block&m=148115232806065&w=2

this is version 2. I wanted to post a new one fairly quickly, as there
ended up being a number of potential crashes in v1. This one should be
solid: I've run mq-deadline on both NVMe and regular rotating storage,
and we handle the various merging cases correctly.

You can download it from git as well:

git://git.kernel.dk/linux-block blk-mq-sched.2

Note that this is based on for-4.10/block, which is in turn based on
v4.9-rc1. I suggest pulling it into my for-next branch, which would
then merge nicely with 'master' as well.

Changes since v1:

- Add Kconfig entries to let the user choose the default scheduler for
  blk-mq, and whether that choice depends on the number of hardware
  queues.

- Properly abstract the whole get/put of a request, so we can manage
  the request lifetime correctly.

- Enable full merging on mq-deadline (front/back, bio-to-rq, rq-to-rq).
  It now has full feature parity with the legacy deadline scheduler.

- Export necessary symbols for compiling mq-deadline as a module.

- Various API adjustments for the mq schedulers; a rough sketch of the
  resulting ops wiring follows this change list.

- Various cleanups and improvements.

- Fix a lot of bugs. A lot. Upgrade!
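
For scheduler writers, the API adjustments above boil down to filling in
the new elevator_mq_ops callbacks and setting uses_mq in the
elevator_type (both introduced in patch 5). Purely as an illustrative
skeleton -- the sketch_* callbacks stand in for a scheduler's own
implementations and are not real symbols from this series; see
mq-deadline in patch 6 for the real thing:

	static struct elevator_type sketch_mq_sched = {
		.mq_ops = {
			.init_sched		= sketch_init_sched,
			.exit_sched		= sketch_exit_sched,
			.get_request		= sketch_get_request,
			.put_request		= sketch_put_request,
			.insert_request		= sketch_insert_request,
			.dispatch_request	= sketch_dispatch_request,
			.bio_merge		= sketch_bio_merge,
			.has_work		= sketch_has_work,
		},
		.uses_mq	= true,
		.elevator_name	= "sketch",
		.elevator_owner	= THIS_MODULE,
	};

	/* registration is unchanged from the legacy elevator core */
	static int __init sketch_sched_init(void)
	{
		return elv_register(&sketch_mq_sched);
	}
	module_init(sketch_sched_init);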

 block/Kconfig.iosched    |   37 ++
 block/Makefile           |    3 
 block/blk-core.c         |    9 
 block/blk-exec.c         |    3 
 block/blk-flush.c        |    7 
 block/blk-merge.c        |    3 
 block/blk-mq-sched.c     |  265 +++++++++++++++++++
 block/blk-mq-sched.h     |  188 +++++++++++++
 block/blk-mq-tag.c       |    1 
 block/blk-mq.c           |  254 ++++++++++--------
 block/blk-mq.h           |   35 +-
 block/elevator.c         |  194 ++++++++++----
 block/mq-deadline.c      |  647 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/pci.c  |    1 
 include/linux/blk-mq.h   |    4 
 include/linux/elevator.h |   34 ++
 16 files changed, 1495 insertions(+), 190 deletions(-)


* [PATCH 1/7] blk-mq: add blk_mq_start_stopped_hw_queue()
  2016-12-08 20:13 [PATCHSET/RFC v2] blk-mq scheduling framework Jens Axboe
@ 2016-12-08 20:13 ` Jens Axboe
  2016-12-13  8:48   ` Bart Van Assche
  2016-12-08 20:13 ` [PATCH 2/7] blk-mq: abstract out blk_mq_dispatch_rq_list() helper Jens Axboe
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2016-12-08 20:13 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

We have a variant that restarts all stopped hardware queues, but not one
that operates on a single hardware queue. Add one, and reimplement the
all-queues variant on top of it.
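
As a usage sketch (hypothetical driver-side code, not part of this
patch), a driver that stopped a single hctx when it ran out of resources
can now restart just that queue once resources are available again:

	/*
	 * No-op if the queue isn't stopped; otherwise clear the
	 * stopped state and kick an async run of the queue.
	 */
	blk_mq_start_stopped_hw_queue(hctx, true);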

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-mq.c         | 18 +++++++++++-------
 include/linux/blk-mq.h |  1 +
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 90db5b490df9..b216746be9d3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1064,18 +1064,22 @@ void blk_mq_start_hw_queues(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_mq_start_hw_queues);
 
+void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
+{
+	if (!blk_mq_hctx_stopped(hctx))
+		return;
+
+	clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
+	blk_mq_run_hw_queue(hctx, async);
+}
+
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async)
 {
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (!blk_mq_hctx_stopped(hctx))
-			continue;
-
-		clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
-		blk_mq_run_hw_queue(hctx, async);
-	}
+	queue_for_each_hw_ctx(q, hctx, i)
+		blk_mq_start_stopped_hw_queue(hctx, async);
 }
 EXPORT_SYMBOL(blk_mq_start_stopped_hw_queues);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 35a0af5ede6d..87e404aae267 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -231,6 +231,7 @@ void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx);
 void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx);
 void blk_mq_stop_hw_queues(struct request_queue *q);
 void blk_mq_start_hw_queues(struct request_queue *q);
+void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
-- 
2.7.4


* [PATCH 2/7] blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  2016-12-08 20:13 [PATCHSET/RFC v2] blk-mq scheduling framework Jens Axboe
  2016-12-08 20:13 ` [PATCH 1/7] blk-mq: add blk_mq_start_stopped_hw_queue() Jens Axboe
@ 2016-12-08 20:13 ` Jens Axboe
  2016-12-09  6:44   ` Hannes Reinecke
                     ` (2 more replies)
  2016-12-08 20:13 ` [PATCH 3/7] elevator: make the rqhash helpers exported Jens Axboe
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 38+ messages in thread
From: Jens Axboe @ 2016-12-08 20:13 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

Add a helper that takes a list of requests and dispatches it to the
driver. Any residual requests (e.g. rejected with BLK_MQ_RQ_QUEUE_BUSY)
are moved to the hctx dispatch list, to be retried on the next queue run.
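
Splitting this out lets an IO scheduler feed its own selection of
requests into the same dispatch path. A simplified sketch of that
intended use (the mq_ops.dispatch_request hook is only introduced later
in this series, in the blk-mq-sched patch):

	struct elevator_queue *e = hctx->queue->elevator;
	struct request *rq;
	LIST_HEAD(rq_list);

	/* pull whatever the scheduler wants dispatched next */
	while ((rq = e->type->mq_ops.dispatch_request(hctx)) != NULL)
		list_add_tail(&rq->queuelist, &rq_list);

	/* hand the list to the driver; leftovers end up on hctx->dispatch */
	blk_mq_dispatch_rq_list(hctx, &rq_list);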

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-mq.c | 85 ++++++++++++++++++++++++++++++++--------------------------
 block/blk-mq.h |  1 +
 2 files changed, 48 insertions(+), 38 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b216746be9d3..abbf7cca4d0d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -821,41 +821,13 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
-/*
- * Run this hardware queue, pulling any software queues mapped to it in.
- * Note that this function currently has various problems around ordering
- * of IO. In particular, we'd like FIFO behaviour on handling existing
- * items on the hctx->dispatch list. Ignore that for now.
- */
-static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
+bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 {
 	struct request_queue *q = hctx->queue;
 	struct request *rq;
-	LIST_HEAD(rq_list);
 	LIST_HEAD(driver_list);
 	struct list_head *dptr;
-	int queued;
-
-	if (unlikely(blk_mq_hctx_stopped(hctx)))
-		return;
-
-	hctx->run++;
-
-	/*
-	 * Touch any software queue that has pending entries.
-	 */
-	flush_busy_ctxs(hctx, &rq_list);
-
-	/*
-	 * If we have previous entries on our dispatch list, grab them
-	 * and stuff them at the front for more fair dispatch.
-	 */
-	if (!list_empty_careful(&hctx->dispatch)) {
-		spin_lock(&hctx->lock);
-		if (!list_empty(&hctx->dispatch))
-			list_splice_init(&hctx->dispatch, &rq_list);
-		spin_unlock(&hctx->lock);
-	}
+	int queued, ret = BLK_MQ_RQ_QUEUE_OK;
 
 	/*
 	 * Start off with dptr being NULL, so we start the first request
@@ -867,16 +839,15 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
 	 * Now process all the entries, sending them to the driver.
 	 */
 	queued = 0;
-	while (!list_empty(&rq_list)) {
+	while (!list_empty(list)) {
 		struct blk_mq_queue_data bd;
-		int ret;
 
-		rq = list_first_entry(&rq_list, struct request, queuelist);
+		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 
 		bd.rq = rq;
 		bd.list = dptr;
-		bd.last = list_empty(&rq_list);
+		bd.last = list_empty(list);
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
 		switch (ret) {
@@ -884,7 +855,7 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
 			queued++;
 			break;
 		case BLK_MQ_RQ_QUEUE_BUSY:
-			list_add(&rq->queuelist, &rq_list);
+			list_add(&rq->queuelist, list);
 			__blk_mq_requeue_request(rq);
 			break;
 		default:
@@ -902,7 +873,7 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
 		 * We've done the first request. If we have more than 1
 		 * left in the list, set dptr to defer issue.
 		 */
-		if (!dptr && rq_list.next != rq_list.prev)
+		if (!dptr && list->next != list->prev)
 			dptr = &driver_list;
 	}
 
@@ -912,10 +883,11 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
 	 * Any items that need requeuing? Stuff them into hctx->dispatch,
 	 * that is where we will continue on next queue run.
 	 */
-	if (!list_empty(&rq_list)) {
+	if (!list_empty(list)) {
 		spin_lock(&hctx->lock);
-		list_splice(&rq_list, &hctx->dispatch);
+		list_splice(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
+
 		/*
 		 * the queue is expected stopped with BLK_MQ_RQ_QUEUE_BUSY, but
 		 * it's possible the queue is stopped and restarted again
@@ -927,6 +899,43 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
 		 **/
 		blk_mq_run_hw_queue(hctx, true);
 	}
+
+	return ret != BLK_MQ_RQ_QUEUE_BUSY;
+}
+
+/*
+ * Run this hardware queue, pulling any software queues mapped to it in.
+ * Note that this function currently has various problems around ordering
+ * of IO. In particular, we'd like FIFO behaviour on handling existing
+ * items on the hctx->dispatch list. Ignore that for now.
+ */
+static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
+{
+	LIST_HEAD(rq_list);
+	LIST_HEAD(driver_list);
+
+	if (unlikely(blk_mq_hctx_stopped(hctx)))
+		return;
+
+	hctx->run++;
+
+	/*
+	 * Touch any software queue that has pending entries.
+	 */
+	flush_busy_ctxs(hctx, &rq_list);
+
+	/*
+	 * If we have previous entries on our dispatch list, grab them
+	 * and stuff them at the front for more fair dispatch.
+	 */
+	if (!list_empty_careful(&hctx->dispatch)) {
+		spin_lock(&hctx->lock);
+		if (!list_empty(&hctx->dispatch))
+			list_splice_init(&hctx->dispatch, &rq_list);
+		spin_unlock(&hctx->lock);
+	}
+
+	blk_mq_dispatch_rq_list(hctx, &rq_list);
 }
 
 static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index b444370ae05b..3a54dd32a6fc 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -31,6 +31,7 @@ void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
+bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
 
 /*
  * CPU hotplug helpers
-- 
2.7.4


* [PATCH 3/7] elevator: make the rqhash helpers exported
  2016-12-08 20:13 [PATCHSET/RFC v2] blk-mq scheduling framework Jens Axboe
  2016-12-08 20:13 ` [PATCH 1/7] blk-mq: add blk_mq_start_stopped_hw_queue() Jens Axboe
  2016-12-08 20:13 ` [PATCH 2/7] blk-mq: abstract out blk_mq_dispatch_rq_list() helper Jens Axboe
@ 2016-12-08 20:13 ` Jens Axboe
  2016-12-09  6:45   ` Hannes Reinecke
  2016-12-08 20:13 ` [PATCH 4/7] blk-flush: run the queue when inserting blk-mq flush Jens Axboe
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2016-12-08 20:13 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/elevator.c         | 8 ++++----
 include/linux/elevator.h | 5 +++++
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index a18a5db274e4..40f0c04e5ad3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -248,13 +248,13 @@ static inline void __elv_rqhash_del(struct request *rq)
 	rq->rq_flags &= ~RQF_HASHED;
 }
 
-static void elv_rqhash_del(struct request_queue *q, struct request *rq)
+void elv_rqhash_del(struct request_queue *q, struct request *rq)
 {
 	if (ELV_ON_HASH(rq))
 		__elv_rqhash_del(rq);
 }
 
-static void elv_rqhash_add(struct request_queue *q, struct request *rq)
+void elv_rqhash_add(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
@@ -263,13 +263,13 @@ static void elv_rqhash_add(struct request_queue *q, struct request *rq)
 	rq->rq_flags |= RQF_HASHED;
 }
 
-static void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
+void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
 {
 	__elv_rqhash_del(rq);
 	elv_rqhash_add(q, rq);
 }
 
-static struct request *elv_rqhash_find(struct request_queue *q, sector_t offset)
+struct request *elv_rqhash_find(struct request_queue *q, sector_t offset)
 {
 	struct elevator_queue *e = q->elevator;
 	struct hlist_node *next;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index f219c9aed360..b276e9ef0e0b 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -108,6 +108,11 @@ struct elevator_type
 
 #define ELV_HASH_BITS 6
 
+void elv_rqhash_del(struct request_queue *q, struct request *rq);
+void elv_rqhash_add(struct request_queue *q, struct request *rq);
+void elv_rqhash_reposition(struct request_queue *q, struct request *rq);
+struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
+
 /*
  * each queue has an elevator_queue associated with it
  */
-- 
2.7.4


* [PATCH 4/7] blk-flush: run the queue when inserting blk-mq flush
  2016-12-08 20:13 [PATCHSET/RFC v2] blk-mq scheduling framework Jens Axboe
                   ` (2 preceding siblings ...)
  2016-12-08 20:13 ` [PATCH 3/7] elevator: make the rqhash helpers exported Jens Axboe
@ 2016-12-08 20:13 ` Jens Axboe
  2016-12-09  6:45   ` Hannes Reinecke
  2016-12-08 20:13 ` [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers Jens Axboe
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2016-12-08 20:13 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

Currently we pass parameters asking for an async queue run, but don't
flag the queue to be run at all. We don't need the run to be async here,
but we do need the queue to be run, so fix up the parameters.
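
For reference, the parameter order here (per block/blk-mq.c; the call is
converted to blk_mq_sched_insert_request() later in this series) is:

	void blk_mq_insert_request(struct request *rq, bool at_head,
				   bool run_queue, bool async);

so the change below flips run_queue from false to true and async from
true to false.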

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-flush.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 1bdbb3d3e5f5..27a42dab5a36 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -426,7 +426,7 @@ void blk_insert_flush(struct request *rq)
 	if ((policy & REQ_FSEQ_DATA) &&
 	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
 		if (q->mq_ops) {
-			blk_mq_insert_request(rq, false, false, true);
+			blk_mq_insert_request(rq, false, true, false);
 		} else
 			list_add_tail(&rq->queuelist, &q->queue_head);
 		return;
-- 
2.7.4


* [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-08 20:13 [PATCHSET/RFC v2] blk-mq scheduling framework Jens Axboe
                   ` (3 preceding siblings ...)
  2016-12-08 20:13 ` [PATCH 4/7] blk-flush: run the queue when inserting blk-mq flush Jens Axboe
@ 2016-12-08 20:13 ` Jens Axboe
  2016-12-13 13:56   ` Bart Van Assche
  2016-12-13 14:29   ` Bart Van Assche
  2016-12-08 20:13 ` [PATCH 6/7] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Jens Axboe @ 2016-12-08 20:13 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/Makefile           |   2 +-
 block/blk-core.c         |   9 +-
 block/blk-exec.c         |   3 +-
 block/blk-flush.c        |   7 +-
 block/blk-merge.c        |   3 +
 block/blk-mq-sched.c     | 246 +++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-mq-sched.h     | 187 +++++++++++++++++++++++++++++++++++
 block/blk-mq-tag.c       |   1 +
 block/blk-mq.c           | 150 +++++++++++++++--------------
 block/blk-mq.h           |  34 +++----
 block/elevator.c         | 181 ++++++++++++++++++++++++++--------
 include/linux/blk-mq.h   |   2 +-
 include/linux/elevator.h |  29 +++++-
 13 files changed, 713 insertions(+), 141 deletions(-)
 create mode 100644 block/blk-mq-sched.c
 create mode 100644 block/blk-mq-sched.h

diff --git a/block/Makefile b/block/Makefile
index a827f988c4e6..2eee9e1bb6db 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
 			blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
-			blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
+			blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
 			badblocks.o partitions/
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 4b7ec5958055..3f83414d6986 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@
 
 #include "blk.h"
 #include "blk-mq.h"
+#include "blk-mq-sched.h"
 #include "blk-wbt.h"
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
@@ -1428,7 +1429,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 		return;
 
 	if (q->mq_ops) {
-		blk_mq_free_request(req);
+		blk_mq_sched_put_request(req);
 		return;
 	}
 
@@ -1464,7 +1465,7 @@ void blk_put_request(struct request *req)
 	struct request_queue *q = req->q;
 
 	if (q->mq_ops)
-		blk_mq_free_request(req);
+		blk_mq_sched_put_request(req);
 	else {
 		unsigned long flags;
 
@@ -1528,6 +1529,7 @@ bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
 	blk_account_io_start(req, false);
 	return true;
 }
+EXPORT_SYMBOL_GPL(bio_attempt_back_merge);
 
 bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
 			     struct bio *bio)
@@ -1552,6 +1554,7 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
 	blk_account_io_start(req, false);
 	return true;
 }
+EXPORT_SYMBOL_GPL(bio_attempt_front_merge);
 
 /**
  * blk_attempt_plug_merge - try to merge with %current's plugged list
@@ -2173,7 +2176,7 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 	if (q->mq_ops) {
 		if (blk_queue_io_stat(q))
 			blk_account_io_start(rq, true);
-		blk_mq_insert_request(rq, false, true, false);
+		blk_mq_sched_insert_request(rq, false, true, false);
 		return 0;
 	}
 
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 3ecb00a6cf45..86656fdfa637 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -9,6 +9,7 @@
 #include <linux/sched/sysctl.h>
 
 #include "blk.h"
+#include "blk-mq-sched.h"
 
 /*
  * for max sense size
@@ -65,7 +66,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 	 * be reused after dying flag is set
 	 */
 	if (q->mq_ops) {
-		blk_mq_insert_request(rq, at_head, true, false);
+		blk_mq_sched_insert_request(rq, at_head, true, false);
 		return;
 	}
 
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 27a42dab5a36..63b91697d167 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -74,6 +74,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
 
 /* FLUSH/FUA sequences */
 enum {
@@ -425,9 +426,9 @@ void blk_insert_flush(struct request *rq)
 	 */
 	if ((policy & REQ_FSEQ_DATA) &&
 	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
-		if (q->mq_ops) {
-			blk_mq_insert_request(rq, false, true, false);
-		} else
+		if (q->mq_ops)
+			blk_mq_sched_insert_request(rq, false, true, false);
+		else
 			list_add_tail(&rq->queuelist, &q->queue_head);
 		return;
 	}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1002afdfee99..01247812e13f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -766,6 +766,7 @@ int attempt_back_merge(struct request_queue *q, struct request *rq)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(attempt_back_merge);
 
 int attempt_front_merge(struct request_queue *q, struct request *rq)
 {
@@ -776,6 +777,7 @@ int attempt_front_merge(struct request_queue *q, struct request *rq)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(attempt_front_merge);
 
 int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
 			  struct request *next)
@@ -825,3 +827,4 @@ int blk_try_merge(struct request *rq, struct bio *bio)
 		return ELEVATOR_FRONT_MERGE;
 	return ELEVATOR_NO_MERGE;
 }
+EXPORT_SYMBOL_GPL(blk_try_merge);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
new file mode 100644
index 000000000000..9213366e67d1
--- /dev/null
+++ b/block/blk-mq-sched.c
@@ -0,0 +1,246 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#include <linux/blk-mq.h>
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+#include "blk-wbt.h"
+
+/*
+ * Empty set
+ */
+static struct blk_mq_ops mq_sched_tag_ops = {
+	.queue_rq	= NULL,
+};
+
+void blk_mq_sched_free_requests(struct blk_mq_tags *tags)
+{
+	blk_mq_free_rq_map(NULL, tags, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_requests);
+
+struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth,
+						unsigned int numa_node)
+{
+	struct blk_mq_tag_set set = {
+		.ops		= &mq_sched_tag_ops,
+		.nr_hw_queues	= 1,
+		.queue_depth	= depth,
+		.numa_node	= numa_node,
+	};
+
+	return blk_mq_init_rq_map(&set, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_requests);
+
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+				 void (*exit)(struct blk_mq_hw_ctx *))
+{
+	struct blk_mq_hw_ctx *hctx;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (exit)
+			exit(hctx);
+		kfree(hctx->sched_data);
+		hctx->sched_data = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
+
+int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
+				void (*init)(struct blk_mq_hw_ctx *))
+{
+	struct blk_mq_hw_ctx *hctx;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
+		if (!hctx->sched_data)
+			goto error;
+
+		if (init)
+			init(hctx);
+	}
+
+	return 0;
+error:
+	blk_mq_sched_free_hctx_data(q, NULL);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data);
+
+struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
+						  struct blk_mq_alloc_data *data,
+						  struct blk_mq_tags *tags,
+						  atomic_t *wait_index)
+{
+	struct sbq_wait_state *ws;
+	DEFINE_WAIT(wait);
+	struct request *rq;
+	int tag;
+
+	tag = __sbitmap_queue_get(&tags->bitmap_tags);
+	if (tag != -1)
+		goto done;
+
+	if (data->flags & BLK_MQ_REQ_NOWAIT)
+		return NULL;
+
+	ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
+	do {
+		prepare_to_wait(&ws->wait, &wait, TASK_UNINTERRUPTIBLE);
+
+		tag = __sbitmap_queue_get(&tags->bitmap_tags);
+		if (tag != -1)
+			break;
+
+		blk_mq_run_hw_queue(data->hctx, false);
+
+		tag = __sbitmap_queue_get(&tags->bitmap_tags);
+		if (tag != -1)
+			break;
+
+		blk_mq_put_ctx(data->ctx);
+		io_schedule();
+
+		data->ctx = blk_mq_get_ctx(data->q);
+		data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
+		finish_wait(&ws->wait, &wait);
+		ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
+	} while (1);
+
+	finish_wait(&ws->wait, &wait);
+done:
+	rq = tags->rqs[tag];
+	rq->tag = tag;
+	rq->rq_flags |= RQF_ALLOCED;
+	return rq;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_shadow_request);
+
+void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
+				      struct request *rq)
+{
+	WARN_ON_ONCE(!(rq->rq_flags & RQF_ALLOCED));
+	sbitmap_queue_clear(&tags->bitmap_tags, rq->tag, rq->mq_ctx->cpu);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_shadow_request);
+
+static void rq_copy(struct request *rq, struct request *src)
+{
+#define FIELD_COPY(dst, src, name)	((dst)->name = (src)->name)
+	FIELD_COPY(rq, src, cpu);
+	FIELD_COPY(rq, src, cmd_type);
+	FIELD_COPY(rq, src, cmd_flags);
+	rq->rq_flags |= (src->rq_flags & (RQF_PREEMPT | RQF_QUIET | RQF_PM | RQF_DONTPREP));
+	rq->rq_flags &= ~RQF_IO_STAT;
+	FIELD_COPY(rq, src, __data_len);
+	FIELD_COPY(rq, src, __sector);
+	FIELD_COPY(rq, src, bio);
+	FIELD_COPY(rq, src, biotail);
+	FIELD_COPY(rq, src, rq_disk);
+	FIELD_COPY(rq, src, part);
+	FIELD_COPY(rq, src, nr_phys_segments);
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+	FIELD_COPY(rq, src, nr_integrity_segments);
+#endif
+	FIELD_COPY(rq, src, ioprio);
+	FIELD_COPY(rq, src, timeout);
+
+	if (src->cmd_type == REQ_TYPE_BLOCK_PC) {
+		FIELD_COPY(rq, src, cmd);
+		FIELD_COPY(rq, src, cmd_len);
+		FIELD_COPY(rq, src, extra_len);
+		FIELD_COPY(rq, src, sense_len);
+		FIELD_COPY(rq, src, resid_len);
+		FIELD_COPY(rq, src, sense);
+		FIELD_COPY(rq, src, retries);
+	}
+
+	src->bio = src->biotail = NULL;
+}
+
+static void sched_rq_end_io(struct request *rq, int error)
+{
+	struct request *sched_rq = rq->end_io_data;
+
+	FIELD_COPY(sched_rq, rq, resid_len);
+	FIELD_COPY(sched_rq, rq, extra_len);
+	FIELD_COPY(sched_rq, rq, sense_len);
+	FIELD_COPY(sched_rq, rq, errors);
+	FIELD_COPY(sched_rq, rq, retries);
+
+	blk_account_io_completion(sched_rq, blk_rq_bytes(sched_rq));
+	blk_account_io_done(sched_rq);
+
+	wbt_done(sched_rq->q->rq_wb, &sched_rq->issue_stat);
+
+	if (sched_rq->end_io)
+		sched_rq->end_io(sched_rq, error);
+
+	blk_mq_free_request(rq);
+}
+
+struct request *
+blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+				 struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
+{
+	struct blk_mq_alloc_data data;
+	struct request *sched_rq, *rq;
+
+	data.q = hctx->queue;
+	data.flags = BLK_MQ_REQ_NOWAIT;
+	data.ctx = blk_mq_get_ctx(hctx->queue);
+	data.hctx = hctx;
+
+	rq = __blk_mq_alloc_request(&data, 0);
+	blk_mq_put_ctx(data.ctx);
+
+	if (!rq) {
+		blk_mq_stop_hw_queue(hctx);
+		return NULL;
+	}
+
+	sched_rq = get_sched_rq(hctx);
+
+	if (!sched_rq) {
+		blk_queue_enter_live(hctx->queue);
+		__blk_mq_free_request(hctx, data.ctx, rq);
+		return NULL;
+	}
+
+	WARN_ON_ONCE(!(sched_rq->rq_flags & RQF_ALLOCED));
+	rq_copy(rq, sched_rq);
+	rq->end_io = sched_rq_end_io;
+	rq->end_io_data = sched_rq;
+
+	return rq;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_request_from_shadow);
+
+void __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
+{
+	struct elevator_queue *e = hctx->queue->elevator;
+	struct request *rq;
+	LIST_HEAD(rq_list);
+
+	if (unlikely(blk_mq_hctx_stopped(hctx)))
+		return;
+
+	hctx->run++;
+
+	if (!list_empty(&hctx->dispatch)) {
+		spin_lock(&hctx->lock);
+		if (!list_empty(&hctx->dispatch))
+			list_splice_init(&hctx->dispatch, &rq_list);
+		spin_unlock(&hctx->lock);
+	}
+
+	while ((rq = e->type->mq_ops.dispatch_request(hctx)) != NULL)
+		list_add_tail(&rq->queuelist, &rq_list);
+
+	blk_mq_dispatch_rq_list(hctx, &rq_list);
+}
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
new file mode 100644
index 000000000000..609c80506cfc
--- /dev/null
+++ b/block/blk-mq-sched.h
@@ -0,0 +1,187 @@
+#ifndef BLK_MQ_SCHED_H
+#define BLK_MQ_SCHED_H
+
+#include "blk-mq.h"
+
+struct blk_mq_hw_ctx;
+struct blk_mq_ctx;
+struct request_queue;
+
+struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth, unsigned int numa_node);
+void blk_mq_sched_free_requests(struct blk_mq_tags *tags);
+
+int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
+				void (*init)(struct blk_mq_hw_ctx *));
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+				 void (*exit)(struct blk_mq_hw_ctx *));
+
+void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
+				      struct request *rq);
+struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
+						  struct blk_mq_alloc_data *data,
+						  struct blk_mq_tags *tags,
+						  atomic_t *wait_index);
+struct request *
+blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+				 struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *));
+
+
+struct blk_mq_alloc_data {
+	/* input parameter */
+	struct request_queue *q;
+	unsigned int flags;
+
+	/* input & output parameter */
+	struct blk_mq_ctx *ctx;
+	struct blk_mq_hw_ctx *hctx;
+};
+
+static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
+		struct request_queue *q, unsigned int flags,
+		struct blk_mq_ctx *ctx, struct blk_mq_hw_ctx *hctx)
+{
+	data->q = q;
+	data->flags = flags;
+	data->ctx = ctx;
+	data->hctx = hctx;
+}
+
+static inline bool
+blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (blk_queue_nomerges(q) || !bio_mergeable(bio))
+		return false;
+
+	if (e) {
+		struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
+		struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+
+		blk_mq_put_ctx(ctx);
+		return e->type->mq_ops.bio_merge(hctx, bio);
+	}
+
+	return false;
+}
+
+static inline void blk_mq_sched_put_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->mq_ops.put_request)
+		e->type->mq_ops.put_request(rq);
+	else
+		blk_mq_free_request(rq);
+}
+
+static inline struct request *
+blk_mq_sched_get_request(struct request_queue *q, unsigned int op,
+			 struct blk_mq_alloc_data *data)
+{
+	struct elevator_queue *e = q->elevator;
+	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_ctx *ctx;
+	struct request *rq;
+
+	blk_queue_enter_live(q);
+	ctx = blk_mq_get_ctx(q);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
+
+	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
+
+	if (e && e->type->mq_ops.get_request)
+		rq = e->type->mq_ops.get_request(q, op, data);
+	else
+		rq = __blk_mq_alloc_request(data, op);
+
+	if (rq)
+		data->hctx->queued++;
+
+	return rq;
+
+}
+
+static inline void
+blk_mq_sched_insert_request(struct request *rq, bool at_head, bool run_queue,
+			    bool async)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+
+	if (e)
+		e->type->mq_ops.insert_request(hctx, rq, at_head);
+	else {
+		spin_lock(&ctx->lock);
+		__blk_mq_insert_request(hctx, rq, at_head);
+		spin_unlock(&ctx->lock);
+	}
+
+	if (run_queue)
+		blk_mq_run_hw_queue(hctx, async);
+}
+
+static inline bool
+blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
+			 struct bio *bio)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->mq_ops.allow_merge)
+		return e->type->mq_ops.allow_merge(q, rq, bio);
+
+	return true;
+}
+
+static inline void
+blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (e && e->type->mq_ops.completed_request)
+		e->type->mq_ops.completed_request(hctx, rq);
+}
+
+static inline void blk_mq_sched_started_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->mq_ops.started_request)
+		e->type->mq_ops.started_request(rq);
+}
+
+static inline void blk_mq_sched_requeue_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->mq_ops.requeue_request)
+		e->type->mq_ops.requeue_request(rq);
+}
+
+static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
+{
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (e && e->type->mq_ops.has_work)
+		return e->type->mq_ops.has_work(hctx);
+
+	return false;
+}
+
+void __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
+
+static inline void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
+{
+	if (hctx->queue->elevator)
+		__blk_mq_sched_dispatch_requests(hctx);
+	else
+		blk_mq_process_sw_list(hctx);
+}
+
+
+#endif
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index dcf5ce3ba4bf..bbd494e23d57 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -12,6 +12,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
 
 bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
 {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index abbf7cca4d0d..019de6f0fd06 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -32,6 +32,7 @@
 #include "blk-mq-tag.h"
 #include "blk-stat.h"
 #include "blk-wbt.h"
+#include "blk-mq-sched.h"
 
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
@@ -41,7 +42,8 @@ static LIST_HEAD(all_q_list);
  */
 static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
 {
-	return sbitmap_any_bit_set(&hctx->ctx_map);
+	return sbitmap_any_bit_set(&hctx->ctx_map) ||
+		blk_mq_sched_has_work(hctx);
 }
 
 /*
@@ -167,8 +169,8 @@ bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
 }
 EXPORT_SYMBOL(blk_mq_can_queue);
 
-static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
-			       struct request *rq, unsigned int op)
+void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
+			struct request *rq, unsigned int op)
 {
 	INIT_LIST_HEAD(&rq->queuelist);
 	/* csd/requeue_work/fifo_time is initialized before use */
@@ -213,9 +215,10 @@ static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
 
 	ctx->rq_dispatched[op_is_sync(op)]++;
 }
+EXPORT_SYMBOL_GPL(blk_mq_rq_ctx_init);
 
-static struct request *
-__blk_mq_alloc_request(struct blk_mq_alloc_data *data, unsigned int op)
+struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
+				       unsigned int op)
 {
 	struct request *rq;
 	unsigned int tag;
@@ -236,25 +239,23 @@ __blk_mq_alloc_request(struct blk_mq_alloc_data *data, unsigned int op)
 
 	return NULL;
 }
+EXPORT_SYMBOL_GPL(__blk_mq_alloc_request);
 
 struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
 		unsigned int flags)
 {
-	struct blk_mq_ctx *ctx;
-	struct blk_mq_hw_ctx *hctx;
-	struct request *rq;
 	struct blk_mq_alloc_data alloc_data;
+	struct request *rq;
 	int ret;
 
 	ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
 	if (ret)
 		return ERR_PTR(ret);
 
-	ctx = blk_mq_get_ctx(q);
-	hctx = blk_mq_map_queue(q, ctx->cpu);
-	blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
-	rq = __blk_mq_alloc_request(&alloc_data, rw);
-	blk_mq_put_ctx(ctx);
+	rq = blk_mq_sched_get_request(q, rw, &alloc_data);
+
+	blk_mq_put_ctx(alloc_data.ctx);
+	blk_queue_exit(q);
 
 	if (!rq) {
 		blk_queue_exit(q);
@@ -319,12 +320,14 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, int rw,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
-static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
-				  struct blk_mq_ctx *ctx, struct request *rq)
+void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+			   struct request *rq)
 {
 	const int tag = rq->tag;
 	struct request_queue *q = rq->q;
 
+	blk_mq_sched_completed_request(hctx, rq);
+
 	if (rq->rq_flags & RQF_MQ_INFLIGHT)
 		atomic_dec(&hctx->nr_active);
 
@@ -467,6 +470,8 @@ void blk_mq_start_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
+	blk_mq_sched_started_request(rq);
+
 	trace_block_rq_issue(q, rq);
 
 	rq->resid_len = blk_rq_bytes(rq);
@@ -515,6 +520,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
+	blk_mq_sched_requeue_request(rq);
 
 	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
 		if (q->dma_drain_size && blk_rq_bytes(rq))
@@ -549,13 +555,13 @@ static void blk_mq_requeue_work(struct work_struct *work)
 
 		rq->rq_flags &= ~RQF_SOFTBARRIER;
 		list_del_init(&rq->queuelist);
-		blk_mq_insert_request(rq, true, false, false);
+		blk_mq_sched_insert_request(rq, true, false, false);
 	}
 
 	while (!list_empty(&rq_list)) {
 		rq = list_entry(rq_list.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
-		blk_mq_insert_request(rq, false, false, false);
+		blk_mq_sched_insert_request(rq, false, false, false);
 	}
 
 	blk_mq_run_hw_queues(q, false);
@@ -761,8 +767,16 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
 
 		if (!blk_rq_merge_ok(rq, bio))
 			continue;
+		if (!blk_mq_sched_allow_merge(q, rq, bio))
+			break;
 
 		el_ret = blk_try_merge(rq, bio);
+		if (el_ret == ELEVATOR_NO_MERGE)
+			continue;
+
+		if (!blk_mq_sched_allow_merge(q, rq, bio))
+			break;
+
 		if (el_ret == ELEVATOR_BACK_MERGE) {
 			if (bio_attempt_back_merge(q, rq, bio)) {
 				ctx->rq_merged++;
@@ -909,7 +923,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
  * of IO. In particular, we'd like FIFO behaviour on handling existing
  * items on the hctx->dispatch list. Ignore that for now.
  */
-static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
+void blk_mq_process_sw_list(struct blk_mq_hw_ctx *hctx)
 {
 	LIST_HEAD(rq_list);
 	LIST_HEAD(driver_list);
@@ -947,11 +961,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 
 	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
 		rcu_read_lock();
-		blk_mq_process_rq_list(hctx);
+		blk_mq_sched_dispatch_requests(hctx);
 		rcu_read_unlock();
 	} else {
 		srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
-		blk_mq_process_rq_list(hctx);
+		blk_mq_sched_dispatch_requests(hctx);
 		srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
 	}
 }
@@ -1081,6 +1095,7 @@ void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
 	blk_mq_run_hw_queue(hctx, async);
 }
+EXPORT_SYMBOL_GPL(blk_mq_start_stopped_hw_queue);
 
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async)
 {
@@ -1135,8 +1150,8 @@ static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
 		list_add_tail(&rq->queuelist, &ctx->rq_list);
 }
 
-static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
-				    struct request *rq, bool at_head)
+void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+			     bool at_head)
 {
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 
@@ -1144,21 +1159,6 @@ static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
 	blk_mq_hctx_mark_pending(hctx, ctx);
 }
 
-void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
-			   bool async)
-{
-	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-
-	spin_lock(&ctx->lock);
-	__blk_mq_insert_request(hctx, rq, at_head);
-	spin_unlock(&ctx->lock);
-
-	if (run_queue)
-		blk_mq_run_hw_queue(hctx, async);
-}
-
 static void blk_mq_insert_requests(struct request_queue *q,
 				     struct blk_mq_ctx *ctx,
 				     struct list_head *list,
@@ -1174,17 +1174,14 @@ static void blk_mq_insert_requests(struct request_queue *q,
 	 * preemption doesn't flush plug list, so it's possible ctx->cpu is
 	 * offline now
 	 */
-	spin_lock(&ctx->lock);
 	while (!list_empty(list)) {
 		struct request *rq;
 
 		rq = list_first_entry(list, struct request, queuelist);
 		BUG_ON(rq->mq_ctx != ctx);
 		list_del_init(&rq->queuelist);
-		__blk_mq_insert_req_list(hctx, rq, false);
+		blk_mq_sched_insert_request(rq, false, false, false);
 	}
-	blk_mq_hctx_mark_pending(hctx, ctx);
-	spin_unlock(&ctx->lock);
 
 	blk_mq_run_hw_queue(hctx, from_schedule);
 }
@@ -1285,41 +1282,27 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
 	}
 }
 
-static struct request *blk_mq_map_request(struct request_queue *q,
-					  struct bio *bio,
-					  struct blk_mq_alloc_data *data)
-{
-	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *ctx;
-	struct request *rq;
-
-	blk_queue_enter_live(q);
-	ctx = blk_mq_get_ctx(q);
-	hctx = blk_mq_map_queue(q, ctx->cpu);
-
-	trace_block_getrq(q, bio, bio->bi_opf);
-	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
-	rq = __blk_mq_alloc_request(data, bio->bi_opf);
-
-	data->hctx->queued++;
-	return rq;
-}
-
 static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 {
-	int ret;
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
 	struct blk_mq_queue_data bd = {
 		.rq = rq,
 		.list = NULL,
 		.last = 1
 	};
-	blk_qc_t new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
+	struct blk_mq_hw_ctx *hctx;
+	blk_qc_t new_cookie;
+	int ret;
+
+	if (q->elevator)
+		goto insert;
 
+	hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
 	if (blk_mq_hctx_stopped(hctx))
 		goto insert;
 
+	new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
+
 	/*
 	 * For OK queue, we are done. For error, kill it. Any other
 	 * error (busy), just add it to our list as we previously
@@ -1341,7 +1324,7 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 	}
 
 insert:
-	blk_mq_insert_request(rq, false, true, true);
+	blk_mq_sched_insert_request(rq, false, true, true);
 }
 
 /*
@@ -1374,9 +1357,14 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
 		return BLK_QC_T_NONE;
 
+	if (blk_mq_sched_bio_merge(q, bio))
+		return BLK_QC_T_NONE;
+
 	wb_acct = wbt_wait(q->rq_wb, bio, NULL);
 
-	rq = blk_mq_map_request(q, bio, &data);
+	trace_block_getrq(q, bio, bio->bi_opf);
+
+	rq = blk_mq_sched_get_request(q, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
 		return BLK_QC_T_NONE;
@@ -1438,6 +1426,12 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		goto done;
 	}
 
+	if (q->elevator) {
+		blk_mq_put_ctx(data.ctx);
+		blk_mq_bio_to_request(rq, bio);
+		blk_mq_sched_insert_request(rq, false, true, true);
+		goto done;
+	}
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
 		/*
 		 * For a SYNC request, send it to the hardware immediately. For
@@ -1483,9 +1477,14 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	} else
 		request_count = blk_plug_queued_count(q);
 
+	if (blk_mq_sched_bio_merge(q, bio))
+		return BLK_QC_T_NONE;
+
 	wb_acct = wbt_wait(q->rq_wb, bio, NULL);
 
-	rq = blk_mq_map_request(q, bio, &data);
+	trace_block_getrq(q, bio, bio->bi_opf);
+
+	rq = blk_mq_sched_get_request(q, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
 		return BLK_QC_T_NONE;
@@ -1535,6 +1534,12 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		return cookie;
 	}
 
+	if (q->elevator) {
+		blk_mq_put_ctx(data.ctx);
+		blk_mq_bio_to_request(rq, bio);
+		blk_mq_sched_insert_request(rq, false, true, true);
+		goto done;
+	}
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
 		/*
 		 * For a SYNC request, send it to the hardware immediately. For
@@ -1547,15 +1552,16 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	}
 
 	blk_mq_put_ctx(data.ctx);
+done:
 	return cookie;
 }
 
-static void blk_mq_free_rq_map(struct blk_mq_tag_set *set,
-		struct blk_mq_tags *tags, unsigned int hctx_idx)
+void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+			unsigned int hctx_idx)
 {
 	struct page *page;
 
-	if (tags->rqs && set->ops->exit_request) {
+	if (tags->rqs && set && set->ops->exit_request) {
 		int i;
 
 		for (i = 0; i < tags->nr_tags; i++) {
@@ -1588,8 +1594,8 @@ static size_t order_to_size(unsigned int order)
 	return (size_t)PAGE_SIZE << order;
 }
 
-static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
-		unsigned int hctx_idx)
+struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
+				       unsigned int hctx_idx)
 {
 	struct blk_mq_tags *tags;
 	unsigned int i, j, entries_per_page, max_order = 4;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 3a54dd32a6fc..ddce89bb0461 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -84,26 +84,6 @@ static inline void blk_mq_put_ctx(struct blk_mq_ctx *ctx)
 	put_cpu();
 }
 
-struct blk_mq_alloc_data {
-	/* input parameter */
-	struct request_queue *q;
-	unsigned int flags;
-
-	/* input & output parameter */
-	struct blk_mq_ctx *ctx;
-	struct blk_mq_hw_ctx *hctx;
-};
-
-static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
-		struct request_queue *q, unsigned int flags,
-		struct blk_mq_ctx *ctx, struct blk_mq_hw_ctx *hctx)
-{
-	data->q = q;
-	data->flags = flags;
-	data->ctx = ctx;
-	data->hctx = hctx;
-}
-
 static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
 {
 	return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
@@ -114,4 +94,18 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 	return hctx->nr_ctx && hctx->tags;
 }
 
+void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+			unsigned int hctx_idx);
+struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
+				       unsigned int hctx_idx);
+void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
+			struct request *rq, unsigned int op);
+void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+			   struct request *rq);
+struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
+				       unsigned int op);
+void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+			     bool at_head);
+void blk_mq_process_sw_list(struct blk_mq_hw_ctx *hctx);
+
 #endif
diff --git a/block/elevator.c b/block/elevator.c
index 40f0c04e5ad3..f1191b3b0ff3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -40,6 +40,7 @@
 #include <trace/events/block.h>
 
 #include "blk.h"
+#include "blk-mq-sched.h"
 
 static DEFINE_SPINLOCK(elv_list_lock);
 static LIST_HEAD(elv_list);
@@ -58,7 +59,9 @@ static int elv_iosched_allow_bio_merge(struct request *rq, struct bio *bio)
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_allow_bio_merge_fn)
+	if (e->uses_mq && e->type->mq_ops.allow_merge)
+		return e->type->mq_ops.allow_merge(q, rq, bio);
+	else if (!e->uses_mq && e->type->ops.elevator_allow_bio_merge_fn)
 		return e->type->ops.elevator_allow_bio_merge_fn(q, rq, bio);
 
 	return 1;
@@ -163,6 +166,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
 	kobject_init(&eq->kobj, &elv_ktype);
 	mutex_init(&eq->sysfs_lock);
 	hash_init(eq->hash);
+	eq->uses_mq = e->uses_mq;
 
 	return eq;
 }
@@ -224,7 +228,10 @@ int elevator_init(struct request_queue *q, char *name)
 		}
 	}
 
-	err = e->ops.elevator_init_fn(q, e);
+	if (e->uses_mq)
+		err = e->mq_ops.init_sched(q, e);
+	else
+		err = e->ops.elevator_init_fn(q, e);
 	if (err)
 		elevator_put(e);
 	return err;
@@ -234,7 +241,9 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
-	if (e->type->ops.elevator_exit_fn)
+	if (e->uses_mq && e->type->mq_ops.exit_sched)
+		e->type->mq_ops.exit_sched(e);
+	else if (!e->uses_mq && e->type->ops.elevator_exit_fn)
 		e->type->ops.elevator_exit_fn(e);
 	mutex_unlock(&e->sysfs_lock);
 
@@ -253,6 +262,7 @@ void elv_rqhash_del(struct request_queue *q, struct request *rq)
 	if (ELV_ON_HASH(rq))
 		__elv_rqhash_del(rq);
 }
+EXPORT_SYMBOL_GPL(elv_rqhash_del);
 
 void elv_rqhash_add(struct request_queue *q, struct request *rq)
 {
@@ -262,12 +272,14 @@ void elv_rqhash_add(struct request_queue *q, struct request *rq)
 	hash_add(e->hash, &rq->hash, rq_hash_key(rq));
 	rq->rq_flags |= RQF_HASHED;
 }
+EXPORT_SYMBOL_GPL(elv_rqhash_add);
 
 void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
 {
 	__elv_rqhash_del(rq);
 	elv_rqhash_add(q, rq);
 }
+EXPORT_SYMBOL_GPL(elv_rqhash_reposition);
 
 struct request *elv_rqhash_find(struct request_queue *q, sector_t offset)
 {
@@ -289,6 +301,7 @@ struct request *elv_rqhash_find(struct request_queue *q, sector_t offset)
 
 	return NULL;
 }
+EXPORT_SYMBOL_GPL(elv_rqhash_find);
 
 /*
  * RB-tree support functions for inserting/lookup/removal of requests
@@ -411,6 +424,9 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	struct request *__rq;
 	int ret;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return ELEVATOR_NO_MERGE;
+
 	/*
 	 * Levels of merges:
 	 * 	nomerges:  No merges at all attempted
@@ -462,6 +478,9 @@ static bool elv_attempt_insert_merge(struct request_queue *q,
 	struct request *__rq;
 	bool ret;
 
+	if (WARN_ON_ONCE(q->elevator && q->elevator->uses_mq))
+		return false;
+
 	if (blk_queue_nomerges(q))
 		return false;
 
@@ -495,7 +514,7 @@ void elv_merged_request(struct request_queue *q, struct request *rq, int type)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_merged_fn)
+	if (!e->uses_mq && e->type->ops.elevator_merged_fn)
 		e->type->ops.elevator_merged_fn(q, rq, type);
 
 	if (type == ELEVATOR_BACK_MERGE)
@@ -508,10 +527,15 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 			     struct request *next)
 {
 	struct elevator_queue *e = q->elevator;
-	const int next_sorted = next->rq_flags & RQF_SORTED;
-
-	if (next_sorted && e->type->ops.elevator_merge_req_fn)
-		e->type->ops.elevator_merge_req_fn(q, rq, next);
+	bool next_sorted = false;
+
+	if (e->uses_mq && e->type->mq_ops.requests_merged)
+		e->type->mq_ops.requests_merged(q, rq, next);
+	else if (e->type->ops.elevator_merge_req_fn) {
+		next_sorted = next->rq_flags & RQF_SORTED;
+		if (next_sorted)
+			e->type->ops.elevator_merge_req_fn(q, rq, next);
+	}
 
 	elv_rqhash_reposition(q, rq);
 
@@ -528,6 +552,9 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return;
+
 	if (e->type->ops.elevator_bio_merged_fn)
 		e->type->ops.elevator_bio_merged_fn(q, rq, bio);
 }
@@ -682,8 +709,11 @@ struct request *elv_latter_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_latter_req_fn)
+	if (e->uses_mq && e->type->mq_ops.next_request)
+		return e->type->mq_ops.next_request(q, rq);
+	else if (!e->uses_mq && e->type->ops.elevator_latter_req_fn)
 		return e->type->ops.elevator_latter_req_fn(q, rq);
+
 	return NULL;
 }
 
@@ -691,7 +721,9 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_former_req_fn)
+	if (e->uses_mq && e->type->mq_ops.former_request)
+		return e->type->mq_ops.former_request(q, rq);
+	if (!e->uses_mq && e->type->ops.elevator_former_req_fn)
 		return e->type->ops.elevator_former_req_fn(q, rq);
 	return NULL;
 }
@@ -701,6 +733,9 @@ int elv_set_request(struct request_queue *q, struct request *rq,
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return 0;
+
 	if (e->type->ops.elevator_set_req_fn)
 		return e->type->ops.elevator_set_req_fn(q, rq, bio, gfp_mask);
 	return 0;
@@ -710,6 +745,9 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return;
+
 	if (e->type->ops.elevator_put_req_fn)
 		e->type->ops.elevator_put_req_fn(rq);
 }
@@ -718,6 +756,9 @@ int elv_may_queue(struct request_queue *q, unsigned int op)
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return 0;
+
 	if (e->type->ops.elevator_may_queue_fn)
 		return e->type->ops.elevator_may_queue_fn(q, op);
 
@@ -728,6 +769,9 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return;
+
 	/*
 	 * request is released from the driver, io must be done
 	 */
@@ -803,7 +847,7 @@ int elv_register_queue(struct request_queue *q)
 		}
 		kobject_uevent(&e->kobj, KOBJ_ADD);
 		e->registered = 1;
-		if (e->type->ops.elevator_registered_fn)
+		if (!e->uses_mq && e->type->ops.elevator_registered_fn)
 			e->type->ops.elevator_registered_fn(q);
 	}
 	return error;
@@ -891,9 +935,14 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old = q->elevator;
-	bool registered = old->registered;
+	bool old_registered = false;
 	int err;
 
+	if (q->mq_ops) {
+		blk_mq_freeze_queue(q);
+		blk_mq_quiesce_queue(q);
+	}
+
 	/*
 	 * Turn on BYPASS and drain all requests w/ elevator private data.
 	 * Block layer doesn't call into a quiesced elevator - all requests
@@ -901,32 +950,54 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	 * using INSERT_BACK.  All requests have SOFTBARRIER set and no
 	 * merge happens either.
 	 */
-	blk_queue_bypass_start(q);
+	if (old) {
+		old_registered = old->registered;
+
+		if (!q->mq_ops)
+			blk_queue_bypass_start(q);
 
-	/* unregister and clear all auxiliary data of the old elevator */
-	if (registered)
-		elv_unregister_queue(q);
+		/* unregister and clear all auxiliary data of the old elevator */
+		if (old_registered)
+			elv_unregister_queue(q);
 
-	spin_lock_irq(q->queue_lock);
-	ioc_clear_queue(q);
-	spin_unlock_irq(q->queue_lock);
+		if (q->queue_lock) {
+			spin_lock_irq(q->queue_lock);
+			ioc_clear_queue(q);
+			spin_unlock_irq(q->queue_lock);
+		}
+	}
 
 	/* allocate, init and register new elevator */
-	err = new_e->ops.elevator_init_fn(q, new_e);
-	if (err)
-		goto fail_init;
+	if (new_e) {
+		if (new_e->uses_mq)
+			err = new_e->mq_ops.init_sched(q, new_e);
+		else
+			err = new_e->ops.elevator_init_fn(q, new_e);
+		if (err)
+			goto fail_init;
 
-	if (registered) {
 		err = elv_register_queue(q);
 		if (err)
 			goto fail_register;
-	}
+	} else
+		q->elevator = NULL;
 
 	/* done, kill the old one and finish */
-	elevator_exit(old);
-	blk_queue_bypass_end(q);
+	if (old) {
+		elevator_exit(old);
+		if (!q->mq_ops)
+			blk_queue_bypass_end(q);
+	}
 
-	blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	if (q->mq_ops) {
+		blk_mq_unfreeze_queue(q);
+		blk_mq_start_stopped_hw_queues(q, true);
+	}
+
+	if (new_e)
+		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	else
+		blk_add_trace_msg(q, "elv switch: none");
 
 	return 0;
 
@@ -934,9 +1005,16 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	elevator_exit(q->elevator);
 fail_init:
 	/* switch failed, restore and re-register old elevator */
-	q->elevator = old;
-	elv_register_queue(q);
-	blk_queue_bypass_end(q);
+	if (old) {
+		q->elevator = old;
+		elv_register_queue(q);
+		if (!q->mq_ops)
+			blk_queue_bypass_end(q);
+	}
+	if (q->mq_ops) {
+		blk_mq_unfreeze_queue(q);
+		blk_mq_start_stopped_hw_queues(q, true);
+	}
 
 	return err;
 }
@@ -949,8 +1027,11 @@ static int __elevator_change(struct request_queue *q, const char *name)
 	char elevator_name[ELV_NAME_MAX];
 	struct elevator_type *e;
 
-	if (!q->elevator)
-		return -ENXIO;
+	/*
+	 * Special case for mq, turn off scheduling
+	 */
+	if (q->mq_ops && !strncmp(name, "none", 4))
+		return elevator_switch(q, NULL);
 
 	strlcpy(elevator_name, name, sizeof(elevator_name));
 	e = elevator_get(strstrip(elevator_name), true);
@@ -959,11 +1040,23 @@ static int __elevator_change(struct request_queue *q, const char *name)
 		return -EINVAL;
 	}
 
-	if (!strcmp(elevator_name, q->elevator->type->elevator_name)) {
+	if (q->elevator &&
+	    !strcmp(elevator_name, q->elevator->type->elevator_name)) {
 		elevator_put(e);
 		return 0;
 	}
 
+	if (!e->uses_mq && q->mq_ops) {
+		printk(KERN_ERR "blk-mq-sched: elv %s does not support mq\n", elevator_name);
+		elevator_put(e);
+		return -EINVAL;
+	}
+	if (e->uses_mq && !q->mq_ops) {
+		printk(KERN_ERR "blk-mq-sched: elv %s is for mq\n", elevator_name);
+		elevator_put(e);
+		return -EINVAL;
+	}
+
 	return elevator_switch(q, e);
 }
 
@@ -985,7 +1078,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
 {
 	int ret;
 
-	if (!q->elevator)
+	if (!q->mq_ops || q->request_fn)
 		return count;
 
 	ret = __elevator_change(q, name);
@@ -999,24 +1092,34 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
 ssize_t elv_iosched_show(struct request_queue *q, char *name)
 {
 	struct elevator_queue *e = q->elevator;
-	struct elevator_type *elv;
+	struct elevator_type *elv = NULL;
 	struct elevator_type *__e;
 	int len = 0;
 
-	if (!q->elevator || !blk_queue_stackable(q))
+	if (!blk_queue_stackable(q))
 		return sprintf(name, "none\n");
 
-	elv = e->type;
+	if (!q->elevator)
+		len += sprintf(name+len, "[none] ");
+	else
+		elv = e->type;
 
 	spin_lock(&elv_list_lock);
 	list_for_each_entry(__e, &elv_list, list) {
-		if (!strcmp(elv->elevator_name, __e->elevator_name))
+		if (elv && !strcmp(elv->elevator_name, __e->elevator_name)) {
 			len += sprintf(name+len, "[%s] ", elv->elevator_name);
-		else
+			continue;
+		}
+		if (__e->uses_mq && q->mq_ops)
+			len += sprintf(name+len, "%s ", __e->elevator_name);
+		else if (!__e->uses_mq && !q->mq_ops)
 			len += sprintf(name+len, "%s ", __e->elevator_name);
 	}
 	spin_unlock(&elv_list_lock);
 
+	if (q->mq_ops && q->elevator)
+		len += sprintf(name+len, "none");
+
 	len += sprintf(len+name, "\n");
 	return len;
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 87e404aae267..c86b314dde97 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -22,6 +22,7 @@ struct blk_mq_hw_ctx {
 
 	unsigned long		flags;		/* BLK_MQ_F_* flags */
 
+	void			*sched_data;
 	struct request_queue	*queue;
 	struct blk_flush_queue	*fq;
 
@@ -179,7 +180,6 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
 
 void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule);
 
-void blk_mq_insert_request(struct request *, bool, bool, bool);
 void blk_mq_free_request(struct request *rq);
 void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *, struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index b276e9ef0e0b..5d013f2b9071 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -77,6 +77,28 @@ struct elevator_ops
 	elevator_registered_fn *elevator_registered_fn;
 };
 
+struct blk_mq_alloc_data;
+struct blk_mq_hw_ctx;
+
+struct elevator_mq_ops {
+	int (*init_sched)(struct request_queue *, struct elevator_type *);
+	void (*exit_sched)(struct elevator_queue *);
+
+	bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
+	bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
+	void (*requests_merged)(struct request_queue *, struct request *, struct request *);
+	struct request *(*get_request)(struct request_queue *, unsigned int, struct blk_mq_alloc_data *);
+	void (*put_request)(struct request *);
+	void (*insert_request)(struct blk_mq_hw_ctx *, struct request *, bool);
+	struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
+	bool (*has_work)(struct blk_mq_hw_ctx *);
+	void (*completed_request)(struct blk_mq_hw_ctx *, struct request *);
+	void (*started_request)(struct request *);
+	void (*requeue_request)(struct request *);
+	struct request *(*former_request)(struct request_queue *, struct request *);
+	struct request *(*next_request)(struct request_queue *, struct request *);
+};
+
 #define ELV_NAME_MAX	(16)
 
 struct elv_fs_entry {
@@ -94,12 +116,16 @@ struct elevator_type
 	struct kmem_cache *icq_cache;
 
 	/* fields provided by elevator implementation */
-	struct elevator_ops ops;
+	union {
+		struct elevator_ops ops;
+		struct elevator_mq_ops mq_ops;
+	};
 	size_t icq_size;	/* see iocontext.h */
 	size_t icq_align;	/* ditto */
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+	bool uses_mq;
 
 	/* managed by elevator core */
 	char icq_cache_name[ELV_NAME_MAX + 5];	/* elvname + "_io_cq" */
@@ -123,6 +149,7 @@ struct elevator_queue
 	struct kobject kobj;
 	struct mutex sysfs_lock;
 	unsigned int registered:1;
+	unsigned int uses_mq:1;
 	DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
 };
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 6/7] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-08 20:13 [PATCHSET/RFC v2] blk-mq scheduling framework Jens Axboe
                   ` (4 preceding siblings ...)
  2016-12-08 20:13 ` [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers Jens Axboe
@ 2016-12-08 20:13 ` Jens Axboe
  2016-12-13 11:04   ` Bart Van Assche
  2016-12-14  8:09   ` Bart Van Assche
  2016-12-08 20:13 ` [PATCH 7/7] blk-mq-sched: allow setting of default " Jens Axboe
  2016-12-13  9:26 ` [PATCHSET/RFC v2] blk-mq scheduling framework Paolo Valente
  7 siblings, 2 replies; 38+ messages in thread
From: Jens Axboe @ 2016-12-08 20:13 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/Kconfig.iosched |   6 +
 block/Makefile        |   1 +
 block/mq-deadline.c   | 647 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 654 insertions(+)
 create mode 100644 block/mq-deadline.c

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 421bef9c4c48..490ef2850fae 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -32,6 +32,12 @@ config IOSCHED_CFQ
 
 	  This is the default I/O scheduler.
 
+config MQ_IOSCHED_DEADLINE
+	tristate "MQ deadline I/O scheduler"
+	default y
+	---help---
+	  MQ version of the deadline IO scheduler.
+
 config CFQ_GROUP_IOSCHED
 	bool "CFQ Group Scheduling support"
 	depends on IOSCHED_CFQ && BLK_CGROUP
diff --git a/block/Makefile b/block/Makefile
index 2eee9e1bb6db..3ee0abd7205a 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
+obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
new file mode 100644
index 000000000000..dfd30b68bfc4
--- /dev/null
+++ b/block/mq-deadline.c
@@ -0,0 +1,647 @@
+/*
+ *  MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
+ *  for the blk-mq scheduling framework
+ *
+ *  Copyright (C) 2016 Jens Axboe <axboe@kernel.dk>
+ */
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/elevator.h>
+#include <linux/bio.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/compiler.h>
+#include <linux/rbtree.h>
+#include <linux/sbitmap.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
+
+static inline bool dd_rq_is_shadow(struct request *rq)
+{
+	return rq->rq_flags & RQF_ALLOCED;
+}
+
+/*
+ * See Documentation/block/deadline-iosched.txt
+ */
+static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
+static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
+static const int writes_starved = 2;    /* max times reads can starve a write */
+static const int fifo_batch = 16;       /* # of sequential requests treated as one
+				     by the above parameters. For throughput. */
+
+struct deadline_data {
+	/*
+	 * run time data
+	 */
+
+	/*
+	 * requests (deadline_rq s) are present on both sort_list and fifo_list
+	 */
+	struct rb_root sort_list[2];
+	struct list_head fifo_list[2];
+
+	/*
+	 * next in sort order. read, write or both are NULL
+	 */
+	struct request *next_rq[2];
+	unsigned int batching;		/* number of sequential requests made */
+	unsigned int starved;		/* times reads have starved writes */
+
+	/*
+	 * settings that change how the i/o scheduler behaves
+	 */
+	int fifo_expire[2];
+	int fifo_batch;
+	int writes_starved;
+	int front_merges;
+
+	spinlock_t lock;
+	struct list_head dispatch;
+	struct blk_mq_tags *tags;
+	atomic_t wait_index;
+};
+
+static inline struct rb_root *
+deadline_rb_root(struct deadline_data *dd, struct request *rq)
+{
+	return &dd->sort_list[rq_data_dir(rq)];
+}
+
+/*
+ * get the request after `rq' in sector-sorted order
+ */
+static inline struct request *
+deadline_latter_request(struct request *rq)
+{
+	struct rb_node *node = rb_next(&rq->rb_node);
+
+	if (node)
+		return rb_entry_rq(node);
+
+	return NULL;
+}
+
+static void
+deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
+{
+	struct rb_root *root = deadline_rb_root(dd, rq);
+
+	elv_rb_add(root, rq);
+}
+
+static inline void
+deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
+{
+	const int data_dir = rq_data_dir(rq);
+
+	if (dd->next_rq[data_dir] == rq)
+		dd->next_rq[data_dir] = deadline_latter_request(rq);
+
+	elv_rb_del(deadline_rb_root(dd, rq), rq);
+}
+
+/*
+ * remove rq from rbtree and fifo.
+ */
+static void deadline_remove_request(struct request_queue *q, struct request *rq)
+{
+	struct deadline_data *dd = q->elevator->elevator_data;
+
+	list_del_init(&rq->queuelist);
+	deadline_del_rq_rb(dd, rq);
+
+	elv_rqhash_del(q, rq);
+	if (q->last_merge == rq)
+		q->last_merge = NULL;
+}
+
+static void dd_merged_requests(struct request_queue *q, struct request *req,
+			       struct request *next)
+{
+	/*
+	 * if next expires before rq, assign its expire time to rq
+	 * and move into next position (next will be deleted) in fifo
+	 */
+	if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
+		if (time_before((unsigned long)next->fifo_time,
+				(unsigned long)req->fifo_time)) {
+			list_move(&req->queuelist, &next->queuelist);
+			req->fifo_time = next->fifo_time;
+		}
+	}
+
+	/*
+	 * kill knowledge of next, this one is a goner
+	 */
+	deadline_remove_request(q, next);
+}
+
+/*
+ * move an entry to dispatch queue
+ */
+static void
+deadline_move_request(struct deadline_data *dd, struct request *rq)
+{
+	const int data_dir = rq_data_dir(rq);
+
+	dd->next_rq[READ] = NULL;
+	dd->next_rq[WRITE] = NULL;
+	dd->next_rq[data_dir] = deadline_latter_request(rq);
+
+	/*
+	 * take it off the sort and fifo list
+	 */
+	deadline_remove_request(rq->q, rq);
+}
+
+/*
+ * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
+ * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
+ */
+static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+{
+	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+
+	/*
+	 * rq is expired!
+	 */
+	if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
+		return 1;
+
+	return 0;
+}
+
+/*
+ * deadline_dispatch_requests selects the best request according to
+ * read/write expire, fifo_batch, etc
+ */
+static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+	struct request *rq;
+	bool reads, writes;
+	int data_dir;
+
+	spin_lock(&dd->lock);
+
+	if (!list_empty(&dd->dispatch)) {
+		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		goto done;
+	}
+
+	reads = !list_empty(&dd->fifo_list[READ]);
+	writes = !list_empty(&dd->fifo_list[WRITE]);
+
+	/*
+	 * batches are currently reads XOR writes
+	 */
+	if (dd->next_rq[WRITE])
+		rq = dd->next_rq[WRITE];
+	else
+		rq = dd->next_rq[READ];
+
+	if (rq && dd->batching < dd->fifo_batch)
+		/* we have a next request and are still entitled to batch */
+		goto dispatch_request;
+
+	/*
+	 * at this point we are not running a batch. select the appropriate
+	 * data direction (read / write)
+	 */
+
+	if (reads) {
+		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+
+		if (writes && (dd->starved++ >= dd->writes_starved))
+			goto dispatch_writes;
+
+		data_dir = READ;
+
+		goto dispatch_find_request;
+	}
+
+	/*
+	 * there are either no reads or writes have been starved
+	 */
+
+	if (writes) {
+dispatch_writes:
+		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+
+		dd->starved = 0;
+
+		data_dir = WRITE;
+
+		goto dispatch_find_request;
+	}
+
+	spin_unlock(&dd->lock);
+	return NULL;
+
+dispatch_find_request:
+	/*
+	 * we are not running a batch, find best request for selected data_dir
+	 */
+	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+		/*
+		 * A deadline has expired, the last request was in the other
+		 * direction, or we have run out of higher-sectored requests.
+		 * Start again from the request with the earliest expiry time.
+		 */
+		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+	} else {
+		/*
+		 * The last req was the same dir and we have a next request in
+		 * sort order. No expired requests so continue on from here.
+		 */
+		rq = dd->next_rq[data_dir];
+	}
+
+	dd->batching = 0;
+
+dispatch_request:
+	/*
+	 * rq is the selected appropriate request.
+	 */
+	dd->batching++;
+	deadline_move_request(dd, rq);
+done:
+	rq->rq_flags |= RQF_STARTED;
+	spin_unlock(&dd->lock);
+	return rq;
+}
+
+static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+	return blk_mq_sched_request_from_shadow(hctx, __dd_dispatch_request);
+}
+
+static void dd_exit_queue(struct elevator_queue *e)
+{
+	struct deadline_data *dd = e->elevator_data;
+
+	BUG_ON(!list_empty(&dd->fifo_list[READ]));
+	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
+
+	blk_mq_sched_free_requests(dd->tags);
+	kfree(dd);
+}
+
+/*
+ * initialize elevator private data (deadline_data).
+ */
+static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
+{
+	struct deadline_data *dd;
+	struct elevator_queue *eq;
+
+	eq = elevator_alloc(q, e);
+	if (!eq)
+		return -ENOMEM;
+
+	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
+	if (!dd) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+	eq->elevator_data = dd;
+
+	dd->tags = blk_mq_sched_alloc_requests(256, q->node);
+	if (!dd->tags) {
+		kfree(dd);
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	INIT_LIST_HEAD(&dd->fifo_list[READ]);
+	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
+	dd->sort_list[READ] = RB_ROOT;
+	dd->sort_list[WRITE] = RB_ROOT;
+	dd->fifo_expire[READ] = read_expire;
+	dd->fifo_expire[WRITE] = write_expire;
+	dd->writes_starved = writes_starved;
+	dd->front_merges = 1;
+	dd->fifo_batch = fifo_batch;
+	spin_lock_init(&dd->lock);
+	INIT_LIST_HEAD(&dd->dispatch);
+	atomic_set(&dd->wait_index, 0);
+
+	q->elevator = eq;
+	return 0;
+}
+
+static int __dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
+			  struct request **req)
+{
+	struct request_queue *q = hctx->queue;
+	struct deadline_data *dd = q->elevator->elevator_data;
+	struct request *__rq;
+	sector_t sector;
+	int ret;
+
+	/*
+	 * First try one-hit cache.
+	 */
+	if (q->last_merge && elv_bio_merge_ok(q->last_merge, bio)) {
+		ret = blk_try_merge(q->last_merge, bio);
+		if (ret != ELEVATOR_NO_MERGE) {
+			*req = q->last_merge;
+			return ret;
+		}
+	}
+
+	if (blk_queue_noxmerges(q))
+		return ELEVATOR_NO_MERGE;
+
+	/*
+	 * See if our hash lookup can find a potential backmerge.
+	 */
+	__rq = elv_rqhash_find(q, bio->bi_iter.bi_sector);
+	if (__rq && elv_bio_merge_ok(__rq, bio)) {
+		*req = __rq;
+		return ELEVATOR_BACK_MERGE;
+	}
+
+	if (!dd->front_merges)
+		return ELEVATOR_NO_MERGE;
+
+	sector = bio_end_sector(bio);
+
+	__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+	if (__rq) {
+		BUG_ON(sector != blk_rq_pos(__rq));
+
+		if (elv_bio_merge_ok(__rq, bio)) {
+			*req = __rq;
+			return ELEVATOR_FRONT_MERGE;
+		}
+	}
+
+	return ELEVATOR_NO_MERGE;
+}
+
+static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
+{
+	struct request_queue *q = hctx->queue;
+	struct deadline_data *dd = q->elevator->elevator_data;
+	struct request *rq;
+	int ret;
+
+	spin_lock(&dd->lock);
+
+	ret = __dd_bio_merge(hctx, bio, &rq);
+
+	if (ret == ELEVATOR_BACK_MERGE) {
+		if (bio_attempt_back_merge(q, rq, bio)) {
+			q->last_merge = rq;
+			elv_rqhash_reposition(q, rq);
+			if (!attempt_back_merge(q, rq))
+				elv_merged_request(q, rq, ret);
+			goto done;
+		}
+		ret = ELEVATOR_NO_MERGE;
+	} else if (ret == ELEVATOR_FRONT_MERGE) {
+		if (bio_attempt_front_merge(q, rq, bio)) {
+			q->last_merge = rq;
+			elv_rb_del(deadline_rb_root(dd, rq), rq);
+			deadline_add_rq_rb(dd, rq);
+			if (!attempt_front_merge(q, rq))
+				elv_merged_request(q, rq, ret);
+			goto done;
+		}
+		ret = ELEVATOR_NO_MERGE;
+	}
+
+done:
+	spin_unlock(&dd->lock);
+	return ret != ELEVATOR_NO_MERGE;
+}
+
+/*
+ * add rq to rbtree and fifo
+ */
+static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+			      bool at_head)
+{
+	struct request_queue *q = hctx->queue;
+	struct deadline_data *dd = q->elevator->elevator_data;
+	const int data_dir = rq_data_dir(rq);
+
+	/*
+	 * If we're trying to insert a real request, just send it directly
+	 * to the hardware dispatch list. This only happens for a requeue,
+	 * or FUA/FLUSH requests.
+	 */
+	if (!dd_rq_is_shadow(rq)) {
+		spin_lock(&hctx->lock);
+		list_add_tail(&rq->queuelist, &hctx->dispatch);
+		spin_unlock(&hctx->lock);
+		return;
+	}
+
+	spin_lock(&dd->lock);
+
+	if (at_head || rq->cmd_type != REQ_TYPE_FS) {
+		if (at_head)
+			list_add(&rq->queuelist, &dd->dispatch);
+		else
+			list_add_tail(&rq->queuelist, &dd->dispatch);
+	} else {
+		deadline_add_rq_rb(dd, rq);
+
+		if (rq_mergeable(rq)) {
+			elv_rqhash_add(q, rq);
+			if (!q->last_merge)
+				q->last_merge = rq;
+		}
+
+		/*
+		 * set expire time and add to fifo list
+		 */
+		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
+		list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+	}
+
+	spin_unlock(&dd->lock);
+}
+
+static struct request *dd_get_request(struct request_queue *q, unsigned int op,
+				      struct blk_mq_alloc_data *data)
+{
+	struct deadline_data *dd = q->elevator->elevator_data;
+	struct request *rq;
+
+	/*
+	 * The flush machinery intercepts before we insert the request. As
+	 * a work-around, just hand it back a real request.
+	 */
+	if (unlikely(op & (REQ_PREFLUSH | REQ_FUA)))
+		rq = __blk_mq_alloc_request(data, op);
+	else {
+		rq = blk_mq_sched_alloc_shadow_request(q, data, dd->tags, &dd->wait_index);
+		if (rq)
+			blk_mq_rq_ctx_init(q, data->ctx, rq, op);
+	}
+
+	return rq;
+}
+
+static void dd_put_request(struct request *rq)
+{
+	/*
+	 * If it's a real request, we just have to free it. For a shadow
+	 * request, we should only free it if we haven't started it. A
+	 * started request is mapped to a real one, and the real one will
+	 * free it. We can get here with request merges, since we then
+	 * free the request before we start/issue it.
+	 */
+	if (!dd_rq_is_shadow(rq))
+		blk_mq_free_request(rq);
+	else if (!(rq->rq_flags & RQF_STARTED)) {
+		struct deadline_data *dd = rq->q->elevator->elevator_data;
+
+		blk_mq_sched_free_shadow_request(dd->tags, rq);
+	}
+}
+
+static void dd_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+	struct request *sched_rq = rq->end_io_data;
+
+	/*
+	 * sched_rq can be NULL, if we haven't setup the shadow yet
+	 * because we failed getting one.
+	 */
+	if (sched_rq) {
+		struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+
+		blk_mq_sched_free_shadow_request(dd->tags, sched_rq);
+		blk_mq_start_stopped_hw_queue(hctx, true);
+	}
+}
+
+static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
+{
+	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+
+	return !list_empty_careful(&dd->dispatch) ||
+		!list_empty_careful(&dd->fifo_list[0]) ||
+		!list_empty_careful(&dd->fifo_list[1]);
+}
+
+/*
+ * sysfs parts below
+ */
+static ssize_t
+deadline_var_show(int var, char *page)
+{
+	return sprintf(page, "%d\n", var);
+}
+
+static ssize_t
+deadline_var_store(int *var, const char *page, size_t count)
+{
+	char *p = (char *) page;
+
+	*var = simple_strtol(p, &p, 10);
+	return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct deadline_data *dd = e->elevator_data;			\
+	int __data = __VAR;						\
+	if (__CONV)							\
+		__data = jiffies_to_msecs(__data);			\
+	return deadline_var_show(__data, (page));			\
+}
+SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
+SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
+SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
+SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
+SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
+{									\
+	struct deadline_data *dd = e->elevator_data;			\
+	int __data;							\
+	int ret = deadline_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV)							\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
+STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
+STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
+#undef STORE_FUNCTION
+
+#define DD_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, deadline_##name##_show, \
+				      deadline_##name##_store)
+
+static struct elv_fs_entry deadline_attrs[] = {
+	DD_ATTR(read_expire),
+	DD_ATTR(write_expire),
+	DD_ATTR(writes_starved),
+	DD_ATTR(front_merges),
+	DD_ATTR(fifo_batch),
+	__ATTR_NULL
+};
+
+static struct elevator_type mq_deadline = {
+	.mq_ops = {
+		.get_request		= dd_get_request,
+		.put_request		= dd_put_request,
+		.insert_request		= dd_insert_request,
+		.dispatch_request	= dd_dispatch_request,
+		.completed_request	= dd_completed_request,
+		.next_request		= elv_rb_latter_request,
+		.former_request		= elv_rb_former_request,
+		.bio_merge		= dd_bio_merge,
+		.requests_merged	= dd_merged_requests,
+		.has_work		= dd_has_work,
+		.init_sched		= dd_init_queue,
+		.exit_sched		= dd_exit_queue,
+	},
+
+	.uses_mq	= true,
+	.elevator_attrs = deadline_attrs,
+	.elevator_name = "mq-deadline",
+	.elevator_owner = THIS_MODULE,
+};
+
+static int __init deadline_init(void)
+{
+	return elv_register(&mq_deadline);
+}
+
+static void __exit deadline_exit(void)
+{
+	elv_unregister(&mq_deadline);
+}
+
+module_init(deadline_init);
+module_exit(deadline_exit);
+
+MODULE_AUTHOR("Jens Axboe");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("MQ deadline IO scheduler");
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 7/7] blk-mq-sched: allow setting of default IO scheduler
  2016-12-08 20:13 [PATCHSET/RFC v2] blk-mq scheduling framework Jens Axboe
                   ` (5 preceding siblings ...)
  2016-12-08 20:13 ` [PATCH 6/7] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
@ 2016-12-08 20:13 ` Jens Axboe
  2016-12-13 10:13   ` Bart Van Assche
  2016-12-13  9:26 ` [PATCHSET/RFC v2] blk-mq scheduling framework Paolo Valente
  7 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2016-12-08 20:13 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/Kconfig.iosched   | 43 +++++++++++++++++++++++++++++++++++++------
 block/blk-mq-sched.c    | 19 +++++++++++++++++++
 block/blk-mq-sched.h    |  1 +
 block/blk-mq.c          |  3 +++
 block/elevator.c        |  5 ++++-
 drivers/nvme/host/pci.c |  1 +
 include/linux/blk-mq.h  |  1 +
 7 files changed, 66 insertions(+), 7 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 490ef2850fae..00502a3d76b7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -32,12 +32,6 @@ config IOSCHED_CFQ
 
 	  This is the default I/O scheduler.
 
-config MQ_IOSCHED_DEADLINE
-	tristate "MQ deadline I/O scheduler"
-	default y
-	---help---
-	  MQ version of the deadline IO scheduler.
-
 config CFQ_GROUP_IOSCHED
 	bool "CFQ Group Scheduling support"
 	depends on IOSCHED_CFQ && BLK_CGROUP
@@ -69,6 +63,43 @@ config DEFAULT_IOSCHED
 	default "cfq" if DEFAULT_CFQ
 	default "noop" if DEFAULT_NOOP
 
+config MQ_IOSCHED_DEADLINE
+	tristate "MQ deadline I/O scheduler"
+	default y
+	---help---
+	  MQ version of the deadline IO scheduler.
+
+config MQ_IOSCHED_NONE
+	bool
+	default y
+
+choice
+	prompt "Default MQ I/O scheduler"
+	default MQ_IOSCHED_NONE
+	help
+	  Select the I/O scheduler which will be used by default for all
+	  blk-mq managed block devices.
+
+	config DEFAULT_MQ_DEADLINE
+		bool "MQ Deadline" if MQ_IOSCHED_DEADLINE=y
+
+	config DEFAULT_MQ_NONE
+		bool "None"
+
+endchoice
+
+config DEFAULT_MQ_IOSCHED
+	string
+	default "mq-deadline" if DEFAULT_MQ_DEADLINE
+	default "none" if DEFAULT_MQ_NONE
+
 endmenu
 
+config MQ_IOSCHED_ONLY_SQ
+	bool "Enable blk-mq IO scheduler only for single queue devices"
+	default y
+	help
+	  Say Y here, if you only want to enable IO scheduling on block
+	  devices that have a single queue registered.
+
 endif
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 9213366e67d1..bcab84d325c2 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -244,3 +244,22 @@ void __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 
 	blk_mq_dispatch_rq_list(hctx, &rq_list);
 }
+
+int blk_mq_sched_init(struct request_queue *q)
+{
+	int ret;
+
+#if defined(CONFIG_DEFAULT_MQ_NONE)
+	return 0;
+#endif
+#if defined(CONFIG_MQ_IOSCHED_ONLY_SQ)
+	if (q->nr_hw_queues > 1)
+		return 0;
+#endif
+
+	mutex_lock(&q->sysfs_lock);
+	ret = elevator_init(q, NULL);
+	mutex_unlock(&q->sysfs_lock);
+
+	return ret;
+}
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 609c80506cfc..391ecc00f520 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -25,6 +25,7 @@ struct request *
 blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
 				 struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *));
 
+int blk_mq_sched_init(struct request_queue *q);
 
 struct blk_mq_alloc_data {
 	/* input parameter */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 019de6f0fd06..9eeffd76f729 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2141,6 +2141,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	INIT_LIST_HEAD(&q->requeue_list);
 	spin_lock_init(&q->requeue_lock);
 
+	if (!(set->flags & BLK_MQ_F_NO_SCHED))
+		blk_mq_sched_init(q);
+
 	if (q->nr_hw_queues > 1)
 		blk_queue_make_request(q, blk_mq_make_request);
 	else
diff --git a/block/elevator.c b/block/elevator.c
index f1191b3b0ff3..368976d05f0a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -219,7 +219,10 @@ int elevator_init(struct request_queue *q, char *name)
 	}
 
 	if (!e) {
-		e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
+		if (q->mq_ops)
+			e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
+		else
+			e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
 		if (!e) {
 			printk(KERN_ERR
 				"Default I/O scheduler not found. " \
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 82b9b3f1f21d..7777ec58252f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1186,6 +1186,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
 		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
 		dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
+		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
 		dev->admin_tagset.driver_data = dev;
 
 		if (blk_mq_alloc_tag_set(&dev->admin_tagset))
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index c86b314dde97..7c470bf4d7bf 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -152,6 +152,7 @@ enum {
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
 	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
+	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/7] blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  2016-12-08 20:13 ` [PATCH 2/7] blk-mq: abstract out blk_mq_dispatch_rq_list() helper Jens Axboe
@ 2016-12-09  6:44   ` Hannes Reinecke
  2016-12-13  8:51   ` Bart Van Assche
  2016-12-13  9:18   ` Ritesh Harjani
  2 siblings, 0 replies; 38+ messages in thread
From: Hannes Reinecke @ 2016-12-09  6:44 UTC (permalink / raw)
  To: Jens Axboe, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/08/2016 09:13 PM, Jens Axboe wrote:
> Takes a list of requests, and dispatches it. Moves any residual
> requests to the dispatch list.
>
> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---
>  block/blk-mq.c | 85 ++++++++++++++++++++++++++++++++--------------------------
>  block/blk-mq.h |  1 +
>  2 files changed, 48 insertions(+), 38 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/7] elevator: make the rqhash helpers exported
  2016-12-08 20:13 ` [PATCH 3/7] elevator: make the rqhash helpers exported Jens Axboe
@ 2016-12-09  6:45   ` Hannes Reinecke
  0 siblings, 0 replies; 38+ messages in thread
From: Hannes Reinecke @ 2016-12-09  6:45 UTC (permalink / raw)
  To: Jens Axboe, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/08/2016 09:13 PM, Jens Axboe wrote:
> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---
>  block/elevator.c         | 8 ++++----
>  include/linux/elevator.h | 5 +++++
>  2 files changed, 9 insertions(+), 4 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] blk-flush: run the queue when inserting blk-mq flush
  2016-12-08 20:13 ` [PATCH 4/7] blk-flush: run the queue when inserting blk-mq flush Jens Axboe
@ 2016-12-09  6:45   ` Hannes Reinecke
  0 siblings, 0 replies; 38+ messages in thread
From: Hannes Reinecke @ 2016-12-09  6:45 UTC (permalink / raw)
  To: Jens Axboe, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/08/2016 09:13 PM, Jens Axboe wrote:
> Currently we pass in to run the queue async, but don't flag the
> queue to be run. We don't need to run it async here, but we should
> run it. So fixup the parameters.
>
> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---
>  block/blk-flush.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 1bdbb3d3e5f5..27a42dab5a36 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -426,7 +426,7 @@ void blk_insert_flush(struct request *rq)
>  	if ((policy & REQ_FSEQ_DATA) &&
>  	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
>  		if (q->mq_ops) {
> -			blk_mq_insert_request(rq, false, false, true);
> +			blk_mq_insert_request(rq, false, true, false);
>  		} else
>  			list_add_tail(&rq->queuelist, &q->queue_head);
>  		return;
>
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/7] blk-mq: add blk_mq_start_stopped_hw_queue()
  2016-12-08 20:13 ` [PATCH 1/7] blk-mq: add blk_mq_start_stopped_hw_queue() Jens Axboe
@ 2016-12-13  8:48   ` Bart Van Assche
  0 siblings, 0 replies; 38+ messages in thread
From: Bart Van Assche @ 2016-12-13  8:48 UTC (permalink / raw)
  To: Jens Axboe, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/08/2016 09:13 PM, Jens Axboe wrote:
> We have a variant for all hardware queues, but not one for a single
> hardware queue.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/7] blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  2016-12-08 20:13 ` [PATCH 2/7] blk-mq: abstract out blk_mq_dispatch_rq_list() helper Jens Axboe
  2016-12-09  6:44   ` Hannes Reinecke
@ 2016-12-13  8:51   ` Bart Van Assche
  2016-12-13 15:05     ` Jens Axboe
  2016-12-13  9:18   ` Ritesh Harjani
  2 siblings, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2016-12-13  8:51 UTC (permalink / raw)
  To: Jens Axboe, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/08/2016 09:13 PM, Jens Axboe wrote:
> +static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
> +{
> +	LIST_HEAD(rq_list);
> +	LIST_HEAD(driver_list);

Hello Jens,

driver_list is not used in this function so please consider removing 
that variable from blk_mq_process_rq_list(). Otherwise this patch looks 
fine to me.

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/7] blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  2016-12-08 20:13 ` [PATCH 2/7] blk-mq: abstract out blk_mq_dispatch_rq_list() helper Jens Axboe
  2016-12-09  6:44   ` Hannes Reinecke
  2016-12-13  8:51   ` Bart Van Assche
@ 2016-12-13  9:18   ` Ritesh Harjani
  2016-12-13  9:29     ` Bart Van Assche
  2 siblings, 1 reply; 38+ messages in thread
From: Ritesh Harjani @ 2016-12-13  9:18 UTC (permalink / raw)
  To: Jens Axboe, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

Minor comments.

On 12/9/2016 1:43 AM, Jens Axboe wrote:
> Takes a list of requests, and dispatches it. Moves any residual
> requests to the dispatch list.
>
> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---
>  block/blk-mq.c | 85 ++++++++++++++++++++++++++++++++--------------------------
>  block/blk-mq.h |  1 +
>  2 files changed, 48 insertions(+), 38 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b216746be9d3..abbf7cca4d0d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -821,41 +821,13 @@ static inline unsigned int queued_to_index(unsigned int queued)
>  	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>  }
>
> -/*
> - * Run this hardware queue, pulling any software queues mapped to it in.
> - * Note that this function currently has various problems around ordering
> - * of IO. In particular, we'd like FIFO behaviour on handling existing
> - * items on the hctx->dispatch list. Ignore that for now.
> - */
> -static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
> +bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
>  {
>  	struct request_queue *q = hctx->queue;
>  	struct request *rq;
> -	LIST_HEAD(rq_list);
>  	LIST_HEAD(driver_list);
>  	struct list_head *dptr;
> -	int queued;
> -
> -	if (unlikely(blk_mq_hctx_stopped(hctx)))
> -		return;
> -
> -	hctx->run++;
> -
> -	/*
> -	 * Touch any software queue that has pending entries.
> -	 */
> -	flush_busy_ctxs(hctx, &rq_list);
> -
> -	/*
> -	 * If we have previous entries on our dispatch list, grab them
> -	 * and stuff them at the front for more fair dispatch.
> -	 */
> -	if (!list_empty_careful(&hctx->dispatch)) {
> -		spin_lock(&hctx->lock);
> -		if (!list_empty(&hctx->dispatch))
> -			list_splice_init(&hctx->dispatch, &rq_list);
> -		spin_unlock(&hctx->lock);
> -	}
> +	int queued, ret = BLK_MQ_RQ_QUEUE_OK;
>
>  	/*
>  	 * Start off with dptr being NULL, so we start the first request
> @@ -867,16 +839,15 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
>  	 * Now process all the entries, sending them to the driver.
>  	 */
>  	queued = 0;
> -	while (!list_empty(&rq_list)) {
> +	while (!list_empty(list)) {
>  		struct blk_mq_queue_data bd;
> -		int ret;
>
> -		rq = list_first_entry(&rq_list, struct request, queuelist);
> +		rq = list_first_entry(list, struct request, queuelist);
>  		list_del_init(&rq->queuelist);
>
>  		bd.rq = rq;
>  		bd.list = dptr;
> -		bd.last = list_empty(&rq_list);
> +		bd.last = list_empty(list);
>
>  		ret = q->mq_ops->queue_rq(hctx, &bd);
>  		switch (ret) {
> @@ -884,7 +855,7 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
>  			queued++;
>  			break;
>  		case BLK_MQ_RQ_QUEUE_BUSY:
> -			list_add(&rq->queuelist, &rq_list);
> +			list_add(&rq->queuelist, list);
>  			__blk_mq_requeue_request(rq);
>  			break;
>  		default:
> @@ -902,7 +873,7 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
>  		 * We've done the first request. If we have more than 1
>  		 * left in the list, set dptr to defer issue.
>  		 */
> -		if (!dptr && rq_list.next != rq_list.prev)
> +		if (!dptr && list->next != list->prev)
>  			dptr = &driver_list;
>  	}
>
> @@ -912,10 +883,11 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
>  	 * Any items that need requeuing? Stuff them into hctx->dispatch,
>  	 * that is where we will continue on next queue run.
>  	 */
> -	if (!list_empty(&rq_list)) {
> +	if (!list_empty(list)) {
>  		spin_lock(&hctx->lock);
> -		list_splice(&rq_list, &hctx->dispatch);
> +		list_splice(list, &hctx->dispatch);
>  		spin_unlock(&hctx->lock);
> +
>  		/*
>  		 * the queue is expected stopped with BLK_MQ_RQ_QUEUE_BUSY, but
>  		 * it's possible the queue is stopped and restarted again
> @@ -927,6 +899,43 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
>  		 **/
>  		blk_mq_run_hw_queue(hctx, true);
>  	}
> +
> +	return ret != BLK_MQ_RQ_QUEUE_BUSY;
> +}
> +
> +/*
> + * Run this hardware queue, pulling any software queues mapped to it in.
> + * Note that this function currently has various problems around ordering
> + * of IO. In particular, we'd like FIFO behaviour on handling existing
> + * items on the hctx->dispatch list. Ignore that for now.
> + */
> +static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
> +{
> +	LIST_HEAD(rq_list);
> +	LIST_HEAD(driver_list);
driver_list is not required. since not used in this function anymore.
> +
> +	if (unlikely(blk_mq_hctx_stopped(hctx)))
> +		return;
> +
> +	hctx->run++;
> +
> +	/*
> +	 * Touch any software queue that has pending entries.
> +	 */
> +	flush_busy_ctxs(hctx, &rq_list);
> +
> +	/*
> +	 * If we have previous entries on our dispatch list, grab them
> +	 * and stuff them at the front for more fair dispatch.
> +	 */
> +	if (!list_empty_careful(&hctx->dispatch)) {
> +		spin_lock(&hctx->lock);
> +		if (!list_empty(&hctx->dispatch))
list_splice_init already checks for list_empty. So this may be 
redundant. Please check.
> +			list_splice_init(&hctx->dispatch, &rq_list);
> +		spin_unlock(&hctx->lock);
> +	}
> +
> +	blk_mq_dispatch_rq_list(hctx, &rq_list);
>  }
>
>  static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index b444370ae05b..3a54dd32a6fc 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -31,6 +31,7 @@ void blk_mq_freeze_queue(struct request_queue *q);
>  void blk_mq_free_queue(struct request_queue *q);
>  int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
>  void blk_mq_wake_waiters(struct request_queue *q);
> +bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
>
>  /*
>   * CPU hotplug helpers
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSET/RFC v2] blk-mq scheduling framework
  2016-12-08 20:13 [PATCHSET/RFC v2] blk-mq scheduling framework Jens Axboe
                   ` (6 preceding siblings ...)
  2016-12-08 20:13 ` [PATCH 7/7] blk-mq-sched: allow setting of default " Jens Axboe
@ 2016-12-13  9:26 ` Paolo Valente
  2016-12-13 15:17   ` Jens Axboe
  7 siblings, 1 reply; 38+ messages in thread
From: Paolo Valente @ 2016-12-13  9:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 08 dic 2016, alle ore 21:13, Jens Axboe <axboe@fb.com> ha scritto:
> 
> As a followup to this posting from yesterday:
> 
> https://marc.info/?l=linux-block&m=148115232806065&w=2
> 
> this is version 2. I wanted to post a new one fairly quickly, as there
> ended up being a number of potential crashes in v1. This one should be
> solid, I've run mq-deadline on both NVMe and regular rotating storage,
> and we handle the various merging cases correctly.
> 
> You can download it from git as well:
> 
> git://git.kernel.dk/linux-block blk-mq-sched.2
> 
> Note that this is based on for-4.10/block, which is in turn based on
> v4.9-rc1. I suggest pulling it into my for-next branch, which would
> then merge nicely with 'master' as well.
> 

Hi Jens,
this is just to tell you that I have finished running some extensive
tests on this patch series (throughput, responsiveness, low latency
for soft real time).  No regression w.r.t. blk detected, and no
crashes or other anomalies.

Starting to work on BFQ port.  Please be patient with my little
expertise on mq environment, and with my next silly questions!

Thanks,
Paolo


> Changes since v1:
> 
> - Add Kconfig entries to allow the user to choose what the default
>  scheduler should be for blk-mq, and whether that depends on the
>  number of hardware queues.
> 
> - Properly abstract the whole get/put of a request, so we can manage
>  the life time properly.
> 
> - Enable full merging on mq-deadline (front/back, bio-to-rq, rq-to-rq).
>  Has full feature parity with deadline now.
> 
> - Export necessary symbols for compiling mq-deadline as a module.
> 
> - Various API adjustments for the mq schedulers.
> 
> - Various cleanups and improvements.
> 
> - Fix a lot of bugs. A lot. Upgrade!
> 
> block/Kconfig.iosched    |   37 ++
> block/Makefile           |    3 
> block/blk-core.c         |    9 
> block/blk-exec.c         |    3 
> block/blk-flush.c        |    7 
> block/blk-merge.c        |    3 
> block/blk-mq-sched.c     |  265 +++++++++++++++++++
> block/blk-mq-sched.h     |  188 +++++++++++++
> block/blk-mq-tag.c       |    1 
> block/blk-mq.c           |  254 ++++++++++--------
> block/blk-mq.h           |   35 +-
> block/elevator.c         |  194 ++++++++++----
> block/mq-deadline.c      |  647 +++++++++++++++++++++++++++++++++++++++++++++++
> drivers/nvme/host/pci.c  |    1 
> include/linux/blk-mq.h   |    4 
> include/linux/elevator.h |   34 ++
> 16 files changed, 1495 insertions(+), 190 deletions(-)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/7] blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  2016-12-13  9:18   ` Ritesh Harjani
@ 2016-12-13  9:29     ` Bart Van Assche
  0 siblings, 0 replies; 38+ messages in thread
From: Bart Van Assche @ 2016-12-13  9:29 UTC (permalink / raw)
  To: Ritesh Harjani, Jens Axboe, axboe, linux-block, linux-kernel
  Cc: paolo.valente, osandov

On 12/13/2016 10:18 AM, Ritesh Harjani wrote:
> On 12/9/2016 1:43 AM, Jens Axboe wrote:
>> +    if (!list_empty_careful(&hctx->dispatch)) {
>> +        spin_lock(&hctx->lock);
>> +        if (!list_empty(&hctx->dispatch))
> list_splice_init already checks for list_empty. So this may be
> redundant. Please check.

Hello Ritesh,

I think the list_empty() check is on purpose and is intended as a 
performance optimization.
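
The pattern is the usual lockless-peek-then-recheck; roughly (sketch only,
reusing the names from the quoted hunk):

	if (!list_empty_careful(&hctx->dispatch)) {
		spin_lock(&hctx->lock);
		/* re-check under the lock; this is the check that matters */
		if (!list_empty(&hctx->dispatch))
			list_splice_init(&hctx->dispatch, &rq_list);
		spin_unlock(&hctx->lock);
	}

The outer list_empty_careful() only exists to avoid taking hctx->lock on
the common empty path; correctness comes from the locked re-check.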

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 7/7] blk-mq-sched: allow setting of default IO scheduler
  2016-12-08 20:13 ` [PATCH 7/7] blk-mq-sched: allow setting of default " Jens Axboe
@ 2016-12-13 10:13   ` Bart Van Assche
  2016-12-13 15:06     ` Jens Axboe
  0 siblings, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2016-12-13 10:13 UTC (permalink / raw)
  To: Jens Axboe, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/08/2016 09:13 PM, Jens Axboe wrote:
> +config DEFAULT_MQ_IOSCHED
> +	string
> +	default "mq-deadline" if DEFAULT_MQ_DEADLINE
> +	default "none" if DEFAULT_MQ_NONE
> +
>  endmenu
>
> +config MQ_IOSCHED_ONLY_SQ
> +	bool "Enable blk-mq IO scheduler only for single queue devices"
> +	default y
> +	help
> +	  Say Y here, if you only want to enable IO scheduling on block
> +	  devices that have a single queue registered.
> +
>  endif

Hello Jens,

Shouldn't the MQ_IOSCHED_ONLY_SQ entry be placed before "endmenu" such 
that it is displayed in the I/O scheduler menu instead of the block menu?

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/7] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-08 20:13 ` [PATCH 6/7] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
@ 2016-12-13 11:04   ` Bart Van Assche
  2016-12-13 15:08     ` Jens Axboe
  2016-12-14  8:09   ` Bart Van Assche
  1 sibling, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2016-12-13 11:04 UTC (permalink / raw)
  To: Jens Axboe, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/08/2016 09:13 PM, Jens Axboe wrote:
> +static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
> +{
> +	struct deadline_data *dd;
> +	struct elevator_queue *eq;
> +
> +	eq = elevator_alloc(q, e);
> +	if (!eq)
> +		return -ENOMEM;
> +
> +	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
> +	if (!dd) {
> +		kobject_put(&eq->kobj);
> +		return -ENOMEM;
> +	}
> +	eq->elevator_data = dd;
> +
> +	dd->tags = blk_mq_sched_alloc_requests(256, q->node);
> +	if (!dd->tags) {
> +		kfree(dd);
> +		kobject_put(&eq->kobj);
> +		return -ENOMEM;
> +	}

Hello Jens,

Please add a comment that explains where the number 256 comes from.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-08 20:13 ` [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers Jens Axboe
@ 2016-12-13 13:56   ` Bart Van Assche
  2016-12-13 15:14     ` Jens Axboe
  2016-12-13 14:29   ` Bart Van Assche
  1 sibling, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2016-12-13 13:56 UTC (permalink / raw)
  To: Jens Axboe, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/08/2016 09:13 PM, Jens Axboe wrote:
> +/*
> + * Empty set
> + */
> +static struct blk_mq_ops mq_sched_tag_ops = {
> +	.queue_rq	= NULL,
> +};

Hello Jens,

Would "static struct blk_mq_ops mq_sched_tag_ops;" have been sufficient? 
Can this data structure be declared 'const' if the blk_mq_ops pointers 
in struct blk_mq_tag_set and struct request_queue are also declared const?

> +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
> +						  struct blk_mq_alloc_data *data,
> +						  struct blk_mq_tags *tags,
> +						  atomic_t *wait_index)
> +{

Using the word "shadow" in the function name suggests to me that there 
is a shadow request for every request and a request for every shadow 
request. However, my understanding from the code is that there can be 
requests without shadow requests (for e.g. a flush) and shadow requests 
without requests. Shouldn't the name of this function reflect that, e.g. 
by using "sched" or "elv" in the function name instead of "shadow"?
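
For what it's worth, my reading of the series is (sketch, not authoritative):

	/*
	 * shadow rq: allocated from the scheduler's private tags; this is
	 *            what mq-deadline sorts, merges and expires on its FIFOs.
	 * real rq:   allocated from the hctx tags only at dispatch time by
	 *            blk_mq_sched_request_from_shadow(); rq->end_io_data
	 *            appears to point back at the shadow so that
	 *            dd_completed_request() can free it.
	 *
	 * A flush/FUA request gets a real rq directly in dd_get_request(),
	 * so it never has a shadow; a shadow that is merged away is freed
	 * before it is ever mapped to a real rq.
	 */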

> +struct request *
> +blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
> +				 struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))

This function dequeues a request from the I/O scheduler queue, allocates 
a request, copies the relevant request structure members into that 
request and makes the request refer to the shadow request. Isn't the 
request dispatching more important than associating the request with the 
shadow request? If so, how about making the function name reflect that?

> +{
> +	struct blk_mq_alloc_data data;
> +	struct request *sched_rq, *rq;
> +
> +	data.q = hctx->queue;
> +	data.flags = BLK_MQ_REQ_NOWAIT;
> +	data.ctx = blk_mq_get_ctx(hctx->queue);
> +	data.hctx = hctx;
> +
> +	rq = __blk_mq_alloc_request(&data, 0);
> +	blk_mq_put_ctx(data.ctx);
> +
> +	if (!rq) {
> +		blk_mq_stop_hw_queue(hctx);
> +		return NULL;
> +	}
> +
> +	sched_rq = get_sched_rq(hctx);
> +
> +	if (!sched_rq) {
> +		blk_queue_enter_live(hctx->queue);
> +		__blk_mq_free_request(hctx, data.ctx, rq);
> +		return NULL;
> +	}

The mq deadline scheduler calls this function with get_sched_rq == 
__dd_dispatch_request. If __blk_mq_alloc_request() fails, shouldn't the 
request that was removed from the scheduler queue be pushed back onto 
that queue? Additionally, are you sure it's necessary to call 
blk_queue_enter_live() from the error path?

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-08 20:13 ` [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers Jens Axboe
  2016-12-13 13:56   ` Bart Van Assche
@ 2016-12-13 14:29   ` Bart Van Assche
  2016-12-13 15:20     ` Jens Axboe
  1 sibling, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2016-12-13 14:29 UTC (permalink / raw)
  To: Jens Axboe, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/08/2016 09:13 PM, Jens Axboe wrote:
> +static inline void blk_mq_sched_put_request(struct request *rq)
> +{
> +	struct request_queue *q = rq->q;
> +	struct elevator_queue *e = q->elevator;
> +
> +	if (e && e->type->mq_ops.put_request)
> +		e->type->mq_ops.put_request(rq);
> +	else
> +		blk_mq_free_request(rq);
> +}

blk_mq_free_request() always triggers a call of blk_queue_exit(). 
dd_put_request() only triggers a call of blk_queue_exit() if it is not a 
shadow request. Is that on purpose?

> +static inline struct request *
> +blk_mq_sched_get_request(struct request_queue *q, unsigned int op,
> +			 struct blk_mq_alloc_data *data)
> +{
> +	struct elevator_queue *e = q->elevator;
> +	struct blk_mq_hw_ctx *hctx;
> +	struct blk_mq_ctx *ctx;
> +	struct request *rq;
> +
> +	blk_queue_enter_live(q);
> +	ctx = blk_mq_get_ctx(q);
> +	hctx = blk_mq_map_queue(q, ctx->cpu);
> +
> +	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> +
> +	if (e && e->type->mq_ops.get_request)
> +		rq = e->type->mq_ops.get_request(q, op, data);
> +	else
> +		rq = __blk_mq_alloc_request(data, op);
> +
> +	if (rq)
> +		data->hctx->queued++;
> +
> +	return rq;
> +
> +}

Some but not all callers of blk_mq_sched_get_request() call 
blk_queue_exit() if this function returns NULL. Please consider moving 
the blk_queue_exit() call from the blk_mq_alloc_request() error path 
into this function. I think that will make it a lot easier to verify 
whether or not the blk_queue_enter() / blk_queue_exit() calls are 
balanced properly.

Additionally, since blk_queue_enter() / blk_queue_exit() calls by 
blk_mq_sched_get_request() and blk_mq_sched_put_request() must be 
balanced and since the latter function only calls blk_queue_exit() for 
non-shadow requests, shouldn't blk_mq_sched_get_request() call 
blk_queue_enter_live() only if __blk_mq_alloc_request() is called?
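
Concretely, I mean something like this (untested sketch of the suggestion;
the ctx get/put and the elevator error path are left out):

	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);

	if (e && e->type->mq_ops.get_request)
		rq = e->type->mq_ops.get_request(q, op, data);
	else {
		/* only take the usage ref on the path whose free puts it */
		blk_queue_enter_live(q);
		rq = __blk_mq_alloc_request(data, op);
		if (!rq)
			blk_queue_exit(q);
	}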

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/7] blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  2016-12-13  8:51   ` Bart Van Assche
@ 2016-12-13 15:05     ` Jens Axboe
  0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2016-12-13 15:05 UTC (permalink / raw)
  To: Bart Van Assche, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/13/2016 01:51 AM, Bart Van Assche wrote:
> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>> +static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
>> +{
>> +	LIST_HEAD(rq_list);
>> +	LIST_HEAD(driver_list);
> 
> Hello Jens,
> 
> driver_list is not used in this function so please consider removing 
> that variable from blk_mq_process_rq_list(). Otherwise this patch looks 
> fine to me.

Thanks Bart, this already got fixed up in the current branch.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 7/7] blk-mq-sched: allow setting of default IO scheduler
  2016-12-13 10:13   ` Bart Van Assche
@ 2016-12-13 15:06     ` Jens Axboe
  0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2016-12-13 15:06 UTC (permalink / raw)
  To: Bart Van Assche, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/13/2016 03:13 AM, Bart Van Assche wrote:
> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>> +config DEFAULT_MQ_IOSCHED
>> +	string
>> +	default "mq-deadline" if DEFAULT_MQ_DEADLINE
>> +	default "none" if DEFAULT_MQ_NONE
>> +
>>  endmenu
>>
>> +config MQ_IOSCHED_ONLY_SQ
>> +	bool "Enable blk-mq IO scheduler only for single queue devices"
>> +	default y
>> +	help
>> +	  Say Y here, if you only want to enable IO scheduling on block
>> +	  devices that have a single queue registered.
>> +
>>  endif
> 
> Hello Jens,
> 
> Shouldn't the MQ_IOSCHED_ONLY_SQ entry be placed before "endmenu" such 
> that it is displayed in the I/O scheduler menu instead of the block menu?

Good catch, yes it should. I'll move it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/7] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-13 11:04   ` Bart Van Assche
@ 2016-12-13 15:08     ` Jens Axboe
  0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2016-12-13 15:08 UTC (permalink / raw)
  To: Bart Van Assche, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/13/2016 04:04 AM, Bart Van Assche wrote:
> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>> +static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
>> +{
>> +	struct deadline_data *dd;
>> +	struct elevator_queue *eq;
>> +
>> +	eq = elevator_alloc(q, e);
>> +	if (!eq)
>> +		return -ENOMEM;
>> +
>> +	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
>> +	if (!dd) {
>> +		kobject_put(&eq->kobj);
>> +		return -ENOMEM;
>> +	}
>> +	eq->elevator_data = dd;
>> +
>> +	dd->tags = blk_mq_sched_alloc_requests(256, q->node);
>> +	if (!dd->tags) {
>> +		kfree(dd);
>> +		kobject_put(&eq->kobj);
>> +		return -ENOMEM;
>> +	}
> 
> Hello Jens,
> 
> Please add a comment that explains where the number 256 comes from.

Pulled out of my... I'll add a comment! Really this should just track the
->nr_requests soft setting; the 256 is just a random sane default that I
chose for now. I had forgotten about that, thanks.
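
Something like this, probably (sketch only; assumes q->nr_requests is
already populated when dd_init_queue() runs):

	/*
	 * Size the shadow request pool from the queue's soft setting
	 * rather than hard-coding it; fall back to a sane default.
	 */
	unsigned int depth = q->nr_requests ?: 256;

	dd->tags = blk_mq_sched_alloc_requests(depth, q->node);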

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-13 13:56   ` Bart Van Assche
@ 2016-12-13 15:14     ` Jens Axboe
  2016-12-14 10:31       ` Bart Van Assche
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2016-12-13 15:14 UTC (permalink / raw)
  To: Bart Van Assche, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/13/2016 06:56 AM, Bart Van Assche wrote:
> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>> +/*
>> + * Empty set
>> + */
>> +static struct blk_mq_ops mq_sched_tag_ops = {
>> +	.queue_rq	= NULL,
>> +};
> 
> Hello Jens,
> 
> Would "static struct blk_mq_ops mq_sched_tag_ops;" have been sufficient? 
> Can this data structure be declared 'const' if the blk_mq_ops pointers 
> in struct blk_mq_tag_set and struct request_queue are also declared const?

Yes, the static should be enough to ensure that it's all zeroes. I did
have this as const, but then realized I'd have to change a few other
places too. I'll make that change, hopefully it'll just work out.
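
For reference, the C rule being relied on, as a standalone sketch:

	struct ops {
		int (*queue_rq)(void);
	};

	/*
	 * An object with static storage duration and no initializer is
	 * zero-initialized, so queue_rq is guaranteed to be NULL here --
	 * the same effect as the explicit .queue_rq = NULL above.
	 */
	static struct ops empty_ops;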

>> +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
>> +						  struct blk_mq_alloc_data *data,
>> +						  struct blk_mq_tags *tags,
>> +						  atomic_t *wait_index)
>> +{
> 
> Using the word "shadow" in the function name suggests to me that there 
> is a shadow request for every request and a request for every shadow 
> request. However, my understanding from the code is that there can be 
> requests without shadow requests (for e.g. a flush) and shadow requests 
> without requests. Shouldn't the name of this function reflect that, e.g. 
> by using "sched" or "elv" in the function name instead of "shadow"?

Shadow might not be the best name. Most requests do have shadows, though;
it's only rare exceptions, like the flush you mention, that don't. I'll see if I
can come up with a better name.

>> +struct request *
>> +blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
>> +				 struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
> 
> This function dequeues a request from the I/O scheduler queue, allocates 
> a request, copies the relevant request structure members into that 
> request and makes the request refer to the shadow request. Isn't the 
> request dispatching more important than associating the request with the 
> shadow request? If so, how about making the function name reflect that?

Sure, I can update the naming. Will need to anyway, if we get rid of the
shadow naming.

>> +{
>> +	struct blk_mq_alloc_data data;
>> +	struct request *sched_rq, *rq;
>> +
>> +	data.q = hctx->queue;
>> +	data.flags = BLK_MQ_REQ_NOWAIT;
>> +	data.ctx = blk_mq_get_ctx(hctx->queue);
>> +	data.hctx = hctx;
>> +
>> +	rq = __blk_mq_alloc_request(&data, 0);
>> +	blk_mq_put_ctx(data.ctx);
>> +
>> +	if (!rq) {
>> +		blk_mq_stop_hw_queue(hctx);
>> +		return NULL;
>> +	}
>> +
>> +	sched_rq = get_sched_rq(hctx);
>> +
>> +	if (!sched_rq) {
>> +		blk_queue_enter_live(hctx->queue);
>> +		__blk_mq_free_request(hctx, data.ctx, rq);
>> +		return NULL;
>> +	}
> 
> The mq deadline scheduler calls this function with get_sched_rq == 
> __dd_dispatch_request. If __blk_mq_alloc_request() fails, shouldn't the 
> request that was removed from the scheduler queue be pushed back onto 
> that queue? Additionally, are you sure it's necessary to call 
> blk_queue_enter_live() from the error path?

If __blk_mq_alloc_request() fails, we haven't pulled a request from the
scheduler yet. The extra ref is needed because __blk_mq_alloc_request()
doesn't take a ref on the request, while __blk_mq_free_request() does
put one.
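
Annotated, the error path quoted above reads like this (the comment is
added here purely for illustration):

	sched_rq = get_sched_rq(hctx);

	if (!sched_rq) {
		/*
		 * __blk_mq_alloc_request() above did not take a queue
		 * reference, but __blk_mq_free_request() drops one, so
		 * take an extra reference here to keep the count balanced.
		 */
		blk_queue_enter_live(hctx->queue);
		__blk_mq_free_request(hctx, data.ctx, rq);
		return NULL;
	}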

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSET/RFC v2] blk-mq scheduling framework
  2016-12-13  9:26 ` [PATCHSET/RFC v2] blk-mq scheduling framework Paolo Valente
@ 2016-12-13 15:17   ` Jens Axboe
  2016-12-13 16:15     ` Paolo Valente
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2016-12-13 15:17 UTC (permalink / raw)
  To: Paolo Valente; +Cc: linux-block, Linux-Kernal, osandov

On Tue, Dec 13 2016, Paolo Valente wrote:
> 
> > Il giorno 08 dic 2016, alle ore 21:13, Jens Axboe <axboe@fb.com> ha scritto:
> > 
> > As a followup to this posting from yesterday:
> > 
> > https://marc.info/?l=linux-block&m=148115232806065&w=2
> > 
> > this is version 2. I wanted to post a new one fairly quickly, as there
> > ended up being a number of potential crashes in v1. This one should be
> > solid, I've run mq-deadline on both NVMe and regular rotating storage,
> > and we handle the various merging cases correctly.
> > 
> > You can download it from git as well:
> > 
> > git://git.kernel.dk/linux-block blk-mq-sched.2
> > 
> > Note that this is based on for-4.10/block, which is in turn based on
> > v4.9-rc1. I suggest pulling it into my for-next branch, which would
> > then merge nicely with 'master' as well.
> > 
> 
> Hi Jens,
> this is just to tell you that I have finished running some extensive
> tests on this patch series (throughput, responsiveness, low latency
> for soft real time).  No regression w.r.t. blk detected, and no
> crashes or other anomalies.
> 
> Starting to work on BFQ port.  Please be patient with my little
> expertise on mq environment, and with my next silly questions!

No worries, ask away if you have questions. As you might have seen, it's
still a little bit of a moving target, but it's getting closer every
day. I'll post a v3 later today hopefully that will be a good fix point
for you. I'll need to add the io context setup etc, that's not there
yet, as only cfq/bfq uses that.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-13 14:29   ` Bart Van Assche
@ 2016-12-13 15:20     ` Jens Axboe
  0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2016-12-13 15:20 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, linux-kernel, paolo.valente, osandov

On Tue, Dec 13 2016, Bart Van Assche wrote:
> On 12/08/2016 09:13 PM, Jens Axboe wrote:
> >+static inline void blk_mq_sched_put_request(struct request *rq)
> >+{
> >+	struct request_queue *q = rq->q;
> >+	struct elevator_queue *e = q->elevator;
> >+
> >+	if (e && e->type->mq_ops.put_request)
> >+		e->type->mq_ops.put_request(rq);
> >+	else
> >+		blk_mq_free_request(rq);
> >+}
> 
> blk_mq_free_request() always triggers a call of blk_queue_exit().
> dd_put_request() only triggers a call of blk_queue_exit() if it is not a
> shadow request. Is that on purpose?

If the scheduler doesn't define get/put request hooks, then the lifetime
follows the normal setup. If we do define them, then dd_put_request()
only wants to put the request if it's one where we did set up a shadow.
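
Reconstructed from that description, the shape is roughly (this is a
sketch, not the literal mq-deadline code):

	static void dd_put_request(struct request *rq)
	{
		struct deadline_data *dd = rq->q->elevator->elevator_data;

		/*
		 * Only a shadow request (one the scheduler allocated from
		 * its private tag space, marked RQF_ALLOCED) is freed here.
		 * A real driver request keeps the normal lifetime, and it
		 * is that path, via blk_mq_free_request(), which ends in
		 * blk_queue_exit().
		 */
		if (dd_rq_is_shadow(rq))
			blk_mq_sched_free_shadow_request(dd->tags, rq);
		else
			blk_mq_free_request(rq);
	}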

> >+static inline struct request *
> >+blk_mq_sched_get_request(struct request_queue *q, unsigned int op,
> >+			 struct blk_mq_alloc_data *data)
> >+{
> >+	struct elevator_queue *e = q->elevator;
> >+	struct blk_mq_hw_ctx *hctx;
> >+	struct blk_mq_ctx *ctx;
> >+	struct request *rq;
> >+
> >+	blk_queue_enter_live(q);
> >+	ctx = blk_mq_get_ctx(q);
> >+	hctx = blk_mq_map_queue(q, ctx->cpu);
> >+
> >+	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> >+
> >+	if (e && e->type->mq_ops.get_request)
> >+		rq = e->type->mq_ops.get_request(q, op, data);
> >+	else
> >+		rq = __blk_mq_alloc_request(data, op);
> >+
> >+	if (rq)
> >+		data->hctx->queued++;
> >+
> >+	return rq;
> >+
> >+}
> 
> Some but not all callers of blk_mq_sched_get_request() call blk_queue_exit()
> if this function returns NULL. Please consider to move the blk_queue_exit()
> call from the blk_mq_alloc_request() error path into this function. I think
> that will make it a lot easier to verify whether or not the
> blk_queue_enter() / blk_queue_exit() calls are balanced properly.

Agree, I'll make the change; it'll be easier to read then.
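
Condensed, the tail of blk_mq_sched_get_request() in the v3 posting
later in the thread does exactly that:

	if (rq) {
		data->hctx->queued++;
		return rq;
	}

	/* allocation failed: drop the reference blk_queue_enter_live() took */
	blk_queue_exit(q);
	return NULL;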

> Additionally, since blk_queue_enter() / blk_queue_exit() calls by
> blk_mq_sched_get_request() and blk_mq_sched_put_request() must be balanced
> and since the latter function only calls blk_queue_exit() for non-shadow
> requests, shouldn't blk_mq_sched_get_request() call blk_queue_enter_live()
> only if __blk_mq_alloc_request() is called?

I'll double check that part; there might be a bug, or at least a chance
to clean this up a bit. I did verify most of this at some point, and
tested it with scheduler switching. That part falls apart pretty
quickly if the references aren't matched exactly.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSET/RFC v2] blk-mq scheduling framework
  2016-12-13 15:17   ` Jens Axboe
@ 2016-12-13 16:15     ` Paolo Valente
  2016-12-13 16:28       ` Jens Axboe
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Valente @ 2016-12-13 16:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Linux-Kernal, osandov


> Il giorno 13 dic 2016, alle ore 16:17, Jens Axboe <axboe@fb.com> ha scritto:
> 
> On Tue, Dec 13 2016, Paolo Valente wrote:
>> 
>>> Il giorno 08 dic 2016, alle ore 21:13, Jens Axboe <axboe@fb.com> ha scritto:
>>> 
>>> As a followup to this posting from yesterday:
>>> 
>>> https://marc.info/?l=linux-block&m=148115232806065&w=2
>>> 
>>> this is version 2. I wanted to post a new one fairly quickly, as there
>>> ended up being a number of potential crashes in v1. This one should be
>>> solid, I've run mq-deadline on both NVMe and regular rotating storage,
>>> and we handle the various merging cases correctly.
>>> 
>>> You can download it from git as well:
>>> 
>>> git://git.kernel.dk/linux-block blk-mq-sched.2
>>> 
>>> Note that this is based on for-4.10/block, which is in turn based on
>>> v4.9-rc1. I suggest pulling it into my for-next branch, which would
>>> then merge nicely with 'master' as well.
>>> 
>> 
>> Hi Jens,
>> this is just to tell you that I have finished running some extensive
>> tests on this patch series (throughput, responsiveness, low latency
>> for soft real time).  No regression w.r.t. blk detected, and no
>> crashes or other anomalies.
>> 
>> Starting to work on BFQ port.  Please be patient with my little
>> expertise on mq environment, and with my next silly questions!
> 
> No worries, ask away if you have questions. As you might have seen, it's
> still a little bit of a moving target, but it's getting closer every
> day. I'll post a v3 later today hopefully that will be a good fix point
> for you. I'll need to add the io context setup etc, that's not there
> yet, as only cfq/bfq uses that.
> 

You anticipated the question that was worrying me more, how to handle
iocontexts :) I'll go on studying your patches while waiting for this
(last, right?) missing piece for bfq.

Should you implement a modified version of cfq, to test your last
extensions, I would of course appreciate very much to have a look at
it (if you are willing to share it, of course).

Thanks,
Paolo


> -- 
> Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSET/RFC v2] blk-mq scheduling framework
  2016-12-13 16:15     ` Paolo Valente
@ 2016-12-13 16:28       ` Jens Axboe
  2016-12-13 21:51         ` Jens Axboe
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2016-12-13 16:28 UTC (permalink / raw)
  To: Paolo Valente; +Cc: linux-block, Linux-Kernal, osandov

On 12/13/2016 09:15 AM, Paolo Valente wrote:
> 
>> Il giorno 13 dic 2016, alle ore 16:17, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> On Tue, Dec 13 2016, Paolo Valente wrote:
>>>
>>>> Il giorno 08 dic 2016, alle ore 21:13, Jens Axboe <axboe@fb.com> ha scritto:
>>>>
>>>> As a followup to this posting from yesterday:
>>>>
>>>> https://marc.info/?l=linux-block&m=148115232806065&w=2
>>>>
>>>> this is version 2. I wanted to post a new one fairly quickly, as there
>>>> ended up being a number of potential crashes in v1. This one should be
>>>> solid, I've run mq-deadline on both NVMe and regular rotating storage,
>>>> and we handle the various merging cases correctly.
>>>>
>>>> You can download it from git as well:
>>>>
>>>> git://git.kernel.dk/linux-block blk-mq-sched.2
>>>>
>>>> Note that this is based on for-4.10/block, which is in turn based on
>>>> v4.9-rc1. I suggest pulling it into my for-next branch, which would
>>>> then merge nicely with 'master' as well.
>>>>
>>>
>>> Hi Jens,
>>> this is just to tell you that I have finished running some extensive
>>> tests on this patch series (throughput, responsiveness, low latency
>>> for soft real time).  No regression w.r.t. blk detected, and no
>>> crashes or other anomalies.
>>>
>>> Starting to work on BFQ port.  Please be patient with my little
>>> expertise on mq environment, and with my next silly questions!
>>
>> No worries, ask away if you have questions. As you might have seen, it's
>> still a little bit of a moving target, but it's getting closer every
>> day. I'll post a v3 later today hopefully that will be a good fix point
>> for you. I'll need to add the io context setup etc, that's not there
>> yet, as only cfq/bfq uses that.
>>
> 
> You anticipated the question that was worrying me more, how to handle
> iocontexts :) I'll go on studying your patches while waiting for this
> (last, right?) missing piece for bfq.

It's the last missing larger piece. We probably have a few hooks that
BFQ/CFQ currently uses that aren't wired up yet in the elevator_ops for
mq, so you'll probably have to do those as you go. I can take a look,
but I would prefer that they be done on an as-needed basis. Perhaps we can
get rid of some of them.

> Should you implement a modified version of cfq, to test your last
> extensions, I would of course appreciate very much to have a look at
> it (if you are willing to share it, of course).

I most likely won't do that, as it would be a waste of time on my end.
If you need help with the BFQ parts, I'll help you out.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSET/RFC v2] blk-mq scheduling framework
  2016-12-13 16:28       ` Jens Axboe
@ 2016-12-13 21:51         ` Jens Axboe
  0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2016-12-13 21:51 UTC (permalink / raw)
  To: Paolo Valente; +Cc: linux-block, Linux-Kernal, osandov

On 12/13/2016 09:28 AM, Jens Axboe wrote:
>>> No worries, ask away if you have questions. As you might have seen, it's
>>> still a little bit of a moving target, but it's getting closer every
>>> day. I'll post a v3 later today hopefully that will be a good fix point
>>> for you. I'll need to add the io context setup etc, that's not there
>>> yet, as only cfq/bfq uses that.
>>>
>>
>> You anticipated the question that was worrying me more, how to handle
>> iocontexts :) I'll go on studying your patches while waiting for this
>> (last, right?) missing piece for bfq.
> 
> It's the last missing larger piece. We probably have a few hooks that
> BFQ/CFQ currently uses that aren't wired up yet in the elevator_ops for
> mq, so you'll probably have to do those as you go. I can take a look,
> but I would prefer that they be done on an as-needed basis. Perhaps we can
> get rid of some of them.

The current 'blk-mq-sched' branch has support for getting the IO
contexts set up and assigned to requests. It only works off the task
io_context for now; we ignore anything set in the bio. But that's a
minor thing, and generally it should work for you.

Note that the mq ops have different naming than the classic elevator
ops. For instance, the set_request/put_request are
get_rq_priv/put_rq_priv instead. Others are different as well. In
general, refer to mq-deadline.c for how the hooks work and you can
compare with deadline-iosched.c, since they are still very close.
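
As a sketch of wiring up the renamed hooks (the example_* symbols are
placeholders, not real functions):

	static int  example_get_rq_priv(struct request_queue *q, struct request *rq);
	static void example_put_rq_priv(struct request_queue *q, struct request *rq);
	static void example_insert_requests(struct blk_mq_hw_ctx *hctx,
					    struct list_head *list, bool at_head);
	static void example_dispatch_requests(struct blk_mq_hw_ctx *hctx,
					      struct list_head *list);

	static struct elevator_type example_mq_sched = {
		.ops.mq = {
			.get_rq_priv		= example_get_rq_priv,	/* was .elevator_set_req_fn */
			.put_rq_priv		= example_put_rq_priv,	/* was .elevator_put_req_fn */
			.insert_requests	= example_insert_requests,
			.dispatch_requests	= example_dispatch_requests,
		},
		.elevator_name	= "example-mq",
		.elevator_owner	= THIS_MODULE,
	};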

Note that the io context linking uses the embedded queue lock,
q->queue_lock, whereas for other things you are free to use a lock
embedded in your elevator data. Again, refer to mq-deadline: it uses
dd->lock to protect the hash/rbtree. If mq-deadline used io contexts, it
would manage those behind q->queue_lock.
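
A sketch of that split (illustrative only; the fifo_list field is
assumed here to mirror deadline-iosched, and the hash/rbtree updates are
omitted):

	static void example_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
	{
		struct deadline_data *dd = hctx->queue->elevator->elevator_data;

		/* scheduler-private state (sort/fifo lists): scheduler's own lock */
		spin_lock(&dd->lock);
		list_add_tail(&rq->queuelist, &dd->fifo_list[rq_data_dir(rq)]);
		spin_unlock(&dd->lock);
	}

	static struct io_cq *example_lookup_icq(struct request_queue *q, struct io_context *ioc)
	{
		struct io_cq *icq;

		/* io_context / icq linking: must use the embedded queue lock */
		spin_lock_irq(q->queue_lock);
		icq = ioc_lookup_icq(ioc, q);
		spin_unlock_irq(q->queue_lock);

		return icq;
	}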

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/7] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-08 20:13 ` [PATCH 6/7] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
  2016-12-13 11:04   ` Bart Van Assche
@ 2016-12-14  8:09   ` Bart Van Assche
  2016-12-14 15:02     ` Jens Axboe
  1 sibling, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2016-12-14  8:09 UTC (permalink / raw)
  To: Jens Axboe, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/08/2016 09:13 PM, Jens Axboe wrote:
> +static inline bool dd_rq_is_shadow(struct request *rq)
> +{
> +	return rq->rq_flags & RQF_ALLOCED;
> +}

Hello Jens,

Something minor: because req_flags_t has been defined using __bitwise
(typedef __u32 __bitwise req_flags_t) sparse complains for the above
function about converting req_flags_t into bool. How about changing the
body of that function into "return (rq->rq_flags & RQF_ALLOCED) != 0" to
keep sparse happy?
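
Spelled out, that would be:

	static inline bool dd_rq_is_shadow(struct request *rq)
	{
		return (rq->rq_flags & RQF_ALLOCED) != 0;
	}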

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-13 15:14     ` Jens Axboe
@ 2016-12-14 10:31       ` Bart Van Assche
  2016-12-14 15:05         ` Jens Axboe
  0 siblings, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2016-12-14 10:31 UTC (permalink / raw)
  To: Jens Axboe, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/13/2016 04:14 PM, Jens Axboe wrote:
> On 12/13/2016 06:56 AM, Bart Van Assche wrote:
>> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>>> +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
>>> +						  struct blk_mq_alloc_data *data,
>>> +						  struct blk_mq_tags *tags,
>>> +						  atomic_t *wait_index)
>>> +{
>>
>> Using the word "shadow" in the function name suggests to me that there 
>> is a shadow request for every request and a request for every shadow 
>> request. However, my understanding from the code is that there can be 
>> requests without shadow requests (for e.g. a flush) and shadow requests 
>> without requests. Shouldn't the name of this function reflect that, e.g. 
>> by using "sched" or "elv" in the function name instead of "shadow"?
> 
>> Shadow might not be the best name. Most requests do have shadows,
>> though; it's only the rare exception, like the flush you mention. I'll
>> see if I can come up with a better name.

Hello Jens,

One aspect of this patch series that might turn out to be a maintenance
burden is the copying between original and shadow requests. It is easy
to overlook that rq_copy() has to be updated if a field would ever be
added to struct request. Additionally, having to allocate two request
structures per I/O instead of one will have a runtime overhead. Do you
think the following approach would work?
- Instead of using two request structures per I/O, only use a single
  request structure.
- Instead of storing one tag in the request structure, store two tags
  in that structure. One tag comes from the I/O scheduler tag set
  (size: nr_requests) and the other from the tag set associated with
  the block driver (size: HBA queue depth).
- Only add a request to the hctx dispatch list after a block driver tag
  has been assigned. This means that an I/O scheduler must keep a
  request structure on a list it manages itself as long as no block
  driver tag has been assigned.
- sysfs_list_show() is modified such that it shows both tags.
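
As a rough sketch of the third point (example_get_driver_tag() is a
hypothetical helper drawing from the driver tag set, not an existing
function):

	static bool example_dispatch_one(struct blk_mq_hw_ctx *hctx, struct request *rq)
	{
		/* Try to get a tag from the driver (HBA) tag set first. */
		rq->tag = example_get_driver_tag(hctx);
		if (rq->tag < 0)
			return false;	/* keep rq on the scheduler's own list */

		/* Only now does the request go onto the hctx dispatch list. */
		spin_lock(&hctx->lock);
		list_add_tail(&rq->queuelist, &hctx->dispatch);
		spin_unlock(&hctx->lock);
		return true;
	}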

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/7] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-14  8:09   ` Bart Van Assche
@ 2016-12-14 15:02     ` Jens Axboe
  0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2016-12-14 15:02 UTC (permalink / raw)
  To: Bart Van Assche, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/14/2016 01:09 AM, Bart Van Assche wrote:
> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>> +static inline bool dd_rq_is_shadow(struct request *rq)
>> +{
>> +	return rq->rq_flags & RQF_ALLOCED;
>> +}
> 
> Hello Jens,
> 
> Something minor: because req_flags_t has been defined using __bitwise
> (typedef __u32 __bitwise req_flags_t) sparse complains for the above
> function about converting req_flags_t into bool. How about changing the
> body of that function into "return (rq->rq_flags & RQF_ALLOCED) != 0" to
> keep sparse happy?

Sure, I can fold in that change.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-14 10:31       ` Bart Van Assche
@ 2016-12-14 15:05         ` Jens Axboe
  0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2016-12-14 15:05 UTC (permalink / raw)
  To: Bart Van Assche, axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

On 12/14/2016 03:31 AM, Bart Van Assche wrote:
> On 12/13/2016 04:14 PM, Jens Axboe wrote:
>> On 12/13/2016 06:56 AM, Bart Van Assche wrote:
>>> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>>>> +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
>>>> +						  struct blk_mq_alloc_data *data,
>>>> +						  struct blk_mq_tags *tags,
>>>> +						  atomic_t *wait_index)
>>>> +{
>>>
>>> Using the word "shadow" in the function name suggests to me that there 
>>> is a shadow request for every request and a request for every shadow 
>>> request. However, my understanding from the code is that there can be 
>>> requests without shadow requests (for e.g. a flush) and shadow requests 
>>> without requests. Shouldn't the name of this function reflect that, e.g. 
>>> by using "sched" or "elv" in the function name instead of "shadow"?
>>
>> Shadow might not be the best name. Most requests do have shadows,
>> though; it's only the rare exception, like the flush you mention. I'll
>> see if I can come up with a better name.
> 
> Hello Jens,
> 
> One aspect of this patch series that might turn out to be a maintenance
> burden is the copying between original and shadow requests. It is easy
> to overlook that rq_copy() has to be updated if a field would ever be
> added to struct request. Additionally, having to allocate two request
> structures per I/O instead of one will have a runtime overhead. Do you
> think the following approach would work?
> - Instead of using two request structures per I/O, only use a single
>   request structure.
> - Instead of storing one tag in the request structure, store two tags
>   in that structure. One tag comes from the I/O scheduler tag set
>   (size: nr_requests) and the other from the tag set associated with
>   the block driver (size: HBA queue depth).
> - Only add a request to the hctx dispatch list after a block driver tag
>   has been assigned. This means that an I/O scheduler must keep a
>   request structure on a list it manages itself as long as no block
>   driver tag has been assigned.
> - sysfs_list_show() is modified such that it shows both tags.

I have considered doing exactly that, but decided to go down the other
path. I may still revisit it; it's not that I'm a huge fan of the shadow
requests and the necessary copying. We don't update struct request that
often, so I don't think it's going to be a big maintenance burden. But
it'd be hard to claim that it's super pretty...

I'll play with the idea.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-15 19:29   ` Omar Sandoval
  2016-12-15 20:14     ` Jens Axboe
@ 2016-12-15 21:44     ` Jens Axboe
  1 sibling, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2016-12-15 21:44 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: axboe, linux-block, linux-kernel, paolo.valente, osandov

On 12/15/2016 12:29 PM, Omar Sandoval wrote:
> Hey, Jens, a couple of minor nits below.
> 
> One bigger note: adding the blk_mq_sched_*() helpers and keeping the
> blk_mq_*() helpers that they replaced seems risky. For example,
> blk_mq_free_request() is superseded by blk_mq_sched_put_request(), but
> we kept blk_mq_free_request(). There are definitely some codepaths that
> are still using blk_mq_free_request() that are now wrong
> (__nvme_submit_user_cmd() is the most obvious one I saw).
> 
> Can we get rid of the old, non-sched functions? Or maybe even make old
> interface do the sched stuff instead of adding blk_mq_sched_*()?

I fixed this up. blk_mq_free_request() remains the exported interface,
and it now just calls blk_mq_sched_put_request(). The old
(__)blk_mq_free_request() helpers are now (__)blk_mq_finish_request()
and internal only.
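
In sketch form (matching the description above, not quoting the
branch):

	void blk_mq_free_request(struct request *rq)
	{
		/* exported name stays, but it now routes through the scheduler hook */
		blk_mq_sched_put_request(rq);
	}
	EXPORT_SYMBOL(blk_mq_free_request);

with the old (__)blk_mq_free_request() bodies living on as
(__)blk_mq_finish_request(), no longer exported.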

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-15 19:29   ` Omar Sandoval
@ 2016-12-15 20:14     ` Jens Axboe
  2016-12-15 21:44     ` Jens Axboe
  1 sibling, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2016-12-15 20:14 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-block, linux-kernel, paolo.valente, osandov

On Thu, Dec 15 2016, Omar Sandoval wrote:
> Hey, Jens, a couple of minor nits below.
> 
> One bigger note: adding the blk_mq_sched_*() helpers and keeping the
> blk_mq_*() helpers that they replaced seems risky. For example,
> blk_mq_free_request() is superseded by blk_mq_sched_put_request(), but
> we kept blk_mq_free_request(). There are definitely some codepaths that
> are still using blk_mq_free_request() that are now wrong
> (__nvme_submit_user_cmd() is the most obvious one I saw).
> 
> Can we get rid of the old, non-sched functions? Or maybe even make old
> interface do the sched stuff instead of adding blk_mq_sched_*()?

You are right, that is a concern. I'll think of some ways to lessen that
risk; it's much better to have build-time breakage than runtime breakage.

> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 8d1cec8e25d1..d10a246a3bc7 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2267,10 +2229,10 @@ static int blk_mq_queue_reinit_dead(unsigned int cpu)
> >   * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
> >   * and set bit0 in pending bitmap as ctx1->index_hw is still zero.
> >   *
> > - * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in
> > - * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
> > - * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list
> > - * is ignored.
> > + * And then while running hw queue, blk_mq_flush_busy_ctxs() finds bit0 is set
> > + * in pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
> > + * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list is
> > + * ignored.
> >   */
> 
> This belongs in patch 4 where flush_busy_ctxs() got renamed.

Fixed, thanks!

> >  static int blk_mq_queue_reinit_prepare(unsigned int cpu)
> >  {
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index e59f5ca520a2..a5ddc860b220 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -47,6 +47,9 @@ struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
> >   */
> >  void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> >  				bool at_head);
> > +void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> > +				struct list_head *list);
> > +void blk_mq_process_sw_list(struct blk_mq_hw_ctx *hctx);
> 
> Looks like the declaration of blk_mq_process_sw_list() survived a rebase
> even though it doesn't exist anymore.

Indeed, now killed.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-15  5:26 ` [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers Jens Axboe
@ 2016-12-15 19:29   ` Omar Sandoval
  2016-12-15 20:14     ` Jens Axboe
  2016-12-15 21:44     ` Jens Axboe
  0 siblings, 2 replies; 38+ messages in thread
From: Omar Sandoval @ 2016-12-15 19:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-block, linux-kernel, paolo.valente, osandov

Hey, Jens, a couple of minor nits below.

One bigger note: adding the blk_mq_sched_*() helpers and keeping the
blk_mq_*() helpers that they replaced seems risky. For example,
blk_mq_free_request() is superseded by blk_mq_sched_put_request(), but
we kept blk_mq_free_request(). There are definitely some codepaths that
are still using blk_mq_free_request() that are now wrong
(__nvme_submit_user_cmd() is the most obvious one I saw).

Can we get rid of the old, non-sched functions? Or maybe even make old
interface do the sched stuff instead of adding blk_mq_sched_*()?

On Wed, Dec 14, 2016 at 10:26:06PM -0700, Jens Axboe wrote:
> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---
>  block/Makefile           |   2 +-
>  block/blk-core.c         |   7 +-
>  block/blk-exec.c         |   3 +-
>  block/blk-flush.c        |   7 +-
>  block/blk-mq-sched.c     | 375 +++++++++++++++++++++++++++++++++++++++++++++++
>  block/blk-mq-sched.h     | 190 ++++++++++++++++++++++++
>  block/blk-mq-tag.c       |   1 +
>  block/blk-mq.c           | 192 ++++++++++--------------
>  block/blk-mq.h           |   3 +
>  block/elevator.c         | 186 +++++++++++++++++------
>  include/linux/blk-mq.h   |   3 +-
>  include/linux/elevator.h |  29 ++++
>  12 files changed, 833 insertions(+), 165 deletions(-)
>  create mode 100644 block/blk-mq-sched.c
>  create mode 100644 block/blk-mq-sched.h
> 

[snip]

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8d1cec8e25d1..d10a246a3bc7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2267,10 +2229,10 @@ static int blk_mq_queue_reinit_dead(unsigned int cpu)
>   * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
>   * and set bit0 in pending bitmap as ctx1->index_hw is still zero.
>   *
> - * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in
> - * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
> - * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list
> - * is ignored.
> + * And then while running hw queue, blk_mq_flush_busy_ctxs() finds bit0 is set
> + * in pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
> + * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list is
> + * ignored.
>   */

This belongs in patch 4 where flush_busy_ctxs() got renamed.

>  static int blk_mq_queue_reinit_prepare(unsigned int cpu)
>  {
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index e59f5ca520a2..a5ddc860b220 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -47,6 +47,9 @@ struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
>   */
>  void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  				bool at_head);
> +void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> +				struct list_head *list);
> +void blk_mq_process_sw_list(struct blk_mq_hw_ctx *hctx);

Looks like the declaration of blk_mq_process_sw_list() survived a rebase
even though it doesn't exist anymore.

[snip]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-15  5:26 [PATCHSET v3] " Jens Axboe
@ 2016-12-15  5:26 ` Jens Axboe
  2016-12-15 19:29   ` Omar Sandoval
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2016-12-15  5:26 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/Makefile           |   2 +-
 block/blk-core.c         |   7 +-
 block/blk-exec.c         |   3 +-
 block/blk-flush.c        |   7 +-
 block/blk-mq-sched.c     | 375 +++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-mq-sched.h     | 190 ++++++++++++++++++++++++
 block/blk-mq-tag.c       |   1 +
 block/blk-mq.c           | 192 ++++++++++--------------
 block/blk-mq.h           |   3 +
 block/elevator.c         | 186 +++++++++++++++++------
 include/linux/blk-mq.h   |   3 +-
 include/linux/elevator.h |  29 ++++
 12 files changed, 833 insertions(+), 165 deletions(-)
 create mode 100644 block/blk-mq-sched.c
 create mode 100644 block/blk-mq-sched.h

diff --git a/block/Makefile b/block/Makefile
index a827f988c4e6..2eee9e1bb6db 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
 			blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
-			blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
+			blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
 			badblocks.o partitions/
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 92baea07acbc..cb1e864cb23d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@
 
 #include "blk.h"
 #include "blk-mq.h"
+#include "blk-mq-sched.h"
 #include "blk-wbt.h"
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
@@ -1413,7 +1414,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 		return;
 
 	if (q->mq_ops) {
-		blk_mq_free_request(req);
+		blk_mq_sched_put_request(req);
 		return;
 	}
 
@@ -1449,7 +1450,7 @@ void blk_put_request(struct request *req)
 	struct request_queue *q = req->q;
 
 	if (q->mq_ops)
-		blk_mq_free_request(req);
+		blk_mq_sched_put_request(req);
 	else {
 		unsigned long flags;
 
@@ -2127,7 +2128,7 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 	if (q->mq_ops) {
 		if (blk_queue_io_stat(q))
 			blk_account_io_start(rq, true);
-		blk_mq_insert_request(rq, false, true, false);
+		blk_mq_sched_insert_request(rq, false, true, false);
 		return 0;
 	}
 
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 3ecb00a6cf45..86656fdfa637 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -9,6 +9,7 @@
 #include <linux/sched/sysctl.h>
 
 #include "blk.h"
+#include "blk-mq-sched.h"
 
 /*
  * for max sense size
@@ -65,7 +66,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 	 * be reused after dying flag is set
 	 */
 	if (q->mq_ops) {
-		blk_mq_insert_request(rq, at_head, true, false);
+		blk_mq_sched_insert_request(rq, at_head, true, false);
 		return;
 	}
 
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 20b7c7a02f1c..6a7c29d2eb3c 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -74,6 +74,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
 
 /* FLUSH/FUA sequences */
 enum {
@@ -453,9 +454,9 @@ void blk_insert_flush(struct request *rq)
 	 */
 	if ((policy & REQ_FSEQ_DATA) &&
 	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
-		if (q->mq_ops) {
-			blk_mq_insert_request(rq, false, true, false);
-		} else
+		if (q->mq_ops)
+			blk_mq_sched_insert_request(rq, false, true, false);
+		else
 			list_add_tail(&rq->queuelist, &q->queue_head);
 		return;
 	}
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
new file mode 100644
index 000000000000..02ad17258666
--- /dev/null
+++ b/block/blk-mq-sched.c
@@ -0,0 +1,375 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/blk-mq.h>
+
+#include <trace/events/block.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+#include "blk-wbt.h"
+
+/*
+ * Empty set
+ */
+static const struct blk_mq_ops mq_sched_tag_ops = {
+};
+
+void blk_mq_sched_free_requests(struct blk_mq_tags *tags)
+{
+	blk_mq_free_rq_map(NULL, tags, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_requests);
+
+struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth,
+						unsigned int numa_node)
+{
+	struct blk_mq_tag_set set = {
+		.ops		= &mq_sched_tag_ops,
+		.nr_hw_queues	= 1,
+		.queue_depth	= depth,
+		.numa_node	= numa_node,
+	};
+
+	return blk_mq_init_rq_map(&set, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_requests);
+
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+				 void (*exit)(struct blk_mq_hw_ctx *))
+{
+	struct blk_mq_hw_ctx *hctx;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (exit)
+			exit(hctx);
+		kfree(hctx->sched_data);
+		hctx->sched_data = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
+
+int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
+				void (*init)(struct blk_mq_hw_ctx *))
+{
+	struct blk_mq_hw_ctx *hctx;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
+		if (!hctx->sched_data)
+			goto error;
+
+		if (init)
+			init(hctx);
+	}
+
+	return 0;
+error:
+	blk_mq_sched_free_hctx_data(q, NULL);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data);
+
+struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
+						  struct blk_mq_alloc_data *data,
+						  struct blk_mq_tags *tags,
+						  atomic_t *wait_index)
+{
+	struct sbq_wait_state *ws;
+	DEFINE_WAIT(wait);
+	struct request *rq;
+	int tag;
+
+	tag = __sbitmap_queue_get(&tags->bitmap_tags);
+	if (tag != -1)
+		goto done;
+
+	if (data->flags & BLK_MQ_REQ_NOWAIT)
+		return NULL;
+
+	ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
+	do {
+		prepare_to_wait(&ws->wait, &wait, TASK_UNINTERRUPTIBLE);
+
+		tag = __sbitmap_queue_get(&tags->bitmap_tags);
+		if (tag != -1)
+			break;
+
+		blk_mq_run_hw_queue(data->hctx, false);
+
+		tag = __sbitmap_queue_get(&tags->bitmap_tags);
+		if (tag != -1)
+			break;
+
+		blk_mq_put_ctx(data->ctx);
+		io_schedule();
+
+		data->ctx = blk_mq_get_ctx(data->q);
+		data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
+		finish_wait(&ws->wait, &wait);
+		ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
+	} while (1);
+
+	finish_wait(&ws->wait, &wait);
+done:
+	rq = tags->rqs[tag];
+	rq->tag = tag;
+	rq->rq_flags |= RQF_ALLOCED;
+	return rq;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_shadow_request);
+
+void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
+				      struct request *rq)
+{
+	WARN_ON_ONCE(!(rq->rq_flags & RQF_ALLOCED));
+	sbitmap_queue_clear(&tags->bitmap_tags, rq->tag, rq->mq_ctx->cpu);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_shadow_request);
+
+static void rq_copy(struct request *rq, struct request *src)
+{
+#define FIELD_COPY(dst, src, name)	((dst)->name = (src)->name)
+	FIELD_COPY(rq, src, cpu);
+	FIELD_COPY(rq, src, cmd_type);
+	FIELD_COPY(rq, src, cmd_flags);
+	rq->rq_flags |= (src->rq_flags & (RQF_PREEMPT | RQF_QUIET | RQF_PM | RQF_DONTPREP));
+	rq->rq_flags &= ~RQF_IO_STAT;
+	FIELD_COPY(rq, src, __data_len);
+	FIELD_COPY(rq, src, __sector);
+	FIELD_COPY(rq, src, bio);
+	FIELD_COPY(rq, src, biotail);
+	FIELD_COPY(rq, src, rq_disk);
+	FIELD_COPY(rq, src, part);
+	FIELD_COPY(rq, src, issue_stat);
+	src->issue_stat.time = 0;
+	FIELD_COPY(rq, src, nr_phys_segments);
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+	FIELD_COPY(rq, src, nr_integrity_segments);
+#endif
+	FIELD_COPY(rq, src, ioprio);
+	FIELD_COPY(rq, src, timeout);
+
+	if (src->cmd_type == REQ_TYPE_BLOCK_PC) {
+		FIELD_COPY(rq, src, cmd);
+		FIELD_COPY(rq, src, cmd_len);
+		FIELD_COPY(rq, src, extra_len);
+		FIELD_COPY(rq, src, sense_len);
+		FIELD_COPY(rq, src, resid_len);
+		FIELD_COPY(rq, src, sense);
+		FIELD_COPY(rq, src, retries);
+	}
+
+	src->bio = src->biotail = NULL;
+}
+
+static void sched_rq_end_io(struct request *rq, int error)
+{
+	struct request *sched_rq = rq->end_io_data;
+
+	FIELD_COPY(sched_rq, rq, resid_len);
+	FIELD_COPY(sched_rq, rq, extra_len);
+	FIELD_COPY(sched_rq, rq, sense_len);
+	FIELD_COPY(sched_rq, rq, errors);
+	FIELD_COPY(sched_rq, rq, retries);
+
+	blk_account_io_completion(sched_rq, blk_rq_bytes(sched_rq));
+	blk_account_io_done(sched_rq);
+
+	if (sched_rq->end_io)
+		sched_rq->end_io(sched_rq, error);
+
+	blk_mq_free_request(rq);
+}
+
+struct request *
+blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+				 struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
+{
+	struct blk_mq_alloc_data data;
+	struct request *sched_rq, *rq;
+
+	data.q = hctx->queue;
+	data.flags = BLK_MQ_REQ_NOWAIT;
+	data.ctx = blk_mq_get_ctx(hctx->queue);
+	data.hctx = hctx;
+
+	rq = __blk_mq_alloc_request(&data, 0);
+	blk_mq_put_ctx(data.ctx);
+
+	if (!rq) {
+		blk_mq_stop_hw_queue(hctx);
+		return NULL;
+	}
+
+	sched_rq = get_sched_rq(hctx);
+
+	if (!sched_rq) {
+		blk_queue_enter_live(hctx->queue);
+		__blk_mq_free_request(hctx, data.ctx, rq);
+		return NULL;
+	}
+
+	WARN_ON_ONCE(!(sched_rq->rq_flags & RQF_ALLOCED));
+	rq_copy(rq, sched_rq);
+	rq->end_io = sched_rq_end_io;
+	rq->end_io_data = sched_rq;
+
+	return rq;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_request_from_shadow);
+
+static void blk_mq_sched_assign_ioc(struct request_queue *q,
+				    struct request *rq, struct bio *bio)
+{
+	struct io_context *ioc = rq_ioc(bio);
+	struct io_cq *icq;
+
+	if (!ioc)
+		return;
+
+	spin_lock_irq(q->queue_lock);
+	icq = ioc_lookup_icq(ioc, q);
+	spin_unlock_irq(q->queue_lock);
+
+	if (!icq) {
+		if (ioc)
+			icq = ioc_create_icq(ioc, q, GFP_ATOMIC);
+		if (!icq) {
+fail:
+			printk_ratelimited("failed icq alloc\n");
+			return;
+		}
+	}
+
+	rq->elv.icq = icq;
+	if (blk_mq_sched_get_rq_priv(q, rq))
+		goto fail;
+
+	if (icq)
+		get_io_context(icq->ioc);
+}
+
+struct request *blk_mq_sched_get_request(struct request_queue *q,
+					 struct bio *bio,
+					 unsigned int op,
+					 struct blk_mq_alloc_data *data)
+{
+	struct elevator_queue *e = q->elevator;
+	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_ctx *ctx;
+	struct request *rq;
+
+	blk_queue_enter_live(q);
+	ctx = blk_mq_get_ctx(q);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
+
+	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
+
+	if (e && e->type->ops.mq.get_request)
+		rq = e->type->ops.mq.get_request(q, op, data);
+	else
+		rq = __blk_mq_alloc_request(data, op);
+
+	if (rq) {
+		rq->elv.icq = NULL;
+		if (e && e->type->icq_cache)
+			blk_mq_sched_assign_ioc(q, rq, bio);
+		data->hctx->queued++;
+		return rq;
+	}
+
+	blk_queue_exit(q);
+	return NULL;
+}
+
+void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
+{
+	struct elevator_queue *e = hctx->queue->elevator;
+	LIST_HEAD(rq_list);
+
+	if (unlikely(blk_mq_hctx_stopped(hctx)))
+		return;
+
+	hctx->run++;
+
+	/*
+	 * If we have previous entries on our dispatch list, grab them first for
+	 * more fair dispatch.
+	 */
+	if (!list_empty_careful(&hctx->dispatch)) {
+		spin_lock(&hctx->lock);
+		if (!list_empty(&hctx->dispatch))
+			list_splice_init(&hctx->dispatch, &rq_list);
+		spin_unlock(&hctx->lock);
+	}
+
+	/*
+	 * Only ask the scheduler for requests, if we didn't have residual
+	 * requests from the dispatch list. This is to avoid the case where
+	 * we only ever dispatch a fraction of the requests available because
+	 * of low device queue depth. Once we pull requests out of the IO
+	 * scheduler, we can no longer merge or sort them. So it's best to
+	 * leave them there for as long as we can. Mark the hw queue as
+	 * needing a restart in that case.
+	 */
+	if (list_empty(&rq_list)) {
+		if (e && e->type->ops.mq.dispatch_requests)
+			e->type->ops.mq.dispatch_requests(hctx, &rq_list);
+		else
+			blk_mq_flush_busy_ctxs(hctx, &rq_list);
+	} else
+		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+
+	blk_mq_dispatch_rq_list(hctx, &rq_list);
+}
+
+bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio)
+{
+	struct request *rq;
+	int ret;
+
+	ret = elv_merge(q, &rq, bio);
+	if (ret == ELEVATOR_BACK_MERGE) {
+		if (bio_attempt_back_merge(q, rq, bio)) {
+			if (!attempt_back_merge(q, rq))
+				elv_merged_request(q, rq, ret);
+			goto done;
+		}
+		ret = ELEVATOR_NO_MERGE;
+	} else if (ret == ELEVATOR_FRONT_MERGE) {
+		if (bio_attempt_front_merge(q, rq, bio)) {
+			if (!attempt_front_merge(q, rq))
+				elv_merged_request(q, rq, ret);
+			goto done;
+		}
+		ret = ELEVATOR_NO_MERGE;
+	}
+done:
+	return ret != ELEVATOR_NO_MERGE;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
+
+bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e->type->ops.mq.bio_merge) {
+		struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
+		struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+
+		blk_mq_put_ctx(ctx);
+		return e->type->ops.mq.bio_merge(hctx, bio);
+	}
+
+	return false;
+}
+
+void blk_mq_sched_request_inserted(struct request *rq)
+{
+	trace_block_rq_insert(rq->q, rq);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
new file mode 100644
index 000000000000..b68dccc0190e
--- /dev/null
+++ b/block/blk-mq-sched.h
@@ -0,0 +1,190 @@
+#ifndef BLK_MQ_SCHED_H
+#define BLK_MQ_SCHED_H
+
+#include "blk-mq.h"
+
+struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth, unsigned int numa_node);
+void blk_mq_sched_free_requests(struct blk_mq_tags *tags);
+
+int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
+				void (*init)(struct blk_mq_hw_ctx *));
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+				 void (*exit)(struct blk_mq_hw_ctx *));
+
+void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
+				      struct request *rq);
+struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
+						  struct blk_mq_alloc_data *data,
+						  struct blk_mq_tags *tags,
+						  atomic_t *wait_index);
+struct request *
+blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+				 struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *));
+
+struct request *blk_mq_sched_get_request(struct request_queue *q, struct bio *bio, unsigned int op, struct blk_mq_alloc_data *data);
+
+void __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
+void blk_mq_sched_request_inserted(struct request *rq);
+bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
+bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
+
+void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
+
+static inline bool
+blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (!e || blk_queue_nomerges(q) || !bio_mergeable(bio))
+		return false;
+
+	return __blk_mq_sched_bio_merge(q, bio);
+}
+
+static inline int blk_mq_sched_get_rq_priv(struct request_queue *q,
+					   struct request *rq)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->ops.mq.get_rq_priv)
+		return e->type->ops.mq.get_rq_priv(q, rq);
+
+	return 0;
+}
+
+static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
+					    struct request *rq)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->ops.mq.put_rq_priv)
+		e->type->ops.mq.put_rq_priv(q, rq);
+}
+
+static inline void blk_mq_sched_put_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+	bool do_free = true;
+
+	if (rq->rq_flags & RQF_ELVPRIV) {
+		blk_mq_sched_put_rq_priv(rq->q, rq);
+		if (rq->elv.icq)
+			put_io_context(rq->elv.icq->ioc);
+	}
+
+	if (e && e->type->ops.mq.put_request)
+		do_free = !e->type->ops.mq.put_request(rq);
+	if (do_free)
+		blk_mq_free_request(rq);
+}
+
+static inline void
+blk_mq_sched_insert_request(struct request *rq, bool at_head, bool run_queue,
+			    bool async)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+
+	if (e && e->type->ops.mq.insert_requests) {
+		LIST_HEAD(list);
+
+		list_add(&rq->queuelist, &list);
+		e->type->ops.mq.insert_requests(hctx, &list, at_head);
+	} else {
+		spin_lock(&ctx->lock);
+		__blk_mq_insert_request(hctx, rq, at_head);
+		spin_unlock(&ctx->lock);
+	}
+
+	if (run_queue)
+		blk_mq_run_hw_queue(hctx, async);
+}
+
+static inline void
+blk_mq_sched_insert_requests(struct request_queue *q, struct blk_mq_ctx *ctx,
+			     struct list_head *list, bool run_queue_async)
+{
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (e && e->type->ops.mq.insert_requests)
+		e->type->ops.mq.insert_requests(hctx, list, false);
+	else
+		blk_mq_insert_requests(hctx, ctx, list);
+
+	blk_mq_run_hw_queue(hctx, run_queue_async);
+}
+
+static inline void
+blk_mq_sched_dispatch_shadow_requests(struct blk_mq_hw_ctx *hctx,
+				      struct list_head *rq_list,
+				      struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
+{
+	do {
+		struct request *rq;
+
+		rq = blk_mq_sched_request_from_shadow(hctx, get_sched_rq);
+		if (!rq)
+			break;
+
+		list_add_tail(&rq->queuelist, rq_list);
+	} while (1);
+}
+
+static inline bool
+blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
+			 struct bio *bio)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->ops.mq.allow_merge)
+		return e->type->ops.mq.allow_merge(q, rq, bio);
+
+	return true;
+}
+
+static inline void
+blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (e && e->type->ops.mq.completed_request)
+		e->type->ops.mq.completed_request(hctx, rq);
+
+	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+		blk_mq_run_hw_queue(hctx, true);
+	}
+}
+
+static inline void blk_mq_sched_started_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->ops.mq.started_request)
+		e->type->ops.mq.started_request(rq);
+}
+
+static inline void blk_mq_sched_requeue_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->ops.mq.requeue_request)
+		e->type->ops.mq.requeue_request(rq);
+}
+
+static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
+{
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (e && e->type->ops.mq.has_work)
+		return e->type->ops.mq.has_work(hctx);
+
+	return false;
+}
+#endif
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index dcf5ce3ba4bf..bbd494e23d57 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -12,6 +12,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
 
 bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
 {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8d1cec8e25d1..d10a246a3bc7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -32,6 +32,7 @@
 #include "blk-mq-tag.h"
 #include "blk-stat.h"
 #include "blk-wbt.h"
+#include "blk-mq-sched.h"
 
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
@@ -41,7 +42,8 @@ static LIST_HEAD(all_q_list);
  */
 static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
 {
-	return sbitmap_any_bit_set(&hctx->ctx_map);
+	return sbitmap_any_bit_set(&hctx->ctx_map) ||
+		blk_mq_sched_has_work(hctx);
 }
 
 /*
@@ -242,26 +244,21 @@ EXPORT_SYMBOL_GPL(__blk_mq_alloc_request);
 struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
 		unsigned int flags)
 {
-	struct blk_mq_ctx *ctx;
-	struct blk_mq_hw_ctx *hctx;
-	struct request *rq;
 	struct blk_mq_alloc_data alloc_data;
+	struct request *rq;
 	int ret;
 
 	ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
 	if (ret)
 		return ERR_PTR(ret);
 
-	ctx = blk_mq_get_ctx(q);
-	hctx = blk_mq_map_queue(q, ctx->cpu);
-	blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
-	rq = __blk_mq_alloc_request(&alloc_data, rw);
-	blk_mq_put_ctx(ctx);
+	rq = blk_mq_sched_get_request(q, NULL, rw, &alloc_data);
 
-	if (!rq) {
-		blk_queue_exit(q);
+	blk_mq_put_ctx(alloc_data.ctx);
+	blk_queue_exit(q);
+
+	if (!rq)
 		return ERR_PTR(-EWOULDBLOCK);
-	}
 
 	rq->__data_len = 0;
 	rq->__sector = (sector_t) -1;
@@ -327,6 +324,8 @@ void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	const int tag = rq->tag;
 	struct request_queue *q = rq->q;
 
+	blk_mq_sched_completed_request(hctx, rq);
+
 	if (rq->rq_flags & RQF_MQ_INFLIGHT)
 		atomic_dec(&hctx->nr_active);
 
@@ -364,8 +363,8 @@ inline void __blk_mq_end_request(struct request *rq, int error)
 		rq->end_io(rq, error);
 	} else {
 		if (unlikely(blk_bidi_rq(rq)))
-			blk_mq_free_request(rq->next_rq);
-		blk_mq_free_request(rq);
+			blk_mq_sched_put_request(rq->next_rq);
+		blk_mq_sched_put_request(rq);
 	}
 }
 EXPORT_SYMBOL(__blk_mq_end_request);
@@ -469,6 +468,8 @@ void blk_mq_start_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
+	blk_mq_sched_started_request(rq);
+
 	trace_block_rq_issue(q, rq);
 
 	rq->resid_len = blk_rq_bytes(rq);
@@ -517,6 +518,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
+	blk_mq_sched_requeue_request(rq);
 
 	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
 		if (q->dma_drain_size && blk_rq_bytes(rq))
@@ -551,13 +553,13 @@ static void blk_mq_requeue_work(struct work_struct *work)
 
 		rq->rq_flags &= ~RQF_SOFTBARRIER;
 		list_del_init(&rq->queuelist);
-		blk_mq_insert_request(rq, true, false, false);
+		blk_mq_sched_insert_request(rq, true, false, false);
 	}
 
 	while (!list_empty(&rq_list)) {
 		rq = list_entry(rq_list.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
-		blk_mq_insert_request(rq, false, false, false);
+		blk_mq_sched_insert_request(rq, false, false, false);
 	}
 
 	blk_mq_run_hw_queues(q, false);
@@ -763,8 +765,16 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
 
 		if (!blk_rq_merge_ok(rq, bio))
 			continue;
+		if (!blk_mq_sched_allow_merge(q, rq, bio))
+			break;
 
 		el_ret = blk_try_merge(rq, bio);
+		if (el_ret == ELEVATOR_NO_MERGE)
+			continue;
+
+		if (!blk_mq_sched_allow_merge(q, rq, bio))
+			break;
+
 		if (el_ret == ELEVATOR_BACK_MERGE) {
 			if (bio_attempt_back_merge(q, rq, bio)) {
 				ctx->rq_merged++;
@@ -906,41 +916,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 	return ret != BLK_MQ_RQ_QUEUE_BUSY;
 }
 
-/*
- * Run this hardware queue, pulling any software queues mapped to it in.
- * Note that this function currently has various problems around ordering
- * of IO. In particular, we'd like FIFO behaviour on handling existing
- * items on the hctx->dispatch list. Ignore that for now.
- */
-static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
-{
-	LIST_HEAD(rq_list);
-	LIST_HEAD(driver_list);
-
-	if (unlikely(blk_mq_hctx_stopped(hctx)))
-		return;
-
-	hctx->run++;
-
-	/*
-	 * Touch any software queue that has pending entries.
-	 */
-	blk_mq_flush_busy_ctxs(hctx, &rq_list);
-
-	/*
-	 * If we have previous entries on our dispatch list, grab them
-	 * and stuff them at the front for more fair dispatch.
-	 */
-	if (!list_empty_careful(&hctx->dispatch)) {
-		spin_lock(&hctx->lock);
-		if (!list_empty(&hctx->dispatch))
-			list_splice_init(&hctx->dispatch, &rq_list);
-		spin_unlock(&hctx->lock);
-	}
-
-	blk_mq_dispatch_rq_list(hctx, &rq_list);
-}
-
 static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 {
 	int srcu_idx;
@@ -950,11 +925,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 
 	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
 		rcu_read_lock();
-		blk_mq_process_rq_list(hctx);
+		blk_mq_sched_dispatch_requests(hctx);
 		rcu_read_unlock();
 	} else {
 		srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
-		blk_mq_process_rq_list(hctx);
+		blk_mq_sched_dispatch_requests(hctx);
 		srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
 	}
 }
@@ -1148,32 +1123,10 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	blk_mq_hctx_mark_pending(hctx, ctx);
 }
 
-void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
-			   bool async)
-{
-	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-
-	spin_lock(&ctx->lock);
-	__blk_mq_insert_request(hctx, rq, at_head);
-	spin_unlock(&ctx->lock);
-
-	if (run_queue)
-		blk_mq_run_hw_queue(hctx, async);
-}
-
-static void blk_mq_insert_requests(struct request_queue *q,
-				     struct blk_mq_ctx *ctx,
-				     struct list_head *list,
-				     int depth,
-				     bool from_schedule)
+void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+			    struct list_head *list)
 
 {
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-
-	trace_block_unplug(q, depth, !from_schedule);
-
 	/*
 	 * preemption doesn't flush plug list, so it's possible ctx->cpu is
 	 * offline now
@@ -1189,8 +1142,6 @@ static void blk_mq_insert_requests(struct request_queue *q,
 	}
 	blk_mq_hctx_mark_pending(hctx, ctx);
 	spin_unlock(&ctx->lock);
-
-	blk_mq_run_hw_queue(hctx, from_schedule);
 }
 
 static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
@@ -1226,9 +1177,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 		BUG_ON(!rq->q);
 		if (rq->mq_ctx != this_ctx) {
 			if (this_ctx) {
-				blk_mq_insert_requests(this_q, this_ctx,
-							&ctx_list, depth,
-							from_schedule);
+				trace_block_unplug(this_q, depth, from_schedule);
+				blk_mq_sched_insert_requests(this_q, this_ctx,
+								&ctx_list,
+								from_schedule);
 			}
 
 			this_ctx = rq->mq_ctx;
@@ -1245,8 +1197,9 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	 * on 'ctx_list'. Do those.
 	 */
 	if (this_ctx) {
-		blk_mq_insert_requests(this_q, this_ctx, &ctx_list, depth,
-				       from_schedule);
+		trace_block_unplug(this_q, depth, from_schedule);
+		blk_mq_sched_insert_requests(this_q, this_ctx, &ctx_list,
+						from_schedule);
 	}
 }
 
@@ -1289,41 +1242,27 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
 	}
 }
 
-static struct request *blk_mq_map_request(struct request_queue *q,
-					  struct bio *bio,
-					  struct blk_mq_alloc_data *data)
-{
-	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *ctx;
-	struct request *rq;
-
-	blk_queue_enter_live(q);
-	ctx = blk_mq_get_ctx(q);
-	hctx = blk_mq_map_queue(q, ctx->cpu);
-
-	trace_block_getrq(q, bio, bio->bi_opf);
-	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
-	rq = __blk_mq_alloc_request(data, bio->bi_opf);
-
-	data->hctx->queued++;
-	return rq;
-}
-
 static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 {
-	int ret;
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
 	struct blk_mq_queue_data bd = {
 		.rq = rq,
 		.list = NULL,
 		.last = 1
 	};
-	blk_qc_t new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
+	struct blk_mq_hw_ctx *hctx;
+	blk_qc_t new_cookie;
+	int ret;
+
+	if (q->elevator)
+		goto insert;
 
+	hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
 	if (blk_mq_hctx_stopped(hctx))
 		goto insert;
 
+	new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
+
 	/*
 	 * For OK queue, we are done. For error, kill it. Any other
 	 * error (busy), just add it to our list as we previously
@@ -1345,7 +1284,7 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 	}
 
 insert:
-	blk_mq_insert_request(rq, false, true, true);
+	blk_mq_sched_insert_request(rq, false, true, true);
 }
 
 /*
@@ -1378,9 +1317,14 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
 		return BLK_QC_T_NONE;
 
+	if (blk_mq_sched_bio_merge(q, bio))
+		return BLK_QC_T_NONE;
+
 	wb_acct = wbt_wait(q->rq_wb, bio, NULL);
 
-	rq = blk_mq_map_request(q, bio, &data);
+	trace_block_getrq(q, bio, bio->bi_opf);
+
+	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
 		return BLK_QC_T_NONE;
@@ -1442,6 +1386,12 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		goto done;
 	}
 
+	if (q->elevator) {
+		blk_mq_put_ctx(data.ctx);
+		blk_mq_bio_to_request(rq, bio);
+		blk_mq_sched_insert_request(rq, false, true, true);
+		goto done;
+	}
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
 		/*
 		 * For a SYNC request, send it to the hardware immediately. For
@@ -1487,9 +1437,14 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	} else
 		request_count = blk_plug_queued_count(q);
 
+	if (blk_mq_sched_bio_merge(q, bio))
+		return BLK_QC_T_NONE;
+
 	wb_acct = wbt_wait(q->rq_wb, bio, NULL);
 
-	rq = blk_mq_map_request(q, bio, &data);
+	trace_block_getrq(q, bio, bio->bi_opf);
+
+	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
 		return BLK_QC_T_NONE;
@@ -1539,6 +1494,12 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		return cookie;
 	}
 
+	if (q->elevator) {
+		blk_mq_put_ctx(data.ctx);
+		blk_mq_bio_to_request(rq, bio);
+		blk_mq_sched_insert_request(rq, false, true, true);
+		goto done;
+	}
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
 		/*
 		 * For a SYNC request, send it to the hardware immediately. For
@@ -1551,6 +1512,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	}
 
 	blk_mq_put_ctx(data.ctx);
+done:
 	return cookie;
 }
 
@@ -1559,7 +1521,7 @@ void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 {
 	struct page *page;
 
-	if (tags->rqs && set->ops->exit_request) {
+	if (tags->rqs && set && set->ops->exit_request) {
 		int i;
 
 		for (i = 0; i < tags->nr_tags; i++) {
@@ -2267,10 +2229,10 @@ static int blk_mq_queue_reinit_dead(unsigned int cpu)
  * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
  * and set bit0 in pending bitmap as ctx1->index_hw is still zero.
  *
- * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in
- * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
- * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list
- * is ignored.
+ * And then while running hw queue, blk_mq_flush_busy_ctxs() finds bit0 is set
+ * in pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
+ * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list is
+ * ignored.
  */
 static int blk_mq_queue_reinit_prepare(unsigned int cpu)
 {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e59f5ca520a2..a5ddc860b220 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -47,6 +47,9 @@ struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
  */
 void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 				bool at_head);
+void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+				struct list_head *list);
+void blk_mq_process_sw_list(struct blk_mq_hw_ctx *hctx);
 
 /*
  * CPU hotplug helpers
diff --git a/block/elevator.c b/block/elevator.c
index 022a26830297..6d39197768c1 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -40,6 +40,7 @@
 #include <trace/events/block.h>
 
 #include "blk.h"
+#include "blk-mq-sched.h"
 
 static DEFINE_SPINLOCK(elv_list_lock);
 static LIST_HEAD(elv_list);
@@ -58,7 +59,9 @@ static int elv_iosched_allow_bio_merge(struct request *rq, struct bio *bio)
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.sq.elevator_allow_bio_merge_fn)
+	if (e->uses_mq && e->type->ops.mq.allow_merge)
+		return e->type->ops.mq.allow_merge(q, rq, bio);
+	else if (!e->uses_mq && e->type->ops.sq.elevator_allow_bio_merge_fn)
 		return e->type->ops.sq.elevator_allow_bio_merge_fn(q, rq, bio);
 
 	return 1;
@@ -163,6 +166,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
 	kobject_init(&eq->kobj, &elv_ktype);
 	mutex_init(&eq->sysfs_lock);
 	hash_init(eq->hash);
+	eq->uses_mq = e->uses_mq;
 
 	return eq;
 }
@@ -219,12 +223,19 @@ int elevator_init(struct request_queue *q, char *name)
 		if (!e) {
 			printk(KERN_ERR
 				"Default I/O scheduler not found. " \
-				"Using noop.\n");
+				"Using noop/none.\n");
+			if (q->mq_ops) {
+				elevator_put(e);
+				return 0;
+			}
 			e = elevator_get("noop", false);
 		}
 	}
 
-	err = e->ops.sq.elevator_init_fn(q, e);
+	if (e->uses_mq)
+		err = e->ops.mq.init_sched(q, e);
+	else
+		err = e->ops.sq.elevator_init_fn(q, e);
 	if (err)
 		elevator_put(e);
 	return err;
@@ -234,7 +245,9 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
-	if (e->type->ops.sq.elevator_exit_fn)
+	if (e->uses_mq && e->type->ops.mq.exit_sched)
+		e->type->ops.mq.exit_sched(e);
+	else if (!e->uses_mq && e->type->ops.sq.elevator_exit_fn)
 		e->type->ops.sq.elevator_exit_fn(e);
 	mutex_unlock(&e->sysfs_lock);
 
@@ -253,6 +266,7 @@ void elv_rqhash_del(struct request_queue *q, struct request *rq)
 	if (ELV_ON_HASH(rq))
 		__elv_rqhash_del(rq);
 }
+EXPORT_SYMBOL_GPL(elv_rqhash_del);
 
 void elv_rqhash_add(struct request_queue *q, struct request *rq)
 {
@@ -262,6 +276,7 @@ void elv_rqhash_add(struct request_queue *q, struct request *rq)
 	hash_add(e->hash, &rq->hash, rq_hash_key(rq));
 	rq->rq_flags |= RQF_HASHED;
 }
+EXPORT_SYMBOL_GPL(elv_rqhash_add);
 
 void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
 {
@@ -443,7 +458,9 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
 		return ELEVATOR_BACK_MERGE;
 	}
 
-	if (e->type->ops.sq.elevator_merge_fn)
+	if (e->uses_mq && e->type->ops.mq.request_merge)
+		return e->type->ops.mq.request_merge(q, req, bio);
+	else if (!e->uses_mq && e->type->ops.sq.elevator_merge_fn)
 		return e->type->ops.sq.elevator_merge_fn(q, req, bio);
 
 	return ELEVATOR_NO_MERGE;
@@ -462,6 +479,9 @@ static bool elv_attempt_insert_merge(struct request_queue *q,
 	struct request *__rq;
 	bool ret;
 
+	if (WARN_ON_ONCE(q->elevator && q->elevator->uses_mq))
+		return false;
+
 	if (blk_queue_nomerges(q))
 		return false;
 
@@ -495,7 +515,9 @@ void elv_merged_request(struct request_queue *q, struct request *rq, int type)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.sq.elevator_merged_fn)
+	if (e->uses_mq && e->type->ops.mq.request_merged)
+		e->type->ops.mq.request_merged(q, rq, type);
+	else if (!e->uses_mq && e->type->ops.sq.elevator_merged_fn)
 		e->type->ops.sq.elevator_merged_fn(q, rq, type);
 
 	if (type == ELEVATOR_BACK_MERGE)
@@ -508,10 +530,15 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 			     struct request *next)
 {
 	struct elevator_queue *e = q->elevator;
-	const int next_sorted = next->rq_flags & RQF_SORTED;
-
-	if (next_sorted && e->type->ops.sq.elevator_merge_req_fn)
-		e->type->ops.sq.elevator_merge_req_fn(q, rq, next);
+	bool next_sorted = false;
+
+	if (e->uses_mq && e->type->ops.mq.requests_merged)
+		e->type->ops.mq.requests_merged(q, rq, next);
+	else if (e->type->ops.sq.elevator_merge_req_fn) {
+		next_sorted = next->rq_flags & RQF_SORTED;
+		if (next_sorted)
+			e->type->ops.sq.elevator_merge_req_fn(q, rq, next);
+	}
 
 	elv_rqhash_reposition(q, rq);
 
@@ -528,6 +555,9 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return;
+
 	if (e->type->ops.sq.elevator_bio_merged_fn)
 		e->type->ops.sq.elevator_bio_merged_fn(q, rq, bio);
 }
@@ -682,8 +712,11 @@ struct request *elv_latter_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.sq.elevator_latter_req_fn)
+	if (e->uses_mq && e->type->ops.mq.next_request)
+		return e->type->ops.mq.next_request(q, rq);
+	else if (!e->uses_mq && e->type->ops.sq.elevator_latter_req_fn)
 		return e->type->ops.sq.elevator_latter_req_fn(q, rq);
+
 	return NULL;
 }
 
@@ -691,7 +724,9 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.sq.elevator_former_req_fn)
+	if (e->uses_mq && e->type->ops.mq.former_request)
+		return e->type->ops.mq.former_request(q, rq);
+	if (!e->uses_mq && e->type->ops.sq.elevator_former_req_fn)
 		return e->type->ops.sq.elevator_former_req_fn(q, rq);
 	return NULL;
 }
@@ -701,6 +736,9 @@ int elv_set_request(struct request_queue *q, struct request *rq,
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return 0;
+
 	if (e->type->ops.sq.elevator_set_req_fn)
 		return e->type->ops.sq.elevator_set_req_fn(q, rq, bio, gfp_mask);
 	return 0;
@@ -710,6 +748,9 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return;
+
 	if (e->type->ops.sq.elevator_put_req_fn)
 		e->type->ops.sq.elevator_put_req_fn(rq);
 }
@@ -718,6 +759,9 @@ int elv_may_queue(struct request_queue *q, unsigned int op)
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return 0;
+
 	if (e->type->ops.sq.elevator_may_queue_fn)
 		return e->type->ops.sq.elevator_may_queue_fn(q, op);
 
@@ -728,6 +772,9 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return;
+
 	/*
 	 * request is released from the driver, io must be done
 	 */
@@ -803,7 +850,7 @@ int elv_register_queue(struct request_queue *q)
 		}
 		kobject_uevent(&e->kobj, KOBJ_ADD);
 		e->registered = 1;
-		if (e->type->ops.sq.elevator_registered_fn)
+		if (!e->uses_mq && e->type->ops.sq.elevator_registered_fn)
 			e->type->ops.sq.elevator_registered_fn(q);
 	}
 	return error;
@@ -891,9 +938,14 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old = q->elevator;
-	bool registered = old->registered;
+	bool old_registered = false;
 	int err;
 
+	if (q->mq_ops) {
+		blk_mq_freeze_queue(q);
+		blk_mq_quiesce_queue(q);
+	}
+
 	/*
 	 * Turn on BYPASS and drain all requests w/ elevator private data.
 	 * Block layer doesn't call into a quiesced elevator - all requests
@@ -901,32 +953,52 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	 * using INSERT_BACK.  All requests have SOFTBARRIER set and no
 	 * merge happens either.
 	 */
-	blk_queue_bypass_start(q);
+	if (old) {
+		old_registered = old->registered;
 
-	/* unregister and clear all auxiliary data of the old elevator */
-	if (registered)
-		elv_unregister_queue(q);
+		if (!q->mq_ops)
+			blk_queue_bypass_start(q);
 
-	spin_lock_irq(q->queue_lock);
-	ioc_clear_queue(q);
-	spin_unlock_irq(q->queue_lock);
+		/* unregister and clear all auxiliary data of the old elevator */
+		if (old_registered)
+			elv_unregister_queue(q);
+
+		spin_lock_irq(q->queue_lock);
+		ioc_clear_queue(q);
+		spin_unlock_irq(q->queue_lock);
+	}
 
 	/* allocate, init and register new elevator */
-	err = new_e->ops.sq.elevator_init_fn(q, new_e);
-	if (err)
-		goto fail_init;
+	if (new_e) {
+		if (new_e->uses_mq)
+			err = new_e->ops.mq.init_sched(q, new_e);
+		else
+			err = new_e->ops.sq.elevator_init_fn(q, new_e);
+		if (err)
+			goto fail_init;
 
-	if (registered) {
 		err = elv_register_queue(q);
 		if (err)
 			goto fail_register;
-	}
+	} else
+		q->elevator = NULL;
 
 	/* done, kill the old one and finish */
-	elevator_exit(old);
-	blk_queue_bypass_end(q);
+	if (old) {
+		elevator_exit(old);
+		if (!q->mq_ops)
+			blk_queue_bypass_end(q);
+	}
+
+	if (q->mq_ops) {
+		blk_mq_unfreeze_queue(q);
+		blk_mq_start_stopped_hw_queues(q, true);
+	}
 
-	blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	if (new_e)
+		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	else
+		blk_add_trace_msg(q, "elv switch: none");
 
 	return 0;
 
@@ -934,9 +1006,16 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	elevator_exit(q->elevator);
 fail_init:
 	/* switch failed, restore and re-register old elevator */
-	q->elevator = old;
-	elv_register_queue(q);
-	blk_queue_bypass_end(q);
+	if (old) {
+		q->elevator = old;
+		elv_register_queue(q);
+		if (!q->mq_ops)
+			blk_queue_bypass_end(q);
+	}
+	if (q->mq_ops) {
+		blk_mq_unfreeze_queue(q);
+		blk_mq_start_stopped_hw_queues(q, true);
+	}
 
 	return err;
 }
@@ -949,8 +1028,11 @@ static int __elevator_change(struct request_queue *q, const char *name)
 	char elevator_name[ELV_NAME_MAX];
 	struct elevator_type *e;
 
-	if (!q->elevator)
-		return -ENXIO;
+	/*
+	 * Special case for mq, turn off scheduling
+	 */
+	if (q->mq_ops && !strncmp(name, "none", 4))
+		return elevator_switch(q, NULL);
 
 	strlcpy(elevator_name, name, sizeof(elevator_name));
 	e = elevator_get(strstrip(elevator_name), true);
@@ -959,11 +1041,23 @@ static int __elevator_change(struct request_queue *q, const char *name)
 		return -EINVAL;
 	}
 
-	if (!strcmp(elevator_name, q->elevator->type->elevator_name)) {
+	if (q->elevator &&
+	    !strcmp(elevator_name, q->elevator->type->elevator_name)) {
 		elevator_put(e);
 		return 0;
 	}
 
+	if (!e->uses_mq && q->mq_ops) {
+		printk(KERN_ERR "blk-mq-sched: elv %s does not support mq\n", elevator_name);
+		elevator_put(e);
+		return -EINVAL;
+	}
+	if (e->uses_mq && !q->mq_ops) {
+		printk(KERN_ERR "blk-mq-sched: elv %s is for mq\n", elevator_name);
+		elevator_put(e);
+		return -EINVAL;
+	}
+
 	return elevator_switch(q, e);
 }
 
@@ -985,7 +1079,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
 {
 	int ret;
 
-	if (!q->elevator)
+	if (!q->mq_ops || q->request_fn)
 		return count;
 
 	ret = __elevator_change(q, name);
@@ -999,24 +1093,34 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
 ssize_t elv_iosched_show(struct request_queue *q, char *name)
 {
 	struct elevator_queue *e = q->elevator;
-	struct elevator_type *elv;
+	struct elevator_type *elv = NULL;
 	struct elevator_type *__e;
 	int len = 0;
 
-	if (!q->elevator || !blk_queue_stackable(q))
+	if (!blk_queue_stackable(q))
 		return sprintf(name, "none\n");
 
-	elv = e->type;
+	if (!q->elevator)
+		len += sprintf(name+len, "[none] ");
+	else
+		elv = e->type;
 
 	spin_lock(&elv_list_lock);
 	list_for_each_entry(__e, &elv_list, list) {
-		if (!strcmp(elv->elevator_name, __e->elevator_name))
+		if (elv && !strcmp(elv->elevator_name, __e->elevator_name)) {
 			len += sprintf(name+len, "[%s] ", elv->elevator_name);
-		else
+			continue;
+		}
+		if (__e->uses_mq && q->mq_ops)
+			len += sprintf(name+len, "%s ", __e->elevator_name);
+		else if (!__e->uses_mq && !q->mq_ops)
 			len += sprintf(name+len, "%s ", __e->elevator_name);
 	}
 	spin_unlock(&elv_list_lock);
 
+	if (q->mq_ops && q->elevator)
+		len += sprintf(name+len, "none");
+
 	len += sprintf(len+name, "\n");
 	return len;
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index afc81d77e471..73b58b5be6e0 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -22,6 +22,7 @@ struct blk_mq_hw_ctx {
 
 	unsigned long		flags;		/* BLK_MQ_F_* flags */
 
+	void			*sched_data;
 	struct request_queue	*queue;
 	struct blk_flush_queue	*fq;
 
@@ -156,6 +157,7 @@ enum {
 
 	BLK_MQ_S_STOPPED	= 0,
 	BLK_MQ_S_TAG_ACTIVE	= 1,
+	BLK_MQ_S_SCHED_RESTART	= 2,
 
 	BLK_MQ_MAX_DEPTH	= 10240,
 
@@ -179,7 +181,6 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
 
 void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule);
 
-void blk_mq_insert_request(struct request *, bool, bool, bool);
 void blk_mq_free_request(struct request *rq);
 void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *, struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 2a9e966eed03..0cbf30a00186 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -77,6 +77,32 @@ struct elevator_ops
 	elevator_registered_fn *elevator_registered_fn;
 };
 
+struct blk_mq_alloc_data;
+struct blk_mq_hw_ctx;
+
+struct elevator_mq_ops {
+	int (*init_sched)(struct request_queue *, struct elevator_type *);
+	void (*exit_sched)(struct elevator_queue *);
+
+	bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
+	bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
+	int (*request_merge)(struct request_queue *q, struct request **, struct bio *);
+	void (*request_merged)(struct request_queue *, struct request *, int);
+	void (*requests_merged)(struct request_queue *, struct request *, struct request *);
+	struct request *(*get_request)(struct request_queue *, unsigned int, struct blk_mq_alloc_data *);
+	bool (*put_request)(struct request *);
+	void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
+	void (*dispatch_requests)(struct blk_mq_hw_ctx *, struct list_head *);
+	bool (*has_work)(struct blk_mq_hw_ctx *);
+	void (*completed_request)(struct blk_mq_hw_ctx *, struct request *);
+	void (*started_request)(struct request *);
+	void (*requeue_request)(struct request *);
+	struct request *(*former_request)(struct request_queue *, struct request *);
+	struct request *(*next_request)(struct request_queue *, struct request *);
+	int (*get_rq_priv)(struct request_queue *, struct request *);
+	void (*put_rq_priv)(struct request_queue *, struct request *);
+};
+
 #define ELV_NAME_MAX	(16)
 
 struct elv_fs_entry {
@@ -96,12 +122,14 @@ struct elevator_type
 	/* fields provided by elevator implementation */
 	union {
 		struct elevator_ops sq;
+		struct elevator_mq_ops mq;
 	} ops;
 	size_t icq_size;	/* see iocontext.h */
 	size_t icq_align;	/* ditto */
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+	bool uses_mq;
 
 	/* managed by elevator core */
 	char icq_cache_name[ELV_NAME_MAX + 5];	/* elvname + "_io_cq" */
@@ -125,6 +153,7 @@ struct elevator_queue
 	struct kobject kobj;
 	struct mutex sysfs_lock;
 	unsigned int registered:1;
+	unsigned int uses_mq:1;
 	DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
 };
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

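Aside for readers skimming the interface: a rough idea of what a scheduler written against the new ops.mq union above could look like. This is only a sketch against the hooks quoted in elevator.h, not code from the patchset; the sketch_* names are invented, the locking is deliberately naive, and a complete scheduler in this series would also need to provide get_request/put_request (mq-deadline covers that with the shadow-request helpers in blk-mq-sched.c).

#include <linux/module.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/elevator.h>
#include <linux/list.h>
#include <linux/slab.h>

/* One FIFO shared by all hardware queues, hung off q->elevator->elevator_data. */
struct sketch_data {
	spinlock_t lock;
	struct list_head fifo;
};

static int sketch_init_sched(struct request_queue *q, struct elevator_type *e)
{
	struct elevator_queue *eq;
	struct sketch_data *sd;

	eq = elevator_alloc(q, e);
	if (!eq)
		return -ENOMEM;

	sd = kzalloc_node(sizeof(*sd), GFP_KERNEL, q->node);
	if (!sd) {
		kobject_put(&eq->kobj);
		return -ENOMEM;
	}
	spin_lock_init(&sd->lock);
	INIT_LIST_HEAD(&sd->fifo);

	eq->elevator_data = sd;
	q->elevator = eq;
	return 0;
}

static void sketch_exit_sched(struct elevator_queue *e)
{
	kfree(e->elevator_data);
}

static void sketch_insert_requests(struct blk_mq_hw_ctx *hctx,
				   struct list_head *list, bool at_head)
{
	struct sketch_data *sd = hctx->queue->elevator->elevator_data;

	spin_lock(&sd->lock);
	if (at_head)
		list_splice_init(list, &sd->fifo);
	else
		list_splice_tail_init(list, &sd->fifo);
	spin_unlock(&sd->lock);
}

static void sketch_dispatch_requests(struct blk_mq_hw_ctx *hctx,
				     struct list_head *list)
{
	struct sketch_data *sd = hctx->queue->elevator->elevator_data;

	/* Hand everything queued so far back to blk-mq in FIFO order. */
	spin_lock(&sd->lock);
	list_splice_init(&sd->fifo, list);
	spin_unlock(&sd->lock);
}

static bool sketch_has_work(struct blk_mq_hw_ctx *hctx)
{
	struct sketch_data *sd = hctx->queue->elevator->elevator_data;

	return !list_empty_careful(&sd->fifo);
}

static struct elevator_type sketch_sched = {
	.ops.mq = {
		.init_sched		= sketch_init_sched,
		.exit_sched		= sketch_exit_sched,
		.insert_requests	= sketch_insert_requests,
		.dispatch_requests	= sketch_dispatch_requests,
		.has_work		= sketch_has_work,
	},
	.uses_mq	= true,
	.elevator_name	= "sketch-fifo",
	.elevator_owner	= THIS_MODULE,
};

Registration would then be a plain elv_register(&sketch_sched) from module init and elv_unregister() on exit, the same as for the legacy single-queue schedulers.
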
* [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-07 23:09 [PATCHSET/RFC] blk-mq scheduling framework Jens Axboe
@ 2016-12-07 23:09 ` Jens Axboe
  0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2016-12-07 23:09 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-mq-sched.c | 243 +++++++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-mq-sched.h | 168 +++++++++++++++++++++++++++++++++++
 2 files changed, 411 insertions(+)
 create mode 100644 block/blk-mq-sched.c
 create mode 100644 block/blk-mq-sched.h

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
new file mode 100644
index 000000000000..8317b26990f8
--- /dev/null
+++ b/block/blk-mq-sched.c
@@ -0,0 +1,243 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#include <linux/blk-mq.h>
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+#include "blk-wbt.h"
+
+/*
+ * Empty set
+ */
+static struct blk_mq_ops mq_sched_tag_ops = {
+	.queue_rq	= NULL,
+};
+
+void blk_mq_sched_free_requests(struct blk_mq_tags *tags)
+{
+	blk_mq_free_rq_map(NULL, tags, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_requests);
+
+struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth,
+						unsigned int numa_node)
+{
+	struct blk_mq_tag_set set = {
+		.ops		= &mq_sched_tag_ops,
+		.nr_hw_queues	= 1,
+		.queue_depth	= depth,
+		.numa_node	= numa_node,
+	};
+
+	return blk_mq_init_rq_map(&set, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_requests);
+
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+				 void (*exit)(struct blk_mq_hw_ctx *))
+{
+	struct blk_mq_hw_ctx *hctx;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (exit)
+			exit(hctx);
+		kfree(hctx->sched_data);
+		hctx->sched_data = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
+
+int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
+				void (*init)(struct blk_mq_hw_ctx *))
+{
+	struct blk_mq_hw_ctx *hctx;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
+		if (!hctx->sched_data)
+			goto error;
+
+		if (init)
+			init(hctx);
+	}
+
+	return 0;
+error:
+	blk_mq_sched_free_hctx_data(q, NULL);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data);
+
+struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
+						  struct blk_mq_alloc_data *data,
+						  struct blk_mq_tags *tags,
+						  atomic_t *wait_index)
+{
+	struct sbq_wait_state *ws;
+	DEFINE_WAIT(wait);
+	struct request *rq;
+	int tag;
+
+	tag = __sbitmap_queue_get(&tags->bitmap_tags);
+	if (tag != -1)
+		goto done;
+
+	if (data->flags & BLK_MQ_REQ_NOWAIT)
+		return NULL;
+
+	ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
+	do {
+		prepare_to_wait(&ws->wait, &wait, TASK_UNINTERRUPTIBLE);
+
+		tag = __sbitmap_queue_get(&tags->bitmap_tags);
+		if (tag != -1)
+			break;
+
+		blk_mq_run_hw_queue(data->hctx, false);
+
+		tag = __sbitmap_queue_get(&tags->bitmap_tags);
+		if (tag != -1)
+			break;
+
+		blk_mq_put_ctx(data->ctx);
+		io_schedule();
+
+		data->ctx = blk_mq_get_ctx(data->q);
+		data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
+		finish_wait(&ws->wait, &wait);
+		ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
+	} while (1);
+
+	finish_wait(&ws->wait, &wait);
+done:
+	rq = tags->rqs[tag];
+	rq->tag = tag;
+	return rq;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_shadow_request);
+
+void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
+				      struct request *rq)
+{
+	sbitmap_queue_clear(&tags->bitmap_tags, rq->tag, rq->mq_ctx->cpu);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_shadow_request);
+
+static void rq_copy(struct request *rq, struct request *src)
+{
+#define FIELD_COPY(dst, src, name)	((dst)->name = (src)->name)
+	FIELD_COPY(rq, src, cpu);
+	FIELD_COPY(rq, src, cmd_type);
+	FIELD_COPY(rq, src, cmd_flags);
+	rq->rq_flags |= (src->rq_flags & (RQF_PREEMPT | RQF_QUIET | RQF_PM | RQF_DONTPREP));
+	rq->rq_flags &= ~RQF_IO_STAT;
+	FIELD_COPY(rq, src, __data_len);
+	FIELD_COPY(rq, src, __sector);
+	FIELD_COPY(rq, src, bio);
+	FIELD_COPY(rq, src, biotail);
+	FIELD_COPY(rq, src, rq_disk);
+	FIELD_COPY(rq, src, part);
+	FIELD_COPY(rq, src, nr_phys_segments);
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+	FIELD_COPY(rq, src, nr_integrity_segments);
+#endif
+	FIELD_COPY(rq, src, ioprio);
+	FIELD_COPY(rq, src, timeout);
+
+	if (src->cmd_type == REQ_TYPE_BLOCK_PC) {
+		FIELD_COPY(rq, src, cmd);
+		FIELD_COPY(rq, src, cmd_len);
+		FIELD_COPY(rq, src, extra_len);
+		FIELD_COPY(rq, src, sense_len);
+		FIELD_COPY(rq, src, resid_len);
+		FIELD_COPY(rq, src, sense);
+		FIELD_COPY(rq, src, retries);
+	}
+
+	src->bio = src->biotail = NULL;
+}
+
+static void sched_rq_end_io(struct request *rq, int error)
+{
+	struct request *sched_rq = rq->end_io_data;
+
+	FIELD_COPY(sched_rq, rq, resid_len);
+	FIELD_COPY(sched_rq, rq, extra_len);
+	FIELD_COPY(sched_rq, rq, sense_len);
+	FIELD_COPY(sched_rq, rq, errors);
+	FIELD_COPY(sched_rq, rq, retries);
+
+	blk_account_io_completion(sched_rq, blk_rq_bytes(sched_rq));
+	blk_account_io_done(sched_rq);
+
+	wbt_done(sched_rq->q->rq_wb, &sched_rq->issue_stat);
+
+	if (sched_rq->end_io)
+		sched_rq->end_io(sched_rq, error);
+
+	blk_mq_free_request(rq);
+}
+
+struct request *
+blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+				 struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
+{
+	struct blk_mq_alloc_data data;
+	struct request *sched_rq, *rq;
+
+	data.q = hctx->queue;
+	data.flags = BLK_MQ_REQ_NOWAIT;
+	data.ctx = blk_mq_get_ctx(hctx->queue);
+	data.hctx = hctx;
+
+	rq = __blk_mq_alloc_request(&data, 0);
+	blk_mq_put_ctx(data.ctx);
+
+	if (!rq) {
+		blk_mq_stop_hw_queue(hctx);
+		return NULL;
+	}
+
+	sched_rq = get_sched_rq(hctx);
+
+	if (!sched_rq) {
+		blk_queue_enter_live(hctx->queue);
+		__blk_mq_free_request(hctx, data.ctx, rq);
+		return NULL;
+	}
+
+	rq_copy(rq, sched_rq);
+	rq->end_io = sched_rq_end_io;
+	rq->end_io_data = sched_rq;
+
+	return rq;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_request_from_shadow);
+
+void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
+{
+	struct elevator_queue *e = hctx->queue->elevator;
+	struct request *rq;
+	LIST_HEAD(rq_list);
+
+	if (unlikely(blk_mq_hctx_stopped(hctx)))
+		return;
+
+	hctx->run++;
+
+	if (!list_empty(&hctx->dispatch)) {
+		spin_lock(&hctx->lock);
+		if (!list_empty(&hctx->dispatch))
+			list_splice_init(&hctx->dispatch, &rq_list);
+		spin_unlock(&hctx->lock);
+	}
+
+	while ((rq = e->type->mq_ops.dispatch_request(hctx)) != NULL)
+		list_add_tail(&rq->queuelist, &rq_list);
+
+	blk_mq_dispatch_rq_list(hctx, &rq_list);
+}
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
new file mode 100644
index 000000000000..125e14e5274a
--- /dev/null
+++ b/block/blk-mq-sched.h
@@ -0,0 +1,168 @@
+#ifndef BLK_MQ_SCHED_H
+#define BLK_MQ_SCHED_H
+
+#include "blk-mq.h"
+
+struct blk_mq_hw_ctx;
+struct blk_mq_ctx;
+struct request_queue;
+
+struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth, unsigned int numa_node);
+void blk_mq_sched_free_requests(struct blk_mq_tags *tags);
+
+int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
+				void (*init)(struct blk_mq_hw_ctx *));
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+				 void (*exit)(struct blk_mq_hw_ctx *));
+
+void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
+				      struct request *rq);
+struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
+						  struct blk_mq_alloc_data *data,
+						  struct blk_mq_tags *tags,
+						  atomic_t *wait_index);
+struct request *
+blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+				 struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *));
+
+
+struct blk_mq_alloc_data {
+	/* input parameter */
+	struct request_queue *q;
+	unsigned int flags;
+
+	/* input & output parameter */
+	struct blk_mq_ctx *ctx;
+	struct blk_mq_hw_ctx *hctx;
+};
+
+static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
+		struct request_queue *q, unsigned int flags,
+		struct blk_mq_ctx *ctx, struct blk_mq_hw_ctx *hctx)
+{
+	data->q = q;
+	data->flags = flags;
+	data->ctx = ctx;
+	data->hctx = hctx;
+}
+
+void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
+
+static inline bool
+blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (blk_queue_nomerges(q) || !bio_mergeable(bio))
+		return false;
+
+	if (e) {
+		struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
+		struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+
+		blk_mq_put_ctx(ctx);
+		return e->type->mq_ops.bio_merge(hctx, bio);
+	}
+
+	return false;
+}
+
+static inline struct request *
+blk_mq_sched_get_request(struct request_queue *q, struct bio *bio,
+			 struct blk_mq_alloc_data *data)
+{
+	struct elevator_queue *e = q->elevator;
+	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_ctx *ctx;
+	struct request *rq;
+
+	blk_queue_enter_live(q);
+	ctx = blk_mq_get_ctx(q);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
+
+	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
+
+	if (e)
+		rq = e->type->mq_ops.get_request(q, bio, data);
+	else
+		rq = __blk_mq_alloc_request(data, bio->bi_opf);
+
+	if (rq)
+		data->hctx->queued++;
+
+	return rq;
+
+}
+
+static inline void
+blk_mq_sched_insert_request(struct request *rq, bool at_head, bool run_queue,
+			    bool async)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+
+	if (e)
+		e->type->mq_ops.insert_request(hctx, rq, at_head);
+	else {
+		spin_lock(&ctx->lock);
+		__blk_mq_insert_request(hctx, rq, at_head);
+		spin_unlock(&ctx->lock);
+	}
+
+	if (run_queue)
+		blk_mq_run_hw_queue(hctx, async);
+}
+
+static inline bool
+blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
+			 struct bio *bio)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->mq_ops.allow_merge)
+		return e->type->mq_ops.allow_merge(q, rq, bio);
+
+	return true;
+}
+
+static inline void
+blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (e && e->type->mq_ops.completed_request)
+		e->type->mq_ops.completed_request(hctx, rq);
+}
+
+static inline void blk_mq_sched_started_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->mq_ops.started_request)
+		e->type->mq_ops.started_request(rq);
+}
+
+static inline void blk_mq_sched_requeue_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->mq_ops.requeue_request)
+		e->type->mq_ops.requeue_request(rq);
+}
+
+static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
+{
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (e && e->type->mq_ops.has_work)
+		return e->type->mq_ops.has_work(hctx);
+
+	return false;
+}
+
+
+#endif
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

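To make the shadow-request flow in the patch above easier to follow: in this version the scheduler queues its own shadow requests, allocated from a private tag map via blk_mq_sched_alloc_requests(), and only at dispatch time does blk_mq_sched_request_from_shadow() allocate a real driver-tag-backed request, copy the shadow's fields across with rq_copy() and chain completion back through sched_rq_end_io(). Below is a minimal sketch of the scheduler side of that hand-off; the sketch_* names are invented and locking is omitted for brevity, so treat it as an illustration of the helpers, not as code from the series.

#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include "blk-mq-sched.h"

/* Per-hctx scheduler state; would live in hctx->sched_data. */
struct sketch_hctx_data {
	struct list_head fifo;	/* shadow requests in FIFO order */
};

/* Callback handed to blk_mq_sched_request_from_shadow(). */
static struct request *sketch_next_shadow_rq(struct blk_mq_hw_ctx *hctx)
{
	struct sketch_hctx_data *shd = hctx->sched_data;
	struct request *rq;

	if (list_empty(&shd->fifo))
		return NULL;

	rq = list_first_entry(&shd->fifo, struct request, queuelist);
	list_del_init(&rq->queuelist);
	return rq;
}

/* Wired up as the scheduler's dispatch_request hook in this version. */
static struct request *sketch_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
	/*
	 * Returns a real request ready for ->queue_rq(), or NULL if the
	 * FIFO is empty or no driver request could be allocated (the
	 * helper stops the hw queue in that case).
	 */
	return blk_mq_sched_request_from_shadow(hctx, sketch_next_shadow_rq);
}

The shadow request itself never reaches the driver; only the copied request does, which is what lets the scheduler keep an arbitrarily deep internal queue while the driver-visible tag space stays bounded.
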
end of thread, other threads:[~2016-12-15 21:44 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-08 20:13 [PATCHSET/RFC v2] blk-mq scheduling framework Jens Axboe
2016-12-08 20:13 ` [PATCH 1/7] blk-mq: add blk_mq_start_stopped_hw_queue() Jens Axboe
2016-12-13  8:48   ` Bart Van Assche
2016-12-08 20:13 ` [PATCH 2/7] blk-mq: abstract out blk_mq_dispatch_rq_list() helper Jens Axboe
2016-12-09  6:44   ` Hannes Reinecke
2016-12-13  8:51   ` Bart Van Assche
2016-12-13 15:05     ` Jens Axboe
2016-12-13  9:18   ` Ritesh Harjani
2016-12-13  9:29     ` Bart Van Assche
2016-12-08 20:13 ` [PATCH 3/7] elevator: make the rqhash helpers exported Jens Axboe
2016-12-09  6:45   ` Hannes Reinecke
2016-12-08 20:13 ` [PATCH 4/7] blk-flush: run the queue when inserting blk-mq flush Jens Axboe
2016-12-09  6:45   ` Hannes Reinecke
2016-12-08 20:13 ` [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers Jens Axboe
2016-12-13 13:56   ` Bart Van Assche
2016-12-13 15:14     ` Jens Axboe
2016-12-14 10:31       ` Bart Van Assche
2016-12-14 15:05         ` Jens Axboe
2016-12-13 14:29   ` Bart Van Assche
2016-12-13 15:20     ` Jens Axboe
2016-12-08 20:13 ` [PATCH 6/7] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
2016-12-13 11:04   ` Bart Van Assche
2016-12-13 15:08     ` Jens Axboe
2016-12-14  8:09   ` Bart Van Assche
2016-12-14 15:02     ` Jens Axboe
2016-12-08 20:13 ` [PATCH 7/7] blk-mq-sched: allow setting of default " Jens Axboe
2016-12-13 10:13   ` Bart Van Assche
2016-12-13 15:06     ` Jens Axboe
2016-12-13  9:26 ` [PATCHSET/RFC v2] blk-mq scheduling framework Paolo Valente
2016-12-13 15:17   ` Jens Axboe
2016-12-13 16:15     ` Paolo Valente
2016-12-13 16:28       ` Jens Axboe
2016-12-13 21:51         ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2016-12-15  5:26 [PATCHSET v3] " Jens Axboe
2016-12-15  5:26 ` [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers Jens Axboe
2016-12-15 19:29   ` Omar Sandoval
2016-12-15 20:14     ` Jens Axboe
2016-12-15 21:44     ` Jens Axboe
2016-12-07 23:09 [PATCHSET/RFC] blk-mq scheduling framework Jens Axboe
2016-12-07 23:09 ` [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers Jens Axboe
