linux-kernel.vger.kernel.org archive mirror
* [PATCHSET/RFC v2] Make legacy IO schedulers work with blk-mq
@ 2016-12-05 18:26 Jens Axboe
  2016-12-05 18:27 ` [PATCH 1/7] block: use legacy path for flush requests for MQ with a scheduler Jens Axboe
                   ` (7 more replies)
  0 siblings, 8 replies; 10+ messages in thread
From: Jens Axboe @ 2016-12-05 18:26 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente

Version 2 of the hack/patchset, that enables blk-mq to use the legacy
IO schedulers with single queue devices. Original posting is here:

https://marc.info/?l=linux-block&m=148073493203664&w=2

You can also find this version in the following git branch:

git://git.kernel.dk/linux-block blk-mq-legacy-sched.2

and new developments/fixes will happen in the 'blk-mq-legacy-sched'
branch.
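
For example, the branch can be checked out with something like this
(just a sketch; the target directory name is arbitrary):

  $ git clone git://git.kernel.dk/linux-block -b blk-mq-legacy-sched.2 linux-block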

Changes since v1:

- Remove the default 'deadline' hard wiring, and provide Kconfig
  entries to set the blk-mq scheduler. This now works like for legacy
  devices.

- Rename blk_use_mq_path() to blk_use_sched_path() to make it more
  clear. Suggested by Johannes Thumshirn.

- Fixup a spot where we did not use the accessor function to determine
  what path to use.

- Flush mq software queues, even if IO scheduler managed. This should
  make paths work that are MQ aware, and using only MQ interfaces.

- Cleanup free path of MQ request.

- Account when IO was queued to a hardware context, similarly to the
  regular MQ path.

- Add BLK_MQ_F_NO_SCHED flag, so that drivers can explicitly ask for
  no scheduling on a queue. Add this for NVMe admin queues.

- Kill BLK_MQ_F_SCHED, since we now have Kconfig entries for setting
  the desired IO scheduler.

- Fix issues with live scheduler switching through sysfs. Should now
  be solid, even with lots of IO running on the device.

- Drop null_blk and SCSI changes, not needed anymore.


 block/Kconfig.iosched   |   29 ++++
 block/blk-core.c        |   77 ++++++-----
 block/blk-exec.c        |   12 +
 block/blk-flush.c       |   40 +++--
 block/blk-merge.c       |    5 
 block/blk-mq.c          |  332 +++++++++++++++++++++++++++++++++++++++++++++---
 block/blk-mq.h          |    1 
 block/blk-sysfs.c       |    2 
 block/blk.h             |   16 ++
 block/cfq-iosched.c     |   22 ++-
 block/elevator.c        |  125 ++++++++++++------
 drivers/nvme/host/pci.c |    1 
 include/linux/blk-mq.h  |    1 
 include/linux/blkdev.h  |    2 
 14 files changed, 555 insertions(+), 110 deletions(-)


* [PATCH 1/7] block: use legacy path for flush requests for MQ with a scheduler
  2016-12-05 18:26 [PATCHSET/RFC v2] Make legacy IO schedulers work with blk-mq Jens Axboe
@ 2016-12-05 18:27 ` Jens Axboe
  2016-12-05 18:27 ` [PATCH 2/7] cfq-iosched: use appropriate run queue function Jens Axboe
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2016-12-05 18:27 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, Jens Axboe

No functional changes with this patch; it's just in preparation for
supporting legacy schedulers on blk-mq.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-core.c  |  2 +-
 block/blk-exec.c  |  2 +-
 block/blk-flush.c | 26 ++++++++++++++------------
 block/blk.h       | 12 +++++++++++-
 4 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4b7ec5958055..813c448453bf 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1310,7 +1310,7 @@ static struct request *blk_old_get_request(struct request_queue *q, int rw,
 
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
-	if (q->mq_ops)
+	if (!blk_use_sched_path(q))
 		return blk_mq_alloc_request(q, rw,
 			(gfp_mask & __GFP_DIRECT_RECLAIM) ?
 				0 : BLK_MQ_REQ_NOWAIT);
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 3ecb00a6cf45..3356dff5508c 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -64,7 +64,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 	 * don't check dying flag for MQ because the request won't
 	 * be reused after dying flag is set
 	 */
-	if (q->mq_ops) {
+	if (!blk_use_sched_path(q)) {
 		blk_mq_insert_request(rq, at_head, true, false);
 		return;
 	}
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 1bdbb3d3e5f5..040c36b83ef7 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -133,14 +133,16 @@ static void blk_flush_restore_request(struct request *rq)
 
 static bool blk_flush_queue_rq(struct request *rq, bool add_front)
 {
-	if (rq->q->mq_ops) {
+	struct request_queue *q = rq->q;
+
+	if (!blk_use_sched_path(q)) {
 		blk_mq_add_to_requeue_list(rq, add_front, true);
 		return false;
 	} else {
 		if (add_front)
-			list_add(&rq->queuelist, &rq->q->queue_head);
+			list_add(&rq->queuelist, &q->queue_head);
 		else
-			list_add_tail(&rq->queuelist, &rq->q->queue_head);
+			list_add_tail(&rq->queuelist, &q->queue_head);
 		return true;
 	}
 }
@@ -201,7 +203,7 @@ static bool blk_flush_complete_seq(struct request *rq,
 		BUG_ON(!list_empty(&rq->queuelist));
 		list_del_init(&rq->flush.list);
 		blk_flush_restore_request(rq);
-		if (q->mq_ops)
+		if (!blk_use_sched_path(q))
 			blk_mq_end_request(rq, error);
 		else
 			__blk_end_request_all(rq, error);
@@ -224,7 +226,7 @@ static void flush_end_io(struct request *flush_rq, int error)
 	unsigned long flags = 0;
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, flush_rq->mq_ctx);
 
-	if (q->mq_ops) {
+	if (!blk_use_sched_path(q)) {
 		struct blk_mq_hw_ctx *hctx;
 
 		/* release the tag's ownership to the req cloned from */
@@ -240,7 +242,7 @@ static void flush_end_io(struct request *flush_rq, int error)
 	/* account completion of the flush request */
 	fq->flush_running_idx ^= 1;
 
-	if (!q->mq_ops)
+	if (blk_use_sched_path(q))
 		elv_completed_request(q, flush_rq);
 
 	/* and push the waiting requests to the next stage */
@@ -267,7 +269,7 @@ static void flush_end_io(struct request *flush_rq, int error)
 		blk_run_queue_async(q);
 	}
 	fq->flush_queue_delayed = 0;
-	if (q->mq_ops)
+	if (!blk_use_sched_path(q))
 		spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
 }
 
@@ -315,7 +317,7 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq)
 	 * be in flight at the same time. And acquire the tag's
 	 * ownership for flush req.
 	 */
-	if (q->mq_ops) {
+	if (!blk_use_sched_path(q)) {
 		struct blk_mq_hw_ctx *hctx;
 
 		flush_rq->mq_ctx = first_rq->mq_ctx;
@@ -409,7 +411,7 @@ void blk_insert_flush(struct request *rq)
 	 * complete the request.
 	 */
 	if (!policy) {
-		if (q->mq_ops)
+		if (!blk_use_sched_path(q))
 			blk_mq_end_request(rq, 0);
 		else
 			__blk_end_bidi_request(rq, 0, 0, 0);
@@ -425,9 +427,9 @@ void blk_insert_flush(struct request *rq)
 	 */
 	if ((policy & REQ_FSEQ_DATA) &&
 	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
-		if (q->mq_ops) {
+		if (!blk_use_sched_path(q))
 			blk_mq_insert_request(rq, false, false, true);
-		} else
+		else
 			list_add_tail(&rq->queuelist, &q->queue_head);
 		return;
 	}
@@ -440,7 +442,7 @@ void blk_insert_flush(struct request *rq)
 	INIT_LIST_HEAD(&rq->flush.list);
 	rq->rq_flags |= RQF_FLUSH_SEQ;
 	rq->flush.saved_end_io = rq->end_io; /* Usually NULL */
-	if (q->mq_ops) {
+	if (!blk_use_sched_path(q)) {
 		rq->end_io = mq_flush_data_end_io;
 
 		spin_lock_irq(&fq->mq_flush_lock);
diff --git a/block/blk.h b/block/blk.h
index 041185e5f129..600e22ea62e9 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -36,10 +36,20 @@ extern struct kmem_cache *request_cachep;
 extern struct kobj_type blk_queue_ktype;
 extern struct ida blk_queue_ida;
 
+/*
+ * Use the MQ path if we have mq_ops, but not if we are using an IO
+ * scheduler. For the scheduler, we should use the legacy path. Only
+ * for internal use in the block layer.
+ */
+static inline bool blk_use_sched_path(struct request_queue *q)
+{
+	return !q->mq_ops || q->elevator;
+}
+
 static inline struct blk_flush_queue *blk_get_flush_queue(
 		struct request_queue *q, struct blk_mq_ctx *ctx)
 {
-	if (q->mq_ops)
+	if (!blk_use_sched_path(q))
 		return blk_mq_map_queue(q, ctx->cpu)->fq;
 	return q->fq;
 }
-- 
2.7.4


* [PATCH 2/7] cfq-iosched: use appropriate run queue function
  2016-12-05 18:26 [PATCHSET/RFC v2] Make legacy IO schedulers work with blk-mq Jens Axboe
  2016-12-05 18:27 ` [PATCH 1/7] block: use legacy path for flush requests for MQ with a scheduler Jens Axboe
@ 2016-12-05 18:27 ` Jens Axboe
  2016-12-05 18:27 ` [PATCH 3/7] block: use appropriate queue running functions Jens Axboe
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2016-12-05 18:27 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, Jens Axboe

For MQ devices, we have to use different functions to run the queue.
No functional changes in this patch, just a prep patch for
supporting legacy schedulers on blk-mq.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/cfq-iosched.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c73a6fcaeb9d..d6d454a72bd4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -919,8 +919,14 @@ static inline struct cfq_data *cic_to_cfqd(struct cfq_io_cq *cic)
 static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
 {
 	if (cfqd->busy_queues) {
+		struct request_queue *q = cfqd->queue;
+
 		cfq_log(cfqd, "schedule dispatch");
-		kblockd_schedule_work(&cfqd->unplug_work);
+
+		if (q->mq_ops)
+			blk_mq_run_hw_queues(q, true);
+		else
+			kblockd_schedule_work(&cfqd->unplug_work);
 	}
 }
 
@@ -4086,6 +4092,16 @@ static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	cfq_mark_cfqq_slice_new(cfqq);
 }
 
+static void cfq_run_queue(struct cfq_data *cfqd)
+{
+	struct request_queue *q = cfqd->queue;
+
+	if (q->mq_ops)
+		blk_mq_run_hw_queues(q, true);
+	else
+		__blk_run_queue(q);
+}
+
 /*
  * Called when a new fs request (rq) is added (to cfqq). Check if there's
  * something we should do about it
@@ -4122,7 +4138,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 			    cfqd->busy_queues > 1) {
 				cfq_del_timer(cfqd, cfqq);
 				cfq_clear_cfqq_wait_request(cfqq);
-				__blk_run_queue(cfqd->queue);
+				cfq_run_queue(cfqd);
 			} else {
 				cfqg_stats_update_idle_time(cfqq->cfqg);
 				cfq_mark_cfqq_must_dispatch(cfqq);
@@ -4136,7 +4152,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		 * this new queue is RT and the current one is BE
 		 */
 		cfq_preempt_queue(cfqd, cfqq);
-		__blk_run_queue(cfqd->queue);
+		cfq_run_queue(cfqd);
 	}
 }
 
-- 
2.7.4


* [PATCH 3/7] block: use appropriate queue running functions
  2016-12-05 18:26 [PATCHSET/RFC v2] Make legacy IO schedulers work with blk-mq Jens Axboe
  2016-12-05 18:27 ` [PATCH 1/7] block: use legacy path for flush requests for MQ with a scheduler Jens Axboe
  2016-12-05 18:27 ` [PATCH 2/7] cfq-iosched: use appropriate run queue function Jens Axboe
@ 2016-12-05 18:27 ` Jens Axboe
  2016-12-05 18:27 ` [PATCH 4/7] blk-mq: blk_account_io_start() takes a bool Jens Axboe
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2016-12-05 18:27 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, Jens Axboe

Use MQ variants for MQ, legacy ones for legacy.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-core.c  |  5 ++++-
 block/blk-exec.c  | 10 ++++++++--
 block/blk-flush.c | 14 ++++++++++----
 block/elevator.c  |  5 ++++-
 4 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 813c448453bf..f0aa810a5fe2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -340,7 +340,10 @@ void __blk_run_queue(struct request_queue *q)
 	if (unlikely(blk_queue_stopped(q)))
 		return;
 
-	__blk_run_queue_uncond(q);
+	if (WARN_ON_ONCE(q->mq_ops))
+		blk_mq_run_hw_queues(q, true);
+	else
+		__blk_run_queue_uncond(q);
 }
 EXPORT_SYMBOL(__blk_run_queue);
 
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 3356dff5508c..6c3f12b32f86 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -80,8 +80,14 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 	}
 
 	__elv_add_request(q, rq, where);
-	__blk_run_queue(q);
-	spin_unlock_irq(q->queue_lock);
+
+	if (q->mq_ops) {
+		spin_unlock_irq(q->queue_lock);
+		blk_mq_run_hw_queues(q, false);
+	} else {
+		__blk_run_queue(q);
+		spin_unlock_irq(q->queue_lock);
+	}
 }
 EXPORT_SYMBOL_GPL(blk_execute_rq_nowait);
 
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 040c36b83ef7..8f2354d97e17 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -265,8 +265,10 @@ static void flush_end_io(struct request *flush_rq, int error)
 	 * kblockd.
 	 */
 	if (queued || fq->flush_queue_delayed) {
-		WARN_ON(q->mq_ops);
-		blk_run_queue_async(q);
+		if (q->mq_ops)
+			blk_mq_run_hw_queues(q, true);
+		else
+			blk_run_queue_async(q);
 	}
 	fq->flush_queue_delayed = 0;
 	if (!blk_use_sched_path(q))
@@ -346,8 +348,12 @@ static void flush_data_end_io(struct request *rq, int error)
 	 * After populating an empty queue, kick it to avoid stall.  Read
 	 * the comment in flush_end_io().
 	 */
-	if (blk_flush_complete_seq(rq, fq, REQ_FSEQ_DATA, error))
-		blk_run_queue_async(q);
+	if (blk_flush_complete_seq(rq, fq, REQ_FSEQ_DATA, error)) {
+		if (q->mq_ops)
+			blk_mq_run_hw_queues(q, true);
+		else
+			blk_run_queue_async(q);
+	}
 }
 
 static void mq_flush_data_end_io(struct request *rq, int error)
diff --git a/block/elevator.c b/block/elevator.c
index a18a5db274e4..11d2cfee2bc1 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -627,7 +627,10 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where)
 		 *   with anything.  There's no point in delaying queue
 		 *   processing.
 		 */
-		__blk_run_queue(q);
+		if (q->mq_ops)
+			blk_mq_run_hw_queues(q, true);
+		else
+			__blk_run_queue(q);
 		break;
 
 	case ELEVATOR_INSERT_SORT_MERGE:
-- 
2.7.4


* [PATCH 4/7] blk-mq: blk_account_io_start() takes a bool
  2016-12-05 18:26 [PATCHSET/RFC v2] Make legacy IO schedulers work with blk-mq Jens Axboe
                   ` (2 preceding siblings ...)
  2016-12-05 18:27 ` [PATCH 3/7] block: use appropriate queue running functions Jens Axboe
@ 2016-12-05 18:27 ` Jens Axboe
  2016-12-05 18:27 ` [PATCH 5/7] blk-mq: add BLK_MQ_F_NO_SCHED flag Jens Axboe
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2016-12-05 18:27 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, Jens Axboe

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bac12caece06..90db5b490df9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1237,7 +1237,7 @@ static void blk_mq_bio_to_request(struct request *rq, struct bio *bio)
 {
 	init_request_from_bio(rq, bio);
 
-	blk_account_io_start(rq, 1);
+	blk_account_io_start(rq, true);
 }
 
 static inline bool hctx_allow_merges(struct blk_mq_hw_ctx *hctx)
-- 
2.7.4


* [PATCH 5/7] blk-mq: add BLK_MQ_F_NO_SCHED flag
  2016-12-05 18:26 [PATCHSET/RFC v2] Make legacy IO schedulers work with blk-mq Jens Axboe
                   ` (3 preceding siblings ...)
  2016-12-05 18:27 ` [PATCH 4/7] blk-mq: blk_account_io_start() takes a bool Jens Axboe
@ 2016-12-05 18:27 ` Jens Axboe
  2016-12-05 18:27 ` [PATCH 6/7] blk-mq: test patch to get legacy IO schedulers working Jens Axboe
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2016-12-05 18:27 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, Jens Axboe

Drivers can use this to prevent IO scheduling on a queue. Use it
for the NVMe admin queue, which doesn't handle file system
requests.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 drivers/nvme/host/pci.c | 1 +
 include/linux/blk-mq.h  | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d58f8e4e2c06..4560575f0a39 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1187,6 +1187,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
 		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
 		dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
+		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
 		dev->admin_tagset.driver_data = dev;
 
 		if (blk_mq_alloc_tag_set(&dev->admin_tagset))
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 35a0af5ede6d..a85bd83bb218 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -151,6 +151,7 @@ enum {
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
 	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
+	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
 
-- 
2.7.4


* [PATCH 6/7] blk-mq: test patch to get legacy IO schedulers working
  2016-12-05 18:26 [PATCHSET/RFC v2] Make legacy IO schedulers work with blk-mq Jens Axboe
                   ` (4 preceding siblings ...)
  2016-12-05 18:27 ` [PATCH 5/7] blk-mq: add BLK_MQ_F_NO_SCHED flag Jens Axboe
@ 2016-12-05 18:27 ` Jens Axboe
  2016-12-05 18:27 ` [PATCH 7/7] block: drop irq+lock when flushing queue plugs Jens Axboe
  2016-12-06 10:01 ` [PATCHSET/RFC v2] Make legacy IO schedulers work with blk-mq Paolo Valente
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2016-12-05 18:27 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, Jens Axboe

With this applied, a single queue blk-mq managed device can use
any of the legacy IO schedulers. The default is selected through
Kconfig, which adds a set of MQ scheduler choices. As with legacy
devices, the IO scheduler can be switched at runtime by echoing
something else into:

/sys/block/<dev>/queue/scheduler
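
For example (a sketch; the device name and the exact list of
schedulers shown depend on the system and Kconfig):

  # cat /sys/block/sda/queue/scheduler
  [none] noop deadline cfq
  # echo cfq > /sys/block/sda/queue/scheduler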

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/Kconfig.iosched  |  29 +++++
 block/blk-core.c       |  44 ++++---
 block/blk-merge.c      |   5 +
 block/blk-mq.c         | 330 ++++++++++++++++++++++++++++++++++++++++++++++---
 block/blk-mq.h         |   1 +
 block/blk-sysfs.c      |   2 +-
 block/blk.h            |   4 +
 block/elevator.c       | 120 ++++++++++++------
 include/linux/blkdev.h |   2 +
 9 files changed, 469 insertions(+), 68 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 421bef9c4c48..50de4589f38e 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -63,6 +63,35 @@ config DEFAULT_IOSCHED
 	default "cfq" if DEFAULT_CFQ
 	default "noop" if DEFAULT_NOOP
 
+choice
+	prompt "Default I/O scheduler for MQ devices"
+	default DEFAULT_MQ_SCHED_NONE
+	help
+	  Select the I/O scheduler which will be used by default for all
+	  multiqueue block devices. Note that this only applies for
+	  multiqueue devices that expose a single queue.
+
+	config DEFAULT_MQ_SCHED_NONE
+		bool "None"
+
+	config DEFAULT_MQ_DEADLINE
+		bool "Deadline" if IOSCHED_DEADLINE=y
+
+	config DEFAULT_MQ_CFQ
+		bool "CFQ" if IOSCHED_CFQ=y
+
+	config DEFAULT_MQ_NOOP
+		bool "No-op"
+
+endchoice
+
+config DEFAULT_MQ_IOSCHED
+	string
+	default "deadline" if DEFAULT_MQ_DEADLINE
+	default "cfq" if DEFAULT_MQ_CFQ
+	default "noop" if DEFAULT_MQ_NOOP
+	default "none" if DEFAULT_MQ_SCHED_NONE
+
 endmenu
 
 endif
diff --git a/block/blk-core.c b/block/blk-core.c
index f0aa810a5fe2..8be12ba91f8e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1160,6 +1160,14 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 	rq->cmd_flags = op;
 	rq->rq_flags = rq_flags;
 
+	/*
+	 * If we're allocating a legacy request off the MQ path, mark
+	 * the request as such. This is so that the free path does the
+	 * right thing.
+	 */
+	if (q->mq_ops)
+		rq->rq_flags |= RQF_MQ_RL;
+
 	/* init elvpriv */
 	if (rq_flags & RQF_ELVPRIV) {
 		if (unlikely(et->icq_cache && !icq)) {
@@ -1246,8 +1254,8 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
  * Returns ERR_PTR on failure, with @q->queue_lock held.
  * Returns request pointer on success, with @q->queue_lock *not held*.
  */
-static struct request *get_request(struct request_queue *q, unsigned int op,
-		struct bio *bio, gfp_t gfp_mask)
+struct request *get_request(struct request_queue *q, unsigned int op,
+			    struct bio *bio, gfp_t gfp_mask)
 {
 	const bool is_sync = op_is_sync(op);
 	DEFINE_WAIT(wait);
@@ -1430,7 +1438,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	if (unlikely(!q))
 		return;
 
-	if (q->mq_ops) {
+	if (q->mq_ops && !(req->rq_flags & RQF_MQ_RL)) {
 		blk_mq_free_request(req);
 		return;
 	}
@@ -1466,7 +1474,7 @@ void blk_put_request(struct request *req)
 {
 	struct request_queue *q = req->q;
 
-	if (q->mq_ops)
+	if (q->mq_ops && !(req->rq_flags & RQF_MQ_RL))
 		blk_mq_free_request(req);
 	else {
 		unsigned long flags;
@@ -1556,6 +1564,15 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
 	return true;
 }
 
+struct list_head *blk_get_plug_list(struct request_queue *q,
+				    struct blk_plug *plug)
+{
+	if (!blk_use_sched_path(q))
+		return &plug->mq_list;
+
+	return &plug->list;
+}
+
 /**
  * blk_attempt_plug_merge - try to merge with %current's plugged list
  * @q: request_queue new bio is being queued at
@@ -1592,10 +1609,7 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 		goto out;
 	*request_count = 0;
 
-	if (q->mq_ops)
-		plug_list = &plug->mq_list;
-	else
-		plug_list = &plug->list;
+	plug_list = blk_get_plug_list(q, plug);
 
 	list_for_each_entry_reverse(rq, plug_list, queuelist) {
 		int el_ret;
@@ -1640,10 +1654,7 @@ unsigned int blk_plug_queued_count(struct request_queue *q)
 	if (!plug)
 		goto out;
 
-	if (q->mq_ops)
-		plug_list = &plug->mq_list;
-	else
-		plug_list = &plug->list;
+	plug_list = blk_get_plug_list(q, plug);
 
 	list_for_each_entry(rq, plug_list, queuelist) {
 		if (rq->q == q)
@@ -3198,7 +3209,9 @@ static void queue_unplugged(struct request_queue *q, unsigned int depth,
 {
 	trace_block_unplug(q, depth, !from_schedule);
 
-	if (from_schedule)
+	if (q->mq_ops)
+		blk_mq_run_hw_queues(q, true);
+	else if (from_schedule)
 		blk_run_queue_async(q);
 	else
 		__blk_run_queue(q);
@@ -3294,7 +3307,10 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 		 * Short-circuit if @q is dead
 		 */
 		if (unlikely(blk_queue_dying(q))) {
-			__blk_end_request_all(rq, -ENODEV);
+			if (q->mq_ops)
+				blk_mq_end_request(rq, -ENODEV);
+			else
+				__blk_end_request_all(rq, -ENODEV);
 			continue;
 		}
 
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1002afdfee99..0952e0503aa4 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -754,6 +754,11 @@ static int attempt_merge(struct request_queue *q, struct request *req,
 	/* owner-ship of bio passed from next to req */
 	next->bio = NULL;
 	__blk_put_request(q, next);
+
+	/* FIXME: MQ+sched holds a reference */
+	if (q->mq_ops && q->elevator)
+		blk_queue_exit(q);
+
 	return 1;
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 90db5b490df9..b6b6dd9590f6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -821,6 +821,151 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
+static void rq_copy(struct request *rq, struct request *src)
+{
+#define FIELD_COPY(dst, src, name)	((dst)->name = (src)->name)
+	FIELD_COPY(rq, src, cpu);
+	FIELD_COPY(rq, src, cmd_type);
+	FIELD_COPY(rq, src, cmd_flags);
+	rq->rq_flags |= (src->rq_flags & (RQF_PREEMPT | RQF_QUIET | RQF_PM | RQF_DONTPREP));
+	rq->rq_flags &= ~RQF_IO_STAT;
+	FIELD_COPY(rq, src, __data_len);
+	FIELD_COPY(rq, src, __sector);
+	FIELD_COPY(rq, src, bio);
+	FIELD_COPY(rq, src, biotail);
+	FIELD_COPY(rq, src, rq_disk);
+	FIELD_COPY(rq, src, part);
+	FIELD_COPY(rq, src, nr_phys_segments);
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+	FIELD_COPY(rq, src, nr_integrity_segments);
+#endif
+	FIELD_COPY(rq, src, ioprio);
+	FIELD_COPY(rq, src, timeout);
+
+	if (src->cmd_type == REQ_TYPE_BLOCK_PC) {
+		FIELD_COPY(rq, src, cmd);
+		FIELD_COPY(rq, src, cmd_len);
+		FIELD_COPY(rq, src, extra_len);
+		FIELD_COPY(rq, src, sense_len);
+		FIELD_COPY(rq, src, resid_len);
+		FIELD_COPY(rq, src, sense);
+		FIELD_COPY(rq, src, retries);
+	}
+
+	src->bio = src->biotail = NULL;
+}
+
+static void sched_rq_end_io(struct request *rq, int error)
+{
+	struct request *sched_rq = rq->end_io_data;
+	struct request_queue *q = rq->q;
+	unsigned long flags;
+
+	FIELD_COPY(sched_rq, rq, resid_len);
+	FIELD_COPY(sched_rq, rq, extra_len);
+	FIELD_COPY(sched_rq, rq, sense_len);
+	FIELD_COPY(sched_rq, rq, errors);
+	FIELD_COPY(sched_rq, rq, retries);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	blk_finish_request(sched_rq, error);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	blk_mq_free_request(rq);
+	blk_mq_start_stopped_hw_queues(q, true);
+}
+
+/*
+ * Pull off the elevator dispatch list and send it to the driver. Note that
+ * we have to transform the fake requests into real requests
+ */
+static void blk_mq_sched_dispatch(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	struct request *rq, *sched_rq;
+	struct blk_mq_alloc_data alloc_data;
+	struct blk_mq_queue_data bd;
+	LIST_HEAD(rq_list);
+	int queued = 0, ret;
+
+	if (unlikely(blk_mq_hctx_stopped(hctx)))
+		return;
+
+	flush_busy_ctxs(hctx, &rq_list);
+
+	hctx->run++;
+
+again:
+	rq = NULL;
+	if (!list_empty(&hctx->dispatch) || !list_empty(&rq_list)) {
+		spin_lock_irq(&hctx->lock);
+		if (!list_empty(&rq_list))
+			list_splice_tail_init(&rq_list, &hctx->dispatch);
+		if (!list_empty(&hctx->dispatch)) {
+			rq = list_first_entry(&hctx->dispatch, struct request, queuelist);
+			list_del_init(&rq->queuelist);
+		}
+		spin_unlock_irq(&hctx->lock);
+	}
+
+	if (!rq) {
+		alloc_data.q = q;
+		alloc_data.flags = BLK_MQ_REQ_NOWAIT;
+		alloc_data.ctx = blk_mq_get_ctx(q);
+		alloc_data.hctx = hctx;
+
+		rq = __blk_mq_alloc_request(&alloc_data, 0);
+		blk_mq_put_ctx(alloc_data.ctx);
+
+		if (!rq) {
+			blk_mq_stop_hw_queue(hctx);
+			goto done;
+		}
+
+		spin_lock_irq(q->queue_lock);
+		sched_rq = blk_fetch_request(q);
+		spin_unlock_irq(q->queue_lock);
+
+		if (!sched_rq) {
+			blk_mq_put_tag(hctx, alloc_data.ctx, rq->tag);
+			goto done;
+		}
+
+		rq_copy(rq, sched_rq);
+		rq->end_io = sched_rq_end_io;
+		rq->end_io_data = sched_rq;
+	}
+
+	bd.rq = rq;
+	bd.list = NULL;
+	bd.last = true;
+
+	ret = q->mq_ops->queue_rq(hctx, &bd);
+	switch (ret) {
+	case BLK_MQ_RQ_QUEUE_OK:
+		queued++;
+		break;
+	case BLK_MQ_RQ_QUEUE_BUSY:
+		spin_lock_irq(&hctx->lock);
+		list_add_tail(&rq->queuelist, &hctx->dispatch);
+		spin_unlock_irq(&hctx->lock);
+		blk_mq_stop_hw_queue(hctx);
+		break;
+	default:
+		pr_err("blk-mq: bad return on queue: %d\n", ret);
+	case BLK_MQ_RQ_QUEUE_ERROR:
+		rq->errors = -EIO;
+		blk_mq_end_request(rq, rq->errors);
+		break;
+	}
+
+	if (ret != BLK_MQ_RQ_QUEUE_BUSY)
+		goto again;
+
+done:
+	hctx->dispatched[queued_to_index(queued)]++;
+}
+
 /*
  * Run this hardware queue, pulling any software queues mapped to it in.
  * Note that this function currently has various problems around ordering
@@ -938,11 +1083,17 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 
 	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
 		rcu_read_lock();
-		blk_mq_process_rq_list(hctx);
+		if (!hctx->queue->elevator)
+			blk_mq_process_rq_list(hctx);
+		else
+			blk_mq_sched_dispatch(hctx);
 		rcu_read_unlock();
 	} else {
 		srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
-		blk_mq_process_rq_list(hctx);
+		if (!hctx->queue->elevator)
+			blk_mq_process_rq_list(hctx);
+		else
+			blk_mq_sched_dispatch(hctx);
 		srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
 	}
 }
@@ -992,18 +1143,27 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	kblockd_schedule_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work);
 }
 
+static inline bool hctx_pending_io(struct request_queue *q,
+				   struct blk_mq_hw_ctx *hctx)
+{
+	/*
+	 * For the pure MQ case, we have pending IO if any of the software
+	 * queues are loaded, or we have residual dispatch. If we have
+	 * an IO scheduler attached, we don't know for sure. So just say
+	 * yes, to ensure the queue runs.
+	 */
+	return blk_mq_hctx_has_pending(hctx) ||
+		!list_empty_careful(&hctx->dispatch) || q->elevator;
+}
+
 void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 {
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
 	queue_for_each_hw_ctx(q, hctx, i) {
-		if ((!blk_mq_hctx_has_pending(hctx) &&
-		    list_empty_careful(&hctx->dispatch)) ||
-		    blk_mq_hctx_stopped(hctx))
-			continue;
-
-		blk_mq_run_hw_queue(hctx, async);
+		if (hctx_pending_io(q, hctx))
+			blk_mq_run_hw_queue(hctx, async);
 	}
 }
 EXPORT_SYMBOL(blk_mq_run_hw_queues);
@@ -1448,12 +1608,15 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 {
 	const int is_sync = op_is_sync(bio->bi_opf);
 	const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA);
+	const bool can_merge = !blk_queue_nomerges(q) && bio_mergeable(bio);
+	const bool use_elevator = READ_ONCE(q->elevator);
 	struct blk_plug *plug;
 	unsigned int request_count = 0;
 	struct blk_mq_alloc_data data;
 	struct request *rq;
 	blk_qc_t cookie;
 	unsigned int wb_acct;
+	int where = ELEVATOR_INSERT_SORT;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1464,18 +1627,71 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 
 	blk_queue_split(q, &bio, q->bio_split);
 
-	if (!is_flush_fua && !blk_queue_nomerges(q)) {
+	if (!is_flush_fua && can_merge) {
 		if (blk_attempt_plug_merge(q, bio, &request_count, NULL))
 			return BLK_QC_T_NONE;
 	} else
 		request_count = blk_plug_queued_count(q);
 
+	/*
+	 * Set some defaults - we have just one hardware queue, so
+	 * we don't have to explicitly map it.
+	 */
+	data.hctx = q->queue_hw_ctx[0];
+	data.ctx = NULL;
+
+	if (use_elevator && can_merge) {
+		int el_ret;
+
+		spin_lock_irq(q->queue_lock);
+
+		el_ret = elv_merge(q, &rq, bio);
+		if (el_ret == ELEVATOR_BACK_MERGE) {
+			if (bio_attempt_back_merge(q, rq, bio)) {
+				elv_bio_merged(q, rq, bio);
+				if (!attempt_back_merge(q, rq))
+					elv_merged_request(q, rq, el_ret);
+				goto elv_unlock;
+			}
+		} else if (el_ret == ELEVATOR_FRONT_MERGE) {
+			if (bio_attempt_front_merge(q, rq, bio)) {
+				elv_bio_merged(q, rq, bio);
+				if (!attempt_front_merge(q, rq))
+					elv_merged_request(q, rq, el_ret);
+				goto elv_unlock;
+			}
+		}
+
+		spin_unlock_irq(q->queue_lock);
+	}
+
 	wb_acct = wbt_wait(q->rq_wb, bio, NULL);
 
-	rq = blk_mq_map_request(q, bio, &data);
-	if (unlikely(!rq)) {
-		__wbt_done(q->rq_wb, wb_acct);
-		return BLK_QC_T_NONE;
+	/*
+	 * If we're not using an IO scheduler, just allocate a MQ
+	 * request/tag. If we are using an IO scheduler, use the
+	 * legacy request_list and we'll map this to an MQ request
+	 * at dispatch time.
+	 */
+	if (!use_elevator) {
+		rq = blk_mq_map_request(q, bio, &data);
+		if (unlikely(!rq)) {
+			__wbt_done(q->rq_wb, wb_acct);
+			return BLK_QC_T_NONE;
+		}
+	} else {
+		blk_queue_enter_live(q);
+		spin_lock_irq(q->queue_lock);
+		rq = get_request(q, bio->bi_opf, bio, GFP_NOIO);
+		if (IS_ERR(rq)) {
+			spin_unlock_irq(q->queue_lock);
+			blk_queue_exit(q);
+			__wbt_done(q->rq_wb, wb_acct);
+			goto elv_unlock;
+		}
+
+		data.hctx->queued++;
+		init_request_from_bio(rq, bio);
 	}
 
 	wbt_track(&rq->issue_stat, wb_acct);
@@ -1483,6 +1699,11 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
 	if (unlikely(is_flush_fua)) {
+		if (use_elevator) {
+			init_request_from_bio(rq, bio);
+			where = ELEVATOR_INSERT_FLUSH;
+			goto elv_insert;
+		}
 		blk_mq_bio_to_request(rq, bio);
 		blk_insert_flush(rq);
 		goto run_queue;
@@ -1495,6 +1716,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	 */
 	plug = current->plug;
 	if (plug) {
+		struct list_head *plug_list = blk_get_plug_list(q, plug);
 		struct request *last = NULL;
 
 		blk_mq_bio_to_request(rq, bio);
@@ -1503,14 +1725,15 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		 * @request_count may become stale because of schedule
 		 * out, so check the list again.
 		 */
-		if (list_empty(&plug->mq_list))
+		if (list_empty(plug_list))
 			request_count = 0;
 		if (!request_count)
 			trace_block_plug(q);
 		else
-			last = list_entry_rq(plug->mq_list.prev);
+			last = list_entry_rq(plug_list->prev);
 
-		blk_mq_put_ctx(data.ctx);
+		if (data.ctx)
+			blk_mq_put_ctx(data.ctx);
 
 		if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
 		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
@@ -1518,10 +1741,21 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 			trace_block_plug(q);
 		}
 
-		list_add_tail(&rq->queuelist, &plug->mq_list);
+		list_add_tail(&rq->queuelist, plug_list);
 		return cookie;
 	}
 
+	if (use_elevator) {
+elv_insert:
+		blk_account_io_start(rq, true);
+		spin_lock_irq(q->queue_lock);
+		__elv_add_request(q, rq, where);
+elv_unlock:
+		spin_unlock_irq(q->queue_lock);
+		blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
+		return BLK_QC_T_NONE;
+	}
+
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
 		/*
 		 * For a SYNC request, send it to the hardware immediately. For
@@ -2085,6 +2319,64 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 	blk_mq_sysfs_register(q);
 }
 
+int blk_mq_sched_init(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx = q->queue_hw_ctx[0];
+
+	if (q->nr_hw_queues > 1)
+		return -EINVAL;
+	if (hctx->flags & BLK_MQ_F_NO_SCHED)
+		return -EINVAL;
+	if (q->fq)
+		return 0;
+
+	q->fq = blk_alloc_flush_queue(q, NUMA_NO_NODE, 0);
+	if (!q->fq)
+		goto fail;
+
+	if (blk_init_rl(&q->root_rl, q, GFP_KERNEL))
+		goto fail;
+
+	return 0;
+fail:
+	if (q->fq) {
+		blk_free_flush_queue(q->fq);
+		q->fq = NULL;
+	}
+	return -ENOMEM;
+}
+
+static int blk_sq_sched_init(struct request_queue *q)
+{
+	int ret;
+
+#if defined(CONFIG_DEFAULT_MQ_IOSCHED)
+	if (!strcmp("none", CONFIG_DEFAULT_MQ_IOSCHED))
+		return 0;
+#else
+	return 0;
+#endif
+
+	if (blk_mq_sched_init(q))
+		goto fail;
+
+	mutex_lock(&q->sysfs_lock);
+	ret = elevator_init(q, NULL);
+	mutex_unlock(&q->sysfs_lock);
+
+	if (ret) {
+		blk_exit_rl(&q->root_rl);
+		goto fail;
+	}
+
+	q->queue_lock = &q->queue_hw_ctx[0]->lock;
+	return 0;
+fail:
+	printk(KERN_ERR "blk-mq: sq io scheduler init failed\n");
+	blk_free_flush_queue(q->fq);
+	return 1;
+}
+
 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						  struct request_queue *q)
 {
@@ -2124,8 +2416,10 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
 	if (q->nr_hw_queues > 1)
 		blk_queue_make_request(q, blk_mq_make_request);
-	else
+	else {
 		blk_queue_make_request(q, blk_sq_make_request);
+		blk_sq_sched_init(q);
+	}
 
 	/*
 	 * Do this after blk_queue_make_request() overrides it...
diff --git a/block/blk-mq.h b/block/blk-mq.h
index b444370ae05b..2b8f43fb9583 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -56,6 +56,7 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 extern int blk_mq_sysfs_register(struct request_queue *q);
 extern void blk_mq_sysfs_unregister(struct request_queue *q);
 extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
+extern int blk_mq_sched_init(struct request_queue *q);
 
 extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 706b27bd73a1..f3a11d4de4e6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -896,7 +896,7 @@ int blk_register_queue(struct gendisk *disk)
 
 	blk_wb_init(q);
 
-	if (!q->request_fn)
+	if (!q->elevator)
 		return 0;
 
 	ret = elv_register_queue(q);
diff --git a/block/blk.h b/block/blk.h
index 600e22ea62e9..758c4e7bb788 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -77,6 +77,9 @@ bool __blk_end_bidi_request(struct request *rq, int error,
 			    unsigned int nr_bytes, unsigned int bidi_bytes);
 void blk_freeze_queue(struct request_queue *q);
 
+struct request *get_request(struct request_queue *, unsigned int, struct bio *,
+				gfp_t);
+
 static inline void blk_queue_enter_live(struct request_queue *q)
 {
 	/*
@@ -110,6 +113,7 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 			    unsigned int *request_count,
 			    struct request **same_queue_rq);
 unsigned int blk_plug_queued_count(struct request_queue *q);
+struct list_head *blk_get_plug_list(struct request_queue *, struct blk_plug *);
 
 void blk_account_io_start(struct request *req, bool new_io);
 void blk_account_io_completion(struct request *req, unsigned int bytes);
diff --git a/block/elevator.c b/block/elevator.c
index 11d2cfee2bc1..19fb1ea0d0fd 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -180,7 +180,7 @@ static void elevator_release(struct kobject *kobj)
 int elevator_init(struct request_queue *q, char *name)
 {
 	struct elevator_type *e = NULL;
-	int err;
+	int err = 0;
 
 	/*
 	 * q->sysfs_lock must be held to provide mutual exclusion between
@@ -215,18 +215,24 @@ int elevator_init(struct request_queue *q, char *name)
 	}
 
 	if (!e) {
-		e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
+		if (q->mq_ops)
+			e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
+		else
+			e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
 		if (!e) {
 			printk(KERN_ERR
 				"Default I/O scheduler not found. " \
-				"Using noop.\n");
-			e = elevator_get("noop", false);
+				"Using %s.\n", q->mq_ops ? "none" : "noop");
+			if (!q->mq_ops)
+				e = elevator_get("noop", false);
 		}
 	}
 
-	err = e->ops.elevator_init_fn(q, e);
-	if (err)
-		elevator_put(e);
+	if (e) {
+		err = e->ops.elevator_init_fn(q, e);
+		if (err)
+			elevator_put(e);
+	}
 	return err;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -894,9 +900,14 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old = q->elevator;
-	bool registered = old->registered;
+	bool old_registered = false;
 	int err;
 
+	if (q->mq_ops) {
+		blk_mq_freeze_queue(q);
+		blk_mq_quiesce_queue(q);
+	}
+
 	/*
 	 * Turn on BYPASS and drain all requests w/ elevator private data.
 	 * Block layer doesn't call into a quiesced elevator - all requests
@@ -904,32 +915,49 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	 * using INSERT_BACK.  All requests have SOFTBARRIER set and no
 	 * merge happens either.
 	 */
-	blk_queue_bypass_start(q);
+	if (old) {
+		old_registered = old->registered;
+
+		if (!q->mq_ops)
+			blk_queue_bypass_start(q);
 
-	/* unregister and clear all auxiliary data of the old elevator */
-	if (registered)
-		elv_unregister_queue(q);
+		/* unregister and clear all auxiliary data of the old elevator */
+		if (old_registered)
+			elv_unregister_queue(q);
 
-	spin_lock_irq(q->queue_lock);
-	ioc_clear_queue(q);
-	spin_unlock_irq(q->queue_lock);
+		spin_lock_irq(q->queue_lock);
+		ioc_clear_queue(q);
+		spin_unlock_irq(q->queue_lock);
+	}
 
 	/* allocate, init and register new elevator */
-	err = new_e->ops.elevator_init_fn(q, new_e);
-	if (err)
-		goto fail_init;
+	if (new_e) {
+		err = new_e->ops.elevator_init_fn(q, new_e);
+		if (err)
+			goto fail_init;
 
-	if (registered) {
 		err = elv_register_queue(q);
 		if (err)
 			goto fail_register;
-	}
+	} else
+		q->elevator = NULL;
 
 	/* done, kill the old one and finish */
-	elevator_exit(old);
-	blk_queue_bypass_end(q);
+	if (old) {
+		elevator_exit(old);
+		if (!q->mq_ops)
+			blk_queue_bypass_end(q);
+	}
 
-	blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	if (q->mq_ops) {
+		blk_mq_unfreeze_queue(q);
+		blk_mq_start_stopped_hw_queues(q, true);
+	}
+
+	if (new_e)
+		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	else
+		blk_add_trace_msg(q, "elv switch: none");
 
 	return 0;
 
@@ -937,10 +965,16 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	elevator_exit(q->elevator);
 fail_init:
 	/* switch failed, restore and re-register old elevator */
-	q->elevator = old;
-	elv_register_queue(q);
-	blk_queue_bypass_end(q);
-
+	if (old) {
+		q->elevator = old;
+		elv_register_queue(q);
+		if (!q->mq_ops)
+			blk_queue_bypass_end(q);
+	}
+	if (q->mq_ops) {
+		blk_mq_unfreeze_queue(q);
+		blk_mq_start_stopped_hw_queues(q, true);
+	}
 	return err;
 }
 
@@ -952,8 +986,11 @@ static int __elevator_change(struct request_queue *q, const char *name)
 	char elevator_name[ELV_NAME_MAX];
 	struct elevator_type *e;
 
-	if (!q->elevator)
-		return -ENXIO;
+	/*
+	 * Special case for mq, turn off scheduling
+	 */
+	if (q->mq_ops && !strncmp(name, "none", 4))
+		return elevator_switch(q, NULL);
 
 	strlcpy(elevator_name, name, sizeof(elevator_name));
 	e = elevator_get(strstrip(elevator_name), true);
@@ -962,7 +999,8 @@ static int __elevator_change(struct request_queue *q, const char *name)
 		return -EINVAL;
 	}
 
-	if (!strcmp(elevator_name, q->elevator->type->elevator_name)) {
+	if (q->elevator &&
+	    !strcmp(elevator_name, q->elevator->type->elevator_name)) {
 		elevator_put(e);
 		return 0;
 	}
@@ -988,9 +1026,15 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
 {
 	int ret;
 
-	if (!q->elevator)
+	if (!q->mq_ops || q->request_fn)
 		return count;
 
+	if (q->mq_ops && !q->elevator) {
+		ret = blk_mq_sched_init(q);
+		if (ret)
+			return ret;
+	}
+
 	ret = __elevator_change(q, name);
 	if (!ret)
 		return count;
@@ -1002,24 +1046,30 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
 ssize_t elv_iosched_show(struct request_queue *q, char *name)
 {
 	struct elevator_queue *e = q->elevator;
-	struct elevator_type *elv;
+	struct elevator_type *elv = NULL;
 	struct elevator_type *__e;
 	int len = 0;
 
-	if (!q->elevator || !blk_queue_stackable(q))
+	if (!blk_queue_stackable(q))
 		return sprintf(name, "none\n");
 
-	elv = e->type;
+	if (!q->elevator)
+		len += sprintf(name+len, "[none] ");
+	else
+		elv = e->type;
 
 	spin_lock(&elv_list_lock);
 	list_for_each_entry(__e, &elv_list, list) {
-		if (!strcmp(elv->elevator_name, __e->elevator_name))
+		if (elv && !strcmp(elv->elevator_name, __e->elevator_name))
 			len += sprintf(name+len, "[%s] ", elv->elevator_name);
 		else
 			len += sprintf(name+len, "%s ", __e->elevator_name);
 	}
 	spin_unlock(&elv_list_lock);
 
+	if (q->mq_ops && q->elevator)
+		len += sprintf(name+len, "none");
+
 	len += sprintf(len+name, "\n");
 	return len;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ebeef2b79c5a..a8c580f806cc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -120,6 +120,8 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_HASHED		((__force req_flags_t)(1 << 16))
 /* IO stats tracking on */
 #define RQF_STATS		((__force req_flags_t)(1 << 17))
+/* rl based request on MQ queue */
+#define RQF_MQ_RL		((__force req_flags_t)(1 << 18))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
-- 
2.7.4


* [PATCH 7/7] block: drop irq+lock when flushing queue plugs
  2016-12-05 18:26 [PATCHSET/RFC v2] Make legacy IO schedulers work with blk-mq Jens Axboe
                   ` (5 preceding siblings ...)
  2016-12-05 18:27 ` [PATCH 6/7] blk-mq: test patch to get legacy IO schedulers working Jens Axboe
@ 2016-12-05 18:27 ` Jens Axboe
  2016-12-06 10:01 ` [PATCHSET/RFC v2] Make legacy IO schedulers work with blk-mq Paolo Valente
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2016-12-05 18:27 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, Jens Axboe

Not convinced this is a faster approach, and it does keep IRQs off
longer than otherwise. With mq+scheduling, it's a problem since
it forces us to offload the queue running. If we get rid of it,
we can run the queue without the queue lock held.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-core.c | 32 ++++++++++++++------------------
 1 file changed, 14 insertions(+), 18 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8be12ba91f8e..2c61d2020c3f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3204,18 +3204,21 @@ static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b)
  * plugger did not intend it.
  */
 static void queue_unplugged(struct request_queue *q, unsigned int depth,
-			    bool from_schedule)
+			    bool from_schedule, unsigned long flags)
 	__releases(q->queue_lock)
 {
 	trace_block_unplug(q, depth, !from_schedule);
 
-	if (q->mq_ops)
-		blk_mq_run_hw_queues(q, true);
-	else if (from_schedule)
-		blk_run_queue_async(q);
-	else
-		__blk_run_queue(q);
-	spin_unlock(q->queue_lock);
+	if (q->mq_ops) {
+		spin_unlock_irqrestore(q->queue_lock, flags);
+		blk_mq_run_hw_queues(q, from_schedule);
+	} else {
+		if (from_schedule)
+			blk_run_queue_async(q);
+		else
+			__blk_run_queue(q);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+	}
 }
 
 static void flush_plug_callbacks(struct blk_plug *plug, bool from_schedule)
@@ -3283,11 +3286,6 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	q = NULL;
 	depth = 0;
 
-	/*
-	 * Save and disable interrupts here, to avoid doing it for every
-	 * queue lock we have to take.
-	 */
-	local_irq_save(flags);
 	while (!list_empty(&list)) {
 		rq = list_entry_rq(list.next);
 		list_del_init(&rq->queuelist);
@@ -3297,10 +3295,10 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 			 * This drops the queue lock
 			 */
 			if (q)
-				queue_unplugged(q, depth, from_schedule);
+				queue_unplugged(q, depth, from_schedule, flags);
 			q = rq->q;
 			depth = 0;
-			spin_lock(q->queue_lock);
+			spin_lock_irqsave(q->queue_lock, flags);
 		}
 
 		/*
@@ -3329,9 +3327,7 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	 * This drops the queue lock
 	 */
 	if (q)
-		queue_unplugged(q, depth, from_schedule);
-
-	local_irq_restore(flags);
+		queue_unplugged(q, depth, from_schedule, flags);
 }
 
 void blk_finish_plug(struct blk_plug *plug)
-- 
2.7.4


* Re: [PATCHSET/RFC v2] Make legacy IO schedulers work with blk-mq
  2016-12-05 18:26 [PATCHSET/RFC v2] Make legacy IO schedulers work with blk-mq Jens Axboe
                   ` (6 preceding siblings ...)
  2016-12-05 18:27 ` [PATCH 7/7] block: drop irq+lock when flushing queue plugs Jens Axboe
@ 2016-12-06 10:01 ` Paolo Valente
  2016-12-06 15:29   ` Jens Axboe
  7 siblings, 1 reply; 10+ messages in thread
From: Paolo Valente @ 2016-12-06 10:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jens Axboe, linux-block, Linux-Kernal, Mark Brown, Ulf Hansson,
	Linus Walleij


> Il giorno 05 dic 2016, alle ore 19:26, Jens Axboe <axboe@fb.com> ha scritto:
> 
> Version 2 of the hack/patchset, that enables blk-mq to use the legacy
> IO schedulers with single queue devices. Original posting is here:
> 
> https://marc.info/?l=linux-block&m=148073493203664&w=2
> 
> You can also find this version in the following git branch:
> 
> git://git.kernel.dk/linux-block blk-mq-legacy-sched.2
> 
> and new developments/fixes will happen in the 'blk-mq-legacy-sched'
> branch.
> 

Hi Jens,
while I was running some tests, the system hung after a while.

If I'm not mistaken, the above branches contain a (modified) 4.9-rc1.
Maybe the instability follows from that?  I have tried a rebase, but
the resulting conflicts are non-trivial (for me) to solve.

Meanwhile, if you deem it useful, I can provide you with the oops
message once I catch it.

As a secondary issue, iostat always reports 0 MB/s for both reads and
writes, while the tps values are non-zero.

Thanks,
Paolo

> Changes since v1:
> 
> - Remove the default 'deadline' hard wiring, and provide Kconfig
>  entries to set the blk-mq scheduler. This now works like for legacy
>  devices.
> 
> - Rename blk_use_mq_path() to blk_use_sched_path() to make it more
>  clear. Suggested by Johannes Thumshirn.
> 
> - Fixup a spot where we did not use the accessor function to determine
>  what path to use.
> 
> - Flush mq software queues, even if IO scheduler managed. This should
>  make paths work that are MQ aware, and using only MQ interfaces.
> 
> - Cleanup free path of MQ request.
> 
> - Account when IO was queued to a hardware context, similarly to the
>  regular MQ path.
> 
> - Add BLK_MQ_F_NO_SCHED flag, so that drivers can explicitly ask for
>  no scheduling on a queue. Add this for NVMe admin queues.
> 
> - Kill BLK_MQ_F_SCHED, since we now have Kconfig entries for setting
>  the desired IO scheduler.
> 
> - Fix issues with live scheduler switching through sysfs. Should now
>  be solid, even with lots of IO running on the device.
> 
> - Drop null_blk and SCSI changes, not needed anymore.
> 
> 
> block/Kconfig.iosched   |   29 ++++
> block/blk-core.c        |   77 ++++++-----
> block/blk-exec.c        |   12 +
> block/blk-flush.c       |   40 +++--
> block/blk-merge.c       |    5 
> block/blk-mq.c          |  332 +++++++++++++++++++++++++++++++++++++++++++++---
> block/blk-mq.h          |    1 
> block/blk-sysfs.c       |    2 
> block/blk.h             |   16 ++
> block/cfq-iosched.c     |   22 ++-
> block/elevator.c        |  125 ++++++++++++------
> drivers/nvme/host/pci.c |    1 
> include/linux/blk-mq.h  |    1 
> include/linux/blkdev.h  |    2 
> 14 files changed, 555 insertions(+), 110 deletions(-)


* Re: [PATCHSET/RFC v2] Make legacy IO schedulers work with blk-mq
  2016-12-06 10:01 ` [PATCHSET/RFC v2] Make legacy IO schedulers work with blk-mq Paolo Valente
@ 2016-12-06 15:29   ` Jens Axboe
  0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2016-12-06 15:29 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, linux-block, Linux-Kernal, Mark Brown, Ulf Hansson,
	Linus Walleij

On 12/06/2016 03:01 AM, Paolo Valente wrote:
> 
>> Il giorno 05 dic 2016, alle ore 19:26, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> Version 2 of the hack/patchset, that enables blk-mq to use the legacy
>> IO schedulers with single queue devices. Original posting is here:
>>
>> https://marc.info/?l=linux-block&m=148073493203664&w=2
>>
>> You can also find this version in the following git branch:
>>
>> git://git.kernel.dk/linux-block blk-mq-legacy-sched.2
>>
>> and new developments/fixes will happen in the 'blk-mq-legacy-sched'
>> branch.
>>
> 
> Hi Jens,
> while running some tests, the system hung after a while.
> 
> If I'm not mistaken, the above branches contain a (modified) 4.9-rc1.
> Maybe instability follows from that?  I have tried a rebase, but
> resulting conflicts are non-trivial (for me) to solve.
> 
> Meanwhile, if you deem it useful, I can provide you with the oops
> message, as I catch it.

Please do. I run my testing with master pulled in, but I can rebase the
branch to include that.

> As a secondary issue, iostat always reports 0 MB/s for both reads and
> writes, while tps are non null.

Looks like only the inflight stuff worked, but not the completion
bytes. Let me fix that up. Done.

OK, pull from:

git://git.kernel.dk/linux-block blk-mq-legacy-sched

and you'll get the latest and greatest, merged to 4.9-rc8 as well.

-- 
Jens Axboe

