linux-kernel.vger.kernel.org archive mirror
* [PATCHSET v4] blk-mq-scheduling framework
@ 2016-12-17  0:12 Jens Axboe
  2016-12-17  0:12 ` [PATCH 1/8] block: move existing elevator ops to union Jens Axboe
                   ` (9 more replies)
  0 siblings, 10 replies; 69+ messages in thread
From: Jens Axboe @ 2016-12-17  0:12 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov

This is version 4 of this patchset, version 3 was posted here:

https://marc.info/?l=linux-block&m=148178513407631&w=2

From the discussion last time, I looked into the feasibility of having
two sets of tags for the same request pool, to avoid having to copy
some of the request fields at dispatch and completion time. To do that,
we'd have to replace the driver tag map(s) with our own, and augment
that with tag map(s) on the side representing the device queue depth.
Queuing IO with the scheduler would allocate from the new map, and
dispatching would acquire the "real" tag. We would need to change
drivers to do this, or add an extra indirection table to map a real
tag to the scheduler tag. We would also need a 1:1 mapping between
scheduler and hardware tag pools, or additional info to track it.
Unless someone can convince me otherwise, I think the current approach
is cleaner.
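
For illustration, here is a rough sketch (hypothetical, not part of this
patchset) of the extra per-hardware-queue bookkeeping the two-tag-map
alternative would need: a scheduler-side tag space sized for the request
pool, a driver-side tag space sized for the device queue depth, and an
indirection table translating a "real" tag back to the scheduler tag
that owns it.

#include <linux/sbitmap.h>

/* Hypothetical per-hctx state for the two-map alternative (not in this series) */
struct example_two_map_state {
	struct sbitmap_queue sched_tags;	/* sized for the scheduler's request pool */
	struct sbitmap_queue driver_tags;	/* sized for the device queue depth */
	int *sched_tag_of;			/* driver tag -> scheduler tag, -1 if idle */
};

Every dispatch would allocate from driver_tags and record the mapping,
and every completion would have to look it up again, on top of either
converting drivers over or keeping such a table in the core.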

I wasn't going to post v4 so soon, but I discovered a bug that led
to drastically decreased merging. With that fixed, this version should
be fast, especially on rotating storage, and on par with the merging
that we get through the legacy schedulers.

Changes since v3:

- Keep the blk_mq_free_request/__blk_mq_free_request() as the
  interface, and have those functions call the scheduler API
  instead.

- Add insertion merging from unplugging.

- Ensure that RQF_STARTED is cleared when we get a new shadow
  request, or merging will fail if it is already set.

- Improve the blk_mq_sched_init_hctx_data() implementation. From Omar.

- Make the shadow alloc/free interface more usable by schedulers
  that use the software queues. From Omar.

- Fix a bug in the io context code.

- Put the is_shadow() helper in generic code, instead of in mq-deadline.

- Add a prep patch that unexports blk_mq_free_hctx_request(), since it's
  not used by anyone.

- Remove the magic '256' queue depth from mq-deadline, replace with a
  module parameter, 'queue_depth', that defaults to 256.

- Various cleanups.

* [PATCH 1/8] block: move existing elevator ops to union
  2016-12-17  0:12 [PATCHSET v4] blk-mq-scheduling framework Jens Axboe
@ 2016-12-17  0:12 ` Jens Axboe
  2016-12-17  0:12 ` [PATCH 2/8] blk-mq: make mq_ops a const pointer Jens Axboe
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2016-12-17  0:12 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

Prep patch for adding MQ ops as well, since anonymous unions with
designated initializers don't work on older compilers.
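
As an aside, here is a minimal sketch of the compiler limitation being
worked around (the type names are made up for illustration, and the mq
member only appears later in the series): older gcc versions reject
designated initializers that name a member of an anonymous union, which
is why the union member is called 'sq' and the initializers below
become '.ops.sq = { ... }'.

struct sq_ops { int (*merge)(void); };
struct mq_ops { int (*insert)(void); };

struct etype_anon {
	union {				/* anonymous union */
		struct sq_ops sq;
		struct mq_ops mq;
	};
};

/* May fail to build on older gcc: the designator names an anonymous member */
static struct etype_anon et_a = { .sq = { .merge = NULL } };

struct etype_named {
	union {
		struct sq_ops sq;
		struct mq_ops mq;
	} ops;				/* named, as done in this patch */
};

/* Portable designated initializer, matching the '.ops.sq' style below */
static struct etype_named et_n = { .ops.sq = { .merge = NULL } };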

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-ioc.c          |  8 +++----
 block/blk-merge.c        |  4 ++--
 block/blk.h              | 10 ++++----
 block/cfq-iosched.c      |  2 +-
 block/deadline-iosched.c |  2 +-
 block/elevator.c         | 60 ++++++++++++++++++++++++------------------------
 block/noop-iosched.c     |  2 +-
 include/linux/elevator.h |  4 +++-
 8 files changed, 47 insertions(+), 45 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 381cb50a673c..ab372092a57d 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -43,8 +43,8 @@ static void ioc_exit_icq(struct io_cq *icq)
 	if (icq->flags & ICQ_EXITED)
 		return;
 
-	if (et->ops.elevator_exit_icq_fn)
-		et->ops.elevator_exit_icq_fn(icq);
+	if (et->ops.sq.elevator_exit_icq_fn)
+		et->ops.sq.elevator_exit_icq_fn(icq);
 
 	icq->flags |= ICQ_EXITED;
 }
@@ -383,8 +383,8 @@ struct io_cq *ioc_create_icq(struct io_context *ioc, struct request_queue *q,
 	if (likely(!radix_tree_insert(&ioc->icq_tree, q->id, icq))) {
 		hlist_add_head(&icq->ioc_node, &ioc->icq_list);
 		list_add(&icq->q_node, &q->icq_list);
-		if (et->ops.elevator_init_icq_fn)
-			et->ops.elevator_init_icq_fn(icq);
+		if (et->ops.sq.elevator_init_icq_fn)
+			et->ops.sq.elevator_init_icq_fn(icq);
 	} else {
 		kmem_cache_free(et->icq_cache, icq);
 		icq = ioc_lookup_icq(ioc, q);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 182398cb1524..480570b691dc 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -763,8 +763,8 @@ int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_allow_rq_merge_fn)
-		if (!e->type->ops.elevator_allow_rq_merge_fn(q, rq, next))
+	if (e->type->ops.sq.elevator_allow_rq_merge_fn)
+		if (!e->type->ops.sq.elevator_allow_rq_merge_fn(q, rq, next))
 			return 0;
 
 	return attempt_merge(q, rq, next);
diff --git a/block/blk.h b/block/blk.h
index 041185e5f129..f46c0ac8ae3d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -167,7 +167,7 @@ static inline struct request *__elv_next_request(struct request_queue *q)
 			return NULL;
 		}
 		if (unlikely(blk_queue_bypass(q)) ||
-		    !q->elevator->type->ops.elevator_dispatch_fn(q, 0))
+		    !q->elevator->type->ops.sq.elevator_dispatch_fn(q, 0))
 			return NULL;
 	}
 }
@@ -176,16 +176,16 @@ static inline void elv_activate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_activate_req_fn)
-		e->type->ops.elevator_activate_req_fn(q, rq);
+	if (e->type->ops.sq.elevator_activate_req_fn)
+		e->type->ops.sq.elevator_activate_req_fn(q, rq);
 }
 
 static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_deactivate_req_fn)
-		e->type->ops.elevator_deactivate_req_fn(q, rq);
+	if (e->type->ops.sq.elevator_deactivate_req_fn)
+		e->type->ops.sq.elevator_deactivate_req_fn(q, rq);
 }
 
 #ifdef CONFIG_FAIL_IO_TIMEOUT
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c73a6fcaeb9d..37aeb20fa454 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -4837,7 +4837,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 };
 
 static struct elevator_type iosched_cfq = {
-	.ops = {
+	.ops.sq = {
 		.elevator_merge_fn = 		cfq_merge,
 		.elevator_merged_fn =		cfq_merged_request,
 		.elevator_merge_req_fn =	cfq_merged_requests,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 55e0bb6d7da7..05fc0ea25a98 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -439,7 +439,7 @@ static struct elv_fs_entry deadline_attrs[] = {
 };
 
 static struct elevator_type iosched_deadline = {
-	.ops = {
+	.ops.sq = {
 		.elevator_merge_fn = 		deadline_merge,
 		.elevator_merged_fn =		deadline_merged_request,
 		.elevator_merge_req_fn =	deadline_merged_requests,
diff --git a/block/elevator.c b/block/elevator.c
index 40f0c04e5ad3..022a26830297 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -58,8 +58,8 @@ static int elv_iosched_allow_bio_merge(struct request *rq, struct bio *bio)
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_allow_bio_merge_fn)
-		return e->type->ops.elevator_allow_bio_merge_fn(q, rq, bio);
+	if (e->type->ops.sq.elevator_allow_bio_merge_fn)
+		return e->type->ops.sq.elevator_allow_bio_merge_fn(q, rq, bio);
 
 	return 1;
 }
@@ -224,7 +224,7 @@ int elevator_init(struct request_queue *q, char *name)
 		}
 	}
 
-	err = e->ops.elevator_init_fn(q, e);
+	err = e->ops.sq.elevator_init_fn(q, e);
 	if (err)
 		elevator_put(e);
 	return err;
@@ -234,8 +234,8 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
-	if (e->type->ops.elevator_exit_fn)
-		e->type->ops.elevator_exit_fn(e);
+	if (e->type->ops.sq.elevator_exit_fn)
+		e->type->ops.sq.elevator_exit_fn(e);
 	mutex_unlock(&e->sysfs_lock);
 
 	kobject_put(&e->kobj);
@@ -443,8 +443,8 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
 		return ELEVATOR_BACK_MERGE;
 	}
 
-	if (e->type->ops.elevator_merge_fn)
-		return e->type->ops.elevator_merge_fn(q, req, bio);
+	if (e->type->ops.sq.elevator_merge_fn)
+		return e->type->ops.sq.elevator_merge_fn(q, req, bio);
 
 	return ELEVATOR_NO_MERGE;
 }
@@ -495,8 +495,8 @@ void elv_merged_request(struct request_queue *q, struct request *rq, int type)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_merged_fn)
-		e->type->ops.elevator_merged_fn(q, rq, type);
+	if (e->type->ops.sq.elevator_merged_fn)
+		e->type->ops.sq.elevator_merged_fn(q, rq, type);
 
 	if (type == ELEVATOR_BACK_MERGE)
 		elv_rqhash_reposition(q, rq);
@@ -510,8 +510,8 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 	struct elevator_queue *e = q->elevator;
 	const int next_sorted = next->rq_flags & RQF_SORTED;
 
-	if (next_sorted && e->type->ops.elevator_merge_req_fn)
-		e->type->ops.elevator_merge_req_fn(q, rq, next);
+	if (next_sorted && e->type->ops.sq.elevator_merge_req_fn)
+		e->type->ops.sq.elevator_merge_req_fn(q, rq, next);
 
 	elv_rqhash_reposition(q, rq);
 
@@ -528,8 +528,8 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_bio_merged_fn)
-		e->type->ops.elevator_bio_merged_fn(q, rq, bio);
+	if (e->type->ops.sq.elevator_bio_merged_fn)
+		e->type->ops.sq.elevator_bio_merged_fn(q, rq, bio);
 }
 
 #ifdef CONFIG_PM
@@ -578,7 +578,7 @@ void elv_drain_elevator(struct request_queue *q)
 
 	lockdep_assert_held(q->queue_lock);
 
-	while (q->elevator->type->ops.elevator_dispatch_fn(q, 1))
+	while (q->elevator->type->ops.sq.elevator_dispatch_fn(q, 1))
 		;
 	if (q->nr_sorted && printed++ < 10) {
 		printk(KERN_ERR "%s: forced dispatching is broken "
@@ -653,7 +653,7 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where)
 		 * rq cannot be accessed after calling
 		 * elevator_add_req_fn.
 		 */
-		q->elevator->type->ops.elevator_add_req_fn(q, rq);
+		q->elevator->type->ops.sq.elevator_add_req_fn(q, rq);
 		break;
 
 	case ELEVATOR_INSERT_FLUSH:
@@ -682,8 +682,8 @@ struct request *elv_latter_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_latter_req_fn)
-		return e->type->ops.elevator_latter_req_fn(q, rq);
+	if (e->type->ops.sq.elevator_latter_req_fn)
+		return e->type->ops.sq.elevator_latter_req_fn(q, rq);
 	return NULL;
 }
 
@@ -691,8 +691,8 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_former_req_fn)
-		return e->type->ops.elevator_former_req_fn(q, rq);
+	if (e->type->ops.sq.elevator_former_req_fn)
+		return e->type->ops.sq.elevator_former_req_fn(q, rq);
 	return NULL;
 }
 
@@ -701,8 +701,8 @@ int elv_set_request(struct request_queue *q, struct request *rq,
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_set_req_fn)
-		return e->type->ops.elevator_set_req_fn(q, rq, bio, gfp_mask);
+	if (e->type->ops.sq.elevator_set_req_fn)
+		return e->type->ops.sq.elevator_set_req_fn(q, rq, bio, gfp_mask);
 	return 0;
 }
 
@@ -710,16 +710,16 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_put_req_fn)
-		e->type->ops.elevator_put_req_fn(rq);
+	if (e->type->ops.sq.elevator_put_req_fn)
+		e->type->ops.sq.elevator_put_req_fn(rq);
 }
 
 int elv_may_queue(struct request_queue *q, unsigned int op)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_may_queue_fn)
-		return e->type->ops.elevator_may_queue_fn(q, op);
+	if (e->type->ops.sq.elevator_may_queue_fn)
+		return e->type->ops.sq.elevator_may_queue_fn(q, op);
 
 	return ELV_MQUEUE_MAY;
 }
@@ -734,8 +734,8 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 	if (blk_account_rq(rq)) {
 		q->in_flight[rq_is_sync(rq)]--;
 		if ((rq->rq_flags & RQF_SORTED) &&
-		    e->type->ops.elevator_completed_req_fn)
-			e->type->ops.elevator_completed_req_fn(q, rq);
+		    e->type->ops.sq.elevator_completed_req_fn)
+			e->type->ops.sq.elevator_completed_req_fn(q, rq);
 	}
 }
 
@@ -803,8 +803,8 @@ int elv_register_queue(struct request_queue *q)
 		}
 		kobject_uevent(&e->kobj, KOBJ_ADD);
 		e->registered = 1;
-		if (e->type->ops.elevator_registered_fn)
-			e->type->ops.elevator_registered_fn(q);
+		if (e->type->ops.sq.elevator_registered_fn)
+			e->type->ops.sq.elevator_registered_fn(q);
 	}
 	return error;
 }
@@ -912,7 +912,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	spin_unlock_irq(q->queue_lock);
 
 	/* allocate, init and register new elevator */
-	err = new_e->ops.elevator_init_fn(q, new_e);
+	err = new_e->ops.sq.elevator_init_fn(q, new_e);
 	if (err)
 		goto fail_init;
 
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index a163c487cf38..2d1b15d89b45 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -92,7 +92,7 @@ static void noop_exit_queue(struct elevator_queue *e)
 }
 
 static struct elevator_type elevator_noop = {
-	.ops = {
+	.ops.sq = {
 		.elevator_merge_req_fn		= noop_merged_requests,
 		.elevator_dispatch_fn		= noop_dispatch,
 		.elevator_add_req_fn		= noop_add_request,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index b276e9ef0e0b..2a9e966eed03 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -94,7 +94,9 @@ struct elevator_type
 	struct kmem_cache *icq_cache;
 
 	/* fields provided by elevator implementation */
-	struct elevator_ops ops;
+	union {
+		struct elevator_ops sq;
+	} ops;
 	size_t icq_size;	/* see iocontext.h */
 	size_t icq_align;	/* ditto */
 	struct elv_fs_entry *elevator_attrs;
-- 
2.7.4

* [PATCH 2/8] blk-mq: make mq_ops a const pointer
  2016-12-17  0:12 [PATCHSET v4] blk-mq-scheduling framework Jens Axboe
  2016-12-17  0:12 ` [PATCH 1/8] block: move existing elevator ops to union Jens Axboe
@ 2016-12-17  0:12 ` Jens Axboe
  2016-12-17  0:12 ` [PATCH 3/8] block: move rq_ioc() to blk.h Jens Axboe
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2016-12-17  0:12 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

We never change it, make that clear.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
---
 block/blk-mq.c         | 2 +-
 include/linux/blk-mq.h | 2 +-
 include/linux/blkdev.h | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d79fdc11b1ee..87b7eaa1cb74 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -639,7 +639,7 @@ struct blk_mq_timeout_data {
 
 void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
-	struct blk_mq_ops *ops = req->q->mq_ops;
+	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
 	/*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 4a2ab5d99ff7..afc81d77e471 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -60,7 +60,7 @@ struct blk_mq_hw_ctx {
 
 struct blk_mq_tag_set {
 	unsigned int		*mq_map;
-	struct blk_mq_ops	*ops;
+	const struct blk_mq_ops	*ops;
 	unsigned int		nr_hw_queues;
 	unsigned int		queue_depth;	/* max hw supported */
 	unsigned int		reserved_tags;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 286b2a264383..7c40fb838b44 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -408,7 +408,7 @@ struct request_queue {
 	dma_drain_needed_fn	*dma_drain_needed;
 	lld_busy_fn		*lld_busy_fn;
 
-	struct blk_mq_ops	*mq_ops;
+	const struct blk_mq_ops	*mq_ops;
 
 	unsigned int		*mq_map;
 
-- 
2.7.4

* [PATCH 3/8] block: move rq_ioc() to blk.h
  2016-12-17  0:12 [PATCHSET v4] blk-mq-scheduling framework Jens Axboe
  2016-12-17  0:12 ` [PATCH 1/8] block: move existing elevator ops to union Jens Axboe
  2016-12-17  0:12 ` [PATCH 2/8] blk-mq: make mq_ops a const pointer Jens Axboe
@ 2016-12-17  0:12 ` Jens Axboe
  2016-12-20 10:12   ` Paolo Valente
  2016-12-17  0:12 ` [PATCH 4/8] blk-mq: un-export blk_mq_free_hctx_request() Jens Axboe
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2016-12-17  0:12 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

We want to use it outside of blk-core.c.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-core.c | 16 ----------------
 block/blk.h      | 16 ++++++++++++++++
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c58b64..92baea07acbc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1040,22 +1040,6 @@ static bool blk_rq_should_init_elevator(struct bio *bio)
 }
 
 /**
- * rq_ioc - determine io_context for request allocation
- * @bio: request being allocated is for this bio (can be %NULL)
- *
- * Determine io_context to use for request allocation for @bio.  May return
- * %NULL if %current->io_context doesn't exist.
- */
-static struct io_context *rq_ioc(struct bio *bio)
-{
-#ifdef CONFIG_BLK_CGROUP
-	if (bio && bio->bi_ioc)
-		return bio->bi_ioc;
-#endif
-	return current->io_context;
-}
-
-/**
  * __get_request - get a free request
  * @rl: request list to allocate from
  * @op: operation and flags
diff --git a/block/blk.h b/block/blk.h
index f46c0ac8ae3d..9a716b5925a4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -264,6 +264,22 @@ void ioc_clear_queue(struct request_queue *q);
 int create_task_io_context(struct task_struct *task, gfp_t gfp_mask, int node);
 
 /**
+ * rq_ioc - determine io_context for request allocation
+ * @bio: request being allocated is for this bio (can be %NULL)
+ *
+ * Determine io_context to use for request allocation for @bio.  May return
+ * %NULL if %current->io_context doesn't exist.
+ */
+static inline struct io_context *rq_ioc(struct bio *bio)
+{
+#ifdef CONFIG_BLK_CGROUP
+	if (bio && bio->bi_ioc)
+		return bio->bi_ioc;
+#endif
+	return current->io_context;
+}
+
+/**
  * create_io_context - try to create task->io_context
  * @gfp_mask: allocation mask
  * @node: allocation node
-- 
2.7.4

* [PATCH 4/8] blk-mq: un-export blk_mq_free_hctx_request()
  2016-12-17  0:12 [PATCHSET v4] blk-mq-scheduling framework Jens Axboe
                   ` (2 preceding siblings ...)
  2016-12-17  0:12 ` [PATCH 3/8] block: move rq_ioc() to blk.h Jens Axboe
@ 2016-12-17  0:12 ` Jens Axboe
  2016-12-17  0:12 ` [PATCH 5/8] blk-mq: export some helpers we need to the scheduling framework Jens Axboe
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2016-12-17  0:12 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

It's only used in blk-mq, so kill it from the main exported header
and kill the symbol export as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-mq.c         | 5 ++---
 include/linux/blk-mq.h | 1 -
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 87b7eaa1cb74..6eeae30cc027 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -337,15 +337,14 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
 	blk_queue_exit(q);
 }
 
-void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+static void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *hctx,
+				     struct request *rq)
 {
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 
 	ctx->rq_completed[rq_is_sync(rq)]++;
 	__blk_mq_free_request(hctx, ctx, rq);
-
 }
-EXPORT_SYMBOL_GPL(blk_mq_free_hctx_request);
 
 void blk_mq_free_request(struct request *rq)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index afc81d77e471..2686f9e7302a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -181,7 +181,6 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule);
 
 void blk_mq_insert_request(struct request *, bool, bool, bool);
 void blk_mq_free_request(struct request *rq);
-void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *, struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
 
 enum {
-- 
2.7.4

* [PATCH 5/8] blk-mq: export some helpers we need to the scheduling framework
  2016-12-17  0:12 [PATCHSET v4] blk-mq-scheduling framework Jens Axboe
                   ` (3 preceding siblings ...)
  2016-12-17  0:12 ` [PATCH 4/8] blk-mq: un-export blk_mq_free_hctx_request() Jens Axboe
@ 2016-12-17  0:12 ` Jens Axboe
  2016-12-17  0:12 ` [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers Jens Axboe
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2016-12-17  0:12 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-mq.c | 39 +++++++++++++++++++++------------------
 block/blk-mq.h | 25 +++++++++++++++++++++++++
 2 files changed, 46 insertions(+), 18 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6eeae30cc027..c3119f527bc1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -167,8 +167,8 @@ bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
 }
 EXPORT_SYMBOL(blk_mq_can_queue);
 
-static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
-			       struct request *rq, unsigned int op)
+void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
+			struct request *rq, unsigned int op)
 {
 	INIT_LIST_HEAD(&rq->queuelist);
 	/* csd/requeue_work/fifo_time is initialized before use */
@@ -213,9 +213,10 @@ static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
 
 	ctx->rq_dispatched[op_is_sync(op)]++;
 }
+EXPORT_SYMBOL_GPL(blk_mq_rq_ctx_init);
 
-static struct request *
-__blk_mq_alloc_request(struct blk_mq_alloc_data *data, unsigned int op)
+struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
+				       unsigned int op)
 {
 	struct request *rq;
 	unsigned int tag;
@@ -236,6 +237,7 @@ __blk_mq_alloc_request(struct blk_mq_alloc_data *data, unsigned int op)
 
 	return NULL;
 }
+EXPORT_SYMBOL_GPL(__blk_mq_alloc_request);
 
 struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
 		unsigned int flags)
@@ -319,8 +321,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, int rw,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
-static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
-				  struct blk_mq_ctx *ctx, struct request *rq)
+void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+			   struct request *rq)
 {
 	const int tag = rq->tag;
 	struct request_queue *q = rq->q;
@@ -802,7 +804,7 @@ static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
  * Process software queues that have been marked busy, splicing them
  * to the for-dispatch
  */
-static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
+void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 {
 	struct flush_busy_ctx_data data = {
 		.hctx = hctx,
@@ -811,6 +813,7 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 
 	sbitmap_for_each_set(&hctx->ctx_map, flush_busy_ctx, &data);
 }
+EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
 
 static inline unsigned int queued_to_index(unsigned int queued)
 {
@@ -921,7 +924,7 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
 	/*
 	 * Touch any software queue that has pending entries.
 	 */
-	flush_busy_ctxs(hctx, &rq_list);
+	blk_mq_flush_busy_ctxs(hctx, &rq_list);
 
 	/*
 	 * If we have previous entries on our dispatch list, grab them
@@ -1135,8 +1138,8 @@ static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
 		list_add_tail(&rq->queuelist, &ctx->rq_list);
 }
 
-static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
-				    struct request *rq, bool at_head)
+void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+			     bool at_head)
 {
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 
@@ -1550,8 +1553,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	return cookie;
 }
 
-static void blk_mq_free_rq_map(struct blk_mq_tag_set *set,
-		struct blk_mq_tags *tags, unsigned int hctx_idx)
+void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+			unsigned int hctx_idx)
 {
 	struct page *page;
 
@@ -1588,8 +1591,8 @@ static size_t order_to_size(unsigned int order)
 	return (size_t)PAGE_SIZE << order;
 }
 
-static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
-		unsigned int hctx_idx)
+struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
+				       unsigned int hctx_idx)
 {
 	struct blk_mq_tags *tags;
 	unsigned int i, j, entries_per_page, max_order = 4;
@@ -2263,10 +2266,10 @@ static int blk_mq_queue_reinit_dead(unsigned int cpu)
  * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
  * and set bit0 in pending bitmap as ctx1->index_hw is still zero.
  *
- * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in
- * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
- * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list
- * is ignored.
+ * And then while running hw queue, blk_mq_flush_busy_ctxs() finds bit0 is set
+ * in pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
+ * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list is
+ * ignored.
  */
 static int blk_mq_queue_reinit_prepare(unsigned int cpu)
 {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 63e9116cddbd..e59f5ca520a2 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -32,6 +32,21 @@ void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
+void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
+
+/*
+ * Internal helpers for allocating/freeing the request map
+ */
+void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+			unsigned int hctx_idx);
+struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
+					unsigned int hctx_idx);
+
+/*
+ * Internal helpers for request insertion into sw queues
+ */
+void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+				bool at_head);
 
 /*
  * CPU hotplug helpers
@@ -103,6 +118,16 @@ static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
 	data->hctx = hctx;
 }
 
+/*
+ * Internal helpers for request allocation/init/free
+ */
+void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
+			struct request *rq, unsigned int op);
+void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+				struct request *rq);
+struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
+					unsigned int op);
+
 static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
 {
 	return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
-- 
2.7.4

* [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-17  0:12 [PATCHSET v4] blk-mq-scheduling framework Jens Axboe
                   ` (4 preceding siblings ...)
  2016-12-17  0:12 ` [PATCH 5/8] blk-mq: export some helpers we need to the scheduling framework Jens Axboe
@ 2016-12-17  0:12 ` Jens Axboe
  2016-12-20 11:55   ` Paolo Valente
  2016-12-22  9:59   ` Paolo Valente
  2016-12-17  0:12 ` [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 69+ messages in thread
From: Jens Axboe @ 2016-12-17  0:12 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.

We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interface, and with a separate set of ops hooks for MQ
devices.

Schedulers can opt in to using shadow requests. Shadow requests
are internal requests that the scheduler uses for the allocate
and insert part, which are then mapped to a real driver request
at dispatch time. This is needed to separate the device queue depth
from the pool of requests that the scheduler has to work with.
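
As a loose illustration of how a scheduler might consume the shadow
helpers added here, a dispatch hook could look roughly like the below.
The example_* names and the per-hctx example_data structure are
hypothetical; only blk_mq_sched_dispatch_shadow_requests() and
hctx->sched_data come from this patch.

#include <linux/blk-mq.h>
#include "blk-mq-sched.h"

/* Hypothetical per-hctx scheduler data, stored in hctx->sched_data */
struct example_data {
	struct list_head fifo;		/* shadow requests, FIFO order */
};

/* Hand the next shadow request to the core for mapping to a real rq */
static struct request *example_next_shadow(struct blk_mq_hw_ctx *hctx)
{
	struct example_data *ed = hctx->sched_data;
	struct request *rq;

	rq = list_first_entry_or_null(&ed->fifo, struct request, queuelist);
	if (rq)
		list_del_init(&rq->queuelist);
	return rq;
}

static void example_dispatch_requests(struct blk_mq_hw_ctx *hctx,
				      struct list_head *rq_list)
{
	blk_mq_sched_dispatch_shadow_requests(hctx, rq_list,
					      example_next_shadow);
}

Each shadow request handed back by the callback is copied into a freshly
allocated driver request, so the device queue depth only bounds what is
in flight, not what the scheduler can hold on to and merge against.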

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/Makefile           |   2 +-
 block/blk-core.c         |   3 +-
 block/blk-exec.c         |   3 +-
 block/blk-flush.c        |   7 +-
 block/blk-merge.c        |   2 +-
 block/blk-mq-sched.c     | 434 +++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-mq-sched.h     | 209 +++++++++++++++++++++++
 block/blk-mq.c           | 197 +++++++++------------
 block/blk-mq.h           |   6 +-
 block/elevator.c         | 186 +++++++++++++++-----
 include/linux/blk-mq.h   |   3 +-
 include/linux/elevator.h |  30 ++++
 12 files changed, 914 insertions(+), 168 deletions(-)
 create mode 100644 block/blk-mq-sched.c
 create mode 100644 block/blk-mq-sched.h

diff --git a/block/Makefile b/block/Makefile
index a827f988c4e6..2eee9e1bb6db 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
 			blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
-			blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
+			blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
 			badblocks.o partitions/
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 92baea07acbc..ee3a6f340cb8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@
 
 #include "blk.h"
 #include "blk-mq.h"
+#include "blk-mq-sched.h"
 #include "blk-wbt.h"
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
@@ -2127,7 +2128,7 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 	if (q->mq_ops) {
 		if (blk_queue_io_stat(q))
 			blk_account_io_start(rq, true);
-		blk_mq_insert_request(rq, false, true, false);
+		blk_mq_sched_insert_request(rq, false, true, false);
 		return 0;
 	}
 
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 3ecb00a6cf45..86656fdfa637 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -9,6 +9,7 @@
 #include <linux/sched/sysctl.h>
 
 #include "blk.h"
+#include "blk-mq-sched.h"
 
 /*
  * for max sense size
@@ -65,7 +66,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 	 * be reused after dying flag is set
 	 */
 	if (q->mq_ops) {
-		blk_mq_insert_request(rq, at_head, true, false);
+		blk_mq_sched_insert_request(rq, at_head, true, false);
 		return;
 	}
 
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 20b7c7a02f1c..6a7c29d2eb3c 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -74,6 +74,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
 
 /* FLUSH/FUA sequences */
 enum {
@@ -453,9 +454,9 @@ void blk_insert_flush(struct request *rq)
 	 */
 	if ((policy & REQ_FSEQ_DATA) &&
 	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
-		if (q->mq_ops) {
-			blk_mq_insert_request(rq, false, true, false);
-		} else
+		if (q->mq_ops)
+			blk_mq_sched_insert_request(rq, false, true, false);
+		else
 			list_add_tail(&rq->queuelist, &q->queue_head);
 		return;
 	}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 480570b691dc..6aa43dec5af4 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -763,7 +763,7 @@ int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.sq.elevator_allow_rq_merge_fn)
+	if (!e->uses_mq && e->type->ops.sq.elevator_allow_rq_merge_fn)
 		if (!e->type->ops.sq.elevator_allow_rq_merge_fn(q, rq, next))
 			return 0;
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
new file mode 100644
index 000000000000..b7e1839d4785
--- /dev/null
+++ b/block/blk-mq-sched.c
@@ -0,0 +1,434 @@
+/*
+ * blk-mq scheduling framework
+ *
+ * Copyright (C) 2016 Jens Axboe
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/blk-mq.h>
+
+#include <trace/events/block.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+#include "blk-wbt.h"
+
+/*
+ * Empty set
+ */
+static const struct blk_mq_ops mq_sched_tag_ops = {
+};
+
+void blk_mq_sched_free_requests(struct blk_mq_tags *tags)
+{
+	blk_mq_free_rq_map(NULL, tags, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_requests);
+
+struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth,
+						unsigned int numa_node)
+{
+	struct blk_mq_tag_set set = {
+		.ops		= &mq_sched_tag_ops,
+		.nr_hw_queues	= 1,
+		.queue_depth	= depth,
+		.numa_node	= numa_node,
+	};
+
+	return blk_mq_init_rq_map(&set, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_requests);
+
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+				 void (*exit)(struct blk_mq_hw_ctx *))
+{
+	struct blk_mq_hw_ctx *hctx;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (exit)
+			exit(hctx);
+		kfree(hctx->sched_data);
+		hctx->sched_data = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
+
+int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
+				int (*init)(struct blk_mq_hw_ctx *),
+				void (*exit)(struct blk_mq_hw_ctx *))
+{
+	struct blk_mq_hw_ctx *hctx;
+	int ret;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
+		if (!hctx->sched_data) {
+			ret = -ENOMEM;
+			goto error;
+		}
+
+		if (init) {
+			ret = init(hctx);
+			if (ret) {
+				/*
+				 * We don't want to give exit() a partially
+				 * initialized sched_data. init() must clean up
+				 * if it fails.
+				 */
+				kfree(hctx->sched_data);
+				hctx->sched_data = NULL;
+				goto error;
+			}
+		}
+	}
+
+	return 0;
+error:
+	blk_mq_sched_free_hctx_data(q, exit);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data);
+
+struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
+						  struct blk_mq_alloc_data *data,
+						  struct blk_mq_tags *tags,
+						  atomic_t *wait_index)
+{
+	struct sbq_wait_state *ws;
+	DEFINE_WAIT(wait);
+	struct request *rq;
+	int tag;
+
+	tag = __sbitmap_queue_get(&tags->bitmap_tags);
+	if (tag != -1)
+		goto done;
+
+	if (data->flags & BLK_MQ_REQ_NOWAIT)
+		return NULL;
+
+	ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
+	do {
+		prepare_to_wait(&ws->wait, &wait, TASK_UNINTERRUPTIBLE);
+
+		tag = __sbitmap_queue_get(&tags->bitmap_tags);
+		if (tag != -1)
+			break;
+
+		blk_mq_run_hw_queue(data->hctx, false);
+
+		tag = __sbitmap_queue_get(&tags->bitmap_tags);
+		if (tag != -1)
+			break;
+
+		blk_mq_put_ctx(data->ctx);
+		io_schedule();
+
+		data->ctx = blk_mq_get_ctx(data->q);
+		data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
+		finish_wait(&ws->wait, &wait);
+		ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
+	} while (1);
+
+	finish_wait(&ws->wait, &wait);
+done:
+	rq = tags->rqs[tag];
+	rq->tag = tag;
+	rq->rq_flags = RQF_ALLOCED;
+	return rq;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_shadow_request);
+
+void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
+				      struct request *rq)
+{
+	WARN_ON_ONCE(!(rq->rq_flags & RQF_ALLOCED));
+	sbitmap_queue_clear(&tags->bitmap_tags, rq->tag, rq->mq_ctx->cpu);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_shadow_request);
+
+static void rq_copy(struct request *rq, struct request *src)
+{
+#define FIELD_COPY(dst, src, name)	((dst)->name = (src)->name)
+	FIELD_COPY(rq, src, cpu);
+	FIELD_COPY(rq, src, cmd_type);
+	FIELD_COPY(rq, src, cmd_flags);
+	rq->rq_flags |= (src->rq_flags & (RQF_PREEMPT | RQF_QUIET | RQF_PM | RQF_DONTPREP));
+	rq->rq_flags &= ~RQF_IO_STAT;
+	FIELD_COPY(rq, src, __data_len);
+	FIELD_COPY(rq, src, __sector);
+	FIELD_COPY(rq, src, bio);
+	FIELD_COPY(rq, src, biotail);
+	FIELD_COPY(rq, src, rq_disk);
+	FIELD_COPY(rq, src, part);
+	FIELD_COPY(rq, src, issue_stat);
+	src->issue_stat.time = 0;
+	FIELD_COPY(rq, src, nr_phys_segments);
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+	FIELD_COPY(rq, src, nr_integrity_segments);
+#endif
+	FIELD_COPY(rq, src, ioprio);
+	FIELD_COPY(rq, src, timeout);
+
+	if (src->cmd_type == REQ_TYPE_BLOCK_PC) {
+		FIELD_COPY(rq, src, cmd);
+		FIELD_COPY(rq, src, cmd_len);
+		FIELD_COPY(rq, src, extra_len);
+		FIELD_COPY(rq, src, sense_len);
+		FIELD_COPY(rq, src, resid_len);
+		FIELD_COPY(rq, src, sense);
+		FIELD_COPY(rq, src, retries);
+	}
+
+	src->bio = src->biotail = NULL;
+}
+
+static void sched_rq_end_io(struct request *rq, int error)
+{
+	struct request *sched_rq = rq->end_io_data;
+
+	FIELD_COPY(sched_rq, rq, resid_len);
+	FIELD_COPY(sched_rq, rq, extra_len);
+	FIELD_COPY(sched_rq, rq, sense_len);
+	FIELD_COPY(sched_rq, rq, errors);
+	FIELD_COPY(sched_rq, rq, retries);
+
+	blk_account_io_completion(sched_rq, blk_rq_bytes(sched_rq));
+	blk_account_io_done(sched_rq);
+
+	if (sched_rq->end_io)
+		sched_rq->end_io(sched_rq, error);
+
+	blk_mq_finish_request(rq);
+}
+
+static inline struct request *
+__blk_mq_sched_alloc_request(struct blk_mq_hw_ctx *hctx)
+{
+	struct blk_mq_alloc_data data;
+	struct request *rq;
+
+	data.q = hctx->queue;
+	data.flags = BLK_MQ_REQ_NOWAIT;
+	data.ctx = blk_mq_get_ctx(hctx->queue);
+	data.hctx = hctx;
+
+	rq = __blk_mq_alloc_request(&data, 0);
+	blk_mq_put_ctx(data.ctx);
+
+	if (!rq)
+		blk_mq_stop_hw_queue(hctx);
+
+	return rq;
+}
+
+static inline void
+__blk_mq_sched_init_request_from_shadow(struct request *rq,
+					struct request *sched_rq)
+{
+	WARN_ON_ONCE(!(sched_rq->rq_flags & RQF_ALLOCED));
+	rq_copy(rq, sched_rq);
+	rq->end_io = sched_rq_end_io;
+	rq->end_io_data = sched_rq;
+}
+
+struct request *
+blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+				 struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
+{
+	struct request *rq, *sched_rq;
+
+	rq = __blk_mq_sched_alloc_request(hctx);
+	if (!rq)
+		return NULL;
+
+	sched_rq = get_sched_rq(hctx);
+	if (sched_rq) {
+		__blk_mq_sched_init_request_from_shadow(rq, sched_rq);
+		return rq;
+	}
+
+	/*
+	 * __blk_mq_finish_request() drops a queue ref we already hold,
+	 * so grab an extra one.
+	 */
+	blk_queue_enter_live(hctx->queue);
+	__blk_mq_finish_request(hctx, rq->mq_ctx, rq);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_request_from_shadow);
+
+struct request *__blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+						   struct request *sched_rq)
+{
+	struct request *rq;
+
+	rq = __blk_mq_sched_alloc_request(hctx);
+	if (rq)
+		__blk_mq_sched_init_request_from_shadow(rq, sched_rq);
+
+	return rq;
+}
+EXPORT_SYMBOL_GPL(__blk_mq_sched_request_from_shadow);
+
+static void __blk_mq_sched_assign_ioc(struct request_queue *q,
+				      struct request *rq, struct io_context *ioc)
+{
+	struct io_cq *icq;
+
+	spin_lock_irq(q->queue_lock);
+	icq = ioc_lookup_icq(ioc, q);
+	spin_unlock_irq(q->queue_lock);
+
+	if (!icq) {
+		icq = ioc_create_icq(ioc, q, GFP_ATOMIC);
+		if (!icq)
+			return;
+	}
+
+	rq->elv.icq = icq;
+	if (!blk_mq_sched_get_rq_priv(q, rq)) {
+		get_io_context(icq->ioc);
+		return;
+	}
+
+	rq->elv.icq = NULL;
+}
+
+static void blk_mq_sched_assign_ioc(struct request_queue *q,
+				    struct request *rq, struct bio *bio)
+{
+	struct io_context *ioc;
+
+	ioc = rq_ioc(bio);
+	if (ioc)
+		__blk_mq_sched_assign_ioc(q, rq, ioc);
+}
+
+struct request *blk_mq_sched_get_request(struct request_queue *q,
+					 struct bio *bio,
+					 unsigned int op,
+					 struct blk_mq_alloc_data *data)
+{
+	struct elevator_queue *e = q->elevator;
+	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_ctx *ctx;
+	struct request *rq;
+
+	blk_queue_enter_live(q);
+	ctx = blk_mq_get_ctx(q);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
+
+	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
+
+	if (e && e->type->ops.mq.get_request)
+		rq = e->type->ops.mq.get_request(q, op, data);
+	else
+		rq = __blk_mq_alloc_request(data, op);
+
+	if (rq) {
+		rq->elv.icq = NULL;
+		if (e && e->type->icq_cache)
+			blk_mq_sched_assign_ioc(q, rq, bio);
+		data->hctx->queued++;
+		return rq;
+	}
+
+	blk_queue_exit(q);
+	return NULL;
+}
+
+void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
+{
+	struct elevator_queue *e = hctx->queue->elevator;
+	LIST_HEAD(rq_list);
+
+	if (unlikely(blk_mq_hctx_stopped(hctx)))
+		return;
+
+	hctx->run++;
+
+	/*
+	 * If we have previous entries on our dispatch list, grab them first for
+	 * more fair dispatch.
+	 */
+	if (!list_empty_careful(&hctx->dispatch)) {
+		spin_lock(&hctx->lock);
+		if (!list_empty(&hctx->dispatch))
+			list_splice_init(&hctx->dispatch, &rq_list);
+		spin_unlock(&hctx->lock);
+	}
+
+	/*
+	 * Only ask the scheduler for requests, if we didn't have residual
+	 * requests from the dispatch list. This is to avoid the case where
+	 * we only ever dispatch a fraction of the requests available because
+	 * of low device queue depth. Once we pull requests out of the IO
+	 * scheduler, we can no longer merge or sort them. So it's best to
+	 * leave them there for as long as we can. Mark the hw queue as
+	 * needing a restart in that case.
+	 */
+	if (list_empty(&rq_list)) {
+		if (e && e->type->ops.mq.dispatch_requests)
+			e->type->ops.mq.dispatch_requests(hctx, &rq_list);
+		else
+			blk_mq_flush_busy_ctxs(hctx, &rq_list);
+	} else if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
+		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+
+	blk_mq_dispatch_rq_list(hctx, &rq_list);
+}
+
+bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio)
+{
+	struct request *rq;
+	int ret;
+
+	ret = elv_merge(q, &rq, bio);
+	if (ret == ELEVATOR_BACK_MERGE) {
+		if (bio_attempt_back_merge(q, rq, bio)) {
+			if (!attempt_back_merge(q, rq))
+				elv_merged_request(q, rq, ret);
+			return true;
+		}
+	} else if (ret == ELEVATOR_FRONT_MERGE) {
+		if (bio_attempt_front_merge(q, rq, bio)) {
+			if (!attempt_front_merge(q, rq))
+				elv_merged_request(q, rq, ret);
+			return true;
+		}
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
+
+bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e->type->ops.mq.bio_merge) {
+		struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
+		struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+
+		blk_mq_put_ctx(ctx);
+		return e->type->ops.mq.bio_merge(hctx, bio);
+	}
+
+	return false;
+}
+
+bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq)
+{
+	return rq_mergeable(rq) && elv_attempt_insert_merge(q, rq);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
+
+void blk_mq_sched_request_inserted(struct request *rq)
+{
+	trace_block_rq_insert(rq->q, rq);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
new file mode 100644
index 000000000000..1d1a4e9ce6ca
--- /dev/null
+++ b/block/blk-mq-sched.h
@@ -0,0 +1,209 @@
+#ifndef BLK_MQ_SCHED_H
+#define BLK_MQ_SCHED_H
+
+#include "blk-mq.h"
+#include "blk-wbt.h"
+
+struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth, unsigned int numa_node);
+void blk_mq_sched_free_requests(struct blk_mq_tags *tags);
+
+int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
+				int (*init)(struct blk_mq_hw_ctx *),
+				void (*exit)(struct blk_mq_hw_ctx *));
+
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+				 void (*exit)(struct blk_mq_hw_ctx *));
+
+void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
+				      struct request *rq);
+struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
+						  struct blk_mq_alloc_data *data,
+						  struct blk_mq_tags *tags,
+						  atomic_t *wait_index);
+struct request *
+blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+				 struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *));
+struct request *
+__blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+				   struct request *sched_rq);
+
+struct request *blk_mq_sched_get_request(struct request_queue *q, struct bio *bio, unsigned int op, struct blk_mq_alloc_data *data);
+
+void __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
+void blk_mq_sched_request_inserted(struct request *rq);
+bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
+bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
+bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
+
+void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
+
+static inline bool
+blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (!e || blk_queue_nomerges(q) || !bio_mergeable(bio))
+		return false;
+
+	return __blk_mq_sched_bio_merge(q, bio);
+}
+
+static inline int blk_mq_sched_get_rq_priv(struct request_queue *q,
+					   struct request *rq)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->ops.mq.get_rq_priv)
+		return e->type->ops.mq.get_rq_priv(q, rq);
+
+	return 0;
+}
+
+static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
+					    struct request *rq)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->ops.mq.put_rq_priv)
+		e->type->ops.mq.put_rq_priv(q, rq);
+}
+
+static inline void blk_mq_sched_put_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+	bool do_free = true;
+
+	wbt_done(q->rq_wb, &rq->issue_stat);
+
+	if (rq->rq_flags & RQF_ELVPRIV) {
+		blk_mq_sched_put_rq_priv(rq->q, rq);
+		if (rq->elv.icq) {
+			put_io_context(rq->elv.icq->ioc);
+			rq->elv.icq = NULL;
+		}
+	}
+
+	if (e && e->type->ops.mq.put_request)
+		do_free = !e->type->ops.mq.put_request(rq);
+	if (do_free)
+		blk_mq_finish_request(rq);
+}
+
+static inline void
+blk_mq_sched_insert_request(struct request *rq, bool at_head, bool run_queue,
+			    bool async)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+
+	if (e && e->type->ops.mq.insert_requests) {
+		LIST_HEAD(list);
+
+		list_add(&rq->queuelist, &list);
+		e->type->ops.mq.insert_requests(hctx, &list, at_head);
+	} else {
+		spin_lock(&ctx->lock);
+		__blk_mq_insert_request(hctx, rq, at_head);
+		spin_unlock(&ctx->lock);
+	}
+
+	if (run_queue)
+		blk_mq_run_hw_queue(hctx, async);
+}
+
+static inline void
+blk_mq_sched_insert_requests(struct request_queue *q, struct blk_mq_ctx *ctx,
+			     struct list_head *list, bool run_queue_async)
+{
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (e && e->type->ops.mq.insert_requests)
+		e->type->ops.mq.insert_requests(hctx, list, false);
+	else
+		blk_mq_insert_requests(hctx, ctx, list);
+
+	blk_mq_run_hw_queue(hctx, run_queue_async);
+}
+
+static inline void
+blk_mq_sched_dispatch_shadow_requests(struct blk_mq_hw_ctx *hctx,
+				      struct list_head *rq_list,
+				      struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
+{
+	do {
+		struct request *rq;
+
+		rq = blk_mq_sched_request_from_shadow(hctx, get_sched_rq);
+		if (!rq)
+			break;
+
+		list_add_tail(&rq->queuelist, rq_list);
+	} while (1);
+}
+
+static inline bool
+blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
+			 struct bio *bio)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->ops.mq.allow_merge)
+		return e->type->ops.mq.allow_merge(q, rq, bio);
+
+	return true;
+}
+
+static inline void
+blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (e && e->type->ops.mq.completed_request)
+		e->type->ops.mq.completed_request(hctx, rq);
+
+	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+		blk_mq_run_hw_queue(hctx, true);
+	}
+}
+
+static inline void blk_mq_sched_started_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->ops.mq.started_request)
+		e->type->ops.mq.started_request(rq);
+}
+
+static inline void blk_mq_sched_requeue_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->type->ops.mq.requeue_request)
+		e->type->ops.mq.requeue_request(rq);
+}
+
+static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
+{
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (e && e->type->ops.mq.has_work)
+		return e->type->ops.mq.has_work(hctx);
+
+	return false;
+}
+
+/*
+ * Returns true if this is an internal shadow request
+ */
+static inline bool blk_mq_sched_rq_is_shadow(struct request *rq)
+{
+	return (rq->rq_flags & RQF_ALLOCED) != 0;
+}
+#endif
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c3119f527bc1..032dca4a27bf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -32,6 +32,7 @@
 #include "blk-mq-tag.h"
 #include "blk-stat.h"
 #include "blk-wbt.h"
+#include "blk-mq-sched.h"
 
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
@@ -41,7 +42,8 @@ static LIST_HEAD(all_q_list);
  */
 static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
 {
-	return sbitmap_any_bit_set(&hctx->ctx_map);
+	return sbitmap_any_bit_set(&hctx->ctx_map) ||
+		blk_mq_sched_has_work(hctx);
 }
 
 /*
@@ -242,26 +244,21 @@ EXPORT_SYMBOL_GPL(__blk_mq_alloc_request);
 struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
 		unsigned int flags)
 {
-	struct blk_mq_ctx *ctx;
-	struct blk_mq_hw_ctx *hctx;
-	struct request *rq;
 	struct blk_mq_alloc_data alloc_data;
+	struct request *rq;
 	int ret;
 
 	ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
 	if (ret)
 		return ERR_PTR(ret);
 
-	ctx = blk_mq_get_ctx(q);
-	hctx = blk_mq_map_queue(q, ctx->cpu);
-	blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
-	rq = __blk_mq_alloc_request(&alloc_data, rw);
-	blk_mq_put_ctx(ctx);
+	rq = blk_mq_sched_get_request(q, NULL, rw, &alloc_data);
 
-	if (!rq) {
-		blk_queue_exit(q);
+	blk_mq_put_ctx(alloc_data.ctx);
+	blk_queue_exit(q);
+
+	if (!rq)
 		return ERR_PTR(-EWOULDBLOCK);
-	}
 
 	rq->__data_len = 0;
 	rq->__sector = (sector_t) -1;
@@ -321,12 +318,14 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, int rw,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
-void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
-			   struct request *rq)
+void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+			     struct request *rq)
 {
 	const int tag = rq->tag;
 	struct request_queue *q = rq->q;
 
+	blk_mq_sched_completed_request(hctx, rq);
+
 	if (rq->rq_flags & RQF_MQ_INFLIGHT)
 		atomic_dec(&hctx->nr_active);
 
@@ -339,18 +338,23 @@ void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	blk_queue_exit(q);
 }
 
-static void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_finish_hctx_request(struct blk_mq_hw_ctx *hctx,
 				     struct request *rq)
 {
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 
 	ctx->rq_completed[rq_is_sync(rq)]++;
-	__blk_mq_free_request(hctx, ctx, rq);
+	__blk_mq_finish_request(hctx, ctx, rq);
+}
+
+void blk_mq_finish_request(struct request *rq)
+{
+	blk_mq_finish_hctx_request(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu), rq);
 }
 
 void blk_mq_free_request(struct request *rq)
 {
-	blk_mq_free_hctx_request(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu), rq);
+	blk_mq_sched_put_request(rq);
 }
 EXPORT_SYMBOL_GPL(blk_mq_free_request);
 
@@ -468,6 +472,8 @@ void blk_mq_start_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
+	blk_mq_sched_started_request(rq);
+
 	trace_block_rq_issue(q, rq);
 
 	rq->resid_len = blk_rq_bytes(rq);
@@ -516,6 +522,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
+	blk_mq_sched_requeue_request(rq);
 
 	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
 		if (q->dma_drain_size && blk_rq_bytes(rq))
@@ -550,13 +557,13 @@ static void blk_mq_requeue_work(struct work_struct *work)
 
 		rq->rq_flags &= ~RQF_SOFTBARRIER;
 		list_del_init(&rq->queuelist);
-		blk_mq_insert_request(rq, true, false, false);
+		blk_mq_sched_insert_request(rq, true, false, false);
 	}
 
 	while (!list_empty(&rq_list)) {
 		rq = list_entry(rq_list.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
-		blk_mq_insert_request(rq, false, false, false);
+		blk_mq_sched_insert_request(rq, false, false, false);
 	}
 
 	blk_mq_run_hw_queues(q, false);
@@ -762,8 +769,16 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
 
 		if (!blk_rq_merge_ok(rq, bio))
 			continue;
+		if (!blk_mq_sched_allow_merge(q, rq, bio))
+			break;
 
 		el_ret = blk_try_merge(rq, bio);
+		if (el_ret == ELEVATOR_NO_MERGE)
+			continue;
+
+		if (!blk_mq_sched_allow_merge(q, rq, bio))
+			break;
+
 		if (el_ret == ELEVATOR_BACK_MERGE) {
 			if (bio_attempt_back_merge(q, rq, bio)) {
 				ctx->rq_merged++;
@@ -905,41 +920,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 	return ret != BLK_MQ_RQ_QUEUE_BUSY;
 }
 
-/*
- * Run this hardware queue, pulling any software queues mapped to it in.
- * Note that this function currently has various problems around ordering
- * of IO. In particular, we'd like FIFO behaviour on handling existing
- * items on the hctx->dispatch list. Ignore that for now.
- */
-static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
-{
-	LIST_HEAD(rq_list);
-	LIST_HEAD(driver_list);
-
-	if (unlikely(blk_mq_hctx_stopped(hctx)))
-		return;
-
-	hctx->run++;
-
-	/*
-	 * Touch any software queue that has pending entries.
-	 */
-	blk_mq_flush_busy_ctxs(hctx, &rq_list);
-
-	/*
-	 * If we have previous entries on our dispatch list, grab them
-	 * and stuff them at the front for more fair dispatch.
-	 */
-	if (!list_empty_careful(&hctx->dispatch)) {
-		spin_lock(&hctx->lock);
-		if (!list_empty(&hctx->dispatch))
-			list_splice_init(&hctx->dispatch, &rq_list);
-		spin_unlock(&hctx->lock);
-	}
-
-	blk_mq_dispatch_rq_list(hctx, &rq_list);
-}
-
 static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 {
 	int srcu_idx;
@@ -949,11 +929,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 
 	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
 		rcu_read_lock();
-		blk_mq_process_rq_list(hctx);
+		blk_mq_sched_dispatch_requests(hctx);
 		rcu_read_unlock();
 	} else {
 		srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
-		blk_mq_process_rq_list(hctx);
+		blk_mq_sched_dispatch_requests(hctx);
 		srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
 	}
 }
@@ -1147,32 +1127,10 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	blk_mq_hctx_mark_pending(hctx, ctx);
 }
 
-void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
-			   bool async)
-{
-	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-
-	spin_lock(&ctx->lock);
-	__blk_mq_insert_request(hctx, rq, at_head);
-	spin_unlock(&ctx->lock);
-
-	if (run_queue)
-		blk_mq_run_hw_queue(hctx, async);
-}
-
-static void blk_mq_insert_requests(struct request_queue *q,
-				     struct blk_mq_ctx *ctx,
-				     struct list_head *list,
-				     int depth,
-				     bool from_schedule)
+void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+			    struct list_head *list)
 
 {
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-
-	trace_block_unplug(q, depth, !from_schedule);
-
 	/*
 	 * preemption doesn't flush plug list, so it's possible ctx->cpu is
 	 * offline now
@@ -1188,8 +1146,6 @@ static void blk_mq_insert_requests(struct request_queue *q,
 	}
 	blk_mq_hctx_mark_pending(hctx, ctx);
 	spin_unlock(&ctx->lock);
-
-	blk_mq_run_hw_queue(hctx, from_schedule);
 }
 
 static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
@@ -1225,9 +1181,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 		BUG_ON(!rq->q);
 		if (rq->mq_ctx != this_ctx) {
 			if (this_ctx) {
-				blk_mq_insert_requests(this_q, this_ctx,
-							&ctx_list, depth,
-							from_schedule);
+				trace_block_unplug(this_q, depth, from_schedule);
+				blk_mq_sched_insert_requests(this_q, this_ctx,
+								&ctx_list,
+								from_schedule);
 			}
 
 			this_ctx = rq->mq_ctx;
@@ -1244,8 +1201,9 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	 * on 'ctx_list'. Do those.
 	 */
 	if (this_ctx) {
-		blk_mq_insert_requests(this_q, this_ctx, &ctx_list, depth,
-				       from_schedule);
+		trace_block_unplug(this_q, depth, from_schedule);
+		blk_mq_sched_insert_requests(this_q, this_ctx, &ctx_list,
+						from_schedule);
 	}
 }
 
@@ -1283,46 +1241,32 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
 		}
 
 		spin_unlock(&ctx->lock);
-		__blk_mq_free_request(hctx, ctx, rq);
+		__blk_mq_finish_request(hctx, ctx, rq);
 		return true;
 	}
 }
 
-static struct request *blk_mq_map_request(struct request_queue *q,
-					  struct bio *bio,
-					  struct blk_mq_alloc_data *data)
-{
-	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *ctx;
-	struct request *rq;
-
-	blk_queue_enter_live(q);
-	ctx = blk_mq_get_ctx(q);
-	hctx = blk_mq_map_queue(q, ctx->cpu);
-
-	trace_block_getrq(q, bio, bio->bi_opf);
-	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
-	rq = __blk_mq_alloc_request(data, bio->bi_opf);
-
-	data->hctx->queued++;
-	return rq;
-}
-
 static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 {
-	int ret;
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
 	struct blk_mq_queue_data bd = {
 		.rq = rq,
 		.list = NULL,
 		.last = 1
 	};
-	blk_qc_t new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
+	struct blk_mq_hw_ctx *hctx;
+	blk_qc_t new_cookie;
+	int ret;
+
+	if (q->elevator)
+		goto insert;
 
+	hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
 	if (blk_mq_hctx_stopped(hctx))
 		goto insert;
 
+	new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
+
 	/*
 	 * For OK queue, we are done. For error, kill it. Any other
 	 * error (busy), just add it to our list as we previously
@@ -1344,7 +1288,7 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 	}
 
 insert:
-	blk_mq_insert_request(rq, false, true, true);
+	blk_mq_sched_insert_request(rq, false, true, true);
 }
 
 /*
@@ -1377,9 +1321,14 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
 		return BLK_QC_T_NONE;
 
+	if (blk_mq_sched_bio_merge(q, bio))
+		return BLK_QC_T_NONE;
+
 	wb_acct = wbt_wait(q->rq_wb, bio, NULL);
 
-	rq = blk_mq_map_request(q, bio, &data);
+	trace_block_getrq(q, bio, bio->bi_opf);
+
+	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
 		return BLK_QC_T_NONE;
@@ -1441,6 +1390,12 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		goto done;
 	}
 
+	if (q->elevator) {
+		blk_mq_put_ctx(data.ctx);
+		blk_mq_bio_to_request(rq, bio);
+		blk_mq_sched_insert_request(rq, false, true, true);
+		goto done;
+	}
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
 		/*
 		 * For a SYNC request, send it to the hardware immediately. For
@@ -1486,9 +1441,14 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	} else
 		request_count = blk_plug_queued_count(q);
 
+	if (blk_mq_sched_bio_merge(q, bio))
+		return BLK_QC_T_NONE;
+
 	wb_acct = wbt_wait(q->rq_wb, bio, NULL);
 
-	rq = blk_mq_map_request(q, bio, &data);
+	trace_block_getrq(q, bio, bio->bi_opf);
+
+	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
 		return BLK_QC_T_NONE;
@@ -1538,6 +1498,12 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		return cookie;
 	}
 
+	if (q->elevator) {
+		blk_mq_put_ctx(data.ctx);
+		blk_mq_bio_to_request(rq, bio);
+		blk_mq_sched_insert_request(rq, false, true, true);
+		goto done;
+	}
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
 		/*
 		 * For a SYNC request, send it to the hardware immediately. For
@@ -1550,6 +1516,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	}
 
 	blk_mq_put_ctx(data.ctx);
+done:
 	return cookie;
 }
 
@@ -1558,7 +1525,7 @@ void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 {
 	struct page *page;
 
-	if (tags->rqs && set->ops->exit_request) {
+	if (tags->rqs && set && set->ops->exit_request) {
 		int i;
 
 		for (i = 0; i < tags->nr_tags; i++) {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e59f5ca520a2..898c3c9a60ec 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -47,7 +47,8 @@ struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
  */
 void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 				bool at_head);
-
+void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+				struct list_head *list);
 /*
  * CPU hotplug helpers
  */
@@ -123,8 +124,9 @@ static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
  */
 void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
 			struct request *rq, unsigned int op);
-void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 				struct request *rq);
+void blk_mq_finish_request(struct request *rq);
 struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
 					unsigned int op);
 
diff --git a/block/elevator.c b/block/elevator.c
index 022a26830297..e6b523360231 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -40,6 +40,7 @@
 #include <trace/events/block.h>
 
 #include "blk.h"
+#include "blk-mq-sched.h"
 
 static DEFINE_SPINLOCK(elv_list_lock);
 static LIST_HEAD(elv_list);
@@ -58,7 +59,9 @@ static int elv_iosched_allow_bio_merge(struct request *rq, struct bio *bio)
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.sq.elevator_allow_bio_merge_fn)
+	if (e->uses_mq && e->type->ops.mq.allow_merge)
+		return e->type->ops.mq.allow_merge(q, rq, bio);
+	else if (!e->uses_mq && e->type->ops.sq.elevator_allow_bio_merge_fn)
 		return e->type->ops.sq.elevator_allow_bio_merge_fn(q, rq, bio);
 
 	return 1;
@@ -163,6 +166,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
 	kobject_init(&eq->kobj, &elv_ktype);
 	mutex_init(&eq->sysfs_lock);
 	hash_init(eq->hash);
+	eq->uses_mq = e->uses_mq;
 
 	return eq;
 }
@@ -219,12 +223,19 @@ int elevator_init(struct request_queue *q, char *name)
 		if (!e) {
 			printk(KERN_ERR
 				"Default I/O scheduler not found. " \
-				"Using noop.\n");
+				"Using noop/none.\n");
+			if (q->mq_ops) {
+				elevator_put(e);
+				return 0;
+			}
 			e = elevator_get("noop", false);
 		}
 	}
 
-	err = e->ops.sq.elevator_init_fn(q, e);
+	if (e->uses_mq)
+		err = e->ops.mq.init_sched(q, e);
+	else
+		err = e->ops.sq.elevator_init_fn(q, e);
 	if (err)
 		elevator_put(e);
 	return err;
@@ -234,7 +245,9 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
-	if (e->type->ops.sq.elevator_exit_fn)
+	if (e->uses_mq && e->type->ops.mq.exit_sched)
+		e->type->ops.mq.exit_sched(e);
+	else if (!e->uses_mq && e->type->ops.sq.elevator_exit_fn)
 		e->type->ops.sq.elevator_exit_fn(e);
 	mutex_unlock(&e->sysfs_lock);
 
@@ -253,6 +266,7 @@ void elv_rqhash_del(struct request_queue *q, struct request *rq)
 	if (ELV_ON_HASH(rq))
 		__elv_rqhash_del(rq);
 }
+EXPORT_SYMBOL_GPL(elv_rqhash_del);
 
 void elv_rqhash_add(struct request_queue *q, struct request *rq)
 {
@@ -262,6 +276,7 @@ void elv_rqhash_add(struct request_queue *q, struct request *rq)
 	hash_add(e->hash, &rq->hash, rq_hash_key(rq));
 	rq->rq_flags |= RQF_HASHED;
 }
+EXPORT_SYMBOL_GPL(elv_rqhash_add);
 
 void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
 {
@@ -443,7 +458,9 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
 		return ELEVATOR_BACK_MERGE;
 	}
 
-	if (e->type->ops.sq.elevator_merge_fn)
+	if (e->uses_mq && e->type->ops.mq.request_merge)
+		return e->type->ops.mq.request_merge(q, req, bio);
+	else if (!e->uses_mq && e->type->ops.sq.elevator_merge_fn)
 		return e->type->ops.sq.elevator_merge_fn(q, req, bio);
 
 	return ELEVATOR_NO_MERGE;
@@ -456,8 +473,7 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
  *
  * Returns true if we merged, false otherwise
  */
-static bool elv_attempt_insert_merge(struct request_queue *q,
-				     struct request *rq)
+bool elv_attempt_insert_merge(struct request_queue *q, struct request *rq)
 {
 	struct request *__rq;
 	bool ret;
@@ -495,7 +511,9 @@ void elv_merged_request(struct request_queue *q, struct request *rq, int type)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.sq.elevator_merged_fn)
+	if (e->uses_mq && e->type->ops.mq.request_merged)
+		e->type->ops.mq.request_merged(q, rq, type);
+	else if (!e->uses_mq && e->type->ops.sq.elevator_merged_fn)
 		e->type->ops.sq.elevator_merged_fn(q, rq, type);
 
 	if (type == ELEVATOR_BACK_MERGE)
@@ -508,10 +526,15 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 			     struct request *next)
 {
 	struct elevator_queue *e = q->elevator;
-	const int next_sorted = next->rq_flags & RQF_SORTED;
-
-	if (next_sorted && e->type->ops.sq.elevator_merge_req_fn)
-		e->type->ops.sq.elevator_merge_req_fn(q, rq, next);
+	bool next_sorted = false;
+
+	if (e->uses_mq && e->type->ops.mq.requests_merged)
+		e->type->ops.mq.requests_merged(q, rq, next);
+	else if (e->type->ops.sq.elevator_merge_req_fn) {
+		next_sorted = next->rq_flags & RQF_SORTED;
+		if (next_sorted)
+			e->type->ops.sq.elevator_merge_req_fn(q, rq, next);
+	}
 
 	elv_rqhash_reposition(q, rq);
 
@@ -528,6 +551,9 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return;
+
 	if (e->type->ops.sq.elevator_bio_merged_fn)
 		e->type->ops.sq.elevator_bio_merged_fn(q, rq, bio);
 }
@@ -682,8 +708,11 @@ struct request *elv_latter_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.sq.elevator_latter_req_fn)
+	if (e->uses_mq && e->type->ops.mq.next_request)
+		return e->type->ops.mq.next_request(q, rq);
+	else if (!e->uses_mq && e->type->ops.sq.elevator_latter_req_fn)
 		return e->type->ops.sq.elevator_latter_req_fn(q, rq);
+
 	return NULL;
 }
 
@@ -691,7 +720,9 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.sq.elevator_former_req_fn)
+	if (e->uses_mq && e->type->ops.mq.former_request)
+		return e->type->ops.mq.former_request(q, rq);
+	if (!e->uses_mq && e->type->ops.sq.elevator_former_req_fn)
 		return e->type->ops.sq.elevator_former_req_fn(q, rq);
 	return NULL;
 }
@@ -701,6 +732,9 @@ int elv_set_request(struct request_queue *q, struct request *rq,
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return 0;
+
 	if (e->type->ops.sq.elevator_set_req_fn)
 		return e->type->ops.sq.elevator_set_req_fn(q, rq, bio, gfp_mask);
 	return 0;
@@ -710,6 +744,9 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return;
+
 	if (e->type->ops.sq.elevator_put_req_fn)
 		e->type->ops.sq.elevator_put_req_fn(rq);
 }
@@ -718,6 +755,9 @@ int elv_may_queue(struct request_queue *q, unsigned int op)
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return 0;
+
 	if (e->type->ops.sq.elevator_may_queue_fn)
 		return e->type->ops.sq.elevator_may_queue_fn(q, op);
 
@@ -728,6 +768,9 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	if (WARN_ON_ONCE(e->uses_mq))
+		return;
+
 	/*
 	 * request is released from the driver, io must be done
 	 */
@@ -803,7 +846,7 @@ int elv_register_queue(struct request_queue *q)
 		}
 		kobject_uevent(&e->kobj, KOBJ_ADD);
 		e->registered = 1;
-		if (e->type->ops.sq.elevator_registered_fn)
+		if (!e->uses_mq && e->type->ops.sq.elevator_registered_fn)
 			e->type->ops.sq.elevator_registered_fn(q);
 	}
 	return error;
@@ -891,9 +934,14 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old = q->elevator;
-	bool registered = old->registered;
+	bool old_registered = false;
 	int err;
 
+	if (q->mq_ops) {
+		blk_mq_freeze_queue(q);
+		blk_mq_quiesce_queue(q);
+	}
+
 	/*
 	 * Turn on BYPASS and drain all requests w/ elevator private data.
 	 * Block layer doesn't call into a quiesced elevator - all requests
@@ -901,32 +949,52 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	 * using INSERT_BACK.  All requests have SOFTBARRIER set and no
 	 * merge happens either.
 	 */
-	blk_queue_bypass_start(q);
+	if (old) {
+		old_registered = old->registered;
 
-	/* unregister and clear all auxiliary data of the old elevator */
-	if (registered)
-		elv_unregister_queue(q);
+		if (!q->mq_ops)
+			blk_queue_bypass_start(q);
 
-	spin_lock_irq(q->queue_lock);
-	ioc_clear_queue(q);
-	spin_unlock_irq(q->queue_lock);
+		/* unregister and clear all auxiliary data of the old elevator */
+		if (old_registered)
+			elv_unregister_queue(q);
+
+		spin_lock_irq(q->queue_lock);
+		ioc_clear_queue(q);
+		spin_unlock_irq(q->queue_lock);
+	}
 
 	/* allocate, init and register new elevator */
-	err = new_e->ops.sq.elevator_init_fn(q, new_e);
-	if (err)
-		goto fail_init;
+	if (new_e) {
+		if (new_e->uses_mq)
+			err = new_e->ops.mq.init_sched(q, new_e);
+		else
+			err = new_e->ops.sq.elevator_init_fn(q, new_e);
+		if (err)
+			goto fail_init;
 
-	if (registered) {
 		err = elv_register_queue(q);
 		if (err)
 			goto fail_register;
-	}
+	} else
+		q->elevator = NULL;
 
 	/* done, kill the old one and finish */
-	elevator_exit(old);
-	blk_queue_bypass_end(q);
+	if (old) {
+		elevator_exit(old);
+		if (!q->mq_ops)
+			blk_queue_bypass_end(q);
+	}
+
+	if (q->mq_ops) {
+		blk_mq_unfreeze_queue(q);
+		blk_mq_start_stopped_hw_queues(q, true);
+	}
 
-	blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	if (new_e)
+		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	else
+		blk_add_trace_msg(q, "elv switch: none");
 
 	return 0;
 
@@ -934,9 +1002,16 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	elevator_exit(q->elevator);
 fail_init:
 	/* switch failed, restore and re-register old elevator */
-	q->elevator = old;
-	elv_register_queue(q);
-	blk_queue_bypass_end(q);
+	if (old) {
+		q->elevator = old;
+		elv_register_queue(q);
+		if (!q->mq_ops)
+			blk_queue_bypass_end(q);
+	}
+	if (q->mq_ops) {
+		blk_mq_unfreeze_queue(q);
+		blk_mq_start_stopped_hw_queues(q, true);
+	}
 
 	return err;
 }
@@ -949,8 +1024,11 @@ static int __elevator_change(struct request_queue *q, const char *name)
 	char elevator_name[ELV_NAME_MAX];
 	struct elevator_type *e;
 
-	if (!q->elevator)
-		return -ENXIO;
+	/*
+	 * Special case for mq, turn off scheduling
+	 */
+	if (q->mq_ops && !strncmp(name, "none", 4))
+		return elevator_switch(q, NULL);
 
 	strlcpy(elevator_name, name, sizeof(elevator_name));
 	e = elevator_get(strstrip(elevator_name), true);
@@ -959,11 +1037,23 @@ static int __elevator_change(struct request_queue *q, const char *name)
 		return -EINVAL;
 	}
 
-	if (!strcmp(elevator_name, q->elevator->type->elevator_name)) {
+	if (q->elevator &&
+	    !strcmp(elevator_name, q->elevator->type->elevator_name)) {
 		elevator_put(e);
 		return 0;
 	}
 
+	if (!e->uses_mq && q->mq_ops) {
+		printk(KERN_ERR "blk-mq-sched: elv %s does not support mq\n", elevator_name);
+		elevator_put(e);
+		return -EINVAL;
+	}
+	if (e->uses_mq && !q->mq_ops) {
+		printk(KERN_ERR "blk-mq-sched: elv %s is for mq\n", elevator_name);
+		elevator_put(e);
+		return -EINVAL;
+	}
+
 	return elevator_switch(q, e);
 }
 
@@ -985,7 +1075,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
 {
 	int ret;
 
-	if (!q->elevator)
+	if (!q->mq_ops || q->request_fn)
 		return count;
 
 	ret = __elevator_change(q, name);
@@ -999,24 +1089,34 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
 ssize_t elv_iosched_show(struct request_queue *q, char *name)
 {
 	struct elevator_queue *e = q->elevator;
-	struct elevator_type *elv;
+	struct elevator_type *elv = NULL;
 	struct elevator_type *__e;
 	int len = 0;
 
-	if (!q->elevator || !blk_queue_stackable(q))
+	if (!blk_queue_stackable(q))
 		return sprintf(name, "none\n");
 
-	elv = e->type;
+	if (!q->elevator)
+		len += sprintf(name+len, "[none] ");
+	else
+		elv = e->type;
 
 	spin_lock(&elv_list_lock);
 	list_for_each_entry(__e, &elv_list, list) {
-		if (!strcmp(elv->elevator_name, __e->elevator_name))
+		if (elv && !strcmp(elv->elevator_name, __e->elevator_name)) {
 			len += sprintf(name+len, "[%s] ", elv->elevator_name);
-		else
+			continue;
+		}
+		if (__e->uses_mq && q->mq_ops)
+			len += sprintf(name+len, "%s ", __e->elevator_name);
+		else if (!__e->uses_mq && !q->mq_ops)
 			len += sprintf(name+len, "%s ", __e->elevator_name);
 	}
 	spin_unlock(&elv_list_lock);
 
+	if (q->mq_ops && q->elevator)
+		len += sprintf(name+len, "none");
+
 	len += sprintf(len+name, "\n");
 	return len;
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2686f9e7302a..e3159be841ff 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -22,6 +22,7 @@ struct blk_mq_hw_ctx {
 
 	unsigned long		flags;		/* BLK_MQ_F_* flags */
 
+	void			*sched_data;
 	struct request_queue	*queue;
 	struct blk_flush_queue	*fq;
 
@@ -156,6 +157,7 @@ enum {
 
 	BLK_MQ_S_STOPPED	= 0,
 	BLK_MQ_S_TAG_ACTIVE	= 1,
+	BLK_MQ_S_SCHED_RESTART	= 2,
 
 	BLK_MQ_MAX_DEPTH	= 10240,
 
@@ -179,7 +181,6 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
 
 void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule);
 
-void blk_mq_insert_request(struct request *, bool, bool, bool);
 void blk_mq_free_request(struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 2a9e966eed03..417810b2d2f5 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -77,6 +77,32 @@ struct elevator_ops
 	elevator_registered_fn *elevator_registered_fn;
 };
 
+struct blk_mq_alloc_data;
+struct blk_mq_hw_ctx;
+
+struct elevator_mq_ops {
+	int (*init_sched)(struct request_queue *, struct elevator_type *);
+	void (*exit_sched)(struct elevator_queue *);
+
+	bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
+	bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
+	int (*request_merge)(struct request_queue *q, struct request **, struct bio *);
+	void (*request_merged)(struct request_queue *, struct request *, int);
+	void (*requests_merged)(struct request_queue *, struct request *, struct request *);
+	struct request *(*get_request)(struct request_queue *, unsigned int, struct blk_mq_alloc_data *);
+	bool (*put_request)(struct request *);
+	void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
+	void (*dispatch_requests)(struct blk_mq_hw_ctx *, struct list_head *);
+	bool (*has_work)(struct blk_mq_hw_ctx *);
+	void (*completed_request)(struct blk_mq_hw_ctx *, struct request *);
+	void (*started_request)(struct request *);
+	void (*requeue_request)(struct request *);
+	struct request *(*former_request)(struct request_queue *, struct request *);
+	struct request *(*next_request)(struct request_queue *, struct request *);
+	int (*get_rq_priv)(struct request_queue *, struct request *);
+	void (*put_rq_priv)(struct request_queue *, struct request *);
+};
+
 #define ELV_NAME_MAX	(16)
 
 struct elv_fs_entry {
@@ -96,12 +122,14 @@ struct elevator_type
 	/* fields provided by elevator implementation */
 	union {
 		struct elevator_ops sq;
+		struct elevator_mq_ops mq;
 	} ops;
 	size_t icq_size;	/* see iocontext.h */
 	size_t icq_align;	/* ditto */
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+	bool uses_mq;
 
 	/* managed by elevator core */
 	char icq_cache_name[ELV_NAME_MAX + 5];	/* elvname + "_io_cq" */
@@ -125,6 +153,7 @@ struct elevator_queue
 	struct kobject kobj;
 	struct mutex sysfs_lock;
 	unsigned int registered:1;
+	unsigned int uses_mq:1;
 	DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
 };
 
@@ -141,6 +170,7 @@ extern void elv_merge_requests(struct request_queue *, struct request *,
 extern void elv_merged_request(struct request_queue *, struct request *, int);
 extern void elv_bio_merged(struct request_queue *q, struct request *,
 				struct bio *);
+extern bool elv_attempt_insert_merge(struct request_queue *, struct request *);
 extern void elv_requeue_request(struct request_queue *, struct request *);
 extern struct request *elv_former_request(struct request_queue *, struct request *);
 extern struct request *elv_latter_request(struct request_queue *, struct request *);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-17  0:12 [PATCHSET v4] blk-mq-scheduling framework Jens Axboe
                   ` (5 preceding siblings ...)
  2016-12-17  0:12 ` [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers Jens Axboe
@ 2016-12-17  0:12 ` Jens Axboe
  2016-12-20  9:34   ` Paolo Valente
                     ` (7 more replies)
  2016-12-17  0:12 ` [PATCH 8/8] blk-mq-sched: allow setting of default " Jens Axboe
                   ` (2 subsequent siblings)
  9 siblings, 8 replies; 69+ messages in thread
From: Jens Axboe @ 2016-12-17  0:12 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

This is basically identical to deadline-iosched, except it registers
as an MQ-capable scheduler. This is still a single-queue design.
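
In sketch form, the registration difference amounts to filling in the new
mq ops union and setting uses_mq (abridged from the full elevator_type at
the bottom of this patch; only a few hooks shown):

  static struct elevator_type mq_deadline = {
          .ops.mq = {
                  .insert_requests   = dd_insert_requests,
                  .dispatch_requests = dd_dispatch_requests,
                  .init_sched        = dd_init_queue,
                  .exit_sched        = dd_exit_queue,
                  /* remaining hooks elided, see the full patch below */
          },
          .uses_mq        = true,
          .elevator_name  = "mq-deadline",
          .elevator_owner = THIS_MODULE,
  };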

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/Kconfig.iosched |   6 +
 block/Makefile        |   1 +
 block/mq-deadline.c   | 649 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 656 insertions(+)
 create mode 100644 block/mq-deadline.c

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 421bef9c4c48..490ef2850fae 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -32,6 +32,12 @@ config IOSCHED_CFQ
 
 	  This is the default I/O scheduler.
 
+config MQ_IOSCHED_DEADLINE
+	tristate "MQ deadline I/O scheduler"
+	default y
+	---help---
+	  MQ version of the deadline IO scheduler.
+
 config CFQ_GROUP_IOSCHED
 	bool "CFQ Group Scheduling support"
 	depends on IOSCHED_CFQ && BLK_CGROUP
diff --git a/block/Makefile b/block/Makefile
index 2eee9e1bb6db..3ee0abd7205a 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
+obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
new file mode 100644
index 000000000000..3cb9de21ab21
--- /dev/null
+++ b/block/mq-deadline.c
@@ -0,0 +1,649 @@
+/*
+ *  MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
+ *  for the blk-mq scheduling framework
+ *
+ *  Copyright (C) 2016 Jens Axboe <axboe@kernel.dk>
+ */
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/elevator.h>
+#include <linux/bio.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/compiler.h>
+#include <linux/rbtree.h>
+#include <linux/sbitmap.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
+
+static unsigned int queue_depth = 256;
+module_param(queue_depth, uint, 0644);
+MODULE_PARM_DESC(queue_depth, "Use this value as the scheduler queue depth");
+
+/*
+ * See Documentation/block/deadline-iosched.txt
+ */
+static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
+static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
+static const int writes_starved = 2;    /* max times reads can starve a write */
+static const int fifo_batch = 16;       /* # of sequential requests treated as one
+				     by the above parameters. For throughput. */
+
+struct deadline_data {
+	/*
+	 * run time data
+	 */
+
+	/*
+	 * requests (deadline_rq s) are present on both sort_list and fifo_list
+	 */
+	struct rb_root sort_list[2];
+	struct list_head fifo_list[2];
+
+	/*
+	 * next in sort order. read, write or both are NULL
+	 */
+	struct request *next_rq[2];
+	unsigned int batching;		/* number of sequential requests made */
+	unsigned int starved;		/* times reads have starved writes */
+
+	/*
+	 * settings that change how the i/o scheduler behaves
+	 */
+	int fifo_expire[2];
+	int fifo_batch;
+	int writes_starved;
+	int front_merges;
+
+	spinlock_t lock;
+	struct list_head dispatch;
+	struct blk_mq_tags *tags;
+	atomic_t wait_index;
+};
+
+static inline struct rb_root *
+deadline_rb_root(struct deadline_data *dd, struct request *rq)
+{
+	return &dd->sort_list[rq_data_dir(rq)];
+}
+
+/*
+ * get the request after `rq' in sector-sorted order
+ */
+static inline struct request *
+deadline_latter_request(struct request *rq)
+{
+	struct rb_node *node = rb_next(&rq->rb_node);
+
+	if (node)
+		return rb_entry_rq(node);
+
+	return NULL;
+}
+
+static void
+deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
+{
+	struct rb_root *root = deadline_rb_root(dd, rq);
+
+	elv_rb_add(root, rq);
+}
+
+static inline void
+deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
+{
+	const int data_dir = rq_data_dir(rq);
+
+	if (dd->next_rq[data_dir] == rq)
+		dd->next_rq[data_dir] = deadline_latter_request(rq);
+
+	elv_rb_del(deadline_rb_root(dd, rq), rq);
+}
+
+/*
+ * remove rq from rbtree and fifo.
+ */
+static void deadline_remove_request(struct request_queue *q, struct request *rq)
+{
+	struct deadline_data *dd = q->elevator->elevator_data;
+
+	list_del_init(&rq->queuelist);
+
+	/*
+	 * We might not be on the rbtree, if we are doing an insert merge
+	 */
+	if (!RB_EMPTY_NODE(&rq->rb_node))
+		deadline_del_rq_rb(dd, rq);
+
+	elv_rqhash_del(q, rq);
+	if (q->last_merge == rq)
+		q->last_merge = NULL;
+}
+
+static void dd_request_merged(struct request_queue *q, struct request *req,
+			      int type)
+{
+	struct deadline_data *dd = q->elevator->elevator_data;
+
+	/*
+	 * if the merge was a front merge, we need to reposition request
+	 */
+	if (type == ELEVATOR_FRONT_MERGE) {
+		elv_rb_del(deadline_rb_root(dd, req), req);
+		deadline_add_rq_rb(dd, req);
+	}
+}
+
+static void dd_merged_requests(struct request_queue *q, struct request *req,
+			       struct request *next)
+{
+	/*
+	 * if next expires before rq, assign its expire time to rq
+	 * and move into next position (next will be deleted) in fifo
+	 */
+	if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
+		if (time_before((unsigned long)next->fifo_time,
+				(unsigned long)req->fifo_time)) {
+			list_move(&req->queuelist, &next->queuelist);
+			req->fifo_time = next->fifo_time;
+		}
+	}
+
+	/*
+	 * kill knowledge of next, this one is a goner
+	 */
+	deadline_remove_request(q, next);
+}
+
+/*
+ * move an entry to dispatch queue
+ */
+static void
+deadline_move_request(struct deadline_data *dd, struct request *rq)
+{
+	const int data_dir = rq_data_dir(rq);
+
+	dd->next_rq[READ] = NULL;
+	dd->next_rq[WRITE] = NULL;
+	dd->next_rq[data_dir] = deadline_latter_request(rq);
+
+	/*
+	 * take it off the sort and fifo list
+	 */
+	deadline_remove_request(rq->q, rq);
+}
+
+/*
+ * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
+ * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
+ */
+static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+{
+	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+
+	/*
+	 * rq is expired!
+	 */
+	if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
+		return 1;
+
+	return 0;
+}
+
+/*
+ * deadline_dispatch_requests selects the best request according to
+ * read/write expire, fifo_batch, etc
+ */
+static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+	struct request *rq;
+	bool reads, writes;
+	int data_dir;
+
+	spin_lock(&dd->lock);
+
+	if (!list_empty(&dd->dispatch)) {
+		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		goto done;
+	}
+
+	reads = !list_empty(&dd->fifo_list[READ]);
+	writes = !list_empty(&dd->fifo_list[WRITE]);
+
+	/*
+	 * batches are currently reads XOR writes
+	 */
+	if (dd->next_rq[WRITE])
+		rq = dd->next_rq[WRITE];
+	else
+		rq = dd->next_rq[READ];
+
+	if (rq && dd->batching < dd->fifo_batch)
+		/* we have a next request and are still entitled to batch */
+		goto dispatch_request;
+
+	/*
+	 * at this point we are not running a batch. select the appropriate
+	 * data direction (read / write)
+	 */
+
+	if (reads) {
+		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+
+		if (writes && (dd->starved++ >= dd->writes_starved))
+			goto dispatch_writes;
+
+		data_dir = READ;
+
+		goto dispatch_find_request;
+	}
+
+	/*
+	 * there are either no reads or writes have been starved
+	 */
+
+	if (writes) {
+dispatch_writes:
+		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+
+		dd->starved = 0;
+
+		data_dir = WRITE;
+
+		goto dispatch_find_request;
+	}
+
+	spin_unlock(&dd->lock);
+	return NULL;
+
+dispatch_find_request:
+	/*
+	 * we are not running a batch, find best request for selected data_dir
+	 */
+	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+		/*
+		 * A deadline has expired, the last request was in the other
+		 * direction, or we have run out of higher-sectored requests.
+		 * Start again from the request with the earliest expiry time.
+		 */
+		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+	} else {
+		/*
+		 * The last req was the same dir and we have a next request in
+		 * sort order. No expired requests so continue on from here.
+		 */
+		rq = dd->next_rq[data_dir];
+	}
+
+	dd->batching = 0;
+
+dispatch_request:
+	/*
+	 * rq is the selected appropriate request.
+	 */
+	dd->batching++;
+	deadline_move_request(dd, rq);
+done:
+	rq->rq_flags |= RQF_STARTED;
+	spin_unlock(&dd->lock);
+	return rq;
+}
+
+static void dd_dispatch_requests(struct blk_mq_hw_ctx *hctx,
+				 struct list_head *rq_list)
+{
+	blk_mq_sched_dispatch_shadow_requests(hctx, rq_list, __dd_dispatch_request);
+}
+
+static void dd_exit_queue(struct elevator_queue *e)
+{
+	struct deadline_data *dd = e->elevator_data;
+
+	BUG_ON(!list_empty(&dd->fifo_list[READ]));
+	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
+
+	blk_mq_sched_free_requests(dd->tags);
+	kfree(dd);
+}
+
+/*
+ * initialize elevator private data (deadline_data).
+ */
+static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
+{
+	struct deadline_data *dd;
+	struct elevator_queue *eq;
+
+	eq = elevator_alloc(q, e);
+	if (!eq)
+		return -ENOMEM;
+
+	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
+	if (!dd) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+	eq->elevator_data = dd;
+
+	dd->tags = blk_mq_sched_alloc_requests(queue_depth, q->node);
+	if (!dd->tags) {
+		kfree(dd);
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	INIT_LIST_HEAD(&dd->fifo_list[READ]);
+	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
+	dd->sort_list[READ] = RB_ROOT;
+	dd->sort_list[WRITE] = RB_ROOT;
+	dd->fifo_expire[READ] = read_expire;
+	dd->fifo_expire[WRITE] = write_expire;
+	dd->writes_starved = writes_starved;
+	dd->front_merges = 1;
+	dd->fifo_batch = fifo_batch;
+	spin_lock_init(&dd->lock);
+	INIT_LIST_HEAD(&dd->dispatch);
+	atomic_set(&dd->wait_index, 0);
+
+	q->elevator = eq;
+	return 0;
+}
+
+static int dd_request_merge(struct request_queue *q, struct request **rq,
+			    struct bio *bio)
+{
+	struct deadline_data *dd = q->elevator->elevator_data;
+	sector_t sector = bio_end_sector(bio);
+	struct request *__rq;
+
+	if (!dd->front_merges)
+		return ELEVATOR_NO_MERGE;
+
+	__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+	if (__rq) {
+		BUG_ON(sector != blk_rq_pos(__rq));
+
+		if (elv_bio_merge_ok(__rq, bio)) {
+			*rq = __rq;
+			return ELEVATOR_FRONT_MERGE;
+		}
+	}
+
+	return ELEVATOR_NO_MERGE;
+}
+
+static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
+{
+	struct request_queue *q = hctx->queue;
+	struct deadline_data *dd = q->elevator->elevator_data;
+	int ret;
+
+	spin_lock(&dd->lock);
+	ret = blk_mq_sched_try_merge(q, bio);
+	spin_unlock(&dd->lock);
+
+	return ret;
+}
+
+/*
+ * add rq to rbtree and fifo
+ */
+static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+			      bool at_head)
+{
+	struct request_queue *q = hctx->queue;
+	struct deadline_data *dd = q->elevator->elevator_data;
+	const int data_dir = rq_data_dir(rq);
+
+	if (blk_mq_sched_try_insert_merge(q, rq))
+		return;
+
+	blk_mq_sched_request_inserted(rq);
+
+	/*
+	 * If we're trying to insert a real request, just send it directly
+	 * to the hardware dispatch list. This only happens for a requeue,
+	 * or FUA/FLUSH requests.
+	 */
+	if (!blk_mq_sched_rq_is_shadow(rq)) {
+		spin_lock(&hctx->lock);
+		list_add_tail(&rq->queuelist, &hctx->dispatch);
+		spin_unlock(&hctx->lock);
+		return;
+	}
+
+	if (at_head || rq->cmd_type != REQ_TYPE_FS) {
+		if (at_head)
+			list_add(&rq->queuelist, &dd->dispatch);
+		else
+			list_add_tail(&rq->queuelist, &dd->dispatch);
+	} else {
+		deadline_add_rq_rb(dd, rq);
+
+		if (rq_mergeable(rq)) {
+			elv_rqhash_add(q, rq);
+			if (!q->last_merge)
+				q->last_merge = rq;
+		}
+
+		/*
+		 * set expire time and add to fifo list
+		 */
+		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
+		list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+	}
+}
+
+static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
+			       struct list_head *list, bool at_head)
+{
+	struct request_queue *q = hctx->queue;
+	struct deadline_data *dd = q->elevator->elevator_data;
+
+	spin_lock(&dd->lock);
+	while (!list_empty(list)) {
+		struct request *rq;
+
+		rq = list_first_entry(list, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		dd_insert_request(hctx, rq, at_head);
+	}
+	spin_unlock(&dd->lock);
+}
+
+static struct request *dd_get_request(struct request_queue *q, unsigned int op,
+				      struct blk_mq_alloc_data *data)
+{
+	struct deadline_data *dd = q->elevator->elevator_data;
+	struct request *rq;
+
+	/*
+	 * The flush machinery intercepts before we insert the request. As
+	 * a work-around, just hand it back a real request.
+	 */
+	if (unlikely(op & (REQ_PREFLUSH | REQ_FUA)))
+		rq = __blk_mq_alloc_request(data, op);
+	else {
+		rq = blk_mq_sched_alloc_shadow_request(q, data, dd->tags, &dd->wait_index);
+		if (rq)
+			blk_mq_rq_ctx_init(q, data->ctx, rq, op);
+	}
+
+	return rq;
+}
+
+static bool dd_put_request(struct request *rq)
+{
+	/*
+	 * If it's a real request, we just have to free it. For a shadow
+	 * request, we should only free it if we haven't started it. A
+	 * started request is mapped to a real one, and the real one will
+	 * free it. We can get here with request merges, since we then
+	 * free the request before we start/issue it.
+	 */
+	if (!blk_mq_sched_rq_is_shadow(rq))
+		return false;
+
+	if (!(rq->rq_flags & RQF_STARTED)) {
+		struct request_queue *q = rq->q;
+		struct deadline_data *dd = q->elevator->elevator_data;
+
+		/*
+		 * IO completion would normally do this, but if we merge
+		 * and free before we issue the request, drop both the
+		 * tag and queue ref
+		 */
+		blk_mq_sched_free_shadow_request(dd->tags, rq);
+		blk_queue_exit(q);
+	}
+
+	return true;
+}
+
+static void dd_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+	struct request *sched_rq = rq->end_io_data;
+
+	/*
+	 * sched_rq can be NULL, if we haven't setup the shadow yet
+	 * because we failed getting one.
+	 */
+	if (sched_rq) {
+		struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+
+		blk_mq_sched_free_shadow_request(dd->tags, sched_rq);
+		blk_mq_start_stopped_hw_queue(hctx, true);
+	}
+}
+
+static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
+{
+	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+
+	return !list_empty_careful(&dd->dispatch) ||
+		!list_empty_careful(&dd->fifo_list[0]) ||
+		!list_empty_careful(&dd->fifo_list[1]);
+}
+
+/*
+ * sysfs parts below
+ */
+static ssize_t
+deadline_var_show(int var, char *page)
+{
+	return sprintf(page, "%d\n", var);
+}
+
+static ssize_t
+deadline_var_store(int *var, const char *page, size_t count)
+{
+	char *p = (char *) page;
+
+	*var = simple_strtol(p, &p, 10);
+	return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct deadline_data *dd = e->elevator_data;			\
+	int __data = __VAR;						\
+	if (__CONV)							\
+		__data = jiffies_to_msecs(__data);			\
+	return deadline_var_show(__data, (page));			\
+}
+SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
+SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
+SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
+SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
+SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
+{									\
+	struct deadline_data *dd = e->elevator_data;			\
+	int __data;							\
+	int ret = deadline_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV)							\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
+STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
+STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
+#undef STORE_FUNCTION
+
+#define DD_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, deadline_##name##_show, \
+				      deadline_##name##_store)
+
+static struct elv_fs_entry deadline_attrs[] = {
+	DD_ATTR(read_expire),
+	DD_ATTR(write_expire),
+	DD_ATTR(writes_starved),
+	DD_ATTR(front_merges),
+	DD_ATTR(fifo_batch),
+	__ATTR_NULL
+};
+
+static struct elevator_type mq_deadline = {
+	.ops.mq = {
+		.get_request		= dd_get_request,
+		.put_request		= dd_put_request,
+		.insert_requests	= dd_insert_requests,
+		.dispatch_requests	= dd_dispatch_requests,
+		.completed_request	= dd_completed_request,
+		.next_request		= elv_rb_latter_request,
+		.former_request		= elv_rb_former_request,
+		.bio_merge		= dd_bio_merge,
+		.request_merge		= dd_request_merge,
+		.requests_merged	= dd_merged_requests,
+		.request_merged		= dd_request_merged,
+		.has_work		= dd_has_work,
+		.init_sched		= dd_init_queue,
+		.exit_sched		= dd_exit_queue,
+	},
+
+	.uses_mq	= true,
+	.elevator_attrs = deadline_attrs,
+	.elevator_name = "mq-deadline",
+	.elevator_owner = THIS_MODULE,
+};
+
+static int __init deadline_init(void)
+{
+	if (!queue_depth) {
+		pr_err("mq-deadline: queue depth must be > 0\n");
+		return -EINVAL;
+	}
+	return elv_register(&mq_deadline);
+}
+
+static void __exit deadline_exit(void)
+{
+	elv_unregister(&mq_deadline);
+}
+
+module_init(deadline_init);
+module_exit(deadline_exit);
+
+MODULE_AUTHOR("Jens Axboe");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("MQ deadline IO scheduler");
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 8/8] blk-mq-sched: allow setting of default IO scheduler
  2016-12-17  0:12 [PATCHSET v4] blk-mq-scheduling framework Jens Axboe
                   ` (6 preceding siblings ...)
  2016-12-17  0:12 ` [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
@ 2016-12-17  0:12 ` Jens Axboe
  2016-12-19 11:32 ` [PATCHSET v4] blk-mq-scheduling framework Paolo Valente
  2016-12-22 16:23 ` Bart Van Assche
  9 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2016-12-17  0:12 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: paolo.valente, osandov, Jens Axboe

Add Kconfig entries to manage what devices get assigned an MQ
scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
The latter is useful for admin type queues that still allocate a blk-mq
queue and tag set, but aren't used for normal IO.
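
On the driver side, opting out is a single flag at tag set setup time. A
minimal sketch using hypothetical "foo" driver names (the real change for
the NVMe admin queue is in the diff below):

  /* Admin/internal queue setup; BLK_MQ_F_NO_SCHED keeps the core from
   * attaching a default IO scheduler to queues created from this tag set.
   */
  static int foo_alloc_admin_tags(struct foo_dev *dev)
  {
          dev->admin_tagset.ops = &foo_admin_mq_ops;
          dev->admin_tagset.nr_hw_queues = 1;
          dev->admin_tagset.queue_depth = FOO_ADMIN_QUEUE_DEPTH;
          dev->admin_tagset.numa_node = dev_to_node(dev->dev);
          dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
          dev->admin_tagset.driver_data = dev;

          return blk_mq_alloc_tag_set(&dev->admin_tagset);
  }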

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/Kconfig.iosched   | 43 +++++++++++++++++++++++++++++++++++++------
 block/blk-mq-sched.c    | 19 +++++++++++++++++++
 block/blk-mq-sched.h    |  2 ++
 block/blk-mq.c          |  3 +++
 block/elevator.c        |  5 ++++-
 drivers/nvme/host/pci.c |  1 +
 include/linux/blk-mq.h  |  1 +
 7 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 490ef2850fae..96216cf18560 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -32,12 +32,6 @@ config IOSCHED_CFQ
 
 	  This is the default I/O scheduler.
 
-config MQ_IOSCHED_DEADLINE
-	tristate "MQ deadline I/O scheduler"
-	default y
-	---help---
-	  MQ version of the deadline IO scheduler.
-
 config CFQ_GROUP_IOSCHED
 	bool "CFQ Group Scheduling support"
 	depends on IOSCHED_CFQ && BLK_CGROUP
@@ -69,6 +63,43 @@ config DEFAULT_IOSCHED
 	default "cfq" if DEFAULT_CFQ
 	default "noop" if DEFAULT_NOOP
 
+config MQ_IOSCHED_DEADLINE
+	tristate "MQ deadline I/O scheduler"
+	default y
+	---help---
+	  MQ version of the deadline IO scheduler.
+
+config MQ_IOSCHED_NONE
+	bool
+	default y
+
+choice
+	prompt "Default MQ I/O scheduler"
+	default MQ_IOSCHED_NONE
+	help
+	  Select the I/O scheduler which will be used by default for all
+	  blk-mq managed block devices.
+
+	config DEFAULT_MQ_DEADLINE
+		bool "MQ Deadline" if MQ_IOSCHED_DEADLINE=y
+
+	config DEFAULT_MQ_NONE
+		bool "None"
+
+endchoice
+
+config DEFAULT_MQ_IOSCHED
+	string
+	default "mq-deadline" if DEFAULT_MQ_DEADLINE
+	default "none" if DEFAULT_MQ_NONE
+
+config MQ_IOSCHED_ONLY_SQ
+	bool "Enable blk-mq IO scheduler only for single queue devices"
+	default y
+	help
+	  Say Y here, if you only want to enable IO scheduling on block
+	  devices that have a single queue registered.
+
 endmenu
 
 endif
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index b7e1839d4785..1f06efcdaa2d 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -432,3 +432,22 @@ void blk_mq_sched_request_inserted(struct request *rq)
 	trace_block_rq_insert(rq->q, rq);
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted);
+
+int blk_mq_sched_init(struct request_queue *q)
+{
+	int ret;
+
+#if defined(CONFIG_DEFAULT_MQ_NONE)
+	return 0;
+#endif
+#if defined(CONFIG_MQ_IOSCHED_ONLY_SQ)
+	if (q->nr_hw_queues > 1)
+		return 0;
+#endif
+
+	mutex_lock(&q->sysfs_lock);
+	ret = elevator_init(q, NULL);
+	mutex_unlock(&q->sysfs_lock);
+
+	return ret;
+}
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1d1a4e9ce6ca..826f3e6991e3 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -37,6 +37,8 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 
+int blk_mq_sched_init(struct request_queue *q);
+
 static inline bool
 blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
 {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 032dca4a27bf..0d8ea45b8562 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2105,6 +2105,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	INIT_LIST_HEAD(&q->requeue_list);
 	spin_lock_init(&q->requeue_lock);
 
+	if (!(set->flags & BLK_MQ_F_NO_SCHED))
+		blk_mq_sched_init(q);
+
 	if (q->nr_hw_queues > 1)
 		blk_queue_make_request(q, blk_mq_make_request);
 	else
diff --git a/block/elevator.c b/block/elevator.c
index e6b523360231..eb34c26f675f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -219,7 +219,10 @@ int elevator_init(struct request_queue *q, char *name)
 	}
 
 	if (!e) {
-		e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
+		if (q->mq_ops)
+			e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
+		else
+			e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
 		if (!e) {
 			printk(KERN_ERR
 				"Default I/O scheduler not found. " \
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d6e6bce93d0c..063410d9b3cc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1188,6 +1188,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
 		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
 		dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
+		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
 		dev->admin_tagset.driver_data = dev;
 
 		if (blk_mq_alloc_tag_set(&dev->admin_tagset))
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e3159be841ff..9255ccb043f2 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -152,6 +152,7 @@ enum {
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
 	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
+	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2016-12-17  0:12 [PATCHSET v4] blk-mq-scheduling framework Jens Axboe
                   ` (7 preceding siblings ...)
  2016-12-17  0:12 ` [PATCH 8/8] blk-mq-sched: allow setting of default " Jens Axboe
@ 2016-12-19 11:32 ` Paolo Valente
  2016-12-19 15:20   ` Jens Axboe
  2016-12-22 16:23 ` Bart Van Assche
  9 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2016-12-19 11:32 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jens Axboe, linux-block, Linux-Kernal, Omar Sandoval,
	Linus Walleij, Ulf Hansson, Mark Brown


> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
> 
> This is version 4 of this patchset, version 3 was posted here:
> 
> https://marc.info/?l=linux-block&m=148178513407631&w=2
> 
> From the discussion last time, I looked into the feasibility of having
> two sets of tags for the same request pool, to avoid having to copy
> some of the request fields at dispatch and completion time. To do that,
> we'd have to replace the driver tag map(s) with our own, and augment
> that with tag map(s) on the side representing the device queue depth.
> Queuing IO with the scheduler would allocate from the new map, and
> dispatching would acquire the "real" tag. We would need to change
> drivers to do this, or add an extra indirection table to map a real
> tag to the scheduler tag. We would also need a 1:1 mapping between
> scheduler and hardware tag pools, or additional info to track it.
> Unless someone can convince me otherwise, I think the current approach
> is cleaner.
> 
> I wasn't going to post v4 so soon, but I discovered a bug that led
> to drastically decreased merging. Especially on rotating storage,
> this release should be fast, and on par with the merging that we
> get through the legacy schedulers.
> 

I'm modifying bfq.  You mentioned other missing pieces to come.  Do
you already have an idea of what they are, so that I am somehow
prepared for what won't work even if my changes are right?

Thanks,
Paolo

> Changes since v3:
> 
> - Keep the blk_mq_free_request/__blk_mq_free_request() as the
>  interface, and have those functions call the scheduler API
>  instead.
> 
> - Add insertion merging from unplugging.
> 
> - Ensure that RQF_STARTED is cleared when we get a new shadow
>  request, or merging will fail if it is already set.
> 
> - Improve the blk_mq_sched_init_hctx_data() implementation. From Omar.
> 
> - Make the shadow alloc/free interface more usable by schedulers
>  that use the software queues. From Omar.
> 
> - Fix a bug in the io context code.
> 
> - Put the is_shadow() helper in generic code, instead of in mq-deadline.
> 
> - Add prep patch that unexports blk_mq_free_hctx_request(), it's not
>  used by anyone.
> 
> - Remove the magic '256' queue depth from mq-deadline, replace with a
>  module parameter, 'queue_depth', that defaults to 256.
> 
> - Various cleanups.
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2016-12-19 11:32 ` [PATCHSET v4] blk-mq-scheduling framework Paolo Valente
@ 2016-12-19 15:20   ` Jens Axboe
  2016-12-19 15:33     ` Jens Axboe
  2016-12-19 18:21     ` Paolo Valente
  0 siblings, 2 replies; 69+ messages in thread
From: Jens Axboe @ 2016-12-19 15:20 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, linux-block, Linux-Kernal, Omar Sandoval,
	Linus Walleij, Ulf Hansson, Mark Brown

On 12/19/2016 04:32 AM, Paolo Valente wrote:
> 
>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> This is version 4 of this patchset, version 3 was posted here:
>>
>> https://marc.info/?l=linux-block&m=148178513407631&w=2
>>
>> From the discussion last time, I looked into the feasibility of having
>> two sets of tags for the same request pool, to avoid having to copy
>> some of the request fields at dispatch and completion time. To do that,
>> we'd have to replace the driver tag map(s) with our own, and augment
>> that with tag map(s) on the side representing the device queue depth.
>> Queuing IO with the scheduler would allocate from the new map, and
>> dispatching would acquire the "real" tag. We would need to change
>> drivers to do this, or add an extra indirection table to map a real
>> tag to the scheduler tag. We would also need a 1:1 mapping between
>> scheduler and hardware tag pools, or additional info to track it.
>> Unless someone can convince me otherwise, I think the current approach
>> is cleaner.
>>
>> I wasn't going to post v4 so soon, but I discovered a bug that led
>> to drastically decreased merging. Especially on rotating storage,
>> this release should be fast, and on par with the merging that we
>> get through the legacy schedulers.
>>
> 
> I'm to modifying bfq.  You mentioned other missing pieces to come.  Do
> you already have an idea of what they are, so that I am somehow
> prepared to what won't work even if my changes are right?

I'm mostly talking about elevator ops hooks that aren't there in the new
framework, but exist in the old one. There should be no hidden
surprises, if that's what you are worried about.

On the ops side, the only ones I can think of are the activate and
deactivate, and those can be done in the dispatch_request hook for
activate, and put/requeue for deactivate.
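
As a rough sketch of that mapping (hypothetical "foo" scheduler, not code
from this series), the old activate work runs when the scheduler hands a
request out in ->dispatch_requests(), and the old deactivate work runs in
->put_request()/->requeue_request(); the foo_* helpers stand in for the
scheduler's existing bookkeeping:

  static void foo_dispatch_requests(struct blk_mq_hw_ctx *hctx,
                                    struct list_head *rq_list)
  {
          struct foo_data *fd = hctx->queue->elevator->elevator_data;
          struct request *rq;

          spin_lock(&fd->lock);
          while ((rq = foo_select_request(fd)) != NULL) {
                  foo_account_activate(fd, rq);   /* old "activate" hook body */
                  list_add_tail(&rq->queuelist, rq_list);
          }
          spin_unlock(&fd->lock);
  }

  static void foo_requeue_request(struct request *rq)
  {
          struct foo_data *fd = rq->q->elevator->elevator_data;

          spin_lock(&fd->lock);
          foo_account_deactivate(fd, rq);         /* old "deactivate" hook body */
          spin_unlock(&fd->lock);
  }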

Outside of that, some of them have been renamed, and some have been
collapsed (like activate/deactivate), and yet others again work a little
differently (like merging). See the mq-deadline conversion, and just
work through them one at a time.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2016-12-19 15:20   ` Jens Axboe
@ 2016-12-19 15:33     ` Jens Axboe
  2016-12-19 18:21     ` Paolo Valente
  1 sibling, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2016-12-19 15:33 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-block, Linux-Kernal, Omar Sandoval, Linus Walleij,
	Ulf Hansson, Mark Brown

On 12/19/2016 08:20 AM, Jens Axboe wrote:
> On 12/19/2016 04:32 AM, Paolo Valente wrote:
>>
>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>>
>>> This is version 4 of this patchset, version 3 was posted here:
>>>
>>> https://marc.info/?l=linux-block&m=148178513407631&w=2
>>>
>>> From the discussion last time, I looked into the feasibility of having
>>> two sets of tags for the same request pool, to avoid having to copy
>>> some of the request fields at dispatch and completion time. To do that,
>>> we'd have to replace the driver tag map(s) with our own, and augment
>>> that with tag map(s) on the side representing the device queue depth.
>>> Queuing IO with the scheduler would allocate from the new map, and
>>> dispatching would acquire the "real" tag. We would need to change
>>> drivers to do this, or add an extra indirection table to map a real
>>> tag to the scheduler tag. We would also need a 1:1 mapping between
>>> scheduler and hardware tag pools, or additional info to track it.
>>> Unless someone can convince me otherwise, I think the current approach
>>> is cleaner.
>>>
>>> I wasn't going to post v4 so soon, but I discovered a bug that led
>>> to drastically decreased merging. Especially on rotating storage,
>>> this release should be fast, and on par with the merging that we
>>> get through the legacy schedulers.
>>>
>>
>> I'm to modifying bfq.  You mentioned other missing pieces to come.  Do
>> you already have an idea of what they are, so that I am somehow
>> prepared to what won't work even if my changes are right?
> 
> I'm mostly talking about elevator ops hooks that aren't there in the new
> framework, but exist in the old one. There should be no hidden
> surprises, if that's what you are worried about.
> 
> On the ops side, the only ones I can think of are the activate and
> deactivate, and those can be done in the dispatch_request hook for
> activate, and put/requeue for deactivate.
> 
> Outside of that, some of them have been renamed, and some have been
> collapsed (like activate/deactivate), and yet others again work a little
> differently (like merging). See the mq-deadline conversion, and just
> work through them one at the time.

Some more details...

Outside of the differences outlined above, a major one is that the old
scheduler interfaces invoked almost all of the hooks with the device
 queue lock held. That's no longer the case in the new framework; you
 have to set up your own lock(s) for what you need. That's a lot saner.
One example is the attempt to merge a bio to an existing request, that
would be the ->bio_merge() hook. If you look at mq-deadline, the hook
merely grabs its per-queue lock (dd->lock) and calls a blk-mq-sched
helper to do the merging. That, in turn, will call ->request_merge(), so
that is called with the lock that ->bio_merge() grabs.
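
 For reference, the mq-deadline hook in question is just this (copied from
 patch 7/8 earlier in the thread):

   static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
   {
           struct request_queue *q = hctx->queue;
           struct deadline_data *dd = q->elevator->elevator_data;
           int ret;

           spin_lock(&dd->lock);
           ret = blk_mq_sched_try_merge(q, bio);
           spin_unlock(&dd->lock);

           return ret;
   }

 so any ->request_merge() work triggered from there runs under dd->lock,
 with no queue_lock in the picture.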

-- 
 Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2016-12-19 15:20   ` Jens Axboe
  2016-12-19 15:33     ` Jens Axboe
@ 2016-12-19 18:21     ` Paolo Valente
  2016-12-19 21:05       ` Jens Axboe
  1 sibling, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2016-12-19 18:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jens Axboe, linux-block, Linux-Kernal, Omar Sandoval,
	Linus Walleij, Ulf Hansson, Mark Brown


> Il giorno 19 dic 2016, alle ore 16:20, Jens Axboe <axboe@fb.com> ha scritto:
> 
> On 12/19/2016 04:32 AM, Paolo Valente wrote:
>> 
>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>> 
>>> This is version 4 of this patchset, version 3 was posted here:
>>> 
>>> https://marc.info/?l=linux-block&m=148178513407631&w=2
>>> 
>>> From the discussion last time, I looked into the feasibility of having
>>> two sets of tags for the same request pool, to avoid having to copy
>>> some of the request fields at dispatch and completion time. To do that,
>>> we'd have to replace the driver tag map(s) with our own, and augment
>>> that with tag map(s) on the side representing the device queue depth.
>>> Queuing IO with the scheduler would allocate from the new map, and
>>> dispatching would acquire the "real" tag. We would need to change
>>> drivers to do this, or add an extra indirection table to map a real
>>> tag to the scheduler tag. We would also need a 1:1 mapping between
>>> scheduler and hardware tag pools, or additional info to track it.
>>> Unless someone can convince me otherwise, I think the current approach
>>> is cleaner.
>>> 
>>> I wasn't going to post v4 so soon, but I discovered a bug that led
>>> to drastically decreased merging. Especially on rotating storage,
>>> this release should be fast, and on par with the merging that we
>>> get through the legacy schedulers.
>>> 
>> 
>> I'm now modifying bfq.  You mentioned other missing pieces to come.  Do
>> you already have an idea of what they are, so that I am somehow
>> prepared for what won't work even if my changes are right?
> 
> I'm mostly talking about elevator ops hooks that aren't there in the new
> framework, but exist in the old one. There should be no hidden
> surprises, if that's what you are worried about.
> 
> On the ops side, the only ones I can think of are the activate and
> deactivate, and those can be done in the dispatch_request hook for
> activate, and put/requeue for deactivate.
> 

You mean that there is no conceptual problem in moving the code of the
activate hook into the dispatch function, and the code of the
deactivate hook into put_request? (For a requeue it is a little
less clear to me, so one step at a time.)  Or am I missing
something more complex?

> Outside of that, some of them have been renamed, and some have been
> collapsed (like activate/deactivate), and yet others again work a little
> differently (like merging). See the mq-deadline conversion, and just
> work through them one at a time.
> 

That's how I'm proceeding, thanks.

Thank you,
Paolo

> -- 
> Jens Axboe
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2016-12-19 18:21     ` Paolo Valente
@ 2016-12-19 21:05       ` Jens Axboe
  2016-12-22 15:28         ` Paolo Valente
  0 siblings, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2016-12-19 21:05 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, linux-block, Linux-Kernel, Omar Sandoval,
	Linus Walleij, Ulf Hansson, Mark Brown

On 12/19/2016 11:21 AM, Paolo Valente wrote:
> 
>> Il giorno 19 dic 2016, alle ore 16:20, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> On 12/19/2016 04:32 AM, Paolo Valente wrote:
>>>
>>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>>>
>>>> This is version 4 of this patchset, version 3 was posted here:
>>>>
>>>> https://marc.info/?l=linux-block&m=148178513407631&w=2
>>>>
>>>> From the discussion last time, I looked into the feasibility of having
>>>> two sets of tags for the same request pool, to avoid having to copy
>>>> some of the request fields at dispatch and completion time. To do that,
>>>> we'd have to replace the driver tag map(s) with our own, and augment
>>>> that with tag map(s) on the side representing the device queue depth.
>>>> Queuing IO with the scheduler would allocate from the new map, and
>>>> dispatching would acquire the "real" tag. We would need to change
>>>> drivers to do this, or add an extra indirection table to map a real
>>>> tag to the scheduler tag. We would also need a 1:1 mapping between
>>>> scheduler and hardware tag pools, or additional info to track it.
>>>> Unless someone can convince me otherwise, I think the current approach
>>>> is cleaner.
>>>>
>>>> I wasn't going to post v4 so soon, but I discovered a bug that led
>>>> to drastically decreased merging. Especially on rotating storage,
>>>> this release should be fast, and on par with the merging that we
>>>> get through the legacy schedulers.
>>>>
>>>
>>> I'm now modifying bfq.  You mentioned other missing pieces to come.  Do
>>> you already have an idea of what they are, so that I am somehow
>>> prepared for what won't work even if my changes are right?
>>
>> I'm mostly talking about elevator ops hooks that aren't there in the new
>> framework, but exist in the old one. There should be no hidden
>> surprises, if that's what you are worried about.
>>
>> On the ops side, the only ones I can think of are the activate and
>> deactivate, and those can be done in the dispatch_request hook for
>> activate, and put/requeue for deactivate.
>>
> 
> You mean that there is no conceptual problem in moving the code of the
> activate hook into the dispatch function, and the code of the
> deactivate hook into put_request? (For a requeue it is a little
> less clear to me, so one step at a time.)  Or am I missing
> something more complex?

Yes, what I mean is that there isn't a 1:1 mapping between the old ops
and the new ops. So you'll have to consider the cases.
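
As a rough sketch of the idea (with bfq_select_next_request(),
bfq_activate_rq() and bfq_deactivate_rq() as hypothetical placeholders
for whatever bookkeeping the old hooks did - this is not bfq code):

static void bfq_dispatch_requests(struct blk_mq_hw_ctx *hctx,
				  struct list_head *rq_list)
{
	struct request *rq;

	while ((rq = bfq_select_next_request(hctx)) != NULL) {
		bfq_activate_rq(rq);	/* was ->elevator_activate_req_fn */
		list_add_tail(&rq->queuelist, rq_list);
	}
}

static bool bfq_put_request(struct request *rq)
{
	bfq_deactivate_rq(rq);		/* was ->elevator_deactivate_req_fn */
	return false;			/* let blk-mq finish the request */
}

static void bfq_requeue_request(struct request *rq)
{
	bfq_deactivate_rq(rq);		/* a requeue undoes the activation too */
}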


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-17  0:12 ` [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
@ 2016-12-20  9:34   ` Paolo Valente
  2016-12-20 15:46     ` Jens Axboe
  2016-12-21 11:59   ` Bart Van Assche
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2016-12-20  9:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernel, osandov


> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
> 
> This is basically identical to deadline-iosched, except it registers
> as a MQ capable scheduler. This is still a single queue design.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>
> ...
> +
> +static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +
> +	return !list_empty_careful(&dd->dispatch) ||
> +		!list_empty_careful(&dd->fifo_list[0]) ||
> +		!list_empty_careful(&dd->fifo_list[1]);

Just a request for clarification: if I'm not mistaken,
list_empty_careful can be used safely only if the only possible other
concurrent access is a delete.  Or am I missing something?

If that constraint does apply, then how do we guarantee that it is
met?  My doubt arises from, e.g., a possible concurrent list_add from
dd_insert_request.

Thanks,
Paolo

> +}
> +
> +/*
> + * sysfs parts below
> + */
> +static ssize_t
> +deadline_var_show(int var, char *page)
> +{
> +	return sprintf(page, "%d\n", var);
> +}
> +
> +static ssize_t
> +deadline_var_store(int *var, const char *page, size_t count)
> +{
> +	char *p = (char *) page;
> +
> +	*var = simple_strtol(p, &p, 10);
> +	return count;
> +}
> +
> +#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
> +static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
> +{									\
> +	struct deadline_data *dd = e->elevator_data;			\
> +	int __data = __VAR;						\
> +	if (__CONV)							\
> +		__data = jiffies_to_msecs(__data);			\
> +	return deadline_var_show(__data, (page));			\
> +}
> +SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
> +SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
> +SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
> +SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
> +SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
> +#undef SHOW_FUNCTION
> +
> +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
> +static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
> +{									\
> +	struct deadline_data *dd = e->elevator_data;			\
> +	int __data;							\
> +	int ret = deadline_var_store(&__data, (page), count);		\
> +	if (__data < (MIN))						\
> +		__data = (MIN);						\
> +	else if (__data > (MAX))					\
> +		__data = (MAX);						\
> +	if (__CONV)							\
> +		*(__PTR) = msecs_to_jiffies(__data);			\
> +	else								\
> +		*(__PTR) = __data;					\
> +	return ret;							\
> +}
> +STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
> +STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
> +STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
> +#undef STORE_FUNCTION
> +
> +#define DD_ATTR(name) \
> +	__ATTR(name, S_IRUGO|S_IWUSR, deadline_##name##_show, \
> +				      deadline_##name##_store)
> +
> +static struct elv_fs_entry deadline_attrs[] = {
> +	DD_ATTR(read_expire),
> +	DD_ATTR(write_expire),
> +	DD_ATTR(writes_starved),
> +	DD_ATTR(front_merges),
> +	DD_ATTR(fifo_batch),
> +	__ATTR_NULL
> +};
> +
> +static struct elevator_type mq_deadline = {
> +	.ops.mq = {
> +		.get_request		= dd_get_request,
> +		.put_request		= dd_put_request,
> +		.insert_requests	= dd_insert_requests,
> +		.dispatch_requests	= dd_dispatch_requests,
> +		.completed_request	= dd_completed_request,
> +		.next_request		= elv_rb_latter_request,
> +		.former_request		= elv_rb_former_request,
> +		.bio_merge		= dd_bio_merge,
> +		.request_merge		= dd_request_merge,
> +		.requests_merged	= dd_merged_requests,
> +		.request_merged		= dd_request_merged,
> +		.has_work		= dd_has_work,
> +		.init_sched		= dd_init_queue,
> +		.exit_sched		= dd_exit_queue,
> +	},
> +
> +	.uses_mq	= true,
> +	.elevator_attrs = deadline_attrs,
> +	.elevator_name = "mq-deadline",
> +	.elevator_owner = THIS_MODULE,
> +};
> +
> +static int __init deadline_init(void)
> +{
> +	if (!queue_depth) {
> +		pr_err("mq-deadline: queue depth must be > 0\n");
> +		return -EINVAL;
> +	}
> +	return elv_register(&mq_deadline);
> +}
> +
> +static void __exit deadline_exit(void)
> +{
> +	elv_unregister(&mq_deadline);
> +}
> +
> +module_init(deadline_init);
> +module_exit(deadline_exit);
> +
> +MODULE_AUTHOR("Jens Axboe");
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("MQ deadline IO scheduler");
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/8] block: move rq_ioc() to blk.h
  2016-12-17  0:12 ` [PATCH 3/8] block: move rq_ioc() to blk.h Jens Axboe
@ 2016-12-20 10:12   ` Paolo Valente
  2016-12-20 15:46     ` Jens Axboe
  0 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2016-12-20 10:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernel, osandov


> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
> 
> We want to use it outside of blk-core.c.
> 

Hi Jens,
no hooks equivalent to elevator_init_icq_fn and elevator_exit_icq_fn
are invoked.  In particular, the second hook lets bfq (as cfq does)
know when it can finally exit the queue associated with the icq.
I'm trying to figure out how to manage without these hooks/signals,
but to no avail so far ...
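
Just to show the shape of what seems to be missing (hypothetical, not
part of this patchset): if elevator_mq_ops grew init_icq/exit_icq
pointers mirroring the old hooks, the icq teardown could dispatch to
them, along these lines:

static void ioc_exit_icq(struct io_cq *icq)
{
	struct elevator_type *et = icq->q->elevator->type;

	if (icq->flags & ICQ_EXITED)
		return;

	if (et->uses_mq && et->ops.mq.exit_icq)
		et->ops.mq.exit_icq(icq);
	else if (!et->uses_mq && et->ops.sq.elevator_exit_icq_fn)
		et->ops.sq.elevator_exit_icq_fn(icq);

	icq->flags |= ICQ_EXITED;
}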

Thanks,
Paolo

> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---
> block/blk-core.c | 16 ----------------
> block/blk.h      | 16 ++++++++++++++++
> 2 files changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 61ba08c58b64..92baea07acbc 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1040,22 +1040,6 @@ static bool blk_rq_should_init_elevator(struct bio *bio)
> }
> 
> /**
> - * rq_ioc - determine io_context for request allocation
> - * @bio: request being allocated is for this bio (can be %NULL)
> - *
> - * Determine io_context to use for request allocation for @bio.  May return
> - * %NULL if %current->io_context doesn't exist.
> - */
> -static struct io_context *rq_ioc(struct bio *bio)
> -{
> -#ifdef CONFIG_BLK_CGROUP
> -	if (bio && bio->bi_ioc)
> -		return bio->bi_ioc;
> -#endif
> -	return current->io_context;
> -}
> -
> -/**
>  * __get_request - get a free request
>  * @rl: request list to allocate from
>  * @op: operation and flags
> diff --git a/block/blk.h b/block/blk.h
> index f46c0ac8ae3d..9a716b5925a4 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -264,6 +264,22 @@ void ioc_clear_queue(struct request_queue *q);
> int create_task_io_context(struct task_struct *task, gfp_t gfp_mask, int node);
> 
> /**
> + * rq_ioc - determine io_context for request allocation
> + * @bio: request being allocated is for this bio (can be %NULL)
> + *
> + * Determine io_context to use for request allocation for @bio.  May return
> + * %NULL if %current->io_context doesn't exist.
> + */
> +static inline struct io_context *rq_ioc(struct bio *bio)
> +{
> +#ifdef CONFIG_BLK_CGROUP
> +	if (bio && bio->bi_ioc)
> +		return bio->bi_ioc;
> +#endif
> +	return current->io_context;
> +}
> +
> +/**
>  * create_io_context - try to create task->io_context
>  * @gfp_mask: allocation mask
>  * @node: allocation node
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-17  0:12 ` [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers Jens Axboe
@ 2016-12-20 11:55   ` Paolo Valente
  2016-12-20 15:45     ` Jens Axboe
  2016-12-21  2:22     ` Jens Axboe
  2016-12-22  9:59   ` Paolo Valente
  1 sibling, 2 replies; 69+ messages in thread
From: Paolo Valente @ 2016-12-20 11:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-block, linux-kernel, osandov


> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
> 
> This adds a set of hooks that intercepts the blk-mq path of
> allocating/inserting/issuing/completing requests, allowing
> us to develop a scheduler within that framework.
> 
> We reuse the existing elevator scheduler API on the registration
> side, but augment that with the scheduler flagging support for
> the blk-mq interfce, and with a separate set of ops hooks for MQ
> devices.
> 
> Schedulers can opt in to using shadow requests. Shadow requests
> are internal requests that the scheduler uses for the allocate
> and insert part, which are then mapped to a real driver request
> at dispatch time. This is needed to separate the device queue depth
> from the pool of requests that the scheduler has to work with.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>

> ...
> 
> +struct request *blk_mq_sched_get_request(struct request_queue *q,
> +					 struct bio *bio,
> +					 unsigned int op,
> +					 struct blk_mq_alloc_data *data)
> +{
> +	struct elevator_queue *e = q->elevator;
> +	struct blk_mq_hw_ctx *hctx;
> +	struct blk_mq_ctx *ctx;
> +	struct request *rq;
> +
> +	blk_queue_enter_live(q);
> +	ctx = blk_mq_get_ctx(q);
> +	hctx = blk_mq_map_queue(q, ctx->cpu);
> +
> +	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> +
> +	if (e && e->type->ops.mq.get_request)
> +		rq = e->type->ops.mq.get_request(q, op, data);

bio is not passed to the scheduler here.  Yet bfq uses bio to get the
blkcg (invoking bio_blkcg).  I'm not finding any workaround.

> +	else
> +		rq = __blk_mq_alloc_request(data, op);
> +
> +	if (rq) {
> +		rq->elv.icq = NULL;
> +		if (e && e->type->icq_cache)
> +			blk_mq_sched_assign_ioc(q, rq, bio);

bfq needs rq->elv.icq to be consistent in bfq_get_request, but the
needed initialization seems to occur only after mq.get_request is
invoked.

Note: to minimize latency, I'm reporting immediately each problem that
apparently cannot be solved by just modifying bfq.  But, if the
resulting higher number of micro-emails is annoying for you, I can
buffer my questions, and send you cumulative emails less frequently.

Thanks,
Paolo

> +		data->hctx->queued++;
> +		return rq;
> +	}
> +
> +	blk_queue_exit(q);
> +	return NULL;
> +}
> +
> +void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct elevator_queue *e = hctx->queue->elevator;
> +	LIST_HEAD(rq_list);
> +
> +	if (unlikely(blk_mq_hctx_stopped(hctx)))
> +		return;
> +
> +	hctx->run++;
> +
> +	/*
> +	 * If we have previous entries on our dispatch list, grab them first for
> +	 * more fair dispatch.
> +	 */
> +	if (!list_empty_careful(&hctx->dispatch)) {
> +		spin_lock(&hctx->lock);
> +		if (!list_empty(&hctx->dispatch))
> +			list_splice_init(&hctx->dispatch, &rq_list);
> +		spin_unlock(&hctx->lock);
> +	}
> +
> +	/*
> +	 * Only ask the scheduler for requests, if we didn't have residual
> +	 * requests from the dispatch list. This is to avoid the case where
> +	 * we only ever dispatch a fraction of the requests available because
> +	 * of low device queue depth. Once we pull requests out of the IO
> +	 * scheduler, we can no longer merge or sort them. So it's best to
> +	 * leave them there for as long as we can. Mark the hw queue as
> +	 * needing a restart in that case.
> +	 */
> +	if (list_empty(&rq_list)) {
> +		if (e && e->type->ops.mq.dispatch_requests)
> +			e->type->ops.mq.dispatch_requests(hctx, &rq_list);
> +		else
> +			blk_mq_flush_busy_ctxs(hctx, &rq_list);
> +	} else if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
> +		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> +
> +	blk_mq_dispatch_rq_list(hctx, &rq_list);
> +}
> +
> +bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio)
> +{
> +	struct request *rq;
> +	int ret;
> +
> +	ret = elv_merge(q, &rq, bio);
> +	if (ret == ELEVATOR_BACK_MERGE) {
> +		if (bio_attempt_back_merge(q, rq, bio)) {
> +			if (!attempt_back_merge(q, rq))
> +				elv_merged_request(q, rq, ret);
> +			return true;
> +		}
> +	} else if (ret == ELEVATOR_FRONT_MERGE) {
> +		if (bio_attempt_front_merge(q, rq, bio)) {
> +			if (!attempt_front_merge(q, rq))
> +				elv_merged_request(q, rq, ret);
> +			return true;
> +		}
> +	}
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
> +
> +bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
> +{
> +	struct elevator_queue *e = q->elevator;
> +
> +	if (e->type->ops.mq.bio_merge) {
> +		struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
> +		struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> +
> +		blk_mq_put_ctx(ctx);
> +		return e->type->ops.mq.bio_merge(hctx, bio);
> +	}
> +
> +	return false;
> +}
> +
> +bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq)
> +{
> +	return rq_mergeable(rq) && elv_attempt_insert_merge(q, rq);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
> +
> +void blk_mq_sched_request_inserted(struct request *rq)
> +{
> +	trace_block_rq_insert(rq->q, rq);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted);
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> new file mode 100644
> index 000000000000..1d1a4e9ce6ca
> --- /dev/null
> +++ b/block/blk-mq-sched.h
> @@ -0,0 +1,209 @@
> +#ifndef BLK_MQ_SCHED_H
> +#define BLK_MQ_SCHED_H
> +
> +#include "blk-mq.h"
> +#include "blk-wbt.h"
> +
> +struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth, unsigned int numa_node);
> +void blk_mq_sched_free_requests(struct blk_mq_tags *tags);
> +
> +int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
> +				int (*init)(struct blk_mq_hw_ctx *),
> +				void (*exit)(struct blk_mq_hw_ctx *));
> +
> +void blk_mq_sched_free_hctx_data(struct request_queue *q,
> +				 void (*exit)(struct blk_mq_hw_ctx *));
> +
> +void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
> +				      struct request *rq);
> +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
> +						  struct blk_mq_alloc_data *data,
> +						  struct blk_mq_tags *tags,
> +						  atomic_t *wait_index);
> +struct request *
> +blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
> +				 struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *));
> +struct request *
> +__blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
> +				   struct request *sched_rq);
> +
> +struct request *blk_mq_sched_get_request(struct request_queue *q, struct bio *bio, unsigned int op, struct blk_mq_alloc_data *data);
> +
> +void __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
> +void blk_mq_sched_request_inserted(struct request *rq);
> +bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
> +bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
> +bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
> +
> +void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
> +
> +static inline bool
> +blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
> +{
> +	struct elevator_queue *e = q->elevator;
> +
> +	if (!e || blk_queue_nomerges(q) || !bio_mergeable(bio))
> +		return false;
> +
> +	return __blk_mq_sched_bio_merge(q, bio);
> +}
> +
> +static inline int blk_mq_sched_get_rq_priv(struct request_queue *q,
> +					   struct request *rq)
> +{
> +	struct elevator_queue *e = q->elevator;
> +
> +	if (e && e->type->ops.mq.get_rq_priv)
> +		return e->type->ops.mq.get_rq_priv(q, rq);
> +
> +	return 0;
> +}
> +
> +static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
> +					    struct request *rq)
> +{
> +	struct elevator_queue *e = q->elevator;
> +
> +	if (e && e->type->ops.mq.put_rq_priv)
> +		e->type->ops.mq.put_rq_priv(q, rq);
> +}
> +
> +static inline void blk_mq_sched_put_request(struct request *rq)
> +{
> +	struct request_queue *q = rq->q;
> +	struct elevator_queue *e = q->elevator;
> +	bool do_free = true;
> +
> +	wbt_done(q->rq_wb, &rq->issue_stat);
> +
> +	if (rq->rq_flags & RQF_ELVPRIV) {
> +		blk_mq_sched_put_rq_priv(rq->q, rq);
> +		if (rq->elv.icq) {
> +			put_io_context(rq->elv.icq->ioc);
> +			rq->elv.icq = NULL;
> +		}
> +	}
> +
> +	if (e && e->type->ops.mq.put_request)
> +		do_free = !e->type->ops.mq.put_request(rq);
> +	if (do_free)
> +		blk_mq_finish_request(rq);
> +}
> +
> +static inline void
> +blk_mq_sched_insert_request(struct request *rq, bool at_head, bool run_queue,
> +			    bool async)
> +{
> +	struct request_queue *q = rq->q;
> +	struct elevator_queue *e = q->elevator;
> +	struct blk_mq_ctx *ctx = rq->mq_ctx;
> +	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> +
> +	if (e && e->type->ops.mq.insert_requests) {
> +		LIST_HEAD(list);
> +
> +		list_add(&rq->queuelist, &list);
> +		e->type->ops.mq.insert_requests(hctx, &list, at_head);
> +	} else {
> +		spin_lock(&ctx->lock);
> +		__blk_mq_insert_request(hctx, rq, at_head);
> +		spin_unlock(&ctx->lock);
> +	}
> +
> +	if (run_queue)
> +		blk_mq_run_hw_queue(hctx, async);
> +}
> +
> +static inline void
> +blk_mq_sched_insert_requests(struct request_queue *q, struct blk_mq_ctx *ctx,
> +			     struct list_head *list, bool run_queue_async)
> +{
> +	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> +	struct elevator_queue *e = hctx->queue->elevator;
> +
> +	if (e && e->type->ops.mq.insert_requests)
> +		e->type->ops.mq.insert_requests(hctx, list, false);
> +	else
> +		blk_mq_insert_requests(hctx, ctx, list);
> +
> +	blk_mq_run_hw_queue(hctx, run_queue_async);
> +}
> +
> +static inline void
> +blk_mq_sched_dispatch_shadow_requests(struct blk_mq_hw_ctx *hctx,
> +				      struct list_head *rq_list,
> +				      struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
> +{
> +	do {
> +		struct request *rq;
> +
> +		rq = blk_mq_sched_request_from_shadow(hctx, get_sched_rq);
> +		if (!rq)
> +			break;
> +
> +		list_add_tail(&rq->queuelist, rq_list);
> +	} while (1);
> +}
> +
> +static inline bool
> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
> +			 struct bio *bio)
> +{
> +	struct elevator_queue *e = q->elevator;
> +
> +	if (e && e->type->ops.mq.allow_merge)
> +		return e->type->ops.mq.allow_merge(q, rq, bio);
> +
> +	return true;
> +}
> +
> +static inline void
> +blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
> +{
> +	struct elevator_queue *e = hctx->queue->elevator;
> +
> +	if (e && e->type->ops.mq.completed_request)
> +		e->type->ops.mq.completed_request(hctx, rq);
> +
> +	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
> +		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> +		blk_mq_run_hw_queue(hctx, true);
> +	}
> +}
> +
> +static inline void blk_mq_sched_started_request(struct request *rq)
> +{
> +	struct request_queue *q = rq->q;
> +	struct elevator_queue *e = q->elevator;
> +
> +	if (e && e->type->ops.mq.started_request)
> +		e->type->ops.mq.started_request(rq);
> +}
> +
> +static inline void blk_mq_sched_requeue_request(struct request *rq)
> +{
> +	struct request_queue *q = rq->q;
> +	struct elevator_queue *e = q->elevator;
> +
> +	if (e && e->type->ops.mq.requeue_request)
> +		e->type->ops.mq.requeue_request(rq);
> +}
> +
> +static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct elevator_queue *e = hctx->queue->elevator;
> +
> +	if (e && e->type->ops.mq.has_work)
> +		return e->type->ops.mq.has_work(hctx);
> +
> +	return false;
> +}
> +
> +/*
> + * Returns true if this is an internal shadow request
> + */
> +static inline bool blk_mq_sched_rq_is_shadow(struct request *rq)
> +{
> +	return (rq->rq_flags & RQF_ALLOCED) != 0;
> +}
> +#endif
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c3119f527bc1..032dca4a27bf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -32,6 +32,7 @@
> #include "blk-mq-tag.h"
> #include "blk-stat.h"
> #include "blk-wbt.h"
> +#include "blk-mq-sched.h"
> 
> static DEFINE_MUTEX(all_q_mutex);
> static LIST_HEAD(all_q_list);
> @@ -41,7 +42,8 @@ static LIST_HEAD(all_q_list);
>  */
> static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
> {
> -	return sbitmap_any_bit_set(&hctx->ctx_map);
> +	return sbitmap_any_bit_set(&hctx->ctx_map) ||
> +		blk_mq_sched_has_work(hctx);
> }
> 
> /*
> @@ -242,26 +244,21 @@ EXPORT_SYMBOL_GPL(__blk_mq_alloc_request);
> struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
> 		unsigned int flags)
> {
> -	struct blk_mq_ctx *ctx;
> -	struct blk_mq_hw_ctx *hctx;
> -	struct request *rq;
> 	struct blk_mq_alloc_data alloc_data;
> +	struct request *rq;
> 	int ret;
> 
> 	ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
> 	if (ret)
> 		return ERR_PTR(ret);
> 
> -	ctx = blk_mq_get_ctx(q);
> -	hctx = blk_mq_map_queue(q, ctx->cpu);
> -	blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
> -	rq = __blk_mq_alloc_request(&alloc_data, rw);
> -	blk_mq_put_ctx(ctx);
> +	rq = blk_mq_sched_get_request(q, NULL, rw, &alloc_data);
> 
> -	if (!rq) {
> -		blk_queue_exit(q);
> +	blk_mq_put_ctx(alloc_data.ctx);
> +	blk_queue_exit(q);
> +
> +	if (!rq)
> 		return ERR_PTR(-EWOULDBLOCK);
> -	}
> 
> 	rq->__data_len = 0;
> 	rq->__sector = (sector_t) -1;
> @@ -321,12 +318,14 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, int rw,
> }
> EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
> 
> -void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> -			   struct request *rq)
> +void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> +			     struct request *rq)
> {
> 	const int tag = rq->tag;
> 	struct request_queue *q = rq->q;
> 
> +	blk_mq_sched_completed_request(hctx, rq);
> +
> 	if (rq->rq_flags & RQF_MQ_INFLIGHT)
> 		atomic_dec(&hctx->nr_active);
> 
> @@ -339,18 +338,23 @@ void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> 	blk_queue_exit(q);
> }
> 
> -static void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *hctx,
> +static void blk_mq_finish_hctx_request(struct blk_mq_hw_ctx *hctx,
> 				     struct request *rq)
> {
> 	struct blk_mq_ctx *ctx = rq->mq_ctx;
> 
> 	ctx->rq_completed[rq_is_sync(rq)]++;
> -	__blk_mq_free_request(hctx, ctx, rq);
> +	__blk_mq_finish_request(hctx, ctx, rq);
> +}
> +
> +void blk_mq_finish_request(struct request *rq)
> +{
> +	blk_mq_finish_hctx_request(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu), rq);
> }
> 
> void blk_mq_free_request(struct request *rq)
> {
> -	blk_mq_free_hctx_request(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu), rq);
> +	blk_mq_sched_put_request(rq);
> }
> EXPORT_SYMBOL_GPL(blk_mq_free_request);
> 
> @@ -468,6 +472,8 @@ void blk_mq_start_request(struct request *rq)
> {
> 	struct request_queue *q = rq->q;
> 
> +	blk_mq_sched_started_request(rq);
> +
> 	trace_block_rq_issue(q, rq);
> 
> 	rq->resid_len = blk_rq_bytes(rq);
> @@ -516,6 +522,7 @@ static void __blk_mq_requeue_request(struct request *rq)
> 
> 	trace_block_rq_requeue(q, rq);
> 	wbt_requeue(q->rq_wb, &rq->issue_stat);
> +	blk_mq_sched_requeue_request(rq);
> 
> 	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> 		if (q->dma_drain_size && blk_rq_bytes(rq))
> @@ -550,13 +557,13 @@ static void blk_mq_requeue_work(struct work_struct *work)
> 
> 		rq->rq_flags &= ~RQF_SOFTBARRIER;
> 		list_del_init(&rq->queuelist);
> -		blk_mq_insert_request(rq, true, false, false);
> +		blk_mq_sched_insert_request(rq, true, false, false);
> 	}
> 
> 	while (!list_empty(&rq_list)) {
> 		rq = list_entry(rq_list.next, struct request, queuelist);
> 		list_del_init(&rq->queuelist);
> -		blk_mq_insert_request(rq, false, false, false);
> +		blk_mq_sched_insert_request(rq, false, false, false);
> 	}
> 
> 	blk_mq_run_hw_queues(q, false);
> @@ -762,8 +769,16 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
> 
> 		if (!blk_rq_merge_ok(rq, bio))
> 			continue;
> +		if (!blk_mq_sched_allow_merge(q, rq, bio))
> +			break;
> 
> 		el_ret = blk_try_merge(rq, bio);
> +		if (el_ret == ELEVATOR_NO_MERGE)
> +			continue;
> +
> +		if (!blk_mq_sched_allow_merge(q, rq, bio))
> +			break;
> +
> 		if (el_ret == ELEVATOR_BACK_MERGE) {
> 			if (bio_attempt_back_merge(q, rq, bio)) {
> 				ctx->rq_merged++;
> @@ -905,41 +920,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
> 	return ret != BLK_MQ_RQ_QUEUE_BUSY;
> }
> 
> -/*
> - * Run this hardware queue, pulling any software queues mapped to it in.
> - * Note that this function currently has various problems around ordering
> - * of IO. In particular, we'd like FIFO behaviour on handling existing
> - * items on the hctx->dispatch list. Ignore that for now.
> - */
> -static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
> -{
> -	LIST_HEAD(rq_list);
> -	LIST_HEAD(driver_list);
> -
> -	if (unlikely(blk_mq_hctx_stopped(hctx)))
> -		return;
> -
> -	hctx->run++;
> -
> -	/*
> -	 * Touch any software queue that has pending entries.
> -	 */
> -	blk_mq_flush_busy_ctxs(hctx, &rq_list);
> -
> -	/*
> -	 * If we have previous entries on our dispatch list, grab them
> -	 * and stuff them at the front for more fair dispatch.
> -	 */
> -	if (!list_empty_careful(&hctx->dispatch)) {
> -		spin_lock(&hctx->lock);
> -		if (!list_empty(&hctx->dispatch))
> -			list_splice_init(&hctx->dispatch, &rq_list);
> -		spin_unlock(&hctx->lock);
> -	}
> -
> -	blk_mq_dispatch_rq_list(hctx, &rq_list);
> -}
> -
> static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
> {
> 	int srcu_idx;
> @@ -949,11 +929,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
> 
> 	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
> 		rcu_read_lock();
> -		blk_mq_process_rq_list(hctx);
> +		blk_mq_sched_dispatch_requests(hctx);
> 		rcu_read_unlock();
> 	} else {
> 		srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
> -		blk_mq_process_rq_list(hctx);
> +		blk_mq_sched_dispatch_requests(hctx);
> 		srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
> 	}
> }
> @@ -1147,32 +1127,10 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> 	blk_mq_hctx_mark_pending(hctx, ctx);
> }
> 
> -void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
> -			   bool async)
> -{
> -	struct blk_mq_ctx *ctx = rq->mq_ctx;
> -	struct request_queue *q = rq->q;
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> -
> -	spin_lock(&ctx->lock);
> -	__blk_mq_insert_request(hctx, rq, at_head);
> -	spin_unlock(&ctx->lock);
> -
> -	if (run_queue)
> -		blk_mq_run_hw_queue(hctx, async);
> -}
> -
> -static void blk_mq_insert_requests(struct request_queue *q,
> -				     struct blk_mq_ctx *ctx,
> -				     struct list_head *list,
> -				     int depth,
> -				     bool from_schedule)
> +void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> +			    struct list_head *list)
> 
> {
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> -
> -	trace_block_unplug(q, depth, !from_schedule);
> -
> 	/*
> 	 * preemption doesn't flush plug list, so it's possible ctx->cpu is
> 	 * offline now
> @@ -1188,8 +1146,6 @@ static void blk_mq_insert_requests(struct request_queue *q,
> 	}
> 	blk_mq_hctx_mark_pending(hctx, ctx);
> 	spin_unlock(&ctx->lock);
> -
> -	blk_mq_run_hw_queue(hctx, from_schedule);
> }
> 
> static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
> @@ -1225,9 +1181,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> 		BUG_ON(!rq->q);
> 		if (rq->mq_ctx != this_ctx) {
> 			if (this_ctx) {
> -				blk_mq_insert_requests(this_q, this_ctx,
> -							&ctx_list, depth,
> -							from_schedule);
> +				trace_block_unplug(this_q, depth, from_schedule);
> +				blk_mq_sched_insert_requests(this_q, this_ctx,
> +								&ctx_list,
> +								from_schedule);
> 			}
> 
> 			this_ctx = rq->mq_ctx;
> @@ -1244,8 +1201,9 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> 	 * on 'ctx_list'. Do those.
> 	 */
> 	if (this_ctx) {
> -		blk_mq_insert_requests(this_q, this_ctx, &ctx_list, depth,
> -				       from_schedule);
> +		trace_block_unplug(this_q, depth, from_schedule);
> +		blk_mq_sched_insert_requests(this_q, this_ctx, &ctx_list,
> +						from_schedule);
> 	}
> }
> 
> @@ -1283,46 +1241,32 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
> 		}
> 
> 		spin_unlock(&ctx->lock);
> -		__blk_mq_free_request(hctx, ctx, rq);
> +		__blk_mq_finish_request(hctx, ctx, rq);
> 		return true;
> 	}
> }
> 
> -static struct request *blk_mq_map_request(struct request_queue *q,
> -					  struct bio *bio,
> -					  struct blk_mq_alloc_data *data)
> -{
> -	struct blk_mq_hw_ctx *hctx;
> -	struct blk_mq_ctx *ctx;
> -	struct request *rq;
> -
> -	blk_queue_enter_live(q);
> -	ctx = blk_mq_get_ctx(q);
> -	hctx = blk_mq_map_queue(q, ctx->cpu);
> -
> -	trace_block_getrq(q, bio, bio->bi_opf);
> -	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> -	rq = __blk_mq_alloc_request(data, bio->bi_opf);
> -
> -	data->hctx->queued++;
> -	return rq;
> -}
> -
> static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
> {
> -	int ret;
> 	struct request_queue *q = rq->q;
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> 	struct blk_mq_queue_data bd = {
> 		.rq = rq,
> 		.list = NULL,
> 		.last = 1
> 	};
> -	blk_qc_t new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
> +	struct blk_mq_hw_ctx *hctx;
> +	blk_qc_t new_cookie;
> +	int ret;
> +
> +	if (q->elevator)
> +		goto insert;
> 
> +	hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> 	if (blk_mq_hctx_stopped(hctx))
> 		goto insert;
> 
> +	new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
> +
> 	/*
> 	 * For OK queue, we are done. For error, kill it. Any other
> 	 * error (busy), just add it to our list as we previously
> @@ -1344,7 +1288,7 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
> 	}
> 
> insert:
> -	blk_mq_insert_request(rq, false, true, true);
> +	blk_mq_sched_insert_request(rq, false, true, true);
> }
> 
> /*
> @@ -1377,9 +1321,14 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
> 	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
> 		return BLK_QC_T_NONE;
> 
> +	if (blk_mq_sched_bio_merge(q, bio))
> +		return BLK_QC_T_NONE;
> +
> 	wb_acct = wbt_wait(q->rq_wb, bio, NULL);
> 
> -	rq = blk_mq_map_request(q, bio, &data);
> +	trace_block_getrq(q, bio, bio->bi_opf);
> +
> +	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
> 	if (unlikely(!rq)) {
> 		__wbt_done(q->rq_wb, wb_acct);
> 		return BLK_QC_T_NONE;
> @@ -1441,6 +1390,12 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
> 		goto done;
> 	}
> 
> +	if (q->elevator) {
> +		blk_mq_put_ctx(data.ctx);
> +		blk_mq_bio_to_request(rq, bio);
> +		blk_mq_sched_insert_request(rq, false, true, true);
> +		goto done;
> +	}
> 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
> 		/*
> 		 * For a SYNC request, send it to the hardware immediately. For
> @@ -1486,9 +1441,14 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
> 	} else
> 		request_count = blk_plug_queued_count(q);
> 
> +	if (blk_mq_sched_bio_merge(q, bio))
> +		return BLK_QC_T_NONE;
> +
> 	wb_acct = wbt_wait(q->rq_wb, bio, NULL);
> 
> -	rq = blk_mq_map_request(q, bio, &data);
> +	trace_block_getrq(q, bio, bio->bi_opf);
> +
> +	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
> 	if (unlikely(!rq)) {
> 		__wbt_done(q->rq_wb, wb_acct);
> 		return BLK_QC_T_NONE;
> @@ -1538,6 +1498,12 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
> 		return cookie;
> 	}
> 
> +	if (q->elevator) {
> +		blk_mq_put_ctx(data.ctx);
> +		blk_mq_bio_to_request(rq, bio);
> +		blk_mq_sched_insert_request(rq, false, true, true);
> +		goto done;
> +	}
> 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
> 		/*
> 		 * For a SYNC request, send it to the hardware immediately. For
> @@ -1550,6 +1516,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
> 	}
> 
> 	blk_mq_put_ctx(data.ctx);
> +done:
> 	return cookie;
> }
> 
> @@ -1558,7 +1525,7 @@ void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> {
> 	struct page *page;
> 
> -	if (tags->rqs && set->ops->exit_request) {
> +	if (tags->rqs && set && set->ops->exit_request) {
> 		int i;
> 
> 		for (i = 0; i < tags->nr_tags; i++) {
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index e59f5ca520a2..898c3c9a60ec 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -47,7 +47,8 @@ struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
>  */
> void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> 				bool at_head);
> -
> +void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> +				struct list_head *list);
> /*
>  * CPU hotplug helpers
>  */
> @@ -123,8 +124,9 @@ static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
>  */
> void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
> 			struct request *rq, unsigned int op);
> -void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> +void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> 				struct request *rq);
> +void blk_mq_finish_request(struct request *rq);
> struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
> 					unsigned int op);
> 
> diff --git a/block/elevator.c b/block/elevator.c
> index 022a26830297..e6b523360231 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -40,6 +40,7 @@
> #include <trace/events/block.h>
> 
> #include "blk.h"
> +#include "blk-mq-sched.h"
> 
> static DEFINE_SPINLOCK(elv_list_lock);
> static LIST_HEAD(elv_list);
> @@ -58,7 +59,9 @@ static int elv_iosched_allow_bio_merge(struct request *rq, struct bio *bio)
> 	struct request_queue *q = rq->q;
> 	struct elevator_queue *e = q->elevator;
> 
> -	if (e->type->ops.sq.elevator_allow_bio_merge_fn)
> +	if (e->uses_mq && e->type->ops.mq.allow_merge)
> +		return e->type->ops.mq.allow_merge(q, rq, bio);
> +	else if (!e->uses_mq && e->type->ops.sq.elevator_allow_bio_merge_fn)
> 		return e->type->ops.sq.elevator_allow_bio_merge_fn(q, rq, bio);
> 
> 	return 1;
> @@ -163,6 +166,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
> 	kobject_init(&eq->kobj, &elv_ktype);
> 	mutex_init(&eq->sysfs_lock);
> 	hash_init(eq->hash);
> +	eq->uses_mq = e->uses_mq;
> 
> 	return eq;
> }
> @@ -219,12 +223,19 @@ int elevator_init(struct request_queue *q, char *name)
> 		if (!e) {
> 			printk(KERN_ERR
> 				"Default I/O scheduler not found. " \
> -				"Using noop.\n");
> +				"Using noop/none.\n");
> +			if (q->mq_ops) {
> +				elevator_put(e);
> +				return 0;
> +			}
> 			e = elevator_get("noop", false);
> 		}
> 	}
> 
> -	err = e->ops.sq.elevator_init_fn(q, e);
> +	if (e->uses_mq)
> +		err = e->ops.mq.init_sched(q, e);
> +	else
> +		err = e->ops.sq.elevator_init_fn(q, e);
> 	if (err)
> 		elevator_put(e);
> 	return err;
> @@ -234,7 +245,9 @@ EXPORT_SYMBOL(elevator_init);
> void elevator_exit(struct elevator_queue *e)
> {
> 	mutex_lock(&e->sysfs_lock);
> -	if (e->type->ops.sq.elevator_exit_fn)
> +	if (e->uses_mq && e->type->ops.mq.exit_sched)
> +		e->type->ops.mq.exit_sched(e);
> +	else if (!e->uses_mq && e->type->ops.sq.elevator_exit_fn)
> 		e->type->ops.sq.elevator_exit_fn(e);
> 	mutex_unlock(&e->sysfs_lock);
> 
> @@ -253,6 +266,7 @@ void elv_rqhash_del(struct request_queue *q, struct request *rq)
> 	if (ELV_ON_HASH(rq))
> 		__elv_rqhash_del(rq);
> }
> +EXPORT_SYMBOL_GPL(elv_rqhash_del);
> 
> void elv_rqhash_add(struct request_queue *q, struct request *rq)
> {
> @@ -262,6 +276,7 @@ void elv_rqhash_add(struct request_queue *q, struct request *rq)
> 	hash_add(e->hash, &rq->hash, rq_hash_key(rq));
> 	rq->rq_flags |= RQF_HASHED;
> }
> +EXPORT_SYMBOL_GPL(elv_rqhash_add);
> 
> void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
> {
> @@ -443,7 +458,9 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
> 		return ELEVATOR_BACK_MERGE;
> 	}
> 
> -	if (e->type->ops.sq.elevator_merge_fn)
> +	if (e->uses_mq && e->type->ops.mq.request_merge)
> +		return e->type->ops.mq.request_merge(q, req, bio);
> +	else if (!e->uses_mq && e->type->ops.sq.elevator_merge_fn)
> 		return e->type->ops.sq.elevator_merge_fn(q, req, bio);
> 
> 	return ELEVATOR_NO_MERGE;
> @@ -456,8 +473,7 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
>  *
>  * Returns true if we merged, false otherwise
>  */
> -static bool elv_attempt_insert_merge(struct request_queue *q,
> -				     struct request *rq)
> +bool elv_attempt_insert_merge(struct request_queue *q, struct request *rq)
> {
> 	struct request *__rq;
> 	bool ret;
> @@ -495,7 +511,9 @@ void elv_merged_request(struct request_queue *q, struct request *rq, int type)
> {
> 	struct elevator_queue *e = q->elevator;
> 
> -	if (e->type->ops.sq.elevator_merged_fn)
> +	if (e->uses_mq && e->type->ops.mq.request_merged)
> +		e->type->ops.mq.request_merged(q, rq, type);
> +	else if (!e->uses_mq && e->type->ops.sq.elevator_merged_fn)
> 		e->type->ops.sq.elevator_merged_fn(q, rq, type);
> 
> 	if (type == ELEVATOR_BACK_MERGE)
> @@ -508,10 +526,15 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
> 			     struct request *next)
> {
> 	struct elevator_queue *e = q->elevator;
> -	const int next_sorted = next->rq_flags & RQF_SORTED;
> -
> -	if (next_sorted && e->type->ops.sq.elevator_merge_req_fn)
> -		e->type->ops.sq.elevator_merge_req_fn(q, rq, next);
> +	bool next_sorted = false;
> +
> +	if (e->uses_mq && e->type->ops.mq.requests_merged)
> +		e->type->ops.mq.requests_merged(q, rq, next);
> +	else if (e->type->ops.sq.elevator_merge_req_fn) {
> +		next_sorted = next->rq_flags & RQF_SORTED;
> +		if (next_sorted)
> +			e->type->ops.sq.elevator_merge_req_fn(q, rq, next);
> +	}
> 
> 	elv_rqhash_reposition(q, rq);
> 
> @@ -528,6 +551,9 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
> {
> 	struct elevator_queue *e = q->elevator;
> 
> +	if (WARN_ON_ONCE(e->uses_mq))
> +		return;
> +
> 	if (e->type->ops.sq.elevator_bio_merged_fn)
> 		e->type->ops.sq.elevator_bio_merged_fn(q, rq, bio);
> }
> @@ -682,8 +708,11 @@ struct request *elv_latter_request(struct request_queue *q, struct request *rq)
> {
> 	struct elevator_queue *e = q->elevator;
> 
> -	if (e->type->ops.sq.elevator_latter_req_fn)
> +	if (e->uses_mq && e->type->ops.mq.next_request)
> +		return e->type->ops.mq.next_request(q, rq);
> +	else if (!e->uses_mq && e->type->ops.sq.elevator_latter_req_fn)
> 		return e->type->ops.sq.elevator_latter_req_fn(q, rq);
> +
> 	return NULL;
> }
> 
> @@ -691,7 +720,9 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
> {
> 	struct elevator_queue *e = q->elevator;
> 
> -	if (e->type->ops.sq.elevator_former_req_fn)
> +	if (e->uses_mq && e->type->ops.mq.former_request)
> +		return e->type->ops.mq.former_request(q, rq);
> +	if (!e->uses_mq && e->type->ops.sq.elevator_former_req_fn)
> 		return e->type->ops.sq.elevator_former_req_fn(q, rq);
> 	return NULL;
> }
> @@ -701,6 +732,9 @@ int elv_set_request(struct request_queue *q, struct request *rq,
> {
> 	struct elevator_queue *e = q->elevator;
> 
> +	if (WARN_ON_ONCE(e->uses_mq))
> +		return 0;
> +
> 	if (e->type->ops.sq.elevator_set_req_fn)
> 		return e->type->ops.sq.elevator_set_req_fn(q, rq, bio, gfp_mask);
> 	return 0;
> @@ -710,6 +744,9 @@ void elv_put_request(struct request_queue *q, struct request *rq)
> {
> 	struct elevator_queue *e = q->elevator;
> 
> +	if (WARN_ON_ONCE(e->uses_mq))
> +		return;
> +
> 	if (e->type->ops.sq.elevator_put_req_fn)
> 		e->type->ops.sq.elevator_put_req_fn(rq);
> }
> @@ -718,6 +755,9 @@ int elv_may_queue(struct request_queue *q, unsigned int op)
> {
> 	struct elevator_queue *e = q->elevator;
> 
> +	if (WARN_ON_ONCE(e->uses_mq))
> +		return 0;
> +
> 	if (e->type->ops.sq.elevator_may_queue_fn)
> 		return e->type->ops.sq.elevator_may_queue_fn(q, op);
> 
> @@ -728,6 +768,9 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
> {
> 	struct elevator_queue *e = q->elevator;
> 
> +	if (WARN_ON_ONCE(e->uses_mq))
> +		return;
> +
> 	/*
> 	 * request is released from the driver, io must be done
> 	 */
> @@ -803,7 +846,7 @@ int elv_register_queue(struct request_queue *q)
> 		}
> 		kobject_uevent(&e->kobj, KOBJ_ADD);
> 		e->registered = 1;
> -		if (e->type->ops.sq.elevator_registered_fn)
> +		if (!e->uses_mq && e->type->ops.sq.elevator_registered_fn)
> 			e->type->ops.sq.elevator_registered_fn(q);
> 	}
> 	return error;
> @@ -891,9 +934,14 @@ EXPORT_SYMBOL_GPL(elv_unregister);
> static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
> {
> 	struct elevator_queue *old = q->elevator;
> -	bool registered = old->registered;
> +	bool old_registered = false;
> 	int err;
> 
> +	if (q->mq_ops) {
> +		blk_mq_freeze_queue(q);
> +		blk_mq_quiesce_queue(q);
> +	}
> +
> 	/*
> 	 * Turn on BYPASS and drain all requests w/ elevator private data.
> 	 * Block layer doesn't call into a quiesced elevator - all requests
> @@ -901,32 +949,52 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
> 	 * using INSERT_BACK.  All requests have SOFTBARRIER set and no
> 	 * merge happens either.
> 	 */
> -	blk_queue_bypass_start(q);
> +	if (old) {
> +		old_registered = old->registered;
> 
> -	/* unregister and clear all auxiliary data of the old elevator */
> -	if (registered)
> -		elv_unregister_queue(q);
> +		if (!q->mq_ops)
> +			blk_queue_bypass_start(q);
> 
> -	spin_lock_irq(q->queue_lock);
> -	ioc_clear_queue(q);
> -	spin_unlock_irq(q->queue_lock);
> +		/* unregister and clear all auxiliary data of the old elevator */
> +		if (old_registered)
> +			elv_unregister_queue(q);
> +
> +		spin_lock_irq(q->queue_lock);
> +		ioc_clear_queue(q);
> +		spin_unlock_irq(q->queue_lock);
> +	}
> 
> 	/* allocate, init and register new elevator */
> -	err = new_e->ops.sq.elevator_init_fn(q, new_e);
> -	if (err)
> -		goto fail_init;
> +	if (new_e) {
> +		if (new_e->uses_mq)
> +			err = new_e->ops.mq.init_sched(q, new_e);
> +		else
> +			err = new_e->ops.sq.elevator_init_fn(q, new_e);
> +		if (err)
> +			goto fail_init;
> 
> -	if (registered) {
> 		err = elv_register_queue(q);
> 		if (err)
> 			goto fail_register;
> -	}
> +	} else
> +		q->elevator = NULL;
> 
> 	/* done, kill the old one and finish */
> -	elevator_exit(old);
> -	blk_queue_bypass_end(q);
> +	if (old) {
> +		elevator_exit(old);
> +		if (!q->mq_ops)
> +			blk_queue_bypass_end(q);
> +	}
> +
> +	if (q->mq_ops) {
> +		blk_mq_unfreeze_queue(q);
> +		blk_mq_start_stopped_hw_queues(q, true);
> +	}
> 
> -	blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
> +	if (new_e)
> +		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
> +	else
> +		blk_add_trace_msg(q, "elv switch: none");
> 
> 	return 0;
> 
> @@ -934,9 +1002,16 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
> 	elevator_exit(q->elevator);
> fail_init:
> 	/* switch failed, restore and re-register old elevator */
> -	q->elevator = old;
> -	elv_register_queue(q);
> -	blk_queue_bypass_end(q);
> +	if (old) {
> +		q->elevator = old;
> +		elv_register_queue(q);
> +		if (!q->mq_ops)
> +			blk_queue_bypass_end(q);
> +	}
> +	if (q->mq_ops) {
> +		blk_mq_unfreeze_queue(q);
> +		blk_mq_start_stopped_hw_queues(q, true);
> +	}
> 
> 	return err;
> }
> @@ -949,8 +1024,11 @@ static int __elevator_change(struct request_queue *q, const char *name)
> 	char elevator_name[ELV_NAME_MAX];
> 	struct elevator_type *e;
> 
> -	if (!q->elevator)
> -		return -ENXIO;
> +	/*
> +	 * Special case for mq, turn off scheduling
> +	 */
> +	if (q->mq_ops && !strncmp(name, "none", 4))
> +		return elevator_switch(q, NULL);
> 
> 	strlcpy(elevator_name, name, sizeof(elevator_name));
> 	e = elevator_get(strstrip(elevator_name), true);
> @@ -959,11 +1037,23 @@ static int __elevator_change(struct request_queue *q, const char *name)
> 		return -EINVAL;
> 	}
> 
> -	if (!strcmp(elevator_name, q->elevator->type->elevator_name)) {
> +	if (q->elevator &&
> +	    !strcmp(elevator_name, q->elevator->type->elevator_name)) {
> 		elevator_put(e);
> 		return 0;
> 	}
> 
> +	if (!e->uses_mq && q->mq_ops) {
> +		printk(KERN_ERR "blk-mq-sched: elv %s does not support mq\n", elevator_name);
> +		elevator_put(e);
> +		return -EINVAL;
> +	}
> +	if (e->uses_mq && !q->mq_ops) {
> +		printk(KERN_ERR "blk-mq-sched: elv %s is for mq\n", elevator_name);
> +		elevator_put(e);
> +		return -EINVAL;
> +	}
> +
> 	return elevator_switch(q, e);
> }
> 
> @@ -985,7 +1075,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
> {
> 	int ret;
> 
> -	if (!q->elevator)
> +	if (!q->mq_ops || q->request_fn)
> 		return count;
> 
> 	ret = __elevator_change(q, name);
> @@ -999,24 +1089,34 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
> ssize_t elv_iosched_show(struct request_queue *q, char *name)
> {
> 	struct elevator_queue *e = q->elevator;
> -	struct elevator_type *elv;
> +	struct elevator_type *elv = NULL;
> 	struct elevator_type *__e;
> 	int len = 0;
> 
> -	if (!q->elevator || !blk_queue_stackable(q))
> +	if (!blk_queue_stackable(q))
> 		return sprintf(name, "none\n");
> 
> -	elv = e->type;
> +	if (!q->elevator)
> +		len += sprintf(name+len, "[none] ");
> +	else
> +		elv = e->type;
> 
> 	spin_lock(&elv_list_lock);
> 	list_for_each_entry(__e, &elv_list, list) {
> -		if (!strcmp(elv->elevator_name, __e->elevator_name))
> +		if (elv && !strcmp(elv->elevator_name, __e->elevator_name)) {
> 			len += sprintf(name+len, "[%s] ", elv->elevator_name);
> -		else
> +			continue;
> +		}
> +		if (__e->uses_mq && q->mq_ops)
> +			len += sprintf(name+len, "%s ", __e->elevator_name);
> +		else if (!__e->uses_mq && !q->mq_ops)
> 			len += sprintf(name+len, "%s ", __e->elevator_name);
> 	}
> 	spin_unlock(&elv_list_lock);
> 
> +	if (q->mq_ops && q->elevator)
> +		len += sprintf(name+len, "none");
> +
> 	len += sprintf(len+name, "\n");
> 	return len;
> }
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 2686f9e7302a..e3159be841ff 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -22,6 +22,7 @@ struct blk_mq_hw_ctx {
> 
> 	unsigned long		flags;		/* BLK_MQ_F_* flags */
> 
> +	void			*sched_data;
> 	struct request_queue	*queue;
> 	struct blk_flush_queue	*fq;
> 
> @@ -156,6 +157,7 @@ enum {
> 
> 	BLK_MQ_S_STOPPED	= 0,
> 	BLK_MQ_S_TAG_ACTIVE	= 1,
> +	BLK_MQ_S_SCHED_RESTART	= 2,
> 
> 	BLK_MQ_MAX_DEPTH	= 10240,
> 
> @@ -179,7 +181,6 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
> 
> void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule);
> 
> -void blk_mq_insert_request(struct request *, bool, bool, bool);
> void blk_mq_free_request(struct request *rq);
> bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
> 
> diff --git a/include/linux/elevator.h b/include/linux/elevator.h
> index 2a9e966eed03..417810b2d2f5 100644
> --- a/include/linux/elevator.h
> +++ b/include/linux/elevator.h
> @@ -77,6 +77,32 @@ struct elevator_ops
> 	elevator_registered_fn *elevator_registered_fn;
> };
> 
> +struct blk_mq_alloc_data;
> +struct blk_mq_hw_ctx;
> +
> +struct elevator_mq_ops {
> +	int (*init_sched)(struct request_queue *, struct elevator_type *);
> +	void (*exit_sched)(struct elevator_queue *);
> +
> +	bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
> +	bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
> +	int (*request_merge)(struct request_queue *q, struct request **, struct bio *);
> +	void (*request_merged)(struct request_queue *, struct request *, int);
> +	void (*requests_merged)(struct request_queue *, struct request *, struct request *);
> +	struct request *(*get_request)(struct request_queue *, unsigned int, struct blk_mq_alloc_data *);
> +	bool (*put_request)(struct request *);
> +	void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
> +	void (*dispatch_requests)(struct blk_mq_hw_ctx *, struct list_head *);
> +	bool (*has_work)(struct blk_mq_hw_ctx *);
> +	void (*completed_request)(struct blk_mq_hw_ctx *, struct request *);
> +	void (*started_request)(struct request *);
> +	void (*requeue_request)(struct request *);
> +	struct request *(*former_request)(struct request_queue *, struct request *);
> +	struct request *(*next_request)(struct request_queue *, struct request *);
> +	int (*get_rq_priv)(struct request_queue *, struct request *);
> +	void (*put_rq_priv)(struct request_queue *, struct request *);
> +};
> +
> #define ELV_NAME_MAX	(16)
> 
> struct elv_fs_entry {
> @@ -96,12 +122,14 @@ struct elevator_type
> 	/* fields provided by elevator implementation */
> 	union {
> 		struct elevator_ops sq;
> +		struct elevator_mq_ops mq;
> 	} ops;
> 	size_t icq_size;	/* see iocontext.h */
> 	size_t icq_align;	/* ditto */
> 	struct elv_fs_entry *elevator_attrs;
> 	char elevator_name[ELV_NAME_MAX];
> 	struct module *elevator_owner;
> +	bool uses_mq;
> 
> 	/* managed by elevator core */
> 	char icq_cache_name[ELV_NAME_MAX + 5];	/* elvname + "_io_cq" */
> @@ -125,6 +153,7 @@ struct elevator_queue
> 	struct kobject kobj;
> 	struct mutex sysfs_lock;
> 	unsigned int registered:1;
> +	unsigned int uses_mq:1;
> 	DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
> };
> 
> @@ -141,6 +170,7 @@ extern void elv_merge_requests(struct request_queue *, struct request *,
> extern void elv_merged_request(struct request_queue *, struct request *, int);
> extern void elv_bio_merged(struct request_queue *q, struct request *,
> 				struct bio *);
> +extern bool elv_attempt_insert_merge(struct request_queue *, struct request *);
> extern void elv_requeue_request(struct request_queue *, struct request *);
> extern struct request *elv_former_request(struct request_queue *, struct request *);
> extern struct request *elv_latter_request(struct request_queue *, struct request *);
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-20 11:55   ` Paolo Valente
@ 2016-12-20 15:45     ` Jens Axboe
  2016-12-21  2:22     ` Jens Axboe
  1 sibling, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2016-12-20 15:45 UTC (permalink / raw)
  To: Paolo Valente; +Cc: axboe, linux-block, linux-kernel, osandov

On 12/20/2016 04:55 AM, Paolo Valente wrote:
>> +struct request *blk_mq_sched_get_request(struct request_queue *q,
>> +					 struct bio *bio,
>> +					 unsigned int op,
>> +					 struct blk_mq_alloc_data *data)
>> +{
>> +	struct elevator_queue *e = q->elevator;
>> +	struct blk_mq_hw_ctx *hctx;
>> +	struct blk_mq_ctx *ctx;
>> +	struct request *rq;
>> +
>> +	blk_queue_enter_live(q);
>> +	ctx = blk_mq_get_ctx(q);
>> +	hctx = blk_mq_map_queue(q, ctx->cpu);
>> +
>> +	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
>> +
>> +	if (e && e->type->ops.mq.get_request)
>> +		rq = e->type->ops.mq.get_request(q, op, data);
> 
> bio is not passed to the scheduler here.  Yet bfq uses bio to get the
> blkcg (invoking bio_blkcg).  I'm not finding any workaround.

One important note here - what I'm posting is a work in progress, it's
by no means set in stone. So when you find missing items like this, feel
free to fix them up and send a patch. I will then fold in that patch. Or
if you don't feel comfortable fixing it up, let me know, and I'll fix it
up next time I touch it.

>> +	else
>> +		rq = __blk_mq_alloc_request(data, op);
>> +
>> +	if (rq) {
>> +		rq->elv.icq = NULL;
>> +		if (e && e->type->icq_cache)
>> +			blk_mq_sched_assign_ioc(q, rq, bio);
> 
> bfq needs rq->elv.icq to be consistent in bfq_get_request, but the
> needed initialization seems to occur only after mq.get_request is
> invoked.
> 
> Note: to minimize latency, I'm reporting each problem that apparently
> cannot be solved by just modifying bfq as soon as I find it.  But if
> the resulting stream of micro-emails is annoying for you, I can buffer
> my questions and send you cumulative emails less frequently.

That's perfectly fine, I prefer knowing earlier rather than later. But
do also remember that it's fine to send a patch to fix those things up,
you don't have to wait for me.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-20  9:34   ` Paolo Valente
@ 2016-12-20 15:46     ` Jens Axboe
  0 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2016-12-20 15:46 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov

On 12/20/2016 02:34 AM, Paolo Valente wrote:
> 
>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> This is basically identical to deadline-iosched, except it registers
>> as a MQ capable scheduler. This is still a single queue design.
>>
>> Signed-off-by: Jens Axboe <axboe@fb.com>
>> ...
>> +
>> +static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
>> +{
>> +	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>> +
>> +	return !list_empty_careful(&dd->dispatch) ||
>> +		!list_empty_careful(&dd->fifo_list[0]) ||
>> +		!list_empty_careful(&dd->fifo_list[1]);
> 
> Just a request for clarification: if I'm not mistaken,
> list_empty_careful can be used safely only if the only possible other
> concurrent access is a delete.  Or am I missing something?

We can "solve" that with memory barriers. For now, it's safe to ignore
on your end.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/8] block: move rq_ioc() to blk.h
  2016-12-20 10:12   ` Paolo Valente
@ 2016-12-20 15:46     ` Jens Axboe
  2016-12-20 22:14       ` Jens Axboe
  0 siblings, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2016-12-20 15:46 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov

On 12/20/2016 03:12 AM, Paolo Valente wrote:
> 
>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> We want to use it outside of blk-core.c.
>>
> 
> Hi Jens,
> no hooks equivalent to elevator_init_icq_fn and elevator_exit_icq_fn
> are invoked.  In particular, the second hook lets bfq (as with cfq)
> know when it can finally exit the queue associated with the icq.
> I'm trying to figure out how to manage without these hooks/signals,
> but to no avail so far ...

Yep, those need to be added.
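
For reference, a sketch of the shape such hooks might take on the MQ
side, mirroring the legacy elevator_init_icq_fn/elevator_exit_icq_fn.
The names and the standalone struct below are assumptions for
illustration, not what will actually be added:

struct io_cq;	/* from include/linux/iocontext.h */

/* hypothetical MQ-side icq hooks, sketched after the legacy ones */
struct elevator_mq_icq_ops {
	void (*init_icq)(struct io_cq *icq);	/* a new icq was created */
	void (*exit_icq)(struct io_cq *icq);	/* icq dying: release its queue */
};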

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/8] block: move rq_ioc() to blk.h
  2016-12-20 15:46     ` Jens Axboe
@ 2016-12-20 22:14       ` Jens Axboe
  0 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2016-12-20 22:14 UTC (permalink / raw)
  To: Paolo Valente; +Cc: linux-block, Linux-Kernal, osandov

On 12/20/2016 08:46 AM, Jens Axboe wrote:
> On 12/20/2016 03:12 AM, Paolo Valente wrote:
>>
>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>>
>>> We want to use it outside of blk-core.c.
>>>
>>
>> Hi Jens,
>> no hooks equivalent to elevator_init_icq_fn and elevator_exit_icq_fn
>> are invoked.  In particular, the second hook lets bfq (as with cfq)
>> know when it can finally exit the queue associated with the icq.
>> I'm trying to figure out how to manage without these hooks/signals,
>> but to no avail so far ...
> 
> Yep, those need to be added.

Done, pushed out.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-20 11:55   ` Paolo Valente
  2016-12-20 15:45     ` Jens Axboe
@ 2016-12-21  2:22     ` Jens Axboe
  2016-12-22 15:20       ` Paolo Valente
  1 sibling, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2016-12-21  2:22 UTC (permalink / raw)
  To: Paolo Valente; +Cc: linux-block, linux-kernel, osandov

On Tue, Dec 20 2016, Paolo Valente wrote:
> > +	else
> > +		rq = __blk_mq_alloc_request(data, op);
> > +
> > +	if (rq) {
> > +		rq->elv.icq = NULL;
> > +		if (e && e->type->icq_cache)
> > +			blk_mq_sched_assign_ioc(q, rq, bio);
> 
> bfq needs rq->elv.icq to be consistent in bfq_get_request, but the
> needed initialization seems to occur only after mq.get_request is
> invoked.

Can you do it from get/put_rq_priv? The icq is assigned there. If not,
we can redo this part, not a big deal.
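
To make that concrete, a rough sketch with hypothetical bfq-side names
(icq_to_bic() is assumed to be a scheduler-internal helper): by the time
->get_rq_priv() runs, rq->elv.icq has already been assigned, so per-icq
state can be attached there instead of in ->get_request().

static int bfq_get_rq_priv_sketch(struct request_queue *q, struct request *rq)
{
	struct io_cq *icq = rq->elv.icq;	/* set by blk_mq_sched_assign_ioc() */

	if (!icq)
		return 0;	/* no io_context: fall back to a default queue */

	rq->elv.priv[0] = icq_to_bic(icq);	/* stash per-process state */
	return 0;
}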

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-17  0:12 ` [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
  2016-12-20  9:34   ` Paolo Valente
@ 2016-12-21 11:59   ` Bart Van Assche
  2016-12-21 14:22     ` Jens Axboe
  2016-12-22 16:07   ` Paolo Valente
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 69+ messages in thread
From: Bart Van Assche @ 2016-12-21 11:59 UTC (permalink / raw)
  To: linux-kernel, linux-block, axboe, axboe; +Cc: osandov, paolo.valente

On 12/17/2016 01:12 AM, Jens Axboe wrote:
> +static bool dd_put_request(struct request *rq)
> +{
> +	/*
> +	 * If it's a real request, we just have to free it. For a shadow
> +	 * request, we should only free it if we haven't started it. A
> +	 * started request is mapped to a real one, and the real one will
> +	 * free it. We can get here with request merges, since we then
> +	 * free the request before we start/issue it.
> +	 */
> +	if (!blk_mq_sched_rq_is_shadow(rq))
> +		return false;
> +
> +	if (!(rq->rq_flags & RQF_STARTED)) {
> +		struct request_queue *q = rq->q;
> +		struct deadline_data *dd = q->elevator->elevator_data;
> +
> +		/*
> +		 * IO completion would normally do this, but if we merge
> +		 * and free before we issue the request, drop both the
> +		 * tag and queue ref
> +		 */
> +		blk_mq_sched_free_shadow_request(dd->tags, rq);
> +		blk_queue_exit(q);
> +	}
> +
> +	return true;
> +}

Hello Jens,

Since this patch is the first patch that introduces a call to blk_queue_exit()
from a module other than the block layer core, shouldn't this patch export the
blk_queue_exit() function? An attempt to build mq-deadline as a module resulted
in the following:

ERROR: "blk_queue_exit" [block/mq-deadline.ko] undefined!
make[1]: *** [scripts/Makefile.modpost:91: __modpost] Error 1
make: *** [Makefile:1198: modules] Error 2
Execution failed: make all

Bart.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-21 11:59   ` Bart Van Assche
@ 2016-12-21 14:22     ` Jens Axboe
  0 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2016-12-21 14:22 UTC (permalink / raw)
  To: Bart Van Assche, linux-kernel, linux-block, axboe; +Cc: osandov, paolo.valente

On 12/21/2016 04:59 AM, Bart Van Assche wrote:
> Since this patch is the first patch that introduces a call to
> blk_queue_exit() from a module other than the block layer core,
> shouldn't this patch export the blk_queue_exit() function? An attempt
> to build mq-deadline as a module resulted in the following:
> 
> ERROR: "blk_queue_exit" [block/mq-deadline.ko] undefined!
> make[1]: *** [scripts/Makefile.modpost:91: __modpost] Error 1
> make: *** [Makefile:1198: modules] Error 2
> Execution failed: make all

Yes, it should. I'll add the export for now; I want to move that check
and free/drop into the generic code so that the schedulers don't have to
worry about it.
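
For completeness, the immediate fix is roughly the following sketch
(whether it ends up as EXPORT_SYMBOL or EXPORT_SYMBOL_GPL, and whether
it survives once the check moves into generic code, is left open here):

/* block/blk-core.c: export the existing helper for modular schedulers */
void blk_queue_exit(struct request_queue *q)
{
	percpu_ref_put(&q->q_usage_counter);
}
EXPORT_SYMBOL_GPL(blk_queue_exit);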

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-17  0:12 ` [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers Jens Axboe
  2016-12-20 11:55   ` Paolo Valente
@ 2016-12-22  9:59   ` Paolo Valente
  2016-12-22 11:13     ` Paolo Valente
                       ` (2 more replies)
  1 sibling, 3 replies; 69+ messages in thread
From: Paolo Valente @ 2016-12-22  9:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
> 
> This adds a set of hooks that intercepts the blk-mq path of
> allocating/inserting/issuing/completing requests, allowing
> us to develop a scheduler within that framework.
> 
> We reuse the existing elevator scheduler API on the registration
> side, but augment that with the scheduler flagging support for
> the blk-mq interface, and with a separate set of ops hooks for MQ
> devices.
> 
> Schedulers can opt in to using shadow requests. Shadow requests
> are internal requests that the scheduler uses for the allocate
> and insert part, which are then mapped to a real driver request
> at dispatch time. This is needed to separate the device queue depth
> from the pool of requests that the scheduler has to work with.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>
> 
...

> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> new file mode 100644
> index 000000000000..b7e1839d4785
> --- /dev/null
> +++ b/block/blk-mq-sched.c

> ...
> +static inline bool
> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
> +			 struct bio *bio)
> +{
> +	struct elevator_queue *e = q->elevator;
> +
> +	if (e && e->type->ops.mq.allow_merge)
> +		return e->type->ops.mq.allow_merge(q, rq, bio);
> +
> +	return true;
> +}
> +

Something does not seem to add up here:
e->type->ops.mq.allow_merge may be called only from
blk_mq_sched_allow_merge, which, in turn, may be called only from
blk_mq_attempt_merge, which, finally, may be called only from
blk_mq_merge_queue_io.  Yet the latter is called only if there is
no elevator (lines 1399 and 1507 in blk-mq.c).

Therefore, e->type->ops.mq.allow_merge can never be called, whether or
not there is an elevator.  Bear with me if I'm missing something huge,
but I thought it was worth reporting this.

Paolo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-22  9:59   ` Paolo Valente
@ 2016-12-22 11:13     ` Paolo Valente
  2017-01-17  2:47       ` Jens Axboe
  2016-12-23 10:12     ` Paolo Valente
  2017-01-17  2:47     ` Jens Axboe
  2 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2016-12-22 11:13 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 22 dic 2016, alle ore 10:59, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
>> 
>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>> 
>> This adds a set of hooks that intercepts the blk-mq path of
>> allocating/inserting/issuing/completing requests, allowing
>> us to develop a scheduler within that framework.
>> 
>> We reuse the existing elevator scheduler API on the registration
>> side, but augment that with the scheduler flagging support for
>> the blk-mq interface, and with a separate set of ops hooks for MQ
>> devices.
>> 
>> Schedulers can opt in to using shadow requests. Shadow requests
>> are internal requests that the scheduler uses for the allocate
>> and insert part, which are then mapped to a real driver request
>> at dispatch time. This is needed to separate the device queue depth
>> from the pool of requests that the scheduler has to work with.
>> 
>> Signed-off-by: Jens Axboe <axboe@fb.com>
>> 
> ...
> 
>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>> new file mode 100644
>> index 000000000000..b7e1839d4785
>> --- /dev/null
>> +++ b/block/blk-mq-sched.c
> 
>> ...
>> +static inline bool
>> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
>> +			 struct bio *bio)
>> +{
>> +	struct elevator_queue *e = q->elevator;
>> +
>> +	if (e && e->type->ops.mq.allow_merge)
>> +		return e->type->ops.mq.allow_merge(q, rq, bio);
>> +
>> +	return true;
>> +}
>> +
> 
> Something does not seem to add up here:
> e->type->ops.mq.allow_merge may be called only from
> blk_mq_sched_allow_merge, which, in turn, may be called only from
> blk_mq_attempt_merge, which, finally, may be called only from
> blk_mq_merge_queue_io.  Yet the latter is called only if there is
> no elevator (lines 1399 and 1507 in blk-mq.c).
> 
> Therefore, e->type->ops.mq.allow_merge can never be called, whether or
> not there is an elevator.  Bear with me if I'm missing something huge,
> but I thought it was worth reporting this.
> 

Just another detail: if e->type->ops.mq.allow_merge ever does get
invoked from the above path, then it is of course invoked without the
scheduler lock held.  In contrast, when this function is invoked from
dd_bio_merge, the scheduler lock is held.

To handle these opposite cases, I don't know whether checking whether
the lock is held (and possibly taking it) from inside
e->type->ops.mq.allow_merge is a good solution.  In any case, before
possibly trying that, I will wait for some feedback on the main
problem, i.e., the fact that e->type->ops.mq.allow_merge seems
unreachable in the above path.
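
For reference, the locked path in question looks roughly like this in
the posted mq-deadline (reconstructed here, so treat it as a sketch):
dd->lock is taken around the whole merge attempt, whereas the
blk_mq_attempt_merge path, if it were ever reached, would call
->allow_merge with no scheduler lock held.

static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
{
	struct request_queue *q = hctx->queue;
	struct deadline_data *dd = q->elevator->elevator_data;
	bool ret;

	spin_lock(&dd->lock);
	ret = blk_mq_sched_try_merge(q, bio);	/* may end up in ->allow_merge */
	spin_unlock(&dd->lock);

	return ret;
}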

Thanks,
Paolo

> Paolo
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-21  2:22     ` Jens Axboe
@ 2016-12-22 15:20       ` Paolo Valente
  0 siblings, 0 replies; 69+ messages in thread
From: Paolo Valente @ 2016-12-22 15:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Linux-Kernal, osandov


> Il giorno 21 dic 2016, alle ore 03:22, Jens Axboe <axboe@fb.com> ha scritto:
> 
> On Tue, Dec 20 2016, Paolo Valente wrote:
>>> +	else
>>> +		rq = __blk_mq_alloc_request(data, op);
>>> +
>>> +	if (rq) {
>>> +		rq->elv.icq = NULL;
>>> +		if (e && e->type->icq_cache)
>>> +			blk_mq_sched_assign_ioc(q, rq, bio);
>> 
>> bfq needs rq->elv.icq to be consistent in bfq_get_request, but the
>> needed initialization seems to occur only after mq.get_request is
>> invoked.
> 
> Can you do it from get/put_rq_priv?

Definitely, I just overlooked them, sorry :(

Thanks,
Paolo

> The icq is assigned there. If not,
> we can redo this part, not a big deal.
> 
> -- 
> Jens Axboe
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2016-12-19 21:05       ` Jens Axboe
@ 2016-12-22 15:28         ` Paolo Valente
  2017-01-17  2:47           ` Jens Axboe
  0 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2016-12-22 15:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jens Axboe, linux-block, Linux-Kernal, Omar Sandoval,
	Linus Walleij, Ulf Hansson, Mark Brown


> Il giorno 19 dic 2016, alle ore 22:05, Jens Axboe <axboe@fb.com> ha scritto:
> 
> On 12/19/2016 11:21 AM, Paolo Valente wrote:
>> 
>>> Il giorno 19 dic 2016, alle ore 16:20, Jens Axboe <axboe@fb.com> ha scritto:
>>> 
>>> On 12/19/2016 04:32 AM, Paolo Valente wrote:
>>>> 
>>>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>>>> 
>>>>> This is version 4 of this patchset, version 3 was posted here:
>>>>> 
>>>>> https://marc.info/?l=linux-block&m=148178513407631&w=2
>>>>> 
>>>>> From the discussion last time, I looked into the feasibility of having
>>>>> two sets of tags for the same request pool, to avoid having to copy
>>>>> some of the request fields at dispatch and completion time. To do that,
>>>>> we'd have to replace the driver tag map(s) with our own, and augment
>>>>> that with tag map(s) on the side representing the device queue depth.
>>>>> Queuing IO with the scheduler would allocate from the new map, and
>>>>> dispatching would acquire the "real" tag. We would need to change
>>>>> drivers to do this, or add an extra indirection table to map a real
>>>>> tag to the scheduler tag. We would also need a 1:1 mapping between
>>>>> scheduler and hardware tag pools, or additional info to track it.
>>>>> Unless someone can convince me otherwise, I think the current approach
>>>>> is cleaner.
>>>>> 
>>>>> I wasn't going to post v4 so soon, but I discovered a bug that led
>>>>> to drastically decreased merging. Especially on rotating storage,
>>>>> this release should be fast, and on par with the merging that we
>>>>> get through the legacy schedulers.
>>>>> 
>>>> 
>>>> I'm working on modifying bfq.  You mentioned other missing pieces to
>>>> come.  Do you already have an idea of what they are, so that I am
>>>> somewhat prepared for what won't work even if my changes are right?
>>> 
>>> I'm mostly talking about elevator ops hooks that aren't there in the new
>>> framework, but exist in the old one. There should be no hidden
>>> surprises, if that's what you are worried about.
>>> 
>>> On the ops side, the only ones I can think of are the activate and
>>> deactivate, and those can be done in the dispatch_request hook for
>>> activate, and put/requeue for deactivate.
>>> 
>> 
>> You mean that there is no conceptual problem in moving the code of the
>> activate interface function into the dispatch function, and the code
>> of the deactivate into the put_request? (for a requeue it is a little
>> less clear to me, so one step at a time)  Or am I missing
>> something more complex?
> 
> Yes, what I mean is that there isn't a 1:1 mapping between the old ops
> and the new ops. So you'll have to consider the cases.
> 
> 

Problem: while it seems easy and safe to move the simple increment that
was done in activate_request somewhere else, I wonder whether a request
may be deactivated before being completed.  If that can happen, then
without a deactivate_request hook the increments would remain
unbalanced.  Or are request completions always guaranteed, as long as
no hw/sw component breaks?
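
To make the concern concrete, here is a sketch of one way a scheduler
could keep the old activate/deactivate counter balanced with only the
new hooks (all names below are hypothetical): bump the counter when a
request leaves the scheduler in ->dispatch_requests(), and drop it in
both ->completed_request() and ->requeue_request(), so a request that
is requeued and never completed still rebalances the count.

struct sched_inflight {
	atomic_t in_driver;	/* requests handed to the driver, not yet back */
};

/* called for each request handed out from ->dispatch_requests() */
static void sched_account_dispatch(struct sched_inflight *s)
{
	atomic_inc(&s->in_driver);	/* was elevator_activate_req_fn */
}

/* called from both ->completed_request() and ->requeue_request() */
static void sched_account_done_or_requeue(struct sched_inflight *s)
{
	atomic_dec(&s->in_driver);	/* was elevator_deactivate_req_fn */
}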

Thanks,
Paolo 

> -- 
> Jens Axboe
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-17  0:12 ` [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
  2016-12-20  9:34   ` Paolo Valente
  2016-12-21 11:59   ` Bart Van Assche
@ 2016-12-22 16:07   ` Paolo Valente
  2017-01-17  2:47     ` Jens Axboe
  2016-12-22 16:49   ` Paolo Valente
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2016-12-22 16:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
> 
> This is basically identical to deadline-iosched, except it registers
> as a MQ capable scheduler. This is still a single queue design.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---

...

> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> new file mode 100644
> index 000000000000..3cb9de21ab21
> --- /dev/null
> +++ b/block/mq-deadline.c
> ...
> +/*
> + * remove rq from rbtree and fifo.
> + */
> +static void deadline_remove_request(struct request_queue *q, struct request *rq)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	list_del_init(&rq->queuelist);
> +
> +	/*
> +	 * We might not be on the rbtree, if we are doing an insert merge
> +	 */
> +	if (!RB_EMPTY_NODE(&rq->rb_node))
> +		deadline_del_rq_rb(dd, rq);
> +

I've been scratching my head over the last three instructions, but to
no avail.  If I understand correctly, the
list_del_init(&rq->queuelist);
removes rq from the fifo list.  But, if so, I don't understand how rq
could possibly not have been added to the rb_tree too.

Another interpretation I tried is that the above three lines correctly
handle the following case, in which rq has not been inserted into the
deadline fifo queue or rb tree at all: when dd_insert_request was
executed for rq, blk_mq_sched_try_insert_merge succeeded.  Yet, in that
case, the
list_del_init(&rq->queuelist);
does not seem to make sense.

Could you please shed some light on this for me?

Thanks,
Paolo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2016-12-17  0:12 [PATCHSET v4] blk-mq-scheduling framework Jens Axboe
                   ` (8 preceding siblings ...)
  2016-12-19 11:32 ` [PATCHSET v4] blk-mq-scheduling framework Paolo Valente
@ 2016-12-22 16:23 ` Bart Van Assche
  2016-12-22 16:52   ` Omar Sandoval
  9 siblings, 1 reply; 69+ messages in thread
From: Bart Van Assche @ 2016-12-22 16:23 UTC (permalink / raw)
  To: linux-kernel, linux-block, axboe, axboe; +Cc: osandov, paolo.valente

[-- Attachment #1: Type: text/plain, Size: 1767 bytes --]

On Fri, 2016-12-16 at 17:12 -0700, Jens Axboe wrote:
> From the discussion last time, I looked into the feasibility of having
> two sets of tags for the same request pool, to avoid having to copy
> some of the request fields at dispatch and completion time. To do that,
> we'd have to replace the driver tag map(s) with our own, and augment
> that with tag map(s) on the side representing the device queue depth.
> Queuing IO with the scheduler would allocate from the new map, and
> dispatching would acquire the "real" tag. We would need to change
> drivers to do this, or add an extra indirection table to map a real
> tag to the scheduler tag. We would also need a 1:1 mapping between
> scheduler and hardware tag pools, or additional info to track it.
> Unless someone can convince me otherwise, I think the current approach
> is cleaner.

Hello Jens,

Can you have a look at the attached patches? These implement the "two tags
per request" approach without a table that maps one tag type to the other
or any other ugly construct. __blk_mq_alloc_request() is modified such that
it assigns rq->sched_tag and sched_tags->rqs[] instead of rq->tag and
tags->rqs[]. rq->tag and tags->rqs[] are assigned just before dispatch by
blk_mq_assign_drv_tag(). This approach results in significantly less code
than the approach proposed in v4 of your blk-mq-sched patch series. Memory
usage is lower because only a single set of requests is allocated. The
runtime overhead is lower because request fields no longer have to be
copied between the requests owned by the block driver and the requests
owned by the I/O scheduler. I can boot a VM from the virtio-blk driver but
otherwise the attached patches have not yet been tested.
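
For readers skimming the attachments: the core of the approach is that
a request carries a scheduler tag from allocation and only picks up a
driver tag just before dispatch.  A simplified sketch of that
dispatch-time step follows; the real blk_mq_assign_drv_tag() is in the
third patch, and the body below is an assumed simplification using the
two-argument blk_mq_get_tag() from the second patch.

static int assign_drv_tag_sketch(struct request *rq)
{
	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);
	struct blk_mq_alloc_data data = {
		.q = rq->q, .hctx = hctx, .ctx = rq->mq_ctx,
		.flags = BLK_MQ_REQ_NOWAIT,	/* don't sleep at dispatch time */
	};
	unsigned int tag;

	tag = blk_mq_get_tag(&data, hctx->tags);	/* driver tag */
	if (tag == BLK_MQ_TAG_FAIL)
		return -EAGAIN;			/* device queue full, retry later */

	rq->tag = tag;
	hctx->tags->rqs[tag] = rq;		/* let completion find the rq */
	return 0;
}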

Thanks,

Bart.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-blk-mq-Revert-some-of-the-blk-mq-sched-framework-cha.patch --]
[-- Type: text/x-patch; name="0001-blk-mq-Revert-some-of-the-blk-mq-sched-framework-cha.patch", Size: 21755 bytes --]

From 0fd04112850a73f5be9fa91a29bd1791179e1e80 Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bart.vanassche@sandisk.com>
Date: Tue, 20 Dec 2016 12:53:54 +0100
Subject: [PATCH 1/3] blk-mq: Revert some of the blk-mq-sched framework changes

Remove the functions that allocate and free shadow requests.
Remove the get_request, put_request and completed_request callback
functions from struct elevator_type. Remove blk-mq I/O scheduling
functions that become superfluous due to these changes.

Note: this patch breaks blk-mq I/O scheduling. Later patches will make
blk-mq I/O scheduling work again.
---
 block/blk-mq-sched.c     | 295 +----------------------------------------------
 block/blk-mq-sched.h     |  55 +--------
 block/blk-mq.c           |  50 ++++++--
 block/mq-deadline.c      |  90 ++-------------
 include/linux/elevator.h |   3 -
 5 files changed, 58 insertions(+), 435 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 265e4a9cce7e..e46769db3d57 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -15,196 +15,6 @@
 #include "blk-mq-tag.h"
 #include "blk-wbt.h"
 
-/*
- * Empty set
- */
-static const struct blk_mq_ops mq_sched_tag_ops = {
-};
-
-void blk_mq_sched_free_requests(struct blk_mq_tags *tags)
-{
-	blk_mq_free_rq_map(NULL, tags, 0);
-}
-EXPORT_SYMBOL_GPL(blk_mq_sched_free_requests);
-
-struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth,
-						unsigned int numa_node)
-{
-	struct blk_mq_tag_set set = {
-		.ops		= &mq_sched_tag_ops,
-		.nr_hw_queues	= 1,
-		.queue_depth	= depth,
-		.numa_node	= numa_node,
-	};
-
-	return blk_mq_init_rq_map(&set, 0);
-}
-EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_requests);
-
-void blk_mq_sched_free_hctx_data(struct request_queue *q,
-				 void (*exit)(struct blk_mq_hw_ctx *))
-{
-	struct blk_mq_hw_ctx *hctx;
-	int i;
-
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (exit)
-			exit(hctx);
-		kfree(hctx->sched_data);
-		hctx->sched_data = NULL;
-	}
-}
-EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
-
-int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
-				int (*init)(struct blk_mq_hw_ctx *),
-				void (*exit)(struct blk_mq_hw_ctx *))
-{
-	struct blk_mq_hw_ctx *hctx;
-	int ret;
-	int i;
-
-	queue_for_each_hw_ctx(q, hctx, i) {
-		hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
-		if (!hctx->sched_data) {
-			ret = -ENOMEM;
-			goto error;
-		}
-
-		if (init) {
-			ret = init(hctx);
-			if (ret) {
-				/*
-				 * We don't want to give exit() a partially
-				 * initialized sched_data. init() must clean up
-				 * if it fails.
-				 */
-				kfree(hctx->sched_data);
-				hctx->sched_data = NULL;
-				goto error;
-			}
-		}
-	}
-
-	return 0;
-error:
-	blk_mq_sched_free_hctx_data(q, exit);
-	return ret;
-}
-EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data);
-
-struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
-						  struct blk_mq_alloc_data *data,
-						  struct blk_mq_tags *tags,
-						  atomic_t *wait_index)
-{
-	struct sbq_wait_state *ws;
-	DEFINE_WAIT(wait);
-	struct request *rq;
-	int tag;
-
-	tag = __sbitmap_queue_get(&tags->bitmap_tags);
-	if (tag != -1)
-		goto done;
-
-	if (data->flags & BLK_MQ_REQ_NOWAIT)
-		return NULL;
-
-	ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
-	do {
-		prepare_to_wait(&ws->wait, &wait, TASK_UNINTERRUPTIBLE);
-
-		tag = __sbitmap_queue_get(&tags->bitmap_tags);
-		if (tag != -1)
-			break;
-
-		blk_mq_run_hw_queue(data->hctx, false);
-
-		tag = __sbitmap_queue_get(&tags->bitmap_tags);
-		if (tag != -1)
-			break;
-
-		blk_mq_put_ctx(data->ctx);
-		io_schedule();
-
-		data->ctx = blk_mq_get_ctx(data->q);
-		data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
-		finish_wait(&ws->wait, &wait);
-		ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
-	} while (1);
-
-	finish_wait(&ws->wait, &wait);
-done:
-	rq = tags->rqs[tag];
-	rq->tag = tag;
-	rq->rq_flags = RQF_ALLOCED;
-	return rq;
-}
-EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_shadow_request);
-
-void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
-				      struct request *rq)
-{
-	WARN_ON_ONCE(!(rq->rq_flags & RQF_ALLOCED));
-	sbitmap_queue_clear(&tags->bitmap_tags, rq->tag, rq->mq_ctx->cpu);
-}
-EXPORT_SYMBOL_GPL(blk_mq_sched_free_shadow_request);
-
-static void rq_copy(struct request *rq, struct request *src)
-{
-#define FIELD_COPY(dst, src, name)	((dst)->name = (src)->name)
-	FIELD_COPY(rq, src, cpu);
-	FIELD_COPY(rq, src, cmd_type);
-	FIELD_COPY(rq, src, cmd_flags);
-	rq->rq_flags |= (src->rq_flags & (RQF_PREEMPT | RQF_QUIET | RQF_PM | RQF_DONTPREP));
-	rq->rq_flags &= ~RQF_IO_STAT;
-	FIELD_COPY(rq, src, __data_len);
-	FIELD_COPY(rq, src, __sector);
-	FIELD_COPY(rq, src, bio);
-	FIELD_COPY(rq, src, biotail);
-	FIELD_COPY(rq, src, rq_disk);
-	FIELD_COPY(rq, src, part);
-	FIELD_COPY(rq, src, issue_stat);
-	src->issue_stat.time = 0;
-	FIELD_COPY(rq, src, nr_phys_segments);
-#if defined(CONFIG_BLK_DEV_INTEGRITY)
-	FIELD_COPY(rq, src, nr_integrity_segments);
-#endif
-	FIELD_COPY(rq, src, ioprio);
-	FIELD_COPY(rq, src, timeout);
-
-	if (src->cmd_type == REQ_TYPE_BLOCK_PC) {
-		FIELD_COPY(rq, src, cmd);
-		FIELD_COPY(rq, src, cmd_len);
-		FIELD_COPY(rq, src, extra_len);
-		FIELD_COPY(rq, src, sense_len);
-		FIELD_COPY(rq, src, resid_len);
-		FIELD_COPY(rq, src, sense);
-		FIELD_COPY(rq, src, retries);
-	}
-
-	src->bio = src->biotail = NULL;
-}
-
-static void sched_rq_end_io(struct request *rq, int error)
-{
-	struct request *sched_rq = rq->end_io_data;
-
-	FIELD_COPY(sched_rq, rq, resid_len);
-	FIELD_COPY(sched_rq, rq, extra_len);
-	FIELD_COPY(sched_rq, rq, sense_len);
-	FIELD_COPY(sched_rq, rq, errors);
-	FIELD_COPY(sched_rq, rq, retries);
-
-	blk_account_io_completion(sched_rq, blk_rq_bytes(sched_rq));
-	blk_account_io_done(sched_rq);
-
-	if (sched_rq->end_io)
-		sched_rq->end_io(sched_rq, error);
-
-	blk_mq_finish_request(rq);
-}
-
 static inline struct request *
 __blk_mq_sched_alloc_request(struct blk_mq_hw_ctx *hctx)
 {
@@ -225,55 +35,6 @@ __blk_mq_sched_alloc_request(struct blk_mq_hw_ctx *hctx)
 	return rq;
 }
 
-static inline void
-__blk_mq_sched_init_request_from_shadow(struct request *rq,
-					struct request *sched_rq)
-{
-	WARN_ON_ONCE(!(sched_rq->rq_flags & RQF_ALLOCED));
-	rq_copy(rq, sched_rq);
-	rq->end_io = sched_rq_end_io;
-	rq->end_io_data = sched_rq;
-}
-
-struct request *
-blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
-				 struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
-{
-	struct request *rq, *sched_rq;
-
-	rq = __blk_mq_sched_alloc_request(hctx);
-	if (!rq)
-		return NULL;
-
-	sched_rq = get_sched_rq(hctx);
-	if (sched_rq) {
-		__blk_mq_sched_init_request_from_shadow(rq, sched_rq);
-		return rq;
-	}
-
-	/*
-	 * __blk_mq_finish_request() drops a queue ref we already hold,
-	 * so grab an extra one.
-	 */
-	blk_queue_enter_live(hctx->queue);
-	__blk_mq_finish_request(hctx, rq->mq_ctx, rq);
-	return NULL;
-}
-EXPORT_SYMBOL_GPL(blk_mq_sched_request_from_shadow);
-
-struct request *__blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
-						   struct request *sched_rq)
-{
-	struct request *rq;
-
-	rq = __blk_mq_sched_alloc_request(hctx);
-	if (rq)
-		__blk_mq_sched_init_request_from_shadow(rq, sched_rq);
-
-	return rq;
-}
-EXPORT_SYMBOL_GPL(__blk_mq_sched_request_from_shadow);
-
 static void __blk_mq_sched_assign_ioc(struct request_queue *q,
 				      struct request *rq, struct io_context *ioc)
 {
@@ -298,8 +59,8 @@ static void __blk_mq_sched_assign_ioc(struct request_queue *q,
 	rq->elv.icq = NULL;
 }
 
-static void blk_mq_sched_assign_ioc(struct request_queue *q,
-				    struct request *rq, struct bio *bio)
+void blk_mq_sched_assign_ioc(struct request_queue *q, struct request *rq,
+			     struct bio *bio)
 {
 	struct io_context *ioc;
 
@@ -308,44 +69,9 @@ static void blk_mq_sched_assign_ioc(struct request_queue *q,
 		__blk_mq_sched_assign_ioc(q, rq, ioc);
 }
 
-struct request *blk_mq_sched_get_request(struct request_queue *q,
-					 struct bio *bio,
-					 unsigned int op,
-					 struct blk_mq_alloc_data *data)
-{
-	struct elevator_queue *e = q->elevator;
-	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *ctx;
-	struct request *rq;
-
-	blk_queue_enter_live(q);
-	ctx = blk_mq_get_ctx(q);
-	hctx = blk_mq_map_queue(q, ctx->cpu);
-
-	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
-
-	if (e && e->type->ops.mq.get_request)
-		rq = e->type->ops.mq.get_request(q, op, data);
-	else
-		rq = __blk_mq_alloc_request(data, op);
-
-	if (rq) {
-		rq->elv.icq = NULL;
-		if (e && e->type->icq_cache)
-			blk_mq_sched_assign_ioc(q, rq, bio);
-		data->hctx->queued++;
-		return rq;
-	}
-
-	blk_queue_exit(q);
-	return NULL;
-}
-
 void blk_mq_sched_put_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct elevator_queue *e = q->elevator;
-	bool has_queue_ref = false, do_free = false;
 
 	wbt_done(q->rq_wb, &rq->issue_stat);
 
@@ -357,22 +83,7 @@ void blk_mq_sched_put_request(struct request *rq)
 		}
 	}
 
-	/*
-	 * If we are freeing a shadow that hasn't been started, then drop
-	 * our queue ref on it. This normally happens at IO completion
-	 * time, but if we merge request-to-request, then this 'rq' will
-	 * never get started or completed.
-	 */
-	if (blk_mq_sched_rq_is_shadow(rq) && !(rq->rq_flags & RQF_STARTED))
-		has_queue_ref = true;
-
-	if (e && e->type->ops.mq.put_request)
-		do_free = !e->type->ops.mq.put_request(rq);
-
-	if (do_free)
-		blk_mq_finish_request(rq);
-	if (has_queue_ref)
-		blk_queue_exit(q);
+	blk_mq_finish_request(rq);
 }
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 8ff37f9782e9..6b8c314b5c20 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -3,30 +3,6 @@
 
 #include "blk-mq.h"
 
-struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth, unsigned int numa_node);
-void blk_mq_sched_free_requests(struct blk_mq_tags *tags);
-
-int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
-				int (*init)(struct blk_mq_hw_ctx *),
-				void (*exit)(struct blk_mq_hw_ctx *));
-
-void blk_mq_sched_free_hctx_data(struct request_queue *q,
-				 void (*exit)(struct blk_mq_hw_ctx *));
-
-void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
-				      struct request *rq);
-struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
-						  struct blk_mq_alloc_data *data,
-						  struct blk_mq_tags *tags,
-						  atomic_t *wait_index);
-struct request *
-blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
-				 struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *));
-struct request *
-__blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
-				   struct request *sched_rq);
-
-struct request *blk_mq_sched_get_request(struct request_queue *q, struct bio *bio, unsigned int op, struct blk_mq_alloc_data *data);
 void blk_mq_sched_put_request(struct request *rq);
 
 void __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
@@ -35,6 +11,9 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
 bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
 
+void blk_mq_sched_assign_ioc(struct request_queue *q,
+			     struct request *rq, struct bio *bio);
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 
 int blk_mq_sched_init(struct request_queue *q);
@@ -109,22 +88,6 @@ blk_mq_sched_insert_requests(struct request_queue *q, struct blk_mq_ctx *ctx,
 	blk_mq_run_hw_queue(hctx, run_queue_async);
 }
 
-static inline void
-blk_mq_sched_dispatch_shadow_requests(struct blk_mq_hw_ctx *hctx,
-				      struct list_head *rq_list,
-				      struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
-{
-	do {
-		struct request *rq;
-
-		rq = blk_mq_sched_request_from_shadow(hctx, get_sched_rq);
-		if (!rq)
-			break;
-
-		list_add_tail(&rq->queuelist, rq_list);
-	} while (1);
-}
-
 static inline bool
 blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
 			 struct bio *bio)
@@ -140,11 +103,6 @@ blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
 static inline void
 blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
 {
-	struct elevator_queue *e = hctx->queue->elevator;
-
-	if (e && e->type->ops.mq.completed_request)
-		e->type->ops.mq.completed_request(hctx, rq);
-
 	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
 		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 		blk_mq_run_hw_queue(hctx, true);
@@ -179,11 +137,4 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 	return false;
 }
 
-/*
- * Returns true if this is an internal shadow request
- */
-static inline bool blk_mq_sched_rq_is_shadow(struct request *rq)
-{
-	return (rq->rq_flags & RQF_ALLOCED) != 0;
-}
 #endif
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3a19834211b2..35e1162602f5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -245,6 +245,8 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
 		unsigned int flags)
 {
 	struct blk_mq_alloc_data alloc_data;
+	struct blk_mq_ctx *ctx;
+	struct blk_mq_hw_ctx *hctx;
 	struct request *rq;
 	int ret;
 
@@ -252,13 +254,16 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
 	if (ret)
 		return ERR_PTR(ret);
 
-	rq = blk_mq_sched_get_request(q, NULL, rw, &alloc_data);
-
-	blk_mq_put_ctx(alloc_data.ctx);
-	blk_queue_exit(q);
+	ctx = blk_mq_get_ctx(q);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
+	blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
+	rq = __blk_mq_alloc_request(&alloc_data, rw);
+	blk_mq_put_ctx(ctx);
 
-	if (!rq)
+	if (!rq) {
+		blk_queue_exit(q);
 		return ERR_PTR(-EWOULDBLOCK);
+	}
 
 	rq->__data_len = 0;
 	rq->__sector = (sector_t) -1;
@@ -324,7 +329,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	const int tag = rq->tag;
 	struct request_queue *q = rq->q;
 
-	blk_mq_sched_completed_request(hctx, rq);
+	ctx->rq_completed[rq_is_sync(rq)]++;
 
 	if (rq->rq_flags & RQF_MQ_INFLIGHT)
 		atomic_dec(&hctx->nr_active);
@@ -1246,6 +1251,34 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
 	}
 }
 
+static struct request *blk_mq_get_request(struct request_queue *q,
+					  struct bio *bio,
+					  struct blk_mq_alloc_data *data)
+{
+	struct elevator_queue *e = q->elevator;
+	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_ctx *ctx;
+	struct request *rq;
+
+	blk_queue_enter_live(q);
+	ctx = blk_mq_get_ctx(q);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
+
+	trace_block_getrq(q, bio, bio->bi_opf);
+	blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
+	rq = __blk_mq_alloc_request(data, bio->bi_opf);
+
+	if (rq) {
+		rq->elv.icq = NULL;
+		if (e && e->type->icq_cache)
+			blk_mq_sched_assign_ioc(q, rq, bio);
+		data->hctx->queued++;
+		return rq;
+	}
+
+	return rq;
+}
+
 static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 {
 	struct request_queue *q = rq->q;
@@ -1328,7 +1361,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
 	trace_block_getrq(q, bio, bio->bi_opf);
 
-	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
+	rq = blk_mq_get_request(q, bio, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
 		return BLK_QC_T_NONE;
@@ -1448,7 +1481,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 
 	trace_block_getrq(q, bio, bio->bi_opf);
 
-	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
+	rq = blk_mq_get_request(q, bio, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
 		return BLK_QC_T_NONE;
@@ -1504,6 +1537,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		blk_mq_sched_insert_request(rq, false, true, true);
 		goto done;
 	}
+
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
 		/*
 		 * For a SYNC request, send it to the hardware immediately. For
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index e26c02798041..9a4039d9b4f0 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -63,8 +63,6 @@ struct deadline_data {
 
 	spinlock_t lock;
 	struct list_head dispatch;
-	struct blk_mq_tags *tags;
-	atomic_t wait_index;
 };
 
 static inline struct rb_root *
@@ -300,7 +298,13 @@ static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 static void dd_dispatch_requests(struct blk_mq_hw_ctx *hctx,
 				 struct list_head *rq_list)
 {
-	blk_mq_sched_dispatch_shadow_requests(hctx, rq_list, __dd_dispatch_request);
+	for (;;) {
+		struct request *rq = __dd_dispatch_request(hctx);
+		if (!rq)
+			break;
+
+		list_add_tail(&rq->queuelist, rq_list);
+	}
 }
 
 static void dd_exit_queue(struct elevator_queue *e)
@@ -310,7 +314,6 @@ static void dd_exit_queue(struct elevator_queue *e)
 	BUG_ON(!list_empty(&dd->fifo_list[READ]));
 	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
 
-	blk_mq_sched_free_requests(dd->tags);
 	kfree(dd);
 }
 
@@ -333,13 +336,6 @@ static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
 	}
 	eq->elevator_data = dd;
 
-	dd->tags = blk_mq_sched_alloc_requests(queue_depth, q->node);
-	if (!dd->tags) {
-		kfree(dd);
-		kobject_put(&eq->kobj);
-		return -ENOMEM;
-	}
-
 	INIT_LIST_HEAD(&dd->fifo_list[READ]);
 	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
 	dd->sort_list[READ] = RB_ROOT;
@@ -351,7 +347,6 @@ static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
 	dd->fifo_batch = fifo_batch;
 	spin_lock_init(&dd->lock);
 	INIT_LIST_HEAD(&dd->dispatch);
-	atomic_set(&dd->wait_index, 0);
 
 	q->elevator = eq;
 	return 0;
@@ -409,11 +404,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	blk_mq_sched_request_inserted(rq);
 
 	/*
-	 * If we're trying to insert a real request, just send it directly
-	 * to the hardware dispatch list. This only happens for a requeue,
-	 * or FUA/FLUSH requests.
+	 * Send FUA and FLUSH requests directly to the hardware dispatch list.
+	 * To do: also send requeued requests directly to the hw disp list.
 	 */
-	if (!blk_mq_sched_rq_is_shadow(rq)) {
+	if (rq->cmd_flags & (REQ_PREFLUSH | REQ_FUA)) {
 		spin_lock(&hctx->lock);
 		list_add_tail(&rq->queuelist, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
@@ -459,67 +453,6 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 	spin_unlock(&dd->lock);
 }
 
-static struct request *dd_get_request(struct request_queue *q, unsigned int op,
-				      struct blk_mq_alloc_data *data)
-{
-	struct deadline_data *dd = q->elevator->elevator_data;
-	struct request *rq;
-
-	/*
-	 * The flush machinery intercepts before we insert the request. As
-	 * a work-around, just hand it back a real request.
-	 */
-	if (unlikely(op & (REQ_PREFLUSH | REQ_FUA)))
-		rq = __blk_mq_alloc_request(data, op);
-	else {
-		rq = blk_mq_sched_alloc_shadow_request(q, data, dd->tags, &dd->wait_index);
-		if (rq)
-			blk_mq_rq_ctx_init(q, data->ctx, rq, op);
-	}
-
-	return rq;
-}
-
-static bool dd_put_request(struct request *rq)
-{
-	/*
-	 * If it's a real request, we just have to free it. Return false
-	 * to say we didn't handle it, and blk_mq_sched will take care of that.
-	 */
-	if (!blk_mq_sched_rq_is_shadow(rq))
-		return false;
-
-	if (!(rq->rq_flags & RQF_STARTED)) {
-		struct request_queue *q = rq->q;
-		struct deadline_data *dd = q->elevator->elevator_data;
-
-		/*
-		 * IO completion would normally do this, but if we merge
-		 * and free before we issue the request, we need to free
-		 * the shadow tag here.
-		 */
-		blk_mq_sched_free_shadow_request(dd->tags, rq);
-	}
-
-	return true;
-}
-
-static void dd_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
-{
-	struct request *sched_rq = rq->end_io_data;
-
-	/*
-	 * sched_rq can be NULL, if we haven't setup the shadow yet
-	 * because we failed getting one.
-	 */
-	if (sched_rq) {
-		struct deadline_data *dd = hctx->queue->elevator->elevator_data;
-
-		blk_mq_sched_free_shadow_request(dd->tags, sched_rq);
-		blk_mq_start_stopped_hw_queue(hctx, true);
-	}
-}
-
 static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
@@ -601,11 +534,8 @@ static struct elv_fs_entry deadline_attrs[] = {
 
 static struct elevator_type mq_deadline = {
 	.ops.mq = {
-		.get_request		= dd_get_request,
-		.put_request		= dd_put_request,
 		.insert_requests	= dd_insert_requests,
 		.dispatch_requests	= dd_dispatch_requests,
-		.completed_request	= dd_completed_request,
 		.next_request		= elv_rb_latter_request,
 		.former_request		= elv_rb_former_request,
 		.bio_merge		= dd_bio_merge,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 64224d39d707..312e6d3e89fa 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -89,12 +89,9 @@ struct elevator_mq_ops {
 	int (*request_merge)(struct request_queue *q, struct request **, struct bio *);
 	void (*request_merged)(struct request_queue *, struct request *, int);
 	void (*requests_merged)(struct request_queue *, struct request *, struct request *);
-	struct request *(*get_request)(struct request_queue *, unsigned int, struct blk_mq_alloc_data *);
-	bool (*put_request)(struct request *);
 	void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
 	void (*dispatch_requests)(struct blk_mq_hw_ctx *, struct list_head *);
 	bool (*has_work)(struct blk_mq_hw_ctx *);
-	void (*completed_request)(struct blk_mq_hw_ctx *, struct request *);
 	void (*started_request)(struct request *);
 	void (*requeue_request)(struct request *);
 	struct request *(*former_request)(struct request_queue *, struct request *);
-- 
2.11.0


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0002-blk-mq-Make-the-blk_mq_-get-put-_tag-callers-specify.patch --]
[-- Type: text/x-patch; name="0002-blk-mq-Make-the-blk_mq_-get-put-_tag-callers-specify.patch", Size: 9820 bytes --]

From ae72bb9f67d01b3a02cee80c81a712f775d13c32 Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bart.vanassche@sandisk.com>
Date: Tue, 20 Dec 2016 12:00:47 +0100
Subject: [PATCH 2/3] blk-mq: Make the blk_mq_{get,put}_tag() callers specify
 the tag set

This patch does not change any functionality.
---
 block/blk-mq-tag.c | 29 ++++++++++----------
 block/blk-mq-tag.h |  7 +++--
 block/blk-mq.c     | 80 ++++++++++++++++++++++++++++++++++++------------------
 block/blk-mq.h     |  9 ++++--
 4 files changed, 77 insertions(+), 48 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index dcf5ce3ba4bf..890d634db0ee 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -156,47 +156,46 @@ static int bt_get(struct blk_mq_alloc_data *data, struct sbitmap_queue *bt,
 	return tag;
 }
 
-static unsigned int __blk_mq_get_tag(struct blk_mq_alloc_data *data)
+static unsigned int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
+				     struct blk_mq_tags *tags)
 {
 	int tag;
 
-	tag = bt_get(data, &data->hctx->tags->bitmap_tags, data->hctx,
-		     data->hctx->tags);
+	tag = bt_get(data, &tags->bitmap_tags, data->hctx, tags);
 	if (tag >= 0)
-		return tag + data->hctx->tags->nr_reserved_tags;
+		return tag + tags->nr_reserved_tags;
 
 	return BLK_MQ_TAG_FAIL;
 }
 
-static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_alloc_data *data)
+static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_alloc_data *data,
+					      struct blk_mq_tags *tags)
 {
 	int tag;
 
-	if (unlikely(!data->hctx->tags->nr_reserved_tags)) {
+	if (unlikely(!tags->nr_reserved_tags)) {
 		WARN_ON_ONCE(1);
 		return BLK_MQ_TAG_FAIL;
 	}
 
-	tag = bt_get(data, &data->hctx->tags->breserved_tags, NULL,
-		     data->hctx->tags);
+	tag = bt_get(data, &tags->breserved_tags, NULL, tags);
 	if (tag < 0)
 		return BLK_MQ_TAG_FAIL;
 
 	return tag;
 }
 
-unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
+unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data,
+			    struct blk_mq_tags *tags)
 {
 	if (data->flags & BLK_MQ_REQ_RESERVED)
-		return __blk_mq_get_reserved_tag(data);
-	return __blk_mq_get_tag(data);
+		return __blk_mq_get_reserved_tag(data, tags);
+	return __blk_mq_get_tag(data, tags);
 }
 
-void blk_mq_put_tag(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
-		    unsigned int tag)
+void blk_mq_put_tag(struct blk_mq_hw_ctx *hctx, struct blk_mq_tags *tags,
+		    struct blk_mq_ctx *ctx, unsigned int tag)
 {
-	struct blk_mq_tags *tags = hctx->tags;
-
 	if (tag >= tags->nr_reserved_tags) {
 		const int real_tag = tag - tags->nr_reserved_tags;
 
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index d1662734dc53..84186a11d2e0 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -23,9 +23,10 @@ struct blk_mq_tags {
 extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
 extern void blk_mq_free_tags(struct blk_mq_tags *tags);
 
-extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
-extern void blk_mq_put_tag(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
-			   unsigned int tag);
+extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data,
+				   struct blk_mq_tags *tags);
+extern void blk_mq_put_tag(struct blk_mq_hw_ctx *hctx, struct blk_mq_tags *tags,
+			   struct blk_mq_ctx *ctx, unsigned int tag);
 extern bool blk_mq_has_free_tags(struct blk_mq_tags *tags);
 extern ssize_t blk_mq_tag_sysfs_show(struct blk_mq_tags *tags, char *page);
 extern int blk_mq_tag_update_depth(struct blk_mq_tags *tags, unsigned int depth);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 35e1162602f5..b68b7fc43e46 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -220,12 +220,13 @@ EXPORT_SYMBOL_GPL(blk_mq_rq_ctx_init);
 struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
 				       unsigned int op)
 {
+	struct blk_mq_tags *tags = data->hctx->tags;
 	struct request *rq;
 	unsigned int tag;
 
-	tag = blk_mq_get_tag(data);
+	tag = blk_mq_get_tag(data, tags);
 	if (tag != BLK_MQ_TAG_FAIL) {
-		rq = data->hctx->tags->rqs[tag];
+		rq = tags->rqs[tag];
 
 		if (blk_mq_tag_busy(data->hctx)) {
 			rq->rq_flags = RQF_MQ_INFLIGHT;
@@ -339,7 +340,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 
 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
 	clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
-	blk_mq_put_tag(hctx, ctx, tag);
+	blk_mq_put_tag(hctx, hctx->tags, ctx, tag);
 	blk_queue_exit(q);
 }
 
@@ -1554,8 +1555,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	return cookie;
 }
 
-void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
-			unsigned int hctx_idx)
+void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+		     unsigned int hctx_idx)
 {
 	struct page *page;
 
@@ -1581,23 +1582,19 @@ void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		kmemleak_free(page_address(page));
 		__free_pages(page, page->private);
 	}
+}
 
+void blk_mq_free_rq_map(struct blk_mq_tags *tags)
+{
 	kfree(tags->rqs);
 
 	blk_mq_free_tags(tags);
 }
 
-static size_t order_to_size(unsigned int order)
-{
-	return (size_t)PAGE_SIZE << order;
-}
-
-struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
-				       unsigned int hctx_idx)
+struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
+					unsigned int hctx_idx)
 {
 	struct blk_mq_tags *tags;
-	unsigned int i, j, entries_per_page, max_order = 4;
-	size_t rq_size, left;
 
 	tags = blk_mq_init_tags(set->queue_depth, set->reserved_tags,
 				set->numa_node,
@@ -1605,8 +1602,6 @@ struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 	if (!tags)
 		return NULL;
 
-	INIT_LIST_HEAD(&tags->page_list);
-
 	tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
 				 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 				 set->numa_node);
@@ -1615,6 +1610,22 @@ struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 		return NULL;
 	}
 
+	return tags;
+}
+
+static size_t order_to_size(unsigned int order)
+{
+	return (size_t)PAGE_SIZE << order;
+}
+
+int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+		     unsigned int hctx_idx)
+{
+	unsigned int i, j, entries_per_page, max_order = 4;
+	size_t rq_size, left;
+
+	INIT_LIST_HEAD(&tags->page_list);
+
 	/*
 	 * rq_size is the size of the request plus driver payload, rounded
 	 * to the cacheline size
@@ -1674,11 +1685,11 @@ struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 			i++;
 		}
 	}
-	return tags;
+	return 0;
 
 fail:
-	blk_mq_free_rq_map(set, tags, hctx_idx);
-	return NULL;
+	blk_mq_free_rqs(set, tags, hctx_idx);
+	return -ENOMEM;
 }
 
 /*
@@ -1899,7 +1910,13 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 		hctx_idx = q->mq_map[i];
 		/* unmapped hw queue can be remapped after CPU topo changed */
 		if (!set->tags[hctx_idx]) {
-			set->tags[hctx_idx] = blk_mq_init_rq_map(set, hctx_idx);
+			set->tags[hctx_idx] = blk_mq_alloc_rq_map(set,
+								  hctx_idx);
+			if (blk_mq_alloc_rqs(set, set->tags[hctx_idx],
+					     hctx_idx) < 0) {
+				blk_mq_free_rq_map(set->tags[hctx_idx]);
+				set->tags[hctx_idx] = NULL;
+			}
 
 			/*
 			 * If tags initialization fail for some hctx,
@@ -1932,7 +1949,8 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 			 * allocation
 			 */
 			if (i && set->tags[i]) {
-				blk_mq_free_rq_map(set, set->tags[i], i);
+				blk_mq_free_rqs(set, set->tags[i], i);
+				blk_mq_free_rq_map(set->tags[i]);
 				set->tags[i] = NULL;
 			}
 			hctx->tags = NULL;
@@ -2102,7 +2120,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 
 		if (hctx) {
 			if (hctx->tags) {
-				blk_mq_free_rq_map(set, hctx->tags, j);
+				blk_mq_free_rqs(set, set->tags[j], j);
+				blk_mq_free_rq_map(hctx->tags);
 				set->tags[j] = NULL;
 			}
 			blk_mq_exit_hctx(q, set, hctx, j);
@@ -2304,16 +2323,21 @@ static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 	int i;
 
 	for (i = 0; i < set->nr_hw_queues; i++) {
-		set->tags[i] = blk_mq_init_rq_map(set, i);
+		set->tags[i] = blk_mq_alloc_rq_map(set, i);
 		if (!set->tags[i])
 			goto out_unwind;
+		if (blk_mq_alloc_rqs(set, set->tags[i], i) < 0)
+			goto free_rq_map;
 	}
 
 	return 0;
 
 out_unwind:
-	while (--i >= 0)
-		blk_mq_free_rq_map(set, set->tags[i], i);
+	while (--i >= 0) {
+		blk_mq_free_rqs(set, set->tags[i], i);
+free_rq_map:
+		blk_mq_free_rq_map(set->tags[i]);
+	}
 
 	return -ENOMEM;
 }
@@ -2438,8 +2462,10 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 	int i;
 
 	for (i = 0; i < nr_cpu_ids; i++) {
-		if (set->tags[i])
-			blk_mq_free_rq_map(set, set->tags[i], i);
+		if (set->tags[i]) {
+			blk_mq_free_rqs(set, set->tags[i], i);
+			blk_mq_free_rq_map(set->tags[i]);
+		}
 	}
 
 	kfree(set->mq_map);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 898c3c9a60ec..2e98dd8ccee2 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -37,10 +37,13 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 /*
  * Internal helpers for allocating/freeing the request map
  */
-void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
-			unsigned int hctx_idx);
-struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
+void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+		     unsigned int hctx_idx);
+void blk_mq_free_rq_map(struct blk_mq_tags *tags);
+struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 					unsigned int hctx_idx);
+int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+		     unsigned int hctx_idx);
 
 /*
  * Internal helpers for request insertion into sw queues
-- 
2.11.0


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #4: 0003-blk-mq-Split-driver-and-scheduler-tags.patch --]
[-- Type: text/x-patch; name="0003-blk-mq-Split-driver-and-scheduler-tags.patch", Size: 14364 bytes --]

From c49ec4e8b0e4135a87c9894597901539f3e3ca08 Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bart.vanassche@sandisk.com>
Date: Wed, 21 Dec 2016 12:39:33 +0100
Subject: [PATCH 3/3] blk-mq: Split driver and scheduler tags

Add 'sched_tags' next to 'tags' in struct blk_mq_hw_ctx and also
in struct blk_mq_tag_set. Add 'sched_tag' next to 'tag' in struct
request. Modify blk_mq_update_nr_requests() such that it accepts
values larger than the queue depth. Make __blk_mq_free_request()
free both tags. Make blk_mq_alloc_tag_set() allocate both tag sets.
Make blk_mq_free_tag_set() free both tag sets. Make
blk_mq_dispatch_rq_list() allocate the driver tag.
---
 block/blk-flush.c      |   9 ++-
 block/blk-mq.c         | 160 +++++++++++++++++++++++++++++++++++--------------
 block/blk-mq.h         |   5 +-
 block/blk-tag.c        |   1 +
 include/linux/blk-mq.h |   2 +
 include/linux/blkdev.h |   1 +
 6 files changed, 129 insertions(+), 49 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 6a7c29d2eb3c..46d12bbfde85 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -170,6 +170,8 @@ static bool blk_flush_complete_seq(struct request *rq,
 	struct list_head *pending = &fq->flush_queue[fq->flush_pending_idx];
 	bool queued = false, kicked;
 
+	BUG_ON(rq->tag < 0);
+
 	BUG_ON(rq->flush.seq & seq);
 	rq->flush.seq |= seq;
 
@@ -319,6 +321,8 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq)
 	if (q->mq_ops) {
 		struct blk_mq_hw_ctx *hctx;
 
+		BUG_ON(first_rq->tag < 0);
+
 		flush_rq->mq_ctx = first_rq->mq_ctx;
 		flush_rq->tag = first_rq->tag;
 		fq->orig_rq = first_rq;
@@ -452,8 +456,9 @@ void blk_insert_flush(struct request *rq)
 	 * processed directly without going through flush machinery.  Queue
 	 * for normal execution.
 	 */
-	if ((policy & REQ_FSEQ_DATA) &&
-	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
+	if (((policy & REQ_FSEQ_DATA) &&
+	     !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) ||
+	    (q->mq_ops && blk_mq_assign_drv_tag(rq) < 0)) {
 		if (q->mq_ops)
 			blk_mq_sched_insert_request(rq, false, true, false);
 		else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b68b7fc43e46..48d7968d4ed9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -220,20 +220,21 @@ EXPORT_SYMBOL_GPL(blk_mq_rq_ctx_init);
 struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
 				       unsigned int op)
 {
-	struct blk_mq_tags *tags = data->hctx->tags;
+	struct blk_mq_tags *tags = data->hctx->sched_tags;
 	struct request *rq;
-	unsigned int tag;
+	unsigned int sched_tag;
 
-	tag = blk_mq_get_tag(data, tags);
-	if (tag != BLK_MQ_TAG_FAIL) {
-		rq = tags->rqs[tag];
+	sched_tag = blk_mq_get_tag(data, tags);
+	if (sched_tag != BLK_MQ_TAG_FAIL) {
+		rq = tags->rqs[sched_tag];
+		rq->tag = -1;
 
 		if (blk_mq_tag_busy(data->hctx)) {
 			rq->rq_flags = RQF_MQ_INFLIGHT;
 			atomic_inc(&data->hctx->nr_active);
 		}
 
-		rq->tag = tag;
+		rq->sched_tag = sched_tag;
 		blk_mq_rq_ctx_init(data->q, data->ctx, rq, op);
 		return rq;
 	}
@@ -328,6 +329,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 			     struct request *rq)
 {
 	const int tag = rq->tag;
+	const int sched_tag = rq->sched_tag;
 	struct request_queue *q = rq->q;
 
 	ctx->rq_completed[rq_is_sync(rq)]++;
@@ -340,7 +342,13 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 
 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
 	clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
-	blk_mq_put_tag(hctx, hctx->tags, ctx, tag);
+	if (tag >= 0) {
+		WARN_ON_ONCE(hctx->tags->rqs[tag] != rq);
+		hctx->tags->rqs[tag] = NULL;
+		blk_mq_put_tag(hctx, hctx->tags, ctx, tag);
+	}
+	if (sched_tag >= 0)
+		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
 	blk_queue_exit(q);
 }
 
@@ -844,6 +852,26 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
+int blk_mq_assign_drv_tag(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
+	struct blk_mq_alloc_data data = {
+		.q = rq->q,
+		.ctx = rq->mq_ctx,
+		.hctx = hctx,
+	};
+
+	rq->tag = blk_mq_get_tag(&data, hctx->tags);
+	if (rq->tag < 0)
+		goto out;
+	WARN_ON_ONCE(hctx->tags->rqs[rq->tag]);
+	hctx->tags->rqs[rq->tag] = rq;
+
+out:
+	return rq->tag;
+}
+
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 {
 	struct request_queue *q = hctx->queue;
@@ -866,6 +894,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 		struct blk_mq_queue_data bd;
 
 		rq = list_first_entry(list, struct request, queuelist);
+		if (rq->tag < 0 && blk_mq_assign_drv_tag(rq) < 0)
+			break;
 		list_del_init(&rq->queuelist);
 
 		bd.rq = rq;
@@ -1296,7 +1326,8 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 		goto insert;
 
 	hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-	if (blk_mq_hctx_stopped(hctx))
+	if (blk_mq_hctx_stopped(hctx) ||
+	    (rq->tag < 0 && blk_mq_assign_drv_tag(rq) < 0))
 		goto insert;
 
 	new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
@@ -1592,17 +1623,19 @@ void blk_mq_free_rq_map(struct blk_mq_tags *tags)
 }
 
 struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
-					unsigned int hctx_idx)
+					unsigned int hctx_idx,
+					unsigned int nr_tags,
+					unsigned int reserved_tags)
 {
 	struct blk_mq_tags *tags;
 
-	tags = blk_mq_init_tags(set->queue_depth, set->reserved_tags,
+	tags = blk_mq_init_tags(nr_tags, reserved_tags,
 				set->numa_node,
 				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
 	if (!tags)
 		return NULL;
 
-	tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
+	tags->rqs = kzalloc_node(nr_tags * sizeof(struct request *),
 				 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 				 set->numa_node);
 	if (!tags->rqs) {
@@ -1800,6 +1833,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
+	hctx->sched_tags = set->sched_tags[hctx_idx];
 
 	/*
 	 * Allocate space for all possible cpus to avoid allocation at
@@ -1881,6 +1915,38 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 	}
 }
 
+static void __blk_mq_free_rq_map_i(struct blk_mq_tag_set *set, int hctx_idx)
+{
+	if (set->sched_tags[hctx_idx]) {
+		blk_mq_free_rqs(set, set->sched_tags[hctx_idx], hctx_idx);
+		blk_mq_free_rq_map(set->sched_tags[hctx_idx]);
+		set->sched_tags[hctx_idx] = NULL;
+	}
+	if (set->tags[hctx_idx]) {
+		blk_mq_free_rq_map(set->tags[hctx_idx]);
+		set->tags[hctx_idx] = NULL;
+	}
+}
+
+static bool __blk_mq_alloc_rq_map_i(struct blk_mq_tag_set *set, int hctx_idx,
+				    unsigned int nr_requests)
+{
+	int ret = 0;
+
+	set->tags[hctx_idx] = blk_mq_alloc_rq_map(set, hctx_idx,
+					set->queue_depth, set->reserved_tags);
+	set->sched_tags[hctx_idx] = blk_mq_alloc_rq_map(set, hctx_idx,
+							nr_requests, 0);
+	if (set->sched_tags[hctx_idx])
+		ret = blk_mq_alloc_rqs(set, set->sched_tags[hctx_idx],
+				       hctx_idx);
+	if (!set->tags[hctx_idx] || !set->sched_tags[hctx_idx] || ret < 0) {
+		__blk_mq_free_rq_map_i(set, hctx_idx);
+		return false;
+	}
+	return true;
+}
+
 static void blk_mq_map_swqueue(struct request_queue *q,
 			       const struct cpumask *online_mask)
 {
@@ -1909,23 +1975,15 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 
 		hctx_idx = q->mq_map[i];
 		/* unmapped hw queue can be remapped after CPU topo changed */
-		if (!set->tags[hctx_idx]) {
-			set->tags[hctx_idx] = blk_mq_alloc_rq_map(set,
-								  hctx_idx);
-			if (blk_mq_alloc_rqs(set, set->tags[hctx_idx],
-					     hctx_idx) < 0) {
-				blk_mq_free_rq_map(set->tags[hctx_idx]);
-				set->tags[hctx_idx] = NULL;
-			}
-
+		if (!set->tags[hctx_idx] &&
+		    !__blk_mq_alloc_rq_map_i(set, hctx_idx, q->nr_requests)) {
 			/*
 			 * If tags initialization fail for some hctx,
 			 * that hctx won't be brought online.  In this
 			 * case, remap the current ctx to hctx[0] which
 			 * is guaranteed to always have tags allocated
 			 */
-			if (!set->tags[hctx_idx])
-				q->mq_map[i] = 0;
+			q->mq_map[i] = 0;
 		}
 
 		ctx = per_cpu_ptr(q->queue_ctx, i);
@@ -2318,26 +2376,20 @@ static int blk_mq_queue_reinit_prepare(unsigned int cpu)
 	return 0;
 }
 
-static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
+static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set,
+				  unsigned int nr_requests)
 {
 	int i;
 
-	for (i = 0; i < set->nr_hw_queues; i++) {
-		set->tags[i] = blk_mq_alloc_rq_map(set, i);
-		if (!set->tags[i])
+	for (i = 0; i < set->nr_hw_queues; i++)
+		if (!__blk_mq_alloc_rq_map_i(set, i, nr_requests))
 			goto out_unwind;
-		if (blk_mq_alloc_rqs(set, set->tags[i], i) < 0)
-			goto free_rq_map;
-	}
 
 	return 0;
 
 out_unwind:
-	while (--i >= 0) {
-		blk_mq_free_rqs(set, set->tags[i], i);
-free_rq_map:
-		blk_mq_free_rq_map(set->tags[i]);
-	}
+	while (--i >= 0)
+		__blk_mq_free_rq_map_i(set, i);
 
 	return -ENOMEM;
 }
@@ -2347,14 +2399,15 @@ static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
  * may reduce the depth asked for, if memory is tight. set->queue_depth
  * will be updated to reflect the allocated depth.
  */
-static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
+static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set,
+				unsigned int nr_requests)
 {
 	unsigned int depth;
 	int err;
 
 	depth = set->queue_depth;
 	do {
-		err = __blk_mq_alloc_rq_maps(set);
+		err = __blk_mq_alloc_rq_maps(set, nr_requests);
 		if (!err)
 			break;
 
@@ -2385,7 +2438,7 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
  */
 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 {
-	int ret;
+	int ret = -ENOMEM;
 
 	BUILD_BUG_ON(BLK_MQ_MAX_DEPTH > 1 << BLK_MQ_UNIQUE_TAG_BITS);
 
@@ -2425,32 +2478,39 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->tags)
 		return -ENOMEM;
 
-	ret = -ENOMEM;
+	set->sched_tags = kzalloc_node(nr_cpu_ids * sizeof(struct blk_mq_tags *),
+				       GFP_KERNEL, set->numa_node);
+	if (!set->sched_tags)
+		goto free_drv_tags;
+
 	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
 			GFP_KERNEL, set->numa_node);
 	if (!set->mq_map)
-		goto out_free_tags;
+		goto free_sched_tags;
 
 	if (set->ops->map_queues)
 		ret = set->ops->map_queues(set);
 	else
 		ret = blk_mq_map_queues(set);
 	if (ret)
-		goto out_free_mq_map;
+		goto free_mq_map;
 
-	ret = blk_mq_alloc_rq_maps(set);
+	ret = blk_mq_alloc_rq_maps(set, set->queue_depth/*q->nr_requests*/);
 	if (ret)
-		goto out_free_mq_map;
+		goto free_mq_map;
 
 	mutex_init(&set->tag_list_lock);
 	INIT_LIST_HEAD(&set->tag_list);
 
 	return 0;
 
-out_free_mq_map:
+free_mq_map:
 	kfree(set->mq_map);
 	set->mq_map = NULL;
-out_free_tags:
+free_sched_tags:
+	kfree(set->sched_tags);
+	set->sched_tags = NULL;
+free_drv_tags:
 	kfree(set->tags);
 	set->tags = NULL;
 	return ret;
@@ -2465,12 +2525,16 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 		if (set->tags[i]) {
 			blk_mq_free_rqs(set, set->tags[i], i);
 			blk_mq_free_rq_map(set->tags[i]);
+			blk_mq_free_rq_map(set->sched_tags[i]);
 		}
 	}
 
 	kfree(set->mq_map);
 	set->mq_map = NULL;
 
+	kfree(set->sched_tags);
+	set->sched_tags = NULL;
+
 	kfree(set->tags);
 	set->tags = NULL;
 }
@@ -2482,14 +2546,18 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	struct blk_mq_hw_ctx *hctx;
 	int i, ret;
 
-	if (!set || nr > set->queue_depth)
+	if (!set)
 		return -EINVAL;
 
 	ret = 0;
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx->tags)
 			continue;
-		ret = blk_mq_tag_update_depth(hctx->tags, nr);
+		ret = blk_mq_tag_update_depth(hctx->tags,
+					      min(nr, set->queue_depth));
+		if (ret)
+			break;
+		ret = blk_mq_tag_update_depth(hctx->sched_tags, nr);
 		if (ret)
 			break;
 	}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 2e98dd8ccee2..0368c513c2ab 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -31,6 +31,7 @@ void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
+int blk_mq_assign_drv_tag(struct request *rq);
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 
@@ -41,7 +42,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx);
 void blk_mq_free_rq_map(struct blk_mq_tags *tags);
 struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
-					unsigned int hctx_idx);
+					unsigned int hctx_idx,
+					unsigned int nr_tags,
+					unsigned int reserved_tags);
 int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx);
 
diff --git a/block/blk-tag.c b/block/blk-tag.c
index bae1decb6ec3..319a3a3eb1d7 100644
--- a/block/blk-tag.c
+++ b/block/blk-tag.c
@@ -272,6 +272,7 @@ void blk_queue_end_tag(struct request_queue *q, struct request *rq)
 	list_del_init(&rq->queuelist);
 	rq->rq_flags &= ~RQF_QUEUED;
 	rq->tag = -1;
+	rq->sched_tag = -1;
 
 	if (unlikely(bqt->tag_index[tag] == NULL))
 		printk(KERN_ERR "%s: tag %d is missing\n",
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 9255ccb043f2..377594bcda8d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -36,6 +36,7 @@ struct blk_mq_hw_ctx {
 	atomic_t		wait_index;
 
 	struct blk_mq_tags	*tags;
+	struct blk_mq_tags	*sched_tags;
 
 	struct srcu_struct	queue_rq_srcu;
 
@@ -72,6 +73,7 @@ struct blk_mq_tag_set {
 	void			*driver_data;
 
 	struct blk_mq_tags	**tags;
+	struct blk_mq_tags	**sched_tags;
 
 	struct mutex		tag_list_lock;
 	struct list_head	tag_list;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7c40fb838b44..112b57bce9e9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -223,6 +223,7 @@ struct request {
 	void *special;		/* opaque pointer available for LLD use */
 
 	int tag;
+	int sched_tag;
 	int errors;
 
 	/*
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-17  0:12 ` [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
                     ` (2 preceding siblings ...)
  2016-12-22 16:07   ` Paolo Valente
@ 2016-12-22 16:49   ` Paolo Valente
  2017-01-17  2:47     ` Jens Axboe
  2017-01-20 13:14   ` Paolo Valente
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2016-12-22 16:49 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
> 
> This is basically identical to deadline-iosched, except it registers
> as a MQ capable scheduler. This is still a single queue design.
> 

One last question (for today ...): in mq-deadline there are no
"schedule dispatch" or "unplug work" functions.  In legacy blk, CFQ and
BFQ perform these schedules/unplugs in a lot of cases.  What's the right
replacement?  Just doing nothing?

Thanks,
Paolo

> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---
> block/Kconfig.iosched |   6 +
> block/Makefile        |   1 +
> block/mq-deadline.c   | 649 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 656 insertions(+)
> create mode 100644 block/mq-deadline.c
> 
> diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
> index 421bef9c4c48..490ef2850fae 100644
> --- a/block/Kconfig.iosched
> +++ b/block/Kconfig.iosched
> @@ -32,6 +32,12 @@ config IOSCHED_CFQ
> 
> 	  This is the default I/O scheduler.
> 
> +config MQ_IOSCHED_DEADLINE
> +	tristate "MQ deadline I/O scheduler"
> +	default y
> +	---help---
> +	  MQ version of the deadline IO scheduler.
> +
> config CFQ_GROUP_IOSCHED
> 	bool "CFQ Group Scheduling support"
> 	depends on IOSCHED_CFQ && BLK_CGROUP
> diff --git a/block/Makefile b/block/Makefile
> index 2eee9e1bb6db..3ee0abd7205a 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
> obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
> obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
> obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
> +obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
> 
> obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
> obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> new file mode 100644
> index 000000000000..3cb9de21ab21
> --- /dev/null
> +++ b/block/mq-deadline.c
> @@ -0,0 +1,649 @@
> +/*
> + *  MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
> + *  for the blk-mq scheduling framework
> + *
> + *  Copyright (C) 2016 Jens Axboe <axboe@kernel.dk>
> + */
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/blkdev.h>
> +#include <linux/blk-mq.h>
> +#include <linux/elevator.h>
> +#include <linux/bio.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/init.h>
> +#include <linux/compiler.h>
> +#include <linux/rbtree.h>
> +#include <linux/sbitmap.h>
> +
> +#include "blk.h"
> +#include "blk-mq.h"
> +#include "blk-mq-tag.h"
> +#include "blk-mq-sched.h"
> +
> +static unsigned int queue_depth = 256;
> +module_param(queue_depth, uint, 0644);
> +MODULE_PARM_DESC(queue_depth, "Use this value as the scheduler queue depth");
> +
> +/*
> + * See Documentation/block/deadline-iosched.txt
> + */
> +static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
> +static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
> +static const int writes_starved = 2;    /* max times reads can starve a write */
> +static const int fifo_batch = 16;       /* # of sequential requests treated as one
> +				     by the above parameters. For throughput. */
> +
> +struct deadline_data {
> +	/*
> +	 * run time data
> +	 */
> +
> +	/*
> +	 * requests (deadline_rq s) are present on both sort_list and fifo_list
> +	 */
> +	struct rb_root sort_list[2];
> +	struct list_head fifo_list[2];
> +
> +	/*
> +	 * next in sort order. read, write or both are NULL
> +	 */
> +	struct request *next_rq[2];
> +	unsigned int batching;		/* number of sequential requests made */
> +	unsigned int starved;		/* times reads have starved writes */
> +
> +	/*
> +	 * settings that change how the i/o scheduler behaves
> +	 */
> +	int fifo_expire[2];
> +	int fifo_batch;
> +	int writes_starved;
> +	int front_merges;
> +
> +	spinlock_t lock;
> +	struct list_head dispatch;
> +	struct blk_mq_tags *tags;
> +	atomic_t wait_index;
> +};
> +
> +static inline struct rb_root *
> +deadline_rb_root(struct deadline_data *dd, struct request *rq)
> +{
> +	return &dd->sort_list[rq_data_dir(rq)];
> +}
> +
> +/*
> + * get the request after `rq' in sector-sorted order
> + */
> +static inline struct request *
> +deadline_latter_request(struct request *rq)
> +{
> +	struct rb_node *node = rb_next(&rq->rb_node);
> +
> +	if (node)
> +		return rb_entry_rq(node);
> +
> +	return NULL;
> +}
> +
> +static void
> +deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
> +{
> +	struct rb_root *root = deadline_rb_root(dd, rq);
> +
> +	elv_rb_add(root, rq);
> +}
> +
> +static inline void
> +deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
> +{
> +	const int data_dir = rq_data_dir(rq);
> +
> +	if (dd->next_rq[data_dir] == rq)
> +		dd->next_rq[data_dir] = deadline_latter_request(rq);
> +
> +	elv_rb_del(deadline_rb_root(dd, rq), rq);
> +}
> +
> +/*
> + * remove rq from rbtree and fifo.
> + */
> +static void deadline_remove_request(struct request_queue *q, struct request *rq)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	list_del_init(&rq->queuelist);
> +
> +	/*
> +	 * We might not be on the rbtree, if we are doing an insert merge
> +	 */
> +	if (!RB_EMPTY_NODE(&rq->rb_node))
> +		deadline_del_rq_rb(dd, rq);
> +
> +	elv_rqhash_del(q, rq);
> +	if (q->last_merge == rq)
> +		q->last_merge = NULL;
> +}
> +
> +static void dd_request_merged(struct request_queue *q, struct request *req,
> +			      int type)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	/*
> +	 * if the merge was a front merge, we need to reposition request
> +	 */
> +	if (type == ELEVATOR_FRONT_MERGE) {
> +		elv_rb_del(deadline_rb_root(dd, req), req);
> +		deadline_add_rq_rb(dd, req);
> +	}
> +}
> +
> +static void dd_merged_requests(struct request_queue *q, struct request *req,
> +			       struct request *next)
> +{
> +	/*
> +	 * if next expires before rq, assign its expire time to rq
> +	 * and move into next position (next will be deleted) in fifo
> +	 */
> +	if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
> +		if (time_before((unsigned long)next->fifo_time,
> +				(unsigned long)req->fifo_time)) {
> +			list_move(&req->queuelist, &next->queuelist);
> +			req->fifo_time = next->fifo_time;
> +		}
> +	}
> +
> +	/*
> +	 * kill knowledge of next, this one is a goner
> +	 */
> +	deadline_remove_request(q, next);
> +}
> +
> +/*
> + * move an entry to dispatch queue
> + */
> +static void
> +deadline_move_request(struct deadline_data *dd, struct request *rq)
> +{
> +	const int data_dir = rq_data_dir(rq);
> +
> +	dd->next_rq[READ] = NULL;
> +	dd->next_rq[WRITE] = NULL;
> +	dd->next_rq[data_dir] = deadline_latter_request(rq);
> +
> +	/*
> +	 * take it off the sort and fifo list
> +	 */
> +	deadline_remove_request(rq->q, rq);
> +}
> +
> +/*
> + * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
> + * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
> + */
> +static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
> +{
> +	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
> +
> +	/*
> +	 * rq is expired!
> +	 */
> +	if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
> +		return 1;
> +
> +	return 0;
> +}
> +
> +/*
> + * deadline_dispatch_requests selects the best request according to
> + * read/write expire, fifo_batch, etc
> + */
> +static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +	struct request *rq;
> +	bool reads, writes;
> +	int data_dir;
> +
> +	spin_lock(&dd->lock);
> +
> +	if (!list_empty(&dd->dispatch)) {
> +		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		goto done;
> +	}
> +
> +	reads = !list_empty(&dd->fifo_list[READ]);
> +	writes = !list_empty(&dd->fifo_list[WRITE]);
> +
> +	/*
> +	 * batches are currently reads XOR writes
> +	 */
> +	if (dd->next_rq[WRITE])
> +		rq = dd->next_rq[WRITE];
> +	else
> +		rq = dd->next_rq[READ];
> +
> +	if (rq && dd->batching < dd->fifo_batch)
> +		/* we have a next request and are still entitled to batch */
> +		goto dispatch_request;
> +
> +	/*
> +	 * at this point we are not running a batch. select the appropriate
> +	 * data direction (read / write)
> +	 */
> +
> +	if (reads) {
> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
> +
> +		if (writes && (dd->starved++ >= dd->writes_starved))
> +			goto dispatch_writes;
> +
> +		data_dir = READ;
> +
> +		goto dispatch_find_request;
> +	}
> +
> +	/*
> +	 * there are either no reads or writes have been starved
> +	 */
> +
> +	if (writes) {
> +dispatch_writes:
> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
> +
> +		dd->starved = 0;
> +
> +		data_dir = WRITE;
> +
> +		goto dispatch_find_request;
> +	}
> +
> +	spin_unlock(&dd->lock);
> +	return NULL;
> +
> +dispatch_find_request:
> +	/*
> +	 * we are not running a batch, find best request for selected data_dir
> +	 */
> +	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
> +		/*
> +		 * A deadline has expired, the last request was in the other
> +		 * direction, or we have run out of higher-sectored requests.
> +		 * Start again from the request with the earliest expiry time.
> +		 */
> +		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> +	} else {
> +		/*
> +		 * The last req was the same dir and we have a next request in
> +		 * sort order. No expired requests so continue on from here.
> +		 */
> +		rq = dd->next_rq[data_dir];
> +	}
> +
> +	dd->batching = 0;
> +
> +dispatch_request:
> +	/*
> +	 * rq is the selected appropriate request.
> +	 */
> +	dd->batching++;
> +	deadline_move_request(dd, rq);
> +done:
> +	rq->rq_flags |= RQF_STARTED;
> +	spin_unlock(&dd->lock);
> +	return rq;
> +}
> +
> +static void dd_dispatch_requests(struct blk_mq_hw_ctx *hctx,
> +				 struct list_head *rq_list)
> +{
> +	blk_mq_sched_dispatch_shadow_requests(hctx, rq_list, __dd_dispatch_request);
> +}
> +
> +static void dd_exit_queue(struct elevator_queue *e)
> +{
> +	struct deadline_data *dd = e->elevator_data;
> +
> +	BUG_ON(!list_empty(&dd->fifo_list[READ]));
> +	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
> +
> +	blk_mq_sched_free_requests(dd->tags);
> +	kfree(dd);
> +}
> +
> +/*
> + * initialize elevator private data (deadline_data).
> + */
> +static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
> +{
> +	struct deadline_data *dd;
> +	struct elevator_queue *eq;
> +
> +	eq = elevator_alloc(q, e);
> +	if (!eq)
> +		return -ENOMEM;
> +
> +	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
> +	if (!dd) {
> +		kobject_put(&eq->kobj);
> +		return -ENOMEM;
> +	}
> +	eq->elevator_data = dd;
> +
> +	dd->tags = blk_mq_sched_alloc_requests(queue_depth, q->node);
> +	if (!dd->tags) {
> +		kfree(dd);
> +		kobject_put(&eq->kobj);
> +		return -ENOMEM;
> +	}
> +
> +	INIT_LIST_HEAD(&dd->fifo_list[READ]);
> +	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
> +	dd->sort_list[READ] = RB_ROOT;
> +	dd->sort_list[WRITE] = RB_ROOT;
> +	dd->fifo_expire[READ] = read_expire;
> +	dd->fifo_expire[WRITE] = write_expire;
> +	dd->writes_starved = writes_starved;
> +	dd->front_merges = 1;
> +	dd->fifo_batch = fifo_batch;
> +	spin_lock_init(&dd->lock);
> +	INIT_LIST_HEAD(&dd->dispatch);
> +	atomic_set(&dd->wait_index, 0);
> +
> +	q->elevator = eq;
> +	return 0;
> +}
> +
> +static int dd_request_merge(struct request_queue *q, struct request **rq,
> +			    struct bio *bio)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	sector_t sector = bio_end_sector(bio);
> +	struct request *__rq;
> +
> +	if (!dd->front_merges)
> +		return ELEVATOR_NO_MERGE;
> +
> +	__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
> +	if (__rq) {
> +		BUG_ON(sector != blk_rq_pos(__rq));
> +
> +		if (elv_bio_merge_ok(__rq, bio)) {
> +			*rq = __rq;
> +			return ELEVATOR_FRONT_MERGE;
> +		}
> +	}
> +
> +	return ELEVATOR_NO_MERGE;
> +}
> +
> +static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	int ret;
> +
> +	spin_lock(&dd->lock);
> +	ret = blk_mq_sched_try_merge(q, bio);
> +	spin_unlock(&dd->lock);
> +
> +	return ret;
> +}
> +
> +/*
> + * add rq to rbtree and fifo
> + */
> +static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> +			      bool at_head)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	const int data_dir = rq_data_dir(rq);
> +
> +	if (blk_mq_sched_try_insert_merge(q, rq))
> +		return;
> +
> +	blk_mq_sched_request_inserted(rq);
> +
> +	/*
> +	 * If we're trying to insert a real request, just send it directly
> +	 * to the hardware dispatch list. This only happens for a requeue,
> +	 * or FUA/FLUSH requests.
> +	 */
> +	if (!blk_mq_sched_rq_is_shadow(rq)) {
> +		spin_lock(&hctx->lock);
> +		list_add_tail(&rq->queuelist, &hctx->dispatch);
> +		spin_unlock(&hctx->lock);
> +		return;
> +	}
> +
> +	if (at_head || rq->cmd_type != REQ_TYPE_FS) {
> +		if (at_head)
> +			list_add(&rq->queuelist, &dd->dispatch);
> +		else
> +			list_add_tail(&rq->queuelist, &dd->dispatch);
> +	} else {
> +		deadline_add_rq_rb(dd, rq);
> +
> +		if (rq_mergeable(rq)) {
> +			elv_rqhash_add(q, rq);
> +			if (!q->last_merge)
> +				q->last_merge = rq;
> +		}
> +
> +		/*
> +		 * set expire time and add to fifo list
> +		 */
> +		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
> +		list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
> +	}
> +}
> +
> +static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
> +			       struct list_head *list, bool at_head)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	spin_lock(&dd->lock);
> +	while (!list_empty(list)) {
> +		struct request *rq;
> +
> +		rq = list_first_entry(list, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		dd_insert_request(hctx, rq, at_head);
> +	}
> +	spin_unlock(&dd->lock);
> +}
> +
> +static struct request *dd_get_request(struct request_queue *q, unsigned int op,
> +				      struct blk_mq_alloc_data *data)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	struct request *rq;
> +
> +	/*
> +	 * The flush machinery intercepts before we insert the request. As
> +	 * a work-around, just hand it back a real request.
> +	 */
> +	if (unlikely(op & (REQ_PREFLUSH | REQ_FUA)))
> +		rq = __blk_mq_alloc_request(data, op);
> +	else {
> +		rq = blk_mq_sched_alloc_shadow_request(q, data, dd->tags, &dd->wait_index);
> +		if (rq)
> +			blk_mq_rq_ctx_init(q, data->ctx, rq, op);
> +	}
> +
> +	return rq;
> +}
> +
> +static bool dd_put_request(struct request *rq)
> +{
> +	/*
> +	 * If it's a real request, we just have to free it. For a shadow
> +	 * request, we should only free it if we haven't started it. A
> +	 * started request is mapped to a real one, and the real one will
> +	 * free it. We can get here with request merges, since we then
> +	 * free the request before we start/issue it.
> +	 */
> +	if (!blk_mq_sched_rq_is_shadow(rq))
> +		return false;
> +
> +	if (!(rq->rq_flags & RQF_STARTED)) {
> +		struct request_queue *q = rq->q;
> +		struct deadline_data *dd = q->elevator->elevator_data;
> +
> +		/*
> +		 * IO completion would normally do this, but if we merge
> +		 * and free before we issue the request, drop both the
> +		 * tag and queue ref
> +		 */
> +		blk_mq_sched_free_shadow_request(dd->tags, rq);
> +		blk_queue_exit(q);
> +	}
> +
> +	return true;
> +}
> +
> +static void dd_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
> +{
> +	struct request *sched_rq = rq->end_io_data;
> +
> +	/*
> +	 * sched_rq can be NULL, if we haven't setup the shadow yet
> +	 * because we failed getting one.
> +	 */
> +	if (sched_rq) {
> +		struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +
> +		blk_mq_sched_free_shadow_request(dd->tags, sched_rq);
> +		blk_mq_start_stopped_hw_queue(hctx, true);
> +	}
> +}
> +
> +static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +
> +	return !list_empty_careful(&dd->dispatch) ||
> +		!list_empty_careful(&dd->fifo_list[0]) ||
> +		!list_empty_careful(&dd->fifo_list[1]);
> +}
> +
> +/*
> + * sysfs parts below
> + */
> +static ssize_t
> +deadline_var_show(int var, char *page)
> +{
> +	return sprintf(page, "%d\n", var);
> +}
> +
> +static ssize_t
> +deadline_var_store(int *var, const char *page, size_t count)
> +{
> +	char *p = (char *) page;
> +
> +	*var = simple_strtol(p, &p, 10);
> +	return count;
> +}
> +
> +#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
> +static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
> +{									\
> +	struct deadline_data *dd = e->elevator_data;			\
> +	int __data = __VAR;						\
> +	if (__CONV)							\
> +		__data = jiffies_to_msecs(__data);			\
> +	return deadline_var_show(__data, (page));			\
> +}
> +SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
> +SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
> +SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
> +SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
> +SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
> +#undef SHOW_FUNCTION
> +
> +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
> +static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
> +{									\
> +	struct deadline_data *dd = e->elevator_data;			\
> +	int __data;							\
> +	int ret = deadline_var_store(&__data, (page), count);		\
> +	if (__data < (MIN))						\
> +		__data = (MIN);						\
> +	else if (__data > (MAX))					\
> +		__data = (MAX);						\
> +	if (__CONV)							\
> +		*(__PTR) = msecs_to_jiffies(__data);			\
> +	else								\
> +		*(__PTR) = __data;					\
> +	return ret;							\
> +}
> +STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
> +STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
> +STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
> +#undef STORE_FUNCTION
> +
> +#define DD_ATTR(name) \
> +	__ATTR(name, S_IRUGO|S_IWUSR, deadline_##name##_show, \
> +				      deadline_##name##_store)
> +
> +static struct elv_fs_entry deadline_attrs[] = {
> +	DD_ATTR(read_expire),
> +	DD_ATTR(write_expire),
> +	DD_ATTR(writes_starved),
> +	DD_ATTR(front_merges),
> +	DD_ATTR(fifo_batch),
> +	__ATTR_NULL
> +};
> +
> +static struct elevator_type mq_deadline = {
> +	.ops.mq = {
> +		.get_request		= dd_get_request,
> +		.put_request		= dd_put_request,
> +		.insert_requests	= dd_insert_requests,
> +		.dispatch_requests	= dd_dispatch_requests,
> +		.completed_request	= dd_completed_request,
> +		.next_request		= elv_rb_latter_request,
> +		.former_request		= elv_rb_former_request,
> +		.bio_merge		= dd_bio_merge,
> +		.request_merge		= dd_request_merge,
> +		.requests_merged	= dd_merged_requests,
> +		.request_merged		= dd_request_merged,
> +		.has_work		= dd_has_work,
> +		.init_sched		= dd_init_queue,
> +		.exit_sched		= dd_exit_queue,
> +	},
> +
> +	.uses_mq	= true,
> +	.elevator_attrs = deadline_attrs,
> +	.elevator_name = "mq-deadline",
> +	.elevator_owner = THIS_MODULE,
> +};
> +
> +static int __init deadline_init(void)
> +{
> +	if (!queue_depth) {
> +		pr_err("mq-deadline: queue depth must be > 0\n");
> +		return -EINVAL;
> +	}
> +	return elv_register(&mq_deadline);
> +}
> +
> +static void __exit deadline_exit(void)
> +{
> +	elv_unregister(&mq_deadline);
> +}
> +
> +module_init(deadline_init);
> +module_exit(deadline_exit);
> +
> +MODULE_AUTHOR("Jens Axboe");
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("MQ deadline IO scheduler");
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2016-12-22 16:23 ` Bart Van Assche
@ 2016-12-22 16:52   ` Omar Sandoval
  2016-12-22 16:57     ` Bart Van Assche
  0 siblings, 1 reply; 69+ messages in thread
From: Omar Sandoval @ 2016-12-22 16:52 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-kernel, linux-block, axboe, axboe, osandov, paolo.valente

On Thu, Dec 22, 2016 at 04:23:24PM +0000, Bart Van Assche wrote:
> On Fri, 2016-12-16 at 17:12 -0700, Jens Axboe wrote:
> > From the discussion last time, I looked into the feasibility of having
> > two sets of tags for the same request pool, to avoid having to copy
> > some of the request fields at dispatch and completion time. To do that,
> > we'd have to replace the driver tag map(s) with our own, and augment
> > that with tag map(s) on the side representing the device queue depth.
> > Queuing IO with the scheduler would allocate from the new map, and
> > dispatching would acquire the "real" tag. We would need to change
> > drivers to do this, or add an extra indirection table to map a real
> > tag to the scheduler tag. We would also need a 1:1 mapping between
> > scheduler and hardware tag pools, or additional info to track it.
> > Unless someone can convince me otherwise, I think the current approach
> > is cleaner.
> 
> Hello Jens,
> 
> Can you have a look at the attached patches? These implement the "two tags
> per request" approach without a table that maps one tag type to the other
> or any other ugly construct. __blk_mq_alloc_request() is modified such that
> it assigns rq->sched_tag and sched_tags->rqs[] instead of rq->tag and
> tags->rqs[]. rq->tag and tags->rqs[] are assigned just before dispatch by
> blk_mq_assign_drv_tag(). This approach results in significantly less code
> than the approach proposed in v4 of your blk-mq-sched patch series. Memory
> usage is lower because only a single set of requests is allocated. The
> runtime overhead is lower because request fields no longer have to be
> copied between the requests owned by the block driver and the requests
> owned by the I/O scheduler. I can boot a VM from the virtio-blk driver but
> otherwise the attached patches have not yet been tested.
> 
> Thanks,
> 
> Bart.

Hey, Bart,

This approach occurred to us, but we couldn't figure out a way to make
blk_mq_tag_to_rq() work with it. From skimming over the patches, I
didn't see a solution to that problem.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2016-12-22 16:52   ` Omar Sandoval
@ 2016-12-22 16:57     ` Bart Van Assche
  2016-12-22 17:12       ` Omar Sandoval
  0 siblings, 1 reply; 69+ messages in thread
From: Bart Van Assche @ 2016-12-22 16:57 UTC (permalink / raw)
  To: osandov; +Cc: linux-kernel, linux-block, osandov, axboe, axboe, paolo.valente

On Thu, 2016-12-22 at 08:52 -0800, Omar Sandoval wrote:
> This approach occurred to us, but we couldn't figure out a way to make
> blk_mq_tag_to_rq() work with it. From skimming over the patches, I
> didn't see a solution to that problem.

Hello Omar,

Can you clarify your comment? Since my patches initialize both tags->rqs[]
and sched_tags->rqs[] the function blk_mq_tag_to_rq() should still work.

Bart.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2016-12-22 16:57     ` Bart Van Assche
@ 2016-12-22 17:12       ` Omar Sandoval
  2016-12-22 17:39         ` Bart Van Assche
  0 siblings, 1 reply; 69+ messages in thread
From: Omar Sandoval @ 2016-12-22 17:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-kernel, linux-block, osandov, axboe, axboe, paolo.valente

On Thu, Dec 22, 2016 at 04:57:36PM +0000, Bart Van Assche wrote:
> On Thu, 2016-12-22 at 08:52 -0800, Omar Sandoval wrote:
> > This approach occurred to us, but we couldn't figure out a way to make
> > blk_mq_tag_to_rq() work with it. From skimming over the patches, I
> > didn't see a solution to that problem.
> 
> Hello Omar,
> 
> Can you clarify your comment? Since my patches initialize both tags->rqs[]
> and sched_tags->rqs[] the function blk_mq_tag_to_rq() should still work.
> 
> Bart.

Sorry, you're right, it does work, but tags->rqs[] ends up being the
extra lookup table. I suspect that the runtime overhead of keeping that
up to date could be worse than copying the rq fields if you have lots of
CPUs but only one hardware queue.
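
For reference, the lookup itself works because blk_mq_assign_drv_tag() in the
patch above writes hctx->tags->rqs[rq->tag] = rq when the driver tag is
acquired, and the free path clears the slot again.  A minimal sketch of what
driver-tag resolution amounts to under that scheme (illustrative only, helper
name made up, not code from the posted patches):

static struct request *example_drv_tag_to_rq(struct blk_mq_hw_ctx *hctx,
					     unsigned int tag)
{
	/* The slot is NULL until blk_mq_assign_drv_tag() maps the request. */
	if (tag >= hctx->tags->nr_tags)
		return NULL;
	return hctx->tags->rqs[tag];
}

Keeping that slot in sync on every dispatch and free is the per-request
bookkeeping whose cost is being weighed here against copying request fields.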

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2016-12-22 17:12       ` Omar Sandoval
@ 2016-12-22 17:39         ` Bart Van Assche
  0 siblings, 0 replies; 69+ messages in thread
From: Bart Van Assche @ 2016-12-22 17:39 UTC (permalink / raw)
  To: osandov; +Cc: linux-kernel, linux-block, osandov, axboe, axboe, paolo.valente

On Thu, 2016-12-22 at 09:12 -0800, Omar Sandoval wrote:
> On Thu, Dec 22, 2016 at 04:57:36PM +0000, Bart Van Assche wrote:
> > On Thu, 2016-12-22 at 08:52 -0800, Omar Sandoval wrote:
> > > This approach occurred to us, but we couldn't figure out a way to make
> > > blk_mq_tag_to_rq() work with it. From skimming over the patches, I
> > > didn't see a solution to that problem.
> > 
> > Can you clarify your comment? Since my patches initialize both tags->rqs[]
> > and sched_tags->rqs[] the function blk_mq_tag_to_rq() should still work.
> 
> Sorry, you're right, it does work, but tags->rqs[] ends up being the
> extra lookup table. I suspect that the runtime overhead of keeping that
> up to date could be worse than copying the rq fields if you have lots of
> CPUs but only one hardware queue.

Hello Omar,

I'm not sure that anything can be done if the number of CPUs that are submitting
I/O is large compared to the queue depth, so I don't think we should spend our
time on that case. If the queue depth is large enough then the sbitmap code will
allocate tags such that different CPUs use different rqs[] elements.

The advantages of the approach I proposed are such that I am convinced it is
what we should start from, addressing contention on the tags->rqs[] array only
if measurements show that it is necessary.

Bart.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-22  9:59   ` Paolo Valente
  2016-12-22 11:13     ` Paolo Valente
@ 2016-12-23 10:12     ` Paolo Valente
  2017-01-17  2:47     ` Jens Axboe
  2 siblings, 0 replies; 69+ messages in thread
From: Paolo Valente @ 2016-12-23 10:12 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 22 dic 2016, alle ore 10:59, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
>> 
>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>> 
>> This adds a set of hooks that intercepts the blk-mq path of
>> allocating/inserting/issuing/completing requests, allowing
>> us to develop a scheduler within that framework.
>> 
>> We reuse the existing elevator scheduler API on the registration
>> side, but augment that with the scheduler flagging support for
>> the blk-mq interface, and with a separate set of ops hooks for MQ
>> devices.
>> 
>> Schedulers can opt in to using shadow requests. Shadow requests
>> are internal requests that the scheduler uses for the allocate
>> and insert part, which are then mapped to a real driver request
>> at dispatch time. This is needed to separate the device queue depth
>> from the pool of requests that the scheduler has to work with.
>> 
>> Signed-off-by: Jens Axboe <axboe@fb.com>
>> 
> ...
> 
>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>> new file mode 100644
>> index 000000000000..b7e1839d4785
>> --- /dev/null
>> +++ b/block/blk-mq-sched.c
> 
>> ...
>> +static inline bool
>> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
>> +			 struct bio *bio)
>> +{
>> +	struct elevator_queue *e = q->elevator;
>> +
>> +	if (e && e->type->ops.mq.allow_merge)
>> +		return e->type->ops.mq.allow_merge(q, rq, bio);
>> +
>> +	return true;
>> +}
>> +
> 
> Something does not seem to add up here:
> e->type->ops.mq.allow_merge may be called only in
> blk_mq_sched_allow_merge, which, in its turn, may be called only in
> blk_mq_attempt_merge, which, finally, may be called only in
> blk_mq_merge_queue_io.  Yet the latter may be called only if there is
> no elevator (lines 1399 and 1507 in blk-mq.c).
> 
> Therefore, e->type->ops.mq.allow_merge can never be called, whether
> there is an elevator or not.  Be patient if I'm missing
> something huge, but I thought it was worth reporting this.
> 

Jens,
I forgot to add that I'm willing (and would be happy) to propose a fix
to this, and possibly the other problems too, on my own.  Just, I'm
not yet expert enough to do it with having first received some
feedback or instructions from you.  In this specific case, I don't
even know yet whether this is really a bug.

Thanks, and merry Christmas if we don't get in touch before,
Paolo

> Paolo
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-22 11:13     ` Paolo Valente
@ 2017-01-17  2:47       ` Jens Axboe
  2017-01-17 10:13         ` Paolo Valente
  0 siblings, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2017-01-17  2:47 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov

On 12/22/2016 04:13 AM, Paolo Valente wrote:
> 
>> Il giorno 22 dic 2016, alle ore 10:59, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>
>>>
>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>>
>>> This adds a set of hooks that intercepts the blk-mq path of
>>> allocating/inserting/issuing/completing requests, allowing
>>> us to develop a scheduler within that framework.
>>>
>>> We reuse the existing elevator scheduler API on the registration
>>> side, but augment that with the scheduler flagging support for
>>> the blk-mq interface, and with a separate set of ops hooks for MQ
>>> devices.
>>>
>>> Schedulers can opt in to using shadow requests. Shadow requests
>>> are internal requests that the scheduler uses for the allocate
>>> and insert part, which are then mapped to a real driver request
>>> at dispatch time. This is needed to separate the device queue depth
>>> from the pool of requests that the scheduler has to work with.
>>>
>>> Signed-off-by: Jens Axboe <axboe@fb.com>
>>>
>> ...
>>
>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>>> new file mode 100644
>>> index 000000000000..b7e1839d4785
>>> --- /dev/null
>>> +++ b/block/blk-mq-sched.c
>>
>>> ...
>>> +static inline bool
>>> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
>>> +			 struct bio *bio)
>>> +{
>>> +	struct elevator_queue *e = q->elevator;
>>> +
>>> +	if (e && e->type->ops.mq.allow_merge)
>>> +		return e->type->ops.mq.allow_merge(q, rq, bio);
>>> +
>>> +	return true;
>>> +}
>>> +
>>
>> Something does not seem to add up here:
>> e->type->ops.mq.allow_merge may be called only in
>> blk_mq_sched_allow_merge, which, in its turn, may be called only in
>> blk_mq_attempt_merge, which, finally, may be called only in
>> blk_mq_merge_queue_io.  Yet the latter may be called only if there is
>> no elevator (lines 1399 and 1507 in blk-mq.c).
>>
>> Therefore, e->type->ops.mq.allow_merge can never be called, whether
>> there is an elevator or not.  Be patient if I'm missing
>> something huge, but I thought it was worth reporting this.
>>
> 
> Just another detail: if e->type->ops.mq.allow_merge does get invoked
> from the above path, then it is invoked of course without the
> scheduler lock held.  In contrast, if this function gets invoked
> from dd_bio_merge, then the scheduler lock is held.

But the scheduler controls that itself. So it'd be perfectly fine to
have a locked and unlocked variant. The way that's typically done is to
have function() grabbing the lock, and __function() is invoked with the
lock held.
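
For illustration, with hypothetical names (not taken from this series), that
convention looks like:

/* Caller must hold dd->lock. */
static void __dd_do_insert(struct deadline_data *dd, struct request *rq)
{
	deadline_add_rq_rb(dd, rq);
	list_add_tail(&rq->queuelist, &dd->fifo_list[rq_data_dir(rq)]);
}

static void dd_do_insert(struct deadline_data *dd, struct request *rq)
{
	spin_lock(&dd->lock);
	__dd_do_insert(dd, rq);
	spin_unlock(&dd->lock);
}

A hook reached on a path that already holds the scheduler lock then calls the
__ variant, and every other caller uses the locking one.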

> To handle these opposite alternatives, I don't know whether checking if
> the lock is held (and possibly taking it) from inside
> e->type->ops.mq.allow_merge is a good solution.  In any case, before
> possibly trying it, I will wait for some feedback on the main problem,
> i.e., on the fact that e->type->ops.mq.allow_merge
> seems unreachable in the above path.

Checking if a lock is held is NEVER a good idea, as it leads to both bad
and incorrect code. If you just check if a lock is held when being
called, you don't necessarily know if it was the caller that grabbed it
or it just happens to be held by someone else for unrelated reasons.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2016-12-22 15:28         ` Paolo Valente
@ 2017-01-17  2:47           ` Jens Axboe
  2017-01-17 10:49             ` Paolo Valente
  0 siblings, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2017-01-17  2:47 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, linux-block, Linux-Kernal, Omar Sandoval,
	Linus Walleij, Ulf Hansson, Mark Brown

On 12/22/2016 08:28 AM, Paolo Valente wrote:
> 
>> Il giorno 19 dic 2016, alle ore 22:05, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> On 12/19/2016 11:21 AM, Paolo Valente wrote:
>>>
>>>> Il giorno 19 dic 2016, alle ore 16:20, Jens Axboe <axboe@fb.com> ha scritto:
>>>>
>>>> On 12/19/2016 04:32 AM, Paolo Valente wrote:
>>>>>
>>>>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>>>>>
>>>>>> This is version 4 of this patchset, version 3 was posted here:
>>>>>>
>>>>>> https://marc.info/?l=linux-block&m=148178513407631&w=2
>>>>>>
>>>>>> From the discussion last time, I looked into the feasibility of having
>>>>>> two sets of tags for the same request pool, to avoid having to copy
>>>>>> some of the request fields at dispatch and completion time. To do that,
>>>>>> we'd have to replace the driver tag map(s) with our own, and augment
>>>>>> that with tag map(s) on the side representing the device queue depth.
>>>>>> Queuing IO with the scheduler would allocate from the new map, and
>>>>>> dispatching would acquire the "real" tag. We would need to change
>>>>>> drivers to do this, or add an extra indirection table to map a real
>>>>>> tag to the scheduler tag. We would also need a 1:1 mapping between
>>>>>> scheduler and hardware tag pools, or additional info to track it.
>>>>>> Unless someone can convince me otherwise, I think the current approach
>>>>>> is cleaner.
>>>>>>
>>>>>> I wasn't going to post v4 so soon, but I discovered a bug that led
>>>>>> to drastically decreased merging. Especially on rotating storage,
>>>>>> this release should be fast, and on par with the merging that we
>>>>>> get through the legacy schedulers.
>>>>>>
>>>>>
>>>>> I'm now modifying bfq.  You mentioned other missing pieces to come.  Do
>>>>> you already have an idea of what they are, so that I am somehow
>>>>> prepared for what won't work even if my changes are right?
>>>>
>>>> I'm mostly talking about elevator ops hooks that aren't there in the new
>>>> framework, but exist in the old one. There should be no hidden
>>>> surprises, if that's what you are worried about.
>>>>
>>>> On the ops side, the only ones I can think of are the activate and
>>>> deactivate, and those can be done in the dispatch_request hook for
>>>> activate, and put/requeue for deactivate.
>>>>
>>>
>>> You mean that there is no conceptual problem in moving the code of the
>>> activate interface function into the dispatch function, and the code
>>> of the deactivate into the put_request? (for a requeue it is a little
>>> less clear to me, so one step at a time)  Or am I missing
>>> something more complex?
>>
>> Yes, what I mean is that there isn't a 1:1 mapping between the old ops
>> and the new ops. So you'll have to consider the cases.
>>
>>
> 
> Problem: whereas it seems easy and safe to do elsewhere the simple
> increment that was done in activate_request, I wonder whether a request
> may be deactivated before being completed.  If that may happen, then,
> without a deactivate_request hook, the increments would remain
> unbalanced.  Or are request completions always guaranteed unless some
> hw/sw component breaks?

You should be able to do it in get/put_request. But you might need some
extra tracking, I'd need to double check. I'm trying to avoid adding
hooks that we don't truly need, the old interface had a lot of that. If
you find that you need a hook and it isn't there, feel free to add it.
activate/deactivate might be a good change.
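
As a rough sketch of that kind of tracking (hypothetical structure and helper
names, not code from this series), the dispatch and put paths alone can keep
the old activate/deactivate accounting balanced:

struct example_sched_data {
	atomic_t nr_in_driver;	/* what activate/deactivate used to track */
};

/* Old "activate" point: the request is handed to the driver at dispatch. */
static void example_account_dispatch(struct example_sched_data *sd,
				     struct request *rq)
{
	rq->rq_flags |= RQF_STARTED;	/* dispatched marker, as mq-deadline sets */
	atomic_inc(&sd->nr_in_driver);
}

/* Old "deactivate"/completion point: only undo what was actually counted. */
static void example_account_put(struct example_sched_data *sd,
				struct request *rq)
{
	if (rq->rq_flags & RQF_STARTED)
		atomic_dec(&sd->nr_in_driver);
}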

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-22 16:07   ` Paolo Valente
@ 2017-01-17  2:47     ` Jens Axboe
  0 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2017-01-17  2:47 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov

On 12/22/2016 09:07 AM, Paolo Valente wrote:
> 
>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> This is basically identical to deadline-iosched, except it registers
>> as a MQ capable scheduler. This is still a single queue design.
>>
>> Signed-off-by: Jens Axboe <axboe@fb.com>
>> ---
> 
> ...
> 
>> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
>> new file mode 100644
>> index 000000000000..3cb9de21ab21
>> --- /dev/null
>> +++ b/block/mq-deadline.c
>> ...
>> +/*
>> + * remove rq from rbtree and fifo.
>> + */
>> +static void deadline_remove_request(struct request_queue *q, struct request *rq)
>> +{
>> +	struct deadline_data *dd = q->elevator->elevator_data;
>> +
>> +	list_del_init(&rq->queuelist);
>> +
>> +	/*
>> +	 * We might not be on the rbtree, if we are doing an insert merge
>> +	 */
>> +	if (!RB_EMPTY_NODE(&rq->rb_node))
>> +		deadline_del_rq_rb(dd, rq);
>> +
> 
> I've been scratching my head over the last three instructions, but to
> no avail.  If I understand correctly, the
> list_del_init(&rq->queuelist);
> removes rq from the fifo list.  But, if so, I don't understand how it
> could be possible that rq has not been added to the rb_tree too.
>
> Another interpretation I tried is that the above three lines correctly
> handle the case where rq has not been inserted at all into the deadline
> fifo queue and rb tree: when dd_insert_request was executed for rq,
> blk_mq_sched_try_insert_merge succeeded.  Yet, in that case, the
> list_del_init(&rq->queuelist);
> does not seem to make sense.
> 
> Could you please shed some light on this for me?

I think you are correct, we don't need to touch ->queuelist for the case
where RB_EMPTY_NODE() is true. Minor detail, the list is already empty,
so it does no harm.
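
For completeness, list_del_init() on an entry that is already in the
self-pointing "empty" state just rewrites the same two pointers, e.g.:

	struct list_head node;

	INIT_LIST_HEAD(&node);	/* node.next == node.prev == &node */
	list_del_init(&node);	/* both pointers rewritten to &node: a no-op */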

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-22 16:49   ` Paolo Valente
@ 2017-01-17  2:47     ` Jens Axboe
  2017-01-20 11:07       ` Paolo Valente
  0 siblings, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2017-01-17  2:47 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov

On 12/22/2016 09:49 AM, Paolo Valente wrote:
> 
>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> This is basically identical to deadline-iosched, except it registers
>> as a MQ capable scheduler. This is still a single queue design.
>>
> 
> One last question (for today ...): in mq-deadline there are no
> "schedule dispatch" or "unplug work" functions.  In legacy blk, CFQ and
> BFQ perform these schedules/unplugs in a lot of cases.  What's the right
> replacement?  Just doing nothing?

You just use blk_mq_run_hw_queue() or variants thereof to kick off queue
runs.
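
For example (sketch only; the wrapper name is made up, blk_mq_run_hw_queue()
is the existing helper), where CFQ/BFQ would have scheduled dispatch or unplug
work, an mq scheduler can simply do:

static void example_sched_kick_queue(struct blk_mq_hw_ctx *hctx)
{
	/* true == run asynchronously, from blk-mq's own workqueue */
	blk_mq_run_hw_queue(hctx, true);
}

The mq-deadline completion hook above does essentially this via
blk_mq_start_stopped_hw_queue().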

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers
  2016-12-22  9:59   ` Paolo Valente
  2016-12-22 11:13     ` Paolo Valente
  2016-12-23 10:12     ` Paolo Valente
@ 2017-01-17  2:47     ` Jens Axboe
  2017-01-17  9:17       ` Paolo Valente
  2 siblings, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2017-01-17  2:47 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov

On 12/22/2016 02:59 AM, Paolo Valente wrote:
> 
>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> This adds a set of hooks that intercepts the blk-mq path of
>> allocating/inserting/issuing/completing requests, allowing
>> us to develop a scheduler within that framework.
>>
>> We reuse the existing elevator scheduler API on the registration
>> side, but augment that with the scheduler flagging support for
>> the blk-mq interface, and with a separate set of ops hooks for MQ
>> devices.
>>
>> Schedulers can opt in to using shadow requests. Shadow requests
>> are internal requests that the scheduler uses for the allocate
>> and insert part, which are then mapped to a real driver request
>> at dispatch time. This is needed to separate the device queue depth
>> from the pool of requests that the scheduler has to work with.
>>
>> Signed-off-by: Jens Axboe <axboe@fb.com>
>>
> ...
> 
>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>> new file mode 100644
>> index 000000000000..b7e1839d4785
>> --- /dev/null
>> +++ b/block/blk-mq-sched.c
> 
>> ...
>> +static inline bool
>> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
>> +			 struct bio *bio)
>> +{
>> +	struct elevator_queue *e = q->elevator;
>> +
>> +	if (e && e->type->ops.mq.allow_merge)
>> +		return e->type->ops.mq.allow_merge(q, rq, bio);
>> +
>> +	return true;
>> +}
>> +
> 
> Something does not seem to add up here:
> e->type->ops.mq.allow_merge may be called only in
> blk_mq_sched_allow_merge, which, in its turn, may be called only in
> blk_mq_attempt_merge, which, finally, may be called only in
> blk_mq_merge_queue_io.  Yet the latter may be called only if there is
> no elevator (line 1399 and 1507 in blk-mq.c).
> 
> Therefore, e->type->ops.mq.allow_merge can never be called, whether
> or not there is an elevator.  Be patient if I'm missing
> something huge, but I thought it was worth reporting this.

I went through the current branch, and it seems mostly fine. There was
a double call to allow_merge() that I killed in the plug path, and one
set missing in blk_mq_sched_try_merge(). The rest looks OK.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers
  2017-01-17  2:47     ` Jens Axboe
@ 2017-01-17  9:17       ` Paolo Valente
  0 siblings, 0 replies; 69+ messages in thread
From: Paolo Valente @ 2017-01-17  9:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 17 gen 2017, alle ore 03:47, Jens Axboe <axboe@fb.com> ha scritto:
> 
> On 12/22/2016 02:59 AM, Paolo Valente wrote:
>> 
>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>> 
>>> This adds a set of hooks that intercepts the blk-mq path of
>>> allocating/inserting/issuing/completing requests, allowing
>>> us to develop a scheduler within that framework.
>>> 
>>> We reuse the existing elevator scheduler API on the registration
>>> side, but augment that with the scheduler flagging support for
>>> the blk-mq interface, and with a separate set of ops hooks for MQ
>>> devices.
>>> 
>>> Schedulers can opt in to using shadow requests. Shadow requests
>>> are internal requests that the scheduler uses for the allocate
>>> and insert part, which are then mapped to a real driver request
>>> at dispatch time. This is needed to separate the device queue depth
>>> from the pool of requests that the scheduler has to work with.
>>> 
>>> Signed-off-by: Jens Axboe <axboe@fb.com>
>>> 
>> ...
>> 
>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>>> new file mode 100644
>>> index 000000000000..b7e1839d4785
>>> --- /dev/null
>>> +++ b/block/blk-mq-sched.c
>> 
>>> ...
>>> +static inline bool
>>> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
>>> +			 struct bio *bio)
>>> +{
>>> +	struct elevator_queue *e = q->elevator;
>>> +
>>> +	if (e && e->type->ops.mq.allow_merge)
>>> +		return e->type->ops.mq.allow_merge(q, rq, bio);
>>> +
>>> +	return true;
>>> +}
>>> +
>> 
>> Something does not seem to add up here:
>> e->type->ops.mq.allow_merge may be called only in
>> blk_mq_sched_allow_merge, which, in its turn, may be called only in
>> blk_mq_attempt_merge, which, finally, may be called only in
>> blk_mq_merge_queue_io.  Yet the latter may be called only if there is
>> no elevator (line 1399 and 1507 in blk-mq.c).
>> 
>> Therefore, e->type->ops.mq.allow_merge can never be called, whether
>> or not there is an elevator.  Be patient if I'm missing
>> something huge, but I thought it was worth reporting this.
> 
> I went through the current branch, and it seems mostly fine. There was
> a double call to allow_merge() that I killed in the plug path, and one
> set missing in blk_mq_sched_try_merge(). The rest looks OK.
> 

Yes, I missed a path, sorry.  I'm happy that at least your check has
not been a waste of time for other reasons.

Thanks,
Paolo

> -- 
> Jens Axboe
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers
  2017-01-17  2:47       ` Jens Axboe
@ 2017-01-17 10:13         ` Paolo Valente
  2017-01-17 12:38           ` Jens Axboe
  0 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2017-01-17 10:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 17 gen 2017, alle ore 03:47, Jens Axboe <axboe@fb.com> ha scritto:
> 
> On 12/22/2016 04:13 AM, Paolo Valente wrote:
>> 
>>> Il giorno 22 dic 2016, alle ore 10:59, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>> 
>>>> 
>>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>>> 
>>>> This adds a set of hooks that intercepts the blk-mq path of
>>>> allocating/inserting/issuing/completing requests, allowing
>>>> us to develop a scheduler within that framework.
>>>> 
>>>> We reuse the existing elevator scheduler API on the registration
>>>> side, but augment that with the scheduler flagging support for
>>>> the blk-mq interface, and with a separate set of ops hooks for MQ
>>>> devices.
>>>> 
>>>> Schedulers can opt in to using shadow requests. Shadow requests
>>>> are internal requests that the scheduler uses for the allocate
>>>> and insert part, which are then mapped to a real driver request
>>>> at dispatch time. This is needed to separate the device queue depth
>>>> from the pool of requests that the scheduler has to work with.
>>>> 
>>>> Signed-off-by: Jens Axboe <axboe@fb.com>
>>>> 
>>> ...
>>> 
>>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>>>> new file mode 100644
>>>> index 000000000000..b7e1839d4785
>>>> --- /dev/null
>>>> +++ b/block/blk-mq-sched.c
>>> 
>>>> ...
>>>> +static inline bool
>>>> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
>>>> +			 struct bio *bio)
>>>> +{
>>>> +	struct elevator_queue *e = q->elevator;
>>>> +
>>>> +	if (e && e->type->ops.mq.allow_merge)
>>>> +		return e->type->ops.mq.allow_merge(q, rq, bio);
>>>> +
>>>> +	return true;
>>>> +}
>>>> +
>>> 
>>> Something does not seem to add up here:
>>> e->type->ops.mq.allow_merge may be called only in
>>> blk_mq_sched_allow_merge, which, in its turn, may be called only in
>>> blk_mq_attempt_merge, which, finally, may be called only in
>>> blk_mq_merge_queue_io.  Yet the latter may be called only if there is
>>> no elevator (line 1399 and 1507 in blk-mq.c).
>>> 
>>> Therefore, e->type->ops.mq.allow_merge can never be called, whether
>>> or not there is an elevator.  Be patient if I'm missing
>>> something huge, but I thought it was worth reporting this.
>>> 
>> 
>> Just another detail: if e->type->ops.mq.allow_merge does get invoked
>> from the above path, then it is invoked of course without the
>> scheduler lock held.  In contrast, if this function gets invoked
>> from dd_bio_merge, then the scheduler lock is held.
> 
> But the scheduler controls that itself. So it'd be perfectly fine to
> have a locked and unlocked variant. The way that's typically done is to
> have function() grabbing the lock, and __function() is invoked with the
> lock held.
> 
>> To handle these opposite alternatives, I don't know whether checking if
>> the lock is held (and possibly taking it) from inside
>> e->type->ops.mq.allow_merge is a good solution.  In any case, before
>> possibly trying it, I will wait for some feedback on the main problem,
>> i.e., on the fact that e->type->ops.mq.allow_merge
>> seems unreachable in the above path.
> 
> Checking if a lock is held is NEVER a good idea, as it leads to both bad
> and incorrect code. If you just check if a lock is held when being
> called, you don't necessarily know if it was the caller that grabbed it
> or it just happens to be held by someone else for unrelated reasons.
> 
> 

Thanks a lot for this and the above explanations.  Unfortunately, I
still see the problem.  To hopefully make you waste less time, I have
reported the problematic paths explicitly, so that you can quickly
point me to my mistake.

The problem is caused by the existence of at least the following two
alternative paths to e->type->ops.mq.allow_merge.

1.  In mq-deadline.c (line 374): spin_lock(&dd->lock);
blk_mq_sched_try_merge -> elv_merge -> elv_bio_merge_ok ->
elv_iosched_allow_bio_merge -> e->type->ops.mq.allow_merge

2. In blk-core.c (line 1660): spin_lock_irq(q->queue_lock);
elv_merge -> elv_bio_merge_ok ->
elv_iosched_allow_bio_merge -> e->type->ops.mq.allow_merge

In the first path, the scheduler lock is held, while in the second
path, it is not.  This does not cause problems with mq-deadline,
because the latter just has no allow_merge function.  Yet it does
cause problems with the allow_merge implementation of bfq.  There was
no issue in blk, as only the global queue lock was used.

Where am I wrong?

Thanks,
Paolo


> -- 
> Jens Axboe
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2017-01-17  2:47           ` Jens Axboe
@ 2017-01-17 10:49             ` Paolo Valente
  2017-01-18 16:14               ` Paolo Valente
  0 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2017-01-17 10:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jens Axboe, linux-block, Linux-Kernal, Omar Sandoval,
	Linus Walleij, Ulf Hansson, Mark Brown

[NEW RESEND ATTEMPT]

> Il giorno 17 gen 2017, alle ore 03:47, Jens Axboe <axboe@fb.com> ha scritto:
> 
> On 12/22/2016 08:28 AM, Paolo Valente wrote:
>> 
>>> Il giorno 19 dic 2016, alle ore 22:05, Jens Axboe <axboe@fb.com> ha scritto:
>>> 
>>> On 12/19/2016 11:21 AM, Paolo Valente wrote:
>>>> 
>>>>> Il giorno 19 dic 2016, alle ore 16:20, Jens Axboe <axboe@fb.com> ha scritto:
>>>>> 
>>>>> On 12/19/2016 04:32 AM, Paolo Valente wrote:
>>>>>> 
>>>>>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>>>>>> 
>>>>>>> This is version 4 of this patchset, version 3 was posted here:
>>>>>>> 
>>>>>>> https://marc.info/?l=linux-block&m=148178513407631&w=2
>>>>>>> 
>>>>>>> From the discussion last time, I looked into the feasibility of having
>>>>>>> two sets of tags for the same request pool, to avoid having to copy
>>>>>>> some of the request fields at dispatch and completion time. To do that,
>>>>>>> we'd have to replace the driver tag map(s) with our own, and augment
>>>>>>> that with tag map(s) on the side representing the device queue depth.
>>>>>>> Queuing IO with the scheduler would allocate from the new map, and
>>>>>>> dispatching would acquire the "real" tag. We would need to change
>>>>>>> drivers to do this, or add an extra indirection table to map a real
>>>>>>> tag to the scheduler tag. We would also need a 1:1 mapping between
>>>>>>> scheduler and hardware tag pools, or additional info to track it.
>>>>>>> Unless someone can convince me otherwise, I think the current approach
>>>>>>> is cleaner.
>>>>>>> 
>>>>>>> I wasn't going to post v4 so soon, but I discovered a bug that led
>>>>>>> to drastically decreased merging. Especially on rotating storage,
>>>>>>> this release should be fast, and on par with the merging that we
>>>>>>> get through the legacy schedulers.
>>>>>>> 
>>>>>> 
>>>>>> I'm about to modify bfq.  You mentioned other missing pieces to come.  Do
>>>>>> you already have an idea of what they are, so that I am somehow
>>>>>> prepared for what won't work even if my changes are right?
>>>>> 
>>>>> I'm mostly talking about elevator ops hooks that aren't there in the new
>>>>> framework, but exist in the old one. There should be no hidden
>>>>> surprises, if that's what you are worried about.
>>>>> 
>>>>> On the ops side, the only ones I can think of are the activate and
>>>>> deactivate, and those can be done in the dispatch_request hook for
>>>>> activate, and put/requeue for deactivate.
>>>>> 
>>>> 
>>>> You mean that there is no conceptual problem in moving the code of the
>>>> activate interface function into the dispatch function, and the code
>>>> of the deactivate into the put_request? (for a requeue it is a little
>>>> less clear to me, so one step at a time)  Or am I missing
>>>> something more complex?
>>> 
>>> Yes, what I mean is that there isn't a 1:1 mapping between the old ops
>>> and the new ops. So you'll have to consider the cases.
>>> 
>>> 
>> 
>> Problem: whereas it seems easy and safe to do somewhere else the
>> simple increment that was done in activate_request, I wonder if it may
>> happen that a request is deactivated before being completed.  If that
>> may happen, then, without a deactivate_request hook, the increments would
>> remain unbalanced.  Or are request completions always guaranteed unless
>> some hw/sw component breaks?
> 
> You should be able to do it in get/put_request. But you might need some
> extra tracking, I'd need to double check.

Exactly, AFAICT something extra is apparently needed.  In particular,
get is not ok, because dispatch is a different event (though dispatch
is already a controlled event), while put could be used, provided
that it is guaranteed to be executed only after dispatch.  If it is
not, then I think that an extra flag or something should be added to
the request.  I don't know whether adding this extra piece would be
worse than adding an extra hook.
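
For what it's worth, the posted mq-deadline patch already makes this
kind of distinction in dd_put_request() via RQF_STARTED, which the
dispatch path sets; a sketch of using the same flag as the extra piece
(sched_data and the accounting helpers are made-up names, not posted
code):

	static bool sched_put_request(struct request *rq)
	{
		struct sched_data *sd = rq->q->elevator->elevator_data;

		/*
		 * RQF_STARTED is set in the dispatch path (see
		 * __dd_dispatch_request() in patch 7/8), so at put time it
		 * tells us whether the request was ever dispatched.
		 */
		if (rq->rq_flags & RQF_STARTED)
			sched_account_completion(sd, rq);	/* made-up helper */
		else
			sched_account_cancel(sd, rq);		/* made-up helper */

		return true;
	}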

> 
> I'm trying to avoid adding
> hooks that we don't truly need, the old interface had a lot of that. If
> you find that you need a hook and it isn't there, feel free to add it.
> activate/deactivate might be a good change.
> 

If my comments above do not trigger any proposal of a better solution,
then I will try adding just one extra 'deactivate' hook.  Unless
unbalanced hooks are a bad idea too.

Thanks,
Paolo

> -- 
> Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers
  2017-01-17 10:13         ` Paolo Valente
@ 2017-01-17 12:38           ` Jens Axboe
  0 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2017-01-17 12:38 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov

On 01/17/2017 02:13 AM, Paolo Valente wrote:
> 
>> Il giorno 17 gen 2017, alle ore 03:47, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> On 12/22/2016 04:13 AM, Paolo Valente wrote:
>>>
>>>> Il giorno 22 dic 2016, alle ore 10:59, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>>>
>>>>>
>>>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>>>>
>>>>> This adds a set of hooks that intercepts the blk-mq path of
>>>>> allocating/inserting/issuing/completing requests, allowing
>>>>> us to develop a scheduler within that framework.
>>>>>
>>>>> We reuse the existing elevator scheduler API on the registration
>>>>> side, but augment that with the scheduler flagging support for
>>>>> the blk-mq interface, and with a separate set of ops hooks for MQ
>>>>> devices.
>>>>>
>>>>> Schedulers can opt in to using shadow requests. Shadow requests
>>>>> are internal requests that the scheduler uses for the allocate
>>>>> and insert part, which are then mapped to a real driver request
>>>>> at dispatch time. This is needed to separate the device queue depth
>>>>> from the pool of requests that the scheduler has to work with.
>>>>>
>>>>> Signed-off-by: Jens Axboe <axboe@fb.com>
>>>>>
>>>> ...
>>>>
>>>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>>>>> new file mode 100644
>>>>> index 000000000000..b7e1839d4785
>>>>> --- /dev/null
>>>>> +++ b/block/blk-mq-sched.c
>>>>
>>>>> ...
>>>>> +static inline bool
>>>>> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
>>>>> +			 struct bio *bio)
>>>>> +{
>>>>> +	struct elevator_queue *e = q->elevator;
>>>>> +
>>>>> +	if (e && e->type->ops.mq.allow_merge)
>>>>> +		return e->type->ops.mq.allow_merge(q, rq, bio);
>>>>> +
>>>>> +	return true;
>>>>> +}
>>>>> +
>>>>
>>>> Something does not seem to add up here:
>>>> e->type->ops.mq.allow_merge may be called only in
>>>> blk_mq_sched_allow_merge, which, in its turn, may be called only in
>>>> blk_mq_attempt_merge, which, finally, may be called only in
>>>> blk_mq_merge_queue_io.  Yet the latter may be called only if there is
>>>> no elevator (line 1399 and 1507 in blk-mq.c).
>>>>
>>>> Therefore, e->type->ops.mq.allow_merge can never be called, whether
>>>> or not there is an elevator.  Be patient if I'm missing
>>>> something huge, but I thought it was worth reporting this.
>>>>
>>>
>>> Just another detail: if e->type->ops.mq.allow_merge does get invoked
>>> from the above path, then it is invoked of course without the
>>> scheduler lock held.  In contrast, if this function gets invoked
>>> from dd_bio_merge, then the scheduler lock is held.
>>
>> But the scheduler controls that itself. So it'd be perfectly fine to
>> have a locked and unlocked variant. The way that's typically done is to
>> have function() grabbing the lock, and __function() is invoked with the
>> lock held.
>>
>>> To handle these opposite alternatives, I don't know whether checking if
>>> the lock is held (and possibly taking it) from inside
>>> e->type->ops.mq.allow_merge is a good solution.  In any case, before
>>> possibly trying it, I will wait for some feedback on the main problem,
>>> i.e., on the fact that e->type->ops.mq.allow_merge
>>> seems unreachable in the above path.
>>
>> Checking if a lock is held is NEVER a good idea, as it leads to both bad
>> and incorrect code. If you just check if a lock is held when being
>> called, you don't necessarily know if it was the caller that grabbed it
>> or it just happens to be held by someone else for unrelated reasons.
>>
>>
> 
> Thanks a lot for this and the above explanations.  Unfortunately, I
> still see the problem.  To hopefully make you waste less time, I have
> reported the problematic paths explicitly, so that you can quickly
> point me to my mistake.
> 
> The problem is caused by the existence of at least the following two
> alternative paths to e->type->ops.mq.allow_merge.
> 
> 1.  In mq-deadline.c (line 374): spin_lock(&dd->lock);
> blk_mq_sched_try_merge -> elv_merge -> elv_bio_merge_ok ->
> elv_iosched_allow_bio_merge -> e->type->ops.mq.allow_merge
> 
> 2. In blk-core.c (line 1660): spin_lock_irq(q->queue_lock);
> elv_merge -> elv_bio_merge_ok ->
> elv_iosched_allow_bio_merge -> e->type->ops.mq.allow_merge
> 
> In the first path, the scheduler lock is held, while in the second
> path, it is not.  This does not cause problems with mq-deadline,
> because the latter just has no allow_merge function.  Yet it does
> cause problems with the allow_merge implementation of bfq.  There was
> no issue in blk, as only the global queue lock was used.
> 
> Where am I wrong?

#2 can never happen for blk-mq, it's the old IO path. blk-mq is never
invoked with ->queue_lock held.
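
As an aside, the locked/unlocked split I mentioned would look roughly
like this (a sketch with made-up names, not bfq or mq-deadline code):

	/* __ variant: caller already holds the scheduler lock */
	static bool __sched_allow_merge(struct sched_data *sd,
					struct request *rq, struct bio *bio)
	{
		lockdep_assert_held(&sd->lock);
		/* actual merge policy goes here */
		return true;
	}

	/* plain variant: takes and drops the lock itself */
	static bool sched_allow_merge(struct sched_data *sd,
				      struct request *rq, struct bio *bio)
	{
		bool ret;

		spin_lock(&sd->lock);
		ret = __sched_allow_merge(sd, rq, bio);
		spin_unlock(&sd->lock);

		return ret;
	}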

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2017-01-17 10:49             ` Paolo Valente
@ 2017-01-18 16:14               ` Paolo Valente
  2017-01-18 16:21                 ` Jens Axboe
  0 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2017-01-18 16:14 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Jens Axboe, linux-block, Linux-Kernal, Omar Sandoval,
	Linus Walleij, Ulf Hansson, Mark Brown


> Il giorno 17 gen 2017, alle ore 11:49, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
> [NEW RESEND ATTEMPT]
> 
>> Il giorno 17 gen 2017, alle ore 03:47, Jens Axboe <axboe@fb.com> ha scritto:
>> 
>> On 12/22/2016 08:28 AM, Paolo Valente wrote:
>>> 
>>>> Il giorno 19 dic 2016, alle ore 22:05, Jens Axboe <axboe@fb.com> ha scritto:
>>>> 
>>>> On 12/19/2016 11:21 AM, Paolo Valente wrote:
>>>>> 
>>>>>> Il giorno 19 dic 2016, alle ore 16:20, Jens Axboe <axboe@fb.com> ha scritto:
>>>>>> 
>>>>>> On 12/19/2016 04:32 AM, Paolo Valente wrote:
>>>>>>> 
>>>>>>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>>>>>>> 
>>>>>>>> This is version 4 of this patchset, version 3 was posted here:
>>>>>>>> 
>>>>>>>> https://marc.info/?l=linux-block&m=148178513407631&w=2
>>>>>>>> 
>>>>>>>> From the discussion last time, I looked into the feasibility of having
>>>>>>>> two sets of tags for the same request pool, to avoid having to copy
>>>>>>>> some of the request fields at dispatch and completion time. To do that,
>>>>>>>> we'd have to replace the driver tag map(s) with our own, and augment
>>>>>>>> that with tag map(s) on the side representing the device queue depth.
>>>>>>>> Queuing IO with the scheduler would allocate from the new map, and
>>>>>>>> dispatching would acquire the "real" tag. We would need to change
>>>>>>>> drivers to do this, or add an extra indirection table to map a real
>>>>>>>> tag to the scheduler tag. We would also need a 1:1 mapping between
>>>>>>>> scheduler and hardware tag pools, or additional info to track it.
>>>>>>>> Unless someone can convince me otherwise, I think the current approach
>>>>>>>> is cleaner.
>>>>>>>> 
>>>>>>>> I wasn't going to post v4 so soon, but I discovered a bug that led
>>>>>>>> to drastically decreased merging. Especially on rotating storage,
>>>>>>>> this release should be fast, and on par with the merging that we
>>>>>>>> get through the legacy schedulers.
>>>>>>>> 
>>>>>>> 
>>>>>>> I'm about to modify bfq.  You mentioned other missing pieces to come.  Do
>>>>>>> you already have an idea of what they are, so that I am somehow
>>>>>>> prepared for what won't work even if my changes are right?
>>>>>> 
>>>>>> I'm mostly talking about elevator ops hooks that aren't there in the new
>>>>>> framework, but exist in the old one. There should be no hidden
>>>>>> surprises, if that's what you are worried about.
>>>>>> 
>>>>>> On the ops side, the only ones I can think of are the activate and
>>>>>> deactivate, and those can be done in the dispatch_request hook for
>>>>>> activate, and put/requeue for deactivate.
>>>>>> 
>>>>> 
>>>>> You mean that there is no conceptual problem in moving the code of the
>>>>> activate interface function into the dispatch function, and the code
>>>>> of the deactivate into the put_request? (for a requeue it is a little
>>>>> less clear to me, so one step at a time)  Or am I missing
>>>>> something more complex?
>>>> 
>>>> Yes, what I mean is that there isn't a 1:1 mapping between the old ops
>>>> and the new ops. So you'll have to consider the cases.
>>>> 
>>>> 
>>> 
>>> Problem: whereas it seems easy and safe to do somewhere else the
>>> simple increment that was done in activate_request, I wonder if it may
>>> happen that a request is deactivated before being completed.  If that
>>> may happen, then, without a deactivate_request hook, the increments would
>>> remain unbalanced.  Or are request completions always guaranteed unless
>>> some hw/sw component breaks?
>> 
>> You should be able to do it in get/put_request. But you might need some
>> extra tracking, I'd need to double check.
> 
> Exactly, AFAICT something extra is apparently needed.  In particular,
> get is not ok, because dispatch is a different event (though dispatch
> is already a controlled event), while put could be used, provided
> that it is guaranteed to be executed only after dispatch.  If it is
> not, then I think that an extra flag or something should be added to
> the request.  I don't know whether adding this extra piece would be
> worse than adding an extra hook.
> 
>> 
>> I'm trying to avoid adding
>> hooks that we don't truly need, the old interface had a lot of that. If
>> you find that you need a hook and it isn't there, feel free to add it.
>> activate/deactivate might be a good change.
>> 
> 
> If my comments above do not trigger any proposal of a better solution,
> then I will try adding just one extra 'deactivate' hook.  Unless
> unbalanced hooks are a bad idea too.
> 

Jens,
according to the function blk_mq_sched_put_request, the
mq.completed_request hook seems to always be invoked (if set) for a
request for which the mq.put_rq_priv is invoked (if set).

Unless you warn me that I'm wrong, I will rely on the above
assumption, and complete bfq without any additional hook or flag.

Thanks,
Paolo

> Thanks,
> Paolo
> 
>> -- 
>> Jens Axboe
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2017-01-18 16:14               ` Paolo Valente
@ 2017-01-18 16:21                 ` Jens Axboe
  2017-01-23 17:04                   ` Paolo Valente
  0 siblings, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2017-01-18 16:21 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, linux-block, Linux-Kernal, Omar Sandoval,
	Linus Walleij, Ulf Hansson, Mark Brown

On 01/18/2017 08:14 AM, Paolo Valente wrote:
> according to the function blk_mq_sched_put_request, the
> mq.completed_request hook seems to always be invoked (if set) for a
> request for which the mq.put_rq_priv is invoked (if set).

Correct, any request that came out of blk_mq_sched_get_request()
will always have completed called on it, regardless of whether it
had IO started on it or not.
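
Given that, a per-scheduler in-flight count can be kept balanced with
just those two hooks. A minimal sketch, modeled on dd_get_request()
from patch 7/8 (sched_data and its fields are placeholders, and the
flush/FUA fallback is left out):

	static struct request *sched_get_request(struct request_queue *q,
						 unsigned int op,
						 struct blk_mq_alloc_data *data)
	{
		struct sched_data *sd = q->elevator->elevator_data;
		struct request *rq;

		rq = blk_mq_sched_alloc_shadow_request(q, data, sd->tags,
						       &sd->wait_index);
		if (rq) {
			blk_mq_rq_ctx_init(q, data->ctx, rq, op);
			atomic_inc(&sd->in_flight);
		}
		return rq;
	}

	static void sched_completed_request(struct blk_mq_hw_ctx *hctx,
					    struct request *rq)
	{
		struct sched_data *sd = hctx->queue->elevator->elevator_data;

		/* pairs with the atomic_inc() above */
		atomic_dec(&sd->in_flight);
	}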

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2017-01-17  2:47     ` Jens Axboe
@ 2017-01-20 11:07       ` Paolo Valente
  2017-01-20 14:26         ` Jens Axboe
  0 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2017-01-20 11:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 17 gen 2017, alle ore 03:47, Jens Axboe <axboe@fb.com> ha scritto:
> 
> On 12/22/2016 09:49 AM, Paolo Valente wrote:
>> 
>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>> 
>>> This is basically identical to deadline-iosched, except it registers
>>> as a MQ capable scheduler. This is still a single queue design.
>>> 
>> 
>> One last question (for today ...): in mq-deadline there are no
>> "schedule dispatch" or "unplug work" functions.  In blk, CFQ and BFQ
>> do these schedules/unplugs in a lot of cases.  What's the right
>> replacement?  Just doing nothing?
> 
> You just use blk_mq_run_hw_queue() or variants thereof to kick off queue
> runs.
> 

Hi Jens,
I'm working on this right now.  I have a pair of quick questions about
performance.

In the blk version of bfq, if the in-service bfq_queue happens to have
no more budget when the bfq dispatch function is invoked, then bfq:
returns no request (NULL), immediately expires the in-service
bfq_queue, and schedules a new dispatch.  The third step is taken so
that, if other bfq_queues have requests, then a new in-service
bfq_queue will be selected on the upcoming new dispatch, and a new
request will be provided right away.

My questions are: is this dispatch-schedule step still needed with
blk-mq, to avoid a stall?  If it is not needed to avoid a stall, would
it still be needed to boost throughput, because it would force an
immediate next dispatch?
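
In case it is still useful, I guess the blk-mq equivalent of that third
step would be just another queue run from inside the dispatch hook
(which gets the hctx as argument, as in __dd_dispatch_request());
something like this fragment, where the bfq helper names are
hypothetical:

	/* in-service queue out of budget: expire it, ask blk-mq for a new
	 * dispatch so another queue gets selected, and return no request
	 */
	if (bfq_bfqq_budget_exhausted(bfqd, in_service_bfqq)) {
		bfq_bfqq_expire(bfqd, in_service_bfqq);
		blk_mq_run_hw_queue(hctx, true);
		return NULL;
	}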

BTW, bfq-mq survived its first request completion.  I will provide you
with a link to a github branch as soon as bfq-mq seems able to stand
up with a minimal workload.

Thanks,
Paolo

> -- 
> Jens Axboe
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-17  0:12 ` [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
                     ` (3 preceding siblings ...)
  2016-12-22 16:49   ` Paolo Valente
@ 2017-01-20 13:14   ` Paolo Valente
  2017-01-20 13:18     ` Paolo Valente
  2017-01-20 14:28     ` Jens Axboe
  2017-02-01 11:11   ` Paolo Valente
                     ` (2 subsequent siblings)
  7 siblings, 2 replies; 69+ messages in thread
From: Paolo Valente @ 2017-01-20 13:14 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
> 
> This is basically identical to deadline-iosched, except it registers
> as a MQ capable scheduler. This is still a single queue design.
> 

Jens,
no spin_lock_irq* in the code.  So, also request dispatches are
guaranteed to never be executed in IRQ context?  I'm asking this
question to understand whether I'm missing something that, even in
BFQ, would somehow allow me to not disable irqs in critical sections,
even if there is the slice_idle-expiration handler.  Be patient with
my ignorance.
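
To make the question concrete, this is the pattern I have in mind
(a generic sketch, not code from this patchset):

	/*
	 * If an expiration handler running in timer/irq context also takes
	 * sd->lock, process-context paths need an irq- (or at least bh-)
	 * disabling lock; without such a handler, as in mq-deadline below,
	 * plain spin_lock() is enough.
	 */
	static void sched_do_work(struct sched_data *sd)
	{
		unsigned long flags;

		spin_lock_irqsave(&sd->lock, flags);
		/* critical section shared with the expiration handler */
		spin_unlock_irqrestore(&sd->lock, flags);
	}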

Thanks,
Paolo

> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---
> block/Kconfig.iosched |   6 +
> block/Makefile        |   1 +
> block/mq-deadline.c   | 649 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 656 insertions(+)
> create mode 100644 block/mq-deadline.c
> 
> diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
> index 421bef9c4c48..490ef2850fae 100644
> --- a/block/Kconfig.iosched
> +++ b/block/Kconfig.iosched
> @@ -32,6 +32,12 @@ config IOSCHED_CFQ
> 
> 	  This is the default I/O scheduler.
> 
> +config MQ_IOSCHED_DEADLINE
> +	tristate "MQ deadline I/O scheduler"
> +	default y
> +	---help---
> +	  MQ version of the deadline IO scheduler.
> +
> config CFQ_GROUP_IOSCHED
> 	bool "CFQ Group Scheduling support"
> 	depends on IOSCHED_CFQ && BLK_CGROUP
> diff --git a/block/Makefile b/block/Makefile
> index 2eee9e1bb6db..3ee0abd7205a 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
> obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
> obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
> obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
> +obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
> 
> obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
> obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> new file mode 100644
> index 000000000000..3cb9de21ab21
> --- /dev/null
> +++ b/block/mq-deadline.c
> @@ -0,0 +1,649 @@
> +/*
> + *  MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
> + *  for the blk-mq scheduling framework
> + *
> + *  Copyright (C) 2016 Jens Axboe <axboe@kernel.dk>
> + */
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/blkdev.h>
> +#include <linux/blk-mq.h>
> +#include <linux/elevator.h>
> +#include <linux/bio.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/init.h>
> +#include <linux/compiler.h>
> +#include <linux/rbtree.h>
> +#include <linux/sbitmap.h>
> +
> +#include "blk.h"
> +#include "blk-mq.h"
> +#include "blk-mq-tag.h"
> +#include "blk-mq-sched.h"
> +
> +static unsigned int queue_depth = 256;
> +module_param(queue_depth, uint, 0644);
> +MODULE_PARM_DESC(queue_depth, "Use this value as the scheduler queue depth");
> +
> +/*
> + * See Documentation/block/deadline-iosched.txt
> + */
> +static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
> +static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
> +static const int writes_starved = 2;    /* max times reads can starve a write */
> +static const int fifo_batch = 16;       /* # of sequential requests treated as one
> +				     by the above parameters. For throughput. */
> +
> +struct deadline_data {
> +	/*
> +	 * run time data
> +	 */
> +
> +	/*
> +	 * requests (deadline_rq s) are present on both sort_list and fifo_list
> +	 */
> +	struct rb_root sort_list[2];
> +	struct list_head fifo_list[2];
> +
> +	/*
> +	 * next in sort order. read, write or both are NULL
> +	 */
> +	struct request *next_rq[2];
> +	unsigned int batching;		/* number of sequential requests made */
> +	unsigned int starved;		/* times reads have starved writes */
> +
> +	/*
> +	 * settings that change how the i/o scheduler behaves
> +	 */
> +	int fifo_expire[2];
> +	int fifo_batch;
> +	int writes_starved;
> +	int front_merges;
> +
> +	spinlock_t lock;
> +	struct list_head dispatch;
> +	struct blk_mq_tags *tags;
> +	atomic_t wait_index;
> +};
> +
> +static inline struct rb_root *
> +deadline_rb_root(struct deadline_data *dd, struct request *rq)
> +{
> +	return &dd->sort_list[rq_data_dir(rq)];
> +}
> +
> +/*
> + * get the request after `rq' in sector-sorted order
> + */
> +static inline struct request *
> +deadline_latter_request(struct request *rq)
> +{
> +	struct rb_node *node = rb_next(&rq->rb_node);
> +
> +	if (node)
> +		return rb_entry_rq(node);
> +
> +	return NULL;
> +}
> +
> +static void
> +deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
> +{
> +	struct rb_root *root = deadline_rb_root(dd, rq);
> +
> +	elv_rb_add(root, rq);
> +}
> +
> +static inline void
> +deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
> +{
> +	const int data_dir = rq_data_dir(rq);
> +
> +	if (dd->next_rq[data_dir] == rq)
> +		dd->next_rq[data_dir] = deadline_latter_request(rq);
> +
> +	elv_rb_del(deadline_rb_root(dd, rq), rq);
> +}
> +
> +/*
> + * remove rq from rbtree and fifo.
> + */
> +static void deadline_remove_request(struct request_queue *q, struct request *rq)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	list_del_init(&rq->queuelist);
> +
> +	/*
> +	 * We might not be on the rbtree, if we are doing an insert merge
> +	 */
> +	if (!RB_EMPTY_NODE(&rq->rb_node))
> +		deadline_del_rq_rb(dd, rq);
> +
> +	elv_rqhash_del(q, rq);
> +	if (q->last_merge == rq)
> +		q->last_merge = NULL;
> +}
> +
> +static void dd_request_merged(struct request_queue *q, struct request *req,
> +			      int type)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	/*
> +	 * if the merge was a front merge, we need to reposition request
> +	 */
> +	if (type == ELEVATOR_FRONT_MERGE) {
> +		elv_rb_del(deadline_rb_root(dd, req), req);
> +		deadline_add_rq_rb(dd, req);
> +	}
> +}
> +
> +static void dd_merged_requests(struct request_queue *q, struct request *req,
> +			       struct request *next)
> +{
> +	/*
> +	 * if next expires before rq, assign its expire time to rq
> +	 * and move into next position (next will be deleted) in fifo
> +	 */
> +	if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
> +		if (time_before((unsigned long)next->fifo_time,
> +				(unsigned long)req->fifo_time)) {
> +			list_move(&req->queuelist, &next->queuelist);
> +			req->fifo_time = next->fifo_time;
> +		}
> +	}
> +
> +	/*
> +	 * kill knowledge of next, this one is a goner
> +	 */
> +	deadline_remove_request(q, next);
> +}
> +
> +/*
> + * move an entry to dispatch queue
> + */
> +static void
> +deadline_move_request(struct deadline_data *dd, struct request *rq)
> +{
> +	const int data_dir = rq_data_dir(rq);
> +
> +	dd->next_rq[READ] = NULL;
> +	dd->next_rq[WRITE] = NULL;
> +	dd->next_rq[data_dir] = deadline_latter_request(rq);
> +
> +	/*
> +	 * take it off the sort and fifo list
> +	 */
> +	deadline_remove_request(rq->q, rq);
> +}
> +
> +/*
> + * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
> + * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
> + */
> +static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
> +{
> +	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
> +
> +	/*
> +	 * rq is expired!
> +	 */
> +	if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
> +		return 1;
> +
> +	return 0;
> +}
> +
> +/*
> + * deadline_dispatch_requests selects the best request according to
> + * read/write expire, fifo_batch, etc
> + */
> +static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +	struct request *rq;
> +	bool reads, writes;
> +	int data_dir;
> +
> +	spin_lock(&dd->lock);
> +
> +	if (!list_empty(&dd->dispatch)) {
> +		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		goto done;
> +	}
> +
> +	reads = !list_empty(&dd->fifo_list[READ]);
> +	writes = !list_empty(&dd->fifo_list[WRITE]);
> +
> +	/*
> +	 * batches are currently reads XOR writes
> +	 */
> +	if (dd->next_rq[WRITE])
> +		rq = dd->next_rq[WRITE];
> +	else
> +		rq = dd->next_rq[READ];
> +
> +	if (rq && dd->batching < dd->fifo_batch)
> +		/* we have a next request and are still entitled to batch */
> +		goto dispatch_request;
> +
> +	/*
> +	 * at this point we are not running a batch. select the appropriate
> +	 * data direction (read / write)
> +	 */
> +
> +	if (reads) {
> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
> +
> +		if (writes && (dd->starved++ >= dd->writes_starved))
> +			goto dispatch_writes;
> +
> +		data_dir = READ;
> +
> +		goto dispatch_find_request;
> +	}
> +
> +	/*
> +	 * there are either no reads or writes have been starved
> +	 */
> +
> +	if (writes) {
> +dispatch_writes:
> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
> +
> +		dd->starved = 0;
> +
> +		data_dir = WRITE;
> +
> +		goto dispatch_find_request;
> +	}
> +
> +	spin_unlock(&dd->lock);
> +	return NULL;
> +
> +dispatch_find_request:
> +	/*
> +	 * we are not running a batch, find best request for selected data_dir
> +	 */
> +	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
> +		/*
> +		 * A deadline has expired, the last request was in the other
> +		 * direction, or we have run out of higher-sectored requests.
> +		 * Start again from the request with the earliest expiry time.
> +		 */
> +		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> +	} else {
> +		/*
> +		 * The last req was the same dir and we have a next request in
> +		 * sort order. No expired requests so continue on from here.
> +		 */
> +		rq = dd->next_rq[data_dir];
> +	}
> +
> +	dd->batching = 0;
> +
> +dispatch_request:
> +	/*
> +	 * rq is the selected appropriate request.
> +	 */
> +	dd->batching++;
> +	deadline_move_request(dd, rq);
> +done:
> +	rq->rq_flags |= RQF_STARTED;
> +	spin_unlock(&dd->lock);
> +	return rq;
> +}
> +
> +static void dd_dispatch_requests(struct blk_mq_hw_ctx *hctx,
> +				 struct list_head *rq_list)
> +{
> +	blk_mq_sched_dispatch_shadow_requests(hctx, rq_list, __dd_dispatch_request);
> +}
> +
> +static void dd_exit_queue(struct elevator_queue *e)
> +{
> +	struct deadline_data *dd = e->elevator_data;
> +
> +	BUG_ON(!list_empty(&dd->fifo_list[READ]));
> +	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
> +
> +	blk_mq_sched_free_requests(dd->tags);
> +	kfree(dd);
> +}
> +
> +/*
> + * initialize elevator private data (deadline_data).
> + */
> +static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
> +{
> +	struct deadline_data *dd;
> +	struct elevator_queue *eq;
> +
> +	eq = elevator_alloc(q, e);
> +	if (!eq)
> +		return -ENOMEM;
> +
> +	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
> +	if (!dd) {
> +		kobject_put(&eq->kobj);
> +		return -ENOMEM;
> +	}
> +	eq->elevator_data = dd;
> +
> +	dd->tags = blk_mq_sched_alloc_requests(queue_depth, q->node);
> +	if (!dd->tags) {
> +		kfree(dd);
> +		kobject_put(&eq->kobj);
> +		return -ENOMEM;
> +	}
> +
> +	INIT_LIST_HEAD(&dd->fifo_list[READ]);
> +	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
> +	dd->sort_list[READ] = RB_ROOT;
> +	dd->sort_list[WRITE] = RB_ROOT;
> +	dd->fifo_expire[READ] = read_expire;
> +	dd->fifo_expire[WRITE] = write_expire;
> +	dd->writes_starved = writes_starved;
> +	dd->front_merges = 1;
> +	dd->fifo_batch = fifo_batch;
> +	spin_lock_init(&dd->lock);
> +	INIT_LIST_HEAD(&dd->dispatch);
> +	atomic_set(&dd->wait_index, 0);
> +
> +	q->elevator = eq;
> +	return 0;
> +}
> +
> +static int dd_request_merge(struct request_queue *q, struct request **rq,
> +			    struct bio *bio)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	sector_t sector = bio_end_sector(bio);
> +	struct request *__rq;
> +
> +	if (!dd->front_merges)
> +		return ELEVATOR_NO_MERGE;
> +
> +	__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
> +	if (__rq) {
> +		BUG_ON(sector != blk_rq_pos(__rq));
> +
> +		if (elv_bio_merge_ok(__rq, bio)) {
> +			*rq = __rq;
> +			return ELEVATOR_FRONT_MERGE;
> +		}
> +	}
> +
> +	return ELEVATOR_NO_MERGE;
> +}
> +
> +static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	int ret;
> +
> +	spin_lock(&dd->lock);
> +	ret = blk_mq_sched_try_merge(q, bio);
> +	spin_unlock(&dd->lock);
> +
> +	return ret;
> +}
> +
> +/*
> + * add rq to rbtree and fifo
> + */
> +static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> +			      bool at_head)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	const int data_dir = rq_data_dir(rq);
> +
> +	if (blk_mq_sched_try_insert_merge(q, rq))
> +		return;
> +
> +	blk_mq_sched_request_inserted(rq);
> +
> +	/*
> +	 * If we're trying to insert a real request, just send it directly
> +	 * to the hardware dispatch list. This only happens for a requeue,
> +	 * or FUA/FLUSH requests.
> +	 */
> +	if (!blk_mq_sched_rq_is_shadow(rq)) {
> +		spin_lock(&hctx->lock);
> +		list_add_tail(&rq->queuelist, &hctx->dispatch);
> +		spin_unlock(&hctx->lock);
> +		return;
> +	}
> +
> +	if (at_head || rq->cmd_type != REQ_TYPE_FS) {
> +		if (at_head)
> +			list_add(&rq->queuelist, &dd->dispatch);
> +		else
> +			list_add_tail(&rq->queuelist, &dd->dispatch);
> +	} else {
> +		deadline_add_rq_rb(dd, rq);
> +
> +		if (rq_mergeable(rq)) {
> +			elv_rqhash_add(q, rq);
> +			if (!q->last_merge)
> +				q->last_merge = rq;
> +		}
> +
> +		/*
> +		 * set expire time and add to fifo list
> +		 */
> +		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
> +		list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
> +	}
> +}
> +
> +static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
> +			       struct list_head *list, bool at_head)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	spin_lock(&dd->lock);
> +	while (!list_empty(list)) {
> +		struct request *rq;
> +
> +		rq = list_first_entry(list, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		dd_insert_request(hctx, rq, at_head);
> +	}
> +	spin_unlock(&dd->lock);
> +}
> +
> +static struct request *dd_get_request(struct request_queue *q, unsigned int op,
> +				      struct blk_mq_alloc_data *data)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	struct request *rq;
> +
> +	/*
> +	 * The flush machinery intercepts before we insert the request. As
> +	 * a work-around, just hand it back a real request.
> +	 */
> +	if (unlikely(op & (REQ_PREFLUSH | REQ_FUA)))
> +		rq = __blk_mq_alloc_request(data, op);
> +	else {
> +		rq = blk_mq_sched_alloc_shadow_request(q, data, dd->tags, &dd->wait_index);
> +		if (rq)
> +			blk_mq_rq_ctx_init(q, data->ctx, rq, op);
> +	}
> +
> +	return rq;
> +}
> +
> +static bool dd_put_request(struct request *rq)
> +{
> +	/*
> +	 * If it's a real request, we just have to free it. For a shadow
> +	 * request, we should only free it if we haven't started it. A
> +	 * started request is mapped to a real one, and the real one will
> +	 * free it. We can get here with request merges, since we then
> +	 * free the request before we start/issue it.
> +	 */
> +	if (!blk_mq_sched_rq_is_shadow(rq))
> +		return false;
> +
> +	if (!(rq->rq_flags & RQF_STARTED)) {
> +		struct request_queue *q = rq->q;
> +		struct deadline_data *dd = q->elevator->elevator_data;
> +
> +		/*
> +		 * IO completion would normally do this, but if we merge
> +		 * and free before we issue the request, drop both the
> +		 * tag and queue ref
> +		 */
> +		blk_mq_sched_free_shadow_request(dd->tags, rq);
> +		blk_queue_exit(q);
> +	}
> +
> +	return true;
> +}
> +
> +static void dd_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
> +{
> +	struct request *sched_rq = rq->end_io_data;
> +
> +	/*
> +	 * sched_rq can be NULL, if we haven't setup the shadow yet
> +	 * because we failed getting one.
> +	 */
> +	if (sched_rq) {
> +		struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +
> +		blk_mq_sched_free_shadow_request(dd->tags, sched_rq);
> +		blk_mq_start_stopped_hw_queue(hctx, true);
> +	}
> +}
> +
> +static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +
> +	return !list_empty_careful(&dd->dispatch) ||
> +		!list_empty_careful(&dd->fifo_list[0]) ||
> +		!list_empty_careful(&dd->fifo_list[1]);
> +}
> +
> +/*
> + * sysfs parts below
> + */
> +static ssize_t
> +deadline_var_show(int var, char *page)
> +{
> +	return sprintf(page, "%d\n", var);
> +}
> +
> +static ssize_t
> +deadline_var_store(int *var, const char *page, size_t count)
> +{
> +	char *p = (char *) page;
> +
> +	*var = simple_strtol(p, &p, 10);
> +	return count;
> +}
> +
> +#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
> +static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
> +{									\
> +	struct deadline_data *dd = e->elevator_data;			\
> +	int __data = __VAR;						\
> +	if (__CONV)							\
> +		__data = jiffies_to_msecs(__data);			\
> +	return deadline_var_show(__data, (page));			\
> +}
> +SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
> +SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
> +SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
> +SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
> +SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
> +#undef SHOW_FUNCTION
> +
> +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
> +static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
> +{									\
> +	struct deadline_data *dd = e->elevator_data;			\
> +	int __data;							\
> +	int ret = deadline_var_store(&__data, (page), count);		\
> +	if (__data < (MIN))						\
> +		__data = (MIN);						\
> +	else if (__data > (MAX))					\
> +		__data = (MAX);						\
> +	if (__CONV)							\
> +		*(__PTR) = msecs_to_jiffies(__data);			\
> +	else								\
> +		*(__PTR) = __data;					\
> +	return ret;							\
> +}
> +STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
> +STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
> +STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
> +#undef STORE_FUNCTION
> +
> +#define DD_ATTR(name) \
> +	__ATTR(name, S_IRUGO|S_IWUSR, deadline_##name##_show, \
> +				      deadline_##name##_store)
> +
> +static struct elv_fs_entry deadline_attrs[] = {
> +	DD_ATTR(read_expire),
> +	DD_ATTR(write_expire),
> +	DD_ATTR(writes_starved),
> +	DD_ATTR(front_merges),
> +	DD_ATTR(fifo_batch),
> +	__ATTR_NULL
> +};
> +
> +static struct elevator_type mq_deadline = {
> +	.ops.mq = {
> +		.get_request		= dd_get_request,
> +		.put_request		= dd_put_request,
> +		.insert_requests	= dd_insert_requests,
> +		.dispatch_requests	= dd_dispatch_requests,
> +		.completed_request	= dd_completed_request,
> +		.next_request		= elv_rb_latter_request,
> +		.former_request		= elv_rb_former_request,
> +		.bio_merge		= dd_bio_merge,
> +		.request_merge		= dd_request_merge,
> +		.requests_merged	= dd_merged_requests,
> +		.request_merged		= dd_request_merged,
> +		.has_work		= dd_has_work,
> +		.init_sched		= dd_init_queue,
> +		.exit_sched		= dd_exit_queue,
> +	},
> +
> +	.uses_mq	= true,
> +	.elevator_attrs = deadline_attrs,
> +	.elevator_name = "mq-deadline",
> +	.elevator_owner = THIS_MODULE,
> +};
> +
> +static int __init deadline_init(void)
> +{
> +	if (!queue_depth) {
> +		pr_err("mq-deadline: queue depth must be > 0\n");
> +		return -EINVAL;
> +	}
> +	return elv_register(&mq_deadline);
> +}
> +
> +static void __exit deadline_exit(void)
> +{
> +	elv_unregister(&mq_deadline);
> +}
> +
> +module_init(deadline_init);
> +module_exit(deadline_exit);
> +
> +MODULE_AUTHOR("Jens Axboe");
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("MQ deadline IO scheduler");
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2017-01-20 13:14   ` Paolo Valente
@ 2017-01-20 13:18     ` Paolo Valente
  2017-01-20 14:28       ` Jens Axboe
  2017-01-20 14:28     ` Jens Axboe
  1 sibling, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2017-01-20 13:18 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 20 gen 2017, alle ore 14:14, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
>> 
>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>> 
>> This is basically identical to deadline-iosched, except it registers
>> as a MQ capable scheduler. This is still a single queue design.
>> 
> 
> Jens,
> no spin_lock_irq* in the code.  So are request dispatches, too,
> guaranteed never to be executed in IRQ context?

Or maybe the opposite? That is, are all scheduler functions invoked in IRQ context?

Thanks,
Paolo

>  I'm asking this
> question to understand whether I'm missing something that, even in
> BFQ, would somehow allow me to not disable irqs in critical sections,
> even if there is the slice_idle-expiration handler.  Be patient with
> my ignorance.
> 
> Thanks,
> Paolo
> 
>> Signed-off-by: Jens Axboe <axboe@fb.com>
>> ---
>> block/Kconfig.iosched |   6 +
>> block/Makefile        |   1 +
>> block/mq-deadline.c   | 649 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 656 insertions(+)
>> create mode 100644 block/mq-deadline.c
>> 
>> diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
>> index 421bef9c4c48..490ef2850fae 100644
>> --- a/block/Kconfig.iosched
>> +++ b/block/Kconfig.iosched
>> @@ -32,6 +32,12 @@ config IOSCHED_CFQ
>> 
>> 	  This is the default I/O scheduler.
>> 
>> +config MQ_IOSCHED_DEADLINE
>> +	tristate "MQ deadline I/O scheduler"
>> +	default y
>> +	---help---
>> +	  MQ version of the deadline IO scheduler.
>> +
>> config CFQ_GROUP_IOSCHED
>> 	bool "CFQ Group Scheduling support"
>> 	depends on IOSCHED_CFQ && BLK_CGROUP
>> diff --git a/block/Makefile b/block/Makefile
>> index 2eee9e1bb6db..3ee0abd7205a 100644
>> --- a/block/Makefile
>> +++ b/block/Makefile
>> @@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
>> obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
>> obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
>> obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
>> +obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
>> 
>> obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
>> obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
>> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
>> new file mode 100644
>> index 000000000000..3cb9de21ab21
>> --- /dev/null
>> +++ b/block/mq-deadline.c
>> @@ -0,0 +1,649 @@
>> +/*
>> + *  MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
>> + *  for the blk-mq scheduling framework
>> + *
>> + *  Copyright (C) 2016 Jens Axboe <axboe@kernel.dk>
>> + */
>> +#include <linux/kernel.h>
>> +#include <linux/fs.h>
>> +#include <linux/blkdev.h>
>> +#include <linux/blk-mq.h>
>> +#include <linux/elevator.h>
>> +#include <linux/bio.h>
>> +#include <linux/module.h>
>> +#include <linux/slab.h>
>> +#include <linux/init.h>
>> +#include <linux/compiler.h>
>> +#include <linux/rbtree.h>
>> +#include <linux/sbitmap.h>
>> +
>> +#include "blk.h"
>> +#include "blk-mq.h"
>> +#include "blk-mq-tag.h"
>> +#include "blk-mq-sched.h"
>> +
>> +static unsigned int queue_depth = 256;
>> +module_param(queue_depth, uint, 0644);
>> +MODULE_PARM_DESC(queue_depth, "Use this value as the scheduler queue depth");
>> +
>> +/*
>> + * See Documentation/block/deadline-iosched.txt
>> + */
>> +static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
>> +static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
>> +static const int writes_starved = 2;    /* max times reads can starve a write */
>> +static const int fifo_batch = 16;       /* # of sequential requests treated as one
>> +				     by the above parameters. For throughput. */
>> +
>> +struct deadline_data {
>> +	/*
>> +	 * run time data
>> +	 */
>> +
>> +	/*
>> +	 * requests (deadline_rq s) are present on both sort_list and fifo_list
>> +	 */
>> +	struct rb_root sort_list[2];
>> +	struct list_head fifo_list[2];
>> +
>> +	/*
>> +	 * next in sort order. read, write or both are NULL
>> +	 */
>> +	struct request *next_rq[2];
>> +	unsigned int batching;		/* number of sequential requests made */
>> +	unsigned int starved;		/* times reads have starved writes */
>> +
>> +	/*
>> +	 * settings that change how the i/o scheduler behaves
>> +	 */
>> +	int fifo_expire[2];
>> +	int fifo_batch;
>> +	int writes_starved;
>> +	int front_merges;
>> +
>> +	spinlock_t lock;
>> +	struct list_head dispatch;
>> +	struct blk_mq_tags *tags;
>> +	atomic_t wait_index;
>> +};
>> +
>> +static inline struct rb_root *
>> +deadline_rb_root(struct deadline_data *dd, struct request *rq)
>> +{
>> +	return &dd->sort_list[rq_data_dir(rq)];
>> +}
>> +
>> +/*
>> + * get the request after `rq' in sector-sorted order
>> + */
>> +static inline struct request *
>> +deadline_latter_request(struct request *rq)
>> +{
>> +	struct rb_node *node = rb_next(&rq->rb_node);
>> +
>> +	if (node)
>> +		return rb_entry_rq(node);
>> +
>> +	return NULL;
>> +}
>> +
>> +static void
>> +deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
>> +{
>> +	struct rb_root *root = deadline_rb_root(dd, rq);
>> +
>> +	elv_rb_add(root, rq);
>> +}
>> +
>> +static inline void
>> +deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
>> +{
>> +	const int data_dir = rq_data_dir(rq);
>> +
>> +	if (dd->next_rq[data_dir] == rq)
>> +		dd->next_rq[data_dir] = deadline_latter_request(rq);
>> +
>> +	elv_rb_del(deadline_rb_root(dd, rq), rq);
>> +}
>> +
>> +/*
>> + * remove rq from rbtree and fifo.
>> + */
>> +static void deadline_remove_request(struct request_queue *q, struct request *rq)
>> +{
>> +	struct deadline_data *dd = q->elevator->elevator_data;
>> +
>> +	list_del_init(&rq->queuelist);
>> +
>> +	/*
>> +	 * We might not be on the rbtree, if we are doing an insert merge
>> +	 */
>> +	if (!RB_EMPTY_NODE(&rq->rb_node))
>> +		deadline_del_rq_rb(dd, rq);
>> +
>> +	elv_rqhash_del(q, rq);
>> +	if (q->last_merge == rq)
>> +		q->last_merge = NULL;
>> +}
>> +
>> +static void dd_request_merged(struct request_queue *q, struct request *req,
>> +			      int type)
>> +{
>> +	struct deadline_data *dd = q->elevator->elevator_data;
>> +
>> +	/*
>> +	 * if the merge was a front merge, we need to reposition request
>> +	 */
>> +	if (type == ELEVATOR_FRONT_MERGE) {
>> +		elv_rb_del(deadline_rb_root(dd, req), req);
>> +		deadline_add_rq_rb(dd, req);
>> +	}
>> +}
>> +
>> +static void dd_merged_requests(struct request_queue *q, struct request *req,
>> +			       struct request *next)
>> +{
>> +	/*
>> +	 * if next expires before rq, assign its expire time to rq
>> +	 * and move into next position (next will be deleted) in fifo
>> +	 */
>> +	if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
>> +		if (time_before((unsigned long)next->fifo_time,
>> +				(unsigned long)req->fifo_time)) {
>> +			list_move(&req->queuelist, &next->queuelist);
>> +			req->fifo_time = next->fifo_time;
>> +		}
>> +	}
>> +
>> +	/*
>> +	 * kill knowledge of next, this one is a goner
>> +	 */
>> +	deadline_remove_request(q, next);
>> +}
>> +
>> +/*
>> + * move an entry to dispatch queue
>> + */
>> +static void
>> +deadline_move_request(struct deadline_data *dd, struct request *rq)
>> +{
>> +	const int data_dir = rq_data_dir(rq);
>> +
>> +	dd->next_rq[READ] = NULL;
>> +	dd->next_rq[WRITE] = NULL;
>> +	dd->next_rq[data_dir] = deadline_latter_request(rq);
>> +
>> +	/*
>> +	 * take it off the sort and fifo list
>> +	 */
>> +	deadline_remove_request(rq->q, rq);
>> +}
>> +
>> +/*
>> + * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
>> + * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
>> + */
>> +static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
>> +{
>> +	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
>> +
>> +	/*
>> +	 * rq is expired!
>> +	 */
>> +	if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
>> +		return 1;
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * deadline_dispatch_requests selects the best request according to
>> + * read/write expire, fifo_batch, etc
>> + */
>> +static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>> +{
>> +	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>> +	struct request *rq;
>> +	bool reads, writes;
>> +	int data_dir;
>> +
>> +	spin_lock(&dd->lock);
>> +
>> +	if (!list_empty(&dd->dispatch)) {
>> +		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
>> +		list_del_init(&rq->queuelist);
>> +		goto done;
>> +	}
>> +
>> +	reads = !list_empty(&dd->fifo_list[READ]);
>> +	writes = !list_empty(&dd->fifo_list[WRITE]);
>> +
>> +	/*
>> +	 * batches are currently reads XOR writes
>> +	 */
>> +	if (dd->next_rq[WRITE])
>> +		rq = dd->next_rq[WRITE];
>> +	else
>> +		rq = dd->next_rq[READ];
>> +
>> +	if (rq && dd->batching < dd->fifo_batch)
>> +		/* we have a next request and are still entitled to batch */
>> +		goto dispatch_request;
>> +
>> +	/*
>> +	 * at this point we are not running a batch. select the appropriate
>> +	 * data direction (read / write)
>> +	 */
>> +
>> +	if (reads) {
>> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
>> +
>> +		if (writes && (dd->starved++ >= dd->writes_starved))
>> +			goto dispatch_writes;
>> +
>> +		data_dir = READ;
>> +
>> +		goto dispatch_find_request;
>> +	}
>> +
>> +	/*
>> +	 * there are either no reads or writes have been starved
>> +	 */
>> +
>> +	if (writes) {
>> +dispatch_writes:
>> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
>> +
>> +		dd->starved = 0;
>> +
>> +		data_dir = WRITE;
>> +
>> +		goto dispatch_find_request;
>> +	}
>> +
>> +	spin_unlock(&dd->lock);
>> +	return NULL;
>> +
>> +dispatch_find_request:
>> +	/*
>> +	 * we are not running a batch, find best request for selected data_dir
>> +	 */
>> +	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
>> +		/*
>> +		 * A deadline has expired, the last request was in the other
>> +		 * direction, or we have run out of higher-sectored requests.
>> +		 * Start again from the request with the earliest expiry time.
>> +		 */
>> +		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
>> +	} else {
>> +		/*
>> +		 * The last req was the same dir and we have a next request in
>> +		 * sort order. No expired requests so continue on from here.
>> +		 */
>> +		rq = dd->next_rq[data_dir];
>> +	}
>> +
>> +	dd->batching = 0;
>> +
>> +dispatch_request:
>> +	/*
>> +	 * rq is the selected appropriate request.
>> +	 */
>> +	dd->batching++;
>> +	deadline_move_request(dd, rq);
>> +done:
>> +	rq->rq_flags |= RQF_STARTED;
>> +	spin_unlock(&dd->lock);
>> +	return rq;
>> +}
>> +
>> +static void dd_dispatch_requests(struct blk_mq_hw_ctx *hctx,
>> +				 struct list_head *rq_list)
>> +{
>> +	blk_mq_sched_dispatch_shadow_requests(hctx, rq_list, __dd_dispatch_request);
>> +}
>> +
>> +static void dd_exit_queue(struct elevator_queue *e)
>> +{
>> +	struct deadline_data *dd = e->elevator_data;
>> +
>> +	BUG_ON(!list_empty(&dd->fifo_list[READ]));
>> +	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
>> +
>> +	blk_mq_sched_free_requests(dd->tags);
>> +	kfree(dd);
>> +}
>> +
>> +/*
>> + * initialize elevator private data (deadline_data).
>> + */
>> +static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
>> +{
>> +	struct deadline_data *dd;
>> +	struct elevator_queue *eq;
>> +
>> +	eq = elevator_alloc(q, e);
>> +	if (!eq)
>> +		return -ENOMEM;
>> +
>> +	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
>> +	if (!dd) {
>> +		kobject_put(&eq->kobj);
>> +		return -ENOMEM;
>> +	}
>> +	eq->elevator_data = dd;
>> +
>> +	dd->tags = blk_mq_sched_alloc_requests(queue_depth, q->node);
>> +	if (!dd->tags) {
>> +		kfree(dd);
>> +		kobject_put(&eq->kobj);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	INIT_LIST_HEAD(&dd->fifo_list[READ]);
>> +	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
>> +	dd->sort_list[READ] = RB_ROOT;
>> +	dd->sort_list[WRITE] = RB_ROOT;
>> +	dd->fifo_expire[READ] = read_expire;
>> +	dd->fifo_expire[WRITE] = write_expire;
>> +	dd->writes_starved = writes_starved;
>> +	dd->front_merges = 1;
>> +	dd->fifo_batch = fifo_batch;
>> +	spin_lock_init(&dd->lock);
>> +	INIT_LIST_HEAD(&dd->dispatch);
>> +	atomic_set(&dd->wait_index, 0);
>> +
>> +	q->elevator = eq;
>> +	return 0;
>> +}
>> +
>> +static int dd_request_merge(struct request_queue *q, struct request **rq,
>> +			    struct bio *bio)
>> +{
>> +	struct deadline_data *dd = q->elevator->elevator_data;
>> +	sector_t sector = bio_end_sector(bio);
>> +	struct request *__rq;
>> +
>> +	if (!dd->front_merges)
>> +		return ELEVATOR_NO_MERGE;
>> +
>> +	__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
>> +	if (__rq) {
>> +		BUG_ON(sector != blk_rq_pos(__rq));
>> +
>> +		if (elv_bio_merge_ok(__rq, bio)) {
>> +			*rq = __rq;
>> +			return ELEVATOR_FRONT_MERGE;
>> +		}
>> +	}
>> +
>> +	return ELEVATOR_NO_MERGE;
>> +}
>> +
>> +static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
>> +{
>> +	struct request_queue *q = hctx->queue;
>> +	struct deadline_data *dd = q->elevator->elevator_data;
>> +	int ret;
>> +
>> +	spin_lock(&dd->lock);
>> +	ret = blk_mq_sched_try_merge(q, bio);
>> +	spin_unlock(&dd->lock);
>> +
>> +	return ret;
>> +}
>> +
>> +/*
>> + * add rq to rbtree and fifo
>> + */
>> +static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>> +			      bool at_head)
>> +{
>> +	struct request_queue *q = hctx->queue;
>> +	struct deadline_data *dd = q->elevator->elevator_data;
>> +	const int data_dir = rq_data_dir(rq);
>> +
>> +	if (blk_mq_sched_try_insert_merge(q, rq))
>> +		return;
>> +
>> +	blk_mq_sched_request_inserted(rq);
>> +
>> +	/*
>> +	 * If we're trying to insert a real request, just send it directly
>> +	 * to the hardware dispatch list. This only happens for a requeue,
>> +	 * or FUA/FLUSH requests.
>> +	 */
>> +	if (!blk_mq_sched_rq_is_shadow(rq)) {
>> +		spin_lock(&hctx->lock);
>> +		list_add_tail(&rq->queuelist, &hctx->dispatch);
>> +		spin_unlock(&hctx->lock);
>> +		return;
>> +	}
>> +
>> +	if (at_head || rq->cmd_type != REQ_TYPE_FS) {
>> +		if (at_head)
>> +			list_add(&rq->queuelist, &dd->dispatch);
>> +		else
>> +			list_add_tail(&rq->queuelist, &dd->dispatch);
>> +	} else {
>> +		deadline_add_rq_rb(dd, rq);
>> +
>> +		if (rq_mergeable(rq)) {
>> +			elv_rqhash_add(q, rq);
>> +			if (!q->last_merge)
>> +				q->last_merge = rq;
>> +		}
>> +
>> +		/*
>> +		 * set expire time and add to fifo list
>> +		 */
>> +		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
>> +		list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
>> +	}
>> +}
>> +
>> +static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
>> +			       struct list_head *list, bool at_head)
>> +{
>> +	struct request_queue *q = hctx->queue;
>> +	struct deadline_data *dd = q->elevator->elevator_data;
>> +
>> +	spin_lock(&dd->lock);
>> +	while (!list_empty(list)) {
>> +		struct request *rq;
>> +
>> +		rq = list_first_entry(list, struct request, queuelist);
>> +		list_del_init(&rq->queuelist);
>> +		dd_insert_request(hctx, rq, at_head);
>> +	}
>> +	spin_unlock(&dd->lock);
>> +}
>> +
>> +static struct request *dd_get_request(struct request_queue *q, unsigned int op,
>> +				      struct blk_mq_alloc_data *data)
>> +{
>> +	struct deadline_data *dd = q->elevator->elevator_data;
>> +	struct request *rq;
>> +
>> +	/*
>> +	 * The flush machinery intercepts before we insert the request. As
>> +	 * a work-around, just hand it back a real request.
>> +	 */
>> +	if (unlikely(op & (REQ_PREFLUSH | REQ_FUA)))
>> +		rq = __blk_mq_alloc_request(data, op);
>> +	else {
>> +		rq = blk_mq_sched_alloc_shadow_request(q, data, dd->tags, &dd->wait_index);
>> +		if (rq)
>> +			blk_mq_rq_ctx_init(q, data->ctx, rq, op);
>> +	}
>> +
>> +	return rq;
>> +}
>> +
>> +static bool dd_put_request(struct request *rq)
>> +{
>> +	/*
>> +	 * If it's a real request, we just have to free it. For a shadow
>> +	 * request, we should only free it if we haven't started it. A
>> +	 * started request is mapped to a real one, and the real one will
>> +	 * free it. We can get here with request merges, since we then
>> +	 * free the request before we start/issue it.
>> +	 */
>> +	if (!blk_mq_sched_rq_is_shadow(rq))
>> +		return false;
>> +
>> +	if (!(rq->rq_flags & RQF_STARTED)) {
>> +		struct request_queue *q = rq->q;
>> +		struct deadline_data *dd = q->elevator->elevator_data;
>> +
>> +		/*
>> +		 * IO completion would normally do this, but if we merge
>> +		 * and free before we issue the request, drop both the
>> +		 * tag and queue ref
>> +		 */
>> +		blk_mq_sched_free_shadow_request(dd->tags, rq);
>> +		blk_queue_exit(q);
>> +	}
>> +
>> +	return true;
>> +}
>> +
>> +static void dd_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
>> +{
>> +	struct request *sched_rq = rq->end_io_data;
>> +
>> +	/*
>> +	 * sched_rq can be NULL, if we haven't setup the shadow yet
>> +	 * because we failed getting one.
>> +	 */
>> +	if (sched_rq) {
>> +		struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>> +
>> +		blk_mq_sched_free_shadow_request(dd->tags, sched_rq);
>> +		blk_mq_start_stopped_hw_queue(hctx, true);
>> +	}
>> +}
>> +
>> +static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
>> +{
>> +	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>> +
>> +	return !list_empty_careful(&dd->dispatch) ||
>> +		!list_empty_careful(&dd->fifo_list[0]) ||
>> +		!list_empty_careful(&dd->fifo_list[1]);
>> +}
>> +
>> +/*
>> + * sysfs parts below
>> + */
>> +static ssize_t
>> +deadline_var_show(int var, char *page)
>> +{
>> +	return sprintf(page, "%d\n", var);
>> +}
>> +
>> +static ssize_t
>> +deadline_var_store(int *var, const char *page, size_t count)
>> +{
>> +	char *p = (char *) page;
>> +
>> +	*var = simple_strtol(p, &p, 10);
>> +	return count;
>> +}
>> +
>> +#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
>> +static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
>> +{									\
>> +	struct deadline_data *dd = e->elevator_data;			\
>> +	int __data = __VAR;						\
>> +	if (__CONV)							\
>> +		__data = jiffies_to_msecs(__data);			\
>> +	return deadline_var_show(__data, (page));			\
>> +}
>> +SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
>> +SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
>> +SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
>> +SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
>> +SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
>> +#undef SHOW_FUNCTION
>> +
>> +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
>> +static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
>> +{									\
>> +	struct deadline_data *dd = e->elevator_data;			\
>> +	int __data;							\
>> +	int ret = deadline_var_store(&__data, (page), count);		\
>> +	if (__data < (MIN))						\
>> +		__data = (MIN);						\
>> +	else if (__data > (MAX))					\
>> +		__data = (MAX);						\
>> +	if (__CONV)							\
>> +		*(__PTR) = msecs_to_jiffies(__data);			\
>> +	else								\
>> +		*(__PTR) = __data;					\
>> +	return ret;							\
>> +}
>> +STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
>> +STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
>> +STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
>> +STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
>> +STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
>> +#undef STORE_FUNCTION
>> +
>> +#define DD_ATTR(name) \
>> +	__ATTR(name, S_IRUGO|S_IWUSR, deadline_##name##_show, \
>> +				      deadline_##name##_store)
>> +
>> +static struct elv_fs_entry deadline_attrs[] = {
>> +	DD_ATTR(read_expire),
>> +	DD_ATTR(write_expire),
>> +	DD_ATTR(writes_starved),
>> +	DD_ATTR(front_merges),
>> +	DD_ATTR(fifo_batch),
>> +	__ATTR_NULL
>> +};
>> +
>> +static struct elevator_type mq_deadline = {
>> +	.ops.mq = {
>> +		.get_request		= dd_get_request,
>> +		.put_request		= dd_put_request,
>> +		.insert_requests	= dd_insert_requests,
>> +		.dispatch_requests	= dd_dispatch_requests,
>> +		.completed_request	= dd_completed_request,
>> +		.next_request		= elv_rb_latter_request,
>> +		.former_request		= elv_rb_former_request,
>> +		.bio_merge		= dd_bio_merge,
>> +		.request_merge		= dd_request_merge,
>> +		.requests_merged	= dd_merged_requests,
>> +		.request_merged		= dd_request_merged,
>> +		.has_work		= dd_has_work,
>> +		.init_sched		= dd_init_queue,
>> +		.exit_sched		= dd_exit_queue,
>> +	},
>> +
>> +	.uses_mq	= true,
>> +	.elevator_attrs = deadline_attrs,
>> +	.elevator_name = "mq-deadline",
>> +	.elevator_owner = THIS_MODULE,
>> +};
>> +
>> +static int __init deadline_init(void)
>> +{
>> +	if (!queue_depth) {
>> +		pr_err("mq-deadline: queue depth must be > 0\n");
>> +		return -EINVAL;
>> +	}
>> +	return elv_register(&mq_deadline);
>> +}
>> +
>> +static void __exit deadline_exit(void)
>> +{
>> +	elv_unregister(&mq_deadline);
>> +}
>> +
>> +module_init(deadline_init);
>> +module_exit(deadline_exit);
>> +
>> +MODULE_AUTHOR("Jens Axboe");
>> +MODULE_LICENSE("GPL");
>> +MODULE_DESCRIPTION("MQ deadline IO scheduler");
>> -- 
>> 2.7.4
>> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2017-01-20 11:07       ` Paolo Valente
@ 2017-01-20 14:26         ` Jens Axboe
  0 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2017-01-20 14:26 UTC (permalink / raw)
  To: Paolo Valente; +Cc: linux-block, Linux-Kernal, osandov

On Fri, Jan 20 2017, Paolo Valente wrote:
> 
> > Il giorno 17 gen 2017, alle ore 03:47, Jens Axboe <axboe@fb.com> ha scritto:
> > 
> > On 12/22/2016 09:49 AM, Paolo Valente wrote:
> >> 
> >>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
> >>> 
> >>> This is basically identical to deadline-iosched, except it registers
> >>> as a MQ capable scheduler. This is still a single queue design.
> >>> 
> >> 
> >> One last question (for today ...): in mq-deadline there are no
> >> "schedule dispatch" or "unplug work" functions.  In blk, CFQ and BFQ
> >> do these schedules/unplugs in a lot of cases.  What's the right
> >> replacement?  Just doing nothing?
> > 
> > You just use blk_mq_run_hw_queue() or variants thereof to kick off queue
> > runs.
> > 
> 
> Hi Jens,
> I'm working on this right now.  I have a pair of quick questions about
> performance.
> 
> In the blk version of bfq, if the in-service bfq_queue happens to have
> no more budget when the bfq dispatch function is invoked, then bfq:
> returns no request (NULL), immediately expires the in-service
> bfq_queue, and schedules a new dispatch.  The third step is taken so
> that, if other bfq_queues have requests, then a new in-service
> bfq_queue will be selected on the upcoming new dispatch, and a new
> request will be provided right away.
> 
> My questions are: is this dispatch-schedule step still needed with
> blk-mq, to avoid a stall?  If it is not needed to avoid a stall, would
> it still be needed to boost throughput, because it would force an
> immediate next dispatch?

Generally that step is only needed if you don't dispatch a request for
that invocation, yet you have requests to dispatch. For that case, you
must ensure that the queues are run at some point in the future. So I'm
inclined to answer yes to your question, though it depends on exactly
how it happens. If you have the queue run in the code, comment it like
that, and we can always revisit.
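
Roughly something like this in the dispatch path would do it (untested
sketch, the bfq_mq_* helpers are made up):

static struct request *bfq_mq_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
	struct request *rq;

	rq = bfq_mq_pick_next_request(hctx);		/* hypothetical */
	if (!rq && bfq_mq_has_queued_requests(hctx)) {	/* hypothetical */
		/*
		 * Nothing to hand out right now (e.g. the in-service
		 * queue just ran out of budget), but other queues still
		 * hold requests: kick an async queue run so they get
		 * dispatched on the next invocation.
		 */
		blk_mq_run_hw_queue(hctx, true);
	}

	return rq;
}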

> BTW, bfq-mq survived its first request completion.  I will provide you
> with a link to a github branch as soon as bfq-mq seems able to stand
> up with a minimal workload.

Congratulations, that's a nice milestone!

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2017-01-20 13:14   ` Paolo Valente
  2017-01-20 13:18     ` Paolo Valente
@ 2017-01-20 14:28     ` Jens Axboe
  1 sibling, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2017-01-20 14:28 UTC (permalink / raw)
  To: Paolo Valente; +Cc: linux-block, Linux-Kernal, osandov

On Fri, Jan 20 2017, Paolo Valente wrote:
> 
> > Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
> > 
> > This is basically identical to deadline-iosched, except it registers
> > as a MQ capable scheduler. This is still a single queue design.
> > 
> 
> Jens,
> no spin_lock_irq* in the code.  So request dispatches, too, are
> guaranteed never to be executed in IRQ context?  I'm asking this
> question to understand whether I'm missing something that, even in
> BFQ, would somehow allow me to not disable irqs in critical sections,
> even if there is the slice_idle-expiration handler.  Be patient with
> my ignorance.

Yes, dispatches will never happen from IRQ context. blk-mq was designed
so we didn't have to use irq disabling locks.

That said, certain parts of the API can be called from IRQ context.
put_request and the completion parts, for instance. But blk-mq doesn't
need to grab any locks there, and neither does mq-deadline. This might
be different from bfq. lockdep can be a big help there.
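
To make that concrete, the two locking patterns would look roughly like
this (untested sketch, the bfq side is purely hypothetical):

/*
 * mq-deadline style: the scheduler lock is only ever taken from process
 * context (insert/dispatch/merge), so a plain spinlock is enough.
 */
static void dd_touch_state(struct deadline_data *dd)
{
	spin_lock(&dd->lock);
	/* ... update fifo/rbtree state ... */
	spin_unlock(&dd->lock);
}

/*
 * If a scheduler also needs its lock from put_request or
 * completed_request, which can run in IRQ context, then every
 * acquisition has to disable interrupts instead.
 */
static void bfq_touch_state(struct bfq_data *bfqd)
{
	unsigned long flags;

	spin_lock_irqsave(&bfqd->lock, flags);
	/* ... update scheduler state ... */
	spin_unlock_irqrestore(&bfqd->lock, flags);
}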

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2017-01-20 13:18     ` Paolo Valente
@ 2017-01-20 14:28       ` Jens Axboe
  0 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2017-01-20 14:28 UTC (permalink / raw)
  To: Paolo Valente; +Cc: linux-block, Linux-Kernal, osandov

On Fri, Jan 20 2017, Paolo Valente wrote:
> 
> > Il giorno 20 gen 2017, alle ore 14:14, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> > 
> >> 
> >> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
> >> 
> >> This is basically identical to deadline-iosched, except it registers
> >> as a MQ capable scheduler. This is still a single queue design.
> >> 
> > 
> > Jens,
> > no spin_lock_irq* in the code.  So request dispatches, too, are
> > guaranteed never to be executed in IRQ context?
> 
> Or maybe the opposite? That is, all scheduler functions invoked in IRQ context?

Nope

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2017-01-18 16:21                 ` Jens Axboe
@ 2017-01-23 17:04                   ` Paolo Valente
  2017-01-23 17:42                     ` Jens Axboe
  0 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2017-01-23 17:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jens Axboe, linux-block, Linux-Kernal, Omar Sandoval,
	Linus Walleij, Ulf Hansson, Mark Brown


> Il giorno 18 gen 2017, alle ore 17:21, Jens Axboe <axboe@fb.com> ha scritto:
> 
> On 01/18/2017 08:14 AM, Paolo Valente wrote:
>> according to the function blk_mq_sched_put_request, the
>> mq.completed_request hook seems to always be invoked (if set) for a
>> request for which the mq.put_rq_priv is invoked (if set).
> 
> Correct, any request that came out of blk_mq_sched_get_request()
> will always have completed called on it, regardless of whether it
> had IO started on it or not.
> 

It seems that, every now and then, some request, after being dispatched,
has no mq.put_rq_priv invoked on it.  Is that expected?  If it is,
could you point me to the path through which the end of the life of
such a request is handled?

Thanks,
Paolo

> -- 
> Jens Axboe
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2017-01-23 17:04                   ` Paolo Valente
@ 2017-01-23 17:42                     ` Jens Axboe
  2017-01-25  8:46                       ` Paolo Valente
  0 siblings, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2017-01-23 17:42 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, linux-block, Linux-Kernal, Omar Sandoval,
	Linus Walleij, Ulf Hansson, Mark Brown

On 01/23/2017 10:04 AM, Paolo Valente wrote:
> 
>> Il giorno 18 gen 2017, alle ore 17:21, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> On 01/18/2017 08:14 AM, Paolo Valente wrote:
>>> according to the function blk_mq_sched_put_request, the
>>> mq.completed_request hook seems to always be invoked (if set) for a
>>> request for which the mq.put_rq_priv is invoked (if set).
>>
>> Correct, any request that came out of blk_mq_sched_get_request()
>> will always have completed called on it, regardless of whether it
>> had IO started on it or not.
>>
> 
> It seems that some request, after being dispatched, happens to have no
> mq.put_rq_priv invoked on it now or then.  Is it expected?  If it is,
> could you point me to the path through which the end of the life of
> such a request is handled?

I'm guessing that's a flush request. I added RQF_QUEUED to check for
that: if RQF_QUEUED is set, you know it has come from your get_request
handler.

I'm assuming that is it, let me know.
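
Something along these lines in your private-data hooks (untested, and
the put_rq_priv prototype is from memory):

static void bfq_put_rq_priv(struct request_queue *q, struct request *rq)
{
	/*
	 * Only requests handed out by the scheduler's get_request
	 * handler carry RQF_QUEUED; a flush request allocated around
	 * the scheduler does not, so there is nothing to tear down.
	 */
	if (!(rq->rq_flags & RQF_QUEUED))
		return;

	/* ... release the scheduler-private data attached to rq ... */
}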

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2017-01-23 17:42                     ` Jens Axboe
@ 2017-01-25  8:46                       ` Paolo Valente
  2017-01-25 16:13                         ` Jens Axboe
  0 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2017-01-25  8:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jens Axboe, linux-block, Linux-Kernal, Omar Sandoval,
	Linus Walleij, Ulf Hansson, Mark Brown


> Il giorno 23 gen 2017, alle ore 18:42, Jens Axboe <axboe@fb.com> ha scritto:
> 
> On 01/23/2017 10:04 AM, Paolo Valente wrote:
>> 
>>> Il giorno 18 gen 2017, alle ore 17:21, Jens Axboe <axboe@fb.com> ha scritto:
>>> 
>>> On 01/18/2017 08:14 AM, Paolo Valente wrote:
>>>> according to the function blk_mq_sched_put_request, the
>>>> mq.completed_request hook seems to always be invoked (if set) for a
>>>> request for which the mq.put_rq_priv is invoked (if set).
>>> 
>>> Correct, any request that came out of blk_mq_sched_get_request()
>>> will always have completed called on it, regardless of whether it
>>> had IO started on it or not.
>>> 
>> 
>> It seems that some request, after being dispatched, happens to have no
>> mq.put_rq_priv invoked on it now or then.  Is it expected?  If it is,
>> could you point me to the path through which the end of the life of
>> such a request is handled?
> 
> I'm guessing that's a flush request. I added RQF_QUEUED to check for
> that, if RQF_QUEUED is set, you know it has come from your get_request
> handler.
> 

Exactly, the completion-without-put_rq_priv pattern seems to occur
only for requests coming from the flusher, precisely because they have
the flag RQF_ELVPRIV unset.  Just to understand: why is this flag
unset for these requests, if they do have private elevator (bfq)
data attached?  What am I misunderstanding?

Just to be certain: this should be the only case where the
completed_request hook is invoked while the put_rq_priv is not, right?

Thanks,
Paolo

> I'm assuming that is it, let me know.
> 
> -- 
> Jens Axboe
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2017-01-25  8:46                       ` Paolo Valente
@ 2017-01-25 16:13                         ` Jens Axboe
  2017-01-26 14:23                           ` Paolo Valente
  0 siblings, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2017-01-25 16:13 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, linux-block, Linux-Kernal, Omar Sandoval,
	Linus Walleij, Ulf Hansson, Mark Brown

On 01/25/2017 01:46 AM, Paolo Valente wrote:
> 
>> Il giorno 23 gen 2017, alle ore 18:42, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> On 01/23/2017 10:04 AM, Paolo Valente wrote:
>>>
>>>> Il giorno 18 gen 2017, alle ore 17:21, Jens Axboe <axboe@fb.com> ha scritto:
>>>>
>>>> On 01/18/2017 08:14 AM, Paolo Valente wrote:
>>>>> according to the function blk_mq_sched_put_request, the
>>>>> mq.completed_request hook seems to always be invoked (if set) for a
>>>>> request for which the mq.put_rq_priv is invoked (if set).
>>>>
>>>> Correct, any request that came out of blk_mq_sched_get_request()
>>>> will always have completed called on it, regardless of whether it
>>>> had IO started on it or not.
>>>>
>>>
>>> It seems that some request, after being dispatched, happens to have no
>>> mq.put_rq_priv invoked on it now or then.  Is it expected?  If it is,
>>> could you point me to the path through which the end of the life of
>>> such a request is handled?
>>
>> I'm guessing that's a flush request. I added RQF_QUEUED to check for
>> that, if RQF_QUEUED is set, you know it has come from your get_request
>> handler.
>>
> 
> Exactly, the completion-without-put_rq_priv pattern seems to occur
> only for requests coming from the flusher, precisely because they have
> the flag RQF_ELVPRIV unset.  Just to understand: why is this flag
> unset for these requests, if they do have private elevator (bfq)
> data attached?  What am I misunderstanding?
> 
> Just to be certain: this should be the only case where the
> completed_request hook is invoked while the put_rq_priv is not, right?

They must NOT have scheduler data attached. In your get_request
function, you must bypass if blk_mq_sched_bypass_insert() returns true.
See how mq-deadline does that. This is important, or you will get hangs
with flushes as well, since the IO scheduler private data and the flush
data is unionized in the request.
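
For reference, transplanting the mq-deadline pattern into a bfq-style
get_request would look roughly like this (untested sketch, the bfq_data
fields are made up, the helpers called are the ones from this patch):

static struct request *bfq_get_request(struct request_queue *q,
				       unsigned int op,
				       struct blk_mq_alloc_data *data)
{
	struct bfq_data *bfqd = q->elevator->elevator_data;
	struct request *rq;

	/*
	 * Flush/FUA requests bypass the scheduler and get a real
	 * request, so no scheduler-private data is ever attached to
	 * them and the flush machinery can safely use the union.
	 */
	if (unlikely(op & (REQ_PREFLUSH | REQ_FUA)))
		return __blk_mq_alloc_request(data, op);

	rq = blk_mq_sched_alloc_shadow_request(q, data, bfqd->tags,
					       &bfqd->wait_index);
	if (rq)
		blk_mq_rq_ctx_init(q, data->ctx, rq, op);
	return rq;
}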

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCHSET v4] blk-mq-scheduling framework
  2017-01-25 16:13                         ` Jens Axboe
@ 2017-01-26 14:23                           ` Paolo Valente
  0 siblings, 0 replies; 69+ messages in thread
From: Paolo Valente @ 2017-01-26 14:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jens Axboe, linux-block, Linux-Kernal, Omar Sandoval,
	Linus Walleij, Ulf Hansson, Mark Brown


> Il giorno 25 gen 2017, alle ore 17:13, Jens Axboe <axboe@fb.com> ha scritto:
> 
> On 01/25/2017 01:46 AM, Paolo Valente wrote:
>> 
>>> Il giorno 23 gen 2017, alle ore 18:42, Jens Axboe <axboe@fb.com> ha scritto:
>>> 
>>> On 01/23/2017 10:04 AM, Paolo Valente wrote:
>>>> 
>>>>> Il giorno 18 gen 2017, alle ore 17:21, Jens Axboe <axboe@fb.com> ha scritto:
>>>>> 
>>>>> On 01/18/2017 08:14 AM, Paolo Valente wrote:
>>>>>> according to the function blk_mq_sched_put_request, the
>>>>>> mq.completed_request hook seems to always be invoked (if set) for a
>>>>>> request for which the mq.put_rq_priv is invoked (if set).
>>>>> 
>>>>> Correct, any request that came out of blk_mq_sched_get_request()
>>>>> will always have completed called on it, regardless of whether it
>>>>> had IO started on it or not.
>>>>> 
>>>> 
>>>> It seems that some request, after being dispatched, happens to have no
>>>> mq.put_rq_priv invoked on it now or then.  Is it expected?  If it is,
>>>> could you point me to the path through which the end of the life of
>>>> such a request is handled?
>>> 
>>> I'm guessing that's a flush request. I added RQF_QUEUED to check for
>>> that, if RQF_QUEUED is set, you know it has come from your get_request
>>> handler.
>>> 
>> 
>> Exactly, the completion-without-put_rq_priv pattern seems to occur
>> only for requests coming from the flusher, precisely because they have
>> the flag RQF_ELVPRIV unset.  Just to understand: why is this flag
>> unset for these requests, if they do have private elevator (bfq)
>> data attached?  What am I misunderstanding?
>> 
>> Just to be certain: this should be the only case where the
>> completed_request hook is invoked while the put_rq_priv is not, right?
> 
> They must NOT have scheduler data attached. In your get_request
> function, you must bypass if blk_mq_sched_bypass_insert() returns true.

Yes, sorry.  I'm already using blk_mq_sched_bypass_insert() to bypass
insertion in the insert hook, as done in mq-deadline, and I have no
get_request defined (see below).

The source of my confusion was that I assumed that flush requests were
not supposed to leave any trace in the scheduler, since the scheduler
does not decide anything for them.  Accordingly, I thought they did not
trigger any put or completion hook.  In contrast, these requests get the
QUEUED flag set if the get_request hook is set, and then trigger
both a put_request and a completed_request.  In this respect, in bfq-mq
I'm not using any of these three hooks (they are all NULL).  I hope
I'm not doing something unexpected or incoherent.

UPDATE: bfq-mq now survives for minutes.  I'm debugging two occasional
failures, which (un)fortunately become more and more occasional as I
go on with debugging and instrumenting the code.

Thanks,
Paolo

> See how mq-deadline does that. This is important, or you will get hangs
> with flushes as well, since the IO scheduler private data and the flush
> data is unionized in the request.
> 
> -- 
> Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-17  0:12 ` [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
                     ` (4 preceding siblings ...)
  2017-01-20 13:14   ` Paolo Valente
@ 2017-02-01 11:11   ` Paolo Valente
  2017-02-02  5:19     ` Jens Axboe
  2017-02-01 11:56   ` Paolo Valente
  2017-02-16 10:46   ` Paolo Valente
  7 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2017-02-01 11:11 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
> 
> This is basically identical to deadline-iosched, except it registers
> as a MQ capable scheduler. This is still a single queue design.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---
> block/Kconfig.iosched |   6 +
> block/Makefile        |   1 +
> block/mq-deadline.c   | 649 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 656 insertions(+)
> create mode 100644 block/mq-deadline.c
> 
> diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
> index 421bef9c4c48..490ef2850fae 100644
> --- a/block/Kconfig.iosched
> +++ b/block/Kconfig.iosched
> @@ -32,6 +32,12 @@ config IOSCHED_CFQ
> 
> 	  This is the default I/O scheduler.
> 
> +config MQ_IOSCHED_DEADLINE
> +	tristate "MQ deadline I/O scheduler"
> +	default y
> +	---help---
> +	  MQ version of the deadline IO scheduler.
> +
> config CFQ_GROUP_IOSCHED
> 	bool "CFQ Group Scheduling support"
> 	depends on IOSCHED_CFQ && BLK_CGROUP
> diff --git a/block/Makefile b/block/Makefile
> index 2eee9e1bb6db..3ee0abd7205a 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
> obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
> obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
> obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
> +obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
> 
> obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
> obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> new file mode 100644
> index 000000000000..3cb9de21ab21
> --- /dev/null
> +++ b/block/mq-deadline.c
> @@ -0,0 +1,649 @@
> +/*
> + *  MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
> + *  for the blk-mq scheduling framework
> + *
> + *  Copyright (C) 2016 Jens Axboe <axboe@kernel.dk>
> + */
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/blkdev.h>
> +#include <linux/blk-mq.h>
> +#include <linux/elevator.h>
> +#include <linux/bio.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/init.h>
> +#include <linux/compiler.h>
> +#include <linux/rbtree.h>
> +#include <linux/sbitmap.h>
> +
> +#include "blk.h"
> +#include "blk-mq.h"
> +#include "blk-mq-tag.h"
> +#include "blk-mq-sched.h"
> +
> +static unsigned int queue_depth = 256;
> +module_param(queue_depth, uint, 0644);
> +MODULE_PARM_DESC(queue_depth, "Use this value as the scheduler queue depth");
> +
> +/*
> + * See Documentation/block/deadline-iosched.txt
> + */
> +static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
> +static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
> +static const int writes_starved = 2;    /* max times reads can starve a write */
> +static const int fifo_batch = 16;       /* # of sequential requests treated as one
> +				     by the above parameters. For throughput. */
> +
> +struct deadline_data {
> +	/*
> +	 * run time data
> +	 */
> +
> +	/*
> +	 * requests (deadline_rq s) are present on both sort_list and fifo_list
> +	 */
> +	struct rb_root sort_list[2];
> +	struct list_head fifo_list[2];
> +
> +	/*
> +	 * next in sort order. read, write or both are NULL
> +	 */
> +	struct request *next_rq[2];
> +	unsigned int batching;		/* number of sequential requests made */
> +	unsigned int starved;		/* times reads have starved writes */
> +
> +	/*
> +	 * settings that change how the i/o scheduler behaves
> +	 */
> +	int fifo_expire[2];
> +	int fifo_batch;
> +	int writes_starved;
> +	int front_merges;
> +
> +	spinlock_t lock;
> +	struct list_head dispatch;
> +	struct blk_mq_tags *tags;
> +	atomic_t wait_index;
> +};
> +
> +static inline struct rb_root *
> +deadline_rb_root(struct deadline_data *dd, struct request *rq)
> +{
> +	return &dd->sort_list[rq_data_dir(rq)];
> +}
> +
> +/*
> + * get the request after `rq' in sector-sorted order
> + */
> +static inline struct request *
> +deadline_latter_request(struct request *rq)
> +{
> +	struct rb_node *node = rb_next(&rq->rb_node);
> +
> +	if (node)
> +		return rb_entry_rq(node);
> +
> +	return NULL;
> +}
> +
> +static void
> +deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
> +{
> +	struct rb_root *root = deadline_rb_root(dd, rq);
> +
> +	elv_rb_add(root, rq);
> +}
> +
> +static inline void
> +deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
> +{
> +	const int data_dir = rq_data_dir(rq);
> +
> +	if (dd->next_rq[data_dir] == rq)
> +		dd->next_rq[data_dir] = deadline_latter_request(rq);
> +
> +	elv_rb_del(deadline_rb_root(dd, rq), rq);
> +}
> +
> +/*
> + * remove rq from rbtree and fifo.
> + */
> +static void deadline_remove_request(struct request_queue *q, struct request *rq)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	list_del_init(&rq->queuelist);
> +
> +	/*
> +	 * We might not be on the rbtree, if we are doing an insert merge
> +	 */
> +	if (!RB_EMPTY_NODE(&rq->rb_node))
> +		deadline_del_rq_rb(dd, rq);
> +
> +	elv_rqhash_del(q, rq);
> +	if (q->last_merge == rq)
> +		q->last_merge = NULL;
> +}
> +
> +static void dd_request_merged(struct request_queue *q, struct request *req,
> +			      int type)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	/*
> +	 * if the merge was a front merge, we need to reposition request
> +	 */
> +	if (type == ELEVATOR_FRONT_MERGE) {
> +		elv_rb_del(deadline_rb_root(dd, req), req);
> +		deadline_add_rq_rb(dd, req);
> +	}
> +}
> +
> +static void dd_merged_requests(struct request_queue *q, struct request *req,
> +			       struct request *next)
> +{
> +	/*
> +	 * if next expires before rq, assign its expire time to rq
> +	 * and move into next position (next will be deleted) in fifo
> +	 */
> +	if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
> +		if (time_before((unsigned long)next->fifo_time,
> +				(unsigned long)req->fifo_time)) {
> +			list_move(&req->queuelist, &next->queuelist);
> +			req->fifo_time = next->fifo_time;
> +		}
> +	}
> +
> +	/*
> +	 * kill knowledge of next, this one is a goner
> +	 */
> +	deadline_remove_request(q, next);
> +}
> +
> +/*
> + * move an entry to dispatch queue
> + */
> +static void
> +deadline_move_request(struct deadline_data *dd, struct request *rq)
> +{
> +	const int data_dir = rq_data_dir(rq);
> +
> +	dd->next_rq[READ] = NULL;
> +	dd->next_rq[WRITE] = NULL;
> +	dd->next_rq[data_dir] = deadline_latter_request(rq);
> +
> +	/*
> +	 * take it off the sort and fifo list
> +	 */
> +	deadline_remove_request(rq->q, rq);
> +}
> +
> +/*
> + * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
> + * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
> + */
> +static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
> +{
> +	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
> +
> +	/*
> +	 * rq is expired!
> +	 */
> +	if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
> +		return 1;
> +
> +	return 0;
> +}
> +
> +/*
> + * deadline_dispatch_requests selects the best request according to
> + * read/write expire, fifo_batch, etc
> + */
> +static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +	struct request *rq;
> +	bool reads, writes;
> +	int data_dir;
> +
> +	spin_lock(&dd->lock);
> +
> +	if (!list_empty(&dd->dispatch)) {
> +		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		goto done;
> +	}
> +
> +	reads = !list_empty(&dd->fifo_list[READ]);
> +	writes = !list_empty(&dd->fifo_list[WRITE]);
> +
> +	/*
> +	 * batches are currently reads XOR writes
> +	 */
> +	if (dd->next_rq[WRITE])
> +		rq = dd->next_rq[WRITE];
> +	else
> +		rq = dd->next_rq[READ];
> +
> +	if (rq && dd->batching < dd->fifo_batch)
> +		/* we have a next request and are still entitled to batch */
> +		goto dispatch_request;
> +
> +	/*
> +	 * at this point we are not running a batch. select the appropriate
> +	 * data direction (read / write)
> +	 */
> +
> +	if (reads) {
> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
> +
> +		if (writes && (dd->starved++ >= dd->writes_starved))
> +			goto dispatch_writes;
> +
> +		data_dir = READ;
> +
> +		goto dispatch_find_request;
> +	}
> +
> +	/*
> +	 * there are either no reads or writes have been starved
> +	 */
> +
> +	if (writes) {
> +dispatch_writes:
> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
> +
> +		dd->starved = 0;
> +
> +		data_dir = WRITE;
> +
> +		goto dispatch_find_request;
> +	}
> +
> +	spin_unlock(&dd->lock);
> +	return NULL;
> +
> +dispatch_find_request:
> +	/*
> +	 * we are not running a batch, find best request for selected data_dir
> +	 */
> +	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
> +		/*
> +		 * A deadline has expired, the last request was in the other
> +		 * direction, or we have run out of higher-sectored requests.
> +		 * Start again from the request with the earliest expiry time.
> +		 */
> +		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> +	} else {
> +		/*
> +		 * The last req was the same dir and we have a next request in
> +		 * sort order. No expired requests so continue on from here.
> +		 */
> +		rq = dd->next_rq[data_dir];
> +	}
> +
> +	dd->batching = 0;
> +
> +dispatch_request:
> +	/*
> +	 * rq is the selected appropriate request.
> +	 */
> +	dd->batching++;
> +	deadline_move_request(dd, rq);
> +done:
> +	rq->rq_flags |= RQF_STARTED;
> +	spin_unlock(&dd->lock);
> +	return rq;
> +}
> +
> +static void dd_dispatch_requests(struct blk_mq_hw_ctx *hctx,
> +				 struct list_head *rq_list)
> +{
> +	blk_mq_sched_dispatch_shadow_requests(hctx, rq_list, __dd_dispatch_request);
> +}
> +
> +static void dd_exit_queue(struct elevator_queue *e)
> +{
> +	struct deadline_data *dd = e->elevator_data;
> +
> +	BUG_ON(!list_empty(&dd->fifo_list[READ]));
> +	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
> +
> +	blk_mq_sched_free_requests(dd->tags);
> +	kfree(dd);
> +}
> +
> +/*
> + * initialize elevator private data (deadline_data).
> + */
> +static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
> +{
> +	struct deadline_data *dd;
> +	struct elevator_queue *eq;
> +
> +	eq = elevator_alloc(q, e);
> +	if (!eq)
> +		return -ENOMEM;
> +
> +	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
> +	if (!dd) {
> +		kobject_put(&eq->kobj);
> +		return -ENOMEM;
> +	}
> +	eq->elevator_data = dd;
> +
> +	dd->tags = blk_mq_sched_alloc_requests(queue_depth, q->node);
> +	if (!dd->tags) {
> +		kfree(dd);
> +		kobject_put(&eq->kobj);
> +		return -ENOMEM;
> +	}
> +
> +	INIT_LIST_HEAD(&dd->fifo_list[READ]);
> +	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
> +	dd->sort_list[READ] = RB_ROOT;
> +	dd->sort_list[WRITE] = RB_ROOT;
> +	dd->fifo_expire[READ] = read_expire;
> +	dd->fifo_expire[WRITE] = write_expire;
> +	dd->writes_starved = writes_starved;
> +	dd->front_merges = 1;
> +	dd->fifo_batch = fifo_batch;
> +	spin_lock_init(&dd->lock);
> +	INIT_LIST_HEAD(&dd->dispatch);
> +	atomic_set(&dd->wait_index, 0);
> +
> +	q->elevator = eq;
> +	return 0;
> +}
> +
> +static int dd_request_merge(struct request_queue *q, struct request **rq,
> +			    struct bio *bio)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	sector_t sector = bio_end_sector(bio);
> +	struct request *__rq;
> +
> +	if (!dd->front_merges)
> +		return ELEVATOR_NO_MERGE;
> +
> +	__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
> +	if (__rq) {
> +		BUG_ON(sector != blk_rq_pos(__rq));
> +
> +		if (elv_bio_merge_ok(__rq, bio)) {
> +			*rq = __rq;
> +			return ELEVATOR_FRONT_MERGE;
> +		}
> +	}
> +
> +	return ELEVATOR_NO_MERGE;
> +}
> +
> +static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	int ret;
> +
> +	spin_lock(&dd->lock);
> +	ret = blk_mq_sched_try_merge(q, bio);
> +	spin_unlock(&dd->lock);
> +

Hi Jens,
first, good news: bfq is passing my first sanity checks.  Still, I
need a little more help with the following issue.  There is a case that
would be impossible to handle without modifying code outside bfq.  But
so far such a case has never occurred, and I hope that it never can.
I'll try to briefly list all the relevant details on this concern of
mine, so that you can quickly confirm my hope, or point out what I am
missing.

First, as done above for mq-deadline, invoking blk_mq_sched_try_merge
with the scheduler lock held is of course necessary (for example, to
protect q->last_merge).  This may lead to put_rq_private being invoked
with the lock held, in case of a successful merge.

As a consequence, put_rq_private may be invoked:
(1) in IRQ context, no scheduler lock held, because of a completion:
can be handled by deferring work and lock grabbing, because the
completed request is not queued in the scheduler any more;
(2) in process context, scheduler lock held, because of the above
successful merge: must be handled immediately, for consistency,
because the request is still queued in the scheduler;
(3) in process context, no scheduler lock held, for some other reason:
some path apparently may lead to this case, although I've never seen
it happen.  Immediate handling, and hence locking, may be needed,
depending on whether the request is still queued in the scheduler.

So, my main question is: is case (3) actually impossible?  Should it
be possible, I guess we would have a problem, because of the
different lock state with respect to (2).

Finally, I hope that a case (4) is simply impossible: IRQ context, no
lock held, but with the request still in the scheduler.
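
For concreteness, this is the kind of deferral I have in mind for case
(1); just a sketch of mine, not actual bfq-mq code, and the bfq_data
fields are invented:

/* only the fields relevant to this sketch */
struct bfq_data {
	spinlock_t lock;
	struct work_struct comp_work;
};

static void bfq_comp_work_fn(struct work_struct *work)
{
	struct bfq_data *bfqd = container_of(work, struct bfq_data, comp_work);

	/* process context: a plain lock is enough here */
	spin_lock(&bfqd->lock);
	/* ... account for completed requests, restart queues ... */
	spin_unlock(&bfqd->lock);
}

static void bfq_completed_request(struct blk_mq_hw_ctx *hctx,
				  struct request *rq)
{
	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;

	/*
	 * May run in IRQ context, and the completed request is no
	 * longer queued in the scheduler: defer the lock-requiring
	 * work.  How the completed request itself is handed over
	 * (e.g. a lockless list) is left out of the sketch.
	 */
	kblockd_schedule_work(&bfqd->comp_work);
}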

Thanks,
Paolo

> +	return ret;
> +}
> +
> +/*
> + * add rq to rbtree and fifo
> + */
> +static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> +			      bool at_head)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	const int data_dir = rq_data_dir(rq);
> +
> +	if (blk_mq_sched_try_insert_merge(q, rq))
> +		return;
> +
> +	blk_mq_sched_request_inserted(rq);
> +
> +	/*
> +	 * If we're trying to insert a real request, just send it directly
> +	 * to the hardware dispatch list. This only happens for a requeue,
> +	 * or FUA/FLUSH requests.
> +	 */
> +	if (!blk_mq_sched_rq_is_shadow(rq)) {
> +		spin_lock(&hctx->lock);
> +		list_add_tail(&rq->queuelist, &hctx->dispatch);
> +		spin_unlock(&hctx->lock);
> +		return;
> +	}
> +
> +	if (at_head || rq->cmd_type != REQ_TYPE_FS) {
> +		if (at_head)
> +			list_add(&rq->queuelist, &dd->dispatch);
> +		else
> +			list_add_tail(&rq->queuelist, &dd->dispatch);
> +	} else {
> +		deadline_add_rq_rb(dd, rq);
> +
> +		if (rq_mergeable(rq)) {
> +			elv_rqhash_add(q, rq);
> +			if (!q->last_merge)
> +				q->last_merge = rq;
> +		}
> +
> +		/*
> +		 * set expire time and add to fifo list
> +		 */
> +		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
> +		list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
> +	}
> +}
> +
> +static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
> +			       struct list_head *list, bool at_head)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	spin_lock(&dd->lock);
> +	while (!list_empty(list)) {
> +		struct request *rq;
> +
> +		rq = list_first_entry(list, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		dd_insert_request(hctx, rq, at_head);
> +	}
> +	spin_unlock(&dd->lock);
> +}
> +
> +static struct request *dd_get_request(struct request_queue *q, unsigned int op,
> +				      struct blk_mq_alloc_data *data)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	struct request *rq;
> +
> +	/*
> +	 * The flush machinery intercepts before we insert the request. As
> +	 * a work-around, just hand it back a real request.
> +	 */
> +	if (unlikely(op & (REQ_PREFLUSH | REQ_FUA)))
> +		rq = __blk_mq_alloc_request(data, op);
> +	else {
> +		rq = blk_mq_sched_alloc_shadow_request(q, data, dd->tags, &dd->wait_index);
> +		if (rq)
> +			blk_mq_rq_ctx_init(q, data->ctx, rq, op);
> +	}
> +
> +	return rq;
> +}
> +
> +static bool dd_put_request(struct request *rq)
> +{
> +	/*
> +	 * If it's a real request, we just have to free it. For a shadow
> +	 * request, we should only free it if we haven't started it. A
> +	 * started request is mapped to a real one, and the real one will
> +	 * free it. We can get here with request merges, since we then
> +	 * free the request before we start/issue it.
> +	 */
> +	if (!blk_mq_sched_rq_is_shadow(rq))
> +		return false;
> +
> +	if (!(rq->rq_flags & RQF_STARTED)) {
> +		struct request_queue *q = rq->q;
> +		struct deadline_data *dd = q->elevator->elevator_data;
> +
> +		/*
> +		 * IO completion would normally do this, but if we merge
> +		 * and free before we issue the request, drop both the
> +		 * tag and queue ref
> +		 */
> +		blk_mq_sched_free_shadow_request(dd->tags, rq);
> +		blk_queue_exit(q);
> +	}
> +
> +	return true;
> +}
> +
> +static void dd_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
> +{
> +	struct request *sched_rq = rq->end_io_data;
> +
> +	/*
> +	 * sched_rq can be NULL, if we haven't setup the shadow yet
> +	 * because we failed getting one.
> +	 */
> +	if (sched_rq) {
> +		struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +
> +		blk_mq_sched_free_shadow_request(dd->tags, sched_rq);
> +		blk_mq_start_stopped_hw_queue(hctx, true);
> +	}
> +}
> +
> +static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +
> +	return !list_empty_careful(&dd->dispatch) ||
> +		!list_empty_careful(&dd->fifo_list[0]) ||
> +		!list_empty_careful(&dd->fifo_list[1]);
> +}
> +
> +/*
> + * sysfs parts below
> + */
> +static ssize_t
> +deadline_var_show(int var, char *page)
> +{
> +	return sprintf(page, "%d\n", var);
> +}
> +
> +static ssize_t
> +deadline_var_store(int *var, const char *page, size_t count)
> +{
> +	char *p = (char *) page;
> +
> +	*var = simple_strtol(p, &p, 10);
> +	return count;
> +}
> +
> +#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
> +static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
> +{									\
> +	struct deadline_data *dd = e->elevator_data;			\
> +	int __data = __VAR;						\
> +	if (__CONV)							\
> +		__data = jiffies_to_msecs(__data);			\
> +	return deadline_var_show(__data, (page));			\
> +}
> +SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
> +SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
> +SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
> +SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
> +SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
> +#undef SHOW_FUNCTION
> +
> +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
> +static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
> +{									\
> +	struct deadline_data *dd = e->elevator_data;			\
> +	int __data;							\
> +	int ret = deadline_var_store(&__data, (page), count);		\
> +	if (__data < (MIN))						\
> +		__data = (MIN);						\
> +	else if (__data > (MAX))					\
> +		__data = (MAX);						\
> +	if (__CONV)							\
> +		*(__PTR) = msecs_to_jiffies(__data);			\
> +	else								\
> +		*(__PTR) = __data;					\
> +	return ret;							\
> +}
> +STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
> +STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
> +STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
> +#undef STORE_FUNCTION
> +
> +#define DD_ATTR(name) \
> +	__ATTR(name, S_IRUGO|S_IWUSR, deadline_##name##_show, \
> +				      deadline_##name##_store)
> +
> +static struct elv_fs_entry deadline_attrs[] = {
> +	DD_ATTR(read_expire),
> +	DD_ATTR(write_expire),
> +	DD_ATTR(writes_starved),
> +	DD_ATTR(front_merges),
> +	DD_ATTR(fifo_batch),
> +	__ATTR_NULL
> +};
> +
> +static struct elevator_type mq_deadline = {
> +	.ops.mq = {
> +		.get_request		= dd_get_request,
> +		.put_request		= dd_put_request,
> +		.insert_requests	= dd_insert_requests,
> +		.dispatch_requests	= dd_dispatch_requests,
> +		.completed_request	= dd_completed_request,
> +		.next_request		= elv_rb_latter_request,
> +		.former_request		= elv_rb_former_request,
> +		.bio_merge		= dd_bio_merge,
> +		.request_merge		= dd_request_merge,
> +		.requests_merged	= dd_merged_requests,
> +		.request_merged		= dd_request_merged,
> +		.has_work		= dd_has_work,
> +		.init_sched		= dd_init_queue,
> +		.exit_sched		= dd_exit_queue,
> +	},
> +
> +	.uses_mq	= true,
> +	.elevator_attrs = deadline_attrs,
> +	.elevator_name = "mq-deadline",
> +	.elevator_owner = THIS_MODULE,
> +};
> +
> +static int __init deadline_init(void)
> +{
> +	if (!queue_depth) {
> +		pr_err("mq-deadline: queue depth must be > 0\n");
> +		return -EINVAL;
> +	}
> +	return elv_register(&mq_deadline);
> +}
> +
> +static void __exit deadline_exit(void)
> +{
> +	elv_unregister(&mq_deadline);
> +}
> +
> +module_init(deadline_init);
> +module_exit(deadline_exit);
> +
> +MODULE_AUTHOR("Jens Axboe");
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("MQ deadline IO scheduler");
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-17  0:12 ` [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
                     ` (5 preceding siblings ...)
  2017-02-01 11:11   ` Paolo Valente
@ 2017-02-01 11:56   ` Paolo Valente
  2017-02-02  5:20     ` Jens Axboe
  2017-02-16 10:46   ` Paolo Valente
  7 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2017-02-01 11:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
> 
> This is basically identical to deadline-iosched, except it registers
> as a MQ capable scheduler. This is still a single queue design.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---
> block/Kconfig.iosched |   6 +
> block/Makefile        |   1 +
> block/mq-deadline.c   | 649 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 656 insertions(+)
> create mode 100644 block/mq-deadline.c
> 
> diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
> index 421bef9c4c48..490ef2850fae 100644
> --- a/block/Kconfig.iosched
> +++ b/block/Kconfig.iosched
> @@ -32,6 +32,12 @@ config IOSCHED_CFQ
> 
> 	  This is the default I/O scheduler.
> 
> +config MQ_IOSCHED_DEADLINE
> +	tristate "MQ deadline I/O scheduler"
> +	default y
> +	---help---
> +	  MQ version of the deadline IO scheduler.
> +
> config CFQ_GROUP_IOSCHED
> 	bool "CFQ Group Scheduling support"
> 	depends on IOSCHED_CFQ && BLK_CGROUP
> diff --git a/block/Makefile b/block/Makefile
> index 2eee9e1bb6db..3ee0abd7205a 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
> obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
> obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
> obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
> +obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
> 
> obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
> obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> new file mode 100644
> index 000000000000..3cb9de21ab21
> --- /dev/null
> +++ b/block/mq-deadline.c
> @@ -0,0 +1,649 @@
> +/*
> + *  MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
> + *  for the blk-mq scheduling framework
> + *
> + *  Copyright (C) 2016 Jens Axboe <axboe@kernel.dk>
> + */
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/blkdev.h>
> +#include <linux/blk-mq.h>
> +#include <linux/elevator.h>
> +#include <linux/bio.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/init.h>
> +#include <linux/compiler.h>
> +#include <linux/rbtree.h>
> +#include <linux/sbitmap.h>
> +
> +#include "blk.h"
> +#include "blk-mq.h"
> +#include "blk-mq-tag.h"
> +#include "blk-mq-sched.h"
> +
> +static unsigned int queue_depth = 256;
> +module_param(queue_depth, uint, 0644);
> +MODULE_PARM_DESC(queue_depth, "Use this value as the scheduler queue depth");
> +
> +/*
> + * See Documentation/block/deadline-iosched.txt
> + */
> +static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
> +static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
> +static const int writes_starved = 2;    /* max times reads can starve a write */
> +static const int fifo_batch = 16;       /* # of sequential requests treated as one
> +				     by the above parameters. For throughput. */
> +
> +struct deadline_data {
> +	/*
> +	 * run time data
> +	 */
> +
> +	/*
> +	 * requests (deadline_rq s) are present on both sort_list and fifo_list
> +	 */
> +	struct rb_root sort_list[2];
> +	struct list_head fifo_list[2];
> +
> +	/*
> +	 * next in sort order. read, write or both are NULL
> +	 */
> +	struct request *next_rq[2];
> +	unsigned int batching;		/* number of sequential requests made */
> +	unsigned int starved;		/* times reads have starved writes */
> +
> +	/*
> +	 * settings that change how the i/o scheduler behaves
> +	 */
> +	int fifo_expire[2];
> +	int fifo_batch;
> +	int writes_starved;
> +	int front_merges;
> +
> +	spinlock_t lock;
> +	struct list_head dispatch;
> +	struct blk_mq_tags *tags;
> +	atomic_t wait_index;
> +};
> +
> +static inline struct rb_root *
> +deadline_rb_root(struct deadline_data *dd, struct request *rq)
> +{
> +	return &dd->sort_list[rq_data_dir(rq)];
> +}
> +
> +/*
> + * get the request after `rq' in sector-sorted order
> + */
> +static inline struct request *
> +deadline_latter_request(struct request *rq)
> +{
> +	struct rb_node *node = rb_next(&rq->rb_node);
> +
> +	if (node)
> +		return rb_entry_rq(node);
> +
> +	return NULL;
> +}
> +
> +static void
> +deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
> +{
> +	struct rb_root *root = deadline_rb_root(dd, rq);
> +
> +	elv_rb_add(root, rq);
> +}
> +
> +static inline void
> +deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
> +{
> +	const int data_dir = rq_data_dir(rq);
> +
> +	if (dd->next_rq[data_dir] == rq)
> +		dd->next_rq[data_dir] = deadline_latter_request(rq);
> +
> +	elv_rb_del(deadline_rb_root(dd, rq), rq);
> +}
> +
> +/*
> + * remove rq from rbtree and fifo.
> + */
> +static void deadline_remove_request(struct request_queue *q, struct request *rq)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	list_del_init(&rq->queuelist);
> +
> +	/*
> +	 * We might not be on the rbtree, if we are doing an insert merge
> +	 */
> +	if (!RB_EMPTY_NODE(&rq->rb_node))
> +		deadline_del_rq_rb(dd, rq);
> +
> +	elv_rqhash_del(q, rq);
> +	if (q->last_merge == rq)
> +		q->last_merge = NULL;
> +}
> +
> +static void dd_request_merged(struct request_queue *q, struct request *req,
> +			      int type)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	/*
> +	 * if the merge was a front merge, we need to reposition request
> +	 */
> +	if (type == ELEVATOR_FRONT_MERGE) {
> +		elv_rb_del(deadline_rb_root(dd, req), req);
> +		deadline_add_rq_rb(dd, req);
> +	}
> +}
> +
> +static void dd_merged_requests(struct request_queue *q, struct request *req,
> +			       struct request *next)
> +{
> +	/*
> +	 * if next expires before rq, assign its expire time to rq
> +	 * and move into next position (next will be deleted) in fifo
> +	 */
> +	if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
> +		if (time_before((unsigned long)next->fifo_time,
> +				(unsigned long)req->fifo_time)) {
> +			list_move(&req->queuelist, &next->queuelist);
> +			req->fifo_time = next->fifo_time;
> +		}
> +	}
> +
> +	/*
> +	 * kill knowledge of next, this one is a goner
> +	 */
> +	deadline_remove_request(q, next);
> +}
> +
> +/*
> + * move an entry to dispatch queue
> + */
> +static void
> +deadline_move_request(struct deadline_data *dd, struct request *rq)
> +{
> +	const int data_dir = rq_data_dir(rq);
> +
> +	dd->next_rq[READ] = NULL;
> +	dd->next_rq[WRITE] = NULL;
> +	dd->next_rq[data_dir] = deadline_latter_request(rq);
> +
> +	/*
> +	 * take it off the sort and fifo list
> +	 */
> +	deadline_remove_request(rq->q, rq);
> +}
> +
> +/*
> + * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
> + * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
> + */
> +static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
> +{
> +	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
> +
> +	/*
> +	 * rq is expired!
> +	 */
> +	if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
> +		return 1;
> +
> +	return 0;
> +}
> +
> +/*
> + * deadline_dispatch_requests selects the best request according to
> + * read/write expire, fifo_batch, etc
> + */
> +static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +	struct request *rq;
> +	bool reads, writes;
> +	int data_dir;
> +
> +	spin_lock(&dd->lock);
> +
> +	if (!list_empty(&dd->dispatch)) {
> +		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		goto done;
> +	}
> +
> +	reads = !list_empty(&dd->fifo_list[READ]);
> +	writes = !list_empty(&dd->fifo_list[WRITE]);
> +
> +	/*
> +	 * batches are currently reads XOR writes
> +	 */
> +	if (dd->next_rq[WRITE])
> +		rq = dd->next_rq[WRITE];
> +	else
> +		rq = dd->next_rq[READ];
> +
> +	if (rq && dd->batching < dd->fifo_batch)
> +		/* we have a next request and are still entitled to batch */
> +		goto dispatch_request;
> +
> +	/*
> +	 * at this point we are not running a batch. select the appropriate
> +	 * data direction (read / write)
> +	 */
> +
> +	if (reads) {
> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
> +
> +		if (writes && (dd->starved++ >= dd->writes_starved))
> +			goto dispatch_writes;
> +
> +		data_dir = READ;
> +
> +		goto dispatch_find_request;
> +	}
> +
> +	/*
> +	 * there are either no reads or writes have been starved
> +	 */
> +
> +	if (writes) {
> +dispatch_writes:
> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
> +
> +		dd->starved = 0;
> +
> +		data_dir = WRITE;
> +
> +		goto dispatch_find_request;
> +	}
> +
> +	spin_unlock(&dd->lock);
> +	return NULL;
> +
> +dispatch_find_request:
> +	/*
> +	 * we are not running a batch, find best request for selected data_dir
> +	 */
> +	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
> +		/*
> +		 * A deadline has expired, the last request was in the other
> +		 * direction, or we have run out of higher-sectored requests.
> +		 * Start again from the request with the earliest expiry time.
> +		 */
> +		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> +	} else {
> +		/*
> +		 * The last req was the same dir and we have a next request in
> +		 * sort order. No expired requests so continue on from here.
> +		 */
> +		rq = dd->next_rq[data_dir];
> +	}
> +
> +	dd->batching = 0;
> +
> +dispatch_request:
> +	/*
> +	 * rq is the selected appropriate request.
> +	 */
> +	dd->batching++;
> +	deadline_move_request(dd, rq);
> +done:
> +	rq->rq_flags |= RQF_STARTED;
> +	spin_unlock(&dd->lock);
> +	return rq;
> +}
> +
> +static void dd_dispatch_requests(struct blk_mq_hw_ctx *hctx,
> +				 struct list_head *rq_list)
> +{
> +	blk_mq_sched_dispatch_shadow_requests(hctx, rq_list, __dd_dispatch_request);
> +}
> +
> +static void dd_exit_queue(struct elevator_queue *e)
> +{
> +	struct deadline_data *dd = e->elevator_data;
> +
> +	BUG_ON(!list_empty(&dd->fifo_list[READ]));
> +	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
> +
> +	blk_mq_sched_free_requests(dd->tags);
> +	kfree(dd);
> +}
> +
> +/*
> + * initialize elevator private data (deadline_data).
> + */
> +static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
> +{
> +	struct deadline_data *dd;
> +	struct elevator_queue *eq;
> +
> +	eq = elevator_alloc(q, e);
> +	if (!eq)
> +		return -ENOMEM;
> +
> +	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
> +	if (!dd) {
> +		kobject_put(&eq->kobj);
> +		return -ENOMEM;
> +	}
> +	eq->elevator_data = dd;
> +
> +	dd->tags = blk_mq_sched_alloc_requests(queue_depth, q->node);
> +	if (!dd->tags) {
> +		kfree(dd);
> +		kobject_put(&eq->kobj);
> +		return -ENOMEM;
> +	}
> +
> +	INIT_LIST_HEAD(&dd->fifo_list[READ]);
> +	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
> +	dd->sort_list[READ] = RB_ROOT;
> +	dd->sort_list[WRITE] = RB_ROOT;
> +	dd->fifo_expire[READ] = read_expire;
> +	dd->fifo_expire[WRITE] = write_expire;
> +	dd->writes_starved = writes_starved;
> +	dd->front_merges = 1;
> +	dd->fifo_batch = fifo_batch;
> +	spin_lock_init(&dd->lock);
> +	INIT_LIST_HEAD(&dd->dispatch);
> +	atomic_set(&dd->wait_index, 0);
> +
> +	q->elevator = eq;
> +	return 0;
> +}
> +
> +static int dd_request_merge(struct request_queue *q, struct request **rq,
> +			    struct bio *bio)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	sector_t sector = bio_end_sector(bio);
> +	struct request *__rq;
> +
> +	if (!dd->front_merges)
> +		return ELEVATOR_NO_MERGE;
> +
> +	__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
> +	if (__rq) {
> +		BUG_ON(sector != blk_rq_pos(__rq));
> +
> +		if (elv_bio_merge_ok(__rq, bio)) {
> +			*rq = __rq;
> +			return ELEVATOR_FRONT_MERGE;
> +		}
> +	}
> +
> +	return ELEVATOR_NO_MERGE;
> +}
> +
> +static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	int ret;
> +
> +	spin_lock(&dd->lock);
> +	ret = blk_mq_sched_try_merge(q, bio);
> +	spin_unlock(&dd->lock);
> +
> +	return ret;
> +}
> +
> +/*
> + * add rq to rbtree and fifo
> + */
> +static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> +			      bool at_head)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	const int data_dir = rq_data_dir(rq);
> +
> +	if (blk_mq_sched_try_insert_merge(q, rq))
> +		return;
> +

A related doubt: shouldn't blk_mq_sched_try_insert_merge be invoked with the scheduler lock held too, as blk_mq_sched_try_merge, to protect (at least) q->last_merge?

In bfq this function is invoked with the lock held.

Thanks,
Paolo

> +	blk_mq_sched_request_inserted(rq);
> +
> +	/*
> +	 * If we're trying to insert a real request, just send it directly
> +	 * to the hardware dispatch list. This only happens for a requeue,
> +	 * or FUA/FLUSH requests.
> +	 */
> +	if (!blk_mq_sched_rq_is_shadow(rq)) {
> +		spin_lock(&hctx->lock);
> +		list_add_tail(&rq->queuelist, &hctx->dispatch);
> +		spin_unlock(&hctx->lock);
> +		return;
> +	}
> +
> +	if (at_head || rq->cmd_type != REQ_TYPE_FS) {
> +		if (at_head)
> +			list_add(&rq->queuelist, &dd->dispatch);
> +		else
> +			list_add_tail(&rq->queuelist, &dd->dispatch);
> +	} else {
> +		deadline_add_rq_rb(dd, rq);
> +
> +		if (rq_mergeable(rq)) {
> +			elv_rqhash_add(q, rq);
> +			if (!q->last_merge)
> +				q->last_merge = rq;
> +		}
> +
> +		/*
> +		 * set expire time and add to fifo list
> +		 */
> +		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
> +		list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
> +	}
> +}
> +
> +static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
> +			       struct list_head *list, bool at_head)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	spin_lock(&dd->lock);
> +	while (!list_empty(list)) {
> +		struct request *rq;
> +
> +		rq = list_first_entry(list, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		dd_insert_request(hctx, rq, at_head);
> +	}
> +	spin_unlock(&dd->lock);
> +}
> +
> +static struct request *dd_get_request(struct request_queue *q, unsigned int op,
> +				      struct blk_mq_alloc_data *data)
> +{
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	struct request *rq;
> +
> +	/*
> +	 * The flush machinery intercepts before we insert the request. As
> +	 * a work-around, just hand it back a real request.
> +	 */
> +	if (unlikely(op & (REQ_PREFLUSH | REQ_FUA)))
> +		rq = __blk_mq_alloc_request(data, op);
> +	else {
> +		rq = blk_mq_sched_alloc_shadow_request(q, data, dd->tags, &dd->wait_index);
> +		if (rq)
> +			blk_mq_rq_ctx_init(q, data->ctx, rq, op);
> +	}
> +
> +	return rq;
> +}
> +
> +static bool dd_put_request(struct request *rq)
> +{
> +	/*
> +	 * If it's a real request, we just have to free it. For a shadow
> +	 * request, we should only free it if we haven't started it. A
> +	 * started request is mapped to a real one, and the real one will
> +	 * free it. We can get here with request merges, since we then
> +	 * free the request before we start/issue it.
> +	 */
> +	if (!blk_mq_sched_rq_is_shadow(rq))
> +		return false;
> +
> +	if (!(rq->rq_flags & RQF_STARTED)) {
> +		struct request_queue *q = rq->q;
> +		struct deadline_data *dd = q->elevator->elevator_data;
> +
> +		/*
> +		 * IO completion would normally do this, but if we merge
> +		 * and free before we issue the request, drop both the
> +		 * tag and queue ref
> +		 */
> +		blk_mq_sched_free_shadow_request(dd->tags, rq);
> +		blk_queue_exit(q);
> +	}
> +
> +	return true;
> +}
> +
> +static void dd_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
> +{
> +	struct request *sched_rq = rq->end_io_data;
> +
> +	/*
> +	 * sched_rq can be NULL, if we haven't setup the shadow yet
> +	 * because we failed getting one.
> +	 */
> +	if (sched_rq) {
> +		struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +
> +		blk_mq_sched_free_shadow_request(dd->tags, sched_rq);
> +		blk_mq_start_stopped_hw_queue(hctx, true);
> +	}
> +}
> +
> +static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +
> +	return !list_empty_careful(&dd->dispatch) ||
> +		!list_empty_careful(&dd->fifo_list[0]) ||
> +		!list_empty_careful(&dd->fifo_list[1]);
> +}
> +
> +/*
> + * sysfs parts below
> + */
> +static ssize_t
> +deadline_var_show(int var, char *page)
> +{
> +	return sprintf(page, "%d\n", var);
> +}
> +
> +static ssize_t
> +deadline_var_store(int *var, const char *page, size_t count)
> +{
> +	char *p = (char *) page;
> +
> +	*var = simple_strtol(p, &p, 10);
> +	return count;
> +}
> +
> +#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
> +static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
> +{									\
> +	struct deadline_data *dd = e->elevator_data;			\
> +	int __data = __VAR;						\
> +	if (__CONV)							\
> +		__data = jiffies_to_msecs(__data);			\
> +	return deadline_var_show(__data, (page));			\
> +}
> +SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
> +SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
> +SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
> +SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
> +SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
> +#undef SHOW_FUNCTION
> +
> +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
> +static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
> +{									\
> +	struct deadline_data *dd = e->elevator_data;			\
> +	int __data;							\
> +	int ret = deadline_var_store(&__data, (page), count);		\
> +	if (__data < (MIN))						\
> +		__data = (MIN);						\
> +	else if (__data > (MAX))					\
> +		__data = (MAX);						\
> +	if (__CONV)							\
> +		*(__PTR) = msecs_to_jiffies(__data);			\
> +	else								\
> +		*(__PTR) = __data;					\
> +	return ret;							\
> +}
> +STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
> +STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
> +STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
> +#undef STORE_FUNCTION
> +
> +#define DD_ATTR(name) \
> +	__ATTR(name, S_IRUGO|S_IWUSR, deadline_##name##_show, \
> +				      deadline_##name##_store)
> +
> +static struct elv_fs_entry deadline_attrs[] = {
> +	DD_ATTR(read_expire),
> +	DD_ATTR(write_expire),
> +	DD_ATTR(writes_starved),
> +	DD_ATTR(front_merges),
> +	DD_ATTR(fifo_batch),
> +	__ATTR_NULL
> +};
> +
> +static struct elevator_type mq_deadline = {
> +	.ops.mq = {
> +		.get_request		= dd_get_request,
> +		.put_request		= dd_put_request,
> +		.insert_requests	= dd_insert_requests,
> +		.dispatch_requests	= dd_dispatch_requests,
> +		.completed_request	= dd_completed_request,
> +		.next_request		= elv_rb_latter_request,
> +		.former_request		= elv_rb_former_request,
> +		.bio_merge		= dd_bio_merge,
> +		.request_merge		= dd_request_merge,
> +		.requests_merged	= dd_merged_requests,
> +		.request_merged		= dd_request_merged,
> +		.has_work		= dd_has_work,
> +		.init_sched		= dd_init_queue,
> +		.exit_sched		= dd_exit_queue,
> +	},
> +
> +	.uses_mq	= true,
> +	.elevator_attrs = deadline_attrs,
> +	.elevator_name = "mq-deadline",
> +	.elevator_owner = THIS_MODULE,
> +};
> +
> +static int __init deadline_init(void)
> +{
> +	if (!queue_depth) {
> +		pr_err("mq-deadline: queue depth must be > 0\n");
> +		return -EINVAL;
> +	}
> +	return elv_register(&mq_deadline);
> +}
> +
> +static void __exit deadline_exit(void)
> +{
> +	elv_unregister(&mq_deadline);
> +}
> +
> +module_init(deadline_init);
> +module_exit(deadline_exit);
> +
> +MODULE_AUTHOR("Jens Axboe");
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("MQ deadline IO scheduler");
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2017-02-01 11:11   ` Paolo Valente
@ 2017-02-02  5:19     ` Jens Axboe
  2017-02-02  9:19       ` Paolo Valente
  0 siblings, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2017-02-02  5:19 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov

On 02/01/2017 04:11 AM, Paolo Valente wrote:
>> +static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
>> +{
>> +	struct request_queue *q = hctx->queue;
>> +	struct deadline_data *dd = q->elevator->elevator_data;
>> +	int ret;
>> +
>> +	spin_lock(&dd->lock);
>> +	ret = blk_mq_sched_try_merge(q, bio);
>> +	spin_unlock(&dd->lock);
>> +
> 
> Hi Jens,
> first, good news, bfq is passing my first sanity checks.  Still, I
> need a little more help for the following issue.  There is a case that
> would be impossible to handle without modifying code outside bfq.  But
> so far such a case never occurred, and I hope that it can never occur.
> I'll try to briefly list all relevant details on this concern of mine,
> so that you can quickly confirm my hope, or highlight where or what I
> am missing.

Remember my earlier advice - it's not a problem to change anything in
the core, in fact I would be surprised if you did not need to. My
foresight isn't THAT good! It's much better to fix up an inconsistency
there, rather than work around it in the consumer of that API.

> First, as done above for mq-deadline, invoking blk_mq_sched_try_merge
> with the scheduler lock held is of course necessary (for example, to
> protect q->last_merge).  This may lead to put_rq_private invoked
> with the lock held, in case of successful merge.

Right, or some other lock with the same scope, as per my other email.

> As a consequence, put_rq_private may be invoked:
> (1) in IRQ context, no scheduler lock held, because of a completion:
> can be handled by deferring work and lock grabbing, because the
> completed request is not queued in the scheduler any more;
> (2) in process context, scheduler lock held, because of the above
> successful merge: must be handled immediately, for consistency,
> because the request is still queued in the scheduler;
> (3) in process context, no scheduler lock held, for some other reason:
> some path apparently may lead to this case, although I've never seen
> it to happen.  Immediate handling, and hence locking, may be needed,
> depending on whether the request is still queued in the scheduler.
> 
> So, my main question is: is case (3) actually impossible?  Should it
> be possible, I guess we would have a problem, because of the
> different lock state with respect to (2).

I agree, there's some inconsistency there, if you potentially need to
grab the lock in your put_rq_private handler. The problem case is #2,
when we have the merge. I would probably suggest that the best way to
handle that is to pass back the dropped request so we can put it outside
of holding the lock.

Let me see if I can come up with a good solution for this. We have to be
consistent in how we invoke the scheduler functions, we can't have hooks
that are called in unknown lock states. I also don't want you to have to
add defer work handling in that kind of path, that will impact your
performance and overhead.

> Finally, I hope that it is certainly impossible to have a case (4): in
> IRQ context, no lock held, but with the request in the scheduler.

That should not be possible.

Edit: since I'm on a flight and email won't send, I had a few minutes to
hack this up. Totally untested, but something like the below should do
it. Not super pretty... I'll play with this a bit more tomorrow.


diff --git a/block/blk-core.c b/block/blk-core.c
index c142de090c41..530a9a3f60c9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1609,7 +1609,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 {
 	struct blk_plug *plug;
 	int el_ret, where = ELEVATOR_INSERT_SORT;
-	struct request *req;
+	struct request *req, *free;
 	unsigned int request_count = 0;
 	unsigned int wb_acct;
 
@@ -1650,15 +1650,21 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	if (el_ret == ELEVATOR_BACK_MERGE) {
 		if (bio_attempt_back_merge(q, req, bio)) {
 			elv_bio_merged(q, req, bio);
-			if (!attempt_back_merge(q, req))
+			free = attempt_back_merge(q, req);
+			if (!free)
 				elv_merged_request(q, req, el_ret);
+			else
+				__blk_put_request(q, free);
 			goto out_unlock;
 		}
 	} else if (el_ret == ELEVATOR_FRONT_MERGE) {
 		if (bio_attempt_front_merge(q, req, bio)) {
 			elv_bio_merged(q, req, bio);
-			if (!attempt_front_merge(q, req))
+			free = attempt_front_merge(q, req);
+			if (!free)
 				elv_merged_request(q, req, el_ret);
+			else
+				__blk_put_request(q, free);
 			goto out_unlock;
 		}
 	}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 6aa43dec5af4..011b1c6e3cb4 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -661,29 +661,29 @@ static void blk_account_io_merge(struct request *req)
 /*
  * Has to be called with the request spinlock acquired
  */
-static int attempt_merge(struct request_queue *q, struct request *req,
-			  struct request *next)
+static struct request *attempt_merge(struct request_queue *q,
+				     struct request *req, struct request *next)
 {
 	if (!rq_mergeable(req) || !rq_mergeable(next))
-		return 0;
+		return NULL;
 
 	if (req_op(req) != req_op(next))
-		return 0;
+		return NULL;
 
 	/*
 	 * not contiguous
 	 */
 	if (blk_rq_pos(req) + blk_rq_sectors(req) != blk_rq_pos(next))
-		return 0;
+		return NULL;
 
 	if (rq_data_dir(req) != rq_data_dir(next)
 	    || req->rq_disk != next->rq_disk
 	    || req_no_special_merge(next))
-		return 0;
+		return NULL;
 
 	if (req_op(req) == REQ_OP_WRITE_SAME &&
 	    !blk_write_same_mergeable(req->bio, next->bio))
-		return 0;
+		return NULL;
 
 	/*
 	 * If we are allowed to merge, then append bio list
@@ -692,7 +692,7 @@ static int attempt_merge(struct request_queue *q, struct request *req,
 	 * counts here.
 	 */
 	if (!ll_merge_requests_fn(q, req, next))
-		return 0;
+		return NULL;
 
 	/*
 	 * If failfast settings disagree or any of the two is already
@@ -732,30 +732,32 @@ static int attempt_merge(struct request_queue *q, struct request *req,
 	if (blk_rq_cpu_valid(next))
 		req->cpu = next->cpu;
 
-	/* owner-ship of bio passed from next to req */
+	/*
+	 * owner-ship of bio passed from next to req, return 'next' for
+	 * the caller to free
+	 */
 	next->bio = NULL;
-	__blk_put_request(q, next);
-	return 1;
+	return next;
 }
 
-int attempt_back_merge(struct request_queue *q, struct request *rq)
+struct request *attempt_back_merge(struct request_queue *q, struct request *rq)
 {
 	struct request *next = elv_latter_request(q, rq);
 
 	if (next)
 		return attempt_merge(q, rq, next);
 
-	return 0;
+	return NULL;
 }
 
-int attempt_front_merge(struct request_queue *q, struct request *rq)
+struct request *attempt_front_merge(struct request_queue *q, struct request *rq)
 {
 	struct request *prev = elv_former_request(q, rq);
 
 	if (prev)
 		return attempt_merge(q, prev, rq);
 
-	return 0;
+	return NULL;
 }
 
 int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
@@ -767,7 +769,12 @@ int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
 		if (!e->type->ops.sq.elevator_allow_rq_merge_fn(q, rq, next))
 			return 0;
 
-	return attempt_merge(q, rq, next);
+	if (attempt_merge(q, rq, next)) {
+		__blk_put_request(q, next);
+		return 1;
+	}
+
+	return 0;
 }
 
 bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 114814ec3d49..d93b56d53c4e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -234,7 +234,8 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_move_to_dispatch);
 
-bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio)
+bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
+			    struct request **merged_request)
 {
 	struct request *rq;
 	int ret;
@@ -244,7 +245,8 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio)
 		if (!blk_mq_sched_allow_merge(q, rq, bio))
 			return false;
 		if (bio_attempt_back_merge(q, rq, bio)) {
-			if (!attempt_back_merge(q, rq))
+			*merged_request = attempt_back_merge(q, rq);
+			if (!*merged_request)
 				elv_merged_request(q, rq, ret);
 			return true;
 		}
@@ -252,7 +254,8 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio)
 		if (!blk_mq_sched_allow_merge(q, rq, bio))
 			return false;
 		if (bio_attempt_front_merge(q, rq, bio)) {
-			if (!attempt_front_merge(q, rq))
+			*merged_request = attempt_front_merge(q, rq);
+			if (!*merged_request)
 				elv_merged_request(q, rq, ret);
 			return true;
 		}
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 9478aaeb48c5..3643686a54b8 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -16,7 +16,8 @@ void blk_mq_sched_put_request(struct request *rq);
 
 void blk_mq_sched_request_inserted(struct request *rq);
 bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, struct request *rq);
-bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
+bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
+				struct request **merged_request);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
 bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
 void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx);
diff --git a/block/blk.h b/block/blk.h
index c1bd4bf9e645..918cea38d51e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -204,8 +204,8 @@ int ll_back_merge_fn(struct request_queue *q, struct request *req,
 		     struct bio *bio);
 int ll_front_merge_fn(struct request_queue *q, struct request *req, 
 		      struct bio *bio);
-int attempt_back_merge(struct request_queue *q, struct request *rq);
-int attempt_front_merge(struct request_queue *q, struct request *rq);
+struct request *attempt_back_merge(struct request_queue *q, struct request *rq);
+struct request *attempt_front_merge(struct request_queue *q, struct request *rq);
 int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
 				struct request *next);
 void blk_recalc_rq_segments(struct request *rq);
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 49583536698c..682fa64f55ff 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -371,12 +371,16 @@ static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
 {
 	struct request_queue *q = hctx->queue;
 	struct deadline_data *dd = q->elevator->elevator_data;
-	int ret;
+	struct request *free = NULL;
+	bool ret;
 
 	spin_lock(&dd->lock);
-	ret = blk_mq_sched_try_merge(q, bio);
+	ret = blk_mq_sched_try_merge(q, bio, &free);
 	spin_unlock(&dd->lock);
 
+	if (free)
+		blk_mq_free_request(free);
+
 	return ret;
 }
 

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2017-02-01 11:56   ` Paolo Valente
@ 2017-02-02  5:20     ` Jens Axboe
  0 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2017-02-02  5:20 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov

On 02/01/2017 04:56 AM, Paolo Valente wrote:
>> +/*
>> + * add rq to rbtree and fifo
>> + */
>> +static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>> +			      bool at_head)
>> +{
>> +	struct request_queue *q = hctx->queue;
>> +	struct deadline_data *dd = q->elevator->elevator_data;
>> +	const int data_dir = rq_data_dir(rq);
>> +
>> +	if (blk_mq_sched_try_insert_merge(q, rq))
>> +		return;
>> +
> 
> A related doubt: shouldn't blk_mq_sched_try_insert_merge be invoked
> with the scheduler lock held too, as blk_mq_sched_try_merge, to
> protect (at least) q->last_merge?
>
> In bfq this function is invoked with the lock held.

It doesn't matter which lock you use, as long as:

1) You use the same one consistently
2) It has the same scope as the queue lock (the one you call the
   scheduler lock)

mq-deadline sets up a per-queue structure, deadline_data, and it has a
lock embedded in that structure. This is what mq-deadline uses to
serialize access to its data structures, as well as those in the queue
(like last_merge).
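
A minimal sketch of that pattern, reusing the embedded-lock idea from the
mq-deadline code quoted earlier (the foo_* names, including foo_add_rq(),
are made up for illustration; this is not code from the patchset):

static void foo_insert_request(struct blk_mq_hw_ctx *hctx,
			       struct request *rq, bool at_head)
{
	struct request_queue *q = hctx->queue;
	struct foo_data *fd = q->elevator->elevator_data;

	/*
	 * fd->lock plays the role the queue lock played for the legacy
	 * schedulers: it covers the scheduler's own fifo/rbtree as well
	 * as the queue fields the merge code touches, such as
	 * q->last_merge, so the insert-merge attempt is done under it.
	 */
	spin_lock(&fd->lock);
	if (!blk_mq_sched_try_insert_merge(q, rq))
		foo_add_rq(fd, rq, at_head);	/* hypothetical fifo/rbtree insert */
	spin_unlock(&fd->lock);
}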

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2017-02-02  5:19     ` Jens Axboe
@ 2017-02-02  9:19       ` Paolo Valente
  2017-02-02 15:30         ` Jens Axboe
  0 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2017-02-02  9:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 02 feb 2017, alle ore 06:19, Jens Axboe <axboe@fb.com> ha scritto:
> 
> On 02/01/2017 04:11 AM, Paolo Valente wrote:
>>> +static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
>>> +{
>>> +	struct request_queue *q = hctx->queue;
>>> +	struct deadline_data *dd = q->elevator->elevator_data;
>>> +	int ret;
>>> +
>>> +	spin_lock(&dd->lock);
>>> +	ret = blk_mq_sched_try_merge(q, bio);
>>> +	spin_unlock(&dd->lock);
>>> +
>> 
>> Hi Jens,
>> first, good news, bfq is passing my first sanity checks.  Still, I
>> need a little more help for the following issue.  There is a case that
>> would be impossible to handle without modifying code outside bfq.  But
>> so far such a case never occurred, and I hope that it can never occur.
>> I'll try to briefly list all relevant details on this concern of mine,
>> so that you can quickly confirm my hope, or highlight where or what I
>> am missing.
> 
> Remember my earlier advice - it's not a problem to change anything in
> the core, in fact I would be surprised if you did not need to. My
> foresight isn't THAT good! It's much better to fix up an inconsistency
> there, rather than work around it in the consumer of that API.
> 
>> First, as done above for mq-deadline, invoking blk_mq_sched_try_merge
>> with the scheduler lock held is of course necessary (for example, to
>> protect q->last_merge).  This may lead to put_rq_private invoked
>> with the lock held, in case of successful merge.
> 
> Right, or some other lock with the same scope, as per my other email.
> 
>> As a consequence, put_rq_private may be invoked:
>> (1) in IRQ context, no scheduler lock held, because of a completion:
>> can be handled by deferring work and lock grabbing, because the
>> completed request is not queued in the scheduler any more;
>> (2) in process context, scheduler lock held, because of the above
>> successful merge: must be handled immediately, for consistency,
>> because the request is still queued in the scheduler;
>> (3) in process context, no scheduler lock held, for some other reason:
>> some path apparently may lead to this case, although I've never seen
>> it to happen.  Immediate handling, and hence locking, may be needed,
>> depending on whether the request is still queued in the scheduler.
>> 
>> So, my main question is: is case (3) actually impossible?  Should it
>> be possible, I guess we would have a problem, because of the
>> different lock state with respect to (2).
> 
> I agree, there's some inconsistency there, if you potentially need to
> grab the lock in your put_rq_private handler. The problem case is #2,
> when we have the merge. I would probably suggest that the best way to
> handle that is to pass back the dropped request so we can put it outside
> of holding the lock.
> 
> Let me see if I can come up with a good solution for this. We have to be
> consistent in how we invoke the scheduler functions, we can't have hooks
> that are called in unknown lock states. I also don't want you to have to
> add defer work handling in that kind of path, that will impact your
> performance and overhead.
> 

I'll try to learn from your solution, because, as of now, I don't see
how to avoid deferred work for the case where put_rq_private is
invoked in interrupt context.  In fact, for this case, we cannot grab
the lock, unless we turn all spin_lock into spin_lock_irq*.
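
For concreteness, a minimal sketch of the deferred-work shape described
above (the foo_* names are hypothetical; this is not the actual bfq-mq
code).  The point is that the scheduler lock is then only ever taken in
process context, so it never needs to disable interrupts:

struct foo_data {
	spinlock_t		lock;
	struct work_struct	put_work;	/* INIT_WORK() at init time */
	/* fifo/rbtree etc., plus a record of requests to clean up */
};

static void foo_put_work_fn(struct work_struct *work)
{
	struct foo_data *fd = container_of(work, struct foo_data, put_work);

	spin_lock(&fd->lock);
	/* release scheduler state for the recorded requests (elided) */
	spin_unlock(&fd->lock);
}

static void foo_put_rq_private(struct request_queue *q, struct request *rq)
{
	struct foo_data *fd = q->elevator->elevator_data;

	/*
	 * May run in IRQ context: do not touch fd->lock here.  Recording
	 * which request needs cleanup requires its own IRQ-safe channel
	 * (e.g. a lockless list), elided in this sketch.
	 */
	schedule_work(&fd->put_work);
}

The trade-off, as discussed below, is the extra latency and overhead of
bouncing the cleanup through a work item.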

>> Finally, I hope that it is certainly impossible to have a case (4): in
>> IRQ context, no lock held, but with the request in the scheduler.
> 
> That should not be possible.
> 
> Edit: since I'm on a flight and email won't send, I had a few minutes to
> hack this up. Totally untested, but something like the below should do
> it. Not super pretty... I'll play with this a bit more tomorrow.
> 
> 

The scheme is clear.  One comment, in case it could make sense and
avoid more complexity: since put_rq_priv is invoked in two different
contexts, process or interrupt, I didn't find it so confusing that, when
put_rq_priv is invoked in the context where the lock cannot be held
(unless one is willing to pay with irq disabling all the time), the
lock is not held, while, when invoked in the context where the lock
can be held, the lock is actually held, or must be taken.

Thanks,
Paolo

> diff --git a/block/blk-core.c b/block/blk-core.c
> index c142de090c41..530a9a3f60c9 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1609,7 +1609,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
> {
> 	struct blk_plug *plug;
> 	int el_ret, where = ELEVATOR_INSERT_SORT;
> -	struct request *req;
> +	struct request *req, *free;
> 	unsigned int request_count = 0;
> 	unsigned int wb_acct;
> 
> @@ -1650,15 +1650,21 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
> 	if (el_ret == ELEVATOR_BACK_MERGE) {
> 		if (bio_attempt_back_merge(q, req, bio)) {
> 			elv_bio_merged(q, req, bio);
> -			if (!attempt_back_merge(q, req))
> +			free = attempt_back_merge(q, req);
> +			if (!free)
> 				elv_merged_request(q, req, el_ret);
> +			else
> +				__blk_put_request(q, free);
> 			goto out_unlock;
> 		}
> 	} else if (el_ret == ELEVATOR_FRONT_MERGE) {
> 		if (bio_attempt_front_merge(q, req, bio)) {
> 			elv_bio_merged(q, req, bio);
> -			if (!attempt_front_merge(q, req))
> +			free = attempt_front_merge(q, req);
> +			if (!free)
> 				elv_merged_request(q, req, el_ret);
> +			else
> +				__blk_put_request(q, free);
> 			goto out_unlock;
> 		}
> 	}
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 6aa43dec5af4..011b1c6e3cb4 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -661,29 +661,29 @@ static void blk_account_io_merge(struct request *req)
> /*
>  * Has to be called with the request spinlock acquired
>  */
> -static int attempt_merge(struct request_queue *q, struct request *req,
> -			  struct request *next)
> +static struct request *attempt_merge(struct request_queue *q,
> +				     struct request *req, struct request *next)
> {
> 	if (!rq_mergeable(req) || !rq_mergeable(next))
> -		return 0;
> +		return NULL;
> 
> 	if (req_op(req) != req_op(next))
> -		return 0;
> +		return NULL;
> 
> 	/*
> 	 * not contiguous
> 	 */
> 	if (blk_rq_pos(req) + blk_rq_sectors(req) != blk_rq_pos(next))
> -		return 0;
> +		return NULL;
> 
> 	if (rq_data_dir(req) != rq_data_dir(next)
> 	    || req->rq_disk != next->rq_disk
> 	    || req_no_special_merge(next))
> -		return 0;
> +		return NULL;
> 
> 	if (req_op(req) == REQ_OP_WRITE_SAME &&
> 	    !blk_write_same_mergeable(req->bio, next->bio))
> -		return 0;
> +		return NULL;
> 
> 	/*
> 	 * If we are allowed to merge, then append bio list
> @@ -692,7 +692,7 @@ static int attempt_merge(struct request_queue *q, struct request *req,
> 	 * counts here.
> 	 */
> 	if (!ll_merge_requests_fn(q, req, next))
> -		return 0;
> +		return NULL;
> 
> 	/*
> 	 * If failfast settings disagree or any of the two is already
> @@ -732,30 +732,32 @@ static int attempt_merge(struct request_queue *q, struct request *req,
> 	if (blk_rq_cpu_valid(next))
> 		req->cpu = next->cpu;
> 
> -	/* owner-ship of bio passed from next to req */
> +	/*
> +	 * owner-ship of bio passed from next to req, return 'next' for
> +	 * the caller to free
> +	 */
> 	next->bio = NULL;
> -	__blk_put_request(q, next);
> -	return 1;
> +	return next;
> }
> 
> -int attempt_back_merge(struct request_queue *q, struct request *rq)
> +struct request *attempt_back_merge(struct request_queue *q, struct request *rq)
> {
> 	struct request *next = elv_latter_request(q, rq);
> 
> 	if (next)
> 		return attempt_merge(q, rq, next);
> 
> -	return 0;
> +	return NULL;
> }
> 
> -int attempt_front_merge(struct request_queue *q, struct request *rq)
> +struct request *attempt_front_merge(struct request_queue *q, struct request *rq)
> {
> 	struct request *prev = elv_former_request(q, rq);
> 
> 	if (prev)
> 		return attempt_merge(q, prev, rq);
> 
> -	return 0;
> +	return NULL;
> }
> 
> int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
> @@ -767,7 +769,12 @@ int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
> 		if (!e->type->ops.sq.elevator_allow_rq_merge_fn(q, rq, next))
> 			return 0;
> 
> -	return attempt_merge(q, rq, next);
> +	if (attempt_merge(q, rq, next)) {
> +		__blk_put_request(q, next);
> +		return 1;
> +	}
> +
> +	return 0;
> }
> 
> bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 114814ec3d49..d93b56d53c4e 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -234,7 +234,8 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
> }
> EXPORT_SYMBOL_GPL(blk_mq_sched_move_to_dispatch);
> 
> -bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio)
> +bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> +			    struct request **merged_request)
> {
> 	struct request *rq;
> 	int ret;
> @@ -244,7 +245,8 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio)
> 		if (!blk_mq_sched_allow_merge(q, rq, bio))
> 			return false;
> 		if (bio_attempt_back_merge(q, rq, bio)) {
> -			if (!attempt_back_merge(q, rq))
> +			*merged_request = attempt_back_merge(q, rq);
> +			if (!*merged_request)
> 				elv_merged_request(q, rq, ret);
> 			return true;
> 		}
> @@ -252,7 +254,8 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio)
> 		if (!blk_mq_sched_allow_merge(q, rq, bio))
> 			return false;
> 		if (bio_attempt_front_merge(q, rq, bio)) {
> -			if (!attempt_front_merge(q, rq))
> +			*merged_request = attempt_front_merge(q, rq);
> +			if (!*merged_request)
> 				elv_merged_request(q, rq, ret);
> 			return true;
> 		}
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 9478aaeb48c5..3643686a54b8 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -16,7 +16,8 @@ void blk_mq_sched_put_request(struct request *rq);
> 
> void blk_mq_sched_request_inserted(struct request *rq);
> bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, struct request *rq);
> -bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
> +bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> +				struct request **merged_request);
> bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
> bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
> void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx);
> diff --git a/block/blk.h b/block/blk.h
> index c1bd4bf9e645..918cea38d51e 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -204,8 +204,8 @@ int ll_back_merge_fn(struct request_queue *q, struct request *req,
> 		     struct bio *bio);
> int ll_front_merge_fn(struct request_queue *q, struct request *req, 
> 		      struct bio *bio);
> -int attempt_back_merge(struct request_queue *q, struct request *rq);
> -int attempt_front_merge(struct request_queue *q, struct request *rq);
> +struct request *attempt_back_merge(struct request_queue *q, struct request *rq);
> +struct request *attempt_front_merge(struct request_queue *q, struct request *rq);
> int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
> 				struct request *next);
> void blk_recalc_rq_segments(struct request *rq);
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 49583536698c..682fa64f55ff 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -371,12 +371,16 @@ static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
> {
> 	struct request_queue *q = hctx->queue;
> 	struct deadline_data *dd = q->elevator->elevator_data;
> -	int ret;
> +	struct request *free = NULL;
> +	bool ret;
> 
> 	spin_lock(&dd->lock);
> -	ret = blk_mq_sched_try_merge(q, bio);
> +	ret = blk_mq_sched_try_merge(q, bio, &free);
> 	spin_unlock(&dd->lock);
> 
> +	if (free)
> +		blk_mq_free_request(free);
> +
> 	return ret;
> }
> 
> 
> -- 
> Jens Axboe
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2017-02-02  9:19       ` Paolo Valente
@ 2017-02-02 15:30         ` Jens Axboe
  2017-02-02 21:15           ` Paolo Valente
  0 siblings, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2017-02-02 15:30 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov

On 02/02/2017 02:19 AM, Paolo Valente wrote:
> The scheme is clear.  One comment, in case it could make sense and
> avoid more complexity: since put_rq_priv is invoked in two different
> contexts, process or interrupt, I didn't feel so confusing that, when
> put_rq_priv is invoked in the context where the lock cannot be held
> (unless one is willing to pay with irq disabling all the times), the
> lock is not held, while, when invoked in the context where the lock
> can be held, the lock is actually held, or must be taken.

If you grab the same lock from put_rq_priv, yes, you must make it IRQ
disabling in all contexts, and use _irqsave() from put_rq_priv. If it's
just freeing resources, you could potentially wait and do that when
someone else needs them, since that part will come from process context.
That would need two locks, though.

As I said above, I would not worry about the IRQ disabling lock.
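
A minimal sketch of the _irqsave() variant, assuming a put_rq_priv-style
hook that receives the queue and the request (foo_* names are made up;
fd->lock stands in for whatever lock the scheduler embeds):

static void foo_put_rq_priv(struct request_queue *q, struct request *rq)
{
	struct foo_data *fd = q->elevator->elevator_data;
	unsigned long flags;

	/*
	 * This can be called from the completion (IRQ) path, so the lock
	 * has to be taken with interrupts disabled here, and every other
	 * acquisition of fd->lock in the scheduler must disable
	 * interrupts as well.
	 */
	spin_lock_irqsave(&fd->lock, flags);
	/* release whatever per-request scheduler state is left (elided) */
	spin_unlock_irqrestore(&fd->lock, flags);
}

The cost is that every acquisition of fd->lock now disables interrupts,
which is the trade-off weighed in the rest of the thread.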

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2017-02-02 15:30         ` Jens Axboe
@ 2017-02-02 21:15           ` Paolo Valente
  2017-02-02 21:32             ` Jens Axboe
  0 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2017-02-02 21:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 02 feb 2017, alle ore 16:30, Jens Axboe <axboe@fb.com> ha scritto:
> 
> On 02/02/2017 02:19 AM, Paolo Valente wrote:
>> The scheme is clear.  One comment, in case it could make sense and
>> avoid more complexity: since put_rq_priv is invoked in two different
>> contexts, process or interrupt, I didn't feel so confusing that, when
>> put_rq_priv is invoked in the context where the lock cannot be held
>> (unless one is willing to pay with irq disabling all the times), the
>> lock is not held, while, when invoked in the context where the lock
>> can be held, the lock is actually held, or must be taken.
> 
> If you grab the same lock from put_rq_priv, yes, you must make it IRQ
> disabling in all contexts, and use _irqsave() from put_rq_priv. If it's
> just freeing resources, you could potentially wait and do that when
> someone else needs them, since that part will come from process context.
> That would need two locks, though.
> 
> As I said above, I would not worry about the IRQ disabling lock.
> 

I'm sorry, I focused only on the IRQ-disabling consequence of grabbing
a scheduler lock also in IRQ context.  I thought it was a serious
enough issue to avoid this option.  Yet there is also a deadlock
problem related to this option.  In fact, the IRQ handler may preempt
some process-context code that already holds some other locks, and, if
some of these locks are already held by another process, which is
executing on another CPU and which then tries to take the scheduler
lock, or which happens to be preempted by an IRQ handler trying to
grab the scheduler lock, then a deadlock occurs.  This is not just a
speculation, but a problem that did occur before I moved to a
deferred-work solution, and that can be readily reproduced.  Before
moving to a deferred work solution, I tried various code manipulations
to avoid these deadlocks without resorting to deferred work, but to no
avail.

At any rate, bfq seems now to work, so I can finally move from just
asking questions endlessly, to proposing actual code to discuss on.
I'm about to: port this version of bfq to your improved/fixed
blk-mq-sched version in for-4.11 (port postponed, to avoid introducing
further changes in code that did not yet work), run more extensive
tests, polish commits a little bit, and finally share a branch.

Thanks,
Paolo

> -- 
> Jens Axboe
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2017-02-02 21:15           ` Paolo Valente
@ 2017-02-02 21:32             ` Jens Axboe
  2017-02-07 17:27               ` Paolo Valente
  0 siblings, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2017-02-02 21:32 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov

On 02/02/2017 02:15 PM, Paolo Valente wrote:
> 
>> Il giorno 02 feb 2017, alle ore 16:30, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> On 02/02/2017 02:19 AM, Paolo Valente wrote:
>>> The scheme is clear.  One comment, in case it could make sense and
>>> avoid more complexity: since put_rq_priv is invoked in two different
>>> contexts, process or interrupt, I didn't feel so confusing that, when
>>> put_rq_priv is invoked in the context where the lock cannot be held
>>> (unless one is willing to pay with irq disabling all the times), the
>>> lock is not held, while, when invoked in the context where the lock
>>> can be held, the lock is actually held, or must be taken.
>>
>> If you grab the same lock from put_rq_priv, yes, you must make it IRQ
>> disabling in all contexts, and use _irqsave() from put_rq_priv. If it's
>> just freeing resources, you could potentially wait and do that when
>> someone else needs them, since that part will come from process context.
>> That would need two locks, though.
>>
>> As I said above, I would not worry about the IRQ disabling lock.
>>
> 
> I'm sorry, I focused only on the IRQ-disabling consequence of grabbing
> a scheduler lock also in IRQ context.  I thought it was a serious
> enough issue to avoid this option.  Yet there is also a deadlock
> problem related to this option.  In fact, the IRQ handler may preempt
> some process-context code that already holds some other locks, and, if
> some of these locks are already held by another process, which is
> executing on another CPU and which then tries to take the scheduler
> lock, or which happens to be preempted by an IRQ handler trying to
> grab the scheduler lock, then a deadlock occurs.  This is not just a
> speculation, but a problem that did occur before I moved to a
> deferred-work solution, and that can be readily reproduced.  Before
> moving to a deferred work solution, I tried various code manipulations
> to avoid these deadlocks without resorting to deferred work, but to no
> avail.

There are two important rules here:

1) If a lock is ever used in interrupt context, anyone acquiring it must
   ensure that interrupts get disabled.

2) If multiple locks are needed, they need to be acquired in the right
   order.

Instead of talking in hypotheticals, be more specific. With the latest
code, the scheduler lock should now be fine, there should be no cases
where you are being invoked with it held. I'm assuming you are running
with lockdep enabled on your kernel? Post the stack traces from your
problem (and your code...), then we can take a look.

Don't punt to deferring work from your put_rq_private() function, that's
a suboptimal workaround. It needs to be fixed for real.

> At any rate, bfq seems now to work, so I can finally move from just
> asking questions endlessly, to proposing actual code to discuss on.
> I'm about to: port this version of bfq to your improved/fixed
> blk-mq-sched version in for-4.11 (port postponed, to avoid introducing
> further changes in code that did not yet work), run more extensive
> tests, polish commits a little bit, and finally share a branch.

Post the code sooner rather than later. There are bound to be things
that need to be improved or fixed up, let's start this process now. The
framework is pretty much buttoned up at this point, so there's time to
shift the attention a bit to a consumer of it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2017-02-02 21:32             ` Jens Axboe
@ 2017-02-07 17:27               ` Paolo Valente
  0 siblings, 0 replies; 69+ messages in thread
From: Paolo Valente @ 2017-02-07 17:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 02 feb 2017, alle ore 22:32, Jens Axboe <axboe@fb.com> ha scritto:
> 
> On 02/02/2017 02:15 PM, Paolo Valente wrote:
>> 
>>> Il giorno 02 feb 2017, alle ore 16:30, Jens Axboe <axboe@fb.com> ha scritto:
>>> 
>>> On 02/02/2017 02:19 AM, Paolo Valente wrote:
>>>> The scheme is clear.  One comment, in case it could make sense and
>>>> avoid more complexity: since put_rq_priv is invoked in two different
>>>> contexts, process or interrupt, I didn't feel so confusing that, when
>>>> put_rq_priv is invoked in the context where the lock cannot be held
>>>> (unless one is willing to pay with irq disabling all the times), the
>>>> lock is not held, while, when invoked in the context where the lock
>>>> can be held, the lock is actually held, or must be taken.
>>> 
>>> If you grab the same lock from put_rq_priv, yes, you must make it IRQ
>>> disabling in all contexts, and use _irqsave() from put_rq_priv. If it's
>>> just freeing resources, you could potentially wait and do that when
>>> someone else needs them, since that part will come from process context.
>>> That would need two locks, though.
>>> 
>>> As I said above, I would not worry about the IRQ disabling lock.
>>> 
>> 
>> I'm sorry, I focused only on the IRQ-disabling consequence of grabbing
>> a scheduler lock also in IRQ context.  I thought it was a serious
>> enough issue to avoid this option.  Yet there is also a deadlock
>> problem related to this option.  In fact, the IRQ handler may preempt
>> some process-context code that already holds some other locks, and, if
>> some of these locks are already held by another process, which is
>> executing on another CPU and which then tries to take the scheduler
>> lock, or which happens to be preempted by an IRQ handler trying to
>> grab the scheduler lock, then a deadlock occurs.  This is not just a
>> speculation, but a problem that did occur before I moved to a
>> deferred-work solution, and that can be readily reproduced.  Before
>> moving to a deferred work solution, I tried various code manipulations
>> to avoid these deadlocks without resorting to deferred work, but to no
>> avail.
> 
> There are two important rules here:
> 
> 1) If a lock is ever used in interrupt context, anyone acquiring it must
>   ensure that interrupts get disabled.
> 
> 2) If multiple locks are needed, they need to be acquired in the right
>   order.
> 
> Instead of talking in hypotheticals, be more specific. With the latest
> code, the scheduler lock should now be fine, there should be no cases
> where you are being invoked with it held. I'm assuming you are running
> with lockdep enabled on your kernel? Post the stack traces from your
> problem (and your code...), then we can take a look.
> 

Hi Jens,

your last change (freeing requests outside merges) did remove two out
of three deadlock scenarios for which I turned some handlers into
deferred work items in bfq-mq.  For the remaining one, I'm about to
send a separate email, with the description of the deadlock, together
with the patch that, applied on top of the bfq-mq branch, causes the
deadlock by moving the body of exit_icq back from a deferred work
item to the exit_icq hook itself.  And, yes, as I'll write below, I'm
finally about to share a branch containing bfq-mq.

> Don't punt to deferring work from your put_rq_private() function, that's
> a suboptimal workaround. It needs to be fixed for real.
> 

Yeah, sub-optimal also in terms of poor developer time: I spent a lot of
time getting that deferred work to function, and hopefully to be a
little efficient!  The actual problem has been that I preferred to try to get
to the bottom of those deadlocks on my own, and not to bother you also
on that issue.  Maybe next time I will ask you one more question
instead of one less :)

>> At any rate, bfq seems now to work, so I can finally move from just
>> asking questions endlessly, to proposing actual code to discuss on.
>> I'm about to: port this version of bfq to your improved/fixed
>> blk-mq-sched version in for-4.11 (port postponed, to avoid introducing
>> further changes in code that did not yet work), run more extensive
>> tests, polish commits a little bit, and finally share a branch.
> 
> Post the code sooner rather than later. There are bound to be things
> that need to be improved or fixed up, let's start this process now. The
> framework is pretty much buttoned up at this point, so there's time to
> shift the attention a bit to a consumer of it.
> 

Ok, to follow this suggestion of yours at 100%, I have postponed
several steps (removal of any invariant check or extra log message,
merging of the various files bfq is made of into just one file, code
polishing), and I'm about to share my current WIP branch in a
follow-up message.

Thanks,
Paolo

> -- 
> Jens Axboe

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2016-12-17  0:12 ` [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
                     ` (6 preceding siblings ...)
  2017-02-01 11:56   ` Paolo Valente
@ 2017-02-16 10:46   ` Paolo Valente
  2017-02-16 15:35     ` Jens Axboe
  7 siblings, 1 reply; 69+ messages in thread
From: Paolo Valente @ 2017-02-16 10:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov


> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
> 
> This is basically identical to deadline-iosched, except it registers
> as a MQ capable scheduler. This is still a single queue design.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>
...
> +
> +static void dd_merged_requests(struct request_queue *q, struct request *req,
> +			       struct request *next)
> +{
> +	/*
> +	 * if next expires before rq, assign its expire time to rq
> +	 * and move into next position (next will be deleted) in fifo
> +	 */
> +	if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
> +		if (time_before((unsigned long)next->fifo_time,
> +				(unsigned long)req->fifo_time)) {
> +			list_move(&req->queuelist, &next->queuelist);
> +			req->fifo_time = next->fifo_time;
> +		}
> +	}
> +

Jens,
while trying to imagine the possible causes of Bart's hang with
bfq-mq, I've bumped into the following doubt: in the above function
(in my case, in bfq-mq-'s equivalent of the above function), are
we sure that neither req or next could EVER be in dd->dispatch instead
of dd->fifo_list?  I've tried to verify it, but, although I think it has never
happened in my tests, I was not able to make sure that no unlucky
combination may ever happen (considering also the use of
blk_rq_is_passthrough too, to decide where to put a new request).

I'm making a blunder, right?

Thanks,
Paolo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
  2017-02-16 10:46   ` Paolo Valente
@ 2017-02-16 15:35     ` Jens Axboe
  0 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2017-02-16 15:35 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, linux-block, Linux-Kernal, osandov

On 02/16/2017 03:46 AM, Paolo Valente wrote:
> 
>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> This is basically identical to deadline-iosched, except it registers
>> as a MQ capable scheduler. This is still a single queue design.
>>
>> Signed-off-by: Jens Axboe <axboe@fb.com>
> ...
>> +
>> +static void dd_merged_requests(struct request_queue *q, struct request *req,
>> +			       struct request *next)
>> +{
>> +	/*
>> +	 * if next expires before rq, assign its expire time to rq
>> +	 * and move into next position (next will be deleted) in fifo
>> +	 */
>> +	if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
>> +		if (time_before((unsigned long)next->fifo_time,
>> +				(unsigned long)req->fifo_time)) {
>> +			list_move(&req->queuelist, &next->queuelist);
>> +			req->fifo_time = next->fifo_time;
>> +		}
>> +	}
>> +
> 
> Jens,
> while trying to imagine the possible causes of Bart's hang with
> bfq-mq, I've bumped into the following doubt: in the above function
> (in my case, in bfq-mq's equivalent of the above function), are we
> sure that neither req nor next could EVER be in dd->dispatch instead
> of dd->fifo_list?  I've tried to verify this and, although I believe it
> has never happened in my tests, I was not able to rule out that some
> unlucky combination might make it happen (considering also that
> blk_rq_is_passthrough is used to decide where to put a new request).
> 
> I'm making a blunder, right?

If a request goes into dd->dispatch, it's not going to be found as a
merge candidate. Hence we can never end up calling the above on such a
request.

-- 
Jens Axboe
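
For illustration, a minimal sketch of the insertion-side split that the
answer above relies on (not the exact code from this patchset: the
function name is made up and locking is omitted, but the helpers are the
ones mq-deadline uses). Passthrough requests go straight to dd->dispatch
and are never added to the merge hash or sort tree, so a merge lookup can
only ever return fifo_list requests:

        static void dd_insert_request_sketch(struct request_queue *q,
                                             struct deadline_data *dd,
                                             struct request *rq)
        {
                if (blk_rq_is_passthrough(rq)) {
                        /*
                         * Dispatch-list only: the request is never added to
                         * the elevator merge hash or the sort tree, so a
                         * merge lookup cannot return it and
                         * dd_merged_requests() never sees it.
                         */
                        list_add_tail(&rq->queuelist, &dd->dispatch);
                        return;
                }

                /*
                 * Normal requests are the only merge candidates: hash for
                 * back merges, rb tree for front merges, fifo list for
                 * expiry.
                 */
                elv_rqhash_add(q, rq);
                deadline_add_rq_rb(dd, rq);
                list_add_tail(&rq->queuelist, &dd->fifo_list[rq_data_dir(rq)]);
        }

In other words, the invariant is maintained at insertion time rather than
checked at merge time.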

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2017-02-16 15:35 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-17  0:12 [PATCHSET v4] blk-mq-scheduling framework Jens Axboe
2016-12-17  0:12 ` [PATCH 1/8] block: move existing elevator ops to union Jens Axboe
2016-12-17  0:12 ` [PATCH 2/8] blk-mq: make mq_ops a const pointer Jens Axboe
2016-12-17  0:12 ` [PATCH 3/8] block: move rq_ioc() to blk.h Jens Axboe
2016-12-20 10:12   ` Paolo Valente
2016-12-20 15:46     ` Jens Axboe
2016-12-20 22:14       ` Jens Axboe
2016-12-17  0:12 ` [PATCH 4/8] blk-mq: un-export blk_mq_free_hctx_request() Jens Axboe
2016-12-17  0:12 ` [PATCH 5/8] blk-mq: export some helpers we need to the scheduling framework Jens Axboe
2016-12-17  0:12 ` [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers Jens Axboe
2016-12-20 11:55   ` Paolo Valente
2016-12-20 15:45     ` Jens Axboe
2016-12-21  2:22     ` Jens Axboe
2016-12-22 15:20       ` Paolo Valente
2016-12-22  9:59   ` Paolo Valente
2016-12-22 11:13     ` Paolo Valente
2017-01-17  2:47       ` Jens Axboe
2017-01-17 10:13         ` Paolo Valente
2017-01-17 12:38           ` Jens Axboe
2016-12-23 10:12     ` Paolo Valente
2017-01-17  2:47     ` Jens Axboe
2017-01-17  9:17       ` Paolo Valente
2016-12-17  0:12 ` [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler Jens Axboe
2016-12-20  9:34   ` Paolo Valente
2016-12-20 15:46     ` Jens Axboe
2016-12-21 11:59   ` Bart Van Assche
2016-12-21 14:22     ` Jens Axboe
2016-12-22 16:07   ` Paolo Valente
2017-01-17  2:47     ` Jens Axboe
2016-12-22 16:49   ` Paolo Valente
2017-01-17  2:47     ` Jens Axboe
2017-01-20 11:07       ` Paolo Valente
2017-01-20 14:26         ` Jens Axboe
2017-01-20 13:14   ` Paolo Valente
2017-01-20 13:18     ` Paolo Valente
2017-01-20 14:28       ` Jens Axboe
2017-01-20 14:28     ` Jens Axboe
2017-02-01 11:11   ` Paolo Valente
2017-02-02  5:19     ` Jens Axboe
2017-02-02  9:19       ` Paolo Valente
2017-02-02 15:30         ` Jens Axboe
2017-02-02 21:15           ` Paolo Valente
2017-02-02 21:32             ` Jens Axboe
2017-02-07 17:27               ` Paolo Valente
2017-02-01 11:56   ` Paolo Valente
2017-02-02  5:20     ` Jens Axboe
2017-02-16 10:46   ` Paolo Valente
2017-02-16 15:35     ` Jens Axboe
2016-12-17  0:12 ` [PATCH 8/8] blk-mq-sched: allow setting of default " Jens Axboe
2016-12-19 11:32 ` [PATCHSET v4] blk-mq-scheduling framework Paolo Valente
2016-12-19 15:20   ` Jens Axboe
2016-12-19 15:33     ` Jens Axboe
2016-12-19 18:21     ` Paolo Valente
2016-12-19 21:05       ` Jens Axboe
2016-12-22 15:28         ` Paolo Valente
2017-01-17  2:47           ` Jens Axboe
2017-01-17 10:49             ` Paolo Valente
2017-01-18 16:14               ` Paolo Valente
2017-01-18 16:21                 ` Jens Axboe
2017-01-23 17:04                   ` Paolo Valente
2017-01-23 17:42                     ` Jens Axboe
2017-01-25  8:46                       ` Paolo Valente
2017-01-25 16:13                         ` Jens Axboe
2017-01-26 14:23                           ` Paolo Valente
2016-12-22 16:23 ` Bart Van Assche
2016-12-22 16:52   ` Omar Sandoval
2016-12-22 16:57     ` Bart Van Assche
2016-12-22 17:12       ` Omar Sandoval
2016-12-22 17:39         ` Bart Van Assche
