* [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)
@ 2017-09-30 10:27 Ming Lei
  2017-09-30 10:27 ` [PATCH V5 1/7] blk-mq: issue rq directly in blk_mq_request_bypass_insert() Ming Lei
                   ` (8 more replies)
  0 siblings, 9 replies; 33+ messages in thread
From: Ming Lei @ 2017-09-30 10:27 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Ming Lei

Hi Jens,

In Red Hat internal storage tests wrt. the blk-mq scheduler, we
found that I/O performance is quite bad with mq-deadline, especially
for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
SRP...).

It turns out one big issue causes the performance regression: requests
are still dequeued from the sw queue/scheduler queue even when the lld's
queue is busy, so I/O merging becomes quite difficult, and
sequential I/O degrades a lot.

This issue became one of the main reasons for reverting default SCSI_MQ
in V4.13.

The 1st patch issues the request directly in blk_mq_request_bypass_insert(),
so that we can improve dm-mpath's performance in part 2, which will
be posted soon.

The other six patches improve this situation and bring back
some of the lost performance.

With this change, SCSI-MQ sequential I/O performance is
improved a lot; Paolo reported that mq-deadline performance
improved much [2] in his dbench test wrt V2, and performance
improvement on lpfc/qla2xxx was observed with V1 [1].

Please consider it for V4.15.

[1] http://marc.info/?l=linux-block&m=150151989915776&w=2
[2] https://marc.info/?l=linux-block&m=150217980602843&w=2

V5:
	- address some comments from Omar
	- add Tested-by & Reviewed-by tags
	- use direct issue for blk_mq_request_bypass_insert(), and
	start to consider improving sequential I/O for dm-mpath
	- only include part 1(the original patch 1 ~ 6), as suggested
	by Omar

V4:
	- add Reviewed-by tag
	- some trivial changes: typo fixes in commit log or comments,
	variable names, no actual functional change

V3:
	- totally round robin for picking req from ctx, as suggested
	by Bart
	- remove one local variable in __sbitmap_for_each_set()
	- drop patches of single dispatch list, which can improve
	performance on mq-deadline, but cause a slight degradation on
	none because all hctxs need to be checked after ->dispatch
	is flushed. Will post it again once it is mature.
	- rebase on v4.13-rc6 with block for-next

V2:
	- dequeue requests from sw queues in round-robin style
	as suggested by Bart, and introduce one helper in sbitmap
	for this purpose
	- improve bio merge via hash table from sw queue
	- add comments about using the DISPATCH_BUSY state in a lockless way,
	simplifying the handling of the busy state,
	- hold ctx->lock when clearing ctx busy bit as suggested
	by Bart


Ming Lei (7):
  blk-mq: issue rq directly in blk_mq_request_bypass_insert()
  blk-mq-sched: fix scheduler bad performance
  sbitmap: introduce __sbitmap_for_each_set()
  blk-mq: introduce blk_mq_dequeue_from_ctx()
  blk-mq-sched: move actual dispatching into one helper
  blk-mq-sched: improve dispatching from sw queue
  blk-mq-sched: don't dequeue request until all in ->dispatch are
    flushed

 block/blk-core.c        |   3 +-
 block/blk-mq-debugfs.c  |   1 +
 block/blk-mq-sched.c    | 104 ++++++++++++++++++++++++++++++++++++-------
 block/blk-mq.c          | 114 +++++++++++++++++++++++++++++++++++++++---------
 block/blk-mq.h          |   4 +-
 drivers/md/dm-rq.c      |   2 +-
 include/linux/blk-mq.h  |   3 ++
 include/linux/sbitmap.h |  64 +++++++++++++++++++--------
 8 files changed, 238 insertions(+), 57 deletions(-)

-- 
2.9.5


* [PATCH V5 1/7] blk-mq: issue rq directly in blk_mq_request_bypass_insert()
  2017-09-30 10:27 [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1) Ming Lei
@ 2017-09-30 10:27 ` Ming Lei
  2017-10-03  8:58   ` Christoph Hellwig
  2017-09-30 10:27 ` [PATCH V5 2/7] blk-mq-sched: fix scheduler bad performance Ming Lei
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Ming Lei @ 2017-09-30 10:27 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Ming Lei

By issuing the rq directly in blk_mq_request_bypass_insert(),
we can:

1) avoid acquiring hctx->lock.

2) return the dispatch result to dm-rq, so that dm-rq
can use this information to improve I/O performance;
part 2 of this patchset will do that.

3) avoid adding the rq into hctx->dispatch directly, since
the following patch for improving sequential I/O performance
uses hctx->dispatch to decide whether the hctx is busy.

There will be another patch in which we move blk_mq_request_direct_insert()
out, since it is better for dm-rq to deal with this situation, and
the I/O scheduler actually lives on the dm-rq side.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c   |  3 +--
 block/blk-mq.c     | 70 ++++++++++++++++++++++++++++++++++++++----------------
 block/blk-mq.h     |  2 +-
 drivers/md/dm-rq.c |  2 +-
 4 files changed, 52 insertions(+), 25 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 048be4aa6024..4c7fd2231145 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2350,8 +2350,7 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
 		 * bypass a potential scheduler on the bottom device for
 		 * insert.
 		 */
-		blk_mq_request_bypass_insert(rq);
-		return BLK_STS_OK;
+		return blk_mq_request_bypass_insert(rq);
 	}
 
 	spin_lock_irqsave(q->queue_lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a18609755e..d1b9fb539eba 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -39,6 +39,8 @@
 
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
+static blk_status_t blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, blk_qc_t *cookie, bool dispatch_only);
 
 static int blk_mq_poll_stats_bkt(const struct request *rq)
 {
@@ -1401,20 +1403,31 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	blk_mq_hctx_mark_pending(hctx, ctx);
 }
 
+static void blk_mq_request_direct_insert(struct blk_mq_hw_ctx *hctx,
+					 struct request *rq)
+{
+	spin_lock(&hctx->lock);
+	list_add_tail(&rq->queuelist, &hctx->dispatch);
+	spin_unlock(&hctx->lock);
+
+	blk_mq_run_hw_queue(hctx, false);
+}
+
 /*
  * Should only be used carefully, when the caller knows we want to
  * bypass a potential IO scheduler on the target device.
  */
-void blk_mq_request_bypass_insert(struct request *rq)
+blk_status_t blk_mq_request_bypass_insert(struct request *rq)
 {
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
+	blk_qc_t cookie;
+	blk_status_t ret;
 
-	spin_lock(&hctx->lock);
-	list_add_tail(&rq->queuelist, &hctx->dispatch);
-	spin_unlock(&hctx->lock);
-
-	blk_mq_run_hw_queue(hctx, false);
+	ret = blk_mq_try_issue_directly(hctx, rq, &cookie, true);
+	if (ret == BLK_STS_RESOURCE)
+		blk_mq_request_direct_insert(hctx, rq);
+	return ret;
 }
 
 void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
@@ -1527,9 +1540,14 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
 }
 
-static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-					struct request *rq,
-					blk_qc_t *cookie, bool may_sleep)
+/*
+ * 'dispatch_only' means we only try to dispatch it out, and
+ * don't deal with dispatch failure if BLK_STS_RESOURCE or
+ * BLK_STS_IOERR happens.
+ */
+static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, blk_qc_t *cookie, bool may_sleep,
+		bool dispatch_only)
 {
 	struct request_queue *q = rq->q;
 	struct blk_mq_queue_data bd = {
@@ -1537,7 +1555,7 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		.last = true,
 	};
 	blk_qc_t new_cookie;
-	blk_status_t ret;
+	blk_status_t ret = BLK_STS_OK;
 	bool run_queue = true;
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -1546,9 +1564,10 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		goto insert;
 	}
 
-	if (q->elevator)
+	if (q->elevator && !dispatch_only)
 		goto insert;
 
+	ret = BLK_STS_RESOURCE;
 	if (!blk_mq_get_driver_tag(rq, NULL, false))
 		goto insert;
 
@@ -1563,26 +1582,32 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	switch (ret) {
 	case BLK_STS_OK:
 		*cookie = new_cookie;
-		return;
+		return ret;
 	case BLK_STS_RESOURCE:
 		__blk_mq_requeue_request(rq);
 		goto insert;
 	default:
 		*cookie = BLK_QC_T_NONE;
-		blk_mq_end_request(rq, ret);
-		return;
+		if (!dispatch_only)
+			blk_mq_end_request(rq, ret);
+		return ret;
 	}
 
 insert:
-	blk_mq_sched_insert_request(rq, false, run_queue, false, may_sleep);
+	if (!dispatch_only)
+		blk_mq_sched_insert_request(rq, false, run_queue, false, may_sleep);
+	return ret;
 }
 
-static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-		struct request *rq, blk_qc_t *cookie)
+static blk_status_t blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, blk_qc_t *cookie, bool dispatch_only)
 {
+	blk_status_t ret;
+
 	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
 		rcu_read_lock();
-		__blk_mq_try_issue_directly(hctx, rq, cookie, false);
+		ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false,
+				dispatch_only);
 		rcu_read_unlock();
 	} else {
 		unsigned int srcu_idx;
@@ -1590,9 +1615,12 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		might_sleep();
 
 		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-		__blk_mq_try_issue_directly(hctx, rq, cookie, true);
+		ret = __blk_mq_try_issue_directly(hctx, rq, cookie, true,
+				dispatch_only);
 		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
 	}
+
+	return ret;
 }
 
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
@@ -1697,12 +1725,12 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 			data.hctx = blk_mq_map_queue(q,
 					same_queue_rq->mq_ctx->cpu);
 			blk_mq_try_issue_directly(data.hctx, same_queue_rq,
-					&cookie);
+					&cookie, false);
 		}
 	} else if (q->nr_hw_queues > 1 && is_sync) {
 		blk_mq_put_ctx(data.ctx);
 		blk_mq_bio_to_request(rq, bio);
-		blk_mq_try_issue_directly(data.hctx, rq, &cookie);
+		blk_mq_try_issue_directly(data.hctx, rq, &cookie, false);
 	} else if (q->elevator) {
 		blk_mq_put_ctx(data.ctx);
 		blk_mq_bio_to_request(rq, bio);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index ef15b3414da5..61aecf398a4b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -54,7 +54,7 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
  */
 void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 				bool at_head);
-void blk_mq_request_bypass_insert(struct request *rq);
+blk_status_t blk_mq_request_bypass_insert(struct request *rq);
 void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 				struct list_head *list);
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 29b237dcc39d..f5e2b6967357 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -404,7 +404,7 @@ static void dm_dispatch_clone_request(struct request *clone, struct request *rq)
 
 	clone->start_time = jiffies;
 	r = blk_insert_cloned_request(clone->q, clone);
-	if (r)
+	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
 		/* must complete clone in terms of original request */
 		dm_complete_request(rq, r);
 }
-- 
2.9.5


* [PATCH V5 2/7] blk-mq-sched: fix scheduler bad performance
  2017-09-30 10:27 [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1) Ming Lei
  2017-09-30 10:27 ` [PATCH V5 1/7] blk-mq: issue rq directly in blk_mq_request_bypass_insert() Ming Lei
@ 2017-09-30 10:27 ` Ming Lei
  2017-10-02 14:19   ` Christoph Hellwig
  2017-09-30 10:27 ` [PATCH V5 3/7] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Ming Lei @ 2017-09-30 10:27 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Ming Lei

When the hw queue is busy, we shouldn't take requests from the
scheduler queue any more, otherwise it becomes difficult to do
I/O merging.

This patch fixes the awful I/O performance on some
SCSI devices (lpfc, qla2xxx, ...) when mq-deadline/kyber
is used, by not taking requests while the hw queue is busy.

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Tom Nguyen <tom81094@gmail.com>
Tested-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..eca011fdfa0e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-	bool did_work = false;
+	bool do_sched_dispatch = true;
 	LIST_HEAD(rq_list);
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -125,18 +125,18 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
+		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
 	} else if (!has_sched_dispatch) {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list);
 	}
 
 	/*
-	 * We want to dispatch from the scheduler if we had no work left
-	 * on the dispatch list, OR if we did have work but weren't able
-	 * to make progress.
+	 * We want to dispatch from the scheduler if there was nothing
+	 * on the dispatch list or we were able to dispatch from the
+	 * dispatch list.
 	 */
-	if (!did_work && has_sched_dispatch) {
+	if (do_sched_dispatch && has_sched_dispatch) {
 		do {
 			struct request *rq;
 
-- 
2.9.5


* [PATCH V5 3/7] sbitmap: introduce __sbitmap_for_each_set()
  2017-09-30 10:27 [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1) Ming Lei
  2017-09-30 10:27 ` [PATCH V5 1/7] blk-mq: issue rq directly in blk_mq_request_bypass_insert() Ming Lei
  2017-09-30 10:27 ` [PATCH V5 2/7] blk-mq-sched: fix scheduler bad performance Ming Lei
@ 2017-09-30 10:27 ` Ming Lei
  2017-09-30 10:27 ` [PATCH V5 4/7] blk-mq: introduce blk_mq_dequeue_from_ctx() Ming Lei
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 33+ messages in thread
From: Ming Lei @ 2017-09-30 10:27 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Ming Lei

We need to iterate over the ctxs starting from any ctx in a
round-robin fashion, so introduce this helper.

Cc: Omar Sandoval <osandov@fb.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Tom Nguyen <tom81094@gmail.com>
Tested-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/sbitmap.h | 64 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 47 insertions(+), 17 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index a1904aadbc45..0dcc60e820de 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -211,10 +211,14 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb);
  */
 bool sbitmap_any_bit_clear(const struct sbitmap *sb);
 
+#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
+#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+
 typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
 
 /**
- * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * __sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @start: Where to start the iteration.
  * @sb: Bitmap to iterate over.
  * @fn: Callback. Should return true to continue or false to break early.
  * @data: Pointer to pass to callback.
@@ -222,35 +226,61 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
  * This is inline even though it's non-trivial so that the function calls to the
  * callback will hopefully get optimized away.
  */
-static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
-					void *data)
+static inline void __sbitmap_for_each_set(struct sbitmap *sb,
+					  unsigned int start,
+					  sb_for_each_fn fn, void *data)
 {
-	unsigned int i;
+	unsigned int index;
+	unsigned int nr;
+	unsigned int scanned = 0;
 
-	for (i = 0; i < sb->map_nr; i++) {
-		struct sbitmap_word *word = &sb->map[i];
-		unsigned int off, nr;
+	if (start >= sb->depth)
+		start = 0;
+	index = SB_NR_TO_INDEX(sb, start);
+	nr = SB_NR_TO_BIT(sb, start);
 
-		if (!word->word)
-			continue;
+	while (scanned < sb->depth) {
+		struct sbitmap_word *word = &sb->map[index];
+		unsigned int depth = min_t(unsigned int, word->depth - nr,
+					   sb->depth - scanned);
 
-		nr = 0;
-		off = i << sb->shift;
+		scanned += depth;
+		if (!word->word)
+			goto next;
+
+		/*
+		 * On the first iteration of the outer loop, we need to add the
+		 * bit offset back to the size of the word for find_next_bit().
+		 * On all other iterations, nr is zero, so this is a noop.
+		 */
+		depth += nr;
 		while (1) {
-			nr = find_next_bit(&word->word, word->depth, nr);
-			if (nr >= word->depth)
+			nr = find_next_bit(&word->word, depth, nr);
+			if (nr >= depth)
 				break;
-
-			if (!fn(sb, off + nr, data))
+			if (!fn(sb, (index << sb->shift) + nr, data))
 				return;
 
 			nr++;
 		}
+next:
+		nr = 0;
+		if (++index >= sb->map_nr)
+			index = 0;
 	}
 }
 
-#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
-#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+/**
+ * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @sb: Bitmap to iterate over.
+ * @fn: Callback. Should return true to continue or false to break early.
+ * @data: Pointer to pass to callback.
+ */
+static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
+					void *data)
+{
+	__sbitmap_for_each_set(sb, 0, fn, data);
+}
 
 static inline unsigned long *__sbitmap_word(struct sbitmap *sb,
 					    unsigned int bitnr)
-- 
2.9.5


* [PATCH V5 4/7] blk-mq: introduce blk_mq_dequeue_from_ctx()
  2017-09-30 10:27 [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1) Ming Lei
                   ` (2 preceding siblings ...)
  2017-09-30 10:27 ` [PATCH V5 3/7] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
@ 2017-09-30 10:27 ` Ming Lei
  2017-10-03  9:01   ` Christoph Hellwig
  2017-09-30 10:27 ` [PATCH V5 5/7] blk-mq-sched: move actual dispatching into one helper Ming Lei
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Ming Lei @ 2017-09-30 10:27 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Ming Lei

This function is introduced for dequeuing requests
from the sw queue so that we can dispatch them in
the scheduler's way.

More importantly, some SCSI devices may set
q->queue_depth, which is a per-request_queue limit
applied to pending I/O from all hctxs. This function
helps avoid dequeuing too many requests from the
sw queue when ->dispatch isn't flushed
completely.

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Tom Nguyen <tom81094@gmail.com>
Tested-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 38 ++++++++++++++++++++++++++++++++++++++
 block/blk-mq.h |  2 ++
 2 files changed, 40 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d1b9fb539eba..8b49af1ade7f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -882,6 +882,44 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 }
 EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
 
+struct dispatch_rq_data {
+	struct blk_mq_hw_ctx *hctx;
+	struct request *rq;
+};
+
+static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
+{
+	struct dispatch_rq_data *dispatch_data = data;
+	struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
+	struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
+
+	spin_lock(&ctx->lock);
+	if (unlikely(!list_empty(&ctx->rq_list))) {
+		dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
+		list_del_init(&dispatch_data->rq->queuelist);
+		if (list_empty(&ctx->rq_list))
+			sbitmap_clear_bit(sb, bitnr);
+	}
+	spin_unlock(&ctx->lock);
+
+	return !dispatch_data->rq;
+}
+
+struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
+					struct blk_mq_ctx *start)
+{
+	unsigned off = start ? start->index_hw : 0;
+	struct dispatch_rq_data data = {
+		.hctx = hctx,
+		.rq   = NULL,
+	};
+
+	__sbitmap_for_each_set(&hctx->ctx_map, off,
+			       dispatch_rq_from_ctx, &data);
+
+	return data.rq;
+}
+
 static inline unsigned int queued_to_index(unsigned int queued)
 {
 	if (!queued)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 61aecf398a4b..915de58572e7 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -35,6 +35,8 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 				bool wait);
+struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
+					struct blk_mq_ctx *start);
 
 /*
  * Internal helpers for allocating/freeing the request map
-- 
2.9.5


* [PATCH V5 5/7] blk-mq-sched: move actual dispatching into one helper
  2017-09-30 10:27 [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1) Ming Lei
                   ` (3 preceding siblings ...)
  2017-09-30 10:27 ` [PATCH V5 4/7] blk-mq: introduce blk_mq_dequeue_from_ctx() Ming Lei
@ 2017-09-30 10:27 ` Ming Lei
  2017-10-02 14:19   ` Christoph Hellwig
  2017-09-30 10:27 ` [PATCH V5 6/7] blk-mq-sched: improve dispatching from sw queue Ming Lei
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Ming Lei @ 2017-09-30 10:27 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Ming Lei

So that it becomes easy to support dispatching from the
sw queue in the following patch.

No functional change.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Tom Nguyen <tom81094@gmail.com>
Tested-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index eca011fdfa0e..538f363f39ca 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -89,6 +89,22 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 	return false;
 }
 
+static void blk_mq_do_dispatch_sched(struct request_queue *q,
+				     struct elevator_queue *e,
+				     struct blk_mq_hw_ctx *hctx)
+{
+	LIST_HEAD(rq_list);
+
+	do {
+		struct request *rq;
+
+		rq = e->type->ops.mq.dispatch_request(hctx);
+		if (!rq)
+			break;
+		list_add(&rq->queuelist, &rq_list);
+	} while (blk_mq_dispatch_rq_list(q, &rq_list));
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
@@ -136,16 +152,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * on the dispatch list or we were able to dispatch from the
 	 * dispatch list.
 	 */
-	if (do_sched_dispatch && has_sched_dispatch) {
-		do {
-			struct request *rq;
-
-			rq = e->type->ops.mq.dispatch_request(hctx);
-			if (!rq)
-				break;
-			list_add(&rq->queuelist, &rq_list);
-		} while (blk_mq_dispatch_rq_list(q, &rq_list));
-	}
+	if (do_sched_dispatch && has_sched_dispatch)
+		blk_mq_do_dispatch_sched(q, e, hctx);
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-- 
2.9.5


* [PATCH V5 6/7] blk-mq-sched: improve dispatching from sw queue
  2017-09-30 10:27 [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1) Ming Lei
                   ` (4 preceding siblings ...)
  2017-09-30 10:27 ` [PATCH V5 5/7] blk-mq-sched: move actual dispatching into one helper Ming Lei
@ 2017-09-30 10:27 ` Ming Lei
  2017-10-03  9:05   ` Christoph Hellwig
  2017-09-30 10:27 ` [PATCH V5 7/7] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Ming Lei @ 2017-09-30 10:27 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Ming Lei

SCSI devices use a host-wide tagset, and the shared
driver tag space is often quite big. Meanwhile
there is also a queue depth for each lun (.cmd_per_lun),
which is often small.

So lots of requests may stay in the sw queue, and we
always flush all of those belonging to the same hw queue
and dispatch them all to the driver; unfortunately it is
easy to cause queue busy because of the small per-lun
queue depth. Once these requests are flushed out, they
have to stay in hctx->dispatch, no bio merge can take
place against these requests any more, and sequential
I/O performance is hurt.

This patch improves dispatching from the sw queue when
there is a per-request_queue queue depth, by taking
requests one by one from the sw queue, just like the
way an I/O scheduler does.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Tom Nguyen <tom81094@gmail.com>
Tested-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c   | 53 ++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/blk-mq.h |  2 ++
 2 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 538f363f39ca..3ba112d9dc15 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -105,6 +105,42 @@ static void blk_mq_do_dispatch_sched(struct request_queue *q,
 	} while (blk_mq_dispatch_rq_list(q, &rq_list));
 }
 
+static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
+					  struct blk_mq_ctx *ctx)
+{
+	unsigned idx = ctx->index_hw;
+
+	if (++idx == hctx->nr_ctx)
+		idx = 0;
+
+	return hctx->ctxs[idx];
+}
+
+static void blk_mq_do_dispatch_ctx(struct request_queue *q,
+				   struct blk_mq_hw_ctx *hctx)
+{
+	LIST_HEAD(rq_list);
+	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+	bool dispatched;
+
+	do {
+		struct request *rq;
+
+		rq = blk_mq_dequeue_from_ctx(hctx, ctx);
+		if (!rq)
+			break;
+		list_add(&rq->queuelist, &rq_list);
+
+		/* round robin for fair dispatch */
+		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
+
+		dispatched = blk_mq_dispatch_rq_list(q, &rq_list);
+	} while (dispatched);
+
+	if (!dispatched)
+		WRITE_ONCE(hctx->dispatch_from, ctx);
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
@@ -142,18 +178,31 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
 		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
-	} else if (!has_sched_dispatch) {
+	} else if (!has_sched_dispatch && !q->queue_depth) {
+		/*
+		 * If there is no per-request_queue depth, we
+		 * flush all requests in this hw queue, otherwise
+		 * pick up request one by one from sw queue for
+		 * avoiding to mess up I/O merge when dispatch
+		 * run out of resource, which can be triggered
+		 * easily by per-request_queue queue depth
+		 */
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list);
 	}
 
+	if (!do_sched_dispatch)
+		return;
+
 	/*
 	 * We want to dispatch from the scheduler if there was nothing
 	 * on the dispatch list or we were able to dispatch from the
 	 * dispatch list.
 	 */
-	if (do_sched_dispatch && has_sched_dispatch)
+	if (has_sched_dispatch)
 		blk_mq_do_dispatch_sched(q, e, hctx);
+	else
+		blk_mq_do_dispatch_ctx(q, hctx);
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2747469cedaf..fccabe00fb55 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -30,6 +30,8 @@ struct blk_mq_hw_ctx {
 
 	struct sbitmap		ctx_map;
 
+	struct blk_mq_ctx	*dispatch_from;
+
 	struct blk_mq_ctx	**ctxs;
 	unsigned int		nr_ctx;
 
-- 
2.9.5


* [PATCH V5 7/7] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  2017-09-30 10:27 [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1) Ming Lei
                   ` (5 preceding siblings ...)
  2017-09-30 10:27 ` [PATCH V5 6/7] blk-mq-sched: improve dispatching from sw queue Ming Lei
@ 2017-09-30 10:27 ` Ming Lei
  2017-10-03  9:11   ` Christoph Hellwig
  2017-09-30 10:32 ` [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1) Ming Lei
  2017-10-09 12:09   ` John Garry
  8 siblings, 1 reply; 33+ messages in thread
From: Ming Lei @ 2017-09-30 10:27 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Ming Lei

During dispatching, we move all requests from hctx->dispatch to
a temporary list, then dispatch them one by one from this list.
Unfortunately during this period, a queue run from another context
may think the queue is idle, then start to dequeue from the
sw/scheduler queue and try to dispatch because ->dispatch is empty.
This hurts sequential I/O performance because requests are dequeued
while the lld queue is busy.

This patch introduces the BLK_MQ_S_DISPATCH_BUSY state to
make sure that requests aren't dequeued until ->dispatch is
flushed.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Tom Nguyen <tom81094@gmail.com>
Tested-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   | 53 +++++++++++++++++++++++++++++++++-----------------
 block/blk-mq.c         |  6 ++++++
 include/linux/blk-mq.h |  1 +
 4 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 813ca3bbbefc..f1a62c0d1acc 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -182,6 +182,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(SCHED_RESTART),
 	HCTX_STATE_NAME(TAG_WAITING),
 	HCTX_STATE_NAME(START_ON_RUN),
+	HCTX_STATE_NAME(DISPATCH_BUSY),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 3ba112d9dc15..c5eac1eee442 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -146,7 +146,6 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-	bool do_sched_dispatch = true;
 	LIST_HEAD(rq_list);
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -177,8 +176,33 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
-	} else if (!has_sched_dispatch && !q->queue_depth) {
+		blk_mq_dispatch_rq_list(q, &rq_list);
+
+		/*
+		 * We may clear DISPATCH_BUSY just after it
+		 * is set from another context, the only cost
+		 * is that one request is dequeued a bit early,
+		 * we can survive that. Given the window is
+		 * small enough, no need to worry about performance
+		 * effect.
+		 */
+		if (list_empty_careful(&hctx->dispatch))
+			clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+	}
+
+	/*
+	 * If DISPATCH_BUSY is set, that means hw queue is busy
+	 * and requests in the list of hctx->dispatch need to
+	 * be flushed first, so return early.
+	 *
+	 * Wherever DISPATCH_BUSY is set, blk_mq_run_hw_queue()
+	 * will be run to try to make progress, so it is always
+	 * safe to check the state here.
+	 */
+	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
+		return;
+
+	if (!has_sched_dispatch) {
 		/*
 		 * If there is no per-request_queue depth, we
 		 * flush all requests in this hw queue, otherwise
@@ -187,22 +211,15 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 		 * run out of resource, which can be triggered
 		 * easily by per-request_queue queue depth
 		 */
-		blk_mq_flush_busy_ctxs(hctx, &rq_list);
-		blk_mq_dispatch_rq_list(q, &rq_list);
-	}
-
-	if (!do_sched_dispatch)
-		return;
-
-	/*
-	 * We want to dispatch from the scheduler if there was nothing
-	 * on the dispatch list or we were able to dispatch from the
-	 * dispatch list.
-	 */
-	if (has_sched_dispatch)
+		if (!q->queue_depth) {
+			blk_mq_flush_busy_ctxs(hctx, &rq_list);
+			blk_mq_dispatch_rq_list(q, &rq_list);
+		} else {
+			blk_mq_do_dispatch_ctx(q, hctx);
+		}
+	} else {
 		blk_mq_do_dispatch_sched(q, e, hctx);
-	else
-		blk_mq_do_dispatch_ctx(q, hctx);
+	}
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8b49af1ade7f..7cb3f87334c0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1142,6 +1142,11 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
+		/*
+		 * DISPATCH_BUSY won't be cleared until all requests
+		 * in hctx->dispatch are dispatched successfully
+		 */
+		set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 		spin_unlock(&hctx->lock);
 
 		/*
@@ -1446,6 +1451,7 @@ static void blk_mq_request_direct_insert(struct blk_mq_hw_ctx *hctx,
 {
 	spin_lock(&hctx->lock);
 	list_add_tail(&rq->queuelist, &hctx->dispatch);
+	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 	spin_unlock(&hctx->lock);
 
 	blk_mq_run_hw_queue(hctx, false);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index fccabe00fb55..aa9853ada8b8 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -172,6 +172,7 @@ enum {
 	BLK_MQ_S_SCHED_RESTART	= 2,
 	BLK_MQ_S_TAG_WAITING	= 3,
 	BLK_MQ_S_START_ON_RUN	= 4,
+	BLK_MQ_S_DISPATCH_BUSY	= 5,
 
 	BLK_MQ_MAX_DEPTH	= 10240,
 
-- 
2.9.5


* Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)
  2017-09-30 10:27 [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1) Ming Lei
                   ` (6 preceding siblings ...)
  2017-09-30 10:27 ` [PATCH V5 7/7] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
@ 2017-09-30 10:32 ` Ming Lei
  2017-10-09 12:09   ` John Garry
  8 siblings, 0 replies; 33+ messages in thread
From: Ming Lei @ 2017-09-30 10:32 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

On Sat, Sep 30, 2017 at 06:27:13PM +0800, Ming Lei wrote:
> Hi Jens,
> 
> In Red Hat internal storage tests wrt. the blk-mq scheduler, we
> found that I/O performance is quite bad with mq-deadline, especially
> for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> SRP...).
> 
> It turns out one big issue causes the performance regression: requests
> are still dequeued from the sw queue/scheduler queue even when the lld's
> queue is busy, so I/O merging becomes quite difficult, and
> sequential I/O degrades a lot.
> 
> This issue became one of the main reasons for reverting default SCSI_MQ
> in V4.13.
> 
> The 1st patch issues the request directly in blk_mq_request_bypass_insert(),
> so that we can improve dm-mpath's performance in part 2, which will
> be posted soon.
> 
> The other six patches improve this situation and bring back
> some of the lost performance.
> 
> With this change, SCSI-MQ sequential I/O performance is
> improved a lot; Paolo reported that mq-deadline performance
> improved much [2] in his dbench test wrt V2, and performance
> improvement on lpfc/qla2xxx was observed with V1 [1].
> 
> Please consider it for V4.15.
> 
> [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
> [2] https://marc.info/?l=linux-block&m=150217980602843&w=2
> 
> V5:
> 	- address some comments from Omar
> 	- add Tested-by & Reviewed-by tags
> 	- use direct issue for blk_mq_request_bypass_insert(), and
> 	start to consider improving sequential I/O for dm-mpath
> 	- only include part 1(the original patch 1 ~ 6), as suggested
> 	by Omar
> 
> V4:
> 	- add Reviewed-by tag
> 	- some trivial changes: typo fixes in commit log or comments,
> 	variable names, no actual functional change
> 
> V3:
> 	- totally round robin for picking req from ctx, as suggested
> 	by Bart
> 	- remove one local variable in __sbitmap_for_each_set()
> 	- drop patches of single dispatch list, which can improve
> 	performance on mq-deadline, but cause a slight degradation on
> 	none because all hctxs need to be checked after ->dispatch
> 	is flushed. Will post it again once it is mature.
> 	- rebase on v4.13-rc6 with block for-next
> 
> V2:
> 	- dequeue requests from sw queues in round-robin style
> 	as suggested by Bart, and introduce one helper in sbitmap
> 	for this purpose
> 	- improve bio merge via hash table from sw queue
> 	- add comments about using the DISPATCH_BUSY state in a lockless way,
> 	simplifying the handling of the busy state,
> 	- hold ctx->lock when clearing ctx busy bit as suggested
> 	by Bart
> 
> 
> Ming Lei (7):
>   blk-mq: issue rq directly in blk_mq_request_bypass_insert()
>   blk-mq-sched: fix scheduler bad performance
>   sbitmap: introduce __sbitmap_for_each_set()
>   blk-mq: introduce blk_mq_dequeue_from_ctx()
>   blk-mq-sched: move actual dispatching into one helper
>   blk-mq-sched: improve dispatching from sw queue
>   blk-mq-sched: don't dequeue request until all in ->dispatch are
>     flushed
> 
>  block/blk-core.c        |   3 +-
>  block/blk-mq-debugfs.c  |   1 +
>  block/blk-mq-sched.c    | 104 ++++++++++++++++++++++++++++++++++++-------
>  block/blk-mq.c          | 114 +++++++++++++++++++++++++++++++++++++++---------
>  block/blk-mq.h          |   4 +-
>  drivers/md/dm-rq.c      |   2 +-
>  include/linux/blk-mq.h  |   3 ++
>  include/linux/sbitmap.h |  64 +++++++++++++++++++--------
>  8 files changed, 238 insertions(+), 57 deletions(-)

Oops, the title should have been:

[PATCH V5 0/7] blk-mq-sched: improve sequential I/O performance(part 1)

Sorry for that.

-- 
Ming


* Re: [PATCH V5 2/7] blk-mq-sched: fix scheduler bad performance
  2017-09-30 10:27 ` [PATCH V5 2/7] blk-mq-sched: fix scheduler bad performance Ming Lei
@ 2017-10-02 14:19   ` Christoph Hellwig
  0 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2017-10-02 14:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH V5 5/7] blk-mq-sched: move actual dispatching into one helper
  2017-09-30 10:27 ` [PATCH V5 5/7] blk-mq-sched: move actual dispatching into one helper Ming Lei
@ 2017-10-02 14:19   ` Christoph Hellwig
  2017-10-09  9:07     ` Ming Lei
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2017-10-02 14:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

Can you move this to the beginning of your series, just after
the other edits to blk_mq_sched_dispatch_requests?

> +static void blk_mq_do_dispatch_sched(struct request_queue *q,
> +				     struct elevator_queue *e,
> +				     struct blk_mq_hw_ctx *hctx)

No need to pass anything but the hctx here, the other two can
be trivially derived from it.
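
For example (a minimal sketch, mirroring how blk_mq_sched_dispatch_requests()
already derives both values from the hctx):

	struct request_queue *q = hctx->queue;
	struct elevator_queue *e = q->elevator;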

> +{
> +	LIST_HEAD(rq_list);
> +
> +	do {
> +		struct request *rq;
> +
> +		rq = e->type->ops.mq.dispatch_request(hctx);

how about shortening this to:

	struct request *rq = e->type->ops.mq.dispatch_request(hctx);

>  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct request_queue *q = hctx->queue;
> @@ -136,16 +152,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	 * on the dispatch list or we were able to dispatch from the
>  	 * dispatch list.
>  	 */
> -	if (do_sched_dispatch && has_sched_dispatch) {
> -		do {
> -			struct request *rq;
> -
> -			rq = e->type->ops.mq.dispatch_request(hctx);
> -			if (!rq)
> -				break;
> -			list_add(&rq->queuelist, &rq_list);
> -		} while (blk_mq_dispatch_rq_list(q, &rq_list));
> -	}
> +	if (do_sched_dispatch && has_sched_dispatch)
> +		blk_mq_do_dispatch_sched(q, e, hctx);
>  }

Please use this new helper to simplify the logic. E.g.:

	if (!list_empty(&rq_list)) {
		blk_mq_sched_mark_restart_hctx(hctx);
		if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch)
			blk_mq_do_dispatch_sched(hctx);
	} else if (has_sched_dispatch) {
		blk_mq_do_dispatch_sched(hctx);
	} else {
		blk_mq_flush_busy_ctxs(hctx, &rq_list);
		blk_mq_dispatch_rq_list(q, &rq_list);
	}


* Re: [PATCH V5 1/7] blk-mq: issue rq directly in blk_mq_request_bypass_insert()
  2017-09-30 10:27 ` [PATCH V5 1/7] blk-mq: issue rq directly in blk_mq_request_bypass_insert() Ming Lei
@ 2017-10-03  8:58   ` Christoph Hellwig
  2017-10-03 13:39     ` Ming Lei
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2017-10-03  8:58 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

This patch does too many things at once and needs a split. I also
don't really understand why it's in this series and not your dm-mpath
performance one.

> +static void blk_mq_request_direct_insert(struct blk_mq_hw_ctx *hctx,
> +					 struct request *rq)
> +{
> +	spin_lock(&hctx->lock);
> +	list_add_tail(&rq->queuelist, &hctx->dispatch);
> +	spin_unlock(&hctx->lock);
> +
> +	blk_mq_run_hw_queue(hctx, false);
> +}

Why doesn't this share code with blk_mq_sched_bypass_insert?

>  /*
>   * Should only be used carefully, when the caller knows we want to
>   * bypass a potential IO scheduler on the target device.
>   */
> -void blk_mq_request_bypass_insert(struct request *rq)
> +blk_status_t blk_mq_request_bypass_insert(struct request *rq)
>  {
>  	struct blk_mq_ctx *ctx = rq->mq_ctx;
>  	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
> +	blk_qc_t cookie;
> +	blk_status_t ret;
>  
> -	spin_lock(&hctx->lock);
> -	list_add_tail(&rq->queuelist, &hctx->dispatch);
> -	spin_unlock(&hctx->lock);
> -
> -	blk_mq_run_hw_queue(hctx, false);
> +	ret = blk_mq_try_issue_directly(hctx, rq, &cookie, true);
> +	if (ret == BLK_STS_RESOURCE)
> +		blk_mq_request_direct_insert(hctx, rq);
> +	return ret;

If you actually insert the request on BLK_STS_RESOURCE why do you
pass the error on?  In general BLK_STS_RESOURCE indicates a failure
to issue.

> +/*
> + * 'dispatch_only' means we only try to dispatch it out, and
> + * don't deal with dispatch failure if BLK_STS_RESOURCE or
> + * BLK_STS_IOERR happens.
> + */
> +static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> +		struct request *rq, blk_qc_t *cookie, bool may_sleep,
> +		bool dispatch_only)

This dispatch_only argument that completely changes behavior is a
nightmare.  Try to find a way to have a low-level helper that
always behaves as if dispatch_only is set, and then build another
helper that actually issues/completes around it.
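
For illustration, the split could look roughly like the sketch below; the
function names are invented for the example and the quiesced/elevator checks
are left out, so this is not what the posted series does:

	/* low-level part: only try to issue, and report the status as-is */
	static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
			struct request *rq, blk_qc_t *cookie)
	{
		struct request_queue *q = rq->q;
		struct blk_mq_queue_data bd = {
			.rq = rq,
			.last = true,
		};
		blk_status_t ret;

		if (!blk_mq_get_driver_tag(rq, NULL, false))
			return BLK_STS_RESOURCE;

		ret = q->mq_ops->queue_rq(hctx, &bd);
		switch (ret) {
		case BLK_STS_OK:
			*cookie = request_to_qc_t(hctx, rq);
			break;
		case BLK_STS_RESOURCE:
			__blk_mq_requeue_request(rq);
			/* fall through */
		default:
			*cookie = BLK_QC_T_NONE;
			break;
		}
		return ret;
	}

	/* wrapper used on the normal submission path, handling the failure cases */
	static void blk_mq_issue_or_insert(struct blk_mq_hw_ctx *hctx,
			struct request *rq, blk_qc_t *cookie, bool may_sleep)
	{
		blk_status_t ret = __blk_mq_issue_directly(hctx, rq, cookie);

		if (ret == BLK_STS_RESOURCE)
			blk_mq_sched_insert_request(rq, false, true, false, may_sleep);
		else if (ret != BLK_STS_OK)
			blk_mq_end_request(rq, ret);
	}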


* Re: [PATCH V5 4/7] blk-mq: introduce blk_mq_dequeue_from_ctx()
  2017-09-30 10:27 ` [PATCH V5 4/7] blk-mq: introduce blk_mq_dequeue_from_ctx() Ming Lei
@ 2017-10-03  9:01   ` Christoph Hellwig
  2017-10-09  4:36     ` Ming Lei
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2017-10-03  9:01 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

On Sat, Sep 30, 2017 at 06:27:17PM +0800, Ming Lei wrote:
> This function is introduced for dequeuing requests
> from the sw queue so that we can dispatch them in
> the scheduler's way.
> 
> More importantly, some SCSI devices may set
> q->queue_depth, which is a per-request_queue limit
> applied to pending I/O from all hctxs. This function
> helps avoid dequeuing too many requests from the
> sw queue when ->dispatch isn't flushed
> completely.

Weird commit log formatting, please use your 70+ characters to format the
text.

> @@ -882,6 +882,44 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
>  
> +struct dispatch_rq_data {
> +	struct blk_mq_hw_ctx *hctx;
> +	struct request *rq;
> +};
> +
> +static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)

Overly long line.

But other than that this looks ok to me, except for the fact that you
just add an unused function.


* Re: [PATCH V5 6/7] blk-mq-sched: improve dispatching from sw queue
  2017-09-30 10:27 ` [PATCH V5 6/7] blk-mq-sched: improve dispatching from sw queue Ming Lei
@ 2017-10-03  9:05   ` Christoph Hellwig
  2017-10-09 10:15     ` Ming Lei
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2017-10-03  9:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

On Sat, Sep 30, 2017 at 06:27:19PM +0800, Ming Lei wrote:
> SCSI devices use a host-wide tagset, and the shared
> driver tag space is often quite big. Meanwhile
> there is also a queue depth for each lun (.cmd_per_lun),
> which is often small.
> 
> So lots of requests may stay in the sw queue, and we
> always flush all of those belonging to the same hw queue
> and dispatch them all to the driver; unfortunately it is
> easy to cause queue busy because of the small per-lun
> queue depth. Once these requests are flushed out, they
> have to stay in hctx->dispatch, no bio merge can take
> place against these requests any more, and sequential
> I/O performance is hurt.
> 
> This patch improves dispatching from the sw queue when
> there is a per-request_queue queue depth, by taking
> requests one by one from the sw queue, just like the
> way an I/O scheduler does.

Once again, use your line length real estate for your change logs..

> +static void blk_mq_do_dispatch_ctx(struct request_queue *q,
> +				   struct blk_mq_hw_ctx *hctx)
> +{
> +	LIST_HEAD(rq_list);
> +	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> +	bool dispatched;
> +
> +	do {
> +		struct request *rq;
> +
> +		rq = blk_mq_dequeue_from_ctx(hctx, ctx);

This probably should be merged with the patch that introduces
blk_mq_dequeue_from_ctx.

> +		if (!rq)
> +			break;
> +		list_add(&rq->queuelist, &rq_list);

Btw, do we really need to return a request from blk_mq_dequeue_from_ctx,
or would it be easier to just add it directly to the list passed as
an argument?  I already wondered that about the existing code.
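
I.e. something like this (hypothetical signature, only to illustrate the idea):

	void blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
				     struct blk_mq_ctx *start,
				     struct list_head *list);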

> +
> +		/* round robin for fair dispatch */
> +		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> +
> +		dispatched = blk_mq_dispatch_rq_list(q, &rq_list);
> +	} while (dispatched);
> +
> +	if (!dispatched)
> +		WRITE_ONCE(hctx->dispatch_from, ctx);

No need for the dispatched argument, just write this as:

	 } while (blk_mq_dispatch_rq_list(q, &rq_list));
	
	WRITE_ONCE(hctx->dispatch_from, ctx);

and do an early return instead of a break from inside the loop.
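
With both changes folded in, the whole function would read something like
this (sketch only, not what was posted):

	static void blk_mq_do_dispatch_ctx(struct request_queue *q,
					   struct blk_mq_hw_ctx *hctx)
	{
		LIST_HEAD(rq_list);
		struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);

		do {
			struct request *rq = blk_mq_dequeue_from_ctx(hctx, ctx);

			if (!rq)
				return;
			list_add(&rq->queuelist, &rq_list);

			/* round robin for fair dispatch */
			ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
		} while (blk_mq_dispatch_rq_list(q, &rq_list));

		WRITE_ONCE(hctx->dispatch_from, ctx);
	}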

> +	} else if (!has_sched_dispatch && !q->queue_depth) {
> +		/*
> +		 * If there is no per-request_queue depth, we
> +		 * flush all requests in this hw queue, otherwise
> +		 * pick up request one by one from sw queue for
> +		 * avoiding to mess up I/O merge when dispatch
> +		 * run out of resource, which can be triggered
> +		 * easily by per-request_queue queue depth
> +		 */

Use your 80 line real estate for comments..


* Re: [PATCH V5 7/7] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  2017-09-30 10:27 ` [PATCH V5 7/7] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
@ 2017-10-03  9:11   ` Christoph Hellwig
  2017-10-09 10:40     ` Ming Lei
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2017-10-03  9:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

This looks good in general:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Minor nitpicks below:

>  	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;

This is now only tested once, so you can remove the local variable
for it.

> +		/*
> +		 * We may clear DISPATCH_BUSY just after it
> +		 * is set from another context, the only cost
> +		 * is that one request is dequeued a bit early,
> +		 * we can survive that. Given the window is
> +		 * small enough, no need to worry about performance
> +		 * effect.
> +		 */

Use your 80 line real estate for comments please.

>  	if (!has_sched_dispatch)
> +		if (!q->queue_depth) {
> +			blk_mq_flush_busy_ctxs(hctx, &rq_list);
> +			blk_mq_dispatch_rq_list(q, &rq_list);
> +		} else {
> +			blk_mq_do_dispatch_ctx(q, hctx);
> +		}
> +	} else {
>  		blk_mq_do_dispatch_sched(q, e, hctx);
> +	}

Maybe flatten this out to:

	if (e && e->type->ops.mq.dispatch_request) {
		blk_mq_do_dispatch_sched(q, e, hctx);
	} else if (q->queue_depth) {
		blk_mq_do_dispatch_ctx(q, hctx);
	} else {
		blk_mq_flush_busy_ctxs(hctx, &rq_list);
		blk_mq_dispatch_rq_list(q, &rq_list);
	}


* Re: [PATCH V5 1/7] blk-mq: issue rq directly in blk_mq_request_bypass_insert()
  2017-10-03  8:58   ` Christoph Hellwig
@ 2017-10-03 13:39     ` Ming Lei
  0 siblings, 0 replies; 33+ messages in thread
From: Ming Lei @ 2017-10-03 13:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Mike Snitzer, dm-devel, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval

On Tue, Oct 03, 2017 at 01:58:50AM -0700, Christoph Hellwig wrote:
> This patch does too many things at once and needs a split. I also
> don't really understand why it's in this series and not your dm-mpath
> performance one.

Because the following patches only mark the hctx as busy after
BLK_STS_RESOURCE is returned from .queue_rq(), then add the
rq into hctx->dispatch.

But commit 157f377beb71 ("block: directly insert blk-mq request from
blk_insert_cloned_request()") just inserts the rq into hctx->dispatch
directly, so we can no longer treat the hctx as busy whenever there
are requests in hctx->dispatch. That is, commit 157f377beb71 breaks
that busy-detection approach.
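
That is also why patch 1 only falls back to ->dispatch after the direct
issue fails with BLK_STS_RESOURCE (quoting the hunk from patch 1):

	ret = blk_mq_try_issue_directly(hctx, rq, &cookie, true);
	if (ret == BLK_STS_RESOURCE)
		blk_mq_request_direct_insert(hctx, rq);
	return ret;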

>
> > +static void blk_mq_request_direct_insert(struct blk_mq_hw_ctx *hctx,
> > +					 struct request *rq)
> > +{
> > +	spin_lock(&hctx->lock);
> > +	list_add_tail(&rq->queuelist, &hctx->dispatch);
> > +	spin_unlock(&hctx->lock);
> > +
> > +	blk_mq_run_hw_queue(hctx, false);
> > +}
> 
> Why doesn't this share code with blk_mq_sched_bypass_insert?

It actually does share the code, since this function is called
by blk_mq_request_bypass_insert().

> 
> >  /*
> >   * Should only be used carefully, when the caller knows we want to
> >   * bypass a potential IO scheduler on the target device.
> >   */
> > -void blk_mq_request_bypass_insert(struct request *rq)
> > +blk_status_t blk_mq_request_bypass_insert(struct request *rq)
> >  {
> >  	struct blk_mq_ctx *ctx = rq->mq_ctx;
> >  	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
> > +	blk_qc_t cookie;
> > +	blk_status_t ret;
> >  
> > -	spin_lock(&hctx->lock);
> > -	list_add_tail(&rq->queuelist, &hctx->dispatch);
> > -	spin_unlock(&hctx->lock);
> > -
> > -	blk_mq_run_hw_queue(hctx, false);
> > +	ret = blk_mq_try_issue_directly(hctx, rq, &cookie, true);
> > +	if (ret == BLK_STS_RESOURCE)
> > +		blk_mq_request_direct_insert(hctx, rq);
> > +	return ret;
> 
> If you actually insert the request on BLK_STS_RESOURCE why do you
> pass the error on?  In general BLK_STS_RESOURCE indicates a failure
> to issue.

OK, I will change it into BLK_STS_OK and switch it back in
the dm-rq patches.

> 
> > +/*
> > + * 'dispatch_only' means we only try to dispatch it out, and
> > + * don't deal with dispatch failure if BLK_STS_RESOURCE or
> > + * BLK_STS_IOERR happens.
> > + */
> > +static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > +		struct request *rq, blk_qc_t *cookie, bool may_sleep,
> > +		bool dispatch_only)
> 
> This dispatch_only argument that completely changes behavior is a
> nightmare.  Try to find a way to have a low-level helper that
> always behaves as if dispatch_only is set, and then build another
> helper that actually issues/completes around it.

OK, I will try to work towards that way.

-- 
Ming


* Re: [PATCH V5 4/7] blk-mq: introduce blk_mq_dequeue_from_ctx()
  2017-10-03  9:01   ` Christoph Hellwig
@ 2017-10-09  4:36     ` Ming Lei
  0 siblings, 0 replies; 33+ messages in thread
From: Ming Lei @ 2017-10-09  4:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Mike Snitzer, dm-devel, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval

On Tue, Oct 03, 2017 at 02:01:09AM -0700, Christoph Hellwig wrote:
> On Sat, Sep 30, 2017 at 06:27:17PM +0800, Ming Lei wrote:
> > This function is introduced for dequeuing requests
> > from the sw queue so that we can dispatch them in
> > the scheduler's way.
> > 
> > More importantly, some SCSI devices may set
> > q->queue_depth, which is a per-request_queue limit
> > applied to pending I/O from all hctxs. This function
> > helps avoid dequeuing too many requests from the
> > sw queue when ->dispatch isn't flushed
> > completely.
> 
> Weird commit log formatting, please use your 70+ characters to format the
> text.

OK.

> 
> > @@ -882,6 +882,44 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
> >  }
> >  EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
> >  
> > +struct dispatch_rq_data {
> > +	struct blk_mq_hw_ctx *hctx;
> > +	struct request *rq;
> > +};
> > +
> > +static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
> 
> Overly long line.
> 
> But except for that this looks ok to me except for the fact that you
> just add an unused function.

OK

-- 
Ming

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5 5/7] blk-mq-sched: move actual dispatching into one helper
  2017-10-02 14:19   ` Christoph Hellwig
@ 2017-10-09  9:07     ` Ming Lei
  0 siblings, 0 replies; 33+ messages in thread
From: Ming Lei @ 2017-10-09  9:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Mike Snitzer, dm-devel, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval

On Mon, Oct 02, 2017 at 07:19:56AM -0700, Christoph Hellwig wrote:
> Can you move this to the beginning of your series, just after
> the other edits to blk_mq_sched_dispatch_requests?

OK.

> 
> > +static void blk_mq_do_dispatch_sched(struct request_queue *q,
> > +				     struct elevator_queue *e,
> > +				     struct blk_mq_hw_ctx *hctx)
> 
> No need to pass anything but the hctx here, the other two can
> be trivially derived from it.

OK.

> 
> > +{
> > +	LIST_HEAD(rq_list);
> > +
> > +	do {
> > +		struct request *rq;
> > +
> > +		rq = e->type->ops.mq.dispatch_request(hctx);
> 
> how about shortening this to:
> 
> 	struct request *rq = e->type->ops.mq.dispatch_request(hctx);

OK
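
Combining both suggestions, the helper would then look roughly like
this (sketch only):

static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
{
	struct request_queue *q = hctx->queue;
	struct elevator_queue *e = q->elevator;
	LIST_HEAD(rq_list);

	do {
		struct request *rq = e->type->ops.mq.dispatch_request(hctx);

		if (!rq)
			break;
		list_add(&rq->queuelist, &rq_list);
	} while (blk_mq_dispatch_rq_list(q, &rq_list));
}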

> 
> >  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  {
> >  	struct request_queue *q = hctx->queue;
> > @@ -136,16 +152,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  	 * on the dispatch list or we were able to dispatch from the
> >  	 * dispatch list.
> >  	 */
> > -	if (do_sched_dispatch && has_sched_dispatch) {
> > -		do {
> > -			struct request *rq;
> > -
> > -			rq = e->type->ops.mq.dispatch_request(hctx);
> > -			if (!rq)
> > -				break;
> > -			list_add(&rq->queuelist, &rq_list);
> > -		} while (blk_mq_dispatch_rq_list(q, &rq_list));
> > -	}
> > +	if (do_sched_dispatch && has_sched_dispatch)
> > +		blk_mq_do_dispatch_sched(q, e, hctx);
> >  }
> 
> Please use this new helper to simplify the logic. E.g.:
> 
> 	if (!list_empty(&rq_list)) {
> 		blk_mq_sched_mark_restart_hctx(hctx);
> 		if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch)
> 			blk_mq_do_dispatch_sched(hctx);
> 	} else if (has_sched_dispatch) {
> 		blk_mq_do_dispatch_sched(hctx);
> 	} else {
> 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> 		blk_mq_dispatch_rq_list(q, &rq_list);
> 	}

OK

-- 
Ming

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5 6/7] blk-mq-sched: improve dispatching from sw queue
  2017-10-03  9:05   ` Christoph Hellwig
@ 2017-10-09 10:15     ` Ming Lei
  0 siblings, 0 replies; 33+ messages in thread
From: Ming Lei @ 2017-10-09 10:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Mike Snitzer, dm-devel, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval

On Tue, Oct 03, 2017 at 02:05:27AM -0700, Christoph Hellwig wrote:
> On Sat, Sep 30, 2017 at 06:27:19PM +0800, Ming Lei wrote:
> > SCSI devices use host-wide tagset, and the shared
> > driver tag space is often quite big. Meantime
> > there is also queue depth for each lun(.cmd_per_lun),
> > which is often small.
> > 
> > So lots of requests may stay in sw queue, and we
> > always flush all belonging to same hw queue and
> > dispatch them all to driver, unfortunately it is
> > easy to cause queue busy because of the small
> > per-lun queue depth. Once these requests are flushed
> > out, they have to stay in hctx->dispatch, and no bio
> > merge can participate into these requests, and
> > sequential IO performance is hurted.
> > 
> > This patch improves dispatching from sw queue when
> > there is per-request-queue queue depth by taking
> > request one by one from sw queue, just like the way
> > of IO scheduler.
> 
> Once again, use your line-length real estate for your change logs.

OK.

> 
> > +static void blk_mq_do_dispatch_ctx(struct request_queue *q,
> > +				   struct blk_mq_hw_ctx *hctx)
> > +{
> > +	LIST_HEAD(rq_list);
> > +	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> > +	bool dispatched;
> > +
> > +	do {
> > +		struct request *rq;
> > +
> > +		rq = blk_mq_dequeue_from_ctx(hctx, ctx);
> 
> This probably should be merged with the patch that introduces
> blk_mq_dequeue_from_ctx.

OK.

> 
> > +		if (!rq)
> > +			break;
> > +		list_add(&rq->queuelist, &rq_list);
> 
> Btw, do we really need to return a request from blk_mq_dequeue_from_ctx,
> or would it be easier to just add it directly to the list passed as
> an argument?  I did wonder that about the existing code already.

I suggest keeping them as they are in this patch; we can clean
them all up in another patch.
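
So for now it stays as a function returning one request, roughly along
these lines (sketch, reusing the dispatch_rq_data/dispatch_rq_from_ctx
bits quoted above):

struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
					struct blk_mq_ctx *start)
{
	unsigned int off = start ? start->index_hw : 0;
	struct dispatch_rq_data data = {
		.hctx = hctx,
		.rq   = NULL,
	};

	/* walk the sw queue bitmap from 'start' and grab one request */
	__sbitmap_for_each_set(&hctx->ctx_map, off,
			       dispatch_rq_from_ctx, &data);

	return data.rq;
}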

> 
> > +
> > +		/* round robin for fair dispatch */
> > +		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> > +
> > +		dispatched = blk_mq_dispatch_rq_list(q, &rq_list);
> > +	} while (dispatched);
> > +
> > +	if (!dispatched)
> > +		WRITE_ONCE(hctx->dispatch_from, ctx);
> 
> No need for the dispatched argument, just write this as:
> 
> 	 } while (blk_mq_dispatch_rq_list(q, &rq_list));
> 	
> 	WRITE_ONCE(hctx->dispatch_from, ctx);
> 
> and do an early return instead of a break from inside the loop.

OK.
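
i.e. the loop would become roughly (sketch, dropping the 'dispatched'
local):

static void blk_mq_do_dispatch_ctx(struct request_queue *q,
				   struct blk_mq_hw_ctx *hctx)
{
	LIST_HEAD(rq_list);
	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);

	do {
		struct request *rq = blk_mq_dequeue_from_ctx(hctx, ctx);

		if (!rq)
			return;
		list_add(&rq->queuelist, &rq_list);

		/* round robin for fair dispatch */
		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
	} while (blk_mq_dispatch_rq_list(q, &rq_list));

	WRITE_ONCE(hctx->dispatch_from, ctx);
}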

> 
> > +	} else if (!has_sched_dispatch && !q->queue_depth) {
> > +		/*
> > +		 * If there is no per-request_queue depth, we
> > +		 * flush all requests in this hw queue, otherwise
> > +		 * pick up request one by one from sw queue for
> > +		 * avoiding to mess up I/O merge when dispatch
> > +		 * run out of resource, which can be triggered
> > +		 * easily by per-request_queue queue depth
> > +		 */
> 
> Use your 80-column real estate for comments.

OK.

-- 
Ming

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5 7/7] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  2017-10-03  9:11   ` Christoph Hellwig
@ 2017-10-09 10:40     ` Ming Lei
  0 siblings, 0 replies; 33+ messages in thread
From: Ming Lei @ 2017-10-09 10:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Mike Snitzer, dm-devel, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval

On Tue, Oct 03, 2017 at 02:11:28AM -0700, Christoph Hellwig wrote:
> This looks good in general:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Minor nitpicks below:
> 
> >  	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> 
> This is now only tested once, so you can remove the local variable
> for it.

There are still two users of the local variable, so I suggest
keeping it.

> 
> > +		/*
> > +		 * We may clear DISPATCH_BUSY just after it
> > +		 * is set from another context, the only cost
> > +		 * is that one request is dequeued a bit early,
> > +		 * we can survive that. Given the window is
> > +		 * small enough, no need to worry about performance
> > +		 * effect.
> > +		 */
> 
> Use your 80-column real estate for comments, please.

OK.

> 
> >  	if (!has_sched_dispatch)
> > +		if (!q->queue_depth) {
> > +			blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > +			blk_mq_dispatch_rq_list(q, &rq_list);
> > +		} else {
> > +			blk_mq_do_dispatch_ctx(q, hctx);
> > +		}
> > +	} else {
> >  		blk_mq_do_dispatch_sched(q, e, hctx);
> > +	}
> 
> Maybe flatten this out to:
> 
> 	if (e && e->type->ops.mq.dispatch_request) {
> 		blk_mq_do_dispatch_sched(q, e, hctx);
> 	} else if (q->queue_depth) {
> 		blk_mq_do_dispatch_ctx(q, hctx);
> 	} else {
> 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> 		blk_mq_dispatch_rq_list(q, &rq_list);
> 	}
> 

OK.


-- 
Ming

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)
  2017-09-30 10:27 [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1) Ming Lei
@ 2017-10-09 12:09   ` John Garry
  2017-09-30 10:27 ` [PATCH V5 2/7] blk-mq-sched: fix scheduler bad performance Ming Lei
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 33+ messages in thread
From: John Garry @ 2017-10-09 12:09 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Linuxarm

On 30/09/2017 11:27, Ming Lei wrote:
> Hi Jens,
>
> In Red Hat internal storage test wrt. blk-mq scheduler, we
> found that I/O performance is much bad with mq-deadline, especially
> about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
> SRP...)
>
> Turns out one big issue causes the performance regression: requests
> are still dequeued from sw queue/scheduler queue even when ldd's
> queue is busy, so I/O merge becomes quite difficult to make, then
> sequential IO degrades a lot.
>
> This issue becomes one of mains reasons for reverting default SCSI_MQ
> in V4.13.
>
> The 1st patch takes direct issue in blk_mq_request_bypass_insert(),
> then we can improve dm-mpath's performance in part 2, which will
> be posted out soon.
>
> The 2nd six patches improve this situation, and brings back
> some performance loss.
>
> With this change, SCSI-MQ sequential I/O performance is
> improved much, Paolo reported that mq-deadline performance
> improved much[2] in his dbench test wrt V2. Also performanc
> improvement on lpfc/qla2xx was observed with V1.[1]
>
> Please consider it for V4.15.
>
> [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
> [2] https://marc.info/?l=linux-block&m=150217980602843&w=2
>

I tested this series for the SAS controller on the HiSilicon hip07
platform, as I am interested in enabling MQ for this driver. The driver
is ./drivers/scsi/hisi_sas/.

So I found that performance is improved when enabling default SCSI_MQ
with this series vs the baseline. However, it is still not as good as
when default SCSI_MQ is disabled.

Here are some figures I got with fio:
4.14-rc2 without default SCSI_MQ
read, rw, write IOPS	
952K, 133K/133K, 800K

4.14-rc2 with default SCSI_MQ
read, rw, write IOPS	
311K, 117K/117K, 320K

This series* without default SCSI_MQ
read, rw, write IOPS	
975K, 132K/132K, 790K

This series* with default SCSI_MQ
read, rw, write IOPS	
770K, 164K/164K, 594K

Please note that the hisi_sas driver does not enable mq by exposing
multiple queues to the upper layer (even though the hardware has
multiple queues). I have been playing with enabling that, but my
performance is always worse...

* I'm using 
https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V5.1, 
as advised by Ming Lei.

Thanks,
John

> V5:
> 	- address some comments from Omar
> 	- add Tested-by & Reveiewed-by tag
> 	- use direct issue for blk_mq_request_bypass_insert(), and
> 	start to consider to improve sequential I/O for dm-mpath
> 	- only include part 1(the original patch 1 ~ 6), as suggested
> 	by Omar
>
> V4:
> 	- add Reviewed-by tag
> 	- some trival change: typo fix in commit log or comment,
> 	variable name, no actual functional change
>
> V3:
> 	- totally round robin for picking req from ctx, as suggested
> 	by Bart
> 	- remove one local variable in __sbitmap_for_each_set()
> 	- drop patches of single dispatch list, which can improve
> 	performance on mq-deadline, but cause a bit degrade on
> 	none because all hctxs need to be checked after ->dispatch
> 	is flushed. Will post it again once it is mature.
> 	- rebase on v4.13-rc6 with block for-next
>
> V2:
> 	- dequeue request from sw queues in round roubin's style
> 	as suggested by Bart, and introduces one helper in sbitmap
> 	for this purpose
> 	- improve bio merge via hash table from sw queue
> 	- add comments about using DISPATCH_BUSY state in lockless way,
> 	simplifying handling on busy state,
> 	- hold ctx->lock when clearing ctx busy bit as suggested
> 	by Bart
>
>
> Ming Lei (7):
>   blk-mq: issue rq directly in blk_mq_request_bypass_insert()
>   blk-mq-sched: fix scheduler bad performance
>   sbitmap: introduce __sbitmap_for_each_set()
>   blk-mq: introduce blk_mq_dequeue_from_ctx()
>   blk-mq-sched: move actual dispatching into one helper
>   blk-mq-sched: improve dispatching from sw queue
>   blk-mq-sched: don't dequeue request until all in ->dispatch are
>     flushed
>
>  block/blk-core.c        |   3 +-
>  block/blk-mq-debugfs.c  |   1 +
>  block/blk-mq-sched.c    | 104 ++++++++++++++++++++++++++++++++++++-------
>  block/blk-mq.c          | 114 +++++++++++++++++++++++++++++++++++++++---------
>  block/blk-mq.h          |   4 +-
>  drivers/md/dm-rq.c      |   2 +-
>  include/linux/blk-mq.h  |   3 ++
>  include/linux/sbitmap.h |  64 +++++++++++++++++++--------
>  8 files changed, 238 insertions(+), 57 deletions(-)
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)
  2017-10-09 12:09   ` John Garry
  (?)
@ 2017-10-09 15:04   ` Ming Lei
  2017-10-10  1:46     ` Ming Lei
  -1 siblings, 1 reply; 33+ messages in thread
From: Ming Lei @ 2017-10-09 15:04 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Linuxarm

Hi John,

On Mon, Oct 09, 2017 at 01:09:22PM +0100, John Garry wrote:
> On 30/09/2017 11:27, Ming Lei wrote:
> > Hi Jens,
> > 
> > In Red Hat internal storage test wrt. blk-mq scheduler, we
> > found that I/O performance is much bad with mq-deadline, especially
> > about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
> > SRP...)
> > 
> > Turns out one big issue causes the performance regression: requests
> > are still dequeued from sw queue/scheduler queue even when ldd's
> > queue is busy, so I/O merge becomes quite difficult to make, then
> > sequential IO degrades a lot.
> > 
> > This issue becomes one of mains reasons for reverting default SCSI_MQ
> > in V4.13.
> > 
> > The 1st patch takes direct issue in blk_mq_request_bypass_insert(),
> > then we can improve dm-mpath's performance in part 2, which will
> > be posted out soon.
> > 
> > The 2nd six patches improve this situation, and brings back
> > some performance loss.
> > 
> > With this change, SCSI-MQ sequential I/O performance is
> > improved much, Paolo reported that mq-deadline performance
> > improved much[2] in his dbench test wrt V2. Also performanc
> > improvement on lpfc/qla2xx was observed with V1.[1]
> > 
> > Please consider it for V4.15.
> > 
> > [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
> > [2] https://marc.info/?l=linux-block&m=150217980602843&w=2
> > 
> 
> I tested this series for the SAS controller on HiSilicon hip07 platform as I
> am interested in enabling MQ for this driver. Driver is
> ./drivers/scsi/hisi_sas/.
> 
> So I found that that performance is improved when enabling default SCSI_MQ
> with this series vs baseline. However, it is still not as a good as when
> default SCSI_MQ is disabled.
> 
> Here are some figures I got with fio:
> 4.14-rc2 without default SCSI_MQ
> read, rw, write IOPS	
> 952K, 133K/133K, 800K
> 
> 4.14-rc2 with default SCSI_MQ
> read, rw, write IOPS	
> 311K, 117K/117K, 320K
> 
> This series* without default SCSI_MQ
> read, rw, write IOPS	
> 975K, 132K/132K, 790K
> 
> This series* with default SCSI_MQ
> read, rw, write IOPS	
> 770K, 164K/164K, 594K

Thanks for testing this patchset!

Looks like there is a big improvement, but the gap compared with
the legacy block path is not small either.

> 
> Please note that hisi_sas driver does not enable mq by exposing multiple
> queues to upper layer (even though it has multiple queues). I have been
> playing with enabling it, but my performance is always worse...
> 
> * I'm using
> https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V5.1,
> as advised by Ming Lei.

Could you test on the following branch and see if it makes a
difference?

	https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V6.1_test

BTW, one big change is in the following commit, which simply adopts the
legacy block layer's policy for dequeuing requests. I can observe some
improvement on virtio-scsi with it too. This commit is just for
verification/debug purposes and has never been posted before.

https://github.com/ming1/linux/commit/94a117fdd9cfc1291445e5a35f04464c89c9ce70


Thanks,
Ming

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)
  2017-10-09 15:04   ` Ming Lei
@ 2017-10-10  1:46     ` Ming Lei
  2017-10-10 12:24         ` John Garry
  0 siblings, 1 reply; 33+ messages in thread
From: Ming Lei @ 2017-10-10  1:46 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Linuxarm

On Mon, Oct 09, 2017 at 11:04:39PM +0800, Ming Lei wrote:
> Hi John,
> 
> On Mon, Oct 09, 2017 at 01:09:22PM +0100, John Garry wrote:
> > On 30/09/2017 11:27, Ming Lei wrote:
> > > Hi Jens,
> > > 
> > > In Red Hat internal storage test wrt. blk-mq scheduler, we
> > > found that I/O performance is much bad with mq-deadline, especially
> > > about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
> > > SRP...)
> > > 
> > > Turns out one big issue causes the performance regression: requests
> > > are still dequeued from sw queue/scheduler queue even when ldd's
> > > queue is busy, so I/O merge becomes quite difficult to make, then
> > > sequential IO degrades a lot.
> > > 
> > > This issue becomes one of mains reasons for reverting default SCSI_MQ
> > > in V4.13.
> > > 
> > > The 1st patch takes direct issue in blk_mq_request_bypass_insert(),
> > > then we can improve dm-mpath's performance in part 2, which will
> > > be posted out soon.
> > > 
> > > The 2nd six patches improve this situation, and brings back
> > > some performance loss.
> > > 
> > > With this change, SCSI-MQ sequential I/O performance is
> > > improved much, Paolo reported that mq-deadline performance
> > > improved much[2] in his dbench test wrt V2. Also performanc
> > > improvement on lpfc/qla2xx was observed with V1.[1]
> > > 
> > > Please consider it for V4.15.
> > > 
> > > [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
> > > [2] https://marc.info/?l=linux-block&m=150217980602843&w=2
> > > 
> > 
> > I tested this series for the SAS controller on HiSilicon hip07 platform as I
> > am interested in enabling MQ for this driver. Driver is
> > ./drivers/scsi/hisi_sas/.
> > 
> > So I found that that performance is improved when enabling default SCSI_MQ
> > with this series vs baseline. However, it is still not as a good as when
> > default SCSI_MQ is disabled.
> > 
> > Here are some figures I got with fio:
> > 4.14-rc2 without default SCSI_MQ
> > read, rw, write IOPS	
> > 952K, 133K/133K, 800K
> > 
> > 4.14-rc2 with default SCSI_MQ
> > read, rw, write IOPS	
> > 311K, 117K/117K, 320K
> > 
> > This series* without default SCSI_MQ
> > read, rw, write IOPS	
> > 975K, 132K/132K, 790K
> > 
> > This series* with default SCSI_MQ
> > read, rw, write IOPS	
> > 770K, 164K/164K, 594K
> 
> Thanks for testing this patchset!
> 
> Looks there is big improvement, but the gap compared with
> block legacy is not small too.
> 
> > 
> > Please note that hisi_sas driver does not enable mq by exposing multiple
> > queues to upper layer (even though it has multiple queues). I have been
> > playing with enabling it, but my performance is always worse...
> > 
> > * I'm using
> > https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V5.1,
> > as advised by Ming Lei.
> 
> Could you test on the following branch and see if it makes a
> difference?
> 
> 	https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V6.1_test

Hi John,

Please test the following branch directly:

https://github.com/ming1/linux/tree/blk_mq_improve_scsi_mpath_perf_V6.2_test

The code is simplified and cleaned up a lot in V6.2, so only two extra
patches (the top 2) are needed against V6, which was posted yesterday.

Please test SCSI_MQ with mq-deadline, which should be the default
mq scheduler on your HiSilicon SAS.

-- 
Ming

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)
  2017-10-10  1:46     ` Ming Lei
@ 2017-10-10 12:24         ` John Garry
  0 siblings, 0 replies; 33+ messages in thread
From: John Garry @ 2017-10-10 12:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Linuxarm

On 10/10/2017 02:46, Ming Lei wrote:
>>> > > I tested this series for the SAS controller on HiSilicon hip07 platform as I
>>> > > am interested in enabling MQ for this driver. Driver is
>>> > > ./drivers/scsi/hisi_sas/.
>>> > >
>>> > > So I found that that performance is improved when enabling default SCSI_MQ
>>> > > with this series vs baseline. However, it is still not as a good as when
>>> > > default SCSI_MQ is disabled.
>>> > >
>>> > > Here are some figures I got with fio:
>>> > > 4.14-rc2 without default SCSI_MQ
>>> > > read, rw, write IOPS	
>>> > > 952K, 133K/133K, 800K
>>> > >
>>> > > 4.14-rc2 with default SCSI_MQ
>>> > > read, rw, write IOPS	
>>> > > 311K, 117K/117K, 320K
>>> > >
>>> > > This series* without default SCSI_MQ
>>> > > read, rw, write IOPS	
>>> > > 975K, 132K/132K, 790K
>>> > >
>>> > > This series* with default SCSI_MQ
>>> > > read, rw, write IOPS	
>>> > > 770K, 164K/164K, 594K
>> >
>> > Thanks for testing this patchset!
>> >
>> > Looks there is big improvement, but the gap compared with
>> > block legacy is not small too.
>> >
>>> > >
>>> > > Please note that hisi_sas driver does not enable mq by exposing multiple
>>> > > queues to upper layer (even though it has multiple queues). I have been
>>> > > playing with enabling it, but my performance is always worse...
>>> > >
>>> > > * I'm using
>>> > > https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V5.1,
>>> > > as advised by Ming Lei.
>> >
>> > Could you test on the following branch and see if it makes a
>> > difference?
>> >
>> > 	https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V6.1_test
> Hi John,
>
> Please test the following branch directly:
>
> https://github.com/ming1/linux/tree/blk_mq_improve_scsi_mpath_perf_V6.2_test
>
> And code is simplified and cleaned up much in V6.2, then only two extra
> patches(top 2) are needed against V6 which was posted yesterday.
>
> Please test SCSI_MQ with mq-deadline, which should be the default
> mq scheduler on your HiSilicon SAS.

Hi Ming Lei,

It's using cfq (for non-mq) and mq-deadline (obviously for mq).

root@(none)$ pwd
/sys/devices/platform/HISI0162:01/host0/port-0:0/expander-0:0/port-0:0:7/end_device-0:0:7
root@(none)$ more ./target0:0:3/0:0:3:0/block/sdd/queue/scheduler
noop [cfq]

and

root@(none)$ more ./target0:0:3/0:0:3:0/block/sdd/queue/scheduler
[mq-deadline] kyber none

Unfortunately my setup has changed since yesterday, and the absolute
figures are not exactly the same (I retested 4.14-rc2). However, we
still see that drop when mq is enabled.

Here's the results:
4.14-rc4 without default SCSI_MQ
read, rw, write IOPS	
860K, 112K/112K, 800K

4.14-rc2 without default SCSI_MQ
read, rw, write IOPS	
880K, 113K/113K, 808K

V6.2 series without default SCSI_MQ
read, rw, write IOPS	
820K, 114/114K, 790K

V6.2 series with default SCSI_MQ
read, rw, write IOPS	
700K, 130K/128K, 640K

Cheers,
John

>
> -- Ming .

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)
  2017-10-10 12:24         ` John Garry
@ 2017-10-10 12:34           ` Johannes Thumshirn
  -1 siblings, 0 replies; 33+ messages in thread
From: Johannes Thumshirn @ 2017-10-10 12:34 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, dm-devel, Bart Van Assche, Laurence Oberman,
	Paolo Valente, Oleksandr Natalenko, Tom Nguyen, linux-kernel,
	linux-scsi, Omar Sandoval, Linuxarm

Hi John,

On Tue, Oct 10, 2017 at 01:24:52PM +0100, John Garry wrote:
> It's using cfq (for non-mq) and mq-deadline (obviously for mq).

Please be aware that cfq and mq-deadline are _not_ comparable; for a realistic
comparison please use deadline and mq-deadline, or cfq and bfq.

> root@(none)$ pwd
> /sys/devices/platform/HISI0162:01/host0/port-0:0/expander-0:0/port-0:0:7/end_device-0:0:7
> root@(none)$ more ./target0:0:3/0:0:3:0/block/sdd/queue/scheduler
> noop [cfq]

Maybe missing CONFIG_IOSCHED_DEADLINE?

Thanks,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)
  2017-10-10 12:34           ` Johannes Thumshirn
@ 2017-10-10 12:37             ` Paolo Valente
  -1 siblings, 0 replies; 33+ messages in thread
From: Paolo Valente @ 2017-10-10 12:37 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: John Garry, Ming Lei, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, dm-devel, Bart Van Assche, Laurence Oberman,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Linuxarm


> On 10 Oct 2017, at 14:34, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> 
> Hi John,
> 
> On Tue, Oct 10, 2017 at 01:24:52PM +0100, John Garry wrote:
>> It's using cfq (for non-mq) and mq-deadline (obviously for mq).
> 
> Please be aware that cfq and mq-deadline are _not_ comparable; for a realistic
> comparison please use deadline and mq-deadline, or cfq and bfq.
> 

Please set low_latency=0 for bfq if yours is just a maximum-throughput test.

Thanks,
Paolo

>> root@(none)$ pwd
>> /sys/devices/platform/HISI0162:01/host0/port-0:0/expander-0:0/port-0:0:7/end_device-0:0:7
>> root@(none)$ more ./target0:0:3/0:0:3:0/block/sdd/queue/scheduler
>> noop [cfq]
> 
> Maybe missing CONFIG_IOSCHED_DEADLINE?
> 
> Thanks,
> 	Johannes
> 
> -- 
> Johannes Thumshirn                                          Storage
> jthumshirn@suse.de                                +49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)
  2017-10-10 12:24         ` John Garry
  (?)
  (?)
@ 2017-10-10 13:45         ` Ming Lei
  2017-10-10 15:10             ` John Garry
  -1 siblings, 1 reply; 33+ messages in thread
From: Ming Lei @ 2017-10-10 13:45 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Linuxarm

On Tue, Oct 10, 2017 at 01:24:52PM +0100, John Garry wrote:
> On 10/10/2017 02:46, Ming Lei wrote:
> > > > > > I tested this series for the SAS controller on HiSilicon hip07 platform as I
> > > > > > am interested in enabling MQ for this driver. Driver is
> > > > > > ./drivers/scsi/hisi_sas/.
> > > > > >
> > > > > > So I found that that performance is improved when enabling default SCSI_MQ
> > > > > > with this series vs baseline. However, it is still not as a good as when
> > > > > > default SCSI_MQ is disabled.
> > > > > >
> > > > > > Here are some figures I got with fio:
> > > > > > 4.14-rc2 without default SCSI_MQ
> > > > > > read, rw, write IOPS	
> > > > > > 952K, 133K/133K, 800K
> > > > > >
> > > > > > 4.14-rc2 with default SCSI_MQ
> > > > > > read, rw, write IOPS	
> > > > > > 311K, 117K/117K, 320K
> > > > > >
> > > > > > This series* without default SCSI_MQ
> > > > > > read, rw, write IOPS	
> > > > > > 975K, 132K/132K, 790K
> > > > > >
> > > > > > This series* with default SCSI_MQ
> > > > > > read, rw, write IOPS	
> > > > > > 770K, 164K/164K, 594K
> > > >
> > > > Thanks for testing this patchset!
> > > >
> > > > Looks there is big improvement, but the gap compared with
> > > > block legacy is not small too.
> > > >
> > > > > >
> > > > > > Please note that hisi_sas driver does not enable mq by exposing multiple
> > > > > > queues to upper layer (even though it has multiple queues). I have been
> > > > > > playing with enabling it, but my performance is always worse...
> > > > > >
> > > > > > * I'm using
> > > > > > https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V5.1,
> > > > > > as advised by Ming Lei.
> > > >
> > > > Could you test on the following branch and see if it makes a
> > > > difference?
> > > >
> > > > 	https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V6.1_test
> > Hi John,
> > 
> > Please test the following branch directly:
> > 
> > https://github.com/ming1/linux/tree/blk_mq_improve_scsi_mpath_perf_V6.2_test
> > 
> > And code is simplified and cleaned up much in V6.2, then only two extra
> > patches(top 2) are needed against V6 which was posted yesterday.
> > 
> > Please test SCSI_MQ with mq-deadline, which should be the default
> > mq scheduler on your HiSilicon SAS.
> 
> Hi Ming Lei,
> 
> It's using cfq (for non-mq) and mq-deadline (obviously for mq).
> 
> root@(none)$ pwd
> /sys/devices/platform/HISI0162:01/host0/port-0:0/expander-0:0/port-0:0:7/end_device-0:0:7
> root@(none)$ more ./target0:0:3/0:0:3:0/block/sdd/queue/scheduler
> noop [cfq]
> 
> and
> 
> root@(none)$ more ./target0:0:3/0:0:3:0/block/sdd/queue/scheduler
> [mq-deadline] kyber none
> 
> Unfortunately my setup has changed since yeterday, and the absolute figures
> are not the exact same (I retested 4.14-rc2). However, we still see that
> drop when mq is enabled.
> 
> Here's the results:
> 4.14-rc4 without default SCSI_MQ
> read, rw, write IOPS	
> 860K, 112K/112K, 800K
> 
> 4.14-rc2 without default SCSI_MQ
> read, rw, write IOPS	
> 880K, 113K/113K, 808K
> 
> V6.2 series without default SCSI_MQ
> read, rw, write IOPS	
> 820K, 114/114K, 790K

Hi John,

All changes in V6.2 are blk-mq/scsi-mq only, which shouldn't
affect non-SCSI_MQ, so I suggest you compare the performance
of deadline vs mq-deadline, as Johannes mentioned.

> 
> V6.2 series with default SCSI_MQ
> read, rw, write IOPS	
> 700K, 130K/128K, 640K

If possible, could you provide your fio script and logs for both
non-SCSI_MQ (deadline) and SCSI_MQ (mq-deadline)? Maybe some clues
can be figured out from them.

Also, I just put another patch on the V6.2 branch, which may improve
things a bit too. You may want to try it in your test.

	https://github.com/ming1/linux/commit/e31e2eec46c9b5ae7cfa181e9b77adad2c6a97ce

-- 
Ming

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)
  2017-10-10 13:45         ` Ming Lei
@ 2017-10-10 15:10             ` John Garry
  0 siblings, 0 replies; 33+ messages in thread
From: John Garry @ 2017-10-10 15:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Linuxarm

On 10/10/2017 14:45, Ming Lei wrote:
> Hi John,
>
> All change in V6.2 is blk-mq/scsi-mq only, which shouldn't
> affect non SCSI_MQ, so I suggest you to compare the perf
> between deadline and mq-deadline, like Johannes mentioned.
>
>> >
>> > V6.2 series with default SCSI_MQ
>> > read, rw, write IOPS	
>> > 700K, 130K/128K, 640K
> If possible, could you provide your fio script and log on both
> non SCSI_MQ(deadline) and SCSI_MQ(mq_deadline)? Maybe some clues
> can be figured out.
>
> Also, I just put another patch on V6.2 branch, which may improve
> a bit too. You may try that in your test.
>
> 	https://github.com/ming1/linux/commit/e31e2eec46c9b5ae7cfa181e9b77adad2c6a97ce
>
> -- Ming .

Hi Ming Lei,

OK, I have tested deadline vs mq-deadline for your v6.2 branch and
4.14-rc2. Unfortunately I don't have time now to test your experimental
patches.

4.14-rc2 without default SCSI_MQ, deadline scheduler
read, rw, write IOPS	
920K, 115K/115K, 806K

4.14-rc2 with default SCSI_MQ, mq-deadline scheduler
read, rw, write IOPS	
280K, 99K/99K, 300K

V6.2 series without default SCSI_MQ, deadline scheduler
read, rw, write IOPS	
919K, 117K/117K, 806K

V6.2 series with default SCSI_MQ, mq-deadline scheduler
read, rw, write IOPS	
688K, 128K/128K, 630K

I think that the non-mq results look a bit more sensible - that is,
they are consistent.

Here's my script sample:
[global]
rw=rW
direct=1
ioengine=libaio
iodepth=2048
numjobs=1
bs=4k
;size=10240000m
;zero_buffers=1
group_reporting=1
group_reporting=1
;ioscheduler=noop
cpumask=0xff
;cpus_allowed=0-3
;gtod_reduce=1
;iodepth_batch=2
;iodepth_batch_complete=2
runtime=100000000
;thread
loops = 10000

[job1]
filename=/dev/sdb:
[job1]
filename=/dev/sdc:
[job1]
filename=/dev/sdd:
[job1]
filename=/dev/sde:
[job1]
filename=/dev/sdf:
[job1]
filename=/dev/sdg:

John

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2017-10-10 15:12 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-30 10:27 [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1) Ming Lei
2017-09-30 10:27 ` [PATCH V5 1/7] blk-mq: issue rq directly in blk_mq_request_bypass_insert() Ming Lei
2017-10-03  8:58   ` Christoph Hellwig
2017-10-03 13:39     ` Ming Lei
2017-09-30 10:27 ` [PATCH V5 2/7] blk-mq-sched: fix scheduler bad performance Ming Lei
2017-10-02 14:19   ` Christoph Hellwig
2017-09-30 10:27 ` [PATCH V5 3/7] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
2017-09-30 10:27 ` [PATCH V5 4/7] blk-mq: introduce blk_mq_dequeue_from_ctx() Ming Lei
2017-10-03  9:01   ` Christoph Hellwig
2017-10-09  4:36     ` Ming Lei
2017-09-30 10:27 ` [PATCH V5 5/7] blk-mq-sched: move actual dispatching into one helper Ming Lei
2017-10-02 14:19   ` Christoph Hellwig
2017-10-09  9:07     ` Ming Lei
2017-09-30 10:27 ` [PATCH V5 6/7] blk-mq-sched: improve dispatching from sw queue Ming Lei
2017-10-03  9:05   ` Christoph Hellwig
2017-10-09 10:15     ` Ming Lei
2017-09-30 10:27 ` [PATCH V5 7/7] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
2017-10-03  9:11   ` Christoph Hellwig
2017-10-09 10:40     ` Ming Lei
2017-09-30 10:32 ` [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1) Ming Lei
2017-10-09 12:09 ` John Garry
2017-10-09 12:09   ` John Garry
2017-10-09 15:04   ` Ming Lei
2017-10-10  1:46     ` Ming Lei
2017-10-10 12:24       ` John Garry
2017-10-10 12:24         ` John Garry
2017-10-10 12:34         ` Johannes Thumshirn
2017-10-10 12:34           ` Johannes Thumshirn
2017-10-10 12:37           ` Paolo Valente
2017-10-10 12:37             ` Paolo Valente
2017-10-10 13:45         ` Ming Lei
2017-10-10 15:10           ` John Garry
2017-10-10 15:10             ` John Garry
