* [PATCH V7 0/6] blk-mq-sched: improve sequential I/O performance
@ 2017-10-12 18:36 Ming Lei
  2017-10-12 18:36 ` [PATCH V7 1/6] blk-mq-sched: fix scheduler bad performance Ming Lei
                   ` (5 more replies)
  0 siblings, 6 replies; 32+ messages in thread
From: Ming Lei @ 2017-10-12 18:36 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, John Garry, Ming Lei

Hi Jens,

In Red Hat's internal storage tests of the blk-mq scheduler, we
found that I/O performance is quite bad with mq-deadline, especially
for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
SRP...).

It turns out one big issue causes the performance regression: requests
are still dequeued from the sw queue/scheduler queue even when the
LLD's queue is busy, so I/O merging becomes quite difficult and
sequential I/O degrades a lot.

This issue became one of the main reasons for reverting default SCSI_MQ
in v4.13.

These six patches improve the situation and bring back the lost
performance.

With this change, SCSI-MQ sequential I/O performance is improved
a lot. Paolo reported that mq-deadline performance improved much [2]
in his dbench test with V2, and a performance improvement on
lpfc/qla2xxx was also observed with V1 [1].

[1] http://marc.info/?l=linux-block&m=150151989915776&w=2
[2] https://marc.info/?l=linux-block&m=150217980602843&w=2

gitweb:
	https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V7_part1

V7:
	- introduce .get_budget/.put_budget; I/O merging gets improved
	compared with V6, and in theory this approach is better than the
	way taken in the legacy block layer

	- drop patch of 'blk-mq-sched: don't dequeue request until all in ->dispatch are flushed'

V6:
	- address comments from Christoph
	- drop the 1st patch which changes blk_mq_request_bypass_insert(),
	which should belong to dm-mpath's improvement
	- move ' blk-mq-sched: move actual dispatching into one helper'
	as 2nd patch, and use the introduced helper to simplify dispatch
	logic
	- merge two previous patches into one for improving dispatch from sw queue
	- make comment/commit log line width as ~70, as suggested by
	  Christoph

V5:
	- address some comments from Omar
	- add Tested-by & Reviewed-by tags
	- use direct issue for blk_mq_request_bypass_insert(), and
	start to consider to improve sequential I/O for dm-mpath
	- only include part 1(the original patch 1 ~ 6), as suggested
	by Omar

V4:
	- add Reviewed-by tag
	- some trivial changes: typo fixes in commit logs or comments,
	variable names, no actual functional change

V3:
	- pick req from ctx in a fully round-robin way, as suggested
	by Bart
	- remove one local variable in __sbitmap_for_each_set()
	- drop the single dispatch list patches, which can improve
	performance on mq-deadline but cause a bit of degradation on
	none because all hctxs need to be checked after ->dispatch
	is flushed. Will post them again once they are mature.
	- rebase on v4.13-rc6 with block for-next

V2:
	- dequeue requests from sw queues in a round-robin style,
	as suggested by Bart, and introduce one helper in sbitmap
	for this purpose
	- improve bio merging against the sw queue via a hash table
	- add comments about using the DISPATCH_BUSY state in a lockless
	way, simplifying handling of the busy state
	- hold ctx->lock when clearing ctx busy bit as suggested
	by Bart

Ming Lei (6):
  blk-mq-sched: fix scheduler bad performance
  blk-mq-sched: move actual dispatching into one helper
  sbitmap: introduce __sbitmap_for_each_set()
  blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  blk-mq-sched: improve dispatching from sw queue
  SCSI: implement .get_budget and .put_budget for blk-mq

 block/blk-mq-sched.c    | 121 +++++++++++++++++++++++++++++++++++++++++-------
 block/blk-mq.c          |  74 +++++++++++++++++++++++++++--
 block/blk-mq.h          |   4 +-
 drivers/scsi/scsi_lib.c |  42 +++++++++++++----
 include/linux/blk-mq.h  |  12 +++++
 include/linux/sbitmap.h |  64 ++++++++++++++++++-------
 6 files changed, 268 insertions(+), 49 deletions(-)

-- 
2.9.5

* [PATCH V7 1/6] blk-mq-sched: fix scheduler bad performance
  2017-10-12 18:36 [PATCH V7 0/6] blk-mq-sched: improve sequential I/O performance Ming Lei
@ 2017-10-12 18:36 ` Ming Lei
  2017-10-12 18:37 ` [PATCH V7 2/6] blk-mq-sched: move actual dispatching into one helper Ming Lei
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2017-10-12 18:36 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, John Garry, Ming Lei

When the hw queue is busy, we shouldn't take requests from the
scheduler queue any more, otherwise it is difficult to do
I/O merging.

This patch fixes the awful I/O performance on some
SCSI devices (lpfc, qla2xxx, ...) when mq-deadline/kyber
is used, by not taking requests if the hw queue is busy.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..eca011fdfa0e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-	bool did_work = false;
+	bool do_sched_dispatch = true;
 	LIST_HEAD(rq_list);
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -125,18 +125,18 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
+		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
 	} else if (!has_sched_dispatch) {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list);
 	}
 
 	/*
-	 * We want to dispatch from the scheduler if we had no work left
-	 * on the dispatch list, OR if we did have work but weren't able
-	 * to make progress.
+	 * We want to dispatch from the scheduler if there was nothing
+	 * on the dispatch list or we were able to dispatch from the
+	 * dispatch list.
 	 */
-	if (!did_work && has_sched_dispatch) {
+	if (do_sched_dispatch && has_sched_dispatch) {
 		do {
 			struct request *rq;
 
-- 
2.9.5

* [PATCH V7 2/6] blk-mq-sched: move actual dispatching into one helper
  2017-10-12 18:36 [PATCH V7 0/6] blk-mq-sched: improve sequential I/O performance Ming Lei
  2017-10-12 18:36 ` [PATCH V7 1/6] blk-mq-sched: fix scheduler bad performance Ming Lei
@ 2017-10-12 18:37 ` Ming Lei
  2017-10-12 18:37 ` [PATCH V7 3/6] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2017-10-12 18:37 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, John Garry, Ming Lei

So that it becomes easy to support dispatching from the sw queue in
the following patch.

No functional change.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Suggested-by: Christoph Hellwig <hch@lst.de> # for simplifying dispatch logic
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 43 ++++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index eca011fdfa0e..be29ba849408 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -89,12 +89,26 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 	return false;
 }
 
+static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	struct elevator_queue *e = q->elevator;
+	LIST_HEAD(rq_list);
+
+	do {
+		struct request *rq = e->type->ops.mq.dispatch_request(hctx);
+
+		if (!rq)
+			break;
+		list_add(&rq->queuelist, &rq_list);
+	} while (blk_mq_dispatch_rq_list(q, &rq_list));
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-	bool do_sched_dispatch = true;
 	LIST_HEAD(rq_list);
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -122,30 +136,21 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * scheduler, we can no longer merge or sort them. So it's best to
 	 * leave them there for as long as we can. Mark the hw queue as
 	 * needing a restart in that case.
+	 *
+	 * We want to dispatch from the scheduler if there was nothing
+	 * on the dispatch list or we were able to dispatch from the
+	 * dispatch list.
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
-	} else if (!has_sched_dispatch) {
+		if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch)
+			blk_mq_do_dispatch_sched(hctx);
+	} else if (has_sched_dispatch) {
+		blk_mq_do_dispatch_sched(hctx);
+	} else {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list);
 	}
-
-	/*
-	 * We want to dispatch from the scheduler if there was nothing
-	 * on the dispatch list or we were able to dispatch from the
-	 * dispatch list.
-	 */
-	if (do_sched_dispatch && has_sched_dispatch) {
-		do {
-			struct request *rq;
-
-			rq = e->type->ops.mq.dispatch_request(hctx);
-			if (!rq)
-				break;
-			list_add(&rq->queuelist, &rq_list);
-		} while (blk_mq_dispatch_rq_list(q, &rq_list));
-	}
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-- 
2.9.5

* [PATCH V7 3/6] sbitmap: introduce __sbitmap_for_each_set()
  2017-10-12 18:36 [PATCH V7 0/6] blk-mq-sched: improve sequential I/O performance Ming Lei
  2017-10-12 18:36 ` [PATCH V7 1/6] blk-mq-sched: fix scheduler bad performance Ming Lei
  2017-10-12 18:37 ` [PATCH V7 2/6] blk-mq-sched: move actual dispatching into one helper Ming Lei
@ 2017-10-12 18:37 ` Ming Lei
  2017-10-12 18:37 ` [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops Ming Lei
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2017-10-12 18:37 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, John Garry, Ming Lei

We need to iterate over ctxs starting from any ctx in a round-robin
way, so introduce this helper.
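
For illustration only (not part of this patch; 'hctx->ctx_map' and
'last' are just example arguments), a caller can resume the scan from
a remembered position and still visit every set bit exactly once:

    /* illustrative callback: count the set bits */
    static bool count_set(struct sbitmap *sb, unsigned int bitnr, void *data)
    {
            unsigned int *count = data;

            (*count)++;
            return true;    /* returning false stops the scan early */
    }

    /* visit every set bit, starting at 'last' and wrapping around */
    unsigned int count = 0;

    __sbitmap_for_each_set(&hctx->ctx_map, last, count_set, &count);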

Reviewed-by: Omar Sandoval <osandov@fb.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/sbitmap.h | 64 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 47 insertions(+), 17 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index a1904aadbc45..0dcc60e820de 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -211,10 +211,14 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb);
  */
 bool sbitmap_any_bit_clear(const struct sbitmap *sb);
 
+#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
+#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+
 typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
 
 /**
- * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * __sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @start: Where to start the iteration.
  * @sb: Bitmap to iterate over.
  * @fn: Callback. Should return true to continue or false to break early.
  * @data: Pointer to pass to callback.
@@ -222,35 +226,61 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
  * This is inline even though it's non-trivial so that the function calls to the
  * callback will hopefully get optimized away.
  */
-static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
-					void *data)
+static inline void __sbitmap_for_each_set(struct sbitmap *sb,
+					  unsigned int start,
+					  sb_for_each_fn fn, void *data)
 {
-	unsigned int i;
+	unsigned int index;
+	unsigned int nr;
+	unsigned int scanned = 0;
 
-	for (i = 0; i < sb->map_nr; i++) {
-		struct sbitmap_word *word = &sb->map[i];
-		unsigned int off, nr;
+	if (start >= sb->depth)
+		start = 0;
+	index = SB_NR_TO_INDEX(sb, start);
+	nr = SB_NR_TO_BIT(sb, start);
 
-		if (!word->word)
-			continue;
+	while (scanned < sb->depth) {
+		struct sbitmap_word *word = &sb->map[index];
+		unsigned int depth = min_t(unsigned int, word->depth - nr,
+					   sb->depth - scanned);
 
-		nr = 0;
-		off = i << sb->shift;
+		scanned += depth;
+		if (!word->word)
+			goto next;
+
+		/*
+		 * On the first iteration of the outer loop, we need to add the
+		 * bit offset back to the size of the word for find_next_bit().
+		 * On all other iterations, nr is zero, so this is a noop.
+		 */
+		depth += nr;
 		while (1) {
-			nr = find_next_bit(&word->word, word->depth, nr);
-			if (nr >= word->depth)
+			nr = find_next_bit(&word->word, depth, nr);
+			if (nr >= depth)
 				break;
-
-			if (!fn(sb, off + nr, data))
+			if (!fn(sb, (index << sb->shift) + nr, data))
 				return;
 
 			nr++;
 		}
+next:
+		nr = 0;
+		if (++index >= sb->map_nr)
+			index = 0;
 	}
 }
 
-#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
-#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+/**
+ * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @sb: Bitmap to iterate over.
+ * @fn: Callback. Should return true to continue or false to break early.
+ * @data: Pointer to pass to callback.
+ */
+static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
+					void *data)
+{
+	__sbitmap_for_each_set(sb, 0, fn, data);
+}
 
 static inline unsigned long *__sbitmap_word(struct sbitmap *sb,
 					    unsigned int bitnr)
-- 
2.9.5

* [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-12 18:36 [PATCH V7 0/6] blk-mq-sched: improve sequential I/O performance Ming Lei
                   ` (2 preceding siblings ...)
  2017-10-12 18:37 ` [PATCH V7 3/6] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
@ 2017-10-12 18:37 ` Ming Lei
  2017-10-12 18:46   ` Jens Axboe
  2017-10-12 18:37 ` [PATCH V7 5/6] blk-mq-sched: improve dispatching from sw queue Ming Lei
  2017-10-12 18:37 ` [PATCH V7 6/6] SCSI: implement .get_budget and .put_budget for blk-mq Ming Lei
  5 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2017-10-12 18:37 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, John Garry, Ming Lei

For SCSI devices, there is often a per-request-queue depth, which needs
to be respected before queuing one request.

The current blk-mq always dequeues one request first, then calls
.queue_rq() to dispatch the request to the LLD. One obvious issue with
this approach is that I/O merging may not be good, because when the
per-request-queue depth can't be respected, .queue_rq() has to return
BLK_STS_RESOURCE, then this request has to stay in the hctx->dispatch
list and never gets a chance to participate in I/O merging.

This patch introduces the .get_budget and .put_budget callbacks in
blk_mq_ops, so that we can try to get the reserved budget before
dequeuing a request. Once we can't get budget for queueing I/O, we
don't need to dequeue the request at all, and I/O merging can be
improved a lot.
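
For illustration only (not part of this patch; the foo_* names and the
per-queue 'busy'/'depth' fields are made up), a driver with a fixed
per-queue depth could wire the callbacks up roughly like this:

    static bool foo_get_budget(struct blk_mq_hw_ctx *hctx)
    {
            struct foo_queue *fq = hctx->queue->queuedata;

            /* reserve one slot; undo the reservation if the queue is full */
            if (atomic_inc_return(&fq->busy) > fq->depth) {
                    atomic_dec(&fq->busy);
                    return false;
            }
            return true;
    }

    static void foo_put_budget(struct blk_mq_hw_ctx *hctx)
    {
            struct foo_queue *fq = hctx->queue->queuedata;

            atomic_dec(&fq->busy);
    }

Patch 6 shows the real SCSI implementation on top of
sdev->device_busy.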

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c   | 39 +++++++++++++++++++++++++++++++--------
 block/blk-mq.c         | 35 ++++++++++++++++++++++++++++++++---
 block/blk-mq.h         |  2 +-
 include/linux/blk-mq.h | 10 ++++++++++
 4 files changed, 74 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index be29ba849408..bcf4773ee1ca 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -89,19 +89,32 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 	return false;
 }
 
-static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	LIST_HEAD(rq_list);
 
 	do {
-		struct request *rq = e->type->ops.mq.dispatch_request(hctx);
+		struct request *rq;
 
-		if (!rq)
+		if (e->type->ops.mq.has_work &&
+				!e->type->ops.mq.has_work(hctx))
 			break;
+
+		if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
+			return true;
+
+		rq = e->type->ops.mq.dispatch_request(hctx);
+		if (!rq) {
+			if (q->mq_ops->put_budget)
+				q->mq_ops->put_budget(hctx);
+			break;
+		}
 		list_add(&rq->queuelist, &rq_list);
-	} while (blk_mq_dispatch_rq_list(q, &rq_list));
+	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
+
+	return false;
 }
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
@@ -110,6 +123,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
 	LIST_HEAD(rq_list);
+	bool run_queue = false;
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
 	if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)))
@@ -143,13 +157,22 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch)
-			blk_mq_do_dispatch_sched(hctx);
+		if (blk_mq_dispatch_rq_list(q, &rq_list, false) &&
+				has_sched_dispatch)
+			run_queue = blk_mq_do_dispatch_sched(hctx);
 	} else if (has_sched_dispatch) {
-		blk_mq_do_dispatch_sched(hctx);
+		run_queue = blk_mq_do_dispatch_sched(hctx);
 	} else {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
-		blk_mq_dispatch_rq_list(q, &rq_list);
+		blk_mq_dispatch_rq_list(q, &rq_list, false);
+	}
+
+	if (run_queue) {
+		if (!blk_mq_sched_needs_restart(hctx) &&
+		    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) {
+			blk_mq_sched_mark_restart_hctx(hctx);
+			blk_mq_run_hw_queue(hctx, true);
+		}
 	}
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 076cbab9c3e0..e6e50a91f170 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1045,7 +1045,16 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
-bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
+static void blk_mq_put_budget(struct blk_mq_hw_ctx *hctx, bool got_budget)
+{
+	struct request_queue *q = hctx->queue;
+
+	if (q->mq_ops->put_budget && got_budget)
+		q->mq_ops->put_budget(hctx);
+}
+
+bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
+		bool got_budget)
 {
 	struct blk_mq_hw_ctx *hctx;
 	struct request *rq;
@@ -1054,6 +1063,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 	if (list_empty(list))
 		return false;
 
+	WARN_ON(!list_is_singular(list) && got_budget);
+
 	/*
 	 * Now process all the entries, sending them to the driver.
 	 */
@@ -1071,15 +1082,25 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 			 * The initial allocation attempt failed, so we need to
 			 * rerun the hardware queue when a tag is freed.
 			 */
-			if (!blk_mq_dispatch_wait_add(hctx))
+			if (!blk_mq_dispatch_wait_add(hctx)) {
+				blk_mq_put_budget(hctx, got_budget);
 				break;
+			}
 
 			/*
 			 * It's possible that a tag was freed in the window
 			 * between the allocation failure and adding the
 			 * hardware queue to the wait queue.
 			 */
-			if (!blk_mq_get_driver_tag(rq, &hctx, false))
+			if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
+				blk_mq_put_budget(hctx, got_budget);
+				break;
+			}
+		}
+
+		if (!got_budget) {
+			if (q->mq_ops->get_budget &&
+					!q->mq_ops->get_budget(hctx))
 				break;
 		}
 
@@ -1577,6 +1598,11 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (!blk_mq_get_driver_tag(rq, NULL, false))
 		goto insert;
 
+	if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx)) {
+		blk_mq_put_driver_tag(rq);
+		goto insert;
+	}
+
 	new_cookie = request_to_qc_t(hctx, rq);
 
 	/*
@@ -2577,6 +2603,9 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->ops->queue_rq)
 		return -EINVAL;
 
+	if ((!!set->ops->get_budget) != (!!set->ops->put_budget))
+		return -EINVAL;
+
 	if (set->queue_depth > BLK_MQ_MAX_DEPTH) {
 		pr_info("blk-mq: reduced tag depth to %u\n",
 			BLK_MQ_MAX_DEPTH);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index ef15b3414da5..d1bfce3e69ad 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -30,7 +30,7 @@ void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
-bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *);
+bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 50c6485cb04f..f8c4ba2682ec 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -90,6 +90,8 @@ struct blk_mq_queue_data {
 
 typedef blk_status_t (queue_rq_fn)(struct blk_mq_hw_ctx *,
 		const struct blk_mq_queue_data *);
+typedef bool (get_budget_fn)(struct blk_mq_hw_ctx *);
+typedef void (put_budget_fn)(struct blk_mq_hw_ctx *);
 typedef enum blk_eh_timer_return (timeout_fn)(struct request *, bool);
 typedef int (init_hctx_fn)(struct blk_mq_hw_ctx *, void *, unsigned int);
 typedef void (exit_hctx_fn)(struct blk_mq_hw_ctx *, unsigned int);
@@ -112,6 +114,14 @@ struct blk_mq_ops {
 	queue_rq_fn		*queue_rq;
 
 	/*
+	 * Reserve budget before queue request, once .queue_rq is
+	 * run, it is driver's responsibility to release the
+	 * reserved budget.
+	 */
+	get_budget_fn		*get_budget;
+	put_budget_fn		*put_budget;
+
+	/*
 	 * Called on request timeout
 	 */
 	timeout_fn		*timeout;
-- 
2.9.5

* [PATCH V7 5/6] blk-mq-sched: improve dispatching from sw queue
  2017-10-12 18:36 [PATCH V7 0/6] blk-mq-sched: improve sequential I/O performance Ming Lei
                   ` (3 preceding siblings ...)
  2017-10-12 18:37 ` [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops Ming Lei
@ 2017-10-12 18:37 ` Ming Lei
  2017-10-12 18:48   ` Bart Van Assche
  2017-10-12 18:37 ` [PATCH V7 6/6] SCSI: implement .get_budget and .put_budget for blk-mq Ming Lei
  5 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2017-10-12 18:37 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, John Garry, Ming Lei

SCSI devices use a host-wide tagset, and the shared driver tag space is
often quite big. Meanwhile there is also a per-LUN queue depth
(.cmd_per_lun), which is often small; for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.

So lots of requests may stay in the sw queue, and we always flush all
of those belonging to the same hw queue and dispatch them all to the
driver; unfortunately this easily makes the queue busy because of the
small .cmd_per_lun. Once these requests are flushed out, they have to
stay in hctx->dispatch, no bio can be merged into them any more, and
sequential I/O performance is hurt a lot.

This patch introduces blk_mq_dequeue_from_ctx() for dequeuing requests
from the sw queue so that we can dispatch them in the scheduler's way,
which lets us avoid dequeuing too many requests from the sw queue when
->dispatch isn't flushed completely.

This patch improves dispatching from the sw queue by using the
.get_budget and .put_budget callbacks.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---

Jens/Omar, once .get_budget is introduced, it becomes unusual to
return busy from .queue_rq, so we can't use the approach discussed
earlier to decide between taking requests one by one and flushing them
all. This patch instead simply checks whether .get_budget is
implemented, which looks reasonable, and in the future it is even
possible to take more requests by getting enough budget first.


 block/blk-mq-sched.c   | 63 +++++++++++++++++++++++++++++++++++++++++++++++---
 block/blk-mq.c         | 39 +++++++++++++++++++++++++++++++
 block/blk-mq.h         |  2 ++
 include/linux/blk-mq.h |  2 ++
 4 files changed, 103 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index bcf4773ee1ca..ccbbc7e108ea 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -117,6 +117,50 @@ static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 	return false;
 }
 
+static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
+					  struct blk_mq_ctx *ctx)
+{
+	unsigned idx = ctx->index_hw;
+
+	if (++idx == hctx->nr_ctx)
+		idx = 0;
+
+	return hctx->ctxs[idx];
+}
+
+static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	LIST_HEAD(rq_list);
+	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+
+	do {
+		struct request *rq;
+
+		if (!sbitmap_any_bit_set(&hctx->ctx_map))
+			break;
+
+		if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
+			return true;
+
+		rq = blk_mq_dequeue_from_ctx(hctx, ctx);
+		if (!rq) {
+			if (q->mq_ops->put_budget)
+				q->mq_ops->put_budget(hctx);
+			break;
+		}
+		list_add(&rq->queuelist, &rq_list);
+
+		/* round robin for fair dispatch */
+		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
+
+	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
+
+	WRITE_ONCE(hctx->dispatch_from, ctx);
+
+	return false;
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
@@ -157,11 +201,24 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		if (blk_mq_dispatch_rq_list(q, &rq_list, false) &&
-				has_sched_dispatch)
-			run_queue = blk_mq_do_dispatch_sched(hctx);
+		if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
+			if (has_sched_dispatch)
+				run_queue = blk_mq_do_dispatch_sched(hctx);
+			else
+				run_queue = blk_mq_do_dispatch_ctx(hctx);
+		}
 	} else if (has_sched_dispatch) {
 		run_queue = blk_mq_do_dispatch_sched(hctx);
+	} else if (q->mq_ops->get_budget) {
+		/*
+		 * If we need to get budget before queuing request, we
+		 * dequeue request one by one from sw queue for avoiding
+		 * to mess up I/O merge when dispatch runs out of resource.
+		 *
+		 * TODO: get more budgets, and dequeue more requests in
+		 * one time.
+		 */
+		run_queue = blk_mq_do_dispatch_ctx(hctx);
 	} else {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list, false);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e6e50a91f170..9e2fc26a9018 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -911,6 +911,45 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 }
 EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
 
+struct dispatch_rq_data {
+	struct blk_mq_hw_ctx *hctx;
+	struct request *rq;
+};
+
+static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
+		void *data)
+{
+	struct dispatch_rq_data *dispatch_data = data;
+	struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
+	struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
+
+	spin_lock(&ctx->lock);
+	if (unlikely(!list_empty(&ctx->rq_list))) {
+		dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
+		list_del_init(&dispatch_data->rq->queuelist);
+		if (list_empty(&ctx->rq_list))
+			sbitmap_clear_bit(sb, bitnr);
+	}
+	spin_unlock(&ctx->lock);
+
+	return !dispatch_data->rq;
+}
+
+struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
+					struct blk_mq_ctx *start)
+{
+	unsigned off = start ? start->index_hw : 0;
+	struct dispatch_rq_data data = {
+		.hctx = hctx,
+		.rq   = NULL,
+	};
+
+	__sbitmap_for_each_set(&hctx->ctx_map, off,
+			       dispatch_rq_from_ctx, &data);
+
+	return data.rq;
+}
+
 static inline unsigned int queued_to_index(unsigned int queued)
 {
 	if (!queued)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d1bfce3e69ad..4d12ef08b0a9 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -35,6 +35,8 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 				bool wait);
+struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
+					struct blk_mq_ctx *start);
 
 /*
  * Internal helpers for allocating/freeing the request map
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f8c4ba2682ec..cc757dcc49e1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -30,6 +30,8 @@ struct blk_mq_hw_ctx {
 
 	struct sbitmap		ctx_map;
 
+	struct blk_mq_ctx	*dispatch_from;
+
 	struct blk_mq_ctx	**ctxs;
 	unsigned int		nr_ctx;
 
-- 
2.9.5

* [PATCH V7 6/6] SCSI: implement .get_budget and .put_budget for blk-mq
  2017-10-12 18:36 [PATCH V7 0/6] blk-mq-sched: improve sequential I/O performance Ming Lei
                   ` (4 preceding siblings ...)
  2017-10-12 18:37 ` [PATCH V7 5/6] blk-mq-sched: improve dispatching from sw queue Ming Lei
@ 2017-10-12 18:37 ` Ming Lei
  5 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2017-10-12 18:37 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, John Garry, Ming Lei

We need to tell blk-mq to reserve resources before queuing
one request, so implement these two callbacks. Then blk-mq
can avoid dequeuing requests too early, and I/O merging can
be improved a lot.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 42 ++++++++++++++++++++++++++++++++----------
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9cf6a80fe297..d6273954a3b4 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1946,6 +1946,33 @@ static void scsi_mq_done(struct scsi_cmnd *cmd)
 	blk_mq_complete_request(cmd->request);
 }
 
+static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	struct scsi_device *sdev = q->queuedata;
+
+	atomic_dec(&sdev->device_busy);
+	put_device(&sdev->sdev_gendev);
+}
+
+static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	struct scsi_device *sdev = q->queuedata;
+
+	if (!get_device(&sdev->sdev_gendev))
+		goto out;
+	if (!scsi_dev_queue_ready(q, sdev))
+		goto out_put_device;
+
+	return true;
+
+out_put_device:
+	put_device(&sdev->sdev_gendev);
+out:
+	return false;
+}
+
 static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 			 const struct blk_mq_queue_data *bd)
 {
@@ -1962,13 +1989,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 		goto out;
 
 	ret = BLK_STS_RESOURCE;
-	if (!get_device(&sdev->sdev_gendev))
-		goto out;
-
-	if (!scsi_dev_queue_ready(q, sdev))
-		goto out_put_device;
 	if (!scsi_target_queue_ready(shost, sdev))
-		goto out_dec_device_busy;
+		goto out_put_budget;
 	if (!scsi_host_queue_ready(q, shost, sdev))
 		goto out_dec_target_busy;
 
@@ -2003,10 +2025,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 out_dec_target_busy:
 	if (scsi_target(sdev)->can_queue > 0)
 		atomic_dec(&scsi_target(sdev)->target_busy);
-out_dec_device_busy:
-	atomic_dec(&sdev->device_busy);
-out_put_device:
-	put_device(&sdev->sdev_gendev);
+out_put_budget:
+	scsi_mq_put_budget(hctx);
 out:
 	switch (ret) {
 	case BLK_STS_OK:
@@ -2211,6 +2231,8 @@ struct request_queue *scsi_old_alloc_queue(struct scsi_device *sdev)
 }
 
 static const struct blk_mq_ops scsi_mq_ops = {
+	.get_budget	= scsi_mq_get_budget,
+	.put_budget	= scsi_mq_put_budget,
 	.queue_rq	= scsi_queue_rq,
 	.complete	= scsi_softirq_done,
 	.timeout	= scsi_timeout,
-- 
2.9.5

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-12 18:37 ` [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops Ming Lei
@ 2017-10-12 18:46   ` Jens Axboe
  2017-10-13  0:19     ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2017-10-12 18:46 UTC (permalink / raw)
  To: Ming Lei, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, John Garry

On 10/12/2017 12:37 PM, Ming Lei wrote:
> For SCSI devices, there is often per-request-queue depth, which need
> to be respected before queuing one request.
> 
> The current blk-mq always dequeues one request first, then calls .queue_rq()
> to dispatch the request to lld. One obvious issue of this way is that I/O
> merge may not be good, because when the per-request-queue depth can't be
> respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
> has to staty in hctx->dispatch list, and never got chance to participate
> into I/O merge.
> 
> This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
> then we can try to get reserved budget first before dequeuing request.
> Once we can't get budget for queueing I/O, we don't need to dequeue request
> at all, then I/O merge can get improved a lot.

I can't help but think that it would be cleaner to just be able to
reinsert the request into the scheduler properly, if we fail to
dispatch it. Bart hinted at that earlier as well.

With that, you would not need budget functions.

I don't feel _that_ strongly about it, though, and it is something
we can do down the line as well. I'd just hate having to introduce
budget ops only to kill them a little later.

If we stick with the budget, then add a parallel blk_mq_get_budget()
like you have a blk_mq_put_budget(), so we don't have to litter the code
with things like:

> +		if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
> +			return true;

all over the place. There are also a few places where you don't use
blk_mq_put_budget() but open-code it instead; you should be consistent.
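
For instance, a small wrapper along these lines (just a sketch,
mirroring the blk_mq_put_budget() helper this patch already adds)
would keep the call sites simple:

    static bool blk_mq_get_budget(struct blk_mq_hw_ctx *hctx)
    {
            struct request_queue *q = hctx->queue;

            if (q->mq_ops->get_budget)
                    return q->mq_ops->get_budget(hctx);

            return true;    /* no budget callback, always allowed */
    }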

-- 
Jens Axboe

* Re: [PATCH V7 5/6] blk-mq-sched: improve dispatching from sw queue
  2017-10-12 18:37 ` [PATCH V7 5/6] blk-mq-sched: improve dispatching from sw queue Ming Lei
@ 2017-10-12 18:48   ` Bart Van Assche
  2017-10-13  0:20     ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Bart Van Assche @ 2017-10-12 18:48 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, John Garry

On 10/12/17 11:37, Ming Lei wrote:
> [ ... ]
> 
> This patch improves dispatching from sw queue by using the callback of
> .get_budget and .put_budget
> 
> Reviewed-by: Omar Sandoval <osandov@fb.com>
> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> 
> Jens/Omar, once .get_budget is introduced, it can be unusual to
> return busy from .queue_rq, so we can't use the approach discussed
> to decide to take one by one or flush all. Then this patch
> simply checks if .get_budget is implemented, looks it is reasonable,
> and in future it is even possible to take more requests by getting
> enough budget first.
> [ ... ]

Version 7 of this patch differs significantly from the version I 
reviewed, so I think you should remove my Reviewed-by tag.

Bart.

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-12 18:46   ` Jens Axboe
@ 2017-10-13  0:19     ` Ming Lei
  2017-10-13 14:44       ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2017-10-13  0:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval, John Garry

On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
> On 10/12/2017 12:37 PM, Ming Lei wrote:
> > For SCSI devices, there is often per-request-queue depth, which need
> > to be respected before queuing one request.
> > 
> > The current blk-mq always dequeues one request first, then calls .queue_rq()
> > to dispatch the request to lld. One obvious issue of this way is that I/O
> > merge may not be good, because when the per-request-queue depth can't be
> > respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
> > has to staty in hctx->dispatch list, and never got chance to participate
> > into I/O merge.
> > 
> > This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
> > then we can try to get reserved budget first before dequeuing request.
> > Once we can't get budget for queueing I/O, we don't need to dequeue request
> > at all, then I/O merge can get improved a lot.
> 
> I can't help but think that it would be cleaner to just be able to
> reinsert the request into the scheduler properly, if we fail to
> dispatch it. Bart hinted at that earlier as well.

Actually when I started to investigate the issue, the first thing I
tried was reinserting, but that way was even worse on qla2xxx.

Once a request is dequeued, the chance of an I/O merge decreases a lot.
With the none scheduler, merging becomes impossible because we only try
to merge against the last 8 requests. With mq-deadline, when one
request is reinserted, another request may be dequeued at the same
time.

Not to mention the cost of acquiring/releasing the lock: that is just
useless work and wasted CPU.

> 
> With that, you would not need budget functions.
> 
> I don't feel _that_ strongly about it, though, and it is something
> we can do down the line as well. I'd just hate having to introduce
> budget ops only to kill them a little later.
> 
> If we stick with the budget, then add a parallel blk_mq_get_budget()
> like you have a blk_mq_put_budget(), so we don't have to litter the code
> with things like:
> 
> > +		if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
> > +			return true;
> 
> all over the place. There are also a few places where you don't use
> blk_mq_put_budget() but open-code it instead, you should be consistent.

OK.


-- 
Ming

* Re: [PATCH V7 5/6] blk-mq-sched: improve dispatching from sw queue
  2017-10-12 18:48   ` Bart Van Assche
@ 2017-10-13  0:20     ` Ming Lei
  0 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2017-10-13  0:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval, John Garry

On Thu, Oct 12, 2017 at 11:48:22AM -0700, Bart Van Assche wrote:
> On 10/12/17 11:37, Ming Lei wrote:
> > [ ... ]
> > 
> > This patch improves dispatching from sw queue by using the callback of
> > .get_budget and .put_budget
> > 
> > Reviewed-by: Omar Sandoval <osandov@fb.com>
> > Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > 
> > Jens/Omar, once .get_budget is introduced, it can be unusual to
> > return busy from .queue_rq, so we can't use the approach discussed
> > to decide to take one by one or flush all. Then this patch
> > simply checks if .get_budget is implemented, looks it is reasonable,
> > and in future it is even possible to take more requests by getting
> > enough budget first.
> > [ ... ]
> 
> Version 7 of this patch differs significantly from the version I reviewed so
> I think you should remove my Reviewed-by tag.

OK, no problem; the only difference is in the introduced .get_budget().

-- 
Ming

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13  0:19     ` Ming Lei
@ 2017-10-13 14:44       ` Jens Axboe
  2017-10-13 16:07         ` Ming Lei
  2017-10-13 16:17         ` Ming Lei
  0 siblings, 2 replies; 32+ messages in thread
From: Jens Axboe @ 2017-10-13 14:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval, John Garry

On 10/12/2017 06:19 PM, Ming Lei wrote:
> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
>> On 10/12/2017 12:37 PM, Ming Lei wrote:
>>> For SCSI devices, there is often per-request-queue depth, which need
>>> to be respected before queuing one request.
>>>
>>> The current blk-mq always dequeues one request first, then calls .queue_rq()
>>> to dispatch the request to lld. One obvious issue of this way is that I/O
>>> merge may not be good, because when the per-request-queue depth can't be
>>> respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
>>> has to staty in hctx->dispatch list, and never got chance to participate
>>> into I/O merge.
>>>
>>> This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
>>> then we can try to get reserved budget first before dequeuing request.
>>> Once we can't get budget for queueing I/O, we don't need to dequeue request
>>> at all, then I/O merge can get improved a lot.
>>
>> I can't help but think that it would be cleaner to just be able to
>> reinsert the request into the scheduler properly, if we fail to
>> dispatch it. Bart hinted at that earlier as well.
> 
> Actually when I start to investigate the issue, the 1st thing I tried
> is to reinsert, but that way is even worse on qla2xxx.
> 
> Once request is dequeued, the IO merge chance is decreased a lot.
> With none scheduler, it becomes not possible to merge because
> we only try to merge over the last 8 requests. With mq-deadline,
> when one request is reinserted, another request may be dequeued
> at the same time.

I don't care too much about 'none'. If perfect merging is crucial for
getting to the performance level you want on the hardware you are using,
you should not be using 'none'. 'none' will work perfectly fine for NVMe
etc style devices, where we are not dependent on merging to the same
extent that we are on other devices.

mq-deadline reinsertion will be expensive, that's in the nature of that
beast. It's basically the same as a normal request insertion. So for
that, we'd have to be a bit careful not to run into this too much. Even
with a dumb approach, it should only happen 1 out of N times, where N is
the typical point at which the device will return STS_RESOURCE. The
reinsertion vs dequeue should be serialized with your patch to do that,
at least for the single queue mq-deadline setup. In fact, I think your
approach suffers from that same basic race, in that the budget isn't a
hard allocation, it's just a hint. It can change from the time you check
it, and when you go and dispatch the IO, if you don't serialize that
part. So really should be no different in that regard.

> Not mention the cost of acquiring/releasing lock, that work
> is just doing useless work and wasting CPU.

Sure, my point is that if it doesn't happen too often, it doesn't really
matter. It's not THAT expensive.

-- 
Jens Axboe

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 14:44       ` Jens Axboe
@ 2017-10-13 16:07         ` Ming Lei
  2017-10-13 16:19           ` Jens Axboe
  2017-10-13 16:31           ` Bart Van Assche
  2017-10-13 16:17         ` Ming Lei
  1 sibling, 2 replies; 32+ messages in thread
From: Ming Lei @ 2017-10-13 16:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval, John Garry

On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
> On 10/12/2017 06:19 PM, Ming Lei wrote:
> > On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
> >> On 10/12/2017 12:37 PM, Ming Lei wrote:
> >>> For SCSI devices, there is often per-request-queue depth, which need
> >>> to be respected before queuing one request.
> >>>
> >>> The current blk-mq always dequeues one request first, then calls .queue_rq()
> >>> to dispatch the request to lld. One obvious issue of this way is that I/O
> >>> merge may not be good, because when the per-request-queue depth can't be
> >>> respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
> >>> has to staty in hctx->dispatch list, and never got chance to participate
> >>> into I/O merge.
> >>>
> >>> This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
> >>> then we can try to get reserved budget first before dequeuing request.
> >>> Once we can't get budget for queueing I/O, we don't need to dequeue request
> >>> at all, then I/O merge can get improved a lot.
> >>
> >> I can't help but think that it would be cleaner to just be able to
> >> reinsert the request into the scheduler properly, if we fail to
> >> dispatch it. Bart hinted at that earlier as well.
> > 
> > Actually when I start to investigate the issue, the 1st thing I tried
> > is to reinsert, but that way is even worse on qla2xxx.
> > 
> > Once request is dequeued, the IO merge chance is decreased a lot.
> > With none scheduler, it becomes not possible to merge because
> > we only try to merge over the last 8 requests. With mq-deadline,
> > when one request is reinserted, another request may be dequeued
> > at the same time.
> 
> I don't care too much about 'none'. If perfect merging is crucial for
> getting to the performance level you want on the hardware you are using,
> you should not be using 'none'. 'none' will work perfectly fine for NVMe
> etc style devices, where we are not dependent on merging to the same
> extent that we are on other devices.
> 
> mq-deadline reinsertion will be expensive, that's in the nature of that
> beast. It's basically the same as a normal request inserition.  So for
> that, we'd have to be a bit careful not to run into this too much. Even
> with a dumb approach, it should only happen 1 out of N times, where N is
> the typical point at which the device will return STS_RESOURCE. The
> reinsertion vs dequeue should be serialized with your patch to do that,
> at least for the single queue mq-deadline setup. In fact, I think your
> approach suffers from that same basic race, in that the budget isn't a
> hard allocation, it's just a hint. It can change from the time you check
> it, and when you go and dispatch the IO, if you don't serialize that
> part. So really should be no different in that regard.

In the case of SCSI, .get_budget is done as atomic counting,
and it is completely effective at avoiding unnecessary dequeues; please
take a look at patch 6.
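
Roughly, the reservation boils down to one atomic counter per LUN
(simplified sketch; the real path in patch 6 goes through
scsi_dev_queue_ready() and handles more cases, and the function name
below is just illustrative):

    static bool sdev_get_budget(struct scsi_device *sdev)
    {
            if (atomic_inc_return(&sdev->device_busy) > sdev->queue_depth) {
                    /* over the per-LUN depth: give the slot back, don't dequeue */
                    atomic_dec(&sdev->device_busy);
                    return false;
            }
            return true;
    }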

> 
> > Not mention the cost of acquiring/releasing lock, that work
> > is just doing useless work and wasting CPU.
> 
> Sure, my point is that if it doesn't happen too often, it doesn't really
> matter. It's not THAT expensive.

Actually it is in the hot path; for example, lpfc's and qla2xxx's queue
depth is 3, so it is quite easy to trigger STS_RESOURCE.


Thanks,
Ming

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 14:44       ` Jens Axboe
  2017-10-13 16:07         ` Ming Lei
@ 2017-10-13 16:17         ` Ming Lei
  2017-10-13 16:20           ` Jens Axboe
  1 sibling, 1 reply; 32+ messages in thread
From: Ming Lei @ 2017-10-13 16:17 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval, John Garry

On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
> On 10/12/2017 06:19 PM, Ming Lei wrote:
> > On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
> >> On 10/12/2017 12:37 PM, Ming Lei wrote:
> >>> For SCSI devices, there is often per-request-queue depth, which need
> >>> to be respected before queuing one request.
> >>>
> >>> The current blk-mq always dequeues one request first, then calls .queue_rq()
> >>> to dispatch the request to lld. One obvious issue of this way is that I/O
> >>> merge may not be good, because when the per-request-queue depth can't be
> >>> respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
> >>> has to staty in hctx->dispatch list, and never got chance to participate
> >>> into I/O merge.
> >>>
> >>> This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
> >>> then we can try to get reserved budget first before dequeuing request.
> >>> Once we can't get budget for queueing I/O, we don't need to dequeue request
> >>> at all, then I/O merge can get improved a lot.
> >>
> >> I can't help but think that it would be cleaner to just be able to
> >> reinsert the request into the scheduler properly, if we fail to
> >> dispatch it. Bart hinted at that earlier as well.
> > 
> > Actually when I start to investigate the issue, the 1st thing I tried
> > is to reinsert, but that way is even worse on qla2xxx.
> > 
> > Once request is dequeued, the IO merge chance is decreased a lot.
> > With none scheduler, it becomes not possible to merge because
> > we only try to merge over the last 8 requests. With mq-deadline,
> > when one request is reinserted, another request may be dequeued
> > at the same time.
> 
> I don't care too much about 'none'. If perfect merging is crucial for
> getting to the performance level you want on the hardware you are using,
> you should not be using 'none'. 'none' will work perfectly fine for NVMe
> etc style devices, where we are not dependent on merging to the same
> extent that we are on other devices.

We still have some SCSI devices, such as qla2xxx, which are 1:1
multi-queue devices like NVMe. In my tests, the big lock of mq-deadline
has been an issue for this kind of device, and none is actually better
than mq-deadline, even though its merging isn't as good.

Thanks,
Ming

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 16:07         ` Ming Lei
@ 2017-10-13 16:19           ` Jens Axboe
  2017-10-13 16:21             ` Ming Lei
  2017-10-13 16:31           ` Bart Van Assche
  1 sibling, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2017-10-13 16:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval, John Garry

On 10/13/2017 10:07 AM, Ming Lei wrote:
> On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
>> On 10/12/2017 06:19 PM, Ming Lei wrote:
>>> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
>>>> On 10/12/2017 12:37 PM, Ming Lei wrote:
>>>>> For SCSI devices, there is often per-request-queue depth, which need
>>>>> to be respected before queuing one request.
>>>>>
>>>>> The current blk-mq always dequeues one request first, then calls .queue_rq()
>>>>> to dispatch the request to lld. One obvious issue of this way is that I/O
>>>>> merge may not be good, because when the per-request-queue depth can't be
>>>>> respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
>>>>> has to staty in hctx->dispatch list, and never got chance to participate
>>>>> into I/O merge.
>>>>>
>>>>> This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
>>>>> then we can try to get reserved budget first before dequeuing request.
>>>>> Once we can't get budget for queueing I/O, we don't need to dequeue request
>>>>> at all, then I/O merge can get improved a lot.
>>>>
>>>> I can't help but think that it would be cleaner to just be able to
>>>> reinsert the request into the scheduler properly, if we fail to
>>>> dispatch it. Bart hinted at that earlier as well.
>>>
>>> Actually when I start to investigate the issue, the 1st thing I tried
>>> is to reinsert, but that way is even worse on qla2xxx.
>>>
>>> Once request is dequeued, the IO merge chance is decreased a lot.
>>> With none scheduler, it becomes not possible to merge because
>>> we only try to merge over the last 8 requests. With mq-deadline,
>>> when one request is reinserted, another request may be dequeued
>>> at the same time.
>>
>> I don't care too much about 'none'. If perfect merging is crucial for
>> getting to the performance level you want on the hardware you are using,
>> you should not be using 'none'. 'none' will work perfectly fine for NVMe
>> etc style devices, where we are not dependent on merging to the same
>> extent that we are on other devices.
>>
>> mq-deadline reinsertion will be expensive, that's in the nature of that
>> beast. It's basically the same as a normal request inserition.  So for
>> that, we'd have to be a bit careful not to run into this too much. Even
>> with a dumb approach, it should only happen 1 out of N times, where N is
>> the typical point at which the device will return STS_RESOURCE. The
>> reinsertion vs dequeue should be serialized with your patch to do that,
>> at least for the single queue mq-deadline setup. In fact, I think your
>> approach suffers from that same basic race, in that the budget isn't a
>> hard allocation, it's just a hint. It can change from the time you check
>> it, and when you go and dispatch the IO, if you don't serialize that
>> part. So really should be no different in that regard.
> 
> In case of SCSI, the .get_buget is done as atomic counting,
> and it is completely effective to avoid unnecessary dequeue, please take
> a look at patch 6.

Looks like you are right; I had initially misread that as just checking
the busy count. But you are actually getting the count at that point,
so it should be solid.

>>> Not mention the cost of acquiring/releasing lock, that work
>>> is just doing useless work and wasting CPU.
>>
>> Sure, my point is that if it doesn't happen too often, it doesn't really
>> matter. It's not THAT expensive.
> 
> Actually it is in hot path, for example, lpfc and qla2xx's queue depth is 3,
> it is quite easy to trigger STS_RESOURCE.

Ugh, that is low.

OK, I think we should just roll with this and see how far we can go. I'll
apply it for 4.15.

-- 
Jens Axboe

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 16:17         ` Ming Lei
@ 2017-10-13 16:20           ` Jens Axboe
  2017-10-13 16:22             ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2017-10-13 16:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval, John Garry

On 10/13/2017 10:17 AM, Ming Lei wrote:
> On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
>> On 10/12/2017 06:19 PM, Ming Lei wrote:
>>> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
>>>> On 10/12/2017 12:37 PM, Ming Lei wrote:
>>>>> For SCSI devices, there is often a per-request-queue depth, which needs
>>>>> to be respected before queuing one request.
>>>>>
>>>>> The current blk-mq always dequeues one request first, then calls .queue_rq()
>>>>> to dispatch the request to the lld. One obvious issue with this approach is
>>>>> that I/O merge may not be good, because when the per-request-queue depth
>>>>> can't be respected, .queue_rq() has to return BLK_STS_RESOURCE, then this
>>>>> request has to stay in the hctx->dispatch list and never gets a chance to
>>>>> participate in I/O merge.
>>>>>
>>>>> This patch introduces .get_budget and .put_budget callbacks in blk_mq_ops,
>>>>> so we can try to get the reserved budget first before dequeuing a request.
>>>>> Once we can't get a budget for queueing I/O, we don't need to dequeue a
>>>>> request at all, and I/O merge can be improved a lot.
>>>>
>>>> I can't help but think that it would be cleaner to just be able to
>>>> reinsert the request into the scheduler properly, if we fail to
>>>> dispatch it. Bart hinted at that earlier as well.
>>>
>>> Actually, when I started to investigate the issue, the first thing I tried
>>> was to reinsert, but that way is even worse on qla2xxx.
>>>
>>> Once a request is dequeued, the chance of an I/O merge decreases a lot.
>>> With the none scheduler, it becomes impossible to merge because we only
>>> try to merge against the last 8 requests. With mq-deadline, when one
>>> request is reinserted, another request may be dequeued at the same time.
>>
>> I don't care too much about 'none'. If perfect merging is crucial for
>> getting to the performance level you want on the hardware you are using,
>> you should not be using 'none'. 'none' will work perfectly fine for NVMe
>> etc style devices, where we are not dependent on merging to the same
>> extent that we are on other devices.
> 
> We still have some SCSI devices, such as qla2xxx, which are 1:1 multi-queue
> devices like NVMe. In my tests, the big lock of mq-deadline has been an
> issue for this kind of device, and none is actually better than
> mq-deadline, even though its merging isn't good.

Kyber should be able to fill that hole, hopefully.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 16:19           ` Jens Axboe
@ 2017-10-13 16:21             ` Ming Lei
  2017-10-13 16:28               ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2017-10-13 16:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval, John Garry

On Fri, Oct 13, 2017 at 10:19:04AM -0600, Jens Axboe wrote:
> On 10/13/2017 10:07 AM, Ming Lei wrote:
> > On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
> >> On 10/12/2017 06:19 PM, Ming Lei wrote:
> >>> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
> >>>> On 10/12/2017 12:37 PM, Ming Lei wrote:
> >>>>> For SCSI devices, there is often a per-request-queue depth, which needs
> >>>>> to be respected before queuing one request.
> >>>>>
> >>>>> The current blk-mq always dequeues one request first, then calls .queue_rq()
> >>>>> to dispatch the request to the lld. One obvious issue with this approach is
> >>>>> that I/O merge may not be good, because when the per-request-queue depth
> >>>>> can't be respected, .queue_rq() has to return BLK_STS_RESOURCE, then this
> >>>>> request has to stay in the hctx->dispatch list and never gets a chance to
> >>>>> participate in I/O merge.
> >>>>>
> >>>>> This patch introduces .get_budget and .put_budget callbacks in blk_mq_ops,
> >>>>> so we can try to get the reserved budget first before dequeuing a request.
> >>>>> Once we can't get a budget for queueing I/O, we don't need to dequeue a
> >>>>> request at all, and I/O merge can be improved a lot.
> >>>>
> >>>> I can't help but think that it would be cleaner to just be able to
> >>>> reinsert the request into the scheduler properly, if we fail to
> >>>> dispatch it. Bart hinted at that earlier as well.
> >>>
> >>> Actually, when I started to investigate the issue, the first thing I tried
> >>> was to reinsert, but that way is even worse on qla2xxx.
> >>>
> >>> Once a request is dequeued, the chance of an I/O merge decreases a lot.
> >>> With the none scheduler, it becomes impossible to merge because we only
> >>> try to merge against the last 8 requests. With mq-deadline, when one
> >>> request is reinserted, another request may be dequeued at the same time.
> >>
> >> I don't care too much about 'none'. If perfect merging is crucial for
> >> getting to the performance level you want on the hardware you are using,
> >> you should not be using 'none'. 'none' will work perfectly fine for NVMe
> >> etc style devices, where we are not dependent on merging to the same
> >> extent that we are on other devices.
> >>
> >> mq-deadline reinsertion will be expensive, that's in the nature of that
> >> beast. It's basically the same as a normal request insertion. So for
> >> that, we'd have to be a bit careful not to run into this too much. Even
> >> with a dumb approach, it should only happen 1 out of N times, where N is
> >> the typical point at which the device will return STS_RESOURCE. The
> >> reinsertion vs dequeue should be serialized with your patch to do that,
> >> at least for the single queue mq-deadline setup. In fact, I think your
> >> approach suffers from that same basic race, in that the budget isn't a
> >> hard allocation, it's just a hint. It can change from the time you check
> >> it, and when you go and dispatch the IO, if you don't serialize that
> >> part. So really should be no different in that regard.
> > 
> > In the case of SCSI, .get_budget is done as atomic counting, and it is
> > completely effective at avoiding unnecessary dequeues; please take a
> > look at patch 6.
> 
> Looks like you are right, I had initially misread that as just checking
> the busy count. But you are actually getting the count at that point,
> so it should be solid.
> 
> >>> Not to mention the cost of acquiring/releasing the lock; that is just
> >>> doing useless work and wasting CPU.
> >>
> >> Sure, my point is that if it doesn't happen too often, it doesn't really
> >> matter. It's not THAT expensive.
> > 
> > Actually it is in the hot path; for example, lpfc's and qla2xxx's queue depth is 3,
> > so it is quite easy to trigger STS_RESOURCE.
> 
> Ugh, that is low.
> 
> OK, I think we should just roll with this and see how far we can go. I'll
> apply it for 4.15.

OK, I have some update, will post a new version soon.

Thanks
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 16:20           ` Jens Axboe
@ 2017-10-13 16:22             ` Ming Lei
  2017-10-13 16:28               ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2017-10-13 16:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval, John Garry

On Fri, Oct 13, 2017 at 10:20:01AM -0600, Jens Axboe wrote:
> On 10/13/2017 10:17 AM, Ming Lei wrote:
> > On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
> >> On 10/12/2017 06:19 PM, Ming Lei wrote:
> >>> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
> >>>> On 10/12/2017 12:37 PM, Ming Lei wrote:
> >>>>> For SCSI devices, there is often a per-request-queue depth, which needs
> >>>>> to be respected before queuing one request.
> >>>>>
> >>>>> The current blk-mq always dequeues one request first, then calls .queue_rq()
> >>>>> to dispatch the request to the lld. One obvious issue with this approach is
> >>>>> that I/O merge may not be good, because when the per-request-queue depth
> >>>>> can't be respected, .queue_rq() has to return BLK_STS_RESOURCE, then this
> >>>>> request has to stay in the hctx->dispatch list and never gets a chance to
> >>>>> participate in I/O merge.
> >>>>>
> >>>>> This patch introduces .get_budget and .put_budget callbacks in blk_mq_ops,
> >>>>> so we can try to get the reserved budget first before dequeuing a request.
> >>>>> Once we can't get a budget for queueing I/O, we don't need to dequeue a
> >>>>> request at all, and I/O merge can be improved a lot.
> >>>>
> >>>> I can't help but think that it would be cleaner to just be able to
> >>>> reinsert the request into the scheduler properly, if we fail to
> >>>> dispatch it. Bart hinted at that earlier as well.
> >>>
> >>> Actually, when I started to investigate the issue, the first thing I tried
> >>> was to reinsert, but that way is even worse on qla2xxx.
> >>>
> >>> Once a request is dequeued, the chance of an I/O merge decreases a lot.
> >>> With the none scheduler, it becomes impossible to merge because we only
> >>> try to merge against the last 8 requests. With mq-deadline, when one
> >>> request is reinserted, another request may be dequeued at the same time.
> >>
> >> I don't care too much about 'none'. If perfect merging is crucial for
> >> getting to the performance level you want on the hardware you are using,
> >> you should not be using 'none'. 'none' will work perfectly fine for NVMe
> >> etc style devices, where we are not dependent on merging to the same
> >> extent that we are on other devices.
> > 
> > We still have some SCSI devices, such as qla2xxx, which are 1:1 multi-queue
> > devices like NVMe. In my tests, the big lock of mq-deadline has been an
> > issue for this kind of device, and none is actually better than
> > mq-deadline, even though its merging isn't good.
> 
> Kyber should be able to fill that hole, hopefully.

Yeah, kyber still uses the same I/O merge as none, :-)

-- 
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 16:21             ` Ming Lei
@ 2017-10-13 16:28               ` Jens Axboe
  0 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-10-13 16:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval, John Garry

On 10/13/2017 10:21 AM, Ming Lei wrote:
> On Fri, Oct 13, 2017 at 10:19:04AM -0600, Jens Axboe wrote:
>> On 10/13/2017 10:07 AM, Ming Lei wrote:
>>> On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
>>>> On 10/12/2017 06:19 PM, Ming Lei wrote:
>>>>> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
>>>>>> On 10/12/2017 12:37 PM, Ming Lei wrote:
>>>>>>> For SCSI devices, there is often a per-request-queue depth, which needs
>>>>>>> to be respected before queuing one request.
>>>>>>>
>>>>>>> The current blk-mq always dequeues one request first, then calls .queue_rq()
>>>>>>> to dispatch the request to the lld. One obvious issue with this approach is
>>>>>>> that I/O merge may not be good, because when the per-request-queue depth
>>>>>>> can't be respected, .queue_rq() has to return BLK_STS_RESOURCE, then this
>>>>>>> request has to stay in the hctx->dispatch list and never gets a chance to
>>>>>>> participate in I/O merge.
>>>>>>>
>>>>>>> This patch introduces .get_budget and .put_budget callbacks in blk_mq_ops,
>>>>>>> so we can try to get the reserved budget first before dequeuing a request.
>>>>>>> Once we can't get a budget for queueing I/O, we don't need to dequeue a
>>>>>>> request at all, and I/O merge can be improved a lot.
>>>>>>
>>>>>> I can't help but think that it would be cleaner to just be able to
>>>>>> reinsert the request into the scheduler properly, if we fail to
>>>>>> dispatch it. Bart hinted at that earlier as well.
>>>>>
>>>>> Actually, when I started to investigate the issue, the first thing I tried
>>>>> was to reinsert, but that way is even worse on qla2xxx.
>>>>>
>>>>> Once a request is dequeued, the chance of an I/O merge decreases a lot.
>>>>> With the none scheduler, it becomes impossible to merge because we only
>>>>> try to merge against the last 8 requests. With mq-deadline, when one
>>>>> request is reinserted, another request may be dequeued at the same time.
>>>>
>>>> I don't care too much about 'none'. If perfect merging is crucial for
>>>> getting to the performance level you want on the hardware you are using,
>>>> you should not be using 'none'. 'none' will work perfectly fine for NVMe
>>>> etc style devices, where we are not dependent on merging to the same
>>>> extent that we are on other devices.
>>>>
>>>> mq-deadline reinsertion will be expensive, that's in the nature of that
>>>> beast. It's basically the same as a normal request insertion. So for
>>>> that, we'd have to be a bit careful not to run into this too much. Even
>>>> with a dumb approach, it should only happen 1 out of N times, where N is
>>>> the typical point at which the device will return STS_RESOURCE. The
>>>> reinsertion vs dequeue should be serialized with your patch to do that,
>>>> at least for the single queue mq-deadline setup. In fact, I think your
>>>> approach suffers from that same basic race, in that the budget isn't a
>>>> hard allocation, it's just a hint. It can change from the time you check
>>>> it, and when you go and dispatch the IO, if you don't serialize that
>>>> part. So really should be no different in that regard.
>>>
>>> In the case of SCSI, .get_budget is done as atomic counting, and it is
>>> completely effective at avoiding unnecessary dequeues; please take a
>>> look at patch 6.
>>
>> Looks like you are right, I had initially misread that as just checking
>> the busy count. But you are actually getting the count at that point,
>> so it should be solid.
>>
>>>>> Not to mention the cost of acquiring/releasing the lock; that is just
>>>>> doing useless work and wasting CPU.
>>>>
>>>> Sure, my point is that if it doesn't happen too often, it doesn't really
>>>> matter. It's not THAT expensive.
>>>
>>> Actually it is in the hot path; for example, lpfc's and qla2xxx's queue depth is 3,
>>> so it is quite easy to trigger STS_RESOURCE.
>>
>> Ugh, that is low.
>>
>> OK, I think we should just roll with this and see how far we can go. I'll
>> apply it for 4.15.
> 
> OK, I have some update, will post a new version soon.

Fold something like this into it then:


diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ccbbc7e108ea..b7bf84b5ddf2 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -102,13 +102,12 @@ static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 				!e->type->ops.mq.has_work(hctx))
 			break;
 
-		if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
+		if (!blk_mq_get_dispatch_budget(hctx))
 			return true;
 
 		rq = e->type->ops.mq.dispatch_request(hctx);
 		if (!rq) {
-			if (q->mq_ops->put_budget)
-				q->mq_ops->put_budget(hctx);
+			blk_mq_put_dispatch_budget(hctx, true);
 			break;
 		}
 		list_add(&rq->queuelist, &rq_list);
@@ -140,13 +139,12 @@ static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 		if (!sbitmap_any_bit_set(&hctx->ctx_map))
 			break;
 
-		if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
+		if (!blk_mq_get_dispatch_budget(hctx))
 			return true;
 
 		rq = blk_mq_dequeue_from_ctx(hctx, ctx);
 		if (!rq) {
-			if (q->mq_ops->put_budget)
-				q->mq_ops->put_budget(hctx);
+			blk_mq_put_dispatch_budget(hctx, true);
 			break;
 		}
 		list_add(&rq->queuelist, &rq_list);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 255a705f8672..008c975b6f4b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1087,14 +1087,6 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
-static void blk_mq_put_budget(struct blk_mq_hw_ctx *hctx, bool got_budget)
-{
-	struct request_queue *q = hctx->queue;
-
-	if (q->mq_ops->put_budget && got_budget)
-		q->mq_ops->put_budget(hctx);
-}
-
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		bool got_budget)
 {
@@ -1125,7 +1117,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			 * rerun the hardware queue when a tag is freed.
 			 */
 			if (!blk_mq_dispatch_wait_add(hctx)) {
-				blk_mq_put_budget(hctx, got_budget);
+				blk_mq_put_dispatch_budget(hctx, got_budget);
 				break;
 			}
 
@@ -1135,16 +1127,13 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			 * hardware queue to the wait queue.
 			 */
 			if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
-				blk_mq_put_budget(hctx, got_budget);
+				blk_mq_put_dispatch_budget(hctx, got_budget);
 				break;
 			}
 		}
 
-		if (!got_budget) {
-			if (q->mq_ops->get_budget &&
-					!q->mq_ops->get_budget(hctx))
-				break;
-		}
+		if (!got_budget && !blk_mq_get_dispatch_budget(hctx))
+			break;
 
 		list_del_init(&rq->queuelist);
 
@@ -1642,7 +1631,7 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (!blk_mq_get_driver_tag(rq, NULL, false))
 		goto insert;
 
-	if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx)) {
+	if (!blk_mq_get_dispatch_budget(hctx)) {
 		blk_mq_put_driver_tag(rq);
 		goto insert;
 	}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 4d12ef08b0a9..9a1426e8b6e5 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -139,4 +139,23 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
 			unsigned int inflight[2]);
 
+static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+
+	if (!q->mq_ops->get_budget)
+		return true;
+
+	return q->mq_ops->get_budget(hctx);
+}
+
+static inline void blk_mq_put_dispatch_budget(struct blk_mq_hw_ctx *hctx,
+					      bool got_budget)
+{
+	struct request_queue *q = hctx->queue;
+
+	if (got_budget && q->mq_ops->put_budget)
+		q->mq_ops->put_budget(hctx);
+}
+
 #endif

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 32+ messages in thread
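
With helpers like blk_mq_get_dispatch_budget()/blk_mq_put_dispatch_budget()
above, a driver only has to fill in the two new blk_mq_ops members; drivers
that leave .get_budget unset keep the old behaviour, because the helper then
always reports that budget is available. A hypothetical registration, reusing
the example_get_budget()/example_put_budget() sketch from earlier in the
thread, might look like:

static const struct blk_mq_ops example_mq_ops = {
	.queue_rq	= example_queue_rq,	/* hypothetical LLD dispatch */
	.get_budget	= example_get_budget,	/* reserve before dequeue */
	.put_budget	= example_put_budget,	/* undo a failed reservation */
};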

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 16:22             ` Ming Lei
@ 2017-10-13 16:28               ` Jens Axboe
  0 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-10-13 16:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen,
	linux-kernel, linux-scsi, Omar Sandoval, John Garry

On 10/13/2017 10:22 AM, Ming Lei wrote:
> On Fri, Oct 13, 2017 at 10:20:01AM -0600, Jens Axboe wrote:
>> On 10/13/2017 10:17 AM, Ming Lei wrote:
>>> On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
>>>> On 10/12/2017 06:19 PM, Ming Lei wrote:
>>>>> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
>>>>>> On 10/12/2017 12:37 PM, Ming Lei wrote:
>>>>>>> For SCSI devices, there is often a per-request-queue depth, which needs
>>>>>>> to be respected before queuing one request.
>>>>>>>
>>>>>>> The current blk-mq always dequeues one request first, then calls .queue_rq()
>>>>>>> to dispatch the request to the lld. One obvious issue with this approach is
>>>>>>> that I/O merge may not be good, because when the per-request-queue depth
>>>>>>> can't be respected, .queue_rq() has to return BLK_STS_RESOURCE, then this
>>>>>>> request has to stay in the hctx->dispatch list and never gets a chance to
>>>>>>> participate in I/O merge.
>>>>>>>
>>>>>>> This patch introduces .get_budget and .put_budget callbacks in blk_mq_ops,
>>>>>>> so we can try to get the reserved budget first before dequeuing a request.
>>>>>>> Once we can't get a budget for queueing I/O, we don't need to dequeue a
>>>>>>> request at all, and I/O merge can be improved a lot.
>>>>>>
>>>>>> I can't help but think that it would be cleaner to just be able to
>>>>>> reinsert the request into the scheduler properly, if we fail to
>>>>>> dispatch it. Bart hinted at that earlier as well.
>>>>>
>>>>> Actually, when I started to investigate the issue, the first thing I tried
>>>>> was to reinsert, but that way is even worse on qla2xxx.
>>>>>
>>>>> Once a request is dequeued, the chance of an I/O merge decreases a lot.
>>>>> With the none scheduler, it becomes impossible to merge because we only
>>>>> try to merge against the last 8 requests. With mq-deadline, when one
>>>>> request is reinserted, another request may be dequeued at the same time.
>>>>
>>>> I don't care too much about 'none'. If perfect merging is crucial for
>>>> getting to the performance level you want on the hardware you are using,
>>>> you should not be using 'none'. 'none' will work perfectly fine for NVMe
>>>> etc style devices, where we are not dependent on merging to the same
>>>> extent that we are on other devices.
>>>
>>> We still have some SCSI devices, such as qla2xxx, which are 1:1 multi-queue
>>> devices like NVMe. In my tests, the big lock of mq-deadline has been an
>>> issue for this kind of device, and none is actually better than
>>> mq-deadline, even though its merging isn't good.
>>
>> Kyber should be able to fill that hole, hopefully.
> 
> Yeah, kyber still uses the same I/O merge as none, :-)

Doesn't mean it can't be changed... 'none' has to remain with very low
overhead, any extra smarts or logic should be a scheduler thing.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 16:07         ` Ming Lei
  2017-10-13 16:19           ` Jens Axboe
@ 2017-10-13 16:31           ` Bart Van Assche
  2017-10-13 16:33             ` Jens Axboe
  2017-10-13 16:45             ` Ming Lei
  1 sibling, 2 replies; 32+ messages in thread
From: Bart Van Assche @ 2017-10-13 16:31 UTC (permalink / raw)
  To: ming.lei, axboe
  Cc: linux-kernel, linux-block, hch, tom81094, paolo.valente,
	Bart Van Assche, linux-scsi, oleksandr, john.garry, osandov,
	loberman

On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
> Actually it is in the hot path; for example, lpfc's and qla2xxx's queue depth is 3,

Sorry, but I doubt whether that is correct. More generally, I don't know of any modern
storage HBA for which the default queue depth is so low.

Bart.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 16:31           ` Bart Van Assche
@ 2017-10-13 16:33             ` Jens Axboe
  2017-10-13 16:45             ` Ming Lei
  1 sibling, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-10-13 16:33 UTC (permalink / raw)
  To: Bart Van Assche, ming.lei
  Cc: linux-kernel, linux-block, hch, tom81094, paolo.valente,
	linux-scsi, oleksandr, john.garry, osandov, loberman

On 10/13/2017 10:31 AM, Bart Van Assche wrote:
> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
>> Actually it is in the hot path; for example, lpfc's and qla2xxx's queue depth is 3,
> 
> Sorry, but I doubt whether that is correct. More generally, I don't know of any modern
> storage HBA for which the default queue depth is so low.

That does seem insanely low. BTW, even a low depth isn't an issue, as long as
we know what it roughly is.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 16:31           ` Bart Van Assche
  2017-10-13 16:33             ` Jens Axboe
@ 2017-10-13 16:45             ` Ming Lei
  2017-10-13 17:08               ` Bart Van Assche
  1 sibling, 1 reply; 32+ messages in thread
From: Ming Lei @ 2017-10-13 16:45 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, linux-kernel, linux-block, hch, tom81094, paolo.valente,
	linux-scsi, oleksandr, john.garry, osandov, loberman

On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
> > Actually it is in the hot path; for example, lpfc's and qla2xxx's queue depth is 3,
> 
> Sorry, but I doubt whether that is correct. More generally, I don't know of any modern
> storage HBA for which the default queue depth is so low.

You can grep:

[ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"

Even SRP/IB isn't big either, just 32.

-- 
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 16:45             ` Ming Lei
@ 2017-10-13 17:08               ` Bart Van Assche
  2017-10-13 17:29                 ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Bart Van Assche @ 2017-10-13 17:08 UTC (permalink / raw)
  To: ming.lei
  Cc: linux-kernel, linux-block, hch, tom81094, himanshu.madhani,
	paolo.valente, axboe, linux-scsi, oleksandr, john.garry, osandov,
	loberman, james.smart

On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote:
> On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
> > On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
> > > Actually it is in the hot path; for example, lpfc's and qla2xxx's queue depth is 3,
> > 
> > Sorry, but I doubt whether that is correct. More generally, I don't know of any modern
> > storage HBA for which the default queue depth is so low.
> 
> You can grep:
> 
> [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"

Such a low queue depth will result in suboptimal performance for adapters
that communicate over a storage network. I think that's a bug and that both
adapters support much higher cmd_per_lun values.

(+James Smart)

James, can you explain us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN
from 30 to 3? Was that perhaps a workaround for a bug in a specific target
implementation?

(+Himanshu Madhani)

Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for
the qla2xxx initiator driver to the scsi_host->can_queue value?

> Even SRP/IB isn't big either, just 32.

The default value for ib_srp for cmd_per_lun is 62 but that value can be
overridden by selecting another value in /etc/default/srp_daemon.conf. Note:
a lower value is selected if after SRP login it becomes clear that the target
queue depth is lower than the cmd_per_lun value requested by the user. This
is a performance optimization and avoids the SRP target system having to
send back BUSY responses to the initiator.

Bart.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 17:08               ` Bart Van Assche
@ 2017-10-13 17:29                 ` Ming Lei
  2017-10-13 17:47                   ` Bart Van Assche
  2017-10-16 11:30                   ` Hannes Reinecke
  0 siblings, 2 replies; 32+ messages in thread
From: Ming Lei @ 2017-10-13 17:29 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-kernel, linux-block, hch, tom81094, himanshu.madhani,
	paolo.valente, axboe, linux-scsi, oleksandr, john.garry, osandov,
	loberman, james.smart

On Fri, Oct 13, 2017 at 05:08:52PM +0000, Bart Van Assche wrote:
> On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote:
> > On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
> > > On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
> > > > Actually it is in the hot path; for example, lpfc's and qla2xxx's queue depth is 3,
> > > 
> > > Sorry, but I doubt whether that is correct. More generally, I don't know of any modern
> > > storage HBA for which the default queue depth is so low.
> > 
> > You can grep:
> > 
> > [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"
> 
> Such a low queue depth will result in suboptimal performance for adapters
> that communicate over a storage network. I think that's a bug and that both
> adapters support much higher cmd_per_lun values.
> 
> (+James Smart)
> 
> James, can you explain us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN
> from 30 to 3? Was that perhaps a workaround for a bug in a specific target
> implementation?
> 
> (+Himanshu Madhani)
> 
> Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for
> the qla2xxx initiator driver to the scsi_host->can_queue value?

->can_queue is the size of the whole tag space shared by all LUNs, so it doesn't look
reasonable to increase cmd_per_lun to .can_queue.

> 
> > Even SRP/IB isn't big either, just 32.
> 
> The default value for ib_srp for cmd_per_lun is 62 but that value can be
> overridden by selecting another value in /etc/default/srp_daemon.conf. Note:
> a lower value is selected if after SRP login it becomes clear that the target
> queue depth is lower than the cmd_per_lun value requested by the user. This
> is a performance optimization and avoids the SRP target system having to
> send back BUSY responses to the initiator.

OK, thanks for sharing. I just saw it as 32 on Laurence's machine.

-- 
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 17:29                 ` Ming Lei
@ 2017-10-13 17:47                   ` Bart Van Assche
  2017-10-16 11:30                   ` Hannes Reinecke
  1 sibling, 0 replies; 32+ messages in thread
From: Bart Van Assche @ 2017-10-13 17:47 UTC (permalink / raw)
  To: ming.lei
  Cc: linux-kernel, linux-block, hch, himanshu.madhani, tom81094,
	paolo.valente, axboe, linux-scsi, oleksandr, john.garry, osandov,
	loberman, james.smart

On Sat, 2017-10-14 at 01:29 +0800, Ming Lei wrote:
> ->can_queue is the size of the whole tag space shared by all LUNs, so it doesn't look
> reasonable to increase cmd_per_lun to .can_queue.

Sorry but I disagree. Setting cmd_per_lun to a value lower than can_queue
will result in suboptimal performance if there is only a single LUN per SCSI
host. If there are multiple LUNs per SCSI host then the blk-mq core tracks
the number of active LUNs through the blk_mq_tags.active_queues variable.
See also hctx_may_queue(). The comment above that function is as follows:

/*
 * For shared tag users, we track the number of currently active users
 * and attempt to provide a fair share of the tag depth for each of them.
 */

BTW, the ib_srp initiator driver sets cmd_per_lun to can_queue and is able
to achieve more than one million IOPS even in tests with multiple LUNs per
SCSI host.

Bart.

^ permalink raw reply	[flat|nested] 32+ messages in thread
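
Roughly speaking, the fair-share logic behind hctx_may_queue() divides the
shared tag depth by the number of request queues that have recently been
active on the tag set, so a single LUN cannot monopolize the host's tags.
Simplified to a standalone sketch (not the exact kernel code):

/*
 * Sketch of the fair-share check for shared tag maps (simplified).
 * With can_queue = 2048 and 100 active LUNs, each LUN would be
 * limited to about 21 in-flight tags.
 */
static bool example_may_queue(unsigned int tag_depth,		/* shost->can_queue */
			      unsigned int active_queues,	/* LUNs with I/O */
			      unsigned int my_in_flight)
{
	unsigned int share;

	if (active_queues <= 1)
		return true;	/* no contention, use the whole tag space */

	/* divide the tag space among the active users, rounding up */
	share = (tag_depth + active_queues - 1) / active_queues;

	return my_in_flight < share;
}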

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-13 17:29                 ` Ming Lei
  2017-10-13 17:47                   ` Bart Van Assche
@ 2017-10-16 11:30                   ` Hannes Reinecke
  2017-10-16 16:06                     ` Bart Van Assche
  2017-10-17  1:29                     ` Ming Lei
  1 sibling, 2 replies; 32+ messages in thread
From: Hannes Reinecke @ 2017-10-16 11:30 UTC (permalink / raw)
  To: Ming Lei, Bart Van Assche
  Cc: linux-kernel, linux-block, hch, tom81094, himanshu.madhani,
	paolo.valente, axboe, linux-scsi, oleksandr, john.garry, osandov,
	loberman, james.smart

On 10/13/2017 07:29 PM, Ming Lei wrote:
> On Fri, Oct 13, 2017 at 05:08:52PM +0000, Bart Van Assche wrote:
>> On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote:
>>> On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
>>>> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
>>>>> Actually it is in the hot path; for example, lpfc's and qla2xxx's queue depth is 3,
>>>>
>>>> Sorry, but I doubt whether that is correct. More generally, I don't know of any modern
>>>> storage HBA for which the default queue depth is so low.
>>>
>>> You can grep:
>>>
>>> [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"
>>
>> Such a low queue depth will result in suboptimal performance for adapters
>> that communicate over a storage network. I think that's a bug and that both
>> adapters support much higher cmd_per_lun values.
>>
>> (+James Smart)
>>
>> James, can you explain us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN
>> from 30 to 3? Was that perhaps a workaround for a bug in a specific target
>> implementation?
>>
>> (+Himanshu Madhani)
>>
>> Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for
>> the qla2xxx initiator driver to the scsi_host->can_queue value?
> 
> ->can_queue is the size of the whole tag space shared by all LUNs, so it doesn't look
> reasonable to increase cmd_per_lun to .can_queue.
> 
'3' is just a starting point; later on it'll be adjusted via
scsi_change_queue_depth().
Looks like it's not working correctly with blk-mq, though.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-16 11:30                   ` Hannes Reinecke
@ 2017-10-16 16:06                     ` Bart Van Assche
  2017-10-17  1:29                     ` Ming Lei
  1 sibling, 0 replies; 32+ messages in thread
From: Bart Van Assche @ 2017-10-16 16:06 UTC (permalink / raw)
  To: hare, ming.lei
  Cc: linux-kernel, linux-block, hch, himanshu.madhani, tom81094,
	paolo.valente, axboe, linux-scsi, oleksandr, john.garry, osandov,
	loberman, james.smart

On Mon, 2017-10-16 at 13:30 +0200, Hannes Reinecke wrote:
> On Fri, Oct 13, 2017 at 05:08:52PM +0000, Bart Van Assche wrote:
> > (+Himanshu Madhani)
> > 
> > Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for
> > the qla2xxx initiator driver to the scsi_host->can_queue value?
> 
> '3' is just a starting point; later on it'll be adjusted via scsi_change_queue_depth().
> Looks like it's not working correctly with blk-mq, though.

Hello Hannes,

Isn't the purpose of scsi_change_queue_depth() to change the sdev queue_depth
only? As far as I know there is only one assignment of shost->cmd_per_lun in
the SCSI core and that's the following assignment in scsi_host_alloc():

	shost->cmd_per_lun = sht->cmd_per_lun;

Bart.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-16 11:30                   ` Hannes Reinecke
  2017-10-16 16:06                     ` Bart Van Assche
@ 2017-10-17  1:29                     ` Ming Lei
  2017-10-17  6:38                       ` Hannes Reinecke
  1 sibling, 1 reply; 32+ messages in thread
From: Ming Lei @ 2017-10-17  1:29 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Bart Van Assche, linux-kernel, linux-block, hch, tom81094,
	himanshu.madhani, paolo.valente, axboe, linux-scsi, oleksandr,
	john.garry, osandov, loberman, james.smart

On Mon, Oct 16, 2017 at 01:30:09PM +0200, Hannes Reinecke wrote:
> On 10/13/2017 07:29 PM, Ming Lei wrote:
> > On Fri, Oct 13, 2017 at 05:08:52PM +0000, Bart Van Assche wrote:
> >> On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote:
> >>> On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
> >>>> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
> >>>>> Actually it is in the hot path; for example, lpfc's and qla2xxx's queue depth is 3,
> >>>>
> >>>> Sorry, but I doubt whether that is correct. More generally, I don't know of any modern
> >>>> storage HBA for which the default queue depth is so low.
> >>>
> >>> You can grep:
> >>>
> >>> [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"
> >>
> >> Such a low queue depth will result in suboptimal performance for adapters
> >> that communicate over a storage network. I think that's a bug and that both
> >> adapters support much higher cmd_per_lun values.
> >>
> >> (+James Smart)
> >>
> >> James, can you explain us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN
> >> from 30 to 3? Was that perhaps a workaround for a bug in a specific target
> >> implementation?
> >>
> >> (+Himanshu Madhani)
> >>
> >> Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for
> >> the qla2xxx initiator driver to the scsi_host->can_queue value?
> > 
> > ->can_queue is the size of the whole tag space shared by all LUNs, so it doesn't look
> > reasonable to increase cmd_per_lun to .can_queue.
> > 
> '3' is just a starting point; later on it'll be adjusted via
> scsi_change_queue_depth().
> Looks like it's not working correctly with blk-mq, though.

By default, in scsi_alloc_sdev(), q->queue_depth is set to
host->cmd_per_lun. You are right, q->queue_depth can be adjusted
later too.

q->queue_depth is respected in scsi_dev_queue_ready().
.cmd_per_lun defines the max outstanding cmds for each LUN; I
guess it is enforced somewhere inside the hardware.

For example, I remember that on lpfc q->queue_depth is 30 because
the default 'lpfc_lun_queue_depth' is 30, and its .cmd_per_lun is 3.
Per my observation, this .cmd_per_lun limit is still in effect.

-- 
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-17  1:29                     ` Ming Lei
@ 2017-10-17  6:38                       ` Hannes Reinecke
  2017-10-17  9:36                         ` Ming Lei
  2017-10-17 18:09                         ` Bart Van Assche
  0 siblings, 2 replies; 32+ messages in thread
From: Hannes Reinecke @ 2017-10-17  6:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, linux-kernel, linux-block, hch, tom81094,
	himanshu.madhani, paolo.valente, axboe, linux-scsi, oleksandr,
	john.garry, osandov, loberman, james.smart

On 10/17/2017 03:29 AM, Ming Lei wrote:
> On Mon, Oct 16, 2017 at 01:30:09PM +0200, Hannes Reinecke wrote:
>> On 10/13/2017 07:29 PM, Ming Lei wrote:
>>> On Fri, Oct 13, 2017 at 05:08:52PM +0000, Bart Van Assche wrote:
>>>> On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote:
>>>>> On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
>>>>>> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
>>>>>>> Actually it is in the hot path; for example, lpfc's and qla2xxx's queue depth is 3,
>>>>>>
>>>>>> Sorry, but I doubt whether that is correct. More generally, I don't know of any modern
>>>>>> storage HBA for which the default queue depth is so low.
>>>>>
>>>>> You can grep:
>>>>>
>>>>> [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"
>>>>
>>>> Such a low queue depth will result in suboptimal performance for adapters
>>>> that communicate over a storage network. I think that's a bug and that both
>>>> adapters support much higher cmd_per_lun values.
>>>>
>>>> (+James Smart)
>>>>
>>>> James, can you explain us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN
>>>> from 30 to 3? Was that perhaps a workaround for a bug in a specific target
>>>> implementation?
>>>>
>>>> (+Himanshu Madhani)
>>>>
>>>> Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for
>>>> the qla2xxx initiator driver to the scsi_host->can_queue value?
>>>
>>> ->can_queue is the size of the whole tag space shared by all LUNs, so it doesn't look
>>> reasonable to increase cmd_per_lun to .can_queue.
>>>
>> '3' is just a starting point; later on it'll be adjusted via
>> scsi_change_queue_depth().
>> Looks like it's not working correctly with blk-mq, though.
> 
> By default, in scsi_alloc_sdev(), q->queue_depth is set to
> host->cmd_per_lun. You are right, q->queue_depth can be adjusted
> later too.
> 
> q->queue_depth is respected in scsi_dev_queue_ready().
> .cmd_per_lun defines the max outstanding cmds for each LUN; I
> guess it is enforced somewhere inside the hardware.
> 
No, this is purely a linux abstraction. Nothing to do with the hardware.

> For example, I remember that on lpfc q->queue_depth is 30 because
> the default 'lpfc_lun_queue_depth' is 30, and its .cmd_per_lun is 3.
> Per my observation, this .cmd_per_lun limit is still in effect.
> 
Again, these are just some pre-defined values to avoid I/O starvation
when having several LUNs. _if_ we can guarantee I/O fairness between
several (hundreds!) devices all sharing the same tagspace we wouldn't
need these variables.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-17  6:38                       ` Hannes Reinecke
@ 2017-10-17  9:36                         ` Ming Lei
  2017-10-17 18:09                         ` Bart Van Assche
  1 sibling, 0 replies; 32+ messages in thread
From: Ming Lei @ 2017-10-17  9:36 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Bart Van Assche, linux-kernel, linux-block, hch, tom81094,
	himanshu.madhani, paolo.valente, axboe, linux-scsi, oleksandr,
	john.garry, osandov, loberman, james.smart

On Tue, Oct 17, 2017 at 08:38:01AM +0200, Hannes Reinecke wrote:
> On 10/17/2017 03:29 AM, Ming Lei wrote:
> > On Mon, Oct 16, 2017 at 01:30:09PM +0200, Hannes Reinecke wrote:
> >> On 10/13/2017 07:29 PM, Ming Lei wrote:
> >>> On Fri, Oct 13, 2017 at 05:08:52PM +0000, Bart Van Assche wrote:
> >>>> On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote:
> >>>>> On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
> >>>>>> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
> >>>>>>> Actually it is in the hot path; for example, lpfc's and qla2xxx's queue depth is 3,
> >>>>>>
> >>>>>> Sorry, but I doubt whether that is correct. More generally, I don't know of any modern
> >>>>>> storage HBA for which the default queue depth is so low.
> >>>>>
> >>>>> You can grep:
> >>>>>
> >>>>> [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"
> >>>>
> >>>> Such a low queue depth will result in suboptimal performance for adapters
> >>>> that communicate over a storage network. I think that's a bug and that both
> >>>> adapters support much higher cmd_per_lun values.
> >>>>
> >>>> (+James Smart)
> >>>>
> >>>> James, can you explain us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN
> >>>> from 30 to 3? Was that perhaps a workaround for a bug in a specific target
> >>>> implementation?
> >>>>
> >>>> (+Himanshu Madhani)
> >>>>
> >>>> Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for
> >>>> the qla2xxx initiator driver to the scsi_host->can_queue value?
> >>>
> >>> ->can_queue is the size of the whole tag space shared by all LUNs, so it doesn't look
> >>> reasonable to increase cmd_per_lun to .can_queue.
> >>>
> >> '3' is just a starting point; later on it'll be adjusted via
> >> scsi_change_queue_depth().
> >> Looks like it's not working correctly with blk-mq, though.
> > 
> > By default, in scsi_alloc_sdev(), q->queue_depth is set to
> > host->cmd_per_lun. You are right, q->queue_depth can be adjusted
> > later too.
> > 
> > q->queue_depth is respected in scsi_dev_queue_ready().
> > .cmd_per_lun defines the max outstanding cmds for each LUN; I
> > guess it is enforced somewhere inside the hardware.
> > 
> No, this is purely a linux abstraction. Nothing to do with the hardware.

That is also my initial understanding.

But my test showed that the max outstanding cmds per LUN is actually
3, even though q->queue_depth is 30; that is why I guess the
hardware may put a hard limit inside:

	https://marc.info/?l=linux-block&m=150549401611868&w=2

Also, if they were the same thing, why does lpfc define different
default values for q->queue_depth and .cmd_per_lun?

drivers/scsi/lpfc/lpfc_attr.c: 3411
	/*
	# lun_queue_depth:  This parameter is used to limit the number of outstanding
	# commands per FCP LUN. Value range is [1,512]. Default value is 30.
	# If this parameter value is greater than 1/8th the maximum number of exchanges
	# supported by the HBA port, then the lun queue depth will be reduced to
	# 1/8th the maximum number of exchanges.
	*/
	LPFC_VPORT_ATTR_R(lun_queue_depth, 30, 1, 512,
	                  "Max number of FCP commands we can queue to a specific LUN");

drivers/scsi/lpfc/lpfc.h: 47
	#define LPFC_CMD_PER_LUN        3       /* max outstanding cmds per lun */


-- 
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread
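
One way to reconcile the two lpfc values quoted above: .cmd_per_lun only
seeds sdev->queue_depth when the device is allocated, and the LLD is then
expected to raise it to its real per-LUN limit (lun_queue_depth, default 30)
via scsi_change_queue_depth(), typically from its ->slave_configure hook;
the open question in this thread is whether that adjustment takes effect
under blk-mq. A rough sketch of the intended flow, not the actual lpfc code:

static int example_slave_configure(struct scsi_device *sdev)
{
	/*
	 * sdev->queue_depth was seeded from shost->cmd_per_lun (3 for
	 * lpfc); raise it to the per-LUN depth the driver really wants
	 * (lpfc's lun_queue_depth defaults to 30).
	 */
	scsi_change_queue_depth(sdev, 30);
	return 0;
}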

* Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  2017-10-17  6:38                       ` Hannes Reinecke
  2017-10-17  9:36                         ` Ming Lei
@ 2017-10-17 18:09                         ` Bart Van Assche
  1 sibling, 0 replies; 32+ messages in thread
From: Bart Van Assche @ 2017-10-17 18:09 UTC (permalink / raw)
  To: Hannes Reinecke, Ming Lei
  Cc: linux-kernel, linux-block, hch, tom81094, himanshu.madhani,
	paolo.valente, axboe, linux-scsi, oleksandr, john.garry, osandov,
	loberman, james.smart

On 10/16/17 23:38, Hannes Reinecke wrote:
> Again, these are just some pre-defined values to avoid I/O starvation
> when having several LUNs. _if_ we can guarantee I/O fairness between
> several (hundreds!) devices all sharing the same tagspace we wouldn't
> need these variables.

I think we already have two mechanisms for guaranteeing I/O fairness:
* Scsi_Host.starved_list for both the legacy SCSI core and scsi-mq.
* For scsi-mq, the active_queues counter in the blk-mq core. See also
   hctx_may_queue().

Bart.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2017-10-17 18:09 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-12 18:36 [PATCH V7 0/6] blk-mq-sched: improve sequential I/O performance Ming Lei
2017-10-12 18:36 ` [PATCH V7 1/6] blk-mq-sched: fix scheduler bad performance Ming Lei
2017-10-12 18:37 ` [PATCH V7 2/6] blk-mq-sched: move actual dispatching into one helper Ming Lei
2017-10-12 18:37 ` [PATCH V7 3/6] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
2017-10-12 18:37 ` [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops Ming Lei
2017-10-12 18:46   ` Jens Axboe
2017-10-13  0:19     ` Ming Lei
2017-10-13 14:44       ` Jens Axboe
2017-10-13 16:07         ` Ming Lei
2017-10-13 16:19           ` Jens Axboe
2017-10-13 16:21             ` Ming Lei
2017-10-13 16:28               ` Jens Axboe
2017-10-13 16:31           ` Bart Van Assche
2017-10-13 16:33             ` Jens Axboe
2017-10-13 16:45             ` Ming Lei
2017-10-13 17:08               ` Bart Van Assche
2017-10-13 17:29                 ` Ming Lei
2017-10-13 17:47                   ` Bart Van Assche
2017-10-16 11:30                   ` Hannes Reinecke
2017-10-16 16:06                     ` Bart Van Assche
2017-10-17  1:29                     ` Ming Lei
2017-10-17  6:38                       ` Hannes Reinecke
2017-10-17  9:36                         ` Ming Lei
2017-10-17 18:09                         ` Bart Van Assche
2017-10-13 16:17         ` Ming Lei
2017-10-13 16:20           ` Jens Axboe
2017-10-13 16:22             ` Ming Lei
2017-10-13 16:28               ` Jens Axboe
2017-10-12 18:37 ` [PATCH V7 5/6] blk-mq-sched: improve dispatching from sw queue Ming Lei
2017-10-12 18:48   ` Bart Van Assche
2017-10-13  0:20     ` Ming Lei
2017-10-12 18:37 ` [PATCH V7 6/6] SCSI: implement .get_budget and .put_budget for blk-mq Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).