* [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance
@ 2017-09-02 15:17 Ming Lei
  2017-09-02 15:17 ` [PATCH V4 01/14] blk-mq-sched: fix scheduler bad performance Ming Lei
                   ` (15 more replies)
  0 siblings, 16 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman, Ming Lei

Hi,

In Red Hat's internal storage tests of the blk-mq scheduler, we
found that I/O performance is quite bad with mq-deadline, especially
for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
SRP...).

It turns out that one big issue causes the performance regression:
requests are still dequeued from the sw queue/scheduler queue even
when the LLD's queue is busy, so I/O merging becomes quite difficult,
and sequential I/O degrades a lot.

The first five patches improve this situation and recover some of
the lost performance.

Patches 6 ~ 7 use q->queue_depth as a hint for setting up the
scheduler queue depth.

Patches 8 ~ 15 improve bio merging via a hash table in the sw queue,
which makes bio merging more efficient than the current approach, in
which only the last 8 requests are checked. Since patches 6~14 switch
SCSI devices to the scheduler's way of dequeuing one request at a
time from the sw queue, ctx->lock is acquired more often; merging
bios via the hash table shortens the time ctx->lock is held and
should eliminate that effect.

With these changes, SCSI-MQ sequential I/O performance is improved
a lot. Paolo reported that mq-deadline performance improved much [2]
in his dbench test with V2, and performance improvement on
lpfc/qla2xxx was also observed with V1 [1].

Bart was also worried that this patchset might affect SRP, so test
data on SCSI SRP is provided this time:

- fio (libaio, bs=4k, direct I/O, queue_depth=64, 64 jobs)
- system (16 cores, dual socket, 96 GB RAM)

          |v4.13-rc6+*  |v4.13-rc6+   | patched v4.13-rc6+ 
-----------------------------------------------------
 IOPS(K)  |  DEADLINE   |    NONE     |    NONE     
-----------------------------------------------------
read      |      587.81 |      511.96 |      518.51 
-----------------------------------------------------
randread  |      116.44 |      142.99 |      142.46 
-----------------------------------------------------
write     |      580.87 |       536.4 |      582.15 
-----------------------------------------------------
randwrite |      104.95 |      124.89 |      123.99 
-----------------------------------------------------


          |v4.13-rc6+   |v4.13-rc6+   | patched v4.13-rc6+ 
-----------------------------------------------------
 IOPS(K)  |  DEADLINE   |MQ-DEADLINE  |MQ-DEADLINE  
-----------------------------------------------------
read      |      587.81 |       158.7 |      450.41 
-----------------------------------------------------
randread  |      116.44 |      142.04 |      142.72 
-----------------------------------------------------
write     |      580.87 |      136.61 |      569.37 
-----------------------------------------------------
randwrite |      104.95 |      123.14 |      124.36 
-----------------------------------------------------

*: v4.13-rc6+ means v4.13-rc6 with block for-next


Please consider merging this for v4.14.

[1] http://marc.info/?l=linux-block&m=150151989915776&w=2
[2] https://marc.info/?l=linux-block&m=150217980602843&w=2

V4:
	- add Reviewed-by tags
	- some trivial changes: typo fixes in commit logs and comments
	and variable renames; no actual functional change

V3:
	- pick requests from ctxs in a fully round-robin fashion, as
	suggested by Bart
	- remove one local variable in __sbitmap_for_each_set()
	- drop the patches for a single dispatch list, which can improve
	performance with mq-deadline but cause a slight degradation with
	none because all hctxs need to be checked after ->dispatch is
	flushed. They will be posted again once mature.
	- rebase on v4.13-rc6 with block for-next

V2:
	- dequeue requests from sw queues in a round-robin style, as
	suggested by Bart, and introduce one helper in sbitmap for
	this purpose
	- improve bio merging via a hash table in the sw queue
	- add comments about using the DISPATCH_BUSY state in a
	lockless way, and simplify handling of the busy state
	- hold ctx->lock when clearing the ctx busy bit, as suggested
	by Bart


Ming Lei (14):
  blk-mq-sched: fix scheduler bad performance
  sbitmap: introduce __sbitmap_for_each_set()
  blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
  blk-mq-sched: move actual dispatching into one helper
  blk-mq-sched: improve dispatching from sw queue
  blk-mq-sched: don't dequeue request until all in ->dispatch are
    flushed
  blk-mq-sched: introduce blk_mq_sched_queue_depth()
  blk-mq-sched: use q->queue_depth as hint for q->nr_requests
  block: introduce rqhash helpers
  block: move actual bio merge code into __elv_merge
  block: add check on elevator for supporting bio merge via hashtable
    from blk-mq sw queue
  block: introduce .last_merge and .hash to blk_mq_ctx
  blk-mq-sched: refactor blk_mq_sched_try_merge()
  blk-mq: improve bio merge from blk-mq sw queue

 block/blk-mq-debugfs.c  |   1 +
 block/blk-mq-sched.c    | 186 +++++++++++++++++++++++++++++++-----------------
 block/blk-mq-sched.h    |  23 ++++++
 block/blk-mq.c          |  93 +++++++++++++++++++++++-
 block/blk-mq.h          |   7 ++
 block/blk-settings.c    |   2 +
 block/blk.h             |  55 ++++++++++++++
 block/elevator.c        |  93 ++++++++++++++----------
 include/linux/blk-mq.h  |   3 +
 include/linux/sbitmap.h |  54 ++++++++++----
 10 files changed, 399 insertions(+), 118 deletions(-)

-- 
2.9.5

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH V4 01/14] blk-mq-sched: fix scheduler bad performance
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
@ 2017-09-02 15:17 ` Ming Lei
  2017-09-08 20:48   ` Omar Sandoval
  2017-09-02 15:17 ` [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman, Ming Lei

When the hw queue is busy, we shouldn't take requests from the
scheduler queue any more; otherwise it is difficult to do I/O
merging.

This patch fixes the awful I/O performance on some SCSI devices
(lpfc, qla2xxx, ...) when mq-deadline/kyber is used, by not taking
requests if the hw queue is busy.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..845e5baf8af1 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-	bool did_work = false;
+	bool do_sched_dispatch = true;
 	LIST_HEAD(rq_list);
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -125,7 +125,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
+		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
 	} else if (!has_sched_dispatch) {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list);
@@ -136,7 +136,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * on the dispatch list, OR if we did have work but weren't able
 	 * to make progress.
 	 */
-	if (!did_work && has_sched_dispatch) {
+	if (do_sched_dispatch && has_sched_dispatch) {
 		do {
 			struct request *rq;
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set()
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
  2017-09-02 15:17 ` [PATCH V4 01/14] blk-mq-sched: fix scheduler bad performance Ming Lei
@ 2017-09-02 15:17 ` Ming Lei
  2017-09-08 20:43   ` Omar Sandoval
  2017-09-02 15:17 ` [PATCH V4 03/14] blk-mq: introduce blk_mq_dispatch_rq_from_ctx() Ming Lei
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman,
	Ming Lei, Omar Sandoval

We need to iterate over the ctxs starting from any given ctx in a
round-robin fashion, so introduce this helper.
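
For illustration only (not part of this patch; the function names
below are made up), a caller might use the new helper like this,
with the existing sb_for_each_fn callback signature and iteration
starting at 'off' and wrapping around the whole sbitmap once:

	static bool count_set_bit(struct sbitmap *sb, unsigned int bitnr,
				  void *data)
	{
		unsigned int *cnt = data;

		(*cnt)++;
		return true;	/* returning false stops the iteration */
	}

	static unsigned int count_set_from(struct sbitmap *sb, unsigned int off)
	{
		unsigned int cnt = 0;

		/* visit every set bit exactly once, starting at 'off' */
		__sbitmap_for_each_set(sb, off, count_set_bit, &cnt);
		return cnt;
	}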

Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/sbitmap.h | 54 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 40 insertions(+), 14 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index a1904aadbc45..2329b9e1a0e2 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -211,10 +211,14 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb);
  */
 bool sbitmap_any_bit_clear(const struct sbitmap *sb);
 
+#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
+#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+
 typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
 
 /**
  * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @off: Where to start the iteration
  * @sb: Bitmap to iterate over.
  * @fn: Callback. Should return true to continue or false to break early.
  * @data: Pointer to pass to callback.
@@ -222,35 +226,57 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
  * This is inline even though it's non-trivial so that the function calls to the
  * callback will hopefully get optimized away.
  */
-static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
-					void *data)
+static inline void __sbitmap_for_each_set(struct sbitmap *sb,
+					  unsigned int off,
+					  sb_for_each_fn fn, void *data)
 {
-	unsigned int i;
+	unsigned int index = SB_NR_TO_INDEX(sb, off);
+	unsigned int nr = SB_NR_TO_BIT(sb, off);
+	unsigned int scanned = 0;
 
-	for (i = 0; i < sb->map_nr; i++) {
-		struct sbitmap_word *word = &sb->map[i];
-		unsigned int off, nr;
+	while (1) {
+		struct sbitmap_word *word = &sb->map[index];
+		unsigned int depth = min_t(unsigned int, word->depth - nr,
+					   sb->depth - scanned);
 
+		scanned += depth;
 		if (!word->word)
-			continue;
+			goto next;
 
-		nr = 0;
-		off = i << sb->shift;
+		depth += nr;
+		off = index << sb->shift;
 		while (1) {
-			nr = find_next_bit(&word->word, word->depth, nr);
-			if (nr >= word->depth)
+			nr = find_next_bit(&word->word, depth, nr);
+			if (nr >= depth)
 				break;
-
 			if (!fn(sb, off + nr, data))
 				return;
 
 			nr++;
 		}
+ next:
+		if (scanned >= sb->depth)
+			break;
+		nr = 0;
+		if (++index >= sb->map_nr)
+			index = 0;
 	}
 }
 
-#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
-#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+/**
+ * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @sb: Bitmap to iterate over.
+ * @fn: Callback. Should return true to continue or false to break early.
+ * @data: Pointer to pass to callback.
+ *
+ * This is inline even though it's non-trivial so that the function calls to the
+ * callback will hopefully get optimized away.
+ */
+static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
+					void *data)
+{
+	__sbitmap_for_each_set(sb, 0, fn, data);
+}
 
 static inline unsigned long *__sbitmap_word(struct sbitmap *sb,
 					    unsigned int bitnr)
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V4 03/14] blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
  2017-09-02 15:17 ` [PATCH V4 01/14] blk-mq-sched: fix scheduler bad performance Ming Lei
  2017-09-02 15:17 ` [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
@ 2017-09-02 15:17 ` Ming Lei
  2017-09-15  0:04   ` Omar Sandoval
  2017-09-02 15:17 ` [PATCH V4 04/14] blk-mq-sched: move actual dispatching into one helper Ming Lei
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman, Ming Lei

This function is introduced for dequeuing one request at a time
from the sw queue, so that we can dispatch it in the scheduler's
way.

More importantly, some SCSI devices may set q->queue_depth, which
is a per-request_queue limit applied to pending I/O from all hctxs.
This function helps avoid dequeuing too many requests from the sw
queue when ->dispatch isn't completely flushed.
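
As an illustration only (the real user is added in a following
patch; the function name below is made up), a caller could pull
requests one at a time and stop as soon as the driver can't accept
more:

	static void dispatch_one_by_one(struct request_queue *q,
					struct blk_mq_hw_ctx *hctx)
	{
		LIST_HEAD(rq_list);

		do {
			struct request *rq;

			/* take a single request from the hw queue's sw queues */
			rq = blk_mq_dispatch_rq_from_ctx(hctx, NULL);
			if (!rq)
				break;
			list_add(&rq->queuelist, &rq_list);
		} while (blk_mq_dispatch_rq_list(q, &rq_list));
	}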

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 38 ++++++++++++++++++++++++++++++++++++++
 block/blk-mq.h |  2 ++
 2 files changed, 40 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f18cff80050..f063dd0f197f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -880,6 +880,44 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 }
 EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
 
+struct dispatch_rq_data {
+	struct blk_mq_hw_ctx *hctx;
+	struct request *rq;
+};
+
+static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
+{
+	struct dispatch_rq_data *dispatch_data = data;
+	struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
+	struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
+
+	spin_lock(&ctx->lock);
+	if (unlikely(!list_empty(&ctx->rq_list))) {
+		dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
+		list_del_init(&dispatch_data->rq->queuelist);
+		if (list_empty(&ctx->rq_list))
+			sbitmap_clear_bit(sb, bitnr);
+	}
+	spin_unlock(&ctx->lock);
+
+	return !dispatch_data->rq;
+}
+
+struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
+					    struct blk_mq_ctx *start)
+{
+	unsigned off = start ? start->index_hw : 0;
+	struct dispatch_rq_data data = {
+		.hctx = hctx,
+		.rq   = NULL,
+	};
+
+	__sbitmap_for_each_set(&hctx->ctx_map, off,
+			       dispatch_rq_from_ctx, &data);
+
+	return data.rq;
+}
+
 static inline unsigned int queued_to_index(unsigned int queued)
 {
 	if (!queued)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 98252b79b80b..e42748bfb959 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -35,6 +35,8 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 				bool wait);
+struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
+					    struct blk_mq_ctx *start);
 
 /*
  * Internal helpers for allocating/freeing the request map
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V4 04/14] blk-mq-sched: move actual dispatching into one helper
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (2 preceding siblings ...)
  2017-09-02 15:17 ` [PATCH V4 03/14] blk-mq: introduce blk_mq_dispatch_rq_from_ctx() Ming Lei
@ 2017-09-02 15:17 ` Ming Lei
  2017-09-19 19:21   ` Omar Sandoval
  2017-09-02 15:17 ` [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue Ming Lei
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman, Ming Lei

Move the actual dispatching into one helper, so that it becomes easy
to support dispatching from the sw queue in the following patch.

No functional change.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 845e5baf8af1..f69752961a34 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -89,6 +89,22 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 	return false;
 }
 
+static void blk_mq_do_dispatch(struct request_queue *q,
+			       struct elevator_queue *e,
+			       struct blk_mq_hw_ctx *hctx)
+{
+	LIST_HEAD(rq_list);
+
+	do {
+		struct request *rq;
+
+		rq = e->type->ops.mq.dispatch_request(hctx);
+		if (!rq)
+			break;
+		list_add(&rq->queuelist, &rq_list);
+	} while (blk_mq_dispatch_rq_list(q, &rq_list));
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
@@ -136,16 +152,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * on the dispatch list, OR if we did have work but weren't able
 	 * to make progress.
 	 */
-	if (do_sched_dispatch && has_sched_dispatch) {
-		do {
-			struct request *rq;
-
-			rq = e->type->ops.mq.dispatch_request(hctx);
-			if (!rq)
-				break;
-			list_add(&rq->queuelist, &rq_list);
-		} while (blk_mq_dispatch_rq_list(q, &rq_list));
-	}
+	if (do_sched_dispatch && has_sched_dispatch)
+		blk_mq_do_dispatch(q, e, hctx);
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (3 preceding siblings ...)
  2017-09-02 15:17 ` [PATCH V4 04/14] blk-mq-sched: move actual dispatching into one helper Ming Lei
@ 2017-09-02 15:17 ` Ming Lei
  2017-09-08 23:54   ` Omar Sandoval
  2017-09-19 20:37   ` Jens Axboe
  2017-09-02 15:17 ` [PATCH V4 06/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman, Ming Lei

SCSI devices use a host-wide tagset, and the shared driver tag
space is often quite big. Meanwhile there is also a queue depth
for each LUN (.cmd_per_lun), which is often small.

So lots of requests may stay in the sw queue, and we always flush
all of those belonging to the same hw queue and dispatch them all
to the driver. Unfortunately it is easy to make the queue busy
because of the small per-LUN queue depth. Once these requests are
flushed out, they have to stay in hctx->dispatch, no bio merging
can happen against them any more, and sequential I/O performance
is hurt.

This patch improves dispatching from the sw queue when there is a
per-request_queue queue depth, by taking requests one by one from
the sw queue, just like an I/O scheduler does.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c   | 61 +++++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/blk-mq.h |  2 ++
 2 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index f69752961a34..735e432294ab 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -89,9 +89,9 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 	return false;
 }
 
-static void blk_mq_do_dispatch(struct request_queue *q,
-			       struct elevator_queue *e,
-			       struct blk_mq_hw_ctx *hctx)
+static void blk_mq_do_dispatch_sched(struct request_queue *q,
+				     struct elevator_queue *e,
+				     struct blk_mq_hw_ctx *hctx)
 {
 	LIST_HEAD(rq_list);
 
@@ -105,6 +105,42 @@ static void blk_mq_do_dispatch(struct request_queue *q,
 	} while (blk_mq_dispatch_rq_list(q, &rq_list));
 }
 
+static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
+					  struct blk_mq_ctx *ctx)
+{
+	unsigned idx = ctx->index_hw;
+
+	if (++idx == hctx->nr_ctx)
+		idx = 0;
+
+	return hctx->ctxs[idx];
+}
+
+static void blk_mq_do_dispatch_ctx(struct request_queue *q,
+				   struct blk_mq_hw_ctx *hctx)
+{
+	LIST_HEAD(rq_list);
+	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+	bool dispatched;
+
+	do {
+		struct request *rq;
+
+		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
+		if (!rq)
+			break;
+		list_add(&rq->queuelist, &rq_list);
+
+		/* round robin for fair dispatch */
+		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
+
+		dispatched = blk_mq_dispatch_rq_list(q, &rq_list);
+	} while (dispatched);
+
+	if (!dispatched)
+		WRITE_ONCE(hctx->dispatch_from, blk_mq_next_ctx(hctx, ctx));
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
@@ -142,18 +178,31 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
 		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
-	} else if (!has_sched_dispatch) {
+	} else if (!has_sched_dispatch && !q->queue_depth) {
+		/*
+		 * If there is no per-request_queue depth, we
+		 * flush all requests in this hw queue, otherwise
+		 * pick up request one by one from sw queue for
+		 * avoiding to mess up I/O merge when dispatch
+		 * is busy, which can be triggered easily by
+		 * per-request_queue queue depth
+		 */
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list);
 	}
 
+	if (!do_sched_dispatch)
+		return;
+
 	/*
 	 * We want to dispatch from the scheduler if we had no work left
 	 * on the dispatch list, OR if we did have work but weren't able
 	 * to make progress.
 	 */
-	if (do_sched_dispatch && has_sched_dispatch)
-		blk_mq_do_dispatch(q, e, hctx);
+	if (has_sched_dispatch)
+		blk_mq_do_dispatch_sched(q, e, hctx);
+	else
+		blk_mq_do_dispatch_ctx(q, hctx);
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 50c6485cb04f..7b7a366a97f3 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -30,6 +30,8 @@ struct blk_mq_hw_ctx {
 
 	struct sbitmap		ctx_map;
 
+	struct blk_mq_ctx	*dispatch_from;
+
 	struct blk_mq_ctx	**ctxs;
 	unsigned int		nr_ctx;
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V4 06/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (4 preceding siblings ...)
  2017-09-02 15:17 ` [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue Ming Lei
@ 2017-09-02 15:17 ` Ming Lei
  2017-09-19 19:11   ` Omar Sandoval
  2017-09-02 15:17 ` [PATCH V4 07/14] blk-mq-sched: introduce blk_mq_sched_queue_depth() Ming Lei
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman, Ming Lei

During dispatching, we move all requests from hctx->dispatch to a
temporary list, then dispatch them one by one from this list.
Unfortunately during this period, a queue run from another context
may think the queue is idle, start to dequeue from the sw/scheduler
queue, and still try to dispatch because ->dispatch is empty. This
hurts sequential I/O performance because requests are dequeued while
the LLD queue is busy.

This patch introduces the BLK_MQ_S_DISPATCH_BUSY state to make sure
that requests aren't dequeued until ->dispatch is flushed.
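
A minimal sketch of the intended protocol (illustrative only; the
helper names are made up, while the state bit and fields match the
patch below):

	static void requeue_to_dispatch(struct blk_mq_hw_ctx *hctx,
					struct list_head *list)
	{
		spin_lock(&hctx->lock);
		list_splice_init(list, &hctx->dispatch);
		/* mark busy before dropping the lock so that concurrent
		 * queue runs see the state */
		set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
		spin_unlock(&hctx->lock);
	}

	static bool may_dequeue_from_sched(struct blk_mq_hw_ctx *hctx)
	{
		/* only dequeue from sw/scheduler queues after ->dispatch
		 * has been flushed and the busy bit cleared */
		return !test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
	}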

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   | 58 +++++++++++++++++++++++++++++++++++---------------
 block/blk-mq.c         |  6 ++++++
 include/linux/blk-mq.h |  1 +
 4 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 980e73095643..7a27f262c96a 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -182,6 +182,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(SCHED_RESTART),
 	HCTX_STATE_NAME(TAG_WAITING),
 	HCTX_STATE_NAME(START_ON_RUN),
+	HCTX_STATE_NAME(DISPATCH_BUSY),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 735e432294ab..97e7a4fe3a32 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -146,7 +146,6 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-	bool do_sched_dispatch = true;
 	LIST_HEAD(rq_list);
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -177,8 +176,33 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
-	} else if (!has_sched_dispatch && !q->queue_depth) {
+		blk_mq_dispatch_rq_list(q, &rq_list);
+
+		/*
+		 * We may clear DISPATCH_BUSY just after it
+		 * is set from another context, the only cost
+		 * is that one request is dequeued a bit early,
+		 * we can survive that. Given the window is
+		 * small enough, no need to worry about performance
+		 * effect.
+		 */
+		if (list_empty_careful(&hctx->dispatch))
+			clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+	}
+
+	/*
+	 * If DISPATCH_BUSY is set, that means hw queue is busy
+	 * and requests in the list of hctx->dispatch need to
+	 * be flushed first, so return early.
+	 *
+	 * Wherever DISPATCH_BUSY is set, blk_mq_run_hw_queue()
+	 * will be run to try to make progress, so it is always
+	 * safe to check the state here.
+	 */
+	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
+		return;
+
+	if (!has_sched_dispatch) {
 		/*
 		 * If there is no per-request_queue depth, we
 		 * flush all requests in this hw queue, otherwise
@@ -187,22 +211,21 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 		 * is busy, which can be triggered easily by
 		 * per-request_queue queue depth
 		 */
-		blk_mq_flush_busy_ctxs(hctx, &rq_list);
-		blk_mq_dispatch_rq_list(q, &rq_list);
-	}
-
-	if (!do_sched_dispatch)
-		return;
+		if (!q->queue_depth) {
+			blk_mq_flush_busy_ctxs(hctx, &rq_list);
+			blk_mq_dispatch_rq_list(q, &rq_list);
+		} else {
+			blk_mq_do_dispatch_ctx(q, hctx);
+		}
+	} else {
 
-	/*
-	 * We want to dispatch from the scheduler if we had no work left
-	 * on the dispatch list, OR if we did have work but weren't able
-	 * to make progress.
-	 */
-	if (has_sched_dispatch)
+		/*
+		 * We want to dispatch from the scheduler if we had no work left
+		 * on the dispatch list, OR if we did have work but weren't able
+		 * to make progress.
+		 */
 		blk_mq_do_dispatch_sched(q, e, hctx);
-	else
-		blk_mq_do_dispatch_ctx(q, hctx);
+	}
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
@@ -330,6 +353,7 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 	 */
 	spin_lock(&hctx->lock);
 	list_add(&rq->queuelist, &hctx->dispatch);
+	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 	spin_unlock(&hctx->lock);
 	return true;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f063dd0f197f..6af56a71c1cd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1140,6 +1140,11 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
+		/*
+		 * DISPATCH_BUSY won't be cleared until all requests
+		 * in hctx->dispatch are dispatched successfully
+		 */
+		set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 		spin_unlock(&hctx->lock);
 
 		/*
@@ -1927,6 +1932,7 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 	spin_lock(&hctx->lock);
 	list_splice_tail_init(&tmp, &hctx->dispatch);
+	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 	spin_unlock(&hctx->lock);
 
 	blk_mq_run_hw_queue(hctx, true);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7b7a366a97f3..13f6c25fa461 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -172,6 +172,7 @@ enum {
 	BLK_MQ_S_SCHED_RESTART	= 2,
 	BLK_MQ_S_TAG_WAITING	= 3,
 	BLK_MQ_S_START_ON_RUN	= 4,
+	BLK_MQ_S_DISPATCH_BUSY	= 5,
 
 	BLK_MQ_MAX_DEPTH	= 10240,
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V4 07/14] blk-mq-sched: introduce blk_mq_sched_queue_depth()
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (5 preceding siblings ...)
  2017-09-02 15:17 ` [PATCH V4 06/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
@ 2017-09-02 15:17 ` Ming Lei
  2017-09-02 15:17 ` [PATCH V4 08/14] blk-mq-sched: use q->queue_depth as hint for q->nr_requests Ming Lei
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman, Ming Lei

The following patch will use a hint to figure out the default queue
depth for the scheduler queue, so introduce the helper
blk_mq_sched_queue_depth() for this purpose.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c |  8 +-------
 block/blk-mq-sched.h | 11 +++++++++++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 97e7a4fe3a32..1ff6f9bedd1a 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -595,13 +595,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 		return 0;
 	}
 
-	/*
-	 * Default to double of smaller one between hw queue_depth and 128,
-	 * since we don't split into sync/async like the old code did.
-	 * Additionally, this is a per-hw queue depth.
-	 */
-	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
-				   BLKDEV_MAX_RQ);
+	q->nr_requests = blk_mq_sched_queue_depth(q);
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		ret = blk_mq_sched_alloc_tags(q, hctx, i);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 9267d0b7c197..1d47f3fda1d0 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -96,4 +96,15 @@ static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
 	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
 
+static inline unsigned blk_mq_sched_queue_depth(struct request_queue *q)
+{
+	/*
+	 * Default to double of smaller one between hw queue_depth and 128,
+	 * since we don't split into sync/async like the old code did.
+	 * Additionally, this is a per-hw queue depth.
+	 */
+	return 2 * min_t(unsigned int, q->tag_set->queue_depth,
+				   BLKDEV_MAX_RQ);
+}
+
 #endif
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V4 08/14] blk-mq-sched: use q->queue_depth as hint for q->nr_requests
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (6 preceding siblings ...)
  2017-09-02 15:17 ` [PATCH V4 07/14] blk-mq-sched: introduce blk_mq_sched_queue_depth() Ming Lei
@ 2017-09-02 15:17 ` Ming Lei
  2017-09-02 15:17 ` [PATCH V4 09/14] block: introduce rqhash helpers Ming Lei
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman, Ming Lei

SCSI sets q->queue_depth from shost->cmd_per_lun. q->queue_depth is
per request_queue and is more closely related to the scheduler queue
than the hw queue depth, which can be shared by queues (e.g. with
TAG_SHARED).

This patch uses q->queue_depth as a hint for computing
q->nr_requests, which should be more effective than the current
way.
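
As a worked example of the computation below (BLKDEV_MAX_RQ is 128):
a LUN with q->queue_depth = 32 from .cmd_per_lun gets
2 * min(32, 128) = 64, clamped up to max(64, 128) = 128 scheduler
requests, while a device with q->queue_depth = 200 gets
2 * min(200, 128) = 256.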

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.h | 18 +++++++++++++++---
 block/blk-mq.c       | 27 +++++++++++++++++++++++++--
 block/blk-mq.h       |  1 +
 block/blk-settings.c |  2 ++
 4 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1d47f3fda1d0..906b10c54f78 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -99,12 +99,24 @@ static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
 static inline unsigned blk_mq_sched_queue_depth(struct request_queue *q)
 {
 	/*
-	 * Default to double of smaller one between hw queue_depth and 128,
+	 * q->queue_depth is more close to scheduler queue, so use it
+	 * as hint for computing scheduler queue depth if it is valid
+	 */
+	unsigned q_depth = q->queue_depth ?: q->tag_set->queue_depth;
+
+	/*
+	 * Default to double of smaller one between queue depth and 128,
 	 * since we don't split into sync/async like the old code did.
 	 * Additionally, this is a per-hw queue depth.
 	 */
-	return 2 * min_t(unsigned int, q->tag_set->queue_depth,
-				   BLKDEV_MAX_RQ);
+	q_depth = 2 * min_t(unsigned int, q_depth, BLKDEV_MAX_RQ);
+
+	/*
+	 * when queue depth of driver is too small, we set queue depth
+	 * of scheduler queue as 128 which is the default setting of
+	 * block legacy code.
+	 */
+	return max_t(unsigned, q_depth, BLKDEV_MAX_RQ);
 }
 
 #endif
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6af56a71c1cd..fc3d26bbfc1a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2650,7 +2650,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_free_tag_set);
 
-int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
+static int __blk_mq_update_nr_requests(struct request_queue *q,
+				       bool sched_only,
+				       unsigned int nr)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
@@ -2669,7 +2671,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 		 * If we're using an MQ scheduler, just update the scheduler
 		 * queue depth. This is similar to what the old code would do.
 		 */
-		if (!hctx->sched_tags) {
+		if (!sched_only && !hctx->sched_tags) {
 			ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
 							min(nr, set->queue_depth),
 							false);
@@ -2689,6 +2691,27 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	return ret;
 }
 
+int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
+{
+	return __blk_mq_update_nr_requests(q, false, nr);
+}
+
+/*
+ * When drivers update q->queue_depth, this API is called so that
+ * we can use this queue depth as hint for adjusting scheduler
+ * queue depth.
+ */
+int blk_mq_update_sched_queue_depth(struct request_queue *q)
+{
+	unsigned nr;
+
+	if (!q->mq_ops || !q->elevator)
+		return 0;
+
+	nr = blk_mq_sched_queue_depth(q);
+	return __blk_mq_update_nr_requests(q, true, nr);
+}
+
 static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 							int nr_hw_queues)
 {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e42748bfb959..0277f9771fab 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -37,6 +37,7 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 				bool wait);
 struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
 					    struct blk_mq_ctx *start);
+int blk_mq_update_sched_queue_depth(struct request_queue *q);
 
 /*
  * Internal helpers for allocating/freeing the request map
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 8559e9563c52..c2db38d2ec2b 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -878,6 +878,8 @@ void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
 	q->queue_depth = depth;
 	wbt_set_queue_depth(q->rq_wb, depth);
+
+	WARN_ON(blk_mq_update_sched_queue_depth(q));
 }
 EXPORT_SYMBOL(blk_set_queue_depth);
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V4 09/14] block: introduce rqhash helpers
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (7 preceding siblings ...)
  2017-09-02 15:17 ` [PATCH V4 08/14] blk-mq-sched: use q->queue_depth as hint for q->nr_requests Ming Lei
@ 2017-09-02 15:17 ` Ming Lei
  2017-09-02 15:17 ` [PATCH V4 10/14] block: move actual bio merge code into __elv_merge Ming Lei
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman, Ming Lei

We need these helpers to support using a hash table to improve bio
merging from the sw queue in the following patches.

No functional change.
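
For illustration only (not part of this patch; the names below are
made up), the helpers are meant to be used the same way the elevator
hash is used today: index each request by its end sector with
rqhash_add(), and look up a back-merge candidate for a bio with
rqhash_find():

	static DEFINE_HASHTABLE(demo_hash, ELV_HASH_BITS);

	static void demo_track_request(struct request *rq)
	{
		/* key is blk_rq_pos(rq) + blk_rq_sectors(rq), i.e. the
		 * request's end sector */
		rqhash_add(demo_hash, rq);
	}

	static struct request *demo_find_back_merge(struct bio *bio)
	{
		/* a bio can only be back-merged to a request that ends
		 * where the bio starts */
		return rqhash_find(demo_hash, bio->bi_iter.bi_sector);
	}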

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk.h      | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator.c | 36 +++++++-----------------------------
 2 files changed, 59 insertions(+), 29 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index fcb9775b997d..eb3436d4a73f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -146,6 +146,58 @@ static inline void blk_clear_rq_complete(struct request *rq)
  */
 #define ELV_ON_HASH(rq) ((rq)->rq_flags & RQF_HASHED)
 
+/*
+ * Merge hash stuff.
+ */
+#define rq_hash_key(rq)		(blk_rq_pos(rq) + blk_rq_sectors(rq))
+
+#define bucket(head, key)	&((head)[hash_min((key), ELV_HASH_BITS)])
+
+static inline void __rqhash_del(struct request *rq)
+{
+	hash_del(&rq->hash);
+	rq->rq_flags &= ~RQF_HASHED;
+}
+
+static inline void rqhash_del(struct request *rq)
+{
+	if (ELV_ON_HASH(rq))
+		__rqhash_del(rq);
+}
+
+static inline void rqhash_add(struct hlist_head *hash, struct request *rq)
+{
+	BUG_ON(ELV_ON_HASH(rq));
+	hlist_add_head(&rq->hash, bucket(hash, rq_hash_key(rq)));
+	rq->rq_flags |= RQF_HASHED;
+}
+
+static inline void rqhash_reposition(struct hlist_head *hash, struct request *rq)
+{
+	__rqhash_del(rq);
+	rqhash_add(hash, rq);
+}
+
+static inline struct request *rqhash_find(struct hlist_head *hash, sector_t offset)
+{
+	struct hlist_node *next;
+	struct request *rq = NULL;
+
+	hlist_for_each_entry_safe(rq, next, bucket(hash, offset), hash) {
+		BUG_ON(!ELV_ON_HASH(rq));
+
+		if (unlikely(!rq_mergeable(rq))) {
+			__rqhash_del(rq);
+			continue;
+		}
+
+		if (rq_hash_key(rq) == offset)
+			return rq;
+	}
+
+	return NULL;
+}
+
 void blk_insert_flush(struct request *rq);
 
 static inline struct request *__elv_next_request(struct request_queue *q)
diff --git a/block/elevator.c b/block/elevator.c
index 153926a90901..824cc3e69ac3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -47,11 +47,6 @@ static DEFINE_SPINLOCK(elv_list_lock);
 static LIST_HEAD(elv_list);
 
 /*
- * Merge hash stuff.
- */
-#define rq_hash_key(rq)		(blk_rq_pos(rq) + blk_rq_sectors(rq))
-
-/*
  * Query io scheduler to see if the current process issuing bio may be
  * merged with rq.
  */
@@ -268,14 +263,12 @@ EXPORT_SYMBOL(elevator_exit);
 
 static inline void __elv_rqhash_del(struct request *rq)
 {
-	hash_del(&rq->hash);
-	rq->rq_flags &= ~RQF_HASHED;
+	__rqhash_del(rq);
 }
 
 void elv_rqhash_del(struct request_queue *q, struct request *rq)
 {
-	if (ELV_ON_HASH(rq))
-		__elv_rqhash_del(rq);
+	rqhash_del(rq);
 }
 EXPORT_SYMBOL_GPL(elv_rqhash_del);
 
@@ -283,37 +276,22 @@ void elv_rqhash_add(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	BUG_ON(ELV_ON_HASH(rq));
-	hash_add(e->hash, &rq->hash, rq_hash_key(rq));
-	rq->rq_flags |= RQF_HASHED;
+	rqhash_add(e->hash, rq);
 }
 EXPORT_SYMBOL_GPL(elv_rqhash_add);
 
 void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
 {
-	__elv_rqhash_del(rq);
-	elv_rqhash_add(q, rq);
+	struct elevator_queue *e = q->elevator;
+
+	rqhash_reposition(e->hash, rq);
 }
 
 struct request *elv_rqhash_find(struct request_queue *q, sector_t offset)
 {
 	struct elevator_queue *e = q->elevator;
-	struct hlist_node *next;
-	struct request *rq;
-
-	hash_for_each_possible_safe(e->hash, rq, next, hash, offset) {
-		BUG_ON(!ELV_ON_HASH(rq));
 
-		if (unlikely(!rq_mergeable(rq))) {
-			__elv_rqhash_del(rq);
-			continue;
-		}
-
-		if (rq_hash_key(rq) == offset)
-			return rq;
-	}
-
-	return NULL;
+	return rqhash_find(e->hash, offset);
 }
 
 /*
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V4 10/14] block: move actual bio merge code into __elv_merge
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (8 preceding siblings ...)
  2017-09-02 15:17 ` [PATCH V4 09/14] block: introduce rqhash helpers Ming Lei
@ 2017-09-02 15:17 ` Ming Lei
  2017-09-02 15:17 ` [PATCH V4 11/14] block: add check on elevator for supporting bio merge via hashtable from blk-mq sw queue Ming Lei
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman, Ming Lei

Move the actual bio merge code into __elv_merge() so that we can
reuse it to merge bios into requests from the sw queue in the
following patches.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 824cc3e69ac3..e11c7873fc21 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -409,8 +409,9 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
 }
 EXPORT_SYMBOL(elv_dispatch_add_tail);
 
-enum elv_merge elv_merge(struct request_queue *q, struct request **req,
-		struct bio *bio)
+static enum elv_merge __elv_merge(struct request_queue *q,
+		struct request **req, struct bio *bio,
+		struct hlist_head *hash, struct request *last_merge)
 {
 	struct elevator_queue *e = q->elevator;
 	struct request *__rq;
@@ -427,11 +428,11 @@ enum elv_merge elv_merge(struct request_queue *q, struct request **req,
 	/*
 	 * First try one-hit cache.
 	 */
-	if (q->last_merge && elv_bio_merge_ok(q->last_merge, bio)) {
-		enum elv_merge ret = blk_try_merge(q->last_merge, bio);
+	if (last_merge && elv_bio_merge_ok(last_merge, bio)) {
+		enum elv_merge ret = blk_try_merge(last_merge, bio);
 
 		if (ret != ELEVATOR_NO_MERGE) {
-			*req = q->last_merge;
+			*req = last_merge;
 			return ret;
 		}
 	}
@@ -442,7 +443,7 @@ enum elv_merge elv_merge(struct request_queue *q, struct request **req,
 	/*
 	 * See if our hash lookup can find a potential backmerge.
 	 */
-	__rq = elv_rqhash_find(q, bio->bi_iter.bi_sector);
+	__rq = rqhash_find(hash, bio->bi_iter.bi_sector);
 	if (__rq && elv_bio_merge_ok(__rq, bio)) {
 		*req = __rq;
 		return ELEVATOR_BACK_MERGE;
@@ -456,6 +457,12 @@ enum elv_merge elv_merge(struct request_queue *q, struct request **req,
 	return ELEVATOR_NO_MERGE;
 }
 
+enum elv_merge elv_merge(struct request_queue *q, struct request **req,
+		struct bio *bio)
+{
+	return __elv_merge(q, req, bio, q->elevator->hash, q->last_merge);
+}
+
 /*
  * Attempt to do an insertion back merge. Only check for the case where
  * we can append 'rq' to an existing request, so we can throw 'rq' away
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V4 11/14] block: add check on elevator for supporting bio merge via hashtable from blk-mq sw queue
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (9 preceding siblings ...)
  2017-09-02 15:17 ` [PATCH V4 10/14] block: move actual bio merge code into __elv_merge Ming Lei
@ 2017-09-02 15:17 ` Ming Lei
  2017-09-02 15:17 ` [PATCH V4 12/14] block: introduce .last_merge and .hash to blk_mq_ctx Ming Lei
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman, Ming Lei

blk_mq_sched_try_merge() will be reused in the following patches to
support bio merging into the blk-mq sw queue, so add checks to the
related functions which are called from blk_mq_sched_try_merge().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/block/elevator.c b/block/elevator.c
index e11c7873fc21..2424aea85393 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -71,6 +71,10 @@ bool elv_bio_merge_ok(struct request *rq, struct bio *bio)
 	if (!blk_rq_merge_ok(rq, bio))
 		return false;
 
+	/* We need to support to merge bio from sw queue */
+	if (!rq->q->elevator)
+		return true;
+
 	if (!elv_iosched_allow_bio_merge(rq, bio))
 		return false;
 
@@ -449,6 +453,10 @@ static enum elv_merge __elv_merge(struct request_queue *q,
 		return ELEVATOR_BACK_MERGE;
 	}
 
+	/* no elevator when merging bio to blk-mq sw queue */
+	if (!e)
+		return ELEVATOR_NO_MERGE;
+
 	if (e->uses_mq && e->type->ops.mq.request_merge)
 		return e->type->ops.mq.request_merge(q, req, bio);
 	else if (!e->uses_mq && e->type->ops.sq.elevator_merge_fn)
@@ -711,6 +719,10 @@ struct request *elv_latter_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/* no elevator when merging bio to blk-mq sw queue */
+	if (!e)
+		return NULL;
+
 	if (e->uses_mq && e->type->ops.mq.next_request)
 		return e->type->ops.mq.next_request(q, rq);
 	else if (!e->uses_mq && e->type->ops.sq.elevator_latter_req_fn)
@@ -723,6 +735,10 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/* no elevator when merging bio to blk-mq sw queue */
+	if (!e)
+		return NULL;
+
 	if (e->uses_mq && e->type->ops.mq.former_request)
 		return e->type->ops.mq.former_request(q, rq);
 	if (!e->uses_mq && e->type->ops.sq.elevator_former_req_fn)
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V4 12/14] block: introduce .last_merge and .hash to blk_mq_ctx
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (10 preceding siblings ...)
  2017-09-02 15:17 ` [PATCH V4 11/14] block: add check on elevator for supporting bio merge via hashtable from blk-mq sw queue Ming Lei
@ 2017-09-02 15:17 ` Ming Lei
  2017-09-02 15:17 ` [PATCH V4 13/14] blk-mq-sched: refactor blk_mq_sched_try_merge() Ming Lei
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman, Ming Lei

Prepare for supporting bio merging into the sw queue when no blk-mq
I/O scheduler is in use.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.h   |  4 ++++
 block/blk.h      |  3 +++
 block/elevator.c | 22 +++++++++++++++++++---
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 0277f9771fab..1b9742eb7399 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -18,6 +18,10 @@ struct blk_mq_ctx {
 	unsigned long		rq_dispatched[2];
 	unsigned long		rq_merged;
 
+	/* bio merge via request hash table */
+	struct request		*last_merge;
+	DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
+
 	/* incremented at completion time */
 	unsigned long		____cacheline_aligned_in_smp rq_completed[2];
 
diff --git a/block/blk.h b/block/blk.h
index eb3436d4a73f..fa4f232afc18 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -198,6 +198,9 @@ static inline struct request *rqhash_find(struct hlist_head *hash, sector_t offs
 	return NULL;
 }
 
+enum elv_merge elv_merge_ctx(struct request_queue *q, struct request **req,
+                struct bio *bio, struct blk_mq_ctx *ctx);
+
 void blk_insert_flush(struct request *rq);
 
 static inline struct request *__elv_next_request(struct request_queue *q)
diff --git a/block/elevator.c b/block/elevator.c
index 2424aea85393..0e13e5c18982 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -471,6 +471,13 @@ enum elv_merge elv_merge(struct request_queue *q, struct request **req,
 	return __elv_merge(q, req, bio, q->elevator->hash, q->last_merge);
 }
 
+enum elv_merge elv_merge_ctx(struct request_queue *q, struct request **req,
+		struct bio *bio, struct blk_mq_ctx *ctx)
+{
+	WARN_ON_ONCE(!q->mq_ops);
+	return __elv_merge(q, req, bio, ctx->hash, ctx->last_merge);
+}
+
 /*
  * Attempt to do an insertion back merge. Only check for the case where
  * we can append 'rq' to an existing request, so we can throw 'rq' away
@@ -516,16 +523,25 @@ void elv_merged_request(struct request_queue *q, struct request *rq,
 		enum elv_merge type)
 {
 	struct elevator_queue *e = q->elevator;
+	struct hlist_head *hash = e->hash;
+
+	/* we do bio merge on blk-mq sw queue */
+	if (q->mq_ops && !e) {
+		rq->mq_ctx->last_merge = rq;
+		hash = rq->mq_ctx->hash;
+		goto reposition;
+	}
+
+	q->last_merge = rq;
 
 	if (e->uses_mq && e->type->ops.mq.request_merged)
 		e->type->ops.mq.request_merged(q, rq, type);
 	else if (!e->uses_mq && e->type->ops.sq.elevator_merged_fn)
 		e->type->ops.sq.elevator_merged_fn(q, rq, type);
 
+ reposition:
 	if (type == ELEVATOR_BACK_MERGE)
-		elv_rqhash_reposition(q, rq);
-
-	q->last_merge = rq;
+		rqhash_reposition(hash, rq);
 }
 
 void elv_merge_requests(struct request_queue *q, struct request *rq,
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V4 13/14] blk-mq-sched: refactor blk_mq_sched_try_merge()
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (11 preceding siblings ...)
  2017-09-02 15:17 ` [PATCH V4 12/14] block: introduce .last_merge and .hash to blk_mq_ctx Ming Lei
@ 2017-09-02 15:17 ` Ming Lei
  2017-09-02 15:17 ` [PATCH V4 14/14] blk-mq: improve bio merge from blk-mq sw queue Ming Lei
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman, Ming Lei

This patch introduces __blk_mq_try_merge(), which will be reused
for merging bios into the sw queue in the following patch.

No functional change.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 1ff6f9bedd1a..afa7d9a258e4 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -228,12 +228,11 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	}
 }
 
-bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-			    struct request **merged_request)
+static bool __blk_mq_try_merge(struct request_queue *q,
+		struct bio *bio, struct request **merged_request,
+		struct request *rq, enum elv_merge type)
 {
-	struct request *rq;
-
-	switch (elv_merge(q, &rq, bio)) {
+	switch (type) {
 	case ELEVATOR_BACK_MERGE:
 		if (!blk_mq_sched_allow_merge(q, rq, bio))
 			return false;
@@ -256,6 +255,15 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
 		return false;
 	}
 }
+
+bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
+		struct request **merged_request)
+{
+	struct request *rq;
+	enum elv_merge type = elv_merge(q, &rq, bio);
+
+	return __blk_mq_try_merge(q, bio, merged_request, rq, type);
+}
 EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
 
 /*
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V4 14/14] blk-mq: improve bio merge from blk-mq sw queue
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (12 preceding siblings ...)
  2017-09-02 15:17 ` [PATCH V4 13/14] blk-mq-sched: refactor blk_mq_sched_try_merge() Ming Lei
@ 2017-09-02 15:17 ` Ming Lei
  2017-09-04  9:12 ` [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Paolo Valente
  2017-09-19 19:25 ` Omar Sandoval
  15 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-02 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman, Ming Lei

This patch uses a hash table to do bio merging from the sw queue,
so that we align with the way blk-mq schedulers and the legacy
block path do bio merging.

It turns out that bio merging via the hash table is more efficient
than the simple merge against the last 8 requests in the sw queue.
On SCSI SRP, a ~10% IOPS increase is observed in a sequential I/O
test with this patch.

It is also one step forward to a real 'none' scheduler, with which
the blk-mq scheduler framework can become cleaner.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 49 ++++++++++++-------------------------------------
 block/blk-mq.c       | 28 +++++++++++++++++++++++++---
 2 files changed, 37 insertions(+), 40 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index afa7d9a258e4..df41cba49866 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -266,50 +266,25 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
 
-/*
- * Reverse check our software queue for entries that we could potentially
- * merge with. Currently includes a hand-wavy stop count of 8, to not spend
- * too much time checking for merges.
- */
-static bool blk_mq_attempt_merge(struct request_queue *q,
+static bool blk_mq_ctx_try_merge(struct request_queue *q,
 				 struct blk_mq_ctx *ctx, struct bio *bio)
 {
-	struct request *rq;
-	int checked = 8;
+	struct request *rq, *free = NULL;
+	enum elv_merge type;
+	bool merged;
 
 	lockdep_assert_held(&ctx->lock);
 
-	list_for_each_entry_reverse(rq, &ctx->rq_list, queuelist) {
-		bool merged = false;
-
-		if (!checked--)
-			break;
-
-		if (!blk_rq_merge_ok(rq, bio))
-			continue;
+	type = elv_merge_ctx(q, &rq, bio, ctx);
+	merged = __blk_mq_try_merge(q, bio, &free, rq, type);
 
-		switch (blk_try_merge(rq, bio)) {
-		case ELEVATOR_BACK_MERGE:
-			if (blk_mq_sched_allow_merge(q, rq, bio))
-				merged = bio_attempt_back_merge(q, rq, bio);
-			break;
-		case ELEVATOR_FRONT_MERGE:
-			if (blk_mq_sched_allow_merge(q, rq, bio))
-				merged = bio_attempt_front_merge(q, rq, bio);
-			break;
-		case ELEVATOR_DISCARD_MERGE:
-			merged = bio_attempt_discard_merge(q, rq, bio);
-			break;
-		default:
-			continue;
-		}
+	if (free)
+		blk_mq_free_request(free);
 
-		if (merged)
-			ctx->rq_merged++;
-		return merged;
-	}
+	if (merged)
+		ctx->rq_merged++;
 
-	return false;
+	return merged;
 }
 
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
@@ -327,7 +302,7 @@ bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
 	if (hctx->flags & BLK_MQ_F_SHOULD_MERGE) {
 		/* default per sw-queue merge */
 		spin_lock(&ctx->lock);
-		ret = blk_mq_attempt_merge(q, ctx, bio);
+		ret = blk_mq_ctx_try_merge(q, ctx, bio);
 		spin_unlock(&ctx->lock);
 	}
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index fc3d26bbfc1a..d935f15c54da 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -847,6 +847,18 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	blk_queue_exit(q);
 }
 
+static void blk_mq_ctx_remove_rq_list(struct blk_mq_ctx *ctx,
+		struct list_head *head)
+{
+	struct request *rq;
+
+	lockdep_assert_held(&ctx->lock);
+
+	list_for_each_entry(rq, head, queuelist)
+		rqhash_del(rq);
+	ctx->last_merge = NULL;
+}
+
 struct flush_busy_ctx_data {
 	struct blk_mq_hw_ctx *hctx;
 	struct list_head *list;
@@ -861,6 +873,7 @@ static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
 	sbitmap_clear_bit(sb, bitnr);
 	spin_lock(&ctx->lock);
 	list_splice_tail_init(&ctx->rq_list, flush_data->list);
+	blk_mq_ctx_remove_rq_list(ctx, flush_data->list);
 	spin_unlock(&ctx->lock);
 	return true;
 }
@@ -890,17 +903,23 @@ static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, void *d
 	struct dispatch_rq_data *dispatch_data = data;
 	struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
 	struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
+	struct request *rq = NULL;
 
 	spin_lock(&ctx->lock);
 	if (unlikely(!list_empty(&ctx->rq_list))) {
-		dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
-		list_del_init(&dispatch_data->rq->queuelist);
+		rq = list_entry_rq(ctx->rq_list.next);
+		list_del_init(&rq->queuelist);
+		rqhash_del(rq);
 		if (list_empty(&ctx->rq_list))
 			sbitmap_clear_bit(sb, bitnr);
 	}
+	if (ctx->last_merge == rq)
+		ctx->last_merge = NULL;
 	spin_unlock(&ctx->lock);
 
-	return !dispatch_data->rq;
+	dispatch_data->rq = rq;
+
+	return !rq;
 }
 
 struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
@@ -1431,6 +1450,8 @@ static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
 		list_add(&rq->queuelist, &ctx->rq_list);
 	else
 		list_add_tail(&rq->queuelist, &ctx->rq_list);
+
+	rqhash_add(ctx->hash, rq);
 }
 
 void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
@@ -1923,6 +1944,7 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	spin_lock(&ctx->lock);
 	if (!list_empty(&ctx->rq_list)) {
 		list_splice_init(&ctx->rq_list, &tmp);
+		blk_mq_ctx_remove_rq_list(ctx, &tmp);
 		blk_mq_hctx_clear_pending(hctx, ctx);
 	}
 	spin_unlock(&ctx->lock);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (13 preceding siblings ...)
  2017-09-02 15:17 ` [PATCH V4 14/14] blk-mq: improve bio merge from blk-mq sw queue Ming Lei
@ 2017-09-04  9:12 ` Paolo Valente
  2017-09-05  1:39   ` Ming Lei
  2017-09-19 19:25 ` Omar Sandoval
  15 siblings, 1 reply; 50+ messages in thread
From: Paolo Valente @ 2017-09-04  9:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Mel Gorman


> Il giorno 02 set 2017, alle ore 17:17, Ming Lei <ming.lei@redhat.com> ha scritto:
> 
> [...]
> 

Tested-by: Paolo Valente <paolo.valente@linaro.org>

> Ming Lei (14):
>  blk-mq-sched: fix scheduler bad performance
>  sbitmap: introduce __sbitmap_for_each_set()
>  blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
>  blk-mq-sched: move actual dispatching into one helper
>  blk-mq-sched: improve dispatching from sw queue
>  blk-mq-sched: don't dequeue request until all in ->dispatch are
>    flushed
>  blk-mq-sched: introduce blk_mq_sched_queue_depth()
>  blk-mq-sched: use q->queue_depth as hint for q->nr_requests
>  block: introduce rqhash helpers
>  block: move actual bio merge code into __elv_merge
>  block: add check on elevator for supporting bio merge via hashtable
>    from blk-mq sw queue
>  block: introduce .last_merge and .hash to blk_mq_ctx
>  blk-mq-sched: refactor blk_mq_sched_try_merge()
>  blk-mq: improve bio merge from blk-mq sw queue
> 
> block/blk-mq-debugfs.c  |   1 +
> block/blk-mq-sched.c    | 186 +++++++++++++++++++++++++++++++-----------------
> block/blk-mq-sched.h    |  23 ++++++
> block/blk-mq.c          |  93 +++++++++++++++++++++++-
> block/blk-mq.h          |   7 ++
> block/blk-settings.c    |   2 +
> block/blk.h             |  55 ++++++++++++++
> block/elevator.c        |  93 ++++++++++++++----------
> include/linux/blk-mq.h  |   3 +
> include/linux/sbitmap.h |  54 ++++++++++----
> 10 files changed, 399 insertions(+), 118 deletions(-)
> 
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance
  2017-09-04  9:12 ` [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Paolo Valente
@ 2017-09-05  1:39   ` Ming Lei
  2017-09-06 15:27     ` Ming Lei
  0 siblings, 1 reply; 50+ messages in thread
From: Ming Lei @ 2017-09-05  1:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Paolo Valente, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Mel Gorman

On Mon, Sep 04, 2017 at 11:12:49AM +0200, Paolo Valente wrote:
> 
> > Il giorno 02 set 2017, alle ore 17:17, Ming Lei <ming.lei@redhat.com> ha scritto:
> > 
> > Hi,
> > 
> > In Red Hat internal storage test wrt. blk-mq scheduler, we
> > found that I/O performance is much bad with mq-deadline, especially
> > about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
> > SRP...)
> > 
> > Turns out one big issue causes the performance regression: requests
> > are still dequeued from sw queue/scheduler queue even when ldd's
> > queue is busy, so I/O merge becomes quite difficult to make, then
> > sequential IO degrades a lot.
> > 
> > The 1st five patches improve this situation, and brings back
> > some performance loss.
> > 
> > Patch 6 ~ 7 uses q->queue_depth as hint for setting up
> > scheduler queue depth.
> > 
> > Patch 8 ~ 15 improve bio merge via hash table in sw queue,
> > which makes bio merge more efficient than current approch
> > in which only the last 8 requests are checked. Since patch
> > 6~14 converts to the scheduler way of dequeuing one request
> > from sw queue one time for SCSI device, and the times of
> > acquring ctx->lock is increased, and merging bio via hash
> > table decreases holding time of ctx->lock and should eliminate
> > effect from patch 14. 
> > 
> > With this changes, SCSI-MQ sequential I/O performance is
> > improved much, Paolo reported that mq-deadline performance
> > improved much[2] in his dbench test wrt V2. Also performanc
> > improvement on lpfc/qla2xx was observed with V1.[1]
> > 
> > Also Bart worried that this patchset may affect SRP, so provide
> > test data on SCSI SRP this time:
> > 
> > - fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs)
> > - system(16 cores, dual sockets, mem: 96G)
> > 
> >          |v4.13-rc6+*  |v4.13-rc6+   | patched v4.13-rc6+ 
> > -----------------------------------------------------
> > IOPS(K)  |  DEADLINE   |    NONE     |    NONE     
> > -----------------------------------------------------
> > read      |      587.81 |      511.96 |      518.51 
> > -----------------------------------------------------
> > randread  |      116.44 |      142.99 |      142.46 
> > -----------------------------------------------------
> > write     |      580.87 |       536.4 |      582.15 
> > -----------------------------------------------------
> > randwrite |      104.95 |      124.89 |      123.99 
> > -----------------------------------------------------
> > 
> > 
> >          |v4.13-rc6+   |v4.13-rc6+   | patched v4.13-rc6+ 
> > -----------------------------------------------------
> > IOPS(K)  |  DEADLINE   |MQ-DEADLINE  |MQ-DEADLINE  
> > -----------------------------------------------------
> > read      |      587.81 |       158.7 |      450.41 
> > -----------------------------------------------------
> > randread  |      116.44 |      142.04 |      142.72 
> > -----------------------------------------------------
> > write     |      580.87 |      136.61 |      569.37 
> > -----------------------------------------------------
> > randwrite |      104.95 |      123.14 |      124.36 
> > -----------------------------------------------------
> > 
> > *: v4.13-rc6+ means v4.13-rc6 with block for-next
> > 
> > 
> > Please consider to merge to V4.4.
> > 
> > [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
> > [2] https://marc.info/?l=linux-block&m=150217980602843&w=2
> > 
> > V4:
> > 	- add Reviewed-by tag
> > 	- some trival change: typo fix in commit log or comment,
> > 	variable name, no actual functional change
> > 
> > V3:
> > 	- totally round robin for picking req from ctx, as suggested
> > 	by Bart
> > 	- remove one local variable in __sbitmap_for_each_set()
> > 	- drop patches of single dispatch list, which can improve
> > 	performance on mq-deadline, but cause a bit degrade on
> > 	none because all hctxs need to be checked after ->dispatch
> > 	is flushed. Will post it again once it is mature.
> > 	- rebase on v4.13-rc6 with block for-next
> > 
> > V2:
> > 	- dequeue request from sw queues in round roubin's style
> > 	as suggested by Bart, and introduces one helper in sbitmap
> > 	for this purpose
> > 	- improve bio merge via hash table from sw queue
> > 	- add comments about using DISPATCH_BUSY state in lockless way,
> > 	simplifying handling on busy state,
> > 	- hold ctx->lock when clearing ctx busy bit as suggested
> > 	by Bart
> > 
> > 
> 
> Tested-by: Paolo Valente <paolo.valente@linaro.org>

Hi Jens,

Is there any chance to make this patchset merged to V4.4?


Thanks,
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance
  2017-09-05  1:39   ` Ming Lei
@ 2017-09-06 15:27     ` Ming Lei
  0 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-06 15:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Paolo Valente, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Mel Gorman

On Tue, Sep 05, 2017 at 09:39:51AM +0800, Ming Lei wrote:
> On Mon, Sep 04, 2017 at 11:12:49AM +0200, Paolo Valente wrote:
> > 
> > > Il giorno 02 set 2017, alle ore 17:17, Ming Lei <ming.lei@redhat.com> ha scritto:
> > > 
> > > [...]
> > > 
> > 
> > Tested-by: Paolo Valente <paolo.valente@linaro.org>
> 
> Hi Jens,
> 
> Is there any chance to make this patchset merged to V4.4?

Hi Jens,

Ping...

Thanks,
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set()
  2017-09-02 15:17 ` [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
@ 2017-09-08 20:43   ` Omar Sandoval
  2017-09-09  9:38     ` Ming Lei
  0 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2017-09-08 20:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman, Omar Sandoval

On Sat, Sep 02, 2017 at 11:17:17PM +0800, Ming Lei wrote:
> We need to iterate ctx starting from any ctx in round robin
> way, so introduce this helper.
> 
> Cc: Omar Sandoval <osandov@fb.com>

A couple of comments below, once you address those you can add

Reviewed-by: Omar Sandoval <osandov@fb.com>

> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  include/linux/sbitmap.h | 54 ++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 40 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index a1904aadbc45..2329b9e1a0e2 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -211,10 +211,14 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb);
>   */
>  bool sbitmap_any_bit_clear(const struct sbitmap *sb);
>  
> +#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
> +#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
> +
>  typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
>  
>  /**
>   * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
> + * @off: Where to start the iteration
>   * @sb: Bitmap to iterate over.
>   * @fn: Callback. Should return true to continue or false to break early.
>   * @data: Pointer to pass to callback.
> @@ -222,35 +226,57 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
>   * This is inline even though it's non-trivial so that the function calls to the
>   * callback will hopefully get optimized away.
>   */
> -static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
> -					void *data)
> +static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> +					  unsigned int off,
> +					  sb_for_each_fn fn, void *data)
>  {
> -	unsigned int i;
> +	unsigned int index = SB_NR_TO_INDEX(sb, off);
> +	unsigned int nr = SB_NR_TO_BIT(sb, off);
> +	unsigned int scanned = 0;
>  
> -	for (i = 0; i < sb->map_nr; i++) {
> -		struct sbitmap_word *word = &sb->map[i];
> -		unsigned int off, nr;
> +	while (1) {
> +		struct sbitmap_word *word = &sb->map[index];
> +		unsigned int depth = min_t(unsigned int, word->depth - nr,
> +					   sb->depth - scanned);
>  
> +		scanned += depth;
>  		if (!word->word)
> -			continue;
> +			goto next;
>  
> -		nr = 0;
> -		off = i << sb->shift;
> +		depth += nr;

I had to think hard to convince myself this was right. If above we set
depth to (sb->depth - scanned), then we must have already looped at
least once, so nr must be 0, therefore this is okay. Am I following this
correctly? I think reassigning like so would be more clear:

		depth = min_t(unsigned int, word->depth, sb->depth - scanned);

> +		off = index << sb->shift;
>  		while (1) {
> -			nr = find_next_bit(&word->word, word->depth, nr);
> -			if (nr >= word->depth)
> +			nr = find_next_bit(&word->word, depth, nr);
> +			if (nr >= depth)
>  				break;
> -
>  			if (!fn(sb, off + nr, data))
>  				return;
>  
>  			nr++;
>  		}
> + next:
> +		if (scanned >= sb->depth)
> +			break;
> +		nr = 0;
> +		if (++index >= sb->map_nr)
> +			index = 0;
>  	}
>  }
>  
> -#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
> -#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
> +/**
> + * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
> + * @sb: Bitmap to iterate over.
> + * @fn: Callback. Should return true to continue or false to break early.
> + * @data: Pointer to pass to callback.


> + *
> + * This is inline even though it's non-trivial so that the function calls to the
> + * callback will hopefully get optimized away.

This part of the comment doesn't make sense for this wrapper, please
remove it.

> + */
> +static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
> +					void *data)
> +{
> +	__sbitmap_for_each_set(sb, 0, fn, data);
> +}
>  
>  static inline unsigned long *__sbitmap_word(struct sbitmap *sb,
>  					    unsigned int bitnr)
> -- 
> 2.9.5
> 
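
As a side note, here is a minimal usage sketch of the new helper, just
to show the calling convention (the callback below is hypothetical and
not part of this series):

#include <linux/sbitmap.h>

/* callback: count every set bit; returning true keeps the scan going */
static bool count_set_bit(struct sbitmap *sb, unsigned int bitnr, void *data)
{
	unsigned int *count = data;

	(*count)++;
	return true;
}

/* visit all set bits, but start (and wrap around) at 'off' */
static unsigned int count_set_from(struct sbitmap *sb, unsigned int off)
{
	unsigned int count = 0;

	__sbitmap_for_each_set(sb, off, count_set_bit, &count);
	return count;
}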

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 01/14] blk-mq-sched: fix scheduler bad performance
  2017-09-02 15:17 ` [PATCH V4 01/14] blk-mq-sched: fix scheduler bad performance Ming Lei
@ 2017-09-08 20:48   ` Omar Sandoval
  2017-09-08 20:54     ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2017-09-08 20:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Sat, Sep 02, 2017 at 11:17:16PM +0800, Ming Lei wrote:
> When hw queue is busy, we shouldn't take requests from
> scheduler queue any more, otherwise it is difficult to do
> IO merge.
> 
> This patch fixes the awful IO performance on some
> SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber
> is used by not taking requests if hw queue is busy.
> 
> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>

Jens and I did some testing with xfs/297 and couldn't find any issues,
so

Reviewed-by: Omar Sandoval <osandov@fb.com>

> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-sched.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 4ab69435708c..845e5baf8af1 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	struct request_queue *q = hctx->queue;
>  	struct elevator_queue *e = q->elevator;
>  	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> -	bool did_work = false;
> +	bool do_sched_dispatch = true;
>  	LIST_HEAD(rq_list);
>  
>  	/* RCU or SRCU read lock is needed before checking quiesced flag */
> @@ -125,7 +125,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	 */
>  	if (!list_empty(&rq_list)) {
>  		blk_mq_sched_mark_restart_hctx(hctx);
> -		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
> +		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
>  	} else if (!has_sched_dispatch) {
>  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
>  		blk_mq_dispatch_rq_list(q, &rq_list);
> @@ -136,7 +136,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	 * on the dispatch list, OR if we did have work but weren't able
>  	 * to make progress.
>  	 */
> -	if (!did_work && has_sched_dispatch) {
> +	if (do_sched_dispatch && has_sched_dispatch) {
>  		do {
>  			struct request *rq;
>  
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 01/14] blk-mq-sched: fix scheduler bad performance
  2017-09-08 20:48   ` Omar Sandoval
@ 2017-09-08 20:54     ` Omar Sandoval
  2017-09-08 20:56       ` Omar Sandoval
  2017-09-09  7:33       ` Ming Lei
  0 siblings, 2 replies; 50+ messages in thread
From: Omar Sandoval @ 2017-09-08 20:54 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Fri, Sep 08, 2017 at 01:48:32PM -0700, Omar Sandoval wrote:
> On Sat, Sep 02, 2017 at 11:17:16PM +0800, Ming Lei wrote:
> > When hw queue is busy, we shouldn't take requests from
> > scheduler queue any more, otherwise it is difficult to do
> > IO merge.
> > 
> > This patch fixes the awful IO performance on some
> > SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber
> > is used by not taking requests if hw queue is busy.
> > 
> > Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> 
> Jens and I did some testing with xfs/297 and couldn't find any issues,
> so
> 
> Reviewed-by: Omar Sandoval <osandov@fb.com>

I take it back, the comment below needs to be updated

> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq-sched.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 4ab69435708c..845e5baf8af1 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  	struct request_queue *q = hctx->queue;
> >  	struct elevator_queue *e = q->elevator;
> >  	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> > -	bool did_work = false;
> > +	bool do_sched_dispatch = true;
> >  	LIST_HEAD(rq_list);
> >  
> >  	/* RCU or SRCU read lock is needed before checking quiesced flag */
> > @@ -125,7 +125,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  	 */
> >  	if (!list_empty(&rq_list)) {
> >  		blk_mq_sched_mark_restart_hctx(hctx);
> > -		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
> > +		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> >  	} else if (!has_sched_dispatch) {
> >  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> >  		blk_mq_dispatch_rq_list(q, &rq_list);
> > @@ -136,7 +136,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  	 * on the dispatch list, OR if we did have work but weren't able
> >  	 * to make progress.
> >  	 */

This should be updated to say:

We want to dispatch from the scheduler if there was nothing on the
dispatch list or we were able to dispatch from the dispatch list.

> > -	if (!did_work && has_sched_dispatch) {
> > +	if (do_sched_dispatch && has_sched_dispatch) {
> >  		do {
> >  			struct request *rq;
> >  
> > -- 
> > 2.9.5
> > 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 01/14] blk-mq-sched: fix scheduler bad performance
  2017-09-08 20:54     ` Omar Sandoval
@ 2017-09-08 20:56       ` Omar Sandoval
  2017-09-09  7:43         ` Ming Lei
  2017-09-09  7:33       ` Ming Lei
  1 sibling, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2017-09-08 20:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Fri, Sep 08, 2017 at 01:54:48PM -0700, Omar Sandoval wrote:
> On Fri, Sep 08, 2017 at 01:48:32PM -0700, Omar Sandoval wrote:
> > On Sat, Sep 02, 2017 at 11:17:16PM +0800, Ming Lei wrote:
> > > When hw queue is busy, we shouldn't take requests from
> > > scheduler queue any more, otherwise it is difficult to do
> > > IO merge.
> > > 
> > > This patch fixes the awful IO performance on some
> > > SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber
> > > is used by not taking requests if hw queue is busy.
> > > 
> > > Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> > 
> > Jens and I did some testing with xfs/297 and couldn't find any issues,
> > so
> > 
> > Reviewed-by: Omar Sandoval <osandov@fb.com>
> 
> I take it back, the comment below needs to be updated
> 
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  block/blk-mq-sched.c | 6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > index 4ab69435708c..845e5baf8af1 100644
> > > --- a/block/blk-mq-sched.c
> > > +++ b/block/blk-mq-sched.c
> > > @@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > >  	struct request_queue *q = hctx->queue;
> > >  	struct elevator_queue *e = q->elevator;
> > >  	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> > > -	bool did_work = false;
> > > +	bool do_sched_dispatch = true;
> > >  	LIST_HEAD(rq_list);
> > >  
> > >  	/* RCU or SRCU read lock is needed before checking quiesced flag */
> > > @@ -125,7 +125,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > >  	 */
> > >  	if (!list_empty(&rq_list)) {
> > >  		blk_mq_sched_mark_restart_hctx(hctx);
> > > -		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
> > > +		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> > >  	} else if (!has_sched_dispatch) {
> > >  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > >  		blk_mq_dispatch_rq_list(q, &rq_list);
> > > @@ -136,7 +136,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > >  	 * on the dispatch list, OR if we did have work but weren't able
> > >  	 * to make progress.
> > >  	 */
> 
> This should be updated to say:
> 
> We want to dispatch from the scheduler if there was nothing on the
> dispatch list or we were able to dispatch from the dispatch list.

By the way, did you experiment with changing blk_mq_dispatch_rq_list()
to return true IFF _all_ requests were dispatched, instead of _any_
requests were dispatched?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue
  2017-09-02 15:17 ` [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue Ming Lei
@ 2017-09-08 23:54   ` Omar Sandoval
  2017-09-10  4:45     ` Ming Lei
  2017-09-19 20:37   ` Jens Axboe
  1 sibling, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2017-09-08 23:54 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Sat, Sep 02, 2017 at 11:17:20PM +0800, Ming Lei wrote:
> SCSI devices use host-wide tagset, and the shared
> driver tag space is often quite big. Meantime
> there is also queue depth for each lun(.cmd_per_lun),
> which is often small.
> 
> So lots of requests may stay in sw queue, and we
> always flush all belonging to same hw queue and
> dispatch them all to driver, unfortunately it is
> easy to cause queue busy because of the small
> per-lun queue depth. Once these requests are flushed
> out, they have to stay in hctx->dispatch, and no bio
> merge can participate into these requests, and
> sequential IO performance is hurted.
> 
> This patch improves dispatching from sw queue when
> there is per-request-queue queue depth by taking
> request one by one from sw queue, just like the way
> of IO scheduler.
> 
> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-sched.c   | 61 +++++++++++++++++++++++++++++++++++++++++++++-----
>  include/linux/blk-mq.h |  2 ++
>  2 files changed, 57 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index f69752961a34..735e432294ab 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -89,9 +89,9 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
>  	return false;
>  }
>  
> -static void blk_mq_do_dispatch(struct request_queue *q,
> -			       struct elevator_queue *e,
> -			       struct blk_mq_hw_ctx *hctx)
> +static void blk_mq_do_dispatch_sched(struct request_queue *q,
> +				     struct elevator_queue *e,
> +				     struct blk_mq_hw_ctx *hctx)
>  {
>  	LIST_HEAD(rq_list);
>  
> @@ -105,6 +105,42 @@ static void blk_mq_do_dispatch(struct request_queue *q,
>  	} while (blk_mq_dispatch_rq_list(q, &rq_list));
>  }
>  
> +static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> +					  struct blk_mq_ctx *ctx)
> +{
> +	unsigned idx = ctx->index_hw;
> +
> +	if (++idx == hctx->nr_ctx)
> +		idx = 0;
> +
> +	return hctx->ctxs[idx];
> +}
> +
> +static void blk_mq_do_dispatch_ctx(struct request_queue *q,
> +				   struct blk_mq_hw_ctx *hctx)
> +{
> +	LIST_HEAD(rq_list);
> +	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> +	bool dispatched;
> +
> +	do {
> +		struct request *rq;
> +
> +		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
> +		if (!rq)
> +			break;
> +		list_add(&rq->queuelist, &rq_list);
> +
> +		/* round robin for fair dispatch */
> +		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);

Hm... this next ctx will get skipped if the dispatch on the previous ctx
fails, since we call blk_mq_next_ctx() again. Seems unfair. Maybe move
the blk_mq_next_ctx() from the if (!dispatched) below into the if (!rq)
above?

> +		dispatched = blk_mq_dispatch_rq_list(q, &rq_list);
> +	} while (dispatched);
> +
> +	if (!dispatched)
> +		WRITE_ONCE(hctx->dispatch_from, blk_mq_next_ctx(hctx, ctx));
> +}
> +
>  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct request_queue *q = hctx->queue;
> @@ -142,18 +178,31 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	if (!list_empty(&rq_list)) {
>  		blk_mq_sched_mark_restart_hctx(hctx);
>  		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> -	} else if (!has_sched_dispatch) {
> +	} else if (!has_sched_dispatch && !q->queue_depth) {
> +		/*
> +		 * If there is no per-request_queue depth, we
> +		 * flush all requests in this hw queue, otherwise
> +		 * pick up request one by one from sw queue for
> +		 * avoiding to mess up I/O merge when dispatch
> +		 * is busy, which can be triggered easily by
> +		 * per-request_queue queue depth
> +		 */
>  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
>  		blk_mq_dispatch_rq_list(q, &rq_list);
>  	}
>  
> +	if (!do_sched_dispatch)
> +		return;
> +
>  	/*
>  	 * We want to dispatch from the scheduler if we had no work left
>  	 * on the dispatch list, OR if we did have work but weren't able
>  	 * to make progress.
>  	 */
> -	if (do_sched_dispatch && has_sched_dispatch)
> -		blk_mq_do_dispatch(q, e, hctx);
> +	if (has_sched_dispatch)
> +		blk_mq_do_dispatch_sched(q, e, hctx);
> +	else
> +		blk_mq_do_dispatch_ctx(q, hctx);
>  }
>  
>  bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 50c6485cb04f..7b7a366a97f3 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -30,6 +30,8 @@ struct blk_mq_hw_ctx {
>  
>  	struct sbitmap		ctx_map;
>  
> +	struct blk_mq_ctx	*dispatch_from;
> +
>  	struct blk_mq_ctx	**ctxs;
>  	unsigned int		nr_ctx;
>  
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 01/14] blk-mq-sched: fix scheduler bad performance
  2017-09-08 20:54     ` Omar Sandoval
  2017-09-08 20:56       ` Omar Sandoval
@ 2017-09-09  7:33       ` Ming Lei
  1 sibling, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-09  7:33 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Fri, Sep 08, 2017 at 01:54:48PM -0700, Omar Sandoval wrote:
> On Fri, Sep 08, 2017 at 01:48:32PM -0700, Omar Sandoval wrote:
> > On Sat, Sep 02, 2017 at 11:17:16PM +0800, Ming Lei wrote:
> > > When hw queue is busy, we shouldn't take requests from
> > > scheduler queue any more, otherwise it is difficult to do
> > > IO merge.
> > > 
> > > This patch fixes the awful IO performance on some
> > > SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber
> > > is used by not taking requests if hw queue is busy.
> > > 
> > > Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> > 
> > Jens and I did some testing with xfs/297 and couldn't find any issues,
> > so
> > 
> > Reviewed-by: Omar Sandoval <osandov@fb.com>
> 
> I take it back, the comment below needs to be updated
> 
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  block/blk-mq-sched.c | 6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > index 4ab69435708c..845e5baf8af1 100644
> > > --- a/block/blk-mq-sched.c
> > > +++ b/block/blk-mq-sched.c
> > > @@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > >  	struct request_queue *q = hctx->queue;
> > >  	struct elevator_queue *e = q->elevator;
> > >  	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> > > -	bool did_work = false;
> > > +	bool do_sched_dispatch = true;
> > >  	LIST_HEAD(rq_list);
> > >  
> > >  	/* RCU or SRCU read lock is needed before checking quiesced flag */
> > > @@ -125,7 +125,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > >  	 */
> > >  	if (!list_empty(&rq_list)) {
> > >  		blk_mq_sched_mark_restart_hctx(hctx);
> > > -		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
> > > +		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> > >  	} else if (!has_sched_dispatch) {
> > >  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > >  		blk_mq_dispatch_rq_list(q, &rq_list);
> > > @@ -136,7 +136,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > >  	 * on the dispatch list, OR if we did have work but weren't able
> > >  	 * to make progress.
> > >  	 */
> 
> This should be updated to say:
> 
> We want to dispatch from the scheduler if there was nothing on the
> dispatch list or we were able to dispatch from the dispatch list.

OK, will do it in V5.

-- 
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 01/14] blk-mq-sched: fix scheduler bad performance
  2017-09-08 20:56       ` Omar Sandoval
@ 2017-09-09  7:43         ` Ming Lei
  0 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-09  7:43 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Fri, Sep 08, 2017 at 01:56:29PM -0700, Omar Sandoval wrote:
> On Fri, Sep 08, 2017 at 01:54:48PM -0700, Omar Sandoval wrote:
> > On Fri, Sep 08, 2017 at 01:48:32PM -0700, Omar Sandoval wrote:
> > > On Sat, Sep 02, 2017 at 11:17:16PM +0800, Ming Lei wrote:
> > > > When hw queue is busy, we shouldn't take requests from
> > > > scheduler queue any more, otherwise it is difficult to do
> > > > IO merge.
> > > > 
> > > > This patch fixes the awful IO performance on some
> > > > SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber
> > > > is used by not taking requests if hw queue is busy.
> > > > 
> > > > Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> > > 
> > > Jens and I did some testing with xfs/297 and couldn't find any issues,
> > > so
> > > 
> > > Reviewed-by: Omar Sandoval <osandov@fb.com>
> > 
> > I take it back, the comment below needs to be updated
> > 
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > >  block/blk-mq-sched.c | 6 +++---
> > > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > index 4ab69435708c..845e5baf8af1 100644
> > > > --- a/block/blk-mq-sched.c
> > > > +++ b/block/blk-mq-sched.c
> > > > @@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > > >  	struct request_queue *q = hctx->queue;
> > > >  	struct elevator_queue *e = q->elevator;
> > > >  	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> > > > -	bool did_work = false;
> > > > +	bool do_sched_dispatch = true;
> > > >  	LIST_HEAD(rq_list);
> > > >  
> > > >  	/* RCU or SRCU read lock is needed before checking quiesced flag */
> > > > @@ -125,7 +125,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > > >  	 */
> > > >  	if (!list_empty(&rq_list)) {
> > > >  		blk_mq_sched_mark_restart_hctx(hctx);
> > > > -		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
> > > > +		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> > > >  	} else if (!has_sched_dispatch) {
> > > >  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > > >  		blk_mq_dispatch_rq_list(q, &rq_list);
> > > > @@ -136,7 +136,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > > >  	 * on the dispatch list, OR if we did have work but weren't able
> > > >  	 * to make progress.
> > > >  	 */
> > 
> > This should be updated to say:
> > 
> > We want to dispatch from the scheduler if there was nothing on the
> > dispatch list or we were able to dispatch from the dispatch list.
> 
> By the way, did you experiment with changing blk_mq_dispatch_rq_list()
> to return true IFF _all_ requests were dispatched, instead of _any_
> requests were dispatched?

I understand that changes nothing, because dispatch may happen
concurrently, and queue busy may be triggered in any one of those
contexts and cause a rq to be added to ->dispatch.

So even if all requests are dispatched successfully from
blk_mq_dispatch_rq_list(), a rq may still be added to ->dispatch
from other contexts; that is why there is the check in the patch
"blk-mq-sched: don't dequeue request until all in ->dispatch are flushed".

-- 
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set()
  2017-09-08 20:43   ` Omar Sandoval
@ 2017-09-09  9:38     ` Ming Lei
  2017-09-10 17:20       ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: Ming Lei @ 2017-09-09  9:38 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman, Omar Sandoval

On Fri, Sep 08, 2017 at 01:43:41PM -0700, Omar Sandoval wrote:
> On Sat, Sep 02, 2017 at 11:17:17PM +0800, Ming Lei wrote:
> > We need to iterate ctx starting from any ctx in round robin
> > way, so introduce this helper.
> > 
> > Cc: Omar Sandoval <osandov@fb.com>
> 
> A couple of comments below, once you address those you can add
> 
> Reviewed-by: Omar Sandoval <osandov@fb.com>
> 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  include/linux/sbitmap.h | 54 ++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 40 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> > index a1904aadbc45..2329b9e1a0e2 100644
> > --- a/include/linux/sbitmap.h
> > +++ b/include/linux/sbitmap.h
> > @@ -211,10 +211,14 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb);
> >   */
> >  bool sbitmap_any_bit_clear(const struct sbitmap *sb);
> >  
> > +#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
> > +#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
> > +
> >  typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
> >  
> >  /**
> >   * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
> > + * @off: Where to start the iteration
> >   * @sb: Bitmap to iterate over.
> >   * @fn: Callback. Should return true to continue or false to break early.
> >   * @data: Pointer to pass to callback.
> > @@ -222,35 +226,57 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
> >   * This is inline even though it's non-trivial so that the function calls to the
> >   * callback will hopefully get optimized away.
> >   */
> > -static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
> > -					void *data)
> > +static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> > +					  unsigned int off,
> > +					  sb_for_each_fn fn, void *data)
> >  {
> > -	unsigned int i;
> > +	unsigned int index = SB_NR_TO_INDEX(sb, off);
> > +	unsigned int nr = SB_NR_TO_BIT(sb, off);
> > +	unsigned int scanned = 0;
> >  
> > -	for (i = 0; i < sb->map_nr; i++) {
> > -		struct sbitmap_word *word = &sb->map[i];
> > -		unsigned int off, nr;
> > +	while (1) {
> > +		struct sbitmap_word *word = &sb->map[index];
> > +		unsigned int depth = min_t(unsigned int, word->depth - nr,
> > +					   sb->depth - scanned);
> >  
> > +		scanned += depth;
> >  		if (!word->word)
> > -			continue;
> > +			goto next;
> >  
> > -		nr = 0;
> > -		off = i << sb->shift;
> > +		depth += nr;
> 
> I had to think hard to convince myself this was right. If above we set
> depth to (sb->depth - scanned), then we must have already looped at

That can only happen in the last iteration of the loop, in which the
1st part of the word has to be checked because we started the scan
from the 2nd part of the same word.

> least once, so nr must be 0, therefore this is okay. Am I following this
> correctly? I think reassigning like so would be more clear:

Yes, you are right, nr can be non-zero only in the 1st loop.

> 
> 		depth = min_t(unsigned int, word->depth, sb->depth - scanned);

If nr isn't zero, the depth to be scanned should be 'word->depth - nr'
in the 1st loop, so the above way can't cover this case.
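
To make the two cases concrete, here is a worked trace (assuming
sb->depth = 32 and sb->shift = 3, i.e. 8 bits per word, and a start
offset of 20; the numbers are only for illustration):

	iteration 1: index = 2, nr = 4, scans bits 4..7 of word 2
	iteration 2: index = 3, nr = 0, scans bits 0..7 of word 3
	iteration 3: index = 0, nr = 0, scans bits 0..7 of word 0
	iteration 4: index = 1, nr = 0, scans bits 0..7 of word 1
	iteration 5: index = 2, nr = 0, 'sb->depth - scanned' limits the
	             scan to bits 0..3 of word 2; nr is guaranteed to be
	             0 by this point, which is why 'depth += nr' is
	             harmless in that case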

> 
> > +		off = index << sb->shift;
> >  		while (1) {
> > -			nr = find_next_bit(&word->word, word->depth, nr);
> > -			if (nr >= word->depth)
> > +			nr = find_next_bit(&word->word, depth, nr);
> > +			if (nr >= depth)
> >  				break;
> > -
> >  			if (!fn(sb, off + nr, data))
> >  				return;
> >  
> >  			nr++;
> >  		}
> > + next:
> > +		if (scanned >= sb->depth)
> > +			break;
> > +		nr = 0;
> > +		if (++index >= sb->map_nr)
> > +			index = 0;
> >  	}
> >  }
> >  
> > -#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
> > -#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
> > +/**
> > + * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
> > + * @sb: Bitmap to iterate over.
> > + * @fn: Callback. Should return true to continue or false to break early.
> > + * @data: Pointer to pass to callback.
> 
> 
> > + *
> > + * This is inline even though it's non-trivial so that the function calls to the
> > + * callback will hopefully get optimized away.
> 
> This part of the comment doesn't make sense for this wrapper, please
> remove it.

OK.

-- 
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue
  2017-09-08 23:54   ` Omar Sandoval
@ 2017-09-10  4:45     ` Ming Lei
  2017-09-10 17:38       ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: Ming Lei @ 2017-09-10  4:45 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Fri, Sep 08, 2017 at 04:54:39PM -0700, Omar Sandoval wrote:
> On Sat, Sep 02, 2017 at 11:17:20PM +0800, Ming Lei wrote:
> > SCSI devices use host-wide tagset, and the shared
> > driver tag space is often quite big. Meantime
> > there is also queue depth for each lun(.cmd_per_lun),
> > which is often small.
> > 
> > So lots of requests may stay in sw queue, and we
> > always flush all belonging to same hw queue and
> > dispatch them all to driver, unfortunately it is
> > easy to cause queue busy because of the small
> > per-lun queue depth. Once these requests are flushed
> > out, they have to stay in hctx->dispatch, and no bio
> > merge can participate into these requests, and
> > sequential IO performance is hurted.
> > 
> > This patch improves dispatching from sw queue when
> > there is per-request-queue queue depth by taking
> > request one by one from sw queue, just like the way
> > of IO scheduler.
> > 
> > Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq-sched.c   | 61 +++++++++++++++++++++++++++++++++++++++++++++-----
> >  include/linux/blk-mq.h |  2 ++
> >  2 files changed, 57 insertions(+), 6 deletions(-)
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index f69752961a34..735e432294ab 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -89,9 +89,9 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
> >  	return false;
> >  }
> >  
> > -static void blk_mq_do_dispatch(struct request_queue *q,
> > -			       struct elevator_queue *e,
> > -			       struct blk_mq_hw_ctx *hctx)
> > +static void blk_mq_do_dispatch_sched(struct request_queue *q,
> > +				     struct elevator_queue *e,
> > +				     struct blk_mq_hw_ctx *hctx)
> >  {
> >  	LIST_HEAD(rq_list);
> >  
> > @@ -105,6 +105,42 @@ static void blk_mq_do_dispatch(struct request_queue *q,
> >  	} while (blk_mq_dispatch_rq_list(q, &rq_list));
> >  }
> >  
> > +static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> > +					  struct blk_mq_ctx *ctx)
> > +{
> > +	unsigned idx = ctx->index_hw;
> > +
> > +	if (++idx == hctx->nr_ctx)
> > +		idx = 0;
> > +
> > +	return hctx->ctxs[idx];
> > +}
> > +
> > +static void blk_mq_do_dispatch_ctx(struct request_queue *q,
> > +				   struct blk_mq_hw_ctx *hctx)
> > +{
> > +	LIST_HEAD(rq_list);
> > +	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> > +	bool dispatched;
> > +
> > +	do {
> > +		struct request *rq;
> > +
> > +		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
> > +		if (!rq)
> > +			break;
> > +		list_add(&rq->queuelist, &rq_list);
> > +
> > +		/* round robin for fair dispatch */
> > +		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> 
> Hm... this next ctx will get skipped if the dispatch on the previous ctx
> fails, since we call blk_mq_next_ctx() again. Seems unfair. Maybe move
> the blk_mq_next_ctx() from the if (!dispatched) below into the if (!rq)
> above?

In the (!rq) case, there isn't any request left in any ctx belonging
to this hctx, so it is reasonable to start the next dispatch from any
one of these ctxs, including the next one.

If dispatch fails on the previous ctx, the rq taken from that ctx will
be put into ->dispatch, so it is fair to start the next dispatch from
the next ctx too.


-- 
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set()
  2017-09-09  9:38     ` Ming Lei
@ 2017-09-10 17:20       ` Omar Sandoval
  2017-09-11  4:08         ` Ming Lei
  0 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2017-09-10 17:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman, Omar Sandoval

On Sat, Sep 09, 2017 at 05:38:13PM +0800, Ming Lei wrote:
> On Fri, Sep 08, 2017 at 01:43:41PM -0700, Omar Sandoval wrote:
> > On Sat, Sep 02, 2017 at 11:17:17PM +0800, Ming Lei wrote:
> > > We need to iterate ctx starting from any ctx in round robin
> > > way, so introduce this helper.
> > > 
> > > Cc: Omar Sandoval <osandov@fb.com>
> > 
> > A couple of comments below, once you address those you can add
> > 
> > Reviewed-by: Omar Sandoval <osandov@fb.com>
> > 
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  include/linux/sbitmap.h | 54 ++++++++++++++++++++++++++++++++++++-------------
> > >  1 file changed, 40 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> > > index a1904aadbc45..2329b9e1a0e2 100644
> > > --- a/include/linux/sbitmap.h
> > > +++ b/include/linux/sbitmap.h
> > > @@ -211,10 +211,14 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb);
> > >   */
> > >  bool sbitmap_any_bit_clear(const struct sbitmap *sb);
> > >  
> > > +#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
> > > +#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
> > > +
> > >  typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
> > >  
> > >  /**
> > >   * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
> > > + * @off: Where to start the iteration
> > >   * @sb: Bitmap to iterate over.
> > >   * @fn: Callback. Should return true to continue or false to break early.
> > >   * @data: Pointer to pass to callback.
> > > @@ -222,35 +226,57 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
> > >   * This is inline even though it's non-trivial so that the function calls to the
> > >   * callback will hopefully get optimized away.
> > >   */
> > > -static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
> > > -					void *data)
> > > +static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> > > +					  unsigned int off,
> > > +					  sb_for_each_fn fn, void *data)
> > >  {
> > > -	unsigned int i;
> > > +	unsigned int index = SB_NR_TO_INDEX(sb, off);
> > > +	unsigned int nr = SB_NR_TO_BIT(sb, off);
> > > +	unsigned int scanned = 0;
> > >  
> > > -	for (i = 0; i < sb->map_nr; i++) {
> > > -		struct sbitmap_word *word = &sb->map[i];
> > > -		unsigned int off, nr;
> > > +	while (1) {
> > > +		struct sbitmap_word *word = &sb->map[index];
> > > +		unsigned int depth = min_t(unsigned int, word->depth - nr,
> > > +					   sb->depth - scanned);
> > >  
> > > +		scanned += depth;
> > >  		if (!word->word)
> > > -			continue;
> > > +			goto next;
> > >  
> > > -		nr = 0;
> > > -		off = i << sb->shift;
> > > +		depth += nr;
> > 
> > I had to think hard to convince myself this was right. If above we set
> > depth to (sb->depth - scanned), then we must have already looped at
> 
> It should be so only in the last loop, in which the 1st half of
> the word is to be checked because we start from the 2nd half of
> the same word.
> 
> > least once, so nr must be 0, therefore this is okay. Am I following this
> > correctly? I think reassigning like so would be more clear:
> 
> Yes, you are right, nr can be non-zero only in the 1st loop.
> 
> > 
> > 		depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> 
> If nr isn't zero, the depth to be scanned should be 'word->depth - nr'
> in the 1st loop, so the above way can't cover this case.

What I mean is that you keep the same initialization above, but instead of
		depth += nr
you do
		depth = min_t(unsigned int, word->depth, sb->depth - scanned);
because like I said, the reasoning about why `+= nr` is okay in the
`sb->depth - scanned` case is subtle.

And maybe even replace the
		scanned += depth;
with
		scanned += min_t(unsigned int, word->depth - nr,
	   			 sb->depth - scanned);
I.e., don't reuse the depth local variable for two different things. I'm
nitpicking here but this code is tricky enough as it is.

For completeness, I mean this exactly:

	while (1) {
		struct sbitmap_word *word = &sb->map[index];
		unsigned int depth;

		scanned += min_t(unsigned int, word->depth - nr,
				 sb->depth - scanned);
		if (!word->word)
			goto next;

		depth = min_t(unsigned int, word->depth, sb->depth - scanned);
		off = index << sb->shift;
		while (1) {
			nr = find_next_bit(&word->word, depth, nr);
			if (nr >= depth)
				break;

			if (!fn(sb, off + nr, data))
				return;

			nr++;
		}
next:
		if (scanned >= sb->depth)
			break;
		nr = 0;
		if (++index >= sb->map_nr)
			index = 0;
	}

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue
  2017-09-10  4:45     ` Ming Lei
@ 2017-09-10 17:38       ` Omar Sandoval
  2017-09-11  4:13         ` Ming Lei
  0 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2017-09-10 17:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Sun, Sep 10, 2017 at 12:45:15PM +0800, Ming Lei wrote:
> On Fri, Sep 08, 2017 at 04:54:39PM -0700, Omar Sandoval wrote:
> > On Sat, Sep 02, 2017 at 11:17:20PM +0800, Ming Lei wrote:
> > > SCSI devices use host-wide tagset, and the shared
> > > driver tag space is often quite big. Meantime
> > > there is also queue depth for each lun(.cmd_per_lun),
> > > which is often small.
> > > 
> > > So lots of requests may stay in sw queue, and we
> > > always flush all belonging to same hw queue and
> > > dispatch them all to driver, unfortunately it is
> > > easy to cause queue busy because of the small
> > > per-lun queue depth. Once these requests are flushed
> > > out, they have to stay in hctx->dispatch, and no bio
> > > merge can participate into these requests, and
> > > sequential IO performance is hurt.
> > > 
> > > This patch improves dispatching from sw queue when
> > > there is per-request-queue queue depth by taking
> > > request one by one from sw queue, just like the way
> > > of IO scheduler.
> > > 
> > > Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  block/blk-mq-sched.c   | 61 +++++++++++++++++++++++++++++++++++++++++++++-----
> > >  include/linux/blk-mq.h |  2 ++
> > >  2 files changed, 57 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > index f69752961a34..735e432294ab 100644
> > > --- a/block/blk-mq-sched.c
> > > +++ b/block/blk-mq-sched.c
> > > @@ -89,9 +89,9 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
> > >  	return false;
> > >  }
> > >  
> > > -static void blk_mq_do_dispatch(struct request_queue *q,
> > > -			       struct elevator_queue *e,
> > > -			       struct blk_mq_hw_ctx *hctx)
> > > +static void blk_mq_do_dispatch_sched(struct request_queue *q,
> > > +				     struct elevator_queue *e,
> > > +				     struct blk_mq_hw_ctx *hctx)
> > >  {
> > >  	LIST_HEAD(rq_list);
> > >  
> > > @@ -105,6 +105,42 @@ static void blk_mq_do_dispatch(struct request_queue *q,
> > >  	} while (blk_mq_dispatch_rq_list(q, &rq_list));
> > >  }
> > >  
> > > +static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> > > +					  struct blk_mq_ctx *ctx)
> > > +{
> > > +	unsigned idx = ctx->index_hw;
> > > +
> > > +	if (++idx == hctx->nr_ctx)
> > > +		idx = 0;
> > > +
> > > +	return hctx->ctxs[idx];
> > > +}
> > > +
> > > +static void blk_mq_do_dispatch_ctx(struct request_queue *q,
> > > +				   struct blk_mq_hw_ctx *hctx)
> > > +{
> > > +	LIST_HEAD(rq_list);
> > > +	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> > > +	bool dispatched;
> > > +
> > > +	do {
> > > +		struct request *rq;
> > > +
> > > +		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
> > > +		if (!rq)
> > > +			break;
> > > +		list_add(&rq->queuelist, &rq_list);
> > > +
> > > +		/* round robin for fair dispatch */
> > > +		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> > 
> > Hm... this next ctx will get skipped if the dispatch on the previous ctx
> > fails, since we call blk_mq_next_ctx() again. Seems unfair. Maybe move
> > the blk_mq_next_ctx() from the if (!dispatched) below into the if (!rq)
> > above?
> 
> In case of if (!rq), that means there isn't any request in all ctxs
> belonging to this hctx, so it is reasonable to start the dispatch from
> any one of these ctxs next time, include the next one.

Yep, that case is okay.

> If dispatch fails on previous ctx, the rq from that ctx will be
> put into ->dispatch, so it is fair to start dispatch from next ctx
> next time too.

I'm talking about this case

	LIST_HEAD(rq_list);
	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
	bool dispatched;
 
	/*
	 * Let's say that ctxs 0, 1, and 2 all have requests pending and
	 * hctx->dispatch_from was ctx0, so ctx is ctx0 when we start.
	 */
	do {
		struct request *rq;
 
		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
		if (!rq)
			break;
		list_add(&rq->queuelist, &rq_list);

		/* Now rq is a request from ctx0 */

		/* round robin for fair dispatch */
		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
		/* Now ctx is ctx1. */
 
		dispatched = blk_mq_dispatch_rq_list(q, &rq_list);

		/* If we couldn't dispatch, we break here. */
	} while (dispatched);
 
	if (!dispatched)
		/*
		 * Now we set hctx->dispatch_from to ctx2, so we've
		 * skipped over ctx1.
		 */
		WRITE_ONCE(hctx->dispatch_from, blk_mq_next_ctx(hctx, ctx));

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set()
  2017-09-10 17:20       ` Omar Sandoval
@ 2017-09-11  4:08         ` Ming Lei
  2017-09-13 18:37           ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: Ming Lei @ 2017-09-11  4:08 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman, Omar Sandoval

On Sun, Sep 10, 2017 at 10:20:27AM -0700, Omar Sandoval wrote:
> On Sat, Sep 09, 2017 at 05:38:13PM +0800, Ming Lei wrote:
> > On Fri, Sep 08, 2017 at 01:43:41PM -0700, Omar Sandoval wrote:
> > > On Sat, Sep 02, 2017 at 11:17:17PM +0800, Ming Lei wrote:
> > > > We need to iterate ctx starting from any ctx in round robin
> > > > way, so introduce this helper.
> > > > 
> > > > Cc: Omar Sandoval <osandov@fb.com>
> > > 
> > > A couple of comments below, once you address those you can add
> > > 
> > > Reviewed-by: Omar Sandoval <osandov@fb.com>
> > > 
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > >  include/linux/sbitmap.h | 54 ++++++++++++++++++++++++++++++++++++-------------
> > > >  1 file changed, 40 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> > > > index a1904aadbc45..2329b9e1a0e2 100644
> > > > --- a/include/linux/sbitmap.h
> > > > +++ b/include/linux/sbitmap.h
> > > > @@ -211,10 +211,14 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb);
> > > >   */
> > > >  bool sbitmap_any_bit_clear(const struct sbitmap *sb);
> > > >  
> > > > +#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
> > > > +#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
> > > > +
> > > >  typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
> > > >  
> > > >  /**
> > > >   * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
> > > > + * @off: Where to start the iteration
> > > >   * @sb: Bitmap to iterate over.
> > > >   * @fn: Callback. Should return true to continue or false to break early.
> > > >   * @data: Pointer to pass to callback.
> > > > @@ -222,35 +226,57 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
> > > >   * This is inline even though it's non-trivial so that the function calls to the
> > > >   * callback will hopefully get optimized away.
> > > >   */
> > > > -static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
> > > > -					void *data)
> > > > +static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> > > > +					  unsigned int off,
> > > > +					  sb_for_each_fn fn, void *data)
> > > >  {
> > > > -	unsigned int i;
> > > > +	unsigned int index = SB_NR_TO_INDEX(sb, off);
> > > > +	unsigned int nr = SB_NR_TO_BIT(sb, off);
> > > > +	unsigned int scanned = 0;
> > > >  
> > > > -	for (i = 0; i < sb->map_nr; i++) {
> > > > -		struct sbitmap_word *word = &sb->map[i];
> > > > -		unsigned int off, nr;
> > > > +	while (1) {
> > > > +		struct sbitmap_word *word = &sb->map[index];
> > > > +		unsigned int depth = min_t(unsigned int, word->depth - nr,
> > > > +					   sb->depth - scanned);
> > > >  
> > > > +		scanned += depth;
> > > >  		if (!word->word)
> > > > -			continue;
> > > > +			goto next;
> > > >  
> > > > -		nr = 0;
> > > > -		off = i << sb->shift;
> > > > +		depth += nr;
> > > 
> > > I had to think hard to convince myself this was right. If above we set
> > > depth to (sb->depth - scanned), then we must have already looped at
> > 
> > It should be so only in the last loop, in which the 1st half of
> > the word is to be checked because we start from the 2nd half of
> > the same word.
> > 
> > > least once, so nr must be 0, therefore this is okay. Am I following this
> > > correctly? I think reassigning like so would be more clear:
> > 
> > Yes, you are right, nr can be non-zero only in the 1st loop.
> > 
> > > 
> > > 		depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> > 
> > If nr isn't zero, the depth to be scanned should be 'word->depth - nr'
> > in the 1st loop, so the above way can't cover this case.
> 
> What I mean is that you keep the same initialization above, but instead of
> 		depth += nr
> you do
> 		depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> because like I said, the reasoning about why `+= nr` is okay in the
> `sb->depth - scanned` case is subtle.
> 
> And maybe even replace the
> 		scanned += depth;
> with
> 		scanned += min_t(unsigned int, word->depth - nr,
> 	   			 sb->depth - scanned);
> I.e., don't reuse the depth local variable for two different things. I'm
> nitpicking here but this code is tricky enough as it is.

It wasn't reused in the old version; reusing it here just saves one local
variable and one extra min_t() call.

Yeah, I admit it isn't clean enough.

> 
> For completeness, I mean this exactly:
> 
> 	while (1) {
> 		struct sbitmap_word *word = &sb->map[index];
> 		unsigned int depth;
> 
> 		scanned += min_t(unsigned int, word->depth - nr,
> 				 sb->depth - scanned);
> 		if (!word->word)
> 			goto next;
> 
> 		depth = min_t(unsigned int, word->depth, sb->depth - scanned);

That's two min_t() calls and a little code duplication.

> 		off = index << sb->shift;
> 		while (1) {
> 			nr = find_next_bit(&word->word, depth, nr);
> 			if (nr >= depth)
> 				break;
> 
> 			if (!fn(sb, off + nr, data))
> 				return;
> 
> 			nr++;
> 		}
> next:
> 		if (scanned >= sb->depth)
> 			break;
> 		nr = 0;
> 		if (++index >= sb->map_nr)
> 			index = 0;
> 	}

The following patch switches to do {} while and handles the 1st scan
outside of the loop, so it should be clean enough (no duplicated
min_t()). How about this one?

From dd455230fed9200d519d3ae4f4c32a1430506441 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Wed, 2 Aug 2017 15:53:09 +0800
Subject: [PATCH] sbitmap: introduce __sbitmap_for_each_set()

We need to iterate ctx starting from any ctx in round robin
way, so introduce this helper.

Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/sbitmap.h | 68 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 48 insertions(+), 20 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index a1904aadbc45..4df5480b7c4e 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -211,46 +211,74 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb);
  */
 bool sbitmap_any_bit_clear(const struct sbitmap *sb);
 
+#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
+#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+
 typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
 
 /**
  * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @start: Where to start the iteration
  * @sb: Bitmap to iterate over.
  * @fn: Callback. Should return true to continue or false to break early.
  * @data: Pointer to pass to callback.
- *
- * This is inline even though it's non-trivial so that the function calls to the
- * callback will hopefully get optimized away.
  */
-static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
-					void *data)
+static inline void __sbitmap_for_each_set(struct sbitmap *sb,
+					  const unsigned int start,
+					  sb_for_each_fn fn, void *data)
 {
-	unsigned int i;
-
-	for (i = 0; i < sb->map_nr; i++) {
-		struct sbitmap_word *word = &sb->map[i];
-		unsigned int off, nr;
+	unsigned int index = SB_NR_TO_INDEX(sb, start);
+	unsigned int nr = SB_NR_TO_BIT(sb, start);
+	struct sbitmap_word *word = &sb->map[index];
+	unsigned int depth = word->depth;
+	/*
+	 * 'scanned' is always updated beforehand, so the exit
+	 * check has to be done after the inner bitmap scanning
+	 * is completed.
+	 */
+	unsigned int scanned = depth - nr;
 
+	do {
 		if (!word->word)
-			continue;
+			goto next;
 
-		nr = 0;
-		off = i << sb->shift;
 		while (1) {
-			nr = find_next_bit(&word->word, word->depth, nr);
-			if (nr >= word->depth)
+			nr = find_next_bit(&word->word, depth, nr);
+			if (nr >= depth)
 				break;
-
-			if (!fn(sb, off + nr, data))
+			if (!fn(sb, (index << sb->shift) + nr, data))
 				return;
 
 			nr++;
 		}
-	}
+ next:
+		if (scanned >= sb->depth)
+			break;
+
+		depth = min_t(unsigned int, word->depth, sb->depth - scanned);
+		scanned += depth;
+
+		if (++index >= sb->map_nr)
+			index = 0;
+		word = &sb->map[index];
+		nr = 0;
+	} while (1);
 }
 
-#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
-#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+/**
+ * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @sb: Bitmap to iterate over.
+ * @fn: Callback. Should return true to continue or false to break early.
+ * @data: Pointer to pass to callback.
+ *
+ * This is inline even though it's non-trivial so that the function calls to the
+ * callback will hopefully get optimized away.
+ */
+static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
+					void *data)
+{
+	__sbitmap_for_each_set(sb, 0, fn, data);
+}
 
 static inline unsigned long *__sbitmap_word(struct sbitmap *sb,
 					    unsigned int bitnr)
-- 
2.9.5

-- 
Ming

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue
  2017-09-10 17:38       ` Omar Sandoval
@ 2017-09-11  4:13         ` Ming Lei
  2017-09-13 17:32           ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: Ming Lei @ 2017-09-11  4:13 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Sun, Sep 10, 2017 at 10:38:33AM -0700, Omar Sandoval wrote:
> On Sun, Sep 10, 2017 at 12:45:15PM +0800, Ming Lei wrote:
> > On Fri, Sep 08, 2017 at 04:54:39PM -0700, Omar Sandoval wrote:
> > > On Sat, Sep 02, 2017 at 11:17:20PM +0800, Ming Lei wrote:
> > > > SCSI devices use host-wide tagset, and the shared
> > > > driver tag space is often quite big. Meantime
> > > > there is also queue depth for each lun(.cmd_per_lun),
> > > > which is often small.
> > > > 
> > > > So lots of requests may stay in sw queue, and we
> > > > always flush all belonging to same hw queue and
> > > > dispatch them all to driver, unfortunately it is
> > > > easy to cause queue busy because of the small
> > > > per-lun queue depth. Once these requests are flushed
> > > > out, they have to stay in hctx->dispatch, and no bio
> > > > merge can participate into these requests, and
> > > > sequential IO performance is hurt.
> > > > 
> > > > This patch improves dispatching from sw queue when
> > > > there is per-request-queue queue depth by taking
> > > > request one by one from sw queue, just like the way
> > > > of IO scheduler.
> > > > 
> > > > Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > >  block/blk-mq-sched.c   | 61 +++++++++++++++++++++++++++++++++++++++++++++-----
> > > >  include/linux/blk-mq.h |  2 ++
> > > >  2 files changed, 57 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > index f69752961a34..735e432294ab 100644
> > > > --- a/block/blk-mq-sched.c
> > > > +++ b/block/blk-mq-sched.c
> > > > @@ -89,9 +89,9 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
> > > >  	return false;
> > > >  }
> > > >  
> > > > -static void blk_mq_do_dispatch(struct request_queue *q,
> > > > -			       struct elevator_queue *e,
> > > > -			       struct blk_mq_hw_ctx *hctx)
> > > > +static void blk_mq_do_dispatch_sched(struct request_queue *q,
> > > > +				     struct elevator_queue *e,
> > > > +				     struct blk_mq_hw_ctx *hctx)
> > > >  {
> > > >  	LIST_HEAD(rq_list);
> > > >  
> > > > @@ -105,6 +105,42 @@ static void blk_mq_do_dispatch(struct request_queue *q,
> > > >  	} while (blk_mq_dispatch_rq_list(q, &rq_list));
> > > >  }
> > > >  
> > > > +static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> > > > +					  struct blk_mq_ctx *ctx)
> > > > +{
> > > > +	unsigned idx = ctx->index_hw;
> > > > +
> > > > +	if (++idx == hctx->nr_ctx)
> > > > +		idx = 0;
> > > > +
> > > > +	return hctx->ctxs[idx];
> > > > +}
> > > > +
> > > > +static void blk_mq_do_dispatch_ctx(struct request_queue *q,
> > > > +				   struct blk_mq_hw_ctx *hctx)
> > > > +{
> > > > +	LIST_HEAD(rq_list);
> > > > +	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> > > > +	bool dispatched;
> > > > +
> > > > +	do {
> > > > +		struct request *rq;
> > > > +
> > > > +		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
> > > > +		if (!rq)
> > > > +			break;
> > > > +		list_add(&rq->queuelist, &rq_list);
> > > > +
> > > > +		/* round robin for fair dispatch */
> > > > +		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> > > 
> > > Hm... this next ctx will get skipped if the dispatch on the previous ctx
> > > fails, since we call blk_mq_next_ctx() again. Seems unfair. Maybe move
> > > the blk_mq_next_ctx() from the if (!dispatched) below into the if (!rq)
> > > above?
> > 
> > In case of if (!rq), that means there isn't any request in all ctxs
> > belonging to this hctx, so it is reasonable to start the dispatch from
> > any one of these ctxs next time, include the next one.
> 
> Yep, that case is okay.
> 
> > If dispatch fails on previous ctx, the rq from that ctx will be
> > put into ->dispatch, so it is fair to start dispatch from next ctx
> > next time too.
> 
> I'm talking about this case
> 
> 	LIST_HEAD(rq_list);
> 	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> 	bool dispatched;
>  
> 	/*
> 	 * Let's say that ctxs 0, 1, and 2 all have requests pending and
> 	 * hctx->dispatch_from was ctx0, so ctx is ctx0 when we start.
> 	 */
> 	do {
> 		struct request *rq;
>  
> 		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
> 		if (!rq)
> 			break;
> 		list_add(&rq->queuelist, &rq_list);
> 
> 		/* Now rq is a request from ctx0 */
> 
> 		/* round robin for fair dispatch */
> 		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> 		/* Now ctx is ctx1. */
>  
> 		dispatched = blk_mq_dispatch_rq_list(q, &rq_list);
> 
> 		/* If we couldn't dispatch, we break here. */
> 	} while (dispatched);
>  
> 	if (!dispatched)
> 		/*
> 		 * Now we set hctx->dispatch_from to ctx2, so we've
> 		 * skipped over ctx1.
> 		 */
> 		WRITE_ONCE(hctx->dispatch_from, blk_mq_next_ctx(hctx, ctx));

Good catch. Looks like it can be fixed by simply changing the above line
into the following:

 		WRITE_ONCE(hctx->dispatch_from, ctx);
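
For clarity, the whole loop would then look like the sketch below (on top
of the V4 patch, not retested; I also initialize 'dispatched' so the
early-break path is well defined):

static void blk_mq_do_dispatch_ctx(struct request_queue *q,
				   struct blk_mq_hw_ctx *hctx)
{
	LIST_HEAD(rq_list);
	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
	/* stays true if we stop because all ctxs are empty */
	bool dispatched = true;

	do {
		struct request *rq;

		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
		if (!rq)
			break;
		list_add(&rq->queuelist, &rq_list);

		/* round robin for fair dispatch */
		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);

		dispatched = blk_mq_dispatch_rq_list(q, &rq_list);
	} while (dispatched);

	if (!dispatched)
		/*
		 * Resume from the ctx we already advanced to above,
		 * instead of skipping over it with blk_mq_next_ctx().
		 */
		WRITE_ONCE(hctx->dispatch_from, ctx);
}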


-- 
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue
  2017-09-11  4:13         ` Ming Lei
@ 2017-09-13 17:32           ` Omar Sandoval
  0 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2017-09-13 17:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Mon, Sep 11, 2017 at 12:13:09PM +0800, Ming Lei wrote:
> On Sun, Sep 10, 2017 at 10:38:33AM -0700, Omar Sandoval wrote:
> > On Sun, Sep 10, 2017 at 12:45:15PM +0800, Ming Lei wrote:
> > > On Fri, Sep 08, 2017 at 04:54:39PM -0700, Omar Sandoval wrote:
> > > > On Sat, Sep 02, 2017 at 11:17:20PM +0800, Ming Lei wrote:
> > > > > SCSI devices use host-wide tagset, and the shared
> > > > > driver tag space is often quite big. Meantime
> > > > > there is also queue depth for each lun(.cmd_per_lun),
> > > > > which is often small.
> > > > > 
> > > > > So lots of requests may stay in sw queue, and we
> > > > > always flush all belonging to same hw queue and
> > > > > dispatch them all to driver, unfortunately it is
> > > > > easy to cause queue busy because of the small
> > > > > per-lun queue depth. Once these requests are flushed
> > > > > out, they have to stay in hctx->dispatch, and no bio
> > > > > merge can participate into these requests, and
> > > > > sequential IO performance is hurt.
> > > > > 
> > > > > This patch improves dispatching from sw queue when
> > > > > there is per-request-queue queue depth by taking
> > > > > request one by one from sw queue, just like the way
> > > > > of IO scheduler.
> > > > > 
> > > > > Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > ---
> > > > >  block/blk-mq-sched.c   | 61 +++++++++++++++++++++++++++++++++++++++++++++-----
> > > > >  include/linux/blk-mq.h |  2 ++
> > > > >  2 files changed, 57 insertions(+), 6 deletions(-)
> > > > > 
> > > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > > index f69752961a34..735e432294ab 100644
> > > > > --- a/block/blk-mq-sched.c
> > > > > +++ b/block/blk-mq-sched.c
> > > > > @@ -89,9 +89,9 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
> > > > >  	return false;
> > > > >  }
> > > > >  
> > > > > -static void blk_mq_do_dispatch(struct request_queue *q,
> > > > > -			       struct elevator_queue *e,
> > > > > -			       struct blk_mq_hw_ctx *hctx)
> > > > > +static void blk_mq_do_dispatch_sched(struct request_queue *q,
> > > > > +				     struct elevator_queue *e,
> > > > > +				     struct blk_mq_hw_ctx *hctx)
> > > > >  {
> > > > >  	LIST_HEAD(rq_list);
> > > > >  
> > > > > @@ -105,6 +105,42 @@ static void blk_mq_do_dispatch(struct request_queue *q,
> > > > >  	} while (blk_mq_dispatch_rq_list(q, &rq_list));
> > > > >  }
> > > > >  
> > > > > +static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> > > > > +					  struct blk_mq_ctx *ctx)
> > > > > +{
> > > > > +	unsigned idx = ctx->index_hw;
> > > > > +
> > > > > +	if (++idx == hctx->nr_ctx)
> > > > > +		idx = 0;
> > > > > +
> > > > > +	return hctx->ctxs[idx];
> > > > > +}
> > > > > +
> > > > > +static void blk_mq_do_dispatch_ctx(struct request_queue *q,
> > > > > +				   struct blk_mq_hw_ctx *hctx)
> > > > > +{
> > > > > +	LIST_HEAD(rq_list);
> > > > > +	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> > > > > +	bool dispatched;
> > > > > +
> > > > > +	do {
> > > > > +		struct request *rq;
> > > > > +
> > > > > +		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
> > > > > +		if (!rq)
> > > > > +			break;
> > > > > +		list_add(&rq->queuelist, &rq_list);
> > > > > +
> > > > > +		/* round robin for fair dispatch */
> > > > > +		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> > > > 
> > > > Hm... this next ctx will get skipped if the dispatch on the previous ctx
> > > > fails, since we call blk_mq_next_ctx() again. Seems unfair. Maybe move
> > > > the blk_mq_next_ctx() from the if (!dispatched) below into the if (!rq)
> > > > above?
> > > 
> > > In case of if (!rq), that means there isn't any request in all ctxs
> > > belonging to this hctx, so it is reasonable to start the dispatch from
> > > any one of these ctxs next time, include the next one.
> > 
> > Yep, that case is okay.
> > 
> > > If dispatch fails on previous ctx, the rq from that ctx will be
> > > put into ->dispatch, so it is fair to start dispatch from next ctx
> > > next time too.
> > 
> > I'm talking about this case
> > 
> > 	LIST_HEAD(rq_list);
> > 	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> > 	bool dispatched;
> >  
> > 	/*
> > 	 * Let's say that ctxs 0, 1, and 2 all have requests pending and
> > 	 * hctx->dispatch_from was ctx0, so ctx is ctx0 when we start.
> > 	 */
> > 	do {
> > 		struct request *rq;
> >  
> > 		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
> > 		if (!rq)
> > 			break;
> > 		list_add(&rq->queuelist, &rq_list);
> > 
> > 		/* Now rq is a request from ctx0 */
> > 
> > 		/* round robin for fair dispatch */
> > 		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> > 		/* Now ctx is ctx1. */
> >  
> > 		dispatched = blk_mq_dispatch_rq_list(q, &rq_list);
> > 
> > 		/* If we couldn't dispatch, we break here. */
> > 	} while (dispatched);
> >  
> > 	if (!dispatched)
> > 		/*
> > 		 * Now we set hctx->dispatch_from to ctx2, so we've
> > 		 * skipped over ctx1.
> > 		 */
> > 		WRITE_ONCE(hctx->dispatch_from, blk_mq_next_ctx(hctx, ctx));
> 
> Good catch, looks it can be fixed by simply changing the above line
> into following:
> 
>  		WRITE_ONCE(hctx->dispatch_from, ctx);

Yeah that'll do it. Once you fix that you can add

Reviewed-by: Omar Sandoval <osandov@fb.com>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set()
  2017-09-11  4:08         ` Ming Lei
@ 2017-09-13 18:37           ` Omar Sandoval
  2017-09-14  1:56             ` Ming Lei
  0 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2017-09-13 18:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman, Omar Sandoval

On Mon, Sep 11, 2017 at 12:08:29PM +0800, Ming Lei wrote:
> On Sun, Sep 10, 2017 at 10:20:27AM -0700, Omar Sandoval wrote:

[snip]

> > What I mean is that you keep the same initialization above, but instead of
> > 		depth += nr
> > you do
> > 		depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> > because like I said, the reasoning about why `+= nr` is okay in the
> > `sb->depth - scanned` case is subtle.
> > 
> > And maybe even replace the
> > 		scanned += depth;
> > with
> > 		scanned += min_t(unsigned int, word->depth - nr,
> > 	   			 sb->depth - scanned);
> > I.e., don't reuse the depth local variable for two different things. I'm
> > nitpicking here but this code is tricky enough as it is.
> 
> It wasn't reused in old version, just for saving one local variable, and
> one extra min_t().
> 
> Yeah, I admit it isn't clean enough.
> 
> > 
> > For completeness, I mean this exactly:
> > 
> > 	while (1) {
> > 		struct sbitmap_word *word = &sb->map[index];
> > 		unsigned int depth;
> > 
> > 		scanned += min_t(unsigned int, word->depth - nr,
> > 				 sb->depth - scanned);
> > 		if (!word->word)
> > 			goto next;
> > 
> > 		depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> 
> two min_t and a little code duplication.

They're similar but they represent different things, so I think trying
to deduplicate this code just makes it more confusing. If performance is
your concern, I'd be really surprised if there's a noticeable difference.

As a side note, I also realized that this code doesn't handle the
sb->depth == 0 case. We should change the while (1) to
while (scanned < sb->depth) and remove the
if (scanned >= sb->depth) break;

> > 		off = index << sb->shift;
> > 		while (1) {
> > 			nr = find_next_bit(&word->word, depth, nr);
> > 			if (nr >= depth)
> > 				break;
> > 
> > 			if (!fn(sb, off + nr, data))
> > 				return;
> > 
> > 			nr++;
> > 		}
> > next:
> > 		if (scanned >= sb->depth)
> > 			break;
> > 		nr = 0;
> > 		if (++index >= sb->map_nr)
> > 			index = 0;
> > 	}
> 
> The following patch switches to do{}while and handles the
> 1st scan outside of the loop, then it should be clean
> enough(no two min_t()), so how about this one?

I find this one subtler and harder to follow. The less it looks like the
typical loop pattern, the longer someone reading the code has to reason
about it.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set()
  2017-09-13 18:37           ` Omar Sandoval
@ 2017-09-14  1:56             ` Ming Lei
  2017-09-14 14:59               ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: Ming Lei @ 2017-09-14  1:56 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman, Omar Sandoval

On Wed, Sep 13, 2017 at 11:37:20AM -0700, Omar Sandoval wrote:
> On Mon, Sep 11, 2017 at 12:08:29PM +0800, Ming Lei wrote:
> > On Sun, Sep 10, 2017 at 10:20:27AM -0700, Omar Sandoval wrote:
> 
> [snip]
> 
> > > What I mean is that you keep the same initialization above, but instead of
> > > 		depth += nr
> > > you do
> > > 		depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> > > because like I said, the reasoning about why `+= nr` is okay in the
> > > `sb->depth - scanned` case is subtle.
> > > 
> > > And maybe even replace the
> > > 		scanned += depth;
> > > with
> > > 		scanned += min_t(unsigned int, word->depth - nr,
> > > 	   			 sb->depth - scanned);
> > > I.e., don't reuse the depth local variable for two different things. I'm
> > > nitpicking here but this code is tricky enough as it is.
> > 
> > It wasn't reused in old version, just for saving one local variable, and
> > one extra min_t().
> > 
> > Yeah, I admit it isn't clean enough.
> > 
> > > 
> > > For completeness, I mean this exactly:
> > > 
> > > 	while (1) {
> > > 		struct sbitmap_word *word = &sb->map[index];
> > > 		unsigned int depth;
> > > 
> > > 		scanned += min_t(unsigned int, word->depth - nr,
> > > 				 sb->depth - scanned);
> > > 		if (!word->word)
> > > 			goto next;
> > > 
> > > 		depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> > 
> > two min_t and a little code duplication.
> 
> They're similar but they represent different things, so I think trying
> to deduplicate this code just makes it more confusing. If performance is
> your concern, I'd be really surprised if there's a noticeable difference.

It isn't only the extra min_t(); the code also isn't easy to read, since
only in the first scan is the amount added to 'scanned' different from
the 'depth' that is actually scanned. That is why I handled the 1st scan
outside of the loop, so that 'scanned' can be updated with 'depth' in
every iteration; that makes the meaning easier to follow.

> 
> As a side note, I also realized that this code doesn't handle the
> sb->depth == 0 case. We should change the while (1) to
> while (scanned < sb->depth) and remove the
> if (scanned >= sb->depth) break;

In the attached patch, if I remember correctly, the zero depth case is
addressed by:

	if (start >= sb->depth)
		return;

which is required since the 'start' parameter is introduced in
this patch.

> 
> > > 		off = index << sb->shift;
> > > 		while (1) {
> > > 			nr = find_next_bit(&word->word, depth, nr);
> > > 			if (nr >= depth)
> > > 				break;
> > > 
> > > 			if (!fn(sb, off + nr, data))
> > > 				return;
> > > 
> > > 			nr++;
> > > 		}
> > > next:
> > > 		if (scanned >= sb->depth)
> > > 			break;
> > > 		nr = 0;
> > > 		if (++index >= sb->map_nr)
> > > 			index = 0;
> > > 	}
> > 
> > The following patch switches to do{}while and handles the
> > 1st scan outside of the loop, then it should be clean
> > enough(no two min_t()), so how about this one?
> 
> I find this one subtler and harder to follow. The less it looks like the
> typical loop pattern, the longer someone reading the code has to reason
> about it.

Using 'depth' to update 'scanned' looks easier to follow than two
min_t() calls, since it makes the relation between the two easier to
understand, and from there the whole code.

-- 
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set()
  2017-09-14  1:56             ` Ming Lei
@ 2017-09-14 14:59               ` Omar Sandoval
  2017-09-14 15:18                 ` Omar Sandoval
  2017-09-15  1:57                 ` Ming Lei
  0 siblings, 2 replies; 50+ messages in thread
From: Omar Sandoval @ 2017-09-14 14:59 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman, Omar Sandoval

On Thu, Sep 14, 2017 at 09:56:56AM +0800, Ming Lei wrote:
> On Wed, Sep 13, 2017 at 11:37:20AM -0700, Omar Sandoval wrote:
> > On Mon, Sep 11, 2017 at 12:08:29PM +0800, Ming Lei wrote:
> > > On Sun, Sep 10, 2017 at 10:20:27AM -0700, Omar Sandoval wrote:
> > 
> > [snip]
> > 
> > > > What I mean is that you keep the same initialization above, but instead of
> > > > 		depth += nr
> > > > you do
> > > > 		depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> > > > because like I said, the reasoning about why `+= nr` is okay in the
> > > > `sb->depth - scanned` case is subtle.
> > > > 
> > > > And maybe even replace the
> > > > 		scanned += depth;
> > > > with
> > > > 		scanned += min_t(unsigned int, word->depth - nr,
> > > > 	   			 sb->depth - scanned);
> > > > I.e., don't reuse the depth local variable for two different things. I'm
> > > > nitpicking here but this code is tricky enough as it is.
> > > 
> > > It wasn't reused in old version, just for saving one local variable, and
> > > one extra min_t().
> > > 
> > > Yeah, I admit it isn't clean enough.
> > > 
> > > > 
> > > > For completeness, I mean this exactly:
> > > > 
> > > > 	while (1) {
> > > > 		struct sbitmap_word *word = &sb->map[index];
> > > > 		unsigned int depth;
> > > > 
> > > > 		scanned += min_t(unsigned int, word->depth - nr,
> > > > 				 sb->depth - scanned);
> > > > 		if (!word->word)
> > > > 			goto next;
> > > > 
> > > > 		depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> > > 
> > > two min_t and a little code duplication.
> > 
> > They're similar but they represent different things, so I think trying
> > to deduplicate this code just makes it more confusing. If performance is
> > your concern, I'd be really surprised if there's a noticeable difference.
> 
> No only one extra min_t(), also it isn't easy to read the code, since
> only in the first scan that 'depth' isn't same with 'depth', that is
> why I set the 1st 'scan' outside of the loop, then we can update 'scan'
> with 'depth' in every loop. People will be easy to follow the
> meaning.
> 
> > 
> > As a side note, I also realized that this code doesn't handle the
> > sb->depth == 0 case. We should change the while (1) to
> > while (scanned < sb->depth) and remove the
> > if (scanned >= sb->depth) break;
> 
> In the attached patch, I remember that the zero depth case is
> addressed by:
> 
> 	if (start >= sb->depth)
> 		return;
> 
> which is required since 'start' parameter is introduced in
> this patch.

I think the better way to handle this is

if (start >= sb->depth)
	start = 0;

Since the sbitmap may have gotten resized since the last time the user
called this and cached their start value.

> > 
> > > > 		off = index << sb->shift;
> > > > 		while (1) {
> > > > 			nr = find_next_bit(&word->word, depth, nr);
> > > > 			if (nr >= depth)
> > > > 				break;
> > > > 
> > > > 			if (!fn(sb, off + nr, data))
> > > > 				return;
> > > > 
> > > > 			nr++;
> > > > 		}
> > > > next:
> > > > 		if (scanned >= sb->depth)
> > > > 			break;
> > > > 		nr = 0;
> > > > 		if (++index >= sb->map_nr)
> > > > 			index = 0;
> > > > 	}
> > > 
> > > The following patch switches to do{}while and handles the
> > > 1st scan outside of the loop, then it should be clean
> > > enough(no two min_t()), so how about this one?
> > 
> > I find this one subtler and harder to follow. The less it looks like the
> > typical loop pattern, the longer someone reading the code has to reason
> > about it.
> 
> Looks using 'depth' to update 'scanned' is easier to follow, than
> two min_t(), since it will make people easy to understand the relation
> between the two, then understand the whole code.

Honestly I prefer your original patch with a comment on depth += nr. I'd
be happy with the following incremental patch on top of your original v4
patch.

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 2329b9e1a0e2..8d747048ae4f 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -218,7 +218,7 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
 
 /**
  * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
- * @off: Where to start the iteration
+ * @off: Where to start the iteration.
  * @sb: Bitmap to iterate over.
  * @fn: Callback. Should return true to continue or false to break early.
  * @data: Pointer to pass to callback.
@@ -230,11 +230,16 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
 					  unsigned int off,
 					  sb_for_each_fn fn, void *data)
 {
-	unsigned int index = SB_NR_TO_INDEX(sb, off);
-	unsigned int nr = SB_NR_TO_BIT(sb, off);
+	unsigned int index;
+	unsigned int nr;
 	unsigned int scanned = 0;
 
-	while (1) {
+	if (off >= sb->depth)
+		off = 0;
+	index = SB_NR_TO_INDEX(sb, off);
+	nr = SB_NR_TO_BIT(sb, off);
+
+	while (scanned < sb->depth) {
 		struct sbitmap_word *word = &sb->map[index];
 		unsigned int depth = min_t(unsigned int, word->depth - nr,
 					   sb->depth - scanned);
@@ -243,6 +248,11 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
 		if (!word->word)
 			goto next;
 
+		/*
+		 * On the first iteration of the outer loop, we need to add the
+		 * bit offset back to the size of the word for find_next_bit().
+		 * On all other iterations, nr is zero, so this is a noop.
+		 */
 		depth += nr;
 		off = index << sb->shift;
 		while (1) {
@@ -254,9 +264,7 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
 
 			nr++;
 		}
- next:
-		if (scanned >= sb->depth)
-			break;
+next:
 		nr = 0;
 		if (++index >= sb->map_nr)
 			index = 0;
@@ -268,9 +276,6 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
  * @sb: Bitmap to iterate over.
  * @fn: Callback. Should return true to continue or false to break early.
  * @data: Pointer to pass to callback.
- *
- * This is inline even though it's non-trivial so that the function calls to the
- * callback will hopefully get optimized away.
  */
 static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
 					void *data)

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set()
  2017-09-14 14:59               ` Omar Sandoval
@ 2017-09-14 15:18                 ` Omar Sandoval
  2017-09-15  1:57                 ` Ming Lei
  1 sibling, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2017-09-14 15:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman, Omar Sandoval

On Thu, Sep 14, 2017 at 07:59:43AM -0700, Omar Sandoval wrote:

[snip]

> Honestly I prefer your original patch with a comment on depth += nr. I'd
> be happy with the following incremental patch on top of your original v4
> patch.

Oh, and the change renaming the off parameter to start would be good to
include, too.

> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index 2329b9e1a0e2..8d747048ae4f 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -218,7 +218,7 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
>  
>  /**
>   * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
> - * @off: Where to start the iteration
> + * @off: Where to start the iteration.
>   * @sb: Bitmap to iterate over.
>   * @fn: Callback. Should return true to continue or false to break early.
>   * @data: Pointer to pass to callback.
> @@ -230,11 +230,16 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
>  					  unsigned int off,
>  					  sb_for_each_fn fn, void *data)
>  {
> -	unsigned int index = SB_NR_TO_INDEX(sb, off);
> -	unsigned int nr = SB_NR_TO_BIT(sb, off);
> +	unsigned int index;
> +	unsigned int nr;
>  	unsigned int scanned = 0;
>  
> -	while (1) {
> +	if (off >= sb->depth)
> +		off = 0;
> +	index = SB_NR_TO_INDEX(sb, off);
> +	nr = SB_NR_TO_BIT(sb, off);
> +
> +	while (scanned < sb->depth) {
>  		struct sbitmap_word *word = &sb->map[index];
>  		unsigned int depth = min_t(unsigned int, word->depth - nr,
>  					   sb->depth - scanned);
> @@ -243,6 +248,11 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
>  		if (!word->word)
>  			goto next;
>  
> +		/*
> +		 * On the first iteration of the outer loop, we need to add the
> +		 * bit offset back to the size of the word for find_next_bit().
> +		 * On all other iterations, nr is zero, so this is a noop.
> +		 */
>  		depth += nr;
>  		off = index << sb->shift;
>  		while (1) {
> @@ -254,9 +264,7 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
>  
>  			nr++;
>  		}
> - next:
> -		if (scanned >= sb->depth)
> -			break;
> +next:
>  		nr = 0;
>  		if (++index >= sb->map_nr)
>  			index = 0;
> @@ -268,9 +276,6 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
>   * @sb: Bitmap to iterate over.
>   * @fn: Callback. Should return true to continue or false to break early.
>   * @data: Pointer to pass to callback.
> - *
> - * This is inline even though it's non-trivial so that the function calls to the
> - * callback will hopefully get optimized away.
>   */
>  static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
>  					void *data)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 03/14] blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
  2017-09-02 15:17 ` [PATCH V4 03/14] blk-mq: introduce blk_mq_dispatch_rq_from_ctx() Ming Lei
@ 2017-09-15  0:04   ` Omar Sandoval
  2017-09-15  1:50     ` Ming Lei
  0 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2017-09-15  0:04 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Sat, Sep 02, 2017 at 11:17:18PM +0800, Ming Lei wrote:
> This function is introduced for dequeuing request
> from sw queue so that we can dispatch it in
> scheduler's way.
> 
> More importantly, some SCSI devices may set
> q->queue_depth, which is a per-request_queue limit,
> and applied on pending I/O from all hctxs. This
> function is introduced for avoiding to dequeue too
> many requests from sw queue when ->dispatch isn't
> flushed completely.

blk_mq_dispatch_rq_from_ctx() doesn't actually dispatch the request. Can
you please rename it blk_mq_dequeue_from_ctx() or something like that?
(Same for the dispatch_rq_from_ctx helper.)
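
I.e., something along these lines (prototypes only, exact names up to
you; the bodies stay as in your patch):

	/* sbitmap callback: pops one request off the ctx it is called for */
	static bool dequeue_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
					void *data);

	struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
						struct blk_mq_ctx *start);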

> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq.c | 38 ++++++++++++++++++++++++++++++++++++++
>  block/blk-mq.h |  2 ++
>  2 files changed, 40 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3f18cff80050..f063dd0f197f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -880,6 +880,44 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
>  
> +struct dispatch_rq_data {
> +	struct blk_mq_hw_ctx *hctx;
> +	struct request *rq;
> +};
> +
> +static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
> +{
> +	struct dispatch_rq_data *dispatch_data = data;
> +	struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
> +	struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
> +
> +	spin_lock(&ctx->lock);
> +	if (unlikely(!list_empty(&ctx->rq_list))) {
> +		dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
> +		list_del_init(&dispatch_data->rq->queuelist);
> +		if (list_empty(&ctx->rq_list))
> +			sbitmap_clear_bit(sb, bitnr);
> +	}
> +	spin_unlock(&ctx->lock);
> +
> +	return !dispatch_data->rq;
> +}
> +
> +struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
> +					    struct blk_mq_ctx *start)
> +{
> +	unsigned off = start ? start->index_hw : 0;
> +	struct dispatch_rq_data data = {
> +		.hctx = hctx,
> +		.rq   = NULL,
> +	};
> +
> +	__sbitmap_for_each_set(&hctx->ctx_map, off,
> +			       dispatch_rq_from_ctx, &data);
> +
> +	return data.rq;
> +}
> +
>  static inline unsigned int queued_to_index(unsigned int queued)
>  {
>  	if (!queued)
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 98252b79b80b..e42748bfb959 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -35,6 +35,8 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
>  bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
>  bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
>  				bool wait);
> +struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
> +					    struct blk_mq_ctx *start);
>  
>  /*
>   * Internal helpers for allocating/freeing the request map
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 03/14] blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
  2017-09-15  0:04   ` Omar Sandoval
@ 2017-09-15  1:50     ` Ming Lei
  0 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-15  1:50 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Thu, Sep 14, 2017 at 05:04:13PM -0700, Omar Sandoval wrote:
> On Sat, Sep 02, 2017 at 11:17:18PM +0800, Ming Lei wrote:
> > This function is introduced for dequeuing request
> > from sw queue so that we can dispatch it in
> > scheduler's way.
> > 
> > More importantly, some SCSI devices may set
> > q->queue_depth, which is a per-request_queue limit,
> > and applied on pending I/O from all hctxs. This
> > function is introduced for avoiding to dequeue too
> > many requests from sw queue when ->dispatch isn't
> > flushed completely.
> 
> blk_mq_dispatch_rq_from_ctx() doesn't actually dispatch the request. Can
> you please rename it blk_mq_dequeue_from_ctx() or something like that?
> (Same for the dispatch_rq_from_ctx helper.)

It makes sense, will do that.

-- 
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set()
  2017-09-14 14:59               ` Omar Sandoval
  2017-09-14 15:18                 ` Omar Sandoval
@ 2017-09-15  1:57                 ` Ming Lei
  1 sibling, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-15  1:57 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman, Omar Sandoval

On Thu, Sep 14, 2017 at 07:59:43AM -0700, Omar Sandoval wrote:
> On Thu, Sep 14, 2017 at 09:56:56AM +0800, Ming Lei wrote:
> > On Wed, Sep 13, 2017 at 11:37:20AM -0700, Omar Sandoval wrote:
> > > On Mon, Sep 11, 2017 at 12:08:29PM +0800, Ming Lei wrote:
> > > > On Sun, Sep 10, 2017 at 10:20:27AM -0700, Omar Sandoval wrote:
> > > 
> > > [snip]
> > > 
> > > > > What I mean is that you keep the same initialization above, but instead of
> > > > > 		depth += nr
> > > > > you do
> > > > > 		depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> > > > > because like I said, the reasoning about why `+= nr` is okay in the
> > > > > `sb->depth - scanned` case is subtle.
> > > > > 
> > > > > And maybe even replace the
> > > > > 		scanned += depth;
> > > > > with
> > > > > 		scanned += min_t(unsigned int, word->depth - nr,
> > > > > 	   			 sb->depth - scanned);
> > > > > I.e., don't reuse the depth local variable for two different things. I'm
> > > > > nitpicking here but this code is tricky enough as it is.
> > > > 
> > > > It wasn't reused in old version, just for saving one local variable, and
> > > > one extra min_t().
> > > > 
> > > > Yeah, I admit it isn't clean enough.
> > > > 
> > > > > 
> > > > > For completeness, I mean this exactly:
> > > > > 
> > > > > 	while (1) {
> > > > > 		struct sbitmap_word *word = &sb->map[index];
> > > > > 		unsigned int depth;
> > > > > 
> > > > > 		scanned += min_t(unsigned int, word->depth - nr,
> > > > > 				 sb->depth - scanned);
> > > > > 		if (!word->word)
> > > > > 			goto next;
> > > > > 
> > > > > 		depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> > > > 
> > > > two min_t and a little code duplication.
> > > 
> > > They're similar but they represent different things, so I think trying
> > > to deduplicate this code just makes it more confusing. If performance is
> > > your concern, I'd be really surprised if there's a noticeable difference.
> > 
> > No only one extra min_t(), also it isn't easy to read the code, since
> > only in the first scan that 'depth' isn't same with 'depth', that is
> > why I set the 1st 'scan' outside of the loop, then we can update 'scan'
> > with 'depth' in every loop. People will be easy to follow the
> > meaning.
> > 
> > > 
> > > As a side note, I also realized that this code doesn't handle the
> > > sb->depth == 0 case. We should change the while (1) to
> > > while (scanned < sb->depth) and remove the
> > > if (scanned >= sb->depth) break;
> > 
> > In the attached patch, I remember that the zero depth case is
> > addressed by:
> > 
> > 	if (start >= sb->depth)
> > 		return;
> > 
> > which is required since 'start' parameter is introduced in
> > this patch.
> 
> I think the better way to handle this is
> 
> if (start >= sb->depth)
> 	start = 0;
> 
> Since the sbitmap may have gotten resized since the last time the user
> called this and cached their start value.

OK.

> 
> > > 
> > > > > 		off = index << sb->shift;
> > > > > 		while (1) {
> > > > > 			nr = find_next_bit(&word->word, depth, nr);
> > > > > 			if (nr >= depth)
> > > > > 				break;
> > > > > 
> > > > > 			if (!fn(sb, off + nr, data))
> > > > > 				return;
> > > > > 
> > > > > 			nr++;
> > > > > 		}
> > > > > next:
> > > > > 		if (scanned >= sb->depth)
> > > > > 			break;
> > > > > 		nr = 0;
> > > > > 		if (++index >= sb->map_nr)
> > > > > 			index = 0;
> > > > > 	}
> > > > 
> > > > The following patch switches to do{}while and handles the
> > > > 1st scan outside of the loop, then it should be clean
> > > > enough(no two min_t()), so how about this one?
> > > 
> > > I find this one subtler and harder to follow. The less it looks like the
> > > typical loop pattern, the longer someone reading the code has to reason
> > > about it.
> > 
> > Looks using 'depth' to update 'scanned' is easier to follow, than
> > two min_t(), since it will make people easy to understand the relation
> > between the two, then understand the whole code.
> 
> Honestly I prefer your original patch with a comment on depth += nr. I'd
> be happy with the following incremental patch on top of your original v4
> patch.

OK, looks fine!

Thanks,
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 06/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  2017-09-02 15:17 ` [PATCH V4 06/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
@ 2017-09-19 19:11   ` Omar Sandoval
  2017-09-20  2:55     ` Ming Lei
  0 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2017-09-19 19:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Sat, Sep 02, 2017 at 11:17:21PM +0800, Ming Lei wrote:
> During dispatching, we moved all requests from hctx->dispatch to
> one temporary list, then dispatch them one by one from this list.
> Unfortunately during this period, run queue from other contexts
> may think the queue is idle, then start to dequeue from sw/scheduler
> queue and still try to dispatch because ->dispatch is empty. This way
> hurts sequential I/O performance because requests are dequeued when
> lld queue is busy.
> 
> This patch introduces the state of BLK_MQ_S_DISPATCH_BUSY to
> make sure that request isn't dequeued until ->dispatch is
> flushed.
> 
> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Neat. I did something similar when I first started on I/O scheduling for
blk-mq, but I completely serialized dispatch so only one CPU would be
dispatching a given hctx at a time. I ended up ditching it because it
was actually a performance regression on artificial null-blk tests with
many CPUs and 1 hctx, and it was also really hard to get right without
deadlocks. Yours still allows concurrent dispatch from multiple CPUs, so
you probably wouldn't have those same issues, but I don't like how
handwavy this is about races on the DISPATCH_BUSY bit, and how the
handling of that bit is spread around everywhere. I'm not totally
opposed to taking this as is, but what do you think about serializing
the dispatch per hctx completely?
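
For reference, by serializing the dispatch per hctx I mean roughly this
shape (BLK_MQ_S_DISPATCHING is a made-up bit, and the hard part, i.e.
noticing work that was queued while we held the bit and rerunning the
queue, is left out):

static void blk_mq_sched_dispatch_serialized(struct blk_mq_hw_ctx *hctx)
{
	/* only one CPU dispatches a given hctx at a time */
	if (test_and_set_bit(BLK_MQ_S_DISPATCHING, &hctx->state))
		return;

	blk_mq_sched_dispatch_requests(hctx);

	clear_bit_unlock(BLK_MQ_S_DISPATCHING, &hctx->state);
	/*
	 * A real version would have to recheck for requests queued while
	 * we were dispatching and run the queue again, which is where the
	 * deadlock potential comes from.
	 */
}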

> ---
>  block/blk-mq-debugfs.c |  1 +
>  block/blk-mq-sched.c   | 58 +++++++++++++++++++++++++++++++++++---------------
>  block/blk-mq.c         |  6 ++++++
>  include/linux/blk-mq.h |  1 +
>  4 files changed, 49 insertions(+), 17 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 980e73095643..7a27f262c96a 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -182,6 +182,7 @@ static const char *const hctx_state_name[] = {
>  	HCTX_STATE_NAME(SCHED_RESTART),
>  	HCTX_STATE_NAME(TAG_WAITING),
>  	HCTX_STATE_NAME(START_ON_RUN),
> +	HCTX_STATE_NAME(DISPATCH_BUSY),
>  };
>  #undef HCTX_STATE_NAME
>  
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 735e432294ab..97e7a4fe3a32 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -146,7 +146,6 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	struct request_queue *q = hctx->queue;
>  	struct elevator_queue *e = q->elevator;
>  	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> -	bool do_sched_dispatch = true;
>  	LIST_HEAD(rq_list);
>  
>  	/* RCU or SRCU read lock is needed before checking quiesced flag */
> @@ -177,8 +176,33 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	 */
>  	if (!list_empty(&rq_list)) {
>  		blk_mq_sched_mark_restart_hctx(hctx);
> -		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> -	} else if (!has_sched_dispatch && !q->queue_depth) {
> +		blk_mq_dispatch_rq_list(q, &rq_list);
> +
> +		/*
> +		 * We may clear DISPATCH_BUSY just after it
> +		 * is set from another context, the only cost
> +		 * is that one request is dequeued a bit early,
> +		 * we can survive that. Given the window is
> +		 * small enough, no need to worry about performance
> +		 * effect.
> +		 */
> +		if (list_empty_careful(&hctx->dispatch))
> +			clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> +	}
> +
> +	/*
> +	 * If DISPATCH_BUSY is set, that means hw queue is busy
> +	 * and requests in the list of hctx->dispatch need to
> +	 * be flushed first, so return early.
> +	 *
> +	 * Wherever DISPATCH_BUSY is set, blk_mq_run_hw_queue()
> +	 * will be run to try to make progress, so it is always
> +	 * safe to check the state here.
> +	 */
> +	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
> +		return;
> +
> +	if (!has_sched_dispatch) {
>  		/*
>  		 * If there is no per-request_queue depth, we
>  		 * flush all requests in this hw queue, otherwise
> @@ -187,22 +211,21 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  		 * is busy, which can be triggered easily by
>  		 * per-request_queue queue depth
>  		 */
> -		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> -		blk_mq_dispatch_rq_list(q, &rq_list);
> -	}
> -
> -	if (!do_sched_dispatch)
> -		return;
> +		if (!q->queue_depth) {
> +			blk_mq_flush_busy_ctxs(hctx, &rq_list);
> +			blk_mq_dispatch_rq_list(q, &rq_list);
> +		} else {
> +			blk_mq_do_dispatch_ctx(q, hctx);
> +		}
> +	} else {
>  
> -	/*
> -	 * We want to dispatch from the scheduler if we had no work left
> -	 * on the dispatch list, OR if we did have work but weren't able
> -	 * to make progress.
> -	 */
> -	if (has_sched_dispatch)
> +		/*
> +		 * We want to dispatch from the scheduler if we had no work left
> +		 * on the dispatch list, OR if we did have work but weren't able
> +		 * to make progress.
> +		 */
>  		blk_mq_do_dispatch_sched(q, e, hctx);
> -	else
> -		blk_mq_do_dispatch_ctx(q, hctx);
> +	}
>  }
>  
>  bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> @@ -330,6 +353,7 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
>  	 */
>  	spin_lock(&hctx->lock);
>  	list_add(&rq->queuelist, &hctx->dispatch);
> +	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
>  	spin_unlock(&hctx->lock);
>  	return true;
>  }
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f063dd0f197f..6af56a71c1cd 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1140,6 +1140,11 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
>  
>  		spin_lock(&hctx->lock);
>  		list_splice_init(list, &hctx->dispatch);
> +		/*
> +		 * DISPATCH_BUSY won't be cleared until all requests
> +		 * in hctx->dispatch are dispatched successfully
> +		 */
> +		set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
>  		spin_unlock(&hctx->lock);
>  
>  		/*
> @@ -1927,6 +1932,7 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  
>  	spin_lock(&hctx->lock);
>  	list_splice_tail_init(&tmp, &hctx->dispatch);
> +	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
>  	spin_unlock(&hctx->lock);
>  
>  	blk_mq_run_hw_queue(hctx, true);
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 7b7a366a97f3..13f6c25fa461 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -172,6 +172,7 @@ enum {
>  	BLK_MQ_S_SCHED_RESTART	= 2,
>  	BLK_MQ_S_TAG_WAITING	= 3,
>  	BLK_MQ_S_START_ON_RUN	= 4,
> +	BLK_MQ_S_DISPATCH_BUSY	= 5,
>  
>  	BLK_MQ_MAX_DEPTH	= 10240,
>  
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 04/14] blk-mq-sched: move actual dispatching into one helper
  2017-09-02 15:17 ` [PATCH V4 04/14] blk-mq-sched: move actual dispatching into one helper Ming Lei
@ 2017-09-19 19:21   ` Omar Sandoval
  0 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2017-09-19 19:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Sat, Sep 02, 2017 at 11:17:19PM +0800, Ming Lei wrote:
> So that it becomes easy to support to dispatch from
> sw queue in the following patch.
> 
> No functional change.
> 
> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-sched.c | 28 ++++++++++++++++++----------
>  1 file changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 845e5baf8af1..f69752961a34 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -89,6 +89,22 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
>  	return false;
>  }
>  
> +static void blk_mq_do_dispatch(struct request_queue *q,
> +			       struct elevator_queue *e,
> +			       struct blk_mq_hw_ctx *hctx)

You rename this to blk_mq_do_dispatch_sched() in the next patch;
you might as well name it that way in the first place. Besides that,

Reviewed-by: Omar Sandoval <osandov@fb.com>

> +{
> +	LIST_HEAD(rq_list);
> +
> +	do {
> +		struct request *rq;
> +
> +		rq = e->type->ops.mq.dispatch_request(hctx);
> +		if (!rq)
> +			break;
> +		list_add(&rq->queuelist, &rq_list);
> +	} while (blk_mq_dispatch_rq_list(q, &rq_list));
> +}
> +
>  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct request_queue *q = hctx->queue;
> @@ -136,16 +152,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	 * on the dispatch list, OR if we did have work but weren't able
>  	 * to make progress.
>  	 */
> -	if (do_sched_dispatch && has_sched_dispatch) {
> -		do {
> -			struct request *rq;
> -
> -			rq = e->type->ops.mq.dispatch_request(hctx);
> -			if (!rq)
> -				break;
> -			list_add(&rq->queuelist, &rq_list);
> -		} while (blk_mq_dispatch_rq_list(q, &rq_list));
> -	}
> +	if (do_sched_dispatch && has_sched_dispatch)
> +		blk_mq_do_dispatch(q, e, hctx);
>  }
>  
>  bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance
  2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (14 preceding siblings ...)
  2017-09-04  9:12 ` [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Paolo Valente
@ 2017-09-19 19:25 ` Omar Sandoval
  2017-09-20  3:18   ` Ming Lei
  15 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2017-09-19 19:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Sat, Sep 02, 2017 at 11:17:15PM +0800, Ming Lei wrote:
> Hi,
> 
> In Red Hat internal storage test wrt. blk-mq scheduler, we
> found that I/O performance is much bad with mq-deadline, especially
> about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
> SRP...)
> 
> Turns out one big issue causes the performance regression: requests
> are still dequeued from sw queue/scheduler queue even when ldd's
> queue is busy, so I/O merge becomes quite difficult to make, then
> sequential IO degrades a lot.
> 
> The 1st five patches improve this situation, and brings back
> some performance loss.

Sorry it took so long; I've reviewed or commented on patches 1-6. When
you send v5, could you send just patches 1-6 and split the rest into
their own series?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue
  2017-09-02 15:17 ` [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue Ming Lei
  2017-09-08 23:54   ` Omar Sandoval
@ 2017-09-19 20:37   ` Jens Axboe
  2017-09-20  2:37     ` Ming Lei
  2017-09-20 12:20     ` Ming Lei
  1 sibling, 2 replies; 50+ messages in thread
From: Jens Axboe @ 2017-09-19 20:37 UTC (permalink / raw)
  To: Ming Lei, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Mel Gorman

On 09/02/2017 09:17 AM, Ming Lei wrote:
> @@ -142,18 +178,31 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	if (!list_empty(&rq_list)) {
>  		blk_mq_sched_mark_restart_hctx(hctx);
>  		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> -	} else if (!has_sched_dispatch) {
> +	} else if (!has_sched_dispatch && !q->queue_depth) {
> +		/*
> +		 * If there is no per-request_queue depth, we
> +		 * flush all requests in this hw queue, otherwise
> +		 * pick up request one by one from sw queue for
> +		 * avoiding to mess up I/O merge when dispatch
> +		 * is busy, which can be triggered easily by
> +		 * per-request_queue queue depth
> +		 */
>  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
>  		blk_mq_dispatch_rq_list(q, &rq_list);
>  	}

I don't like this part at all. It's a heuristic, and on top of that,
it's a heuristic based on a weak assumption that if q->queue_depth == 0,
then we never run into BUSY constraints. I think that would be stronger
if it was based on "is this using shared tags", but even that is not
great at all. It's perfectly possible to have a shared tags setup and
never run into resource constraints. The opposite condition is also true
- you can run without shared tags, and yet hit resource constraints
constantly. Hence this is NOT a good test for deciding whether to flush
everything at once or not. In short, I think a much better test case
would be "has this device ever returned BLK_STS_RESOURCE. As it so
happens, that's easy enough to keep track of and base this test on.

Additionally, this further differentiates dispatch cases. I'd much
rather just have one sane dispatch case. I realize we are balancing
between performance/scalability and practical cases like merging here,
but it should not be impossible to do.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue
  2017-09-19 20:37   ` Jens Axboe
@ 2017-09-20  2:37     ` Ming Lei
  2017-09-20 12:20     ` Ming Lei
  1 sibling, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-20  2:37 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

Hi Jens,

Thanks for your comment!

On Tue, Sep 19, 2017 at 02:37:50PM -0600, Jens Axboe wrote:
> On 09/02/2017 09:17 AM, Ming Lei wrote:
> > @@ -142,18 +178,31 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  	if (!list_empty(&rq_list)) {
> >  		blk_mq_sched_mark_restart_hctx(hctx);
> >  		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> > -	} else if (!has_sched_dispatch) {
> > +	} else if (!has_sched_dispatch && !q->queue_depth) {
> > +		/*
> > +		 * If there is no per-request_queue depth, we
> > +		 * flush all requests in this hw queue, otherwise
> > +		 * pick up request one by one from sw queue for
> > +		 * avoiding to mess up I/O merge when dispatch
> > +		 * is busy, which can be triggered easily by
> > +		 * per-request_queue queue depth
> > +		 */
> >  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> >  		blk_mq_dispatch_rq_list(q, &rq_list);
> >  	}
> 
> I don't like this part at all. It's a heuristic, and on top of that,

Yeah, it is a heuristic for improving SCSI-MQ's performance.

> it's a heuristic based on a weak assumption that if q->queue_depth == 0,

Only SCSI sets q->queue_depth, which means all hw queues share
the global depth of q->queue_depth. Once the total number
of in-flight requests across all hctxs exceeds q->queue_depth,
the queue becomes BUSY.

> then we never run into BUSY constraints. I think that would be stronger
> if it was based on "is this using shared tags", but even that is not

IMO, it isn't related to shared tags; for example, SATA's
performance can be affected by SCSI-MQ, which uses single tags.

The root cause is the limits of q->queue_depth and SCSI's
.cmd_per_lun inside SCSI, which cause the device to return
BLK_STS_RESOURCE because the I/O scheduler queue depth is usually
bigger than q->queue_depth, and it can be even worse
in the shared-tags case.
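
To make the shared-depth point concrete, here is a purely
illustrative sketch (the 'shared_inflight' counter is made up;
the real accounting lives in the SCSI midlayer): every hw queue
of the same request_queue counts against one limit, so any of
them can be refused with BLK_STS_RESOURCE once the total reaches
q->queue_depth.

/*
 * Illustration only: all hw queues of one request_queue share
 * a single in-flight budget. 'shared_inflight' is hypothetical;
 * SCSI does the equivalent bookkeeping internally.
 */
static bool queue_depth_busy(struct request_queue *q)
{
	if (!q->queue_depth)		/* driver set no limit */
		return false;

	return atomic_read(&q->shared_inflight) >= q->queue_depth;
}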

> great at all. It's perfectly possible to have a shared tags setup and
> never run into resource constraints. The opposite condition is also true

As I explained, it isn't related to shared tags: either shared
or single tags may run into resource constraints. SCSI has both
shared- and single-tags cases, and all of them have this issue
if q->queue_depth is set.

> - you can run without shared tags, and yet hit resource constraints
> constantly. Hence this is NOT a good test for deciding whether to flush
> everything at once or not. In short, I think a much better test case
> would be "has this device ever returned BLK_STS_RESOURCE. As it so
> happens, that's easy enough to keep track of and base this test on.

OK, it looks like we can try to flush all first, and switch to
picking requests one by one once the device has returned
BLK_STS_RESOURCE. Are you fine with that? Or do you have a better
suggestion? Something like the below, for example:
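
(Untested sketch; BLK_MQ_S_EVER_BUSY is a made-up flag. The idea
is that blk_mq_dispatch_rq_list() would set it the first time the
driver returns BLK_STS_RESOURCE, and after that we stop flushing
all ctxs at once for this hctx.)

/*
 * Sketch, not a real patch: pick the dispatch strategy based on
 * whether this hw queue has ever been rejected with
 * BLK_STS_RESOURCE, instead of looking at q->queue_depth.
 */
static void blk_mq_dispatch_sw_queue(struct request_queue *q,
				     struct blk_mq_hw_ctx *hctx)
{
	LIST_HEAD(rq_list);

	if (!test_bit(BLK_MQ_S_EVER_BUSY, &hctx->state)) {
		/* never saw BLK_STS_RESOURCE: safe to flush everything */
		blk_mq_flush_busy_ctxs(hctx, &rq_list);
		blk_mq_dispatch_rq_list(q, &rq_list);
	} else {
		/* device has been busy before: dispatch one by one */
		blk_mq_do_dispatch_ctx(q, hctx);
	}
}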

> 
> Additionally, this further differentiates dispatch cases. I'd much
> rather just have one sane dispatch case. I realize we are balancing
> between performance/scalability and practical cases like merging here,
> but it should not be impossible to do.

Yeah, I agree, but neither flush-all nor pick-one-by-one is
invented by this patch, :-)

-- 
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 06/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  2017-09-19 19:11   ` Omar Sandoval
@ 2017-09-20  2:55     ` Ming Lei
  0 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-20  2:55 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Tue, Sep 19, 2017 at 12:11:05PM -0700, Omar Sandoval wrote:
> On Sat, Sep 02, 2017 at 11:17:21PM +0800, Ming Lei wrote:
> > During dispatching, we moved all requests from hctx->dispatch to
> > one temporary list, then dispatch them one by one from this list.
> > Unfortunately during this period, run queue from other contexts
> > may think the queue is idle, then start to dequeue from sw/scheduler
> > queue and still try to dispatch because ->dispatch is empty. This way
> > hurts sequential I/O performance because requests are dequeued when
> > lld queue is busy.
> > 
> > This patch introduces the state of BLK_MQ_S_DISPATCH_BUSY to
> > make sure that request isn't dequeued until ->dispatch is
> > flushed.
> > 
> > Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> 
> Neat. I did something similar when I first started on I/O scheduling for
> blk-mq, but I completely serialized dispatch so only one CPU would be
> dispatching a given hctx at a time. I ended up ditching it because it
> was actually a performance regression on artificial null-blk tests with
> many CPUs and 1 hctx, and it was also really hard to get right without
> deadlocks. Yours still allows concurrent dispatch from multiple CPUs, so
> you probably wouldn't have those same issues, but I don't like how
> handwavy this is about races on the DISPATCH_BUSY bit, and how the
> handling of that bit is spread around everywhere. I'm not totally
> opposed to taking this as is, but what do you think about serializing
> the dispatch per hctx completely?

I think we might need that for SCSI because q->queue_depth is actually
a per-request_queue limit, and all hctxs share the depth.

It was posted once in V2 as patches 6 ~ 14:

	https://marc.info/?t=150191627900003&r=1&w=2

but running all hctxs after the queue-busy state is cleared was missed there.

I would appreciate it if you could take a look and see whether
that approach is good; maybe we can figure out a better one.

In my tests, serializing all hctxs improves mq-deadline somewhat
on IB/SRP, but Bart and Jens suggested splitting this patchset,
so it isn't included in V3/V4.

Also, an sbitmap may be required to mark whether a hctx needs to
run again; otherwise a performance loss is observed with none on
IB/SRP if all hctxs are simply re-run after the busy bit is
cleared. A rough sketch of what I mean is below.
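
(Untested, shape only: BLK_MQ_S_DISPATCHING, the per-queue
restart_bitmap and blk_mq_rerun_marked_hctxs() are made-up names.
The run-queue path would call this wrapper instead of calling
blk_mq_sched_dispatch_requests() directly.)

/*
 * Sketch of fully serialized dispatch: only one context may
 * dispatch a given hctx; anyone else marks the hctx in a
 * hypothetical per-queue sbitmap so it gets re-run afterwards.
 */
static void blk_mq_serialized_dispatch(struct blk_mq_hw_ctx *hctx)
{
	struct request_queue *q = hctx->queue;

	if (test_and_set_bit(BLK_MQ_S_DISPATCHING, &hctx->state)) {
		/* another context owns dispatch, remember to re-run us */
		sbitmap_set_bit(&q->restart_bitmap, hctx->queue_num);
		return;
	}

	blk_mq_sched_dispatch_requests(hctx);

	clear_bit(BLK_MQ_S_DISPATCHING, &hctx->state);

	/* kick any hctx that was skipped while we were dispatching */
	blk_mq_rerun_marked_hctxs(q);
}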

-- 
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance
  2017-09-19 19:25 ` Omar Sandoval
@ 2017-09-20  3:18   ` Ming Lei
  0 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-20  3:18 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Tue, Sep 19, 2017 at 12:25:15PM -0700, Omar Sandoval wrote:
> On Sat, Sep 02, 2017 at 11:17:15PM +0800, Ming Lei wrote:
> > Hi,
> > 
> > In Red Hat internal storage test wrt. blk-mq scheduler, we
> > found that I/O performance is much bad with mq-deadline, especially
> > about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
> > SRP...)
> > 
> > Turns out one big issue causes the performance regression: requests
> > are still dequeued from sw queue/scheduler queue even when ldd's
> > queue is busy, so I/O merge becomes quite difficult to make, then
> > sequential IO degrades a lot.
> > 
> > The 1st five patches improve this situation, and brings back
> > some performance loss.
> 
> Sorry it took so long, I've reviewed or commented on patches 1-6. When
> you send v5, could you just send patches 1-6, and split the rest as
> their own series?

Sure, no problem.

Thanks for your review!

-- 
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue
  2017-09-19 20:37   ` Jens Axboe
  2017-09-20  2:37     ` Ming Lei
@ 2017-09-20 12:20     ` Ming Lei
  2017-09-22  2:15       ` Ming Lei
  1 sibling, 1 reply; 50+ messages in thread
From: Ming Lei @ 2017-09-20 12:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Tue, Sep 19, 2017 at 02:37:50PM -0600, Jens Axboe wrote:
> On 09/02/2017 09:17 AM, Ming Lei wrote:
> > @@ -142,18 +178,31 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  	if (!list_empty(&rq_list)) {
> >  		blk_mq_sched_mark_restart_hctx(hctx);
> >  		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> > -	} else if (!has_sched_dispatch) {
> > +	} else if (!has_sched_dispatch && !q->queue_depth) {
> > +		/*
> > +		 * If there is no per-request_queue depth, we
> > +		 * flush all requests in this hw queue, otherwise
> > +		 * pick up request one by one from sw queue for
> > +		 * avoiding to mess up I/O merge when dispatch
> > +		 * is busy, which can be triggered easily by
> > +		 * per-request_queue queue depth
> > +		 */
> >  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> >  		blk_mq_dispatch_rq_list(q, &rq_list);
> >  	}
> 
> I don't like this part at all. It's a heuristic, and on top of that,
> it's a heuristic based on a weak assumption that if q->queue_depth == 0,
> then we never run into BUSY constraints. I think that would be stronger
> if it was based on "is this using shared tags", but even that is not
> great at all. It's perfectly possible to have a shared tags setup and
> never run into resource constraints. The opposite condition is also true
> - you can run without shared tags, and yet hit resource constraints
> constantly. Hence this is NOT a good test for deciding whether to flush
> everything at once or not. In short, I think a much better test case
> would be "has this device ever returned BLK_STS_RESOURCE. As it so
> happens, that's easy enough to keep track of and base this test on.

Hi Jens,

The attached patch follows your suggestion and uses an EWMA to
compute the average length of hctx->dispatch, then flushes all
requests from the ctxs only when that average is zero. What do
you think about this approach? Any other suggestions?
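
To give a concrete feel for the hint (with the weight of 8 used
below): starting from a dispatch_len of 8, runs that find no
parked requests (cnt == 0) take it through
8 -> 7 -> 6 -> 5 -> 4 -> 3 -> 2 -> 1 -> 0, so flushing all ctxs
at once only resumes after roughly eight "clean" runs, while a
run that finds a single parked request bumps the value back up
by one.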


diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 7a27f262c96a..c420c775b9c0 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -232,6 +232,14 @@ static int hctx_flags_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+static int hctx_dispatch_len_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+
+	seq_printf(m, "%u\n", hctx->dispatch_len);
+	return 0;
+}
+
 #define REQ_OP_NAME(name) [REQ_OP_##name] = #name
 static const char *const op_name[] = {
 	REQ_OP_NAME(READ),
@@ -763,6 +771,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
 	{"state", 0400, hctx_state_show},
 	{"flags", 0400, hctx_flags_show},
 	{"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
+	{"dispatch_length", 0400, hctx_dispatch_len_show},
 	{"busy", 0400, hctx_busy_show},
 	{"ctx_map", 0400, hctx_ctx_map_show},
 	{"tags", 0400, hctx_tags_show},
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index a5712dd67783..91a6eeaadaf1 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -102,7 +102,7 @@ static void blk_mq_do_dispatch_sched(struct request_queue *q,
 		if (!rq)
 			break;
 		list_add(&rq->queuelist, &rq_list);
-	} while (blk_mq_dispatch_rq_list(q, &rq_list));
+	} while (blk_mq_dispatch_rq_list(q, &rq_list, 1));
 }
 
 static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
@@ -134,19 +134,51 @@ static void blk_mq_do_dispatch_ctx(struct request_queue *q,
 		/* round robin for fair dispatch */
 		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
 
-		dispatched = blk_mq_dispatch_rq_list(q, &rq_list);
+		dispatched = blk_mq_dispatch_rq_list(q, &rq_list, 1);
 	} while (dispatched);
 
 	if (!dispatched)
 		WRITE_ONCE(hctx->dispatch_from, ctx);
 }
 
+/* borrowed from bcache */
+void ewma_add(unsigned *ewma, unsigned val, unsigned weight)
+{
+	unsigned  result = READ_ONCE(*ewma);
+
+	if (val == 1) {
+		result += 1;
+	} else {
+		result *= weight - 1;
+		result += val;
+		result /= weight;
+	}
+	WRITE_ONCE(*ewma, result);
+}
+
+/*
+ * We use an EWMA to compute the average length of the dispatch
+ * list. It is updated locklessly, which is fine because the
+ * value is only a hint.
+ *
+ * We only flush requests from all ctxs if the average length
+ * of the dispatch list is zero; otherwise the hw queue is
+ * considered busy and we dequeue requests from the ctxs one
+ * by one.
+ */
+static void blk_mq_update_dispatch_length(struct blk_mq_hw_ctx *hctx,
+					  unsigned cnt)
+{
+	ewma_add(&hctx->dispatch_len, cnt, 8);
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
 	LIST_HEAD(rq_list);
+	unsigned rq_cnt = 0;
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
 	if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)))
@@ -162,9 +194,14 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 		spin_lock(&hctx->lock);
 		if (!list_empty(&hctx->dispatch))
 			list_splice_init(&hctx->dispatch, &rq_list);
+		rq_cnt = hctx->cnt;
+		hctx->cnt = 0;
 		spin_unlock(&hctx->lock);
 	}
 
+	if (!has_sched_dispatch)
+		blk_mq_update_dispatch_length(hctx, rq_cnt);
+
 	/*
 	 * Only ask the scheduler for requests, if we didn't have residual
 	 * requests from the dispatch list. This is to avoid the case where
@@ -176,7 +213,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		blk_mq_dispatch_rq_list(q, &rq_list);
+		blk_mq_dispatch_rq_list(q, &rq_list, rq_cnt);
 
 		/*
 		 * We may clear DISPATCH_BUSY just after it
@@ -204,16 +241,16 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 
 	if (!has_sched_dispatch) {
 		/*
-		 * If there is no per-request_queue depth, we
+		 * If dispatch doesn't run out of resource, we
 		 * flush all requests in this hw queue, otherwise
 		 * pick up request one by one from sw queue for
 		 * avoiding to mess up I/O merge when dispatch
 		 * run out of resource, which can be triggered
 		 * easily by per-request_queue queue depth
 		 */
-		if (!q->queue_depth) {
-			blk_mq_flush_busy_ctxs(hctx, &rq_list);
-			blk_mq_dispatch_rq_list(q, &rq_list);
+		if (!hctx->dispatch_len) {
+			rq_cnt = blk_mq_flush_busy_ctxs(hctx, &rq_list);
+			blk_mq_dispatch_rq_list(q, &rq_list, rq_cnt);
 		} else {
 			blk_mq_do_dispatch_ctx(q, hctx);
 		}
@@ -346,6 +383,7 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 	 * the dispatch list.
 	 */
 	spin_lock(&hctx->lock);
+	hctx->cnt++;
 	list_add(&rq->queuelist, &hctx->dispatch);
 	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 	spin_unlock(&hctx->lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1bb45e995da3..b2db5d6d568e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -850,6 +850,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
 struct flush_busy_ctx_data {
 	struct blk_mq_hw_ctx *hctx;
 	struct list_head *list;
+	unsigned cnt;
 };
 
 static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
@@ -857,11 +858,16 @@ static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
 	struct flush_busy_ctx_data *flush_data = data;
 	struct blk_mq_hw_ctx *hctx = flush_data->hctx;
 	struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
+	unsigned cnt = 0;
 
 	sbitmap_clear_bit(sb, bitnr);
 	spin_lock(&ctx->lock);
 	list_splice_tail_init(&ctx->rq_list, flush_data->list);
+	cnt = ctx->cnt;
+	ctx->cnt = 0;
 	spin_unlock(&ctx->lock);
+	flush_data->cnt += cnt;
+
 	return true;
 }
 
@@ -869,14 +875,17 @@ static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
  * Process software queues that have been marked busy, splicing them
  * to the for-dispatch
  */
-void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
+unsigned blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 {
 	struct flush_busy_ctx_data data = {
 		.hctx = hctx,
 		.list = list,
+		.cnt  = 0,
 	};
 
 	sbitmap_for_each_set(&hctx->ctx_map, flush_busy_ctx, &data);
+
+	return data.cnt;
 }
 EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
 
@@ -897,6 +906,7 @@ static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, void *d
 		list_del_init(&dispatch_data->rq->queuelist);
 		if (list_empty(&ctx->rq_list))
 			sbitmap_clear_bit(sb, bitnr);
+		ctx->cnt--;
 	}
 	spin_unlock(&ctx->lock);
 
@@ -1052,7 +1062,9 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
-bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
+/* @cnt represents the number of requests in @list */
+bool blk_mq_dispatch_rq_list(struct request_queue *q,
+			     struct list_head *list, unsigned cnt)
 {
 	struct blk_mq_hw_ctx *hctx;
 	struct request *rq;
@@ -1139,6 +1151,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 		blk_mq_put_driver_tag(rq);
 
 		spin_lock(&hctx->lock);
+		hctx->cnt += cnt - queued - errors;
 		list_splice_init(list, &hctx->dispatch);
 		/*
 		 * DISPATCH_BUSY won't be cleared until all requests
@@ -1431,6 +1444,7 @@ static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
 		list_add(&rq->queuelist, &ctx->rq_list);
 	else
 		list_add_tail(&rq->queuelist, &ctx->rq_list);
+	ctx->cnt++;
 }
 
 void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
@@ -1454,6 +1468,7 @@ void blk_mq_request_bypass_insert(struct request *rq)
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
 
 	spin_lock(&hctx->lock);
+	hctx->cnt++;
 	list_add_tail(&rq->queuelist, &hctx->dispatch);
 	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 	spin_unlock(&hctx->lock);
@@ -1941,6 +1956,7 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	if (!list_empty(&ctx->rq_list)) {
 		list_splice_init(&ctx->rq_list, &tmp);
 		blk_mq_hctx_clear_pending(hctx, ctx);
+		ctx->cnt = 0;
 	}
 	spin_unlock(&ctx->lock);
 
@@ -1950,6 +1966,7 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	spin_lock(&hctx->lock);
 	list_splice_tail_init(&tmp, &hctx->dispatch);
 	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+	hctx->cnt++;
 	spin_unlock(&hctx->lock);
 
 	blk_mq_run_hw_queue(hctx, true);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 231cfb0d973b..0da5a81cc41f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -9,6 +9,7 @@ struct blk_mq_ctx {
 	struct {
 		spinlock_t		lock;
 		struct list_head	rq_list;
+		unsigned		cnt;
 	}  ____cacheline_aligned_in_smp;
 
 	unsigned int		cpu;
@@ -30,8 +31,9 @@ void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
-bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *);
-void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
+bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *,
+			     unsigned);
+unsigned blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 				bool wait);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 13f6c25fa461..778dbdd596f3 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -11,6 +11,7 @@ struct blk_flush_queue;
 struct blk_mq_hw_ctx {
 	struct {
 		spinlock_t		lock;
+		unsigned		cnt;
 		struct list_head	dispatch;
 		unsigned long		state;		/* BLK_MQ_S_* flags */
 	} ____cacheline_aligned_in_smp;
@@ -31,6 +32,7 @@ struct blk_mq_hw_ctx {
 	struct sbitmap		ctx_map;
 
 	struct blk_mq_ctx	*dispatch_from;
+	unsigned		dispatch_len;
 
 	struct blk_mq_ctx	**ctxs;
 	unsigned int		nr_ctx;


Thanks,
Ming

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue
  2017-09-20 12:20     ` Ming Lei
@ 2017-09-22  2:15       ` Ming Lei
  0 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2017-09-22  2:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Paolo Valente, Mel Gorman

On Wed, Sep 20, 2017 at 08:20:58PM +0800, Ming Lei wrote:
> On Tue, Sep 19, 2017 at 02:37:50PM -0600, Jens Axboe wrote:
> > On 09/02/2017 09:17 AM, Ming Lei wrote:
> > > @@ -142,18 +178,31 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > >  	if (!list_empty(&rq_list)) {
> > >  		blk_mq_sched_mark_restart_hctx(hctx);
> > >  		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> > > -	} else if (!has_sched_dispatch) {
> > > +	} else if (!has_sched_dispatch && !q->queue_depth) {
> > > +		/*
> > > +		 * If there is no per-request_queue depth, we
> > > +		 * flush all requests in this hw queue, otherwise
> > > +		 * pick up request one by one from sw queue for
> > > +		 * avoiding to mess up I/O merge when dispatch
> > > +		 * is busy, which can be triggered easily by
> > > +		 * per-request_queue queue depth
> > > +		 */
> > >  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > >  		blk_mq_dispatch_rq_list(q, &rq_list);
> > >  	}
> > 
> > I don't like this part at all. It's a heuristic, and on top of that,
> > it's a heuristic based on a weak assumption that if q->queue_depth == 0,
> > then we never run into BUSY constraints. I think that would be stronger
> > if it was based on "is this using shared tags", but even that is not
> > great at all. It's perfectly possible to have a shared tags setup and
> > never run into resource constraints. The opposite condition is also true
> > - you can run without shared tags, and yet hit resource constraints
> > constantly. Hence this is NOT a good test for deciding whether to flush
> > everything at once or not. In short, I think a much better test case
> > would be "has this device ever returned BLK_STS_RESOURCE. As it so
> > happens, that's easy enough to keep track of and base this test on.
> 
> Hi Jens,
> 
> The attached patch follows your suggestion, and uses EWMA to
> compute the average length of hctx->dispatch, then only flush
> all requests from ctxs if the average length is zero, what do
> you think about this approach? Or other suggestions?

Hi Jens,

I am going to prepare for V5, as suggested by Omar.

Could you suggest which way you prefer: keeping the check on
q->queue_depth, the approach of using the average length of
hctx->dispatch, or something else?


-- 
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance
  2017-09-06 21:09 Oleksandr Natalenko
@ 2017-09-06 21:22 ` Tom Nguyen
  0 siblings, 0 replies; 50+ messages in thread
From: Tom Nguyen @ 2017-09-06 21:22 UTC (permalink / raw)
  To: Oleksandr Natalenko, ming.lei; +Cc: linux-block, Jens Axboe

Likewise, no problems on my work laptop after 4 days of uptime.

Tested-by: Tom Nguyen <tom81094@gmail.com>


On 09/07/2017 04:09 AM, Oleksandr Natalenko wrote:
> Feel free to add:
>
> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
>
> since I'm running this on 4 machines without issues.
>
>> Hi Jens,
>>
>> Ping...

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance
@ 2017-09-06 21:09 Oleksandr Natalenko
  2017-09-06 21:22 ` Tom Nguyen
  0 siblings, 1 reply; 50+ messages in thread
From: Oleksandr Natalenko @ 2017-09-06 21:09 UTC (permalink / raw)
  To: ming.lei; +Cc: linux-block, Jens Axboe

Feel free to add:

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>

since I'm running this on 4 machines without issues.

> Hi Jens,
>
> Ping...

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2017-09-22  2:15 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-02 15:17 [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Ming Lei
2017-09-02 15:17 ` [PATCH V4 01/14] blk-mq-sched: fix scheduler bad performance Ming Lei
2017-09-08 20:48   ` Omar Sandoval
2017-09-08 20:54     ` Omar Sandoval
2017-09-08 20:56       ` Omar Sandoval
2017-09-09  7:43         ` Ming Lei
2017-09-09  7:33       ` Ming Lei
2017-09-02 15:17 ` [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
2017-09-08 20:43   ` Omar Sandoval
2017-09-09  9:38     ` Ming Lei
2017-09-10 17:20       ` Omar Sandoval
2017-09-11  4:08         ` Ming Lei
2017-09-13 18:37           ` Omar Sandoval
2017-09-14  1:56             ` Ming Lei
2017-09-14 14:59               ` Omar Sandoval
2017-09-14 15:18                 ` Omar Sandoval
2017-09-15  1:57                 ` Ming Lei
2017-09-02 15:17 ` [PATCH V4 03/14] blk-mq: introduce blk_mq_dispatch_rq_from_ctx() Ming Lei
2017-09-15  0:04   ` Omar Sandoval
2017-09-15  1:50     ` Ming Lei
2017-09-02 15:17 ` [PATCH V4 04/14] blk-mq-sched: move actual dispatching into one helper Ming Lei
2017-09-19 19:21   ` Omar Sandoval
2017-09-02 15:17 ` [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue Ming Lei
2017-09-08 23:54   ` Omar Sandoval
2017-09-10  4:45     ` Ming Lei
2017-09-10 17:38       ` Omar Sandoval
2017-09-11  4:13         ` Ming Lei
2017-09-13 17:32           ` Omar Sandoval
2017-09-19 20:37   ` Jens Axboe
2017-09-20  2:37     ` Ming Lei
2017-09-20 12:20     ` Ming Lei
2017-09-22  2:15       ` Ming Lei
2017-09-02 15:17 ` [PATCH V4 06/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
2017-09-19 19:11   ` Omar Sandoval
2017-09-20  2:55     ` Ming Lei
2017-09-02 15:17 ` [PATCH V4 07/14] blk-mq-sched: introduce blk_mq_sched_queue_depth() Ming Lei
2017-09-02 15:17 ` [PATCH V4 08/14] blk-mq-sched: use q->queue_depth as hint for q->nr_requests Ming Lei
2017-09-02 15:17 ` [PATCH V4 09/14] block: introduce rqhash helpers Ming Lei
2017-09-02 15:17 ` [PATCH V4 10/14] block: move actual bio merge code into __elv_merge Ming Lei
2017-09-02 15:17 ` [PATCH V4 11/14] block: add check on elevator for supporting bio merge via hashtable from blk-mq sw queue Ming Lei
2017-09-02 15:17 ` [PATCH V4 12/14] block: introduce .last_merge and .hash to blk_mq_ctx Ming Lei
2017-09-02 15:17 ` [PATCH V4 13/14] blk-mq-sched: refactor blk_mq_sched_try_merge() Ming Lei
2017-09-02 15:17 ` [PATCH V4 14/14] blk-mq: improve bio merge from blk-mq sw queue Ming Lei
2017-09-04  9:12 ` [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance Paolo Valente
2017-09-05  1:39   ` Ming Lei
2017-09-06 15:27     ` Ming Lei
2017-09-19 19:25 ` Omar Sandoval
2017-09-20  3:18   ` Ming Lei
2017-09-06 21:09 Oleksandr Natalenko
2017-09-06 21:22 ` Tom Nguyen
