* [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
@ 2017-08-05  6:56 Ming Lei
  2017-08-05  6:56 ` [PATCH V2 01/20] blk-mq-sched: fix scheduler bad performance Ming Lei
                   ` (23 more replies)
  0 siblings, 24 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

In Red Hat internal storage tests of the blk-mq scheduler, we
found that I/O performance is much worse with mq-deadline, especially
for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
SRP...).

It turns out one big issue causes the performance regression: requests
are still dequeued from the sw/scheduler queue even when the lld's
queue is busy, so I/O merges become quite hard to make and
sequential I/O degrades a lot.

The first five patches improve this situation and bring back
some of the lost performance.

But they are still not enough, because of the queue depth shared
among all hw queues. For SCSI devices, .cmd_per_lun defines the
max number of pending I/Os on one request queue, i.e. a
per-request_queue depth. So during dispatch, if one hctx is too
busy to make progress, all the other hctxs are blocked as well
by that per-request_queue depth.
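
As a rough illustration of why a shared depth stalls every hw queue,
here is a minimal userspace C sketch (all names are made up; this is
not blk-mq code) in which one per-request_queue budget gates dispatch
from every hw queue at once:

#include <stdbool.h>
#include <stdio.h>

#define NR_HW_QUEUES	4
#define QUEUE_DEPTH	3	/* models .cmd_per_lun */

static int inflight;		/* pending I/O on the whole request_queue */

/* Try to send one request from hw queue 'hctx' to the driver. */
static bool dispatch_one(int hctx)
{
	if (inflight >= QUEUE_DEPTH)
		return false;	/* like BLK_STS_RESOURCE: every hctx sees it */
	inflight++;
	printf("hctx %d dispatched, inflight=%d\n", hctx, inflight);
	return true;
}

int main(void)
{
	/* Once the shared budget is consumed, *all* hw queues are blocked,
	 * no matter which hctx used it up. */
	for (int i = 0; i < NR_HW_QUEUES; i++)
		while (dispatch_one(i))
			;
	return 0;
}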

Patches 6-14 use a per-request_queue dispatch list to avoid
dequeuing requests from the sw/scheduler queue while the lld
queue is busy.

Patches 15-20 improve bio merging via a hash table in the sw
queue, which is more efficient than the current approach of
checking only the last 8 requests. Since patches 6-14 switch
SCSI devices to the scheduler-style dequeue of one request from
the sw queue at a time, ctx->lock is acquired more often; merging
bios via the hash table shortens the time ctx->lock is held and
should cancel out that effect of patch 14.
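
For reference, the idea behind the hash-based merge lookup in
patches 15-20 can be sketched in plain userspace C (illustrative
only; names and the hash function are invented, while the actual
patches factor the existing elevator request hash into rqhash
helpers):

#include <stdint.h>
#include <stdio.h>

/* Toy request: the merge key is the sector right after its data, the
 * same idea as hashing on the request's end sector. */
struct toy_rq {
	uint64_t start;
	uint32_t nr_sectors;
	struct toy_rq *hash_next;
};

#define HASH_BITS	6
#define HASH_SIZE	(1U << HASH_BITS)
static struct toy_rq *hash[HASH_SIZE];

static unsigned int hash_key(uint64_t end_sector)
{
	return (unsigned int)(end_sector * 2654435761u) & (HASH_SIZE - 1);
}

/* Index a request by the sector where a back-merge candidate would start. */
static void rq_hash_add(struct toy_rq *rq)
{
	unsigned int h = hash_key(rq->start + rq->nr_sectors);

	rq->hash_next = hash[h];
	hash[h] = rq;
}

/* O(1) lookup instead of scanning only the last few queued requests. */
static struct toy_rq *rq_hash_find(uint64_t bio_start)
{
	struct toy_rq *rq;

	for (rq = hash[hash_key(bio_start)]; rq; rq = rq->hash_next)
		if (rq->start + rq->nr_sectors == bio_start)
			return rq;
	return NULL;
}

int main(void)
{
	struct toy_rq a = { .start = 100, .nr_sectors = 8 };

	rq_hash_add(&a);
	printf("bio at sector 108 back-merges? %s\n",
	       rq_hash_find(108) ? "yes" : "no");
	return 0;
}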

With these changes, SCSI-MQ sequential I/O performance is
improved a lot. For lpfc it is basically brought back to the
level of the legacy block path [1]: mq-deadline improves by
more than 10x on lpfc [1] and by more than 3x on SCSI SRP.
With the 'none' scheduler, lpfc improves by 10%, and writes
on SRP improve by more than 10% too.

Also, Bart was worried that this patchset might affect SRP, so
test data on SCSI SRP is provided this time:

- fio (libaio, bs=4k, direct I/O, queue_depth=64, 64 jobs)
- system: 16 cores, dual socket, 96GB RAM

              |v4.13-rc3     |v4.13-rc3     |v4.13-rc3+patches|
              |blk-legacy dd |blk-mq none   |blk-mq none      |
---------------------------------------------------------------
read     :iops|         587K |         526K |            537K |
randread :iops|         115K |         140K |            139K |
write    :iops|         596K |         519K |            602K |
randwrite:iops|         103K |         122K |            120K |


              |v4.13-rc3     |v4.13-rc3     |v4.13-rc3+patches|
              |blk-legacy dd |blk-mq dd     |blk-mq dd        |
---------------------------------------------------------------
read     :iops|         587K |         155K |            522K |
randread :iops|         115K |         140K |            141K |
write    :iops|         596K |         135K |            587K |
randwrite:iops|         103K |         120K |            118K |

V2:
	- dequeue requests from sw queues in round-robin style,
	as suggested by Bart, and introduce one sbitmap helper
	for this purpose
	- improve bio merging via a hash table in the sw queue
	- add comments about using the DISPATCH_BUSY state in a
	lockless way, and simplify the handling of the busy state
	- hold ctx->lock when clearing the ctx busy bit, as
	suggested by Bart


[1] http://marc.info/?l=linux-block&m=150151989915776&w=2

Ming Lei (20):
  blk-mq-sched: fix scheduler bad performance
  sbitmap: introduce __sbitmap_for_each_set()
  blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
  blk-mq-sched: move actual dispatching into one helper
  blk-mq-sched: improve dispatching from sw queue
  blk-mq-sched: don't dequeue request until all in ->dispatch are
    flushed
  blk-mq-sched: introduce blk_mq_sched_queue_depth()
  blk-mq-sched: use q->queue_depth as hint for q->nr_requests
  blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
  blk-mq-sched: introduce helpers for query, change busy state
  blk-mq: introduce helpers for operating ->dispatch list
  blk-mq: introduce pointers to dispatch lock & list
  blk-mq: pass 'request_queue *' to several helpers of operating BUSY
  blk-mq-sched: improve IO scheduling on SCSI device
  block: introduce rqhash helpers
  block: move actual bio merge code into __elv_merge
  block: add check on elevator for supporting bio merge via hashtable
    from blk-mq sw queue
  block: introduce .last_merge and .hash to blk_mq_ctx
  blk-mq-sched: refactor blk_mq_sched_try_merge()
  blk-mq: improve bio merge from blk-mq sw queue

 block/blk-mq-debugfs.c  |  12 ++--
 block/blk-mq-sched.c    | 187 +++++++++++++++++++++++++++++-------------------
 block/blk-mq-sched.h    |  23 ++++++
 block/blk-mq.c          | 133 +++++++++++++++++++++++++++++++---
 block/blk-mq.h          |  73 +++++++++++++++++++
 block/blk-settings.c    |   2 +
 block/blk.h             |  55 ++++++++++++++
 block/elevator.c        |  93 ++++++++++++++----------
 include/linux/blk-mq.h  |   5 ++
 include/linux/blkdev.h  |   5 ++
 include/linux/sbitmap.h |  54 ++++++++++----
 11 files changed, 504 insertions(+), 138 deletions(-)

-- 
2.9.4

* [PATCH V2 01/20] blk-mq-sched: fix scheduler bad performance
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
@ 2017-08-05  6:56 ` Ming Lei
  2017-08-09  0:11   ` Omar Sandoval
  2017-08-05  6:56 ` [PATCH V2 02/20] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
                   ` (22 subsequent siblings)
  23 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

When the hw queue is busy, we shouldn't take requests from the
scheduler queue any more, otherwise I/O merging becomes
difficult to do.

This patch fixes the awful I/O performance on some SCSI devices
(lpfc, qla2xxx, ...) when mq-deadline/kyber is used, by not
taking requests while the hw queue is busy.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..845e5baf8af1 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-	bool did_work = false;
+	bool do_sched_dispatch = true;
 	LIST_HEAD(rq_list);
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -125,7 +125,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
+		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
 	} else if (!has_sched_dispatch) {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list);
@@ -136,7 +136,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * on the dispatch list, OR if we did have work but weren't able
 	 * to make progress.
 	 */
-	if (!did_work && has_sched_dispatch) {
+	if (do_sched_dispatch && has_sched_dispatch) {
 		do {
 			struct request *rq;
 
-- 
2.9.4

* [PATCH V2 02/20] sbitmap: introduce __sbitmap_for_each_set()
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
  2017-08-05  6:56 ` [PATCH V2 01/20] blk-mq-sched: fix scheduler bad performance Ming Lei
@ 2017-08-05  6:56 ` Ming Lei
  2017-08-22 18:28   ` Bart Van Assche
  2017-08-22 18:37   ` Bart Van Assche
  2017-08-05  6:56 ` [PATCH V2 03/20] blk-mq: introduce blk_mq_dispatch_rq_from_ctx() Ming Lei
                   ` (21 subsequent siblings)
  23 siblings, 2 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei, Omar Sandoval

We need to iterate over the ctxs starting from a given offset,
in round-robin fashion, so introduce this helper.
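
As an aside, a minimal userspace model of such an offset-based,
wrap-around iteration might look like this (illustrative only; not
the sbitmap implementation, which walks per-word bitmaps):

#include <stdbool.h>
#include <stdio.h>

#define NR_CTX	8

/* Visit every index with a pending marker, starting at 'off' and
 * wrapping around, so dispatch does not always favour ctx 0. */
static void for_each_set_from(const bool *pending, unsigned int off,
			      bool (*fn)(unsigned int idx, void *data),
			      void *data)
{
	for (unsigned int scanned = 0; scanned < NR_CTX; scanned++) {
		unsigned int idx = (off + scanned) % NR_CTX;

		if (pending[idx] && !fn(idx, data))
			break;		/* callback asked to stop early */
	}
}

static bool print_idx(unsigned int idx, void *data)
{
	(void)data;
	printf("visiting ctx %u\n", idx);
	return true;
}

int main(void)
{
	bool pending[NR_CTX] = { [1] = true, [3] = true, [6] = true };

	/* Starting at 4 visits ctx 6 first, then wraps to 1 and 3. */
	for_each_set_from(pending, 4, print_idx, NULL);
	return 0;
}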

Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/sbitmap.h | 54 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 40 insertions(+), 14 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index a1904aadbc45..ace08016f54a 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -211,10 +211,14 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb);
  */
 bool sbitmap_any_bit_clear(const struct sbitmap *sb);
 
+#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
+#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+
 typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
 
 /**
  * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @off: Offset to iterate from
  * @sb: Bitmap to iterate over.
  * @fn: Callback. Should return true to continue or false to break early.
  * @data: Pointer to pass to callback.
@@ -222,35 +226,57 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
  * This is inline even though it's non-trivial so that the function calls to the
  * callback will hopefully get optimized away.
  */
-static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
-					void *data)
+static inline void __sbitmap_for_each_set(struct sbitmap *sb,
+					  unsigned int off,
+					  sb_for_each_fn fn, void *data)
 {
-	unsigned int i;
+	unsigned int index = SB_NR_TO_INDEX(sb, off);
+	unsigned int nr = SB_NR_TO_BIT(sb, off);
+	unsigned int scanned = 0;
 
-	for (i = 0; i < sb->map_nr; i++) {
-		struct sbitmap_word *word = &sb->map[i];
-		unsigned int off, nr;
+	while (1) {
+		struct sbitmap_word *word = &sb->map[index];
+		unsigned int start = nr;
+		unsigned int depth = min_t(unsigned int, word->depth - start,
+					   sb->depth - scanned);
 
 		if (!word->word)
-			continue;
+			goto next;
 
-		nr = 0;
-		off = i << sb->shift;
+		off = index << sb->shift;
 		while (1) {
-			nr = find_next_bit(&word->word, word->depth, nr);
-			if (nr >= word->depth)
+			nr = find_next_bit(&word->word, start + depth, nr);
+			if (nr >= start + depth)
 				break;
-
 			if (!fn(sb, off + nr, data))
 				return;
 
 			nr++;
 		}
+ next:
+		scanned += depth;
+		if (scanned >= sb->depth)
+			break;
+		nr = 0;
+		if (++index >= sb->map_nr)
+			index = 0;
 	}
 }
 
-#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
-#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+/**
+ * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @sb: Bitmap to iterate over.
+ * @fn: Callback. Should return true to continue or false to break early.
+ * @data: Pointer to pass to callback.
+ *
+ * This is inline even though it's non-trivial so that the function calls to the
+ * callback will hopefully get optimized away.
+ */
+static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
+					void *data)
+{
+	__sbitmap_for_each_set(sb, 0, fn, data);
+}
 
 static inline unsigned long *__sbitmap_word(struct sbitmap *sb,
 					    unsigned int bitnr)
-- 
2.9.4

* [PATCH V2 03/20] blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
  2017-08-05  6:56 ` [PATCH V2 01/20] blk-mq-sched: fix scheduler bad performance Ming Lei
  2017-08-05  6:56 ` [PATCH V2 02/20] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
@ 2017-08-05  6:56 ` Ming Lei
  2017-08-22 18:45   ` Bart Van Assche
  2017-08-05  6:56 ` [PATCH V2 04/20] blk-mq-sched: move actual dispatching into one helper Ming Lei
                   ` (20 subsequent siblings)
  23 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

This function is introduced for dequeuing one request from the
sw queue so that we can dispatch it the way an I/O scheduler does.

More importantly, for some SCSI devices the driver tags are
host wide and their number is quite big, but each LUN has a very
limited queue depth. This function helps avoid dequeuing too many
requests from the sw queue while ->dispatch hasn't been flushed
completely.
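
Roughly, the callback pattern used here can be modelled in plain C
as follows (a sketch only; names are invented and the locking and
sbitmap details of the real function are omitted):

#include <stdbool.h>
#include <stdio.h>

/* Per-CPU software queue modelled as a plain counter of queued requests. */
struct toy_ctx {
	int queued;
};

struct dispatch_data {
	struct toy_ctx *ctxs;
	int picked;		/* index of the ctx we took a request from */
};

/* Iteration callback: grab at most ONE request, then return false so the
 * walk stops instead of draining every sw queue. */
static bool grab_one(unsigned int idx, void *priv)
{
	struct dispatch_data *data = priv;
	struct toy_ctx *ctx = &data->ctxs[idx];

	if (!ctx->queued)
		return true;		/* nothing here, keep scanning */
	ctx->queued--;
	data->picked = (int)idx;
	return false;			/* got one, stop the iteration */
}

int main(void)
{
	struct toy_ctx ctxs[4] = { {0}, {2}, {0}, {5} };
	struct dispatch_data data = { .ctxs = ctxs, .picked = -1 };

	for (unsigned int i = 0; i < 4 && grab_one(i, &data); i++)
		;
	printf("took one request from ctx %d\n", data.picked);
	return 0;
}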

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 38 ++++++++++++++++++++++++++++++++++++++
 block/blk-mq.h |  2 ++
 2 files changed, 40 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 041f7b7fa0d6..d7a89d009f62 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -842,6 +842,44 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 }
 EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
 
+struct dispatch_rq_data {
+	struct blk_mq_hw_ctx *hctx;
+	struct request *rq;
+};
+
+static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
+{
+	struct dispatch_rq_data *dispatch_data = data;
+	struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
+	struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
+
+	spin_lock(&ctx->lock);
+	if (unlikely(!list_empty(&ctx->rq_list))) {
+		dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
+		list_del_init(&dispatch_data->rq->queuelist);
+		if (list_empty(&ctx->rq_list))
+			sbitmap_clear_bit(sb, bitnr);
+	}
+	spin_unlock(&ctx->lock);
+
+	return !dispatch_data->rq;
+}
+
+struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
+					    struct blk_mq_ctx *start)
+{
+	unsigned off = start ? start->index_hw : 0;
+	struct dispatch_rq_data data = {
+		.hctx = hctx,
+		.rq   = NULL,
+	};
+
+	__sbitmap_for_each_set(&hctx->ctx_map, off,
+			       dispatch_rq_from_ctx, &data);
+
+	return data.rq;
+}
+
 static inline unsigned int queued_to_index(unsigned int queued)
 {
 	if (!queued)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 60b01c0309bc..2bfb1254841b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -35,6 +35,8 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 				bool wait);
+struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
+					    struct blk_mq_ctx *start);
 
 /*
  * Internal helpers for allocating/freeing the request map
-- 
2.9.4

* [PATCH V2 04/20] blk-mq-sched: move actual dispatching into one helper
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (2 preceding siblings ...)
  2017-08-05  6:56 ` [PATCH V2 03/20] blk-mq: introduce blk_mq_dispatch_rq_from_ctx() Ming Lei
@ 2017-08-05  6:56 ` Ming Lei
  2017-08-22 19:50   ` Bart Van Assche
  2017-08-05  6:56 ` [PATCH V2 05/20] blk-mq-sched: improve dispatching from sw queue Ming Lei
                   ` (19 subsequent siblings)
  23 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

Move the actual dispatch loop into a helper so that dispatching
from the sw queue becomes easy to support in the following patch.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 845e5baf8af1..f69752961a34 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -89,6 +89,22 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 	return false;
 }
 
+static void blk_mq_do_dispatch(struct request_queue *q,
+			       struct elevator_queue *e,
+			       struct blk_mq_hw_ctx *hctx)
+{
+	LIST_HEAD(rq_list);
+
+	do {
+		struct request *rq;
+
+		rq = e->type->ops.mq.dispatch_request(hctx);
+		if (!rq)
+			break;
+		list_add(&rq->queuelist, &rq_list);
+	} while (blk_mq_dispatch_rq_list(q, &rq_list));
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
@@ -136,16 +152,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * on the dispatch list, OR if we did have work but weren't able
 	 * to make progress.
 	 */
-	if (do_sched_dispatch && has_sched_dispatch) {
-		do {
-			struct request *rq;
-
-			rq = e->type->ops.mq.dispatch_request(hctx);
-			if (!rq)
-				break;
-			list_add(&rq->queuelist, &rq_list);
-		} while (blk_mq_dispatch_rq_list(q, &rq_list));
-	}
+	if (do_sched_dispatch && has_sched_dispatch)
+		blk_mq_do_dispatch(q, e, hctx);
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-- 
2.9.4

* [PATCH V2 05/20] blk-mq-sched: improve dispatching from sw queue
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (3 preceding siblings ...)
  2017-08-05  6:56 ` [PATCH V2 04/20] blk-mq-sched: move actual dispatching into one helper Ming Lei
@ 2017-08-05  6:56 ` Ming Lei
  2017-08-22 19:55   ` Bart Van Assche
  2017-08-22 20:57   ` Bart Van Assche
  2017-08-05  6:56 ` [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
                   ` (18 subsequent siblings)
  23 siblings, 2 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

SCSI devices use a host-wide tagset, and the shared driver tag
space is often quite big. Meanwhile there is also a queue depth
for each LUN (.cmd_per_lun), which is often small.

So lots of requests may stay in the sw queue, and we always flush
all of those belonging to the same hw queue and dispatch them all
to the driver. Unfortunately it is easy to make the queue busy
because of the small per-LUN queue depth. Once these requests are
flushed out, they have to stay in hctx->dispatch, no bio can be
merged into them any more, and sequential I/O performance is hurt.

This patch improves dispatching from the sw queue when there is a
per-request-queue queue depth, by taking requests one by one from
the sw queue, just like an I/O scheduler does.
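
A simplified userspace model of this "take one, try to dispatch,
stop when the LUN is saturated" loop (illustrative only; all names
are made up):

#include <stdbool.h>
#include <stdio.h>

#define NR_CTX		4
#define LUN_DEPTH	2	/* small per-LUN budget, like .cmd_per_lun */

static int queued[NR_CTX] = { 3, 1, 2, 2 };	/* requests per sw queue */
static int inflight;

static bool driver_accepts(void)
{
	if (inflight >= LUN_DEPTH)
		return false;
	inflight++;
	return true;
}

int main(void)
{
	unsigned int next = 0;

	/* Take requests one by one, round robin across sw queues, and stop
	 * as soon as the device is busy; the rest stay merge-able in the
	 * sw queues instead of piling up in ->dispatch. */
	while (1) {
		unsigned int i, pick = NR_CTX;

		for (i = 0; i < NR_CTX; i++) {
			unsigned int idx = (next + i) % NR_CTX;

			if (queued[idx]) {
				pick = idx;
				break;
			}
		}
		if (pick == NR_CTX)
			break;		/* all sw queues are empty */
		if (!driver_accepts())
			break;		/* device busy, stop dequeuing */
		queued[pick]--;
		printf("dispatched one request from ctx %u\n", pick);
		next = (pick + 1) % NR_CTX;	/* fair: start after this ctx */
	}
	return 0;
}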

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 49 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index f69752961a34..e43c9407d653 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -89,9 +89,9 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 	return false;
 }
 
-static void blk_mq_do_dispatch(struct request_queue *q,
-			       struct elevator_queue *e,
-			       struct blk_mq_hw_ctx *hctx)
+static inline void blk_mq_do_dispatch_sched(struct request_queue *q,
+					    struct elevator_queue *e,
+					    struct blk_mq_hw_ctx *hctx)
 {
 	LIST_HEAD(rq_list);
 
@@ -105,6 +105,36 @@ static void blk_mq_do_dispatch(struct request_queue *q,
 	} while (blk_mq_dispatch_rq_list(q, &rq_list));
 }
 
+static inline struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
+						 struct blk_mq_ctx *ctx)
+{
+	unsigned idx = ctx->index_hw;
+
+	if (++idx == hctx->nr_ctx)
+		idx = 0;
+
+	return hctx->ctxs[idx];
+}
+
+static inline void blk_mq_do_dispatch_ctx(struct request_queue *q,
+					  struct blk_mq_hw_ctx *hctx)
+{
+	LIST_HEAD(rq_list);
+	struct blk_mq_ctx *ctx = NULL;
+
+	do {
+		struct request *rq;
+
+		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
+		if (!rq)
+			break;
+		list_add(&rq->queuelist, &rq_list);
+
+		/* round robin for fair dispatch */
+		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
+	} while (blk_mq_dispatch_rq_list(q, &rq_list));
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
@@ -142,18 +172,31 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
 		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
-	} else if (!has_sched_dispatch) {
+	} else if (!has_sched_dispatch && !q->queue_depth) {
+		/*
+		 * If there is no per-request_queue depth, we
+		 * flush all requests in this hw queue, otherwise
+		 * pick up request one by one from sw queue for
+		 * avoiding to mess up I/O merge when dispatch
+		 * is busy, which can be triggered easily by
+		 * the limit of the request_queue's queue depth
+		 */
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list);
 	}
 
+	if (!do_sched_dispatch)
+		return;
+
 	/*
 	 * We want to dispatch from the scheduler if we had no work left
 	 * on the dispatch list, OR if we did have work but weren't able
 	 * to make progress.
 	 */
-	if (do_sched_dispatch && has_sched_dispatch)
-		blk_mq_do_dispatch(q, e, hctx);
+	if (has_sched_dispatch)
+		blk_mq_do_dispatch_sched(q, e, hctx);
+	else
+		blk_mq_do_dispatch_ctx(q, hctx);
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-- 
2.9.4

* [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (4 preceding siblings ...)
  2017-08-05  6:56 ` [PATCH V2 05/20] blk-mq-sched: improve dispatching from sw queue Ming Lei
@ 2017-08-05  6:56 ` Ming Lei
  2017-08-22 20:09   ` Bart Van Assche
  2017-08-23 19:56   ` Jens Axboe
  2017-08-05  6:56 ` [PATCH V2 07/20] blk-mq-sched: introduce blk_mq_sched_queue_depth() Ming Lei
                   ` (17 subsequent siblings)
  23 siblings, 2 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

During dispatching, we move all requests from hctx->dispatch to
one temporary list, then dispatch them one by one from this list.
Unfortunately, during this period a queue run from another context
may think the queue is idle and start to dequeue from the
sw/scheduler queue and try to dispatch, because ->dispatch is
empty. This hurts sequential I/O performance because requests are
dequeued while the lld queue is busy.

This patch introduces the BLK_MQ_S_DISPATCH_BUSY state to make
sure that no request is dequeued until ->dispatch has been
flushed.
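
The intended semantics of the flag can be sketched like this in
plain C (a single-threaded model for illustration; the real code
uses hctx state bits and runs from several contexts):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* One flag per hw queue: while requests sit in ->dispatch, nobody should
 * pull more work out of the sw/scheduler queues. */
static atomic_bool dispatch_busy;
static int leftover;		/* requests parked in ->dispatch */

static void device_returned_busy(int nr_left)
{
	/* Mark busy before parking the leftovers; the bit is only cleared
	 * after ->dispatch has been flushed again. */
	atomic_store(&dispatch_busy, true);
	leftover = nr_left;
}

static void run_queue(void)
{
	if (leftover) {
		/* Retry the parked requests first; assume they all go out. */
		leftover = 0;
		atomic_store(&dispatch_busy, false);
	}

	if (atomic_load(&dispatch_busy)) {
		printf("skip dequeuing from sw/scheduler queue\n");
		return;
	}
	printf("safe to dequeue fresh requests\n");
}

int main(void)
{
	device_returned_busy(2);
	run_queue();	/* flushes the leftovers, then dequeues again */
	run_queue();	/* nothing parked, dequeues as usual */
	return 0;
}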

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   | 54 ++++++++++++++++++++++++++++++++++----------------
 block/blk-mq.c         |  6 ++++++
 include/linux/blk-mq.h |  1 +
 4 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 9ebc2945f991..7932782c4614 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -182,6 +182,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(SCHED_RESTART),
 	HCTX_STATE_NAME(TAG_WAITING),
 	HCTX_STATE_NAME(START_ON_RUN),
+	HCTX_STATE_NAME(DISPATCH_BUSY),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e43c9407d653..26dc19204548 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -140,7 +140,6 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-	bool do_sched_dispatch = true;
 	LIST_HEAD(rq_list);
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -171,8 +170,29 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
-	} else if (!has_sched_dispatch && !q->queue_depth) {
+		blk_mq_dispatch_rq_list(q, &rq_list);
+
+		/*
+		 * We may clear DISPATCH_BUSY just after it
+		 * is set from another context, the only cost
+		 * is that one request is dequeued a bit early,
+		 * we can survive that. Given the window is
+		 * too small, no need to worry about performance
+		 * effect.
+		 */
+		if (list_empty_careful(&hctx->dispatch))
+			clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+	}
+
+	/*
+	 * Wherever DISPATCH_BUSY is set, blk_mq_run_hw_queue()
+	 * will be run to try to make progress, so it is always
+	 * safe to check the state here.
+	 */
+	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
+		return;
+
+	if (!has_sched_dispatch) {
 		/*
 		 * If there is no per-request_queue depth, we
 		 * flush all requests in this hw queue, otherwise
@@ -181,22 +201,21 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 		 * is busy, which can be triggered easily by
 		 * the limit of the request_queue's queue depth
 		 */
-		blk_mq_flush_busy_ctxs(hctx, &rq_list);
-		blk_mq_dispatch_rq_list(q, &rq_list);
-	}
-
-	if (!do_sched_dispatch)
-		return;
+		if (!q->queue_depth) {
+			blk_mq_flush_busy_ctxs(hctx, &rq_list);
+			blk_mq_dispatch_rq_list(q, &rq_list);
+		} else {
+			blk_mq_do_dispatch_ctx(q, hctx);
+		}
+	} else {
 
-	/*
-	 * We want to dispatch from the scheduler if we had no work left
-	 * on the dispatch list, OR if we did have work but weren't able
-	 * to make progress.
-	 */
-	if (has_sched_dispatch)
+		/*
+		 * We want to dispatch from the scheduler if we had no work left
+		 * on the dispatch list, OR if we did have work but weren't able
+		 * to make progress.
+		 */
 		blk_mq_do_dispatch_sched(q, e, hctx);
-	else
-		blk_mq_do_dispatch_ctx(q, hctx);
+	}
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
@@ -323,6 +342,7 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 	 * the dispatch list.
 	 */
 	spin_lock(&hctx->lock);
+	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 	list_add(&rq->queuelist, &hctx->dispatch);
 	spin_unlock(&hctx->lock);
 	return true;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d7a89d009f62..a6b23e51893c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1101,6 +1101,11 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 		blk_mq_put_driver_tag(rq);
 
 		spin_lock(&hctx->lock);
+		/*
+		 * DISPATCH_BUSY won't be cleared until all requests
+		 * in hctx->dispatch are dispatched successfully
+		 */
+		set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
 
@@ -1878,6 +1883,7 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 		return 0;
 
 	spin_lock(&hctx->lock);
+	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 	list_splice_tail_init(&tmp, &hctx->dispatch);
 	spin_unlock(&hctx->lock);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 14542308d25b..cae063482b82 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -172,6 +172,7 @@ enum {
 	BLK_MQ_S_SCHED_RESTART	= 2,
 	BLK_MQ_S_TAG_WAITING	= 3,
 	BLK_MQ_S_START_ON_RUN	= 4,
+	BLK_MQ_S_DISPATCH_BUSY	= 5,
 
 	BLK_MQ_MAX_DEPTH	= 10240,
 
-- 
2.9.4

* [PATCH V2 07/20] blk-mq-sched: introduce blk_mq_sched_queue_depth()
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (5 preceding siblings ...)
  2017-08-05  6:56 ` [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
@ 2017-08-05  6:56 ` Ming Lei
  2017-08-22 20:10   ` Bart Van Assche
  2017-08-05  6:56 ` [PATCH V2 08/20] blk-mq-sched: use q->queue_depth as hint for q->nr_requests Ming Lei
                   ` (16 subsequent siblings)
  23 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

The following patch will propose some hints for figuring out the
default queue depth of the scheduler queue, so introduce the
blk_mq_sched_queue_depth() helper for this purpose.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c |  8 +-------
 block/blk-mq-sched.h | 11 +++++++++++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 26dc19204548..2eda942357c3 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -585,13 +585,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 		return 0;
 	}
 
-	/*
-	 * Default to double of smaller one between hw queue_depth and 128,
-	 * since we don't split into sync/async like the old code did.
-	 * Additionally, this is a per-hw queue depth.
-	 */
-	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
-				   BLKDEV_MAX_RQ);
+	q->nr_requests = blk_mq_sched_queue_depth(q);
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		ret = blk_mq_sched_alloc_tags(q, hctx, i);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 9267d0b7c197..1d47f3fda1d0 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -96,4 +96,15 @@ static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
 	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
 
+static inline unsigned blk_mq_sched_queue_depth(struct request_queue *q)
+{
+	/*
+	 * Default to double of smaller one between hw queue_depth and 128,
+	 * since we don't split into sync/async like the old code did.
+	 * Additionally, this is a per-hw queue depth.
+	 */
+	return 2 * min_t(unsigned int, q->tag_set->queue_depth,
+				   BLKDEV_MAX_RQ);
+}
+
 #endif
-- 
2.9.4

* [PATCH V2 08/20] blk-mq-sched: use q->queue_depth as hint for q->nr_requests
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (6 preceding siblings ...)
  2017-08-05  6:56 ` [PATCH V2 07/20] blk-mq-sched: introduce blk_mq_sched_queue_depth() Ming Lei
@ 2017-08-05  6:56 ` Ming Lei
  2017-08-22 20:20   ` Bart Van Assche
  2017-08-05  6:56 ` [PATCH V2 09/20] blk-mq: introduce BLK_MQ_F_SHARED_DEPTH Ming Lei
                   ` (15 subsequent siblings)
  23 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

SCSI sets q->queue_depth from shost->cmd_per_lun. q->queue_depth
is per request_queue and is more closely related to the scheduler
queue than the hw queue depth, which can be shared among queues
(e.g. with TAG_SHARED).

This patch tries to use q->queue_depth as a hint for computing
q->nr_requests, which should be more effective than the
current way.
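
Restating the heuristic from this patch and the previous one as a
standalone C function, with two worked examples (assuming
BLKDEV_MAX_RQ is 128, as the comments below state):

#include <stdio.h>

#define BLKDEV_MAX_RQ	128

/* Prefer q->queue_depth (cmd_per_lun) over the tag-set depth, double the
 * smaller of that and 128, but never go below 128. */
static unsigned int sched_queue_depth(unsigned int q_queue_depth,
				      unsigned int tagset_depth)
{
	unsigned int depth = q_queue_depth ? q_queue_depth : tagset_depth;

	depth = 2 * (depth < BLKDEV_MAX_RQ ? depth : BLKDEV_MAX_RQ);
	return depth > BLKDEV_MAX_RQ ? depth : BLKDEV_MAX_RQ;
}

int main(void)
{
	/* cmd_per_lun = 3 (a shallow LUN): 2 * 3 = 6, clamped up to 128 */
	printf("%u\n", sched_queue_depth(3, 256));
	/* no q->queue_depth, deep tag set: 2 * min(256, 128) = 256 */
	printf("%u\n", sched_queue_depth(0, 256));
	return 0;
}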

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.h | 18 +++++++++++++++---
 block/blk-mq.c       | 27 +++++++++++++++++++++++++--
 block/blk-mq.h       |  1 +
 block/blk-settings.c |  2 ++
 4 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1d47f3fda1d0..906b10c54f78 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -99,12 +99,24 @@ static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
 static inline unsigned blk_mq_sched_queue_depth(struct request_queue *q)
 {
 	/*
-	 * Default to double of smaller one between hw queue_depth and 128,
+	 * q->queue_depth is closer to the scheduler queue, so use it
+	 * as hint for computing scheduler queue depth if it is valid
+	 */
+	unsigned q_depth = q->queue_depth ?: q->tag_set->queue_depth;
+
+	/*
+	 * Default to double of smaller one between queue depth and 128,
 	 * since we don't split into sync/async like the old code did.
 	 * Additionally, this is a per-hw queue depth.
 	 */
-	return 2 * min_t(unsigned int, q->tag_set->queue_depth,
-				   BLKDEV_MAX_RQ);
+	q_depth = 2 * min_t(unsigned int, q_depth, BLKDEV_MAX_RQ);
+
+	/*
+	 * when queue depth of driver is too small, we set queue depth
+	 * of scheduler queue as 128 which is the default setting of
+	 * block legacy code.
+	 */
+	return max_t(unsigned, q_depth, BLKDEV_MAX_RQ);
 }
 
 #endif
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a6b23e51893c..da2fa175d78f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2602,7 +2602,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_free_tag_set);
 
-int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
+static int __blk_mq_update_nr_requests(struct request_queue *q,
+				       bool sched_only,
+				       unsigned int nr)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
@@ -2621,7 +2623,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 		 * If we're using an MQ scheduler, just update the scheduler
 		 * queue depth. This is similar to what the old code would do.
 		 */
-		if (!hctx->sched_tags) {
+		if (!sched_only && !hctx->sched_tags) {
 			ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
 							min(nr, set->queue_depth),
 							false);
@@ -2641,6 +2643,27 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	return ret;
 }
 
+int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
+{
+	return __blk_mq_update_nr_requests(q, false, nr);
+}
+
+/*
+ * When drivers update q->queue_depth, this API is called so that
+ * we can use this queue depth as hint for adjusting scheduler
+ * queue depth.
+ */
+int blk_mq_update_sched_queue_depth(struct request_queue *q)
+{
+	unsigned nr;
+
+	if (!q->mq_ops || !q->elevator)
+		return 0;
+
+	nr = blk_mq_sched_queue_depth(q);
+	return __blk_mq_update_nr_requests(q, true, nr);
+}
+
 static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 							int nr_hw_queues)
 {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 2bfb1254841b..260b608af336 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -37,6 +37,7 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 				bool wait);
 struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
 					    struct blk_mq_ctx *start);
+int blk_mq_update_sched_queue_depth(struct request_queue *q);
 
 /*
  * Internal helpers for allocating/freeing the request map
diff --git a/block/blk-settings.c b/block/blk-settings.c
index be1f115b538b..94a349601545 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -877,6 +877,8 @@ void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
 	q->queue_depth = depth;
 	wbt_set_queue_depth(q->rq_wb, depth);
+
+	WARN_ON(blk_mq_update_sched_queue_depth(q));
 }
 EXPORT_SYMBOL(blk_set_queue_depth);
 
-- 
2.9.4

* [PATCH V2 09/20] blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (7 preceding siblings ...)
  2017-08-05  6:56 ` [PATCH V2 08/20] blk-mq-sched: use q->queue_depth as hint for q->nr_requests Ming Lei
@ 2017-08-05  6:56 ` Ming Lei
  2017-08-22 21:55   ` Bart Van Assche
  2017-08-05  6:56 ` [PATCH V2 10/20] blk-mq-sched: introduce helpers for query, change busy state Ming Lei
                   ` (14 subsequent siblings)
  23 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

SCSI devices often provide one per-request_queue depth via
q->queue_depth (.cmd_per_lun), which is a global limit on all
hw queues. Once the pending I/O submitted to one request queue
reaches this limit, BLK_STS_RESOURCE is returned to every
dispatch path. That means when one hw queue is stuck, all the
hctxs are actually stuck too.

This flag is introduced to improve blk-mq I/O scheduling on
this kind of device.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   |  2 +-
 block/blk-mq.c         | 25 ++++++++++++++++++++++---
 include/linux/blk-mq.h |  1 +
 4 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 7932782c4614..febcaa7bfc82 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -210,6 +210,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(SG_MERGE),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
+	HCTX_FLAG_NAME(SHARED_DEPTH),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 2eda942357c3..18d997679d7d 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -201,7 +201,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 		 * is busy, which can be triggered easily by
 		 * the limit of the request_queue's queue depth
 		 */
-		if (!q->queue_depth) {
+		if (!(hctx->flags & BLK_MQ_F_SHARED_DEPTH)) {
 			blk_mq_flush_busy_ctxs(hctx, &rq_list);
 			blk_mq_dispatch_rq_list(q, &rq_list);
 		} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index da2fa175d78f..b535587570b8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2656,12 +2656,31 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 int blk_mq_update_sched_queue_depth(struct request_queue *q)
 {
 	unsigned nr;
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+	int ret = 0;
 
-	if (!q->mq_ops || !q->elevator)
-		return 0;
+	if (!q->mq_ops)
+		return ret;
+
+	blk_mq_freeze_queue(q);
+	/*
+	 * if there is q->queue_depth, all hw queues share
+	 * this queue depth limit
+	 */
+	if (q->queue_depth) {
+		queue_for_each_hw_ctx(q, hctx, i)
+			hctx->flags |= BLK_MQ_F_SHARED_DEPTH;
+	}
+
+	if (!q->elevator)
+		goto exit;
 
 	nr = blk_mq_sched_queue_depth(q);
-	return __blk_mq_update_nr_requests(q, true, nr);
+	ret = __blk_mq_update_nr_requests(q, true, nr);
+ exit:
+	blk_mq_unfreeze_queue(q);
+	return ret;
 }
 
 static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index cae063482b82..1197e5dee015 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -164,6 +164,7 @@ enum {
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
+	BLK_MQ_F_SHARED_DEPTH	= 1 << 7,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
 
-- 
2.9.4

* [PATCH V2 10/20] blk-mq-sched: introduce helpers for query, change busy state
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (8 preceding siblings ...)
  2017-08-05  6:56 ` [PATCH V2 09/20] blk-mq: introduce BLK_MQ_F_SHARED_DEPTH Ming Lei
@ 2017-08-05  6:56 ` Ming Lei
  2017-08-22 20:41   ` Bart Van Assche
  2017-08-05  6:56 ` [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list Ming Lei
                   ` (13 subsequent siblings)
  23 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c |  6 +++---
 block/blk-mq.c       |  4 ++--
 block/blk-mq.h       | 15 +++++++++++++++
 3 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 18d997679d7d..9fae76275acf 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -181,7 +181,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 		 * effect.
 		 */
 		if (list_empty_careful(&hctx->dispatch))
-			clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+			blk_mq_hctx_clear_dispatch_busy(hctx);
 	}
 
 	/*
@@ -189,7 +189,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * will be run to try to make progress, so it is always
 	 * safe to check the state here.
 	 */
-	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
+	if (blk_mq_hctx_is_dispatch_busy(hctx))
 		return;
 
 	if (!has_sched_dispatch) {
@@ -342,7 +342,7 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 	 * the dispatch list.
 	 */
 	spin_lock(&hctx->lock);
-	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+	blk_mq_hctx_set_dispatch_busy(hctx);
 	list_add(&rq->queuelist, &hctx->dispatch);
 	spin_unlock(&hctx->lock);
 	return true;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b535587570b8..11042aa0501d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1105,7 +1105,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 		 * DISPATCH_BUSY won't be cleared until all requests
 		 * in hctx->dispatch are dispatched successfully
 		 */
-		set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+		blk_mq_hctx_set_dispatch_busy(hctx);
 		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
 
@@ -1883,7 +1883,7 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 		return 0;
 
 	spin_lock(&hctx->lock);
-	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+	blk_mq_hctx_set_dispatch_busy(hctx);
 	list_splice_tail_init(&tmp, &hctx->dispatch);
 	spin_unlock(&hctx->lock);
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 260b608af336..cadc0c83a140 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -136,4 +136,19 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 	return hctx->nr_ctx && hctx->tags;
 }
 
+static inline bool blk_mq_hctx_is_dispatch_busy(struct blk_mq_hw_ctx *hctx)
+{
+	return test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+}
+
+static inline void blk_mq_hctx_set_dispatch_busy(struct blk_mq_hw_ctx *hctx)
+{
+	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+}
+
+static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx)
+{
+	clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+}
+
 #endif
-- 
2.9.4

* [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (9 preceding siblings ...)
  2017-08-05  6:56 ` [PATCH V2 10/20] blk-mq-sched: introduce helpers for query, change busy state Ming Lei
@ 2017-08-05  6:56 ` Ming Lei
  2017-08-22 20:43   ` Bart Van Assche
  2017-08-05  6:56 ` [PATCH V2 12/20] blk-mq: introduce pointers to dispatch lock & list Ming Lei
                   ` (12 subsequent siblings)
  23 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 13 +++----------
 block/blk-mq.c       | 24 +++++++++++-------------
 block/blk-mq.h       | 40 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+), 23 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 9fae76275acf..5d435f01ecc8 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -152,12 +152,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * If we have previous entries on our dispatch list, grab them first for
 	 * more fair dispatch.
 	 */
-	if (!list_empty_careful(&hctx->dispatch)) {
-		spin_lock(&hctx->lock);
-		if (!list_empty(&hctx->dispatch))
-			list_splice_init(&hctx->dispatch, &rq_list);
-		spin_unlock(&hctx->lock);
-	}
+	if (blk_mq_has_dispatch_rqs(hctx))
+		blk_mq_take_list_from_dispatch(hctx, &rq_list);
 
 	/*
 	 * Only ask the scheduler for requests, if we didn't have residual
@@ -341,10 +337,7 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 	 * If we already have a real request tag, send directly to
 	 * the dispatch list.
 	 */
-	spin_lock(&hctx->lock);
-	blk_mq_hctx_set_dispatch_busy(hctx);
-	list_add(&rq->queuelist, &hctx->dispatch);
-	spin_unlock(&hctx->lock);
+	blk_mq_add_rq_to_dispatch(hctx, rq);
 	return true;
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 11042aa0501d..2392a813f5ee 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -63,7 +63,7 @@ static int blk_mq_poll_stats_bkt(const struct request *rq)
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
 {
 	return sbitmap_any_bit_set(&hctx->ctx_map) ||
-			!list_empty_careful(&hctx->dispatch) ||
+			blk_mq_has_dispatch_rqs(hctx) ||
 			blk_mq_sched_has_work(hctx);
 }
 
@@ -1100,14 +1100,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 		rq = list_first_entry(list, struct request, queuelist);
 		blk_mq_put_driver_tag(rq);
 
-		spin_lock(&hctx->lock);
-		/*
-		 * DISPATCH_BUSY won't be cleared until all requests
-		 * in hctx->dispatch are dispatched successfully
-		 */
-		blk_mq_hctx_set_dispatch_busy(hctx);
-		list_splice_init(list, &hctx->dispatch);
-		spin_unlock(&hctx->lock);
+		blk_mq_add_list_to_dispatch(hctx, list);
 
 		/*
 		 * If SCHED_RESTART was set by the caller of this function and
@@ -1882,10 +1875,7 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	if (list_empty(&tmp))
 		return 0;
 
-	spin_lock(&hctx->lock);
-	blk_mq_hctx_set_dispatch_busy(hctx);
-	list_splice_tail_init(&tmp, &hctx->dispatch);
-	spin_unlock(&hctx->lock);
+	blk_mq_add_list_to_dispatch_tail(hctx, &tmp);
 
 	blk_mq_run_hw_queue(hctx, true);
 	return 0;
@@ -1935,6 +1925,13 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	}
 }
 
+static void blk_mq_init_dispatch(struct request_queue *q,
+		struct blk_mq_hw_ctx *hctx)
+{
+	spin_lock_init(&hctx->lock);
+	INIT_LIST_HEAD(&hctx->dispatch);
+}
+
 static int blk_mq_init_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set,
 		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
@@ -1948,6 +1945,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
 	spin_lock_init(&hctx->lock);
 	INIT_LIST_HEAD(&hctx->dispatch);
+	blk_mq_init_dispatch(q, hctx);
 	hctx->queue = q;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index cadc0c83a140..7f0d35ca5fea 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -151,4 +151,44 @@ static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx)
 	clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 }
 
+static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	return !list_empty_careful(&hctx->dispatch);
+}
+
+static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
+		struct request *rq)
+{
+	spin_lock(&hctx->lock);
+	list_add(&rq->queuelist, &hctx->dispatch);
+	blk_mq_hctx_set_dispatch_busy(hctx);
+	spin_unlock(&hctx->lock);
+}
+
+static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
+		struct list_head *list)
+{
+	spin_lock(&hctx->lock);
+	list_splice_init(list, &hctx->dispatch);
+	blk_mq_hctx_set_dispatch_busy(hctx);
+	spin_unlock(&hctx->lock);
+}
+
+static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx,
+						    struct list_head *list)
+{
+	spin_lock(&hctx->lock);
+	list_splice_tail_init(list, &hctx->dispatch);
+	blk_mq_hctx_set_dispatch_busy(hctx);
+	spin_unlock(&hctx->lock);
+}
+
+static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx *hctx,
+		struct list_head *list)
+{
+	spin_lock(&hctx->lock);
+	list_splice_init(&hctx->dispatch, list);
+	spin_unlock(&hctx->lock);
+}
+
 #endif
-- 
2.9.4

* [PATCH V2 12/20] blk-mq: introduce pointers to dispatch lock & list
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (10 preceding siblings ...)
  2017-08-05  6:56 ` [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list Ming Lei
@ 2017-08-05  6:56 ` Ming Lei
  2017-08-05  6:56 ` [PATCH V2 13/20] blk-mq: pass 'request_queue *' to several helpers of operating BUSY Ming Lei
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

Prepare for supporting a per-request_queue dispatch list by
introducing pointers to the dispatch lock and list, so that no
runtime check is needed to pick the right ones.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c | 10 +++++-----
 block/blk-mq-sched.c   |  2 +-
 block/blk-mq.c         |  7 +++++--
 block/blk-mq.h         | 26 +++++++++++++-------------
 include/linux/blk-mq.h |  3 +++
 5 files changed, 27 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index febcaa7bfc82..734e478a0395 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -371,23 +371,23 @@ static void *hctx_dispatch_start(struct seq_file *m, loff_t *pos)
 {
 	struct blk_mq_hw_ctx *hctx = m->private;
 
-	spin_lock(&hctx->lock);
-	return seq_list_start(&hctx->dispatch, *pos);
+	spin_lock(hctx->dispatch_lock);
+	return seq_list_start(hctx->dispatch_list, *pos);
 }
 
 static void *hctx_dispatch_next(struct seq_file *m, void *v, loff_t *pos)
 {
 	struct blk_mq_hw_ctx *hctx = m->private;
 
-	return seq_list_next(v, &hctx->dispatch, pos);
+	return seq_list_next(v, hctx->dispatch_list, pos);
 }
 
 static void hctx_dispatch_stop(struct seq_file *m, void *v)
-	__releases(&hctx->lock)
+	__releases(hctx->dispatch_lock)
 {
 	struct blk_mq_hw_ctx *hctx = m->private;
 
-	spin_unlock(&hctx->lock);
+	spin_unlock(hctx->dispatch_lock);
 }
 
 static const struct seq_operations hctx_dispatch_seq_ops = {
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 5d435f01ecc8..53d6d5acd71c 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -176,7 +176,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 		 * too small, no need to worry about performance
 		 * effect.
 		 */
-		if (list_empty_careful(&hctx->dispatch))
+		if (list_empty_careful(hctx->dispatch_list))
 			blk_mq_hctx_clear_dispatch_busy(hctx);
 	}
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2392a813f5ee..86e4bb44f054 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1928,8 +1928,11 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 static void blk_mq_init_dispatch(struct request_queue *q,
 		struct blk_mq_hw_ctx *hctx)
 {
-	spin_lock_init(&hctx->lock);
-	INIT_LIST_HEAD(&hctx->dispatch);
+	hctx->dispatch_lock = &hctx->lock;
+	hctx->dispatch_list = &hctx->dispatch;
+
+	spin_lock_init(hctx->dispatch_lock);
+	INIT_LIST_HEAD(hctx->dispatch_list);
 }
 
 static int blk_mq_init_hctx(struct request_queue *q,
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 7f0d35ca5fea..474e1c2aa8c4 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -153,42 +153,42 @@ static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx)
 
 static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
 {
-	return !list_empty_careful(&hctx->dispatch);
+	return !list_empty_careful(hctx->dispatch_list);
 }
 
 static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
 		struct request *rq)
 {
-	spin_lock(&hctx->lock);
-	list_add(&rq->queuelist, &hctx->dispatch);
+	spin_lock(hctx->dispatch_lock);
+	list_add(&rq->queuelist, hctx->dispatch_list);
 	blk_mq_hctx_set_dispatch_busy(hctx);
-	spin_unlock(&hctx->lock);
+	spin_unlock(hctx->dispatch_lock);
 }
 
 static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
 		struct list_head *list)
 {
-	spin_lock(&hctx->lock);
-	list_splice_init(list, &hctx->dispatch);
+	spin_lock(hctx->dispatch_lock);
+	list_splice_init(list, hctx->dispatch_list);
 	blk_mq_hctx_set_dispatch_busy(hctx);
-	spin_unlock(&hctx->lock);
+	spin_unlock(hctx->dispatch_lock);
 }
 
 static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx,
 						    struct list_head *list)
 {
-	spin_lock(&hctx->lock);
-	list_splice_tail_init(list, &hctx->dispatch);
+	spin_lock(hctx->dispatch_lock);
+	list_splice_tail_init(list, hctx->dispatch_list);
 	blk_mq_hctx_set_dispatch_busy(hctx);
-	spin_unlock(&hctx->lock);
+	spin_unlock(hctx->dispatch_lock);
 }
 
 static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx *hctx,
 		struct list_head *list)
 {
-	spin_lock(&hctx->lock);
-	list_splice_init(&hctx->dispatch, list);
-	spin_unlock(&hctx->lock);
+	spin_lock(hctx->dispatch_lock);
+	list_splice_init(hctx->dispatch_list, list);
+	spin_unlock(hctx->dispatch_lock);
 }
 
 #endif
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 1197e5dee015..5c316e9e03a4 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -22,6 +22,9 @@ struct blk_mq_hw_ctx {
 
 	unsigned long		flags;		/* BLK_MQ_F_* flags */
 
+	spinlock_t		*dispatch_lock;
+	struct list_head	*dispatch_list;
+
 	void			*sched_data;
 	struct request_queue	*queue;
 	struct blk_flush_queue	*fq;
-- 
2.9.4

* [PATCH V2 13/20] blk-mq: pass 'request_queue *' to several helpers of operating BUSY
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (11 preceding siblings ...)
  2017-08-05  6:56 ` [PATCH V2 12/20] blk-mq: introduce pointers to dispatch lock & list Ming Lei
@ 2017-08-05  6:56 ` Ming Lei
  2017-08-05  6:56 ` [PATCH V2 14/20] blk-mq-sched: improve IO scheduling on SCSI device Ming Lei
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

We need to support a per-request_queue dispatch list to avoid
early dispatch in the shared queue depth case.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c |  4 ++--
 block/blk-mq.c       |  2 +-
 block/blk-mq.h       | 19 +++++++++++--------
 3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 53d6d5acd71c..b11f53d1e1f3 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -177,7 +177,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 		 * effect.
 		 */
 		if (list_empty_careful(hctx->dispatch_list))
-			blk_mq_hctx_clear_dispatch_busy(hctx);
+			blk_mq_hctx_clear_dispatch_busy(q, hctx);
 	}
 
 	/*
@@ -185,7 +185,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * will be run to try to make progress, so it is always
 	 * safe to check the state here.
 	 */
-	if (blk_mq_hctx_is_dispatch_busy(hctx))
+	if (blk_mq_hctx_is_dispatch_busy(q, hctx))
 		return;
 
 	if (!has_sched_dispatch) {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 86e4bb44f054..c6624154bb37 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1100,7 +1100,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 		rq = list_first_entry(list, struct request, queuelist);
 		blk_mq_put_driver_tag(rq);
 
-		blk_mq_add_list_to_dispatch(hctx, list);
+		blk_mq_add_list_to_dispatch(q, hctx, list);
 
 		/*
 		 * If SCHED_RESTART was set by the caller of this function and
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 474e1c2aa8c4..86a35c799ca6 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -136,17 +136,20 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 	return hctx->nr_ctx && hctx->tags;
 }
 
-static inline bool blk_mq_hctx_is_dispatch_busy(struct blk_mq_hw_ctx *hctx)
+static inline bool blk_mq_hctx_is_dispatch_busy(struct request_queue *q,
+		struct blk_mq_hw_ctx *hctx)
 {
 	return test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 }
 
-static inline void blk_mq_hctx_set_dispatch_busy(struct blk_mq_hw_ctx *hctx)
+static inline void blk_mq_hctx_set_dispatch_busy(struct request_queue *q,
+		struct blk_mq_hw_ctx *hctx)
 {
 	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 }
 
-static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx)
+static inline void blk_mq_hctx_clear_dispatch_busy(struct request_queue *q,
+		struct blk_mq_hw_ctx *hctx)
 {
 	clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 }
@@ -161,16 +164,16 @@ static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
 {
 	spin_lock(hctx->dispatch_lock);
 	list_add(&rq->queuelist, hctx->dispatch_list);
-	blk_mq_hctx_set_dispatch_busy(hctx);
+	blk_mq_hctx_set_dispatch_busy(rq->q, hctx);
 	spin_unlock(hctx->dispatch_lock);
 }
 
-static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
-		struct list_head *list)
+static inline void blk_mq_add_list_to_dispatch(struct request_queue *q,
+		struct blk_mq_hw_ctx *hctx, struct list_head *list)
 {
 	spin_lock(hctx->dispatch_lock);
 	list_splice_init(list, hctx->dispatch_list);
-	blk_mq_hctx_set_dispatch_busy(hctx);
+	blk_mq_hctx_set_dispatch_busy(q, hctx);
 	spin_unlock(hctx->dispatch_lock);
 }
 
@@ -179,7 +182,7 @@ static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx,
 {
 	spin_lock(hctx->dispatch_lock);
 	list_splice_tail_init(list, hctx->dispatch_list);
-	blk_mq_hctx_set_dispatch_busy(hctx);
+	blk_mq_hctx_set_dispatch_busy(hctx->queue, hctx);
 	spin_unlock(hctx->dispatch_lock);
 }
 
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH V2 14/20] blk-mq-sched: improve IO scheduling on SCSI device
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (12 preceding siblings ...)
  2017-08-05  6:56 ` [PATCH V2 13/20] blk-mq: pass 'request_queue *' to several helpers of operating BUSY Ming Lei
@ 2017-08-05  6:56 ` Ming Lei
  2017-08-22 20:51   ` Bart Van Assche
  2017-08-05  6:57 ` [PATCH V2 15/20] block: introduce rqhash helpers Ming Lei
                   ` (9 subsequent siblings)
  23 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

A SCSI device often has a per-request_queue queue depth
(.cmd_per_lun), which is in fact applied across all hw queues;
this patchset calls that a shared queue depth.

One principle of the scheduler is that we shouldn't dequeue
requests from the sw/scheduler queue and dispatch them to the
driver while the low-level queue is busy.

For a SCSI device, being busy depends on the per-request_queue
limit, so we should hold back all hw queues when the request
queue is busy.

This patch introduces a per-request_queue dispatch list for this
purpose; only when all requests on this list have been dispatched
successfully do we resume dequeuing requests from the
sw/scheduler queue and dispatching them to the lld.
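
For illustration, a minimal sketch of what the shared-depth case boils
down to (not part of the patch; the helper name is hypothetical):

	/* illustrative only: one busy flag throttles all hw queues */
	static bool example_can_dequeue(struct request_queue *q,
					struct blk_mq_hw_ctx *hctx)
	{
		if (hctx->flags & BLK_MQ_F_SHARED_DEPTH)
			/* shared .cmd_per_lun budget: per-queue state */
			return !q->mq_dispatch_busy;

		/* otherwise keep the per-hctx state from earlier patches */
		return !test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
	}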

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         |  8 +++++++-
 block/blk-mq.h         | 14 +++++++++++---
 include/linux/blkdev.h |  5 +++++
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c6624154bb37..db21e71bb087 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2670,8 +2670,14 @@ int blk_mq_update_sched_queue_depth(struct request_queue *q)
 	 * this queue depth limit
 	 */
 	if (q->queue_depth) {
-		queue_for_each_hw_ctx(q, hctx, i)
+		queue_for_each_hw_ctx(q, hctx, i) {
 			hctx->flags |= BLK_MQ_F_SHARED_DEPTH;
+			hctx->dispatch_lock = &q->__mq_dispatch_lock;
+			hctx->dispatch_list = &q->__mq_dispatch_list;
+
+			spin_lock_init(hctx->dispatch_lock);
+			INIT_LIST_HEAD(hctx->dispatch_list);
+		}
 	}
 
 	if (!q->elevator)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 86a35c799ca6..295fd9dfb01d 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -139,19 +139,27 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 static inline bool blk_mq_hctx_is_dispatch_busy(struct request_queue *q,
 		struct blk_mq_hw_ctx *hctx)
 {
-	return test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+	if (!(hctx->flags & BLK_MQ_F_SHARED_DEPTH))
+		return test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+	return q->mq_dispatch_busy;
 }
 
 static inline void blk_mq_hctx_set_dispatch_busy(struct request_queue *q,
 		struct blk_mq_hw_ctx *hctx)
 {
-	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+	if (!(hctx->flags & BLK_MQ_F_SHARED_DEPTH))
+		set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+	else
+		q->mq_dispatch_busy = 1;
 }
 
 static inline void blk_mq_hctx_clear_dispatch_busy(struct request_queue *q,
 		struct blk_mq_hw_ctx *hctx)
 {
-	clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+	if (!(hctx->flags & BLK_MQ_F_SHARED_DEPTH))
+		clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+	else
+		q->mq_dispatch_busy = 0;
 }
 
 static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 25f6a0cb27d3..bc0e607710f2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -395,6 +395,11 @@ struct request_queue {
 
 	atomic_t		shared_hctx_restart;
 
+	/* blk-mq dispatch list and lock for shared queue depth case */
+	struct list_head	__mq_dispatch_list;
+	spinlock_t		__mq_dispatch_lock;
+	unsigned int		mq_dispatch_busy;
+
 	struct blk_queue_stats	*stats;
 	struct rq_wb		*rq_wb;
 
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH V2 15/20] block: introduce rqhash helpers
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (13 preceding siblings ...)
  2017-08-05  6:56 ` [PATCH V2 14/20] blk-mq-sched: improve IO scheduling on SCSI device Ming Lei
@ 2017-08-05  6:57 ` Ming Lei
  2017-08-05  6:57 ` [PATCH V2 16/20] block: move actual bio merge code into __elv_merge Ming Lei
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:57 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

We need these helpers to support using a hash table to improve
bio merge from the sw queue in the following patches.

No functional change.
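
As a usage illustration (the function name is hypothetical, not part of
the patch): a back-merge candidate is looked up by the bio's start
sector, because rq_hash_key() is the end sector of a request:

	/* illustrative only */
	static struct request *example_find_back_merge(struct hlist_head *hash,
						       struct bio *bio)
	{
		/* a request ending where this bio starts can be extended */
		return rqhash_find(hash, bio->bi_iter.bi_sector);
	}

	/* insert side: rqhash_add(hash, rq); removal: rqhash_del(rq) */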

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk.h      | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator.c | 36 +++++++-----------------------------
 2 files changed, 59 insertions(+), 29 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 3a3d715bd725..313002c2f666 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -147,6 +147,58 @@ static inline void blk_clear_rq_complete(struct request *rq)
  */
 #define ELV_ON_HASH(rq) ((rq)->rq_flags & RQF_HASHED)
 
+/*
+ * Merge hash stuff.
+ */
+#define rq_hash_key(rq)		(blk_rq_pos(rq) + blk_rq_sectors(rq))
+
+#define bucket(head, key)	&((head)[hash_min((key), ELV_HASH_BITS)])
+
+static inline void __rqhash_del(struct request *rq)
+{
+	hash_del(&rq->hash);
+	rq->rq_flags &= ~RQF_HASHED;
+}
+
+static inline void rqhash_del(struct request *rq)
+{
+	if (ELV_ON_HASH(rq))
+		__rqhash_del(rq);
+}
+
+static inline void rqhash_add(struct hlist_head *hash, struct request *rq)
+{
+	BUG_ON(ELV_ON_HASH(rq));
+	hlist_add_head(&rq->hash, bucket(hash, rq_hash_key(rq)));
+	rq->rq_flags |= RQF_HASHED;
+}
+
+static inline void rqhash_reposition(struct hlist_head *hash, struct request *rq)
+{
+	__rqhash_del(rq);
+	rqhash_add(hash, rq);
+}
+
+static inline struct request *rqhash_find(struct hlist_head *hash, sector_t offset)
+{
+	struct hlist_node *next;
+	struct request *rq = NULL;
+
+	hlist_for_each_entry_safe(rq, next, bucket(hash, offset), hash) {
+		BUG_ON(!ELV_ON_HASH(rq));
+
+		if (unlikely(!rq_mergeable(rq))) {
+			__rqhash_del(rq);
+			continue;
+		}
+
+		if (rq_hash_key(rq) == offset)
+			return rq;
+	}
+
+	return NULL;
+}
+
 void blk_insert_flush(struct request *rq);
 
 static inline struct request *__elv_next_request(struct request_queue *q)
diff --git a/block/elevator.c b/block/elevator.c
index 4bb2f0c93fa6..7423e9c9cb27 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -47,11 +47,6 @@ static DEFINE_SPINLOCK(elv_list_lock);
 static LIST_HEAD(elv_list);
 
 /*
- * Merge hash stuff.
- */
-#define rq_hash_key(rq)		(blk_rq_pos(rq) + blk_rq_sectors(rq))
-
-/*
  * Query io scheduler to see if the current process issuing bio may be
  * merged with rq.
  */
@@ -268,14 +263,12 @@ EXPORT_SYMBOL(elevator_exit);
 
 static inline void __elv_rqhash_del(struct request *rq)
 {
-	hash_del(&rq->hash);
-	rq->rq_flags &= ~RQF_HASHED;
+	__rqhash_del(rq);
 }
 
 void elv_rqhash_del(struct request_queue *q, struct request *rq)
 {
-	if (ELV_ON_HASH(rq))
-		__elv_rqhash_del(rq);
+	rqhash_del(rq);
 }
 EXPORT_SYMBOL_GPL(elv_rqhash_del);
 
@@ -283,37 +276,22 @@ void elv_rqhash_add(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	BUG_ON(ELV_ON_HASH(rq));
-	hash_add(e->hash, &rq->hash, rq_hash_key(rq));
-	rq->rq_flags |= RQF_HASHED;
+	rqhash_add(e->hash, rq);
 }
 EXPORT_SYMBOL_GPL(elv_rqhash_add);
 
 void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
 {
-	__elv_rqhash_del(rq);
-	elv_rqhash_add(q, rq);
+	struct elevator_queue *e = q->elevator;
+
+	rqhash_reposition(e->hash, rq);
 }
 
 struct request *elv_rqhash_find(struct request_queue *q, sector_t offset)
 {
 	struct elevator_queue *e = q->elevator;
-	struct hlist_node *next;
-	struct request *rq;
-
-	hash_for_each_possible_safe(e->hash, rq, next, hash, offset) {
-		BUG_ON(!ELV_ON_HASH(rq));
 
-		if (unlikely(!rq_mergeable(rq))) {
-			__elv_rqhash_del(rq);
-			continue;
-		}
-
-		if (rq_hash_key(rq) == offset)
-			return rq;
-	}
-
-	return NULL;
+	return rqhash_find(e->hash, offset);
 }
 
 /*
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH V2 16/20] block: move actual bio merge code into __elv_merge
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (14 preceding siblings ...)
  2017-08-05  6:57 ` [PATCH V2 15/20] block: introduce rqhash helpers Ming Lei
@ 2017-08-05  6:57 ` Ming Lei
  2017-08-05  6:57 ` [PATCH V2 17/20] block: add check on elevator for supporting bio merge via hashtable from blk-mq sw queue Ming Lei
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:57 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

This allows us to reuse __elv_merge() to merge bios into
requests from the sw queue in the following patches.
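
For orientation, elv_merge() becomes the thin wrapper shown below (this
duplicates the hunk in this patch), and a later patch in the series adds
a sw-queue flavour on top of the same body:

	enum elv_merge elv_merge(struct request_queue *q, struct request **req,
				 struct bio *bio)
	{
		return __elv_merge(q, req, bio, q->elevator->hash, q->last_merge);
	}

	/* added later in the series (patch 18): */
	enum elv_merge elv_merge_ctx(struct request_queue *q, struct request **req,
				     struct bio *bio, struct blk_mq_ctx *ctx)
	{
		return __elv_merge(q, req, bio, ctx->hash, ctx->last_merge);
	}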

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 7423e9c9cb27..2fd2f777127b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -409,8 +409,9 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
 }
 EXPORT_SYMBOL(elv_dispatch_add_tail);
 
-enum elv_merge elv_merge(struct request_queue *q, struct request **req,
-		struct bio *bio)
+static inline enum elv_merge __elv_merge(struct request_queue *q,
+		struct request **req, struct bio *bio,
+		struct hlist_head *hash, struct request *last_merge)
 {
 	struct elevator_queue *e = q->elevator;
 	struct request *__rq;
@@ -427,11 +428,11 @@ enum elv_merge elv_merge(struct request_queue *q, struct request **req,
 	/*
 	 * First try one-hit cache.
 	 */
-	if (q->last_merge && elv_bio_merge_ok(q->last_merge, bio)) {
-		enum elv_merge ret = blk_try_merge(q->last_merge, bio);
+	if (last_merge && elv_bio_merge_ok(last_merge, bio)) {
+		enum elv_merge ret = blk_try_merge(last_merge, bio);
 
 		if (ret != ELEVATOR_NO_MERGE) {
-			*req = q->last_merge;
+			*req = last_merge;
 			return ret;
 		}
 	}
@@ -442,7 +443,7 @@ enum elv_merge elv_merge(struct request_queue *q, struct request **req,
 	/*
 	 * See if our hash lookup can find a potential backmerge.
 	 */
-	__rq = elv_rqhash_find(q, bio->bi_iter.bi_sector);
+	__rq = rqhash_find(hash, bio->bi_iter.bi_sector);
 	if (__rq && elv_bio_merge_ok(__rq, bio)) {
 		*req = __rq;
 		return ELEVATOR_BACK_MERGE;
@@ -456,6 +457,12 @@ enum elv_merge elv_merge(struct request_queue *q, struct request **req,
 	return ELEVATOR_NO_MERGE;
 }
 
+enum elv_merge elv_merge(struct request_queue *q, struct request **req,
+		struct bio *bio)
+{
+	return __elv_merge(q, req, bio, q->elevator->hash, q->last_merge);
+}
+
 /*
  * Attempt to do an insertion back merge. Only check for the case where
  * we can append 'rq' to an existing request, so we can throw 'rq' away
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH V2 17/20] block: add check on elevator for supporting bio merge via hashtable from blk-mq sw queue
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (15 preceding siblings ...)
  2017-08-05  6:57 ` [PATCH V2 16/20] block: move actual bio merge code into __elv_merge Ming Lei
@ 2017-08-05  6:57 ` Ming Lei
  2017-08-05  6:57 ` [PATCH V2 18/20] block: introduce .last_merge and .hash to blk_mq_ctx Ming Lei
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:57 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

blk_mq_sched_try_merge() will be reused in the following patches
to support bio merge to the blk-mq sw queue, so add checks to the
related functions that are called from blk_mq_sched_try_merge().
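
Roughly, the merge path that can run without an elevator attached looks
like this once the rest of the series is applied (a sketch using names
from patches 18-20, not new code):

	/*
	 *   __blk_mq_sched_bio_merge()            q->elevator == NULL
	 *     -> blk_mq_ctx_try_merge()
	 *          -> elv_merge_ctx() / __elv_merge()
	 *               -> elv_bio_merge_ok()     returns true early
	 *          -> __blk_mq_try_merge()
	 *               -> bio_attempt_back_merge()
	 *               -> attempt_back_merge()
	 *                    -> elv_latter_request()  returns NULL
	 *               -> attempt_front_merge()
	 *                    -> elv_former_request()  returns NULL
	 */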

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/block/elevator.c b/block/elevator.c
index 2fd2f777127b..eb56e2709291 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -71,6 +71,10 @@ bool elv_bio_merge_ok(struct request *rq, struct bio *bio)
 	if (!blk_rq_merge_ok(rq, bio))
 		return false;
 
+	/* We need to support to merge bio from sw queue */
+	if (!rq->q->elevator)
+		return true;
+
 	if (!elv_iosched_allow_bio_merge(rq, bio))
 		return false;
 
@@ -449,6 +453,10 @@ static inline enum elv_merge __elv_merge(struct request_queue *q,
 		return ELEVATOR_BACK_MERGE;
 	}
 
+	/* no elevator when merging bio to blk-mq sw queue */
+	if (!e)
+		return ELEVATOR_NO_MERGE;
+
 	if (e->uses_mq && e->type->ops.mq.request_merge)
 		return e->type->ops.mq.request_merge(q, req, bio);
 	else if (!e->uses_mq && e->type->ops.sq.elevator_merge_fn)
@@ -711,6 +719,10 @@ struct request *elv_latter_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/* no elevator when merging bio to blk-mq sw queue */
+	if (!e)
+		return NULL;
+
 	if (e->uses_mq && e->type->ops.mq.next_request)
 		return e->type->ops.mq.next_request(q, rq);
 	else if (!e->uses_mq && e->type->ops.sq.elevator_latter_req_fn)
@@ -723,6 +735,10 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/* no elevator when merging bio to blk-mq sw queue */
+	if (!e)
+		return NULL;
+
 	if (e->uses_mq && e->type->ops.mq.former_request)
 		return e->type->ops.mq.former_request(q, rq);
 	if (!e->uses_mq && e->type->ops.sq.elevator_former_req_fn)
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH V2 18/20] block: introduce .last_merge and .hash to blk_mq_ctx
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (16 preceding siblings ...)
  2017-08-05  6:57 ` [PATCH V2 17/20] block: add check on elevator for supporting bio merge via hashtable from blk-mq sw queue Ming Lei
@ 2017-08-05  6:57 ` Ming Lei
  2017-08-05  6:57 ` [PATCH V2 19/20] blk-mq-sched: refactor blk_mq_sched_try_merge() Ming Lei
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:57 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

Prepare for supporting bio merge to the sw queue when no
blk-mq io scheduler is in use.
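
For illustration (the helper name is hypothetical), the per-ctx state
added here mirrors what the elevator keeps per request_queue, namely a
hash table plus a one-hit cache, consulted under ctx->lock:

	/*
	 * Illustrative only: ctx->hash plays the role of q->elevator->hash
	 * and ctx->last_merge the role of q->last_merge, per sw queue.
	 */
	static struct request *example_ctx_back_merge(struct blk_mq_ctx *ctx,
						      struct bio *bio)
	{
		lockdep_assert_held(&ctx->lock);
		return rqhash_find(ctx->hash, bio->bi_iter.bi_sector);
	}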

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.h   |  4 ++++
 block/blk.h      |  3 +++
 block/elevator.c | 22 +++++++++++++++++++---
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 295fd9dfb01d..3400b01a2e4c 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -18,6 +18,10 @@ struct blk_mq_ctx {
 	unsigned long		rq_dispatched[2];
 	unsigned long		rq_merged;
 
+	/* bio merge via request hash table */
+	struct request		*last_merge;
+	DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
+
 	/* incremented at completion time */
 	unsigned long		____cacheline_aligned_in_smp rq_completed[2];
 
diff --git a/block/blk.h b/block/blk.h
index 313002c2f666..6847c5435cca 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -199,6 +199,9 @@ static inline struct request *rqhash_find(struct hlist_head *hash, sector_t offs
 	return NULL;
 }
 
+enum elv_merge elv_merge_ctx(struct request_queue *q, struct request **req,
+                struct bio *bio, struct blk_mq_ctx *ctx);
+
 void blk_insert_flush(struct request *rq);
 
 static inline struct request *__elv_next_request(struct request_queue *q)
diff --git a/block/elevator.c b/block/elevator.c
index eb56e2709291..cbdbe823d262 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -471,6 +471,13 @@ enum elv_merge elv_merge(struct request_queue *q, struct request **req,
 	return __elv_merge(q, req, bio, q->elevator->hash, q->last_merge);
 }
 
+enum elv_merge elv_merge_ctx(struct request_queue *q, struct request **req,
+		struct bio *bio, struct blk_mq_ctx *ctx)
+{
+	WARN_ON_ONCE(!q->mq_ops);
+	return __elv_merge(q, req, bio, ctx->hash, ctx->last_merge);
+}
+
 /*
  * Attempt to do an insertion back merge. Only check for the case where
  * we can append 'rq' to an existing request, so we can throw 'rq' away
@@ -516,16 +523,25 @@ void elv_merged_request(struct request_queue *q, struct request *rq,
 		enum elv_merge type)
 {
 	struct elevator_queue *e = q->elevator;
+	struct hlist_head *hash = e->hash;
+
+	/* we do bio merge on blk-mq sw queue */
+	if (q->mq_ops && !e) {
+		rq->mq_ctx->last_merge = rq;
+		hash = rq->mq_ctx->hash;
+		goto reposition;
+	}
+
+	q->last_merge = rq;
 
 	if (e->uses_mq && e->type->ops.mq.request_merged)
 		e->type->ops.mq.request_merged(q, rq, type);
 	else if (!e->uses_mq && e->type->ops.sq.elevator_merged_fn)
 		e->type->ops.sq.elevator_merged_fn(q, rq, type);
 
+ reposition:
 	if (type == ELEVATOR_BACK_MERGE)
-		elv_rqhash_reposition(q, rq);
-
-	q->last_merge = rq;
+		rqhash_reposition(hash, rq);
 }
 
 void elv_merge_requests(struct request_queue *q, struct request *rq,
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH V2 19/20] blk-mq-sched: refactor blk_mq_sched_try_merge()
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (17 preceding siblings ...)
  2017-08-05  6:57 ` [PATCH V2 18/20] block: introduce .last_merge and .hash to blk_mq_ctx Ming Lei
@ 2017-08-05  6:57 ` Ming Lei
  2017-08-05  6:57 ` [PATCH V2 20/20] blk-mq: improve bio merge from blk-mq sw queue Ming Lei
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:57 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

This patch introduces __blk_mq_try_merge(), which will be
reused for bio merge to the sw queue in the following patch.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index b11f53d1e1f3..7fb7949ed7ce 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -214,12 +214,13 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	}
 }
 
-bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-			    struct request **merged_request)
+static inline bool __blk_mq_try_merge(struct request_queue *q,
+		struct bio *bio, struct request **merged_request,
+		struct request *candidate, enum elv_merge type)
 {
-	struct request *rq;
+	struct request *rq = candidate;
 
-	switch (elv_merge(q, &rq, bio)) {
+	switch (type) {
 	case ELEVATOR_BACK_MERGE:
 		if (!blk_mq_sched_allow_merge(q, rq, bio))
 			return false;
@@ -242,6 +243,15 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
 		return false;
 	}
 }
+
+bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
+		struct request **merged_request)
+{
+	struct request *rq;
+	enum elv_merge type = elv_merge(q, &rq, bio);
+
+	return __blk_mq_try_merge(q, bio, merged_request, rq, type);
+}
 EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
 
 /*
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH V2 20/20] blk-mq: improve bio merge from blk-mq sw queue
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (18 preceding siblings ...)
  2017-08-05  6:57 ` [PATCH V2 19/20] blk-mq-sched: refactor blk_mq_sched_try_merge() Ming Lei
@ 2017-08-05  6:57 ` Ming Lei
  2017-08-07 12:48 ` [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Laurence Oberman
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-05  6:57 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Laurence Oberman, Ming Lei

This patch uses a hash table to do bio merge from the sw queue,
so we align with the way bio merge is done by the blk-mq
scheduler and the legacy block path.

It turns out that bio merge via a hash table is more efficient
than a simple merge against the last 8 requests in the sw queue.
On SCSI SRP, an IOPS increase of ~10% is observed in the
sequential IO test with this patch.

It is also one step towards a real 'none' scheduler, which would
allow the blk-mq scheduler framework to be cleaner.
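
For illustration, a simplified sketch of the submit-side merge (the
function name is hypothetical and the locking is slightly simplified
compared with blk_mq_ctx_try_merge() below):

	/* illustrative only */
	static bool example_sw_queue_merge(struct request_queue *q,
					   struct blk_mq_ctx *ctx,
					   struct bio *bio)
	{
		struct request *rq, *free = NULL;
		enum elv_merge type;
		bool merged;

		spin_lock(&ctx->lock);
		type = elv_merge_ctx(q, &rq, bio, ctx);	/* per-ctx hash lookup */
		merged = __blk_mq_try_merge(q, bio, &free, rq, type);
		spin_unlock(&ctx->lock);

		if (free)
			blk_mq_free_request(free);
		return merged;
	}

	/*
	 * Insert adds the request to the hash (rqhash_add(ctx->hash, rq));
	 * the dispatch and flush paths remove it (rqhash_del(rq)) and clear
	 * ctx->last_merge so no stale pointer is ever followed.
	 */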

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 49 ++++++++++++-------------------------------------
 block/blk-mq.c       | 28 +++++++++++++++++++++++++---
 2 files changed, 37 insertions(+), 40 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 7fb7949ed7ce..f49d131d91ce 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -254,50 +254,25 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
 
-/*
- * Reverse check our software queue for entries that we could potentially
- * merge with. Currently includes a hand-wavy stop count of 8, to not spend
- * too much time checking for merges.
- */
-static bool blk_mq_attempt_merge(struct request_queue *q,
+static bool blk_mq_ctx_try_merge(struct request_queue *q,
 				 struct blk_mq_ctx *ctx, struct bio *bio)
 {
-	struct request *rq;
-	int checked = 8;
+	struct request *rq, *free = NULL;
+	enum elv_merge type;
+	bool merged;
 
 	lockdep_assert_held(&ctx->lock);
 
-	list_for_each_entry_reverse(rq, &ctx->rq_list, queuelist) {
-		bool merged = false;
-
-		if (!checked--)
-			break;
-
-		if (!blk_rq_merge_ok(rq, bio))
-			continue;
+	type = elv_merge_ctx(q, &rq, bio, ctx);
+	merged = __blk_mq_try_merge(q, bio, &free, rq, type);
 
-		switch (blk_try_merge(rq, bio)) {
-		case ELEVATOR_BACK_MERGE:
-			if (blk_mq_sched_allow_merge(q, rq, bio))
-				merged = bio_attempt_back_merge(q, rq, bio);
-			break;
-		case ELEVATOR_FRONT_MERGE:
-			if (blk_mq_sched_allow_merge(q, rq, bio))
-				merged = bio_attempt_front_merge(q, rq, bio);
-			break;
-		case ELEVATOR_DISCARD_MERGE:
-			merged = bio_attempt_discard_merge(q, rq, bio);
-			break;
-		default:
-			continue;
-		}
+	if (free)
+		blk_mq_free_request(free);
 
-		if (merged)
-			ctx->rq_merged++;
-		return merged;
-	}
+	if (merged)
+		ctx->rq_merged++;
 
-	return false;
+	return merged;
 }
 
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
@@ -315,7 +290,7 @@ bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
 	if (hctx->flags & BLK_MQ_F_SHOULD_MERGE) {
 		/* default per sw-queue merge */
 		spin_lock(&ctx->lock);
-		ret = blk_mq_attempt_merge(q, ctx, bio);
+		ret = blk_mq_ctx_try_merge(q, ctx, bio);
 		spin_unlock(&ctx->lock);
 	}
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index db21e71bb087..6bd960fe11ec 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -809,6 +809,18 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	blk_queue_exit(q);
 }
 
+static void blk_mq_ctx_remove_rq_list(struct blk_mq_ctx *ctx,
+		struct list_head *head)
+{
+	struct request *rq;
+
+	lockdep_assert_held(&ctx->lock);
+
+	list_for_each_entry(rq, head, queuelist)
+		rqhash_del(rq);
+	ctx->last_merge = NULL;
+}
+
 struct flush_busy_ctx_data {
 	struct blk_mq_hw_ctx *hctx;
 	struct list_head *list;
@@ -823,6 +835,7 @@ static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
 	sbitmap_clear_bit(sb, bitnr);
 	spin_lock(&ctx->lock);
 	list_splice_tail_init(&ctx->rq_list, flush_data->list);
+	blk_mq_ctx_remove_rq_list(ctx, flush_data->list);
 	spin_unlock(&ctx->lock);
 	return true;
 }
@@ -852,17 +865,23 @@ static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, void *d
 	struct dispatch_rq_data *dispatch_data = data;
 	struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
 	struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
+	struct request *rq = NULL;
 
 	spin_lock(&ctx->lock);
 	if (unlikely(!list_empty(&ctx->rq_list))) {
-		dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
-		list_del_init(&dispatch_data->rq->queuelist);
+		rq = list_entry_rq(ctx->rq_list.next);
+		list_del_init(&rq->queuelist);
+		rqhash_del(rq);
 		if (list_empty(&ctx->rq_list))
 			sbitmap_clear_bit(sb, bitnr);
 	}
+	if (ctx->last_merge == rq)
+		ctx->last_merge = NULL;
 	spin_unlock(&ctx->lock);
 
-	return !dispatch_data->rq;
+	dispatch_data->rq = rq;
+
+	return !rq;
 }
 
 struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
@@ -1376,6 +1395,8 @@ static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
 		list_add(&rq->queuelist, &ctx->rq_list);
 	else
 		list_add_tail(&rq->queuelist, &ctx->rq_list);
+
+	rqhash_add(ctx->hash, rq);
 }
 
 void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
@@ -1868,6 +1889,7 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	spin_lock(&ctx->lock);
 	if (!list_empty(&ctx->rq_list)) {
 		list_splice_init(&ctx->rq_list, &tmp);
+		blk_mq_ctx_remove_rq_list(ctx, &tmp);
 		blk_mq_hctx_clear_pending(hctx, ctx);
 	}
 	spin_unlock(&ctx->lock);
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (19 preceding siblings ...)
  2017-08-05  6:57 ` [PATCH V2 20/20] blk-mq: improve bio merge from blk-mq sw queue Ming Lei
@ 2017-08-07 12:48 ` Laurence Oberman
  2017-08-07 15:27   ` Bart Van Assche
       [not found]   ` <CAFfF4qv3W6D-j8BSSZbwPLqhd_mmwk8CZQe7dSqud8cMMd2yPg@mail.gmail.com>
  2017-08-08  8:09 ` Paolo Valente
                   ` (2 subsequent siblings)
  23 siblings, 2 replies; 84+ messages in thread
From: Laurence Oberman @ 2017-08-07 12:48 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig; +Cc: Bart Van Assche



On 08/05/2017 02:56 AM, Ming Lei wrote:
> In Red Hat internal storage test wrt. blk-mq scheduler, we
> found that I/O performance is much bad with mq-deadline, especially
> about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
> SRP...)
> 
> Turns out one big issue causes the performance regression: requests
> are still dequeued from sw queue/scheduler queue even when ldd's
> queue is busy, so I/O merge becomes quite difficult to make, then
> sequential IO degrades a lot.
> 
> The 1st five patches improve this situation, and brings back
> some performance loss.
> 
> But looks they are still not enough. It is caused by
> the shared queue depth among all hw queues. For SCSI devices,
> .cmd_per_lun defines the max number of pending I/O on one
> request queue, which is per-request_queue depth. So during
> dispatch, if one hctx is too busy to move on, all hctxs can't
> dispatch too because of the per-request_queue depth.
> 
> Patch 6 ~ 14 use per-request_queue dispatch list to avoid
> to dequeue requests from sw/scheduler queue when lld queue
> is busy.
> 
> Patch 15 ~20 improve bio merge via hash table in sw queue,
> which makes bio merge more efficient than current approch
> in which only the last 8 requests are checked. Since patch
> 6~14 converts to the scheduler way of dequeuing one request
> from sw queue one time for SCSI device, and the times of
> acquring ctx->lock is increased, and merging bio via hash
> table decreases holding time of ctx->lock and should eliminate
> effect from patch 14.
> 
> With this changes, SCSI-MQ sequential I/O performance is
> improved much, for lpfc, it is basically brought back
> compared with block legacy path[1], especially mq-deadline
> is improved by > X10 [1] on lpfc and by > 3X on SCSI SRP,
> For mq-none it is improved by 10% on lpfc, and write is
> improved by > 10% on SRP too.
> 
> Also Bart worried that this patchset may affect SRP, so provide
> test data on SCSI SRP this time:
> 
> - fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs)
> - system(16 cores, dual sockets, mem: 96G)
> 
>                |v4.13-rc3     |v4.13-rc3     | v4.13-rc3+patches |
>                |blk-legacy dd |blk-mq none   | blk-mq none  |
> -----------------------------------------------------------|
> read     :iops|         587K |         526K |         537K |
> randread :iops|         115K |         140K |         139K |
> write    :iops|         596K |         519K |         602K |
> randwrite:iops|         103K |         122K |         120K |
> 
> 
>                |v4.13-rc3     |v4.13-rc3     | v4.13-rc3+patches
>                |blk-legacy dd |blk-mq dd     | blk-mq dd    |
> ------------------------------------------------------------
> read     :iops|         587K |         155K |         522K |
> randread :iops|         115K |         140K |         141K |
> write    :iops|         596K |         135K |         587K |
> randwrite:iops|         103K |         120K |         118K |
> 
> V2:
> 	- dequeue request from sw queues in round roubin's style
> 	as suggested by Bart, and introduces one helper in sbitmap
> 	for this purpose
> 	- improve bio merge via hash table from sw queue
> 	- add comments about using DISPATCH_BUSY state in lockless way,
> 	simplifying handling on busy state,
> 	- hold ctx->lock when clearing ctx busy bit as suggested
> 	by Bart
> 
> 
> [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
> 
> Ming Lei (20):
>    blk-mq-sched: fix scheduler bad performance
>    sbitmap: introduce __sbitmap_for_each_set()
>    blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
>    blk-mq-sched: move actual dispatching into one helper
>    blk-mq-sched: improve dispatching from sw queue
>    blk-mq-sched: don't dequeue request until all in ->dispatch are
>      flushed
>    blk-mq-sched: introduce blk_mq_sched_queue_depth()
>    blk-mq-sched: use q->queue_depth as hint for q->nr_requests
>    blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
>    blk-mq-sched: introduce helpers for query, change busy state
>    blk-mq: introduce helpers for operating ->dispatch list
>    blk-mq: introduce pointers to dispatch lock & list
>    blk-mq: pass 'request_queue *' to several helpers of operating BUSY
>    blk-mq-sched: improve IO scheduling on SCSI device
>    block: introduce rqhash helpers
>    block: move actual bio merge code into __elv_merge
>    block: add check on elevator for supporting bio merge via hashtable
>      from blk-mq sw queue
>    block: introduce .last_merge and .hash to blk_mq_ctx
>    blk-mq-sched: refactor blk_mq_sched_try_merge()
>    blk-mq: improve bio merge from blk-mq sw queue
> 
>   block/blk-mq-debugfs.c  |  12 ++--
>   block/blk-mq-sched.c    | 187 +++++++++++++++++++++++++++++-------------------
>   block/blk-mq-sched.h    |  23 ++++++
>   block/blk-mq.c          | 133 +++++++++++++++++++++++++++++++---
>   block/blk-mq.h          |  73 +++++++++++++++++++
>   block/blk-settings.c    |   2 +
>   block/blk.h             |  55 ++++++++++++++
>   block/elevator.c        |  93 ++++++++++++++----------
>   include/linux/blk-mq.h  |   5 ++
>   include/linux/blkdev.h  |   5 ++
>   include/linux/sbitmap.h |  54 ++++++++++----
>   11 files changed, 504 insertions(+), 138 deletions(-)
> 

Hello

I tested this series using Ming's tests as well as my own set of tests 
typically run against changes to upstream code in my SRP test-bed.
My tests also include very large sequential buffered and un-buffered I/O.

This series seems to be fine for me. I did uncover another issue that is 
unrelated to these patches and also exists in 4.13-RC3 generic that I am 
still debugging.

For what its worth:
Tested-by: Laurence Oberman <loberman@redhat.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-07 12:48 ` [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Laurence Oberman
@ 2017-08-07 15:27   ` Bart Van Assche
  2017-08-07 17:29     ` Laurence Oberman
       [not found]   ` <CAFfF4qv3W6D-j8BSSZbwPLqhd_mmwk8CZQe7dSqud8cMMd2yPg@mail.gmail.com>
  1 sibling, 1 reply; 84+ messages in thread
From: Bart Van Assche @ 2017-08-07 15:27 UTC (permalink / raw)
  To: hch, linux-block, axboe, loberman, ming.lei; +Cc: Bart Van Assche

On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote:
> I tested this series using Ming's tests as well as my own set of tests 
> typically run against changes to upstream code in my SRP test-bed.
> My tests also include very large sequential buffered and un-buffered I/O.
> 
> This series seems to be fine for me. I did uncover another issue that is 
> unrelated to these patches and also exists in 4.13-RC3 generic that I am 
> still debugging.

Hello Laurence,

What kind of tests did you run? Only functional tests or also performance
tests?

Has the issue you encountered with kernel 4.13-rc3 already been reported on
the linux-rdma mailing list?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-07 15:27   ` Bart Van Assche
@ 2017-08-07 17:29     ` Laurence Oberman
  2017-08-07 18:46       ` Laurence Oberman
  2017-08-07 23:04       ` Ming Lei
  0 siblings, 2 replies; 84+ messages in thread
From: Laurence Oberman @ 2017-08-07 17:29 UTC (permalink / raw)
  To: Bart Van Assche, hch, linux-block, axboe, ming.lei



On 08/07/2017 11:27 AM, Bart Van Assche wrote:
> On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote:
>> I tested this series using Ming's tests as well as my own set of tests
>> typically run against changes to upstream code in my SRP test-bed.
>> My tests also include very large sequential buffered and un-buffered I/O.
>>
>> This series seems to be fine for me. I did uncover another issue that is
>> unrelated to these patches and also exists in 4.13-RC3 generic that I am
>> still debugging.
> 
> Hello Laurence,
> 
> What kind of tests did you run? Only functional tests or also performance
> tests?
> 
> Has the issue you encountered with kernel 4.13-rc3 already been reported on
> the linux-rdma mailing list?
> 
> Thanks,
> 
> Bart.
> 

Hi Bart

Actually I was focusing on just performance to see if we had any 
regressions with Ming's new patches for the large sequential I/O cases.

Ming had already tested the small I/O performance cases so I was making 
sure the large I/O sequential tests did not suffer.

The 4MB un-buffered direct read tests to DM devices seem to perform 
much the same in my test bed.
The 4MB buffered and un-buffered writes also seem to be well behaved 
with not much improvement.

These were not exhaustive tests and did not include my usual port 
disconnect and recovery tests either.
I was just making sure we did not regress with Ming's changes.

I was only just starting to baseline test the mq-deadline scheduler as 
prior to 4.13-RC3 I had not been testing any of the new MQ schedulers.
I had always only tested with [none]

The tests were with [none] and [mq-deadline]

The new issue I started seeing has not been reported yet as I was still 
investigating it.

In summary:
With buffered writes we see the clone fail in blk_get_request in both 
Ming's kernel and in the upstream 4.13-RC3 stock kernel.

[  885.271451] io scheduler mq-deadline registered
[  898.455442] device-mapper: multipath: blk_get_request() returned -11 
- requeuing

This is due to

multipath_clone_and_map()

/*
  * Map cloned requests (request-based multipath)
  */
static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
                                    union map_info *map_context,
                                    struct request **__clone)
{
..
..
..
         clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, 
GFP_ATOMIC);
         if (IS_ERR(clone)) {
                 /* EBUSY, ENODEV or EWOULDBLOCK: requeue */
                 bool queue_dying = blk_queue_dying(q);
                 DMERR_LIMIT("blk_get_request() returned %ld%s - requeuing",
                             PTR_ERR(clone), queue_dying ? " (path 
offline)" : "");
                 if (queue_dying) {
                         atomic_inc(&m->pg_init_in_progress);
                         activate_or_offline_path(pgpath);
                         return DM_MAPIO_REQUEUE;
                 }
                 return DM_MAPIO_DELAY_REQUEUE;
         }

Still investigating but it leads to a hard lockup


So I still need to see if the hard-lockup happens in the stock kernel 
with mq-deadline and some other work before coming up with a full 
summary of the issue.

I also intend to re-run all tests including disconnect and reconnect 
tests on both mq-deadline and none.

Trace below


[ 1553.167357] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
[ 1553.167359] Modules linked in: mq_deadline binfmt_misc dm_round_robin 
xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter 
ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set 
nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat 
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma ip6table_mangle 
ip6table_security ip6table_raw iptable_nat ib_isert nf_conntrack_ipv4 
iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ib_iser 
libiscsi scsi_transport_iscsi iptable_mangle iptable_security 
iptable_raw ebtable_filter ebtables target_core_mod ip6table_filter 
ip6_tables iptable_filter ib_srp scsi_transport_srp ib_ipoib rdma_ucm 
ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core 
intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul
[ 1553.167385]  crc32_pclmul ghash_clmulni_intel pcbc aesni_intel sg 
joydev ipmi_si hpilo crypto_simd hpwdt iTCO_wdt cryptd ipmi_devintf 
glue_helper gpio_ich iTCO_vendor_support shpchp ipmi_msghandler pcspkr 
acpi_power_meter i7core_edac lpc_ich pcc_cpufreq nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c sd_mod 
amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm mlxfw i2c_core ptp 
serio_raw hpsa crc32c_intel bnx2 pps_core devlink scsi_transport_sas 
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ib_srpt]
[ 1553.167410] CPU: 4 PID: 11532 Comm: dd Tainted: G          I 
4.13.0-rc3lobeming.ming_V4+ #20
[ 1553.167411] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[ 1553.167412] task: ffff9d9344b0d800 task.stack: ffffb5cb913b0000
[ 1553.167421] RIP: 0010:__blk_recalc_rq_segments+0xec/0x3d0
[ 1553.167421] RSP: 0018:ffff9d9377a83d08 EFLAGS: 00000046
[ 1553.167422] RAX: 0000000000001000 RBX: 0000000000001000 RCX: 
ffff9d91b8e9f500
[ 1553.167423] RDX: fffffcd56af20f00 RSI: ffff9d91b8e9f588 RDI: 
ffff9d91b8e9f488
[ 1553.167424] RBP: ffff9d9377a83d88 R08: 000000000003c000 R09: 
0000000000000001
[ 1553.167424] R10: 0000000000000000 R11: 0000000000000000 R12: 
0000000000000000
[ 1553.167425] R13: 0000000000000000 R14: 0000000000000000 R15: 
0000000000000003
[ 1553.167426] FS:  00007f3061002740(0000) GS:ffff9d9377a80000(0000) 
knlGS:0000000000000000
[ 1553.167427] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1553.167428] CR2: 00007f305a511000 CR3: 0000000b1bfb2000 CR4: 
00000000000006e0
[ 1553.167429] Call Trace:
[ 1553.167432]  <IRQ>
[ 1553.167437]  ? mempool_free+0x2b/0x80
[ 1553.167440]  blk_recalc_rq_segments+0x28/0x40
[ 1553.167442]  blk_update_request+0x249/0x310
[ 1553.167450]  end_clone_bio+0x46/0x70 [dm_mod]
[ 1553.167453]  bio_endio+0xa1/0x120
[ 1553.167454]  blk_update_request+0xb7/0x310
[ 1553.167457]  scsi_end_request+0x38/0x1d0
[ 1553.167458]  scsi_io_completion+0x13c/0x630
[ 1553.167460]  scsi_finish_command+0xd9/0x120
[ 1553.167462]  scsi_softirq_done+0x12a/0x150
[ 1553.167463]  __blk_mq_complete_request_remote+0x13/0x20
[ 1553.167466]  flush_smp_call_function_queue+0x52/0x110
[ 1553.167468]  generic_smp_call_function_single_interrupt+0x13/0x30
[ 1553.167470]  smp_call_function_single_interrupt+0x27/0x40
[ 1553.167471]  call_function_single_interrupt+0x93/0xa0
[ 1553.167473] RIP: 0010:radix_tree_next_chunk+0xcb/0x2e0
[ 1553.167474] RSP: 0018:ffffb5cb913b3a68 EFLAGS: 00000203 ORIG_RAX: 
ffffffffffffff04
[ 1553.167475] RAX: 0000000000000001 RBX: 0000000000000010 RCX: 
ffff9d9e2c556938
[ 1553.167476] RDX: ffff9d93053c5919 RSI: 0000000000000001 RDI: 
ffff9d9e2c556910
[ 1553.167476] RBP: ffffb5cb913b3ab8 R08: ffffb5cb913b3bc0 R09: 
0000000ab7d52000
[ 1553.167477] R10: fffffcd56adf5440 R11: 0000000000000000 R12: 
0000000000000230
[ 1553.167477] R13: 0000000000000000 R14: 000000000006e0fe R15: 
ffff9d9e2c556910
[ 1553.167478]  </IRQ>
[ 1553.167480]  ? radix_tree_next_chunk+0x10b/0x2e0
[ 1553.167481]  find_get_pages_tag+0x149/0x270
[ 1553.167485]  ? block_write_full_page+0xcd/0xe0
[ 1553.167487]  pagevec_lookup_tag+0x21/0x30
[ 1553.167488]  write_cache_pages+0x14c/0x510
[ 1553.167490]  ? __wb_calc_thresh+0x140/0x140
[ 1553.167492]  generic_writepages+0x51/0x80
[ 1553.167493]  blkdev_writepages+0x2f/0x40
[ 1553.167494]  do_writepages+0x1c/0x70
[ 1553.167495]  __filemap_fdatawrite_range+0xc6/0x100
[ 1553.167497]  filemap_write_and_wait+0x3d/0x80
[ 1553.167498]  __sync_blockdev+0x1f/0x40
[ 1553.167499]  __blkdev_put+0x74/0x200
[ 1553.167500]  blkdev_put+0x4c/0xd0
[ 1553.167501]  blkdev_close+0x25/0x30
[ 1553.167503]  __fput+0xe7/0x210
[ 1553.167504]  ____fput+0xe/0x10
[ 1553.167508]  task_work_run+0x83/0xb0
[ 1553.167511]  exit_to_usermode_loop+0x66/0x98
[ 1553.167512]  do_syscall_64+0x13a/0x150
[ 1553.167519]  entry_SYSCALL64_slow_path+0x25/0x25
[ 1553.167520] RIP: 0033:0x7f3060b25bd0
[ 1553.167521] RSP: 002b:00007ffc5e6b12c8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000003
[ 1553.167522] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
00007f3060b25bd0
[ 1553.167522] RDX: 0000000000400000 RSI: 0000000000000000 RDI: 
0000000000000001
[ 1553.167523] RBP: 0000000000000320 R08: ffffffffffffffff R09: 
0000000000402003
[ 1553.167524] R10: 00007ffc5e6b1040 R11: 0000000000000246 R12: 
0000000000000320
[ 1553.167525] R13: 0000000000000000 R14: 00007ffc5e6b2618 R15: 
00007ffc5e6b1540
[ 1553.167526] Code: db 0f 84 92 01 00 00 45 89 f5 45 89 e2 4c 89 ee 48 
c1 e6 04 48 03 71 78 8b 46 08 48 8b 16 44 29 e0 48 89 54 24 30 39 d8 0f 
47 c3 <44> 03 56 0c 45 84 db 89 44 24 38 44 89 54 24 3c 0f 85 2a 01 00
[ 1553.167540] Kernel panic - not syncing: Hard LOCKUP
[ 1553.167542] CPU: 4 PID: 11532 Comm: dd Tainted: G          I 
4.13.0-rc3lobeming.ming_V4+ #20
[ 1553.167542] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[ 1553.167543] Call Trace:
[ 1553.167543]  <NMI>
[ 1553.167545]  dump_stack+0x63/0x87
[ 1553.167548]  panic+0xeb/0x245
[ 1553.167550]  nmi_panic+0x3f/0x40
[ 1553.167553]  watchdog_overflow_callback+0xce/0xf0
[ 1553.167556]  __perf_event_overflow+0x54/0xe0
[ 1553.167559]  perf_event_overflow+0x14/0x20
[ 1553.167562]  intel_pmu_handle_irq+0x203/0x4b0
[ 1553.167566]  perf_event_nmi_handler+0x2d/0x50
[ 1553.167568]  nmi_handle+0x61/0x110
[ 1553.167569]  default_do_nmi+0x44/0x120
[ 1553.167571]  do_nmi+0x113/0x190
[ 1553.167573]  end_repeat_nmi+0x1a/0x1e
[ 1553.167574] RIP: 0010:__blk_recalc_rq_segments+0xec/0x3d0
[ 1553.167575] RSP: 0018:ffff9d9377a83d08 EFLAGS: 00000046
[ 1553.167576] RAX: 0000000000001000 RBX: 0000000000001000 RCX: 
ffff9d91b8e9f500
[ 1553.167576] RDX: fffffcd56af20f00 RSI: ffff9d91b8e9f588 RDI: 
ffff9d91b8e9f488
[ 1553.167577] RBP: ffff9d9377a83d88 R08: 000000000003c000 R09: 
0000000000000001
[ 1553.167577] R10: 0000000000000000 R11: 0000000000000000 R12: 
0000000000000000
[ 1553.167578] R13: 0000000000000000 R14: 0000000000000000 R15: 
0000000000000003
[ 1553.167580]  ? __blk_recalc_rq_segments+0xec/0x3d0
[ 1553.167581]  ? __blk_recalc_rq_segments+0xec/0x3d0
[ 1553.167582]  </NMI>
[ 1553.167582]  <IRQ>
[ 1553.167584]  ? mempool_free+0x2b/0x80
[ 1553.167585]  blk_recalc_rq_segments+0x28/0x40
[ 1553.167586]  blk_update_request+0x249/0x310
[ 1553.167590]  end_clone_bio+0x46/0x70 [dm_mod]
[ 1553.167592]  bio_endio+0xa1/0x120
[ 1553.167593]  blk_update_request+0xb7/0x310
[ 1553.167594]  scsi_end_request+0x38/0x1d0
[ 1553.167595]  scsi_io_completion+0x13c/0x630
[ 1553.167597]  scsi_finish_command+0xd9/0x120
[ 1553.167598]  scsi_softirq_done+0x12a/0x150
[ 1553.167599]  __blk_mq_complete_request_remote+0x13/0x20
[ 1553.167601]  flush_smp_call_function_queue+0x52/0x110
[ 1553.167602]  generic_smp_call_function_single_interrupt+0x13/0x30
[ 1553.167603]  smp_call_function_single_interrupt+0x27/0x40
[ 1553.167604]  call_function_single_interrupt+0x93/0xa0
[ 1553.167605] RIP: 0010:radix_tree_next_chunk+0xcb/0x2e0
[ 1553.167606] RSP: 0018:ffffb5cb913b3a68 EFLAGS: 00000203 ORIG_RAX: 
ffffffffffffff04
[ 1553.167607] RAX: 0000000000000001 RBX: 0000000000000010 RCX: 
ffff9d9e2c556938
[ 1553.167607] RDX: ffff9d93053c5919 RSI: 0000000000000001 RDI: 
ffff9d9e2c556910
[ 1553.167608] RBP: ffffb5cb913b3ab8 R08: ffffb5cb913b3bc0 R09: 
0000000ab7d52000
[ 1553.167609] R10: fffffcd56adf5440 R11: 0000000000000000 R12: 
0000000000000230
[ 1553.167609] R13: 0000000000000000 R14: 000000000006e0fe R15: 
ffff9d9e2c556910
[ 1553.167610]  </IRQ>
[ 1553.167611]  ? radix_tree_next_chunk+0x10b/0x2e0
[ 1553.167613]  find_get_pages_tag+0x149/0x270
[ 1553.167615]  ? block_write_full_page+0xcd/0xe0
[ 1553.167616]  pagevec_lookup_tag+0x21/0x30
[ 1553.167617]  write_cache_pages+0x14c/0x510
[ 1553.167619]  ? __wb_calc_thresh+0x140/0x140
[ 1553.167621]  generic_writepages+0x51/0x80
[ 1553.167622]  blkdev_writepages+0x2f/0x40
[ 1553.167623]  do_writepages+0x1c/0x70
[ 1553.167627]  __filemap_fdatawrite_range+0xc6/0x100
[ 1553.167629]  filemap_write_and_wait+0x3d/0x80
[ 1553.167630]  __sync_blockdev+0x1f/0x40
[ 1553.167631]  __blkdev_put+0x74/0x200
[ 1553.167632]  blkdev_put+0x4c/0xd0
[ 1553.167633]  blkdev_close+0x25/0x30
[ 1553.167634]  __fput+0xe7/0x210
[ 1553.167635]  ____fput+0xe/0x10
[ 1553.167636]  task_work_run+0x83/0xb0
[ 1553.167637]  exit_to_usermode_loop+0x66/0x98
[ 1553.167638]  do_syscall_64+0x13a/0x150
[ 1553.167640]  entry_SYSCALL64_slow_path+0x25/0x25
[ 1553.167641] RIP: 0033:0x7f3060b25bd0
[ 1553.167641] RSP: 002b:00007ffc5e6b12c8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000003
[ 1553.167642] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
00007f3060b25bd0
[ 1553.167643] RDX: 0000000000400000 RSI: 0000000000000000 RDI: 
0000000000000001
[ 1553.167643] RBP: 0000000000000320 R08: ffffffffffffffff R09: 
0000000000402003
[ 1553.167644] R10: 00007ffc5e6b1040 R11: 0000000000000246 R12: 
0000000000000320
[ 1553.167645] R13: 0000000000000000 R14: 00007ffc5e6b2618 R15: 
00007ffc5e6b1540
[ 1553.167829] Kernel Offset: 0x2f400000 from 0xffffffff81000000 
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1558.119352] ---[ end Kernel panic - not syncing: Hard LOCKUP
[ 1558.151334] sched: Unexpected reschedule of offline CPU#0!
[ 1558.151342] ------------[ cut here ]------------
[ 1558.151347] WARNING: CPU: 4 PID: 11532 at arch/x86/kernel/smp.c:128 
native_smp_send_reschedule+0x3c/0x40
[ 1558.151348] Modules linked in: mq_deadline binfmt_misc dm_round_robin 
xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter 
ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set 
nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat 
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma ip6table_mangle 
ip6table_security ip6table_raw iptable_nat ib_isert nf_conntrack_ipv4 
iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ib_iser 
libiscsi scsi_transport_iscsi iptable_mangle iptable_security 
iptable_raw ebtable_filter ebtables target_core_mod ip6table_filter 
ip6_tables iptable_filter ib_srp scsi_transport_srp ib_ipoib rdma_ucm 
ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core 
intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul
[ 1558.151363]  crc32_pclmul ghash_clmulni_intel pcbc aesni_intel sg 
joydev ipmi_si hpilo crypto_simd hpwdt iTCO_wdt cryptd ipmi_devintf 
glue_helper gpio_ich iTCO_vendor_support shpchp ipmi_msghandler pcspkr 
acpi_power_meter i7core_edac lpc_ich pcc_cpufreq nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c sd_mod 
amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm mlxfw i2c_core ptp 
serio_raw hpsa crc32c_intel bnx2 pps_core devlink scsi_transport_sas 
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ib_srpt]
[ 1558.151377] CPU: 4 PID: 11532 Comm: dd Tainted: G          I 
4.13.0-rc3lobeming.ming_V4+ #20
[ 1558.151378] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[ 1558.151379] task: ffff9d9344b0d800 task.stack: ffffb5cb913b0000
[ 1558.151380] RIP: 0010:native_smp_send_reschedule+0x3c/0x40
[ 1558.151381] RSP: 0018:ffff9d9377a856a0 EFLAGS: 00010046
[ 1558.151382] RAX: 000000000000002e RBX: 0000000000000000 RCX: 
0000000000000006
[ 1558.151383] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
ffff9d9377a8e010
[ 1558.151383] RBP: ffff9d9377a856a0 R08: 000000000000002e R09: 
ffff9d9fa7029552
[ 1558.151384] R10: 0000000000000732 R11: 0000000000000000 R12: 
ffff9d9377a1bb00
[ 1558.151384] R13: ffff9d936ef70000 R14: ffff9d9377a85758 R15: 
ffff9d9377a1bb00
[ 1558.151385] FS:  00007f3061002740(0000) GS:ffff9d9377a80000(0000) 
knlGS:0000000000000000
[ 1558.151386] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1558.151387] CR2: 00007f305a511000 CR3: 0000000b1bfb2000 CR4: 
00000000000006e0
[ 1558.151387] Call Trace:
[ 1558.151388]  <NMI>
[ 1558.151391]  resched_curr+0xa1/0xc0
[ 1558.151392]  check_preempt_curr+0x79/0x90
[ 1558.151393]  ttwu_do_wakeup+0x1e/0x160
[ 1558.151395]  ttwu_do_activate+0x7a/0x90
[ 1558.151396]  try_to_wake_up+0x1e1/0x470
[ 1558.151398]  ? vsnprintf+0x1fa/0x4a0
[ 1558.151399]  default_wake_function+0x12/0x20
[ 1558.151403]  __wake_up_common+0x55/0x90
[ 1558.151405]  __wake_up_locked+0x13/0x20
[ 1558.151408]  ep_poll_callback+0xd3/0x2a0
[ 1558.151409]  __wake_up_common+0x55/0x90
[ 1558.151411]  __wake_up+0x39/0x50
[ 1558.151415]  wake_up_klogd_work_func+0x40/0x60
[ 1558.151419]  irq_work_run_list+0x4d/0x70
[ 1558.151421]  ? tick_sched_do_timer+0x70/0x70
[ 1558.151422]  irq_work_tick+0x40/0x50
[ 1558.151424]  update_process_times+0x42/0x60
[ 1558.151426]  tick_sched_handle+0x2d/0x60
[ 1558.151427]  tick_sched_timer+0x39/0x70
[ 1558.151428]  __hrtimer_run_queues+0xe5/0x230
[ 1558.151429]  hrtimer_interrupt+0xa8/0x1a0
[ 1558.151431]  local_apic_timer_interrupt+0x35/0x60
[ 1558.151432]  smp_apic_timer_interrupt+0x38/0x50
[ 1558.151433]  apic_timer_interrupt+0x93/0xa0
[ 1558.151435] RIP: 0010:panic+0x1fa/0x245
[ 1558.151435] RSP: 0018:ffff9d9377a85b30 EFLAGS: 00000246 ORIG_RAX: 
ffffffffffffff10
[ 1558.151436] RAX: 0000000000000030 RBX: ffff9d890c5d0000 RCX: 
0000000000000006
[ 1558.151437] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
ffff9d9377a8e010
[ 1558.151438] RBP: ffff9d9377a85ba0 R08: 0000000000000030 R09: 
ffff9d9fa7029514
[ 1558.151438] R10: 0000000000000731 R11: 0000000000000000 R12: 
ffffffffb0e63f7f
[ 1558.151439] R13: 0000000000000000 R14: 0000000000000000 R15: 
0000000000000000
[ 1558.151441]  nmi_panic+0x3f/0x40
[ 1558.151442]  watchdog_overflow_callback+0xce/0xf0
[ 1558.151444]  __perf_event_overflow+0x54/0xe0
[ 1558.151445]  perf_event_overflow+0x14/0x20
[ 1558.151446]  intel_pmu_handle_irq+0x203/0x4b0
[ 1558.151449]  perf_event_nmi_handler+0x2d/0x50
[ 1558.151450]  nmi_handle+0x61/0x110
[ 1558.151452]  default_do_nmi+0x44/0x120
[ 1558.151453]  do_nmi+0x113/0x190
[ 1558.151454]  end_repeat_nmi+0x1a/0x1e
[ 1558.151456] RIP: 0010:__blk_recalc_rq_segments+0xec/0x3d0
[ 1558.151456] RSP: 0018:ffff9d9377a83d08 EFLAGS: 00000046
[ 1558.151457] RAX: 0000000000001000 RBX: 0000000000001000 RCX: 
ffff9d91b8e9f500
[ 1558.151458] RDX: fffffcd56af20f00 RSI: ffff9d91b8e9f588 RDI: 
ffff9d91b8e9f488
[ 1558.151458] RBP: ffff9d9377a83d88 R08: 000000000003c000 R09: 
0000000000000001
[ 1558.151459] R10: 0000000000000000 R11: 0000000000000000 R12: 
0000000000000000
[ 1558.151459] R13: 0000000000000000 R14: 0000000000000000 R15: 
0000000000000003
[ 1558.151461]  ? __blk_recalc_rq_segments+0xec/0x3d0
[ 1558.151463]  ? __blk_recalc_rq_segments+0xec/0x3d0
[ 1558.151463]  </NMI>
[ 1558.151464]  <IRQ>
[ 1558.151465]  ? mempool_free+0x2b/0x80
[ 1558.151467]  blk_recalc_rq_segments+0x28/0x40
[ 1558.151468]  blk_update_request+0x249/0x310
[ 1558.151472]  end_clone_bio+0x46/0x70 [dm_mod]
[ 1558.151473]  bio_endio+0xa1/0x120
[ 1558.151474]  blk_update_request+0xb7/0x310
[ 1558.151476]  scsi_end_request+0x38/0x1d0
[ 1558.151477]  scsi_io_completion+0x13c/0x630
[ 1558.151478]  scsi_finish_command+0xd9/0x120
[ 1558.151480]  scsi_softirq_done+0x12a/0x150
[ 1558.151481]  __blk_mq_complete_request_remote+0x13/0x20
[ 1558.151483]  flush_smp_call_function_queue+0x52/0x110
[ 1558.151484]  generic_smp_call_function_single_interrupt+0x13/0x30
[ 1558.151485]  smp_call_function_single_interrupt+0x27/0x40
[ 1558.151486]  call_function_single_interrupt+0x93/0xa0
[ 1558.151487] RIP: 0010:radix_tree_next_chunk+0xcb/0x2e0
[ 1558.151488] RSP: 0018:ffffb5cb913b3a68 EFLAGS: 00000203 ORIG_RAX: 
ffffffffffffff04
[ 1558.151489] RAX: 0000000000000001 RBX: 0000000000000010 RCX: 
ffff9d9e2c556938
[ 1558.151490] RDX: ffff9d93053c5919 RSI: 0000000000000001 RDI: 
ffff9d9e2c556910
[ 1558.151490] RBP: ffffb5cb913b3ab8 R08: ffffb5cb913b3bc0 R09: 
0000000ab7d52000
[ 1558.151491] R10: fffffcd56adf5440 R11: 0000000000000000 R12: 
0000000000000230
[ 1558.151492] R13: 0000000000000000 R14: 000000000006e0fe R15: 
ffff9d9e2c556910
[ 1558.151492]  </IRQ>
[ 1558.151494]  ? radix_tree_next_chunk+0x10b/0x2e0
[ 1558.151496]  find_get_pages_tag+0x149/0x270
[ 1558.151497]  ? block_write_full_page+0xcd/0xe0
[ 1558.151499]  pagevec_lookup_tag+0x21/0x30
[ 1558.151500]  write_cache_pages+0x14c/0x510
[ 1558.151502]  ? __wb_calc_thresh+0x140/0x140
[ 1558.151504]  generic_writepages+0x51/0x80
[ 1558.151505]  blkdev_writepages+0x2f/0x40
[ 1558.151506]  do_writepages+0x1c/0x70
[ 1558.151507]  __filemap_fdatawrite_range+0xc6/0x100
[ 1558.151509]  filemap_write_and_wait+0x3d/0x80
[ 1558.151510]  __sync_blockdev+0x1f/0x40
[ 1558.151511]  __blkdev_put+0x74/0x200
[ 1558.151512]  blkdev_put+0x4c/0xd0
[ 1558.151513]  blkdev_close+0x25/0x30
[ 1558.151514]  __fput+0xe7/0x210
[ 1558.151515]  ____fput+0xe/0x10
[ 1558.151517]  task_work_run+0x83/0xb0
[ 1558.151518]  exit_to_usermode_loop+0x66/0x98
[ 1558.151519]  do_syscall_64+0x13a/0x150
[ 1558.151521]  entry_SYSCALL64_slow_path+0x25/0x25
[ 1558.151522] RIP: 0033:0x7f3060b25bd0
[ 1558.151522] RSP: 002b:00007ffc5e6b12c8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000003
[ 1558.151523] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
00007f3060b25bd0
[ 1558.151524] RDX: 0000000000400000 RSI: 0000000000000000 RDI: 
0000000000000001
[ 1558.151524] RBP: 0000000000000320 R08: ffffffffffffffff R09: 
0000000000402003
[ 1558.151525] R10: 00007ffc5e6b1040 R11: 0000000000000246 R12: 
0000000000000320
[ 1558.151526] R13: 0000000000000000 R14: 00007ffc5e6b2618 R15: 
00007ffc5e6b1540
[ 1558.151527] Code: dc 00 0f 92 c0 84 c0 74 14 48 8b 05 ff 9d a9 00 be 
fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 90 4d e3 b0 e8 c7 db 
09 00 <0f> ff 5d c3 66 66 66 66 90 55 48 89 e5 48 83 ec 20 65 48 8b 04
[ 1558.151540] ---[ end trace 408c28f8c132530c ]---

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-07 17:29     ` Laurence Oberman
@ 2017-08-07 18:46       ` Laurence Oberman
  2017-08-07 19:46         ` Laurence Oberman
  2017-08-07 23:04       ` Ming Lei
  1 sibling, 1 reply; 84+ messages in thread
From: Laurence Oberman @ 2017-08-07 18:46 UTC (permalink / raw)
  To: Bart Van Assche, hch, linux-block, axboe, ming.lei



On 08/07/2017 01:29 PM, Laurence Oberman wrote:
> 
> 
> On 08/07/2017 11:27 AM, Bart Van Assche wrote:
>> On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote:
>>> I tested this series using Ming's tests as well as my own set of tests
>>> typically run against changes to upstream code in my SRP test-bed.
>>> My tests also include very large sequential buffered and un-buffered 
>>> I/O.
>>>
>>> This series seems to be fine for me. I did uncover another issue that is
>>> unrelated to these patches and also exists in 4.13-RC3 generic that I am
>>> still debugging.
>>
>> Hello Laurence,
>>
>> What kind of tests did you run? Only functional tests or also performance
>> tests?
>>
>> Has the issue you encountered with kernel 4.13-rc3 already been 
>> reported on
>> the linux-rdma mailing list?
>>
>> Thanks,
>>
>> Bart.
>>
> 
> Hi Bart
> 
> Actually I was focusing on just performance to see if we had any 
> regressions with Mings new patches for the large sequential I/O cases.
> 
> Ming had already tested the small I/O performance cases so I was making 
> sure the large I/O sequential tests did not suffer.
> 
> The 4MB un-buffered direct read tests to DM devices seems to perform 
> much the same in my test bed.
> The 4MB buffered and un-buffered 4MB writes also seem to be well behaved 
> with not much improvement.
> 
> These were not exhaustive tests and did not include my usual port 
> disconnect and recovery tests either.
> I was just making sure we did not regress with Ming's changes.
> 
> I was only just starting to baseline test the mq-deadline scheduler as 
> prior to 4.13-RC3 I had not been testing any of the new MQ schedulers.
> I had always only tested with [none]
> 
> The tests were with [none] and [mq-deadline]
> 
> The new issue I started seeing was not yet reported yet as I was still 
> investigating it.
> 
> In summary:
> With buffered writes we see the clone fail in blk_get_request in both 
> Mings kernel and in the upstream 4.13-RC3 stock kernel
> 
> [  885.271451] io scheduler mq-deadline registered
> [  898.455442] device-mapper: multipath: blk_get_request() returned -11 
> - requeuing
> 
> This is due to
> 
> multipath_clone_and_map()
> 
> /*
>   * Map cloned requests (request-based multipath)
>   */
> static int multipath_clone_and_map(struct dm_target *ti, struct request 
> *rq,
>                                     union map_info *map_context,
>                                     struct request **__clone)
> {
> ..
> ..
> ..
>          clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, 
> GFP_ATOMIC);
>          if (IS_ERR(clone)) {
>                  /* EBUSY, ENODEV or EWOULDBLOCK: requeue */
>                  bool queue_dying = blk_queue_dying(q);
>                  DMERR_LIMIT("blk_get_request() returned %ld%s - 
> requeuing",
>                              PTR_ERR(clone), queue_dying ? " (path 
> offline)" : "");
>                  if (queue_dying) {
>                          atomic_inc(&m->pg_init_in_progress);
>                          activate_or_offline_path(pgpath);
>                          return DM_MAPIO_REQUEUE;
>                  }
>                  return DM_MAPIO_DELAY_REQUEUE;
>          }
> 
> Still investigating but it leads to a hard lockup
> 
> 
> So I still need to see if the hard-lockup happens in the stock kernel 
> with mq-deadline and some other work before coming up with a full 
> summary of the issue.
> 
> I also intend to re-run all tests including disconnect and reconnect 
> tests on both mq-deadline and none.
> 
> Trace below
> 
> 
> [ 1553.167357] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
> [ 1553.167359] Modules linked in: mq_deadline binfmt_misc dm_round_robin 
> xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter 
> ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set 
> nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat 
> nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma ip6table_mangle 
> ip6table_security ip6table_raw iptable_nat ib_isert nf_conntrack_ipv4 
> iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ib_iser 
> libiscsi scsi_transport_iscsi iptable_mangle iptable_security 
> iptable_raw ebtable_filter ebtables target_core_mod ip6table_filter 
> ip6_tables iptable_filter ib_srp scsi_transport_srp ib_ipoib rdma_ucm 
> ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core 
> intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul
> [ 1553.167385]  crc32_pclmul ghash_clmulni_intel pcbc aesni_intel sg 
> joydev ipmi_si hpilo crypto_simd hpwdt iTCO_wdt cryptd ipmi_devintf 
> glue_helper gpio_ich iTCO_vendor_support shpchp ipmi_msghandler pcspkr 
> acpi_power_meter i7core_edac lpc_ich pcc_cpufreq nfsd auth_rpcgss 
> nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c sd_mod 
> amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea 
> sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm mlxfw i2c_core ptp 
> serio_raw hpsa crc32c_intel bnx2 pps_core devlink scsi_transport_sas 
> dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ib_srpt]
> [ 1553.167410] CPU: 4 PID: 11532 Comm: dd Tainted: G          I 
> 4.13.0-rc3lobeming.ming_V4+ #20
> [ 1553.167411] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
> [ 1553.167412] task: ffff9d9344b0d800 task.stack: ffffb5cb913b0000
> [ 1553.167421] RIP: 0010:__blk_recalc_rq_segments+0xec/0x3d0
> [ 1553.167421] RSP: 0018:ffff9d9377a83d08 EFLAGS: 00000046
> [ 1553.167422] RAX: 0000000000001000 RBX: 0000000000001000 RCX: 
> ffff9d91b8e9f500
> [ 1553.167423] RDX: fffffcd56af20f00 RSI: ffff9d91b8e9f588 RDI: 
> ffff9d91b8e9f488
> [ 1553.167424] RBP: ffff9d9377a83d88 R08: 000000000003c000 R09: 
> 0000000000000001
> [ 1553.167424] R10: 0000000000000000 R11: 0000000000000000 R12: 
> 0000000000000000
> [ 1553.167425] R13: 0000000000000000 R14: 0000000000000000 R15: 
> 0000000000000003
> [ 1553.167426] FS:  00007f3061002740(0000) GS:ffff9d9377a80000(0000) 
> knlGS:0000000000000000
> [ 1553.167427] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1553.167428] CR2: 00007f305a511000 CR3: 0000000b1bfb2000 CR4: 
> 00000000000006e0
> [ 1553.167429] Call Trace:
> [ 1553.167432]  <IRQ>
> [ 1553.167437]  ? mempool_free+0x2b/0x80
> [ 1553.167440]  blk_recalc_rq_segments+0x28/0x40
> [ 1553.167442]  blk_update_request+0x249/0x310
> [ 1553.167450]  end_clone_bio+0x46/0x70 [dm_mod]
> [ 1553.167453]  bio_endio+0xa1/0x120
> [ 1553.167454]  blk_update_request+0xb7/0x310
> [ 1553.167457]  scsi_end_request+0x38/0x1d0
> [ 1553.167458]  scsi_io_completion+0x13c/0x630
> [ 1553.167460]  scsi_finish_command+0xd9/0x120
> [ 1553.167462]  scsi_softirq_done+0x12a/0x150
> [ 1553.167463]  __blk_mq_complete_request_remote+0x13/0x20
> [ 1553.167466]  flush_smp_call_function_queue+0x52/0x110
> [ 1553.167468]  generic_smp_call_function_single_interrupt+0x13/0x30
> [ 1553.167470]  smp_call_function_single_interrupt+0x27/0x40
> [ 1553.167471]  call_function_single_interrupt+0x93/0xa0
> [ 1553.167473] RIP: 0010:radix_tree_next_chunk+0xcb/0x2e0
> [ 1553.167474] RSP: 0018:ffffb5cb913b3a68 EFLAGS: 00000203 ORIG_RAX: 
> ffffffffffffff04
> [ 1553.167475] RAX: 0000000000000001 RBX: 0000000000000010 RCX: 
> ffff9d9e2c556938
> [ 1553.167476] RDX: ffff9d93053c5919 RSI: 0000000000000001 RDI: 
> ffff9d9e2c556910
> [ 1553.167476] RBP: ffffb5cb913b3ab8 R08: ffffb5cb913b3bc0 R09: 
> 0000000ab7d52000
> [ 1553.167477] R10: fffffcd56adf5440 R11: 0000000000000000 R12: 
> 0000000000000230
> [ 1553.167477] R13: 0000000000000000 R14: 000000000006e0fe R15: 
> ffff9d9e2c556910
> [ 1553.167478]  </IRQ>
> [ 1553.167480]  ? radix_tree_next_chunk+0x10b/0x2e0
> [ 1553.167481]  find_get_pages_tag+0x149/0x270
> [ 1553.167485]  ? block_write_full_page+0xcd/0xe0
> [ 1553.167487]  pagevec_lookup_tag+0x21/0x30
> [ 1553.167488]  write_cache_pages+0x14c/0x510
> [ 1553.167490]  ? __wb_calc_thresh+0x140/0x140
> [ 1553.167492]  generic_writepages+0x51/0x80
> [ 1553.167493]  blkdev_writepages+0x2f/0x40
> [ 1553.167494]  do_writepages+0x1c/0x70
> [ 1553.167495]  __filemap_fdatawrite_range+0xc6/0x100
> [ 1553.167497]  filemap_write_and_wait+0x3d/0x80
> [ 1553.167498]  __sync_blockdev+0x1f/0x40
> [ 1553.167499]  __blkdev_put+0x74/0x200
> [ 1553.167500]  blkdev_put+0x4c/0xd0
> [ 1553.167501]  blkdev_close+0x25/0x30
> [ 1553.167503]  __fput+0xe7/0x210
> [ 1553.167504]  ____fput+0xe/0x10
> [ 1553.167508]  task_work_run+0x83/0xb0
> [ 1553.167511]  exit_to_usermode_loop+0x66/0x98
> [ 1553.167512]  do_syscall_64+0x13a/0x150
> [ 1553.167519]  entry_SYSCALL64_slow_path+0x25/0x25
> [ 1553.167520] RIP: 0033:0x7f3060b25bd0
> [ 1553.167521] RSP: 002b:00007ffc5e6b12c8 EFLAGS: 00000246 ORIG_RAX: 
> 0000000000000003
> [ 1553.167522] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
> 00007f3060b25bd0
> [ 1553.167522] RDX: 0000000000400000 RSI: 0000000000000000 RDI: 
> 0000000000000001
> [ 1553.167523] RBP: 0000000000000320 R08: ffffffffffffffff R09: 
> 0000000000402003
> [ 1553.167524] R10: 00007ffc5e6b1040 R11: 0000000000000246 R12: 
> 0000000000000320
> [ 1553.167525] R13: 0000000000000000 R14: 00007ffc5e6b2618 R15: 
> 00007ffc5e6b1540
> [ 1553.167526] Code: db 0f 84 92 01 00 00 45 89 f5 45 89 e2 4c 89 ee 48 
> c1 e6 04 48 03 71 78 8b 46 08 48 8b 16 44 29 e0 48 89 54 24 30 39 d8 0f 
> 47 c3 <44> 03 56 0c 45 84 db 89 44 24 38 44 89 54 24 3c 0f 85 2a 01 00
> [ 1553.167540] Kernel panic - not syncing: Hard LOCKUP
> [ 1553.167542] CPU: 4 PID: 11532 Comm: dd Tainted: G          I 
> 4.13.0-rc3lobeming.ming_V4+ #20
> [ 1553.167542] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
> [ 1553.167543] Call Trace:
> [ 1553.167543]  <NMI>
> [ 1553.167545]  dump_stack+0x63/0x87
> [ 1553.167548]  panic+0xeb/0x245
> [ 1553.167550]  nmi_panic+0x3f/0x40
> [ 1553.167553]  watchdog_overflow_callback+0xce/0xf0
> [ 1553.167556]  __perf_event_overflow+0x54/0xe0
> [ 1553.167559]  perf_event_overflow+0x14/0x20
> [ 1553.167562]  intel_pmu_handle_irq+0x203/0x4b0
> [ 1553.167566]  perf_event_nmi_handler+0x2d/0x50
> [ 1553.167568]  nmi_handle+0x61/0x110
> [ 1553.167569]  default_do_nmi+0x44/0x120
> [ 1553.167571]  do_nmi+0x113/0x190
> [ 1553.167573]  end_repeat_nmi+0x1a/0x1e
> [ 1553.167574] RIP: 0010:__blk_recalc_rq_segments+0xec/0x3d0
> [ 1553.167575] RSP: 0018:ffff9d9377a83d08 EFLAGS: 00000046
> [ 1553.167576] RAX: 0000000000001000 RBX: 0000000000001000 RCX: 
> ffff9d91b8e9f500
> [ 1553.167576] RDX: fffffcd56af20f00 RSI: ffff9d91b8e9f588 RDI: 
> ffff9d91b8e9f488
> [ 1553.167577] RBP: ffff9d9377a83d88 R08: 000000000003c000 R09: 
> 0000000000000001
> [ 1553.167577] R10: 0000000000000000 R11: 0000000000000000 R12: 
> 0000000000000000
> [ 1553.167578] R13: 0000000000000000 R14: 0000000000000000 R15: 
> 0000000000000003
> [ 1553.167580]  ? __blk_recalc_rq_segments+0xec/0x3d0
> [ 1553.167581]  ? __blk_recalc_rq_segments+0xec/0x3d0
> [ 1553.167582]  </NMI>
> [ 1553.167582]  <IRQ>
> [ 1553.167584]  ? mempool_free+0x2b/0x80
> [ 1553.167585]  blk_recalc_rq_segments+0x28/0x40
> [ 1553.167586]  blk_update_request+0x249/0x310
> [ 1553.167590]  end_clone_bio+0x46/0x70 [dm_mod]
> [ 1553.167592]  bio_endio+0xa1/0x120
> [ 1553.167593]  blk_update_request+0xb7/0x310
> [ 1553.167594]  scsi_end_request+0x38/0x1d0
> [ 1553.167595]  scsi_io_completion+0x13c/0x630
> [ 1553.167597]  scsi_finish_command+0xd9/0x120
> [ 1553.167598]  scsi_softirq_done+0x12a/0x150
> [ 1553.167599]  __blk_mq_complete_request_remote+0x13/0x20
> [ 1553.167601]  flush_smp_call_function_queue+0x52/0x110
> [ 1553.167602]  generic_smp_call_function_single_interrupt+0x13/0x30
> [ 1553.167603]  smp_call_function_single_interrupt+0x27/0x40
> [ 1553.167604]  call_function_single_interrupt+0x93/0xa0
> [ 1553.167605] RIP: 0010:radix_tree_next_chunk+0xcb/0x2e0
> [ 1553.167606] RSP: 0018:ffffb5cb913b3a68 EFLAGS: 00000203 ORIG_RAX: 
> ffffffffffffff04
> [ 1553.167607] RAX: 0000000000000001 RBX: 0000000000000010 RCX: 
> ffff9d9e2c556938
> [ 1553.167607] RDX: ffff9d93053c5919 RSI: 0000000000000001 RDI: 
> ffff9d9e2c556910
> [ 1553.167608] RBP: ffffb5cb913b3ab8 R08: ffffb5cb913b3bc0 R09: 
> 0000000ab7d52000
> [ 1553.167609] R10: fffffcd56adf5440 R11: 0000000000000000 R12: 
> 0000000000000230
> [ 1553.167609] R13: 0000000000000000 R14: 000000000006e0fe R15: 
> ffff9d9e2c556910
> [ 1553.167610]  </IRQ>
> [ 1553.167611]  ? radix_tree_next_chunk+0x10b/0x2e0
> [ 1553.167613]  find_get_pages_tag+0x149/0x270
> [ 1553.167615]  ? block_write_full_page+0xcd/0xe0
> [ 1553.167616]  pagevec_lookup_tag+0x21/0x30
> [ 1553.167617]  write_cache_pages+0x14c/0x510
> [ 1553.167619]  ? __wb_calc_thresh+0x140/0x140
> [ 1553.167621]  generic_writepages+0x51/0x80
> [ 1553.167622]  blkdev_writepages+0x2f/0x40
> [ 1553.167623]  do_writepages+0x1c/0x70
> [ 1553.167627]  __filemap_fdatawrite_range+0xc6/0x100
> [ 1553.167629]  filemap_write_and_wait+0x3d/0x80
> [ 1553.167630]  __sync_blockdev+0x1f/0x40
> [ 1553.167631]  __blkdev_put+0x74/0x200
> [ 1553.167632]  blkdev_put+0x4c/0xd0
> [ 1553.167633]  blkdev_close+0x25/0x30
> [ 1553.167634]  __fput+0xe7/0x210
> [ 1553.167635]  ____fput+0xe/0x10
> [ 1553.167636]  task_work_run+0x83/0xb0
> [ 1553.167637]  exit_to_usermode_loop+0x66/0x98
> [ 1553.167638]  do_syscall_64+0x13a/0x150
> [ 1553.167640]  entry_SYSCALL64_slow_path+0x25/0x25
> [ 1553.167641] RIP: 0033:0x7f3060b25bd0
> [ 1553.167641] RSP: 002b:00007ffc5e6b12c8 EFLAGS: 00000246 ORIG_RAX: 
> 0000000000000003
> [ 1553.167642] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
> 00007f3060b25bd0
> [ 1553.167643] RDX: 0000000000400000 RSI: 0000000000000000 RDI: 
> 0000000000000001
> [ 1553.167643] RBP: 0000000000000320 R08: ffffffffffffffff R09: 
> 0000000000402003
> [ 1553.167644] R10: 00007ffc5e6b1040 R11: 0000000000000246 R12: 
> 0000000000000320
> [ 1553.167645] R13: 0000000000000000 R14: 00007ffc5e6b2618 R15: 
> 00007ffc5e6b1540
> [ 1553.167829] Kernel Offset: 0x2f400000 from 0xffffffff81000000 
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 1558.119352] ---[ end Kernel panic - not syncing: Hard LOCKUP
> [ 1558.151334] sched: Unexpected reschedule of offline CPU#0!
> [ 1558.151342] ------------[ cut here ]------------
> [ 1558.151347] WARNING: CPU: 4 PID: 11532 at arch/x86/kernel/smp.c:128 
> native_smp_send_reschedule+0x3c/0x40
> [ 1558.151348] Modules linked in: mq_deadline binfmt_misc dm_round_robin 
> xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter 
> ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set 
> nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat 
> nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma ip6table_mangle 
> ip6table_security ip6table_raw iptable_nat ib_isert nf_conntrack_ipv4 
> iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ib_iser 
> libiscsi scsi_transport_iscsi iptable_mangle iptable_security 
> iptable_raw ebtable_filter ebtables target_core_mod ip6table_filter 
> ip6_tables iptable_filter ib_srp scsi_transport_srp ib_ipoib rdma_ucm 
> ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core 
> intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul
> [ 1558.151363]  crc32_pclmul ghash_clmulni_intel pcbc aesni_intel sg 
> joydev ipmi_si hpilo crypto_simd hpwdt iTCO_wdt cryptd ipmi_devintf 
> glue_helper gpio_ich iTCO_vendor_support shpchp ipmi_msghandler pcspkr 
> acpi_power_meter i7core_edac lpc_ich pcc_cpufreq nfsd auth_rpcgss 
> nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c sd_mod 
> amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea 
> sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm mlxfw i2c_core ptp 
> serio_raw hpsa crc32c_intel bnx2 pps_core devlink scsi_transport_sas 
> dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ib_srpt]
> [ 1558.151377] CPU: 4 PID: 11532 Comm: dd Tainted: G          I 
> 4.13.0-rc3lobeming.ming_V4+ #20
> [ 1558.151378] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
> [ 1558.151379] task: ffff9d9344b0d800 task.stack: ffffb5cb913b0000
> [ 1558.151380] RIP: 0010:native_smp_send_reschedule+0x3c/0x40
> [ 1558.151381] RSP: 0018:ffff9d9377a856a0 EFLAGS: 00010046
> [ 1558.151382] RAX: 000000000000002e RBX: 0000000000000000 RCX: 
> 0000000000000006
> [ 1558.151383] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
> ffff9d9377a8e010
> [ 1558.151383] RBP: ffff9d9377a856a0 R08: 000000000000002e R09: 
> ffff9d9fa7029552
> [ 1558.151384] R10: 0000000000000732 R11: 0000000000000000 R12: 
> ffff9d9377a1bb00
> [ 1558.151384] R13: ffff9d936ef70000 R14: ffff9d9377a85758 R15: 
> ffff9d9377a1bb00
> [ 1558.151385] FS:  00007f3061002740(0000) GS:ffff9d9377a80000(0000) 
> knlGS:0000000000000000
> [ 1558.151386] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1558.151387] CR2: 00007f305a511000 CR3: 0000000b1bfb2000 CR4: 
> 00000000000006e0
> [ 1558.151387] Call Trace:
> [ 1558.151388]  <NMI>
> [ 1558.151391]  resched_curr+0xa1/0xc0
> [ 1558.151392]  check_preempt_curr+0x79/0x90
> [ 1558.151393]  ttwu_do_wakeup+0x1e/0x160
> [ 1558.151395]  ttwu_do_activate+0x7a/0x90
> [ 1558.151396]  try_to_wake_up+0x1e1/0x470
> [ 1558.151398]  ? vsnprintf+0x1fa/0x4a0
> [ 1558.151399]  default_wake_function+0x12/0x20
> [ 1558.151403]  __wake_up_common+0x55/0x90
> [ 1558.151405]  __wake_up_locked+0x13/0x20
> [ 1558.151408]  ep_poll_callback+0xd3/0x2a0
> [ 1558.151409]  __wake_up_common+0x55/0x90
> [ 1558.151411]  __wake_up+0x39/0x50
> [ 1558.151415]  wake_up_klogd_work_func+0x40/0x60
> [ 1558.151419]  irq_work_run_list+0x4d/0x70
> [ 1558.151421]  ? tick_sched_do_timer+0x70/0x70
> [ 1558.151422]  irq_work_tick+0x40/0x50
> [ 1558.151424]  update_process_times+0x42/0x60
> [ 1558.151426]  tick_sched_handle+0x2d/0x60
> [ 1558.151427]  tick_sched_timer+0x39/0x70
> [ 1558.151428]  __hrtimer_run_queues+0xe5/0x230
> [ 1558.151429]  hrtimer_interrupt+0xa8/0x1a0
> [ 1558.151431]  local_apic_timer_interrupt+0x35/0x60
> [ 1558.151432]  smp_apic_timer_interrupt+0x38/0x50
> [ 1558.151433]  apic_timer_interrupt+0x93/0xa0
> [ 1558.151435] RIP: 0010:panic+0x1fa/0x245
> [ 1558.151435] RSP: 0018:ffff9d9377a85b30 EFLAGS: 00000246 ORIG_RAX: 
> ffffffffffffff10
> [ 1558.151436] RAX: 0000000000000030 RBX: ffff9d890c5d0000 RCX: 
> 0000000000000006
> [ 1558.151437] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
> ffff9d9377a8e010
> [ 1558.151438] RBP: ffff9d9377a85ba0 R08: 0000000000000030 R09: 
> ffff9d9fa7029514
> [ 1558.151438] R10: 0000000000000731 R11: 0000000000000000 R12: 
> ffffffffb0e63f7f
> [ 1558.151439] R13: 0000000000000000 R14: 0000000000000000 R15: 
> 0000000000000000
> [ 1558.151441]  nmi_panic+0x3f/0x40
> [ 1558.151442]  watchdog_overflow_callback+0xce/0xf0
> [ 1558.151444]  __perf_event_overflow+0x54/0xe0
> [ 1558.151445]  perf_event_overflow+0x14/0x20
> [ 1558.151446]  intel_pmu_handle_irq+0x203/0x4b0
> [ 1558.151449]  perf_event_nmi_handler+0x2d/0x50
> [ 1558.151450]  nmi_handle+0x61/0x110
> [ 1558.151452]  default_do_nmi+0x44/0x120
> [ 1558.151453]  do_nmi+0x113/0x190
> [ 1558.151454]  end_repeat_nmi+0x1a/0x1e
> [ 1558.151456] RIP: 0010:__blk_recalc_rq_segments+0xec/0x3d0
> [ 1558.151456] RSP: 0018:ffff9d9377a83d08 EFLAGS: 00000046
> [ 1558.151457] RAX: 0000000000001000 RBX: 0000000000001000 RCX: 
> ffff9d91b8e9f500
> [ 1558.151458] RDX: fffffcd56af20f00 RSI: ffff9d91b8e9f588 RDI: 
> ffff9d91b8e9f488
> [ 1558.151458] RBP: ffff9d9377a83d88 R08: 000000000003c000 R09: 
> 0000000000000001
> [ 1558.151459] R10: 0000000000000000 R11: 0000000000000000 R12: 
> 0000000000000000
> [ 1558.151459] R13: 0000000000000000 R14: 0000000000000000 R15: 
> 0000000000000003
> [ 1558.151461]  ? __blk_recalc_rq_segments+0xec/0x3d0
> [ 1558.151463]  ? __blk_recalc_rq_segments+0xec/0x3d0
> [ 1558.151463]  </NMI>
> [ 1558.151464]  <IRQ>
> [ 1558.151465]  ? mempool_free+0x2b/0x80
> [ 1558.151467]  blk_recalc_rq_segments+0x28/0x40
> [ 1558.151468]  blk_update_request+0x249/0x310
> [ 1558.151472]  end_clone_bio+0x46/0x70 [dm_mod]
> [ 1558.151473]  bio_endio+0xa1/0x120
> [ 1558.151474]  blk_update_request+0xb7/0x310
> [ 1558.151476]  scsi_end_request+0x38/0x1d0
> [ 1558.151477]  scsi_io_completion+0x13c/0x630
> [ 1558.151478]  scsi_finish_command+0xd9/0x120
> [ 1558.151480]  scsi_softirq_done+0x12a/0x150
> [ 1558.151481]  __blk_mq_complete_request_remote+0x13/0x20
> [ 1558.151483]  flush_smp_call_function_queue+0x52/0x110
> [ 1558.151484]  generic_smp_call_function_single_interrupt+0x13/0x30
> [ 1558.151485]  smp_call_function_single_interrupt+0x27/0x40
> [ 1558.151486]  call_function_single_interrupt+0x93/0xa0
> [ 1558.151487] RIP: 0010:radix_tree_next_chunk+0xcb/0x2e0
> [ 1558.151488] RSP: 0018:ffffb5cb913b3a68 EFLAGS: 00000203 ORIG_RAX: 
> ffffffffffffff04
> [ 1558.151489] RAX: 0000000000000001 RBX: 0000000000000010 RCX: 
> ffff9d9e2c556938
> [ 1558.151490] RDX: ffff9d93053c5919 RSI: 0000000000000001 RDI: 
> ffff9d9e2c556910
> [ 1558.151490] RBP: ffffb5cb913b3ab8 R08: ffffb5cb913b3bc0 R09: 
> 0000000ab7d52000
> [ 1558.151491] R10: fffffcd56adf5440 R11: 0000000000000000 R12: 
> 0000000000000230
> [ 1558.151492] R13: 0000000000000000 R14: 000000000006e0fe R15: 
> ffff9d9e2c556910
> [ 1558.151492]  </IRQ>
> [ 1558.151494]  ? radix_tree_next_chunk+0x10b/0x2e0
> [ 1558.151496]  find_get_pages_tag+0x149/0x270
> [ 1558.151497]  ? block_write_full_page+0xcd/0xe0
> [ 1558.151499]  pagevec_lookup_tag+0x21/0x30
> [ 1558.151500]  write_cache_pages+0x14c/0x510
> [ 1558.151502]  ? __wb_calc_thresh+0x140/0x140
> [ 1558.151504]  generic_writepages+0x51/0x80
> [ 1558.151505]  blkdev_writepages+0x2f/0x40
> [ 1558.151506]  do_writepages+0x1c/0x70
> [ 1558.151507]  __filemap_fdatawrite_range+0xc6/0x100
> [ 1558.151509]  filemap_write_and_wait+0x3d/0x80
> [ 1558.151510]  __sync_blockdev+0x1f/0x40
> [ 1558.151511]  __blkdev_put+0x74/0x200
> [ 1558.151512]  blkdev_put+0x4c/0xd0
> [ 1558.151513]  blkdev_close+0x25/0x30
> [ 1558.151514]  __fput+0xe7/0x210
> [ 1558.151515]  ____fput+0xe/0x10
> [ 1558.151517]  task_work_run+0x83/0xb0
> [ 1558.151518]  exit_to_usermode_loop+0x66/0x98
> [ 1558.151519]  do_syscall_64+0x13a/0x150
> [ 1558.151521]  entry_SYSCALL64_slow_path+0x25/0x25
> [ 1558.151522] RIP: 0033:0x7f3060b25bd0
> [ 1558.151522] RSP: 002b:00007ffc5e6b12c8 EFLAGS: 00000246 ORIG_RAX: 
> 0000000000000003
> [ 1558.151523] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
> 00007f3060b25bd0
> [ 1558.151524] RDX: 0000000000400000 RSI: 0000000000000000 RDI: 
> 0000000000000001
> [ 1558.151524] RBP: 0000000000000320 R08: ffffffffffffffff R09: 
> 0000000000402003
> [ 1558.151525] R10: 00007ffc5e6b1040 R11: 0000000000000246 R12: 
> 0000000000000320
> [ 1558.151526] R13: 0000000000000000 R14: 00007ffc5e6b2618 R15: 
> 00007ffc5e6b1540
> [ 1558.151527] Code: dc 00 0f 92 c0 84 c0 74 14 48 8b 05 ff 9d a9 00 be 
> fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 90 4d e3 b0 e8 c7 db 
> 09 00 <0f> ff 5d c3 66 66 66 66 90 55 48 89 e5 48 83 ec 20 65 48 8b 04
> [ 1558.151540] ---[ end trace 408c28f8c132530c ]---

Replying to my own email:

Hello Bart

So with the stock 4.13-RC3 using [none] I get the failing clones but no 
hard lockup.

[  526.677611] multipath_clone_and_map: 148 callbacks suppressed
[  526.707429] device-mapper: multipath: blk_get_request() returned -11 
- requeuing
[  527.283432] device-mapper: multipath: blk_get_request() returned -11 
- requeuing
[  528.324566] device-mapper: multipath: blk_get_request() returned -11 
- requeuing
[  528.367118] device-mapper: multipath: blk_get_request() returned -11 
- requeuing

Then, on the same kernel, after modprobe mq-deadline and running with 
[mq-deadline] on stock upstream, I do NOT get the clone failures and I 
don't seem to get hard lockups.
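
For reference, the scheduler switch was done via the usual sysfs knob, 
roughly as below (sdX is a placeholder for each path device); on blk-mq 
the active scheduler is shown in brackets, which is what [none] and 
[mq-deadline] above refer to:

    modprobe mq-deadline
    echo mq-deadline > /sys/block/sdX/queue/scheduler
    cat /sys/block/sdX/queue/scheduler
    # lists the available schedulers with the active one in brackets,
    # e.g. "[mq-deadline] none"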

I will reach out to Ming once I confirm that going back to his kernel gets 
me back into hard lockups again.
I will also mention to him that we see the clone failure when using his 
kernel with [mq-deadline].

Perhaps his enhancements are indeed contributing.

Thanks
Laurence



I also intend to test qla2xxx and mq-deadline shortly.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-07 18:46       ` Laurence Oberman
@ 2017-08-07 19:46         ` Laurence Oberman
  0 siblings, 0 replies; 84+ messages in thread
From: Laurence Oberman @ 2017-08-07 19:46 UTC (permalink / raw)
  To: Bart Van Assche, hch, linux-block, axboe, ming.lei



On 08/07/2017 02:46 PM, Laurence Oberman wrote:
> 
> 
> On 08/07/2017 01:29 PM, Laurence Oberman wrote:
>>
>>
>> On 08/07/2017 11:27 AM, Bart Van Assche wrote:
>>> On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote:
>>>> I tested this series using Ming's tests as well as my own set of tests
>>>> typically run against changes to upstream code in my SRP test-bed.
>>>> My tests also include very large sequential buffered and un-buffered 
>>>> I/O.
>>>>
>>>> This series seems to be fine for me. I did uncover another issue 
>>>> that is
>>>> unrelated to these patches and also exists in 4.13-RC3 generic that 
>>>> I am
>>>> still debugging.
>>>
>>> Hello Laurence,
>>>
>>> What kind of tests did you run? Only functional tests or also 
>>> performance
>>> tests?
>>>
>>> Has the issue you encountered with kernel 4.13-rc3 already been 
>>> reported on
>>> the linux-rdma mailing list?
>>>
>>> Thanks,
>>>
>>> Bart.
>>>
>>
>> Hi Bart
>>
>> Actually I was focusing on just performance to see if we had any 
>> regressions with Mings new patches for the large sequential I/O cases.
>>
>> Ming had already tested the small I/O performance cases so I was 
>> making sure the large I/O sequential tests did not suffer.
>>
>> The 4MB un-buffered direct read tests to DM devices seems to perform 
>> much the same in my test bed.
>> The 4MB buffered and un-buffered 4MB writes also seem to be well 
>> behaved with not much improvement.
>>
>> These were not exhaustive tests and did not include my usual port 
>> disconnect and recovery tests either.
>> I was just making sure we did not regress with Ming's changes.
>>
>> I was only just starting to baseline test the mq-deadline scheduler as 
>> prior to 4.13-RC3 I had not been testing any of the new MQ schedulers.
>> I had always only tested with [none]
>>
>> The tests were with [none] and [mq-deadline]
>>
>> The new issue I started seeing was not yet reported yet as I was still 
>> investigating it.
>>
>> In summary:
>> With buffered writes we see the clone fail in blk_get_request in both 
>> Mings kernel and in the upstream 4.13-RC3 stock kernel
>>
>> [  885.271451] io scheduler mq-deadline registered
>> [  898.455442] device-mapper: multipath: blk_get_request() returned 
>> -11 - requeuing
>>
>> This is due to
>>
>> multipath_clone_and_map()
>>
>> /*
>>   * Map cloned requests (request-based multipath)
>>   */
>> static int multipath_clone_and_map(struct dm_target *ti, struct 
>> request *rq,
>>                                     union map_info *map_context,
>>                                     struct request **__clone)
>> {
>> ..
>> ..
>> ..
>>          clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, 
>> GFP_ATOMIC);
>>          if (IS_ERR(clone)) {
>>                  /* EBUSY, ENODEV or EWOULDBLOCK: requeue */
>>                  bool queue_dying = blk_queue_dying(q);
>>                  DMERR_LIMIT("blk_get_request() returned %ld%s - 
>> requeuing",
>>                              PTR_ERR(clone), queue_dying ? " (path 
>> offline)" : "");
>>                  if (queue_dying) {
>>                          atomic_inc(&m->pg_init_in_progress);
>>                          activate_or_offline_path(pgpath);
>>                          return DM_MAPIO_REQUEUE;
>>                  }
>>                  return DM_MAPIO_DELAY_REQUEUE;
>>          }
>>
>> Still investigating but it leads to a hard lockup
>>
>>
>> So I still need to see if the hard-lockup happens in the stock kernel 
>> with mq-deadline and some other work before coming up with a full 
>> summary of the issue.
>>
>> I also intend to re-run all tests including disconnect and reconnect 
>> tests on both mq-deadline and none.
>>
>> Trace below
>>
>>
>> [ 1553.167357] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
>> [ 1553.167359] Modules linked in: mq_deadline binfmt_misc 
>> dm_round_robin xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun 
>> ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 
>> xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp 
>> llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma 
>> ip6table_mangle ip6table_security ip6table_raw iptable_nat ib_isert 
>> nf_conntrack_ipv4 iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
>> nf_conntrack ib_iser libiscsi scsi_transport_iscsi iptable_mangle 
>> iptable_security iptable_raw ebtable_filter ebtables target_core_mod 
>> ip6table_filter ip6_tables iptable_filter ib_srp scsi_transport_srp 
>> ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib 
>> ib_core intel_powerclamp coretemp kvm_intel kvm irqbypass 
>> crct10dif_pclmul
>> [ 1553.167385]  crc32_pclmul ghash_clmulni_intel pcbc aesni_intel sg 
>> joydev ipmi_si hpilo crypto_simd hpwdt iTCO_wdt cryptd ipmi_devintf 
>> glue_helper gpio_ich iTCO_vendor_support shpchp ipmi_msghandler pcspkr 
>> acpi_power_meter i7core_edac lpc_ich pcc_cpufreq nfsd auth_rpcgss 
>> nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c sd_mod 
>> amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea 
>> sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm mlxfw i2c_core ptp 
>> serio_raw hpsa crc32c_intel bnx2 pps_core devlink scsi_transport_sas 
>> dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ib_srpt]
>> [ 1553.167410] CPU: 4 PID: 11532 Comm: dd Tainted: G          I 
>> 4.13.0-rc3lobeming.ming_V4+ #20
>> [ 1553.167411] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
>> [ 1553.167412] task: ffff9d9344b0d800 task.stack: ffffb5cb913b0000
>> [ 1553.167421] RIP: 0010:__blk_recalc_rq_segments+0xec/0x3d0
>> [ 1553.167421] RSP: 0018:ffff9d9377a83d08 EFLAGS: 00000046
>> [ 1553.167422] RAX: 0000000000001000 RBX: 0000000000001000 RCX: 
>> ffff9d91b8e9f500
>> [ 1553.167423] RDX: fffffcd56af20f00 RSI: ffff9d91b8e9f588 RDI: 
>> ffff9d91b8e9f488
>> [ 1553.167424] RBP: ffff9d9377a83d88 R08: 000000000003c000 R09: 
>> 0000000000000001
>> [ 1553.167424] R10: 0000000000000000 R11: 0000000000000000 R12: 
>> 0000000000000000
>> [ 1553.167425] R13: 0000000000000000 R14: 0000000000000000 R15: 
>> 0000000000000003
>> [ 1553.167426] FS:  00007f3061002740(0000) GS:ffff9d9377a80000(0000) 
>> knlGS:0000000000000000
>> [ 1553.167427] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1553.167428] CR2: 00007f305a511000 CR3: 0000000b1bfb2000 CR4: 
>> 00000000000006e0
>> [ 1553.167429] Call Trace:
>> [ 1553.167432]  <IRQ>
>> [ 1553.167437]  ? mempool_free+0x2b/0x80
>> [ 1553.167440]  blk_recalc_rq_segments+0x28/0x40
>> [ 1553.167442]  blk_update_request+0x249/0x310
>> [ 1553.167450]  end_clone_bio+0x46/0x70 [dm_mod]
>> [ 1553.167453]  bio_endio+0xa1/0x120
>> [ 1553.167454]  blk_update_request+0xb7/0x310
>> [ 1553.167457]  scsi_end_request+0x38/0x1d0
>> [ 1553.167458]  scsi_io_completion+0x13c/0x630
>> [ 1553.167460]  scsi_finish_command+0xd9/0x120
>> [ 1553.167462]  scsi_softirq_done+0x12a/0x150
>> [ 1553.167463]  __blk_mq_complete_request_remote+0x13/0x20
>> [ 1553.167466]  flush_smp_call_function_queue+0x52/0x110
>> [ 1553.167468]  generic_smp_call_function_single_interrupt+0x13/0x30
>> [ 1553.167470]  smp_call_function_single_interrupt+0x27/0x40
>> [ 1553.167471]  call_function_single_interrupt+0x93/0xa0
>> [ 1553.167473] RIP: 0010:radix_tree_next_chunk+0xcb/0x2e0
>> [ 1553.167474] RSP: 0018:ffffb5cb913b3a68 EFLAGS: 00000203 ORIG_RAX: 
>> ffffffffffffff04
>> [ 1553.167475] RAX: 0000000000000001 RBX: 0000000000000010 RCX: 
>> ffff9d9e2c556938
>> [ 1553.167476] RDX: ffff9d93053c5919 RSI: 0000000000000001 RDI: 
>> ffff9d9e2c556910
>> [ 1553.167476] RBP: ffffb5cb913b3ab8 R08: ffffb5cb913b3bc0 R09: 
>> 0000000ab7d52000
>> [ 1553.167477] R10: fffffcd56adf5440 R11: 0000000000000000 R12: 
>> 0000000000000230
>> [ 1553.167477] R13: 0000000000000000 R14: 000000000006e0fe R15: 
>> ffff9d9e2c556910
>> [ 1553.167478]  </IRQ>
>> [ 1553.167480]  ? radix_tree_next_chunk+0x10b/0x2e0
>> [ 1553.167481]  find_get_pages_tag+0x149/0x270
>> [ 1553.167485]  ? block_write_full_page+0xcd/0xe0
>> [ 1553.167487]  pagevec_lookup_tag+0x21/0x30
>> [ 1553.167488]  write_cache_pages+0x14c/0x510
>> [ 1553.167490]  ? __wb_calc_thresh+0x140/0x140
>> [ 1553.167492]  generic_writepages+0x51/0x80
>> [ 1553.167493]  blkdev_writepages+0x2f/0x40
>> [ 1553.167494]  do_writepages+0x1c/0x70
>> [ 1553.167495]  __filemap_fdatawrite_range+0xc6/0x100
>> [ 1553.167497]  filemap_write_and_wait+0x3d/0x80
>> [ 1553.167498]  __sync_blockdev+0x1f/0x40
>> [ 1553.167499]  __blkdev_put+0x74/0x200
>> [ 1553.167500]  blkdev_put+0x4c/0xd0
>> [ 1553.167501]  blkdev_close+0x25/0x30
>> [ 1553.167503]  __fput+0xe7/0x210
>> [ 1553.167504]  ____fput+0xe/0x10
>> [ 1553.167508]  task_work_run+0x83/0xb0
>> [ 1553.167511]  exit_to_usermode_loop+0x66/0x98
>> [ 1553.167512]  do_syscall_64+0x13a/0x150
>> [ 1553.167519]  entry_SYSCALL64_slow_path+0x25/0x25
>> [ 1553.167520] RIP: 0033:0x7f3060b25bd0
>> [ 1553.167521] RSP: 002b:00007ffc5e6b12c8 EFLAGS: 00000246 ORIG_RAX: 
>> 0000000000000003
>> [ 1553.167522] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
>> 00007f3060b25bd0
>> [ 1553.167522] RDX: 0000000000400000 RSI: 0000000000000000 RDI: 
>> 0000000000000001
>> [ 1553.167523] RBP: 0000000000000320 R08: ffffffffffffffff R09: 
>> 0000000000402003
>> [ 1553.167524] R10: 00007ffc5e6b1040 R11: 0000000000000246 R12: 
>> 0000000000000320
>> [ 1553.167525] R13: 0000000000000000 R14: 00007ffc5e6b2618 R15: 
>> 00007ffc5e6b1540
>> [ 1553.167526] Code: db 0f 84 92 01 00 00 45 89 f5 45 89 e2 4c 89 ee 
>> 48 c1 e6 04 48 03 71 78 8b 46 08 48 8b 16 44 29 e0 48 89 54 24 30 39 
>> d8 0f 47 c3 <44> 03 56 0c 45 84 db 89 44 24 38 44 89 54 24 3c 0f 85 2a 
>> 01 00
>> [ 1553.167540] Kernel panic - not syncing: Hard LOCKUP
>> [ 1553.167542] CPU: 4 PID: 11532 Comm: dd Tainted: G          I 
>> 4.13.0-rc3lobeming.ming_V4+ #20
>> [ 1553.167542] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
>> [ 1553.167543] Call Trace:
>> [ 1553.167543]  <NMI>
>> [ 1553.167545]  dump_stack+0x63/0x87
>> [ 1553.167548]  panic+0xeb/0x245
>> [ 1553.167550]  nmi_panic+0x3f/0x40
>> [ 1553.167553]  watchdog_overflow_callback+0xce/0xf0
>> [ 1553.167556]  __perf_event_overflow+0x54/0xe0
>> [ 1553.167559]  perf_event_overflow+0x14/0x20
>> [ 1553.167562]  intel_pmu_handle_irq+0x203/0x4b0
>> [ 1553.167566]  perf_event_nmi_handler+0x2d/0x50
>> [ 1553.167568]  nmi_handle+0x61/0x110
>> [ 1553.167569]  default_do_nmi+0x44/0x120
>> [ 1553.167571]  do_nmi+0x113/0x190
>> [ 1553.167573]  end_repeat_nmi+0x1a/0x1e
>> [ 1553.167574] RIP: 0010:__blk_recalc_rq_segments+0xec/0x3d0
>> [ 1553.167575] RSP: 0018:ffff9d9377a83d08 EFLAGS: 00000046
>> [ 1553.167576] RAX: 0000000000001000 RBX: 0000000000001000 RCX: 
>> ffff9d91b8e9f500
>> [ 1553.167576] RDX: fffffcd56af20f00 RSI: ffff9d91b8e9f588 RDI: 
>> ffff9d91b8e9f488
>> [ 1553.167577] RBP: ffff9d9377a83d88 R08: 000000000003c000 R09: 
>> 0000000000000001
>> [ 1553.167577] R10: 0000000000000000 R11: 0000000000000000 R12: 
>> 0000000000000000
>> [ 1553.167578] R13: 0000000000000000 R14: 0000000000000000 R15: 
>> 0000000000000003
>> [ 1553.167580]  ? __blk_recalc_rq_segments+0xec/0x3d0
>> [ 1553.167581]  ? __blk_recalc_rq_segments+0xec/0x3d0
>> [ 1553.167582]  </NMI>
>> [ 1553.167582]  <IRQ>
>> [ 1553.167584]  ? mempool_free+0x2b/0x80
>> [ 1553.167585]  blk_recalc_rq_segments+0x28/0x40
>> [ 1553.167586]  blk_update_request+0x249/0x310
>> [ 1553.167590]  end_clone_bio+0x46/0x70 [dm_mod]
>> [ 1553.167592]  bio_endio+0xa1/0x120
>> [ 1553.167593]  blk_update_request+0xb7/0x310
>> [ 1553.167594]  scsi_end_request+0x38/0x1d0
>> [ 1553.167595]  scsi_io_completion+0x13c/0x630
>> [ 1553.167597]  scsi_finish_command+0xd9/0x120
>> [ 1553.167598]  scsi_softirq_done+0x12a/0x150
>> [ 1553.167599]  __blk_mq_complete_request_remote+0x13/0x20
>> [ 1553.167601]  flush_smp_call_function_queue+0x52/0x110
>> [ 1553.167602]  generic_smp_call_function_single_interrupt+0x13/0x30
>> [ 1553.167603]  smp_call_function_single_interrupt+0x27/0x40
>> [ 1553.167604]  call_function_single_interrupt+0x93/0xa0
>> [ 1553.167605] RIP: 0010:radix_tree_next_chunk+0xcb/0x2e0
>> [ 1553.167606] RSP: 0018:ffffb5cb913b3a68 EFLAGS: 00000203 ORIG_RAX: 
>> ffffffffffffff04
>> [ 1553.167607] RAX: 0000000000000001 RBX: 0000000000000010 RCX: 
>> ffff9d9e2c556938
>> [ 1553.167607] RDX: ffff9d93053c5919 RSI: 0000000000000001 RDI: 
>> ffff9d9e2c556910
>> [ 1553.167608] RBP: ffffb5cb913b3ab8 R08: ffffb5cb913b3bc0 R09: 
>> 0000000ab7d52000
>> [ 1553.167609] R10: fffffcd56adf5440 R11: 0000000000000000 R12: 
>> 0000000000000230
>> [ 1553.167609] R13: 0000000000000000 R14: 000000000006e0fe R15: 
>> ffff9d9e2c556910
>> [ 1553.167610]  </IRQ>
>> [ 1553.167611]  ? radix_tree_next_chunk+0x10b/0x2e0
>> [ 1553.167613]  find_get_pages_tag+0x149/0x270
>> [ 1553.167615]  ? block_write_full_page+0xcd/0xe0
>> [ 1553.167616]  pagevec_lookup_tag+0x21/0x30
>> [ 1553.167617]  write_cache_pages+0x14c/0x510
>> [ 1553.167619]  ? __wb_calc_thresh+0x140/0x140
>> [ 1553.167621]  generic_writepages+0x51/0x80
>> [ 1553.167622]  blkdev_writepages+0x2f/0x40
>> [ 1553.167623]  do_writepages+0x1c/0x70
>> [ 1553.167627]  __filemap_fdatawrite_range+0xc6/0x100
>> [ 1553.167629]  filemap_write_and_wait+0x3d/0x80
>> [ 1553.167630]  __sync_blockdev+0x1f/0x40
>> [ 1553.167631]  __blkdev_put+0x74/0x200
>> [ 1553.167632]  blkdev_put+0x4c/0xd0
>> [ 1553.167633]  blkdev_close+0x25/0x30
>> [ 1553.167634]  __fput+0xe7/0x210
>> [ 1553.167635]  ____fput+0xe/0x10
>> [ 1553.167636]  task_work_run+0x83/0xb0
>> [ 1553.167637]  exit_to_usermode_loop+0x66/0x98
>> [ 1553.167638]  do_syscall_64+0x13a/0x150
>> [ 1553.167640]  entry_SYSCALL64_slow_path+0x25/0x25
>> [ 1553.167641] RIP: 0033:0x7f3060b25bd0
>> [ 1553.167641] RSP: 002b:00007ffc5e6b12c8 EFLAGS: 00000246 ORIG_RAX: 
>> 0000000000000003
>> [ 1553.167642] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
>> 00007f3060b25bd0
>> [ 1553.167643] RDX: 0000000000400000 RSI: 0000000000000000 RDI: 
>> 0000000000000001
>> [ 1553.167643] RBP: 0000000000000320 R08: ffffffffffffffff R09: 
>> 0000000000402003
>> [ 1553.167644] R10: 00007ffc5e6b1040 R11: 0000000000000246 R12: 
>> 0000000000000320
>> [ 1553.167645] R13: 0000000000000000 R14: 00007ffc5e6b2618 R15: 
>> 00007ffc5e6b1540
>> [ 1553.167829] Kernel Offset: 0x2f400000 from 0xffffffff81000000 
>> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> [ 1558.119352] ---[ end Kernel panic - not syncing: Hard LOCKUP
>> [ 1558.151334] sched: Unexpected reschedule of offline CPU#0!
>> [ 1558.151342] ------------[ cut here ]------------
>> [ 1558.151347] WARNING: CPU: 4 PID: 11532 at arch/x86/kernel/smp.c:128 
>> native_smp_send_reschedule+0x3c/0x40
>> [ 1558.151348] Modules linked in: mq_deadline binfmt_misc 
>> dm_round_robin xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun 
>> ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 
>> xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp 
>> llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma 
>> ip6table_mangle ip6table_security ip6table_raw iptable_nat ib_isert 
>> nf_conntrack_ipv4 iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
>> nf_conntrack ib_iser libiscsi scsi_transport_iscsi iptable_mangle 
>> iptable_security iptable_raw ebtable_filter ebtables target_core_mod 
>> ip6table_filter ip6_tables iptable_filter ib_srp scsi_transport_srp 
>> ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib 
>> ib_core intel_powerclamp coretemp kvm_intel kvm irqbypass 
>> crct10dif_pclmul
>> [ 1558.151363]  crc32_pclmul ghash_clmulni_intel pcbc aesni_intel sg 
>> joydev ipmi_si hpilo crypto_simd hpwdt iTCO_wdt cryptd ipmi_devintf 
>> glue_helper gpio_ich iTCO_vendor_support shpchp ipmi_msghandler pcspkr 
>> acpi_power_meter i7core_edac lpc_ich pcc_cpufreq nfsd auth_rpcgss 
>> nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c sd_mod 
>> amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea 
>> sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm mlxfw i2c_core ptp 
>> serio_raw hpsa crc32c_intel bnx2 pps_core devlink scsi_transport_sas 
>> dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ib_srpt]
>> [ 1558.151377] CPU: 4 PID: 11532 Comm: dd Tainted: G          I 
>> 4.13.0-rc3lobeming.ming_V4+ #20
>> [ 1558.151378] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
>> [ 1558.151379] task: ffff9d9344b0d800 task.stack: ffffb5cb913b0000
>> [ 1558.151380] RIP: 0010:native_smp_send_reschedule+0x3c/0x40
>> [ 1558.151381] RSP: 0018:ffff9d9377a856a0 EFLAGS: 00010046
>> [ 1558.151382] RAX: 000000000000002e RBX: 0000000000000000 RCX: 
>> 0000000000000006
>> [ 1558.151383] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
>> ffff9d9377a8e010
>> [ 1558.151383] RBP: ffff9d9377a856a0 R08: 000000000000002e R09: 
>> ffff9d9fa7029552
>> [ 1558.151384] R10: 0000000000000732 R11: 0000000000000000 R12: 
>> ffff9d9377a1bb00
>> [ 1558.151384] R13: ffff9d936ef70000 R14: ffff9d9377a85758 R15: 
>> ffff9d9377a1bb00
>> [ 1558.151385] FS:  00007f3061002740(0000) GS:ffff9d9377a80000(0000) 
>> knlGS:0000000000000000
>> [ 1558.151386] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1558.151387] CR2: 00007f305a511000 CR3: 0000000b1bfb2000 CR4: 
>> 00000000000006e0
>> [ 1558.151387] Call Trace:
>> [ 1558.151388]  <NMI>
>> [ 1558.151391]  resched_curr+0xa1/0xc0
>> [ 1558.151392]  check_preempt_curr+0x79/0x90
>> [ 1558.151393]  ttwu_do_wakeup+0x1e/0x160
>> [ 1558.151395]  ttwu_do_activate+0x7a/0x90
>> [ 1558.151396]  try_to_wake_up+0x1e1/0x470
>> [ 1558.151398]  ? vsnprintf+0x1fa/0x4a0
>> [ 1558.151399]  default_wake_function+0x12/0x20
>> [ 1558.151403]  __wake_up_common+0x55/0x90
>> [ 1558.151405]  __wake_up_locked+0x13/0x20
>> [ 1558.151408]  ep_poll_callback+0xd3/0x2a0
>> [ 1558.151409]  __wake_up_common+0x55/0x90
>> [ 1558.151411]  __wake_up+0x39/0x50
>> [ 1558.151415]  wake_up_klogd_work_func+0x40/0x60
>> [ 1558.151419]  irq_work_run_list+0x4d/0x70
>> [ 1558.151421]  ? tick_sched_do_timer+0x70/0x70
>> [ 1558.151422]  irq_work_tick+0x40/0x50
>> [ 1558.151424]  update_process_times+0x42/0x60
>> [ 1558.151426]  tick_sched_handle+0x2d/0x60
>> [ 1558.151427]  tick_sched_timer+0x39/0x70
>> [ 1558.151428]  __hrtimer_run_queues+0xe5/0x230
>> [ 1558.151429]  hrtimer_interrupt+0xa8/0x1a0
>> [ 1558.151431]  local_apic_timer_interrupt+0x35/0x60
>> [ 1558.151432]  smp_apic_timer_interrupt+0x38/0x50
>> [ 1558.151433]  apic_timer_interrupt+0x93/0xa0
>> [ 1558.151435] RIP: 0010:panic+0x1fa/0x245
>> [ 1558.151435] RSP: 0018:ffff9d9377a85b30 EFLAGS: 00000246 ORIG_RAX: 
>> ffffffffffffff10
>> [ 1558.151436] RAX: 0000000000000030 RBX: ffff9d890c5d0000 RCX: 
>> 0000000000000006
>> [ 1558.151437] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
>> ffff9d9377a8e010
>> [ 1558.151438] RBP: ffff9d9377a85ba0 R08: 0000000000000030 R09: 
>> ffff9d9fa7029514
>> [ 1558.151438] R10: 0000000000000731 R11: 0000000000000000 R12: 
>> ffffffffb0e63f7f
>> [ 1558.151439] R13: 0000000000000000 R14: 0000000000000000 R15: 
>> 0000000000000000
>> [ 1558.151441]  nmi_panic+0x3f/0x40
>> [ 1558.151442]  watchdog_overflow_callback+0xce/0xf0
>> [ 1558.151444]  __perf_event_overflow+0x54/0xe0
>> [ 1558.151445]  perf_event_overflow+0x14/0x20
>> [ 1558.151446]  intel_pmu_handle_irq+0x203/0x4b0
>> [ 1558.151449]  perf_event_nmi_handler+0x2d/0x50
>> [ 1558.151450]  nmi_handle+0x61/0x110
>> [ 1558.151452]  default_do_nmi+0x44/0x120
>> [ 1558.151453]  do_nmi+0x113/0x190
>> [ 1558.151454]  end_repeat_nmi+0x1a/0x1e
>> [ 1558.151456] RIP: 0010:__blk_recalc_rq_segments+0xec/0x3d0
>> [ 1558.151456] RSP: 0018:ffff9d9377a83d08 EFLAGS: 00000046
>> [ 1558.151457] RAX: 0000000000001000 RBX: 0000000000001000 RCX: 
>> ffff9d91b8e9f500
>> [ 1558.151458] RDX: fffffcd56af20f00 RSI: ffff9d91b8e9f588 RDI: 
>> ffff9d91b8e9f488
>> [ 1558.151458] RBP: ffff9d9377a83d88 R08: 000000000003c000 R09: 
>> 0000000000000001
>> [ 1558.151459] R10: 0000000000000000 R11: 0000000000000000 R12: 
>> 0000000000000000
>> [ 1558.151459] R13: 0000000000000000 R14: 0000000000000000 R15: 
>> 0000000000000003
>> [ 1558.151461]  ? __blk_recalc_rq_segments+0xec/0x3d0
>> [ 1558.151463]  ? __blk_recalc_rq_segments+0xec/0x3d0
>> [ 1558.151463]  </NMI>
>> [ 1558.151464]  <IRQ>
>> [ 1558.151465]  ? mempool_free+0x2b/0x80
>> [ 1558.151467]  blk_recalc_rq_segments+0x28/0x40
>> [ 1558.151468]  blk_update_request+0x249/0x310
>> [ 1558.151472]  end_clone_bio+0x46/0x70 [dm_mod]
>> [ 1558.151473]  bio_endio+0xa1/0x120
>> [ 1558.151474]  blk_update_request+0xb7/0x310
>> [ 1558.151476]  scsi_end_request+0x38/0x1d0
>> [ 1558.151477]  scsi_io_completion+0x13c/0x630
>> [ 1558.151478]  scsi_finish_command+0xd9/0x120
>> [ 1558.151480]  scsi_softirq_done+0x12a/0x150
>> [ 1558.151481]  __blk_mq_complete_request_remote+0x13/0x20
>> [ 1558.151483]  flush_smp_call_function_queue+0x52/0x110
>> [ 1558.151484]  generic_smp_call_function_single_interrupt+0x13/0x30
>> [ 1558.151485]  smp_call_function_single_interrupt+0x27/0x40
>> [ 1558.151486]  call_function_single_interrupt+0x93/0xa0
>> [ 1558.151487] RIP: 0010:radix_tree_next_chunk+0xcb/0x2e0
>> [ 1558.151488] RSP: 0018:ffffb5cb913b3a68 EFLAGS: 00000203 ORIG_RAX: 
>> ffffffffffffff04
>> [ 1558.151489] RAX: 0000000000000001 RBX: 0000000000000010 RCX: 
>> ffff9d9e2c556938
>> [ 1558.151490] RDX: ffff9d93053c5919 RSI: 0000000000000001 RDI: 
>> ffff9d9e2c556910
>> [ 1558.151490] RBP: ffffb5cb913b3ab8 R08: ffffb5cb913b3bc0 R09: 
>> 0000000ab7d52000
>> [ 1558.151491] R10: fffffcd56adf5440 R11: 0000000000000000 R12: 
>> 0000000000000230
>> [ 1558.151492] R13: 0000000000000000 R14: 000000000006e0fe R15: 
>> ffff9d9e2c556910
>> [ 1558.151492]  </IRQ>
>> [ 1558.151494]  ? radix_tree_next_chunk+0x10b/0x2e0
>> [ 1558.151496]  find_get_pages_tag+0x149/0x270
>> [ 1558.151497]  ? block_write_full_page+0xcd/0xe0
>> [ 1558.151499]  pagevec_lookup_tag+0x21/0x30
>> [ 1558.151500]  write_cache_pages+0x14c/0x510
>> [ 1558.151502]  ? __wb_calc_thresh+0x140/0x140
>> [ 1558.151504]  generic_writepages+0x51/0x80
>> [ 1558.151505]  blkdev_writepages+0x2f/0x40
>> [ 1558.151506]  do_writepages+0x1c/0x70
>> [ 1558.151507]  __filemap_fdatawrite_range+0xc6/0x100
>> [ 1558.151509]  filemap_write_and_wait+0x3d/0x80
>> [ 1558.151510]  __sync_blockdev+0x1f/0x40
>> [ 1558.151511]  __blkdev_put+0x74/0x200
>> [ 1558.151512]  blkdev_put+0x4c/0xd0
>> [ 1558.151513]  blkdev_close+0x25/0x30
>> [ 1558.151514]  __fput+0xe7/0x210
>> [ 1558.151515]  ____fput+0xe/0x10
>> [ 1558.151517]  task_work_run+0x83/0xb0
>> [ 1558.151518]  exit_to_usermode_loop+0x66/0x98
>> [ 1558.151519]  do_syscall_64+0x13a/0x150
>> [ 1558.151521]  entry_SYSCALL64_slow_path+0x25/0x25
>> [ 1558.151522] RIP: 0033:0x7f3060b25bd0
>> [ 1558.151522] RSP: 002b:00007ffc5e6b12c8 EFLAGS: 00000246 ORIG_RAX: 
>> 0000000000000003
>> [ 1558.151523] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
>> 00007f3060b25bd0
>> [ 1558.151524] RDX: 0000000000400000 RSI: 0000000000000000 RDI: 
>> 0000000000000001
>> [ 1558.151524] RBP: 0000000000000320 R08: ffffffffffffffff R09: 
>> 0000000000402003
>> [ 1558.151525] R10: 00007ffc5e6b1040 R11: 0000000000000246 R12: 
>> 0000000000000320
>> [ 1558.151526] R13: 0000000000000000 R14: 00007ffc5e6b2618 R15: 
>> 00007ffc5e6b1540
>> [ 1558.151527] Code: dc 00 0f 92 c0 84 c0 74 14 48 8b 05 ff 9d a9 00 
>> be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 90 4d e3 b0 e8 
>> c7 db 09 00 <0f> ff 5d c3 66 66 66 66 90 55 48 89 e5 48 83 ec 20 65 48 
>> 8b 04
>> [ 1558.151540] ---[ end trace 408c28f8c132530c ]---
> 
> Replying to my own email:
> 
> Hello Bart
> 
> So with the stock 4.13-RC3 using [none] I get the failing clones but no 
> hard lockup.
> 
> [  526.677611] multipath_clone_and_map: 148 callbacks suppressed
> [  526.707429] device-mapper: multipath: blk_get_request() returned -11 
> - requeuing
> [  527.283432] device-mapper: multipath: blk_get_request() returned -11 
> - requeuing
> [  528.324566] device-mapper: multipath: blk_get_request() returned -11 
> - requeuing
> [  528.367118] device-mapper: multipath: blk_get_request() returned -11 
> - requeuing
> 
> Then on the same  kernel
> modprobe mq-deadline and running with [mq-deadline] on stock upstream I 
> DO not get the clone failures and I don't seem to get hard lockups.
> 
> I will reach out to Ming once I confirm going back to his kernel gets me 
> back into hard lockups again.
> I will also mention to him that we see the clone failure when using his 
> kernel with [mq-deadline]
> 
> Perhaps his enhancements are indeed contributing
> 
> Thanks
> Laurence
> 
> 
> 
> I also intend to test qla2xxx and mq-deadline shortly
> 
> 

Hello Ming

I am able to force the hard lockup with your kernel if I wait long enough 
after the blk_get_request() failure in the clone.

I will get a vmcore and do some analysis.

In summary, running two dd writes in parallel to two separate dm devices 
with buffered 4MB I/O sizes gets into the hard lockup after a while.

I guess it's good we focused more on the clone failure, because all of 
your tests were not against the dm device, just against the single device.

My tests are always to the dm device.
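
Roughly, the reproducer looks like the following (mpatha/mpathb and the 
count are placeholders for my actual multipath devices and run length):

    # two parallel buffered writes with 4MB I/O sizes to two separate dm devices
    dd if=/dev/zero of=/dev/mapper/mpatha bs=4M count=5000 &
    dd if=/dev/zero of=/dev/mapper/mpathb bs=4M count=5000 &
    wait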

I noticed the hard lockup this weekend but thought it was specific to 
the clone issue and not your patches; however, I can't create the hard 
lockup on stock upstream.

In both kernels I can create the clone failure, which seems to recover on 
the upstream stock kernel, but of course the requeue means we are throttled.

Thanks
Laurence

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
       [not found]   ` <CAFfF4qv3W6D-j8BSSZbwPLqhd_mmwk8CZQe7dSqud8cMMd2yPg@mail.gmail.com>
@ 2017-08-07 22:29     ` Bart Van Assche
  2017-08-07 23:17     ` Ming Lei
  2017-08-08 13:41     ` Ming Lei
  2 siblings, 0 replies; 84+ messages in thread
From: Bart Van Assche @ 2017-08-07 22:29 UTC (permalink / raw)
  To: hch, linux-block, axboe, loberman, ming.lei; +Cc: Bart Van Assche

On Mon, 2017-08-07 at 18:06 -0400, Laurence Oberman wrote:
> With [mq-deadline] enabled on stock I don't see them at all and it behaves.
> 
> Now with Ming's patches if we enable [mq-deadline] we DO see the clone
> failures and the hard lockup so we have opposite behaviour with the
> scheduler choice and we have the hard lockup.
> 
> On Ming's kernel with [none] we are well behaved and that was my original
> focus, testing on [none] and hence my Tested-by: pass.
> 
> So more investigation is needed here.

Hello Laurence,

Was debugfs enabled on your test setup? Had you perhaps collected the
contents of the block layer debugfs files after the lockup occurred, e.g. as
follows: (cd /sys/kernel/debug/block && find -type f | xargs grep -aH '')?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-07 17:29     ` Laurence Oberman
  2017-08-07 18:46       ` Laurence Oberman
@ 2017-08-07 23:04       ` Ming Lei
  1 sibling, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-07 23:04 UTC (permalink / raw)
  To: Laurence Oberman; +Cc: Bart Van Assche, hch, linux-block, axboe, dm-devel

On Mon, Aug 07, 2017 at 01:29:41PM -0400, Laurence Oberman wrote:
> 
> 
> On 08/07/2017 11:27 AM, Bart Van Assche wrote:
> > On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote:
> > > I tested this series using Ming's tests as well as my own set of tests
> > > typically run against changes to upstream code in my SRP test-bed.
> > > My tests also include very large sequential buffered and un-buffered I/O.
> > > 
> > > This series seems to be fine for me. I did uncover another issue that is
> > > unrelated to these patches and also exists in 4.13-RC3 generic that I am
> > > still debugging.
> > 
> > Hello Laurence,
> > 
> > What kind of tests did you run? Only functional tests or also performance
> > tests?
> > 
> > Has the issue you encountered with kernel 4.13-rc3 already been reported on
> > the linux-rdma mailing list?
> > 
> > Thanks,
> > 
> > Bart.
> > 
> 
> Hi Bart
> 
> Actually I was focusing on just performance to see if we had any regressions
> with Mings new patches for the large sequential I/O cases.
> 
> Ming had already tested the small I/O performance cases so I was making sure
> the large I/O sequential tests did not suffer.
> 
> The 4MB un-buffered direct read tests to DM devices seems to perform much
> the same in my test bed.
> The 4MB buffered and un-buffered 4MB writes also seem to be well behaved
> with not much improvement.

As I described, this patchset improves I/O scheduling, and especially
I/O merging for sequential I/O. BLK_DEF_MAX_SECTORS is defined as
2560 (1280K), so it is expected that this patchset can't help the 4MB
I/O case, because there is no merging left to do for 4MB I/O.

But the result is still positive, since there is no regression with the
patchset.
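
Just to spell out the arithmetic behind that (a quick check, not part of
the patchset):

    # per-request byte limit implied by BLK_DEF_MAX_SECTORS (512-byte sectors)
    echo $(( 2560 * 512 ))                        # 1310720 bytes = 1280K
    # a single 4MB I/O therefore already spans several max-size requests,
    # so there is nothing extra for the scheduler to merge
    echo $(( 4 * 1024 * 1024 / (2560 * 512) ))    # 3, plus a remainder -> 4 requests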

> 
> These were not exhaustive tests and did not include my usual port disconnect
> and recovery tests either.
> I was just making sure we did not regress with Ming's changes.
> 
> I was only just starting to baseline test the mq-deadline scheduler as prior
> to 4.13-RC3 I had not been testing any of the new MQ schedulers.
> I had always only tested with [none]
> 
> The tests were with [none] and [mq-deadline]
> 
> The new issue I started seeing was not yet reported yet as I was still
> investigating it.
> 
> In summary:
> With buffered writes we see the clone fail in blk_get_request in both Mings
> kernel and in the upstream 4.13-RC3 stock kernel
> 
> [  885.271451] io scheduler mq-deadline registered
> [  898.455442] device-mapper: multipath: blk_get_request() returned -11 -
> requeuing

-11 is -EAGAIN, and it isn't an error.

GFP_ATOMIC is passed to blk_get_request() in multipath_clone_and_map(),
so it isn't strange to see this failure, especially when there is a lot
of concurrent I/O.

> 
> This is due to
> 
> multipath_clone_and_map()
> 
> /*
>  * Map cloned requests (request-based multipath)
>  */
> static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>                                    union map_info *map_context,
>                                    struct request **__clone)
> {
> ..
> ..
> ..
>         clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
>         if (IS_ERR(clone)) {
>                 /* EBUSY, ENODEV or EWOULDBLOCK: requeue */
>                 bool queue_dying = blk_queue_dying(q);
>                 DMERR_LIMIT("blk_get_request() returned %ld%s - requeuing",
>                             PTR_ERR(clone), queue_dying ? " (path offline)"
> : "");
>                 if (queue_dying) {
>                         atomic_inc(&m->pg_init_in_progress);
>                         activate_or_offline_path(pgpath);
>                         return DM_MAPIO_REQUEUE;
>                 }
>                 return DM_MAPIO_DELAY_REQUEUE;
>         }
> 
> Still investigating but it leads to a hard lockup
> 
> 
> So I still need to see if the hard-lockup happens in the stock kernel with
> mq-deadline and some other work before coming up with a full summary of the
> issue.
> 
> I also intend to re-run all tests including disconnect and reconnect tests
> on both mq-deadline and none.
> 
> Trace below
> 
> 
> [ 1553.167357] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
> [ 1553.167359] Modules linked in: mq_deadline binfmt_misc dm_round_robin
> xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter
> ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set
> nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat
> nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma ip6table_mangle
> ip6table_security ip6table_raw iptable_nat ib_isert nf_conntrack_ipv4
> iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ib_iser
> libiscsi scsi_transport_iscsi iptable_mangle iptable_security iptable_raw
> ebtable_filter ebtables target_core_mod ip6table_filter ip6_tables
> iptable_filter ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs
> ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp
> kvm_intel kvm irqbypass crct10dif_pclmul
> [ 1553.167385]  crc32_pclmul ghash_clmulni_intel pcbc aesni_intel sg joydev
> ipmi_si hpilo crypto_simd hpwdt iTCO_wdt cryptd ipmi_devintf glue_helper
> gpio_ich iTCO_vendor_support shpchp ipmi_msghandler pcspkr acpi_power_meter
> i7core_edac lpc_ich pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc
> dm_multipath ip_tables xfs libcrc32c sd_mod amdkfd amd_iommu_v2 radeon
> i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
> ttm mlx5_core drm mlxfw i2c_core ptp serio_raw hpsa crc32c_intel bnx2
> pps_core devlink scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
> [last unloaded: ib_srpt]
> [ 1553.167410] CPU: 4 PID: 11532 Comm: dd Tainted: G          I
> 4.13.0-rc3lobeming.ming_V4+ #20
> [ 1553.167411] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
> [ 1553.167412] task: ffff9d9344b0d800 task.stack: ffffb5cb913b0000
> [ 1553.167421] RIP: 0010:__blk_recalc_rq_segments+0xec/0x3d0
> [ 1553.167421] RSP: 0018:ffff9d9377a83d08 EFLAGS: 00000046
> [ 1553.167422] RAX: 0000000000001000 RBX: 0000000000001000 RCX:
> ffff9d91b8e9f500
> [ 1553.167423] RDX: fffffcd56af20f00 RSI: ffff9d91b8e9f588 RDI:
> ffff9d91b8e9f488
> [ 1553.167424] RBP: ffff9d9377a83d88 R08: 000000000003c000 R09:
> 0000000000000001
> [ 1553.167424] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000000
> [ 1553.167425] R13: 0000000000000000 R14: 0000000000000000 R15:
> 0000000000000003
> [ 1553.167426] FS:  00007f3061002740(0000) GS:ffff9d9377a80000(0000)
> knlGS:0000000000000000
> [ 1553.167427] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1553.167428] CR2: 00007f305a511000 CR3: 0000000b1bfb2000 CR4:
> 00000000000006e0
> [ 1553.167429] Call Trace:
> [ 1553.167432]  <IRQ>
> [ 1553.167437]  ? mempool_free+0x2b/0x80
> [ 1553.167440]  blk_recalc_rq_segments+0x28/0x40
> [ 1553.167442]  blk_update_request+0x249/0x310
> [ 1553.167450]  end_clone_bio+0x46/0x70 [dm_mod]
> [ 1553.167453]  bio_endio+0xa1/0x120
> [ 1553.167454]  blk_update_request+0xb7/0x310
> [ 1553.167457]  scsi_end_request+0x38/0x1d0
> [ 1553.167458]  scsi_io_completion+0x13c/0x630
> [ 1553.167460]  scsi_finish_command+0xd9/0x120
> [ 1553.167462]  scsi_softirq_done+0x12a/0x150
> [ 1553.167463]  __blk_mq_complete_request_remote+0x13/0x20
> [ 1553.167466]  flush_smp_call_function_queue+0x52/0x110
> [ 1553.167468]  generic_smp_call_function_single_interrupt+0x13/0x30
> [ 1553.167470]  smp_call_function_single_interrupt+0x27/0x40
> [ 1553.167471]  call_function_single_interrupt+0x93/0xa0
> [ 1553.167473] RIP: 0010:radix_tree_next_chunk+0xcb/0x2e0
> [ 1553.167474] RSP: 0018:ffffb5cb913b3a68 EFLAGS: 00000203 ORIG_RAX:
> ffffffffffffff04
> [ 1553.167475] RAX: 0000000000000001 RBX: 0000000000000010 RCX:
> ffff9d9e2c556938
> [ 1553.167476] RDX: ffff9d93053c5919 RSI: 0000000000000001 RDI:
> ffff9d9e2c556910
> [ 1553.167476] RBP: ffffb5cb913b3ab8 R08: ffffb5cb913b3bc0 R09:
> 0000000ab7d52000
> [ 1553.167477] R10: fffffcd56adf5440 R11: 0000000000000000 R12:
> 0000000000000230
> [ 1553.167477] R13: 0000000000000000 R14: 000000000006e0fe R15:
> ffff9d9e2c556910

This one looks like a real issue, and I am not sure it is related
to the request allocation failure. It looks as if there is an
endless loop in blk_recalc_rq_segments()?
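
To make the suspicion concrete, here is a deliberately simplified, toy version of
the kind of walk the segment recount does over a request's bios. This is only an
illustration of the loop shape (the real __blk_recalc_rq_segments() differs in
detail), but it shows how a bio whose byte count never drains, e.g. because of a
zero-length segment, would spin forever in exactly this sort of loop:

/* Toy types standing in for struct bio / struct bio_vec; illustration only. */
struct toy_bvec {
	unsigned int bv_len;		/* bytes in this segment */
};

struct toy_bio {
	struct toy_bio *bi_next;	/* next bio in the request */
	unsigned int bi_size;		/* bytes left to account for */
	unsigned int bi_vcnt;
	struct toy_bvec *bi_io_vec;
};

static unsigned int toy_recalc_segments(struct toy_bio *bio)
{
	unsigned int nr_segs = 0;

	for (; bio; bio = bio->bi_next) {
		unsigned int left = bio->bi_size;
		unsigned int idx = 0;

		/* Terminates only when 'left' drains to zero; a segment that
		 * reports bv_len == 0 while bytes are still outstanding never
		 * makes progress here, which is the kind of endless loop the
		 * NMI watchdog above would flag. */
		while (left) {
			struct toy_bvec *bv = &bio->bi_io_vec[idx++ % bio->bi_vcnt];
			unsigned int step = bv->bv_len < left ? bv->bv_len : left;

			nr_segs++;
			left -= step;
		}
	}
	return nr_segs;
}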



-- 
Ming

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
       [not found]   ` <CAFfF4qv3W6D-j8BSSZbwPLqhd_mmwk8CZQe7dSqud8cMMd2yPg@mail.gmail.com>
  2017-08-07 22:29     ` Bart Van Assche
@ 2017-08-07 23:17     ` Ming Lei
  2017-08-08 13:41     ` Ming Lei
  2 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-07 23:17 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche

On Mon, Aug 07, 2017 at 06:06:11PM -0400, Laurence Oberman wrote:
> On Mon, Aug 7, 2017 at 8:48 AM, Laurence Oberman <loberman@redhat.com>
> wrote:
> 
> >
> >
> > On 08/05/2017 02:56 AM, Ming Lei wrote:
> >
> >> In Red Hat internal storage test wrt. blk-mq scheduler, we
> >> found that I/O performance is much bad with mq-deadline, especially
> >> about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
> >> SRP...)
> >>
> >> Turns out one big issue causes the performance regression: requests
> >> are still dequeued from sw queue/scheduler queue even when ldd's
> >> queue is busy, so I/O merge becomes quite difficult to make, then
> >> sequential IO degrades a lot.
> >>
> >> The 1st five patches improve this situation, and brings back
> >> some performance loss.
> >>
> >> But looks they are still not enough. It is caused by
> >> the shared queue depth among all hw queues. For SCSI devices,
> >> .cmd_per_lun defines the max number of pending I/O on one
> >> request queue, which is per-request_queue depth. So during
> >> dispatch, if one hctx is too busy to move on, all hctxs can't
> >> dispatch too because of the per-request_queue depth.
> >>
> >> Patch 6 ~ 14 use per-request_queue dispatch list to avoid
> >> to dequeue requests from sw/scheduler queue when lld queue
> >> is busy.
> >>
> >> Patch 15 ~20 improve bio merge via hash table in sw queue,
> >> which makes bio merge more efficient than current approch
> >> in which only the last 8 requests are checked. Since patch
> >> 6~14 converts to the scheduler way of dequeuing one request
> >> from sw queue one time for SCSI device, and the times of
> >> acquring ctx->lock is increased, and merging bio via hash
> >> table decreases holding time of ctx->lock and should eliminate
> >> effect from patch 14.
> >>
> >> With this changes, SCSI-MQ sequential I/O performance is
> >> improved much, for lpfc, it is basically brought back
> >> compared with block legacy path[1], especially mq-deadline
> >> is improved by > X10 [1] on lpfc and by > 3X on SCSI SRP,
> >> For mq-none it is improved by 10% on lpfc, and write is
> >> improved by > 10% on SRP too.
> >>
> >> Also Bart worried that this patchset may affect SRP, so provide
> >> test data on SCSI SRP this time:
> >>
> >> - fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs)
> >> - system(16 cores, dual sockets, mem: 96G)
> >>
> >>                |v4.13-rc3     |v4.13-rc3     | v4.13-rc3+patches |
> >>                |blk-legacy dd |blk-mq none   | blk-mq none  |
> >> -----------------------------------------------------------|
> >> read     :iops|         587K |         526K |         537K |
> >> randread :iops|         115K |         140K |         139K |
> >> write    :iops|         596K |         519K |         602K |
> >> randwrite:iops|         103K |         122K |         120K |
> >>
> >>
> >>                |v4.13-rc3     |v4.13-rc3     | v4.13-rc3+patches
> >>                |blk-legacy dd |blk-mq dd     | blk-mq dd    |
> >> ------------------------------------------------------------
> >> read     :iops|         587K |         155K |         522K |
> >> randread :iops|         115K |         140K |         141K |
> >> write    :iops|         596K |         135K |         587K |
> >> randwrite:iops|         103K |         120K |         118K |
> >>
> >> V2:
> >>         - dequeue request from sw queues in round roubin's style
> >>         as suggested by Bart, and introduces one helper in sbitmap
> >>         for this purpose
> >>         - improve bio merge via hash table from sw queue
> >>         - add comments about using DISPATCH_BUSY state in lockless way,
> >>         simplifying handling on busy state,
> >>         - hold ctx->lock when clearing ctx busy bit as suggested
> >>         by Bart
> >>
> >>
> >> [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
> >>
> >> Ming Lei (20):
> >>    blk-mq-sched: fix scheduler bad performance
> >>    sbitmap: introduce __sbitmap_for_each_set()
> >>    blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
> >>    blk-mq-sched: move actual dispatching into one helper
> >>    blk-mq-sched: improve dispatching from sw queue
> >>    blk-mq-sched: don't dequeue request until all in ->dispatch are
> >>      flushed
> >>    blk-mq-sched: introduce blk_mq_sched_queue_depth()
> >>    blk-mq-sched: use q->queue_depth as hint for q->nr_requests
> >>    blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
> >>    blk-mq-sched: introduce helpers for query, change busy state
> >>    blk-mq: introduce helpers for operating ->dispatch list
> >>    blk-mq: introduce pointers to dispatch lock & list
> >>    blk-mq: pass 'request_queue *' to several helpers of operating BUSY
> >>    blk-mq-sched: improve IO scheduling on SCSI devcie
> >>    block: introduce rqhash helpers
> >>    block: move actual bio merge code into __elv_merge
> >>    block: add check on elevator for supporting bio merge via hashtable
> >>      from blk-mq sw queue
> >>    block: introduce .last_merge and .hash to blk_mq_ctx
> >>    blk-mq-sched: refactor blk_mq_sched_try_merge()
> >>    blk-mq: improve bio merge from blk-mq sw queue
> >>
> >>   block/blk-mq-debugfs.c  |  12 ++--
> >>   block/blk-mq-sched.c    | 187 +++++++++++++++++++++++++++++-
> >> ------------------
> >>   block/blk-mq-sched.h    |  23 ++++++
> >>   block/blk-mq.c          | 133 +++++++++++++++++++++++++++++++---
> >>   block/blk-mq.h          |  73 +++++++++++++++++++
> >>   block/blk-settings.c    |   2 +
> >>   block/blk.h             |  55 ++++++++++++++
> >>   block/elevator.c        |  93 ++++++++++++++----------
> >>   include/linux/blk-mq.h  |   5 ++
> >>   include/linux/blkdev.h  |   5 ++
> >>   include/linux/sbitmap.h |  54 ++++++++++----
> >>   11 files changed, 504 insertions(+), 138 deletions(-)
> >>
> >>
> > Hello
> >
> > I tested this series using Ming's tests as well as my own set of tests
> > typically run against changes to upstream code in my SRP test-bed.
> > My tests also include very large sequential buffered and un-buffered I/O.
> >
> > This series seems to be fine for me. I did uncover another issue that is
> > unrelated to these patches and also exists in 4.13-RC3 generic that I am
> > still debugging.
> >
> > For what it's worth:
> > Tested-by: Laurence Oberman <loberman@redhat.com>
> >
> >
> Hello
> 
> I need to retract my Tested-by:
> 
> While it's valid that the patches do not introduce performance regressions,
> they seem to cause a hard lockup when the [mq-deadline] scheduler is
> enabled so I am not confident with a passing result here.
> 
> This is specific to large buffered I/O writes (4MB). At least that is my
> current test.
> 
> I did not wait long enough for the issue to show when I first sent the pass
> (Tested-by) message because I know my test platform so well I thought I had
> given it enough time to validate the patches for performance regressions.
> 
> I don't know if the failing clone in blk_get_request() is a direct
> catalyst for the hard lockup, but what I do know is that with the stock
> upstream 4.13-RC3 I only see them when I am set to [none], and stock
> upstream never seems to see the hard lockup.

That might be triggered when the IOPS is much higher.

> 
> With [mq-deadline] enabled on stock I don't see them at all and it behaves.
> 
> Now with Ming's patches if we enable [mq-deadline] we DO see the clone
> failures

That should be easy to explain: this patchset improves IOPS by > 3X
on SRP in the mq-deadline case, so request allocation (GFP_ATOMIC)
failures can be triggered much more easily.

> and the hard lockup, so we have opposite behaviour with the
> scheduler choice and we have the hard lockup.

IMO, it is hard to say that the scheduler choice is related to this hard
lockup, since the I/O scheduler only works in the I/O submission path,
and you can still trigger the issue with mq-none whether or not this
patchset is applied.

I still have some time today; if you can provide me with exact
reproduction steps, I am happy to take a look at it.

> On Ming's kernel with [none] we are well behaved and that was my original
> focus, testing on [none] and hence my Tested-by: pass.

OK, that is still positive feedback, thanks for your testing!


--
Ming

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (20 preceding siblings ...)
  2017-08-07 12:48 ` [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Laurence Oberman
@ 2017-08-08  8:09 ` Paolo Valente
  2017-08-08  9:09   ` Ming Lei
  2017-08-11  8:11 ` Christoph Hellwig
  2017-08-23 16:12 ` Bart Van Assche
  23 siblings, 1 reply; 84+ messages in thread
From: Paolo Valente @ 2017-08-08  8:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman


> On 05 Aug 2017, at 08:56, Ming Lei <ming.lei@redhat.com> wrote:
>=20
> In Red Hat internal storage test wrt. blk-mq scheduler, we
> found that I/O performance is much bad with mq-deadline, especially
> about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
> SRP...)
>=20
> Turns out one big issue causes the performance regression: requests
> are still dequeued from sw queue/scheduler queue even when ldd's
> queue is busy, so I/O merge becomes quite difficult to make, then
> sequential IO degrades a lot.
>=20
> The 1st five patches improve this situation, and brings back
> some performance loss.
>=20
> But looks they are still not enough. It is caused by
> the shared queue depth among all hw queues. For SCSI devices,
> .cmd_per_lun defines the max number of pending I/O on one
> request queue, which is per-request_queue depth. So during
> dispatch, if one hctx is too busy to move on, all hctxs can't
> dispatch too because of the per-request_queue depth.
>=20
> Patch 6 ~ 14 use per-request_queue dispatch list to avoid
> to dequeue requests from sw/scheduler queue when lld queue
> is busy.
>=20
> Patch 15 ~20 improve bio merge via hash table in sw queue,
> which makes bio merge more efficient than current approch
> in which only the last 8 requests are checked. Since patch
> 6~14 converts to the scheduler way of dequeuing one request
> from sw queue one time for SCSI device, and the times of
> acquring ctx->lock is increased, and merging bio via hash
> table decreases holding time of ctx->lock and should eliminate
> effect from patch 14.=20
>=20
> With this changes, SCSI-MQ sequential I/O performance is
> improved much, for lpfc, it is basically brought back
> compared with block legacy path[1], especially mq-deadline
> is improved by > X10 [1] on lpfc and by > 3X on SCSI SRP,
> For mq-none it is improved by 10% on lpfc, and write is
> improved by > 10% on SRP too.
>=20
> Also Bart worried that this patchset may affect SRP, so provide
> test data on SCSI SRP this time:
>=20
> - fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs)
> - system(16 cores, dual sockets, mem: 96G)
>=20
>              |v4.13-rc3     |v4.13-rc3     | v4.13-rc3+patches |
>              |blk-legacy dd |blk-mq none   | blk-mq none  |
> -----------------------------------------------------------| =20
> read     :iops|         587K |         526K |         537K |
> randread :iops|         115K |         140K |         139K |
> write    :iops|         596K |         519K |         602K |
> randwrite:iops|         103K |         122K |         120K |
>=20
>=20
>              |v4.13-rc3     |v4.13-rc3     | v4.13-rc3+patches
>              |blk-legacy dd |blk-mq dd     | blk-mq dd    |
> ------------------------------------------------------------
> read     :iops|         587K |         155K |         522K |
> randread :iops|         115K |         140K |         141K |
> write    :iops|         596K |         135K |         587K |
> randwrite:iops|         103K |         120K |         118K |
>=20
> V2:
> 	- dequeue request from sw queues in round roubin's style
> 	as suggested by Bart, and introduces one helper in sbitmap
> 	for this purpose
> 	- improve bio merge via hash table from sw queue
> 	- add comments about using DISPATCH_BUSY state in lockless way,
> 	simplifying handling on busy state,
> 	- hold ctx->lock when clearing ctx busy bit as suggested
> 	by Bart
>=20
>=20

Hi,
I've performance-tested Ming's patchset with the dbench4 test in
MMTests, and with the mq-deadline and bfq schedulers.  Max latencies
have decreased dramatically: up to 32 times.  Very good results for
average latencies as well.

For brevity, here are only results for deadline.  You can find full
results with bfq in the thread that triggered my testing of Ming's
patches [1].

MQ-DEADLINE WITHOUT MING'S PATCHES

 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                    13760    90.542 13221.495
 Close                   137654     0.008    27.133
 LockX                      640     0.009     0.115
 Rename                    8064     1.062   246.759
 ReadX                   297956     0.051   347.018
 WriteX                   94698   425.636 15090.020
 Unlink                   35077     0.580   208.462
 UnlockX                    640     0.007     0.291
 FIND_FIRST               66630     0.566   530.339
 SET_FILE_INFORMATION     16000     1.419   811.494
 QUERY_FILE_INFORMATION   30717     0.004     1.108
 QUERY_PATH_INFORMATION  176153     0.182   517.419
 QUERY_FS_INFORMATION     30857     0.018    18.562
 NTCreateX               184145     0.281   582.076

Throughput 8.93961 MB/sec  64 clients  64 procs  max_latency=15090.026 ms

MQ-DEADLINE WITH MING'S PATCHES

 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                    13760    48.650   431.525
 Close                   144320     0.004     7.605
 LockX                      640     0.005     0.019
 Rename                    8320     0.187     5.702
 ReadX                   309248     0.023   216.220
 WriteX                   97176   338.961  5464.995
 Unlink                   39744     0.454   315.207
 UnlockX                    640     0.004     0.027
 FIND_FIRST               69184     0.042    17.648
 SET_FILE_INFORMATION     16128     0.113   134.464
 QUERY_FILE_INFORMATION   31104     0.004     0.370
 QUERY_PATH_INFORMATION  187136     0.031   168.554
 QUERY_FS_INFORMATION     33024     0.009     2.915
 NTCreateX               196672     0.152   163.835

Thanks,
Paolo

[1] https://lkml.org/lkml/2017/8/3/157

> [1] http://marc.info/?l=3Dlinux-block&m=3D150151989915776&w=3D2
>=20
> Ming Lei (20):
>  blk-mq-sched: fix scheduler bad performance
>  sbitmap: introduce __sbitmap_for_each_set()
>  blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
>  blk-mq-sched: move actual dispatching into one helper
>  blk-mq-sched: improve dispatching from sw queue
>  blk-mq-sched: don't dequeue request until all in ->dispatch are
>    flushed
>  blk-mq-sched: introduce blk_mq_sched_queue_depth()
>  blk-mq-sched: use q->queue_depth as hint for q->nr_requests
>  blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
>  blk-mq-sched: introduce helpers for query, change busy state
>  blk-mq: introduce helpers for operating ->dispatch list
>  blk-mq: introduce pointers to dispatch lock & list
>  blk-mq: pass 'request_queue *' to several helpers of operating BUSY
>  blk-mq-sched: improve IO scheduling on SCSI devcie
>  block: introduce rqhash helpers
>  block: move actual bio merge code into __elv_merge
>  block: add check on elevator for supporting bio merge via hashtable
>    from blk-mq sw queue
>  block: introduce .last_merge and .hash to blk_mq_ctx
>  blk-mq-sched: refactor blk_mq_sched_try_merge()
>  blk-mq: improve bio merge from blk-mq sw queue
>=20
> block/blk-mq-debugfs.c  |  12 ++--
> block/blk-mq-sched.c    | 187 =
+++++++++++++++++++++++++++++-------------------
> block/blk-mq-sched.h    |  23 ++++++
> block/blk-mq.c          | 133 +++++++++++++++++++++++++++++++---
> block/blk-mq.h          |  73 +++++++++++++++++++
> block/blk-settings.c    |   2 +
> block/blk.h             |  55 ++++++++++++++
> block/elevator.c        |  93 ++++++++++++++----------
> include/linux/blk-mq.h  |   5 ++
> include/linux/blkdev.h  |   5 ++
> include/linux/sbitmap.h |  54 ++++++++++----
> 11 files changed, 504 insertions(+), 138 deletions(-)
>=20
> --=20
> 2.9.4
>=20

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-08  8:09 ` Paolo Valente
@ 2017-08-08  9:09   ` Ming Lei
  2017-08-08  9:13     ` Paolo Valente
  0 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-08  9:09 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman

On Tue, Aug 08, 2017 at 10:09:57AM +0200, Paolo Valente wrote:
> 
> > On 05 Aug 2017, at 08:56, Ming Lei <ming.lei@redhat.com> wrote:
> > 
> > In Red Hat internal storage test wrt. blk-mq scheduler, we
> > found that I/O performance is much bad with mq-deadline, especially
> > about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
> > SRP...)
> > 
> > Turns out one big issue causes the performance regression: requests
> > are still dequeued from sw queue/scheduler queue even when ldd's
> > queue is busy, so I/O merge becomes quite difficult to make, then
> > sequential IO degrades a lot.
> > 
> > The 1st five patches improve this situation, and brings back
> > some performance loss.
> > 
> > But looks they are still not enough. It is caused by
> > the shared queue depth among all hw queues. For SCSI devices,
> > .cmd_per_lun defines the max number of pending I/O on one
> > request queue, which is per-request_queue depth. So during
> > dispatch, if one hctx is too busy to move on, all hctxs can't
> > dispatch too because of the per-request_queue depth.
> > 
> > Patch 6 ~ 14 use per-request_queue dispatch list to avoid
> > to dequeue requests from sw/scheduler queue when lld queue
> > is busy.
> > 
> > Patch 15 ~20 improve bio merge via hash table in sw queue,
> > which makes bio merge more efficient than current approch
> > in which only the last 8 requests are checked. Since patch
> > 6~14 converts to the scheduler way of dequeuing one request
> > from sw queue one time for SCSI device, and the times of
> > acquring ctx->lock is increased, and merging bio via hash
> > table decreases holding time of ctx->lock and should eliminate
> > effect from patch 14. 
> > 
> > With this changes, SCSI-MQ sequential I/O performance is
> > improved much, for lpfc, it is basically brought back
> > compared with block legacy path[1], especially mq-deadline
> > is improved by > X10 [1] on lpfc and by > 3X on SCSI SRP,
> > For mq-none it is improved by 10% on lpfc, and write is
> > improved by > 10% on SRP too.
> > 
> > Also Bart worried that this patchset may affect SRP, so provide
> > test data on SCSI SRP this time:
> > 
> > - fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs)
> > - system(16 cores, dual sockets, mem: 96G)
> > 
> >              |v4.13-rc3     |v4.13-rc3     | v4.13-rc3+patches |
> >              |blk-legacy dd |blk-mq none   | blk-mq none  |
> > -----------------------------------------------------------|  
> > read     :iops|         587K |         526K |         537K |
> > randread :iops|         115K |         140K |         139K |
> > write    :iops|         596K |         519K |         602K |
> > randwrite:iops|         103K |         122K |         120K |
> > 
> > 
> >              |v4.13-rc3     |v4.13-rc3     | v4.13-rc3+patches
> >              |blk-legacy dd |blk-mq dd     | blk-mq dd    |
> > ------------------------------------------------------------
> > read     :iops|         587K |         155K |         522K |
> > randread :iops|         115K |         140K |         141K |
> > write    :iops|         596K |         135K |         587K |
> > randwrite:iops|         103K |         120K |         118K |
> > 
> > V2:
> > 	- dequeue request from sw queues in round roubin's style
> > 	as suggested by Bart, and introduces one helper in sbitmap
> > 	for this purpose
> > 	- improve bio merge via hash table from sw queue
> > 	- add comments about using DISPATCH_BUSY state in lockless way,
> > 	simplifying handling on busy state,
> > 	- hold ctx->lock when clearing ctx busy bit as suggested
> > 	by Bart
> > 
> > 
> 
> Hi,
> I've performance-tested Ming's patchset with the dbench4 test in
> MMTests, and with the mq-deadline and bfq schedulers.  Max latencies,
> have decreased dramatically: up to 32 times.  Very good results for
> average latencies as well.
> 
> For brevity, here are only results for deadline.  You can find full
> results with bfq in the thread that triggered my testing of Ming's
> patches [1].
> 
> MQ-DEADLINE WITHOUT MING'S PATCHES
> 
>  Operation                Count    AvgLat    MaxLat
>  --------------------------------------------------
>  Flush                    13760    90.542 13221.495
>  Close                   137654     0.008    27.133
>  LockX                      640     0.009     0.115
>  Rename                    8064     1.062   246.759
>  ReadX                   297956     0.051   347.018
>  WriteX                   94698   425.636 15090.020
>  Unlink                   35077     0.580   208.462
>  UnlockX                    640     0.007     0.291
>  FIND_FIRST               66630     0.566   530.339
>  SET_FILE_INFORMATION     16000     1.419   811.494
>  QUERY_FILE_INFORMATION   30717     0.004     1.108
>  QUERY_PATH_INFORMATION  176153     0.182   517.419
>  QUERY_FS_INFORMATION     30857     0.018    18.562
>  NTCreateX               184145     0.281   582.076
> 
> Throughput 8.93961 MB/sec  64 clients  64 procs  max_latency=15090.026 ms
> 
> MQ-DEADLINE WITH MING'S PATCHES
> 
>  Operation                Count    AvgLat    MaxLat
>  --------------------------------------------------
>  Flush                    13760    48.650   431.525
>  Close                   144320     0.004     7.605
>  LockX                      640     0.005     0.019
>  Rename                    8320     0.187     5.702
>  ReadX                   309248     0.023   216.220
>  WriteX                   97176   338.961  5464.995
>  Unlink                   39744     0.454   315.207
>  UnlockX                    640     0.004     0.027
>  FIND_FIRST               69184     0.042    17.648
>  SET_FILE_INFORMATION     16128     0.113   134.464
>  QUERY_FILE_INFORMATION   31104     0.004     0.370
>  QUERY_PATH_INFORMATION  187136     0.031   168.554
>  QUERY_FS_INFORMATION     33024     0.009     2.915
>  NTCreateX               196672     0.152   163.835

Hi Paolo,

Thanks very much for testing this patchset!

BTW, could you share with us which kind of disk you are using
in this test?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-08  9:09   ` Ming Lei
@ 2017-08-08  9:13     ` Paolo Valente
  0 siblings, 0 replies; 84+ messages in thread
From: Paolo Valente @ 2017-08-08  9:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman


> On 08 Aug 2017, at 11:09, Ming Lei <ming.lei@redhat.com> wrote:
>=20
> On Tue, Aug 08, 2017 at 10:09:57AM +0200, Paolo Valente wrote:
>>=20
>>> On 05 Aug 2017, at 08:56, Ming Lei <ming.lei@redhat.com> wrote:
>>>=20
>>> In Red Hat internal storage test wrt. blk-mq scheduler, we
>>> found that I/O performance is much bad with mq-deadline, especially
>>> about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
>>> SRP...)
>>>=20
>>> Turns out one big issue causes the performance regression: requests
>>> are still dequeued from sw queue/scheduler queue even when ldd's
>>> queue is busy, so I/O merge becomes quite difficult to make, then
>>> sequential IO degrades a lot.
>>>=20
>>> The 1st five patches improve this situation, and brings back
>>> some performance loss.
>>>=20
>>> But looks they are still not enough. It is caused by
>>> the shared queue depth among all hw queues. For SCSI devices,
>>> .cmd_per_lun defines the max number of pending I/O on one
>>> request queue, which is per-request_queue depth. So during
>>> dispatch, if one hctx is too busy to move on, all hctxs can't
>>> dispatch too because of the per-request_queue depth.
>>>=20
>>> Patch 6 ~ 14 use per-request_queue dispatch list to avoid
>>> to dequeue requests from sw/scheduler queue when lld queue
>>> is busy.
>>>=20
>>> Patch 15 ~20 improve bio merge via hash table in sw queue,
>>> which makes bio merge more efficient than current approch
>>> in which only the last 8 requests are checked. Since patch
>>> 6~14 converts to the scheduler way of dequeuing one request
>>> from sw queue one time for SCSI device, and the times of
>>> acquring ctx->lock is increased, and merging bio via hash
>>> table decreases holding time of ctx->lock and should eliminate
>>> effect from patch 14.=20
>>>=20
>>> With this changes, SCSI-MQ sequential I/O performance is
>>> improved much, for lpfc, it is basically brought back
>>> compared with block legacy path[1], especially mq-deadline
>>> is improved by > X10 [1] on lpfc and by > 3X on SCSI SRP,
>>> For mq-none it is improved by 10% on lpfc, and write is
>>> improved by > 10% on SRP too.
>>>=20
>>> Also Bart worried that this patchset may affect SRP, so provide
>>> test data on SCSI SRP this time:
>>>=20
>>> - fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs)
>>> - system(16 cores, dual sockets, mem: 96G)
>>>=20
>>>             |v4.13-rc3     |v4.13-rc3     | v4.13-rc3+patches |
>>>             |blk-legacy dd |blk-mq none   | blk-mq none  |
>>> -----------------------------------------------------------| =20
>>> read     :iops|         587K |         526K |         537K |
>>> randread :iops|         115K |         140K |         139K |
>>> write    :iops|         596K |         519K |         602K |
>>> randwrite:iops|         103K |         122K |         120K |
>>>=20
>>>=20
>>>             |v4.13-rc3     |v4.13-rc3     | v4.13-rc3+patches
>>>             |blk-legacy dd |blk-mq dd     | blk-mq dd    |
>>> ------------------------------------------------------------
>>> read     :iops|         587K |         155K |         522K |
>>> randread :iops|         115K |         140K |         141K |
>>> write    :iops|         596K |         135K |         587K |
>>> randwrite:iops|         103K |         120K |         118K |
>>>=20
>>> V2:
>>> 	- dequeue request from sw queues in round roubin's style
>>> 	as suggested by Bart, and introduces one helper in sbitmap
>>> 	for this purpose
>>> 	- improve bio merge via hash table from sw queue
>>> 	- add comments about using DISPATCH_BUSY state in lockless way,
>>> 	simplifying handling on busy state,
>>> 	- hold ctx->lock when clearing ctx busy bit as suggested
>>> 	by Bart
>>>=20
>>>=20
>>=20
>> Hi,
>> I've performance-tested Ming's patchset with the dbench4 test in
>> MMTests, and with the mq-deadline and bfq schedulers.  Max latencies,
>> have decreased dramatically: up to 32 times.  Very good results for
>> average latencies as well.
>>=20
>> For brevity, here are only results for deadline.  You can find full
>> results with bfq in the thread that triggered my testing of Ming's
>> patches [1].
>>=20
>> MQ-DEADLINE WITHOUT MING'S PATCHES
>>=20
>> Operation                Count    AvgLat    MaxLat
>> --------------------------------------------------
>> Flush                    13760    90.542 13221.495
>> Close                   137654     0.008    27.133
>> LockX                      640     0.009     0.115
>> Rename                    8064     1.062   246.759
>> ReadX                   297956     0.051   347.018
>> WriteX                   94698   425.636 15090.020
>> Unlink                   35077     0.580   208.462
>> UnlockX                    640     0.007     0.291
>> FIND_FIRST               66630     0.566   530.339
>> SET_FILE_INFORMATION     16000     1.419   811.494
>> QUERY_FILE_INFORMATION   30717     0.004     1.108
>> QUERY_PATH_INFORMATION  176153     0.182   517.419
>> QUERY_FS_INFORMATION     30857     0.018    18.562
>> NTCreateX               184145     0.281   582.076
>>=20
>> Throughput 8.93961 MB/sec  64 clients  64 procs  max_latency=15090.026 ms
>>=20
>> MQ-DEADLINE WITH MING'S PATCHES
>>=20
>> Operation                Count    AvgLat    MaxLat
>> --------------------------------------------------
>> Flush                    13760    48.650   431.525
>> Close                   144320     0.004     7.605
>> LockX                      640     0.005     0.019
>> Rename                    8320     0.187     5.702
>> ReadX                   309248     0.023   216.220
>> WriteX                   97176   338.961  5464.995
>> Unlink                   39744     0.454   315.207
>> UnlockX                    640     0.004     0.027
>> FIND_FIRST               69184     0.042    17.648
>> SET_FILE_INFORMATION     16128     0.113   134.464
>> QUERY_FILE_INFORMATION   31104     0.004     0.370
>> QUERY_PATH_INFORMATION  187136     0.031   168.554
>> QUERY_FS_INFORMATION     33024     0.009     2.915
>> NTCreateX               196672     0.152   163.835
>=20
> Hi Paolo,
>=20
> Thanks very much for testing this patchset!
>=20
> BTW, could you share us which kind of disk you are using
> in this test?
>=20

Absolutely:

ATA device, with non-removable media
	Model Number:       HITACHI HTS727550A9E364
	Serial Number:      J3370082G622JD
	Firmware Revision:  JF3ZD0H0
	Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6; Revision: ATA8-AST T13 Project D1697 Revision 0b

Thanks,
Paolo

> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
       [not found]   ` <CAFfF4qv3W6D-j8BSSZbwPLqhd_mmwk8CZQe7dSqud8cMMd2yPg@mail.gmail.com>
  2017-08-07 22:29     ` Bart Van Assche
  2017-08-07 23:17     ` Ming Lei
@ 2017-08-08 13:41     ` Ming Lei
  2017-08-08 13:58       ` Laurence Oberman
  2 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-08 13:41 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche

Hi Laurence and Guys, 

On Mon, Aug 07, 2017 at 06:06:11PM -0400, Laurence Oberman wrote:
> On Mon, Aug 7, 2017 at 8:48 AM, Laurence Oberman <loberman@redhat.com>
> wrote:
> Hello
> 
> I need to retract my Tested-by:
> 
> While its valid that the patches do not introduce performance regressions,
> they seem to cause a hard lockup when the [mq-deadline] scheduler is
> enabled so I am not confident with a passing result here.
> 
> This is specific to large buffered I/O writes (4MB) At least that is my
> current test.
> 
> I did not wait long enough for the issue to show when I first sent the pass
> (Tested-by) message because I know my test platform so well I thought I had
> given it enough time to validate the patches for performance regressions.
> 
> I dont know if the failing clone in blk_get_request() is a direct a
> catalyst for the hard lockup but what I do know is with the stock upstream
> 4.13-RC3 I only see them when I am set to [none] and stock upstream never
> seems to see the hard lockup.
> 
> With [mq-deadline] enabled on stock I dont see them at all and it behaves.
> 
> Now with Ming's patches if we enable [mq-deadline] we DO see the clone
> failures and the hard lockup so we have opposit behaviour with the
> scheduler choice and we have the hard lockup.
> 
> On Ming's kernel with [none] we are well behaved and that was my original
> focus, testing on [none] and hence my Tested-by: pass.
> 
> So more investigation is needed here.

Laurence, as we discussed on IRC, the hard lockup issue you saw isn't
related to this patchset, because the issue can be reproduced on both
v4.13-rc3 and RHEL 7. The only trick is to run your hammer write script
concurrently in 16 jobs; then it takes only several minutes to trigger,
no matter whether the mq none or mq-deadline scheduler is used.

Given that it is easy to reproduce, I believe it shouldn't be very
difficult to investigate and root-cause.

I will report the issue on another thread, and attach the
script for reproduction.
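
Until that thread is posted, here is a rough user-space stand-in for that kind of
reproducer -- 16 concurrent jobs doing large buffered writes. The file path, file
count and sizes below are made up for illustration; they are not taken from the
actual hammer script:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_JOBS		16
#define BUF_SIZE	(4 * 1024 * 1024)	/* 4 MB buffered writes */
#define NR_WRITES	1024			/* ~4 GB written per job */

int main(void)
{
	for (int i = 0; i < NR_JOBS; i++) {
		if (fork() == 0) {
			char path[64];
			char *buf = malloc(BUF_SIZE);
			int fd;

			if (!buf)
				_exit(1);
			memset(buf, 0x5a, BUF_SIZE);
			/* hypothetical target on the multipath test device */
			snprintf(path, sizeof(path), "/mnt/mpath/hammer.%d", i);
			fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
			if (fd < 0)
				_exit(1);
			for (int j = 0; j < NR_WRITES; j++)
				if (write(fd, buf, BUF_SIZE) != BUF_SIZE)
					_exit(1);
			close(fd);
			_exit(0);
		}
	}
	for (int i = 0; i < NR_JOBS; i++)
		wait(NULL);
	return 0;
}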

So let's focus on this patchset ([PATCH V2 00/20] blk-mq-sched: improve
SCSI-MQ performance) in this thread.

Thanks again for your test!

Thanks,
Ming

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-08 13:41     ` Ming Lei
@ 2017-08-08 13:58       ` Laurence Oberman
  0 siblings, 0 replies; 84+ messages in thread
From: Laurence Oberman @ 2017-08-08 13:58 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche



On 08/08/2017 09:41 AM, Ming Lei wrote:
> Hi Laurence and Guys,
> 
> On Mon, Aug 07, 2017 at 06:06:11PM -0400, Laurence Oberman wrote:
>> On Mon, Aug 7, 2017 at 8:48 AM, Laurence Oberman <loberman@redhat.com>
>> wrote:
>> Hello
>>
>> I need to retract my Tested-by:
>>
>> While its valid that the patches do not introduce performance regressions,
>> they seem to cause a hard lockup when the [mq-deadline] scheduler is
>> enabled so I am not confident with a passing result here.
>>
>> This is specific to large buffered I/O writes (4MB) At least that is my
>> current test.
>>
>> I did not wait long enough for the issue to show when I first sent the pass
>> (Tested-by) message because I know my test platform so well I thought I had
>> given it enough time to validate the patches for performance regressions.
>>
>> I dont know if the failing clone in blk_get_request() is a direct a
>> catalyst for the hard lockup but what I do know is with the stock upstream
>> 4.13-RC3 I only see them when I am set to [none] and stock upstream never
>> seems to see the hard lockup.
>>
>> With [mq-deadline] enabled on stock I dont see them at all and it behaves.
>>
>> Now with Ming's patches if we enable [mq-deadline] we DO see the clone
>> failures and the hard lockup so we have opposit behaviour with the
>> scheduler choice and we have the hard lockup.
>>
>> On Ming's kernel with [none] we are well behaved and that was my original
>> focus, testing on [none] and hence my Tested-by: pass.
>>
>> So more investigation is needed here.
> 
> Laurence, as we talked in IRC, the hard lock issue you saw isn't
> related with this patchset, because the issue can be reproduced on
> both v4.13-rc3 and RHEL7. The only trick is to run your hammer
> write script concurrently in 16 jobs, then it just takes several
> minutes to trigger, no matter with using mq none or mq-deadline
> scheduler.
> 
> Given it is easy to reproduce, I believe it shouldn't be very
> difficult to investigate and root cause.
> 
> I will report the issue on another thread, and attach the
> script for reproduction.
> 
> So let's focus on this patchset([PATCH V2 00/20] blk-mq-sched: improve
> SCSI-MQ performance) in this thread.
> 
> Thanks again for your test!
> 
> Thanks,
> Ming
> 

Hello Ming,

Yes, I agree. This means my original Tested-by: for your patch set is
then still valid for the large-size I/O tests.
Thank you for all this hard work and for improving blk-mq.

Regards
Laurence

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 01/20] blk-mq-sched: fix scheduler bad performance
  2017-08-05  6:56 ` [PATCH V2 01/20] blk-mq-sched: fix scheduler bad performance Ming Lei
@ 2017-08-09  0:11   ` Omar Sandoval
  2017-08-09  2:32     ` Ming Lei
  0 siblings, 1 reply; 84+ messages in thread
From: Omar Sandoval @ 2017-08-09  0:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman

On Sat, Aug 05, 2017 at 02:56:46PM +0800, Ming Lei wrote:
> When hw queue is busy, we shouldn't take requests from
> scheduler queue any more, otherwise IO merge will be
> difficult to do.
> 
> This patch fixes the awful IO performance on some
> SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber
> is used by not taking requests if hw queue is busy.

Jens added this behavior in 64765a75ef25 ("blk-mq-sched: ask scheduler
for work, if we failed dispatching leftovers"). That change was a big
performance improvement, but we didn't figure out why. We'll need to dig
up whatever test Jens was doing to make sure it doesn't regress.

> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-sched.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 4ab69435708c..845e5baf8af1 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	struct request_queue *q = hctx->queue;
>  	struct elevator_queue *e = q->elevator;
>  	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> -	bool did_work = false;
> +	bool do_sched_dispatch = true;
>  	LIST_HEAD(rq_list);
>  
>  	/* RCU or SRCU read lock is needed before checking quiesced flag */
> @@ -125,7 +125,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	 */
>  	if (!list_empty(&rq_list)) {
>  		blk_mq_sched_mark_restart_hctx(hctx);
> -		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
> +		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
>  	} else if (!has_sched_dispatch) {
>  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
>  		blk_mq_dispatch_rq_list(q, &rq_list);
> @@ -136,7 +136,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	 * on the dispatch list, OR if we did have work but weren't able
>  	 * to make progress.
>  	 */
> -	if (!did_work && has_sched_dispatch) {
> +	if (do_sched_dispatch && has_sched_dispatch) {
>  		do {
>  			struct request *rq;
>  
> -- 
> 2.9.4
> 

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 01/20] blk-mq-sched: fix scheduler bad performance
  2017-08-09  0:11   ` Omar Sandoval
@ 2017-08-09  2:32     ` Ming Lei
  2017-08-09  7:11       ` Omar Sandoval
  0 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-09  2:32 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig,
	Bart Van Assche, Laurence Oberman

On Wed, Aug 9, 2017 at 8:11 AM, Omar Sandoval <osandov@osandov.com> wrote:
> On Sat, Aug 05, 2017 at 02:56:46PM +0800, Ming Lei wrote:
>> When hw queue is busy, we shouldn't take requests from
>> scheduler queue any more, otherwise IO merge will be
>> difficult to do.
>>
>> This patch fixes the awful IO performance on some
>> SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber
>> is used by not taking requests if hw queue is busy.
>
> Jens added this behavior in 64765a75ef25 ("blk-mq-sched: ask scheduler
> for work, if we failed dispatching leftovers"). That change was a big
> performance improvement, but we didn't figure out why. We'll need to dig
> up whatever test Jens was doing to make sure it doesn't regress.

I could not find any info about Jens' test case for this commit via Google.

Maybe Jens could provide some input about the test case he used?

In theory, if the hw queue is busy and requests are left in ->dispatch,
we should not continue to dequeue requests from the sw/scheduler queue.
Otherwise, I/O merging can be hurt badly. At least on SCSI devices this
improves sequential I/O a lot: sequential read throughput increases by at
least 3X on lpfc with this patch, in the mq-deadline case.

Or are there other special cases in which we still need
to push requests hard into busy hardware?

And this patch won't have an effect on devices in which queue busy
is seldom triggered, such as NVMe.


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 01/20] blk-mq-sched: fix scheduler bad performance
  2017-08-09  2:32     ` Ming Lei
@ 2017-08-09  7:11       ` Omar Sandoval
  2017-08-21  8:18         ` Ming Lei
  2017-08-23  7:48         ` Ming Lei
  0 siblings, 2 replies; 84+ messages in thread
From: Omar Sandoval @ 2017-08-09  7:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig,
	Bart Van Assche, Laurence Oberman

On Wed, Aug 09, 2017 at 10:32:52AM +0800, Ming Lei wrote:
> On Wed, Aug 9, 2017 at 8:11 AM, Omar Sandoval <osandov@osandov.com> wrote:
> > On Sat, Aug 05, 2017 at 02:56:46PM +0800, Ming Lei wrote:
> >> When hw queue is busy, we shouldn't take requests from
> >> scheduler queue any more, otherwise IO merge will be
> >> difficult to do.
> >>
> >> This patch fixes the awful IO performance on some
> >> SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber
> >> is used by not taking requests if hw queue is busy.
> >
> > Jens added this behavior in 64765a75ef25 ("blk-mq-sched: ask scheduler
> > for work, if we failed dispatching leftovers"). That change was a big
> > performance improvement, but we didn't figure out why. We'll need to dig
> > up whatever test Jens was doing to make sure it doesn't regress.
> 
> Not found info about Jen's test case on this commit from google.
> 
> Maybe Jens could provide some input about your test case?

Okay I found my previous discussion with Jens (it was an off-list
discussion). The test case was xfs/297 from xfstests: after
64765a75ef25, the test went from taking ~300 seconds to ~200 seconds on
his SCSI device.

> In theory, if hw queue is busy and requests are left in ->dispatch,
> we should not have continued to dequeue requests from sw/scheduler queue
> any more. Otherwise, IO merge can be hurt much. At least on SCSI devices,
> this improved much on sequential I/O,  at least 3X of sequential
> read is increased on lpfc with this patch, in case of mq-deadline.

Right, your patch definitely makes more sense intuitively.

> Or are there other special cases in which we still need
> to push requests hard into a busy hardware?

xfs/297 does a lot of fsyncs and hence a lot of flushes; that could be
the special case.
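
For context, that flush-heavy pattern looks roughly like the sketch below from
user space: each fsync() forces cache-flush requests through the block layer,
which is a different mix from the large sequential streams discussed earlier
(a generic illustration, not code from xfs/297 itself):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Small writes, each followed by fsync(): every iteration pushes a flush
 * down to the device, so the queue sees many small, latency-sensitive
 * requests rather than large mergeable ones. */
static int fsync_hammer(const char *path, int iterations)
{
	char buf[4096];
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	memset(buf, 0, sizeof(buf));
	for (int i = 0; i < iterations; i++) {
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			break;
		fsync(fd);
	}
	return close(fd);
}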

> And this patch won't have an effect on devices in which queue busy
> is seldom triggered, such as NVMe.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (21 preceding siblings ...)
  2017-08-08  8:09 ` Paolo Valente
@ 2017-08-11  8:11 ` Christoph Hellwig
  2017-08-11 14:25   ` James Bottomley
  2017-08-23 16:12 ` Bart Van Assche
  23 siblings, 1 reply; 84+ messages in thread
From: Christoph Hellwig @ 2017-08-11  8:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Laurence Oberman, Martin K. Petersen, linux-scsi

[+ Martin and linux-scsi]

Given that we need this big pile and a few bfq fixes to avoid
major regressions, I'm tempted to revert the default to scsi-mq
for 4.14, but bring it back a little later for 4.15.

What do you think?  Maybe for 4.15 we could also do it through the
block tree where all the fixes will be queued.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-11  8:11 ` Christoph Hellwig
@ 2017-08-11 14:25   ` James Bottomley
  0 siblings, 0 replies; 84+ messages in thread
From: James Bottomley @ 2017-08-11 14:25 UTC (permalink / raw)
  To: Christoph Hellwig, Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Laurence Oberman,
	Martin K. Petersen, linux-scsi

On Fri, 2017-08-11 at 01:11 -0700, Christoph Hellwig wrote:
> [+ Martin and linux-scsi]
> 
> Given that we need this big pile and a few bfq fixes to avoid
> major regressesions I'm tempted to revert the default to scsi-mq
> for 4.14, but bring it back a little later for 4.15.
> 
> What do you think?  Maybe for 4.15 we could also do it through the
> block tree where all the fixes will be queued.

Given the severe workload regressions Mel reported, I think that's
wise.

I also think we wouldn't have found all these problems if it hadn't
been the default, so the original patch was the best way of trying to
find out if we were ready for the switch and forcing all the issues
out.

Thanks,

James

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 01/20] blk-mq-sched: fix scheduler bad performance
  2017-08-09  7:11       ` Omar Sandoval
@ 2017-08-21  8:18         ` Ming Lei
  2017-08-23  7:48         ` Ming Lei
  1 sibling, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-21  8:18 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig,
	Bart Van Assche, Laurence Oberman

On Wed, Aug 09, 2017 at 12:11:18AM -0700, Omar Sandoval wrote:
> On Wed, Aug 09, 2017 at 10:32:52AM +0800, Ming Lei wrote:
> > On Wed, Aug 9, 2017 at 8:11 AM, Omar Sandoval <osandov@osandov.com> wrote:
> > > On Sat, Aug 05, 2017 at 02:56:46PM +0800, Ming Lei wrote:
> > >> When hw queue is busy, we shouldn't take requests from
> > >> scheduler queue any more, otherwise IO merge will be
> > >> difficult to do.
> > >>
> > >> This patch fixes the awful IO performance on some
> > >> SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber
> > >> is used by not taking requests if hw queue is busy.
> > >
> > > Jens added this behavior in 64765a75ef25 ("blk-mq-sched: ask scheduler
> > > for work, if we failed dispatching leftovers"). That change was a big
> > > performance improvement, but we didn't figure out why. We'll need to dig
> > > up whatever test Jens was doing to make sure it doesn't regress.
> > 
> > Not found info about Jen's test case on this commit from google.
> > 
> > Maybe Jens could provide some input about your test case?
> 
> Okay I found my previous discussion with Jens (it was an off-list
> discussion). The test case was xfs/297 from xfstests: after
> 64765a75ef25, the test went from taking ~300 seconds to ~200 seconds on
> his SCSI device.
> 
> > In theory, if hw queue is busy and requests are left in ->dispatch,
> > we should not have continued to dequeue requests from sw/scheduler queue
> > any more. Otherwise, IO merge can be hurt much. At least on SCSI devices,
> > this improved much on sequential I/O,  at least 3X of sequential
> > read is increased on lpfc with this patch, in case of mq-deadline.
> 
> Right, your patch definitely makes more sense intuitively.
> 
> > Or are there other special cases in which we still need
> > to push requests hard into a busy hardware?
> 
> xfs/297 does a lot of fsyncs and hence a lot of flushes, that could be
> the special case.

OK, thanks a lot for this input, and I will run xfs/297 with this
patchset to see whether performance is degraded.

-- 
Ming

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 02/20] sbitmap: introduce __sbitmap_for_each_set()
  2017-08-05  6:56 ` [PATCH V2 02/20] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
@ 2017-08-22 18:28   ` Bart Van Assche
  2017-08-24  3:57     ` Ming Lei
  2017-08-22 18:37   ` Bart Van Assche
  1 sibling, 1 reply; 84+ messages in thread
From: Bart Van Assche @ 2017-08-22 18:28 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: osandov, loberman

On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> /**
>   * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
> + * @off: Offset to iterate from
>   * @sb: Bitmap to iterate over.
>   * @fn: Callback. Should return true to continue or false to break early.
>   * @data: Pointer to pass to callback.

Using 'off' as the name for the new argument seems confusing to me since that
argument starts from zero and is not an offset relative to anything. Please
consider to use 'start' as the name for this argument.

> -static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
> -					void *data)
> +static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> +					  unsigned int off,
> +					  sb_for_each_fn fn, void *data)
>  {
> -	unsigned int i;
> +	unsigned int index = SB_NR_TO_INDEX(sb, off);

Is it really useful to rename 'i' into 'index'? I think that change makes this
patch harder to read than necessary.

> +	unsigned int nr = SB_NR_TO_BIT(sb, off);
> +	unsigned int scanned = 0;
>  
> -	for (i = 0; i < sb->map_nr; i++) {
> -		struct sbitmap_word *word = &sb->map[i];
> -		unsigned int off, nr;
> +	while (1) {

Sorry but this change looks incorrect to me. I think the following two tests
have to be performed before the while loop starts to avoid triggering an
out-of-bounds reference of sb->map[]:
* Whether or not sb->map_nr is zero.
* Whether or not index >= sb->map_nr. I propose to start iterating from the
  start of @sb in this case.

Additionally, the new loop in __sbitmap_for_each_set() looks more complicated
and more fragile to me than necessary. How about using the code below? That
code needs one local variable less than your implementation.

static inline void __sbitmap_for_each_set(struct sbitmap *sb,
					  const unsigned int start,
					  sb_for_each_fn fn, void *data)
{
	unsigned int i = start >> sb->shift;
	unsigned int nr = start & ((1 << sb->shift) - 1);
	bool cycled = false;

	if (!sb->map_nr)
		return;

	if (unlikely(i >= sb->map_nr)) {
		i = 0;
		nr = 0;
	}

	while (true) {
		struct sbitmap_word *word = &sb->map[i];
		unsigned int off;

		off = i << sb->shift;
		while (1) {
			nr = find_next_bit(&word->word, word->depth, nr);
			if (cycled && off + nr >= start)
				return;

			if (nr >= word->depth)
				break;

			if (!fn(sb, off + nr, data))
				return;

			nr++;
		}
		if (++i >= sb->map_nr) {
			cycled = true;
			i = 0;
		}
		nr = 0;
	}
}

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 02/20] sbitmap: introduce __sbitmap_for_each_set()
  2017-08-05  6:56 ` [PATCH V2 02/20] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
  2017-08-22 18:28   ` Bart Van Assche
@ 2017-08-22 18:37   ` Bart Van Assche
  2017-08-24  4:02     ` Ming Lei
  1 sibling, 1 reply; 84+ messages in thread
From: Bart Van Assche @ 2017-08-22 18:37 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: osandov, loberman

On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> -static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
> -					void *data)
> +static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> +					  unsigned int off,
> +					  sb_for_each_fn fn, void *data)
>  {

An additional comment: if a function name starts with a double underscore
usually that either means that it should be called with a specific lock
held or that it is an implementation function that should not be called by
other modules. Since neither is the case for __sbitmap_for_each_set(),
please consider to use another name for this function.

Thanks,

Bart.


* Re: [PATCH V2 03/20] blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
  2017-08-05  6:56 ` [PATCH V2 03/20] blk-mq: introduce blk_mq_dispatch_rq_from_ctx() Ming Lei
@ 2017-08-22 18:45   ` Bart Van Assche
  2017-08-24  4:52     ` Ming Lei
  0 siblings, 1 reply; 84+ messages in thread
From: Bart Van Assche @ 2017-08-22 18:45 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: loberman

On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> More importantly, for some SCSI devices, driver
> tags are host wide, and the number is quite big,
> but each lun has very limited queue depth.

This may be the case but is not always the case. Another important use-case
is one LUN per host and where the queue depth per LUN is identical to the
number of host tags.

> +struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
> +					    struct blk_mq_ctx *start)
> +{
> +	unsigned off = start ? start->index_hw : 0;

Please consider to rename this function into blk_mq_dispatch_rq_from_next_ctx()
and to start from start->index_hw + 1 instead of start->index_hw. I think that
will not only result in simpler but also in faster code.

Thanks,

Bart.


* Re: [PATCH V2 04/20] blk-mq-sched: move actual dispatching into one helper
  2017-08-05  6:56 ` [PATCH V2 04/20] blk-mq-sched: move actual dispatching into one helper Ming Lei
@ 2017-08-22 19:50   ` Bart Van Assche
  0 siblings, 0 replies; 84+ messages in thread
From: Bart Van Assche @ 2017-08-22 19:50 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: loberman

On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> So that it becomes easy to support to dispatch from
> sw queue in the following patch.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>


* Re: [PATCH V2 05/20] blk-mq-sched: improve dispatching from sw queue
  2017-08-05  6:56 ` [PATCH V2 05/20] blk-mq-sched: improve dispatching from sw queue Ming Lei
@ 2017-08-22 19:55   ` Bart Van Assche
  2017-08-23 19:58     ` Jens Axboe
  2017-08-24  5:52     ` Ming Lei
  2017-08-22 20:57   ` Bart Van Assche
  1 sibling, 2 replies; 84+ messages in thread
From: Bart Van Assche @ 2017-08-22 19:55 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: loberman

On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> easy to cause queue busy becasue of the small
                           ^^^^^^^
                           because?

> -static void blk_mq_do_dispatch(struct request_queue *q,
> -			       struct elevator_queue *e,
> -			       struct blk_mq_hw_ctx *hctx)
> +static inline void blk_mq_do_dispatch_sched(struct request_queue *q,
> +					    struct elevator_queue *e,
> +					    struct blk_mq_hw_ctx *hctx)
>  {
>  	LIST_HEAD(rq_list);

Why to declare this function "inline"? Are you sure that the compiler
is not smart enough to decide on its own whether or not to inline this
function?

> +static inline struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> +						 struct blk_mq_ctx *ctx)
> +{
> +	unsigned idx = ctx->index_hw;
> +
> +	if (++idx == hctx->nr_ctx)
> +		idx = 0;
> +
> +	return hctx->ctxs[idx];
> +}
> +
> +static inline void blk_mq_do_dispatch_ctx(struct request_queue *q,
> +					  struct blk_mq_hw_ctx *hctx)
> +{
> +	LIST_HEAD(rq_list);
> +	struct blk_mq_ctx *ctx = NULL;
> +
> +	do {
> +		struct request *rq;
> +
> +		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
> +		if (!rq)
> +			break;
> +		list_add(&rq->queuelist, &rq_list);
> +
> +		/* round robin for fair dispatch */
> +		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> +	} while (blk_mq_dispatch_rq_list(q, &rq_list));
> +}

Please consider to move the blk_mq_next_ctx() functionality into
blk_mq_dispatch_rq_from_ctx() as requested in a comment on a previous patch.

>  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct request_queue *q = hctx->queue;
> @@ -142,18 +172,31 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	if (!list_empty(&rq_list)) {
>  		blk_mq_sched_mark_restart_hctx(hctx);
>  		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> -	} else if (!has_sched_dispatch) {
> +	} else if (!has_sched_dispatch & !q->queue_depth) {

Please use "&&" instead of "&" to represent logical and.

Otherwise this patch looks fine to me.

Thanks,

Bart.


* Re: [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  2017-08-05  6:56 ` [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
@ 2017-08-22 20:09   ` Bart Van Assche
  2017-08-24  6:18     ` Ming Lei
  2017-08-23 19:56   ` Jens Axboe
  1 sibling, 1 reply; 84+ messages in thread
From: Bart Van Assche @ 2017-08-22 20:09 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: loberman

On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> +	/*
> +	 * Wherever DISPATCH_BUSY is set, blk_mq_run_hw_queue()
> +	 * will be run to try to make progress, so it is always
> +	 * safe to check the state here.
> +	 */
> +	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
> +		return;

The comment above test_bit() is useful but does not explain the purpose of
the early return. Is the purpose of the early return perhaps to serialize
blk_mq_sched_dispatch_requests() calls? If so, please mention this.

Thanks,

Bart.


* Re: [PATCH V2 07/20] blk-mq-sched: introduce blk_mq_sched_queue_depth()
  2017-08-05  6:56 ` [PATCH V2 07/20] blk-mq-sched: introduce blk_mq_sched_queue_depth() Ming Lei
@ 2017-08-22 20:10   ` Bart Van Assche
  0 siblings, 0 replies; 84+ messages in thread
From: Bart Van Assche @ 2017-08-22 20:10 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: loberman

On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> The following patch will propose some hints to figure out
> default queue depth for scheduler queue, so introduce helper
> of blk_mq_sched_queue_depth() for this purpose.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>


* Re: [PATCH V2 08/20] blk-mq-sched: use q->queue_depth as hint for q->nr_requests
  2017-08-05  6:56 ` [PATCH V2 08/20] blk-mq-sched: use q->queue_depth as hint for q->nr_requests Ming Lei
@ 2017-08-22 20:20   ` Bart Van Assche
  2017-08-24  6:39     ` Ming Lei
  0 siblings, 1 reply; 84+ messages in thread
From: Bart Van Assche @ 2017-08-22 20:20 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: loberman

On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> SCSI sets q->queue_depth from shost->cmd_per_lun, and
> q->queue_depth is per request_queue and more related to
> scheduler queue compared with hw queue depth, which can be
> shared by queues, such as TAG_SHARED.
> 
> This patch trys to use q->queue_depth as hint for computing
             ^^^^
             tries?
> q->nr_requests, which should be more effective than
> current way.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>


* Re: [PATCH V2 10/20] blk-mq-sched: introduce helpers for query, change busy state
  2017-08-05  6:56 ` [PATCH V2 10/20] blk-mq-sched: introduce helpers for query, change busy state Ming Lei
@ 2017-08-22 20:41   ` Bart Van Assche
  2017-08-23 20:02     ` Jens Axboe
  2017-08-24  6:54     ` Ming Lei
  0 siblings, 2 replies; 84+ messages in thread
From: Bart Van Assche @ 2017-08-22 20:41 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: loberman

On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> +static inline bool blk_mq_hctx_is_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> +{
> +	return test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> +}
> +
> +static inline void blk_mq_hctx_set_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> +{
> +	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> +}
> +
> +static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> +{
> +	clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> +}

Hello Ming,

Are these helper functions modified in a later patch? If not, please drop
this patch. One line helper functions are not useful and do not improve
readability of source code.

Thanks,

Bart.


* Re: [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list
  2017-08-05  6:56 ` [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list Ming Lei
@ 2017-08-22 20:43   ` Bart Van Assche
  2017-08-24  0:59     ` Damien Le Moal
  2017-08-24  6:57     ` Ming Lei
  0 siblings, 2 replies; 84+ messages in thread
From: Bart Van Assche @ 2017-08-22 20:43 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: loberman

On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> +static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
> +{
> +	return !list_empty_careful(&hctx->dispatch);
> +}
> +
> +static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
> +		struct request *rq)
> +{
> +	spin_lock(&hctx->lock);
> +	list_add(&rq->queuelist, &hctx->dispatch);
> +	blk_mq_hctx_set_dispatch_busy(hctx);
> +	spin_unlock(&hctx->lock);
> +}
> +
> +static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
> +		struct list_head *list)
> +{
> +	spin_lock(&hctx->lock);
> +	list_splice_init(list, &hctx->dispatch);
> +	blk_mq_hctx_set_dispatch_busy(hctx);
> +	spin_unlock(&hctx->lock);
> +}
> +
> +static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx,
> +						    struct list_head *list)
> +{
> +	spin_lock(&hctx->lock);
> +	list_splice_tail_init(list, &hctx->dispatch);
> +	blk_mq_hctx_set_dispatch_busy(hctx);
> +	spin_unlock(&hctx->lock);
> +}
> +
> +static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx *hctx,
> +		struct list_head *list)
> +{
> +	spin_lock(&hctx->lock);
> +	list_splice_init(&hctx->dispatch, list);
> +	spin_unlock(&hctx->lock);
> +}

Same comment for this patch: these helper functions are so short that I'm not
sure it is useful to introduce these helper functions.

Bart.


* Re: [PATCH V2 14/20] blk-mq-sched: improve IO scheduling on SCSI devcie
  2017-08-05  6:56 ` [PATCH V2 14/20] blk-mq-sched: improve IO scheduling on SCSI devcie Ming Lei
@ 2017-08-22 20:51   ` Bart Van Assche
  2017-08-24  7:14     ` Ming Lei
  0 siblings, 1 reply; 84+ messages in thread
From: Bart Van Assche @ 2017-08-22 20:51 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: loberman

On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> This patch introduces per-request_queue dispatch
> list for this purpose, and only when all requests
> in this list are dispatched out successfully, we
> can restart to dequeue request from sw/scheduler
> queue and dispath it to lld.

Wasn't one of the design goals of blk-mq to avoid a per-request_queue list?

Thanks,

Bart.


* Re: [PATCH V2 05/20] blk-mq-sched: improve dispatching from sw queue
  2017-08-05  6:56 ` [PATCH V2 05/20] blk-mq-sched: improve dispatching from sw queue Ming Lei
  2017-08-22 19:55   ` Bart Van Assche
@ 2017-08-22 20:57   ` Bart Van Assche
  2017-08-24  6:12     ` Ming Lei
  1 sibling, 1 reply; 84+ messages in thread
From: Bart Van Assche @ 2017-08-22 20:57 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: loberman

On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> +static inline void blk_mq_do_dispatch_ctx(struct request_queue *q,
> +					  struct blk_mq_hw_ctx *hctx)
> +{
> +	LIST_HEAD(rq_list);
> +	struct blk_mq_ctx *ctx = NULL;
> +
> +	do {
> +		struct request *rq;
> +
> +		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
> +		if (!rq)
> +			break;
> +		list_add(&rq->queuelist, &rq_list);
> +
> +		/* round robin for fair dispatch */
> +		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> +	} while (blk_mq_dispatch_rq_list(q, &rq_list));
> +}

An additional question about this patch: shouldn't request dequeuing start
at the software queue next to the last one from which a request got dequeued
instead of always starting at the first software queue (struct blk_mq_ctx
*ctx = NULL) to be truly round robin?

Thanks,

Bart.


* Re: [PATCH V2 09/20] blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
  2017-08-05  6:56 ` [PATCH V2 09/20] blk-mq: introduce BLK_MQ_F_SHARED_DEPTH Ming Lei
@ 2017-08-22 21:55   ` Bart Van Assche
  2017-08-23  6:46     ` Hannes Reinecke
  2017-08-24  6:52     ` Ming Lei
  0 siblings, 2 replies; 84+ messages in thread
From: Bart Van Assche @ 2017-08-22 21:55 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: loberman

On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> +	/*
> +	 * if there is q->queue_depth, all hw queues share
> +	 * this queue depth limit
> +	 */
> +	if (q->queue_depth) {
> +		queue_for_each_hw_ctx(q, hctx, i)
> +			hctx->flags |= BLK_MQ_F_SHARED_DEPTH;
> +	}
> +
> +	if (!q->elevator)
> +		goto exit;

Hello Ming,

It seems very fragile to me to set BLK_MQ_F_SHARED_DEPTH if and only if
q->queue_depth != 0. Wouldn't it be better to let the block driver tell the
block layer whether or not there is a queue depth limit across hardware
queues, e.g. through a tag set flag?

Thanks,

Bart.


* Re: [PATCH V2 09/20] blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
  2017-08-22 21:55   ` Bart Van Assche
@ 2017-08-23  6:46     ` Hannes Reinecke
  2017-08-24  6:52     ` Ming Lei
  1 sibling, 0 replies; 84+ messages in thread
From: Hannes Reinecke @ 2017-08-23  6:46 UTC (permalink / raw)
  To: Bart Van Assche, hch, linux-block, axboe, ming.lei; +Cc: loberman

On 08/22/2017 11:55 PM, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
>> +	/*
>> +	 * if there is q->queue_depth, all hw queues share
>> +	 * this queue depth limit
>> +	 */
>> +	if (q->queue_depth) {
>> +		queue_for_each_hw_ctx(q, hctx, i)
>> +			hctx->flags |= BLK_MQ_F_SHARED_DEPTH;
>> +	}
>> +
>> +	if (!q->elevator)
>> +		goto exit;
> 
> Hello Ming,
> 
> It seems very fragile to me to set BLK_MQ_F_SHARED_DEPTH if and only if
> q->queue_depth != 0. Wouldn't it be better to let the block driver tell the
> block layer whether or not there is a queue depth limit across hardware
> queues, e.g. through a tag set flag?
> 
Yes, please.
I've come across this several times now, and am always tempted to
introduce this flag.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: [PATCH V2 01/20] blk-mq-sched: fix scheduler bad performance
  2017-08-09  7:11       ` Omar Sandoval
  2017-08-21  8:18         ` Ming Lei
@ 2017-08-23  7:48         ` Ming Lei
  1 sibling, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-23  7:48 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig,
	Bart Van Assche, Laurence Oberman

On Wed, Aug 09, 2017 at 12:11:18AM -0700, Omar Sandoval wrote:
> On Wed, Aug 09, 2017 at 10:32:52AM +0800, Ming Lei wrote:
> > On Wed, Aug 9, 2017 at 8:11 AM, Omar Sandoval <osandov@osandov.com> wrote:
> > > On Sat, Aug 05, 2017 at 02:56:46PM +0800, Ming Lei wrote:
> > >> When hw queue is busy, we shouldn't take requests from
> > >> scheduler queue any more, otherwise IO merge will be
> > >> difficult to do.
> > >>
> > >> This patch fixes the awful IO performance on some
> > >> SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber
> > >> is used by not taking requests if hw queue is busy.
> > >
> > > Jens added this behavior in 64765a75ef25 ("blk-mq-sched: ask scheduler
> > > for work, if we failed dispatching leftovers"). That change was a big
> > > performance improvement, but we didn't figure out why. We'll need to dig
> > > up whatever test Jens was doing to make sure it doesn't regress.
> > 
> > Not found info about Jen's test case on this commit from google.
> > 
> > Maybe Jens could provide some input about your test case?
> 
> Okay I found my previous discussion with Jens (it was an off-list
> discussion). The test case was xfs/297 from xfstests: after
> 64765a75ef25, the test went from taking ~300 seconds to ~200 seconds on
> his SCSI device.

I just ran xfs/297 on a virtio-scsi device with this patch, using the
mq-deadline scheduler:

	v4.13-rc6 + block for-next: 				83s
	v4.13-rc6 + block for-next + this patch:	79s

So there looks to be no big difference.

> 
> > In theory, if hw queue is busy and requests are left in ->dispatch,
> > we should not have continued to dequeue requests from sw/scheduler queue
> > any more. Otherwise, IO merge can be hurt much. At least on SCSI devices,
> > this improved much on sequential I/O,  at least 3X of sequential
> > read is increased on lpfc with this patch, in case of mq-deadline.
> 
> Right, your patch definitely makes more sense intuitively.
> 
> > Or are there other special cases in which we still need
> > to push requests hard into a busy hardware?
> 
> xfs/297 does a lot of fsyncs and hence a lot of flushes, that could be
> the special case.

IMO, this patch shouldn't degrade flush handling in theory. Actually, in
Paolo's dbench test[1], flush latency decreased a lot with this patchset,
and Paolo's test was run on a SATA device.

[1] https://marc.info/?l=linux-block&m=150217980602843&w=2

--
Ming


* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
                   ` (22 preceding siblings ...)
  2017-08-11  8:11 ` Christoph Hellwig
@ 2017-08-23 16:12 ` Bart Van Assche
  2017-08-23 16:15   ` Jens Axboe
  23 siblings, 1 reply; 84+ messages in thread
From: Bart Van Assche @ 2017-08-23 16:12 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: loberman

On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> In Red Hat internal storage test wrt. blk-mq scheduler, we
> found that I/O performance is much bad with mq-deadline, especially
> about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
> SRP...)

Hello Ming and Jens,

There may not be enough time left to reach agreement about the whole patch
series before the kernel v4.14 merge window opens. How about focusing on
patches 1..8 of this series for kernel v4.14 and revisiting the rest of this
patch series later?

Thanks,

Bart.


* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-23 16:12 ` Bart Van Assche
@ 2017-08-23 16:15   ` Jens Axboe
  2017-08-23 16:24     ` Ming Lei
  0 siblings, 1 reply; 84+ messages in thread
From: Jens Axboe @ 2017-08-23 16:15 UTC (permalink / raw)
  To: Bart Van Assche, hch, linux-block, ming.lei; +Cc: loberman

On 08/23/2017 10:12 AM, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
>> In Red Hat internal storage test wrt. blk-mq scheduler, we
>> found that I/O performance is much bad with mq-deadline, especially
>> about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
>> SRP...)
> 
> Hello Ming and Jens,
> 
> There may not be enough time left to reach agreement about the whole patch
> series before the kernel v4.14 merge window opens. How about focusing on
> patches 1..8 of this series for kernel v4.14 and revisiting the rest of this
> patch series later?

I was going to go over the series today with 4.14 in mind. Looks to me like
this should really be 2-3 patch series, that depend on each other. Might be
better for review purposes as well. So I'd agree with Bart - can we get this
split a bit and geared towards what we need for 4.14 at least, since it's
getting close. And some of the changes do make me somewhat nervous, they
need proper cooking time.

-- 
Jens Axboe


* Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
  2017-08-23 16:15   ` Jens Axboe
@ 2017-08-23 16:24     ` Ming Lei
  0 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-23 16:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Bart Van Assche, hch, linux-block, loberman

On Wed, Aug 23, 2017 at 10:15:29AM -0600, Jens Axboe wrote:
> On 08/23/2017 10:12 AM, Bart Van Assche wrote:
> > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> >> In Red Hat internal storage test wrt. blk-mq scheduler, we
> >> found that I/O performance is much bad with mq-deadline, especially
> >> about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
> >> SRP...)
> > 
> > Hello Ming and Jens,
> > 
> > There may not be enough time left to reach agreement about the whole patch
> > series before the kernel v4.14 merge window opens. How about focusing on
> > patches 1..8 of this series for kernel v4.14 and revisiting the rest of this
> > patch series later?
> 
> I was going to go over the series today with 4.14 in mind. Looks to me like
> this should really be 2-3 patch series, that depend on each other. Might be
> better for review purposes as well. So I'd agree with Bart - can we get this
> split a bit and geared towards what we need for 4.14 at least, since it's
> getting close. And some of the changes do make me somewhat nervous, they
> need proper cooking time.

I agree to split the patchset and will do it tomorrow.
If you guys have any suggestions about the splitting (such as
which parts should aim at v4.14), please let me know.

-- 
Ming


* Re: [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  2017-08-05  6:56 ` [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
  2017-08-22 20:09   ` Bart Van Assche
@ 2017-08-23 19:56   ` Jens Axboe
  2017-08-24  6:38     ` Ming Lei
  1 sibling, 1 reply; 84+ messages in thread
From: Jens Axboe @ 2017-08-23 19:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Laurence Oberman

On Sat, Aug 05 2017, Ming Lei wrote:
> During dispatching, we moved all requests from hctx->dispatch to
> one temporary list, then dispatch them one by one from this list.
> Unfortunately duirng this period, run queue from other contexts
> may think the queue is idle, then start to dequeue from sw/scheduler
> queue and still try to dispatch because ->dispatch is empty. This way
> hurts sequential I/O performance because requests are dequeued when
> lld queue is busy.
> 
> This patch introduces the state of BLK_MQ_S_DISPATCH_BUSY to
> make sure that request isn't dequeued until ->dispatch is
> flushed.

I don't like how this patch introduces a bunch of locked setting of a
flag under the hctx lock. Especially since I think we can easily avoid
it.

> -	} else if (!has_sched_dispatch & !q->queue_depth) {
> +		blk_mq_dispatch_rq_list(q, &rq_list);
> +
> +		/*
> +		 * We may clear DISPATCH_BUSY just after it
> +		 * is set from another context, the only cost
> +		 * is that one request is dequeued a bit early,
> +		 * we can survive that. Given the window is
> +		 * too small, no need to worry about performance
> +		 * effect.
> +		 */
> +		if (list_empty_careful(&hctx->dispatch))
> +			clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);

This is basically the only place where we modify it without holding the
hctx lock. Can we move it into blk_mq_dispatch_rq_list()? The list is
generally empty, unless for the case where we splice residuals back. If
we splice them back, we grab the lock anyway.

The other places it's set under the hctx lock, yet we end up using an
atomic operation to do it.
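
For illustration, a rough sketch of that idea (hypothetical helper name,
not the actual patch) could be a small tail function called from
blk_mq_dispatch_rq_list():

static void blk_mq_dispatch_list_done(struct blk_mq_hw_ctx *hctx,
				      struct list_head *list)
{
	if (!list_empty(list)) {
		/* residuals go back to ->dispatch under the lock we
		 * already take, so mark busy here */
		spin_lock(&hctx->lock);
		list_splice_init(list, &hctx->dispatch);
		if (!test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
			set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
		spin_unlock(&hctx->lock);
	} else if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state)) {
		/* skip the locked RMW when the bit is already clear */
		clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
	}
}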

-- 
Jens Axboe


* Re: [PATCH V2 05/20] blk-mq-sched: improve dispatching from sw queue
  2017-08-22 19:55   ` Bart Van Assche
@ 2017-08-23 19:58     ` Jens Axboe
  2017-08-24  5:52     ` Ming Lei
  1 sibling, 0 replies; 84+ messages in thread
From: Jens Axboe @ 2017-08-23 19:58 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, ming.lei, loberman

On Tue, Aug 22 2017, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > easy to cause queue busy becasue of the small
>                            ^^^^^^^
>                            because?
>  
> > -static void blk_mq_do_dispatch(struct request_queue *q,
> > -			       struct elevator_queue *e,
> > -			       struct blk_mq_hw_ctx *hctx)
> > +static inline void blk_mq_do_dispatch_sched(struct request_queue *q,
> > +					    struct elevator_queue *e,
> > +					    struct blk_mq_hw_ctx *hctx)
> >  {
> >  	LIST_HEAD(rq_list);
> 
> Why to declare this function "inline"? Are you sure that the compiler
> is not smart enough to decide on its own whether or not to inline this
> function?

Ditto in other places, too. Kill them.

-- 
Jens Axboe


* Re: [PATCH V2 10/20] blk-mq-sched: introduce helpers for query, change busy state
  2017-08-22 20:41   ` Bart Van Assche
@ 2017-08-23 20:02     ` Jens Axboe
  2017-08-24  6:55       ` Ming Lei
  2017-08-24  6:54     ` Ming Lei
  1 sibling, 1 reply; 84+ messages in thread
From: Jens Axboe @ 2017-08-23 20:02 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, ming.lei, loberman

On Tue, Aug 22 2017, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > +static inline bool blk_mq_hctx_is_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	return test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > +}
> > +
> > +static inline void blk_mq_hctx_set_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > +}
> > +
> > +static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > +}
> 
> Hello Ming,
> 
> Are these helper functions modified in a later patch? If not, please drop
> this patch. One line helper functions are not useful and do not improve
> readability of source code.

Agree, they just obfuscate the code. Only reason to do this is to do
things like:

static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx)
{
        if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
	        clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
}

to avoid unnecessary RMW (and locked instruction). At least then you can
add a single comment as to why it's done that way. Apart from that, I
prefer to open-code it so people don't have to grep to figure out wtf
blk_mq_hctx_clear_dispatch_busy() does.

-- 
Jens Axboe


* Re: [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list
  2017-08-22 20:43   ` Bart Van Assche
@ 2017-08-24  0:59     ` Damien Le Moal
  2017-08-24  7:10       ` Ming Lei
  2017-08-24  6:57     ` Ming Lei
  1 sibling, 1 reply; 84+ messages in thread
From: Damien Le Moal @ 2017-08-24  0:59 UTC (permalink / raw)
  To: Bart Van Assche, hch, linux-block, axboe, ming.lei; +Cc: loberman


On 8/23/17 05:43, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
>> +static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
>> +{
>> +	return !list_empty_careful(&hctx->dispatch);
>> +}
>> +
>> +static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
>> +		struct request *rq)
>> +{
>> +	spin_lock(&hctx->lock);
>> +	list_add(&rq->queuelist, &hctx->dispatch);
>> +	blk_mq_hctx_set_dispatch_busy(hctx);
>> +	spin_unlock(&hctx->lock);
>> +}
>> +
>> +static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
>> +		struct list_head *list)
>> +{
>> +	spin_lock(&hctx->lock);
>> +	list_splice_init(list, &hctx->dispatch);
>> +	blk_mq_hctx_set_dispatch_busy(hctx);
>> +	spin_unlock(&hctx->lock);
>> +}
>> +
>> +static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx,
>> +						    struct list_head *list)
>> +{
>> +	spin_lock(&hctx->lock);
>> +	list_splice_tail_init(list, &hctx->dispatch);
>> +	blk_mq_hctx_set_dispatch_busy(hctx);
>> +	spin_unlock(&hctx->lock);
>> +}
>> +
>> +static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx *hctx,
>> +		struct list_head *list)
>> +{
>> +	spin_lock(&hctx->lock);
>> +	list_splice_init(&hctx->dispatch, list);
>> +	spin_unlock(&hctx->lock);
>> +}
> 
> Same comment for this patch: these helper functions are so short that I'm not
> sure it is useful to introduce these helper functions.
> 
> Bart.

Personally, I like those very much as they give a place to hook up
different dispatch_list handling without having to change blk-mq.c and
blk-mq-sched.c all over the place.

I am thinking of SMR (zoned block device) support here since we need to
sort-insert write requests in blk_mq_add_rq_to_dispatch() and
blk_mq_add_list_to_dispatch_tail(). For this last one, the name would
become a little awkward though. This sort insert would be to avoid
breaking a sequential write request sequence sent by the disk user. This
is needed since this reordering breakage cannot be solved only from the
SCSI layer.

blk_mq_add_list_to_dispatch() and blk_mq_add_list_to_dispatch_tail()
could be combined together into blk_mq_add_list_to_dispatch() with the
addition of a "where" argument (BLK_MQ_INSERT_HEAD or
BLK_MQ_INSERT_TAIL), or like some other functions in blk-mq, the
addition of a simple "bool at_head" argument.
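
For illustration, such a combined helper could look roughly like this
(hypothetical signature, not from the posted series; the set_dispatch_busy
helper is the one quoted above):

static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
					       struct list_head *list,
					       bool at_head)
{
	spin_lock(&hctx->lock);
	if (at_head)
		list_splice_init(list, &hctx->dispatch);
	else
		list_splice_tail_init(list, &hctx->dispatch);
	blk_mq_hctx_set_dispatch_busy(hctx);
	spin_unlock(&hctx->lock);
}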

Best regards.

-- 
Damien Le Moal,
Western Digital Research


* Re: [PATCH V2 02/20] sbitmap: introduce __sbitmap_for_each_set()
  2017-08-22 18:28   ` Bart Van Assche
@ 2017-08-24  3:57     ` Ming Lei
  2017-08-25 21:36       ` Bart Van Assche
  0 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-24  3:57 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, osandov, loberman

On Tue, Aug 22, 2017 at 06:28:54PM +0000, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > /**
> >   * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
> > + * @off: Offset to iterate from
> >   * @sb: Bitmap to iterate over.
> >   * @fn: Callback. Should return true to continue or false to break early.
> >   * @data: Pointer to pass to callback.
> 
> Using 'off' as the name for the new argument seems confusing to me since that
> argument starts from zero and is not an offset relative to anything. Please
> consider to use 'start' as the name for this argument.

OK

> 
> > -static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
> > -					void *data)
> > +static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> > +					  unsigned int off,
> > +					  sb_for_each_fn fn, void *data)
> >  {
> > -	unsigned int i;
> > +	unsigned int index = SB_NR_TO_INDEX(sb, off);
> 
> Is it really useful to rename 'i' into 'index'? I think that change makes this
> patch harder to read than necessary.

All local variables in lib/sbitmap.c are named 'index' if they are
initialized with SB_NR_TO_INDEX().

> 
> > +	unsigned int nr = SB_NR_TO_BIT(sb, off);
> > +	unsigned int scanned = 0;
> >  
> > -	for (i = 0; i < sb->map_nr; i++) {
> > -		struct sbitmap_word *word = &sb->map[i];
> > -		unsigned int off, nr;
> > +	while (1) {
> 
> Sorry but this change looks incorrect to me. I think the following two tests
> have to be performed before the while loop starts to avoid triggering an
> out-of-bounds reference of sb->map[]:
> * Whether or not sb->map_nr is zero.

That won't happen, please see sbitmap_init_node().

> * Whether or not index >= sb->map_nr. I propose to start iterating from the
>   start of @sb in this case.

It has been checked at the end of the loop.

> 
> Additionally, the new loop in __sbitmap_for_each_set() looks more complicated
> and more fragile to me than necessary. How about using the code below? That

Why is it fragile? With the 'scanned' variable, it is guaranteed
that every bit is iterated exactly once.

> code needs one local variable less than your implementation.

The local variable 'start' can be removed from the original patch
too; I will do that in V3.

> 
> static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> 					  const unsigned int start,
> 					  sb_for_each_fn fn, void *data)
> {
> 	unsigned int i = start >> sb->shift;
> 	unsigned int nr = start & ((1 << sb->shift) - 1);
> 	bool cycled = false;
> 
> 	if (!sb->map_nr)
> 		return;

We don't support zero depth yet, so sb->map_nr can't be zero.

> 
> 	if (unlikely(i >= sb->map_nr)) {
> 		i = 0;
> 		nr = 0;

This check isn't necessary, at least for now.

> 	}
> 
> 	while (true) {
> 		struct sbitmap_word *word = &sb->map[i];
> 		unsigned int off;

It looks like you removed the check on 'word->word'.

> 
> 		off = i << sb->shift;
> 		while (1) {
> 			nr = find_next_bit(&word->word, word->depth, nr);

If we start from the middle of one word, the last search shouldn't touch
the part that has already been scanned. This approach can't guarantee that.

> 			if (cycled && off + nr >= start)
> 				return;

If 'map_nr' is 1 and start is zero, 'cycled' is still false, so we should
return here but don't; in that case we end up doing one extra search
every time.

> 
> 			if (nr >= word->depth)
> 				break;
> 
> 			if (!fn(sb, off + nr, data))
> 				return;
> 
> 			nr++;
> 		}
> 		if (++i >= sb->map_nr) {
> 			cycled = true;
> 			i = 0;
> 		}
> 		nr = 0;
> 	}
> }
> 
> Thanks,
> 
> Bart.

Given that the sbitmap is a global map, and the original patch guarantees
that each bit is iterated exactly once, I suggest taking the original
approach first.
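
For clarity, here is a rough sketch (my reconstruction, not necessarily the
exact V2/V3 code) of how a 'scanned' counter bounds the walk to sb->depth
bits regardless of the start offset:

static inline void __sbitmap_for_each_set(struct sbitmap *sb,
					  unsigned int start,
					  sb_for_each_fn fn, void *data)
{
	unsigned int index, nr, scanned = 0;

	if (start >= sb->depth)
		start = 0;
	index = SB_NR_TO_INDEX(sb, start);
	nr = SB_NR_TO_BIT(sb, start);

	while (scanned < sb->depth) {
		struct sbitmap_word *word = &sb->map[index];
		unsigned int depth = min_t(unsigned int, word->depth - nr,
					   sb->depth - scanned);

		scanned += depth;
		if (word->word) {
			/*
			 * 'nr' is non-zero only in the first word visited,
			 * so adding it back keeps find_next_bit() within
			 * this word.
			 */
			depth += nr;
			while (1) {
				nr = find_next_bit(&word->word, depth, nr);
				if (nr >= depth)
					break;
				if (!fn(sb, (index << sb->shift) + nr, data))
					return;
				nr++;
			}
		}
		/* continue with the next word, wrapping at the end */
		nr = 0;
		if (++index >= sb->map_nr)
			index = 0;
	}
}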

-- 
Ming


* Re: [PATCH V2 02/20] sbitmap: introduce __sbitmap_for_each_set()
  2017-08-22 18:37   ` Bart Van Assche
@ 2017-08-24  4:02     ` Ming Lei
  0 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-24  4:02 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, osandov, loberman

On Tue, Aug 22, 2017 at 06:37:02PM +0000, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > -static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
> > -					void *data)
> > +static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> > +					  unsigned int off,
> > +					  sb_for_each_fn fn, void *data)
> >  {
> 
> An additional comment: if a function name starts with a double underscore
> usually that either means that it should be called with a specific lock
> held or that it is an implementation function that should not be called by
> other modules. Since neither is the case for __sbitmap_for_each_set(),
> please consider to use another name for this function.

We have lots of naming like this; please see __blk_mq_end_request(),
__free_pages(), ....

-- 
Ming


* Re: [PATCH V2 03/20] blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
  2017-08-22 18:45   ` Bart Van Assche
@ 2017-08-24  4:52     ` Ming Lei
  2017-08-25 21:41       ` Bart Van Assche
  0 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-24  4:52 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, loberman

On Tue, Aug 22, 2017 at 06:45:46PM +0000, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > More importantly, for some SCSI devices, driver
> > tags are host wide, and the number is quite big,
> > but each lun has very limited queue depth.
> 
> This may be the case but is not always the case. Another important use-case
> is one LUN per host and where the queue depth per LUN is identical to the
> number of host tags.

This patchset won't hurt that case because the BUSY information is returned
from the driver. In that case, BLK_STS_RESOURCE should seldom be returned
from .queue_rq.

Also, one important fact is that once q->queue_depth is set, there is a
per-request_queue limit on pending I/Os, and the single LUN is just a
special case which is covered by this whole patchset. We don't need to pay
special attention to this special case at all.

> 
> > +struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
> > +					    struct blk_mq_ctx *start)
> > +{
> > +	unsigned off = start ? start->index_hw : 0;
> 
> Please consider to rename this function into blk_mq_dispatch_rq_from_next_ctx()
> and to start from start->index_hw + 1 instead of start->index_hw. I think that
> will not only result in simpler but also in faster code.

I believe this helper together with blk_mq_next_ctx(hctx, rq->mq_ctx)
is much simpler and easier to implement, and the code is much more
readable too.

blk_mq_dispatch_rq_from_next_ctx() is ugly and mixes two things
together.

Please see the actual implementation in patch of 'blk-mq-sched: improve
dispatching from sw queue'.

-- 
Ming


* Re: [PATCH V2 05/20] blk-mq-sched: improve dispatching from sw queue
  2017-08-22 19:55   ` Bart Van Assche
  2017-08-23 19:58     ` Jens Axboe
@ 2017-08-24  5:52     ` Ming Lei
  1 sibling, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-24  5:52 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, loberman

On Tue, Aug 22, 2017 at 07:55:55PM +0000, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > easy to cause queue busy becasue of the small
>                            ^^^^^^^
>                            because?
>  
> > -static void blk_mq_do_dispatch(struct request_queue *q,
> > -			       struct elevator_queue *e,
> > -			       struct blk_mq_hw_ctx *hctx)
> > +static inline void blk_mq_do_dispatch_sched(struct request_queue *q,
> > +					    struct elevator_queue *e,
> > +					    struct blk_mq_hw_ctx *hctx)
> >  {
> >  	LIST_HEAD(rq_list);
> 
> Why to declare this function "inline"? Are you sure that the compiler
> is not smart enough to decide on its own whether or not to inline this
> function?

OK.

> 
> > +static inline struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> > +						 struct blk_mq_ctx *ctx)
> > +{
> > +	unsigned idx = ctx->index_hw;
> > +
> > +	if (++idx == hctx->nr_ctx)
> > +		idx = 0;
> > +
> > +	return hctx->ctxs[idx];
> > +}
> > +
> > +static inline void blk_mq_do_dispatch_ctx(struct request_queue *q,
> > +					  struct blk_mq_hw_ctx *hctx)
> > +{
> > +	LIST_HEAD(rq_list);
> > +	struct blk_mq_ctx *ctx = NULL;
> > +
> > +	do {
> > +		struct request *rq;
> > +
> > +		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
> > +		if (!rq)
> > +			break;
> > +		list_add(&rq->queuelist, &rq_list);
> > +
> > +		/* round robin for fair dispatch */
> > +		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> > +	} while (blk_mq_dispatch_rq_list(q, &rq_list));
> > +}
> 
> Please consider to move the blk_mq_next_ctx() functionality into
> blk_mq_dispatch_rq_from_ctx() as requested in a comment on a previous patch.

I believe this way is cleaner and more readable; otherwise
blk_mq_dispatch_rq_from_next_ctx() would be a bit ugly.

> 
> >  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  {
> >  	struct request_queue *q = hctx->queue;
> > @@ -142,18 +172,31 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  	if (!list_empty(&rq_list)) {
> >  		blk_mq_sched_mark_restart_hctx(hctx);
> >  		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> > -	} else if (!has_sched_dispatch) {
> > +	} else if (!has_sched_dispatch & !q->queue_depth) {
> 
> Please use "&&" instead of "&" to represent logical and.

Hmm, I remember that another one was fixed against V1, but
this one was missed.

> 
> Otherwise this patch looks fine to me.

Thanks.


-- 
Ming


* Re: [PATCH V2 05/20] blk-mq-sched: improve dispatching from sw queue
  2017-08-22 20:57   ` Bart Van Assche
@ 2017-08-24  6:12     ` Ming Lei
  0 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-24  6:12 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, loberman

On Tue, Aug 22, 2017 at 08:57:03PM +0000, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > +static inline void blk_mq_do_dispatch_ctx(struct request_queue *q,
> > +					  struct blk_mq_hw_ctx *hctx)
> > +{
> > +	LIST_HEAD(rq_list);
> > +	struct blk_mq_ctx *ctx = NULL;
> > +
> > +	do {
> > +		struct request *rq;
> > +
> > +		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
> > +		if (!rq)
> > +			break;
> > +		list_add(&rq->queuelist, &rq_list);
> > +
> > +		/* round robin for fair dispatch */
> > +		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> > +	} while (blk_mq_dispatch_rq_list(q, &rq_list));
> > +}
> 
> An additional question about this patch: shouldn't request dequeuing start
> at the software queue next to the last one from which a request got dequeued
> instead of always starting at the first software queue (struct blk_mq_ctx
> *ctx = NULL) to be truly round robin?

Looks like a good idea; I will introduce a ctx hint in the hctx to start
from when blk_mq_dispatch_rq_list() returns zero. No lock is needed since
it is just a hint.
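
Something along these lines might work; this is only a rough, untested
sketch, and the 'dispatch_from' field name is just for illustration:

static void blk_mq_do_dispatch_ctx(struct request_queue *q,
				   struct blk_mq_hw_ctx *hctx)
{
	LIST_HEAD(rq_list);
	/* resume where the last run stopped; races only cost fairness */
	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);

	do {
		struct request *rq;

		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
		if (!rq)
			break;
		list_add(&rq->queuelist, &rq_list);

		/* round robin for fair dispatch */
		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
	} while (blk_mq_dispatch_rq_list(q, &rq_list));

	WRITE_ONCE(hctx->dispatch_from, ctx);
}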

-- 
Ming


* Re: [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  2017-08-22 20:09   ` Bart Van Assche
@ 2017-08-24  6:18     ` Ming Lei
  0 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-24  6:18 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, loberman

On Tue, Aug 22, 2017 at 08:09:32PM +0000, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > +	/*
> > +	 * Wherever DISPATCH_BUSY is set, blk_mq_run_hw_queue()
> > +	 * will be run to try to make progress, so it is always
> > +	 * safe to check the state here.
> > +	 */
> > +	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
> > +		return;
> 
> The comment above test_bit() is useful but does not explain the purpose of
> the early return. Is the purpose of the early return perhaps to serialize
> blk_mq_sched_dispatch_requests() calls? If so, please mention this.

If the bit is set, that means there are requests in hctx->dispatch, so we
return early to avoid dequeuing requests from the sw/scheduler queue
unnecessarily.

I thought the code was self-explanatory, so I didn't explain it; I will add
the comment.
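
Something like the following wording, as a sketch on top of the quoted
check (the comment text is mine, not from the patch):

	/*
	 * Requests are still left in hctx->dispatch, i.e. the LLD queue
	 * was busy, so don't dequeue anything from the sw/scheduler
	 * queues for now; the run queue instance that flushes ->dispatch
	 * will make progress and re-run the hw queue.
	 */
	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
		return;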

-- 
Ming


* Re: [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  2017-08-23 19:56   ` Jens Axboe
@ 2017-08-24  6:38     ` Ming Lei
  2017-08-25 10:19       ` Ming Lei
  0 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-24  6:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Laurence Oberman

On Wed, Aug 23, 2017 at 01:56:50PM -0600, Jens Axboe wrote:
> On Sat, Aug 05 2017, Ming Lei wrote:
> > During dispatching, we moved all requests from hctx->dispatch to
> > one temporary list, then dispatch them one by one from this list.
> > Unfortunately duirng this period, run queue from other contexts
> > may think the queue is idle, then start to dequeue from sw/scheduler
> > queue and still try to dispatch because ->dispatch is empty. This way
> > hurts sequential I/O performance because requests are dequeued when
> > lld queue is busy.
> > 
> > This patch introduces the state of BLK_MQ_S_DISPATCH_BUSY to
> > make sure that request isn't dequeued until ->dispatch is
> > flushed.
> 
> I don't like how this patch introduces a bunch of locked setting of a
> flag under the hctx lock. Especially since I think we can easily avoid
> it.

Actually the lock isn't needed for setting the flag; I will move it out
of the lock in V3.

> 
> > -	} else if (!has_sched_dispatch & !q->queue_depth) {
> > +		blk_mq_dispatch_rq_list(q, &rq_list);
> > +
> > +		/*
> > +		 * We may clear DISPATCH_BUSY just after it
> > +		 * is set from another context, the only cost
> > +		 * is that one request is dequeued a bit early,
> > +		 * we can survive that. Given the window is
> > +		 * too small, no need to worry about performance
> > +		 * effect.
> > +		 */
> > +		if (list_empty_careful(&hctx->dispatch))
> > +			clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> 
> This is basically the only place where we modify it without holding the
> hctx lock. Can we move it into blk_mq_dispatch_rq_list()? The list is

The problem is that blk_mq_dispatch_rq_list() doesn't know whether it is
handling requests from hctx->dispatch or from the sw/scheduler queue. We only
need to clear the bit after hctx->dispatch is flushed, so the clearing
can't be moved into blk_mq_dispatch_rq_list().

> generally empty, unless for the case where we splice residuals back. If
> we splice them back, we grab the lock anyway.
> 
> The other places it's set under the hctx lock, yet we end up using an
> atomic operation to do it.

As the comment explains, it needn't be atomic.

-- 
Ming


* Re: [PATCH V2 08/20] blk-mq-sched: use q->queue_depth as hint for q->nr_requests
  2017-08-22 20:20   ` Bart Van Assche
@ 2017-08-24  6:39     ` Ming Lei
  0 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-24  6:39 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, loberman

On Tue, Aug 22, 2017 at 08:20:20PM +0000, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > SCSI sets q->queue_depth from shost->cmd_per_lun, and
> > q->queue_depth is per request_queue and more related to
> > scheduler queue compared with hw queue depth, which can be
> > shared by queues, such as TAG_SHARED.
> > 
> > This patch trys to use q->queue_depth as hint for computing
>              ^^^^
>              tries?

Will fix it in V3.

> > q->nr_requests, which should be more effective than
> > current way.
> 
> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>

Thanks!

-- 
Ming


* Re: [PATCH V2 09/20] blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
  2017-08-22 21:55   ` Bart Van Assche
  2017-08-23  6:46     ` Hannes Reinecke
@ 2017-08-24  6:52     ` Ming Lei
  2017-08-25 22:23       ` Bart Van Assche
  1 sibling, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-24  6:52 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, loberman

On Tue, Aug 22, 2017 at 09:55:57PM +0000, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > +	/*
> > +	 * if there is q->queue_depth, all hw queues share
> > +	 * this queue depth limit
> > +	 */
> > +	if (q->queue_depth) {
> > +		queue_for_each_hw_ctx(q, hctx, i)
> > +			hctx->flags |= BLK_MQ_F_SHARED_DEPTH;
> > +	}
> > +
> > +	if (!q->elevator)
> > +		goto exit;
> 
> Hello Ming,
> 
> It seems very fragile to me to set BLK_MQ_F_SHARED_DEPTH if and only if
> q->queue_depth != 0. Wouldn't it be better to let the block driver tell the
> block layer whether or not there is a queue depth limit across hardware
> queues, e.g. through a tag set flag?

One reason for not doing it that way is that q->queue_depth can be
changed via the sysfs interface.

Another reason is that it is better not to expose this flag to drivers, since
that isn't necessary; in other words, it is really an internal flag.

-- 
Ming


* Re: [PATCH V2 10/20] blk-mq-sched: introduce helpers for query, change busy state
  2017-08-22 20:41   ` Bart Van Assche
  2017-08-23 20:02     ` Jens Axboe
@ 2017-08-24  6:54     ` Ming Lei
  1 sibling, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-24  6:54 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, loberman

On Tue, Aug 22, 2017 at 08:41:14PM +0000, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > +static inline bool blk_mq_hctx_is_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	return test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > +}
> > +
> > +static inline void blk_mq_hctx_set_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > +}
> > +
> > +static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > +}
> 
> Hello Ming,
> 
> Are these helper functions modified in a later patch? If not, please drop
> this patch. One line helper functions are not useful and do not improve
> readability of source code.

It is definitely modified in a later patch :-)

-- 
Ming


* Re: [PATCH V2 10/20] blk-mq-sched: introduce helpers for query, change busy state
  2017-08-23 20:02     ` Jens Axboe
@ 2017-08-24  6:55       ` Ming Lei
  0 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-24  6:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Bart Van Assche, hch, linux-block, loberman

On Wed, Aug 23, 2017 at 02:02:20PM -0600, Jens Axboe wrote:
> On Tue, Aug 22 2017, Bart Van Assche wrote:
> > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > > +static inline bool blk_mq_hctx_is_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > > +{
> > > +	return test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > > +}
> > > +
> > > +static inline void blk_mq_hctx_set_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > > +{
> > > +	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > > +}
> > > +
> > > +static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > > +{
> > > +	clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > > +}
> > 
> > Hello Ming,
> > 
> > Are these helper functions modified in a later patch? If not, please drop
> > this patch. One line helper functions are not useful and do not improve
> > readability of source code.
> 
> Agree, they just obfuscate the code. Only reason to do this is to do
> things like:

If you look at the following patch, these introduced functions are
modified a lot.

> 
> static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> {
>         if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
> 	        clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> }
> 
> to avoid unecessary RMW (and locked instruction). At least then you can
> add a single comment as to why it's done that way. Apart from that, I
> prefer to open-code it so people don't have to grep to figure out wtf
> blk_mq_hctx_clear_dispatch_busy() does.

OK, will do it that way.



-- 
Ming


* Re: [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list
  2017-08-22 20:43   ` Bart Van Assche
  2017-08-24  0:59     ` Damien Le Moal
@ 2017-08-24  6:57     ` Ming Lei
  1 sibling, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-24  6:57 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, loberman

On Tue, Aug 22, 2017 at 08:43:12PM +0000, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > +static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	return !list_empty_careful(&hctx->dispatch);
> > +}
> > +
> > +static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
> > +		struct request *rq)
> > +{
> > +	spin_lock(&hctx->lock);
> > +	list_add(&rq->queuelist, &hctx->dispatch);
> > +	blk_mq_hctx_set_dispatch_busy(hctx);
> > +	spin_unlock(&hctx->lock);
> > +}
> > +
> > +static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
> > +		struct list_head *list)
> > +{
> > +	spin_lock(&hctx->lock);
> > +	list_splice_init(list, &hctx->dispatch);
> > +	blk_mq_hctx_set_dispatch_busy(hctx);
> > +	spin_unlock(&hctx->lock);
> > +}
> > +
> > +static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx,
> > +						    struct list_head *list)
> > +{
> > +	spin_lock(&hctx->lock);
> > +	list_splice_tail_init(list, &hctx->dispatch);
> > +	blk_mq_hctx_set_dispatch_busy(hctx);
> > +	spin_unlock(&hctx->lock);
> > +}
> > +
> > +static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx *hctx,
> > +		struct list_head *list)
> > +{
> > +	spin_lock(&hctx->lock);
> > +	list_splice_init(&hctx->dispatch, list);
> > +	spin_unlock(&hctx->lock);
> > +}
> 
> Same comment for this patch: these helper functions are so short that I'm not
> sure it is useful to introduce these helper functions.

They will become longer in the following patches.

-- 
Ming


* Re: [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list
  2017-08-24  0:59     ` Damien Le Moal
@ 2017-08-24  7:10       ` Ming Lei
  2017-08-24  7:42         ` Damien Le Moal
  0 siblings, 1 reply; 84+ messages in thread
From: Ming Lei @ 2017-08-24  7:10 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Bart Van Assche, hch, linux-block, axboe, loberman

On Thu, Aug 24, 2017 at 09:59:13AM +0900, Damien Le Moal wrote:
> 
> On 8/23/17 05:43, Bart Van Assche wrote:
> > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> >> +static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
> >> +{
> >> +	return !list_empty_careful(&hctx->dispatch);
> >> +}
> >> +
> >> +static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
> >> +		struct request *rq)
> >> +{
> >> +	spin_lock(&hctx->lock);
> >> +	list_add(&rq->queuelist, &hctx->dispatch);
> >> +	blk_mq_hctx_set_dispatch_busy(hctx);
> >> +	spin_unlock(&hctx->lock);
> >> +}
> >> +
> >> +static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
> >> +		struct list_head *list)
> >> +{
> >> +	spin_lock(&hctx->lock);
> >> +	list_splice_init(list, &hctx->dispatch);
> >> +	blk_mq_hctx_set_dispatch_busy(hctx);
> >> +	spin_unlock(&hctx->lock);
> >> +}
> >> +
> >> +static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx,
> >> +						    struct list_head *list)
> >> +{
> >> +	spin_lock(&hctx->lock);
> >> +	list_splice_tail_init(list, &hctx->dispatch);
> >> +	blk_mq_hctx_set_dispatch_busy(hctx);
> >> +	spin_unlock(&hctx->lock);
> >> +}
> >> +
> >> +static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx *hctx,
> >> +		struct list_head *list)
> >> +{
> >> +	spin_lock(&hctx->lock);
> >> +	list_splice_init(&hctx->dispatch, list);
> >> +	spin_unlock(&hctx->lock);
> >> +}
> > 
> > Same comment for this patch: these helper functions are so short that I'm not
> > sure it is useful to introduce these helper functions.
> > 
> > Bart.
> 
> Personally, I like those very much as they give a place to hook up
> different dispatch_list handling without having to change blk-mq.c and
> blk-mq-sched.c all over the place.
> 
> I am thinking of SMR (zoned block device) support here since we need to
> to sort insert write requests in blk_mq_add_rq_to_dispatch() and
> blk_mq_add_list_to_dispatch_tail(). For this last one, the name would

Could you explain a bit why you would sort-insert write requests in these
two helpers, which are only triggered when dispatch is busy?

> become a little awkward though. This sort insert would be to avoid
> breaking a sequential write request sequence sent by the disk user. This
> is needed since this reordering breakage cannot be solved only from the
> SCSI layer.

Basically this patchset tries to flush hctx->dispatch first, and only
fetches requests from the scheduler queue after hctx->dispatch has been
flushed.

It isn't strictly ordered that way, though, because the implementation is
lockless.

So it looks like requests from hctx->dispatch won't break the ordering of
requests coming from the scheduler.
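
For illustration only, the intended ordering looks roughly like the sketch
below. The helper names follow the ones quoted earlier in this thread; the
wrapper function and the final dequeue call are hypothetical
simplifications, not the actual patch.

static void dispatch_order_sketch(struct request_queue *q,
				  struct blk_mq_hw_ctx *hctx)
{
	LIST_HEAD(rq_list);

	if (blk_mq_has_dispatch_rqs(hctx)) {
		/* flush requests parked on ->dispatch first */
		blk_mq_take_list_from_dispatch(hctx, &rq_list);
		blk_mq_dispatch_rq_list(q, &rq_list);
	}

	/* only dequeue new requests once ->dispatch has been drained */
	if (!test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
		dequeue_from_sw_or_scheduler_queue(hctx);	/* hypothetical */
}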


-- 
Ming

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 14/20] blk-mq-sched: improve IO scheduling on SCSI devcie
  2017-08-22 20:51   ` Bart Van Assche
@ 2017-08-24  7:14     ` Ming Lei
  0 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-24  7:14 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, loberman

On Tue, Aug 22, 2017 at 08:51:32PM +0000, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > This patch introduces per-request_queue dispatch
> > list for this purpose, and only when all requests
> > in this list are dispatched out successfully, we
> > can restart to dequeue request from sw/scheduler
> > queue and dispath it to lld.
> 
> Wasn't one of the design goals of blk-mq to avoid a per-request_queue list?

Yes, but why do SCSI devices have a per-request_queue depth?

Anyway, this single dispatch list won't hurt devices which don't have a
per-request_queue depth.

For a SCSI device which has q->queue_depth, if one hctx is stuck, all
hctxs are effectively stuck; the single dispatch list fits exactly this
kind of use case, doesn't it?

-- 
Ming

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list
  2017-08-24  7:10       ` Ming Lei
@ 2017-08-24  7:42         ` Damien Le Moal
  0 siblings, 0 replies; 84+ messages in thread
From: Damien Le Moal @ 2017-08-24  7:42 UTC (permalink / raw)
  To: Ming Lei; +Cc: Bart Van Assche, hch, linux-block, axboe, loberman



On 8/24/17 16:10, Ming Lei wrote:
> On Thu, Aug 24, 2017 at 09:59:13AM +0900, Damien Le Moal wrote:
>>
>> On 8/23/17 05:43, Bart Van Assche wrote:
>>> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
>>>> +static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
>>>> +{
>>>> +	return !list_empty_careful(&hctx->dispatch);
>>>> +}
>>>> +
>>>> +static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
>>>> +		struct request *rq)
>>>> +{
>>>> +	spin_lock(&hctx->lock);
>>>> +	list_add(&rq->queuelist, &hctx->dispatch);
>>>> +	blk_mq_hctx_set_dispatch_busy(hctx);
>>>> +	spin_unlock(&hctx->lock);
>>>> +}
>>>> +
>>>> +static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
>>>> +		struct list_head *list)
>>>> +{
>>>> +	spin_lock(&hctx->lock);
>>>> +	list_splice_init(list, &hctx->dispatch);
>>>> +	blk_mq_hctx_set_dispatch_busy(hctx);
>>>> +	spin_unlock(&hctx->lock);
>>>> +}
>>>> +
>>>> +static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx,
>>>> +						    struct list_head *list)
>>>> +{
>>>> +	spin_lock(&hctx->lock);
>>>> +	list_splice_tail_init(list, &hctx->dispatch);
>>>> +	blk_mq_hctx_set_dispatch_busy(hctx);
>>>> +	spin_unlock(&hctx->lock);
>>>> +}
>>>> +
>>>> +static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx *hctx,
>>>> +		struct list_head *list)
>>>> +{
>>>> +	spin_lock(&hctx->lock);
>>>> +	list_splice_init(&hctx->dispatch, list);
>>>> +	spin_unlock(&hctx->lock);
>>>> +}
>>>
>>> Same comment for this patch: these helper functions are so short that I'm not
>>> sure it is useful to introduce these helper functions.
>>>
>>> Bart.
>>
>> Personally, I like those very much as they give a place to hook up
>> different dispatch_list handling without having to change blk-mq.c and
>> blk-mq-sched.c all over the place.
>>
>> I am thinking of SMR (zoned block device) support here since we need to
>> to sort insert write requests in blk_mq_add_rq_to_dispatch() and
>> blk_mq_add_list_to_dispatch_tail(). For this last one, the name would
> 
> Could you explain a bit why you sort insert write rq in these two
> helpers? which are only triggered in case of dispatch busy.

For host-managed zoned block devices (SMR disks), writes have to be
sequential per zone of the device, otherwise an "unaligned write error"
is returned by the device. This constraint is exposed to the user (FS,
device mapper or applications doing raw I/Os to the block device). So a
well-behaved user (f2fs and dm-zoned in the kernel) will issue
sequential writes to zones (this includes write same and write zeroes).
However, the current blk-mq code does not guarantee that this
issuing order will be respected at dispatch time. There are multiple
reasons:

1) On submit, requests first go into per-CPU context lists. So if the
issuer thread is moved from one CPU to another in the middle of a
sequential write request sequence, the sequence would be broken into two
parts in different lists. When a scheduler is used, this can get fixed
by a nice LBA ordering (mq-deadline would do that), but in the "none"
scheduler case, the requests in the CPU lists are merged into the hctx
dispatch queue in CPU number order, potentially reordering the write
sequence. Hence the sort insert to fix that.

2) Dispatch may happen in parallel with multiple contexts getting
requests from the scheduler (or the CPU context lists) and again break
write sequences. I think your patch partly addresses this with the BUSY
flag, but only when requests are already in the dispatch queue. However, at
dispatch time, requests go directly from the scheduler (or CPU context
lists) into the dispatch caller local list, bypassing the hctx dispatch
queue. Write ordering gets broken again here.

3) The SCSI layer locks a device zone when it sees a write, to prevent
subsequent writes to the same zone. This is to avoid reordering at the
HBA level (AHCI has a 100% chance of doing it). In effect, this is a
per-zone write QD=1 implementation, and that causes requeues of write
requests. In that case, the requeue happens after the ->queue_rq()
call, so it uses the dispatch caller's local list. Combine that with (2)
and reordering can happen here too.

I have a series implementing a "dispatcher", which is basically a set of
operations that a device driver can specify to handle the dispatch
queue. The operations are basically "insert" and "get", so what you have
as helpers. The default dispatcher uses operations very similar to your
helpers minus the busy bit. That is, the default dispatcher does not
change the current behavior. But a ZBC dispatcher can be installed by
the scsi layer for a ZBC drive at device scan time and do the sort
insert of write requests to address all the potential reordering.

That series goes beyond what you did, and I was waiting for your series
to stabilize to rebase (or rework) what I did on top of your changes
which look like they would simplify handling of ZBC drives.

Note: ZBC drives are single queue devices, so they always only have a
single hctx.
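
For illustration, the kind of sort insert I have in mind could look
roughly like the sketch below (a hypothetical helper, not code from my
series), ordering the dispatch list by start sector and reusing the
set-busy helper from your patch:

static void blk_mq_sort_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
					   struct request *rq)
{
	struct request *pos;

	spin_lock(&hctx->lock);
	list_for_each_entry(pos, &hctx->dispatch, queuelist) {
		if (blk_rq_pos(pos) > blk_rq_pos(rq))
			break;
	}
	/* insert before the first request with a higher start sector */
	list_add_tail(&rq->queuelist, &pos->queuelist);
	blk_mq_hctx_set_dispatch_busy(hctx);
	spin_unlock(&hctx->lock);
}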

> 
>> become a little awkward though. This sort insert would be to avoid
>> breaking a sequential write request sequence sent by the disk user. This
>> is needed since this reordering breakage cannot be solved only from the
>> SCSI layer.
> 
> Basically this patchset tries to flush hctx->dispatch first, then
> fetch requests from scheduler queue after htct->dispatch is flushed.
> 
> But it isn't strictly in this way because the implementation is lockless.
> 
> So looks the request from hctx->dispatch won't break one coming
> requests from scheduler.

Yes, I understood that. The result would be requests staying longer
in the scheduler, increasing the chances of merging. I think this is a good
idea. But unfortunately, I do not think that it will solve the ZBC
sequential write case on its own. In particular, this does not change
much about what happens to write request ordering without a scheduler, nor
about the potential reordering on requeue. But I may be completely wrong
here. I need to look at your code more carefully.

I have not yet tested your code on ZBC drives. I will do it as soon as I
can.

Thanks.

Best regards.

-- 
Damien Le Moal,
Western Digital Research

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  2017-08-24  6:38     ` Ming Lei
@ 2017-08-25 10:19       ` Ming Lei
  0 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-25 10:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Laurence Oberman

On Thu, Aug 24, 2017 at 02:38:36PM +0800, Ming Lei wrote:
> On Wed, Aug 23, 2017 at 01:56:50PM -0600, Jens Axboe wrote:
> > On Sat, Aug 05 2017, Ming Lei wrote:
> > > During dispatching, we moved all requests from hctx->dispatch to
> > > one temporary list, then dispatch them one by one from this list.
> > > Unfortunately during this period, run queue from other contexts
> > > may think the queue is idle, then start to dequeue from sw/scheduler
> > > queue and still try to dispatch because ->dispatch is empty. This way
> > > hurts sequential I/O performance because requests are dequeued when
> > > lld queue is busy.
> > > 
> > > This patch introduces the state of BLK_MQ_S_DISPATCH_BUSY to
> > > make sure that request isn't dequeued until ->dispatch is
> > > flushed.
> > 
> > I don't like how this patch introduces a bunch of locked setting of a
> > flag under the hctx lock. Especially since I think we can easily avoid
> > it.
> 
> Actually the lock isn't needed for setting the flag, will move it out
> in V3.

My fault, it looks like we can't move it out of the lock: the newly
added requests can be flushed, and the bit cleared along with them, in
the window between adding the list to ->dispatch and setting
BLK_MQ_S_DISPATCH_BUSY; after that the bit is never cleared and an I/O
hang results.
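
In other words, the bit has to be set before hctx->lock is dropped, as in
the sketch below (illustrative only, mirroring the add-to-dispatch helpers
quoted elsewhere in this thread):

	spin_lock(&hctx->lock);
	list_splice_init(list, &hctx->dispatch);
	/*
	 * Set the bit before dropping the lock: if it were set after the
	 * unlock, another context could flush ->dispatch and clear the bit
	 * in between, leaving the bit set forever afterwards -> I/O hang.
	 */
	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
	spin_unlock(&hctx->lock);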

> 
> > 
> > > -	} else if (!has_sched_dispatch & !q->queue_depth) {
> > > +		blk_mq_dispatch_rq_list(q, &rq_list);
> > > +
> > > +		/*
> > > +		 * We may clear DISPATCH_BUSY just after it
> > > +		 * is set from another context, the only cost
> > > +		 * is that one request is dequeued a bit early,
> > > +		 * we can survive that. Given the window is
> > > +		 * too small, no need to worry about performance
> > > +		 * effect.
> > > +		 */
> > > +		if (list_empty_careful(&hctx->dispatch))
> > > +			clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > 
> > This is basically the only place where we modify it without holding the
> > hctx lock. Can we move it into blk_mq_dispatch_rq_list()? The list is
> 
> The problem is that blk_mq_dispatch_rq_list() doesn't know if it is
> handling requests from hctx->dispatch or the sw/scheduler queue. We only
> need to clear the bit after hctx->dispatch is flushed. So the clearing
> can't be moved into blk_mq_dispatch_rq_list().
> 
> > generally empty, unless for the case where we splice residuals back. If
> > we splice them back, we grab the lock anyway.
> > 
> > The other places it's set under the hctx lock, yet we end up using an
> > atomic operation to do it.

In theory, it is better to hold the lock while clearing the bit, but
that costs one extra lock acquisition, whether or not we move the
clearing into blk_mq_dispatch_rq_list().

We can move clear_bit() into blk_mq_dispatch_rq_list() and pass one
parameter to indicate whether it is handling requests from ->dispatch;
then the following code is needed at the end of
blk_mq_dispatch_rq_list():

	if (list_empty(list)) {
		if (rq_from_dispatch_list) {
			spin_lock(&hctx->lock);
			if (list_empty_careful(&hctx->dispatch))
				clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
			spin_unlock(&hctx->lock);
		}
	}

If we clear the bit locklessly, the BUSY bit may be cleared early and
requests dequeued a bit early; that is acceptable because the race
window is so small.

-- 
Ming

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 02/20] sbitmap: introduce __sbitmap_for_each_set()
  2017-08-24  3:57     ` Ming Lei
@ 2017-08-25 21:36       ` Bart Van Assche
  2017-08-26  8:43         ` Ming Lei
  0 siblings, 1 reply; 84+ messages in thread
From: Bart Van Assche @ 2017-08-25 21:36 UTC (permalink / raw)
  To: ming.lei; +Cc: hch, linux-block, osandov, axboe, loberman

On Thu, 2017-08-24 at 11:57 +0800, Ming Lei wrote:
> On Tue, Aug 22, 2017 at 06:28:54PM +0000, Bart Van Assche wrote:
> > * Whether or not index >= sb->map_nr. I propose to start iterating from the
> >   start of @sb in this case.
> 
> It has been checked at the end of the loop.

That's not sufficient to avoid an out-of-bounds access if the start index is
large. If __sbitmap_for_each_set() would accept values for the start index
argument that result in index >= sb->map_nr then that will simplify code that
accesses an sbitmap in a round-robin fashion.

> > 	}
> > 
> > 	while (true) {
> > 		struct sbitmap_word *word = &sb->map[i];
> > 		unsigned int off;
> 
> Looks you removed the check on 'word->word'.

Yes, and I did that on purpose. If the start index refers to a word that is
zero then the "if (word->word) continue;" code will cause the end-of-loop
check to be skipped and hence will cause an infinite loop.

Bart.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 03/20] blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
  2017-08-24  4:52     ` Ming Lei
@ 2017-08-25 21:41       ` Bart Van Assche
  2017-08-26  8:47         ` Ming Lei
  0 siblings, 1 reply; 84+ messages in thread
From: Bart Van Assche @ 2017-08-25 21:41 UTC (permalink / raw)
  To: ming.lei; +Cc: hch, linux-block, axboe, loberman

On Thu, 2017-08-24 at 12:52 +0800, Ming Lei wrote:
> On Tue, Aug 22, 2017 at 06:45:46PM +0000, Bart Van Assche wrote:
> > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > > More importantly, for some SCSI devices, driver
> > > tags are host wide, and the number is quite big,
> > > but each lun has very limited queue depth.
> > 
> > This may be the case but is not always the case. Another important use-case
> > is one LUN per host and where the queue depth per LUN is identical to the
> > number of host tags.
> 
> This patchset won't hurt this case because the BUSY info is returned
> from driver.  In this case, BLK_STS_RESOURCE should seldom be returned
> from .queue_rq generally.
> 
> Also one important fact is that once q->queue_depth is set, that
> means there is per-request_queue limit on pending I/Os, and the
> single LUN is just the special case which is covered by this whole
> patchset. We don't need to pay special attention in this special case
> at all.

The purpose of my comment was not to ask for further clarification but to
report that the description of this patch is not correct.

> > 
> > > +struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
> > > +					    struct blk_mq_ctx *start)
> > > +{
> > > +	unsigned off = start ? start->index_hw : 0;
> > 
> > Please consider to rename this function into blk_mq_dispatch_rq_from_next_ctx()
> > and to start from start->index_hw + 1 instead of start->index_hw. I think that
> > will not only result in simpler but also in faster code.
> 
> I believe this helper with blk_mq_next_ctx(hctx, rq->mq_ctx) together
> will be much simpler and easier to implement, and code can be much
> readable too.
> 
> blk_mq_dispatch_rq_from_next_ctx() is ugly and mixing two things
> together.

Sorry but I disagree with both of the above statements.

Bart.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 09/20] blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
  2017-08-24  6:52     ` Ming Lei
@ 2017-08-25 22:23       ` Bart Van Assche
  2017-08-26  8:53         ` Ming Lei
  0 siblings, 1 reply; 84+ messages in thread
From: Bart Van Assche @ 2017-08-25 22:23 UTC (permalink / raw)
  To: ming.lei; +Cc: hch, linux-block, axboe, loberman

On Thu, 2017-08-24 at 14:52 +0800, Ming Lei wrote:
> On Tue, Aug 22, 2017 at 09:55:57PM +0000, Bart Van Assche wrote:
> > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > > +	/*
> > > +	 * if there is q->queue_depth, all hw queues share
> > > +	 * this queue depth limit
> > > +	 */
> > > +	if (q->queue_depth) {
> > > +		queue_for_each_hw_ctx(q, hctx, i)
> > > +			hctx->flags |= BLK_MQ_F_SHARED_DEPTH;
> > > +	}
> > > +
> > > +	if (!q->elevator)
> > > +		goto exit;
> > 
> > Hello Ming,
> > 
> > It seems very fragile to me to set BLK_MQ_F_SHARED_DEPTH if and only if
> > q->queue_depth != 0. Wouldn't it be better to let the block driver tell the
> > block layer whether or not there is a queue depth limit across hardware
> > queues, e.g. through a tag set flag?
> 
> One reason for not doing in that way is because q->queue_depth can be
> changed via sysfs interface.

Only the SCSI core allows the queue depth to be changed through sysfs. The
other block drivers I am familiar with set the queue depth when the block
layer queue is created and do not allow the queue depth to be changed later
on.

> Another reason is that better to not exposing this flag to drivers since
> it isn't necessary, that said it is an internal flag actually.

As far as I know only the SCSI core can create request queues that have a
queue depth that is lower than the number of tags in the tag set. So for all
block drivers except the SCSI core it is OK to dispatch all requests at once
to which a hardware tag has been assigned. My proposal is either to let
drivers like the SCSI core set BLK_MQ_F_SHARED_DEPTH or to modify
blk_mq_sched_dispatch_requests() such that it flushes all requests if the
request queue depth is not lower than the hardware tag set size instead of if
q->queue_depth == 0.

Bart.
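
For illustration, the comparison described above could look like the
following minimal sketch (assuming q->tag_set is valid at that point; not
a tested change), where the shared-depth handling only kicks in when the
queue depth is really lower than the tag set size:

	if (q->queue_depth && q->queue_depth < q->tag_set->queue_depth) {
		queue_for_each_hw_ctx(q, hctx, i)
			hctx->flags |= BLK_MQ_F_SHARED_DEPTH;
	}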

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 02/20] sbitmap: introduce __sbitmap_for_each_set()
  2017-08-25 21:36       ` Bart Van Assche
@ 2017-08-26  8:43         ` Ming Lei
  0 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-26  8:43 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, osandov, axboe, loberman

On Fri, Aug 25, 2017 at 09:36:26PM +0000, Bart Van Assche wrote:
> On Thu, 2017-08-24 at 11:57 +0800, Ming Lei wrote:
> > On Tue, Aug 22, 2017 at 06:28:54PM +0000, Bart Van Assche wrote:
> > > * Whether or not index >= sb->map_nr. I propose to start iterating from the
> > >   start of @sb in this case.
> > 
> > It has been checked at the end of the loop.
> 
> That's not sufficient to avoid an out-of-bounds access if the start index is
> large. If __sbitmap_for_each_set() would accept values for the start index
> argument that result in index >= sb->map_nr then that will simplify code that
> accesses an sbitmap in a round-robin fashion.

Given the only user of this helper is blk_mq_dispatch_rq_from_ctx(), the
start index won't be out of bounds.

> 
> > > 	}
> > > 
> > > 	while (true) {
> > > 		struct sbitmap_word *word = &sb->map[i];
> > > 		unsigned int off;
> > 
> > Looks you removed the check on 'word->word'.
> 
> Yes, and I did that on purpose. If the start index refers to a word that is
> zero then the "if (word->word) continue;" code will cause the end-of-loop
> check to be skipped and hence will cause an infinite loop.

Got it, but it removes the optimization too, :-)
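
For what it's worth, a loop shape like the sketch below would keep
skipping all-zero words while still running the wrap check on every pass
(illustrative only; the function name and body are hypothetical, only the
sbitmap fields are real):

static void sbitmap_round_robin_sketch(struct sbitmap *sb, unsigned int start)
{
	unsigned int index = start;	/* caller guarantees start < sb->map_nr */

	while (true) {
		struct sbitmap_word *word = &sb->map[index];

		if (word->word) {
			/* ... visit the set bits in this word ... */
		}

		/* the wrap/termination check runs on every pass */
		if (++index >= sb->map_nr)
			index = 0;
		if (index == start)
			break;
	}
}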

-- 
Ming

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 03/20] blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
  2017-08-25 21:41       ` Bart Van Assche
@ 2017-08-26  8:47         ` Ming Lei
  0 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-26  8:47 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, loberman

On Fri, Aug 25, 2017 at 09:41:08PM +0000, Bart Van Assche wrote:
> On Thu, 2017-08-24 at 12:52 +0800, Ming Lei wrote:
> > On Tue, Aug 22, 2017 at 06:45:46PM +0000, Bart Van Assche wrote:
> > > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > > > More importantly, for some SCSI devices, driver
> > > > tags are host wide, and the number is quite big,
> > > > but each lun has very limited queue depth.
> > > 
> > > This may be the case but is not always the case. Another important use-case
> > > is one LUN per host and where the queue depth per LUN is identical to the
> > > number of host tags.
> > 
> > This patchset won't hurt this case because the BUSY info is returned
> > from driver.  In this case, BLK_STS_RESOURCE should seldom be returned
> > from .queue_rq generally.
> > 
> > Also one important fact is that once q->queue_depth is set, that
> > means there is per-request_queue limit on pending I/Os, and the
> > single LUN is just the special case which is covered by this whole
> > patchset. We don't need to pay special attention in this special case
> > at all.
> 
> The purpose of my comment was not to ask for further clarification but to
> report that the description of this patch is not correct.

OK, will change the commit log in V3.

> 
> > > 
> > > > +struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
> > > > +					    struct blk_mq_ctx *start)
> > > > +{
> > > > +	unsigned off = start ? start->index_hw : 0;
> > > 
> > > Please consider to rename this function into blk_mq_dispatch_rq_from_next_ctx()
> > > and to start from start->index_hw + 1 instead of start->index_hw. I think that
> > > will not only result in simpler but also in faster code.
> > 
> > I believe this helper with blk_mq_next_ctx(hctx, rq->mq_ctx) together
> > will be much simpler and easier to implement, and code can be much
> > readable too.
> > 
> > blk_mq_dispatch_rq_from_next_ctx() is ugly and mixing two things
> > together.
> 
> Sorry but I disagree with both of the above statements.

I will post V3; please comment on that patch regarding this issue,
especially since full round-robin is added.

-- 
Ming

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V2 09/20] blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
  2017-08-25 22:23       ` Bart Van Assche
@ 2017-08-26  8:53         ` Ming Lei
  0 siblings, 0 replies; 84+ messages in thread
From: Ming Lei @ 2017-08-26  8:53 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, loberman

On Fri, Aug 25, 2017 at 10:23:09PM +0000, Bart Van Assche wrote:
> On Thu, 2017-08-24 at 14:52 +0800, Ming Lei wrote:
> > On Tue, Aug 22, 2017 at 09:55:57PM +0000, Bart Van Assche wrote:
> > > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > > > +	/*
> > > > +	 * if there is q->queue_depth, all hw queues share
> > > > +	 * this queue depth limit
> > > > +	 */
> > > > +	if (q->queue_depth) {
> > > > +		queue_for_each_hw_ctx(q, hctx, i)
> > > > +			hctx->flags |= BLK_MQ_F_SHARED_DEPTH;
> > > > +	}
> > > > +
> > > > +	if (!q->elevator)
> > > > +		goto exit;
> > > 
> > > Hello Ming,
> > > 
> > > It seems very fragile to me to set BLK_MQ_F_SHARED_DEPTH if and only if
> > > q->queue_depth != 0. Wouldn't it be better to let the block driver tell the
> > > block layer whether or not there is a queue depth limit across hardware
> > > queues, e.g. through a tag set flag?
> > 
> > One reason for not doing in that way is because q->queue_depth can be
> > changed via sysfs interface.
> 
> Only the SCSI core allows the queue depth to be changed through sysfs. The
> other block drivers I am familiar with set the queue depth when the block
> layer queue is created and do not allow the queue depth to be changed later
> on.

Actually the SCSI core is the only user of q->queue_depth, and it
supports changing it via sysfs.

> 
> > Another reason is that better to not exposing this flag to drivers since
> > it isn't necessary, that said it is an internal flag actually.
> 
> As far as I know only the SCSI core can create request queues that have a
> queue depth that is lower than the number of tags in the tag set. So for all
> block drivers except the SCSI core it is OK to dispatch all requests at once
> to which a hardware tag has been assigned. My proposal is either to let
> drivers like the SCSI core set BLK_MQ_F_SHARED_DEPTH or to modify
> blk_mq_sched_dispatch_requests() such that it flushes all requests if the
> request queue depth is not lower than the hardware tag set size instead of if
> q->queue_depth == 0.

As I mentioned, the SCSI core is the only user of q->queue_depth, so
there is no effect on other drivers.


-- 
Ming

^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2017-08-26  8:53 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-05  6:56 [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Ming Lei
2017-08-05  6:56 ` [PATCH V2 01/20] blk-mq-sched: fix scheduler bad performance Ming Lei
2017-08-09  0:11   ` Omar Sandoval
2017-08-09  2:32     ` Ming Lei
2017-08-09  7:11       ` Omar Sandoval
2017-08-21  8:18         ` Ming Lei
2017-08-23  7:48         ` Ming Lei
2017-08-05  6:56 ` [PATCH V2 02/20] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
2017-08-22 18:28   ` Bart Van Assche
2017-08-24  3:57     ` Ming Lei
2017-08-25 21:36       ` Bart Van Assche
2017-08-26  8:43         ` Ming Lei
2017-08-22 18:37   ` Bart Van Assche
2017-08-24  4:02     ` Ming Lei
2017-08-05  6:56 ` [PATCH V2 03/20] blk-mq: introduce blk_mq_dispatch_rq_from_ctx() Ming Lei
2017-08-22 18:45   ` Bart Van Assche
2017-08-24  4:52     ` Ming Lei
2017-08-25 21:41       ` Bart Van Assche
2017-08-26  8:47         ` Ming Lei
2017-08-05  6:56 ` [PATCH V2 04/20] blk-mq-sched: move actual dispatching into one helper Ming Lei
2017-08-22 19:50   ` Bart Van Assche
2017-08-05  6:56 ` [PATCH V2 05/20] blk-mq-sched: improve dispatching from sw queue Ming Lei
2017-08-22 19:55   ` Bart Van Assche
2017-08-23 19:58     ` Jens Axboe
2017-08-24  5:52     ` Ming Lei
2017-08-22 20:57   ` Bart Van Assche
2017-08-24  6:12     ` Ming Lei
2017-08-05  6:56 ` [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
2017-08-22 20:09   ` Bart Van Assche
2017-08-24  6:18     ` Ming Lei
2017-08-23 19:56   ` Jens Axboe
2017-08-24  6:38     ` Ming Lei
2017-08-25 10:19       ` Ming Lei
2017-08-05  6:56 ` [PATCH V2 07/20] blk-mq-sched: introduce blk_mq_sched_queue_depth() Ming Lei
2017-08-22 20:10   ` Bart Van Assche
2017-08-05  6:56 ` [PATCH V2 08/20] blk-mq-sched: use q->queue_depth as hint for q->nr_requests Ming Lei
2017-08-22 20:20   ` Bart Van Assche
2017-08-24  6:39     ` Ming Lei
2017-08-05  6:56 ` [PATCH V2 09/20] blk-mq: introduce BLK_MQ_F_SHARED_DEPTH Ming Lei
2017-08-22 21:55   ` Bart Van Assche
2017-08-23  6:46     ` Hannes Reinecke
2017-08-24  6:52     ` Ming Lei
2017-08-25 22:23       ` Bart Van Assche
2017-08-26  8:53         ` Ming Lei
2017-08-05  6:56 ` [PATCH V2 10/20] blk-mq-sched: introduce helpers for query, change busy state Ming Lei
2017-08-22 20:41   ` Bart Van Assche
2017-08-23 20:02     ` Jens Axboe
2017-08-24  6:55       ` Ming Lei
2017-08-24  6:54     ` Ming Lei
2017-08-05  6:56 ` [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list Ming Lei
2017-08-22 20:43   ` Bart Van Assche
2017-08-24  0:59     ` Damien Le Moal
2017-08-24  7:10       ` Ming Lei
2017-08-24  7:42         ` Damien Le Moal
2017-08-24  6:57     ` Ming Lei
2017-08-05  6:56 ` [PATCH V2 12/20] blk-mq: introduce pointers to dispatch lock & list Ming Lei
2017-08-05  6:56 ` [PATCH V2 13/20] blk-mq: pass 'request_queue *' to several helpers of operating BUSY Ming Lei
2017-08-05  6:56 ` [PATCH V2 14/20] blk-mq-sched: improve IO scheduling on SCSI devcie Ming Lei
2017-08-22 20:51   ` Bart Van Assche
2017-08-24  7:14     ` Ming Lei
2017-08-05  6:57 ` [PATCH V2 15/20] block: introduce rqhash helpers Ming Lei
2017-08-05  6:57 ` [PATCH V2 16/20] block: move actual bio merge code into __elv_merge Ming Lei
2017-08-05  6:57 ` [PATCH V2 17/20] block: add check on elevator for supporting bio merge via hashtable from blk-mq sw queue Ming Lei
2017-08-05  6:57 ` [PATCH V2 18/20] block: introduce .last_merge and .hash to blk_mq_ctx Ming Lei
2017-08-05  6:57 ` [PATCH V2 19/20] blk-mq-sched: refactor blk_mq_sched_try_merge() Ming Lei
2017-08-05  6:57 ` [PATCH V2 20/20] blk-mq: improve bio merge from blk-mq sw queue Ming Lei
2017-08-07 12:48 ` [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance Laurence Oberman
2017-08-07 15:27   ` Bart Van Assche
2017-08-07 17:29     ` Laurence Oberman
2017-08-07 18:46       ` Laurence Oberman
2017-08-07 19:46         ` Laurence Oberman
2017-08-07 23:04       ` Ming Lei
     [not found]   ` <CAFfF4qv3W6D-j8BSSZbwPLqhd_mmwk8CZQe7dSqud8cMMd2yPg@mail.gmail.com>
2017-08-07 22:29     ` Bart Van Assche
2017-08-07 23:17     ` Ming Lei
2017-08-08 13:41     ` Ming Lei
2017-08-08 13:58       ` Laurence Oberman
2017-08-08  8:09 ` Paolo Valente
2017-08-08  9:09   ` Ming Lei
2017-08-08  9:13     ` Paolo Valente
2017-08-11  8:11 ` Christoph Hellwig
2017-08-11 14:25   ` James Bottomley
2017-08-23 16:12 ` Bart Van Assche
2017-08-23 16:15   ` Jens Axboe
2017-08-23 16:24     ` Ming Lei
