All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
@ 2018-01-11  6:01 Ming Lei
  2018-01-11  6:01 ` [PATCH V3 1/5] dm-mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE Ming Lei
                   ` (5 more replies)
  0 siblings, 6 replies; 48+ messages in thread
From: Ming Lei @ 2018-01-11  6:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Ming Lei

Hi Guys,

The 1st patch removes the workaround of blk_mq_delay_run_hw_queue() in
case of requeue; this workaround isn't necessary, and worse, it keeps
BLK_MQ_S_SCHED_RESTART from working and degrades I/O performance.

The 2nd patch returns DM_MAPIO_REQUEUE to dm-rq if underlying request
allocation fails, then we can return BLK_STS_RESOURCE from dm-rq to
blk-mq, so that blk-mq can hold back the requests to be dequeued.

The other 3 patches change the blk-mq part of blk_insert_cloned_request():
we switch to blk_mq_try_issue_directly(), so that both dm-rq and blk-mq
can see the dispatch result of the underlying queue, and with this
information, blk-mq can handle I/O merging much better, which improves
sequential I/O performance significantly.

In my dm-mpath over virtio-scsi test, this whole patchset improves
sequential IO by 3X ~ 5X.

V3:
	- rebase on the latest for-4.16/block of block tree
	- add missing pg_init_all_paths() call in patch 1, per Bart's review

V2:
	- drop 'dm-mpath: cache ti->clone during requeue', which is a bit
	too complicated and did not show an obvious performance improvement
	- make change on blk-mq part cleaner

Ming Lei (5):
  dm-mpath: don't call blk_mq_delay_run_hw_queue() in case of
    BLK_STS_RESOURCE
  dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure
  blk-mq: move actual issue into one helper
  blk-mq: return dispatch result to caller in blk_mq_try_issue_directly
  blk-mq: issue request directly for blk_insert_cloned_request

 block/blk-core.c      |  3 +-
 block/blk-mq.c        | 86 +++++++++++++++++++++++++++++++++++++++------------
 block/blk-mq.h        |  3 ++
 drivers/md/dm-mpath.c | 19 +++++++++---
 drivers/md/dm-rq.c    | 20 +++++++++---
 5 files changed, 101 insertions(+), 30 deletions(-)

-- 
2.9.5

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH V3 1/5] dm-mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
  2018-01-11  6:01 [PATCH V3 0/5] dm-rq: improve sequential I/O performance Ming Lei
@ 2018-01-11  6:01 ` Ming Lei
  2018-01-11  6:01 ` [PATCH V3 2/5] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure Ming Lei
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2018-01-11  6:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Ming Lei

If .queue_rq() returns BLK_STS_RESOURCE, blk-mq will rerun the queue in
the three situations:

1) if BLK_MQ_S_SCHED_RESTART is set
- queue is rerun after one rq is completed, see blk_mq_sched_restart()
which is run from blk_mq_free_request()

2) run out of driver tag
- queue is rerun after one tag is freed

3) otherwise
- queue is run immediately in blk_mq_dispatch_rq_list()
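
The three cases above can be captured in a small stand-alone sketch (the
names mirror blk-mq, but the types and logic are illustrative stand-ins,
not kernel code):

```c
#include <stdbool.h>

/* Illustrative model of when blk-mq reruns a hw queue after
 * .queue_rq() returns BLK_STS_RESOURCE; not kernel code. */
enum rerun_when {
	AFTER_RQ_COMPLETION,	/* 1) via blk_mq_sched_restart() */
	AFTER_TAG_FREED,	/* 2) when a driver tag is released */
	IMMEDIATELY,		/* 3) in blk_mq_dispatch_rq_list() */
};

static enum rerun_when rerun_policy(bool sched_restart_set,
				    bool out_of_driver_tags)
{
	if (sched_restart_set)
		return AFTER_RQ_COMPLETION;
	if (out_of_driver_tags)
		return AFTER_TAG_FREED;
	return IMMEDIATELY;
}
```

In every case the rerun already happens at a well-defined point, which is
why an extra blk_mq_delay_run_hw_queue() with an arbitrary delay is
unnecessary.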

This random delay in running the hw queue was introduced by commit 6077c2d706097c0
(dm rq: Avoid that request processing stalls sporadically), which claimed
to fix a request processing stall but never explained the idea behind the
fix; it is a workaround at most. The question has not been answered in
recent discussion either.

Also, calling blk_mq_delay_run_hw_queue() inside .queue_rq() is a horrible
hack because it keeps BLK_MQ_S_SCHED_RESTART from working and degrades I/O
performance a lot.

Finally, this patch makes sure that dm-rq returns BLK_STS_RESOURCE to blk-mq
only when the underlying queue is out of resources, so we switch to returning
DM_MAPIO_DELAY_REQUEUE if either MPATHF_QUEUE_IO or MPATHF_PG_INIT_REQUIRED
is set in multipath_clone_and_map().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-mpath.c | 5 ++---
 drivers/md/dm-rq.c    | 1 -
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index f7810cc869ac..86bf502a8e51 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -516,9 +516,8 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
 		return DM_MAPIO_KILL;
 	} else if (test_bit(MPATHF_QUEUE_IO, &m->flags) ||
 		   test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) {
-		if (pg_init_all_paths(m))
-			return DM_MAPIO_DELAY_REQUEUE;
-		return DM_MAPIO_REQUEUE;
+		pg_init_all_paths(m);
+		return DM_MAPIO_DELAY_REQUEUE;
 	}
 
 	memset(mpio, 0, sizeof(*mpio));
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3b319776d80c..4d157b14d302 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -755,7 +755,6 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Undo dm_start_request() before requeuing */
 		rq_end_stats(md, rq);
 		rq_completed(md, rq_data_dir(rq), false);
-		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
 		return BLK_STS_RESOURCE;
 	}
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH V3 2/5] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure
  2018-01-11  6:01 [PATCH V3 0/5] dm-rq: improve sequential I/O performance Ming Lei
  2018-01-11  6:01 ` [PATCH V3 1/5] dm-mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE Ming Lei
@ 2018-01-11  6:01 ` Ming Lei
  2018-01-12 19:04     ` Bart Van Assche
  2018-01-11  6:01 ` [PATCH V3 3/5] blk-mq: move actual issue into one helper Ming Lei
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 48+ messages in thread
From: Ming Lei @ 2018-01-11  6:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Ming Lei

blk-mq will rerun the queue via RESTART or a dispatch wakeup after one
request is completed, so it is not necessary to wait a random amount of
time before requeuing; we should trust blk-mq to do it.

More importantly, we need to return BLK_STS_RESOURCE to blk-mq so that
dequeuing from the I/O scheduler can be stopped, which improves I/O merging.
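
The feedback path this relies on can be sketched as a stand-alone model
(the enum values are local stand-ins and the DELAY_REQUEUE handling is
simplified; this is not the actual dm-rq code):

```c
/* Simplified stand-ins for the kernel's dm map results and block
 * status codes; values and names are illustrative only. */
enum dm_mapio {
	DM_MAPIO_SUBMITTED,
	DM_MAPIO_REMAPPED,
	DM_MAPIO_REQUEUE,
	DM_MAPIO_DELAY_REQUEUE,
	DM_MAPIO_KILL,
};

enum blk_sts { BLK_STS_OK, BLK_STS_RESOURCE, BLK_STS_IOERR };

/* How dm-rq can translate a target's map result into the status it
 * returns to blk-mq: REQUEUE becomes BLK_STS_RESOURCE, the "queue busy"
 * feedback that lets blk-mq stop dequeuing from the I/O scheduler and
 * merge requests in the meantime. */
static enum blk_sts map_result_to_blk_status(enum dm_mapio r)
{
	switch (r) {
	case DM_MAPIO_REQUEUE:
		return BLK_STS_RESOURCE;
	case DM_MAPIO_KILL:
		return BLK_STS_IOERR;
	default:
		/* SUBMITTED/REMAPPED proceed normally; DELAY_REQUEUE is
		 * requeued by dm-rq itself after a delay. */
		return BLK_STS_OK;
	}
}
```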

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-mpath.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 86bf502a8e51..fcddf5a62581 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -533,8 +533,20 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
 		if (queue_dying) {
 			atomic_inc(&m->pg_init_in_progress);
 			activate_or_offline_path(pgpath);
+			return DM_MAPIO_DELAY_REQUEUE;
 		}
-		return DM_MAPIO_DELAY_REQUEUE;
+
+		/*
+		 * blk-mq's SCHED_RESTART can cover this requeue, so
+		 * we don't need to handle it with DELAY_REQUEUE. More
+		 * importantly, we have to return DM_MAPIO_REQUEUE
+		 * so that blk-mq gets the queue-busy feedback;
+		 * otherwise I/O merging can be hurt.
+		 */
+		if (q->mq_ops)
+			return DM_MAPIO_REQUEUE;
+		else
+			return DM_MAPIO_DELAY_REQUEUE;
 	}
 	clone->bio = clone->biotail = NULL;
 	clone->rq_disk = bdev->bd_disk;
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH V3 3/5] blk-mq: move actual issue into one helper
  2018-01-11  6:01 [PATCH V3 0/5] dm-rq: improve sequential I/O performance Ming Lei
  2018-01-11  6:01 ` [PATCH V3 1/5] dm-mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE Ming Lei
  2018-01-11  6:01 ` [PATCH V3 2/5] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure Ming Lei
@ 2018-01-11  6:01 ` Ming Lei
  2018-01-11 22:09   ` Mike Snitzer
  2018-01-11  6:01 ` [PATCH V3 4/5] blk-mq: return dispatch result to caller in blk_mq_try_issue_directly Ming Lei
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 48+ messages in thread
From: Ming Lei @ 2018-01-11  6:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Ming Lei

No functional change; this just cleans up the code a bit so that the
following change, which switches blk_mq_request_bypass_insert() to direct
issue as needed by DM, is easier to make.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 39 +++++++++++++++++++++++++++------------
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9aa24c9508f9..d27c1a265d54 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1684,15 +1684,38 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
 }
 
-static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-					struct request *rq,
-					blk_qc_t *cookie)
+/* return true if insert is needed */
+static bool __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
+			       struct request *rq,
+			       blk_qc_t *new_cookie,
+			       blk_status_t *ret)
 {
 	struct request_queue *q = rq->q;
 	struct blk_mq_queue_data bd = {
 		.rq = rq,
 		.last = true,
 	};
+
+	if (!blk_mq_get_driver_tag(rq, NULL, false))
+		return true;
+
+	if (!blk_mq_get_dispatch_budget(hctx)) {
+		blk_mq_put_driver_tag(rq);
+		return true;
+	}
+
+	*new_cookie = request_to_qc_t(hctx, rq);
+
+	*ret = q->mq_ops->queue_rq(hctx, &bd);
+
+	return false;
+}
+
+static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+					struct request *rq,
+					blk_qc_t *cookie)
+{
+	struct request_queue *q = rq->q;
 	blk_qc_t new_cookie;
 	blk_status_t ret;
 	bool run_queue = true;
@@ -1706,22 +1729,14 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (q->elevator)
 		goto insert;
 
-	if (!blk_mq_get_driver_tag(rq, NULL, false))
+	if (__blk_mq_issue_req(hctx, rq, &new_cookie, &ret))
 		goto insert;
 
-	if (!blk_mq_get_dispatch_budget(hctx)) {
-		blk_mq_put_driver_tag(rq);
-		goto insert;
-	}
-
-	new_cookie = request_to_qc_t(hctx, rq);
-
 	/*
 	 * For OK queue, we are done. For error, kill it. Any other
 	 * error (busy), just add it to our list as we previously
 	 * would have done
 	 */
-	ret = q->mq_ops->queue_rq(hctx, &bd);
 	switch (ret) {
 	case BLK_STS_OK:
 		*cookie = new_cookie;
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH V3 4/5] blk-mq: return dispatch result to caller in blk_mq_try_issue_directly
  2018-01-11  6:01 [PATCH V3 0/5] dm-rq: improve sequential I/O performance Ming Lei
                   ` (2 preceding siblings ...)
  2018-01-11  6:01 ` [PATCH V3 3/5] blk-mq: move actual issue into one helper Ming Lei
@ 2018-01-11  6:01 ` Ming Lei
  2018-01-11 22:10   ` Mike Snitzer
  2018-01-11  6:01 ` [PATCH V3 5/5] blk-mq: issue request directly for blk_insert_cloned_request Ming Lei
  2018-01-11 22:07 ` [PATCH V3 0/5] dm-rq: improve sequential I/O performance Mike Snitzer
  5 siblings, 1 reply; 48+ messages in thread
From: Ming Lei @ 2018-01-11  6:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Ming Lei

In the following patch, we will use blk_mq_try_issue_directly() for DM
to return the dispatch result, and DM needs this information to improve
I/O merging.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d27c1a265d54..444a4bf9705f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1711,13 +1711,13 @@ static bool __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
 	return false;
 }
 
-static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-					struct request *rq,
-					blk_qc_t *cookie)
+static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+						struct request *rq,
+						blk_qc_t *cookie)
 {
 	struct request_queue *q = rq->q;
 	blk_qc_t new_cookie;
-	blk_status_t ret;
+	blk_status_t ret = BLK_STS_OK;
 	bool run_queue = true;
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -1740,31 +1740,36 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	switch (ret) {
 	case BLK_STS_OK:
 		*cookie = new_cookie;
-		return;
+		return ret;
 	case BLK_STS_RESOURCE:
 		__blk_mq_requeue_request(rq);
 		goto insert;
 	default:
 		*cookie = BLK_QC_T_NONE;
 		blk_mq_end_request(rq, ret);
-		return;
+		return ret;
 	}
 
 insert:
 	blk_mq_sched_insert_request(rq, false, run_queue, false,
 					hctx->flags & BLK_MQ_F_BLOCKING);
+	return ret;
 }
 
-static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-		struct request *rq, blk_qc_t *cookie)
+static blk_status_t blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+					      struct request *rq,
+					      blk_qc_t *cookie)
 {
 	int srcu_idx;
+	blk_status_t ret;
 
 	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
 	hctx_lock(hctx, &srcu_idx);
-	__blk_mq_try_issue_directly(hctx, rq, cookie);
+	ret = __blk_mq_try_issue_directly(hctx, rq, cookie);
 	hctx_unlock(hctx, srcu_idx);
+
+	return ret;
 }
 
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH V3 5/5] blk-mq: issue request directly for blk_insert_cloned_request
  2018-01-11  6:01 [PATCH V3 0/5] dm-rq: improve sequential I/O performance Ming Lei
                   ` (3 preceding siblings ...)
  2018-01-11  6:01 ` [PATCH V3 4/5] blk-mq: return dispatch result to caller in blk_mq_try_issue_directly Ming Lei
@ 2018-01-11  6:01 ` Ming Lei
  2018-01-11 22:42   ` Mike Snitzer
  2018-01-11 22:07 ` [PATCH V3 0/5] dm-rq: improve sequential I/O performance Mike Snitzer
  5 siblings, 1 reply; 48+ messages in thread
From: Ming Lei @ 2018-01-11  6:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel
  Cc: Bart Van Assche, Ming Lei

blk_insert_cloned_request() is called in the fast path of the dm-rq driver,
and in this function we append the request directly to the
hctx->dispatch_list of the underlying queue.

1) This way isn't efficient enough because the hctx lock is always required.

2) With blk_insert_cloned_request(), we bypass the underlying queue's I/O
scheduler totally and depend on the dm-rq driver to do all the I/O
scheduling. But the dm-rq driver can't get the underlying queue's dispatch
feedback at all, and this information is extremely useful for I/O merging.
Without it, blk-mq basically can't do any I/O merging, which causes very
bad sequential I/O performance.

This patch makes use of blk_mq_try_issue_directly() to dispatch the rq to
the underlying queue, providing the dispatch result to dm-rq and blk-mq,
which improves the above situations very much.

With this patch, sequential I/O is improved by 3X in my test over
mpath/virtio-scsi.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c   |  3 +--
 block/blk-mq.c     | 34 ++++++++++++++++++++++++++++++----
 block/blk-mq.h     |  3 +++
 drivers/md/dm-rq.c | 19 ++++++++++++++++---
 4 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f843ae4f858d..30759fcc93e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
 		 * bypass a potential scheduler on the bottom device for
 		 * insert.
 		 */
-		blk_mq_request_bypass_insert(rq, true);
-		return BLK_STS_OK;
+		return blk_mq_request_direct_issue(rq);
 	}
 
 	spin_lock_irqsave(q->queue_lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 444a4bf9705f..9bfca039d526 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1719,6 +1719,12 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	blk_qc_t new_cookie;
 	blk_status_t ret = BLK_STS_OK;
 	bool run_queue = true;
+	/*
+	 * This function is called from blk_insert_cloned_request() if
+	 * 'cookie' is NULL, and for dispatching this request only.
+	 */
+	bool dispatch_only = !cookie;
+	bool need_insert;
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
 	if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
@@ -1726,10 +1732,19 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		goto insert;
 	}
 
-	if (q->elevator)
+	if (q->elevator && !dispatch_only)
 		goto insert;
 
-	if (__blk_mq_issue_req(hctx, rq, &new_cookie, &ret))
+	need_insert = __blk_mq_issue_req(hctx, rq, &new_cookie, &ret);
+	if (dispatch_only) {
+		if (need_insert)
+			return BLK_STS_RESOURCE;
+		if (ret == BLK_STS_RESOURCE)
+			__blk_mq_requeue_request(rq);
+		return ret;
+	}
+
+	if (need_insert)
 		goto insert;
 
 	/*
@@ -1751,8 +1766,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	}
 
 insert:
-	blk_mq_sched_insert_request(rq, false, run_queue, false,
-					hctx->flags & BLK_MQ_F_BLOCKING);
+	if (!dispatch_only)
+		blk_mq_sched_insert_request(rq, false, run_queue, false,
+				hctx->flags & BLK_MQ_F_BLOCKING);
+	else
+		blk_mq_request_bypass_insert(rq, run_queue);
 	return ret;
 }
 
@@ -1772,6 +1790,14 @@ static blk_status_t blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
+blk_status_t blk_mq_request_direct_issue(struct request *rq)
+{
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
+
+	return blk_mq_try_issue_directly(hctx, rq, NULL);
+}
+
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 {
 	const int is_sync = op_is_sync(bio->bi_opf);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8591a54d989b..bd98864d8e38 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -74,6 +74,9 @@ void blk_mq_request_bypass_insert(struct request *rq, bool run_queue);
 void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 				struct list_head *list);
 
+/* Used by DM for issuing req directly */
+blk_status_t blk_mq_request_direct_issue(struct request *rq);
+
 /*
  * CPU -> queue mappings
  */
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 4d157b14d302..6b01298fba06 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -395,7 +395,7 @@ static void end_clone_request(struct request *clone, blk_status_t error)
 	dm_complete_request(tio->orig, error);
 }
 
-static void dm_dispatch_clone_request(struct request *clone, struct request *rq)
+static blk_status_t dm_dispatch_clone_request(struct request *clone, struct request *rq)
 {
 	blk_status_t r;
 
@@ -404,9 +404,10 @@ static void dm_dispatch_clone_request(struct request *clone, struct request *rq)
 
 	clone->start_time = jiffies;
 	r = blk_insert_cloned_request(clone->q, clone);
-	if (r)
+	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
 		/* must complete clone in terms of original request */
 		dm_complete_request(rq, r);
+	return r;
 }
 
 static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig,
@@ -476,8 +477,10 @@ static int map_request(struct dm_rq_target_io *tio)
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;
 	struct request *clone = NULL;
+	blk_status_t ret;
 
 	r = ti->type->clone_and_map_rq(ti, rq, &tio->info, &clone);
+ check_again:
 	switch (r) {
 	case DM_MAPIO_SUBMITTED:
 		/* The target has taken the I/O to submit by itself later */
@@ -492,7 +495,17 @@ static int map_request(struct dm_rq_target_io *tio)
 		/* The target has remapped the I/O so dispatch it */
 		trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
 				     blk_rq_pos(rq));
-		dm_dispatch_clone_request(clone, rq);
+		ret = dm_dispatch_clone_request(clone, rq);
+		if (ret == BLK_STS_RESOURCE) {
+			blk_rq_unprep_clone(clone);
+			tio->ti->type->release_clone_rq(clone);
+			tio->clone = NULL;
+			if (!rq->q->mq_ops)
+				r = DM_MAPIO_DELAY_REQUEUE;
+			else
+				r = DM_MAPIO_REQUEUE;
+			goto check_again;
+		}
 		break;
 	case DM_MAPIO_REQUEUE:
 		/* The target wants to requeue the I/O */
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-11  6:01 [PATCH V3 0/5] dm-rq: improve sequential I/O performance Ming Lei
                   ` (4 preceding siblings ...)
  2018-01-11  6:01 ` [PATCH V3 5/5] blk-mq: issue request directly for blk_insert_cloned_request Ming Lei
@ 2018-01-11 22:07 ` Mike Snitzer
  2018-01-11 22:37     ` Bart Van Assche
  5 siblings, 1 reply; 48+ messages in thread
From: Mike Snitzer @ 2018-01-11 22:07 UTC (permalink / raw)
  To: Ming Lei, axboe
  Cc: Jens Axboe, linux-block, Christoph Hellwig, dm-devel, Bart Van Assche

On Thu, Jan 11 2018 at  1:01am -0500,
Ming Lei <ming.lei@redhat.com> wrote:

> Hi Guys,
> 
> The 1st patch removes the workaround of blk_mq_delay_run_hw_queue() in
> case of requeue; this workaround isn't necessary, and worse, it keeps
> BLK_MQ_S_SCHED_RESTART from working and degrades I/O performance.
> 
> The 2nd patch returns DM_MAPIO_REQUEUE to dm-rq if underlying request
> allocation fails, then we can return BLK_STS_RESOURCE from dm-rq to
> blk-mq, so that blk-mq can hold back the requests to be dequeued.

In general the delayed requeue was meant to cope (slightly better) with
queue_if_no_path and no paths being available.  I think patch 1 respects
that intent.  And patch 2 looks right to me: BLK_STS_RESOURCE definitely
should be returned if blk_get_request() fails.

So I picked these up for dm-4.16 (and staged them in linux-next).  I
reworded both headers and the 2nd patch's block comment, see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=64e4e12bf474cf3aaaf59245bbeba3057d2fedf9
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=c50de17c1cc7d5910df58aba9b950e99579e239d

Bart, if for some reason we regress for some workload you're able to
more readily test we can deal with it.  But I'm too encouraged by Ming's
performance improvements to hold these changes back any further.

> The other 3 patches change the blk-mq part of blk_insert_cloned_request():
> we switch to blk_mq_try_issue_directly(), so that both dm-rq and blk-mq
> can see the dispatch result of the underlying queue, and with this
> information, blk-mq can handle I/O merging much better, which improves
> sequential I/O performance significantly.
> 
> In my dm-mpath over virtio-scsi test, this whole patchset improves
> sequential IO by 3X ~ 5X.

I've merged dm-4.16 and for-4.16/block, and patches 3 - 5 apply cleanly
(so we won't have any merge conflicts during 4.16); the resulting git
branch is here:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=block-4.16_dm-4.16

I'll reply to patches 3 - 5 individually with my Reviewed-by.
Jens, please feel free to pick them up for 4.16.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 3/5] blk-mq: move actual issue into one helper
  2018-01-11  6:01 ` [PATCH V3 3/5] blk-mq: move actual issue into one helper Ming Lei
@ 2018-01-11 22:09   ` Mike Snitzer
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2018-01-11 22:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, dm-devel, Bart Van Assche

On Thu, Jan 11 2018 at  1:01am -0500,
Ming Lei <ming.lei@redhat.com> wrote:

> No functional change; this just cleans up the code a bit so that the
> following change, which switches blk_mq_request_bypass_insert() to direct
> issue as needed by DM, is easier to make.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Reviewed-by: Mike Snitzer <snitzer@redhat.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 4/5] blk-mq: return dispatch result to caller in blk_mq_try_issue_directly
  2018-01-11  6:01 ` [PATCH V3 4/5] blk-mq: return dispatch result to caller in blk_mq_try_issue_directly Ming Lei
@ 2018-01-11 22:10   ` Mike Snitzer
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2018-01-11 22:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, dm-devel, Bart Van Assche

On Thu, Jan 11 2018 at  1:01am -0500,
Ming Lei <ming.lei@redhat.com> wrote:

> In the following patch, we will use blk_mq_try_issue_directly() for DM
> to return the dispatch result, and DM needs this information to improve
> I/O merging.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Reviewed-by: Mike Snitzer <snitzer@redhat.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
@ 2018-01-11 22:37     ` Bart Van Assche
  0 siblings, 0 replies; 48+ messages in thread
From: Bart Van Assche @ 2018-01-11 22:37 UTC (permalink / raw)
  To: axboe, snitzer, ming.lei
  Cc: dm-devel, hch, linux-block, axboe, martin.petersen

On Thu, 2018-01-11 at 17:07 -0500, Mike Snitzer wrote:
> Bart, if for some reason we regress for some workload you're able to
> more readily test we can deal with it.  But I'm too encouraged by Ming's
> performance improvements to hold these changes back any further.

Sorry Mike but I think Ming's measurement results are not sufficient to
motivate upstreaming of these patches. What I remember from previous versions
of this patch series is that Ming measured the performance impact of this
patch series only for the Emulex FC driver (lpfc). That driver limits queue
depth to 3. Instead of modifying the dm code, that driver needs to be fixed
such that it allows larger queue depths.

Additionally, some time ago I had explained in detail why I think that patch
2/5 in this series is wrong and why it will introduce fairness regressions
in multi-LUN workloads. I think we need performance measurements for this
patch series for at least the following two configurations before this patch
series can be considered for upstream inclusion:
* The performance impact of this patch series for SCSI devices with a
  realistic queue depth (e.g. 64 or 128).
* The performance impact for this patch series for a SCSI host with which
  multiple LUNs are associated and for which the target system often replies
  with the SCSI "BUSY" status.

Bart.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 5/5] blk-mq: issue request directly for blk_insert_cloned_request
  2018-01-11  6:01 ` [PATCH V3 5/5] blk-mq: issue request directly for blk_insert_cloned_request Ming Lei
@ 2018-01-11 22:42   ` Mike Snitzer
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2018-01-11 22:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, dm-devel, Bart Van Assche

On Thu, Jan 11 2018 at  1:01am -0500,
Ming Lei <ming.lei@redhat.com> wrote:

> blk_insert_cloned_request() is called in the fast path of the dm-rq
> driver, and in this function we append the request directly to the
> hctx->dispatch_list of the underlying queue.
> 
> 1) This way isn't efficient enough because the hctx lock is always required.
> 
> 2) With blk_insert_cloned_request(), we bypass the underlying queue's I/O
> scheduler totally and depend on the dm-rq driver to do all the I/O
> scheduling. But the dm-rq driver can't get the underlying queue's dispatch
> feedback at all, and this information is extremely useful for I/O merging.
> Without it, blk-mq basically can't do any I/O merging, which causes very
> bad sequential I/O performance.
> 
> This patch makes use of blk_mq_try_issue_directly() to dispatch the rq to
> the underlying queue, providing the dispatch result to dm-rq and blk-mq,
> which improves the above situations very much.
> 
> With this patch, sequential I/O is improved by 3X in my test over
> mpath/virtio-scsi.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Tested-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-11 22:37     ` Bart Van Assche
  (?)
@ 2018-01-11 22:58     ` Mike Snitzer
  2018-01-11 23:27         ` Bart Van Assche
  -1 siblings, 1 reply; 48+ messages in thread
From: Mike Snitzer @ 2018-01-11 22:58 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, ming.lei, dm-devel, hch, linux-block, axboe, martin.petersen

On Thu, Jan 11 2018 at  5:37pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Thu, 2018-01-11 at 17:07 -0500, Mike Snitzer wrote:
> > Bart, if for some reason we regress for some workload you're able to
> > more readily test we can deal with it.  But I'm too encouraged by Ming's
> > performance improvements to hold these changes back any further.
> 
> Sorry Mike but I think Ming's measurement results are not sufficient to
> motivate upstreaming of these patches. What I remember from previous versions
> of this patch series is that Ming measured the performance impact of this
> patch series only for the Emulex FC driver (lpfc). That driver limits queue
> depth to 3. Instead of modifying the dm code, that driver needs to be fixed
> such that it allows larger queue depths.

Sorry Bart but even in my test (using null_blk with a queue depth of 8,
12 hw queues on a 12-cpu system, with 12 fio jobs) it is yielding a
performance improvement: going from 1350K IOPS to 1420K IOPS, and from
5275MB/s to 5532MB/s, for a sequential fio workload.

This test has been a staple of mine for a very long time.  The
improvement is very real (and hard to come by).  The IOPS improvement is
a bellwether for improved efficiency in the fast path.
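
For reference, a fio job file reproducing this kind of sequential workload
might look like the following (illustrative only; the device name and
runtime are assumptions, not the actual test script):

```ini
[global]
ioengine=libaio
direct=1
rw=read            ; sequential reads
bs=4k
iodepth=8          ; queue depth of 8, as above
numjobs=12         ; 12 jobs, one per CPU / hw queue
runtime=30
time_based
group_reporting

[seqread]
filename=/dev/dm-0 ; the dm-mpath (or null_blk) device under test
```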

So all your concern about Ming's testing is fine.  But you'd really help
me out a lot if you could give these changes a test.
 
> Additionally, some time ago I had explained in detail why I think that patch
> 2/5 in this series is wrong and why it will introduce fairness regressions
> in multi-LUN workloads.

You may think you explained it, but IIRC it was very hand-wavy.  If
you can provide a pointer to what you think was a compelling argument
then please do.

But I'm left very much unconvinced with your argument given what I see
in patch 2.  If blk_get_request() fails it follows that BLK_STS_RESOURCE
should be returned.  Why is the 100ms delay meaningful in that case for
SCSI?

> I think we need performance measurements for this
> patch series for at least the following two configurations before this patch
> series can be considered for upstream inclusion:
> * The performance impact of this patch series for SCSI devices with a
>   realistic queue depth (e.g. 64 or 128).
> * The performance impact for this patch series for a SCSI host with which
>   multiple LUNs are associated and for which the target system often replies
>   with the SCSI "BUSY" status.

OK, so please test it.

The changes are pretty easy to review.  This notion that these changes
are problematic rings very hollow given your lack of actual numbers (or
some other concerning observation rooted in testing fact) to back up
your position.

Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
@ 2018-01-11 23:27         ` Bart Van Assche
  0 siblings, 0 replies; 48+ messages in thread
From: Bart Van Assche @ 2018-01-11 23:27 UTC (permalink / raw)
  To: snitzer
  Cc: dm-devel, hch, linux-block, martin.petersen, axboe, axboe, ming.lei

On Thu, 2018-01-11 at 17:58 -0500, Mike Snitzer wrote:
> The changes are pretty easy to review.  This notion that these changes
> are problematic rings very hollow given your lack of actual numbers (or
> some other concerning observation rooted in testing fact) to back up
> your position.

It's not my job to run the multi-LUN test. That's the job of the people who
want these patches upstream. Since I asked for these test results for the first
time several months ago I'm surprised that nobody has run these tests yet.

Bart.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-11 22:37     ` Bart Van Assche
  (?)
  (?)
@ 2018-01-12  1:42     ` Ming Lei
  2018-01-12  1:57       ` Mike Snitzer
  -1 siblings, 1 reply; 48+ messages in thread
From: Ming Lei @ 2018-01-12  1:42 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, snitzer, dm-devel, hch, linux-block, axboe, martin.petersen

On Thu, Jan 11, 2018 at 10:37:37PM +0000, Bart Van Assche wrote:
> On Thu, 2018-01-11 at 17:07 -0500, Mike Snitzer wrote:
> > Bart, if for some reason we regress for some workload you're able to
> > more readily test we can deal with it.  But I'm too encouraged by Ming's
> > performance improvements to hold these changes back any further.
> 
> Sorry Mike but I think Ming's measurement results are not sufficient to
> motivate upstreaming of these patches. What I remember from previous versions
> of this patch series is that Ming measured the performance impact of this
> patch series only for the Emulex FC driver (lpfc). That driver limits queue
> depth to 3. Instead of modifying the dm code, that driver needs to be fixed
> such that it allows larger queue depths.
> 
> Additionally, some time ago I had explained in detail why I think that patch
> 2/5 in this series is wrong and why it will introduce fairness regressions
> in multi-LUN workloads. I think we need performance measurements for this
> patch series for at least the following two configurations before this patch
> series can be considered for upstream inclusion:
> * The performance impact of this patch series for SCSI devices with a
>   realistic queue depth (e.g. 64 or 128).

Please look at the cover letter or patch 5's commit log; it mentions that
dm-mpath over virtio-scsi was tested, and the default queue depth of
virtio-scsi is 128.

> * The performance impact for this patch series for a SCSI host with which
>   multiple LUNs are associated and for which the target system often replies
>   with the SCSI "BUSY" status.

I don't think this patch is related to that case; this patch just provides the
BUSY feedback from the underlying queue to blk-mq via dm-rq.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-11 23:27         ` Bart Van Assche
  (?)
@ 2018-01-12  1:43         ` Mike Snitzer
  -1 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2018-01-12  1:43 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: dm-devel, hch, linux-block, martin.petersen, axboe, axboe, ming.lei

On Thu, Jan 11 2018 at  6:27pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Thu, 2018-01-11 at 17:58 -0500, Mike Snitzer wrote:
> > The changes are pretty easy to review.  This notion that these changes
> > are problematic rings very hollow given your lack of actual numbers (or
> > some other concerning observation rooted in testing fact) to back up
> > your position.
> 
> It's not my job to run the multi-LUN test. That's the job of the people who
> want these patches upstream. Since I asked for these test results for the first
> time several months ago I'm surprised that nobody has run these tests yet.

I've reasoned through a few different ways to respond to this.  Fact is
you're not giving me much to work with.  AFAIK you _are_ chartered with
supporting the types of storage configs that you've requested
performance results from.

Your dm-rq.c commit 6077c2d706097c0 ("dm rq: Avoid that request
processing stalls sporadically") silently went in through Jens:
https://www.redhat.com/archives/dm-devel/2017-April/msg00157.html
Not sure why that happened to begin with honestly.

But at the end of that post I meant to say:
"If this dm-mq specific commit is justified the case certainly is
_not_ spelled out in the commit header."

Anyway, I've split this contentious removal of dm_mq_queue_rq's
blk_mq_delay_run_hw_queue(hctx, 100/*ms*/) into a separate patch; but
at this point I'm still inclined to accept it for 4.16.

I'll hopefully look closer at understanding the need for commit
6077c2d706097c0 tomorrow.

In the meantime, I'd _really_ appreciate it if you'd give the rest of
the changes Ming has proposed in this patchset a much more open mind!

Thanks,
Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-12  1:42     ` Ming Lei
@ 2018-01-12  1:57       ` Mike Snitzer
  2018-01-12  3:33         ` Ming Lei
  0 siblings, 1 reply; 48+ messages in thread
From: Mike Snitzer @ 2018-01-12  1:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, axboe, dm-devel, hch, linux-block, axboe,
	martin.petersen

On Thu, Jan 11 2018 at  8:42pm -0500,
Ming Lei <ming.lei@redhat.com> wrote:

> On Thu, Jan 11, 2018 at 10:37:37PM +0000, Bart Van Assche wrote:
> > On Thu, 2018-01-11 at 17:07 -0500, Mike Snitzer wrote:
> > > Bart, if for some reason we regress for some workload you're able to
> > > more readily test we can deal with it.  But I'm too encouraged by Ming's
> > > performance improvements to hold these changes back any further.
> > 
> > Sorry Mike but I think Ming's measurement results are not sufficient to
> > motivate upstreaming of these patches. What I remember from previous versions
> > of this patch series is that Ming measured the performance impact of this
> > patch series only for the Emulex FC driver (lpfc). That driver limits queue
> > depth to 3. Instead of modifying the dm code, that driver needs to be fixed
> > such that it allows larger queue depths.
> > 
> > Additionally, some time ago I had explained in detail why I think that patch
> > 2/5 in this series is wrong and why it will introduce fairness regressions
> > in multi-LUN workloads. I think we need performance measurements for this
> > patch series for at least the following two configurations before this patch
> > series can be considered for upstream inclusion:
> > * The performance impact of this patch series for SCSI devices with a
> >   realistic queue depth (e.g. 64 or 128).
> 
> Please look at the cover letter or patch 5's commit log, it mentions that
> dm-mpath over virtio-scsi is tested, and the default queue depth of virito-scsi
> is 128.
> 
> > * The performance impact for this patch series for a SCSI host with which
> >   multiple LUNs are associated and for which the target system often replies
> >   with the SCSI "BUSY" status.
> 
> I don't think this patch is related with this case, this patch just provides the
> BUSY feedback from underlying queue to blk-mq via dm-rq.

Hi Ming,

Please see https://www.redhat.com/archives/dm-devel/2017-April/msg00190.html

Specifically:
"That patch [commit 6077c2d706] can be considered as a first step that
can be refined further, namely by modifying the dm-rq code further such
that dm-rq queues are only rerun after the busy condition has been
cleared. The patch at the start of this thread is easier to review and
easier to test than any patch that would only rerun dm-rq queues after
the busy condition has been cleared."

Do you have time to reason through Bart's previous proposal for how to
better address the specific issue that was documented to be fixed in the
header for commit 6077c2d706 ?

Otherwise I fear we'll just keep hitting a wall with Bart...

Thanks,
Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-12  1:57       ` Mike Snitzer
@ 2018-01-12  3:33         ` Ming Lei
  2018-01-12 17:18           ` Mike Snitzer
  0 siblings, 1 reply; 48+ messages in thread
From: Ming Lei @ 2018-01-12  3:33 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Bart Van Assche, axboe, dm-devel, hch, linux-block, axboe,
	martin.petersen

On Thu, Jan 11, 2018 at 08:57:21PM -0500, Mike Snitzer wrote:
> On Thu, Jan 11 2018 at  8:42pm -0500,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Thu, Jan 11, 2018 at 10:37:37PM +0000, Bart Van Assche wrote:
> > > On Thu, 2018-01-11 at 17:07 -0500, Mike Snitzer wrote:
> > > > Bart, if for some reason we regress for some workload you're able to
> > > > more readily test we can deal with it.  But I'm too encouraged by Ming's
> > > > performance improvements to hold these changes back any further.
> > > 
> > > Sorry Mike but I think Ming's measurement results are not sufficient to
> > > motivate upstreaming of these patches. What I remember from previous versions
> > > of this patch series is that Ming measured the performance impact of this
> > > patch series only for the Emulex FC driver (lpfc). That driver limits queue
> > > depth to 3. Instead of modifying the dm code, that driver needs to be fixed
> > > such that it allows larger queue depths.
> > > 
> > > Additionally, some time ago I had explained in detail why I think that patch
> > > 2/5 in this series is wrong and why it will introduce fairness regressions
> > > in multi-LUN workloads. I think we need performance measurements for this
> > > patch series for at least the following two configurations before this patch
> > > series can be considered for upstream inclusion:
> > > * The performance impact of this patch series for SCSI devices with a
> > >   realistic queue depth (e.g. 64 or 128).
> > 
> > Please look at the cover letter or patch 5's commit log, it mentions that
> > dm-mpath over virtio-scsi is tested, and the default queue depth of virito-scsi
> > is 128.
> > 
> > > * The performance impact for this patch series for a SCSI host with which
> > >   multiple LUNs are associated and for which the target system often replies
> > >   with the SCSI "BUSY" status.
> > 
> > I don't think this patch is related with this case, this patch just provides the
> > BUSY feedback from underlying queue to blk-mq via dm-rq.
> 
> Hi Ming,
> 
> Please see https://www.redhat.com/archives/dm-devel/2017-April/msg00190.html
> 
> Specifically:
> "That patch [commit 6077c2d706] can be considered as a first step that
> can be refined further, namely by modifying the dm-rq code further such
> that dm-rq queues are only rerun after the busy condition has been
> cleared. The patch at the start of this thread is easier to review and
> easier to test than any patch that would only rerun dm-rq queues after
> the busy condition has been cleared."
> 
> Do you have time to reason through Bart's previous proposal for how to
> better address the specific issue that was documented to be fixed in the
> header for commit 6077c2d706 ?

Hi Mike,

Recently we fixed one such issue in blk-mq (commit eb619fdb2d4cb8b3d34), and I
strongly suspect that may fix Bart's case.

Let's look at commit 6077c2d706 again: I don't see the idea behind the
fix, which just adds an arbitrary 100ms delay, and this delay degrades
performance a lot, since no hctx can be scheduled any more during the delay.

So again, it is definitely a workaround: no reasoning, no theory, just the
claim that it is a fix. Is there anyone who can explain the story?

IMO, the blk_get_request() in dm-mpath is just like the kmalloc(GFP_ATOMIC)
used in other blk-mq drivers' I/O paths too, so I don't know why dm-mq is so
special as to require such an ugly 100ms delay.

So I suggest removing it; let's see if there are reports, and if there
are, I am quite confident we can root-cause and fix them.

> 
> Otherwise I fear we'll just keep hitting a wall with Bart...

I did test dm-mq over scsi_debug set up with two LUNs and can_queue set
to 1, and this patchset works well, without any I/O hang.  Anyone can try
and play with it.
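
For reference, such a scsi_debug configuration can be sketched roughly as follows (the exact parameters and multipath setup are assumptions based on the description above, not Ming's actual script):

```shell
# Two-LUN scsi_debug host with a queue depth of 1 (root required);
# parameter values are inferred from the description above.
modprobe scsi_debug max_luns=2 can_queue=1
lsscsi | grep scsi_debug   # locate the resulting /dev/sdX devices
# Then build dm-mpath (dm-mq) on top of those devices and run fio
# against the multipath device while watching for I/O stalls.
```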

Thanks,
Ming

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-12  3:33         ` Ming Lei
@ 2018-01-12 17:18           ` Mike Snitzer
  2018-01-12 17:26               ` Bart Van Assche
  0 siblings, 1 reply; 48+ messages in thread
From: Mike Snitzer @ 2018-01-12 17:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, axboe, dm-devel, hch, linux-block, axboe,
	martin.petersen

On Thu, Jan 11 2018 at 10:33pm -0500,
Ming Lei <ming.lei@redhat.com> wrote:

> On Thu, Jan 11, 2018 at 08:57:21PM -0500, Mike Snitzer wrote:
> > On Thu, Jan 11 2018 at  8:42pm -0500,
> > Ming Lei <ming.lei@redhat.com> wrote:
> > 
> > > On Thu, Jan 11, 2018 at 10:37:37PM +0000, Bart Van Assche wrote:
> > > > On Thu, 2018-01-11 at 17:07 -0500, Mike Snitzer wrote:
> > > > > Bart, if for some reason we regress for some workload you're able to
> > > > > more readily test we can deal with it.  But I'm too encouraged by Ming's
> > > > > performance improvements to hold these changes back any further.
> > > > 
> > > > Sorry Mike but I think Ming's measurement results are not sufficient to
> > > > motivate upstreaming of these patches. What I remember from previous versions
> > > > of this patch series is that Ming measured the performance impact of this
> > > > patch series only for the Emulex FC driver (lpfc). That driver limits queue
> > > > depth to 3. Instead of modifying the dm code, that driver needs to be fixed
> > > > such that it allows larger queue depths.
> > > > 
> > > > Additionally, some time ago I had explained in detail why I think that patch
> > > > 2/5 in this series is wrong and why it will introduce fairness regressions
> > > > in multi-LUN workloads. I think we need performance measurements for this
> > > > patch series for at least the following two configurations before this patch
> > > > series can be considered for upstream inclusion:
> > > > * The performance impact of this patch series for SCSI devices with a
> > > >   realistic queue depth (e.g. 64 or 128).
> > > 
> > > Please look at the cover letter or patch 5's commit log, it mentions that
> > > dm-mpath over virtio-scsi is tested, and the default queue depth of virito-scsi
> > > is 128.
> > > 
> > > > * The performance impact for this patch series for a SCSI host with which
> > > >   multiple LUNs are associated and for which the target system often replies
> > > >   with the SCSI "BUSY" status.
> > > 
> > > I don't think this patch is related with this case, this patch just provides the
> > > BUSY feedback from underlying queue to blk-mq via dm-rq.
> > 
> > Hi Ming,
> > 
> > Please see https://www.redhat.com/archives/dm-devel/2017-April/msg00190.html
> > 
> > Specifically:
> > "That patch [commit 6077c2d706] can be considered as a first step that
> > can be refined further, namely by modifying the dm-rq code further such
> > that dm-rq queues are only rerun after the busy condition has been
> > cleared. The patch at the start of this thread is easier to review and
> > easier to test than any patch that would only rerun dm-rq queues after
> > the busy condition has been cleared."
> > 
> > Do you have time to reason through Bart's previous proposal for how to
> > better address the specific issue that was documented to be fixed in the
> > header for commit 6077c2d706 ?
> 
> Hi Mike,
> 
> Recently we fixed one such issue in blk-mq (eb619fdb2d4cb8b3d34), and I
> highly guess that may fix Bart's case.

Wow, kind of staggering that there was no mention of this fix prior to
now.  Seems _very_ relevant (like the real fix).

> Let's see this commit 6077c2d706 again, I don't know the idea behind the
> fix, which just adds random of 100ms, and this delay degrades performance a
> lot, since no hctx can be scheduled any more during the delay.

I'd love to try to reproduce the issue documented in commit
6077c2d706 but sadly I cannot get that srp-test to work on my system.
Just fails with:

# while srp-test/run_tests -d -r 30 -t 02-mq; do :; done
Unloaded the ib_srpt kernel module
Unloaded the rdma_rxe kernel module
Zero-initializing /dev/ram0 ... done
Zero-initializing /dev/ram1 ... done
Configured SRP target driver
Running test srp-test/tests/02-mq ...
Test file I/O on top of multipath concurrently with logout and login (0 min; mq)
SRP login failed
Test srp-test/tests/02-mq failed

Think the login is failing because the target isn't being set up properly?
Yep, from reading this thread it is now clear that srp-test only works
if you have an IB HCA:
https://patchwork.kernel.org/patch/10041381/

> So again it is definitely a workaround, no any reasoning, no any theory, just
> say it is a fix. Are there anyone who can explain the story?
> 
> IMO, the blk_get_request() in dm-mpath is just like a kmalloc(ATOMIC) which
> is used in other blk-mq drivers' IO path too, so don't know why dm-mq is so
> strange to require such ugly 100ms delay.

The header for commit 6077c2d706 notes the same action that Jens took
to unwedge the request stalls (in the patchwork thread I referenced
above), with:

echo run > /sys/kernel/debug/block/nullb1/state
vs what Bart noted in commit 6077c2d706:
echo run >/sys/kernel/debug/block/dm-0/state

Really does feel like the issue Jens' shared tags fix addressed _is_ the
root cause that commit 6077c2d706 worked around.

> So I suggest to remove it, let's see if there are reports, and if there
> are, I am quite confident we can root cause that and fix that.

Yeah, it really is ridiculous that we're getting this kind of pushback
for something that was/is so poorly justified.  And especially given
that dm_mq_queue_rq() _still_ has this:

        if (ti->type->busy && ti->type->busy(ti))
                return BLK_STS_RESOURCE;

yet our desire to avoid blk_mq_delay_run_hw_queue(hctx, 100/*ms*/) in
response to blk_get_request() failure, just prior to returning
BLK_STS_RESOURCE in dm_mq_queue_rq(), is met with hellfire.
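
For what it's worth, the behavioral difference under debate can be reduced to a toy model (all identifiers below are illustrative stand-ins, not the kernel's actual types or functions):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the contested dm_mq_queue_rq() resource-failure path;
 * names are invented for illustration only. */
enum action {
    DISPATCHED,          /* clone allocated and issued to the path */
    REQUEUE_WITH_DELAY,  /* old: blk_mq_delay_run_hw_queue(hctx, 100ms) */
    REQUEUE_ON_RESTART,  /* new: rely on BLK_MQ_S_SCHED_RESTART */
};

/* With commit 6077c2d706: a blk_get_request() failure schedules a
 * blind 100ms rerun of the hw queue on top of BLK_STS_RESOURCE. */
enum action queue_rq_old(bool clone_alloc_fails)
{
    return clone_alloc_fails ? REQUEUE_WITH_DELAY : DISPATCHED;
}

/* After removing the delay: BLK_STS_RESOURCE is returned and the queue
 * is rerun when an in-flight request completes (the RESTART mechanism),
 * instead of stalling every dispatch attempt by a fixed 100ms. */
enum action queue_rq_new(bool clone_alloc_fails)
{
    return clone_alloc_fails ? REQUEUE_ON_RESTART : DISPATCHED;
}
```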

I refuse to submit to this highly unexpected cargo cult programming.

This is going upstream for 4.16:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=dm-4.16&id=5b18cff4baedde77e0d69bd62a13ae78f9488d89

I've lost patience with the unwillingness to reassess/justify the need
for commit 6077c2d706; so it is just getting removed, and we'll debug
and fix any future reported blk-mq stalls (and/or high cpu load) in a
much more precise manner -- provided the reporter is willing to work
with us!

Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
@ 2018-01-12 17:26               ` Bart Van Assche
  0 siblings, 0 replies; 48+ messages in thread
From: Bart Van Assche @ 2018-01-12 17:26 UTC (permalink / raw)
  To: snitzer, ming.lei
  Cc: axboe, linux-block, martin.petersen, axboe, hch, dm-devel

On Fri, 2018-01-12 at 12:18 -0500, Mike Snitzer wrote:
> This is going upstream for 4.16:
> https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=dm-4.16&id=5b18cff4baedde77e0d69bd62a13ae78f9488d89

That is really gross. I have explained many times in detail on the dm-devel
list why that blk_mq_delay_run_hw_queue() call is necessary. So I think it is
wrong if you remove it. Isn't the responsibility of you as the device mapper
kernel maintainer to avoid regressions instead of introducing regressions?

Bart.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-12 17:26               ` Bart Van Assche
  (?)
@ 2018-01-12 17:40               ` Mike Snitzer
  2018-01-12 17:46                   ` Bart Van Assche
  -1 siblings, 1 reply; 48+ messages in thread
From: Mike Snitzer @ 2018-01-12 17:40 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: ming.lei, dm-devel, hch, linux-block, axboe, axboe, martin.petersen

On Fri, Jan 12 2018 at 12:26pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Fri, 2018-01-12 at 12:18 -0500, Mike Snitzer wrote:
> > This is going upstream for 4.16:
> > https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=dm-4.16&id=5b18cff4baedde77e0d69bd62a13ae78f9488d89
> 
> That is really gross. I have explained many times in detail on the dm-devel
> list why that blk_mq_delay_run_hw_queue() call is necessary. So I think it is
> wrong if you remove it. Isn't the responsibility of you as the device mapper
> kernel maintainer to avoid regressions instead of introducing regressions?

Please stop, seriously.

You've not explained it many times.  We cannot get a straight answer
from you.  No analysis establishes that, if an underlying dm-mq
multipath path is out of tags (shared or otherwise), the dm-rq core
_must_ run the hw queue after a delay.

Commit 6077c2d7060 just papers over a real blk-mq problem (that may now
be fixed).  Your assertion that blk-mq would otherwise need to poll
(and waste resources so it can immediately retry) ignores that blk-mq
_should_ make progress as requests complete or if/when a path is
recovered, etc.  So I'm not going to accept your dysfunctional reasoning
on this, sorry.  _THAT_ is my job as a maintainer.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
@ 2018-01-12 17:46                   ` Bart Van Assche
  0 siblings, 0 replies; 48+ messages in thread
From: Bart Van Assche @ 2018-01-12 17:46 UTC (permalink / raw)
  To: snitzer
  Cc: dm-devel, hch, linux-block, martin.petersen, axboe, ming.lei, axboe

On Fri, 2018-01-12 at 12:40 -0500, Mike Snitzer wrote:
> You've not explained it many times.

That's not correct. I have already posted a detailed and easy-to-understand
explanation several times.

> We cannot get a straight answer from you.

That is a gross and incorrect statement. Please calm down.

Bart.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-12 17:46                   ` Bart Van Assche
  (?)
@ 2018-01-12 18:06                   ` Mike Snitzer
  2018-01-12 18:54                       ` Bart Van Assche
  -1 siblings, 1 reply; 48+ messages in thread
From: Mike Snitzer @ 2018-01-12 18:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: dm-devel, hch, linux-block, martin.petersen, axboe, ming.lei, axboe

On Fri, Jan 12 2018 at 12:46pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Fri, 2018-01-12 at 12:40 -0500, Mike Snitzer wrote:
> > You've not explained it many times.
> 
> That's not correct. I have already several times posted a detailed and easy
> to understand explanation

OK, you have the stage: please give me a pointer to the best
explanation of the several.

> > We cannot get a straight answer from you.
>
> That is a gross and incorrect statement. Please calm down.

I'm perfectly calm.  I'm just tired of how you sow controversy.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
@ 2018-01-12 18:54                       ` Bart Van Assche
  0 siblings, 0 replies; 48+ messages in thread
From: Bart Van Assche @ 2018-01-12 18:54 UTC (permalink / raw)
  To: snitzer
  Cc: dm-devel, hch, linux-block, axboe, martin.petersen, axboe, ming.lei

On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> OK, you have the stage: please give me a pointer to your best
> explanation of the several.

Since the previous discussion about this topic occurred more than a month
ago it could take more time to look up an explanation than to explain it
again. Anyway, here we go. As you know a block layer request queue needs to
be rerun if one or more requests are waiting and a previous condition that
prevented the requests from being executed has been cleared. For the dm-mpath
driver, examples of such conditions are: no tags available, a path that is
busy (see also pgpath_busy()), path initialization that is in progress
(pg_init_in_progress), or a request that completes with an error status,
e.g. if the SCSI core calls __blk_mq_end_request(req, error) with error != 0.
For some of these conditions, e.g. completion of path initialization, a
callback function in the dm-mpath driver is called and it is possible to
explicitly rerun the queue. I agree that for such scenarios a delayed queue
run should not be triggered. For other scenarios, e.g. if a SCSI initiator submits a
SCSI request over a fabric and the SCSI target replies with "BUSY" then the
SCSI core will end the I/O request with status BLK_STS_RESOURCE after the
maximum number of retries has been reached (see also scsi_io_completion()).
In that last case, if a SCSI target sends a "BUSY" reply over the wire back
to the initiator, there is no other approach for the SCSI initiator to
figure out whether it can queue another request than to resubmit the
request. The worst possible strategy is to resubmit a request immediately
because that will cause a significant fraction of the fabric bandwidth to
be used just for replying "BUSY" to requests that can't be processed
immediately.

The intention of commit 6077c2d706097c0 was to address the last mentioned
case. It may be possible to move the delayed queue rerun from the
dm_queue_rq() into dm_requeue_original_request(). But I think it would be
wrong to rerun the queue immediately in case a SCSI target system returns
"BUSY".

Bart.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 2/5] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure
@ 2018-01-12 19:04     ` Bart Van Assche
  0 siblings, 0 replies; 48+ messages in thread
From: Bart Van Assche @ 2018-01-12 19:04 UTC (permalink / raw)
  To: dm-devel, hch, linux-block, axboe, ming.lei, snitzer

On Thu, 2018-01-11 at 14:01 +0800, Ming Lei wrote:
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 86bf502a8e51..fcddf5a62581 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -533,8 +533,20 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>  		if (queue_dying) {
>  			atomic_inc(&m->pg_init_in_progress);
>  			activate_or_offline_path(pgpath);
> +			return DM_MAPIO_DELAY_REQUEUE;
>  		}
> -		return DM_MAPIO_DELAY_REQUEUE;
> +
> +		/*
> +		 * blk-mq's SCHED_RESTART can cover this requeue, so
> +		 * we needn't to deal with it by DELAY_REQUEUE. More
> +		 * importantly, we have to return DM_MAPIO_REQUEUE
> +		 * so that blk-mq can get the queue busy feedback,
> +		 * otherwise I/O merge can be hurt.
> +		 */
> +		if (q->mq_ops)
> +			return DM_MAPIO_REQUEUE;
> +		else
> +			return DM_MAPIO_DELAY_REQUEUE;
>  	}

Sorry but the approach of this patch looks wrong to me. I'm afraid that this
approach will cause 100% CPU consumption if the underlying .queue_rq()
function returns BLK_STS_RESOURCE for another reason than what can be detected
by the .get_budget() call. This can happen if e.g. a SCSI LLD .queuecommand()
implementation returns SCSI_MLQUEUE_HOST_BUSY. Many SCSI LLDs can do this:
$ git grep 'SCSI_MLQUEUE_[^_]*_BUSY' | wc -l
204

Isn't this a severe regression?

Bart.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-12 18:54                       ` Bart Van Assche
  (?)
@ 2018-01-12 19:29                       ` Mike Snitzer
  -1 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2018-01-12 19:29 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: dm-devel, hch, linux-block, axboe, martin.petersen, axboe, ming.lei

On Fri, Jan 12 2018 at  1:54pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> > OK, you have the stage: please give me a pointer to your best
> > explaination of the several.
> 
> Since the previous discussion about this topic occurred more than a month
> ago it could take more time to look up an explanation than to explain it
> again. Anyway, here we go. As you know a block layer request queue needs to
> be rerun if one or more requests are waiting and a previous condition that
> prevented the request to be executed has been cleared. For the dm-mpath
> driver, examples of such conditions are no tags available, a path that is
> busy (see also pgpath_busy()), path initialization that is in progress
> (pg_init_in_progress) or a request completes with status, e.g. if the
> SCSI core calls __blk_mq_end_request(req, error) with error != 0. For some
> of these conditions, e.g. path initialization completes, a callback
> function in the dm-mpath driver is called and it is possible to explicitly
> rerun the queue. I agree that for such scenario's a delayed queue run should
> not be triggered. For other scenario's, e.g. if a SCSI initiator submits a
> SCSI request over a fabric and the SCSI target replies with "BUSY" then the
> SCSI core will end the I/O request with status BLK_STS_RESOURCE after the
> maximum number of retries has been reached (see also scsi_io_completion()).
> In that last case, if a SCSI target sends a "BUSY" reply over the wire back
> to the initiator, there is no other approach for the SCSI initiator to
> figure out whether it can queue another request than to resubmit the
> request. The worst possible strategy is to resubmit a request immediately
> because that will cause a significant fraction of the fabric bandwidth to
> be used just for replying "BUSY" to requests that can't be processed
> immediately.
> 
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case. It may be possible to move the delayed queue rerun from the
> dm_queue_rq() into dm_requeue_original_request(). But I think it would be
> wrong to rerun the queue immediately in case a SCSI target system returns
> "BUSY".

OK, thank you very much for this.  Really helps.

For starters, multipath_clone_and_map() could do a fair amount more with
the insight that a SCSI "BUSY" was transmitted back.  If both blk-mq
running out of tags and a SCSI "BUSY" simply surface as BLK_STS_RESOURCE,
then dm-mpath doesn't have the ability to behave more intelligently.

Anyway, armed with this info I'll have a think about what we might do to
tackle this problem head on.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
@ 2018-01-12 19:53                         ` Elliott, Robert (Persistent Memory)
  0 siblings, 0 replies; 48+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2018-01-12 19:53 UTC (permalink / raw)
  To: Bart Van Assche, snitzer
  Cc: dm-devel, hch, linux-block, axboe, martin.petersen, axboe, ming.lei



> -----Original Message-----
> From: linux-block-owner@vger.kernel.org [mailto:linux-block-
> owner@vger.kernel.org] On Behalf Of Bart Van Assche
...
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case. It may be possible to move the delayed queue rerun from the
> dm_queue_rq() into dm_requeue_original_request(). But I think it would be
> wrong to rerun the queue immediately in case a SCSI target system returns
> "BUSY".

Seeing SCSI BUSY mentioned here...

On its own, patch 6077c2d706 seems to be adding an arbitrarily selected
magic value of 100 ms without explanation in the patch description or
in the added code.

But, dm_requeue_original_request() already seems to have chosen that value
for similar purposes:
        unsigned long delay_ms = delay_requeue ? 100 : 0;

So, making them share a #define would indicate there's a reason for that
particular value.  Any change to the value would be picked up everywhere.

All the other callers of blk_mq_delay_run_hw_queue() use macros:
drivers/md/dm-rq.c:             blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
drivers/nvme/host/fc.c:         blk_mq_delay_run_hw_queue(queue->hctx, NVMEFC_QUEUE_DELAY);
drivers/scsi/scsi_lib.c:                blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
drivers/scsi/scsi_lib.c:                        blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);

Those both happen to be set to 3 (ms).


---
Robert Elliott, HPE Persistent Memory




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-12 18:54                       ` Bart Van Assche
                                         ` (2 preceding siblings ...)
  (?)
@ 2018-01-12 22:31                       ` Mike Snitzer
  2018-01-13 15:04                           ` Ming Lei
  -1 siblings, 1 reply; 48+ messages in thread
From: Mike Snitzer @ 2018-01-12 22:31 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: dm-devel, hch, linux-block, axboe, martin.petersen, axboe, ming.lei

On Fri, Jan 12 2018 at  1:54pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> > OK, you have the stage: please give me a pointer to your best
> > explaination of the several.
> 
> Since the previous discussion about this topic occurred more than a month
> ago it could take more time to look up an explanation than to explain it
> again. Anyway, here we go. As you know a block layer request queue needs to
> be rerun if one or more requests are waiting and a previous condition that
> prevented the request to be executed has been cleared. For the dm-mpath
> driver, examples of such conditions are no tags available, a path that is
> busy (see also pgpath_busy()), path initialization that is in progress
> (pg_init_in_progress) or a request completes with status, e.g. if the
> SCSI core calls __blk_mq_end_request(req, error) with error != 0. For some
> of these conditions, e.g. path initialization completes, a callback
> function in the dm-mpath driver is called and it is possible to explicitly
> rerun the queue. I agree that for such scenario's a delayed queue run should
> not be triggered. For other scenario's, e.g. if a SCSI initiator submits a
> SCSI request over a fabric and the SCSI target replies with "BUSY" then the
> SCSI core will end the I/O request with status BLK_STS_RESOURCE after the
> maximum number of retries has been reached (see also scsi_io_completion()).
> In that last case, if a SCSI target sends a "BUSY" reply over the wire back
> to the initiator, there is no other approach for the SCSI initiator to
> figure out whether it can queue another request than to resubmit the
> request. The worst possible strategy is to resubmit a request immediately
> because that will cause a significant fraction of the fabric bandwidth to
> be used just for replying "BUSY" to requests that can't be processed
> immediately.

The thing is, the 2 scenarios you are most concerned about have
_nothing_ to do with dm_mq_queue_rq() at all.  They occur as an error in
the IO path _after_ the request is successfully retrieved with
blk_get_request() and then submitted.
 
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case.

So it really makes me question why you think commit 6077c2d706097c0
addresses the issue you think it does.  And it gives me more conviction
to remove 6077c2d706097c0.

It may help just by virtue of blindly kicking the queue if
blk_get_request() fails (regardless of whether the target is responding
with BUSY or not).  Very unsatisfying to say the least.

I think it'd be much more beneficial for dm-mpath.c:multipath_end_io()
to be trained to respond more intelligently to BLK_STS_RESOURCE.

E.g. set BLK_MQ_S_SCHED_RESTART if requests are known to be outstanding
on the path.  This is one case where Ming said the queue would be
re-run, as detailed in this header:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=5b18cff4baedde77e0d69bd62a13ae78f9488d89

And Jens has reinforced to me that BLK_MQ_S_SCHED_RESTART is a means to
kicking the queue more efficiently.  _BUT_ I'm not seeing any external
blk-mq interface that exposes this capability to a blk-mq driver.  As is
BLK_MQ_S_SCHED_RESTART gets set very much in the bowels of blk-mq
(blk_mq_sched_mark_restart_hctx).

SO I have to do more homework here...

Ming or Jens: might you be able to shed some light on how dm-mpath
would/could set BLK_MQ_S_SCHED_RESTART?  A new function added that can
be called from q->softirq_done_fn in response to BLK_STS_RESOURCE?  Or
is this level of control (and blk-mq detail) something that a blk-mq
driver should need to care about?

Anyway, we may need to get to that level of blk-mq driver controlled
retry optimization.

But Bart: again, this is a vastly different feedback loop than the
stabbing in the dark that your commit 6077c2d706097c0 is doing.

Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-12 18:54                       ` Bart Van Assche
                                         ` (3 preceding siblings ...)
  (?)
@ 2018-01-12 23:17                       ` Mike Snitzer
  2018-01-12 23:42                           ` Bart Van Assche
  -1 siblings, 1 reply; 48+ messages in thread
From: Mike Snitzer @ 2018-01-12 23:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: dm-devel, hch, linux-block, axboe, martin.petersen, axboe, ming.lei

On Fri, Jan 12 2018 at  1:54pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
 
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case. It may be possible to move the delayed queue rerun from the
> dm_queue_rq() into dm_requeue_original_request(). But I think it would be
> wrong to rerun the queue immediately in case a SCSI target system returns
> "BUSY".

This is my 3rd reply to this email.. 3rd time's the charm? ;)

Here is a patch that will kick the hw queues via blk_mq_requeue_work(),
indirectly from dm-rq.c:__dm_mq_kick_requeue_list(), after a delay if
BLK_STS_RESOURCE is returned from the target.

Your thoughts on this patch as an alternative to commit 6077c2d7060 ?

Thanks,
Mike

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index d8c7259..ab2cfdc 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1570,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
 	if (error && blk_path_error(error)) {
 		struct multipath *m = ti->private;
 
-		r = DM_ENDIO_REQUEUE;
+		if (r == BLK_STS_RESOURCE)
+			r = DM_ENDIO_DELAY_REQUEUE;
+		else
+			r = DM_ENDIO_REQUEUE;
 
 		if (pgpath)
 			fail_path(pgpath);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 6b01298..ab0ed2d 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -315,6 +315,10 @@ static void dm_done(struct request *clone, blk_status_t error, bool mapped)
 		/* The target wants to requeue the I/O */
 		dm_requeue_original_request(tio, false);
 		break;
+	case DM_ENDIO_DELAY_REQUEUE:
+		/* The target wants to requeue the I/O after a delay */
+		dm_requeue_original_request(tio, true);
+		break;
 	default:
 		DMWARN("unimplemented target endio return value: %d", r);
 		BUG();
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 9ba8453..da83f64 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -550,6 +550,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
 #define DM_ENDIO_DONE		0
 #define DM_ENDIO_INCOMPLETE	1
 #define DM_ENDIO_REQUEUE	2
+#define DM_ENDIO_DELAY_REQUEUE	3
 
 /*
  * Definitions of return values from target map function.
@@ -557,7 +558,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
 #define DM_MAPIO_SUBMITTED	0
 #define DM_MAPIO_REMAPPED	1
 #define DM_MAPIO_REQUEUE	DM_ENDIO_REQUEUE
-#define DM_MAPIO_DELAY_REQUEUE	3
+#define DM_MAPIO_DELAY_REQUEUE	DM_ENDIO_DELAY_REQUEUE
 #define DM_MAPIO_KILL		4
 
 #define dm_sector_div64(x, y)( \

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
@ 2018-01-12 23:42                           ` Bart Van Assche
  0 siblings, 0 replies; 48+ messages in thread
From: Bart Van Assche @ 2018-01-12 23:42 UTC (permalink / raw)
  To: snitzer
  Cc: dm-devel, hch, linux-block, ming.lei, martin.petersen, axboe, axboe

On Fri, 2018-01-12 at 18:17 -0500, Mike Snitzer wrote:
> @@ -1570,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
>  	if (error && blk_path_error(error)) {
>  		struct multipath *m = ti->private;
>  
> -		r = DM_ENDIO_REQUEUE;
> +		if (r == BLK_STS_RESOURCE)
> +			r = DM_ENDIO_DELAY_REQUEUE;
> +		else
> +			r = DM_ENDIO_REQUEUE;

Did you perhaps intend "error == BLK_STS_RESOURCE"?

> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index 9ba8453..da83f64 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -550,6 +550,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
>  #define DM_ENDIO_DONE		0
>  #define DM_ENDIO_INCOMPLETE	1
>  #define DM_ENDIO_REQUEUE	2
> +#define DM_ENDIO_DELAY_REQUEUE	3
>  
>  /*
>   * Definitions of return values from target map function.
> @@ -557,7 +558,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
>  #define DM_MAPIO_SUBMITTED	0
>  #define DM_MAPIO_REMAPPED	1
>  #define DM_MAPIO_REQUEUE	DM_ENDIO_REQUEUE
> -#define DM_MAPIO_DELAY_REQUEUE	3
> +#define DM_MAPIO_DELAY_REQUEUE	DM_ENDIO_DELAY_REQUEUE
>  #define DM_MAPIO_KILL		4
>  
>  #define dm_sector_div64(x, y)( \

Please consider introducing enumeration types for the DM_ENDIO_* and the
DM_MAPIO_* constants such that the compiler can catch what I reported above.
Otherwise this patch looks fine to me.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-12 23:42                           ` Bart Van Assche
  (?)
@ 2018-01-13  0:45                           ` Mike Snitzer
  -1 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2018-01-13  0:45 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: dm-devel, hch, linux-block, ming.lei, martin.petersen, axboe, axboe

On Fri, Jan 12 2018 at  6:42pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Fri, 2018-01-12 at 18:17 -0500, Mike Snitzer wrote:
> > @@ -1570,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
> >  	if (error && blk_path_error(error)) {
> >  		struct multipath *m = ti->private;
> >  
> > -		r = DM_ENDIO_REQUEUE;
> > +		if (r == BLK_STS_RESOURCE)
> > +			r = DM_ENDIO_DELAY_REQUEUE;
> > +		else
> > +			r = DM_ENDIO_REQUEUE;
> 
> Did you perhaps intend "error == BLK_STS_RESOURCE"?

Yes, it was a quick patch to get your thoughts.

> 
> > diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> > index 9ba8453..da83f64 100644
> > --- a/include/linux/device-mapper.h
> > +++ b/include/linux/device-mapper.h
> > @@ -550,6 +550,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
> >  #define DM_ENDIO_DONE		0
> >  #define DM_ENDIO_INCOMPLETE	1
> >  #define DM_ENDIO_REQUEUE	2
> > +#define DM_ENDIO_DELAY_REQUEUE	3
> >  
> >  /*
> >   * Definitions of return values from target map function.
> > @@ -557,7 +558,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
> >  #define DM_MAPIO_SUBMITTED	0
> >  #define DM_MAPIO_REMAPPED	1
> >  #define DM_MAPIO_REQUEUE	DM_ENDIO_REQUEUE
> > -#define DM_MAPIO_DELAY_REQUEUE	3
> > +#define DM_MAPIO_DELAY_REQUEUE	DM_ENDIO_DELAY_REQUEUE
> >  #define DM_MAPIO_KILL		4
> >  
> >  #define dm_sector_div64(x, y)( \
> 
> Please consider to introduce enumeration types for the DM_ENDIO_* and the
> DM_MAPIO_* constants such that the compiler can catch what I reported above.

OK, point taken.

> Otherwise this patch looks fine to me.

Cool, thanks.

Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-12 19:53                         ` Elliott, Robert (Persistent Memory)
  (?)
@ 2018-01-13  0:52                         ` Mike Snitzer
  2018-01-13  1:00                             ` Bart Van Assche
  -1 siblings, 1 reply; 48+ messages in thread
From: Mike Snitzer @ 2018-01-13  0:52 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory)
  Cc: Bart Van Assche, dm-devel, hch, linux-block, axboe,
	martin.petersen, axboe, ming.lei

On Fri, Jan 12 2018 at  2:53pm -0500,
Elliott, Robert (Persistent Memory) <elliott@hpe.com> wrote:

> 
> 
> > -----Original Message-----
> > From: linux-block-owner@vger.kernel.org [mailto:linux-block-
> > owner@vger.kernel.org] On Behalf Of Bart Van Assche
> ...
> > The intention of commit 6077c2d706097c0 was to address the last mentioned
> > case. It may be possible to move the delayed queue rerun from the
> > dm_queue_rq() into dm_requeue_original_request(). But I think it would be
> > wrong to rerun the queue immediately in case a SCSI target system returns
> > "BUSY".
> 
> Seeing SCSI BUSY mentioned here...
> 
> On its own, patch 6077c2d706 seems to be adding an arbitrarily selected
> magic value of 100 ms without explanation in the patch description or
> in the added code.

It was 50 ms before it was 100 ms.  No real explanation for these
values other than they seem to make Bart's IB SRP testbed happy?
 
> But, dm_request_original_request() already seems to have chosen that value
> for similar purposes:
>         unsigned long delay_ms = delay_requeue ? 100 : 0;
> 
> So, making them share a #define would indicate there's a reason for that
> particular value.  Any change to the value would be picked up everywhere.
> 
> All the other callers of blk_mq_delay_run_hw_queue() use macros:
> drivers/md/dm-rq.c:             blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> drivers/nvme/host/fc.c:         blk_mq_delay_run_hw_queue(queue->hctx, NVMEFC_QUEUE_DELAY);
> drivers/scsi/scsi_lib.c:                blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> drivers/scsi/scsi_lib.c:                        blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);

Sure, I'll add a #define.

> Those both happen to be set to 3 (ms).

As for the value of 100: we're dealing with path faults, so retrying them
extremely fast could be wasted effort.  But there is obviously no one
size fits all.  As storage gets faster, 100 ms could prove to be very
bad.

But without tests to verify a change, there won't be one.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
@ 2018-01-13  1:00                             ` Bart Van Assche
  0 siblings, 0 replies; 48+ messages in thread
From: Bart Van Assche @ 2018-01-13  1:00 UTC (permalink / raw)
  To: snitzer, elliott
  Cc: dm-devel, hch, linux-block, ming.lei, martin.petersen, axboe, axboe

On Fri, 2018-01-12 at 19:52 -0500, Mike Snitzer wrote:
> It was 50 ms before it was 100 ms.  No real explaination for these
> values other than they seem to make Bart's IB SRP testbed happy?

But that constant was not introduced by me in the dm code. See e.g. the
following commits:

commit d548b34b062b60b4f4df295a0b4823dfda1f1fc4
Author: Mike Snitzer <snitzer@redhat.com>
Date:   Thu Mar 5 22:21:10 2015 -0500

    dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms
     
    Commit 7eaceaccab ("block: remove per-queue plugging") didn't justify
    DM's use of a 100ms delay; such an extended delay is a liability when
    there is reason to re-kick the queue.
     
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 7eaceaccab5f40bbfda044629a6298616aeaed50
Author: Jens Axboe <jaxboe@fusionio.com>
Date:   Thu Mar 10 08:52:07 2011 +0100

    block: remove per-queue plugging
     
    Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So lets kill off the old plugging along with aops->sync_page().
     
    Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

Bart. 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 2/5] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure
  2018-01-12 19:04     ` Bart Van Assche
  (?)
@ 2018-01-13  1:29     ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2018-01-13  1:29 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: dm-devel, hch, linux-block, axboe, snitzer

On Fri, Jan 12, 2018 at 07:04:28PM +0000, Bart Van Assche wrote:
> On Thu, 2018-01-11 at 14:01 +0800, Ming Lei wrote:
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index 86bf502a8e51..fcddf5a62581 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -533,8 +533,20 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
> >  		if (queue_dying) {
> >  			atomic_inc(&m->pg_init_in_progress);
> >  			activate_or_offline_path(pgpath);
> > +			return DM_MAPIO_DELAY_REQUEUE;
> >  		}
> > -		return DM_MAPIO_DELAY_REQUEUE;
> > +
> > +		/*
> > +		 * blk-mq's SCHED_RESTART can cover this requeue, so
> > +		 * we needn't to deal with it by DELAY_REQUEUE. More
> > +		 * importantly, we have to return DM_MAPIO_REQUEUE
> > +		 * so that blk-mq can get the queue busy feedback,
> > +		 * otherwise I/O merge can be hurt.
> > +		 */
> > +		if (q->mq_ops)
> > +			return DM_MAPIO_REQUEUE;
> > +		else
> > +			return DM_MAPIO_DELAY_REQUEUE;
> >  	}
> 
> Sorry but the approach of this patch looks wrong to me. I'm afraid that this
> approach will cause 100% CPU consumption if the underlying .queue_rq()
> function returns BLK_STS_RESOURCE for another reason than what can be detected
> by the .get_budget() call. This can happen if e.g. a SCSI LLD .queuecommand()
> implementation returns SCSI_MLQUEUE_HOST_BUSY. Many SCSI LLDs can do this:
> $ git grep 'SCSI_MLQUEUE_[^_]*_BUSY' | wc -l
> 204
> 
> Isn't this a severe regression?

No, it is not.

This DM_MAPIO_REQUEUE is converted to BLK_STS_RESOURCE in dm-rq and then
returned to blk-mq. That means when we run out of resources for dispatching
the underlying request, we tell blk-mq the truth. This approach is already
used by other blk-mq drivers, such as NVMe; more importantly, we have to
provide this real feedback to blk-mq to fix the performance issue.

blk_get_request(GFP_ATOMIC) is just like kmalloc(GFP_ATOMIC), and the memory
is shared system-wide too. There are lots of GFP_ATOMIC allocations in the
NVMe I/O path; when one fails, BLK_STS_RESOURCE is returned to blk-mq.

blk-mq has a built-in SCHED_RESTART mechanism to handle running out of
resources when BLK_STS_RESOURCE is observed.

.get_budget() is implemented only for SCSI, and other blk-mq drivers (NVMe,
...) often return BLK_STS_RESOURCE to blk-mq too, so with respect to this
usage there isn't any difference between dm-mpath and the other cases, right?


Thanks,
Ming

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
  2018-01-13  1:00                             ` Bart Van Assche
  (?)
@ 2018-01-13  1:37                             ` Mike Snitzer
  2018-01-13 15:14                               ` Mike Snitzer
  -1 siblings, 1 reply; 48+ messages in thread
From: Mike Snitzer @ 2018-01-13  1:37 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: elliott, axboe, linux-block, martin.petersen, axboe, ming.lei,
	hch, dm-devel

On Fri, Jan 12 2018 at  8:00pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Fri, 2018-01-12 at 19:52 -0500, Mike Snitzer wrote:
> > It was 50 ms before it was 100 ms.  No real explaination for these
> > values other than they seem to make Bart's IB SRP testbed happy?
> 
> But that constant was not introduced by me in the dm code.

No actually it was (not that there's anything wrong with that):

commit 06eb061f48594aa369f6e852b352410298b317a8
Author: Bart Van Assche <bart.vanassche@sandisk.com>
Date:   Fri Apr 7 16:50:44 2017 -0700

    dm mpath: requeue after a small delay if blk_get_request() fails

    If blk_get_request() returns ENODEV then multipath_clone_and_map()
    causes a request to be requeued immediately. This can cause a kworker
    thread to spend 100% of the CPU time of a single core in
    __blk_mq_run_hw_queue() and also can cause device removal to never
    finish.

    Avoid this by only requeuing after a delay if blk_get_request() fails.
    Additionally, reduce the requeue delay.

    Cc: stable@vger.kernel.org # 4.9+
    Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

Note that this commit actually documents a different case, where a
blk_get_request() return of -ENODEV (in the existing code) is a very
compelling reason to use DM_MAPIO_DELAY_REQUEUE.

So I'll revisit what is appropriate in multipath_clone_and_map() on
Monday.

I need a break... have a good weekend Bart.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-12 18:54                       ` Bart Van Assche
                                         ` (4 preceding siblings ...)
  (?)
@ 2018-01-13 14:34                       ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2018-01-13 14:34 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: snitzer, dm-devel, hch, linux-block, axboe, martin.petersen, axboe

On Fri, Jan 12, 2018 at 06:54:49PM +0000, Bart Van Assche wrote:
> On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> > OK, you have the stage: please give me a pointer to your best
> > explaination of the several.
> 
> Since the previous discussion about this topic occurred more than a month
> ago it could take more time to look up an explanation than to explain it
> again. Anyway, here we go. As you know a block layer request queue needs to
> be rerun if one or more requests are waiting and a previous condition that
> prevented the request to be executed has been cleared. For the dm-mpath
> driver, examples of such conditions are no tags available, a path that is
> busy (see also pgpath_busy()), path initialization that is in progress
> (pg_init_in_progress) or a request completes with status, e.g. if the
> SCSI core calls __blk_mq_end_request(req, error) with error != 0. For some
> of these conditions, e.g. path initialization completes, a callback
> function in the dm-mpath driver is called and it is possible to explicitly
> rerun the queue. I agree that for such scenario's a delayed queue run should
> not be triggered. For other scenario's, e.g. if a SCSI initiator submits a
> SCSI request over a fabric and the SCSI target replies with "BUSY" then the
> SCSI core will end the I/O request with status BLK_STS_RESOURCE after the
> maximum number of retries has been reached (see also scsi_io_completion()).
> In that last case, if a SCSI target sends a "BUSY" reply over the wire back
> to the initiator, there is no other approach for the SCSI initiator to
> figure out whether it can queue another request than to resubmit the
> request. The worst possible strategy is to resubmit a request immediately
> because that will cause a significant fraction of the fabric bandwidth to
> be used just for replying "BUSY" to requests that can't be processed
> immediately.

That isn't true: when BLK_STS_RESOURCE is returned to blk-mq, blk-mq applies
BLK_MQ_S_SCHED_RESTART to hold the queue until an in-flight request
completes; please see blk_mq_sched_restart(), which is called from
blk_mq_free_request().

Also, now that we have I/O schedulers, a NULL return from blk_get_request()
in dm-mpath doesn't reflect the underlying queue's busy state accurately or
in time, since by default the scheduler tag set is twice the size of the
driver tag set. So it isn't good to depend on blk_get_request() alone to
evaluate the queue's busy status; this patchset provides the underlying
queue's dispatch result directly to blk-mq and can deal with this case much
better.

> 
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case. It may be possible to move the delayed queue rerun from the
> dm_queue_rq() into dm_requeue_original_request(). But I think it would be
> wrong to rerun the queue immediately in case a SCSI target system returns
> "BUSY".

Again, the queue won't be rerun immediately after BLK_STS_RESOURCE is
returned to blk-mq. And BLK_MQ_S_SCHED_RESTART should address your concern
about continuous resubmission when we run out of requests, right?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-12 22:31                       ` Mike Snitzer
@ 2018-01-13 15:04                           ` Ming Lei
  0 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2018-01-13 15:04 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Bart Van Assche, dm-devel, hch, linux-block, axboe,
	martin.petersen, axboe

On Fri, Jan 12, 2018 at 05:31:17PM -0500, Mike Snitzer wrote:
> On Fri, Jan 12 2018 at  1:54pm -0500,
> Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> 
> > On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> > > OK, you have the stage: please give me a pointer to your best
> > > explaination of the several.
> > 
> > Since the previous discussion about this topic occurred more than a month
> > ago it could take more time to look up an explanation than to explain it
> > again. Anyway, here we go. As you know a block layer request queue needs to
> > be rerun if one or more requests are waiting and a previous condition that
> > prevented the request to be executed has been cleared. For the dm-mpath
> > driver, examples of such conditions are no tags available, a path that is
> > busy (see also pgpath_busy()), path initialization that is in progress
> > (pg_init_in_progress) or a request completes with status, e.g. if the
> > SCSI core calls __blk_mq_end_request(req, error) with error != 0. For some
> > of these conditions, e.g. path initialization completes, a callback
> > function in the dm-mpath driver is called and it is possible to explicitly
> > rerun the queue. I agree that for such scenario's a delayed queue run should
> > not be triggered. For other scenario's, e.g. if a SCSI initiator submits a
> > SCSI request over a fabric and the SCSI target replies with "BUSY" then the
> > SCSI core will end the I/O request with status BLK_STS_RESOURCE after the
> > maximum number of retries has been reached (see also scsi_io_completion()).
> > In that last case, if a SCSI target sends a "BUSY" reply over the wire back
> > to the initiator, there is no other approach for the SCSI initiator to
> > figure out whether it can queue another request than to resubmit the
> > request. The worst possible strategy is to resubmit a request immediately
> > because that will cause a significant fraction of the fabric bandwidth to
> > be used just for replying "BUSY" to requests that can't be processed
> > immediately.
> 
> The thing is, the 2 scenarios you are most concerned about have
> _nothing_ to do with dm_mq_queue_rq() at all.  They occur as an error in
> the IO path _after_ the request is successfully retrieved with
> blk_get_request() and then submitted.
>  
> > The intention of commit 6077c2d706097c0 was to address the last mentioned
> > case.
> 
> So it really makes me question why you think commit 6077c2d706097c0
> addresses the issue you think it does.  And gives me more conviction to
> remove 6077c2d706097c0.
> 
> It may help just by virtue of blindly kicking the queue if
> blk_get_request() fails (regardless of the target is responding with
> BUSY or not).  Very unsatisfying to say the least.
> 
> I think it'd be much more beneficial for dm-mpath.c:multipath_end_io()
> to be trained to be respond more intelligently to BLK_STS_RESOURCE.
> 
> E.g. set BLK_MQ_S_SCHED_RESTART if requests are known to be outstanding
> on the path.  This is one case where Ming said the queue would be
> re-run, as detailed in this header:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=5b18cff4baedde77e0d69bd62a13ae78f9488d89
> 
> And Jens has reinforced to me that BLK_MQ_S_SCHED_RESTART is a means to
> kicking the queue more efficiently.  _BUT_ I'm not seeing any external
> blk-mq interface that exposes this capability to a blk-mq driver.  As is
> BLK_MQ_S_SCHED_RESTART gets set very much in the bowels of blk-mq
> (blk_mq_sched_mark_restart_hctx).
> 
> SO I have to do more homework here...
> 
> Ming or Jens: might you be able to shed some light on how dm-mpath
> would/could set BLK_MQ_S_SCHED_RESTART?  A new function added that can

When BLK_STS_RESOURCE is returned from .queue_rq(), blk_mq_dispatch_rq_list()
checks whether BLK_MQ_S_SCHED_RESTART is set.

If it is already set, the queue isn't rerun for this request; it is only
rerun once an in-flight request completes, see blk_mq_sched_restart(), which
is called from blk_mq_free_request().

If BLK_MQ_S_SCHED_RESTART isn't set, the queue is rerun in
blk_mq_dispatch_rq_list(), and BLK_MQ_S_SCHED_RESTART is set before
.queue_rq() is called, see blk_mq_sched_mark_restart_hctx(), which is called
from blk_mq_sched_dispatch_requests().

This mechanism avoids rerunning the queue continuously on BLK_STS_RESOURCE,
which means drivers needn't work around it by adding arbitrary delays.


-- 
Ming

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5]  dm-rq: improve sequential I/O performance
  2018-01-13 15:04                           ` Ming Lei
  (?)
@ 2018-01-13 15:10                           ` Mike Snitzer
  -1 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2018-01-13 15:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, dm-devel, hch, linux-block, axboe,
	martin.petersen, axboe

On Sat, Jan 13 2018 at 10:04am -0500,
Ming Lei <ming.lei@redhat.com> wrote:

> On Fri, Jan 12, 2018 at 05:31:17PM -0500, Mike Snitzer wrote:
> > 
> > Ming or Jens: might you be able to shed some light on how dm-mpath
> > would/could set BLK_MQ_S_SCHED_RESTART?  A new function added that can
> 
> When BLK_STS_RESOURCE is returned from .queue_rq(), blk_mq_dispatch_rq_list()
> will check if BLK_MQ_S_SCHED_RESTART is set.
> 
> If it has been set, the queue won't be rerun for this request, and the queue
> will be rerun until one in-flight request is completed, see blk_mq_sched_restart()
> which is called from blk_mq_free_request().
> 
> If BLK_MQ_S_SCHED_RESTART isn't set, queue is rerun in blk_mq_dispatch_rq_list(),
> and BLK_MQ_S_SCHED_RESTART is set before calling .queue_rq(), see
> blk_mq_sched_mark_restart_hctx() which is called in blk_mq_sched_dispatch_requests().
> 
> This mechanism can avoid continuous running queue in case of STS_RESOURCE, that
> means drivers wouldn't worry about that by adding random delay.

Great, thanks for the overview.  Really appreciate it.

Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
  2018-01-13  1:37                             ` Mike Snitzer
@ 2018-01-13 15:14                               ` Mike Snitzer
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2018-01-13 15:14 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: elliott, axboe, linux-block, martin.petersen, axboe, ming.lei,
	hch, dm-devel

On Fri, Jan 12 2018 at  8:37pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Fri, Jan 12 2018 at  8:00pm -0500,
> Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> 
> > On Fri, 2018-01-12 at 19:52 -0500, Mike Snitzer wrote:
> > > It was 50 ms before it was 100 ms.  No real explaination for these
> > > values other than they seem to make Bart's IB SRP testbed happy?
> > 
> > But that constant was not introduced by me in the dm code.
> 
> No actually it was (not that there's anything wrong with that):
> 
> commit 06eb061f48594aa369f6e852b352410298b317a8
> Author: Bart Van Assche <bart.vanassche@sandisk.com>
> Date:   Fri Apr 7 16:50:44 2017 -0700
> 
>     dm mpath: requeue after a small delay if blk_get_request() fails
> 
>     If blk_get_request() returns ENODEV then multipath_clone_and_map()
>     causes a request to be requeued immediately. This can cause a kworker
>     thread to spend 100% of the CPU time of a single core in
>     __blk_mq_run_hw_queue() and also can cause device removal to never
>     finish.
> 
>     Avoid this by only requeuing after a delay if blk_get_request() fails.
>     Additionally, reduce the requeue delay.
> 
>     Cc: stable@vger.kernel.org # 4.9+
>     Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
>     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> 
> Note that this commit actually details a different case where a
> blk_get_request() (in existing code) return of -ENODEV is a very
> compelling case to use DM_MAPIO_DELAY_REQUEUE.
> 
> SO I'll revisit what is appropriate in multipath_clone_and_map() on
> Monday.

Sleep helped.  I had another look and it is only the old .request_fn
blk_get_request() code that even sets -ENODEV (if blk_queue_dying).

But thankfully the blk_get_request() error handling in
multipath_clone_and_map() checks for blk_queue_dying() and will return
DM_MAPIO_DELAY_REQUEUE.

So we're all set for this case.

Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2018-01-13 15:14 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-11  6:01 [PATCH V3 0/5] dm-rq: improve sequential I/O performance Ming Lei
2018-01-11  6:01 ` [PATCH V3 1/5] dm-mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE Ming Lei
2018-01-11  6:01 ` [PATCH V3 2/5] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure Ming Lei
2018-01-12 19:04   ` Bart Van Assche
2018-01-12 19:04     ` Bart Van Assche
2018-01-13  1:29     ` Ming Lei
2018-01-11  6:01 ` [PATCH V3 3/5] blk-mq: move actual issue into one helper Ming Lei
2018-01-11 22:09   ` Mike Snitzer
2018-01-11  6:01 ` [PATCH V3 4/5] blk-mq: return dispatch result to caller in blk_mq_try_issue_directly Ming Lei
2018-01-11 22:10   ` Mike Snitzer
2018-01-11  6:01 ` [PATCH V3 5/5] blk-mq: issue request directly for blk_insert_cloned_request Ming Lei
2018-01-11 22:42   ` Mike Snitzer
2018-01-11 22:07 ` [PATCH V3 0/5] dm-rq: improve sequential I/O performance Mike Snitzer
2018-01-11 22:37   ` Bart Van Assche
2018-01-11 22:37     ` Bart Van Assche
2018-01-11 22:58     ` Mike Snitzer
2018-01-11 23:27       ` Bart Van Assche
2018-01-11 23:27         ` Bart Van Assche
2018-01-12  1:43         ` Mike Snitzer
2018-01-12  1:42     ` Ming Lei
2018-01-12  1:57       ` Mike Snitzer
2018-01-12  3:33         ` Ming Lei
2018-01-12 17:18           ` Mike Snitzer
2018-01-12 17:26             ` Bart Van Assche
2018-01-12 17:26               ` Bart Van Assche
2018-01-12 17:40               ` Mike Snitzer
2018-01-12 17:46                 ` Bart Van Assche
2018-01-12 17:46                   ` Bart Van Assche
2018-01-12 18:06                   ` Mike Snitzer
2018-01-12 18:54                     ` Bart Van Assche
2018-01-12 18:54                       ` Bart Van Assche
2018-01-12 19:29                       ` Mike Snitzer
2018-01-12 19:53                       ` Elliott, Robert (Persistent Memory)
2018-01-12 19:53                         ` Elliott, Robert (Persistent Memory)
2018-01-13  0:52                         ` Mike Snitzer
2018-01-13  1:00                           ` Bart Van Assche
2018-01-13  1:00                             ` Bart Van Assche
2018-01-13  1:37                             ` Mike Snitzer
2018-01-13 15:14                               ` Mike Snitzer
2018-01-12 22:31                       ` Mike Snitzer
2018-01-13 15:04                         ` Ming Lei
2018-01-13 15:04                           ` Ming Lei
2018-01-13 15:10                           ` Mike Snitzer
2018-01-12 23:17                       ` Mike Snitzer
2018-01-12 23:42                         ` Bart Van Assche
2018-01-12 23:42                           ` Bart Van Assche
2018-01-13  0:45                           ` Mike Snitzer
2018-01-13 14:34                       ` Ming Lei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.