* [RFC PATCH 0/3] blk-mq: Timeout rework
@ 2018-05-21 23:11 ` Keith Busch
  0 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-05-21 23:11 UTC (permalink / raw)
  To: Jens Axboe, linux-nvme, linux-block, Ming Lei, Christoph Hellwig,
	Bart Van Assche
  Cc: Keith Busch

The current blk-mq code can lock requests out of completion by the
thousands, making drivers jump through hoops to handle them. This patch
set lets drivers complete their requests whenever they are completed,
with minimal synchronization and without requiring drivers to know
anything about the timeout code.

Other proposals under current consideration still have moments that
prevent a driver from progressing a request to the completed state.

The timeout is ultimately made safe by taking a reference count on the
request when timeout handling claims it. By holding the reference, we
don't need any tricks to prevent a driver from completing the request
out from under the timeout handler: the recorded state can change in
line with the true state, and drivers don't need to be aware any of
this is happening.

To keep the overhead minimal, the request's reference is taken only
when it appears that actual timeout handling needs to be done.

Keith Busch (3):
  blk-mq: Reference count request usage
  blk-mq: Fix timeout and state order
  blk-mq: Remove generation sequence

 block/blk-core.c       |   6 -
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c         | 291 +++++++++++++------------------------------------
 block/blk-mq.h         |  20 +---
 block/blk-timeout.c    |   1 -
 include/linux/blkdev.h |  26 +----
 6 files changed, 83 insertions(+), 262 deletions(-)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 128+ messages in thread

* [RFC PATCH 1/3] blk-mq: Reference count request usage
  2018-05-21 23:11 ` Keith Busch
@ 2018-05-21 23:11   ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-05-21 23:11 UTC (permalink / raw)
  To: Jens Axboe, linux-nvme, linux-block, Ming Lei, Christoph Hellwig,
	Bart Van Assche
  Cc: Keith Busch

This patch adds a struct kref to struct request so that request users
can be sure they're operating on the same request without it changing
while they're processing it. The request's tag won't be released for
reuse until the last user is done with it.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/blk-mq.c         | 30 +++++++++++++++++++++++-------
 include/linux/blkdev.h |  2 ++
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4cbfd784e837..8b370ed75605 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -332,6 +332,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 #endif
 
 	data->ctx->rq_dispatched[op_is_sync(op)]++;
+	kref_init(&rq->ref);
 	return rq;
 }
 
@@ -465,13 +466,33 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
+static void blk_mq_exit_request(struct kref *ref)
+{
+	struct request *rq = container_of(ref, struct request, ref);
+	struct request_queue *q = rq->q;
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+	const int sched_tag = rq->internal_tag;
+
+	if (rq->tag != -1)
+		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+	if (sched_tag != -1)
+		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
+	blk_mq_sched_restart(hctx);
+	blk_queue_exit(q);
+}
+
+static void blk_mq_put_request(struct request *rq)
+{
+	kref_put(&rq->ref, blk_mq_exit_request);
+}
+
 void blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-	const int sched_tag = rq->internal_tag;
 
 	if (rq->rq_flags & RQF_ELVPRIV) {
 		if (e && e->type->ops.mq.finish_request)
@@ -495,12 +516,7 @@ void blk_mq_free_request(struct request *rq)
 		blk_put_rl(blk_rq_rl(rq));
 
 	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
-	if (rq->tag != -1)
-		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
-	if (sched_tag != -1)
-		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
-	blk_mq_sched_restart(hctx);
-	blk_queue_exit(q);
+	blk_mq_put_request(rq);
 }
 EXPORT_SYMBOL_GPL(blk_mq_free_request);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f3999719f828..26bf2c1e3502 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -257,6 +257,8 @@ struct request {
 	struct u64_stats_sync aborted_gstate_sync;
 	u64 aborted_gstate;
 
+	struct kref ref;
+
 	/* access through blk_rq_set_deadline, blk_rq_deadline */
 	unsigned long __deadline;
 
-- 
2.14.3

* [RFC PATCH 2/3] blk-mq: Fix timeout and state order
  2018-05-21 23:11 ` Keith Busch
@ 2018-05-21 23:11   ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-05-21 23:11 UTC (permalink / raw)
  To: Jens Axboe, linux-nvme, linux-block, Ming Lei, Christoph Hellwig,
	Bart Van Assche
  Cc: Keith Busch

The block layer had been setting the state to in-flight prior to
updating the timer. This is the wrong order: the timeout handler could
observe the in-flight state alongside the older timeout and conclude
the request had expired when in fact it was just getting started.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8b370ed75605..66e5c768803f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -713,8 +713,8 @@ void blk_mq_start_request(struct request *rq)
 	preempt_disable();
 	write_seqcount_begin(&rq->gstate_seq);
 
-	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 	blk_add_timer(rq);
+	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 
 	write_seqcount_end(&rq->gstate_seq);
 	preempt_enable();
-- 
2.14.3

* [RFC PATCH 3/3] blk-mq: Remove generation sequence
  2018-05-21 23:11 ` Keith Busch
@ 2018-05-21 23:11   ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-05-21 23:11 UTC (permalink / raw)
  To: Jens Axboe, linux-nvme, linux-block, Ming Lei, Christoph Hellwig,
	Bart Van Assche
  Cc: Keith Busch

This patch simplifies the timeout handling by relying on the request
reference counting to ensure the iterator is operating on an in-flight
and truly timed-out request. Since the reference counting prevents the
tag from being reallocated, the block layer no longer needs to stop
drivers from completing their requests while the timeout handler is
operating on them: a driver completing a request may proceed to the
next state without additional synchronization with the block layer.

This also removes any need for generation sequence numbers, since a
request cannot be reallocated as a new instance while timeout handling
holds a reference to it.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/blk-core.c       |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c         | 259 ++++++++++---------------------------------------
 block/blk-mq.h         |  20 +---
 block/blk-timeout.c    |   1 -
 include/linux/blkdev.h |  26 +----
 6 files changed, 58 insertions(+), 255 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43370faee935..cee03cad99f2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->internal_tag = -1;
 	rq->start_time_ns = ktime_get_ns();
 	rq->part = NULL;
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * See comment of blk_mq_init_request
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3080e18cb859..ffa622366922 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -344,7 +344,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
 	RQF_NAME(ZONE_WRITE_LOCKED),
-	RQF_NAME(MQ_TIMEOUT_EXPIRED),
 	RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 66e5c768803f..4858876fd364 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -589,56 +589,6 @@ static void __blk_mq_complete_request(struct request *rq)
 	put_cpu();
 }
 
-static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
-	__releases(hctx->srcu)
-{
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-		rcu_read_unlock();
-	else
-		srcu_read_unlock(hctx->srcu, srcu_idx);
-}
-
-static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
-	__acquires(hctx->srcu)
-{
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-		/* shut up gcc false positive */
-		*srcu_idx = 0;
-		rcu_read_lock();
-	} else
-		*srcu_idx = srcu_read_lock(hctx->srcu);
-}
-
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
-	unsigned long flags;
-
-	/*
-	 * blk_mq_rq_aborted_gstate() is used from the completion path and
-	 * can thus be called from irq context.  u64_stats_fetch in the
-	 * middle of update on the same CPU leads to lockup.  Disable irq
-	 * while updating.
-	 */
-	local_irq_save(flags);
-	u64_stats_update_begin(&rq->aborted_gstate_sync);
-	rq->aborted_gstate = gstate;
-	u64_stats_update_end(&rq->aborted_gstate_sync);
-	local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-	unsigned int start;
-	u64 aborted_gstate;
-
-	do {
-		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-		aborted_gstate = rq->aborted_gstate;
-	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
-
-	return aborted_gstate;
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
-
-	/*
-	 * If @rq->aborted_gstate equals the current instance, timeout is
-	 * claiming @rq and we lost.  This is synchronized through
-	 * hctx_lock().  See blk_mq_timeout_work() for details.
-	 *
-	 * Completion path never blocks and we can directly use RCU here
-	 * instead of hctx_lock() which can be either RCU or SRCU.
-	 * However, that would complicate paths which want to synchronize
-	 * against us.  Let stay in sync with the issue path so that
-	 * hctx_lock() covers both issue and completion paths.
-	 */
-	hctx_lock(hctx, &srcu_idx);
-	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
-		__blk_mq_complete_request(rq);
-	hctx_unlock(hctx, srcu_idx);
+	__blk_mq_complete_request(rq);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -699,26 +632,9 @@ void blk_mq_start_request(struct request *rq)
 
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
-	/*
-	 * Mark @rq in-flight which also advances the generation number,
-	 * and register for timeout.  Protect with a seqcount to allow the
-	 * timeout path to read both @rq->gstate and @rq->deadline
-	 * coherently.
-	 *
-	 * This is the only place where a request is marked in-flight.  If
-	 * the timeout path reads an in-flight @rq->gstate, the
-	 * @rq->deadline it reads together under @rq->gstate_seq is
-	 * guaranteed to be the matching one.
-	 */
-	preempt_disable();
-	write_seqcount_begin(&rq->gstate_seq);
-
 	blk_add_timer(rq);
 	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 
-	write_seqcount_end(&rq->gstate_seq);
-	preempt_enable();
-
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
 		 * Make sure space for the drain appears.  We know we can do
@@ -730,11 +646,6 @@ void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
-/*
- * When we reach here because queue is busy, it's safe to change the state
- * to IDLE without checking @rq->aborted_gstate because we should still be
- * holding the RCU read lock and thus protected against timeout.
- */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
@@ -843,33 +754,24 @@ struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 }
 EXPORT_SYMBOL(blk_mq_tag_to_rq);
 
-struct blk_mq_timeout_data {
-	unsigned long next;
-	unsigned int next_set;
-	unsigned int nr_expired;
-};
-
 static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
-
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
 	switch (ret) {
 	case BLK_EH_HANDLED:
-		__blk_mq_complete_request(req);
-		break;
-	case BLK_EH_RESET_TIMER:
 		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
+		 * If the request is still in flight, the driver is requesting
+		 * blk-mq complete it.
 		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
+		if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
+			__blk_mq_complete_request(req);
+		break;
+	case BLK_EH_RESET_TIMER:
 		blk_add_timer(req);
 		break;
 	case BLK_EH_NOT_HANDLED:
@@ -880,64 +782,64 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	}
 }
 
-static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
-		struct request *rq, void *priv, bool reserved)
+static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
 {
-	struct blk_mq_timeout_data *data = priv;
-	unsigned long gstate, deadline;
-	int start;
+	unsigned long deadline;
 
-	might_sleep();
-
-	if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
-		return;
+	if (blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
+		return false;
 
-	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
-	while (true) {
-		start = read_seqcount_begin(&rq->gstate_seq);
-		gstate = READ_ONCE(rq->gstate);
-		deadline = blk_rq_deadline(rq);
-		if (!read_seqcount_retry(&rq->gstate_seq, start))
-			break;
-		cond_resched();
-	}
+	deadline = blk_rq_deadline(rq);
+	if (time_after_eq(jiffies, deadline))
+		return true;
 
-	/* if in-flight && overdue, mark for abortion */
-	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
-	    time_after_eq(jiffies, deadline)) {
-		blk_mq_rq_update_aborted_gstate(rq, gstate);
-		data->nr_expired++;
-		hctx->nr_expired++;
-	} else if (!data->next_set || time_after(data->next, deadline)) {
-		data->next = deadline;
-		data->next_set = 1;
-	}
+	if (*next == 0)
+		*next = deadline;
+	else if (time_after(*next, deadline))
+		*next = deadline;
+	return false;
 }
 
-static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
+	unsigned long *next = priv;
+
 	/*
-	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
-	 * completions that we lost to, they would have finished and
-	 * updated @rq->gstate by now; otherwise, the completion path is
-	 * now guaranteed to see @rq->aborted_gstate and yield.  If
-	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
+	 * Just do a quick check if it is expired before locking the request
+	 * so we're not unnecessarily synchronizing across CPUs.
 	 */
-	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
-	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
+	if (!blk_mq_req_expired(rq, next))
+		return;
+
+	/*
+	 * We have reason to believe the request may be expired. Take a
+	 * reference on the request to lock this request lifetime into its
+	 * currently allocated context to prevent it from being reallocated in
+	 * the event the completion by-passes this timeout handler.
+	 * 
+	 * If the reference was already released, then the driver beat the
+	 * timeout handler to posting a natural completion.
+	 */
+	if (!kref_get_unless_zero(&rq->ref))
+		return;
+
+	/*
+	 * The request is now locked and cannot be reallocated underneath the
+	 * timeout handler's processing. Re-verify this exact request is truly
+	 * expired; if it is not expired, then the request was completed and
+	 * reallocated as a new request.
+	 */
+	if (blk_mq_req_expired(rq, next))
 		blk_mq_rq_timed_out(rq, reserved);
+	blk_mq_put_request(rq);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
 		container_of(work, struct request_queue, timeout_work);
-	struct blk_mq_timeout_data data = {
-		.next		= 0,
-		.next_set	= 0,
-		.nr_expired	= 0,
-	};
+	unsigned long next = 0;
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
@@ -957,39 +859,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
-	/* scan for the expired ones and set their ->aborted_gstate */
-	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
+	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
 
-	if (data.nr_expired) {
-		bool has_rcu = false;
-
-		/*
-		 * Wait till everyone sees ->aborted_gstate.  The
-		 * sequential waits for SRCUs aren't ideal.  If this ever
-		 * becomes a problem, we can add per-hw_ctx rcu_head and
-		 * wait in parallel.
-		 */
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->nr_expired)
-				continue;
-
-			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-				has_rcu = true;
-			else
-				synchronize_srcu(hctx->srcu);
-
-			hctx->nr_expired = 0;
-		}
-		if (has_rcu)
-			synchronize_rcu();
-
-		/* terminate the ones we won */
-		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
-	}
-
-	if (data.next_set) {
-		data.next = blk_rq_timeout(round_jiffies_up(data.next));
-		mod_timer(&q->timeout, data.next);
+	if (next != 0) {
+		mod_timer(&q->timeout, next);
 	} else {
 		/*
 		 * Request timeouts are handled as a forward rolling timer. If
@@ -1334,8 +1207,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 
 static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 {
-	int srcu_idx;
-
 	/*
 	 * We should be running this queue from one of the CPUs that
 	 * are mapped to it.
@@ -1369,9 +1240,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 
 	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
-	hctx_lock(hctx, &srcu_idx);
 	blk_mq_sched_dispatch_requests(hctx);
-	hctx_unlock(hctx, srcu_idx);
 }
 
 static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
@@ -1458,7 +1327,6 @@ EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
 
 bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
-	int srcu_idx;
 	bool need_run;
 
 	/*
@@ -1469,10 +1337,8 @@ bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
 	 * quiesced.
 	 */
-	hctx_lock(hctx, &srcu_idx);
 	need_run = !blk_queue_quiesced(hctx->queue) &&
 		blk_mq_hctx_has_pending(hctx);
-	hctx_unlock(hctx, srcu_idx);
 
 	if (need_run) {
 		__blk_mq_delay_run_hw_queue(hctx, async, 0);
@@ -1828,34 +1694,23 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, blk_qc_t *cookie)
 {
 	blk_status_t ret;
-	int srcu_idx;
 
 	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
-	hctx_lock(hctx, &srcu_idx);
-
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
 	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
 		blk_mq_sched_insert_request(rq, false, true, false);
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
-
-	hctx_unlock(hctx, srcu_idx);
 }
 
 blk_status_t blk_mq_request_issue_directly(struct request *rq)
 {
-	blk_status_t ret;
-	int srcu_idx;
 	blk_qc_t unused_cookie;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
 
-	hctx_lock(hctx, &srcu_idx);
-	ret = __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true);
-	hctx_unlock(hctx, srcu_idx);
-
-	return ret;
+	return __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true);
 }
 
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
@@ -2065,15 +1920,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			return ret;
 	}
 
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * start gstate with gen 1 instead of 0, otherwise it will be equal
-	 * to aborted_gstate, and be identified timed out by
-	 * blk_mq_terminate_expired.
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
-
+	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
 	return 0;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e1bb420dc5d6..53452df16343 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -31,17 +31,12 @@ struct blk_mq_ctx {
 } ____cacheline_aligned_in_smp;
 
 /*
- * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
- * and the upper bits the generation number.
+ * Request state.
  */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
 	MQ_RQ_COMPLETE		= 2,
-
-	MQ_RQ_STATE_BITS	= 2,
-	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
-	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
 };
 
 void blk_mq_freeze_queue(struct request_queue *q);
@@ -109,7 +104,7 @@ void blk_mq_release(struct request_queue *q);
  */
 static inline int blk_mq_rq_state(struct request *rq)
 {
-	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+	return READ_ONCE(rq->state);
 }
 
 /**
@@ -124,16 +119,7 @@ static inline int blk_mq_rq_state(struct request *rq)
 static inline void blk_mq_rq_update_state(struct request *rq,
 					  enum mq_rq_state state)
 {
-	u64 old_val = READ_ONCE(rq->gstate);
-	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-	if (state == MQ_RQ_IN_FLIGHT) {
-		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-		new_val += MQ_RQ_GEN_INC;
-	}
-
-	/* avoid exposing interim values */
-	WRITE_ONCE(rq->gstate, new_val);
+	WRITE_ONCE(rq->state, state);
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..f95d6e6cbc96 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -214,7 +214,6 @@ void blk_add_timer(struct request *req)
 		req->timeout = q->rq_timeout;
 
 	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
 
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 26bf2c1e3502..8812a7e3c0a3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -125,10 +125,8 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
-/* timeout is expired */
-#define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
 /* already slept for hybrid poll */
-#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
+#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 20))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
@@ -236,27 +234,7 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
-	/*
-	 * On blk-mq, the lower bits of ->gstate (generation number and
-	 * state) carry the MQ_RQ_* state value and the upper bits the
-	 * generation number which is monotonically incremented and used to
-	 * distinguish the reuse instances.
-	 *
-	 * ->gstate_seq allows updates to ->gstate and other fields
-	 * (currently ->deadline) during request start to be read
-	 * atomically from the timeout path, so that it can operate on a
-	 * coherent set of information.
-	 */
-	seqcount_t gstate_seq;
-	u64 gstate;
-
-	/*
-	 * ->aborted_gstate is used by the timeout to claim a specific
-	 * recycle instance of this request.  See blk_mq_timeout_work().
-	 */
-	struct u64_stats_sync aborted_gstate_sync;
-	u64 aborted_gstate;
-
+	u64 state;
 	struct kref ref;
 
 	/* access through blk_rq_set_deadline, blk_rq_deadline */
-- 
2.14.3

 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
-
-	/*
-	 * If @rq->aborted_gstate equals the current instance, timeout is
-	 * claiming @rq and we lost.  This is synchronized through
-	 * hctx_lock().  See blk_mq_timeout_work() for details.
-	 *
-	 * Completion path never blocks and we can directly use RCU here
-	 * instead of hctx_lock() which can be either RCU or SRCU.
-	 * However, that would complicate paths which want to synchronize
-	 * against us.  Let stay in sync with the issue path so that
-	 * hctx_lock() covers both issue and completion paths.
-	 */
-	hctx_lock(hctx, &srcu_idx);
-	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
-		__blk_mq_complete_request(rq);
-	hctx_unlock(hctx, srcu_idx);
+	__blk_mq_complete_request(rq);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -699,26 +632,9 @@ void blk_mq_start_request(struct request *rq)
 
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
-	/*
-	 * Mark @rq in-flight which also advances the generation number,
-	 * and register for timeout.  Protect with a seqcount to allow the
-	 * timeout path to read both @rq->gstate and @rq->deadline
-	 * coherently.
-	 *
-	 * This is the only place where a request is marked in-flight.  If
-	 * the timeout path reads an in-flight @rq->gstate, the
-	 * @rq->deadline it reads together under @rq->gstate_seq is
-	 * guaranteed to be the matching one.
-	 */
-	preempt_disable();
-	write_seqcount_begin(&rq->gstate_seq);
-
 	blk_add_timer(rq);
 	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 
-	write_seqcount_end(&rq->gstate_seq);
-	preempt_enable();
-
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
 		 * Make sure space for the drain appears.  We know we can do
@@ -730,11 +646,6 @@ void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
-/*
- * When we reach here because queue is busy, it's safe to change the state
- * to IDLE without checking @rq->aborted_gstate because we should still be
- * holding the RCU read lock and thus protected against timeout.
- */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
@@ -843,33 +754,24 @@ struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 }
 EXPORT_SYMBOL(blk_mq_tag_to_rq);
 
-struct blk_mq_timeout_data {
-	unsigned long next;
-	unsigned int next_set;
-	unsigned int nr_expired;
-};
-
 static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
-
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
 	switch (ret) {
 	case BLK_EH_HANDLED:
-		__blk_mq_complete_request(req);
-		break;
-	case BLK_EH_RESET_TIMER:
 		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
+		 * If the request is still in flight, the driver is requesting
+		 * blk-mq complete it.
 		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
+		if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
+			__blk_mq_complete_request(req);
+		break;
+	case BLK_EH_RESET_TIMER:
 		blk_add_timer(req);
 		break;
 	case BLK_EH_NOT_HANDLED:
@@ -880,64 +782,64 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	}
 }
 
-static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
-		struct request *rq, void *priv, bool reserved)
+static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
 {
-	struct blk_mq_timeout_data *data = priv;
-	unsigned long gstate, deadline;
-	int start;
+	unsigned long deadline;
 
-	might_sleep();
-
-	if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
-		return;
+	if (blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
+		return false;
 
-	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
-	while (true) {
-		start = read_seqcount_begin(&rq->gstate_seq);
-		gstate = READ_ONCE(rq->gstate);
-		deadline = blk_rq_deadline(rq);
-		if (!read_seqcount_retry(&rq->gstate_seq, start))
-			break;
-		cond_resched();
-	}
+	deadline = blk_rq_deadline(rq);
+	if (time_after_eq(jiffies, deadline))
+		return true;
 
-	/* if in-flight && overdue, mark for abortion */
-	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
-	    time_after_eq(jiffies, deadline)) {
-		blk_mq_rq_update_aborted_gstate(rq, gstate);
-		data->nr_expired++;
-		hctx->nr_expired++;
-	} else if (!data->next_set || time_after(data->next, deadline)) {
-		data->next = deadline;
-		data->next_set = 1;
-	}
+	if (*next == 0)
+		*next = deadline;
+	else if (time_after(*next, deadline))
+		*next = deadline;
+	return false;
 }
 
-static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
+	unsigned long *next = priv;
+
 	/*
-	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
-	 * completions that we lost to, they would have finished and
-	 * updated @rq->gstate by now; otherwise, the completion path is
-	 * now guaranteed to see @rq->aborted_gstate and yield.  If
-	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
+	 * Just do a quick check if it is expired before locking the request in,
+	 * so we're not unnecessarily synchronizing across CPUs.
 	 */
-	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
-	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
+	if (!blk_mq_req_expired(rq, next))
+		return;
+
+	/*
+	 * We have reason to believe the request may be expired. Take a
+	 * reference on the request to lock its lifetime into the currently
+	 * allocated context, preventing it from being reallocated in the
+	 * event a completion bypasses this timeout handler.
+	 *
+	 * If the reference was already released, then the driver beat the
+	 * timeout handler to posting a natural completion.
+	 */
+	if (!kref_get_unless_zero(&rq->ref))
+		return;
+
+	/*
+	 * The request is now locked and cannot be reallocated underneath the
+	 * timeout handler's processing. Re-verify this exact request is truly
+	 * expired; if it is not expired, then the request was completed and
+	 * reallocated as a new request.
+	 */
+	if (blk_mq_req_expired(rq, next))
 		blk_mq_rq_timed_out(rq, reserved);
+	blk_mq_put_request(rq);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
 		container_of(work, struct request_queue, timeout_work);
-	struct blk_mq_timeout_data data = {
-		.next		= 0,
-		.next_set	= 0,
-		.nr_expired	= 0,
-	};
+	unsigned long next = 0;
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
@@ -957,39 +859,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
-	/* scan for the expired ones and set their ->aborted_gstate */
-	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
+	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
 
-	if (data.nr_expired) {
-		bool has_rcu = false;
-
-		/*
-		 * Wait till everyone sees ->aborted_gstate.  The
-		 * sequential waits for SRCUs aren't ideal.  If this ever
-		 * becomes a problem, we can add per-hw_ctx rcu_head and
-		 * wait in parallel.
-		 */
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->nr_expired)
-				continue;
-
-			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-				has_rcu = true;
-			else
-				synchronize_srcu(hctx->srcu);
-
-			hctx->nr_expired = 0;
-		}
-		if (has_rcu)
-			synchronize_rcu();
-
-		/* terminate the ones we won */
-		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
-	}
-
-	if (data.next_set) {
-		data.next = blk_rq_timeout(round_jiffies_up(data.next));
-		mod_timer(&q->timeout, data.next);
+	if (next != 0) {
+		mod_timer(&q->timeout, next);
 	} else {
 		/*
 		 * Request timeouts are handled as a forward rolling timer. If
@@ -1334,8 +1207,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 
 static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 {
-	int srcu_idx;
-
 	/*
 	 * We should be running this queue from one of the CPUs that
 	 * are mapped to it.
@@ -1369,9 +1240,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 
 	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
-	hctx_lock(hctx, &srcu_idx);
 	blk_mq_sched_dispatch_requests(hctx);
-	hctx_unlock(hctx, srcu_idx);
 }
 
 static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
@@ -1458,7 +1327,6 @@ EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
 
 bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
-	int srcu_idx;
 	bool need_run;
 
 	/*
@@ -1469,10 +1337,8 @@ bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
 	 * quiesced.
 	 */
-	hctx_lock(hctx, &srcu_idx);
 	need_run = !blk_queue_quiesced(hctx->queue) &&
 		blk_mq_hctx_has_pending(hctx);
-	hctx_unlock(hctx, srcu_idx);
 
 	if (need_run) {
 		__blk_mq_delay_run_hw_queue(hctx, async, 0);
@@ -1828,34 +1694,23 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, blk_qc_t *cookie)
 {
 	blk_status_t ret;
-	int srcu_idx;
 
 	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
-	hctx_lock(hctx, &srcu_idx);
-
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
 	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
 		blk_mq_sched_insert_request(rq, false, true, false);
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
-
-	hctx_unlock(hctx, srcu_idx);
 }
 
 blk_status_t blk_mq_request_issue_directly(struct request *rq)
 {
-	blk_status_t ret;
-	int srcu_idx;
 	blk_qc_t unused_cookie;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
 
-	hctx_lock(hctx, &srcu_idx);
-	ret = __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true);
-	hctx_unlock(hctx, srcu_idx);
-
-	return ret;
+	return __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true);
 }
 
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
@@ -2065,15 +1920,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			return ret;
 	}
 
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * start gstate with gen 1 instead of 0, otherwise it will be equal
-	 * to aborted_gstate, and be identified timed out by
-	 * blk_mq_terminate_expired.
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
-
+	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
 	return 0;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e1bb420dc5d6..53452df16343 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -31,17 +31,12 @@ struct blk_mq_ctx {
 } ____cacheline_aligned_in_smp;
 
 /*
- * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
- * and the upper bits the generation number.
+ * Request state.
  */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
 	MQ_RQ_COMPLETE		= 2,
-
-	MQ_RQ_STATE_BITS	= 2,
-	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
-	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
 };
 
 void blk_mq_freeze_queue(struct request_queue *q);
@@ -109,7 +104,7 @@ void blk_mq_release(struct request_queue *q);
  */
 static inline int blk_mq_rq_state(struct request *rq)
 {
-	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+	return READ_ONCE(rq->state);
 }
 
 /**
@@ -124,16 +119,7 @@ static inline int blk_mq_rq_state(struct request *rq)
 static inline void blk_mq_rq_update_state(struct request *rq,
 					  enum mq_rq_state state)
 {
-	u64 old_val = READ_ONCE(rq->gstate);
-	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-	if (state == MQ_RQ_IN_FLIGHT) {
-		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-		new_val += MQ_RQ_GEN_INC;
-	}
-
-	/* avoid exposing interim values */
-	WRITE_ONCE(rq->gstate, new_val);
+	WRITE_ONCE(rq->state, state);
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..f95d6e6cbc96 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -214,7 +214,6 @@ void blk_add_timer(struct request *req)
 		req->timeout = q->rq_timeout;
 
 	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
 
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 26bf2c1e3502..8812a7e3c0a3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -125,10 +125,8 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
-/* timeout is expired */
-#define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
 /* already slept for hybrid poll */
-#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
+#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 20))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
@@ -236,27 +234,7 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
-	/*
-	 * On blk-mq, the lower bits of ->gstate (generation number and
-	 * state) carry the MQ_RQ_* state value and the upper bits the
-	 * generation number which is monotonically incremented and used to
-	 * distinguish the reuse instances.
-	 *
-	 * ->gstate_seq allows updates to ->gstate and other fields
-	 * (currently ->deadline) during request start to be read
-	 * atomically from the timeout path, so that it can operate on a
-	 * coherent set of information.
-	 */
-	seqcount_t gstate_seq;
-	u64 gstate;
-
-	/*
-	 * ->aborted_gstate is used by the timeout to claim a specific
-	 * recycle instance of this request.  See blk_mq_timeout_work().
-	 */
-	struct u64_stats_sync aborted_gstate_sync;
-	u64 aborted_gstate;
-
+	u64 state;
 	struct kref ref;
 
 	/* access through blk_rq_set_deadline, blk_rq_deadline */
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
  2018-05-21 23:11   ` Keith Busch
@ 2018-05-21 23:29     ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-05-21 23:29 UTC (permalink / raw)
  To: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [RFC PATCH 3/3] blk-mq: Remove generation sequence
@ 2018-05-21 23:29     ` Bart Van Assche
  0 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-05-21 23:29 UTC (permalink / raw)


On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> @@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
>  void blk_mq_complete_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> -	int srcu_idx;
>  
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
> -	[ ... ]
> +	__blk_mq_complete_request(rq);
>  }
>  EXPORT_SYMBOL(blk_mq_complete_request);
> [ ... ]
>  static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  {
>  	const struct blk_mq_ops *ops = req->q->mq_ops;
>  	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
>  
> -	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
> -
>  	if (ops->timeout)
>  		ret = ops->timeout(req, reserved);
>  
>  	switch (ret) {
>  	case BLK_EH_HANDLED:
> -		__blk_mq_complete_request(req);
> -		break;
> -	case BLK_EH_RESET_TIMER:
>  		/*
> -		 * As nothing prevents from completion happening while
> -		 * ->aborted_gstate is set, this may lead to ignored
> -		 * completions and further spurious timeouts.
> +		 * If the request is still in flight, the driver is requesting
> +		 * blk-mq complete it.
>  		 */
> -		blk_mq_rq_update_aborted_gstate(req, 0);
> +		if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
> +			__blk_mq_complete_request(req);
> +		break;
> +	case BLK_EH_RESET_TIMER:
>  		blk_add_timer(req);
>  		break;
>  	case BLK_EH_NOT_HANDLED:
> @@ -880,64 +782,64 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  	}
>  }

I think the above changes can lead to concurrent calls of
__blk_mq_complete_request() from the regular completion path and the timeout
path. That's wrong: the __blk_mq_complete_request() caller should guarantee
that no concurrent calls from another context to that function can occur.

Bart.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 0/3] blk-mq: Timeout rework
  2018-05-21 23:11 ` Keith Busch
@ 2018-05-21 23:29   ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-05-21 23:29 UTC (permalink / raw)
  To: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [RFC PATCH 0/3] blk-mq: Timeout rework
@ 2018-05-21 23:29   ` Bart Van Assche
  0 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-05-21 23:29 UTC (permalink / raw)


On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> The current blk-mq code potentially locks requests out of completion by
> the thousands, making drivers jump through hoops to handle them. This
> patch set allows drivers to complete their requests whenever they're
> completed without requiring drivers know anything about the timeout code
> with minimal syncronization.
> 
> Other proposals under current consideration still have moments that
> prevent a driver from progressing a request to the completed state.
> 
> The timeout is ultimatley made safe by reference counting the request
> when timeout handling claims the request. By holding the reference count,
> we don't need to do any tricks to prevent a driver from completing the
> request out from under the timeout handler, allowing the actual state
> to be changed inline with the true state, and drivers don't need to be
> aware any of this is happening.
> 
> In order to make the overhead as minimal as possible, the request's
> reference is taken only when it appears that actual timeout handling
> needs to be done.

Hello Keith,

Can you explain why the NVMe driver needs reference counting of requests but
no other block driver needs this? Additionally, why is it that for all block
drivers except NVMe the current block layer API is sufficient
(blk_get_request()/blk_execute_rq()/blk_mq_start_request()/
blk_mq_complete_request()/blk_mq_end_request())?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 1/3] blk-mq: Reference count request usage
  2018-05-21 23:11   ` Keith Busch
@ 2018-05-22  2:27     ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2018-05-22  2:27 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche

On Mon, May 21, 2018 at 05:11:29PM -0600, Keith Busch wrote:
> This patch adds a struct kref to struct request so that request users
> can be sure they're operating on the same request without it changing
> while they're processing it. The request's tag won't be released for
> reuse until the last user is done with it.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  block/blk-mq.c         | 30 +++++++++++++++++++++++-------
>  include/linux/blkdev.h |  2 ++
>  2 files changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4cbfd784e837..8b370ed75605 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -332,6 +332,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>  #endif
>  
>  	data->ctx->rq_dispatched[op_is_sync(op)]++;
> +	kref_init(&rq->ref);
>  	return rq;
>  }
>  
> @@ -465,13 +466,33 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
>  
> +static void blk_mq_exit_request(struct kref *ref)
> +{
> +	struct request *rq = container_of(ref, struct request, ref);
> +	struct request_queue *q = rq->q;
> +	struct blk_mq_ctx *ctx = rq->mq_ctx;
> +	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> +	const int sched_tag = rq->internal_tag;
> +
> +	if (rq->tag != -1)
> +		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
> +	if (sched_tag != -1)
> +		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
> +	blk_mq_sched_restart(hctx);
> +	blk_queue_exit(q);
> +}
> +
> +static void blk_mq_put_request(struct request *rq)
> +{
> +       kref_put(&rq->ref, blk_mq_exit_request);
> +}
> +
>  void blk_mq_free_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
>  	struct elevator_queue *e = q->elevator;
>  	struct blk_mq_ctx *ctx = rq->mq_ctx;
>  	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> -	const int sched_tag = rq->internal_tag;
>  
>  	if (rq->rq_flags & RQF_ELVPRIV) {
>  		if (e && e->type->ops.mq.finish_request)
> @@ -495,12 +516,7 @@ void blk_mq_free_request(struct request *rq)
>  		blk_put_rl(blk_rq_rl(rq));
>  
>  	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
> -	if (rq->tag != -1)
> -		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
> -	if (sched_tag != -1)
> -		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
> -	blk_mq_sched_restart(hctx);
> -	blk_queue_exit(q);
> +	blk_mq_put_request(rq);

Both the above line(atomic_try_cmpxchg_release is implied) and kref_init()
in blk_mq_rq_ctx_init() are run from fast path, and may introduce some cost,
you may have to run some benchmark to show if there is effect.

Also given the cost isn't free, could you describe a bit in comment log
what we can get with the cost?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 128+ messages in thread

* [RFC PATCH 1/3] blk-mq: Reference count request usage
@ 2018-05-22  2:27     ` Ming Lei
  0 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2018-05-22  2:27 UTC (permalink / raw)


On Mon, May 21, 2018 at 05:11:29PM -0600, Keith Busch wrote:
> This patch adds a struct kref to struct request so that request users
> can be sure they're operating on the same request without it changing
> while they're processing it. The request's tag won't be released for
> reuse until the last user is done with it.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---
>  block/blk-mq.c         | 30 +++++++++++++++++++++++-------
>  include/linux/blkdev.h |  2 ++
>  2 files changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4cbfd784e837..8b370ed75605 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -332,6 +332,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>  #endif
>  
>  	data->ctx->rq_dispatched[op_is_sync(op)]++;
> +	kref_init(&rq->ref);
>  	return rq;
>  }
>  
> @@ -465,13 +466,33 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
>  
> +static void blk_mq_exit_request(struct kref *ref)
> +{
> +	struct request *rq = container_of(ref, struct request, ref);
> +	struct request_queue *q = rq->q;
> +	struct blk_mq_ctx *ctx = rq->mq_ctx;
> +	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> +	const int sched_tag = rq->internal_tag;
> +
> +	if (rq->tag != -1)
> +		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
> +	if (sched_tag != -1)
> +		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
> +	blk_mq_sched_restart(hctx);
> +	blk_queue_exit(q);
> +}
> +
> +static void blk_mq_put_request(struct request *rq)
> +{
> +       kref_put(&rq->ref, blk_mq_exit_request);
> +}
> +
>  void blk_mq_free_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
>  	struct elevator_queue *e = q->elevator;
>  	struct blk_mq_ctx *ctx = rq->mq_ctx;
>  	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> -	const int sched_tag = rq->internal_tag;
>  
>  	if (rq->rq_flags & RQF_ELVPRIV) {
>  		if (e && e->type->ops.mq.finish_request)
> @@ -495,12 +516,7 @@ void blk_mq_free_request(struct request *rq)
>  		blk_put_rl(blk_rq_rl(rq));
>  
>  	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
> -	if (rq->tag != -1)
> -		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
> -	if (sched_tag != -1)
> -		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
> -	blk_mq_sched_restart(hctx);
> -	blk_queue_exit(q);
> +	blk_mq_put_request(rq);

Both the above line(atomic_try_cmpxchg_release is implied) and kref_init()
in blk_mq_rq_ctx_init() are run from fast path, and may introduce some cost,
you may have to run some benchmark to show if there is effect.

Also given the cost isn't free, could you describe a bit in comment log
what we can get with the cost?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 2/3] blk-mq: Fix timeout and state order
  2018-05-21 23:11   ` Keith Busch
@ 2018-05-22  2:28     ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2018-05-22  2:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche

On Mon, May 21, 2018 at 05:11:30PM -0600, Keith Busch wrote:
> The block layer had been setting the state to in-flight prior to updating
> the timer. This is the wrong order since the timeout handler could observe
> the in-flight state with the older timeout, believing the request had
> expired when in fact it is just getting started.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  block/blk-mq.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8b370ed75605..66e5c768803f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -713,8 +713,8 @@ void blk_mq_start_request(struct request *rq)
>  	preempt_disable();
>  	write_seqcount_begin(&rq->gstate_seq);
>  
> -	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>  	blk_add_timer(rq);
> +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>  
>  	write_seqcount_end(&rq->gstate_seq);
>  	preempt_enable();
> -- 
> 2.14.3
> 

Looks fine,

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
  2018-05-21 23:11   ` Keith Busch
@ 2018-05-22  2:49     ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2018-05-22  2:49 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche

On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
> This patch simplifies the timeout handling by relying on the request
> reference counting to ensure the iterator is operating on an inflight

The reference counting isn't free, what is the real benefit in this way?

> and truly timed out request. Since the reference counting prevents the
> tag from being reallocated, the block layer no longer needs to prevent
> drivers from completing their requests while the timeout handler is
> operating on it: a driver completing a request is allowed to proceed to
> the next state without additional synchronization with the block layer.

This might cause trouble for drivers: the previous behaviour was that a
request is only completed from one path, and this changes that.

> 
> This also removes any need for generation sequence numbers since the
> request lifetime is prevented from being reallocated as a new sequence
> while timeout handling is operating on it.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  block/blk-core.c       |   6 --
>  block/blk-mq-debugfs.c |   1 -
>  block/blk-mq.c         | 259 ++++++++++---------------------------------------
>  block/blk-mq.h         |  20 +---
>  block/blk-timeout.c    |   1 -
>  include/linux/blkdev.h |  26 +----
>  6 files changed, 58 insertions(+), 255 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 43370faee935..cee03cad99f2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>  	rq->internal_tag = -1;
>  	rq->start_time_ns = ktime_get_ns();
>  	rq->part = NULL;
> -	seqcount_init(&rq->gstate_seq);
> -	u64_stats_init(&rq->aborted_gstate_sync);
> -	/*
> -	 * See comment of blk_mq_init_request
> -	 */
> -	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
>  }
>  EXPORT_SYMBOL(blk_rq_init);
>  
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 3080e18cb859..ffa622366922 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -344,7 +344,6 @@ static const char *const rqf_name[] = {
>  	RQF_NAME(STATS),
>  	RQF_NAME(SPECIAL_PAYLOAD),
>  	RQF_NAME(ZONE_WRITE_LOCKED),
> -	RQF_NAME(MQ_TIMEOUT_EXPIRED),
>  	RQF_NAME(MQ_POLL_SLEPT),
>  };
>  #undef RQF_NAME
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 66e5c768803f..4858876fd364 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -589,56 +589,6 @@ static void __blk_mq_complete_request(struct request *rq)
>  	put_cpu();
>  }
>  
> -static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
> -	__releases(hctx->srcu)
> -{
> -	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
> -		rcu_read_unlock();
> -	else
> -		srcu_read_unlock(hctx->srcu, srcu_idx);
> -}
> -
> -static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
> -	__acquires(hctx->srcu)
> -{
> -	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
> -		/* shut up gcc false positive */
> -		*srcu_idx = 0;
> -		rcu_read_lock();
> -	} else
> -		*srcu_idx = srcu_read_lock(hctx->srcu);
> -}
> -
> -static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
> -{
> -	unsigned long flags;
> -
> -	/*
> -	 * blk_mq_rq_aborted_gstate() is used from the completion path and
> -	 * can thus be called from irq context.  u64_stats_fetch in the
> -	 * middle of update on the same CPU leads to lockup.  Disable irq
> -	 * while updating.
> -	 */
> -	local_irq_save(flags);
> -	u64_stats_update_begin(&rq->aborted_gstate_sync);
> -	rq->aborted_gstate = gstate;
> -	u64_stats_update_end(&rq->aborted_gstate_sync);
> -	local_irq_restore(flags);
> -}
> -
> -static u64 blk_mq_rq_aborted_gstate(struct request *rq)
> -{
> -	unsigned int start;
> -	u64 aborted_gstate;
> -
> -	do {
> -		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
> -		aborted_gstate = rq->aborted_gstate;
> -	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
> -
> -	return aborted_gstate;
> -}
> -
>  /**
>   * blk_mq_complete_request - end I/O on a request
>   * @rq:		the request being processed
> @@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
>  void blk_mq_complete_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> -	int srcu_idx;
>  
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
> -
> -	/*
> -	 * If @rq->aborted_gstate equals the current instance, timeout is
> -	 * claiming @rq and we lost.  This is synchronized through
> -	 * hctx_lock().  See blk_mq_timeout_work() for details.
> -	 *
> -	 * Completion path never blocks and we can directly use RCU here
> -	 * instead of hctx_lock() which can be either RCU or SRCU.
> -	 * However, that would complicate paths which want to synchronize
> -	 * against us.  Let stay in sync with the issue path so that
> -	 * hctx_lock() covers both issue and completion paths.
> -	 */
> -	hctx_lock(hctx, &srcu_idx);
> -	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> -		__blk_mq_complete_request(rq);
> -	hctx_unlock(hctx, srcu_idx);
> +	__blk_mq_complete_request(rq);
>  }
>  EXPORT_SYMBOL(blk_mq_complete_request);
>  
> @@ -699,26 +632,9 @@ void blk_mq_start_request(struct request *rq)
>  
>  	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
>  
> -	/*
> -	 * Mark @rq in-flight which also advances the generation number,
> -	 * and register for timeout.  Protect with a seqcount to allow the
> -	 * timeout path to read both @rq->gstate and @rq->deadline
> -	 * coherently.
> -	 *
> -	 * This is the only place where a request is marked in-flight.  If
> -	 * the timeout path reads an in-flight @rq->gstate, the
> -	 * @rq->deadline it reads together under @rq->gstate_seq is
> -	 * guaranteed to be the matching one.
> -	 */
> -	preempt_disable();
> -	write_seqcount_begin(&rq->gstate_seq);
> -
>  	blk_add_timer(rq);
>  	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>  
> -	write_seqcount_end(&rq->gstate_seq);
> -	preempt_enable();
> -
>  	if (q->dma_drain_size && blk_rq_bytes(rq)) {
>  		/*
>  		 * Make sure space for the drain appears.  We know we can do
> @@ -730,11 +646,6 @@ void blk_mq_start_request(struct request *rq)
>  }
>  EXPORT_SYMBOL(blk_mq_start_request);
>  
> -/*
> - * When we reach here because queue is busy, it's safe to change the state
> - * to IDLE without checking @rq->aborted_gstate because we should still be
> - * holding the RCU read lock and thus protected against timeout.
> - */
>  static void __blk_mq_requeue_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> @@ -843,33 +754,24 @@ struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
>  }
>  EXPORT_SYMBOL(blk_mq_tag_to_rq);
>  
> -struct blk_mq_timeout_data {
> -	unsigned long next;
> -	unsigned int next_set;
> -	unsigned int nr_expired;
> -};
> -
>  static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  {
>  	const struct blk_mq_ops *ops = req->q->mq_ops;
>  	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
>  
> -	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
> -
>  	if (ops->timeout)
>  		ret = ops->timeout(req, reserved);
>  
>  	switch (ret) {
>  	case BLK_EH_HANDLED:
> -		__blk_mq_complete_request(req);
> -		break;
> -	case BLK_EH_RESET_TIMER:
>  		/*
> -		 * As nothing prevents from completion happening while
> -		 * ->aborted_gstate is set, this may lead to ignored
> -		 * completions and further spurious timeouts.
> +		 * If the request is still in flight, the driver is requesting
> +		 * blk-mq complete it.
>  		 */
> -		blk_mq_rq_update_aborted_gstate(req, 0);
> +		if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
> +			__blk_mq_complete_request(req);
> +		break;
> +	case BLK_EH_RESET_TIMER:
>  		blk_add_timer(req);
>  		break;
>  	case BLK_EH_NOT_HANDLED:
> @@ -880,64 +782,64 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  	}
>  }
>  
> -static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
> -		struct request *rq, void *priv, bool reserved)
> +static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
>  {
> -	struct blk_mq_timeout_data *data = priv;
> -	unsigned long gstate, deadline;
> -	int start;
> +	unsigned long deadline;
>  
> -	might_sleep();
> -
> -	if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
> -		return;
> +	if (blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
> +		return false;
>  
> -	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
> -	while (true) {
> -		start = read_seqcount_begin(&rq->gstate_seq);
> -		gstate = READ_ONCE(rq->gstate);
> -		deadline = blk_rq_deadline(rq);
> -		if (!read_seqcount_retry(&rq->gstate_seq, start))
> -			break;
> -		cond_resched();
> -	}
> +	deadline = blk_rq_deadline(rq);
> +	if (time_after_eq(jiffies, deadline))
> +		return true;
>  
> -	/* if in-flight && overdue, mark for abortion */
> -	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
> -	    time_after_eq(jiffies, deadline)) {
> -		blk_mq_rq_update_aborted_gstate(rq, gstate);
> -		data->nr_expired++;
> -		hctx->nr_expired++;
> -	} else if (!data->next_set || time_after(data->next, deadline)) {
> -		data->next = deadline;
> -		data->next_set = 1;
> -	}
> +	if (*next == 0)
> +		*next = deadline;
> +	else if (time_after(*next, deadline))
> +		*next = deadline;
> +	return false;
>  }
>  
> -static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
> +static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
>  		struct request *rq, void *priv, bool reserved)
>  {
> +	unsigned long *next = priv;
> +
>  	/*
> -	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
> -	 * completions that we lost to, they would have finished and
> -	 * updated @rq->gstate by now; otherwise, the completion path is
> -	 * now guaranteed to see @rq->aborted_gstate and yield.  If
> -	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
> +	 * Just do a quick check if it is expired before locking the request in
> +	 * so we're not unnecessarily synchronizing across CPUs.
>  	 */
> -	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
> -	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
> +	if (!blk_mq_req_expired(rq, next))
> +		return;
> +
> +	/*
> +	 * We have reason to believe the request may be expired. Take a
> +	 * reference on the request to lock this request lifetime into its
> +	 * currently allocated context to prevent it from being reallocated in
> +	 * the event the completion by-passes this timeout handler.
> +	 * 
> +	 * If the reference was already released, then the driver beat the
> +	 * timeout handler to posting a natural completion.
> +	 */
> +	if (!kref_get_unless_zero(&rq->ref))
> +		return;

If this request has just been completed in the normal path but its state
hasn't been updated yet, the timeout handler will hold the request and may
complete it again, so the request can be completed twice.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
  2018-05-22  2:49     ` Ming Lei
@ 2018-05-22  3:16       ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2018-05-22  3:16 UTC (permalink / raw)
  To: Ming Lei, Keith Busch
  Cc: linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche

On 5/21/18 8:49 PM, Ming Lei wrote:
> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
>> This patch simplifies the timeout handling by relying on the request
>> reference counting to ensure the iterator is operating on an inflight
> 
> The reference counting isn't free, what is the real benefit in this way?

Neither is the current scheme and locking, and this is a hell of a lot
simpler. I'd get rid of the kref stuff and just do a simple
atomic_dec_and_test(). Most of the time we should be uncontended on
that, which would make it pretty darn cheap. I'd be surprised if it
wasn't better than the current alternatives.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
  2018-05-22  3:16       ` Jens Axboe
@ 2018-05-22  3:47         ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2018-05-22  3:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche

On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote:
> On 5/21/18 8:49 PM, Ming Lei wrote:
> > On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
> >> This patch simplifies the timeout handling by relying on the request
> >> reference counting to ensure the iterator is operating on an inflight
> > 
> > The reference counting isn't free, what is the real benefit in this way?
> 
> Neither is the current scheme and locking, and this is a hell of a lot
> simpler. I'd get rid of the kref stuff and just do a simple
> atomic_dec_and_test(). Most of the time we should be uncontended on
> that, which would make it pretty darn cheap. I'd be surprised if it
> wasn't better than the current alternatives.

The explicit memory barriers implied by atomic_dec_and_test() aren't free.

Also, the double-completion issue needs to be fixed before discussing
this approach further.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
  2018-05-22  3:47         ` Ming Lei
@ 2018-05-22  3:51           ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2018-05-22  3:51 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche

On May 21, 2018, at 9:47 PM, Ming Lei <ming.lei@redhat.com> wrote:
>
>> On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote:
>>> On 5/21/18 8:49 PM, Ming Lei wrote:
>>>> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
>>>> This patch simplifies the timeout handling by relying on the request
>>>> reference counting to ensure the iterator is operating on an inflight
>>>
>>> The reference counting isn't free, what is the real benefit in this way?
>>
>> Neither is the current scheme and locking, and this is a hell of a lot
>> simpler. I'd get rid of the kref stuff and just do a simple
>> atomic_dec_and_test(). Most of the time we should be uncontended on
>> that, which would make it pretty darn cheap. I'd be surprised if it
>> wasn't better than the current alternatives.
>
> The explicit memory barriers implied by atomic_dec_and_test() aren't free.

I'm not saying it's free. Neither is our current synchronization.

> Also, the double-completion issue needs to be fixed before discussing
> this approach further.

Certainly. Also not saying that the current patch is perfect. But it's a
lot more palatable than the alternatives, hence my interest. And I'd like
for this issue to get solved, we seem to be a bit stuck atm.

I'll take a closer look tomorrow.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-22  3:51           ` Jens Axboe
@ 2018-05-22  8:51             ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2018-05-22  8:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche

On Mon, May 21, 2018 at 09:51:19PM -0600, Jens Axboe wrote:
> On May 21, 2018, at 9:47 PM, Ming Lei <ming.lei@redhat.com> wrote:
> > 
> >> On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote:
> >>> On 5/21/18 8:49 PM, Ming Lei wrote:
> >>>> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
> >>>> This patch simplifies the timeout handling by relying on the request
> >>>> reference counting to ensure the iterator is operating on an inflight
> >>> 
> >>> The reference counting isn't free, what is the real benefit in this way?
> >> 
> >> Neither is the current scheme and locking, and this is a hell of a lot
> >> simpler. I'd get rid of the kref stuff and just do a simple
> >> atomic_dec_and_test(). Most of the time we should be uncontended on
> >> that, which would make it pretty darn cheap. I'd be surprised if it
> >> wasn't better than the current alternatives.
> > 
> > The explicit memory barriers by atomic_dec_and_test() aren't free.
> 
> I’m not saying it’s free. Neither is our current synchronization.
> 
> > Also the double completion issue needs to be fixed before discussing
> > this approach further.
> 
> Certainly. Also not saying that the current patch is perfect. But it’s a lot more palatable than the alternatives, hence my interest. And I’d like for this issue to get solved, we seem to be a bit stuck atm. 
> 

It may not be something we are stuck at, and there seem to be no
alternatives to this patchset.

It is a new requirement from NVMe, and Keith wants the driver to complete
the timed-out request in .timeout(). We have never supported that before,
in either the mq or non-mq code path.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 0/3] blk-mq: Timeout rework
  2018-05-21 23:29   ` Bart Van Assche
@ 2018-05-22 14:06     ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-05-22 14:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Mon, May 21, 2018 at 11:29:21PM +0000, Bart Van Assche wrote:
> Can you explain why the NVMe driver needs reference counting of requests but
> no other block driver needs this? Additionally, why is it that for all block
> drivers except NVMe the current block layer API is sufficient
> (blk_get_request()/blk_execute_rq()/blk_mq_start_request()/
> blk_mq_complete_request()/blk_mq_end_request())?

Hi Bart,

I'm pretty sure NVMe isn't the only driver where a call to
blk_mq_complete_request silently fails to transition the request to
COMPLETE, forcing unnecessary error handling. This patch isn't so
much about NVMe as it is about removing that silent exception from the
block API.

Thanks,
Keith

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-21 23:29     ` Bart Van Assche
@ 2018-05-22 14:15       ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-05-22 14:15 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Mon, May 21, 2018 at 11:29:06PM +0000, Bart Van Assche wrote:
> On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> >  	switch (ret) {
> >  	case BLK_EH_HANDLED:
> > -		__blk_mq_complete_request(req);
> > -		break;
> > -	case BLK_EH_RESET_TIMER:
> >  		/*
> > -		 * As nothing prevents from completion happening while
> > -		 * ->aborted_gstate is set, this may lead to ignored
> > -		 * completions and further spurious timeouts.
> > +		 * If the request is still in flight, the driver is requesting
> > +		 * blk-mq complete it.
> >  		 */
> > -		blk_mq_rq_update_aborted_gstate(req, 0);
> > +		if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
> > +			__blk_mq_complete_request(req);
> > +		break;
> > +	case BLK_EH_RESET_TIMER:
> >  		blk_add_timer(req);
> >  		break;
> >  	case BLK_EH_NOT_HANDLED:
> > @@ -880,64 +782,64 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
> >  	}
> >  }
> 
> I think the above changes can lead to concurrent calls of
> __blk_mq_complete_request() from the regular completion path and the timeout
> path. That's wrong: the __blk_mq_complete_request() caller should guarantee
> that no concurrent calls from another context to that function can occur.

Hi Bart,

This shouldn't be introducing any new concurrent calls to
__blk_mq_complete_request if they don't already exist. The timeout work
calls it only if the driver's timeout returns BLK_EH_HANDLED, which means
the driver is claiming the command is now done, but the driver didn't call
blk_mq_complete_request as indicated by the request's IN_FLIGHT state.

In order to get a second call to __blk_mq_complete_request(), then,
the driver would have to end up calling blk_mq_complete_request()
in another context. But there's nothing stopping that from happening
today, and it would be bad if any driver actually did that: the request
may have been re-allocated and issued as a new command, and calling
blk_mq_complete_request() a second time introduces data corruption.

Thanks,
Keith

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-22  2:49     ` Ming Lei
@ 2018-05-22 14:20       ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-05-22 14:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, linux-block, Bart Van Assche,
	Christoph Hellwig, linux-nvme

On Tue, May 22, 2018 at 10:49:11AM +0800, Ming Lei wrote:
> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
> > -static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
> > +static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
> >  		struct request *rq, void *priv, bool reserved)
> >  {
> > +	unsigned long *next = priv;
> > +
> >  	/*
> > -	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
> > -	 * completions that we lost to, they would have finished and
> > -	 * updated @rq->gstate by now; otherwise, the completion path is
> > -	 * now guaranteed to see @rq->aborted_gstate and yield.  If
> > -	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
> > +	 * Just do a quick check if it is expired before locking the request in
> > +	 * so we're not unnecessarilly synchronizing across CPUs.
> >  	 */
> > -	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
> > -	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
> > +	if (!blk_mq_req_expired(rq, next))
> > +		return;
> > +
> > +	/*
> > +	 * We have reason to believe the request may be expired. Take a
> > +	 * reference on the request to lock this request lifetime into its
> > +	 * currently allocated context to prevent it from being reallocated in
> > +	 * the event the completion by-passes this timeout handler.
> > +	 * 
> > +	 * If the reference was already released, then the driver beat the
> > +	 * timeout handler to posting a natural completion.
> > +	 */
> > +	if (!kref_get_unless_zero(&rq->ref))
> > +		return;
> 
> If this request is just completed in normal path and its state isn't
> updated yet, timeout will hold the request, and may complete this
> request again, then this req can be completed two times.

Hi Ming,

In the event the driver requests a normal completion, the timeout work
releasing the last reference doesn't do a second completion: it only
releases the request's tag back for re-allocation.

Thanks,
Keith

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-22  8:51             ` Ming Lei
@ 2018-05-22 14:35               ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2018-05-22 14:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche

On 5/22/18 2:51 AM, Ming Lei wrote:
> On Mon, May 21, 2018 at 09:51:19PM -0600, Jens Axboe wrote:
>> On May 21, 2018, at 9:47 PM, Ming Lei <ming.lei@redhat.com> wrote:
>>>
>>>> On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote:
>>>>> On 5/21/18 8:49 PM, Ming Lei wrote:
>>>>>> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
>>>>>> This patch simplifies the timeout handling by relying on the request
>>>>>> reference counting to ensure the iterator is operating on an inflight
>>>>>
>>>>> The reference counting isn't free, what is the real benefit in this way?
>>>>
>>>> Neither is the current scheme and locking, and this is a hell of a lot
>>>> simpler. I'd get rid of the kref stuff and just do a simple
>>>> atomic_dec_and_test(). Most of the time we should be uncontended on
>>>> that, which would make it pretty darn cheap. I'd be surprised if it
>>>> wasn't better than the current alternatives.
>>>
>>> The explicit memory barriers by atomic_dec_and_test() aren't free.
>>
>> I’m not saying it’s free. Neither is our current synchronization.
>>
>>> Also the double completion issue needs to be fixed before discussing
>>> this approach further.
>>
>> Certainly. Also not saying that the current patch is perfect. But
>> it’s a lot more palatable than the alternatives, hence my interest.
>> And I’d like for this issue to get solved, we seem to be a bit stuck
>> atm. 
>>
> 
> It may not be something we are stuck at, and there seem to be no
> alternatives to this patchset.
> 
> It is a new requirement from NVMe, and Keith wants the driver to complete
> the timed-out request in .timeout(). We have never supported that before,
> in either the mq or non-mq code path.

No, that's not what he wants to do. He wants to use referencing to
release the final resources of the request (the tag), to prevent
premature reuse of the request.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-22 14:20       ` Keith Busch
@ 2018-05-22 14:37         ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2018-05-22 14:37 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Keith Busch, Jens Axboe, linux-block, Bart Van Assche,
	Christoph Hellwig, linux-nvme

On Tue, May 22, 2018 at 10:20 PM, Keith Busch
<keith.busch@linux.intel.com> wrote:
> On Tue, May 22, 2018 at 10:49:11AM +0800, Ming Lei wrote:
>> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
>> > -static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
>> > +static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
>> >             struct request *rq, void *priv, bool reserved)
>> >  {
>> > +   unsigned long *next = priv;
>> > +
>> >     /*
>> > -    * We marked @rq->aborted_gstate and waited for RCU.  If there were
>> > -    * completions that we lost to, they would have finished and
>> > -    * updated @rq->gstate by now; otherwise, the completion path is
>> > -    * now guaranteed to see @rq->aborted_gstate and yield.  If
>> > -    * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
>> > +    * Just do a quick check if it is expired before locking the request in
>> > +    * so we're not unnecessarilly synchronizing across CPUs.
>> >      */
>> > -   if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
>> > -       READ_ONCE(rq->gstate) == rq->aborted_gstate)
>> > +   if (!blk_mq_req_expired(rq, next))
>> > +           return;
>> > +
>> > +   /*
>> > +    * We have reason to believe the request may be expired. Take a
>> > +    * reference on the request to lock this request lifetime into its
>> > +    * currently allocated context to prevent it from being reallocated in
>> > +    * the event the completion by-passes this timeout handler.
>> > +    *
>> > +    * If the reference was already released, then the driver beat the
>> > +    * timeout handler to posting a natural completion.
>> > +    */
>> > +   if (!kref_get_unless_zero(&rq->ref))
>> > +           return;
>>
>> If this request is just completed in normal path and its state isn't
>> updated yet, timeout will hold the request, and may complete this
>> request again, then this req can be completed two times.
>
> Hi Ming,
>
> In the event the driver requests a normal completion, the timeout work
> releasing the last reference doesn't do a second completion: it only

The reference only covers the request's lifetime; it isn't related to the
completion itself.

It isn't the last reference. When the driver returns EH_HANDLED, blk-mq
will complete this request one extra time.

Yes, if the driver's timeout code and normal completion code can synchronize
on this completion, that should be fine, but the current behaviour doesn't
depend on the driver's synchronization, since the req is always completed
atomically in the following way:

1) timeout

if (mark_completed(rq))
   timed_out(rq)

2) normal completion
if (mark_completed(rq))
    complete(rq)

For example, suppose that just before nvme_timeout() runs nvme_dev_disable(),
an irq arrives and this req is completed from the normal completion path, but
nvme_timeout() still returns EH_HANDLED; blk-mq may then complete the req one
extra time, since the normal completion path may not have updated the req's
state yet.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-22 14:37         ` Ming Lei
@ 2018-05-22 14:46           ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-05-22 14:46 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, linux-nvme, Ming Lei, Keith Busch,
	Bart Van Assche, Christoph Hellwig

On Tue, May 22, 2018 at 10:37:32PM +0800, Ming Lei wrote:
> On Tue, May 22, 2018 at 10:20 PM, Keith Busch
> <keith.busch@linux.intel.com> wrote:
> > In the event the driver requests a normal completion, the timeout work
> > releasing the last reference doesn't do a second completion: it only
> 
> The reference only covers request's lifetime, not related with completion.
> 
> It isn't the last reference. When the driver returns EH_HANDLED, blk-mq
> will complete this request one extra time.
> 
> Yes, if driver's timeout code and normal completion code can sync
> about this completion, that should be fine, but the current behaviour
> doesn't depend driver's sync since the req is always completed atomically
> via the following way:
> 
> 1) timeout
> 
> if (mark_completed(rq))
>    timed_out(rq)
> 
> 2) normal completion
> if (mark_completed(rq))
>     complete(rq)
> 
> For example, before nvme_timeout() is trying to run nvme_dev_disable(),
> irq comes and this req is completed from normal completion path, but
> nvme_timeout() still returns EH_HANDLED, and blk-mq may complete
> the req one extra time since the normal completion path may not update
> req's state yet.

nvme_dev_disable tears down the irqs, meaning their handling is already
synced before nvme_dev_disable can proceed. Whether the completion
comes through nvme_irq or through nvme_dev_disable, there is no way
for nvme's timeout to return EH_HANDLED if the state was not
updated prior to returning that status.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-22 14:46           ` Keith Busch
@ 2018-05-22 14:57             ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2018-05-22 14:57 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, linux-nvme, Ming Lei, Keith Busch,
	Bart Van Assche, Christoph Hellwig

On Tue, May 22, 2018 at 10:46 PM, Keith Busch
<keith.busch@linux.intel.com> wrote:
> On Tue, May 22, 2018 at 10:37:32PM +0800, Ming Lei wrote:
>> On Tue, May 22, 2018 at 10:20 PM, Keith Busch
>> <keith.busch@linux.intel.com> wrote:
>> > In the event the driver requests a normal completion, the timeout work
>> > releasing the last reference doesn't do a second completion: it only
>>
>> The reference only covers request's lifetime, not related with completion.
>>
>> It isn't the last reference. When the driver returns EH_HANDLED, blk-mq
>> will complete this request one extra time.
>>
>> Yes, if driver's timeout code and normal completion code can sync
>> about this completion, that should be fine, but the current behaviour
>> doesn't depend driver's sync since the req is always completed atomically
>> via the following way:
>>
>> 1) timeout
>>
>> if (mark_completed(rq))
>>    timed_out(rq)
>>
>> 2) normal completion
>> if (mark_completed(rq))
>>     complete(rq)
>>
>> For example, before nvme_timeout() is trying to run nvme_dev_disable(),
>> irq comes and this req is completed from normal completion path, but
>> nvme_timeout() still returns EH_HANDLED, and blk-mq may complete
>> the req one extra time since the normal completion path may not update
>> req's state yet.
>
> nvme_dev_disable tears down irq's, meaing their handling is already
> sync'ed before nvme_dev_disable can proceed. Whether the completion
> comes through nvme_irq, or through nvme_dev_disable, there is no way
> possible for nvme's timeout to return EH_HANDLED if the state was not
> updated prior to returning that status.

OK, that still depends on the driver's behaviour. Even though it is true
for NVMe, you still have to audit other drivers for this sync, because
your patch requires it.


thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
  2018-05-22 14:57             ` Ming Lei
@ 2018-05-22 15:01               ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-05-22 15:01 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, linux-nvme, Ming Lei, Keith Busch,
	Bart Van Assche, Christoph Hellwig

On Tue, May 22, 2018 at 10:57:40PM +0800, Ming Lei wrote:
> OK, that still depends on driver's behaviour, even though it is true
> for NVMe,  you still have to audit other drivers about this sync
> because it is required by your patch.

Okay, forget about nvme for a moment here. Please run this thought
experiment to decide if what you're saying is even plausible for any
block driver with the existing implementation:

Your scenario has a driver return EH_HANDLED for a timed out request. The
timeout work then completes the request. The request may then be
reallocated for a new command and dispatched.

At approximately the same time, you're saying the driver that returned
EH_HANDLED may then call blk_mq_complete_request() in reference to the
*old* request that it returned EH_HANDLED for, incorrectly completing
the new request before it is done. That will inevitably lead to data
corruption. Is that happening today in any driver?

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
  2018-05-22 15:01               ` Keith Busch
@ 2018-05-22 15:07                 ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2018-05-22 15:07 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, linux-nvme, Ming Lei, Keith Busch,
	Bart Van Assche, Christoph Hellwig

On Tue, May 22, 2018 at 11:01 PM, Keith Busch
<keith.busch@linux.intel.com> wrote:
> On Tue, May 22, 2018 at 10:57:40PM +0800, Ming Lei wrote:
>> OK, that still depends on driver's behaviour, even though it is true
>> for NVMe,  you still have to audit other drivers about this sync
>> because it is required by your patch.
>
> Okay, forget about nvme for a moment here. Please run this thought
> experiment to decide if what you're saying is even plausible for any
> block driver with the existing implementation:
>
> Your scenario has a driver return EH_HANDLED for a timed out request. The
> timeout work then completes the request. The request may then be
> reallocated for a new command and dispatched.

Yes.

>
> At approximately the same time, you're saying the driver that returned
> EH_HANDLED may then call blk_mq_complete_request() in reference to the
> *old* request that it returned EH_HANDLED for, incorrectly completing

Because this request has been completed above by the blk-mq timeout,
this old request won't be completed any more via blk_mq_complete_request(),
whether from the normal path or anywhere else, such as cancel.

> the new request before it is done. That will inevitably lead to data
> corruption. Is that happening today in any driver?

No such issue: current blk-mq makes sure one req is only completed once,
but your patch changes that to depend on the driver to guarantee it.


thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
  2018-05-22 15:07                 ` Ming Lei
@ 2018-05-22 15:17                   ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-05-22 15:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, linux-nvme, Ming Lei, Keith Busch,
	Bart Van Assche, Christoph Hellwig

On Tue, May 22, 2018 at 11:07:07PM +0800, Ming Lei wrote:
> > At approximately the same time, you're saying the driver that returned
> > EH_HANDLED may then call blk_mq_complete_request() in reference to the
> > *old* request that it returned EH_HANDLED for, incorrectly completing
> 
> Because this request has been completed above by blk-mq timeout,
> then this old request won't be completed any more via blk_mq_complete_request()
> either from normal path or what ever, such as cancel.
 
> > the new request before it is done. That will inevitably lead to data
> > corruption. Is that happening today in any driver?
> 
> No such issue since current blk-mq makes sure one req is only completed
> once, but your patch changes to depend on driver to make sure that.

The blk-mq timeout complete makes the request available for allocation
as a new command, at which point blk_mq_complete_request can be called
again. If a driver is somehow relying on blk-mq to prevent a double
completion for a previously completed request context, they're already
in a lot of trouble.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 1/3] blk-mq: Reference count request usage
  2018-05-21 23:11   ` Keith Busch
@ 2018-05-22 15:19     ` Christoph Hellwig
  -1 siblings, 0 replies; 128+ messages in thread
From: Christoph Hellwig @ 2018-05-22 15:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-nvme, linux-block, Ming Lei, Christoph Hellwig,
	Bart Van Assche

On Mon, May 21, 2018 at 05:11:29PM -0600, Keith Busch wrote:
> This patch adds a struct kref to struct request so that request users
> can be sure they're operating on the same request without it changing
> while they're processing it. The request's tag won't be released for
> reuse until the last user is done with it.

Can you please use a raw refcount_t instead of the kref?  That avoids
a possible indirect call in the fast path, which has become really
painful with the Spectre mitigations, and it is also easier to follow
to start with.

I also think this should be merged into patch 3, as it isn't very
useful on its own.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
  2018-05-22 15:17                   ` Keith Busch
@ 2018-05-22 15:23                     ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2018-05-22 15:23 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, linux-nvme, Ming Lei, Keith Busch,
	Bart Van Assche, Christoph Hellwig

On Tue, May 22, 2018 at 11:17 PM, Keith Busch
<keith.busch@linux.intel.com> wrote:
> On Tue, May 22, 2018 at 11:07:07PM +0800, Ming Lei wrote:
>> > At approximately the same time, you're saying the driver that returned
>> > EH_HANDLED may then call blk_mq_complete_request() in reference to the
>> > *old* request that it returned EH_HANDLED for, incorrectly completing
>>
>> Because this request has been completed above by blk-mq timeout,
>> then this old request won't be completed any more via blk_mq_complete_request()
>> either from normal path or what ever, such as cancel.
>
>> > the new request before it is done. That will inevitably lead to data
>> > corruption. Is that happening today in any driver?
>>
>> No such issue since current blk-mq makes sure one req is only completed
>> once, but your patch changes to depend on driver to make sure that.
>
> The blk-mq timeout complete makes the request available for allocation
> as a new command, at which point blk_mq_complete_request can be called
> again. If a driver is somehow relying on blk-mq to prevent a double
> completion for a previously completed request context, they're already
> in a lot of trouble.

Yes, previously there was the atomic flag REQ_ATOM_COMPLETE covering
atomic completion, and recently Tejun changed that to an aborted state
with a generation counter, but both provide a sort of atomic completion.

So even though it is much simplified by using the request refcount, atomic
completion should still be provided by blk-mq, or drivers have to be
audited to avoid double completion.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 2/3] blk-mq: Fix timeout and state order
  2018-05-21 23:11   ` Keith Busch
@ 2018-05-22 15:24     ` Christoph Hellwig
  -1 siblings, 0 replies; 128+ messages in thread
From: Christoph Hellwig @ 2018-05-22 15:24 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-nvme, linux-block, Ming Lei, Christoph Hellwig,
	Bart Van Assche

On Mon, May 21, 2018 at 05:11:30PM -0600, Keith Busch wrote:
> The block layer had been setting the state to in-flight prior to updating
> the timer. This is the wrong order since the timeout handler could observe
> the in-flight state with the older timeout, believing the request had
> expired when in fact it is just getting started.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>

The way I understood Bart's comments on my comments on his patch, we
actually need the two updates to be atomic.  I haven't had much time
to follow up, but I'd like to hear Bart's opinion.  Either way we
clearly need to document our assumptions here in comments.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
  2018-05-21 23:11   ` Keith Busch
@ 2018-05-22 16:17     ` Christoph Hellwig
  -1 siblings, 0 replies; 128+ messages in thread
From: Christoph Hellwig @ 2018-05-22 16:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-nvme, linux-block, Ming Lei, Christoph Hellwig,
	Bart Van Assche

Hi Keith,

I like this series a lot.  One comment that is probably close
to the big discussion in the thread:

>  	switch (ret) {
>  	case BLK_EH_HANDLED:
>  		/*
> +		 * If the request is still in flight, the driver is requesting
> +		 * blk-mq complete it.
>  		 */
> +		if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
> +			__blk_mq_complete_request(req);
> +		break;

The state check here really irked me, and from the thread it seems like
I'm not the only one.  At least for the NVMe case I think it is perfectly
safe, although I agree I'd rather audit what other drivers do carefully.

That being said I think BLK_EH_HANDLED seems like a fundamentally broken
idea, and I'd actually prefer to get rid of it over adding things like
the MQ_RQ_IN_FLIGHT check above.

E.g. if we look at the cases where nvme-pci returns it:

 - if we did call nvme_dev_disable, we already canceled all requests,
   so we might as well just return BLK_EH_NOT_HANDLED
 - the poll for completion case already completed the command,
   so we should return BLK_EH_NOT_HANDLED

So I think we need to fix up nvme and if needed any other driver
to return the right value and then assert that the request is
still in in-flight status for the BLK_EH_HANDLED case.

> @@ -124,16 +119,7 @@ static inline int blk_mq_rq_state(struct request *rq)
>  static inline void blk_mq_rq_update_state(struct request *rq,
>  					  enum mq_rq_state state)
>  {
> +	WRITE_ONCE(rq->state, state);
>  }

I think this helper can go away now.  But we should have a comment
near the state field documenting the concurrency implications.



> +	u64 state;

This should probably be a mq_rq_state instead.  Which means it needs
to be moved to blkdev.h, but that should be ok.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 2/3] blk-mq: Fix timeout and state order
  2018-05-22 15:24     ` Christoph Hellwig
@ 2018-05-22 16:27       ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-05-22 16:27 UTC (permalink / raw)
  To: hch, keith.busch; +Cc: linux-nvme, linux-block, axboe, ming.lei

On Tue, 2018-05-22 at 17:24 +0200, Christoph Hellwig wrote:
> On Mon, May 21, 2018 at 05:11:30PM -0600, Keith Busch wrote:
> > The block layer had been setting the state to in-flight prior to updating
> > the timer. This is the wrong order since the timeout handler could observe
> > the in-flight state with the older timeout, believing the request had
> > expired when in fact it is just getting started.
> > 
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> 
> The way I understood Barts comments to my comments on his patch we
> actually need the two updates to be atomic.  I haven't had much time
> to follow up, but I'd like to hear Barts opinion.  Either way we
> clearly need to document our assumptions here in comments.

After we discussed request state updating I have been thinking further about
this. I think now that it is safe to update the request deadline first because
the timeout code ignores requests anyway that have another state than
MQ_RQ_IN_FLIGHT. Additionally, it is unlikely that the request timer will fire
before the request state has been updated. And if that would happen the request
timeout will be handled the next time request timeouts are examined.

Bart.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
  2018-05-22 14:15       ` Keith Busch
@ 2018-05-22 16:29         ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-05-22 16:29 UTC (permalink / raw)
  To: keith.busch; +Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Tue, 2018-05-22 at 08:15 -0600, Keith Busch wrote:
> This shouldn't be introducing any new concurrent calls to
> __blk_mq_complete_request if they don't already exist. The timeout work
> calls it only if the driver's timeout returns BLK_EH_HANDLED, which means
> the driver is claiming the command is now done, but the driver didn't call
> blk_mq_complete_request as indicated by the request's IN_FLIGHT state.
> 
> In order to get a second call to __blk_mq_complete_request(), then,
> the driver would have to end up calling blk_mq_complete_request()
> in another context. But there's nothing stopping that from happening
> today, and would be bad if any driver actually did that: the request
> may have been re-allocated and issued as a new command, and calling
> blk_mq_complete_request() the second time introduces data corruption.

Hello Keith,

Please have another look at the current code that handles request timeouts
and completions. The current implementation guarantees that no double
completions can occur but your patch removes essential aspects of that
implementation.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 0/3] blk-mq: Timeout rework
  2018-05-22 14:06     ` Keith Busch
@ 2018-05-22 16:30       ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-05-22 16:30 UTC (permalink / raw)
  To: keith.busch; +Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Tue, 2018-05-22 at 08:06 -0600, Keith Busch wrote:
> On Mon, May 21, 2018 at 11:29:21PM +0000, Bart Van Assche wrote:
> > Can you explain why the NVMe driver needs reference counting of requests but
> > no other block driver needs this? Additionally, why is it that for all block
> > drivers except NVMe the current block layer API is sufficient
> > (blk_get_request()/blk_execute_rq()/blk_mq_start_request()/
> > blk_mq_complete_request()/blk_mq_end_request())?
> 
> I'm pretty sure NVMe isn't the only driver where a call to
> blk_mq_complete_request silently fails to transition the request to
> COMPLETE, forcing unnecessary error handling. This patch isn't so
> much about NVMe as it is about removing that silent exception from the
> block API.

Hello Keith,

Please have a look at v13 of the timeout handling rework patch that I posted.
That patch should not introduce any new race conditions and should also handle
the scenario fine in which blk_mq_complete_request() is called while the NVMe
timeout handling function is in progress.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-22 16:29         ` Bart Van Assche
@ 2018-05-22 16:34           ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-05-22 16:34 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Tue, May 22, 2018 at 04:29:07PM +0000, Bart Van Assche wrote:
> Please have another look at the current code that handles request timeouts
> and completions. The current implementation guarantees that no double
> completions can occur but your patch removes essential aspects of that
> implementation.

How does the current implementation guarantee a double completion doesn't
happen when the request is allocated for a new command?

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 0/3] blk-mq: Timeout rework
  2018-05-22 16:30       ` Bart Van Assche
@ 2018-05-22 16:44         ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-05-22 16:44 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Tue, May 22, 2018 at 04:30:41PM +0000, Bart Van Assche wrote:
> 
> Please have a look at v13 of the timeout handling rework patch that I posted.
> That patch should not introduce any new race conditions and should also handle
> the scenario fine in which blk_mq_complete_request() is called while the NVMe
> timeout handling function is in progress.

Thanks for the notice. That sounds very interesting and I will be happy to
take a look.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-22 16:34           ` Keith Busch
@ 2018-05-22 16:48             ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-05-22 16:48 UTC (permalink / raw)
  To: keith.busch; +Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Tue, 2018-05-22 at 10:34 -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 04:29:07PM +0000, Bart Van Assche wrote:
> > Please have another look at the current code that handles request timeouts
> > and completions. The current implementation guarantees that no double
> > completions can occur but your patch removes essential aspects of that
> > implementation.
> 
> How does the current implementation guarantee a double completion doesn't
> happen when the request is allocated for a new command?

Hello Keith,

If a request completes and is reused after the timeout handler has set
aborted_gstate and before blk_mq_terminate_expired() is called, then the latter
function will skip the request because restarting a request causes the
generation number in rq->gstate to be incremented. From blk_mq_rq_update_state():

	if (state == MQ_RQ_IN_FLIGHT) {
		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
		new_val += MQ_RQ_GEN_INC;
	}

Bart.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-22 16:17     ` Christoph Hellwig
@ 2018-05-23  0:34       ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2018-05-23  0:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, linux-nvme, linux-block, Bart Van Assche

On Tue, May 22, 2018 at 06:17:04PM +0200, Christoph Hellwig wrote:
> Hi Keith,
> 
> I like this series a lot.  One comment that is probably close
> to the big discussion in the thread:
> 
> >  	switch (ret) {
> >  	case BLK_EH_HANDLED:
> >  		/*
> > +		 * If the request is still in flight, the driver is requesting
> > +		 * blk-mq complete it.
> >  		 */
> > +		if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
> > +			__blk_mq_complete_request(req);
> > +		break;
> 
> The state check here really irked me, and from the thread it seems like
> I'm not the only one.  At least for the NVMe case I think it is perfectly
> safe, although I agree I'd rather audit what other drivers do carefully.

Let's consider the normal NVMe timeout code path:

1) one request is timed out;

2) controller is shutdown, this timed-out request is requeued from
nvme_cancel_request(), but can't dispatch because queues are quiesced

3) reset is done from another context, and this request is dispatched
again, and completed exactly before returning EH_HANDLED to blk-mq, but
its state isn't updated to COMPLETE yet.

4) then double completions are done from both normal completion and timeout
path.

The same issue seems to exist on the poll path.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-22 16:17     ` Christoph Hellwig
@ 2018-05-23  5:48       ` Hannes Reinecke
  -1 siblings, 0 replies; 128+ messages in thread
From: Hannes Reinecke @ 2018-05-23  5:48 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch
  Cc: Jens Axboe, linux-nvme, linux-block, Ming Lei, Bart Van Assche

On 05/22/2018 06:17 PM, Christoph Hellwig wrote:
> Hi Keith,
> 
> I like this series a lot.  One comment that is probably close
> to the big discussion in the thread:
> 
>>   	switch (ret) {
>>   	case BLK_EH_HANDLED:
>>   		/*
>> +		 * If the request is still in flight, the driver is requesting
>> +		 * blk-mq complete it.
>>   		 */
>> +		if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
>> +			__blk_mq_complete_request(req);
>> +		break;
> 
> The state check here really irked me, and from the thread it seems like
> I'm not the only one.  At least for the NVMe case I think it is perfectly
> safe, although I agree I'd rather audit what other drivers do carefully.
> 
> That being said I think BLK_EH_HANDLED seems like a fundamentally broken
> idea, and I'd actually prefer to get rid of it over adding things like
> the MQ_RQ_IN_FLIGHT check above.
> 
I can't agree more here.

BLK_EH_HANDLED is breaking all sorts of assumptions, and I'd be glad to 
see it go.

> E.g. if we look at the cases where nvme-pci returns it:
> 
>   - if we did call nvme_dev_disable, we already canceled all requests,
>     so we might as well just return BLK_EH_NOT_HANDLED
>   - the poll for completion case already completed the command,
>     so we should return BLK_EH_NOT_HANDLED
> 
> So I think we need to fix up nvme and if needed any other driver
> to return the right value and then assert that the request is
> still in in-flight status for the BLK_EH_HANDLED case.
> 
... and then kill BLK_EH_HANDLED :-)

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-23  0:34       ` Ming Lei
@ 2018-05-23 14:35         ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-05-23 14:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Bart Van Assche,
	linux-block, linux-nvme

On Wed, May 23, 2018 at 08:34:48AM +0800, Ming Lei wrote:
> Let's consider the normal NVMe timeout code path:
> 
> 1) one request is timed out;
> 
> 2) controller is shutdown, this timed-out request is requeued from
> nvme_cancel_request(), but can't dispatch because queues are quiesced
> 
> 3) reset is done from another context, and this request is dispatched
> again, and completed exactly before returning EH_HANDLED to blk-mq, but
> its state isn't updated to COMPLETE yet.
> 
> 4) then double completions are done from both normal completion and timeout
> path.

We're definitely fixing this, but I must admit that's an impressive
cognitive traversal across 5 thread contexts to arrive at that race. :)

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-23 14:35         ` Keith Busch
@ 2018-05-24  1:52           ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2018-05-24  1:52 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Bart Van Assche,
	linux-block, linux-nvme

On Wed, May 23, 2018 at 08:35:40AM -0600, Keith Busch wrote:
> On Wed, May 23, 2018 at 08:34:48AM +0800, Ming Lei wrote:
> > Let's consider the normal NVMe timeout code path:
> > 
> > 1) one request is timed out;
> > 
> > 2) controller is shutdown, this timed-out request is requeued from
> > nvme_cancel_request(), but can't dispatch because queues are quiesced
> > 
> > 3) reset is done from another context, and this request is dispatched
> > again, and completed exactly before returning EH_HANDLED to blk-mq, but
> > its state isn't updated to COMPLETE yet.
> > 
> > 4) then double completions are done from both normal completion and timeout
> > path.
> 
> We're definitely fixing this, but I must admit that's an impressive
> cognitive traversal across 5 thread contexts to arrive at that race. :)

It can be only 2 thread contexts if requeue is done on polled request
from nvme_timeout(), :-)

Thanks,
Ming

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-21 23:11   ` Keith Busch
@ 2018-07-12 18:16     ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-07-12 18:16 UTC (permalink / raw)
  To: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
>  	/*
> -	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
> -	 * completions that we lost to, they would have finished and
> -	 * updated @rq->gstate by now; otherwise, the completion path is
> -	 * now guaranteed to see @rq->aborted_gstate and yield.  If
> -	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
> +	 * Just do a quick check if it is expired before locking the request in
> +	 * so we're not unnecessarilly synchronizing across CPUs.
>  	 */
> -	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
> -	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
> +	if (!blk_mq_req_expired(rq, next))
> +		return;
> +
> +	/*
> +	 * We have reason to believe the request may be expired. Take a
> +	 * reference on the request to lock this request lifetime into its
> +	 * currently allocated context to prevent it from being reallocated in
> +	 * the event the completion by-passes this timeout handler.
> +	 * 
> +	 * If the reference was already released, then the driver beat the
> +	 * timeout handler to posting a natural completion.
> +	 */
> +	if (!kref_get_unless_zero(&rq->ref))
> +		return;
> +
> +	/*
> +	 * The request is now locked and cannot be reallocated underneath the
> +	 * timeout handler's processing. Re-verify this exact request is truly
> +	 * expired; if it is not expired, then the request was completed and
> +	 * reallocated as a new request.
> +	 */
> +	if (blk_mq_req_expired(rq, next))
>  		blk_mq_rq_timed_out(rq, reserved);
> +	blk_mq_put_request(rq);
>  }

Hello Keith and Christoph,

What prevents that a request finishes and gets reused after the
blk_mq_req_expired() call has finished and before kref_get_unless_zero() is
called? Is this perhaps a race condition that has not yet been triggered by
any existing block layer test? Please note that there is no such race
condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling
again" - https://www.spinics.net/lists/linux-block/msg26489.html).

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-12 18:16     ` Bart Van Assche
@ 2018-07-12 19:24       ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-07-12 19:24 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Thu, Jul 12, 2018 at 06:16:12PM +0000, Bart Van Assche wrote:
> On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> >  	/*
> > -	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
> > -	 * completions that we lost to, they would have finished and
> > -	 * updated @rq->gstate by now; otherwise, the completion path is
> > -	 * now guaranteed to see @rq->aborted_gstate and yield.  If
> > -	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
> > +	 * Just do a quick check if it is expired before locking the request in
> > +	 * so we're not unnecessarilly synchronizing across CPUs.
> >  	 */
> > -	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
> > -	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
> > +	if (!blk_mq_req_expired(rq, next))
> > +		return;
> > +
> > +	/*
> > +	 * We have reason to believe the request may be expired. Take a
> > +	 * reference on the request to lock this request lifetime into its
> > +	 * currently allocated context to prevent it from being reallocated in
> > +	 * the event the completion by-passes this timeout handler.
> > +	 * 
> > +	 * If the reference was already released, then the driver beat the
> > +	 * timeout handler to posting a natural completion.
> > +	 */
> > +	if (!kref_get_unless_zero(&rq->ref))
> > +		return;
> > +
> > +	/*
> > +	 * The request is now locked and cannot be reallocated underneath the
> > +	 * timeout handler's processing. Re-verify this exact request is truly
> > +	 * expired; if it is not expired, then the request was completed and
> > +	 * reallocated as a new request.
> > +	 */
> > +	if (blk_mq_req_expired(rq, next))
> >  		blk_mq_rq_timed_out(rq, reserved);
> > +	blk_mq_put_request(rq);
> >  }
> 
> Hello Keith and Christoph,
> 
> What prevents that a request finishes and gets reused after the
> blk_mq_req_expired() call has finished and before kref_get_unless_zero() is
> called? Is this perhaps a race condition that has not yet been triggered by
> any existing block layer test? Please note that there is no such race
> condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling
> again" - https://www.spinics.net/lists/linux-block/msg26489.html).

I don't think there's any such race in the merged implementation
either. If the request is reallocated, then the kref check may succeed,
but the blk_mq_req_expired() check would surely fail, allowing the
request to proceed as normal. The code comments at least say as much.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-12 19:24       ` Keith Busch
@ 2018-07-12 22:24         ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-07-12 22:24 UTC (permalink / raw)
  To: keith.busch; +Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Thu, 2018-07-12 at 13:24 -0600, Keith Busch wrote:
> On Thu, Jul 12, 2018 at 06:16:12PM +0000, Bart Van Assche wrote:
> > What prevents that a request finishes and gets reused after the
> > blk_mq_req_expired() call has finished and before kref_get_unless_zero() is
> > called? Is this perhaps a race condition that has not yet been triggered by
> > any existing block layer test? Please note that there is no such race
> > condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling
> > again" - https://www.spinics.net/lists/linux-block/msg26489.html).
> 
> I don't think there's any such race in the merged implementation
> either. If the request is reallocated, then the kref check may succeed,
> but the blk_mq_req_expired() check would surely fail, allowing the
> request to proceed as normal. The code comments at least say as much.

Hello Keith,

Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
request completion was reported after request timeout processing had
started, completion handling was skipped. The following code in
blk_mq_complete_request() realized that:

	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
		__blk_mq_complete_request(rq);

Since commit 12f5b9314545, if a completion occurs after request timeout
processing has started, that completion is processed if the request has the
state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request
state unless the block driver timeout handler modifies it, e.g. by calling
blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical
behavior of scsi_times_out() is to queue sending of a SCSI abort and hence
not to change the request state immediately. In other words, if a request
completion occurs during or shortly after a timeout occurred then
blk_mq_complete_request() will call __blk_mq_complete_request() and will
complete the request, although that is not allowed because timeout handling
has already started. Do you agree with this analysis?

Thanks,

Bart.
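
[Editor's note: Keith's point in this thread, that even when the kref check
on a recycled request succeeds the blk_mq_req_expired() re-check fails, rests
on a claim-then-recheck pattern built on kref_get_unless_zero(). Below is a
minimal userspace sketch of that pattern, with hypothetical names and C11
atomics standing in for the kernel primitives; it is not the actual blk-mq
code.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a request with a reference count;
 * a refcount of 0 means the request has been freed or recycled. */
struct req {
    atomic_int refcount;
    unsigned long deadline;
};

/* Same semantics as kref_get_unless_zero(): take a reference only
 * if the count has not already dropped to zero. */
static bool get_unless_zero(struct req *rq)
{
    int old = atomic_load(&rq->refcount);
    do {
        if (old == 0)
            return false;   /* request already gone; leave it alone */
    } while (!atomic_compare_exchange_weak(&rq->refcount, &old, old + 1));
    return true;
}

/* Timeout side: claim a reference first, then re-check expiry, so a
 * reallocated request (with a fresh deadline) is simply released again. */
static bool try_timeout(struct req *rq, unsigned long now)
{
    if (!get_unless_zero(rq))
        return false;                 /* lost the race with the final put */
    if (now < rq->deadline) {         /* blk_mq_req_expired()-style re-check */
        atomic_fetch_sub(&rq->refcount, 1);
        return false;                 /* recycled request proceeds as normal */
    }
    /* ... timeout handling would run here, holding the reference ... */
    atomic_fetch_sub(&rq->refcount, 1);
    return true;
}
```

With these semantics, a request that is completed and reallocated between
the expiry scan and the claim either fails the refcount check or fails the
deadline re-check, which is the behavior Keith describes.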

^ permalink raw reply	[flat|nested] 128+ messages in thread


* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-12 22:24         ` Bart Van Assche
@ 2018-07-13  1:12           ` jianchao.wang
  -1 siblings, 0 replies; 128+ messages in thread
From: jianchao.wang @ 2018-07-13  1:12 UTC (permalink / raw)
  To: Bart Van Assche, keith.busch
  Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei



On 07/13/2018 06:24 AM, Bart Van Assche wrote:
> On Thu, 2018-07-12 at 13:24 -0600, Keith Busch wrote:
>> On Thu, Jul 12, 2018 at 06:16:12PM +0000, Bart Van Assche wrote:
>>> What prevents that a request finishes and gets reused after the
>>> blk_mq_req_expired() call has finished and before kref_get_unless_zero() is
>>> called? Is this perhaps a race condition that has not yet been triggered by
>>> any existing block layer test? Please note that there is no such race
>>> condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling
>>> again" - https://www.spinics.net/lists/linux-block/msg26489.html).
>>
>> I don't think there's any such race in the merged implementation
>> either. If the request is reallocated, then the kref check may succeed,
>> but the blk_mq_req_expired() check would surely fail, allowing the
>> request to proceed as normal. The code comments at least say as much.
> 
> Hello Keith,
> 
> Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> request completion was reported after request timeout processing had
> started, completion handling was skipped. The following code in
> blk_mq_complete_request() realized that:
> 
> 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> 		__blk_mq_complete_request(rq);
> 
> Since commit 12f5b9314545, if a completion occurs after request timeout
> processing has started, that completion is processed if the request has the
> state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request
> state unless the block driver timeout handler modifies it, e.g. by calling
> blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical
> behavior of scsi_times_out() is to queue sending of a SCSI abort and hence
> not to change the request state immediately. In other words, if a request
> completion occurs during or shortly after a timeout occurred then
> blk_mq_complete_request() will call __blk_mq_complete_request() and will
> complete the request, although that is not allowed because timeout handling
> has already started. Do you agree with this analysis?
>

Thank goodness Bart has said this.
I have been saying the same thing all along, in this mail:
https://marc.info/?l=linux-block&m=152950093831738&w=2
Even though my voice is small, I definitely support Bart's point.

Thanks
Jianchao
 
> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 128+ messages in thread


* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-12 22:24         ` Bart Van Assche
@ 2018-07-13  2:40           ` jianchao.wang
  -1 siblings, 0 replies; 128+ messages in thread
From: jianchao.wang @ 2018-07-13  2:40 UTC (permalink / raw)
  To: Bart Van Assche, keith.busch
  Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei


On 07/13/2018 06:24 AM, Bart Van Assche wrote:
> Hello Keith,
> 
> Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> request completion was reported after request timeout processing had
> started, completion handling was skipped. The following code in
> blk_mq_complete_request() realized that:
> 
> 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> 		__blk_mq_complete_request(rq);

Even before Tejun's patch, we already had this in both the blk-mq and legacy block code:

blk_rq_check_expired

    if (time_after_eq(jiffies, rq->deadline)) {
        list_del_init(&rq->timeout_list);

        /*
         * Check if we raced with end io completion
         */
        if (!blk_mark_rq_complete(rq))
            blk_rq_timed_out(rq);
    } 
 

blk_complete_request
    
    if (!blk_mark_rq_complete(req))
        __blk_complete_request(req);

blk_mq_check_expired
    
    if (time_after_eq(jiffies, rq->deadline)) {
        if (!blk_mark_rq_complete(rq))
            blk_mq_rq_timed_out(rq, reserved);
    }


blk_mq_complete_request

    if (!blk_mark_rq_complete(rq))
        __blk_mq_complete_request(rq);
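
[Editor's note: all four snippets above hinge on the same primitive:
blk_mark_rq_complete() atomically test-and-sets a per-request "complete"
bit, so whichever of the completion and timeout paths gets there first
claims the request and the other backs off. Below is a minimal userspace
sketch of that arbitration, with hypothetical names and a C11
atomic_exchange standing in for the kernel's test_and_set_bit(); it is not
the actual block layer code.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for REQ_ATOM_COMPLETE in rq->atomic_flags. */
struct req {
    atomic_bool complete;
};

/* Mirrors blk_mark_rq_complete(): atomically set the bit and report
 * whether it was already set before this call. */
static bool mark_rq_complete(struct req *rq)
{
    return atomic_exchange(&rq->complete, true);
}

/* Completion path: only complete the request if we claimed it first. */
static bool complete_request(struct req *rq)
{
    if (!mark_rq_complete(rq)) {
        /* __blk_mq_complete_request()-style work would go here */
        return true;
    }
    return false;   /* timeout path already owns the request */
}

/* Timeout path: same claim, mirrored. */
static bool timeout_request(struct req *rq)
{
    if (!mark_rq_complete(rq)) {
        /* blk_mq_rq_timed_out()-style work would go here */
        return true;
    }
    return false;   /* completion path already owns the request */
}
```

For any given request, exactly one of complete_request() and
timeout_request() can return true, which is the mutual exclusion the
pre-12f5b9314545 code relied on.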

Thanks
Jianchao

> 
> Since commit 12f5b9314545, if a completion occurs after request timeout
> processing has started, that completion is processed if the request has the
> state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request
> state unless the block driver timeout handler modifies it, e.g. by calling
> blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical
> behavior of scsi_times_out() is to queue sending of a SCSI abort and hence
> not to change the request state immediately. In other words, if a request
> completion occurs during or shortly after a timeout occurred then
> blk_mq_complete_request() will call __blk_mq_complete_request() and will
> complete the request, although that is not allowed because timeout handling
> has already started. Do you agree with this analysis?

^ permalink raw reply	[flat|nested] 128+ messages in thread


* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-12 22:24         ` Bart Van Assche
@ 2018-07-13 15:43           ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-07-13 15:43 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, linux-block, linux-nvme, ming.lei, keith.busch, hch

On Thu, Jul 12, 2018 at 10:24:42PM +0000, Bart Van Assche wrote:
> Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> request completion was reported after request timeout processing had
> started, completion handling was skipped. The following code in
> blk_mq_complete_request() realized that:
> 
> 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> 		__blk_mq_complete_request(rq);
> 
> Since commit 12f5b9314545, if a completion occurs after request timeout
> processing has started, that completion is processed if the request has the
> state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request
> state unless the block driver timeout handler modifies it, e.g. by calling
> blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical
> behavior of scsi_times_out() is to queue sending of a SCSI abort and hence
> not to change the request state immediately. In other words, if a request
> completion occurs during or shortly after a timeout occurred then
> blk_mq_complete_request() will call __blk_mq_complete_request() and will
> complete the request, although that is not allowed because timeout handling
> has already started. Do you agree with this analysis?

Yes, it's different, and that was the whole point. No one made that a
secret either. Are you saying you want timeout software to take priority
over handling hardware events?

^ permalink raw reply	[flat|nested] 128+ messages in thread


* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-13 15:43           ` Keith Busch
@ 2018-07-13 15:52             ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-07-13 15:52 UTC (permalink / raw)
  To: keith.busch; +Cc: hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei

On Fri, 2018-07-13 at 09:43 -0600, Keith Busch wrote:
> On Thu, Jul 12, 2018 at 10:24:42PM +0000, Bart Van Assche wrote:
> > Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> > request completion was reported after request timeout processing had
> > started, completion handling was skipped. The following code in
> > blk_mq_complete_request() realized that:
> > 
> > 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> > 		__blk_mq_complete_request(rq);
> > 
> > Since commit 12f5b9314545, if a completion occurs after request timeout
> > processing has started, that completion is processed if the request has the
> > state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request
> > state unless the block driver timeout handler modifies it, e.g. by calling
> > blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical
> > behavior of scsi_times_out() is to queue sending of a SCSI abort and hence
> > not to change the request state immediately. In other words, if a request
> > completion occurs during or shortly after a timeout occurred then
> > blk_mq_complete_request() will call __blk_mq_complete_request() and will
> > complete the request, although that is not allowed because timeout handling
> > has already started. Do you agree with this analysis?
> 
> Yes, it's different, and that was the whole point. No one made that a
> secret either. Are you saying you want timeout software to take priority
> over handling hardware events?

No. What I'm saying is that a behavior change has been introduced in the block
layer that was not documented in the patch description. I think that behavior
change even can trigger a kernel crash.

Bart.

^ permalink raw reply	[flat|nested] 128+ messages in thread


* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-13 15:52             ` Bart Van Assche
@ 2018-07-13 18:47               ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-07-13 18:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei

On Fri, Jul 13, 2018 at 03:52:38PM +0000, Bart Van Assche wrote:
> No. What I'm saying is that a behavior change has been introduced in the block
> layer that was not documented in the patch description. 

Did you read the changelog?

> I think that behavior change even can trigger a kernel crash.

I think we are past acknowledging issues exist with timeouts.

^ permalink raw reply	[flat|nested] 128+ messages in thread


* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-13 18:47               ` Keith Busch
@ 2018-07-13 23:03                 ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-07-13 23:03 UTC (permalink / raw)
  To: keith.busch; +Cc: hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei

On Fri, 2018-07-13 at 12:47 -0600, Keith Busch wrote:
> On Fri, Jul 13, 2018 at 03:52:38PM +0000, Bart Van Assche wrote:
> > I think that behavior change even can trigger a kernel crash.
> 
> I think we are past acknowledging issues exist with timeouts.

Hello Keith,

How do you want to go forward from here? Do you prefer the approach of the
patch I had posted (https://www.spinics.net/lists/linux-block/msg26489.html),
Jianchao's approach (https://marc.info/?l=linux-block&m=152950093831738) or
perhaps yet another approach? Note: I think Jianchao's patch is a good start
but also that it needs further improvement.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 128+ messages in thread


* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-13 23:03                 ` Bart Van Assche
@ 2018-07-13 23:58                   ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-07-13 23:58 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei

On Fri, Jul 13, 2018 at 11:03:18PM +0000, Bart Van Assche wrote:
> How do you want to go forward from here? Do you prefer the approach of the
> patch I had posted (https://www.spinics.net/lists/linux-block/msg26489.html), 
> Jianchao's approach (https://marc.info/?l=linux-block&m=152950093831738) or
> perhaps yet another approach? Note: I think Jianchao's patch is a good start
> but also that it needs further improvement.

Of the two you mentioned, yours is preferable IMO. While I appreciate
Jianchao's detailed analysis, it's hard to take a proposal seriously
that so colourfully calls everyone else "dangerous" while advocating
for silently losing requests on purpose.

But where's the option that fixes scsi to handle hardware completions
concurrently with arbitrary timeout software? Propping up that house of
cards can't be the only recourse.

^ permalink raw reply	[flat|nested] 128+ messages in thread


* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-13 23:58                   ` Keith Busch
@ 2018-07-18 19:56                     ` hch
  -1 siblings, 0 replies; 128+ messages in thread
From: hch @ 2018-07-18 19:56 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bart Van Assche, hch, keith.busch, linux-block, linux-nvme,
	axboe, ming.lei

On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote:
> Of the two you mentioned, yours is preferable IMO. While I appreciate
> Jianchao's detailed analysis, it's hard to take a proposal seriously
> that so colourfully calls everyone else "dangerous" while advocating
> for silently losing requests on purpose.
> 
> But where's the option that fixes scsi to handle hardware completions
> concurrently with arbitrary timeout software? Propping up that house of
> cards can't be the only recourse.

The important bit is that we need to fix this issue quickly.  We are
past -rc5 so I'm rather concerned about anything too complicated.

I'm not even sure SCSI has a problem with multiple completions happening
at the same time, but it certainly has a problem with bypassing
blk_mq_complete_request from the EH path.

I think we can solve this properly, but I also think we are way too late
in the 4.18 cycle to fix it properly.  For now I fear we'll just have
to revert the changes and try again for 4.19 or even 4.20 if we don't
act quickly enough.

^ permalink raw reply	[flat|nested] 128+ messages in thread


* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-18 19:56                     ` hch
@ 2018-07-18 20:39                       ` hch
  -1 siblings, 0 replies; 128+ messages in thread
From: hch @ 2018-07-18 20:39 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bart Van Assche, hch, keith.busch, linux-block, linux-nvme,
	axboe, ming.lei, jianchao.w.wang

On Wed, Jul 18, 2018 at 09:56:50PM +0200, hch@lst.de wrote:
> On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote:
> > Of the two you mentioned, yours is preferable IMO. While I appreciate
> > Jianchao's detailed analysis, it's hard to take a proposal seriously
> > that so colourfully calls everyone else "dangerous" while advocating
> > for silently losing requests on purpose.
> > 
> > But where's the option that fixes scsi to handle hardware completions
> > concurrently with arbitrary timeout software? Propping up that house of
> > cards can't be the only recourse.
> 
> The important bit is that we need to fix this issue quickly.  We are
> past -rc5 so I'm rather concerned about anything too complicated.
> 
> I'm not even sure SCSI has a problem with multiple completions happening
> at the same time, but it certainly has a problem with bypassing
> blk_mq_complete_request from the EH path.
> 
> I think we can solve this properly, but I also think we are way to late
> in the 4.18 cycle to fix it properly.  For now I fear we'll just have
> to revert the changes and try again for 4.19 or even 4.20 if we don't
> act quickly enough.

So here is a quick attempt at the revert while also trying to keep
nvme working.  Keith, Bart, Jianchao - does this looks reasonable
as a 4.18 band aid?

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-revert

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-18 19:56                     ` hch
@ 2018-07-18 20:53                       ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-07-18 20:53 UTC (permalink / raw)
  To: hch
  Cc: Bart Van Assche, keith.busch, linux-block, linux-nvme, axboe, ming.lei

On Wed, Jul 18, 2018 at 09:56:50PM +0200, hch@lst.de wrote:
> On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote:
> > Of the two you mentioned, yours is preferable IMO. While I appreciate
> > Jianchao's detailed analysis, it's hard to take a proposal seriously
> > that so colourfully calls everyone else "dangerous" while advocating
> > for silently losing requests on purpose.
> > 
> > But where's the option that fixes scsi to handle hardware completions
> > concurrently with arbitrary timeout software? Propping up that house of
> > cards can't be the only recourse.
> 
> The important bit is that we need to fix this issue quickly.  We are
> past -rc5 so I'm rather concerned about anything too complicated.
> 
> I'm not even sure SCSI has a problem with multiple completions happening
> at the same time, but it certainly has a problem with bypassing
> blk_mq_complete_request from the EH path.
> 
> I think we can solve this properly, but I also think we are way to late
> in the 4.18 cycle to fix it properly.  For now I fear we'll just have
> to revert the changes and try again for 4.19 or even 4.20 if we don't
> act quickly enough.

If scsi needs this behavior, why not just put that behavior in scsi? It
can set the state to complete and then everything can play out as
before.

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 22326612a5d3..f50559718b71 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -558,10 +558,8 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	if (cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) !=
-			MQ_RQ_IN_FLIGHT)
+	if (blk_mq_mark_complete(rq))
 		return;
-
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
 
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..a5d05fab24a7 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -286,6 +286,14 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 	enum blk_eh_timer_return rtn = BLK_EH_DONE;
 	struct Scsi_Host *host = scmd->device->host;
 
+	/*
+	 * Mark complete now so lld can't post a completion during error
+	 * handling. Return immediately if it was already marked complete, as
+	 * that means the lld posted a completion already.
+	 */
+	if (req->q->mq_ops && blk_mq_mark_complete(req))
+		return rtn;
+
 	trace_scsi_dispatch_cmd_timeout(scmd);
 	scsi_log_completion(scmd, TIMEOUT_ERROR);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 9b0fd11ce89a..0ce587c9c27b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -289,6 +289,15 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
+/*
+ * Returns true if request is not in flight.
+ */
+static inline bool blk_mq_mark_complete(struct request *rq)
+{
+	return (cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) !=
+			MQ_RQ_IN_FLIGHT);
+}
+
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
--

^ permalink raw reply related	[flat|nested] 128+ messages in thread
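[Editorial aside: the claim-once semantics of the blk_mq_mark_complete() helper proposed above can be sketched in userspace with C11 atomics. The struct and state names below are stand-ins invented for this sketch, not the kernel's types. Whichever path performs the IN_FLIGHT -> COMPLETE transition first wins; every later caller is told the request was no longer in flight, which is why neither the completion path nor the timeout handler can complete a request twice.]

```c
#include <stdatomic.h>

/* Userspace stand-ins for the kernel's blk-mq request states. */
enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

struct fake_request {
	_Atomic int state;
};

/*
 * Analogue of the proposed blk_mq_mark_complete(): atomically try to
 * move the request from IN_FLIGHT to COMPLETE.  Returns true if the
 * request was NOT in flight, i.e. some other path already claimed it.
 */
static _Bool mark_complete(struct fake_request *rq)
{
	int expected = MQ_RQ_IN_FLIGHT;

	return !atomic_compare_exchange_strong(&rq->state, &expected,
					       MQ_RQ_COMPLETE);
}
```

Under this sketch's assumptions, scsi_times_out() returning early when the helper reports true is safe: the only way it can report true is that another path already claimed the completion.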

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-18 20:53                       ` Keith Busch
@ 2018-07-18 20:58                         ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-07-18 20:58 UTC (permalink / raw)
  To: hch, keith.busch; +Cc: keith.busch, linux-block, linux-nvme, axboe, ming.lei

On Wed, 2018-07-18 at 14:53 -0600, Keith Busch wrote:
> If scsi needs this behavior, why not just put that behavior in scsi? It
> can set the state to complete and then everything can play out as
> before.
> [ ... ]

There may be other drivers that need the same protection the SCSI core needs
so I think the patch at the end of your previous e-mail is a step in the wrong
direction.

Bart.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-18 20:39                       ` hch
@ 2018-07-18 21:05                         ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-07-18 21:05 UTC (permalink / raw)
  To: hch, keith.busch
  Cc: jianchao.w.wang, keith.busch, linux-block, linux-nvme, axboe, ming.lei

On Wed, 2018-07-18 at 22:39 +0200, hch@lst.de wrote:
> On Wed, Jul 18, 2018 at 09:56:50PM +0200, hch@lst.de wrote:
> > The important bit is that we need to fix this issue quickly.  We are
> > past -rc5 so I'm rather concerned about anything too complicated.
> > 
> > I'm not even sure SCSI has a problem with multiple completions happening
> > at the same time, but it certainly has a problem with bypassing
> > blk_mq_complete_request from the EH path.
> > 
> > I think we can solve this properly, but I also think we are way to late
> > in the 4.18 cycle to fix it properly.  For now I fear we'll just have
> > to revert the changes and try again for 4.19 or even 4.20 if we don't
> > act quickly enough.
> 
> So here is a quick attempt at the revert while also trying to keep
> nvme working.  Keith, Bart, Jianchao - does this looks reasonable
> as a 4.18 band aid?
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-revert

Hello Christoph,

A patch series that first reverts the following patches:
* blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE
* block: fix timeout changes for legacy request drivers
* blk-mq: don't time out requests again that are in the timeout handler
* blk-mq: simplify blk_mq_rq_timed_out
* block: remove BLK_EH_HANDLED
* block: rename BLK_EH_NOT_HANDLED to BLK_EH_DONE
* blk-mq: Remove generation seqeunce

and next renames BLK_EH_NOT_HANDLED again into BLK_EH_DONE would probably be
a lot easier to review.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-18 20:58                         ` Bart Van Assche
@ 2018-07-18 21:17                           ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-07-18 21:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei

On Wed, Jul 18, 2018 at 08:58:48PM +0000, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 14:53 -0600, Keith Busch wrote:
> > If scsi needs this behavior, why not just put that behavior in scsi? It
> > can set the state to complete and then everything can play out as
> > before.
> > [ ... ]
> 
> There may be other drivers that need the same protection the SCSI core needs
> so I think the patch at the end of your previous e-mail is a step in the wrong
> direction.
> 
> Bart.

And there may be other drivers that don't want their completions
ignored, so breaking them again is also a step in the wrong direction.

There are not that many blk-mq drivers, so we can go through them all.

Most don't even implement .timeout, so they never know that condition
ever happened. Others always return BLK_EH_RESET_TIMER without doing
anything else, so the 'new' behavior would have to be better for those,
too.

The following don't implement .timeout:

  loop, rbd, virtio, xen, dm, ubi, scm

The following always return RESET_TIMER:

  null, skd

The following is safe with the new way:

  mtip

And now the ones I am not sure about:

  nbd, mmc, dasd

I don't know, reverting looks worse than just fixing the drivers.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-18 21:17                           ` Keith Busch
@ 2018-07-18 21:30                             ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-07-18 21:30 UTC (permalink / raw)
  To: keith.busch; +Cc: hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei

On Wed, 2018-07-18 at 15:17 -0600, Keith Busch wrote:
> There are not that many blk-mq drivers, so we can go through them all.

I'm not sure that's the right approach. I think it is the responsibility of
the block layer to handle races between completion handling and timeout
handling and that this is not the responsibility of e.g. a block driver. If
you look at e.g. the legacy block layer then you will see that it takes care
of this race and that legacy block drivers do not have to worry about this
race.

Bart.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-18 21:30                             ` Bart Van Assche
@ 2018-07-18 21:33                               ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-07-18 21:33 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei

On Wed, Jul 18, 2018 at 09:30:11PM +0000, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 15:17 -0600, Keith Busch wrote:
> > There are not that many blk-mq drivers, so we can go through them all.
> 
> I'm not sure that's the right approach. I think it is the responsibility of
> the block layer to handle races between completion handling and timeout
> handling and that this is not the responsibility of e.g. a block driver. If
> you look at e.g. the legacy block layer then you will see that it takes care
> of this race and that legacy block drivers do not have to worry about this
> race.

Reverting doesn't handle the race at all. It just ignores completions
and puts the responsibility on the drivers to handle the race because
that's what scsi wants to happen.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-18 20:39                       ` hch
@ 2018-07-18 22:53                         ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-07-18 22:53 UTC (permalink / raw)
  To: hch
  Cc: Bart Van Assche, keith.busch, linux-block, linux-nvme, axboe,
	ming.lei, jianchao.w.wang

On Wed, Jul 18, 2018 at 10:39:36PM +0200, hch@lst.de wrote:
> On Wed, Jul 18, 2018 at 09:56:50PM +0200, hch@lst.de wrote:
> > On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote:
> > > Of the two you mentioned, yours is preferable IMO. While I appreciate
> > > Jianchao's detailed analysis, it's hard to take a proposal seriously
> > > that so colourfully calls everyone else "dangerous" while advocating
> > > for silently losing requests on purpose.
> > > 
> > > But where's the option that fixes scsi to handle hardware completions
> > > concurrently with arbitrary timeout software? Propping up that house of
> > > cards can't be the only recourse.
> > 
> > The important bit is that we need to fix this issue quickly.  We are
> > past -rc5 so I'm rather concerned about anything too complicated.
> > 
> > I'm not even sure SCSI has a problem with multiple completions happening
> > at the same time, but it certainly has a problem with bypassing
> > blk_mq_complete_request from the EH path.
> > 
> > I think we can solve this properly, but I also think we are way to late
> > in the 4.18 cycle to fix it properly.  For now I fear we'll just have
> > to revert the changes and try again for 4.19 or even 4.20 if we don't
> > act quickly enough.
> 
> So here is a quick attempt at the revert while also trying to keep
> nvme working.  Keith, Bart, Jianchao - does this looks reasonable
> as a 4.18 band aid?
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-revert

Hm, I'm not really a fan. The vast majority of blk-mq drivers don't even
implement a timeout, and reverting it will lose their requests forever
if they complete at the same time as a timeout.

Of the remaining drivers, most of those don't want the reverted behavior
either. It actually looks like scsi is the only mq driver that wants
to block completions. In the short term, scsi can make that happen with
just three lines of code.

---
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..03986af3076c 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -286,6 +286,10 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 	enum blk_eh_timer_return rtn = BLK_EH_DONE;
 	struct Scsi_Host *host = scmd->device->host;

+	if (req->q->mq_ops && cmpxchg(&req->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) !=
+							MQ_RQ_IN_FLIGHT)
+		return rtn;
+
	trace_scsi_dispatch_cmd_timeout(scmd);
	scsi_log_completion(scmd, TIMEOUT_ERROR);


-- 

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-18 21:17                           ` Keith Busch
@ 2018-07-19 13:19                             ` hch
  -1 siblings, 0 replies; 128+ messages in thread
From: hch @ 2018-07-19 13:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bart Van Assche, hch, keith.busch, linux-block, linux-nvme,
	axboe, ming.lei

On Wed, Jul 18, 2018 at 03:17:11PM -0600, Keith Busch wrote:
> And there may be other drivers that don't want their completions
> ignored, so breaking them again is also a step in the wrong direction.
> 
> There are not that many blk-mq drivers, so we can go through them all.

I think the point is that SCSI is the biggest user by both the number
of low-level drivers sitting under the midlayer, and also by usage.

We need to be very careful not to break it.  Note that this doesn't
mean that I don't want to eventually move away from just ignoring
completions in timeout state for SCSI.  I'd just rather revert 4.18
to a clean known state instead of doctoring around late in the rc
phase.

> Most don't even implement .timeout, so they never know that condition
> ever happened. Others always return BLK_EH_RESET_TIMER without doing
> anythign else, so the 'new' behavior would have to be better for those,
> too.

And we should never even hit the timeout handler for those as it
is rather pointless (although it looks like we currently do...).

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-18 20:53                       ` Keith Busch
@ 2018-07-19 13:22                         ` hch
  -1 siblings, 0 replies; 128+ messages in thread
From: hch @ 2018-07-19 13:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: hch, Bart Van Assche, keith.busch, linux-block, linux-nvme,
	axboe, ming.lei, linux-scsi

On Wed, Jul 18, 2018 at 02:53:21PM -0600, Keith Busch wrote:
> If scsi needs this behavior, why not just put that behavior in scsi? It
> can set the state to complete and then everything can play out as
> before.

I think even with this we are missing handling for the somewhat
degenerate blk_abort_request case.

But most importantly we'll need some good test coverage.  Please
do some basic testing (e.g. with a version of the hack from
Jianchao, who seems to keep getting dropped from this thread for
some reason) and send it out to the block and scsi lists.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-19 13:19                             ` hch
@ 2018-07-19 14:59                               ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-07-19 14:59 UTC (permalink / raw)
  To: hch
  Cc: axboe, linux-block, linux-nvme, ming.lei, keith.busch, Bart Van Assche

On Thu, Jul 19, 2018 at 03:19:04PM +0200, hch@lst.de wrote:
> On Wed, Jul 18, 2018 at 03:17:11PM -0600, Keith Busch wrote:
> > And there may be other drivers that don't want their completions
> > ignored, so breaking them again is also a step in the wrong direction.
> > 
> > There are not that many blk-mq drivers, so we can go through them all.
> 
> I think the point is that SCSI is the biggest user by both the number
> of low-level drivers sitting under the midlayer, and also by usage.
> 
> We need to be very careful not to break it.  Note that this doesn't
> mean that I don't want to eventually move away from just ignoring
> completions in timeout state for SCSI.  I'd just rather revert 4.18
> to a clean known state instead of doctoring around late in the rc
> phase.

I definitely do not want to break scsi. I just don't want to break
everyone else either, and I think scsi can get the behavior it wants without
forcing others to subscribe to it.
 
> > Most don't even implement .timeout, so they never know that condition
> > ever happened. Others always return BLK_EH_RESET_TIMER without doing
> > anything else, so the 'new' behavior would have to be better for those,
> > too.
> 
> And we should never even hit the timeout handler for those as it
> is rather pointless (although it looks we currently do..).

I don't see why we'd expect to never hit timeout for at least some of
these. It's not a stretch to see, for example, that virtio-blk or loop
could have their requests lost with no way to recover if we revert. I've
wasted too much time debugging hardware for such lost commands when it
was in fact functioning perfectly fine. So reintroducing that behavior
is a bit distressing.



* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-19 14:59                               ` Keith Busch
@ 2018-07-19 15:56                                 ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-07-19 15:56 UTC (permalink / raw)
  To: hch
  Cc: axboe, keith.busch, linux-nvme, ming.lei, linux-block, Bart Van Assche

On Thu, Jul 19, 2018 at 08:59:31AM -0600, Keith Busch wrote:
> > And we should never even hit the timeout handler for those as it
> > is rather pointless (although it looks we currently do..).
> 
> I don't see why we'd expect to never hit timeout for at least some of
> these. It's not a stretch to see, for example, that virtio-blk or loop
> could have their requests lost with no way to recover if we revert. I've
> wasted too much time debugging hardware for such lost commands when it
> was in fact functioning perfectly fine. So reintroducing that behavior
> is a bit distressing.

Even some scsi drivers are susceptible to losing their requests with the
reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
from its timeout handler. With the behavior everyone seems to want,
a natural completion at or around the same time is lost forever because
it was blocked from completion with no way to recover.

While the window in which requests may be lost is quite narrow, I've
seen it often enough in very difficult to reproduce scenarios that hardware
devs no longer trust that IO timeouts are their problem, because Linux
loses their completions.



* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-19 15:56                                 ` Keith Busch
@ 2018-07-19 16:04                                   ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-07-19 16:04 UTC (permalink / raw)
  To: hch, keith.busch; +Cc: keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote:
> Even some scsi drivers are susceptible to losing their requests with the
> reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
> from its timeout handler. With the behavior everyone seems to want,
> a natural completion at or around the same time is lost forever because
> it was blocked from completion with no way to recover.

The patch I had posted properly handles a completion that occurs while a
timeout is being handled. From https://www.mail-archive.com/linux-block@vger.kernel.org/msg22196.html:

 void blk_mq_complete_request(struct request *rq)
[ ... ]
+               if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT,
+                                          MQ_RQ_COMPLETE)) {
+                       __blk_mq_complete_request(rq);
+                       break;
+               }
+               if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
+                       break;
[ ... ]
@@ -838,25 +838,42 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
[ ... ]
        case BLK_EH_RESET_TIMER:
[ ... ]
+                       if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
+                               __blk_mq_complete_request(req);
+                               break;
+                       }

Bart.



* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-19 16:04                                   ` Bart Van Assche
@ 2018-07-19 16:22                                     ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-07-19 16:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Thu, Jul 19, 2018 at 04:04:46PM +0000, Bart Van Assche wrote:
> On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote:
> > Even some scsi drivers are susceptible to losing their requests with the
> > reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
> > from its timeout handler. With the behavior everyone seems to want,
> > a natural completion at or around the same time is lost forever because
> > it was blocked from completion with no way to recover.
> 
> The patch I had posted handles a completion that occurs while a timeout is
> being handled properly. From https://www.mail-archive.com/linux-block@vger.kernel.org/msg22196.html:

Sounds like a win-win to me.



* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-19 16:22                                     ` Keith Busch
@ 2018-07-19 16:29                                       ` hch
  -1 siblings, 0 replies; 128+ messages in thread
From: hch @ 2018-07-19 16:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bart Van Assche, hch, keith.busch, linux-nvme, linux-block,
	axboe, ming.lei

On Thu, Jul 19, 2018 at 10:22:40AM -0600, Keith Busch wrote:
> On Thu, Jul 19, 2018 at 04:04:46PM +0000, Bart Van Assche wrote:
> > On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote:
> > > Even some scsi drivers are susceptible to losing their requests with the
> > > reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
> > > from its timeout handler. With the behavior everyone seems to want,
> > > a natural completion at or around the same time is lost forever because
> > > it was blocked from completion with no way to recover.
> > 
> > The patch I had posted handles a completion that occurs while a timeout is
> > being handled properly. From https://www.mail-archive.com/linux-block@vger.kernel.org/msg22196.html:
> 
> Sounds like a win-win to me.

How do we get a fix into 4.18 at this part of the cycle?  I think that
is the most important priority right now.



* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-07-19 16:29                                       ` hch
@ 2018-07-19 20:18                                         ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-07-19 20:18 UTC (permalink / raw)
  To: hch
  Cc: Bart Van Assche, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Thu, Jul 19, 2018 at 06:29:24PM +0200, hch@lst.de wrote:
> How do we get a fix into 4.18 at this part of the cycle?  I think that
> is the most important prirority right now.

Even if you were okay at this point to incorporate the concepts from
Bart's patch, it still looks like trouble for scsi (will elaborate
separately).

But reverting breaks other things we finally got working, so I'd
still vote for isolating the old behavior to scsi if that isn't too
unpalatable. I'll send a small patch shortly and see what happens.



end of thread, other threads:[~2018-07-19 21:03 UTC | newest]

Thread overview: 128+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-21 23:11 [RFC PATCH 0/3] blk-mq: Timeout rework Keith Busch
2018-05-21 23:11 ` Keith Busch
2018-05-21 23:11 ` [RFC PATCH 1/3] blk-mq: Reference count request usage Keith Busch
2018-05-21 23:11   ` Keith Busch
2018-05-22  2:27   ` Ming Lei
2018-05-22  2:27     ` Ming Lei
2018-05-22 15:19   ` Christoph Hellwig
2018-05-22 15:19     ` Christoph Hellwig
2018-05-21 23:11 ` [RFC PATCH 2/3] blk-mq: Fix timeout and state order Keith Busch
2018-05-21 23:11   ` Keith Busch
2018-05-22  2:28   ` Ming Lei
2018-05-22  2:28     ` Ming Lei
2018-05-22 15:24   ` Christoph Hellwig
2018-05-22 15:24     ` Christoph Hellwig
2018-05-22 16:27     ` Bart Van Assche
2018-05-22 16:27       ` Bart Van Assche
2018-05-21 23:11 ` [RFC PATCH 3/3] blk-mq: Remove generation seqeunce Keith Busch
2018-05-21 23:11   ` Keith Busch
2018-05-21 23:29   ` Bart Van Assche
2018-05-21 23:29     ` Bart Van Assche
2018-05-22 14:15     ` Keith Busch
2018-05-22 14:15       ` Keith Busch
2018-05-22 16:29       ` Bart Van Assche
2018-05-22 16:29         ` Bart Van Assche
2018-05-22 16:34         ` Keith Busch
2018-05-22 16:34           ` Keith Busch
2018-05-22 16:48           ` Bart Van Assche
2018-05-22 16:48             ` Bart Van Assche
2018-05-22  2:49   ` Ming Lei
2018-05-22  2:49     ` Ming Lei
2018-05-22  3:16     ` Jens Axboe
2018-05-22  3:16       ` Jens Axboe
2018-05-22  3:47       ` Ming Lei
2018-05-22  3:47         ` Ming Lei
2018-05-22  3:51         ` Jens Axboe
2018-05-22  3:51           ` Jens Axboe
2018-05-22  8:51           ` Ming Lei
2018-05-22  8:51             ` Ming Lei
2018-05-22 14:35             ` Jens Axboe
2018-05-22 14:35               ` Jens Axboe
2018-05-22 14:20     ` Keith Busch
2018-05-22 14:20       ` Keith Busch
2018-05-22 14:37       ` Ming Lei
2018-05-22 14:37         ` Ming Lei
2018-05-22 14:46         ` Keith Busch
2018-05-22 14:46           ` Keith Busch
2018-05-22 14:57           ` Ming Lei
2018-05-22 14:57             ` Ming Lei
2018-05-22 15:01             ` Keith Busch
2018-05-22 15:01               ` Keith Busch
2018-05-22 15:07               ` Ming Lei
2018-05-22 15:07                 ` Ming Lei
2018-05-22 15:17                 ` Keith Busch
2018-05-22 15:17                   ` Keith Busch
2018-05-22 15:23                   ` Ming Lei
2018-05-22 15:23                     ` Ming Lei
2018-05-22 16:17   ` Christoph Hellwig
2018-05-22 16:17     ` Christoph Hellwig
2018-05-23  0:34     ` Ming Lei
2018-05-23  0:34       ` Ming Lei
2018-05-23 14:35       ` Keith Busch
2018-05-23 14:35         ` Keith Busch
2018-05-24  1:52         ` Ming Lei
2018-05-24  1:52           ` Ming Lei
2018-05-23  5:48     ` Hannes Reinecke
2018-05-23  5:48       ` Hannes Reinecke
2018-07-12 18:16   ` Bart Van Assche
2018-07-12 18:16     ` Bart Van Assche
2018-07-12 19:24     ` Keith Busch
2018-07-12 19:24       ` Keith Busch
2018-07-12 22:24       ` Bart Van Assche
2018-07-12 22:24         ` Bart Van Assche
2018-07-13  1:12         ` jianchao.wang
2018-07-13  1:12           ` jianchao.wang
2018-07-13  2:40         ` jianchao.wang
2018-07-13  2:40           ` jianchao.wang
2018-07-13 15:43         ` Keith Busch
2018-07-13 15:43           ` Keith Busch
2018-07-13 15:52           ` Bart Van Assche
2018-07-13 15:52             ` Bart Van Assche
2018-07-13 18:47             ` Keith Busch
2018-07-13 18:47               ` Keith Busch
2018-07-13 23:03               ` Bart Van Assche
2018-07-13 23:03                 ` Bart Van Assche
2018-07-13 23:58                 ` Keith Busch
2018-07-13 23:58                   ` Keith Busch
2018-07-18 19:56                   ` hch
2018-07-18 19:56                     ` hch
2018-07-18 20:39                     ` hch
2018-07-18 20:39                       ` hch
2018-07-18 21:05                       ` Bart Van Assche
2018-07-18 21:05                         ` Bart Van Assche
2018-07-18 22:53                       ` Keith Busch
2018-07-18 22:53                         ` Keith Busch
2018-07-18 20:53                     ` Keith Busch
2018-07-18 20:53                       ` Keith Busch
2018-07-18 20:58                       ` Bart Van Assche
2018-07-18 20:58                         ` Bart Van Assche
2018-07-18 21:17                         ` Keith Busch
2018-07-18 21:17                           ` Keith Busch
2018-07-18 21:30                           ` Bart Van Assche
2018-07-18 21:30                             ` Bart Van Assche
2018-07-18 21:33                             ` Keith Busch
2018-07-18 21:33                               ` Keith Busch
2018-07-19 13:19                           ` hch
2018-07-19 13:19                             ` hch
2018-07-19 14:59                             ` Keith Busch
2018-07-19 14:59                               ` Keith Busch
2018-07-19 15:56                               ` Keith Busch
2018-07-19 15:56                                 ` Keith Busch
2018-07-19 16:04                                 ` Bart Van Assche
2018-07-19 16:04                                   ` Bart Van Assche
2018-07-19 16:22                                   ` Keith Busch
2018-07-19 16:22                                     ` Keith Busch
2018-07-19 16:29                                     ` hch
2018-07-19 16:29                                       ` hch
2018-07-19 20:18                                       ` Keith Busch
2018-07-19 20:18                                         ` Keith Busch
2018-07-19 13:22                       ` hch
2018-07-19 13:22                         ` hch
2018-05-21 23:29 ` [RFC PATCH 0/3] blk-mq: Timeout rework Bart Van Assche
2018-05-21 23:29   ` Bart Van Assche
2018-05-22 14:06   ` Keith Busch
2018-05-22 14:06     ` Keith Busch
2018-05-22 16:30     ` Bart Van Assche
2018-05-22 16:30       ` Bart Van Assche
2018-05-22 16:44       ` Keith Busch
2018-05-22 16:44         ` Keith Busch
