* [PATCH v13] blk-mq: Rework blk-mq timeout handling again
@ 2018-05-22 16:25 Bart Van Assche
From: Bart Van Assche @ 2018-05-22 16:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Tejun Heo,
	Keith Busch, Jianchao Wang, Ming Lei, Sebastian Ott,
	Sagi Grimberg, Israel Rukshin, Max Gurtovoy

Recently the blk-mq timeout handling code was reworked. See also Tejun
Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
This patch reworks the blk-mq timeout handling code again. The timeout
handling code is simplified by introducing a state machine per request.
This change prevents the blk-mq timeout handling code from ignoring
completions that occur after blk_mq_check_expired() has been called and
before blk_mq_rq_timed_out() has been called.

Fix this race as follows (see the illustrative sketch after this list):
- Reduce the gstate field from 64 to 32 bits such that cmpxchg() can
  be used to update it. Introduce deadline_seq for updating the deadline
  on 32-bit systems.
- Remove the request member variables that became superfluous due to
  this change, namely gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to
  this patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.
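
(Editorial sketch, not part of the patch: the protocol described above,
modelled in stand-alone C11. A 32-bit word packs the generation and state,
and a compare-and-swap lets exactly one of the completion and timeout paths
claim a request. All names below are illustrative, not the kernel's.
Compile with e.g. gcc -std=c11.)

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors the layout of union blk_generation_and_state: the low 2 bits
 * hold the MQ_RQ_* state, the high 30 bits hold the generation that
 * distinguishes reuse instances of a request. */
enum { RQ_IDLE, RQ_IN_FLIGHT, RQ_COMPLETE, RQ_TIMED_OUT };
#define STATE_MASK 0x3u
#define GEN_INC    0x4u

static _Atomic uint32_t gstate;

/* Move from old_state to new_state; returns 1 iff this caller won. */
static int change_state(uint32_t old_state, uint32_t new_state)
{
	uint32_t old_val = atomic_load(&gstate);

	for (;;) {
		uint32_t new_val;

		if ((old_val & STATE_MASK) != old_state)
			return 0;
		new_val = (old_val & ~STATE_MASK) | new_state;
		if (new_state == RQ_IN_FLIGHT)
			new_val += GEN_INC;	/* new reuse instance */
		/* On failure, old_val is reloaded and the loop retries. */
		if (atomic_compare_exchange_weak(&gstate, &old_val, new_val))
			return 1;
	}
}

int main(void)
{
	atomic_store(&gstate, RQ_IN_FLIGHT);

	/* Completion path, mirroring the loop in blk_mq_complete_request():
	 * either it claims the in-flight request directly, or, having lost
	 * that race to the timeout code, it claims the timed-out request so
	 * the completion is acted upon rather than silently dropped. */
	for (;;) {
		if (change_state(RQ_IN_FLIGHT, RQ_COMPLETE)) {
			puts("completion claimed the request");
			break;
		}
		if (change_state(RQ_TIMED_OUT, RQ_COMPLETE)) {
			puts("completion raced with timeout and still won");
			break;
		}
	}
	return 0;
}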

Notes:
- Atomic instructions are only used to update the request state if
  a concurrent request state change could be in progress.
- blk_add_timer() has been split into two functions - one for the
  legacy block layer and one for blk-mq.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Sebastian Ott <sebott@linux.ibm.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
---

Changes compared to v12:
- Switched from cmpxchg64() to cmpxchg(). This became possible because the
  deadline is now updated before the request state.
- Introduced a new request state to ensure that completions that occur while
  the timeout function is in progress are not lost.
- Left out the ARM cmpxchg64() patch.

Changes compared to v11:
- Reworked patch 1/2: instead of introducing CONFIG_ARCH_HAVE_CMPXCHG64, make
  sure that cmpxchg64() is only defined if it can be used.

Changes compared to v10:
- In patch 1/2, added "default y if 64BIT" to the "config ARCH_HAVE_CMPXCHG64"
  entry in arch/Kconfig. Left out the "select ARCH_HAVE_CMPXCHG64" statements
  that became superfluous due to this change (alpha, arm64, powerpc and s390).
- Also in patch 1/2, only select ARCH_HAVE_CMPXCHG64 if X86_CMPXCHG64 has been
  selected.
- In patch 2/2, moved blk_mq_change_rq_state() from blk-mq.h to blk-mq.c.
- Added a comment header above __blk_mq_requeue_request() and
  blk_mq_requeue_request().
- Documented the MQ_RQ_* state transitions in block/blk-mq.h.
- Left out the fourth argument of blk_mq_rq_set_deadline().

Changes compared to v9:
- Addressed multiple comments related to patch 1/2: added
  CONFIG_ARCH_HAVE_CMPXCHG64 for riscv, modified
  features/locking/cmpxchg64/arch-support.txt as requested and made the
  order of the symbols in the arch/*/Kconfig alphabetical where possible.

Changes compared to v8:
- Split into two patches.
- Moved the spin_lock_init() call from blk_mq_rq_ctx_init() into
  blk_mq_init_request().
- Fixed the deadline set by blk_add_timer().
- Surrounded the das_lock member with #ifndef CONFIG_ARCH_HAVE_CMPXCHG64 /
  #endif.

Changes compared to v7:
- Fixed the generation number mechanism. Note: with this patch applied the
  behavior of the block layer does not depend on the generation number.
- Added more 32-bit architectures to the list of architectures on which
  cmpxchg64() should not be used.

Changes compared to v6:
- Used a union instead of bit manipulations to store multiple values into
  a single 64-bit field.
- Reduced the size of the timeout field from 64 to 32 bits.
- Made sure that the block layer still builds with this patch applied
  for the sh and mips architectures.
- Fixed two sparse warnings that were introduced by this patch in the
  WRITE_ONCE() calls.

Changes compared to v5:
- Restored the synchronize_rcu() call between marking a request for timeout
  handling and the actual timeout handling to avoid that timeout handling
  starts while .queue_rq() is still in progress if the timeout is very short.
- Only use cmpxchg() if another context could attempt to change the request
  state concurrently. Use WRITE_ONCE() otherwise.

Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important are
  that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
  there is now a nice and clean split between the legacy and blk-mq versions
  of blk_add_timer().
- Changed the patch name and modified the patch description because there is
  disagreement about whether or not the v4.16 blk-mq core can complete a
  single request twice. Kept the "Cc: stable" tag because of
  https://bugzilla.kernel.org/show_bug.cgi?id=199077.

Changes compared to v3 (see also https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
  v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
  for blk-mq.

Changes compared to v2 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
  the __deadline member to encode both the generation and state information.


 block/blk-core.c       |   9 +-
 block/blk-mq-debugfs.c |   4 +-
 block/blk-mq.c         | 250 ++++++++++++++++++++++++++-----------------------
 block/blk-mq.h         |  54 +++++------
 block/blk-timeout.c    | 107 +++++++++++++--------
 block/blk.h            |   1 +
 include/linux/blkdev.h |  44 ++++-----
 7 files changed, 250 insertions(+), 219 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 341501c5e239..42b055292cdc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,9 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->internal_tag = -1;
 	rq->start_time_ns = ktime_get_ns();
 	rq->part = NULL;
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * See comment of blk_mq_init_request
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
+#ifndef CONFIG_64BIT
+	seqcount_init(&rq->deadline_seq);
+#endif
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3080e18cb859..9c539ab2c0dc 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -344,15 +344,15 @@ static const char *const rqf_name[] = {
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
 	RQF_NAME(ZONE_WRITE_LOCKED),
-	RQF_NAME(MQ_TIMEOUT_EXPIRED),
 	RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
 
 static const char *const blk_mq_rq_state_name_array[] = {
 	[MQ_RQ_IDLE]		= "idle",
-	[MQ_RQ_IN_FLIGHT]	= "in_flight",
+	[MQ_RQ_IN_FLIGHT]	= "in flight",
 	[MQ_RQ_COMPLETE]	= "complete",
+	[MQ_RQ_TIMED_OUT]	= "timed out",
 };
 
 static const char *blk_mq_rq_state_name(enum mq_rq_state rq_state)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 64630caaf27e..6bfc7679a5df 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -319,6 +319,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	/* tag was already set */
 	rq->extra_len = 0;
 	rq->__deadline = 0;
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
 	INIT_LIST_HEAD(&rq->timeout_list);
 	rq->timeout = 0;
@@ -465,6 +466,39 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
+/**
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old_state: Old request state.
+ * @new_state: New request state.
+ *
+ * Returns %true if and only if the old state was @old_state and if the
+ * state has been changed into @new_state.
+ */
+static bool blk_mq_change_rq_state(struct request *rq,
+				   enum mq_rq_state old_state,
+				   enum mq_rq_state new_state)
+{
+	union blk_generation_and_state gstate = READ_ONCE(rq->gstate);
+	union blk_generation_and_state old_val = gstate;
+	union blk_generation_and_state new_val = gstate;
+
+	old_val.state = old_state;
+	new_val.state = new_state;
+	if (new_state == MQ_RQ_IN_FLIGHT)
+		new_val.generation++;
+	/*
+	 * For transitions from state in-flight to another state cmpxchg()
+	 * must be used. For other state transitions it is safe to use
+	 * WRITE_ONCE().
+	 */
+	if (old_state != MQ_RQ_IN_FLIGHT) {
+		WRITE_ONCE(rq->gstate.val, new_val.val);
+		return true;
+	}
+	return blk_mq_set_rq_state(rq, old_val, new_val);
+}
+
 void blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
@@ -494,7 +528,8 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
-	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+		WARN_ON_ONCE(true);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
@@ -547,8 +582,7 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-	blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
@@ -593,36 +627,6 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
-	unsigned long flags;
-
-	/*
-	 * blk_mq_rq_aborted_gstate() is used from the completion path and
-	 * can thus be called from irq context.  u64_stats_fetch in the
-	 * middle of update on the same CPU leads to lockup.  Disable irq
-	 * while updating.
-	 */
-	local_irq_save(flags);
-	u64_stats_update_begin(&rq->aborted_gstate_sync);
-	rq->aborted_gstate = gstate;
-	u64_stats_update_end(&rq->aborted_gstate_sync);
-	local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-	unsigned int start;
-	u64 aborted_gstate;
-
-	do {
-		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-		aborted_gstate = rq->aborted_gstate;
-	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
-
-	return aborted_gstate;
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -634,27 +638,20 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
 
-	/*
-	 * If @rq->aborted_gstate equals the current instance, timeout is
-	 * claiming @rq and we lost.  This is synchronized through
-	 * hctx_lock().  See blk_mq_timeout_work() for details.
-	 *
-	 * Completion path never blocks and we can directly use RCU here
-	 * instead of hctx_lock() which can be either RCU or SRCU.
-	 * However, that would complicate paths which want to synchronize
-	 * against us.  Let stay in sync with the issue path so that
-	 * hctx_lock() covers both issue and completion paths.
-	 */
-	hctx_lock(hctx, &srcu_idx);
-	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
-		__blk_mq_complete_request(rq);
-	hctx_unlock(hctx, srcu_idx);
+	/* The loop is for the unlikely case of a race with the timeout code. */
+	while (true) {
+		if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT,
+					   MQ_RQ_COMPLETE)) {
+			__blk_mq_complete_request(rq);
+			break;
+		}
+		if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
+			break;
+	}
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -681,27 +678,8 @@ void blk_mq_start_request(struct request *rq)
 		wbt_issue(q->rq_wb, rq);
 	}
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
-
-	/*
-	 * Mark @rq in-flight which also advances the generation number,
-	 * and register for timeout.  Protect with a seqcount to allow the
-	 * timeout path to read both @rq->gstate and @rq->deadline
-	 * coherently.
-	 *
-	 * This is the only place where a request is marked in-flight.  If
-	 * the timeout path reads an in-flight @rq->gstate, the
-	 * @rq->deadline it reads together under @rq->gstate_seq is
-	 * guaranteed to be the matching one.
-	 */
-	preempt_disable();
-	write_seqcount_begin(&rq->gstate_seq);
-
-	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
-	blk_add_timer(rq);
-
-	write_seqcount_end(&rq->gstate_seq);
-	preempt_enable();
+	blk_mq_add_timer(rq);
+	blk_mq_change_rq_state(rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -714,27 +692,46 @@ void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
-/*
- * When we reach here because queue is busy, it's safe to change the state
- * to IDLE without checking @rq->aborted_gstate because we should still be
- * holding the RCU read lock and thus protected against timeout.
+/**
+ * __blk_mq_requeue_request - requeue a request
+ * @rq: request to be requeued
+ *
+ * This function is either called by blk_mq_requeue_request() or by the block
+ * layer core if .queue_rq() returned BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.
+ * If the request state is MQ_RQ_IN_FLIGHT and if this function is called from
+ * inside .queue_rq() then it is guaranteed that the timeout code won't try to
+ * modify the request state while this function is in progress because an RCU
+ * read lock is held around .queue_rq() and because the timeout code calls
+ * synchronize_rcu() after having marked requests and before processing
+ * time-outs.
  */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	enum mq_rq_state old_state = blk_mq_rq_state(rq);
 
 	blk_mq_put_driver_tag(rq);
 
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, rq);
 
-	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
-		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	if (old_state != MQ_RQ_IDLE) {
+		if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+			WARN_ON_ONCE(true);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
 }
 
+/**
+ * blk_mq_requeue_request - requeue a request
+ * @rq: request to be requeued
+ * @kick_requeue_list: whether or not to kick the requeue_list
+ *
+ * This function is called after a request has completed, after a request has
+ * timed out or from inside .queue_rq(). In the latter case the request may
+ * already have been started.
+ */
 void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
 {
 	__blk_mq_requeue_request(rq);
@@ -838,8 +835,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
-
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
@@ -848,13 +843,22 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
+		blk_mq_add_timer(req);
 		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
+		 * The loop is for the unlikely case of a race with the
+		 * completion code. There is no need to reset the deadline
+		 * if the transition to the in-flight state fails because
+		 * the deadline only matters in the in-flight state.
 		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
-		blk_add_timer(req);
+		while (true) {
+			if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
+						   MQ_RQ_IN_FLIGHT))
+				break;
+			if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
+				__blk_mq_complete_request(req);
+				break;
+			}
+		}
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -868,48 +872,60 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	unsigned long gstate, deadline;
-	int start;
+	union blk_generation_and_state gstate = READ_ONCE(rq->gstate);
+	unsigned long deadline;
 
 	might_sleep();
 
-	if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
-		return;
-
-	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
+#ifdef CONFIG_64BIT
+	deadline = rq->__deadline;
+#else
 	while (true) {
-		start = read_seqcount_begin(&rq->gstate_seq);
-		gstate = READ_ONCE(rq->gstate);
-		deadline = blk_rq_deadline(rq);
-		if (!read_seqcount_retry(&rq->gstate_seq, start))
+		int start;
+
+		start = read_seqcount_begin(&rq->deadline_seq);
+		deadline = rq->__deadline;
+		if (!read_seqcount_retry(&rq->deadline_seq, start))
 			break;
 		cond_resched();
 	}
+#endif
 
-	/* if in-flight && overdue, mark for abortion */
-	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
+	/*
+	 * Make sure that rq->aborted_gstate != rq->gstate if rq->deadline has
+	 * not yet been reached even if a request gets recycled before
+	 * blk_mq_terminate_expired() is called and the value of rq->deadline
+	 * is not modified due to the request recycling.
+	 */
+	rq->aborted_gstate = gstate;
+	rq->aborted_gstate.generation ^= (1UL << 29);
+	if (gstate.state == MQ_RQ_IN_FLIGHT &&
 	    time_after_eq(jiffies, deadline)) {
-		blk_mq_rq_update_aborted_gstate(rq, gstate);
+		/* request timed out */
+		rq->aborted_gstate = gstate;
 		data->nr_expired++;
 		hctx->nr_expired++;
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
+
 }
 
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
+	union blk_generation_and_state old_val = rq->aborted_gstate;
+	union blk_generation_and_state new_val = old_val;
+
+	new_val.state = MQ_RQ_TIMED_OUT;
+
 	/*
-	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
-	 * completions that we lost to, they would have finished and
-	 * updated @rq->gstate by now; otherwise, the completion path is
-	 * now guaranteed to see @rq->aborted_gstate and yield.  If
-	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
+	 * We marked @rq->aborted_gstate and waited for ongoing .queue_rq()
+	 * calls. If rq->gstate has not changed that means that it
+	 * is now safe to change the request state and to handle the timeout.
 	 */
-	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
-	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
+	if (blk_mq_set_rq_state(rq, old_val, new_val))
 		blk_mq_rq_timed_out(rq, reserved);
 }
 
@@ -948,10 +964,12 @@ static void blk_mq_timeout_work(struct work_struct *work)
 		bool has_rcu = false;
 
 		/*
-		 * Wait till everyone sees ->aborted_gstate.  The
-		 * sequential waits for SRCUs aren't ideal.  If this ever
-		 * becomes a problem, we can add per-hw_ctx rcu_head and
-		 * wait in parallel.
+		 * For very short timeouts it can happen that
+		 * blk_mq_check_expired() modifies the state of a request
+		 * while .queue_rq() is still in progress. Hence wait until
+		 * these .queue_rq() calls have finished. This is also
+		 * necessary to avoid races with blk_mq_requeue_request() for
+		 * requests that have already been started.
 		 */
 		queue_for_each_hw_ctx(q, hctx, i) {
 			if (!hctx->nr_expired)
@@ -967,7 +985,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
 		if (has_rcu)
 			synchronize_rcu();
 
-		/* terminate the ones we won */
+		/* Terminate the requests marked by blk_mq_check_expired(). */
 		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
 	}
 
@@ -2063,14 +2081,9 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			return ret;
 	}
 
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * start gstate with gen 1 instead of 0, otherwise it will be equal
-	 * to aborted_gstate, and be identified timed out by
-	 * blk_mq_terminate_expired.
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
+#ifndef CONFIG_64BIT
+	seqcount_init(&rq->deadline_seq);
+#endif
 
 	return 0;
 }
@@ -3105,7 +3118,8 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
 
 	hrtimer_init_sleeper(&hs, current);
 	do {
-		if (blk_mq_rq_state(rq) == MQ_RQ_COMPLETE)
+		if (blk_mq_rq_state(rq) == MQ_RQ_COMPLETE ||
+		    blk_mq_rq_state(rq) == MQ_RQ_TIMED_OUT)
 			break;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		hrtimer_start_expires(&hs.timer, mode);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e1bb420dc5d6..7b810c934732 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -2,6 +2,7 @@
 #ifndef INT_BLK_MQ_H
 #define INT_BLK_MQ_H
 
+#include <asm/cmpxchg.h>
 #include "blk-stat.h"
 #include "blk-mq-tag.h"
 
@@ -30,18 +31,25 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-/*
- * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
- * and the upper bits the generation number.
+/**
+ * enum mq_rq_state - blk-mq request state
+ *
+ * The legal state transitions are:
+ * - idle      -> in-flight: blk_mq_start_request()
+ * - in-flight -> complete:  blk_mq_complete_request()
+ * - timed out -> complete:  blk_mq_complete_request()
+ * - in-flight -> timed out: request times out
+ * - complete/tmo -> idle:   blk_mq_requeue_request() or blk_mq_free_request()
+ * - in-flight -> idle:      blk_mq_requeue_request() or blk_mq_free_request()
+ * - timed out -> in-flight: request restart due to BLK_EH_RESET_TIMER
+ *
+ * See also blk_generation_and_state.state.
  */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
 	MQ_RQ_COMPLETE		= 2,
-
-	MQ_RQ_STATE_BITS	= 2,
-	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
-	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
+	MQ_RQ_TIMED_OUT		= 3,
 };
 
 void blk_mq_freeze_queue(struct request_queue *q);
@@ -103,37 +111,21 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_release(struct request_queue *q);
 
-/**
- * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
- * @rq: target request.
- */
-static inline int blk_mq_rq_state(struct request *rq)
+static inline bool blk_mq_set_rq_state(struct request *rq,
+				       union blk_generation_and_state old_val,
+				       union blk_generation_and_state new_val)
 {
-	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+	return cmpxchg(&rq->gstate.val, old_val.val, new_val.val) ==
+		old_val.val;
 }
 
 /**
- * blk_mq_rq_update_state() - set the current MQ_RQ_* state of a request
+ * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
  * @rq: target request.
- * @state: new state to set.
- *
- * Set @rq's state to @state.  The caller is responsible for ensuring that
- * there are no other updaters.  A request can transition into IN_FLIGHT
- * only from IDLE and doing so increments the generation number.
  */
-static inline void blk_mq_rq_update_state(struct request *rq,
-					  enum mq_rq_state state)
+static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
 {
-	u64 old_val = READ_ONCE(rq->gstate);
-	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-	if (state == MQ_RQ_IN_FLIGHT) {
-		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-		new_val += MQ_RQ_GEN_INC;
-	}
-
-	/* avoid exposing interim values */
-	WRITE_ONCE(rq->gstate, new_val);
+	return READ_ONCE(rq->gstate).state;
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..3abbaa332a91 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -145,6 +145,22 @@ void blk_timeout_work(struct work_struct *work)
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
+/* Update deadline to @time. May be called from interrupt context. */
+static void blk_mq_rq_set_deadline(struct request *rq, unsigned long new_time)
+{
+#ifdef CONFIG_64BIT
+	rq->__deadline = new_time;
+#else
+	unsigned long flags;
+
+	local_irq_save(flags);
+	write_seqcount_begin(&rq->deadline_seq);
+	rq->__deadline = new_time;
+	write_seqcount_end(&rq->deadline_seq);
+	local_irq_restore(flags);
+#endif
+}
+
 /**
  * blk_abort_request -- Request request recovery for the specified command
  * @req:	pointer to the request of interest
@@ -158,11 +174,10 @@ void blk_abort_request(struct request *req)
 {
 	if (req->q->mq_ops) {
 		/*
-		 * All we need to ensure is that timeout scan takes place
-		 * immediately and that scan sees the new timeout value.
-		 * No need for fancy synchronizations.
+		 * Ensure that a timeout scan takes place immediately and that
+		 * that scan sees the new timeout value.
 		 */
-		blk_rq_set_deadline(req, jiffies);
+		blk_mq_rq_set_deadline(req, jiffies);
 		kblockd_schedule_work(&req->q->timeout_work);
 	} else {
 		if (blk_mark_rq_complete(req))
@@ -184,52 +199,17 @@ unsigned long blk_rq_timeout(unsigned long timeout)
 	return timeout;
 }
 
-/**
- * blk_add_timer - Start timeout timer for a single request
- * @req:	request that is about to start running.
- *
- * Notes:
- *    Each request has its own timer, and as it is added to the queue, we
- *    set up the timer. When the request completes, we cancel the timer.
- */
-void blk_add_timer(struct request *req)
+static void __blk_add_timer(struct request *req, unsigned long deadline)
 {
 	struct request_queue *q = req->q;
 	unsigned long expiry;
 
-	if (!q->mq_ops)
-		lockdep_assert_held(q->queue_lock);
-
-	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
-	if (!q->mq_ops && !q->rq_timed_out_fn)
-		return;
-
-	BUG_ON(!list_empty(&req->timeout_list));
-
-	/*
-	 * Some LLDs, like scsi, peek at the timeout to prevent a
-	 * command from being retried forever.
-	 */
-	if (!req->timeout)
-		req->timeout = q->rq_timeout;
-
-	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
-
-	/*
-	 * Only the non-mq case needs to add the request to a protected list.
-	 * For the mq case we simply scan the tag map.
-	 */
-	if (!q->mq_ops)
-		list_add_tail(&req->timeout_list, &req->q->timeout_list);
-
 	/*
 	 * If the timer isn't already pending or this timeout is earlier
 	 * than an existing one, modify the timer. Round up to next nearest
 	 * second.
 	 */
-	expiry = blk_rq_timeout(round_jiffies_up(blk_rq_deadline(req)));
-
+	expiry = blk_rq_timeout(round_jiffies_up(deadline));
 	if (!timer_pending(&q->timeout) ||
 	    time_before(expiry, q->timeout.expires)) {
 		unsigned long diff = q->timeout.expires - expiry;
@@ -244,5 +224,50 @@ void blk_add_timer(struct request *req)
 		if (!timer_pending(&q->timeout) || (diff >= HZ / 2))
 			mod_timer(&q->timeout, expiry);
 	}
+}
 
+/**
+ * blk_add_timer - Start timeout timer for a single request
+ * @req:	request that is about to start running.
+ *
+ * Notes:
+ *    Each request has its own timer, and as it is added to the queue, we
+ *    set up the timer. When the request completes, we cancel the timer.
+ */
+void blk_add_timer(struct request *req)
+{
+	struct request_queue *q = req->q;
+	unsigned long deadline;
+
+	lockdep_assert_held(q->queue_lock);
+
+	if (!q->rq_timed_out_fn)
+		return;
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
+
+	deadline = jiffies + req->timeout;
+	blk_rq_set_deadline(req, deadline);
+	list_add_tail(&req->timeout_list, &req->q->timeout_list);
+
+	return __blk_add_timer(req, deadline);
+}
+
+/**
+ * blk_mq_add_timer - set the deadline for a single request
+ * @req:	request for which to set the deadline.
+ *
+ * Sets the deadline of a request. The caller must guarantee that the request
+ * state won't be modified while this function is in progress.
+ */
+void blk_mq_add_timer(struct request *req)
+{
+	struct request_queue *q = req->q;
+	unsigned long deadline;
+
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
+	deadline = jiffies + req->timeout;
+	blk_mq_rq_set_deadline(req, deadline);
+	return __blk_add_timer(req, deadline);
 }
diff --git a/block/blk.h b/block/blk.h
index eaf1a8e87d11..ffd44cbf3095 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -170,6 +170,7 @@ static inline bool bio_integrity_endio(struct bio *bio)
 void blk_timeout_work(struct work_struct *work);
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
+void blk_mq_add_timer(struct request *req);
 void blk_delete_timer(struct request *);
 
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f3999719f828..acc08806a6e5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -7,6 +7,7 @@
 
 #ifdef CONFIG_BLOCK
 
+#include <linux/atomic.h>	/* cmpxchg() */
 #include <linux/major.h>
 #include <linux/genhd.h>
 #include <linux/list.h>
@@ -28,7 +29,6 @@
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
 #include <linux/seqlock.h>
-#include <linux/u64_stats_sync.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -125,15 +125,21 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
-/* timeout is expired */
-#define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
 /* already slept for hybrid poll */
-#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
+#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 20))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
 	(RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
 
+union blk_generation_and_state {
+	struct {
+		uint32_t generation:30;
+		uint32_t state:2;
+	};
+	uint32_t val;
+};
+
 /*
  * Try to put the fields that are referenced together in the same cacheline.
  *
@@ -236,28 +242,24 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
-	/*
-	 * On blk-mq, the lower bits of ->gstate (generation number and
-	 * state) carry the MQ_RQ_* state value and the upper bits the
-	 * generation number which is monotonically incremented and used to
-	 * distinguish the reuse instances.
-	 *
-	 * ->gstate_seq allows updates to ->gstate and other fields
-	 * (currently ->deadline) during request start to be read
-	 * atomically from the timeout path, so that it can operate on a
-	 * coherent set of information.
-	 */
-	seqcount_t gstate_seq;
-	u64 gstate;
-
 	/*
 	 * ->aborted_gstate is used by the timeout to claim a specific
 	 * recycle instance of this request.  See blk_mq_timeout_work().
 	 */
-	struct u64_stats_sync aborted_gstate_sync;
-	u64 aborted_gstate;
+	union blk_generation_and_state aborted_gstate;
 
-	/* access through blk_rq_set_deadline, blk_rq_deadline */
+	/*
+	 * Access through blk_rq_deadline() and blk_rq_set_deadline(),
+	 * blk_mark_rq_complete(), blk_clear_rq_complete() and
+	 * blk_rq_is_complete() for legacy queues or blk_mq_rq_state(),
+	 * blk_mq_change_rq_state() and blk_mq_rq_set_deadline() for
+	 * blk-mq queues. deadline_seq protects __deadline in blk-mq mode
+	 * only.
+	 */
+	union blk_generation_and_state gstate;
+#ifndef CONFIG_64BIT
+	seqcount_t deadline_seq;
+#endif
 	unsigned long __deadline;
 
 	struct list_head timeout_list;
-- 
2.16.3


* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
From: Jens Axboe @ 2018-05-22 16:44 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, Christoph Hellwig, Tejun Heo, Keith Busch,
	Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy


On 5/22/18 10:25 AM, Bart Van Assche wrote:
> Recently the blk-mq timeout handling code was reworked. See also Tejun
> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
> This patch reworks the blk-mq timeout handling code again. The timeout
> handling code is simplified by introducing a state machine per request.
> This change prevents the blk-mq timeout handling code from ignoring
> completions that occur after blk_mq_check_expired() has been called and
> before blk_mq_rq_timed_out() has been called.

I'll take a look at this again, getting rid of cmpxchg64 makes me
much more comfortable.

-- 
Jens Axboe


* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
From: Jens Axboe @ 2018-05-22 17:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, Christoph Hellwig, Tejun Heo, Keith Busch,
	Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy

On 5/22/18 10:44 AM, Jens Axboe wrote:
> 
> On 5/22/18 10:25 AM, Bart Van Assche wrote:
>> Recently the blk-mq timeout handling code was reworked. See also Tejun
>> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
>> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
>> This patch reworks the blk-mq timeout handling code again. The timeout
>> handling code is simplified by introducing a state machine per request.
>> This change prevents the blk-mq timeout handling code from ignoring
>> completions that occur after blk_mq_check_expired() has been called and
>> before blk_mq_rq_timed_out() has been called.
> 
> I'll take a look at this again, getting rid of cmpxchg64 makes me
> much more comfortable.

FWIW, a quick pass on runtime testing works fine. As expected, it's
more efficient than what's currently in the kernel, testing with both
null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
is that we shrink the request size from 360 bytes to 312, and I did
a small followup patch that brings that to 304. That's a 15% reduction,
massive.

-- 
Jens Axboe


* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
From: Jens Axboe @ 2018-05-22 18:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, Christoph Hellwig, Tejun Heo, Keith Busch,
	Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy

On 5/22/18 11:17 AM, Jens Axboe wrote:
> On 5/22/18 10:44 AM, Jens Axboe wrote:
>>
>> On 5/22/18 10:25 AM, Bart Van Assche wrote:
>>> Recently the blk-mq timeout handling code was reworked. See also Tejun
>>> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
>>> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
>>> This patch reworks the blk-mq timeout handling code again. The timeout
>>> handling code is simplified by introducing a state machine per request.
>>> This change avoids that the blk-mq timeout handling code ignores
>>> completions that occur after blk_mq_check_expired() has been called and
>>> before blk_mq_rq_timed_out() has been called.
>>
>> I'll take a look at this again, getting rid of cmpxchg64 makes me
>> much more comfortable.
> 
> FWIW, a quick pass on runtime testing works fine. As expected, it's
> more efficient than what's currently in the kernel, testing with both
> null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
> is that we shrink the request size from 360 bytes to 312, and I did
> a small followup patch that brings that to 304. That's a 15% reduction,
> massive.

Ran into this, running block/014 from blktests:

[ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
[ 5750.723000] null: rq 00000000ff68f103 timed out
[ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 __blk_mq_complete_request+0xa6/0x0
[ 5750.738187] Modules linked in: null_blk(+) configfs nvme nvme_core sb_edac x86_pkg_temp_therma]
[ 5750.765509] CPU: 45 PID: 2480 Comm: kworker/45:1H Not tainted 4.17.0-rc6+ #712
[ 5750.774087] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
[ 5750.783369] Workqueue: kblockd blk_mq_timeout_work
[ 5750.789223] RIP: 0010:__blk_mq_complete_request+0xa6/0x110
[ 5750.795850] RSP: 0018:ffff883ffb417d68 EFLAGS: 00010202
[ 5750.802187] RAX: 0000000000000003 RBX: ffff881ff100d800 RCX: 0000000000000000
[ 5750.810649] RDX: ffff88407fd9e040 RSI: ffff88407fd956b8 RDI: ffff881ff100d800
[ 5750.819119] RBP: ffffe8ffffd91800 R08: 0000000000000000 R09: ffffffff82066eb8
[ 5750.827588] R10: ffff883ffa386138 R11: ffff883ffa385900 R12: 0000000000000001
[ 5750.836050] R13: ffff881fe7da6000 R14: 0000000000000020 R15: 0000000000000002
[ 5750.844529] FS:  0000000000000000(0000) GS:ffff88407fd80000(0000) knlGS:0000000000000000
[ 5750.854482] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5750.861397] CR2: 00007ffc92f97f68 CR3: 000000000201d005 CR4: 00000000003606e0
[ 5750.869861] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5750.878333] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5750.886805] Call Trace:
[ 5750.890033]  bt_iter+0x42/0x50
[ 5750.894000]  blk_mq_queue_tag_busy_iter+0x12b/0x220
[ 5750.899941]  ? blk_mq_tag_to_rq+0x20/0x20
[ 5750.904913]  ? __rcu_read_unlock+0x50/0x50
[ 5750.909978]  ? blk_mq_tag_to_rq+0x20/0x20
[ 5750.914948]  blk_mq_timeout_work+0x14b/0x240
[ 5750.920220]  process_one_work+0x21b/0x510
[ 5750.925197]  worker_thread+0x3a/0x390
[ 5750.929781]  ? process_one_work+0x510/0x510
[ 5750.934944]  kthread+0x11c/0x140
[ 5750.939028]  ? kthread_create_worker_on_cpu+0x50/0x50
[ 5750.945169]  ret_from_fork+0x1f/0x30
[ 5750.949656] Code: 48 02 00 00 80 e6 80 74 29 8b 95 80 00 00 00 44 39 e2 75 3b 48 89 df ff 90 2 
[ 5750.972139] ---[ end trace 40065cb1764bf500 ]---

which is this:

        WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);                    

hitting when blk_mq_terminate_expired() completes the request through BLK_EH_HANDLED.

-- 
Jens Axboe


* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
From: Jens Axboe @ 2018-05-22 19:03 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, Christoph Hellwig, Tejun Heo, Keith Busch,
	Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy

On 5/22/18 12:47 PM, Jens Axboe wrote:
> On 5/22/18 11:17 AM, Jens Axboe wrote:
>> On 5/22/18 10:44 AM, Jens Axboe wrote:
>>>
>>> On 5/22/18 10:25 AM, Bart Van Assche wrote:
>>>> Recently the blk-mq timeout handling code was reworked. See also Tejun
>>>> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
>>>> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
>>>> This patch reworks the blk-mq timeout handling code again. The timeout
>>>> handling code is simplified by introducing a state machine per request.
>>> This change prevents the blk-mq timeout handling code from ignoring
>>>> completions that occur after blk_mq_check_expired() has been called and
>>>> before blk_mq_rq_timed_out() has been called.
>>>
>>> I'll take a look at this again, getting rid of cmpxchg64 makes me
>>> much more comfortable.
>>
>> FWIW, a quick pass on runtime testing works fine. As expected, it's
>> more efficient than what's currently in the kernel, testing with both
>> null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
>> is that we shrink the request size from 360 bytes to 312, and I did
>> a small followup patch that brings that to 304. That's a 15% reduction,
>> massive.
> 
> Ran into this, running block/014 from blktests:
> 
> [ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
> [ 5750.723000] null: rq 00000000ff68f103 timed out
> [ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 __blk_mq_complete_request+0xa6/0x0
> [...]
> 
> which is this:
> 
>         WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);

That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
state transition. So that check should be:

        WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
                     blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);

-- 
Jens Axboe


* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
From: Jens Axboe @ 2018-05-22 19:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, Christoph Hellwig, Tejun Heo, Keith Busch,
	Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy

On 5/22/18 1:03 PM, Jens Axboe wrote:
> On 5/22/18 12:47 PM, Jens Axboe wrote:
>> On 5/22/18 11:17 AM, Jens Axboe wrote:
>>> On 5/22/18 10:44 AM, Jens Axboe wrote:
>>>>
>>>> On 5/22/18 10:25 AM, Bart Van Assche wrote:
>>>>> Recently the blk-mq timeout handling code was reworked. See also Tejun
>>>>> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
>>>>> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
>>>>> This patch reworks the blk-mq timeout handling code again. The timeout
>>>>> handling code is simplified by introducing a state machine per request.
>>>> This change prevents the blk-mq timeout handling code from ignoring
>>>>> completions that occur after blk_mq_check_expired() has been called and
>>>>> before blk_mq_rq_timed_out() has been called.
>>>>
>>>> I'll take a look at this again, getting rid of cmpxchg64 makes me
>>>> much more comfortable.
>>>
>>> FWIW, a quick pass on runtime testing works fine. As expected, it's
>>> more efficient than what's currently in the kernel, testing with both
>>> null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
>>> is that we shrink the request size from 360 bytes to 312, and I did
>>> a small followup patch that brings that to 304. That's a 15% reduction,
>>> massive.
>>
>> Ran into this, running block/014 from blktests:
>>
>> [ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
>> [ 5750.723000] null: rq 00000000ff68f103 timed out
>> [ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 __blk_mq_complete_request+0xa6/0x0
>> [...]
>>
>> which is this:
>>
>>         WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
> 
> That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
> state transition. So that check should be:
> 
>         WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
>                      blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);

I guess it would be cleaner to actually do the transition, in
blk_mq_rq_timed_out():

        case BLK_EH_HANDLED:                                                    
                if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,                
                                                MQ_RQ_COMPLETE))                
                        __blk_mq_complete_request(req);                         
                break;        

This works for me.

-- 
Jens Axboe


* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
  2018-05-22 19:38         ` Jens Axboe
@ 2018-05-22 20:26           ` Keith Busch
  2018-05-22 20:29             ` Jens Axboe
  2018-05-22 20:33           ` Bart Van Assche
  1 sibling, 1 reply; 22+ messages in thread
From: Keith Busch @ 2018-05-22 20:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bart Van Assche, linux-block, Christoph Hellwig, Tejun Heo,
	Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy

On Tue, May 22, 2018 at 01:38:06PM -0600, Jens Axboe wrote:
> I guess it would be cleaner to actually do the transition, in
> blk_mq_rq_timed_out():
> 
>         case BLK_EH_HANDLED:                                                    
>                 if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,                
>                                                 MQ_RQ_COMPLETE))                
>                         __blk_mq_complete_request(req);                         
>                 break;        
> 
> This works for me.

Works for me as well on manual fault injection tests.

I think this change above goes back to Christoph's point earlier on usage
of BLK_EH_HANDLED. Is the driver supposed to return BLK_EH_NOT_HANDLED
when the driver actually knows the request has been completed before
returning the status?


* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
From: Jens Axboe @ 2018-05-22 20:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bart Van Assche, linux-block, Christoph Hellwig, Tejun Heo,
	Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy

On 5/22/18 2:26 PM, Keith Busch wrote:
> On Tue, May 22, 2018 at 01:38:06PM -0600, Jens Axboe wrote:
>> I guess it would be cleaner to actually do the transition, in
>> blk_mq_rq_timed_out():
>>
>>         case BLK_EH_HANDLED:                                                    
>>                 if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,                
>>                                                 MQ_RQ_COMPLETE))                
>>                         __blk_mq_complete_request(req);                         
>>                 break;        
>>
>> This works for me.
> 
> Works for me as well on manual fault injection tests.
> 
> I think this change above goes back to Christoph's point earlier on usage
> of BLK_EH_HANDLED. Is the driver supposed to return BLK_EH_NOT_HANDLED
> when the driver actually knows the request has been completed before
> returning the status?

If the driver knows it's completed, it'd have to return BLK_EH_NOT_HANDLED.
Or BLK_EH_HANDLED would work too, since the above state transition would
then fail.
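
(Editorial sketch, not from the thread: what those two return values mean
for a driver's ->timeout() handler under the state machine discussed above.
The mydrv_* helpers are hypothetical; only the blk_eh_timer_return values
and the handler signature are real. Assumes <linux/blk-mq.h>.)

static enum blk_eh_timer_return mydrv_timeout(struct request *rq,
					      bool reserved)
{
	/* The driver already saw the command finish: step aside and let
	 * the racing completion path win the state transition. */
	if (mydrv_cmd_already_done(rq))		/* hypothetical helper */
		return BLK_EH_NOT_HANDLED;

	mydrv_abort_cmd(rq);			/* hypothetical helper */

	/* Ask the core to complete the request; with the transition-guarded
	 * handling above, this is a no-op if a completion got there first. */
	return BLK_EH_HANDLED;
}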

-- 
Jens Axboe


* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
From: Bart Van Assche @ 2018-05-22 20:33 UTC (permalink / raw)
  To: axboe
  Cc: linux-block, israelr, sagi, hch, sebott, ming.lei,
	jianchao.w.wang, maxg, tj, keith.busch

On Tue, 2018-05-22 at 13:38 -0600, Jens Axboe wrote:
> On 5/22/18 1:03 PM, Jens Axboe wrote:
> > On 5/22/18 12:47 PM, Jens Axboe wrote:
> > > Ran into this, running block/014 from blktests:
> > > 
> > > [...]
> > > 
> > > which is this:
> > > 
> > >         WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
> > 
> > That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
> > state transition. So that check should be:
> > 
> >         WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
> >                      blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);
> 
> I guess it would be cleaner to actually do the transition, in
> blk_mq_rq_timed_out():
> 
>         case BLK_EH_HANDLED:
>                 if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
>                                                 MQ_RQ_COMPLETE))
>                         __blk_mq_complete_request(req);
>                 break;
> 
> This works for me.

Hello Jens,

Thanks for having reported this. How about using the following change to
suppress that warning:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bb99c03e7a34..84e55ea55baf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -844,6 +844,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 
 	switch (ret) {
 	case BLK_EH_HANDLED:
+		blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE);
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:

I think this will work better than what was proposed in your last e-mail.
I'm afraid that with that change a completion that occurs while the timeout
handler is running can be ignored.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
  2018-05-22 16:25 [PATCH v13] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
  2018-05-22 16:44 ` Jens Axboe
@ 2018-05-22 20:33 ` Keith Busch
  2018-05-22 20:36   ` Bart Van Assche
  2018-05-23 14:02 ` Keith Busch
  2 siblings, 1 reply; 22+ messages in thread
From: Keith Busch @ 2018-05-22 20:33 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Tejun Heo,
	Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy

On Tue, May 22, 2018 at 09:25:15AM -0700, Bart Van Assche wrote:
> @@ -848,13 +843,22 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  	case BLK_EH_RESET_TIMER:
> +		blk_mq_add_timer(req);
>  		/*
> +		 * The loop is for the unlikely case of a race with the
> +		 * completion code. There is no need to reset the deadline
> +		 * if the transition to the in-flight state fails because
> +		 * the deadline only matters in the in-flight state.
>  		 */
> -		blk_mq_rq_update_aborted_gstate(req, 0);
> -		blk_add_timer(req);
> +		while (true) {
> +			if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
> +						   MQ_RQ_IN_FLIGHT))
> +				break;
> +			if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
> +				__blk_mq_complete_request(req);
> +				break;
> +			}
> +		}

I'm having some trouble triggering this case where the state is already
MQ_RQ_COMPLETE, so I'm just trying to figure this out through inspection.

It looks like the new blk_mq_complete_request() already called
__blk_mq_complete_request() when it gets the state to MQ_RQ_COMPLETE,
so doing that again completes it a second time.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
  2018-05-22 20:33 ` Keith Busch
@ 2018-05-22 20:36   ` Bart Van Assche
  2018-05-22 20:40     ` Keith Busch
  2018-05-22 20:44     ` Keith Busch
  0 siblings, 2 replies; 22+ messages in thread
From: Bart Van Assche @ 2018-05-22 20:36 UTC (permalink / raw)
  To: keith.busch
  Cc: linux-block, israelr, sagi, hch, sebott, ming.lei, axboe,
	jianchao.w.wang, maxg, tj

On Tue, 2018-05-22 at 14:33 -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 09:25:15AM -0700, Bart Van Assche wrote:
> > @@ -848,13 +843,22 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
> >  	case BLK_EH_RESET_TIMER:
> > +		blk_mq_add_timer(req);
> >  		/*
> > +		 * The loop is for the unlikely case of a race with the
> > +		 * completion code. There is no need to reset the deadline
> > +		 * if the transition to the in-flight state fails because
> > +		 * the deadline only matters in the in-flight state.
> >  		 */
> > -		blk_mq_rq_update_aborted_gstate(req, 0);
> > -		blk_add_timer(req);
> > +		while (true) {
> > +			if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
> > +						   MQ_RQ_IN_FLIGHT))
> > +				break;
> > +			if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
> > +				__blk_mq_complete_request(req);
> > +				break;
> > +			}
> > +		}
> 
> I'm having some trouble triggering this case where the state is already
> MQ_RQ_COMPLETE, so I'm just trying to figure this out through inspection.
> 
> It looks like the new blk_mq_complete_request() already called
> __blk_mq_complete_request() when it gets the state to MQ_RQ_COMPLETE,
> so doing that again completes it a second time.

Hello Keith,

Have you noticed that if blk_mq_complete_request() encounters a request with
state MQ_RQ_TIMED_OUT, it doesn't call __blk_mq_complete_request()? I think
the code in blk_mq_complete_request() together with the above code guarantees
that __blk_mq_complete_request() is only called once per request generation.

Bart.
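
To make the once-per-generation guarantee concrete, here is a minimal
sketch of the two racing paths, condensed from the v13 hunks quoted in
this thread (not the literal patch text):

	/* Completion path, blk_mq_complete_request(): */
	while (true) {
		if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT,
					   MQ_RQ_COMPLETE)) {
			/* The completion won the race: complete rq here. */
			__blk_mq_complete_request(rq);
			break;
		}
		/* Timeout code owns rq: only record that it completed. */
		if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT,
					   MQ_RQ_COMPLETE))
			break;
	}

	/* Timeout path, BLK_EH_RESET_TIMER case in blk_mq_rq_timed_out(),
	 * after blk_mq_add_timer(req): */
	while (true) {
		if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
					   MQ_RQ_IN_FLIGHT))
			break;
		if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
			/* A completion raced in: finish it here, once. */
			__blk_mq_complete_request(req);
			break;
		}
	}

The completion path completes the request only when it wins the
IN_FLIGHT -> COMPLETE transition; otherwise it merely marks the request
MQ_RQ_COMPLETE and the timeout path finishes the completion, so
__blk_mq_complete_request() runs at most once per generation.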

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
  2018-05-22 20:33           ` Bart Van Assche
@ 2018-05-22 20:38             ` Jens Axboe
  2018-05-22 20:44               ` Bart Van Assche
  2018-05-23  2:35               ` Ming Lei
  0 siblings, 2 replies; 22+ messages in thread
From: Jens Axboe @ 2018-05-22 20:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, israelr, sagi, hch, sebott, ming.lei,
	jianchao.w.wang, maxg, tj, keith.busch

On 5/22/18 2:33 PM, Bart Van Assche wrote:
> On Tue, 2018-05-22 at 13:38 -0600, Jens Axboe wrote:
>> On 5/22/18 1:03 PM, Jens Axboe wrote:
>>> On 5/22/18 12:47 PM, Jens Axboe wrote:
>>>> Ran into this, running block/014 from blktests:
>>>>
>>>> [ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
>>>> [ 5750.723000] null: rq 00000000ff68f103 timed out
>>>> [ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 __blk_mq_complete_request+0xa6/0x0
>>>> [ 5750.738187] Modules linked in: null_blk(+) configfs nvme nvme_core sb_edac x86_pkg_temp_therma]
>>>> [ 5750.765509] CPU: 45 PID: 2480 Comm: kworker/45:1H Not tainted 4.17.0-rc6+ #712
>>>> [ 5750.774087] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
>>>> [ 5750.783369] Workqueue: kblockd blk_mq_timeout_work
>>>> [ 5750.789223] RIP: 0010:__blk_mq_complete_request+0xa6/0x110
>>>> [ 5750.795850] RSP: 0018:ffff883ffb417d68 EFLAGS: 00010202
>>>> [ 5750.802187] RAX: 0000000000000003 RBX: ffff881ff100d800 RCX: 0000000000000000
>>>> [ 5750.810649] RDX: ffff88407fd9e040 RSI: ffff88407fd956b8 RDI: ffff881ff100d800
>>>> [ 5750.819119] RBP: ffffe8ffffd91800 R08: 0000000000000000 R09: ffffffff82066eb8
>>>> [ 5750.827588] R10: ffff883ffa386138 R11: ffff883ffa385900 R12: 0000000000000001
>>>> [ 5750.836050] R13: ffff881fe7da6000 R14: 0000000000000020 R15: 0000000000000002
>>>> [ 5750.844529] FS:  0000000000000000(0000) GS:ffff88407fd80000(0000) knlGS:0000000000000000
>>>> [ 5750.854482] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 5750.861397] CR2: 00007ffc92f97f68 CR3: 000000000201d005 CR4: 00000000003606e0
>>>> [ 5750.869861] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> [ 5750.878333] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> [ 5750.886805] Call Trace:
>>>> [ 5750.890033]  bt_iter+0x42/0x50
>>>> [ 5750.894000]  blk_mq_queue_tag_busy_iter+0x12b/0x220
>>>> [ 5750.899941]  ? blk_mq_tag_to_rq+0x20/0x20
>>>> [ 5750.904913]  ? __rcu_read_unlock+0x50/0x50
>>>> [ 5750.909978]  ? blk_mq_tag_to_rq+0x20/0x20
>>>> [ 5750.914948]  blk_mq_timeout_work+0x14b/0x240
>>>> [ 5750.920220]  process_one_work+0x21b/0x510
>>>> [ 5750.925197]  worker_thread+0x3a/0x390
>>>> [ 5750.929781]  ? process_one_work+0x510/0x510
>>>> [ 5750.934944]  kthread+0x11c/0x140
>>>> [ 5750.939028]  ? kthread_create_worker_on_cpu+0x50/0x50
>>>> [ 5750.945169]  ret_from_fork+0x1f/0x30
>>>> [ 5750.949656] Code: 48 02 00 00 80 e6 80 74 29 8b 95 80 00 00 00 44 39 e2 75 3b 48 89 df ff 90 2 
>>>> [ 5750.972139] ---[ end trace 40065cb1764bf500 ]---
>>>>
>>>> which is this:
>>>>
>>>>         WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
>>>
>>> That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
>>> state transition. So that check should be:
>>>
>>>         WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
>>>                      blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);
>>
>> I guess it would be cleaner to actually do the transition, in
>> blk_mq_rq_timed_out():
>>
>>         case BLK_EH_HANDLED:                                                    
>>                 if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,                
>>                                                 MQ_RQ_COMPLETE))                
>>                         __blk_mq_complete_request(req);                         
>>                 break;        
>>
>> This works for me.
> 
> Hello Jens,
> 
> Thanks for having reported this. How about using the following change to suppress
> that warning:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index bb99c03e7a34..84e55ea55baf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -844,6 +844,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  
>  	switch (ret) {
>  	case BLK_EH_HANDLED:
> +		blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE);
>  		__blk_mq_complete_request(req);
>  		break;
>  	case BLK_EH_RESET_TIMER:
> 
> I think this will work better than what was proposed in your last e-mail. I'm afraid
> that with that change a completion that occurs while the timeout handler is
> running can be ignored.

What if that races with eg requeue? We get the completion from IRQ, decide we
need to requeue/restart the request rather than complete it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
  2018-05-22 20:36   ` Bart Van Assche
@ 2018-05-22 20:40     ` Keith Busch
  2018-05-22 20:44     ` Keith Busch
  1 sibling, 0 replies; 22+ messages in thread
From: Keith Busch @ 2018-05-22 20:40 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, israelr, sagi, hch, sebott, ming.lei, axboe,
	jianchao.w.wang, maxg, tj

On Tue, May 22, 2018 at 08:36:27PM +0000, Bart Van Assche wrote:
> Have you noticed that if blk_mq_complete_request() encounters a request with
> state MQ_RQ_TIMED_OUT, it doesn't call __blk_mq_complete_request()? I think
> the code in blk_mq_complete_request() together with the above code guarantees
> that __blk_mq_complete_request() is only called once per request generation.

Right, my mistake. I noticed that when I saw your reply on the EH_HANDLED
case, so this looks fine.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
  2018-05-22 20:36   ` Bart Van Assche
  2018-05-22 20:40     ` Keith Busch
@ 2018-05-22 20:44     ` Keith Busch
  2018-05-22 20:47       ` Bart Van Assche
  1 sibling, 1 reply; 22+ messages in thread
From: Keith Busch @ 2018-05-22 20:44 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, israelr, sagi, hch, sebott, ming.lei, axboe,
	jianchao.w.wang, maxg, tj

On Tue, May 22, 2018 at 08:36:27PM +0000, Bart Van Assche wrote:
> 
> Have you noticed that if blk_mq_complete_request() encounters a request with
> state MQ_RQ_TIMED_OUT, it doesn't call __blk_mq_complete_request()? I think
> the code in blk_mq_complete_request() together with the above code guarantees
> that __blk_mq_complete_request() is only called once per request generation.

Okay, now to the BLK_EH_NOT_HANDLED case: that's supposedly the correct
status to return if the driver knows blk_mq_complete_request() was called
prior to returning from the timeout handler, so we need a similar check
there, right?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
  2018-05-22 20:38             ` Jens Axboe
@ 2018-05-22 20:44               ` Bart Van Assche
  2018-05-22 21:03                 ` Jens Axboe
  2018-05-23  2:35               ` Ming Lei
  1 sibling, 1 reply; 22+ messages in thread
From: Bart Van Assche @ 2018-05-22 20:44 UTC (permalink / raw)
  To: axboe
  Cc: linux-block, israelr, sagi, hch, sebott, ming.lei,
	jianchao.w.wang, maxg, tj, keith.busch

On Tue, 2018-05-22 at 14:38 -0600, Jens Axboe wrote:
> On 5/22/18 2:33 PM, Bart Van Assche wrote:
> > Thanks for having reported this. How about using the following change to suppress
> > that warning:
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index bb99c03e7a34..84e55ea55baf 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -844,6 +844,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
> >  
> >  	switch (ret) {
> >  	case BLK_EH_HANDLED:
> > +		blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE);
> >  		__blk_mq_complete_request(req);
> >  		break;
> >  	case BLK_EH_RESET_TIMER:
> > 
> > I think this will work better than what was proposed in your last e-mail. I'm afraid
> > that with that change a completion that occurs while the timeout handler is
> > running can be ignored.
> 
> What if that races with eg requeue? We get the completion from IRQ, decide we
> need to requeue/restart the request rather than complete it.

Shouldn't block drivers that requeue a request from inside their timeout handler
return BLK_EH_NOT_HANDLED instead of BLK_EH_HANDLED?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
  2018-05-22 20:44     ` Keith Busch
@ 2018-05-22 20:47       ` Bart Van Assche
  0 siblings, 0 replies; 22+ messages in thread
From: Bart Van Assche @ 2018-05-22 20:47 UTC (permalink / raw)
  To: keith.busch
  Cc: linux-block, israelr, sagi, hch, sebott, axboe, ming.lei,
	jianchao.w.wang, maxg, tj

On Tue, 2018-05-22 at 14:44 -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 08:36:27PM +0000, Bart Van Assche wrote:
> > 
> > Have you noticed that if blk_mq_complete_request() encounters a request with
> > state MQ_RQ_TIMED_OUT, it doesn't call __blk_mq_complete_request()? I think
> > the code in blk_mq_complete_request() together with the above code guarantees
> > that __blk_mq_complete_request() is only called once per request generation.
> 
> Okay, now to the BLK_EH_NOT_HANDLED case: that's supposedly the correct
> status to return if the driver knows blk_mq_complete_request() was called
> prior to returning from the timeout handler, so we need a similar check
> there, right?

Good catch. To me that seems like the right place to handle that case.

Bart.
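
A sketch of the kind of check being discussed here, hypothetical and
untested, mirroring the BLK_EH_RESET_TIMER loop elsewhere in the thread:

	case BLK_EH_NOT_HANDLED:
		/*
		 * Hypothetical: the driver completed the request from its
		 * timeout handler, so blk_mq_complete_request() moved it
		 * from MQ_RQ_TIMED_OUT to MQ_RQ_COMPLETE without calling
		 * __blk_mq_complete_request(). Finish that completion here
		 * so it is not lost.
		 */
		if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE)
			__blk_mq_complete_request(req);
		break;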

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
  2018-05-22 20:29             ` Jens Axboe
@ 2018-05-22 21:02               ` Christoph Hellwig
  2018-05-22 21:02                 ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2018-05-22 21:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Bart Van Assche, linux-block, Christoph Hellwig,
	Tejun Heo, Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy

On Tue, May 22, 2018 at 02:29:21PM -0600, Jens Axboe wrote:
> > of BLK_EH_HANDLED. Is the driver supposed to return BLK_EH_NOT_HANDLED
> > when the driver actually knows the request has been completed before
> > returning the status?
> 
> If the driver knows it's completed, it'd have to return BLK_EH_NOT_HANDLED.
> Or BLK_EH_HANDLED would work too, since the above state transition would
> then fail.

Btw, I think we should just kill off BLK_EH_HANDLED.  WIP totally
untested progress toward that is here:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-ret

The only real missing bit is SCSI overloading the value for internal
use.
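
For illustration, one shape such a cleanup could take; the names below
are hypothetical and need not match what the WIP branch actually does:

	enum blk_eh_timer_return {
		BLK_EH_DONE,		/* driver owns (or already did) the completion */
		BLK_EH_RESET_TIMER,	/* re-arm the timer and keep waiting */
	};

With only two return values there is no BLK_EH_HANDLED case for the core
to second-guess: a driver either completes the request itself, possibly
from IRQ context, or asks for more time.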

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
  2018-05-22 21:02               ` Christoph Hellwig
@ 2018-05-22 21:02                 ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2018-05-22 21:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Bart Van Assche, linux-block, Tejun Heo,
	Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy

On 5/22/18 3:02 PM, Christoph Hellwig wrote:
> On Tue, May 22, 2018 at 02:29:21PM -0600, Jens Axboe wrote:
>>> of BLK_EH_HANDLED. Is the driver supposed to return BLK_EH_NOT_HANDLED
>>> when the driver actually knows the request has been completed before
>>> returning the status?
>>
>> If the driver knows it's completed, it'd have to return BLK_EH_NOT_HANDLED.
>> Or BLK_EH_HANDLED would work too, since the above state transition would
>> then fail.
> 
> Btw, I think we should just kill off BLK_EH_HANDLED.  WIP totally
> untested progress toward that is here:
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-ret
> 
> The only real missing bit is SCSI overloading the value for internal
> use.

I think that's a great idea; the use cases aren't clear at all, and driver
writers will get this wrong.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
  2018-05-22 20:44               ` Bart Van Assche
@ 2018-05-22 21:03                 ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2018-05-22 21:03 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, israelr, sagi, hch, sebott, ming.lei,
	jianchao.w.wang, maxg, tj, keith.busch

On 5/22/18 2:44 PM, Bart Van Assche wrote:
> On Tue, 2018-05-22 at 14:38 -0600, Jens Axboe wrote:
>> On 5/22/18 2:33 PM, Bart Van Assche wrote:
>>> Thanks for having reported this. How about using the following change to suppress
>>> that warning:
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index bb99c03e7a34..84e55ea55baf 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -844,6 +844,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>>>  
>>>  	switch (ret) {
>>>  	case BLK_EH_HANDLED:
>>> +		blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE);
>>>  		__blk_mq_complete_request(req);
>>>  		break;
>>>  	case BLK_EH_RESET_TIMER:
>>>
>>> I think this will work better than what was proposed in your last e-mail. I'm afraid
>>> that with that change a completion that occurs while the timeout handler is
>>> running can be ignored.
>>
>> What if that races with eg requeue? We get the completion from IRQ, decide we
>> need to requeue/restart the request rather than complete it.
> 
> Shouldn't block drivers that requeue a request from inside their timeout handler
> return BLK_EH_NOT_HANDLED instead of BLK_EH_HANDLED?

Yeah, I guess they should. See reply to Christoph as well, making this
simpler would be a great idea. Could be done as a prep patch (series)
to this one.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
  2018-05-22 20:38             ` Jens Axboe
  2018-05-22 20:44               ` Bart Van Assche
@ 2018-05-23  2:35               ` Ming Lei
  1 sibling, 0 replies; 22+ messages in thread
From: Ming Lei @ 2018-05-23  2:35 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bart Van Assche, linux-block, israelr, sagi, hch, sebott,
	jianchao.w.wang, maxg, tj, keith.busch

On Tue, May 22, 2018 at 02:38:37PM -0600, Jens Axboe wrote:
> On 5/22/18 2:33 PM, Bart Van Assche wrote:
> > On Tue, 2018-05-22 at 13:38 -0600, Jens Axboe wrote:
> >> On 5/22/18 1:03 PM, Jens Axboe wrote:
> >>> On 5/22/18 12:47 PM, Jens Axboe wrote:
> >>>> Ran into this, running block/014 from blktests:
> >>>>
> >>>> [ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
> >>>> [ 5750.723000] null: rq 00000000ff68f103 timed out
> >>>> [ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 __blk_mq_complete_request+0xa6/0x0
> >>>> [ 5750.738187] Modules linked in: null_blk(+) configfs nvme nvme_core sb_edac x86_pkg_temp_therma]
> >>>> [ 5750.765509] CPU: 45 PID: 2480 Comm: kworker/45:1H Not tainted 4.17.0-rc6+ #712
> >>>> [ 5750.774087] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
> >>>> [ 5750.783369] Workqueue: kblockd blk_mq_timeout_work
> >>>> [ 5750.789223] RIP: 0010:__blk_mq_complete_request+0xa6/0x110
> >>>> [ 5750.795850] RSP: 0018:ffff883ffb417d68 EFLAGS: 00010202
> >>>> [ 5750.802187] RAX: 0000000000000003 RBX: ffff881ff100d800 RCX: 0000000000000000
> >>>> [ 5750.810649] RDX: ffff88407fd9e040 RSI: ffff88407fd956b8 RDI: ffff881ff100d800
> >>>> [ 5750.819119] RBP: ffffe8ffffd91800 R08: 0000000000000000 R09: ffffffff82066eb8
> >>>> [ 5750.827588] R10: ffff883ffa386138 R11: ffff883ffa385900 R12: 0000000000000001
> >>>> [ 5750.836050] R13: ffff881fe7da6000 R14: 0000000000000020 R15: 0000000000000002
> >>>> [ 5750.844529] FS:  0000000000000000(0000) GS:ffff88407fd80000(0000) knlGS:0000000000000000
> >>>> [ 5750.854482] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>> [ 5750.861397] CR2: 00007ffc92f97f68 CR3: 000000000201d005 CR4: 00000000003606e0
> >>>> [ 5750.869861] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>> [ 5750.878333] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>> [ 5750.886805] Call Trace:
> >>>> [ 5750.890033]  bt_iter+0x42/0x50
> >>>> [ 5750.894000]  blk_mq_queue_tag_busy_iter+0x12b/0x220
> >>>> [ 5750.899941]  ? blk_mq_tag_to_rq+0x20/0x20
> >>>> [ 5750.904913]  ? __rcu_read_unlock+0x50/0x50
> >>>> [ 5750.909978]  ? blk_mq_tag_to_rq+0x20/0x20
> >>>> [ 5750.914948]  blk_mq_timeout_work+0x14b/0x240
> >>>> [ 5750.920220]  process_one_work+0x21b/0x510
> >>>> [ 5750.925197]  worker_thread+0x3a/0x390
> >>>> [ 5750.929781]  ? process_one_work+0x510/0x510
> >>>> [ 5750.934944]  kthread+0x11c/0x140
> >>>> [ 5750.939028]  ? kthread_create_worker_on_cpu+0x50/0x50
> >>>> [ 5750.945169]  ret_from_fork+0x1f/0x30
> >>>> [ 5750.949656] Code: 48 02 00 00 80 e6 80 74 29 8b 95 80 00 00 00 44 39 e2 75 3b 48 89 df ff 90 2 
> >>>> [ 5750.972139] ---[ end trace 40065cb1764bf500 ]---
> >>>>
> >>>> which is this:
> >>>>
> >>>>         WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
> >>>
> >>> That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
> >>> state transition. So that check should be:
> >>>
> >>>         WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
> >>>                      blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);
> >>
> >> I guess it would be cleaner to actually do the transition, in
> >> blk_mq_rq_timed_out():
> >>
> >>         case BLK_EH_HANDLED:                                                    
> >>                 if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,                
> >>                                                 MQ_RQ_COMPLETE))                
> >>                         __blk_mq_complete_request(req);                         
> >>                 break;        
> >>
> >> This works for me.
> > 
> > Hello Jens,
> > 
> > Thanks for having reported this. How about using the following change to suppress
> > that warning:
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index bb99c03e7a34..84e55ea55baf 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -844,6 +844,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
> >  
> >  	switch (ret) {
> >  	case BLK_EH_HANDLED:
> > +		blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE);
> >  		__blk_mq_complete_request(req);
> >  		break;
> >  	case BLK_EH_RESET_TIMER:
> > 
> > I think this will work better than what was proposed in your last e-mail. I'm afraid
> > that with that change a completion that occurs while the timeout handler is
> > running can be ignored.
> 
> What if that races with eg requeue? We get the completion from IRQ, decide we
> need to requeue/restart the request rather than complete it.

If drivers need to requeue this request, that should have been done in
its .complete handler.

But if the requeue is done directly from the IRQ handler, that may be an
existing race between normal completion and timeout, and Bart's patch
doesn't change that situation either.

thanks,
Ming

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
  2018-05-22 16:25 [PATCH v13] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
  2018-05-22 16:44 ` Jens Axboe
  2018-05-22 20:33 ` Keith Busch
@ 2018-05-23 14:02 ` Keith Busch
  2018-05-23 14:08   ` Keith Busch
  2 siblings, 1 reply; 22+ messages in thread
From: Keith Busch @ 2018-05-23 14:02 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Tejun Heo,
	Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy

On Tue, May 22, 2018 at 09:25:15AM -0700, Bart Van Assche wrote:
> +static bool blk_mq_change_rq_state(struct request *rq,
> +				   enum mq_rq_state old_state,
> +				   enum mq_rq_state new_state)
> +{
> +	union blk_generation_and_state gstate = READ_ONCE(rq->gstate);
> +	union blk_generation_and_state old_val = gstate;
> +	union blk_generation_and_state new_val = gstate;
> +
> +	old_val.state = old_state;
> +	new_val.state = new_state;
> +	if (new_state == MQ_RQ_IN_FLIGHT)
> +		new_val.generation++;
> +	/*
> +	 * For transitions from state in-flight to another state cmpxchg()
> +	 * must be used. For other state transitions it is safe to use
> +	 * WRITE_ONCE().
> +	 */
> +	if (old_state != MQ_RQ_IN_FLIGHT) {
> +		WRITE_ONCE(rq->gstate.val, new_val.val);
> +		return true;
> +	}
> +	return blk_mq_set_rq_state(rq, old_val, new_val);
> +}

<snip>

>  void blk_mq_complete_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> -	int srcu_idx;
>  
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
>  
> -	/*
> -	 * If @rq->aborted_gstate equals the current instance, timeout is
> -	 * claiming @rq and we lost.  This is synchronized through
> -	 * hctx_lock().  See blk_mq_timeout_work() for details.
> -	 *
> -	 * Completion path never blocks and we can directly use RCU here
> -	 * instead of hctx_lock() which can be either RCU or SRCU.
> -	 * However, that would complicate paths which want to synchronize
> -	 * against us.  Let stay in sync with the issue path so that
> -	 * hctx_lock() covers both issue and completion paths.
> -	 */
> -	hctx_lock(hctx, &srcu_idx);
> -	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> -		__blk_mq_complete_request(rq);
> -	hctx_unlock(hctx, srcu_idx);
> +	/* The loop is for the unlikely case of a race with the timeout code. */
> +	while (true) {
> +		if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT,
> +					   MQ_RQ_COMPLETE)) {
> +			__blk_mq_complete_request(rq);
> +			break;
> +		}
> +		if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
> +			break;
> +	}
>  }

Looks like the cmpxchg is also needed if old_state is MQ_RQ_TIMED_OUT,
otherwise it's guaranteed to return 'true' and there's no point to the
loop and 'if' check.
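
For illustration, a hypothetical fix along those lines would restrict the
WRITE_ONCE() fast path to transitions that cannot race (untested, and v14
may do this differently):

static bool blk_mq_change_rq_state(struct request *rq,
				   enum mq_rq_state old_state,
				   enum mq_rq_state new_state)
{
	union blk_generation_and_state gstate = READ_ONCE(rq->gstate);
	union blk_generation_and_state old_val = gstate;
	union blk_generation_and_state new_val = gstate;

	old_val.state = old_state;
	new_val.state = new_state;
	if (new_state == MQ_RQ_IN_FLIGHT)
		new_val.generation++;
	/*
	 * Both the in-flight and the timed-out state can be left
	 * concurrently (by the completion code and by the timeout code),
	 * so leaving either one requires cmpxchg().
	 */
	if (old_state != MQ_RQ_IN_FLIGHT && old_state != MQ_RQ_TIMED_OUT) {
		WRITE_ONCE(rq->gstate.val, new_val.val);
		return true;
	}
	return blk_mq_set_rq_state(rq, old_val, new_val);
}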

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again
  2018-05-23 14:02 ` Keith Busch
@ 2018-05-23 14:08   ` Keith Busch
  0 siblings, 0 replies; 22+ messages in thread
From: Keith Busch @ 2018-05-23 14:08 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Tejun Heo,
	Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy

On Wed, May 23, 2018 at 08:02:31AM -0600, Keith Busch wrote:
> Looks like the cmpxchg is also needed if old_state is MQ_RQ_TIMED_OUT,
> otherwise it's guaranteed to return 'true' and there's no point to the
> loop and 'if' check.

And I see v14 is already posted with that fix! :)

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-05-23 14:08 UTC | newest]

Thread overview: 22+ messages
2018-05-22 16:25 [PATCH v13] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
2018-05-22 16:44 ` Jens Axboe
2018-05-22 17:17   ` Jens Axboe
2018-05-22 18:47     ` Jens Axboe
2018-05-22 19:03       ` Jens Axboe
2018-05-22 19:38         ` Jens Axboe
2018-05-22 20:26           ` Keith Busch
2018-05-22 20:29             ` Jens Axboe
2018-05-22 21:02               ` Christoph Hellwig
2018-05-22 21:02                 ` Jens Axboe
2018-05-22 20:33           ` Bart Van Assche
2018-05-22 20:38             ` Jens Axboe
2018-05-22 20:44               ` Bart Van Assche
2018-05-22 21:03                 ` Jens Axboe
2018-05-23  2:35               ` Ming Lei
2018-05-22 20:33 ` Keith Busch
2018-05-22 20:36   ` Bart Van Assche
2018-05-22 20:40     ` Keith Busch
2018-05-22 20:44     ` Keith Busch
2018-05-22 20:47       ` Bart Van Assche
2018-05-23 14:02 ` Keith Busch
2018-05-23 14:08   ` Keith Busch
