* [PATCH v4] blk-mq: Fix race conditions in request timeout handling
@ 2018-04-10  1:34 Bart Van Assche
  2018-04-10  7:59 ` jianchao.wang
                   ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Bart Van Assche @ 2018-04-10  1:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Tejun Heo,
	Sagi Grimberg, Israel Rukshin, Max Gurtovoy, stable

If a completion occurs after blk_mq_rq_timed_out() has reset
rq->aborted_gstate and the request is again in flight when the timeout
expires then a request will be completed twice: a first time by the
timeout handler and a second time when the regular completion occurs.

Additionally, the blk-mq timeout handling code ignores completions that
occur after blk_mq_check_expired() has been called and before
blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
timeout handler always returns BLK_EH_RESET_TIMER then the result will
be that the request never terminates.

Since the request state can be updated from two different contexts,
namely regular completion and request timeout, this race cannot be
fixed with RCU synchronization only. Fix this race as follows:
- Split __deadline in two variables, namely lq_deadline for legacy
  request queues and mq_deadline for blk-mq request queues. Use atomic
  operations to update mq_deadline.
- Use the deadline instead of the request generation to detect whether
  or not a request timer fired after reinitialization of a request.
- Store the request state in the lowest two bits of the deadline instead
  of the lowest two bits of 'gstate'.
- Remove all request member variables that became superfluous due to
  this change: gstate, aborted_gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to this
  patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the hctx member that became superfluous due to these changes,
  namely nr_expired.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.
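
As a minimal userspace sketch of the resulting scheme (illustrative only,
not the kernel code: C11 stdatomic stands in for the kernel's
atomic_long_cmpxchg() and the deadline value is made up):

/* Build with: cc -std=c11 -o sketch sketch.c */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum mq_rq_state { MQ_RQ_IDLE = 0, MQ_RQ_IN_FLIGHT = 1, MQ_RQ_COMPLETE = 2 };
#define MQ_RQ_STATE_MASK 0x3UL

struct request { _Atomic unsigned long mq_deadline; };

/* Succeeds if and only if the state is still @old; models
 * blk_mq_change_rq_state(). */
static bool change_rq_state(struct request *rq, unsigned long old,
			    unsigned long new)
{
	unsigned long old_d = (atomic_load(&rq->mq_deadline) &
			       ~MQ_RQ_STATE_MASK) | old;
	unsigned long new_d = (old_d & ~MQ_RQ_STATE_MASK) | new;

	return atomic_compare_exchange_strong(&rq->mq_deadline, &old_d, new_d);
}

int main(void)
{
	/* Hypothetical deadline of 1000 jiffies, request in flight. */
	struct request rq = {
		.mq_deadline = (1000UL & ~MQ_RQ_STATE_MASK) | MQ_RQ_IN_FLIGHT
	};

	/* Whichever of the timeout path and the regular completion path
	 * performs this transition first owns the request... */
	printf("first claim:  %d\n",
	       change_rq_state(&rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE));
	/* ...and the loser sees its cmpxchg fail and backs off. */
	printf("second claim: %d\n",
	       change_rq_state(&rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE));
	return 0;
}

Because both paths go through the same cmpxchg, exactly one of them can
move a request from IN_FLIGHT to COMPLETE for a given deadline value.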

This patch fixes the following kernel crashes:

BUG: unable to handle kernel NULL pointer dereference at           (null)
Oops: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 151 Comm: kworker/2:1H Tainted: G        W        4.15.0-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: kblockd blk_mq_timeout_work
RIP: 0010:scsi_times_out+0x17/0x2c0 [scsi_mod]
Call Trace:
blk_mq_terminate_expired+0x42/0x80
bt_iter+0x3d/0x50
blk_mq_queue_tag_busy_iter+0xe9/0x200
blk_mq_timeout_work+0x181/0x2e0
process_one_work+0x21c/0x6d0
worker_thread+0x35/0x380
kthread+0x117/0x130
ret_from_fork+0x24/0x30

This patch also fixes a double completion problem in the NVMeOF
initiator driver. See also http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html.

Fixes: 1d9bd5161ba3 ("blk-mq: replace timeout synchronization with a RCU and generation based scheme")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: <stable@vger.kernel.org> # v4.16
---

Changes compared to v3 (see also https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
  v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
  for blk-mq.

Changes compared to v2 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
  the __deadline member to encode both the generation and state information.

 block/blk-core.c       |   2 -
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c         | 166 +++----------------------------------------------
 block/blk-mq.h         |  47 ++++++++------
 block/blk-timeout.c    |  57 ++++++++++++-----
 block/blk.h            |  41 ++++++++++--
 include/linux/blk-mq.h |   1 -
 include/linux/blkdev.h |  32 +++-------
 8 files changed, 122 insertions(+), 225 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0c48bef8490f..422b79b61bb9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -200,8 +200,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
 	rq->part = NULL;
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 6f72413b6cab..80c7c585769f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -345,7 +345,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
 	RQF_NAME(ZONE_WRITE_LOCKED),
-	RQF_NAME(MQ_TIMEOUT_EXPIRED),
 	RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7816d28b7219..337e10a5a30c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -305,7 +305,6 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->special = NULL;
 	/* tag was already set */
 	rq->extra_len = 0;
-	rq->__deadline = 0;
 
 	INIT_LIST_HEAD(&rq->timeout_list);
 	rq->timeout = 0;
@@ -527,8 +526,7 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-	blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
@@ -577,36 +575,6 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
-	unsigned long flags;
-
-	/*
-	 * blk_mq_rq_aborted_gstate() is used from the completion path and
-	 * can thus be called from irq context.  u64_stats_fetch in the
-	 * middle of update on the same CPU leads to lockup.  Disable irq
-	 * while updating.
-	 */
-	local_irq_save(flags);
-	u64_stats_update_begin(&rq->aborted_gstate_sync);
-	rq->aborted_gstate = gstate;
-	u64_stats_update_end(&rq->aborted_gstate_sync);
-	local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-	unsigned int start;
-	u64 aborted_gstate;
-
-	do {
-		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-		aborted_gstate = rq->aborted_gstate;
-	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
-
-	return aborted_gstate;
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -618,27 +586,12 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
 
-	/*
-	 * If @rq->aborted_gstate equals the current instance, timeout is
-	 * claiming @rq and we lost.  This is synchronized through
-	 * hctx_lock().  See blk_mq_timeout_work() for details.
-	 *
-	 * Completion path never blocks and we can directly use RCU here
-	 * instead of hctx_lock() which can be either RCU or SRCU.
-	 * However, that would complicate paths which want to synchronize
-	 * against us.  Let stay in sync with the issue path so that
-	 * hctx_lock() covers both issue and completion paths.
-	 */
-	hctx_lock(hctx, &srcu_idx);
-	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
+	if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE))
 		__blk_mq_complete_request(rq);
-	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -662,27 +615,8 @@ void blk_mq_start_request(struct request *rq)
 		wbt_issue(q->rq_wb, &rq->issue_stat);
 	}
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
-
-	/*
-	 * Mark @rq in-flight which also advances the generation number,
-	 * and register for timeout.  Protect with a seqcount to allow the
-	 * timeout path to read both @rq->gstate and @rq->deadline
-	 * coherently.
-	 *
-	 * This is the only place where a request is marked in-flight.  If
-	 * the timeout path reads an in-flight @rq->gstate, the
-	 * @rq->deadline it reads together under @rq->gstate_seq is
-	 * guaranteed to be the matching one.
-	 */
-	preempt_disable();
-	write_seqcount_begin(&rq->gstate_seq);
-
-	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
-	blk_add_timer(rq);
-
-	write_seqcount_end(&rq->gstate_seq);
-	preempt_enable();
+	/* Mark @rq in-flight and set its deadline. */
+	blk_mq_add_timer(rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -695,11 +629,6 @@ void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
-/*
- * When we reach here because queue is busy, it's safe to change the state
- * to IDLE without checking @rq->aborted_gstate because we should still be
- * holding the RCU read lock and thus protected against timeout.
- */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
@@ -811,7 +740,6 @@ EXPORT_SYMBOL(blk_mq_tag_to_rq);
 struct blk_mq_timeout_data {
 	unsigned long next;
 	unsigned int next_set;
-	unsigned int nr_expired;
 };
 
 static void blk_mq_rq_timed_out(struct request *req, bool reserved)
@@ -819,8 +747,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
-
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
@@ -829,13 +755,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
-		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
-		blk_add_timer(req);
+		blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -849,60 +769,23 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	unsigned long gstate, deadline;
-	int start;
-
-	might_sleep();
-
-	if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
-		return;
-
-	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
-	while (true) {
-		start = read_seqcount_begin(&rq->gstate_seq);
-		gstate = READ_ONCE(rq->gstate);
-		deadline = blk_rq_deadline(rq);
-		if (!read_seqcount_retry(&rq->gstate_seq, start))
-			break;
-		cond_resched();
-	}
+	unsigned long deadline = blk_mq_rq_deadline(rq);
 
-	/* if in-flight && overdue, mark for abortion */
-	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
-	    time_after_eq(jiffies, deadline)) {
-		blk_mq_rq_update_aborted_gstate(rq, gstate);
-		data->nr_expired++;
-		hctx->nr_expired++;
+	if (time_after_eq(jiffies, deadline) &&
+	    blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
+		blk_mq_rq_timed_out(rq, reserved);
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
-}
 
-static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
-		struct request *rq, void *priv, bool reserved)
-{
-	/*
-	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
-	 * completions that we lost to, they would have finished and
-	 * updated @rq->gstate by now; otherwise, the completion path is
-	 * now guaranteed to see @rq->aborted_gstate and yield.  If
-	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
-	 */
-	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
-	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
-		blk_mq_rq_timed_out(rq, reserved);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
 		container_of(work, struct request_queue, timeout_work);
-	struct blk_mq_timeout_data data = {
-		.next		= 0,
-		.next_set	= 0,
-		.nr_expired	= 0,
-	};
+	struct blk_mq_timeout_data data = { };
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
@@ -925,33 +808,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	/* scan for the expired ones and set their ->aborted_gstate */
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
-	if (data.nr_expired) {
-		bool has_rcu = false;
-
-		/*
-		 * Wait till everyone sees ->aborted_gstate.  The
-		 * sequential waits for SRCUs aren't ideal.  If this ever
-		 * becomes a problem, we can add per-hw_ctx rcu_head and
-		 * wait in parallel.
-		 */
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->nr_expired)
-				continue;
-
-			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-				has_rcu = true;
-			else
-				synchronize_srcu(hctx->srcu);
-
-			hctx->nr_expired = 0;
-		}
-		if (has_rcu)
-			synchronize_rcu();
-
-		/* terminate the ones we won */
-		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
-	}
-
 	if (data.next_set) {
 		data.next = blk_rq_timeout(round_jiffies_up(data.next));
 		mod_timer(&q->timeout, data.next);
@@ -2087,8 +1943,6 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			return ret;
 	}
 
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
 	return 0;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..4f96fd66eb8a 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -27,10 +27,7 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-/*
- * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
- * and the upper bits the generation number.
- */
+/* Lowest two bits of request->mq_deadline. */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
@@ -38,7 +35,6 @@ enum mq_rq_state {
 
 	MQ_RQ_STATE_BITS	= 2,
 	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
-	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
 };
 
 void blk_mq_freeze_queue(struct request_queue *q);
@@ -104,9 +100,30 @@ void blk_mq_release(struct request_queue *q);
  * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
  * @rq: target request.
  */
-static inline int blk_mq_rq_state(struct request *rq)
+static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
 {
-	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+	return atomic_long_read(&rq->mq_deadline) & MQ_RQ_STATE_MASK;
+}
+
+/**
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old: Old request state.
+ * @new: New request state.
+ *
+ * Returns %true if and only if the old state was @old and if the state has
+ * been changed into @new.
+ */
+static inline bool blk_mq_change_rq_state(struct request *rq,
+					  enum mq_rq_state old_s,
+					  enum mq_rq_state new_s)
+{
+	unsigned long old_d = (atomic_long_read(&rq->mq_deadline) &
+			       ~(unsigned long)MQ_RQ_STATE_MASK) | old_s;
+	unsigned long new_d = (old_d & ~(unsigned long)MQ_RQ_STATE_MASK) |
+			      new_s;
+
+	return atomic_long_cmpxchg(&rq->mq_deadline, old_d, new_d) == old_d;
 }
 
 /**
@@ -114,23 +131,13 @@ static inline int blk_mq_rq_state(struct request *rq)
  * @rq: target request.
  * @state: new state to set.
  *
- * Set @rq's state to @state.  The caller is responsible for ensuring that
- * there are no other updaters.  A request can transition into IN_FLIGHT
- * only from IDLE and doing so increments the generation number.
+ * Set @rq's state to @state.
  */
 static inline void blk_mq_rq_update_state(struct request *rq,
-					  enum mq_rq_state state)
+					  enum mq_rq_state new_s)
 {
-	u64 old_val = READ_ONCE(rq->gstate);
-	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-	if (state == MQ_RQ_IN_FLIGHT) {
-		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-		new_val += MQ_RQ_GEN_INC;
+	while (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), new_s)) {
 	}
-
-	/* avoid exposing interim values */
-	WRITE_ONCE(rq->gstate, new_val);
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 50a191720055..3ca829dce2d6 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -165,8 +165,9 @@ void blk_abort_request(struct request *req)
 		 * immediately and that scan sees the new timeout value.
 		 * No need for fancy synchronizations.
 		 */
-		blk_rq_set_deadline(req, jiffies);
-		kblockd_schedule_work(&req->q->timeout_work);
+		if (blk_mq_rq_set_deadline(req, jiffies, MQ_RQ_IN_FLIGHT,
+					   MQ_RQ_IN_FLIGHT))
+			kblockd_schedule_work(&req->q->timeout_work);
 	} else {
 		if (blk_mark_rq_complete(req))
 			return;
@@ -187,15 +188,8 @@ unsigned long blk_rq_timeout(unsigned long timeout)
 	return timeout;
 }
 
-/**
- * blk_add_timer - Start timeout timer for a single request
- * @req:	request that is about to start running.
- *
- * Notes:
- *    Each request has its own timer, and as it is added to the queue, we
- *    set up the timer. When the request completes, we cancel the timer.
- */
-void blk_add_timer(struct request *req)
+static void __blk_add_timer(struct request *req, enum mq_rq_state old,
+			    enum mq_rq_state new)
 {
 	struct request_queue *q = req->q;
 	unsigned long expiry;
@@ -216,15 +210,17 @@ void blk_add_timer(struct request *req)
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
-	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
-
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
 	 * For the mq case we simply scan the tag map.
 	 */
-	if (!q->mq_ops)
+	if (!q->mq_ops) {
+		blk_rq_set_deadline(req, jiffies + req->timeout);
 		list_add_tail(&req->timeout_list, &req->q->timeout_list);
+	} else {
+		WARN_ON_ONCE(!blk_mq_rq_set_deadline(req, jiffies +
+						     req->timeout, old, new));
+	}
 
 	/*
 	 * If the timer isn't already pending or this timeout is earlier
@@ -249,3 +245,34 @@ void blk_add_timer(struct request *req)
 	}
 
 }
+
+/**
+ * blk_add_timer - Start timeout timer for a single request
+ * @req:	request that is about to start running.
+ *
+ * Notes:
+ *    Each request has its own timer, and as it is added to the queue, we
+ *    set up the timer. When the request completes, we cancel the timer.
+ */
+void blk_add_timer(struct request *req)
+{
+	return __blk_add_timer(req, MQ_RQ_IDLE/*ignored*/,
+			       MQ_RQ_IDLE/*ignored*/);
+}
+
+/**
+ * blk_mq_add_timer - set the deadline for a single request
+ * @req:	request for which to set the deadline.
+ * @old:	current request state.
+ * @new:	new request state.
+ *
+ * Sets the deadline of a request if and only if it has state @old and
+ * at the same time changes the request state from @old into @new. The caller
+ * must guarantee that the request state won't be modified while this function
+ * is in progress.
+ */
+void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
+		      enum mq_rq_state new)
+{
+	return __blk_add_timer(req, old, new);
+}
diff --git a/block/blk.h b/block/blk.h
index b034fd2460c4..7665d4af777e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -170,6 +170,8 @@ static inline bool bio_integrity_endio(struct bio *bio)
 void blk_timeout_work(struct work_struct *work);
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
+void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
+		      enum mq_rq_state new);
 void blk_delete_timer(struct request *);
 
 
@@ -191,21 +193,21 @@ void blk_account_io_done(struct request *req);
 /*
  * EH timer and IO completion will both attempt to 'grab' the request, make
  * sure that only one of them succeeds. Steal the bottom bit of the
- * __deadline field for this.
+ * lq_deadline field for this.
  */
 static inline int blk_mark_rq_complete(struct request *rq)
 {
-	return test_and_set_bit(0, &rq->__deadline);
+	return test_and_set_bit(0, &rq->lq_deadline);
 }
 
 static inline void blk_clear_rq_complete(struct request *rq)
 {
-	clear_bit(0, &rq->__deadline);
+	clear_bit(0, &rq->lq_deadline);
 }
 
 static inline bool blk_rq_is_complete(struct request *rq)
 {
-	return test_bit(0, &rq->__deadline);
+	return test_bit(0, &rq->lq_deadline);
 }
 
 /*
@@ -311,15 +313,42 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
  * Steal a bit from this field for legacy IO path atomic IO marking. Note that
  * setting the deadline clears the bottom bit, potentially clearing the
  * completed bit. The user has to be OK with this (current ones are fine).
+ * Must be called with the request queue lock held.
  */
 static inline void blk_rq_set_deadline(struct request *rq, unsigned long time)
 {
-	rq->__deadline = time & ~0x1UL;
+	rq->lq_deadline = time & ~0x1UL;
 }
 
 static inline unsigned long blk_rq_deadline(struct request *rq)
 {
-	return rq->__deadline & ~0x1UL;
+	return rq->lq_deadline & ~0x1UL;
+}
+
+/*
+ * If the state of request @rq equals @old_s, update deadline and request state
+ * atomically to @time and @new_s. blk-mq only.
+ */
+static inline bool blk_mq_rq_set_deadline(struct request *rq,
+					  unsigned long time,
+					  enum mq_rq_state old_s,
+					  enum mq_rq_state new_s)
+{
+	unsigned long old_d, new_d;
+
+	do {
+		old_d = atomic_long_read(&rq->mq_deadline);
+		if ((old_d & MQ_RQ_STATE_MASK) != old_s)
+			return false;
+		new_d = (time & ~0x3UL) | (new_s & 3UL);
+	} while (atomic_long_cmpxchg(&rq->mq_deadline, old_d, new_d) != old_d);
+
+	return true;
+}
+
+static inline unsigned long blk_mq_rq_deadline(struct request *rq)
+{
+	return atomic_long_read(&rq->mq_deadline) & ~0x3UL;
 }
 
 /*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8efcf49796a3..13ccbb418e89 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -51,7 +51,6 @@ struct blk_mq_hw_ctx {
 	unsigned int		queue_num;
 
 	atomic_t		nr_active;
-	unsigned int		nr_expired;
 
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6075d1a6760c..abf78819014b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,7 +27,6 @@
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
-#include <linux/seqlock.h>
 #include <linux/u64_stats_sync.h>
 
 struct module;
@@ -125,8 +124,6 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
-/* timeout is expired */
-#define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
 /* already slept for hybrid poll */
 #define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
 
@@ -226,28 +223,15 @@ struct request {
 	unsigned int extra_len;	/* length of alignment and padding */
 
 	/*
-	 * On blk-mq, the lower bits of ->gstate (generation number and
-	 * state) carry the MQ_RQ_* state value and the upper bits the
-	 * generation number which is monotonically incremented and used to
-	 * distinguish the reuse instances.
-	 *
-	 * ->gstate_seq allows updates to ->gstate and other fields
-	 * (currently ->deadline) during request start to be read
-	 * atomically from the timeout path, so that it can operate on a
-	 * coherent set of information.
+	 * Access through blk_rq_set_deadline(), blk_rq_deadline() and
+	 * blk_mark_rq_complete(), blk_clear_rq_complete() and
+	 * blk_rq_is_complete() for legacy queues or blk_mq_rq_set_deadline(),
+	 * blk_mq_rq_deadline() and blk_mq_rq_state() for blk-mq queues.
 	 */
-	seqcount_t gstate_seq;
-	u64 gstate;
-
-	/*
-	 * ->aborted_gstate is used by the timeout to claim a specific
-	 * recycle instance of this request.  See blk_mq_timeout_work().
-	 */
-	struct u64_stats_sync aborted_gstate_sync;
-	u64 aborted_gstate;
-
-	/* access through blk_rq_set_deadline, blk_rq_deadline */
-	unsigned long __deadline;
+	union {
+		unsigned long lq_deadline;
+		atomic_long_t mq_deadline;
+	};
 
 	struct list_head timeout_list;
 
-- 
2.16.2

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10  1:34 [PATCH v4] blk-mq: Fix race conditions in request timeout handling Bart Van Assche
@ 2018-04-10  7:59 ` jianchao.wang
  2018-04-10 10:04   ` Ming Lei
  2018-04-10 13:01     ` Bart Van Assche
  2018-04-10  8:41 ` Ming Lei
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 47+ messages in thread
From: jianchao.wang @ 2018-04-10  7:59 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Tejun Heo, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy, stable

Hi Bart

On 04/10/2018 09:34 AM, Bart Van Assche wrote:
> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout
> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs

Would you please explain in more detail why the request could be completed twice?

Is it the scenario you described in https://marc.info/?l=linux-block&m=151796816127318, quoted below?

The following race can occur between the code that resets the timer
and completion handling:
- The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
- A completion occurs and blk_mq_complete_request() calls
  __blk_mq_complete_request().
- The timeout code calls blk_add_timer() and that function sets the
  request deadline and adjusts the timer.
- __blk_mq_complete_request() frees the request tag.
- The timer fires and the timeout handler gets called for a freed
  request.
If yes, how does the timeout handler get hold of the freed request when its tag has been freed?

Thanks
Jianchao
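
A toy model of why the question above can arise at all (hypothetical
code, assuming blk-mq's convention that one struct request is
preallocated per tag, so "freeing" a request only releases its tag):

/* Toy model, not kernel code: freeing a request clears its tag bit,
 * but the struct stays addressable for the lifetime of the tag set,
 * so a stale timer can still reach it. Build with: cc -std=c11 tags.c */
#include <stdbool.h>
#include <stdio.h>

#define NR_TAGS 4

struct request { int tag; };

static struct request static_rqs[NR_TAGS];	/* allocated with the tag set */
static bool tag_in_use[NR_TAGS];

static struct request *alloc_rq(int tag)
{
	tag_in_use[tag] = true;
	static_rqs[tag].tag = tag;
	return &static_rqs[tag];
}

static void free_rq(struct request *rq)
{
	tag_in_use[rq->tag] = false;	/* only the tag is released */
}

int main(void)
{
	struct request *rq = alloc_rq(2);

	free_rq(rq);
	/* The pointer is still valid memory, which is why the request
	 * state bits are needed to arbitrate between the two paths. */
	printf("tag %d, in use: %d\n", rq->tag, tag_in_use[rq->tag]);
	return 0;
}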

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10  1:34 [PATCH v4] blk-mq: Fix race conditions in request timeout handling Bart Van Assche
  2018-04-10  7:59 ` jianchao.wang
@ 2018-04-10  8:41 ` Ming Lei
  2018-04-10 12:58     ` Bart Van Assche
  2018-04-10  9:55 ` Christoph Hellwig
  2018-04-10 14:20 ` Tejun Heo
  3 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2018-04-10  8:41 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Tejun Heo,
	Sagi Grimberg, Israel Rukshin, Max Gurtovoy, stable

On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout

Given rq->aborted_gstate is reset only for BLK_EH_RESET_TIMER, I
think you are addressing the race between normal completion and
the timeout of BLK_EH_RESET_TIMER.

> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs.

This patch looks simpler, and more like the previous model of
using blk_mark_rq_complete().

I have one question:

- with this patch, rq's state is updated atomically with cmpxchg. Suppose
one rq times out and its state is updated to MQ_RQ_COMPLETE by
blk_mq_check_expired(), then ops->timeout() is run and
BLK_EH_RESET_TIMER is returned, so blk_mq_add_timer(req, MQ_RQ_COMPLETE,
MQ_RQ_IN_FLIGHT) is called to update rq's state to MQ_RQ_IN_FLIGHT.

Now the original normal completion may still occur after rq's state
becomes MQ_RQ_IN_FLIGHT, so isn't the double completion still possible
with this patch? Maybe I am wrong; please explain a bit.

Also, Tejun's approach needs a synchronize_rcu() between marking the
request COMPLETE and handling its timeout, and that delay can be quite
long; I guess that makes the race easier to trigger?

Thanks,
Ming

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10  1:34 [PATCH v4] blk-mq: Fix race conditions in request timeout handling Bart Van Assche
  2018-04-10  7:59 ` jianchao.wang
  2018-04-10  8:41 ` Ming Lei
@ 2018-04-10  9:55 ` Christoph Hellwig
  2018-04-10 13:26     ` Bart Van Assche
  2018-04-10 14:41   ` Jens Axboe
  2018-04-10 14:20 ` Tejun Heo
  3 siblings, 2 replies; 47+ messages in thread
From: Christoph Hellwig @ 2018-04-10  9:55 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Tejun Heo,
	Sagi Grimberg, Israel Rukshin, Max Gurtovoy, stable

On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout
> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs.
> 
> Additionally, the blk-mq timeout handling code ignores completions that
> occur after blk_mq_check_expired() has been called and before
> blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
> timeout handler always returns BLK_EH_RESET_TIMER then the result will
> be that the request never terminates.
> 
> Since the request state can be updated from two different contexts,
> namely regular completion and request timeout, this race cannot be
> fixed with RCU synchronization only. Fix this race as follows:
> - Split __deadline in two variables, namely lq_deadline for legacy
>   request queues and mq_deadline for blk-mq request queues. Use atomic
>   operations to update mq_deadline.

I don't think we need the atomic_long_cmpxchg, and can do with a plain
cmpxchg.  Also unterminated cmpxchg loops are a bad idea, but I think
both callers are protected from other changes so we can turn that
into a WARN_ON().  That plus some minor cleanups in the version below,
only boot tested:

diff --git a/block/blk-core.c b/block/blk-core.c
index abcb8684ba67..7e88e4a9133d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -199,8 +199,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
 	rq->part = NULL;
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 58b3b79cbe83..ecb934bafb29 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -345,7 +345,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
 	RQF_NAME(ZONE_WRITE_LOCKED),
-	RQF_NAME(MQ_TIMEOUT_EXPIRED),
 	RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f5c7dbcb954f..0f5c56789fde 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -477,7 +477,9 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
-	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+		WARN_ON_ONCE(1);
+
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
@@ -523,8 +525,7 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-	blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
@@ -573,36 +574,6 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
-	unsigned long flags;
-
-	/*
-	 * blk_mq_rq_aborted_gstate() is used from the completion path and
-	 * can thus be called from irq context.  u64_stats_fetch in the
-	 * middle of update on the same CPU leads to lockup.  Disable irq
-	 * while updating.
-	 */
-	local_irq_save(flags);
-	u64_stats_update_begin(&rq->aborted_gstate_sync);
-	rq->aborted_gstate = gstate;
-	u64_stats_update_end(&rq->aborted_gstate_sync);
-	local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-	unsigned int start;
-	u64 aborted_gstate;
-
-	do {
-		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-		aborted_gstate = rq->aborted_gstate;
-	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
-
-	return aborted_gstate;
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -614,27 +585,12 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
 
-	/*
-	 * If @rq->aborted_gstate equals the current instance, timeout is
-	 * claiming @rq and we lost.  This is synchronized through
-	 * hctx_lock().  See blk_mq_timeout_work() for details.
-	 *
-	 * Completion path never blocks and we can directly use RCU here
-	 * instead of hctx_lock() which can be either RCU or SRCU.
-	 * However, that would complicate paths which want to synchronize
-	 * against us.  Let stay in sync with the issue path so that
-	 * hctx_lock() covers both issue and completion paths.
-	 */
-	hctx_lock(hctx, &srcu_idx);
-	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
+	if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE))
 		__blk_mq_complete_request(rq);
-	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -658,27 +614,7 @@ void blk_mq_start_request(struct request *rq)
 		wbt_issue(q->rq_wb, &rq->issue_stat);
 	}
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
-
-	/*
-	 * Mark @rq in-flight which also advances the generation number,
-	 * and register for timeout.  Protect with a seqcount to allow the
-	 * timeout path to read both @rq->gstate and @rq->deadline
-	 * coherently.
-	 *
-	 * This is the only place where a request is marked in-flight.  If
-	 * the timeout path reads an in-flight @rq->gstate, the
-	 * @rq->deadline it reads together under @rq->gstate_seq is
-	 * guaranteed to be the matching one.
-	 */
-	preempt_disable();
-	write_seqcount_begin(&rq->gstate_seq);
-
-	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
-	blk_add_timer(rq);
-
-	write_seqcount_end(&rq->gstate_seq);
-	preempt_enable();
+	blk_mq_add_timer(rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -691,22 +627,19 @@ void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
-/*
- * When we reach here because queue is busy, it's safe to change the state
- * to IDLE without checking @rq->aborted_gstate because we should still be
- * holding the RCU read lock and thus protected against timeout.
- */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	unsigned long old_state = blk_mq_rq_state(rq);
 
 	blk_mq_put_driver_tag(rq);
 
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
 
-	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
-		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	if (old_state != MQ_RQ_IDLE) {
+		if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+			WARN_ON_ONCE(1);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
@@ -807,7 +740,6 @@ EXPORT_SYMBOL(blk_mq_tag_to_rq);
 struct blk_mq_timeout_data {
 	unsigned long next;
 	unsigned int next_set;
-	unsigned int nr_expired;
 };
 
 static void blk_mq_rq_timed_out(struct request *req, bool reserved)
@@ -815,8 +747,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
-
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
@@ -825,13 +755,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
-		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
-		blk_add_timer(req);
+		blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -845,60 +769,23 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	unsigned long gstate, deadline;
-	int start;
-
-	might_sleep();
-
-	if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
-		return;
-
-	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
-	while (true) {
-		start = read_seqcount_begin(&rq->gstate_seq);
-		gstate = READ_ONCE(rq->gstate);
-		deadline = blk_rq_deadline(rq);
-		if (!read_seqcount_retry(&rq->gstate_seq, start))
-			break;
-		cond_resched();
-	}
+	unsigned long deadline = blk_rq_deadline(rq);
 
-	/* if in-flight && overdue, mark for abortion */
-	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
-	    time_after_eq(jiffies, deadline)) {
-		blk_mq_rq_update_aborted_gstate(rq, gstate);
-		data->nr_expired++;
-		hctx->nr_expired++;
+	if (time_after_eq(jiffies, deadline) &&
+	    blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
+		blk_mq_rq_timed_out(rq, reserved);
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
-}
 
-static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
-		struct request *rq, void *priv, bool reserved)
-{
-	/*
-	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
-	 * completions that we lost to, they would have finished and
-	 * updated @rq->gstate by now; otherwise, the completion path is
-	 * now guaranteed to see @rq->aborted_gstate and yield.  If
-	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
-	 */
-	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
-	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
-		blk_mq_rq_timed_out(rq, reserved);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
 		container_of(work, struct request_queue, timeout_work);
-	struct blk_mq_timeout_data data = {
-		.next		= 0,
-		.next_set	= 0,
-		.nr_expired	= 0,
-	};
+	struct blk_mq_timeout_data data = { };
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
@@ -921,33 +808,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	/* scan for the expired ones and set their ->aborted_gstate */
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
-	if (data.nr_expired) {
-		bool has_rcu = false;
-
-		/*
-		 * Wait till everyone sees ->aborted_gstate.  The
-		 * sequential waits for SRCUs aren't ideal.  If this ever
-		 * becomes a problem, we can add per-hw_ctx rcu_head and
-		 * wait in parallel.
-		 */
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->nr_expired)
-				continue;
-
-			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-				has_rcu = true;
-			else
-				synchronize_srcu(hctx->srcu);
-
-			hctx->nr_expired = 0;
-		}
-		if (has_rcu)
-			synchronize_rcu();
-
-		/* terminate the ones we won */
-		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
-	}
-
 	if (data.next_set) {
 		data.next = blk_rq_timeout(round_jiffies_up(data.next));
 		mod_timer(&q->timeout, data.next);
@@ -2067,8 +1927,6 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			return ret;
 	}
 
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
 	return 0;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..a9805b9809c4 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -27,18 +27,11 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-/*
- * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
- * and the upper bits the generation number.
- */
+/* Lowest two bits of request->__deadline. */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
 	MQ_RQ_COMPLETE		= 2,
-
-	MQ_RQ_STATE_BITS	= 2,
-	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
-	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
 };
 
 void blk_mq_freeze_queue(struct request_queue *q);
@@ -100,37 +93,52 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_release(struct request_queue *q);
 
+/*
+ * If the state of request @rq equals @old_s, update deadline and request state
+ * atomically to @time and @new_s. blk-mq only.
+ */
+static inline bool blk_mq_rq_set_deadline(struct request *rq,
+		unsigned long new_time, enum mq_rq_state old_state,
+		enum mq_rq_state new_state)
+{
+	unsigned long old_val, new_val;
+
+	do {
+		old_val = rq->__deadline;
+		if ((old_val & RQ_STATE_MASK) != old_state)
+			return false;
+		new_val = (new_time & ~RQ_STATE_MASK) |
+			  (new_state & RQ_STATE_MASK);
+	} while (cmpxchg(&rq->__deadline, old_val, new_val) != old_val);
+
+	return true;
+}
+
 /**
  * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
  * @rq: target request.
  */
-static inline int blk_mq_rq_state(struct request *rq)
+static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
 {
-	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+	return rq->__deadline & RQ_STATE_MASK;
 }
 
 /**
- * blk_mq_rq_update_state() - set the current MQ_RQ_* state of a request
- * @rq: target request.
- * @state: new state to set.
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old_state: Old request state.
+ * @new_state: New request state.
  *
- * Set @rq's state to @state.  The caller is responsible for ensuring that
- * there are no other updaters.  A request can transition into IN_FLIGHT
- * only from IDLE and doing so increments the generation number.
+ * Returns %true if and only if the old state was @old and if the state has
+ * been changed into @new.
  */
-static inline void blk_mq_rq_update_state(struct request *rq,
-					  enum mq_rq_state state)
+static inline bool blk_mq_change_rq_state(struct request *rq,
+		enum mq_rq_state old_state, enum mq_rq_state new_state)
 {
-	u64 old_val = READ_ONCE(rq->gstate);
-	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-	if (state == MQ_RQ_IN_FLIGHT) {
-		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-		new_val += MQ_RQ_GEN_INC;
-	}
+	unsigned long old_val = (rq->__deadline & ~RQ_STATE_MASK) | old_state;
+	unsigned long new_val = (rq->__deadline & ~RQ_STATE_MASK) | new_state;
 
-	/* avoid exposing interim values */
-	WRITE_ONCE(rq->gstate, new_val);
+	return cmpxchg(&rq->__deadline, old_val, new_val) == old_val;
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..f666b45415b6 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -162,8 +162,9 @@ void blk_abort_request(struct request *req)
 		 * immediately and that scan sees the new timeout value.
 		 * No need for fancy synchronizations.
 		 */
-		blk_rq_set_deadline(req, jiffies);
-		kblockd_schedule_work(&req->q->timeout_work);
+		if (blk_mq_rq_set_deadline(req, jiffies, MQ_RQ_IN_FLIGHT,
+					   MQ_RQ_IN_FLIGHT))
+			kblockd_schedule_work(&req->q->timeout_work);
 	} else {
 		if (blk_mark_rq_complete(req))
 			return;
@@ -184,52 +185,17 @@ unsigned long blk_rq_timeout(unsigned long timeout)
 	return timeout;
 }
 
-/**
- * blk_add_timer - Start timeout timer for a single request
- * @req:	request that is about to start running.
- *
- * Notes:
- *    Each request has its own timer, and as it is added to the queue, we
- *    set up the timer. When the request completes, we cancel the timer.
- */
-void blk_add_timer(struct request *req)
+static void __blk_add_timer(struct request *req)
 {
 	struct request_queue *q = req->q;
 	unsigned long expiry;
 
-	if (!q->mq_ops)
-		lockdep_assert_held(q->queue_lock);
-
-	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
-	if (!q->mq_ops && !q->rq_timed_out_fn)
-		return;
-
-	BUG_ON(!list_empty(&req->timeout_list));
-
-	/*
-	 * Some LLDs, like scsi, peek at the timeout to prevent a
-	 * command from being retried forever.
-	 */
-	if (!req->timeout)
-		req->timeout = q->rq_timeout;
-
-	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
-
-	/*
-	 * Only the non-mq case needs to add the request to a protected list.
-	 * For the mq case we simply scan the tag map.
-	 */
-	if (!q->mq_ops)
-		list_add_tail(&req->timeout_list, &req->q->timeout_list);
-
 	/*
 	 * If the timer isn't already pending or this timeout is earlier
 	 * than an existing one, modify the timer. Round up to next nearest
 	 * second.
 	 */
 	expiry = blk_rq_timeout(round_jiffies_up(blk_rq_deadline(req)));
-
 	if (!timer_pending(&q->timeout) ||
 	    time_before(expiry, q->timeout.expires)) {
 		unsigned long diff = q->timeout.expires - expiry;
@@ -244,5 +210,53 @@ void blk_add_timer(struct request *req)
 		if (!timer_pending(&q->timeout) || (diff >= HZ / 2))
 			mod_timer(&q->timeout, expiry);
 	}
+}
+
+/**
+ * blk_add_timer - Start timeout timer for a single request
+ * @req:	request that is about to start running.
+ *
+ * Notes:
+ *    Each request has its own timer, and as it is added to the queue, we
+ *    set up the timer. When the request completes, we cancel the timer.
+ */
+void blk_add_timer(struct request *req)
+{
+	struct request_queue *q = req->q;
+
+	lockdep_assert_held(q->queue_lock);
+
+	if (!q->rq_timed_out_fn)
+		return;
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
+
+	blk_rq_set_deadline(req, jiffies + req->timeout);
+	list_add_tail(&req->timeout_list, &req->q->timeout_list);
+
+	return __blk_add_timer(req);
+}
+
+/**
+ * blk_mq_add_timer - set the deadline for a single request
+ * @req:	request for which to set the deadline.
+ * @old:	current request state.
+ * @new:	new request state.
+ *
+ * Sets the deadline of a request if and only if it has state @old and
+ * at the same time changes the request state from @old into @new. The caller
+ * must guarantee that the request state won't be modified while this function
+ * is in progress.
+ */
+void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
+		      enum mq_rq_state new)
+{
+	struct request_queue *q = req->q;
+
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
 
+	if (!blk_mq_rq_set_deadline(req, jiffies + req->timeout, old, new))
+		WARN_ON_ONCE(1);
+	return __blk_add_timer(req);
 }
diff --git a/block/blk.h b/block/blk.h
index b034fd2460c4..e0d6e9eb55a3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -170,6 +170,8 @@ static inline bool bio_integrity_endio(struct bio *bio)
 void blk_timeout_work(struct work_struct *work);
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
+void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
+		      enum mq_rq_state new);
 void blk_delete_timer(struct request *);
 
 
@@ -311,15 +313,16 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
  * Steal a bit from this field for legacy IO path atomic IO marking. Note that
  * setting the deadline clears the bottom bit, potentially clearing the
  * completed bit. The user has to be OK with this (current ones are fine).
+ * Must be called with the request queue lock held.
  */
 static inline void blk_rq_set_deadline(struct request *rq, unsigned long time)
 {
-	rq->__deadline = time & ~0x1UL;
+	rq->__deadline = time & ~RQ_STATE_MASK;
 }
 
 static inline unsigned long blk_rq_deadline(struct request *rq)
 {
-	return rq->__deadline & ~0x1UL;
+	return rq->__deadline & ~RQ_STATE_MASK;
 }
 
 /*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8efcf49796a3..13ccbb418e89 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -51,7 +51,6 @@ struct blk_mq_hw_ctx {
 	unsigned int		queue_num;
 
 	atomic_t		nr_active;
-	unsigned int		nr_expired;
 
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9af3e0f430bc..cde8cf6a359d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,7 +27,6 @@
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
-#include <linux/seqlock.h>
 #include <linux/u64_stats_sync.h>
 
 struct module;
@@ -125,10 +124,8 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
-/* timeout is expired */
-#define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
 /* already slept for hybrid poll */
-#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
+#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 20))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
@@ -226,27 +223,10 @@ struct request {
 	unsigned int extra_len;	/* length of alignment and padding */
 
 	/*
-	 * On blk-mq, the lower bits of ->gstate (generation number and
-	 * state) carry the MQ_RQ_* state value and the upper bits the
-	 * generation number which is monotonically incremented and used to
-	 * distinguish the reuse instances.
-	 *
-	 * ->gstate_seq allows updates to ->gstate and other fields
-	 * (currently ->deadline) during request start to be read
-	 * atomically from the timeout path, so that it can operate on a
-	 * coherent set of information.
+	 * Access through proper helpers only as it encodes request state as well:
 	 */
-	seqcount_t gstate_seq;
-	u64 gstate;
-
-	/*
-	 * ->aborted_gstate is used by the timeout to claim a specific
-	 * recycle instance of this request.  See blk_mq_timeout_work().
-	 */
-	struct u64_stats_sync aborted_gstate_sync;
-	u64 aborted_gstate;
-
-	/* access through blk_rq_set_deadline, blk_rq_deadline */
+#define RQ_STATE_BITS	2
+#define RQ_STATE_MASK	((1UL << RQ_STATE_BITS) - 1)
 	unsigned long __deadline;
 
 	struct list_head timeout_list;

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10  7:59 ` jianchao.wang
@ 2018-04-10 10:04   ` Ming Lei
  2018-04-10 12:04     ` Shan Hai
  2018-04-10 13:01     ` Bart Van Assche
  1 sibling, 1 reply; 47+ messages in thread
From: Ming Lei @ 2018-04-10 10:04 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Bart Van Assche, Jens Axboe, linux-block, Christoph Hellwig,
	Tejun Heo, Sagi Grimberg, Israel Rukshin, Max Gurtovoy, stable

On Tue, Apr 10, 2018 at 03:59:30PM +0800, jianchao.wang wrote:
> Hi Bart
> 
> On 04/10/2018 09:34 AM, Bart Van Assche wrote:
> > If a completion occurs after blk_mq_rq_timed_out() has reset
> > rq->aborted_gstate and the request is again in flight when the timeout
> > expires then a request will be completed twice: a first time by the
> > timeout handler and a second time when the regular completion occurs
> 
> Would you please explain in more detail why the request could be completed twice?
> 
> Is it the scenario you described in https://marc.info/?l=linux-block&m=151796816127318, quoted below?
> 
> The following race can occur between the code that resets the timer
> and completion handling:
> - The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
> - A completion occurs and blk_mq_complete_request() calls
>   __blk_mq_complete_request().
> - The timeout code calls blk_add_timer() and that function sets the
>   request deadline and adjusts the timer.
> - __blk_mq_complete_request() frees the request tag.
> - The timer fires and the timeout handler gets called for a freed
>   request.
> If yes, how does the timeout handler get hold of the freed request when its tag has been freed?

Thinking about this patch further.

The issue may not be a double completion issue; it may be the
following behaviour, which breaks NVMe or other drivers easily:

1) there is a long delay (synchronize_rcu()) between setting
rq->aborted_gstate and handling the timeout in blk_mq_rq_timed_out().

2) during the long delay, the rq may be completed by hardware; then
if the following timeout is handled as BLK_EH_RESET_TIMER, it is a
driver bug, and the driver's .timeout() may be confused by this
behaviour, I guess.

In theory this behaviour exists in all these approaches, but it is
just easier to trigger if a long delay is introduced before the
timeout is handled.

Thanks,
Ming

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10 10:04   ` Ming Lei
@ 2018-04-10 12:04     ` Shan Hai
  0 siblings, 0 replies; 47+ messages in thread
From: Shan Hai @ 2018-04-10 12:04 UTC (permalink / raw)
  To: Ming Lei, jianchao.wang
  Cc: Bart Van Assche, Jens Axboe, linux-block, Christoph Hellwig,
	Tejun Heo, Sagi Grimberg, Israel Rukshin, Max Gurtovoy, stable



On 04/10/2018 18:04, Ming Lei wrote:
> On Tue, Apr 10, 2018 at 03:59:30PM +0800, jianchao.wang wrote:
>> Hi Bart
>>
>> On 04/10/2018 09:34 AM, Bart Van Assche wrote:
>>> If a completion occurs after blk_mq_rq_timed_out() has reset
>>> rq->aborted_gstate and the request is again in flight when the timeout
>>> expires then a request will be completed twice: a first time by the
>>> timeout handler and a second time when the regular completion occurs
>> Would you please explain in more detail why the request could be completed twice?
>>
>> Is it the scenario you described in https://marc.info/?l=linux-block&m=151796816127318, quoted below?
>>
>> The following race can occur between the code that resets the timer
>> and completion handling:
>> - The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
>> - A completion occurs and blk_mq_complete_request() calls
>>    __blk_mq_complete_request().
>> - The timeout code calls blk_add_timer() and that function sets the
>>    request deadline and adjusts the timer.
>> - __blk_mq_complete_request() frees the request tag.
>> - The timer fires and the timeout handler gets called for a freed
>>    request.
>> If yes, how does the timeout handler get hold of the freed request when its tag has been freed?
> Thinking about this patch further.
>
> The issue may not be a double completion issue; it may be the
> following behaviour, which breaks NVMe or other drivers easily:
>
> 1) there is a long delay (synchronize_rcu()) between setting
> rq->aborted_gstate and handling the timeout in blk_mq_rq_timed_out().
>
> 2) during the long delay, the rq may be completed by hardware; then
> if the following timeout is handled as BLK_EH_RESET_TIMER, it is a
> driver bug, and the driver's .timeout() may be confused by this
> behaviour, I guess.
>
> In theory this behaviour exists in all these approaches, but it is
> just easier to trigger if a long delay is introduced before the
> timeout is handled.

Or think of it as below?


                    C                           C                C
+-----------------------------+---------------+-----------+
S                                   T F              E

Request life time line:
S: start
T: timed out
F: found (by timer), inflight but timed out because of a long delay
E: completed by timeout handler
C: regular completion

normal request completion time range: (S, T)
completion in (T, F) is fine since the request is not in flight anymore
race window time range: (F, E)

If the delayed request is completed in the (F, E) range then it could be
completed twice: by the regular completion first and then by the timeout
handler.

Thanks
Shan Hai
> Thanks,
> Ming

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10  8:41 ` Ming Lei
@ 2018-04-10 12:58     ` Bart Van Assche
  0 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2018-04-10 12:58 UTC (permalink / raw)
  To: ming.lei; +Cc: linux-block, israelr, sagi, hch, stable, axboe, maxg, tj

On Tue, 2018-04-10 at 16:41 +0800, Ming Lei wrote:
> On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> > If a completion occurs after blk_mq_rq_timed_out() has reset
> > rq->aborted_gstate and the request is again in flight when the timeout
> 
> Given rq->aborted_gstate is reset only for BLK_EH_RESET_TIMER, I
> think you are addressing the race between normal completion and
> the timeout of BLK_EH_RESET_TIMER.

Yes, that's correct. I will make this more explicit.

> > expires then a request will be completed twice: a first time by the
> > timeout handler and a second time when the regular completion occurs.
> 
> This patch looks simpler, and more like the previous model of
> using blk_mark_rq_complete().

That's also correct.

> I have one question:
> 
> - with this patch, rq's state is updated atomically with cmpxchg. Suppose
> one rq times out and its state is updated to MQ_RQ_COMPLETE by
> blk_mq_check_expired(), then ops->timeout() is run and
> BLK_EH_RESET_TIMER is returned, so blk_mq_add_timer(req, MQ_RQ_COMPLETE,
> MQ_RQ_IN_FLIGHT) is called to update rq's state to MQ_RQ_IN_FLIGHT.
> 
> Now the original normal completion may still occur after rq's state
> becomes MQ_RQ_IN_FLIGHT, so isn't the double completion still possible
> with this patch? Maybe I am wrong; please explain a bit.

That scenario won't result in a double completion. After the timer has
been reset the block driver blk_mq_complete_request() call will change
the request state from MQ_RQ_IN_FLIGHT into MQ_RQ_COMPLETE. The next
time blk_mq_check_expired() is called it will execute the following code:

	blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);

That function call only changes the request state if the current state is
IN_FLIGHT. However, the blk_mq_complete_request() call changed the request
state into COMPLETE. Hence, the above blk_mq_change_rq_state() call will
return false and the blk-mq timeout code will skip this request. If the
request had already been reused and marked again as IN_FLIGHT then its
deadline would also have been updated and the request would be skipped by
the timeout code because its deadline had not yet expired.

> Also, Tejun's approach needs a synchronize_rcu() between marking the
> request COMPLETE and handling its timeout, and that delay can be quite
> long; I guess that makes the race easier to trigger?

I have done what I could to trigger races between the regular completion
path and the timeout code in my tests. Without this patch, if I run the
srp-test software I see crashes being reported in the rdma_rxe driver, but
with this patch applied I don't see any crashes with my tests.

Bart.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
@ 2018-04-10 12:58     ` Bart Van Assche
  0 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2018-04-10 12:58 UTC (permalink / raw)
  To: ming.lei; +Cc: linux-block, israelr, sagi, hch, stable, axboe, maxg, tj

On Tue, 2018-04-10 at 16:41 +0800, Ming Lei wrote:
> On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> > If a completion occurs after blk_mq_rq_timed_out() has reset
> > rq->aborted_gstate and the request is again in flight when the timeout
> 
> Given rq->aborted_gstate is reset only for BLK_EH_RESET_TIMER, I
> think you are addressing the race between normal completion and
> the timeout of BLK_EH_RESET_TIMER.

Yes, that's correct. I will make this more explicit.

> > expires then a request will be completed twice: a first time by the
> > timeout handler and a second time when the regular completion occurs.
> 
> This patch looks simpler, and more like the previous model of
> using blk_mark_rq_complete().

That's also correct.

> I have one question:
> 
> - with this patch, rq's state is updated atomically with cmpxchg. Suppose
> one rq is timed out, the rq's state is updated to MQ_RQ_COMPLETE by
> blk_mq_check_expired(), then ops->timeout() is run and
> BLK_EH_RESET_TIMER is returned, so blk_mq_add_timer(req, MQ_RQ_COMPLETE,
> MQ_RQ_IN_FLIGHT) is called to update rq's state into MQ_RQ_IN_FLIGHT.
> 
> Now the original normal completion may still occur after rq's state
> becomes MQ_RQ_IN_FLIGHT, and it seems there is still the double completion
> with this patch? Maybe I am wrong, please explain a bit.

That scenario won't result in a double completion. After the timer has
been reset the block driver blk_mq_complete_request() call will change
the request state from MQ_RQ_IN_FLIGHT into MQ_RQ_COMPLETE. The next
time blk_mq_check_expired() is called it will execute the following code:

	blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);

That function call only changes the request state if the current state is
IN_FLIGHT. However, the blk_mq_complete_request() call changed the request
state into COMPLETE. Hence, the above blk_mq_change_rq_state() call will
return false and the blk-mq timeout code will skip this request. If the
request would already have been reused and would have been marked again as
IN_FLIGHT then its deadline will also have been updated and the request
will be skipped by the timeout code because its deadline has not yet
expired.
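
For reference, blk_mq_change_rq_state() boils down to a single cmpxchg().
Here is a minimal sketch, assuming the state lives in the two low bits of
a per-request word as in this patch (the names are illustrative; whether
that word should be a plain unsigned long or an atomic_long_t is discussed
elsewhere in this thread):

enum mq_rq_state {
	MQ_RQ_IDLE	= 0,
	MQ_RQ_IN_FLIGHT	= 1,
	MQ_RQ_COMPLETE	= 2,
};

#define RQ_STATE_MASK	0x3UL

/*
 * Atomically switch @rq from @old_state to @new_state. Returns false if
 * another context changed the state first; that failed cmpxchg() is what
 * makes the timeout path skip a request that has already been completed.
 */
static bool blk_mq_change_rq_state(struct request *rq,
				   enum mq_rq_state old_state,
				   enum mq_rq_state new_state)
{
	unsigned long old_val = (READ_ONCE(rq->mq_deadline) & ~RQ_STATE_MASK) |
				old_state;
	unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state;

	return cmpxchg(&rq->mq_deadline, old_val, new_val) == old_val;
}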

> And synchronize_rcu() is needed by Tejun's approach between marking
> COMPLETE and handling this rq's timeout, and the time can be quite long,
> I guess it might be easier to trigger?

I have done what I could to trigger races between the regular completion
path and the timeout code in my tests. Without this patch if I run the
srp-test software I see crashes being reported in the rdma_rxe driver but
with this patch applied I don't see any crashes with my tests.

Bart.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
@ 2018-04-10 13:01     ` Bart Van Assche
  0 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2018-04-10 13:01 UTC (permalink / raw)
  To: jianchao.w.wang, axboe; +Cc: hch, tj, israelr, linux-block, maxg, stable, sagi

On Tue, 2018-04-10 at 15:59 +0800, jianchao.wang wrote:
> If yes, how does the timeout handler get the freed request when the tag has been freed?

Hello Jianchao,

Have you noticed that the timeout handler does not check whether or not the request
tag is freed? Additionally, I don't think it would be possible to add such a check
to the timeout code without introducing a new race condition.

Bart.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
@ 2018-04-10 13:26     ` Bart Van Assche
  0 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2018-04-10 13:26 UTC (permalink / raw)
  To: hch; +Cc: tj, israelr, linux-block, maxg, stable, axboe, sagi

On Tue, 2018-04-10 at 11:55 +0200, Christoph Hellwig wrote:
> I don't think we need the atomic_long_cmpxchg, and can do with a plain
> cmpxchg.  Also unterminated cmpxchg loops are a bad idea, but I think
> both callers are protected from other changes so we can turn that
> into a WARN_ON().

Hello Christoph,

I will remove the blk_mq_rq_update_state() function so that's one while
loop less.

Can you explain why you think that using cmpxchg() is safe in this context?
The regular completion path and the timeout code can both execute e.g. the
following code concurrently:

	blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);

That's why I think that we need an atomic compare and exchange instead of
cmpxchg(). What I found in the Intel Software Developer Manual seems to
confirm that:

Description

Compares the value in the AL, AX, EAX, or RAX register with the first
operand (destination operand). If the two values are equal, the second
operand (source operand) is loaded into the destination operand. Otherwise,
the destination operand is loaded into the AL, AX, EAX or RAX register. RAX
register is available only in 64-bit mode. This instruction can be used
with a LOCK prefix to allow the instruction to be executed atomically. To
simplify the interface to the processor’s bus, the destination operand
receives a write cycle without regard to the result of the comparison. The
destination operand is written back if the comparison fails; otherwise, the
source operand is written into the destination. (The processor never
produces a locked read without also producing a locked write.)

Thanks,

Bart.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10 12:58     ` Bart Van Assche
  (?)
@ 2018-04-10 13:55     ` Ming Lei
  2018-04-10 14:09         ` Bart Van Assche
  -1 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2018-04-10 13:55 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, israelr, sagi, hch, stable, axboe, maxg, tj

On Tue, Apr 10, 2018 at 12:58:04PM +0000, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 16:41 +0800, Ming Lei wrote:
> > On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> > > If a completion occurs after blk_mq_rq_timed_out() has reset
> > > rq->aborted_gstate and the request is again in flight when the timeout
> > 
> > Given rq->aborted_gstate is reset only for BLK_EH_RESET_TIMER, I
> > think you are addressing the race between normal completion and
> > the timeout of BLK_EH_RESET_TIMER.
> 
> Yes, that's correct. I will make this more explicit.
> 
> > > expires then a request will be completed twice: a first time by the
> > > timeout handler and a second time when the regular completion occurs.
> > 
> > This patch looks simpler, and more like the previous model of
> > using blk_mark_rq_complete().
> 
> That's also correct.
> 
> > I have one question:
> > 
> > - with this patch, rq's state is updated atomically as cmpxchg. Suppose
> > one rq is timed out, the rq's state is updated as MQ_RQ_COMPLETE by
> > blk_mq_check_expired(), then ops->timeout() is run and
> > BLK_EH_RESET_TIMER is returned, so blk_mq_add_timer(req, MQ_RQ_COMPLETE,
> > MQ_RQ_IN_FLIGHT) is called to update rq's state into MQ_RQ_IN_FLIGHT.
> > 
> > Now the original normal completion still may occur after rq's state
> > becomes MQ_RQ_IN_FLIGHT, and seems there is still the double completion
> > with this patch? Maybe I am wrong, please explain a bit.
> 
> That scenario won't result in a double completion. After the timer has
> been reset the block driver blk_mq_complete_request() call will change
> the request state from MQ_RQ_IN_FLIGHT into MQ_RQ_COMPLETE. The next
> time blk_mq_check_expired() is called it will execute the following code:
> 
> 	blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);
> 
> That function call only changes the request state if the current state is
> IN_FLIGHT. However, the blk_mq_complete_request() call changed the request
> state into COMPLETE. Hence, the above blk_mq_change_rq_state() call will
> return false and the blk-mq timeout code will skip this request. If the
> request would already have been reused and would have been marked again as
> IN_FLIGHT then its deadline will also have been updated and the request
> will be skipped by the timeout code because its deadline has not yet
> expired.

OK.

Then I have the same question as Jianchao: what is the actual double
completion in Linus' tree between BLK_EH_RESET_TIMER and normal completion?

Here is my understanding:

1) when a timeout is detected on one request, its aborted_gstate is
updated in blk_mq_check_expired().

2) run synchronize_rcu(), and make sure all pending completions are done

3) run blk_mq_rq_timed_out()
- ret = ops->timeout
- blk_mq_rq_update_aborted_gstate(req, 0)
- blk_add_timer(req);

If normal completion is done between 1) and the reset of aborted_gstate in 3),
blk_mq_complete_request() will be called and will find that aborted_gstate
is set, so the rq won't really be completed.

If normal completion is done after the reset of aborted_gstate in 3), it
should be the same as with this patch applied.

> 
> > And synchronize_rcu() is needed by Tejun's approach between marking
> > COMPLETE and handling this rq's timeout, and the time can be quite long,
> > I guess it might be easier to trigger?
> 
> I have done what I could to trigger races between the regular completion
> path and the timeout code in my tests. Without this patch if I run the
> srp-test software I see crashes being reported in the rdma_rxe driver but
> with this patch applied I don't see any crashes with my tests.

I believe this patch may fix this issue, but I think the idea behind it
should be understood.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
@ 2018-04-10 14:09         ` Bart Van Assche
  0 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2018-04-10 14:09 UTC (permalink / raw)
  To: ming.lei; +Cc: linux-block, israelr, sagi, hch, stable, axboe, maxg, tj

On Tue, 2018-04-10 at 21:55 +0800, Ming Lei wrote:
> Then I have the same question as Jianchao: what is the actual double
> completion in Linus' tree between BLK_EH_RESET_TIMER and normal completion?
> 
> Here is my understanding:
> 
> 1) when a timeout is detected on one request, its aborted_gstate is
> updated in blk_mq_check_expired().
> 
> 2) run synchronize_rcu(), and make sure all pending completions are done
> 
> 3) run blk_mq_rq_timed_out()
> - ret = ops->timeout
> - blk_mq_rq_update_aborted_gstate(req, 0)
> - blk_add_timer(req);
> 
> If normal completion is done between 1) and the reset of aborted_gstate in 3),
> blk_mq_complete_request() will be called and will find that aborted_gstate
> is set, so the rq won't really be completed.
> 
> If normal completion is done after the reset of aborted_gstate in 3), it
> should be the same as with this patch applied.

Hello Ming,

Please keep in mind that all synchronize_rcu() does is to wait for pre-
existing RCU readers to finish. synchronize_rcu() does not prevent that new
rcu_read_lock() calls happen. It is e.g. possible that after
blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
completion occurs. If that request is not reused before the timer that was
restarted by the timeout code expires, that request will be completed twice.
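
Put differently, the interleaving I am describing looks like this (a sketch
in the style of the timeline earlier in this thread):

  timeout worker                          regular completion path
  --------------                          -----------------------
  set rq->aborted_gstate
  synchronize_rcu()   (waits only for
    already-started RCU read sections)
  ops->timeout() -> BLK_EH_RESET_TIMER
  blk_mq_rq_update_aborted_gstate(rq, 0)
                                          blk_mq_complete_request(rq)
                                            (aborted_gstate is zero again,
                                             so rq is completed normally)
  the timer restarted by blk_add_timer()
  expires before rq is reused
  -> the timeout path completes rq a second time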

Bart.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10  1:34 [PATCH v4] blk-mq: Fix race conditions in request timeout handling Bart Van Assche
                   ` (2 preceding siblings ...)
  2018-04-10  9:55 ` Christoph Hellwig
@ 2018-04-10 14:20 ` Tejun Heo
  2018-04-10 14:30     ` Bart Van Assche
  3 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2018-04-10 14:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy, stable

Hello, Bart.

On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> Since the request state can be updated from two different contexts,
> namely regular completion and request timeout, this race cannot be
> fixed with RCU synchronization only. Fix this race as follows:

Well, it can be and the patches have been posted months ago.  It just
needed a repro case to confirm the fix, which we now seem to have.

Switching to another model might be better but let's please do that
with the right rationales.  A good portion of this seems to be built
on misunderstandings.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10 14:09         ` Bart Van Assche
  (?)
@ 2018-04-10 14:30         ` Ming Lei
  2018-04-10 15:02             ` Bart Van Assche
  -1 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2018-04-10 14:30 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, israelr, sagi, hch, stable, axboe, maxg, tj

On Tue, Apr 10, 2018 at 02:09:33PM +0000, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 21:55 +0800, Ming Lei wrote:
> > Then I have the same question as Jianchao: what is the actual double
> > completion in Linus' tree between BLK_EH_RESET_TIMER and normal completion?
> > 
> > Here is my understanding:
> > 
> > 1) when a timeout is detected on one request, its aborted_gstate is
> > updated in blk_mq_check_expired().
> > 
> > 2) run synchronize_rcu(), and make sure all pending completions are done
> > 
> > 3) run blk_mq_rq_timed_out()
> > - ret = ops->timeout
> > - blk_mq_rq_update_aborted_gstate(req, 0)
> > - blk_add_timer(req);
> > 
> > If normal completion is done between 1) and the reset of aborted_gstate in 3),
> > blk_mq_complete_request() will be called and will find that aborted_gstate
> > is set, so the rq won't really be completed.
> > 
> > If normal completion is done after the reset of aborted_gstate in 3), it
> > should be the same as with this patch applied.
> 
> Hello Ming,
> 
> Please keep in mind that all synchronize_rcu() does is to wait for pre-
> existing RCU readers to finish. synchronize_rcu() does not prevent that new
> rcu_read_lock() calls happen. It is e.g. possible that after

That is right, and I also mentioned that normal completion can be done
between 1) and the reset of aborted_gstate in 3).

> blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
> completion occurs. If that request is not reused before the timer that was
> restarted by the timeout code expires, that request will be completed twice.

In this patch, blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT) is
called for handling BLK_EH_RESET_TIMER. And after rq's state is changed
to MQ_RQ_IN_FLIGHT, normal completion can still come and complete this rq,
just like what you described above, right?


Thanks,
Ming

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
@ 2018-04-10 14:30     ` Bart Van Assche
  0 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2018-04-10 14:30 UTC (permalink / raw)
  To: tj; +Cc: hch, maxg, israelr, linux-block, stable, axboe, sagi

On Tue, 2018-04-10 at 07:20 -0700, Tejun Heo wrote:
> On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> > Since the request state can be updated from two different contexts,
> > namely regular completion and request timeout, this race cannot be
> > fixed with RCU synchronization only. Fix this race as follows:
> 
> Well, it can be and the patches have been posted months ago.

That's not correct. I have explained to you in detail that the two patches you
posted do not fix all the races fixed by the patch at the start of this
e-mail thread.

> Switching to another model might be better but let's please do that
> with the right rationales.  A good portion of this seems to be built
> on misunderstandings.

Which misunderstandings? I'm not aware of any misunderstandings on my side.
Additionally, tests with two different block drivers (NVMeOF initiator and
the SRP initiator driver) have shown that the current blk-mq timeout
implementation with or without your two patches applied results in subtle and
hard-to-debug crashes and/or memory corruption. That is not the case for the
patch at the start of this thread. The latest report of a crash I ran into
myself and that is fixed by the patch at the start of this thread is
available here: https://www.spinics.net/lists/linux-rdma/msg63240.html.

Please also keep in mind that if this patch is accepted, that does not
prevent it from being replaced with an RCU-based solution later on. If
anyone comes up at any time with a reliably working RCU-based solution I
will be happy to accept a revert of this patch and I will help review that
RCU-based solution.

Bart.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10 13:01     ` Bart Van Assche
  (?)
@ 2018-04-10 14:32     ` jianchao.wang
  -1 siblings, 0 replies; 47+ messages in thread
From: jianchao.wang @ 2018-04-10 14:32 UTC (permalink / raw)
  To: Bart Van Assche, axboe; +Cc: hch, tj, israelr, linux-block, maxg, stable, sagi

Hi Bart

Thanks for your kind response.

On 04/10/2018 09:01 PM, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 15:59 +0800, jianchao.wang wrote:
>> If yes, how does the timeout handler get the freed request when the tag has been freed?
> 
> Hello Jianchao,
> 
> Have you noticed that the timeout handler does not check whether or not the request
> tag is freed? Additionally, I don't think it would be possible to add such a check
> to the timeout code without introducing a new race condition.

Doesn't blk_mq_queue_tag_busy_iter() only iterate over the tags that have been allocated/set?
When the request is freed, the tag will be cleared through blk_mq_put_tag->sbitmap_queue_clear.
Am I missing something else?
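
My mental model of that iteration is roughly this (a simplified sketch, not
the actual kernel code):

	unsigned int tag;

	/* only tag bits that are still set are visited */
	for_each_set_bit(tag, tags->bitmap, tags->depth) {
		struct request *rq = tags->rqs[tag];

		if (rq)
			fn(hctx, rq, data, reserved); /* e.g. blk_mq_check_expired() */
	}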

Thanks
Jianchao

> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10 14:30     ` Bart Van Assche
  (?)
@ 2018-04-10 14:33     ` tj
  -1 siblings, 0 replies; 47+ messages in thread
From: tj @ 2018-04-10 14:33 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, maxg, israelr, linux-block, stable, axboe, sagi

Hello,

On Tue, Apr 10, 2018 at 02:30:26PM +0000, Bart Van Assche wrote:
> > Switching to another model might be better but let's please do that
> > with the right rationales.  A good portion of this seems to be built
> > on misunderstandings.
> 
> Which misunderstandings? I'm not aware of any misunderstandings on my side.
> Additionally, tests with two different block drivers (NVMeOF initiator and
> the SRP initiator driver) have shown that the current blk-mq timeout
> implementation with or without your two patches applied results in subtle and
> hard-to-debug crashes and/or memory corruption. That is not the case for the

I must have missed that part.  Which tests were they?

> patch at the start of this thread. The latest report of a crash I ran into
> myself and that is fixed by the patch at the start of this thread is
> available here: https://www.spinics.net/lists/linux-rdma/msg63240.html.
> 
> Please also keep in mind that if this patch is accepted, that does not
> prevent it from being replaced with an RCU-based solution later on. If
> anyone comes up at any time with a reliably working RCU-based solution I
> will be happy to accept a revert of this patch and I will help review that
> RCU-based solution.

Oh, switching is fine but let's get in sync first.  Who has the repro
cases and what was tested?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10  9:55 ` Christoph Hellwig
  2018-04-10 13:26     ` Bart Van Assche
@ 2018-04-10 14:41   ` Jens Axboe
  1 sibling, 0 replies; 47+ messages in thread
From: Jens Axboe @ 2018-04-10 14:41 UTC (permalink / raw)
  To: Christoph Hellwig, Bart Van Assche
  Cc: linux-block, Tejun Heo, Sagi Grimberg, Israel Rukshin,
	Max Gurtovoy, stable

On 4/10/18 3:55 AM, Christoph Hellwig wrote:
> On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
>> If a completion occurs after blk_mq_rq_timed_out() has reset
>> rq->aborted_gstate and the request is again in flight when the timeout
>> expires then a request will be completed twice: a first time by the
>> timeout handler and a second time when the regular completion occurs.
>>
>> Additionally, the blk-mq timeout handling code ignores completions that
>> occur after blk_mq_check_expired() has been called and before
>> blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
>> timeout handler always returns BLK_EH_RESET_TIMER then the result will
>> be that the request never terminates.
>>
>> Since the request state can be updated from two different contexts,
>> namely regular completion and request timeout, this race cannot be
>> fixed with RCU synchronization only. Fix this race as follows:
>> - Split __deadline in two variables, namely lq_deadline for legacy
>>   request queues and mq_deadline for blk-mq request queues. Use atomic
>>   operations to update mq_deadline.
> 
> I don't think we need the atomic_long_cmpxchg, and can do with a plain
> cmpxchg.  Also unterminated cmpxchg loops are a bad idea, but I think
> both callers are protected from other changes so we can turn that
> into a WARN_ON().  That plus some minor cleanups in the version below,
> only boot tested:

This version looks reasonable, let me throw it through the testing
machinery.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10 13:26     ` Bart Van Assche
  (?)
@ 2018-04-10 14:50     ` hch
  -1 siblings, 0 replies; 47+ messages in thread
From: hch @ 2018-04-10 14:50 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, tj, israelr, linux-block, maxg, stable, axboe, sagi

On Tue, Apr 10, 2018 at 01:26:39PM +0000, Bart Van Assche wrote:
> Can you explain why you think that using cmpxchg() is safe in this context?
> The regular completion path and the timeout code can both execute e.g. the
> following code concurrently:
> 
> 	blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);
> 
> That's why I think that we need an atomic compare and exchange instead of
> cmpxchg(). What I found in the Intel Software Developer Manual seems to
> confirm that:

The Linux cmpxchg() helper is supposed to always be atomic, you only
need atomic_cmpxchg and friends if you want to operate on an atomic_t.
(same for the _long variant).

In fact if you look at the x86 implementation of the cmpxchg() macro
you'll see that it will use the lock prefix if the kernel has been
built with CONFIG_SMP enabled.
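
For example (a sketch; the variables are illustrative), the following two
calls are equally atomic and differ only in the C type they operate on:

	unsigned long word = 0;				/* plain field */
	atomic_long_t aword = ATOMIC_LONG_INIT(0);	/* atomic_long_t field */
	unsigned long old = 0, new = 1;

	cmpxchg(&word, old, new);		/* LOCK CMPXCHG on SMP x86 */
	atomic_long_cmpxchg(&aword, old, new);	/* same instruction underneath */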

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
@ 2018-04-10 15:02             ` Bart Van Assche
  0 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2018-04-10 15:02 UTC (permalink / raw)
  To: ming.lei; +Cc: linux-block, israelr, sagi, hch, stable, axboe, maxg, tj

On Tue, 2018-04-10 at 22:30 +0800, Ming Lei wrote:
> On Tue, Apr 10, 2018 at 02:09:33PM +0000, Bart Van Assche wrote:
> > Please keep in mind that all synchronize_rcu() does is to wait for pre-
> > existing RCU readers to finish. synchronize_rcu() does not prevent that new
> > rcu_read_lock() calls happen. It is e.g. possible that after
> 
> That is right, and I also mentioned normal completion can be done between
> 1) and reset aborted_gstate in 3).
> 
> > blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
> > completion occurs. If that request is not reused before the timer that was
> > restarted by the timeout code expires, that request will be completed twice.
> 
> In this patch, blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT) is
> called for handling BLK_EH_RESET_TIMER. And after rq's state is changed
> to MQ_RQ_IN_FLIGHT, normal completion can still come and complete this rq,
> just like what you described above, right?

I should have added the following in my previous e-mail: "if the completion
occurs after blk_mq_check_expired() examined rq->gstate and before it updated
rq->aborted_gstate". That race can occur with the current upstream blk-mq
timeout handling code but not after my patch has been applied.

Bart.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10 15:02             ` Bart Van Assche
  (?)
@ 2018-04-10 15:25             ` Ming Lei
  2018-04-10 15:30               ` tj
  -1 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2018-04-10 15:25 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, israelr, sagi, hch, stable, axboe, maxg, tj

On Tue, Apr 10, 2018 at 03:02:11PM +0000, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 22:30 +0800, Ming Lei wrote:
> > On Tue, Apr 10, 2018 at 02:09:33PM +0000, Bart Van Assche wrote:
> > > Please keep in mind that all synchronize_rcu() does is to wait for pre-
> > > existing RCU readers to finish. synchronize_rcu() does not prevent that new
> > > rcu_read_lock() calls happen. It is e.g. possible that after
> > 
> > That is right, and I also mentioned that normal completion can be done
> > between 1) and the reset of aborted_gstate in 3).
> > 
> > > blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
> > > completion occurs. If that request is not reused before the timer that was
> > > restarted by the timeout code expires, that request will be completed twice.
> > 
> > In this patch, blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT) is
> > called for handling BLK_EH_RESET_TIMER. And after rq's state is changed
> > to MQ_RQ_IN_FLIGHT, normal completion can still come and complete this rq,
> > just like what you described above, right?
> 
> I should have added the following in my previous e-mail: "if the completion
> occurs after blk_mq_check_expired() examined rq->gstate and before it updated
> rq->aborted_gstate".

That is the difference between tj's approach and your patch, but the
difference is just in the timing.

See this patch:

+   if (time_after_eq(jiffies, deadline) &&
+       blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
+       blk_mq_rq_timed_out(rq, reserved);

Normal completion can still happen between blk_mq_change_rq_state()
and blk_mq_rq_timed_out().

In tj's approach, there is a synchronize_rcu() between writing aborted_gstate
and blk_mq_rq_timed_out(), so it is easier for normal completion to happen
during that big window.
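
To make the remaining window concrete (a sketch):

  timeout worker                          regular completion path
  --------------                          -----------------------
  blk_mq_change_rq_state(rq,
      MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)
                                          blk_mq_complete_request(rq)
                                            (cmpxchg IN_FLIGHT -> COMPLETE
                                             fails, the completion is dropped)
  blk_mq_rq_timed_out(rq, reserved)
    .timeout() may return RESET_TIMER even
    though the request has actually been done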

> That race can occur with the current upstream blk-mq
> timeout handling code but not after my patch has been applied.

In theory, the 'race' can occur with your patch too, but the window
is just much smaller.

If you think my comment is correct, please update your commit log.

-- 
Ming

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10 15:25             ` Ming Lei
@ 2018-04-10 15:30               ` tj
  2018-04-10 15:38                 ` Ming Lei
  0 siblings, 1 reply; 47+ messages in thread
From: tj @ 2018-04-10 15:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, linux-block, israelr, sagi, hch, stable, axboe, maxg

Hello, Ming.

On Tue, Apr 10, 2018 at 11:25:54PM +0800, Ming Lei wrote:
> +   if (time_after_eq(jiffies, deadline) &&
> +       blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
> +       blk_mq_rq_timed_out(rq, reserved);
> 
> Normal completion can still happen between blk_mq_change_rq_state()
> and blk_mq_rq_timed_out().
> 
> In tj's approach, there is synchronize_rcu() between writing aborted_gstate
> and blk_mq_rq_timed_out, it is easier for normal completion to happen during
> the big window.

I don't think plugging this hole is all that difficult, but this
shouldn't lead to any critical failures.  If so, that'd be a driver
bug.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10 15:30               ` tj
@ 2018-04-10 15:38                 ` Ming Lei
  2018-04-10 15:40                   ` tj
  0 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2018-04-10 15:38 UTC (permalink / raw)
  To: tj; +Cc: Bart Van Assche, linux-block, israelr, sagi, hch, stable, axboe, maxg

Hi Tejun,

On Tue, Apr 10, 2018 at 08:30:31AM -0700, tj@kernel.org wrote:
> Hello, Ming.
> 
> On Tue, Apr 10, 2018 at 11:25:54PM +0800, Ming Lei wrote:
> > +   if (time_after_eq(jiffies, deadline) &&
> > +       blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
> > +       blk_mq_rq_timed_out(rq, reserved);
> > 
> > Normal completion still can happen between blk_mq_change_rq_state()
> > and blk_mq_rq_timed_out().
> > 
> > In tj's approach, there is synchronize_rcu() between writing aborted_gstate
> > and blk_mq_rq_timed_out, it is easier for normal completion to happen during
> > the big window.
> 
> I don't think plugging this hole is all that difficult, but this
> shouldn't lead to any critical failures.  If so, that'd be a driver
> bug.

I agree, the issue should be in the driver's irq handler and .timeout in
theory.

For example, even though one request has been completed by the irq handler,
.timeout may still return RESET_TIMER.
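
Something like this hypothetical handler (the foo_* names are made up; the
callback signature is the real blk-mq one):

static enum blk_eh_timer_return foo_timeout(struct request *rq, bool reserved)
{
	struct foo_cmd *cmd = blk_mq_rq_to_pdu(rq);

	/*
	 * The device may have completed @rq and raised its irq just before
	 * or during this check; the handler cannot tell, so returning
	 * BLK_EH_RESET_TIMER here races with that completion.
	 */
	if (foo_device_seems_alive(cmd->dev))
		return BLK_EH_RESET_TIMER;

	return BLK_EH_HANDLED;
}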

Thanks,
Ming

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10 15:38                 ` Ming Lei
@ 2018-04-10 15:40                   ` tj
  2018-04-10 21:33                     ` tj
  0 siblings, 1 reply; 47+ messages in thread
From: tj @ 2018-04-10 15:40 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, linux-block, israelr, sagi, hch, stable, axboe, maxg

Hello,

On Tue, Apr 10, 2018 at 11:38:18PM +0800, Ming Lei wrote:
> I agree, the issue should be in the driver's irq handler and .timeout in
> theory.
> 
> For example, even though one request has been completed by the irq handler,
> .timeout may still return RESET_TIMER.

blk-mq can use a separate flag to track normal completions during
timeout and complete the request normally on RESET_TIMER if the flag
is set while EH was in progress.  With a bit of care, we'd be able to
plug the race completely.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10 15:40                   ` tj
@ 2018-04-10 21:33                     ` tj
  2018-04-10 21:46                         ` Bart Van Assche
  0 siblings, 1 reply; 47+ messages in thread
From: tj @ 2018-04-10 21:33 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, linux-block, israelr, sagi, hch, stable, axboe, maxg

Hello,

On Tue, Apr 10, 2018 at 08:40:43AM -0700, tj@kernel.org wrote:
> On Tue, Apr 10, 2018 at 11:38:18PM +0800, Ming Lei wrote:
> > I agree, the issue should be in the driver's irq handler and .timeout in
> > theory.
> > 
> > For example, even though one request has been completed by the irq handler,
> > .timeout may still return RESET_TIMER.
> 
> blk-mq can use a separate flag to track normal completions during
> timeout and complete the request normally on RESET_TIMER if the flag
> is set while EH was in progress.  With a bit of care, we'd be able to
> plug the race completely.

So, something like the following.  Just compile tested.  The extra
dedicated field is kinda unfortunate but we need something like that
no matter how we do this because the involved completion marking
violates the ownership (we want to record the normal completion while
timeout handling is in progress).  Alternatively, we can add an atomic
flags field.  The addition doesn't change the size of struct request
because it fits in a hole but that hole seems removable, so...

Anyways, it'd be great if someone can test this combined with the
previous two rcu sync patches.

Thanks.
---
 block/blk-mq.c         |   44 +++++++++++++++++++++++++++++++-------------
 include/linux/blkdev.h |    2 ++
 2 files changed, 33 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
 	hctx_lock(hctx, &srcu_idx);
 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 		__blk_mq_complete_request(rq);
+	else
+		rq->missed_completion = true;
 	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -881,7 +883,7 @@ static void blk_mq_check_expired(struct
 	}
 }
 
-static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
 {
 	struct blk_mq_hw_ctx *hctx;
 	bool has_rcu = false;
@@ -896,7 +898,8 @@ static void blk_mq_timeout_sync_rcu(stru
 		else
 			synchronize_srcu(hctx->srcu);
 
-		hctx->need_sync_rcu = false;
+		if (clear)
+			hctx->need_sync_rcu = false;
 	}
 	if (has_rcu)
 		synchronize_rcu();
@@ -917,25 +920,37 @@ static void blk_mq_terminate_expired(str
 		blk_mq_rq_timed_out(hctx, rq, priv, reserved);
 }
 
-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	/*
 	 * @rq's timer reset has gone through rcu synchronization and is
 	 * visible now.  Allow normal completions again by resetting
 	 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-	 * there's no memory ordering around ->aborted_gstate making it the
-	 * only field safe to update.  Let blk_add_timer() clear it later
-	 * when the request is recycled or times out again.
-	 *
-	 * As nothing prevents from completion happening while
-	 * ->aborted_gstate is set, this may lead to ignored completions
-	 * and further spurious timeouts.
+	 * blk_mq_timeout_reset_cleanup() needs it again and there's no
+	 * memory ordering around ->aborted_gstate making it the only field
+	 * safe to update.  Let blk_add_timer() clear it later when the
+	 * request is recycled or times out again.
 	 */
 	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
 		blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * @rq is now fully returned to the normal path.  If normal
+	 * completion raced timeout handling, execute the missed completion
+	 * here.  This is safe because 1. ->missed_completion can no longer
+	 * be asserted because nothing is timing out right now and 2. if
+	 * ->missed_completion is set, @rq is safe from recycling because
+	 * nobody could have completed it.
+	 */
+	if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+		blk_mq_complete_request(rq);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
@@ -976,7 +991,7 @@ static void blk_mq_timeout_work(struct w
 		 * becomes a problem, we can add per-hw_ctx rcu_head and
 		 * wait in parallel.
 		 */
-		blk_mq_timeout_sync_rcu(q);
+		blk_mq_timeout_sync_rcu(q, true);
 
 		/* terminate the ones we won */
 		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1003,12 @@ static void blk_mq_timeout_work(struct w
 		 * reset racing against recycling.
 		 */
 		if (nr_resets) {
-			blk_mq_timeout_sync_rcu(q);
+			blk_mq_timeout_sync_rcu(q, false);
+			blk_mq_queue_tag_busy_iter(q,
+					blk_mq_timeout_reset_return, NULL);
+			blk_mq_timeout_sync_rcu(q, true);
 			blk_mq_queue_tag_busy_iter(q,
-					blk_mq_finish_timeout_reset, NULL);
+					blk_mq_timeout_reset_cleanup, NULL);
 		}
 	}
 
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -227,6 +227,8 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
+	bool missed_completion;
+
 	/*
 	 * On blk-mq, the lower bits of ->gstate (generation number and
 	 * state) carry the MQ_RQ_* state value and the upper bits the

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
@ 2018-04-10 21:46                         ` Bart Van Assche
  0 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2018-04-10 21:46 UTC (permalink / raw)
  To: tj, ming.lei; +Cc: hch, maxg, israelr, linux-block, stable, sagi, axboe

On Tue, 2018-04-10 at 14:33 -0700, tj@kernel.org wrote:
> +       else
> +               rq->missed_completion = true;

In this patch I found code that sets rq->missed_completion but no code that
clears it. Did I perhaps miss something?

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10 21:46                         ` Bart Van Assche
  (?)
@ 2018-04-10 21:54                         ` tj
  2018-04-11 12:50                             ` Bart Van Assche
                                             ` (2 more replies)
  -1 siblings, 3 replies; 47+ messages in thread
From: tj @ 2018-04-10 21:54 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: ming.lei, hch, maxg, israelr, linux-block, stable, sagi, axboe

On Tue, Apr 10, 2018 at 09:46:23PM +0000, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 14:33 -0700, tj@kernel.org wrote:
> > +       else
> > +               rq->missed_completion = true;
> 
> In this patch I found code that sets rq->missed_completion but no code that
> clears it. Did I perhaps miss something?

Ah, yeah, I was moving it out of add_timer but forgot to actually add
it to the issue path.  Fixed patch below.

BTW, no matter what we do w/ the request handover between normal and
timeout paths, we'd need something similar.  Otherwise, while we can
reduce the window, we can't get rid of it.

Thanks.

---
 block/blk-mq.c         |   45 ++++++++++++++++++++++++++++++++-------------
 include/linux/blkdev.h |    2 ++
 2 files changed, 34 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
 	hctx_lock(hctx, &srcu_idx);
 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 		__blk_mq_complete_request(rq);
+	else
+		rq->missed_completion = true;
 	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
 
 	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 	blk_add_timer(rq);
+	rq->missed_completion = false;
 
 	write_seqcount_end(&rq->gstate_seq);
 	preempt_enable();
@@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct
 	}
 }
 
-static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
 {
 	struct blk_mq_hw_ctx *hctx;
 	bool has_rcu = false;
@@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
 		else
 			synchronize_srcu(hctx->srcu);
 
-		hctx->need_sync_rcu = false;
+		if (clear)
+			hctx->need_sync_rcu = false;
 	}
 	if (has_rcu)
 		synchronize_rcu();
@@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
 		blk_mq_rq_timed_out(hctx, rq, priv, reserved);
 }
 
-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	/*
 	 * @rq's timer reset has gone through rcu synchronization and is
 	 * visible now.  Allow normal completions again by resetting
 	 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-	 * there's no memory ordering around ->aborted_gstate making it the
-	 * only field safe to update.  Let blk_add_timer() clear it later
-	 * when the request is recycled or times out again.
-	 *
-	 * As nothing prevents from completion happening while
-	 * ->aborted_gstate is set, this may lead to ignored completions
-	 * and further spurious timeouts.
+	 * blk_mq_timeout_reset_cleanup() needs it again and there's no
+	 * memory ordering around ->aborted_gstate making it the only field
+	 * safe to update.  Let blk_add_timer() clear it later when the
+	 * request is recycled or times out again.
 	 */
 	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
 		blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * @rq is now fully returned to the normal path.  If normal
+	 * completion raced timeout handling, execute the missed completion
+	 * here.  This is safe because 1. ->missed_completion can no longer
+	 * be asserted because nothing is timing out right now and 2. if
+	 * ->missed_completion is set, @rq is safe from recycling because
+	 * nobody could have completed it.
+	 */
+	if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+		blk_mq_complete_request(rq);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
@@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
 		 * becomes a problem, we can add per-hw_ctx rcu_head and
 		 * wait in parallel.
 		 */
-		blk_mq_timeout_sync_rcu(q);
+		blk_mq_timeout_sync_rcu(q, true);
 
 		/* terminate the ones we won */
 		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
 		 * reset racing against recycling.
 		 */
 		if (nr_resets) {
-			blk_mq_timeout_sync_rcu(q);
+			blk_mq_timeout_sync_rcu(q, false);
+			blk_mq_queue_tag_busy_iter(q,
+					blk_mq_timeout_reset_return, NULL);
+			blk_mq_timeout_sync_rcu(q, true);
 			blk_mq_queue_tag_busy_iter(q,
-					blk_mq_finish_timeout_reset, NULL);
+					blk_mq_timeout_reset_cleanup, NULL);
 		}
 	}
 
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -227,6 +227,8 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
+	bool missed_completion;
+
 	/*
 	 * On blk-mq, the lower bits of ->gstate (generation number and
 	 * state) carry the MQ_RQ_* state value and the upper bits the

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
@ 2018-04-11 12:50                             ` Bart Van Assche
  0 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2018-04-11 12:50 UTC (permalink / raw)
  To: tj; +Cc: linux-block, israelr, sagi, hch, martin, stable, axboe, ming.lei, maxg

On Tue, 2018-04-10 at 14:54 -0700, tj@kernel.org wrote:
> Ah, yeah, I was moving it out of add_timer but forgot to actually add
> it to the issue path.  Fixed patch below.
> 
> BTW, no matter what we do w/ the request handover between normal and
> timeout paths, we'd need something similar.  Otherwise, while we can
> reduce the window, we can't get rid of it.

(+Martin Steigerwald)

Hello Tejun,

Thank you for having shared this patch. It looks interesting to me. What I
know about the blk-mq timeout handling is as follows:
* Nobody who has reviewed the blk-mq timeout handling code with this patch
  applied has reported any shortcomings for that code.
* However, several people have reported kernel crashes that disappear when
  the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts
  memory upon timeout"
  (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html)
  and also to a "RIP: scsi_times_out+0x17" crash during boot
  (https://bugzilla.kernel.org/show_bug.cgi?id=199077).

So we have the choice between two approaches:
(1) apply the patch from your previous e-mail and root-cause and fix the
    crashes referred to above.
(2) apply a patch that makes the crashes reported against v4.16 disappear and
    remove the atomic instructions introduced by such a patch at a later time.

Since crashes have been reported for kernel v4.16 I think we should follow
approach (2). That will remove the time pressure from root-causing and fixing
the crashes reported for the NVMeOF initiator and SCSI initiator drivers.

Thanks,

Bart.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-11 12:50                             ` Bart Van Assche
  (?)
@ 2018-04-11 14:16                             ` tj
  -1 siblings, 0 replies; 47+ messages in thread
From: tj @ 2018-04-11 14:16 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, israelr, sagi, hch, martin, stable, axboe, ming.lei, maxg

Hello, Bart.

On Wed, Apr 11, 2018 at 12:50:51PM +0000, Bart Van Assche wrote:
> Thank you for having shared this patch. It looks interesting to me. What I
> know about the blk-mq timeout handling is as follows:
> * Nobody who has reviewed the blk-mq timeout handling code with this patch
>   applied has reported any shortcomings for that code.
> * However, several people have reported kernel crashes that disappear when
>   the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts
>   memory upon timeout"
>   (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html)
>   and also to a "RIP: scsi_times_out+0x17" crash during boot
>   (https://bugzilla.kernel.org/show_bug.cgi?id=199077).
> 
> So we have the choice between two approaches:
> (1) apply the patch from your previous e-mail and root-cause and fix the
>     crashes referred to above.
> (2) apply a patch that makes the crashes reported against v4.16 disappear and
>     remove the atomic instructions introduced by such a patch at a later time.
> 
> Since crashes have been reported for kernel v4.16 I think we should follow
> approach (2). That will remove the time pressure from root-causing and fixing
> the crashes reported for the NVMeOF initiator and SCSI initiator drivers.

So, it really bothers me how blindly we're going about this.  It isn't
an insurmountable emergency that we have to adopt whatever solution
passed a couple of tests this minute.  We can debug and root-cause
this properly and pick the right solution.  We even have the two most
likely causes already analysed and patches proposed, one of them
months ago.  If we wanna change the handover model, let's do that
because the new one is better, not because of vague fear.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10 21:54                         ` tj
  2018-04-11 12:50                             ` Bart Van Assche
@ 2018-04-11 14:24                           ` Sagi Grimberg
  2018-04-11 14:43                             ` tj
  2018-04-11 16:16                             ` Israel Rukshin
  2018-04-18 16:34                           ` Bart Van Assche
  2 siblings, 2 replies; 47+ messages in thread
From: Sagi Grimberg @ 2018-04-11 14:24 UTC (permalink / raw)
  To: tj, Bart Van Assche, israelr
  Cc: ming.lei, hch, maxg, linux-block, stable, axboe


> Ah, yeah, I was moving it out of add_timer but forgot to actually add
> it to the issue path.  Fixed patch below.
> 
> BTW, no matter what we do w/ the request handover between normal and
> timeout paths, we'd need something similar.  Otherwise, while we can
> reduce the window, we can't get rid of it.
> 
> Thanks.

Just noticed this one, this looks interesting to me as well. Israel,
can you run your test with this patch?

> 
> ---
>   block/blk-mq.c         |   45 ++++++++++++++++++++++++++++++++-------------
>   include/linux/blkdev.h |    2 ++
>   2 files changed, 34 insertions(+), 13 deletions(-)
> 
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
>   	hctx_lock(hctx, &srcu_idx);
>   	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>   		__blk_mq_complete_request(rq);
> +	else
> +		rq->missed_completion = true;
>   	hctx_unlock(hctx, srcu_idx);
>   }
>   EXPORT_SYMBOL(blk_mq_complete_request);
> @@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
>   
>   	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>   	blk_add_timer(rq);
> +	rq->missed_completion = false;
>   
>   	write_seqcount_end(&rq->gstate_seq);
>   	preempt_enable();
> @@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct
>   	}
>   }
>   
> -static void blk_mq_timeout_sync_rcu(struct request_queue *q)
> +static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
>   {
>   	struct blk_mq_hw_ctx *hctx;
>   	bool has_rcu = false;
> @@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
>   		else
>   			synchronize_srcu(hctx->srcu);
>   
> -		hctx->need_sync_rcu = false;
> +		if (clear)
> +			hctx->need_sync_rcu = false;
>   	}
>   	if (has_rcu)
>   		synchronize_rcu();
> @@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
>   		blk_mq_rq_timed_out(hctx, rq, priv, reserved);
>   }
>   
> -static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
> +static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
>   		struct request *rq, void *priv, bool reserved)
>   {
>   	/*
>   	 * @rq's timer reset has gone through rcu synchronization and is
>   	 * visible now.  Allow normal completions again by resetting
>   	 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
> -	 * there's no memory ordering around ->aborted_gstate making it the
> -	 * only field safe to update.  Let blk_add_timer() clear it later
> -	 * when the request is recycled or times out again.
> -	 *
> -	 * As nothing prevents from completion happening while
> -	 * ->aborted_gstate is set, this may lead to ignored completions
> -	 * and further spurious timeouts.
> +	 * blk_mq_timeout_reset_cleanup() needs it again and there's no
> +	 * memory ordering around ->aborted_gstate making it the only field
> +	 * safe to update.  Let blk_add_timer() clear it later when the
> +	 * request is recycled or times out again.
>   	 */
>   	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
>   		blk_mq_rq_update_aborted_gstate(rq, 0);
>   }
>   
> +static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
> +		struct request *rq, void *priv, bool reserved)
> +{
> +	/*
> +	 * @rq is now fully returned to the normal path.  If normal
> +	 * completion raced timeout handling, execute the missed completion
> +	 * here.  This is safe because 1. ->missed_completion can no longer
> +	 * be asserted because nothing is timing out right now and 2. if
> +	 * ->missed_completion is set, @rq is safe from recycling because
> +	 * nobody could have completed it.
> +	 */
> +	if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
> +		blk_mq_complete_request(rq);
> +}
> +
>   static void blk_mq_timeout_work(struct work_struct *work)
>   {
>   	struct request_queue *q =
> @@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
>   		 * becomes a problem, we can add per-hw_ctx rcu_head and
>   		 * wait in parallel.
>   		 */
> -		blk_mq_timeout_sync_rcu(q);
> +		blk_mq_timeout_sync_rcu(q, true);
>   
>   		/* terminate the ones we won */
>   		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
> @@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
>   		 * reset racing against recycling.
>   		 */
>   		if (nr_resets) {
> -			blk_mq_timeout_sync_rcu(q);
> +			blk_mq_timeout_sync_rcu(q, false);
> +			blk_mq_queue_tag_busy_iter(q,
> +					blk_mq_timeout_reset_return, NULL);
> +			blk_mq_timeout_sync_rcu(q, true);
>   			blk_mq_queue_tag_busy_iter(q,
> -					blk_mq_finish_timeout_reset, NULL);
> +					blk_mq_timeout_reset_cleanup, NULL);
>   		}
>   	}
>   
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -227,6 +227,8 @@ struct request {
>   
>   	unsigned int extra_len;	/* length of alignment and padding */
>   
> +	bool missed_completion;
> +

Would be nicer if we can flag this somewhere instead of adding a hole to
struct request...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-11 14:24                           ` Sagi Grimberg
@ 2018-04-11 14:43                             ` tj
  2018-04-11 16:16                             ` Israel Rukshin
  1 sibling, 0 replies; 47+ messages in thread
From: tj @ 2018-04-11 14:43 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Bart Van Assche, israelr, ming.lei, hch, maxg, linux-block,
	stable, axboe

Hello,

On Wed, Apr 11, 2018 at 05:24:49PM +0300, Sagi Grimberg wrote:
> 
> >Ah, yeah, I was moving it out of add_timer but forgot to actually add
> >it to the issue path.  Fixed patch below.
> >
> >BTW, no matter what we do w/ the request handover between normal and
> >timeout paths, we'd need something similar.  Otherwise, while we can
> >reduce the window, we can't get rid of it.
> >
> >Thanks.
> 
> Just noticed this one, this looks interesting to me as well. Israel,
> can you run your test with this patch?

The following is the combined patch with one debug message to see
whether missed completions are actually happening.

Thanks.

Index: work/block/blk-mq.c
===================================================================
--- work.orig/block/blk-mq.c
+++ work/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
 	hctx_lock(hctx, &srcu_idx);
 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 		__blk_mq_complete_request(rq);
+	else
+		rq->missed_completion = true;
 	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
 
 	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 	blk_add_timer(rq);
+	rq->missed_completion = false;
 
 	write_seqcount_end(&rq->gstate_seq);
 	preempt_enable();
@@ -818,7 +821,8 @@ struct blk_mq_timeout_data {
 	unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request *req,
+				int *nr_resets, bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -833,13 +837,10 @@ static void blk_mq_rq_timed_out(struct r
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
-		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
 		blk_add_timer(req);
+		req->rq_flags |= RQF_MQ_TIMEOUT_RESET;
+		(*nr_resets)++;
+		hctx->need_sync_rcu = true;
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -876,13 +877,35 @@ static void blk_mq_check_expired(struct
 	    time_after_eq(jiffies, deadline)) {
 		blk_mq_rq_update_aborted_gstate(rq, gstate);
 		data->nr_expired++;
-		hctx->nr_expired++;
+		hctx->need_sync_rcu = true;
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
 }
 
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
+{
+	struct blk_mq_hw_ctx *hctx;
+	bool has_rcu = false;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (!hctx->need_sync_rcu)
+			continue;
+
+		if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+			has_rcu = true;
+		else
+			synchronize_srcu(hctx->srcu);
+
+		if (clear)
+			hctx->need_sync_rcu = false;
+	}
+	if (has_rcu)
+		synchronize_rcu();
+}
+
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
@@ -895,7 +918,45 @@ static void blk_mq_terminate_expired(str
 	 */
 	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
 	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
-		blk_mq_rq_timed_out(rq, reserved);
+		blk_mq_rq_timed_out(hctx, rq, priv, reserved);
+}
+
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * @rq's timer reset has gone through rcu synchronization and is
+	 * visible now.  Allow normal completions again by resetting
+	 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
+	 * blk_mq_timeout_reset_cleanup() needs it again and there's no
+	 * memory ordering around ->aborted_gstate making it the only field
+	 * safe to update.  Let blk_add_timer() clear it later when the
+	 * request is recycled or times out again.
+	 */
+	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
+		blk_mq_rq_update_aborted_gstate(rq, 0);
+}
+
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	int cnt = 0;
+
+	/*
+	 * @rq is now fully returned to the normal path.  If normal
+	 * completion raced timeout handling, execute the missed completion
+	 * here.  This is safe because 1. ->missed_completion can no longer
+	 * be asserted because nothing is timing out right now and 2. if
+	 * ->missed_completion is set, @rq is safe from recycling because
+	 * nobody could have completed it.
+	 */
+	if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion) {
+		blk_mq_complete_request(rq);
+		cnt++;
+	}
+
+	printk("XXX blk_mq_timeout_reset_cleanup(): executed %d missed completions\n",
+	       cnt);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
@@ -930,7 +991,7 @@ static void blk_mq_timeout_work(struct w
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
 	if (data.nr_expired) {
-		bool has_rcu = false;
+		int nr_resets = 0;
 
 		/*
 		 * Wait till everyone sees ->aborted_gstate.  The
@@ -938,22 +999,25 @@ static void blk_mq_timeout_work(struct w
 		 * becomes a problem, we can add per-hw_ctx rcu_head and
 		 * wait in parallel.
 		 */
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->nr_expired)
-				continue;
-
-			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-				has_rcu = true;
-			else
-				synchronize_srcu(hctx->srcu);
-
-			hctx->nr_expired = 0;
-		}
-		if (has_rcu)
-			synchronize_rcu();
+		blk_mq_timeout_sync_rcu(q, true);
 
 		/* terminate the ones we won */
-		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
+		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
+					   &nr_resets);
+
+		/*
+		 * For BLK_EH_RESET_TIMER, release the requests after
+		 * blk_add_timer() from above is visible to avoid timer
+		 * reset racing against recycling.
+		 */
+		if (nr_resets) {
+			blk_mq_timeout_sync_rcu(q, false);
+			blk_mq_queue_tag_busy_iter(q,
+					blk_mq_timeout_reset_return, NULL);
+			blk_mq_timeout_sync_rcu(q, true);
+			blk_mq_queue_tag_busy_iter(q,
+					blk_mq_timeout_reset_cleanup, NULL);
+		}
 	}
 
 	if (data.next_set) {
Index: work/include/linux/blk-mq.h
===================================================================
--- work.orig/include/linux/blk-mq.h
+++ work/include/linux/blk-mq.h
@@ -51,7 +51,7 @@ struct blk_mq_hw_ctx {
 	unsigned int		queue_num;
 
 	atomic_t		nr_active;
-	unsigned int		nr_expired;
+	bool			need_sync_rcu;
 
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
Index: work/block/blk-timeout.c
===================================================================
--- work.orig/block/blk-timeout.c
+++ work/block/blk-timeout.c
@@ -216,7 +216,7 @@ void blk_add_timer(struct request *req)
 		req->timeout = q->rq_timeout;
 
 	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
+	req->rq_flags &= ~(RQF_MQ_TIMEOUT_EXPIRED | RQF_MQ_TIMEOUT_RESET);
 
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
Index: work/include/linux/blkdev.h
===================================================================
--- work.orig/include/linux/blkdev.h
+++ work/include/linux/blkdev.h
@@ -127,8 +127,10 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
 /* timeout is expired */
 #define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
+/* timeout timer was reset */
+#define RQF_MQ_TIMEOUT_RESET	((__force req_flags_t)(1 << 21))
 /* already slept for hybrid poll */
-#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
+#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 22))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
@@ -225,6 +227,8 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
+	bool missed_completion;
+
 	/*
 	 * On blk-mq, the lower bits of ->gstate (generation number and
 	 * state) carry the MQ_RQ_* state value and the upper bits the

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-11 14:24                           ` Sagi Grimberg
  2018-04-11 14:43                             ` tj
@ 2018-04-11 16:16                             ` Israel Rukshin
  2018-04-11 17:07                               ` tj
  1 sibling, 1 reply; 47+ messages in thread
From: Israel Rukshin @ 2018-04-11 16:16 UTC (permalink / raw)
  To: Sagi Grimberg, tj, Bart Van Assche
  Cc: ming.lei, hch, maxg, linux-block, stable, axboe

On 4/11/2018 5:24 PM, Sagi Grimberg wrote:
>
>> Ah, yeah, I was moving it out of add_timer but forgot to actually add
>> it to the issue path.  Fixed patch below.
>>
>> BTW, no matter what we do w/ the request handover between normal and
>> timeout paths, we'd need something similar.  Otherwise, while we can
>> reduce the window, we can't get rid of it.
>>
>> Thanks.
>
> Just noticed this one, this looks interesting to me as well. Israel,
> can you run your test with this patch?

Yes, I just did and it looks good.

>
>>
>> ---
>>   block/blk-mq.c         |   45 ++++++++++++++++++++++++++++++++-------------
>>   include/linux/blkdev.h |    2 ++
>>   2 files changed, 34 insertions(+), 13 deletions(-)
>>
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
>>       hctx_lock(hctx, &srcu_idx);
>>       if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>>           __blk_mq_complete_request(rq);
>> +    else
>> +        rq->missed_completion = true;
>>       hctx_unlock(hctx, srcu_idx);
>>   }
>>   EXPORT_SYMBOL(blk_mq_complete_request);
>> @@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
>>         blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>>       blk_add_timer(rq);
>> +    rq->missed_completion = false;
>>         write_seqcount_end(&rq->gstate_seq);
>>       preempt_enable();
>> @@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct
>>       }
>>   }
>>   -static void blk_mq_timeout_sync_rcu(struct request_queue *q)
>> +static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
>>   {
>>       struct blk_mq_hw_ctx *hctx;
>>       bool has_rcu = false;
>> @@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
>>           else
>>               synchronize_srcu(hctx->srcu);
>>   -        hctx->need_sync_rcu = false;
>> +        if (clear)
>> +            hctx->need_sync_rcu = false;
>>       }
>>       if (has_rcu)
>>           synchronize_rcu();
>> @@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
>>           blk_mq_rq_timed_out(hctx, rq, priv, reserved);
>>   }
>>   -static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
>> +static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
>>           struct request *rq, void *priv, bool reserved)
>>   {
>>       /*
>>        * @rq's timer reset has gone through rcu synchronization and is
>>        * visible now.  Allow normal completions again by resetting
>>        * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
>> -     * there's no memory ordering around ->aborted_gstate making it the
>> -     * only field safe to update.  Let blk_add_timer() clear it later
>> -     * when the request is recycled or times out again.
>> -     *
>> -     * As nothing prevents from completion happening while
>> -     * ->aborted_gstate is set, this may lead to ignored completions
>> -     * and further spurious timeouts.
>> +     * blk_mq_timeout_reset_cleanup() needs it again and there's no
>> +     * memory ordering around ->aborted_gstate making it the only field
>> +     * safe to update.  Let blk_add_timer() clear it later when the
>> +     * request is recycled or times out again.
>>        */
>>       if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
>>           blk_mq_rq_update_aborted_gstate(rq, 0);
>>   }
>>   +static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
>> +        struct request *rq, void *priv, bool reserved)
>> +{
>> +    /*
>> +     * @rq is now fully returned to the normal path.  If normal
>> +     * completion raced timeout handling, execute the missed completion
>> +     * here.  This is safe because 1. ->missed_completion can no longer
>> +     * be asserted because nothing is timing out right now and 2. if
>> +     * ->missed_completion is set, @rq is safe from recycling because
>> +     * nobody could have completed it.
>> +     */
>> +    if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
>> +        blk_mq_complete_request(rq);
>> +}
>> +
>>   static void blk_mq_timeout_work(struct work_struct *work)
>>   {
>>       struct request_queue *q =
>> @@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
>>            * becomes a problem, we can add per-hw_ctx rcu_head and
>>            * wait in parallel.
>>            */
>> -        blk_mq_timeout_sync_rcu(q);
>> +        blk_mq_timeout_sync_rcu(q, true);
>>             /* terminate the ones we won */
>>           blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
>> @@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
>>            * reset racing against recycling.
>>            */
>>           if (nr_resets) {
>> -            blk_mq_timeout_sync_rcu(q);
>> +            blk_mq_timeout_sync_rcu(q, false);
>> +            blk_mq_queue_tag_busy_iter(q,
>> +                    blk_mq_timeout_reset_return, NULL);
>> +            blk_mq_timeout_sync_rcu(q, true);
>>               blk_mq_queue_tag_busy_iter(q,
>> -                    blk_mq_finish_timeout_reset, NULL);
>> +                    blk_mq_timeout_reset_cleanup, NULL);
>>           }
>>       }
>>   --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -227,6 +227,8 @@ struct request {
>>         unsigned int extra_len;    /* length of alignment and padding */
>>   +    bool missed_completion;
>> +
>
> Would be nicer if we can flag this somewhere instead of adding a hole to
> struct request...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-11 16:16                             ` Israel Rukshin
@ 2018-04-11 17:07                               ` tj
  2018-04-11 21:31                                 ` tj
  0 siblings, 1 reply; 47+ messages in thread
From: tj @ 2018-04-11 17:07 UTC (permalink / raw)
  To: Israel Rukshin
  Cc: Sagi Grimberg, Bart Van Assche, ming.lei, hch, maxg, linux-block,
	stable, axboe

Hello, Israel.

On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
> >Just noticed this one, this looks interesting to me as well. Israel,
> >can you run your test with this patch?
> 
> Yes, I just did and it looks good.

Awesome.

> >>+++ b/include/linux/blkdev.h
> >>@@ -227,6 +227,8 @@ struct request {
> >>  	unsigned int extra_len;	/* length of alignment and padding */
> >> +	bool missed_completion;
> >>+
> >
> >Would be nicer if we can flag this somewhere instead of adding a hole to
> >struct request...

I missed it before.  It's actually being put in an existing hole, so
the struct size stays the same before and after.  It's a bit of
cheating cuz this is one of the two holes which can be removed by
swapping two fields.
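
To make the layout point concrete, here is a minimal editorial sketch
(not from any patch in this thread) of how such a hole forms on a
64-bit ABI and how swapping two fields removes it; the struct and
field names are illustrative, not the real struct request layout:

	/* Illustrative only -- not the real struct request. */
	struct with_hole {
		unsigned int len;	/* 4 bytes */
		/* 4-byte hole: 'ptr' needs 8-byte alignment */
		void *ptr;		/* 8 bytes */
		bool flag;		/* 1 byte + 7 bytes tail padding */
	};				/* sizeof == 24 on x86-64 */

	struct without_hole {
		unsigned int len;	/* 4 bytes */
		bool flag;		/* fills the former hole */
		/* 3 bytes of padding before 'ptr' */
		void *ptr;		/* 8 bytes */
	};				/* sizeof == 16 on x86-64 */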

Re. making it a flag: regardless of whether this is a flag or a
separate field, we need to add a new field, because there currently is
no field which can be modified by the party that doesn't own the
request.  So if we make it a flag, we need to add something like an
unsigned long atom_flags.
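
As a rough sketch of that alternative (editorial; the field and flag
names are made up and this is not part of any posted patch), the new
field would be driven with the atomic bitops so the non-owning party
can write it safely:

	/* Hypothetical sketch -- assumes a new rq->atom_flags field. */
	#define RQ_ATOM_MISSED_COMPLETION	0

	/* completion path: may not own the request during timeout handling */
	static inline void rq_mark_missed_completion(struct request *rq)
	{
		set_bit(RQ_ATOM_MISSED_COMPLETION, &rq->atom_flags);
	}

	/* timeout cleanup path: claim and clear the bit in one atomic step */
	static inline bool rq_claim_missed_completion(struct request *rq)
	{
		return test_and_clear_bit(RQ_ATOM_MISSED_COMPLETION,
					  &rq->atom_flags);
	}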

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
@ 2018-04-11 18:38                               ` Martin Steigerwald
  0 siblings, 0 replies; 47+ messages in thread
From: Martin Steigerwald @ 2018-04-11 18:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: tj, linux-block, israelr, sagi, hch, stable, axboe, ming.lei, maxg

Bart Van Assche - 11.04.18, 14:50:
> On Tue, 2018-04-10 at 14:54 -0700, tj@kernel.org wrote:
> > Ah, yeah, I was moving it out of add_timer but forgot to actually add
> > it to the issue path.  Fixed patch below.
> > 
> > BTW, no matter what we do w/ the request handover between normal and
> > timeout paths, we'd need something similar.  Otherwise, while we can
> > reduce the window, we can't get rid of it.
> 
> (+Martin Steigerwald)
[…]
> Thank you for having shared this patch. It looks interesting to me.
> What I know about the blk-mq timeout handling is as follows:
> * Nobody who has reviewed the blk-mq timeout handling code with this
> patch applied has reported any shortcomings for that code.
> * However, several people have reported kernel crashes that disappear
> when the blk-mq timeout code is reworked. I'm referring to "nvme-rdma
> corrupts memory upon timeout"
>  
> (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848
> .html) and also to a "RIP: scsi_times_out+0x17" crash during boot
> (https://bugzilla.kernel.org/show_bug.cgi?id=199077).

Yes, with the three patches:

- '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a 
request.mbox'

- '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into 
rcu_read_{lock,unlock}().mbox'

- '[PATCH v4] blk-mq_Fix race conditions in request timeout 
handling.mbox'

the occasional hangs on some boots / resumes from hibernation appear to 
be gone.

Also it appears that the error loading SMART data issue is gone as well 
(see my bug report). However, it is still too early to say for sure. I 
think I need at least 2-3 days of additional testing with this kernel to 
be sufficiently sure about it.

However… I could also test another patch, but from reading the rest of 
this thread so far it is not clear to me whether to try one of the new 
patches and, if so, which one, and whether to add it on top of some of 
the patches I already applied or use it as a replacement.

So while doing a training this week and next I can apply a patch here 
and there, but I won't have much time to read the complete discussion 
to figure out what to apply.

Personally, as a stable kernel has been released with those issues, I 
think it's good to fix this up soon. On the other hand, it may take 
quite some time until popular distros carry 4.16 for regular users. 
And I have no idea how frequent the reported issues are, i.e. how many 
users would be affected.

Thanks,
-- 
Martin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-11 17:07                               ` tj
@ 2018-04-11 21:31                                 ` tj
  2018-04-12  8:59                                   ` Israel Rukshin
  0 siblings, 1 reply; 47+ messages in thread
From: tj @ 2018-04-11 21:31 UTC (permalink / raw)
  To: Israel Rukshin
  Cc: Sagi Grimberg, Bart Van Assche, ming.lei, hch, maxg, linux-block,
	stable, axboe

Hey, again.

On Wed, Apr 11, 2018 at 10:07:33AM -0700, tj@kernel.org wrote:
> Hello, Israel.
> 
> On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
> > >Just noticed this one, this looks interesting to me as well. Israel,
> > >can you run your test with this patch?
> > 
> > Yes, I just did and it looks good.
> 
> Awesome.

Just to be sure, you tested the combined patch and saw the XXX debug
messages, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-11 21:31                                 ` tj
@ 2018-04-12  8:59                                   ` Israel Rukshin
  2018-04-12 13:35                                     ` tj
  0 siblings, 1 reply; 47+ messages in thread
From: Israel Rukshin @ 2018-04-12  8:59 UTC (permalink / raw)
  To: tj
  Cc: Sagi Grimberg, Bart Van Assche, ming.lei, hch, maxg, linux-block,
	stable, axboe

On 4/12/2018 12:31 AM, tj@kernel.org wrote:
> Hey, again.
>
> On Wed, Apr 11, 2018 at 10:07:33AM -0700, tj@kernel.org wrote:
>> Hello, Israel.
>>
>> On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
>>>> Just noticed this one, this looks interesting to me as well. Israel,
>>>> can you run your test with this patch?
>>> Yes, I just did and it looks good.
>> Awesome.
> Just to be sure, you tested the combined patch and saw the XXX debug
> messages, right?

I am not sure I understand.
What is the combined patch?
What are the debug messages that I need to see?

> Thanks.
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-12  8:59                                   ` Israel Rukshin
@ 2018-04-12 13:35                                     ` tj
  2018-04-15 12:28                                       ` Israel Rukshin
  0 siblings, 1 reply; 47+ messages in thread
From: tj @ 2018-04-12 13:35 UTC (permalink / raw)
  To: Israel Rukshin
  Cc: Sagi Grimberg, Bart Van Assche, ming.lei, hch, maxg, linux-block,
	stable, axboe

Hello, Israel.

On Thu, Apr 12, 2018 at 11:59:11AM +0300, Israel Rukshin wrote:
> On 4/12/2018 12:31 AM, tj@kernel.org wrote:
> >Hey, again.
> >
> >On Wed, Apr 11, 2018 at 10:07:33AM -0700, tj@kernel.org wrote:
> >>Hello, Israel.
> >>
> >>On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
> >>>>Just noticed this one, this looks interesting to me as well. Israel,
> >>>>can you run your test with this patch?
> >>>Yes, I just did and it looks good.
> >>Awesome.
> >Just to be sure, you tested the combined patch and saw the XXX debug
> >messages, right?
> 
> I am not sure I understand.
> What is the combined patch?
> What are the debug messages that I need to see?

There are a total of three patches, and I posted the combined patch + a
debug message in another post.  Can you please test the following
patch?  It'll print out something like the following (possibly many of
them).

 XXX blk_mq_timeout_reset_cleanup(): executed %d missed completions

Thanks.

Index: work/block/blk-mq.c
===================================================================
--- work.orig/block/blk-mq.c
+++ work/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
 	hctx_lock(hctx, &srcu_idx);
 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 		__blk_mq_complete_request(rq);
+	else
+		rq->missed_completion = true;
 	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
 
 	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 	blk_add_timer(rq);
+	rq->missed_completion = false;
 
 	write_seqcount_end(&rq->gstate_seq);
 	preempt_enable();
@@ -818,7 +821,8 @@ struct blk_mq_timeout_data {
 	unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request *req,
+				int *nr_resets, bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -833,13 +837,10 @@ static void blk_mq_rq_timed_out(struct r
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
-		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
 		blk_add_timer(req);
+		req->rq_flags |= RQF_MQ_TIMEOUT_RESET;
+		(*nr_resets)++;
+		hctx->need_sync_rcu = true;
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -876,13 +877,35 @@ static void blk_mq_check_expired(struct
 	    time_after_eq(jiffies, deadline)) {
 		blk_mq_rq_update_aborted_gstate(rq, gstate);
 		data->nr_expired++;
-		hctx->nr_expired++;
+		hctx->need_sync_rcu = true;
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
 }
 
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
+{
+	struct blk_mq_hw_ctx *hctx;
+	bool has_rcu = false;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (!hctx->need_sync_rcu)
+			continue;
+
+		if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+			has_rcu = true;
+		else
+			synchronize_srcu(hctx->srcu);
+
+		if (clear)
+			hctx->need_sync_rcu = false;
+	}
+	if (has_rcu)
+		synchronize_rcu();
+}
+
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
@@ -895,7 +918,45 @@ static void blk_mq_terminate_expired(str
 	 */
 	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
 	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
-		blk_mq_rq_timed_out(rq, reserved);
+		blk_mq_rq_timed_out(hctx, rq, priv, reserved);
+}
+
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * @rq's timer reset has gone through rcu synchronization and is
+	 * visible now.  Allow normal completions again by resetting
+	 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
+	 * blk_mq_timeout_reset_cleanup() needs it again and there's no
+	 * memory ordering around ->aborted_gstate making it the only field
+	 * safe to update.  Let blk_add_timer() clear it later when the
+	 * request is recycled or times out again.
+	 */
+	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
+		blk_mq_rq_update_aborted_gstate(rq, 0);
+}
+
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	int cnt = 0;
+
+	/*
+	 * @rq is now fully returned to the normal path.  If normal
+	 * completion raced timeout handling, execute the missed completion
+	 * here.  This is safe because 1. ->missed_completion can no longer
+	 * be asserted because nothing is timing out right now and 2. if
+	 * ->missed_completion is set, @rq is safe from recycling because
+	 * nobody could have completed it.
+	 */
+	if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion) {
+		blk_mq_complete_request(rq);
+		cnt++;
+	}
+
+	printk("XXX blk_mq_timeout_reset_cleanup(): executed %d missed completions\n",
+	       cnt);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
@@ -930,7 +991,7 @@ static void blk_mq_timeout_work(struct w
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
 	if (data.nr_expired) {
-		bool has_rcu = false;
+		int nr_resets = 0;
 
 		/*
 		 * Wait till everyone sees ->aborted_gstate.  The
@@ -938,22 +999,25 @@ static void blk_mq_timeout_work(struct w
 		 * becomes a problem, we can add per-hw_ctx rcu_head and
 		 * wait in parallel.
 		 */
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->nr_expired)
-				continue;
-
-			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-				has_rcu = true;
-			else
-				synchronize_srcu(hctx->srcu);
-
-			hctx->nr_expired = 0;
-		}
-		if (has_rcu)
-			synchronize_rcu();
+		blk_mq_timeout_sync_rcu(q, true);
 
 		/* terminate the ones we won */
-		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
+		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
+					   &nr_resets);
+
+		/*
+		 * For BLK_EH_RESET_TIMER, release the requests after
+		 * blk_add_timer() from above is visible to avoid timer
+		 * reset racing against recycling.
+		 */
+		if (nr_resets) {
+			blk_mq_timeout_sync_rcu(q, false);
+			blk_mq_queue_tag_busy_iter(q,
+					blk_mq_timeout_reset_return, NULL);
+			blk_mq_timeout_sync_rcu(q, true);
+			blk_mq_queue_tag_busy_iter(q,
+					blk_mq_timeout_reset_cleanup, NULL);
+		}
 	}
 
 	if (data.next_set) {
Index: work/include/linux/blk-mq.h
===================================================================
--- work.orig/include/linux/blk-mq.h
+++ work/include/linux/blk-mq.h
@@ -51,7 +51,7 @@ struct blk_mq_hw_ctx {
 	unsigned int		queue_num;
 
 	atomic_t		nr_active;
-	unsigned int		nr_expired;
+	bool			need_sync_rcu;
 
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
Index: work/block/blk-timeout.c
===================================================================
--- work.orig/block/blk-timeout.c
+++ work/block/blk-timeout.c
@@ -216,7 +216,7 @@ void blk_add_timer(struct request *req)
 		req->timeout = q->rq_timeout;
 
 	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
+	req->rq_flags &= ~(RQF_MQ_TIMEOUT_EXPIRED | RQF_MQ_TIMEOUT_RESET);
 
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
Index: work/include/linux/blkdev.h
===================================================================
--- work.orig/include/linux/blkdev.h
+++ work/include/linux/blkdev.h
@@ -127,8 +127,10 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
 /* timeout is expired */
 #define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
+/* timeout timer was reset */
+#define RQF_MQ_TIMEOUT_RESET	((__force req_flags_t)(1 << 21))
 /* already slept for hybrid poll */
-#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
+#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 22))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
@@ -225,6 +227,8 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
+	bool missed_completion;
+
 	/*
 	 * On blk-mq, the lower bits of ->gstate (generation number and
 	 * state) carry the MQ_RQ_* state value and the upper bits the

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-12 13:35                                     ` tj
@ 2018-04-15 12:28                                       ` Israel Rukshin
  0 siblings, 0 replies; 47+ messages in thread
From: Israel Rukshin @ 2018-04-15 12:28 UTC (permalink / raw)
  To: tj
  Cc: Sagi Grimberg, Bart Van Assche, ming.lei, hch, maxg, linux-block,
	stable, axboe

Hi,

On 4/12/2018 4:35 PM, tj@kernel.org wrote:
> Hello, Israel.
>
> On Thu, Apr 12, 2018 at 11:59:11AM +0300, Israel Rukshin wrote:
>> On 4/12/2018 12:31 AM, tj@kernel.org wrote:
>>> Hey, again.
>>>
>>> On Wed, Apr 11, 2018 at 10:07:33AM -0700, tj@kernel.org wrote:
>>>> Hello, Israel.
>>>>
>>>> On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
>>>>>> Just noticed this one, this looks interesting to me as well. Israel,
>>>>>> can you run your test with this patch?
>>>>> Yes, I just did and it looks good.
>>>> Awesome.
>>> Just to be sure, you tested the combined patch and saw the XXX debug
>>> messages, right?
>> I am not sure I understand.
>> What is the combined patch?
>> What are the debug messages that I need to see?
> There are total of three patches and I posted the combined patch + a
> debug message in another post.  Can you please test the following
> patch?  It'll print out something like the following (possibly many of
> them).

I tested this combined patch with your debug print and I got the 
following call trace:

Apr 15 11:42:35 nvme nvme1: I/O 55 QID 10 timeout, reset controller
Apr 15 11:42:35 XXX blk_mq_timeout_reset_cleanup(): executed 1 missed 
completions
Apr 15 11:42:35 XXX blk_mq_timeout_reset_cleanup(): executed 1 missed 
completions
Apr 15 11:42:35 XXX blk_mq_timeout_reset_cleanup(): executed 1 missed 
completions
Apr 15 11:42:35 WARNING: CPU: 1 PID: 648 at block/blk-mq.c:534 
__blk_mq_complete_request+0x154/0x160
Apr 15 11:42:35 Modules linked in: nvme_rdma rdma_cm iw_cm ib_cm 
nvme_fabrics nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache iscsi_tcp 
libiscsi_tcp libiscsi scsi_transport_iscsi netconsole mlx4_ib ib_core 
mlx4_en mlx4_core xfs libcrc32c dm_mirror dm_region_hash dm_log 
dm_multipath scsi_dh_rdac scsi_dh_emc dm_mod dax scsi_dh_alua 
x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass 
ghash_clmulni_intel pcbc iTCO_wdt aesni_intel nfsd iTCO_vendor_support 
aes_x86_64 ipmi_si crypto_simd lpc_ich cryptd ipmi_devintf shpchp wmi 
acpi_pad glue_helper ipmi_msghandler pcspkr sg i2c_i801 mfd_core ioatdma 
auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 
sd_mod igb i2c_algo_bit ahci i2c_core xhci_pci crc32c_intel xhci_hcd 
libahci dca configfs nvme nvme_core ipv6 crc_ccitt autofs4 [last 
unloaded: nvme_fabrics]
Apr 15 11:42:35 CPU: 1 PID: 648 Comm: kworker/1:1H Not tainted 4.16.0+ #8
Apr 15 11:42:35 Hardware name: Supermicro SYS-6018R-WTR/X10DRW-i, BIOS 
2.0 12/17/2015
Apr 15 11:42:35 Workqueue: kblockd blk_mq_timeout_work
Apr 15 11:42:35 RIP: 0010:__blk_mq_complete_request+0x154/0x160
Apr 15 11:42:35 RSP: 0018:ffffc9000407bd68 EFLAGS: 00010297
Apr 15 11:42:35 RAX: 0000000000000000 RBX: ffff8808310236c0 RCX: 
0000000000000009
Apr 15 11:42:35 RDX: 0000000000000009 RSI: 0000000000000000 RDI: 
ffff8808310236c0
Apr 15 11:42:35 RBP: ffffe8ffffae12c0 R08: 0000000000000000 R09: 
00000000000004ea
Apr 15 11:42:35 R10: 000000000000006c R11: ffffc90003dfda80 R12: 
0000000000000000
Apr 15 11:42:35 R13: 0000000000000010 R14: 0000000000000010 R15: 
0000000000000000
Apr 15 11:42:35 FS:  0000000000000000(0000) GS:ffff88047fa40000(0000) 
knlGS:0000000000000000
Apr 15 11:42:35 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 15 11:42:35 CR2: 00007f6ab2db3140 CR3: 0000000001e0a006 CR4: 
00000000001606e0
Apr 15 11:42:35 Call Trace:
Apr 15 11:42:35  blk_mq_complete_request+0x4b/0x90
Apr 15 11:42:35  blk_mq_timeout_reset_cleanup+0x27/0x40
Apr 15 11:42:35  bt_iter+0x43/0x50
Apr 15 11:42:35  blk_mq_queue_tag_busy_iter+0xfb/0x230
Apr 15 11:42:35  ? blk_mq_complete_request+0x90/0x90
Apr 15 11:42:35  ? blk_mq_complete_request+0x90/0x90
Apr 15 11:42:35  ? __call_rcu.constprop.72+0x170/0x1c0
Apr 15 11:42:35  blk_mq_timeout_work+0x191/0x1f0
Apr 15 11:42:35  process_one_work+0x140/0x2a0
Apr 15 11:42:35  worker_thread+0x3f/0x3c0
Apr 15 11:42:35  kthread+0xeb/0x120
Apr 15 11:42:35  ? process_one_work+0x2a0/0x2a0
Apr 15 11:42:35  ? kthread_bind+0x10/0x10
Apr 15 11:42:35  ? SyS_exit_group+0xb/0x10
Apr 15 11:42:35  ret_from_fork+0x35/0x40
Apr 15 11:42:35 Code: 00 00 00 00 00 8b 7d 40 e8 0a 20 e7 ff e9 6e ff ff 
ff 48 8b 35 1e ca ba 00 48 83 c7 10 48 83 c6 64 e8 71 ee e5 ff e9 77 ff 
ff ff <0f> 0b e9 c3 fe ff ff 0f 1f 44 00 00 41 54 45 31 e4 55 53 48 8b
Apr 15 11:42:35 ---[ end trace fbf397c4b27ea0b8 ]---
Apr 15 11:42:35 XXX blk_mq_timeout_reset_cleanup(): executed 1 missed 
completions
Apr 15 11:42:35 BUG: unable to handle kernel NULL pointer dereference at 
0000000000000018

Regards,
Israel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
  2018-04-10 21:54                         ` tj
  2018-04-11 12:50                             ` Bart Van Assche
  2018-04-11 14:24                           ` Sagi Grimberg
@ 2018-04-18 16:34                           ` Bart Van Assche
  2 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2018-04-18 16:34 UTC (permalink / raw)
  To: tj; +Cc: ming.lei, hch, maxg, israelr, linux-block, stable, sagi, axboe

On 04/10/18 14:54, tj@kernel.org wrote:
> On Tue, Apr 10, 2018 at 09:46:23PM +0000, Bart Van Assche wrote:
>> On Tue, 2018-04-10 at 14:33 -0700, tj@kernel.org wrote:
>>> +       else
>>> +               rq->missed_completion = true;
>>
>> In this patch I found code that sets rq->missed_completion but no code that
>> clears it. Did I perhaps miss something?
> 
> Ah, yeah, I was moving it out of add_timer but forgot to actually add
> it to the issue path.  Fixed patch below.
> 
> BTW, no matter what we do w/ the request handover between normal and
> timeout paths, we'd need something similar.  Otherwise, while we can
> reduce the window, we can't get rid of it.
> 
> [ ... ]

Hello Tejun,

This patch no longer applies cleanly on Jens' tree. Can you rebase and 
repost it?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2018-04-18 16:34 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-10  1:34 [PATCH v4] blk-mq: Fix race conditions in request timeout handling Bart Van Assche
2018-04-10  7:59 ` jianchao.wang
2018-04-10 10:04   ` Ming Lei
2018-04-10 12:04     ` Shan Hai
2018-04-10 13:01   ` Bart Van Assche
2018-04-10 13:01     ` Bart Van Assche
2018-04-10 14:32     ` jianchao.wang
2018-04-10  8:41 ` Ming Lei
2018-04-10 12:58   ` Bart Van Assche
2018-04-10 12:58     ` Bart Van Assche
2018-04-10 13:55     ` Ming Lei
2018-04-10 14:09       ` Bart Van Assche
2018-04-10 14:09         ` Bart Van Assche
2018-04-10 14:30         ` Ming Lei
2018-04-10 15:02           ` Bart Van Assche
2018-04-10 15:02             ` Bart Van Assche
2018-04-10 15:25             ` Ming Lei
2018-04-10 15:30               ` tj
2018-04-10 15:38                 ` Ming Lei
2018-04-10 15:40                   ` tj
2018-04-10 21:33                     ` tj
2018-04-10 21:46                       ` Bart Van Assche
2018-04-10 21:46                         ` Bart Van Assche
2018-04-10 21:54                         ` tj
2018-04-11 12:50                           ` Bart Van Assche
2018-04-11 12:50                             ` Bart Van Assche
2018-04-11 14:16                             ` tj
2018-04-11 18:38                             ` Martin Steigerwald
2018-04-11 18:38                               ` Martin Steigerwald
2018-04-11 14:24                           ` Sagi Grimberg
2018-04-11 14:43                             ` tj
2018-04-11 16:16                             ` Israel Rukshin
2018-04-11 17:07                               ` tj
2018-04-11 21:31                                 ` tj
2018-04-12  8:59                                   ` Israel Rukshin
2018-04-12 13:35                                     ` tj
2018-04-15 12:28                                       ` Israel Rukshin
2018-04-18 16:34                           ` Bart Van Assche
2018-04-10  9:55 ` Christoph Hellwig
2018-04-10 13:26   ` Bart Van Assche
2018-04-10 13:26     ` Bart Van Assche
2018-04-10 14:50     ` hch
2018-04-10 14:41   ` Jens Axboe
2018-04-10 14:20 ` Tejun Heo
2018-04-10 14:30   ` Bart Van Assche
2018-04-10 14:30     ` Bart Van Assche
2018-04-10 14:33     ` tj
