All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
@ 2018-04-09  5:20 Bart Van Assche
  2018-04-09  8:37 ` Sagi Grimberg
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Bart Van Assche @ 2018-04-09  5:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Tejun Heo,
	Sagi Grimberg, Israel Rukshin, Max Gurtovoy, stable

If a completion occurs after blk_mq_rq_timed_out() has reset
rq->aborted_gstate and the request is again in flight when the timeout
expires then a request will be completed twice: a first time by the
timeout handler and a second time when the regular completion occurs.

Additionally, the blk-mq timeout handling code ignores completions that
occur after blk_mq_check_expired() has been called and before
blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
timeout handler always returns BLK_EH_RESET_TIMER then the result will
be that the request never terminates.

Since the request state can be updated from two different contexts,
namely regular completion and request timeout, this race cannot be
fixed with RCU synchronization only. Fix this race as follows:
- Introduce a spinlock to protect the request state and deadline changes.
- Use the deadline instead of the request generation to detect whether
  or not a request timer fired after reinitialization of a request.
- Store the request state in the lowest two bits of the deadline instead
  of the lowest two bits of 'gstate'.
- Remove all request member variables that became superfluous due to
  this change: gstate, aborted_gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to this
  patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.

This patch fixes the following kernel crash:

BUG: unable to handle kernel NULL pointer dereference at           (null)
Oops: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 151 Comm: kworker/2:1H Tainted: G        W        4.15.0-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: kblockd blk_mq_timeout_work
RIP: 0010:scsi_times_out+0x17/0x2c0 [scsi_mod]
Call Trace:
blk_mq_terminate_expired+0x42/0x80
bt_iter+0x3d/0x50
blk_mq_queue_tag_busy_iter+0xe9/0x200
blk_mq_timeout_work+0x181/0x2e0
process_one_work+0x21c/0x6d0
worker_thread+0x35/0x380
kthread+0x117/0x130
ret_from_fork+0x24/0x30

Fixes: 1d9bd5161ba3 ("blk-mq: replace timeout synchronization with a RCU and generation based scheme")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Israel Rukshin <israelr@mellanox.com>,
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: <stable@vger.kernel.org> # v4.16
---
 block/blk-core.c       |   3 +-
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c         | 178 +++++++++++--------------------------------------
 block/blk-mq.h         |  25 ++-----
 block/blk-timeout.c    |   1 -
 block/blk.h            |   4 +-
 include/linux/blkdev.h |  28 ++------
 7 files changed, 53 insertions(+), 187 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2623e609db4a..83c7a58e4fb3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -200,8 +200,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
 	rq->part = NULL;
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
+	spin_lock_init(&rq->state_lock);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 6f72413b6cab..80c7c585769f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -345,7 +345,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
 	RQF_NAME(ZONE_WRITE_LOCKED),
-	RQF_NAME(MQ_TIMEOUT_EXPIRED),
 	RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7816d28b7219..1da16d5e5cf1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -305,7 +305,6 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->special = NULL;
 	/* tag was already set */
 	rq->extra_len = 0;
-	rq->__deadline = 0;
 
 	INIT_LIST_HEAD(&rq->timeout_list);
 	rq->timeout = 0;
@@ -527,8 +526,7 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-	blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
@@ -577,34 +575,26 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
+/**
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old: Old request state.
+ * @new: New request state.
+ */
+static bool blk_mq_change_rq_state(struct request *rq, enum mq_rq_state old,
+				   enum mq_rq_state new)
 {
 	unsigned long flags;
+	bool changed_state = false;
 
-	/*
-	 * blk_mq_rq_aborted_gstate() is used from the completion path and
-	 * can thus be called from irq context.  u64_stats_fetch in the
-	 * middle of update on the same CPU leads to lockup.  Disable irq
-	 * while updating.
-	 */
-	local_irq_save(flags);
-	u64_stats_update_begin(&rq->aborted_gstate_sync);
-	rq->aborted_gstate = gstate;
-	u64_stats_update_end(&rq->aborted_gstate_sync);
-	local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-	unsigned int start;
-	u64 aborted_gstate;
-
-	do {
-		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-		aborted_gstate = rq->aborted_gstate;
-	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
+	spin_lock_irqsave(&rq->state_lock, flags);
+	if (blk_mq_rq_state(rq) == old) {
+		blk_mq_rq_update_state(rq, new);
+		changed_state = true;
+	}
+	spin_unlock_irqrestore(&rq->state_lock, flags);
 
-	return aborted_gstate;
+	return changed_state;
 }
 
 /**
@@ -618,27 +608,12 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
 
-	/*
-	 * If @rq->aborted_gstate equals the current instance, timeout is
-	 * claiming @rq and we lost.  This is synchronized through
-	 * hctx_lock().  See blk_mq_timeout_work() for details.
-	 *
-	 * Completion path never blocks and we can directly use RCU here
-	 * instead of hctx_lock() which can be either RCU or SRCU.
-	 * However, that would complicate paths which want to synchronize
-	 * against us.  Let stay in sync with the issue path so that
-	 * hctx_lock() covers both issue and completion paths.
-	 */
-	hctx_lock(hctx, &srcu_idx);
-	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
+	if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE))
 		__blk_mq_complete_request(rq);
-	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -665,24 +640,14 @@ void blk_mq_start_request(struct request *rq)
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
 	/*
-	 * Mark @rq in-flight which also advances the generation number,
-	 * and register for timeout.  Protect with a seqcount to allow the
-	 * timeout path to read both @rq->gstate and @rq->deadline
-	 * coherently.
-	 *
-	 * This is the only place where a request is marked in-flight.  If
-	 * the timeout path reads an in-flight @rq->gstate, the
-	 * @rq->deadline it reads together under @rq->gstate_seq is
-	 * guaranteed to be the matching one.
+	 * Mark @rq in-flight and register for timeout. Because blk_add_timer()
+	 * updates the deadline, if a timer set by a previous incarnation of
+	 * this request fires this request will be skipped by the timeout code.
 	 */
-	preempt_disable();
-	write_seqcount_begin(&rq->gstate_seq);
-
+	spin_lock_irq(&rq->state_lock);
 	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 	blk_add_timer(rq);
-
-	write_seqcount_end(&rq->gstate_seq);
-	preempt_enable();
+	spin_unlock_irq(&rq->state_lock);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -695,11 +660,6 @@ void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
-/*
- * When we reach here because queue is busy, it's safe to change the state
- * to IDLE without checking @rq->aborted_gstate because we should still be
- * holding the RCU read lock and thus protected against timeout.
- */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
@@ -811,15 +771,13 @@ EXPORT_SYMBOL(blk_mq_tag_to_rq);
 struct blk_mq_timeout_data {
 	unsigned long next;
 	unsigned int next_set;
-	unsigned int nr_expired;
 };
 
 static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
-
-	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
+	unsigned long flags;
 
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
@@ -829,13 +787,10 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
-		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
+		spin_lock_irqsave(&req->state_lock, flags);
 		blk_add_timer(req);
+		blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
+		spin_unlock_irqrestore(&req->state_lock, flags);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -849,48 +804,23 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	unsigned long gstate, deadline;
-	int start;
-
-	might_sleep();
-
-	if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
-		return;
-
-	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
-	while (true) {
-		start = read_seqcount_begin(&rq->gstate_seq);
-		gstate = READ_ONCE(rq->gstate);
-		deadline = blk_rq_deadline(rq);
-		if (!read_seqcount_retry(&rq->gstate_seq, start))
-			break;
-		cond_resched();
-	}
+	unsigned long deadline;
+	bool timed_out = false;
 
-	/* if in-flight && overdue, mark for abortion */
-	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
+	spin_lock_irq(&rq->state_lock);
+	deadline = blk_rq_deadline(rq);
+	if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT &&
 	    time_after_eq(jiffies, deadline)) {
-		blk_mq_rq_update_aborted_gstate(rq, gstate);
-		data->nr_expired++;
+		blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+		timed_out = true;
 		hctx->nr_expired++;
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
-}
+	spin_unlock_irq(&rq->state_lock);
 
-static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
-		struct request *rq, void *priv, bool reserved)
-{
-	/*
-	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
-	 * completions that we lost to, they would have finished and
-	 * updated @rq->gstate by now; otherwise, the completion path is
-	 * now guaranteed to see @rq->aborted_gstate and yield.  If
-	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
-	 */
-	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
-	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
+	if (timed_out)
 		blk_mq_rq_timed_out(rq, reserved);
 }
 
@@ -898,11 +828,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
 		container_of(work, struct request_queue, timeout_work);
-	struct blk_mq_timeout_data data = {
-		.next		= 0,
-		.next_set	= 0,
-		.nr_expired	= 0,
-	};
+	struct blk_mq_timeout_data data = { };
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
@@ -925,33 +851,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	/* scan for the expired ones and set their ->aborted_gstate */
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
-	if (data.nr_expired) {
-		bool has_rcu = false;
-
-		/*
-		 * Wait till everyone sees ->aborted_gstate.  The
-		 * sequential waits for SRCUs aren't ideal.  If this ever
-		 * becomes a problem, we can add per-hw_ctx rcu_head and
-		 * wait in parallel.
-		 */
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->nr_expired)
-				continue;
-
-			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-				has_rcu = true;
-			else
-				synchronize_srcu(hctx->srcu);
-
-			hctx->nr_expired = 0;
-		}
-		if (has_rcu)
-			synchronize_rcu();
-
-		/* terminate the ones we won */
-		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
-	}
-
 	if (data.next_set) {
 		data.next = blk_rq_timeout(round_jiffies_up(data.next));
 		mod_timer(&q->timeout, data.next);
@@ -2087,8 +1986,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			return ret;
 	}
 
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
+	spin_lock_init(&rq->state_lock);
 	return 0;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..d4d72f95d5a9 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -27,10 +27,7 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-/*
- * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
- * and the upper bits the generation number.
- */
+/* Lowest two bits of request->__deadline. */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
@@ -38,7 +35,6 @@ enum mq_rq_state {
 
 	MQ_RQ_STATE_BITS	= 2,
 	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
-	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
 };
 
 void blk_mq_freeze_queue(struct request_queue *q);
@@ -104,9 +100,9 @@ void blk_mq_release(struct request_queue *q);
  * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
  * @rq: target request.
  */
-static inline int blk_mq_rq_state(struct request *rq)
+static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
 {
-	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+	return rq->__deadline & MQ_RQ_STATE_MASK;
 }
 
 /**
@@ -115,22 +111,15 @@ static inline int blk_mq_rq_state(struct request *rq)
  * @state: new state to set.
  *
  * Set @rq's state to @state.  The caller is responsible for ensuring that
- * there are no other updaters.  A request can transition into IN_FLIGHT
- * only from IDLE and doing so increments the generation number.
+ * there are no other updaters.
  */
 static inline void blk_mq_rq_update_state(struct request *rq,
 					  enum mq_rq_state state)
 {
-	u64 old_val = READ_ONCE(rq->gstate);
-	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-	if (state == MQ_RQ_IN_FLIGHT) {
-		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-		new_val += MQ_RQ_GEN_INC;
-	}
+	unsigned long d = rq->__deadline;
 
-	/* avoid exposing interim values */
-	WRITE_ONCE(rq->gstate, new_val);
+	d &= ~(unsigned long)MQ_RQ_STATE_MASK;
+	rq->__deadline = d | state;
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 50a191720055..844a98edcf3f 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -217,7 +217,6 @@ void blk_add_timer(struct request *req)
 		req->timeout = q->rq_timeout;
 
 	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
 
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
diff --git a/block/blk.h b/block/blk.h
index b034fd2460c4..07275598d262 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -314,12 +314,12 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
  */
 static inline void blk_rq_set_deadline(struct request *rq, unsigned long time)
 {
-	rq->__deadline = time & ~0x1UL;
+	rq->__deadline = (time & ~0x3UL) | (rq->__deadline & 3UL);
 }
 
 static inline unsigned long blk_rq_deadline(struct request *rq)
 {
-	return rq->__deadline & ~0x1UL;
+	return rq->__deadline & ~0x3UL;
 }
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6075d1a6760c..e0a6a741afd0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,7 +27,6 @@
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
-#include <linux/seqlock.h>
 #include <linux/u64_stats_sync.h>
 
 struct module;
@@ -125,8 +124,6 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
-/* timeout is expired */
-#define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
 /* already slept for hybrid poll */
 #define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
 
@@ -141,6 +138,7 @@ typedef __u32 __bitwise req_flags_t;
  * especially blk_mq_rq_ctx_init() to take care of the added fields.
  */
 struct request {
+	spinlock_t state_lock;		/* protects __deadline for blk-mq */
 	struct request_queue *q;
 	struct blk_mq_ctx *mq_ctx;
 
@@ -226,27 +224,11 @@ struct request {
 	unsigned int extra_len;	/* length of alignment and padding */
 
 	/*
-	 * On blk-mq, the lower bits of ->gstate (generation number and
-	 * state) carry the MQ_RQ_* state value and the upper bits the
-	 * generation number which is monotonically incremented and used to
-	 * distinguish the reuse instances.
-	 *
-	 * ->gstate_seq allows updates to ->gstate and other fields
-	 * (currently ->deadline) during request start to be read
-	 * atomically from the timeout path, so that it can operate on a
-	 * coherent set of information.
+	 * access through blk_rq_set_deadline(), blk_rq_deadline() and
+	 * blk_mark_rq_complete(), blk_clear_rq_complete() and
+	 * blk_rq_is_complete() for legacy queues or blk_mq_rq_state() for
+	 * blk-mq queues.
 	 */
-	seqcount_t gstate_seq;
-	u64 gstate;
-
-	/*
-	 * ->aborted_gstate is used by the timeout to claim a specific
-	 * recycle instance of this request.  See blk_mq_timeout_work().
-	 */
-	struct u64_stats_sync aborted_gstate_sync;
-	u64 aborted_gstate;
-
-	/* access through blk_rq_set_deadline, blk_rq_deadline */
 	unsigned long __deadline;
 
 	struct list_head timeout_list;
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
  2018-04-09  5:20 [PATCH] blk-mq: Fix recently introduced races in the timeout handling code Bart Van Assche
@ 2018-04-09  8:37 ` Sagi Grimberg
  2018-04-09 14:37     ` Bart Van Assche
                     ` (2 more replies)
  2018-04-09  9:37 ` Christoph Hellwig
  2018-04-09 16:47 ` Tejun Heo
  2 siblings, 3 replies; 19+ messages in thread
From: Sagi Grimberg @ 2018-04-09  8:37 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Tejun Heo, Israel Rukshin,
	Max Gurtovoy, stable


> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout
> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs.
> 
> Additionally, the blk-mq timeout handling code ignores completions that
> occur after blk_mq_check_expired() has been called and before
> blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
> timeout handler always returns BLK_EH_RESET_TIMER then the result will
> be that the request never terminates.

OK, now I understand how we can complete twice. Israel, can you verify
this patch solves your double completion problem?

Given that it is, the change log of your patches should be modified to
the original bug report it solves.

Thread starts here:
http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
  2018-04-09  5:20 [PATCH] blk-mq: Fix recently introduced races in the timeout handling code Bart Van Assche
  2018-04-09  8:37 ` Sagi Grimberg
@ 2018-04-09  9:37 ` Christoph Hellwig
  2018-04-09 14:58     ` Bart Van Assche
  2018-04-09 16:47 ` Tejun Heo
  2 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2018-04-09  9:37 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Tejun Heo,
	Sagi Grimberg, Israel Rukshin, Max Gurtovoy, stable

This looks sensible, but I'm worried about taking a whole spinlock
for every request completion, including irq disabling.  However it seems
like your new updated pattern would fit use of cmpxchg() very nicely.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
  2018-04-09  8:37 ` Sagi Grimberg
@ 2018-04-09 14:37     ` Bart Van Assche
  2018-04-09 15:42   ` Israel Rukshin
  2018-04-09 16:49   ` Tejun Heo
  2 siblings, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2018-04-09 14:37 UTC (permalink / raw)
  To: sagi, axboe; +Cc: hch, tj, israelr, linux-block, maxg, stable

T24gTW9uLCAyMDE4LTA0LTA5IGF0IDExOjM3ICswMzAwLCBTYWdpIEdyaW1iZXJnIHdyb3RlOg0K
PiA+IElmIGEgY29tcGxldGlvbiBvY2N1cnMgYWZ0ZXIgYmxrX21xX3JxX3RpbWVkX291dCgpIGhh
cyByZXNldA0KPiA+IHJxLT5hYm9ydGVkX2dzdGF0ZSBhbmQgdGhlIHJlcXVlc3QgaXMgYWdhaW4g
aW4gZmxpZ2h0IHdoZW4gdGhlIHRpbWVvdXQNCj4gPiBleHBpcmVzIHRoZW4gYSByZXF1ZXN0IHdp
bGwgYmUgY29tcGxldGVkIHR3aWNlOiBhIGZpcnN0IHRpbWUgYnkgdGhlDQo+ID4gdGltZW91dCBo
YW5kbGVyIGFuZCBhIHNlY29uZCB0aW1lIHdoZW4gdGhlIHJlZ3VsYXIgY29tcGxldGlvbiBvY2N1
cnMuDQo+ID4gDQo+ID4gQWRkaXRpb25hbGx5LCB0aGUgYmxrLW1xIHRpbWVvdXQgaGFuZGxpbmcg
Y29kZSBpZ25vcmVzIGNvbXBsZXRpb25zIHRoYXQNCj4gPiBvY2N1ciBhZnRlciBibGtfbXFfY2hl
Y2tfZXhwaXJlZCgpIGhhcyBiZWVuIGNhbGxlZCBhbmQgYmVmb3JlDQo+ID4gYmxrX21xX3JxX3Rp
bWVkX291dCgpIGhhcyByZXNldCBycS0+YWJvcnRlZF9nc3RhdGUuIElmIGEgYmxvY2sgZHJpdmVy
DQo+ID4gdGltZW91dCBoYW5kbGVyIGFsd2F5cyByZXR1cm5zIEJMS19FSF9SRVNFVF9USU1FUiB0
aGVuIHRoZSByZXN1bHQgd2lsbA0KPiA+IGJlIHRoYXQgdGhlIHJlcXVlc3QgbmV2ZXIgdGVybWlu
YXRlcy4NCj4gDQo+IE9LLCBub3cgSSB1bmRlcnN0YW5kIGhvdyB3ZSBjYW4gY29tcGxldGUgdHdp
Y2UuIElzcmFlbCwgY2FuIHlvdSB2ZXJpZnkNCj4gdGhpcyBwYXRjaCBzb2x2ZXMgeW91ciBkb3Vi
bGUgY29tcGxldGlvbiBwcm9ibGVtPw0KPiANCj4gR2l2ZW4gdGhhdCBpdCBpcywgdGhlIGNoYW5n
ZSBsb2cgb2YgeW91ciBwYXRjaGVzIHNob3VsZCBiZSBtb2RpZmllZCB0bw0KPiB0aGUgb3JpZ2lu
YWwgYnVnIHJlcG9ydCBpdCBzb2x2ZXMuDQo+IA0KPiBUaHJlYWQgc3RhcnRzIGhlcmU6DQo+IGh0
dHA6Ly9saXN0cy5pbmZyYWRlYWQub3JnL3BpcGVybWFpbC9saW51eC1udm1lLzIwMTgtRmVicnVh
cnkvMDE1ODQ4Lmh0bWwNCg0KSGVsbG8gU2FnaSwNCg0KSSB3aWxsIHdhaXQgdW50aWwgaXQgaGFz
IGJlZW4gdmVyaWZpZWQgd2hldGhlciBvciBub3QgdGhpcyBwYXRjaCBmaXhlcyB0aGUNCiJudm1l
LXJkbWEgY29ycnVwdHMgbWVtb3J5IHVwb24gdGltZW91dCIgaXNzdWUgYmVmb3JlIGFkZGluZyBh
IHJlZmVyZW5jZSB0bw0KdGhhdCBpc3N1ZSBpbiB0aGUgZGVzY3JpcHRpb24gb2YgdGhpcyBwYXRj
aC4NCg0KVGhhbmtzLA0KDQpCYXJ0Lg0KDQoNCg==

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
@ 2018-04-09 14:37     ` Bart Van Assche
  0 siblings, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2018-04-09 14:37 UTC (permalink / raw)
  To: sagi, axboe; +Cc: hch, tj, israelr, linux-block, maxg, stable

On Mon, 2018-04-09 at 11:37 +0300, Sagi Grimberg wrote:
> > If a completion occurs after blk_mq_rq_timed_out() has reset
> > rq->aborted_gstate and the request is again in flight when the timeout
> > expires then a request will be completed twice: a first time by the
> > timeout handler and a second time when the regular completion occurs.
> > 
> > Additionally, the blk-mq timeout handling code ignores completions that
> > occur after blk_mq_check_expired() has been called and before
> > blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
> > timeout handler always returns BLK_EH_RESET_TIMER then the result will
> > be that the request never terminates.
> 
> OK, now I understand how we can complete twice. Israel, can you verify
> this patch solves your double completion problem?
> 
> Given that it is, the change log of your patches should be modified to
> the original bug report it solves.
> 
> Thread starts here:
> http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html

Hello Sagi,

I will wait until it has been verified whether or not this patch fixes the
"nvme-rdma corrupts memory upon timeout" issue before adding a reference to
that issue in the description of this patch.

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
  2018-04-09  9:37 ` Christoph Hellwig
@ 2018-04-09 14:58     ` Bart Van Assche
  0 siblings, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2018-04-09 14:58 UTC (permalink / raw)
  To: hch; +Cc: tj, israelr, linux-block, maxg, stable, axboe, sagi

T24gTW9uLCAyMDE4LTA0LTA5IGF0IDExOjM3ICswMjAwLCBDaHJpc3RvcGggSGVsbHdpZyB3cm90
ZToNCj4gVGhpcyBsb29rcyBzZW5zaWJsZSwgYnV0IEknbSB3b3JyaWVkIGFib3V0IHRha2luZyBh
IHdob2xlIHNwaW5sb2NrDQo+IGZvciBldmVyeSByZXF1ZXN0IGNvbXBsZXRpb24sIGluY2x1ZGlu
ZyBpcnEgZGlzYWJsaW5nLiAgSG93ZXZlciBpdCBzZWVtcw0KPiBsaWtlIHlvdXIgbmV3IHVwZGF0
ZWQgcGF0dGVybiB3b3VsZCBmaXQgdXNlIG9mIGNtcHhjaGcoKSB2ZXJ5IG5pY2VseS4NCg0KSGVs
bG8gQ2hyaXN0b3BoLA0KDQpUaGFua3MgZm9yIHRoZSByZXZpZXcuIEkgaGFkIGEgbG9vayBhdCB0
aGUgc3BpbiBsb2NrIGltcGxlbWVudGF0aW9uIG9uDQp4ODYgYW5kIGFwcGFyZW50bHkgb24geDg2
IHNwaW4gbG9ja3MgYXJlIGltcGxlbWVudGVkIGFzIHFzcGlubG9ja3MNCihpbmNsdWRlL2FzbS1n
ZW5lcmljL3FzcGlubG9jay5oKS4gcXVldWVkX3NwaW5fbG9jaygpIGFscmVhZHkgdXNlcw0KYXRv
bWljX2NtcHhjaGdfYWNxdWlyZSgpLiBBcmUgeW91IHN1cmUgdGhhdCByZXBsYWNpbmcgdGhlIHNw
aW4gbG9jaw0KYnkgY21weGNoZygpIHdpbGwgeWllbGQgYSBwZXJmb3JtYW5jZSBpbXByb3ZlbWVu
dD8NCg0KVGhhbmtzLA0KDQpCYXJ0Lg0K

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
@ 2018-04-09 14:58     ` Bart Van Assche
  0 siblings, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2018-04-09 14:58 UTC (permalink / raw)
  To: hch; +Cc: tj, israelr, linux-block, maxg, stable, axboe, sagi

On Mon, 2018-04-09 at 11:37 +0200, Christoph Hellwig wrote:
> This looks sensible, but I'm worried about taking a whole spinlock
> for every request completion, including irq disabling.  However it seems
> like your new updated pattern would fit use of cmpxchg() very nicely.

Hello Christoph,

Thanks for the review. I had a look at the spin lock implementation on
x86 and apparently on x86 spin locks are implemented as qspinlocks
(include/asm-generic/qspinlock.h). queued_spin_lock() already uses
atomic_cmpxchg_acquire(). Are you sure that replacing the spin lock
by cmpxchg() will yield a performance improvement?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
  2018-04-09 14:58     ` Bart Van Assche
  (?)
@ 2018-04-09 15:03     ` Jens Axboe
  -1 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2018-04-09 15:03 UTC (permalink / raw)
  To: Bart Van Assche, hch; +Cc: tj, israelr, linux-block, maxg, stable, sagi

On 4/9/18 8:58 AM, Bart Van Assche wrote:
> On Mon, 2018-04-09 at 11:37 +0200, Christoph Hellwig wrote:
>> This looks sensible, but I'm worried about taking a whole spinlock
>> for every request completion, including irq disabling.  However it seems
>> like your new updated pattern would fit use of cmpxchg() very nicely.
> 
> Hello Christoph,
> 
> Thanks for the review. I had a look at the spin lock implementation on
> x86 and apparently on x86 spin locks are implemented as qspinlocks
> (include/asm-generic/qspinlock.h). queued_spin_lock() already uses
> atomic_cmpxchg_acquire(). Are you sure that replacing the spin lock
> by cmpxchg() will yield a performance improvement?

It's definitely worth a shot, especially since this looks like a clear
cut case for cmpxchg(). Additionally, it also kills the local irq
disable.

Conversion should be trivial and we can do some perf testing around
that.

Neither solution really makes me happy though, the prospect of
either kind of synchronization cost isn't appealing at all. I'll
have to ponder this a lot more, but it would be ideal if we could
shift that cost to the extremely unlikely case of a timeout
triggering rather than add the cost upfront to EVERY request.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
  2018-04-09  8:37 ` Sagi Grimberg
  2018-04-09 14:37     ` Bart Van Assche
@ 2018-04-09 15:42   ` Israel Rukshin
  2018-04-09 16:49   ` Tejun Heo
  2 siblings, 0 replies; 19+ messages in thread
From: Israel Rukshin @ 2018-04-09 15:42 UTC (permalink / raw)
  To: Sagi Grimberg, Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Tejun Heo, Max Gurtovoy, stable

On 4/9/2018 11:37 AM, Sagi Grimberg wrote:
>
>> If a completion occurs after blk_mq_rq_timed_out() has reset
>> rq->aborted_gstate and the request is again in flight when the timeout
>> expires then a request will be completed twice: a first time by the
>> timeout handler and a second time when the regular completion occurs.
>>
>> Additionally, the blk-mq timeout handling code ignores completions that
>> occur after blk_mq_check_expired() has been called and before
>> blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
>> timeout handler always returns BLK_EH_RESET_TIMER then the result will
>> be that the request never terminates.
>
> OK, now I understand how we can complete twice. Israel, can you verify
> this patch solves your double completion problem?

I just verified that this commit fix the double completion problem.
We still need my patches to fix the original bug report.

>
> Given that it is, the change log of your patches should be modified to
> the original bug report it solves.
>
> Thread starts here:
> http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
  2018-04-09  5:20 [PATCH] blk-mq: Fix recently introduced races in the timeout handling code Bart Van Assche
  2018-04-09  8:37 ` Sagi Grimberg
  2018-04-09  9:37 ` Christoph Hellwig
@ 2018-04-09 16:47 ` Tejun Heo
  2018-04-09 17:03     ` Bart Van Assche
  2 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2018-04-09 16:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy, stable

Hey, Bart.

On Sun, Apr 08, 2018 at 10:20:38PM -0700, Bart Van Assche wrote:
> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout
> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs.

Are we still talking about the same BLK_EH_RESET_TIMER case?  This can
be solved by the two patches which rcu-synchronizes the hand-over to
normal completion path, right?

> Additionally, the blk-mq timeout handling code ignores completions that
> occur after blk_mq_check_expired() has been called and before
> blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
> timeout handler always returns BLK_EH_RESET_TIMER then the result will
> be that the request never terminates.

And this is the same race window which was always there, right?  I
really don't think reducing or closing this window requires full
synchronization.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
  2018-04-09  8:37 ` Sagi Grimberg
  2018-04-09 14:37     ` Bart Van Assche
  2018-04-09 15:42   ` Israel Rukshin
@ 2018-04-09 16:49   ` Tejun Heo
  2 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2018-04-09 16:49 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Bart Van Assche, Jens Axboe, linux-block, Christoph Hellwig,
	Israel Rukshin, Max Gurtovoy, stable

Hello, Sagi.

On Mon, Apr 09, 2018 at 11:37:15AM +0300, Sagi Grimberg wrote:
> 
> >If a completion occurs after blk_mq_rq_timed_out() has reset
> >rq->aborted_gstate and the request is again in flight when the timeout
> >expires then a request will be completed twice: a first time by the
> >timeout handler and a second time when the regular completion occurs.
> >
> >Additionally, the blk-mq timeout handling code ignores completions that
> >occur after blk_mq_check_expired() has been called and before
> >blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
> >timeout handler always returns BLK_EH_RESET_TIMER then the result will
> >be that the request never terminates.
> 
> OK, now I understand how we can complete twice. Israel, can you verify
> this patch solves your double completion problem?
> 
> Given that it is, the change log of your patches should be modified to
> the original bug report it solves.
> 
> Thread starts here:
> http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html

Can you please see whether the following two patches fix the problem
you've been seeing?

 http://lkml.kernel.org/r/20180402190053.GC388343@devbig577.frc2.facebook.com
 http://lkml.kernel.org/r/20180402190120.GD388343@devbig577.frc2.facebook.com

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
  2018-04-09 16:47 ` Tejun Heo
@ 2018-04-09 17:03     ` Bart Van Assche
  0 siblings, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2018-04-09 17:03 UTC (permalink / raw)
  To: tj; +Cc: hch, maxg, israelr, linux-block, stable, axboe, sagi

T24gTW9uLCAyMDE4LTA0LTA5IGF0IDA5OjQ3IC0wNzAwLCBUZWp1biBIZW8gd3JvdGU6DQo+IE9u
IFN1biwgQXByIDA4LCAyMDE4IGF0IDEwOjIwOjM4UE0gLTA3MDAsIEJhcnQgVmFuIEFzc2NoZSB3
cm90ZToNCj4gPiBJZiBhIGNvbXBsZXRpb24gb2NjdXJzIGFmdGVyIGJsa19tcV9ycV90aW1lZF9v
dXQoKSBoYXMgcmVzZXQNCj4gPiBycS0+YWJvcnRlZF9nc3RhdGUgYW5kIHRoZSByZXF1ZXN0IGlz
IGFnYWluIGluIGZsaWdodCB3aGVuIHRoZSB0aW1lb3V0DQo+ID4gZXhwaXJlcyB0aGVuIGEgcmVx
dWVzdCB3aWxsIGJlIGNvbXBsZXRlZCB0d2ljZTogYSBmaXJzdCB0aW1lIGJ5IHRoZQ0KPiA+IHRp
bWVvdXQgaGFuZGxlciBhbmQgYSBzZWNvbmQgdGltZSB3aGVuIHRoZSByZWd1bGFyIGNvbXBsZXRp
b24gb2NjdXJzLg0KPiANCj4gQXJlIHdlIHN0aWxsIHRhbGtpbmcgYWJvdXQgdGhlIHNhbWUgQkxL
X0VIX1JFU0VUX1RJTUVSIGNhc2U/ICBUaGlzIGNhbg0KPiBiZSBzb2x2ZWQgYnkgdGhlIHR3byBw
YXRjaGVzIHdoaWNoIHJjdS1zeW5jaHJvbml6ZXMgdGhlIGhhbmQtb3ZlciB0bw0KPiBub3JtYWwg
Y29tcGxldGlvbiBwYXRoLCByaWdodD8NCg0KSGVsbG8gVGVqdW4sDQoNCk15IG9waW5pb24gaXMg
bm90IG9ubHkgdGhhdCB0aGUgdHdvIHBhdGNoZXMgdGhhdCB5b3UgcG9zdGVkIHJlY2VudGx5IGRv
IG5vdA0KZml4IGFsbCB0aGUgcmFjZXMgdGhhdCBhcmUgZml4ZWQgYnkgdGhpcyBwYXRjaCBidXQg
YWxzbyB0aGF0IHRoZSByYWNlcyB0aGF0DQpleGlzdCB0b2RheSBpbiB0aGUgYmxrLW1xIHRpbWVv
dXQgaGFuZGxpbmcgY29kZSBjYW5ub3QgYmUgZml4ZWQgY29tcGxldGVseQ0KdXNpbmcgUkNVIG9u
bHkuDQoNCj4gPiBBZGRpdGlvbmFsbHksIHRoZSBibGstbXEgdGltZW91dCBoYW5kbGluZyBjb2Rl
IGlnbm9yZXMgY29tcGxldGlvbnMgdGhhdA0KPiA+IG9jY3VyIGFmdGVyIGJsa19tcV9jaGVja19l
eHBpcmVkKCkgaGFzIGJlZW4gY2FsbGVkIGFuZCBiZWZvcmUNCj4gPiBibGtfbXFfcnFfdGltZWRf
b3V0KCkgaGFzIHJlc2V0IHJxLT5hYm9ydGVkX2dzdGF0ZS4gSWYgYSBibG9jayBkcml2ZXINCj4g
PiB0aW1lb3V0IGhhbmRsZXIgYWx3YXlzIHJldHVybnMgQkxLX0VIX1JFU0VUX1RJTUVSIHRoZW4g
dGhlIHJlc3VsdCB3aWxsDQo+ID4gYmUgdGhhdCB0aGUgcmVxdWVzdCBuZXZlciB0ZXJtaW5hdGVz
Lg0KPiANCj4gQW5kIHRoaXMgaXMgdGhlIHNhbWUgcmFjZSB3aW5kb3cgd2hpY2ggd2FzIGFsd2F5
cyB0aGVyZSwgcmlnaHQ/ICBJDQo+IHJlYWxseSBkb24ndCB0aGluayByZWR1Y2luZyBvciBjbG9z
aW5nIHRoaXMgd2luZG93IHJlcXVpcmVzIGZ1bGwNCj4gc3luY2hyb25pemF0aW9uLg0KDQpUaGF0
IHJhY2Ugd2luZG93IGRpZCBub3QgZXhpc3QgaW4gdGhlIGxlZ2FjeSBibG9jayBsYXllci4gSSBk
b24ndCB0aGluaw0Kd2Ugc2hvdWxkIHRvbGVyYXRlIHN1Y2ggYSByYWNlIHdpbmRvdyBpbiB0aGUg
YmxrLW1xIGNvZGUgZWl0aGVyLiBJZiBhIHJlcXVlc3QNCmRvZXMgbm90IGdldCBjb21wbGV0ZWQg
dGhhdCBsZWFkcyB0byB1bmtpbGxhYmxlIHByb2Nlc3NlcyBvciBoYW5naW5nIGtlcm5lbA0KY29k
ZS4gU3VjaCBpc3N1ZXMgYXJlIGhhcmQgdG8gaWRlbnRpZnkgd2hlbiByZXBvcnRlZCBieSB1c2Vy
cy4gSSB0aGluayB3ZQ0Kc2hvdWxkIGZpeCB0aGVzZSByYWNlcyBiZWZvcmUgdGhlc2UgY2F1c2Ug
bW9yZSB0cm91YmxlIHRvIExpbnV4IHVzZXJzLg0KDQpUaGFua3MsDQoNCkJhcnQuDQoNCg0K

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
@ 2018-04-09 17:03     ` Bart Van Assche
  0 siblings, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2018-04-09 17:03 UTC (permalink / raw)
  To: tj; +Cc: hch, maxg, israelr, linux-block, stable, axboe, sagi

On Mon, 2018-04-09 at 09:47 -0700, Tejun Heo wrote:
> On Sun, Apr 08, 2018 at 10:20:38PM -0700, Bart Van Assche wrote:
> > If a completion occurs after blk_mq_rq_timed_out() has reset
> > rq->aborted_gstate and the request is again in flight when the timeout
> > expires then a request will be completed twice: a first time by the
> > timeout handler and a second time when the regular completion occurs.
> 
> Are we still talking about the same BLK_EH_RESET_TIMER case?  This can
> be solved by the two patches which rcu-synchronizes the hand-over to
> normal completion path, right?

Hello Tejun,

My opinion is not only that the two patches that you posted recently do not
fix all the races that are fixed by this patch but also that the races that
exist today in the blk-mq timeout handling code cannot be fixed completely
using RCU only.

> > Additionally, the blk-mq timeout handling code ignores completions that
> > occur after blk_mq_check_expired() has been called and before
> > blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
> > timeout handler always returns BLK_EH_RESET_TIMER then the result will
> > be that the request never terminates.
> 
> And this is the same race window which was always there, right?  I
> really don't think reducing or closing this window requires full
> synchronization.

That race window did not exist in the legacy block layer. I don't think
we should tolerate such a race window in the blk-mq code either. If a request
does not get completed that leads to unkillable processes or hanging kernel
code. Such issues are hard to identify when reported by users. I think we
should fix these races before these cause more trouble to Linux users.

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
  2018-04-09 17:03     ` Bart Van Assche
  (?)
@ 2018-04-09 18:56     ` tj
  2018-04-09 21:30         ` Bart Van Assche
  -1 siblings, 1 reply; 19+ messages in thread
From: tj @ 2018-04-09 18:56 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, maxg, israelr, linux-block, stable, axboe, sagi

Hello,

On Mon, Apr 09, 2018 at 05:03:05PM +0000, Bart Van Assche wrote:
> My opinion is not only that the two patches that you posted recently do not
> fix all the races that are fixed by this patch but also that the races that

The race was with the path where the ownership of a timed out request
is passed back to normal completion path and those two patches fix
that, right?

> exist today in the blk-mq timeout handling code cannot be fixed completely
> using RCU only.

I really don't think that is that complicated.  Let's first confirm
the race fix and get to narrowing / closing that window.

> That race window did not exist in the legacy block layer. I don't think

IIRC, it did.  It was there when I first started working on libata EH
years ago.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
  2018-04-09 18:56     ` tj
@ 2018-04-09 21:30         ` Bart Van Assche
  0 siblings, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2018-04-09 21:30 UTC (permalink / raw)
  To: tj; +Cc: hch, maxg, israelr, linux-block, stable, axboe, sagi

T24gTW9uLCAyMDE4LTA0LTA5IGF0IDExOjU2IC0wNzAwLCB0akBrZXJuZWwub3JnIHdyb3RlOg0K
PiBPbiBNb24sIEFwciAwOSwgMjAxOCBhdCAwNTowMzowNVBNICswMDAwLCBCYXJ0IFZhbiBBc3Nj
aGUgd3JvdGU6DQo+ID4gZXhpc3QgdG9kYXkgaW4gdGhlIGJsay1tcSB0aW1lb3V0IGhhbmRsaW5n
IGNvZGUgY2Fubm90IGJlIGZpeGVkIGNvbXBsZXRlbHkNCj4gPiB1c2luZyBSQ1Ugb25seS4NCj4g
DQo+IEkgcmVhbGx5IGRvbid0IHRoaW5rIHRoYXQgaXMgdGhhdCBjb21wbGljYXRlZC4gIExldCdz
IGZpcnN0IGNvbmZpcm0NCj4gdGhlIHJhY2UgZml4IGFuZCBnZXQgdG8gbmFycm93aW5nIC8gY2xv
c2luZyB0aGF0IHdpbmRvdy4NCg0KSGVsbG8gVGVqdW4sDQoNClR3byBtb250aHMgYWdvIGl0IHdh
cyByZXBvcnRlZCBmb3IgdGhlIGZpcnN0IHRpbWUgdGhhdCBjb21taXQgMWQ5YmQ1MTYxYmEzDQoo
ImJsay1tcTogcmVwbGFjZSB0aW1lb3V0IHN5bmNocm9uaXphdGlvbiB3aXRoIGEgUkNVIGFuZCBn
ZW5lcmF0aW9uIGJhc2VkDQpzY2hlbWUiKSBpbnRyb2R1Y2VzIGEgcmVncmVzc2lvbi4gU2luY2Ug
dGhhdCByZXBvcnQgbm9ib2R5IGhhcyBwb3N0ZWQgYQ0KcGF0Y2ggdGhhdCBmaXhlcyBhbGwgcmFj
ZXMgcmVsYXRlZCB0byBibGstbXEgdGltZW91dCBoYW5kbGluZyBhbmQgdGhhdCBvbmx5DQp1c2Vz
IFJDVS4gSWYgeW91IHdhbnQgdG8gY29udGludWUgd29ya2luZyBvbiB0aGlzIHRoYXQncyBmaW5l
IHdpdGggbWUuIEJ1dA0Kc2luY2UgbXkgb3BpbmlvbiBpcyB0aGF0IGl0IGlzIGltcG9zc2libGUg
dG8gZml4IHRoZXNlIHJhY2VzIHVzaW5nIFJDVSBvbmx5DQpJIHdpbGwgY29udGludWUgd29ya2lu
ZyBvbiBhbiBhbHRlcm5hdGl2ZSBhcHByb2FjaC4gU2VlIGFsc28gIltQQVRDSF0NCmJsay1tcTog
Rml4IGEgcmFjZSBiZXR3ZWVuIHJlc2V0dGluZyB0aGUgdGltZXIgYW5kIGNvbXBsZXRpb24gaGFu
ZGxpbmciDQooaHR0cHM6Ly93d3cubWFpbC1hcmNoaXZlLmNvbS9saW51eC1ibG9ja0B2Z2VyLmtl
cm5lbC5vcmcvbXNnMTgwODkuaHRtbCkuDQoNClRoYW5rcywNCg0KQmFydC4NCg0KDQoNCg==

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
@ 2018-04-09 21:30         ` Bart Van Assche
  0 siblings, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2018-04-09 21:30 UTC (permalink / raw)
  To: tj; +Cc: hch, maxg, israelr, linux-block, stable, axboe, sagi

On Mon, 2018-04-09 at 11:56 -0700, tj@kernel.org wrote:
> On Mon, Apr 09, 2018 at 05:03:05PM +0000, Bart Van Assche wrote:
> > exist today in the blk-mq timeout handling code cannot be fixed completely
> > using RCU only.
> 
> I really don't think that is that complicated.  Let's first confirm
> the race fix and get to narrowing / closing that window.

Hello Tejun,

Two months ago it was reported for the first time that commit 1d9bd5161ba3
("blk-mq: replace timeout synchronization with a RCU and generation based
scheme") introduces a regression. Since that report nobody has posted a
patch that fixes all races related to blk-mq timeout handling and that only
uses RCU. If you want to continue working on this that's fine with me. But
since my opinion is that it is impossible to fix these races using RCU only
I will continue working on an alternative approach. See also "[PATCH]
blk-mq: Fix a race between resetting the timer and completion handling"
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html).

Thanks,

Bart.




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
  2018-04-09 21:30         ` Bart Van Assche
  (?)
@ 2018-04-09 21:40         ` tj
  -1 siblings, 0 replies; 19+ messages in thread
From: tj @ 2018-04-09 21:40 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, maxg, israelr, linux-block, stable, axboe, sagi

Hello, Bart.

On Mon, Apr 09, 2018 at 09:30:27PM +0000, Bart Van Assche wrote:
> On Mon, 2018-04-09 at 11:56 -0700, tj@kernel.org wrote:
> > On Mon, Apr 09, 2018 at 05:03:05PM +0000, Bart Van Assche wrote:
> > > exist today in the blk-mq timeout handling code cannot be fixed completely
> > > using RCU only.
> > 
> > I really don't think that is that complicated.  Let's first confirm
> > the race fix and get to narrowing / closing that window.
> 
> Two months ago it was reported for the first time that commit 1d9bd5161ba3
> ("blk-mq: replace timeout synchronization with a RCU and generation based
> scheme") introduces a regression. Since that report nobody has posted a
> patch that fixes all races related to blk-mq timeout handling and that only

The two patches using RCU were posted a long time ago.  It was just
that the repro that only you had at the time didn't work anymore so we
couldn't confirm the fix.  If we now have a different repro, awesome.
Let's see whether the fix works.

> uses RCU. If you want to continue working on this that's fine with me. But
> since my opinion is that it is impossible to fix these races using RCU only
> I will continue working on an alternative approach. See also "[PATCH]
> blk-mq: Fix a race between resetting the timer and completion handling"
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html).

ISTR discussing that patch earlier.  Didn't the RCU based fix get
posted after that discussion?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
@ 2018-04-10 15:37 ` Alex G.
  0 siblings, 0 replies; 19+ messages in thread
From: Alex G. @ 2018-04-10 15:37 UTC (permalink / raw)
  To: linux-block
  Cc: linux-nvme, austin_bolen, Alex_Gagniuc, shyam_iyer, Keith Busch,
	Scott Bauer

I've observed a very nasty NULL pointer dereference with NVME drives on
hot-unplug [1]. While I haven't gone in the details of this change, it
does fix the NULL dereference I've been seeing.

Alex

[1] http://lists.infradead.org/pipermail/linux-nvme/2018-April/016623.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
@ 2018-04-10 15:37 ` Alex G.
  0 siblings, 0 replies; 19+ messages in thread
From: Alex G. @ 2018-04-10 15:37 UTC (permalink / raw)


I've observed a very nasty NULL pointer dereference with NVME drives on
hot-unplug [1]. While I haven't gone in the details of this change, it
does fix the NULL dereference I've been seeing.

Alex

[1] http://lists.infradead.org/pipermail/linux-nvme/2018-April/016623.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-04-10 15:37 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-09  5:20 [PATCH] blk-mq: Fix recently introduced races in the timeout handling code Bart Van Assche
2018-04-09  8:37 ` Sagi Grimberg
2018-04-09 14:37   ` Bart Van Assche
2018-04-09 14:37     ` Bart Van Assche
2018-04-09 15:42   ` Israel Rukshin
2018-04-09 16:49   ` Tejun Heo
2018-04-09  9:37 ` Christoph Hellwig
2018-04-09 14:58   ` Bart Van Assche
2018-04-09 14:58     ` Bart Van Assche
2018-04-09 15:03     ` Jens Axboe
2018-04-09 16:47 ` Tejun Heo
2018-04-09 17:03   ` Bart Van Assche
2018-04-09 17:03     ` Bart Van Assche
2018-04-09 18:56     ` tj
2018-04-09 21:30       ` Bart Van Assche
2018-04-09 21:30         ` Bart Van Assche
2018-04-09 21:40         ` tj
2018-04-10 15:37 Alex G.
2018-04-10 15:37 ` Alex G.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.