linux-kernel.vger.kernel.org archive mirror
* [PATCHSET] blk-mq: reimplement timeout handling
@ 2017-12-09 19:25 Tejun Heo
  2017-12-09 19:25 ` [PATCH 1/6] blk-mq: protect completion path with RCU Tejun Heo
                   ` (8 more replies)
  0 siblings, 9 replies; 26+ messages in thread
From: Tejun Heo @ 2017-12-09 19:25 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, oleg, peterz, kernel-team, osandov

Currently, the blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules.  Unfortunately, it contains quite a few holes.

It's pretty easy to make blk_mq_check_expired() terminate a later
instance of a request.  If we induce a 5s delay before the
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly.  Nothing actually timed out; the
timeout path simply made the call on a recycled instance of a request
and then terminated a later instance long after the original instance
had finished.  The scenario isn't theoretical either.
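
The reproduction hack amounts to something like the following.  This is
illustrative only and not part of the series; apart from the injected
msleep(), the body mirrors the current blk_mq_check_expired():

static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
		struct request *rq, void *priv, bool reserved)
{
	struct blk_mq_timeout_data *data = priv;
	unsigned long deadline;

	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
		return;

	/* ensure the deadline read below is up-to-date wrt STARTED */
	smp_rmb();
	deadline = READ_ONCE(rq->deadline);

	/*
	 * Injected delay: this runs from the timeout work item, so
	 * sleeping is fine.  By the time we act on the values sampled
	 * above, the request may have completed and been recycled.
	 */
	msleep(5000);

	if (time_after_eq(jiffies, deadline)) {
		/* possibly marking a recycled instance for timeout */
		if (!blk_mark_rq_complete(rq))
			blk_mq_rq_timed_out(rq, reserved);
	} else if (!data->next_set || time_after(data->next, deadline)) {
		data->next = deadline;
		data->next_set = 1;
	}
}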

This patchset replaces the broken synchronization mechanism with an
RCU and generation number based one.  Please read the description of
the second patch for more details.

Oleg, Peter, I'd really appreciate it if you could go over the
reported breakages and the new implementation.

This patchset contains the following six patches.

 0001-blk-mq-protect-completion-path-with-RCU.patch
 0002-blk-mq-replace-timeout-synchronization-with-a-RCU-an.patch
 0003-blk-mq-use-blk_mq_rq_state-instead-of-testing-REQ_AT.patch
 0004-blk-mq-make-blk_abort_request-trigger-timeout-path.patch
 0005-blk-mq-remove-REQ_ATOM_COMPLETE-usages-from-blk-mq.patch
 0006-blk-mq-remove-REQ_ATOM_STARTED.patch

and is available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blk-mq-timeout

diffstat follows.  Thanks.

 block/blk-core.c       |    2 
 block/blk-mq-debugfs.c |    4 
 block/blk-mq.c         |  246 +++++++++++++++++++++++++++----------------------
 block/blk-mq.h         |   48 +++++++++
 block/blk-timeout.c    |    9 -
 block/blk.h            |    7 -
 include/linux/blk-mq.h |    1 
 include/linux/blkdev.h |   23 ++++
 8 files changed, 218 insertions(+), 122 deletions(-)

--
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/6] blk-mq: protect completion path with RCU
  2017-12-09 19:25 [PATCHSET] blk-mq: reimplement timeout handling Tejun Heo
@ 2017-12-09 19:25 ` Tejun Heo
  2017-12-13  3:10   ` jianchao.wang
  2017-12-09 19:25 ` [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme Tejun Heo
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-12-09 19:25 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, oleg, peterz, kernel-team, osandov, Tejun Heo

Currently, blk-mq protects only the issue path with RCU.  This patch
puts the completion path under the same RCU protection.  This will be
used by later patches to synchronize issue/completion against the
timeout path; those patches will also add the explanatory comments.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1109747..acf4fbb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -568,11 +568,23 @@ static void __blk_mq_complete_request(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
+	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
-	if (!blk_mark_rq_complete(rq))
-		__blk_mq_complete_request(rq);
+
+	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
+		rcu_read_lock();
+		if (!blk_mark_rq_complete(rq))
+			__blk_mq_complete_request(rq);
+		rcu_read_unlock();
+	} else {
+		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
+		if (!blk_mark_rq_complete(rq))
+			__blk_mq_complete_request(rq);
+		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
+	}
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-09 19:25 [PATCHSET] blk-mq: reimplement timeout handling Tejun Heo
  2017-12-09 19:25 ` [PATCH 1/6] blk-mq: protect completion path with RCU Tejun Heo
@ 2017-12-09 19:25 ` Tejun Heo
  2017-12-12 10:09   ` Peter Zijlstra
                     ` (2 more replies)
  2017-12-09 19:25 ` [PATCH 3/6] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE Tejun Heo
                   ` (6 subsequent siblings)
  8 siblings, 3 replies; 26+ messages in thread
From: Tejun Heo @ 2017-12-09 19:25 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, oleg, peterz, kernel-team, osandov, Tejun Heo

Currently, the blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules.  Unfortunately, it contains quite a few holes.

There's a complex dance around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between the issue/completion and timeout paths;
however, they don't have a synchronization point across request
recycle instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from the N-2'th
instance, the deadline from the N-1'th, and run blk_mark_rq_complete()
against the Nth instance.

In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request.  If we induce a 5s delay before the
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly.  Nothing actually timed out; the
timeout path simply made the call on a recycled instance of a request
and then terminated a later instance long after the original instance
had finished.  The scenario isn't theoretical either.

This patch replaces the broken synchronization mechanism with an RCU
and generation number based one.

1. Each request has a u64 generation + state value, which can be
   updated only by the request owner.  Whenever a request becomes
   in-flight, the generation number gets bumped up too.  This provides
   the basis for the timeout path to distinguish different recycle
   instances of the request.

   Also, marking a request in-flight and setting its deadline are
   protected with a seqcount so that the timeout path can fetch both
   values coherently.

2. The timeout path fetches the generation, state and deadline.  If
   the verdict is timeout, it records the generation in a dedicated
   abort field (->aborted_gstate) and waits for an RCU grace period.

3. The completion path is also protected by RCU (from the previous
   patch) and checks whether the current generation number and state
   match the abort field.  If so, it skips completion.

4. The timeout path, after the RCU wait, scans requests again and
   terminates the ones whose generation and state still match the ones
   requested for abortion.

   By now, the timeout path knows that it either lost the race (the
   generation number and state have changed) or the completion path
   will yield to it, so it can safely time out the request.

While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
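
To make that concrete, with the encoding used below (state in the low
two bits, generation above it, MQ_RQ_GEN_INC == 4), a recycle sequence
looks like this, with generation N purely for illustration:

  issue, instance N:        gstate         = (N << 2) | MQ_RQ_IN_FLIGHT
  timeout samples/records:  aborted_gstate = (N << 2) | MQ_RQ_IN_FLIGHT
  complete, instance N:     gstate         = (N << 2) | MQ_RQ_IDLE
  re-issue, instance N+1:   gstate         = ((N + 1) << 2) | MQ_RQ_IN_FLIGHT

After the RCU wait, the timeout path sees that ->gstate no longer
matches ->aborted_gstate and leaves the recycled instance alone.  Had
the completion lost the race instead, ->gstate would still read
(N << 2) | MQ_RQ_IN_FLIGHT and the timeout path would own the request.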

While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between the issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places.  Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c       |   2 +
 block/blk-mq.c         | 198 +++++++++++++++++++++++++++++++------------------
 block/blk-mq.h         |  45 +++++++++++
 block/blk-timeout.c    |   2 +-
 block/blk.h            |   6 --
 include/linux/blk-mq.h |   1 +
 include/linux/blkdev.h |  23 ++++++
 7 files changed, 196 insertions(+), 81 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b888175..ccf3f8e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
 	rq->part = NULL;
+	seqcount_init(&rq->gstate_seqc);
+	u64_stats_init(&rq->aborted_gstate_sync);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index acf4fbb..2d093f7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -530,6 +530,9 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
+	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
 	if (rq->rq_flags & RQF_STATS) {
@@ -557,6 +560,19 @@ static void __blk_mq_complete_request(struct request *rq)
 	put_cpu();
 }
 
+static u64 blk_mq_rq_aborted_gstate(struct request *rq)
+{
+	unsigned int start;
+	u64 aborted_gstate;
+
+	do {
+		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
+		aborted_gstate = rq->aborted_gstate;
+	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
+
+	return aborted_gstate;
+}
+
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -574,14 +590,21 @@ void blk_mq_complete_request(struct request *rq)
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
 
+	/*
+	 * If @rq->aborted_gstate equals the current instance, timeout is
+	 * claiming @rq and we lost.  This is synchronized through RCU.
+	 * See blk_mq_timeout_work() for details.
+	 */
 	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
 		rcu_read_lock();
-		if (!blk_mark_rq_complete(rq))
+		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
+		    !blk_mark_rq_complete(rq))
 			__blk_mq_complete_request(rq);
 		rcu_read_unlock();
 	} else {
 		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-		if (!blk_mark_rq_complete(rq))
+		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
+		    !blk_mark_rq_complete(rq))
 			__blk_mq_complete_request(rq);
 		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
 	}
@@ -608,34 +631,28 @@ void blk_mq_start_request(struct request *rq)
 		wbt_issue(q->rq_wb, &rq->issue_stat);
 	}
 
-	blk_add_timer(rq);
-
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 	WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
 
 	/*
-	 * Mark us as started and clear complete. Complete might have been
-	 * set if requeue raced with timeout, which then marked it as
-	 * complete. So be sure to clear complete again when we start
-	 * the request, otherwise we'll ignore the completion event.
+	 * Mark @rq in-flight which also advances the generation number,
+	 * and register for timeout.  Protect with a seqcount to allow the
+	 * timeout path to read both @rq->gstate and @rq->deadline
+	 * coherently.
 	 *
-	 * Ensure that ->deadline is visible before we set STARTED, such that
-	 * blk_mq_check_expired() is guaranteed to observe our ->deadline when
-	 * it observes STARTED.
+	 * This is the only place where a request is marked in-flight.  If
+	 * the timeout path reads an in-flight @rq->gstate, the
+	 * @rq->deadline it reads together under @rq->gstate_seqc is
+	 * guaranteed to be the matching one.
 	 */
-	smp_wmb();
+	write_seqcount_begin(&rq->gstate_seqc);
+	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
+	blk_add_timer(rq);
+	write_seqcount_end(&rq->gstate_seqc);
+
 	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
-		/*
-		 * Coherence order guarantees these consecutive stores to a
-		 * single variable propagate in the specified order. Thus the
-		 * clear_bit() is ordered _after_ the set bit. See
-		 * blk_mq_check_expired().
-		 *
-		 * (the bits must be part of the same byte for this to be
-		 * true).
-		 */
+	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
 		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
-	}
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -668,6 +685,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 	blk_mq_sched_requeue_request(rq);
 
 	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
+		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
@@ -765,6 +783,7 @@ EXPORT_SYMBOL(blk_mq_tag_to_rq);
 struct blk_mq_timeout_data {
 	unsigned long next;
 	unsigned int next_set;
+	unsigned int nr_expired;
 };
 
 void blk_mq_rq_timed_out(struct request *req, bool reserved)
@@ -807,50 +826,48 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	unsigned long deadline;
+	unsigned long gstate, deadline;
+	int start;
 
 	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
 		return;
 
-	/*
-	 * Ensures that if we see STARTED we must also see our
-	 * up-to-date deadline, see blk_mq_start_request().
-	 */
-	smp_rmb();
-
-	deadline = READ_ONCE(rq->deadline);
-
-	/*
-	 * The rq being checked may have been freed and reallocated
-	 * out already here, we avoid this race by checking rq->deadline
-	 * and REQ_ATOM_COMPLETE flag together:
-	 *
-	 * - if rq->deadline is observed as new value because of
-	 *   reusing, the rq won't be timed out because of timing.
-	 * - if rq->deadline is observed as previous value,
-	 *   REQ_ATOM_COMPLETE flag won't be cleared in reuse path
-	 *   because we put a barrier between setting rq->deadline
-	 *   and clearing the flag in blk_mq_start_request(), so
-	 *   this rq won't be timed out too.
-	 */
-	if (time_after_eq(jiffies, deadline)) {
-		if (!blk_mark_rq_complete(rq)) {
-			/*
-			 * Again coherence order ensures that consecutive reads
-			 * from the same variable must be in that order. This
-			 * ensures that if we see COMPLETE clear, we must then
-			 * see STARTED set and we'll ignore this timeout.
-			 *
-			 * (There's also the MB implied by the test_and_clear())
-			 */
-			blk_mq_rq_timed_out(rq, reserved);
-		}
+	/* read coherent snapshots of @rq->gstate and @rq->deadline */
+	do {
+		start = read_seqcount_begin(&rq->gstate_seqc);
+		gstate = READ_ONCE(rq->gstate);
+		deadline = rq->deadline;
+	} while (read_seqcount_retry(&rq->gstate_seqc, start));
+
+	/* if in-flight && overdue, mark for abortion */
+	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
+	    time_after_eq(jiffies, deadline)) {
+		u64_stats_update_begin(&rq->aborted_gstate_sync);
+		rq->aborted_gstate = gstate;
+		u64_stats_update_end(&rq->aborted_gstate_sync);
+		data->nr_expired++;
+		hctx->nr_expired++;
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
 }
 
+static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
+	 * completions that we lost to, they would have finished and
+	 * updated @rq->gstate by now; otherwise, the completion path is
+	 * now guaranteed to see @rq->aborted_gstate and yield.  If
+	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
+	 */
+	if (READ_ONCE(rq->gstate) == rq->aborted_gstate &&
+	    !blk_mark_rq_complete(rq))
+		blk_mq_rq_timed_out(rq, reserved);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
@@ -858,7 +875,9 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	struct blk_mq_timeout_data data = {
 		.next		= 0,
 		.next_set	= 0,
+		.nr_expired	= 0,
 	};
+	struct blk_mq_hw_ctx *hctx;
 	int i;
 
 	/* A deadlock might occur if a request is stuck requiring a
@@ -877,14 +896,40 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
+	/* scan for the expired ones and set their ->aborted_gstate */
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
+	if (data.nr_expired) {
+		bool has_rcu = false;
+
+		/*
+		 * Wait till everyone sees ->aborted_gstate.  The
+		 * sequential waits for SRCUs aren't ideal.  If this ever
+		 * becomes a problem, we can add per-hw_ctx rcu_head and
+		 * wait in parallel.
+		 */
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (!hctx->nr_expired)
+				continue;
+
+			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+				has_rcu = true;
+			else
+				synchronize_srcu(hctx->queue_rq_srcu);
+
+			hctx->nr_expired = 0;
+		}
+		if (has_rcu)
+			synchronize_rcu();
+
+		/* terminate the ones we won */
+		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
+	}
+
 	if (data.next_set) {
 		data.next = blk_rq_timeout(round_jiffies_up(data.next));
 		mod_timer(&q->timeout, data.next);
 	} else {
-		struct blk_mq_hw_ctx *hctx;
-
 		queue_for_each_hw_ctx(q, hctx, i) {
 			/* the hctx may be unmapped, so check it here */
 			if (blk_mq_hw_queue_mapped(hctx))
@@ -1879,6 +1924,22 @@ static size_t order_to_size(unsigned int order)
 	return (size_t)PAGE_SIZE << order;
 }
 
+static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
+			       unsigned int hctx_idx, int node)
+{
+	int ret;
+
+	if (set->ops->init_request) {
+		ret = set->ops->init_request(set, rq, hctx_idx, node);
+		if (ret)
+			return ret;
+	}
+
+	seqcount_init(&rq->gstate_seqc);
+	u64_stats_init(&rq->aborted_gstate_sync);
+	return 0;
+}
+
 int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx, unsigned int depth)
 {
@@ -1940,12 +2001,9 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 			struct request *rq = p;
 
 			tags->static_rqs[i] = rq;
-			if (set->ops->init_request) {
-				if (set->ops->init_request(set, rq, hctx_idx,
-						node)) {
-					tags->static_rqs[i] = NULL;
-					goto fail;
-				}
+			if (blk_mq_init_request(set, rq, hctx_idx, node)) {
+				tags->static_rqs[i] = NULL;
+				goto fail;
 			}
 
 			p += rq_size;
@@ -2084,9 +2142,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	if (!hctx->fq)
 		goto sched_exit_hctx;
 
-	if (set->ops->init_request &&
-	    set->ops->init_request(set, hctx->fq->flush_rq, hctx_idx,
-				   node))
+	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
 		goto free_fq;
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
@@ -2980,12 +3036,6 @@ static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
 
 static int __init blk_mq_init(void)
 {
-	/*
-	 * See comment in block/blk.h rq_atomic_flags enum
-	 */
-	BUILD_BUG_ON((REQ_ATOM_STARTED / BITS_PER_BYTE) !=
-			(REQ_ATOM_COMPLETE / BITS_PER_BYTE));
-
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
 	return 0;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6c7c3ff..5a83ef3 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -27,6 +27,19 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
+/*
+ * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
+ * and the upper bits the generation number.
+ */
+enum mq_rq_state {
+	MQ_RQ_IDLE		= 0,
+	MQ_RQ_IN_FLIGHT		= 1,
+
+	MQ_RQ_STATE_BITS	= 2,
+	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
+	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
+};
+
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
@@ -85,6 +98,38 @@ extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
 
 void blk_mq_release(struct request_queue *q);
 
+/**
+ * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
+ * @rq: target request.
+ */
+static inline int blk_mq_rq_state(struct request *rq)
+{
+	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+}
+
+/**
+ * blk_mq_rq_update_state() - set the current MQ_RQ_* state of a request
+ * @rq: target request.
+ * @state: new state to set.
+ *
+ * Set @rq's state to @state.  The caller is responsible for ensuring that
+ * there are no other updaters.  A request can transition into IN_FLIGHT
+ * only from IDLE and doing so increments the generation number.
+ */
+static inline void blk_mq_rq_update_state(struct request *rq,
+					  enum mq_rq_state state)
+{
+	u64 new_val = (rq->gstate & ~MQ_RQ_STATE_MASK) | state;
+
+	if (state == MQ_RQ_IN_FLIGHT) {
+		WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
+		new_val += MQ_RQ_GEN_INC;
+	}
+
+	/* avoid exposing interim values */
+	WRITE_ONCE(rq->gstate, new_val);
+}
+
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
 					   unsigned int cpu)
 {
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 764ecf9..6427be7 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -208,7 +208,7 @@ void blk_add_timer(struct request *req)
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
-	WRITE_ONCE(req->deadline, jiffies + req->timeout);
+	req->deadline = jiffies + req->timeout;
 
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
diff --git a/block/blk.h b/block/blk.h
index 3f14469..9cb2739 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -123,12 +123,6 @@ void blk_account_io_done(struct request *req);
  * Internal atomic flags for request handling
  */
 enum rq_atomic_flags {
-	/*
-	 * Keep these two bits first - not because we depend on the
-	 * value of them, but we do depend on them being in the same
-	 * byte of storage to ensure ordering on writes. Keeping them
-	 * first will achieve that nicely.
-	 */
 	REQ_ATOM_COMPLETE = 0,
 	REQ_ATOM_STARTED,
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 95c9a5c..460798d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -51,6 +51,7 @@ struct blk_mq_hw_ctx {
 	unsigned int		queue_num;
 
 	atomic_t		nr_active;
+	unsigned int		nr_expired;
 
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8089ca1..e6cfe4b3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,6 +27,8 @@
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
+#include <linux/seqlock.h>
+#include <linux/u64_stats_sync.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -228,6 +230,27 @@ struct request {
 
 	unsigned short write_hint;
 
+	/*
+	 * On blk-mq, the lower bits of ->gstate carry the MQ_RQ_* state
+	 * value and the upper bits the generation number which is
+	 * monotonically incremented and used to distinguish the reuse
+	 * instances.
+	 *
+	 * ->gstate_seqc allows updates to ->gstate and other fields
+	 * (currently ->deadline) during request start to be read
+	 * atomically from the timeout path, so that it can operate on a
+	 * coherent set of information.
+	 */
+	seqcount_t gstate_seqc;
+	u64 gstate;
+
+	/*
+	 * ->aborted_gstate is used by the timeout to claim a specific
+	 * recycle instance of this request.  See blk_mq_timeout_work().
+	 */
+	struct u64_stats_sync aborted_gstate_sync;
+	u64 aborted_gstate;
+
 	unsigned long deadline;
 	struct list_head timeout_list;
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/6] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE
  2017-12-09 19:25 [PATCHSET] blk-mq: reimplement timeout handling Tejun Heo
  2017-12-09 19:25 ` [PATCH 1/6] blk-mq: protect completion path with RCU Tejun Heo
  2017-12-09 19:25 ` [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme Tejun Heo
@ 2017-12-09 19:25 ` Tejun Heo
  2017-12-09 19:25 ` [PATCH 4/6] blk-mq: make blk_abort_request() trigger timeout path Tejun Heo
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2017-12-09 19:25 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, oleg, peterz, kernel-team, osandov, Tejun Heo

blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test
REQ_ATOM_COMPLETE to determine the request state.  Both uses are
speculative and we can test REQ_ATOM_STARTED and blk_mq_rq_state() for
equivalent results.  Replace the tests.  This will allow removing
REQ_ATOM_COMPLETE usages from blk-mq.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2d093f7..39b79bd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -95,8 +95,7 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
 {
 	struct mq_inflight *mi = priv;
 
-	if (test_bit(REQ_ATOM_STARTED, &rq->atomic_flags) &&
-	    !test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
+	if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT) {
 		/*
 		 * index[0] counts the specific partition that was asked
 		 * for. index[1] counts the ones that are active on the
@@ -2950,7 +2949,8 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
 
 	hrtimer_init_sleeper(&hs, current);
 	do {
-		if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
+		if (test_bit(REQ_ATOM_STARTED, &rq->atomic_flags) &&
+		    blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
 			break;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		hrtimer_start_expires(&hs.timer, mode);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/6] blk-mq: make blk_abort_request() trigger timeout path
  2017-12-09 19:25 [PATCHSET] blk-mq: reimplement timeout handling Tejun Heo
                   ` (2 preceding siblings ...)
  2017-12-09 19:25 ` [PATCH 3/6] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE Tejun Heo
@ 2017-12-09 19:25 ` Tejun Heo
  2017-12-09 19:25 ` [PATCH 5/6] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq Tejun Heo
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2017-12-09 19:25 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, oleg, peterz, kernel-team, osandov, Tejun Heo

With issue/complete and timeout paths now using the generation number
and state based synchronization, blk_abort_request() is the only one
which depends on REQ_ATOM_COMPLETE for arbitrating completion.

There's no reason for blk_abort_request() to be a completely separate
path.  This patch makes blk_abort_request() piggyback on the timeout
path instead of trying to terminate the request directly.

This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq.c      | 2 +-
 block/blk-mq.h      | 2 --
 block/blk-timeout.c | 7 ++++---
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 39b79bd..0267040 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -785,7 +785,7 @@ struct blk_mq_timeout_data {
 	unsigned int nr_expired;
 };
 
-void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 5a83ef3..760e7ed 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -94,8 +94,6 @@ extern int blk_mq_sysfs_register(struct request_queue *q);
 extern void blk_mq_sysfs_unregister(struct request_queue *q);
 extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
-extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
-
 void blk_mq_release(struct request_queue *q);
 
 /**
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 6427be7..051924f 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -156,12 +156,13 @@ void blk_timeout_work(struct work_struct *work)
  */
 void blk_abort_request(struct request *req)
 {
-	if (blk_mark_rq_complete(req))
-		return;
 
 	if (req->q->mq_ops) {
-		blk_mq_rq_timed_out(req, false);
+		req->deadline = jiffies;
+		mod_timer(&req->q->timeout, 0);
 	} else {
+		if (blk_mark_rq_complete(req))
+			return;
 		blk_delete_timer(req);
 		blk_rq_timed_out(req);
 	}
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/6] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq
  2017-12-09 19:25 [PATCHSET] blk-mq: reimplement timeout handling Tejun Heo
                   ` (3 preceding siblings ...)
  2017-12-09 19:25 ` [PATCH 4/6] blk-mq: make blk_abort_request() trigger timeout path Tejun Heo
@ 2017-12-09 19:25 ` Tejun Heo
  2017-12-09 19:25 ` [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED Tejun Heo
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2017-12-09 19:25 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, oleg, peterz, kernel-team, osandov, Tejun Heo

After the recent updates to use generation number and state based
synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE for
anything.

Remove all REQ_ATOM_COMPLETE usages.  This removes atomic bitops from
hot paths too.  The next patch will push further in this direction
and include simple performance test results.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0267040..4ebac33 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -596,14 +596,12 @@ void blk_mq_complete_request(struct request *rq)
 	 */
 	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
 		rcu_read_lock();
-		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
-		    !blk_mark_rq_complete(rq))
+		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 			__blk_mq_complete_request(rq);
 		rcu_read_unlock();
 	} else {
 		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
-		    !blk_mark_rq_complete(rq))
+		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 			__blk_mq_complete_request(rq);
 		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
 	}
@@ -650,8 +648,6 @@ void blk_mq_start_request(struct request *rq)
 	write_seqcount_end(&rq->gstate_seqc);
 
 	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
-		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -862,8 +858,7 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 	 * now guaranteed to see @rq->aborted_gstate and yield.  If
 	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
 	 */
-	if (READ_ONCE(rq->gstate) == rq->aborted_gstate &&
-	    !blk_mark_rq_complete(rq))
+	if (READ_ONCE(rq->gstate) == rq->aborted_gstate)
 		blk_mq_rq_timed_out(rq, reserved);
 }
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED
  2017-12-09 19:25 [PATCHSET] blk-mq: reimplement timeout handling Tejun Heo
                   ` (4 preceding siblings ...)
  2017-12-09 19:25 ` [PATCH 5/6] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq Tejun Heo
@ 2017-12-09 19:25 ` Tejun Heo
  2017-12-12 10:09   ` jianchao.wang
  2017-12-12 11:17   ` Nikolay Borisov
  2017-12-11  9:27 ` [PATCHSET] blk-mq: reimplement timeout handling Peter Zijlstra
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 26+ messages in thread
From: Tejun Heo @ 2017-12-09 19:25 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, oleg, peterz, kernel-team, osandov, Tejun Heo

After the recent updates to use generation number and state based
synchronization, we can easily replace REQ_ATOM_STARTED usages by
adding an extra state to distinguish completed but not yet freed
state.

Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
left and is removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq-debugfs.c |  4 +---
 block/blk-mq.c         | 39 ++++++++-------------------------------
 block/blk-mq.h         |  1 +
 block/blk.h            |  1 -
 4 files changed, 10 insertions(+), 35 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b56a4f3..8adc837 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -271,7 +271,6 @@ static const char *const cmd_flag_name[] = {
 #define RQF_NAME(name) [ilog2((__force u32)RQF_##name)] = #name
 static const char *const rqf_name[] = {
 	RQF_NAME(SORTED),
-	RQF_NAME(STARTED),
 	RQF_NAME(QUEUED),
 	RQF_NAME(SOFTBARRIER),
 	RQF_NAME(FLUSH_SEQ),
@@ -295,7 +294,6 @@ static const char *const rqf_name[] = {
 #define RQAF_NAME(name) [REQ_ATOM_##name] = #name
 static const char *const rqaf_name[] = {
 	RQAF_NAME(COMPLETE),
-	RQAF_NAME(STARTED),
 	RQAF_NAME(POLL_SLEPT),
 };
 #undef RQAF_NAME
@@ -409,7 +407,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
 	const struct show_busy_params *params = data;
 
 	if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-	    test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+	    blk_mq_rq_state(rq) != MQ_RQ_IDLE)
 		__blk_mq_debugfs_rq_show(params->m,
 					 list_entry_rq(&rq->queuelist));
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4ebac33..663069d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -482,7 +482,7 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
-	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
 	clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
@@ -530,7 +530,7 @@ static void __blk_mq_complete_request(struct request *rq)
 	int cpu;
 
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
 
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
@@ -610,7 +610,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
 
 int blk_mq_request_started(struct request *rq)
 {
-	return test_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
 }
 EXPORT_SYMBOL_GPL(blk_mq_request_started);
 
@@ -629,7 +629,6 @@ void blk_mq_start_request(struct request *rq)
 	}
 
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
-	WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
 
 	/*
 	 * Mark @rq in-flight which also advances the generation number,
@@ -647,8 +646,6 @@ void blk_mq_start_request(struct request *rq)
 	blk_add_timer(rq);
 	write_seqcount_end(&rq->gstate_seqc);
 
-	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
 		 * Make sure space for the drain appears.  We know we can do
@@ -661,13 +658,9 @@ void blk_mq_start_request(struct request *rq)
 EXPORT_SYMBOL(blk_mq_start_request);
 
 /*
- * When we reach here because queue is busy, REQ_ATOM_COMPLETE
- * flag isn't set yet, so there may be race with timeout handler,
- * but given rq->deadline is just set in .queue_rq() under
- * this situation, the race won't be possible in reality because
- * rq->timeout should be set as big enough to cover the window
- * between blk_mq_start_request() called from .queue_rq() and
- * clearing REQ_ATOM_STARTED here.
+ * When we reach here because queue is busy, it's safe to change the state
+ * to IDLE without checking @rq->aborted_gstate because we should still be
+ * holding the RCU read lock and thus protected against timeout.
  */
 static void __blk_mq_requeue_request(struct request *rq)
 {
@@ -679,7 +672,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
 	blk_mq_sched_requeue_request(rq);
 
-	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
+	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
 		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
@@ -786,18 +779,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	/*
-	 * We know that complete is set at this point. If STARTED isn't set
-	 * anymore, then the request isn't active and the "timeout" should
-	 * just be ignored. This can happen due to the bitflag ordering.
-	 * Timeout first checks if STARTED is set, and if it is, assumes
-	 * the request is active. But if we race with completion, then
-	 * both flags will get cleared. So check here again, and ignore
-	 * a timeout event with a request that isn't active.
-	 */
-	if (!test_bit(REQ_ATOM_STARTED, &req->atomic_flags))
-		return;
-
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
@@ -824,9 +805,6 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 	unsigned long gstate, deadline;
 	int start;
 
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
-		return;
-
 	/* read coherent snapshots of @rq->gstate and @rq->deadline */
 	do {
 		start = read_seqcount_begin(&rq->gstate_seqc);
@@ -2944,8 +2922,7 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
 
 	hrtimer_init_sleeper(&hs, current);
 	do {
-		if (test_bit(REQ_ATOM_STARTED, &rq->atomic_flags) &&
-		    blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
+		if (blk_mq_rq_state(rq) == MQ_RQ_COMPLETE)
 			break;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		hrtimer_start_expires(&hs.timer, mode);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 760e7ed..e1eaeb9 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -34,6 +34,7 @@ struct blk_mq_ctx {
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
+	MQ_RQ_COMPLETE		= 2,
 
 	MQ_RQ_STATE_BITS	= 2,
 	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
diff --git a/block/blk.h b/block/blk.h
index 9cb2739..a68dbe3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -124,7 +124,6 @@ void blk_account_io_done(struct request *req);
  */
 enum rq_atomic_flags {
 	REQ_ATOM_COMPLETE = 0,
-	REQ_ATOM_STARTED,
 
 	REQ_ATOM_POLL_SLEPT,
 };
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCHSET] blk-mq: reimplement timeout handling
  2017-12-09 19:25 [PATCHSET] blk-mq: reimplement timeout handling Tejun Heo
                   ` (5 preceding siblings ...)
  2017-12-09 19:25 ` [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED Tejun Heo
@ 2017-12-11  9:27 ` Peter Zijlstra
  2017-12-12  9:21 ` Christoph Hellwig
  2017-12-12 16:08 ` Peter Zijlstra
  8 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2017-12-11  9:27 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, oleg, kernel-team, osandov

On Sat, Dec 09, 2017 at 11:25:19AM -0800, Tejun Heo wrote:
> Currently, blk-mq timeout path synchronizes against the usual
> issue/completion path using a complex scheme involving atomic
> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> rules.  Unfortunately, it contains quite a few holes.
> 
> It's pretty easy to make blk_mq_check_expired() terminate a later
> instance of a request.  If we induce 5 sec delay before
> time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
> 2s, and issue back-to-back large IOs, blk-mq starts timing out
> requests spuriously pretty quickly.  Nothing actually timed out.  It
> just made the call on a recycle instance of a request and then
> terminated a later instance long after the original instance finished.
> The scenario isn't theoretical either.
> 
> This patchset replaces the broken synchronization mechanism with a RCU
> and generation number based one.  Please read the patch description of
> the second patch for more details.
> 
> Oleg, Peter, I'd really appreciate if you guys can go over the
> reported breakages and the new implementation.

Great, yes that code seemed very suspicious when I looked at it; thanks
for making it go away. I'll try and find a spot to stare at the patches.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHSET] blk-mq: reimplement timeout handling
  2017-12-09 19:25 [PATCHSET] blk-mq: reimplement timeout handling Tejun Heo
                   ` (6 preceding siblings ...)
  2017-12-11  9:27 ` [PATCHSET] blk-mq: reimplement timeout handling Peter Zijlstra
@ 2017-12-12  9:21 ` Christoph Hellwig
  2017-12-12 16:39   ` Tejun Heo
  2017-12-12 16:08 ` Peter Zijlstra
  8 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2017-12-12  9:21 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, oleg, peterz, kernel-team, osandov

Please send this to linux-block so that it can be properly reviewed.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED
  2017-12-09 19:25 ` [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED Tejun Heo
@ 2017-12-12 10:09   ` jianchao.wang
  2017-12-12 17:01     ` Tejun Heo
  2017-12-12 17:26     ` Tejun Heo
  2017-12-12 11:17   ` Nikolay Borisov
  1 sibling, 2 replies; 26+ messages in thread
From: jianchao.wang @ 2017-12-12 10:09 UTC (permalink / raw)
  To: Tejun Heo, axboe; +Cc: linux-kernel, oleg, peterz, kernel-team, osandov

Hi Tejun

On 12/10/2017 03:25 AM, Tejun Heo wrote:
> After the recent updates to use generation number and state based
> synchronization, we can easily replace REQ_ATOM_STARTED usages by
> adding an extra state to distinguish completed but not yet freed
> state.
> 
> Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
> blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
> left and is removed.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  block/blk-mq-debugfs.c |  4 +---
>  block/blk-mq.c         | 39 ++++++++-------------------------------
>  block/blk-mq.h         |  1 +
>  block/blk.h            |  1 -
>  4 files changed, 10 insertions(+), 35 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index b56a4f3..8adc837 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -271,7 +271,6 @@ static const char *const cmd_flag_name[] = {
>  #define RQF_NAME(name) [ilog2((__force u32)RQF_##name)] = #name
>  static const char *const rqf_name[] = {
>  	RQF_NAME(SORTED),
> -	RQF_NAME(STARTED),
>  	RQF_NAME(QUEUED),
>  	RQF_NAME(SOFTBARRIER),
>  	RQF_NAME(FLUSH_SEQ),
> @@ -295,7 +294,6 @@ static const char *const rqf_name[] = {
>  #define RQAF_NAME(name) [REQ_ATOM_##name] = #name
>  static const char *const rqaf_name[] = {
>  	RQAF_NAME(COMPLETE),
> -	RQAF_NAME(STARTED),
>  	RQAF_NAME(POLL_SLEPT),
>  };
>  #undef RQAF_NAME
> @@ -409,7 +407,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
>  	const struct show_busy_params *params = data;
>  
>  	if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
> -	    test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> +	    blk_mq_rq_state(rq) != MQ_RQ_IDLE)
>  		__blk_mq_debugfs_rq_show(params->m,
>  					 list_entry_rq(&rq->queuelist));
>  }
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4ebac33..663069d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -482,7 +482,7 @@ void blk_mq_free_request(struct request *rq)
>  	if (blk_rq_rl(rq))
>  		blk_put_rl(blk_rq_rl(rq));
>  
> -	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> +	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
>  	clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
>  	if (rq->tag != -1)
>  		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
> @@ -530,7 +530,7 @@ static void __blk_mq_complete_request(struct request *rq)
>  	int cpu;
>  
>  	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
> -	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
> +	blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
>  
>  	if (rq->internal_tag != -1)
>  		blk_mq_sched_completed_request(rq);
> @@ -610,7 +610,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
>  
>  int blk_mq_request_started(struct request *rq)
>  {
> -	return test_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> +	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_request_started);
>  
> @@ -629,7 +629,6 @@ void blk_mq_start_request(struct request *rq)
>  	}
>  
>  	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
> -	WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
>  
>  	/*
>  	 * Mark @rq in-flight which also advances the generation number,
> @@ -647,8 +646,6 @@ void blk_mq_start_request(struct request *rq)
>  	blk_add_timer(rq);
>  	write_seqcount_end(&rq->gstate_seqc);
>  
> -	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> -
>  	if (q->dma_drain_size && blk_rq_bytes(rq)) {
>  		/*
>  		 * Make sure space for the drain appears.  We know we can do
> @@ -661,13 +658,9 @@ void blk_mq_start_request(struct request *rq)
>  EXPORT_SYMBOL(blk_mq_start_request);
>  
>  /*
> - * When we reach here because queue is busy, REQ_ATOM_COMPLETE
> - * flag isn't set yet, so there may be race with timeout handler,
> - * but given rq->deadline is just set in .queue_rq() under
> - * this situation, the race won't be possible in reality because
> - * rq->timeout should be set as big enough to cover the window
> - * between blk_mq_start_request() called from .queue_rq() and
> - * clearing REQ_ATOM_STARTED here.
> + * When we reach here because queue is busy, it's safe to change the state
> + * to IDLE without checking @rq->aborted_gstate because we should still be
> + * holding the RCU read lock and thus protected against timeout.
>   */
>  static void __blk_mq_requeue_request(struct request *rq)
>  {
> @@ -679,7 +672,7 @@ static void __blk_mq_requeue_request(struct request *rq)
>  	wbt_requeue(q->rq_wb, &rq->issue_stat);
>  	blk_mq_sched_requeue_request(rq);
>  
> -	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> +	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
>  		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
The MQ_RQ_IDLE looks confusing here. The request is not freed, but idled.
And when the requeued request is started again, the generation number will be increased.
But it is not a recycled instance of the request. Maybe another state is needed here?
>  		if (q->dma_drain_size && blk_rq_bytes(rq))
>  			rq->nr_phys_segments--;
> @@ -786,18 +779,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  	const struct blk_mq_ops *ops = req->q->mq_ops;
>  	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
>  
> -	/*
> -	 * We know that complete is set at this point. If STARTED isn't set
> -	 * anymore, then the request isn't active and the "timeout" should
> -	 * just be ignored. This can happen due to the bitflag ordering.
> -	 * Timeout first checks if STARTED is set, and if it is, assumes
> -	 * the request is active. But if we race with completion, then
> -	 * both flags will get cleared. So check here again, and ignore
> -	 * a timeout event with a request that isn't active.
> -	 */
> -	if (!test_bit(REQ_ATOM_STARTED, &req->atomic_flags))
> -		return;
> -
>  	if (ops->timeout)
>  		ret = ops->timeout(req, reserved);
The BLK_EH_RESET_TIMER case is not covered here. In that case, the timer will be re-armed,
but the gstate and aborted_gstate are not updated and still equal each other.
Consequently, when the request is completed later, __blk_mq_complete_request() will be skipped,
and the request will expire again. The aborted_gstate should be updated in the BLK_EH_RESET_TIMER case.

Thanks
Jianchao

>  
> @@ -824,9 +805,6 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
>  	unsigned long gstate, deadline;
>  	int start;
>  
> -	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> -		return;
> -
>  	/* read coherent snapshots of @rq->gstate and @rq->deadline */
>  	do {
>  		start = read_seqcount_begin(&rq->gstate_seqc);
> @@ -2944,8 +2922,7 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
>  
>  	hrtimer_init_sleeper(&hs, current);
>  	do {
> -		if (test_bit(REQ_ATOM_STARTED, &rq->atomic_flags) &&
> -		    blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
> +		if (blk_mq_rq_state(rq) == MQ_RQ_COMPLETE)
>  			break;
>  		set_current_state(TASK_UNINTERRUPTIBLE);
>  		hrtimer_start_expires(&hs.timer, mode);
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 760e7ed..e1eaeb9 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -34,6 +34,7 @@ struct blk_mq_ctx {
>  enum mq_rq_state {
>  	MQ_RQ_IDLE		= 0,
>  	MQ_RQ_IN_FLIGHT		= 1,
> +	MQ_RQ_COMPLETE		= 2,
>  
>  	MQ_RQ_STATE_BITS	= 2,
>  	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
> diff --git a/block/blk.h b/block/blk.h
> index 9cb2739..a68dbe3 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -124,7 +124,6 @@ void blk_account_io_done(struct request *req);
>   */
>  enum rq_atomic_flags {
>  	REQ_ATOM_COMPLETE = 0,
> -	REQ_ATOM_STARTED,
>  
>  	REQ_ATOM_POLL_SLEPT,
>  };
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-09 19:25 ` [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme Tejun Heo
@ 2017-12-12 10:09   ` Peter Zijlstra
  2017-12-12 18:02     ` Tejun Heo
  2017-12-12 10:10   ` Peter Zijlstra
  2017-12-12 11:56   ` Peter Zijlstra
  2 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2017-12-12 10:09 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, oleg, kernel-team, osandov

I don't yet have a full picture, but let me make a few comments.

On Sat, Dec 09, 2017 at 11:25:21AM -0800, Tejun Heo wrote:

> +static u64 blk_mq_rq_aborted_gstate(struct request *rq)
> +{
> +	unsigned int start;
> +	u64 aborted_gstate;
> +
> +	do {
> +		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
> +		aborted_gstate = rq->aborted_gstate;
> +	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
> +
> +	return aborted_gstate;
> +}

> +	/* if in-flight && overdue, mark for abortion */
> +	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
> +	    time_after_eq(jiffies, deadline)) {
> +		u64_stats_update_begin(&rq->aborted_gstate_sync);
> +		rq->aborted_gstate = gstate;
> +		u64_stats_update_end(&rq->aborted_gstate_sync);
> +		data->nr_expired++;
> +		hctx->nr_expired++;
>  	} else if (!data->next_set || time_after(data->next, deadline)) {

> +	/*
> +	 * ->aborted_gstate is used by the timeout to claim a specific
> +	 * recycle instance of this request.  See blk_mq_timeout_work().
> +	 */
> +	struct u64_stats_sync aborted_gstate_sync;
> +	u64 aborted_gstate;

So I dislike that u64_stats_sync thingy. Esp when used on a single
variable like this.

There are two alternatives, but I don't understand the code well enough
to judge the trade-offs.

1) use gstate_seq for this too; yes it will add some superfluous
   instructions on 64bit targets, but if timeouts are a slow path
   this might not matter.  (rough sketch at the end of this mail)

2) use the pattern we use for cfs_rq::min_vruntime; namely:

	u64 aborted_gstate
#ifdef CONFIG_64BIT
	u64 aborted_gstate_copy;
#endif


static inline void blk_mq_rq_set_abort(struct request *rq, u64 gstate)
{
	rq->aborted_gstate = gstate;
#ifdef CONFIG_64BIT
	smp_wmb();
	rq->aborted_gstate_copy = gstate;
#endif
}

static inline u64 blk_mq_rq_get_abort(struct request *rq)
{
#ifdef CONFIG_64BIT
	u64 abort, copy;

	do {
		copy = rq->aborted_gstate_copy;
		smp_rmb();
		abort = rq->aborted_gstate;
	} while (abort != copy);

	return abort;
#else
	return rq->aborted_gstate;
#endif
}

   which is actually _faster_ than the u64_stats_sync stuff (for a
   single variable).


But it might not matter; I just dislike that thing, could be me.
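
For completeness, 1) above would be something like this (completely
untested; note the issue path also writes under the same seqcount, so
the two writers would have to be kept from overlapping):

static inline void blk_mq_rq_set_aborted_gstate(struct request *rq, u64 gstate)
{
	/* writer side, timeout path */
	write_seqcount_begin(&rq->gstate_seqc);
	rq->aborted_gstate = gstate;
	write_seqcount_end(&rq->gstate_seqc);
}

static inline u64 blk_mq_rq_aborted_gstate(struct request *rq)
{
	/* reader side, completion path */
	unsigned int start;
	u64 aborted_gstate;

	do {
		start = read_seqcount_begin(&rq->gstate_seqc);
		aborted_gstate = rq->aborted_gstate;
	} while (read_seqcount_retry(&rq->gstate_seqc, start));

	return aborted_gstate;
}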

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-09 19:25 ` [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme Tejun Heo
  2017-12-12 10:09   ` Peter Zijlstra
@ 2017-12-12 10:10   ` Peter Zijlstra
  2017-12-12 18:03     ` Tejun Heo
  2017-12-12 11:56   ` Peter Zijlstra
  2 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2017-12-12 10:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, oleg, kernel-team, osandov

On Sat, Dec 09, 2017 at 11:25:21AM -0800, Tejun Heo wrote:
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8089ca1..e6cfe4b3 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -228,6 +230,27 @@ struct request {
>  
>  	unsigned short write_hint;
>  
> +	/*
> +	 * On blk-mq, the lower bits of ->gstate carry the MQ_RQ_* state
> +	 * value and the upper bits the generation number which is
> +	 * monotonically incremented and used to distinguish the reuse
> +	 * instances.
> +	 *
> +	 * ->gstate_seqc allows updates to ->gstate and other fields
> +	 * (currently ->deadline) during request start to be read
> +	 * atomically from the timeout path, so that it can operate on a
> +	 * coherent set of information.
> +	 */
> +	seqcount_t gstate_seqc;
> +	u64 gstate;

We typically name seqcount_t thingies _seq.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED
  2017-12-09 19:25 ` [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED Tejun Heo
  2017-12-12 10:09   ` jianchao.wang
@ 2017-12-12 11:17   ` Nikolay Borisov
  2017-12-12 17:29     ` Tejun Heo
  1 sibling, 1 reply; 26+ messages in thread
From: Nikolay Borisov @ 2017-12-12 11:17 UTC (permalink / raw)
  To: Tejun Heo, axboe; +Cc: linux-kernel, oleg, peterz, kernel-team, osandov



On  9.12.2017 21:25, Tejun Heo wrote:
> After the recent updates to use generation number and state based
> synchronization, we can easily replace REQ_ATOM_STARTED usages by
> adding an extra state to distinguish completed but not yet freed
> state.
> 
> Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
> blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
> left and is removed.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>


Where are the performance results promised in patch 5/6?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-09 19:25 ` [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme Tejun Heo
  2017-12-12 10:09   ` Peter Zijlstra
  2017-12-12 10:10   ` Peter Zijlstra
@ 2017-12-12 11:56   ` Peter Zijlstra
  2017-12-12 18:04     ` Tejun Heo
  2 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2017-12-12 11:56 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, oleg, kernel-team, osandov

On Sat, Dec 09, 2017 at 11:25:21AM -0800, Tejun Heo wrote:
> +static inline void blk_mq_rq_update_state(struct request *rq,
> +					  enum mq_rq_state state)
> +{
> +	u64 new_val = (rq->gstate & ~MQ_RQ_STATE_MASK) | state;
> +
> +	if (state == MQ_RQ_IN_FLIGHT) {
> +		WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
> +		new_val += MQ_RQ_GEN_INC;
> +	}
> +
> +	/* avoid exposing interim values */

My paranoia would like to see READ_ONCE() on the rq->gstate load above
as well; that makes it a fully explicit load-store operation.
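
IOW, something like (sketch):

	u64 new_val = (READ_ONCE(rq->gstate) & ~MQ_RQ_STATE_MASK) | state;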

> +	WRITE_ONCE(rq->gstate, new_val);
> +}

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHSET] blk-mq: reimplement timeout handling
  2017-12-09 19:25 [PATCHSET] blk-mq: reimplement timeout handling Tejun Heo
                   ` (7 preceding siblings ...)
  2017-12-12  9:21 ` Christoph Hellwig
@ 2017-12-12 16:08 ` Peter Zijlstra
  8 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2017-12-12 16:08 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, oleg, kernel-team, osandov

On Sat, Dec 09, 2017 at 11:25:19AM -0800, Tejun Heo wrote:
> Oleg, Peter, I'd really appreciate if you guys can go over the
> reported breakages and the new implementation.

I'm still struggling with a lack of understanding of most of this; but I
think it's OK. It's certainly easier than what was there before.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHSET] blk-mq: reimplement timeout handling
  2017-12-12  9:21 ` Christoph Hellwig
@ 2017-12-12 16:39   ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2017-12-12 16:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, linux-kernel, oleg, peterz, kernel-team, osandov

Hello, Christoph.

On Tue, Dec 12, 2017 at 01:21:23AM -0800, Christoph Hellwig wrote:
> Please send this to linux-block so that it can be properly reviewed.

Oops, will include linux-block when posting v2.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED
  2017-12-12 10:09   ` jianchao.wang
@ 2017-12-12 17:01     ` Tejun Heo
  2017-12-12 17:26     ` Tejun Heo
  1 sibling, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2017-12-12 17:01 UTC (permalink / raw)
  To: jianchao.wang; +Cc: axboe, linux-kernel, oleg, peterz, kernel-team, osandov

Hello, Jianchao.

On Tue, Dec 12, 2017 at 06:09:32PM +0800, jianchao.wang wrote:
> > @@ -786,18 +779,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
> >  	const struct blk_mq_ops *ops = req->q->mq_ops;
> >  	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
> >  
> > -	/*
> > -	 * We know that complete is set at this point. If STARTED isn't set
> > -	 * anymore, then the request isn't active and the "timeout" should
> > -	 * just be ignored. This can happen due to the bitflag ordering.
> > -	 * Timeout first checks if STARTED is set, and if it is, assumes
> > -	 * the request is active. But if we race with completion, then
> > -	 * both flags will get cleared. So check here again, and ignore
> > -	 * a timeout event with a request that isn't active.
> > -	 */
> > -	if (!test_bit(REQ_ATOM_STARTED, &req->atomic_flags))
> > -		return;
> > -
> >  	if (ops->timeout)
> >  		ret = ops->timeout(req, reserved);
>
> The BLK_EH_RESET_TIMER case has not been covered here. In that case,
> the timer will be re-armed, but gstate and aborted_gstate are not
> updated and remain equal to each other.  Consequently, when the
> request is completed later, __blk_mq_complete_request() will be
> skipped and the request will expire again.  aborted_gstate should
> be updated in the BLK_EH_RESET_TIMER case.

You're right.  This is inherently racy though.  Nothing prevented the
command from completing before complete was cleared.  I'll just clear
aborted_gstate, which should behave the same way.
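
(A rough sketch of what that could look like in blk_mq_rq_timed_out().
This is illustrative only, not a hunk from the posted series, and
blk_mq_rq_update_aborted_gstate() is assumed as a small helper that
writes ->aborted_gstate.)

static void blk_mq_rq_timed_out(struct request *req, bool reserved)
{
	const struct blk_mq_ops *ops = req->q->mq_ops;
	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;

	if (ops->timeout)
		ret = ops->timeout(req, reserved);

	switch (ret) {
	case BLK_EH_HANDLED:
		__blk_mq_complete_request(req);
		break;
	case BLK_EH_RESET_TIMER:
		/*
		 * As nothing prevents the request from completing while
		 * the timeout handler runs, clear ->aborted_gstate so a
		 * completion racing with the timeout isn't treated as
		 * aborted, then re-arm the timer.
		 */
		blk_mq_rq_update_aborted_gstate(req, 0);
		blk_add_timer(req);
		break;
	case BLK_EH_NOT_HANDLED:
		break;
	}
}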

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED
  2017-12-12 10:09   ` jianchao.wang
  2017-12-12 17:01     ` Tejun Heo
@ 2017-12-12 17:26     ` Tejun Heo
  2017-12-13  3:05       ` jianchao.wang
  1 sibling, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-12-12 17:26 UTC (permalink / raw)
  To: jianchao.wang; +Cc: axboe, linux-kernel, oleg, peterz, kernel-team, osandov

Hello, again.

Sorry, I missed part of your comment in the previous reply.

On Tue, Dec 12, 2017 at 06:09:32PM +0800, jianchao.wang wrote:
> >  static void __blk_mq_requeue_request(struct request *rq)
> >  {
> > @@ -679,7 +672,7 @@ static void __blk_mq_requeue_request(struct request *rq)
> >  	wbt_requeue(q->rq_wb, &rq->issue_stat);
> >  	blk_mq_sched_requeue_request(rq);
> >  
> > -	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> > +	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
> >  		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
>
> The MQ_RQ_IDLE looks confusing here. The request is not freed, just
> idled.  And when the requeued request is started again, the generation
> number will be increased, but it is not a recycled instance of the
> request.  Maybe another state is needed here?

I don't quite follow it.  At this point, the request can't be
in-flight on the device side and is scheduled for re-submission.  I'm
not sure the distinction from IDLE is necessary.  Am I missing
something?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED
  2017-12-12 11:17   ` Nikolay Borisov
@ 2017-12-12 17:29     ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2017-12-12 17:29 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: axboe, linux-kernel, oleg, peterz, kernel-team, osandov

Hello, Nikolay.

On Tue, Dec 12, 2017 at 01:17:52PM +0200, Nikolay Borisov wrote:
> On  9.12.2017 21:25, Tejun Heo wrote:
> > After the recent updates to use generation number and state based
> > synchronization, we can easily replace REQ_ATOM_STARTED usages by
> > adding an extra state to distinguish completed but not yet freed
> > state.
> > 
> > Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
> > blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
> > left and is removed.
> 
> Where are the performance results promised in patch 5/6?

Oops, I thought I had removed all of those.  I couldn't reliably show
that this performed better.  I was testing with nullblk, but the
run-to-run deviations were too great (the runs generally kept getting
faster, maybe due to better locality?) to draw a reliable conclusion.
Whatever difference there is in performance is unlikely to be material
in actual workloads anyway.

I dropped the sentence from the description.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-12 10:09   ` Peter Zijlstra
@ 2017-12-12 18:02     ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2017-12-12 18:02 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: axboe, linux-kernel, oleg, kernel-team, osandov

Hello, Peter.

On Tue, Dec 12, 2017 at 11:09:35AM +0100, Peter Zijlstra wrote:
> > +	/*
> > +	 * ->aborted_gstate is used by the timeout to claim a specific
> > +	 * recycle instance of this request.  See blk_mq_timeout_work().
> > +	 */
> > +	struct u64_stats_sync aborted_gstate_sync;
> > +	u64 aborted_gstate;
> 
> So I dislike that u64_stats_sync thingy. Esp when used on a single
> variable like this.

Hmm... I see.

> There are two alternatives, but I don't understand the code well enough
> to judge the trade-offs.
> 
> 1) use gstate_seq for this too; yes it will add some superfluous
>    instructions on 64bit targets, but if timeouts are a slow path
>    this might not matter.

For aborted_gstate, the hot reader side is the completion path.  Going
through gstate_seq would mean two rmbs there, which in itself isn't too
much but is still difficult to justify.

> 2) use the pattern we use for cfs_rq::min_vruntime; namely:
> 
> 	u64 aborted_gstate
> #ifdef CONFIG_64BIT
> 	u64 aborted_gstate_copy;
> #endif
> 
> 
> static inline void blk_mq_rq_set_abort(struct rq *rq, u64 gstate)
> {
> 	rq->aborted_gstate = gstate;
> #ifdef CONFIG_64BIT
> 	smp_wmb();
> 	rq->aborted_gstate_copy = gstate;
> #endif
> }
> 
> static inline u64 blk_mq_rq_get_abort(struct rq *rq)
> {
> #ifdef CONFIG_64BIT
> 	u64 abort, copy;
> 
> 	do {
> 		copy = rq->aborted_gstate_copy;
> 		smp_rmb();
> 		abort = rq->aborted_gstate;
> 	} while (abort != copy);
> 
> 	return abort;
> #else
> 	return rq->aborted_gstate;
> #endif
> }
> 
>    which is actually _faster_ than the u64_stats_sync stuff (for a
>    single variable).

Hmm... that's doing the seq reading on the variable content itself.  If
we had something like this as a library helper, I'd be happy to use it,
but I really don't want to open-code it.

> But it might not matter; I just dislike that thing, could be me.

I'll leave it as-is for now.  Probably the right thing to do in the
longer term is to add the seq-reading-by-content helper to the library.
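
(For readers following along, the u64_stats_sync arrangement under
discussion reads roughly as below.  The helper names are illustrative
and not taken from the posted patches; only ->aborted_gstate and
->aborted_gstate_sync come from the series.)

#include <linux/u64_stats_sync.h>

/* timeout path: claim a specific recycle instance of the request */
static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
{
	unsigned long flags;

	/* writer side; irqs off so a reader on this CPU can't interleave */
	local_irq_save(flags);
	u64_stats_update_begin(&rq->aborted_gstate_sync);
	rq->aborted_gstate = gstate;
	u64_stats_update_end(&rq->aborted_gstate_sync);
	local_irq_restore(flags);
}

/* completion hot path: on 64-bit this is effectively a plain load */
static u64 blk_mq_rq_aborted_gstate(struct request *rq)
{
	unsigned int start;
	u64 aborted_gstate;

	do {
		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
		aborted_gstate = rq->aborted_gstate;
	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));

	return aborted_gstate;
}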

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-12 10:10   ` Peter Zijlstra
@ 2017-12-12 18:03     ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2017-12-12 18:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: axboe, linux-kernel, oleg, kernel-team, osandov

Hello, Peter.

On Tue, Dec 12, 2017 at 11:10:52AM +0100, Peter Zijlstra wrote:
> > +	seqcount_t gstate_seqc;
> > +	u64 gstate;
> 
> We typically name seqcount_t thingies _seq.

Will rename.  Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-12 11:56   ` Peter Zijlstra
@ 2017-12-12 18:04     ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2017-12-12 18:04 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: axboe, linux-kernel, oleg, kernel-team, osandov

Hello,

On Tue, Dec 12, 2017 at 12:56:57PM +0100, Peter Zijlstra wrote:
> On Sat, Dec 09, 2017 at 11:25:21AM -0800, Tejun Heo wrote:
> > +static inline void blk_mq_rq_update_state(struct request *rq,
> > +					  enum mq_rq_state state)
> > +{
> > +	u64 new_val = (rq->gstate & ~MQ_RQ_STATE_MASK) | state;
> > +
> > +	if (state == MQ_RQ_IN_FLIGHT) {
> > +		WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
> > +		new_val += MQ_RQ_GEN_INC;
> > +	}
> > +
> > +	/* avoid exposing interim values */
> 
> My paranoia would like to see READ_ONCE() on the rq->gstate load above
> as well, that makes it a fully explicit load-store operation.

Right now, only the request owner can update the field, so there's no
data coherency issue, but then again there's nothing to lose by adding
READ_ONCE() there.  Will add it.
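
(With the READ_ONCE() added, the helper would read roughly as follows;
a sketch of the agreed change against the hunk quoted above, not a hunk
taken from v2.)

static inline void blk_mq_rq_update_state(struct request *rq,
					  enum mq_rq_state state)
{
	/* load once so the update is an explicit load-modify-store */
	u64 new_val = (READ_ONCE(rq->gstate) & ~MQ_RQ_STATE_MASK) | state;

	if (state == MQ_RQ_IN_FLIGHT) {
		WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
		new_val += MQ_RQ_GEN_INC;
	}

	/* avoid exposing interim values */
	WRITE_ONCE(rq->gstate, new_val);
}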

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED
  2017-12-12 17:26     ` Tejun Heo
@ 2017-12-13  3:05       ` jianchao.wang
  2017-12-13 16:09         ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: jianchao.wang @ 2017-12-13  3:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, oleg, peterz, kernel-team, osandov

Hi tejun

On 12/13/2017 01:26 AM, Tejun Heo wrote:
> Hello, again.
> 
> Sorry, I missed part of your comment in the previous reply.
> 
> On Tue, Dec 12, 2017 at 06:09:32PM +0800, jianchao.wang wrote:
>>>  static void __blk_mq_requeue_request(struct request *rq)
>>>  {
>>> @@ -679,7 +672,7 @@ static void __blk_mq_requeue_request(struct request *rq)
>>>  	wbt_requeue(q->rq_wb, &rq->issue_stat);
>>>  	blk_mq_sched_requeue_request(rq);
>>>  
>>> -	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
>>> +	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
>>>  		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
>>
>> The MQ_RQ_IDLE looks confusing here. The request is not freed, just
>> idled.  And when the requeued request is started again, the generation
>> number will be increased, but it is not a recycled instance of the
>> request.  Maybe another state is needed here?
> 
> I don't quite follow it.  At this point, the request can't be
> in-flight on the device side and is scheduled for re-submission.  I'm
> not sure the distinction from IDLE is necessary.  Am I missing
> something?
I just don't quite understand the strict definition of "generation" you
mentioned.  Is an allocation->free cycle a generation, or is an
idle->in-flight cycle a generation?  The idle->in-flight cycle could
include the requeue case above.  Can you clarify this?

Thanks
Jianchao
> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/6] blk-mq: protect completion path with RCU
  2017-12-09 19:25 ` [PATCH 1/6] blk-mq: protect completion path with RCU Tejun Heo
@ 2017-12-13  3:10   ` jianchao.wang
  0 siblings, 0 replies; 26+ messages in thread
From: jianchao.wang @ 2017-12-13  3:10 UTC (permalink / raw)
  To: Tejun Heo, axboe; +Cc: linux-kernel, oleg, peterz, kernel-team, osandov

Hi tejun

On 12/10/2017 03:25 AM, Tejun Heo wrote:
> Currently, blk-mq protects only the issue path with RCU.  This patch
> puts the completion path under the same RCU protection.  This will be
> used to synchronize issue/completion against timeout by later patches,
> which will also add the comments.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  block/blk-mq.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 1109747..acf4fbb 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -568,11 +568,23 @@ static void __blk_mq_complete_request(struct request *rq)
>  void blk_mq_complete_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> +	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> +	int srcu_idx;
>  
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
> -	if (!blk_mark_rq_complete(rq))
> -		__blk_mq_complete_request(rq);
> +
> +	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
> +		rcu_read_lock();
> +		if (!blk_mark_rq_complete(rq))
> +			__blk_mq_complete_request(rq);
> +		rcu_read_unlock();
> +	} else {
> +		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> +		if (!blk_mark_rq_complete(rq))
> +			__blk_mq_complete_request(rq);
> +		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
> +	}
__blk_mq_complete_request() can be executed in irq context, so there
should not be any sleeping operations in it.  The SRCU case is
therefore not necessary here.
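
(For reference, the open-coded BLK_MQ_F_BLOCKING branching in the hunk
above could be factored into small helpers along these lines; an
illustrative sketch, not something posted in this thread.)

static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
{
	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
		rcu_read_lock();
	else
		*srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
}

static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
{
	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
		rcu_read_unlock();
	else
		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
}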

Thanks
Jianchao
>  }
>  EXPORT_SYMBOL(blk_mq_complete_request);
>  
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED
  2017-12-13  3:05       ` jianchao.wang
@ 2017-12-13 16:09         ` Tejun Heo
  2017-12-14  2:14           ` jianchao.wang
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-12-13 16:09 UTC (permalink / raw)
  To: jianchao.wang; +Cc: axboe, linux-kernel, oleg, peterz, kernel-team, osandov

Hello,

On Wed, Dec 13, 2017 at 11:05:11AM +0800, jianchao.wang wrote:
> I just don't quite understand the strict definition of "generation" you
> mentioned.  Is an allocation->free cycle a generation, or is an
> idle->in-flight cycle a generation?  The idle->in-flight cycle could
> include the requeue case above.  Can you clarify this?

So, a request is

* IN_FLIGHT, if it can be completed by the lower level driver

* COMPLETE, if the lower level driver indicated completion but block
  layer is still processing.

* IDLE, if neither of the above two.

and generation is bumped up on IDLE -> IN_FLIGHT to distinguish
different IN_FLIGHT instances.
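
(Roughly how the series encodes this in rq->gstate: the low bits carry
the state and the remaining bits the generation.  Paraphrased for
illustration; see patch 2 for the authoritative definitions.)

enum mq_rq_state {
	MQ_RQ_IDLE		= 0,
	MQ_RQ_IN_FLIGHT		= 1,
	MQ_RQ_COMPLETE		= 2,

	MQ_RQ_STATE_BITS	= 2,
	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
};

/* the current state lives in the low bits of rq->gstate */
static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
{
	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
}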

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED
  2017-12-13 16:09         ` Tejun Heo
@ 2017-12-14  2:14           ` jianchao.wang
  0 siblings, 0 replies; 26+ messages in thread
From: jianchao.wang @ 2017-12-14  2:14 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, oleg, peterz, kernel-team, osandov



On 12/14/2017 12:09 AM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Dec 13, 2017 at 11:05:11AM +0800, jianchao.wang wrote:
>> I just don't quite understand the strict definition of "generation" you
>> mentioned.  Is an allocation->free cycle a generation, or is an
>> idle->in-flight cycle a generation?  The idle->in-flight cycle could
>> include the requeue case above.  Can you clarify this?
> 
> So, a request is
> 
> * IN_FLIGHT, if it can be completed by the lower level driver
> 
> * COMPLETE, if the lower level driver indicated completion but block
>   layer is still processing.
> 
> * IDLE, if neither of the above two.
> 
> and generation is bumped up on IDLE -> IN_FLIGHT to distinguish
> different IN_FLIGHT instances.
> 

So a generation means a cycle from IDLE -> IN_FLIGHT.
Thanks for your response. That's really appreciated.

Jianchao
> Thanks.
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-12-14  2:14 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-09 19:25 [PATCHSET] blk-mq: reimplement timeout handling Tejun Heo
2017-12-09 19:25 ` [PATCH 1/6] blk-mq: protect completion path with RCU Tejun Heo
2017-12-13  3:10   ` jianchao.wang
2017-12-09 19:25 ` [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme Tejun Heo
2017-12-12 10:09   ` Peter Zijlstra
2017-12-12 18:02     ` Tejun Heo
2017-12-12 10:10   ` Peter Zijlstra
2017-12-12 18:03     ` Tejun Heo
2017-12-12 11:56   ` Peter Zijlstra
2017-12-12 18:04     ` Tejun Heo
2017-12-09 19:25 ` [PATCH 3/6] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE Tejun Heo
2017-12-09 19:25 ` [PATCH 4/6] blk-mq: make blk_abort_request() trigger timeout path Tejun Heo
2017-12-09 19:25 ` [PATCH 5/6] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq Tejun Heo
2017-12-09 19:25 ` [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED Tejun Heo
2017-12-12 10:09   ` jianchao.wang
2017-12-12 17:01     ` Tejun Heo
2017-12-12 17:26     ` Tejun Heo
2017-12-13  3:05       ` jianchao.wang
2017-12-13 16:09         ` Tejun Heo
2017-12-14  2:14           ` jianchao.wang
2017-12-12 11:17   ` Nikolay Borisov
2017-12-12 17:29     ` Tejun Heo
2017-12-11  9:27 ` [PATCHSET] blk-mq: reimplement timeout handling Peter Zijlstra
2017-12-12  9:21 ` Christoph Hellwig
2017-12-12 16:39   ` Tejun Heo
2017-12-12 16:08 ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).