* [PATCHSET v2] blk-mq: reimplement timeout handling
@ 2017-12-12 19:01 Tejun Heo
  2017-12-12 19:01 ` [PATCH 1/6] blk-mq: protect completion path with RCU Tejun Heo
                   ` (7 more replies)
  0 siblings, 8 replies; 49+ messages in thread
From: Tejun Heo @ 2017-12-12 19:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, oleg, peterz, kernel-team, osandov, linux-block, hch

Changes from the last version[1]

- BLK_EH_RESET_TIMER handling fixed.

- s/request->gstate_seqc/request->gstate_seq/

- READ_ONCE() added to blk_mq_rq_update_state().

- Removed left over blk_clear_rq_complete() invocation from
  blk_mq_rq_timed_out().

Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules.  Unfortunately, it contains quite a few holes.

It's pretty easy to make blk_mq_check_expired() terminate a later
instance of a request.  If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly.  Nothing actually timed out.  It
just made the call on a recycled instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
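
For reference, the delay injection used to trigger this is roughly the
following (an illustrative hack against the current
blk_mq_check_expired(), not part of the series; msleep() needs
<linux/delay.h> and is fine here since the timeout work runs in
process context):

	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
		return;

	/*
	 * Injected delay: with the queue timeout shortened to 2s, the
	 * request completes and gets recycled while we sleep here, and
	 * the checks below then act on the new instance.
	 */
	msleep(5000);

	deadline = READ_ONCE(rq->deadline);
	if (time_after_eq(jiffies, deadline) && !blk_mark_rq_complete(rq))
		blk_mq_rq_timed_out(rq, reserved);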

This patchset replaces the broken synchronization mechanism with a RCU
and generation number based one.  Please read the patch description of
the second patch for more details.

This patchset contains the following six patches.

 0001-blk-mq-protect-completion-path-with-RCU.patch
 0002-blk-mq-replace-timeout-synchronization-with-a-RCU-an.patch
 0003-blk-mq-use-blk_mq_rq_state-instead-of-testing-REQ_AT.patch
 0004-blk-mq-make-blk_abort_request-trigger-timeout-path.patch
 0005-blk-mq-remove-REQ_ATOM_COMPLETE-usages-from-blk-mq.patch
 0006-blk-mq-remove-REQ_ATOM_STARTED.patch

and is available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blk-mq-timeout

diffstat follows.  Thanks.

 block/blk-core.c       |    2 
 block/blk-mq-debugfs.c |    4 
 block/blk-mq.c         |  255 ++++++++++++++++++++++++++++---------------------
 block/blk-mq.h         |   48 ++++++++-
 block/blk-timeout.c    |    9 -
 block/blk.h            |    7 -
 include/linux/blk-mq.h |    1 
 include/linux/blkdev.h |   23 ++++
 8 files changed, 226 insertions(+), 123 deletions(-)

--
tejun

[1] http://lkml.kernel.org/r/20171209192525.982030-1-tj@kernel.org

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 1/6] blk-mq: protect completion path with RCU
  2017-12-12 19:01 [PATCHSET v2] blk-mq: reimplement timeout handling Tejun Heo
@ 2017-12-12 19:01 ` Tejun Heo
  2017-12-13  3:30   ` jianchao.wang
  2017-12-14 17:01     ` Bart Van Assche
  2017-12-12 19:01 ` [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme Tejun Heo
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Tejun Heo @ 2017-12-12 19:01 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, oleg, peterz, kernel-team, osandov, linux-block,
	hch, Tejun Heo

Currently, blk-mq protects only the issue path with RCU.  This patch
puts the completion path under the same RCU protection.  This will be
used to synchronize issue/completion against timeout by later patches,
which will also add the comments.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1109747..acf4fbb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -568,11 +568,23 @@ static void __blk_mq_complete_request(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
+	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
-	if (!blk_mark_rq_complete(rq))
-		__blk_mq_complete_request(rq);
+
+	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
+		rcu_read_lock();
+		if (!blk_mark_rq_complete(rq))
+			__blk_mq_complete_request(rq);
+		rcu_read_unlock();
+	} else {
+		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
+		if (!blk_mark_rq_complete(rq))
+			__blk_mq_complete_request(rq);
+		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
+	}
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-12 19:01 [PATCHSET v2] blk-mq: reimplement timeout handling Tejun Heo
  2017-12-12 19:01 ` [PATCH 1/6] blk-mq: protect completion path with RCU Tejun Heo
@ 2017-12-12 19:01 ` Tejun Heo
  2017-12-12 21:37     ` Bart Van Assche
                     ` (2 more replies)
  2017-12-12 19:01 ` [PATCH 3/6] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE Tejun Heo
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 49+ messages in thread
From: Tejun Heo @ 2017-12-12 19:01 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, oleg, peterz, kernel-team, osandov, linux-block,
	hch, Tejun Heo, jianchao.wang

Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules.  Unfortunately, it contains quite a few holes.

There's a complex dance around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between the issue/completion and timeout paths;
however, they don't have a synchronization point across request
recycle instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from the N-2'th
instance, the deadline from the N-1'th, and run blk_mark_rq_complete()
against the Nth instance.

In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request.  If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly.  Nothing actually timed out.  It
just made the call on a recycled instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.

This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.

1. Each request has a u64 generation + state value, which can be
   updated only by the request owner.  Whenever a request becomes
   in-flight, the generation number gets bumped up too.  This provides
   the basis for the timeout path to distinguish different recycle
   instances of the request.

   Also, marking a request in-flight and setting its deadline are
   protected with a seqcount so that the timeout path can fetch both
   values coherently.

2. The timeout path fetches the generation, state and deadline.  If
   the verdict is timeout, it records the generation into a dedicated
   request abortion field and does RCU wait.

3. The completion path is also protected by RCU (from the previous
   patch) and checks whether the current generation number and state
   match the abortion field.  If so, it skips completion.

4. The timeout path, after RCU wait, scans requests again and
   terminates the ones whose generation and state still match the ones
   requested for abortion.

   By now, the timeout path knows that it either lost the race (the
   generation number and state have changed) or the completion path
   will yield to it, so it can safely time out the request.

While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
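
Condensed, the claim/yield ordering between the two paths looks like
the following (a paraphrase of the code below with the seqcount and
u64_stats wrappers omitted, not additional code):

	/* timeout path (blk_mq_timeout_work) */
	rq->aborted_gstate = gstate;	/* claim this recycle instance */
	synchronize_rcu();		/* or synchronize_srcu() for BLK_MQ_F_BLOCKING */
	if (READ_ONCE(rq->gstate) == rq->aborted_gstate)
		blk_mq_rq_timed_out(rq, reserved);	/* completion yielded, @rq is ours */

	/* completion path (blk_mq_complete_request), under rcu/srcu read lock */
	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
		__blk_mq_complete_request(rq);	/* not claimed by timeout */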

While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places.  Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.

v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
    - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
    - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
---
 block/blk-core.c       |   2 +
 block/blk-mq.c         | 206 +++++++++++++++++++++++++++++++------------------
 block/blk-mq.h         |  45 +++++++++++
 block/blk-timeout.c    |   2 +-
 block/blk.h            |   6 --
 include/linux/blk-mq.h |   1 +
 include/linux/blkdev.h |  23 ++++++
 7 files changed, 204 insertions(+), 81 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b888175..6034623 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
 	rq->part = NULL;
+	seqcount_init(&rq->gstate_seq);
+	u64_stats_init(&rq->aborted_gstate_sync);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index acf4fbb..b4e733b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -530,6 +530,9 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
+	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
 	if (rq->rq_flags & RQF_STATS) {
@@ -557,6 +560,19 @@ static void __blk_mq_complete_request(struct request *rq)
 	put_cpu();
 }
 
+static u64 blk_mq_rq_aborted_gstate(struct request *rq)
+{
+	unsigned int start;
+	u64 aborted_gstate;
+
+	do {
+		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
+		aborted_gstate = rq->aborted_gstate;
+	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
+
+	return aborted_gstate;
+}
+
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -574,14 +590,21 @@ void blk_mq_complete_request(struct request *rq)
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
 
+	/*
+	 * If @rq->aborted_gstate equals the current instance, timeout is
+	 * claiming @rq and we lost.  This is synchronized through RCU.
+	 * See blk_mq_timeout_work() for details.
+	 */
 	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
 		rcu_read_lock();
-		if (!blk_mark_rq_complete(rq))
+		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
+		    !blk_mark_rq_complete(rq))
 			__blk_mq_complete_request(rq);
 		rcu_read_unlock();
 	} else {
 		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-		if (!blk_mark_rq_complete(rq))
+		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
+		    !blk_mark_rq_complete(rq))
 			__blk_mq_complete_request(rq);
 		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
 	}
@@ -608,34 +631,28 @@ void blk_mq_start_request(struct request *rq)
 		wbt_issue(q->rq_wb, &rq->issue_stat);
 	}
 
-	blk_add_timer(rq);
-
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 	WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
 
 	/*
-	 * Mark us as started and clear complete. Complete might have been
-	 * set if requeue raced with timeout, which then marked it as
-	 * complete. So be sure to clear complete again when we start
-	 * the request, otherwise we'll ignore the completion event.
+	 * Mark @rq in-flight which also advances the generation number,
+	 * and register for timeout.  Protect with a seqcount to allow the
+	 * timeout path to read both @rq->gstate and @rq->deadline
+	 * coherently.
 	 *
-	 * Ensure that ->deadline is visible before we set STARTED, such that
-	 * blk_mq_check_expired() is guaranteed to observe our ->deadline when
-	 * it observes STARTED.
+	 * This is the only place where a request is marked in-flight.  If
+	 * the timeout path reads an in-flight @rq->gstate, the
+	 * @rq->deadline it reads together under @rq->gstate_seq is
+	 * guaranteed to be the matching one.
 	 */
-	smp_wmb();
+	write_seqcount_begin(&rq->gstate_seq);
+	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
+	blk_add_timer(rq);
+	write_seqcount_end(&rq->gstate_seq);
+
 	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
-		/*
-		 * Coherence order guarantees these consecutive stores to a
-		 * single variable propagate in the specified order. Thus the
-		 * clear_bit() is ordered _after_ the set bit. See
-		 * blk_mq_check_expired().
-		 *
-		 * (the bits must be part of the same byte for this to be
-		 * true).
-		 */
+	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
 		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
-	}
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -668,6 +685,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 	blk_mq_sched_requeue_request(rq);
 
 	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
+		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
@@ -765,6 +783,7 @@ EXPORT_SYMBOL(blk_mq_tag_to_rq);
 struct blk_mq_timeout_data {
 	unsigned long next;
 	unsigned int next_set;
+	unsigned int nr_expired;
 };
 
 void blk_mq_rq_timed_out(struct request *req, bool reserved)
@@ -792,6 +811,14 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
+		/*
+		 * As nothing prevents from completion happening while
+		 * ->aborted_gstate is set, this may lead to ignored
+		 * completions and further spurious timeouts.
+		 */
+		u64_stats_update_begin(&req->aborted_gstate_sync);
+		req->aborted_gstate = 0;
+		u64_stats_update_end(&req->aborted_gstate_sync);
 		blk_add_timer(req);
 		blk_clear_rq_complete(req);
 		break;
@@ -807,50 +834,48 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	unsigned long deadline;
+	unsigned long gstate, deadline;
+	int start;
 
 	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
 		return;
 
-	/*
-	 * Ensures that if we see STARTED we must also see our
-	 * up-to-date deadline, see blk_mq_start_request().
-	 */
-	smp_rmb();
-
-	deadline = READ_ONCE(rq->deadline);
-
-	/*
-	 * The rq being checked may have been freed and reallocated
-	 * out already here, we avoid this race by checking rq->deadline
-	 * and REQ_ATOM_COMPLETE flag together:
-	 *
-	 * - if rq->deadline is observed as new value because of
-	 *   reusing, the rq won't be timed out because of timing.
-	 * - if rq->deadline is observed as previous value,
-	 *   REQ_ATOM_COMPLETE flag won't be cleared in reuse path
-	 *   because we put a barrier between setting rq->deadline
-	 *   and clearing the flag in blk_mq_start_request(), so
-	 *   this rq won't be timed out too.
-	 */
-	if (time_after_eq(jiffies, deadline)) {
-		if (!blk_mark_rq_complete(rq)) {
-			/*
-			 * Again coherence order ensures that consecutive reads
-			 * from the same variable must be in that order. This
-			 * ensures that if we see COMPLETE clear, we must then
-			 * see STARTED set and we'll ignore this timeout.
-			 *
-			 * (There's also the MB implied by the test_and_clear())
-			 */
-			blk_mq_rq_timed_out(rq, reserved);
-		}
+	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
+	do {
+		start = read_seqcount_begin(&rq->gstate_seq);
+		gstate = READ_ONCE(rq->gstate);
+		deadline = rq->deadline;
+	} while (read_seqcount_retry(&rq->gstate_seq, start));
+
+	/* if in-flight && overdue, mark for abortion */
+	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
+	    time_after_eq(jiffies, deadline)) {
+		u64_stats_update_begin(&rq->aborted_gstate_sync);
+		rq->aborted_gstate = gstate;
+		u64_stats_update_end(&rq->aborted_gstate_sync);
+		data->nr_expired++;
+		hctx->nr_expired++;
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
 }
 
+static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
+	 * completions that we lost to, they would have finished and
+	 * updated @rq->gstate by now; otherwise, the completion path is
+	 * now guaranteed to see @rq->aborted_gstate and yield.  If
+	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
+	 */
+	if (READ_ONCE(rq->gstate) == rq->aborted_gstate &&
+	    !blk_mark_rq_complete(rq))
+		blk_mq_rq_timed_out(rq, reserved);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
@@ -858,7 +883,9 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	struct blk_mq_timeout_data data = {
 		.next		= 0,
 		.next_set	= 0,
+		.nr_expired	= 0,
 	};
+	struct blk_mq_hw_ctx *hctx;
 	int i;
 
 	/* A deadlock might occur if a request is stuck requiring a
@@ -877,14 +904,40 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
+	/* scan for the expired ones and set their ->aborted_gstate */
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
+	if (data.nr_expired) {
+		bool has_rcu = false;
+
+		/*
+		 * Wait till everyone sees ->aborted_gstate.  The
+		 * sequential waits for SRCUs aren't ideal.  If this ever
+		 * becomes a problem, we can add per-hw_ctx rcu_head and
+		 * wait in parallel.
+		 */
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (!hctx->nr_expired)
+				continue;
+
+			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+				has_rcu = true;
+			else
+				synchronize_srcu(hctx->queue_rq_srcu);
+
+			hctx->nr_expired = 0;
+		}
+		if (has_rcu)
+			synchronize_rcu();
+
+		/* terminate the ones we won */
+		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
+	}
+
 	if (data.next_set) {
 		data.next = blk_rq_timeout(round_jiffies_up(data.next));
 		mod_timer(&q->timeout, data.next);
 	} else {
-		struct blk_mq_hw_ctx *hctx;
-
 		queue_for_each_hw_ctx(q, hctx, i) {
 			/* the hctx may be unmapped, so check it here */
 			if (blk_mq_hw_queue_mapped(hctx))
@@ -1879,6 +1932,22 @@ static size_t order_to_size(unsigned int order)
 	return (size_t)PAGE_SIZE << order;
 }
 
+static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
+			       unsigned int hctx_idx, int node)
+{
+	int ret;
+
+	if (set->ops->init_request) {
+		ret = set->ops->init_request(set, rq, hctx_idx, node);
+		if (ret)
+			return ret;
+	}
+
+	seqcount_init(&rq->gstate_seq);
+	u64_stats_init(&rq->aborted_gstate_sync);
+	return 0;
+}
+
 int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx, unsigned int depth)
 {
@@ -1940,12 +2009,9 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 			struct request *rq = p;
 
 			tags->static_rqs[i] = rq;
-			if (set->ops->init_request) {
-				if (set->ops->init_request(set, rq, hctx_idx,
-						node)) {
-					tags->static_rqs[i] = NULL;
-					goto fail;
-				}
+			if (blk_mq_init_request(set, rq, hctx_idx, node)) {
+				tags->static_rqs[i] = NULL;
+				goto fail;
 			}
 
 			p += rq_size;
@@ -2084,9 +2150,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	if (!hctx->fq)
 		goto sched_exit_hctx;
 
-	if (set->ops->init_request &&
-	    set->ops->init_request(set, hctx->fq->flush_rq, hctx_idx,
-				   node))
+	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
 		goto free_fq;
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
@@ -2980,12 +3044,6 @@ static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
 
 static int __init blk_mq_init(void)
 {
-	/*
-	 * See comment in block/blk.h rq_atomic_flags enum
-	 */
-	BUILD_BUG_ON((REQ_ATOM_STARTED / BITS_PER_BYTE) !=
-			(REQ_ATOM_COMPLETE / BITS_PER_BYTE));
-
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
 	return 0;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6c7c3ff..154b2fa 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -27,6 +27,19 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
+/*
+ * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
+ * and the upper bits the generation number.
+ */
+enum mq_rq_state {
+	MQ_RQ_IDLE		= 0,
+	MQ_RQ_IN_FLIGHT		= 1,
+
+	MQ_RQ_STATE_BITS	= 2,
+	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
+	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
+};
+
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
@@ -85,6 +98,38 @@ extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
 
 void blk_mq_release(struct request_queue *q);
 
+/**
+ * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
+ * @rq: target request.
+ */
+static inline int blk_mq_rq_state(struct request *rq)
+{
+	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+}
+
+/**
+ * blk_mq_rq_update_state() - set the current MQ_RQ_* state of a request
+ * @rq: target request.
+ * @state: new state to set.
+ *
+ * Set @rq's state to @state.  The caller is responsible for ensuring that
+ * there are no other updaters.  A request can transition into IN_FLIGHT
+ * only from IDLE and doing so increments the generation number.
+ */
+static inline void blk_mq_rq_update_state(struct request *rq,
+					  enum mq_rq_state state)
+{
+	u64 new_val = (READ_ONCE(rq->gstate) & ~MQ_RQ_STATE_MASK) | state;
+
+	if (state == MQ_RQ_IN_FLIGHT) {
+		WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
+		new_val += MQ_RQ_GEN_INC;
+	}
+
+	/* avoid exposing interim values */
+	WRITE_ONCE(rq->gstate, new_val);
+}
+
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
 					   unsigned int cpu)
 {
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 764ecf9..6427be7 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -208,7 +208,7 @@ void blk_add_timer(struct request *req)
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
-	WRITE_ONCE(req->deadline, jiffies + req->timeout);
+	req->deadline = jiffies + req->timeout;
 
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
diff --git a/block/blk.h b/block/blk.h
index 3f14469..9cb2739 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -123,12 +123,6 @@ void blk_account_io_done(struct request *req);
  * Internal atomic flags for request handling
  */
 enum rq_atomic_flags {
-	/*
-	 * Keep these two bits first - not because we depend on the
-	 * value of them, but we do depend on them being in the same
-	 * byte of storage to ensure ordering on writes. Keeping them
-	 * first will achieve that nicely.
-	 */
 	REQ_ATOM_COMPLETE = 0,
 	REQ_ATOM_STARTED,
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 95c9a5c..460798d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -51,6 +51,7 @@ struct blk_mq_hw_ctx {
 	unsigned int		queue_num;
 
 	atomic_t		nr_active;
+	unsigned int		nr_expired;
 
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8089ca1..2d6fd11 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,6 +27,8 @@
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
+#include <linux/seqlock.h>
+#include <linux/u64_stats_sync.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -228,6 +230,27 @@ struct request {
 
 	unsigned short write_hint;
 
+	/*
+	 * On blk-mq, the lower bits of ->gstate carry the MQ_RQ_* state
+	 * value and the upper bits the generation number which is
+	 * monotonically incremented and used to distinguish the reuse
+	 * instances.
+	 *
+	 * ->gstate_seq allows updates to ->gstate and other fields
+	 * (currently ->deadline) during request start to be read
+	 * atomically from the timeout path, so that it can operate on a
+	 * coherent set of information.
+	 */
+	seqcount_t gstate_seq;
+	u64 gstate;
+
+	/*
+	 * ->aborted_gstate is used by the timeout to claim a specific
+	 * recycle instance of this request.  See blk_mq_timeout_work().
+	 */
+	struct u64_stats_sync aborted_gstate_sync;
+	u64 aborted_gstate;
+
 	unsigned long deadline;
 	struct list_head timeout_list;
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 3/6] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE
  2017-12-12 19:01 [PATCHSET v2] blk-mq: reimplement timeout handling Tejun Heo
  2017-12-12 19:01 ` [PATCH 1/6] blk-mq: protect completion path with RCU Tejun Heo
  2017-12-12 19:01 ` [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme Tejun Heo
@ 2017-12-12 19:01 ` Tejun Heo
  2017-12-12 19:01 ` [PATCH 4/6] blk-mq: make blk_abort_request() trigger timeout path Tejun Heo
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2017-12-12 19:01 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, oleg, peterz, kernel-team, osandov, linux-block,
	hch, Tejun Heo

blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test
REQ_ATOM_COMPLETE to determine the request state.  Both uses are
speculative and we can test REQ_ATOM_STARTED and blk_mq_rq_state() for
equivalent results.  Replace the tests.  This will allow removing
REQ_ATOM_COMPLETE usages from blk-mq.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b4e733b..3a08d7a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -95,8 +95,7 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
 {
 	struct mq_inflight *mi = priv;
 
-	if (test_bit(REQ_ATOM_STARTED, &rq->atomic_flags) &&
-	    !test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
+	if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT) {
 		/*
 		 * index[0] counts the specific partition that was asked
 		 * for. index[1] counts the ones that are active on the
@@ -2958,7 +2957,8 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
 
 	hrtimer_init_sleeper(&hs, current);
 	do {
-		if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
+		if (test_bit(REQ_ATOM_STARTED, &rq->atomic_flags) &&
+		    blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
 			break;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		hrtimer_start_expires(&hs.timer, mode);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 4/6] blk-mq: make blk_abort_request() trigger timeout path
  2017-12-12 19:01 [PATCHSET v2] blk-mq: reimplement timeout handling Tejun Heo
                   ` (2 preceding siblings ...)
  2017-12-12 19:01 ` [PATCH 3/6] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE Tejun Heo
@ 2017-12-12 19:01 ` Tejun Heo
  2017-12-14 18:56     ` Bart Van Assche
  2017-12-12 19:01 ` [PATCH 5/6] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq Tejun Heo
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 49+ messages in thread
From: Tejun Heo @ 2017-12-12 19:01 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, oleg, peterz, kernel-team, osandov, linux-block,
	hch, Tejun Heo

With issue/complete and timeout paths now using the generation number
and state based synchronization, blk_abort_request() is the only one
which depends on REQ_ATOM_COMPLETE for arbitrating completion.

There's no reason for blk_abort_request() to be a completely separate
path.  This patch makes blk_abort_request() piggyback on the timeout
path instead of trying to terminate the request directly.

This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq.c      | 2 +-
 block/blk-mq.h      | 2 --
 block/blk-timeout.c | 7 ++++---
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3a08d7a..73d6444 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -785,7 +785,7 @@ struct blk_mq_timeout_data {
 	unsigned int nr_expired;
 };
 
-void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 154b2fa..1fa83fa 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -94,8 +94,6 @@ extern int blk_mq_sysfs_register(struct request_queue *q);
 extern void blk_mq_sysfs_unregister(struct request_queue *q);
 extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
-extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
-
 void blk_mq_release(struct request_queue *q);
 
 /**
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 6427be7..051924f 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -156,12 +156,13 @@ void blk_timeout_work(struct work_struct *work)
  */
 void blk_abort_request(struct request *req)
 {
-	if (blk_mark_rq_complete(req))
-		return;
 
 	if (req->q->mq_ops) {
-		blk_mq_rq_timed_out(req, false);
+		req->deadline = jiffies;
+		mod_timer(&req->q->timeout, 0);
 	} else {
+		if (blk_mark_rq_complete(req))
+			return;
 		blk_delete_timer(req);
 		blk_rq_timed_out(req);
 	}
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 5/6] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq
  2017-12-12 19:01 [PATCHSET v2] blk-mq: reimplement timeout handling Tejun Heo
                   ` (3 preceding siblings ...)
  2017-12-12 19:01 ` [PATCH 4/6] blk-mq: make blk_abort_request() trigger timeout path Tejun Heo
@ 2017-12-12 19:01 ` Tejun Heo
  2017-12-12 19:01 ` [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED Tejun Heo
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2017-12-12 19:01 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, oleg, peterz, kernel-team, osandov, linux-block,
	hch, Tejun Heo, jianchao.wang

After the recent updates to use generation number and state based
synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE for
anything.

Remove all REQ_ATOM_COMPLETE usages.  This removes atomic bitops from
hot paths too.

v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
---
 block/blk-mq.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 73d6444..7269552 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -596,14 +596,12 @@ void blk_mq_complete_request(struct request *rq)
 	 */
 	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
 		rcu_read_lock();
-		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
-		    !blk_mark_rq_complete(rq))
+		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 			__blk_mq_complete_request(rq);
 		rcu_read_unlock();
 	} else {
 		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
-		    !blk_mark_rq_complete(rq))
+		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 			__blk_mq_complete_request(rq);
 		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
 	}
@@ -650,8 +648,6 @@ void blk_mq_start_request(struct request *rq)
 	write_seqcount_end(&rq->gstate_seq);
 
 	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
-		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -819,7 +815,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		req->aborted_gstate = 0;
 		u64_stats_update_end(&req->aborted_gstate_sync);
 		blk_add_timer(req);
-		blk_clear_rq_complete(req);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -870,8 +865,7 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 	 * now guaranteed to see @rq->aborted_gstate and yield.  If
 	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
 	 */
-	if (READ_ONCE(rq->gstate) == rq->aborted_gstate &&
-	    !blk_mark_rq_complete(rq))
+	if (READ_ONCE(rq->gstate) == rq->aborted_gstate)
 		blk_mq_rq_timed_out(rq, reserved);
 }
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED
  2017-12-12 19:01 [PATCHSET v2] blk-mq: reimplement timeout handling Tejun Heo
                   ` (4 preceding siblings ...)
  2017-12-12 19:01 ` [PATCH 5/6] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq Tejun Heo
@ 2017-12-12 19:01 ` Tejun Heo
  2017-12-12 22:20     ` Bart Van Assche
  2017-12-12 20:23 ` [PATCHSET v2] blk-mq: reimplement timeout handling Jens Axboe
  2017-12-20 23:41   ` Bart Van Assche
  7 siblings, 1 reply; 49+ messages in thread
From: Tejun Heo @ 2017-12-12 19:01 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, oleg, peterz, kernel-team, osandov, linux-block,
	hch, Tejun Heo

After the recent updates to use generation number and state based
synchronization, we can easily replace REQ_ATOM_STARTED usages by
adding an extra state that marks a request as completed but not yet
freed.

Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
left and is removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq-debugfs.c |  4 +---
 block/blk-mq.c         | 39 ++++++++-------------------------------
 block/blk-mq.h         |  1 +
 block/blk.h            |  1 -
 4 files changed, 10 insertions(+), 35 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b56a4f3..8adc837 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -271,7 +271,6 @@ static const char *const cmd_flag_name[] = {
 #define RQF_NAME(name) [ilog2((__force u32)RQF_##name)] = #name
 static const char *const rqf_name[] = {
 	RQF_NAME(SORTED),
-	RQF_NAME(STARTED),
 	RQF_NAME(QUEUED),
 	RQF_NAME(SOFTBARRIER),
 	RQF_NAME(FLUSH_SEQ),
@@ -295,7 +294,6 @@ static const char *const rqf_name[] = {
 #define RQAF_NAME(name) [REQ_ATOM_##name] = #name
 static const char *const rqaf_name[] = {
 	RQAF_NAME(COMPLETE),
-	RQAF_NAME(STARTED),
 	RQAF_NAME(POLL_SLEPT),
 };
 #undef RQAF_NAME
@@ -409,7 +407,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
 	const struct show_busy_params *params = data;
 
 	if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-	    test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+	    blk_mq_rq_state(rq) != MQ_RQ_IDLE)
 		__blk_mq_debugfs_rq_show(params->m,
 					 list_entry_rq(&rq->queuelist));
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7269552..bf53132 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -482,7 +482,7 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
-	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
 	clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
@@ -530,7 +530,7 @@ static void __blk_mq_complete_request(struct request *rq)
 	int cpu;
 
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
 
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
@@ -610,7 +610,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
 
 int blk_mq_request_started(struct request *rq)
 {
-	return test_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
 }
 EXPORT_SYMBOL_GPL(blk_mq_request_started);
 
@@ -629,7 +629,6 @@ void blk_mq_start_request(struct request *rq)
 	}
 
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
-	WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
 
 	/*
 	 * Mark @rq in-flight which also advances the generation number,
@@ -647,8 +646,6 @@ void blk_mq_start_request(struct request *rq)
 	blk_add_timer(rq);
 	write_seqcount_end(&rq->gstate_seq);
 
-	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
 		 * Make sure space for the drain appears.  We know we can do
@@ -661,13 +658,9 @@ void blk_mq_start_request(struct request *rq)
 EXPORT_SYMBOL(blk_mq_start_request);
 
 /*
- * When we reach here because queue is busy, REQ_ATOM_COMPLETE
- * flag isn't set yet, so there may be race with timeout handler,
- * but given rq->deadline is just set in .queue_rq() under
- * this situation, the race won't be possible in reality because
- * rq->timeout should be set as big enough to cover the window
- * between blk_mq_start_request() called from .queue_rq() and
- * clearing REQ_ATOM_STARTED here.
+ * When we reach here because queue is busy, it's safe to change the state
+ * to IDLE without checking @rq->aborted_gstate because we should still be
+ * holding the RCU read lock and thus protected against timeout.
  */
 static void __blk_mq_requeue_request(struct request *rq)
 {
@@ -679,7 +672,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
 	blk_mq_sched_requeue_request(rq);
 
-	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
+	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
 		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
@@ -786,18 +779,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	/*
-	 * We know that complete is set at this point. If STARTED isn't set
-	 * anymore, then the request isn't active and the "timeout" should
-	 * just be ignored. This can happen due to the bitflag ordering.
-	 * Timeout first checks if STARTED is set, and if it is, assumes
-	 * the request is active. But if we race with completion, then
-	 * both flags will get cleared. So check here again, and ignore
-	 * a timeout event with a request that isn't active.
-	 */
-	if (!test_bit(REQ_ATOM_STARTED, &req->atomic_flags))
-		return;
-
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
@@ -831,9 +812,6 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 	unsigned long gstate, deadline;
 	int start;
 
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
-		return;
-
 	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
 	do {
 		start = read_seqcount_begin(&rq->gstate_seq);
@@ -2951,8 +2929,7 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
 
 	hrtimer_init_sleeper(&hs, current);
 	do {
-		if (test_bit(REQ_ATOM_STARTED, &rq->atomic_flags) &&
-		    blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
+		if (blk_mq_rq_state(rq) == MQ_RQ_COMPLETE)
 			break;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		hrtimer_start_expires(&hs.timer, mode);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1fa83fa..523cad0 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -34,6 +34,7 @@ struct blk_mq_ctx {
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
+	MQ_RQ_COMPLETE		= 2,
 
 	MQ_RQ_STATE_BITS	= 2,
 	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
diff --git a/block/blk.h b/block/blk.h
index 9cb2739..a68dbe3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -124,7 +124,6 @@ void blk_account_io_done(struct request *req);
  */
 enum rq_atomic_flags {
 	REQ_ATOM_COMPLETE = 0,
-	REQ_ATOM_STARTED,
 
 	REQ_ATOM_POLL_SLEPT,
 };
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCHSET v2] blk-mq: reimplement timeout handling
  2017-12-12 19:01 [PATCHSET v2] blk-mq: reimplement timeout handling Tejun Heo
                   ` (5 preceding siblings ...)
  2017-12-12 19:01 ` [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED Tejun Heo
@ 2017-12-12 20:23 ` Jens Axboe
  2017-12-12 21:40   ` Tejun Heo
  2017-12-20 23:41   ` Bart Van Assche
  7 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2017-12-12 20:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, oleg, peterz, kernel-team, osandov, linux-block, hch

On 12/12/2017 12:01 PM, Tejun Heo wrote:
> Changes from the last version[1]
> 
> - BLK_EH_RESET_TIMER handling fixed.
> 
> - s/request->gstate_seqc/request->gstate_seq/
> 
> - READ_ONCE() added to blk_mq_rq_update_state().
> 
> - Removed left over blk_clear_rq_complete() invocation from
>   blk_mq_rq_timed_out().
> 
> Currently, blk-mq timeout path synchronizes against the usual
> issue/completion path using a complex scheme involving atomic
> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> rules.  Unfortunately, it contains quite a few holes.
> 
> It's pretty easy to make blk_mq_check_expired() terminate a later
> instance of a request.  If we induce 5 sec delay before
> time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
> 2s, and issue back-to-back large IOs, blk-mq starts timing out
> requests spuriously pretty quickly.  Nothing actually timed out.  It
> just made the call on a recycled instance of a request and then
> terminated a later instance long after the original instance finished.
> The scenario isn't theoretical either.
> 
> This patchset replaces the broken synchronization mechanism with a RCU
> and generation number based one.  Please read the patch description of
> the second patch for more details.

I like this a lot, it's a lot less fragile and more intuitive/readable
than what we have now. And apparently less error prone... I'll do
some testing with this.

BTW, since you add a few more BLK_MQ_F_BLOCKING checks, I think something
like the below would be a good cleanup on top of this.


From: Jens Axboe <axboe@kernel.dk>
Subject: [PATCH] blk-mq: move hctx lock/unlock into a helper

Move the RCU vs SRCU logic into lock/unlock helpers, which makes
the actual functional bits within the locked region much easier
to read.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 663069dca4d6..dec3d1bb0559 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -572,6 +572,22 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 	return aborted_gstate;
 }
 
+static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
+{
+	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+		rcu_read_unlock();
+	else
+		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
+}
+
+static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
+{
+	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+		rcu_read_lock();
+	else
+		*srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
+}
+
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -594,17 +610,10 @@ void blk_mq_complete_request(struct request *rq)
 	 * claiming @rq and we lost.  This is synchronized through RCU.
 	 * See blk_mq_timeout_work() for details.
 	 */
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-		rcu_read_lock();
-		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
-			__blk_mq_complete_request(rq);
-		rcu_read_unlock();
-	} else {
-		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
-			__blk_mq_complete_request(rq);
-		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
-	}
+	hctx_lock(hctx, &srcu_idx);
+	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
+		__blk_mq_complete_request(rq);
+	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -1243,17 +1252,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	 */
 	WARN_ON_ONCE(in_interrupt());
 
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-		rcu_read_lock();
-		blk_mq_sched_dispatch_requests(hctx);
-		rcu_read_unlock();
-	} else {
-		might_sleep();
+	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
-		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-		blk_mq_sched_dispatch_requests(hctx);
-		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
-	}
+	hctx_lock(hctx, &srcu_idx);
+	blk_mq_sched_dispatch_requests(hctx);
+	hctx_unlock(hctx, srcu_idx);
 }
 
 /*
@@ -1624,7 +1627,7 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
 
 static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 					struct request *rq,
-					blk_qc_t *cookie, bool may_sleep)
+					blk_qc_t *cookie)
 {
 	struct request_queue *q = rq->q;
 	struct blk_mq_queue_data bd = {
@@ -1674,25 +1677,20 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	}
 
 insert:
-	blk_mq_sched_insert_request(rq, false, run_queue, false, may_sleep);
+	blk_mq_sched_insert_request(rq, false, run_queue, false,
+					hctx->flags & BLK_MQ_F_BLOCKING);
 }
 
 static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, blk_qc_t *cookie)
 {
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-		rcu_read_lock();
-		__blk_mq_try_issue_directly(hctx, rq, cookie, false);
-		rcu_read_unlock();
-	} else {
-		unsigned int srcu_idx;
+	int srcu_idx;
 
-		might_sleep();
+	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
-		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-		__blk_mq_try_issue_directly(hctx, rq, cookie, true);
-		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
-	}
+	hctx_lock(hctx, &srcu_idx);
+	__blk_mq_try_issue_directly(hctx, rq, cookie);
+	hctx_unlock(hctx, srcu_idx);
 }
 
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
@ 2017-12-12 21:37     ` Bart Van Assche
  0 siblings, 0 replies; 49+ messages in thread
From: Bart Van Assche @ 2017-12-12 21:37 UTC (permalink / raw)
  To: tj, axboe
  Cc: linux-kernel, peterz, linux-block, kernel-team, oleg, hch,
	jianchao.w.wang, osandov

On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> +/*
> + * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
> + * and the upper bits the generation number.
> + */
> +enum mq_rq_state {
> +	MQ_RQ_IDLE		= 0,
> +	MQ_RQ_IN_FLIGHT		= 1,
> +
> +	MQ_RQ_STATE_BITS	= 2,
> +	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
> +	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
> +};
> +
> @@ -85,6 +98,38 @@ extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
> +/**
> + * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
> + * @rq: target request.
> + */
> +static inline int blk_mq_rq_state(struct request *rq)
> +{
> +	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
> +}
> +
> +/**
> + * blk_mq_rq_update_state() - set the current MQ_RQ_* state of a request
> + * @rq: target request.
> + * @state: new state to set.
> + *
> + * Set @rq's state to @state.  The caller is responsible for ensuring that
> + * there are no other updaters.  A request can transition into IN_FLIGHT
> + * only from IDLE and doing so increments the generation number.
> + */
> +static inline void blk_mq_rq_update_state(struct request *rq,
> +					  enum mq_rq_state state)
> +{
> +	u64 new_val = (READ_ONCE(rq->gstate) & ~MQ_RQ_STATE_MASK) | state;
> +
> +	if (state == MQ_RQ_IN_FLIGHT) {
> +		WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
> +		new_val += MQ_RQ_GEN_INC;
> +	}
> +
> +	/* avoid exposing interim values */
> +	WRITE_ONCE(rq->gstate, new_val);
> +}

Hello Tejun,

Have you considered the following instead of introducing MQ_RQ_IDLE and
MQ_RQ_IN_FLIGHT? I think this could help to limit the number of new atomic
operations introduced in the hot path by this patch series.

static inline bool blk_mq_rq_in_flight(struct request *rq)
{
	return list_empty(&rq->queuelist);
}

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCHSET v2] blk-mq: reimplement timeout handling
  2017-12-12 20:23 ` [PATCHSET v2] blk-mq: reimplement timeout handling Jens Axboe
@ 2017-12-12 21:40   ` Tejun Heo
  0 siblings, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2017-12-12 21:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, oleg, peterz, kernel-team, osandov, linux-block, hch

Hello, Jens.

On Tue, Dec 12, 2017 at 01:23:01PM -0700, Jens Axboe wrote:
> I like this a lot, it's a lot less fragile and more intuitive/readable
> than what we have now. And apparently less error prone... I'll do
> some testing with this.

Great.

> BTW, since you add a few more BLK_MQ_F_BLOCKING checks, I think something
> like the below would be a good cleanup on top of this.
> 
> From: Jens Axboe <axboe@kernel.dk>
> Subject: [PATCH] blk-mq: move hctx lock/unlock into a helper
> 
> Move the RCU vs SRCU logic into lock/unlock helpers, which makes
> the actual functional bits within the locked region much easier
> to read.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Yeah, that's a lot better.  It might also be a good idea to add
lockdep_assert_hctx_locked() for verification and documentation.
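
Something like the following, perhaps (just a sketch to go with your
hctx_lock()/hctx_unlock() helpers; the checks only do anything under
CONFIG_PROVE_RCU):

	static inline void lockdep_assert_hctx_locked(struct blk_mq_hw_ctx *hctx)
	{
		if (!(hctx->flags & BLK_MQ_F_BLOCKING))
			RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
					 "hctx not RCU read locked");
		else
			RCU_LOCKDEP_WARN(!srcu_read_lock_held(hctx->queue_rq_srcu),
					 "hctx not SRCU read locked");
	}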

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-12 21:37     ` Bart Van Assche
  (?)
@ 2017-12-12 21:44     ` tj
  -1 siblings, 0 replies; 49+ messages in thread
From: tj @ 2017-12-12 21:44 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, linux-kernel, peterz, linux-block, kernel-team, oleg, hch,
	jianchao.w.wang, osandov

Hello, Bart.

On Tue, Dec 12, 2017 at 09:37:11PM +0000, Bart Van Assche wrote:
> Have you considered the following instead of introducing MQ_RQ_IDLE and
> MQ_RQ_IN_FLIGHT? I think this could help to limit the number of new atomic
> operations introduced in the hot path by this patch series.

But nothing in the hot paths is atomic.

> static inline bool blk_mq_rq_in_flight(struct request *rq)
> {
> 	return list_empty(&rq->queuelist);
> }

And the fact that we encode the generation number and state into a
single variable contributes to not needing atomic operations.
Breaking up the state and generation like the above would need more
synchronization, not less.
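
To put it concretely (reusing the helpers and masks from the patch;
the comments are just illustration):

	/*
	 * Hot path: one plain load yields generation and state together,
	 * so no atomic RMW or extra ordering is needed to keep the two
	 * consistent with each other.
	 */
	u64 gstate = READ_ONCE(rq->gstate);
	bool in_flight = (gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT;

	/*
	 * With the state kept separately (e.g. inferred from
	 * list_empty()), the timeout path would need extra
	 * synchronization to know that the state it saw belongs to the
	 * same recycle instance as the generation and deadline it read.
	 */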

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED
@ 2017-12-12 22:20     ` Bart Van Assche
  0 siblings, 0 replies; 49+ messages in thread
From: Bart Van Assche @ 2017-12-12 22:20 UTC (permalink / raw)
  To: tj, axboe
  Cc: kernel-team, linux-kernel, peterz, osandov, linux-block, oleg, hch

On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> @@ -409,7 +407,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
>         const struct show_busy_params *params = data;
>  
>         if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
> -           test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> +           blk_mq_rq_state(rq) != MQ_RQ_IDLE)
>                 __blk_mq_debugfs_rq_show(params->m,
>                                          list_entry_rq(&rq->queuelist));

The above code should show all requests owned by the block driver. Patch
"blk-mq-debugfs: Also show requests that have not yet been started" (not yet
in Jens' tree) changes the REQ_ATOM_STARTED test into list_empty(&rq->queuelist).
Can that change be supported with the existing MQ_RQ_* states or will a new
state have to be introduced to support this? See also
https://marc.info/?l=linux-block&m=151252188411991.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED
  2017-12-12 22:20     ` Bart Van Assche
  (?)
@ 2017-12-12 22:22     ` tj
  -1 siblings, 0 replies; 49+ messages in thread
From: tj @ 2017-12-12 22:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, kernel-team, linux-kernel, peterz, osandov, linux-block,
	oleg, hch

On Tue, Dec 12, 2017 at 10:20:39PM +0000, Bart Van Assche wrote:
> The above code should show all requests owned by the block driver. Patch
> "blk-mq-debugfs: Also show requests that have not yet been started" (not yet
> in Jens' tree) changes the REQ_ATOM_STARTED test into list_empty(&rq->queuelist).
> Can that change be supported with the existing MQ_RQ_* states or will a new
> state have to be introduced to support this? See also
> https://marc.info/?l=linux-block&m=151252188411991.

If the list_empty() test was correct before, it'd be correct now too.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/6] blk-mq: protect completion path with RCU
  2017-12-12 19:01 ` [PATCH 1/6] blk-mq: protect completion path with RCU Tejun Heo
@ 2017-12-13  3:30   ` jianchao.wang
  2017-12-13 16:13     ` Tejun Heo
  2017-12-14 17:01     ` Bart Van Assche
  1 sibling, 1 reply; 49+ messages in thread
From: jianchao.wang @ 2017-12-13  3:30 UTC (permalink / raw)
  To: Tejun Heo, axboe
  Cc: linux-kernel, oleg, peterz, kernel-team, osandov, linux-block, hch

Hello Tejun
Sorry for missing the V2, same comment again.
 
On 12/13/2017 03:01 AM, Tejun Heo wrote:
> Currently, blk-mq protects only the issue path with RCU.  This patch
> puts the completion path under the same RCU protection.  This will be
> used to synchronize issue/completion against timeout by later patches,
> which will also add the comments.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  block/blk-mq.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 1109747..acf4fbb 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -568,11 +568,23 @@ static void __blk_mq_complete_request(struct request *rq)
>  void blk_mq_complete_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> +	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> +	int srcu_idx;
>  
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
> -	if (!blk_mark_rq_complete(rq))
> -		__blk_mq_complete_request(rq);
> +
> +	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
> +		rcu_read_lock();
> +		if (!blk_mark_rq_complete(rq))
> +			__blk_mq_complete_request(rq);
> +		rcu_read_unlock();
> +	} else {
> +		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> +		if (!blk_mark_rq_complete(rq))
> +			__blk_mq_complete_request(rq);
> +		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);

The __blk_mq_complete_request() could be executed in irq context, so there should not be any
sleeping operations in it. If this is just to synchronize with the timeout path to ensure the aborted_gstate
is seen, only rcu is needed here, as well as in blk_mq_timeout_work.
> +	}
>  }
>  EXPORT_SYMBOL(blk_mq_complete_request);
>  
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-12 19:01 ` [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme Tejun Heo
  2017-12-12 21:37     ` Bart Van Assche
@ 2017-12-13  5:07   ` jianchao.wang
  2017-12-13 16:13     ` Tejun Heo
  2017-12-14 18:51     ` Bart Van Assche
  2 siblings, 1 reply; 49+ messages in thread
From: jianchao.wang @ 2017-12-13  5:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, oleg, peterz, kernel-team, osandov,
	linux-block, hch

Hi Tejun

On 12/13/2017 03:01 AM, Tejun Heo wrote:
> Currently, blk-mq timeout path synchronizes against the usual
> issue/completion path using a complex scheme involving atomic
> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> rules.  Unfortunatley, it contains quite a few holes.
> 
> There's a complex dancing around REQ_ATOM_STARTED and
> REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
> they don't have a synchronization point across request recycle
> instances and it isn't clear what the barriers add.
> blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
> deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
> 
> In fact, it's pretty easy to make blk_mq_check_expired() terminate a
> later instance of a request.  If we induce 5 sec delay before
> time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
> 2s, and issue back-to-back large IOs, blk-mq starts timing out
> requests spuriously pretty quickly.  Nothing actually timed out.  It
> just made the call on a recycle instance of a request and then
> terminated a later instance long after the original instance finished.
> The scenario isn't theoretical either.
> 
> This patch replaces the broken synchronization mechanism with a RCU
> and generation number based one.
> 
> 1. Each request has a u64 generation + state value, which can be
>    updated only by the request owner.  Whenever a request becomes
>    in-flight, the generation number gets bumped up too.  This provides
>    the basis for the timeout path to distinguish different recycle
>    instances of the request.
> 
>    Also, marking a request in-flight and setting its deadline are
>    protected with a seqcount so that the timeout path can fetch both
>    values coherently.
> 
> 2. The timeout path fetches the generation, state and deadline.  If
>    the verdict is timeout, it records the generation into a dedicated
>    request abortion field and does RCU wait.
> 
> 3. The completion path is also protected by RCU (from the previous
>    patch) and checks whether the current generation number and state
>    match the abortion field.  If so, it skips completion.
> 
> 4. The timeout path, after RCU wait, scans requests again and
>    terminates the ones whose generation and state still match the ones
>    requested for abortion.
> 
>    By now, the timeout path knows that either the generation number
>    and state changed if it lost the race or the completion will yield
>    to it and can safely timeout the request.
> 
> While it's more lines of code, it's conceptually simpler, doesn't
> depend on direct use of subtle memory ordering or coherence, and
> hopefully doesn't terminate the wrong instance.
> 
> While this change makes REQ_ATOM_COMPLETE synchornization unnecessary
> between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
> removed yet as it's still used in other places.  Future patches will
> move all state tracking to the new mechanism and remove all bitops in
> the hot paths.
> 
> v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
>     - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
>     - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
> ---
>  block/blk-core.c       |   2 +
>  block/blk-mq.c         | 206 +++++++++++++++++++++++++++++++------------------
>  block/blk-mq.h         |  45 +++++++++++
>  block/blk-timeout.c    |   2 +-
>  block/blk.h            |   6 --
>  include/linux/blk-mq.h |   1 +
>  include/linux/blkdev.h |  23 ++++++
>  7 files changed, 204 insertions(+), 81 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b888175..6034623 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>  	rq->start_time = jiffies;
>  	set_start_time_ns(rq);
>  	rq->part = NULL;
> +	seqcount_init(&rq->gstate_seq);
> +	u64_stats_init(&rq->aborted_gstate_sync);
>  }
>  EXPORT_SYMBOL(blk_rq_init);
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index acf4fbb..b4e733b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -530,6 +530,9 @@ static void __blk_mq_complete_request(struct request *rq)
>  	bool shared = false;
>  	int cpu;
>  
> +	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
> +	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
> +
>  	if (rq->internal_tag != -1)
>  		blk_mq_sched_completed_request(rq);
>  	if (rq->rq_flags & RQF_STATS) {
> @@ -557,6 +560,19 @@ static void __blk_mq_complete_request(struct request *rq)
>  	put_cpu();
>  }
>  
> +static u64 blk_mq_rq_aborted_gstate(struct request *rq)
> +{
> +	unsigned int start;
> +	u64 aborted_gstate;
> +
> +	do {
> +		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
> +		aborted_gstate = rq->aborted_gstate;
> +	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
> +
> +	return aborted_gstate;
> +}
> +
>  /**
>   * blk_mq_complete_request - end I/O on a request
>   * @rq:		the request being processed
> @@ -574,14 +590,21 @@ void blk_mq_complete_request(struct request *rq)
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
>  
> +	/*
> +	 * If @rq->aborted_gstate equals the current instance, timeout is
> +	 * claiming @rq and we lost.  This is synchronized through RCU.
> +	 * See blk_mq_timeout_work() for details.
> +	 */
>  	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
>  		rcu_read_lock();
> -		if (!blk_mark_rq_complete(rq))
> +		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
> +		    !blk_mark_rq_complete(rq))
>  			__blk_mq_complete_request(rq);
>  		rcu_read_unlock();
>  	} else {
>  		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> -		if (!blk_mark_rq_complete(rq))
> +		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
> +		    !blk_mark_rq_complete(rq))
>  			__blk_mq_complete_request(rq);
>  		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
>  	}
> @@ -608,34 +631,28 @@ void blk_mq_start_request(struct request *rq)
>  		wbt_issue(q->rq_wb, &rq->issue_stat);
>  	}
>  
> -	blk_add_timer(rq);
> -
> +	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
>  	WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
>  
>  	/*
> -	 * Mark us as started and clear complete. Complete might have been
> -	 * set if requeue raced with timeout, which then marked it as
> -	 * complete. So be sure to clear complete again when we start
> -	 * the request, otherwise we'll ignore the completion event.
> +	 * Mark @rq in-flight which also advances the generation number,
> +	 * and register for timeout.  Protect with a seqcount to allow the
> +	 * timeout path to read both @rq->gstate and @rq->deadline
> +	 * coherently.
>  	 *
> -	 * Ensure that ->deadline is visible before we set STARTED, such that
> -	 * blk_mq_check_expired() is guaranteed to observe our ->deadline when
> -	 * it observes STARTED.
> +	 * This is the only place where a request is marked in-flight.  If
> +	 * the timeout path reads an in-flight @rq->gstate, the
> +	 * @rq->deadline it reads together under @rq->gstate_seq is
> +	 * guaranteed to be the matching one.
>  	 */
> -	smp_wmb();
> +	write_seqcount_begin(&rq->gstate_seq);
> +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> +	blk_add_timer(rq);
> +	write_seqcount_end(&rq->gstate_seq);
> +
>  	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> -	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
> -		/*
> -		 * Coherence order guarantees these consecutive stores to a
> -		 * single variable propagate in the specified order. Thus the
> -		 * clear_bit() is ordered _after_ the set bit. See
> -		 * blk_mq_check_expired().
> -		 *
> -		 * (the bits must be part of the same byte for this to be
> -		 * true).
> -		 */
> +	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
>  		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
> -	}
>  
>  	if (q->dma_drain_size && blk_rq_bytes(rq)) {
>  		/*
> @@ -668,6 +685,7 @@ static void __blk_mq_requeue_request(struct request *rq)
>  	blk_mq_sched_requeue_request(rq);
>  
>  	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> +		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
>  		if (q->dma_drain_size && blk_rq_bytes(rq))
>  			rq->nr_phys_segments--;
>  	}
> @@ -765,6 +783,7 @@ EXPORT_SYMBOL(blk_mq_tag_to_rq);
>  struct blk_mq_timeout_data {
>  	unsigned long next;
>  	unsigned int next_set;
> +	unsigned int nr_expired;
>  };
>  
>  void blk_mq_rq_timed_out(struct request *req, bool reserved)
> @@ -792,6 +811,14 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  		__blk_mq_complete_request(req);
>  		break;
>  	case BLK_EH_RESET_TIMER:
> +		/*
> +		 * As nothing prevents from completion happening while
> +		 * ->aborted_gstate is set, this may lead to ignored
> +		 * completions and further spurious timeouts.
> +		 */
> +		u64_stats_update_begin(&req->aborted_gstate_sync);
> +		req->aborted_gstate = 0;
> +		u64_stats_update_end(&req->aborted_gstate_sync);
>  		blk_add_timer(req);
>  		blk_clear_rq_complete(req);
Test ok with NVMe

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/6] blk-mq: protect completion path with RCU
  2017-12-13  3:30   ` jianchao.wang
@ 2017-12-13 16:13     ` Tejun Heo
  2017-12-14  2:09       ` jianchao.wang
  0 siblings, 1 reply; 49+ messages in thread
From: Tejun Heo @ 2017-12-13 16:13 UTC (permalink / raw)
  To: jianchao.wang
  Cc: axboe, linux-kernel, oleg, peterz, kernel-team, osandov,
	linux-block, hch

Hello,

On Wed, Dec 13, 2017 at 11:30:48AM +0800, jianchao.wang wrote:
> > +	} else {
> > +		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> > +		if (!blk_mark_rq_complete(rq))
> > +			__blk_mq_complete_request(rq);
> > +		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
> 
> The __blk_mq_complete_request() could be executed in irq context, so there should not be any
> sleeping operations in it. If this is just to synchronize with the timeout path to ensure the aborted_gstate
> is seen, only rcu is needed here, as well as in blk_mq_timeout_work.

Sure, but it's just a lot cleaner to use the same to protect both
issue and completion; otherwise, whoever wants to synchronize
against them has to do awkward double rcu locking.
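
Ie. if the completion path stayed under plain RCU while issue on a
BLK_MQ_F_BLOCKING hctx sits under SRCU, anyone who wants to wait both
paths out needs something like

	synchronize_srcu(hctx->queue_rq_srcu);	/* issue side readers */
	synchronize_rcu();			/* completion side readers */

whereas with both paths under the same protection a single wait per
hctx is enough.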

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-13  5:07   ` jianchao.wang
@ 2017-12-13 16:13     ` Tejun Heo
  0 siblings, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2017-12-13 16:13 UTC (permalink / raw)
  To: jianchao.wang
  Cc: axboe, linux-kernel, oleg, peterz, kernel-team, osandov,
	linux-block, hch

Hi, Jianchao.

On Wed, Dec 13, 2017 at 01:07:30PM +0800, jianchao.wang wrote:
> Test ok with NVMe

Awesome, thanks for testing!

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/6] blk-mq: protect completion path with RCU
  2017-12-13 16:13     ` Tejun Heo
@ 2017-12-14  2:09       ` jianchao.wang
  0 siblings, 0 replies; 49+ messages in thread
From: jianchao.wang @ 2017-12-14  2:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, oleg, peterz, kernel-team, osandov,
	linux-block, hch



On 12/14/2017 12:13 AM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Dec 13, 2017 at 11:30:48AM +0800, jianchao.wang wrote:
>>> +	} else {
>>> +		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
>>> +		if (!blk_mark_rq_complete(rq))
>>> +			__blk_mq_complete_request(rq);
>>> +		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
>>
>> The __blk_mq_complete_request() could be executed in irq context, so there should not be any
>> sleeping operations in it. If this is just to synchronize with the timeout path to ensure the aborted_gstate
>> is seen, only rcu is needed here, as well as in blk_mq_timeout_work.
> 
> Sure, but it's just a lot cleaner to use the same to protect both
> issue and completion; otherwise, whoever wants to synchronize
> against them has to do awkward double rcu locking.
> 
It's fair. Thanks for your detailed response. That's really appreciated.
> Thanks.
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/6] blk-mq: protect completion path with RCU
@ 2017-12-14 17:01     ` Bart Van Assche
  0 siblings, 0 replies; 49+ messages in thread
From: Bart Van Assche @ 2017-12-14 17:01 UTC (permalink / raw)
  To: tj, axboe
  Cc: kernel-team, linux-kernel, peterz, osandov, linux-block, oleg, hch

On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> +	} else {
> +		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> +		if (!blk_mark_rq_complete(rq))
> +			__blk_mq_complete_request(rq);
> +		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);

Hello Tejun,

The name queue_rq_srcu was chosen to reflect the original use of that structure,
namely to protect .queue_rq() calls. Your patch series broadens the use of that
srcu structure so I would appreciate it if it would be renamed, e.g. into "srcu".
See also commit 6a83e74d214a ("blk-mq: Introduce blk_mq_quiesce_queue()").

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/6] blk-mq: protect completion path with RCU
  2017-12-14 17:01     ` Bart Van Assche
  (?)
@ 2017-12-14 18:14     ` tj
  -1 siblings, 0 replies; 49+ messages in thread
From: tj @ 2017-12-14 18:14 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, kernel-team, linux-kernel, peterz, osandov, linux-block,
	oleg, hch

On Thu, Dec 14, 2017 at 05:01:06PM +0000, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > +	} else {
> > +		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> > +		if (!blk_mark_rq_complete(rq))
> > +			__blk_mq_complete_request(rq);
> > +		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
> 
> Hello Tejun,
> 
> The name queue_rq_srcu was chosen to reflect the original use of that structure,
> namely to protect .queue_rq() calls. Your patch series broadens the use of that

Yeah, will add a patch to rename it.

> srcu structure so I would appreciate it if it would be renamed, e.g. into "srcu".
> See also commit 6a83e74d214a ("blk-mq: Introduce blk_mq_quiesce_queue()").

Ah yeah, it'd be nice to have the [s]rcu synchronize calls factored
out.
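
Something like the below, I guess (just a sketch, the helper name is
made up), which is also what blk_mq_quiesce_queue() effectively does
for each hctx:

static void blk_mq_hctx_synchronize(struct blk_mq_hw_ctx *hctx)
{
	/* wait out all read-side sections of @hctx, whichever flavor they use */
	if (hctx->flags & BLK_MQ_F_BLOCKING)
		synchronize_srcu(hctx->queue_rq_srcu);
	else
		synchronize_rcu();
}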

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
@ 2017-12-14 18:51     ` Bart Van Assche
  0 siblings, 0 replies; 49+ messages in thread
From: Bart Van Assche @ 2017-12-14 18:51 UTC (permalink / raw)
  To: tj, axboe
  Cc: linux-kernel, peterz, linux-block, kernel-team, oleg, hch,
	jianchao.w.wang, osandov

On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> rules.  Unfortunatley, it contains quite a few holes.
          ^^^^^^^^^^^^^
          Unfortunately?

> While this change makes REQ_ATOM_COMPLETE synchornization unnecessary
                                            ^^^^^^^^^^^^^^^
                                            synchronization?

> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>  	rq->start_time = jiffies;
>  	set_start_time_ns(rq);
>  	rq->part = NULL;
> +	seqcount_init(&rq->gstate_seq);
> +	u64_stats_init(&rq->aborted_gstate_sync);
>  }
>  EXPORT_SYMBOL(blk_rq_init);

Sorry but the above change looks ugly to me. My understanding is that 
blk_rq_init() is only used inside the block layer to initialize legacy block
layer requests while gstate_seq and aborted_gstate_sync are only relevant
for blk-mq requests. Wouldn't it be better to avoid that blk_rq_init() is
called for blk-mq requests such that the above change can be left out? The
only callers outside the block layer core of blk_rq_init() I know of are
ide_prep_sense() and scsi_ioctl_reset(). I can help with converting the SCSI
code if you want.

> +	write_seqcount_begin(&rq->gstate_seq);
> +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> +	blk_add_timer(rq);
> +	write_seqcount_end(&rq->gstate_seq);

My understanding is that both write_seqcount_begin() and write_seqcount_end()
trigger a write memory barrier. Is a seqcount really faster than a spinlock?

> 
> @@ -792,6 +811,14 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  		__blk_mq_complete_request(req);
>  		break;
>  	case BLK_EH_RESET_TIMER:
> +		/*
> +		 * As nothing prevents from completion happening while
> +		 * ->aborted_gstate is set, this may lead to ignored
> +		 * completions and further spurious timeouts.
> +		 */
> +		u64_stats_update_begin(&req->aborted_gstate_sync);
> +		req->aborted_gstate = 0;
> +		u64_stats_update_end(&req->aborted_gstate_sync);

If a blk-mq request is resubmitted 2**62 times, can that result in the above
code setting aborted_gstate to the same value as gstate? Isn't that a bug?
If so, how about setting aborted_gstate in the above code to e.g. gstate ^ (2**63)?

> @@ -228,6 +230,27 @@ struct request {
>  
>  	unsigned short write_hint;
>  
> +	/*
> +	 * On blk-mq, the lower bits of ->gstate carry the MQ_RQ_* state
> +	 * value and the upper bits the generation number which is
> +	 * monotonically incremented and used to distinguish the reuse
> +	 * instances.
> +	 *
> +	 * ->gstate_seq allows updates to ->gstate and other fields
> +	 * (currently ->deadline) during request start to be read
> +	 * atomically from the timeout path, so that it can operate on a
> +	 * coherent set of information.
> +	 */
> +	seqcount_t gstate_seq;
> +	u64 gstate;
> +
> +	/*
> +	 * ->aborted_gstate is used by the timeout to claim a specific
> +	 * recycle instance of this request.  See blk_mq_timeout_work().
> +	 */
> +	struct u64_stats_sync aborted_gstate_sync;
> +	u64 aborted_gstate;
> +
>  	unsigned long deadline;
>  	struct list_head timeout_list;

Why are gstate and aborted_gstate 64-bit variables? What makes you think that
32 bits would not be enough?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/6] blk-mq: make blk_abort_request() trigger timeout path
@ 2017-12-14 18:56     ` Bart Van Assche
  0 siblings, 0 replies; 49+ messages in thread
From: Bart Van Assche @ 2017-12-14 18:56 UTC (permalink / raw)
  To: tj, axboe
  Cc: kernel-team, linux-kernel, peterz, osandov, linux-block, oleg, hch

On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
>  void blk_abort_request(struct request *req)
>  {
> -	if (blk_mark_rq_complete(req))
> -		return;
>  
>  	if (req->q->mq_ops) {
> -		blk_mq_rq_timed_out(req, false);
> +		req->deadline = jiffies;
> +		mod_timer(&req->q->timeout, 0);
>  	} else {
> +		if (blk_mark_rq_complete(req))
> +			return;
>  		blk_delete_timer(req);
>  		blk_rq_timed_out(req);
>  	}

This patch makes blk_abort_request() asynchronous for blk-mq. Have all callers
been audited to verify whether this change is safe?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-14 18:51     ` Bart Van Assche
  (?)
@ 2017-12-14 19:19     ` tj
  2017-12-14 21:13         ` Bart Van Assche
  -1 siblings, 1 reply; 49+ messages in thread
From: tj @ 2017-12-14 19:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, linux-kernel, peterz, linux-block, kernel-team, oleg, hch,
	jianchao.w.wang, osandov

Hello,

On Thu, Dec 14, 2017 at 06:51:11PM +0000, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > rules.  Unfortunatley, it contains quite a few holes.
>           ^^^^^^^^^^^^^
>           Unfortunately?
> 
> > While this change makes REQ_ATOM_COMPLETE synchornization unnecessary
>                                             ^^^^^^^^^^^^^^^
>                                             synchronization?

lol, believe it or not, my English spelling is a lot better than my
Korean.  Will fix them.

> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
> >  	rq->start_time = jiffies;
> >  	set_start_time_ns(rq);
> >  	rq->part = NULL;
> > +	seqcount_init(&rq->gstate_seq);
> > +	u64_stats_init(&rq->aborted_gstate_sync);
> >  }
> >  EXPORT_SYMBOL(blk_rq_init);
> 
> Sorry but the above change looks ugly to me. My understanding is that 
> blk_rq_init() is only used inside the block layer to initialize legacy block
> layer requests while gstate_seq and aborted_gstate_sync are only relevant
> for blk-mq requests. Wouldn't it be better to avoid that blk_rq_init() is
> called for blk-mq requests such that the above change can be left out? The
> only callers outside the block layer core of blk_rq_init() I know of are
> ide_prep_sense() and scsi_ioctl_reset(). I can help with converting the SCSI
> code if you want.

This is also used by the flush path.  We probably should clean that up,
but let's worry about that later cuz flush handling has enough of its
own complications.

> > +	write_seqcount_begin(&rq->gstate_seq);
> > +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > +	blk_add_timer(rq);
> > +	write_seqcount_end(&rq->gstate_seq);
> 
> My understanding is that both write_seqcount_begin() and write_seqcount_end()
> trigger a write memory barrier. Is a seqcount really faster than a spinlock?

Write memory barrier has no cost on x86 and usually pretty low cost
elsewhere too and they're likely in the same cacheline as other rq
fields.

> > @@ -792,6 +811,14 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
> >  		__blk_mq_complete_request(req);
> >  		break;
> >  	case BLK_EH_RESET_TIMER:
> > +		/*
> > +		 * As nothing prevents from completion happening while
> > +		 * ->aborted_gstate is set, this may lead to ignored
> > +		 * completions and further spurious timeouts.
> > +		 */
> > +		u64_stats_update_begin(&req->aborted_gstate_sync);
> > +		req->aborted_gstate = 0;
> > +		u64_stats_update_end(&req->aborted_gstate_sync);
> 
> If a blk-mq request is resubmitted 2**62 times, can that result in the above
> code setting aborted_gstate to the same value as gstate? Isn't that a bug?
> If so, how about setting aborted_gstate in the above code to e.g. gstate ^ (2**63)?

A request gets aborted only if the state is in-flight, 0 isn't that.
Also, how many years would 2^62 times be?
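
For scale: 2^62 is about 4.6 * 10^18, so even at an unrealistically
hot one million resubmissions per second of the same request, the
generation counter would take on the order of 146,000 years to wrap.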

> > +	struct u64_stats_sync aborted_gstate_sync;
> > +	u64 aborted_gstate;
> > +
> >  	unsigned long deadline;
> >  	struct list_head timeout_list;
> 
> Why are gstate and aborted_gstate 64-bit variables? What makes you think that
> 32 bits would not be enough?

Because 32 bits puts it in the range where a false hit is still
theoretically possible in a reasonable amount of time.
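
Rough numbers, assuming a couple of low-order state bits: a 32-bit
gstate would leave about 2^30, i.e. roughly a billion, generations.  A
heavily recycled request at around a million reuses per second burns
through that in well under an hour, so a wrap inside a long timeout
window is at least conceivable; with 62 generation bits it can't
happen in practice.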

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/6] blk-mq: make blk_abort_request() trigger timeout path
  2017-12-14 18:56     ` Bart Van Assche
  (?)
@ 2017-12-14 19:26     ` tj
  -1 siblings, 0 replies; 49+ messages in thread
From: tj @ 2017-12-14 19:26 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, kernel-team, linux-kernel, peterz, osandov, linux-block,
	oleg, hch

Hello,

On Thu, Dec 14, 2017 at 06:56:55PM +0000, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> >  void blk_abort_request(struct request *req)
> >  {
> > -	if (blk_mark_rq_complete(req))
> > -		return;
> >  
> >  	if (req->q->mq_ops) {
> > -		blk_mq_rq_timed_out(req, false);
> > +		req->deadline = jiffies;
> > +		mod_timer(&req->q->timeout, 0);
> >  	} else {
> > +		if (blk_mark_rq_complete(req))
> > +			return;
> >  		blk_delete_timer(req);
> >  		blk_rq_timed_out(req);
> >  	}
> 
> This patch makes blk_abort_request() asynchronous for blk-mq. Have all callers
> been audited to verify whether this change is safe?

I *think* so.  For all the ata-related parts, I know they are.  For
mtip32xx and dasd_ioctl, it seems safe, but I can't tell for sure.
Will cc the respective maintainers on the next posting.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-14 18:51     ` Bart Van Assche
  (?)
  (?)
@ 2017-12-14 20:20     ` Peter Zijlstra
  2017-12-14 21:42         ` Bart Van Assche
  2017-12-15 13:50       ` tj
  -1 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2017-12-14 20:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: tj, axboe, linux-kernel, linux-block, kernel-team, oleg, hch,
	jianchao.w.wang, osandov

On Thu, Dec 14, 2017 at 06:51:11PM +0000, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > +	write_seqcount_begin(&rq->gstate_seq);
> > +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > +	blk_add_timer(rq);
> > +	write_seqcount_end(&rq->gstate_seq);
> 
> My understanding is that both write_seqcount_begin() and write_seqcount_end()
> trigger a write memory barrier. Is a seqcount really faster than a spinlock?

Yes lots, no atomic operations and no waiting.

The only constraint for write_seqlock is that there must not be any
concurrency.

But now that I look at this again, TJ, why can't the below happen?

	write_seqlock_begin();
	blk_mq_rq_update_state(rq, IN_FLIGHT);
	blk_add_timer(rq);
	<timer-irq>
		read_seqcount_begin()
			while (seq & 1)
				cpurelax();
		// life-lock
	</timer-irq>
	write_seqlock_end();

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
@ 2017-12-14 21:13         ` Bart Van Assche
  0 siblings, 0 replies; 49+ messages in thread
From: Bart Van Assche @ 2017-12-14 21:13 UTC (permalink / raw)
  To: tj
  Cc: linux-kernel, peterz, linux-block, kernel-team, oleg, hch, axboe,
	jianchao.w.wang, osandov

On Thu, 2017-12-14 at 11:19 -0800, tj@kernel.org wrote:
> On Thu, Dec 14, 2017 at 06:51:11PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
> > >  	rq->start_time = jiffies;
> > >  	set_start_time_ns(rq);
> > >  	rq->part = NULL;
> > > +	seqcount_init(&rq->gstate_seq);
> > > +	u64_stats_init(&rq->aborted_gstate_sync);
> > >  }
> > >  EXPORT_SYMBOL(blk_rq_init);
> > 
> > Sorry but the above change looks ugly to me. My understanding is that 
> > blk_rq_init() is only used inside the block layer to initialize legacy block
> > layer requests while gstate_seq and aborted_gstate_sync are only relevant
> > for blk-mq requests. Wouldn't it be better to avoid that blk_rq_init() is
> > called for blk-mq requests such that the above change can be left out? The
> > only callers outside the block layer core of blk_rq_init() I know of are
> > ide_prep_sense() and scsi_ioctl_reset(). I can help with converting the SCSI
> > code if you want.
> 
> This is also used by the flush path.  We probably should clean that up,
> but let's worry about that later cuz flush handling has enough of its
> own complications.

We may have a different opinion about this, but I think it is more than a
detail. This patch needs gstate_seq and aborted_gstate_sync to be preserved
across request state transitions from MQ_RQ_IN_FLIGHT to MQ_RQ_IDLE.
blk_mq_init_request() is called at request allocation time, so it's the right
context to initialize gstate_seq and aborted_gstate_sync. blk_rq_init(),
however, is called before every use of a request. Sorry but I'm not
enthusiastic about the code in blk_rq_init() that reinitializes state
information that should survive request reuse.
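
Concretely, I'd rather see those two lines done once per request at
allocation time, e.g. (sketch):

	/* in blk_mq_init_request(), once per allocated request */
	seqcount_init(&rq->gstate_seq);
	u64_stats_init(&rq->aborted_gstate_sync);

and not repeated on every reuse in blk_rq_init().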

Bart.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
@ 2017-12-14 21:42         ` Bart Van Assche
  0 siblings, 0 replies; 49+ messages in thread
From: Bart Van Assche @ 2017-12-14 21:42 UTC (permalink / raw)
  To: peterz
  Cc: linux-kernel, linux-block, kernel-team, oleg, hch, axboe,
	jianchao.w.wang, osandov, tj

On Thu, 2017-12-14 at 21:20 +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 06:51:11PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > > +	write_seqcount_begin(&rq->gstate_seq);
> > > +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > > +	blk_add_timer(rq);
> > > +	write_seqcount_end(&rq->gstate_seq);
> > 
> > My understanding is that both write_seqcount_begin() and write_seqcount_end()
> > trigger a write memory barrier. Is a seqcount really faster than a spinlock?
> 
> Yes lots, no atomic operations and no waiting.
> 
> The only constraint for write_seqlock is that there must not be any
> concurrency.
> 
> But now that I look at this again, TJ, why can't the below happen?
> 
> 	write_seqlock_begin();
> 	blk_mq_rq_update_state(rq, IN_FLIGHT);
> 	blk_add_timer(rq);
> 	<timer-irq>
> 		read_seqcount_begin()
> 			while (seq & 1)
> 				cpurelax();
> 		// life-lock
> 	</timer-irq>
> 	write_seqlock_end();

Hello Peter,

Some time ago the block layer was changed to handle timeouts in thread context
instead of interrupt context. See also commit 287922eb0b18 ("block: defer
timeouts to a workqueue").

Bart.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-14 21:42         ` Bart Van Assche
  (?)
@ 2017-12-14 21:54         ` Peter Zijlstra
  2017-12-15  2:12           ` jianchao.wang
  2017-12-15  2:39             ` Mike Galbraith
  -1 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2017-12-14 21:54 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-kernel, linux-block, kernel-team, oleg, hch, axboe,
	jianchao.w.wang, osandov, tj

On Thu, Dec 14, 2017 at 09:42:48PM +0000, Bart Van Assche wrote:
> On Thu, 2017-12-14 at 21:20 +0100, Peter Zijlstra wrote:
> > On Thu, Dec 14, 2017 at 06:51:11PM +0000, Bart Van Assche wrote:
> > > On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > > > +	write_seqcount_begin(&rq->gstate_seq);
> > > > +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > > > +	blk_add_timer(rq);
> > > > +	write_seqcount_end(&rq->gstate_seq);
> > > 
> > > My understanding is that both write_seqcount_begin() and write_seqcount_end()
> > > trigger a write memory barrier. Is a seqcount really faster than a spinlock?
> > 
> > Yes lots, no atomic operations and no waiting.
> > 
> > The only constraint for write_seqlock is that there must not be any
> > concurrency.
> > 
> > But now that I look at this again, TJ, why can't the below happen?
> > 
> > 	write_seqlock_begin();
> > 	blk_mq_rq_update_state(rq, IN_FLIGHT);
> > 	blk_add_timer(rq);
> > 	<timer-irq>
> > 		read_seqcount_begin()
> > 			while (seq & 1)
> > 				cpurelax();
> > 		// life-lock
> > 	</timer-irq>
> > 	write_seqlock_end();
> 
> Hello Peter,
> 
> Some time ago the block layer was changed to handle timeouts in thread context
> instead of interrupt context. See also commit 287922eb0b18 ("block: defer
> timeouts to a workqueue").

That only makes it a little better:

	Task-A					Worker

	write_seqcount_begin()
	blk_mq_rw_update_state(rq, IN_FLIGHT)
	blk_add_timer(rq)
	<timer>
		schedule_work()
	</timer>
	<context-switch to worker>
						read_seqcount_begin()
							while(seq & 1)
								cpu_relax();


Now normally this isn't fatal because Worker will simply spin its entire
time slice away and we'll eventually schedule our Task-A back in, which
will complete the seqcount and things will work.

But if, for some reason, our Worker was to have RT priority higher than
our Task-A we'd be up some creek without no paddles.

We don't happen to have preemption or IRQs off here? That would fix
things nicely.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-14 21:54         ` Peter Zijlstra
@ 2017-12-15  2:12           ` jianchao.wang
  2017-12-15  7:31             ` Peter Zijlstra
  2017-12-15  2:39             ` Mike Galbraith
  1 sibling, 1 reply; 49+ messages in thread
From: jianchao.wang @ 2017-12-15  2:12 UTC (permalink / raw)
  To: Peter Zijlstra, Bart Van Assche
  Cc: linux-kernel, linux-block, kernel-team, oleg, hch, axboe, osandov, tj



On 12/15/2017 05:54 AM, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 09:42:48PM +0000, Bart Van Assche wrote:
>> On Thu, 2017-12-14 at 21:20 +0100, Peter Zijlstra wrote:
>>> On Thu, Dec 14, 2017 at 06:51:11PM +0000, Bart Van Assche wrote:
>>>> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
>>>>> +	write_seqcount_begin(&rq->gstate_seq);
>>>>> +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>>>>> +	blk_add_timer(rq);
>>>>> +	write_seqcount_end(&rq->gstate_seq);
>>>>
>>>> My understanding is that both write_seqcount_begin() and write_seqcount_end()
>>>> trigger a write memory barrier. Is a seqcount really faster than a spinlock?
>>>
>>> Yes lots, no atomic operations and no waiting.
>>>
>>> The only constraint for write_seqlock is that there must not be any
>>> concurrency.
>>>
>>> But now that I look at this again, TJ, why can't the below happen?
>>>
>>> 	write_seqlock_begin();
>>> 	blk_mq_rq_update_state(rq, IN_FLIGHT);
>>> 	blk_add_timer(rq);
>>> 	<timer-irq>
>>> 		read_seqcount_begin()
>>> 			while (seq & 1)
>>> 				cpurelax();
>>> 		// life-lock
>>> 	</timer-irq>
>>> 	write_seqlock_end();
>>
>> Hello Peter,
>>
>> Some time ago the block layer was changed to handle timeouts in thread context
>> instead of interrupt context. See also commit 287922eb0b18 ("block: defer
>> timeouts to a workqueue").
> 
> That only makes it a little better:
> 
> 	Task-A					Worker
> 
> 	write_seqcount_begin()
> 	blk_mq_rw_update_state(rq, IN_FLIGHT)
> 	blk_add_timer(rq)
> 	<timer>
> 		schedule_work()
> 	</timer>
> 	<context-switch to worker>
> 						read_seqcount_begin()
> 							while(seq & 1)
> 								cpu_relax();
> 
Hi Peter

The current seqcount read side is as below:
	do {
		start = read_seqcount_begin(&rq->gstate_seq);
		gstate = READ_ONCE(rq->gstate);
		deadline = rq->deadline;
	} while (read_seqcount_retry(&rq->gstate_seq, start));
read_seqcount_retry() doesn't check bit 0; it checks whether the saved value from
read_seqcount_begin() is equal to the current value of the seqcount.
Please refer to:
static inline int __read_seqcount_retry(const seqcount_t *s, unsigned start)
{
	return unlikely(s->sequence != start);
}

Thanks
Jianchao
> 
> Now normally this isn't fatal because Worker will simply spin its entire
> time slice away and we'll eventually schedule our Task-A back in, which
> will complete the seqcount and things will work.
> 
> But if, for some reason, our Worker was to have RT priority higher than
> our Task-A we'd be up some creek without no paddles.
> 
> We don't happen to have preemption of IRQs off here? That would fix
> things nicely.
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-14 21:54         ` Peter Zijlstra
@ 2017-12-15  2:39             ` Mike Galbraith
  2017-12-15  2:39             ` Mike Galbraith
  1 sibling, 0 replies; 49+ messages in thread
From: Mike Galbraith @ 2017-12-15  2:39 UTC (permalink / raw)
  To: Peter Zijlstra, Bart Van Assche
  Cc: linux-kernel, linux-block, kernel-team, oleg, hch, axboe,
	jianchao.w.wang, osandov, tj

On Thu, 2017-12-14 at 22:54 +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 09:42:48PM +0000, Bart Van Assche wrote:
> 
> > Some time ago the block layer was changed to handle timeouts in thread context
> > instead of interrupt context. See also commit 287922eb0b18 ("block: defer
> > timeouts to a workqueue").
> 
> That only makes it a little better:
> 
> 	Task-A					Worker
> 
> 	write_seqcount_begin()
> 	blk_mq_rw_update_state(rq, IN_FLIGHT)
> 	blk_add_timer(rq)
> 	<timer>
> 		schedule_work()
> 	</timer>
> 	<context-switch to worker>
> 						read_seqcount_begin()
> 							while(seq & 1)
> 								cpu_relax();
> 
> 
> Now normally this isn't fatal because Worker will simply spin its entire
> time slice away and we'll eventually schedule our Task-A back in, which
> will complete the seqcount and things will work.
> 
> But if, for some reason, our Worker was to have RT priority higher than
> our Task-A we'd be up some creek without no paddles.

Most kthreads, including kworkers, are very frequently SCHED_FIFO here.

	-Mike

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-15  2:12           ` jianchao.wang
@ 2017-12-15  7:31             ` Peter Zijlstra
  2017-12-15 15:14               ` jianchao.wang
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2017-12-15  7:31 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Bart Van Assche, linux-kernel, linux-block, kernel-team, oleg,
	hch, axboe, osandov, tj

On Fri, Dec 15, 2017 at 10:12:50AM +0800, jianchao.wang wrote:
> > That only makes it a little better:
> > 
> > 	Task-A					Worker
> > 
> > 	write_seqcount_begin()
> > 	blk_mq_rw_update_state(rq, IN_FLIGHT)
> > 	blk_add_timer(rq)
> > 	<timer>
> > 		schedule_work()
> > 	</timer>
> > 	<context-switch to worker>
> > 						read_seqcount_begin()
> > 							while(seq & 1)
> > 								cpu_relax();
> > 
> Hi Peter
> 
> The current seqcount read side is as below:
> 	do {
> 		start = read_seqcount_begin(&rq->gstate_seq);


static inline unsigned read_seqcount_begin(const seqcount_t *s)
{
	seqcount_lockdep_reader_access(s);
	return raw_read_seqcount_begin(s);
}

static inline unsigned raw_read_seqcount_begin(const seqcount_t *s)
{
	unsigned ret = __read_seqcount_begin(s);
	smp_rmb();
	return ret;
}

static inline unsigned __read_seqcount_begin(const seqcount_t *s)
{
	unsigned ret;

repeat:
	ret = READ_ONCE(s->sequence);
	if (unlikely(ret & 1)) {
		cpu_relax();
		goto repeat;
	}
	return ret;
}

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-14 21:13         ` Bart Van Assche
  (?)
@ 2017-12-15 13:30         ` tj
  -1 siblings, 0 replies; 49+ messages in thread
From: tj @ 2017-12-15 13:30 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-kernel, peterz, linux-block, kernel-team, oleg, hch, axboe,
	jianchao.w.wang, osandov

Hello, Bart.

On Thu, Dec 14, 2017 at 09:13:32PM +0000, Bart Van Assche wrote:
...
> however, is called before every use of a request. Sorry but I'm not
> enthusiastic about the code in blk_rq_init() that reinitializes state
> information that should survive request reuse.

If it wasn't clear, me neither.  I think what'd be better is making
those paths use the usual request allocation path instead of a custom
one, but for flush handling that's not gonna be trivial, so let's deal
with that later.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-14 20:20     ` Peter Zijlstra
  2017-12-14 21:42         ` Bart Van Assche
@ 2017-12-15 13:50       ` tj
  1 sibling, 0 replies; 49+ messages in thread
From: tj @ 2017-12-15 13:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Bart Van Assche, axboe, linux-kernel, linux-block, kernel-team,
	oleg, hch, jianchao.w.wang, osandov

Hello, Peter.

On Thu, Dec 14, 2017 at 09:20:42PM +0100, Peter Zijlstra wrote:
> But now that I look at this again, TJ, why can't the below happen?
> 
> 	write_seqcount_begin();
> 	blk_mq_rq_update_state(rq, IN_FLIGHT);
> 	blk_add_timer(rq);
> 	<timer-irq>
> 		read_seqcount_begin()
> 			while (seq & 1)
> 				cpu_relax();
> 		// live-lock
> 	</timer-irq>
> 	write_seqcount_end();

Ah, you're right.  For both gstate_seq and aborted_gstate_sync, we can
push all synchronization to the timeout side - i.e. the gstate_seq read
can yield, pause or synchronize_rcu, and the aborted_gstate_sync update
can disable irqs around the write.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2017-12-15  7:31             ` Peter Zijlstra
@ 2017-12-15 15:14               ` jianchao.wang
  0 siblings, 0 replies; 49+ messages in thread
From: jianchao.wang @ 2017-12-15 15:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Bart Van Assche, linux-kernel, linux-block, kernel-team, oleg,
	hch, axboe, osandov, tj



On 12/15/2017 03:31 PM, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 10:12:50AM +0800, jianchao.wang wrote:
>>> That only makes it a little better:
>>>
>>> 	Task-A					Worker
>>>
>>> 	write_seqcount_begin()
>>> 	blk_mq_rq_update_state(rq, IN_FLIGHT)
>>> 	blk_add_timer(rq)
>>> 	<timer>
>>> 		schedule_work()
>>> 	</timer>
>>> 	<context-switch to worker>
>>> 						read_seqcount_begin()
>>> 							while(seq & 1)
>>> 								cpu_relax();
>>>
>> Hi Peter
>>
>> The current seqcount read side is as below:
>> 	do {
>> 		start = read_seqcount_begin(&rq->gstate_seq);
> 
> 
> static inline unsigned read_seqcount_begin(const seqcount_t *s)
> {
> 	seqcount_lockdep_reader_access(s);
> 	return raw_read_seqcount_begin(s);
> }
> 
> static inline unsigned raw_read_seqcount_begin(const seqcount_t *s)
> {
> 	unsigned ret = __read_seqcount_begin(s);
> 	smp_rmb();
> 	return ret;
> }
> 
> static inline unsigned __read_seqcount_begin(const seqcount_t *s)
> {
> 	unsigned ret;
> 
> repeat:
> 	ret = READ_ONCE(s->sequence);
> 	if (unlikely(ret & 1)) {
> 		cpu_relax();
> 		goto repeat;
> 	}
> 	return ret;
> }
> 
Thanks a lot for kindly pointing this out.
jianchao

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCHSET v2] blk-mq: reimplement timeout handling
@ 2017-12-20 23:41   ` Bart Van Assche
  0 siblings, 0 replies; 49+ messages in thread
From: Bart Van Assche @ 2017-12-20 23:41 UTC (permalink / raw)
  To: tj, axboe
  Cc: kernel-team, linux-kernel, peterz, osandov, linux-block, oleg, hch

On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> Currently, blk-mq timeout path synchronizes against the usual
> issue/completion path using a complex scheme involving atomic
> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> rules.  Unfortunatley, it contains quite a few holes.

Hello Tejun,

An attempt to run SCSI I/O with this patch series applied resulted in
the following:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: scsi_times_out+0x1c/0x2d0
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 437 Comm: kworker/1:1H Tainted: G        W        4.15.0-rc4-dbg+ #1
Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 2.5.4 01/22/2016
Workqueue: kblockd blk_mq_timeout_work
RIP: 0010:scsi_times_out+0x1c/0x2d0
RSP: 0018:ffffc90007ef3d58 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880878eab000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880878eab000
RBP: ffff880878eab1a0 R08: ffffffffffffffff R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
R13: 0000000000000000 R14: ffff88085e4a5ce8 R15: ffff880878e9f848
FS:  0000000000000000(0000) GS:ffff88093f600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000001c0f002 CR4: 00000000000606e0
Call Trace:
 blk_mq_terminate_expired+0x36/0x70
 bt_iter+0x43/0x50
 blk_mq_queue_tag_busy_iter+0xee/0x200
 blk_mq_timeout_work+0x186/0x2e0
 process_one_work+0x221/0x6e0
 worker_thread+0x3a/0x390
 kthread+0x11c/0x140
 ret_from_fork+0x24/0x30
RIP: scsi_times_out+0x1c/0x2d0 RSP: ffffc90007ef3d58
CR2: 0000000000000000

(gdb) list *(scsi_times_out+0x1c)
0xffffffff8147adbc is in scsi_times_out (drivers/scsi/scsi_error.c:285).
280      */
281     enum blk_eh_timer_return scsi_times_out(struct request *req)
282     {
283             struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
284             enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
285             struct Scsi_Host *host = scmd->device->host;
286
287             trace_scsi_dispatch_cmd_timeout(scmd);
288             scsi_log_completion(scmd, TIMEOUT_ERROR);
289

(gdb) disas /s scsi_times_out
[ ... ]
283             struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
284             enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
285             struct Scsi_Host *host = scmd->device->host;
   0xffffffff8147adb2 <+18>:    mov    0x1d8(%rdi),%rax
   0xffffffff8147adb9 <+25>:    mov    %rdi,%rbx
   0xffffffff8147adbc <+28>:    mov    (%rax),%r13
   0xffffffff8147adbf <+31>:    nopl   0x0(%rax,%rax,1)

Bart.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCHSET v2] blk-mq: reimplement timeout handling
  2017-12-20 23:41   ` Bart Van Assche
  (?)
@ 2017-12-21  0:08   ` tj
  2017-12-21  1:00       ` Bart Van Assche
  -1 siblings, 1 reply; 49+ messages in thread
From: tj @ 2017-12-21  0:08 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, kernel-team, linux-kernel, peterz, osandov, linux-block,
	oleg, hch

On Wed, Dec 20, 2017 at 11:41:02PM +0000, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > Currently, blk-mq timeout path synchronizes against the usual
> > issue/completion path using a complex scheme involving atomic
> > bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> > rules.  Unfortunatley, it contains quite a few holes.
> 
> Hello Tejun,
> 
> An attempt to run SCSI I/O with this patch series applied resulted in
> the following:

Can you please try the v3?  There were a couple bugs that I missed
while testing earlier versions.

 http://lkml.kernel.org/r/20171216120726.517153-1-tj@kernel.org

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCHSET v2] blk-mq: reimplement timeout handling
@ 2017-12-21  1:00       ` Bart Van Assche
  0 siblings, 0 replies; 49+ messages in thread
From: Bart Van Assche @ 2017-12-21  1:00 UTC (permalink / raw)
  To: tj
  Cc: linux-kernel, peterz, linux-block, kernel-team, oleg, hch, axboe,
	osandov

On Wed, 2017-12-20 at 16:08 -0800, tj@kernel.org wrote:
> On Wed, Dec 20, 2017 at 11:41:02PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > > Currently, blk-mq timeout path synchronizes against the usual
> > > issue/completion path using a complex scheme involving atomic
> > > bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> > > rules.  Unfortunatley, it contains quite a few holes.
> > 
> > Hello Tejun,
> > 
> > An attempt to run SCSI I/O with this patch series applied resulted in
> > the following:
> 
> Can you please try the v3?  There were a couple bugs that I missed
> while testing earlier versions.
> 
>  http://lkml.kernel.org/r/20171216120726.517153-1-tj@kernel.org

Will do. But please Cc: linux-block if you post a v4 of this patch
series. I searched the linux-block folder of my mailbox for the latest version
of this patch series, and that is how I ended up testing v2 instead of v3 ...

Bart.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/6] blk-mq: protect completion path with RCU
  2017-12-09 19:25 ` [PATCH 1/6] blk-mq: protect completion path with RCU Tejun Heo
@ 2017-12-13  3:10   ` jianchao.wang
  0 siblings, 0 replies; 49+ messages in thread
From: jianchao.wang @ 2017-12-13  3:10 UTC (permalink / raw)
  To: Tejun Heo, axboe; +Cc: linux-kernel, oleg, peterz, kernel-team, osandov

Hi tejun

On 12/10/2017 03:25 AM, Tejun Heo wrote:
> Currently, blk-mq protects only the issue path with RCU.  This patch
> puts the completion path under the same RCU protection.  This will be
> used to synchronize issue/completion against timeout by later patches,
> which will also add the comments.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  block/blk-mq.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 1109747..acf4fbb 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -568,11 +568,23 @@ static void __blk_mq_complete_request(struct request *rq)
>  void blk_mq_complete_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> +	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> +	int srcu_idx;
>  
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
> -	if (!blk_mark_rq_complete(rq))
> -		__blk_mq_complete_request(rq);
> +
> +	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
> +		rcu_read_lock();
> +		if (!blk_mark_rq_complete(rq))
> +			__blk_mq_complete_request(rq);
> +		rcu_read_unlock();
> +	} else {
> +		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> +		if (!blk_mark_rq_complete(rq))
> +			__blk_mq_complete_request(rq);
> +		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
> +	}
__blk_mq_complete_request() can be executed in irq context, so there must not
be any sleeping operations in it. The srcu case is therefore not necessary here.

Thanks
Jianchao
>  }
>  EXPORT_SYMBOL(blk_mq_complete_request);
>  
> 
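
As an aside, the duplicated RCU/SRCU branches in the hunk above could be
factored into small helpers, roughly like this (hypothetical helper names,
not part of this patch):

	static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
	{
		if (!(hctx->flags & BLK_MQ_F_BLOCKING))
			rcu_read_lock();
		else
			*srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
	}

	static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
	{
		if (!(hctx->flags & BLK_MQ_F_BLOCKING))
			rcu_read_unlock();
		else
			srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
	}

with blk_mq_complete_request() reduced to:

	hctx_lock(hctx, &srcu_idx);
	if (!blk_mark_rq_complete(rq))
		__blk_mq_complete_request(rq);
	hctx_unlock(hctx, srcu_idx);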

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 1/6] blk-mq: protect completion path with RCU
  2017-12-09 19:25 [PATCHSET] " Tejun Heo
@ 2017-12-09 19:25 ` Tejun Heo
  2017-12-13  3:10   ` jianchao.wang
  0 siblings, 1 reply; 49+ messages in thread
From: Tejun Heo @ 2017-12-09 19:25 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, oleg, peterz, kernel-team, osandov, Tejun Heo

Currently, blk-mq protects only the issue path with RCU.  This patch
puts the completion path under the same RCU protection.  This will be
used to synchronize issue/completion against timeout by later patches,
which will also add the comments.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1109747..acf4fbb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -568,11 +568,23 @@ static void __blk_mq_complete_request(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
+	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
-	if (!blk_mark_rq_complete(rq))
-		__blk_mq_complete_request(rq);
+
+	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
+		rcu_read_lock();
+		if (!blk_mark_rq_complete(rq))
+			__blk_mq_complete_request(rq);
+		rcu_read_unlock();
+	} else {
+		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
+		if (!blk_mark_rq_complete(rq))
+			__blk_mq_complete_request(rq);
+		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
+	}
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2017-12-21  1:00 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-12 19:01 [PATCHSET v2] blk-mq: reimplement timeout handling Tejun Heo
2017-12-12 19:01 ` [PATCH 1/6] blk-mq: protect completion path with RCU Tejun Heo
2017-12-13  3:30   ` jianchao.wang
2017-12-13 16:13     ` Tejun Heo
2017-12-14  2:09       ` jianchao.wang
2017-12-14 17:01   ` Bart Van Assche
2017-12-14 17:01     ` Bart Van Assche
2017-12-14 18:14     ` tj
2017-12-12 19:01 ` [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme Tejun Heo
2017-12-12 21:37   ` Bart Van Assche
2017-12-12 21:37     ` Bart Van Assche
2017-12-12 21:44     ` tj
2017-12-13  5:07   ` jianchao.wang
2017-12-13 16:13     ` Tejun Heo
2017-12-14 18:51   ` Bart Van Assche
2017-12-14 18:51     ` Bart Van Assche
2017-12-14 19:19     ` tj
2017-12-14 21:13       ` Bart Van Assche
2017-12-14 21:13         ` Bart Van Assche
2017-12-15 13:30         ` tj
2017-12-14 20:20     ` Peter Zijlstra
2017-12-14 21:42       ` Bart Van Assche
2017-12-14 21:42         ` Bart Van Assche
2017-12-14 21:54         ` Peter Zijlstra
2017-12-15  2:12           ` jianchao.wang
2017-12-15  7:31             ` Peter Zijlstra
2017-12-15 15:14               ` jianchao.wang
2017-12-15  2:39           ` Mike Galbraith
2017-12-15  2:39             ` Mike Galbraith
2017-12-15 13:50       ` tj
2017-12-12 19:01 ` [PATCH 3/6] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE Tejun Heo
2017-12-12 19:01 ` [PATCH 4/6] blk-mq: make blk_abort_request() trigger timeout path Tejun Heo
2017-12-14 18:56   ` Bart Van Assche
2017-12-14 18:56     ` Bart Van Assche
2017-12-14 19:26     ` tj
2017-12-12 19:01 ` [PATCH 5/6] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq Tejun Heo
2017-12-12 19:01 ` [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED Tejun Heo
2017-12-12 22:20   ` Bart Van Assche
2017-12-12 22:20     ` Bart Van Assche
2017-12-12 22:22     ` tj
2017-12-12 20:23 ` [PATCHSET v2] blk-mq: reimplement timeout handling Jens Axboe
2017-12-12 21:40   ` Tejun Heo
2017-12-20 23:41 ` Bart Van Assche
2017-12-20 23:41   ` Bart Van Assche
2017-12-21  0:08   ` tj
2017-12-21  1:00     ` Bart Van Assche
2017-12-21  1:00       ` Bart Van Assche
  -- strict thread matches above, loose matches on Subject: below --
2017-12-09 19:25 [PATCHSET] " Tejun Heo
2017-12-09 19:25 ` [PATCH 1/6] blk-mq: protect completion path with RCU Tejun Heo
2017-12-13  3:10   ` jianchao.wang
