* [PATCHSET v5] blk-mq: reimplement timeout handling
@ 2018-01-09 16:29 Tejun Heo
  2018-01-09 16:29 ` [PATCH 1/8] blk-mq: move hctx lock/unlock into a helper Tejun Heo
                   ` (9 more replies)
  0 siblings, 10 replies; 25+ messages in thread
From: Tejun Heo @ 2018-01-09 16:29 UTC (permalink / raw)
  To: jack, axboe, clm, jbacik
  Cc: kernel-team, linux-kernel, linux-btrfs, peterz, jianchao.w.wang,
	Bart.VanAssche, linux-block

Hello,

Changes from [v4]

- Comments added.  Patch description updated.

Changes from [v3]

- Rebased on top of for-4.16/block.

- Integrated Jens's hctx_[un]lock() factoring patch and refreshed the
  patches accordingly.

- Added comment explaining the use of hctx_lock() instead of
  rcu_read_lock() in completion path.

Changes from [v2]

- Possible extended looping around seqcount and u64_stats_sync fixed.

- Misplaced MQ_RQ_IDLE state setting fixed.

- RQF_MQ_TIMEOUT_EXPIRED added to prevent firing the same timeout
  multiple times.

- s/queue_rq_srcu/srcu/ patch added.

- Other misc changes.

Changes from [v1]

- BLK_EH_RESET_TIMER handling fixed.

- s/request->gstate_seqc/request->gstate_seq/

- READ_ONCE() added to blk_mq_rq_update_state().

- Removed left over blk_clear_rq_complete() invocation from
  blk_mq_rq_timed_out().

Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules.  Unfortunately, it contains quite a few holes.

It's pretty easy to make blk_mq_check_expired() terminate a later
instance of a request.  If we induce a 5s delay before the
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly.  Nothing actually timed out; the
code just made the call on a recycled instance of a request and then
terminated a later instance long after the original instance had
finished.  The scenario isn't theoretical either.

This patchset replaces the broken synchronization mechanism with an
RCU and generation number based one.  Please read the description of
the second patch for more details.
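
As a standalone illustration of the scheme, the generation + state
packing that the series introduces can be modeled in userspace C.  The
MQ_RQ_* constants mirror the ones patch 3 adds to block/blk-mq.h, but
the helpers below are invented for this sketch and are not the kernel
code itself:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mirrors the encoding from patch 3: the lower two bits of ->gstate
 * carry the MQ_RQ_* state, the upper bits a monotonically increasing
 * generation number.
 */
enum mq_rq_state {
	MQ_RQ_IDLE		= 0,
	MQ_RQ_IN_FLIGHT		= 1,

	MQ_RQ_STATE_BITS	= 2,
	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
};

static inline int rq_state(uint64_t gstate)
{
	return gstate & MQ_RQ_STATE_MASK;
}

/*
 * Transitioning into IN_FLIGHT bumps the generation, so a timeout
 * decision made against an old gstate can never match a recycled
 * instance of the same request.
 */
static inline uint64_t rq_set_state(uint64_t gstate, enum mq_rq_state state)
{
	uint64_t new_val = (gstate & ~(uint64_t)MQ_RQ_STATE_MASK) | state;

	if (state == MQ_RQ_IN_FLIGHT)
		new_val += MQ_RQ_GEN_INC;
	return new_val;
}
```

Each issue/free/reissue cycle therefore yields a distinct gstate value
even though the request structure itself is reused.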

This patchset contains the following eight patches.

0001-blk-mq-move-hctx-lock-unlock-into-a-helper.patch
0002-blk-mq-protect-completion-path-with-RCU.patch
0003-blk-mq-replace-timeout-synchronization-with-a-RCU-an.patch
0004-blk-mq-use-blk_mq_rq_state-instead-of-testing-REQ_AT.patch
0005-blk-mq-make-blk_abort_request-trigger-timeout-path.patch
0006-blk-mq-remove-REQ_ATOM_COMPLETE-usages-from-blk-mq.patch
0007-blk-mq-remove-REQ_ATOM_STARTED.patch
0008-blk-mq-rename-blk_mq_hw_ctx-queue_rq_srcu-to-srcu.patch

and is available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blk-mq-timeout-v5

diffstat follows.  Thanks.

 block/blk-core.c       |    2 
 block/blk-mq-debugfs.c |    4 
 block/blk-mq.c         |  346 +++++++++++++++++++++++++++++--------------------
 block/blk-mq.h         |   49 ++++++
 block/blk-timeout.c    |   16 +-
 block/blk.h            |    7 
 include/linux/blk-mq.h |    3 
 include/linux/blkdev.h |   25 +++
 8 files changed, 294 insertions(+), 158 deletions(-)

--
tejun

[v4] http://lkml.kernel.org/r/20180108191542.379478-1-tj@kernel.org
[v3] http://lkml.kernel.org/r/20171216120726.517153-1-tj@kernel.org
[v2] http://lkml.kernel.org/r/20171010155441.753966-1-tj@kernel.org
[v1] http://lkml.kernel.org/r/20171209192525.982030-1-tj@kernel.org


* [PATCH 1/8] blk-mq: move hctx lock/unlock into a helper
  2018-01-09 16:29 [PATCHSET v5] blk-mq: reimplement timeout handling Tejun Heo
@ 2018-01-09 16:29 ` Tejun Heo
  2018-01-09 16:29 ` [PATCH 2/8] blk-mq: protect completion path with RCU Tejun Heo
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2018-01-09 16:29 UTC (permalink / raw)
  To: jack, axboe, clm, jbacik
  Cc: kernel-team, linux-kernel, linux-btrfs, peterz, jianchao.w.wang,
	Bart.VanAssche, linux-block, Tejun Heo

From: Jens Axboe <axboe@kernel.dk>

Move the RCU vs SRCU logic into lock/unlock helpers, which makes
the actual functional bits within the locked region much easier
to read.

tj: Reordered in front of timeout revamp patches and added the missing
    blk_mq_run_hw_queue() conversion.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq.c | 66 ++++++++++++++++++++++++++++------------------------------
 1 file changed, 32 insertions(+), 34 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 111e1aa..ddc9261 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -557,6 +557,22 @@ static void __blk_mq_complete_request(struct request *rq)
 	put_cpu();
 }
 
+static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
+{
+	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+		rcu_read_unlock();
+	else
+		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
+}
+
+static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
+{
+	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+		rcu_read_lock();
+	else
+		*srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
+}
+
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -1214,17 +1230,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	 */
 	WARN_ON_ONCE(in_interrupt());
 
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-		rcu_read_lock();
-		blk_mq_sched_dispatch_requests(hctx);
-		rcu_read_unlock();
-	} else {
-		might_sleep();
+	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
-		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-		blk_mq_sched_dispatch_requests(hctx);
-		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
-	}
+	hctx_lock(hctx, &srcu_idx);
+	blk_mq_sched_dispatch_requests(hctx);
+	hctx_unlock(hctx, srcu_idx);
 }
 
 /*
@@ -1296,17 +1306,10 @@ bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
 	 * quiesced.
 	 */
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-		rcu_read_lock();
-		need_run = !blk_queue_quiesced(hctx->queue) &&
-			blk_mq_hctx_has_pending(hctx);
-		rcu_read_unlock();
-	} else {
-		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-		need_run = !blk_queue_quiesced(hctx->queue) &&
-			blk_mq_hctx_has_pending(hctx);
-		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
-	}
+	hctx_lock(hctx, &srcu_idx);
+	need_run = !blk_queue_quiesced(hctx->queue) &&
+		blk_mq_hctx_has_pending(hctx);
+	hctx_unlock(hctx, srcu_idx);
 
 	if (need_run) {
 		__blk_mq_delay_run_hw_queue(hctx, async, 0);
@@ -1618,7 +1621,7 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
 
 static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 					struct request *rq,
-					blk_qc_t *cookie, bool may_sleep)
+					blk_qc_t *cookie)
 {
 	struct request_queue *q = rq->q;
 	struct blk_mq_queue_data bd = {
@@ -1668,25 +1671,20 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	}
 
 insert:
-	blk_mq_sched_insert_request(rq, false, run_queue, false, may_sleep);
+	blk_mq_sched_insert_request(rq, false, run_queue, false,
+					hctx->flags & BLK_MQ_F_BLOCKING);
 }
 
 static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, blk_qc_t *cookie)
 {
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-		rcu_read_lock();
-		__blk_mq_try_issue_directly(hctx, rq, cookie, false);
-		rcu_read_unlock();
-	} else {
-		unsigned int srcu_idx;
+	int srcu_idx;
 
-		might_sleep();
+	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
-		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-		__blk_mq_try_issue_directly(hctx, rq, cookie, true);
-		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
-	}
+	hctx_lock(hctx, &srcu_idx);
+	__blk_mq_try_issue_directly(hctx, rq, cookie);
+	hctx_unlock(hctx, srcu_idx);
 }
 
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
-- 
2.9.5


* [PATCH 2/8] blk-mq: protect completion path with RCU
  2018-01-09 16:29 [PATCHSET v5] blk-mq: reimplement timeout handling Tejun Heo
  2018-01-09 16:29 ` [PATCH 1/8] blk-mq: move hctx lock/unlock into a helper Tejun Heo
@ 2018-01-09 16:29 ` Tejun Heo
  2018-01-09 16:29 ` [PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme Tejun Heo
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2018-01-09 16:29 UTC (permalink / raw)
  To: jack, axboe, clm, jbacik
  Cc: kernel-team, linux-kernel, linux-btrfs, peterz, jianchao.w.wang,
	Bart.VanAssche, linux-block, Tejun Heo

Currently, blk-mq protects only the issue path with RCU.  This patch
puts the completion path under the same RCU protection.  This will be
used by later patches to synchronize issue/completion against the
timeout path; those patches will also add the relevant comments.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc9261..6741c3e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -584,11 +584,16 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
+	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
+
+	hctx_lock(hctx, &srcu_idx);
 	if (!blk_mark_rq_complete(rq))
 		__blk_mq_complete_request(rq);
+	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
-- 
2.9.5


* [PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme
  2018-01-09 16:29 [PATCHSET v5] blk-mq: reimplement timeout handling Tejun Heo
  2018-01-09 16:29 ` [PATCH 1/8] blk-mq: move hctx lock/unlock into a helper Tejun Heo
  2018-01-09 16:29 ` [PATCH 2/8] blk-mq: protect completion path with RCU Tejun Heo
@ 2018-01-09 16:29 ` Tejun Heo
  2018-01-09 16:29 ` [PATCH 4/8] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE Tejun Heo
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2018-01-09 16:29 UTC (permalink / raw)
  To: jack, axboe, clm, jbacik
  Cc: kernel-team, linux-kernel, linux-btrfs, peterz, jianchao.w.wang,
	Bart.VanAssche, linux-block, Tejun Heo, Christoph Hellwig

Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules.  Unfortunately, it contains quite a few holes.

There's a complex dance around REQ_ATOM_STARTED and REQ_ATOM_COMPLETE
between the issue/completion and timeout paths; however, they don't
have a synchronization point across request recycle instances and it
isn't clear what the barriers add.  blk_mq_check_expired() can easily
read STARTED from the N-2'th recycle instance, the deadline from the
N-1'th, and run blk_mark_rq_complete() against the Nth instance.

In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request.  If we induce a 5s delay before the
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly.  Nothing actually timed out; the
code just made the call on a recycled instance of a request and then
terminated a later instance long after the original instance had
finished.  The scenario isn't theoretical either.

This patch replaces the broken synchronization mechanism with an RCU
and generation number based one.

1. Each request has a u64 generation + state value, which can be
   updated only by the request owner.  Whenever a request becomes
   in-flight, the generation number gets bumped up too.  This provides
   the basis for the timeout path to distinguish different recycle
   instances of the request.

   Also, marking a request in-flight and setting its deadline are
   protected with a seqcount so that the timeout path can fetch both
   values coherently.

2. The timeout path fetches the generation, state and deadline.  If
   the verdict is timeout, it records the generation into a dedicated
   request abortion field and then waits for an RCU grace period.

3. The completion path is also protected by RCU (from the previous
   patch) and checks whether the current generation number and state
   match the abortion field.  If so, it skips completion.

4. The timeout path, after RCU wait, scans requests again and
   terminates the ones whose generation and state still match the ones
   requested for abortion.

   By now, the timeout path knows that it either lost the race (the
   generation number and state changed) or that any concurrent
   completion will yield to it, and can safely time out the request.

While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
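
As a rough, single-threaded illustration of steps 2-4 above (a sketch
only: the RCU grace-period wait, gstate_seq and u64_stats_sync
protection of the real code are elided, and the helper names are
invented for this example), the claiming logic reduces to comparing
the recorded abortion generation against the request's current one:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct req {
	uint64_t gstate;	  /* generation + state, updated only by the owner */
	uint64_t aborted_gstate;  /* generation recorded by the timeout path */
};

/* step 2: record the generation of the instance being aborted */
static void timeout_mark(struct req *rq)
{
	rq->aborted_gstate = rq->gstate;
}

/* step 3: the completion path yields if the timeout claimed this instance */
static bool try_complete(struct req *rq)
{
	if (rq->aborted_gstate == rq->gstate)
		return false;		/* lost the race to the timeout path */
	rq->gstate += 4;		/* complete + reissue: MQ_RQ_GEN_INC */
	return true;
}

/* step 4: after the RCU wait, abort only if the snapshot still matches */
static bool timeout_claim(struct req *rq)
{
	return rq->aborted_gstate == rq->gstate;
}
```

If a completion sneaks in between the mark and the claim, the
generation has moved on and the claim fails; otherwise the completion
path is guaranteed to have seen the abortion mark and yielded.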

While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places.  Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.

Note that this patch adds a comment explaining a race condition in
the BLK_EH_RESET_TIMER path.  The race has always been there and this
patch doesn't change it; the comment just documents the existing race.

v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
    - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
    - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

v3: - Fixed possible extended seqcount / u64_stats_sync read looping
      spotted by Peter.
    - MQ_RQ_IDLE was incorrectly being set in complete_request instead
      of free_request.  Fixed.

v4: - Rebased on top of hctx_lock() refactoring patch.
    - Added comment explaining the use of hctx_lock() in completion path.

v5: - Added comments requested by Bart.
    - Noted the BLK_EH_RESET_TIMER race condition in the commit
      message.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
---
 block/blk-core.c       |   2 +
 block/blk-mq.c         | 229 +++++++++++++++++++++++++++++++++----------------
 block/blk-mq.h         |  46 ++++++++++
 block/blk-timeout.c    |   2 +-
 block/blk.h            |   6 --
 include/linux/blk-mq.h |   1 +
 include/linux/blkdev.h |  23 +++++
 7 files changed, 230 insertions(+), 79 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2e0d041..f843ae4f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
 	rq->part = NULL;
+	seqcount_init(&rq->gstate_seq);
+	u64_stats_init(&rq->aborted_gstate_sync);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6741c3e..052fee5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -483,6 +483,7 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
+	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
 	clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
 	if (rq->tag != -1)
@@ -530,6 +531,8 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
+
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
 	if (rq->rq_flags & RQF_STATS) {
@@ -573,6 +576,36 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
 }
 
+static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
+{
+	unsigned long flags;
+
+	/*
+	 * blk_mq_rq_aborted_gstate() is used from the completion path and
+	 * can thus be called from irq context.  u64_stats_fetch in the
+	 * middle of update on the same CPU leads to lockup.  Disable irq
+	 * while updating.
+	 */
+	local_irq_save(flags);
+	u64_stats_update_begin(&rq->aborted_gstate_sync);
+	rq->aborted_gstate = gstate;
+	u64_stats_update_end(&rq->aborted_gstate_sync);
+	local_irq_restore(flags);
+}
+
+static u64 blk_mq_rq_aborted_gstate(struct request *rq)
+{
+	unsigned int start;
+	u64 aborted_gstate;
+
+	do {
+		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
+		aborted_gstate = rq->aborted_gstate;
+	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
+
+	return aborted_gstate;
+}
+
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -590,8 +623,20 @@ void blk_mq_complete_request(struct request *rq)
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
 
+	/*
+	 * If @rq->aborted_gstate equals the current instance, timeout is
+	 * claiming @rq and we lost.  This is synchronized through
+	 * hctx_lock().  See blk_mq_timeout_work() for details.
+	 *
+	 * Completion path never blocks and we can directly use RCU here
+	 * instead of hctx_lock() which can be either RCU or SRCU.
+	 * However, that would complicate paths which want to synchronize
+	 * against us.  Let's stay in sync with the issue path so that
+	 * hctx_lock() covers both issue and completion paths.
+	 */
 	hctx_lock(hctx, &srcu_idx);
-	if (!blk_mark_rq_complete(rq))
+	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
+	    !blk_mark_rq_complete(rq))
 		__blk_mq_complete_request(rq);
 	hctx_unlock(hctx, srcu_idx);
 }
@@ -617,34 +662,32 @@ void blk_mq_start_request(struct request *rq)
 		wbt_issue(q->rq_wb, &rq->issue_stat);
 	}
 
-	blk_add_timer(rq);
-
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 	WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
 
 	/*
-	 * Mark us as started and clear complete. Complete might have been
-	 * set if requeue raced with timeout, which then marked it as
-	 * complete. So be sure to clear complete again when we start
-	 * the request, otherwise we'll ignore the completion event.
+	 * Mark @rq in-flight which also advances the generation number,
+	 * and register for timeout.  Protect with a seqcount to allow the
+	 * timeout path to read both @rq->gstate and @rq->deadline
+	 * coherently.
 	 *
-	 * Ensure that ->deadline is visible before we set STARTED, such that
-	 * blk_mq_check_expired() is guaranteed to observe our ->deadline when
-	 * it observes STARTED.
+	 * This is the only place where a request is marked in-flight.  If
+	 * the timeout path reads an in-flight @rq->gstate, the
+	 * @rq->deadline it reads together under @rq->gstate_seq is
+	 * guaranteed to be the matching one.
 	 */
-	smp_wmb();
+	preempt_disable();
+	write_seqcount_begin(&rq->gstate_seq);
+
+	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
+	blk_add_timer(rq);
+
+	write_seqcount_end(&rq->gstate_seq);
+	preempt_enable();
+
 	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
-		/*
-		 * Coherence order guarantees these consecutive stores to a
-		 * single variable propagate in the specified order. Thus the
-		 * clear_bit() is ordered _after_ the set bit. See
-		 * blk_mq_check_expired().
-		 *
-		 * (the bits must be part of the same byte for this to be
-		 * true).
-		 */
+	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
 		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
-	}
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -677,6 +720,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 	blk_mq_sched_requeue_request(rq);
 
 	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
+		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
@@ -774,6 +818,7 @@ EXPORT_SYMBOL(blk_mq_tag_to_rq);
 struct blk_mq_timeout_data {
 	unsigned long next;
 	unsigned int next_set;
+	unsigned int nr_expired;
 };
 
 void blk_mq_rq_timed_out(struct request *req, bool reserved)
@@ -801,6 +846,12 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
+		/*
+		 * As nothing prevents completion from happening while
+		 * ->aborted_gstate is set, this may lead to ignored
+		 * completions and further spurious timeouts.
+		 */
+		blk_mq_rq_update_aborted_gstate(req, 0);
 		blk_add_timer(req);
 		blk_clear_rq_complete(req);
 		break;
@@ -816,50 +867,51 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	unsigned long deadline;
+	unsigned long gstate, deadline;
+	int start;
+
+	might_sleep();
 
 	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
 		return;
 
-	/*
-	 * Ensures that if we see STARTED we must also see our
-	 * up-to-date deadline, see blk_mq_start_request().
-	 */
-	smp_rmb();
-
-	deadline = READ_ONCE(rq->deadline);
+	/* read coherent snapshots of @rq->gstate and @rq->deadline */
+	while (true) {
+		start = read_seqcount_begin(&rq->gstate_seq);
+		gstate = READ_ONCE(rq->gstate);
+		deadline = rq->deadline;
+		if (!read_seqcount_retry(&rq->gstate_seq, start))
+			break;
+		cond_resched();
+	}
 
-	/*
-	 * The rq being checked may have been freed and reallocated
-	 * out already here, we avoid this race by checking rq->deadline
-	 * and REQ_ATOM_COMPLETE flag together:
-	 *
-	 * - if rq->deadline is observed as new value because of
-	 *   reusing, the rq won't be timed out because of timing.
-	 * - if rq->deadline is observed as previous value,
-	 *   REQ_ATOM_COMPLETE flag won't be cleared in reuse path
-	 *   because we put a barrier between setting rq->deadline
-	 *   and clearing the flag in blk_mq_start_request(), so
-	 *   this rq won't be timed out too.
-	 */
-	if (time_after_eq(jiffies, deadline)) {
-		if (!blk_mark_rq_complete(rq)) {
-			/*
-			 * Again coherence order ensures that consecutive reads
-			 * from the same variable must be in that order. This
-			 * ensures that if we see COMPLETE clear, we must then
-			 * see STARTED set and we'll ignore this timeout.
-			 *
-			 * (There's also the MB implied by the test_and_clear())
-			 */
-			blk_mq_rq_timed_out(rq, reserved);
-		}
+	/* if in-flight && overdue, mark for abortion */
+	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
+	    time_after_eq(jiffies, deadline)) {
+		blk_mq_rq_update_aborted_gstate(rq, gstate);
+		data->nr_expired++;
+		hctx->nr_expired++;
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
 }
 
+static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
+	 * completions that we lost to, they would have finished and
+	 * updated @rq->gstate by now; otherwise, the completion path is
+	 * now guaranteed to see @rq->aborted_gstate and yield.  If
+	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
+	 */
+	if (READ_ONCE(rq->gstate) == rq->aborted_gstate &&
+	    !blk_mark_rq_complete(rq))
+		blk_mq_rq_timed_out(rq, reserved);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
@@ -867,7 +919,9 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	struct blk_mq_timeout_data data = {
 		.next		= 0,
 		.next_set	= 0,
+		.nr_expired	= 0,
 	};
+	struct blk_mq_hw_ctx *hctx;
 	int i;
 
 	/* A deadlock might occur if a request is stuck requiring a
@@ -886,14 +940,40 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
+	/* scan for the expired ones and set their ->aborted_gstate */
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
+	if (data.nr_expired) {
+		bool has_rcu = false;
+
+		/*
+		 * Wait till everyone sees ->aborted_gstate.  The
+		 * sequential waits for SRCUs aren't ideal.  If this ever
+		 * becomes a problem, we can add per-hw_ctx rcu_head and
+		 * wait in parallel.
+		 */
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (!hctx->nr_expired)
+				continue;
+
+			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+				has_rcu = true;
+			else
+				synchronize_srcu(hctx->queue_rq_srcu);
+
+			hctx->nr_expired = 0;
+		}
+		if (has_rcu)
+			synchronize_rcu();
+
+		/* terminate the ones we won */
+		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
+	}
+
 	if (data.next_set) {
 		data.next = blk_rq_timeout(round_jiffies_up(data.next));
 		mod_timer(&q->timeout, data.next);
 	} else {
-		struct blk_mq_hw_ctx *hctx;
-
 		queue_for_each_hw_ctx(q, hctx, i) {
 			/* the hctx may be unmapped, so check it here */
 			if (blk_mq_hw_queue_mapped(hctx))
@@ -1893,6 +1973,22 @@ static size_t order_to_size(unsigned int order)
 	return (size_t)PAGE_SIZE << order;
 }
 
+static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
+			       unsigned int hctx_idx, int node)
+{
+	int ret;
+
+	if (set->ops->init_request) {
+		ret = set->ops->init_request(set, rq, hctx_idx, node);
+		if (ret)
+			return ret;
+	}
+
+	seqcount_init(&rq->gstate_seq);
+	u64_stats_init(&rq->aborted_gstate_sync);
+	return 0;
+}
+
 int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx, unsigned int depth)
 {
@@ -1954,12 +2050,9 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 			struct request *rq = p;
 
 			tags->static_rqs[i] = rq;
-			if (set->ops->init_request) {
-				if (set->ops->init_request(set, rq, hctx_idx,
-						node)) {
-					tags->static_rqs[i] = NULL;
-					goto fail;
-				}
+			if (blk_mq_init_request(set, rq, hctx_idx, node)) {
+				tags->static_rqs[i] = NULL;
+				goto fail;
 			}
 
 			p += rq_size;
@@ -2098,9 +2191,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	if (!hctx->fq)
 		goto sched_exit_hctx;
 
-	if (set->ops->init_request &&
-	    set->ops->init_request(set, hctx->fq->flush_rq, hctx_idx,
-				   node))
+	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
 		goto free_fq;
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
@@ -3018,12 +3109,6 @@ static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
 
 static int __init blk_mq_init(void)
 {
-	/*
-	 * See comment in block/blk.h rq_atomic_flags enum
-	 */
-	BUILD_BUG_ON((REQ_ATOM_STARTED / BITS_PER_BYTE) !=
-			(REQ_ATOM_COMPLETE / BITS_PER_BYTE));
-
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
 	return 0;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6c7c3ff..cf01f6f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -27,6 +27,19 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
+/*
+ * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
+ * and the upper bits the generation number.
+ */
+enum mq_rq_state {
+	MQ_RQ_IDLE		= 0,
+	MQ_RQ_IN_FLIGHT		= 1,
+
+	MQ_RQ_STATE_BITS	= 2,
+	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
+	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
+};
+
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
@@ -85,6 +98,39 @@ extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
 
 void blk_mq_release(struct request_queue *q);
 
+/**
+ * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
+ * @rq: target request.
+ */
+static inline int blk_mq_rq_state(struct request *rq)
+{
+	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+}
+
+/**
+ * blk_mq_rq_update_state() - set the current MQ_RQ_* state of a request
+ * @rq: target request.
+ * @state: new state to set.
+ *
+ * Set @rq's state to @state.  The caller is responsible for ensuring that
+ * there are no other updaters.  A request can transition into IN_FLIGHT
+ * only from IDLE and doing so increments the generation number.
+ */
+static inline void blk_mq_rq_update_state(struct request *rq,
+					  enum mq_rq_state state)
+{
+	u64 old_val = READ_ONCE(rq->gstate);
+	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
+
+	if (state == MQ_RQ_IN_FLIGHT) {
+		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
+		new_val += MQ_RQ_GEN_INC;
+	}
+
+	/* avoid exposing interim values */
+	WRITE_ONCE(rq->gstate, new_val);
+}
+
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
 					   unsigned int cpu)
 {
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 764ecf9..6427be7 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -208,7 +208,7 @@ void blk_add_timer(struct request *req)
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
-	WRITE_ONCE(req->deadline, jiffies + req->timeout);
+	req->deadline = jiffies + req->timeout;
 
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
diff --git a/block/blk.h b/block/blk.h
index 3f14469..9cb2739 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -123,12 +123,6 @@ void blk_account_io_done(struct request *req);
  * Internal atomic flags for request handling
  */
 enum rq_atomic_flags {
-	/*
-	 * Keep these two bits first - not because we depend on the
-	 * value of them, but we do depend on them being in the same
-	 * byte of storage to ensure ordering on writes. Keeping them
-	 * first will achieve that nicely.
-	 */
 	REQ_ATOM_COMPLETE = 0,
 	REQ_ATOM_STARTED,
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 95c9a5c..460798d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -51,6 +51,7 @@ struct blk_mq_hw_ctx {
 	unsigned int		queue_num;
 
 	atomic_t		nr_active;
+	unsigned int		nr_expired;
 
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 46e606f..ae563d0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,6 +27,8 @@
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
+#include <linux/seqlock.h>
+#include <linux/u64_stats_sync.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -230,6 +232,27 @@ struct request {
 
 	unsigned short write_hint;
 
+	/*
+	 * On blk-mq, the lower bits of ->gstate (generation number and
+	 * state) carry the MQ_RQ_* state value and the upper bits the
+	 * generation number which is monotonically incremented and used to
+	 * distinguish the reuse instances.
+	 *
+	 * ->gstate_seq allows updates to ->gstate and other fields
+	 * (currently ->deadline) during request start to be read
+	 * atomically from the timeout path, so that it can operate on a
+	 * coherent set of information.
+	 */
+	seqcount_t gstate_seq;
+	u64 gstate;
+
+	/*
+	 * ->aborted_gstate is used by the timeout to claim a specific
+	 * recycle instance of this request.  See blk_mq_timeout_work().
+	 */
+	struct u64_stats_sync aborted_gstate_sync;
+	u64 aborted_gstate;
+
 	unsigned long deadline;
 	struct list_head timeout_list;
 
-- 
2.9.5


* [PATCH 4/8] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE
  2018-01-09 16:29 [PATCHSET v5] blk-mq: reimplement timeout handling Tejun Heo
                   ` (2 preceding siblings ...)
  2018-01-09 16:29 ` [PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme Tejun Heo
@ 2018-01-09 16:29 ` Tejun Heo
  2018-01-09 16:29 ` [PATCH 5/8] blk-mq: make blk_abort_request() trigger timeout path Tejun Heo
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2018-01-09 16:29 UTC (permalink / raw)
  To: jack, axboe, clm, jbacik
  Cc: kernel-team, linux-kernel, linux-btrfs, peterz, jianchao.w.wang,
	Bart.VanAssche, linux-block, Tejun Heo

blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test
REQ_ATOM_COMPLETE to determine the request state.  Both uses are
speculative, and testing REQ_ATOM_STARTED together with
blk_mq_rq_state() yields equivalent results.  Replace the tests.
This will allow removing the REQ_ATOM_COMPLETE usages from blk-mq.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 052fee5..51e9704 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -95,8 +95,7 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
 {
 	struct mq_inflight *mi = priv;
 
-	if (test_bit(REQ_ATOM_STARTED, &rq->atomic_flags) &&
-	    !test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
+	if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT) {
 		/*
 		 * index[0] counts the specific partition that was asked
 		 * for. index[1] counts the ones that are active on the
@@ -3023,7 +3022,8 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
 
 	hrtimer_init_sleeper(&hs, current);
 	do {
-		if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
+		if (test_bit(REQ_ATOM_STARTED, &rq->atomic_flags) &&
+		    blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
 			break;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		hrtimer_start_expires(&hs.timer, mode);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 5/8] blk-mq: make blk_abort_request() trigger timeout path
  2018-01-09 16:29 [PATCHSET v5] blk-mq: reimplement timeout handling Tejun Heo
                   ` (3 preceding siblings ...)
  2018-01-09 16:29 ` [PATCH 4/8] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE Tejun Heo
@ 2018-01-09 16:29 ` Tejun Heo
  2018-01-09 16:29 ` [PATCH 6/8] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq Tejun Heo
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2018-01-09 16:29 UTC (permalink / raw)
  To: jack, axboe, clm, jbacik
  Cc: kernel-team, linux-kernel, linux-btrfs, peterz, jianchao.w.wang,
	Bart.VanAssche, linux-block, Tejun Heo, Asai Thambi SP,
	Stefan Haberland, Jan Hoeppner

With issue/complete and timeout paths now using the generation number
and state based synchronization, blk_abort_request() is the only one
which depends on REQ_ATOM_COMPLETE for arbitrating completion.

There's no reason for blk_abort_request() to be a completely separate
path.  This patch makes blk_abort_request() piggyback on the timeout
path instead of trying to terminate the request directly.

This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.

Note that this makes blk_abort_request() asynchronous - it initiates
the abort, but the actual termination happens a short while later,
even when the caller owns the request.  AFAICS, SCSI and ATA should
be fine with that, and mtip32xx and dasd look safe as well, though
I'm not completely sure.  It'd be great if people who know those
drivers could take a look.

v2: - Add comment explaining the lack of synchronization around
      ->deadline update as requested by Bart.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi SP <asamymuthupa@micron.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Jan Hoeppner <hoeppner@linux.vnet.ibm.com>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
---
 block/blk-mq.c      |  2 +-
 block/blk-mq.h      |  2 --
 block/blk-timeout.c | 13 +++++++++----
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 51e9704..b419746 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -820,7 +820,7 @@ struct blk_mq_timeout_data {
 	unsigned int nr_expired;
 };
 
-void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index cf01f6f..6b2d616 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -94,8 +94,6 @@ extern int blk_mq_sysfs_register(struct request_queue *q);
 extern void blk_mq_sysfs_unregister(struct request_queue *q);
 extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
-extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
-
 void blk_mq_release(struct request_queue *q);
 
 /**
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 6427be7..4f04cd1 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -156,12 +156,17 @@ void blk_timeout_work(struct work_struct *work)
  */
 void blk_abort_request(struct request *req)
 {
-	if (blk_mark_rq_complete(req))
-		return;
-
 	if (req->q->mq_ops) {
-		blk_mq_rq_timed_out(req, false);
+		/*
+		 * All we need to ensure is that timeout scan takes place
+		 * immediately and that scan sees the new timeout value.
+		 * No need for fancy synchronizations.
+		 */
+		req->deadline = jiffies;
+		mod_timer(&req->q->timeout, 0);
 	} else {
+		if (blk_mark_rq_complete(req))
+			return;
 		blk_delete_timer(req);
 		blk_rq_timed_out(req);
 	}
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 6/8] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq
  2018-01-09 16:29 [PATCHSET v5] blk-mq: reimplement timeout handling Tejun Heo
                   ` (4 preceding siblings ...)
  2018-01-09 16:29 ` [PATCH 5/8] blk-mq: make blk_abort_request() trigger timeout path Tejun Heo
@ 2018-01-09 16:29 ` Tejun Heo
  2018-01-09 16:29 ` [PATCH 7/8] blk-mq: remove REQ_ATOM_STARTED Tejun Heo
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2018-01-09 16:29 UTC (permalink / raw)
  To: jack, axboe, clm, jbacik
  Cc: kernel-team, linux-kernel, linux-btrfs, peterz, jianchao.w.wang,
	Bart.VanAssche, linux-block, Tejun Heo

After the recent updates to use generation number and state based
synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
to avoid firing the same timeout multiple times.

Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag,
RQF_MQ_TIMEOUT_EXPIRED, for that purpose instead.  This also removes
atomic bitops from the hot paths.

v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().

v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
---
 block/blk-mq.c         | 15 +++++++--------
 block/blk-timeout.c    |  1 +
 include/linux/blkdev.h |  2 ++
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b419746..3f297f0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -634,8 +634,7 @@ void blk_mq_complete_request(struct request *rq)
 	 * hctx_lock() covers both issue and completion paths.
 	 */
 	hctx_lock(hctx, &srcu_idx);
-	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
-	    !blk_mark_rq_complete(rq))
+	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 		__blk_mq_complete_request(rq);
 	hctx_unlock(hctx, srcu_idx);
 }
@@ -685,8 +684,6 @@ void blk_mq_start_request(struct request *rq)
 	preempt_enable();
 
 	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
-		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -837,6 +834,8 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	if (!test_bit(REQ_ATOM_STARTED, &req->atomic_flags))
 		return;
 
+	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
+
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
@@ -852,7 +851,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		 */
 		blk_mq_rq_update_aborted_gstate(req, 0);
 		blk_add_timer(req);
-		blk_clear_rq_complete(req);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -871,7 +869,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 
 	might_sleep();
 
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+	if ((rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) ||
+	    !test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
 		return;
 
 	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
@@ -906,8 +905,8 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 	 * now guaranteed to see @rq->aborted_gstate and yield.  If
 	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
 	 */
-	if (READ_ONCE(rq->gstate) == rq->aborted_gstate &&
-	    !blk_mark_rq_complete(rq))
+	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
+	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
 		blk_mq_rq_timed_out(rq, reserved);
 }
 
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 4f04cd1..ebe9996 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -214,6 +214,7 @@ void blk_add_timer(struct request *req)
 		req->timeout = q->rq_timeout;
 
 	req->deadline = jiffies + req->timeout;
+	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
 
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ae563d0..007a7cf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -125,6 +125,8 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
+/* timeout is expired */
+#define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 7/8] blk-mq: remove REQ_ATOM_STARTED
  2018-01-09 16:29 [PATCHSET v5] blk-mq: reimplement timeout handling Tejun Heo
                   ` (5 preceding siblings ...)
  2018-01-09 16:29 ` [PATCH 6/8] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq Tejun Heo
@ 2018-01-09 16:29 ` Tejun Heo
  2018-01-09 16:29 ` [PATCH 8/8] blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu Tejun Heo
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2018-01-09 16:29 UTC (permalink / raw)
  To: jack, axboe, clm, jbacik
  Cc: kernel-team, linux-kernel, linux-btrfs, peterz, jianchao.w.wang,
	Bart.VanAssche, linux-block, Tejun Heo

After the recent updates to use generation number and state based
synchronization, we can easily replace the REQ_ATOM_STARTED usages by
adding an extra state that distinguishes requests which are completed
but not yet freed.

Add MQ_RQ_COMPLETE and replace the REQ_ATOM_STARTED usages with
blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
left and is removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq-debugfs.c |  4 +---
 block/blk-mq.c         | 37 ++++++++-----------------------------
 block/blk-mq.h         |  1 +
 block/blk.h            |  1 -
 4 files changed, 10 insertions(+), 33 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b56a4f3..8adc837 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -271,7 +271,6 @@ static const char *const cmd_flag_name[] = {
 #define RQF_NAME(name) [ilog2((__force u32)RQF_##name)] = #name
 static const char *const rqf_name[] = {
 	RQF_NAME(SORTED),
-	RQF_NAME(STARTED),
 	RQF_NAME(QUEUED),
 	RQF_NAME(SOFTBARRIER),
 	RQF_NAME(FLUSH_SEQ),
@@ -295,7 +294,6 @@ static const char *const rqf_name[] = {
 #define RQAF_NAME(name) [REQ_ATOM_##name] = #name
 static const char *const rqaf_name[] = {
 	RQAF_NAME(COMPLETE),
-	RQAF_NAME(STARTED),
 	RQAF_NAME(POLL_SLEPT),
 };
 #undef RQAF_NAME
@@ -409,7 +407,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
 	const struct show_busy_params *params = data;
 
 	if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-	    test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+	    blk_mq_rq_state(rq) != MQ_RQ_IDLE)
 		__blk_mq_debugfs_rq_show(params->m,
 					 list_entry_rq(&rq->queuelist));
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f297f0..e6e74c0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -483,7 +483,6 @@ void blk_mq_free_request(struct request *rq)
 		blk_put_rl(blk_rq_rl(rq));
 
 	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
-	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
 	clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
@@ -531,6 +530,7 @@ static void __blk_mq_complete_request(struct request *rq)
 	int cpu;
 
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
+	blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
 
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
@@ -642,7 +642,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
 
 int blk_mq_request_started(struct request *rq)
 {
-	return test_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
 }
 EXPORT_SYMBOL_GPL(blk_mq_request_started);
 
@@ -661,7 +661,6 @@ void blk_mq_start_request(struct request *rq)
 	}
 
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
-	WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
 
 	/*
 	 * Mark @rq in-flight which also advances the generation number,
@@ -683,8 +682,6 @@ void blk_mq_start_request(struct request *rq)
 	write_seqcount_end(&rq->gstate_seq);
 	preempt_enable();
 
-	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
 		 * Make sure space for the drain appears.  We know we can do
@@ -697,13 +694,9 @@ void blk_mq_start_request(struct request *rq)
 EXPORT_SYMBOL(blk_mq_start_request);
 
 /*
- * When we reach here because queue is busy, REQ_ATOM_COMPLETE
- * flag isn't set yet, so there may be race with timeout handler,
- * but given rq->deadline is just set in .queue_rq() under
- * this situation, the race won't be possible in reality because
- * rq->timeout should be set as big enough to cover the window
- * between blk_mq_start_request() called from .queue_rq() and
- * clearing REQ_ATOM_STARTED here.
+ * When we reach here because queue is busy, it's safe to change the state
+ * to IDLE without checking @rq->aborted_gstate because we should still be
+ * holding the RCU read lock and thus protected against timeout.
  */
 static void __blk_mq_requeue_request(struct request *rq)
 {
@@ -715,7 +708,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
 	blk_mq_sched_requeue_request(rq);
 
-	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
+	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
 		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
@@ -822,18 +815,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	/*
-	 * We know that complete is set at this point. If STARTED isn't set
-	 * anymore, then the request isn't active and the "timeout" should
-	 * just be ignored. This can happen due to the bitflag ordering.
-	 * Timeout first checks if STARTED is set, and if it is, assumes
-	 * the request is active. But if we race with completion, then
-	 * both flags will get cleared. So check here again, and ignore
-	 * a timeout event with a request that isn't active.
-	 */
-	if (!test_bit(REQ_ATOM_STARTED, &req->atomic_flags))
-		return;
-
 	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
 
 	if (ops->timeout)
@@ -869,8 +850,7 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 
 	might_sleep();
 
-	if ((rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) ||
-	    !test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+	if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
 		return;
 
 	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
@@ -3021,8 +3001,7 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
 
 	hrtimer_init_sleeper(&hs, current);
 	do {
-		if (test_bit(REQ_ATOM_STARTED, &rq->atomic_flags) &&
-		    blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
+		if (blk_mq_rq_state(rq) == MQ_RQ_COMPLETE)
 			break;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		hrtimer_start_expires(&hs.timer, mode);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6b2d616..8591a54 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -34,6 +34,7 @@ struct blk_mq_ctx {
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
+	MQ_RQ_COMPLETE		= 2,
 
 	MQ_RQ_STATE_BITS	= 2,
 	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
diff --git a/block/blk.h b/block/blk.h
index 9cb2739..a68dbe3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -124,7 +124,6 @@ void blk_account_io_done(struct request *req);
  */
 enum rq_atomic_flags {
 	REQ_ATOM_COMPLETE = 0,
-	REQ_ATOM_STARTED,
 
 	REQ_ATOM_POLL_SLEPT,
 };
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 8/8] blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu
  2018-01-09 16:29 [PATCHSET v5] blk-mq: reimplement timeout handling Tejun Heo
                   ` (6 preceding siblings ...)
  2018-01-09 16:29 ` [PATCH 7/8] blk-mq: remove REQ_ATOM_STARTED Tejun Heo
@ 2018-01-09 16:29 ` Tejun Heo
  2018-01-09 16:53 ` [PATCHSET v5] blk-mq: reimplement timeout handling Jens Axboe
  2018-01-12 20:57   ` Bart Van Assche
  9 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2018-01-09 16:29 UTC (permalink / raw)
  To: jack, axboe, clm, jbacik
  Cc: kernel-team, linux-kernel, linux-btrfs, peterz, jianchao.w.wang,
	Bart.VanAssche, linux-block, Tejun Heo

The RCU protection has been expanded to cover both queueing and
completion paths, making ->queue_rq_srcu a misnomer.  Rename it to
->srcu, as suggested by Bart.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
---
 block/blk-mq.c         | 14 +++++++-------
 include/linux/blk-mq.h |  2 +-
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e6e74c0..5fb25e6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -219,7 +219,7 @@ void blk_mq_quiesce_queue(struct request_queue *q)
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (hctx->flags & BLK_MQ_F_BLOCKING)
-			synchronize_srcu(hctx->queue_rq_srcu);
+			synchronize_srcu(hctx->srcu);
 		else
 			rcu = true;
 	}
@@ -564,7 +564,7 @@ static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
 	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
 		rcu_read_unlock();
 	else
-		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
+		srcu_read_unlock(hctx->srcu, srcu_idx);
 }
 
 static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
@@ -572,7 +572,7 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
 		rcu_read_lock();
 	else
-		*srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
+		*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
 static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
@@ -937,7 +937,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
 			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
 				has_rcu = true;
 			else
-				synchronize_srcu(hctx->queue_rq_srcu);
+				synchronize_srcu(hctx->srcu);
 
 			hctx->nr_expired = 0;
 		}
@@ -2100,7 +2100,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 		set->ops->exit_hctx(hctx, hctx_idx);
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		cleanup_srcu_struct(hctx->queue_rq_srcu);
+		cleanup_srcu_struct(hctx->srcu);
 
 	blk_mq_remove_cpuhp(hctx);
 	blk_free_flush_queue(hctx->fq);
@@ -2173,7 +2173,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 		goto free_fq;
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		init_srcu_struct(hctx->queue_rq_srcu);
+		init_srcu_struct(hctx->srcu);
 
 	blk_mq_debugfs_register_hctx(q, hctx);
 
@@ -2462,7 +2462,7 @@ static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
 {
 	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
 
-	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, queue_rq_srcu),
+	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
 			   __alignof__(struct blk_mq_hw_ctx)) !=
 		     sizeof(struct blk_mq_hw_ctx));
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 460798d..8efcf49 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -66,7 +66,7 @@ struct blk_mq_hw_ctx {
 #endif
 
 	/* Must be the last member - see also blk_mq_hw_ctx_size(). */
-	struct srcu_struct	queue_rq_srcu[0];
+	struct srcu_struct	srcu[0];
 };
 
 struct blk_mq_tag_set {
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCHSET v5] blk-mq: reimplement timeout handling
  2018-01-09 16:29 [PATCHSET v5] blk-mq: reimplement timeout handling Tejun Heo
                   ` (7 preceding siblings ...)
  2018-01-09 16:29 ` [PATCH 8/8] blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu Tejun Heo
@ 2018-01-09 16:53 ` Jens Axboe
  2018-01-12 20:57   ` Bart Van Assche
  9 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2018-01-09 16:53 UTC (permalink / raw)
  To: Tejun Heo, jack, clm, jbacik
  Cc: kernel-team, linux-kernel, linux-btrfs, peterz, jianchao.w.wang,
	Bart.VanAssche, linux-block

On 1/9/18 9:29 AM, Tejun Heo wrote:
> Hello,
> 
> Changes from [v4]
> 
> - Comments added.  Patch description updated.
> 
> Changes from [v3]
> 
> - Rebased on top of for-4.16/block.
> 
> - Integrated Jens's hctx_[un]lock() factoring patch and refreshed the
>   patches accordingly.
> 
> - Added comment explaining the use of hctx_lock() instead of
>   rcu_read_lock() in completion path.
> 
> Changes from [v2]
> 
> - Possible extended looping around seqcount and u64_stat_sync fixed.
> 
> - Misplaced MQ_RQ_IDLE state setting fixed.
> 
> - RQF_MQ_TIMEOUT_EXPIRED added to prevent firing the same timeout
>   multiple times.
> 
> - s/queue_rq_src/srcu/ patch added.
> 
> - Other misc changes.
> 
> Changes from [v1]
> 
> - BLK_EH_RESET_TIMER handling fixed.
> 
> - s/request->gstate_seqc/request->gstate_seq/
> 
> - READ_ONCE() added to blk_mq_rq_udpate_state().
> 
> - Removed left over blk_clear_rq_complete() invocation from
>   blk_mq_rq_timed_out().
> 
> Currently, blk-mq timeout path synchronizes against the usual
> issue/completion path using a complex scheme involving atomic
> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> rules.  Unfortunatley, it contains quite a few holes.
> 
> It's pretty easy to make blk_mq_check_expired() terminate a later
> instance of a request.  If we induce 5 sec delay before
> time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
> 2s, and issue back-to-back large IOs, blk-mq starts timing out
> requests spuriously pretty quickly.  Nothing actually timed out.  It
> just made the call on a recycle instance of a request and then
> terminated a later instance long after the original instance finished.
> The scenario isn't theoretical either.
> 
> This patchset replaces the broken synchronization mechanism with a RCU
> and generation number based one.  Please read the patch description of
> the second path for more details.

Applied for 4.16, and added a patch to silence the false-positive
srcu_idx warning for hctx_unlock() that got reported.

This grows the request a bit, which sucks, but it's probably
unavoidable.  There's some room for improvement with a hole or two,
however.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 25+ messages in thread


* Re: [PATCHSET v5] blk-mq: reimplement timeout handling
@ 2018-01-12 20:57   ` Bart Van Assche
  0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2018-01-12 20:57 UTC (permalink / raw)
  To: jbacik, tj, jack, clm, axboe
  Cc: kernel-team, linux-kernel, peterz, linux-btrfs, linux-block,
	jianchao.w.wang

On Tue, 2018-01-09 at 08:29 -0800, Tejun Heo wrote:
> Currently, blk-mq timeout path synchronizes against the usual
> issue/completion path using a complex scheme involving atomic
> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> rules.  Unfortunatley, it contains quite a few holes.

Hello Tejun,

With this patch series applied I see weird hangs in blk_mq_get_tag() when I
run the srp-test software. If I pull Jens' latest for-next branch and revert
this patch series, the srp-test software runs successfully. Note: if you
don't have InfiniBand hardware available, you will need the RDMA/CM
patches for the SRP initiator and target drivers that were recently posted
on the linux-rdma mailing list in order to run the srp-test software.

This is how I run the srp-test software in a VM:

./run_tests -c -d -r 10

Here is an example of what SysRq-w reported when the hang occurred:

sysrq: SysRq : Show Blocked State
 task                        PC stack   pid father
kworker/u8:0    D12864     5      2 0x80000000
Workqueue: events_unbound sd_probe_async [sd_mod]
Call Trace:
? __schedule+0x2b4/0xbb0
schedule+0x2d/0x90
io_schedule+0xd/0x30
blk_mq_get_tag+0x169/0x290
? finish_wait+0x80/0x80
blk_mq_get_request+0x16a/0x4f0
blk_mq_alloc_request+0x59/0xc0
blk_get_request_flags+0x3f/0x260
scsi_execute+0x33/0x1e0 [scsi_mod]
read_capacity_16.part.35+0x9c/0x460 [sd_mod]
sd_revalidate_disk+0x14bb/0x1cb0 [sd_mod]
sd_probe_async+0xf2/0x1a0 [sd_mod]
process_one_work+0x21c/0x6d0
worker_thread+0x35/0x380
? process_one_work+0x6d0/0x6d0
kthread+0x117/0x130
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x24/0x30
systemd-udevd   D13672  1048    285 0x00000100
Call Trace:
? __schedule+0x2b4/0xbb0
schedule+0x2d/0x90
io_schedule+0xd/0x30
generic_file_read_iter+0x32f/0x970
? page_cache_tree_insert+0x100/0x100
__vfs_read+0xcc/0x120
vfs_read+0x96/0x140
SyS_read+0x40/0xa0
do_syscall_64+0x5f/0x1b0
entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f8ce6d08d11
RSP: 002b:00007fff96dec288 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 00005651de7f6e10 RCX: 00007f8ce6d08d11
RDX: 0000000000000040 RSI: 00005651de7f6e38 RDI: 0000000000000007
RBP: 00005651de7ea500 R08: 00007f8ce6cf1c20 R09: 00005651de7f6e10
R10: 000000000000006f R11: 0000000000000246 R12: 0000000001ff0000
R13: 0000000001ff0040 R14: 00005651de7ea550 R15: 0000000000000040
systemd-udevd   D13496  1049    285 0x00000100
Call Trace:
? __schedule+0x2b4/0xbb0
schedule+0x2d/0x90
io_schedule+0xd/0x30
blk_mq_get_tag+0x169/0x290
? finish_wait+0x80/0x80
blk_mq_get_request+0x16a/0x4f0
blk_mq_make_request+0x105/0x8e0
? generic_make_request+0xd6/0x3d0
generic_make_request+0x103/0x3d0
? submit_bio+0x57/0x110
submit_bio+0x57/0x110
mpage_readpages+0x13b/0x160
? I_BDEV+0x10/0x10
? rcu_read_lock_sched_held+0x66/0x70
? __alloc_pages_nodemask+0x2e8/0x360
__do_page_cache_readahead+0x2a4/0x370
? force_page_cache_readahead+0xaf/0x110
force_page_cache_readahead+0xaf/0x110
generic_file_read_iter+0x743/0x970
? find_held_lock+0x2d/0x90
? _raw_spin_unlock+0x29/0x40
__vfs_read+0xcc/0x120
vfs_read+0x96/0x140
SyS_read+0x40/0xa0
do_syscall_64+0x5f/0x1b0
entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f8ce6d08d11
RSP: 002b:00007fff96dec8b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 00007f8ce7085010 RCX: 00007f8ce6d08d11
RDX: 0000000000040000 RSI: 00007f8ce7085038 RDI: 000000000000000f
RBP: 00005651de7ec840 R08: ffffffffffffffff R09: 00007f8ce7085010
R10: 00007f8ce7085028 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000040000 R14: 00005651de7ec890 R15: 0000000000040000
systemd-udevd   D13672  1055    285 0x00000100
Call Trace:
? __schedule+0x2b4/0xbb0
schedule+0x2d/0x90
io_schedule+0xd/0x30
blk_mq_get_tag+0x169/0x290
? finish_wait+0x80/0x80
blk_mq_get_request+0x16a/0x4f0
blk_mq_make_request+0x105/0x8e0
? generic_make_request+0xd6/0x3d0
generic_make_request+0x103/0x3d0
? submit_bio+0x57/0x110
submit_bio+0x57/0x110
mpage_readpages+0x13b/0x160
? I_BDEV+0x10/0x10
? rcu_read_lock_sched_held+0x66/0x70
? __alloc_pages_nodemask+0x2e8/0x360
__do_page_cache_readahead+0x2a4/0x370
? force_page_cache_readahead+0xaf/0x110
force_page_cache_readahead+0xaf/0x110
generic_file_read_iter+0x743/0x970
__vfs_read+0xcc/0x120
vfs_read+0x96/0x140
SyS_read+0x40/0xa0
do_syscall_64+0x5f/0x1b0
entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f8ce6d08d11
RSP: 002b:00007fff96dec848 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 00005651de7ec300 RCX: 00007f8ce6d08d11
RDX: 0000000000000100 RSI: 00005651de7ec328 RDI: 000000000000000f
RBP: 00005651de7ea500 R08: 0000000000000000 R09: 00005651de7ec300
R10: 00005651de7ec318 R11: 0000000000000246 R12: 0000000001ffe000
R13: 0000000001ffe100 R14: 00005651de7ea550 R15: 0000000000000100

Please let me know if you need more information.

Bart.


* Re: [PATCHSET v5] blk-mq: reimplement timeout handling
  2018-01-12 20:57   ` Bart Van Assche
  (?)
@ 2018-01-12 21:07   ` Jens Axboe
  2018-01-12 21:19     ` Bart Van Assche
  -1 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2018-01-12 21:07 UTC (permalink / raw)
  To: Bart Van Assche, jbacik, tj, jack, clm
  Cc: kernel-team, linux-kernel, peterz, linux-btrfs, linux-block,
	jianchao.w.wang

On 1/12/18 1:57 PM, Bart Van Assche wrote:
> On Tue, 2018-01-09 at 08:29 -0800, Tejun Heo wrote:
>> Currently, blk-mq timeout path synchronizes against the usual
>> issue/completion path using a complex scheme involving atomic
>> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
>> rules.  Unfortunately, it contains quite a few holes.
> 
> Hello Tejun,
> 
> With this patch series applied I see weird hangs in blk_mq_get_tag() when I
> run the srp-test software. If I pull Jens' latest for-next branch and revert
> this patch series, the srp-test software runs successfully. Note: if you
> don't have InfiniBand hardware available, you will need the RDMA/CM
> patches for the SRP initiator and target drivers that were recently posted
> on the linux-rdma mailing list in order to run the srp-test software.

You're really not making it easy for folks to run this :-)

> This is how I run the srp-test software in a VM:
> 
> ./run_tests -c -d -r 10
> 
> Here is an example of what SysRq-w reported when the hang occurred:
> 
> sysrq: SysRq : Show Blocked State
>  task                        PC stack   pid father
> kworker/u8:0    D12864     5      2 0x80000000
> Workqueue: events_unbound sd_probe_async [sd_mod]
> Call Trace:
> ? __schedule+0x2b4/0xbb0
> schedule+0x2d/0x90
> io_schedule+0xd/0x30
> blk_mq_get_tag+0x169/0x290
> ? finish_wait+0x80/0x80
> blk_mq_get_request+0x16a/0x4f0
> blk_mq_alloc_request+0x59/0xc0
> blk_get_request_flags+0x3f/0x260
> scsi_execute+0x33/0x1e0 [scsi_mod]
> read_capacity_16.part.35+0x9c/0x460 [sd_mod]
> sd_revalidate_disk+0x14bb/0x1cb0 [sd_mod]
> sd_probe_async+0xf2/0x1a0 [sd_mod]
> process_one_work+0x21c/0x6d0
> worker_thread+0x35/0x380
> ? process_one_work+0x6d0/0x6d0
> kthread+0x117/0x130
> ? kthread_create_worker_on_cpu+0x40/0x40
> ret_from_fork+0x24/0x30
> systemd-udevd   D13672  1048    285 0x00000100
> Call Trace:
> ? __schedule+0x2b4/0xbb0
> schedule+0x2d/0x90
> io_schedule+0xd/0x30
> generic_file_read_iter+0x32f/0x970
> ? page_cache_tree_insert+0x100/0x100
> __vfs_read+0xcc/0x120
> vfs_read+0x96/0x140
> SyS_read+0x40/0xa0
> do_syscall_64+0x5f/0x1b0
> entry_SYSCALL64_slow_path+0x25/0x25

Do you have the matching blk-mq debugfs output for the device?
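[Editor's note: the blk-mq debugfs state referred to here lives under debugfs, and a dump in the path-prefixed "<file>:<contents>" style seen in this thread can be produced with a recursive grep. The snippet below is a sketch under assumptions (mount point /sys/kernel/debug, disk name sde are not stated in the thread); a mock directory tree stands in for debugfs so the recipe is self-contained.]

```shell
# Real invocation would be roughly (assumption, not quoted from the thread):
#   grep -r . /sys/kernel/debug/block/sde/
# A mock tree stands in for the debugfs directory so this sketch is runnable.
tmp=$(mktemp -d)
mkdir -p "$tmp/sde/hctx0"
printf '92\n' > "$tmp/sde/hctx0/run"
printf '30\n' > "$tmp/sde/hctx0/queued"
# grep -r prints each attribute prefixed with its originating file path
(cd "$tmp" && grep -r . sde/ | sort)
rm -rf "$tmp"
```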

-- 
Jens Axboe


* Re: [PATCHSET v5] blk-mq: reimplement timeout handling
  2018-01-12 21:07   ` Jens Axboe
@ 2018-01-12 21:19     ` Bart Van Assche
  2018-01-12 21:55       ` Jens Axboe
  2018-01-14 15:12       ` jianchao.wang
  0 siblings, 2 replies; 25+ messages in thread
From: Bart Van Assche @ 2018-01-12 21:19 UTC (permalink / raw)
  To: jbacik, tj, jack, clm, axboe
  Cc: kernel-team, linux-kernel, peterz, linux-btrfs, linux-block,
	jianchao.w.wang

[-- Attachment #1: Type: text/plain, Size: 1139 bytes --]

On Fri, 2018-01-12 at 14:07 -0700, Jens Axboe wrote:
> You're really not making it easy for folks to run this :-)

My hope is that the ib_srp and ib_srpt patches will be accepted upstream soon.
As long as these are not upstream, anyone who wants to retrieve these patches
is welcome to clone https://github.com/bvanassche/linux/tree/block-scsi-for-next,
a kernel tree with all my pending patches.

> Do you have the matching blk-mq debugfs output for the device?

Sorry, but I only retrieved the blk-mq debugfs output several minutes after the
hang started, so I'm not sure the state information is still relevant. Anyway, I
have attached it to this e-mail. The most remarkable part is the following:

./000000009ddfa913/requeue_list:000000009646711c {.op=READ, .state=idle, gen=0x118, abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT, complete=0, .tag=-1, .internal_tag=217}

The hexadecimal number at the start is the request_queue pointer (I modified the
blk-mq-debugfs code such that queues are registered with their address just after
creation and until a name is assigned). This is a dm-mpath queue.
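[Editor's note: each line of such a dump has the form "<queue>/<attribute path>:<value>". A minimal parser sketch follows; parse_dump is a hypothetical helper for readers of the attached dump, not part of the kernel's blk-mq-debugfs tooling.]

```python
# Hypothetical helper: parse path-prefixed blk-mq debugfs dump lines of the
# form "./<queue>/<attr path>:<value>" into {queue: {attr: [values]}}.
from collections import defaultdict

def parse_dump(lines):
    state = defaultdict(dict)
    for line in lines:
        path, _, value = line.partition(":")
        queue, _, attr = path.lstrip("./").partition("/")
        # an attribute can appear many times (e.g. per-bucket histograms),
        # so collect every occurrence instead of overwriting
        state[queue].setdefault(attr, []).append(value)
    return state

dump = [
    "./000000009ddfa913/hctx0/run:92",
    "./000000009ddfa913/hctx0/queued:30",
    "./sde/hctx3/active:62",
]
state = parse_dump(dump)
print(state["sde"]["hctx3/active"])  # prints ['62']
```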

Bart.

[-- Attachment #2: blk-state --]
[-- Type: text/plain, Size: 46000 bytes --]

./000000009ddfa913/hctx0/cpu3/completed:29 0
./000000009ddfa913/hctx0/cpu3/merged:0
./000000009ddfa913/hctx0/cpu3/dispatched:30 0
./000000009ddfa913/hctx0/cpu2/completed:0 0
./000000009ddfa913/hctx0/cpu2/merged:0
./000000009ddfa913/hctx0/cpu2/dispatched:0 0
./000000009ddfa913/hctx0/cpu1/completed:0 0
./000000009ddfa913/hctx0/cpu1/merged:0
./000000009ddfa913/hctx0/cpu1/dispatched:0 0
./000000009ddfa913/hctx0/cpu0/completed:0 0
./000000009ddfa913/hctx0/cpu0/merged:0
./000000009ddfa913/hctx0/cpu0/dispatched:0 0
./000000009ddfa913/hctx0/active:0
./000000009ddfa913/hctx0/run:92
./000000009ddfa913/hctx0/queued:30
./000000009ddfa913/hctx0/dispatched:       0	0
./000000009ddfa913/hctx0/dispatched:       1	97
./000000009ddfa913/hctx0/dispatched:       2	0
./000000009ddfa913/hctx0/dispatched:       4	0
./000000009ddfa913/hctx0/dispatched:       8	0
./000000009ddfa913/hctx0/dispatched:      16	0
./000000009ddfa913/hctx0/dispatched:      32+	0
./000000009ddfa913/hctx0/io_poll:considered=0
./000000009ddfa913/hctx0/io_poll:invoked=0
./000000009ddfa913/hctx0/io_poll:success=0
./000000009ddfa913/hctx0/sched_tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/sched_tags_bitmap:00000010: 0000 0000 0000 0000 0000 0002 0000 0000
./000000009ddfa913/hctx0/sched_tags:nr_tags=256
./000000009ddfa913/hctx0/sched_tags:nr_reserved_tags=0
./000000009ddfa913/hctx0/sched_tags:active_queues=0
./000000009ddfa913/hctx0/sched_tags:
./000000009ddfa913/hctx0/sched_tags:bitmap_tags:
./000000009ddfa913/hctx0/sched_tags:depth=256
./000000009ddfa913/hctx0/sched_tags:busy=1
./000000009ddfa913/hctx0/sched_tags:bits_per_word=64
./000000009ddfa913/hctx0/sched_tags:map_nr=4
./000000009ddfa913/hctx0/sched_tags:alloc_hint={45, 56, 144, 218}
./000000009ddfa913/hctx0/sched_tags:wake_batch=8
./000000009ddfa913/hctx0/sched_tags:wake_index=0
./000000009ddfa913/hctx0/sched_tags:ws={
./000000009ddfa913/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/sched_tags:}
./000000009ddfa913/hctx0/sched_tags:round_robin=0
./000000009ddfa913/hctx0/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:00000080: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:00000090: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:000000a0: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:000000b0: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:000000c0: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:000000d0: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:000000e0: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags_bitmap:000000f0: 0000 0000 0000 0000 0000 0000 0000 0000
./000000009ddfa913/hctx0/tags:nr_tags=2048
./000000009ddfa913/hctx0/tags:nr_reserved_tags=0
./000000009ddfa913/hctx0/tags:active_queues=0
./000000009ddfa913/hctx0/tags:
./000000009ddfa913/hctx0/tags:bitmap_tags:
./000000009ddfa913/hctx0/tags:depth=2048
./000000009ddfa913/hctx0/tags:busy=0
./000000009ddfa913/hctx0/tags:bits_per_word=64
./000000009ddfa913/hctx0/tags:map_nr=32
./000000009ddfa913/hctx0/tags:alloc_hint={604, 1046, 1289, 113}
./000000009ddfa913/hctx0/tags:wake_batch=8
./000000009ddfa913/hctx0/tags:wake_index=0
./000000009ddfa913/hctx0/tags:ws={
./000000009ddfa913/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./000000009ddfa913/hctx0/tags:}
./000000009ddfa913/hctx0/tags:round_robin=0
./000000009ddfa913/hctx0/ctx_map:00000000: 00
./000000009ddfa913/hctx0/flags:alloc_policy=FIFO SHOULD_MERGE|SG_MERGE
./000000009ddfa913/hctx0/state:
./000000009ddfa913/sched/starved:0
./000000009ddfa913/sched/batching:1
./000000009ddfa913/write_hints:hint0: 0
./000000009ddfa913/write_hints:hint1: 0
./000000009ddfa913/write_hints:hint2: 0
./000000009ddfa913/write_hints:hint3: 0
./000000009ddfa913/write_hints:hint4: 0
./000000009ddfa913/state:SAME_COMP|IO_STAT|INIT_DONE|POLL|REGISTERED
./000000009ddfa913/requeue_list:000000009646711c {.op=READ, .state=idle, gen=0x118, abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT, complete=0, .tag=-1, .internal_tag=217}
./000000009ddfa913/poll_stat:read  (512 Bytes): samples=0
./000000009ddfa913/poll_stat:write (512 Bytes): samples=0
./000000009ddfa913/poll_stat:read  (1024 Bytes): samples=0
./000000009ddfa913/poll_stat:write (1024 Bytes): samples=0
./000000009ddfa913/poll_stat:read  (2048 Bytes): samples=0
./000000009ddfa913/poll_stat:write (2048 Bytes): samples=0
./000000009ddfa913/poll_stat:read  (4096 Bytes): samples=0
./000000009ddfa913/poll_stat:write (4096 Bytes): samples=0
./000000009ddfa913/poll_stat:read  (8192 Bytes): samples=0
./000000009ddfa913/poll_stat:write (8192 Bytes): samples=0
./000000009ddfa913/poll_stat:read  (16384 Bytes): samples=0
./000000009ddfa913/poll_stat:write (16384 Bytes): samples=0
./000000009ddfa913/poll_stat:read  (32768 Bytes): samples=0
./000000009ddfa913/poll_stat:write (32768 Bytes): samples=0
./000000009ddfa913/poll_stat:read  (65536 Bytes): samples=0
./000000009ddfa913/poll_stat:write (65536 Bytes): samples=0
./00000000c9ca81bf/hctx0/cpu3/completed:26 0
./00000000c9ca81bf/hctx0/cpu3/merged:0
./00000000c9ca81bf/hctx0/cpu3/dispatched:26 0
./00000000c9ca81bf/hctx0/cpu2/completed:2 0
./00000000c9ca81bf/hctx0/cpu2/merged:0
./00000000c9ca81bf/hctx0/cpu2/dispatched:2 0
./00000000c9ca81bf/hctx0/cpu1/completed:0 0
./00000000c9ca81bf/hctx0/cpu1/merged:0
./00000000c9ca81bf/hctx0/cpu1/dispatched:0 0
./00000000c9ca81bf/hctx0/cpu0/completed:5 0
./00000000c9ca81bf/hctx0/cpu0/merged:0
./00000000c9ca81bf/hctx0/cpu0/dispatched:5 0
./00000000c9ca81bf/hctx0/active:0
./00000000c9ca81bf/hctx0/run:27
./00000000c9ca81bf/hctx0/queued:33
./00000000c9ca81bf/hctx0/dispatched:       0	0
./00000000c9ca81bf/hctx0/dispatched:       1	33
./00000000c9ca81bf/hctx0/dispatched:       2	0
./00000000c9ca81bf/hctx0/dispatched:       4	0
./00000000c9ca81bf/hctx0/dispatched:       8	0
./00000000c9ca81bf/hctx0/dispatched:      16	0
./00000000c9ca81bf/hctx0/dispatched:      32+	0
./00000000c9ca81bf/hctx0/io_poll:considered=0
./00000000c9ca81bf/hctx0/io_poll:invoked=0
./00000000c9ca81bf/hctx0/io_poll:success=0
./00000000c9ca81bf/hctx0/sched_tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/sched_tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/sched_tags:nr_tags=256
./00000000c9ca81bf/hctx0/sched_tags:nr_reserved_tags=0
./00000000c9ca81bf/hctx0/sched_tags:active_queues=0
./00000000c9ca81bf/hctx0/sched_tags:
./00000000c9ca81bf/hctx0/sched_tags:bitmap_tags:
./00000000c9ca81bf/hctx0/sched_tags:depth=256
./00000000c9ca81bf/hctx0/sched_tags:busy=0
./00000000c9ca81bf/hctx0/sched_tags:bits_per_word=64
./00000000c9ca81bf/hctx0/sched_tags:map_nr=4
./00000000c9ca81bf/hctx0/sched_tags:alloc_hint={56, 122, 48, 44}
./00000000c9ca81bf/hctx0/sched_tags:wake_batch=8
./00000000c9ca81bf/hctx0/sched_tags:wake_index=0
./00000000c9ca81bf/hctx0/sched_tags:ws={
./00000000c9ca81bf/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/sched_tags:}
./00000000c9ca81bf/hctx0/sched_tags:round_robin=0
./00000000c9ca81bf/hctx0/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:00000080: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:00000090: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:000000a0: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:000000b0: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:000000c0: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:000000d0: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:000000e0: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags_bitmap:000000f0: 0000 0000 0000 0000 0000 0000 0000 0000
./00000000c9ca81bf/hctx0/tags:nr_tags=2048
./00000000c9ca81bf/hctx0/tags:nr_reserved_tags=0
./00000000c9ca81bf/hctx0/tags:active_queues=0
./00000000c9ca81bf/hctx0/tags:
./00000000c9ca81bf/hctx0/tags:bitmap_tags:
./00000000c9ca81bf/hctx0/tags:depth=2048
./00000000c9ca81bf/hctx0/tags:busy=0
./00000000c9ca81bf/hctx0/tags:bits_per_word=64
./00000000c9ca81bf/hctx0/tags:map_nr=32
./00000000c9ca81bf/hctx0/tags:alloc_hint={114, 943, 285, 201}
./00000000c9ca81bf/hctx0/tags:wake_batch=8
./00000000c9ca81bf/hctx0/tags:wake_index=0
./00000000c9ca81bf/hctx0/tags:ws={
./00000000c9ca81bf/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./00000000c9ca81bf/hctx0/tags:}
./00000000c9ca81bf/hctx0/tags:round_robin=0
./00000000c9ca81bf/hctx0/ctx_map:00000000: 00
./00000000c9ca81bf/hctx0/flags:alloc_policy=FIFO SHOULD_MERGE|SG_MERGE
./00000000c9ca81bf/hctx0/state:
./00000000c9ca81bf/sched/starved:0
./00000000c9ca81bf/sched/batching:1
./00000000c9ca81bf/write_hints:hint0: 0
./00000000c9ca81bf/write_hints:hint1: 0
./00000000c9ca81bf/write_hints:hint2: 0
./00000000c9ca81bf/write_hints:hint3: 0
./00000000c9ca81bf/write_hints:hint4: 0
./00000000c9ca81bf/state:SAME_COMP|IO_STAT|INIT_DONE|POLL|REGISTERED
./00000000c9ca81bf/poll_stat:read  (512 Bytes): samples=0
./00000000c9ca81bf/poll_stat:write (512 Bytes): samples=0
./00000000c9ca81bf/poll_stat:read  (1024 Bytes): samples=0
./00000000c9ca81bf/poll_stat:write (1024 Bytes): samples=0
./00000000c9ca81bf/poll_stat:read  (2048 Bytes): samples=0
./00000000c9ca81bf/poll_stat:write (2048 Bytes): samples=0
./00000000c9ca81bf/poll_stat:read  (4096 Bytes): samples=0
./00000000c9ca81bf/poll_stat:write (4096 Bytes): samples=0
./00000000c9ca81bf/poll_stat:read  (8192 Bytes): samples=0
./00000000c9ca81bf/poll_stat:write (8192 Bytes): samples=0
./00000000c9ca81bf/poll_stat:read  (16384 Bytes): samples=0
./00000000c9ca81bf/poll_stat:write (16384 Bytes): samples=0
./00000000c9ca81bf/poll_stat:read  (32768 Bytes): samples=0
./00000000c9ca81bf/poll_stat:write (32768 Bytes): samples=0
./00000000c9ca81bf/poll_stat:read  (65536 Bytes): samples=0
./00000000c9ca81bf/poll_stat:write (65536 Bytes): samples=0
./sde/hctx3/cpu3/completed:51 11
./sde/hctx3/cpu3/merged:0
./sde/hctx3/cpu3/dispatched:51 11
./sde/hctx3/active:62
./sde/hctx3/run:68
./sde/hctx3/queued:62
./sde/hctx3/dispatched:       0	9
./sde/hctx3/dispatched:       1	59
./sde/hctx3/dispatched:       2	0
./sde/hctx3/dispatched:       4	0
./sde/hctx3/dispatched:       8	0
./sde/hctx3/dispatched:      16	0
./sde/hctx3/dispatched:      32+	0
./sde/hctx3/io_poll:considered=0
./sde/hctx3/io_poll:invoked=0
./sde/hctx3/io_poll:success=0
./sde/hctx3/tags_bitmap:00000000: 0000 0000 0000 0000
./sde/hctx3/tags:nr_tags=62
./sde/hctx3/tags:nr_reserved_tags=0
./sde/hctx3/tags:active_queues=1
./sde/hctx3/tags:
./sde/hctx3/tags:bitmap_tags:
./sde/hctx3/tags:depth=62
./sde/hctx3/tags:busy=0
./sde/hctx3/tags:bits_per_word=8
./sde/hctx3/tags:map_nr=8
./sde/hctx3/tags:alloc_hint={21, 38, 49, 44}
./sde/hctx3/tags:wake_batch=7
./sde/hctx3/tags:wake_index=0
./sde/hctx3/tags:ws={
./sde/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx3/tags:}
./sde/hctx3/tags:round_robin=0
./sde/hctx3/ctx_map:00000000: 00
./sde/hctx3/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_SHARED|SG_MERGE
./sde/hctx3/state:
./sde/hctx2/cpu2/completed:2 3
./sde/hctx2/cpu2/merged:0
./sde/hctx2/cpu2/dispatched:2 3
./sde/hctx2/active:5
./sde/hctx2/run:5
./sde/hctx2/queued:5
./sde/hctx2/dispatched:       0	0
./sde/hctx2/dispatched:       1	5
./sde/hctx2/dispatched:       2	0
./sde/hctx2/dispatched:       4	0
./sde/hctx2/dispatched:       8	0
./sde/hctx2/dispatched:      16	0
./sde/hctx2/dispatched:      32+	0
./sde/hctx2/io_poll:considered=0
./sde/hctx2/io_poll:invoked=0
./sde/hctx2/io_poll:success=0
./sde/hctx2/tags_bitmap:00000000: 0000 0000 0000 0000
./sde/hctx2/tags:nr_tags=62
./sde/hctx2/tags:nr_reserved_tags=0
./sde/hctx2/tags:active_queues=1
./sde/hctx2/tags:
./sde/hctx2/tags:bitmap_tags:
./sde/hctx2/tags:depth=62
./sde/hctx2/tags:busy=0
./sde/hctx2/tags:bits_per_word=8
./sde/hctx2/tags:map_nr=8
./sde/hctx2/tags:alloc_hint={28, 56, 24, 56}
./sde/hctx2/tags:wake_batch=7
./sde/hctx2/tags:wake_index=0
./sde/hctx2/tags:ws={
./sde/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx2/tags:}
./sde/hctx2/tags:round_robin=0
./sde/hctx2/ctx_map:00000000: 00
./sde/hctx2/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_SHARED|SG_MERGE
./sde/hctx2/state:
./sde/hctx1/cpu1/completed:2 38
./sde/hctx1/cpu1/merged:0
./sde/hctx1/cpu1/dispatched:2 38
./sde/hctx1/active:40
./sde/hctx1/run:45
./sde/hctx1/queued:40
./sde/hctx1/dispatched:       0	4
./sde/hctx1/dispatched:       1	39
./sde/hctx1/dispatched:       2	0
./sde/hctx1/dispatched:       4	0
./sde/hctx1/dispatched:       8	0
./sde/hctx1/dispatched:      16	0
./sde/hctx1/dispatched:      32+	0
./sde/hctx1/io_poll:considered=0
./sde/hctx1/io_poll:invoked=0
./sde/hctx1/io_poll:success=0
./sde/hctx1/tags_bitmap:00000000: 0000 0000 0000 0000
./sde/hctx1/tags:nr_tags=62
./sde/hctx1/tags:nr_reserved_tags=0
./sde/hctx1/tags:active_queues=3
./sde/hctx1/tags:
./sde/hctx1/tags:bitmap_tags:
./sde/hctx1/tags:depth=62
./sde/hctx1/tags:busy=0
./sde/hctx1/tags:bits_per_word=8
./sde/hctx1/tags:map_nr=8
./sde/hctx1/tags:alloc_hint={10, 8, 38, 57}
./sde/hctx1/tags:wake_batch=7
./sde/hctx1/tags:wake_index=1
./sde/hctx1/tags:ws={
./sde/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx1/tags:	{.wait_cnt=2, .wait=active},
./sde/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx1/tags:	{.wait_cnt=7, .wait=active},
./sde/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx1/tags:	{.wait_cnt=6, .wait=inactive},
./sde/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx1/tags:}
./sde/hctx1/tags:round_robin=0
./sde/hctx1/ctx_map:00000000: 00
./sde/hctx1/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_SHARED|SG_MERGE
./sde/hctx1/state:TAG_ACTIVE
./sde/hctx0/cpu0/completed:20 11
./sde/hctx0/cpu0/merged:0
./sde/hctx0/cpu0/dispatched:20 11
./sde/hctx0/active:31
./sde/hctx0/run:41
./sde/hctx0/queued:31
./sde/hctx0/dispatched:       0	8
./sde/hctx0/dispatched:       1	31
./sde/hctx0/dispatched:       2	0
./sde/hctx0/dispatched:       4	0
./sde/hctx0/dispatched:       8	0
./sde/hctx0/dispatched:      16	0
./sde/hctx0/dispatched:      32+	0
./sde/hctx0/io_poll:considered=0
./sde/hctx0/io_poll:invoked=0
./sde/hctx0/io_poll:success=0
./sde/hctx0/tags_bitmap:00000000: 0000 0000 0000 0000
./sde/hctx0/tags:nr_tags=62
./sde/hctx0/tags:nr_reserved_tags=0
./sde/hctx0/tags:active_queues=2
./sde/hctx0/tags:
./sde/hctx0/tags:bitmap_tags:
./sde/hctx0/tags:depth=62
./sde/hctx0/tags:busy=0
./sde/hctx0/tags:bits_per_word=8
./sde/hctx0/tags:map_nr=8
./sde/hctx0/tags:alloc_hint={19, 4, 39, 51}
./sde/hctx0/tags:wake_batch=7
./sde/hctx0/tags:wake_index=5
./sde/hctx0/tags:ws={
./sde/hctx0/tags:	{.wait_cnt=7, .wait=active},
./sde/hctx0/tags:	{.wait_cnt=2, .wait=inactive},
./sde/hctx0/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx0/tags:	{.wait_cnt=2, .wait=inactive},
./sde/hctx0/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx0/tags:	{.wait_cnt=6, .wait=inactive},
./sde/hctx0/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx0/tags:	{.wait_cnt=7, .wait=inactive},
./sde/hctx0/tags:}
./sde/hctx0/tags:round_robin=0
./sde/hctx0/ctx_map:00000000: 00
./sde/hctx0/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_SHARED|SG_MERGE
./sde/hctx0/state:TAG_ACTIVE
./sde/write_hints:hint0: 0
./sde/write_hints:hint1: 0
./sde/write_hints:hint2: 0
./sde/write_hints:hint3: 0
./sde/write_hints:hint4: 0
./sde/state:SAME_COMP|IO_STAT|INIT_DONE|POLL|REGISTERED|SCSI_PASSTHROUGH
./sde/poll_stat:read  (512 Bytes): samples=0
./sde/poll_stat:write (512 Bytes): samples=0
./sde/poll_stat:read  (1024 Bytes): samples=0
./sde/poll_stat:write (1024 Bytes): samples=0
./sde/poll_stat:read  (2048 Bytes): samples=0
./sde/poll_stat:write (2048 Bytes): samples=0
./sde/poll_stat:read  (4096 Bytes): samples=0
./sde/poll_stat:write (4096 Bytes): samples=0
./sde/poll_stat:read  (8192 Bytes): samples=0
./sde/poll_stat:write (8192 Bytes): samples=0
./sde/poll_stat:read  (16384 Bytes): samples=0
./sde/poll_stat:write (16384 Bytes): samples=0
./sde/poll_stat:read  (32768 Bytes): samples=0
./sde/poll_stat:write (32768 Bytes): samples=0
./sde/poll_stat:read  (65536 Bytes): samples=0
./sde/poll_stat:write (65536 Bytes): samples=0
./sdc/hctx3/cpu3/completed:59 3
./sdc/hctx3/cpu3/merged:0
./sdc/hctx3/cpu3/dispatched:59 3
./sdc/hctx3/active:62
./sdc/hctx3/run:64
./sdc/hctx3/queued:62
./sdc/hctx3/dispatched:       0	8
./sdc/hctx3/dispatched:       1	56
./sdc/hctx3/dispatched:       2	0
./sdc/hctx3/dispatched:       4	0
./sdc/hctx3/dispatched:       8	0
./sdc/hctx3/dispatched:      16	0
./sdc/hctx3/dispatched:      32+	0
./sdc/hctx3/io_poll:considered=0
./sdc/hctx3/io_poll:invoked=0
./sdc/hctx3/io_poll:success=0
./sdc/hctx3/tags_bitmap:00000000: 0000 0000 0000 0000
./sdc/hctx3/tags:nr_tags=62
./sdc/hctx3/tags:nr_reserved_tags=0
./sdc/hctx3/tags:active_queues=1
./sdc/hctx3/tags:
./sdc/hctx3/tags:bitmap_tags:
./sdc/hctx3/tags:depth=62
./sdc/hctx3/tags:busy=0
./sdc/hctx3/tags:bits_per_word=8
./sdc/hctx3/tags:map_nr=8
./sdc/hctx3/tags:alloc_hint={21, 38, 49, 44}
./sdc/hctx3/tags:wake_batch=7
./sdc/hctx3/tags:wake_index=0
./sdc/hctx3/tags:ws={
./sdc/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx3/tags:}
./sdc/hctx3/tags:round_robin=0
./sdc/hctx3/ctx_map:00000000: 00
./sdc/hctx3/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_SHARED|SG_MERGE
./sdc/hctx3/state:TAG_ACTIVE
./sdc/hctx2/cpu2/completed:0 8
./sdc/hctx2/cpu2/merged:0
./sdc/hctx2/cpu2/dispatched:0 8
./sdc/hctx2/active:8
./sdc/hctx2/run:8
./sdc/hctx2/queued:8
./sdc/hctx2/dispatched:       0	0
./sdc/hctx2/dispatched:       1	8
./sdc/hctx2/dispatched:       2	0
./sdc/hctx2/dispatched:       4	0
./sdc/hctx2/dispatched:       8	0
./sdc/hctx2/dispatched:      16	0
./sdc/hctx2/dispatched:      32+	0
./sdc/hctx2/io_poll:considered=0
./sdc/hctx2/io_poll:invoked=0
./sdc/hctx2/io_poll:success=0
./sdc/hctx2/tags_bitmap:00000000: 0000 0000 0000 0000
./sdc/hctx2/tags:nr_tags=62
./sdc/hctx2/tags:nr_reserved_tags=0
./sdc/hctx2/tags:active_queues=1
./sdc/hctx2/tags:
./sdc/hctx2/tags:bitmap_tags:
./sdc/hctx2/tags:depth=62
./sdc/hctx2/tags:busy=0
./sdc/hctx2/tags:bits_per_word=8
./sdc/hctx2/tags:map_nr=8
./sdc/hctx2/tags:alloc_hint={28, 56, 24, 56}
./sdc/hctx2/tags:wake_batch=7
./sdc/hctx2/tags:wake_index=0
./sdc/hctx2/tags:ws={
./sdc/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx2/tags:}
./sdc/hctx2/tags:round_robin=0
./sdc/hctx2/ctx_map:00000000: 00
./sdc/hctx2/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_SHARED|SG_MERGE
./sdc/hctx2/state:TAG_ACTIVE
./sdc/hctx1/cpu1/completed:14 13
./sdc/hctx1/cpu1/merged:0
./sdc/hctx1/cpu1/dispatched:14 13
./sdc/hctx1/active:21
./sdc/hctx1/run:32
./sdc/hctx1/queued:27
./sdc/hctx1/dispatched:       0	7
./sdc/hctx1/dispatched:       1	26
./sdc/hctx1/dispatched:       2	0
./sdc/hctx1/dispatched:       4	0
./sdc/hctx1/dispatched:       8	0
./sdc/hctx1/dispatched:      16	0
./sdc/hctx1/dispatched:      32+	0
./sdc/hctx1/io_poll:considered=0
./sdc/hctx1/io_poll:invoked=0
./sdc/hctx1/io_poll:success=0
./sdc/hctx1/tags_bitmap:00000000: 0000 0000 0000 0000
./sdc/hctx1/tags:nr_tags=62
./sdc/hctx1/tags:nr_reserved_tags=0
./sdc/hctx1/tags:active_queues=3
./sdc/hctx1/tags:
./sdc/hctx1/tags:bitmap_tags:
./sdc/hctx1/tags:depth=62
./sdc/hctx1/tags:busy=0
./sdc/hctx1/tags:bits_per_word=8
./sdc/hctx1/tags:map_nr=8
./sdc/hctx1/tags:alloc_hint={10, 8, 38, 57}
./sdc/hctx1/tags:wake_batch=7
./sdc/hctx1/tags:wake_index=1
./sdc/hctx1/tags:ws={
./sdc/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx1/tags:	{.wait_cnt=2, .wait=active},
./sdc/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx1/tags:	{.wait_cnt=7, .wait=active},
./sdc/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx1/tags:	{.wait_cnt=6, .wait=inactive},
./sdc/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx1/tags:}
./sdc/hctx1/tags:round_robin=0
./sdc/hctx1/ctx_map:00000000: 00
./sdc/hctx1/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_SHARED|SG_MERGE
./sdc/hctx1/state:TAG_ACTIVE
./sdc/hctx0/cpu0/completed:1 41
./sdc/hctx0/cpu0/merged:0
./sdc/hctx0/cpu0/dispatched:1 41
./sdc/hctx0/active:36
./sdc/hctx0/run:55
./sdc/hctx0/queued:42
./sdc/hctx0/dispatched:       0	11
./sdc/hctx0/dispatched:       1	42
./sdc/hctx0/dispatched:       2	0
./sdc/hctx0/dispatched:       4	0
./sdc/hctx0/dispatched:       8	0
./sdc/hctx0/dispatched:      16	0
./sdc/hctx0/dispatched:      32+	0
./sdc/hctx0/io_poll:considered=0
./sdc/hctx0/io_poll:invoked=0
./sdc/hctx0/io_poll:success=0
./sdc/hctx0/tags_bitmap:00000000: 0000 0000 0000 0000
./sdc/hctx0/tags:nr_tags=62
./sdc/hctx0/tags:nr_reserved_tags=0
./sdc/hctx0/tags:active_queues=2
./sdc/hctx0/tags:
./sdc/hctx0/tags:bitmap_tags:
./sdc/hctx0/tags:depth=62
./sdc/hctx0/tags:busy=0
./sdc/hctx0/tags:bits_per_word=8
./sdc/hctx0/tags:map_nr=8
./sdc/hctx0/tags:alloc_hint={19, 4, 39, 51}
./sdc/hctx0/tags:wake_batch=7
./sdc/hctx0/tags:wake_index=5
./sdc/hctx0/tags:ws={
./sdc/hctx0/tags:	{.wait_cnt=7, .wait=active},
./sdc/hctx0/tags:	{.wait_cnt=2, .wait=inactive},
./sdc/hctx0/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx0/tags:	{.wait_cnt=2, .wait=inactive},
./sdc/hctx0/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx0/tags:	{.wait_cnt=6, .wait=inactive},
./sdc/hctx0/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx0/tags:	{.wait_cnt=7, .wait=inactive},
./sdc/hctx0/tags:}
./sdc/hctx0/tags:round_robin=0
./sdc/hctx0/ctx_map:00000000: 00
./sdc/hctx0/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_SHARED|SG_MERGE
./sdc/hctx0/state:TAG_ACTIVE
./sdc/write_hints:hint0: 0
./sdc/write_hints:hint1: 0
./sdc/write_hints:hint2: 0
./sdc/write_hints:hint3: 0
./sdc/write_hints:hint4: 0
./sdc/state:SAME_COMP|IO_STAT|INIT_DONE|POLL|REGISTERED|SCSI_PASSTHROUGH
./sdc/poll_stat:read  (512 Bytes): samples=0
./sdc/poll_stat:write (512 Bytes): samples=0
./sdc/poll_stat:read  (1024 Bytes): samples=0
./sdc/poll_stat:write (1024 Bytes): samples=0
./sdc/poll_stat:read  (2048 Bytes): samples=0
./sdc/poll_stat:write (2048 Bytes): samples=0
./sdc/poll_stat:read  (4096 Bytes): samples=0
./sdc/poll_stat:write (4096 Bytes): samples=0
./sdc/poll_stat:read  (8192 Bytes): samples=0
./sdc/poll_stat:write (8192 Bytes): samples=0
./sdc/poll_stat:read  (16384 Bytes): samples=0
./sdc/poll_stat:write (16384 Bytes): samples=0
./sdc/poll_stat:read  (32768 Bytes): samples=0
./sdc/poll_stat:write (32768 Bytes): samples=0
./sdc/poll_stat:read  (65536 Bytes): samples=0
./sdc/poll_stat:write (65536 Bytes): samples=0
./sdd/hctx3/cpu3/completed:0 0
./sdd/hctx3/cpu3/merged:0
./sdd/hctx3/cpu3/dispatched:0 0
./sdd/hctx3/active:0
./sdd/hctx3/run:0
./sdd/hctx3/queued:0
./sdd/hctx3/dispatched:       0	0
./sdd/hctx3/dispatched:       1	0
./sdd/hctx3/dispatched:       2	0
./sdd/hctx3/dispatched:       4	0
./sdd/hctx3/dispatched:       8	0
./sdd/hctx3/dispatched:      16	0
./sdd/hctx3/dispatched:      32+	0
./sdd/hctx3/io_poll:considered=0
./sdd/hctx3/io_poll:invoked=0
./sdd/hctx3/io_poll:success=0
./sdd/hctx3/tags_bitmap:00000000: 0000 0000 0000 0000
./sdd/hctx3/tags:nr_tags=62
./sdd/hctx3/tags:nr_reserved_tags=0
./sdd/hctx3/tags:active_queues=1
./sdd/hctx3/tags:
./sdd/hctx3/tags:bitmap_tags:
./sdd/hctx3/tags:depth=62
./sdd/hctx3/tags:busy=0
./sdd/hctx3/tags:bits_per_word=8
./sdd/hctx3/tags:map_nr=8
./sdd/hctx3/tags:alloc_hint={21, 38, 49, 44}
./sdd/hctx3/tags:wake_batch=7
./sdd/hctx3/tags:wake_index=0
./sdd/hctx3/tags:ws={
./sdd/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx3/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx3/tags:}
./sdd/hctx3/tags:round_robin=0
./sdd/hctx3/ctx_map:00000000: 00
./sdd/hctx3/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_SHARED|SG_MERGE
./sdd/hctx3/state:
./sdd/hctx2/cpu2/completed:26 3
./sdd/hctx2/cpu2/merged:0
./sdd/hctx2/cpu2/dispatched:26 3
./sdd/hctx2/active:29
./sdd/hctx2/run:40
./sdd/hctx2/queued:29
./sdd/hctx2/dispatched:       0	17
./sdd/hctx2/dispatched:       1	28
./sdd/hctx2/dispatched:       2	0
./sdd/hctx2/dispatched:       4	0
./sdd/hctx2/dispatched:       8	0
./sdd/hctx2/dispatched:      16	0
./sdd/hctx2/dispatched:      32+	0
./sdd/hctx2/io_poll:considered=0
./sdd/hctx2/io_poll:invoked=0
./sdd/hctx2/io_poll:success=0
./sdd/hctx2/tags_bitmap:00000000: 0000 0000 0000 0000
./sdd/hctx2/tags:nr_tags=62
./sdd/hctx2/tags:nr_reserved_tags=0
./sdd/hctx2/tags:active_queues=1
./sdd/hctx2/tags:
./sdd/hctx2/tags:bitmap_tags:
./sdd/hctx2/tags:depth=62
./sdd/hctx2/tags:busy=0
./sdd/hctx2/tags:bits_per_word=8
./sdd/hctx2/tags:map_nr=8
./sdd/hctx2/tags:alloc_hint={28, 56, 24, 56}
./sdd/hctx2/tags:wake_batch=7
./sdd/hctx2/tags:wake_index=0
./sdd/hctx2/tags:ws={
./sdd/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx2/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx2/tags:}
./sdd/hctx2/tags:round_robin=0
./sdd/hctx2/ctx_map:00000000: 00
./sdd/hctx2/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_SHARED|SG_MERGE
./sdd/hctx2/state:
./sdd/hctx1/cpu1/completed:1 41
./sdd/hctx1/cpu1/merged:0
./sdd/hctx1/cpu1/dispatched:1 41
./sdd/hctx1/active:42
./sdd/hctx1/run:52
./sdd/hctx1/queued:42
./sdd/hctx1/dispatched:       0	9
./sdd/hctx1/dispatched:       1	41
./sdd/hctx1/dispatched:       2	0
./sdd/hctx1/dispatched:       4	0
./sdd/hctx1/dispatched:       8	0
./sdd/hctx1/dispatched:      16	0
./sdd/hctx1/dispatched:      32+	0
./sdd/hctx1/io_poll:considered=0
./sdd/hctx1/io_poll:invoked=0
./sdd/hctx1/io_poll:success=0
./sdd/hctx1/tags_bitmap:00000000: 0000 0000 0000 0000
./sdd/hctx1/tags:nr_tags=62
./sdd/hctx1/tags:nr_reserved_tags=0
./sdd/hctx1/tags:active_queues=3
./sdd/hctx1/tags:
./sdd/hctx1/tags:bitmap_tags:
./sdd/hctx1/tags:depth=62
./sdd/hctx1/tags:busy=0
./sdd/hctx1/tags:bits_per_word=8
./sdd/hctx1/tags:map_nr=8
./sdd/hctx1/tags:alloc_hint={10, 8, 38, 57}
./sdd/hctx1/tags:wake_batch=7
./sdd/hctx1/tags:wake_index=1
./sdd/hctx1/tags:ws={
./sdd/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx1/tags:	{.wait_cnt=2, .wait=active},
./sdd/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx1/tags:	{.wait_cnt=7, .wait=active},
./sdd/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx1/tags:	{.wait_cnt=6, .wait=inactive},
./sdd/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx1/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx1/tags:}
./sdd/hctx1/tags:round_robin=0
./sdd/hctx1/ctx_map:00000000: 00
./sdd/hctx1/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_SHARED|SG_MERGE
./sdd/hctx1/state:TAG_ACTIVE
./sdd/hctx0/cpu0/completed:5 4
./sdd/hctx0/cpu0/merged:0
./sdd/hctx0/cpu0/dispatched:5 4
./sdd/hctx0/active:9
./sdd/hctx0/run:10
./sdd/hctx0/queued:9
./sdd/hctx0/dispatched:       0	1
./sdd/hctx0/dispatched:       1	9
./sdd/hctx0/dispatched:       2	0
./sdd/hctx0/dispatched:       4	0
./sdd/hctx0/dispatched:       8	0
./sdd/hctx0/dispatched:      16	0
./sdd/hctx0/dispatched:      32+	0
./sdd/hctx0/io_poll:considered=0
./sdd/hctx0/io_poll:invoked=0
./sdd/hctx0/io_poll:success=0
./sdd/hctx0/tags_bitmap:00000000: 0000 0000 0000 0000
./sdd/hctx0/tags:nr_tags=62
./sdd/hctx0/tags:nr_reserved_tags=0
./sdd/hctx0/tags:active_queues=2
./sdd/hctx0/tags:
./sdd/hctx0/tags:bitmap_tags:
./sdd/hctx0/tags:depth=62
./sdd/hctx0/tags:busy=0
./sdd/hctx0/tags:bits_per_word=8
./sdd/hctx0/tags:map_nr=8
./sdd/hctx0/tags:alloc_hint={19, 4, 39, 51}
./sdd/hctx0/tags:wake_batch=7
./sdd/hctx0/tags:wake_index=5
./sdd/hctx0/tags:ws={
./sdd/hctx0/tags:	{.wait_cnt=7, .wait=active},
./sdd/hctx0/tags:	{.wait_cnt=2, .wait=inactive},
./sdd/hctx0/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx0/tags:	{.wait_cnt=2, .wait=inactive},
./sdd/hctx0/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx0/tags:	{.wait_cnt=6, .wait=inactive},
./sdd/hctx0/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx0/tags:	{.wait_cnt=7, .wait=inactive},
./sdd/hctx0/tags:}
./sdd/hctx0/tags:round_robin=0
./sdd/hctx0/ctx_map:00000000: 00
./sdd/hctx0/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_SHARED|SG_MERGE
./sdd/hctx0/state:
./sdd/write_hints:hint0: 0
./sdd/write_hints:hint1: 0
./sdd/write_hints:hint2: 0
./sdd/write_hints:hint3: 0
./sdd/write_hints:hint4: 0
./sdd/state:SAME_COMP|NONROT|IO_STAT|INIT_DONE|POLL|WC|FUA|REGISTERED|SCSI_PASSTHROUGH
./sdd/poll_stat:read  (512 Bytes): samples=0
./sdd/poll_stat:write (512 Bytes): samples=0
./sdd/poll_stat:read  (1024 Bytes): samples=0
./sdd/poll_stat:write (1024 Bytes): samples=0
./sdd/poll_stat:read  (2048 Bytes): samples=0
./sdd/poll_stat:write (2048 Bytes): samples=0
./sdd/poll_stat:read  (4096 Bytes): samples=0
./sdd/poll_stat:write (4096 Bytes): samples=0
./sdd/poll_stat:read  (8192 Bytes): samples=0
./sdd/poll_stat:write (8192 Bytes): samples=0
./sdd/poll_stat:read  (16384 Bytes): samples=0
./sdd/poll_stat:write (16384 Bytes): samples=0
./sdd/poll_stat:read  (32768 Bytes): samples=0
./sdd/poll_stat:write (32768 Bytes): samples=0
./sdd/poll_stat:read  (65536 Bytes): samples=0
./sdd/poll_stat:write (65536 Bytes): samples=0
./sdb/hctx0/cpu3/completed:56 64
./sdb/hctx0/cpu3/merged:0
./sdb/hctx0/cpu3/dispatched:56 64
./sdb/hctx0/cpu2/completed:27 6
./sdb/hctx0/cpu2/merged:0
./sdb/hctx0/cpu2/dispatched:27 6
./sdb/hctx0/cpu1/completed:1 5
./sdb/hctx0/cpu1/merged:0
./sdb/hctx0/cpu1/dispatched:1 5
./sdb/hctx0/cpu0/completed:96 9
./sdb/hctx0/cpu0/merged:0
./sdb/hctx0/cpu0/dispatched:96 9
./sdb/hctx0/active:0
./sdb/hctx0/run:258
./sdb/hctx0/queued:264
./sdb/hctx0/dispatched:       0	0
./sdb/hctx0/dispatched:       1	264
./sdb/hctx0/dispatched:       2	0
./sdb/hctx0/dispatched:       4	0
./sdb/hctx0/dispatched:       8	0
./sdb/hctx0/dispatched:      16	0
./sdb/hctx0/dispatched:      32+	0
./sdb/hctx0/io_poll:considered=0
./sdb/hctx0/io_poll:invoked=0
./sdb/hctx0/io_poll:success=0
./sdb/hctx0/sched_tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
./sdb/hctx0/sched_tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
./sdb/hctx0/sched_tags:nr_tags=256
./sdb/hctx0/sched_tags:nr_reserved_tags=0
./sdb/hctx0/sched_tags:active_queues=0
./sdb/hctx0/sched_tags:
./sdb/hctx0/sched_tags:bitmap_tags:
./sdb/hctx0/sched_tags:depth=256
./sdb/hctx0/sched_tags:busy=0
./sdb/hctx0/sched_tags:bits_per_word=64
./sdb/hctx0/sched_tags:map_nr=4
./sdb/hctx0/sched_tags:alloc_hint={231, 245, 5, 202}
./sdb/hctx0/sched_tags:wake_batch=8
./sdb/hctx0/sched_tags:wake_index=0
./sdb/hctx0/sched_tags:ws={
./sdb/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/sched_tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/sched_tags:}
./sdb/hctx0/sched_tags:round_robin=0
./sdb/hctx0/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
./sdb/hctx0/tags_bitmap:00000010: 0000 0000 0000 0000
./sdb/hctx0/tags:nr_tags=192
./sdb/hctx0/tags:nr_reserved_tags=0
./sdb/hctx0/tags:active_queues=0
./sdb/hctx0/tags:
./sdb/hctx0/tags:bitmap_tags:
./sdb/hctx0/tags:depth=192
./sdb/hctx0/tags:busy=0
./sdb/hctx0/tags:bits_per_word=32
./sdb/hctx0/tags:map_nr=6
./sdb/hctx0/tags:alloc_hint={70, 142, 58, 133}
./sdb/hctx0/tags:wake_batch=8
./sdb/hctx0/tags:wake_index=0
./sdb/hctx0/tags:ws={
./sdb/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sdb/hctx0/tags:}
./sdb/hctx0/tags:round_robin=0
./sdb/hctx0/ctx_map:00000000: 00
./sdb/hctx0/flags:alloc_policy=FIFO SHOULD_MERGE|SG_MERGE
./sdb/hctx0/state:
./sdb/sched/starved:0
./sdb/sched/batching:1
./sdb/write_hints:hint0: 0
./sdb/write_hints:hint1: 0
./sdb/write_hints:hint2: 0
./sdb/write_hints:hint3: 0
./sdb/write_hints:hint4: 0
./sdb/state:BIDI|SAME_COMP|NONROT|IO_STAT|INIT_DONE|POLL|WC|FUA|REGISTERED|SCSI_PASSTHROUGH
./sdb/poll_stat:read  (512 Bytes): samples=0
./sdb/poll_stat:write (512 Bytes): samples=0
./sdb/poll_stat:read  (1024 Bytes): samples=0
./sdb/poll_stat:write (1024 Bytes): samples=0
./sdb/poll_stat:read  (2048 Bytes): samples=0
./sdb/poll_stat:write (2048 Bytes): samples=0
./sdb/poll_stat:read  (4096 Bytes): samples=0
./sdb/poll_stat:write (4096 Bytes): samples=0
./sdb/poll_stat:read  (8192 Bytes): samples=0
./sdb/poll_stat:write (8192 Bytes): samples=0
./sdb/poll_stat:read  (16384 Bytes): samples=0
./sdb/poll_stat:write (16384 Bytes): samples=0
./sdb/poll_stat:read  (32768 Bytes): samples=0
./sdb/poll_stat:write (32768 Bytes): samples=0
./sdb/poll_stat:read  (65536 Bytes): samples=0
./sdb/poll_stat:write (65536 Bytes): samples=0
./sda/hctx0/cpu3/completed:2 2
./sda/hctx0/cpu3/merged:0
./sda/hctx0/cpu3/dispatched:2 2
./sda/hctx0/cpu2/completed:43 15
./sda/hctx0/cpu2/merged:0
./sda/hctx0/cpu2/dispatched:43 15
./sda/hctx0/cpu1/completed:0 8
./sda/hctx0/cpu1/merged:0
./sda/hctx0/cpu1/dispatched:0 8
./sda/hctx0/cpu0/completed:45 48
./sda/hctx0/cpu0/merged:0
./sda/hctx0/cpu0/dispatched:45 48
./sda/hctx0/active:0
./sda/hctx0/run:136
./sda/hctx0/queued:163
./sda/hctx0/dispatched:       0	0
./sda/hctx0/dispatched:       1	163
./sda/hctx0/dispatched:       2	0
./sda/hctx0/dispatched:       4	0
./sda/hctx0/dispatched:       8	0
./sda/hctx0/dispatched:      16	0
./sda/hctx0/dispatched:      32+	0
./sda/hctx0/io_poll:considered=0
./sda/hctx0/io_poll:invoked=0
./sda/hctx0/io_poll:success=0
./sda/hctx0/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
./sda/hctx0/tags:nr_tags=128
./sda/hctx0/tags:nr_reserved_tags=0
./sda/hctx0/tags:active_queues=0
./sda/hctx0/tags:
./sda/hctx0/tags:bitmap_tags:
./sda/hctx0/tags:depth=128
./sda/hctx0/tags:busy=0
./sda/hctx0/tags:bits_per_word=32
./sda/hctx0/tags:map_nr=4
./sda/hctx0/tags:alloc_hint={16, 23, 40, 14}
./sda/hctx0/tags:wake_batch=8
./sda/hctx0/tags:wake_index=0
./sda/hctx0/tags:ws={
./sda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./sda/hctx0/tags:}
./sda/hctx0/tags:round_robin=0
./sda/hctx0/ctx_map:00000000: 00
./sda/hctx0/flags:alloc_policy=FIFO SHOULD_MERGE|SG_MERGE
./sda/hctx0/state:
./sda/write_hints:hint0: 0
./sda/write_hints:hint1: 0
./sda/write_hints:hint2: 0
./sda/write_hints:hint3: 0
./sda/write_hints:hint4: 0
./sda/state:SAME_COMP|IO_STAT|DISCARD|INIT_DONE|POLL|WC|REGISTERED|SCSI_PASSTHROUGH
./sda/poll_stat:read  (512 Bytes): samples=0
./sda/poll_stat:write (512 Bytes): samples=0
./sda/poll_stat:read  (1024 Bytes): samples=0
./sda/poll_stat:write (1024 Bytes): samples=0
./sda/poll_stat:read  (2048 Bytes): samples=0
./sda/poll_stat:write (2048 Bytes): samples=0
./sda/poll_stat:read  (4096 Bytes): samples=0
./sda/poll_stat:write (4096 Bytes): samples=0
./sda/poll_stat:read  (8192 Bytes): samples=0
./sda/poll_stat:write (8192 Bytes): samples=0
./sda/poll_stat:read  (16384 Bytes): samples=0
./sda/poll_stat:write (16384 Bytes): samples=0
./sda/poll_stat:read  (32768 Bytes): samples=0
./sda/poll_stat:write (32768 Bytes): samples=0
./sda/poll_stat:read  (65536 Bytes): samples=0
./sda/poll_stat:write (65536 Bytes): samples=0
./vda/hctx0/cpu3/completed:1395 39
./vda/hctx0/cpu3/merged:0
./vda/hctx0/cpu3/dispatched:1395 39
./vda/hctx0/cpu2/completed:1367 3
./vda/hctx0/cpu2/merged:0
./vda/hctx0/cpu2/dispatched:1367 3
./vda/hctx0/cpu1/completed:1176 2
./vda/hctx0/cpu1/merged:34
./vda/hctx0/cpu1/dispatched:1176 2
./vda/hctx0/cpu0/completed:2349 8
./vda/hctx0/cpu0/merged:18
./vda/hctx0/cpu0/dispatched:2349 8
./vda/hctx0/active:0
./vda/hctx0/run:5911
./vda/hctx0/queued:6339
./vda/hctx0/dispatched:       0	0
./vda/hctx0/dispatched:       1	5828
./vda/hctx0/dispatched:       2	39
./vda/hctx0/dispatched:       4	12
./vda/hctx0/dispatched:       8	8
./vda/hctx0/dispatched:      16	19
./vda/hctx0/dispatched:      32+	0
./vda/hctx0/io_poll:considered=0
./vda/hctx0/io_poll:invoked=0
./vda/hctx0/io_poll:success=0
./vda/hctx0/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
./vda/hctx0/tags:nr_tags=128
./vda/hctx0/tags:nr_reserved_tags=0
./vda/hctx0/tags:active_queues=0
./vda/hctx0/tags:
./vda/hctx0/tags:bitmap_tags:
./vda/hctx0/tags:depth=128
./vda/hctx0/tags:busy=0
./vda/hctx0/tags:bits_per_word=32
./vda/hctx0/tags:map_nr=4
./vda/hctx0/tags:alloc_hint={121, 45, 19, 110}
./vda/hctx0/tags:wake_batch=8
./vda/hctx0/tags:wake_index=0
./vda/hctx0/tags:ws={
./vda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./vda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./vda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./vda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./vda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./vda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./vda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./vda/hctx0/tags:	{.wait_cnt=8, .wait=inactive},
./vda/hctx0/tags:}
./vda/hctx0/tags:round_robin=0
./vda/hctx0/ctx_map:00000000: 00
./vda/hctx0/flags:alloc_policy=FIFO SHOULD_MERGE
./vda/hctx0/state:
./vda/write_hints:hint0: 0
./vda/write_hints:hint1: 0
./vda/write_hints:hint2: 0
./vda/write_hints:hint3: 0
./vda/write_hints:hint4: 0
./vda/state:SAME_COMP|IO_STAT|INIT_DONE|NO_SG_MERGE|POLL|WC|REGISTERED
./vda/poll_stat:read  (512 Bytes): samples=0
./vda/poll_stat:write (512 Bytes): samples=0
./vda/poll_stat:read  (1024 Bytes): samples=0
./vda/poll_stat:write (1024 Bytes): samples=0
./vda/poll_stat:read  (2048 Bytes): samples=0
./vda/poll_stat:write (2048 Bytes): samples=0
./vda/poll_stat:read  (4096 Bytes): samples=0
./vda/poll_stat:write (4096 Bytes): samples=0
./vda/poll_stat:read  (8192 Bytes): samples=0
./vda/poll_stat:write (8192 Bytes): samples=0
./vda/poll_stat:read  (16384 Bytes): samples=0
./vda/poll_stat:write (16384 Bytes): samples=0
./vda/poll_stat:read  (32768 Bytes): samples=0
./vda/poll_stat:write (32768 Bytes): samples=0
./vda/poll_stat:read  (65536 Bytes): samples=0
./vda/poll_stat:write (65536 Bytes): samples=0
./sr0/hctx0/cpu3/completed:0 39
./sr0/hctx0/cpu3/merged:0
./sr0/hctx0/cpu3/dispatched:0 39
./sr0/hctx0/cpu2/completed:0 48
./sr0/hctx0/cpu2/merged:0
./sr0/hctx0/cpu2/dispatched:0 48
./sr0/hctx0/cpu1/completed:0 15
./sr0/hctx0/cpu1/merged:0
./sr0/hctx0/cpu1/dispatched:0 15
./sr0/hctx0/cpu0/completed:46 28
./sr0/hctx0/cpu0/merged:0
./sr0/hctx0/cpu0/dispatched:46 28
./sr0/hctx0/active:0
./sr0/hctx0/run:176
./sr0/hctx0/queued:176
./sr0/hctx0/dispatched:       0	0
./sr0/hctx0/dispatched:       1	176
./sr0/hctx0/dispatched:       2	0
./sr0/hctx0/dispatched:       4	0
./sr0/hctx0/dispatched:       8	0
./sr0/hctx0/dispatched:      16	0
./sr0/hctx0/dispatched:      32+	0
./sr0/hctx0/io_poll:considered=0
./sr0/hctx0/io_poll:invoked=0
./sr0/hctx0/io_poll:success=0
./sr0/hctx0/tags_bitmap:00000000: 00
./sr0/hctx0/tags:nr_tags=1
./sr0/hctx0/tags:nr_reserved_tags=0
./sr0/hctx0/tags:active_queues=0
./sr0/hctx0/tags:
./sr0/hctx0/tags:bitmap_tags:
./sr0/hctx0/tags:depth=1
./sr0/hctx0/tags:busy=0
./sr0/hctx0/tags:bits_per_word=64
./sr0/hctx0/tags:map_nr=1
./sr0/hctx0/tags:alloc_hint={0, 0, 0, 0}
./sr0/hctx0/tags:wake_batch=1
./sr0/hctx0/tags:wake_index=4
./sr0/hctx0/tags:ws={
./sr0/hctx0/tags:	{.wait_cnt=1, .wait=inactive},
./sr0/hctx0/tags:	{.wait_cnt=1, .wait=inactive},
./sr0/hctx0/tags:	{.wait_cnt=1, .wait=inactive},
./sr0/hctx0/tags:	{.wait_cnt=1, .wait=inactive},
./sr0/hctx0/tags:	{.wait_cnt=1, .wait=inactive},
./sr0/hctx0/tags:	{.wait_cnt=1, .wait=inactive},
./sr0/hctx0/tags:	{.wait_cnt=1, .wait=inactive},
./sr0/hctx0/tags:	{.wait_cnt=1, .wait=inactive},
./sr0/hctx0/tags:}
./sr0/hctx0/tags:round_robin=1
./sr0/hctx0/ctx_map:00000000: 00
./sr0/hctx0/flags:alloc_policy=RR SHOULD_MERGE|SG_MERGE
./sr0/hctx0/state:
./sr0/write_hints:hint0: 0
./sr0/write_hints:hint1: 0
./sr0/write_hints:hint2: 0
./sr0/write_hints:hint3: 0
./sr0/write_hints:hint4: 0
./sr0/state:SAME_COMP|IO_STAT|INIT_DONE|POLL|FLUSH_NQ|REGISTERED|SCSI_PASSTHROUGH
./sr0/poll_stat:read  (512 Bytes): samples=0
./sr0/poll_stat:write (512 Bytes): samples=0
./sr0/poll_stat:read  (1024 Bytes): samples=0
./sr0/poll_stat:write (1024 Bytes): samples=0
./sr0/poll_stat:read  (2048 Bytes): samples=0
./sr0/poll_stat:write (2048 Bytes): samples=0
./sr0/poll_stat:read  (4096 Bytes): samples=0
./sr0/poll_stat:write (4096 Bytes): samples=0
./sr0/poll_stat:read  (8192 Bytes): samples=0
./sr0/poll_stat:write (8192 Bytes): samples=0
./sr0/poll_stat:read  (16384 Bytes): samples=0
./sr0/poll_stat:write (16384 Bytes): samples=0
./sr0/poll_stat:read  (32768 Bytes): samples=0
./sr0/poll_stat:write (32768 Bytes): samples=0
./sr0/poll_stat:read  (65536 Bytes): samples=0
./sr0/poll_stat:write (65536 Bytes): samples=0
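
The attachment above is a grep-style dump, one `path:content` pair per line. To compare per-hctx attributes across devices it helps to index the dump by path first — a minimal sketch (the function name and sample lines are illustrative; the sample values are taken from the dump above):

```python
from collections import defaultdict

def index_dump(lines):
    """Map each debugfs path to the list of content lines dumped for it."""
    index = defaultdict(list)
    for line in lines:
        # Split on the first ':' only; hexdump lines contain further colons.
        path, sep, content = line.partition(':')
        if sep:
            index[path].append(content)
    return index

sample = [
    './sdc/hctx2/tags:nr_tags=62',
    './sdc/hctx2/tags:active_queues=1',
    './sdc/hctx2/state:TAG_ACTIVE',
]
idx = index_dump(sample)
print(idx['./sdc/hctx2/tags'])   # ['nr_tags=62', 'active_queues=1']
```

With the full attachment read in via `open(path).read().splitlines()`, `idx['./sdc/hctx1/state']` versus `idx['./sdd/hctx0/state']` makes the TAG_ACTIVE asymmetry between queues easy to spot.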

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCHSET v5] blk-mq: reimplement timeout handling
  2018-01-12 21:19     ` Bart Van Assche
@ 2018-01-12 21:55       ` Jens Axboe
  2018-01-14 15:12       ` jianchao.wang
  1 sibling, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2018-01-12 21:55 UTC (permalink / raw)
  To: Bart Van Assche, jbacik, tj, jack, clm
  Cc: kernel-team, linux-kernel, peterz, linux-btrfs, linux-block,
	jianchao.w.wang

On 1/12/18 2:19 PM, Bart Van Assche wrote:
> On Fri, 2018-01-12 at 14:07 -0700, Jens Axboe wrote:
>> You're really not making it easy for folks to run this :-)
> 
> My hope is that the ib_srp and ib_srpt patches will be accepted upstream soon.
> As long as these are not upstream, anyone who wants to retrieve these patches
> is welcome to clone https://github.com/bvanassche/linux/tree/block-scsi-for-next,
> a kernel tree with all my pending patches.
> 
>> Do you have the matching blk-mq debugfs output for the device?
> 
> Sorry, but I only retrieved the blk-mq debugfs output several minutes after the
> hang started, so I'm not sure the state information is relevant. Anyway, I have attached
> it to this e-mail. The most remarkable part is the following:
> 
> ./000000009ddfa913/requeue_list:000000009646711c {.op=READ, .state=idle,
> gen=0x118, abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT,
> complete=0, .tag=-1, .internal_tag=217}

Two things come to mind here:

1) We forgot to add RQF_STARTED to the debugfs bits, I just rectified
   that.

2) You are using a scheduler (which one?). The request was inserted, and
   retrieved by the driver, then requeued. After this requeue,
   apparently nothing happened. The queue should have been re-run, but
   that didn't happen. What are the queue/hctx states?
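The requeue_list entry quoted above is one `{.key=value, ...}` record per request. A short sketch of pulling such a line apart for inspection (field names follow the dump format shown in this thread; the parser itself and the shortened sample line are illustrative):

```python
def parse_rq_line(line):
    """Return a dict of the {.key=value, ...} fields of a debugfs request line."""
    # Take everything between the outermost braces.
    body = line[line.index('{') + 1:line.rindex('}')]
    fields = {}
    for part in body.split(', '):
        key, _, value = part.partition('=')
        # Keys may or may not carry a leading '.' (e.g. ".tag" vs "complete").
        fields[key.lstrip('.')] = value
    return fields

line = ('000000009646711c {.op=READ, .state=idle, .cmd_flags=, '
        '.rq_flags=SORTED|1|SOFTBARRIER|IO_STAT, complete=0, '
        '.tag=-1, .internal_tag=217}')
print(parse_rq_line(line)['rq_flags'].split('|'))
```

`.tag=-1` with a valid `.internal_tag` is consistent with a scheduler request that has given back its driver tag, matching point 2 above.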

-- 
Jens Axboe


* Re: [PATCHSET v5] blk-mq: reimplement timeout handling
  2018-01-12 20:57   ` Bart Van Assche
@ 2018-01-12 21:55     ` Laurence Oberman
  -1 siblings, 0 replies; 25+ messages in thread
From: Laurence Oberman @ 2018-01-12 21:55 UTC (permalink / raw)
  To: Bart Van Assche, jbacik, tj, jack, clm, axboe
  Cc: kernel-team, linux-kernel, peterz, linux-btrfs, linux-block,
	jianchao.w.wang

On Fri, 2018-01-12 at 20:57 +0000, Bart Van Assche wrote:
> On Tue, 2018-01-09 at 08:29 -0800, Tejun Heo wrote:
> > Currently, blk-mq timeout path synchronizes against the usual
> > issue/completion path using a complex scheme involving atomic
> > bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> > rules.  Unfortunately, it contains quite a few holes.
> 
> Hello Tejun,
> 
> With this patch series applied I see weird hangs in blk_mq_get_tag() when I
> run the srp-test software. If I pull Jens' latest for-next branch and revert
> this patch series then the srp-test software runs successfully. Note: if you
> don't have InfiniBand hardware available then you will need the RDMA/CM
> patches for the SRP initiator and target drivers that have been posted
> recently on the linux-rdma mailing list to run the srp-test software.
> 
> This is how I run the srp-test software in a VM:
> 
> ./run_tests -c -d -r 10
> 
> Here is an example of what SysRq-w reported when the hang occurred:
> 
> sysrq: SysRq : Show Blocked State
>  task                        PC stack   pid father
> kworker/u8:0    D12864     5      2 0x80000000
> Workqueue: events_unbound sd_probe_async [sd_mod]
> Call Trace:
> ? __schedule+0x2b4/0xbb0
> schedule+0x2d/0x90
> io_schedule+0xd/0x30
> blk_mq_get_tag+0x169/0x290
> ? finish_wait+0x80/0x80
> blk_mq_get_request+0x16a/0x4f0
> blk_mq_alloc_request+0x59/0xc0
> blk_get_request_flags+0x3f/0x260
> scsi_execute+0x33/0x1e0 [scsi_mod]
> read_capacity_16.part.35+0x9c/0x460 [sd_mod]
> sd_revalidate_disk+0x14bb/0x1cb0 [sd_mod]
> sd_probe_async+0xf2/0x1a0 [sd_mod]
> process_one_work+0x21c/0x6d0
> worker_thread+0x35/0x380
> ? process_one_work+0x6d0/0x6d0
> kthread+0x117/0x130
> ? kthread_create_worker_on_cpu+0x40/0x40
> ret_from_fork+0x24/0x30
> systemd-udevd   D13672  1048    285 0x00000100
> Call Trace:
> ? __schedule+0x2b4/0xbb0
> schedule+0x2d/0x90
> io_schedule+0xd/0x30
> generic_file_read_iter+0x32f/0x970
> ? page_cache_tree_insert+0x100/0x100
> __vfs_read+0xcc/0x120
> vfs_read+0x96/0x140
> SyS_read+0x40/0xa0
> do_syscall_64+0x5f/0x1b0
> entry_SYSCALL64_slow_path+0x25/0x25
> RIP: 0033:0x7f8ce6d08d11
> RSP: 002b:00007fff96dec288 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 00005651de7f6e10 RCX: 00007f8ce6d08d11
> RDX: 0000000000000040 RSI: 00005651de7f6e38 RDI: 0000000000000007
> RBP: 00005651de7ea500 R08: 00007f8ce6cf1c20 R09: 00005651de7f6e10
> R10: 000000000000006f R11: 0000000000000246 R12: 0000000001ff0000
> R13: 0000000001ff0040 R14: 00005651de7ea550 R15: 0000000000000040
> systemd-udevd   D13496  1049    285 0x00000100
> Call Trace:
> ? __schedule+0x2b4/0xbb0
> schedule+0x2d/0x90
> io_schedule+0xd/0x30
> blk_mq_get_tag+0x169/0x290
> ? finish_wait+0x80/0x80
> blk_mq_get_request+0x16a/0x4f0
> blk_mq_make_request+0x105/0x8e0
> ? generic_make_request+0xd6/0x3d0
> generic_make_request+0x103/0x3d0
> ? submit_bio+0x57/0x110
> submit_bio+0x57/0x110
> mpage_readpages+0x13b/0x160
> ? I_BDEV+0x10/0x10
> ? rcu_read_lock_sched_held+0x66/0x70
> ? __alloc_pages_nodemask+0x2e8/0x360
> __do_page_cache_readahead+0x2a4/0x370
> ? force_page_cache_readahead+0xaf/0x110
> force_page_cache_readahead+0xaf/0x110
> generic_file_read_iter+0x743/0x970
> ? find_held_lock+0x2d/0x90
> ? _raw_spin_unlock+0x29/0x40
> __vfs_read+0xcc/0x120
> vfs_read+0x96/0x140
> SyS_read+0x40/0xa0
> do_syscall_64+0x5f/0x1b0
> entry_SYSCALL64_slow_path+0x25/0x25
> RIP: 0033:0x7f8ce6d08d11
> RSP: 002b:00007fff96dec8b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 00007f8ce7085010 RCX: 00007f8ce6d08d11
> RDX: 0000000000040000 RSI: 00007f8ce7085038 RDI: 000000000000000f
> RBP: 00005651de7ec840 R08: ffffffffffffffff R09: 00007f8ce7085010
> R10: 00007f8ce7085028 R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000040000 R14: 00005651de7ec890 R15: 0000000000040000
> systemd-udevd   D13672  1055    285 0x00000100
> Call Trace:
> ? __schedule+0x2b4/0xbb0
> schedule+0x2d/0x90
> io_schedule+0xd/0x30
> blk_mq_get_tag+0x169/0x290
> ? finish_wait+0x80/0x80
> blk_mq_get_request+0x16a/0x4f0
> blk_mq_make_request+0x105/0x8e0
> ? generic_make_request+0xd6/0x3d0
> generic_make_request+0x103/0x3d0
> ? submit_bio+0x57/0x110
> submit_bio+0x57/0x110
> mpage_readpages+0x13b/0x160
> ? I_BDEV+0x10/0x10
> ? rcu_read_lock_sched_held+0x66/0x70
> ? __alloc_pages_nodemask+0x2e8/0x360
> __do_page_cache_readahead+0x2a4/0x370
> ? force_page_cache_readahead+0xaf/0x110
> force_page_cache_readahead+0xaf/0x110
> generic_file_read_iter+0x743/0x970
> __vfs_read+0xcc/0x120
> vfs_read+0x96/0x140
> SyS_read+0x40/0xa0
> do_syscall_64+0x5f/0x1b0
> entry_SYSCALL64_slow_path+0x25/0x25
> RIP: 0033:0x7f8ce6d08d11
> RSP: 002b:00007fff96dec848 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 00005651de7ec300 RCX: 00007f8ce6d08d11
> RDX: 0000000000000100 RSI: 00005651de7ec328 RDI: 000000000000000f
> RBP: 00005651de7ea500 R08: 0000000000000000 R09: 00005651de7ec300
> R10: 00005651de7ec318 R11: 0000000000000246 R12: 0000000001ffe000
> R13: 0000000001ffe100 R14: 00005651de7ea550 R15: 0000000000000100
> 
> Please let me know if you need more information.
> 
> Bart.

I just bumped into stalls too, using Mike's tree. I had started getting
ready, with blk-mq enabled for both dm_mod and scsi_mod, to test SRP so
I could run further tests for Bart.

I am working out when it started by beginning with Linus's latest tree,
as I initially tested that same tree without blk-mq and it was fine.

My concern is that I tried a 4.15-rc1 kernel as well and still saw
issues, so I have to sanitize all of this first and report back.

My trace was this:

[  615.714199]       Tainted: G          I      4.15.0-rc4+ #1
[  615.746226] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  615.790597] systemd-udevd   D    0  2247    745 0x80000086
[  615.821048] Call Trace:
[  615.833971]  ? __schedule+0x28d/0x870
[  615.854226]  schedule+0x32/0x80
[  615.871819]  io_schedule+0x12/0x40
[  615.890850]  __lock_page+0x105/0x150
[  615.910378]  ? page_cache_tree_insert+0xd0/0xd0
[  615.935932]  truncate_inode_pages_range+0x49d/0x810
[  615.963371]  __blkdev_put+0x73/0x200
[  615.983351]  blkdev_close+0x21/0x30
[  616.002910]  __fput+0xd5/0x210
[  616.019812]  task_work_run+0x82/0xa0
[  616.040506]  do_exit+0x2ce/0xb30
[  616.058622]  do_group_exit+0x39/0xa0
[  616.079506]  get_signal+0x1d0/0x5b0
[  616.100116]  do_signal+0x36/0x610
[  616.118889]  ? page_cache_tree_insert+0xd0/0xd0
[  616.144958]  exit_to_usermode_loop+0x47/0x93
[  616.169040]  do_syscall_64+0x172/0x1a0
[  616.190514]  entry_SYSCALL64_slow_path+0x25/0x25
[  616.216566] RIP: 0033:0x7fb247dd86f0
[  616.237395] RSP: 002b:00007ffea3c56cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  616.279551] RAX: fffffffffffffffc RBX: 0000000000200000 RCX: 00007fb247dd86f0
[  616.318057] RDX: 0000000000000400 RSI: 000055855eeefea8 RDI: 0000000000000008
[  616.357320] RBP: 000055855eeefe80 R08: 0000000005231994 R09: 0000000000000428
[  616.397099] R10: 0000000000000040 R11: 0000000000000246 R12: 000055855eeea550
[  616.436999] R13: 0000000000000400 R14: 000055855eeea5a0 R15: 000055855eeefe98
[  616.477831] INFO: task systemd-udevd:2248 blocked for more than 120 seconds.
[  616.518289]       Tainted: G          I      4.15.0-rc4+ #1
[  616.549323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCHSET v5] blk-mq: reimplement timeout handling
  2018-01-12 21:55     ` Laurence Oberman
@ 2018-01-13 12:39       ` Ming Lei
  -1 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2018-01-13 12:39 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Bart Van Assche, jbacik, tj, jack, clm, axboe, kernel-team,
	linux-kernel, peterz, linux-btrfs, linux-block, jianchao.w.wang

On Fri, Jan 12, 2018 at 04:55:34PM -0500, Laurence Oberman wrote:
> On Fri, 2018-01-12 at 20:57 +0000, Bart Van Assche wrote:
> > On Tue, 2018-01-09 at 08:29 -0800, Tejun Heo wrote:
> > > Currently, blk-mq timeout path synchronizes against the usual
> > > issue/completion path using a complex scheme involving atomic
> > > bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> > > rules.  Unfortunatley, it contains quite a few holes.
> > 
> > Hello Tejun,
> > 
> > With this patch series applied I see weird hangs in blk_mq_get_tag()
> > when I
> > run the srp-test software. If I pull Jens' latest for-next branch and
> > revert
> > this patch series then the srp-test software runs successfully. Note:
> > if you
> > don't have InfiniBand hardware available then you will need the
> > RDMA/CM
> > patches for the SRP initiator and target drivers that have been
> > posted
> > recently on the linux-rdma mailing list to run the srp-test software.
> > 
> > This is how I run the srp-test software in a VM:
> > 
> > ./run_tests -c -d -r 10
> > 
> > Here is an example of what SysRq-w reported when the hang occurred:
> > 
> > sysrq: SysRq : Show Blocked State
> >  task                        PC stack   pid father
> > kworker/u8:0    D12864     5      2 0x80000000
> > Workqueue: events_unbound sd_probe_async [sd_mod]
> > Call Trace:
> > ? __schedule+0x2b4/0xbb0
> > schedule+0x2d/0x90
> > io_schedule+0xd/0x30
> > blk_mq_get_tag+0x169/0x290
> > ? finish_wait+0x80/0x80
> > blk_mq_get_request+0x16a/0x4f0
> > blk_mq_alloc_request+0x59/0xc0
> > blk_get_request_flags+0x3f/0x260
> > scsi_execute+0x33/0x1e0 [scsi_mod]
> > read_capacity_16.part.35+0x9c/0x460 [sd_mod]
> > sd_revalidate_disk+0x14bb/0x1cb0 [sd_mod]
> > sd_probe_async+0xf2/0x1a0 [sd_mod]
> > process_one_work+0x21c/0x6d0
> > worker_thread+0x35/0x380
> > ? process_one_work+0x6d0/0x6d0
> > kthread+0x117/0x130
> > ? kthread_create_worker_on_cpu+0x40/0x40
> > ret_from_fork+0x24/0x30
> > systemd-udevd   D13672  1048    285 0x00000100
> > Call Trace:
> > ? __schedule+0x2b4/0xbb0
> > schedule+0x2d/0x90
> > io_schedule+0xd/0x30
> > generic_file_read_iter+0x32f/0x970
> > ? page_cache_tree_insert+0x100/0x100
> > __vfs_read+0xcc/0x120
> > vfs_read+0x96/0x140
> > SyS_read+0x40/0xa0
> > do_syscall_64+0x5f/0x1b0
> > entry_SYSCALL64_slow_path+0x25/0x25
> > RIP: 0033:0x7f8ce6d08d11
> > RSP: 002b:00007fff96dec288 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000000
> > RAX: ffffffffffffffda RBX: 00005651de7f6e10 RCX: 00007f8ce6d08d11
> > RDX: 0000000000000040 RSI: 00005651de7f6e38 RDI: 0000000000000007
> > RBP: 00005651de7ea500 R08: 00007f8ce6cf1c20 R09: 00005651de7f6e10
> > R10: 000000000000006f R11: 0000000000000246 R12: 0000000001ff0000
> > R13: 0000000001ff0040 R14: 00005651de7ea550 R15: 0000000000000040
> > systemd-udevd   D13496  1049    285 0x00000100
> > Call Trace:
> > ? __schedule+0x2b4/0xbb0
> > schedule+0x2d/0x90
> > io_schedule+0xd/0x30
> > blk_mq_get_tag+0x169/0x290
> > ? finish_wait+0x80/0x80
> > blk_mq_get_request+0x16a/0x4f0
> > blk_mq_make_request+0x105/0x8e0
> > ? generic_make_request+0xd6/0x3d0
> > generic_make_request+0x103/0x3d0
> > ? submit_bio+0x57/0x110
> > submit_bio+0x57/0x110
> > mpage_readpages+0x13b/0x160
> > ? I_BDEV+0x10/0x10
> > ? rcu_read_lock_sched_held+0x66/0x70
> > ? __alloc_pages_nodemask+0x2e8/0x360
> > __do_page_cache_readahead+0x2a4/0x370
> > ? force_page_cache_readahead+0xaf/0x110
> > force_page_cache_readahead+0xaf/0x110
> > generic_file_read_iter+0x743/0x970
> > ? find_held_lock+0x2d/0x90
> > ? _raw_spin_unlock+0x29/0x40
> > __vfs_read+0xcc/0x120
> > vfs_read+0x96/0x140
> > SyS_read+0x40/0xa0
> > do_syscall_64+0x5f/0x1b0
> > entry_SYSCALL64_slow_path+0x25/0x25
> > RIP: 0033:0x7f8ce6d08d11
> > RSP: 002b:00007fff96dec8b8 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000000
> > RAX: ffffffffffffffda RBX: 00007f8ce7085010 RCX: 00007f8ce6d08d11
> > RDX: 0000000000040000 RSI: 00007f8ce7085038 RDI: 000000000000000f
> > RBP: 00005651de7ec840 R08: ffffffffffffffff R09: 00007f8ce7085010
> > R10: 00007f8ce7085028 R11: 0000000000000246 R12: 0000000000000000
> > R13: 0000000000040000 R14: 00005651de7ec890 R15: 0000000000040000
> > systemd-udevd   D13672  1055    285 0x00000100
> > Call Trace:
> > ? __schedule+0x2b4/0xbb0
> > schedule+0x2d/0x90
> > io_schedule+0xd/0x30
> > blk_mq_get_tag+0x169/0x290
> > ? finish_wait+0x80/0x80
> > blk_mq_get_request+0x16a/0x4f0
> > blk_mq_make_request+0x105/0x8e0
> > ? generic_make_request+0xd6/0x3d0
> > generic_make_request+0x103/0x3d0
> > ? submit_bio+0x57/0x110
> > submit_bio+0x57/0x110
> > mpage_readpages+0x13b/0x160
> > ? I_BDEV+0x10/0x10
> > ? rcu_read_lock_sched_held+0x66/0x70
> > ? __alloc_pages_nodemask+0x2e8/0x360
> > __do_page_cache_readahead+0x2a4/0x370
> > ? force_page_cache_readahead+0xaf/0x110
> > force_page_cache_readahead+0xaf/0x110
> > generic_file_read_iter+0x743/0x970
> > __vfs_read+0xcc/0x120
> > vfs_read+0x96/0x140
> > SyS_read+0x40/0xa0
> > do_syscall_64+0x5f/0x1b0
> > entry_SYSCALL64_slow_path+0x25/0x25
> > RIP: 0033:0x7f8ce6d08d11
> > RSP: 002b:00007fff96dec848 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000000
> > RAX: ffffffffffffffda RBX: 00005651de7ec300 RCX: 00007f8ce6d08d11
> > RDX: 0000000000000100 RSI: 00005651de7ec328 RDI: 000000000000000f
> > RBP: 00005651de7ea500 R08: 0000000000000000 R09: 00005651de7ec300
> > R10: 00005651de7ec318 R11: 0000000000000246 R12: 0000000001ffe000
> > R13: 0000000001ffe100 R14: 00005651de7ea550 R15: 0000000000000100
> > 
> > Please let me know if you need more information.
> > 
> > Bart.
> 
> I just bumped into stalls too, using Mike's tree. I had started getting
> ready with blk-mq enabled for both dm_mod and scsi_mod and testing SRP
> so I could run further tests for Bart.
> 
> I am working to see when it started by starting first with Linus's
> latest tree as I tested that same without blk-mq initially and it was
> fine.
> 
> My concern is that I tried a 4.15-rc1 kernel as well and still saw
> issues so have to sanitize all of this first and report back.

Hi Laurence,

Please collect the blk-mq debugfs log via the following script after the hang
is triggered:

	http://people.redhat.com/minlei/tests/tools/dump-blk-info

The script takes no parameters.

From Bart's log, it looks like the request is requeued by dm-rq but never
kicked off again; your log will show whether your case matches Bart's.
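For reference, a minimal sketch of the kind of state such a dump collects — this is an illustration only, not the contents of dump-blk-info itself, and it assumes debugfs is mounted at the usual /sys/kernel/debug location:

```shell
#!/bin/sh
# Sketch of a blk-mq debugfs dump: print every readable attribute under
# the per-queue block debugfs tree (tags, dispatch lists, hctx state
# flags, ...). Assumes debugfs is mounted at /sys/kernel/debug; this is
# an illustration of what dump-blk-info-style scripts gather, not a copy
# of that script.

dump_blk_mq() {
    # $1: root of the block debugfs tree (default /sys/kernel/debug/block)
    root=${1:-/sys/kernel/debug/block}
    for f in $(find "$root" -type f 2>/dev/null); do
        echo "==== $f ===="
        cat "$f" 2>/dev/null
    done
}

dump_blk_mq "$@"
```

Run as root after the hang triggers and attach the output; the per-hctx files show whether any requests are still queued or in flight.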


Thanks,
Ming

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCHSET v5] blk-mq: reimplement timeout handling
  2018-01-12 21:55     ` Laurence Oberman
@ 2018-01-13 14:45       ` Ming Lei
  -1 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2018-01-13 14:45 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Bart Van Assche, jbacik, tj, jack, clm, axboe, kernel-team,
	linux-kernel, peterz, linux-btrfs, linux-block, jianchao.w.wang

On Fri, Jan 12, 2018 at 04:55:34PM -0500, Laurence Oberman wrote:
> On Fri, 2018-01-12 at 20:57 +0000, Bart Van Assche wrote:
> > On Tue, 2018-01-09 at 08:29 -0800, Tejun Heo wrote:
> > > Currently, blk-mq timeout path synchronizes against the usual
> > > issue/completion path using a complex scheme involving atomic
> > > bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> > > rules.  Unfortunatley, it contains quite a few holes.
> > 
> > Hello Tejun,
> > 
> > With this patch series applied I see weird hangs in blk_mq_get_tag()
> > when I
> > run the srp-test software. If I pull Jens' latest for-next branch and
> > revert
> > this patch series then the srp-test software runs successfully. Note:
> > if you
> > don't have InfiniBand hardware available then you will need the
> > RDMA/CM
> > patches for the SRP initiator and target drivers that have been
> > posted
> > recently on the linux-rdma mailing list to run the srp-test software.
> > 
> > This is how I run the srp-test software in a VM:
> > 
> > ./run_tests -c -d -r 10
> > 
> > Here is an example of what SysRq-w reported when the hang occurred:
> > 
> > sysrq: SysRq : Show Blocked State
> >  task                        PC stack   pid father
> > kworker/u8:0    D12864     5      2 0x80000000
> > Workqueue: events_unbound sd_probe_async [sd_mod]
> > Call Trace:
> > ? __schedule+0x2b4/0xbb0
> > schedule+0x2d/0x90
> > io_schedule+0xd/0x30
> > blk_mq_get_tag+0x169/0x290
> > ? finish_wait+0x80/0x80
> > blk_mq_get_request+0x16a/0x4f0
> > blk_mq_alloc_request+0x59/0xc0
> > blk_get_request_flags+0x3f/0x260
> > scsi_execute+0x33/0x1e0 [scsi_mod]
> > read_capacity_16.part.35+0x9c/0x460 [sd_mod]
> > sd_revalidate_disk+0x14bb/0x1cb0 [sd_mod]
> > sd_probe_async+0xf2/0x1a0 [sd_mod]
> > process_one_work+0x21c/0x6d0
> > worker_thread+0x35/0x380
> > ? process_one_work+0x6d0/0x6d0
> > kthread+0x117/0x130
> > ? kthread_create_worker_on_cpu+0x40/0x40
> > ret_from_fork+0x24/0x30
> > systemd-udevd   D13672  1048    285 0x00000100
> > Call Trace:
> > ? __schedule+0x2b4/0xbb0
> > schedule+0x2d/0x90
> > io_schedule+0xd/0x30
> > generic_file_read_iter+0x32f/0x970
> > ? page_cache_tree_insert+0x100/0x100
> > __vfs_read+0xcc/0x120
> > vfs_read+0x96/0x140
> > SyS_read+0x40/0xa0
> > do_syscall_64+0x5f/0x1b0
> > entry_SYSCALL64_slow_path+0x25/0x25
> > RIP: 0033:0x7f8ce6d08d11
> > RSP: 002b:00007fff96dec288 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000000
> > RAX: ffffffffffffffda RBX: 00005651de7f6e10 RCX: 00007f8ce6d08d11
> > RDX: 0000000000000040 RSI: 00005651de7f6e38 RDI: 0000000000000007
> > RBP: 00005651de7ea500 R08: 00007f8ce6cf1c20 R09: 00005651de7f6e10
> > R10: 000000000000006f R11: 0000000000000246 R12: 0000000001ff0000
> > R13: 0000000001ff0040 R14: 00005651de7ea550 R15: 0000000000000040
> > systemd-udevd   D13496  1049    285 0x00000100
> > Call Trace:
> > ? __schedule+0x2b4/0xbb0
> > schedule+0x2d/0x90
> > io_schedule+0xd/0x30
> > blk_mq_get_tag+0x169/0x290
> > ? finish_wait+0x80/0x80
> > blk_mq_get_request+0x16a/0x4f0
> > blk_mq_make_request+0x105/0x8e0
> > ? generic_make_request+0xd6/0x3d0
> > generic_make_request+0x103/0x3d0
> > ? submit_bio+0x57/0x110
> > submit_bio+0x57/0x110
> > mpage_readpages+0x13b/0x160
> > ? I_BDEV+0x10/0x10
> > ? rcu_read_lock_sched_held+0x66/0x70
> > ? __alloc_pages_nodemask+0x2e8/0x360
> > __do_page_cache_readahead+0x2a4/0x370
> > ? force_page_cache_readahead+0xaf/0x110
> > force_page_cache_readahead+0xaf/0x110
> > generic_file_read_iter+0x743/0x970
> > ? find_held_lock+0x2d/0x90
> > ? _raw_spin_unlock+0x29/0x40
> > __vfs_read+0xcc/0x120
> > vfs_read+0x96/0x140
> > SyS_read+0x40/0xa0
> > do_syscall_64+0x5f/0x1b0
> > entry_SYSCALL64_slow_path+0x25/0x25
> > RIP: 0033:0x7f8ce6d08d11
> > RSP: 002b:00007fff96dec8b8 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000000
> > RAX: ffffffffffffffda RBX: 00007f8ce7085010 RCX: 00007f8ce6d08d11
> > RDX: 0000000000040000 RSI: 00007f8ce7085038 RDI: 000000000000000f
> > RBP: 00005651de7ec840 R08: ffffffffffffffff R09: 00007f8ce7085010
> > R10: 00007f8ce7085028 R11: 0000000000000246 R12: 0000000000000000
> > R13: 0000000000040000 R14: 00005651de7ec890 R15: 0000000000040000
> > systemd-udevd   D13672  1055    285 0x00000100
> > Call Trace:
> > ? __schedule+0x2b4/0xbb0
> > schedule+0x2d/0x90
> > io_schedule+0xd/0x30
> > blk_mq_get_tag+0x169/0x290
> > ? finish_wait+0x80/0x80
> > blk_mq_get_request+0x16a/0x4f0
> > blk_mq_make_request+0x105/0x8e0
> > ? generic_make_request+0xd6/0x3d0
> > generic_make_request+0x103/0x3d0
> > ? submit_bio+0x57/0x110
> > submit_bio+0x57/0x110
> > mpage_readpages+0x13b/0x160
> > ? I_BDEV+0x10/0x10
> > ? rcu_read_lock_sched_held+0x66/0x70
> > ? __alloc_pages_nodemask+0x2e8/0x360
> > __do_page_cache_readahead+0x2a4/0x370
> > ? force_page_cache_readahead+0xaf/0x110
> > force_page_cache_readahead+0xaf/0x110
> > generic_file_read_iter+0x743/0x970
> > __vfs_read+0xcc/0x120
> > vfs_read+0x96/0x140
> > SyS_read+0x40/0xa0
> > do_syscall_64+0x5f/0x1b0
> > entry_SYSCALL64_slow_path+0x25/0x25
> > RIP: 0033:0x7f8ce6d08d11
> > RSP: 002b:00007fff96dec848 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000000
> > RAX: ffffffffffffffda RBX: 00005651de7ec300 RCX: 00007f8ce6d08d11
> > RDX: 0000000000000100 RSI: 00005651de7ec328 RDI: 000000000000000f
> > RBP: 00005651de7ea500 R08: 0000000000000000 R09: 00005651de7ec300
> > R10: 00005651de7ec318 R11: 0000000000000246 R12: 0000000001ffe000
> > R13: 0000000001ffe100 R14: 00005651de7ea550 R15: 0000000000000100
> > 
> > Please let me know if you need more information.
> > 
> > Bart.
> 
> I just bumped into stalls too, using Mike's tree. I had started getting
> ready with blk-mq enabled for both dm_mod and scsi_mod and testing SRP
> so I could run further tests for Bart.
> 
> I am working to see when it started by starting first with Linus's
> latest tree as I tested that same without blk-mq initially and it was
> fine.
> 
> My concern is that I tried a 4.15-rc1 kernel as well and still saw
> issues so have to sanitize all of this first and report back.
> 
> My trace was this
> 
> [  615.714199]       Tainted: G          I      4.15.0-rc4+ #1
> [  615.746226] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  615.790597] systemd-udevd   D    0  2247    745 0x80000086
> [  615.821048] Call Trace:
> [  615.833971]  ? __schedule+0x28d/0x870
> [  615.854226]  schedule+0x32/0x80
> [  615.871819]  io_schedule+0x12/0x40
> [  615.890850]  __lock_page+0x105/0x150
> [  615.910378]  ? page_cache_tree_insert+0xd0/0xd0
> [  615.935932]  truncate_inode_pages_range+0x49d/0x810
> [  615.963371]  __blkdev_put+0x73/0x200
> [  615.983351]  blkdev_close+0x21/0x30
> [  616.002910]  __fput+0xd5/0x210
> [  616.019812]  task_work_run+0x82/0xa0
> [  616.040506]  do_exit+0x2ce/0xb30
> [  616.058622]  do_group_exit+0x39/0xa0
> [  616.079506]  get_signal+0x1d0/0x5b0
> [  616.100116]  do_signal+0x36/0x610
> [  616.118889]  ? page_cache_tree_insert+0xd0/0xd0
> [  616.144958]  exit_to_usermode_loop+0x47/0x93
> [  616.169040]  do_syscall_64+0x172/0x1a0
> [  616.190514]  entry_SYSCALL64_slow_path+0x25/0x25
> [  616.216566] RIP: 0033:0x7fb247dd86f0
> [  616.237395] RSP: 002b:00007ffea3c56cb8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000000
> [  616.279551] RAX: fffffffffffffffc RBX: 0000000000200000 RCX:
> 00007fb247dd86f0
> [  616.318057] RDX: 0000000000000400 RSI: 000055855eeefea8 RDI:
> 0000000000000008
> [  616.357320] RBP: 000055855eeefe80 R08: 0000000005231994 R09:
> 0000000000000428
> [  616.397099] R10: 0000000000000040 R11: 0000000000000246 R12:
> 000055855eeea550
> [  616.436999] R13: 0000000000000400 R14: 000055855eeea5a0 R15:
> 000055855eeefe98
> [  616.477831] INFO: task systemd-udevd:2248 blocked for more than 120
> seconds.
> [  616.518289]       Tainted: G          I      4.15.0-rc4+ #1
> [  616.549323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.

I can trigger this hang too; it looks like it happens after starting
multipathd, and 'lsblk' also complains:

	lsblk: dm-0: failed to get device path
	lsblk: dm-0: failed to get device path

And one SCSI device can't be attached to dm-mpath.

But it may not be related to IO, since when this hang happens there isn't
any pending IO observed from the blk-mq debugfs.
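For what it's worth, that "no pending IO" check can be reproduced by hand. A minimal sketch, assuming debugfs is mounted at /sys/kernel/debug and that the per-hctx `dispatch` and per-queue `requeue_list` attributes exist (names assumed from the blk-mq debugfs layout of this era — verify against your kernel):

```shell
#!/bin/sh
# Hedged sketch: report any non-empty blk-mq dispatch/requeue lists under
# the block debugfs tree. The attribute names (dispatch, requeue_list)
# are assumptions about this era's blk-mq debugfs layout.

find_pending() {
    root=${1:-/sys/kernel/debug/block}
    found=0
    for f in $(find "$root" \( -name dispatch -o -name requeue_list \) \
                    -type f 2>/dev/null); do
        # debugfs files report size 0, so test the actual contents
        # instead of using [ -s "$f" ].
        contents=$(cat "$f" 2>/dev/null)
        if [ -n "$contents" ]; then
            echo "pending requests in $f:"
            echo "$contents"
            found=1
        fi
    done
    [ "$found" -eq 0 ] && echo "no pending blk-mq requests found under $root"
}

find_pending "$@"
```

If every dispatch and requeue list comes back empty while the task is still hung, the stall is above blk-mq (e.g. in dm-rq requeue handling) rather than a lost request inside a hardware queue.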

I also tried Jan's 6 patches for fixing the race between bdev open and
gendisk shutdown, and it looks like they don't help with this issue.

	https://marc.info/?l=linux-block&m=151569001430678&w=2

Thanks,
Ming

^ permalink raw reply	[flat|nested] 25+ messages in thread


* Re: [PATCHSET v5] blk-mq: reimplement timeout handling
@ 2018-01-13 17:59         ` Ming Lei
  0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2018-01-13 17:59 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Bart Van Assche, jbacik, tj, jack, clm, axboe, kernel-team,
	linux-kernel, peterz, linux-btrfs, linux-block, jianchao.w.wang

On Sat, Jan 13, 2018 at 10:45:14PM +0800, Ming Lei wrote:
> On Fri, Jan 12, 2018 at 04:55:34PM -0500, Laurence Oberman wrote:
> > On Fri, 2018-01-12 at 20:57 +0000, Bart Van Assche wrote:
> > > On Tue, 2018-01-09 at 08:29 -0800, Tejun Heo wrote:
> > > > Currently, blk-mq timeout path synchronizes against the usual
> > > > issue/completion path using a complex scheme involving atomic
> > > > bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> > > > rules.  Unfortunatley, it contains quite a few holes.
> > > 
> > > Hello Tejun,
> > > 
> > > With this patch series applied I see weird hangs in blk_mq_get_tag()
> > > when I
> > > run the srp-test software. If I pull Jens' latest for-next branch and
> > > revert
> > > this patch series then the srp-test software runs successfully. Note:
> > > if you
> > > don't have InfiniBand hardware available then you will need the
> > > RDMA/CM
> > > patches for the SRP initiator and target drivers that have been
> > > posted
> > > recently on the linux-rdma mailing list to run the srp-test software.
> > > 
> > > This is how I run the srp-test software in a VM:
> > > 
> > > ./run_tests -c -d -r 10
> > > 
> > > Here is an example of what SysRq-w reported when the hang occurred:
> > > 
> > > sysrq: SysRq : Show Blocked State
> > >  task                        PC stack   pid father
> > > kworker/u8:0    D12864     5      2 0x80000000
> > > Workqueue: events_unbound sd_probe_async [sd_mod]
> > > Call Trace:
> > > ? __schedule+0x2b4/0xbb0
> > > schedule+0x2d/0x90
> > > io_schedule+0xd/0x30
> > > blk_mq_get_tag+0x169/0x290
> > > ? finish_wait+0x80/0x80
> > > blk_mq_get_request+0x16a/0x4f0
> > > blk_mq_alloc_request+0x59/0xc0
> > > blk_get_request_flags+0x3f/0x260
> > > scsi_execute+0x33/0x1e0 [scsi_mod]
> > > read_capacity_16.part.35+0x9c/0x460 [sd_mod]
> > > sd_revalidate_disk+0x14bb/0x1cb0 [sd_mod]
> > > sd_probe_async+0xf2/0x1a0 [sd_mod]
> > > process_one_work+0x21c/0x6d0
> > > worker_thread+0x35/0x380
> > > ? process_one_work+0x6d0/0x6d0
> > > kthread+0x117/0x130
> > > ? kthread_create_worker_on_cpu+0x40/0x40
> > > ret_from_fork+0x24/0x30
> > > systemd-udevd   D13672  1048    285 0x00000100
> > > Call Trace:
> > > ? __schedule+0x2b4/0xbb0
> > > schedule+0x2d/0x90
> > > io_schedule+0xd/0x30
> > > generic_file_read_iter+0x32f/0x970
> > > ? page_cache_tree_insert+0x100/0x100
> > > __vfs_read+0xcc/0x120
> > > vfs_read+0x96/0x140
> > > SyS_read+0x40/0xa0
> > > do_syscall_64+0x5f/0x1b0
> > > entry_SYSCALL64_slow_path+0x25/0x25
> > > RIP: 0033:0x7f8ce6d08d11
> > > RSP: 002b:00007fff96dec288 EFLAGS: 00000246 ORIG_RAX:
> > > 0000000000000000
> > > RAX: ffffffffffffffda RBX: 00005651de7f6e10 RCX: 00007f8ce6d08d11
> > > RDX: 0000000000000040 RSI: 00005651de7f6e38 RDI: 0000000000000007
> > > RBP: 00005651de7ea500 R08: 00007f8ce6cf1c20 R09: 00005651de7f6e10
> > > R10: 000000000000006f R11: 0000000000000246 R12: 0000000001ff0000
> > > R13: 0000000001ff0040 R14: 00005651de7ea550 R15: 0000000000000040
> > > systemd-udevd   D13496  1049    285 0x00000100
> > > Call Trace:
> > > ? __schedule+0x2b4/0xbb0
> > > schedule+0x2d/0x90
> > > io_schedule+0xd/0x30
> > > blk_mq_get_tag+0x169/0x290
> > > ? finish_wait+0x80/0x80
> > > blk_mq_get_request+0x16a/0x4f0
> > > blk_mq_make_request+0x105/0x8e0
> > > ? generic_make_request+0xd6/0x3d0
> > > generic_make_request+0x103/0x3d0
> > > ? submit_bio+0x57/0x110
> > > submit_bio+0x57/0x110
> > > mpage_readpages+0x13b/0x160
> > > ? I_BDEV+0x10/0x10
> > > ? rcu_read_lock_sched_held+0x66/0x70
> > > ? __alloc_pages_nodemask+0x2e8/0x360
> > > __do_page_cache_readahead+0x2a4/0x370
> > > ? force_page_cache_readahead+0xaf/0x110
> > > force_page_cache_readahead+0xaf/0x110
> > > generic_file_read_iter+0x743/0x970
> > > ? find_held_lock+0x2d/0x90
> > > ? _raw_spin_unlock+0x29/0x40
> > > __vfs_read+0xcc/0x120
> > > vfs_read+0x96/0x140
> > > SyS_read+0x40/0xa0
> > > do_syscall_64+0x5f/0x1b0
> > > entry_SYSCALL64_slow_path+0x25/0x25
> > > RIP: 0033:0x7f8ce6d08d11
> > > RSP: 002b:00007fff96dec8b8 EFLAGS: 00000246 ORIG_RAX:
> > > 0000000000000000
> > > RAX: ffffffffffffffda RBX: 00007f8ce7085010 RCX: 00007f8ce6d08d11
> > > RDX: 0000000000040000 RSI: 00007f8ce7085038 RDI: 000000000000000f
> > > RBP: 00005651de7ec840 R08: ffffffffffffffff R09: 00007f8ce7085010
> > > R10: 00007f8ce7085028 R11: 0000000000000246 R12: 0000000000000000
> > > R13: 0000000000040000 R14: 00005651de7ec890 R15: 0000000000040000
> > > systemd-udevd   D13672  1055    285 0x00000100
> > > Call Trace:
> > > ? __schedule+0x2b4/0xbb0
> > > schedule+0x2d/0x90
> > > io_schedule+0xd/0x30
> > > blk_mq_get_tag+0x169/0x290
> > > ? finish_wait+0x80/0x80
> > > blk_mq_get_request+0x16a/0x4f0
> > > blk_mq_make_request+0x105/0x8e0
> > > ? generic_make_request+0xd6/0x3d0
> > > generic_make_request+0x103/0x3d0
> > > ? submit_bio+0x57/0x110
> > > submit_bio+0x57/0x110
> > > mpage_readpages+0x13b/0x160
> > > ? I_BDEV+0x10/0x10
> > > ? rcu_read_lock_sched_held+0x66/0x70
> > > ? __alloc_pages_nodemask+0x2e8/0x360
> > > __do_page_cache_readahead+0x2a4/0x370
> > > ? force_page_cache_readahead+0xaf/0x110
> > > force_page_cache_readahead+0xaf/0x110
> > > generic_file_read_iter+0x743/0x970
> > > __vfs_read+0xcc/0x120
> > > vfs_read+0x96/0x140
> > > SyS_read+0x40/0xa0
> > > do_syscall_64+0x5f/0x1b0
> > > entry_SYSCALL64_slow_path+0x25/0x25
> > > RIP: 0033:0x7f8ce6d08d11
> > > RSP: 002b:00007fff96dec848 EFLAGS: 00000246 ORIG_RAX:
> > > 0000000000000000
> > > RAX: ffffffffffffffda RBX: 00005651de7ec300 RCX: 00007f8ce6d08d11
> > > RDX: 0000000000000100 RSI: 00005651de7ec328 RDI: 000000000000000f
> > > RBP: 00005651de7ea500 R08: 0000000000000000 R09: 00005651de7ec300
> > > R10: 00005651de7ec318 R11: 0000000000000246 R12: 0000000001ffe000
> > > R13: 0000000001ffe100 R14: 00005651de7ea550 R15: 0000000000000100
> > > 
> > > Please let me know if you need more information.
> > > 
> > > Bart.
> > 
> > I just bumped into stalls too, using Mike's tree. I had started getting
> > ready with blk-mq enabled for both dm_mod and scsi_mod and testing SRP
> > so I could run further tests for Bart.
> > 
> > I am working to see when it started by starting first with Linus's
> > latest tree as I tested that same without blk-mq initially and it was
> > fine.
> > 
> > My concern is that I tried a 4.15-rc1 kernel as well and still saw
> > issues so have to sanitize all of this first and report back.
> > 
> > My trace was this
> > 
> > [  615.714199]       Tainted: G          I      4.15.0-rc4+ #1
> > [  615.746226] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  615.790597] systemd-udevd   D    0  2247    745 0x80000086
> > [  615.821048] Call Trace:
> > [  615.833971]  ? __schedule+0x28d/0x870
> > [  615.854226]  schedule+0x32/0x80
> > [  615.871819]  io_schedule+0x12/0x40
> > [  615.890850]  __lock_page+0x105/0x150
> > [  615.910378]  ? page_cache_tree_insert+0xd0/0xd0
> > [  615.935932]  truncate_inode_pages_range+0x49d/0x810
> > [  615.963371]  __blkdev_put+0x73/0x200
> > [  615.983351]  blkdev_close+0x21/0x30
> > [  616.002910]  __fput+0xd5/0x210
> > [  616.019812]  task_work_run+0x82/0xa0
> > [  616.040506]  do_exit+0x2ce/0xb30
> > [  616.058622]  do_group_exit+0x39/0xa0
> > [  616.079506]  get_signal+0x1d0/0x5b0
> > [  616.100116]  do_signal+0x36/0x610
> > [  616.118889]  ? page_cache_tree_insert+0xd0/0xd0
> > [  616.144958]  exit_to_usermode_loop+0x47/0x93
> > [  616.169040]  do_syscall_64+0x172/0x1a0
> > [  616.190514]  entry_SYSCALL64_slow_path+0x25/0x25
> > [  616.216566] RIP: 0033:0x7fb247dd86f0
> > [  616.237395] RSP: 002b:00007ffea3c56cb8 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000000
> > [  616.279551] RAX: fffffffffffffffc RBX: 0000000000200000 RCX:
> > 00007fb247dd86f0
> > [  616.318057] RDX: 0000000000000400 RSI: 000055855eeefea8 RDI:
> > 0000000000000008
> > [  616.357320] RBP: 000055855eeefe80 R08: 0000000005231994 R09:
> > 0000000000000428
> > [  616.397099] R10: 0000000000000040 R11: 0000000000000246 R12:
> > 000055855eeea550
> > [  616.436999] R13: 0000000000000400 R14: 000055855eeea5a0 R15:
> > 000055855eeefe98
> > [  616.477831] INFO: task systemd-udevd:2248 blocked for more than 120
> > seconds.
> > [  616.518289]       Tainted: G          I      4.15.0-rc4+ #1
> > [  616.549323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> 
> I can trigger this hang too, looks it happens after starting multipathd,
> and 'lsblk' also complains:
> 
> 	lsblk: dm-0: failed to get device path
> 	lsblk: dm-0: failed to get device path

This one is caused by commit 7c3fb70f0341 ("block: rearrange a few request
fields for better cache layout"), and a fix has been sent out at the link:

	https://marc.info/?l=linux-block&m=151586577400558&w=2

Not sure if it is the same as Bart's report, but it is worth a test.

-- 
Ming

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCHSET v5] blk-mq: reimplement timeout handling
  2018-01-12 21:19     ` Bart Van Assche
  2018-01-12 21:55       ` Jens Axboe
@ 2018-01-14 15:12       ` jianchao.wang
  2018-01-14 15:45         ` Ming Lei
  1 sibling, 1 reply; 25+ messages in thread
From: jianchao.wang @ 2018-01-14 15:12 UTC (permalink / raw)
  To: Bart Van Assche, jbacik, tj, jack, clm, axboe
  Cc: kernel-team, linux-kernel, peterz, linux-btrfs, linux-block



On 01/13/2018 05:19 AM, Bart Van Assche wrote:
> Sorry but I only retrieved the blk-mq debugfs several minutes after the hang
> started so I'm not sure the state information is relevant. Anyway, I have attached
> it to this e-mail. The most remarkable part is the following:
> 
> ./000000009ddfa913/requeue_list:000000009646711c {.op=READ, .state=idle, gen=0x1
> 18, abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT, complete
> =0, .tag=-1, .internal_tag=217}
> 
> The hexadecimal number at the start is the request_queue pointer (I modified the
> blk-mq-debugfs code such that queues are registered with their address just after
> creation and until a name is assigned). This is a dm-mpath queue.

There seems to be something wrong with hctx->nr_active.
./sde/hctx2/cpu2/completed:2 3
./sde/hctx2/cpu2/merged:0
./sde/hctx2/cpu2/dispatched:2 3
./sde/hctx2/active:5

./sde/hctx1/cpu1/completed:2 38
./sde/hctx1/cpu1/merged:0
./sde/hctx1/cpu1/dispatched:2 38
./sde/hctx1/active:40

./sde/hctx0/cpu0/completed:20 11
./sde/hctx0/cpu0/merged:0
./sde/hctx0/cpu0/dispatched:20 11
./sde/hctx0/active:31
...
./sdc/hctx1/cpu1/completed:14 13
./sdc/hctx1/cpu1/merged:0
./sdc/hctx1/cpu1/dispatched:14 13
./sdc/hctx1/active:21

./sdc/hctx0/cpu0/completed:1 41
./sdc/hctx0/cpu0/merged:0
./sdc/hctx0/cpu0/dispatched:1 41
./sdc/hctx0/active:36
....
Then hctx_may_queue() returns false.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCHSET v5] blk-mq: reimplement timeout handling
  2018-01-14 15:12       ` jianchao.wang
@ 2018-01-14 15:45         ` Ming Lei
  0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2018-01-14 15:45 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Bart Van Assche, jbacik, tj, jack, clm, axboe, kernel-team,
	linux-kernel, peterz, linux-btrfs, linux-block

On Sun, Jan 14, 2018 at 11:12 PM, jianchao.wang
<jianchao.w.wang@oracle.com> wrote:
>
>
> On 01/13/2018 05:19 AM, Bart Van Assche wrote:
>> Sorry but I only retrieved the blk-mq debugfs several minutes after the hang
>> started so I'm not sure the state information is relevant. Anyway, I have attached
>> it to this e-mail. The most remarkable part is the following:
>>
>> ./000000009ddfa913/requeue_list:000000009646711c {.op=READ, .state=idle, gen=0x1
>> 18, abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT, complete
>> =0, .tag=-1, .internal_tag=217}
>>
>> The hexadecimal number at the start is the request_queue pointer (I modified the
>> blk-mq-debugfs code such that queues are registered with there address just after
>> creation and until a name is assigned). This is a dm-mpath queue.
>
> There seems to be something wrong in hctx->nr_active.

Then it looks like it is the same as the issue I saw when starting
multipathd, and the following patch should fix it, if there is no other
issue.

        https://marc.info/?l=linux-block&m=151586577400558&w=2

-- 
Ming Lei

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2018-01-14 15:45 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-09 16:29 [PATCHSET v5] blk-mq: reimplement timeout handling Tejun Heo
2018-01-09 16:29 ` [PATCH 1/8] blk-mq: move hctx lock/unlock into a helper Tejun Heo
2018-01-09 16:29 ` [PATCH 2/8] blk-mq: protect completion path with RCU Tejun Heo
2018-01-09 16:29 ` [PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme Tejun Heo
2018-01-09 16:29 ` [PATCH 4/8] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE Tejun Heo
2018-01-09 16:29 ` [PATCH 5/8] blk-mq: make blk_abort_request() trigger timeout path Tejun Heo
2018-01-09 16:29 ` [PATCH 6/8] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq Tejun Heo
2018-01-09 16:29 ` [PATCH 7/8] blk-mq: remove REQ_ATOM_STARTED Tejun Heo
2018-01-09 16:29 ` [PATCH 8/8] blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu Tejun Heo
2018-01-09 16:53 ` [PATCHSET v5] blk-mq: reimplement timeout handling Jens Axboe
2018-01-12 20:57 ` Bart Van Assche
2018-01-12 20:57   ` Bart Van Assche
2018-01-12 21:07   ` Jens Axboe
2018-01-12 21:19     ` Bart Van Assche
2018-01-12 21:55       ` Jens Axboe
2018-01-14 15:12       ` jianchao.wang
2018-01-14 15:45         ` Ming Lei
2018-01-12 21:55   ` Laurence Oberman
2018-01-12 21:55     ` Laurence Oberman
2018-01-13 12:39     ` Ming Lei
2018-01-13 12:39       ` Ming Lei
2018-01-13 14:45     ` Ming Lei
2018-01-13 14:45       ` Ming Lei
2018-01-13 17:59       ` Ming Lei
2018-01-13 17:59         ` Ming Lei
