All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER
@ 2018-04-15 15:43 Ming Lei
  2018-04-15 15:43 ` [PATCH V4 1/2] blk-mq: set RQF_MQ_TIMEOUT_EXPIRED when the rq's timeout isn't handled Ming Lei
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Ming Lei @ 2018-04-15 15:43 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Tejun Heo, Bart Van Assche, Martin Steigerwald, Israel Rukshin, Ming Lei

Hi Jens,

These two patches fix the recently discussed race between completion
and BLK_EH_RESET_TIMER.

Israel & Martin, this is a simpler fix for the issue, and it also covers
the potential hang of a MQ_RQ_COMPLETE_IN_TIMEOUT request. Could you
test V4 and see whether your issue is fixed?

Thanks,

V4:
	- run synchronize_rcu() once for handling all timed-out requests
	between .timeout() and the following handling
	- address tj's concern about reordering between blk_add_timer() and
	blk_mq_rq_update_aborted_gstate(req, 0)

V3:
	- before completing rq for BLK_EH_HANDLED, sync with normal completion path
	- make sure rq's state updated as MQ_RQ_IN_FLIGHT before completing
V2:
	- rename the new flag as MQ_RQ_COMPLETE_IN_TIMEOUT
	- fix lock uses in blk_mq_rq_timed_out
	- document re-order between blk_add_timer() and
	blk_mq_rq_update_aborted_gstate(req, 0)


Ming Lei (2):
  blk-mq: set RQF_MQ_TIMEOUT_EXPIRED when the rq's timeout isn't handled
  blk-mq: fix race between complete and BLK_EH_RESET_TIMER

 block/blk-mq.c         | 120 +++++++++++++++++++++++++++++++++++++++----------
 block/blk-mq.h         |   1 +
 block/blk-timeout.c    |   1 -
 include/linux/blkdev.h |   6 +++
 4 files changed, 104 insertions(+), 24 deletions(-)

-- 
2.9.5

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH V4 1/2] blk-mq: set RQF_MQ_TIMEOUT_EXPIRED when the rq's timeout isn't handled
  2018-04-15 15:43 [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER Ming Lei
@ 2018-04-15 15:43 ` Ming Lei
  2018-04-15 15:43 ` [PATCH V4 2/2] blk-mq: fix race between complete and BLK_EH_RESET_TIMER Ming Lei
  2018-04-15 16:31 ` [PATCH V4 0/2] blk-mq: fix race between completion " Martin Steigerwald
  2 siblings, 0 replies; 11+ messages in thread
From: Ming Lei @ 2018-04-15 15:43 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Tejun Heo, Bart Van Assche, Martin Steigerwald, Israel Rukshin,
	Ming Lei, jianchao.wang, Christoph Hellwig, Sagi Grimberg,
	Max Gurtovoy

Firstly, blk_mq_timeout_work() always runs exclusively for each queue.

Secondly, if .timeout() returns BLK_EH_RESET_TIMER or BLK_EH_HANDLED,
the RQF_MQ_TIMEOUT_EXPIRED flag would have to be cleared again for this
request, because the request will be requeued or freed for
BLK_EH_HANDLED, and for BLK_EH_RESET_TIMER a new timer has to be
started for this request.

So this patch only sets the RQF_MQ_TIMEOUT_EXPIRED flag when .timeout()
returns neither BLK_EH_RESET_TIMER nor BLK_EH_HANDLED.

Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Martin Steigerwald <Martin@Lichtvoll.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c      | 4 ++--
 block/blk-timeout.c | 1 -
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0dc9e341c2a7..d6a21898933d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -815,8 +815,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
-
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
@@ -834,9 +832,11 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		blk_add_timer(req);
 		break;
 	case BLK_EH_NOT_HANDLED:
+		req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
 		break;
 	default:
 		printk(KERN_ERR "block: bad eh return: %d\n", ret);
+		req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
 		break;
 	}
 }
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..f95d6e6cbc96 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -214,7 +214,6 @@ void blk_add_timer(struct request *req)
 		req->timeout = q->rq_timeout;
 
 	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
 
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH V4 2/2] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
  2018-04-15 15:43 [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER Ming Lei
  2018-04-15 15:43 ` [PATCH V4 1/2] blk-mq: set RQF_MQ_TIMEOUT_EXPIRED when the rq's timeout isn't handled Ming Lei
@ 2018-04-15 15:43 ` Ming Lei
  2018-04-15 16:31 ` [PATCH V4 0/2] blk-mq: fix race between completion " Martin Steigerwald
  2 siblings, 0 replies; 11+ messages in thread
From: Ming Lei @ 2018-04-15 15:43 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Tejun Heo, Bart Van Assche, Martin Steigerwald, Israel Rukshin,
	Ming Lei, jianchao.wang, Christoph Hellwig, Sagi Grimberg,
	Max Gurtovoy, stable

The normal request completion can happen before or during the handling
of BLK_EH_RESET_TIMER, and this race may cause the request to never be
completed, since the driver's .timeout() may keep returning
BLK_EH_RESET_TIMER.

This issue can't be fixed completely by the driver, since the normal
completion can happen between .timeout() returning and the handling of
BLK_EH_RESET_TIMER.

This patch fixes the race by introducing the rq state
MQ_RQ_COMPLETE_IN_TIMEOUT, and by reading/writing the rq's state while
holding the queue lock. The lock could actually be per-request, but it
isn't necessary to introduce one lock for such an unusual event.

Also handle the timed-out requests in two steps:

1) in the 1st step, call .timeout(), and reset the timer for
BLK_EH_RESET_TIMER

2) in the 2nd step, sync with the normal completion path by holding the
queue lock, to avoid the race between BLK_EH_RESET_TIMER and normal
completion.

Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: stable@vger.kernel.org
Cc: Martin Steigerwald <Martin@Lichtvoll.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 116 ++++++++++++++++++++++++++++++++++++++++---------
 block/blk-mq.h         |   1 +
 include/linux/blkdev.h |   6 +++
 3 files changed, 102 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d6a21898933d..9415e65302a8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -630,10 +630,27 @@ void blk_mq_complete_request(struct request *rq)
 	 * However, that would complicate paths which want to synchronize
 	 * against us.  Let stay in sync with the issue path so that
 	 * hctx_lock() covers both issue and completion paths.
+	 *
+	 * Cover complete vs BLK_EH_RESET_TIMER race in slow path with
+	 * holding queue lock.
 	 */
 	hctx_lock(hctx, &srcu_idx);
 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 		__blk_mq_complete_request(rq);
+	else {
+		unsigned long flags;
+		bool need_complete = false;
+
+		spin_lock_irqsave(q->queue_lock, flags);
+		if (!blk_mq_rq_aborted_gstate(rq))
+			need_complete = true;
+		else
+			blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_TIMEOUT);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+
+		if (need_complete)
+			__blk_mq_complete_request(rq);
+	}
 	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -810,7 +827,7 @@ struct blk_mq_timeout_data {
 	unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_pre_timed_out(struct request *req, bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -818,18 +835,44 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
+	if (ret == BLK_EH_RESET_TIMER)
+		blk_add_timer(req);
+
+	req->timeout_ret = ret;
+}
+
+static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+{
+	enum blk_eh_timer_return ret = req->timeout_ret;
+	unsigned long flags;
+
 	switch (ret) {
 	case BLK_EH_HANDLED:
+		spin_lock_irqsave(req->q->queue_lock, flags);
+ complete_rq:
+		if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE_IN_TIMEOUT)
+			blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
+		spin_unlock_irqrestore(req->q->queue_lock, flags);
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
 		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
+		 * The normal completion may happen during handling the
+		 * timeout, or even after returning from .timeout(), so
+		 * once the request has been completed, we can't reset
+		 * timer any more since this request may be handled as
+		 * BLK_EH_RESET_TIMER in next timeout handling too, and
+		 * it has to be completed in this situation.
+		 *
+		 * Holding the queue lock to cover read/write rq's
+		 * aborted_gstate and normal state, so the race can be
+		 * avoided completely.
 		 */
+		spin_lock_irqsave(req->q->queue_lock, flags);
 		blk_mq_rq_update_aborted_gstate(req, 0);
-		blk_add_timer(req);
+		if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE_IN_TIMEOUT)
+			goto complete_rq;
+		spin_unlock_irqrestore(req->q->queue_lock, flags);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
@@ -875,7 +918,7 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 	}
 }
 
-static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_prepare_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	/*
@@ -887,9 +930,40 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 	 */
 	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
 	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
+		blk_mq_rq_pre_timed_out(rq, reserved);
+}
+
+static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
+	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
 		blk_mq_rq_timed_out(rq, reserved);
 }
 
+static void blk_mq_timeout_synchronize_rcu(struct request_queue *q,
+		bool reset_expired)
+{
+	struct blk_mq_hw_ctx *hctx;
+	int i;
+	bool has_rcu = false;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (!hctx->nr_expired)
+			continue;
+
+		if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+			has_rcu = true;
+		else
+			synchronize_srcu(hctx->srcu);
+
+		if (reset_expired)
+			hctx->nr_expired = 0;
+	}
+	if (has_rcu)
+		synchronize_rcu();
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
@@ -899,8 +973,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
 		.next_set	= 0,
 		.nr_expired	= 0,
 	};
-	struct blk_mq_hw_ctx *hctx;
-	int i;
 
 	/* A deadlock might occur if a request is stuck requiring a
 	 * timeout at the same time a queue freeze is waiting
@@ -922,27 +994,26 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
 	if (data.nr_expired) {
-		bool has_rcu = false;
-
 		/*
 		 * Wait till everyone sees ->aborted_gstate.  The
 		 * sequential waits for SRCUs aren't ideal.  If this ever
 		 * becomes a problem, we can add per-hw_ctx rcu_head and
 		 * wait in parallel.
 		 */
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->nr_expired)
-				continue;
+		blk_mq_timeout_synchronize_rcu(q, false);
 
-			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-				has_rcu = true;
-			else
-				synchronize_srcu(hctx->srcu);
+		/* call .timeout() for timed-out requests */
+		blk_mq_queue_tag_busy_iter(q, blk_mq_prepare_expired, NULL);
 
-			hctx->nr_expired = 0;
-		}
-		if (has_rcu)
-			synchronize_rcu();
+		/*
+		 * If .timeout returns BLK_EH_HANDLED, wait till current
+		 * completion is done, for avoiding to update state on
+		 * completed request.
+		 *
+		 * If .timeout returns BLK_EH_RESET_TIMER, wait till
+		 * blk_add_timer() is commited before completing this rq.
+		 */
+		blk_mq_timeout_synchronize_rcu(q, true);
 
 		/* terminate the ones we won */
 		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
@@ -952,6 +1023,9 @@ static void blk_mq_timeout_work(struct work_struct *work)
 		data.next = blk_rq_timeout(round_jiffies_up(data.next));
 		mod_timer(&q->timeout, data.next);
 	} else {
+		struct blk_mq_hw_ctx *hctx;
+		int i;
+
 		/*
 		 * Request timeouts are handled as a forward rolling timer. If
 		 * we end up here it means that no requests are pending and
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..0426d048743d 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -35,6 +35,7 @@ enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
 	MQ_RQ_COMPLETE		= 2,
+	MQ_RQ_COMPLETE_IN_TIMEOUT	= 3,
 
 	MQ_RQ_STATE_BITS	= 2,
 	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9af3e0f430bc..8278f67d39a6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -252,8 +252,14 @@ struct request {
 	struct list_head timeout_list;
 
 	union {
+		/* used after completion */
 		struct __call_single_data csd;
+
+		/* used in io scheduler, before dispatch */
 		u64 fifo_time;
+
+		/* used after dispatch and before completion */
+		int timeout_ret;
 	};
 
 	/*
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER
  2018-04-15 15:43 [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER Ming Lei
  2018-04-15 15:43 ` [PATCH V4 1/2] blk-mq: set RQF_MQ_TIMEOUT_EXPIRED when the rq's timeout isn't handled Ming Lei
  2018-04-15 15:43 ` [PATCH V4 2/2] blk-mq: fix race between complete and BLK_EH_RESET_TIMER Ming Lei
@ 2018-04-15 16:31 ` Martin Steigerwald
  2018-04-16  0:45   ` Ming Lei
  2 siblings, 1 reply; 11+ messages in thread
From: Martin Steigerwald @ 2018-04-15 16:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Tejun Heo, Bart Van Assche, Israel Rukshin

Hi Ming.

Ming Lei - 15.04.18, 17:43:
> Hi Jens,
>
> This two patches fixes the recently discussed race between completion
> and BLK_EH_RESET_TIMER.
>
> Israel & Martin, this one is a simpler fix on this issue and can
> cover the potencial hang of MQ_RQ_COMPLETE_IN_TIMEOUT request, could
> you test V4 and see if your issue can be fixed?

In replacement of all the three other patches I applied?

- '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a
request.mbox'

- '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into
rcu_read_{lock,unlock}().mbox'

- '[PATCH v4] blk-mq_Fix race conditions in request timeout
handling.mbox'

These patches worked reliably so far both for the hang on boot and error
reading SMART data.

I'd compile a kernel tomorrow or Tuesday I think.

> V4:
> 	- run synchronize_rcu() once for handling all timed out request
> 	between .timeout() and the following handling
> 	- address tj's concern about reorder between blk_add_timer() and
> 	blk_mq_rq_update_aborted_gstate(req, 0)
>
> V3:
> 	- before completing rq for BLK_EH_HANDLED, sync with normal completion path
> 	- make sure rq's state updated as MQ_RQ_IN_FLIGHT before completing
> V2:
> 	- rename the new flag as MQ_RQ_COMPLETE_IN_TIMEOUT
> 	- fix lock uses in blk_mq_rq_timed_out
> 	- document re-order between blk_add_timer() and
> 	blk_mq_rq_update_aborted_gstate(req, 0)
>
>
> Ming Lei (2):
>   blk-mq: set RQF_MQ_TIMEOUT_EXPIRED when the rq's timeout isn't handled
>   blk-mq: fix race between complete and BLK_EH_RESET_TIMER
>
>  block/blk-mq.c         | 120 +++++++++++++++++++++++++++++++++++++++----------
>  block/blk-mq.h         |   1 +
>  block/blk-timeout.c    |   1 -
>  include/linux/blkdev.h |   6 +++
>  4 files changed, 104 insertions(+), 24 deletions(-)


-- 
Martin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER
  2018-04-15 16:31 ` [PATCH V4 0/2] blk-mq: fix race between completion " Martin Steigerwald
@ 2018-04-16  0:45   ` Ming Lei
  2018-04-16 13:12     ` Martin Steigerwald
  0 siblings, 1 reply; 11+ messages in thread
From: Ming Lei @ 2018-04-16  0:45 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: Jens Axboe, linux-block, Tejun Heo, Bart Van Assche, Israel Rukshin

On Sun, Apr 15, 2018 at 06:31:44PM +0200, Martin Steigerwald wrote:
> Hi Ming.
> 
> Ming Lei - 15.04.18, 17:43:
> > Hi Jens,
> > 
> > This two patches fixes the recently discussed race between completion
> > and BLK_EH_RESET_TIMER.
> > 
> > Israel & Martin, this one is a simpler fix on this issue and can
> > cover the potencial hang of MQ_RQ_COMPLETE_IN_TIMEOUT request, could
> > you test V4 and see if your issue can be fixed?
> 
> In replacement of all the three other patches I applied?
> 
> - '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a 
> request.mbox'
> 
> - '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into 
> rcu_read_{lock,unlock}().mbox'
> 
> - '[PATCH v4] blk-mq_Fix race conditions in request timeout 
> handling.mbox'

You only need to replace the above one '[PATCH v4] blk-mq_Fix race
conditions in request timeout' with V4 in this thread.

> 
> These patches worked reliably so far both for the hang on boot and error 
> reading SMART data.

And you may see the reason in the following thread:

https://marc.info/?l=linux-block&m=152366441625786&w=2

> 
> I'd compile a kernel tomorrow or Tuesday I think.

Thanks!

--
Ming

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER
  2018-04-16  0:45   ` Ming Lei
@ 2018-04-16 13:12     ` Martin Steigerwald
  2018-04-16 16:04       ` jianchao.wang
  2018-04-18 16:46       ` Ming Lei
  0 siblings, 2 replies; 11+ messages in thread
From: Martin Steigerwald @ 2018-04-16 13:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Tejun Heo, Bart Van Assche, Israel Rukshin

Ming Lei - 16.04.18, 02:45:
> On Sun, Apr 15, 2018 at 06:31:44PM +0200, Martin Steigerwald wrote:
> > Hi Ming.
> >
> > Ming Lei - 15.04.18, 17:43:
> > > Hi Jens,
> > >
> > > This two patches fixes the recently discussed race between
> > > completion and BLK_EH_RESET_TIMER.
> > >
> > > Israel & Martin, this one is a simpler fix on this issue and can
> > > cover the potencial hang of MQ_RQ_COMPLETE_IN_TIMEOUT request,
> > > could you test V4 and see if your issue can be fixed?
> >
> > In replacement of all the three other patches I applied?
> >
> > - '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a
> > request.mbox'
> >
> > - '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair
> > into rcu_read_{lock,unlock}().mbox'
> >
> > - '[PATCH v4] blk-mq_Fix race conditions in request timeout
> > handling.mbox'
>
> You only need to replace the above one '[PATCH v4] blk-mq_Fix race
> conditions in request timeout' with V4 in this thread.

Ming, a 4.16.2 with the patches:

'[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a
request.mbox'
'[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into
rcu_read_{lock,unlock}().mbox'
'[PATCH V4 1_2] blk-mq_set RQF_MQ_TIMEOUT_EXPIRED when the rq'\''s
timeout isn'\''t handled.mbox'
'[PATCH V4 2_2] blk-mq_fix race between complete and
BLK_EH_RESET_TIMER.mbox'

hung on boot 3 out of 4 times.

See

[Possible REGRESSION, 4.16-rc4] Error updating SMART data during runtime
and boot failures with blk_mq_terminate_expired in backtrace
https://bugzilla.kernel.org/show_bug.cgi?id=199077#c13

I tried to add your mail address to Cc of the bug report, but Bugzilla
did not know it.

Fortunately it booted on the fourth attempt, cause I forgot my GRUB
password.

Reverting back to previous 4.16.1 kernel with patches from Bart.

> > These patches worked reliably so far both for the hang on boot and
> > error reading SMART data.
>
> And you may see the reason in the following thread:
>
> https://marc.info/?l=linux-block&m=152366441625786&w=2

So requests could never be completed?

> > I'd compile a kernel tomorrow or Tuesday I think.

Thanks,
-- 
Martin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER
  2018-04-16 13:12     ` Martin Steigerwald
@ 2018-04-16 16:04       ` jianchao.wang
  2018-04-17  0:15         ` Bart Van Assche
  2018-04-18 16:46       ` Ming Lei
  1 sibling, 1 reply; 11+ messages in thread
From: jianchao.wang @ 2018-04-16 16:04 UTC (permalink / raw)
  To: Martin Steigerwald, Ming Lei
  Cc: Jens Axboe, linux-block, Tejun Heo, Bart Van Assche, Israel Rukshin

Hi Martin and Ming

Regarding the issue "RIP: scsi_times_out+0x17":

rq->gstate and rq->aborted_gstate are both zero until a request is
started for the first time. It looks like the SCSI timeout value in
Martin's system is small. When the request_queue timer fires, if there
is a request that has been allocated but not yet started, its rq->gstate
and rq->aborted_gstate are both still 0, so in

static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
		struct request *rq, void *priv, bool reserved)
{
	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
		blk_mq_rq_timed_out(rq, reserved);
}

blk_mq_terminate_expired will identify the request as timed out and
invoke scsi_times_out. At that moment the scsi_cmnd is not yet
initialized, so scsi_cmnd->device is NULL and we get the crash.

Maybe we could try this:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 16e83e6..be9b435 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2077,6 +2077,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 
        seqcount_init(&rq->gstate_seq);
        u64_stats_init(&rq->aborted_gstate_sync);
+       WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
        return 0;
 }

Thanks
Jianchao

On 04/16/2018 09:12 PM, Martin Steigerwald wrote:
> Ming Lei - 16.04.18, 02:45:
>> On Sun, Apr 15, 2018 at 06:31:44PM +0200, Martin Steigerwald wrote:
>>> Hi Ming.
>>>
>>> Ming Lei - 15.04.18, 17:43:
>>>> Hi Jens,
>>>>
>>>> This two patches fixes the recently discussed race between
>>>> completion
>>>> and BLK_EH_RESET_TIMER.
>>>>
>>>> Israel & Martin, this one is a simpler fix on this issue and can
>>>> cover the potencial hang of MQ_RQ_COMPLETE_IN_TIMEOUT request,
>>>> could
>>>> you test V4 and see if your issue can be fixed?
>>>
>>> In replacement of all the three other patches I applied?
>>>
>>> - '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a
>>> request.mbox'
>>>
>>> - '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair
>>> into rcu_read_{lock,unlock}().mbox'
>>>
>>> - '[PATCH v4] blk-mq_Fix race conditions in request timeout
>>> handling.mbox'
>>
>> You only need to replace the above one '[PATCH v4] blk-mq_Fix race
>> conditions in request timeout' with V4 in this thread.
> 
> Ming, a 4.16.2 with the patches:
> 
> '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a 
> request.mbox'
> '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into 
> rcu_read_{lock,unlock}().mbox'
> '[PATCH V4 1_2] blk-mq_set RQF_MQ_TIMEOUT_EXPIRED when the rq'\''s 
> timeout isn'\''t handled.mbox'
> '[PATCH V4 2_2] blk-mq_fix race between complete and 
> BLK_EH_RESET_TIMER.mbox'
> 
> hung on boot 3 out of 4 times.
> 
> See
> 
> [Possible REGRESSION, 4.16-rc4] Error updating SMART data during runtime 
> and boot failures with blk_mq_terminate_expired in backtrace
> https://bugzilla.kernel.org/show_bug.cgi?id=199077#c13
> 
> I tried to add your mail address to Cc of the bug report, but Bugzilla 
> did not know it.
> 
> Fortunately it booted on the fourth attempt, cause I forgot my GRUB 
> password.
> 
> Reverting back to previous 4.16.1 kernel with patches from Bart.
> 
>>> These patches worked reliably so far both for the hang on boot and
>>> error reading SMART data.
>>
>> And you may see the reason in the following thread:
>>
>> https://marc.info/?l=linux-block&m=152366441625786&w=2
> 
> So requests could never be completed?
> 
>>> I´d compile a kernel tomorrow or Tuesday I think.
> 
> Thanks,
> 

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER
  2018-04-16 16:04       ` jianchao.wang
@ 2018-04-17  0:15         ` Bart Van Assche
  2018-04-17  3:49           ` jianchao.wang
  0 siblings, 1 reply; 11+ messages in thread
From: Bart Van Assche @ 2018-04-17  0:15 UTC (permalink / raw)
  To: martin, jianchao.w.wang, ming.lei; +Cc: tj, israelr, linux-block, axboe

On Tue, 2018-04-17 at 00:04 +0800, jianchao.wang wrote:
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 16e83e6..be9b435 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2077,6 +2077,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
>  
>         seqcount_init(&rq->gstate_seq);
>         u64_stats_init(&rq->aborted_gstate_sync);
> +       WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
>         return 0;
>  }

Hello Jianchao,

Your approach looks interesting to me. Can you send an official patch to Jens?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER
  2018-04-17  0:15         ` Bart Van Assche
@ 2018-04-17  3:49           ` jianchao.wang
  0 siblings, 0 replies; 11+ messages in thread
From: jianchao.wang @ 2018-04-17  3:49 UTC (permalink / raw)
  To: Bart Van Assche, martin, ming.lei; +Cc: tj, israelr, linux-block, axboe

Hi bart

Thanks for your kindly response.
I have sent out the patch. Please refer to
https://marc.info/?l=linux-block&m=152393666517449&w=2

Thanks
Jianchao

On 04/17/2018 08:15 AM, Bart Van Assche wrote:
> On Tue, 2018-04-17 at 00:04 +0800, jianchao.wang wrote:
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 16e83e6..be9b435 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2077,6 +2077,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
>>  
>>         seqcount_init(&rq->gstate_seq);
>>         u64_stats_init(&rq->aborted_gstate_sync);
>> +       WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
>>         return 0;
>>  }
> 
> Hello Jianchao,
> 
> Your approach looks interesting to me. Can you send an official patch to Jens?
> 
> Thanks,
> 
> Bart.
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER
  2018-04-16 13:12     ` Martin Steigerwald
  2018-04-16 16:04       ` jianchao.wang
@ 2018-04-18 16:46       ` Ming Lei
  2018-04-23  8:41         ` Martin Steigerwald
  1 sibling, 1 reply; 11+ messages in thread
From: Ming Lei @ 2018-04-18 16:46 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: Jens Axboe, linux-block, Tejun Heo, Bart Van Assche, Israel Rukshin

On Mon, Apr 16, 2018 at 03:12:30PM +0200, Martin Steigerwald wrote:
> Ming Lei - 16.04.18, 02:45:
> > On Sun, Apr 15, 2018 at 06:31:44PM +0200, Martin Steigerwald wrote:
> > > Hi Ming.
> > > 
> > > Ming Lei - 15.04.18, 17:43:
> > > > Hi Jens,
> > > > 
> > > > This two patches fixes the recently discussed race between
> > > > completion
> > > > and BLK_EH_RESET_TIMER.
> > > > 
> > > > Israel & Martin, this one is a simpler fix on this issue and can
> > > > cover the potencial hang of MQ_RQ_COMPLETE_IN_TIMEOUT request,
> > > > could
> > > > you test V4 and see if your issue can be fixed?
> > > 
> > > In replacement of all the three other patches I applied?
> > > 
> > > - '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a
> > > request.mbox'
> > > 
> > > - '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair
> > > into rcu_read_{lock,unlock}().mbox'
> > > 
> > > - '[PATCH v4] blk-mq_Fix race conditions in request timeout
> > > handling.mbox'
> > 
> > You only need to replace the above one '[PATCH v4] blk-mq_Fix race
> > conditions in request timeout' with V4 in this thread.
> 
> Ming, a 4.16.2 with the patches:
> 
> '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a 
> request.mbox'
> '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into 
> rcu_read_{lock,unlock}().mbox'
> '[PATCH V4 1_2] blk-mq_set RQF_MQ_TIMEOUT_EXPIRED when the rq'\''s 
> timeout isn'\''t handled.mbox'
> '[PATCH V4 2_2] blk-mq_fix race between complete and 
> BLK_EH_RESET_TIMER.mbox'
> 
> hung on boot 3 out of 4 times.
> 
> See
> 
> [Possible REGRESSION, 4.16-rc4] Error updating SMART data during runtime 
> and boot failures with blk_mq_terminate_expired in backtrace
> https://bugzilla.kernel.org/show_bug.cgi?id=199077#c13
> 
> I tried to add your mail address to Cc of the bug report, but Bugzilla 
> did not know it.
> 
> Fortunately it booted on the fourth attempt, cause I forgot my GRUB 
> password.
> 
> Reverting back to previous 4.16.1 kernel with patches from Bart.
> 
> > > These patches worked reliably so far both for the hang on boot and
> > > error reading SMART data.
> > 
> > And you may see the reason in the following thread:
> > 
> > https://marc.info/?l=linux-block&m=152366441625786&w=2
> 
> So requests could never be completed?

Yes.

I guess Jianchao's patch("[PATCH] blk-mq: start request gstate with gen 1")
may work for your issue because you are using blk_abort_request().

If it doesn't, please try the following V5 together with the other two.
V5 fixes one big issue: the new rq state shouldn't be introduced,
otherwise timeout handling is easily broken.

I have tested V5 by running blktests (block/011), which covers all of
these code paths: EH_HANDLED, EH_RESET_TIMER, and normal completion
during timeout. Patch V5 works as expected.

--

From e81da316a953db999d155d08143fd5722b44e79e Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Thu, 12 Apr 2018 04:23:09 +0800
Subject: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

The normal request completion can be done before or during handling
BLK_EH_RESET_TIMER, and this race may cause the request to never
be completed, since the driver's .timeout() may always return
BLK_EH_RESET_TIMER.

This issue can't be fixed completely by the driver, since the normal
completion can happen between .timeout() returning and the handling of
BLK_EH_RESET_TIMER.

This patch fixes the race by introducing the request flag
RQF_MQ_COMPLETE_IN_EH, and by reading/writing the rq's state while
holding the queue lock. The lock could actually be per-request, but it
is not worth introducing a dedicated lock for such an unusual event.

Also handle timed-out requests in two steps:

1) in the 1st step, call .timeout() and, for BLK_EH_RESET_TIMER, reset
the timer

2) in the 2nd step, sync with the normal completion path by holding the
queue lock, to avoid the race between BLK_EH_RESET_TIMER and normal
completion.

Another change is that, with the help of the queue lock, a request is
now always handled exclusively by the timeout path.

Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: stable@vger.kernel.org
Cc: Martin Steigerwald <Martin@Lichtvoll.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 139 +++++++++++++++++++++++++++++++++++++++----------
 include/linux/blkdev.h |  13 +++++
 2 files changed, 125 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0dc9e341c2a7..b70b00910b41 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -630,10 +630,27 @@ void blk_mq_complete_request(struct request *rq)
 	 * However, that would complicate paths which want to synchronize
 	 * against us.  Let stay in sync with the issue path so that
 	 * hctx_lock() covers both issue and completion paths.
+	 *
+	 * Cover complete vs BLK_EH_RESET_TIMER race in slow path with
+	 * holding queue lock.
 	 */
 	hctx_lock(hctx, &srcu_idx);
 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 		__blk_mq_complete_request(rq);
+	else {
+		unsigned long flags;
+		bool need_complete = false;
+
+		spin_lock_irqsave(q->queue_lock, flags);
+		if (!blk_mq_rq_aborted_gstate(rq))
+			need_complete = true;
+		else
+			rq->rq_flags |= RQF_MQ_COMPLETE_IN_EH;
+		spin_unlock_irqrestore(q->queue_lock, flags);
+
+		if (need_complete)
+			__blk_mq_complete_request(rq);
+	}
 	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -810,28 +827,50 @@ struct blk_mq_timeout_data {
 	unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_pre_timed_out(struct request *req, bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
-
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
+	if (ret == BLK_EH_RESET_TIMER)
+		blk_add_timer(req);
+	req->eh_data.ret = ret;
+}
+
+static void blk_mq_rq_timed_out(struct request *req)
+{
+	enum blk_eh_timer_return ret = req->eh_data.ret;
+
 	switch (ret) {
 	case BLK_EH_HANDLED:
+		spin_lock_irq(req->q->queue_lock);
+ complete_rq:
+		if (req->rq_flags & RQF_MQ_COMPLETE_IN_EH)
+			req->rq_flags &= ~RQF_MQ_COMPLETE_IN_EH;
+		spin_unlock_irq(req->q->queue_lock);
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
 		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
+		 * The normal completion may happen during the timeout
+		 * handling, or even after .timeout() has returned, so
+		 * once the request has been completed we must not reset
+		 * the timer any more: the next round of timeout handling
+		 * might return BLK_EH_RESET_TIMER again, while the
+		 * request has to be completed in this situation.
+		 *
+		 * Hold the queue lock to cover reads/writes of the rq's
+		 * aborted_gstate and normal state, so this race is
+		 * avoided completely.
 		 */
+		spin_lock_irq(req->q->queue_lock);
+		if (req->rq_flags & RQF_MQ_COMPLETE_IN_EH)
+			goto complete_rq;
 		blk_mq_rq_update_aborted_gstate(req, 0);
-		blk_add_timer(req);
+		spin_unlock_irq(req->q->queue_lock);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -847,10 +886,15 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 	struct blk_mq_timeout_data *data = priv;
 	unsigned long gstate, deadline;
 	int start;
+	bool expired;
 
 	might_sleep();
 
-	if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
+	spin_lock_irq(rq->q->queue_lock);
+	expired = rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED;
+	spin_unlock_irq(rq->q->queue_lock);
+
+	if (expired)
 		return;
 
 	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
@@ -875,19 +919,53 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 	}
 }
 
-static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_prepare_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
+	struct list_head *list = priv;
 	/*
 	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
 	 * completions that we lost to, they would have finished and
 	 * updated @rq->gstate by now; otherwise, the completion path is
 	 * now guaranteed to see @rq->aborted_gstate and yield.  If
 	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
+	 *
+	 * One request is always handled as time-out exclusively with
+	 * queue lock.
 	 */
+	spin_lock_irq(rq->q->queue_lock);
 	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
-	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
-		blk_mq_rq_timed_out(rq, reserved);
+	    READ_ONCE(rq->gstate) == rq->aborted_gstate) {
+		rq->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
+		spin_unlock_irq(rq->q->queue_lock);
+
+		blk_mq_rq_pre_timed_out(rq, reserved);
+		list_add_tail(&rq->eh_data.node, list);
+	} else
+		spin_unlock_irq(rq->q->queue_lock);
+}
+
+static void blk_mq_timeout_synchronize_rcu(struct request_queue *q,
+		bool reset_expired)
+{
+	struct blk_mq_hw_ctx *hctx;
+	int i;
+	bool has_rcu = false;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (!hctx->nr_expired)
+			continue;
+
+		if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+			has_rcu = true;
+		else
+			synchronize_srcu(hctx->srcu);
+
+		if (reset_expired)
+			hctx->nr_expired = 0;
+	}
+	if (has_rcu)
+		synchronize_rcu();
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
@@ -899,8 +977,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
 		.next_set	= 0,
 		.nr_expired	= 0,
 	};
-	struct blk_mq_hw_ctx *hctx;
-	int i;
 
 	/* A deadlock might occur if a request is stuck requiring a
 	 * timeout at the same time a queue freeze is waiting
@@ -922,7 +998,8 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
 	if (data.nr_expired) {
-		bool has_rcu = false;
+		struct request *rq;
+		LIST_HEAD(rq_list);
 
 		/*
 		 * Wait till everyone sees ->aborted_gstate.  The
@@ -930,28 +1007,36 @@ static void blk_mq_timeout_work(struct work_struct *work)
 		 * becomes a problem, we can add per-hw_ctx rcu_head and
 		 * wait in parallel.
 		 */
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->nr_expired)
-				continue;
+		blk_mq_timeout_synchronize_rcu(q, false);
 
-			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-				has_rcu = true;
-			else
-				synchronize_srcu(hctx->srcu);
+		/* call .timeout() for timed-out requests */
+		blk_mq_queue_tag_busy_iter(q, blk_mq_prepare_expired, &rq_list);
 
-			hctx->nr_expired = 0;
-		}
-		if (has_rcu)
-			synchronize_rcu();
+		/*
+		 * If .timeout returns BLK_EH_HANDLED, wait till the current
+		 * completion is done, to avoid updating the state of an
+		 * already completed request.
+		 *
+		 * If .timeout returns BLK_EH_RESET_TIMER, wait till
+		 * blk_add_timer() is committed before completing this rq.
+		 */
+		blk_mq_timeout_synchronize_rcu(q, true);
 
-		/* terminate the ones we won */
-		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
+		/* terminate timed-out requests */
+		while (!list_empty(&rq_list)) {
+			rq = list_entry(rq_list.next, struct request, eh_data.node);
+			list_del_init(&rq->eh_data.node);
+			blk_mq_rq_timed_out(rq);
+		}
 	}
 
 	if (data.next_set) {
 		data.next = blk_rq_timeout(round_jiffies_up(data.next));
 		mod_timer(&q->timeout, data.next);
 	} else {
+		struct blk_mq_hw_ctx *hctx;
+		int i;
+
 		/*
 		 * Request timeouts are handled as a forward rolling timer. If
 		 * we end up here it means that no requests are pending and
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9af3e0f430bc..704d54ce7702 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -129,11 +129,18 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
 /* already slept for hybrid poll */
 #define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
+/* completion happened while the request was being handled in EH */
+#define RQF_MQ_COMPLETE_IN_EH	((__force req_flags_t)(1 << 22))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
 	(RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
 
+struct req_eh_data {
+	int ret;
+	struct list_head node;
+};
+
 /*
  * Try to put the fields that are referenced together in the same cacheline.
  *
@@ -252,8 +259,14 @@ struct request {
 	struct list_head timeout_list;
 
 	union {
+		/* used after completion */
 		struct __call_single_data csd;
+
+		/* used in io scheduler, before dispatch */
 		u64 fifo_time;
+
+		/* used after dispatch and before completion */
+		struct req_eh_data eh_data;
 	};
 
 	/*
-- 
2.9.5

-- 
Ming

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER
  2018-04-18 16:46       ` Ming Lei
@ 2018-04-23  8:41         ` Martin Steigerwald
  0 siblings, 0 replies; 11+ messages in thread
From: Martin Steigerwald @ 2018-04-23  8:41 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Tejun Heo, Bart Van Assche, Israel Rukshin

Hello Ming.

Ming Lei - 18.04.18, 18:46:
> On Mon, Apr 16, 2018 at 03:12:30PM +0200, Martin Steigerwald wrote:
> > Ming Lei - 16.04.18, 02:45:
> > > > On Sun, Apr 15, 2018 at 06:31:44PM +0200, Martin Steigerwald wrote:
> > > > Hi Ming.
> > > > 
> > > > Ming Lei - 15.04.18, 17:43:
> > > > > Hi Jens,
> > > > > 
> > > > > This two patches fixes the recently discussed race between
> > > > > completion
> > > > > and BLK_EH_RESET_TIMER.
> > > > > 
> > > > > Israel & Martin, this one is a simpler fix on this issue and
> > > > > can
> > > > > cover the potencial hang of MQ_RQ_COMPLETE_IN_TIMEOUT request,
> > > > > could
> > > > > you test V4 and see if your issue can be fixed?
> > > > 
> > > > In replacement of all the three other patches I applied?
> > > > 
> > > > - '[PATCH] blk-mq_Directly schedule q->timeout_work when
> > > > aborting a
> > > > request.mbox'
> > > > 
> > > > - '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched()
> > > > pair
> > > > into rcu_read_{lock,unlock}().mbox'
> > > > 
> > > > - '[PATCH v4] blk-mq_Fix race conditions in request timeout
> > > > handling.mbox'
> > > 
> > > You only need to replace the above one '[PATCH v4] blk-mq_Fix race
> > > conditions in request timeout' with V4 in this thread.
> > 
> > Ming, a 4.16.2 with the patches:
> > 
> > '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a
> > request.mbox'
> > '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into
> > rcu_read_{lock,unlock}().mbox'
> > '[PATCH V4 1_2] blk-mq_set RQF_MQ_TIMEOUT_EXPIRED when the rq'\''s
> > timeout isn'\''t handled.mbox'
> > '[PATCH V4 2_2] blk-mq_fix race between complete and
> > BLK_EH_RESET_TIMER.mbox'
> > 
> > hung on boot 3 out of 4 times.
> > 
> > See
> > 
> > [Possible REGRESSION, 4.16-rc4] Error updating SMART data during
> > runtime and boot failures with blk_mq_terminate_expired in
> > backtrace https://bugzilla.kernel.org/show_bug.cgi?id=199077#c13
> > 
> > I tried to add your mail address to Cc of the bug report, but
> > Bugzilla did not know it.
> > 
> > Fortunately it booted on the fourth attempt, cause I forgot my GRUB
> > password.
> > 
> > Reverting back to previous 4.16.1 kernel with patches from Bart.
> > 
> > > > These patches worked reliably so far both for the hang on boot
> > > > and
> > > > error reading SMART data.
> > > 
> > > And you may see the reason in the following thread:
> > > 
> > > https://marc.info/?l=linux-block&m=152366441625786&w=2
> > 
> > So requests could never be completed?
> 
> Yes.
> 
> I guess Jianchao's patch("[PATCH] blk-mq: start request gstate with
> gen 1") may work for your issue because you are using
> blk_abort_request().
> 
> If it doesn't, please try the following V5 together the other two, and
> V5 fixes one big issue, in which the new rq state shouldn't be
> introduced, otherwise timeout is broken easily.
> 
> I have tested V5 by running blktests(block/011), in which both these
> code paths are covered: EH_HANDLED, EH_RESET_TIMER, and normal
> completion during timeout, and the patch V5 works as expected.

I tested 4.16.3 with just Jianchao's patch "[PATCH] blk-mq: start 
request gstate with gen 1" (+ the unrelated btrfs trimming fix I have 
carried for a long time already) and it managed at least 15 boots 
successfully (without hanging). So far there has also been no "error 
updating SMART data" mail, but it will take a few days of 
suspend/hibernation + resume cycles to know for sure.

So if I read your mail correctly, there is no need to test your V5 
patch.

Thanks,
Martin

> [...]

-- 
Martin

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-04-23  8:41 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-15 15:43 [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER Ming Lei
2018-04-15 15:43 ` [PATCH V4 1/2] blk-mq: set RQF_MQ_TIMEOUT_EXPIRED when the rq's timeout isn't handled Ming Lei
2018-04-15 15:43 ` [PATCH V4 2/2] blk-mq: fix race between complete and BLK_EH_RESET_TIMER Ming Lei
2018-04-15 16:31 ` [PATCH V4 0/2] blk-mq: fix race between completion " Martin Steigerwald
2018-04-16  0:45   ` Ming Lei
2018-04-16 13:12     ` Martin Steigerwald
2018-04-16 16:04       ` jianchao.wang
2018-04-17  0:15         ` Bart Van Assche
2018-04-17  3:49           ` jianchao.wang
2018-04-18 16:46       ` Ming Lei
2018-04-23  8:41         ` Martin Steigerwald
