All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
@ 2018-04-11 20:55 Ming Lei
  2018-04-11 21:30 ` Tejun Heo
  2018-04-11 22:49   ` Bart Van Assche
  0 siblings, 2 replies; 10+ messages in thread
From: Ming Lei @ 2018-04-11 20:55 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Ming Lei, Bart Van Assche, Tejun Heo, Christoph Hellwig,
	Sagi Grimberg, Israel Rukshin, Max Gurtovoy, stable

The normal request completion can be done before or during handling
BLK_EH_RESET_TIMER, and this race may cause the request to never
be completed since driver's .timeout() may always return
BLK_EH_RESET_TIMER.

This issue can't be fixed completely by driver, since the normal
completion can be done between returning .timeout() and handing
BLK_EH_RESET_TIMER.

This patch fixes this race by introducing rq state of MQ_RQ_COMPLETE_IN_RESET,
and reading/writing rq's state by holding queue lock, which can be
per-request actually, but just not necessary to introduce one lock for
so unusual event.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Israel Rukshin <israelr@mellanox.com>,
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---

This is another way to fix this long-time issue, and turns out this
solution is much simpler.

 block/blk-mq.c | 44 +++++++++++++++++++++++++++++++++++++++-----
 block/blk-mq.h |  1 +
 2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0dc9e341c2a7..12e8850e3905 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -630,10 +630,27 @@ void blk_mq_complete_request(struct request *rq)
 	 * However, that would complicate paths which want to synchronize
 	 * against us.  Let stay in sync with the issue path so that
 	 * hctx_lock() covers both issue and completion paths.
+	 *
+	 * Cover complete vs BLK_EH_RESET_TIMER race in slow path with
+	 * helding queue lock.
 	 */
 	hctx_lock(hctx, &srcu_idx);
 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 		__blk_mq_complete_request(rq);
+	else {
+		unsigned long flags;
+		bool need_complete = false;
+
+		spin_lock_irqsave(q->queue_lock, flags);
+		if (!blk_mq_rq_aborted_gstate(rq))
+			need_complete = true;
+		else
+			blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_RESET);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+
+		if (need_complete)
+			__blk_mq_complete_request(rq);
+	}
 	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -814,24 +831,41 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
+	unsigned long flags;
 
 	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
 
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
+again:
 	switch (ret) {
 	case BLK_EH_HANDLED:
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
 		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
+		 * The normal completion may happen during handling the
+		 * timeout, or even after returning from .timeout(), so
+		 * once the request has been completed, we can't reset
+		 * timer any more since this request may be handled as
+		 * BLK_EH_RESET_TIMER in next timeout handling too, and
+		 * it has to be completed in this situation.
+		 *
+		 * Holding the queue lock to cover read/write rq's
+		 * aborted_gstate and normal state, so the race can be
+		 * avoided completely.
 		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
-		blk_add_timer(req);
+		spin_lock_irqsave(req->q->queue_lock, flags);
+		if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
+			blk_mq_rq_update_aborted_gstate(req, 0);
+			blk_add_timer(req);
+		} else {
+			blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
+			ret = BLK_EH_HANDLED;
+			goto again;
+		}
+		spin_unlock_irqrestore(req->q->queue_lock, flags);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..6dc242fc785a 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -35,6 +35,7 @@ enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
 	MQ_RQ_COMPLETE		= 2,
+	MQ_RQ_COMPLETE_IN_RESET	= 3,
 
 	MQ_RQ_STATE_BITS	= 2,
 	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
  2018-04-11 20:55 [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER Ming Lei
@ 2018-04-11 21:30 ` Tejun Heo
  2018-04-11 22:43   ` Ming Lei
  2018-04-11 22:49   ` Bart Van Assche
  1 sibling, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2018-04-11 21:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Christoph Hellwig,
	Sagi Grimberg, Israel Rukshin, Max Gurtovoy, stable

Hello, Ming.

On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote:
...
> +		spin_lock_irqsave(req->q->queue_lock, flags);
> +		if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> +			blk_mq_rq_update_aborted_gstate(req, 0);
> +			blk_add_timer(req);

Nothing prevents the above blk_add_timer() racing against the next
recycle instance of the request, so this still leaves a small race
window.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
  2018-04-11 21:30 ` Tejun Heo
@ 2018-04-11 22:43   ` Ming Lei
  2018-04-11 22:47     ` Tejun Heo
  0 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2018-04-11 22:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, linux-block, Bart Van Assche, Christoph Hellwig,
	Sagi Grimberg, Israel Rukshin, Max Gurtovoy, stable

On Wed, Apr 11, 2018 at 02:30:07PM -0700, Tejun Heo wrote:
> Hello, Ming.
> 
> On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote:
> ...
> > +		spin_lock_irqsave(req->q->queue_lock, flags);
> > +		if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> > +			blk_mq_rq_update_aborted_gstate(req, 0);
> > +			blk_add_timer(req);
> 
> Nothing prevents the above blk_add_timer() racing against the next
> recycle instance of the request, so this still leaves a small race
> window.

OK.

But this small race window can be avoided by running blk_add_timer(req)
before blk_mq_rq_update_aborted_gstate(req, 0), can't it?

-- 
Ming

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
  2018-04-11 22:43   ` Ming Lei
@ 2018-04-11 22:47     ` Tejun Heo
  2018-04-11 23:05       ` Ming Lei
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2018-04-11 22:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Christoph Hellwig,
	Sagi Grimberg, Israel Rukshin, Max Gurtovoy, stable

Hello,

On Thu, Apr 12, 2018 at 06:43:45AM +0800, Ming Lei wrote:
> On Wed, Apr 11, 2018 at 02:30:07PM -0700, Tejun Heo wrote:
> > Hello, Ming.
> > 
> > On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote:
> > ...
> > > +		spin_lock_irqsave(req->q->queue_lock, flags);
> > > +		if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> > > +			blk_mq_rq_update_aborted_gstate(req, 0);
> > > +			blk_add_timer(req);
> > 
> > Nothing prevents the above blk_add_timer() racing against the next
> > recycle instance of the request, so this still leaves a small race
> > window.
> 
> OK.
> 
> But this small race window can be avoided by running blk_add_timer(req)
> before blk_mq_rq_update_aborted_gstate(req, 0), can't it?

Not really because aborted_gstate right now doesn't have any memory
barrier around it, so nothing ensures blk_add_timer() actually appears
before.  We can either add the matching barriers in aborted_gstate
update and when it's read in the normal completion path, or we can
wait for the update to be visible everywhere by waiting for rcu grace
period (because the reader is rcu protected).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
  2018-04-11 20:55 [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER Ming Lei
@ 2018-04-11 22:49   ` Bart Van Assche
  2018-04-11 22:49   ` Bart Van Assche
  1 sibling, 0 replies; 10+ messages in thread
From: Bart Van Assche @ 2018-04-11 22:49 UTC (permalink / raw)
  To: linux-block, ming.lei, axboe; +Cc: hch, tj, israelr, maxg, stable, sagi

T24gVGh1LCAyMDE4LTA0LTEyIGF0IDA0OjU1ICswODAwLCBNaW5nIExlaSB3cm90ZToNCj4gK2Fn
YWluOg0KPiAgCXN3aXRjaCAocmV0KSB7DQo+ICAJY2FzZSBCTEtfRUhfSEFORExFRDoNCj4gIAkJ
X19ibGtfbXFfY29tcGxldGVfcmVxdWVzdChyZXEpOw0KPiAgCQlicmVhazsNCj4gIAljYXNlIEJM
S19FSF9SRVNFVF9USU1FUjoNCj4gIAkJWyAuLi4gXQ0KPiArCQlzcGluX2xvY2tfaXJxc2F2ZShy
ZXEtPnEtPnF1ZXVlX2xvY2ssIGZsYWdzKTsNCj4gKwkJaWYgKGJsa19tcV9ycV9zdGF0ZShyZXEp
ICE9IE1RX1JRX0NPTVBMRVRFX0lOX1JFU0VUKSB7DQo+ICsJCQlibGtfbXFfcnFfdXBkYXRlX2Fi
b3J0ZWRfZ3N0YXRlKHJlcSwgMCk7DQo+ICsJCQlibGtfYWRkX3RpbWVyKHJlcSk7DQo+ICsJCX0g
ZWxzZSB7DQo+ICsJCQlibGtfbXFfcnFfdXBkYXRlX3N0YXRlKHJlcSwgTVFfUlFfSU5fRkxJR0hU
KTsNCj4gKwkJCXJldCA9IEJMS19FSF9IQU5ETEVEOw0KPiArCQkJZ290byBhZ2FpbjsNCj4gKwkJ
fQ0KPiArCQlzcGluX3VubG9ja19pcnFyZXN0b3JlKHJlcS0+cS0+cXVldWVfbG9jaywgZmxhZ3Mp
Ow0KDQpEb2VzIHRoZSBhYm92ZSBjaHVuayBpbnRyb2R1Y2UgYSBiYWNrd2FyZHMgZ290byBmcm9t
IGluc2lkZSBhIHJlZ2lvbiBhcm91bmQNCndoaWNoIGEgc3BpbmxvY2sgaXMgaGVsZCB0byBvdXRz
aWRlIHRoYXQgcmVnaW9uPyBDYW4gc3VjaCBhIGdvdG8gcmVzdWx0IGluDQphbnl0aGluZyBlbHNl
IHRoYW4gYSBkZWFkbG9jaz8NCg0KVGhhbmtzLA0KDQpCYXJ0Lg0KDQoNCg0K

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
@ 2018-04-11 22:49   ` Bart Van Assche
  0 siblings, 0 replies; 10+ messages in thread
From: Bart Van Assche @ 2018-04-11 22:49 UTC (permalink / raw)
  To: linux-block, ming.lei, axboe; +Cc: hch, tj, israelr, maxg, stable, sagi

On Thu, 2018-04-12 at 04:55 +0800, Ming Lei wrote:
> +again:
>  	switch (ret) {
>  	case BLK_EH_HANDLED:
>  		__blk_mq_complete_request(req);
>  		break;
>  	case BLK_EH_RESET_TIMER:
>  		[ ... ]
> +		spin_lock_irqsave(req->q->queue_lock, flags);
> +		if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> +			blk_mq_rq_update_aborted_gstate(req, 0);
> +			blk_add_timer(req);
> +		} else {
> +			blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
> +			ret = BLK_EH_HANDLED;
> +			goto again;
> +		}
> +		spin_unlock_irqrestore(req->q->queue_lock, flags);

Does the above chunk introduce a backwards goto from inside a region around
which a spinlock is held to outside that region? Can such a goto result in
anything else than a deadlock?

Thanks,

Bart.




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
  2018-04-11 22:47     ` Tejun Heo
@ 2018-04-11 23:05       ` Ming Lei
  2018-04-12 13:57         ` Tejun Heo
  0 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2018-04-11 23:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, linux-block, Bart Van Assche, Christoph Hellwig,
	Sagi Grimberg, Israel Rukshin, Max Gurtovoy, stable

On Wed, Apr 11, 2018 at 03:47:12PM -0700, Tejun Heo wrote:
> Hello,
> 
> On Thu, Apr 12, 2018 at 06:43:45AM +0800, Ming Lei wrote:
> > On Wed, Apr 11, 2018 at 02:30:07PM -0700, Tejun Heo wrote:
> > > Hello, Ming.
> > > 
> > > On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote:
> > > ...
> > > > +		spin_lock_irqsave(req->q->queue_lock, flags);
> > > > +		if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> > > > +			blk_mq_rq_update_aborted_gstate(req, 0);
> > > > +			blk_add_timer(req);
> > > 
> > > Nothing prevents the above blk_add_timer() racing against the next
> > > recycle instance of the request, so this still leaves a small race
> > > window.
> > 
> > OK.
> > 
> > But this small race window can be avoided by running blk_add_timer(req)
> > before blk_mq_rq_update_aborted_gstate(req, 0), can't it?
> 
> Not really because aborted_gstate right now doesn't have any memory
> barrier around it, so nothing ensures blk_add_timer() actually appears
> before.  We can either add the matching barriers in aborted_gstate
> update and when it's read in the normal completion path, or we can
> wait for the update to be visible everywhere by waiting for rcu grace
> period (because the reader is rcu protected).

Seems not necessary.

Suppose it is out of order, the only side-effect is that the new
recycled request is timed out as a bit late, I think that is what
we can survive, right?

But it need to be documented.

--
Ming

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
  2018-04-11 22:49   ` Bart Van Assche
  (?)
@ 2018-04-11 23:06   ` Ming Lei
  -1 siblings, 0 replies; 10+ messages in thread
From: Ming Lei @ 2018-04-11 23:06 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, axboe, hch, tj, israelr, maxg, stable, sagi

On Wed, Apr 11, 2018 at 10:49:51PM +0000, Bart Van Assche wrote:
> On Thu, 2018-04-12 at 04:55 +0800, Ming Lei wrote:
> > +again:
> >  	switch (ret) {
> >  	case BLK_EH_HANDLED:
> >  		__blk_mq_complete_request(req);
> >  		break;
> >  	case BLK_EH_RESET_TIMER:
> >  		[ ... ]
> > +		spin_lock_irqsave(req->q->queue_lock, flags);
> > +		if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> > +			blk_mq_rq_update_aborted_gstate(req, 0);
> > +			blk_add_timer(req);
> > +		} else {
> > +			blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
> > +			ret = BLK_EH_HANDLED;
> > +			goto again;
> > +		}
> > +		spin_unlock_irqrestore(req->q->queue_lock, flags);
> 
> Does the above chunk introduce a backwards goto from inside a region around
> which a spinlock is held to outside that region? Can such a goto result in
> anything else than a deadlock?

Yes, it is being fixed in my local V2, :-)

-- 
Ming

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
  2018-04-11 23:05       ` Ming Lei
@ 2018-04-12 13:57         ` Tejun Heo
  2018-04-13 15:38           ` Ming Lei
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2018-04-12 13:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Christoph Hellwig,
	Sagi Grimberg, Israel Rukshin, Max Gurtovoy, stable

On Thu, Apr 12, 2018 at 07:05:13AM +0800, Ming Lei wrote:
> > Not really because aborted_gstate right now doesn't have any memory
> > barrier around it, so nothing ensures blk_add_timer() actually appears
> > before.  We can either add the matching barriers in aborted_gstate
> > update and when it's read in the normal completion path, or we can
> > wait for the update to be visible everywhere by waiting for rcu grace
> > period (because the reader is rcu protected).
> 
> Seems not necessary.
> 
> Suppose it is out of order, the only side-effect is that the new
> recycled request is timed out as a bit late, I think that is what
> we can survive, right?

It at least can mess up the timeout duration for the next recycle
instance because there can be two competing blk_add_timer() instances.
I'm not sure whether there can be other consequences.  When ownership
isn't clear, it becomes really difficult to reason about these things
and can lead to subtle failures.  I think it'd be best to always
establish who owns what.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
  2018-04-12 13:57         ` Tejun Heo
@ 2018-04-13 15:38           ` Ming Lei
  0 siblings, 0 replies; 10+ messages in thread
From: Ming Lei @ 2018-04-13 15:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, linux-block, Bart Van Assche, Christoph Hellwig,
	Sagi Grimberg, Israel Rukshin, Max Gurtovoy, stable

On Thu, Apr 12, 2018 at 06:57:12AM -0700, Tejun Heo wrote:
> On Thu, Apr 12, 2018 at 07:05:13AM +0800, Ming Lei wrote:
> > > Not really because aborted_gstate right now doesn't have any memory
> > > barrier around it, so nothing ensures blk_add_timer() actually appears
> > > before.  We can either add the matching barriers in aborted_gstate
> > > update and when it's read in the normal completion path, or we can
> > > wait for the update to be visible everywhere by waiting for rcu grace
> > > period (because the reader is rcu protected).
> > 
> > Seems not necessary.
> > 
> > Suppose it is out of order, the only side-effect is that the new
> > recycled request is timed out as a bit late, I think that is what
> > we can survive, right?
> 
> It at least can mess up the timeout duration for the next recycle
> instance because there can be two competing blk_add_timer() instances.
> I'm not sure whether there can be other consequences.  When ownership
> isn't clear, it becomes really difficult to reason about these things
> and can lead to subtle failures.  I think it'd be best to always
> establish who owns what.

Please see the code of blk_add_timer() for blk-mq:

	blk_rq_set_deadline(req, jiffies + req->timeout);
	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;

	if (!timer_pending(&q->timeout) ||
	    time_before(expiry, q->timeout.expires))
			mod_timer(&q->timeout, expiry);

If this rq is recycled, blk_add_timer() only touches rq->deadline
and the EXPIRED flags, and the only effect is that the timeout
may be handled a bit late, but the timeout monitor won't be lost.

And this thing shouldn't be difficult to avoid, as you mentioned,
synchronize_rcu() can be added between blk_add_timer() and
resetting aborted gstate for avoiding it.


thanks,
Ming

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-04-13 15:38 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-11 20:55 [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER Ming Lei
2018-04-11 21:30 ` Tejun Heo
2018-04-11 22:43   ` Ming Lei
2018-04-11 22:47     ` Tejun Heo
2018-04-11 23:05       ` Ming Lei
2018-04-12 13:57         ` Tejun Heo
2018-04-13 15:38           ` Ming Lei
2018-04-11 22:49 ` Bart Van Assche
2018-04-11 22:49   ` Bart Van Assche
2018-04-11 23:06   ` Ming Lei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.