All of lore.kernel.org
* backport request for use-after-free blk_mq_queue_tag_busy_iter
@ 2020-04-01 17:47 Giuliano Procida
  2020-04-01 17:55 ` Greg KH
  2020-04-03  9:20 ` Greg KH
  0 siblings, 2 replies; 22+ messages in thread
From: Giuliano Procida @ 2020-04-01 17:47 UTC (permalink / raw)
  To: stable

This issue was found in 4.14 and is present in earlier kernels.

Please backport

f5bbbbe4d635 blk-mq: sync the update nr_hw_queues with
blk_mq_queue_tag_busy_iter
530ca2c9bd69 blk-mq: Allow blocking queue tag iter callbacks

onto the stable branches that don't have these. The second is a fix
for the first. Thank you.

4.19.y and later - commits already present
4.14.y - f5bbbbe4d635 doesn't patch cleanly but it's still
straightforward, just drop the comment and code mentioning switching
to 'none' in the trailing context
4.9.y - ditto
4.4.y - there was a refactoring of the code in commit
0bf6cd5b9531bcc29c0a5e504b6ce2984c6fd8d8 making this non-trivial
3.16.y - ditto

I am happy to try to produce clean patches, but it may be a day or so.

Regards,
Giuliano.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: backport request for use-after-free blk_mq_queue_tag_busy_iter
  2020-04-01 17:47 backport request for use-after-free blk_mq_queue_tag_busy_iter Giuliano Procida
@ 2020-04-01 17:55 ` Greg KH
  2020-04-03  9:20 ` Greg KH
  1 sibling, 0 replies; 22+ messages in thread
From: Greg KH @ 2020-04-01 17:55 UTC (permalink / raw)
  To: Giuliano Procida; +Cc: stable

On Wed, Apr 01, 2020 at 05:47:02PM +0000, Giuliano Procida wrote:
> This issue was found in 4.14 and is present in earlier kernels.
> 
> Please backport
> 
> f5bbbbe4d635 blk-mq: sync the update nr_hw_queues with
> blk_mq_queue_tag_busy_iter
> 530ca2c9bd69 blk-mq: Allow blocking queue tag iter callbacks
> 
> onto the stable branches that don't have these. The second is a fix
> for the first. Thank you.
> 
> 4.19.y and later - commits already present
> 4.14.y - f5bbbbe4d635 doesn't patch cleanly but it's still
> straightforward, just drop the comment and code mentioning switching
> to 'none' in the trailing context
> 4.9.y - ditto
> 4.4.y - there was a refactoring of the code in commit
> 0bf6cd5b9531bcc29c0a5e504b6ce2984c6fd8d8 making this non-trivial
> 3.16.y - ditto
> 
> I am happy to try to produce clean patches, but it may be a day or so.

Clean patches would be good, as there are -rcs out right now so I can't
do anything until they are released in a few days.

thanks,

greg k-h


* Re: backport request for use-after-free blk_mq_queue_tag_busy_iter
  2020-04-01 17:47 backport request for use-after-free blk_mq_queue_tag_busy_iter Giuliano Procida
  2020-04-01 17:55 ` Greg KH
@ 2020-04-03  9:20 ` Greg KH
  2020-04-03 22:30   ` Giuliano Procida
  1 sibling, 1 reply; 22+ messages in thread
From: Greg KH @ 2020-04-03  9:20 UTC (permalink / raw)
  To: Giuliano Procida; +Cc: stable

On Wed, Apr 01, 2020 at 05:47:02PM +0000, Giuliano Procida wrote:
> This issue was found in 4.14 and is present in earlier kernels.
> 
> Please backport
> 
> f5bbbbe4d635 blk-mq: sync the update nr_hw_queues with
> blk_mq_queue_tag_busy_iter
> 530ca2c9bd69 blk-mq: Allow blocking queue tag iter callbacks
> 
> onto the stable branches that don't have these. The second is a fix
> for the first. Thank you.
> 
> 4.19.y and later - commits already present
> 4.14.y - f5bbbbe4d635 doesn't patch cleanly but it's still
> straightforward, just drop the comment and code mentioning switching
> to 'none' in the trailing context
> 4.9.y - ditto
> 4.4.y - there was a refactoring of the code in commit
> 0bf6cd5b9531bcc29c0a5e504b6ce2984c6fd8d8 making this non-trivial
> 3.16.y - ditto
> 
> I am happy to try to produce clean patches, but it may be a day or so.

I have done this for 4.14.y and 4.9.y, can you please provide a backport
for 4.4.y that I can queue up?

thanks,

greg k-h


* Re: backport request for use-after-free blk_mq_queue_tag_busy_iter
  2020-04-03  9:20 ` Greg KH
@ 2020-04-03 22:30   ` Giuliano Procida
  2020-04-07 16:31     ` Giuliano Procida
  0 siblings, 1 reply; 22+ messages in thread
From: Giuliano Procida @ 2020-04-03 22:30 UTC (permalink / raw)
  To: Greg KH; +Cc: stable

Hi Greg.

I also have 4.14 and 4.9, I'll send them on for comparison.

I will try 4.4 but, as one call site doesn't exist and the other
didn't have any locking to start with, I'd like to try to reproduce
the issue first.

I should have some spare time for this soon.

Giuliano.

On Fri, 3 Apr 2020 at 10:20, Greg KH <greg@kroah.com> wrote:
>
> On Wed, Apr 01, 2020 at 05:47:02PM +0000, Giuliano Procida wrote:
> > This issue was found in 4.14 and is present in earlier kernels.
> >
> > Please backport
> >
> > f5bbbbe4d635 blk-mq: sync the update nr_hw_queues with
> > blk_mq_queue_tag_busy_iter
> > 530ca2c9bd69 blk-mq: Allow blocking queue tag iter callbacks
> >
> > onto the stable branches that don't have these. The second is a fix
> > for the first. Thank you.
> >
> > 4.19.y and later - commits already present
> > 4.14.y - f5bbbbe4d635 doesn't patch cleanly but it's still
> > straightforward, just drop the comment and code mentioning switching
> > to 'none' in the trailing context
> > 4.9.y - ditto
> > 4.4.y - there was a refactoring of the code in commit
> > 0bf6cd5b9531bcc29c0a5e504b6ce2984c6fd8d8 making this non-trivial
> > 3.16.y - ditto
> >
> > I am happy to try to produce clean patches, but it may be a day or so.
>
> I have done this for 4.14.y and 4.9.y, can you please provide a backport
> for 4.4.y that I can queue up?
>
> thanks,
>
> greg k-h


* Re: backport request for use-after-free blk_mq_queue_tag_busy_iter
  2020-04-03 22:30   ` Giuliano Procida
@ 2020-04-07 16:31     ` Giuliano Procida
  2020-04-07 16:55       ` [PATCH 0/4] " Giuliano Procida
                         ` (10 more replies)
  0 siblings, 11 replies; 22+ messages in thread
From: Giuliano Procida @ 2020-04-07 16:31 UTC (permalink / raw)
  To: Greg KH; +Cc: stable

Hi Greg.

On Fri, 3 Apr 2020 at 23:30, Giuliano Procida <gprocida@google.com> wrote:
>
> Hi Greg.
>
> I also have 4.14 and 4.9, I'll send them on for comparison.

I've done this.

> I will try 4.4 but, as one call site doesn't exist and the other
> didn't have any locking to start with, I'd like to try to reproduce
> the issue first.

I have failed to build a bootable 4.4 kernel, which is surprising and
embarrassing: my current toolchain (even after working around various
known issues) compiles kernels that either panic or triple-fault
(apparently; there's no log output, just a reboot) on my amd64
hardware. Running an old live distribution with a 4.4 kernel, I wasn't
able to reproduce the issue apparently resolved by these fixes, even
after several hours of running.

I've also spent most of 2 days looking at unfamiliar code.

The code in 4.4 uses a timer instead of a workqueue for timeout
callbacks. The callbacks also have blk_queue_enter/exit protection in
4.9 but not in 4.4. I'm guessing, but don't know, that the execution
contexts of timers and workqueues are sufficiently similar that this
protection should be back-ported to 4.4. This is relatively simple;
it takes bits of a couple of extra commits.

f5bbbbe4d635 adds to blk_mq_queue_tag_busy_iter an RCU-protected test
to see if the blk_queue is held before doing any work. It also adds
RCU synchronisation to the code that manipulates the number of
hardware queues. The follow-up 530ca2c9bd69 more sensibly just does a
(possibly recursive) try-to-enter/exit instead. 4.4 doesn't have code
that manipulates the number of hardware queues. However, the
blk_mq_queue_tag_busy_iter locking may be enough to prevent
ioctl/procfs concurrency.

To this end, I've put together patches for 4.4. They are completely
untested. Once I've verified they actually compile I'll send them on.

Giuliano.

> I should have some spare time for this soon.
>
> Giuliano.
>
> On Fri, 3 Apr 2020 at 10:20, Greg KH <greg@kroah.com> wrote:
> >
> > On Wed, Apr 01, 2020 at 05:47:02PM +0000, Giuliano Procida wrote:
> > > This issue was found in 4.14 and is present in earlier kernels.
> > >
> > > Please backport
> > >
> > > f5bbbbe4d635 blk-mq: sync the update nr_hw_queues with
> > > blk_mq_queue_tag_busy_iter
> > > 530ca2c9bd69 blk-mq: Allow blocking queue tag iter callbacks
> > >
> > > onto the stable branches that don't have these. The second is a fix
> > > for the first. Thank you.
> > >
> > > 4.19.y and later - commits already present
> > > 4.14.y - f5bbbbe4d635 doesn't patch cleanly but it's still
> > > straightforward, just drop the comment and code mentioning switching
> > > to 'none' in the trailing context
> > > 4.9.y - ditto
> > > 4.4.y - there was a refactoring of the code in commit
> > > 0bf6cd5b9531bcc29c0a5e504b6ce2984c6fd8d8 making this non-trivial
> > > 3.16.y - ditto
> > >
> > > I am happy to try to produce clean patches, but it may be a day or so.
> >
> > I have done this for 4.14.y and 4.9.y, can you please provide a backport
> > for 4.4.y that I can queue up?
> >
> > thanks,
> >
> > greg k-h


* [PATCH 0/4] backport request for use-after-free blk_mq_queue_tag_busy_iter
  2020-04-07 16:31     ` Giuliano Procida
@ 2020-04-07 16:55       ` Giuliano Procida
  2020-04-10  9:04         ` Greg KH
  2020-04-07 16:55       ` [PATCH 1/4] block: more locking around delayed work Giuliano Procida
                         ` (9 subsequent siblings)
  10 siblings, 1 reply; 22+ messages in thread
From: Giuliano Procida @ 2020-04-07 16:55 UTC (permalink / raw)
  To: greg; +Cc: stable, Giuliano Procida

Here are the patches for linux-4.4.y. Untested.

Regards,
Giuliano.

Giuliano Procida (4):
  block: more locking around delayed work
  blk-mq: Allow timeouts to run while queue is freezing
  blk-mq: sync things with blk_mq_queue_tag_busy_iter
  blk-mq: Allow blocking queue tag iter callbacks

 block/blk-mq-tag.c  |  7 ++++++-
 block/blk-mq.c      | 17 +++++++++++++++++
 block/blk-timeout.c |  3 +++
 3 files changed, 26 insertions(+), 1 deletion(-)

-- 
2.26.0.292.g33ef6b2f38-goog



* [PATCH 1/4] block: more locking around delayed work
  2020-04-07 16:31     ` Giuliano Procida
  2020-04-07 16:55       ` [PATCH 0/4] " Giuliano Procida
@ 2020-04-07 16:55       ` Giuliano Procida
  2020-04-10  9:03         ` Greg KH
  2020-04-07 16:55         ` Giuliano Procida
                         ` (8 subsequent siblings)
  10 siblings, 1 reply; 22+ messages in thread
From: Giuliano Procida @ 2020-04-07 16:55 UTC (permalink / raw)
  To: greg; +Cc: stable, Giuliano Procida

commit 287922eb0b186e2a5bf54fdd04b734c25c90035c upstream.

The upstream commit (block: defer timeouts to a workqueue) included
various locking changes. The original commit message did not say
anything about the extra locking. Perhaps this is only needed for
workqueue callbacks and not timer callbacks. We assume it is needed
here.

This patch includes the locking changes but leaves timeouts using a
timer.

Both blk_mq_rq_timer and blk_rq_timed_out_timer will return without
doing any work if they cannot acquire the queue (without waiting).

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq.c      | 4 ++++
 block/blk-timeout.c | 3 +++
 2 files changed, 7 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8649dbf06ce4..11a23bf73fd9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -628,6 +628,9 @@ static void blk_mq_rq_timer(unsigned long priv)
 	};
 	int i;
 
+	if (blk_queue_enter(q, GFP_NOWAIT))
+		return;
+
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
 	if (data.next_set) {
@@ -642,6 +645,7 @@ static void blk_mq_rq_timer(unsigned long priv)
 				blk_mq_tag_idle(hctx);
 		}
 	}
+	blk_queue_exit(q);
 }
 
 /*
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index aa40aa93381b..2bc03df554a6 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -134,6 +134,8 @@ void blk_rq_timed_out_timer(unsigned long data)
 	struct request *rq, *tmp;
 	int next_set = 0;
 
+	if (blk_queue_enter(q, GFP_NOWAIT))
+		return;
 	spin_lock_irqsave(q->queue_lock, flags);
 
 	list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list)
@@ -143,6 +145,7 @@ void blk_rq_timed_out_timer(unsigned long data)
 		mod_timer(&q->timeout, round_jiffies_up(next));
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
+	blk_queue_exit(q);
 }
 
 /**
-- 
2.26.0.292.g33ef6b2f38-goog



* [PATCH 2/4] blk-mq: Allow timeouts to run while queue is freezing
  2020-04-07 16:31     ` Giuliano Procida
@ 2020-04-07 16:55         ` Giuliano Procida
  2020-04-07 16:55       ` [PATCH 1/4] block: more locking around delayed work Giuliano Procida
                           ` (9 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Giuliano Procida @ 2020-04-07 16:55 UTC (permalink / raw)
  To: greg
  Cc: stable, Giuliano Procida, Gabriel Krisman Bertazi, Brian King,
	Keith Busch, linux-nvme, linux-block, Christoph Hellwig,
	Jens Axboe

commit 71f79fb3179e69b0c1448a2101a866d871c66e7f upstream.

In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by first thread.  This puts
the request_queue in a hung state, forever waiting for the freeze
completion.

This was observed while running IO to an NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.

[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4

The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion.  This can be
achieved with percpu_ref_tryget, which does not require the counter
to be alive.  This way the timeout work can do its job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.

Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout.  In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.

I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process.  I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled on a per-driver basis.  I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 11a23bf73fd9..d13e70d40df9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -628,7 +628,20 @@ static void blk_mq_rq_timer(unsigned long priv)
 	};
 	int i;
 
-	if (blk_queue_enter(q, GFP_NOWAIT))
+	/* A deadlock might occur if a request is stuck requiring a
+	 * timeout at the same time a queue freeze is waiting
+	 * completion, since the timeout code would not be able to
+	 * acquire the queue reference here.
+	 *
+	 * That's why we don't use blk_queue_enter here; instead, we use
+	 * percpu_ref_tryget directly, because we need to be able to
+	 * obtain a reference even in the short window between the queue
+	 * starting to freeze, by dropping the first reference in
+	 * blk_mq_freeze_queue_start, and the moment the last request is
+	 * consumed, marked by the instant q_usage_counter reaches
+	 * zero.
+	 */
+	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
-- 
2.26.0.292.g33ef6b2f38-goog



* [PATCH 2/4] blk-mq: Allow timeouts to run while queue is freezing
@ 2020-04-07 16:55         ` Giuliano Procida
  0 siblings, 0 replies; 22+ messages in thread
From: Giuliano Procida @ 2020-04-07 16:55 UTC (permalink / raw)
  To: greg
  Cc: linux-block, Jens Axboe, Gabriel Krisman Bertazi, linux-nvme,
	Keith Busch, stable, Brian King, Giuliano Procida,
	Christoph Hellwig

commit 71f79fb3179e69b0c1448a2101a866d871c66e7f upstream.

In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by first thread.  This puts
the request_queue in a hung state, forever waiting for the freeze
completion.

This was observed while running IO to an NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.

[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4

The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion.  This can be
achieved with percpu_ref_tryget, which does not require the counter
to be alive.  This way the timeout work can do its job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.

Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout.  In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.

I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process.  I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled on a per-driver basis.  I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 11a23bf73fd9..d13e70d40df9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -628,7 +628,20 @@ static void blk_mq_rq_timer(unsigned long priv)
 	};
 	int i;
 
-	if (blk_queue_enter(q, GFP_NOWAIT))
+	/* A deadlock might occur if a request is stuck requiring a
+	 * timeout at the same time a queue freeze is waiting
+	 * completion, since the timeout code would not be able to
+	 * acquire the queue reference here.
+	 *
+	 * That's why we don't use blk_queue_enter here; instead, we use
+	 * percpu_ref_tryget directly, because we need to be able to
+	 * obtain a reference even in the short window between the queue
+	 * starting to freeze, by dropping the first reference in
+	 * blk_mq_freeze_queue_start, and the moment the last request is
+	 * consumed, marked by the instant q_usage_counter reaches
+	 * zero.
+	 */
+	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
-- 
2.26.0.292.g33ef6b2f38-goog


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* [PATCH 3/4] blk-mq: sync things with blk_mq_queue_tag_busy_iter
  2020-04-07 16:31     ` Giuliano Procida
                         ` (2 preceding siblings ...)
  2020-04-07 16:55         ` Giuliano Procida
@ 2020-04-07 16:55       ` Giuliano Procida
  2020-04-07 16:55       ` [PATCH 4/4] blk-mq: Allow blocking queue tag iter callbacks Giuliano Procida
                         ` (6 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Giuliano Procida @ 2020-04-07 16:55 UTC (permalink / raw)
  To: greg; +Cc: stable, Giuliano Procida, Jianchao Wang, Ming Lei, Jens Axboe

commit f5bbbbe4d63577026f908a809f22f5fd5a90ea1f upstream.

The original commit was intended to prevent concurrent manipulation of
nr_hw_queues and iteration over queues. The former doesn't happen in
this older kernel version. However, the extra locking (which is buggy
as it exists in this commit) may protect against other concurrent
accesses such as queue removal.

The original commit message follows for completeness.

blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter

For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to
account the inflight requests. It will access the queue_hw_ctx and
nr_hw_queues w/o any protection. When updating nr_hw_queues and
blk_mq_in_flight/rw occur concurrently, panic comes up.

Before update nr_hw_queues, the q will be frozen. So we could use
q_usage_counter to avoid the race. percpu_ref_is_zero is used here
so that we will not miss any in-flight request. The access to
nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
under rcu critical section, __blk_mq_update_nr_hw_queues could use
synchronize_rcu to ensure the zeroed q_usage_counter to be globally
visible.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq-tag.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a07ca3488d96..bf356de30134 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -481,6 +481,14 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
+	/*
+	 * Avoid potential races with things like queue removal.
+	 */
+	rcu_read_lock();
+	if (percpu_ref_is_zero(&q->q_usage_counter)) {
+		rcu_read_unlock();
+		return;
+	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		struct blk_mq_tags *tags = hctx->tags;
@@ -497,7 +505,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		bt_for_each(hctx, &tags->bitmap_tags, tags->nr_reserved_tags, fn, priv,
 		      false);
 	}
-
+	rcu_read_unlock();
 }
 
 static unsigned int bt_unused_tags(struct blk_mq_bitmap_tags *bt)
-- 
2.26.0.292.g33ef6b2f38-goog



* [PATCH 4/4] blk-mq: Allow blocking queue tag iter callbacks
  2020-04-07 16:31     ` Giuliano Procida
                         ` (3 preceding siblings ...)
  2020-04-07 16:55       ` [PATCH 3/4] blk-mq: sync things with blk_mq_queue_tag_busy_iter Giuliano Procida
@ 2020-04-07 16:55       ` Giuliano Procida
  2020-04-07 21:02       ` backport request for use-after-free blk_mq_queue_tag_busy_iter Giuliano Procida
                         ` (5 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Giuliano Procida @ 2020-04-07 16:55 UTC (permalink / raw)
  To: greg; +Cc: stable, Giuliano Procida, Jianchao Wang, Keith Busch, Jens Axboe

commit 530ca2c9bd6949c72c9b5cfc330cb3dbccaa3f5b upstream.

This change is a back-ported fix for the back-port of f5bbbbe4d6357
(stable commit a439abbd6e707232b1f399e6df1a85ace42e8f9f).

A recent commit runs tag iterator callbacks under the rcu read lock,
but existing callbacks do not satisfy the non-blocking requirement.
The commit intended to prevent an iterator from accessing a queue that's
being modified. This patch fixes the original issue by taking a queue
reference instead of reading it, which allows callbacks to make blocking
calls.

Fixes: f5bbbbe4d6357 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
Acked-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq-tag.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index bf356de30134..c1c654319287 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -484,11 +484,8 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 	/*
 	 * Avoid potential races with things like queue removal.
 	 */
-	rcu_read_lock();
-	if (percpu_ref_is_zero(&q->q_usage_counter)) {
-		rcu_read_unlock();
+	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
-	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		struct blk_mq_tags *tags = hctx->tags;
@@ -505,7 +502,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		bt_for_each(hctx, &tags->bitmap_tags, tags->nr_reserved_tags, fn, priv,
 		      false);
 	}
-	rcu_read_unlock();
+	blk_queue_exit(q);
 }
 
 static unsigned int bt_unused_tags(struct blk_mq_bitmap_tags *bt)
-- 
2.26.0.292.g33ef6b2f38-goog



* Re: backport request for use-after-free blk_mq_queue_tag_busy_iter
  2020-04-07 16:31     ` Giuliano Procida
                         ` (4 preceding siblings ...)
  2020-04-07 16:55       ` [PATCH 4/4] blk-mq: Allow blocking queue tag iter callbacks Giuliano Procida
@ 2020-04-07 21:02       ` Giuliano Procida
  2020-04-15 13:00       ` [PATCH v2 0/4] " Giuliano Procida
                         ` (4 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Giuliano Procida @ 2020-04-07 21:02 UTC (permalink / raw)
  To: Greg KH; +Cc: stable

Greg.

Going back further to 3.16 or 3.18 looks like a lot of work and I have
low confidence of generating correct code.

There are changes like 3ef28e83ab15799742e55fd13243a5f678b04242 (from
4.3) which changed the locking from blk_mq_queue_enter to
blk_queue_enter.

I'm going to stand down here.

Sorry about this.
Giuliano.

On Tue, 7 Apr 2020 at 17:31, Giuliano Procida <gprocida@google.com> wrote:
>
> Hi Greg.
>
> On Fri, 3 Apr 2020 at 23:30, Giuliano Procida <gprocida@google.com> wrote:
> >
> > Hi Greg.
> >
> > I also have 4.14 and 4.9, I'll send them on for comparison.
>
> I've done this.
>
> > I will try 4.4 but, as one call site doesn't exist and the other
> > didn't have any locking to start with, I'd like to try to reproduce
> > the issue first.
>
> I have failed to build a bootable 4.4 kernel, which is surprising and
> embarrassing: my current toolchain (even after working around various
> known issues) compiles kernels that either panic or triple-fault
> (apparently; there's no log output, just a reboot) on my amd64
> hardware. Running an old live distribution with a 4.4 kernel, I wasn't
> able to reproduce the issue apparently resolved by these fixes, even
> after several hours of running.
>
> I've also spent most of 2 days looking at unfamiliar code.
>
> The code in 4.4 uses a timer instead of a workqueue for timeout
> callbacks. The callbacks also have blk_queue_enter/exit protection in
> 4.9 but not in 4.4. I'm guessing, but don't know, that the execution
> contexts of timers and workqueues are sufficiently similar that this
> protection should be back-ported to 4.4. This is relatively simple;
> it takes bits of a couple of extra commits.
>
> f5bbbbe4d635 adds to blk_mq_queue_tag_busy_iter an RCU-protected test
> to see if the blk_queue is held before doing any work. It also adds
> RCU synchronisation to the code that manipulates the number of
> hardware queues. The follow-up 530ca2c9bd is more sensible: it simply
> does a (possibly recursive) try-to-enter/exit instead. 4.4 doesn't
> have code that manipulates the number of hardware queues. However,
> the blk_mq_queue_tag_busy_iter locking may be enough to prevent
> ioctl/procfs concurrency.
>
> To this end, I've put together patches for 4.4. They are completely
> untested. Once I've verified they actually compile I'll send them on.
>
> Giuliano.
>
> > I should have some spare time for this soon.
> >
> > Giuliano.
> >
> > On Fri, 3 Apr 2020 at 10:20, Greg KH <greg@kroah.com> wrote:
> > >
> > > On Wed, Apr 01, 2020 at 05:47:02PM +0000, Giuliano Procida wrote:
> > > > This issue was found in 4.14 and is present in earlier kernels.
> > > >
> > > > Please backport
> > > >
> > > > f5bbbbe4d635 blk-mq: sync the update nr_hw_queues with
> > > > blk_mq_queue_tag_busy_iter
> > > > 530ca2c9bd69 blk-mq: Allow blocking queue tag iter callbacks
> > > >
> > > > onto the stable branches that don't have these. The second is a fix
> > > > for the first. Thank you.
> > > >
> > > > 4.19.y and later - commits already present
> > > > 4.14.y - f5bbbbe4d635 doesn't patch cleanly but it's still
> > > > straightforward, just drop the comment and code mentioning switching
> > > > to 'none' in the trailing context
> > > > 4.9.y - ditto
> > > > 4.4.y - there was a refactoring of the code in commit
> > > > 0bf6cd5b9531bcc29c0a5e504b6ce2984c6fd8d8 making this non-trivial
> > > > 3.16.y - ditto
> > > >
> > > > I am happy to try to produce clean patches, but it may be a day or so.
> > >
> > > I have done this for 4.14.y and 4.9.y, can you please provide a backport
> > > for 4.4.y that I can queue up?
> > >
> > > thanks,
> > >
> > > greg k-h

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] block: more locking around delayed work
  2020-04-07 16:55       ` [PATCH 1/4] block: more locking around delayed work Giuliano Procida
@ 2020-04-10  9:03         ` Greg KH
  2020-04-15 12:03           ` Giuliano Procida
  0 siblings, 1 reply; 22+ messages in thread
From: Greg KH @ 2020-04-10  9:03 UTC (permalink / raw)
  To: Giuliano Procida; +Cc: stable

On Tue, Apr 07, 2020 at 05:55:36PM +0100, Giuliano Procida wrote:
> commit 287922eb0b186e2a5bf54fdd04b734c25c90035c upstream.
> 
> The upstream commit (block: defer timeouts to a workqueue) included
> various locking changes. The original commit message did not say
> anything about the extra locking. Perhaps this is only needed for
> workqueue callbacks and not timer callbacks. We assume it is needed
> here.
> 
> This patch includes the locking changes but leaves timeouts using a
> timer.
> 
> Both blk_mq_rq_timer and blk_rq_timed_out_timer will return without
> doing any work if they cannot acquire the queue (without waiting).
> 
> Signed-off-by: Giuliano Procida <gprocida@google.com>

Don't write your own changelog text for something that is upstream,
include the original and then add your own later on below, making it
obvious what you are doing differently from the original commit.

And be sure to cc: all of the original people on that commit, and keep
their s-o-b also.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] backport request for use-after-free blk_mq_queue_tag_busy_iter
  2020-04-07 16:55       ` [PATCH 0/4] " Giuliano Procida
@ 2020-04-10  9:04         ` Greg KH
  0 siblings, 0 replies; 22+ messages in thread
From: Greg KH @ 2020-04-10  9:04 UTC (permalink / raw)
  To: Giuliano Procida; +Cc: stable

On Tue, Apr 07, 2020 at 05:55:35PM +0100, Giuliano Procida wrote:
> Here are the patches for linux-4.4.y. Untested.

Can you fix up patch 1/4 as noted, and test them before resending them?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] block: more locking around delayed work
  2020-04-10  9:03         ` Greg KH
@ 2020-04-15 12:03           ` Giuliano Procida
  0 siblings, 0 replies; 22+ messages in thread
From: Giuliano Procida @ 2020-04-15 12:03 UTC (permalink / raw)
  To: Greg KH; +Cc: stable

No problem.

Will do today.

Giuliano.

On Fri, 10 Apr 2020 at 10:03, Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Tue, Apr 07, 2020 at 05:55:36PM +0100, Giuliano Procida wrote:
> > commit 287922eb0b186e2a5bf54fdd04b734c25c90035c upstream.
> >
> > The upstream commit (block: defer timeouts to a workqueue) included
> > various locking changes. The original commit message did not say
> > anything about the extra locking. Perhaps this is only needed for
> > workqueue callbacks and not timer callbacks. We assume it is needed
> > here.
> >
> > This patch includes the locking changes but leaves timeouts using a
> > timer.
> >
> > Both blk_mq_rq_timer and blk_rq_timed_out_timer will return without
> > doing any work if they cannot acquire the queue (without waiting).
> >
> > Signed-off-by: Giuliano Procida <gprocida@google.com>
>
> Don't write your own changelog text for something that is upstream,
> include the original and then add your own later on below, making it
> obvious what you are doing differently from the original commit.
>
> And be sure to cc: all of the original people on that commit, and keep
> their s-o-b also.
>
> thanks,
>
> greg k-h

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 0/4] backport request for use-after-free blk_mq_queue_tag_busy_iter
  2020-04-07 16:31     ` Giuliano Procida
                         ` (5 preceding siblings ...)
  2020-04-07 21:02       ` backport request for use-after-free blk_mq_queue_tag_busy_iter Giuliano Procida
@ 2020-04-15 13:00       ` Giuliano Procida
  2020-05-18  7:27         ` Greg KH
  2020-04-15 13:00       ` [PATCH v2 1/4] block: more locking around delayed work Giuliano Procida
                         ` (3 subsequent siblings)
  10 siblings, 1 reply; 22+ messages in thread
From: Giuliano Procida @ 2020-04-15 13:00 UTC (permalink / raw)
  To: greg; +Cc: Giuliano Procida, stable

v2: Updated commit messages following feedback from gregkh.

Here are the patches for linux-4.4.y.

There are two further patches beyond those for linux-4.9.y, and the
differences after back-porting are non-trivial.

The code compiles without warnings. However, I have no suitable
hardware or virtual machine to test this on.

Regards,
Giuliano.

Giuliano Procida (4):
  block: more locking around delayed work
  blk-mq: Allow timeouts to run while queue is freezing
  blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
  blk-mq: Allow blocking queue tag iter callbacks

 block/blk-mq-tag.c  |  7 ++++++-
 block/blk-mq.c      | 17 +++++++++++++++++
 block/blk-timeout.c |  3 +++
 3 files changed, 26 insertions(+), 1 deletion(-)

-- 
2.26.0.110.g2183baf09c-goog


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 1/4] block: more locking around delayed work
  2020-04-07 16:31     ` Giuliano Procida
                         ` (6 preceding siblings ...)
  2020-04-15 13:00       ` [PATCH v2 0/4] " Giuliano Procida
@ 2020-04-15 13:00       ` Giuliano Procida
  2020-04-15 13:00         ` Giuliano Procida
                         ` (2 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Giuliano Procida @ 2020-04-15 13:00 UTC (permalink / raw)
  To: greg; +Cc: Giuliano Procida, stable, Christoph Hellwig, Keith Busch, Jens Axboe

commit 287922eb0b186e2a5bf54fdd04b734c25c90035c upstream.

block: defer timeouts to a workqueue

Timer context is not very useful for drivers to perform any meaningful abort
action from.  So instead of calling the driver from this useless context
defer it to a workqueue as soon as possible.

Note that while a delayed_work item would seem the right thing here I didn't
dare to use it due to the magic in blk_add_timer that pokes deep into timer
internals.  But maybe this encourages Tejun to add a sensible API for that to
the workqueue API and we'll all be fine in the end :)

Contains a major update from Keith Busch:

"This patch removes synchronizing the timeout work so that the timer can
 start a freeze on its own queue. The timer enters the queue, so timer
 context can only start a freeze, but not wait for frozen."

NOTE: Back-ported to 4.4.y.

The only parts of the upstream commit that have been kept are various
locking changes, none of which were mentioned in the original commit
message, which therefore does not describe this change at all.

Timeout callbacks continue to be run via a timer. Both blk_mq_rq_timer
and blk_rq_timed_out_timer will return without doing any work if they
cannot acquire the queue (without waiting).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq.c      | 4 ++++
 block/blk-timeout.c | 3 +++
 2 files changed, 7 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8649dbf06ce4..11a23bf73fd9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -628,6 +628,9 @@ static void blk_mq_rq_timer(unsigned long priv)
 	};
 	int i;
 
+	if (blk_queue_enter(q, GFP_NOWAIT))
+		return;
+
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
 	if (data.next_set) {
@@ -642,6 +645,7 @@ static void blk_mq_rq_timer(unsigned long priv)
 				blk_mq_tag_idle(hctx);
 		}
 	}
+	blk_queue_exit(q);
 }
 
 /*
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index aa40aa93381b..2bc03df554a6 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -134,6 +134,8 @@ void blk_rq_timed_out_timer(unsigned long data)
 	struct request *rq, *tmp;
 	int next_set = 0;
 
+	if (blk_queue_enter(q, GFP_NOWAIT))
+		return;
 	spin_lock_irqsave(q->queue_lock, flags);
 
 	list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list)
@@ -143,6 +145,7 @@ void blk_rq_timed_out_timer(unsigned long data)
 		mod_timer(&q->timeout, round_jiffies_up(next));
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
+	blk_queue_exit(q);
 }
 
 /**
-- 
2.26.0.110.g2183baf09c-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 2/4] blk-mq: Allow timeouts to run while queue is freezing
  2020-04-07 16:31     ` Giuliano Procida
@ 2020-04-15 13:00         ` Giuliano Procida
  2020-04-07 16:55       ` [PATCH 1/4] block: more locking around delayed work Giuliano Procida
                           ` (9 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Giuliano Procida @ 2020-04-15 13:00 UTC (permalink / raw)
  To: greg
  Cc: Giuliano Procida, stable, Gabriel Krisman Bertazi, Brian King,
	Keith Busch, linux-nvme, linux-block, Christoph Hellwig,
	Jens Axboe

commit 71f79fb3179e69b0c1448a2101a866d871c66e7f upstream.

In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by first thread.  This puts
the request_queue in a hung state, forever waiting for the freeze
completion.

This was observed while running IO to a NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.

[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4

The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion.  This can be
achieved with percpu_refcount_tryget, which does not require the counter
to be alive.  This way the timeout work can do its job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.

Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout.  In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.

I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process.  I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled on a per-driver basis.  I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 11a23bf73fd9..d13e70d40df9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -628,7 +628,20 @@ static void blk_mq_rq_timer(unsigned long priv)
 	};
 	int i;
 
-	if (blk_queue_enter(q, GFP_NOWAIT))
+	/* A deadlock might occur if a request is stuck requiring a
+	 * timeout at the same time a queue freeze is waiting
+	 * completion, since the timeout code would not be able to
+	 * acquire the queue reference here.
+	 *
+	 * That's why we don't use blk_queue_enter here; instead, we use
+	 * percpu_ref_tryget directly, because we need to be able to
+	 * obtain a reference even in the short window between the queue
+	 * starting to freeze, by dropping the first reference in
+	 * blk_mq_freeze_queue_start, and the moment the last request is
+	 * consumed, marked by the instant q_usage_counter reaches
+	 * zero.
+	 */
+	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
-- 
2.26.0.110.g2183baf09c-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 2/4] blk-mq: Allow timeouts to run while queue is freezing
@ 2020-04-15 13:00         ` Giuliano Procida
  0 siblings, 0 replies; 22+ messages in thread
From: Giuliano Procida @ 2020-04-15 13:00 UTC (permalink / raw)
  To: greg
  Cc: linux-block, Jens Axboe, Gabriel Krisman Bertazi, linux-nvme,
	Keith Busch, stable, Brian King, Giuliano Procida,
	Christoph Hellwig

commit 71f79fb3179e69b0c1448a2101a866d871c66e7f upstream.

In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by first thread.  This puts
the request_queue in a hung state, forever waiting for the freeze
completion.

This was observed while running IO to a NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.

[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4

The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion.  This can be
achieved with percpu_refcount_tryget, which does not require the counter
to be alive.  This way the timeout work can do its job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.

Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout.  In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.

I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process.  I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled on a per-driver basis.  I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 11a23bf73fd9..d13e70d40df9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -628,7 +628,20 @@ static void blk_mq_rq_timer(unsigned long priv)
 	};
 	int i;
 
-	if (blk_queue_enter(q, GFP_NOWAIT))
+	/* A deadlock might occur if a request is stuck requiring a
+	 * timeout at the same time a queue freeze is waiting
+	 * completion, since the timeout code would not be able to
+	 * acquire the queue reference here.
+	 *
+	 * That's why we don't use blk_queue_enter here; instead, we use
+	 * percpu_ref_tryget directly, because we need to be able to
+	 * obtain a reference even in the short window between the queue
+	 * starting to freeze, by dropping the first reference in
+	 * blk_mq_freeze_queue_start, and the moment the last request is
+	 * consumed, marked by the instant q_usage_counter reaches
+	 * zero.
+	 */
+	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
-- 
2.26.0.110.g2183baf09c-goog


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 3/4] blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
  2020-04-07 16:31     ` Giuliano Procida
                         ` (8 preceding siblings ...)
  2020-04-15 13:00         ` Giuliano Procida
@ 2020-04-15 13:00       ` Giuliano Procida
  2020-04-15 13:00       ` [PATCH v2 4/4] blk-mq: Allow blocking queue tag iter callbacks Giuliano Procida
  10 siblings, 0 replies; 22+ messages in thread
From: Giuliano Procida @ 2020-04-15 13:00 UTC (permalink / raw)
  To: greg; +Cc: Giuliano Procida, stable, Jianchao Wang, Ming Lei, Jens Axboe

commit f5bbbbe4d63577026f908a809f22f5fd5a90ea1f upstream.

For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to
account the inflight requests. It will access the queue_hw_ctx and
nr_hw_queues w/o any protection. When updating nr_hw_queues and
blk_mq_in_flight/rw occur concurrently, panic comes up.

Before update nr_hw_queues, the q will be frozen. So we could use
q_usage_counter to avoid the race. percpu_ref_is_zero is used here
so that we will not miss any in-flight request. The access to
nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
under rcu critical section, __blk_mq_update_nr_hw_queues could use
synchronize_rcu to ensure the zeroed q_usage_counter to be globally
visible.

NOTE: Back-ported to 4.4.y.

The upstream commit was intended to prevent concurrent manipulation of
nr_hw_queues and iteration over queues. The former doesn't happen in
4.4.y (as __blk_mq_update_nr_hw_queues doesn't exist). The extra
locking is also buggy in this commit, but that is fixed in a follow-up.

It may protect against other concurrent accesses, such as queue
removal, by synchronising RCU locking around q_usage_counter.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq-tag.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a07ca3488d96..bf356de30134 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -481,6 +481,14 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
+	/*
+	 * Avoid potential races with things like queue removal.
+	 */
+	rcu_read_lock();
+	if (percpu_ref_is_zero(&q->q_usage_counter)) {
+		rcu_read_unlock();
+		return;
+	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		struct blk_mq_tags *tags = hctx->tags;
@@ -497,7 +505,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		bt_for_each(hctx, &tags->bitmap_tags, tags->nr_reserved_tags, fn, priv,
 		      false);
 	}
-
+	rcu_read_unlock();
 }
 
 static unsigned int bt_unused_tags(struct blk_mq_bitmap_tags *bt)
-- 
2.26.0.110.g2183baf09c-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 4/4] blk-mq: Allow blocking queue tag iter callbacks
  2020-04-07 16:31     ` Giuliano Procida
                         ` (9 preceding siblings ...)
  2020-04-15 13:00       ` [PATCH v2 3/4] blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter Giuliano Procida
@ 2020-04-15 13:00       ` Giuliano Procida
  10 siblings, 0 replies; 22+ messages in thread
From: Giuliano Procida @ 2020-04-15 13:00 UTC (permalink / raw)
  To: greg; +Cc: Giuliano Procida, stable, Jianchao Wang, Keith Busch, Jens Axboe

commit 530ca2c9bd6949c72c9b5cfc330cb3dbccaa3f5b upstream.

A recent commit runs tag iterator callbacks under the rcu read lock,
but existing callbacks do not satisfy the non-blocking requirement.
The commit intended to prevent an iterator from accessing a queue that's
being modified. This patch fixes the original issue by taking a queue
reference instead of reading it, which allows callbacks to make blocking
calls.

Fixes: f5bbbbe4d6357 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
Acked-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq-tag.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index bf356de30134..c1c654319287 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -484,11 +484,8 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 	/*
 	 * Avoid potential races with things like queue removal.
 	 */
-	rcu_read_lock();
-	if (percpu_ref_is_zero(&q->q_usage_counter)) {
-		rcu_read_unlock();
+	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
-	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		struct blk_mq_tags *tags = hctx->tags;
@@ -505,7 +502,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		bt_for_each(hctx, &tags->bitmap_tags, tags->nr_reserved_tags, fn, priv,
 		      false);
 	}
-	rcu_read_unlock();
+	blk_queue_exit(q);
 }
 
 static unsigned int bt_unused_tags(struct blk_mq_bitmap_tags *bt)
-- 
2.26.0.110.g2183baf09c-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 0/4] backport request for use-after-free blk_mq_queue_tag_busy_iter
  2020-04-15 13:00       ` [PATCH v2 0/4] " Giuliano Procida
@ 2020-05-18  7:27         ` Greg KH
  0 siblings, 0 replies; 22+ messages in thread
From: Greg KH @ 2020-05-18  7:27 UTC (permalink / raw)
  To: Giuliano Procida; +Cc: stable

On Wed, Apr 15, 2020 at 02:00:13PM +0100, Giuliano Procida wrote:
> v2: Updated commit messages following feedback from gregkh.
> 
> Here are the patches for linux-4.4.y.
> 
> There are 2 further patches over those for linux-4.9.y and the
> differences after back-porting are non-trivial.
> 
> The code complies without warnings. However, I have no suitable
> hardware or virtual machine to test this on.

Sorry for the delay, now queued up.

greg k-h

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-05-18  7:27 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-01 17:47 backport request for use-after-free blk_mq_queue_tag_busy_iter Giuliano Procida
2020-04-01 17:55 ` Greg KH
2020-04-03  9:20 ` Greg KH
2020-04-03 22:30   ` Giuliano Procida
2020-04-07 16:31     ` Giuliano Procida
2020-04-07 16:55       ` [PATCH 0/4] " Giuliano Procida
2020-04-10  9:04         ` Greg KH
2020-04-07 16:55       ` [PATCH 1/4] block: more locking around delayed work Giuliano Procida
2020-04-10  9:03         ` Greg KH
2020-04-15 12:03           ` Giuliano Procida
2020-04-07 16:55       ` [PATCH 2/4] blk-mq: Allow timeouts to run while queue is freezing Giuliano Procida
2020-04-07 16:55         ` Giuliano Procida
2020-04-07 16:55       ` [PATCH 3/4] blk-mq: sync things with blk_mq_queue_tag_busy_iter Giuliano Procida
2020-04-07 16:55       ` [PATCH 4/4] blk-mq: Allow blocking queue tag iter callbacks Giuliano Procida
2020-04-07 21:02       ` backport request for use-after-free blk_mq_queue_tag_busy_iter Giuliano Procida
2020-04-15 13:00       ` [PATCH v2 0/4] " Giuliano Procida
2020-05-18  7:27         ` Greg KH
2020-04-15 13:00       ` [PATCH v2 1/4] block: more locking around delayed work Giuliano Procida
2020-04-15 13:00       ` [PATCH v2 2/4] blk-mq: Allow timeouts to run while queue is freezing Giuliano Procida
2020-04-15 13:00         ` Giuliano Procida
2020-04-15 13:00       ` [PATCH v2 3/4] blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter Giuliano Procida
2020-04-15 13:00       ` [PATCH v2 4/4] blk-mq: Allow blocking queue tag iter callbacks Giuliano Procida
