linux-nvme.lists.infradead.org archive mirror
* [PATCH v3 1/2] nvme/host/pci: Fix a race in controller removal
@ 2019-09-17 22:41 Balbir Singh
  2019-09-17 22:41 ` [PATCH] nvme/host/core: Allow overriding of wait_ready timeout Balbir Singh
  2019-09-17 23:45 ` [PATCH v3 1/2] nvme/host/pci: Fix a race in controller removal Keith Busch
  0 siblings, 2 replies; 3+ messages in thread
From: Balbir Singh @ 2019-09-17 22:41 UTC (permalink / raw)
  To: linux-nvme; +Cc: kbusch, axboe, Balbir Singh, hch, sagi

This race is hard to hit in general, now that we
have the shutdown_lock taken in both nvme_reset_work()
and nvme_dev_disable().

The real issue is that after doing all the setup work
in nvme_reset_work(), we can get another timeout
(nvme_timeout()) and proceed to disable the controller.
This causes the reset work to make only partial progress
and then fail.

Depending on the progress made, we call into
nvme_remove_dead_ctrl(), which does another
nvme_dev_disable(), freezing the block-mq queues.

I've noticed a race with udevd: udevd tries to re-read
the partition table, takes bd_mutex, and then blocks in
blk_queue_enter() because we froze the queues in
nvme_dev_disable(). nvme_kill_queues() then calls
revalidate_disk(), which waits on the same bd_mutex,
resulting in a deadlock.
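
A rough sketch of the two tasks involved (abbreviated by
hand, not exact backtraces):

  /* task 1: udevd re-reading the partition table */
  blkdev_get()                      /* takes bd_mutex */
    blk_queue_enter()               /* sleeps: queue is frozen */

  /* task 2: controller teardown */
  nvme_remove_dead_ctrl()
    nvme_dev_disable()              /* freezes the queues */
  nvme_kill_queues()
    revalidate_disk()               /* wants bd_mutex: deadlock */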

Give the hung tasks a chance to make progress by marking
the queues as dying and unfreezing them first, and only
then calling revalidate_disk() to update the disk size.

NOTE: I've seen this race when the controller does not
respond to I/Os or abort requests, but does respond to
other commands and even signals that it is ready after
reset, yet still drops I/O. I've tested this by emulating
that behaviour in the driver.

Signed-off-by: Balbir Singh <sblbir@amzn.com>
---

Changelog v3
  1. Simplify the comment about moving revalidate_disk
Changelog v2
  1. Rely on blk_set_queue_dying to do the wake_all()
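
(For reference, blk_set_queue_dying() in block/blk-core.c
already ends with

  /* Make blk_queue_enter() reexamine the DYING flag. */
  wake_up_all(&q->mq_freeze_wq);

so tasks sleeping in blk_queue_enter() re-check the dying
flag and fail out with -ENODEV; an explicit wake in this
patch would be redundant.)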

 drivers/nvme/host/core.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b45f82d58be8..6ad1f1df9e44 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -103,10 +103,13 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
 	 */
 	if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
 		return;
-	revalidate_disk(ns->disk);
 	blk_set_queue_dying(ns->queue);
 	/* Forcibly unquiesce queues to avoid blocking dispatch */
 	blk_mq_unquiesce_queue(ns->queue);
+	/*
+	 * Revalidate after unblocking dispatchers that may be holding bd_mutex
+	 */
+	revalidate_disk(ns->disk);
 }
 
 static void nvme_queue_scan(struct nvme_ctrl *ctrl)
-- 
2.16.5



* [PATCH] nvme/host/core: Allow overriding of wait_ready timeout
  2019-09-17 22:41 [PATCH v3 1/2] nvme/host/pci: Fix a race in controller removal Balbir Singh
@ 2019-09-17 22:41 ` Balbir Singh
  2019-09-17 23:45 ` [PATCH v3 1/2] nvme/host/pci: Fix a race in controller removal Keith Busch
  1 sibling, 0 replies; 3+ messages in thread
From: Balbir Singh @ 2019-09-17 22:41 UTC (permalink / raw)
  To: linux-nvme; +Cc: kbusch, axboe, Balbir Singh, hch, sagi

This is largely for debugging purposes, where controllers
with large timeouts get stuck during reset. In my case I
have a simple mirrored RAID configuration on top of NVMe
using mdadm. If one of the controllers hits I/O timeouts
and takes very long to become ready again (128 seconds
worst case), I am happy to use this debug option to drop
that controller quickly and let the system keep running.
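
For context, the 128 second worst case falls out of the
default wait-ready calculation in nvme_wait_ready() (see
the diff below): CAP.TO is an 8-bit field in units of
500 ms, so the longest a controller can ask us to wait is

  timeout = (NVME_CAP_TIMEOUT(cap) + 1) * HZ / 2;
  /* CAP.TO = 0xff -> (255 + 1) * 0.5s = 128 seconds */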

Signed-off-by: Balbir Singh <sblbir@amzn.com>
---

Changelog:
 Implement simplifications suggested in the review

NOTE:

I am still sending this out for wider review because I
think it is helpful. There is some concern that the value
might be misused, a quirk does not meet my requirements,
and the timeout is global (it applies to all controllers).
I still think it is useful for my use case, for lack of a
better solution.
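
For completeness, since this is a plain module parameter it
can be set at load time (e.g. nvme_core.wait_ready_timeout=5
on the kernel command line) or, given the 0644 permissions,
changed at runtime via
/sys/module/nvme_core/parameters/wait_ready_timeout
(assuming the usual nvme-core module naming).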

 drivers/nvme/host/core.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6ad1f1df9e44..798edc609d6a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -40,6 +40,10 @@ module_param_named(io_timeout, nvme_io_timeout, uint, 0644);
 MODULE_PARM_DESC(io_timeout, "timeout in seconds for I/O");
 EXPORT_SYMBOL_GPL(nvme_io_timeout);
 
+static unsigned int nvme_wait_ready_timeout;
+module_param_named(wait_ready_timeout, nvme_wait_ready_timeout, uint, 0644);
+MODULE_PARM_DESC(wait_ready_timeout, "timeout in seconds to wait for controller ready on reset");
+
 static unsigned char shutdown_timeout = 5;
 module_param(shutdown_timeout, byte, 0644);
 MODULE_PARM_DESC(shutdown_timeout, "timeout in seconds for controller shutdown");
@@ -1930,11 +1934,15 @@ const struct block_device_operations nvme_ns_head_ops = {
 
 static int nvme_wait_ready(struct nvme_ctrl *ctrl, u64 cap, bool enabled)
 {
-	unsigned long timeout =
-		((NVME_CAP_TIMEOUT(cap) + 1) * HZ / 2) + jiffies;
+	unsigned long timeout;
 	u32 csts, bit = enabled ? NVME_CSTS_RDY : 0;
 	int ret;
 
+	if (unlikely(nvme_wait_ready_timeout))
+		timeout = (nvme_wait_ready_timeout * HZ) + jiffies;
+	else
+		timeout = ((NVME_CAP_TIMEOUT(cap) + 1) * HZ / 2) + jiffies;
+
 	while ((ret = ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &csts)) == 0) {
 		if (csts == ~0)
 			return -ENODEV;
-- 
2.16.5



* Re: [PATCH v3 1/2] nvme/host/pci: Fix a race in controller removal
  2019-09-17 22:41 [PATCH v3 1/2] nvme/host/pci: Fix a race in controller removal Balbir Singh
  2019-09-17 22:41 ` [PATCH] nvme/host/core: Allow overriding of wait_ready timeout Balbir Singh
@ 2019-09-17 23:45 ` Keith Busch
  1 sibling, 0 replies; 3+ messages in thread
From: Keith Busch @ 2019-09-17 23:45 UTC (permalink / raw)
  To: Balbir Singh; +Cc: axboe, hch, linux-nvme, sagi

On Tue, Sep 17, 2019 at 10:41:04PM +0000, Balbir Singh wrote:
> This race is hard to hit in general, now that we
> have the shutdown_lock taken in both nvme_reset_work()
> and nvme_dev_disable().
> 
> The real issue is that after doing all the setup work
> in nvme_reset_work(), we can get another timeout
> (nvme_timeout()) and proceed to disable the controller.
> This causes the reset work to make only partial progress
> and then fail.
> 
> Depending on the progress made, we call into
> nvme_remove_dead_ctrl(), which does another
> nvme_dev_disable(), freezing the block-mq queues.
> 
> I've noticed a race with udevd: udevd tries to re-read
> the partition table, takes bd_mutex, and then blocks in
> blk_queue_enter() because we froze the queues in
> nvme_dev_disable(). nvme_kill_queues() then calls
> revalidate_disk(), which waits on the same bd_mutex,
> resulting in a deadlock.
> 
> Give the hung tasks a chance to make progress by marking
> the queues as dying and unfreezing them first, and only
> then calling revalidate_disk() to update the disk size.
> 
> NOTE: I've seen this race when the controller does not
> respond to I/Os or abort requests, but does respond to
> other commands and even signals that it is ready after
> reset, yet still drops I/O. I've tested this by emulating
> that behaviour in the driver.

I recommend the following changelog for this commit:

  User space programs like udevd may try to read partitions at the
  same time the driver detects a namespace is unusable, and may deadlock
  if revalidate_disk() is called while such a process is waiting to
  enter the frozen queue. On detecting a dead namespace, move the disk
  revalidation after unblocking dispatchers that may be holding bd_mutex.

And that's it.
 
> Signed-off-by: Balbir Singh <sblbir@amzn.com>
> ---
> 
> Changelog v3
>   1. Simplify the comment about moving revalidate_disk
> Changelog v2
>   1. Rely on blk_set_queue_dying to do the wake_all()
> 
>  drivers/nvme/host/core.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index b45f82d58be8..6ad1f1df9e44 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -103,10 +103,13 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
>  	 */
>  	if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
>  		return;
> -	revalidate_disk(ns->disk);
>  	blk_set_queue_dying(ns->queue);
>  	/* Forcibly unquiesce queues to avoid blocking dispatch */
>  	blk_mq_unquiesce_queue(ns->queue);
> +	/*
> +	 * Revalidate after unblocking dispatchers that may be holding bd_mutex
> +	 */
> +	revalidate_disk(ns->disk);
>  }
>  
>  static void nvme_queue_scan(struct nvme_ctrl *ctrl)
> -- 
> 2.16.5
> 

