linux-nvme.lists.infradead.org archive mirror
* [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal
@ 2019-09-13 23:36 Balbir Singh
  2019-09-13 23:36 ` [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout Balbir Singh
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Balbir Singh @ 2019-09-13 23:36 UTC (permalink / raw)
  To: linux-nvme; +Cc: kbusch, axboe, Balbir Singh, hch, sagi

This race is hard to hit in general, now that we
have the shutdown_lock in both nvme_reset_work() and
nvme_dev_disable().

The real issue is that after doing all the setup work
in nvme_reset_work(), if we get another timeout (nvme_timeout()),
we proceed to disable the controller. This causes
the reset work to only partially progress and then fail.

Depending on the progress made, we call into
nvme_remove_dead_ctrl(), which does another
nvme_dev_disable() freezing the block-mq queues.

I've noticed a race with udevd: udevd tries to re-read
the partition table, ends up holding bd_mutex, and then
waits in blk_queue_enter(), since we froze the queues
in nvme_dev_disable(). Meanwhile nvme_kill_queues() calls
revalidate_disk() and ends up waiting on the same bd_mutex,
resulting in a deadlock.
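
A sketch of the two wait chains involved (simplified, task names
illustrative):

  udevd:       opens the disk -> holds bd_mutex
               -> blk_queue_enter() waits on the frozen queue
  nvme worker: nvme_remove_dead_ctrl() -> nvme_kill_queues()
               -> revalidate_disk() -> waits on bd_mutex (deadlock)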

Allow the hung tasks a chance by unfreezing the queues after
setting dying bits on the queue, then call revalidate_disk()
to update the disk size.

NOTE: I've seen this race when the controller does not
respond to IOs or abort requests, but responds to other
commands and even signals it's ready after its reset,
but still drops IO. I've tested this by emulating the
behaviour in the driver.

Signed-off-by: Balbir Singh <sblbir@amzn.com>
---

Changelog:
- Rely on blk_set_queue_dying to do the wake_all()

 drivers/nvme/host/core.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b45f82d58be8..f6ddb58a7013 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -103,10 +103,16 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
 	 */
 	if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
 		return;
-	revalidate_disk(ns->disk);
 	blk_set_queue_dying(ns->queue);
 	/* Forcibly unquiesce queues to avoid blocking dispatch */
 	blk_mq_unquiesce_queue(ns->queue);
+	/*
+	 * revalidate_disk, after all pending IO is cleaned up
+	 * by blk_set_queue_dying, largely avoids any races with blk
+	 * partition reads that might come in after freezing the queues;
+	 * otherwise we'll end up waiting on bd_mutex, creating a deadlock.
+	 */
+	revalidate_disk(ns->disk);
 }
 
 static void nvme_queue_scan(struct nvme_ctrl *ctrl)
-- 
2.16.5



* [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout
  2019-09-13 23:36 [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal Balbir Singh
@ 2019-09-13 23:36 ` Balbir Singh
  2019-09-16  7:41   ` Christoph Hellwig
  2019-09-16  7:49 ` [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal Christoph Hellwig
  2019-09-16 15:40 ` Bart Van Assche
  2 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2019-09-13 23:36 UTC (permalink / raw)
  To: linux-nvme; +Cc: kbusch, axboe, Balbir Singh, hch, sagi

This is largely for debugging purposes, for cases where controllers
with large timeouts get stuck during reset.
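
For reference, a hypothetical invocation, assuming the parameter lands
under nvme_core as named in this patch:

  # cap wait-ready at 5 seconds while debugging
  modprobe nvme_core wait_ready_timeout=5
  # or at runtime (the parameter is 0644):
  echo 5 > /sys/module/nvme_core/parameters/wait_ready_timeout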

Signed-off-by: Balbir Singh <sblbir@amzn.com>
---

Changelog:
 - Remove EXPORT_SYMBOL_GPL for the nvme_wait_ready_timeout

 drivers/nvme/host/core.c | 7 +++++++
 drivers/nvme/host/nvme.h | 3 +++
 2 files changed, 10 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f6ddb58a7013..b8676786c06a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -40,6 +40,10 @@ module_param_named(io_timeout, nvme_io_timeout, uint, 0644);
 MODULE_PARM_DESC(io_timeout, "timeout in seconds for I/O");
 EXPORT_SYMBOL_GPL(nvme_io_timeout);
 
+unsigned int nvme_wait_ready_timeout = 0;
+module_param_named(wait_ready_timeout, nvme_wait_ready_timeout, uint, 0644);
+MODULE_PARM_DESC(wait_ready_timeout, "timeout in seconds for wait ready on reset");
+
 static unsigned char shutdown_timeout = 5;
 module_param(shutdown_timeout, byte, 0644);
 MODULE_PARM_DESC(shutdown_timeout, "timeout in seconds for controller shutdown");
@@ -1938,6 +1942,9 @@ static int nvme_wait_ready(struct nvme_ctrl *ctrl, u64 cap, bool enabled)
 	u32 csts, bit = enabled ? NVME_CSTS_RDY : 0;
 	int ret;
 
+	if (nvme_wait_ready_timeout)
+		timeout = NVME_WAIT_READY_TIMEOUT + jiffies;
+
 	while ((ret = ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &csts)) == 0) {
 		if (csts == ~0)
 			return -ENODEV;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index b5013c101b35..c3caabc1f149 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -21,6 +21,9 @@
 extern unsigned int nvme_io_timeout;
 #define NVME_IO_TIMEOUT	(nvme_io_timeout * HZ)
 
+extern unsigned int nvme_wait_ready_timeout;
+#define NVME_WAIT_READY_TIMEOUT	(nvme_wait_ready_timeout * HZ)
+
 extern unsigned int admin_timeout;
 #define ADMIN_TIMEOUT	(admin_timeout * HZ)
 
-- 
2.16.5



* Re: [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout
  2019-09-13 23:36 ` [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout Balbir Singh
@ 2019-09-16  7:41   ` Christoph Hellwig
  2019-09-16 12:33     ` Singh, Balbir
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2019-09-16  7:41 UTC (permalink / raw)
  To: Balbir Singh; +Cc: kbusch, axboe, hch, linux-nvme, sagi

On Fri, Sep 13, 2019 at 11:36:31PM +0000, Balbir Singh wrote:
> +unsigned int nvme_wait_ready_timeout = 0;
> +module_param_named(wait_ready_timeout, nvme_wait_ready_timeout, uint, 0644);
> +MODULE_PARM_DESC(wait_ready_timeout, "timeout in seconds for wait ready on reset");

This is only used in core.c, so it can be marked static.

Also it introduces a > 80 char line.

> +
>  static unsigned char shutdown_timeout = 5;
>  module_param(shutdown_timeout, byte, 0644);
>  MODULE_PARM_DESC(shutdown_timeout, "timeout in seconds for controller shutdown");
> @@ -1938,6 +1942,9 @@ static int nvme_wait_ready(struct nvme_ctrl *ctrl, u64 cap, bool enabled)
>  	u32 csts, bit = enabled ? NVME_CSTS_RDY : 0;
>  	int ret;
>  
> +	if (nvme_wait_ready_timeout)
> +		timeout = NVME_WAIT_READY_TIMEOUT + jiffies;

I'm not sure the NVME_WAIT_READY_TIMEOUT #define really helps much here.
Also the code is a little confusing as is; why not an if / else with
the normal timeout definition?
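
Something along these lines, assuming the usual CAP.TO-based
computation at the top of nvme_wait_ready() (illustrative sketch,
not tested):

	if (nvme_wait_ready_timeout)
		timeout = jiffies + nvme_wait_ready_timeout * HZ;
	else
		timeout = jiffies + ((NVME_CAP_TIMEOUT(cap) + 1) * HZ / 2);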

Then again I'm not even sure we really want this.  The debugging use
case is something where you can easily hack a line in the driver, and
we really don't want normal users to mess with a random parameter like
this one.


* Re: [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal
  2019-09-13 23:36 [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal Balbir Singh
  2019-09-13 23:36 ` [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout Balbir Singh
@ 2019-09-16  7:49 ` Christoph Hellwig
  2019-09-16 12:07   ` Singh, Balbir
  2019-09-16 15:40 ` Bart Van Assche
  2 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2019-09-16  7:49 UTC (permalink / raw)
  To: Balbir Singh; +Cc: kbusch, axboe, hch, linux-nvme, sagi

On Fri, Sep 13, 2019 at 11:36:30PM +0000, Balbir Singh wrote:
> This race is hard to hit in general, now that we
> have the shutdown_lock in both nvme_reset_work() and
> nvme_dev_disable().
> 
> The real issue is that after doing all the setup work
> in nvme_reset_work(), if we get another timeout (nvme_timeout()),
> we proceed to disable the controller. This causes
> the reset work to only partially progress and then fail.
> 
> Depending on the progress made, we call into
> nvme_remove_dead_ctrl(), which does another
> nvme_dev_disable() freezing the block-mq queues.
> 
> I've noticed a race with udevd: udevd tries to re-read
> the partition table, ends up holding bd_mutex, and then
> waits in blk_queue_enter(), since we froze the queues
> in nvme_dev_disable(). Meanwhile nvme_kill_queues() calls
> revalidate_disk() and ends up waiting on the same bd_mutex,
> resulting in a deadlock.
> 
> Allow the hung tasks a chance by unfreezing the queues after
> setting dying bits on the queue, then call revalidate_disk()
> to update the disk size.
> 
> NOTE: I've seen this race when the controller does not
> respond to IOs or abort requests, but responds to other
> commands and even signals it's ready after its reset,
> but still drops IO. I've tested this by emulating the
> behaviour in the driver.
> 
> Signed-off-by: Balbir Singh <sblbir@amzn.com>
> ---
> 
> Changelog:
> - Rely on blk_set_queue_dying to do the wake_all()
> 
>  drivers/nvme/host/core.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index b45f82d58be8..f6ddb58a7013 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -103,10 +103,16 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
>  	 */
>  	if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
>  		return;
> -	revalidate_disk(ns->disk);
>  	blk_set_queue_dying(ns->queue);
>  	/* Forcibly unquiesce queues to avoid blocking dispatch */
>  	blk_mq_unquiesce_queue(ns->queue);
> +	/*
> +	 * revalidate_disk, after all pending IO is cleaned up
> +	 * by blk_set_queue_dying, largely avoids any races with blk
> +	 * partition reads that might come in after freezing the queues;
> +	 * otherwise we'll end up waiting on bd_mutex, creating a deadlock.
> +	 */
> +	revalidate_disk(ns->disk);

The patch looks fine to me, but the comment looks a little strange.
How do we trigger the partition scan?  Is someone opening the device
again after we froze it?


* Re: [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal
  2019-09-16  7:49 ` [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal Christoph Hellwig
@ 2019-09-16 12:07   ` Singh, Balbir
  0 siblings, 0 replies; 23+ messages in thread
From: Singh, Balbir @ 2019-09-16 12:07 UTC (permalink / raw)
  To: hch, sblbir; +Cc: kbusch, axboe, sagi, linux-nvme

On Mon, 2019-09-16 at 09:49 +0200, Christoph Hellwig wrote:
> On Fri, Sep 13, 2019 at 11:36:30PM +0000, Balbir Singh wrote:
> > This race is hard to hit in general, now that we
> > have the shutdown_lock in both nvme_reset_work() and
> > nvme_dev_disable().
> > 
> > The real issue is that after doing all the setup work
> > in nvme_reset_work(), if we get another timeout (nvme_timeout()),
> > we proceed to disable the controller. This causes
> > the reset work to only partially progress and then fail.
> > 
> > Depending on the progress made, we call into
> > nvme_remove_dead_ctrl(), which does another
> > nvme_dev_disable() freezing the block-mq queues.
> > 
> > I've noticed a race with udevd: udevd tries to re-read
> > the partition table, ends up holding bd_mutex, and then
> > waits in blk_queue_enter(), since we froze the queues
> > in nvme_dev_disable(). Meanwhile nvme_kill_queues() calls
> > revalidate_disk() and ends up waiting on the same bd_mutex,
> > resulting in a deadlock.
> > 
> > Allow the hung tasks a chance by unfreezing the queues after
> > setting dying bits on the queue, then call revalidate_disk()
> > to update the disk size.
> > 
> > NOTE: I've seen this race when the controller does not
> > respond to IOs or abort requests, but responds to other
> > commands and even signals it's ready after its reset,
> > but still drops IO. I've tested this by emulating the
> > behaviour in the driver.
> > 
> > Signed-off-by: Balbir Singh <sblbir@amzn.com>
> > ---
> > 
> > Changelog:
> > - Rely on blk_set_queue_dying to do the wake_all()
> > 
> >  drivers/nvme/host/core.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index b45f82d58be8..f6ddb58a7013 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -103,10 +103,16 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
> >  	 */
> >  	if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
> >  		return;
> > -	revalidate_disk(ns->disk);
> >  	blk_set_queue_dying(ns->queue);
> >  	/* Forcibly unquiesce queues to avoid blocking dispatch */
> >  	blk_mq_unquiesce_queue(ns->queue);
> > +	/*
> > +	 * revalidate_disk, after all pending IO is cleaned up
> > +	 * by blk_set_queue_dying, largely avoids any races with blk
> > +	 * partition reads that might come in after freezing the queues;
> > +	 * otherwise we'll end up waiting on bd_mutex, creating a deadlock.
> > +	 */
> > +	revalidate_disk(ns->disk);
> 
> The patch looks fine to me, but the comment looks a little strange.
> How do we trigger the partition scan?  Is someone opening the device
> again after we froze it?

The race itself is between the timeout for an aborted IO (iod->aborted
= 1) causing a reset and the subsequent controller disable, eventually
leading to IO errors on the device.

To reproduce the issue I run fio on the block device; when fio gets
errors due to the timeout, it closes the device (/dev/nvme1), which
triggers udevd to re-read the partitions.

But independent of this scenario, the race can be triggered by anything
scanning or reading the partition table at the time of the controller
reset/disable.
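
For completeness, a sketch of the reproducer (device name and job
parameters are illustrative, not the exact job file I used):

  fio --name=repro --filename=/dev/nvme1n1 --rw=randread \
      --ioengine=libaio --iodepth=32 --direct=1 \
      --time_based --runtime=600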

Balbir Singh.

* Re: [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout
  2019-09-16  7:41   ` Christoph Hellwig
@ 2019-09-16 12:33     ` Singh, Balbir
  2019-09-16 16:01       ` hch
  0 siblings, 1 reply; 23+ messages in thread
From: Singh, Balbir @ 2019-09-16 12:33 UTC (permalink / raw)
  To: hch, sblbir; +Cc: kbusch, axboe, sagi, linux-nvme

On Mon, 2019-09-16 at 09:41 +0200, Christoph Hellwig wrote:
> On Fri, Sep 13, 2019 at 11:36:31PM +0000, Balbir Singh wrote:
> > +unsigned int nvme_wait_ready_timeout = 0;
> > +module_param_named(wait_ready_timeout, nvme_wait_ready_timeout, uint, 0644);
> > +MODULE_PARM_DESC(wait_ready_timeout, "timeout in seconds for wait ready on reset");
> 
> This is only used in core.c, so it can be marked static.
> 
> Also it introduces a > 80 char line.

I'll fix that, I must revisit my checkpatch results.

> 
> > +
> >  static unsigned char shutdown_timeout = 5;
> >  module_param(shutdown_timeout, byte, 0644);
> >  MODULE_PARM_DESC(shutdown_timeout, "timeout in seconds for controller shutdown");
> > @@ -1938,6 +1942,9 @@ static int nvme_wait_ready(struct nvme_ctrl *ctrl, u64 cap, bool enabled)
> >  	u32 csts, bit = enabled ? NVME_CSTS_RDY : 0;
> >  	int ret;
> >  
> > +	if (nvme_wait_ready_timeout)
> > +		timeout = NVME_WAIT_READY_TIMEOUT + jiffies;
> 
> I'm not sure the NVME_WAIT_READY_TIMEOUT #define really helps much
> here.
> Also the code is a little confusing as is; why not an if / else with
> the normal timeout definition?

I could refactor the patch.

> 
> Then again I'm not even sure we really want this.  The debugging use
> case is something where you can easily hack a line in the driver, and
> we really don't want normal users to mess with a random parameter
> like this one.

The reason I sent this out is that I've seen some controllers setting
this to the max value. I suspect there is no good way for the controller
to set wait ready values either. In any case 128 seconds for a failure
seems a bit too much, especially if the controller does not respond on
boot, hence the debug option. I don't expect it to be used by normal
users on most of their systems.

Balbir Singh.


* Re: [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal
  2019-09-13 23:36 [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal Balbir Singh
  2019-09-13 23:36 ` [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout Balbir Singh
  2019-09-16  7:49 ` [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal Christoph Hellwig
@ 2019-09-16 15:40 ` Bart Van Assche
  2019-09-16 19:38   ` Singh, Balbir
  2 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2019-09-16 15:40 UTC (permalink / raw)
  To: Balbir Singh, linux-nvme; +Cc: kbusch, axboe, hch, sagi

On 9/13/19 4:36 PM, Balbir Singh wrote:
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index b45f82d58be8..f6ddb58a7013 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -103,10 +103,16 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
>   	 */
>   	if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
>   		return;
> -	revalidate_disk(ns->disk);
>   	blk_set_queue_dying(ns->queue);
>   	/* Forcibly unquiesce queues to avoid blocking dispatch */
>   	blk_mq_unquiesce_queue(ns->queue);
> +	/*
> +	 * revalidate_disk, after all pending IO is cleaned up
> +	 * by blk_set_queue_dying, largely avoids any races with blk
> +	 * partition reads that might come in after freezing the queues;
> +	 * otherwise we'll end up waiting on bd_mutex, creating a deadlock.
> +	 */
> +	revalidate_disk(ns->disk);
>   }

The comment above revalidate_disk() looks wrong to me. I don't think 
that blk_set_queue_dying() guarantees that ongoing commands have 
finished by the time that function returns. All blk_set_queue_dying() 
does is to set the DYING flag, to kill q->q_usage_counter and to wake up 
threads that are waiting inside a request allocation function. It does 
not wait for pending commands to finish.
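
For reference, a paraphrased sketch of what blk_set_queue_dying() does
(simplified from the block layer of this era, not verbatim):

  void blk_set_queue_dying(struct request_queue *q)
  {
  	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
  	/* make future blk_queue_enter() callers fail */
  	blk_freeze_queue_start(q);	/* kills q_usage_counter */
  	if (queue_is_mq(q))
  		blk_mq_wake_waiters(q);
  	/* let blk_queue_enter() waiters re-examine the DYING flag */
  	wake_up_all(&q->mq_freeze_wq);
  }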

Bart.



* Re: [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout
  2019-09-16 12:33     ` Singh, Balbir
@ 2019-09-16 16:01       ` hch
  2019-09-16 21:04         ` Singh, Balbir
  0 siblings, 1 reply; 23+ messages in thread
From: hch @ 2019-09-16 16:01 UTC (permalink / raw)
  To: Singh, Balbir; +Cc: sblbir, sagi, linux-nvme, axboe, kbusch, hch

On Mon, Sep 16, 2019 at 12:33:31PM +0000, Singh, Balbir wrote:
> > Then again I'm not even sure we really want this.  The debugging use
> > case is something where you can easily hack a line in the driver, and
> > we really don't want normal users to mess with a random parameter
> > like this one.
> 
> > The reason I sent this out is that I've seen some controllers setting
> > this to the max value. I suspect there is no good way for the controller
> > to set wait ready values either. In any case 128 seconds for a failure
> > seems a bit too much, especially if the controller does not respond on
> > boot, hence the debug option. I don't expect it to be used by normal
> > users on most of their systems.

The problem with these tweaks is that people will touch them for
weird reasons.  Another one is that a module parameter is global,
while many settings should be per-controller (although that would
add even more boilerplate code).


* Re: [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal
  2019-09-16 15:40 ` Bart Van Assche
@ 2019-09-16 19:38   ` Singh, Balbir
  2019-09-16 19:56     ` Bart Van Assche
  2019-09-16 20:07     ` Keith Busch
  0 siblings, 2 replies; 23+ messages in thread
From: Singh, Balbir @ 2019-09-16 19:38 UTC (permalink / raw)
  To: bvanassche, linux-nvme, sblbir; +Cc: kbusch, axboe, hch, sagi

On Mon, 2019-09-16 at 08:40 -0700, Bart Van Assche wrote:
> On 9/13/19 4:36 PM, Balbir Singh wrote:
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index b45f82d58be8..f6ddb58a7013 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -103,10 +103,16 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
> >   	 */
> >   	if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
> >   		return;
> > -	revalidate_disk(ns->disk);
> >   	blk_set_queue_dying(ns->queue);
> >   	/* Forcibly unquiesce queues to avoid blocking dispatch */
> >   	blk_mq_unquiesce_queue(ns->queue);
> > +	/*
> > +	 * revalidate_disk, after all pending IO is cleaned up
> > +	 * by blk_set_queue_dying, largely avoids any races with blk
> > +	 * partition reads that might come in after freezing the queues;
> > +	 * otherwise we'll end up waiting on bd_mutex, creating a deadlock.
> > +	 */
> > +	revalidate_disk(ns->disk);
> >   }
> 
> The comment above revalidate_disk() looks wrong to me. I don't think
> that blk_set_queue_dying() guarantees that ongoing commands have
> finished by the time that function returns. All blk_set_queue_dying()
> does is to set the DYING flag, to kill q->q_usage_counter and to wake
> up threads that are waiting inside a request allocation function. It
> does not wait for pending commands to finish.

I was referring to the combined effect of blk_set_queue_dying() and
blk_mq_unquiesce_queue() which should invoke blk_mq_run_hw_queues().
I can see how that might be misleading. I can reword it to say

/*
 * revalidate_disk, after all pending IO is cleaned up,
 * largely avoids any races with block partition
 * reads that might come in after freezing the queues; otherwise
 * we'll end up waiting on bd_mutex, creating a deadlock
 */

Would that work?
Balbir Singh.

> 
> Bart.
> 

* Re: [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal
  2019-09-16 19:38   ` Singh, Balbir
@ 2019-09-16 19:56     ` Bart Van Assche
  2019-09-16 20:40       ` Singh, Balbir
  2019-09-16 20:07     ` Keith Busch
  1 sibling, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2019-09-16 19:56 UTC (permalink / raw)
  To: Singh, Balbir, linux-nvme, sblbir; +Cc: kbusch, axboe, hch, sagi

On 9/16/19 12:38 PM, Singh, Balbir wrote:
> On Mon, 2019-09-16 at 08:40 -0700, Bart Van Assche wrote:
>> On 9/13/19 4:36 PM, Balbir Singh wrote:
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index b45f82d58be8..f6ddb58a7013 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -103,10 +103,16 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
>>>    	 */
>>>    	if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
>>>    		return;
>>> -	revalidate_disk(ns->disk);
>>>    	blk_set_queue_dying(ns->queue);
>>>    	/* Forcibly unquiesce queues to avoid blocking dispatch */
>>>    	blk_mq_unquiesce_queue(ns->queue);
>>> +	/*
>>> +	 * revalidate_disk, after all pending IO is cleaned up
>>> +	 * by blk_set_queue_dying, largely avoids any races with blk
>>> +	 * partition reads that might come in after freezing the queues;
>>> +	 * otherwise we'll end up waiting on bd_mutex, creating a deadlock.
>>> +	 */
>>> +	revalidate_disk(ns->disk);
>>>    }
>>
>> The comment above revalidate_disk() looks wrong to me. I don't think
>> that blk_set_queue_dying() guarantees that ongoing commands have
>> finished by the time that function returns. All blk_set_queue_dying()
>> does is to set the DYING flag, to kill q->q_usage_counter and to wake
>> up threads that are waiting inside a request allocation function. It
>> does not wait for pending commands to finish.
> 
> I was referring to the combined effect of blk_set_queue_dying() and
> blk_mq_unquiesce_queue() which should invoke blk_mq_run_hw_queues().
> I can see how that might be misleading. I can reword it to say
> 
> /*
>   * revalidate_disk, after all pending IO is cleaned up,
>   * largely avoids any races with block partition
>   * reads that might come in after freezing the queues; otherwise
>   * we'll end up waiting on bd_mutex, creating a deadlock
>   */
> 
> Would that work?

I don't think so. Running the hardware queues is not sufficient to 
guarantee that requests that had been started before the DYING was set 
have finished.

Bart.


* Re: [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal
  2019-09-16 19:38   ` Singh, Balbir
  2019-09-16 19:56     ` Bart Van Assche
@ 2019-09-16 20:07     ` Keith Busch
  1 sibling, 0 replies; 23+ messages in thread
From: Keith Busch @ 2019-09-16 20:07 UTC (permalink / raw)
  To: Singh, Balbir; +Cc: sblbir, sagi, linux-nvme, axboe, hch, bvanassche

On Mon, Sep 16, 2019 at 07:38:44PM +0000, Singh, Balbir wrote:
> On Mon, 2019-09-16 at 08:40 -0700, Bart Van Assche wrote:
> > The comment above revalidate_disk() looks wrong to me. I don't think 
> > that blk_set_queue_dying() guarantees that ongoing commands have 
> > finished by the time that function returns. All blk_set_queue_dying()
> > does is to set the DYING flag, to kill q->q_usage_counter and to wake
> > up threads that are waiting inside a request allocation function. It
> > does not wait for pending commands to finish.
> 
> I was referring to the combined effect of blk_set_queue_dying() and
> blk_mq_unquiesce_queue() which should invoke blk_mq_run_hw_queues().
> I can see how that might be misleading. I can reword it to say
> 
> /*
>  * revalidate_disk, after all pending IO is cleaned up,
>  * largely avoids any races with block partition
>  * reads that might come in after freezing the queues; otherwise
>  * we'll end up waiting on bd_mutex, creating a deadlock
>  */
> 
> Would that work?

How about:

 /* Revalidate after unblocking dispatchers that may be holding bd_mutex */


* Re: [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal
  2019-09-16 19:56     ` Bart Van Assche
@ 2019-09-16 20:40       ` Singh, Balbir
  2019-09-17 17:55         ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: Singh, Balbir @ 2019-09-16 20:40 UTC (permalink / raw)
  To: bvanassche, linux-nvme, sblbir; +Cc: kbusch, axboe, hch, sagi

On Mon, 2019-09-16 at 12:56 -0700, Bart Van Assche wrote:
> On 9/16/19 12:38 PM, Singh, Balbir wrote:
> > On Mon, 2019-09-16 at 08:40 -0700, Bart Van Assche wrote:
> > > On 9/13/19 4:36 PM, Balbir Singh wrote:
> > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > > index b45f82d58be8..f6ddb58a7013 100644
> > > > --- a/drivers/nvme/host/core.c
> > > > +++ b/drivers/nvme/host/core.c
> > > > @@ -103,10 +103,16 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
> > > >    	 */
> > > >    	if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
> > > >    		return;
> > > > -	revalidate_disk(ns->disk);
> > > >    	blk_set_queue_dying(ns->queue);
> > > >    	/* Forcibly unquiesce queues to avoid blocking dispatch */
> > > >    	blk_mq_unquiesce_queue(ns->queue);
> > > > +	/*
> > > > +	 * revalidate_disk, after all pending IO is cleaned up
> > > > +	 * by blk_set_queue_dying, largely avoids any races with blk
> > > > +	 * partition reads that might come in after freezing the queues;
> > > > +	 * otherwise we'll end up waiting on bd_mutex, creating a deadlock.
> > > > +	 */
> > > > +	revalidate_disk(ns->disk);
> > > >    }
> > > 
> > > The comment above revalidate_disk() looks wrong to me. I don't think
> > > that blk_set_queue_dying() guarantees that ongoing commands have
> > > finished by the time that function returns. All blk_set_queue_dying()
> > > does is to set the DYING flag, to kill q->q_usage_counter and to wake
> > > up threads that are waiting inside a request allocation function. It
> > > does not wait for pending commands to finish.
> > 
> > I was referring to the combined effect of blk_set_queue_dying() and
> > blk_mq_unquiesce_queue() which should invoke blk_mq_run_hw_queues().
> > I can see how that might be misleading. I can reword it to say
> > 
> > /*
> >   * revalidate_disk, after all pending IO is cleaned up,
> >   * largely avoids any races with block partition
> >   * reads that might come in after freezing the queues; otherwise
> >   * we'll end up waiting on bd_mutex, creating a deadlock
> >   */
> > 
> > Would that work?
> 
> I don't think so. Running the hardware queues is not sufficient to 
> guarantee that requests that had been started before the DYING was set 
> have finished.

Maybe I am missing something, but in this particular scenario, there
should be no requests pending with the controller - we've frozen the
queues and disabled the device (nvme_dev_disable()). A combination of
blk_set_queue_dying() and blk_mq_unquiesce_queue() should flush any
pending bits out.

Is your concern largely with the comment or the patch?

Keith just recommended adding:

 /* Revalidate after unblocking dispatchers that may be holding bd_mutex */

That sounds quite reasonable to me.

Cheers,
Balbir Singh.

> 
> Bart.

* Re: [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout
  2019-09-16 16:01       ` hch
@ 2019-09-16 21:04         ` Singh, Balbir
  2019-09-17  1:14           ` Keith Busch
  0 siblings, 1 reply; 23+ messages in thread
From: Singh, Balbir @ 2019-09-16 21:04 UTC (permalink / raw)
  To: hch; +Cc: kbusch, axboe, sblbir, sagi, linux-nvme

On Mon, 2019-09-16 at 18:01 +0200, hch@lst.de wrote:
> On Mon, Sep 16, 2019 at 12:33:31PM +0000, Singh, Balbir wrote:
> > > Then again I'm not even sure we really want this.  The debugging use
> > > case is something where you can easily hack a line in the driver, and
> > > we really don't want normal users to mess with a random parameter
> > > like this one.
> > 
> > The reason I sent this out is that I've seen some controllers setting
> > this to the max value. I suspect there is no good way for the controller
> > to set wait ready values either. In any case 128 seconds for a failure
> > seems a bit too much, especially if the controller does not respond on
> > boot, hence the debug option. I don't expect it to be used by normal
> > users on most of their systems.
> 
> The problem with these tweaks is that people will touch them for
> weird reasons.  Another one is that a module parameter is global,
> while many settings should be per-controller (although that would
> add even more boilerplate code).

I do understand the concerns; maybe a sysfs file would be better for
per-controller tweaks, but that means it is hard to apply on boot.

I wonder if I should just call it debug_wait_ready_timeout and call
out in the help/description that this is a debug feature.

What do you suggest?

Cheers,
Balbir Singh.



* Re: [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout
  2019-09-16 21:04         ` Singh, Balbir
@ 2019-09-17  1:14           ` Keith Busch
  2019-09-17  2:56             ` Singh, Balbir
  0 siblings, 1 reply; 23+ messages in thread
From: Keith Busch @ 2019-09-17  1:14 UTC (permalink / raw)
  To: Singh, Balbir; +Cc: axboe, sblbir, hch, linux-nvme, sagi

On Mon, Sep 16, 2019 at 09:04:31PM +0000, Singh, Balbir wrote:
> I wonder if I should just call it debug_wait_ready_timeout and call
> out in the help/description that this is a debug feature.
> 
> What do you suggest?

I recommend going to the vendors that report incorrect timeout values, and
apparently broken controllers that can't initialize, to have them fix
both. If it doesn't initialize in 128 seconds, your only debugging
recourse is to report to the vendor anyway; overriding the timeout to
something under what the device reports it requires doesn't exactly
provide you any additional debugging information.

If you really need the driver to do spec non-compliant behavior, we have
quirks for that.
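
For example, quirks are wired up via the PCI ID table in
drivers/nvme/host/pci.c; a hypothetical entry would look like the
sketch below (the vendor/device IDs and the quirk flag are made up
for illustration):

	{ PCI_DEVICE(0x1234, 0x5678),	/* hypothetical controller */
		.driver_data = NVME_QUIRK_SHORT_WAIT_READY, },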


* Re: [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout
  2019-09-17  1:14           ` Keith Busch
@ 2019-09-17  2:56             ` Singh, Balbir
  2019-09-17  3:17               ` Bart Van Assche
  2019-09-17  3:54               ` Keith Busch
  0 siblings, 2 replies; 23+ messages in thread
From: Singh, Balbir @ 2019-09-17  2:56 UTC (permalink / raw)
  To: kbusch; +Cc: axboe, sblbir, hch, linux-nvme, sagi

On Mon, 2019-09-16 at 19:14 -0600, Keith Busch wrote:
> On Mon, Sep 16, 2019 at 09:04:31PM +0000, Singh, Balbir wrote:
> > I wonder if I should just call it debug_wait_ready_timeout and call
> > out in the help/description that this is a debug feature.
> > 
> > What do you suggest?
> 
> I recommend going to the vendors that report incorrect timeout values, and
> apparently broken controllers that can't initialize, to have them fix
> both. If it doesn't initialize in 128 seconds, your only debugging
> recourse is to report to the vendor anyway; overriding the timeout to
> something under what the device reports it requires doesn't exactly
> provide you any additional debugging information.
> 

In my case I was doing a simple mirror (using mdadm across two nvme
devices) and when I get timeouts on one, I need to wait up to 128 seconds
before switching over. Ideally I want this to be really fast and drop
the slow broken controller.
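
For context, the setup is a plain RAID1 mirror, e.g. (device names
illustrative):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/nvme0n1 /dev/nvme1n1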

> If you really need the driver to do spec non-compliant behavior, we have
> quirks for that.

I like the quirks approach, but it assumes the timeout value is not
variable, but rather fixed by the quirk. I was attempting to really have
mirrored IO timeout quickly on a bad device.

Balbir Singh.




* Re: [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout
  2019-09-17  2:56             ` Singh, Balbir
@ 2019-09-17  3:17               ` Bart Van Assche
  2019-09-17  5:02                 ` Singh, Balbir
  2019-09-17 17:21                 ` James Smart
  2019-09-17  3:54               ` Keith Busch
  1 sibling, 2 replies; 23+ messages in thread
From: Bart Van Assche @ 2019-09-17  3:17 UTC (permalink / raw)
  To: Singh, Balbir, kbusch; +Cc: axboe, sblbir, hch, linux-nvme, sagi

On 9/16/19 7:56 PM, Singh, Balbir wrote:
> On Mon, 2019-09-16 at 19:14 -0600, Keith Busch wrote:
> [ ... ]
> In my case I was doing a simple mirror (using mdadm across two nvme
> devices) and when I get timeouts on one, I need to wait up to 128 seconds
> before switching over. Ideally I want this to be really fast and drop
> the slow broken controller.
> 
>> If you really need the driver to do spec non-compliant behavior, we have
>> quirks for that.
> 
> I like the quirks approach, but it assumes the timeout value is not
> variable, but rather fixed by the quirk. I was attempting to really have
> mirrored IO timeout quickly on a bad device.

Other Linux kernel storage transports (FC, SRP) decouple the failover 
timeout from the I/O timeout. See also the output of git grep -nH 
fast_io_fail for the kernel source tree. See also the documentation of 
fast_io_fail_tmo in https://linux.die.net/man/5/multipath.conf. Maybe we 
need something similar for NVMe?
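
For illustration, this is the dm-multipath knob in multipath.conf
(values picked arbitrarily):

  defaults {
          fast_io_fail_tmo 5
          dev_loss_tmo 60
  }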

Bart.


* Re: [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout
  2019-09-17  2:56             ` Singh, Balbir
  2019-09-17  3:17               ` Bart Van Assche
@ 2019-09-17  3:54               ` Keith Busch
  1 sibling, 0 replies; 23+ messages in thread
From: Keith Busch @ 2019-09-17  3:54 UTC (permalink / raw)
  To: Singh, Balbir; +Cc: axboe, sblbir, hch, linux-nvme, sagi

On Tue, Sep 17, 2019 at 02:56:44AM +0000, Singh, Balbir wrote:
> On Mon, 2019-09-16 at 19:14 -0600, Keith Busch wrote:
> > On Mon, Sep 16, 2019 at 09:04:31PM +0000, Singh, Balbir wrote:
> > > I wonder if I should just call it debug_wait_ready_timeout and call
> > > out in the help/description that this is a debug feature.
> > > 
> > > What do you suggest?
> > 
> > I recommend going to the vendors that report incorrect timeout values, and
> > apparently broken controllers that can't initialize, to have them fix
> > both. If it doesn't initialize in 128 seconds, your only debugging
> > recourse is to report to the vendor anyway; overriding the timeout to
> > something under what the device reports it requires doesn't exactly
> > provide you any additional debugging information.
> > 
> 
> In my case I was doing a simple mirror (using mdadm across two nvme
> devices) and when I get timeouts on one, I need to wait up to 128 seconds
> before switching over. Ideally I want this to be really fast and drop
> the slow broken controller.

So it's not really for debugging at all.

The initialization code will abort if you send the thread a SIGKILL. I
don't think we currently provide a nice way to determine the PID you want
to signal, but I think making that known to user space and controlling
the early termination from there might be the better way to go.


* Re: [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout
  2019-09-17  3:17               ` Bart Van Assche
@ 2019-09-17  5:02                 ` Singh, Balbir
  2019-09-17 17:21                 ` James Smart
  1 sibling, 0 replies; 23+ messages in thread
From: Singh, Balbir @ 2019-09-17  5:02 UTC (permalink / raw)
  To: kbusch, bvanassche; +Cc: axboe, sblbir, hch, linux-nvme, sagi

On Mon, 2019-09-16 at 20:17 -0700, Bart Van Assche wrote:
> On 9/16/19 7:56 PM, Singh, Balbir wrote:
> > On Mon, 2019-09-16 at 19:14 -0600, Keith Busch wrote:
> > [ ... ]
> > In my case I was doing a simple mirror (using mdadm across two nvme
> > devices) and when I get timeouts on one, I need to wait up to 128 seconds
> > before switching over. Ideally I want this to be really fast and drop
> > the slow broken controller.
> > 
> > > If you really need the driver to do spec non-compliant behavior, we have
> > > quirks for that.
> > 
> > I like the quirks approach, but it assumes the timeout value is not
> > variable, but rather fixed by the quirk. I was attempting to really have
> > mirrored IO timeout quickly on a bad device.
> 
> Other Linux kernel storage transports (FC, SRP) decouple the failover 
> timeout from the I/O timeout. See also the output of git grep -nH 
> fast_io_fail for the kernel source tree. See also the documentation of 
> fast_io_fail_tmo in https://linux.die.net/man/5/multipath.conf. Maybe we 
> need something similar for NVMe?
> 

Great suggestion!

I tried something similar with wait_ready, isolating it from io_timeout,
since it is sort of the bottleneck w.r.t. recovery in a mirrored
configuration (timeouts on one of the volumes).

In this case it needs to be system-wide to remove the non-responsive
controller within the expected timeout; not everyone needs it, but
it sort of gives similar behaviour to (FC, SRP) failure in a multipath
setup. I am not too familiar with the FC/SRP setup, but it seems like
we should provide an aggregated timeout for the full reset work on
failure/timeouts. Would that make sense to try and build, or did you
have some other design in mind?

> Bart.

Cheers,
Balbir Singh.

* Re: [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout
  2019-09-17  3:17               ` Bart Van Assche
  2019-09-17  5:02                 ` Singh, Balbir
@ 2019-09-17 17:21                 ` James Smart
  2019-09-17 20:08                   ` James Smart
  1 sibling, 1 reply; 23+ messages in thread
From: James Smart @ 2019-09-17 17:21 UTC (permalink / raw)
  To: Bart Van Assche, Singh, Balbir, kbusch
  Cc: axboe, sblbir, hch, linux-nvme, sagi

On 9/16/2019 8:17 PM, Bart Van Assche wrote:
> On 9/16/19 7:56 PM, Singh, Balbir wrote:
>> On Mon, 2019-09-16 at 19:14 -0600, Keith Busch wrote:
>> [ ... ]
>> In my case I was doing a simple mirror (using mdadm across two nvme
>> devices) and when I get timeouts on one, I need to wait up to 128
>> seconds before switching over. Ideally I want this to be really fast
>> and drop the slow broken controller.
>>
>>> If you really need the driver to do spec non-compliant behavior, we
>>> have quirks for that.
>>
>> I like the quirks approach, but it assumes the timeout value is not
>> variable, but rather fixed by the quirk. I was attempting to really have
>> mirrored IO timeout quickly on a bad device.
>
> Other Linux kernel storage transports (FC, SRP) decouple the failover 
> timeout from the I/O timeout. See also the output of git grep -nH 
> fast_io_fail for the kernel source tree. See also the documentation of 
> fast_io_fail_tmo in https://linux.die.net/man/5/multipath.conf. Maybe 
> we need something similar for NVMe?
>

Well, it's not really fast io fail that needs to be replicated, and in 
fact, when I looked at nvme-fc, I saw no need for fast_io_failover as it 
didn't apply.

To understand:
with SCSI - we had the device "blocked" as there was a detection of a 
loss of connectivity to the device. This blocked state did not terminate 
i/o - we let any io completions trickling in continue to finish, but we 
certainly stopped new i/o from being started.  I/O could continue to 
timeout, but in most cases, a timeout while in this 
loss-of-connectivity, meant the timeout was just rescheduled. The 
blocked state was dependent on the "device loss" timeout that was 
running. Outstanding i/o wouldn't be terminated by the lldd until the 
final point when we gave up on the device and tore it down - the device 
loss timeout expiration.   With multipathing, waiting for the device 
loss timeout was too long - so we invented the fast-io-fail timeout, 
started at the same point as devloss and inherently expiring before
devloss, which would terminate all i/o to the device. This allowed 
multipath to get the io back faster than actual device failure.

With NVME-FC - there is a similar behavior to the "blocked" state, which 
is the reconnecting state.  E.g. when loss of connectivity is 
determined, the controller goes through an implicit reset which 
terminates all outstanding io, then goes into a reconnect timeout that 
retries connections up until an overall timer expires - known 
generically as ctrl_loss_tmo which applies to all fabric types and 
defaults to 60s.  FC additionally adds in the "device loss" tmo known by 
SCSI (the FC device may be both SCSI and NVME and should use the same 
value) and expires on the minimum of those two timeout values.    The 
fact that the controller reset terminates all outstanding i/o, true on 
any fabric transport, means the fast_io_fail timeout isn't needed.

So what seems to be talked about in this thread is how the fabric 
detects device connectivity loss.   FC has its nameserver so it's 
automatic.  But the other transports don't have such a thing, unless 
it's TCP connection timeout failures or similar.  Connectivity loss is 
supposed to be the job of the keep alive timeout.  So I would look at 
that area to see how it should be manipulated.
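
For what it's worth, ctrl_loss_tmo is already tunable per connection on
the fabrics side, e.g. with nvme-cli (transport, address and NQN below
are placeholders):

  nvme connect -t rdma -a 10.0.0.1 -s 4420 \
       -n nqn.2019-09.org.example:subsys1 --ctrl-loss-tmo=30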

-- james




* Re: [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal
  2019-09-16 20:40       ` Singh, Balbir
@ 2019-09-17 17:55         ` Bart Van Assche
  2019-09-17 20:30           ` Keith Busch
  2019-09-17 20:44           ` Singh, Balbir
  0 siblings, 2 replies; 23+ messages in thread
From: Bart Van Assche @ 2019-09-17 17:55 UTC (permalink / raw)
  To: Singh, Balbir, linux-nvme, sblbir; +Cc: kbusch, axboe, hch, sagi

On 9/16/19 1:40 PM, Singh, Balbir wrote:
> Is your concern largely with the comment or the patch?
> 
> Keith just recommended adding:
> 
>   /* Revalidate after unblocking dispatchers that may be holding bd_mutex */
> 
> That sounds quite reasonable to me.

Hi Balbir,

Keith's comment suggestion sounds great to me.

Another concern I have is that your patch depends on implementation 
details of the block layer. Calling revalidate_disk() after the DYING 
flag has been set means that requests will be allocated, started and 
completed after the DYING flag has been set. I think that works with the 
current implementation of the block layer because blk_queue_enter() does 
not check the DYING flag if percpu_ref_tryget_live() succeeds. If your 
patch gets accepted and if blk_queue_enter() would be modified such that 
the DYING flag is checked for every blk_queue_enter() call then that 
would break the NVMe driver.
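
For reference, a paraphrased sketch of the relevant blk_queue_enter()
logic (simplified, not verbatim):

  int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
  {
  	while (true) {
  		/* fast path: the DYING flag is not checked here */
  		if (percpu_ref_tryget_live(&q->q_usage_counter))
  			return 0;
  		if (flags & BLK_MQ_REQ_NOWAIT)
  			return -EBUSY;
  		wait_event(q->mq_freeze_wq,
  			   !q->mq_freeze_depth || blk_queue_dying(q));
  		if (blk_queue_dying(q))
  			return -ENODEV;
  	}
  }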

Thanks,

Bart.


* Re: [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout
  2019-09-17 17:21                 ` James Smart
@ 2019-09-17 20:08                   ` James Smart
  0 siblings, 0 replies; 23+ messages in thread
From: James Smart @ 2019-09-17 20:08 UTC (permalink / raw)
  To: Bart Van Assche, Singh, Balbir, kbusch
  Cc: axboe, sblbir, hch, linux-nvme, sagi


On 9/17/2019 10:21 AM, James Smart wrote:
> On 9/16/2019 8:17 PM, Bart Van Assche wrote:
>> On 9/16/19 7:56 PM, Singh, Balbir wrote:
>>> On Mon, 2019-09-16 at 19:14 -0600, Keith Busch wrote:
>>> [ ... ]
>>> In my case I was doing a simple mirror (using mdadm across two nvme
>>> devices) and when I get timeouts on one, I need to wait up to 128
>>> seconds before switching over. Ideally I want this to be really fast
>>> and drop the slow broken controller.
>>>
>>>> If you really need the driver to do spec non-compliant behavior, we
>>>> have quirks for that.
>>>
>>> I like the quirks approach, but it assumes the timeout value is not
>>> variable, but rather fixed by the quirk. I was attempting to really
>>> have mirrored IO timeout quickly on a bad device.
>>
>> Other Linux kernel storage transports (FC, SRP) decouple the failover 
>> timeout from the I/O timeout. See also the output of git grep -nH 
>> fast_io_fail for the kernel source tree. See also the documentation 
>> of fast_io_fail_tmo in https://linux.die.net/man/5/multipath.conf. 
>> Maybe we need something similar for NVMe?
>>
>
> Well, it's not really fast io fail that needs to be replicated, and in 
> fact, when I looked at nvme-fc, I saw no need for fast_io_failover as 
> it didn't apply.
>
> To understand:
> with SCSI - we had the device "blocked" as there was a detection of a 
> loss of connectivity to the device. This blocked state did not 
> terminate i/o - we let any io completions trickling in continue to 
> finish, but we certainly stopped new i/o from being started.  I/O 
> could continue to timeout, but in most cases, a timeout while in this 
> loss-of-connectivity, meant the timeout was just rescheduled. The 
> blocked state was dependent on the "device loss" timeout that was 
> running. Outstanding i/o wouldn't be terminated by the lldd until the 
> final point when we gave up on the device and tore it down - the 
> device loss timeout expiration.   With multipathing, waiting for the 
> device loss timeout was too long - so we invented the fast-io-fail 
> timeout, started at the same point as devloss and inherently expiring
> before devloss, which would terminate all i/o to the device. 
> This allowed multipath to get the io back faster than actual device 
> failure.
>
> With NVME-FC - there is a similar behavior to the "blocked" state, 
> which is the reconnecting state.  E.g. when loss of connectivity is 
> determined, the controller goes through an implicit reset which 
> terminates all outstanding io, then goes into a reconnect timeout that 
> retries connections up until an overall timer expires - known 
> generically as ctrl_loss_tmo which applies to all fabric types and 
> defaults to 60s.  FC additionally adds in the "device loss" tmo known 
> by SCSI (the FC device may be both SCSI and NVME and should use the 
> same value) and expires on the minimum of those two timeout values.    
> The fact that the controller reset terminates all outstanding i/o, 
> true on any fabric transport, means the fast_io_fail timeout isn't 
> needed.
>
> So what seems to be talked about in this thread is how the fabric 
> detects device connectivity loss.   FC has its nameserver so it's 
> automatic.  But the other transports don't have such a thing, unless 
> it's TCP connection timeout failures or similar. Connectivity loss is 
> supposed to be the job of the keep alive timeout.  So I would look at 
> that area to see how it should be manipulated.
>
> -- james
>
>

Also wanted to say, with fabrics, if a command times out - there's only 
3 options: 1) ignore it and reset the timer; 2) send an Abort cmd via 
admin queue; or 3) treat it as an unrecoverable error and reset the 
controller.    (1) - isn't reasonable.  (2) - based on Abort being 
best-effort (at best), limited in count, and various race conditions, 
isn't worthwhile. So (3) is normally what happens.   As stated, when a 
controller reset occurs, all i/o is terminated, so - it should release 
the i/o back to the multipather. So an I/O timeout is an indirect way, 
besides kato, to release the i/o's.

-- james



* Re: [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal
  2019-09-17 17:55         ` Bart Van Assche
@ 2019-09-17 20:30           ` Keith Busch
  2019-09-17 20:44           ` Singh, Balbir
  1 sibling, 0 replies; 23+ messages in thread
From: Keith Busch @ 2019-09-17 20:30 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: sblbir, sagi, linux-nvme, axboe, Singh, Balbir, hch

On Tue, Sep 17, 2019 at 10:55:45AM -0700, Bart Van Assche wrote:
> On 9/16/19 1:40 PM, Singh, Balbir wrote:
> > Is your concern largely with the comment or the patch?
> > 
> > Keith just recommended adding:
> > 
> >   /* Revalidate after unblocking dispatchers that may be holding bd_mutex */
> > 
> > That sounds quite reasonable to me.
> 
> Hi Balbir,
> 
> Keith's comment suggestion sounds great to me.
> 
> Another concern I have is that your patch depends on implementation details
> of the block layer. Calling revalidate_disk() after the DYING flag has been
> set means that requests will be allocated, started and completed after the
> DYING flag has been set. I think that works with the current implementation
> of the block layer because blk_queue_enter() does not check the DYING flag
> if percpu_ref_tryget_live() succeeds. If your patch gets accepted and if
> blk_queue_enter() would be modified such that the DYING flag is checked for
> every blk_queue_enter() call then that would break the NVMe driver.

I don't think it should matter. Once the call to blk_set_queue_dying()
returns, blk_queue_enter() will never succeed again: in addition
to setting the DYING flag in the queue, it also sets the percpu_ref to
__PERCPU_REF_DEAD, which causes percpu_ref_tryget_live() to fail, at
which point it will observe the DYING flag and abort entering the queue.

In the short window in which a request may be allocated on a DYING queue
but before we've killed the percpu_ref, the nvme driver handles those
requests by unquiescing to get them flushed out through its .queue_rq().
I don't really like that hack, but blk_mq_queue_tag_busy_iter() isn't
exported for a driver to directly end requests on a quiesced queue.


* Re: [PATCH v2 1/2] nvme/host/pci: Fix a race in controller removal
  2019-09-17 17:55         ` Bart Van Assche
  2019-09-17 20:30           ` Keith Busch
@ 2019-09-17 20:44           ` Singh, Balbir
  1 sibling, 0 replies; 23+ messages in thread
From: Singh, Balbir @ 2019-09-17 20:44 UTC (permalink / raw)
  To: bvanassche, linux-nvme, sblbir; +Cc: kbusch, axboe, hch, sagi

On Tue, 2019-09-17 at 10:55 -0700, Bart Van Assche wrote:
> On 9/16/19 1:40 PM, Singh, Balbir wrote:
> > Is your concern largely with the comment or the patch?
> > 
> > Keith just recommended adding:
> > 
> >   /* Revalidate after unblocking dispatchers that may be holding bd_mutex */
> > 
> > That sounds quite reasonable to me.
> 
> Hi Balbir,
> 
> Keith's comment suggestion sounds great to me.
> 
> Another concern I have is that your patch depends on implementation 
> details of the block layer. Calling revalidate_disk() after the DYING 
> flag has been set means that requests will be allocated, started and 
> completed after the DYING flag has been set. I think that works with the 
> current implementation of the block layer because blk_queue_enter() does 
> not check the DYING flag if percpu_ref_tryget_live() succeeds. If your 
> patch gets accepted and if blk_queue_enter() would be modified such that 
> the DYING flag is checked for every blk_queue_enter() call then that 
> would break the NVMe driver.
> 


I don't think any request is sent down. From what I understand, the
following happens in nvme_set_queue_dying():

	if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
		return;

In nvme_revalidate_disk(), we do the following

	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
		set_capacity(disk, 0);
		return -ENODEV;
	}

I don't think we ever send the request down; we just set the capacity
to 0, and then revalidate_disk() calls check_disk_size_change().

Cheers,
Balbir Singh.
