* [RFC PATCH] nvme of: don't flush scan work inside reset context
@ 2018-11-05 11:57 Ming Lei
  2018-11-05 16:28 ` Keith Busch
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Ming Lei @ 2018-11-05 11:57 UTC (permalink / raw)


When scan work is in progress, any controller error may trigger a
reset; currently the fc, rdma and loop hosts try to flush scan work
inside the reset context.

This can easily cause a deadlock because any IO issued during controller
recovery (reset) can't be completed until the recovery is done.

This patch tries to address the deadlock by not flushing scan work
inside the reset context. There is no obvious reason to flush it there:

- once reset is done, a new scan will be scheduled.
- PCI NVMe doesn't do it that way.

Cc: James Smart <james.smart at broadcom.com>
Cc: Keith Busch <keith.busch at intel.com>
Cc: Christoph Hellwig <hch at lst.de>
Cc: Sagi Grimberg <sagi at grimberg.me>
Signed-off-by: Ming Lei <ming.lei at redhat.com>
---
 drivers/nvme/host/core.c   | 4 ++--
 drivers/nvme/host/fc.c     | 2 +-
 drivers/nvme/host/nvme.h   | 2 +-
 drivers/nvme/host/pci.c    | 2 +-
 drivers/nvme/host/rdma.c   | 2 +-
 drivers/nvme/target/loop.c | 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2e65be8b1387..fbbb6bd8fbc5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -161,7 +161,7 @@ static void nvme_delete_ctrl_work(struct work_struct *work)
 		 "Removing ctrl: NQN \"%s\"\n", ctrl->opts->subsysnqn);
 
 	flush_work(&ctrl->reset_work);
-	nvme_stop_ctrl(ctrl);
+	nvme_stop_ctrl(ctrl, true);
 	nvme_remove_namespaces(ctrl);
 	ctrl->ops->delete_ctrl(ctrl);
 	nvme_uninit_ctrl(ctrl);
@@ -3469,7 +3469,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 }
 EXPORT_SYMBOL_GPL(nvme_complete_async_event);
 
-void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
+void nvme_stop_ctrl(struct nvme_ctrl *ctrl, bool flush_scan)
 {
 	nvme_mpath_stop(ctrl);
 	nvme_stop_keep_alive(ctrl);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index e52b9d3c0bd6..358249f83bc7 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2872,7 +2872,7 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
 		container_of(work, struct nvme_fc_ctrl, ctrl.reset_work);
 	int ret;
 
-	nvme_stop_ctrl(&ctrl->ctrl);
+	nvme_stop_ctrl(&ctrl->ctrl, false);
 
 	/* will block will waiting for io to terminate */
 	nvme_fc_delete_association(ctrl);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index cee79cb388af..02cdbfb66bca 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -418,7 +418,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 		const struct nvme_ctrl_ops *ops, unsigned long quirks);
 void nvme_uninit_ctrl(struct nvme_ctrl *ctrl);
 void nvme_start_ctrl(struct nvme_ctrl *ctrl);
-void nvme_stop_ctrl(struct nvme_ctrl *ctrl);
+void nvme_stop_ctrl(struct nvme_ctrl *ctrl, bool flush_scan);
 void nvme_put_ctrl(struct nvme_ctrl *ctrl);
 int nvme_init_identify(struct nvme_ctrl *ctrl);
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f31e14b35421..9373f1bf8469 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2828,7 +2828,7 @@ static void nvme_remove(struct pci_dev *pdev)
 	}
 
 	flush_work(&dev->ctrl.reset_work);
-	nvme_stop_ctrl(&dev->ctrl);
+	nvme_stop_ctrl(&dev->ctrl, true);
 	nvme_remove_namespaces(&dev->ctrl);
 	nvme_dev_disable(dev, true);
 	nvme_free_host_mem(dev);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index d181cafedc58..26afcc97a445 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1823,7 +1823,7 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
 	struct nvme_rdma_ctrl *ctrl =
 		container_of(work, struct nvme_rdma_ctrl, ctrl.reset_work);
 
-	nvme_stop_ctrl(&ctrl->ctrl);
+	nvme_stop_ctrl(&ctrl->ctrl, false);
 	nvme_rdma_shutdown_ctrl(ctrl, false);
 
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 9908082b32c4..cf2169cb890c 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -468,7 +468,7 @@ static void nvme_loop_reset_ctrl_work(struct work_struct *work)
 	bool changed;
 	int ret;
 
-	nvme_stop_ctrl(&ctrl->ctrl);
+	nvme_stop_ctrl(&ctrl->ctrl, false);
 	nvme_loop_shutdown_ctrl(ctrl);
 
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
-- 
2.9.5


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-05 11:57 [RFC PATCH] nvme of: don't flush scan work inside reset context Ming Lei
@ 2018-11-05 16:28 ` Keith Busch
  2018-11-06  0:30   ` Ming Lei
  2018-11-05 20:04 ` James Smart
  2018-11-07  3:26 ` Sagi Grimberg
  2 siblings, 1 reply; 16+ messages in thread
From: Keith Busch @ 2018-11-05 16:28 UTC (permalink / raw)


On Mon, Nov 05, 2018@07:57:34PM +0800, Ming Lei wrote:
> -void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
> +void nvme_stop_ctrl(struct nvme_ctrl *ctrl, bool flush_scan)
>  {
>  	nvme_mpath_stop(ctrl);
>  	nvme_stop_keep_alive(ctrl);

Newly added parameter 'flush_scan' is unused?


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-05 11:57 [RFC PATCH] nvme of: don't flush scan work inside reset context Ming Lei
  2018-11-05 16:28 ` Keith Busch
@ 2018-11-05 20:04 ` James Smart
  2018-11-06  1:18   ` Ming Lei
  2018-11-07  3:26 ` Sagi Grimberg
  2 siblings, 1 reply; 16+ messages in thread
From: James Smart @ 2018-11-05 20:04 UTC (permalink / raw)


On 11/5/2018 3:57 AM, Ming Lei wrote:
> When scan work is in progress, any controller error may trigger a
> reset; currently the fc, rdma and loop hosts try to flush scan work
> inside the reset context.
>
> This can easily cause a deadlock because any IO issued during controller
> recovery (reset) can't be completed until the recovery is done.
>
> This patch tries to address the deadlock by not flushing scan work
> inside the reset context. There is no obvious reason to flush it there:
>
> - once reset is done, a new scan will be scheduled.
> - PCI NVMe doesn't do it that way.
>
> Cc: James Smart <james.smart at broadcom.com>
> Cc: Keith Busch <keith.busch at intel.com>
> Cc: Christoph Hellwig <hch at lst.de>
> Cc: Sagi Grimberg <sagi at grimberg.me>
> Signed-off-by: Ming Lei <ming.lei at redhat.com>
> ---
>   drivers/nvme/host/core.c   | 4 ++--
>   drivers/nvme/host/fc.c     | 2 +-
>   drivers/nvme/host/nvme.h   | 2 +-
>   drivers/nvme/host/pci.c    | 2 +-
>   drivers/nvme/host/rdma.c   | 2 +-
>   drivers/nvme/target/loop.c | 2 +-
>   6 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 2e65be8b1387..fbbb6bd8fbc5 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -161,7 +161,7 @@ static void nvme_delete_ctrl_work(struct work_struct *work)
>   		 "Removing ctrl: NQN \"%s\"\n", ctrl->opts->subsysnqn);
>   
>   	flush_work(&ctrl->reset_work);
> -	nvme_stop_ctrl(ctrl);
> +	nvme_stop_ctrl(ctrl, true);
>   	nvme_remove_namespaces(ctrl);
>   	ctrl->ops->delete_ctrl(ctrl);
>   	nvme_uninit_ctrl(ctrl);
> @@ -3469,7 +3469,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
>   }
>   EXPORT_SYMBOL_GPL(nvme_complete_async_event);
>   
> -void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
> +void nvme_stop_ctrl(struct nvme_ctrl *ctrl, bool flush_scan)
>   {
>   	nvme_mpath_stop(ctrl);
>   	nvme_stop_keep_alive(ctrl);
>

Keith: This was missing a snippet in nvme_stop_ctrl() for

 -	flush_work(&ctrl->scan_work);
 +	if (flush_scan)
 +		flush_work(&ctrl->scan_work);

I also believe Ming's patch isn't enough. The issue for the transports
is that several of the flush cases require an io to complete in order to
finish flushing - not just scan_work. nvme_mpath_stop() syncs with
ctrl->ana_work, which may be waiting for an ana log read to complete.
fw_act_queues() may be in the midst of polling the CSTS register or
doing a fw_slot log read, or a stop_ctrl() routine for the transport may
be in the middle of a start_ctrl call.

Given the transport may be at a point where it has lost connectivity to the
controller (never really a case for pci unless you consider hot unplug),
before calling ctrl stop, the transport has to terminate/return
outstanding io and stop further io from being accepted (usually by
marking the queue as not connected).

rdma seems to pretty much do the right thing for its error handling
path - where it schedules the err_work handler to do this. But that's
missing from the reset controller path (and is nooped if a reset is in
progress and stuck in one of these flushes when a timeout occurs).

fc resets the device, and thus calls the flush routines, when it resets or
has an error. So it never gets to call delete_association to
tear down/mark things unconnected before waiting on the flushes.

I think we need to have the reset_work routines call the routine that
deletes/tears down before they call nvme_stop_ctrl(). These routines will
likely call nvme_stop_keep_alive right before doing so as well.
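
Roughly, for rdma, something along these lines (an untested sketch only;
fc and loop would need the equivalent reordering):

 static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
 {
 	struct nvme_rdma_ctrl *ctrl =
 		container_of(work, struct nvme_rdma_ctrl, ctrl.reset_work);

-	nvme_stop_ctrl(&ctrl->ctrl);
-	nvme_rdma_shutdown_ctrl(ctrl, false);
+	/* stop keep-alive, then tear down the queues and fail/return any
+	 * outstanding io before flushing works that may be waiting on io */
+	nvme_stop_keep_alive(&ctrl->ctrl);
+	nvme_rdma_shutdown_ctrl(ctrl, false);
+	nvme_stop_ctrl(&ctrl->ctrl);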

-- james


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-05 16:28 ` Keith Busch
@ 2018-11-06  0:30   ` Ming Lei
  0 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2018-11-06  0:30 UTC (permalink / raw)


On Mon, Nov 05, 2018@09:28:15AM -0700, Keith Busch wrote:
> On Mon, Nov 05, 2018@07:57:34PM +0800, Ming Lei wrote:
> > -void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
> > +void nvme_stop_ctrl(struct nvme_ctrl *ctrl, bool flush_scan)
> >  {
> >  	nvme_mpath_stop(ctrl);
> >  	nvme_stop_keep_alive(ctrl);
> 
> Newly added parameter 'flush_scan' is unsused. ?

oops, it should have been applied as:

	if (flush_scan)
		flush_work(&ctrl->scan_work);
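
i.e. the intent was roughly the following (a sketch only, not necessarily
the exact upstream nvme_stop_ctrl() body):

void nvme_stop_ctrl(struct nvme_ctrl *ctrl, bool flush_scan)
{
	nvme_mpath_stop(ctrl);
	nvme_stop_keep_alive(ctrl);
	flush_work(&ctrl->async_event_work);
	/* only callers outside the reset path wait for a pending scan */
	if (flush_scan)
		flush_work(&ctrl->scan_work);
	cancel_work_sync(&ctrl->fw_act_work);
	if (ctrl->ops->stop_ctrl)
		ctrl->ops->stop_ctrl(ctrl);
}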

Thanks,
Ming


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-05 20:04 ` James Smart
@ 2018-11-06  1:18   ` Ming Lei
  2018-11-06  5:45     ` James Smart
  0 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2018-11-06  1:18 UTC (permalink / raw)


Hi James,

On Mon, Nov 05, 2018@12:04:18PM -0800, James Smart wrote:
> On 11/5/2018 3:57 AM, Ming Lei wrote:
> > When scan work is in progress, any controller error may trigger a
> > reset; currently the fc, rdma and loop hosts try to flush scan work
> > inside the reset context.
> >
> > This can easily cause a deadlock because any IO issued during controller
> > recovery (reset) can't be completed until the recovery is done.
> >
> > This patch tries to address the deadlock by not flushing scan work
> > inside the reset context. There is no obvious reason to flush it there:
> >
> > - once reset is done, a new scan will be scheduled.
> > - PCI NVMe doesn't do it that way.
> > 
> > Cc: James Smart <james.smart at broadcom.com>
> > Cc: Keith Busch <keith.busch at intel.com>
> > Cc: Christoph Hellwig <hch at lst.de>
> > Cc: Sagi Grimberg <sagi at grimberg.me>
> > Signed-off-by: Ming Lei <ming.lei at redhat.com>
> > ---
> >   drivers/nvme/host/core.c   | 4 ++--
> >   drivers/nvme/host/fc.c     | 2 +-
> >   drivers/nvme/host/nvme.h   | 2 +-
> >   drivers/nvme/host/pci.c    | 2 +-
> >   drivers/nvme/host/rdma.c   | 2 +-
> >   drivers/nvme/target/loop.c | 2 +-
> >   6 files changed, 7 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 2e65be8b1387..fbbb6bd8fbc5 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -161,7 +161,7 @@ static void nvme_delete_ctrl_work(struct work_struct *work)
> >   		 "Removing ctrl: NQN \"%s\"\n", ctrl->opts->subsysnqn);
> >   	flush_work(&ctrl->reset_work);
> > -	nvme_stop_ctrl(ctrl);
> > +	nvme_stop_ctrl(ctrl, true);
> >   	nvme_remove_namespaces(ctrl);
> >   	ctrl->ops->delete_ctrl(ctrl);
> >   	nvme_uninit_ctrl(ctrl);
> > @@ -3469,7 +3469,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
> >   }
> >   EXPORT_SYMBOL_GPL(nvme_complete_async_event);
> > -void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
> > +void nvme_stop_ctrl(struct nvme_ctrl *ctrl, bool flush_scan)
> >   {
> >   	nvme_mpath_stop(ctrl);
> >   	nvme_stop_keep_alive(ctrl);
> > 
> 
> Keith: This was missing a snippet in nvme_stop_ctrl() for
> 
>  -	flush_work(&ctrl->scan_work);
>  +	if (flush_scan)
>  +		flush_work(&ctrl->scan_work);
> 
> I also believe Ming's patch isn't enough. The issue for the transports
> is that several of the flush cases require an io to complete in order to
> finish flushing - not just scan_work. nvme_mpath_stop() syncs with
> ctrl->ana_work, which may be waiting for an ana log read to complete.
> fw_act_queues() may be in the midst of polling the CSTS register or
> doing a fw_slot log read, or a stop_ctrl() routine for the transport may
> be in the middle of a start_ctrl call.

These commands won't be retried since REQ_FAILFAST_DRIVER is set for any
request allocated via nvme_alloc_request().
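
For reference, a simplified sketch of nvme_alloc_request() (not the exact
upstream body; qid handling omitted) showing where the flag gets set:

struct request *nvme_alloc_request(struct request_queue *q,
		struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid)
{
	struct request *req;

	/* qid-specific hctx allocation omitted in this sketch */
	req = blk_mq_alloc_request(q, nvme_is_write(cmd) ?
			REQ_OP_DRV_OUT : REQ_OP_DRV_IN, flags);
	if (IS_ERR(req))
		return req;

	/* driver-internal commands are never retried by the core */
	req->cmd_flags |= REQ_FAILFAST_DRIVER;
	nvme_req(req)->cmd = cmd;

	return req;
}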

So the above flush cases you mentioned should be easier to handle than IO from
the scan context. However, it looks like the patch we discussed before, which
moves nvme_stop_ctrl() after nvme_rdma_shutdown_ctrl(), is still needed to
avoid hanging on the flush of these commands.

> 
> Given the transport may be at a point where it has lost connectivity to the
> controller (never really a case for pci unless you consider hot unplug),
> before calling ctrl stop, the transport has to terminate/return outstanding
> io and stop further io from being accepted (usually by marking the queue as
> not connected).

This patch doesn't change that point; all in-flight IO is still terminated/returned.
With or without this patch, all these IOs will also be retried after the controller
is recovered.

Or do you mean all these IOs have to be terminated/returned before calling
ctrl->ops->stop_ctrl()?

> 
> rdma seems to pretty much do the right thing for its error handling path -
> where it schedules the err_work handler to do this. But that's missing from
> the reset controller path (and is nooped if a reset is in progress and stuck
> in one of these flushes when a timeout occurs).
> 
> fc resets the device, and thus calls the flush routines, when it resets or
> has an error. So it never gets to call delete_association to tear down/mark
> things unconnected before waiting on the flushes.
> 
> I think we need to have the reset_work routines call the routine that
> deletes/tears down before they call nvme_stop_ctrl(). These routines will
> likely call nvme_stop_keep_alive right before doing so as well.

I agree; this way any in-flight nvme command can be failed before flushing
any work function that is waiting for its completion.


Thanks,
Ming


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-06  1:18   ` Ming Lei
@ 2018-11-06  5:45     ` James Smart
  2018-11-07  1:58       ` Ming Lei
  2018-11-08  0:19       ` Sagi Grimberg
  0 siblings, 2 replies; 16+ messages in thread
From: James Smart @ 2018-11-06  5:45 UTC (permalink / raw)




On 11/5/2018 5:18 PM, Ming Lei wrote:
>
>> I also believe Ming's patch isn't enough. The issue for the transports
>> is that several of the flush cases require an io to complete in order to
>> finish flushing - not just scan_work. nvme_mpath_stop() syncs with
>> ctrl->ana_work, which may be waiting for an ana log read to complete.
>> fw_act_queues() may be in the midst of polling the CSTS register or
>> doing a fw_slot log read, or a stop_ctrl() routine for the transport may
>> be in the middle of a start_ctrl call.
> These commands won't be retried since REQ_FAILFAST_DRIVER is set for any
> request allocated via nvme_alloc_request().
>
> So the above flush cases you mentioned should be easier to handle than IO from
> the scan context. However, it looks like the patch we discussed before, which
> moves nvme_stop_ctrl() after nvme_rdma_shutdown_ctrl(), is still needed to
> avoid hanging on the flush of these commands.

Agree - nvme_stop_ctrl() has to move so the transport routines that
stop/return the io can run. Leaving it in the same place with an
argument won't be enough.


>
>> Given the transport may be at a point where it has lost connectivity to the
>> controller (never really a case for pci unless you consider hot unplug),
>> before calling ctrl stop, the transport has to terminate/return outstanding
>> io and stop further io from being accepted (usually by marking the queue as
>> not connected).
> This patch doesn't change that point; all in-flight IO is still terminated/returned.
> With or without this patch, all these IOs will also be retried after the controller
> is recovered.
>
> Or do you mean all these IOs have to be terminated/returned before calling
> ctrl->ops->stop_ctrl()?

The patch as is, which still calls the flush routines pointed out, will
deadlock, as the io won't be terminated and returned by the transport. The
flush routines, and thus stop_ctrl, have to be called after the transport
stops the queues and aborts/returns the ios. Retries after that will do
what they do.

But the concern I have is: if things like ns_revalidate require normal
io to complete in order to be "flushed", then we have an issue. The
non-normal io generated by the core layer and the io issued by multipath
will fast-fail. But the normal io, whether non-multipath or last-path
with multipath, may not be failed, as it will be continually requeued
or will wait for a new path. So I don't see how it exits this deadlock.

-- james


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-06  5:45     ` James Smart
@ 2018-11-07  1:58       ` Ming Lei
  2018-11-08  0:19       ` Sagi Grimberg
  1 sibling, 0 replies; 16+ messages in thread
From: Ming Lei @ 2018-11-07  1:58 UTC (permalink / raw)


Hi James,

On Mon, Nov 05, 2018@09:45:21PM -0800, James Smart wrote:
> 
> 
> On 11/5/2018 5:18 PM, Ming Lei wrote:
> > 
> > > I also believe Ming's patch isn't enough. The issue for the transports
> > > is that several of the flush cases require an io to complete in order to
> > > finish flushing - not just scan_work. nvme_mpath_stop() syncs with
> > > ctrl->ana_work, which may be waiting for an ana log read to complete.
> > > fw_act_queues() may be in the midst of polling the CSTS register or
> > > doing a fw_slot log read, or a stop_ctrl() routine for the transport may
> > > be in the middle of a start_ctrl call.
> > These commands won't be retried since REQ_FAILFAST_DRIVER is set for any
> > request allocated via nvme_alloc_request().
> > 
> > So the above flush cases you mentioned should be easier to handle than IO from
> > the scan context. However, it looks like the patch we discussed before, which
> > moves nvme_stop_ctrl() after nvme_rdma_shutdown_ctrl(), is still needed to
> > avoid hanging on the flush of these commands.
> 
> Agree - nvme_stop_ctrl() has to move so the transport routines that
> stop/return the io can run. Leaving it in the same place with an argument
> won't be enough.

OK, I will include this part in RFC V2 and post it for further review.

> 
> 
> > 
> > > Given the transport may be at a point where it has lost connectivity to the
> > > controller (never really a case for pci unless you consider hot unplug),
> > > before calling ctrl stop, the transport has to terminate/return outstanding
> > > io and stop further io from being accepted (usually by marking the queue as
> > > not connected).
> > This patch doesn't change that point; all in-flight IO is still terminated/returned.
> > With or without this patch, all these IOs will also be retried after the controller
> > is recovered.
> >
> > Or do you mean all these IOs have to be terminated/returned before calling
> > ctrl->ops->stop_ctrl()?
> 
> The patch as is, which still calls the flush routines pointed out, will
> deadlock, as the io won't be terminated and returned by the transport. The
> flush routines, and thus stop_ctrl, have to be called after the transport
> stops the queues and aborts/returns the ios. Retries after that will do
> what they do.

All normal IOs are retried and won't be terminated during reset.

However, it is very unusual to wait for the completion of these normal IOs
inside the reset handler, and the only such usage I have seen is
flush_work(&ctrl->scan_work) in fc/rdma/loop.

> 
> But the concern I have is: if things like ns_revalidate require normal io
> to complete in order to be "flushed", then we have an issue. The non-normal
> io generated by the core layer and the io issued by multipath will fast-fail.
> But the normal io, whether non-multipath or last-path with multipath, may
> not be failed, as it will be continually requeued or will wait for a new
> path. So I don't see how it exits this deadlock.

Except for flush_work(&ctrl->scan_work), do we have other cases in which the
reset handler waits for the completion of normal IO? I haven't found any.

For any other wait on the completion of normal IO, it is the responsibility
of the reset, error or timeout handler to guarantee forward progress, and
there shouldn't be such a deadlock.

Thanks,
Ming


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-05 11:57 [RFC PATCH] nvme of: don't flush scan work inside reset context Ming Lei
  2018-11-05 16:28 ` Keith Busch
  2018-11-05 20:04 ` James Smart
@ 2018-11-07  3:26 ` Sagi Grimberg
  2018-11-07  3:51   ` Ming Lei
  2 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2018-11-07  3:26 UTC (permalink / raw)


Ming,

> When scan work is in progress, any controller error may trigger a
> reset; currently the fc, rdma and loop hosts try to flush scan work
> inside the reset context.
> 
> This can easily cause a deadlock because any IO issued during controller
> recovery (reset) can't be completed until the recovery is done.

Did you encounter this deadlock? or is it theoretical?

The point of nvme_stop_ctrl is to quiesce everything before
moving forward with tearing down the controller instead of
trying to handle concurrent incoming I/O.

I'm not sure I understand why you say that I/O can only be
completed when the reset is done. If the transport entered
a failed state, either the in-flight I/O is drained or one of
the scan work I/O operations times out.


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-07  3:26 ` Sagi Grimberg
@ 2018-11-07  3:51   ` Ming Lei
  2018-11-07  4:38     ` Sagi Grimberg
  0 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2018-11-07  3:51 UTC (permalink / raw)


On Tue, Nov 06, 2018@07:26:28PM -0800, Sagi Grimberg wrote:
> Ming,
> 
> > When scan work is in progress, any controller error may trigger a
> > reset; currently the fc, rdma and loop hosts try to flush scan work
> > inside the reset context.
> > 
> > This can easily cause a deadlock because any IO issued during controller
> > recovery (reset) can't be completed until the recovery is done.
> 
> Did you encounter this deadlock? or is it theoretical?

There are several such reports in Red Hat Bugzilla.

> 
> The point of nvme_stop_ctrl is to quiesce everything before
> moving forward with tearing down the controller instead of
> trying to handle concurrent incoming I/O.
> 
> I'm not sure I understand why you say that I/O can only be
> completed when the reset is done. If the transport entered

Please see nvme_rdma_teardown_io_queues(), in which each in-flight
request is canceled via nvme_cancel_request(), which just calls
nvme_complete_rq() to requeue the request (normal IO) to the blk-mq sw
queue or scheduler queue.

During reset, the block request queues are quiesced, so the requeued
requests can't be dispatched to the nvme driver until the reset is done.

That is why all normal I/O can only be completed after the reset is done.
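
Roughly, that cancel/requeue path looks like this (a simplified sketch,
not the exact upstream code):

/* called for every in-flight request during teardown */
static void nvme_cancel_request(struct request *req, void *data, bool reserved)
{
	nvme_req(req)->status = NVME_SC_ABORT_REQ;
	blk_mq_complete_request(req);	/* ends up in nvme_complete_rq() */
}

void nvme_complete_rq(struct request *req)
{
	blk_status_t status = nvme_error_status(req);

	if (status != BLK_STS_OK && nvme_req_needs_retry(req)) {
		/* normal (non-FAILFAST) IO is simply requeued; with the
		 * queues quiesced it stays there until the reset is done */
		blk_mq_requeue_request(req, true);
		return;
	}
	blk_mq_end_request(req, status);
}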

> a failed state, either the in-flight I/O is drained or one of
> the scan work I/O operations times out.

Timeout only works for in-flight requests. As mentioned above,
all these requests are canceled and put back into the blk-mq sw queue
or scheduler queue during reset, so the timeout handler can't cover
them at all.

Thanks,
Ming


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-07  3:51   ` Ming Lei
@ 2018-11-07  4:38     ` Sagi Grimberg
  2018-11-07  8:34       ` Ming Lei
  0 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2018-11-07  4:38 UTC (permalink / raw)



>> Did you encounter this deadlock? or is it theoretical?
> 
> There are several such reports in Red Hat Bugzilla.

Is there any way to see the reports? I'm looking for the
scenario because error recovery does not usually go through
the reset flow.

>> The point of nvme_stop_ctrl is to quiesce everything before
>> moving forward with tearing down the controller instead of
>> trying to handle concurrent incoming I/O.
>>
>> I'm not sure I understand why you say that I/O can only be
>> completed when the reset is done. If the transport entered
> 
> Please see nvme_rdma_teardown_io_queues(), in which each in-flight
> request is canceled via nvme_cancel_request(), which just calls
> nvme_complete_rq() to requeue the request (normal IO) to the blk-mq sw
> queue or scheduler queue.

I'm pretty familiar with what nvme_rdma_teardown_io_queues() is
doing.

> During reset, the block request queues are quiesced, so the requeued
> requests can't be dispatched to the nvme driver until the reset is done.

That's fine; they have no reason to be dispatched until the reset
is done, as they have no chance to complete.

> That is why all normal I/O can only be completed after the reset is done.

I'm still not getting your point; resets should be able to complete, with
any pending I/Os completing with a failed status.
That is why I want to see the ticket details; I want to understand
what issue this solves.

>> a failed state, either the in-flight I/O is drained or one of
>> the scan work I/O operations times out.
> 
> Timeout only works for in-flight requests. As mentioned above,
> all these requests are canceled and put back into the blk-mq sw queue
> or scheduler queue during reset, so the timeout handler can't cover
> them at all.

It's not supposed to. This patch says it specifically addresses the scan
work.

I think you need to explain your patch better to get across exactly
what it is fixing.


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-07  4:38     ` Sagi Grimberg
@ 2018-11-07  8:34       ` Ming Lei
  2018-11-07 18:38         ` Sagi Grimberg
  0 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2018-11-07  8:34 UTC (permalink / raw)


On Tue, Nov 06, 2018@08:38:59PM -0800, Sagi Grimberg wrote:
> 
> > > Did you encounter this deadlock? or is it theoretical?
> > 
> > There are several such reports in Red Hat Bugzilla.
> 
> Is there any way to see the reports? I'm looking for the
> scenario because error recovery does not usually go through
> the reset flow.

https://bugzilla.redhat.com/show_bug.cgi?id=1622487
https://bugzilla.redhat.com/show_bug.cgi?id=1628100

Especially you may find some details in the following comment:

https://bugzilla.redhat.com/show_bug.cgi?id=1628100#c7

> 
> > > The point of nvme_stop_ctrl is to quiesce everything before
> > > moving forward with tearing down the controller instead of
> > > trying to handle concurrent incoming I/O.
> > > 
> > > I'm not sure I understand why you say that I/O can only be
> > > completed when the reset is done. If the transport entered
> > 
> > Please see nvme_rdma_teardown_io_queues(), in which each in-flight
> > request is canceled via nvme_cancel_request(), which just calls
> > nvme_complete_rq() to requeue the request (normal IO) to the blk-mq sw
> > queue or scheduler queue.
> 
> I'm pretty familiar with what nvme_rdma_teardown_io_queues() is
> doing.
> 
> > During reset, the block request queues are quiesced, so the requeued
> > requests can't be dispatched to the nvme driver until the reset is done.
> 
> That's fine; they have no reason to be dispatched until the reset
> is done, as they have no chance to complete.
> 
> > That is why all normal I/O can only be completed after the reset is done.
> 
> I'm still not getting your point; resets should be able to complete, with
> any pending I/Os completing with a failed status.
> That is why I want to see the ticket details; I want to understand
> what issue this solves.

Once the controller's state is updated to RESETTING, no new FS IO can
be queued to hardware any more and these IOs are handled as
BLK_STS_RESOURCE, so the scan work function may not make progress because
of request exhaustion or writeback throttling.

Then flush_work(&ctrl->scan_work) inside the reset work function hangs forever.
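
In other words, once the controller leaves the LIVE state, the fabrics
->queue_rq() path behaves roughly like the sketch below (a simplified,
hypothetical helper; the real check lives in the fabrics/transport code):

static blk_status_t nvmf_check_ready_sketch(struct nvme_ctrl *ctrl,
					    struct request *rq)
{
	if (ctrl->state != NVME_CTRL_LIVE) {
		/* driver-internal commands carry REQ_FAILFAST_DRIVER
		 * and are failed outright */
		if (rq->cmd_flags & REQ_FAILFAST_DRIVER)
			return BLK_STS_IOERR;
		/* normal FS IO gets BLK_STS_RESOURCE and is requeued, so
		 * scan work waiting on it cannot make progress */
		return BLK_STS_RESOURCE;
	}
	return BLK_STS_OK;	/* safe to dispatch */
}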

> 
> > > a failed state, either the in-flight I/O is drained or one of
> > > the scan work I/O operations times out.
> > 
> > Timeout only works for in-flight requests. As mentioned above,
> > all these requests are canceled and put back into the blk-mq sw queue
> > or scheduler queue during reset, so the timeout handler can't cover
> > them at all.
> 
> It's not supposed to. This patch says it specifically addresses the scan
> work.
> 
> I think you need to explain your patch better to get across exactly
> what it is fixing.

There may not be any in-flight requests in this case, so the timeout
handler can't cover it.


Thanks,
Ming


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-07  8:34       ` Ming Lei
@ 2018-11-07 18:38         ` Sagi Grimberg
  2018-11-07 19:27           ` James Smart
  2018-11-08  0:05           ` Ming Lei
  0 siblings, 2 replies; 16+ messages in thread
From: Sagi Grimberg @ 2018-11-07 18:38 UTC (permalink / raw)



> https://bugzilla.redhat.com/show_bug.cgi?id=1622487
> https://bugzilla.redhat.com/show_bug.cgi?id=1628100
> 
> Especially you may find some details in the following comment:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1628100#c7

I don't have permissions to see the tickets.


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-07 18:38         ` Sagi Grimberg
@ 2018-11-07 19:27           ` James Smart
  2018-11-08  0:05           ` Ming Lei
  1 sibling, 0 replies; 16+ messages in thread
From: James Smart @ 2018-11-07 19:27 UTC (permalink / raw)


On 11/7/2018 10:38 AM, Sagi Grimberg wrote:
>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1622487
>> https://bugzilla.redhat.com/show_bug.cgi?id=1628100
>>
>> Especially you may find some details in the following comment:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1628100#c7
>
> I don't have permissions to see the tickets.

They show the issue occurring on both RDMA IB and FC.

The test cases manually induce a reset via the reset_controller
attribute while there is load on the controller. The reset_work threads
are hung waiting on flush(scan_work), and the scan_work thread is in
ns_validate waiting for the io load to go to zero - which it can't, as
the normal io is requeued within blk-mq due to the BLK_STS_RESOURCE returns.

FC is more susceptible as its error handling resolves the error by
resetting the controller, rather than an abbreviated reset like rdma does.

-- james


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-07 18:38         ` Sagi Grimberg
  2018-11-07 19:27           ` James Smart
@ 2018-11-08  0:05           ` Ming Lei
  1 sibling, 0 replies; 16+ messages in thread
From: Ming Lei @ 2018-11-08  0:05 UTC (permalink / raw)


On Wed, Nov 07, 2018@10:38:44AM -0800, Sagi Grimberg wrote:
> 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1622487
> > https://bugzilla.redhat.com/show_bug.cgi?id=1628100
> > 
> > Especially you may find some details in the following comment:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1628100#c7
> 
> I don't have permissions to see the tickets.

But I have explained the issue in enough detail, haven't I?

Thanks,
Ming


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-06  5:45     ` James Smart
  2018-11-07  1:58       ` Ming Lei
@ 2018-11-08  0:19       ` Sagi Grimberg
  2018-11-08 17:49         ` James Smart
  1 sibling, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2018-11-08  0:19 UTC (permalink / raw)



>>> transports
>>> is that several of the flush cases require an io to complete in order to
>>> finish flushing - not just scan_work. nvme_mpath_stop() syncs with
>>> ctrl->ana_work, which may be waiting for an ana log read to complete.
>>> fw_act_queues() may be in the midst of polling the CSTS register or
>>> doing a fw_slot log read, or a stop_ctrl() routine for the transport
>>> may be in the middle of a start_ctrl call.
>> These commands won't be retried since REQ_FAILFAST_DRIVER is set for any
>> request allocated via nvme_alloc_request().
>>
>> So the above flush cases you mentioned should be easier to handle than
>> IO from the scan context. However, it looks like the patch we discussed
>> before, which moves nvme_stop_ctrl() after nvme_rdma_shutdown_ctrl(), is
>> still needed to avoid hanging on the flush of these commands.
> 
> Agree - nvme_stop_ctrl() has to move so the transport routines that
> stop/return the io can run. Leaving it in the same place with an
> argument won't be enough.

nvme_stop_ctrl() cannot move to after queue termination; it has to
act as a barrier before moving forward. The things that must be flushed
before are async_event_work, which does not go via the normal (quiesced)
request path, and ->stop_ctrl(), which at least for the rdma case makes
sure that no error recovery runs concurrently with the reset/delete flow.

Perhaps we can move scan_work flush into nvme_remove_namespaces()
instead? I guess it makes sense that when we remove namespaces we don't
want a scan to run concurrently?
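
Something along these lines, perhaps (untested sketch; the surrounding
context may differ):

void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns, *next;
	LIST_HEAD(ns_list);

	/* make sure a pending scan is not still adding namespaces
	 * while we are removing them */
	flush_work(&ctrl->scan_work);

	/* ... existing removal logic unchanged ... */
}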


* [RFC PATCH] nvme of: don't flush scan work inside reset context
  2018-11-08  0:19       ` Sagi Grimberg
@ 2018-11-08 17:49         ` James Smart
  0 siblings, 0 replies; 16+ messages in thread
From: James Smart @ 2018-11-08 17:49 UTC (permalink / raw)




On 11/7/2018 4:19 PM, Sagi Grimberg wrote:
>
>>>> transports
>>>> is that several of the flush cases require an io to complete in order
>>>> to finish flushing - not just scan_work. nvme_mpath_stop() syncs with
>>>> ctrl->ana_work, which may be waiting for an ana log read to complete.
>>>> fw_act_queues() may be in the midst of polling the CSTS register or
>>>> doing a fw_slot log read, or a stop_ctrl() routine for the transport
>>>> may be in the middle of a start_ctrl call.
>>> These commands won't be retried since REQ_FAILFAST_DRIVER is set for
>>> any request allocated via nvme_alloc_request().
>>>
>>> So the above flush cases you mentioned should be easier to handle
>>> than IO from the scan context. However, it looks like the patch we
>>> discussed before, which moves nvme_stop_ctrl() after
>>> nvme_rdma_shutdown_ctrl(), is still needed to avoid hanging on the
>>> flush of these commands.
>>
>> Agree - nvme_stop_ctrl() has to move so the transport routines that
>> stop/return the io can run. Leaving it in the same place with an
>> argument won't be enough.
>
> nvme_stop_ctrl() cannot move to after queue termination; it has to
> act as a barrier before moving forward. The things that must be flushed
> before are async_event_work, which does not go via the normal (quiesced)
> request path, and ->stop_ctrl(), which at least for the rdma case makes
> sure that no error recovery runs concurrently with the reset/delete flow.

Well - then I don't know how things are going to proceed. If you are in
a scenario where the device is no longer responding for outstanding ios,
the only way you will get the flushes waiting on internally generated
ios to finish is for the transport to walk the queues, terminate the
ios and mark the queues non-connected so that any retries fail
outright. And if it's normal io, the same thing has to be true for
scan_work, but we're additionally faced with the fact that those ios
aren't terminated by the queue not being connected - they sit on the
blk-mq queue, or they fail back to the nvme multipather, which may be
holding them waiting for a path to come back. The latter case is where
this same nvme_stop_ctrl() hits the controller delete case if scan_work
is outstanding at the time of the delete.

So minimally, we need the io aborted and the queues marked - like what is
done in the err_work thread on rdma.

>
> Perhaps we can move scan_work flush into nvme_remove_namespaces()
> instead? I guess it makes sense that when we remove namespaces we don't
> want a scan to run concurrently?

Well - that makes sense, but I assume there's also the question of the
validity of the scan instance itself. If we leave scan_work blocked
waiting for normal io to finish, it won't release until a new controller
instance is connected or the device fails. For the case where a new
instance is connected:
a) do we have any issues with it using data gathered prior to blocking
(from the old controller) after it resumes (now on the new controller)?
b) will the scan work kicked off by the init_ctrl() call on the
reconnect cause the scan to be re-performed correctly a second time? (I
assume so)


-- james

