All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] nvme-pci: fix race between pci reset and nvme probe
@ 2022-08-01 12:57 Ming Lei
  2022-08-01 14:33 ` Keith Busch
  2022-08-04 21:55 ` Keith Busch
  0 siblings, 2 replies; 6+ messages in thread
From: Ming Lei @ 2022-08-01 12:57 UTC (permalink / raw)
  To: Christoph Hellwig, linux-nvme
  Cc: Sagi Grimberg, Keith Busch, Ming Lei, Yi Zhang

After nvme_probe() returns, device lock is released, and PCI reset
handler may come, meantime reset work is just scheduled and should
be in-progress.

When nvme_reset_prepare() is run, all NSs may not be allocated yet
and each NS's request queue won't be frozen by nvme_dev_disable().

But when nvme_reset_done() is called for resetting controller, all
NSs may have been scanned successfully, and nvme_wait_freeze() is
called on un-frozen request queues, then wait forever.

Fix the issue by holding device lock for resetting from nvme probe.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/linux-block/CAHj4cs--KPTAGP=jj+7KMe=arDv=HeGeOgs1T8vbusyk=EjXow@mail.gmail.com/#r
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4232192e10dd..d49b1a082983 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3075,9 +3075,14 @@ static unsigned long check_vendor_combination_bug(struct pci_dev *pdev)
 static void nvme_async_probe(void *data, async_cookie_t cookie)
 {
 	struct nvme_dev *dev = data;
+	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
+	pci_dev_lock(pdev);
+	nvme_reset_ctrl(&dev->ctrl);
 	flush_work(&dev->ctrl.reset_work);
 	flush_work(&dev->ctrl.scan_work);
+	pci_dev_unlock(pdev);
+
 	nvme_put_ctrl(&dev->ctrl);
 }
 
@@ -3154,7 +3159,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
-	nvme_reset_ctrl(&dev->ctrl);
 	async_schedule(nvme_async_probe, dev);
 
 	return 0;
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] nvme-pci: fix race between pci reset and nvme probe
  2022-08-01 12:57 [PATCH] nvme-pci: fix race between pci reset and nvme probe Ming Lei
@ 2022-08-01 14:33 ` Keith Busch
  2022-08-02  0:05   ` Ming Lei
  2022-08-04 21:55 ` Keith Busch
  1 sibling, 1 reply; 6+ messages in thread
From: Keith Busch @ 2022-08-01 14:33 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, linux-nvme, Sagi Grimberg, Yi Zhang

On Mon, Aug 01, 2022 at 08:57:53PM +0800, Ming Lei wrote:
> After nvme_probe() returns, device lock is released, and PCI reset
> handler may come, meantime reset work is just scheduled and should
> be in-progress.
> 
> When nvme_reset_prepare() is run, all NSs may not be allocated yet
> and each NS's request queue won't be frozen by nvme_dev_disable().
> 
> But when nvme_reset_done() is called for resetting controller, all
> NSs may have been scanned successfully, and nvme_wait_freeze() is
> called on un-frozen request queues, then wait forever.
> 
> Fix the issue by holding device lock for resetting from nvme probe.
> 
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Link: https://lore.kernel.org/linux-block/CAHj4cs--KPTAGP=jj+7KMe=arDv=HeGeOgs1T8vbusyk=EjXow@mail.gmail.com/#r
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/nvme/host/pci.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 4232192e10dd..d49b1a082983 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3075,9 +3075,14 @@ static unsigned long check_vendor_combination_bug(struct pci_dev *pdev)
>  static void nvme_async_probe(void *data, async_cookie_t cookie)
>  {
>  	struct nvme_dev *dev = data;
> +	struct pci_dev *pdev = to_pci_dev(dev->dev);
>  
> +	pci_dev_lock(pdev);
> +	nvme_reset_ctrl(&dev->ctrl);
>  	flush_work(&dev->ctrl.reset_work);
>  	flush_work(&dev->ctrl.scan_work);
> +	pci_dev_unlock(pdev);
> +
>  	nvme_put_ctrl(&dev->ctrl);
>  }

When low on memory, async_schedule() falls back to calling the requested
function directly, so this would deadlock on taking the pci_dev_lock() the
second time within the probe context.

If it is successfully scheduled asynchronously, holding the lock blocks a hot
removal, which might be the only thing that can unblock the nvme reset_work
from forward progress.

If you are encountering a nvme_reset_prepare() condition during scanning, that
might indicate a failure to communicate with the end device. The scan work may
need the error handling to unblock it.

> @@ -3154,7 +3159,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  
>  	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
>  
> -	nvme_reset_ctrl(&dev->ctrl);
>  	async_schedule(nvme_async_probe, dev);
>  
>  	return 0;
> -- 
> 2.31.1
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] nvme-pci: fix race between pci reset and nvme probe
  2022-08-01 14:33 ` Keith Busch
@ 2022-08-02  0:05   ` Ming Lei
  2022-08-02  2:22     ` Keith Busch
  0 siblings, 1 reply; 6+ messages in thread
From: Ming Lei @ 2022-08-02  0:05 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, linux-nvme, Sagi Grimberg, Yi Zhang, ming.lei

On Mon, Aug 01, 2022 at 08:33:01AM -0600, Keith Busch wrote:
> On Mon, Aug 01, 2022 at 08:57:53PM +0800, Ming Lei wrote:
> > After nvme_probe() returns, device lock is released, and PCI reset
> > handler may come, meantime reset work is just scheduled and should
> > be in-progress.
> > 
> > When nvme_reset_prepare() is run, all NSs may not be allocated yet
> > and each NS's request queue won't be frozen by nvme_dev_disable().
> > 
> > But when nvme_reset_done() is called for resetting controller, all
> > NSs may have been scanned successfully, and nvme_wait_freeze() is
> > called on un-frozen request queues, then wait forever.
> > 
> > Fix the issue by holding device lock for resetting from nvme probe.
> > 
> > Reported-by: Yi Zhang <yi.zhang@redhat.com>
> > Link: https://lore.kernel.org/linux-block/CAHj4cs--KPTAGP=jj+7KMe=arDv=HeGeOgs1T8vbusyk=EjXow@mail.gmail.com/#r
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/nvme/host/pci.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 4232192e10dd..d49b1a082983 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -3075,9 +3075,14 @@ static unsigned long check_vendor_combination_bug(struct pci_dev *pdev)
> >  static void nvme_async_probe(void *data, async_cookie_t cookie)
> >  {
> >  	struct nvme_dev *dev = data;
> > +	struct pci_dev *pdev = to_pci_dev(dev->dev);
> >  
> > +	pci_dev_lock(pdev);
> > +	nvme_reset_ctrl(&dev->ctrl);
> >  	flush_work(&dev->ctrl.reset_work);
> >  	flush_work(&dev->ctrl.scan_work);
> > +	pci_dev_unlock(pdev);
> > +
> >  	nvme_put_ctrl(&dev->ctrl);
> >  }
> 
> When low on memory, async_schedule() falls back to calling the requested
> function directly, so this would deadlock on taking the pci_dev_lock() the
> second time within the probe context.

Good catch, maybe async_schedule() should provide one variation by
returning failure in case of OOM.

> 
> If it is successfully scheduled asynchronously, holding the lock blocks a hot
> removal, which might be the only thing that can unblock the nvme reset_work
> from forward progress.

OK, adding one reset or pci_reset lock may avoid to affect hot removal.

> 
> If you are encountering a nvme_reset_prepare() condition during scanning, that
> might indicate a failure to communicate with the end device. The scan work may
> need the error handling to unblock it.

Not sure I get your point. nvme_reset_prepare() is called from storing
to pci device's reset sysfs attr, so it can come any time, including
scanning.

Thanks,
Ming



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] nvme-pci: fix race between pci reset and nvme probe
  2022-08-02  0:05   ` Ming Lei
@ 2022-08-02  2:22     ` Keith Busch
  0 siblings, 0 replies; 6+ messages in thread
From: Keith Busch @ 2022-08-02  2:22 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, linux-nvme, Sagi Grimberg, Yi Zhang

On Tue, Aug 02, 2022 at 08:05:09AM +0800, Ming Lei wrote:
> > 
> > If you are encountering a nvme_reset_prepare() condition during scanning, that
> > might indicate a failure to communicate with the end device. The scan work may
> > need the error handling to unblock it.
> 
> Not sure I get your point. nvme_reset_prepare() is called from storing
> to pci device's reset sysfs attr, so it can come any time, including
> scanning.

Oh, I mean these .reset_prepare/done callbacks are also used by the PCIe AER
driver when the root port sees an uncorrectable error. In some cases, you won't
be able to successfuly use the PCIe bus downstream that root port until after
resetting it. I was just considering that scan_work may be stuck because of
such an error.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] nvme-pci: fix race between pci reset and nvme probe
  2022-08-01 12:57 [PATCH] nvme-pci: fix race between pci reset and nvme probe Ming Lei
  2022-08-01 14:33 ` Keith Busch
@ 2022-08-04 21:55 ` Keith Busch
  2022-08-05 12:34   ` Ming Lei
  1 sibling, 1 reply; 6+ messages in thread
From: Keith Busch @ 2022-08-04 21:55 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, linux-nvme, Sagi Grimberg, Yi Zhang

On Mon, Aug 01, 2022 at 08:57:53PM +0800, Ming Lei wrote:
> After nvme_probe() returns, device lock is released, and PCI reset
> handler may come, meantime reset work is just scheduled and should
> be in-progress.
> 
> When nvme_reset_prepare() is run, all NSs may not be allocated yet
> and each NS's request queue won't be frozen by nvme_dev_disable().
> 
> But when nvme_reset_done() is called for resetting controller, all
> NSs may have been scanned successfully, and nvme_wait_freeze() is
> called on un-frozen request queues, then wait forever.
> 
> Fix the issue by holding device lock for resetting from nvme probe.

Beyond the issues with locking the pci device, I think the problem you're
describing is generic to any rescan that alters the namespace inventory, so we
can't resolve it only on probe.

Is it enough to track the frozen status within the namespace flags? Something
like the following:

---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2429b11eb9a8..befd156d57c1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5027,8 +5027,11 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl)
 	struct nvme_ns *ns;
 
 	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		if (!test_and_clear_bit(NVME_NS_FREEZE, &ns->flags))
+			continue;
 		blk_mq_unfreeze_queue(ns->queue);
+	}
 	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_unfreeze);
@@ -5039,6 +5042,8 @@ int nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
 
 	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		if (!test_bit(NVME_NS_FREEZE, &ns->flags))
+			continue;
 		timeout = blk_mq_freeze_queue_wait_timeout(ns->queue, timeout);
 		if (timeout <= 0)
 			break;
@@ -5053,8 +5058,11 @@ void nvme_wait_freeze(struct nvme_ctrl *ctrl)
 	struct nvme_ns *ns;
 
 	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		if (!test_bit(NVME_NS_FREEZE, &ns->flags))
+			continue;
 		blk_mq_freeze_queue_wait(ns->queue);
+	}
 	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_wait_freeze);
@@ -5064,8 +5072,10 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
 	struct nvme_ns *ns;
 
 	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		set_bit(NVME_NS_FREEZE, &ns->flags);
 		blk_freeze_queue_start(ns->queue);
+	}
 	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_start_freeze);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index bdc0ff7ed9ab..872237b4b65d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -484,6 +484,7 @@ struct nvme_ns {
 #define NVME_NS_FORCE_RO	3
 #define NVME_NS_READY		4
 #define NVME_NS_STOPPED		5
+#define NVME_NS_FREEZE		6
 
 	struct cdev		cdev;
 	struct device		cdev_device;
--


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] nvme-pci: fix race between pci reset and nvme probe
  2022-08-04 21:55 ` Keith Busch
@ 2022-08-05 12:34   ` Ming Lei
  0 siblings, 0 replies; 6+ messages in thread
From: Ming Lei @ 2022-08-05 12:34 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, linux-nvme, Sagi Grimberg, Yi Zhang, ming.lei

On Thu, Aug 04, 2022 at 03:55:14PM -0600, Keith Busch wrote:
> On Mon, Aug 01, 2022 at 08:57:53PM +0800, Ming Lei wrote:
> > After nvme_probe() returns, device lock is released, and PCI reset
> > handler may come, meantime reset work is just scheduled and should
> > be in-progress.
> > 
> > When nvme_reset_prepare() is run, all NSs may not be allocated yet
> > and each NS's request queue won't be frozen by nvme_dev_disable().
> > 
> > But when nvme_reset_done() is called for resetting controller, all
> > NSs may have been scanned successfully, and nvme_wait_freeze() is
> > called on un-frozen request queues, then wait forever.
> > 
> > Fix the issue by holding device lock for resetting from nvme probe.
> 
> Beyond the issues with locking the pci device, I think the problem you're
> describing is generic to any rescan that alters the namespace inventory, so we
> can't resolve it only on probe.

Yeah, just it is easier to trigger during the 1st scan when no any NS
exits.

> 
> Is it enough to track the frozen status within the namespace flags? Something
> like the following:

I guess the patch could avoid the issue.

Another way might be to remove nvme_wait_freeze() from nvme_reset_work(). No
see it is necessary:

- nvme_dev_add():

if (!dev->ctrl.tagset) {
	...
} else {
	blk_mq_update_nr_hw_queues(&dev->tagset, dev->online_queues - 1);
	...
} 

If there isn't tagset, there can't be IO; and if nr_hw_queues is to be
changed really, blk_mq_update_nr_hw_queues() will freeze queue, otherwise
it isn't necessary to nvme_wait_freeze() before calling
blk_mq_update_nr_hw_queues.



Thanks,
Ming



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-08-05 12:35 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-01 12:57 [PATCH] nvme-pci: fix race between pci reset and nvme probe Ming Lei
2022-08-01 14:33 ` Keith Busch
2022-08-02  0:05   ` Ming Lei
2022-08-02  2:22     ` Keith Busch
2022-08-04 21:55 ` Keith Busch
2022-08-05 12:34   ` Ming Lei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.