Linux-PCI Archive on lore.kernel.org
* [RFC PATCH] nvme: avoid race-conditions when enabling devices
@ 2018-03-21 11:00 Marta Rybczynska
  2018-03-21 11:50 ` Ming Lei
  0 siblings, 1 reply; 10+ messages in thread
From: Marta Rybczynska @ 2018-03-21 11:00 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi, linux-nvme, linux-kernel,
	bhelgaas, linux-pci
  Cc: Pierre-Yves Kerbrat

The NVMe driver uses threads for the work done at device reset, including
enabling the PCIe device. When multiple NVMe devices are initialized, their
reset works may be scheduled in parallel, and pci_enable_device_mem can then
be called in parallel on multiple cores.

This causes a walk up the tree enabling all upstream bridges in
pci_enable_bridge(). pci_enable_bridge() performs multiple operations,
including __pci_set_master() and architecture-specific functions that
call functions such as pci_enable_resources(). Both __pci_set_master()
and pci_enable_resources() read the PCI_COMMAND field in the PCI config
space and change it. This is done as a read/modify/write sequence.

Imagine that the PCIe tree looks like:
A - B - switch -  C - D
               \- E - F

D and F are two NVMe disks, and all devices from B down are not enabled and
bus mastering is not set. If their reset works are scheduled in parallel, the
two modifications of PCI_COMMAND may happen in parallel without locking and
the system may end up with part of the PCIe tree not enabled.

The problem may also happen if another device is initialized in parallel to
an NVMe disk.

This fix moves pci_enable_device_mem to the probe part of the driver, which
is run sequentially, to avoid the issue.

Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu>
Signed-off-by: Pierre-Yves Kerbrat <pkerbrat@kalray.eu>
---
 drivers/nvme/host/pci.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b6f43b7..af53854 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2515,6 +2515,14 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
+	/*
+	 * Enable the device now to make sure that all accesses to bridges above
+	 * are done without races
+	 */
+	result = pci_enable_device_mem(pdev);
+	if (result)
+		goto release_pools;
+
 	nvme_reset_ctrl(&dev->ctrl);
 
 	return 0;
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
  2018-03-21 11:00 [RFC PATCH] nvme: avoid race-conditions when enabling devices Marta Rybczynska
@ 2018-03-21 11:50 ` Ming Lei
  2018-03-21 12:10   ` Marta Rybczynska
  0 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2018-03-21 11:50 UTC (permalink / raw)
  To: Marta Rybczynska
  Cc: keith.busch, axboe, hch, sagi, linux-nvme, linux-kernel,
	bhelgaas, linux-pci, Pierre-Yves Kerbrat

On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
> NVMe driver uses threads for the work at device reset, including enabling
> the PCIe device. When multiple NVMe devices are initialized, their reset
> works may be scheduled in parallel. Then pci_enable_device_mem can be
> called in parallel on multiple cores.
> 
> This causes a loop of enabling of all upstream bridges in
> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
> including __pci_set_master and architecture-specific functions that
> call functions such as pci_enable_resources(). Both __pci_set_master()
> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
> and change it. This is done as read/modify/write.
> 
> Imagine that the PCIe tree looks like:
> A - B - switch -  C - D
>                \- E - F
> 
> D and F are two NVMe disks and all devices from B are not enabled and bus
> mastering is not set. If their reset works are scheduled in parallel, the two
> modifications of PCI_COMMAND may happen in parallel without locking and the
> system may end up with the part of PCIe tree not enabled.

Then it looks like a serialized reset should be used, and I did see that
commit 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes the 'failed
to mark controller state' issue in reset stress tests.

But that commit only covers the case of a PCI reset from the sysfs attribute,
and maybe other cases need to be dealt with in a similar way too.

Thanks,
Ming


* Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
  2018-03-21 11:50 ` Ming Lei
@ 2018-03-21 12:10   ` Marta Rybczynska
  2018-03-21 15:48     ` Ming Lei
  0 siblings, 1 reply; 10+ messages in thread
From: Marta Rybczynska @ 2018-03-21 12:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: keith busch, axboe, hch, sagi, linux-nvme, linux-kernel,
	bhelgaas, linux-pci, Pierre-Yves Kerbrat

> On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
>> NVMe driver uses threads for the work at device reset, including enabling
>> the PCIe device. When multiple NVMe devices are initialized, their reset
>> works may be scheduled in parallel. Then pci_enable_device_mem can be
>> called in parallel on multiple cores.
>> 
>> This causes a loop of enabling of all upstream bridges in
>> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
>> including __pci_set_master and architecture-specific functions that
>> call functions such as pci_enable_resources(). Both __pci_set_master()
>> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
>> and change it. This is done as read/modify/write.
>> 
>> Imagine that the PCIe tree looks like:
>> A - B - switch -  C - D
>>                \- E - F
>> 
>> D and F are two NVMe disks and all devices from B are not enabled and bus
>> mastering is not set. If their reset works are scheduled in parallel, the two
>> modifications of PCI_COMMAND may happen in parallel without locking and the
>> system may end up with the part of PCIe tree not enabled.
> 
> Then looks serialized reset should be used, and I did see the commit
> 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes issue of 'failed
> to mark controller state' in reset stress test.
> 
> But that commit only covers case of PCI reset from sysfs attribute, and
> maybe other cases need to be dealt with in similar way too.
> 

It seems to me that the serialized reset works for multiple resets of the
same device, doesn't it? Our problem is linked to resets of different devices
that share the same PCIe tree.

You're right that the problem we face might also come with manual resets
under certain conditions (I think that all devices in a subtree would need
to be disabled).

Thanks,
Marta


* Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
  2018-03-21 12:10   ` Marta Rybczynska
@ 2018-03-21 15:48     ` Ming Lei
  2018-03-21 16:02       ` Keith Busch
  0 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2018-03-21 15:48 UTC (permalink / raw)
  To: Marta Rybczynska
  Cc: keith busch, axboe, hch, sagi, linux-nvme, linux-kernel,
	bhelgaas, linux-pci, Pierre-Yves Kerbrat

On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
> > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
> >> NVMe driver uses threads for the work at device reset, including enabling
> >> the PCIe device. When multiple NVMe devices are initialized, their reset
> >> works may be scheduled in parallel. Then pci_enable_device_mem can be
> >> called in parallel on multiple cores.
> >> 
> >> This causes a loop of enabling of all upstream bridges in
> >> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
> >> including __pci_set_master and architecture-specific functions that
> >> call functions such as pci_enable_resources(). Both __pci_set_master()
> >> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
> >> and change it. This is done as read/modify/write.
> >> 
> >> Imagine that the PCIe tree looks like:
> >> A - B - switch -  C - D
> >>                \- E - F
> >> 
> >> D and F are two NVMe disks and all devices from B are not enabled and bus
> >> mastering is not set. If their reset works are scheduled in parallel, the two
> >> modifications of PCI_COMMAND may happen in parallel without locking and the
> >> system may end up with the part of PCIe tree not enabled.
> > 
> > Then looks serialized reset should be used, and I did see the commit
> > 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes issue of 'failed
> > to mark controller state' in reset stress test.
> > 
> > But that commit only covers case of PCI reset from sysfs attribute, and
> > maybe other cases need to be dealt with in similar way too.
> > 
> 
> It seems to me that the serialized reset works for multiple resets of the
> same device, doesn't it? Our problem is linked to resets of different devices
> that share the same PCIe tree.

Given that reset shouldn't be a frequent action, it might be fine to serialize
all resets from different devices.

Thanks,
Ming


* Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
  2018-03-21 15:48     ` Ming Lei
@ 2018-03-21 16:02       ` Keith Busch
  2018-03-21 16:10         ` Marta Rybczynska
  0 siblings, 1 reply; 10+ messages in thread
From: Keith Busch @ 2018-03-21 16:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: Marta Rybczynska, axboe, hch, sagi, linux-nvme, linux-kernel,
	bhelgaas, linux-pci, Pierre-Yves Kerbrat

On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
> > >> NVMe driver uses threads for the work at device reset, including enabling
> > >> the PCIe device. When multiple NVMe devices are initialized, their reset
> > >> works may be scheduled in parallel. Then pci_enable_device_mem can be
> > >> called in parallel on multiple cores.
> > >> 
> > >> This causes a loop of enabling of all upstream bridges in
> > >> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
> > >> including __pci_set_master and architecture-specific functions that
> > >> call functions such as pci_enable_resources(). Both __pci_set_master()
> > >> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
> > >> and change it. This is done as read/modify/write.
> > >> 
> > >> Imagine that the PCIe tree looks like:
> > >> A - B - switch -  C - D
> > >>                \- E - F
> > >> 
> > >> D and F are two NVMe disks and all devices from B are not enabled and bus
> > >> mastering is not set. If their reset works are scheduled in parallel, the two
> > >> modifications of PCI_COMMAND may happen in parallel without locking and the
> > >> system may end up with the part of PCIe tree not enabled.
> > > 
> > > Then looks serialized reset should be used, and I did see the commit
> > > 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes issue of 'failed
> > > to mark controller state' in reset stress test.
> > > 
> > > But that commit only covers case of PCI reset from sysfs attribute, and
> > > maybe other cases need to be dealt with in similar way too.
> > > 
> > 
> > It seems to me that the serialized reset works for multiple resets of the
> > same device, doesn't it? Our problem is linked to resets of different devices
> > that share the same PCIe tree.
> 
> Given reset shouldn't be a frequent action, it might be fine to serialize all
> reset from different devices.

The driver was much simpler when we had serialized resets in line with
probe, but that had a bigger problem with certain init systems when
you put enough nvme devices in your server, making them unbootable.

Would it be okay to serialize just the pci_enable_device across all
other tasks messing with the PCI topology?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cef5ce851a92..e0a2f6c0f1cf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
	int result = -ENOMEM;
	struct pci_dev *pdev = to_pci_dev(dev->dev);

-	if (pci_enable_device_mem(pdev))
-		return result;
+	pci_lock_rescan_remove();
+	result = pci_enable_device_mem(pdev);
+	pci_unlock_rescan_remove();
+	if (result)
+		return -ENODEV;

	pci_set_master(pdev);

--


* Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
  2018-03-21 16:02       ` Keith Busch
@ 2018-03-21 16:10         ` Marta Rybczynska
  2018-03-21 21:53           ` Bjorn Helgaas
  2018-03-23  7:44           ` Marta Rybczynska
  0 siblings, 2 replies; 10+ messages in thread
From: Marta Rybczynska @ 2018-03-21 16:10 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, axboe, hch, sagi, linux-nvme, linux-kernel, bhelgaas,
	linux-pci, Pierre-Yves Kerbrat

> On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
>> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
>> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
>> > >> NVMe driver uses threads for the work at device reset, including enabling
>> > >> the PCIe device. When multiple NVMe devices are initialized, their reset
>> > >> works may be scheduled in parallel. Then pci_enable_device_mem can be
>> > >> called in parallel on multiple cores.
>> > >> 
>> > >> This causes a loop of enabling of all upstream bridges in
>> > >> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
>> > >> including __pci_set_master and architecture-specific functions that
>> > >> call functions such as pci_enable_resources(). Both __pci_set_master()
>> > >> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
>> > >> and change it. This is done as read/modify/write.
>> > >> 
>> > >> Imagine that the PCIe tree looks like:
>> > >> A - B - switch -  C - D
>> > >>                \- E - F
>> > >> 
>> > >> D and F are two NVMe disks and all devices from B are not enabled and bus
>> > >> mastering is not set. If their reset works are scheduled in parallel, the two
>> > >> modifications of PCI_COMMAND may happen in parallel without locking and the
>> > >> system may end up with the part of PCIe tree not enabled.
>> > > 
>> > > Then looks serialized reset should be used, and I did see the commit
>> > > 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes issue of 'failed
>> > > to mark controller state' in reset stress test.
>> > > 
>> > > But that commit only covers case of PCI reset from sysfs attribute, and
>> > > maybe other cases need to be dealt with in similar way too.
>> > > 
>> > 
>> > It seems to me that the serialized reset works for multiple resets of the
>> > same device, doesn't it? Our problem is linked to resets of different devices
>> > that share the same PCIe tree.
>> 
>> Given reset shouldn't be a frequent action, it might be fine to serialize all
>> reset from different devices.
> 
> The driver was much simpler when we had serialized resets in line with
> probe, but that had a bigger problems with certain init systems when
> you put enough nvme devices in your server, making them unbootable.
> 
> Would it be okay to serialize just the pci_enable_device across all
> other tasks messing with the PCI topology?
> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index cef5ce851a92..e0a2f6c0f1cf 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>	int result = -ENOMEM;
>	struct pci_dev *pdev = to_pci_dev(dev->dev);
> 
> -	if (pci_enable_device_mem(pdev))
> -		return result;
> +	pci_lock_rescan_remove();
> +	result = pci_enable_device_mem(pdev);
> +	pci_unlock_rescan_remove();
> +	if (result)
> +		return -ENODEV;
> 
>	pci_set_master(pdev);
> 

The problem may also happen with another device doing its probe while nvme
runs its workqueue (and we have probably seen it in practice too). We were
thinking about a lock in the PCI generic code too; that's why I've put the
linux-pci@ list in copy.

Marta


* Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
  2018-03-21 16:10         ` Marta Rybczynska
@ 2018-03-21 21:53           ` Bjorn Helgaas
  2018-03-23  7:28             ` Marta Rybczynska
  2018-03-23  7:44           ` Marta Rybczynska
  1 sibling, 1 reply; 10+ messages in thread
From: Bjorn Helgaas @ 2018-03-21 21:53 UTC (permalink / raw)
  To: Marta Rybczynska
  Cc: Keith Busch, Ming Lei, axboe, hch, sagi, linux-nvme,
	linux-kernel, bhelgaas, linux-pci, Pierre-Yves Kerbrat,
	Srinath Mannam

[+cc Srinath]

On Wed, Mar 21, 2018 at 05:10:56PM +0100, Marta Rybczynska wrote:
> > On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
> >> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
> >> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
> >> > >> NVMe driver uses threads for the work at device reset, including enabling
> >> > >> the PCIe device. When multiple NVMe devices are initialized, their reset
> >> > >> works may be scheduled in parallel. Then pci_enable_device_mem can be
> >> > >> called in parallel on multiple cores.
> >> > >> 
> >> > >> This causes a loop of enabling of all upstream bridges in
> >> > >> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
> >> > >> including __pci_set_master and architecture-specific functions that
> >> > >> call ones like and pci_enable_resources(). Both __pci_set_master()
> >> > >> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
> >> > >> and change it. This is done as read/modify/write.
> >> > >> 
> >> > >> Imagine that the PCIe tree looks like:
> >> > >> A - B - switch -  C - D
> >> > >>                \- E - F
> >> > >> 
> >> > >> D and F are two NVMe disks and all devices from B are not enabled and bus
> >> > >> mastering is not set. If their reset works are scheduled in parallel, the two
> >> > >> modifications of PCI_COMMAND may happen in parallel without locking and the
> >> > >> system may end up with the part of PCIe tree not enabled.
> >> > > 
> >> > > Then looks serialized reset should be used, and I did see the commit
> >> > > 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes issue of 'failed
> >> > > to mark controller state' in reset stress test.
> >> > > 
> >> > > But that commit only covers case of PCI reset from sysfs attribute, and
> >> > > maybe other cases need to be dealt with in similar way too.
> >> > > 
> >> > 
> >> > It seems to me that the serialized reset works for multiple resets of the
> >> > same device, doesn't it? Our problem is linked to resets of different devices
> >> > that share the same PCIe tree.
> >> 
> >> Given reset shouldn't be a frequent action, it might be fine to serialize all
> >> reset from different devices.
> > 
> > The driver was much simpler when we had serialized resets in line with
> > probe, but that had a bigger problems with certain init systems when
> > you put enough nvme devices in your server, making them unbootable.
> > 
> > Would it be okay to serialize just the pci_enable_device across all
> > other tasks messing with the PCI topology?
> > 
> > ---
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index cef5ce851a92..e0a2f6c0f1cf 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
> >	int result = -ENOMEM;
> >	struct pci_dev *pdev = to_pci_dev(dev->dev);
> > 
> > -	if (pci_enable_device_mem(pdev))
> > -		return result;
> > +	pci_lock_rescan_remove();
> > +	result = pci_enable_device_mem(pdev);
> > +	pci_unlock_rescan_remove();
> > +	if (result)
> > +		return -ENODEV;
> > 
> >	pci_set_master(pdev);
> 
> The problem may happen also with other device doing its probe and
> nvme running its workqueue (and we probably have seen it in practice
> too). We were thinking about a lock in the pci generic code too,
> that's why I've put the linux-pci@ list in copy.

Yes, this is a generic problem in the PCI core.  We've tried to fix it
in the past but haven't figured it out yet.

See 40f11adc7cd9 ("PCI: Avoid race while enabling upstream bridges")
and 0f50a49e3008 ("Revert "PCI: Avoid race while enabling upstream
bridges"").

It's not trivial, but if you figure out a good way to fix this, I'd be
thrilled.

Bjorn


* Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
  2018-03-21 21:53           ` Bjorn Helgaas
@ 2018-03-23  7:28             ` Marta Rybczynska
  2018-03-23  8:44               ` Srinath Mannam
  0 siblings, 1 reply; 10+ messages in thread
From: Marta Rybczynska @ 2018-03-23  7:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Keith Busch, Ming Lei, axboe, hch, sagi, linux-nvme,
	linux-kernel, bhelgaas, linux-pci, Pierre-Yves Kerbrat,
	Srinath Mannam


> On Wed, Mar 21, 2018 at 05:10:56PM +0100, Marta Rybczynska wrote:
>> 
>> The problem may happen also with other device doing its probe and
>> nvme running its workqueue (and we probably have seen it in practice
>> too). We were thinking about a lock in the pci generic code too,
>> that's why I've put the linux-pci@ list in copy.
> 
> Yes, this is a generic problem in the PCI core.  We've tried to fix it
> in the past but haven't figured it out yet.
> 
> See 40f11adc7cd9 ("PCI: Avoid race while enabling upstream bridges")
> and 0f50a49e3008 ("Revert "PCI: Avoid race while enabling upstream
> bridges"").
> 
> It's not trivial, but if you figure out a good way to fix this, I'd be
> thrilled.
> 

Bjorn, Srinath, are you aware of anyone working on an updated fix
for this one?

Marta


* Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
  2018-03-21 16:10         ` Marta Rybczynska
  2018-03-21 21:53           ` Bjorn Helgaas
@ 2018-03-23  7:44           ` Marta Rybczynska
  1 sibling, 0 replies; 10+ messages in thread
From: Marta Rybczynska @ 2018-03-23  7:44 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, axboe, hch, sagi, linux-nvme, linux-kernel, bhelgaas,
	linux-pci, Pierre-Yves Kerbrat



----- Original Message -----
> From: "Marta Rybczynska" <mrybczyn@kalray.eu>
> To: "Keith Busch" <keith.busch@intel.com>
> Cc: "Ming Lei" <ming.lei@redhat.com>, axboe@fb.com, hch@lst.de,
> sagi@grimberg.me, linux-nvme@lists.infradead.org,
> linux-kernel@vger.kernel.org, bhelgaas@google.com,
> linux-pci@vger.kernel.org, "Pierre-Yves Kerbrat" <pkerbrat@kalray.eu>
> Sent: Wednesday, 21 March 2018 17:10:56
> Subject: Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices

>> On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
>>> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
>>> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
>>> > >> NVMe driver uses threads for the work at device reset, including
>>> > >> enabling the PCIe device. When multiple NVMe devices are
>>> > >> initialized, their reset works may be scheduled in parallel. Then
>>> > >> pci_enable_device_mem can be called in parallel on multiple cores.
>>> > >> 
>>> > >> This causes a loop of enabling of all upstream bridges in
>>> > >> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
>>> > >> including __pci_set_master and architecture-specific functions that
>>> > >> call functions such as pci_enable_resources(). Both __pci_set_master()
>>> > >> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
>>> > >> and change it. This is done as read/modify/write.
>>> > >> 
>>> > >> Imagine that the PCIe tree looks like:
>>> > >> A - B - switch -  C - D
>>> > >>                \- E - F
>>> > >> 
>>> > >> D and F are two NVMe disks and all devices from B are not enabled
>>> > >> and bus mastering is not set. If their reset works are scheduled in
>>> > >> parallel, the two modifications of PCI_COMMAND may happen in
>>> > >> parallel without locking and the system may end up with part of the
>>> > >> PCIe tree not enabled.
>>> > > 
>>> > > Then looks serialized reset should be used, and I did see the commit
>>> > > 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes issue of 'failed
>>> > > to mark controller state' in reset stress test.
>>> > > 
>>> > > But that commit only covers case of PCI reset from sysfs attribute, and
>>> > > maybe other cases need to be dealt with in similar way too.
>>> > > 
>>> > 
>>> > It seems to me that the serialized reset works for multiple resets of
>>> > the same device, doesn't it? Our problem is linked to resets of
>>> > different devices that share the same PCIe tree.
>>> 
>>> Given reset shouldn't be a frequent action, it might be fine to
>>> serialize all resets from different devices.
>> 
>> The driver was much simpler when we had serialized resets in line with
>> probe, but that had a bigger problem with certain init systems when
>> you put enough nvme devices in your server, making them unbootable.
>> 
>> Would it be okay to serialize just the pci_enable_device across all
>> other tasks messing with the PCI topology?
>> 
>> ---
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index cef5ce851a92..e0a2f6c0f1cf 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>> 	int result = -ENOMEM;
>> 	struct pci_dev *pdev = to_pci_dev(dev->dev);
>> 
>> -	if (pci_enable_device_mem(pdev))
>> -		return result;
>> +	pci_lock_rescan_remove();
>> +	result = pci_enable_device_mem(pdev);
>> +	pci_unlock_rescan_remove();
>> +	if (result)
>> +		return -ENODEV;
>> 
>> 	pci_set_master(pdev);
>> 
> 
> The problem may also happen with another device doing its probe and nvme
> running its workqueue (and we probably have seen it in practice too). We
> were thinking about a lock in the pci generic code too, that's why I've
> put the linux-pci@ list in copy.
> 

Keith, it looks to me that this is going to fix the issue between two nvme
driver instances at hotplug time. This is the one we didn't cover in the
first patch.

We can see the issue at driver load (so at boot) and the lock isn't taken
by the generic non-rescan code. Other calls to pci_enable_device_mem aren't
protected either (see Bjorn's message).

What do you think about applying both for now until we have a generic fix
in pci?

Marta


* RE: [RFC PATCH] nvme: avoid race-conditions when enabling devices
  2018-03-23  7:28             ` Marta Rybczynska
@ 2018-03-23  8:44               ` Srinath Mannam
  0 siblings, 0 replies; 10+ messages in thread
From: Srinath Mannam @ 2018-03-23  8:44 UTC (permalink / raw)
  To: Marta Rybczynska, Bjorn Helgaas
  Cc: Keith Busch, Ming Lei, axboe, hch, sagi, linux-nvme,
	linux-kernel, bhelgaas, linux-pci, Pierre-Yves Kerbrat

Hi Marta,

I could not find time to work on this.
The present patch works for our platforms, so we continue with that.
I will post updated changes a little later.
If you have time, please try the same patch and let us know if you see
any issue.

Regards,
Srinath.

-----Original Message-----
From: Marta Rybczynska [mailto:mrybczyn@kalray.eu]
Sent: Friday, March 23, 2018 12:58 PM
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Keith Busch <keith.busch@intel.com>; Ming Lei <ming.lei@redhat.com>;
axboe@fb.com; hch@lst.de; sagi@grimberg.me; linux-nvme@lists.infradead.org;
linux-kernel@vger.kernel.org; bhelgaas@google.com;
linux-pci@vger.kernel.org; Pierre-Yves Kerbrat <pkerbrat@kalray.eu>; Srinath
Mannam <srinath.mannam@broadcom.com>
Subject: Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices


> On Wed, Mar 21, 2018 at 05:10:56PM +0100, Marta Rybczynska wrote:
>>
>> The problem may happen also with other device doing its probe and
>> nvme running its workqueue (and we probably have seen it in practice
>> too). We were thinking about a lock in the pci generic code too,
>> that's why I've put the linux-pci@ list in copy.
>
> Yes, this is a generic problem in the PCI core.  We've tried to fix it
> in the past but haven't figured it out yet.
>
> See 40f11adc7cd9 ("PCI: Avoid race while enabling upstream bridges")
> and 0f50a49e3008 ("Revert "PCI: Avoid race while enabling upstream
> bridges"").
>
> It's not trivial, but if you figure out a good way to fix this, I'd be
> thrilled.
>

Bjorn, Srinath, are you aware of anyone working on an updated fix for this
one?

Marta

