From: Keith Busch <keith.busch@intel.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: Marta Rybczynska <mrybczyn@kalray.eu>,
axboe@fb.com, hch@lst.de, sagi@grimberg.me,
linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
bhelgaas@google.com, linux-pci@vger.kernel.org,
Pierre-Yves Kerbrat <pkerbrat@kalray.eu>
Subject: Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
Date: Wed, 21 Mar 2018 10:02:39 -0600
Message-ID: <20180321160238.GF12909@localhost.localdomain>
In-Reply-To: <20180321154807.GD22254@ming.t460p>

On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
> > >> NVMe driver uses threads for the work at device reset, including enabling
> > >> the PCIe device. When multiple NVMe devices are initialized, their reset
> > >> works may be scheduled in parallel. Then pci_enable_device_mem can be
> > >> called in parallel on multiple cores.
> > >>
> > >> This causes a loop of enabling of all upstream bridges in
> > >> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
> > >> including __pci_set_master() and architecture-specific functions that
> > >> call helpers such as pci_enable_resources(). Both __pci_set_master()
> > >> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
> > >> and change it. This is done as read/modify/write.
> > >>
> > >> Imagine that the PCIe tree looks like:
> > >> A - B - switch -  C - D
> > >>                \- E - F
> > >>
> > >> D and F are two NVMe disks and all devices from B are not enabled and bus
> > >> mastering is not set. If their reset work are scheduled in parallel the two
> > >> modifications of PCI_COMMAND may happen in parallel without locking and the
> > >> system may end up with the part of PCIe tree not enabled.
> > >
> > > Then looks serialized reset should be used, and I did see the commit
> > > 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes issue of 'failed
> > > to mark controller state' in reset stress test.
> > >
> > > But that commit only covers case of PCI reset from sysfs attribute, and
> > > maybe other cases need to be dealt with in similar way too.
> > >
> >
> > It seems to me that the serialized reset works for multiple resets of the
> > same device, doesn't it? Our problem is linked to resets of different devices
> > that share the same PCIe tree.
>
> Given reset shouldn't be a frequent action, it might be fine to serialize all
> reset from different devices.

The driver was much simpler when we had serialized resets in line with
probe, but that had bigger problems with certain init systems when
you put enough nvme devices in your server, making them unbootable.

Would it be okay to serialize just the pci_enable_device across all
other tasks messing with the PCI topology?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cef5ce851a92..e0a2f6c0f1cf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 	int result = -ENOMEM;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
-	if (pci_enable_device_mem(pdev))
-		return result;
+	pci_lock_rescan_remove();
+	result = pci_enable_device_mem(pdev);
+	pci_unlock_rescan_remove();
+	if (result)
+		return -ENODEV;
 
 	pci_set_master(pdev);
 
--
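
For readers following the thread, the lost update described in the quoted
report can be sketched in user space. This is only an illustration, not
kernel code: struct fake_bridge, rmw_set(), racy_enable() and
serialized_enable() are hypothetical names, the interleaving is written out
by hand instead of using real threads, and it shows just one possible
ordering of the two read/modify/write sequences on a shared bridge's
PCI_COMMAND register. The bit values are those of PCI_COMMAND_MEMORY and
PCI_COMMAND_MASTER.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit positions as PCI_COMMAND_MEMORY / PCI_COMMAND_MASTER. */
#define CMD_MEMORY 0x0002u
#define CMD_MASTER 0x0004u

/* Hypothetical stand-in for the PCI_COMMAND register of a shared bridge. */
struct fake_bridge {
	uint16_t command;
};

/* One complete read/modify/write, as a task would do while holding a lock. */
void rmw_set(struct fake_bridge *br, uint16_t bits)
{
	assert(br != NULL);
	uint16_t cmd = br->command;	/* read */
	cmd |= bits;			/* modify */
	br->command = cmd;		/* write */
}

/*
 * Unserialized case: both tasks read the register before either writes.
 * The second write is based on a stale value and drops the first task's bit.
 */
uint16_t racy_enable(void)
{
	struct fake_bridge br = { .command = 0 };
	uint16_t seen_by_a = br.command;	/* task A reads 0 */
	uint16_t seen_by_b = br.command;	/* task B reads 0 as well */

	br.command = seen_by_a | CMD_MEMORY;	/* task A writes */
	br.command = seen_by_b | CMD_MASTER;	/* task B overwrites A's update */
	return br.command;
}

/*
 * Serialized case: each read/modify/write finishes before the next starts,
 * which is what holding a common lock around the enable path would enforce.
 */
uint16_t serialized_enable(void)
{
	struct fake_bridge br = { .command = 0 };

	rmw_set(&br, CMD_MEMORY);	/* task A, start to finish */
	rmw_set(&br, CMD_MASTER);	/* task B sees A's result */
	return br.command;
}
```

In the unserialized ordering the register ends up with only the bus master
bit set, so memory decoding on the bridge stays off; in the serialized
ordering both bits survive.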