From: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Sam Bobroff <sbobroff@linux.ibm.com>, <linux-pci@vger.kernel.org>,
Bjorn Helgaas <bhelgaas@google.com>, <linux@yadro.com>,
Russell Currey <ruscur@russell.cc>,
<linuxppc-dev@lists.ozlabs.org>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Oliver OHalloran <oliveroh@au1.ibm.com>
Subject: Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
Date: Tue, 18 Sep 2018 17:01:48 +0300 [thread overview]
Message-ID: <b3963047-e3c1-5612-1bcf-04c9e81dacd2@yadro.com> (raw)
In-Reply-To: <20180917225935.GD54859@bhelgaas-glaptop.roam.corp.google.com>
[-- Attachment #1.1: Type: text/plain, Size: 10048 bytes --]
On 9/18/18 1:59 AM, Bjorn Helgaas wrote:
> [+cc Russell, Ben, Oliver, linuxppc-dev]
>
> On Mon, Sep 17, 2018 at 11:55:43PM +0300, Sergey Miroshnichenko wrote:
>> Hello Sam,
>>
>> On 9/17/18 8:28 AM, Sam Bobroff wrote:
>>> Hi Sergey,
>>>
>>> On Fri, Sep 14, 2018 at 07:14:01PM +0300, Sergey Miroshnichenko wrote:
>>>> Introduce a new command line option "pci=pcie_movable_bars" that indicates
>>>> support of PCIe hotplug without prior reservation of memory regions by
>>>> BIOS/bootloader.
>>>>
>>>> If a new PCIe device has been hot-plugged between two active ones, which
>>>> have no (or not big enough) gap between their BARs, allocating new BARs
>>>> requires to move BARs of the following working devices:
>>>>
>>>> 1) dev 4
>>>> |
>>>> v
>>>> .. | dev 3 | dev 3 | dev 5 | dev 7 |
>>>> .. | BAR 0 | BAR 1 | BAR 0 | BAR 0 |
>>>>
>>>> 2) dev 4
>>>> |
>>>> v
>>>> .. | dev 3 | dev 3 | --> --> | dev 5 | dev 7 |
>>>> .. | BAR 0 | BAR 1 | --> --> | BAR 0 | BAR 0 |
>>>>
>>>> 3)
>>>>
>>>> .. | dev 3 | dev 3 | dev 4 | dev 4 | dev 5 | dev 7 |
>>>> .. | BAR 0 | BAR 1 | BAR 0 | BAR 1 | BAR 0 | BAR 0 |
>>>>
>>>> Not only BARs, but also bridge windows can be updated during a PCIe rescan,
>>>> threatening all memory transactions during this procedure, so the PCI
>>>> subsystem will instruct the drivers to pause by calling the reset_prepare()
>>>> and reset_done() callbacks.
>>>>
>>>> If a device may be affected by BAR movement, the BAR changes tracking must
>>>> be implemented in its driver.
>>>>
>>>> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
>>>> ---
>>>> .../admin-guide/kernel-parameters.txt | 6 +++
>>>> drivers/pci/pci.c | 2 +
>>>> drivers/pci/probe.c | 43 +++++++++++++++++++
>>>> include/linux/pci.h | 1 +
>>>> 4 files changed, 52 insertions(+)
>>>>
>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>>> index 64a3bf54b974..f8132a709061 100644
>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>> @@ -3311,6 +3311,12 @@
>>>> bridges without forcing it upstream. Note:
>>>> this removes isolation between devices and
>>>> may put more devices in an IOMMU group.
>>>> + pcie_movable_bars Arrange a space at runtime for BARs of
>>>> + hotplugged PCIe devices - usable if bootloader
>>>> + doesn't reserve memory regions for them. Freeing
>>>> + a space may require moving BARs of active devices
>>>> + to higher addresses, so device drivers will be
>>>> + paused during rescan.
>>>>
>>>> pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power
>>>> Management.
>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>> index 1835f3a7aa8d..5f07a59b5924 100644
>>>> --- a/drivers/pci/pci.c
>>>> +++ b/drivers/pci/pci.c
>>>> @@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str)
>>>> pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>>>> } else if (!strncmp(str, "disable_acs_redir=", 18)) {
>>>> disable_acs_redir_param = str + 18;
>>>> + } else if (!strncmp(str, "pcie_movable_bars", 17)) {
>>>> + pci_add_flags(PCI_MOVABLE_BARS);
>>>> } else {
>>>> printk(KERN_ERR "PCI: Unknown option `%s'\n",
>>>> str);
>>>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>>>> index 201f9e5ff55c..bdaafc48dc4c 100644
>>>> --- a/drivers/pci/probe.c
>>>> +++ b/drivers/pci/probe.c
>>>> @@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
>>>> return max;
>>>> }
>>>>
>>>> +/*
>>>> + * Put all devices of the bus and its children to reset
>>>> + */
>>>> +static void pci_bus_reset_prepare(struct pci_bus *bus)
>>>> +{
>>>> + struct pci_dev *dev;
>>>> +
>>>> + list_for_each_entry(dev, &bus->devices, bus_list) {
>>>> + struct pci_bus *child = dev->subordinate;
>>>> +
>>>> + if (child) {
>>>> + pci_bus_reset_prepare(child);
>>>> + } else if (dev->driver &&
>>>> + dev->driver->err_handler &&
>>>> + dev->driver->err_handler->reset_prepare) {
>>>> + dev->driver->err_handler->reset_prepare(dev);
>>>> + }
>>>
>>> What about devices with drivers that don't have reset_prepare()? It
>>> looks like it will just reconfigure them anyway. Is that right?
>>>
>>
>> It is possible that unprepared driver without these hooks will get BARs
>> moved, I should put a warning message there. There three ways we can see
>> to make this safe:
>> - add the reset_prepare()/reset_done() hooks to *every* PCIe driver;
>> - refuse BAR movement if at least one unprepared driver has been
>> encountered during rescan;
>> - reduce the number of drivers which can be affected to some
>> controllable value and prepare them on demand.
>>
>> Applying the second proposal as a major restriction seems fairly
>> reasonable, but for our particular setups and use-cases it is probably
>> too stiff:
>> - we've noticed that devices connected directly to the root bridge
>> don't get moved BARs, and this covers our x86_64 servers: we only
>> insert/remove devices into "second-level" and "lower" bridges there, but
>> not root;
>> - on PowerNV we have system devices (network interfaces, USB hub, etc.)
>> grouped into dedicated domain, with all other domains ready for hotplug,
>> and only these domains can be rescanned.
>>
>> With our scenarios currently reduced to these two, we can live with just
>> a few drivers "prepared" for now: NVME and few Ethernet adapters, this
>> gives us a possibility to use this feature before "converting" *all* the
>> drivers, and even have the NVidia cards running on a closed proprietary
>> driver.
>>
>> Should we make this behavior adjustable with something like
>> "pcie_movable_bars=safe" and "pcie_movable_bars=always" ?
>
> I like the overall idea of this a lot.
>
> - Why do we need a command line parameter to enable this? Can't we
> do it unconditionally and automatically when it's possible? We
> could have a chicken switch to *disable* it in case this breaks
> something horribly, but I would like this functionality to be
> always available without a special option.
>
After making this feature completely safe we could activate it with the
existing option "pci=realloc".
> - I'm not sure the existence of .reset_done() is evidence that a
> driver is prepared for its BARs to move. I don't see any
> documentation that refers to BAR movement, and I doubt it's been
> tested. But I only see 5 implementations in the tree, so it'd be
> easy to verify.
>
You are right, and we should clarify the description:
- drivers which have the .reset_done() already - none of them are aware
of movable BARs yet;
- the rest of the drivers should both be able to pause and handle the
changes in BARs.
> - I think your second proposal above sounds right: we should regard
> any device whose driver lacks .reset_done() as immovable. We will
> likely be able to move some devices but not others. Implementing
> .reset_done() will increase flexibility but it shouldn't be a
> requirement for all drivers.
>
Thanks for the advice! This is more flexible and doesn't have any
prerequisites. In this case the greater the "movable"/"immovable" ratio
of the devices that was working before the hotplug event - the higher
the probability to free some space for new BARs. But even a single
"immovable" device at an undesirable place can block the re-arrangement,
in this case all we can is just give up and print an error message.
This patchset in its current form doesn't support marking a choosen BAR
as immovable (just releasing all the resources of the root bridge and
trying to sort and re-assign them back), so I'll have to implement that.
Best regards,
Serge
>>>> + }
>>>> +}
>>>> +
>>>> +/*
>>>> + * Complete the reset of all devices for the bus and its children
>>>> + */
>>>> +static void pci_bus_reset_done(struct pci_bus *bus)
>>>> +{
>>>> + struct pci_dev *dev;
>>>> +
>>>> + list_for_each_entry(dev, &bus->devices, bus_list) {
>>>> + struct pci_bus *child = dev->subordinate;
>>>> +
>>>> + if (child) {
>>>> + pci_bus_reset_done(child);
>>>> + } else if (dev->driver && dev->driver->err_handler &&
>>>> + dev->driver->err_handler->reset_done) {
>>>> + dev->driver->err_handler->reset_done(dev);
>>>> + }
>>>> + }
>>>> +}
>>>> +
>>>> /**
>>>> * pci_rescan_bus - Scan a PCI bus for devices
>>>> * @bus: PCI bus to scan
>>>> @@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
>>>> {
>>>> unsigned int max;
>>>>
>>>> + if (pci_has_flag(PCI_MOVABLE_BARS))
>>>> + pci_bus_reset_prepare(bus);
>>>> max = pci_scan_child_bus(bus);
>>>> pci_assign_unassigned_bus_resources(bus);
>>>> + if (pci_has_flag(PCI_MOVABLE_BARS))
>>>> + pci_bus_reset_done(bus);
>>>> pci_bus_add_devices(bus);
>>>>
>>>> return max;
>>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>>> index 6925828f9f25..a8cb1a367c34 100644
>>>> --- a/include/linux/pci.h
>>>> +++ b/include/linux/pci.h
>>>> @@ -847,6 +847,7 @@ enum {
>>>> PCI_ENABLE_PROC_DOMAINS = 0x00000010, /* Enable domains in /proc */
>>>> PCI_COMPAT_DOMAIN_0 = 0x00000020, /* ... except domain 0 */
>>>> PCI_SCAN_ALL_PCIE_DEVS = 0x00000040, /* Scan all, not just dev 0 */
>>>> + PCI_MOVABLE_BARS = 0x00000080, /* Runtime BAR reassign after hotplug */
>>>> };
>>>>
>>>> /* These external functions are only available when PCI support is enabled */
>>>> --
>>>> 2.17.1
>>>>
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2018-09-18 19:34 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-09-14 16:14 [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug Sergey Miroshnichenko
2018-09-14 16:14 ` [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan Sergey Miroshnichenko
2018-09-17 5:28 ` Sam Bobroff
2018-09-17 20:55 ` Sergey Miroshnichenko
2018-09-17 22:59 ` Bjorn Helgaas
2018-09-18 14:01 ` Sergey Miroshnichenko [this message]
2018-09-18 21:10 ` Bjorn Helgaas
2018-09-17 23:35 ` Oliver
2018-09-18 16:51 ` Sergey Miroshnichenko
2018-09-17 19:00 ` Rajat Jain
2018-09-17 19:38 ` Lukas Wunner
2018-09-17 19:44 ` Rajat Jain
2018-09-17 21:25 ` Sergey Miroshnichenko
2018-09-18 21:22 ` Rajat Jain
2019-01-11 17:24 ` Sergey Miroshnichenko
2018-09-14 16:14 ` [PATCH RFC 2/4] PCI: Release and reassign resources from the root " Sergey Miroshnichenko
2018-09-14 16:14 ` [PATCH RFC 3/4] PCI: Invalidate the released BAR resources Sergey Miroshnichenko
2018-09-14 16:14 ` [PATCH RFC 4/4] PCI: Fix writing invalid BARs during pci_restore_state() Sergey Miroshnichenko
2018-09-18 11:16 ` [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug David Laight
2018-09-18 17:07 ` Sergey Miroshnichenko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=b3963047-e3c1-5612-1bcf-04c9e81dacd2@yadro.com \
--to=s.miroshnichenko@yadro.com \
--cc=benh@kernel.crashing.org \
--cc=bhelgaas@google.com \
--cc=helgaas@kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=linux@yadro.com \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=oliveroh@au1.ibm.com \
--cc=ruscur@russell.cc \
--cc=sbobroff@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).