linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Sam Bobroff <sbobroff@linux.ibm.com>, <linux-pci@vger.kernel.org>,
	Bjorn Helgaas <bhelgaas@google.com>, <linux@yadro.com>,
	Russell Currey <ruscur@russell.cc>,
	<linuxppc-dev@lists.ozlabs.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Oliver OHalloran <oliveroh@au1.ibm.com>
Subject: Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
Date: Tue, 18 Sep 2018 17:01:48 +0300	[thread overview]
Message-ID: <b3963047-e3c1-5612-1bcf-04c9e81dacd2@yadro.com> (raw)
In-Reply-To: <20180917225935.GD54859@bhelgaas-glaptop.roam.corp.google.com>


[-- Attachment #1.1: Type: text/plain, Size: 10048 bytes --]

On 9/18/18 1:59 AM, Bjorn Helgaas wrote:
> [+cc Russell, Ben, Oliver, linuxppc-dev]
> 
> On Mon, Sep 17, 2018 at 11:55:43PM +0300, Sergey Miroshnichenko wrote:
>> Hello Sam,
>>
>> On 9/17/18 8:28 AM, Sam Bobroff wrote:
>>> Hi Sergey,
>>>
>>> On Fri, Sep 14, 2018 at 07:14:01PM +0300, Sergey Miroshnichenko wrote:
>>>> Introduce a new command line option "pci=pcie_movable_bars" that indicates
>>>> support of PCIe hotplug without prior reservation of memory regions by
>>>> BIOS/bootloader.
>>>>
>>>> If a new PCIe device has been hot-plugged between two active ones, which
>>>> have no (or not big enough) gap between their BARs, allocating new BARs
>>>> requires to move BARs of the following working devices:
>>>>
>>>> 1)                   dev 4
>>>>                        |
>>>>                        v
>>>> .. |  dev 3  |  dev 3  |  dev 5  |  dev 7  |
>>>> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>>>>
>>>> 2)                             dev 4
>>>>                                  |
>>>>                                  v
>>>> .. |  dev 3  |  dev 3  | -->           --> |  dev 5  |  dev 7  |
>>>> .. |  BAR 0  |  BAR 1  | -->           --> |  BAR 0  |  BAR 0  |
>>>>
>>>> 3)
>>>>
>>>> .. |  dev 3  |  dev 3  |  dev 4  |  dev 4  |  dev 5  |  dev 7  |
>>>> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>>>>
>>>> Not only BARs, but also bridge windows can be updated during a PCIe rescan,
>>>> threatening all memory transactions during this procedure, so the PCI
>>>> subsystem will instruct the drivers to pause by calling the reset_prepare()
>>>> and reset_done() callbacks.
>>>>
>>>> If a device may be affected by BAR movement, the BAR changes tracking must
>>>> be implemented in its driver.
>>>>
>>>> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
>>>> ---
>>>>  .../admin-guide/kernel-parameters.txt         |  6 +++
>>>>  drivers/pci/pci.c                             |  2 +
>>>>  drivers/pci/probe.c                           | 43 +++++++++++++++++++
>>>>  include/linux/pci.h                           |  1 +
>>>>  4 files changed, 52 insertions(+)
>>>>
>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>>> index 64a3bf54b974..f8132a709061 100644
>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>> @@ -3311,6 +3311,12 @@
>>>>  				bridges without forcing it upstream. Note:
>>>>  				this removes isolation between devices and
>>>>  				may put more devices in an IOMMU group.
>>>> +		pcie_movable_bars	Arrange a space at runtime for BARs of
>>>> +				hotplugged PCIe devices - usable if bootloader
>>>> +				doesn't reserve memory regions for them. Freeing
>>>> +				a space may require moving BARs of active devices
>>>> +				to higher addresses, so device drivers will be
>>>> +				paused during rescan.
>>>>  
>>>>  	pcie_aspm=	[PCIE] Forcibly enable or disable PCIe Active State Power
>>>>  			Management.
>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>> index 1835f3a7aa8d..5f07a59b5924 100644
>>>> --- a/drivers/pci/pci.c
>>>> +++ b/drivers/pci/pci.c
>>>> @@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str)
>>>>  				pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>>>>  			} else if (!strncmp(str, "disable_acs_redir=", 18)) {
>>>>  				disable_acs_redir_param = str + 18;
>>>> +			} else if (!strncmp(str, "pcie_movable_bars", 17)) {
>>>> +				pci_add_flags(PCI_MOVABLE_BARS);
>>>>  			} else {
>>>>  				printk(KERN_ERR "PCI: Unknown option `%s'\n",
>>>>  						str);
>>>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>>>> index 201f9e5ff55c..bdaafc48dc4c 100644
>>>> --- a/drivers/pci/probe.c
>>>> +++ b/drivers/pci/probe.c
>>>> @@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
>>>>  	return max;
>>>>  }
>>>>  
>>>> +/*
>>>> + * Put all devices of the bus and its children to reset
>>>> + */
>>>> +static void pci_bus_reset_prepare(struct pci_bus *bus)
>>>> +{
>>>> +	struct pci_dev *dev;
>>>> +
>>>> +	list_for_each_entry(dev, &bus->devices, bus_list) {
>>>> +		struct pci_bus *child = dev->subordinate;
>>>> +
>>>> +		if (child) {
>>>> +			pci_bus_reset_prepare(child);
>>>> +		} else if (dev->driver &&
>>>> +			   dev->driver->err_handler &&
>>>> +			   dev->driver->err_handler->reset_prepare) {
>>>> +			dev->driver->err_handler->reset_prepare(dev);
>>>> +		}
>>>
>>> What about devices with drivers that don't have reset_prepare()?  It
>>> looks like it will just reconfigure them anyway. Is that right?
>>>
>>
>> It is possible that unprepared driver without these hooks will get BARs
>> moved, I should put a warning message there. There three ways we can see
>> to make this safe:
>>  - add the reset_prepare()/reset_done() hooks to *every* PCIe driver;
>>  - refuse BAR movement if at least one unprepared driver has been
>> encountered during rescan;
>>  - reduce the number of drivers which can be affected to some
>> controllable value and prepare them on demand.
>>
>> Applying the second proposal as a major restriction seems fairly
>> reasonable, but for our particular setups and use-cases it is probably
>> too stiff:
>>  - we've noticed that devices connected directly to the root bridge
>> don't get moved BARs, and this covers our x86_64 servers: we only
>> insert/remove devices into "second-level" and "lower" bridges there, but
>> not root;
>>  - on PowerNV we have system devices (network interfaces, USB hub, etc.)
>> grouped into dedicated domain, with all other domains ready for hotplug,
>> and only these domains can be rescanned.
>>
>> With our scenarios currently reduced to these two, we can live with just
>> a few drivers "prepared" for now: NVME and few Ethernet adapters, this
>> gives us a possibility to use this feature before "converting" *all* the
>> drivers, and even have the NVidia cards running on a closed proprietary
>> driver.
>>
>> Should we make this behavior adjustable with something like
>> "pcie_movable_bars=safe" and "pcie_movable_bars=always" ?
> 
> I like the overall idea of this a lot.
> 
>   - Why do we need a command line parameter to enable this?  Can't we
>     do it unconditionally and automatically when it's possible?  We
>     could have a chicken switch to *disable* it in case this breaks
>     something horribly, but I would like this functionality to be
>     always available without a special option.
> 

After making this feature completely safe we could activate it with the
existing option "pci=realloc".

>   - I'm not sure the existence of .reset_done() is evidence that a
>     driver is prepared for its BARs to move.  I don't see any
>     documentation that refers to BAR movement, and I doubt it's been
>     tested.  But I only see 5 implementations in the tree, so it'd be
>     easy to verify.
>     

You are right, and we should clarify the description:
 - drivers which have the .reset_done() already - none of them are aware
of movable BARs yet;
 - the rest of the drivers should both be able to pause and handle the
changes in BARs.

>   - I think your second proposal above sounds right: we should regard
>     any device whose driver lacks .reset_done() as immovable.  We will
>     likely be able to move some devices but not others.  Implementing
>     .reset_done() will increase flexibility but it shouldn't be a
>     requirement for all drivers.
> 

Thanks for the advice! This is more flexible and doesn't have any
prerequisites. In this case the greater the "movable"/"immovable" ratio
of the devices that was working before the hotplug event - the higher
the probability to free some space for new BARs. But even a single
"immovable" device at an undesirable place can block the re-arrangement,
in this case all we can is just give up and print an error message.

This patchset in its current form doesn't support marking a choosen BAR
as immovable (just releasing all the resources of the root bridge and
trying to sort and re-assign them back), so I'll have to implement that.

Best regards,
Serge

>>>> +	}
>>>> +}
>>>> +
>>>> +/*
>>>> + * Complete the reset of all devices for the bus and its children
>>>> + */
>>>> +static void pci_bus_reset_done(struct pci_bus *bus)
>>>> +{
>>>> +	struct pci_dev *dev;
>>>> +
>>>> +	list_for_each_entry(dev, &bus->devices, bus_list) {
>>>> +		struct pci_bus *child = dev->subordinate;
>>>> +
>>>> +		if (child) {
>>>> +			pci_bus_reset_done(child);
>>>> +		} else if (dev->driver && dev->driver->err_handler &&
>>>> +			   dev->driver->err_handler->reset_done) {
>>>> +			dev->driver->err_handler->reset_done(dev);
>>>> +		}
>>>> +	}
>>>> +}
>>>> +
>>>>  /**
>>>>   * pci_rescan_bus - Scan a PCI bus for devices
>>>>   * @bus: PCI bus to scan
>>>> @@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
>>>>  {
>>>>  	unsigned int max;
>>>>  
>>>> +	if (pci_has_flag(PCI_MOVABLE_BARS))
>>>> +		pci_bus_reset_prepare(bus);
>>>>  	max = pci_scan_child_bus(bus);
>>>>  	pci_assign_unassigned_bus_resources(bus);
>>>> +	if (pci_has_flag(PCI_MOVABLE_BARS))
>>>> +		pci_bus_reset_done(bus);
>>>>  	pci_bus_add_devices(bus);
>>>>  
>>>>  	return max;
>>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>>> index 6925828f9f25..a8cb1a367c34 100644
>>>> --- a/include/linux/pci.h
>>>> +++ b/include/linux/pci.h
>>>> @@ -847,6 +847,7 @@ enum {
>>>>  	PCI_ENABLE_PROC_DOMAINS	= 0x00000010,	/* Enable domains in /proc */
>>>>  	PCI_COMPAT_DOMAIN_0	= 0x00000020,	/* ... except domain 0 */
>>>>  	PCI_SCAN_ALL_PCIE_DEVS	= 0x00000040,	/* Scan all, not just dev 0 */
>>>> +	PCI_MOVABLE_BARS	= 0x00000080,	/* Runtime BAR reassign after hotplug */
>>>>  };
>>>>  
>>>>  /* These external functions are only available when PCI support is enabled */
>>>> -- 
>>>> 2.17.1
>>>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2018-09-18 19:34 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-14 16:14 [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug Sergey Miroshnichenko
2018-09-14 16:14 ` [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan Sergey Miroshnichenko
2018-09-17  5:28   ` Sam Bobroff
2018-09-17 20:55     ` Sergey Miroshnichenko
2018-09-17 22:59       ` Bjorn Helgaas
2018-09-18 14:01         ` Sergey Miroshnichenko [this message]
2018-09-18 21:10           ` Bjorn Helgaas
2018-09-17 23:35       ` Oliver
2018-09-18 16:51         ` Sergey Miroshnichenko
2018-09-17 19:00   ` Rajat Jain
2018-09-17 19:38     ` Lukas Wunner
2018-09-17 19:44       ` Rajat Jain
2018-09-17 21:25     ` Sergey Miroshnichenko
2018-09-18 21:22       ` Rajat Jain
2019-01-11 17:24         ` Sergey Miroshnichenko
2018-09-14 16:14 ` [PATCH RFC 2/4] PCI: Release and reassign resources from the root " Sergey Miroshnichenko
2018-09-14 16:14 ` [PATCH RFC 3/4] PCI: Invalidate the released BAR resources Sergey Miroshnichenko
2018-09-14 16:14 ` [PATCH RFC 4/4] PCI: Fix writing invalid BARs during pci_restore_state() Sergey Miroshnichenko
2018-09-18 11:16 ` [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug David Laight
2018-09-18 17:07   ` Sergey Miroshnichenko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b3963047-e3c1-5612-1bcf-04c9e81dacd2@yadro.com \
    --to=s.miroshnichenko@yadro.com \
    --cc=benh@kernel.crashing.org \
    --cc=bhelgaas@google.com \
    --cc=helgaas@kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux@yadro.com \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=oliveroh@au1.ibm.com \
    --cc=ruscur@russell.cc \
    --cc=sbobroff@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).