linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug
@ 2018-09-14 16:14 Sergey Miroshnichenko
  2018-09-14 16:14 ` [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan Sergey Miroshnichenko
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Sergey Miroshnichenko @ 2018-09-14 16:14 UTC (permalink / raw)
  To: linux-pci; +Cc: Bjorn Helgaas, linux, Sergey Miroshnichenko

If the firmware or kernel has arranged memory for PCIe devices in a way that doesn't
provide enough space for BARs of a new hotplugged device, the kernel can pause the
drivers of the "obstructing" devices and move their BARs, so new BARs can fit into the
created hole.

When a driver is un-paused by the kernel after the PCIe rescan, it should check if its
BARs had changed and ioremap() them if needed.

BARs are moved not directly, but by releasing bridge resources, then sorting and
assigning them back, similarly to the initial PCIe topology scan during system boot
(when pci=realloc is passed).

Pausing drivers is performed via reset_prepare() and reset_done() callbacks of struct
pci_error_handlers. Drivers should pause during rescan not only because of potential
movement of their BARs, but also because of possible updating of the bridge windows.

This patchset is a part of our work on adding support for hotplugging bridges full of
NVME devices (without special requirement such as Hot-Plug Controller, reservation of
bus numbers and memory regions by firmware, etc.), should I also add here the patch that
adds support of moving BARs to the NVME driver?

Sergey Miroshnichenko (4):
  PCI: hotplug: Add parameter to put devices to reset during rescan
  PCI: Release and reassign resources from the root during rescan
  PCI: Invalidate the released BAR resources
  PCI: Fix writing invalid BARs during pci_restore_state()

 .../admin-guide/kernel-parameters.txt         |  6 ++
 drivers/pci/pci.c                             |  4 +-
 drivers/pci/pci.h                             |  8 ++
 drivers/pci/probe.c                           | 78 ++++++++++++++++++-
 drivers/pci/setup-bus.c                       | 33 +++++---
 include/linux/pci.h                           |  1 +
 6 files changed, 119 insertions(+), 11 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
  2018-09-14 16:14 [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug Sergey Miroshnichenko
@ 2018-09-14 16:14 ` Sergey Miroshnichenko
  2018-09-17  5:28   ` Sam Bobroff
  2018-09-17 19:00   ` Rajat Jain
  2018-09-14 16:14 ` [PATCH RFC 2/4] PCI: Release and reassign resources from the root " Sergey Miroshnichenko
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 20+ messages in thread
From: Sergey Miroshnichenko @ 2018-09-14 16:14 UTC (permalink / raw)
  To: linux-pci; +Cc: Bjorn Helgaas, linux, Sergey Miroshnichenko

Introduce a new command line option "pci=pcie_movable_bars" that indicates
support of PCIe hotplug without prior reservation of memory regions by
BIOS/bootloader.

If a new PCIe device has been hot-plugged between two active ones, which
have no (or not big enough) gap between their BARs, allocating new BARs
requires to move BARs of the following working devices:

1)                   dev 4
                       |
                       v
.. |  dev 3  |  dev 3  |  dev 5  |  dev 7  |
.. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |

2)                             dev 4
                                 |
                                 v
.. |  dev 3  |  dev 3  | -->           --> |  dev 5  |  dev 7  |
.. |  BAR 0  |  BAR 1  | -->           --> |  BAR 0  |  BAR 0  |

3)

.. |  dev 3  |  dev 3  |  dev 4  |  dev 4  |  dev 5  |  dev 7  |
.. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |

Not only BARs, but also bridge windows can be updated during a PCIe rescan,
threatening all memory transactions during this procedure, so the PCI
subsystem will instruct the drivers to pause by calling the reset_prepare()
and reset_done() callbacks.

If a device may be affected by BAR movement, the BAR changes tracking must
be implemented in its driver.

Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
---
 .../admin-guide/kernel-parameters.txt         |  6 +++
 drivers/pci/pci.c                             |  2 +
 drivers/pci/probe.c                           | 43 +++++++++++++++++++
 include/linux/pci.h                           |  1 +
 4 files changed, 52 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 64a3bf54b974..f8132a709061 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3311,6 +3311,12 @@
 				bridges without forcing it upstream. Note:
 				this removes isolation between devices and
 				may put more devices in an IOMMU group.
+		pcie_movable_bars	Arrange a space at runtime for BARs of
+				hotplugged PCIe devices - usable if bootloader
+				doesn't reserve memory regions for them. Freeing
+				a space may require moving BARs of active devices
+				to higher addresses, so device drivers will be
+				paused during rescan.
 
 	pcie_aspm=	[PCIE] Forcibly enable or disable PCIe Active State Power
 			Management.
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 1835f3a7aa8d..5f07a59b5924 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str)
 				pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
 			} else if (!strncmp(str, "disable_acs_redir=", 18)) {
 				disable_acs_redir_param = str + 18;
+			} else if (!strncmp(str, "pcie_movable_bars", 17)) {
+				pci_add_flags(PCI_MOVABLE_BARS);
 			} else {
 				printk(KERN_ERR "PCI: Unknown option `%s'\n",
 						str);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 201f9e5ff55c..bdaafc48dc4c 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
 	return max;
 }
 
+/*
+ * Put all devices of the bus and its children to reset
+ */
+static void pci_bus_reset_prepare(struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		struct pci_bus *child = dev->subordinate;
+
+		if (child) {
+			pci_bus_reset_prepare(child);
+		} else if (dev->driver &&
+			   dev->driver->err_handler &&
+			   dev->driver->err_handler->reset_prepare) {
+			dev->driver->err_handler->reset_prepare(dev);
+		}
+	}
+}
+
+/*
+ * Complete the reset of all devices for the bus and its children
+ */
+static void pci_bus_reset_done(struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		struct pci_bus *child = dev->subordinate;
+
+		if (child) {
+			pci_bus_reset_done(child);
+		} else if (dev->driver && dev->driver->err_handler &&
+			   dev->driver->err_handler->reset_done) {
+			dev->driver->err_handler->reset_done(dev);
+		}
+	}
+}
+
 /**
  * pci_rescan_bus - Scan a PCI bus for devices
  * @bus: PCI bus to scan
@@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
 {
 	unsigned int max;
 
+	if (pci_has_flag(PCI_MOVABLE_BARS))
+		pci_bus_reset_prepare(bus);
 	max = pci_scan_child_bus(bus);
 	pci_assign_unassigned_bus_resources(bus);
+	if (pci_has_flag(PCI_MOVABLE_BARS))
+		pci_bus_reset_done(bus);
 	pci_bus_add_devices(bus);
 
 	return max;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 6925828f9f25..a8cb1a367c34 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -847,6 +847,7 @@ enum {
 	PCI_ENABLE_PROC_DOMAINS	= 0x00000010,	/* Enable domains in /proc */
 	PCI_COMPAT_DOMAIN_0	= 0x00000020,	/* ... except domain 0 */
 	PCI_SCAN_ALL_PCIE_DEVS	= 0x00000040,	/* Scan all, not just dev 0 */
+	PCI_MOVABLE_BARS	= 0x00000080,	/* Runtime BAR reassign after hotplug */
 };
 
 /* These external functions are only available when PCI support is enabled */
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC 2/4] PCI: Release and reassign resources from the root during rescan
  2018-09-14 16:14 [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug Sergey Miroshnichenko
  2018-09-14 16:14 ` [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan Sergey Miroshnichenko
@ 2018-09-14 16:14 ` Sergey Miroshnichenko
  2018-09-14 16:14 ` [PATCH RFC 3/4] PCI: Invalidate the released BAR resources Sergey Miroshnichenko
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Sergey Miroshnichenko @ 2018-09-14 16:14 UTC (permalink / raw)
  To: linux-pci; +Cc: Bjorn Helgaas, linux, Sergey Miroshnichenko

If assigned resources don't contain holes to fit memory for hotplugged
devices, these conflicting resources must be freed, sorted and reassigned.

When resources are finally allocated and written to BARs, it's time to
update bridge windows with pci_setup_bridge().

Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
---
 drivers/pci/pci.h       |  8 ++++++++
 drivers/pci/probe.c     | 35 ++++++++++++++++++++++++++++++++++-
 drivers/pci/setup-bus.c | 12 ++++--------
 3 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 6e0d1528d471..cb157630f8d7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -224,6 +224,11 @@ enum pci_bar_type {
 	pci_bar_mem64,		/* A 64-bit memory BAR */
 };
 
+enum release_type {
+	leaf_only,
+	whole_subtree,
+};
+
 int pci_configure_extended_tags(struct pci_dev *dev, void *ign);
 bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *pl,
 				int crs_timeout);
@@ -240,6 +245,9 @@ void __pci_bus_size_bridges(struct pci_bus *bus,
 void __pci_bus_assign_resources(const struct pci_bus *bus,
 				struct list_head *realloc_head,
 				struct list_head *fail_head);
+void pci_bus_release_bridge_resources(struct pci_bus *bus,
+				      unsigned long type,
+				      enum release_type rel_type);
 bool pci_bus_clip_resource(struct pci_dev *dev, int idx);
 
 void pci_reassigndev_resource_alignment(struct pci_dev *dev);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index bdaafc48dc4c..2c2b853454c2 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3113,6 +3113,25 @@ struct pci_bus *pci_scan_bus(int bus, struct pci_ops *ops,
 }
 EXPORT_SYMBOL(pci_scan_bus);
 
+static void pci_setup_new_bridges(struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		struct pci_bus *child;
+
+		if (!pci_dev_is_added(dev))
+			continue;
+
+		child = dev->subordinate;
+		if (child)
+			pci_setup_new_bridges(child);
+	}
+
+	if (bus->self)
+		pci_setup_bridge(bus);
+}
+
 /**
  * pci_rescan_bus_bridge_resize - Scan a PCI bus for devices
  * @bridge: PCI bridge for the bus to scan
@@ -3189,11 +3208,25 @@ static void pci_bus_reset_done(struct pci_bus *bus)
 unsigned int pci_rescan_bus(struct pci_bus *bus)
 {
 	unsigned int max;
+	struct pci_bus *root = bus;
+
+	while (!pci_is_root_bus(root))
+		root = root->parent;
 
 	if (pci_has_flag(PCI_MOVABLE_BARS))
 		pci_bus_reset_prepare(bus);
 	max = pci_scan_child_bus(bus);
-	pci_assign_unassigned_bus_resources(bus);
+	if (pci_has_flag(PCI_MOVABLE_BARS)) {
+		pci_bus_release_bridge_resources(bus, IORESOURCE_IO, whole_subtree);
+		pci_bus_release_bridge_resources(bus, IORESOURCE_MEM, whole_subtree);
+		pci_bus_release_bridge_resources(bus,
+						 IORESOURCE_MEM_64 | IORESOURCE_PREFETCH,
+						 whole_subtree);
+		pci_assign_unassigned_root_bus_resources(root);
+	} else {
+		pci_assign_unassigned_bus_resources(bus);
+	}
+	pci_setup_new_bridges(bus);
 	if (pci_has_flag(PCI_MOVABLE_BARS))
 		pci_bus_reset_done(bus);
 	pci_bus_add_devices(bus);
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 79b1824e83b4..fd5675bb501f 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1244,7 +1244,7 @@ void __pci_bus_size_bridges(struct pci_bus *bus, struct list_head *realloc_head)
 
 	case PCI_CLASS_BRIDGE_PCI:
 		pci_bridge_check_ranges(bus);
-		if (bus->self->is_hotplug_bridge) {
+		if (bus->self->is_hotplug_bridge && !pci_has_flag(PCI_MOVABLE_BARS)) {
 			additional_io_size  = pci_hotplug_io_size;
 			additional_mem_size = pci_hotplug_mem_size;
 		}
@@ -1582,17 +1582,13 @@ static void pci_bridge_release_resources(struct pci_bus *bus,
 	}
 }
 
-enum release_type {
-	leaf_only,
-	whole_subtree,
-};
 /*
  * try to release pci bridge resources that is from leaf bridge,
  * so we can allocate big new one later
  */
-static void pci_bus_release_bridge_resources(struct pci_bus *bus,
-					     unsigned long type,
-					     enum release_type rel_type)
+void pci_bus_release_bridge_resources(struct pci_bus *bus,
+				      unsigned long type,
+				      enum release_type rel_type)
 {
 	struct pci_dev *dev;
 	bool is_leaf_bridge = true;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC 3/4] PCI: Invalidate the released BAR resources
  2018-09-14 16:14 [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug Sergey Miroshnichenko
  2018-09-14 16:14 ` [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan Sergey Miroshnichenko
  2018-09-14 16:14 ` [PATCH RFC 2/4] PCI: Release and reassign resources from the root " Sergey Miroshnichenko
@ 2018-09-14 16:14 ` Sergey Miroshnichenko
  2018-09-14 16:14 ` [PATCH RFC 4/4] PCI: Fix writing invalid BARs during pci_restore_state() Sergey Miroshnichenko
  2018-09-18 11:16 ` [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug David Laight
  4 siblings, 0 replies; 20+ messages in thread
From: Sergey Miroshnichenko @ 2018-09-14 16:14 UTC (permalink / raw)
  To: linux-pci; +Cc: Bjorn Helgaas, linux, Sergey Miroshnichenko

Otherwise after release_child_resources() there can be resources
with the IORESOURCE_STARTALIGN flag remaining, but with start
dropped to zero, that makes them not valid for re-assigning.

Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
---
 drivers/pci/setup-bus.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index fd5675bb501f..ec88461462b8 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1521,7 +1521,7 @@ static void __pci_bridge_assign_resources(const struct pci_dev *bridge,
 static void pci_bridge_release_resources(struct pci_bus *bus,
 					  unsigned long type)
 {
-	struct pci_dev *dev = bus->self;
+	struct pci_dev *dev = bus->self, *child_dev;
 	struct resource *r;
 	unsigned old_flags = 0;
 	struct resource *b_res;
@@ -1564,6 +1564,25 @@ static void pci_bridge_release_resources(struct pci_bus *bus,
 	 *  all
 	 */
 	release_child_resources(r);
+	list_for_each_entry(child_dev, &bus->devices, bus_list) {
+		int i;
+
+		if (child_dev->subordinate)
+			continue;
+
+		for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
+			struct resource *res = &child_dev->resource[i];
+			resource_size_t size = resource_size(res);
+
+			if (!res->flags)
+				continue;
+
+			res->end = size - 1;
+			res->start = 0;
+			res->flags &= ~IORESOURCE_STARTALIGN;
+			res->flags |= IORESOURCE_SIZEALIGN;
+		}
+	}
 	if (!release_resource(r)) {
 		type = old_flags = r->flags & PCI_RES_TYPE_MASK;
 		pci_printk(KERN_DEBUG, dev, "resource %d %pR released\n",
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC 4/4] PCI: Fix writing invalid BARs during pci_restore_state()
  2018-09-14 16:14 [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug Sergey Miroshnichenko
                   ` (2 preceding siblings ...)
  2018-09-14 16:14 ` [PATCH RFC 3/4] PCI: Invalidate the released BAR resources Sergey Miroshnichenko
@ 2018-09-14 16:14 ` Sergey Miroshnichenko
  2018-09-18 11:16 ` [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug David Laight
  4 siblings, 0 replies; 20+ messages in thread
From: Sergey Miroshnichenko @ 2018-09-14 16:14 UTC (permalink / raw)
  To: linux-pci; +Cc: Bjorn Helgaas, linux, Sergey Miroshnichenko

If BAR movement has happened (due to PCIe hotplug) after pci_save_state(),
the saved addresses will become outdated. Restore them the most recently
calculated values, not the ones stored in an arbitrary moment.

Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
---
 drivers/pci/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 5f07a59b5924..154130959443 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1328,7 +1328,7 @@ static void pci_restore_config_space(struct pci_dev *pdev)
 	if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
 		pci_restore_config_space_range(pdev, 10, 15, 0);
 		/* Restore BARs before the command register. */
-		pci_restore_config_space_range(pdev, 4, 9, 10);
+		pci_restore_bars(pdev);
 		pci_restore_config_space_range(pdev, 0, 3, 0);
 	} else {
 		pci_restore_config_space_range(pdev, 0, 15, 0);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
  2018-09-14 16:14 ` [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan Sergey Miroshnichenko
@ 2018-09-17  5:28   ` Sam Bobroff
  2018-09-17 20:55     ` Sergey Miroshnichenko
  2018-09-17 19:00   ` Rajat Jain
  1 sibling, 1 reply; 20+ messages in thread
From: Sam Bobroff @ 2018-09-17  5:28 UTC (permalink / raw)
  To: Sergey Miroshnichenko; +Cc: linux-pci, Bjorn Helgaas, linux

[-- Attachment #1: Type: text/plain, Size: 5788 bytes --]

Hi Sergey,

On Fri, Sep 14, 2018 at 07:14:01PM +0300, Sergey Miroshnichenko wrote:
> Introduce a new command line option "pci=pcie_movable_bars" that indicates
> support of PCIe hotplug without prior reservation of memory regions by
> BIOS/bootloader.
> 
> If a new PCIe device has been hot-plugged between two active ones, which
> have no (or not big enough) gap between their BARs, allocating new BARs
> requires to move BARs of the following working devices:
> 
> 1)                   dev 4
>                        |
>                        v
> .. |  dev 3  |  dev 3  |  dev 5  |  dev 7  |
> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
> 
> 2)                             dev 4
>                                  |
>                                  v
> .. |  dev 3  |  dev 3  | -->           --> |  dev 5  |  dev 7  |
> .. |  BAR 0  |  BAR 1  | -->           --> |  BAR 0  |  BAR 0  |
> 
> 3)
> 
> .. |  dev 3  |  dev 3  |  dev 4  |  dev 4  |  dev 5  |  dev 7  |
> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
> 
> Not only BARs, but also bridge windows can be updated during a PCIe rescan,
> threatening all memory transactions during this procedure, so the PCI
> subsystem will instruct the drivers to pause by calling the reset_prepare()
> and reset_done() callbacks.
> 
> If a device may be affected by BAR movement, the BAR changes tracking must
> be implemented in its driver.
> 
> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  6 +++
>  drivers/pci/pci.c                             |  2 +
>  drivers/pci/probe.c                           | 43 +++++++++++++++++++
>  include/linux/pci.h                           |  1 +
>  4 files changed, 52 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 64a3bf54b974..f8132a709061 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3311,6 +3311,12 @@
>  				bridges without forcing it upstream. Note:
>  				this removes isolation between devices and
>  				may put more devices in an IOMMU group.
> +		pcie_movable_bars	Arrange a space at runtime for BARs of
> +				hotplugged PCIe devices - usable if bootloader
> +				doesn't reserve memory regions for them. Freeing
> +				a space may require moving BARs of active devices
> +				to higher addresses, so device drivers will be
> +				paused during rescan.
>  
>  	pcie_aspm=	[PCIE] Forcibly enable or disable PCIe Active State Power
>  			Management.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 1835f3a7aa8d..5f07a59b5924 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str)
>  				pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>  			} else if (!strncmp(str, "disable_acs_redir=", 18)) {
>  				disable_acs_redir_param = str + 18;
> +			} else if (!strncmp(str, "pcie_movable_bars", 17)) {
> +				pci_add_flags(PCI_MOVABLE_BARS);
>  			} else {
>  				printk(KERN_ERR "PCI: Unknown option `%s'\n",
>  						str);
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 201f9e5ff55c..bdaafc48dc4c 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
>  	return max;
>  }
>  
> +/*
> + * Put all devices of the bus and its children to reset
> + */
> +static void pci_bus_reset_prepare(struct pci_bus *bus)
> +{
> +	struct pci_dev *dev;
> +
> +	list_for_each_entry(dev, &bus->devices, bus_list) {
> +		struct pci_bus *child = dev->subordinate;
> +
> +		if (child) {
> +			pci_bus_reset_prepare(child);
> +		} else if (dev->driver &&
> +			   dev->driver->err_handler &&
> +			   dev->driver->err_handler->reset_prepare) {
> +			dev->driver->err_handler->reset_prepare(dev);
> +		}

What about devices with drivers that don't have reset_prepare()?  It
looks like it will just reconfigure them anyway. Is that right?

> +	}
> +}
> +
> +/*
> + * Complete the reset of all devices for the bus and its children
> + */
> +static void pci_bus_reset_done(struct pci_bus *bus)
> +{
> +	struct pci_dev *dev;
> +
> +	list_for_each_entry(dev, &bus->devices, bus_list) {
> +		struct pci_bus *child = dev->subordinate;
> +
> +		if (child) {
> +			pci_bus_reset_done(child);
> +		} else if (dev->driver && dev->driver->err_handler &&
> +			   dev->driver->err_handler->reset_done) {
> +			dev->driver->err_handler->reset_done(dev);
> +		}
> +	}
> +}
> +
>  /**
>   * pci_rescan_bus - Scan a PCI bus for devices
>   * @bus: PCI bus to scan
> @@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
>  {
>  	unsigned int max;
>  
> +	if (pci_has_flag(PCI_MOVABLE_BARS))
> +		pci_bus_reset_prepare(bus);
>  	max = pci_scan_child_bus(bus);
>  	pci_assign_unassigned_bus_resources(bus);
> +	if (pci_has_flag(PCI_MOVABLE_BARS))
> +		pci_bus_reset_done(bus);
>  	pci_bus_add_devices(bus);
>  
>  	return max;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 6925828f9f25..a8cb1a367c34 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -847,6 +847,7 @@ enum {
>  	PCI_ENABLE_PROC_DOMAINS	= 0x00000010,	/* Enable domains in /proc */
>  	PCI_COMPAT_DOMAIN_0	= 0x00000020,	/* ... except domain 0 */
>  	PCI_SCAN_ALL_PCIE_DEVS	= 0x00000040,	/* Scan all, not just dev 0 */
> +	PCI_MOVABLE_BARS	= 0x00000080,	/* Runtime BAR reassign after hotplug */
>  };
>  
>  /* These external functions are only available when PCI support is enabled */
> -- 
> 2.17.1
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
  2018-09-14 16:14 ` [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan Sergey Miroshnichenko
  2018-09-17  5:28   ` Sam Bobroff
@ 2018-09-17 19:00   ` Rajat Jain
  2018-09-17 19:38     ` Lukas Wunner
  2018-09-17 21:25     ` Sergey Miroshnichenko
  1 sibling, 2 replies; 20+ messages in thread
From: Rajat Jain @ 2018-09-17 19:00 UTC (permalink / raw)
  To: s.miroshnichenko; +Cc: linux-pci, Bjorn Helgaas, linux

On Fri, Sep 14, 2018 at 9:21 AM Sergey Miroshnichenko
<s.miroshnichenko@yadro.com> wrote:
>
> Introduce a new command line option "pci=pcie_movable_bars" that indicates
> support of PCIe hotplug without prior reservation of memory regions by
> BIOS/bootloader.
>
> If a new PCIe device has been hot-plugged between two active ones, which
> have no (or not big enough) gap between their BARs, allocating new BARs
> requires to move BARs of the following working devices:

I think this is a great problem to solve. I have some questions (not
objections, but trying to understand what are the limitations of the
solution being proposed):

* What about hot-pluggable root ports? Would this help (I'm guessing not)?

* What about situations where the root port itself does not have
enough resources for the new device being inserted. I'm guessing we
are not going to expand root port allocation in those cases. But do we
fail gracefully rejecting the hotplug by not assigning it resources,
or do we manage to screw up the already working healthy devices (while
attempting to move them)?

* What about non-memory resources? E.g. cards may have pci bridges or
switches on them, and they may need extra PCI bus numbers. Does this
help that use case?
>
> 1)                   dev 4
>                        |
>                        v
> .. |  dev 3  |  dev 3  |  dev 5  |  dev 7  |
> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>
> 2)                             dev 4
>                                  |
>                                  v
> .. |  dev 3  |  dev 3  | -->           --> |  dev 5  |  dev 7  |
> .. |  BAR 0  |  BAR 1  | -->           --> |  BAR 0  |  BAR 0  |
>
> 3)
>
> .. |  dev 3  |  dev 3  |  dev 4  |  dev 4  |  dev 5  |  dev 7  |
> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>
> Not only BARs, but also bridge windows can be updated during a PCIe rescan,
> threatening all memory transactions during this procedure, so the PCI
> subsystem will instruct the drivers to pause by calling the reset_prepare()
> and reset_done() callbacks.
>
> If a device may be affected by BAR movement, the BAR changes tracking must
> be implemented in its driver.

This is quite a big thing right? I expect that this will break a lot
of drivers because
(a) they may not have reset_prepare() and reset_done() callbacks (I
grepped in the sources and seems only 4 support it?).
(b) Now, it is expected that in the reset_done() the drivers should
re-read the BAR and remap any memory resources that may have changed.
Note this may also include cleaning up any existing resource mappings
that they had before they were paused. Not sure if this was the
semantics of reset_done() already, or may be it was, I'm just not
sure.

Thanks,

Rajat

>
> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  6 +++
>  drivers/pci/pci.c                             |  2 +
>  drivers/pci/probe.c                           | 43 +++++++++++++++++++
>  include/linux/pci.h                           |  1 +
>  4 files changed, 52 insertions(+)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 64a3bf54b974..f8132a709061 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3311,6 +3311,12 @@
>                                 bridges without forcing it upstream. Note:
>                                 this removes isolation between devices and
>                                 may put more devices in an IOMMU group.
> +               pcie_movable_bars       Arrange a space at runtime for BARs of
> +                               hotplugged PCIe devices - usable if bootloader
> +                               doesn't reserve memory regions for them. Freeing
> +                               a space may require moving BARs of active devices
> +                               to higher addresses, so device drivers will be
> +                               paused during rescan.
>
>         pcie_aspm=      [PCIE] Forcibly enable or disable PCIe Active State Power
>                         Management.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 1835f3a7aa8d..5f07a59b5924 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str)
>                                 pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>                         } else if (!strncmp(str, "disable_acs_redir=", 18)) {
>                                 disable_acs_redir_param = str + 18;
> +                       } else if (!strncmp(str, "pcie_movable_bars", 17)) {
> +                               pci_add_flags(PCI_MOVABLE_BARS);
>                         } else {
>                                 printk(KERN_ERR "PCI: Unknown option `%s'\n",
>                                                 str);
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 201f9e5ff55c..bdaafc48dc4c 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
>         return max;
>  }
>
> +/*
> + * Put all devices of the bus and its children to reset
> + */
> +static void pci_bus_reset_prepare(struct pci_bus *bus)
> +{
> +       struct pci_dev *dev;
> +
> +       list_for_each_entry(dev, &bus->devices, bus_list) {
> +               struct pci_bus *child = dev->subordinate;
> +
> +               if (child) {
> +                       pci_bus_reset_prepare(child);
> +               } else if (dev->driver &&
> +                          dev->driver->err_handler &&
> +                          dev->driver->err_handler->reset_prepare) {
> +                       dev->driver->err_handler->reset_prepare(dev);
> +               }
> +       }
> +}
> +
> +/*
> + * Complete the reset of all devices for the bus and its children
> + */
> +static void pci_bus_reset_done(struct pci_bus *bus)
> +{
> +       struct pci_dev *dev;
> +
> +       list_for_each_entry(dev, &bus->devices, bus_list) {
> +               struct pci_bus *child = dev->subordinate;
> +
> +               if (child) {
> +                       pci_bus_reset_done(child);
> +               } else if (dev->driver && dev->driver->err_handler &&
> +                          dev->driver->err_handler->reset_done) {
> +                       dev->driver->err_handler->reset_done(dev);
> +               }
> +       }
> +}
> +
>  /**
>   * pci_rescan_bus - Scan a PCI bus for devices
>   * @bus: PCI bus to scan
> @@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
>  {
>         unsigned int max;
>
> +       if (pci_has_flag(PCI_MOVABLE_BARS))
> +               pci_bus_reset_prepare(bus);
>         max = pci_scan_child_bus(bus);
>         pci_assign_unassigned_bus_resources(bus);
> +       if (pci_has_flag(PCI_MOVABLE_BARS))
> +               pci_bus_reset_done(bus);
>         pci_bus_add_devices(bus);
>
>         return max;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 6925828f9f25..a8cb1a367c34 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -847,6 +847,7 @@ enum {
>         PCI_ENABLE_PROC_DOMAINS = 0x00000010,   /* Enable domains in /proc */
>         PCI_COMPAT_DOMAIN_0     = 0x00000020,   /* ... except domain 0 */
>         PCI_SCAN_ALL_PCIE_DEVS  = 0x00000040,   /* Scan all, not just dev 0 */
> +       PCI_MOVABLE_BARS        = 0x00000080,   /* Runtime BAR reassign after hotplug */
>  };
>
>  /* These external functions are only available when PCI support is enabled */
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
  2018-09-17 19:00   ` Rajat Jain
@ 2018-09-17 19:38     ` Lukas Wunner
  2018-09-17 19:44       ` Rajat Jain
  2018-09-17 21:25     ` Sergey Miroshnichenko
  1 sibling, 1 reply; 20+ messages in thread
From: Lukas Wunner @ 2018-09-17 19:38 UTC (permalink / raw)
  To: Rajat Jain; +Cc: s.miroshnichenko, linux-pci, Bjorn Helgaas, linux

On Mon, Sep 17, 2018 at 12:00:22PM -0700, Rajat Jain wrote:
> On Fri, Sep 14, 2018 at 9:21 AM Sergey Miroshnichenko <s.miroshnichenko@yadro.com> wrote:
> > If a new PCIe device has been hot-plugged between two active ones, which
> > have no (or not big enough) gap between their BARs, allocating new BARs
> > requires to move BARs of the following working devices:
> 
> * What about non-memory resources? E.g. cards may have pci bridges or
> switches on them, and they may need extra PCI bus numbers. Does this
> help that use case?

FWIW, macOS has had a "PCI Pause" functionality since 2013, documented here:
(the anchor is apparently overridden by Javascript, scroll down to
"Supporting PCIe Pause")

https://developer.apple.com/library/archive/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/Basics02/Basics02.html#//apple_ref/doc/uid/TP40011138-CH4-SW14

In addition to memory resources, they also reallocate bus numbers and MSIs.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
  2018-09-17 19:38     ` Lukas Wunner
@ 2018-09-17 19:44       ` Rajat Jain
  0 siblings, 0 replies; 20+ messages in thread
From: Rajat Jain @ 2018-09-17 19:44 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: s.miroshnichenko, linux-pci, Bjorn Helgaas, linux

On Mon, Sep 17, 2018 at 12:38 PM Lukas Wunner <lukas@wunner.de> wrote:
>
> On Mon, Sep 17, 2018 at 12:00:22PM -0700, Rajat Jain wrote:
> > On Fri, Sep 14, 2018 at 9:21 AM Sergey Miroshnichenko <s.miroshnichenko@yadro.com> wrote:
> > > If a new PCIe device has been hot-plugged between two active ones, which
> > > have no (or not big enough) gap between their BARs, allocating new BARs
> > > requires to move BARs of the following working devices:
> >
> > * What about non-memory resources? E.g. cards may have pci bridges or
> > switches on them, and they may need extra PCI bus numbers. Does this
> > help that use case?
>
> FWIW, macOS has had a "PCI Pause" functionality since 2013, documented here:
> (the anchor is apparently overridden by Javascript, scroll down to
> "Supporting PCIe Pause")
>
> https://developer.apple.com/library/archive/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/Basics02/Basics02.html#//apple_ref/doc/uid/TP40011138-CH4-SW14
>
> In addition to memory resources, they also reallocate bus numbers and MSIs.

Thanks, that makes me wonder that would also mean that now the user
space / UDEV may need to be notified (because some one may have taken
a note of the PCI B:D:F). Also, the UIO drivers, ugh....

Thanks,

Rajat

>
> Thanks,
>
> Lukas

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
  2018-09-17  5:28   ` Sam Bobroff
@ 2018-09-17 20:55     ` Sergey Miroshnichenko
  2018-09-17 22:59       ` Bjorn Helgaas
  2018-09-17 23:35       ` Oliver
  0 siblings, 2 replies; 20+ messages in thread
From: Sergey Miroshnichenko @ 2018-09-17 20:55 UTC (permalink / raw)
  To: Sam Bobroff; +Cc: linux-pci, Bjorn Helgaas, linux

Hello Sam,

On 9/17/18 8:28 AM, Sam Bobroff wrote:
> Hi Sergey,
> 
> On Fri, Sep 14, 2018 at 07:14:01PM +0300, Sergey Miroshnichenko wrote:
>> Introduce a new command line option "pci=pcie_movable_bars" that indicates
>> support of PCIe hotplug without prior reservation of memory regions by
>> BIOS/bootloader.
>>
>> If a new PCIe device has been hot-plugged between two active ones, which
>> have no (or not big enough) gap between their BARs, allocating new BARs
>> requires to move BARs of the following working devices:
>>
>> 1)                   dev 4
>>                        |
>>                        v
>> .. |  dev 3  |  dev 3  |  dev 5  |  dev 7  |
>> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>>
>> 2)                             dev 4
>>                                  |
>>                                  v
>> .. |  dev 3  |  dev 3  | -->           --> |  dev 5  |  dev 7  |
>> .. |  BAR 0  |  BAR 1  | -->           --> |  BAR 0  |  BAR 0  |
>>
>> 3)
>>
>> .. |  dev 3  |  dev 3  |  dev 4  |  dev 4  |  dev 5  |  dev 7  |
>> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>>
>> Not only BARs, but also bridge windows can be updated during a PCIe rescan,
>> threatening all memory transactions during this procedure, so the PCI
>> subsystem will instruct the drivers to pause by calling the reset_prepare()
>> and reset_done() callbacks.
>>
>> If a device may be affected by BAR movement, the BAR changes tracking must
>> be implemented in its driver.
>>
>> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
>> ---
>>  .../admin-guide/kernel-parameters.txt         |  6 +++
>>  drivers/pci/pci.c                             |  2 +
>>  drivers/pci/probe.c                           | 43 +++++++++++++++++++
>>  include/linux/pci.h                           |  1 +
>>  4 files changed, 52 insertions(+)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index 64a3bf54b974..f8132a709061 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -3311,6 +3311,12 @@
>>  				bridges without forcing it upstream. Note:
>>  				this removes isolation between devices and
>>  				may put more devices in an IOMMU group.
>> +		pcie_movable_bars	Arrange a space at runtime for BARs of
>> +				hotplugged PCIe devices - usable if bootloader
>> +				doesn't reserve memory regions for them. Freeing
>> +				a space may require moving BARs of active devices
>> +				to higher addresses, so device drivers will be
>> +				paused during rescan.
>>  
>>  	pcie_aspm=	[PCIE] Forcibly enable or disable PCIe Active State Power
>>  			Management.
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 1835f3a7aa8d..5f07a59b5924 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str)
>>  				pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>>  			} else if (!strncmp(str, "disable_acs_redir=", 18)) {
>>  				disable_acs_redir_param = str + 18;
>> +			} else if (!strncmp(str, "pcie_movable_bars", 17)) {
>> +				pci_add_flags(PCI_MOVABLE_BARS);
>>  			} else {
>>  				printk(KERN_ERR "PCI: Unknown option `%s'\n",
>>  						str);
>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>> index 201f9e5ff55c..bdaafc48dc4c 100644
>> --- a/drivers/pci/probe.c
>> +++ b/drivers/pci/probe.c
>> @@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
>>  	return max;
>>  }
>>  
>> +/*
>> + * Put all devices of the bus and its children to reset
>> + */
>> +static void pci_bus_reset_prepare(struct pci_bus *bus)
>> +{
>> +	struct pci_dev *dev;
>> +
>> +	list_for_each_entry(dev, &bus->devices, bus_list) {
>> +		struct pci_bus *child = dev->subordinate;
>> +
>> +		if (child) {
>> +			pci_bus_reset_prepare(child);
>> +		} else if (dev->driver &&
>> +			   dev->driver->err_handler &&
>> +			   dev->driver->err_handler->reset_prepare) {
>> +			dev->driver->err_handler->reset_prepare(dev);
>> +		}
> 
> What about devices with drivers that don't have reset_prepare()?  It
> looks like it will just reconfigure them anyway. Is that right?
> 

It is possible that unprepared driver without these hooks will get BARs
moved, I should put a warning message there. There three ways we can see
to make this safe:
 - add the reset_prepare()/reset_done() hooks to *every* PCIe driver;
 - refuse BAR movement if at least one unprepared driver has been
encountered during rescan;
 - reduce the number of drivers which can be affected to some
controllable value and prepare them on demand.

Applying the second proposal as a major restriction seems fairly
reasonable, but for our particular setups and use-cases it is probably
too stiff:
 - we've noticed that devices connected directly to the root bridge
don't get moved BARs, and this covers our x86_64 servers: we only
insert/remove devices into "second-level" and "lower" bridges there, but
not root;
 - on PowerNV we have system devices (network interfaces, USB hub, etc.)
grouped into dedicated domain, with all other domains ready for hotplug,
and only these domains can be rescanned.

With our scenarios currently reduced to these two, we can live with just
a few drivers "prepared" for now: NVME and few Ethernet adapters, this
gives us a possibility to use this feature before "converting" *all* the
drivers, and even have the NVidia cards running on a closed proprietary
driver.

Should we make this behavior adjustable with something like
"pcie_movable_bars=safe" and "pcie_movable_bars=always" ?

Thanks!

Best regards,
Serge

>> +	}
>> +}
>> +
>> +/*
>> + * Complete the reset of all devices for the bus and its children
>> + */
>> +static void pci_bus_reset_done(struct pci_bus *bus)
>> +{
>> +	struct pci_dev *dev;
>> +
>> +	list_for_each_entry(dev, &bus->devices, bus_list) {
>> +		struct pci_bus *child = dev->subordinate;
>> +
>> +		if (child) {
>> +			pci_bus_reset_done(child);
>> +		} else if (dev->driver && dev->driver->err_handler &&
>> +			   dev->driver->err_handler->reset_done) {
>> +			dev->driver->err_handler->reset_done(dev);
>> +		}
>> +	}
>> +}
>> +
>>  /**
>>   * pci_rescan_bus - Scan a PCI bus for devices
>>   * @bus: PCI bus to scan
>> @@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
>>  {
>>  	unsigned int max;
>>  
>> +	if (pci_has_flag(PCI_MOVABLE_BARS))
>> +		pci_bus_reset_prepare(bus);
>>  	max = pci_scan_child_bus(bus);
>>  	pci_assign_unassigned_bus_resources(bus);
>> +	if (pci_has_flag(PCI_MOVABLE_BARS))
>> +		pci_bus_reset_done(bus);
>>  	pci_bus_add_devices(bus);
>>  
>>  	return max;
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 6925828f9f25..a8cb1a367c34 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -847,6 +847,7 @@ enum {
>>  	PCI_ENABLE_PROC_DOMAINS	= 0x00000010,	/* Enable domains in /proc */
>>  	PCI_COMPAT_DOMAIN_0	= 0x00000020,	/* ... except domain 0 */
>>  	PCI_SCAN_ALL_PCIE_DEVS	= 0x00000040,	/* Scan all, not just dev 0 */
>> +	PCI_MOVABLE_BARS	= 0x00000080,	/* Runtime BAR reassign after hotplug */
>>  };
>>  
>>  /* These external functions are only available when PCI support is enabled */
>> -- 
>> 2.17.1
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
  2018-09-17 19:00   ` Rajat Jain
  2018-09-17 19:38     ` Lukas Wunner
@ 2018-09-17 21:25     ` Sergey Miroshnichenko
  2018-09-18 21:22       ` Rajat Jain
  1 sibling, 1 reply; 20+ messages in thread
From: Sergey Miroshnichenko @ 2018-09-17 21:25 UTC (permalink / raw)
  To: Rajat Jain; +Cc: linux-pci, Bjorn Helgaas, linux

Hello Rajat,

On 9/17/18 10:00 PM, Rajat Jain wrote:
> On Fri, Sep 14, 2018 at 9:21 AM Sergey Miroshnichenko
> <s.miroshnichenko@yadro.com> wrote:
>>
>> Introduce a new command line option "pci=pcie_movable_bars" that indicates
>> support of PCIe hotplug without prior reservation of memory regions by
>> BIOS/bootloader.
>>
>> If a new PCIe device has been hot-plugged between two active ones, which
>> have no (or not big enough) gap between their BARs, allocating new BARs
>> requires to move BARs of the following working devices:
> 
> I think this is a great problem to solve. I have some questions (not
> objections, but trying to understand what are the limitations of the
> solution being proposed):
> 
> * What about hot-pluggable root ports? Would this help (I'm guessing not)?
> 

We don't have a hardware that is able to handle a hotplug directly into
a root port (but we'll find) - there are intermediate bridges always, so
I can't provide proofs right now. But after a thought experiment I can't
see any obstacles, so just need to verify.

> * What about situations where the root port itself does not have
> enough resources for the new device being inserted. I'm guessing we
> are not going to expand root port allocation in those cases. But do we
> fail gracefully rejecting the hotplug by not assigning it resources,
> or do we manage to screw up the already working healthy devices (while
> attempting to move them)?
> 

Thanks for a great point! The code currently is too simplistic to handle
this case properly, and though it never happened to us yet, but I'm sure
it can kick off a working device to use its resources for a hotplugged
one. The struct resource doesn't have a "priority" field, so probably I
need to check the status of the resource allocation procedure and retry
it without some of new hotplugged devices in case of fail.

> * What about non-memory resources? E.g. cards may have pci bridges or
> switches on them, and they may need extra PCI bus numbers. Does this
> help that use case?

Actually, this patchset is a part of our bigger work on hotplugging
bridges full of bridges full of NVME and GPU devices. And in case of
success here I was planning to submit the last part (hotplugging
bridges), which includes the movement of bus numbers - similar to BARs,
but more awkward, with PCIe devices renaming and re-building the
/sys/bus/pci and /proc/pci, otherwise the lspci tool gets mad.

By device renaming I mean the "/dev/nvme0n1p1" and "eth0" will remain
the same, but underlying "nvme 0022:04:00.0" and "tg3 0001:06:00.1" may
change. In this regard I don't know which UDEV action is the best here -
it's neither "add", "remove" nor "change".


Also I've just realized that should take a closer look and check if MSIs
are also needs to be reallocated.

>>
>> 1)                   dev 4
>>                        |
>>                        v
>> .. |  dev 3  |  dev 3  |  dev 5  |  dev 7  |
>> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>>
>> 2)                             dev 4
>>                                  |
>>                                  v
>> .. |  dev 3  |  dev 3  | -->           --> |  dev 5  |  dev 7  |
>> .. |  BAR 0  |  BAR 1  | -->           --> |  BAR 0  |  BAR 0  |
>>
>> 3)
>>
>> .. |  dev 3  |  dev 3  |  dev 4  |  dev 4  |  dev 5  |  dev 7  |
>> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>>
>> Not only BARs, but also bridge windows can be updated during a PCIe rescan,
>> threatening all memory transactions during this procedure, so the PCI
>> subsystem will instruct the drivers to pause by calling the reset_prepare()
>> and reset_done() callbacks.
>>
>> If a device may be affected by BAR movement, the BAR changes tracking must
>> be implemented in its driver.
> 
> This is quite a big thing right? I expect that this will break a lot
> of drivers because
> (a) they may not have reset_prepare() and reset_done() callbacks (I
> grepped in the sources and seems only 4 support it?).
> (b) Now, it is expected that in the reset_done() the drivers should
> re-read the BAR and remap any memory resources that may have changed.
> Note this may also include cleaning up any existing resource mappings
> that they had before they were paused. Not sure if this was the
> semantics of reset_done() already, or may be it was, I'm just not
> sure.
> 

Yes, this is the most painful part: all the drivers that may be affected
with BARs getting moved, must be prepared in this way. Luckily, these
operations have distinct patterns, just like power management callbacks
like runtime_suspend()/runtime_resume() and friends.

Just a few minutes earlier in this thread I've wrote to Sam why we think
this approach still can be used even with a minimal subset of drivers
being prepared: for now we've limited the configurations of hotplug
usage to non-root bridges and to domains with similar devices. If this
concept looks viable to the community, we'll be happy to share the
patches for the nvme, ixgbe and tg3 drivers (and a bit later - for
xhci). Right now these are the only drivers _we_ need on our ppc64
(PowerNV) and x86_64 servers.

> Thanks,
> 
> Rajat
> 

Thanks for the review!

Best regards,
Serge

>>
>> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
>> ---
>>  .../admin-guide/kernel-parameters.txt         |  6 +++
>>  drivers/pci/pci.c                             |  2 +
>>  drivers/pci/probe.c                           | 43 +++++++++++++++++++
>>  include/linux/pci.h                           |  1 +
>>  4 files changed, 52 insertions(+)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index 64a3bf54b974..f8132a709061 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -3311,6 +3311,12 @@
>>                                 bridges without forcing it upstream. Note:
>>                                 this removes isolation between devices and
>>                                 may put more devices in an IOMMU group.
>> +               pcie_movable_bars       Arrange a space at runtime for BARs of
>> +                               hotplugged PCIe devices - usable if bootloader
>> +                               doesn't reserve memory regions for them. Freeing
>> +                               a space may require moving BARs of active devices
>> +                               to higher addresses, so device drivers will be
>> +                               paused during rescan.
>>
>>         pcie_aspm=      [PCIE] Forcibly enable or disable PCIe Active State Power
>>                         Management.
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 1835f3a7aa8d..5f07a59b5924 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str)
>>                                 pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>>                         } else if (!strncmp(str, "disable_acs_redir=", 18)) {
>>                                 disable_acs_redir_param = str + 18;
>> +                       } else if (!strncmp(str, "pcie_movable_bars", 17)) {
>> +                               pci_add_flags(PCI_MOVABLE_BARS);
>>                         } else {
>>                                 printk(KERN_ERR "PCI: Unknown option `%s'\n",
>>                                                 str);
>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>> index 201f9e5ff55c..bdaafc48dc4c 100644
>> --- a/drivers/pci/probe.c
>> +++ b/drivers/pci/probe.c
>> @@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
>>         return max;
>>  }
>>
>> +/*
>> + * Put all devices of the bus and its children to reset
>> + */
>> +static void pci_bus_reset_prepare(struct pci_bus *bus)
>> +{
>> +       struct pci_dev *dev;
>> +
>> +       list_for_each_entry(dev, &bus->devices, bus_list) {
>> +               struct pci_bus *child = dev->subordinate;
>> +
>> +               if (child) {
>> +                       pci_bus_reset_prepare(child);
>> +               } else if (dev->driver &&
>> +                          dev->driver->err_handler &&
>> +                          dev->driver->err_handler->reset_prepare) {
>> +                       dev->driver->err_handler->reset_prepare(dev);
>> +               }
>> +       }
>> +}
>> +
>> +/*
>> + * Complete the reset of all devices for the bus and its children
>> + */
>> +static void pci_bus_reset_done(struct pci_bus *bus)
>> +{
>> +       struct pci_dev *dev;
>> +
>> +       list_for_each_entry(dev, &bus->devices, bus_list) {
>> +               struct pci_bus *child = dev->subordinate;
>> +
>> +               if (child) {
>> +                       pci_bus_reset_done(child);
>> +               } else if (dev->driver && dev->driver->err_handler &&
>> +                          dev->driver->err_handler->reset_done) {
>> +                       dev->driver->err_handler->reset_done(dev);
>> +               }
>> +       }
>> +}
>> +
>>  /**
>>   * pci_rescan_bus - Scan a PCI bus for devices
>>   * @bus: PCI bus to scan
>> @@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
>>  {
>>         unsigned int max;
>>
>> +       if (pci_has_flag(PCI_MOVABLE_BARS))
>> +               pci_bus_reset_prepare(bus);
>>         max = pci_scan_child_bus(bus);
>>         pci_assign_unassigned_bus_resources(bus);
>> +       if (pci_has_flag(PCI_MOVABLE_BARS))
>> +               pci_bus_reset_done(bus);
>>         pci_bus_add_devices(bus);
>>
>>         return max;
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 6925828f9f25..a8cb1a367c34 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -847,6 +847,7 @@ enum {
>>         PCI_ENABLE_PROC_DOMAINS = 0x00000010,   /* Enable domains in /proc */
>>         PCI_COMPAT_DOMAIN_0     = 0x00000020,   /* ... except domain 0 */
>>         PCI_SCAN_ALL_PCIE_DEVS  = 0x00000040,   /* Scan all, not just dev 0 */
>> +       PCI_MOVABLE_BARS        = 0x00000080,   /* Runtime BAR reassign after hotplug */
>>  };
>>
>>  /* These external functions are only available when PCI support is enabled */
>> --
>> 2.17.1
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
  2018-09-17 20:55     ` Sergey Miroshnichenko
@ 2018-09-17 22:59       ` Bjorn Helgaas
  2018-09-18 14:01         ` Sergey Miroshnichenko
  2018-09-17 23:35       ` Oliver
  1 sibling, 1 reply; 20+ messages in thread
From: Bjorn Helgaas @ 2018-09-17 22:59 UTC (permalink / raw)
  To: Sergey Miroshnichenko
  Cc: Sam Bobroff, linux-pci, Bjorn Helgaas, linux, Russell Currey,
	linuxppc-dev, Benjamin Herrenschmidt, Oliver OHalloran

[+cc Russell, Ben, Oliver, linuxppc-dev]

On Mon, Sep 17, 2018 at 11:55:43PM +0300, Sergey Miroshnichenko wrote:
> Hello Sam,
> 
> On 9/17/18 8:28 AM, Sam Bobroff wrote:
> > Hi Sergey,
> > 
> > On Fri, Sep 14, 2018 at 07:14:01PM +0300, Sergey Miroshnichenko wrote:
> >> Introduce a new command line option "pci=pcie_movable_bars" that indicates
> >> support of PCIe hotplug without prior reservation of memory regions by
> >> BIOS/bootloader.
> >>
> >> If a new PCIe device has been hot-plugged between two active ones, which
> >> have no (or not big enough) gap between their BARs, allocating new BARs
> >> requires to move BARs of the following working devices:
> >>
> >> 1)                   dev 4
> >>                        |
> >>                        v
> >> .. |  dev 3  |  dev 3  |  dev 5  |  dev 7  |
> >> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
> >>
> >> 2)                             dev 4
> >>                                  |
> >>                                  v
> >> .. |  dev 3  |  dev 3  | -->           --> |  dev 5  |  dev 7  |
> >> .. |  BAR 0  |  BAR 1  | -->           --> |  BAR 0  |  BAR 0  |
> >>
> >> 3)
> >>
> >> .. |  dev 3  |  dev 3  |  dev 4  |  dev 4  |  dev 5  |  dev 7  |
> >> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
> >>
> >> Not only BARs, but also bridge windows can be updated during a PCIe rescan,
> >> threatening all memory transactions during this procedure, so the PCI
> >> subsystem will instruct the drivers to pause by calling the reset_prepare()
> >> and reset_done() callbacks.
> >>
> >> If a device may be affected by BAR movement, the BAR changes tracking must
> >> be implemented in its driver.
> >>
> >> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
> >> ---
> >>  .../admin-guide/kernel-parameters.txt         |  6 +++
> >>  drivers/pci/pci.c                             |  2 +
> >>  drivers/pci/probe.c                           | 43 +++++++++++++++++++
> >>  include/linux/pci.h                           |  1 +
> >>  4 files changed, 52 insertions(+)
> >>
> >> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> >> index 64a3bf54b974..f8132a709061 100644
> >> --- a/Documentation/admin-guide/kernel-parameters.txt
> >> +++ b/Documentation/admin-guide/kernel-parameters.txt
> >> @@ -3311,6 +3311,12 @@
> >>  				bridges without forcing it upstream. Note:
> >>  				this removes isolation between devices and
> >>  				may put more devices in an IOMMU group.
> >> +		pcie_movable_bars	Arrange a space at runtime for BARs of
> >> +				hotplugged PCIe devices - usable if bootloader
> >> +				doesn't reserve memory regions for them. Freeing
> >> +				a space may require moving BARs of active devices
> >> +				to higher addresses, so device drivers will be
> >> +				paused during rescan.
> >>  
> >>  	pcie_aspm=	[PCIE] Forcibly enable or disable PCIe Active State Power
> >>  			Management.
> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> >> index 1835f3a7aa8d..5f07a59b5924 100644
> >> --- a/drivers/pci/pci.c
> >> +++ b/drivers/pci/pci.c
> >> @@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str)
> >>  				pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
> >>  			} else if (!strncmp(str, "disable_acs_redir=", 18)) {
> >>  				disable_acs_redir_param = str + 18;
> >> +			} else if (!strncmp(str, "pcie_movable_bars", 17)) {
> >> +				pci_add_flags(PCI_MOVABLE_BARS);
> >>  			} else {
> >>  				printk(KERN_ERR "PCI: Unknown option `%s'\n",
> >>  						str);
> >> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> >> index 201f9e5ff55c..bdaafc48dc4c 100644
> >> --- a/drivers/pci/probe.c
> >> +++ b/drivers/pci/probe.c
> >> @@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
> >>  	return max;
> >>  }
> >>  
> >> +/*
> >> + * Put all devices of the bus and its children to reset
> >> + */
> >> +static void pci_bus_reset_prepare(struct pci_bus *bus)
> >> +{
> >> +	struct pci_dev *dev;
> >> +
> >> +	list_for_each_entry(dev, &bus->devices, bus_list) {
> >> +		struct pci_bus *child = dev->subordinate;
> >> +
> >> +		if (child) {
> >> +			pci_bus_reset_prepare(child);
> >> +		} else if (dev->driver &&
> >> +			   dev->driver->err_handler &&
> >> +			   dev->driver->err_handler->reset_prepare) {
> >> +			dev->driver->err_handler->reset_prepare(dev);
> >> +		}
> > 
> > What about devices with drivers that don't have reset_prepare()?  It
> > looks like it will just reconfigure them anyway. Is that right?
> > 
> 
> It is possible that unprepared driver without these hooks will get BARs
> moved, I should put a warning message there. There three ways we can see
> to make this safe:
>  - add the reset_prepare()/reset_done() hooks to *every* PCIe driver;
>  - refuse BAR movement if at least one unprepared driver has been
> encountered during rescan;
>  - reduce the number of drivers which can be affected to some
> controllable value and prepare them on demand.
> 
> Applying the second proposal as a major restriction seems fairly
> reasonable, but for our particular setups and use-cases it is probably
> too stiff:
>  - we've noticed that devices connected directly to the root bridge
> don't get moved BARs, and this covers our x86_64 servers: we only
> insert/remove devices into "second-level" and "lower" bridges there, but
> not root;
>  - on PowerNV we have system devices (network interfaces, USB hub, etc.)
> grouped into dedicated domain, with all other domains ready for hotplug,
> and only these domains can be rescanned.
> 
> With our scenarios currently reduced to these two, we can live with just
> a few drivers "prepared" for now: NVME and few Ethernet adapters, this
> gives us a possibility to use this feature before "converting" *all* the
> drivers, and even have the NVidia cards running on a closed proprietary
> driver.
> 
> Should we make this behavior adjustable with something like
> "pcie_movable_bars=safe" and "pcie_movable_bars=always" ?

I like the overall idea of this a lot.

  - Why do we need a command line parameter to enable this?  Can't we
    do it unconditionally and automatically when it's possible?  We
    could have a chicken switch to *disable* it in case this breaks
    something horribly, but I would like this functionality to be
    always available without a special option.

  - I'm not sure the existence of .reset_done() is evidence that a
    driver is prepared for its BARs to move.  I don't see any
    documentation that refers to BAR movement, and I doubt it's been
    tested.  But I only see 5 implementations in the tree, so it'd be
    easy to verify.
    
  - I think your second proposal above sounds right: we should regard
    any device whose driver lacks .reset_done() as immovable.  We will
    likely be able to move some devices but not others.  Implementing
    .reset_done() will increase flexibility but it shouldn't be a
    requirement for all drivers.

> >> +	}
> >> +}
> >> +
> >> +/*
> >> + * Complete the reset of all devices for the bus and its children
> >> + */
> >> +static void pci_bus_reset_done(struct pci_bus *bus)
> >> +{
> >> +	struct pci_dev *dev;
> >> +
> >> +	list_for_each_entry(dev, &bus->devices, bus_list) {
> >> +		struct pci_bus *child = dev->subordinate;
> >> +
> >> +		if (child) {
> >> +			pci_bus_reset_done(child);
> >> +		} else if (dev->driver && dev->driver->err_handler &&
> >> +			   dev->driver->err_handler->reset_done) {
> >> +			dev->driver->err_handler->reset_done(dev);
> >> +		}
> >> +	}
> >> +}
> >> +
> >>  /**
> >>   * pci_rescan_bus - Scan a PCI bus for devices
> >>   * @bus: PCI bus to scan
> >> @@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
> >>  {
> >>  	unsigned int max;
> >>  
> >> +	if (pci_has_flag(PCI_MOVABLE_BARS))
> >> +		pci_bus_reset_prepare(bus);
> >>  	max = pci_scan_child_bus(bus);
> >>  	pci_assign_unassigned_bus_resources(bus);
> >> +	if (pci_has_flag(PCI_MOVABLE_BARS))
> >> +		pci_bus_reset_done(bus);
> >>  	pci_bus_add_devices(bus);
> >>  
> >>  	return max;
> >> diff --git a/include/linux/pci.h b/include/linux/pci.h
> >> index 6925828f9f25..a8cb1a367c34 100644
> >> --- a/include/linux/pci.h
> >> +++ b/include/linux/pci.h
> >> @@ -847,6 +847,7 @@ enum {
> >>  	PCI_ENABLE_PROC_DOMAINS	= 0x00000010,	/* Enable domains in /proc */
> >>  	PCI_COMPAT_DOMAIN_0	= 0x00000020,	/* ... except domain 0 */
> >>  	PCI_SCAN_ALL_PCIE_DEVS	= 0x00000040,	/* Scan all, not just dev 0 */
> >> +	PCI_MOVABLE_BARS	= 0x00000080,	/* Runtime BAR reassign after hotplug */
> >>  };
> >>  
> >>  /* These external functions are only available when PCI support is enabled */
> >> -- 
> >> 2.17.1
> >>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
  2018-09-17 20:55     ` Sergey Miroshnichenko
  2018-09-17 22:59       ` Bjorn Helgaas
@ 2018-09-17 23:35       ` Oliver
  2018-09-18 16:51         ` Sergey Miroshnichenko
  1 sibling, 1 reply; 20+ messages in thread
From: Oliver @ 2018-09-17 23:35 UTC (permalink / raw)
  To: Sergey Miroshnichenko; +Cc: Sam Bobroff, linux-pci, Bjorn Helgaas, linux

On Tue, Sep 18, 2018 at 6:55 AM, Sergey Miroshnichenko
<s.miroshnichenko@yadro.com> wrote:
> Hello Sam,
>
> On 9/17/18 8:28 AM, Sam Bobroff wrote:
>> Hi Sergey,
>>
>> On Fri, Sep 14, 2018 at 07:14:01PM +0300, Sergey Miroshnichenko wrote:
>>> Introduce a new command line option "pci=pcie_movable_bars" that indicates
>>> support of PCIe hotplug without prior reservation of memory regions by
>>> BIOS/bootloader.
>>>
>>> If a new PCIe device has been hot-plugged between two active ones, which
>>> have no (or not big enough) gap between their BARs, allocating new BARs
>>> requires to move BARs of the following working devices:
>>>
>>> 1)                   dev 4
>>>                        |
>>>                        v
>>> .. |  dev 3  |  dev 3  |  dev 5  |  dev 7  |
>>> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>>>
>>> 2)                             dev 4
>>>                                  |
>>>                                  v
>>> .. |  dev 3  |  dev 3  | -->           --> |  dev 5  |  dev 7  |
>>> .. |  BAR 0  |  BAR 1  | -->           --> |  BAR 0  |  BAR 0  |
>>>
>>> 3)
>>>
>>> .. |  dev 3  |  dev 3  |  dev 4  |  dev 4  |  dev 5  |  dev 7  |
>>> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>>>
>>> Not only BARs, but also bridge windows can be updated during a PCIe rescan,
>>> threatening all memory transactions during this procedure, so the PCI
>>> subsystem will instruct the drivers to pause by calling the reset_prepare()
>>> and reset_done() callbacks.
>>>
>>> If a device may be affected by BAR movement, the BAR changes tracking must
>>> be implemented in its driver.
>>>
>>> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
>>> ---
>>>  .../admin-guide/kernel-parameters.txt         |  6 +++
>>>  drivers/pci/pci.c                             |  2 +
>>>  drivers/pci/probe.c                           | 43 +++++++++++++++++++
>>>  include/linux/pci.h                           |  1 +
>>>  4 files changed, 52 insertions(+)
>>>
>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>> index 64a3bf54b974..f8132a709061 100644
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -3311,6 +3311,12 @@
>>>                              bridges without forcing it upstream. Note:
>>>                              this removes isolation between devices and
>>>                              may put more devices in an IOMMU group.
>>> +            pcie_movable_bars       Arrange a space at runtime for BARs of
>>> +                            hotplugged PCIe devices - usable if bootloader
>>> +                            doesn't reserve memory regions for them. Freeing
>>> +                            a space may require moving BARs of active devices
>>> +                            to higher addresses, so device drivers will be
>>> +                            paused during rescan.
>>>
>>>      pcie_aspm=      [PCIE] Forcibly enable or disable PCIe Active State Power
>>>                      Management.
>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>> index 1835f3a7aa8d..5f07a59b5924 100644
>>> --- a/drivers/pci/pci.c
>>> +++ b/drivers/pci/pci.c
>>> @@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str)
>>>                              pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>>>                      } else if (!strncmp(str, "disable_acs_redir=", 18)) {
>>>                              disable_acs_redir_param = str + 18;
>>> +                    } else if (!strncmp(str, "pcie_movable_bars", 17)) {
>>> +                            pci_add_flags(PCI_MOVABLE_BARS);
>>>                      } else {
>>>                              printk(KERN_ERR "PCI: Unknown option `%s'\n",
>>>                                              str);
>>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>>> index 201f9e5ff55c..bdaafc48dc4c 100644
>>> --- a/drivers/pci/probe.c
>>> +++ b/drivers/pci/probe.c
>>> @@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
>>>      return max;
>>>  }
>>>
>>> +/*
>>> + * Put all devices of the bus and its children to reset
>>> + */
>>> +static void pci_bus_reset_prepare(struct pci_bus *bus)
>>> +{
>>> +    struct pci_dev *dev;
>>> +
>>> +    list_for_each_entry(dev, &bus->devices, bus_list) {
>>> +            struct pci_bus *child = dev->subordinate;
>>> +
>>> +            if (child) {
>>> +                    pci_bus_reset_prepare(child);
>>> +            } else if (dev->driver &&
>>> +                       dev->driver->err_handler &&
>>> +                       dev->driver->err_handler->reset_prepare) {
>>> +                    dev->driver->err_handler->reset_prepare(dev);
>>> +            }
>>
>> What about devices with drivers that don't have reset_prepare()?  It
>> looks like it will just reconfigure them anyway. Is that right?
>
> It is possible that unprepared driver without these hooks will get BARs
> moved, I should put a warning message there. There three ways we can see
> to make this safe:>

It might make more sense to allow drivers to opt-in to this rather
than assuming any driver with the error handlers supports re-assigning
BARs on the fly. As far as I know we restore the previous BAR values a
part of the EEH recovery process when there's a driver bound, but I
could be wrong.

>  - add the reset_prepare()/reset_done() hooks to *every* PCIe driver;
>  - refuse BAR movement if at least one unprepared driver has been
> encountered during rescan;
>  - reduce the number of drivers which can be affected to some
> controllable value and prepare them on demand.

Is there any reason we couldn't just unbind the unaware drivers and
re-bind them afterwards? That might be useful when hotplugging NVMe
racks since you wouldn't want a switch management driver (or similar)
to prevent re-assignments if they're required.

> Applying the second proposal as a major restriction seems fairly
> reasonable, but for our particular setups and use-cases it is probably
> too stiff:
>  - we've noticed that devices connected directly to the root bridge
> don't get moved BARs, and this covers our x86_64 servers: we only
> insert/remove devices into "second-level" and "lower" bridges there, but
> not root;
>
>  - on PowerNV we have system devices (network interfaces, USB hub, etc.)
> grouped into dedicated domain, with all other domains ready for hotplug,
> and only these domains can be rescanned.

Are you doing anything to enforce this or just relying on people not
re-scanning the system device bus?

> With our scenarios currently reduced to these two, we can live with just
> a few drivers "prepared" for now: NVME and few Ethernet adapters, this
> gives us a possibility to use this feature before "converting" *all* the
> drivers, and even have the NVidia cards running on a closed proprietary
> driver.
>
> Should we make this behavior adjustable with something like
> "pcie_movable_bars=safe" and "pcie_movable_bars=always" ?
>
> Thanks!
>
> Best regards,
> Serge
>
>>> +    }
>>> +}
>>> +
>>> +/*
>>> + * Complete the reset of all devices for the bus and its children
>>> + */
>>> +static void pci_bus_reset_done(struct pci_bus *bus)
>>> +{
>>> +    struct pci_dev *dev;
>>> +
>>> +    list_for_each_entry(dev, &bus->devices, bus_list) {
>>> +            struct pci_bus *child = dev->subordinate;
>>> +
>>> +            if (child) {
>>> +                    pci_bus_reset_done(child);
>>> +            } else if (dev->driver && dev->driver->err_handler &&
>>> +                       dev->driver->err_handler->reset_done) {
>>> +                    dev->driver->err_handler->reset_done(dev);
>>> +            }
>>> +    }
>>> +}
>>> +
>>>  /**
>>>   * pci_rescan_bus - Scan a PCI bus for devices
>>>   * @bus: PCI bus to scan
>>> @@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
>>>  {
>>>      unsigned int max;
>>>
>>> +    if (pci_has_flag(PCI_MOVABLE_BARS))
>>> +            pci_bus_reset_prepare(bus);
>>>      max = pci_scan_child_bus(bus);
>>>      pci_assign_unassigned_bus_resources(bus);
>>> +    if (pci_has_flag(PCI_MOVABLE_BARS))
>>> +            pci_bus_reset_done(bus);
>>>      pci_bus_add_devices(bus);
>>>
>>>      return max;
>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>> index 6925828f9f25..a8cb1a367c34 100644
>>> --- a/include/linux/pci.h
>>> +++ b/include/linux/pci.h
>>> @@ -847,6 +847,7 @@ enum {
>>>      PCI_ENABLE_PROC_DOMAINS = 0x00000010,   /* Enable domains in /proc */
>>>      PCI_COMPAT_DOMAIN_0     = 0x00000020,   /* ... except domain 0 */
>>>      PCI_SCAN_ALL_PCIE_DEVS  = 0x00000040,   /* Scan all, not just dev 0 */
>>> +    PCI_MOVABLE_BARS        = 0x00000080,   /* Runtime BAR reassign after hotplug */
>>>  };
>>>
>>>  /* These external functions are only available when PCI support is enabled */
>>> --
>>> 2.17.1
>>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug
  2018-09-14 16:14 [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug Sergey Miroshnichenko
                   ` (3 preceding siblings ...)
  2018-09-14 16:14 ` [PATCH RFC 4/4] PCI: Fix writing invalid BARs during pci_restore_state() Sergey Miroshnichenko
@ 2018-09-18 11:16 ` David Laight
  2018-09-18 17:07   ` Sergey Miroshnichenko
  4 siblings, 1 reply; 20+ messages in thread
From: David Laight @ 2018-09-18 11:16 UTC (permalink / raw)
  To: 'Sergey Miroshnichenko', linux-pci; +Cc: Bjorn Helgaas, linux

From: Sergey Miroshnichenko
> Sent: 14 September 2018 17:14
> 
> If the firmware or kernel has arranged memory for PCIe devices in a way that doesn't
> provide enough space for BARs of a new hotplugged device, the kernel can pause the
> drivers of the "obstructing" devices and move their BARs, so new BARs can fit into the
> created hole.
> 
> When a driver is un-paused by the kernel after the PCIe rescan, it should check if its
> BARs had changed and ioremap() them if needed.
> 
> BARs are moved not directly, but by releasing bridge resources, then sorting and
> assigning them back, similarly to the initial PCIe topology scan during system boot
> (when pci=realloc is passed).
> 
> Pausing drivers is performed via reset_prepare() and reset_done() callbacks of struct
> pci_error_handlers. Drivers should pause during rescan not only because of potential
> movement of their BARs, but also because of possible updating of the bridge windows.
...

You are restricting this to GPL drivers (unless someone has opened up the error
handling functions since I last looked).

I also suspect you need something to indicate that the driver is willing for
all this to happen.

An error return from reset_prepare() will all be too late - other drivers
might have been stopped even though it is impossible to actually perform
the reorganisation.

There will also be drivers/devices that can handle being moved by a full
remove/rescan sequence, particularly if there are no current users of the
device.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
  2018-09-17 22:59       ` Bjorn Helgaas
@ 2018-09-18 14:01         ` Sergey Miroshnichenko
  2018-09-18 21:10           ` Bjorn Helgaas
  0 siblings, 1 reply; 20+ messages in thread
From: Sergey Miroshnichenko @ 2018-09-18 14:01 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Sam Bobroff, linux-pci, Bjorn Helgaas, linux, Russell Currey,
	linuxppc-dev, Benjamin Herrenschmidt, Oliver OHalloran


[-- Attachment #1.1: Type: text/plain, Size: 10048 bytes --]

On 9/18/18 1:59 AM, Bjorn Helgaas wrote:
> [+cc Russell, Ben, Oliver, linuxppc-dev]
> 
> On Mon, Sep 17, 2018 at 11:55:43PM +0300, Sergey Miroshnichenko wrote:
>> Hello Sam,
>>
>> On 9/17/18 8:28 AM, Sam Bobroff wrote:
>>> Hi Sergey,
>>>
>>> On Fri, Sep 14, 2018 at 07:14:01PM +0300, Sergey Miroshnichenko wrote:
>>>> Introduce a new command line option "pci=pcie_movable_bars" that indicates
>>>> support of PCIe hotplug without prior reservation of memory regions by
>>>> BIOS/bootloader.
>>>>
>>>> If a new PCIe device has been hot-plugged between two active ones, which
>>>> have no (or not big enough) gap between their BARs, allocating new BARs
>>>> requires to move BARs of the following working devices:
>>>>
>>>> 1)                   dev 4
>>>>                        |
>>>>                        v
>>>> .. |  dev 3  |  dev 3  |  dev 5  |  dev 7  |
>>>> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>>>>
>>>> 2)                             dev 4
>>>>                                  |
>>>>                                  v
>>>> .. |  dev 3  |  dev 3  | -->           --> |  dev 5  |  dev 7  |
>>>> .. |  BAR 0  |  BAR 1  | -->           --> |  BAR 0  |  BAR 0  |
>>>>
>>>> 3)
>>>>
>>>> .. |  dev 3  |  dev 3  |  dev 4  |  dev 4  |  dev 5  |  dev 7  |
>>>> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>>>>
>>>> Not only BARs, but also bridge windows can be updated during a PCIe rescan,
>>>> threatening all memory transactions during this procedure, so the PCI
>>>> subsystem will instruct the drivers to pause by calling the reset_prepare()
>>>> and reset_done() callbacks.
>>>>
>>>> If a device may be affected by BAR movement, the BAR changes tracking must
>>>> be implemented in its driver.
>>>>
>>>> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
>>>> ---
>>>>  .../admin-guide/kernel-parameters.txt         |  6 +++
>>>>  drivers/pci/pci.c                             |  2 +
>>>>  drivers/pci/probe.c                           | 43 +++++++++++++++++++
>>>>  include/linux/pci.h                           |  1 +
>>>>  4 files changed, 52 insertions(+)
>>>>
>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>>> index 64a3bf54b974..f8132a709061 100644
>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>> @@ -3311,6 +3311,12 @@
>>>>  				bridges without forcing it upstream. Note:
>>>>  				this removes isolation between devices and
>>>>  				may put more devices in an IOMMU group.
>>>> +		pcie_movable_bars	Arrange a space at runtime for BARs of
>>>> +				hotplugged PCIe devices - usable if bootloader
>>>> +				doesn't reserve memory regions for them. Freeing
>>>> +				a space may require moving BARs of active devices
>>>> +				to higher addresses, so device drivers will be
>>>> +				paused during rescan.
>>>>  
>>>>  	pcie_aspm=	[PCIE] Forcibly enable or disable PCIe Active State Power
>>>>  			Management.
>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>> index 1835f3a7aa8d..5f07a59b5924 100644
>>>> --- a/drivers/pci/pci.c
>>>> +++ b/drivers/pci/pci.c
>>>> @@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str)
>>>>  				pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>>>>  			} else if (!strncmp(str, "disable_acs_redir=", 18)) {
>>>>  				disable_acs_redir_param = str + 18;
>>>> +			} else if (!strncmp(str, "pcie_movable_bars", 17)) {
>>>> +				pci_add_flags(PCI_MOVABLE_BARS);
>>>>  			} else {
>>>>  				printk(KERN_ERR "PCI: Unknown option `%s'\n",
>>>>  						str);
>>>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>>>> index 201f9e5ff55c..bdaafc48dc4c 100644
>>>> --- a/drivers/pci/probe.c
>>>> +++ b/drivers/pci/probe.c
>>>> @@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
>>>>  	return max;
>>>>  }
>>>>  
>>>> +/*
>>>> + * Put all devices of the bus and its children to reset
>>>> + */
>>>> +static void pci_bus_reset_prepare(struct pci_bus *bus)
>>>> +{
>>>> +	struct pci_dev *dev;
>>>> +
>>>> +	list_for_each_entry(dev, &bus->devices, bus_list) {
>>>> +		struct pci_bus *child = dev->subordinate;
>>>> +
>>>> +		if (child) {
>>>> +			pci_bus_reset_prepare(child);
>>>> +		} else if (dev->driver &&
>>>> +			   dev->driver->err_handler &&
>>>> +			   dev->driver->err_handler->reset_prepare) {
>>>> +			dev->driver->err_handler->reset_prepare(dev);
>>>> +		}
>>>
>>> What about devices with drivers that don't have reset_prepare()?  It
>>> looks like it will just reconfigure them anyway. Is that right?
>>>
>>
>> It is possible that unprepared driver without these hooks will get BARs
>> moved, I should put a warning message there. There three ways we can see
>> to make this safe:
>>  - add the reset_prepare()/reset_done() hooks to *every* PCIe driver;
>>  - refuse BAR movement if at least one unprepared driver has been
>> encountered during rescan;
>>  - reduce the number of drivers which can be affected to some
>> controllable value and prepare them on demand.
>>
>> Applying the second proposal as a major restriction seems fairly
>> reasonable, but for our particular setups and use-cases it is probably
>> too stiff:
>>  - we've noticed that devices connected directly to the root bridge
>> don't get moved BARs, and this covers our x86_64 servers: we only
>> insert/remove devices into "second-level" and "lower" bridges there, but
>> not root;
>>  - on PowerNV we have system devices (network interfaces, USB hub, etc.)
>> grouped into dedicated domain, with all other domains ready for hotplug,
>> and only these domains can be rescanned.
>>
>> With our scenarios currently reduced to these two, we can live with just
>> a few drivers "prepared" for now: NVME and few Ethernet adapters, this
>> gives us a possibility to use this feature before "converting" *all* the
>> drivers, and even have the NVidia cards running on a closed proprietary
>> driver.
>>
>> Should we make this behavior adjustable with something like
>> "pcie_movable_bars=safe" and "pcie_movable_bars=always" ?
> 
> I like the overall idea of this a lot.
> 
>   - Why do we need a command line parameter to enable this?  Can't we
>     do it unconditionally and automatically when it's possible?  We
>     could have a chicken switch to *disable* it in case this breaks
>     something horribly, but I would like this functionality to be
>     always available without a special option.
> 

After making this feature completely safe we could activate it with the
existing option "pci=realloc".

>   - I'm not sure the existence of .reset_done() is evidence that a
>     driver is prepared for its BARs to move.  I don't see any
>     documentation that refers to BAR movement, and I doubt it's been
>     tested.  But I only see 5 implementations in the tree, so it'd be
>     easy to verify.
>     

You are right, and we should clarify the description:
 - drivers which have the .reset_done() already - none of them are aware
of movable BARs yet;
 - the rest of the drivers should both be able to pause and handle the
changes in BARs.

>   - I think your second proposal above sounds right: we should regard
>     any device whose driver lacks .reset_done() as immovable.  We will
>     likely be able to move some devices but not others.  Implementing
>     .reset_done() will increase flexibility but it shouldn't be a
>     requirement for all drivers.
> 

Thanks for the advice! This is more flexible and doesn't have any
prerequisites. In this case the greater the "movable"/"immovable" ratio
of the devices that was working before the hotplug event - the higher
the probability to free some space for new BARs. But even a single
"immovable" device at an undesirable place can block the re-arrangement,
in this case all we can is just give up and print an error message.

This patchset in its current form doesn't support marking a choosen BAR
as immovable (just releasing all the resources of the root bridge and
trying to sort and re-assign them back), so I'll have to implement that.

Best regards,
Serge

>>>> +	}
>>>> +}
>>>> +
>>>> +/*
>>>> + * Complete the reset of all devices for the bus and its children
>>>> + */
>>>> +static void pci_bus_reset_done(struct pci_bus *bus)
>>>> +{
>>>> +	struct pci_dev *dev;
>>>> +
>>>> +	list_for_each_entry(dev, &bus->devices, bus_list) {
>>>> +		struct pci_bus *child = dev->subordinate;
>>>> +
>>>> +		if (child) {
>>>> +			pci_bus_reset_done(child);
>>>> +		} else if (dev->driver && dev->driver->err_handler &&
>>>> +			   dev->driver->err_handler->reset_done) {
>>>> +			dev->driver->err_handler->reset_done(dev);
>>>> +		}
>>>> +	}
>>>> +}
>>>> +
>>>>  /**
>>>>   * pci_rescan_bus - Scan a PCI bus for devices
>>>>   * @bus: PCI bus to scan
>>>> @@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
>>>>  {
>>>>  	unsigned int max;
>>>>  
>>>> +	if (pci_has_flag(PCI_MOVABLE_BARS))
>>>> +		pci_bus_reset_prepare(bus);
>>>>  	max = pci_scan_child_bus(bus);
>>>>  	pci_assign_unassigned_bus_resources(bus);
>>>> +	if (pci_has_flag(PCI_MOVABLE_BARS))
>>>> +		pci_bus_reset_done(bus);
>>>>  	pci_bus_add_devices(bus);
>>>>  
>>>>  	return max;
>>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>>> index 6925828f9f25..a8cb1a367c34 100644
>>>> --- a/include/linux/pci.h
>>>> +++ b/include/linux/pci.h
>>>> @@ -847,6 +847,7 @@ enum {
>>>>  	PCI_ENABLE_PROC_DOMAINS	= 0x00000010,	/* Enable domains in /proc */
>>>>  	PCI_COMPAT_DOMAIN_0	= 0x00000020,	/* ... except domain 0 */
>>>>  	PCI_SCAN_ALL_PCIE_DEVS	= 0x00000040,	/* Scan all, not just dev 0 */
>>>> +	PCI_MOVABLE_BARS	= 0x00000080,	/* Runtime BAR reassign after hotplug */
>>>>  };
>>>>  
>>>>  /* These external functions are only available when PCI support is enabled */
>>>> -- 
>>>> 2.17.1
>>>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
  2018-09-17 23:35       ` Oliver
@ 2018-09-18 16:51         ` Sergey Miroshnichenko
  0 siblings, 0 replies; 20+ messages in thread
From: Sergey Miroshnichenko @ 2018-09-18 16:51 UTC (permalink / raw)
  To: Oliver; +Cc: Sam Bobroff, linux-pci, Bjorn Helgaas, linux



On 9/18/18 2:35 AM, Oliver wrote:
> On Tue, Sep 18, 2018 at 6:55 AM, Sergey Miroshnichenko
> <s.miroshnichenko@yadro.com> wrote:
>> Hello Sam,
>>
>> On 9/17/18 8:28 AM, Sam Bobroff wrote:
>>> Hi Sergey,
>>>
>>> On Fri, Sep 14, 2018 at 07:14:01PM +0300, Sergey Miroshnichenko wrote:
>>>> Introduce a new command line option "pci=pcie_movable_bars" that indicates
>>>> support of PCIe hotplug without prior reservation of memory regions by
>>>> BIOS/bootloader.
>>>>
>>>> If a new PCIe device has been hot-plugged between two active ones, which
>>>> have no (or not big enough) gap between their BARs, allocating new BARs
>>>> requires to move BARs of the following working devices:
>>>>
>>>> 1)                   dev 4
>>>>                        |
>>>>                        v
>>>> .. |  dev 3  |  dev 3  |  dev 5  |  dev 7  |
>>>> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>>>>
>>>> 2)                             dev 4
>>>>                                  |
>>>>                                  v
>>>> .. |  dev 3  |  dev 3  | -->           --> |  dev 5  |  dev 7  |
>>>> .. |  BAR 0  |  BAR 1  | -->           --> |  BAR 0  |  BAR 0  |
>>>>
>>>> 3)
>>>>
>>>> .. |  dev 3  |  dev 3  |  dev 4  |  dev 4  |  dev 5  |  dev 7  |
>>>> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>>>>
>>>> Not only BARs, but also bridge windows can be updated during a PCIe rescan,
>>>> threatening all memory transactions during this procedure, so the PCI
>>>> subsystem will instruct the drivers to pause by calling the reset_prepare()
>>>> and reset_done() callbacks.
>>>>
>>>> If a device may be affected by BAR movement, the BAR changes tracking must
>>>> be implemented in its driver.
>>>>
>>>> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
>>>> ---
>>>>  .../admin-guide/kernel-parameters.txt         |  6 +++
>>>>  drivers/pci/pci.c                             |  2 +
>>>>  drivers/pci/probe.c                           | 43 +++++++++++++++++++
>>>>  include/linux/pci.h                           |  1 +
>>>>  4 files changed, 52 insertions(+)
>>>>
>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>>> index 64a3bf54b974..f8132a709061 100644
>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>> @@ -3311,6 +3311,12 @@
>>>>                              bridges without forcing it upstream. Note:
>>>>                              this removes isolation between devices and
>>>>                              may put more devices in an IOMMU group.
>>>> +            pcie_movable_bars       Arrange a space at runtime for BARs of
>>>> +                            hotplugged PCIe devices - usable if bootloader
>>>> +                            doesn't reserve memory regions for them. Freeing
>>>> +                            a space may require moving BARs of active devices
>>>> +                            to higher addresses, so device drivers will be
>>>> +                            paused during rescan.
>>>>
>>>>      pcie_aspm=      [PCIE] Forcibly enable or disable PCIe Active State Power
>>>>                      Management.
>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>> index 1835f3a7aa8d..5f07a59b5924 100644
>>>> --- a/drivers/pci/pci.c
>>>> +++ b/drivers/pci/pci.c
>>>> @@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str)
>>>>                              pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>>>>                      } else if (!strncmp(str, "disable_acs_redir=", 18)) {
>>>>                              disable_acs_redir_param = str + 18;
>>>> +                    } else if (!strncmp(str, "pcie_movable_bars", 17)) {
>>>> +                            pci_add_flags(PCI_MOVABLE_BARS);
>>>>                      } else {
>>>>                              printk(KERN_ERR "PCI: Unknown option `%s'\n",
>>>>                                              str);
>>>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>>>> index 201f9e5ff55c..bdaafc48dc4c 100644
>>>> --- a/drivers/pci/probe.c
>>>> +++ b/drivers/pci/probe.c
>>>> @@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
>>>>      return max;
>>>>  }
>>>>
>>>> +/*
>>>> + * Put all devices of the bus and its children to reset
>>>> + */
>>>> +static void pci_bus_reset_prepare(struct pci_bus *bus)
>>>> +{
>>>> +    struct pci_dev *dev;
>>>> +
>>>> +    list_for_each_entry(dev, &bus->devices, bus_list) {
>>>> +            struct pci_bus *child = dev->subordinate;
>>>> +
>>>> +            if (child) {
>>>> +                    pci_bus_reset_prepare(child);
>>>> +            } else if (dev->driver &&
>>>> +                       dev->driver->err_handler &&
>>>> +                       dev->driver->err_handler->reset_prepare) {
>>>> +                    dev->driver->err_handler->reset_prepare(dev);
>>>> +            }
>>>
>>> What about devices with drivers that don't have reset_prepare()?  It
>>> looks like it will just reconfigure them anyway. Is that right?
>>
>> It is possible that unprepared driver without these hooks will get BARs
>> moved, I should put a warning message there. There three ways we can see
>> to make this safe:>
> 
> It might make more sense to allow drivers to opt-in to this rather
> than assuming any driver with the error handlers supports re-assigning
> BARs on the fly. As far as I know we restore the previous BAR values a
> part of the EEH recovery process when there's a driver bound, but I
> could be wrong.
> 

If I add the new two callbacks rescan_prepare() and rescan_done() to the
struct pci_driver, would they be a good indicator of driver's awareness
of movable BARs?

>>  - add the reset_prepare()/reset_done() hooks to *every* PCIe driver;
>>  - refuse BAR movement if at least one unprepared driver has been
>> encountered during rescan;
>>  - reduce the number of drivers which can be affected to some
>> controllable value and prepare them on demand.
> 
> Is there any reason we couldn't just unbind the unaware drivers and
> re-bind them afterwards? That might be useful when hotplugging NVMe
> racks since you wouldn't want a switch management driver (or similar)
> to prevent re-assignments if they're required.
> 

We would like the hotplug events to be as much transparent for the rest
of the system as possible: it is ok if a device takes a pause for a few
seconds, but it is not allowed to disappear, even if it will return soon.

>> Applying the second proposal as a major restriction seems fairly
>> reasonable, but for our particular setups and use-cases it is probably
>> too stiff:
>>  - we've noticed that devices connected directly to the root bridge
>> don't get moved BARs, and this covers our x86_64 servers: we only
>> insert/remove devices into "second-level" and "lower" bridges there, but
>> not root;
>>
>>  - on PowerNV we have system devices (network interfaces, USB hub, etc.)
>> grouped into dedicated domain, with all other domains ready for hotplug,
>> and only these domains can be rescanned.
> 
> Are you doing anything to enforce this or just relying on people not
> re-scanning the system device bus?
> 

It was the latter. But then Bjorn has suggested how we can avoid such
prerequisites and assumptions: I'll change the implementation so the
arrangement of a free space for BARs of new devices may fail because of
memory fragmentation by "immovable" devices. In return, the whole
operation of rescan becomes safe.

Serge

>> With our scenarios currently reduced to these two, we can live with just
>> a few drivers "prepared" for now: NVME and few Ethernet adapters, this
>> gives us a possibility to use this feature before "converting" *all* the
>> drivers, and even have the NVidia cards running on a closed proprietary
>> driver.
>>
>> Should we make this behavior adjustable with something like
>> "pcie_movable_bars=safe" and "pcie_movable_bars=always" ?
>>
>> Thanks!
>>
>> Best regards,
>> Serge
>>
>>>> +    }
>>>> +}
>>>> +
>>>> +/*
>>>> + * Complete the reset of all devices for the bus and its children
>>>> + */
>>>> +static void pci_bus_reset_done(struct pci_bus *bus)
>>>> +{
>>>> +    struct pci_dev *dev;
>>>> +
>>>> +    list_for_each_entry(dev, &bus->devices, bus_list) {
>>>> +            struct pci_bus *child = dev->subordinate;
>>>> +
>>>> +            if (child) {
>>>> +                    pci_bus_reset_done(child);
>>>> +            } else if (dev->driver && dev->driver->err_handler &&
>>>> +                       dev->driver->err_handler->reset_done) {
>>>> +                    dev->driver->err_handler->reset_done(dev);
>>>> +            }
>>>> +    }
>>>> +}
>>>> +
>>>>  /**
>>>>   * pci_rescan_bus - Scan a PCI bus for devices
>>>>   * @bus: PCI bus to scan
>>>> @@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
>>>>  {
>>>>      unsigned int max;
>>>>
>>>> +    if (pci_has_flag(PCI_MOVABLE_BARS))
>>>> +            pci_bus_reset_prepare(bus);
>>>>      max = pci_scan_child_bus(bus);
>>>>      pci_assign_unassigned_bus_resources(bus);
>>>> +    if (pci_has_flag(PCI_MOVABLE_BARS))
>>>> +            pci_bus_reset_done(bus);
>>>>      pci_bus_add_devices(bus);
>>>>
>>>>      return max;
>>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>>> index 6925828f9f25..a8cb1a367c34 100644
>>>> --- a/include/linux/pci.h
>>>> +++ b/include/linux/pci.h
>>>> @@ -847,6 +847,7 @@ enum {
>>>>      PCI_ENABLE_PROC_DOMAINS = 0x00000010,   /* Enable domains in /proc */
>>>>      PCI_COMPAT_DOMAIN_0     = 0x00000020,   /* ... except domain 0 */
>>>>      PCI_SCAN_ALL_PCIE_DEVS  = 0x00000040,   /* Scan all, not just dev 0 */
>>>> +    PCI_MOVABLE_BARS        = 0x00000080,   /* Runtime BAR reassign after hotplug */
>>>>  };
>>>>
>>>>  /* These external functions are only available when PCI support is enabled */
>>>> --
>>>> 2.17.1
>>>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug
  2018-09-18 11:16 ` [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug David Laight
@ 2018-09-18 17:07   ` Sergey Miroshnichenko
  0 siblings, 0 replies; 20+ messages in thread
From: Sergey Miroshnichenko @ 2018-09-18 17:07 UTC (permalink / raw)
  To: David Laight; +Cc: linux-pci, Bjorn Helgaas, linux


On 9/18/18 2:16 PM, David Laight wrote:
> From: Sergey Miroshnichenko
>> Sent: 14 September 2018 17:14
>>
>> If the firmware or kernel has arranged memory for PCIe devices in a way that doesn't
>> provide enough space for BARs of a new hotplugged device, the kernel can pause the
>> drivers of the "obstructing" devices and move their BARs, so new BARs can fit into the
>> created hole.
>>
>> When a driver is un-paused by the kernel after the PCIe rescan, it should check if its
>> BARs had changed and ioremap() them if needed.
>>
>> BARs are moved not directly, but by releasing bridge resources, then sorting and
>> assigning them back, similarly to the initial PCIe topology scan during system boot
>> (when pci=realloc is passed).
>>
>> Pausing drivers is performed via reset_prepare() and reset_done() callbacks of struct
>> pci_error_handlers. Drivers should pause during rescan not only because of potential
>> movement of their BARs, but also because of possible updating of the bridge windows.
> ...
> 
> You are restricting this to GPL drivers (unless someone has opened up the error
> handling functions since I last looked).
> 

The only GPL-related restrictions I was aware of are functions exported
with EXPORT_SYMBOL_GPL().

> I also suspect you need something to indicate that the driver is willing for
> all this to happen.
> 

I'm thinking about new callbacks in the struct pci_driver, would they
met the same restrictions?

> An error return from reset_prepare() will all be too late - other drivers
> might have been stopped even though it is impossible to actually perform
> the reorganisation.
> 
> There will also be drivers/devices that can handle being moved by a full
> remove/rescan sequence, particularly if there are no current users of the
> device.
> 

The implementation is going to be changed a bit:
 - if a device's driver doesn't have these callbacks (and so it can't
pause), its BARs are marked as immovable;
 - for every "movable" device that was working before the hotplug event,
it is guaranteed to arrange BARs;
 - if there is not enough memory or if the memory is too fragmented by
"immovable" devices, the BAR allocation for hotplugged devices can fail,
in this case the only way is to convert drivers to "movable".

The remove+rescan trick for unused "immovable" devices could be used as
a fallback, if BAR allocation has failed, but I'm not sure yet - what if
there are devices that are expected to do some job even without being
explicitly open()'ed?

> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
> 

Best regards,
Serge

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
  2018-09-18 14:01         ` Sergey Miroshnichenko
@ 2018-09-18 21:10           ` Bjorn Helgaas
  0 siblings, 0 replies; 20+ messages in thread
From: Bjorn Helgaas @ 2018-09-18 21:10 UTC (permalink / raw)
  To: Sergey Miroshnichenko
  Cc: Sam Bobroff, linux-pci, Bjorn Helgaas, linux, Russell Currey,
	linuxppc-dev, Benjamin Herrenschmidt, Oliver OHalloran

On Tue, Sep 18, 2018 at 05:01:48PM +0300, Sergey Miroshnichenko wrote:
> On 9/18/18 1:59 AM, Bjorn Helgaas wrote:
> > On Mon, Sep 17, 2018 at 11:55:43PM +0300, Sergey Miroshnichenko wrote:
> >> On 9/17/18 8:28 AM, Sam Bobroff wrote:
> >>> On Fri, Sep 14, 2018 at 07:14:01PM +0300, Sergey Miroshnichenko wrote:
> >>>> Introduce a new command line option "pci=pcie_movable_bars"
> >>>> that indicates support of PCIe hotplug without prior
> >>>> reservation of memory regions by BIOS/bootloader.

> >>> What about devices with drivers that don't have reset_prepare()?  It
> >>> looks like it will just reconfigure them anyway. Is that right?
> >>
> >> It is possible that unprepared driver without these hooks will get BARs
> >> moved, I should put a warning message there. There three ways we can see
> >> to make this safe:
> >>  - add the reset_prepare()/reset_done() hooks to *every* PCIe driver;
> >>  - refuse BAR movement if at least one unprepared driver has been
> >> encountered during rescan;
> >>  - reduce the number of drivers which can be affected to some
> >> controllable value and prepare them on demand.
> >>
> >> Applying the second proposal as a major restriction seems fairly
> >> reasonable, but for our particular setups and use-cases it is probably
> >> too stiff:
> >>  - we've noticed that devices connected directly to the root bridge
> >> don't get moved BARs, and this covers our x86_64 servers: we only
> >> insert/remove devices into "second-level" and "lower" bridges there, but
> >> not root;
> >>  - on PowerNV we have system devices (network interfaces, USB hub, etc.)
> >> grouped into dedicated domain, with all other domains ready for hotplug,
> >> and only these domains can be rescanned.
> >>
> >> With our scenarios currently reduced to these two, we can live with just
> >> a few drivers "prepared" for now: NVME and few Ethernet adapters, this
> >> gives us a possibility to use this feature before "converting" *all* the
> >> drivers, and even have the NVidia cards running on a closed proprietary
> >> driver.
> >>
> >> Should we make this behavior adjustable with something like
> >> "pcie_movable_bars=safe" and "pcie_movable_bars=always" ?
> > 
> > I like the overall idea of this a lot.
> > 
> >   - Why do we need a command line parameter to enable this?  Can't we
> >     do it unconditionally and automatically when it's possible?  We
> >     could have a chicken switch to *disable* it in case this breaks
> >     something horribly, but I would like this functionality to be
> >     always available without a special option.
> 
> After making this feature completely safe we could activate it with the
> existing option "pci=realloc".

That *sounds* good, but in practice it never happens that we decide a
feature is completely safe and somebody makes it the default.  If
we're going to do this, I think we need to commit to making it work
100% of the time, with no option needed.

> >   - I'm not sure the existence of .reset_done() is evidence that a
> >     driver is prepared for its BARs to move.  I don't see any
> >     documentation that refers to BAR movement, and I doubt it's been
> >     tested.  But I only see 5 implementations in the tree, so it'd be
> >     easy to verify.
> 
> You are right, and we should clarify the description:
>  - drivers which have the .reset_done() already - none of them are aware
> of movable BARs yet;
>  - the rest of the drivers should both be able to pause and handle the
> changes in BARs.

This doesn't clarify it for me.  If you want to update all existing
.reset_done() methods so they deal with BAR changes, that would be
fine with me.  That would be done by preliminary patches in the series
that adds the feature.

> >   - I think your second proposal above sounds right: we should regard
> >     any device whose driver lacks .reset_done() as immovable.  We will
> >     likely be able to move some devices but not others.  Implementing
> >     .reset_done() will increase flexibility but it shouldn't be a
> >     requirement for all drivers.
> 
> Thanks for the advice! This is more flexible and doesn't have any
> prerequisites. In this case the greater the "movable"/"immovable" ratio
> of the devices that was working before the hotplug event - the higher
> the probability to free some space for new BARs. But even a single
> "immovable" device at an undesirable place can block the re-arrangement,
> in this case all we can is just give up and print an error message.

Right.  There's nothing we can do about that except make the relevant
drivers smarter.

> This patchset in its current form doesn't support marking a choosen BAR
> as immovable (just releasing all the resources of the root bridge and
> trying to sort and re-assign them back), so I'll have to implement that.

The current IORESOURCE_PCI_FIXED usage is for things that literally
*cannot* be moved because there is no BAR at all (VGA or IDE legacy,
enhanced allocation (see pci_ea_read()) or there's some platform
quirk.

It *might* make sense to also use it for devices where the *driver*
isn't smart enough to deal with moving it, but I'm not sure.  That
would have to be done at driver probe time, I guess.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
  2018-09-17 21:25     ` Sergey Miroshnichenko
@ 2018-09-18 21:22       ` Rajat Jain
  2019-01-11 17:24         ` Sergey Miroshnichenko
  0 siblings, 1 reply; 20+ messages in thread
From: Rajat Jain @ 2018-09-18 21:22 UTC (permalink / raw)
  To: s.miroshnichenko; +Cc: linux-pci, Bjorn Helgaas, linux

Hi Sergey,

Sorry, I have one more basic question: The PCI express hierarchy is
made up from PCI express switches and end points, so essentially there
are only point to point links (except the switch internal logicl bus
on which you cannot hotplug anything). The device being hotplugged
shall be plugged onto a downstream port of a switch (or root port), so
there shall be no other device on that bus (since each downstream port
gets its own secondary bus number, and its own share of resources). So
it seems to me, that in this case, IF the resources at the downstream
port are not big enough, you will need to re-program the switch to get
resources to this downstream port, by stealing from *another*
downstream port. Is this correct understanding of the problem that you
are trying to solve?

Thanks,

Rajat



On Mon, Sep 17, 2018 at 2:25 PM Sergey Miroshnichenko
<s.miroshnichenko@yadro.com> wrote:
>
> Hello Rajat,
>
> On 9/17/18 10:00 PM, Rajat Jain wrote:
> > On Fri, Sep 14, 2018 at 9:21 AM Sergey Miroshnichenko
> > <s.miroshnichenko@yadro.com> wrote:
> >>
> >> Introduce a new command line option "pci=pcie_movable_bars" that indicates
> >> support of PCIe hotplug without prior reservation of memory regions by
> >> BIOS/bootloader.
> >>
> >> If a new PCIe device has been hot-plugged between two active ones, which
> >> have no (or not big enough) gap between their BARs, allocating new BARs
> >> requires to move BARs of the following working devices:
> >
> > I think this is a great problem to solve. I have some questions (not
> > objections, but trying to understand what are the limitations of the
> > solution being proposed):
> >
> > * What about hot-pluggable root ports? Would this help (I'm guessing not)?
> >
>
> We don't have a hardware that is able to handle a hotplug directly into
> a root port (but we'll find) - there are intermediate bridges always, so
> I can't provide proofs right now. But after a thought experiment I can't
> see any obstacles, so just need to verify.
>
> > * What about situations where the root port itself does not have
> > enough resources for the new device being inserted. I'm guessing we
> > are not going to expand root port allocation in those cases. But do we
> > fail gracefully rejecting the hotplug by not assigning it resources,
> > or do we manage to screw up the already working healthy devices (while
> > attempting to move them)?
> >
>
> Thanks for a great point! The code currently is too simplistic to handle
> this case properly, and though it never happened to us yet, but I'm sure
> it can kick off a working device to use its resources for a hotplugged
> one. The struct resource doesn't have a "priority" field, so probably I
> need to check the status of the resource allocation procedure and retry
> it without some of new hotplugged devices in case of fail.
>
> > * What about non-memory resources? E.g. cards may have pci bridges or
> > switches on them, and they may need extra PCI bus numbers. Does this
> > help that use case?
>
> Actually, this patchset is a part of our bigger work on hotplugging
> bridges full of bridges full of NVME and GPU devices. And in case of
> success here I was planning to submit the last part (hotplugging
> bridges), which includes the movement of bus numbers - similar to BARs,
> but more awkward, with PCIe devices renaming and re-building the
> /sys/bus/pci and /proc/pci, otherwise the lspci tool gets mad.
>
> By device renaming I mean the "/dev/nvme0n1p1" and "eth0" will remain
> the same, but underlying "nvme 0022:04:00.0" and "tg3 0001:06:00.1" may
> change. In this regard I don't know which UDEV action is the best here -
> it's neither "add", "remove" nor "change".
>
>
> Also I've just realized that should take a closer look and check if MSIs
> are also needs to be reallocated.
>
> >>
> >> 1)                   dev 4
> >>                        |
> >>                        v
> >> .. |  dev 3  |  dev 3  |  dev 5  |  dev 7  |
> >> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
> >>
> >> 2)                             dev 4
> >>                                  |
> >>                                  v
> >> .. |  dev 3  |  dev 3  | -->           --> |  dev 5  |  dev 7  |
> >> .. |  BAR 0  |  BAR 1  | -->           --> |  BAR 0  |  BAR 0  |
> >>
> >> 3)
> >>
> >> .. |  dev 3  |  dev 3  |  dev 4  |  dev 4  |  dev 5  |  dev 7  |
> >> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
> >>
> >> Not only BARs, but also bridge windows can be updated during a PCIe rescan,
> >> threatening all memory transactions during this procedure, so the PCI
> >> subsystem will instruct the drivers to pause by calling the reset_prepare()
> >> and reset_done() callbacks.
> >>
> >> If a device may be affected by BAR movement, the BAR changes tracking must
> >> be implemented in its driver.
> >
> > This is quite a big thing right? I expect that this will break a lot
> > of drivers because
> > (a) they may not have reset_prepare() and reset_done() callbacks (I
> > grepped in the sources and seems only 4 support it?).
> > (b) Now, it is expected that in the reset_done() the drivers should
> > re-read the BAR and remap any memory resources that may have changed.
> > Note this may also include cleaning up any existing resource mappings
> > that they had before they were paused. Not sure if this was the
> > semantics of reset_done() already, or may be it was, I'm just not
> > sure.
> >
>
> Yes, this is the most painful part: all the drivers that may be affected
> with BARs getting moved, must be prepared in this way. Luckily, these
> operations have distinct patterns, just like power management callbacks
> like runtime_suspend()/runtime_resume() and friends.
>
> Just a few minutes earlier in this thread I've wrote to Sam why we think
> this approach still can be used even with a minimal subset of drivers
> being prepared: for now we've limited the configurations of hotplug
> usage to non-root bridges and to domains with similar devices. If this
> concept looks viable to the community, we'll be happy to share the
> patches for the nvme, ixgbe and tg3 drivers (and a bit later - for
> xhci). Right now these are the only drivers _we_ need on our ppc64
> (PowerNV) and x86_64 servers.
>
> > Thanks,
> >
> > Rajat
> >
>
> Thanks for the review!
>
> Best regards,
> Serge
>
> >>
> >> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
> >> ---
> >>  .../admin-guide/kernel-parameters.txt         |  6 +++
> >>  drivers/pci/pci.c                             |  2 +
> >>  drivers/pci/probe.c                           | 43 +++++++++++++++++++
> >>  include/linux/pci.h                           |  1 +
> >>  4 files changed, 52 insertions(+)
> >>
> >> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> >> index 64a3bf54b974..f8132a709061 100644
> >> --- a/Documentation/admin-guide/kernel-parameters.txt
> >> +++ b/Documentation/admin-guide/kernel-parameters.txt
> >> @@ -3311,6 +3311,12 @@
> >>                                 bridges without forcing it upstream. Note:
> >>                                 this removes isolation between devices and
> >>                                 may put more devices in an IOMMU group.
> >> +               pcie_movable_bars       Arrange a space at runtime for BARs of
> >> +                               hotplugged PCIe devices - usable if bootloader
> >> +                               doesn't reserve memory regions for them. Freeing
> >> +                               a space may require moving BARs of active devices
> >> +                               to higher addresses, so device drivers will be
> >> +                               paused during rescan.
> >>
> >>         pcie_aspm=      [PCIE] Forcibly enable or disable PCIe Active State Power
> >>                         Management.
> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> >> index 1835f3a7aa8d..5f07a59b5924 100644
> >> --- a/drivers/pci/pci.c
> >> +++ b/drivers/pci/pci.c
> >> @@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str)
> >>                                 pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
> >>                         } else if (!strncmp(str, "disable_acs_redir=", 18)) {
> >>                                 disable_acs_redir_param = str + 18;
> >> +                       } else if (!strncmp(str, "pcie_movable_bars", 17)) {
> >> +                               pci_add_flags(PCI_MOVABLE_BARS);
> >>                         } else {
> >>                                 printk(KERN_ERR "PCI: Unknown option `%s'\n",
> >>                                                 str);
> >> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> >> index 201f9e5ff55c..bdaafc48dc4c 100644
> >> --- a/drivers/pci/probe.c
> >> +++ b/drivers/pci/probe.c
> >> @@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
> >>         return max;
> >>  }
> >>
> >> +/*
> >> + * Put all devices of the bus and its children to reset
> >> + */
> >> +static void pci_bus_reset_prepare(struct pci_bus *bus)
> >> +{
> >> +       struct pci_dev *dev;
> >> +
> >> +       list_for_each_entry(dev, &bus->devices, bus_list) {
> >> +               struct pci_bus *child = dev->subordinate;
> >> +
> >> +               if (child) {
> >> +                       pci_bus_reset_prepare(child);
> >> +               } else if (dev->driver &&
> >> +                          dev->driver->err_handler &&
> >> +                          dev->driver->err_handler->reset_prepare) {
> >> +                       dev->driver->err_handler->reset_prepare(dev);
> >> +               }
> >> +       }
> >> +}
> >> +
> >> +/*
> >> + * Complete the reset of all devices for the bus and its children
> >> + */
> >> +static void pci_bus_reset_done(struct pci_bus *bus)
> >> +{
> >> +       struct pci_dev *dev;
> >> +
> >> +       list_for_each_entry(dev, &bus->devices, bus_list) {
> >> +               struct pci_bus *child = dev->subordinate;
> >> +
> >> +               if (child) {
> >> +                       pci_bus_reset_done(child);
> >> +               } else if (dev->driver && dev->driver->err_handler &&
> >> +                          dev->driver->err_handler->reset_done) {
> >> +                       dev->driver->err_handler->reset_done(dev);
> >> +               }
> >> +       }
> >> +}
> >> +
> >>  /**
> >>   * pci_rescan_bus - Scan a PCI bus for devices
> >>   * @bus: PCI bus to scan
> >> @@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
> >>  {
> >>         unsigned int max;
> >>
> >> +       if (pci_has_flag(PCI_MOVABLE_BARS))
> >> +               pci_bus_reset_prepare(bus);
> >>         max = pci_scan_child_bus(bus);
> >>         pci_assign_unassigned_bus_resources(bus);
> >> +       if (pci_has_flag(PCI_MOVABLE_BARS))
> >> +               pci_bus_reset_done(bus);
> >>         pci_bus_add_devices(bus);
> >>
> >>         return max;
> >> diff --git a/include/linux/pci.h b/include/linux/pci.h
> >> index 6925828f9f25..a8cb1a367c34 100644
> >> --- a/include/linux/pci.h
> >> +++ b/include/linux/pci.h
> >> @@ -847,6 +847,7 @@ enum {
> >>         PCI_ENABLE_PROC_DOMAINS = 0x00000010,   /* Enable domains in /proc */
> >>         PCI_COMPAT_DOMAIN_0     = 0x00000020,   /* ... except domain 0 */
> >>         PCI_SCAN_ALL_PCIE_DEVS  = 0x00000040,   /* Scan all, not just dev 0 */
> >> +       PCI_MOVABLE_BARS        = 0x00000080,   /* Runtime BAR reassign after hotplug */
> >>  };
> >>
> >>  /* These external functions are only available when PCI support is enabled */
> >> --
> >> 2.17.1
> >>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan
  2018-09-18 21:22       ` Rajat Jain
@ 2019-01-11 17:24         ` Sergey Miroshnichenko
  0 siblings, 0 replies; 20+ messages in thread
From: Sergey Miroshnichenko @ 2019-01-11 17:24 UTC (permalink / raw)
  To: Rajat Jain; +Cc: linux-pci, Bjorn Helgaas, linux


[-- Attachment #1.1: Type: text/plain, Size: 14025 bytes --]

Hello Rajat,

Sorry for the long absence! I've finally reworked the patchset heavily
to address the proposals and issues (v2, sent today to linux-pci), so
now I can give a better answer:

On 9/19/18 12:22 AM, Rajat Jain wrote:
> Hi Sergey,
> 
> Sorry, I have one more basic question: The PCI express hierarchy is
> made up from PCI express switches and end points, so essentially there
> are only point to point links (except the switch internal logicl bus
> on which you cannot hotplug anything). The device being hotplugged
> shall be plugged onto a downstream port of a switch (or root port), so
> there shall be no other device on that bus (since each downstream port
> gets its own secondary bus number, and its own share of resources). So
> it seems to me, that in this case, IF the resources at the downstream
> port are not big enough, you will need to re-program the switch to get
> resources to this downstream port, by stealing from *another*
> downstream port. Is this correct understanding of the problem that you
> are trying to solve?

To some extent, yes.

The patchset is intended to work with the "pci=realloc" command line
argument, in this case the kernel will set bridge windows of zero size
for empty buses. And for non-empty slots their bridge windows would be
equal to to a sum of children BARs + alignments.

After hotplugging a device into a downstream port of a switch, first the
kernel needs to allocate a bridge window for this downstream port.
Neighboring empty slots have nothing to steal from, and busy slots can
not be shrunk (because all working devices must remain working after a
hotplug event).

So the switch's bridge window should be extended to fit the bridge
window for the downstream port to fit the BARs of the hotplugged device.

And here is the problem: switch's neighbors most probably reside very
closely, and they can be movable (support movable BARs, or don't have
connected drivers), immovable or movable in hard limits (depending on
location of children immovable BARs).

This patchset releases bridge windows and everything movable, marks the
immovable BARs with IORESOURCE_PCI_FIXED, and calculates for every
bridge its "fixed range" - the lowest and the highest addresses of
children immovable BARs to determine how far away we can move these
bridge windows.

Then the re-assignment starts. If it fails (the root bridge window is
not big enough to fit all the BARs or can't squeeze the new BARs between
the immovable ones), the hotplugged devices are to be marked with the
new flag PCI_DEV_IGNORE one by one, and re-assignment restarts.

Then the kernel writes the recalculated BAR addresses and bridge windows
into the ports and endpoints, and enables the MEM and IO flags if needed.

Please ask if I haven't covered some aspects!

Best regards,
Serge

> 
> Thanks,
> 
> Rajat
> 
> 
> 
> On Mon, Sep 17, 2018 at 2:25 PM Sergey Miroshnichenko
> <s.miroshnichenko@yadro.com> wrote:
>>
>> Hello Rajat,
>>
>> On 9/17/18 10:00 PM, Rajat Jain wrote:
>>> On Fri, Sep 14, 2018 at 9:21 AM Sergey Miroshnichenko
>>> <s.miroshnichenko@yadro.com> wrote:
>>>>
>>>> Introduce a new command line option "pci=pcie_movable_bars" that indicates
>>>> support of PCIe hotplug without prior reservation of memory regions by
>>>> BIOS/bootloader.
>>>>
>>>> If a new PCIe device has been hot-plugged between two active ones, which
>>>> have no (or not big enough) gap between their BARs, allocating new BARs
>>>> requires to move BARs of the following working devices:
>>>
>>> I think this is a great problem to solve. I have some questions (not
>>> objections, but trying to understand what are the limitations of the
>>> solution being proposed):
>>>
>>> * What about hot-pluggable root ports? Would this help (I'm guessing not)?
>>>
>>
>> We don't have a hardware that is able to handle a hotplug directly into
>> a root port (but we'll find) - there are intermediate bridges always, so
>> I can't provide proofs right now. But after a thought experiment I can't
>> see any obstacles, so just need to verify.
>>
>>> * What about situations where the root port itself does not have
>>> enough resources for the new device being inserted. I'm guessing we
>>> are not going to expand root port allocation in those cases. But do we
>>> fail gracefully rejecting the hotplug by not assigning it resources,
>>> or do we manage to screw up the already working healthy devices (while
>>> attempting to move them)?
>>>
>>
>> Thanks for a great point! The code currently is too simplistic to handle
>> this case properly, and though it never happened to us yet, but I'm sure
>> it can kick off a working device to use its resources for a hotplugged
>> one. The struct resource doesn't have a "priority" field, so probably I
>> need to check the status of the resource allocation procedure and retry
>> it without some of new hotplugged devices in case of fail.
>>
>>> * What about non-memory resources? E.g. cards may have pci bridges or
>>> switches on them, and they may need extra PCI bus numbers. Does this
>>> help that use case?
>>
>> Actually, this patchset is a part of our bigger work on hotplugging
>> bridges full of bridges full of NVME and GPU devices. And in case of
>> success here I was planning to submit the last part (hotplugging
>> bridges), which includes the movement of bus numbers - similar to BARs,
>> but more awkward, with PCIe devices renaming and re-building the
>> /sys/bus/pci and /proc/pci, otherwise the lspci tool gets mad.
>>
>> By device renaming I mean the "/dev/nvme0n1p1" and "eth0" will remain
>> the same, but underlying "nvme 0022:04:00.0" and "tg3 0001:06:00.1" may
>> change. In this regard I don't know which UDEV action is the best here -
>> it's neither "add", "remove" nor "change".
>>
>>
>> Also I've just realized that should take a closer look and check if MSIs
>> are also needs to be reallocated.
>>
>>>>
>>>> 1)                   dev 4
>>>>                        |
>>>>                        v
>>>> .. |  dev 3  |  dev 3  |  dev 5  |  dev 7  |
>>>> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>>>>
>>>> 2)                             dev 4
>>>>                                  |
>>>>                                  v
>>>> .. |  dev 3  |  dev 3  | -->           --> |  dev 5  |  dev 7  |
>>>> .. |  BAR 0  |  BAR 1  | -->           --> |  BAR 0  |  BAR 0  |
>>>>
>>>> 3)
>>>>
>>>> .. |  dev 3  |  dev 3  |  dev 4  |  dev 4  |  dev 5  |  dev 7  |
>>>> .. |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 1  |  BAR 0  |  BAR 0  |
>>>>
>>>> Not only BARs, but also bridge windows can be updated during a PCIe rescan,
>>>> threatening all memory transactions during this procedure, so the PCI
>>>> subsystem will instruct the drivers to pause by calling the reset_prepare()
>>>> and reset_done() callbacks.
>>>>
>>>> If a device may be affected by BAR movement, the BAR changes tracking must
>>>> be implemented in its driver.
>>>
>>> This is quite a big thing right? I expect that this will break a lot
>>> of drivers because
>>> (a) they may not have reset_prepare() and reset_done() callbacks (I
>>> grepped in the sources and seems only 4 support it?).
>>> (b) Now, it is expected that in the reset_done() the drivers should
>>> re-read the BAR and remap any memory resources that may have changed.
>>> Note this may also include cleaning up any existing resource mappings
>>> that they had before they were paused. Not sure if this was the
>>> semantics of reset_done() already, or may be it was, I'm just not
>>> sure.
>>>
>>
>> Yes, this is the most painful part: all the drivers that may be affected
>> with BARs getting moved, must be prepared in this way. Luckily, these
>> operations have distinct patterns, just like power management callbacks
>> like runtime_suspend()/runtime_resume() and friends.
>>
>> Just a few minutes earlier in this thread I've wrote to Sam why we think
>> this approach still can be used even with a minimal subset of drivers
>> being prepared: for now we've limited the configurations of hotplug
>> usage to non-root bridges and to domains with similar devices. If this
>> concept looks viable to the community, we'll be happy to share the
>> patches for the nvme, ixgbe and tg3 drivers (and a bit later - for
>> xhci). Right now these are the only drivers _we_ need on our ppc64
>> (PowerNV) and x86_64 servers.
>>
>>> Thanks,
>>>
>>> Rajat
>>>
>>
>> Thanks for the review!
>>
>> Best regards,
>> Serge
>>
>>>>
>>>> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
>>>> ---
>>>>  .../admin-guide/kernel-parameters.txt         |  6 +++
>>>>  drivers/pci/pci.c                             |  2 +
>>>>  drivers/pci/probe.c                           | 43 +++++++++++++++++++
>>>>  include/linux/pci.h                           |  1 +
>>>>  4 files changed, 52 insertions(+)
>>>>
>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>>> index 64a3bf54b974..f8132a709061 100644
>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>> @@ -3311,6 +3311,12 @@
>>>>                                 bridges without forcing it upstream. Note:
>>>>                                 this removes isolation between devices and
>>>>                                 may put more devices in an IOMMU group.
>>>> +               pcie_movable_bars       Arrange a space at runtime for BARs of
>>>> +                               hotplugged PCIe devices - usable if bootloader
>>>> +                               doesn't reserve memory regions for them. Freeing
>>>> +                               a space may require moving BARs of active devices
>>>> +                               to higher addresses, so device drivers will be
>>>> +                               paused during rescan.
>>>>
>>>>         pcie_aspm=      [PCIE] Forcibly enable or disable PCIe Active State Power
>>>>                         Management.
>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>> index 1835f3a7aa8d..5f07a59b5924 100644
>>>> --- a/drivers/pci/pci.c
>>>> +++ b/drivers/pci/pci.c
>>>> @@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str)
>>>>                                 pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>>>>                         } else if (!strncmp(str, "disable_acs_redir=", 18)) {
>>>>                                 disable_acs_redir_param = str + 18;
>>>> +                       } else if (!strncmp(str, "pcie_movable_bars", 17)) {
>>>> +                               pci_add_flags(PCI_MOVABLE_BARS);
>>>>                         } else {
>>>>                                 printk(KERN_ERR "PCI: Unknown option `%s'\n",
>>>>                                                 str);
>>>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>>>> index 201f9e5ff55c..bdaafc48dc4c 100644
>>>> --- a/drivers/pci/probe.c
>>>> +++ b/drivers/pci/probe.c
>>>> @@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
>>>>         return max;
>>>>  }
>>>>
>>>> +/*
>>>> + * Put all devices of the bus and its children to reset
>>>> + */
>>>> +static void pci_bus_reset_prepare(struct pci_bus *bus)
>>>> +{
>>>> +       struct pci_dev *dev;
>>>> +
>>>> +       list_for_each_entry(dev, &bus->devices, bus_list) {
>>>> +               struct pci_bus *child = dev->subordinate;
>>>> +
>>>> +               if (child) {
>>>> +                       pci_bus_reset_prepare(child);
>>>> +               } else if (dev->driver &&
>>>> +                          dev->driver->err_handler &&
>>>> +                          dev->driver->err_handler->reset_prepare) {
>>>> +                       dev->driver->err_handler->reset_prepare(dev);
>>>> +               }
>>>> +       }
>>>> +}
>>>> +
>>>> +/*
>>>> + * Complete the reset of all devices for the bus and its children
>>>> + */
>>>> +static void pci_bus_reset_done(struct pci_bus *bus)
>>>> +{
>>>> +       struct pci_dev *dev;
>>>> +
>>>> +       list_for_each_entry(dev, &bus->devices, bus_list) {
>>>> +               struct pci_bus *child = dev->subordinate;
>>>> +
>>>> +               if (child) {
>>>> +                       pci_bus_reset_done(child);
>>>> +               } else if (dev->driver && dev->driver->err_handler &&
>>>> +                          dev->driver->err_handler->reset_done) {
>>>> +                       dev->driver->err_handler->reset_done(dev);
>>>> +               }
>>>> +       }
>>>> +}
>>>> +
>>>>  /**
>>>>   * pci_rescan_bus - Scan a PCI bus for devices
>>>>   * @bus: PCI bus to scan
>>>> @@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
>>>>  {
>>>>         unsigned int max;
>>>>
>>>> +       if (pci_has_flag(PCI_MOVABLE_BARS))
>>>> +               pci_bus_reset_prepare(bus);
>>>>         max = pci_scan_child_bus(bus);
>>>>         pci_assign_unassigned_bus_resources(bus);
>>>> +       if (pci_has_flag(PCI_MOVABLE_BARS))
>>>> +               pci_bus_reset_done(bus);
>>>>         pci_bus_add_devices(bus);
>>>>
>>>>         return max;
>>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>>> index 6925828f9f25..a8cb1a367c34 100644
>>>> --- a/include/linux/pci.h
>>>> +++ b/include/linux/pci.h
>>>> @@ -847,6 +847,7 @@ enum {
>>>>         PCI_ENABLE_PROC_DOMAINS = 0x00000010,   /* Enable domains in /proc */
>>>>         PCI_COMPAT_DOMAIN_0     = 0x00000020,   /* ... except domain 0 */
>>>>         PCI_SCAN_ALL_PCIE_DEVS  = 0x00000040,   /* Scan all, not just dev 0 */
>>>> +       PCI_MOVABLE_BARS        = 0x00000080,   /* Runtime BAR reassign after hotplug */
>>>>  };
>>>>
>>>>  /* These external functions are only available when PCI support is enabled */
>>>> --
>>>> 2.17.1
>>>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-01-11 17:25 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-14 16:14 [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug Sergey Miroshnichenko
2018-09-14 16:14 ` [PATCH RFC 1/4] PCI: hotplug: Add parameter to put devices to reset during rescan Sergey Miroshnichenko
2018-09-17  5:28   ` Sam Bobroff
2018-09-17 20:55     ` Sergey Miroshnichenko
2018-09-17 22:59       ` Bjorn Helgaas
2018-09-18 14:01         ` Sergey Miroshnichenko
2018-09-18 21:10           ` Bjorn Helgaas
2018-09-17 23:35       ` Oliver
2018-09-18 16:51         ` Sergey Miroshnichenko
2018-09-17 19:00   ` Rajat Jain
2018-09-17 19:38     ` Lukas Wunner
2018-09-17 19:44       ` Rajat Jain
2018-09-17 21:25     ` Sergey Miroshnichenko
2018-09-18 21:22       ` Rajat Jain
2019-01-11 17:24         ` Sergey Miroshnichenko
2018-09-14 16:14 ` [PATCH RFC 2/4] PCI: Release and reassign resources from the root " Sergey Miroshnichenko
2018-09-14 16:14 ` [PATCH RFC 3/4] PCI: Invalidate the released BAR resources Sergey Miroshnichenko
2018-09-14 16:14 ` [PATCH RFC 4/4] PCI: Fix writing invalid BARs during pci_restore_state() Sergey Miroshnichenko
2018-09-18 11:16 ` [PATCH RFC 0/4] PCI: Allow BAR movement during hotplug David Laight
2018-09-18 17:07   ` Sergey Miroshnichenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).