All of lore.kernel.org
* [PATCHv4 next 0/3] Limiting pci access
@ 2016-10-28 22:58 Keith Busch
  2016-10-28 22:58 ` [PATCHv4 next 1/3] pci: Add is_removed state Keith Busch
                   ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Keith Busch @ 2016-10-28 22:58 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas; +Cc: Lukas Wunner, Wei Zhang, Keith Busch

Here's version 4. Two of the patches from v3 have already been applied
upstream, so this is down to three patches now.

The main difference from v3 is setting subordinates to removed when a
removal is detected (patch 1/3).

Patch 2/3 short-circuits pci_device_is_present() if is_removed is set.

Otherwise, the same as before. Just to restate the justification, this
significantly speeds up many device removals without relying on hardware.

Keith Busch (3):
  pci: Add is_removed state
  pci: No config access for removed devices
  pci/msix: Skip disabling removed devices

 drivers/pci/hotplug/pciehp_pci.c |  5 +++++
 drivers/pci/msi.c                |  7 ++++++-
 drivers/pci/pci.c                |  2 ++
 drivers/pci/pcie/pcie-dpc.c      |  4 ++++
 include/linux/pci.h              | 25 +++++++++++++++++++++++++
 5 files changed, 42 insertions(+), 1 deletion(-)

-- 
2.7.2


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCHv4 next 1/3] pci: Add is_removed state
  2016-10-28 22:58 [PATCHv4 next 0/3] Limiting pci access Keith Busch
@ 2016-10-28 22:58 ` Keith Busch
  2016-10-31 10:41   ` Lukas Wunner
  2016-12-13 20:56   ` Bjorn Helgaas
  2016-10-28 22:58 ` [PATCHv4 next 2/3] pci: No config access for removed devices Keith Busch
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 42+ messages in thread
From: Keith Busch @ 2016-10-28 22:58 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas; +Cc: Lukas Wunner, Wei Zhang, Keith Busch

This adds a new state for devices that were once in the system but were
unexpectedly removed, so that device tear-down functions can observe that
the device is no longer accessible and skip attempting to access the
hardware.

The pciehp and pcie-dpc drivers are aware of when the link is down, so
they explicitly set this flag when their handlers detect that the device
is gone.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Cc: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/hotplug/pciehp_pci.c | 5 +++++
 drivers/pci/pcie/pcie-dpc.c      | 4 ++++
 include/linux/pci.h              | 7 +++++++
 3 files changed, 16 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
index 9e69403..7560961 100644
--- a/drivers/pci/hotplug/pciehp_pci.c
+++ b/drivers/pci/hotplug/pciehp_pci.c
@@ -109,6 +109,11 @@ int pciehp_unconfigure_device(struct slot *p_slot)
 				break;
 			}
 		}
+		if (!presence) {
+			pci_set_removed(dev, NULL);
+			if (pci_has_subordinate(dev))
+				pci_walk_bus(dev->subordinate, pci_set_removed, NULL);
+		}
 		pci_stop_and_remove_bus_device(dev);
 		/*
 		 * Ensure that no new Requests will be generated from
diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
index 9811b14..7818c88 100644
--- a/drivers/pci/pcie/pcie-dpc.c
+++ b/drivers/pci/pcie/pcie-dpc.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/pci.h>
 #include <linux/pcieport_if.h>
+#include "../pci.h"
 
 struct dpc_dev {
 	struct pcie_device	*dev;
@@ -46,6 +47,9 @@ static void interrupt_event_handler(struct work_struct *work)
 	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
 					 bus_list) {
 		pci_dev_get(dev);
+		pci_set_removed(dev, NULL);
+		if (pci_has_subordinate(dev))
+			pci_walk_bus(dev->subordinate, pci_set_removed, NULL);
 		pci_stop_and_remove_bus_device(dev);
 		pci_dev_put(dev);
 	}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 0e49f70..2115d19 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -341,6 +341,7 @@ struct pci_dev {
 	unsigned int	multifunction:1;/* Part of multi-function device */
 	/* keep track of device state */
 	unsigned int	is_added:1;
+	unsigned int	is_removed:1;	/* device was surprise removed */
 	unsigned int	is_busmaster:1; /* device is busmaster */
 	unsigned int	no_msi:1;	/* device may not use msi */
 	unsigned int	no_64bit_msi:1; /* device may only use 32-bit MSIs */
@@ -417,6 +418,12 @@ static inline int pci_channel_offline(struct pci_dev *pdev)
 	return (pdev->error_state != pci_channel_io_normal);
 }
 
+static inline int pci_set_removed(struct pci_dev *pdev, void *unused)
+{
+	pdev->is_removed = 1;
+	return 0;
+}
+
 struct pci_host_bridge {
 	struct device dev;
 	struct pci_bus *bus;		/* root bus */
-- 
2.7.2



* [PATCHv4 next 2/3] pci: No config access for removed devices
  2016-10-28 22:58 [PATCHv4 next 0/3] Limiting pci access Keith Busch
  2016-10-28 22:58 ` [PATCHv4 next 1/3] pci: Add is_removed state Keith Busch
@ 2016-10-28 22:58 ` Keith Busch
  2016-10-31 12:18   ` Lukas Wunner
  2016-10-28 22:58 ` [PATCHv4 next 3/3] pci/msix: Skip disabling " Keith Busch
  2016-11-18 23:25 ` [PATCHv4 next 0/3] Limiting pci access Keith Busch
  3 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2016-10-28 22:58 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas; +Cc: Lukas Wunner, Wei Zhang, Keith Busch

If we've detected that a PCI device has been removed, there is no need to
access the device's config space. This patch has all the device config
reads and writes return an error when the device is in such a state.

If the device is not present on a config read, the returned value is set
to all 1's. This is the same as what hardware is expected to return when
accessing a removed device, but software can do it faster without relying
on the hardware.

When checking whether a device is present, we short-circuit the check for
removed devices by returning the known state instead of querying the bus.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Cc: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/pci.c   |  2 ++
 include/linux/pci.h | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index ba34907..c412095 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4966,6 +4966,8 @@ bool pci_device_is_present(struct pci_dev *pdev)
 {
 	u32 v;
 
+	if (pdev->is_removed)
+		return false;
 	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
 }
 EXPORT_SYMBOL_GPL(pci_device_is_present);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2115d19..a92eedb 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -936,28 +936,46 @@ struct pci_ops *pci_bus_set_ops(struct pci_bus *bus, struct pci_ops *ops);
 
 static inline int pci_read_config_byte(const struct pci_dev *dev, int where, u8 *val)
 {
+	if (unlikely(dev->is_removed)) {
+		*val = ~0;
+		return -ENODEV;
+	}
 	return pci_bus_read_config_byte(dev->bus, dev->devfn, where, val);
 }
 static inline int pci_read_config_word(const struct pci_dev *dev, int where, u16 *val)
 {
+	if (unlikely(dev->is_removed)) {
+		*val = ~0;
+		return -ENODEV;
+	}
 	return pci_bus_read_config_word(dev->bus, dev->devfn, where, val);
 }
 static inline int pci_read_config_dword(const struct pci_dev *dev, int where,
 					u32 *val)
 {
+	if (unlikely(dev->is_removed)) {
+		*val = ~0;
+		return -ENODEV;
+	}
 	return pci_bus_read_config_dword(dev->bus, dev->devfn, where, val);
 }
 static inline int pci_write_config_byte(const struct pci_dev *dev, int where, u8 val)
 {
+	if (unlikely(dev->is_removed))
+		return -ENODEV;
 	return pci_bus_write_config_byte(dev->bus, dev->devfn, where, val);
 }
 static inline int pci_write_config_word(const struct pci_dev *dev, int where, u16 val)
 {
+	if (unlikely(dev->is_removed))
+		return -ENODEV;
 	return pci_bus_write_config_word(dev->bus, dev->devfn, where, val);
 }
 static inline int pci_write_config_dword(const struct pci_dev *dev, int where,
 					 u32 val)
 {
+	if (unlikely(dev->is_removed))
+		return -ENODEV;
 	return pci_bus_write_config_dword(dev->bus, dev->devfn, where, val);
 }
 
-- 
2.7.2



* [PATCHv4 next 3/3] pci/msix: Skip disabling removed devices
  2016-10-28 22:58 [PATCHv4 next 0/3] Limiting pci access Keith Busch
  2016-10-28 22:58 ` [PATCHv4 next 1/3] pci: Add is_removed state Keith Busch
  2016-10-28 22:58 ` [PATCHv4 next 2/3] pci: No config access for removed devices Keith Busch
@ 2016-10-28 22:58 ` Keith Busch
  2016-10-31 11:00   ` Lukas Wunner
  2016-12-13 21:18   ` Bjorn Helgaas
  2016-11-18 23:25 ` [PATCHv4 next 0/3] Limiting pci access Keith Busch
  3 siblings, 2 replies; 42+ messages in thread
From: Keith Busch @ 2016-10-28 22:58 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas; +Cc: Lukas Wunner, Wei Zhang, Keith Busch

There is no need to disable MSI-X interrupts or write message data on a
device that isn't present. This patch checks the device's removed state
before executing device shutdown operations so that tear-down of removed
devices completes more quickly.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Cc: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/msi.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index bfdd074..9249ec1 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -317,7 +317,7 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 {
 	struct pci_dev *dev = msi_desc_to_pci_dev(entry);
 
-	if (dev->current_state != PCI_D0) {
+	if (dev->current_state != PCI_D0 || dev->is_removed) {
 		/* Don't touch the hardware now */
 	} else if (entry->msi_attrib.is_msix) {
 		void __iomem *base = pci_msix_desc_addr(entry);
@@ -1017,6 +1017,11 @@ void pci_msix_shutdown(struct pci_dev *dev)
 	if (!pci_msi_enable || !dev || !dev->msix_enabled)
 		return;
 
+	if (dev->is_removed) {
+		dev->msix_enabled = 0;
+		return;
+	}
+
 	/* Return the device with MSI-X masked as initial states */
 	for_each_pci_msi_entry(entry, dev) {
 		/* Keep cached states to be restored */
-- 
2.7.2



* Re: [PATCHv4 next 1/3] pci: Add is_removed state
  2016-10-28 22:58 ` [PATCHv4 next 1/3] pci: Add is_removed state Keith Busch
@ 2016-10-31 10:41   ` Lukas Wunner
  2016-12-13 20:56   ` Bjorn Helgaas
  1 sibling, 0 replies; 42+ messages in thread
From: Lukas Wunner @ 2016-10-31 10:41 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Wei Zhang, Andreas Noever

On Fri, Oct 28, 2016 at 06:58:15PM -0400, Keith Busch wrote:
> This adds a new state for devices that were once in the system but were
> unexpectedly removed, so that device tear-down functions can observe that
> the device is no longer accessible and skip attempting to access the
> hardware.
> 
> The pciehp and pcie-dpc drivers are aware of when the link is down, so
> they explicitly set this flag when their handlers detect that the device
> is gone.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>

Reviewed-by: Lukas Wunner <lukas@wunner.de>

I'll send a follow-up patch in a few minutes which leverages the
is_removed flag to fix a soft lockup on surprise removal of the
Apple Thunderbolt Gigabit Ethernet adapter.

Thanks,

Lukas

> ---
>  drivers/pci/hotplug/pciehp_pci.c | 5 +++++
>  drivers/pci/pcie/pcie-dpc.c      | 4 ++++
>  include/linux/pci.h              | 7 +++++++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
> index 9e69403..7560961 100644
> --- a/drivers/pci/hotplug/pciehp_pci.c
> +++ b/drivers/pci/hotplug/pciehp_pci.c
> @@ -109,6 +109,11 @@ int pciehp_unconfigure_device(struct slot *p_slot)
>  				break;
>  			}
>  		}
> +		if (!presence) {
> +			pci_set_removed(dev, NULL);
> +			if (pci_has_subordinate(dev))
> +				pci_walk_bus(dev->subordinate, pci_set_removed, NULL);
> +		}
>  		pci_stop_and_remove_bus_device(dev);
>  		/*
>  		 * Ensure that no new Requests will be generated from
> diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
> index 9811b14..7818c88 100644
> --- a/drivers/pci/pcie/pcie-dpc.c
> +++ b/drivers/pci/pcie/pcie-dpc.c
> @@ -14,6 +14,7 @@
>  #include <linux/init.h>
>  #include <linux/pci.h>
>  #include <linux/pcieport_if.h>
> +#include "../pci.h"
>  
>  struct dpc_dev {
>  	struct pcie_device	*dev;
> @@ -46,6 +47,9 @@ static void interrupt_event_handler(struct work_struct *work)
>  	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
>  					 bus_list) {
>  		pci_dev_get(dev);
> +		pci_set_removed(dev, NULL);
> +		if (pci_has_subordinate(dev))
> +			pci_walk_bus(dev->subordinate, pci_set_removed, NULL);
>  		pci_stop_and_remove_bus_device(dev);
>  		pci_dev_put(dev);
>  	}
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 0e49f70..2115d19 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -341,6 +341,7 @@ struct pci_dev {
>  	unsigned int	multifunction:1;/* Part of multi-function device */
>  	/* keep track of device state */
>  	unsigned int	is_added:1;
> +	unsigned int	is_removed:1;	/* device was surprise removed */
>  	unsigned int	is_busmaster:1; /* device is busmaster */
>  	unsigned int	no_msi:1;	/* device may not use msi */
>  	unsigned int	no_64bit_msi:1; /* device may only use 32-bit MSIs */
> @@ -417,6 +418,12 @@ static inline int pci_channel_offline(struct pci_dev *pdev)
>  	return (pdev->error_state != pci_channel_io_normal);
>  }
>  
> +static inline int pci_set_removed(struct pci_dev *pdev, void *unused)
> +{
> +	pdev->is_removed = 1;
> +	return 0;
> +}
> +
>  struct pci_host_bridge {
>  	struct device dev;
>  	struct pci_bus *bus;		/* root bus */
> -- 
> 2.7.2


* Re: [PATCHv4 next 3/3] pci/msix: Skip disabling removed devices
  2016-10-28 22:58 ` [PATCHv4 next 3/3] pci/msix: Skip disabling " Keith Busch
@ 2016-10-31 11:00   ` Lukas Wunner
  2016-10-31 13:54     ` Keith Busch
  2016-12-13 21:18   ` Bjorn Helgaas
  1 sibling, 1 reply; 42+ messages in thread
From: Lukas Wunner @ 2016-10-31 11:00 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Wei Zhang

On Fri, Oct 28, 2016 at 06:58:17PM -0400, Keith Busch wrote:
> There is no need to disable MSI-X interrupts or write message data on a
> device that isn't present. This patch checks the device's removed state
> before executing device shutdown operations so that tear-down of removed
> devices completes more quickly.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> Cc: Lukas Wunner <lukas@wunner.de>
> ---
>  drivers/pci/msi.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index bfdd074..9249ec1 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -317,7 +317,7 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
>  {
>  	struct pci_dev *dev = msi_desc_to_pci_dev(entry);
>  
> -	if (dev->current_state != PCI_D0) {
> +	if (dev->current_state != PCI_D0 || dev->is_removed) {
>  		/* Don't touch the hardware now */
>  	} else if (entry->msi_attrib.is_msix) {
>  		void __iomem *base = pci_msix_desc_addr(entry);

Hm, I don't see __pci_write_msi_msg() being called anywhere on device
removal.  Am I missing something?  Rest of the patch LGTM.

Thanks,

Lukas

> @@ -1017,6 +1017,11 @@ void pci_msix_shutdown(struct pci_dev *dev)
>  	if (!pci_msi_enable || !dev || !dev->msix_enabled)
>  		return;
>  
> +	if (dev->is_removed) {
> +		dev->msix_enabled = 0;
> +		return;
> +	}
> +
>  	/* Return the device with MSI-X masked as initial states */
>  	for_each_pci_msi_entry(entry, dev) {
>  		/* Keep cached states to be restored */
> -- 
> 2.7.2


* Re: [PATCHv4 next 2/3] pci: No config access for removed devices
  2016-10-28 22:58 ` [PATCHv4 next 2/3] pci: No config access for removed devices Keith Busch
@ 2016-10-31 12:18   ` Lukas Wunner
  0 siblings, 0 replies; 42+ messages in thread
From: Lukas Wunner @ 2016-10-31 12:18 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Wei Zhang

On Fri, Oct 28, 2016 at 06:58:16PM -0400, Keith Busch wrote:
> If we've detected that a PCI device has been removed, there is no need to
> access the device's config space. This patch has all the device config
> reads and writes return an error when the device is in such a state.
> 
> If the device is not present on a config read, the returned value is set
> to all 1's. This is the same as what hardware is expected to return when
> accessing a removed device, but software can do it faster without relying
> on the hardware.
> 
> When checking whether a device is present, we short-circuit the check for
> removed devices by returning the known state instead of querying the bus.

I went through all the function calls that occur in the PCI core and
the pciehp driver on device removal and the result is the patch below.
This, together with your change to pci_device_is_present() in patch [2/3]
and your change to MSI-X shutdown in patch [3/3] should be sufficient to
avoid config space access on teardown of a surprise-removed device.

The changes in patch [2/3] to pci_read_config*() and pci_write_config*()
would thus become redundant.  I had originally suggested to check the
is_removed flag in them, but later realized that it causes a small
performance penalty when they're called while the device is still present,
which is the normal case.  Also, while the actual device access is avoided,
CPU cycles are still wasted for various locking and calculation of
to-be-written register contents in the config space.

However it might still make sense to keep the changes to pci_read_config*()
and pci_write_config*() in order to prevent device accesses in drivers.
That's ultimately for Bjorn to decide.

Thanks,

Lukas

-- >8 --
Subject: [PATCH] PCI: Skip access to surprise removed devices

Skip config space access to surprise removed devices in the PCI core and
pciehp driver to speed up teardown.

Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/hotplug/pciehp_hpc.c |  3 +++
 drivers/pci/msi.c                |  5 +++++
 drivers/pci/pci.c                | 12 +++++++-----
 drivers/pci/pcie/aspm.c          |  4 ++++
 4 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 4582fdf..f68703e 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -684,6 +684,9 @@ static void pcie_disable_notification(struct controller *ctrl)
 {
 	u16 mask;
 
+	if (ctrl_dev(ctrl)->is_removed)
+		return;
+
 	mask = (PCI_EXP_SLTCTL_PDCE | PCI_EXP_SLTCTL_ABPE |
 		PCI_EXP_SLTCTL_MRLSCE | PCI_EXP_SLTCTL_PFDE |
 		PCI_EXP_SLTCTL_HPIE | PCI_EXP_SLTCTL_CCIE |
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index cc13332..db1462c 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -894,6 +894,11 @@ void pci_msi_shutdown(struct pci_dev *dev)
 	if (!pci_msi_enable || !dev || !dev->msi_enabled)
 		return;
 
+	if (dev->is_removed) {
+		dev->msi_enabled = 0;
+		return;
+	}
+
 	BUG_ON(list_empty(dev_to_msi_list(&dev->dev)));
 	desc = first_pci_msi_entry(dev);
 
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 55c0c79..d4a47b2 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1577,10 +1577,12 @@ static void do_pci_disable_device(struct pci_dev *dev)
 {
 	u16 pci_command;
 
-	pci_read_config_word(dev, PCI_COMMAND, &pci_command);
-	if (pci_command & PCI_COMMAND_MASTER) {
-		pci_command &= ~PCI_COMMAND_MASTER;
-		pci_write_config_word(dev, PCI_COMMAND, pci_command);
+	if (!dev->is_removed) {
+		pci_read_config_word(dev, PCI_COMMAND, &pci_command);
+		if (pci_command & PCI_COMMAND_MASTER) {
+			pci_command &= ~PCI_COMMAND_MASTER;
+			pci_write_config_word(dev, PCI_COMMAND, pci_command);
+		}
 	}
 
 	pcibios_disable_device(dev);
@@ -1771,7 +1773,7 @@ static void __pci_pme_active(struct pci_dev *dev, bool enable)
 {
 	u16 pmcsr;
 
-	if (!dev->pme_support)
+	if (!dev->pme_support || dev->is_removed)
 		return;
 
 	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index 0ec649d..8bddaef 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -437,6 +437,9 @@ static void pcie_config_aspm_link(struct pcie_link_state *link, u32 state)
 	state &= (link->aspm_capable & ~link->aspm_disable);
 	if (link->aspm_enabled == state)
 		return;
+	/* Skip device access if bridge was surprise-removed */
+	if (parent->is_removed)
+		goto set_state;
 	/* Convert ASPM state to upstream/downstream ASPM register state */
 	if (state & ASPM_STATE_L0S_UP)
 		dwstream |= PCI_EXP_LNKCTL_ASPM_L0S;
@@ -459,6 +462,7 @@ static void pcie_config_aspm_link(struct pcie_link_state *link, u32 state)
 	if (!(state & ASPM_STATE_L1))
 		pcie_config_aspm_dev(parent, upstream);
 
+set_state:
 	link->aspm_enabled = state;
 }
 
-- 
2.9.3



* Re: [PATCHv4 next 3/3] pci/msix: Skip disabling removed devices
  2016-10-31 11:00   ` Lukas Wunner
@ 2016-10-31 13:54     ` Keith Busch
  0 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2016-10-31 13:54 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: linux-pci, Bjorn Helgaas, Wei Zhang

On Mon, Oct 31, 2016 at 12:00:40PM +0100, Lukas Wunner wrote:
> On Fri, Oct 28, 2016 at 06:58:17PM -0400, Keith Busch wrote:
> > There is no need to disable MSI-X interrupts or write message data on a
> > device that isn't present. This patch checks the device's removed state
> > before executing device shutdown operations so that tear-down of removed
> > devices completes more quickly.
> > 
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > Cc: Lukas Wunner <lukas@wunner.de>
> > ---
> >  drivers/pci/msi.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index bfdd074..9249ec1 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -317,7 +317,7 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
> >  {
> >  	struct pci_dev *dev = msi_desc_to_pci_dev(entry);
> >  
> > -	if (dev->current_state != PCI_D0) {
> > +	if (dev->current_state != PCI_D0 || dev->is_removed) {
> >  		/* Don't touch the hardware now */
> >  	} else if (entry->msi_attrib.is_msix) {
> >  		void __iomem *base = pci_msix_desc_addr(entry);
> 
> Hm, I don't see __pci_write_msi_msg() being called anywhere on device
> removal.  Am I missing something?  Rest of the patch LGTM.

Prior to calling pci_disable_msix() as part of typical driver unbinding,
the driver needs to call free_irq() for every registered action. That
path reaches __pci_write_msi_msg() when it tries to mask the vector.


* Re: [PATCHv4 next 0/3] Limiting pci access
  2016-10-28 22:58 [PATCHv4 next 0/3] Limiting pci access Keith Busch
                   ` (2 preceding siblings ...)
  2016-10-28 22:58 ` [PATCHv4 next 3/3] pci/msix: Skip disabling " Keith Busch
@ 2016-11-18 23:25 ` Keith Busch
  2016-11-23 16:09   ` Bjorn Helgaas
  3 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2016-11-18 23:25 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas; +Cc: Lukas Wunner, Wei Zhang

On Fri, Oct 28, 2016 at 06:58:14PM -0400, Keith Busch wrote:
> Here's version 4. Two of the patches from v3 have already been applied
> upstream, so this is down to three patches now.
> 
> The main difference from v3 is setting subordinates to removed when a
> removal is detected (patch 1/3).
> 
> Patch 2/3 short-circuits pci_device_is_present() if is_removed is set.
> 
> Otherwise, the same as before. Just to restate the justification, this
> significantly speeds up many device removals without relying on hardware.
> 
> Keith Busch (3):
>   pci: Add is_removed state
>   pci: No config access for removed devices
>   pci/msix: Skip disabling removed devices
> 
>  drivers/pci/hotplug/pciehp_pci.c |  5 +++++
>  drivers/pci/msi.c                |  7 ++++++-
>  drivers/pci/pci.c                |  2 ++
>  drivers/pci/pcie/pcie-dpc.c      |  4 ++++
>  include/linux/pci.h              | 25 +++++++++++++++++++++++++
>  5 files changed, 42 insertions(+), 1 deletion(-)

Hi Bjorn,

Just wanted to check in with you on this series. Any thoughts on
this? I've been pinged by several hardware vendors about this since they
say it improves system stability. I keep telling them to bug you about
it, but I guess they think I'm less intimidating. :)

Thanks,
Keith


* Re: [PATCHv4 next 0/3] Limiting pci access
  2016-11-18 23:25 ` [PATCHv4 next 0/3] Limiting pci access Keith Busch
@ 2016-11-23 16:09   ` Bjorn Helgaas
  2016-11-28  9:14     ` Wei Zhang
  2016-11-28 18:02     ` Keith Busch
  0 siblings, 2 replies; 42+ messages in thread
From: Bjorn Helgaas @ 2016-11-23 16:09 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Lukas Wunner, Wei Zhang

On Fri, Nov 18, 2016 at 06:25:44PM -0500, Keith Busch wrote:
> On Fri, Oct 28, 2016 at 06:58:14PM -0400, Keith Busch wrote:
> > Here's version 4. Two of the patches from v3 have already been applied
> > upstream, so this is down to three patches now.
> > 
> > The main difference from v3 is setting subordinates to removed when a
> > removal is detected (patch 1/3).
> > 
> > Patch 2/3 short-circuits pci_device_is_present() if is_removed is set.
> > 
> > Otherwise, the same as before. Just to restate the justification, this
> > significantly speeds up many device removals without relying on hardware.
> > 
> > Keith Busch (3):
> >   pci: Add is_removed state
> >   pci: No config access for removed devices
> >   pci/msix: Skip disabling removed devices
> > 
> >  drivers/pci/hotplug/pciehp_pci.c |  5 +++++
> >  drivers/pci/msi.c                |  7 ++++++-
> >  drivers/pci/pci.c                |  2 ++
> >  drivers/pci/pcie/pcie-dpc.c      |  4 ++++
> >  include/linux/pci.h              | 25 +++++++++++++++++++++++++
> >  5 files changed, 42 insertions(+), 1 deletion(-)
> 
> Hi Bjorn,
> 
> Just wanted to check in with you on this series. Any thoughts on
> this? I've been pinged by several hardware vendors about this since they
> say it improves system stability. I keep telling them to bug you about
> it, but I guess they think I'm less intimidating. :)

I'd think they'd want somebody *more* intimidating to encourage
progress :)

Sorry I haven't had a chance to look at these yet.  I want to think
about them a little more because it seems like these should be
optimizations, not really fixes.  If they improve stability by fixing
Linux issues, details of those issues would help.  But maybe the
improvement is from avoiding things the hardware doesn't handle quite
correctly.

And I'm always hesitant about things like flags that say "this device
is being removed but hasn't been removed yet" because the remove
process is already complicated and we've tripped over races and
ordering issues in the past.  That's not to say this is a bad
approach; it's just an excuse for why I need some time to work through
these patches.

Bjorn


* Re: [PATCHv4 next 0/3] Limiting pci access
  2016-11-23 16:09   ` Bjorn Helgaas
@ 2016-11-28  9:14     ` Wei Zhang
  2016-11-28 10:22       ` Lukas Wunner
  2016-11-28 18:02     ` Keith Busch
  1 sibling, 1 reply; 42+ messages in thread
From: Wei Zhang @ 2016-11-28  9:14 UTC (permalink / raw)
  To: Bjorn Helgaas, Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Lukas Wunner

Hi Bjorn,

This is Wei and I work in the Facebook Storage Hardware group.  The series
of changes made by Keith are to support the PCIe hot plug testing on our
Lightning platform (Thanks Keith!).  Lightning is a "Just of Bunch of
Flash", or JBOF, storage platform that has a PCIe switch connecting to 15
NVMe SSDs.  One of our goals is to be able to surprise hot remove/plug
each SSD in the chassis.  These changes fix issues found during surprise
hot plug testing.  Specifically, they reduce PCIe register accesses during
a surprise hot remove event.  We have incorporated Keith's patches and
tested them on our Lightning platform.  They worked well.

For a detailed description of the Lightning platform, please refer to
https://code.facebook.com/posts/989638804458007/introducing-lightning-a-flexible-nvme-jbof/

Hope this gives you some background info on these changes.  Appreciate
your reviews as well!

Thanks!
-Wei

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2016-11-28  9:14     ` Wei Zhang
@ 2016-11-28 10:22       ` Lukas Wunner
  0 siblings, 0 replies; 42+ messages in thread
From: Lukas Wunner @ 2016-11-28 10:22 UTC (permalink / raw)
  To: Wei Zhang; +Cc: Bjorn Helgaas, Keith Busch, linux-pci

On Mon, Nov 28, 2016 at 09:14:39AM +0000, Wei Zhang wrote:
> The series of changes made by Keith are to support the PCIe hot plug
> testing on our Lightning platform (Thanks Keith!).  Lightning is a
> "Just of Bunch of Flash", or JBOF, storage platform that has a PCIe
> switch connecting to 15 NVMe SSDs.

Are the hotplug ports on this switch capable of D3hot?  There's a
series queued up for 4.10 which will suspend hotplug ports to D3hot
when nothing is connected.  I cc'ed Keith when I posted it, but I
forgot to cc you:

https://www.spinics.net/lists/linux-pci/msg55564.html

If you haven't tested this series so far, now would be a good time
to do so, as it would allow fixing regressions (if any) before 4.10
is out.  Note that this series might save you quite a bit of power
if you have lots of unused ports.

You can fetch the patches from:

https://github.com/l1k/linux/commits/pcie_port_pm_v2

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2016-11-23 16:09   ` Bjorn Helgaas
  2016-11-28  9:14     ` Wei Zhang
@ 2016-11-28 18:02     ` Keith Busch
  2016-12-08 17:54       ` Bjorn Helgaas
  1 sibling, 1 reply; 42+ messages in thread
From: Keith Busch @ 2016-11-28 18:02 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Bjorn Helgaas, Lukas Wunner, Wei Zhang

On Wed, Nov 23, 2016 at 10:09:06AM -0600, Bjorn Helgaas wrote:
> Sorry I haven't had a chance to look at these yet.  I want to think
> about them a little more because it seems like these should be
> optimizations, not really fixes.  If they improve stability by fixing
> Linux issues, details of those issues would help.  But maybe the
> improvement is from avoiding things the hardware doesn't handle quite
> correctly.

I also think of this mainly as an optimization since it significantly
speeds up the hot removal of larger PCIe hierarchies. If this also happens
to improve hot plug stability on hardware (which I understand it does),
that's just a bonus, but the patch should be able to stand on its own
merits without considering hardware issues.

It also is a fix in that it blocks config space access to devices known
to be removed. If we let the config access happen in a hot swap, the
config request is rejected in the best case, but it goes to the wrong
device to potentially undefined results in the worst case.
 
> And I'm always hesitant about things like flags that say "this device
> is being removed but hasn't been removed yet" because the remove
> process is already complicated and we've tripped over races and
> ordering issues in the past.  That's not to say this is a bad
> approach; it's just an excuse for why I need some time to work through
> these patches.

The flag is really just saying the device is physically removed before
we've logically torn down the software resources for it. This is similar
to how many of the Linux SCSI drivers have target-specific flags to
know when to return DID_NO_CONNECT for previously configured devices
that were removed without going through software first.

Instead of 'is_removed', would a different name for this flag, like
'disconnected', be more appropriate?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2016-11-28 18:02     ` Keith Busch
@ 2016-12-08 17:54       ` Bjorn Helgaas
  2016-12-08 19:32         ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Bjorn Helgaas @ 2016-12-08 17:54 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Lukas Wunner, Wei Zhang

On Mon, Nov 28, 2016 at 01:02:14PM -0500, Keith Busch wrote:
> On Wed, Nov 23, 2016 at 10:09:06AM -0600, Bjorn Helgaas wrote:
> > Sorry I haven't had a chance to look at these yet.  I want to think
> > about them a little more because it seems like these should be
> > optimizations, not really fixes.  If they improve stability by fixing
> > Linux issues, details of those issues would help.  But maybe the
> > improvement is from avoiding things the hardware doesn't handle quite
> > correctly.
> 
> I also think of this mainly as an optimization since it significantly
> speeds up the hot removal of larger PCIe hierarchies. If this also happens
> to improve hot plug stability on hardware (which I understand it does),
> that's just a bonus, but the patch should be able to stand on its own
> merits without considering hardware issues.

In the pciehp thread, you cited performance improvements of seconds to
microseconds.  That's *huge*.  Can you give a few details in the
patches that realize that improvement (2 & 3, I think) about how they
account for it?  Are we simply avoiding seconds worth of config
accesses (I know they're not fast, but we can still do an awful lot
of them in a second), or are we avoiding timeouts somewhere, or what?

Bjorn

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2016-12-08 17:54       ` Bjorn Helgaas
@ 2016-12-08 19:32         ` Keith Busch
  2016-12-12 23:42           ` Bjorn Helgaas
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2016-12-08 19:32 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Bjorn Helgaas, Lukas Wunner, Wei Zhang

On Thu, Dec 08, 2016 at 11:54:32AM -0600, Bjorn Helgaas wrote:
> On Mon, Nov 28, 2016 at 01:02:14PM -0500, Keith Busch wrote:
> > On Wed, Nov 23, 2016 at 10:09:06AM -0600, Bjorn Helgaas wrote:
> > > Sorry I haven't had a chance to look at these yet.  I want to think
> > > about them a little more because it seems like these should be
> > > optimizations, not really fixes.  If they improve stability by fixing
> > > Linux issues, details of those issues would help.  But maybe the
> > > improvement is from avoiding things the hardware doesn't handle quite
> > > correctly.
> > 
> > I also think of this mainly as an optimization since it significantly
> > speeds up the hot removal of larger PCIe hierarchies. If this also happens
> > to improve hot plug stability on hardware (which I understand it does),
> > that's just a bonus, but the patch should be able to stand on its own
> > merits without considering hardware issues.
> 
> In the pciehp thread, you cited performance improvements of seconds to
> microseconds.  That's *huge*.  Can you give a few details in the
> patches that realize that improvement (2 & 3, I think) about how they
> account for it?  Are we simply avoiding seconds worth of config
> accesses (I know they're not fast, but we can still do an awful lot
> of them in a second), or are we avoiding timeouts somewhere, or what?

Depending on the device and the driver, there are hundreds to thousands
of non-posted transactions submitted to the device to complete driver
unbinding and removal. Since the device is gone, hardware has to handle
that as an error condition, which is slower than a successful
non-posted transaction. Since we're doing 1000 of them for no particular
reason, it takes a long time. If you hot remove a switch with multiple
downstream devices, the serialized removal adds up to many seconds.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2016-12-08 19:32         ` Keith Busch
@ 2016-12-12 23:42           ` Bjorn Helgaas
  2016-12-13  0:55             ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Bjorn Helgaas @ 2016-12-12 23:42 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Lukas Wunner, Wei Zhang

On Thu, Dec 08, 2016 at 02:32:53PM -0500, Keith Busch wrote:
> On Thu, Dec 08, 2016 at 11:54:32AM -0600, Bjorn Helgaas wrote:
> > On Mon, Nov 28, 2016 at 01:02:14PM -0500, Keith Busch wrote:
> > > On Wed, Nov 23, 2016 at 10:09:06AM -0600, Bjorn Helgaas wrote:
> > > > Sorry I haven't had a chance to look at these yet.  I want to think
> > > > about them a little more because it seems like these should be
> > > > optimizations, not really fixes.  If they improve stability by fixing
> > > > Linux issues, details of those issues would help.  But maybe the
> > > > improvement is from avoiding things the hardware doesn't handle quite
> > > > correctly.
> > > 
> > > I also think of this mainly as an optimization since it significantly
> > > speeds up the hot removal of larger PCIe hierarchies. If this also happens
> > > to improve hot plug stability on hardware (which I understand it does),
> > > that's just a bonus, but the patch should be able to stand on its own
> > > merits without considering hardware issues.
> > 
> > In the pciehp thread, you cited performance improvements of seconds to
> > microseconds.  That's *huge*.  Can you give a few details in the
> > patches that realize that improvement (2 & 3, I think) about how they
> > account for it?  Are we simply avoiding seconds worth of config
> > accesses (I know they're not fast, but we can still do an awful lot
> > of them in a second), or are we avoiding timeouts somewhere, or what?
> 
> Depending on the device and the driver, there are hundreds to thousands
> of non-posted transactions submitted to the device to complete driver
> unbinding and removal. Since the device is gone, hardware has to handle
> that as an error condition, which is slower than a successful
> non-posted transaction. Since we're doing 1000 of them for no particular
> reason, it takes a long time. If you hot remove a switch with multiple
> downstream devices, the serialized removal adds up to many seconds.

Another thread mentioned 1-2us as a reasonable config access cost, and
I'm still a little puzzled about how we get to something on the order
of a million times that cost.

I know this is all pretty hand-wavey, but 1000 config accesses to shut
down a device seems unreasonably high.  The entire config space is
only 4096 bytes, and most devices use a small fraction of that.  If
we're really doing 1000 accesses, it sounds like we're doing something
wrong, like polling without a delay or something.

I measured the cost of config reads during enumeration using the TSC
on a 2.8GHz CPU and found the following:

  1580 cycles, 0.565 usec (device present)
  1230 cycles, 0.440 usec (empty slot)
  2130 cycles, 0.761 usec (unimplemented function of multi-function device)

So 1-2usec does seem the right order of magnitude, and my "empty slot"
error responses are actually *faster* than the "device present" ones,
which is plausible to me because the Downstream Port can generate the
error response immediately without sending a packet down the link.
The "unimplemented function" responses take longer than the "empty
slot", which makes sense because the Downstream Port does have to send
a packet to the device, which then complains because it doesn't
implement that function.

Of course, these aren't the same case as yours, where the link used to
be up but is no longer.  Is there some hardware timeout to see if the
link will come back?

Bjorn

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2016-12-12 23:42           ` Bjorn Helgaas
@ 2016-12-13  0:55             ` Keith Busch
  2016-12-13 20:50               ` Bjorn Helgaas
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2016-12-13  0:55 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Bjorn Helgaas, Lukas Wunner, Wei Zhang

On Mon, Dec 12, 2016 at 05:42:27PM -0600, Bjorn Helgaas wrote:
> On Thu, Dec 08, 2016 at 02:32:53PM -0500, Keith Busch wrote:
> > Depending on the device and the driver, there are hundreds to thousands
> > of non-posted transactions submitted to the device to complete driver
> > unbinding and removal. Since the device is gone, hardware has to handle
> > that as an error condition, which is slower than a successful
> > non-posted transaction. Since we're doing 1000 of them for no particular
> > reason, it takes a long time. If you hot remove a switch with multiple
> > downstream devices, the serialized removal adds up to many seconds.
> 
> Another thread mentioned 1-2us as a reasonable config access cost, and
> I'm still a little puzzled about how we get to something on the order
> of a million times that cost.
> 
> I know this is all pretty hand-wavey, but 1000 config accesses to shut
> down a device seems unreasonably high.  The entire config space is
> only 4096 bytes, and most devices use a small fraction of that.  If
> we're really doing 1000 accesses, it sounds like we're doing something
> wrong, like polling without a delay or something.

Every time pci_find_ext_capability is called on a removed device, the
kernel will do 481 failed config space accesses trying to find that
capability. The kernel used to do that multiple times to find the AER
capability under conditions common to surprise removal.

But now that we cache the AER position (commit: 66b80809), we've
eliminated by far the worst offender. The counts I'm telling you are
still referencing the original captured traces showing long tear down
times, so it's not up-to-date with the most recent version of the kernel.
 
> I measured the cost of config reads during enumeration using the TSC
> on a 2.8GHz CPU and found the following:
> 
>   1580 cycles, 0.565 usec (device present)
>   1230 cycles, 0.440 usec (empty slot)
>   2130 cycles, 0.761 usec (unimplemented function of multi-function device)
> 
> So 1-2usec does seem the right order of magnitude, and my "empty slot"
> error responses are actually *faster* than the "device present" ones,
> which is plausible to me because the Downstream Port can generate the
> error response immediately without sending a packet down the link.
> The "unimplemented function" responses take longer than the "empty
> slot", which makes sense because the Downstream Port does have to send
> a packet to the device, which then complains because it doesn't
> implement that function.
> 
> Of course, these aren't the same case as yours, where the link used to
> be up but is no longer.  Is there some hardware timeout to see if the
> link will come back?

Yes, the hardware does not respond immediately under this test, which
is considered an error condition. This is a reason why PCIe Device
Capabilities 2 Completion Timeout Ranges are recommended to be in the
10ms range.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2016-12-13  0:55             ` Keith Busch
@ 2016-12-13 20:50               ` Bjorn Helgaas
  2016-12-13 23:18                 ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Bjorn Helgaas @ 2016-12-13 20:50 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Lukas Wunner, Wei Zhang

On Mon, Dec 12, 2016 at 07:55:47PM -0500, Keith Busch wrote:
> On Mon, Dec 12, 2016 at 05:42:27PM -0600, Bjorn Helgaas wrote:
> > On Thu, Dec 08, 2016 at 02:32:53PM -0500, Keith Busch wrote:
> > > Depending on the device and the driver, there are hundreds to thousands
> > > of non-posted transactions submitted to the device to complete driver
> > > unbinding and removal. Since the device is gone, hardware has to handle
> > > that as an error condition, which is slower than a successful
> > > non-posted transaction. Since we're doing 1000 of them for no particular
> > > reason, it takes a long time. If you hot remove a switch with multiple
> > > downstream devices, the serialized removal adds up to many seconds.
> > 
> > Another thread mentioned 1-2us as a reasonable config access cost, and
> > I'm still a little puzzled about how we get to something on the order
> > of a million times that cost.
> > 
> > I know this is all pretty hand-wavey, but 1000 config accesses to shut
> > down a device seems unreasonably high.  The entire config space is
> > only 4096 bytes, and most devices use a small fraction of that.  If
> > we're really doing 1000 accesses, it sounds like we're doing something
> > wrong, like polling without a delay or something.
> 
> Every time pci_find_ext_capability is called on a removed device, the
> kernel will do 481 failed config space accesses trying to find that
> capability. The kernel used to do that multiple times to find the AER
> capability under conditions common to surprise removal.

Right, that's a perfect example.  I'd rather fix issues like this by
caching the position as you did with AER.  The "removed" bit makes
these issues "go away" without addressing the underlying problem.

We might still need a "removed" bit for other reasons, but I want to
be clear about those reasons, not just throw it in under the general
"make it go fast" umbrella.

> But now that we cache the AER position (commit: 66b80809), we've
> eliminated by far the worst offender. The counts I'm telling you are
> still referencing the original captured traces showing long tear down
> times, so it's not up-to-date with the most recent version of the kernel.
>  
> > I measured the cost of config reads during enumeration using the TSC
> > on a 2.8GHz CPU and found the following:
> > 
> >   1580 cycles, 0.565 usec (device present)
> >   1230 cycles, 0.440 usec (empty slot)
> >   2130 cycles, 0.761 usec (unimplemented function of multi-function device)
> > 
> > So 1-2usec does seem the right order of magnitude, and my "empty slot"
> > error responses are actually *faster* than the "device present" ones,
> > which is plausible to me because the Downstream Port can generate the
> > error response immediately without sending a packet down the link.
> > The "unimplemented function" responses take longer than the "empty
> > slot", which makes sense because the Downstream Port does have to send
> > a packet to the device, which then complains because it doesn't
> > implement that function.
> > 
> > Of course, these aren't the same case as yours, where the link used to
> > be up but is no longer.  Is there some hardware timeout to see if the
> > link will come back?
> 
> Yes, the hardware does not respond immediately under this test, which
> is considered an error condition. This is a reason why PCIe Device
> Capabilities 2 Completion Timeout Ranges are recommended to be in the
> 10ms range.

And we're apparently still doing a lot of these accesses?  I'm still
curious about exactly what these are, because it may be that we're
doing more than necessary.

Bjorn

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 1/3] pci: Add is_removed state
  2016-10-28 22:58 ` [PATCHv4 next 1/3] pci: Add is_removed state Keith Busch
  2016-10-31 10:41   ` Lukas Wunner
@ 2016-12-13 20:56   ` Bjorn Helgaas
  2016-12-13 23:07     ` Keith Busch
  2016-12-13 23:54     ` Lukas Wunner
  1 sibling, 2 replies; 42+ messages in thread
From: Bjorn Helgaas @ 2016-12-13 20:56 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Lukas Wunner, Wei Zhang

On Fri, Oct 28, 2016 at 06:58:15PM -0400, Keith Busch wrote:
> This adds a new state for devices that were once in the system, but
> unexpectedly removed. This is so device tear down functions can observe
> the device is not accessible so it may skip attempting to initialize
> the hardware.
> 
> The pciehp and pcie-dpc drivers are aware of when the link is down,
> so these explicitly set this flag when its handlers detect the device
> is gone.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> Cc: Lukas Wunner <lukas@wunner.de>
> ---
>  drivers/pci/hotplug/pciehp_pci.c | 5 +++++
>  drivers/pci/pcie/pcie-dpc.c      | 4 ++++
>  include/linux/pci.h              | 7 +++++++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
> index 9e69403..7560961 100644
> --- a/drivers/pci/hotplug/pciehp_pci.c
> +++ b/drivers/pci/hotplug/pciehp_pci.c
> @@ -109,6 +109,11 @@ int pciehp_unconfigure_device(struct slot *p_slot)
>  				break;
>  			}
>  		}
> +		if (!presence) {
> +			pci_set_removed(dev, NULL);
> +			if (pci_has_subordinate(dev))
> +				pci_walk_bus(dev->subordinate, pci_set_removed, NULL);
> +		}
>  		pci_stop_and_remove_bus_device(dev);
>  		/*
>  		 * Ensure that no new Requests will be generated from
> diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
> index 9811b14..7818c88 100644
> --- a/drivers/pci/pcie/pcie-dpc.c
> +++ b/drivers/pci/pcie/pcie-dpc.c
> @@ -14,6 +14,7 @@
>  #include <linux/init.h>
>  #include <linux/pci.h>
>  #include <linux/pcieport_if.h>
> +#include "../pci.h"
>  
>  struct dpc_dev {
>  	struct pcie_device	*dev;
> @@ -46,6 +47,9 @@ static void interrupt_event_handler(struct work_struct *work)
>  	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
>  					 bus_list) {
>  		pci_dev_get(dev);
> +		pci_set_removed(dev, NULL);
> +		if (pci_has_subordinate(dev))
> +			pci_walk_bus(dev->subordinate, pci_set_removed, NULL);
>  		pci_stop_and_remove_bus_device(dev);
>  		pci_dev_put(dev);
>  	}
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 0e49f70..2115d19 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -341,6 +341,7 @@ struct pci_dev {
>  	unsigned int	multifunction:1;/* Part of multi-function device */
>  	/* keep track of device state */
>  	unsigned int	is_added:1;
> +	unsigned int	is_removed:1;	/* device was surprise removed */
>  	unsigned int	is_busmaster:1; /* device is busmaster */
>  	unsigned int	no_msi:1;	/* device may not use msi */
>  	unsigned int	no_64bit_msi:1; /* device may only use 32-bit MSIs */
> @@ -417,6 +418,12 @@ static inline int pci_channel_offline(struct pci_dev *pdev)
>  	return (pdev->error_state != pci_channel_io_normal);
>  }
>  
> +static inline int pci_set_removed(struct pci_dev *pdev, void *unused)
> +{
> +	pdev->is_removed = 1;

This makes me slightly worried because this is a bitfield and there's
no locking.  A concurrent write to some nearby field can corrupt
things.  It doesn't look *likely*, but it's a lot of work to be
convinced that this is completely safe, especially since the writer is
running on behalf of the bridge, and the target is a child of the
bridge.

The USB HCD_FLAG_DEAD and HCD_FLAG_HW_ACCESSIBLE flags are somewhat
similar.  Maybe we can leverage some of that design?

> +	return 0;
> +}
> +
>  struct pci_host_bridge {
>  	struct device dev;
>  	struct pci_bus *bus;		/* root bus */
> -- 
> 2.7.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 3/3] pci/msix: Skip disabling removed devices
  2016-10-28 22:58 ` [PATCHv4 next 3/3] pci/msix: Skip disabling " Keith Busch
  2016-10-31 11:00   ` Lukas Wunner
@ 2016-12-13 21:18   ` Bjorn Helgaas
  2016-12-13 23:01     ` Keith Busch
  1 sibling, 1 reply; 42+ messages in thread
From: Bjorn Helgaas @ 2016-12-13 21:18 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Lukas Wunner, Wei Zhang

On Fri, Oct 28, 2016 at 06:58:17PM -0400, Keith Busch wrote:
> There is no need to disable MSIx interrupts or write message data on a
> device that isn't present. This patch checks the device removed state
> prior to executing device shutdown operations so that tear down on
> removed devices completes quicker.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> Cc: Lukas Wunner <lukas@wunner.de>
> ---
>  drivers/pci/msi.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index bfdd074..9249ec1 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -317,7 +317,7 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
>  {
>  	struct pci_dev *dev = msi_desc_to_pci_dev(entry);
>  
> -	if (dev->current_state != PCI_D0) {
> +	if (dev->current_state != PCI_D0 || dev->is_removed) {
>  		/* Don't touch the hardware now */
>  	} else if (entry->msi_attrib.is_msix) {
>  		void __iomem *base = pci_msix_desc_addr(entry);
> @@ -1017,6 +1017,11 @@ void pci_msix_shutdown(struct pci_dev *dev)
>  	if (!pci_msi_enable || !dev || !dev->msix_enabled)
>  		return;
>  
> +	if (dev->is_removed) {
> +		dev->msix_enabled = 0;
> +		return;
> +	}
> +
>  	/* Return the device with MSI-X masked as initial states */
>  	for_each_pci_msi_entry(entry, dev) {
>  		/* Keep cached states to be restored */

Do we need the same thing in pci_msi_shutdown()?

It's too bad we have the data structure maintenance stuff all
intermingled with the hardware-touching code.  I don't know how they
could really be disentangled, but it feels a little ad hoc to sprinkle
is_removed checks in random places where we observe problems.

Bjorn

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 3/3] pci/msix: Skip disabling removed devices
  2016-12-13 21:18   ` Bjorn Helgaas
@ 2016-12-13 23:01     ` Keith Busch
  0 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2016-12-13 23:01 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Bjorn Helgaas, Lukas Wunner, Wei Zhang

On Tue, Dec 13, 2016 at 03:18:10PM -0600, Bjorn Helgaas wrote:
> On Fri, Oct 28, 2016 at 06:58:17PM -0400, Keith Busch wrote:
> > @@ -1017,6 +1017,11 @@ void pci_msix_shutdown(struct pci_dev *dev)
> >  	if (!pci_msi_enable || !dev || !dev->msix_enabled)
> >  		return;
> >  
> > +	if (dev->is_removed) {
> > +		dev->msix_enabled = 0;
> > +		return;
> > +	}
> > +
> >  	/* Return the device with MSI-X masked as initial states */
> >  	for_each_pci_msi_entry(entry, dev) {
> >  		/* Keep cached states to be restored */
> 
> Do we need the same thing in pci_msi_shutdown()?

We get MSI automatically from the previous two patches since MSI masking
is config access. MSIx is MMIO so it needs to explicitly check the
removed flag to fence off those unnecessary read-modify-writes.
 
> It's too bad we have the data structure maintenance stuff all
> intermingled with the hardware-touching code.  I don't know how they
> could really be disentangled, but it feels a little ad hoc to sprinkle
> is_removed checks in random places where we observe problems.

Yeah, I unfortunately don't see a better way to do it either. It may
look a bit arbitrary considering there's plenty of other places we
could possibly check for removed, but this is the one we observe to
be the highest contributor. I'm mainly interested in NVM-Express where
controllers have dozens to hundreds of vectors, so this is potentially
shaving off another ~100 non-posted commands.

I have heard interest in leveraging this flag for other device drivers,
though, and suspect new usage may come later.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 1/3] pci: Add is_removed state
  2016-12-13 20:56   ` Bjorn Helgaas
@ 2016-12-13 23:07     ` Keith Busch
  2016-12-14  2:50       ` Bjorn Helgaas
  2016-12-13 23:54     ` Lukas Wunner
  1 sibling, 1 reply; 42+ messages in thread
From: Keith Busch @ 2016-12-13 23:07 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Bjorn Helgaas, Lukas Wunner, Wei Zhang

On Tue, Dec 13, 2016 at 02:56:14PM -0600, Bjorn Helgaas wrote:
> On Fri, Oct 28, 2016 at 06:58:15PM -0400, Keith Busch wrote:
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 0e49f70..2115d19 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -341,6 +341,7 @@ struct pci_dev {
> >  	unsigned int	multifunction:1;/* Part of multi-function device */
> >  	/* keep track of device state */
> >  	unsigned int	is_added:1;
> > +	unsigned int	is_removed:1;	/* device was surprise removed */
> >  	unsigned int	is_busmaster:1; /* device is busmaster */
> >  	unsigned int	no_msi:1;	/* device may not use msi */
> >  	unsigned int	no_64bit_msi:1; /* device may only use 32-bit MSIs */
> > @@ -417,6 +418,12 @@ static inline int pci_channel_offline(struct pci_dev *pdev)
> >  	return (pdev->error_state != pci_channel_io_normal);
> >  }
> >  
> > +static inline int pci_set_removed(struct pci_dev *pdev, void *unused)
> > +{
> > +	pdev->is_removed = 1;
> 
> This makes me slightly worried because this is a bitfield and there's
> no locking.  A concurrent write to some nearby field can corrupt
> things.  It doesn't look *likely*, but it's a lot of work to be
> convinced that this is completely safe, especially since the writer is
> running on behalf of the bridge, and the target is a child of the
> bridge.
> 
> The USB HCD_FLAG_DEAD and HCD_FLAG_HW_ACCESSIBLE flags are somewhat
> similar.  Maybe we can leverage some of that design?

A bit field with atomic accessors sounds good to me. Do you want to
see all the struct pci_dev bit flags converted to that model? If so,
I can send you a prep patch that does that first.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2016-12-13 20:50               ` Bjorn Helgaas
@ 2016-12-13 23:18                 ` Keith Busch
       [not found]                   ` <B58D82457FDA0744A320A2FC5AC253B93D82F37D@fmsmsx104.amr.corp.intel.com>
  2018-11-13  6:05                   ` Bjorn Helgaas
  0 siblings, 2 replies; 42+ messages in thread
From: Keith Busch @ 2016-12-13 23:18 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Bjorn Helgaas, Lukas Wunner, Wei Zhang

On Tue, Dec 13, 2016 at 02:50:12PM -0600, Bjorn Helgaas wrote:
> On Mon, Dec 12, 2016 at 07:55:47PM -0500, Keith Busch wrote:
> > On Mon, Dec 12, 2016 at 05:42:27PM -0600, Bjorn Helgaas wrote:
> > > On Thu, Dec 08, 2016 at 02:32:53PM -0500, Keith Busch wrote:
> > > > Depending on the device and the driver, there are hundreds to thousands
> > > > of non-posted transactions submitted to the device to complete driver
> > > > unbinding and removal. Since the device is gone, hardware has to handle
> > > > that as an error condition, which is slower than a successful
> > > > non-posted transaction. Since we're doing 1000 of them for no particular
> > > > reason, it takes a long time. If you hot remove a switch with multiple
> > > > downstream devices, the serialized removal adds up to many seconds.
> > > 
> > > Another thread mentioned 1-2us as a reasonable config access cost, and
> > > I'm still a little puzzled about how we get to something on the order
> > > of a million times that cost.
> > > 
> > > I know this is all pretty hand-wavey, but 1000 config accesses to shut
> > > down a device seems unreasonably high.  The entire config space is
> > > only 4096 bytes, and most devices use a small fraction of that.  If
> > > we're really doing 1000 accesses, it sounds like we're doing something
> > > wrong, like polling without a delay or something.
> > 
> > Every time pci_find_ext_capability is called on a removed device, the
> > kernel will do 481 failed config space accesses trying to find that
> > capability. The kernel used to do that multiple times to find the AER
> > capability under conditions common to surprise removal.
> 
> Right, that's a perfect example.  I'd rather fix issues like this by
> caching the position as you did with AER.  The "removed" bit makes
> these issues "go away" without addressing the underlying problem.
> 
> We might still need a "removed" bit for other reasons, but I want to
> be clear about those reasons, not just throw it in under the general
> "make it go fast" umbrella.
> 
> > But now that we cache the AER position (commit: 66b80809), we've
> > eliminated by far the worst offender. The counts I'm telling you are
> > still referencing the original captured traces showing long tear down
> > times, so it's not up-to-date with the most recent version of the kernel.
> >  
> > > I measured the cost of config reads during enumeration using the TSC
> > > on a 2.8GHz CPU and found the following:
> > > 
> > >   1580 cycles, 0.565 usec (device present)
> > >   1230 cycles, 0.440 usec (empty slot)
> > >   2130 cycles, 0.761 usec (unimplemented function of multi-function device)
> > > 
> > > So 1-2usec does seem the right order of magnitude, and my "empty slot"
> > > error responses are actually *faster* than the "device present" ones,
> > > which is plausible to me because the Downstream Port can generate the
> > > error response immediately without sending a packet down the link.
> > > The "unimplemented function" responses take longer than the "empty
> > > slot", which makes sense because the Downstream Port does have to send
> > > a packet to the device, which then complains because it doesn't
> > > implement that function.
> > > 
> > > Of course, these aren't the same case as yours, where the link used to
> > > be up but is no longer.  Is there some hardware timeout to see if the
> > > link will come back?
> > 
> > Yes, the hardware does not respond immediately under this test, which
> > is considered an error condition. This is a reason why PCIe Device
> > Capabilities 2 Completion Timeout Ranges are recommended to be in the
> > 10ms range.
> 
> And we're apparently still doing a lot of these accesses?  I'm still
> curious about exactly what these are, because it may be that we're
> doing more than necessary.

It's the MSI-X masking that's our next highest contributor. Masking
vectors still requires non-posted commands, and since they're not going
through managed API accessors like config space uses, the removed flag
is needed for checking before doing significant MMIO.
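
[The shape of the proposed check is simple: consult the cached removed flag before touching the MSI-X table. A hedged sketch with an illustrative device handle (field names are not the kernel's struct pci_dev):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device handle; "removed" caches the surprise-removal
 * state that pciehp/DPC would set. */
struct fake_pci_dev {
	bool removed;
	uint32_t msix_table[32];	/* stands in for the MMIO MSI-X table */
};

static int mmio_writes;

/* Mask one MSI-X vector.  The early return is the point of the patch:
 * once the device is known gone, skip the non-posted MMIO that would
 * otherwise stall until the completion timeout. */
static void msix_mask_vector(struct fake_pci_dev *dev, int vec)
{
	if (dev->removed)
		return;			/* device gone: the write cannot succeed */
	dev->msix_table[vec] |= 1u;	/* PCI_MSIX_ENTRY_CTRL_MASKBIT */
	mmio_writes++;
}
```

With hundreds of vectors per device and a ~10ms completion timeout each, skipping these writes is where the multi-second teardown savings come from.]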

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 1/3] pci: Add is_removed state
  2016-12-13 20:56   ` Bjorn Helgaas
  2016-12-13 23:07     ` Keith Busch
@ 2016-12-13 23:54     ` Lukas Wunner
  1 sibling, 0 replies; 42+ messages in thread
From: Lukas Wunner @ 2016-12-13 23:54 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Keith Busch, linux-pci, Wei Zhang

On Tue, Dec 13, 2016 at 02:56:14PM -0600, Bjorn Helgaas wrote:
> On Fri, Oct 28, 2016 at 06:58:15PM -0400, Keith Busch wrote:
[snip]
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 0e49f70..2115d19 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -341,6 +341,7 @@ struct pci_dev {
> >  	unsigned int	multifunction:1;/* Part of multi-function device */
> >  	/* keep track of device state */
> >  	unsigned int	is_added:1;
> > +	unsigned int	is_removed:1;	/* device was surprise removed */
> >  	unsigned int	is_busmaster:1; /* device is busmaster */
> >  	unsigned int	no_msi:1;	/* device may not use msi */
> >  	unsigned int	no_64bit_msi:1; /* device may only use 32-bit MSIs */
> > @@ -417,6 +418,12 @@ static inline int pci_channel_offline(struct pci_dev *pdev)
> >  	return (pdev->error_state != pci_channel_io_normal);
> >  }
> >  
> > +static inline int pci_set_removed(struct pci_dev *pdev, void *unused)
> > +{
> > +	pdev->is_removed = 1;
> 
> This makes me slightly worried because this is a bitfield and there's
> no locking.  A concurrent write to some nearby field can corrupt
> things.  It doesn't look *likely*, but it's a lot of work to be
> convinced that this is completely safe, especially since the writer is
> running on behalf of the bridge, and the target is a child of the
> bridge.
> 
> The USB HCD_FLAG_DEAD and HCD_FLAG_HW_ACCESSIBLE flags are somewhat
> similar.  Maybe we can leverage some of that design?

Back in October I suggested leveraging the error_state field in struct
pci_dev.  That's an enum defined at the top of include/linux/pci.h
with values pci_channel_io_normal, pci_channel_io_frozen and
pci_channel_io_perm_failure.  I suggested adding a removed state.
The benefit is that lots of drivers already check pci_channel_offline()
before accessing a device, so without any further changes they would
treat surprise-removed devices properly.

However Keith responded:
"I'd be happy if we can reuse that, but concerned about overloading
error_state's intended purpose for AER. The conditions under which an
'is_removed' may be set can also create AER events, and the aer driver
overrides the error_state."
(http://www.spinics.net/lists/linux-pci/msg55417.html)

So it would seem to require at least a modification of the AER driver
to not overwrite a pci_channel_io_removed state.
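
[A sketch of the suggestion, assuming a hypothetical pci_channel_io_removed value appended to the existing enum (the struct and helper below are illustrative stand-ins, not the kernel's types):

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the enum at the top of include/linux/pci.h, with the
 * proposed "removed" value appended. */
enum pci_channel_state {
	pci_channel_io_normal = 1,
	pci_channel_io_frozen,
	pci_channel_io_perm_failure,
	pci_channel_io_removed,		/* proposed: surprise removed */
};

struct fake_pci_dev {
	enum pci_channel_state error_state;
};

/* Same shape as pci_channel_offline(): any non-normal state counts
 * as offline, so the many drivers that already call this would treat
 * surprise-removed devices correctly with no further changes. */
static bool fake_pci_channel_offline(const struct fake_pci_dev *pdev)
{
	return pdev->error_state != pci_channel_io_normal;
}
```
]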

Best regards,

Lukas


* Re: [PATCHv4 next 1/3] pci: Add is_removed state
  2016-12-13 23:07     ` Keith Busch
@ 2016-12-14  2:50       ` Bjorn Helgaas
  2016-12-14  2:54         ` Bjorn Helgaas
  0 siblings, 1 reply; 42+ messages in thread
From: Bjorn Helgaas @ 2016-12-14  2:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-pci, Bjorn Helgaas, Lukas Wunner, Wei Zhang, Alan Stern,
	Greg Kroah-Hartman

[+cc Alan, Greg]

On Tue, Dec 13, 2016 at 06:07:31PM -0500, Keith Busch wrote:
> On Tue, Dec 13, 2016 at 02:56:14PM -0600, Bjorn Helgaas wrote:
> > On Fri, Oct 28, 2016 at 06:58:15PM -0400, Keith Busch wrote:
> > > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > > index 0e49f70..2115d19 100644
> > > --- a/include/linux/pci.h
> > > +++ b/include/linux/pci.h
> > > @@ -341,6 +341,7 @@ struct pci_dev {
> > >  	unsigned int	multifunction:1;/* Part of multi-function device */
> > >  	/* keep track of device state */
> > >  	unsigned int	is_added:1;
> > > +	unsigned int	is_removed:1;	/* device was surprise removed */
> > >  	unsigned int	is_busmaster:1; /* device is busmaster */
> > >  	unsigned int	no_msi:1;	/* device may not use msi */
> > >  	unsigned int	no_64bit_msi:1; /* device may only use 32-bit MSIs */
> > > @@ -417,6 +418,12 @@ static inline int pci_channel_offline(struct pci_dev *pdev)
> > >  	return (pdev->error_state != pci_channel_io_normal);
> > >  }
> > >  
> > > +static inline int pci_set_removed(struct pci_dev *pdev, void *unused)
> > > +{
> > > +	pdev->is_removed = 1;
> > 
> > This makes me slightly worried because this is a bitfield and there's
> > no locking.  A concurrent write to some nearby field can corrupt
> > things.  It doesn't look *likely*, but it's a lot of work to be
> > convinced that this is completely safe, especially since the writer is
> > running on behalf of the bridge, and the target is a child of the
> > bridge.
> > 
> > The USB HCD_FLAG_DEAD and HCD_FLAG_HW_ACCESSIBLE flags are somewhat
> > similar.  Maybe we can leverage some of that design?
> 
> A bit field with atomic accessors sounds good to me. Do you want to
> see all the struct pci_dev bit flags converted to that model? If so,
> I can send you a prep patch that does that first.

This is still blue-sky, "what if?" thinking on my part.  I was
starting to wonder if we could make something generic that could be
used for your is_removed work and also for the USB stuff.

The USB HCD_FLAG_DEAD is set by usb_hc_died(), which is frequently
used by drivers after an MMIO read returns ~0.  HCD_FLAG_HW_ACCESSIBLE
seems to encode part of the idea of "this device is powered up enough
to respond", which is sort of similar to what PCI does with
dev->current_state.

If there were generic PCI interfaces like pci_dev_died() and a
pci_dev_dead() predicate, I wonder if USB could use them instead of
rolling their own?
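
[The "bit field with atomic accessors" idea can be sketched in userspace C11; in the kernel this would be set_bit()/test_bit() on an unsigned long, as USB does for HCD_FLAG_DEAD and HCD_FLAG_HW_ACCESSIBLE (the names below are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { FLAG_IS_ADDED, FLAG_IS_REMOVED };

struct fake_pci_dev {
	atomic_uint flags;	/* replaces the racy ":1" bitfields */
};

static void dev_set_flag(struct fake_pci_dev *d, int bit)
{
	/* Atomic read-modify-write: a concurrent update to a neighboring
	 * flag cannot be lost, unlike a plain bitfield store, which
	 * compiles to a non-atomic load/modify/store of the whole word. */
	atomic_fetch_or(&d->flags, 1u << bit);
}

static bool dev_test_flag(struct fake_pci_dev *d, int bit)
{
	return (atomic_load(&d->flags) >> bit) & 1u;
}
```
]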

Bjorn


* Re: [PATCHv4 next 1/3] pci: Add is_removed state
  2016-12-14  2:50       ` Bjorn Helgaas
@ 2016-12-14  2:54         ` Bjorn Helgaas
  0 siblings, 0 replies; 42+ messages in thread
From: Bjorn Helgaas @ 2016-12-14  2:54 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-pci, Bjorn Helgaas, Lukas Wunner, Wei Zhang, Alan Stern,
	Greg Kroah-Hartman

On Tue, Dec 13, 2016 at 08:50:07PM -0600, Bjorn Helgaas wrote:
> [+cc Alan, Greg]
> 
> On Tue, Dec 13, 2016 at 06:07:31PM -0500, Keith Busch wrote:
> > On Tue, Dec 13, 2016 at 02:56:14PM -0600, Bjorn Helgaas wrote:
> > > On Fri, Oct 28, 2016 at 06:58:15PM -0400, Keith Busch wrote:
> > > > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > > > index 0e49f70..2115d19 100644
> > > > --- a/include/linux/pci.h
> > > > +++ b/include/linux/pci.h
> > > > @@ -341,6 +341,7 @@ struct pci_dev {
> > > >  	unsigned int	multifunction:1;/* Part of multi-function device */
> > > >  	/* keep track of device state */
> > > >  	unsigned int	is_added:1;
> > > > +	unsigned int	is_removed:1;	/* device was surprise removed */
> > > >  	unsigned int	is_busmaster:1; /* device is busmaster */
> > > >  	unsigned int	no_msi:1;	/* device may not use msi */
> > > >  	unsigned int	no_64bit_msi:1; /* device may only use 32-bit MSIs */
> > > > @@ -417,6 +418,12 @@ static inline int pci_channel_offline(struct pci_dev *pdev)
> > > >  	return (pdev->error_state != pci_channel_io_normal);
> > > >  }
> > > >  
> > > > +static inline int pci_set_removed(struct pci_dev *pdev, void *unused)
> > > > +{
> > > > +	pdev->is_removed = 1;
> > > 
> > > This makes me slightly worried because this is a bitfield and there's
> > > no locking.  A concurrent write to some nearby field can corrupt
> > > things.  It doesn't look *likely*, but it's a lot of work to be
> > > convinced that this is completely safe, especially since the writer is
> > > running on behalf of the bridge, and the target is a child of the
> > > bridge.
> > > 
> > > The USB HCD_FLAG_DEAD and HCD_FLAG_HW_ACCESSIBLE flags are somewhat
> > > similar.  Maybe we can leverage some of that design?
> > 
> > A bit field with atomic accessors sounds good to me. Do you want to
> > see all the struct pci_dev bit flags converted to that model? If so,
> > I can send you a prep patch that does that first.
> 
> This is still blue-sky, "what if?" thinking on my part.  I was
> starting to wonder if we could make something generic that could be
> used for your is_removed work and also for the USB stuff.
> 
> The USB HCD_FLAG_DEAD is set by usb_hc_died(), which is frequently
> used by drivers after an MMIO read returns ~0.  HCD_FLAG_HW_ACCESSIBLE
> seems to encode part of the idea of "this device is powered up enough
> to respond", which is sort of similar to what PCI does with
> dev->current_state.
> 
> If there were generic PCI interfaces like pci_dev_died() and a
> pci_dev_dead() predicate, I wonder if USB could use them instead of
> rolling their own?

But I guess all the world is not PCI -- there are non-PCI USB host
controllers, so it's not quite that simple.


* Re: [PATCHv4 next 0/3] Limiting pci access
       [not found]                     ` <20170120213550.GA16618@localhost.localdomain>
@ 2017-01-21  7:31                       ` Lukas Wunner
  2017-01-21  8:42                         ` Greg Kroah-Hartman
  0 siblings, 1 reply; 42+ messages in thread
From: Lukas Wunner @ 2017-01-21  7:31 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Keith Busch
  Cc: linux-pci, Bjorn Helgaas, Wei Zhang, Austin Bolen

On Fri, Jan 20, 2017 at 04:35:50PM -0500, Keith Busch wrote:
> On Tue, Dec 13, 2016 at 04:19:32PM -0500, Keith Busch wrote:
> > On Tue, Dec 13, 2016 at 02:50:12PM -0600, Bjorn Helgaas wrote:
> > > And we're apparently still doing a lot of these accesses?  I'm still
> > > curious about exactly what these are, because it may be that we're
> > > doing more than necessary.
> > 
> > It's the MSI-x masking that's our next highest contributor. Masking
> > vectors still requires non-posted commands, and since they're not going
> > through managed API accessors like config space uses, the removed flag
> > is needed for checking before doing significant MMIO.
> 
> Hi Bjorn,
> 
> Just wanted to do another check with you on this. We'd still like to fence
> off all config access with appropriate error codes, and short-cut the most
> significant MMIO access to improve surprise removal. There may still be
> other offenders, but these are the most important ones we've identified.
> 
> I've updated the series to make the new flag an atomic accessor as
> requested, and improved the change logs with more information compelling
> the change. Otherwise it's much the same as before. I know you weren't
> keen on capturing all the accesses under the umbrella of improving device
> unbinding time, but the general consensus among device makers is that
> it's a good thing to have software return an error early rather than
> send a command we know will fail. Any other thoughts I should consider
> before posting v5?

I think Bjorn was pondering whether a flag to indicate surprise removal
should be put in struct device rather than struct pci_dev, so as to
cover other buses capable of surprise removal.  There's already an
"offline" flag in struct device which is set when user space initiates
a safe hot removal via sysfs.

Bjorn cc'ed his e-mails of Dec 13 to Greg KH and Alan Stern but got no
replies.

@Greg KH:
Would you entertain a patch adding a bit to struct device which indicates
the device was surprise removed?  The PCIe Hotplug and PCIe Downstream
Port Containment drivers are both able to recognize surprise removal and
can set the bit.

When removing the device we currently perform numerous accesses to config
space in the PCI core.  Additionally the driver for the removed device
(e.g. network driver, GPU driver) performs mmio accesses to orderly shut
down the device.  E.g. when unplugging the Apple Thunderbolt Ethernet
adapter the kernel currently locks up as the tg3 driver tries to shutdown
the removed device.  If we had a bit to indicate surprise removal we could
handle this properly in the PCI core and device driver ->remove hooks.

For comparison, this is what macOS recommends to driver developers:

       "PCI device drivers are typically developed with the expectation
	that the device will not be removed from the PCI bus during its
	operation. However, Thunderbolt technology allows PCI data to be
	tunneled through a Thunderbolt connection, and the Thunderbolt
	cables may be unplugged from the host or device at any time.
	Therefore, the system must be able to cope with the removal of
	PCI devices by the user at any time.

	The PCI device drivers used with Thunderbolt devices may need to
	be updated in order to handle surprise or unplanned removal.
	In particular, MMIO cycles and PCI Configuration accesses require
	special attention. [...] As a basic guideline, developers should
	modify their drivers to handle a return value of 0xFFFFFFFF.
	If any thread, callback, interrupt filter, or code path in a
	driver receives 0xFFFFFFFF indicating the device has been
	unplugged, then all threads, callbacks, interrupt filters,
	interrupt handlers, and other code paths in that driver must
	cease MMIO reads and writes immediately and prepare for
	termination. [...]

	Once it has been determined that a device is no longer connected,
	do not try to clean up or reset the hardware as attempts to
	communicate with the hardware may lead to further delays. [...]
	A typical way for a developer to solve this problem is to provide
	a single bottleneck routine for all MMIO reads and have that
	routine check the status of the device before beginning the actual
	transaction."

	Source: https://developer.apple.com/library/content/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/Basics02/Basics02.html

We lack a comparable functionality and the question is whether to
support it only in the PCI core or in a more general fashion in the
driver core.  Other buses (such as USB) have to support surprise
removal as well.
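
[The "single bottleneck routine" the macOS guideline describes can be sketched as follows; the handle and latch are illustrative assumptions, not any real driver's API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative device handle; "dead" caches the all-ones detection. */
struct fake_dev {
	bool dead;
	const uint32_t *mmio;	/* stands in for the mapped BAR */
};

/* Single bottleneck for MMIO reads: short-circuit once the device is
 * known gone, and latch "gone" the first time a read comes back as
 * all-ones.  (All-ones can also be a legitimate register value; real
 * drivers typically cross-check a register that can never legally be
 * all-ones, such as the vendor ID.) */
static uint32_t dev_read32(struct fake_dev *dev, unsigned int off)
{
	uint32_t val;

	if (dev->dead)
		return 0xffffffffu;	/* don't touch the hardware again */
	val = dev->mmio[off / 4];
	if (val == 0xffffffffu)
		dev->dead = true;
	return val;
}
```
]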

Thanks,

Lukas


* Re: [PATCHv4 next 0/3] Limiting pci access
  2017-01-21  7:31                       ` Lukas Wunner
@ 2017-01-21  8:42                         ` Greg Kroah-Hartman
  2017-01-21 14:22                           ` Lukas Wunner
  2017-01-23 16:04                           ` Keith Busch
  0 siblings, 2 replies; 42+ messages in thread
From: Greg Kroah-Hartman @ 2017-01-21  8:42 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Keith Busch, linux-pci, Bjorn Helgaas, Wei Zhang, Austin Bolen

On Sat, Jan 21, 2017 at 08:31:40AM +0100, Lukas Wunner wrote:
> On Fri, Jan 20, 2017 at 04:35:50PM -0500, Keith Busch wrote:
> > On Tue, Dec 13, 2016 at 04:19:32PM -0500, Keith Busch wrote:
> > > On Tue, Dec 13, 2016 at 02:50:12PM -0600, Bjorn Helgaas wrote:
> > > > And we're apparently still doing a lot of these accesses?  I'm still
> > > > curious about exactly what these are, because it may be that we're
> > > > doing more than necessary.
> > > 
> > > It's the MSI-X masking that's our next highest contributor. Masking
> > > vectors still requires non-posted commands, and since they're not going
> > > through managed API accessors like config space uses, the removed flag
> > > is needed for checking before doing significant MMIO.
> > 
> > Hi Bjorn,
> > 
> > Just wanted to do another check with you on this. We'd still like to fence
> > off all config access with appropriate error codes, and short-cut the most
> > significant MMIO access to improve surprise removal. There may still be
> > other offenders, but these are the most important ones we've identified.
> > 
> > I've updated the series to make the new flag an atomic accessor as
> > requested, and improved the change logs with more information compelling
> > the change. Otherwise it's much the same as before. I know you weren't
> > keen on capturing all the accesses under the umbrella of improving device
> > unbinding time, but the general consensus among device makers is that
> > it's a good thing to have software return an error early rather than
> > send a command we know will fail. Any other thoughts I should consider
> > before posting v5?
> 
> I think Bjorn was pondering whether a flag to indicate surprise removal
> should be put in struct device rather than struct pci_dev, so as to
> cover other buses capable of surprise removal.  There's already an
> "offline" flag in struct device which is set when user space initiates
> a safe hot removal via sysfs.
> 
> Bjorn cc'ed his e-mails of Dec 13 to Greg KH and Alan Stern but got no
> replies.

Sorry, I don't recall seeing those :(

> @Greg KH:
> Would you entertain a patch adding a bit to struct device which indicates
> the device was surprise removed?  The PCIe Hotplug and PCIe Downstream
> Port Containment drivers are both able to recognize surprise removal and
> can set the bit.
> 
> When removing the device we currently perform numerous accesses to config
> space in the PCI core.  Additionally the driver for the removed device
> (e.g. network driver, GPU driver) performs mmio accesses to orderly shut
> down the device.  E.g. when unplugging the Apple Thunderbolt Ethernet
> adapter the kernel currently locks up as the tg3 driver tries to shutdown
> the removed device.  If we had a bit to indicate surprise removal we could
> handle this properly in the PCI core and device driver ->remove hooks.
> 
> For comparison, this is what macOS recommends to driver developers:
> 
>        "PCI device drivers are typically developed with the expectation
> 	that the device will not be removed from the PCI bus during its
> 	operation. However, Thunderbolt technology allows PCI data to be
> 	tunneled through a Thunderbolt connection, and the Thunderbolt
> 	cables may be unplugged from the host or device at any time.
> 	Therefore, the system must be able to cope with the removal of
> 	PCI devices by the user at any time.
> 
> 	The PCI device drivers used with Thunderbolt devices may need to
> 	be updated in order to handle surprise or unplanned removal.
> 	In particular, MMIO cycles and PCI Configuration accesses require
> 	special attention. [...] As a basic guideline, developers should
> 	modify their drivers to handle a return value of 0xFFFFFFFF.
> 	If any thread, callback, interrupt filter, or code path in a
> 	driver receives 0xFFFFFFFF indicating the device has been
> 	unplugged, then all threads, callbacks, interrupt filters,
> 	interrupt handlers, and other code paths in that driver must
> 	cease MMIO reads and writes immediately and prepare for
> 	termination. [...]
> 
> 	Once it has been determined that a device is no longer connected,
> 	do not try to clean up or reset the hardware as attempts to
> 	communicate with the hardware may lead to further delays. [...]
> 	A typical way for a developer to solve this problem is to provide
> 	a single bottleneck routine for all MMIO reads and have that
> 	routine check the status of the device before beginning the actual
> 	transaction."
> 
> 	Source: https://developer.apple.com/library/content/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/Basics02/Basics02.html
> 
> We lack a comparable functionality and the question is whether to
> support it only in the PCI core or in a more general fashion in the
> driver core.  Other buses (such as USB) have to support surprise
> removal as well.

PCI devices have _ALWAYS_ had to handle surprise removal, and the macOS
guidelines are the exact same thing that we have been telling Linux
kernel developers for years.

So no, a surprise removal flag should not be needed, your driver should
already be handling this problem today (if it isn't, it needs to be
fixed.)

Don't try to rely on some out-of-band notification that your device is
removed, just do what you write above and handle it that way in your
driver.  There are lots of examples of this in the kernel today, are you
concerned about any specific driver that does not do this properly?

And really, all buses need to handle their device being removed at any
point in time anyway, so the reliance on a "surprise" flag would not
help much, as usually that is only known _after_ the fact and the device
would have already started to report other types of errors.

So what "benefit" would a surprise removal flag provide?  Other than
some PCI buses could set this, and many others could not, so a driver
would have to be changed to now support both types of detection,
_adding_ work to drivers, not making it easier by any means :(

thanks,

greg k-h


* Re: [PATCHv4 next 0/3] Limiting pci access
  2017-01-21  8:42                         ` Greg Kroah-Hartman
@ 2017-01-21 14:22                           ` Lukas Wunner
  2017-01-25 11:47                             ` Greg Kroah-Hartman
  2017-01-23 16:04                           ` Keith Busch
  1 sibling, 1 reply; 42+ messages in thread
From: Lukas Wunner @ 2017-01-21 14:22 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Keith Busch, linux-pci, Bjorn Helgaas, Wei Zhang, Austin Bolen

On Sat, Jan 21, 2017 at 09:42:43AM +0100, Greg Kroah-Hartman wrote:
> On Sat, Jan 21, 2017 at 08:31:40AM +0100, Lukas Wunner wrote:
> > On Fri, Jan 20, 2017 at 04:35:50PM -0500, Keith Busch wrote:
> > > On Tue, Dec 13, 2016 at 04:19:32PM -0500, Keith Busch wrote:
> > > > On Tue, Dec 13, 2016 at 02:50:12PM -0600, Bjorn Helgaas wrote:
> > > > > And we're apparently still doing a lot of these accesses?  I'm still
> > > > > curious about exactly what these are, because it may be that we're
> > > > > doing more than necessary.
> > > > 
> > > > It's the MSI-X masking that's our next highest contributor. Masking
> > > > vectors still requires non-posted commands, and since they're not going
> > > > through managed API accessors like config space uses, the removed flag
> > > > is needed for checking before doing significant MMIO.
> > > 
> > > Hi Bjorn,
> > > 
> > > Just wanted to do another check with you on this. We'd still like to fence
> > > off all config access with appropriate error codes, and short-cut the most
> > > significant MMIO access to improve surprise removal. There may still be
> > > other offenders, but these are the most important ones we've identified.
> > > 
> > > I've updated the series to make the new flag an atomic accessor as
> > > requested, and improved the change logs with more information compelling
> > > the change. Otherwise it's much the same as before. I know you weren't
> > > keen on capturing all the accesses under the umbrella of improving device
> > > unbinding time, but the general consensus among device makers is that
> > > it's a good thing to have software return an error early rather than
> > > send a command we know will fail. Any other thoughts I should consider
> > > before posting v5?
> > 
> > I think Bjorn was pondering whether a flag to indicate surprise removal
> > should be put in struct device rather than struct pci_dev, so as to
> > cover other buses capable of surprise removal.  There's already an
> > "offline" flag in struct device which is set when user space initiates
> > a safe hot removal via sysfs.
> > 
> > Bjorn cc'ed his e-mails of Dec 13 to Greg KH and Alan Stern but got no
> > replies.
> 
> Sorry, I don't recall seeing those :(
> 
> > @Greg KH:
> > Would you entertain a patch adding a bit to struct device which indicates
> > the device was surprise removed?  The PCIe Hotplug and PCIe Downstream
> > Port Containment drivers are both able to recognize surprise removal and
> > can set the bit.
> > 
> > When removing the device we currently perform numerous accesses to config
> > space in the PCI core.  Additionally the driver for the removed device
> > (e.g. network driver, GPU driver) performs mmio accesses to orderly shut
> > down the device.  E.g. when unplugging the Apple Thunderbolt Ethernet
> > adapter the kernel currently locks up as the tg3 driver tries to shutdown
> > the removed device.  If we had a bit to indicate surprise removal we could
> > handle this properly in the PCI core and device driver ->remove hooks.
> > 
> > For comparison, this is what macOS recommends to driver developers:
> > 
> >        "PCI device drivers are typically developed with the expectation
> > 	that the device will not be removed from the PCI bus during its
> > 	operation. However, Thunderbolt technology allows PCI data to be
> > 	tunneled through a Thunderbolt connection, and the Thunderbolt
> > 	cables may be unplugged from the host or device at any time.
> > 	Therefore, the system must be able to cope with the removal of
> > 	PCI devices by the user at any time.
> > 
> > 	The PCI device drivers used with Thunderbolt devices may need to
> > 	be updated in order to handle surprise or unplanned removal.
> > 	In particular, MMIO cycles and PCI Configuration accesses require
> > 	special attention. [...] As a basic guideline, developers should
> > 	modify their drivers to handle a return value of 0xFFFFFFFF.
> > 	If any thread, callback, interrupt filter, or code path in a
> > 	driver receives 0xFFFFFFFF indicating the device has been
> > 	unplugged, then all threads, callbacks, interrupt filters,
> > 	interrupt handlers, and other code paths in that driver must
> > 	cease MMIO reads and writes immediately and prepare for
> > 	termination. [...]
> > 
> > 	Once it has been determined that a device is no longer connected,
> > 	do not try to clean up or reset the hardware as attempts to
> > 	communicate with the hardware may lead to further delays. [...]
> > 	A typical way for a developer to solve this problem is to provide
> > 	a single bottleneck routine for all MMIO reads and have that
> > 	routine check the status of the device before beginning the actual
> > 	transaction."
> > 
> > 	Source: https://developer.apple.com/library/content/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/Basics02/Basics02.html
> > 
> > We lack a comparable functionality and the question is whether to
> > support it only in the PCI core or in a more general fashion in the
> > driver core.  Other buses (such as USB) have to support surprise
> > removal as well.
> 
> PCI devices have _ALWAYS_ had to handle surprise removal, and the macOS
> guidelines are the exact same thing that we have been telling Linux
> kernel developers for years.
> 
> So no, a surprise removal flag should not be needed, your driver should
> already be handling this problem today (if it isn't, it needs to be
> fixed.)
> 
> Don't try to rely on some out-of-band notification that your device is
> removed,

This isn't about notification, it's about caching.

Once it is known that the device is gone, that status should be cached.
This obviates the need to do any further checks for presence and allows
skipping all following config space and mmio accesses, thereby greatly
speeding up device removal and avoiding lockups.

Since the device is accessed both in the pci_bus_type's ->remove hook
as well as in ->remove hooks of individual PCI drivers, it makes sense
to cache the status in struct pci_dev so that it can be checked in the
PCI core as well as in PCI drivers.
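
[The caching argument in miniature: a presence check that consults the cached flag first never issues the slow config-space probe once removal has been detected.  A hedged sketch (the handle and probe are illustrative, not the kernel's pci_device_is_present()):

```c
#include <assert.h>
#include <stdbool.h>

static int config_reads;

struct fake_pci_dev {
	bool is_removed;	/* cached by hotplug/DPC on surprise removal */
	bool on_bus;		/* stands in for the real presence probe */
};

/* The slow path: a non-posted config read that may stall for the
 * completion timeout on a removed device. */
static bool probe_vendor_id(struct fake_pci_dev *dev)
{
	config_reads++;
	return dev->on_bus;
}

/* Shape of the patched presence check: consult the cached flag first,
 * so repeated checks during teardown cost nothing once removal has
 * been detected. */
static bool device_is_present(struct fake_pci_dev *dev)
{
	if (dev->is_removed)
		return false;
	return probe_vendor_id(dev);
}
```
]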

Bjorn's question, AFAIU, was if caching in struct device would make more
sense, to let other buses benefit from the knowledge that the device is
gone.  Keith's patch recursively sets the is_removed flag on all PCI
devices below the one that was removed.  (Think of a chain of Thunderbolt
devices that is surprise removed.)  If instead the flag was in struct
device, it could be set on any type of child device.  (Think of a USB hub
in a Thunderbolt dock that is surprise removed.)

How does the USB bus type currently handle surprise removal?


> just do what you write above and handle it that way in your
> driver.  There are lots of examples of this in the kernel today, are you
> concerned about any specific driver that does not do this properly?

As I've written above, an example is drivers/net/ethernet/broadcom/tg3.c
which is the driver used for Apple Thunderbolt Ethernet adapters.
Surprise removal of those currently results in:

    NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:2:299]
    ...
    Workqueue: pciehp-4 pciehp_power_thread
    RIP: 0010:[<ffffffffa105b01d>]  [<ffffffffa105b01d>] tg3_read32+0xd/0x10 [tg3]
    ...
    Call Trace:
     [<ffffffffa1063610>] ? tg3_stop_block.constprop.126+0x80/0x110 [tg3]
     [<ffffffffa1066298>] ? tg3_abort_hw+0x68/0x2f0 [tg3]
     [<ffffffffa106654d>] ? tg3_halt+0x2d/0x180 [tg3]
     [<ffffffffa1072a07>] ? tg3_stop+0x157/0x210 [tg3]
     [<ffffffffa1072aeb>] ? tg3_close+0x2b/0xe0 [tg3]
     [<ffffffff81465ff4>] ? __dev_close_many+0x84/0xd0
     [<ffffffff814660b4>] ? dev_close_many+0x74/0x100
     [<ffffffff8146790b>] ? rollback_registered_many+0xfb/0x2e0
     [<ffffffff81467b19>] ? rollback_registered+0x29/0x40
     [<ffffffff81468950>] ? unregister_netdevice_queue+0x40/0x90
     [<ffffffff814689b8>] ? unregister_netdev+0x18/0x20
     [<ffffffffa106604b>] ? tg3_remove_one+0x8b/0x130 [tg3]
     [<ffffffff8130b556>] ? pci_device_remove+0x36/0xb0
     [<ffffffff813df92a>] ? __device_release_driver+0x9a/0x140
     [<ffffffff813df9ee>] ? device_release_driver+0x1e/0x30
     [<ffffffff81304bf4>] ? pci_stop_bus_device+0x84/0xa0
     [<ffffffff81304b9b>] ? pci_stop_bus_device+0x2b/0xa0
     [<ffffffff81304b9b>] ? pci_stop_bus_device+0x2b/0xa0
     [<ffffffff81304cee>] ? pci_stop_and_remove_bus_device+0xe/0x20
     [<ffffffff8131ecea>] ? pciehp_unconfigure_device+0x9a/0x180
     [<ffffffff8131e7ef>] ? pciehp_disable_slot+0x3f/0xb0
     [<ffffffff8131e8e5>] ? pciehp_power_thread+0x85/0xa0
     [<ffffffff810855df>] ? process_one_work+0x19f/0x3d0

The driver is already littered with calls to pci_channel_offline()
and pci_device_is_present() but still causes a lockup.

I've found that the issue goes away with Keith's patch to cache
surprise removal plus this minor change to pci_channel_offline():
http://www.spinics.net/lists/linux-pci/msg55601.html

Thanks,

Lukas

> 
> And really, all buses need to handle their device being removed at any
> point in time anyway, so the reliance on a "surprise" flag would not
> help much, as usually that is only known _after_ the fact and the device
> would have already started to report other types of errors.
> 
> So what "benifit" would a supprise removal flag provide?  Other than
> some PCI busses could set this, and many others could not, so a driver
> would have to be changed to now support both types of detection,
> _adding_ work to drivers, not making it easier by any means :(
> 
> thanks,
> 
> greg k-h

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2017-01-21  8:42                         ` Greg Kroah-Hartman
  2017-01-21 14:22                           ` Lukas Wunner
@ 2017-01-23 16:04                           ` Keith Busch
  2017-01-25  0:44                             ` Austin.Bolen
                                               ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: Keith Busch @ 2017-01-23 16:04 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Lukas Wunner, linux-pci, Bjorn Helgaas, Wei Zhang, Austin Bolen

On Sat, Jan 21, 2017 at 09:42:43AM +0100, Greg Kroah-Hartman wrote:
> On Sat, Jan 21, 2017 at 08:31:40AM +0100, Lukas Wunner wrote:
> > @Greg KH:
> > Would you entertain a patch adding a bit to struct device which indicates
> > the device was surprise removed?  The PCIe Hotplug and PCIe Downstream
> > Port Containment drivers are both able to recognize surprise removal and
> > can set the bit.
> > 
> > When removing the device we currently perform numerous accesses to config
> > space in the PCI core.  Additionally the driver for the removed device
> > (e.g. network driver, GPU driver) performs mmio accesses to orderly shut
> > down the device.  E.g. when unplugging the Apple Thunderbolt Ethernet
> > adapter the kernel currently locks up as the tg3 driver tries to shutdown
> > the removed device.  If we had a bit to indicate surprise removal we could
> > handle this properly in the PCI core and device driver ->remove hooks.
> > 
> > For comparison, this is what macOS recommends to driver developers:
> > 
> >        "PCI device drivers are typically developed with the expectation
> > 	that the device will not be removed from the PCI bus during its
> > 	operation. However, Thunderbolt technology allows PCI data to be
> > 	tunneled through a Thunderbolt connection, and the Thunderbolt
> > 	cables may be unplugged from the host or device at any time.
> > 	Therefore, the system must be able to cope with the removal of
> > 	PCI devices by the user at any time.
> > 
> > 	The PCI device drivers used with Thunderbolt devices may need to
> > 	be updated in order to handle surprise or unplanned removal.
> > 	In particular, MMIO cycles and PCI Configuration accesses require
> > 	special attention. [...] As a basic guideline, developers should
> > 	modify their drivers to handle a return value of 0xFFFFFFFF.
> > 	If any thread, callback, interrupt filter, or code path in a
> > 	driver receives 0xFFFFFFFF indicating the device has been
> > 	unplugged, then all threads, callbacks, interrupt filters,
> > 	interrupt handlers, and other code paths in that driver must
> > 	cease MMIO reads and writes immediately and prepare for
> > 	termination. [...]
> > 
> > 	Once it has been determined that a device is no longer connected,
> > 	do not try to clean up or reset the hardware as attempts to
> > 	communicate with the hardware may lead to further delays. [...]
> > 	A typical way for a developer to solve this problem is to provide
> > 	a single bottleneck routine for all MMIO reads and have that
> > 	routine check the status of the device before beginning the actual
> > 	transaction."
> > 
> > 	Source: https://developer.apple.com/library/content/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/Basics02/Basics02.html
> > 
> > We lack a comparable functionality and the question is whether to
> > support it only in the PCI core or in a more general fashion in the
> > driver core.  Other buses (such as USB) have to support surprise
> > removal as well.
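The "single bottleneck routine" the macOS guideline above recommends can be
sketched roughly as follows. This is an illustrative userspace model, not a
real kernel API: the struct, the helper name, and the latched `removed` flag
are all stand-ins.

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Hypothetical model of the "bottleneck" pattern: every MMIO read
 * funnels through one helper that checks for the all-ones value a
 * root port synthesizes once the device is gone, and latches that
 * state so all later accesses are skipped.
 */
struct mmio_dev {
	volatile uint32_t *regs;	/* mapped BAR */
	bool removed;			/* latched once all-ones is seen */
};

/* Returns true and stores the value if the device still answers. */
static bool mmio_read32(struct mmio_dev *dev, unsigned int reg, uint32_t *val)
{
	if (dev->removed)
		return false;		/* skip the bus cycle entirely */

	*val = dev->regs[reg];
	if (*val == 0xFFFFFFFFu) {	/* likely synthesized: device gone */
		dev->removed = true;
		return false;
	}
	return true;
}
```

Caveat: a register may legitimately read as all ones, so a production driver
would typically confirm removal with a second read of a register that is
known never to be all ones before latching the flag.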
> 
> PCI devices have _ALWAYS_ had to handle surprise removal, and the MacOS
> guidelines are the exact same thing that we have been telling Linux
> kernel developers for years.
> 
> So no, a surprise removal flag should not be needed, your driver should
> already be handling this problem today (if it isn't, it needs to be
> fixed.)
> 
> Don't try to rely on some out-of-band notification that your device is
> removed, just do what you write above and handle it that way in your
> driver.  There are lots of examples of this in the kernel today, are you
> concerned about any specific driver that does not do this properly?

The generic pci bus driver is what we're concerned about. This is not
intended to target a device specific driver.

Handling a removal has the generic pci driver generate many hundreds
of config and MMIO accesses to a device we know is removed, so we know
those accesses will fail. The flag is only so the pci bus driver knows
it doesn't need to send requests whose outcome we already know.
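As a rough userspace model of that short-circuit (the struct, the helper,
and the error constant are illustrative stand-ins, not the actual kernel
implementation; the kernel's config accessors fail with their own
PCIBIOS-style error codes):

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative error code for a missing device. */
#define CFG_ERR_DEVICE_NOT_FOUND	(-1)

/* Minimal stand-in for the state the patches cache on struct pci_dev. */
struct fake_pci_dev {
	bool is_removed;		/* set by hotplug/DPC on surprise removal */
	uint32_t config[64];		/* simulated config space */
};

/*
 * Model of the short-circuit: once is_removed is set, fail fast with
 * an error instead of issuing a config cycle that can only time out.
 */
static int fake_config_read(struct fake_pci_dev *dev, unsigned int off,
			    uint32_t *val)
{
	if (dev->is_removed) {
		*val = 0xFFFFFFFFu;	/* match what hardware would synthesize */
		return CFG_ERR_DEVICE_NOT_FOUND;
	}
	*val = dev->config[off / 4];
	return 0;
}
```

The point of the design is that the caller gets both an explicit error code
and the familiar all-ones fill, without any bus traffic.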

The concern is that hardware from many vendors handles these as slower
error cases, and we are very closely approaching the completion timeouts
defined in the 10s-of-milliseconds range. We occasionally observe
exceeding the timeout, creating an MCE, but that's not really our
justification for the patches.

When the pci bus driver issues hundreds of such requests, it
significantly slows down device tear-down. We want to handle removal of
switches with potentially many devices below them, and since pci bus
scanning is serialized, the proposed change gets such scenarios to
complete removal in microseconds where we currently take many seconds.

We are currently so slow that a user can easily swap a device while the
linux pci driver is still trying to unbind from the old one, sending
unnecessary config accesses with potentially undefined results when
those requests reach the new device that linux doesn't know about yet.

> And really, all busses need to handle their device being removed at any
> point in time anyway, so the reliance on a "surprise" flag would not
> help much, as usually that is only known _after_ the fact and the device
> would have already started to report other types of errors.
> 
> So what "benifit" would a supprise removal flag provide?  Other than
> some PCI busses could set this, and many others could not, so a driver
> would have to be changed to now support both types of detection,
> _adding_ work to drivers, not making it easier by any means :(

The flag is not intended for a device specific driver to use. The
intention is for the pci bus driver to use it, so no additional work for
other drivers.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCHv4 next 0/3] Limiting pci access
  2017-01-23 16:04                           ` Keith Busch
@ 2017-01-25  0:44                             ` Austin.Bolen
  2017-01-25 21:17                               ` Bjorn Helgaas
  2017-01-25 11:48                             ` Greg Kroah-Hartman
  2017-01-28  7:36                             ` Christoph Hellwig
  2 siblings, 1 reply; 42+ messages in thread
From: Austin.Bolen @ 2017-01-25  0:44 UTC (permalink / raw)
  To: keith.busch, gregkh; +Cc: lukas, linux-pci, bhelgaas, wzhang

To add to what Keith sent, I wanted to throw in my 2 cents on this issue.
Full disclosure - I'm not a Linux developer and so my thoughts on this are
more from the PCIe Base Specification point of view and specifically,
hardening the spec to better support PCIe hot-plug for storage use cases
(NVMe) and providing better guidance in the PCIe spec on how system
software can interact with the PCIe subsystem to better determine/handle
the sort of errors that arise in PCIe removal scenarios.

From Dell EMC side, we would definitely like to see the config read
routines in all operating environments return an error status code any
time data is synthesized by the root port - but especially for failures
due to device removal.  Today, if bad data is synthesized by a root port
due to a failed config read, the config read routines in various
operating environments do not return an error.  We, and others, have
found cases where the caller of these config routines will consume the
bad data since the API did not return an error due to this type of
failure.  Most folks have been patching up every caller to check if data
is synthesized after every config read.  This is not very efficient as
there are many entities that read config space (device driver, bus
driver, systems management software, diagnostic software, etc.).  This
patch, or similar, is very beneficial because it localizes the error
checking for synthesized data to a single location in the core config
read APIs rather than distributing this code to every caller.  The
callers benefit by only requiring a simple check of the return status
code from the API (which many are already doing).

Having the check for a removed device before reading config space will
result in correctly returning an error for most failures of this nature.
There is a race condition even with this patch, however.  If the device
is present when the check occurs, but is removed just after that and
before the config read goes out, the config routine can still return bad
data without an error status return code.  To eliminate this race
condition where bad data is returned, I think that config read routines
will need to have a check *after* reading the data in addition to the
check *before* reading.  The check after reading would need to determine
if the data was synthesized and return an error status if so.
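A minimal model of that check-before plus check-after pattern might look
like the following. All names here are hypothetical; a real implementation
would live in the core config accessors and use the platform's presence
check (e.g. something like pci_device_is_present()).

```c
#include <stdint.h>
#include <stdbool.h>

/* Simulated config space plus a presence flag standing in for the slot state. */
struct cfg_space {
	bool present;
	uint32_t regs[16];
};

/* Raw access: a root port synthesizes all ones once the device is gone. */
static uint32_t raw_read(struct cfg_space *s, unsigned int off)
{
	return s->present ? s->regs[off / 4] : 0xFFFFFFFFu;
}

static int verified_read(struct cfg_space *s, unsigned int off, uint32_t *val)
{
	if (!s->present)
		return -1;		/* check before: device already gone */

	*val = raw_read(s, off);

	/*
	 * Check after: all-ones data is only trustworthy if the device
	 * is still there; otherwise assume the root port synthesized it
	 * because the device vanished mid-transaction.
	 */
	if (*val == 0xFFFFFFFFu && !s->present)
		return -1;
	return 0;
}
```

Note the second presence check only fires for all-ones data, so a register
that legitimately reads as all ones still succeeds while the device is
present; only the removal race is converted into an error.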

I'm interested in hearing what others think on returning an error for
this case in the core config read APIs or any alternative ways to more
gracefully handle these types of errors.

Thanks,
Austin



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2017-01-21 14:22                           ` Lukas Wunner
@ 2017-01-25 11:47                             ` Greg Kroah-Hartman
  0 siblings, 0 replies; 42+ messages in thread
From: Greg Kroah-Hartman @ 2017-01-25 11:47 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Keith Busch, linux-pci, Bjorn Helgaas, Wei Zhang, Austin Bolen

On Sat, Jan 21, 2017 at 03:22:29PM +0100, Lukas Wunner wrote:
> On Sat, Jan 21, 2017 at 09:42:43AM +0100, Greg Kroah-Hartman wrote:
> > On Sat, Jan 21, 2017 at 08:31:40AM +0100, Lukas Wunner wrote:
> > > On Fri, Jan 20, 2017 at 04:35:50PM -0500, Keith Busch wrote:
> > > > On Tue, Dec 13, 2016 at 04:19:32PM -0500, Keith Busch wrote:
> > > > > On Tue, Dec 13, 2016 at 02:50:12PM -0600, Bjorn Helgaas wrote:
> > > > > > And we're apparently still doing a lot of these accesses?  I'm still
> > > > > > curious about exactly what these are, because it may be that we're
> > > > > > doing more than necessary.
> > > > > 
> > > > > It's the MSI-x masking that's our next highest contributor. Masking
> > > > > vectors still requires non-posted commands, and since they're not going
> > > > > through managed API accessors like config space uses, the removed flag
> > > > > is needed for checking before doing significant MMIO.
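The MSI-X short-cut described above can be modeled roughly like this.
These structures are simplified stand-ins for the kernel's msi_desc
bookkeeping, with a write counter added purely to make the skip visible.

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Sketch of the idea in "pci/msix: Skip disabling removed devices":
 * masking a vector is an MMIO write to its control word, and disabling
 * MSI-X on teardown walks every vector.  If the device is already gone,
 * each of those non-posted accesses can only stall, so skip them all.
 */
struct fake_msix_vec {
	uint32_t ctrl;			/* bit 0 = mask, as in the MSI-X table */
};

struct fake_msix_dev {
	bool is_removed;
	unsigned int nvec;
	struct fake_msix_vec *vecs;
	unsigned int mmio_writes;	/* counts the writes we would issue */
};

static void fake_msix_mask_all(struct fake_msix_dev *dev)
{
	unsigned int i;

	if (dev->is_removed)
		return;			/* nothing to mask; skip all MMIO */

	for (i = 0; i < dev->nvec; i++) {
		dev->vecs[i].ctrl |= 1;
		dev->mmio_writes++;
	}
}
```

For a switch with many functions below it, each skipped vector saves one
non-posted transaction that would otherwise have to wait out a completion
timeout.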
> > > > 
> > > > Hi Bjorn,
> > > > 
> > > > Just wanted to do another check with you on this. We'd still like to fence
> > > > off all config access with appropriate error codes, and short cut the most
> > > > significant MMIO access to improve surprise removal. There may still be
> > > > other offenders, but these are the most important ones we've identified.
> > > > 
> > > > I've updated the series to make the new flag an atomic accessor as
> > > > requested, and improved the change logs with more information compelling
> > > > the change. Otherwise it's much the same as before. I know you weren't
> > > > keen on capturing all the access under the umbrella of improving device
> > > > unbinding time, but the general consensus among device makers is that
> > > > it's a good thing to have software return an error early rather than
> > > > send a command we know will fail. Any other thoughts I should consider
> > > > before posting v5?
> > > 
> > > I think Bjorn was pondering whether a flag to indicate surprise removal
> > > should be put in struct device rather than struct pci_dev, so as to
> > > cover other buses capable of surprise removal.  There's already an
> > > "offline" flag in struct device which is set when user space initiates
> > > a safe hot removal via sysfs.
> > > 
> > > Bjorn cc'ed his e-mails of Dec 13 to Greg KH and Alan Stern but got no
> > > replies.
> > 
> > Sorry, I don't recall seeing those :(
> > 
> > > @Greg KH:
> > > Would you entertain a patch adding a bit to struct device which indicates
> > > the device was surprise removed?  The PCIe Hotplug and PCIe Downstream
> > > Port Containment drivers are both able to recognize surprise removal and
> > > can set the bit.
> > > 
> > > When removing the device we currently perform numerous accesses to config
> > > space in the PCI core.  Additionally the driver for the removed device
> > > (e.g. network driver, GPU driver) performs mmio accesses to orderly shut
> > > down the device.  E.g. when unplugging the Apple Thunderbolt Ethernet
> > > adapter the kernel currently locks up as the tg3 driver tries to shutdown
> > > the removed device.  If we had a bit to indicate surprise removal we could
> > > handle this properly in the PCI core and device driver ->remove hooks.
> > > 
> > > For comparison, this is what macOS recommends to driver developers:
> > > 
> > >        "PCI device drivers are typically developed with the expectation
> > > 	that the device will not be removed from the PCI bus during its
> > > 	operation. However, Thunderbolt technology allows PCI data to be
> > > 	tunneled through a Thunderbolt connection, and the Thunderbolt
> > > 	cables may be unplugged from the host or device at any time.
> > > 	Therefore, the system must be able to cope with the removal of
> > > 	PCI devices by the user at any time.
> > > 
> > > 	The PCI device drivers used with Thunderbolt devices may need to
> > > 	be updated in order to handle surprise or unplanned removal.
> > > 	In particular, MMIO cycles and PCI Configuration accesses require
> > > 	special attention. [...] As a basic guideline, developers should
> > > 	modify their drivers to handle a return value of 0xFFFFFFFF.
> > > 	If any thread, callback, interrupt filter, or code path in a
> > > 	driver receives 0xFFFFFFFF indicating the device has been
> > > 	unplugged, then all threads, callbacks, interrupt filters,
> > > 	interrupt handlers, and other code paths in that driver must
> > > 	cease MMIO reads and writes immediately and prepare for
> > > 	termination. [...]
> > > 
> > > 	Once it has been determined that a device is no longer connected,
> > > 	do not try to clean up or reset the hardware as attempts to
> > > 	communicate with the hardware may lead to further delays. [...]
> > > 	A typical way for a developer to solve this problem is to provide
> > > 	a single bottleneck routine for all MMIO reads and have that
> > > 	routine check the status of the device before beginning the actual
> > > 	transaction."
> > > 
> > > 	Source: https://developer.apple.com/library/content/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/Basics02/Basics02.html
> > > 
> > > We lack a comparable functionality and the question is whether to
> > > support it only in the PCI core or in a more general fashion in the
> > > driver core.  Other buses (such as USB) have to support surprise
> > > removal as well.
> > 
> > PCI devices have _ALWAYS_ had to handle surprise removal, and the MacOS
> > guidelines are the exact same thing that we have been telling Linux
> > kernel developers for years.
> > 
> > So no, a surprise removal flag should not be needed, your driver should
> > already be handling this problem today (if it isn't, it needs to be
> > fixed.)
> > 
> > Don't try to rely on some out-of-band notification that your device is
> > removed,
> 
> This isn't about notification, it's about caching.
> 
> Once it is known that the device is gone, that status should be cached.
> This obviates the need to do any further checks for presence and allows
> skipping all following config space and mmio accesses, thereby greatly
> speeding up device removal and avoiding lockups.

How is device removal "slow" today?  What is locking up?

> Since the device is accessed both in the pci_bus_type's ->remove hook
> as well as in ->remove hooks of individual PCI drivers, it makes sense
> to cache the status in struct pci_dev so that it can be checked in the
> PCI core as well as in PCI drivers.

But what can you do with it?  You can't rely on it being valid (i.e. it
might change right after you look at it), so how would you actually use
it?

> Bjorn's question, AFAIU, was if caching in struct device would make more
> sense, to let other buses benefit from the knowledge that the device is
> gone.  Keith's patch recursively sets the is_removed flag on all PCI
> devices below the one that was removed.  (Think of a chain of Thunderbolt
> devices that is surprise removed.)  If instead the flag was in struct
> device, it could be set on any type of child device.  (Think of a USB hub
> in a Thunderbolt dock that is surprise removed.)
> 
> How does the USB bus type currently handle surprise removal?

We just cancel all pending transactions on the device and don't allow
any new ones to succeed.  Nothing special, as there is no need for any
magic "is the device here or not" type of logic.

> > just do what you write above and handle it that way in your
> > driver.  There are lots of examples of this in the kernel today, are you
> > concerned about any specific driver that does not do this properly?
> 
> As I've written above, an example is drivers/net/ethernet/broadcom/tg3.c
> which is the driver used for Apple Thunderbolt Ethernet adapters.
> Surprise removal of those currently results in:

<oops snipped>

Looks like you need to fix the driver :)

Again, this is something that we have been doing for well over a decade,
it shouldn't be a mystery.  If you all want to put this in the pci
device, ok, but be sure to tie it into all PCI bus types (remember
expresscard?)

How would knowing if the device is present or not prevent this specific
driver oops?

> The driver is already littered with calls to pci_channel_offline()
> and pci_device_is_present() but still causes a lockup.

Then your new flag would not have done anything new, right?

> I've found that the issue goes away with Keith's patch to cache
> surprise removal plus this minor change to pci_channel_offline():
> http://www.spinics.net/lists/linux-pci/msg55601.html

Odd that this fixes the above oops, perhaps the logic of "offline" isn't
all that correct?

Anyway, I don't see a real need for this in 'struct device' at the
moment, let's see how you all handle it in the pci core first :)

good luck!

greg k-h

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2017-01-23 16:04                           ` Keith Busch
  2017-01-25  0:44                             ` Austin.Bolen
@ 2017-01-25 11:48                             ` Greg Kroah-Hartman
  2017-01-28  7:36                             ` Christoph Hellwig
  2 siblings, 0 replies; 42+ messages in thread
From: Greg Kroah-Hartman @ 2017-01-25 11:48 UTC (permalink / raw)
  To: Keith Busch
  Cc: Lukas Wunner, linux-pci, Bjorn Helgaas, Wei Zhang, Austin Bolen

On Mon, Jan 23, 2017 at 11:04:52AM -0500, Keith Busch wrote:
> On Sat, Jan 21, 2017 at 09:42:43AM +0100, Greg Kroah-Hartman wrote:
> > On Sat, Jan 21, 2017 at 08:31:40AM +0100, Lukas Wunner wrote:
> > > @Greg KH:
> > > Would you entertain a patch adding a bit to struct device which indicates
> > > the device was surprise removed?  The PCIe Hotplug and PCIe Downstream
> > > Port Containment drivers are both able to recognize surprise removal and
> > > can set the bit.
> > > 
> > > When removing the device we currently perform numerous accesses to config
> > > space in the PCI core.  Additionally the driver for the removed device
> > > (e.g. network driver, GPU driver) performs mmio accesses to orderly shut
> > > down the device.  E.g. when unplugging the Apple Thunderbolt Ethernet
> > > adapter the kernel currently locks up as the tg3 driver tries to shutdown
> > > the removed device.  If we had a bit to indicate surprise removal we could
> > > handle this properly in the PCI core and device driver ->remove hooks.
> > > 
> > > For comparison, this is what macOS recommends to driver developers:
> > > 
> > >        "PCI device drivers are typically developed with the expectation
> > > 	that the device will not be removed from the PCI bus during its
> > > 	operation. However, Thunderbolt technology allows PCI data to be
> > > 	tunneled through a Thunderbolt connection, and the Thunderbolt
> > > 	cables may be unplugged from the host or device at any time.
> > > 	Therefore, the system must be able to cope with the removal of
> > > 	PCI devices by the user at any time.
> > > 
> > > 	The PCI device drivers used with Thunderbolt devices may need to
> > > 	be updated in order to handle surprise or unplanned removal.
> > > 	In particular, MMIO cycles and PCI Configuration accesses require
> > > 	special attention. [...] As a basic guideline, developers should
> > > 	modify their drivers to handle a return value of 0xFFFFFFFF.
> > > 	If any thread, callback, interrupt filter, or code path in a
> > > 	driver receives 0xFFFFFFFF indicating the device has been
> > > 	unplugged, then all threads, callbacks, interrupt filters,
> > > 	interrupt handlers, and other code paths in that driver must
> > > 	cease MMIO reads and writes immediately and prepare for
> > > 	termination. [...]
> > > 
> > > 	Once it has been determined that a device is no longer connected,
> > > 	do not try to clean up or reset the hardware as attempts to
> > > 	communicate with the hardware may lead to further delays. [...]
> > > 	A typical way for a developer to solve this problem is to provide
> > > 	a single bottleneck routine for all MMIO reads and have that
> > > 	routine check the status of the device before beginning the actual
> > > 	transaction."
> > > 
> > > 	Source: https://developer.apple.com/library/content/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/Basics02/Basics02.html
> > > 
> > > We lack a comparable functionality and the question is whether to
> > > support it only in the PCI core or in a more general fashion in the
> > > driver core.  Other buses (such as USB) have to support surprise
> > > removal as well.
> > 
> > PCI devices have _ALWAYS_ had to handle surprise removal, and the MacOS
> > guidelines are the exact same thing that we have been telling Linux
> > kernel developers for years.
> > 
> > So no, a surprise removal flag should not be needed, your driver should
> > already be handling this problem today (if it isn't, it needs to be
> > fixed.)
> > 
> > Don't try to rely on some out-of-band notification that your device is
> > removed, just do what you write above and handle it that way in your
> > driver.  There are lots of examples of this in the kernel today, are you
> > concerned about any specific driver that does not do this properly?
> 
> The generic pci bus driver is what we're concerned about. This is not
> intended to target a device specific driver.
> 
> Handling a removal has the generic pci driver generate many hundreds
> of config and MMIO access to the device we know is removed, so we know
> those will fail. The flag is only so the pci bus driver knows that it
> doesn't need to send requests since we should already know the outcome.
> 
> The concern is that hardware from many vendors handle these as slower
> error cases, and we are very closely approaching the completion timeouts
> defined in the 10s of millisecond range. We occasionally observe exceeding
> the timeout, creating MCE, but that's not really our justification for
> the patches.
> 
> When the pci bus driver issues hundreds of such requests, it significantly
> slows down device tear down time. We want to handle removal of switches
> with potentially many devices below it, and pci bus scanning being
> serialized, the proposed change gets such scenarios to complete removal
> in microseconds where we currently take many seconds.
> 
> We are currently so slow that a user can easily swap a device such that
> the linux pci driver is still trying to unbind from the old device with
> unnecessary config access to potentially undefined results when those
> requests reach the new device that linux doesn't know about yet.

Ok, that's too slow, and not good at all.  And it makes a bit more
sense, thanks for the details.

> > And really, all busses need to handle their device being removed at any
> > point in time anyway, so the reliance on a "surprise" flag would not
> > help much, as usually that is only known _after_ the fact and the device
> > would have already started to report other types of errors.
> > 
> > So what "benifit" would a supprise removal flag provide?  Other than
> > some PCI busses could set this, and many others could not, so a driver
> > would have to be changed to now support both types of detection,
> > _adding_ work to drivers, not making it easier by any means :(
> 
> The flag is not intended for a device specific driver to use. The
> intention is for the pci bus driver to use it, so no additional work for
> other drivers.

But the driver can see this in the structure, so it will get used...

good luck!

greg k-h

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2017-01-25  0:44                             ` Austin.Bolen
@ 2017-01-25 21:17                               ` Bjorn Helgaas
  2017-01-26  1:12                                 ` Austin.Bolen
  0 siblings, 1 reply; 42+ messages in thread
From: Bjorn Helgaas @ 2017-01-25 21:17 UTC (permalink / raw)
  To: Austin.Bolen; +Cc: keith.busch, gregkh, lukas, linux-pci, bhelgaas, wzhang

Hi Austin,

Thanks a lot for bringing some more PCIe spec expertise.  I wish
we had more insight into what the spec authors were thinking
because I'm sure we often miss the point and cause face-palms on
the committee.

On Wed, Jan 25, 2017 at 12:44:34AM +0000, Austin.Bolen@dell.com wrote:
> To add to what Keith sent, I wanted to throw in my 2 cents on
> this issue.  Full disclosure - I'm not a Linux developer and so
> my thoughts on this are more from the PCIe Base Specification
> point of view and specifically, hardening the spec to better
> support PCIe hot-plug for storage use cases (NVMe) and
> providing better guidance in the PCIe spec on how system
> software can interact with the PCIe subsystem to better
> determine/handle the sort of errors that arise in PCIe removal
> scenarios.
> 
> From Dell EMC side, we would definitely like to see the config
> read routines in all operating environments return an error
> status code any time data is synthesized by the root port - but
> especially for failures due to device removal.  Today, if bad
> data is synthesized by a root port due to a failed config read,
> the config read routines in various operating environments do
> not return an error.  We, and others, have found cases where
> the caller of these config routines will consume the bad data
> since the API did not return an error due to this type of
> failure.  Most folks have been patching up every caller to
> check if data is synthesized after every config read.

How exactly do the callers do this?  By checking for all ones
data?  It's legal for a config register to contain all ones, of
course, so checking for all ones by itself is not definitive.  A
driver does have additional knowledge about what registers exist
and what their legal contents are, but of course, the config
accessor in the core does not.

> This is not very efficient as there are many entities that read
> config space (device driver, bus driver, systems management
> software, diagnostic software, etc.).   This patch, or similar,
> is very beneficial because it localizes the error checking for
> synthesized data to a single location in the core config read
> APIs rather than distributing this code to every caller.

This sounds like you mean checking for synthesized data in the
PCI core.  How can we figure that out?  Is there something in the
root port that can tell us the data was synthesized?

Is there a way to do this check for both conventional PCI and
PCIe, since we use the same config accessors for PCI and PCIe?

We support many host bridges and most of the time (x86 platforms
with ACPI) we have no bridge-specific support in Linux, so
anything we do in the core would have to be pretty generic.  I'm
sure this is all obvious, so I apologize for restating it.  I'm
just not sure what the concrete details should look like.

> The callers benefit by only requiring a simple check of the
> return status code from the API (which many are already doing).

Very roughly, I'd guess about 10% of current Linux callers check
the status code:

  $ git grep "pci_read_config" | wc -l
  2922
  $ git grep "pci_read_config" | egrep "if \(pci| = pci" | wc -l
  219

But I certainly agree that the current API is hard for callers to
deal with.

> Having the check for a removed device before reading config
> space will result in correctly returning an error for most
> failures of this nature.  There is a race condition even with
> this patch, however.  If the device is present when the check
> occurs, but is removed just after that and before the config
> read goes out, the config routine can still return bad data
> without an error status return code.  To eliminate this race
> condition where bad data is returned, I think that config read
> routines will need to have a check *after* reading the data in
> addition to the check *before* reading.  The check after
> reading would need to determine if the data was synthesized and
> return and error status if so.
> 
> I'm interested in hearing what others think on returning an
> error for this case in the core config read APIs or any
> alternative ways to more gracefully handle these types of
> errors.

> 
> Thanks,
> Austin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCHv4 next 0/3] Limiting pci access
  2017-01-25 21:17                               ` Bjorn Helgaas
@ 2017-01-26  1:12                                 ` Austin.Bolen
  2017-02-01 16:04                                   ` Bjorn Helgaas
  0 siblings, 1 reply; 42+ messages in thread
From: Austin.Bolen @ 2017-01-26  1:12 UTC (permalink / raw)
  To: helgaas; +Cc: keith.busch, gregkh, lukas, linux-pci, bhelgaas, wzhang

Hi Bjorn,

You're welcome and likewise, the PCIe spec will only benefit by having
a better understanding of the constraints and concerns of the operating
systems that utilize PCIe, so thank you as well for your feedback from
the OS and PCIe driver side.

I know of two ways that are being discussed on how to handle this in a
generic way that don't require a priori knowledge of what the register
value should be.  Both have tradeoffs.  It may be that neither are
acceptable from an OS standpoint or both are. I'm interested in your
perspective, and if neither are acceptable, then it will be great to
get your thoughts on whether there is anything that could be added to
the PCIe Base Spec to provide the OS with an acceptable way to handle
these types of errors.

1. Check Vendor ID Register
If 1's are returned on a config read, then read the Vendor ID.  If the
Vendor ID also returns 1's (which it will if the device is removed)
then we know with certainty it is an error.  On removal events, we see
this check working in practice.

How the check above can fail:  It is possible (though we've not been
able to replicate this failure IRL) that a device could be removed,
causing all 1's to be returned for the data, and by the time the error
checking code in the config read API reads the Vendor ID, a new device
has been inserted and Vendor ID returns a valid value (the config read
API would need to be really slow or get suspended or otherwise stalled
for a large amount of time). An operating system could probably add
some sort of co-ordination between this error checking code in the
config space API and the operating system's PCIe hot-plug driver
(perhaps using dev->is_removed or the like as the proxy) to cover this
corner case and make this bullet-proof, but I'm not sure if it's
necessary.

This mechanism would work for PCI and PCIe.

OSes could choose to not handle this in the core config space APIs and
require all software entities that may access a device's config space
(PCI bus driver/device driver/systems management software/diagnostic
utilities/etc.) to perform this check (what we do today).

2. Use the Root Port Programmed IO (RP PIO) mechanism
The PCIe Spec architected way is to use the RP PIO extensions which
were added in the PCIe 3.1 specification (see sections 6.2.10.3 and
7.31) as part of the Enhanced Downstream Port Containment ECN.  This
ECN added status bits in root ports for the various errors (Unsupported
Request, Completer Abort, Completion Timeout) that could occur due to a
failed non-posted request (Configuration Space access, MMIO access, IO
access), TLP Header Logs to determine which device is associated with
the failure, and a synchronous exception mechanism to notify the
operating system.  This mechanism was intended to solve all corner
cases for these types of failures.

This feature won't be available in products for some time and is
optional, so all root ports may not implement it.

This feature is PCIe only so won't work for PCI.  Is this a deal-breaker?

Unlike option 1, this mechanism most likely cannot be used by anyone
but the OS so this work cannot be pushed to the caller.  There would
be no way to co-ordinate ownership of the RP PIO registers amongst the
many software entities that may access a device's config space (PCI bus
driver/device driver/systems management software/diagnostic
utilities/etc.)

Thanks,
Austin


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2017-01-23 16:04                           ` Keith Busch
  2017-01-25  0:44                             ` Austin.Bolen
  2017-01-25 11:48                             ` Greg Kroah-Hartman
@ 2017-01-28  7:36                             ` Christoph Hellwig
  2 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2017-01-28  7:36 UTC (permalink / raw)
  To: Keith Busch
  Cc: Greg Kroah-Hartman, Lukas Wunner, linux-pci, Bjorn Helgaas,
	Wei Zhang, Austin Bolen

Or to put it the other way around - why even start an operation
that will fail?

We have a pretty similar issue in file system land: A file system
must handle failure from the storage device at any time to provide
data integrity.  But that doesn't mean we can't optimize for the
case where a device is toast (host removed, fails any I/O, etc).
In that case we mark the file system as shutdown and stop submitting
more I/O, potentially avoiding up to hours of recovery time.

It would be helpful for everyone up and down the stack to apply this
"don't do anything stupid if you can avoid it" approach, starting from
the lowest layers.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2017-01-26  1:12                                 ` Austin.Bolen
@ 2017-02-01 16:04                                   ` Bjorn Helgaas
  2017-02-03 20:30                                     ` Austin.Bolen
  2017-02-03 21:43                                     ` Austin.Bolen
  0 siblings, 2 replies; 42+ messages in thread
From: Bjorn Helgaas @ 2017-02-01 16:04 UTC (permalink / raw)
  To: Austin.Bolen; +Cc: keith.busch, gregkh, lukas, linux-pci, bhelgaas, wzhang

[Responding inline, per linux mailing list convention]

Hi Austin,

On Thu, Jan 26, 2017 at 01:12:40AM +0000, Austin.Bolen@dell.com wrote:
> I know of two ways that are being discussed on how to handle this in
> a generic way that don't require a priori knowledge of what the
> register value should be.  Both have tradeoffs.  It may be that
> neither are acceptable from OS standpoint or both are. I'm
> interested in your perspective and if neither are acceptable, then
> it will be great to get your thoughts on if there is anything that
> could be added to the PCIe Base Spec to provide the OS with an
> acceptable way to handle these types of errors.
> 
> 1. Check Vendor ID Register If 1's are returned on a config read
> then read the Vendor ID.  If the Vendor ID also returns 1's (which
> it will if the device is removed) then we know with certainty it is
> an error.  On removal events, we see this check working in practice.
> 
> How the check above can fail:  It is possible (though we've not been
> able to replicate this failure IRL) that a device could be removed
> causing all 1's to be returned for the data and by the time the
> error checking code in the config read API reads the Vendor ID, a
> new device has been inserted and Vendor ID returns a valid value
> (the config read API would need to be really slow or get suspended
> or otherwise stalled for a large amount of time). An operating
> system could probably add some sort of co-ordination between this
> error checking code in the config space API and the operating
> system's PCIe hot-plug driver (perhaps using dev->is_removed or the
> like as the proxy) to cover this corner case and make this
> bullet-proof but I'm not sure if it's necessary.
> 
> This mechanism would work for PCI and PCIe.

I like this idea, and if we do go down the route of trying to reliably
return PCIBIOS_DEVICE_NOT_FOUND, I think we'd have to do something
like this at least for conventional PCI.  I don't think it would help
much to make a mechanism where drivers would have to check for errors
differently for PCI vs. PCIe devices.

I do wonder about the MacOS guidance Lukas pointed out [1].  It sounds
like MacOS drivers are expected to handle 0xFFFFFFFF data values, both
from config reads and MMIO config reads.

Does that mean MacOS considered and rejected this idea (reading Vendor
ID to check for device removal) for some reason?  Or maybe MacOS
config accessors don't return success/failure indication?  Or maybe
MMIO accesses can't return success/failure and they wanted the same
failure model for both MMIO and config space?

I think there's some value in having similar driver expectations for
MacOS and Linux, to make it easier for developers who work on both
OSes.

[1] https://developer.apple.com/library/content/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/Basics02/Basics02.html

> OSes could choose to not handle this in the core config space APIs
> and require all software entities that may access a device's config
> space (PCI bus driver/device driver/systems management
> software/diagnostic utilities/etc.) to perform this check (what we
> do today).
> 
> 2. Use the Root Port Programmed IO (RP PIO) mechanism The PCIe Spec
> architected way is to use RP PIO extensions which were added in the
> PCIe 3.1 specification (see section 6.2.10.3 and 7.31) as part of
> the Enhanced Downstream Port Containment ECN.  This ECN added status
> bits in root ports for the various errors (Unsupported Request,
> Completer Abort, Completion Timeout) that could occur due to a
> failed non-posted request (Configuration Space access, MMIO access,
> IO access), TLP Header Logs to determine the which device is
> associated with the failure, and a synchronous exception mechanism
> to notify the operating system.  This mechanism was intended to
> solve all corner cases for these types of failures.
> 
> This feature won't be available in products for some time and is
> optional so all root ports may not implement it.
> 
> This feature is PCIe only so won't work for PCI.  Is this a
> deal-breaker?

The config accessors (pci_read_config_word(), etc.) currently work for
PCI and PCIe devices, and the caller doesn't have to know the
difference.  I want to preserve that, so I don't want to add
PCIe-specific accessors.  But it's conceivable that the accessors
could internally do something different for PCI vs. PCIe.

We currently serialize all config accesses system-wide, which I think
is unnecessarily strict for ECAM, so I would like to relax it.  It
sounds like this mechanism might require serialization at least at the
Root Port level to do this synchronously.  

> Unlike option 1, this mechanism most likely cannot be used by anyone
> but the OS so this work cannot be pushed to the caller.  There would
> be no way to co-ordinate ownership of the RP PIO registers amongst
> the many software entities that may access a device's config space
> (PCI bus driver/device driver/systems management software/diagnostic
> utilities/etc.)

If we want a change in this area, I think we have to figure out what
the benefit would be.  The current situation is that drivers should
check for 0xffffffff data and treat that as an (unreliable) indication
that the device was removed.  We can't change all those drivers, so I
think we have to preserve this behavior.

We could also make the accessor return something like
PCIBIOS_DEVICE_NOT_FOUND (in addition to returning data of
0xffffffff), which might be a more reliable indication.

In that case, would we change the message to driver writers from
"treat 0xffffffff data as possible removal" to "treat
PCIBIOS_DEVICE_NOT_FOUND as removal indication"?

If PCIBIOS_DEVICE_NOT_FOUND were reliable, that would be an
improvement.  But maybe we could get most of that benefit by adding a
new "pci_device_removed(dev)" API that uses one of the mechanisms you
mention?  Then the guidance could be "if you read 0xffffffff, use
pci_device_removed() and act accordingly."  That's an incremental
change and makes it easier to maintain a driver that works across old
and new versions of Linux.

You mention other entities (systems management, utilities, etc.)
above.  I don't really know what the tradeoffs are there.  Obviously
they wouldn't use the Linux accessors directly, so the strategy there
might not have to be the same.  Maybe the wrappers they use could use
one of these mechanisms.

Bjorn


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCHv4 next 0/3] Limiting pci access
  2017-02-01 16:04                                   ` Bjorn Helgaas
@ 2017-02-03 20:30                                     ` Austin.Bolen
  2017-02-03 20:39                                       ` Greg KH
  2017-02-03 21:43                                     ` Austin.Bolen
  1 sibling, 1 reply; 42+ messages in thread
From: Austin.Bolen @ 2017-02-03 20:30 UTC (permalink / raw)
  To: helgaas; +Cc: keith.busch, gregkh, lukas, linux-pci, bhelgaas, wzhang



-----Original Message-----
From: Bjorn Helgaas [mailto:helgaas@kernel.org]
Sent: Wednesday, February 1, 2017 10:05 AM
To: Bolen, Austin <Austin_Bolen@Dell.com>
Cc: keith.busch@intel.com; gregkh@linuxfoundation.org; lukas@wunner.de; linux-pci@vger.kernel.org; bhelgaas@google.com; wzhang@fb.com
Subject: Re: [PATCHv4 next 0/3] Limiting pci access

[Responding inline, per linux mailing list convention]

[AB]
Hi Bjorn,
Responding in-line per convention.  It becomes a bit jumbled when I do
so, probably because of my email client, so I'll tag my comments with
[AB] ... [/AB] so I can keep it straight.  Hopefully it is readable on
your end.
[/AB]

Hi Austin,

On Thu, Jan 26, 2017 at 01:12:40AM +0000, Austin.Bolen@dell.com wrote:
> I know of two ways that are being discussed on how to handle this in a
> generic way that don't require a priori knowledge of what the register
> value should be.  Both have tradeoffs.  It may be that neither are
> acceptable from OS standpoint or both are. I'm interested in your
> perspective and if neither are acceptable, then it will be great to
> get your thoughts on if there is anything that could be added to the
> PCIe Base Spec to provide the OS with an acceptable way to handle
> these types of errors.
> 
> 1. Check Vendor ID Register If 1's are returned on a config read then
> read the Vendor ID.  If the Vendor ID also returns 1's (which it will
> if the device is removed) then we know with certainty it is an error.
> On removal events, we see this check working in practice.
> 
> How the check above can fail:  It is possible (though we've not been
> able to replicate this failure IRL) that a device could be removed
> causing all 1's to be returned for the data and by the time the error
> checking code in the config read API reads the Vendor ID, a new device
> has been inserted and Vendor ID returns a valid value (the config read
> API would need to be really slow or get suspended or otherwise stalled
> for a large amount of time). An operating system could probably add
> some sort of co-ordination between this error checking code in the
> config space API and the operating system's PCIe hot-plug driver
> (perhaps using dev->is_removed or the like as the proxy) to cover this
> corner case and make this bullet-proof but I'm not sure if it's
> necessary.
> 
> This mechanism would work for PCI and PCIe.

I like this idea, and if we do go down the route of trying to reliably
return PCIBIOS_DEVICE_NOT_FOUND, I think we'd have to do something like
this at least for conventional PCI.  I don't think it would help much
to make a mechanism where drivers would have to check for errors
differently for PCI vs. PCIe devices.

I do wonder about the MacOS guidance Lukas pointed out [1].  It sounds
like MacOS drivers are expected to handle 0xFFFFFFFF data values, both
from config reads and MMIO config reads.

Does that mean MacOS considered and rejected this idea (reading Vendor
ID to check for device removal) for some reason?  Or maybe MacOS config
accessors don't return success/failure indication?  Or maybe MMIO
accesses can't return success/failure and they wanted the same failure
model for both MMIO and config space?

I think there's some value in having similar driver expectations for
MacOS and Linux, to make it easier for developers who work on both
OSes.

[1] https://developer.apple.com/library/content/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/Basics02/Basics02.html

[AB]
We have very similar guidelines.  I can't speak for why MacOS has these
guidelines, but we have them because OSes typically do not do the
check.  If the OSes do the check, then that would be our preference, to
avoid every caller having to do it.

Regarding reading vendor ID, it seems the MacOS guidelines are very
similar: "If 0xFFFFFFFF is a legal value for a particular register
offset, one additional read of a different register, which is known to
never return 0xFFFFFFFF is the preferred mechanism for determining if
the device is still connected."

They do not specify which register that is "known to never return
0xFFFFFFFF" you should read, but Vendor ID is the de facto standard
choice as it is a register that can never be all 1's for any device.
Assume that would be ok for MacOS as well. Since the OS will not, in
general, know when all 1's is valid, it would need to always check.
[/AB]

> OSes could choose to not handle this in the core config space APIs and
> require all software entities that may access a device's config space
> (PCI bus driver/device driver/systems management software/diagnostic
> utilities/etc.) to perform this check (what we do today).
> 
> 2. Use the Root Port Programmed IO (RP PIO) mechanism The PCIe Spec
> architected way is to use RP PIO extensions which were added in the
> PCIe 3.1 specification (see section 6.2.10.3 and 7.31) as part of the
> Enhanced Downstream Port Containment ECN.  This ECN added status bits
> in root ports for the various errors (Unsupported Request, Completer
> Abort, Completion Timeout) that could occur due to a failed non-posted
> request (Configuration Space access, MMIO access, IO access), TLP
> Header Logs to determine which device is associated with the failure,
> and a synchronous exception mechanism to notify the operating system.
> This mechanism was intended to solve all corner cases for these types
> of failures.
> 
> This feature won't be available in products for some time and is
> optional so all root ports may not implement it.
> 
> This feature is PCIe only so won't work for PCI.  Is this a
> deal-breaker?

The config accessors (pci_read_config_word(), etc.) currently work for PCI
and PCIe devices, and the caller doesn't have to know the difference.  I
want to preserve that, so I don't want to add PCIe-specific accessors.  But
it's conceivable that the accessors could internally do something different
for PCI vs. PCIe.

[AB]
Understandable.  One thing on hot-plug PCI... this sort of thing may not be
necessary for PCI, so perhaps it would be ok with a single config accessor
that did different things internally depending on whether or not it was PCI
or PCIe.  PCI hot-plug devices are shielded from a lot of this because they
follow the Standard Hot-Plug Controller model.  In this model, the driver
has to be quiesced and power removed from the slot.  Therefore, Keith's
patch to check if the device is present at the start of the routine should
suffice.

There is a corner case where a config transaction could get out to the
device just before power to the slot is turned off.  Checking for all 1's
after the read would catch this case.

The reason I mention this new RP PIO mechanism here is that it is more
robust than the all 1's check.  The all 1's check will probably suffice for
config reads since there are already accessor routines and they are
serialized.  The other PCI Express non-posted requests (mem read, io read,
config write, io write, and AtomicOp) may require the RP PIO mechanism.
For now I'm just focusing on config read to set a baseline, but long term I
would like to see errors from removal events being handled for all
non-posted requests.
[/AB]

We currently serialize all config accesses system-wide, which I think is
unnecessarily strict for ECAM, so I would like to relax it.  It sounds like
this mechanism might require serialization at least at the Root Port level
to do this synchronously.

[AB]
The RP PIO mechanism should not require serialization.  It should work for
MMIO, which is not currently serialized on most OSes.  There are some talks
on building a proof-of-concept in Linux to show how it would work.

On the topic of relaxing the requirement to serialize ECAM access, I'll
throw in my .02.  There are some weird corner cases you can get into on the
PCIe power down sequence that can lead to Completion Timeout Errors if
config accesses are not serialized (though there may be other ways to
handle this).  There may also be other scenarios that currently rely on the
fact that all OSes serialize config access.  I would be hesitant to make a
change like this on the servers that we ship.  Might be safer to leave the
default as is and allow the user a knob to un-serialize if they know for
certain it is ok in their current operating environment.  But I'd like to
better understand the problems being seen with serializing config accesses
today.  Is this for performance optimization?
[/AB]


> Unlike option 1, this mechanism most likely cannot be used by anyone
> but the OS so this work cannot be pushed to the caller.  There would
> be no way to co-ordinate ownership of the RP PIO registers amongst the
> many software entities that may access a device's config space (PCI
> bus driver/device driver/systems management software/diagnostic
> utilities/etc.)

If we want a change in this area, I think we have to figure out what the
benefit would be.  The current situation is that drivers should check for
0xffffffff data and treat that as an (unreliable) indication that the
device was removed.  We can't change all those drivers, so I think we have
to preserve this behavior.

[AB]
Agree the behavior has to be preserved.  Looks like Keith's patch returns
all 1's as the data even on failure so that it works the same way for
drivers that check all 1's.  If we used the RP PIO mechanism it could also
return all 1's as the data.
[/AB]

We could also make the accessor return something like
PCIBIOS_DEVICE_NOT_FOUND (in addition to returning data of 0xffffffff),
which might be a more reliable indication.

In that case, would we change the message to driver writers from "treat
0xffffffff data as possible removal" to "treat PCIBIOS_DEVICE_NOT_FOUND as
removal indication"?

[AB]
I think drivers could either check for the error return code or check for
all 1's or both.  New drivers can just check the return code (or check all
1's if they really want to).  Old drivers can continue to check all 1's, or
they can update to check the error return code if they want.
[/AB]


If PCIBIOS_DEVICE_NOT_FOUND were reliable, that would be an improvement.
But maybe we could get most of that benefit by adding a new
"pci_device_removed(dev)" API that uses one of the mechanisms you mention?
Then the guidance could be "if you read 0xffffffff, use
pci_device_removed() and act accordingly."  That's an incremental change
and makes it easier to maintain a driver that works across old and new
versions of Linux.
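[Editor's sketch: what that incremental driver guidance could look like.  pci_device_removed() is the *proposed* API from this discussion, not an existing kernel function, and the device structure and register offset are invented for illustration.]

```c
#include <stdbool.h>
#include <stdint.h>

struct pci_dev { bool removed; };       /* stand-in for the real struct */

/* Simulated config accessor: a removed device reads back all 1's. */
static uint32_t read_config_dword(struct pci_dev *dev, unsigned off)
{
    (void)off;
    return dev->removed ? 0xFFFFFFFFu : 0x00000042u;
}

/* Proposed presence check, simulated. */
static bool pci_device_removed(struct pci_dev *dev)
{
    return dev->removed;
}

/* Driver pattern: all 1's alone is only a hint; confirm with the API. */
static int driver_read_status(struct pci_dev *dev, uint32_t *status)
{
    *status = read_config_dword(dev, 0x44);
    if (*status == 0xFFFFFFFFu && pci_device_removed(dev))
        return -1;                      /* treat as surprise removal */
    return 0;                           /* data is trustworthy */
}
```

A driver written this way keeps working on older kernels by simply dropping the pci_device_removed() confirmation and falling back to the all 1's heuristic.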

[AB]
That may be a good middle ground.  The tradeoff here is that well written
code that does check the return code does not need to change if the API
does that check and returns an error.  Only code that does no checks would
need to change.  With this proposal, the code that currently checks the API
status would also need to change, but at least they don't have to invent
their own mechanism to determine device presence.

For MMIO/IO, something like this may be needed anyway.  My understanding is
that in Linux the convention is to read MMIO directly from the driver
rather than using an OS accessor routine, so you'd either need an MMIO
accessor routine or a way for the driver to manually check for presence
when it receives all 1's data back.
[/AB]

You mention other entities (systems management, utilities, etc.) above.  I
don't really know what the tradeoffs are there.  Obviously they wouldn't
use the Linux accessors directly, so the strategy there might not have to
be the same.  Maybe the wrappers they use could use one of these
mechanisms.

[AB]
Our management tools typically install a small driver that provides
low-level access, including to config space.  I believe they do use
operating system calls to read config space today, so they would also
benefit.  So all OS-produced config access mechanisms callable by driver or
application would need to return an error for these device removal cases.
I can check into which mechanisms our tools use in Linux to access config
space.
[/AB]

Bjorn

> -----Original Message-----
> From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> Sent: Wednesday, January 25, 2017 3:17 PM
> To: Bolen, Austin <Austin_Bolen@Dell.com>
> Cc: keith.busch@intel.com; gregkh@linuxfoundation.org;
> lukas@wunner.de; linux-pci@vger.kernel.org; bhelgaas@google.com;
> wzhang@fb.com
> Subject: Re: [PATCHv4 next 0/3] Limiting pci access
>
> Hi Austin,
>
> Thanks a lot for bringing some more PCIe spec expertise.  I wish we had
> more insight into what the spec authors were thinking because I'm sure we
> often miss the point and cause face-palms on the committee.
>=20
> On Wed, Jan 25, 2017 at 12:44:34AM +0000, Austin.Bolen@dell.com wrote:
> > To add to what Keith sent, I wanted to throw in my 2 cents on this
> > issue.  Full disclosure - I'm not a Linux developer and so my
> > thoughts on this are more from the PCIe Base Specification point of
> > view and specifically, hardening the spec to better support PCIe
> > hot-plug for storage use cases (NVMe) and providing better guidance
> > in the PCIe spec on how system software can interact with the PCIe
> > subsystem to better determine/handle the sort of errors that arise
> > in PCIe removal scenarios.
> >
> > From Dell EMC side, we would definitely like to see the config read
> > routines in all operating environments return an error status code
> > any time data is synthesized by the root port - but especially for
> > failures due to device removal.  Today, if bad data is synthesized
> > by a root port due to a failed config read, the config read routines
> > in various operating environments do not return an error.  We, and
> > others, have found cases where the caller of these config routines
> > will consume the bad data since the API did not return an error due
> > to this type of failure.  Most folks have been patching up every
> > caller to check if data is synthesized after every config read.
>
> How exactly do the callers do this?  By checking for all ones data?  It's
> legal for a config register to contain all ones, of course, so checking
> for all ones by itself is not definitive.  A driver does have additional
> knowledge about what registers exist and what their legal contents are,
> but of course, the config accessor in the core does not.
>
> > This is not very efficient as there are many entities that read
> > config space (device driver, bus driver, systems management
> > software, diagnostic software, etc.).   This patch, or similar,
> > is very beneficial because it localizes the error checking for
> > synthesized data to a single location in the core config read APIs
> > rather than distributing this code to every caller.
>=20
> This sounds like you mean checking for synthesized data in the PCI core.
> How can we figure that out?  Is there something in the root port that can
> tell us the data was synthesized?
>
> Is there a way to do this check for both conventional PCI and PCIe, since
> we use the same config accessors for PCI and PCIe?
>
> We support many host bridges and most of the time (x86 platforms with
> ACPI) we have no bridge-specific support in Linux, so anything we do in
> the core would have to be pretty generic.  I'm sure this is all obvious,
> so I apologize for restating it.  I'm just not sure what the concrete
> details should look like.
>
> > The callers benefit by only requiring a simple check of the return
> > status code from the API (which many are already doing).
>
> Very roughly, I'd guess about 10% of current Linux callers check the
> status code:
>
>   $ git grep "pci_read_config" | wc -l
>   2922
>   $ git grep "pci_read_config" | egrep "if \(pci| = pci" | wc -l
>   219
>
> But I certainly agree that the current API is hard for callers to deal
> with.
>=20
> > Having the check for a removed device before reading config space
> > will result in correctly returning an error for most failures of
> > this nature.  There is a race condition even with this patch,
> > however.  If the device is present when the check occurs, but is
> > removed just after that and before the config read goes out, the
> > config routine can still return bad data without an error status
> > return code.  To eliminate this race condition where bad data is
> > returned, I think that config read routines will need to have a
> > check *after* reading the data in addition to the check *before*
> > reading.  The check after reading would need to determine if the
> > data was synthesized and return an error status if so.
> >
> > I'm interested in hearing what others think on returning an error
> > for this case in the core config read APIs or any alternative ways
> > to more gracefully handle these types of errors.
>
> >
> > Thanks,
> > Austin
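[Editor's sketch: the check-before plus check-after pattern quoted in this message, applied to a core config read.  Everything here is simulated for illustration (presence is a flag, a removed device's reads synthesize all 1's); real code would hook into the kernel's config access path.]

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_VENDOR_ID_OFF 0x00u

static bool present = true;             /* simulated slot state */

static uint32_t hw_cfg_read(unsigned off)
{
    if (!present)
        return 0xFFFFFFFFu;             /* root port synthesizes 1's */
    return (off == PCI_VENDOR_ID_OFF) ? 0x14528086u : 0x12345678u;
}

static int core_cfg_read(unsigned off, uint32_t *val)
{
    /* Check *before*: catches most removals cheaply. */
    if (!present) {
        *val = 0xFFFFFFFFu;             /* keep the legacy all-1's contract */
        return -1;
    }
    *val = hw_cfg_read(off);
    /* Check *after*: closes the race where the device vanishes between
     * the presence test and the read.  Vendor ID can never legally be
     * all 1's, so re-reading it spots synthesized data. */
    if (*val == 0xFFFFFFFFu &&
        hw_cfg_read(PCI_VENDOR_ID_OFF) == 0xFFFFFFFFu)
        return -1;
    return 0;
}
```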

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2017-02-03 20:30                                     ` Austin.Bolen
@ 2017-02-03 20:39                                       ` Greg KH
  0 siblings, 0 replies; 42+ messages in thread
From: Greg KH @ 2017-02-03 20:39 UTC (permalink / raw)
  To: Austin.Bolen; +Cc: helgaas, keith.busch, lukas, linux-pci, bhelgaas, wzhang

On Fri, Feb 03, 2017 at 08:30:08PM +0000, Austin.Bolen@dell.com wrote:
> Dell - Internal Use - Confidential  

Nice!  As per the legal instructions I have been given, I'm now deleting
this and can't read anymore, sorry :(

greg k-h

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCHv4 next 0/3] Limiting pci access
  2017-02-01 16:04                                   ` Bjorn Helgaas
  2017-02-03 20:30                                     ` Austin.Bolen
@ 2017-02-03 21:43                                     ` Austin.Bolen
  1 sibling, 0 replies; 42+ messages in thread
From: Austin.Bolen @ 2017-02-03 21:43 UTC (permalink / raw)
  To: helgaas; +Cc: keith.busch, gregkh, lukas, linux-pci, bhelgaas, wzhang

Resending without the confidentiality notice (hopefully) that our email
system stuffs into plain text email.


-----Original Message-----
From: Bjorn Helgaas [mailto:helgaas@kernel.org]
Sent: Wednesday, February 1, 2017 10:05 AM
To: Bolen, Austin <Austin_Bolen@Dell.com>
Cc: keith.busch@intel.com; gregkh@linuxfoundation.org; lukas@wunner.de; linux-pci@vger.kernel.org; bhelgaas@google.com; wzhang@fb.com
Subject: Re: [PATCHv4 next 0/3] Limiting pci access

[Responding inline, per linux mailing list convention]

[AB]
Hi Bjorn,
Responding in-line per convention.  It becomes a bit jumbled when I do so,
probably because of my email client, so I'll tag my comments with [AB] ...
[/AB] so I can keep it straight.  Hopefully it is readable on your end.
[/AB]




Hi Austin,

On Thu, Jan 26, 2017 at 01:12:40AM +0000, Austin.Bolen@dell.com wrote:
> I know of two ways that are being discussed on how to handle this in a
> generic way that don't require a priori knowledge of what the register
> value should be.  Both have tradeoffs.  It may be that neither are
> acceptable from OS standpoint or both are. I'm interested in your
> perspective and if neither are acceptable, then it will be great to
> get your thoughts on if there is anything that could be added to the
> PCIe Base Spec to provide the OS with an acceptable way to handle
> these types of errors.
>
> 1. Check Vendor ID Register If 1's are returned on a config read then
> read the Vendor ID.  If the Vendor ID also returns 1's (which it will
> if the device is removed) then we know with certainty it is an error.
> On removal events, we see this check working in practice.
>
> How the check above can fail:  It is possible (though we've not been
> able to replicate this failure IRL) that a device could be removed
> causing all 1's to be returned for the data and by the time the error
> checking code in the config read API reads the Vendor ID, a new device
> has been inserted and Vendor ID returns a valid value (the config read
> API would need to be really slow or get suspended or otherwise stalled
> for a large amount of time). An operating system could probably add
> some sort of co-ordination between this error checking code in the
> config space API and the operating system's PCIe hot-plug driver
> (perhaps using dev->is_removed or the like as the proxy) to cover this
> corner case and make this bullet-proof but I'm not sure if it's
> necessary.
>
> This mechanism would work for PCI and PCIe.

I like this idea, and if we do go down the route of trying to reliably
return PCIBIOS_DEVICE_NOT_FOUND, I think we'd have to do something like
this at least for conventional PCI.  I don't think it would help much to
make a mechanism where drivers would have to check for errors differently
for PCI vs. PCIe devices.

I do wonder about the MacOS guidance Lukas pointed out [1].  It sounds like
MacOS drivers are expected to handle 0xFFFFFFFF data values, both from
config reads and MMIO config reads.

Does that mean MacOS considered and rejected this idea (reading Vendor ID
to check for device removal) for some reason?  Or maybe MacOS config
accessors don't return success/failure indication?  Or maybe MMIO accesses
can't return success/failure and they wanted the same failure model for
both MMIO and config space?

I think there's some value in having similar driver expectations for MacOS
and Linux, to make it easier for developers who work on both OSes.

[1] https://developer.apple.com/library/content/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/Basics02/Basics02.html

[AB]
We have very similar guidelines.  I can't speak for why MacOS has these
guidelines, but we have them because OSes typically do not do the check.  If
the OSes do the check, then that would be our preference, to avoid every
caller having to do it.

Regarding reading the Vendor ID, it seems the MacOS guidelines are very
similar: "If 0xFFFFFFFF is a legal value for a particular register offset,
one additional read of a different register, which is known to never return
0xFFFFFFFF is the preferred mechanism for determining if the device is
still connected."

They do not specify which register "known to never return 0xFFFFFFFF" you
should read, but Vendor ID is the de facto standard choice, as it is a
register that can never be all 1's for any device.  I assume that would be
ok for MacOS as well.  Since the OS will not, in general, know when all 1's
is valid, it would need to always check.
[/AB]

> OSes could choose to not handle this in the core config space APIs and
> require all software entities that may access a device's config space
> (PCI bus driver/device driver/systems management software/diagnostic
> utilities/etc.) to perform this check (what we do today).
>
> 2. Use the Root Port Programmed IO (RP PIO) mechanism The PCIe Spec
> architected way is to use RP PIO extensions which were added in the
> PCIe 3.1 specification (see section 6.2.10.3 and 7.31) as part of the
> Enhanced Downstream Port Containment ECN.  This ECN added status bits
> in root ports for the various errors (Unsupported Request, Completer
> Abort, Completion Timeout) that could occur due to a failed non-posted
> request (Configuration Space access, MMIO access, IO access), TLP
> Header Logs to determine which device is associated with the
> failure, and a synchronous exception mechanism to notify the operating
> system.  This mechanism was intended to solve all corner cases for
> these types of failures.
>
> This feature won't be available in products for some time and is
> optional so all root ports may not implement it.
>
> This feature is PCIe only so won't work for PCI.  Is this a
> deal-breaker?

The config accessors (pci_read_config_word(), etc.) currently work for PCI
and PCIe devices, and the caller doesn't have to know the difference.  I
want to preserve that, so I don't want to add PCIe-specific accessors.  But
it's conceivable that the accessors could internally do something different
for PCI vs. PCIe.

[AB]
Understandable.  One thing on hot-plug PCI... this sort of thing may not be
necessary for PCI, so perhaps it would be ok with a single config accessor
that did different things internally depending on whether or not it was PCI
or PCIe.  PCI hot-plug devices are shielded from a lot of this because they
follow the Standard Hot-Plug Controller model.  In this model, the driver
has to be quiesced and power removed from the slot.  Therefore, Keith's
patch to check if the device is present at the start of the routine should
suffice.

There is a corner case where a config transaction could get out to the
device just before power to the slot is turned off.  Checking for all 1's
after the read would catch this case.

The reason I mention this new RP PIO mechanism here is that it is more
robust than the all 1's check.  The all 1's check will probably suffice for
config reads since there are already accessor routines and they are
serialized.  The other PCI Express non-posted requests (mem read, io read,
config write, io write, and AtomicOp) may require the RP PIO mechanism.
For now I'm just focusing on config read to set a baseline, but long term I
would like to see errors from removal events being handled for all
non-posted requests.
[/AB]

We currently serialize all config accesses system-wide, which I think is
unnecessarily strict for ECAM, so I would like to relax it.  It sounds like
this mechanism might require serialization at least at the Root Port level
to do this synchronously.

[AB]
The RP PIO mechanism should not require serialization.  It should work for
MMIO, which is not currently serialized on most OSes.  There are some talks
on building a proof-of-concept in Linux to show how it would work.

On the topic of relaxing the requirement to serialize ECAM access, I'll
throw in my .02.  There are some weird corner cases you can get into on the
PCIe power down sequence that can lead to Completion Timeout Errors if
config accesses are not serialized (though there may be other ways to
handle this).  There may also be other scenarios that currently rely on the
fact that all OSes serialize config access.  I would be hesitant to make a
change like this on the servers that we ship.  Might be safer to leave the
default as is and allow the user a knob to un-serialize if they know for
certain it is ok in their current operating environment.  But I'd like to
better understand the problems being seen with serializing config accesses
today.  Is this for performance optimization?
[/AB]


> Unlike option 1, this mechanism most likely cannot be used by anyone
> but the OS so this work cannot be pushed to the caller.  There would
> be no way to co-ordinate ownership of the RP PIO registers amongst the
> many software entities that may access a device's config space (PCI
> bus driver/device driver/systems management software/diagnostic
> utilities/etc.)

If we want a change in this area, I think we have to figure out what the
benefit would be.  The current situation is that drivers should check for
0xffffffff data and treat that as an (unreliable) indication that the
device was removed.  We can't change all those drivers, so I think we have
to preserve this behavior.

[AB]
Agree the behavior has to be preserved.  Looks like Keith's patch returns
all 1's as the data even on failure so that it works the same way for
drivers that check all 1's.  If we used the RP PIO mechanism it could also
return all 1's as the data.
[/AB]

We could also make the accessor return something like
PCIBIOS_DEVICE_NOT_FOUND (in addition to returning data of 0xffffffff),
which might be a more reliable indication.

In that case, would we change the message to driver writers from "treat
0xffffffff data as possible removal" to "treat PCIBIOS_DEVICE_NOT_FOUND as
removal indication"?

[AB]
I think drivers could either check for the error return code or check for
all 1's or both.  New drivers can just check the return code (or check all
1's if they really want to).  Old drivers can continue to check all 1's, or
they can update to check the error return code if they want.
[/AB]


If PCIBIOS_DEVICE_NOT_FOUND were reliable, that would be an improvement.
But maybe we could get most of that benefit by adding a new
"pci_device_removed(dev)" API that uses one of the mechanisms you mention?
Then the guidance could be "if you read 0xffffffff, use
pci_device_removed() and act accordingly."  That's an incremental change
and makes it easier to maintain a driver that works across old and new
versions of Linux.

[AB]
That may be a good middle ground.  The tradeoff here is that well written
code that does check the return code does not need to change if the API
does that check and returns an error.  Only code that does no checks would
need to change.  With this proposal, the code that currently checks the API
status would also need to change, but at least they don't have to invent
their own mechanism to determine device presence.

For MMIO/IO, something like this may be needed anyway.  My understanding is
that in Linux the convention is to read MMIO directly from the driver
rather than using an OS accessor routine, so you'd either need an MMIO
accessor routine or a way for the driver to manually check for presence
when it receives all 1's data back.
[/AB]

You mention other entities (systems management, utilities, etc.) above.  I
don't really know what the tradeoffs are there.  Obviously they wouldn't
use the Linux accessors directly, so the strategy there might not have to
be the same.  Maybe the wrappers they use could use one of these
mechanisms.

[AB]
Our management tools typically install a small driver that provides
low-level access, including to config space.  I believe they do use
operating system calls to read config space today, so they would also
benefit.  So all OS-produced config access mechanisms callable by driver or
application would need to return an error for these device removal cases.
I can check into which mechanisms our tools use in Linux to access config
space.
[/AB]

Bjorn

> -----Original Message-----
> From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> Sent: Wednesday, January 25, 2017 3:17 PM
> To: Bolen, Austin <Austin_Bolen@Dell.com>
> Cc: keith.busch@intel.com; gregkh@linuxfoundation.org;
> lukas@wunner.de; linux-pci@vger.kernel.org; bhelgaas@google.com;
> wzhang@fb.com
> Subject: Re: [PATCHv4 next 0/3] Limiting pci access
>
> Hi Austin,
>
> Thanks a lot for bringing some more PCIe spec expertise.  I wish we had
> more insight into what the spec authors were thinking because I'm sure we
> often miss the point and cause face-palms on the committee.
>=20
> On Wed, Jan 25, 2017 at 12:44:34AM +0000, Austin.Bolen@dell.com wrote:
> > To add to what Keith sent, I wanted to throw in my 2 cents on this
> > issue.  Full disclosure - I'm not a Linux developer and so my
> > thoughts on this are more from the PCIe Base Specification point of
> > view and specifically, hardening the spec to better support PCIe
> > hot-plug for storage use cases (NVMe) and providing better guidance
> > in the PCIe spec on how system software can interact with the PCIe
> > subsystem to better determine/handle the sort of errors that arise
> > in PCIe removal scenarios.
> >
> > From Dell EMC side, we would definitely like to see the config read
> > routines in all operating environments return an error status code
> > any time data is synthesized by the root port - but especially for
> > failures due to device removal.  Today, if bad data is synthesized
> > by a root port due to a failed config read, the config read routines
> > in various operating environments do not return an error.  We, and
> > others, have found cases where the caller of these config routines
> > will consume the bad data since the API did not return an error due
> > to this type of failure.  Most folks have been patching up every
> > caller to check if data is synthesized after every config read.
>
> How exactly do the callers do this?  By checking for all ones data?  It's
> legal for a config register to contain all ones, of course, so checking
> for all ones by itself is not definitive.  A driver does have additional
> knowledge about what registers exist and what their legal contents are,
> but of course, the config accessor in the core does not.
>
> > This is not very efficient as there are many entities that read
> > config space (device driver, bus driver, systems management
> > software, diagnostic software, etc.).   This patch, or similar,
> > is very beneficial because it localizes the error checking for
> > synthesized data to a single location in the core config read APIs
> > rather than distributing this code to every caller.
>=20
> This sounds like you mean checking for synthesized data in the PCI core.
> How can we figure that out?  Is there something in the root port that can
> tell us the data was synthesized?
>
> Is there a way to do this check for both conventional PCI and PCIe, since
> we use the same config accessors for PCI and PCIe?
>
> We support many host bridges and most of the time (x86 platforms with
> ACPI) we have no bridge-specific support in Linux, so anything we do in
> the core would have to be pretty generic.  I'm sure this is all obvious,
> so I apologize for restating it.  I'm just not sure what the concrete
> details should look like.
>
> > The callers benefit by only requiring a simple check of the return
> > status code from the API (which many are already doing).
>
> Very roughly, I'd guess about 10% of current Linux callers check the
> status code:
>
>   $ git grep "pci_read_config" | wc -l
>   2922
>   $ git grep "pci_read_config" | egrep "if \(pci| = pci" | wc -l
>   219
>
> But I certainly agree that the current API is hard for callers to deal
> with.
>=20
> > Having the check for a removed device before reading config space
> > will result in correctly returning an error for most failures of
> > this nature.  There is a race condition even with this patch,
> > however.  If the device is present when the check occurs, but is
> > removed just after that and before the config read goes out, the
> > config routine can still return bad data without an error status
> > return code.  To eliminate this race condition where bad data is
> > returned, I think that config read routines will need to have a
> > check *after* reading the data in addition to the check *before*
> > reading.  The check after reading would need to determine if the
> > data was synthesized and return an error status if so.
> >
> > I'm interested in hearing what others think on returning an error
> > for this case in the core config read APIs or any alternative ways
> > to more gracefully handle these types of errors.
>
> >
> > Thanks,
> > Austin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2016-12-13 23:18                 ` Keith Busch
       [not found]                   ` <B58D82457FDA0744A320A2FC5AC253B93D82F37D@fmsmsx104.amr.corp.intel.com>
@ 2018-11-13  6:05                   ` Bjorn Helgaas
  2018-11-13 14:59                     ` Keith Busch
  1 sibling, 1 reply; 42+ messages in thread
From: Bjorn Helgaas @ 2018-11-13  6:05 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Lukas Wunner, Wei Zhang

On Tue, Dec 13, 2016 at 06:18:40PM -0500, Keith Busch wrote:
> On Tue, Dec 13, 2016 at 02:50:12PM -0600, Bjorn Helgaas wrote:
> > On Mon, Dec 12, 2016 at 07:55:47PM -0500, Keith Busch wrote:
> > > On Mon, Dec 12, 2016 at 05:42:27PM -0600, Bjorn Helgaas wrote:
> > > > On Thu, Dec 08, 2016 at 02:32:53PM -0500, Keith Busch wrote:
> > > > > Depending on the device and the driver, there are hundreds
> > > > > to thousands of non-posted transactions submitted to the
> > > > > device to complete driver unbinding and removal. Since the
> > > > > device is gone, hardware has to handle that as an error
> > > > > condition, which is slower than the a successful non-posted
> > > > > transaction. Since we're doing 1000 of them for no
> > > > > particular reason, it takes a long time. If you hot remove a
> > > > > switch with multiple downstream devices, the serialized
> > > > > removal adds up to many seconds.
> > > > 
> > > > Another thread mentioned 1-2us as a reasonable config access
> > > > cost, and I'm still a little puzzled about how we get to
> > > > something on the order of a million times that cost.
> > > > 
> > > > I know this is all pretty hand-wavey, but 1000 config accesses
> > > > to shut down a device seems unreasonably high.  The entire
> > > > config space is only 4096 bytes, and most devices use a small
> > > > fraction of that.  If we're really doing 1000 accesses, it
> > > > sounds like we're doing something wrong, like polling without
> > > > a delay or something.
> > > 
> > > Every time pci_find_ext_capability is called on a removed
> > > device, the kernel will do 481 failed config space accesses
> > > trying to find that capability. The kernel used to do that
> > > multiple times to find the AER capability under conditions
> > > common to surprise removal.
> > 
> > Right, that's a perfect example.  I'd rather fix issues like this
> > by caching the position as you did with AER.  The "removed" bit
> > makes these issues "go away" without addressing the underlying
> > problem.
> > 
> > We might still need a "removed" bit for other reasons, but I want
> > to be clear about those reasons, not just throw it in under the
> > general "make it go fast" umbrella.
> > 
> > > But now that we cache the AER position (commit: 66b80809), we've
> > > eliminated by far the worst offender. The counts I'm telling you
> > > are still referencing the original captured traces showing long
> > > tear down times, so it's not up-to-date with the most recent
> > > version of the kernel.
> > >  
> > > > I measured the cost of config reads during enumeration using
> > > > the TSC on a 2.8GHz CPU and found the following:
> > > > 
> > > >   1580 cycles, 0.565 usec (device present)
> > > >   1230 cycles, 0.440 usec (empty slot)
> > > >   2130 cycles, 0.761 usec (unimplemented function of multi-function device)
> > > > 
> > > > So 1-2usec does seem the right order of magnitude, and my
> > > > "empty slot" error responses are actually *faster* than the
> > > > "device present" ones, which is plausible to me because the
> > > > Downstream Port can generate the error response immediately
> > > > without sending a packet down the link.  The "unimplemented
> > > > function" responses take longer than the "empty slot", which
> > > > makes sense because the Downstream Port does have to send a
> > > > packet to the device, which then complains because it doesn't
> > > > implement that function.
> > > > 
> > > > Of course, these aren't the same case as yours, where the link
> > > > used to be up but is no longer.  Is there some hardware
> > > > timeout to see if the link will come back?
> > > 
> > > Yes, the hardware does not respond immediately under this test,
> > > which is considered an error condition. This is a reason why
> > > PCIe Device Capabilities 2 Completion Timeout Ranges are
> > > recommended to be in the 10ms range.
> > 
> > And we're apparently still doing a lot of these accesses?  I'm
> > still curious about exactly what these are, because it may be that
> > we're doing more than necessary.
> 
> It's the MSI-x masking that's our next highest contributor. Masking
> vectors still requires non-posted commands, and since they're not
> going through managed API accessors like config space uses, the
> removed flag is needed for checking before doing significant MMIO.

Sorry for digging up this ancient history, but do you remember where
this MSI-X masking with non-posted commands happens?  The masking is
an MMIO write (posted) and there should only be a non-posted MMIO read
if we use the msi_set_mask_bit() path.

I'm looking at this path:

  nvme_pci_disable
    pci_free_irq_vectors
      pci_disable_msix
        pci_msix_shutdown
          if (pci_dev_is_disconnected())    # added by 0170591bb067
            return                          # (skip hw access)
          for_each_pci_msi_entry(...)       # <-- loop
            __pci_msix_desc_mask_irq
              writel                        # <-- MMIO write
          pci_read_config_word(PCI_MSIX_FLAGS)
          pci_write_config_word(PCI_MSIX_FLAGS)
        free_msi_irqs

which only does MMIO *writes*, which I think are posted and do not
require completion and should not cause timeouts.  That's wasted
effort, I agree, but it doesn't seem like it should be a performance
issue.

So there must be another path, probably preceding this one (since
pci_disable_msix() cleans everything up), that masks the vectors and
does the non-posted reads?

The only places we do MMIO reads are msix_program_entries(), which is
done at MSI-X enable time, and msi_set_mask_bit(), which is used in
irq_chip.irq_mask() and irq_chip.irq_unmask() methods.  But I haven't
figured out where that irq_chip path is used in a loop.

Bjorn

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv4 next 0/3] Limiting pci access
  2018-11-13  6:05                   ` Bjorn Helgaas
@ 2018-11-13 14:59                     ` Keith Busch
  0 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2018-11-13 14:59 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Lukas Wunner, Wei Zhang

On Mon, Nov 12, 2018 at 10:05:11PM -0800, Bjorn Helgaas wrote:
> On Tue, Dec 13, 2016 at 06:18:40PM -0500, Keith Busch wrote:
> > It's the MSI-x masking that's our next highest contributor. Masking
> > vectors still requires non-posted commands, and since they're not
> > going through managed API accessors like config space uses, the
> > removed flag is needed for checking before doing significant MMIO.
> 
> Sorry for digging up this ancient history, but do you remember where
> this MSI-X masking with non-posted commands happens?  The masking is
> an MMIO write (posted) and there should only be a non-posted MMIO read
> if we use the msi_set_mask_bit() path.
> 
> I'm looking at this path:
> 
>   nvme_pci_disable
>     pci_free_irq_vectors
>       pci_disable_msix
>         pci_msix_shutdown
>           if (pci_dev_is_disconnected())    # added by 0170591bb067
>             return                          # (skip hw access)
>           for_each_pci_msi_entry(...)       # <-- loop
>             __pci_msix_desc_mask_irq
>               writel                        # <-- MMIO write
>           pci_read_config_word(PCI_MSIX_FLAGS)
>           pci_write_config_word(PCI_MSIX_FLAGS)
>         free_msi_irqs
> 
> which only does MMIO *writes*, which I think are posted and do not
> require completion and should not cause timeouts.  That's wasted
> effort, I agree, but it doesn't seem like it should be a performance
> issue.
>
> So there must be another path, probably preceding this one (since
> pci_disable_msix() cleans everything up), that masks the vectors and
> does the non-posted reads?
> 
> The only places we do MMIO reads are msix_program_entries(), which is
> done at MSI-X enable time, and msi_set_mask_bit(), which is used in
> irq_chip.irq_mask() and irq_chip.irq_unmask() methods.  But I haven't
> figured out where that irq_chip path is used in a loop.

I must have thought masking was a read-modify-write operation. It looks
like we're using a cached value so only posted commands needed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2018-11-13 15:03 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-28 22:58 [PATCHv4 next 0/3] Limiting pci access Keith Busch
2016-10-28 22:58 ` [PATCHv4 next 1/3] pci: Add is_removed state Keith Busch
2016-10-31 10:41   ` Lukas Wunner
2016-12-13 20:56   ` Bjorn Helgaas
2016-12-13 23:07     ` Keith Busch
2016-12-14  2:50       ` Bjorn Helgaas
2016-12-14  2:54         ` Bjorn Helgaas
2016-12-13 23:54     ` Lukas Wunner
2016-10-28 22:58 ` [PATCHv4 next 2/3] pci: No config access for removed devices Keith Busch
2016-10-31 12:18   ` Lukas Wunner
2016-10-28 22:58 ` [PATCHv4 next 3/3] pci/msix: Skip disabling " Keith Busch
2016-10-31 11:00   ` Lukas Wunner
2016-10-31 13:54     ` Keith Busch
2016-12-13 21:18   ` Bjorn Helgaas
2016-12-13 23:01     ` Keith Busch
2016-11-18 23:25 ` [PATCHv4 next 0/3] Limiting pci access Keith Busch
2016-11-23 16:09   ` Bjorn Helgaas
2016-11-28  9:14     ` Wei Zhang
2016-11-28 10:22       ` Lukas Wunner
2016-11-28 18:02     ` Keith Busch
2016-12-08 17:54       ` Bjorn Helgaas
2016-12-08 19:32         ` Keith Busch
2016-12-12 23:42           ` Bjorn Helgaas
2016-12-13  0:55             ` Keith Busch
2016-12-13 20:50               ` Bjorn Helgaas
2016-12-13 23:18                 ` Keith Busch
     [not found]                   ` <B58D82457FDA0744A320A2FC5AC253B93D82F37D@fmsmsx104.amr.corp.intel.com>
     [not found]                     ` <20170120213550.GA16618@localhost.localdomain>
2017-01-21  7:31                       ` Lukas Wunner
2017-01-21  8:42                         ` Greg Kroah-Hartman
2017-01-21 14:22                           ` Lukas Wunner
2017-01-25 11:47                             ` Greg Kroah-Hartman
2017-01-23 16:04                           ` Keith Busch
2017-01-25  0:44                             ` Austin.Bolen
2017-01-25 21:17                               ` Bjorn Helgaas
2017-01-26  1:12                                 ` Austin.Bolen
2017-02-01 16:04                                   ` Bjorn Helgaas
2017-02-03 20:30                                     ` Austin.Bolen
2017-02-03 20:39                                       ` Greg KH
2017-02-03 21:43                                     ` Austin.Bolen
2017-01-25 11:48                             ` Greg Kroah-Hartman
2017-01-28  7:36                             ` Christoph Hellwig
2018-11-13  6:05                   ` Bjorn Helgaas
2018-11-13 14:59                     ` Keith Busch
