* [PATCH v4 0/9] pci: bus and slot reset interfaces
@ 2013-08-05 19:37 Alex Williamson
  2013-08-05 19:37 ` [PATCH v4 1/9] pci: Create pci_reset_bridge_secondary_bus() Alex Williamson
                   ` (8 more replies)
  0 siblings, 9 replies; 16+ messages in thread
From: Alex Williamson @ 2013-08-05 19:37 UTC (permalink / raw)
  To: bhelgaas, linux-pci; +Cc: ddutile, indou.takao, linux-kernel

v3: Incorporate feedback from Don:
      - Expand the comment in patch 5
      - Reverse bus/slot unlocking to go up the tree, comments for all

Thanks,
Alex

This series adds PCI bus and slot reset interfaces to the already
existing function reset interface.  I need this for two reasons.  The
first is that not all devices support function level reset.  Even
some of those that we detect as supporting a PM reset on D3hot->D0
transition actually don't do any reset.  Others have no reset
capability at all.  We currently implement a secondary bus reset
escalation from the function reset path, but only when there is a
single devfn on the bus.  Drivers like vfio can have ownership of
all of the devices on a bus and should therefore have a path to
initiate a secondary bus reset with multiple devices.  This is
particularly required for use of GPUs by userspace, where none of
the predominant GPUs implement a useful function level reset.

The second reason is that even the current function reset escalating
to a secondary bus reset can cause problems with hotplug controllers.
If a root port supports PCIe HP with surprise removal, a bus reset
can trigger a presence detection change, which results in an attempt
to remove the struct device.  By having a slot reset interface, we
can involve the hotplug controllers to allow for a controlled bus
reset and avoid this spurious removal attempt.

---

Alex Williamson (9):
      pci: Create pci_reset_bridge_secondary_bus()
      pci: Add hotplug_slot_ops.reset_slot()
      pci: Implement reset_slot for pciehp
      pci: Add slot reset option to pci_dev_reset
      pci: Split out pci_dev lock/unlock and save/restore
      pci: Add slot and bus reset interfaces
      pci: Wake-up devices before save for reset
      pci: Tune secondary bus reset timing
      pci: Remove aer_do_secondary_bus_reset()


 drivers/pci/hotplug/pciehp.h       |    1 
 drivers/pci/hotplug/pciehp_core.c  |   12 +
 drivers/pci/hotplug/pciehp_hpc.c   |   31 +++
 drivers/pci/pci.c                  |  348 +++++++++++++++++++++++++++++++++---
 drivers/pci/pcie/aer/aerdrv.c      |    2 
 drivers/pci/pcie/aer/aerdrv.h      |    1 
 drivers/pci/pcie/aer/aerdrv_core.c |   35 ----
 include/linux/pci.h                |    3 
 include/linux/pci_hotplug.h        |    4 
 9 files changed, 375 insertions(+), 62 deletions(-)


* [PATCH v4 1/9] pci: Create pci_reset_bridge_secondary_bus()
  2013-08-05 19:37 [PATCH v4 0/9] pci: bus and slot reset interfaces Alex Williamson
@ 2013-08-05 19:37 ` Alex Williamson
  2013-08-05 19:37 ` [PATCH v4 2/9] pci: Add hotplug_slot_ops.reset_slot() Alex Williamson
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Alex Williamson @ 2013-08-05 19:37 UTC (permalink / raw)
  To: bhelgaas, linux-pci; +Cc: ddutile, indou.takao, linux-kernel

Move the secondary bus reset code from pci_parent_bus_reset() into its own
function.  Export it as we'll later be calling it from hotplug controllers
and elsewhere.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/pci/pci.c   |   32 +++++++++++++++++++++++---------
 include/linux/pci.h |    1 +
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e37fea6..d468608 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3215,9 +3215,30 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
 	return 0;
 }
 
-static int pci_parent_bus_reset(struct pci_dev *dev, int probe)
+/**
+ * pci_reset_bridge_secondary_bus - Reset the secondary bus on a PCI bridge.
+ * @dev: Bridge device
+ *
+ * Use the bridge control register to assert reset on the secondary bus.
+ * Devices on the secondary bus are left in power-on state.
+ */
+void pci_reset_bridge_secondary_bus(struct pci_dev *dev)
 {
 	u16 ctrl;
+
+	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
+	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
+	msleep(100);
+
+	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
+	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
+	msleep(100);
+}
+EXPORT_SYMBOL_GPL(pci_reset_bridge_secondary_bus);
+
+static int pci_parent_bus_reset(struct pci_dev *dev, int probe)
+{
 	struct pci_dev *pdev;
 
 	if (pci_is_root_bus(dev->bus) || dev->subordinate || !dev->bus->self)
@@ -3230,14 +3251,7 @@ static int pci_parent_bus_reset(struct pci_dev *dev, int probe)
 	if (probe)
 		return 0;
 
-	pci_read_config_word(dev->bus->self, PCI_BRIDGE_CONTROL, &ctrl);
-	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
-	pci_write_config_word(dev->bus->self, PCI_BRIDGE_CONTROL, ctrl);
-	msleep(100);
-
-	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
-	pci_write_config_word(dev->bus->self, PCI_BRIDGE_CONTROL, ctrl);
-	msleep(100);
+	pci_reset_bridge_secondary_bus(dev->bus->self);
 
 	return 0;
 }
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 0fd1f15..35c1bc4 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -924,6 +924,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
 int __pci_reset_function(struct pci_dev *dev);
 int __pci_reset_function_locked(struct pci_dev *dev);
 int pci_reset_function(struct pci_dev *dev);
+void pci_reset_bridge_secondary_bus(struct pci_dev *dev);
 void pci_update_resource(struct pci_dev *dev, int resno);
 int __must_check pci_assign_resource(struct pci_dev *dev, int i);
 int __must_check pci_reassign_resource(struct pci_dev *dev, int i, resource_size_t add_size, resource_size_t align);



* [PATCH v4 2/9] pci: Add hotplug_slot_ops.reset_slot()
  2013-08-05 19:37 [PATCH v4 0/9] pci: bus and slot reset interfaces Alex Williamson
  2013-08-05 19:37 ` [PATCH v4 1/9] pci: Create pci_reset_bridge_secondary_bus() Alex Williamson
@ 2013-08-05 19:37 ` Alex Williamson
  2013-08-05 19:37 ` [PATCH v4 3/9] pci: Implement reset_slot for pciehp Alex Williamson
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Alex Williamson @ 2013-08-05 19:37 UTC (permalink / raw)
  To: bhelgaas, linux-pci; +Cc: ddutile, indou.takao, linux-kernel

This optional callback allows hotplug controllers to perform slot
specific resets.  These may be necessary in cases where a normal
secondary bus reset can interact with controller logic and expose
spurious hotplugs.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 include/linux/pci_hotplug.h |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/pci_hotplug.h b/include/linux/pci_hotplug.h
index 8db71dc..bd32109 100644
--- a/include/linux/pci_hotplug.h
+++ b/include/linux/pci_hotplug.h
@@ -63,6 +63,9 @@ enum pcie_link_width {
  * @get_adapter_status: Called to get see if an adapter is present in the slot or not.
  *	If this field is NULL, the value passed in the struct hotplug_slot_info
  *	will be used when this value is requested by a user.
+ * @reset_slot: Optional interface to allow override of a bus reset for the
+ *	slot for cases where a secondary bus reset can result in spurious
+ *	hotplug events or where a slot can be reset independent of the bus.
  *
  * The table of function pointers that is passed to the hotplug pci core by a
  * hotplug pci driver.  These functions are called by the hotplug pci core when
@@ -80,6 +83,7 @@ struct hotplug_slot_ops {
 	int (*get_attention_status)	(struct hotplug_slot *slot, u8 *value);
 	int (*get_latch_status)		(struct hotplug_slot *slot, u8 *value);
 	int (*get_adapter_status)	(struct hotplug_slot *slot, u8 *value);
+	int (*reset_slot)		(struct hotplug_slot *slot, int probe);
 };
 
 /**



* [PATCH v4 3/9] pci: Implement reset_slot for pciehp
  2013-08-05 19:37 [PATCH v4 0/9] pci: bus and slot reset interfaces Alex Williamson
  2013-08-05 19:37 ` [PATCH v4 1/9] pci: Create pci_reset_bridge_secondary_bus() Alex Williamson
  2013-08-05 19:37 ` [PATCH v4 2/9] pci: Add hotplug_slot_ops.reset_slot() Alex Williamson
@ 2013-08-05 19:37 ` Alex Williamson
  2013-08-05 19:37 ` [PATCH v4 4/9] pci: Add slot reset option to pci_dev_reset Alex Williamson
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Alex Williamson @ 2013-08-05 19:37 UTC (permalink / raw)
  To: bhelgaas, linux-pci; +Cc: ddutile, indou.takao, linux-kernel

PCIe hotplug has a bus per slot, so we can just use a normal
secondary bus reset.  However, if a slot supports surprise removal
then a bus reset can be seen as a presence detection change triggering
a hot-remove followed by a hot-add.  Disable presence detection from
triggering an interrupt or being polled around the bus reset.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/pci/hotplug/pciehp.h      |    1 +
 drivers/pci/hotplug/pciehp_core.c |   12 ++++++++++++
 drivers/pci/hotplug/pciehp_hpc.c  |   31 +++++++++++++++++++++++++++++++
 3 files changed, 44 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index 7fb3269..541bbe6 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -155,6 +155,7 @@ void pciehp_green_led_off(struct slot *slot);
 void pciehp_green_led_blink(struct slot *slot);
 int pciehp_check_link_status(struct controller *ctrl);
 void pciehp_release_ctrl(struct controller *ctrl);
+int pciehp_reset_slot(struct slot *slot, int probe);
 
 static inline const char *slot_name(struct slot *slot)
 {
diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index 7d72c5e..f4a18f5 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -69,6 +69,7 @@ static int get_power_status	(struct hotplug_slot *slot, u8 *value);
 static int get_attention_status	(struct hotplug_slot *slot, u8 *value);
 static int get_latch_status	(struct hotplug_slot *slot, u8 *value);
 static int get_adapter_status	(struct hotplug_slot *slot, u8 *value);
+static int reset_slot		(struct hotplug_slot *slot, int probe);
 
 /**
  * release_slot - free up the memory used by a slot
@@ -111,6 +112,7 @@ static int init_slot(struct controller *ctrl)
 	ops->disable_slot = disable_slot;
 	ops->get_power_status = get_power_status;
 	ops->get_adapter_status = get_adapter_status;
+	ops->reset_slot = reset_slot;
 	if (MRL_SENS(ctrl))
 		ops->get_latch_status = get_latch_status;
 	if (ATTN_LED(ctrl)) {
@@ -223,6 +225,16 @@ static int get_adapter_status(struct hotplug_slot *hotplug_slot, u8 *value)
 	return pciehp_get_adapter_status(slot, value);
 }
 
+static int reset_slot(struct hotplug_slot *hotplug_slot, int probe)
+{
+	struct slot *slot = hotplug_slot->private;
+
+	ctrl_dbg(slot->ctrl, "%s: physical_slot = %s\n",
+		 __func__, slot_name(slot));
+
+	return pciehp_reset_slot(slot, probe);
+}
+
 static int pciehp_probe(struct pcie_device *dev)
 {
 	int rc;
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index b225573..51f56ef 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -749,6 +749,37 @@ static void pcie_disable_notification(struct controller *ctrl)
 		ctrl_warn(ctrl, "Cannot disable software notification\n");
 }
 
+/*
+ * pciehp has a 1:1 bus:slot relationship so we ultimately want a secondary
+ * bus reset of the bridge, but if the slot supports surprise removal we need
+ * to disable presence detection around the bus reset and clear any spurious
+ * events after.
+ */
+int pciehp_reset_slot(struct slot *slot, int probe)
+{
+	struct controller *ctrl = slot->ctrl;
+
+	if (probe)
+		return 0;
+
+	if (HP_SUPR_RM(ctrl)) {
+		pcie_write_cmd(ctrl, 0, PCI_EXP_SLTCTL_PDCE);
+		if (pciehp_poll_mode)
+			del_timer_sync(&ctrl->poll_timer);
+	}
+
+	pci_reset_bridge_secondary_bus(ctrl->pcie->port);
+
+	if (HP_SUPR_RM(ctrl)) {
+		pciehp_writew(ctrl, PCI_EXP_SLTSTA, PCI_EXP_SLTSTA_PDC);
+		pcie_write_cmd(ctrl, PCI_EXP_SLTCTL_PDCE, PCI_EXP_SLTCTL_PDCE);
+		if (pciehp_poll_mode)
+			int_poll_timeout(ctrl->poll_timer.data);
+	}
+
+	return 0;
+}
+
 int pcie_init_notification(struct controller *ctrl)
 {
 	if (pciehp_request_irq(ctrl))


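The ordering in pciehp_reset_slot() above can be sketched in userspace.
This is only a model under stated assumptions, not the pciehp code:
struct slot_model and the model_*() helpers are hypothetical, the masked
write mirrors how pcie_write_cmd() is used in the patch (cmd, mask), the
bus reset is assumed to always latch a presence change, and the
poll-timer handling is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit values as PCI_EXP_SLTCTL_PDCE and PCI_EXP_SLTSTA_PDC. */
#define SLTCTL_PDCE 0x0008u	/* Presence Detect Changed Enable */
#define SLTSTA_PDC  0x0008u	/* Presence Detect Changed */

/* Hypothetical stand-in for one slot's control and status registers. */
struct slot_model {
	uint16_t sltctl;
	uint16_t sltsta;
	int spurious_events;	/* presence changes that reached a handler */
};

/* Masked update: only the bits selected by mask are changed. */
void model_write_cmd(struct slot_model *s, uint16_t cmd, uint16_t mask)
{
	assert(s);
	s->sltctl = (uint16_t)((s->sltctl & ~mask) | (cmd & mask));
}

/* Assumption: the reset drops the link and latches a presence change. */
void model_bus_reset(struct slot_model *s)
{
	s->sltsta = (uint16_t)(s->sltsta | SLTSTA_PDC);
	if (s->sltctl & SLTCTL_PDCE)
		s->spurious_events++;
}

/* Status bits are modelled as write-1-to-clear. */
void model_clear_status(struct slot_model *s, uint16_t bits)
{
	s->sltsta = (uint16_t)(s->sltsta & ~bits);
}

/* Mask presence detection, reset, clear the latched event, unmask. */
void model_reset_slot(struct slot_model *s)
{
	model_write_cmd(s, 0, SLTCTL_PDCE);
	model_bus_reset(s);
	model_clear_status(s, SLTSTA_PDC);
	model_write_cmd(s, SLTCTL_PDCE, SLTCTL_PDCE);
}
```

In this model, model_reset_slot() leaves spurious_events untouched and
ends with presence detection enabled again, while calling
model_bus_reset() directly on a slot with PDCE set counts one spurious
event.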

* [PATCH v4 4/9] pci: Add slot reset option to pci_dev_reset
  2013-08-05 19:37 [PATCH v4 0/9] pci: bus and slot reset interfaces Alex Williamson
                   ` (2 preceding siblings ...)
  2013-08-05 19:37 ` [PATCH v4 3/9] pci: Implement reset_slot for pciehp Alex Williamson
@ 2013-08-05 19:37 ` Alex Williamson
  2013-08-05 19:37 ` [PATCH v4 5/9] pci: Split out pci_dev lock/unlock and save/restore Alex Williamson
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Alex Williamson @ 2013-08-05 19:37 UTC (permalink / raw)
  To: bhelgaas, linux-pci; +Cc: ddutile, indou.takao, linux-kernel

If the hotplug controller provides a way to reset a slot, use that
before a direct parent bus reset.  Like the bus reset option, this is
only available when a single pci_dev occupies the slot.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/pci/pci.c |   34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d468608..9407aab 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -22,6 +22,7 @@
 #include <linux/interrupt.h>
 #include <linux/device.h>
 #include <linux/pm_runtime.h>
+#include <linux/pci_hotplug.h>
 #include <asm-generic/pci-bridge.h>
 #include <asm/setup.h>
 #include "pci.h"
@@ -3256,6 +3257,35 @@ static int pci_parent_bus_reset(struct pci_dev *dev, int probe)
 	return 0;
 }
 
+static int pci_reset_hotplug_slot(struct hotplug_slot *hotplug, int probe)
+{
+	int rc = -ENOTTY;
+
+	if (!hotplug || !try_module_get(hotplug->ops->owner))
+		return rc;
+
+	if (hotplug->ops->reset_slot)
+		rc = hotplug->ops->reset_slot(hotplug, probe);
+
+	module_put(hotplug->ops->owner);
+
+	return rc;
+}
+
+static int pci_dev_reset_slot_function(struct pci_dev *dev, int probe)
+{
+	struct pci_dev *pdev;
+
+	if (dev->subordinate || !dev->slot)
+		return -ENOTTY;
+
+	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
+		if (pdev != dev && pdev->slot == dev->slot)
+			return -ENOTTY;
+
+	return pci_reset_hotplug_slot(dev->slot->hotplug, probe);
+}
+
 static int __pci_dev_reset(struct pci_dev *dev, int probe)
 {
 	int rc;
@@ -3278,6 +3308,10 @@ static int __pci_dev_reset(struct pci_dev *dev, int probe)
 	if (rc != -ENOTTY)
 		goto done;
 
+	rc = pci_dev_reset_slot_function(dev, probe);
+	if (rc != -ENOTTY)
+		goto done;
+
 	rc = pci_parent_bus_reset(dev, probe);
 done:
 	return rc;


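The escalation in __pci_dev_reset() relies on -ENOTTY meaning "this
method does not apply, try the next one", with any other return value
ending the chain.  A minimal userspace sketch of that convention
follows.  The names reset_method_fn, try_reset_methods() and the
method_*() stubs are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical reset method: returns 0 on success, -ENOTTY if the
 * method is not supported, or another negative errno on failure.
 */
typedef int (*reset_method_fn)(int probe);

/* Try each method in order; stop at the first that is not -ENOTTY. */
int try_reset_methods(const reset_method_fn *methods, size_t count, int probe)
{
	int rc = -ENOTTY;
	size_t i;

	assert(methods != NULL || count == 0);

	for (i = 0; i < count; i++) {
		rc = methods[i](probe);
		if (rc != -ENOTTY)
			break;	/* success or a real error: stop escalating */
	}

	return rc;
}

/* Stub methods standing in for the real reset paths. */
int method_unsupported(int probe) { (void)probe; return -ENOTTY; }
int method_ok(int probe)          { (void)probe; return 0; }
int method_io_error(int probe)    { (void)probe; return -EIO; }
```

Placing the slot reset ahead of the parent bus reset in such a list
matches the order this patch adds: a later method is reached only when
every earlier one reported -ENOTTY.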

* [PATCH v4 5/9] pci: Split out pci_dev lock/unlock and save/restore
  2013-08-05 19:37 [PATCH v4 0/9] pci: bus and slot reset interfaces Alex Williamson
                   ` (3 preceding siblings ...)
  2013-08-05 19:37 ` [PATCH v4 4/9] pci: Add slot reset option to pci_dev_reset Alex Williamson
@ 2013-08-05 19:37 ` Alex Williamson
  2013-08-05 19:37 ` [PATCH v4 6/9] pci: Add slot and bus reset interfaces Alex Williamson
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Alex Williamson @ 2013-08-05 19:37 UTC (permalink / raw)
  To: bhelgaas, linux-pci; +Cc: ddutile, indou.takao, linux-kernel

Only cosmetic code changes to existing paths.  Expand the comment in
the new pci_dev_save_and_disable() function since there's a lot
hidden in that Command register write.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/pci/pci.c |   55 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 17 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 9407aab..4a0275c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3317,22 +3317,49 @@ done:
 	return rc;
 }
 
+static void pci_dev_lock(struct pci_dev *dev)
+{
+	pci_cfg_access_lock(dev);
+	/* block PM suspend, driver probe, etc. */
+	device_lock(&dev->dev);
+}
+
+static void pci_dev_unlock(struct pci_dev *dev)
+{
+	device_unlock(&dev->dev);
+	pci_cfg_access_unlock(dev);
+}
+
+static void pci_dev_save_and_disable(struct pci_dev *dev)
+{
+	pci_save_state(dev);
+	/*
+	 * Disable the device by clearing the Command register, except for
+	 * INTx-disable which is set.  This not only disables MMIO and I/O port
+	 * BARs, but also prevents the device from being Bus Master, preventing
+	 * DMA from the device including MSI/MSI-X interrupts.  For PCI 2.3
+	 * compliant devices, INTx-disable prevents legacy interrupts.
+	 */
+	pci_write_config_word(dev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
+}
+
+static void pci_dev_restore(struct pci_dev *dev)
+{
+	pci_restore_state(dev);
+}
+
 static int pci_dev_reset(struct pci_dev *dev, int probe)
 {
 	int rc;
 
-	if (!probe) {
-		pci_cfg_access_lock(dev);
-		/* block PM suspend, driver probe, etc. */
-		device_lock(&dev->dev);
-	}
+	if (!probe)
+		pci_dev_lock(dev);
 
 	rc = __pci_dev_reset(dev, probe);
 
-	if (!probe) {
-		device_unlock(&dev->dev);
-		pci_cfg_access_unlock(dev);
-	}
+	if (!probe)
+		pci_dev_unlock(dev);
+
 	return rc;
 }
 /**
@@ -3423,17 +3450,11 @@ int pci_reset_function(struct pci_dev *dev)
 	if (rc)
 		return rc;
 
-	pci_save_state(dev);
-
-	/*
-	 * both INTx and MSI are disabled after the Interrupt Disable bit
-	 * is set and the Bus Master bit is cleared.
-	 */
-	pci_write_config_word(dev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
+	pci_dev_save_and_disable(dev);
 
 	rc = pci_dev_reset(dev, 0);
 
-	pci_restore_state(dev);
+	pci_dev_restore(dev);
 
 	return rc;
 }


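To make the effect of that single Command register write concrete, here
is a small userspace sketch.  The bit values are those of
PCI_COMMAND_IO, PCI_COMMAND_MEMORY, PCI_COMMAND_MASTER and
PCI_COMMAND_INTX_DISABLE.  struct cfg_model and the model_*() helpers
are hypothetical, and only the Command register is modelled, whereas
pci_save_state() covers far more of config space.

```c
#include <assert.h>
#include <stdint.h>

#define CMD_IO            0x0001u	/* respond to I/O space */
#define CMD_MEMORY        0x0002u	/* respond to memory space */
#define CMD_MASTER        0x0004u	/* bus mastering (DMA, MSI/MSI-X) */
#define CMD_INTX_DISABLE  0x0400u	/* INTx emulation disable */

/* Hypothetical model holding only the Command register. */
struct cfg_model {
	uint16_t command;
	uint16_t saved_command;
};

/* Save the current value, then write INTx-disable and nothing else. */
void model_save_and_disable(struct cfg_model *c)
{
	assert(c);
	c->saved_command = c->command;
	c->command = CMD_INTX_DISABLE;
}

/* Put the saved value back. */
void model_restore(struct cfg_model *c)
{
	assert(c);
	c->command = c->saved_command;
}
```

Because the write replaces the whole register, the I/O, memory and bus
master enables are all cleared in one step, and the restore brings back
whatever was set before.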

* [PATCH v4 6/9] pci: Add slot and bus reset interfaces
  2013-08-05 19:37 [PATCH v4 0/9] pci: bus and slot reset interfaces Alex Williamson
                   ` (4 preceding siblings ...)
  2013-08-05 19:37 ` [PATCH v4 5/9] pci: Split out pci_dev lock/unlock and save/restore Alex Williamson
@ 2013-08-05 19:37 ` Alex Williamson
  2013-08-05 19:37 ` [PATCH v4 7/9] pci: Wake-up devices before save for reset Alex Williamson
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Alex Williamson @ 2013-08-05 19:37 UTC (permalink / raw)
  To: bhelgaas, linux-pci; +Cc: ddutile, indou.takao, linux-kernel

Sometimes pci_reset_function is not sufficient.  We have cases where
devices do not support any kind of reset, but there might be multiple
functions on the bus preventing pci_reset_function from doing a
secondary bus reset.  We also have cases where a device will advertise
that it supports a PM reset, but really does nothing on D3hot->D0
(graphics cards are notorious for this).  These devices often also
have more than one function, so even blacklisting PM reset for them
wouldn't allow a secondary bus reset through pci_reset_function.

If a driver supports multiple devices it should have the ability to
induce a bus reset when it needs to.  This patch provides that ability
through pci_reset_slot and pci_reset_bus.  It's the caller's
responsibility when using these interfaces to understand that all of
the devices in or below the slot (or on or below the bus) will be
reset and therefore should be under control of the caller.  PCI state
of all the affected devices is saved and restored around these resets,
but internal state of all of the affected devices is reset (which
should be the intention).

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/pci/pci.c   |  209 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pci.h |    2 
 2 files changed, 211 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 4a0275c..1dba7dd 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3460,6 +3460,215 @@ int pci_reset_function(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_reset_function);
 
+/* Lock devices from the top of the tree down */
+static void pci_bus_lock(struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		pci_dev_lock(dev);
+		if (dev->subordinate)
+			pci_bus_lock(dev->subordinate);
+	}
+}
+
+/* Unlock devices from the bottom of the tree up */
+static void pci_bus_unlock(struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		if (dev->subordinate)
+			pci_bus_unlock(dev->subordinate);
+		pci_dev_unlock(dev);
+	}
+}
+
+/* Lock devices from the top of the tree down */
+static void pci_slot_lock(struct pci_slot *slot)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &slot->bus->devices, bus_list) {
+		if (!dev->slot || dev->slot != slot)
+			continue;
+		pci_dev_lock(dev);
+		if (dev->subordinate)
+			pci_bus_lock(dev->subordinate);
+	}
+}
+
+/* Unlock devices from the bottom of the tree up */
+static void pci_slot_unlock(struct pci_slot *slot)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &slot->bus->devices, bus_list) {
+		if (!dev->slot || dev->slot != slot)
+			continue;
+		if (dev->subordinate)
+			pci_bus_unlock(dev->subordinate);
+		pci_dev_unlock(dev);
+	}
+}
+
+/* Save and disable devices from the top of the tree down */
+static void pci_bus_save_and_disable(struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		pci_dev_save_and_disable(dev);
+		if (dev->subordinate)
+			pci_bus_save_and_disable(dev->subordinate);
+	}
+}
+
+/*
+ * Restore devices from top of the tree down - parent bridges need to be
+ * restored before we can get to subordinate devices.
+ */
+static void pci_bus_restore(struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		pci_dev_restore(dev);
+		if (dev->subordinate)
+			pci_bus_restore(dev->subordinate);
+	}
+}
+
+/* Save and disable devices from the top of the tree down */
+static void pci_slot_save_and_disable(struct pci_slot *slot)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &slot->bus->devices, bus_list) {
+		if (!dev->slot || dev->slot != slot)
+			continue;
+		pci_dev_save_and_disable(dev);
+		if (dev->subordinate)
+			pci_bus_save_and_disable(dev->subordinate);
+	}
+}
+
+/*
+ * Restore devices from top of the tree down - parent bridges need to be
+ * restored before we can get to subordinate devices.
+ */
+static void pci_slot_restore(struct pci_slot *slot)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &slot->bus->devices, bus_list) {
+		if (!dev->slot || dev->slot != slot)
+			continue;
+		pci_dev_restore(dev);
+		if (dev->subordinate)
+			pci_bus_restore(dev->subordinate);
+	}
+}
+
+static int pci_slot_reset(struct pci_slot *slot, int probe)
+{
+	int rc;
+
+	if (!slot)
+		return -ENOTTY;
+
+	if (!probe)
+		pci_slot_lock(slot);
+
+	might_sleep();
+
+	rc = pci_reset_hotplug_slot(slot->hotplug, probe);
+
+	if (!probe)
+		pci_slot_unlock(slot);
+
+	return rc;
+}
+
+/**
+ * pci_reset_slot - reset a PCI slot
+ * @slot: PCI slot to reset
+ *
+ * A PCI bus may host multiple slots, each slot may support a reset mechanism
+ * independent of other slots.  For instance, some slots may support slot power
+ * control.  In the case of a 1:1 bus to slot architecture, this function may
+ * wrap the bus reset to avoid spurious slot related events such as hotplug.
+ * Generally a slot reset should be attempted before a bus reset.  All of the
+ * functions of the slot and any subordinate buses behind the slot are reset
+ * through this function.  PCI config space of all devices in the slot and
+ * behind the slot is saved before and restored after reset.
+ *
+ * Return 0 on success, non-zero on error.
+ */
+int pci_reset_slot(struct pci_slot *slot)
+{
+	int rc;
+
+	rc = pci_slot_reset(slot, 1);
+	if (rc)
+		return rc;
+
+	pci_slot_save_and_disable(slot);
+
+	rc = pci_slot_reset(slot, 0);
+
+	pci_slot_restore(slot);
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(pci_reset_slot);
+
+static int pci_bus_reset(struct pci_bus *bus, int probe)
+{
+	if (!bus->self)
+		return -ENOTTY;
+
+	if (probe)
+		return 0;
+
+	pci_bus_lock(bus);
+
+	might_sleep();
+
+	pci_reset_bridge_secondary_bus(bus->self);
+
+	pci_bus_unlock(bus);
+
+	return 0;
+}
+
+/**
+ * pci_reset_bus - reset a PCI bus
+ * @bus: top level PCI bus to reset
+ *
+ * Do a bus reset on the given bus and any subordinate buses, saving
+ * and restoring state of all devices.
+ *
+ * Return 0 on success, non-zero on error.
+ */
+int pci_reset_bus(struct pci_bus *bus)
+{
+	int rc;
+
+	rc = pci_bus_reset(bus, 1);
+	if (rc)
+		return rc;
+
+	pci_bus_save_and_disable(bus);
+
+	rc = pci_bus_reset(bus, 0);
+
+	pci_bus_restore(bus);
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(pci_reset_bus);
+
 /**
  * pcix_get_max_mmrbc - get PCI-X maximum designed memory read byte count
  * @dev: PCI device to query
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 35c1bc4..1a8fd34 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -924,6 +924,8 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
 int __pci_reset_function(struct pci_dev *dev);
 int __pci_reset_function_locked(struct pci_dev *dev);
 int pci_reset_function(struct pci_dev *dev);
+int pci_reset_slot(struct pci_slot *slot);
+int pci_reset_bus(struct pci_bus *bus);
 void pci_reset_bridge_secondary_bus(struct pci_dev *dev);
 void pci_update_resource(struct pci_dev *dev, int resno);
 int __must_check pci_assign_resource(struct pci_dev *dev, int i);


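The lock and unlock traversals in pci_bus_lock() and pci_bus_unlock()
can be modelled in userspace to see the resulting order.  This is a
sketch under assumptions: struct dev_model, struct order_log and the
model_*() functions are hypothetical, a bus is reduced to a singly
linked list of devices, and "locking" just records the device name.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOG 16

/*
 * Hypothetical device node: next is the sibling on the same bus,
 * subordinate is the first device behind a bridge (NULL otherwise).
 */
struct dev_model {
	char name;
	struct dev_model *next;
	struct dev_model *subordinate;
};

/* Records the order in which devices are visited. */
struct order_log {
	char entries[MAX_LOG];
	size_t count;
};

void log_visit(struct order_log *log, char name)
{
	assert(log->count < MAX_LOG);
	log->entries[log->count++] = name;
}

/* A bridge is visited before the devices behind it. */
void model_bus_lock(struct dev_model *first, struct order_log *log)
{
	struct dev_model *dev;

	for (dev = first; dev; dev = dev->next) {
		log_visit(log, dev->name);
		if (dev->subordinate)
			model_bus_lock(dev->subordinate, log);
	}
}

/* The devices behind a bridge are visited before the bridge itself. */
void model_bus_unlock(struct dev_model *first, struct order_log *log)
{
	struct dev_model *dev;

	for (dev = first; dev; dev = dev->next) {
		if (dev->subordinate)
			model_bus_unlock(dev->subordinate, log);
		log_visit(log, dev->name);
	}
}
```

For a bus holding bridge A and device B, with C and D behind A, the
model locks in the order A, C, D, B and unlocks in the order C, D, A, B.
Children are always released before their bridge, but siblings on one
bus are walked in the same list order both times, so the unlock order is
not a strict reversal of the lock order.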

* [PATCH v4 7/9] pci: Wake-up devices before save for reset
  2013-08-05 19:37 [PATCH v4 0/9] pci: bus and slot reset interfaces Alex Williamson
                   ` (5 preceding siblings ...)
  2013-08-05 19:37 ` [PATCH v4 6/9] pci: Add slot and bus reset interfaces Alex Williamson
@ 2013-08-05 19:37 ` Alex Williamson
  2013-08-05 19:37 ` [PATCH v4 8/9] pci: Tune secondary bus reset timing Alex Williamson
  2013-08-05 19:37 ` [PATCH v4 9/9] pci: Remove aer_do_secondary_bus_reset() Alex Williamson
  8 siblings, 0 replies; 16+ messages in thread
From: Alex Williamson @ 2013-08-05 19:37 UTC (permalink / raw)
  To: bhelgaas, linux-pci; +Cc: ddutile, indou.takao, linux-kernel

Devices come out of reset in D0.  Restoring a device to a different
post-reset state takes more smarts than our simple config space
restore, which can leave devices in an inconsistent state.  For
example, if a device is reset in D3, but the restore doesn't
successfully return the device to D3, then the actual state of the
device and dev->current_state are contradictory.  Put everything
in D0 going into the reset, then we don't need to do anything
special on the way out.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/pci/pci.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 1dba7dd..b204206 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3332,6 +3332,13 @@ static void pci_dev_unlock(struct pci_dev *dev)
 
 static void pci_dev_save_and_disable(struct pci_dev *dev)
 {
+	/*
+	 * Wake-up device prior to save.  PM registers default to D0 after
+	 * reset and a simple register restore doesn't reliably return
+	 * to a non-D0 state anyway.
+	 */
+	pci_set_power_state(dev, PCI_D0);
+
 	pci_save_state(dev);
 	/*
 	 * Disable the device by clearing the Command register, except for



* [PATCH v4 8/9] pci: Tune secondary bus reset timing
  2013-08-05 19:37 [PATCH v4 0/9] pci: bus and slot reset interfaces Alex Williamson
                   ` (6 preceding siblings ...)
  2013-08-05 19:37 ` [PATCH v4 7/9] pci: Wake-up devices before save for reset Alex Williamson
@ 2013-08-05 19:37 ` Alex Williamson
  2013-08-06 23:27   ` Alexander Duyck
  2013-08-05 19:37 ` [PATCH v4 9/9] pci: Remove aer_do_secondary_bus_reset() Alex Williamson
  8 siblings, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2013-08-05 19:37 UTC (permalink / raw)
  To: bhelgaas, linux-pci; +Cc: ddutile, indou.takao, linux-kernel

The PCI spec indicates that with stable power, reset needs to be
asserted for a minimum of 1ms (Trst).  Seems like we should be able
to assume power is stable for a runtime secondary bus reset.  The
current code has always used 100ms with no explanation where that
came from.  The aer_do_secondary_bus_reset() function uses 2ms, but
that seems to be a misinterpretation of the PCIe spec, where hot
reset is implemented by TS1 ordered sets containing the hot reset
command.  After a 2ms delay the state machine enters the detect state,
but to generate a link down, only two consecutive TS1 hot reset
ordered sets are required.  1ms should be plenty for that.

After reset is de-asserted we must wait for devices to complete
initialization.  The specs refer to this as "recovery time" (Trhfa).
For PCI this is 2^25 clock cycles or 2^26 for PCI-X.  For minimum
bus speeds, both of those come to 1s.  PCIe "softens" this
requirement with the Configuration Request Retry Status (CRS)
completion status.  Theoretically we could use CRS to shorten the
wait time.  We don't make use of that here, using a fixed 1s delay
to allow devices to re-initialize.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/pci/pci.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b204206..ba64a7e 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3230,11 +3230,22 @@ void pci_reset_bridge_secondary_bus(struct pci_dev *dev)
 	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
 	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
 	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
-	msleep(100);
+	/*
+	 * PCI spec v3.0 7.6.4.2 requires minimum Trst of 1ms.
+	 */
+	msleep(1);
 
 	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
 	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
-	msleep(100);
+
+	/*
+	 * Trhfa for conventional PCI is 2^25 clock cycles.
+	 * Assuming a minimum 33MHz clock this results in a 1s
+	 * delay before we can consider subordinate devices to
+	 * be re-initialized.  PCIe has some ways to shorten this,
+	 * but we don't make use of them yet.
+	 */
+	ssleep(1);
 }
 EXPORT_SYMBOL_GPL(pci_reset_bridge_secondary_bus);
 


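The 1s figure can be checked with a few lines of arithmetic.  The sketch
below takes the 33MHz clock from the comment in the patch.  The 66MHz
value used for PCI-X in the note after the code is an assumption for
illustration and is not stated in the patch.  recovery_time_ms() is a
hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Time for 2^exponent clock cycles at clock_hz, in whole milliseconds
 * (rounded down).
 */
uint64_t recovery_time_ms(unsigned int exponent, uint64_t clock_hz)
{
	assert(exponent < 32 && clock_hz > 0);

	return ((UINT64_C(1) << exponent) * 1000u) / clock_hz;
}
```

recovery_time_ms(25, 33000000) and recovery_time_ms(26, 66000000) both
evaluate to 1016, i.e. just over one second, which is consistent with
the ssleep(1) chosen above being a rounded figure.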

* [PATCH v4 9/9] pci: Remove aer_do_secondary_bus_reset()
  2013-08-05 19:37 [PATCH v4 0/9] pci: bus and slot reset interfaces Alex Williamson
                   ` (7 preceding siblings ...)
  2013-08-05 19:37 ` [PATCH v4 8/9] pci: Tune secondary bus reset timing Alex Williamson
@ 2013-08-05 19:37 ` Alex Williamson
  8 siblings, 0 replies; 16+ messages in thread
From: Alex Williamson @ 2013-08-05 19:37 UTC (permalink / raw)
  To: bhelgaas, linux-pci; +Cc: ddutile, indou.takao, linux-kernel

One PCI bus reset function to rule them all.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/pci/pcie/aer/aerdrv.c      |    2 +-
 drivers/pci/pcie/aer/aerdrv.h      |    1 -
 drivers/pci/pcie/aer/aerdrv_core.c |   35 +----------------------------------
 3 files changed, 2 insertions(+), 36 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
index 76ef634..0bf82a2 100644
--- a/drivers/pci/pcie/aer/aerdrv.c
+++ b/drivers/pci/pcie/aer/aerdrv.c
@@ -352,7 +352,7 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
 	reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
 	pci_write_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, reg32);
 
-	aer_do_secondary_bus_reset(dev);
+	pci_reset_bridge_secondary_bus(dev);
 	dev_printk(KERN_DEBUG, &dev->dev, "Root Port link has been reset\n");
 
 	/* Clear Root Error Status */
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index 90ea3e8..84420b7 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -106,7 +106,6 @@ static inline pci_ers_result_t merge_result(enum pci_ers_result orig,
 }
 
 extern struct bus_type pcie_port_bus_type;
-void aer_do_secondary_bus_reset(struct pci_dev *dev);
 int aer_init(struct pcie_device *dev);
 void aer_isr(struct work_struct *work);
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 8b68ae5..85ca36f 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -367,39 +367,6 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 }
 
 /**
- * aer_do_secondary_bus_reset - perform secondary bus reset
- * @dev: pointer to bridge's pci_dev data structure
- *
- * Invoked when performing link reset at Root Port or Downstream Port.
- */
-void aer_do_secondary_bus_reset(struct pci_dev *dev)
-{
-	u16 p2p_ctrl;
-
-	/* Assert Secondary Bus Reset */
-	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &p2p_ctrl);
-	p2p_ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
-	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, p2p_ctrl);
-
-	/*
-	 * we should send hot reset message for 2ms to allow it time to
-	 * propagate to all downstream ports
-	 */
-	msleep(2);
-
-	/* De-assert Secondary Bus Reset */
-	p2p_ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
-	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, p2p_ctrl);
-
-	/*
-	 * System software must wait for at least 100ms from the end
-	 * of a reset of one or more device before it is permitted
-	 * to issue Configuration Requests to those devices.
-	 */
-	msleep(200);
-}
-
-/**
  * default_reset_link - default reset function
  * @dev: pointer to pci_dev data structure
  *
@@ -408,7 +375,7 @@ void aer_do_secondary_bus_reset(struct pci_dev *dev)
  */
 static pci_ers_result_t default_reset_link(struct pci_dev *dev)
 {
-	aer_do_secondary_bus_reset(dev);
+	pci_reset_bridge_secondary_bus(dev);
 	dev_printk(KERN_DEBUG, &dev->dev, "downstream link has been reset\n");
 	return PCI_ERS_RESULT_RECOVERED;
 }



* Re: [PATCH v4 8/9] pci: Tune secondary bus reset timing
  2013-08-05 19:37 ` [PATCH v4 8/9] pci: Tune secondary bus reset timing Alex Williamson
@ 2013-08-06 23:27   ` Alexander Duyck
  2013-08-07  2:56     ` Alex Williamson
  0 siblings, 1 reply; 16+ messages in thread
From: Alexander Duyck @ 2013-08-06 23:27 UTC (permalink / raw)
  To: Alex Williamson; +Cc: bhelgaas, linux-pci, ddutile, indou.takao, linux-kernel

On 08/05/2013 12:37 PM, Alex Williamson wrote:
> The PCI spec indicates that with stable power, reset needs to be
> asserted for a minimum of 1ms (Trst).  Seems like we should be able
> to assume power is stable for a runtime secondary bus reset.  The
> current code has always used 100ms with no explanation where that
> came from.  The aer_do_secondary_bus_reset() function uses 2ms, but
> that seems to be a misinterpretation of the PCIe spec, where hot
> reset is implemented by TS1 ordered sets containing the hot reset
> command.  After a 2ms delay the state machine enters the detect state,
> but to generate a link down, only two consecutive TS1 hot reset
> ordered sets are required.  1ms should be plenty for that.

The reason for doing a 2ms sleep is because we are supposed to be
sending the Hot Reset TS1 Ordered-Sets continuously for 2ms per all of
the documents I have read.  The 1ms number you quote is the minimum time
for a conventional PCI bus.  I'm not completely sure if that applies as
well to PCIe, nor does it represent the maximum recommended value.

If we stop early we risk not resetting the full device tree on the
secondary bus which is the bug I was resolving by adding the 2ms delay. 
Previously we saw that some devices were only getting their PCIe link
retrained without performing a hot reset when the bit was not held for
long enough.  I would prefer to keep this at 2 ms in order to account
for the fact that PCIe has to go through link recovery states before it
can perform the hot reset.

Thanks,

Alex




* Re: [PATCH v4 8/9] pci: Tune secondary bus reset timing
  2013-08-06 23:27   ` Alexander Duyck
@ 2013-08-07  2:56     ` Alex Williamson
  2013-08-07 18:30       ` Alexander Duyck
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2013-08-07  2:56 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: bhelgaas, linux-pci, ddutile, indou.takao, linux-kernel

On Tue, 2013-08-06 at 16:27 -0700, Alexander Duyck wrote:
> On 08/05/2013 12:37 PM, Alex Williamson wrote:
> > The PCI spec indicates that with stable power, reset needs to be
> > asserted for a minimum of 1ms (Trst).  Seems like we should be able
> > to assume power is stable for a runtime secondary bus reset.  The
> > current code has always used 100ms with no explanation where that
> > came from.  The aer_do_secondary_bus_reset() function uses 2ms, but
> > that seems to be a misinterpretation of the PCIe spec, where hot
> > reset is implemented by TS1 ordered sets containing the hot reset
> > command.  After a 2ms delay the state machine enters the detect state,
> > but to generate a link down, only two consecutive TS1 hot reset
> > ordered sets are required.  1ms should be plenty for that.
> 
> The reason for doing a 2ms sleep is because we are supposed to be
> sending the Hot Reset TS1 Ordered-Sets continuously for 2ms per all of
> the documents I have read.

Could you point to one of those references?  In the PCIe v3 spec I'm
seeing things like 4.2.6.11 Hot Reset:

      * If two consecutive TS1 Ordered Sets are received on any Lane
        with the Hot Reset bit asserted and configured Link and Lane
        numbers, then:
              * LinkUp = 0b (False)
              * If no higher Layer is directing the Physical Layer to
                remain in Hot Reset, the next state is Detect
              * Otherwise, all Lanes in the configured Link continue to
                transmit TS1 Ordered Sets with the Hot Reset bit
                asserted and the configured Link and Lane numbers.
      * Otherwise, after a 2 ms timeout next state is Detect.

The next section has something similar for propagation of hot resets.

Nowhere there does it say TS1 Ordered Sets need to be sent continuously
for 2ms.  A hot reset is initiated only by two consecutive TS1 Ordered
Sets with the Hot Reset bit asserted.  The 2ms timeout seems to be the
delay before the link moves to the Detect state after we stop asserting
hot reset.  1ms seems like more than enough time for two TS1 Ordered
Sets to propagate down a multi-level hierarchy at 2.5GT/s. 

> The 1ms number you quote is the minimum time
> for a conventional PCI bus.  I'm not completely sure if that applies as
> well to PCIe, nor does it represent the maximum recommended value.

Correct, 1ms comes from conventional PCI.  PCIe is designed to be
software compatible with conventional PCI so it makes sense that PCIe
would do something within the timing boundaries of conventional PCI.  I
didn't see any reference to a maximum recommended value for this
parameter.

> If we stop early we risk not resetting the full device tree on the
> secondary bus which is the bug I was resolving by adding the 2ms delay. 
> Previously we saw that some devices were only getting their PCIe link
> retrained without performing a hot reset when the bit was not held for
> long enough.  I would prefer to keep this at 2 ms in order to account
> for the fact that PCIe has to go through link recovery states before it
> can perform the hot reset.

I'm not going to sweat over 1ms or 2ms but I do want to be able to
document why we're setting it to one or the other.  If it's warm
fuzzies, so be it, but I'd prefer if we could find actual spec or
hardware examples to back it up.  Thanks,

Alex



* Re: [PATCH v4 8/9] pci: Tune secondary bus reset timing
  2013-08-07  2:56     ` Alex Williamson
@ 2013-08-07 18:30       ` Alexander Duyck
  2013-08-08  5:23         ` Alex Williamson
  0 siblings, 1 reply; 16+ messages in thread
From: Alexander Duyck @ 2013-08-07 18:30 UTC (permalink / raw)
  To: Alex Williamson; +Cc: bhelgaas, linux-pci, ddutile, indou.takao, linux-kernel

On 08/06/2013 07:56 PM, Alex Williamson wrote:
> On Tue, 2013-08-06 at 16:27 -0700, Alexander Duyck wrote:
>> On 08/05/2013 12:37 PM, Alex Williamson wrote:
>>> The PCI spec indicates that with stable power, reset needs to be
>>> asserted for a minimum of 1ms (Trst).  Seems like we should be able
>>> to assume power is stable for a runtime secondary bus reset.  The
>>> current code has always used 100ms with no explanation where that
>>> came from.  The aer_do_secondary_bus_reset() function uses 2ms, but
>>> that seems to be a misinterpretation of the PCIe spec, where hot
>>> reset is implemented by TS1 ordered sets containing the hot reset
>>> command.  After a 2ms delay the state machine enters the detect state,
>>> but to generate a link down, only two consecutive TS1 hot reset
>>> ordered sets are required.  1ms should be plenty for that.
>> The reason for doing a 2ms sleep is because we are supposed to be
>> sending the Hot Reset TS1 Ordered-Sets continuously for 2ms per all of
>> the documents I have read.
> Could you point to one of those references?  In the PCIe v3 spec I'm
> seeing things like 4.2.6.11 Hot Reset:
>
>       * If two consecutive TS1 Ordered Sets are received on any Lane
>         with the Hot Reset bit asserted and configured Link and Lane
>         numbers, then:
>               * LinkUp = 0b (False)
>               * If no higher Layer is directing the Physical Layer to
>                 remain in Hot Reset, the next state is Detect
>               * Otherwise, all Lanes in the configured Link continue to
>                 transmit TS1 Ordered Sets with the Hot Reset bit
>                 asserted and the configured Link and Lane numbers.
>       * Otherwise, after a 2 ms timeout next state is Detect.
>
> The next section has something similar for propagation of hot resets.
>
> Nowhere there does it say TS1 Ordered Sets need to be sent continuously
> for 2ms.  A hot reset is initiated only by two consecutive TS1 Ordered
> Sets with the Hot Reset bit asserted.  The 2ms timeout seems to be the
> delay before the link moves to the Detect state after we stop asserting
> hot reset.  1ms seems like more than enough time for two TS1 Ordered
> Sets to propagate down a multi-level hierarchy at 2.5GT/s. 
>

My original implementation is actually based on page 536 of the "PCI
Express System Architecture".  However based on the PCIe spec itself I
think the point is that the port is supposed to stay in Hot Reset for
2ms after receiving the in-band message.  For a bridge port it means
that it is supposed to be sending the Hot Reset message for those 2ms on
all downstream facing ports.  After the timer expires then it stops
sending the Hot Reset TS1 Ordered Sets and then will transition to the
Detect state.

My main concern here is that the previous code was not triggering a Hot
Reset on all ports.  What was happening was that some of the
ports would only get as far as Recovery as the upstream port was only
sending a couple of TS1 frames and not allowing the downstream ports
time to switch to Recovery themselves and discover the Hot Reset.

>> The 1ms number you quote is the minimum time
>> for a conventional PCI bus.  I'm not completely sure if that applies as
>> well to PCIe, nor does it represent the maximum recommended value.
> Correct, 1ms comes from conventional PCI.  PCIe is designed to be
> software compatible with conventional PCI so it makes sense that PCIe
> would do something within the timing boundaries of conventional PCI.  I
> didn't see any reference to a maximum recommended value for this
> parameter.

I don't want to implement things to minimum specification as there are
too many marginal parts where the minimum doesn't work.  I would rather
not have to add a ton of quirks for all of the parts out there that
didn't quite meet up to the specification.  By using a value of 2ms we
are matching what the PCIe bridge behavior is supposed to be by sending
the Hot Reset TS1 ordered sets for 2ms.

>> If we stop early we risk not resetting the full device tree on the
>> secondary bus which is the bug I was resolving by adding the 2ms delay. 
>> Previously we saw that some devices were only getting their PCIe link
>> retrained without performing a hot reset when the bit was not held for
>> long enough.  I would prefer to keep this at 2 ms in order to account
>> for the fact that PCIe has to go through link recovery states before it
>> can perform the hot reset.
> I'm not going to sweat over 1ms or 2ms but I do want to be able to
> document why we're setting it to one or the other.  If it's warm
> fuzzies, so be it, but I'd prefer if we could find actual spec or
> hardware examples to back it up.  Thanks,
>
> Alex

I think our difference is that I based my value on the in-band message
behavior and your value is based on the recommended minimum time for the
Secondary Bus Reset.  The downstream ports of a bridge that receives the
in-band Hot Reset notification are supposed to send a continuous stream
of TS1 Ordered sets with the Hot Reset bit set for 2ms.  Based on all of
the conditions in the spec the device should start a 2ms timer, and all
downstream ports should begin transmitting the TS1 Ordered sets with the
Hot Reset bit asserted, then after the 2ms timer expires it should
switch to the detect state.  I verified with a PCIe analyzer that this
was what the AER code was doing after I had changed it and added the sleep.

What I found is that most parts will stop transmitting the TS1 ordered
sets as soon as you clear the Secondary Bus Reset bit.  So if you set
the bit and clear it 1 ms later you might only get to send a few ordered
sets and that may not be enough depending on how fast the part can
transition between L0/L0s/L1, Recovery, and Hot Reset.

Thanks,

Alex



* Re: [PATCH v4 8/9] pci: Tune secondary bus reset timing
  2013-08-07 18:30       ` Alexander Duyck
@ 2013-08-08  5:23         ` Alex Williamson
  2013-08-08 16:46           ` Alexander Duyck
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2013-08-08  5:23 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: bhelgaas, linux-pci, ddutile, indou.takao, linux-kernel

On Wed, 2013-08-07 at 11:30 -0700, Alexander Duyck wrote:
> On 08/06/2013 07:56 PM, Alex Williamson wrote:
> > On Tue, 2013-08-06 at 16:27 -0700, Alexander Duyck wrote:
> >> On 08/05/2013 12:37 PM, Alex Williamson wrote:
> >>> The PCI spec indicates that with stable power, reset needs to be
> >>> asserted for a minimum of 1ms (Trst).  Seems like we should be able
> >>> to assume power is stable for a runtime secondary bus reset.  The
> >>> current code has always used 100ms with no explanation where that
> >>> came from.  The aer_do_secondary_bus_reset() function uses 2ms, but
> >>> that seems to be a misinterpretation of the PCIe spec, where hot
> >>> reset is implemented by TS1 ordered sets containing the hot reset
> >>> command.  After a 2ms delay the state machine enters the detect state,
> >>> but to generate a link down, only two consecutive TS1 hot reset
> >>> ordered sets are required.  1ms should be plenty for that.
> >> The reason for doing a 2ms sleep is because we are supposed to be
> >> sending the Hot Reset TS1 Ordered-Sets continuously for 2ms per all of
> >> the documents I have read.
> > Could you point to one of those references?  In the PCIe v3 spec I'm
> > seeing things like 4.2.6.11 Hot Reset:
> >
> >       * If two consecutive TS1 Ordered Sets are received on any Lane
> >         with the Hot Reset bit asserted and configured Link and Lane
> >         numbers, then:
> >               * LinkUp = 0b (False)
> >               * If no higher Layer is directing the Physical Layer to
> >                 remain in Hot Reset, the next state is Detect
> >               * Otherwise, all Lanes in the configured Link continue to
> >                 transmit TS1 Ordered Sets with the Hot Reset bit
> >                 asserted and the configured Link and Lane numbers.
> >       * Otherwise, after a 2 ms timeout next state is Detect.
> >
> > The next section has something similar for propagation of hot resets.
> >
> > Nowhere there does it say TS1 Ordered Sets need to be sent continuously
> > for 2ms.  A hot reset is initiated only by two consecutive TS1 Ordered
> > Sets with the Hot Reset bit asserted.  The 2ms timeout seems to be the
> > delay before the link moves to the Detect state after we stop asserting
> > hot reset.  1ms seems like more than enough time for two TS1 Ordered
> > Sets to propagate down a multi-level hierarchy at 2.5GT/s. 
> >
> 
> My original implementation is actually based on page 536 of the "PCI
> Express System Architecture".  However based on the PCIe spec itself I
> think the point is that the port is supposed to stay in Hot Reset for
> 2ms after receiving the in-band message.  For a bridge port it means
> that it is supposed to be sending the Hot Reset message for those 2ms on
> all downstream facing ports.  After the timer expires then it stops
> sending the Hot Reset TS1 Ordered Sets and then will transition to the
> Detect state.

Conveniently page 536 is available for preview on google :)  What that
suggests to me is that the minimum "nobody home", unconnected link
timeout is 2ms.  Downstream ports may exit to the Detect state after
either a 2ms timeout expires or after two hot-reset-TS1s are received
from the downstream device.  The other 2ms case is that an upstream port
in the Hot Reset state will always wait for the 2ms timeout to expire
after the last pair of hot-reset-TS1s is received before entering the
Detect state.

> My main concern here is that the previous code was not triggering a Hot
> Reset on all ports.  What was happening was that some of the
> ports would only get as far as Recovery as the upstream port was only
> sending a couple of TS1 frames and not allowing the downstream ports
> time to switch to Recovery themselves and discover the Hot Reset.

Was that the original code that had no delay between set and clear of
the bridge control register?  1ms is a pretty long time vs no delay.

> >> The 1ms number you quote is the minimum time
> >> for a conventional PCI bus.  I'm not completely sure if that applies as
> >> well to PCIe, nor does it represent the maximum recommended value.
> > Correct, 1ms comes from conventional PCI.  PCIe is designed to be
> > software compatible with conventional PCI so it makes sense that PCIe
> > would do something within the timing boundaries of conventional PCI.  I
> > didn't see any reference to a maximum recommended value for this
> > parameter.
> 
> I don't want to implement things to minimum specification as there are
> too many marginal parts where the minimum doesn't work.  I would rather
> not have to add a ton of quirks for all of the parts out there that
> didn't quite meet up to the specification.  By using a value of 2ms we
> are matching what the PCIe bridge behavior is supposed to be by sending
> the Hot Reset TS1 ordered sets for 2ms.

The minimum requirement is 2 hot-reset-TS1.  We're sending ~2.5 million
(if we can assume 1 per transfer cycle).

> >> If we stop early we risk not resetting the full device tree on the
> >> secondary bus which is the bug I was resolving by adding the 2ms delay. 
> >> Previously we saw that some devices were only getting their PCIe link
> >> retrained without performing a hot reset when the bit was not held for
> >> long enough.  I would prefer to keep this at 2 ms in order to account
> >> for the fact that PCIe has to go through link recovery states before it
> >> can perform the hot reset.
> > I'm not going to sweat over 1ms or 2ms but I do want to be able to
> > document why we're setting it to one or the other.  If it's warm
> > fuzzies, so be it, but I'd prefer if we could find actual spec or
> > hardware examples to back it up.  Thanks,
> >
> > Alex
> 
> I think our difference is that I based my value on the in-band message
> behavior and your value is based on the recommended minimum time for the
> Secondary Bus Reset.  The downstream ports of a bridge that receives the
> in-band Hot Reset notification are supposed to send a continuous stream
> of TS1 Ordered sets with the Hot Reset bit set for 2ms.  Based on all of
> the conditions in the spec the device should start a 2ms timer, and all
> downstream ports should begin transmitting the TS1 Ordered sets with the
> Hot Reset bit asserted, then after the 2ms timer expires it should
> switch to the detect state.  I verified with a PCIe analyzer that this
> was what the AER code was doing after I had changed it and added the sleep.
>
> What I found is that most parts will stop transmitting the TS1 ordered
> sets as soon as you clear the Secondary Bus Reset bit.

If what I state above is correct, then the downstream port of the Bridge
is able to immediately move to Detect after it receives two
hot-reset-TS1s from the downstream device.  I suspect this is what you
were seeing.

> So if you set
> the bit and clear it 1 ms later you might only get to send a few ordered
> sets and that may not be enough depending on how fast the part can
> transition between L0/L0s/L1, Recovery, and Hot Reset.

I would guess what you were seeing previously with a back-to-back
set/clear of the bridge control register was that the bridge never
really entered Hot Reset.  Perhaps it wasn't even set long enough to be
latched into the hardware.  As long as we get the bridge to enter Hot
Reset, I think the protocol takes care of itself.  For example:

root port        switch           endpoint
+-----+          +-----+          +-----+
|  X  |<---A'----|  Y  |<---B'----|  Z  |
|     |----A---->|     |----B---->|     |
+-----+          +-----+          +-----+

Say root port X makes it into the Hot Reset state and we have some way
to immediately detect this and clear the bridge control register.  X
will still continue to send hot-reset-TS1 until either a) the 2ms timer
expires or b) it receives two hot-reset-TS1s on link A'.  If link A is
up, switch Y will certainly receive two hot-reset-TS1s within that 2ms
and enters the Hot Reset state on its upstream port.  Switch Y then
begins sending hot-reset-TS1s on link A'.  At the same time, Y directs
its downstream ports to enter Hot Reset "as soon as possible", and
begins sending hot-reset-TS1s on link B.  Once X receives two
hot-reset-TS1s on link A', X enters the Detect state.  hot-reset-TS1s on
link A cease.  2ms after the upstream port of Y receives the last two
hot-reset-TS1s, those ports also enter the detect phase.

The downstream port of Y behaves the same.  We left off with Y's
downstream port in Hot Reset sending hot-reset-TS1s down link B.  It
continues to do this for 2ms or until two hot-reset-TS1s are received on
link B'.  The protocol takes care of propagating the Hot Reset to
subordinate devices regardless of whether we're still directing the
original bridge to stay in Hot Reset.

If the above is a correct interpretation, then the only requirement on
how long we assert the secondary bus reset bit is how long it takes the
bridge to enter the Hot Reset state.  Intuitively, 1ms seems like more
than enough time and is software compatible with conventional PCI which
is generally a design goal for PCIe.  If we factor in link recovery
time, the maximum L1 latency is 64us, which is a pretty small fraction
of 1ms.

Did you experiment at all with 1ms?  I'm trying to come up with a reason
to make it 2ms, but the spec isn't supporting it.  Maybe the comment
should be "This could probably be 1ms, but we're more comfortable with
2ms.".  Thanks,

Alex



* Re: [PATCH v4 8/9] pci: Tune secondary bus reset timing
  2013-08-08  5:23         ` Alex Williamson
@ 2013-08-08 16:46           ` Alexander Duyck
  2013-08-08 18:42             ` Alex Williamson
  0 siblings, 1 reply; 16+ messages in thread
From: Alexander Duyck @ 2013-08-08 16:46 UTC (permalink / raw)
  To: Alex Williamson; +Cc: bhelgaas, linux-pci, ddutile, indou.takao, linux-kernel

On 08/07/2013 10:23 PM, Alex Williamson wrote:
> On Wed, 2013-08-07 at 11:30 -0700, Alexander Duyck wrote:
>> On 08/06/2013 07:56 PM, Alex Williamson wrote:
>>> On Tue, 2013-08-06 at 16:27 -0700, Alexander Duyck wrote:
>>>> On 08/05/2013 12:37 PM, Alex Williamson wrote:
>>>>> The PCI spec indicates that with stable power, reset needs to be
>>>>> asserted for a minimum of 1ms (Trst).  Seems like we should be able
>>>>> to assume power is stable for a runtime secondary bus reset.  The
>>>>> current code has always used 100ms with no explanation where that
>>>>> came from.  The aer_do_secondary_bus_reset() function uses 2ms, but
>>>>> that seems to be a misinterpretation of the PCIe spec, where hot
>>>>> reset is implemented by TS1 ordered sets containing the hot reset
>>>>> command.  After a 2ms delay the state machine enters the detect state,
>>>>> but to generate a link down, only two consecutive TS1 hot reset
>>>>> ordered sets are required.  1ms should be plenty for that.
>>>> The reason for doing a 2ms sleep is because we are supposed to be
>>>> sending the Hot Reset TS1 Ordered-Sets continuously for 2ms per all of
>>>> the documents I have read.
>>> Could you point to one of those references?  In the PCIe v3 spec I'm
>>> seeing things like 4.2.6.11 Hot Reset:
>>>
>>>       * If two consecutive TS1 Ordered Sets are received on any Lane
>>>         with the Hot Reset bit asserted and configured Link and Lane
>>>         numbers, then:
>>>               * LinkUp = 0b (False)
>>>               * If no higher Layer is directing the Physical Layer to
>>>                 remain in Hot Reset, the next state is Detect
>>>               * Otherwise, all Lanes in the configured Link continue to
>>>                 transmit TS1 Ordered Sets with the Hot Reset bit
>>>                 asserted and the configured Link and Lane numbers.
>>>       * Otherwise, after a 2 ms timeout next state is Detect.
>>>
>>> The next section has something similar for propagation of hot resets.
>>>
>>> Nowhere there does it say TS1 Ordered Sets need to be sent continuously
>>> for 2ms.  A hot reset is initiated only by two consecutive TS1 Ordered
>>> Sets with the Hot Reset bit asserted.  The 2ms timeout seems to be the
>>> delay before the link moves to the Detect state after we stop asserting
>>> hot reset.  1ms seems like more than enough time for two TS1 Ordered
>>> Sets to propagate down a multi-level hierarchy at 2.5GT/s. 
>>>
>> My original implementation is actually based on page 536 of the "PCI
>> Express System Architecture".  However based on the PCIe spec itself I
>> think the point is that the port is supposed to stay in Hot Reset for
>> 2ms after receiving the in-band message.  For a bridge port it means
>> that it is supposed to be sending the Hot Reset message for those 2ms on
>> all downstream facing ports.  After the timer expires then it stops
>> sending the Hot Reset TS1 Ordered Sets and then will transition to the
>> Detect state.
> Conveniently page 536 is available for preview on google :)  What that
> suggests to me is that the minimum "nobody home", unconnected link
> timeout is 2ms.  Downstream ports may exit to the Detect state after
> either a 2ms timeout expires or after two hot-reset-TS1s are received
> from the downstream device.  The other 2ms case is that an upstream port
> in the Hot Reset state will always wait for the 2ms timeout to expire
> after the last pair of hot-reset-TS1s is received before entering the
> Detect state.
>
>> My main concern here is that the previous code was not triggering a Hot
>> Reset on all ports.  What was happening was that some of the
>> ports would only get as far as Recovery as the upstream port was only
>> sending a couple of TS1 frames and not allowing the downstream ports
>> time to switch to Recovery themselves and discover the Hot Reset.
> Was that the original code that had no delay between set and clear of
> the bridge control register?  1ms is a pretty long time vs no delay.
>
>>>> The 1ms number you quote is the minimum time
>>>> for a conventional PCI bus.  I'm not completely sure if that applies as
>>>> well to PCIe, nor does it represent the maximum recommended value.
>>> Correct, 1ms comes from conventional PCI.  PCIe is designed to be
>>> software compatible with conventional PCI so it makes sense that PCIe
>>> would do something within the timing boundaries of conventional PCI.  I
>>> didn't see any reference to a maximum recommended value for this
>>> parameter.
>> I don't want to implement things to minimum specification as there are
>> too many marginal parts where the minimum doesn't work.  I would rather
>> not have to add a ton of quirks for all of the parts out there that
>> didn't quite meet up to the specification.  By using a value of 2ms we
>> are matching what the PCIe bridge behavior is supposed to be by sending
>> the Hot Reset TS1 ordered sets for 2ms.
> The minimum requirement is 2 hot-reset-TS1.  We're sending ~2.5 million
> (if we can assume 1 per transfer cycle).

Yes, but there are multiple states that must be transitioned through in
order to get to the hot-reset state.

>>>> If we stop early we risk not resetting the full device tree on the
>>>> secondary bus which is the bug I was resolving by adding the 2ms delay. 
>>>> Previously we saw that some devices were only getting their PCIe link
>>>> retrained without performing a hot reset when the bit was not held for
>>>> long enough.  I would prefer to keep this at 2 ms in order to account
>>>> for the fact that PCIe has to go through link recovery states before it
>>>> can perform the hot reset.
>>> I'm not going to sweat over 1ms or 2ms but I do want to be able to
>>> document why we're setting it to one or the other.  If it's warm
>>> fuzzies, so be it, but I'd prefer if we could find actual spec or
>>> hardware examples to back it up.  Thanks,
>>>
>>> Alex
>> I think our difference is that I based my value on the in-band message
>> behavior and your value is based on the recommended minimum time for the
>> Secondary Bus Reset.  The downstream ports of a bridge that receives the
>> in-band Hot Reset notification are supposed to send a continuous stream
>> of TS1 Ordered sets with the Hot Reset bit set for 2ms.  Based on all of
>> the conditions in the spec the device should start a 2ms timer, and all
>> downstream ports should begin transmitting the TS1 Ordered sets with the
>> Hot Reset bit asserted, then after the 2ms timer expires it should
>> switch to the detect state.  I verified with a PCIe analyzer that this
>> was what the AER code was doing after I had changed it and added the sleep.
>>
>> What I found is that most parts will stop transmitting the TS1 ordered
>> sets as soon as you clear the Secondary Bus Reset bit.
> If what I state above is correct, then the downstream port of the Bridge
> is able to immediately move to Detect after it receives two
> hot-reset-TS1s from the downstream device.  I suspect this is what you
> were seeing.
>
>> So if you set
>> the bit and clear it 1 ms later you might only get to send a few ordered
>> sets and that may not be enough depending on how fast the part can
>> transition between L0/L0s/L1, Recovery, and Hot Reset.
> I would guess what you were seeing previously with a back-to-back
> set/clear of the bridge control register was that the bridge never
> really entered Hot Reset.  Perhaps it wasn't even set long enough to be
> latched into the hardware.  As long as we get the bridge to enter Hot
> Reset, I think the protocol takes care of itself.  For example:
>
> root port        switch           endpoint
> +-----+          +-----+          +-----+
> |  X  |<---A'----|  Y  |<---B'----|  Z  |
> |     |----A---->|     |----B---->|     |
> +-----+          +-----+          +-----+
>
> Say root port X makes it into the Hot Reset state and we have some way
> to immediately detect this and clear the bridge control register.  X
> will still continue to send hot-reset-TS1 until either a) the 2ms timer
> expires or b) it receives two hot-reset-TS1s on link A'.  If link A is
> up, switch Y will certainly receive two hot-reset-TS1s within that 2ms
> and enters the Hot Reset state on its upstream port.  Switch Y then
> begins sending hot-reset-TS1s on link A'.  At the same time, Y directs
> its downstream ports to enter Hot Reset "as soon as possible", and
> begins sending hot-reset-TS1s on link B.  Once X receives two
> hot-reset-TS1s on link A', X enters the Detect state.  hot-reset-TS1s on
> link A cease.  2ms after the upstream port of Y receives the last two
> hot-reset-TS1s, those ports also enter the detect phase.

Are you sure about the flow of Hot Reset TS1 ordered sets along the A'
and B' paths?  My understanding was that they flowed downstream, not
upstream.  It has been so long that I no longer have the trace from when
I was working on this, so I don't remember the exact behavior and could
be wrong.

The issue is that the secondary reset bit doesn't quite work like you
have described.  From what I have seen in the past setting the bit will
hold the root port in the Hot Reset state with it pumping out the
hot-reset TS1 ordered sets until we clear the bit.  When we clear the
bit then all of the ports will cascade from the Hot Reset state to detect.

> The downstream port of Y behaves the same.  We left off with Y's
> downstream port in Hot Reset sending hot-reset-TS1s down link B.  It
> continues to do this for 2ms or until two hot-reset-TS1s are received on
> link B'.  The protocol takes care of propagating the Hot Reset to
> subordinate devices regardless of whether we're still directing the
> original bridge to stay in Hot Reset.

This is where I derived my 2ms value from.  The simple thought is if the
downstream ports wait 2ms before giving up why shouldn't we do the same
for the Secondary Bus Reset bit?

> If the above is a correct interpretation, then the only requirement on
> how long we assert the secondary bus reset bit is how long it takes the
> bridge to enter the Hot Reset state.  Intuitively, 1ms seems like more
> than enough time and is software compatible with conventional PCI which
> is generally a design goal for PCIe.  If we factor in link recovery
> time, the maximum L1 latency is 64us, which is a pretty small fraction
> of 1ms.

1ms should be more than enough for most parts, however if that is the
case why do the downstream ports on most bridges have a 2ms timeout on
Hot Reset?

> Did you experiment at all with 1ms?  I'm trying to come up with a reason
> to make it 2ms, but the spec isn't supporting it.  Maybe the comment
> should be "This could probably be 1ms, but we're more comfortable with
> 2ms.".  Thanks,
>
> Alex

I recall I did experiment with 1ms.  It did reset the part I was working
with.  My concern as I recall was the fact that as soon as I cleared the
secondary bus reset the root port stopped transmitting the hot reset
ordered sets.

The key thing that I think is the point of contention here between you
and I is the line that "Software must ensure a minimum reset duration
(Trst) as defined in the PCI Local Bus Specification".  To me that is
the lower bounds of acceptable values, and it seems like you are
assuming that to be the recommended value.

My preference is the 2ms value with a comment stating that the value can
be no less than 1ms.  This way it gives us a bit of wiggle room for any
bus delays and such and we are more or less guaranteed to have at least
1ms with the bit set.  If you go the 1ms route we really need a comment
that we are running tight on the tolerance for the msleep since the spec
says we must have at least 1ms.

Thanks,

Alex


* Re: [PATCH v4 8/9] pci: Tune secondary bus reset timing
  2013-08-08 16:46           ` Alexander Duyck
@ 2013-08-08 18:42             ` Alex Williamson
  0 siblings, 0 replies; 16+ messages in thread
From: Alex Williamson @ 2013-08-08 18:42 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: bhelgaas, linux-pci, ddutile, indou.takao, linux-kernel

On Thu, 2013-08-08 at 09:46 -0700, Alexander Duyck wrote:
> On 08/07/2013 10:23 PM, Alex Williamson wrote:
> > On Wed, 2013-08-07 at 11:30 -0700, Alexander Duyck wrote:
> >> On 08/06/2013 07:56 PM, Alex Williamson wrote:
> >>> On Tue, 2013-08-06 at 16:27 -0700, Alexander Duyck wrote:
> >>>> On 08/05/2013 12:37 PM, Alex Williamson wrote:
> >>>>> The PCI spec indicates that with stable power, reset needs to be
> >>>>> asserted for a minimum of 1ms (Trst).  Seems like we should be able
> >>>>> to assume power is stable for a runtime secondary bus reset.  The
> >>>>> current code has always used 100ms with no explanation where that
> >>>>> came from.  The aer_do_secondary_bus_reset() function uses 2ms, but
> >>>>> that seems to be a misinterpretation of the PCIe spec, where hot
> >>>>> reset is implemented by TS1 ordered sets containing the hot reset
> >>>>> command.  After a 2ms delay the state machine enters the detect state,
> >>>>> but to generate a link down, only two consecutive TS1 hot reset
> >>>>> ordered sets are required.  1ms should be plenty for that.
> >>>> The reason for doing a 2ms sleep is because they are supposed to be
> >>>> sending the Hot Reset TS1 Ordered-Sets continuously for 2ms per all of
> >>>> the documents I have read.
> >>> Could you point to one of those references?  In the PCIe v3 spec I'm
> >>> seeing things like 4.2.6.11 Hot Reset:
> >>>
> >>>       * If two consecutive TS1 Ordered Sets are received on any Lane
> >>>         with the Hot Reset bit asserted and configured Link and Lane
> >>>         numbers, then:
> >>>               * LinkUp = 0b (False)
> >>>               * If no higher Layer is directing the Physical Layer to
> >>>                 remain in Hot Reset, the next state is Detect
> >>>               * Otherwise, all Lanes in the configured Link continue to
> >>>                 transmit TS1 Ordered Sets with the Hot Reset bit
> >>>                 asserted and the configured Link and Lane numbers.
> >>>       * Otherwise, after a 2 ms timeout next state is Detect.
> >>>
> >>> The next section has something similar for propagation of hot resets.
> >>>
> >>> Nowhere there does it say TS1 Ordered Sets need to be sent continuously
> >>> for 2ms.  A hot reset is initiated only by two consecutive TS1 Ordered
> >>> Sets with the Hot Reset bit asserted.  The 2ms timeout seems to be the
> >>> delay before the link moves to the Detect state after we stop asserting
> >>> hot reset.  1ms seems like more than enough time for two TS1 Ordered
> >>> Sets to propagate down a multi-level hierarchy at 2.5GT/s. 
> >>>
> >> My original implementation is actually based on page 536 of the "PCI
> >> Express System Architecture".  However based on the PCIe spec itself I
> >> think the point is that the port is supposed to stay in Hot Reset for
> >> 2ms after receiving the in-band message.  For a bridge port it means
> >> that is supposed to be sending the Hot Reset message for those 2ms on
> >> all downstream facing ports.  After the timer expires then it stops
> >> sending the Hot Reset TS1 Ordered Sets and then will transition to the
> >> Detect state.
> > Conveniently page 536 is available for preview on google :)  What that
> > suggests to me is that the minimum "nobody home", unconnected link
> > timeout is 2ms.  Downstream ports may exit to the Detect state after
> > either a 2ms timeout expires or after two hot-reset-TS1s are received
> > from the downstream device.  The other 2ms case is that an upstream port
> > in the Hot Reset state will always wait for the 2ms timeout to expire
> > after the last pair of hot-reset-TS1s is received before entering the
> > Detect state.
> >
> >> My main concern here is that the previous code was not triggering a Hot
> >> Reset on all ports previously.  What was happening was that some of the
> >> ports would only get as far as Recovery as the upstream port was only
> >> sending a couple of TS1 frames and not allowing the downstream ports
> >> time to switch to Recovery themselves and discover the Hot Reset.
> > Was that the original code that had no delay between set and clear of
> > the bridge control register?  1ms is pretty long time vs no delay.
> >
> >>>> The 1ms number you quote is the minimum time
> >>>> for a conventional PCI bus.  I'm not completely sure if that applies as
> >>>> well to PCIe, nor does it represent the maximum recommended value.
> >>> Correct, 1ms comes from conventional PCI.  PCIe is designed to be
> >>> software compatible with conventional PCI so it makes sense that PCIe
> >>> would do something within the timing boundaries of conventional PCI.  I
> >>> didn't see any reference to a maximum recommended value for this
> >>> parameter.
> >> I don't want to implement things to minimum specification as there are
> >> too many marginal parts where the minimum doesn't work.  I would rather
> >> not have to add a ton of quirks for all of the parts out there that
> >> didn't quite meet up to the specification.  By using a value of 2ms we
> >> are matching what the PCIe bridge behavior is supposed to be by sending
> >> the Hot Reset TS1 ordered sets for 2ms.
> > The minimum requirement is 2 hot-reset-TS1.  We're sending ~2.5 million
> > (if we can assume 1 per transfer cycle).
> 
> Yes, but there are multiple states that must be transitioned through in
> order to get to the hot-reset state.
> 
> >>>> If we stop early we risk not resetting the full device tree on the
> >>>> secondary bus which is the bug I was resolving by adding the 2ms delay. 
> >>>> Previously we saw that some devices were only getting their PCIe link
> >>>> retrained without performing a hot reset when the bit was not held for
> >>>> long enough.  I would prefer to keep this at 2 ms in order to account
> >>>> for the fact that PCIe has to go through link recovery states before it
> >>>> can perform the hot reset.
> >>> I'm not going to sweat over 1ms or 2ms but I do want to be able to
> >>> document why we're setting it to one or the other.  If it's warm
> >>> fuzzies, so be it, but I'd prefer if we could find actual spec or
> >>> hardware examples to back it up.  Thanks,
> >>>
> >>> Alex
> >> I think our difference is that I based my value on the in-band message
> >> behavior and your value is based on the recommended minimum time for the
> >> Secondary Bus Reset.  The downstream ports of a bridge that receives the
> >> in-band Hot Reset notification are supposed to send a continuous stream
> >> of TS1 Ordered sets with the Hot Reset bit set for 2ms.  Based on all of
> >> the conditions in the spec the device should start a 2ms timer, and all
> >> downstream ports should begin transmitting the TS1 Ordered sets with the
> >> Hot Reset bit asserted, then after the 2ms timer expires it should
> >> switch to the detect state.  I verified with a PCIe analyzer that this
> >> was what the AER code was doing after I had changed it and added the sleep.
> >>
> >> What I found is that most parts will stop transmitting the TS1 ordered
> >> sets as soon as you clear the Secondary Bus Reset bit.
> > If what I state above is correct, then the downstream port of the Bridge
> > is able to immediately move to Detect after it receives two
> > hot-reset-TS1s from the downstream device.  I suspect this is what you
> > were seeing.
> >
> >> So if you set
> >> the bit and clear it 1 ms later you might only get to send a few ordered
> >> sets and that may not be enough depending on how fast the part can
> >> transition between L0/L0s/L1, Recovery, and Hot Reset.
> > I would guess what you were seeing previously with a back-to-back
> > set/clear of the bridge control register was that the bridge never
> > really entered Hot Reset.  Perhaps it wasn't even set long enough to be
> > latched into the hardware.  As long as we get the bridge to enter Hot
> > Reset, I think the protocol takes care of itself.  For example:
> >
> > root port        switch           endpoint
> > +-----+          +-----+          +-----+
> > |  X  |<---A'----|  Y  |<---B'----|  Z  |
> > |     |----A---->|     |----B---->|     |
> > +-----+          +-----+          +-----+
> >
> > Say root port X makes it into the Hot Reset state and we have some way
> > to immediately detect this and clear the bridge control register.  X
> > will still continue to send hot-reset-TS1 until either a) the 2ms timer
> > expires or b) it receives two hot-reset-TS1s on link A'.  If link A is
> > up, switch Y will certainly receive two hot-reset-TS1s within that 2ms
> > and enters the Hot Reset state on its upstream port.  Switch Y then
> > begins sending hot-reset-TS1s on link A'.  At the same time, Y directs
> > its downstream ports to enter Hot Reset "as soon as possible", and
> > begins sending hot-reset-TS1s on link B.  Once X receives two
> > hot-reset-TS1s on link A', X enters the Detect state.  hot-reset-TS1s on
> > link A cease.  2ms after the upstream port of Y receives the last two
> > hot-reset-TS1s, those ports also enter the detect phase.
> 
> Are you sure about the flow of Hot Reset TS1 ordered sets along the A'
> and B' paths?  My understanding was that they flowed downstream, not
> upstream.  It has been so long that I no longer have the trace from when
> I was working on this, so I don't remember the exact behavior and could
> be wrong.

I don't know for certain, my interpretation is purely from reading the
spec.  This is the only way I can make sense of (4.2.6.11):

        If two consecutive TS1 Ordered Sets are received on any Lane
        with the Hot Reset bit asserted and configured Link and Lane
        numbers,...

The wording specifically uses "transmit" and "receive" and given that
the links are bi-directional, I come to the above interpretation that
both ends drive hot-reset-TS1s in both directions on the link.

> The issue is that the secondary reset bit doesn't quite work like you
> have described.  From what I have seen in the past setting the bit will
> hold the root port in the Hot Reset state with it pumping out the
> hot-reset TS1 ordered sets until we clear the bit.  When we clear the
> bit then all of the ports will cascade from the Hot Reset state to detect.

That's exactly what I describe.  Once the upper layer stops directing the
physical layer to stay in Hot Reset and two hot-reset-TS1s are received,
the bridge immediately stops sending hot-reset-TS1s.  This causes
downstream devices to cascade into the Detect state.  What I was trying
to illustrate above is that regardless of how long we direct the
physical layer to stay in Hot Reset, once it enters Hot Reset the
protocol ensures that it cascades all the way down the chain.

> > The downstream port of Y behaves the same.  We left off with Y's
> > downstream port in Hot Reset sending hot-reset-TS1s down link B.  It
> > continues to do this for 2ms or until two hot-reset-TS1s are received on
> > link B'.  The protocol takes care of propagating the Hot Reset to
> > subordinate devices regardless of whether we're still directing the
> > original bridge to stay in Hot Reset.
> 
> This is where I derived my 2ms value from.  The simple thought is if the
> downstream ports wait 2ms before giving up why shouldn't we do the same
> for the Secondary Bus Reset bit?

The downstream port has no way to confirm that the upstream port
received the hot-reset-TS1s that the downstream port was sending.  Using
the example above, X sends hot-reset-TS1s down link A.  X has positive
confirmation that the downstream device Y has entered Hot Reset when it
receives two hot-reset-TS1s on link A'.  It can then exit early from the
Hot Reset state if not held in Hot Reset by a higher layer.

The downstream device Y has no such positive confirmation that X ever
saw the hot-reset-TS1s that were pushed out through link A'.  Thus, Y
waits the full timeout after the last two hot-reset-TS1s before entering
Detect.

This is all from my interpretation of the spec, so it could be very
wrong.  For instance, the spec isn't clear on whether the hot-reset-TS1s
being sent on A' keep X in Hot Reset.  That would obviously cause
deadlock given my interpretation, so I assume not.
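
As a rough illustration of the exit rule quoted from 4.2.6.11, the C
sketch below models a single port that has been directed into Hot Reset.
It only encodes the reading given in this message and is not taken from
the spec or from any kernel code; the type, field, and function names
are hypothetical.

```c
#include <stdbool.h>

/* Only the two link states discussed here are modelled. */
enum ltssm_state { HOT_RESET, DETECT };

#define HOT_RESET_TIMEOUT_US 2000u /* the 2 ms timeout from 4.2.6.11 */

/* Hypothetical snapshot of a port sitting in Hot Reset. */
struct port {
	bool held_by_upper_layer;    /* e.g. Secondary Bus Reset bit still set */
	bool got_two_hot_reset_ts1;  /* two consecutive hot-reset-TS1s received */
	unsigned int us_in_hot_reset; /* time spent waiting, in microseconds */
};

/* Next state for a port in Hot Reset, per the quoted excerpt. */
static enum ltssm_state hot_reset_next(const struct port *p)
{
	if (p->got_two_hot_reset_ts1) {
		/* LinkUp = 0; leave unless a higher layer holds us here. */
		return p->held_by_upper_layer ? HOT_RESET : DETECT;
	}

	/* No confirmation from the link partner: give up after 2 ms. */
	return p->us_in_hot_reset >= HOT_RESET_TIMEOUT_US ? DETECT : HOT_RESET;
}
```

In this model the early exit is available only to a port that has seen
hot-reset-TS1s come back and is no longer held by software, which is the
case for X once the Secondary Bus Reset bit is cleared.  A port with no
such confirmation falls through to the timeout branch.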

> > If the above is a correct interpretation, then the only requirement on
> > how long we assert the secondary bus reset bit is how long it takes the
> > bridge to enter the Hot Reset state.  Intuitively, 1ms seems like more
> > than enough time and is software compatible with conventional PCI which
> > is generally a design goal for PCIe.  If we factor in link recovery
> > time, the maximum L1 latency is 64us, which is a pretty small fraction
> > of 1ms.
> 
> 1ms should be more than enough for most parts, however if that is the
> case why do the downstream ports on most bridges have a 2ms timeout on
> Hot Reset?

Without any response from the downstream device, the hardware is still
going to do a full 2ms timeout, so what does it matter if we hold the
device in Hot Reset for 1ms or 2ms?  That translates to 3ms or 4ms of
hot-reset-TS1s on a dead link.  We're only allowing for an early exit if
the link is live and the downstream device has entered Hot Reset.
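
For a sense of scale, the timing arithmetic can be sketched in C as
below.  It assumes a 2.5GT/s lane with 8b/10b encoding and 16-symbol TS1
Ordered Sets; none of those figures is stated in this thread, so treat
the results as a rough check rather than a measurement.  The function
names are hypothetical.

```c
/* Assumptions, not taken from the thread: Gen1 signalling at 2.5 GT/s,
 * 8b/10b encoding (10 bit-times per symbol), 16 symbols per TS1. */
#define BIT_TIME_PS          400ULL /* 1 / 2.5 GT/s */
#define BITS_PER_SYMBOL      10ULL
#define SYMBOLS_PER_TS1      16ULL
#define PS_PER_MS            1000000000ULL
#define HOT_RESET_TIMEOUT_MS 2ULL   /* 2 ms timeout from 4.2.6.11 */

/* Time to transmit one TS1 Ordered Set on one lane, in picoseconds. */
static unsigned long long ts1_time_ps(void)
{
	return BIT_TIME_PS * BITS_PER_SYMBOL * SYMBOLS_PER_TS1;
}

/* Back-to-back TS1 Ordered Sets that fit into hold_ms on one lane. */
static unsigned long long ts1_per_hold(unsigned long long hold_ms)
{
	return hold_ms * PS_PER_MS / ts1_time_ps();
}

/* Time hot-reset-TS1s are driven onto a dead link: the software hold
 * plus the full hardware timeout, as described above. */
static unsigned long long dead_link_ms(unsigned long long hold_ms)
{
	return hold_ms + HOT_RESET_TIMEOUT_MS;
}
```

Under those assumptions one TS1 takes 64ns, so ts1_per_hold(1) gives
15625 and ts1_per_hold(2) gives 31250, both far above the two that are
required.  dead_link_ms(1) and dead_link_ms(2) give the 3ms and 4ms
mentioned above.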

> > Did you experiment at all with 1ms?  I'm trying to come up with a reason
> > to make it 2ms, but the spec isn't supporting it.  Maybe the comment
> > should be "This could probably be 1ms, but we're more comfortable with
> > 2ms.".  Thanks,
> >
> > Alex
> 
> I recall I did experiment with 1ms.  It did reset the part I was working
> with.  My concern as I recall was the fact that as soon as I cleared the
> secondary bus reset the root port stopped transmitting the hot reset
> ordered sets.

I think you would have to sever the return path (A') to see otherwise.
As long as there are hot-reset-TS1s on link A', X is able to immediately
transition to Detect.

> 
> The key thing that I think is the point of contention here between you
> and I is the line that "Software must ensure a minimum reset duration
> (Trst) as defined in the PCI Local Bus Specification".  To me that is
> the lower bounds of acceptable values, and it seems like you are
> assuming that to be the recommended value.
> 
> My preference is the 2ms value with a comment stating that the value can
> be no less than 1ms.  This way it gives us a bit of wiggle room for any
> bus delays and such and we are more or less guaranteed to have at least
> 1ms with the bit set.  If you go the 1ms route we really need a comment
> that we are running tight on the tolerance for the msleep since the spec
> says we must have at least 1ms.

Ok, I can agree to that and I think the justification is more from
adhering to the conventional PCI minimum timing rather than anything
added by PCIe.  2ms is simply a fudge-factor to ensure that RST# on the
bus is actually asserted for at least 1ms.  I'll send an update.
Thanks,

Alex
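
The agreed sequence (set the Secondary Bus Reset bit, hold it for 2ms so
that at least the 1ms Trst minimum is met, then clear it) can be sketched
as a user-space mock.  This is not the code from the patch: the
mock_bridge structure, the virtual clock, and every function name here
are hypothetical.  Only the register offset (0x3e) and bit value (0x40)
correspond to the Bridge Control register and its Secondary Bus Reset
bit.

```c
#include <stdint.h>

#define PCI_BRIDGE_CONTROL       0x3e /* Bridge Control register offset */
#define PCI_BRIDGE_CTL_BUS_RESET 0x40 /* Secondary Bus Reset bit */
#define TRST_MIN_MS              1u   /* conventional PCI minimum (Trst) */
#define RESET_HOLD_MS            2u   /* minimum doubled as a fudge factor */

/* Hypothetical mock of a bridge: config space plus a virtual clock. */
struct mock_bridge {
	uint8_t cfg[64];
	unsigned int reset_held_ms; /* virtual ms spent with the bit set */
};

/* Little-endian 16-bit config space read. */
static uint16_t cfg_read16(const struct mock_bridge *b, unsigned int off)
{
	return (uint16_t)(b->cfg[off] | (b->cfg[off + 1] << 8));
}

/* Little-endian 16-bit config space write. */
static void cfg_write16(struct mock_bridge *b, unsigned int off, uint16_t val)
{
	b->cfg[off] = (uint8_t)(val & 0xff);
	b->cfg[off + 1] = (uint8_t)(val >> 8);
}

/* Stand-in for a sleep: counts virtual time only while reset is set. */
static void mock_msleep(struct mock_bridge *b, unsigned int ms)
{
	if (cfg_read16(b, PCI_BRIDGE_CONTROL) & PCI_BRIDGE_CTL_BUS_RESET)
		b->reset_held_ms += ms;
}

/* Assert the reset bit, hold it for RESET_HOLD_MS, then deassert it. */
static void mock_secondary_bus_reset(struct mock_bridge *b)
{
	uint16_t ctrl = cfg_read16(b, PCI_BRIDGE_CONTROL);

	cfg_write16(b, PCI_BRIDGE_CONTROL,
		    (uint16_t)(ctrl | PCI_BRIDGE_CTL_BUS_RESET));
	mock_msleep(b, RESET_HOLD_MS);
	cfg_write16(b, PCI_BRIDGE_CONTROL,
		    (uint16_t)(ctrl & ~PCI_BRIDGE_CTL_BUS_RESET));
}
```

The mock does not model bus delays between the register write and the
actual assertion of reset, which is the margin the extra 1ms is meant to
cover.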



Thread overview: 16+ messages
2013-08-05 19:37 [PATCH v4 0/9] pci: bus and slot reset interfaces Alex Williamson
2013-08-05 19:37 ` [PATCH v4 1/9] pci: Create pci_reset_bridge_secondary_bus() Alex Williamson
2013-08-05 19:37 ` [PATCH v4 2/9] pci: Add hotplug_slot_ops.reset_slot() Alex Williamson
2013-08-05 19:37 ` [PATCH v4 3/9] pci: Implement reset_slot for pciehp Alex Williamson
2013-08-05 19:37 ` [PATCH v4 4/9] pci: Add slot reset option to pci_dev_reset Alex Williamson
2013-08-05 19:37 ` [PATCH v4 5/9] pci: Split out pci_dev lock/unlock and save/restore Alex Williamson
2013-08-05 19:37 ` [PATCH v4 6/9] pci: Add slot and bus reset interfaces Alex Williamson
2013-08-05 19:37 ` [PATCH v4 7/9] pci: Wake-up devices before save for reset Alex Williamson
2013-08-05 19:37 ` [PATCH v4 8/9] pci: Tune secondary bus reset timing Alex Williamson
2013-08-06 23:27   ` Alexander Duyck
2013-08-07  2:56     ` Alex Williamson
2013-08-07 18:30       ` Alexander Duyck
2013-08-08  5:23         ` Alex Williamson
2013-08-08 16:46           ` Alexander Duyck
2013-08-08 18:42             ` Alex Williamson
2013-08-05 19:37 ` [PATCH v4 9/9] pci: Remove aer_do_secondary_bus_reset() Alex Williamson
