linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] PCI: Power management improvements
@ 2019-06-05 14:58 Mika Westerberg
  2019-06-05 14:58 ` [PATCH 1/3] PCI: Add missing link delays required by the PCIe spec Mika Westerberg
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Mika Westerberg @ 2019-06-05 14:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Rafael J. Wysocki
  Cc: Len Brown, Lukas Wunner, Keith Busch, Alex Williamson,
	Alexandru Gagniuc, Mika Westerberg, linux-acpi, linux-pci

Hi all,

This series includes a couple of changes to the PCI power management that
should make Linux follow the PCIe spec better and also support for sibling
PCIe devices sharing ACPI power resources. The issues this series aims to
solve came up with Intel Ice Lake Thunderbolt enabling where the controller
is first time integrated into the SoC but I think these issues are generic
to any platform having similar configuration.

Mika Westerberg (3):
  PCI: Add missing link delays required by the PCIe spec
  PCI: Do not poll for PME if the device is in D3cold
  PCI / ACPI: Handle sibling devices sharing power resources

 drivers/acpi/power.c            |  32 ++++++++
 drivers/pci/pci-acpi.c          |  32 ++++++--
 drivers/pci/pci.c               | 138 ++++++++++++++++++++++++--------
 drivers/pci/pci.h               |   6 ++
 drivers/pci/pcie/portdrv_core.c |  62 ++++++++++++++
 include/acpi/acpi_bus.h         |   3 +
 6 files changed, 233 insertions(+), 40 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/3] PCI: Add missing link delays required by the PCIe spec
  2019-06-05 14:58 [PATCH 0/3] PCI: Power management improvements Mika Westerberg
@ 2019-06-05 14:58 ` Mika Westerberg
  2019-06-06  8:40   ` Rafael J. Wysocki
  2019-06-05 14:58 ` [PATCH 2/3] PCI: Do not poll for PME if the device is in D3cold Mika Westerberg
  2019-06-05 14:58 ` [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources Mika Westerberg
  2 siblings, 1 reply; 23+ messages in thread
From: Mika Westerberg @ 2019-06-05 14:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Rafael J. Wysocki
  Cc: Len Brown, Lukas Wunner, Keith Busch, Alex Williamson,
	Alexandru Gagniuc, Mika Westerberg, linux-acpi, linux-pci

Currently Linux does not follow PCIe spec regarding the required delays
after reset. A concrete example is a Thunderbolt add-in-card that
consists of a PCIe switch and two PCIe endpoints:

  +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
                                  +-01.0-[04-36]-- DS hotplug port
                                  +-02.0-[37]----00.0 xHCI controller
                                  \-04.0-[38-6b]-- DS hotplug port

The root port (1b.0) and the PCIe switch downstream ports are all PCIe
gen3 so they support 8GT/s link speeds.

We wait for the PCIe hierarchy to enter D3cold (runtime):

  pcieport 0000:00:1b.0: power state changed by ACPI to D3cold

When it wakes up from D3cold, according to the PCIe 4.0 section 5.8 the
PCIe switch is put to reset and its power is re-applied. This means that
we must follow the rules in PCIe 4.0 section 6.6.1.

For the PCIe gen3 ports we are dealing with here, the following applies:

  With a Downstream Port that supports Link speeds greater than 5.0
  GT/s, software must wait a minimum of 100 ms after Link training
  completes before sending a Configuration Request to the device
  immediately below that Port. Software can determine when Link training
  completes by polling the Data Link Layer Link Active bit or by setting
  up an associated interrupt (see Section 6.7.3.3).

Translating this into the above topology we would need to do this (DLLLA
stands for Data Link Layer Link Active):

  pcieport 0000:00:1b.0: wait for 100ms after DLLLA is set before access to 0000:01:00.0
  pcieport 0000:02:00.0: wait for 100ms after DLLLA is set before access to 0000:03:00.0
  pcieport 0000:02:02.0: wait for 100ms after DLLLA is set before access to 0000:37:00.0

I've instrumented the kernel with additional logging so we can see the
actual delays the kernel performs:

  pcieport 0000:00:1b.0: power state changed by ACPI to D0
  pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms
  pcieport 0000:00:1b.0: waking up bus
  pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms
  pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
  ...
  pcieport 0000:00:1b.0: PME# disabled
  pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  ...
  pcieport 0000:01:00.0: PME# disabled
  pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  ...
  pcieport 0000:02:00.0: PME# disabled
  pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  ...
  pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
  pcieport 0000:02:01.0: PME# disabled
  pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  ...
  pcieport 0000:02:02.0: PME# disabled
  pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  ...
  pcieport 0000:02:04.0: PME# disabled
  pcieport 0000:02:01.0: PME# enabled
  pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms
  pcieport 0000:02:04.0: PME# enabled
  pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms
  thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000)
  ...
  thunderbolt 0000:03:00.0: PME# disabled
  xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000)
  ...
  xhci_hcd 0000:37:00.0: PME# disabled

For the switch upstream port (01:00.0) we wait for 100ms but not taking
into account the DLLLA requirement. We then wait 10ms for D3hot -> D0
transition of the root port and the two downstream hotplug ports. This
means that we deviate from what the spec requires.

Performing the same check for system sleep (s2idle) transitions we can
see following when resuming from s2idle:

  pcieport 0000:00:1b.0: power state changed by ACPI to D0
  pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
  ...
  pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  ...
  pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  pcieport 0000:02:02.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
  pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  pcieport 0000:02:02.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
  pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  pcieport 0000:02:02.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1)
  pcieport 0000:02:01.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
  pcieport 0000:02:02.0: restoring config space at offset 0x20 (was 0x0, writing 0x73f073f0)
  pcieport 0000:02:04.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
  pcieport 0000:02:01.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
  pcieport 0000:02:00.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
  pcieport 0000:02:02.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1)
  pcieport 0000:02:04.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
  pcieport 0000:02:01.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1ff10001)
  pcieport 0000:02:00.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
  pcieport 0000:02:02.0: restoring config space at offset 0x18 (was 0x0, writing 0x373702)
  pcieport 0000:02:04.0: restoring config space at offset 0x24 (was 0x10001, writing 0x49f12001)
  pcieport 0000:02:01.0: restoring config space at offset 0x20 (was 0x0, writing 0x73e05c00)
  pcieport 0000:02:00.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1)
  pcieport 0000:02:04.0: restoring config space at offset 0x20 (was 0x0, writing 0x89f07400)
  pcieport 0000:02:01.0: restoring config space at offset 0x1c (was 0x101, writing 0x5151)
  pcieport 0000:02:00.0: restoring config space at offset 0x20 (was 0x0, writing 0x8a008a00)
  pcieport 0000:02:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
  pcieport 0000:02:04.0: restoring config space at offset 0x1c (was 0x101, writing 0x6161)
  pcieport 0000:02:01.0: restoring config space at offset 0x18 (was 0x0, writing 0x360402)
  pcieport 0000:02:00.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1)
  pcieport 0000:02:04.0: restoring config space at offset 0x18 (was 0x0, writing 0x6b3802)
  pcieport 0000:02:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
  pcieport 0000:02:00.0: restoring config space at offset 0x18 (was 0x0, writing 0x30302)
  pcieport 0000:02:01.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
  pcieport 0000:02:04.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
  pcieport 0000:02:00.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
  pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
  pcieport 0000:02:04.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
  pcieport 0000:02:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
  xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000)
  ...
  thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000)

This is even worse. None of the mandatory delays are performed. If this
would be S3 instead of s2idle then according to PCI FW spec 3.2 section
4.6.8.  there is a specific _DSM that allows the OS to skip the delays
but this platform does not provide the _DSM and does not go to S3 anyway
so no firmware is involved that could already handle these delays.

In this particular Intel Coffee Lake platform these delays are not
actually needed because there is an additional delay as part of the ACPI
power resource that is used to turn on power to the hierarchy but since
that additional delay is not required by any of standards (PCIe, ACPI)
it is not present in the Intel Ice Lake, for example where missing the
mandatory delays causes pciehp to start tearing down the stack too early
(links are not yet trained).

For this reason, change the PCIe portdrv PM resume hooks so that they
perform the mandatory delays before the downstream component gets
resumed. We perform the delays before port services are resumed because
otherwise pciehp might find that the link is not up (even if it is just
training) and tears-down the hierarchy.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
---
 drivers/pci/pci.c               | 29 +++++++++------
 drivers/pci/pci.h               |  1 +
 drivers/pci/pcie/portdrv_core.c | 62 +++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8abc843b1615..87a1f902fa8e 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1004,15 +1004,10 @@ static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
 	if (state == PCI_D0) {
 		pci_platform_power_transition(dev, PCI_D0);
 		/*
-		 * Mandatory power management transition delays, see
-		 * PCI Express Base Specification Revision 2.0 Section
-		 * 6.6.1: Conventional Reset.  Do not delay for
-		 * devices powered on/off by corresponding bridge,
-		 * because have already delayed for the bridge.
+		 * Mandatory power management transition delays are
+		 * handled in the PCIe portdrv resume hooks.
 		 */
 		if (dev->runtime_d3cold) {
-			if (dev->d3cold_delay && !dev->imm_ready)
-				msleep(dev->d3cold_delay);
 			/*
 			 * When powering on a bridge from D3cold, the
 			 * whole hierarchy may be powered on into
@@ -4568,14 +4563,16 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
 
 	return pci_dev_wait(dev, "PM D3->D0", PCIE_RESET_READY_POLL_MS);
 }
+
 /**
- * pcie_wait_for_link - Wait until link is active or inactive
+ * pcie_wait_for_link_delay - Wait until link is active or inactive
  * @pdev: Bridge device
  * @active: waiting for active or inactive?
+ * @delay: Delay to wait after link has become active (in ms)
  *
  * Use this to wait till link becomes active or inactive.
  */
-bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
+bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active, int delay)
 {
 	int timeout = 1000;
 	bool ret;
@@ -4612,13 +4609,25 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
 		timeout -= 10;
 	}
 	if (active && ret)
-		msleep(100);
+		msleep(delay);
 	else if (ret != active)
 		pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
 			active ? "set" : "cleared");
 	return ret == active;
 }
 
+/**
+ * pcie_wait_for_link - Wait until link is active or inactive
+ * @pdev: Bridge device
+ * @active: waiting for active or inactive?
+ *
+ * Use this to wait till link becomes active or inactive.
+ */
+bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
+{
+	return pcie_wait_for_link_delay(pdev, active, 100);
+}
+
 void pci_reset_secondary_bus(struct pci_dev *dev)
 {
 	u16 ctrl;
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9cb99380c61e..59802b3def4b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -493,6 +493,7 @@ static inline int pci_dev_specific_disable_acs_redir(struct pci_dev *dev)
 void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 		      u32 service);
 
+bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active, int delay);
 bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
 #ifdef CONFIG_PCIEASPM
 void pcie_aspm_init_link_state(struct pci_dev *pdev);
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index 1b330129089f..88d151a54be6 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -9,6 +9,7 @@
 #include <linux/module.h>
 #include <linux/pci.h>
 #include <linux/kernel.h>
+#include <linux/delay.h>
 #include <linux/errno.h>
 #include <linux/pm.h>
 #include <linux/pm_runtime.h>
@@ -378,6 +379,58 @@ static int pm_iter(struct device *dev, void *data)
 	return 0;
 }
 
+static int get_downstream_delay(struct pci_bus *bus)
+{
+	struct pci_dev *pdev;
+	int min_delay = 100;
+	int max_delay = 0;
+
+	list_for_each_entry(pdev, &bus->devices, bus_list) {
+		if (!pdev->imm_ready)
+			min_delay = 0;
+		else if (pdev->d3cold_delay < min_delay)
+			min_delay = pdev->d3cold_delay;
+		if (pdev->d3cold_delay > max_delay)
+			max_delay = pdev->d3cold_delay;
+	}
+
+	return max(min_delay, max_delay);
+}
+
+static void wait_for_downstream_link(struct pci_dev *pdev)
+{
+	/*
+	 * Handle delays according to PCIe 4.0 section 6.6.1 before
+	 * configuration access to the downstream component is permitted.
+	 *
+	 * This blocks PCI core resume of the hierarchy below this port
+	 * until the link is trained.
+	 */
+	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT ||
+	     pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM) &&
+	     pdev->subordinate && !list_empty(&pdev->subordinate->devices) &&
+	     pdev->bridge_d3 && !pci_dev_is_disconnected(pdev)) {
+		int delay;
+
+		delay = get_downstream_delay(pdev->subordinate);
+		if (!delay)
+			return;
+
+		dev_dbg(&pdev->dev, "waiting downstream link for %d ms\n", delay);
+
+		/*
+		 * If downstream port does not support speeds greater than
+		 * 5 GT/s need to wait 100ms. For higher speeds (gen3) we
+		 * need to wait first for the data link layer to become
+		 * active.
+		 */
+		if (pcie_get_speed_cap(pdev) <= PCIE_SPEED_5_0GT)
+			msleep(delay);
+		else
+			pcie_wait_for_link_delay(pdev, true, delay);
+	}
+}
+
 /**
  * pcie_port_device_suspend - suspend port services associated with a PCIe port
  * @dev: PCI Express port to handle
@@ -391,6 +444,13 @@ int pcie_port_device_suspend(struct device *dev)
 int pcie_port_device_resume_noirq(struct device *dev)
 {
 	size_t off = offsetof(struct pcie_port_service_driver, resume_noirq);
+
+	/*
+	 * Wait for the link to be fully up before resuming port services.
+	 * This prevents pciehp from starting to tear-down the hierarchy
+	 * too soon.
+	 */
+	wait_for_downstream_link(to_pci_dev(dev));
 	return device_for_each_child(dev, &off, pm_iter);
 }
 
@@ -421,6 +481,8 @@ int pcie_port_device_runtime_suspend(struct device *dev)
 int pcie_port_device_runtime_resume(struct device *dev)
 {
 	size_t off = offsetof(struct pcie_port_service_driver, runtime_resume);
+
+	wait_for_downstream_link(to_pci_dev(dev));
 	return device_for_each_child(dev, &off, pm_iter);
 }
 #endif /* PM */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/3] PCI: Do not poll for PME if the device is in D3cold
  2019-06-05 14:58 [PATCH 0/3] PCI: Power management improvements Mika Westerberg
  2019-06-05 14:58 ` [PATCH 1/3] PCI: Add missing link delays required by the PCIe spec Mika Westerberg
@ 2019-06-05 14:58 ` Mika Westerberg
  2019-06-05 19:05   ` Lukas Wunner
                     ` (2 more replies)
  2019-06-05 14:58 ` [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources Mika Westerberg
  2 siblings, 3 replies; 23+ messages in thread
From: Mika Westerberg @ 2019-06-05 14:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Rafael J. Wysocki
  Cc: Len Brown, Lukas Wunner, Keith Busch, Alex Williamson,
	Alexandru Gagniuc, Mika Westerberg, linux-acpi, linux-pci

PME polling does not take into account that a device that is directly
connected to the host bridge may go into D3cold as well. This leads to a
situation where the PME poll thread reads from a config space of a
device that is in D3cold and gets incorrect information because the
config space is not accessible.

Here is an example from Intel Ice Lake system where two PCIe root ports
are in D3cold (I've instrumented the kernel to log the PMCSR register
contents):

  [   62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
  [   62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff

Since 0xffff is interpreted so that PME is pending, the root ports will
be runtime resumed. This repeats over and over again essentially
blocking all runtime power management.

Prevent this from happening by checking whether the device is in D3cold
before its PME status is read.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
---
 drivers/pci/pci.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 87a1f902fa8e..720da09d4d73 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2060,6 +2060,13 @@ static void pci_pme_list_scan(struct work_struct *work)
 			 */
 			if (bridge && bridge->current_state != PCI_D0)
 				continue;
+			/*
+			 * If the device is in D3cold it should not be
+			 * polled either.
+			 */
+			if (pme_dev->dev->current_state == PCI_D3cold)
+				continue;
+
 			pci_pme_wakeup(pme_dev->dev, NULL);
 		} else {
 			list_del(&pme_dev->list);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources
  2019-06-05 14:58 [PATCH 0/3] PCI: Power management improvements Mika Westerberg
  2019-06-05 14:58 ` [PATCH 1/3] PCI: Add missing link delays required by the PCIe spec Mika Westerberg
  2019-06-05 14:58 ` [PATCH 2/3] PCI: Do not poll for PME if the device is in D3cold Mika Westerberg
@ 2019-06-05 14:58 ` Mika Westerberg
  2019-06-06  8:54   ` Rafael J. Wysocki
  2 siblings, 1 reply; 23+ messages in thread
From: Mika Westerberg @ 2019-06-05 14:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Rafael J. Wysocki
  Cc: Len Brown, Lukas Wunner, Keith Busch, Alex Williamson,
	Alexandru Gagniuc, Mika Westerberg, linux-acpi, linux-pci

Intel Ice Lake has an interated Thunderbolt controller which means that
the PCIe topology is extended directly from the two root ports (RP0 and
RP1). Power management is handled by ACPI power resources that are
shared between the root ports, Thunderbolt controller (NHI) and xHCI
controller.

The topology with the power resources (marked with []) looks like:

  Host bridge
      |
      +- RP0 ---\
      +- RP1 ---|--+--> [TBT]
      +- NHI --/   |
      |            |
      |            v
      +- xHCI --> [D3C]

Here TBT and D3C are the shared ACPI power resources. ACPI _PR3() method
returns either TBT or D3C or both.

Say we runtime suspend first the root ports RP0 and RP1, then NHI. Now
since the TBT power resource is still on when the root ports are runtime
suspended their dev->current_state is set to D3hot. When NHI is runtime
suspended TBT is finally turned off but state of the root ports remain
to be D3hot.

If the user now runs lspci for instance, the result is all 1's like in
the below output (07.0 is the first root port, RP0):

  # lspci -vv -s 07.0
  00:07.0 PCI bridge: Intel Corporation Device 8a1d (rev ff) (prog-if ff)
  	!!! Unknown header type 7f
  	Kernel driver in use: pcieport

I short the hardware state is not in sync with the software state
anymore. The exact same thing happens with the PME polling thread which
ends up bringing the root ports back into D0 after they are runtime
suspended.

There is another issue that happens when the power resource(s) are
turned on. Assume the Thunderbolt controller is runtime resumed and the
power resources are turned on which means that the other devices sharing
them (RP0, RP1 and xHCI) are transitioned into D0uninitialized state. If
they were configured to trigger wake (PME) on certain event that
configuration gets lost after reset so we would need to re-initialize
them to get the wakeup working as expected again. This means that we
would need to runtime resume all of them to make sure their registers
get restored properly before we can runtime suspend them again.

Prevent this from happening by asking from the platform (ACPI) if there
are other devices sharing the same power source when device power is
turned on/off through platform. If we find such devices we update their
dev->current_state (in case of D3hot -> D3cold transition) or runtime
resume them (in case of D3cold -> D0uninitialized).

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
---
 drivers/acpi/power.c    |  32 ++++++++++++
 drivers/pci/pci-acpi.c  |  32 ++++++++++--
 drivers/pci/pci.c       | 106 ++++++++++++++++++++++++++++++----------
 drivers/pci/pci.h       |   5 ++
 include/acpi/acpi_bus.h |   3 ++
 5 files changed, 146 insertions(+), 32 deletions(-)

diff --git a/drivers/acpi/power.c b/drivers/acpi/power.c
index a916417b9e70..9d244fde05a4 100644
--- a/drivers/acpi/power.c
+++ b/drivers/acpi/power.c
@@ -739,6 +739,38 @@ int acpi_power_transition(struct acpi_device *device, int state)
 	return result;
 }
 
+/**
+ * acpi_shared_power_resource - Do given devices share power resource
+ * @adev1: First device to check
+ * @adev2: Second device to check
+ * @state: State which power resources are looked
+ *
+ * Checks if given devices share one or more power resources and in that
+ * case returns true.
+ */
+bool acpi_shared_power_resource(struct acpi_device *adev1, struct acpi_device *adev2,
+				int state)
+{
+	struct acpi_power_resource_entry *e1, *e2;
+	struct list_head *l1, *l2;
+
+	/* Both need to have power resources */
+	if (!adev1->power.flags.power_resources ||
+	    !adev2->power.flags.power_resources)
+		return false;
+
+	l1 = &adev1->power.states[state].resources;
+	l2 = &adev2->power.states[state].resources;
+
+	list_for_each_entry(e1, l1, node)
+		list_for_each_entry(e2, l2, node) {
+			if (e1->resource == e2->resource)
+				return true;
+		}
+
+	return false;
+}
+
 static void acpi_release_power_resource(struct device *dev)
 {
 	struct acpi_device *device = to_acpi_device(dev);
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 1897847ceb0c..39112e684403 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -632,16 +632,23 @@ static bool acpi_pci_power_manageable(struct pci_dev *dev)
 	return adev ? acpi_device_power_manageable(adev) : false;
 }
 
-static int acpi_pci_set_power_state(struct pci_dev *dev, pci_power_t state)
+static const u8 pci_power_to_acpi(pci_power_t state)
 {
-	struct acpi_device *adev = ACPI_COMPANION(&dev->dev);
-	static const u8 state_conv[] = {
+	static const int state_conv[] = {
 		[PCI_D0] = ACPI_STATE_D0,
 		[PCI_D1] = ACPI_STATE_D1,
 		[PCI_D2] = ACPI_STATE_D2,
 		[PCI_D3hot] = ACPI_STATE_D3_HOT,
 		[PCI_D3cold] = ACPI_STATE_D3_COLD,
 	};
+
+	return state_conv[state];
+}
+
+static int acpi_pci_set_power_state(struct pci_dev *dev, pci_power_t state)
+{
+	struct acpi_device *adev = ACPI_COMPANION(&dev->dev);
+	int acpi_state = pci_power_to_acpi(state);
 	int error = -EINVAL;
 
 	/* If the ACPI device has _EJ0, ignore the device */
@@ -660,12 +667,12 @@ static int acpi_pci_set_power_state(struct pci_dev *dev, pci_power_t state)
 	case PCI_D1:
 	case PCI_D2:
 	case PCI_D3hot:
-		error = acpi_device_set_power(adev, state_conv[state]);
+		error = acpi_device_set_power(adev, acpi_state);
 	}
 
 	if (!error)
 		pci_dbg(dev, "power state changed by ACPI to %s\n",
-			 acpi_power_state_string(state_conv[state]));
+			 acpi_power_state_string(acpi_state));
 
 	return error;
 }
@@ -743,6 +750,20 @@ static bool acpi_pci_need_resume(struct pci_dev *dev)
 	return !!adev->power.flags.dsw_present;
 }
 
+static bool acpi_pci_shared_power(struct pci_dev *dev1, struct pci_dev *dev2,
+				  pci_power_t state)
+{
+	struct acpi_device *adev1 = ACPI_COMPANION(&dev1->dev);
+	struct acpi_device *adev2 = ACPI_COMPANION(&dev2->dev);
+
+	if (!adev1 || !acpi_device_power_manageable(adev1))
+		return false;
+	if (!adev2 || !acpi_device_power_manageable(adev2))
+		return false;
+
+	return acpi_shared_power_resource(adev1, adev2, pci_power_to_acpi(state));
+}
+
 static const struct pci_platform_pm_ops acpi_pci_platform_pm = {
 	.bridge_d3 = acpi_pci_bridge_d3,
 	.is_manageable = acpi_pci_power_manageable,
@@ -751,6 +772,7 @@ static const struct pci_platform_pm_ops acpi_pci_platform_pm = {
 	.choose_state = acpi_pci_choose_state,
 	.set_wakeup = acpi_pci_wakeup,
 	.need_resume = acpi_pci_need_resume,
+	.shared_power = acpi_pci_shared_power,
 };
 
 void acpi_pci_add_bus(struct pci_bus *bus)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 720da09d4d73..224689116df3 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -755,7 +755,8 @@ static const struct pci_platform_pm_ops *pci_platform_pm;
 int pci_set_platform_pm(const struct pci_platform_pm_ops *ops)
 {
 	if (!ops->is_manageable || !ops->set_state  || !ops->get_state ||
-	    !ops->choose_state  || !ops->set_wakeup || !ops->need_resume)
+	    !ops->choose_state  || !ops->set_wakeup || !ops->need_resume ||
+	    !ops->bridge_d3 || !ops->shared_power)
 		return -EINVAL;
 	pci_platform_pm = ops;
 	return 0;
@@ -799,6 +800,12 @@ static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
 	return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
 }
 
+static inline bool platform_pci_shared_power(struct pci_dev *dev1, struct pci_dev *dev2,
+					     pci_power_t t)
+{
+	return pci_platform_pm ? pci_platform_pm->shared_power(dev1, dev2, t) : false;
+}
+
 /**
  * pci_raw_set_power_state - Use PCI PM registers to set the power state of
  *			     given PCI device
@@ -994,31 +1001,6 @@ void pci_wakeup_bus(struct pci_bus *bus)
 		pci_walk_bus(bus, pci_wakeup, NULL);
 }
 
-/**
- * __pci_start_power_transition - Start power transition of a PCI device
- * @dev: PCI device to handle.
- * @state: State to put the device into.
- */
-static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
-{
-	if (state == PCI_D0) {
-		pci_platform_power_transition(dev, PCI_D0);
-		/*
-		 * Mandatory power management transition delays are
-		 * handled in the PCIe portdrv resume hooks.
-		 */
-		if (dev->runtime_d3cold) {
-			/*
-			 * When powering on a bridge from D3cold, the
-			 * whole hierarchy may be powered on into
-			 * D0uninitialized state, resume them to give
-			 * them a chance to suspend again
-			 */
-			pci_wakeup_bus(dev->subordinate);
-		}
-	}
-}
-
 /**
  * __pci_dev_set_current_state - Set current state of a PCI device
  * @dev: Device to handle
@@ -1043,6 +1025,76 @@ void pci_bus_set_current_state(struct pci_bus *bus, pci_power_t state)
 		pci_walk_bus(bus, __pci_dev_set_current_state, &state);
 }
 
+
+static void pci_update_topology_power_state(struct pci_dev *dev, pci_power_t state)
+{
+	struct pci_dev *p = NULL;
+
+	if (!platform_pci_power_manageable(dev))
+		return;
+
+	if (state == PCI_D3cold) {
+		/* The topology below is in D3cold as well */
+		pci_bus_set_current_state(dev->subordinate, PCI_D3cold);
+		/*
+		 * Find other devices that were in PCI_D3hot and now are in
+		 * D3cold that share the power resource.
+		 */
+		for_each_pci_dev(p) {
+			if (p == dev || !platform_pci_power_manageable(p))
+				continue;
+
+			if (platform_pci_shared_power(dev, p, PCI_D3hot) &&
+			    p->current_state == PCI_D3hot &&
+			    platform_pci_get_power_state(p) == PCI_D3cold) {
+				dev_dbg(&p->dev,
+					"transition into D3cold because power turned off\n");
+				p->current_state = PCI_D3cold;
+				pci_bus_set_current_state(p->subordinate, PCI_D3cold);
+			}
+		}
+	} else if (state == PCI_D0) {
+		/*
+		 * When powering on a bridge from D3cold, the whole
+		 * hierarchy may be powered on into D0uninitialized state,
+		 * resume them to give them a chance to suspend again.
+		 */
+		pci_wakeup_bus(dev->subordinate);
+		/*
+		 * Find other devices that were in PCI_D3cold and now are
+		 * in D0uninitialized because of the shared power resource
+		 * and resume them now so that they get properly
+		 * re-initialized.
+		 */
+		for_each_pci_dev(p) {
+			if (p == dev || !platform_pci_power_manageable(p))
+				continue;
+
+			if (platform_pci_shared_power(dev, p, PCI_D0) &&
+			    p->current_state == PCI_D3cold &&
+			    platform_pci_get_power_state(p) == PCI_D0) {
+				dev_dbg(&p->dev, "waking up because power turned on\n");
+				pci_wakeup(p, NULL);
+				pci_wakeup_bus(p->subordinate);
+			}
+		}
+	}
+}
+
+/**
+ * __pci_start_power_transition - Start power transition of a PCI device
+ * @dev: PCI device to handle.
+ * @state: State to put the device into.
+ */
+static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
+{
+	if (state == PCI_D0) {
+		pci_platform_power_transition(dev, PCI_D0);
+		if (dev->runtime_d3cold)
+			pci_update_topology_power_state(dev, PCI_D0);
+	}
+}
+
 /**
  * __pci_complete_power_transition - Complete power transition of a PCI device
  * @dev: PCI device to handle.
@@ -1059,7 +1111,7 @@ int __pci_complete_power_transition(struct pci_dev *dev, pci_power_t state)
 	ret = pci_platform_power_transition(dev, state);
 	/* Power off the bridge may power off the whole hierarchy */
 	if (!ret && state == PCI_D3cold)
-		pci_bus_set_current_state(dev->subordinate, PCI_D3cold);
+		pci_update_topology_power_state(dev, PCI_D3cold);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(__pci_complete_power_transition);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 59802b3def4b..855c9a2f3079 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -61,6 +61,9 @@ int pci_bus_error_reset(struct pci_dev *dev);
  *		 suspended) needs to be resumed to be configured for system
  *		 wakeup.
  *
+ * @shared_power: Returns 'true' if the two PCI devices share power source
+ *		  such as ACPI power resource.
+ *
  * If given platform is generally capable of power managing PCI devices, all of
  * these callbacks are mandatory.
  */
@@ -72,6 +75,8 @@ struct pci_platform_pm_ops {
 	pci_power_t (*choose_state)(struct pci_dev *dev);
 	int (*set_wakeup)(struct pci_dev *dev, bool enable);
 	bool (*need_resume)(struct pci_dev *dev);
+	bool (*shared_power)(struct pci_dev *dev1, struct pci_dev *dev2,
+			     pci_power_t state);
 };
 
 int pci_set_platform_pm(const struct pci_platform_pm_ops *ops);
diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
index 31b6c87d6240..d70dbbdb5164 100644
--- a/include/acpi/acpi_bus.h
+++ b/include/acpi/acpi_bus.h
@@ -597,6 +597,9 @@ struct acpi_pci_root *acpi_pci_find_root(acpi_handle handle);
 int acpi_enable_wakeup_device_power(struct acpi_device *dev, int state);
 int acpi_disable_wakeup_device_power(struct acpi_device *dev);
 
+bool acpi_shared_power_resource(struct acpi_device *adev1,
+				struct acpi_device *adev2, int state);
+
 #ifdef CONFIG_X86
 bool acpi_device_always_present(struct acpi_device *adev);
 #else
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] PCI: Do not poll for PME if the device is in D3cold
  2019-06-05 14:58 ` [PATCH 2/3] PCI: Do not poll for PME if the device is in D3cold Mika Westerberg
@ 2019-06-05 19:05   ` Lukas Wunner
  2019-06-06 11:21     ` Mika Westerberg
  2019-06-09 18:38   ` Lukas Wunner
  2019-06-10 11:35   ` Rafael J. Wysocki
  2 siblings, 1 reply; 23+ messages in thread
From: Lukas Wunner @ 2019-06-05 19:05 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Len Brown, Keith Busch,
	Alex Williamson, Alexandru Gagniuc, linux-acpi, linux-pci

On Wed, Jun 05, 2019 at 05:58:19PM +0300, Mika Westerberg wrote:
> PME polling does not take into account that a device that is directly
> connected to the host bridge may go into D3cold as well. This leads to a
> situation where the PME poll thread reads from a config space of a
> device that is in D3cold and gets incorrect information because the
> config space is not accessible.
> 
> Here is an example from Intel Ice Lake system where two PCIe root ports
> are in D3cold (I've instrumented the kernel to log the PMCSR register
> contents):
> 
>   [   62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
>   [   62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff
> 
> Since 0xffff is interpreted so that PME is pending, the root ports will
> be runtime resumed. This repeats over and over again essentially
> blocking all runtime power management.
> 
> Prevent this from happening by checking whether the device is in D3cold
> before its PME status is read.

There's more broken here.  The below patch fixes a PME polling race
and should also fix the issue you're witnessing, could you verify that?

The patch has been rotting on my development branch for several months,
I just didn't get around to posting it, my apologies.

-- >8 --
Subject: [PATCH] PCI / PM: Fix race on PME polling

Since commit df17e62e5bff ("PCI: Add support for polling PME state on
suspended legacy PCI devices"), the work item pci_pme_list_scan() polls
the PME status flag of devices and wakes them up if the bit is set.

The function performs a check whether a device's upstream bridge is in
D0 for otherwise the device is inaccessible, rendering PME polling
impossible.  However the check is racy because it is performed before
polling the device.  If the upstream bridge runtime suspends to D3hot
after pci_pme_list_scan() checks its power state and before it invokes
pci_pme_wakeup(), the latter will read the PMCSR as "all ones" and
mistake it for a set PME status flag.  I am seeing this race play out as
a Thunderbolt controller going to D3cold and occasionally immediately
going to D0 again because PM polling was performed at just the wrong
time.

Avoid by checking for an "all ones" PMCSR in pci_check_pme_status().

Fixes: 58ff463396ad ("PCI PM: Add function for checking PME status of devices")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable@vger.kernel.org # v2.6.34+
---
 drivers/pci/pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b98a564..2e05348 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1753,6 +1753,8 @@ bool pci_check_pme_status(struct pci_dev *dev)
 	pci_read_config_word(dev, pmcsr_pos, &pmcsr);
 	if (!(pmcsr & PCI_PM_CTRL_PME_STATUS))
 		return false;
+	if (pmcsr == ~0)
+		return false;
 
 	/* Clear PME status. */
 	pmcsr |= PCI_PM_CTRL_PME_STATUS;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] PCI: Add missing link delays required by the PCIe spec
  2019-06-05 14:58 ` [PATCH 1/3] PCI: Add missing link delays required by the PCIe spec Mika Westerberg
@ 2019-06-06  8:40   ` Rafael J. Wysocki
  2019-06-06 11:24     ` Mika Westerberg
  0 siblings, 1 reply; 23+ messages in thread
From: Rafael J. Wysocki @ 2019-06-06  8:40 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Len Brown, Lukas Wunner,
	Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Wed, Jun 5, 2019 at 4:58 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> Currently Linux does not follow PCIe spec regarding the required delays
> after reset. A concrete example is a Thunderbolt add-in-card that
> consists of a PCIe switch and two PCIe endpoints:
>
>   +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
>                                   +-01.0-[04-36]-- DS hotplug port
>                                   +-02.0-[37]----00.0 xHCI controller
>                                   \-04.0-[38-6b]-- DS hotplug port
>
> The root port (1b.0) and the PCIe switch downstream ports are all PCIe
> gen3 so they support 8GT/s link speeds.
>
> We wait for the PCIe hierarchy to enter D3cold (runtime):
>
>   pcieport 0000:00:1b.0: power state changed by ACPI to D3cold
>
> When it wakes up from D3cold, according to the PCIe 4.0 section 5.8 the
> PCIe switch is put to reset and its power is re-applied. This means that
> we must follow the rules in PCIe 4.0 section 6.6.1.
>
> For the PCIe gen3 ports we are dealing with here, the following applies:
>
>   With a Downstream Port that supports Link speeds greater than 5.0
>   GT/s, software must wait a minimum of 100 ms after Link training
>   completes before sending a Configuration Request to the device
>   immediately below that Port. Software can determine when Link training
>   completes by polling the Data Link Layer Link Active bit or by setting
>   up an associated interrupt (see Section 6.7.3.3).
>
> Translating this into the above topology we would need to do this (DLLLA
> stands for Data Link Layer Link Active):
>
>   pcieport 0000:00:1b.0: wait for 100ms after DLLLA is set before access to 0000:01:00.0
>   pcieport 0000:02:00.0: wait for 100ms after DLLLA is set before access to 0000:03:00.0
>   pcieport 0000:02:02.0: wait for 100ms after DLLLA is set before access to 0000:37:00.0
>
> I've instrumented the kernel with additional logging so we can see the
> actual delays the kernel performs:
>
>   pcieport 0000:00:1b.0: power state changed by ACPI to D0
>   pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms
>   pcieport 0000:00:1b.0: waking up bus
>   pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms
>   pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
>   ...
>   pcieport 0000:00:1b.0: PME# disabled
>   pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   ...
>   pcieport 0000:01:00.0: PME# disabled
>   pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   ...
>   pcieport 0000:02:00.0: PME# disabled
>   pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   ...
>   pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
>   pcieport 0000:02:01.0: PME# disabled
>   pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   ...
>   pcieport 0000:02:02.0: PME# disabled
>   pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   ...
>   pcieport 0000:02:04.0: PME# disabled
>   pcieport 0000:02:01.0: PME# enabled
>   pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms
>   pcieport 0000:02:04.0: PME# enabled
>   pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms
>   thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000)
>   ...
>   thunderbolt 0000:03:00.0: PME# disabled
>   xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000)
>   ...
>   xhci_hcd 0000:37:00.0: PME# disabled
>
> For the switch upstream port (01:00.0) we wait for 100ms but not taking
> into account the DLLLA requirement. We then wait 10ms for D3hot -> D0
> transition of the root port and the two downstream hotplug ports. This
> means that we deviate from what the spec requires.
>
> Performing the same check for system sleep (s2idle) transitions we can
> see following when resuming from s2idle:
>
>   pcieport 0000:00:1b.0: power state changed by ACPI to D0
>   pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
>   ...
>   pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   ...
>   pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   pcieport 0000:02:02.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
>   pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   pcieport 0000:02:02.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
>   pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   pcieport 0000:02:02.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1)
>   pcieport 0000:02:01.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
>   pcieport 0000:02:02.0: restoring config space at offset 0x20 (was 0x0, writing 0x73f073f0)
>   pcieport 0000:02:04.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
>   pcieport 0000:02:01.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
>   pcieport 0000:02:00.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
>   pcieport 0000:02:02.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1)
>   pcieport 0000:02:04.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
>   pcieport 0000:02:01.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1ff10001)
>   pcieport 0000:02:00.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
>   pcieport 0000:02:02.0: restoring config space at offset 0x18 (was 0x0, writing 0x373702)
>   pcieport 0000:02:04.0: restoring config space at offset 0x24 (was 0x10001, writing 0x49f12001)
>   pcieport 0000:02:01.0: restoring config space at offset 0x20 (was 0x0, writing 0x73e05c00)
>   pcieport 0000:02:00.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1)
>   pcieport 0000:02:04.0: restoring config space at offset 0x20 (was 0x0, writing 0x89f07400)
>   pcieport 0000:02:01.0: restoring config space at offset 0x1c (was 0x101, writing 0x5151)
>   pcieport 0000:02:00.0: restoring config space at offset 0x20 (was 0x0, writing 0x8a008a00)
>   pcieport 0000:02:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
>   pcieport 0000:02:04.0: restoring config space at offset 0x1c (was 0x101, writing 0x6161)
>   pcieport 0000:02:01.0: restoring config space at offset 0x18 (was 0x0, writing 0x360402)
>   pcieport 0000:02:00.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1)
>   pcieport 0000:02:04.0: restoring config space at offset 0x18 (was 0x0, writing 0x6b3802)
>   pcieport 0000:02:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
>   pcieport 0000:02:00.0: restoring config space at offset 0x18 (was 0x0, writing 0x30302)
>   pcieport 0000:02:01.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
>   pcieport 0000:02:04.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
>   pcieport 0000:02:00.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
>   pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
>   pcieport 0000:02:04.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
>   pcieport 0000:02:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
>   xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000)
>   ...
>   thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000)
>
> This is even worse. None of the mandatory delays are performed. If this
> would be S3 instead of s2idle then according to PCI FW spec 3.2 section
> 4.6.8.  there is a specific _DSM that allows the OS to skip the delays
> but this platform does not provide the _DSM and does not go to S3 anyway
> so no firmware is involved that could already handle these delays.
>
> In this particular Intel Coffee Lake platform these delays are not
> actually needed because there is an additional delay as part of the ACPI
> power resource that is used to turn on power to the hierarchy but since
> that additional delay is not required by any of standards (PCIe, ACPI)
> it is not present in the Intel Ice Lake, for example where missing the
> mandatory delays causes pciehp to start tearing down the stack too early
> (links are not yet trained).
>
> For this reason, change the PCIe portdrv PM resume hooks so that they
> perform the mandatory delays before the downstream component gets
> resumed. We perform the delays before port services are resumed because
> otherwise pciehp might find that the link is not up (even if it is just
> training) and tears-down the hierarchy.
>
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>

Generally

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

with a couple of nits below.

> ---
>  drivers/pci/pci.c               | 29 +++++++++------
>  drivers/pci/pci.h               |  1 +
>  drivers/pci/pcie/portdrv_core.c | 62 +++++++++++++++++++++++++++++++++
>  3 files changed, 82 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 8abc843b1615..87a1f902fa8e 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1004,15 +1004,10 @@ static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
>         if (state == PCI_D0) {
>                 pci_platform_power_transition(dev, PCI_D0);
>                 /*
> -                * Mandatory power management transition delays, see
> -                * PCI Express Base Specification Revision 2.0 Section
> -                * 6.6.1: Conventional Reset.  Do not delay for
> -                * devices powered on/off by corresponding bridge,
> -                * because have already delayed for the bridge.
> +                * Mandatory power management transition delays are
> +                * handled in the PCIe portdrv resume hooks.
>                  */
>                 if (dev->runtime_d3cold) {
> -                       if (dev->d3cold_delay && !dev->imm_ready)
> -                               msleep(dev->d3cold_delay);
>                         /*
>                          * When powering on a bridge from D3cold, the
>                          * whole hierarchy may be powered on into
> @@ -4568,14 +4563,16 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
>
>         return pci_dev_wait(dev, "PM D3->D0", PCIE_RESET_READY_POLL_MS);
>  }
> +
>  /**
> - * pcie_wait_for_link - Wait until link is active or inactive
> + * pcie_wait_for_link_delay - Wait until link is active or inactive
>   * @pdev: Bridge device
>   * @active: waiting for active or inactive?
> + * @delay: Delay to wait after link has become active (in ms)
>   *
>   * Use this to wait till link becomes active or inactive.
>   */
> -bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> +bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active, int delay)
>  {
>         int timeout = 1000;
>         bool ret;
> @@ -4612,13 +4609,25 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
>                 timeout -= 10;
>         }
>         if (active && ret)
> -               msleep(100);
> +               msleep(delay);
>         else if (ret != active)
>                 pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
>                         active ? "set" : "cleared");
>         return ret == active;
>  }
>
> +/**
> + * pcie_wait_for_link - Wait until link is active or inactive
> + * @pdev: Bridge device
> + * @active: waiting for active or inactive?
> + *
> + * Use this to wait till link becomes active or inactive.
> + */
> +bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> +{
> +       return pcie_wait_for_link_delay(pdev, active, 100);
> +}
> +
>  void pci_reset_secondary_bus(struct pci_dev *dev)
>  {
>         u16 ctrl;
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 9cb99380c61e..59802b3def4b 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -493,6 +493,7 @@ static inline int pci_dev_specific_disable_acs_redir(struct pci_dev *dev)
>  void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>                       u32 service);
>
> +bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active, int delay);
>  bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
>  #ifdef CONFIG_PCIEASPM
>  void pcie_aspm_init_link_state(struct pci_dev *pdev);
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index 1b330129089f..88d151a54be6 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -9,6 +9,7 @@
>  #include <linux/module.h>
>  #include <linux/pci.h>
>  #include <linux/kernel.h>
> +#include <linux/delay.h>
>  #include <linux/errno.h>
>  #include <linux/pm.h>
>  #include <linux/pm_runtime.h>
> @@ -378,6 +379,58 @@ static int pm_iter(struct device *dev, void *data)
>         return 0;
>  }
>
> +static int get_downstream_delay(struct pci_bus *bus)
> +{
> +       struct pci_dev *pdev;
> +       int min_delay = 100;
> +       int max_delay = 0;
> +
> +       list_for_each_entry(pdev, &bus->devices, bus_list) {
> +               if (!pdev->imm_ready)
> +                       min_delay = 0;
> +               else if (pdev->d3cold_delay < min_delay)
> +                       min_delay = pdev->d3cold_delay;
> +               if (pdev->d3cold_delay > max_delay)
> +                       max_delay = pdev->d3cold_delay;
> +       }
> +
> +       return max(min_delay, max_delay);
> +}
> +
> +static void wait_for_downstream_link(struct pci_dev *pdev)
> +{
> +       /*
> +        * Handle delays according to PCIe 4.0 section 6.6.1 before
> +        * configuration access to the downstream component is permitted.
> +        *
> +        * This blocks PCI core resume of the hierarchy below this port
> +        * until the link is trained.
> +        */

Shouldn't the above go to a kerneldoc comment?

> +       if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT ||
> +            pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM) &&
> +            pdev->subordinate && !list_empty(&pdev->subordinate->devices) &&
> +            pdev->bridge_d3 && !pci_dev_is_disconnected(pdev)) {

There's nothing except for this if () in the function, so maybe just
return here if the condition is not satisfied?

Also maybe split it into a port type check (you can return right away
if it fails), disconnected check and the rest?

> +               int delay;
> +
> +               delay = get_downstream_delay(pdev->subordinate);
> +               if (!delay)
> +                       return;
> +
> +               dev_dbg(&pdev->dev, "waiting downstream link for %d ms\n", delay);
> +
> +               /*
> +                * If downstream port does not support speeds greater than
> +                * 5 GT/s need to wait 100ms. For higher speeds (gen3) we
> +                * need to wait first for the data link layer to become
> +                * active.
> +                */
> +               if (pcie_get_speed_cap(pdev) <= PCIE_SPEED_5_0GT)
> +                       msleep(delay);
> +               else
> +                       pcie_wait_for_link_delay(pdev, true, delay);
> +       }
> +}
> +
>  /**
>   * pcie_port_device_suspend - suspend port services associated with a PCIe port
>   * @dev: PCI Express port to handle
> @@ -391,6 +444,13 @@ int pcie_port_device_suspend(struct device *dev)
>  int pcie_port_device_resume_noirq(struct device *dev)
>  {
>         size_t off = offsetof(struct pcie_port_service_driver, resume_noirq);
> +
> +       /*
> +        * Wait for the link to be fully up before resuming port services.
> +        * This prevents pciehp from starting to tear-down the hierarchy
> +        * too soon.
> +        */
> +       wait_for_downstream_link(to_pci_dev(dev));
>         return device_for_each_child(dev, &off, pm_iter);
>  }
>
> @@ -421,6 +481,8 @@ int pcie_port_device_runtime_suspend(struct device *dev)
>  int pcie_port_device_runtime_resume(struct device *dev)
>  {
>         size_t off = offsetof(struct pcie_port_service_driver, runtime_resume);
> +

The comment from pcie_port_device_resume_noirq() is applicable here
too, so maybe move it to a common place (kerneldoc?).

> +       wait_for_downstream_link(to_pci_dev(dev));
>         return device_for_each_child(dev, &off, pm_iter);
>  }
>  #endif /* PM */
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources
  2019-06-05 14:58 ` [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources Mika Westerberg
@ 2019-06-06  8:54   ` Rafael J. Wysocki
  2019-06-06 11:26     ` Mika Westerberg
  0 siblings, 1 reply; 23+ messages in thread
From: Rafael J. Wysocki @ 2019-06-06  8:54 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Len Brown, Lukas Wunner,
	Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Wed, Jun 5, 2019 at 4:58 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> Intel Ice Lake has an interated Thunderbolt controller which means that

integrated

> the PCIe topology is extended directly from the two root ports (RP0 and
> RP1). Power management is handled by ACPI power resources that are
> shared between the root ports, Thunderbolt controller (NHI) and xHCI
> controller.
>
> The topology with the power resources (marked with []) looks like:
>
>   Host bridge
>       |
>       +- RP0 ---\
>       +- RP1 ---|--+--> [TBT]
>       +- NHI --/   |
>       |            |
>       |            v
>       +- xHCI --> [D3C]
>
> Here TBT and D3C are the shared ACPI power resources. ACPI _PR3() method
> returns either TBT or D3C or both.
>
> Say we runtime suspend first the root ports RP0 and RP1, then NHI. Now
> since the TBT power resource is still on when the root ports are runtime
> suspended their dev->current_state is set to D3hot. When NHI is runtime
> suspended TBT is finally turned off but state of the root ports remain
> to be D3hot.

It looks like this problem will affect all ACPI devices using power
resources and _PR3 in general, so fixing it just for PCI is not
sufficient IMO.

An alternative approach may be to set the state of a device that
dropped its references to power resources listed in _PR3 to D3cold
even though those power resources may be physically "on" at that time.
Everything else (including this patch AFAICS) will be racy this way or
another.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] PCI: Do not poll for PME if the device is in D3cold
  2019-06-05 19:05   ` Lukas Wunner
@ 2019-06-06 11:21     ` Mika Westerberg
  0 siblings, 0 replies; 23+ messages in thread
From: Mika Westerberg @ 2019-06-06 11:21 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Len Brown, Keith Busch,
	Alex Williamson, Alexandru Gagniuc, linux-acpi, linux-pci

On Wed, Jun 05, 2019 at 09:05:57PM +0200, Lukas Wunner wrote:
> On Wed, Jun 05, 2019 at 05:58:19PM +0300, Mika Westerberg wrote:
> > PME polling does not take into account that a device that is directly
> > connected to the host bridge may go into D3cold as well. This leads to a
> > situation where the PME poll thread reads from a config space of a
> > device that is in D3cold and gets incorrect information because the
> > config space is not accessible.
> > 
> > Here is an example from Intel Ice Lake system where two PCIe root ports
> > are in D3cold (I've instrumented the kernel to log the PMCSR register
> > contents):
> > 
> >   [   62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
> >   [   62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff
> > 
> > Since 0xffff is interpreted so that PME is pending, the root ports will
> > be runtime resumed. This repeats over and over again essentially
> > blocking all runtime power management.
> > 
> > Prevent this from happening by checking whether the device is in D3cold
> > before its PME status is read.
> 
> There's more broken here.  The below patch fixes a PME polling race
> and should also fix the issue you're witnessing, could you verify that?

It fixes the issue but I needed to tune it a bit ->

> The patch has been rotting on my development branch for several months,
> I just didn't get around to posting it, my apologies.

Better late than never :)

> -- >8 --
> Subject: [PATCH] PCI / PM: Fix race on PME polling
> 
> Since commit df17e62e5bff ("PCI: Add support for polling PME state on
> suspended legacy PCI devices"), the work item pci_pme_list_scan() polls
> the PME status flag of devices and wakes them up if the bit is set.
> 
> The function performs a check whether a device's upstream bridge is in
> D0 for otherwise the device is inaccessible, rendering PME polling
> impossible.  However the check is racy because it is performed before
> polling the device.  If the upstream bridge runtime suspends to D3hot
> after pci_pme_list_scan() checks its power state and before it invokes
> pci_pme_wakeup(), the latter will read the PMCSR as "all ones" and
> mistake it for a set PME status flag.  I am seeing this race play out as
> a Thunderbolt controller going to D3cold and occasionally immediately
> going to D0 again because PM polling was performed at just the wrong
> time.
> 
> Avoid by checking for an "all ones" PMCSR in pci_check_pme_status().
> 
> Fixes: 58ff463396ad ("PCI PM: Add function for checking PME status of devices")
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: stable@vger.kernel.org # v2.6.34+
> ---
>  drivers/pci/pci.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b98a564..2e05348 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1753,6 +1753,8 @@ bool pci_check_pme_status(struct pci_dev *dev)
>  	pci_read_config_word(dev, pmcsr_pos, &pmcsr);
>  	if (!(pmcsr & PCI_PM_CTRL_PME_STATUS))
>  		return false;
> +	if (pmcsr == ~0)

<- Here I needed to do

	if (pmcsr == (u16)~0)

I think it is because pmcsr is u16 so we end up comparing:

	0xffff == 0xffffffff

> +		return false;
>  
>  	/* Clear PME status. */
>  	pmcsr |= PCI_PM_CTRL_PME_STATUS;
> -- 
> 2.20.1

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] PCI: Add missing link delays required by the PCIe spec
  2019-06-06  8:40   ` Rafael J. Wysocki
@ 2019-06-06 11:24     ` Mika Westerberg
  0 siblings, 0 replies; 23+ messages in thread
From: Mika Westerberg @ 2019-06-06 11:24 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Len Brown, Lukas Wunner,
	Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Thu, Jun 06, 2019 at 10:40:48AM +0200, Rafael J. Wysocki wrote:
> On Wed, Jun 5, 2019 at 4:58 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > Currently Linux does not follow PCIe spec regarding the required delays
> > after reset. A concrete example is a Thunderbolt add-in-card that
> > consists of a PCIe switch and two PCIe endpoints:
> >
> >   +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
> >                                   +-01.0-[04-36]-- DS hotplug port
> >                                   +-02.0-[37]----00.0 xHCI controller
> >                                   \-04.0-[38-6b]-- DS hotplug port
> >
> > The root port (1b.0) and the PCIe switch downstream ports are all PCIe
> > gen3 so they support 8GT/s link speeds.
> >
> > We wait for the PCIe hierarchy to enter D3cold (runtime):
> >
> >   pcieport 0000:00:1b.0: power state changed by ACPI to D3cold
> >
> > When it wakes up from D3cold, according to the PCIe 4.0 section 5.8 the
> > PCIe switch is put to reset and its power is re-applied. This means that
> > we must follow the rules in PCIe 4.0 section 6.6.1.
> >
> > For the PCIe gen3 ports we are dealing with here, the following applies:
> >
> >   With a Downstream Port that supports Link speeds greater than 5.0
> >   GT/s, software must wait a minimum of 100 ms after Link training
> >   completes before sending a Configuration Request to the device
> >   immediately below that Port. Software can determine when Link training
> >   completes by polling the Data Link Layer Link Active bit or by setting
> >   up an associated interrupt (see Section 6.7.3.3).
> >
> > Translating this into the above topology we would need to do this (DLLLA
> > stands for Data Link Layer Link Active):
> >
> >   pcieport 0000:00:1b.0: wait for 100ms after DLLLA is set before access to 0000:01:00.0
> >   pcieport 0000:02:00.0: wait for 100ms after DLLLA is set before access to 0000:03:00.0
> >   pcieport 0000:02:02.0: wait for 100ms after DLLLA is set before access to 0000:37:00.0
> >
> > I've instrumented the kernel with additional logging so we can see the
> > actual delays the kernel performs:
> >
> >   pcieport 0000:00:1b.0: power state changed by ACPI to D0
> >   pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms
> >   pcieport 0000:00:1b.0: waking up bus
> >   pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms
> >   pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
> >   ...
> >   pcieport 0000:00:1b.0: PME# disabled
> >   pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> >   ...
> >   pcieport 0000:01:00.0: PME# disabled
> >   pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> >   ...
> >   pcieport 0000:02:00.0: PME# disabled
> >   pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> >   ...
> >   pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
> >   pcieport 0000:02:01.0: PME# disabled
> >   pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> >   ...
> >   pcieport 0000:02:02.0: PME# disabled
> >   pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> >   ...
> >   pcieport 0000:02:04.0: PME# disabled
> >   pcieport 0000:02:01.0: PME# enabled
> >   pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms
> >   pcieport 0000:02:04.0: PME# enabled
> >   pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms
> >   thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000)
> >   ...
> >   thunderbolt 0000:03:00.0: PME# disabled
> >   xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000)
> >   ...
> >   xhci_hcd 0000:37:00.0: PME# disabled
> >
> > For the switch upstream port (01:00.0) we wait for 100ms but not taking
> > into account the DLLLA requirement. We then wait 10ms for D3hot -> D0
> > transition of the root port and the two downstream hotplug ports. This
> > means that we deviate from what the spec requires.
> >
> > Performing the same check for system sleep (s2idle) transitions we can
> > see following when resuming from s2idle:
> >
> >   pcieport 0000:00:1b.0: power state changed by ACPI to D0
> >   pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
> >   ...
> >   pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> >   ...
> >   pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> >   pcieport 0000:02:02.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
> >   pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> >   pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> >   pcieport 0000:02:02.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
> >   pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> >   pcieport 0000:02:02.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1)
> >   pcieport 0000:02:01.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
> >   pcieport 0000:02:02.0: restoring config space at offset 0x20 (was 0x0, writing 0x73f073f0)
> >   pcieport 0000:02:04.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
> >   pcieport 0000:02:01.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
> >   pcieport 0000:02:00.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
> >   pcieport 0000:02:02.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1)
> >   pcieport 0000:02:04.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
> >   pcieport 0000:02:01.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1ff10001)
> >   pcieport 0000:02:00.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
> >   pcieport 0000:02:02.0: restoring config space at offset 0x18 (was 0x0, writing 0x373702)
> >   pcieport 0000:02:04.0: restoring config space at offset 0x24 (was 0x10001, writing 0x49f12001)
> >   pcieport 0000:02:01.0: restoring config space at offset 0x20 (was 0x0, writing 0x73e05c00)
> >   pcieport 0000:02:00.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1)
> >   pcieport 0000:02:04.0: restoring config space at offset 0x20 (was 0x0, writing 0x89f07400)
> >   pcieport 0000:02:01.0: restoring config space at offset 0x1c (was 0x101, writing 0x5151)
> >   pcieport 0000:02:00.0: restoring config space at offset 0x20 (was 0x0, writing 0x8a008a00)
> >   pcieport 0000:02:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
> >   pcieport 0000:02:04.0: restoring config space at offset 0x1c (was 0x101, writing 0x6161)
> >   pcieport 0000:02:01.0: restoring config space at offset 0x18 (was 0x0, writing 0x360402)
> >   pcieport 0000:02:00.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1)
> >   pcieport 0000:02:04.0: restoring config space at offset 0x18 (was 0x0, writing 0x6b3802)
> >   pcieport 0000:02:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
> >   pcieport 0000:02:00.0: restoring config space at offset 0x18 (was 0x0, writing 0x30302)
> >   pcieport 0000:02:01.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
> >   pcieport 0000:02:04.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
> >   pcieport 0000:02:00.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
> >   pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
> >   pcieport 0000:02:04.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
> >   pcieport 0000:02:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
> >   xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000)
> >   ...
> >   thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000)
> >
> > This is even worse. None of the mandatory delays are performed. If this
> > would be S3 instead of s2idle then according to PCI FW spec 3.2 section
> > 4.6.8.  there is a specific _DSM that allows the OS to skip the delays
> > but this platform does not provide the _DSM and does not go to S3 anyway
> > so no firmware is involved that could already handle these delays.
> >
> > In this particular Intel Coffee Lake platform these delays are not
> > actually needed because there is an additional delay as part of the ACPI
> > power resource that is used to turn on power to the hierarchy but since
> > that additional delay is not required by any of standards (PCIe, ACPI)
> > it is not present in the Intel Ice Lake, for example where missing the
> > mandatory delays causes pciehp to start tearing down the stack too early
> > (links are not yet trained).
> >
> > For this reason, change the PCIe portdrv PM resume hooks so that they
> > perform the mandatory delays before the downstream component gets
> > resumed. We perform the delays before port services are resumed because
> > otherwise pciehp might find that the link is not up (even if it is just
> > training) and tears-down the hierarchy.
> >
> > Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> 
> Generally
> 
> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Thanks!

> with a couple of nits below.
> 
> > ---
> >  drivers/pci/pci.c               | 29 +++++++++------
> >  drivers/pci/pci.h               |  1 +
> >  drivers/pci/pcie/portdrv_core.c | 62 +++++++++++++++++++++++++++++++++
> >  3 files changed, 82 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 8abc843b1615..87a1f902fa8e 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -1004,15 +1004,10 @@ static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
> >         if (state == PCI_D0) {
> >                 pci_platform_power_transition(dev, PCI_D0);
> >                 /*
> > -                * Mandatory power management transition delays, see
> > -                * PCI Express Base Specification Revision 2.0 Section
> > -                * 6.6.1: Conventional Reset.  Do not delay for
> > -                * devices powered on/off by corresponding bridge,
> > -                * because have already delayed for the bridge.
> > +                * Mandatory power management transition delays are
> > +                * handled in the PCIe portdrv resume hooks.
> >                  */
> >                 if (dev->runtime_d3cold) {
> > -                       if (dev->d3cold_delay && !dev->imm_ready)
> > -                               msleep(dev->d3cold_delay);
> >                         /*
> >                          * When powering on a bridge from D3cold, the
> >                          * whole hierarchy may be powered on into
> > @@ -4568,14 +4563,16 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
> >
> >         return pci_dev_wait(dev, "PM D3->D0", PCIE_RESET_READY_POLL_MS);
> >  }
> > +
> >  /**
> > - * pcie_wait_for_link - Wait until link is active or inactive
> > + * pcie_wait_for_link_delay - Wait until link is active or inactive
> >   * @pdev: Bridge device
> >   * @active: waiting for active or inactive?
> > + * @delay: Delay to wait after link has become active (in ms)
> >   *
> >   * Use this to wait till link becomes active or inactive.
> >   */
> > -bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> > +bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active, int delay)
> >  {
> >         int timeout = 1000;
> >         bool ret;
> > @@ -4612,13 +4609,25 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> >                 timeout -= 10;
> >         }
> >         if (active && ret)
> > -               msleep(100);
> > +               msleep(delay);
> >         else if (ret != active)
> >                 pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
> >                         active ? "set" : "cleared");
> >         return ret == active;
> >  }
> >
> > +/**
> > + * pcie_wait_for_link - Wait until link is active or inactive
> > + * @pdev: Bridge device
> > + * @active: waiting for active or inactive?
> > + *
> > + * Use this to wait till link becomes active or inactive.
> > + */
> > +bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> > +{
> > +       return pcie_wait_for_link_delay(pdev, active, 100);
> > +}
> > +
> >  void pci_reset_secondary_bus(struct pci_dev *dev)
> >  {
> >         u16 ctrl;
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 9cb99380c61e..59802b3def4b 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -493,6 +493,7 @@ static inline int pci_dev_specific_disable_acs_redir(struct pci_dev *dev)
> >  void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
> >                       u32 service);
> >
> > +bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active, int delay);
> >  bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
> >  #ifdef CONFIG_PCIEASPM
> >  void pcie_aspm_init_link_state(struct pci_dev *pdev);
> > diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> > index 1b330129089f..88d151a54be6 100644
> > --- a/drivers/pci/pcie/portdrv_core.c
> > +++ b/drivers/pci/pcie/portdrv_core.c
> > @@ -9,6 +9,7 @@
> >  #include <linux/module.h>
> >  #include <linux/pci.h>
> >  #include <linux/kernel.h>
> > +#include <linux/delay.h>
> >  #include <linux/errno.h>
> >  #include <linux/pm.h>
> >  #include <linux/pm_runtime.h>
> > @@ -378,6 +379,58 @@ static int pm_iter(struct device *dev, void *data)
> >         return 0;
> >  }
> >
> > +static int get_downstream_delay(struct pci_bus *bus)
> > +{
> > +       struct pci_dev *pdev;
> > +       int min_delay = 100;
> > +       int max_delay = 0;
> > +
> > +       list_for_each_entry(pdev, &bus->devices, bus_list) {
> > +               if (!pdev->imm_ready)
> > +                       min_delay = 0;
> > +               else if (pdev->d3cold_delay < min_delay)
> > +                       min_delay = pdev->d3cold_delay;
> > +               if (pdev->d3cold_delay > max_delay)
> > +                       max_delay = pdev->d3cold_delay;
> > +       }
> > +
> > +       return max(min_delay, max_delay);
> > +}
> > +
> > +static void wait_for_downstream_link(struct pci_dev *pdev)
> > +{
> > +       /*
> > +        * Handle delays according to PCIe 4.0 section 6.6.1 before
> > +        * configuration access to the downstream component is permitted.
> > +        *
> > +        * This blocks PCI core resume of the hierarchy below this port
> > +        * until the link is trained.
> > +        */
> 
> Shouldn't the above go to a kerneldoc comment?

Yup I'll move it there.

> > +       if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT ||
> > +            pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM) &&
> > +            pdev->subordinate && !list_empty(&pdev->subordinate->devices) &&
> > +            pdev->bridge_d3 && !pci_dev_is_disconnected(pdev)) {
> 
> There's nothing except for this if () in the function, so maybe just
> return here if the condition is not satisfied?

OK

> Also maybe split it into a port type check (you can return right away
> if it fails), disconnected check and the rest?

OK

> > +               int delay;
> > +
> > +               delay = get_downstream_delay(pdev->subordinate);
> > +               if (!delay)
> > +                       return;
> > +
> > +               dev_dbg(&pdev->dev, "waiting downstream link for %d ms\n", delay);
> > +
> > +               /*
> > +                * If downstream port does not support speeds greater than
> > +                * 5 GT/s need to wait 100ms. For higher speeds (gen3) we
> > +                * need to wait first for the data link layer to become
> > +                * active.
> > +                */
> > +               if (pcie_get_speed_cap(pdev) <= PCIE_SPEED_5_0GT)
> > +                       msleep(delay);
> > +               else
> > +                       pcie_wait_for_link_delay(pdev, true, delay);
> > +       }
> > +}
> > +
> >  /**
> >   * pcie_port_device_suspend - suspend port services associated with a PCIe port
> >   * @dev: PCI Express port to handle
> > @@ -391,6 +444,13 @@ int pcie_port_device_suspend(struct device *dev)
> >  int pcie_port_device_resume_noirq(struct device *dev)
> >  {
> >         size_t off = offsetof(struct pcie_port_service_driver, resume_noirq);
> > +
> > +       /*
> > +        * Wait for the link to be fully up before resuming port services.
> > +        * This prevents pciehp from starting to tear-down the hierarchy
> > +        * too soon.
> > +        */
> > +       wait_for_downstream_link(to_pci_dev(dev));
> >         return device_for_each_child(dev, &off, pm_iter);
> >  }
> >
> > @@ -421,6 +481,8 @@ int pcie_port_device_runtime_suspend(struct device *dev)
> >  int pcie_port_device_runtime_resume(struct device *dev)
> >  {
> >         size_t off = offsetof(struct pcie_port_service_driver, runtime_resume);
> > +
> 
> The comment from pcie_port_device_resume_noirq() is applicable here
> too, so maybe move it to a common place (kerneldoc?).

Sure.

> 
> > +       wait_for_downstream_link(to_pci_dev(dev));
> >         return device_for_each_child(dev, &off, pm_iter);
> >  }
> >  #endif /* PM */
> > --
> > 2.20.1
> >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources
  2019-06-06  8:54   ` Rafael J. Wysocki
@ 2019-06-06 11:26     ` Mika Westerberg
  2019-06-06 13:44       ` Mika Westerberg
  0 siblings, 1 reply; 23+ messages in thread
From: Mika Westerberg @ 2019-06-06 11:26 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Len Brown, Lukas Wunner,
	Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Thu, Jun 06, 2019 at 10:54:40AM +0200, Rafael J. Wysocki wrote:
> On Wed, Jun 5, 2019 at 4:58 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > Intel Ice Lake has an interated Thunderbolt controller which means that
> 
> integrated

Right :)

> > the PCIe topology is extended directly from the two root ports (RP0 and
> > RP1). Power management is handled by ACPI power resources that are
> > shared between the root ports, Thunderbolt controller (NHI) and xHCI
> > controller.
> >
> > The topology with the power resources (marked with []) looks like:
> >
> >   Host bridge
> >       |
> >       +- RP0 ---\
> >       +- RP1 ---|--+--> [TBT]
> >       +- NHI --/   |
> >       |            |
> >       |            v
> >       +- xHCI --> [D3C]
> >
> > Here TBT and D3C are the shared ACPI power resources. ACPI _PR3() method
> > returns either TBT or D3C or both.
> >
> > Say we runtime suspend first the root ports RP0 and RP1, then NHI. Now
> > since the TBT power resource is still on when the root ports are runtime
> > suspended their dev->current_state is set to D3hot. When NHI is runtime
> > suspended TBT is finally turned off but state of the root ports remain
> > to be D3hot.
> 
> It looks like this problem will affect all ACPI devices using power
> resources and _PR3 in general, so fixing it just for PCI is not
> sufficient IMO.

Fair enough.

> An alternative approach may be to set the state of a device that
> dropped its references to power resources listed in _PR3 to D3cold
> even though those power resources may be physically "on" at that time.
> Everything else (including this patch AFAICS) will be racy this way or
> another.

OK, thanks for the comment. I'll try to look into this approach then.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources
  2019-06-06 11:26     ` Mika Westerberg
@ 2019-06-06 13:44       ` Mika Westerberg
  2019-06-06 14:08         ` Rafael J. Wysocki
  0 siblings, 1 reply; 23+ messages in thread
From: Mika Westerberg @ 2019-06-06 13:44 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Len Brown, Lukas Wunner,
	Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Thu, Jun 06, 2019 at 02:26:40PM +0300, Mika Westerberg wrote:
> On Thu, Jun 06, 2019 at 10:54:40AM +0200, Rafael J. Wysocki wrote:
> > On Wed, Jun 5, 2019 at 4:58 PM Mika Westerberg
> > <mika.westerberg@linux.intel.com> wrote:
> > >
> > > Intel Ice Lake has an interated Thunderbolt controller which means that
> > 
> > integrated
> 
> Right :)
> 
> > > the PCIe topology is extended directly from the two root ports (RP0 and
> > > RP1). Power management is handled by ACPI power resources that are
> > > shared between the root ports, Thunderbolt controller (NHI) and xHCI
> > > controller.
> > >
> > > The topology with the power resources (marked with []) looks like:
> > >
> > >   Host bridge
> > >       |
> > >       +- RP0 ---\
> > >       +- RP1 ---|--+--> [TBT]
> > >       +- NHI --/   |
> > >       |            |
> > >       |            v
> > >       +- xHCI --> [D3C]
> > >
> > > Here TBT and D3C are the shared ACPI power resources. ACPI _PR3() method
> > > returns either TBT or D3C or both.
> > >
> > > Say we runtime suspend first the root ports RP0 and RP1, then NHI. Now
> > > since the TBT power resource is still on when the root ports are runtime
> > > suspended their dev->current_state is set to D3hot. When NHI is runtime
> > > suspended TBT is finally turned off but state of the root ports remain
> > > to be D3hot.
> > 
> > It looks like this problem will affect all ACPI devices using power
> > resources and _PR3 in general, so fixing it just for PCI is not
> > sufficient IMO.
> 
> Fair enough.
> 
> > An alternative approach may be to set the state of a device that
> > dropped its references to power resources listed in _PR3 to D3cold
> > even though those power resources may be physically "on" at that time.
> > Everything else (including this patch AFAICS) will be racy this way or
> > another.
> 
> OK, thanks for the comment. I'll try to look into this approach then.

One additional question.

How about the other direction when shared power resource(s) gets turned
on? We would need to wake up all the sharing devices so that their state
gets restored back from D0uninitialized.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources
  2019-06-06 13:44       ` Mika Westerberg
@ 2019-06-06 14:08         ` Rafael J. Wysocki
  2019-06-06 14:17           ` Mika Westerberg
  0 siblings, 1 reply; 23+ messages in thread
From: Rafael J. Wysocki @ 2019-06-06 14:08 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Rafael J. Wysocki, Bjorn Helgaas, Rafael J. Wysocki, Len Brown,
	Lukas Wunner, Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Thu, Jun 6, 2019 at 3:44 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> On Thu, Jun 06, 2019 at 02:26:40PM +0300, Mika Westerberg wrote:
> > On Thu, Jun 06, 2019 at 10:54:40AM +0200, Rafael J. Wysocki wrote:
> > > On Wed, Jun 5, 2019 at 4:58 PM Mika Westerberg
> > > <mika.westerberg@linux.intel.com> wrote:
> > > >
> > > > Intel Ice Lake has an interated Thunderbolt controller which means that
> > >
> > > integrated
> >
> > Right :)
> >
> > > > the PCIe topology is extended directly from the two root ports (RP0 and
> > > > RP1). Power management is handled by ACPI power resources that are
> > > > shared between the root ports, Thunderbolt controller (NHI) and xHCI
> > > > controller.
> > > >
> > > > The topology with the power resources (marked with []) looks like:
> > > >
> > > >   Host bridge
> > > >       |
> > > >       +- RP0 ---\
> > > >       +- RP1 ---|--+--> [TBT]
> > > >       +- NHI --/   |
> > > >       |            |
> > > >       |            v
> > > >       +- xHCI --> [D3C]
> > > >
> > > > Here TBT and D3C are the shared ACPI power resources. ACPI _PR3() method
> > > > returns either TBT or D3C or both.
> > > >
> > > > Say we runtime suspend first the root ports RP0 and RP1, then NHI. Now
> > > > since the TBT power resource is still on when the root ports are runtime
> > > > suspended their dev->current_state is set to D3hot. When NHI is runtime
> > > > suspended TBT is finally turned off but state of the root ports remain
> > > > to be D3hot.
> > >
> > > It looks like this problem will affect all ACPI devices using power
> > > resources and _PR3 in general, so fixing it just for PCI is not
> > > sufficient IMO.
> >
> > Fair enough.
> >
> > > An alternative approach may be to set the state of a device that
> > > dropped its references to power resources listed in _PR3 to D3cold
> > > even though those power resources may be physically "on" at that time.
> > > Everything else (including this patch AFAICS) will be racy this way or
> > > another.
> >
> > OK, thanks for the comment. I'll try to look into this approach then.
>
> One additional question.
>
> How about the other direction when shared power resource(s) gets turned
> on? We would need to wake up all the sharing devices so that their state
> gets restored back from D0uninitialized.

That isn't necessary IMO as long as the device are not accessed.  If
the kernel thinks that a given device is in D3cold and doesn't access
it, then it really doesn't matter too much what state the device is in
physically.  On the first access the device should be reinitialized
anyway.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources
  2019-06-06 14:08         ` Rafael J. Wysocki
@ 2019-06-06 14:17           ` Mika Westerberg
  2019-06-06 14:27             ` Rafael J. Wysocki
  0 siblings, 1 reply; 23+ messages in thread
From: Mika Westerberg @ 2019-06-06 14:17 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Len Brown, Lukas Wunner,
	Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Thu, Jun 06, 2019 at 04:08:11PM +0200, Rafael J. Wysocki wrote:
> That isn't necessary IMO as long as the device are not accessed.  If
> the kernel thinks that a given device is in D3cold and doesn't access
> it, then it really doesn't matter too much what state the device is in
> physically.  On the first access the device should be reinitialized
> anyway.

But if the device is configured to wake. For example when it detects a
hotplug that state is gone when it goes to D0unitialized.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources
  2019-06-06 14:17           ` Mika Westerberg
@ 2019-06-06 14:27             ` Rafael J. Wysocki
  2019-06-06 14:36               ` Mika Westerberg
  2019-06-09 18:58               ` Lukas Wunner
  0 siblings, 2 replies; 23+ messages in thread
From: Rafael J. Wysocki @ 2019-06-06 14:27 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Rafael J. Wysocki, Bjorn Helgaas, Rafael J. Wysocki, Len Brown,
	Lukas Wunner, Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Thu, Jun 6, 2019 at 4:17 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> On Thu, Jun 06, 2019 at 04:08:11PM +0200, Rafael J. Wysocki wrote:
> > That isn't necessary IMO as long as the device are not accessed.  If
> > the kernel thinks that a given device is in D3cold and doesn't access
> > it, then it really doesn't matter too much what state the device is in
> > physically.  On the first access the device should be reinitialized
> > anyway.
>
> But if the device is configured to wake. For example when it detects a
> hotplug that state is gone when it goes to D0unitialized.

For this we'll need a pm_runtime_resume() of the dependent device on
the resource going "on".

That means we need a list of devices to resume when the resource goes
"on" after being taken "off".

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources
  2019-06-06 14:27             ` Rafael J. Wysocki
@ 2019-06-06 14:36               ` Mika Westerberg
  2019-06-12 22:38                 ` Rafael J. Wysocki
  2019-06-09 18:58               ` Lukas Wunner
  1 sibling, 1 reply; 23+ messages in thread
From: Mika Westerberg @ 2019-06-06 14:36 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Len Brown, Lukas Wunner,
	Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Thu, Jun 06, 2019 at 04:27:21PM +0200, Rafael J. Wysocki wrote:
> On Thu, Jun 6, 2019 at 4:17 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > On Thu, Jun 06, 2019 at 04:08:11PM +0200, Rafael J. Wysocki wrote:
> > > That isn't necessary IMO as long as the device are not accessed.  If
> > > the kernel thinks that a given device is in D3cold and doesn't access
> > > it, then it really doesn't matter too much what state the device is in
> > > physically.  On the first access the device should be reinitialized
> > > anyway.
> >
> > But if the device is configured to wake. For example when it detects a
> > hotplug that state is gone when it goes to D0unitialized.
> 
> For this we'll need a pm_runtime_resume() of the dependent device on
> the resource going "on".
> 
> That means we need a list of devices to resume when the resource goes
> "on" after being taken "off".

OK, thanks.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] PCI: Do not poll for PME if the device is in D3cold
  2019-06-05 14:58 ` [PATCH 2/3] PCI: Do not poll for PME if the device is in D3cold Mika Westerberg
  2019-06-05 19:05   ` Lukas Wunner
@ 2019-06-09 18:38   ` Lukas Wunner
  2019-06-10 11:35   ` Rafael J. Wysocki
  2 siblings, 0 replies; 23+ messages in thread
From: Lukas Wunner @ 2019-06-09 18:38 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Len Brown, Keith Busch,
	Alex Williamson, Alexandru Gagniuc, linux-acpi, linux-pci

On Wed, Jun 05, 2019 at 05:58:19PM +0300, Mika Westerberg wrote:
> PME polling does not take into account that a device that is directly
> connected to the host bridge may go into D3cold as well. This leads to a
> situation where the PME poll thread reads from a config space of a
> device that is in D3cold and gets incorrect information because the
> config space is not accessible.
> 
> Here is an example from Intel Ice Lake system where two PCIe root ports
> are in D3cold (I've instrumented the kernel to log the PMCSR register
> contents):
> 
>   [   62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
>   [   62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff
> 
> Since 0xffff is interpreted so that PME is pending, the root ports will
> be runtime resumed. This repeats over and over again essentially
> blocking all runtime power management.
> 
> Prevent this from happening by checking whether the device is in D3cold
> before its PME status is read.
> 
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>

Reviewed-by: Lukas Wunner <lukas@wunner.de>
Fixes: 71a83bd727cc ("PCI/PM: add runtime PM support to PCIe port")
Cc: stable@vger.kernel.org # v3.6+

Although the patch I've posted today (which checks for an "all ones" read
from config space) covers the issue fixed herein, your patch still makes
sense to avoid unnecessarily accessing config space in the first place.

Thanks,

Lukas

> ---
>  drivers/pci/pci.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 87a1f902fa8e..720da09d4d73 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2060,6 +2060,13 @@ static void pci_pme_list_scan(struct work_struct *work)
>  			 */
>  			if (bridge && bridge->current_state != PCI_D0)
>  				continue;
> +			/*
> +			 * If the device is in D3cold it should not be
> +			 * polled either.
> +			 */
> +			if (pme_dev->dev->current_state == PCI_D3cold)
> +				continue;
> +
>  			pci_pme_wakeup(pme_dev->dev, NULL);
>  		} else {
>  			list_del(&pme_dev->list);
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources
  2019-06-06 14:27             ` Rafael J. Wysocki
  2019-06-06 14:36               ` Mika Westerberg
@ 2019-06-09 18:58               ` Lukas Wunner
  2019-06-10 10:57                 ` Rafael J. Wysocki
  1 sibling, 1 reply; 23+ messages in thread
From: Lukas Wunner @ 2019-06-09 18:58 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Mika Westerberg, Bjorn Helgaas, Rafael J. Wysocki, Len Brown,
	Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Thu, Jun 06, 2019 at 04:27:21PM +0200, Rafael J. Wysocki wrote:
> On Thu, Jun 6, 2019 at 4:17 PM Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> > On Thu, Jun 06, 2019 at 04:08:11PM +0200, Rafael J. Wysocki wrote:
> > > That isn't necessary IMO as long as the device are not accessed.  If
> > > the kernel thinks that a given device is in D3cold and doesn't access
> > > it, then it really doesn't matter too much what state the device is in
> > > physically.  On the first access the device should be reinitialized
> > > anyway.
> >
> > But if the device is configured to wake. For example when it detects a
> > hotplug that state is gone when it goes to D0unitialized.
> 
> For this we'll need a pm_runtime_resume() of the dependent device on
> the resource going "on".
> 
> That means we need a list of devices to resume when the resource goes
> "on" after being taken "off".

An idea would be to model every ACPI power resource as a struct device
and automatically set up a device link from the devices using that
power resource (consumers).  After all dependent devices runtime suspend,
the power resource "device" runtime suspends by turning itself off
(and updating the PCI current_state of dependent devices to D3cold).
When the power resource runtime resumes, it schedules a runtime resume
of all dependent devices.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources
  2019-06-09 18:58               ` Lukas Wunner
@ 2019-06-10 10:57                 ` Rafael J. Wysocki
  0 siblings, 0 replies; 23+ messages in thread
From: Rafael J. Wysocki @ 2019-06-10 10:57 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Rafael J. Wysocki, Mika Westerberg, Bjorn Helgaas, Len Brown,
	Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Sunday, June 9, 2019 8:58:35 PM CEST Lukas Wunner wrote:
> On Thu, Jun 06, 2019 at 04:27:21PM +0200, Rafael J. Wysocki wrote:
> > On Thu, Jun 6, 2019 at 4:17 PM Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> > > On Thu, Jun 06, 2019 at 04:08:11PM +0200, Rafael J. Wysocki wrote:
> > > > That isn't necessary IMO as long as the device are not accessed.  If
> > > > the kernel thinks that a given device is in D3cold and doesn't access
> > > > it, then it really doesn't matter too much what state the device is in
> > > > physically.  On the first access the device should be reinitialized
> > > > anyway.
> > >
> > > But if the device is configured to wake. For example when it detects a
> > > hotplug that state is gone when it goes to D0unitialized.
> > 
> > For this we'll need a pm_runtime_resume() of the dependent device on
> > the resource going "on".
> > 
> > That means we need a list of devices to resume when the resource goes
> > "on" after being taken "off".
> 
> An idea would be to model every ACPI power resource as a struct device
> and automatically set up a device link from the devices using that
> power resource (consumers).  After all dependent devices runtime suspend,
> the power resource "device" runtime suspends by turning itself off
> (and updating the PCI current_state of dependent devices to D3cold).
> When the power resource runtime resumes, it schedules a runtime resume
> of all dependent devices.

The sharing of power resources is covered already.  That's not the problem here.

The missing part is the runtime resume of dependent devices and I'm not even sure
if it needs to be done in general or for PCI devices only.  At least it doesn't need to be
done for devices that are not configured for wakeup, even on a PCI bus.

Thanks,
Rafael




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] PCI: Do not poll for PME if the device is in D3cold
  2019-06-05 14:58 ` [PATCH 2/3] PCI: Do not poll for PME if the device is in D3cold Mika Westerberg
  2019-06-05 19:05   ` Lukas Wunner
  2019-06-09 18:38   ` Lukas Wunner
@ 2019-06-10 11:35   ` Rafael J. Wysocki
  2 siblings, 0 replies; 23+ messages in thread
From: Rafael J. Wysocki @ 2019-06-10 11:35 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Bjorn Helgaas, Len Brown, Lukas Wunner, Keith Busch,
	Alex Williamson, Alexandru Gagniuc, linux-acpi, linux-pci

On Wednesday, June 5, 2019 4:58:19 PM CEST Mika Westerberg wrote:
> PME polling does not take into account that a device that is directly
> connected to the host bridge may go into D3cold as well. This leads to a
> situation where the PME poll thread reads from a config space of a
> device that is in D3cold and gets incorrect information because the
> config space is not accessible.
> 
> Here is an example from Intel Ice Lake system where two PCIe root ports
> are in D3cold (I've instrumented the kernel to log the PMCSR register
> contents):
> 
>   [   62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
>   [   62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff
> 
> Since 0xffff is interpreted so that PME is pending, the root ports will
> be runtime resumed. This repeats over and over again essentially
> blocking all runtime power management.
> 
> Prevent this from happening by checking whether the device is in D3cold
> before its PME status is read.
> 
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  drivers/pci/pci.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 87a1f902fa8e..720da09d4d73 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2060,6 +2060,13 @@ static void pci_pme_list_scan(struct work_struct *work)
>  			 */
>  			if (bridge && bridge->current_state != PCI_D0)
>  				continue;
> +			/*
> +			 * If the device is in D3cold it should not be
> +			 * polled either.
> +			 */
> +			if (pme_dev->dev->current_state == PCI_D3cold)
> +				continue;
> +
>  			pci_pme_wakeup(pme_dev->dev, NULL);
>  		} else {
>  			list_del(&pme_dev->list);
> 





^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources
  2019-06-06 14:36               ` Mika Westerberg
@ 2019-06-12 22:38                 ` Rafael J. Wysocki
  2019-06-13 12:52                   ` Mika Westerberg
  0 siblings, 1 reply; 23+ messages in thread
From: Rafael J. Wysocki @ 2019-06-12 22:38 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Rafael J. Wysocki, Bjorn Helgaas, Len Brown, Lukas Wunner,
	Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Thursday, June 6, 2019 4:36:06 PM CEST Mika Westerberg wrote:
> On Thu, Jun 06, 2019 at 04:27:21PM +0200, Rafael J. Wysocki wrote:
> > On Thu, Jun 6, 2019 at 4:17 PM Mika Westerberg
> > <mika.westerberg@linux.intel.com> wrote:
> > >
> > > On Thu, Jun 06, 2019 at 04:08:11PM +0200, Rafael J. Wysocki wrote:
> > > > That isn't necessary IMO as long as the device are not accessed.  If
> > > > the kernel thinks that a given device is in D3cold and doesn't access
> > > > it, then it really doesn't matter too much what state the device is in
> > > > physically.  On the first access the device should be reinitialized
> > > > anyway.
> > >
> > > But if the device is configured to wake. For example when it detects a
> > > hotplug that state is gone when it goes to D0unitialized.
> > 
> > For this we'll need a pm_runtime_resume() of the dependent device on
> > the resource going "on".
> > 
> > That means we need a list of devices to resume when the resource goes
> > "on" after being taken "off".
> 
> OK, thanks.

Basically, at the pci_acpi_setup() time dev and adev need to be passed to a function that
will add dev as a "dependent device" for each of the power resources in the adev's D0
list.

Next whenever a power resource with a list of "dependent devices" goes _ON successfully,
pm_request_resume() needs to be called for each device in that list.

Finally, at the pci_acpi_cleanup() time, dev needs to be removed from the lists of
"dependent devices" for all power resources in its ACPI companion's D0 list.

At least that's how I see that.




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources
  2019-06-12 22:38                 ` Rafael J. Wysocki
@ 2019-06-13 12:52                   ` Mika Westerberg
  2019-06-13 13:51                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 23+ messages in thread
From: Mika Westerberg @ 2019-06-13 12:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Bjorn Helgaas, Len Brown, Lukas Wunner,
	Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Thu, Jun 13, 2019 at 12:38:50AM +0200, Rafael J. Wysocki wrote:
> Basically, at the pci_acpi_setup() time dev and adev need to be passed to a function that
> will add dev as a "dependent device" for each of the power resources in the adev's D0
> list.
> 
> Next whenever a power resource with a list of "dependent devices" goes _ON successfully,
> pm_request_resume() needs to be called for each device in that list.
> 
> Finally, at the pci_acpi_cleanup() time, dev needs to be removed from the lists of
> "dependent devices" for all power resources in its ACPI companion's D0 list.
> 
> At least that's how I see that.

Thanks for the suggestion. This seems to make it work only for PCI
devices, though. Is that the intention?

I went for the all ACPI devices path instead where we add all devices
sharing the power resource as "consumers" for that resource. I haven't
fully tested the approach yet but the draft patch is below. I have no
issues doing what you say above, though :)

----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----

diff --git a/drivers/acpi/internal.h b/drivers/acpi/internal.h
index f6157d4d637a..e840299c3293 100644
--- a/drivers/acpi/internal.h
+++ b/drivers/acpi/internal.h
@@ -127,10 +127,10 @@ int __acpi_device_uevent_modalias(struct acpi_device *adev,
                                   Power Resource
    -------------------------------------------------------------------------- */
 int acpi_power_init(void);
-void acpi_power_resources_list_free(struct list_head *list);
-int acpi_extract_power_resources(union acpi_object *package, unsigned int start,
-				 struct list_head *list);
-int acpi_add_power_resource(acpi_handle handle);
+void acpi_power_resources_list_free(struct acpi_device *device, struct list_head *list);
+int acpi_extract_power_resources(struct acpi_device *device,
+	union acpi_object *package, unsigned int start, struct list_head *list);
+int acpi_add_power_resource(struct acpi_device *adev, acpi_handle handle);
 void acpi_power_add_remove_device(struct acpi_device *adev, bool add);
 int acpi_power_wakeup_list_init(struct list_head *list, int *system_level);
 int acpi_device_sleep_wake(struct acpi_device *dev,
diff --git a/drivers/acpi/power.c b/drivers/acpi/power.c
index a916417b9e70..31817f931381 100644
--- a/drivers/acpi/power.c
+++ b/drivers/acpi/power.c
@@ -45,6 +45,8 @@ ACPI_MODULE_NAME("power");
 struct acpi_power_resource {
 	struct acpi_device device;
 	struct list_head list_node;
+	struct list_head consumers;
+	struct mutex consumer_lock;
 	char *name;
 	u32 system_level;
 	u32 order;
@@ -58,6 +60,11 @@ struct acpi_power_resource_entry {
 	struct acpi_power_resource *resource;
 };
 
+struct acpi_power_resource_consumer {
+	struct list_head node;
+	struct acpi_device *consumer;
+};
+
 static LIST_HEAD(acpi_power_resource_list);
 static DEFINE_MUTEX(power_resource_list_lock);
 
@@ -81,6 +88,111 @@ static struct acpi_power_resource *acpi_power_get_context(acpi_handle handle)
 	return to_power_resource(device);
 }
 
+static int acpi_power_add_consumer(struct acpi_power_resource *resource,
+				   struct acpi_device *device)
+{
+	struct acpi_power_resource_consumer *consumer;
+	int ret = 0;
+
+	if (!device)
+		return 0;
+
+	mutex_lock(&resource->consumer_lock);
+
+	list_for_each_entry(consumer, &resource->consumers, node) {
+		/* Don't add it twice */
+		if (consumer->consumer == device)
+			goto unlock;
+	}
+
+	consumer = kzalloc(sizeof(*consumer), GFP_KERNEL);
+	if (!consumer) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+
+	consumer->consumer = device;
+	list_add_tail(&consumer->node, &resource->consumers);
+
+	acpi_handle_debug(device->handle, "added dependency to %s\n",
+			  resource->name);
+
+unlock:
+	mutex_unlock(&resource->consumer_lock);
+	return ret;
+}
+
+static void acpi_power_remove_consumer(struct acpi_power_resource *resource,
+				       struct acpi_device *device)
+{
+	struct acpi_power_resource_consumer *consumer;
+
+	mutex_lock(&resource->consumer_lock);
+	list_for_each_entry(consumer, &resource->consumers, node) {
+		if (consumer->consumer == device) {
+			list_del(&consumer->node);
+			kfree(consumer);
+
+			acpi_handle_debug(device->handle,
+					  "removed dependency to %s\n",
+					  resource->name);
+			break;
+		}
+	}
+	mutex_unlock(&resource->consumer_lock);
+}
+
+static void
+acpi_power_resume_consumer(const struct acpi_power_resource *resource,
+			  struct acpi_device *device)
+{
+	struct device *dev;
+
+	/*
+	 * If the device is prepared to wake we need to resume it now so
+	 * that the driver can re-program it to do so. For non-wake devices
+	 * we can leave them as is. The driver then restores the device
+	 * when it is needed next time.
+	 */
+	if (!acpi_device_can_wakeup(device) || !device->wakeup.prepare_count)
+		return;
+
+	dev = acpi_get_first_physical_node(device);
+	if (dev) {
+		acpi_handle_debug(device->handle,
+				  "resuming %s because %s was turned on\n",
+				  dev_name(dev), resource->name);
+		pm_runtime_resume(dev);
+	}
+}
+
+static int acpi_power_resume_consumers(struct acpi_power_resource *resource,
+				       struct acpi_device *device)
+{
+	struct acpi_power_resource_consumer *consumer;
+	int ret = 0;
+
+	mutex_lock(&resource->consumer_lock);
+
+	list_for_each_entry(consumer, &resource->consumers, node) {
+		struct acpi_device *adev = consumer->consumer;
+		int state, ret;
+
+		/* Skip the device that originated the power on request */
+		if (adev == device)
+			continue;
+
+		ret = acpi_power_get_inferred_state(consumer->consumer, &state);
+		if (!ret && adev->power.state > ACPI_STATE_D0 &&
+		    state == ACPI_STATE_D0) {
+			acpi_power_resume_consumer(resource, adev);
+		}
+	}
+
+	mutex_unlock(&resource->consumer_lock);
+	return ret;
+}
+
 static int acpi_power_resources_list_add(acpi_handle handle,
 					 struct list_head *list)
 {
@@ -108,12 +220,14 @@ static int acpi_power_resources_list_add(acpi_handle handle,
 	return 0;
 }
 
-void acpi_power_resources_list_free(struct list_head *list)
+void acpi_power_resources_list_free(struct acpi_device *device, struct list_head *list)
 {
 	struct acpi_power_resource_entry *entry, *e;
 
 	list_for_each_entry_safe(entry, e, list, node) {
 		list_del(&entry->node);
+		if (device)
+			acpi_power_remove_consumer(entry->resource, device);
 		kfree(entry);
 	}
 }
@@ -135,8 +249,8 @@ static bool acpi_power_resource_is_dup(union acpi_object *package,
 	return false;
 }
 
-int acpi_extract_power_resources(union acpi_object *package, unsigned int start,
-				 struct list_head *list)
+int acpi_extract_power_resources(struct acpi_device *device,
+	union acpi_object *package, unsigned int start, struct list_head *list)
 {
 	unsigned int i;
 	int err = 0;
@@ -159,7 +273,7 @@ int acpi_extract_power_resources(union acpi_object *package, unsigned int start,
 		if (acpi_power_resource_is_dup(package, start, i))
 			continue;
 
-		err = acpi_add_power_resource(rhandle);
+		err = acpi_add_power_resource(device, rhandle);
 		if (err)
 			break;
 
@@ -168,7 +282,7 @@ int acpi_extract_power_resources(union acpi_object *package, unsigned int start,
 			break;
 	}
 	if (err)
-		acpi_power_resources_list_free(list);
+		acpi_power_resources_list_free(device, list);
 
 	return err;
 }
@@ -258,18 +372,33 @@ static int acpi_power_on_unlocked(struct acpi_power_resource *resource)
 		result = __acpi_power_on(resource);
 		if (result)
 			resource->ref_count--;
+		else
+			result = 1;
 	}
 	return result;
 }
 
-static int acpi_power_on(struct acpi_power_resource *resource)
+static int acpi_power_on(struct acpi_device *device,
+			 struct acpi_power_resource *resource)
 {
 	int result;
 
 	mutex_lock(&resource->resource_lock);
 	result = acpi_power_on_unlocked(resource);
 	mutex_unlock(&resource->resource_lock);
-	return result;
+
+	if (result <= 0)
+		return result;
+
+	/*
+	 * The power resource was physically turned on. Because of this
+	 * some of the devices sharing it may have been transitioned into
+	 * D0 so we need to runtime resume them to make sure their driver
+	 * re-initializes them properly. This is important for PCI devices
+	 * that go into D0uninitialized and lose their wakeup settings
+	 * otherwise.
+	 */
+	return acpi_power_resume_consumers(resource, device);
 }
 
 static int __acpi_power_off(struct acpi_power_resource *resource)
@@ -319,7 +448,7 @@ static int acpi_power_off(struct acpi_power_resource *resource)
 	return result;
 }
 
-static int acpi_power_off_list(struct list_head *list)
+static int acpi_power_off_list(struct acpi_device *device, struct list_head *list)
 {
 	struct acpi_power_resource_entry *entry;
 	int result = 0;
@@ -333,18 +462,18 @@ static int acpi_power_off_list(struct list_head *list)
 
  err:
 	list_for_each_entry_continue(entry, list, node)
-		acpi_power_on(entry->resource);
+		acpi_power_on(device, entry->resource);
 
 	return result;
 }
 
-static int acpi_power_on_list(struct list_head *list)
+static int acpi_power_on_list(struct acpi_device *device, struct list_head *list)
 {
 	struct acpi_power_resource_entry *entry;
 	int result = 0;
 
 	list_for_each_entry(entry, list, node) {
-		result = acpi_power_on(entry->resource);
+		result = acpi_power_on(device, entry->resource);
 		if (result)
 			goto err;
 	}
@@ -582,7 +711,7 @@ int acpi_enable_wakeup_device_power(struct acpi_device *dev, int sleep_state)
 
 		if (!resource->wakeup_enabled) {
 			err = acpi_power_on_unlocked(resource);
-			if (!err)
+			if (err >= 0)
 				resource->wakeup_enabled = true;
 		}
 
@@ -703,7 +832,7 @@ int acpi_power_on_resources(struct acpi_device *device, int state)
 	if (!device || state < ACPI_STATE_D0 || state > ACPI_STATE_D3_HOT)
 		return -EINVAL;
 
-	return acpi_power_on_list(&device->power.states[state].resources);
+	return acpi_power_on_list(device, &device->power.states[state].resources);
 }
 
 int acpi_power_transition(struct acpi_device *device, int state)
@@ -726,11 +855,11 @@ int acpi_power_transition(struct acpi_device *device, int state)
 	 * we dereference all power resources used in the current list.
 	 */
 	if (state < ACPI_STATE_D3_COLD)
-		result = acpi_power_on_list(
+		result = acpi_power_on_list(device,
 			&device->power.states[state].resources);
 
 	if (!result && device->power.state < ACPI_STATE_D3_COLD)
-		acpi_power_off_list(
+		acpi_power_off_list(device,
 			&device->power.states[device->power.state].resources);
 
 	/* We shouldn't change the state unless the above operations succeed. */
@@ -788,7 +917,7 @@ static void acpi_power_add_resource_to_list(struct acpi_power_resource *resource
 	mutex_unlock(&power_resource_list_lock);
 }
 
-int acpi_add_power_resource(acpi_handle handle)
+int acpi_add_power_resource(struct acpi_device *adev, acpi_handle handle)
 {
 	struct acpi_power_resource *resource;
 	struct acpi_device *device = NULL;
@@ -798,8 +927,10 @@ int acpi_add_power_resource(acpi_handle handle)
 	int state, result = -ENODEV;
 
 	acpi_bus_get_device(handle, &device);
-	if (device)
-		return 0;
+	if (device) {
+		resource = to_power_resource(device);
+		goto add_consumer;
+	}
 
 	resource = kzalloc(sizeof(*resource), GFP_KERNEL);
 	if (!resource)
@@ -810,6 +941,8 @@ int acpi_add_power_resource(acpi_handle handle)
 				ACPI_STA_DEFAULT);
 	mutex_init(&resource->resource_lock);
 	INIT_LIST_HEAD(&resource->list_node);
+	INIT_LIST_HEAD(&resource->consumers);
+	mutex_init(&resource->consumer_lock);
 	resource->name = device->pnp.bus_id;
 	strcpy(acpi_device_name(device), ACPI_POWER_DEVICE_NAME);
 	strcpy(acpi_device_class(device), ACPI_POWER_CLASS);
@@ -840,7 +973,11 @@ int acpi_add_power_resource(acpi_handle handle)
 
 	acpi_power_add_resource_to_list(resource);
 	acpi_device_add_finalize(device);
-	return 0;
+
+ add_consumer:
+	result = acpi_power_add_consumer(resource, adev);
+	if (!result)
+		return 0;
 
  err:
 	acpi_release_power_resource(&device->dev);
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 6153030451eb..3af0abe5b5e2 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -451,14 +451,14 @@ static void acpi_free_power_resources_lists(struct acpi_device *device)
 	int i;
 
 	if (device->wakeup.flags.valid)
-		acpi_power_resources_list_free(&device->wakeup.resources);
+		acpi_power_resources_list_free(NULL, &device->wakeup.resources);
 
 	if (!device->power.flags.power_resources)
 		return;
 
 	for (i = ACPI_STATE_D0; i <= ACPI_STATE_D3_HOT; i++) {
 		struct acpi_device_power_state *ps = &device->power.states[i];
-		acpi_power_resources_list_free(&ps->resources);
+		acpi_power_resources_list_free(device, &ps->resources);
 	}
 }
 
@@ -816,7 +816,7 @@ static int acpi_bus_extract_wakeup_device_power_package(struct acpi_device *dev)
 
 	wakeup->sleep_state = element->integer.value;
 
-	err = acpi_extract_power_resources(package, 2, &wakeup->resources);
+	err = acpi_extract_power_resources(NULL, package, 2, &wakeup->resources);
 	if (err)
 		goto out;
 
@@ -828,7 +828,7 @@ static int acpi_bus_extract_wakeup_device_power_package(struct acpi_device *dev)
 		if (err) {
 			acpi_handle_warn(handle, "Retrieving current states "
 					 "of wakeup power resources failed\n");
-			acpi_power_resources_list_free(&wakeup->resources);
+			acpi_power_resources_list_free(NULL, &wakeup->resources);
 			goto out;
 		}
 		if (sleep_state < wakeup->sleep_state) {
@@ -920,7 +920,7 @@ static void acpi_bus_init_power_state(struct acpi_device *device, int state)
 		if (buffer.length && package
 		    && package->type == ACPI_TYPE_PACKAGE
 		    && package->package.count) {
-			int err = acpi_extract_power_resources(package, 0,
+			int err = acpi_extract_power_resources(device, package, 0,
 							       &ps->resources);
 			if (!err)
 				device->power.flags.power_resources = 1;
@@ -1867,7 +1867,7 @@ static acpi_status acpi_bus_check_add(acpi_handle handle, u32 lvl_not_used,
 		return AE_OK;
 
 	if (type == ACPI_BUS_TYPE_POWER) {
-		acpi_add_power_resource(handle);
+		acpi_add_power_resource(NULL, handle);
 		return AE_OK;
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources
  2019-06-13 12:52                   ` Mika Westerberg
@ 2019-06-13 13:51                     ` Rafael J. Wysocki
  2019-06-13 14:27                       ` Mika Westerberg
  0 siblings, 1 reply; 23+ messages in thread
From: Rafael J. Wysocki @ 2019-06-13 13:51 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Bjorn Helgaas, Len Brown,
	Lukas Wunner, Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Thu, Jun 13, 2019 at 2:52 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> On Thu, Jun 13, 2019 at 12:38:50AM +0200, Rafael J. Wysocki wrote:
> > Basically, at the pci_acpi_setup() time dev and adev need to be passed to a function that
> > will add dev as a "dependent device" for each of the power resources in the adev's D0
> > list.
> >
> > Next whenever a power resource with a list of "dependent devices" goes _ON successfully,
> > pm_request_resume() needs to be called for each device in that list.
> >
> > Finally, at the pci_acpi_cleanup() time, dev needs to be removed from the lists of
> > "dependent devices" for all power resources in its ACPI companion's D0 list.
> >
> > At least that's how I see that.
>
> Thanks for the suggestion. This seems to make it work only for PCI
> devices, though. Is that the intention?

I'm not sure to what extent the D0-uninitialized concept applies to
non-PCI devices.  It may just not be necessary to do this for non-PCI
device in general.

> I went for the all ACPI devices path instead where we add all devices
> sharing the power resource as "consumers" for that resource. I haven't
> fully tested the approach yet but the draft patch is below. I have no
> issues doing what you say above, though :)
>
> ----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----
>
> diff --git a/drivers/acpi/internal.h b/drivers/acpi/internal.h
> index f6157d4d637a..e840299c3293 100644
> --- a/drivers/acpi/internal.h
> +++ b/drivers/acpi/internal.h
> @@ -127,10 +127,10 @@ int __acpi_device_uevent_modalias(struct acpi_device *adev,
>                                    Power Resource
>     -------------------------------------------------------------------------- */
>  int acpi_power_init(void);
> -void acpi_power_resources_list_free(struct list_head *list);
> -int acpi_extract_power_resources(union acpi_object *package, unsigned int start,
> -                                struct list_head *list);
> -int acpi_add_power_resource(acpi_handle handle);
> +void acpi_power_resources_list_free(struct acpi_device *device, struct list_head *list);
> +int acpi_extract_power_resources(struct acpi_device *device,
> +       union acpi_object *package, unsigned int start, struct list_head *list);
> +int acpi_add_power_resource(struct acpi_device *adev, acpi_handle handle);
>  void acpi_power_add_remove_device(struct acpi_device *adev, bool add);
>  int acpi_power_wakeup_list_init(struct list_head *list, int *system_level);
>  int acpi_device_sleep_wake(struct acpi_device *dev,
> diff --git a/drivers/acpi/power.c b/drivers/acpi/power.c
> index a916417b9e70..31817f931381 100644
> --- a/drivers/acpi/power.c
> +++ b/drivers/acpi/power.c
> @@ -45,6 +45,8 @@ ACPI_MODULE_NAME("power");
>  struct acpi_power_resource {
>         struct acpi_device device;
>         struct list_head list_node;
> +       struct list_head consumers;
> +       struct mutex consumer_lock;

Why do you need this extra lock?

>         char *name;
>         u32 system_level;
>         u32 order;
> @@ -58,6 +60,11 @@ struct acpi_power_resource_entry {
>         struct acpi_power_resource *resource;
>  };
>
> +struct acpi_power_resource_consumer {
> +       struct list_head node;
> +       struct acpi_device *consumer;

I would put the "physical" device pointer here.

> +};
> +
>  static LIST_HEAD(acpi_power_resource_list);
>  static DEFINE_MUTEX(power_resource_list_lock);
>
> @@ -81,6 +88,111 @@ static struct acpi_power_resource *acpi_power_get_context(acpi_handle handle)
>         return to_power_resource(device);
>  }
>
> +static int acpi_power_add_consumer(struct acpi_power_resource *resource,
> +                                  struct acpi_device *device)
> +{
> +       struct acpi_power_resource_consumer *consumer;
> +       int ret = 0;
> +
> +       if (!device)
> +               return 0;
> +
> +       mutex_lock(&resource->consumer_lock);

All of this could be done under the resource mutex instead, I think.

> +
> +       list_for_each_entry(consumer, &resource->consumers, node) {
> +               /* Don't add it twice */
> +               if (consumer->consumer == device)
> +                       goto unlock;
> +       }
> +
> +       consumer = kzalloc(sizeof(*consumer), GFP_KERNEL);
> +       if (!consumer) {
> +               ret = -ENOMEM;
> +               goto unlock;
> +       }
> +
> +       consumer->consumer = device;
> +       list_add_tail(&consumer->node, &resource->consumers);
> +
> +       acpi_handle_debug(device->handle, "added dependency to %s\n",
> +                         resource->name);
> +
> +unlock:
> +       mutex_unlock(&resource->consumer_lock);
> +       return ret;
> +}
> +
> +static void acpi_power_remove_consumer(struct acpi_power_resource *resource,
> +                                      struct acpi_device *device)
> +{
> +       struct acpi_power_resource_consumer *consumer;
> +
> +       mutex_lock(&resource->consumer_lock);
> +       list_for_each_entry(consumer, &resource->consumers, node) {
> +               if (consumer->consumer == device) {
> +                       list_del(&consumer->node);
> +                       kfree(consumer);
> +
> +                       acpi_handle_debug(device->handle,
> +                                         "removed dependency to %s\n",
> +                                         resource->name);
> +                       break;
> +               }
> +       }
> +       mutex_unlock(&resource->consumer_lock);
> +}
> +
> +static void
> +acpi_power_resume_consumer(const struct acpi_power_resource *resource,
> +                         struct acpi_device *device)
> +{
> +       struct device *dev;
> +
> +       /*
> +        * If the device is prepared to wake we need to resume it now so
> +        * that the driver can re-program it to do so. For non-wake devices
> +        * we can leave them as is. The driver then restores the device
> +        * when it is needed next time.
> +        */
> +       if (!acpi_device_can_wakeup(device) || !device->wakeup.prepare_count)
> +               return;

Checking prepare_count should not be necessary here.

Arguably, the power resource cannot go off without suspending all
devices that depend on it and for PM-runtime wakeup is always enabled.

> +
> +       dev = acpi_get_first_physical_node(device);

And that would be unnecessary if you stored dev in struct
acpi_power_resource_consumer.

> +       if (dev) {
> +               acpi_handle_debug(device->handle,
> +                                 "resuming %s because %s was turned on\n",
> +                                 dev_name(dev), resource->name);
> +               pm_runtime_resume(dev);

pm_request_resume()

This needs to be asynchronous as it should be done under resource_lock
(see below).

> +       }
> +}
> +
> +static int acpi_power_resume_consumers(struct acpi_power_resource *resource,
> +                                      struct acpi_device *device)
> +{
> +       struct acpi_power_resource_consumer *consumer;
> +       int ret = 0;
> +
> +       mutex_lock(&resource->consumer_lock);
> +
> +       list_for_each_entry(consumer, &resource->consumers, node) {
> +               struct acpi_device *adev = consumer->consumer;
> +               int state, ret;
> +
> +               /* Skip the device that originated the power on request */
> +               if (adev == device)
> +                       continue;
> +
> +               ret = acpi_power_get_inferred_state(consumer->consumer, &state);

This isn't necessary AFAICS.

The resource was off previously, so none of its consumers can be in D0.

> +               if (!ret && adev->power.state > ACPI_STATE_D0 &&
> +                   state == ACPI_STATE_D0) {
> +                       acpi_power_resume_consumer(resource, adev);
> +               }
> +       }
> +
> +       mutex_unlock(&resource->consumer_lock);
> +       return ret;
> +}
> +
>  static int acpi_power_resources_list_add(acpi_handle handle,
>                                          struct list_head *list)
>  {
> @@ -108,12 +220,14 @@ static int acpi_power_resources_list_add(acpi_handle handle,
>         return 0;
>  }
>
> -void acpi_power_resources_list_free(struct list_head *list)
> +void acpi_power_resources_list_free(struct acpi_device *device, struct list_head *list)
>  {
>         struct acpi_power_resource_entry *entry, *e;
>
>         list_for_each_entry_safe(entry, e, list, node) {
>                 list_del(&entry->node);
> +               if (device)
> +                       acpi_power_remove_consumer(entry->resource, device);
>                 kfree(entry);
>         }
>  }
> @@ -135,8 +249,8 @@ static bool acpi_power_resource_is_dup(union acpi_object *package,
>         return false;
>  }
>
> -int acpi_extract_power_resources(union acpi_object *package, unsigned int start,
> -                                struct list_head *list)
> +int acpi_extract_power_resources(struct acpi_device *device,
> +       union acpi_object *package, unsigned int start, struct list_head *list)
>  {
>         unsigned int i;
>         int err = 0;
> @@ -159,7 +273,7 @@ int acpi_extract_power_resources(union acpi_object *package, unsigned int start,
>                 if (acpi_power_resource_is_dup(package, start, i))
>                         continue;
>
> -               err = acpi_add_power_resource(rhandle);
> +               err = acpi_add_power_resource(device, rhandle);
>                 if (err)
>                         break;
>
> @@ -168,7 +282,7 @@ int acpi_extract_power_resources(union acpi_object *package, unsigned int start,
>                         break;
>         }
>         if (err)
> -               acpi_power_resources_list_free(list);
> +               acpi_power_resources_list_free(device, list);
>
>         return err;
>  }
> @@ -258,18 +372,33 @@ static int acpi_power_on_unlocked(struct acpi_power_resource *resource)
>                 result = __acpi_power_on(resource);
>                 if (result)
>                         resource->ref_count--;
> +               else
> +                       result = 1;
>         }
>         return result;
>  }
>
> -static int acpi_power_on(struct acpi_power_resource *resource)
> +static int acpi_power_on(struct acpi_device *device,
> +                        struct acpi_power_resource *resource)
>  {
>         int result;
>
>         mutex_lock(&resource->resource_lock);
>         result = acpi_power_on_unlocked(resource);
>         mutex_unlock(&resource->resource_lock);
> -       return result;
> +
> +       if (result <= 0)
> +               return result;
> +
> +       /*
> +        * The power resource was physically turned on. Because of this
> +        * some of the devices sharing it may have been transitioned into
> +        * D0 so we need to runtime resume them to make sure their driver
> +        * re-initializes them properly. This is important for PCI devices
> +        * that go into D0uninitialized and lose their wakeup settings
> +        * otherwise.

So you only talk about PCI devices here, which is why I think that, at
least for now, this needs to be done only for PCI devices.

> +        */
> +       return acpi_power_resume_consumers(resource, device);

Resuming consumers technically belongs to the "on" operation, so it
should be done under resource_lock (or there is nothing to prevent the
resource from going off immediately in a different thread in theory,
although that is unlikely due to the way device PM code works).

>  }
>
>  static int __acpi_power_off(struct acpi_power_resource *resource)
> @@ -319,7 +448,7 @@ static int acpi_power_off(struct acpi_power_resource *resource)
>         return result;
>  }
>
> -static int acpi_power_off_list(struct list_head *list)
> +static int acpi_power_off_list(struct acpi_device *device, struct list_head *list)
>  {
>         struct acpi_power_resource_entry *entry;
>         int result = 0;
> @@ -333,18 +462,18 @@ static int acpi_power_off_list(struct list_head *list)
>
>   err:
>         list_for_each_entry_continue(entry, list, node)
> -               acpi_power_on(entry->resource);
> +               acpi_power_on(device, entry->resource);
>
>         return result;
>  }
>
> -static int acpi_power_on_list(struct list_head *list)
> +static int acpi_power_on_list(struct acpi_device *device, struct list_head *list)
>  {
>         struct acpi_power_resource_entry *entry;
>         int result = 0;
>
>         list_for_each_entry(entry, list, node) {
> -               result = acpi_power_on(entry->resource);
> +               result = acpi_power_on(device, entry->resource);
>                 if (result)
>                         goto err;
>         }
> @@ -582,7 +711,7 @@ int acpi_enable_wakeup_device_power(struct acpi_device *dev, int sleep_state)
>
>                 if (!resource->wakeup_enabled) {
>                         err = acpi_power_on_unlocked(resource);
> -                       if (!err)
> +                       if (err >= 0)
>                                 resource->wakeup_enabled = true;
>                 }
>
> @@ -703,7 +832,7 @@ int acpi_power_on_resources(struct acpi_device *device, int state)
>         if (!device || state < ACPI_STATE_D0 || state > ACPI_STATE_D3_HOT)
>                 return -EINVAL;
>
> -       return acpi_power_on_list(&device->power.states[state].resources);
> +       return acpi_power_on_list(device, &device->power.states[state].resources);
>  }
>
>  int acpi_power_transition(struct acpi_device *device, int state)
> @@ -726,11 +855,11 @@ int acpi_power_transition(struct acpi_device *device, int state)
>          * we dereference all power resources used in the current list.
>          */
>         if (state < ACPI_STATE_D3_COLD)
> -               result = acpi_power_on_list(
> +               result = acpi_power_on_list(device,
>                         &device->power.states[state].resources);
>
>         if (!result && device->power.state < ACPI_STATE_D3_COLD)
> -               acpi_power_off_list(
> +               acpi_power_off_list(device,
>                         &device->power.states[device->power.state].resources);
>
>         /* We shouldn't change the state unless the above operations succeed. */
> @@ -788,7 +917,7 @@ static void acpi_power_add_resource_to_list(struct acpi_power_resource *resource
>         mutex_unlock(&power_resource_list_lock);
>  }
>
> -int acpi_add_power_resource(acpi_handle handle)
> +int acpi_add_power_resource(struct acpi_device *adev, acpi_handle handle)
>  {
>         struct acpi_power_resource *resource;
>         struct acpi_device *device = NULL;
> @@ -798,8 +927,10 @@ int acpi_add_power_resource(acpi_handle handle)
>         int state, result = -ENODEV;
>
>         acpi_bus_get_device(handle, &device);
> -       if (device)
> -               return 0;
> +       if (device) {
> +               resource = to_power_resource(device);
> +               goto add_consumer;
> +       }
>
>         resource = kzalloc(sizeof(*resource), GFP_KERNEL);
>         if (!resource)
> @@ -810,6 +941,8 @@ int acpi_add_power_resource(acpi_handle handle)
>                                 ACPI_STA_DEFAULT);
>         mutex_init(&resource->resource_lock);
>         INIT_LIST_HEAD(&resource->list_node);
> +       INIT_LIST_HEAD(&resource->consumers);
> +       mutex_init(&resource->consumer_lock);
>         resource->name = device->pnp.bus_id;
>         strcpy(acpi_device_name(device), ACPI_POWER_DEVICE_NAME);
>         strcpy(acpi_device_class(device), ACPI_POWER_CLASS);
> @@ -840,7 +973,11 @@ int acpi_add_power_resource(acpi_handle handle)
>
>         acpi_power_add_resource_to_list(resource);
>         acpi_device_add_finalize(device);
> -       return 0;
> +
> + add_consumer:

Note that this only needs to be done if the device's D0 list contains
the resource.

> +       result = acpi_power_add_consumer(resource, adev);
> +       if (!result)
> +               return 0;
>
>   err:
>         acpi_release_power_resource(&device->dev);
> diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
> index 6153030451eb..3af0abe5b5e2 100644
> --- a/drivers/acpi/scan.c
> +++ b/drivers/acpi/scan.c
> @@ -451,14 +451,14 @@ static void acpi_free_power_resources_lists(struct acpi_device *device)
>         int i;
>
>         if (device->wakeup.flags.valid)
> -               acpi_power_resources_list_free(&device->wakeup.resources);
> +               acpi_power_resources_list_free(NULL, &device->wakeup.resources);
>
>         if (!device->power.flags.power_resources)
>                 return;
>
>         for (i = ACPI_STATE_D0; i <= ACPI_STATE_D3_HOT; i++) {
>                 struct acpi_device_power_state *ps = &device->power.states[i];
> -               acpi_power_resources_list_free(&ps->resources);
> +               acpi_power_resources_list_free(device, &ps->resources);
>         }
>  }
>
> @@ -816,7 +816,7 @@ static int acpi_bus_extract_wakeup_device_power_package(struct acpi_device *dev)
>
>         wakeup->sleep_state = element->integer.value;
>
> -       err = acpi_extract_power_resources(package, 2, &wakeup->resources);
> +       err = acpi_extract_power_resources(NULL, package, 2, &wakeup->resources);
>         if (err)
>                 goto out;
>
> @@ -828,7 +828,7 @@ static int acpi_bus_extract_wakeup_device_power_package(struct acpi_device *dev)
>                 if (err) {
>                         acpi_handle_warn(handle, "Retrieving current states "
>                                          "of wakeup power resources failed\n");
> -                       acpi_power_resources_list_free(&wakeup->resources);
> +                       acpi_power_resources_list_free(NULL, &wakeup->resources);
>                         goto out;
>                 }
>                 if (sleep_state < wakeup->sleep_state) {
> @@ -920,7 +920,7 @@ static void acpi_bus_init_power_state(struct acpi_device *device, int state)
>                 if (buffer.length && package
>                     && package->type == ACPI_TYPE_PACKAGE
>                     && package->package.count) {
> -                       int err = acpi_extract_power_resources(package, 0,
> +                       int err = acpi_extract_power_resources(device, package, 0,
>                                                                &ps->resources);
>                         if (!err)
>                                 device->power.flags.power_resources = 1;
> @@ -1867,7 +1867,7 @@ static acpi_status acpi_bus_check_add(acpi_handle handle, u32 lvl_not_used,
>                 return AE_OK;
>
>         if (type == ACPI_BUS_TYPE_POWER) {
> -               acpi_add_power_resource(handle);
> +               acpi_add_power_resource(NULL, handle);
>                 return AE_OK;
>         }
>
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources
  2019-06-13 13:51                     ` Rafael J. Wysocki
@ 2019-06-13 14:27                       ` Mika Westerberg
  0 siblings, 0 replies; 23+ messages in thread
From: Mika Westerberg @ 2019-06-13 14:27 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Bjorn Helgaas, Len Brown, Lukas Wunner,
	Keith Busch, Alex Williamson, Alexandru Gagniuc,
	ACPI Devel Maling List, Linux PCI

On Thu, Jun 13, 2019 at 03:51:40PM +0200, Rafael J. Wysocki wrote:
> On Thu, Jun 13, 2019 at 2:52 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > On Thu, Jun 13, 2019 at 12:38:50AM +0200, Rafael J. Wysocki wrote:
> > > Basically, at the pci_acpi_setup() time dev and adev need to be passed to a function that
> > > will add dev as a "dependent device" for each of the power resources in the adev's D0
> > > list.
> > >
> > > Next whenever a power resource with a list of "dependent devices" goes _ON successfully,
> > > pm_request_resume() needs to be called for each device in that list.
> > >
> > > Finally, at the pci_acpi_cleanup() time, dev needs to be removed from the lists of
> > > "dependent devices" for all power resources in its ACPI companion's D0 list.
> > >
> > > At least that's how I see that.
> >
> > Thanks for the suggestion. This seems to make it work only for PCI
> > devices, though. Is that the intention?
> 
> I'm not sure to what extent the D0-uninitialized concept applies to
> non-PCI devices.  It may just not be necessary to do this for non-PCI
> device in general.

OK, sounds reasonable. We can extend it to apply for other types of
devices later if needed.

> > I went for the all ACPI devices path instead where we add all devices
> > sharing the power resource as "consumers" for that resource. I haven't
> > fully tested the approach yet but the draft patch is below. I have no
> > issues doing what you say above, though :)
> >
> > ----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----
> >
> > diff --git a/drivers/acpi/internal.h b/drivers/acpi/internal.h
> > index f6157d4d637a..e840299c3293 100644
> > --- a/drivers/acpi/internal.h
> > +++ b/drivers/acpi/internal.h
> > @@ -127,10 +127,10 @@ int __acpi_device_uevent_modalias(struct acpi_device *adev,
> >                                    Power Resource
> >     -------------------------------------------------------------------------- */
> >  int acpi_power_init(void);
> > -void acpi_power_resources_list_free(struct list_head *list);
> > -int acpi_extract_power_resources(union acpi_object *package, unsigned int start,
> > -                                struct list_head *list);
> > -int acpi_add_power_resource(acpi_handle handle);
> > +void acpi_power_resources_list_free(struct acpi_device *device, struct list_head *list);
> > +int acpi_extract_power_resources(struct acpi_device *device,
> > +       union acpi_object *package, unsigned int start, struct list_head *list);
> > +int acpi_add_power_resource(struct acpi_device *adev, acpi_handle handle);
> >  void acpi_power_add_remove_device(struct acpi_device *adev, bool add);
> >  int acpi_power_wakeup_list_init(struct list_head *list, int *system_level);
> >  int acpi_device_sleep_wake(struct acpi_device *dev,
> > diff --git a/drivers/acpi/power.c b/drivers/acpi/power.c
> > index a916417b9e70..31817f931381 100644
> > --- a/drivers/acpi/power.c
> > +++ b/drivers/acpi/power.c
> > @@ -45,6 +45,8 @@ ACPI_MODULE_NAME("power");
> >  struct acpi_power_resource {
> >         struct acpi_device device;
> >         struct list_head list_node;
> > +       struct list_head consumers;
> > +       struct mutex consumer_lock;
> 
> Why do you need this extra lock?

It is there because we call acpi_power_get_inferred_state() in the
notification path. However, as you point out below the call is not
needed so in that sense resource_lock should be enough.

> 
> >         char *name;
> >         u32 system_level;
> >         u32 order;
> > @@ -58,6 +60,11 @@ struct acpi_power_resource_entry {
> >         struct acpi_power_resource *resource;
> >  };
> >
> > +struct acpi_power_resource_consumer {
> > +       struct list_head node;
> > +       struct acpi_device *consumer;
> 
> I would put the "physical" device pointer here.

OK

> > +};
> > +
> >  static LIST_HEAD(acpi_power_resource_list);
> >  static DEFINE_MUTEX(power_resource_list_lock);
> >
> > @@ -81,6 +88,111 @@ static struct acpi_power_resource *acpi_power_get_context(acpi_handle handle)
> >         return to_power_resource(device);
> >  }
> >
> > +static int acpi_power_add_consumer(struct acpi_power_resource *resource,
> > +                                  struct acpi_device *device)
> > +{
> > +       struct acpi_power_resource_consumer *consumer;
> > +       int ret = 0;
> > +
> > +       if (!device)
> > +               return 0;
> > +
> > +       mutex_lock(&resource->consumer_lock);
> 
> All of this could be done under the resource mutex instead, I think.

Yup

> > +
> > +       list_for_each_entry(consumer, &resource->consumers, node) {
> > +               /* Don't add it twice */
> > +               if (consumer->consumer == device)
> > +                       goto unlock;
> > +       }
> > +
> > +       consumer = kzalloc(sizeof(*consumer), GFP_KERNEL);
> > +       if (!consumer) {
> > +               ret = -ENOMEM;
> > +               goto unlock;
> > +       }
> > +
> > +       consumer->consumer = device;
> > +       list_add_tail(&consumer->node, &resource->consumers);
> > +
> > +       acpi_handle_debug(device->handle, "added dependency to %s\n",
> > +                         resource->name);
> > +
> > +unlock:
> > +       mutex_unlock(&resource->consumer_lock);
> > +       return ret;
> > +}
> > +
> > +static void acpi_power_remove_consumer(struct acpi_power_resource *resource,
> > +                                      struct acpi_device *device)
> > +{
> > +       struct acpi_power_resource_consumer *consumer;
> > +
> > +       mutex_lock(&resource->consumer_lock);
> > +       list_for_each_entry(consumer, &resource->consumers, node) {
> > +               if (consumer->consumer == device) {
> > +                       list_del(&consumer->node);
> > +                       kfree(consumer);
> > +
> > +                       acpi_handle_debug(device->handle,
> > +                                         "removed dependency to %s\n",
> > +                                         resource->name);
> > +                       break;
> > +               }
> > +       }
> > +       mutex_unlock(&resource->consumer_lock);
> > +}
> > +
> > +static void
> > +acpi_power_resume_consumer(const struct acpi_power_resource *resource,
> > +                         struct acpi_device *device)
> > +{
> > +       struct device *dev;
> > +
> > +       /*
> > +        * If the device is prepared to wake we need to resume it now so
> > +        * that the driver can re-program it to do so. For non-wake devices
> > +        * we can leave them as is. The driver then restores the device
> > +        * when it is needed next time.
> > +        */
> > +       if (!acpi_device_can_wakeup(device) || !device->wakeup.prepare_count)
> > +               return;
> 
> Checking prepare_count should not be necessary here.
> 
> Arguably, the power resource cannot go off without suspending all
> devices that depend on it and for PM-runtime wakeup is always enabled.

OK

> > +
> > +       dev = acpi_get_first_physical_node(device);
> 
> And that would be unnecessary if you stored dev in struct
> acpi_power_resource_consumer.

Indeed

> > +       if (dev) {
> > +               acpi_handle_debug(device->handle,
> > +                                 "resuming %s because %s was turned on\n",
> > +                                 dev_name(dev), resource->name);
> > +               pm_runtime_resume(dev);
> 
> pm_request_resume()
> 
> This needs to be asynchronous as it should be done under resource_lock
> (see below).

OK

> > +       }
> > +}
> > +
> > +static int acpi_power_resume_consumers(struct acpi_power_resource *resource,
> > +                                      struct acpi_device *device)
> > +{
> > +       struct acpi_power_resource_consumer *consumer;
> > +       int ret = 0;
> > +
> > +       mutex_lock(&resource->consumer_lock);
> > +
> > +       list_for_each_entry(consumer, &resource->consumers, node) {
> > +               struct acpi_device *adev = consumer->consumer;
> > +               int state, ret;
> > +
> > +               /* Skip the device that originated the power on request */
> > +               if (adev == device)
> > +                       continue;
> > +
> > +               ret = acpi_power_get_inferred_state(consumer->consumer, &state);
> 
> This isn't necessary AFAICS.
> 
> The resource was off previously, so none of its consumers can be in D0.

Makes sense.

> 
> > +               if (!ret && adev->power.state > ACPI_STATE_D0 &&
> > +                   state == ACPI_STATE_D0) {
> > +                       acpi_power_resume_consumer(resource, adev);
> > +               }
> > +       }
> > +
> > +       mutex_unlock(&resource->consumer_lock);
> > +       return ret;
> > +}
> > +
> >  static int acpi_power_resources_list_add(acpi_handle handle,
> >                                          struct list_head *list)
> >  {
> > @@ -108,12 +220,14 @@ static int acpi_power_resources_list_add(acpi_handle handle,
> >         return 0;
> >  }
> >
> > -void acpi_power_resources_list_free(struct list_head *list)
> > +void acpi_power_resources_list_free(struct acpi_device *device, struct list_head *list)
> >  {
> >         struct acpi_power_resource_entry *entry, *e;
> >
> >         list_for_each_entry_safe(entry, e, list, node) {
> >                 list_del(&entry->node);
> > +               if (device)
> > +                       acpi_power_remove_consumer(entry->resource, device);
> >                 kfree(entry);
> >         }
> >  }
> > @@ -135,8 +249,8 @@ static bool acpi_power_resource_is_dup(union acpi_object *package,
> >         return false;
> >  }
> >
> > -int acpi_extract_power_resources(union acpi_object *package, unsigned int start,
> > -                                struct list_head *list)
> > +int acpi_extract_power_resources(struct acpi_device *device,
> > +       union acpi_object *package, unsigned int start, struct list_head *list)
> >  {
> >         unsigned int i;
> >         int err = 0;
> > @@ -159,7 +273,7 @@ int acpi_extract_power_resources(union acpi_object *package, unsigned int start,
> >                 if (acpi_power_resource_is_dup(package, start, i))
> >                         continue;
> >
> > -               err = acpi_add_power_resource(rhandle);
> > +               err = acpi_add_power_resource(device, rhandle);
> >                 if (err)
> >                         break;
> >
> > @@ -168,7 +282,7 @@ int acpi_extract_power_resources(union acpi_object *package, unsigned int start,
> >                         break;
> >         }
> >         if (err)
> > -               acpi_power_resources_list_free(list);
> > +               acpi_power_resources_list_free(device, list);
> >
> >         return err;
> >  }
> > @@ -258,18 +372,33 @@ static int acpi_power_on_unlocked(struct acpi_power_resource *resource)
> >                 result = __acpi_power_on(resource);
> >                 if (result)
> >                         resource->ref_count--;
> > +               else
> > +                       result = 1;
> >         }
> >         return result;
> >  }
> >
> > -static int acpi_power_on(struct acpi_power_resource *resource)
> > +static int acpi_power_on(struct acpi_device *device,
> > +                        struct acpi_power_resource *resource)
> >  {
> >         int result;
> >
> >         mutex_lock(&resource->resource_lock);
> >         result = acpi_power_on_unlocked(resource);
> >         mutex_unlock(&resource->resource_lock);
> > -       return result;
> > +
> > +       if (result <= 0)
> > +               return result;
> > +
> > +       /*
> > +        * The power resource was physically turned on. Because of this
> > +        * some of the devices sharing it may have been transitioned into
> > +        * D0 so we need to runtime resume them to make sure their driver
> > +        * re-initializes them properly. This is important for PCI devices
> > +        * that go into D0uninitialized and lose their wakeup settings
> > +        * otherwise.
> 
> So you only talk about PCI devices here, which is why I think that, at
> least for now, this needs to be done only for PCI devices.

OK. For that we would need to include <linux/pci.h> and then call
dev_is_pci() here. The other alternative is to do this in pci-acpi.c
instead like you suggest which limits this to PCI but then again PCI is
the only one having the D0uninitialized thing so probably makes more
sense to add it there anyway. I'll try that approach next.

> 
> > +        */
> > +       return acpi_power_resume_consumers(resource, device);
> 
> Resuming consumers technically belongs to the "on" operation, so it
> should be done under resource_lock (or there is nothing to prevent the
> resource from going off immediately in a different thread in theory,
> although that is unlikely due to the way device PM code works).

OK

> >  }
> >
> >  static int __acpi_power_off(struct acpi_power_resource *resource)
> > @@ -319,7 +448,7 @@ static int acpi_power_off(struct acpi_power_resource *resource)
> >         return result;
> >  }
> >
> > -static int acpi_power_off_list(struct list_head *list)
> > +static int acpi_power_off_list(struct acpi_device *device, struct list_head *list)
> >  {
> >         struct acpi_power_resource_entry *entry;
> >         int result = 0;
> > @@ -333,18 +462,18 @@ static int acpi_power_off_list(struct list_head *list)
> >
> >   err:
> >         list_for_each_entry_continue(entry, list, node)
> > -               acpi_power_on(entry->resource);
> > +               acpi_power_on(device, entry->resource);
> >
> >         return result;
> >  }
> >
> > -static int acpi_power_on_list(struct list_head *list)
> > +static int acpi_power_on_list(struct acpi_device *device, struct list_head *list)
> >  {
> >         struct acpi_power_resource_entry *entry;
> >         int result = 0;
> >
> >         list_for_each_entry(entry, list, node) {
> > -               result = acpi_power_on(entry->resource);
> > +               result = acpi_power_on(device, entry->resource);
> >                 if (result)
> >                         goto err;
> >         }
> > @@ -582,7 +711,7 @@ int acpi_enable_wakeup_device_power(struct acpi_device *dev, int sleep_state)
> >
> >                 if (!resource->wakeup_enabled) {
> >                         err = acpi_power_on_unlocked(resource);
> > -                       if (!err)
> > +                       if (err >= 0)
> >                                 resource->wakeup_enabled = true;
> >                 }
> >
> > @@ -703,7 +832,7 @@ int acpi_power_on_resources(struct acpi_device *device, int state)
> >         if (!device || state < ACPI_STATE_D0 || state > ACPI_STATE_D3_HOT)
> >                 return -EINVAL;
> >
> > -       return acpi_power_on_list(&device->power.states[state].resources);
> > +       return acpi_power_on_list(device, &device->power.states[state].resources);
> >  }
> >
> >  int acpi_power_transition(struct acpi_device *device, int state)
> > @@ -726,11 +855,11 @@ int acpi_power_transition(struct acpi_device *device, int state)
> >          * we dereference all power resources used in the current list.
> >          */
> >         if (state < ACPI_STATE_D3_COLD)
> > -               result = acpi_power_on_list(
> > +               result = acpi_power_on_list(device,
> >                         &device->power.states[state].resources);
> >
> >         if (!result && device->power.state < ACPI_STATE_D3_COLD)
> > -               acpi_power_off_list(
> > +               acpi_power_off_list(device,
> >                         &device->power.states[device->power.state].resources);
> >
> >         /* We shouldn't change the state unless the above operations succeed. */
> > @@ -788,7 +917,7 @@ static void acpi_power_add_resource_to_list(struct acpi_power_resource *resource
> >         mutex_unlock(&power_resource_list_lock);
> >  }
> >
> > -int acpi_add_power_resource(acpi_handle handle)
> > +int acpi_add_power_resource(struct acpi_device *adev, acpi_handle handle)
> >  {
> >         struct acpi_power_resource *resource;
> >         struct acpi_device *device = NULL;
> > @@ -798,8 +927,10 @@ int acpi_add_power_resource(acpi_handle handle)
> >         int state, result = -ENODEV;
> >
> >         acpi_bus_get_device(handle, &device);
> > -       if (device)
> > -               return 0;
> > +       if (device) {
> > +               resource = to_power_resource(device);
> > +               goto add_consumer;
> > +       }
> >
> >         resource = kzalloc(sizeof(*resource), GFP_KERNEL);
> >         if (!resource)
> > @@ -810,6 +941,8 @@ int acpi_add_power_resource(acpi_handle handle)
> >                                 ACPI_STA_DEFAULT);
> >         mutex_init(&resource->resource_lock);
> >         INIT_LIST_HEAD(&resource->list_node);
> > +       INIT_LIST_HEAD(&resource->consumers);
> > +       mutex_init(&resource->consumer_lock);
> >         resource->name = device->pnp.bus_id;
> >         strcpy(acpi_device_name(device), ACPI_POWER_DEVICE_NAME);
> >         strcpy(acpi_device_class(device), ACPI_POWER_CLASS);
> > @@ -840,7 +973,11 @@ int acpi_add_power_resource(acpi_handle handle)
> >
> >         acpi_power_add_resource_to_list(resource);
> >         acpi_device_add_finalize(device);
> > -       return 0;
> > +
> > + add_consumer:
> 
> Note that this only needs to be done if the device's D0 list contains
> the resource.

Good point.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-06-13 17:10 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-05 14:58 [PATCH 0/3] PCI: Power management improvements Mika Westerberg
2019-06-05 14:58 ` [PATCH 1/3] PCI: Add missing link delays required by the PCIe spec Mika Westerberg
2019-06-06  8:40   ` Rafael J. Wysocki
2019-06-06 11:24     ` Mika Westerberg
2019-06-05 14:58 ` [PATCH 2/3] PCI: Do not poll for PME if the device is in D3cold Mika Westerberg
2019-06-05 19:05   ` Lukas Wunner
2019-06-06 11:21     ` Mika Westerberg
2019-06-09 18:38   ` Lukas Wunner
2019-06-10 11:35   ` Rafael J. Wysocki
2019-06-05 14:58 ` [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources Mika Westerberg
2019-06-06  8:54   ` Rafael J. Wysocki
2019-06-06 11:26     ` Mika Westerberg
2019-06-06 13:44       ` Mika Westerberg
2019-06-06 14:08         ` Rafael J. Wysocki
2019-06-06 14:17           ` Mika Westerberg
2019-06-06 14:27             ` Rafael J. Wysocki
2019-06-06 14:36               ` Mika Westerberg
2019-06-12 22:38                 ` Rafael J. Wysocki
2019-06-13 12:52                   ` Mika Westerberg
2019-06-13 13:51                     ` Rafael J. Wysocki
2019-06-13 14:27                       ` Mika Westerberg
2019-06-09 18:58               ` Lukas Wunner
2019-06-10 10:57                 ` Rafael J. Wysocki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).