linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC v2 0/5] PCIe, Add PCIe runtime D3cold support
@ 2012-05-04  8:13 Huang Ying
  2012-05-04  8:13 ` [RFC v2 1/5] PM, Runtime, Add power_must_be_on flag Huang Ying
                   ` (4 more replies)
  0 siblings, 5 replies; 47+ messages in thread
From: Huang Ying @ 2012-05-04  8:13 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: ming.m.lin, linux-kernel, linux-pm, Rafael J. Wysocki,
	Huang Ying, Zheng Yan

This patchset add runtime D3cold support to PCIe.

Changelog:

v2: Refreshed based on comments from Rafael

[RFC v2 1/5] PM, Runtime, Add power_must_be_on flag
[RFC v2 2/5] PM, Add sysfs file power_off to control device power off
[RFC v2 3/5] PCIe, Add runtime PM support to PCIe port
[RFC v2 4/5] ACPI, PM, Specify lowest allowed state for device sleep
[RFC v2 5/5] PCIe, Add PCIe runtime D3cold support

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC v2 1/5] PM, Runtime, Add power_must_be_on flag
  2012-05-04  8:13 [RFC v2 0/5] PCIe, Add PCIe runtime D3cold support Huang Ying
@ 2012-05-04  8:13 ` Huang Ying
  2012-05-04 19:37   ` Rafael J. Wysocki
  2012-05-04 19:50   ` Bjorn Helgaas
  2012-05-04  8:13 ` [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy Huang Ying
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 47+ messages in thread
From: Huang Ying @ 2012-05-04  8:13 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: ming.m.lin, linux-kernel, linux-pm, Rafael J. Wysocki,
	Huang Ying, Zheng Yan

The extreme way to save device power in runtime is to turn off power
of device.  For example, D3cold for PCIe bus and ZPODD (Zero Power
Optical Disk Drive) for SATA bus will do that.

But sometimes power off is not expected, some possible reason is as
follow

- power off device usually incurs longer resume latency, if it exceeds
  power QoS requirement, power off should be disabled.

- For some buses, device in power off state can not support remote
  wakeup.  If remote wakeup is desired, power off should be disabled.

In general, whether to put a device into power off state should be
decided by the driver of the device, but for some buses, whether to
put a device into power off state may be done by the parent of the
device.  For example, a PCIe end point device may be put into power
off state by the PCIe port connected to it.

So a flag is introduced for the children devices to tell the parent
device, whether it should be put into power off state.

This flag is also used for device driver to tell bus layer whether it
is OK to be powered off.

Signed-off-by: Huang Ying <ying.huang@intel.com>
---
 include/linux/pm.h |    1 +
 1 file changed, 1 insertion(+)

--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -536,6 +536,7 @@ struct dev_pm_info {
 	unsigned int		irq_safe:1;
 	unsigned int		use_autosuspend:1;
 	unsigned int		timer_autosuspends:1;
+	unsigned int		power_must_be_on:1;
 	enum rpm_request	request;
 	enum rpm_status		runtime_status;
 	int			runtime_error;

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy
  2012-05-04  8:13 [RFC v2 0/5] PCIe, Add PCIe runtime D3cold support Huang Ying
  2012-05-04  8:13 ` [RFC v2 1/5] PM, Runtime, Add power_must_be_on flag Huang Ying
@ 2012-05-04  8:13 ` Huang Ying
  2012-05-04 19:33   ` Rafael J. Wysocki
  2012-05-04 19:50   ` Bjorn Helgaas
  2012-05-04  8:13 ` [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port Huang Ying
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 47+ messages in thread
From: Huang Ying @ 2012-05-04  8:13 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: ming.m.lin, linux-kernel, linux-pm, Rafael J. Wysocki,
	Huang Ying, Zheng Yan, Lan Tianyu

From: Lan Tianyu <tianyu.lan@intel.com>

Some devices can be powered off to save more power via some platform
mechanism, e.g., ACPI.  But that may not work as expected for some
device or platform.  So, this patch adds a sysfs file named power_off
under <device>/power directory to provide a mechanism for user to control
whether to allow the device to be power off.

power_off => "enabled" means allowing the device to be powered off if
possible.

power_off => "disabled" means the device must be power on anytime.

Also add flag power_off_user to struct dev_pm_info to record users'
choice. The bus layer can use this field to determine whether to
power off the device.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
---
 drivers/base/power/sysfs.c |   33 +++++++++++++++++++++++++++++++++
 include/linux/pm.h         |    1 +
 2 files changed, 34 insertions(+)

--- a/drivers/base/power/sysfs.c
+++ b/drivers/base/power/sysfs.c
@@ -243,6 +243,38 @@ static ssize_t pm_qos_latency_store(stru
 
 static DEVICE_ATTR(pm_qos_resume_latency_us, 0644,
 		   pm_qos_latency_show, pm_qos_latency_store);
+
+static ssize_t power_off_show(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%s\n",
+		       dev->power.power_off_user ? enabled : disabled);
+}
+
+static ssize_t power_off_store(struct device * dev,
+			       struct device_attribute *attr,
+			       const char * buf, size_t n)
+{
+	char *cp;
+	int len = n;
+	unsigned int power_off_user;
+
+	cp = memchr(buf, '\n', n);
+	if (cp)
+		len = cp - buf;
+
+	if (len == sizeof enabled - 1 && strncmp(buf, enabled, len) == 0)
+		dev->power.power_off_user = true;
+	else if (len == sizeof disabled - 1 && strncmp(buf, disabled, len) == 0)
+		dev->power.power_off_user = false;
+	else
+		return -EINVAL;
+
+	pm_runtime_resume(dev);
+	return n;
+
+}
+static DEVICE_ATTR(power_off, 0644, power_off_show, power_off_store);
 #endif /* CONFIG_PM_RUNTIME */
 
 #ifdef CONFIG_PM_SLEEP
@@ -508,6 +540,7 @@ static struct attribute *runtime_attrs[]
 	&dev_attr_runtime_suspended_time.attr,
 	&dev_attr_runtime_active_time.attr,
 	&dev_attr_autosuspend_delay_ms.attr,
+	&dev_attr_power_off.attr,
 #endif /* CONFIG_PM_RUNTIME */
 	NULL,
 };
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -537,6 +537,7 @@ struct dev_pm_info {
 	unsigned int		use_autosuspend:1;
 	unsigned int		timer_autosuspends:1;
 	unsigned int		power_must_be_on:1;
+	unsigned int		power_off_user:1;
 	enum rpm_request	request;
 	enum rpm_status		runtime_status;
 	int			runtime_error;

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port
  2012-05-04  8:13 [RFC v2 0/5] PCIe, Add PCIe runtime D3cold support Huang Ying
  2012-05-04  8:13 ` [RFC v2 1/5] PM, Runtime, Add power_must_be_on flag Huang Ying
  2012-05-04  8:13 ` [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy Huang Ying
@ 2012-05-04  8:13 ` Huang Ying
  2012-05-04 19:43   ` Rafael J. Wysocki
  2012-05-04 19:50   ` Bjorn Helgaas
  2012-05-04  8:13 ` [RFC v2 4/5] ACPI, PM, Specify lowest allowed state for device sleep state Huang Ying
  2012-05-04  8:13 ` [RFC v2 5/5] PCIe, Add PCIe runtime D3cold support Huang Ying
  4 siblings, 2 replies; 47+ messages in thread
From: Huang Ying @ 2012-05-04  8:13 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: ming.m.lin, linux-kernel, linux-pm, Rafael J. Wysocki,
	Huang Ying, Zheng Yan

From: Zheng Yan <zheng.z.yan@intel.com>

This patch adds runtime PM support to PCIe port.  This is needed by
PCIe D3cold support, where PCIe device in slot may be powered on/off
by PCIe port.

Because runtime suspend is broken for some chipset, a white list is
used to enable runtime PM support for only chipset known to work.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
---
 drivers/pci/pci.c              |    9 +++++++++
 drivers/pci/pcie/portdrv_pci.c |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)

--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1476,6 +1476,15 @@ bool pci_check_pme_status(struct pci_dev
  */
 static int pci_pme_wakeup(struct pci_dev *dev, void *pme_poll_reset)
 {
+	struct pci_dev *bridge = dev->bus->self;
+
+	/*
+	 * If bridge is in low power state, the configuration space of
+	 * subordinate devices may be not accessible
+	 */
+	if (bridge && bridge->current_state != PCI_D0)
+		return 0;
+
 	if (pme_poll_reset && dev->pme_poll)
 		dev->pme_poll = false;
 
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -11,6 +11,7 @@
 #include <linux/kernel.h>
 #include <linux/errno.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/init.h>
 #include <linux/pcieport_if.h>
 #include <linux/aer.h>
@@ -99,6 +100,27 @@ static int pcie_port_resume_noirq(struct
 	return 0;
 }
 
+#ifdef CONFIG_PM_RUNTIME
+static int pcie_port_runtime_suspend(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	pci_save_state(pdev);
+	return 0;
+}
+
+static int pcie_port_runtime_resume(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	pci_restore_state(pdev);
+	return 0;
+}
+#else
+#define pcie_port_runtime_suspend	NULL
+#define pcie_port_runtime_resume	NULL
+#endif
+
 static const struct dev_pm_ops pcie_portdrv_pm_ops = {
 	.suspend	= pcie_port_device_suspend,
 	.resume		= pcie_port_device_resume,
@@ -107,6 +129,8 @@ static const struct dev_pm_ops pcie_port
 	.poweroff	= pcie_port_device_suspend,
 	.restore	= pcie_port_device_resume,
 	.resume_noirq	= pcie_port_resume_noirq,
+	.runtime_suspend = pcie_port_runtime_suspend,
+	.runtime_resume = pcie_port_runtime_resume,
 };
 
 #define PCIE_PORTDRV_PM_OPS	(&pcie_portdrv_pm_ops)
@@ -117,6 +141,18 @@ static const struct dev_pm_ops pcie_port
 #endif /* !PM */
 
 /*
+ * PCIe port runtime suspend is broken for some chipset, so use a
+ * white list to disable runtime PM for these chipset.
+ */
+static const struct pci_device_id port_runtime_pm_pci_ids[] = {
+	{ PCI_VDEVICE(INTEL, 0x1e10), },
+	{ PCI_VDEVICE(INTEL, 0x1e16), },
+	{ PCI_VDEVICE(INTEL, 0x1e18), },
+	{ PCI_VDEVICE(INTEL, 0x1e1c), },
+	{ /* end: all zeroes */ }
+};
+
+/*
  * pcie_portdrv_probe - Probe PCI-Express port devices
  * @dev: PCI-Express port device being probed
  *
@@ -144,12 +180,16 @@ static int __devinit pcie_portdrv_probe(
 		return status;
 
 	pci_save_state(dev);
+	if (pci_match_id(port_runtime_pm_pci_ids, dev))
+		pm_runtime_put_noidle(&dev->dev);
 
 	return 0;
 }
 
 static void pcie_portdrv_remove(struct pci_dev *dev)
 {
+	if (pci_match_id(port_runtime_pm_pci_ids, dev))
+		pm_runtime_get_noresume(&dev->dev);
 	pcie_port_device_remove(dev);
 	pci_disable_device(dev);
 }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC v2 4/5] ACPI, PM, Specify lowest allowed state for device sleep state
  2012-05-04  8:13 [RFC v2 0/5] PCIe, Add PCIe runtime D3cold support Huang Ying
                   ` (2 preceding siblings ...)
  2012-05-04  8:13 ` [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port Huang Ying
@ 2012-05-04  8:13 ` Huang Ying
  2012-05-04 20:10   ` Rafael J. Wysocki
  2012-05-04  8:13 ` [RFC v2 5/5] PCIe, Add PCIe runtime D3cold support Huang Ying
  4 siblings, 1 reply; 47+ messages in thread
From: Huang Ying @ 2012-05-04  8:13 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: ming.m.lin, linux-kernel, linux-pm, Rafael J. Wysocki,
	Huang Ying, Zheng Yan

Lower device sleep state can save more power, but has more exit
latency too.  Sometimes, to satisfy some power QoS and other
requirement, we need to constrain the lowest device sleep state.

In this patch, a parameter to specify lowest allowed state for
acpi_pm_device_sleep_state is added.  So that the caller can enforce
the constraint via the parameter.

Signed-off-by: Huang Ying <ying.huang@intel.com>
---
 drivers/acpi/sleep.c       |   18 +++++++++++++++---
 drivers/pci/pci-acpi.c     |    3 ++-
 drivers/pnp/pnpacpi/core.c |    4 ++--
 include/acpi/acpi_bus.h    |    6 +++---
 4 files changed, 22 insertions(+), 9 deletions(-)

--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -677,6 +677,7 @@ int acpi_suspend(u32 acpi_state)
  *	@dev: device to examine; its driver model wakeup flags control
  *		whether it should be able to wake up the system
  *	@d_min_p: used to store the upper limit of allowed states range
+ *	@d_max_in: specify the lowest allowed states
  *	Return value: preferred power state of the device on success, -ENODEV on
  *		failure (ie. if there's no 'struct acpi_device' for @dev)
  *
@@ -693,7 +694,7 @@ int acpi_suspend(u32 acpi_state)
  *	via @wake.
  */
 
-int acpi_pm_device_sleep_state(struct device *dev, int *d_min_p)
+int acpi_pm_device_sleep_state(struct device *dev, int *d_min_p, int d_max_in)
 {
 	acpi_handle handle = DEVICE_ACPI_HANDLE(dev);
 	struct acpi_device *adev;
@@ -704,11 +705,14 @@ int acpi_pm_device_sleep_state(struct de
 		printk(KERN_DEBUG "ACPI handle has no context!\n");
 		return -ENODEV;
 	}
+	d_max_in = clamp_t(int, d_max_in, ACPI_STATE_D0, ACPI_STATE_D3);
 
 	acpi_method[2] = '0' + acpi_target_sleep_state;
 	/*
-	 * If the sleep state is S0, we will return D3, but if the device has
-	 * _S0W, we will use the value from _S0W
+	 * If the sleep state is S0, the lowest limit from ACPI is D3,
+	 * but if the device has _S0W, we will use the value from _S0W
+	 * as the lowest limit from ACPI.  Finally, we will constrain
+	 * the lowest limit with the specified one.
 	 */
 	d_min = ACPI_STATE_D0;
 	d_max = ACPI_STATE_D3;
@@ -754,6 +758,14 @@ int acpi_pm_device_sleep_state(struct de
 
 	if (d_min_p)
 		*d_min_p = d_min;
+	/* constrain d_max with specified lowest limit (max number) */
+	if (d_max > d_max_in) {
+		d_max = d_max_in;
+		for (;d_max > d_min; d_max--) {
+			if (adev->power.states[d_max].flags.valid)
+				break;
+		}
+	}
 	return d_max;
 }
 #endif /* CONFIG_PM */
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -189,7 +189,8 @@ static pci_power_t acpi_pci_choose_state
 {
 	int acpi_state;
 
-	acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL);
+	acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL,
+						ACPI_STATE_D3);
 	if (acpi_state < 0)
 		return PCI_POWER_ERROR;
 
--- a/drivers/pnp/pnpacpi/core.c
+++ b/drivers/pnp/pnpacpi/core.c
@@ -170,8 +170,8 @@ static int pnpacpi_suspend(struct pnp_de
 	}
 
 	if (acpi_bus_power_manageable(handle)) {
-		int power_state = acpi_pm_device_sleep_state(&dev->dev, NULL);
-
+		int power_state = acpi_pm_device_sleep_state(&dev->dev, NULL,
+							     ACPI_STATE_D3);
 		if (power_state < 0)
 			power_state = (state.event == PM_EVENT_ON) ?
 					ACPI_STATE_D0 : ACPI_STATE_D3;
--- a/include/acpi/acpi_bus.h
+++ b/include/acpi/acpi_bus.h
@@ -383,13 +383,13 @@ int acpi_enable_wakeup_device_power(stru
 int acpi_disable_wakeup_device_power(struct acpi_device *dev);
 
 #ifdef CONFIG_PM
-int acpi_pm_device_sleep_state(struct device *, int *);
+int acpi_pm_device_sleep_state(struct device *, int *, int);
 #else
-static inline int acpi_pm_device_sleep_state(struct device *d, int *p)
+static inline int acpi_pm_device_sleep_state(struct device *d, int *p, int m)
 {
 	if (p)
 		*p = ACPI_STATE_D0;
-	return ACPI_STATE_D3;
+	return m == ACPI_STATE_D3 ? m : ACPI_STATE_D0;
 }
 #endif
 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC v2 5/5] PCIe, Add PCIe runtime D3cold support
  2012-05-04  8:13 [RFC v2 0/5] PCIe, Add PCIe runtime D3cold support Huang Ying
                   ` (3 preceding siblings ...)
  2012-05-04  8:13 ` [RFC v2 4/5] ACPI, PM, Specify lowest allowed state for device sleep state Huang Ying
@ 2012-05-04  8:13 ` Huang Ying
  2012-05-04 19:51   ` Bjorn Helgaas
  2012-05-04 20:50   ` Rafael J. Wysocki
  4 siblings, 2 replies; 47+ messages in thread
From: Huang Ying @ 2012-05-04  8:13 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: ming.m.lin, linux-kernel, linux-pm, Rafael J. Wysocki,
	Huang Ying, Zheng Yan

This patch adds runtime D3cold support to PCIe bus.  D3cold is the
deepest power saving state for PCIe devices.  Where the main PCIe link
will be powered off totally, so before the PCIe main link is powered
on again, you can not access the device, even the configuration space,
which is still accessible in D3hot.  Because the main PCIe link is not
available, the PCI PM registers can not be used to put device into/out
of D3cold state, that will be done by platform logic such as ACPI
_PR3.

To support remote wake up in D3cold, aux power is supplied, and a
side-band pin: WAKE# is used to notify remote wake up event, the pin
is usually connected to platform logic such as ACPI GPE.  This is
quite different with other power saving states, where remote wake up
is notified via PME message via the PCIe main link.

PCIe devices in plug-in slot usually has no direct platform logic, for
example, there is usually no ACPI _PR3 for them.  The D3cold support
for these devices can be done via the PCIe port connected to it.  When
power on/off the PCIe port, the corresponding PCIe devices are powered
on/off too.  And the remote wake up event from PCIe device will be
notified to the corresponding PCIe port.

The PCI subsystem D3cold support and corresponding ACPI platform
support is implemented in the patch.  Including the support for PCIe
devices in plug-in slot.

For more information about PCIe D3cold and corresponding ACPI support,
please refer to:

- PCI Express Base Specification Revision 2.0
- Advanced Configuration and Power Interface Specification Revision 5.0

Originally-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
---
 drivers/pci/pci-acpi.c         |   32 +++++++++--
 drivers/pci/pci-driver.c       |    3 +
 drivers/pci/pci.c              |  115 +++++++++++++++++++++++++++++++++++++----
 drivers/pci/pci.h              |    1 
 drivers/pci/pcie/portdrv_pci.c |   28 +++++++++
 include/linux/pci.h            |   12 +++-
 6 files changed, 175 insertions(+), 16 deletions(-)

--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -48,6 +48,21 @@ static void pci_acpi_wake_dev(acpi_handl
 	if (event != ACPI_NOTIFY_DEVICE_WAKE || !pci_dev)
 		return;
 
+	if (pci_dev->current_state == PCI_D3cold) {
+		unsigned int count = 0;
+
+		/*
+		 * Powering on bridge need to resume whole hierarchy,
+		 * just resume the children to avoid the bridge going
+		 * suspending as soon as resumed
+		 */
+		if (pci_dev->subordinate)
+			count = pci_wakeup_bus(pci_dev->subordinate);
+		if (count == 0)
+			pm_runtime_resume(&pci_dev->dev);
+		return;
+	}
+
 	if (!pci_dev->pm_cap || !pci_dev->pme_support
 	     || pci_check_pme_status(pci_dev)) {
 		if (pci_dev->pme_poll)
@@ -187,10 +202,13 @@ acpi_status pci_acpi_remove_pm_notifier(
 
 static pci_power_t acpi_pci_choose_state(struct pci_dev *pdev)
 {
-	int acpi_state;
+	int acpi_state, d_max;
 
-	acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL,
-						ACPI_STATE_D3);
+	if (pdev->dev.power.power_must_be_on)
+		d_max = ACPI_STATE_D3_HOT;
+	else
+		d_max = ACPI_STATE_D3_COLD;
+	acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL, d_max);
 	if (acpi_state < 0)
 		return PCI_POWER_ERROR;
 
@@ -297,7 +315,13 @@ static void acpi_pci_propagate_run_wake(
 
 static int acpi_pci_run_wake(struct pci_dev *dev, bool enable)
 {
-	if (dev->pme_interrupt)
+	/*
+	 * Per PCI Express Base Specification Revision 2.0 section
+	 * 5.3.3.2 Link Wakeup, platform support is needed for D3cold
+	 * waking up to power on the main link even if there is PME
+	 * support for D3cold
+	 */
+	if (dev->pme_interrupt && !dev->runtime_d3cold)
 		return 0;
 
 	if (!acpi_pm_device_run_wake(&dev->dev, enable))
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -1013,10 +1013,13 @@ static int pci_pm_runtime_suspend(struct
 	if (!pm || !pm->runtime_suspend)
 		return -ENOSYS;
 
+	dev->power.power_must_be_on = false;
 	error = pm->runtime_suspend(dev);
 	suspend_report_result(pm->runtime_suspend, error);
 	if (error)
 		return error;
+	if (!dev->power.power_off_user)
+		dev->power.power_must_be_on = true;
 
 	pci_fixup_device(pci_fixup_suspend, pci_dev);
 
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -653,6 +653,12 @@ void pci_update_current_state(struct pci
 	if (dev->pm_cap) {
 		u16 pmcsr;
 
+		/*
+		 * Configuration space is not accessible for device in
+		 * D3cold, so keep in D3cold for safety
+		 */
+		if (dev->current_state == PCI_D3cold)
+			return;
 		pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
 		dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
 	} else {
@@ -671,8 +677,11 @@ static int pci_platform_power_transition
 
 	if (platform_pci_power_manageable(dev)) {
 		error = platform_pci_set_power_state(dev, state);
-		if (!error)
+		if (!error) {
+			if (state == PCI_D3cold)
+				dev->current_state = PCI_D3cold;
 			pci_update_current_state(dev, state);
+		}
 		/* Fall back to PCI_D0 if native PM is not supported */
 		if (!dev->pm_cap)
 			dev->current_state = PCI_D0;
@@ -693,8 +702,49 @@ static int pci_platform_power_transition
  */
 static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
 {
-	if (state == PCI_D0)
+	if (state == PCI_D0) {
 		pci_platform_power_transition(dev, PCI_D0);
+		/*
+		 * Mandatory power management transition delays, see
+		 * PCI Express Base Specification Revision 2.0 Section
+		 * 6.6.1: Conventional Reset.  Do not delay for
+		 * devices powered on/off by corresponding bridge,
+		 * because have already delayed for the bridge.
+		 */
+		if (dev->runtime_d3cold) {
+			msleep(dev->d3cold_delay);
+			/* When powering on a bridge from D3cold, the
+			 * whole hierarchy may be powered on into
+			 * D0uninitialized state, resume them to give
+			 * them a chance to suspend again */
+			if (dev->subordinate)
+				pci_wakeup_bus(dev->subordinate);
+		}
+	}
+}
+
+/**
+ * __pci_dev_set_current_state - Set current state of a PCI device
+ * @dev: Device to handle
+ * @data: pointer to state to be set
+ */
+static int __pci_dev_set_current_state(struct pci_dev *dev, void *data)
+{
+	pci_power_t state = *(pci_power_t *)data;
+
+	dev->current_state = state;
+	return 0;
+}
+
+/**
+ * __pci_bus_set_current_state - Walk given bus and set current state of devices
+ * @bus: Top bus of the subtree to walk.
+ * @state: state to be set
+ */
+static void __pci_bus_set_current_state(struct pci_bus *bus, pci_power_t state)
+{
+	if (bus)
+		pci_walk_bus(bus, __pci_dev_set_current_state, &state);
 }
 
 /**
@@ -706,8 +756,17 @@ static void __pci_start_power_transition
  */
 int __pci_complete_power_transition(struct pci_dev *dev, pci_power_t state)
 {
-	return state >= PCI_D0 ?
-			pci_platform_power_transition(dev, state) : -EINVAL;
+	int ret;
+
+	if (state < PCI_D0)
+		return -EINVAL;
+	ret = pci_platform_power_transition(dev, state);
+	if (!ret && state == PCI_D3cold) {
+		/* Power off the bridge may power off the whole hierarchy */
+		if (dev->subordinate)
+			__pci_bus_set_current_state(dev->subordinate, state);
+	}
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__pci_complete_power_transition);
 
@@ -731,8 +790,8 @@ int pci_set_power_state(struct pci_dev *
 	int error;
 
 	/* bound the state we're entering */
-	if (state > PCI_D3hot)
-		state = PCI_D3hot;
+	if (state > PCI_D3cold)
+		state = PCI_D3cold;
 	else if (state < PCI_D0)
 		state = PCI_D0;
 	else if ((state == PCI_D1 || state == PCI_D2) && pci_no_d1d2(dev))
@@ -747,10 +806,15 @@ int pci_set_power_state(struct pci_dev *
 
 	/* This device is quirked not to be put into D3, so
 	   don't put it in D3 */
-	if (state == PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
+	if (state >= PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
 		return 0;
 
-	error = pci_raw_set_power_state(dev, state);
+	/*
+	 * To put device in D3cold, we put device into D3hot in native
+	 * way, then put device into D3cold with platform ops
+	 */
+	error = pci_raw_set_power_state(dev, state > PCI_D3hot ?
+					PCI_D3hot : state);
 
 	if (!__pci_complete_power_transition(dev, state))
 		error = 0;
@@ -1506,6 +1570,34 @@ void pci_pme_wakeup_bus(struct pci_bus *
 }
 
 /**
+ * pci_wakeup - Wake up a PCI device
+ * @dev: Device to handle.
+ * @data: to count how many device are waken up.
+ */
+static int pci_wakeup(struct pci_dev *dev, void *data)
+{
+	unsigned int *count = data;
+
+	pci_wakeup_event(dev);
+	pm_request_resume(&dev->dev);
+	(*count)++;
+	return 0;
+}
+
+/**
+ * pci_wakeup_bus - Walk given bus and wake up devices on it
+ * @bus: Top bus of the subtree to walk.
+ */
+unsigned int pci_wakeup_bus(struct pci_bus *bus)
+{
+	unsigned int count = 0;
+
+	if (bus)
+		pci_walk_bus(bus, pci_wakeup, &count);
+	return count;
+}
+
+/**
  * pci_pme_capable - check the capability of PCI device to generate PME#
  * @dev: PCI device to handle.
  * @state: PCI state from which device will issue PME#.
@@ -1789,12 +1881,16 @@ int pci_finish_runtime_suspend(struct pc
 	if (target_state == PCI_POWER_ERROR)
 		return -EIO;
 
+	dev->runtime_d3cold = target_state == PCI_D3cold ? true : false;
+
 	__pci_enable_wake(dev, target_state, true, pci_dev_run_wake(dev));
 
 	error = pci_set_power_state(dev, target_state);
 
-	if (error)
+	if (error) {
 		__pci_enable_wake(dev, target_state, true, false);
+		dev->runtime_d3cold = false;
+	}
 
 	return error;
 }
@@ -1864,6 +1960,7 @@ void pci_pm_init(struct pci_dev *dev)
 
 	dev->pm_cap = pm;
 	dev->d3_delay = PCI_PM_D3_WAIT;
+	dev->d3cold_delay = PCI_PM_D3COLD_WAIT;
 
 	dev->d1_support = false;
 	dev->d2_support = false;
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -70,6 +70,7 @@ extern void pci_update_current_state(str
 extern void pci_disable_enabled_device(struct pci_dev *dev);
 extern int pci_finish_runtime_suspend(struct pci_dev *dev);
 extern int __pci_pme_wakeup(struct pci_dev *dev, void *ign);
+extern unsigned int pci_wakeup_bus(struct pci_bus *bus);
 extern void pci_pm_init(struct pci_dev *dev);
 extern void platform_pci_wakeup_init(struct pci_dev *dev);
 extern void pci_allocate_cap_save_buffers(struct pci_dev *dev);
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -101,10 +101,38 @@ static int pcie_port_resume_noirq(struct
 }
 
 #ifdef CONFIG_PM_RUNTIME
+struct d3cold_info {
+	bool power_must_be_on;
+	unsigned int d3cold_delay;
+};
+
+static int pci_dev_d3cold_info(struct pci_dev *pdev, void *data)
+{
+	struct d3cold_info *info = data;
+
+	info->d3cold_delay = max_t(unsigned int, pdev->d3cold_delay,
+				   info->d3cold_delay);
+	info->power_must_be_on = info->power_must_be_on ||
+		pdev->dev.power.power_must_be_on;
+	return 0;
+}
+
 static int pcie_port_runtime_suspend(struct device *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
+	struct d3cold_info d3cold_info = {
+		.power_must_be_on	= dev->power.power_must_be_on,
+		.d3cold_delay		= PCI_PM_D3_WAIT,
+	};
 
+	/*
+	 * If any subordinate device need to keep power on, we should
+	 * not power off the port.  The D3cold delay of port should be
+	 * the max of that of all subordinate devices.
+	 */
+	pci_walk_bus(pdev->subordinate, pci_dev_d3cold_info, &d3cold_info);
+	dev->power.power_must_be_on = d3cold_info.power_must_be_on;
+	pdev->d3cold_delay = d3cold_info.d3cold_delay;
 	pci_save_state(pdev);
 	return 0;
 }
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -132,9 +132,10 @@ static inline const char *pci_power_name
 	return pci_power_names[1 + (int) state];
 }
 
-#define PCI_PM_D2_DELAY	200
-#define PCI_PM_D3_WAIT	10
-#define PCI_PM_BUS_WAIT	50
+#define PCI_PM_D2_DELAY		200
+#define PCI_PM_D3_WAIT		10
+#define PCI_PM_D3COLD_WAIT	100
+#define PCI_PM_BUS_WAIT		50
 
 /** The pci_channel state describes connectivity between the CPU and
  *  the pci device.  If some PCI bus between here and the pci device
@@ -282,7 +283,12 @@ struct pci_dev {
 	unsigned int	mmio_always_on:1;	/* disallow turning off io/mem
 						   decoding during bar sizing */
 	unsigned int	wakeup_prepared:1;
+	unsigned int	runtime_d3cold:1;	/* whether go through runtime
+						   D3cold, not set for devices
+						   powered on/off by the
+						   corresponding bridge */
 	unsigned int	d3_delay;	/* D3->D0 transition time in ms */
+	unsigned int	d3cold_delay;	/* D3cold->D0 transition time in ms */
 
 #ifdef CONFIG_PCIEASPM
 	struct pcie_link_state	*link_state;	/* ASPM link state. */

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy
  2012-05-04  8:13 ` [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy Huang Ying
@ 2012-05-04 19:33   ` Rafael J. Wysocki
  2012-05-05  6:29     ` huang ying
  2012-05-04 19:50   ` Bjorn Helgaas
  1 sibling, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-04 19:33 UTC (permalink / raw)
  To: Huang Ying
  Cc: Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan, Lan Tianyu

On Friday, May 04, 2012, Huang Ying wrote:
> From: Lan Tianyu <tianyu.lan@intel.com>
> 
> Some devices can be powered off to save more power via some platform
> mechanism, e.g., ACPI.  But that may not work as expected for some
> device or platform.  So, this patch adds a sysfs file named power_off
> under <device>/power directory to provide a mechanism for user to control
> whether to allow the device to be power off.
> 
> power_off => "enabled" means allowing the device to be powered off if
> possible.
> 
> power_off => "disabled" means the device must be power on anytime.
> 
> Also add flag power_off_user to struct dev_pm_info to record users'
> choice. The bus layer can use this field to determine whether to
> power off the device.

It looks like the new attribute is added for all devices regardless of whether
or not they actually can be powered off?  If so, then please don't do that,
it's _extremely_ confusing.

If you need user space to be able to control that functionality (and I can
think of a couple of cases in which you do), there need to be 2 flags,
can_power_off and may_power_off, where the first one is set by the kernel
if it is known that power can be removed from the device - the attribute
should be created when this flag is set and removed when it is unset.

Then, the setting of the second flag will be controlled by the new attribute.

And you'll need to patch quite a few places where devices actually have that
capability, like where power domains are in use, so that's quire a lot of
work.

Thanks,
Rafael


> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> Signed-off-by: Huang Ying <ying.huang@intel.com>
> ---
>  drivers/base/power/sysfs.c |   33 +++++++++++++++++++++++++++++++++
>  include/linux/pm.h         |    1 +
>  2 files changed, 34 insertions(+)
> 
> --- a/drivers/base/power/sysfs.c
> +++ b/drivers/base/power/sysfs.c
> @@ -243,6 +243,38 @@ static ssize_t pm_qos_latency_store(stru
>  
>  static DEVICE_ATTR(pm_qos_resume_latency_us, 0644,
>  		   pm_qos_latency_show, pm_qos_latency_store);
> +
> +static ssize_t power_off_show(struct device *dev,
> +			      struct device_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%s\n",
> +		       dev->power.power_off_user ? enabled : disabled);
> +}
> +
> +static ssize_t power_off_store(struct device * dev,
> +			       struct device_attribute *attr,
> +			       const char * buf, size_t n)
> +{
> +	char *cp;
> +	int len = n;
> +	unsigned int power_off_user;
> +
> +	cp = memchr(buf, '\n', n);
> +	if (cp)
> +		len = cp - buf;
> +
> +	if (len == sizeof enabled - 1 && strncmp(buf, enabled, len) == 0)
> +		dev->power.power_off_user = true;
> +	else if (len == sizeof disabled - 1 && strncmp(buf, disabled, len) == 0)
> +		dev->power.power_off_user = false;
> +	else
> +		return -EINVAL;
> +
> +	pm_runtime_resume(dev);
> +	return n;
> +
> +}
> +static DEVICE_ATTR(power_off, 0644, power_off_show, power_off_store);
>  #endif /* CONFIG_PM_RUNTIME */
>  
>  #ifdef CONFIG_PM_SLEEP
> @@ -508,6 +540,7 @@ static struct attribute *runtime_attrs[]
>  	&dev_attr_runtime_suspended_time.attr,
>  	&dev_attr_runtime_active_time.attr,
>  	&dev_attr_autosuspend_delay_ms.attr,
> +	&dev_attr_power_off.attr,
>  #endif /* CONFIG_PM_RUNTIME */
>  	NULL,
>  };
> --- a/include/linux/pm.h
> +++ b/include/linux/pm.h
> @@ -537,6 +537,7 @@ struct dev_pm_info {
>  	unsigned int		use_autosuspend:1;
>  	unsigned int		timer_autosuspends:1;
>  	unsigned int		power_must_be_on:1;
> +	unsigned int		power_off_user:1;
>  	enum rpm_request	request;
>  	enum rpm_status		runtime_status;
>  	int			runtime_error;
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 1/5] PM, Runtime, Add power_must_be_on flag
  2012-05-04  8:13 ` [RFC v2 1/5] PM, Runtime, Add power_must_be_on flag Huang Ying
@ 2012-05-04 19:37   ` Rafael J. Wysocki
  2012-05-05  5:15     ` huang ying
  2012-05-04 19:50   ` Bjorn Helgaas
  1 sibling, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-04 19:37 UTC (permalink / raw)
  To: Huang Ying; +Cc: Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Friday, May 04, 2012, Huang Ying wrote:
> The extreme way to save device power in runtime is to turn off power
> of device.  For example, D3cold for PCIe bus and ZPODD (Zero Power
> Optical Disk Drive) for SATA bus will do that.
> 
> But sometimes power off is not expected, some possible reason is as
> follow
> 
> - power off device usually incurs longer resume latency, if it exceeds
>   power QoS requirement, power off should be disabled.
> 
> - For some buses, device in power off state can not support remote
>   wakeup.  If remote wakeup is desired, power off should be disabled.
> 
> In general, whether to put a device into power off state should be
> decided by the driver of the device, but for some buses, whether to
> put a device into power off state may be done by the parent of the
> device.  For example, a PCIe end point device may be put into power
> off state by the PCIe port connected to it.
> 
> So a flag is introduced for the children devices to tell the parent
> device, whether it should be put into power off state.
> 
> This flag is also used for device driver to tell bus layer whether it
> is OK to be powered off.
> 
> Signed-off-by: Huang Ying <ying.huang@intel.com>

I would be almost fine with this patch, if [2/5] were not present.

However, if you introduce a flag like this, you need to put checks
against it into all places where power may be removed from devices,
like the generic PM domains framework (but not only there).

Thanks,
Rafael


> ---
>  include/linux/pm.h |    1 +
>  1 file changed, 1 insertion(+)
> 
> --- a/include/linux/pm.h
> +++ b/include/linux/pm.h
> @@ -536,6 +536,7 @@ struct dev_pm_info {
>  	unsigned int		irq_safe:1;
>  	unsigned int		use_autosuspend:1;
>  	unsigned int		timer_autosuspends:1;
> +	unsigned int		power_must_be_on:1;
>  	enum rpm_request	request;
>  	enum rpm_status		runtime_status;
>  	int			runtime_error;
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port
  2012-05-04  8:13 ` [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port Huang Ying
@ 2012-05-04 19:43   ` Rafael J. Wysocki
  2012-05-05  6:46     ` huang ying
  2012-05-04 19:50   ` Bjorn Helgaas
  1 sibling, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-04 19:43 UTC (permalink / raw)
  To: Huang Ying; +Cc: Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Friday, May 04, 2012, Huang Ying wrote:
> From: Zheng Yan <zheng.z.yan@intel.com>
> 
> This patch adds runtime PM support to PCIe port.  This is needed by
> PCIe D3cold support, where PCIe device in slot may be powered on/off
> by PCIe port.
> 
> Because runtime suspend is broken for some chipset, a white list is
> used to enable runtime PM support for only chipset known to work.
> 
> Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
> Signed-off-by: Huang Ying <ying.huang@intel.com>
> ---
>  drivers/pci/pci.c              |    9 +++++++++
>  drivers/pci/pcie/portdrv_pci.c |   40 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 49 insertions(+)
> 
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1476,6 +1476,15 @@ bool pci_check_pme_status(struct pci_dev
>   */
>  static int pci_pme_wakeup(struct pci_dev *dev, void *pme_poll_reset)
>  {
> +	struct pci_dev *bridge = dev->bus->self;
> +
> +	/*
> +	 * If bridge is in low power state, the configuration space of
> +	 * subordinate devices may be not accessible

Please also say in the comment _when_ this is possible.  That's far from
obvious, because the runtime PM framework generally ensures that parents are
resumed before the children, so the comment should describe the particular
scenario leading to this situation.

> +	 */
> +	if (bridge && bridge->current_state != PCI_D0)
> +		return 0;
> +
>  	if (pme_poll_reset && dev->pme_poll)
>  		dev->pme_poll = false;
>  
> --- a/drivers/pci/pcie/portdrv_pci.c
> +++ b/drivers/pci/pcie/portdrv_pci.c
> @@ -11,6 +11,7 @@
>  #include <linux/kernel.h>
>  #include <linux/errno.h>
>  #include <linux/pm.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/init.h>
>  #include <linux/pcieport_if.h>
>  #include <linux/aer.h>
> @@ -99,6 +100,27 @@ static int pcie_port_resume_noirq(struct
>  	return 0;
>  }
>  
> +#ifdef CONFIG_PM_RUNTIME
> +static int pcie_port_runtime_suspend(struct device *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +

A comment explaining why this is needed here would be welcome.

> +	pci_save_state(pdev);
> +	return 0;
> +}
> +
> +static int pcie_port_runtime_resume(struct device *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	pci_restore_state(pdev);
> +	return 0;
> +}
> +#else
> +#define pcie_port_runtime_suspend	NULL
> +#define pcie_port_runtime_resume	NULL
> +#endif
> +
>  static const struct dev_pm_ops pcie_portdrv_pm_ops = {
>  	.suspend	= pcie_port_device_suspend,
>  	.resume		= pcie_port_device_resume,
> @@ -107,6 +129,8 @@ static const struct dev_pm_ops pcie_port
>  	.poweroff	= pcie_port_device_suspend,
>  	.restore	= pcie_port_device_resume,
>  	.resume_noirq	= pcie_port_resume_noirq,
> +	.runtime_suspend = pcie_port_runtime_suspend,
> +	.runtime_resume = pcie_port_runtime_resume,
>  };
>  
>  #define PCIE_PORTDRV_PM_OPS	(&pcie_portdrv_pm_ops)
> @@ -117,6 +141,18 @@ static const struct dev_pm_ops pcie_port
>  #endif /* !PM */
>  
>  /*
> + * PCIe port runtime suspend is broken for some chipset, so use a
> + * white list to disable runtime PM for these chipset.
> + */
> +static const struct pci_device_id port_runtime_pm_pci_ids[] = {
> +	{ PCI_VDEVICE(INTEL, 0x1e10), },
> +	{ PCI_VDEVICE(INTEL, 0x1e16), },
> +	{ PCI_VDEVICE(INTEL, 0x1e18), },
> +	{ PCI_VDEVICE(INTEL, 0x1e1c), },
> +	{ /* end: all zeroes */ }
> +};
> +
> +/*
>   * pcie_portdrv_probe - Probe PCI-Express port devices
>   * @dev: PCI-Express port device being probed
>   *
> @@ -144,12 +180,16 @@ static int __devinit pcie_portdrv_probe(
>  		return status;
>  
>  	pci_save_state(dev);
> +	if (pci_match_id(port_runtime_pm_pci_ids, dev))
> +		pm_runtime_put_noidle(&dev->dev);
>  
>  	return 0;
>  }
>  
>  static void pcie_portdrv_remove(struct pci_dev *dev)
>  {
> +	if (pci_match_id(port_runtime_pm_pci_ids, dev))
> +		pm_runtime_get_noresume(&dev->dev);
>  	pcie_port_device_remove(dev);
>  	pci_disable_device(dev);
>  }

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 1/5] PM, Runtime, Add power_must_be_on flag
  2012-05-04  8:13 ` [RFC v2 1/5] PM, Runtime, Add power_must_be_on flag Huang Ying
  2012-05-04 19:37   ` Rafael J. Wysocki
@ 2012-05-04 19:50   ` Bjorn Helgaas
  2012-05-05  5:59     ` huang ying
  1 sibling, 1 reply; 47+ messages in thread
From: Bjorn Helgaas @ 2012-05-04 19:50 UTC (permalink / raw)
  To: Huang Ying
  Cc: ming.m.lin, linux-kernel, linux-pm, Rafael J. Wysocki, Zheng Yan

On Fri, May 4, 2012 at 2:13 AM, Huang Ying <ying.huang@intel.com> wrote:
> The extreme way to save device power in runtime is to turn off power
> of device.  For example, D3cold for PCIe bus and ZPODD (Zero Power
> Optical Disk Drive) for SATA bus will do that.
>
> But sometimes power off is not expected, some possible reason is as
> follow
>
> - power off device usually incurs longer resume latency, if it exceeds
>  power QoS requirement, power off should be disabled.
>
> - For some buses, device in power off state can not support remote
>  wakeup.  If remote wakeup is desired, power off should be disabled.
>
> In general, whether to put a device into power off state should be
> decided by the driver of the device, but for some buses, whether to
> put a device into power off state may be done by the parent of the
> device.  For example, a PCIe end point device may be put into power
> off state by the PCIe port connected to it.
>
> So a flag is introduced for the children devices to tell the parent
> device, whether it should be put into power off state.
>
> This flag is also used for device driver to tell bus layer whether it
> is OK to be powered off.
>
> Signed-off-by: Huang Ying <ying.huang@intel.com>
> ---
>  include/linux/pm.h |    1 +
>  1 file changed, 1 insertion(+)
>
> --- a/include/linux/pm.h
> +++ b/include/linux/pm.h
> @@ -536,6 +536,7 @@ struct dev_pm_info {
>        unsigned int            irq_safe:1;
>        unsigned int            use_autosuspend:1;
>        unsigned int            timer_autosuspends:1;
> +       unsigned int            power_must_be_on:1;
>        enum rpm_request        request;
>        enum rpm_status         runtime_status;
>        int                     runtime_error;

It's a little weird to just add the field, with no users.  Would it
make sense to pull out the bits of patch 5 that use this and combine
them into a single smaller patch?  But see related comments there; it
might be safer to have a function that computes this whenever you need
it instead of caching the value.

Bjorn

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port
  2012-05-04  8:13 ` [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port Huang Ying
  2012-05-04 19:43   ` Rafael J. Wysocki
@ 2012-05-04 19:50   ` Bjorn Helgaas
  2012-05-04 20:55     ` Rafael J. Wysocki
  2012-05-05  6:53     ` huang ying
  1 sibling, 2 replies; 47+ messages in thread
From: Bjorn Helgaas @ 2012-05-04 19:50 UTC (permalink / raw)
  To: Huang Ying
  Cc: ming.m.lin, linux-kernel, linux-pm, Rafael J. Wysocki, Zheng Yan

On Fri, May 4, 2012 at 2:13 AM, Huang Ying <ying.huang@intel.com> wrote:
> From: Zheng Yan <zheng.z.yan@intel.com>
>
> This patch adds runtime PM support to PCIe port.  This is needed by
> PCIe D3cold support, where PCIe device in slot may be powered on/off
> by PCIe port.

I assume this works for integrated PCIe devices as well as those that
are plugged into a slot and can be physically removed -- maybe the
text "in slot" is superfluous?

> Because runtime suspend is broken for some chipset, a white list is
> used to enable runtime PM support for only chipset known to work.

A whitelist requires perpetual maintenance.  Every time a new working
chipset comes out, you have to update the whitelist.  That doesn't
seem right.

> Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
> Signed-off-by: Huang Ying <ying.huang@intel.com>
> ---
>  drivers/pci/pci.c              |    9 +++++++++
>  drivers/pci/pcie/portdrv_pci.c |   40 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 49 insertions(+)
>
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1476,6 +1476,15 @@ bool pci_check_pme_status(struct pci_dev
>  */
>  static int pci_pme_wakeup(struct pci_dev *dev, void *pme_poll_reset)
>  {
> +       struct pci_dev *bridge = dev->bus->self;
> +
> +       /*
> +        * If bridge is in low power state, the configuration space of
> +        * subordinate devices may be not accessible
> +        */

This comment would make more sense as "If the upstream bridge is in a
low power state, the configuration space of this device is not
accessible."

> +       if (bridge && bridge->current_state != PCI_D0)
> +               return 0;
> +
>        if (pme_poll_reset && dev->pme_poll)
>                dev->pme_poll = false;
>
> --- a/drivers/pci/pcie/portdrv_pci.c
> +++ b/drivers/pci/pcie/portdrv_pci.c
> @@ -11,6 +11,7 @@
>  #include <linux/kernel.h>
>  #include <linux/errno.h>
>  #include <linux/pm.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/init.h>
>  #include <linux/pcieport_if.h>
>  #include <linux/aer.h>
> @@ -99,6 +100,27 @@ static int pcie_port_resume_noirq(struct
>        return 0;
>  }
>
> +#ifdef CONFIG_PM_RUNTIME
> +static int pcie_port_runtime_suspend(struct device *dev)
> +{
> +       struct pci_dev *pdev = to_pci_dev(dev);
> +
> +       pci_save_state(pdev);
> +       return 0;
> +}
> +
> +static int pcie_port_runtime_resume(struct device *dev)
> +{
> +       struct pci_dev *pdev = to_pci_dev(dev);
> +
> +       pci_restore_state(pdev);
> +       return 0;
> +}
> +#else
> +#define pcie_port_runtime_suspend      NULL
> +#define pcie_port_runtime_resume       NULL
> +#endif
> +
>  static const struct dev_pm_ops pcie_portdrv_pm_ops = {
>        .suspend        = pcie_port_device_suspend,
>        .resume         = pcie_port_device_resume,
> @@ -107,6 +129,8 @@ static const struct dev_pm_ops pcie_port
>        .poweroff       = pcie_port_device_suspend,
>        .restore        = pcie_port_device_resume,
>        .resume_noirq   = pcie_port_resume_noirq,
> +       .runtime_suspend = pcie_port_runtime_suspend,
> +       .runtime_resume = pcie_port_runtime_resume,
>  };
>
>  #define PCIE_PORTDRV_PM_OPS    (&pcie_portdrv_pm_ops)
> @@ -117,6 +141,18 @@ static const struct dev_pm_ops pcie_port
>  #endif /* !PM */
>
>  /*
> + * PCIe port runtime suspend is broken for some chipset, so use a
> + * white list to disable runtime PM for these chipset.

If you're *disabling* runtime PM for these chipsets, I'd call this a
blacklist (and a blacklist makes more sense to me from a maintenance
standpoint, because you only have to update it when new broken chips
are discovered).

> + */
> +static const struct pci_device_id port_runtime_pm_pci_ids[] = {
> +       { PCI_VDEVICE(INTEL, 0x1e10), },
> +       { PCI_VDEVICE(INTEL, 0x1e16), },
> +       { PCI_VDEVICE(INTEL, 0x1e18), },
> +       { PCI_VDEVICE(INTEL, 0x1e1c), },
> +       { /* end: all zeroes */ }
> +};
> +
> +/*
>  * pcie_portdrv_probe - Probe PCI-Express port devices
>  * @dev: PCI-Express port device being probed
>  *
> @@ -144,12 +180,16 @@ static int __devinit pcie_portdrv_probe(
>                return status;
>
>        pci_save_state(dev);
> +       if (pci_match_id(port_runtime_pm_pci_ids, dev))
> +               pm_runtime_put_noidle(&dev->dev);
>
>        return 0;
>  }
>
>  static void pcie_portdrv_remove(struct pci_dev *dev)
>  {
> +       if (pci_match_id(port_runtime_pm_pci_ids, dev))
> +               pm_runtime_get_noresume(&dev->dev);
>        pcie_port_device_remove(dev);
>        pci_disable_device(dev);
>  }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy
  2012-05-04  8:13 ` [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy Huang Ying
  2012-05-04 19:33   ` Rafael J. Wysocki
@ 2012-05-04 19:50   ` Bjorn Helgaas
  2012-05-04 21:00     ` Rafael J. Wysocki
  2012-05-05  6:36     ` huang ying
  1 sibling, 2 replies; 47+ messages in thread
From: Bjorn Helgaas @ 2012-05-04 19:50 UTC (permalink / raw)
  To: Huang Ying
  Cc: ming.m.lin, linux-kernel, linux-pm, Rafael J. Wysocki, Zheng Yan,
	Lan Tianyu

On Fri, May 4, 2012 at 2:13 AM, Huang Ying <ying.huang@intel.com> wrote:
> From: Lan Tianyu <tianyu.lan@intel.com>
>
> Some devices can be powered off to save more power via some platform
> mechanism, e.g., ACPI.  But that may not work as expected for some
> device or platform.  So, this patch adds a sysfs file named power_off
> under <device>/power directory to provide a mechanism for user to control
> whether to allow the device to be power off.
>
> power_off => "enabled" means allowing the device to be powered off if
> possible.
>
> power_off => "disabled" means the device must be power on anytime.
>
> Also add flag power_off_user to struct dev_pm_info to record users'
> choice. The bus layer can use this field to determine whether to
> power off the device.

My first thought was that writing to "power_off" would actually turn
the power off, which isn't true.  Maybe something like
"poweroff_allowed" would work.

I think there's only one use of this new field, in
pci_pm_runtime_suspend().  Maybe you could pull out that hunk from
patch 5, combine it with this one, and move it to after patch 5?

> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> Signed-off-by: Huang Ying <ying.huang@intel.com>
> ---
>  drivers/base/power/sysfs.c |   33 +++++++++++++++++++++++++++++++++
>  include/linux/pm.h         |    1 +
>  2 files changed, 34 insertions(+)
>
> --- a/drivers/base/power/sysfs.c
> +++ b/drivers/base/power/sysfs.c
> @@ -243,6 +243,38 @@ static ssize_t pm_qos_latency_store(stru
>
>  static DEVICE_ATTR(pm_qos_resume_latency_us, 0644,
>                   pm_qos_latency_show, pm_qos_latency_store);
> +
> +static ssize_t power_off_show(struct device *dev,
> +                             struct device_attribute *attr, char *buf)
> +{
> +       return sprintf(buf, "%s\n",
> +                      dev->power.power_off_user ? enabled : disabled);
> +}
> +
> +static ssize_t power_off_store(struct device * dev,
> +                              struct device_attribute *attr,
> +                              const char * buf, size_t n)
> +{
> +       char *cp;
> +       int len = n;
> +       unsigned int power_off_user;
> +
> +       cp = memchr(buf, '\n', n);
> +       if (cp)
> +               len = cp - buf;
> +
> +       if (len == sizeof enabled - 1 && strncmp(buf, enabled, len) == 0)
> +               dev->power.power_off_user = true;
> +       else if (len == sizeof disabled - 1 && strncmp(buf, disabled, len) == 0)
> +               dev->power.power_off_user = false;
> +       else
> +               return -EINVAL;
> +
> +       pm_runtime_resume(dev);
> +       return n;
> +
> +}
> +static DEVICE_ATTR(power_off, 0644, power_off_show, power_off_store);
>  #endif /* CONFIG_PM_RUNTIME */
>
>  #ifdef CONFIG_PM_SLEEP
> @@ -508,6 +540,7 @@ static struct attribute *runtime_attrs[]
>        &dev_attr_runtime_suspended_time.attr,
>        &dev_attr_runtime_active_time.attr,
>        &dev_attr_autosuspend_delay_ms.attr,
> +       &dev_attr_power_off.attr,
>  #endif /* CONFIG_PM_RUNTIME */
>        NULL,
>  };
> --- a/include/linux/pm.h
> +++ b/include/linux/pm.h
> @@ -537,6 +537,7 @@ struct dev_pm_info {
>        unsigned int            use_autosuspend:1;
>        unsigned int            timer_autosuspends:1;
>        unsigned int            power_must_be_on:1;
> +       unsigned int            power_off_user:1;

This name definitely doesn't suggest anything useful  I think
"poweroff_allowed" or similar would make a lot more sense when reading
the code.

>        enum rpm_request        request;
>        enum rpm_status         runtime_status;
>        int                     runtime_error;

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 5/5] PCIe, Add PCIe runtime D3cold support
  2012-05-04  8:13 ` [RFC v2 5/5] PCIe, Add PCIe runtime D3cold support Huang Ying
@ 2012-05-04 19:51   ` Bjorn Helgaas
  2012-05-05  7:34     ` huang ying
  2012-05-04 20:50   ` Rafael J. Wysocki
  1 sibling, 1 reply; 47+ messages in thread
From: Bjorn Helgaas @ 2012-05-04 19:51 UTC (permalink / raw)
  To: Huang Ying
  Cc: ming.m.lin, linux-kernel, linux-pm, Rafael J. Wysocki, Zheng Yan

On Fri, May 4, 2012 at 2:13 AM, Huang Ying <ying.huang@intel.com> wrote:
> This patch adds runtime D3cold support to PCIe bus.  D3cold is the
> deepest power saving state for PCIe devices.  Where the main PCIe link
> will be powered off totally, so before the PCIe main link is powered
> on again, you can not access the device, even the configuration space,
> which is still accessible in D3hot.  Because the main PCIe link is not
> available, the PCI PM registers can not be used to put device into/out
> of D3cold state, that will be done by platform logic such as ACPI
> _PR3.
>
> To support remote wake up in D3cold, aux power is supplied, and a
> side-band pin: WAKE# is used to notify remote wake up event, the pin
> is usually connected to platform logic such as ACPI GPE.  This is
> quite different with other power saving states, where remote wake up
> is notified via PME message via the PCIe main link.
>
> PCIe devices in plug-in slot usually has no direct platform logic, for
> example, there is usually no ACPI _PR3 for them.  The D3cold support
> for these devices can be done via the PCIe port connected to it.  When
> power on/off the PCIe port, the corresponding PCIe devices are powered
> on/off too.  And the remote wake up event from PCIe device will be
> notified to the corresponding PCIe port.
>
> The PCI subsystem D3cold support and corresponding ACPI platform
> support is implemented in the patch.  Including the support for PCIe
> devices in plug-in slot.
>
> For more information about PCIe D3cold and corresponding ACPI support,
> please refer to:
>
> - PCI Express Base Specification Revision 2.0
> - Advanced Configuration and Power Interface Specification Revision 5.0
>
> Originally-by: Zheng Yan <zheng.z.yan@intel.com>
> Signed-off-by: Huang Ying <ying.huang@intel.com>
> ---
>  drivers/pci/pci-acpi.c         |   32 +++++++++--
>  drivers/pci/pci-driver.c       |    3 +
>  drivers/pci/pci.c              |  115 +++++++++++++++++++++++++++++++++++++----
>  drivers/pci/pci.h              |    1
>  drivers/pci/pcie/portdrv_pci.c |   28 +++++++++
>  include/linux/pci.h            |   12 +++-
>  6 files changed, 175 insertions(+), 16 deletions(-)
>
> --- a/drivers/pci/pci-acpi.c
> +++ b/drivers/pci/pci-acpi.c
> @@ -48,6 +48,21 @@ static void pci_acpi_wake_dev(acpi_handl
>        if (event != ACPI_NOTIFY_DEVICE_WAKE || !pci_dev)
>                return;
>
> +       if (pci_dev->current_state == PCI_D3cold) {
> +               unsigned int count = 0;
> +
> +               /*
> +                * Powering on bridge need to resume whole hierarchy,
> +                * just resume the children to avoid the bridge going
> +                * suspending as soon as resumed
> +                */ at
> +               if (pci_dev->subordinate)
> +                       count = pci_wakeup_bus(pci_dev->subordinate);
> +               if (count == 0)
> +                       pm_runtime_resume(&pci_dev->dev);
> +               return;
> +       }
> +
>        if (!pci_dev->pm_cap || !pci_dev->pme_support
>             || pci_check_pme_status(pci_dev)) {
>                if (pci_dev->pme_poll)
> @@ -187,10 +202,13 @@ acpi_status pci_acpi_remove_pm_notifier(
>
>  static pci_power_t acpi_pci_choose_state(struct pci_dev *pdev)
>  {
> -       int acpi_state;
> +       int acpi_state, d_max;
>
> -       acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL,
> -                                               ACPI_STATE_D3);
> +       if (pdev->dev.power.power_must_be_on)
> +               d_max = ACPI_STATE_D3_HOT;
> +       else
> +               d_max = ACPI_STATE_D3_COLD;
> +       acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL, d_max);
>        if (acpi_state < 0)
>                return PCI_POWER_ERROR;
>
> @@ -297,7 +315,13 @@ static void acpi_pci_propagate_run_wake(
>
>  static int acpi_pci_run_wake(struct pci_dev *dev, bool enable)
>  {
> -       if (dev->pme_interrupt)
> +       /*
> +        * Per PCI Express Base Specification Revision 2.0 section
> +        * 5.3.3.2 Link Wakeup, platform support is needed for D3cold
> +        * waking up to power on the main link even if there is PME
> +        * support for D3cold
> +        */
> +       if (dev->pme_interrupt && !dev->runtime_d3cold)
>                return 0;
>
>        if (!acpi_pm_device_run_wake(&dev->dev, enable))
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -1013,10 +1013,13 @@ static int pci_pm_runtime_suspend(struct
>        if (!pm || !pm->runtime_suspend)
>                return -ENOSYS;
>
> +       dev->power.power_must_be_on = false;
>        error = pm->runtime_suspend(dev);
>        suspend_report_result(pm->runtime_suspend, error);
>        if (error)
>                return error;
> +       if (!dev->power.power_off_user)
> +               dev->power.power_must_be_on = true;
>
>        pci_fixup_device(pci_fixup_suspend, pci_dev);
>
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -653,6 +653,12 @@ void pci_update_current_state(struct pci
>        if (dev->pm_cap) {
>                u16 pmcsr;
>
> +               /*
> +                * Configuration space is not accessible for device in
> +                * D3cold, so keep in D3cold for safety
> +                */
> +               if (dev->current_state == PCI_D3cold)
> +                       return;
>                pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
>                dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
>        } else {
> @@ -671,8 +677,11 @@ static int pci_platform_power_transition
>
>        if (platform_pci_power_manageable(dev)) {
>                error = platform_pci_set_power_state(dev, state);
> -               if (!error)
> +               if (!error) {
> +                       if (state == PCI_D3cold)
> +                               dev->current_state = PCI_D3cold;
>                        pci_update_current_state(dev, state);
> +               }
>                /* Fall back to PCI_D0 if native PM is not supported */
>                if (!dev->pm_cap)
>                        dev->current_state = PCI_D0;
> @@ -693,8 +702,49 @@ static int pci_platform_power_transition
>  */
>  static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
>  {
> -       if (state == PCI_D0)
> +       if (state == PCI_D0) {
>                pci_platform_power_transition(dev, PCI_D0);
> +               /*
> +                * Mandatory power management transition delays, see
> +                * PCI Express Base Specification Revision 2.0 Section
> +                * 6.6.1: Conventional Reset.  Do not delay for
> +                * devices powered on/off by corresponding bridge,
> +                * because have already delayed for the bridge.
> +                */
> +               if (dev->runtime_d3cold) {
> +                       msleep(dev->d3cold_delay);
> +                       /* When powering on a bridge from D3cold, the
> +                        * whole hierarchy may be powered on into
> +                        * D0uninitialized state, resume them to give
> +                        * them a chance to suspend again */
> +                       if (dev->subordinate)
> +                               pci_wakeup_bus(dev->subordinate);
> +               }
> +       }
> +}
> +
> +/**
> + * __pci_dev_set_current_state - Set current state of a PCI device
> + * @dev: Device to handle
> + * @data: pointer to state to be set
> + */
> +static int __pci_dev_set_current_state(struct pci_dev *dev, void *data)
> +{
> +       pci_power_t state = *(pci_power_t *)data;
> +
> +       dev->current_state = state;
> +       return 0;
> +}
> +
> +/**
> + * __pci_bus_set_current_state - Walk given bus and set current state of devices
> + * @bus: Top bus of the subtree to walk.
> + * @state: state to be set
> + */
> +static void __pci_bus_set_current_state(struct pci_bus *bus, pci_power_t state)
> +{
> +       if (bus)
> +               pci_walk_bus(bus, __pci_dev_set_current_state, &state);
>  }
>
>  /**
> @@ -706,8 +756,17 @@ static void __pci_start_power_transition
>  */
>  int __pci_complete_power_transition(struct pci_dev *dev, pci_power_t state)
>  {
> -       return state >= PCI_D0 ?
> -                       pci_platform_power_transition(dev, state) : -EINVAL;
> +       int ret;
> +
> +       if (state < PCI_D0)
> +               return -EINVAL;
> +       ret = pci_platform_power_transition(dev, state);
> +       if (!ret && state == PCI_D3cold) {
> +               /* Power off the bridge may power off the whole hierarchy */
> +               if (dev->subordinate)
> +                       __pci_bus_set_current_state(dev->subordinate, state);
> +       }
> +       return ret;
>  }
>  EXPORT_SYMBOL_GPL(__pci_complete_power_transition);
>
> @@ -731,8 +790,8 @@ int pci_set_power_state(struct pci_dev *
>        int error;
>
>        /* bound the state we're entering */
> -       if (state > PCI_D3hot)
> -               state = PCI_D3hot;
> +       if (state > PCI_D3cold)
> +               state = PCI_D3cold;
>        else if (state < PCI_D0)
>                state = PCI_D0;
>        else if ((state == PCI_D1 || state == PCI_D2) && pci_no_d1d2(dev))
> @@ -747,10 +806,15 @@ int pci_set_power_state(struct pci_dev *
>
>        /* This device is quirked not to be put into D3, so
>           don't put it in D3 */
> -       if (state == PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
> +       if (state >= PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
>                return 0;
>
> -       error = pci_raw_set_power_state(dev, state);
> +       /*
> +        * To put device in D3cold, we put device into D3hot in native
> +        * way, then put device into D3cold with platform ops
> +        */
> +       error = pci_raw_set_power_state(dev, state > PCI_D3hot ?
> +                                       PCI_D3hot : state);
>
>        if (!__pci_complete_power_transition(dev, state))
>                error = 0;
> @@ -1506,6 +1570,34 @@ void pci_pme_wakeup_bus(struct pci_bus *
>  }
>
>  /**
> + * pci_wakeup - Wake up a PCI device
> + * @dev: Device to handle.
> + * @data: to count how many device are waken up.
> + */
> +static int pci_wakeup(struct pci_dev *dev, void *data)
> +{
> +       unsigned int *count = data;
> +
> +       pci_wakeup_event(dev);
> +       pm_request_resume(&dev->dev);
> +       (*count)++;
> +       return 0;
> +}
> +
> +/**
> + * pci_wakeup_bus - Walk given bus and wake up devices on it
> + * @bus: Top bus of the subtree to walk.
> + */
> +unsigned int pci_wakeup_bus(struct pci_bus *bus)
> +{
> +       unsigned int count = 0;
> +
> +       if (bus)
> +               pci_walk_bus(bus, pci_wakeup, &count);
> +       return count;
> +}
> +
> +/**
>  * pci_pme_capable - check the capability of PCI device to generate PME#
>  * @dev: PCI device to handle.
>  * @state: PCI state from which device will issue PME#.
> @@ -1789,12 +1881,16 @@ int pci_finish_runtime_suspend(struct pc
>        if (target_state == PCI_POWER_ERROR)
>                return -EIO;
>
> +       dev->runtime_d3cold = target_state == PCI_D3cold ? true : false;
> +
>        __pci_enable_wake(dev, target_state, true, pci_dev_run_wake(dev));
>
>        error = pci_set_power_state(dev, target_state);
>
> -       if (error)
> +       if (error) {
>                __pci_enable_wake(dev, target_state, true, false);
> +               dev->runtime_d3cold = false;
> +       }
>
>        return error;
>  }
> @@ -1864,6 +1960,7 @@ void pci_pm_init(struct pci_dev *dev)
>
>        dev->pm_cap = pm;
>        dev->d3_delay = PCI_PM_D3_WAIT;
> +       dev->d3cold_delay = PCI_PM_D3COLD_WAIT;
>
>        dev->d1_support = false;
>        dev->d2_support = false;
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -70,6 +70,7 @@ extern void pci_update_current_state(str
>  extern void pci_disable_enabled_device(struct pci_dev *dev);
>  extern int pci_finish_runtime_suspend(struct pci_dev *dev);
>  extern int __pci_pme_wakeup(struct pci_dev *dev, void *ign);
> +extern unsigned int pci_wakeup_bus(struct pci_bus *bus);
>  extern void pci_pm_init(struct pci_dev *dev);
>  extern void platform_pci_wakeup_init(struct pci_dev *dev);
>  extern void pci_allocate_cap_save_buffers(struct pci_dev *dev);
> --- a/drivers/pci/pcie/portdrv_pci.c
> +++ b/drivers/pci/pcie/portdrv_pci.c
> @@ -101,10 +101,38 @@ static int pcie_port_resume_noirq(struct
>  }
>
>  #ifdef CONFIG_PM_RUNTIME
> +struct d3cold_info {
> +       bool power_must_be_on;
> +       unsigned int d3cold_delay;
> +};
> +
> +static int pci_dev_d3cold_info(struct pci_dev *pdev, void *data)
> +{
> +       struct d3cold_info *info = data;
> +
> +       info->d3cold_delay = max_t(unsigned int, pdev->d3cold_delay,
> +                                  info->d3cold_delay);
> +       info->power_must_be_on = info->power_must_be_on ||
> +               pdev->dev.power.power_must_be_on;

We don't care what the previous state of info->power_must_be_on was,
so to me, this would read more naturally as:

    if (pdev->dev.power.power_must_be_on)
        info->power_must_be_on = true;

> +       return 0;
> +}
> +
>  static int pcie_port_runtime_suspend(struct device *dev)
>  {
>        struct pci_dev *pdev = to_pci_dev(dev);
> +       struct d3cold_info d3cold_info = {
> +               .power_must_be_on       = dev->power.power_must_be_on,
> +               .d3cold_delay           = PCI_PM_D3_WAIT,
> +       };
>
> +       /*
> +        * If any subordinate device need to keep power on, we should
> +        * not power off the port.  The D3cold delay of port should be
> +        * the max of that of all subordinate devices.
> +        */
> +       pci_walk_bus(pdev->subordinate, pci_dev_d3cold_info, &d3cold_info);
> +       dev->power.power_must_be_on = d3cold_info.power_must_be_on;
> +       pdev->d3cold_delay = d3cold_info.d3cold_delay;

Maybe you've thought about how this works in the presence of hot-added
devices below pdev, but it's not obvious to me that it's safe to cache
power_must_be_on and d3cold_delay in the struct device and struct
pci_dev.  Are they guaranteed to be updated when a new device is
hot-added?

>        pci_save_state(pdev);
>        return 0;
>  }
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -132,9 +132,10 @@ static inline const char *pci_power_name
>        return pci_power_names[1 + (int) state];
>  }
>
> -#define PCI_PM_D2_DELAY        200
> -#define PCI_PM_D3_WAIT 10
> -#define PCI_PM_BUS_WAIT        50
> +#define PCI_PM_D2_DELAY                200
> +#define PCI_PM_D3_WAIT         10
> +#define PCI_PM_D3COLD_WAIT     100
> +#define PCI_PM_BUS_WAIT                50
>
>  /** The pci_channel state describes connectivity between the CPU and
>  *  the pci device.  If some PCI bus between here and the pci device
> @@ -282,7 +283,12 @@ struct pci_dev {
>        unsigned int    mmio_always_on:1;       /* disallow turning off io/mem
>                                                   decoding during bar sizing */
>        unsigned int    wakeup_prepared:1;
> +       unsigned int    runtime_d3cold:1;       /* whether go through runtime
> +                                                  D3cold, not set for devices
> +                                                  powered on/off by the
> +                                                  corresponding bridge */
>        unsigned int    d3_delay;       /* D3->D0 transition time in ms */
> +       unsigned int    d3cold_delay;   /* D3cold->D0 transition time in ms */
>
>  #ifdef CONFIG_PCIEASPM
>        struct pcie_link_state  *link_state;    /* ASPM link state. */

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 4/5] ACPI, PM, Specify lowest allowed state for device sleep state
  2012-05-04  8:13 ` [RFC v2 4/5] ACPI, PM, Specify lowest allowed state for device sleep state Huang Ying
@ 2012-05-04 20:10   ` Rafael J. Wysocki
  2012-05-05  7:25     ` huang ying
  0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-04 20:10 UTC (permalink / raw)
  To: Huang Ying; +Cc: Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Friday, May 04, 2012, Huang Ying wrote:
> Lower device sleep state can save more power, but has more exit
> latency too.  Sometimes, to satisfy some power QoS and other
> requirement, we need to constrain the lowest device sleep state.
> 
> In this patch, a parameter to specify lowest allowed state for
> acpi_pm_device_sleep_state is added.  So that the caller can enforce
> the constraint via the parameter.
> 
> Signed-off-by: Huang Ying <ying.huang@intel.com>
> ---
>  drivers/acpi/sleep.c       |   18 +++++++++++++++---
>  drivers/pci/pci-acpi.c     |    3 ++-
>  drivers/pnp/pnpacpi/core.c |    4 ++--
>  include/acpi/acpi_bus.h    |    6 +++---
>  4 files changed, 22 insertions(+), 9 deletions(-)
> 
> --- a/drivers/acpi/sleep.c
> +++ b/drivers/acpi/sleep.c
> @@ -677,6 +677,7 @@ int acpi_suspend(u32 acpi_state)
>   *	@dev: device to examine; its driver model wakeup flags control
>   *		whether it should be able to wake up the system
>   *	@d_min_p: used to store the upper limit of allowed states range
> + *	@d_max_in: specify the lowest allowed states
>   *	Return value: preferred power state of the device on success, -ENODEV on
>   *		failure (ie. if there's no 'struct acpi_device' for @dev)
>   *
> @@ -693,7 +694,7 @@ int acpi_suspend(u32 acpi_state)
>   *	via @wake.
>   */
>  
> -int acpi_pm_device_sleep_state(struct device *dev, int *d_min_p)
> +int acpi_pm_device_sleep_state(struct device *dev, int *d_min_p, int d_max_in)
>  {
>  	acpi_handle handle = DEVICE_ACPI_HANDLE(dev);
>  	struct acpi_device *adev;
> @@ -704,11 +705,14 @@ int acpi_pm_device_sleep_state(struct de
>  		printk(KERN_DEBUG "ACPI handle has no context!\n");
>  		return -ENODEV;
>  	}
> +	d_max_in = clamp_t(int, d_max_in, ACPI_STATE_D0, ACPI_STATE_D3);

Shouldn't that be clamp_val(), rather?

>  
>  	acpi_method[2] = '0' + acpi_target_sleep_state;
>  	/*
> -	 * If the sleep state is S0, we will return D3, but if the device has
> -	 * _S0W, we will use the value from _S0W
> +	 * If the sleep state is S0, the lowest limit from ACPI is D3,
> +	 * but if the device has _S0W, we will use the value from _S0W
> +	 * as the lowest limit from ACPI.  Finally, we will constrain
> +	 * the lowest limit with the specified one.
>  	 */
>  	d_min = ACPI_STATE_D0;
>  	d_max = ACPI_STATE_D3;
> @@ -754,6 +758,14 @@ int acpi_pm_device_sleep_state(struct de
>  
>  	if (d_min_p)
>  		*d_min_p = d_min;
> +	/* constrain d_max with specified lowest limit (max number) */
> +	if (d_max > d_max_in) {
> +		d_max = d_max_in;
> +		for (;d_max > d_min; d_max--) {

Well, why didn't you do

+		for (d_max = d_max_in; d_max > d_min; d_max--)

> +			if (adev->power.states[d_max].flags.valid)
> +				break;
> +		}
> +	}

And what if d_min > d_max_in ?

>  	return d_max;
>  }
>  #endif /* CONFIG_PM */
> --- a/drivers/pci/pci-acpi.c
> +++ b/drivers/pci/pci-acpi.c
> @@ -189,7 +189,8 @@ static pci_power_t acpi_pci_choose_state
>  {
>  	int acpi_state;
>  
> -	acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL);
> +	acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL,
> +						ACPI_STATE_D3);
>  	if (acpi_state < 0)
>  		return PCI_POWER_ERROR;
>  
> --- a/drivers/pnp/pnpacpi/core.c
> +++ b/drivers/pnp/pnpacpi/core.c
> @@ -170,8 +170,8 @@ static int pnpacpi_suspend(struct pnp_de
>  	}
>  
>  	if (acpi_bus_power_manageable(handle)) {
> -		int power_state = acpi_pm_device_sleep_state(&dev->dev, NULL);
> -
> +		int power_state = acpi_pm_device_sleep_state(&dev->dev, NULL,
> +							     ACPI_STATE_D3);
>  		if (power_state < 0)
>  			power_state = (state.event == PM_EVENT_ON) ?
>  					ACPI_STATE_D0 : ACPI_STATE_D3;
> --- a/include/acpi/acpi_bus.h
> +++ b/include/acpi/acpi_bus.h
> @@ -383,13 +383,13 @@ int acpi_enable_wakeup_device_power(stru
>  int acpi_disable_wakeup_device_power(struct acpi_device *dev);
>  
>  #ifdef CONFIG_PM
> -int acpi_pm_device_sleep_state(struct device *, int *);
> +int acpi_pm_device_sleep_state(struct device *, int *, int);
>  #else
> -static inline int acpi_pm_device_sleep_state(struct device *d, int *p)
> +static inline int acpi_pm_device_sleep_state(struct device *d, int *p, int m)
>  {
>  	if (p)
>  		*p = ACPI_STATE_D0;
> -	return ACPI_STATE_D3;
> +	return m == ACPI_STATE_D3 ? m : ACPI_STATE_D0;

Shouldn't m be returned (so long as it is between D0 and D3 inclusive)?  IOW:

+	return (m >= ACPI_STATE_D0 && m <= ACPI_STATE_D3) ? m : ACPI_STATE_D0;

>  }
>  #endif

Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 5/5] PCIe, Add PCIe runtime D3cold support
  2012-05-04  8:13 ` [RFC v2 5/5] PCIe, Add PCIe runtime D3cold support Huang Ying
  2012-05-04 19:51   ` Bjorn Helgaas
@ 2012-05-04 20:50   ` Rafael J. Wysocki
  2012-05-05  8:08     ` huang ying
  1 sibling, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-04 20:50 UTC (permalink / raw)
  To: Huang Ying; +Cc: Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Friday, May 04, 2012, Huang Ying wrote:
> This patch adds runtime D3cold support to PCIe bus.  D3cold is the
> deepest power saving state for PCIe devices.  Where the main PCIe link
> will be powered off totally, so before the PCIe main link is powered
> on again, you can not access the device, even the configuration space,
> which is still accessible in D3hot.  Because the main PCIe link is not
> available, the PCI PM registers can not be used to put device into/out
> of D3cold state, that will be done by platform logic such as ACPI
> _PR3.
> 
> To support remote wake up in D3cold, aux power is supplied, and a
> side-band pin: WAKE# is used to notify remote wake up event, the pin
> is usually connected to platform logic such as ACPI GPE.  This is
> quite different with other power saving states, where remote wake up
> is notified via PME message via the PCIe main link.
> 
> PCIe devices in plug-in slot usually has no direct platform logic, for
> example, there is usually no ACPI _PR3 for them.  The D3cold support
> for these devices can be done via the PCIe port connected to it.  When
> power on/off the PCIe port, the corresponding PCIe devices are powered
> on/off too.  And the remote wake up event from PCIe device will be
> notified to the corresponding PCIe port.
> 
> The PCI subsystem D3cold support and corresponding ACPI platform
> support is implemented in the patch.  Including the support for PCIe
> devices in plug-in slot.
> 
> For more information about PCIe D3cold and corresponding ACPI support,
> please refer to:
> 
> - PCI Express Base Specification Revision 2.0
> - Advanced Configuration and Power Interface Specification Revision 5.0
> 
> Originally-by: Zheng Yan <zheng.z.yan@intel.com>
> Signed-off-by: Huang Ying <ying.huang@intel.com>
> ---
>  drivers/pci/pci-acpi.c         |   32 +++++++++--
>  drivers/pci/pci-driver.c       |    3 +
>  drivers/pci/pci.c              |  115 +++++++++++++++++++++++++++++++++++++----
>  drivers/pci/pci.h              |    1 
>  drivers/pci/pcie/portdrv_pci.c |   28 +++++++++
>  include/linux/pci.h            |   12 +++-
>  6 files changed, 175 insertions(+), 16 deletions(-)
> 
> --- a/drivers/pci/pci-acpi.c
> +++ b/drivers/pci/pci-acpi.c
> @@ -48,6 +48,21 @@ static void pci_acpi_wake_dev(acpi_handl
>  	if (event != ACPI_NOTIFY_DEVICE_WAKE || !pci_dev)
>  		return;
>  
> +	if (pci_dev->current_state == PCI_D3cold) {
> +		unsigned int count = 0;
> +
> +		/*
> +		 * Powering on bridge need to resume whole hierarchy,
> +		 * just resume the children to avoid the bridge going
> +		 * suspending as soon as resumed
> +		 */

Don't you need to resume the bridge before you start walking the hierarchy
below it?

> +		if (pci_dev->subordinate)
> +			count = pci_wakeup_bus(pci_dev->subordinate);
> +		if (count == 0)
> +			pm_runtime_resume(&pci_dev->dev);

What's the count for, exactly?

> +		return;
> +	}
> +
>  	if (!pci_dev->pm_cap || !pci_dev->pme_support
>  	     || pci_check_pme_status(pci_dev)) {
>  		if (pci_dev->pme_poll)
> @@ -187,10 +202,13 @@ acpi_status pci_acpi_remove_pm_notifier(
>  
>  static pci_power_t acpi_pci_choose_state(struct pci_dev *pdev)
>  {
> -	int acpi_state;
> +	int acpi_state, d_max;
>  
> -	acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL,
> -						ACPI_STATE_D3);
> +	if (pdev->dev.power.power_must_be_on)
> +		d_max = ACPI_STATE_D3_HOT;
> +	else
> +		d_max = ACPI_STATE_D3_COLD;
> +	acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL, d_max);
>  	if (acpi_state < 0)
>  		return PCI_POWER_ERROR;
>  
> @@ -297,7 +315,13 @@ static void acpi_pci_propagate_run_wake(
>  
>  static int acpi_pci_run_wake(struct pci_dev *dev, bool enable)
>  {
> -	if (dev->pme_interrupt)
> +	/*
> +	 * Per PCI Express Base Specification Revision 2.0 section
> +	 * 5.3.3.2 Link Wakeup, platform support is needed for D3cold
> +	 * waking up to power on the main link even if there is PME
> +	 * support for D3cold
> +	 */
> +	if (dev->pme_interrupt && !dev->runtime_d3cold)
>  		return 0;
>  
>  	if (!acpi_pm_device_run_wake(&dev->dev, enable))
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -1013,10 +1013,13 @@ static int pci_pm_runtime_suspend(struct
>  	if (!pm || !pm->runtime_suspend)
>  		return -ENOSYS;
>  
> +	dev->power.power_must_be_on = false;
>  	error = pm->runtime_suspend(dev);
>  	suspend_report_result(pm->runtime_suspend, error);
>  	if (error)
>  		return error;
> +	if (!dev->power.power_off_user)
> +		dev->power.power_must_be_on = true;

That doesn't look good.  The flag itself should be exported via sysfs, but
only if it is known that power can be removed from the device.

>  
>  	pci_fixup_device(pci_fixup_suspend, pci_dev);
>  
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -653,6 +653,12 @@ void pci_update_current_state(struct pci
>  	if (dev->pm_cap) {
>  		u16 pmcsr;
>  
> +		/*
> +		 * Configuration space is not accessible for device in
> +		 * D3cold, so keep in D3cold for safety
> +		 */
> +		if (dev->current_state == PCI_D3cold)
> +			return;
>  		pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
>  		dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
>  	} else {
> @@ -671,8 +677,11 @@ static int pci_platform_power_transition
>  
>  	if (platform_pci_power_manageable(dev)) {
>  		error = platform_pci_set_power_state(dev, state);
> -		if (!error)
> +		if (!error) {
> +			if (state == PCI_D3cold)
> +				dev->current_state = PCI_D3cold;

+			else ?

>  			pci_update_current_state(dev, state);
>
> +		}
>  		/* Fall back to PCI_D0 if native PM is not supported */
>  		if (!dev->pm_cap)
>  			dev->current_state = PCI_D0;
> @@ -693,8 +702,49 @@ static int pci_platform_power_transition
>   */
>  static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
>  {
> -	if (state == PCI_D0)
> +	if (state == PCI_D0) {
>  		pci_platform_power_transition(dev, PCI_D0);
> +		/*
> +		 * Mandatory power management transition delays, see
> +		 * PCI Express Base Specification Revision 2.0 Section
> +		 * 6.6.1: Conventional Reset.  Do not delay for
> +		 * devices powered on/off by corresponding bridge,
> +		 * because have already delayed for the bridge.
> +		 */
> +		if (dev->runtime_d3cold) {
> +			msleep(dev->d3cold_delay);
> +			/* When powering on a bridge from D3cold, the
> +			 * whole hierarchy may be powered on into
> +			 * D0uninitialized state, resume them to give
> +			 * them a chance to suspend again */
> +			if (dev->subordinate)
> +				pci_wakeup_bus(dev->subordinate);
> +		}
> +	}
> +}
> +
> +/**
> + * __pci_dev_set_current_state - Set current state of a PCI device
> + * @dev: Device to handle
> + * @data: pointer to state to be set
> + */
> +static int __pci_dev_set_current_state(struct pci_dev *dev, void *data)
> +{
> +	pci_power_t state = *(pci_power_t *)data;
> +
> +	dev->current_state = state;
> +	return 0;
> +}
> +
> +/**
> + * __pci_bus_set_current_state - Walk given bus and set current state of devices
> + * @bus: Top bus of the subtree to walk.
> + * @state: state to be set
> + */
> +static void __pci_bus_set_current_state(struct pci_bus *bus, pci_power_t state)
> +{
> +	if (bus)
> +		pci_walk_bus(bus, __pci_dev_set_current_state, &state);
>  }
>  
>  /**
> @@ -706,8 +756,17 @@ static void __pci_start_power_transition
>   */
>  int __pci_complete_power_transition(struct pci_dev *dev, pci_power_t state)
>  {
> -	return state >= PCI_D0 ?
> -			pci_platform_power_transition(dev, state) : -EINVAL;
> +	int ret;
> +
> +	if (state < PCI_D0)
> +		return -EINVAL;
> +	ret = pci_platform_power_transition(dev, state);
> +	if (!ret && state == PCI_D3cold) {
> +		/* Power off the bridge may power off the whole hierarchy */
> +		if (dev->subordinate)
> +			__pci_bus_set_current_state(dev->subordinate, state);

I'd just say

+			__pci_bus_set_current_state(dev->subordinate, PCI_D3cold);

here.  Also you don't need the if (dev->subordinate) check, because
__pci_bus_set_current_state() will do it anyway.

> +	}
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(__pci_complete_power_transition);
>  
> @@ -731,8 +790,8 @@ int pci_set_power_state(struct pci_dev *
>  	int error;
>  
>  	/* bound the state we're entering */
> -	if (state > PCI_D3hot)
> -		state = PCI_D3hot;
> +	if (state > PCI_D3cold)
> +		state = PCI_D3cold;
>  	else if (state < PCI_D0)
>  		state = PCI_D0;
>  	else if ((state == PCI_D1 || state == PCI_D2) && pci_no_d1d2(dev))
> @@ -747,10 +806,15 @@ int pci_set_power_state(struct pci_dev *
>  
>  	/* This device is quirked not to be put into D3, so
>  	   don't put it in D3 */
> -	if (state == PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
> +	if (state >= PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
>  		return 0;
>
> -	error = pci_raw_set_power_state(dev, state);
> +	/*
> +	 * To put device in D3cold, we put device into D3hot in native
> +	 * way, then put device into D3cold with platform ops
> +	 */
> +	error = pci_raw_set_power_state(dev, state > PCI_D3hot ?
> +					PCI_D3hot : state);
>  
>  	if (!__pci_complete_power_transition(dev, state))
>  		error = 0;
> @@ -1506,6 +1570,34 @@ void pci_pme_wakeup_bus(struct pci_bus *
>  }
>  
>  /**
> + * pci_wakeup - Wake up a PCI device
> + * @dev: Device to handle.
> + * @data: to count how many device are waken up.
> + */
> +static int pci_wakeup(struct pci_dev *dev, void *data)
> +{
> +	unsigned int *count = data;
> +
> +	pci_wakeup_event(dev);
> +	pm_request_resume(&dev->dev);
> +	(*count)++;
> +	return 0;
> +}
> +
> +/**
> + * pci_wakeup_bus - Walk given bus and wake up devices on it
> + * @bus: Top bus of the subtree to walk.
> + */
> +unsigned int pci_wakeup_bus(struct pci_bus *bus)
> +{
> +	unsigned int count = 0;
> +
> +	if (bus)
> +		pci_walk_bus(bus, pci_wakeup, &count);
> +	return count;
> +}
> +
> +/**
>   * pci_pme_capable - check the capability of PCI device to generate PME#
>   * @dev: PCI device to handle.
>   * @state: PCI state from which device will issue PME#.
> @@ -1789,12 +1881,16 @@ int pci_finish_runtime_suspend(struct pc
>  	if (target_state == PCI_POWER_ERROR)
>  		return -EIO;
>  
> +	dev->runtime_d3cold = target_state == PCI_D3cold ? true : false;

+	dev->runtime_d3cold = target_state == PCI_D3cold;

will suffice.

> +
>  	__pci_enable_wake(dev, target_state, true, pci_dev_run_wake(dev));
>  
>  	error = pci_set_power_state(dev, target_state);
>  
> -	if (error)
> +	if (error) {
>  		__pci_enable_wake(dev, target_state, true, false);
> +		dev->runtime_d3cold = false;
> +	}
>  
>  	return error;
>  }
> @@ -1864,6 +1960,7 @@ void pci_pm_init(struct pci_dev *dev)
>  
>  	dev->pm_cap = pm;
>  	dev->d3_delay = PCI_PM_D3_WAIT;
> +	dev->d3cold_delay = PCI_PM_D3COLD_WAIT;
>  
>  	dev->d1_support = false;
>  	dev->d2_support = false;
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -70,6 +70,7 @@ extern void pci_update_current_state(str
>  extern void pci_disable_enabled_device(struct pci_dev *dev);
>  extern int pci_finish_runtime_suspend(struct pci_dev *dev);
>  extern int __pci_pme_wakeup(struct pci_dev *dev, void *ign);
> +extern unsigned int pci_wakeup_bus(struct pci_bus *bus);
>  extern void pci_pm_init(struct pci_dev *dev);
>  extern void platform_pci_wakeup_init(struct pci_dev *dev);
>  extern void pci_allocate_cap_save_buffers(struct pci_dev *dev);
> --- a/drivers/pci/pcie/portdrv_pci.c
> +++ b/drivers/pci/pcie/portdrv_pci.c
> @@ -101,10 +101,38 @@ static int pcie_port_resume_noirq(struct
>  }
>  
>  #ifdef CONFIG_PM_RUNTIME
> +struct d3cold_info {
> +	bool power_must_be_on;
> +	unsigned int d3cold_delay;
> +};
> +
> +static int pci_dev_d3cold_info(struct pci_dev *pdev, void *data)
> +{
> +	struct d3cold_info *info = data;
> +
> +	info->d3cold_delay = max_t(unsigned int, pdev->d3cold_delay,
> +				   info->d3cold_delay);
> +	info->power_must_be_on = info->power_must_be_on ||
> +		pdev->dev.power.power_must_be_on;
> +	return 0;
> +}
> +
>  static int pcie_port_runtime_suspend(struct device *dev)
>  {
>  	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct d3cold_info d3cold_info = {
> +		.power_must_be_on	= dev->power.power_must_be_on,
> +		.d3cold_delay		= PCI_PM_D3_WAIT,
> +	};
>  
> +	/*
> +	 * If any subordinate device need to keep power on, we should
> +	 * not power off the port.  The D3cold delay of port should be
> +	 * the max of that of all subordinate devices.

What if some of the devices below the port are ports themselves?  Or PCI-to-PCIe
bridges?

> +	 */
> +	pci_walk_bus(pdev->subordinate, pci_dev_d3cold_info, &d3cold_info);
> +	dev->power.power_must_be_on = d3cold_info.power_must_be_on;
> +	pdev->d3cold_delay = d3cold_info.d3cold_delay;
>  	pci_save_state(pdev);
>  	return 0;
>  }
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -132,9 +132,10 @@ static inline const char *pci_power_name
>  	return pci_power_names[1 + (int) state];
>  }
>  
> -#define PCI_PM_D2_DELAY	200
> -#define PCI_PM_D3_WAIT	10
> -#define PCI_PM_BUS_WAIT	50
> +#define PCI_PM_D2_DELAY		200
> +#define PCI_PM_D3_WAIT		10
> +#define PCI_PM_D3COLD_WAIT	100
> +#define PCI_PM_BUS_WAIT		50
>  
>  /** The pci_channel state describes connectivity between the CPU and
>   *  the pci device.  If some PCI bus between here and the pci device
> @@ -282,7 +283,12 @@ struct pci_dev {
>  	unsigned int	mmio_always_on:1;	/* disallow turning off io/mem
>  						   decoding during bar sizing */
>  	unsigned int	wakeup_prepared:1;
> +	unsigned int	runtime_d3cold:1;	/* whether go through runtime
> +						   D3cold, not set for devices
> +						   powered on/off by the
> +						   corresponding bridge */
>  	unsigned int	d3_delay;	/* D3->D0 transition time in ms */
> +	unsigned int	d3cold_delay;	/* D3cold->D0 transition time in ms */
>  
>  #ifdef CONFIG_PCIEASPM
>  	struct pcie_link_state	*link_state;	/* ASPM link state. */

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port
  2012-05-04 19:50   ` Bjorn Helgaas
@ 2012-05-04 20:55     ` Rafael J. Wysocki
  2012-05-05  6:54       ` huang ying
  2012-05-05  6:53     ` huang ying
  1 sibling, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-04 20:55 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Huang Ying, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Friday, May 04, 2012, Bjorn Helgaas wrote:
> On Fri, May 4, 2012 at 2:13 AM, Huang Ying <ying.huang@intel.com> wrote:
> > From: Zheng Yan <zheng.z.yan@intel.com>
> >
> > This patch adds runtime PM support to PCIe port.  This is needed by
> > PCIe D3cold support, where PCIe device in slot may be powered on/off
> > by PCIe port.
> 
> I assume this works for integrated PCIe devices as well as those that
> are plugged into a slot and can be physically removed -- maybe the
> text "in slot" is superfluous?
> 
> > Because runtime suspend is broken for some chipset, a white list is
> > used to enable runtime PM support for only chipset known to work.
> 
> A whitelist requires perpetual maintenance.  Every time a new working
> chipset comes out, you have to update the whitelist.  That doesn't
> seem right.

Well, we can't possibly enable the feature for all PCIe ports in existence
either, because some of them will not work with it (almost surely).

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy
  2012-05-04 19:50   ` Bjorn Helgaas
@ 2012-05-04 21:00     ` Rafael J. Wysocki
  2012-05-05  6:36     ` huang ying
  1 sibling, 0 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-04 21:00 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Huang Ying, ming.m.lin, linux-kernel, linux-pm, Zheng Yan, Lan Tianyu

On Friday, May 04, 2012, Bjorn Helgaas wrote:
> On Fri, May 4, 2012 at 2:13 AM, Huang Ying <ying.huang@intel.com> wrote:
> > From: Lan Tianyu <tianyu.lan@intel.com>
> >
> > Some devices can be powered off to save more power via some platform
> > mechanism, e.g., ACPI.  But that may not work as expected for some
> > device or platform.  So, this patch adds a sysfs file named power_off
> > under <device>/power directory to provide a mechanism for user to control
> > whether to allow the device to be power off.
> >
> > power_off => "enabled" means allowing the device to be powered off if
> > possible.
> >
> > power_off => "disabled" means the device must be power on anytime.
> >
> > Also add flag power_off_user to struct dev_pm_info to record users'
> > choice. The bus layer can use this field to determine whether to
> > power off the device.
> 
> My first thought was that writing to "power_off" would actually turn
> the power off, which isn't true.  Maybe something like
> "poweroff_allowed" would work.
> 
> I think there's only one use of this new field, in
> pci_pm_runtime_suspend().  Maybe you could pull out that hunk from
> patch 5, combine it with this one, and move it to after patch 5?

Well, please see my comment.

First, it doesn't make sense to export a sysfs file to control a feature that
the given device doesn't have.

Second, if such a file is exported _at_ _this_ _level_, the sysfs setting
should affect every situation in which power may be removed from devices, not
just the PCIe D3cold damned thing.

If this is going to be PCIe-specific, the flag should go into struct pci_dev,
and the sysfs file accordingly.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 1/5] PM, Runtime, Add power_must_be_on flag
  2012-05-04 19:37   ` Rafael J. Wysocki
@ 2012-05-05  5:15     ` huang ying
  2012-05-07 20:33       ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: huang ying @ 2012-05-05  5:15 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Huang Ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Sat, May 5, 2012 at 3:37 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Friday, May 04, 2012, Huang Ying wrote:
>> The extreme way to save device power in runtime is to turn off power
>> of device.  For example, D3cold for PCIe bus and ZPODD (Zero Power
>> Optical Disk Drive) for SATA bus will do that.
>>
>> But sometimes power off is not expected, some possible reason is as
>> follow
>>
>> - power off device usually incurs longer resume latency, if it exceeds
>>   power QoS requirement, power off should be disabled.
>>
>> - For some buses, device in power off state can not support remote
>>   wakeup.  If remote wakeup is desired, power off should be disabled.
>>
>> In general, whether to put a device into power off state should be
>> decided by the driver of the device, but for some buses, whether to
>> put a device into power off state may be done by the parent of the
>> device.  For example, a PCIe end point device may be put into power
>> off state by the PCIe port connected to it.
>>
>> So a flag is introduced for the children devices to tell the parent
>> device, whether it should be put into power off state.
>>
>> This flag is also used for device driver to tell bus layer whether it
>> is OK to be powered off.
>>
>> Signed-off-by: Huang Ying <ying.huang@intel.com>
>
> I would be almost fine with this patch, if [2/5] were not present.
>
> However, if you introduce a flag like this, you need to put checks
> against it into all places where power may be removed from devices,
> like the generic PM domains framework (but not only there).

Yes.  At least this flag will be needed by other buses, like ZPODD
support from Lin Ming:

https://lkml.org/lkml/2012/3/28/23

So my original plan is to introduce this flag firstly, then to add
checking for this flag in various places need it.  Do you suggest to
put PCIe D3cold support, ZPODD support, power domain related checking
into one patchset.

Best Regards,
Huang Ying

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 1/5] PM, Runtime, Add power_must_be_on flag
  2012-05-04 19:50   ` Bjorn Helgaas
@ 2012-05-05  5:59     ` huang ying
  2012-05-07 20:37       ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: huang ying @ 2012-05-05  5:59 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Huang Ying, ming.m.lin, linux-kernel, linux-pm,
	Rafael J. Wysocki, Zheng Yan

On Sat, May 5, 2012 at 3:50 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Fri, May 4, 2012 at 2:13 AM, Huang Ying <ying.huang@intel.com> wrote:
>> --- a/include/linux/pm.h
>> +++ b/include/linux/pm.h
>> @@ -536,6 +536,7 @@ struct dev_pm_info {
>>        unsigned int            irq_safe:1;
>>        unsigned int            use_autosuspend:1;
>>        unsigned int            timer_autosuspends:1;
>> +       unsigned int            power_must_be_on:1;
>>        enum rpm_request        request;
>>        enum rpm_status         runtime_status;
>>        int                     runtime_error;
>
> It's a little weird to just add the field, with no users.  Would it
> make sense to pull out the bits of patch 5 that use this and combine
> them into a single smaller patch?

This patch is needed by some other subsystem too, such as ZPODD
support in following patchset:

https://lkml.org/lkml/2012/3/28/23

The original plan is to merge this into linux-pm.git firstly, then
merge various usage of this flag into various subsystem git tree.
That will make cross-tree merging a little easier.  Is it possible?

> But see related comments there; it
> might be safer to have a function that computes this whenever you need
> it instead of caching the value.

This flag may be used (or calculated if implemented as a function)
when device is in suspended state.  Reason is as follow from the
changelog of this patch.

"
In general, whether to put a device into power off state should be
decided by the driver of the device, but for some buses, whether to
put a device into power off state may be done by the parent of the
device.  For example, a PCIe end point device may be put into power
off state by the PCIe port connected to it.
"

If all devices can calculate "power_must_be_on" when it is in
suspended state, then this will be a design choice.

Hi, Rafael,

What do you think about the idea to replace .power_must_be_on with a
function (maybe add a new callback in pm_ops)?

Best Regards,
Huang Ying

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy
  2012-05-04 19:33   ` Rafael J. Wysocki
@ 2012-05-05  6:29     ` huang ying
  2012-05-07 20:53       ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: huang ying @ 2012-05-05  6:29 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Huang Ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm,
	Zheng Yan, Lan Tianyu

On Sat, May 5, 2012 at 3:33 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Friday, May 04, 2012, Huang Ying wrote:
>> From: Lan Tianyu <tianyu.lan@intel.com>
>>
>> Some devices can be powered off to save more power via some platform
>> mechanism, e.g., ACPI.  But that may not work as expected for some
>> device or platform.  So, this patch adds a sysfs file named power_off
>> under <device>/power directory to provide a mechanism for user to control
>> whether to allow the device to be power off.
>>
>> power_off => "enabled" means allowing the device to be powered off if
>> possible.
>>
>> power_off => "disabled" means the device must be power on anytime.
>>
>> Also add flag power_off_user to struct dev_pm_info to record users'
>> choice. The bus layer can use this field to determine whether to
>> power off the device.
>
> It looks like the new attribute is added for all devices regardless of whether
> or not they actually can be powered off?  If so, then please don't do that,
> it's _extremely_ confusing.

Yes.  You are right.

> If you need user space to be able to control that functionality (and I can
> think of a couple of cases in which you do), there need to be 2 flags,
> can_power_off and may_power_off, where the first one is set by the kernel
> if it is known that power can be removed from the device - the attribute
> should be created when this flag is set and removed when it is unset.
>
> Then, the setting of the second flag will be controlled by the new attribute.
>
> And you'll need to patch quite a few places where devices actually have that
> capability, like where power domains are in use, so that's quire a lot of
> work.

If so, I think maybe we need 3 flags:

- can_power_off, set by kernel when it is possible to power off the device
- may_power_off_user, set by user via sysfs attribute
- may_power_off, set by kernel according to may_power_off_user, power
QoS and some other conditions

Sysfs attribute for may_power_off_user is only created if can_power_off is true.

I think we still can do that step by step.  For example, when we add
power off support to PCI devices, we set can_power_off to true for PCI
devices that is possible to be powered off;  when we add power domain
support, we set can_power_off to true for devices in power domain.  Do
you agree?

Best Regards,
Huang Ying

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy
  2012-05-04 19:50   ` Bjorn Helgaas
  2012-05-04 21:00     ` Rafael J. Wysocki
@ 2012-05-05  6:36     ` huang ying
  1 sibling, 0 replies; 47+ messages in thread
From: huang ying @ 2012-05-05  6:36 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Huang Ying, ming.m.lin, linux-kernel, linux-pm,
	Rafael J. Wysocki, Zheng Yan, Lan Tianyu

On Sat, May 5, 2012 at 3:50 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Fri, May 4, 2012 at 2:13 AM, Huang Ying <ying.huang@intel.com> wrote:
>> From: Lan Tianyu <tianyu.lan@intel.com>
>>
>> Some devices can be powered off to save more power via some platform
>> mechanism, e.g., ACPI.  But that may not work as expected for some
>> device or platform.  So, this patch adds a sysfs file named power_off
>> under <device>/power directory to provide a mechanism for user to control
>> whether to allow the device to be power off.
>>
>> power_off => "enabled" means allowing the device to be powered off if
>> possible.
>>
>> power_off => "disabled" means the device must be power on anytime.
>>
>> Also add flag power_off_user to struct dev_pm_info to record users'
>> choice. The bus layer can use this field to determine whether to
>> power off the device.
>
> My first thought was that writing to "power_off" would actually turn
> the power off, which isn't true.  Maybe something like
> "poweroff_allowed" would work.

Yes. "power_off" is not a good name.  "poweroff_allowed" is much better.

> I think there's only one use of this new field, in
> pci_pm_runtime_suspend().  Maybe you could pull out that hunk from
> patch 5, combine it with this one, and move it to after patch 5?

This will be used by various places, such as ZPODD support, power domain, etc.

Best Regards,
Huang Ying

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port
  2012-05-04 19:43   ` Rafael J. Wysocki
@ 2012-05-05  6:46     ` huang ying
  2012-05-07 21:00       ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: huang ying @ 2012-05-05  6:46 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Huang Ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Sat, May 5, 2012 at 3:43 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Friday, May 04, 2012, Huang Ying wrote:
>> From: Zheng Yan <zheng.z.yan@intel.com>
>>
>> This patch adds runtime PM support to PCIe port.  This is needed by
>> PCIe D3cold support, where PCIe device in slot may be powered on/off
>> by PCIe port.
>>
>> Because runtime suspend is broken for some chipset, a white list is
>> used to enable runtime PM support for only chipset known to work.
>>
>> Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
>> Signed-off-by: Huang Ying <ying.huang@intel.com>
>> ---
>>  drivers/pci/pci.c              |    9 +++++++++
>>  drivers/pci/pcie/portdrv_pci.c |   40 ++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 49 insertions(+)
>>
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -1476,6 +1476,15 @@ bool pci_check_pme_status(struct pci_dev
>>   */
>>  static int pci_pme_wakeup(struct pci_dev *dev, void *pme_poll_reset)
>>  {
>> +     struct pci_dev *bridge = dev->bus->self;
>> +
>> +     /*
>> +      * If bridge is in low power state, the configuration space of
>> +      * subordinate devices may be not accessible
>
> Please also say in the comment _when_ this is possible.  That's far from
> obvious, because the runtime PM framework generally ensures that parents are
> resumed before the children, so the comment should describe the particular
> scenario leading to this situation.

OK.  I will add something like below into comments.

This is possible when doing PME poll.

>> +      */
>> +     if (bridge && bridge->current_state != PCI_D0)
>> +             return 0;
>> +
>>       if (pme_poll_reset && dev->pme_poll)
>>               dev->pme_poll = false;
>>
>> --- a/drivers/pci/pcie/portdrv_pci.c
>> +++ b/drivers/pci/pcie/portdrv_pci.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/kernel.h>
>>  #include <linux/errno.h>
>>  #include <linux/pm.h>
>> +#include <linux/pm_runtime.h>
>>  #include <linux/init.h>
>>  #include <linux/pcieport_if.h>
>>  #include <linux/aer.h>
>> @@ -99,6 +100,27 @@ static int pcie_port_resume_noirq(struct
>>       return 0;
>>  }
>>
>> +#ifdef CONFIG_PM_RUNTIME
>> +static int pcie_port_runtime_suspend(struct device *dev)
>> +{
>> +     struct pci_dev *pdev = to_pci_dev(dev);
>> +
>
> A comment explaining why this is needed here would be welcome.

Sorry, do not catch your idea exactly.  What is needed?  Why do we
need to add PCIe port runtime suspend support?

Best Regards,
Huang Ying

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port
  2012-05-04 19:50   ` Bjorn Helgaas
  2012-05-04 20:55     ` Rafael J. Wysocki
@ 2012-05-05  6:53     ` huang ying
  1 sibling, 0 replies; 47+ messages in thread
From: huang ying @ 2012-05-05  6:53 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Huang Ying, ming.m.lin, linux-kernel, linux-pm,
	Rafael J. Wysocki, Zheng Yan

On Sat, May 5, 2012 at 3:50 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Fri, May 4, 2012 at 2:13 AM, Huang Ying <ying.huang@intel.com> wrote:
>> From: Zheng Yan <zheng.z.yan@intel.com>
>>
>> This patch adds runtime PM support to PCIe port.  This is needed by
>> PCIe D3cold support, where PCIe device in slot may be powered on/off
>> by PCIe port.
>
> I assume this works for integrated PCIe devices as well as those that
> are plugged into a slot and can be physically removed -- maybe the
> text "in slot" is superfluous?

"in slot" is used here, because usually integrated PCIe devices has
ACPI node and method attached with them while that is not the case for
PCIe devices in slot.  I will make it more clear with something as
follow:

Some PCIe devices have not direct platform support (such as ACPI node
and method) to be powered on/off.  But they may be powered on/off by
the corresponding PCIe port.

>> Because runtime suspend is broken for some chipset, a white list is
>> used to enable runtime PM support for only chipset known to work.
>
> A whitelist requires perpetual maintenance.  Every time a new working
> chipset comes out, you have to update the whitelist.  That doesn't
> seem right.
>
>> Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
>> Signed-off-by: Huang Ying <ying.huang@intel.com>
>> ---
>>  drivers/pci/pci.c              |    9 +++++++++
>>  drivers/pci/pcie/portdrv_pci.c |   40 ++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 49 insertions(+)
>>
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -1476,6 +1476,15 @@ bool pci_check_pme_status(struct pci_dev
>>  */
>>  static int pci_pme_wakeup(struct pci_dev *dev, void *pme_poll_reset)
>>  {
>> +       struct pci_dev *bridge = dev->bus->self;
>> +
>> +       /*
>> +        * If bridge is in low power state, the configuration space of
>> +        * subordinate devices may be not accessible
>> +        */
>
> This comment would make more sense as "If the upstream bridge is in a
> low power state, the configuration space of this device is not
> accessible."

Yes.  Will change this.

>> +       if (bridge && bridge->current_state != PCI_D0)
>> +               return 0;
>> +
>>        if (pme_poll_reset && dev->pme_poll)
>>                dev->pme_poll = false;
>>
>> --- a/drivers/pci/pcie/portdrv_pci.c
>> +++ b/drivers/pci/pcie/portdrv_pci.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/kernel.h>
>>  #include <linux/errno.h>
>>  #include <linux/pm.h>
>> +#include <linux/pm_runtime.h>
>>  #include <linux/init.h>
>>  #include <linux/pcieport_if.h>
>>  #include <linux/aer.h>
>> @@ -99,6 +100,27 @@ static int pcie_port_resume_noirq(struct
>>        return 0;
>>  }
>>
>> +#ifdef CONFIG_PM_RUNTIME
>> +static int pcie_port_runtime_suspend(struct device *dev)
>> +{
>> +       struct pci_dev *pdev = to_pci_dev(dev);
>> +
>> +       pci_save_state(pdev);
>> +       return 0;
>> +}
>> +
>> +static int pcie_port_runtime_resume(struct device *dev)
>> +{
>> +       struct pci_dev *pdev = to_pci_dev(dev);
>> +
>> +       pci_restore_state(pdev);
>> +       return 0;
>> +}
>> +#else
>> +#define pcie_port_runtime_suspend      NULL
>> +#define pcie_port_runtime_resume       NULL
>> +#endif
>> +
>>  static const struct dev_pm_ops pcie_portdrv_pm_ops = {
>>        .suspend        = pcie_port_device_suspend,
>>        .resume         = pcie_port_device_resume,
>> @@ -107,6 +129,8 @@ static const struct dev_pm_ops pcie_port
>>        .poweroff       = pcie_port_device_suspend,
>>        .restore        = pcie_port_device_resume,
>>        .resume_noirq   = pcie_port_resume_noirq,
>> +       .runtime_suspend = pcie_port_runtime_suspend,
>> +       .runtime_resume = pcie_port_runtime_resume,
>>  };
>>
>>  #define PCIE_PORTDRV_PM_OPS    (&pcie_portdrv_pm_ops)
>> @@ -117,6 +141,18 @@ static const struct dev_pm_ops pcie_port
>>  #endif /* !PM */
>>
>>  /*
>> + * PCIe port runtime suspend is broken for some chipset, so use a
>> + * white list to disable runtime PM for these chipset.
>
> If you're *disabling* runtime PM for these chipsets, I'd call this a
> blacklist (and a blacklist makes more sense to me from a maintenance
> standpoint, because you only have to update it when new broken chips
> are discovered).

Yes.  I think blacklist may be better.

Best Regards,
Huang Ying

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port
  2012-05-04 20:55     ` Rafael J. Wysocki
@ 2012-05-05  6:54       ` huang ying
  2012-05-07 21:06         ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: huang ying @ 2012-05-05  6:54 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Bjorn Helgaas, Huang Ying, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Sat, May 5, 2012 at 4:55 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Friday, May 04, 2012, Bjorn Helgaas wrote:
>> On Fri, May 4, 2012 at 2:13 AM, Huang Ying <ying.huang@intel.com> wrote:
>> > From: Zheng Yan <zheng.z.yan@intel.com>
>> >
>> > This patch adds runtime PM support to PCIe port.  This is needed by
>> > PCIe D3cold support, where PCIe device in slot may be powered on/off
>> > by PCIe port.
>>
>> I assume this works for integrated PCIe devices as well as those that
>> are plugged into a slot and can be physically removed -- maybe the
>> text "in slot" is superfluous?
>>
>> > Because runtime suspend is broken for some chipset, a white list is
>> > used to enable runtime PM support for only chipset known to work.
>>
>> A whitelist requires perpetual maintenance.  Every time a new working
>> chipset comes out, you have to update the whitelist.  That doesn't
>> seem right.
>
> Well, we can't possibly enable the feature for all PCIe ports in existence
> either, because some of them will not work with it (almost surely).

What do you think about the idea from Bjorn to use some kind of blacklist here?

Best Regards,
Huang Ying

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 4/5] ACPI, PM, Specify lowest allowed state for device sleep state
  2012-05-04 20:10   ` Rafael J. Wysocki
@ 2012-05-05  7:25     ` huang ying
  2012-05-07 21:15       ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: huang ying @ 2012-05-05  7:25 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Huang Ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Sat, May 5, 2012 at 4:10 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Friday, May 04, 2012, Huang Ying wrote:
>> Lower device sleep state can save more power, but has more exit
>> latency too.  Sometimes, to satisfy some power QoS and other
>> requirement, we need to constrain the lowest device sleep state.
>>
>> In this patch, a parameter to specify lowest allowed state for
>> acpi_pm_device_sleep_state is added.  So that the caller can enforce
>> the constraint via the parameter.
>>
>> Signed-off-by: Huang Ying <ying.huang@intel.com>
>> ---
>>  drivers/acpi/sleep.c       |   18 +++++++++++++++---
>>  drivers/pci/pci-acpi.c     |    3 ++-
>>  drivers/pnp/pnpacpi/core.c |    4 ++--
>>  include/acpi/acpi_bus.h    |    6 +++---
>>  4 files changed, 22 insertions(+), 9 deletions(-)
>>
>> --- a/drivers/acpi/sleep.c
>> +++ b/drivers/acpi/sleep.c
>> @@ -677,6 +677,7 @@ int acpi_suspend(u32 acpi_state)
>>   *   @dev: device to examine; its driver model wakeup flags control
>>   *           whether it should be able to wake up the system
>>   *   @d_min_p: used to store the upper limit of allowed states range
>> + *   @d_max_in: specify the lowest allowed states
>>   *   Return value: preferred power state of the device on success, -ENODEV on
>>   *           failure (ie. if there's no 'struct acpi_device' for @dev)
>>   *
>> @@ -693,7 +694,7 @@ int acpi_suspend(u32 acpi_state)
>>   *   via @wake.
>>   */
>>
>> -int acpi_pm_device_sleep_state(struct device *dev, int *d_min_p)
>> +int acpi_pm_device_sleep_state(struct device *dev, int *d_min_p, int d_max_in)
>>  {
>>       acpi_handle handle = DEVICE_ACPI_HANDLE(dev);
>>       struct acpi_device *adev;
>> @@ -704,11 +705,14 @@ int acpi_pm_device_sleep_state(struct de
>>               printk(KERN_DEBUG "ACPI handle has no context!\n");
>>               return -ENODEV;
>>       }
>> +     d_max_in = clamp_t(int, d_max_in, ACPI_STATE_D0, ACPI_STATE_D3);
>
> Shouldn't that be clamp_val(), rather?

Yes.  clamp_val() is sufficient here.

>>       acpi_method[2] = '0' + acpi_target_sleep_state;
>>       /*
>> -      * If the sleep state is S0, we will return D3, but if the device has
>> -      * _S0W, we will use the value from _S0W
>> +      * If the sleep state is S0, the lowest limit from ACPI is D3,
>> +      * but if the device has _S0W, we will use the value from _S0W
>> +      * as the lowest limit from ACPI.  Finally, we will constrain
>> +      * the lowest limit with the specified one.
>>        */
>>       d_min = ACPI_STATE_D0;
>>       d_max = ACPI_STATE_D3;
>> @@ -754,6 +758,14 @@ int acpi_pm_device_sleep_state(struct de
>>
>>       if (d_min_p)
>>               *d_min_p = d_min;
>> +     /* constrain d_max with specified lowest limit (max number) */
>> +     if (d_max > d_max_in) {
>> +             d_max = d_max_in;
>> +             for (;d_max > d_min; d_max--) {
>
> Well, why didn't you do
>
> +               for (d_max = d_max_in; d_max > d_min; d_max--)

Because I think it is possible that d_max < d_max_in.

>> +                     if (adev->power.states[d_max].flags.valid)
>> +                             break;
>> +             }
>> +     }
>
> And what if d_min > d_max_in ?

I think that means something bad happens.  Maybe we can do something as follow

if (d_min > d_max_in) {
        pr_warning("acpi_pm_device_sleep_state: the specified lowest
state is higher than the highest state from ACPI!");
        d_max_in = d_min;
}
if (d_max > d_max_in) {
...
}

>>       return d_max;
>>  }
>>  #endif /* CONFIG_PM */
>> --- a/drivers/pci/pci-acpi.c
>> +++ b/drivers/pci/pci-acpi.c
>> @@ -189,7 +189,8 @@ static pci_power_t acpi_pci_choose_state
>>  {
>>       int acpi_state;
>>
>> -     acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL);
>> +     acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL,
>> +                                             ACPI_STATE_D3);
>>       if (acpi_state < 0)
>>               return PCI_POWER_ERROR;
>>
>> --- a/drivers/pnp/pnpacpi/core.c
>> +++ b/drivers/pnp/pnpacpi/core.c
>> @@ -170,8 +170,8 @@ static int pnpacpi_suspend(struct pnp_de
>>       }
>>
>>       if (acpi_bus_power_manageable(handle)) {
>> -             int power_state = acpi_pm_device_sleep_state(&dev->dev, NULL);
>> -
>> +             int power_state = acpi_pm_device_sleep_state(&dev->dev, NULL,
>> +                                                          ACPI_STATE_D3);
>>               if (power_state < 0)
>>                       power_state = (state.event == PM_EVENT_ON) ?
>>                                       ACPI_STATE_D0 : ACPI_STATE_D3;
>> --- a/include/acpi/acpi_bus.h
>> +++ b/include/acpi/acpi_bus.h
>> @@ -383,13 +383,13 @@ int acpi_enable_wakeup_device_power(stru
>>  int acpi_disable_wakeup_device_power(struct acpi_device *dev);
>>
>>  #ifdef CONFIG_PM
>> -int acpi_pm_device_sleep_state(struct device *, int *);
>> +int acpi_pm_device_sleep_state(struct device *, int *, int);
>>  #else
>> -static inline int acpi_pm_device_sleep_state(struct device *d, int *p)
>> +static inline int acpi_pm_device_sleep_state(struct device *d, int *p, int m)
>>  {
>>       if (p)
>>               *p = ACPI_STATE_D0;
>> -     return ACPI_STATE_D3;
>> +     return m == ACPI_STATE_D3 ? m : ACPI_STATE_D0;
>
> Shouldn't m be returned (so long as it is between D0 and D3 inclusive)?  IOW:
>
> +       return (m >= ACPI_STATE_D0 && m <= ACPI_STATE_D3) ? m : ACPI_STATE_D0;

My original idea is that only D0 and D3 is guaranteed to be valid for
the device.  If that need not to be considered here, you one is
better.

Best Regards,
Huang Ying

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 5/5] PCIe, Add PCIe runtime D3cold support
  2012-05-04 19:51   ` Bjorn Helgaas
@ 2012-05-05  7:34     ` huang ying
  0 siblings, 0 replies; 47+ messages in thread
From: huang ying @ 2012-05-05  7:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Huang Ying, ming.m.lin, linux-kernel, linux-pm,
	Rafael J. Wysocki, Zheng Yan

On Sat, May 5, 2012 at 3:51 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Fri, May 4, 2012 at 2:13 AM, Huang Ying <ying.huang@intel.com> wrote:
>> This patch adds runtime D3cold support to PCIe bus.  D3cold is the
>> deepest power saving state for PCIe devices.  Where the main PCIe link
>> will be powered off totally, so before the PCIe main link is powered
>> on again, you can not access the device, even the configuration space,
>> which is still accessible in D3hot.  Because the main PCIe link is not
>> available, the PCI PM registers can not be used to put device into/out
>> of D3cold state, that will be done by platform logic such as ACPI
>> _PR3.
>>
>> To support remote wake up in D3cold, aux power is supplied, and a
>> side-band pin: WAKE# is used to notify remote wake up event, the pin
>> is usually connected to platform logic such as ACPI GPE.  This is
>> quite different with other power saving states, where remote wake up
>> is notified via PME message via the PCIe main link.
>>
>> PCIe devices in plug-in slot usually has no direct platform logic, for
>> example, there is usually no ACPI _PR3 for them.  The D3cold support
>> for these devices can be done via the PCIe port connected to it.  When
>> power on/off the PCIe port, the corresponding PCIe devices are powered
>> on/off too.  And the remote wake up event from PCIe device will be
>> notified to the corresponding PCIe port.
>>
>> The PCI subsystem D3cold support and corresponding ACPI platform
>> support is implemented in the patch.  Including the support for PCIe
>> devices in plug-in slot.
>>
>> For more information about PCIe D3cold and corresponding ACPI support,
>> please refer to:
>>
>> - PCI Express Base Specification Revision 2.0
>> - Advanced Configuration and Power Interface Specification Revision 5.0
>>
>> Originally-by: Zheng Yan <zheng.z.yan@intel.com>
>> Signed-off-by: Huang Ying <ying.huang@intel.com>
>> ---
>>  drivers/pci/pci-acpi.c         |   32 +++++++++--
>>  drivers/pci/pci-driver.c       |    3 +
>>  drivers/pci/pci.c              |  115 +++++++++++++++++++++++++++++++++++++----
>>  drivers/pci/pci.h              |    1
>>  drivers/pci/pcie/portdrv_pci.c |   28 +++++++++
>>  include/linux/pci.h            |   12 +++-
>>  6 files changed, 175 insertions(+), 16 deletions(-)
>>
>> --- a/drivers/pci/pci-acpi.c
>> +++ b/drivers/pci/pci-acpi.c
>> @@ -48,6 +48,21 @@ static void pci_acpi_wake_dev(acpi_handl
>>        if (event != ACPI_NOTIFY_DEVICE_WAKE || !pci_dev)
>>                return;
>>
>> +       if (pci_dev->current_state == PCI_D3cold) {
>> +               unsigned int count = 0;
>> +
>> +               /*
>> +                * Powering on bridge need to resume whole hierarchy,
>> +                * just resume the children to avoid the bridge going
>> +                * suspending as soon as resumed
>> +                */ at
>> +               if (pci_dev->subordinate)
>> +                       count = pci_wakeup_bus(pci_dev->subordinate);
>> +               if (count == 0)
>> +                       pm_runtime_resume(&pci_dev->dev);
>> +               return;
>> +       }
>> +
>>        if (!pci_dev->pm_cap || !pci_dev->pme_support
>>             || pci_check_pme_status(pci_dev)) {
>>                if (pci_dev->pme_poll)
>> @@ -187,10 +202,13 @@ acpi_status pci_acpi_remove_pm_notifier(
>>
>>  static pci_power_t acpi_pci_choose_state(struct pci_dev *pdev)
>>  {
>> -       int acpi_state;
>> +       int acpi_state, d_max;
>>
>> -       acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL,
>> -                                               ACPI_STATE_D3);
>> +       if (pdev->dev.power.power_must_be_on)
>> +               d_max = ACPI_STATE_D3_HOT;
>> +       else
>> +               d_max = ACPI_STATE_D3_COLD;
>> +       acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL, d_max);
>>        if (acpi_state < 0)
>>                return PCI_POWER_ERROR;
>>
>> @@ -297,7 +315,13 @@ static void acpi_pci_propagate_run_wake(
>>
>>  static int acpi_pci_run_wake(struct pci_dev *dev, bool enable)
>>  {
>> -       if (dev->pme_interrupt)
>> +       /*
>> +        * Per PCI Express Base Specification Revision 2.0 section
>> +        * 5.3.3.2 Link Wakeup, platform support is needed for D3cold
>> +        * waking up to power on the main link even if there is PME
>> +        * support for D3cold
>> +        */
>> +       if (dev->pme_interrupt && !dev->runtime_d3cold)
>>                return 0;
>>
>>        if (!acpi_pm_device_run_wake(&dev->dev, enable))
>> --- a/drivers/pci/pci-driver.c
>> +++ b/drivers/pci/pci-driver.c
>> @@ -1013,10 +1013,13 @@ static int pci_pm_runtime_suspend(struct
>>        if (!pm || !pm->runtime_suspend)
>>                return -ENOSYS;
>>
>> +       dev->power.power_must_be_on = false;
>>        error = pm->runtime_suspend(dev);
>>        suspend_report_result(pm->runtime_suspend, error);
>>        if (error)
>>                return error;
>> +       if (!dev->power.power_off_user)
>> +               dev->power.power_must_be_on = true;
>>
>>        pci_fixup_device(pci_fixup_suspend, pci_dev);
>>
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -653,6 +653,12 @@ void pci_update_current_state(struct pci
>>        if (dev->pm_cap) {
>>                u16 pmcsr;
>>
>> +               /*
>> +                * Configuration space is not accessible for device in
>> +                * D3cold, so keep in D3cold for safety
>> +                */
>> +               if (dev->current_state == PCI_D3cold)
>> +                       return;
>>                pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
>>                dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
>>        } else {
>> @@ -671,8 +677,11 @@ static int pci_platform_power_transition
>>
>>        if (platform_pci_power_manageable(dev)) {
>>                error = platform_pci_set_power_state(dev, state);
>> -               if (!error)
>> +               if (!error) {
>> +                       if (state == PCI_D3cold)
>> +                               dev->current_state = PCI_D3cold;
>>                        pci_update_current_state(dev, state);
>> +               }
>>                /* Fall back to PCI_D0 if native PM is not supported */
>>                if (!dev->pm_cap)
>>                        dev->current_state = PCI_D0;
>> @@ -693,8 +702,49 @@ static int pci_platform_power_transition
>>  */
>>  static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
>>  {
>> -       if (state == PCI_D0)
>> +       if (state == PCI_D0) {
>>                pci_platform_power_transition(dev, PCI_D0);
>> +               /*
>> +                * Mandatory power management transition delays, see
>> +                * PCI Express Base Specification Revision 2.0 Section
>> +                * 6.6.1: Conventional Reset.  Do not delay for
>> +                * devices powered on/off by corresponding bridge,
>> +                * because have already delayed for the bridge.
>> +                */
>> +               if (dev->runtime_d3cold) {
>> +                       msleep(dev->d3cold_delay);
>> +                       /* When powering on a bridge from D3cold, the
>> +                        * whole hierarchy may be powered on into
>> +                        * D0uninitialized state, resume them to give
>> +                        * them a chance to suspend again */
>> +                       if (dev->subordinate)
>> +                               pci_wakeup_bus(dev->subordinate);
>> +               }
>> +       }
>> +}
>> +
>> +/**
>> + * __pci_dev_set_current_state - Set current state of a PCI device
>> + * @dev: Device to handle
>> + * @data: pointer to state to be set
>> + */
>> +static int __pci_dev_set_current_state(struct pci_dev *dev, void *data)
>> +{
>> +       pci_power_t state = *(pci_power_t *)data;
>> +
>> +       dev->current_state = state;
>> +       return 0;
>> +}
>> +
>> +/**
>> + * __pci_bus_set_current_state - Walk given bus and set current state of devices
>> + * @bus: Top bus of the subtree to walk.
>> + * @state: state to be set
>> + */
>> +static void __pci_bus_set_current_state(struct pci_bus *bus, pci_power_t state)
>> +{
>> +       if (bus)
>> +               pci_walk_bus(bus, __pci_dev_set_current_state, &state);
>>  }
>>
>>  /**
>> @@ -706,8 +756,17 @@ static void __pci_start_power_transition
>>  */
>>  int __pci_complete_power_transition(struct pci_dev *dev, pci_power_t state)
>>  {
>> -       return state >= PCI_D0 ?
>> -                       pci_platform_power_transition(dev, state) : -EINVAL;
>> +       int ret;
>> +
>> +       if (state < PCI_D0)
>> +               return -EINVAL;
>> +       ret = pci_platform_power_transition(dev, state);
>> +       if (!ret && state == PCI_D3cold) {
>> +               /* Power off the bridge may power off the whole hierarchy */
>> +               if (dev->subordinate)
>> +                       __pci_bus_set_current_state(dev->subordinate, state);
>> +       }
>> +       return ret;
>>  }
>>  EXPORT_SYMBOL_GPL(__pci_complete_power_transition);
>>
>> @@ -731,8 +790,8 @@ int pci_set_power_state(struct pci_dev *
>>        int error;
>>
>>        /* bound the state we're entering */
>> -       if (state > PCI_D3hot)
>> -               state = PCI_D3hot;
>> +       if (state > PCI_D3cold)
>> +               state = PCI_D3cold;
>>        else if (state < PCI_D0)
>>                state = PCI_D0;
>>        else if ((state == PCI_D1 || state == PCI_D2) && pci_no_d1d2(dev))
>> @@ -747,10 +806,15 @@ int pci_set_power_state(struct pci_dev *
>>
>>        /* This device is quirked not to be put into D3, so
>>           don't put it in D3 */
>> -       if (state == PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
>> +       if (state >= PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
>>                return 0;
>>
>> -       error = pci_raw_set_power_state(dev, state);
>> +       /*
>> +        * To put device in D3cold, we put device into D3hot in native
>> +        * way, then put device into D3cold with platform ops
>> +        */
>> +       error = pci_raw_set_power_state(dev, state > PCI_D3hot ?
>> +                                       PCI_D3hot : state);
>>
>>        if (!__pci_complete_power_transition(dev, state))
>>                error = 0;
>> @@ -1506,6 +1570,34 @@ void pci_pme_wakeup_bus(struct pci_bus *
>>  }
>>
>>  /**
>> + * pci_wakeup - Wake up a PCI device
>> + * @dev: Device to handle.
>> + * @data: to count how many device are waken up.
>> + */
>> +static int pci_wakeup(struct pci_dev *dev, void *data)
>> +{
>> +       unsigned int *count = data;
>> +
>> +       pci_wakeup_event(dev);
>> +       pm_request_resume(&dev->dev);
>> +       (*count)++;
>> +       return 0;
>> +}
>> +
>> +/**
>> + * pci_wakeup_bus - Walk given bus and wake up devices on it
>> + * @bus: Top bus of the subtree to walk.
>> + */
>> +unsigned int pci_wakeup_bus(struct pci_bus *bus)
>> +{
>> +       unsigned int count = 0;
>> +
>> +       if (bus)
>> +               pci_walk_bus(bus, pci_wakeup, &count);
>> +       return count;
>> +}
>> +
>> +/**
>>  * pci_pme_capable - check the capability of PCI device to generate PME#
>>  * @dev: PCI device to handle.
>>  * @state: PCI state from which device will issue PME#.
>> @@ -1789,12 +1881,16 @@ int pci_finish_runtime_suspend(struct pc
>>        if (target_state == PCI_POWER_ERROR)
>>                return -EIO;
>>
>> +       dev->runtime_d3cold = target_state == PCI_D3cold ? true : false;
>> +
>>        __pci_enable_wake(dev, target_state, true, pci_dev_run_wake(dev));
>>
>>        error = pci_set_power_state(dev, target_state);
>>
>> -       if (error)
>> +       if (error) {
>>                __pci_enable_wake(dev, target_state, true, false);
>> +               dev->runtime_d3cold = false;
>> +       }
>>
>>        return error;
>>  }
>> @@ -1864,6 +1960,7 @@ void pci_pm_init(struct pci_dev *dev)
>>
>>        dev->pm_cap = pm;
>>        dev->d3_delay = PCI_PM_D3_WAIT;
>> +       dev->d3cold_delay = PCI_PM_D3COLD_WAIT;
>>
>>        dev->d1_support = false;
>>        dev->d2_support = false;
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -70,6 +70,7 @@ extern void pci_update_current_state(str
>>  extern void pci_disable_enabled_device(struct pci_dev *dev);
>>  extern int pci_finish_runtime_suspend(struct pci_dev *dev);
>>  extern int __pci_pme_wakeup(struct pci_dev *dev, void *ign);
>> +extern unsigned int pci_wakeup_bus(struct pci_bus *bus);
>>  extern void pci_pm_init(struct pci_dev *dev);
>>  extern void platform_pci_wakeup_init(struct pci_dev *dev);
>>  extern void pci_allocate_cap_save_buffers(struct pci_dev *dev);
>> --- a/drivers/pci/pcie/portdrv_pci.c
>> +++ b/drivers/pci/pcie/portdrv_pci.c
>> @@ -101,10 +101,38 @@ static int pcie_port_resume_noirq(struct
>>  }
>>
>>  #ifdef CONFIG_PM_RUNTIME
>> +struct d3cold_info {
>> +       bool power_must_be_on;
>> +       unsigned int d3cold_delay;
>> +};
>> +
>> +static int pci_dev_d3cold_info(struct pci_dev *pdev, void *data)
>> +{
>> +       struct d3cold_info *info = data;
>> +
>> +       info->d3cold_delay = max_t(unsigned int, pdev->d3cold_delay,
>> +                                  info->d3cold_delay);
>> +       info->power_must_be_on = info->power_must_be_on ||
>> +               pdev->dev.power.power_must_be_on;
>
> We don't care what the previous state of info->power_must_be_on was,
> so to me, this would read more naturally as:
>
>    if (pdev->dev.power.power_must_be_on)
>        info->power_must_be_on = true;

OK.  Will change this.

>> +       return 0;
>> +}
>> +
>>  static int pcie_port_runtime_suspend(struct device *dev)
>>  {
>>        struct pci_dev *pdev = to_pci_dev(dev);
>> +       struct d3cold_info d3cold_info = {
>> +               .power_must_be_on       = dev->power.power_must_be_on,
>> +               .d3cold_delay           = PCI_PM_D3_WAIT,
>> +       };
>>
>> +       /*
>> +        * If any subordinate device need to keep power on, we should
>> +        * not power off the port.  The D3cold delay of port should be
>> +        * the max of that of all subordinate devices.
>> +        */
>> +       pci_walk_bus(pdev->subordinate, pci_dev_d3cold_info, &d3cold_info);
>> +       dev->power.power_must_be_on = d3cold_info.power_must_be_on;
>> +       pdev->d3cold_delay = d3cold_info.d3cold_delay;
>
> Maybe you've thought about how this works in the presence of hot-added
> devices below pdev, but it's not obvious to me that it's safe to cache
> power_must_be_on and d3cold_delay in the struct device and struct
> pci_dev.  Are they guaranteed to be updated when a new device is
> hot-added?

The power_must_be_on and d3cold_delay will be updated when a device is
going to be runtime suspended, and used during the runtime
suspend/resume cycle.  I think when we hot-add a new device, we should
runtime resume all its parents firstly.  Doesn't it?

Best Regards,
Huang Ying

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 5/5] PCIe, Add PCIe runtime D3cold support
  2012-05-04 20:50   ` Rafael J. Wysocki
@ 2012-05-05  8:08     ` huang ying
  2012-05-07 21:22       ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: huang ying @ 2012-05-05  8:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Huang Ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Sat, May 5, 2012 at 4:50 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Friday, May 04, 2012, Huang Ying wrote:
>> This patch adds runtime D3cold support to PCIe bus.  D3cold is the
>> deepest power saving state for PCIe devices.  Where the main PCIe link
>> will be powered off totally, so before the PCIe main link is powered
>> on again, you can not access the device, even the configuration space,
>> which is still accessible in D3hot.  Because the main PCIe link is not
>> available, the PCI PM registers can not be used to put device into/out
>> of D3cold state, that will be done by platform logic such as ACPI
>> _PR3.
>>
>> To support remote wake up in D3cold, aux power is supplied, and a
>> side-band pin: WAKE# is used to notify remote wake up event, the pin
>> is usually connected to platform logic such as ACPI GPE.  This is
>> quite different with other power saving states, where remote wake up
>> is notified via PME message via the PCIe main link.
>>
>> PCIe devices in plug-in slot usually has no direct platform logic, for
>> example, there is usually no ACPI _PR3 for them.  The D3cold support
>> for these devices can be done via the PCIe port connected to it.  When
>> power on/off the PCIe port, the corresponding PCIe devices are powered
>> on/off too.  And the remote wake up event from PCIe device will be
>> notified to the corresponding PCIe port.
>>
>> The PCI subsystem D3cold support and corresponding ACPI platform
>> support is implemented in the patch.  Including the support for PCIe
>> devices in plug-in slot.
>>
>> For more information about PCIe D3cold and corresponding ACPI support,
>> please refer to:
>>
>> - PCI Express Base Specification Revision 2.0
>> - Advanced Configuration and Power Interface Specification Revision 5.0
>>
>> Originally-by: Zheng Yan <zheng.z.yan@intel.com>
>> Signed-off-by: Huang Ying <ying.huang@intel.com>
>> ---
>>  drivers/pci/pci-acpi.c         |   32 +++++++++--
>>  drivers/pci/pci-driver.c       |    3 +
>>  drivers/pci/pci.c              |  115 +++++++++++++++++++++++++++++++++++++----
>>  drivers/pci/pci.h              |    1
>>  drivers/pci/pcie/portdrv_pci.c |   28 +++++++++
>>  include/linux/pci.h            |   12 +++-
>>  6 files changed, 175 insertions(+), 16 deletions(-)
>>
>> --- a/drivers/pci/pci-acpi.c
>> +++ b/drivers/pci/pci-acpi.c
>> @@ -48,6 +48,21 @@ static void pci_acpi_wake_dev(acpi_handl
>>       if (event != ACPI_NOTIFY_DEVICE_WAKE || !pci_dev)
>>               return;
>>
>> +     if (pci_dev->current_state == PCI_D3cold) {
>> +             unsigned int count = 0;
>> +
>> +             /*
>> +              * Powering on bridge need to resume whole hierarchy,
>> +              * just resume the children to avoid the bridge going
>> +              * suspending as soon as resumed
>> +              */
>
> Don't you need to resume the bridge before you start walking the hierarchy
> below it?

When we resume the hierarchy below the bridge, its parent, the bridge,
will be resumed firstly. That is:

rpm_resume(child)
  rpm_resume(parent)
  ->runtime_suspend(child)

>> +             if (pci_dev->subordinate)
>> +                     count = pci_wakeup_bus(pci_dev->subordinate);
>> +             if (count == 0)
>> +                     pm_runtime_resume(&pci_dev->dev);
>
> What's the count for, exactly?

If there is no devices under the bridge, count returned will be 0,
then we will resume bridge itself.

>> +             return;
>> +     }
>> +
>>       if (!pci_dev->pm_cap || !pci_dev->pme_support
>>            || pci_check_pme_status(pci_dev)) {
>>               if (pci_dev->pme_poll)
>> @@ -187,10 +202,13 @@ acpi_status pci_acpi_remove_pm_notifier(
>>
>>  static pci_power_t acpi_pci_choose_state(struct pci_dev *pdev)
>>  {
>> -     int acpi_state;
>> +     int acpi_state, d_max;
>>
>> -     acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL,
>> -                                             ACPI_STATE_D3);
>> +     if (pdev->dev.power.power_must_be_on)
>> +             d_max = ACPI_STATE_D3_HOT;
>> +     else
>> +             d_max = ACPI_STATE_D3_COLD;
>> +     acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL, d_max);
>>       if (acpi_state < 0)
>>               return PCI_POWER_ERROR;
>>
>> @@ -297,7 +315,13 @@ static void acpi_pci_propagate_run_wake(
>>
>>  static int acpi_pci_run_wake(struct pci_dev *dev, bool enable)
>>  {
>> -     if (dev->pme_interrupt)
>> +     /*
>> +      * Per PCI Express Base Specification Revision 2.0 section
>> +      * 5.3.3.2 Link Wakeup, platform support is needed for D3cold
>> +      * waking up to power on the main link even if there is PME
>> +      * support for D3cold
>> +      */
>> +     if (dev->pme_interrupt && !dev->runtime_d3cold)
>>               return 0;
>>
>>       if (!acpi_pm_device_run_wake(&dev->dev, enable))
>> --- a/drivers/pci/pci-driver.c
>> +++ b/drivers/pci/pci-driver.c
>> @@ -1013,10 +1013,13 @@ static int pci_pm_runtime_suspend(struct
>>       if (!pm || !pm->runtime_suspend)
>>               return -ENOSYS;
>>
>> +     dev->power.power_must_be_on = false;
>>       error = pm->runtime_suspend(dev);
>>       suspend_report_result(pm->runtime_suspend, error);
>>       if (error)
>>               return error;
>> +     if (!dev->power.power_off_user)
>> +             dev->power.power_must_be_on = true;
>
> That doesn't look good.  The flag itself should be exported via sysfs, but
> only if it is known that power can be removed from the device.
>
>>
>>       pci_fixup_device(pci_fixup_suspend, pci_dev);
>>
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -653,6 +653,12 @@ void pci_update_current_state(struct pci
>>       if (dev->pm_cap) {
>>               u16 pmcsr;
>>
>> +             /*
>> +              * Configuration space is not accessible for device in
>> +              * D3cold, so keep in D3cold for safety
>> +              */
>> +             if (dev->current_state == PCI_D3cold)
>> +                     return;
>>               pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
>>               dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
>>       } else {
>> @@ -671,8 +677,11 @@ static int pci_platform_power_transition
>>
>>       if (platform_pci_power_manageable(dev)) {
>>               error = platform_pci_set_power_state(dev, state);
>> -             if (!error)
>> +             if (!error) {
>> +                     if (state == PCI_D3cold)
>> +                             dev->current_state = PCI_D3cold;
>
> +                       else ?

OK.  Will add else here.

>>                       pci_update_current_state(dev, state);
>>
>> +             }
>>               /* Fall back to PCI_D0 if native PM is not supported */
>>               if (!dev->pm_cap)
>>                       dev->current_state = PCI_D0;
>> @@ -693,8 +702,49 @@ static int pci_platform_power_transition
>>   */
>>  static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
>>  {
>> -     if (state == PCI_D0)
>> +     if (state == PCI_D0) {
>>               pci_platform_power_transition(dev, PCI_D0);
>> +             /*
>> +              * Mandatory power management transition delays, see
>> +              * PCI Express Base Specification Revision 2.0 Section
>> +              * 6.6.1: Conventional Reset.  Do not delay for
>> +              * devices powered on/off by corresponding bridge,
>> +              * because have already delayed for the bridge.
>> +              */
>> +             if (dev->runtime_d3cold) {
>> +                     msleep(dev->d3cold_delay);
>> +                     /* When powering on a bridge from D3cold, the
>> +                      * whole hierarchy may be powered on into
>> +                      * D0uninitialized state, resume them to give
>> +                      * them a chance to suspend again */
>> +                     if (dev->subordinate)
>> +                             pci_wakeup_bus(dev->subordinate);
>> +             }
>> +     }
>> +}
>> +
>> +/**
>> + * __pci_dev_set_current_state - Set current state of a PCI device
>> + * @dev: Device to handle
>> + * @data: pointer to state to be set
>> + */
>> +static int __pci_dev_set_current_state(struct pci_dev *dev, void *data)
>> +{
>> +     pci_power_t state = *(pci_power_t *)data;
>> +
>> +     dev->current_state = state;
>> +     return 0;
>> +}
>> +
>> +/**
>> + * __pci_bus_set_current_state - Walk given bus and set current state of devices
>> + * @bus: Top bus of the subtree to walk.
>> + * @state: state to be set
>> + */
>> +static void __pci_bus_set_current_state(struct pci_bus *bus, pci_power_t state)
>> +{
>> +     if (bus)
>> +             pci_walk_bus(bus, __pci_dev_set_current_state, &state);
>>  }
>>
>>  /**
>> @@ -706,8 +756,17 @@ static void __pci_start_power_transition
>>   */
>>  int __pci_complete_power_transition(struct pci_dev *dev, pci_power_t state)
>>  {
>> -     return state >= PCI_D0 ?
>> -                     pci_platform_power_transition(dev, state) : -EINVAL;
>> +     int ret;
>> +
>> +     if (state < PCI_D0)
>> +             return -EINVAL;
>> +     ret = pci_platform_power_transition(dev, state);
>> +     if (!ret && state == PCI_D3cold) {
>> +             /* Power off the bridge may power off the whole hierarchy */
>> +             if (dev->subordinate)
>> +                     __pci_bus_set_current_state(dev->subordinate, state);
>
> I'd just say
>
> +                       __pci_bus_set_current_state(dev->subordinate, PCI_D3cold);
>
> here.  Also you don't need the if (dev->subordinate) check, because
> __pci_bus_set_current_state() will do it anyway.

OK.  Will do that.

>> +     }
>> +     return ret;
>>  }
>>  EXPORT_SYMBOL_GPL(__pci_complete_power_transition);
>>
>> @@ -731,8 +790,8 @@ int pci_set_power_state(struct pci_dev *
>>       int error;
>>
>>       /* bound the state we're entering */
>> -     if (state > PCI_D3hot)
>> -             state = PCI_D3hot;
>> +     if (state > PCI_D3cold)
>> +             state = PCI_D3cold;
>>       else if (state < PCI_D0)
>>               state = PCI_D0;
>>       else if ((state == PCI_D1 || state == PCI_D2) && pci_no_d1d2(dev))
>> @@ -747,10 +806,15 @@ int pci_set_power_state(struct pci_dev *
>>
>>       /* This device is quirked not to be put into D3, so
>>          don't put it in D3 */
>> -     if (state == PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
>> +     if (state >= PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
>>               return 0;
>>
>> -     error = pci_raw_set_power_state(dev, state);
>> +     /*
>> +      * To put device in D3cold, we put device into D3hot in native
>> +      * way, then put device into D3cold with platform ops
>> +      */
>> +     error = pci_raw_set_power_state(dev, state > PCI_D3hot ?
>> +                                     PCI_D3hot : state);
>>
>>       if (!__pci_complete_power_transition(dev, state))
>>               error = 0;
>> @@ -1506,6 +1570,34 @@ void pci_pme_wakeup_bus(struct pci_bus *
>>  }
>>
>>  /**
>> + * pci_wakeup - Wake up a PCI device
>> + * @dev: Device to handle.
>> + * @data: to count how many device are waken up.
>> + */
>> +static int pci_wakeup(struct pci_dev *dev, void *data)
>> +{
>> +     unsigned int *count = data;
>> +
>> +     pci_wakeup_event(dev);
>> +     pm_request_resume(&dev->dev);
>> +     (*count)++;
>> +     return 0;
>> +}
>> +
>> +/**
>> + * pci_wakeup_bus - Walk given bus and wake up devices on it
>> + * @bus: Top bus of the subtree to walk.
>> + */
>> +unsigned int pci_wakeup_bus(struct pci_bus *bus)
>> +{
>> +     unsigned int count = 0;
>> +
>> +     if (bus)
>> +             pci_walk_bus(bus, pci_wakeup, &count);
>> +     return count;
>> +}
>> +
>> +/**
>>   * pci_pme_capable - check the capability of PCI device to generate PME#
>>   * @dev: PCI device to handle.
>>   * @state: PCI state from which device will issue PME#.
>> @@ -1789,12 +1881,16 @@ int pci_finish_runtime_suspend(struct pc
>>       if (target_state == PCI_POWER_ERROR)
>>               return -EIO;
>>
>> +     dev->runtime_d3cold = target_state == PCI_D3cold ? true : false;
>
> +       dev->runtime_d3cold = target_state == PCI_D3cold;
>
> will suffice.

OK.  Will do that.

>> +
>>       __pci_enable_wake(dev, target_state, true, pci_dev_run_wake(dev));
>>
>>       error = pci_set_power_state(dev, target_state);
>>
>> -     if (error)
>> +     if (error) {
>>               __pci_enable_wake(dev, target_state, true, false);
>> +             dev->runtime_d3cold = false;
>> +     }
>>
>>       return error;
>>  }
>> @@ -1864,6 +1960,7 @@ void pci_pm_init(struct pci_dev *dev)
>>
>>       dev->pm_cap = pm;
>>       dev->d3_delay = PCI_PM_D3_WAIT;
>> +     dev->d3cold_delay = PCI_PM_D3COLD_WAIT;
>>
>>       dev->d1_support = false;
>>       dev->d2_support = false;
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -70,6 +70,7 @@ extern void pci_update_current_state(str
>>  extern void pci_disable_enabled_device(struct pci_dev *dev);
>>  extern int pci_finish_runtime_suspend(struct pci_dev *dev);
>>  extern int __pci_pme_wakeup(struct pci_dev *dev, void *ign);
>> +extern unsigned int pci_wakeup_bus(struct pci_bus *bus);
>>  extern void pci_pm_init(struct pci_dev *dev);
>>  extern void platform_pci_wakeup_init(struct pci_dev *dev);
>>  extern void pci_allocate_cap_save_buffers(struct pci_dev *dev);
>> --- a/drivers/pci/pcie/portdrv_pci.c
>> +++ b/drivers/pci/pcie/portdrv_pci.c
>> @@ -101,10 +101,38 @@ static int pcie_port_resume_noirq(struct
>>  }
>>
>>  #ifdef CONFIG_PM_RUNTIME
>> +struct d3cold_info {
>> +     bool power_must_be_on;
>> +     unsigned int d3cold_delay;
>> +};
>> +
>> +static int pci_dev_d3cold_info(struct pci_dev *pdev, void *data)
>> +{
>> +     struct d3cold_info *info = data;
>> +
>> +     info->d3cold_delay = max_t(unsigned int, pdev->d3cold_delay,
>> +                                info->d3cold_delay);
>> +     info->power_must_be_on = info->power_must_be_on ||
>> +             pdev->dev.power.power_must_be_on;
>> +     return 0;
>> +}
>> +
>>  static int pcie_port_runtime_suspend(struct device *dev)
>>  {
>>       struct pci_dev *pdev = to_pci_dev(dev);
>> +     struct d3cold_info d3cold_info = {
>> +             .power_must_be_on       = dev->power.power_must_be_on,
>> +             .d3cold_delay           = PCI_PM_D3_WAIT,
>> +     };
>>
>> +     /*
>> +      * If any subordinate device need to keep power on, we should
>> +      * not power off the port.  The D3cold delay of port should be
>> +      * the max of that of all subordinate devices.
>
> What if some of the devices below the port are ports themselves?  Or PCI-to-PCIe
> bridges?

For them, I think the current solution is safe.

Best Regards,
Huang Ying

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 1/5] PM, Runtime, Add power_must_be_on flag
  2012-05-05  5:15     ` huang ying
@ 2012-05-07 20:33       ` Rafael J. Wysocki
  0 siblings, 0 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-07 20:33 UTC (permalink / raw)
  To: huang ying
  Cc: Huang Ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Saturday, May 05, 2012, huang ying wrote:
> On Sat, May 5, 2012 at 3:37 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Friday, May 04, 2012, Huang Ying wrote:
> >> The extreme way to save device power in runtime is to turn off power
> >> of device.  For example, D3cold for PCIe bus and ZPODD (Zero Power
> >> Optical Disk Drive) for SATA bus will do that.
> >>
> >> But sometimes power off is not expected, some possible reason is as
> >> follow
> >>
> >> - power off device usually incurs longer resume latency, if it exceeds
> >>   power QoS requirement, power off should be disabled.
> >>
> >> - For some buses, device in power off state can not support remote
> >>   wakeup.  If remote wakeup is desired, power off should be disabled.
> >>
> >> In general, whether to put a device into power off state should be
> >> decided by the driver of the device, but for some buses, whether to
> >> put a device into power off state may be done by the parent of the
> >> device.  For example, a PCIe end point device may be put into power
> >> off state by the PCIe port connected to it.
> >>
> >> So a flag is introduced for the children devices to tell the parent
> >> device, whether it should be put into power off state.
> >>
> >> This flag is also used for device driver to tell bus layer whether it
> >> is OK to be powered off.
> >>
> >> Signed-off-by: Huang Ying <ying.huang@intel.com>
> >
> > I would be almost fine with this patch, if [2/5] were not present.
> >
> > However, if you introduce a flag like this, you need to put checks
> > against it into all places where power may be removed from devices,
> > like the generic PM domains framework (but not only there).
> 
> Yes.  At least this flag will be needed by other buses, like ZPODD
> support from Lin Ming:
> 
> https://lkml.org/lkml/2012/3/28/23
> 
> So my original plan is to introduce this flag firstly, then to add
> checking for this flag in various places need it.

That sounds like a good plan, but then please don't export this to user
space as long as the kernel side is complete.

> Do you suggest to
> put PCIe D3cold support, ZPODD support, power domain related checking
> into one patchset.

This isn't necessary so long as the flag is not exported.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 1/5] PM, Runtime, Add power_must_be_on flag
  2012-05-05  5:59     ` huang ying
@ 2012-05-07 20:37       ` Rafael J. Wysocki
  0 siblings, 0 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-07 20:37 UTC (permalink / raw)
  To: huang ying
  Cc: Bjorn Helgaas, Huang Ying, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Saturday, May 05, 2012, huang ying wrote:
> On Sat, May 5, 2012 at 3:50 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> > On Fri, May 4, 2012 at 2:13 AM, Huang Ying <ying.huang@intel.com> wrote:
> >> --- a/include/linux/pm.h
> >> +++ b/include/linux/pm.h
> >> @@ -536,6 +536,7 @@ struct dev_pm_info {
> >>        unsigned int            irq_safe:1;
> >>        unsigned int            use_autosuspend:1;
> >>        unsigned int            timer_autosuspends:1;
> >> +       unsigned int            power_must_be_on:1;
> >>        enum rpm_request        request;
> >>        enum rpm_status         runtime_status;
> >>        int                     runtime_error;
> >
> > It's a little weird to just add the field, with no users.  Would it
> > make sense to pull out the bits of patch 5 that use this and combine
> > them into a single smaller patch?
> 
> This patch is needed by some other subsystem too, such as ZPODD
> support in following patchset:
> 
> https://lkml.org/lkml/2012/3/28/23
> 
> The original plan is to merge this into linux-pm.git firstly, then
> merge various usage of this flag into various subsystem git tree.
> That will make cross-tree merging a little easier.  Is it possible?
> 
> > But see related comments there; it
> > might be safer to have a function that computes this whenever you need
> > it instead of caching the value.
> 
> This flag may be used (or calculated if implemented as a function)
> when device is in suspended state.  Reason is as follow from the
> changelog of this patch.
> 
> "
> In general, whether to put a device into power off state should be
> decided by the driver of the device, but for some buses, whether to
> put a device into power off state may be done by the parent of the
> device.  For example, a PCIe end point device may be put into power
> off state by the PCIe port connected to it.
> "
> 
> If all devices can calculate "power_must_be_on" when it is in
> suspended state, then this will be a design choice.
> 
> Hi, Rafael,
> 
> What do you think about the idea to replace .power_must_be_on with a
> function

That looks like a static characteristics of a device (ie. one that's not going
to change at run time), so it seems better to use a flag for it.

> (maybe add a new callback in pm_ops)?

Please don't do that.  We have many of them already.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy
  2012-05-05  6:29     ` huang ying
@ 2012-05-07 20:53       ` Rafael J. Wysocki
  2012-05-08  1:44         ` Huang Ying
  0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-07 20:53 UTC (permalink / raw)
  To: huang ying
  Cc: Huang Ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm,
	Zheng Yan, Lan Tianyu

On Saturday, May 05, 2012, huang ying wrote:
> On Sat, May 5, 2012 at 3:33 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Friday, May 04, 2012, Huang Ying wrote:
> >> From: Lan Tianyu <tianyu.lan@intel.com>
> >>
> >> Some devices can be powered off to save more power via some platform
> >> mechanism, e.g., ACPI.  But that may not work as expected for some
> >> device or platform.  So, this patch adds a sysfs file named power_off
> >> under <device>/power directory to provide a mechanism for user to control
> >> whether to allow the device to be power off.
> >>
> >> power_off => "enabled" means allowing the device to be powered off if
> >> possible.
> >>
> >> power_off => "disabled" means the device must be power on anytime.
> >>
> >> Also add flag power_off_user to struct dev_pm_info to record users'
> >> choice. The bus layer can use this field to determine whether to
> >> power off the device.
> >
> > It looks like the new attribute is added for all devices regardless of whether
> > or not they actually can be powered off?  If so, then please don't do that,
> > it's _extremely_ confusing.
> 
> Yes.  You are right.
> 
> > If you need user space to be able to control that functionality (and I can
> > think of a couple of cases in which you do), there need to be 2 flags,
> > can_power_off and may_power_off, where the first one is set by the kernel
> > if it is known that power can be removed from the device - the attribute
> > should be created when this flag is set and removed when it is unset.
> >
> > Then, the setting of the second flag will be controlled by the new attribute.
> >
> > And you'll need to patch quite a few places where devices actually have that
> > capability, like where power domains are in use, so that's quire a lot of
> > work.
> 
> If so, I think maybe we need 3 flags:
> 
> - can_power_off, set by kernel when it is possible to power off the device

Well, on a second thought, it may be difficult to determine that in some
cases (eg. for devices belonging to power domains with additional constraints
related to the other devices in the same domain etc.).

In other cases power may be removed from devices indirectly, like for example
by putting a device's parent into a low power state.

So I guess the can_power_off flag may not be practical after all.

> - may_power_off_user, set by user via sysfs attribute

I'd call that one power_off_allowed, meaning "allowed by user space".

> - may_power_off, set by kernel according to may_power_off_user, power
> QoS and some other conditions

And I'd call that one power_must_be_on, meaning "don't power off even if
allowed by user space".

> Sysfs attribute for may_power_off_user is only created if can_power_off is true.
> 
> I think we still can do that step by step.  For example, when we add
> power off support to PCI devices, we set can_power_off to true for PCI
> devices that is possible to be powered off;  when we add power domain
> support, we set can_power_off to true for devices in power domain.  Do
> you agree?

I think we may add helpers for exporting/unexporting power_off_allowed
like for the PM QoS latency attribute.  Then, whoever wants to support
power_off_allowed and use it will export it through that helper.

Still, I'm afraid we're trying to special case something that really ins't
a special case.  Namely, we may want to restrict devices from using some
other low-power states as well, not only power off (eg. we may want to
prevent devices' clocks from being stopped).

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port
  2012-05-05  6:46     ` huang ying
@ 2012-05-07 21:00       ` Rafael J. Wysocki
  2012-05-11  7:57         ` Huang Ying
  0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-07 21:00 UTC (permalink / raw)
  To: huang ying
  Cc: Huang Ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Saturday, May 05, 2012, huang ying wrote:
> On Sat, May 5, 2012 at 3:43 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Friday, May 04, 2012, Huang Ying wrote:
> >> From: Zheng Yan <zheng.z.yan@intel.com>
> >>
> >> This patch adds runtime PM support to PCIe port.  This is needed by
> >> PCIe D3cold support, where PCIe device in slot may be powered on/off
> >> by PCIe port.
> >>
> >> Because runtime suspend is broken for some chipset, a white list is
> >> used to enable runtime PM support for only chipset known to work.
> >>
> >> Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
> >> Signed-off-by: Huang Ying <ying.huang@intel.com>
> >> ---
> >>  drivers/pci/pci.c              |    9 +++++++++
> >>  drivers/pci/pcie/portdrv_pci.c |   40 ++++++++++++++++++++++++++++++++++++++++
> >>  2 files changed, 49 insertions(+)
> >>
> >> --- a/drivers/pci/pci.c
> >> +++ b/drivers/pci/pci.c
> >> @@ -1476,6 +1476,15 @@ bool pci_check_pme_status(struct pci_dev
> >>   */
> >>  static int pci_pme_wakeup(struct pci_dev *dev, void *pme_poll_reset)
> >>  {
> >> +     struct pci_dev *bridge = dev->bus->self;
> >> +
> >> +     /*
> >> +      * If bridge is in low power state, the configuration space of
> >> +      * subordinate devices may be not accessible
> >
> > Please also say in the comment _when_ this is possible.  That's far from
> > obvious, because the runtime PM framework generally ensures that parents are
> > resumed before the children, so the comment should describe the particular
> > scenario leading to this situation.
> 
> OK.  I will add something like below into comments.
> 
> This is possible when doing PME poll.

Well, that doesn't really explain much. :-)

I _think_ the situation is when a device causes WAKE# to be generated and
the platform receives a GPE as a result and we get an ACPI_NOTIFY_DEVICE_WAKE
notification for the device, which may be under a bridge (PCIe port really)
in D3_cold.  Is that the case?

> >> +      */
> >> +     if (bridge && bridge->current_state != PCI_D0)
> >> +             return 0;
> >> +
> >>       if (pme_poll_reset && dev->pme_poll)
> >>               dev->pme_poll = false;
> >>
> >> --- a/drivers/pci/pcie/portdrv_pci.c
> >> +++ b/drivers/pci/pcie/portdrv_pci.c
> >> @@ -11,6 +11,7 @@
> >>  #include <linux/kernel.h>
> >>  #include <linux/errno.h>
> >>  #include <linux/pm.h>
> >> +#include <linux/pm_runtime.h>
> >>  #include <linux/init.h>
> >>  #include <linux/pcieport_if.h>
> >>  #include <linux/aer.h>
> >> @@ -99,6 +100,27 @@ static int pcie_port_resume_noirq(struct
> >>       return 0;
> >>  }
> >>
> >> +#ifdef CONFIG_PM_RUNTIME
> >> +static int pcie_port_runtime_suspend(struct device *dev)
> >> +{
> >> +     struct pci_dev *pdev = to_pci_dev(dev);
> >> +
> >
> > A comment explaining why this is needed here would be welcome.
> 
> Sorry, do not catch your idea exactly.  What is needed?  Why do we
> need to add PCIe port runtime suspend support?

No, why we need to call pci_save_state() from here and pci_restore_state()
from the corresponding resume routine.

In theory that shouldn't be necessary, because the PCI bus type's
runtime suspend/resume routines do the save/restore of the PCI config space.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port
  2012-05-05  6:54       ` huang ying
@ 2012-05-07 21:06         ` Rafael J. Wysocki
  0 siblings, 0 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-07 21:06 UTC (permalink / raw)
  To: huang ying
  Cc: Bjorn Helgaas, Huang Ying, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Saturday, May 05, 2012, huang ying wrote:
> On Sat, May 5, 2012 at 4:55 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Friday, May 04, 2012, Bjorn Helgaas wrote:
> >> On Fri, May 4, 2012 at 2:13 AM, Huang Ying <ying.huang@intel.com> wrote:
> >> > From: Zheng Yan <zheng.z.yan@intel.com>
> >> >
> >> > This patch adds runtime PM support to PCIe port.  This is needed by
> >> > PCIe D3cold support, where PCIe device in slot may be powered on/off
> >> > by PCIe port.
> >>
> >> I assume this works for integrated PCIe devices as well as those that
> >> are plugged into a slot and can be physically removed -- maybe the
> >> text "in slot" is superfluous?
> >>
> >> > Because runtime suspend is broken for some chipset, a white list is
> >> > used to enable runtime PM support for only chipset known to work.
> >>
> >> A whitelist requires perpetual maintenance.  Every time a new working
> >> chipset comes out, you have to update the whitelist.  That doesn't
> >> seem right.
> >
> > Well, we can't possibly enable the feature for all PCIe ports in existence
> > either, because some of them will not work with it (almost surely).
> 
> What do you think about the idea from Bjorn to use some kind of blacklist here?

Whitelists are much better than blacklists for new features IMO, because they
allow the feature to be enabled on more and more existing systems over time
(as they are tested) and it's easy to revert wrong enablements (eg. if
a system is added to a whitelist and it turns out not to work afterward, it's
sufficient to remove the whitelist entry to fix the problem).

A general rule for adding new features should be that the feature is only
enabled on systems where it is _known_ to work or at the user's request.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 4/5] ACPI, PM, Specify lowest allowed state for device sleep state
  2012-05-05  7:25     ` huang ying
@ 2012-05-07 21:15       ` Rafael J. Wysocki
  2012-05-08  1:49         ` Huang Ying
  0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-07 21:15 UTC (permalink / raw)
  To: huang ying
  Cc: Huang Ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Saturday, May 05, 2012, huang ying wrote:
> On Sat, May 5, 2012 at 4:10 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Friday, May 04, 2012, Huang Ying wrote:
> >> Lower device sleep state can save more power, but has more exit
> >> latency too.  Sometimes, to satisfy some power QoS and other
> >> requirement, we need to constrain the lowest device sleep state.
> >>
> >> In this patch, a parameter to specify lowest allowed state for
> >> acpi_pm_device_sleep_state is added.  So that the caller can enforce
> >> the constraint via the parameter.
> >>
> >> Signed-off-by: Huang Ying <ying.huang@intel.com>
> >> ---
> >>  drivers/acpi/sleep.c       |   18 +++++++++++++++---
> >>  drivers/pci/pci-acpi.c     |    3 ++-
> >>  drivers/pnp/pnpacpi/core.c |    4 ++--
> >>  include/acpi/acpi_bus.h    |    6 +++---
> >>  4 files changed, 22 insertions(+), 9 deletions(-)
> >>
> >> --- a/drivers/acpi/sleep.c
> >> +++ b/drivers/acpi/sleep.c
> >> @@ -677,6 +677,7 @@ int acpi_suspend(u32 acpi_state)
> >>   *   @dev: device to examine; its driver model wakeup flags control
> >>   *           whether it should be able to wake up the system
> >>   *   @d_min_p: used to store the upper limit of allowed states range
> >> + *   @d_max_in: specify the lowest allowed states
> >>   *   Return value: preferred power state of the device on success, -ENODEV on
> >>   *           failure (ie. if there's no 'struct acpi_device' for @dev)
> >>   *
> >> @@ -693,7 +694,7 @@ int acpi_suspend(u32 acpi_state)
> >>   *   via @wake.
> >>   */
> >>
> >> -int acpi_pm_device_sleep_state(struct device *dev, int *d_min_p)
> >> +int acpi_pm_device_sleep_state(struct device *dev, int *d_min_p, int d_max_in)
> >>  {
> >>       acpi_handle handle = DEVICE_ACPI_HANDLE(dev);
> >>       struct acpi_device *adev;
> >> @@ -704,11 +705,14 @@ int acpi_pm_device_sleep_state(struct de
> >>               printk(KERN_DEBUG "ACPI handle has no context!\n");
> >>               return -ENODEV;
> >>       }
> >> +     d_max_in = clamp_t(int, d_max_in, ACPI_STATE_D0, ACPI_STATE_D3);
> >
> > Shouldn't that be clamp_val(), rather?
> 
> Yes.  clamp_val() is sufficient here.
> 
> >>       acpi_method[2] = '0' + acpi_target_sleep_state;
> >>       /*
> >> -      * If the sleep state is S0, we will return D3, but if the device has
> >> -      * _S0W, we will use the value from _S0W
> >> +      * If the sleep state is S0, the lowest limit from ACPI is D3,
> >> +      * but if the device has _S0W, we will use the value from _S0W
> >> +      * as the lowest limit from ACPI.  Finally, we will constrain
> >> +      * the lowest limit with the specified one.
> >>        */
> >>       d_min = ACPI_STATE_D0;
> >>       d_max = ACPI_STATE_D3;
> >> @@ -754,6 +758,14 @@ int acpi_pm_device_sleep_state(struct de
> >>
> >>       if (d_min_p)
> >>               *d_min_p = d_min;
> >> +     /* constrain d_max with specified lowest limit (max number) */
> >> +     if (d_max > d_max_in) {
> >> +             d_max = d_max_in;
> >> +             for (;d_max > d_min; d_max--) {
> >
> > Well, why didn't you do
> >
> > +               for (d_max = d_max_in; d_max > d_min; d_max--)
> 
> Because I think it is possible that d_max < d_max_in.

I mean:

+     if (d_max > d_max_in) {
+               for (d_max = d_max_in; d_max > d_min; d_max--) {

The assignment followed by the for () loop without the start instruction looks
odd.

> >> +                     if (adev->power.states[d_max].flags.valid)
> >> +                             break;
> >> +             }
> >> +     }
> >
> > And what if d_min > d_max_in ?
> 
> I think that means something bad happens.  Maybe we can do something as follow
> 
> if (d_min > d_max_in) {
>         pr_warning("acpi_pm_device_sleep_state: the specified lowest
> state is higher than the highest state from ACPI!");
>         d_max_in = d_min;

Well, what about returning -EINVAL in that case?

> }
> if (d_max > d_max_in) {
> ...
> }
> 
> >>       return d_max;
> >>  }
> >>  #endif /* CONFIG_PM */
> >> --- a/drivers/pci/pci-acpi.c
> >> +++ b/drivers/pci/pci-acpi.c
> >> @@ -189,7 +189,8 @@ static pci_power_t acpi_pci_choose_state
> >>  {
> >>       int acpi_state;
> >>
> >> -     acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL);
> >> +     acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL,
> >> +                                             ACPI_STATE_D3);
> >>       if (acpi_state < 0)
> >>               return PCI_POWER_ERROR;
> >>
> >> --- a/drivers/pnp/pnpacpi/core.c
> >> +++ b/drivers/pnp/pnpacpi/core.c
> >> @@ -170,8 +170,8 @@ static int pnpacpi_suspend(struct pnp_de
> >>       }
> >>
> >>       if (acpi_bus_power_manageable(handle)) {
> >> -             int power_state = acpi_pm_device_sleep_state(&dev->dev, NULL);
> >> -
> >> +             int power_state = acpi_pm_device_sleep_state(&dev->dev, NULL,
> >> +                                                          ACPI_STATE_D3);
> >>               if (power_state < 0)
> >>                       power_state = (state.event == PM_EVENT_ON) ?
> >>                                       ACPI_STATE_D0 : ACPI_STATE_D3;
> >> --- a/include/acpi/acpi_bus.h
> >> +++ b/include/acpi/acpi_bus.h
> >> @@ -383,13 +383,13 @@ int acpi_enable_wakeup_device_power(stru
> >>  int acpi_disable_wakeup_device_power(struct acpi_device *dev);
> >>
> >>  #ifdef CONFIG_PM
> >> -int acpi_pm_device_sleep_state(struct device *, int *);
> >> +int acpi_pm_device_sleep_state(struct device *, int *, int);
> >>  #else
> >> -static inline int acpi_pm_device_sleep_state(struct device *d, int *p)
> >> +static inline int acpi_pm_device_sleep_state(struct device *d, int *p, int m)
> >>  {
> >>       if (p)
> >>               *p = ACPI_STATE_D0;
> >> -     return ACPI_STATE_D3;
> >> +     return m == ACPI_STATE_D3 ? m : ACPI_STATE_D0;
> >
> > Shouldn't m be returned (so long as it is between D0 and D3 inclusive)?  IOW:
> >
> > +       return (m >= ACPI_STATE_D0 && m <= ACPI_STATE_D3) ? m : ACPI_STATE_D0;
> 
> My original idea is that only D0 and D3 is guaranteed to be valid for
> the device.  If that need not to be considered here, you one is
> better.

No, it need not.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 5/5] PCIe, Add PCIe runtime D3cold support
  2012-05-05  8:08     ` huang ying
@ 2012-05-07 21:22       ` Rafael J. Wysocki
  2012-05-08  2:22         ` Huang Ying
  0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-07 21:22 UTC (permalink / raw)
  To: huang ying
  Cc: Huang Ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Saturday, May 05, 2012, huang ying wrote:
> On Sat, May 5, 2012 at 4:50 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Friday, May 04, 2012, Huang Ying wrote:
> >> This patch adds runtime D3cold support to PCIe bus.  D3cold is the
> >> deepest power saving state for PCIe devices.  Where the main PCIe link
> >> will be powered off totally, so before the PCIe main link is powered
> >> on again, you can not access the device, even the configuration space,
> >> which is still accessible in D3hot.  Because the main PCIe link is not
> >> available, the PCI PM registers can not be used to put device into/out
> >> of D3cold state, that will be done by platform logic such as ACPI
> >> _PR3.
> >>
> >> To support remote wake up in D3cold, aux power is supplied, and a
> >> side-band pin: WAKE# is used to notify remote wake up event, the pin
> >> is usually connected to platform logic such as ACPI GPE.  This is
> >> quite different with other power saving states, where remote wake up
> >> is notified via PME message via the PCIe main link.
> >>
> >> PCIe devices in plug-in slot usually has no direct platform logic, for
> >> example, there is usually no ACPI _PR3 for them.  The D3cold support
> >> for these devices can be done via the PCIe port connected to it.  When
> >> power on/off the PCIe port, the corresponding PCIe devices are powered
> >> on/off too.  And the remote wake up event from PCIe device will be
> >> notified to the corresponding PCIe port.
> >>
> >> The PCI subsystem D3cold support and corresponding ACPI platform
> >> support is implemented in the patch.  Including the support for PCIe
> >> devices in plug-in slot.
> >>
> >> For more information about PCIe D3cold and corresponding ACPI support,
> >> please refer to:
> >>
> >> - PCI Express Base Specification Revision 2.0
> >> - Advanced Configuration and Power Interface Specification Revision 5.0
> >>
> >> Originally-by: Zheng Yan <zheng.z.yan@intel.com>
> >> Signed-off-by: Huang Ying <ying.huang@intel.com>
> >> ---
> >>  drivers/pci/pci-acpi.c         |   32 +++++++++--
> >>  drivers/pci/pci-driver.c       |    3 +
> >>  drivers/pci/pci.c              |  115 +++++++++++++++++++++++++++++++++++++----
> >>  drivers/pci/pci.h              |    1
> >>  drivers/pci/pcie/portdrv_pci.c |   28 +++++++++
> >>  include/linux/pci.h            |   12 +++-
> >>  6 files changed, 175 insertions(+), 16 deletions(-)
> >>
> >> --- a/drivers/pci/pci-acpi.c
> >> +++ b/drivers/pci/pci-acpi.c
> >> @@ -48,6 +48,21 @@ static void pci_acpi_wake_dev(acpi_handl
> >>       if (event != ACPI_NOTIFY_DEVICE_WAKE || !pci_dev)
> >>               return;
> >>
> >> +     if (pci_dev->current_state == PCI_D3cold) {
> >> +             unsigned int count = 0;
> >> +
> >> +             /*
> >> +              * Powering on bridge need to resume whole hierarchy,
> >> +              * just resume the children to avoid the bridge going
> >> +              * suspending as soon as resumed
> >> +              */
> >
> > Don't you need to resume the bridge before you start walking the hierarchy
> > below it?
> 
> When we resume the hierarchy below the bridge, its parent, the bridge,
> will be resumed firstly. That is:
> 
> rpm_resume(child)
>   rpm_resume(parent)
>   ->runtime_suspend(child)
> 
> >> +             if (pci_dev->subordinate)
> >> +                     count = pci_wakeup_bus(pci_dev->subordinate);
> >> +             if (count == 0)
> >> +                     pm_runtime_resume(&pci_dev->dev);
> >
> > What's the count for, exactly?
> 
> If there is no devices under the bridge, count returned will be 0,
> then we will resume bridge itself.

So it looks like you will resume the bridge in both cases, right?

Why don't you call pm_runtime_get_sync() on the bridge first and then
go for resuming the devices below it, then?

[...]
> >>  static int pcie_port_runtime_suspend(struct device *dev)
> >>  {
> >>       struct pci_dev *pdev = to_pci_dev(dev);
> >> +     struct d3cold_info d3cold_info = {
> >> +             .power_must_be_on       = dev->power.power_must_be_on,
> >> +             .d3cold_delay           = PCI_PM_D3_WAIT,
> >> +     };
> >>
> >> +     /*
> >> +      * If any subordinate device need to keep power on, we should
> >> +      * not power off the port.  The D3cold delay of port should be
> >> +      * the max of that of all subordinate devices.
> >
> > What if some of the devices below the port are ports themselves?  Or PCI-to-PCIe
> > bridges?
> 
> For them, I think the current solution is safe.

Hmm.  Shouldn't the total delay be a sum rather than a max in those cases?

Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy
  2012-05-07 20:53       ` Rafael J. Wysocki
@ 2012-05-08  1:44         ` Huang Ying
  2012-05-08 21:34           ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: Huang Ying @ 2012-05-08  1:44 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: huang ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm,
	Zheng Yan, Lan Tianyu

On Mon, 2012-05-07 at 22:53 +0200, Rafael J. Wysocki wrote:
> On Saturday, May 05, 2012, huang ying wrote:
> > On Sat, May 5, 2012 at 3:33 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > > On Friday, May 04, 2012, Huang Ying wrote:
> > >> From: Lan Tianyu <tianyu.lan@intel.com>
> > >>
> > >> Some devices can be powered off to save more power via some platform
> > >> mechanism, e.g., ACPI.  But that may not work as expected for some
> > >> device or platform.  So, this patch adds a sysfs file named power_off
> > >> under <device>/power directory to provide a mechanism for user to control
> > >> whether to allow the device to be power off.
> > >>
> > >> power_off => "enabled" means allowing the device to be powered off if
> > >> possible.
> > >>
> > >> power_off => "disabled" means the device must be power on anytime.
> > >>
> > >> Also add flag power_off_user to struct dev_pm_info to record users'
> > >> choice. The bus layer can use this field to determine whether to
> > >> power off the device.
> > >
> > > It looks like the new attribute is added for all devices regardless of whether
> > > or not they actually can be powered off?  If so, then please don't do that,
> > > it's _extremely_ confusing.
> > 
> > Yes.  You are right.
> > 
> > > If you need user space to be able to control that functionality (and I can
> > > think of a couple of cases in which you do), there need to be 2 flags,
> > > can_power_off and may_power_off, where the first one is set by the kernel
> > > if it is known that power can be removed from the device - the attribute
> > > should be created when this flag is set and removed when it is unset.
> > >
> > > Then, the setting of the second flag will be controlled by the new attribute.
> > >
> > > And you'll need to patch quite a few places where devices actually have that
> > > capability, like where power domains are in use, so that's quire a lot of
> > > work.
> > 
> > If so, I think maybe we need 3 flags:
> > 
> > - can_power_off, set by kernel when it is possible to power off the device
> 
> Well, on a second thought, it may be difficult to determine that in some
> cases (eg. for devices belonging to power domains with additional constraints
> related to the other devices in the same domain etc.).
> 
> In other cases power may be removed from devices indirectly, like for example
> by putting a device's parent into a low power state.
> 
> So I guess the can_power_off flag may not be practical after all.
> 
> > - may_power_off_user, set by user via sysfs attribute
> 
> I'd call that one power_off_allowed, meaning "allowed by user space".
> 
> > - may_power_off, set by kernel according to may_power_off_user, power
> > QoS and some other conditions
> 
> And I'd call that one power_must_be_on, meaning "don't power off even if
> allowed by user space".
> 
> > Sysfs attribute for may_power_off_user is only created if can_power_off is true.
> > 
> > I think we still can do that step by step.  For example, when we add
> > power off support to PCI devices, we set can_power_off to true for PCI
> > devices that is possible to be powered off;  when we add power domain
> > support, we set can_power_off to true for devices in power domain.  Do
> > you agree?
> 
> I think we may add helpers for exporting/unexporting power_off_allowed
> like for the PM QoS latency attribute.  Then, whoever wants to support
> power_off_allowed and use it will export it through that helper.

That sounds good!

> Still, I'm afraid we're trying to special case something that really ins't
> a special case.  Namely, we may want to restrict devices from using some
> other low-power states as well, not only power off (eg. we may want to
> prevent devices' clocks from being stopped).

One step towards generalization is to provide a way for user to specify
lowest power state allowed.  For example, for PCI devices, they can
specify D1, D2, D3hot or D3cold.  But it is hard to generalize a set of
low power states for all kind of devices.  Maybe we should keep this
user space interface bus specific?  For example, we can have a sysfs
file like "lowest_pm_state_allowed" for each PCI devices.

BTW:  I wonder that are there standard low power states defined for
devices on platform bus.

Best Regards,
Huang Ying



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 4/5] ACPI, PM, Specify lowest allowed state for device sleep state
  2012-05-07 21:15       ` Rafael J. Wysocki
@ 2012-05-08  1:49         ` Huang Ying
  0 siblings, 0 replies; 47+ messages in thread
From: Huang Ying @ 2012-05-08  1:49 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: huang ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Mon, 2012-05-07 at 23:15 +0200, Rafael J. Wysocki wrote:
> On Saturday, May 05, 2012, huang ying wrote:
> > On Sat, May 5, 2012 at 4:10 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > > On Friday, May 04, 2012, Huang Ying wrote:
> > >> Lower device sleep state can save more power, but has more exit
> > >> latency too.  Sometimes, to satisfy some power QoS and other
> > >> requirement, we need to constrain the lowest device sleep state.
> > >>
> > >> In this patch, a parameter to specify lowest allowed state for
> > >> acpi_pm_device_sleep_state is added.  So that the caller can enforce
> > >> the constraint via the parameter.
> > >>
> > >> Signed-off-by: Huang Ying <ying.huang@intel.com>
> > >> ---
> > >>  drivers/acpi/sleep.c       |   18 +++++++++++++++---
> > >>  drivers/pci/pci-acpi.c     |    3 ++-
> > >>  drivers/pnp/pnpacpi/core.c |    4 ++--
> > >>  include/acpi/acpi_bus.h    |    6 +++---
> > >>  4 files changed, 22 insertions(+), 9 deletions(-)
> > >>
> > >> --- a/drivers/acpi/sleep.c
> > >> +++ b/drivers/acpi/sleep.c
> > >> @@ -677,6 +677,7 @@ int acpi_suspend(u32 acpi_state)
> > >>   *   @dev: device to examine; its driver model wakeup flags control
> > >>   *           whether it should be able to wake up the system
> > >>   *   @d_min_p: used to store the upper limit of allowed states range
> > >> + *   @d_max_in: specify the lowest allowed states
> > >>   *   Return value: preferred power state of the device on success, -ENODEV on
> > >>   *           failure (ie. if there's no 'struct acpi_device' for @dev)
> > >>   *
> > >> @@ -693,7 +694,7 @@ int acpi_suspend(u32 acpi_state)
> > >>   *   via @wake.
> > >>   */
> > >>
> > >> -int acpi_pm_device_sleep_state(struct device *dev, int *d_min_p)
> > >> +int acpi_pm_device_sleep_state(struct device *dev, int *d_min_p, int d_max_in)
> > >>  {
> > >>       acpi_handle handle = DEVICE_ACPI_HANDLE(dev);
> > >>       struct acpi_device *adev;
> > >> @@ -704,11 +705,14 @@ int acpi_pm_device_sleep_state(struct de
> > >>               printk(KERN_DEBUG "ACPI handle has no context!\n");
> > >>               return -ENODEV;
> > >>       }
> > >> +     d_max_in = clamp_t(int, d_max_in, ACPI_STATE_D0, ACPI_STATE_D3);
> > >
> > > Shouldn't that be clamp_val(), rather?
> > 
> > Yes.  clamp_val() is sufficient here.
> > 
> > >>       acpi_method[2] = '0' + acpi_target_sleep_state;
> > >>       /*
> > >> -      * If the sleep state is S0, we will return D3, but if the device has
> > >> -      * _S0W, we will use the value from _S0W
> > >> +      * If the sleep state is S0, the lowest limit from ACPI is D3,
> > >> +      * but if the device has _S0W, we will use the value from _S0W
> > >> +      * as the lowest limit from ACPI.  Finally, we will constrain
> > >> +      * the lowest limit with the specified one.
> > >>        */
> > >>       d_min = ACPI_STATE_D0;
> > >>       d_max = ACPI_STATE_D3;
> > >> @@ -754,6 +758,14 @@ int acpi_pm_device_sleep_state(struct de
> > >>
> > >>       if (d_min_p)
> > >>               *d_min_p = d_min;
> > >> +     /* constrain d_max with specified lowest limit (max number) */
> > >> +     if (d_max > d_max_in) {
> > >> +             d_max = d_max_in;
> > >> +             for (;d_max > d_min; d_max--) {
> > >
> > > Well, why didn't you do
> > >
> > > +               for (d_max = d_max_in; d_max > d_min; d_max--)
> > 
> > Because I think it is possible that d_max < d_max_in.
> 
> I mean:
> 
> +     if (d_max > d_max_in) {
> +               for (d_max = d_max_in; d_max > d_min; d_max--) {
> 
> The assignment followed by the for () loop without the start instruction looks
> odd.

Oh, Yes.  I will change this.

> > >> +                     if (adev->power.states[d_max].flags.valid)
> > >> +                             break;
> > >> +             }
> > >> +     }
> > >
> > > And what if d_min > d_max_in ?
> > 
> > I think that means something bad happens.  Maybe we can do something as follow
> > 
> > if (d_min > d_max_in) {
> >         pr_warning("acpi_pm_device_sleep_state: the specified lowest
> > state is higher than the highest state from ACPI!");
> >         d_max_in = d_min;
> 
> Well, what about returning -EINVAL in that case?

Yes.  That is reasonable because it's a invalid parameter.

Best Regards,
Huang Ying



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 5/5] PCIe, Add PCIe runtime D3cold support
  2012-05-07 21:22       ` Rafael J. Wysocki
@ 2012-05-08  2:22         ` Huang Ying
  2012-05-08  8:34           ` Huang Ying
  0 siblings, 1 reply; 47+ messages in thread
From: Huang Ying @ 2012-05-08  2:22 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: huang ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Mon, 2012-05-07 at 23:22 +0200, Rafael J. Wysocki wrote:
> On Saturday, May 05, 2012, huang ying wrote:
> > On Sat, May 5, 2012 at 4:50 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > > On Friday, May 04, 2012, Huang Ying wrote:
> > >> This patch adds runtime D3cold support to PCIe bus.  D3cold is the
> > >> deepest power saving state for PCIe devices.  Where the main PCIe link
> > >> will be powered off totally, so before the PCIe main link is powered
> > >> on again, you can not access the device, even the configuration space,
> > >> which is still accessible in D3hot.  Because the main PCIe link is not
> > >> available, the PCI PM registers can not be used to put device into/out
> > >> of D3cold state, that will be done by platform logic such as ACPI
> > >> _PR3.
> > >>
> > >> To support remote wake up in D3cold, aux power is supplied, and a
> > >> side-band pin: WAKE# is used to notify remote wake up event, the pin
> > >> is usually connected to platform logic such as ACPI GPE.  This is
> > >> quite different with other power saving states, where remote wake up
> > >> is notified via PME message via the PCIe main link.
> > >>
> > >> PCIe devices in plug-in slot usually has no direct platform logic, for
> > >> example, there is usually no ACPI _PR3 for them.  The D3cold support
> > >> for these devices can be done via the PCIe port connected to it.  When
> > >> power on/off the PCIe port, the corresponding PCIe devices are powered
> > >> on/off too.  And the remote wake up event from PCIe device will be
> > >> notified to the corresponding PCIe port.
> > >>
> > >> The PCI subsystem D3cold support and corresponding ACPI platform
> > >> support is implemented in the patch.  Including the support for PCIe
> > >> devices in plug-in slot.
> > >>
> > >> For more information about PCIe D3cold and corresponding ACPI support,
> > >> please refer to:
> > >>
> > >> - PCI Express Base Specification Revision 2.0
> > >> - Advanced Configuration and Power Interface Specification Revision 5.0
> > >>
> > >> Originally-by: Zheng Yan <zheng.z.yan@intel.com>
> > >> Signed-off-by: Huang Ying <ying.huang@intel.com>
> > >> ---
> > >>  drivers/pci/pci-acpi.c         |   32 +++++++++--
> > >>  drivers/pci/pci-driver.c       |    3 +
> > >>  drivers/pci/pci.c              |  115 +++++++++++++++++++++++++++++++++++++----
> > >>  drivers/pci/pci.h              |    1
> > >>  drivers/pci/pcie/portdrv_pci.c |   28 +++++++++
> > >>  include/linux/pci.h            |   12 +++-
> > >>  6 files changed, 175 insertions(+), 16 deletions(-)
> > >>
> > >> --- a/drivers/pci/pci-acpi.c
> > >> +++ b/drivers/pci/pci-acpi.c
> > >> @@ -48,6 +48,21 @@ static void pci_acpi_wake_dev(acpi_handl
> > >>       if (event != ACPI_NOTIFY_DEVICE_WAKE || !pci_dev)
> > >>               return;
> > >>
> > >> +     if (pci_dev->current_state == PCI_D3cold) {
> > >> +             unsigned int count = 0;
> > >> +
> > >> +             /*
> > >> +              * Powering on bridge need to resume whole hierarchy,
> > >> +              * just resume the children to avoid the bridge going
> > >> +              * suspending as soon as resumed
> > >> +              */
> > >
> > > Don't you need to resume the bridge before you start walking the hierarchy
> > > below it?
> > 
> > When we resume the hierarchy below the bridge, its parent, the bridge,
> > will be resumed firstly. That is:
> > 
> > rpm_resume(child)
> >   rpm_resume(parent)
> >   ->runtime_suspend(child)
> > 
> > >> +             if (pci_dev->subordinate)
> > >> +                     count = pci_wakeup_bus(pci_dev->subordinate);
> > >> +             if (count == 0)
> > >> +                     pm_runtime_resume(&pci_dev->dev);
> > >
> > > What's the count for, exactly?
> > 
> > If there is no devices under the bridge, count returned will be 0,
> > then we will resume bridge itself.
> 
> So it looks like you will resume the bridge in both cases, right?
> 
> Why don't you call pm_runtime_get_sync() on the bridge first and then
> go for resuming the devices below it, then?

OK.  I will do that.

> [...]
> > >>  static int pcie_port_runtime_suspend(struct device *dev)
> > >>  {
> > >>       struct pci_dev *pdev = to_pci_dev(dev);
> > >> +     struct d3cold_info d3cold_info = {
> > >> +             .power_must_be_on       = dev->power.power_must_be_on,
> > >> +             .d3cold_delay           = PCI_PM_D3_WAIT,
> > >> +     };
> > >>
> > >> +     /*
> > >> +      * If any subordinate device need to keep power on, we should
> > >> +      * not power off the port.  The D3cold delay of port should be
> > >> +      * the max of that of all subordinate devices.
> > >
> > > What if some of the devices below the port are ports themselves?  Or PCI-to-PCIe
> > > bridges?
> > 
> > For them, I think the current solution is safe.
> 
> Hmm.  Shouldn't the total delay be a sum rather than a max in those cases?

I think the max is sufficient here.  The delay is used to wait devices
to complete the power on reset after applying the power to them. There
are two possibility.

- Power is applied to all devices (including PCIe port) below altogether
  - We just need wait the max delay

- Power is applied to devices other than PCIe port altogether, for
example, for hierarchy below:

 * rp1
   * dev1
   * rp2
     - dev2
     - dev3

When rp1 is resumed, rp1, dev1 will be applied power.  And when resuming
rp2, rp2, dev2 and dev3 will be applied power.  Here the delay for rp2,
dev2 and dev3 will be sum of delays automatically.

Best Regards,
Huang Ying



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 5/5] PCIe, Add PCIe runtime D3cold support
  2012-05-08  2:22         ` Huang Ying
@ 2012-05-08  8:34           ` Huang Ying
  2012-05-10 19:28             ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: Huang Ying @ 2012-05-08  8:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: huang ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Tue, 2012-05-08 at 10:22 +0800, Huang Ying wrote:
> On Mon, 2012-05-07 at 23:22 +0200, Rafael J. Wysocki wrote:
> > On Saturday, May 05, 2012, huang ying wrote:
[...]
> > > >> --- a/drivers/pci/pci-acpi.c
> > > >> +++ b/drivers/pci/pci-acpi.c
> > > >> @@ -48,6 +48,21 @@ static void pci_acpi_wake_dev(acpi_handl
> > > >>       if (event != ACPI_NOTIFY_DEVICE_WAKE || !pci_dev)
> > > >>               return;
> > > >>
> > > >> +     if (pci_dev->current_state == PCI_D3cold) {
> > > >> +             unsigned int count = 0;
> > > >> +
> > > >> +             /*
> > > >> +              * Powering on bridge need to resume whole hierarchy,
> > > >> +              * just resume the children to avoid the bridge going
> > > >> +              * suspending as soon as resumed
> > > >> +              */
> > > >
> > > > Don't you need to resume the bridge before you start walking the hierarchy
> > > > below it?
> > > 
> > > When we resume the hierarchy below the bridge, its parent, the bridge,
> > > will be resumed firstly. That is:
> > > 
> > > rpm_resume(child)
> > >   rpm_resume(parent)
> > >   ->runtime_suspend(child)
> > > 
> > > >> +             if (pci_dev->subordinate)
> > > >> +                     count = pci_wakeup_bus(pci_dev->subordinate);
> > > >> +             if (count == 0)
> > > >> +                     pm_runtime_resume(&pci_dev->dev);
> > > >
> > > > What's the count for, exactly?
> > > 
> > > If there is no devices under the bridge, count returned will be 0,
> > > then we will resume bridge itself.
> > 
> > So it looks like you will resume the bridge in both cases, right?
> > 
> > Why don't you call pm_runtime_get_sync() on the bridge first and then
> > go for resuming the devices below it, then?
> 
> OK.  I will do that.

After some thinking, have some question on this method.

Do you suggest something like below?

    pm_runtime_get_sync(&pci_dev->dev);
    pci_wakeup_bus(pci_dev->subordinate);
    pm_runtime_put(&pci_dev->dev);

If so,  because pci_wakeup_bus() will call pm_request_resume() on
subordinate devices, which is asynchronous, bridge may go suspended
(powered off) again after pm_runtime_put(), if resuming of subordinate
devices are still pending in work queue.


If we replace pm_runtime_put() with pm_runtime_put_noidle() as follow,

    pm_runtime_get_sync(&pci_dev->dev);
    pci_wakeup_bus(pci_dev->subordinate);
    pm_runtime_put_noidle(&pci_dev->dev);

if before pm_runtime_put_noidle(), subordinate devices have already go
through resuming/suspending cycle and go suspended again (because of
preemption?), bridge may lose an opportunity to go suspended.

Or we can add a parameter to pci_wakeup_bus() which will resume the
first device synchronously?

Best Regards,
Huang Ying



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy
  2012-05-08  1:44         ` Huang Ying
@ 2012-05-08 21:34           ` Rafael J. Wysocki
  2012-05-09  6:46             ` Huang Ying
  0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-08 21:34 UTC (permalink / raw)
  To: Huang Ying
  Cc: huang ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm,
	Zheng Yan, Lan Tianyu

On Tuesday, May 08, 2012, Huang Ying wrote:
> On Mon, 2012-05-07 at 22:53 +0200, Rafael J. Wysocki wrote:
> > On Saturday, May 05, 2012, huang ying wrote:
> > > On Sat, May 5, 2012 at 3:33 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > > > On Friday, May 04, 2012, Huang Ying wrote:
> > > >> From: Lan Tianyu <tianyu.lan@intel.com>
[...]
> > I think we may add helpers for exporting/unexporting power_off_allowed
> > like for the PM QoS latency attribute.  Then, whoever wants to support
> > power_off_allowed and use it will export it through that helper.
> 
> That sounds good!
> 
> > Still, I'm afraid we're trying to special case something that really ins't
> > a special case.  Namely, we may want to restrict devices from using some
> > other low-power states as well, not only power off (eg. we may want to
> > prevent devices' clocks from being stopped).
> 
> One step towards generalization is to provide a way for user to specify
> lowest power state allowed.  For example, for PCI devices, they can
> specify D1, D2, D3hot or D3cold.  But it is hard to generalize a set of
> low power states for all kind of devices.  Maybe we should keep this
> user space interface bus specific?

I came to the same conclusion. :-)

Besides, we already have the no_d1d2 and d1_support, d2_support flags
in struct pci_dev.  We can simply add no_d3_cold in analogy and add a similar
thing for ATA to cover the ZPODD case.

I would keep those things bus-type-specific and platform-specific.

> For example, we can have a sysfs file like "lowest_pm_state_allowed" for each
> PCI devices.

I'm not sure if that's going to be sufficient, because some devices appear to
have problems with _intermediate_ low-power states.

> BTW:  I wonder that are there standard low power states defined for
> devices on platform bus.

Not that I know of.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy
  2012-05-08 21:34           ` Rafael J. Wysocki
@ 2012-05-09  6:46             ` Huang Ying
  2012-05-09 10:38               ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: Huang Ying @ 2012-05-09  6:46 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: huang ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm,
	Zheng Yan, Lan Tianyu

On Tue, 2012-05-08 at 23:34 +0200, Rafael J. Wysocki wrote:
> On Tuesday, May 08, 2012, Huang Ying wrote:
> > On Mon, 2012-05-07 at 22:53 +0200, Rafael J. Wysocki wrote:
> > > On Saturday, May 05, 2012, huang ying wrote:
> > > > On Sat, May 5, 2012 at 3:33 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > > > > On Friday, May 04, 2012, Huang Ying wrote:
> > > > >> From: Lan Tianyu <tianyu.lan@intel.com>
> [...]
> > > I think we may add helpers for exporting/unexporting power_off_allowed
> > > like for the PM QoS latency attribute.  Then, whoever wants to support
> > > power_off_allowed and use it will export it through that helper.
> > 
> > That sounds good!
> > 
> > > Still, I'm afraid we're trying to special case something that really ins't
> > > a special case.  Namely, we may want to restrict devices from using some
> > > other low-power states as well, not only power off (eg. we may want to
> > > prevent devices' clocks from being stopped).
> > 
> > One step towards generalization is to provide a way for user to specify
> > lowest power state allowed.  For example, for PCI devices, they can
> > specify D1, D2, D3hot or D3cold.  But it is hard to generalize a set of
> > low power states for all kind of devices.  Maybe we should keep this
> > user space interface bus specific?
> 
> I came to the same conclusion. :-)
> 
> Besides, we already have the no_d1d2 and d1_support, d2_support flags
> in struct pci_dev.  We can simply add no_d3_cold in analogy and add a similar
> thing for ATA to cover the ZPODD case.
> 
> I would keep those things bus-type-specific and platform-specific.

But power-off sate seems like the only low power state that can be
shared between buses.  And keep power_must_be_on in dev_pm_info seems
good for power domain implementation.  Because one power domain may
contain devices comes from different buses.

Best Regards,
Huang Ying



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy
  2012-05-09  6:46             ` Huang Ying
@ 2012-05-09 10:38               ` Rafael J. Wysocki
  2012-05-10  0:55                 ` Huang Ying
  0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-09 10:38 UTC (permalink / raw)
  To: Huang Ying
  Cc: huang ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm,
	Zheng Yan, Lan Tianyu

On Wednesday, May 09, 2012, Huang Ying wrote:
> On Tue, 2012-05-08 at 23:34 +0200, Rafael J. Wysocki wrote:
> > On Tuesday, May 08, 2012, Huang Ying wrote:
> > > On Mon, 2012-05-07 at 22:53 +0200, Rafael J. Wysocki wrote:
> > > > On Saturday, May 05, 2012, huang ying wrote:
> > > > > On Sat, May 5, 2012 at 3:33 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > > > > > On Friday, May 04, 2012, Huang Ying wrote:
> > > > > >> From: Lan Tianyu <tianyu.lan@intel.com>
> > [...]
> > > > I think we may add helpers for exporting/unexporting power_off_allowed
> > > > like for the PM QoS latency attribute.  Then, whoever wants to support
> > > > power_off_allowed and use it will export it through that helper.
> > > 
> > > That sounds good!
> > > 
> > > > Still, I'm afraid we're trying to special case something that really ins't
> > > > a special case.  Namely, we may want to restrict devices from using some
> > > > other low-power states as well, not only power off (eg. we may want to
> > > > prevent devices' clocks from being stopped).
> > > 
> > > One step towards generalization is to provide a way for user to specify
> > > lowest power state allowed.  For example, for PCI devices, they can
> > > specify D1, D2, D3hot or D3cold.  But it is hard to generalize a set of
> > > low power states for all kind of devices.  Maybe we should keep this
> > > user space interface bus specific?
> > 
> > I came to the same conclusion. :-)
> > 
> > Besides, we already have the no_d1d2 and d1_support, d2_support flags
> > in struct pci_dev.  We can simply add no_d3_cold in analogy and add a similar
> > thing for ATA to cover the ZPODD case.
> > 
> > I would keep those things bus-type-specific and platform-specific.
> 
> But power-off sate seems like the only low power state that can be
> shared between buses.

Well, the problem is that "power off" is not so well defined really.
It geneally means "power removed", but how to achieve that, or more
precisely what sequence of events leads to that situation, is rather
bus/platform-specific.  It may be a direct action (like _PS3) or
putting the parent into a low-power state (which need not mean "power off"
for the parent), or turning off a power domain.

Also, from a driver's perspective the result of putting a device into
some non-power-off low-power state may be just as unpleasant as the
result of removing power from it.

> And keep power_must_be_on in dev_pm_info seems
> good for power domain implementation.  Because one power domain may
> contain devices comes from different buses.

But then it may go into the domain data.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy
  2012-05-09 10:38               ` Rafael J. Wysocki
@ 2012-05-10  0:55                 ` Huang Ying
  2012-05-10 14:48                   ` Alan Stern
  0 siblings, 1 reply; 47+ messages in thread
From: Huang Ying @ 2012-05-10  0:55 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: huang ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm,
	Zheng Yan, Lan Tianyu

On Wed, 2012-05-09 at 12:38 +0200, Rafael J. Wysocki wrote:
> On Wednesday, May 09, 2012, Huang Ying wrote:
> > On Tue, 2012-05-08 at 23:34 +0200, Rafael J. Wysocki wrote:
> > > On Tuesday, May 08, 2012, Huang Ying wrote:
> > > > On Mon, 2012-05-07 at 22:53 +0200, Rafael J. Wysocki wrote:
> > > > > On Saturday, May 05, 2012, huang ying wrote:
> > > > > > On Sat, May 5, 2012 at 3:33 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > > > > > > On Friday, May 04, 2012, Huang Ying wrote:
> > > > > > >> From: Lan Tianyu <tianyu.lan@intel.com>
> > > [...]
> > > > > I think we may add helpers for exporting/unexporting power_off_allowed
> > > > > like for the PM QoS latency attribute.  Then, whoever wants to support
> > > > > power_off_allowed and use it will export it through that helper.
> > > > 
> > > > That sounds good!
> > > > 
> > > > > Still, I'm afraid we're trying to special case something that really ins't
> > > > > a special case.  Namely, we may want to restrict devices from using some
> > > > > other low-power states as well, not only power off (eg. we may want to
> > > > > prevent devices' clocks from being stopped).
> > > > 
> > > > One step towards generalization is to provide a way for user to specify
> > > > lowest power state allowed.  For example, for PCI devices, they can
> > > > specify D1, D2, D3hot or D3cold.  But it is hard to generalize a set of
> > > > low power states for all kind of devices.  Maybe we should keep this
> > > > user space interface bus specific?
> > > 
> > > I came to the same conclusion. :-)
> > > 
> > > Besides, we already have the no_d1d2 and d1_support, d2_support flags
> > > in struct pci_dev.  We can simply add no_d3_cold in analogy and add a similar
> > > thing for ATA to cover the ZPODD case.
> > > 
> > > I would keep those things bus-type-specific and platform-specific.
> > 
> > But power-off sate seems like the only low power state that can be
> > shared between buses.
> 
> Well, the problem is that "power off" is not so well defined really.
> It geneally means "power removed", but how to achieve that, or more
> precisely what sequence of events leads to that situation, is rather
> bus/platform-specific.  It may be a direct action (like _PS3) or
> putting the parent into a low-power state (which need not mean "power off"
> for the parent), or turning off a power domain.

How to remove power is not general enough, but I think whether to power
off is more general.  So a flag like power_must_be_on can be general and
useful.

> Also, from a driver's perspective the result of putting a device into
> some non-power-off low-power state may be just as unpleasant as the
> result of removing power from it.

Yes.  From device driver or bus point of view, "power off" is just like
other low power state.  But from the "all devices" point of view, power
off can be the only low power state that is shared by all devices.  So
we can say "power off" is special from this point of view.

> > And keep power_must_be_on in dev_pm_info seems
> > good for power domain implementation.  Because one power domain may
> > contain devices comes from different buses.
> 
> But then it may go into the domain data.

That is how it is implemented for domain now.  But consider we may put a
PCIe device into power domain on some platform.  On some other platform,
D3Cold may implemented not in power domain, so we need a "no_d3cold" in
struct pci_dev.  When a PCIe device driver wants to disable "power off",
it will need to set pci_dev->no_d3cold to false and call
pm_genpd_dev_always_on().  Is it better for us to just set
dev->power_must_be_on to true instead of two operations?  After all,
duplicated state is bad smell of code.

Best Regards,
Huang Ying



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy
  2012-05-10  0:55                 ` Huang Ying
@ 2012-05-10 14:48                   ` Alan Stern
  2012-05-10 19:03                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: Alan Stern @ 2012-05-10 14:48 UTC (permalink / raw)
  To: Huang Ying
  Cc: Rafael J. Wysocki, huang ying, Bjorn Helgaas, ming.m.lin,
	linux-kernel, linux-pm, Zheng Yan, Lan Tianyu

On Thu, 10 May 2012, Huang Ying wrote:

> How to remove power is not general enough, but I think whether to power
> off is more general.  So a flag like power_must_be_on can be general and
> useful.

I'm not so sure about that.

> > Also, from a driver's perspective the result of putting a device into
> > some non-power-off low-power state may be just as unpleasant as the
> > result of removing power from it.
> 
> Yes.  From device driver or bus point of view, "power off" is just like
> other low power state.  But from the "all devices" point of view, power
> off can be the only low power state that is shared by all devices.  So
> we can say "power off" is special from this point of view.

No -- it's not true that all devices share a "power off" state while 
the system is running.

The only thing that can be said about all devices is that some of them 
can be put into one or many low-power states.  That's what the PM core 
means when it talks about runtime suspend.

Alan Stern


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy
  2012-05-10 14:48                   ` Alan Stern
@ 2012-05-10 19:03                     ` Rafael J. Wysocki
  0 siblings, 0 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-10 19:03 UTC (permalink / raw)
  To: Alan Stern, Huang Ying
  Cc: huang ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm,
	Zheng Yan, Lan Tianyu

On Thursday, May 10, 2012, Alan Stern wrote:
> On Thu, 10 May 2012, Huang Ying wrote:
> 
> > How to remove power is not general enough, but I think whether to power
> > off is more general.  So a flag like power_must_be_on can be general and
> > useful.
> 
> I'm not so sure about that.

Me neither. :-)

> > > Also, from a driver's perspective the result of putting a device into
> > > some non-power-off low-power state may be just as unpleasant as the
> > > result of removing power from it.
> > 
> > Yes.  From device driver or bus point of view, "power off" is just like
> > other low power state.  But from the "all devices" point of view, power
> > off can be the only low power state that is shared by all devices.  So
> > we can say "power off" is special from this point of view.
> 
> No -- it's not true that all devices share a "power off" state while 
> the system is running.
> 
> The only thing that can be said about all devices is that some of them 
> can be put into one or many low-power states.  That's what the PM core 
> means when it talks about runtime suspend.

That's correct.

We actually had discussed that in-depth when we were about to implement the
runtime PM core code and the conclusion was that low-power states, including
the "power off" one, could not be generalized.  That's why we decided to make
the runtime PM core distinguish only two status values related to the current
state of the device, active, which meant that the device could do the I/O
normally, and suspended, which meant that the device was not supposed to do the
I/O (most likely because it was in a low-power state of some sort).

The handling of the detials related to device low-power states is supposed to
be done at the subsystem (e.g. bus type) or PM domain level and "power off" is
just one of those states, so I think it should just stay at this level.

Of course, the power state of a parent device may be restricted by its children
requirements, but "don't remove power from me" is just one way of specifying
them (and it's not universal).

So, I suggest that we add a no_d3_cold flag to struct pci_dev and an analogous
flag to the SATA data structures and make the PCI core and SATA controller
drivers use those flags.  If it turns out that we can or have to do something
more general in the future, we'll to it _then_.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 5/5] PCIe, Add PCIe runtime D3cold support
  2012-05-08  8:34           ` Huang Ying
@ 2012-05-10 19:28             ` Rafael J. Wysocki
  0 siblings, 0 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-10 19:28 UTC (permalink / raw)
  To: Huang Ying
  Cc: huang ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Tuesday, May 08, 2012, Huang Ying wrote:
> On Tue, 2012-05-08 at 10:22 +0800, Huang Ying wrote:
> > On Mon, 2012-05-07 at 23:22 +0200, Rafael J. Wysocki wrote:
> > > On Saturday, May 05, 2012, huang ying wrote:
> [...]
> > > > >> --- a/drivers/pci/pci-acpi.c
> > > > >> +++ b/drivers/pci/pci-acpi.c
> > > > >> @@ -48,6 +48,21 @@ static void pci_acpi_wake_dev(acpi_handl
> > > > >>       if (event != ACPI_NOTIFY_DEVICE_WAKE || !pci_dev)
> > > > >>               return;
> > > > >>
> > > > >> +     if (pci_dev->current_state == PCI_D3cold) {
> > > > >> +             unsigned int count = 0;
> > > > >> +
> > > > >> +             /*
> > > > >> +              * Powering on bridge need to resume whole hierarchy,
> > > > >> +              * just resume the children to avoid the bridge going
> > > > >> +              * suspending as soon as resumed
> > > > >> +              */
> > > > >
> > > > > Don't you need to resume the bridge before you start walking the hierarchy
> > > > > below it?
> > > > 
> > > > When we resume the hierarchy below the bridge, its parent, the bridge,
> > > > will be resumed firstly. That is:
> > > > 
> > > > rpm_resume(child)
> > > >   rpm_resume(parent)
> > > >   ->runtime_suspend(child)
> > > > 
> > > > >> +             if (pci_dev->subordinate)
> > > > >> +                     count = pci_wakeup_bus(pci_dev->subordinate);
> > > > >> +             if (count == 0)
> > > > >> +                     pm_runtime_resume(&pci_dev->dev);
> > > > >
> > > > > What's the count for, exactly?
> > > > 
> > > > If there is no devices under the bridge, count returned will be 0,
> > > > then we will resume bridge itself.
> > > 
> > > So it looks like you will resume the bridge in both cases, right?
> > > 
> > > Why don't you call pm_runtime_get_sync() on the bridge first and then
> > > go for resuming the devices below it, then?
> > 
> > OK.  I will do that.
> 
> After some thinking, have some question on this method.
> 
> Do you suggest something like below?
> 
>     pm_runtime_get_sync(&pci_dev->dev);
>     pci_wakeup_bus(pci_dev->subordinate);
>     pm_runtime_put(&pci_dev->dev);
> 
> If so,  because pci_wakeup_bus() will call pm_request_resume() on
> subordinate devices, which is asynchronous, bridge may go suspended
> (powered off) again after pm_runtime_put(), if resuming of subordinate
> devices are still pending in work queue.

I don't think this is a big problem.  Worst case it will be resumed
again by the first resuming child, but I doubt we'll see much of that
in practice.

Alternatively, you can use pm_runtime_put_noidle() followed by
pm_runtime_schedule_suspend() with appropriate delay.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port
  2012-05-07 21:00       ` Rafael J. Wysocki
@ 2012-05-11  7:57         ` Huang Ying
  2012-05-11 18:44           ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: Huang Ying @ 2012-05-11  7:57 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: huang ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Mon, 2012-05-07 at 23:00 +0200, Rafael J. Wysocki wrote:
> On Saturday, May 05, 2012, huang ying wrote:
> > On Sat, May 5, 2012 at 3:43 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > > On Friday, May 04, 2012, Huang Ying wrote:
> > >> From: Zheng Yan <zheng.z.yan@intel.com>
> > >>
> > >> This patch adds runtime PM support to PCIe port.  This is needed by
> > >> PCIe D3cold support, where PCIe device in slot may be powered on/off
> > >> by PCIe port.
> > >>
> > >> Because runtime suspend is broken for some chipset, a white list is
> > >> used to enable runtime PM support for only chipset known to work.
> > >>
> > >> Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
> > >> Signed-off-by: Huang Ying <ying.huang@intel.com>
> > >> ---
> > >>  drivers/pci/pci.c              |    9 +++++++++
> > >>  drivers/pci/pcie/portdrv_pci.c |   40 ++++++++++++++++++++++++++++++++++++++++
> > >>  2 files changed, 49 insertions(+)
> > >>
> > >> --- a/drivers/pci/pci.c
> > >> +++ b/drivers/pci/pci.c
> > >> @@ -1476,6 +1476,15 @@ bool pci_check_pme_status(struct pci_dev
> > >>   */
> > >>  static int pci_pme_wakeup(struct pci_dev *dev, void *pme_poll_reset)
> > >>  {
> > >> +     struct pci_dev *bridge = dev->bus->self;
> > >> +
> > >> +     /*
> > >> +      * If bridge is in low power state, the configuration space of
> > >> +      * subordinate devices may be not accessible
> > >
> > > Please also say in the comment _when_ this is possible.  That's far from
> > > obvious, because the runtime PM framework generally ensures that parents are
> > > resumed before the children, so the comment should describe the particular
> > > scenario leading to this situation.
> > 
> > OK.  I will add something like below into comments.
> > 
> > This is possible when doing PME poll.
> 
> Well, that doesn't really explain much. :-)
> 
> I _think_ the situation is when a device causes WAKE# to be generated and
> the platform receives a GPE as a result and we get an ACPI_NOTIFY_DEVICE_WAKE
> notification for the device, which may be under a bridge (PCIe port really)
> in D3_cold.  Is that the case?

That is not the case now.  Because we do not use pci_pme_wakeup_bus() in
ACPI handler now.

But for PME poll, if some subordinate devices under a bridge is put in
low power state, we still need to access configuration space of
subordinate devices to poll PME status even when bridge is in low power
state.  That may not work.

> > >> +      */
> > >> +     if (bridge && bridge->current_state != PCI_D0)
> > >> +             return 0;
> > >> +
> > >>       if (pme_poll_reset && dev->pme_poll)
> > >>               dev->pme_poll = false;
> > >>
> > >> --- a/drivers/pci/pcie/portdrv_pci.c
> > >> +++ b/drivers/pci/pcie/portdrv_pci.c
> > >> @@ -11,6 +11,7 @@
> > >>  #include <linux/kernel.h>
> > >>  #include <linux/errno.h>
> > >>  #include <linux/pm.h>
> > >> +#include <linux/pm_runtime.h>
> > >>  #include <linux/init.h>
> > >>  #include <linux/pcieport_if.h>
> > >>  #include <linux/aer.h>
> > >> @@ -99,6 +100,27 @@ static int pcie_port_resume_noirq(struct
> > >>       return 0;
> > >>  }
> > >>
> > >> +#ifdef CONFIG_PM_RUNTIME
> > >> +static int pcie_port_runtime_suspend(struct device *dev)
> > >> +{
> > >> +     struct pci_dev *pdev = to_pci_dev(dev);
> > >> +
> > >
> > > A comment explaining why this is needed here would be welcome.
> > 
> > Sorry, do not catch your idea exactly.  What is needed?  Why do we
> > need to add PCIe port runtime suspend support?
> 
> No, why we need to call pci_save_state() from here and pci_restore_state()
> from the corresponding resume routine.
> 
> In theory that shouldn't be necessary, because the PCI bus type's
> runtime suspend/resume routines do the save/restore of the PCI config space.

I found maybe we do not need this special processing, will test more.

Best Regards,
Huang Ying



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port
  2012-05-11  7:57         ` Huang Ying
@ 2012-05-11 18:44           ` Rafael J. Wysocki
  0 siblings, 0 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2012-05-11 18:44 UTC (permalink / raw)
  To: Huang Ying
  Cc: huang ying, Bjorn Helgaas, ming.m.lin, linux-kernel, linux-pm, Zheng Yan

On Friday, May 11, 2012, Huang Ying wrote:
> On Mon, 2012-05-07 at 23:00 +0200, Rafael J. Wysocki wrote:
> > On Saturday, May 05, 2012, huang ying wrote:
> > > On Sat, May 5, 2012 at 3:43 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > > > On Friday, May 04, 2012, Huang Ying wrote:
> > > >> From: Zheng Yan <zheng.z.yan@intel.com>
> > > >>
> > > >> This patch adds runtime PM support to PCIe port.  This is needed by
> > > >> PCIe D3cold support, where PCIe device in slot may be powered on/off
> > > >> by PCIe port.
> > > >>
> > > >> Because runtime suspend is broken for some chipset, a white list is
> > > >> used to enable runtime PM support for only chipset known to work.
> > > >>
> > > >> Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
> > > >> Signed-off-by: Huang Ying <ying.huang@intel.com>
> > > >> ---
> > > >>  drivers/pci/pci.c              |    9 +++++++++
> > > >>  drivers/pci/pcie/portdrv_pci.c |   40 ++++++++++++++++++++++++++++++++++++++++
> > > >>  2 files changed, 49 insertions(+)
> > > >>
> > > >> --- a/drivers/pci/pci.c
> > > >> +++ b/drivers/pci/pci.c
> > > >> @@ -1476,6 +1476,15 @@ bool pci_check_pme_status(struct pci_dev
> > > >>   */
> > > >>  static int pci_pme_wakeup(struct pci_dev *dev, void *pme_poll_reset)
> > > >>  {
> > > >> +     struct pci_dev *bridge = dev->bus->self;
> > > >> +
> > > >> +     /*
> > > >> +      * If bridge is in low power state, the configuration space of
> > > >> +      * subordinate devices may be not accessible
> > > >
> > > > Please also say in the comment _when_ this is possible.  That's far from
> > > > obvious, because the runtime PM framework generally ensures that parents are
> > > > resumed before the children, so the comment should describe the particular
> > > > scenario leading to this situation.
> > > 
> > > OK.  I will add something like below into comments.
> > > 
> > > This is possible when doing PME poll.
> > 
> > Well, that doesn't really explain much. :-)
> > 
> > I _think_ the situation is when a device causes WAKE# to be generated and
> > the platform receives a GPE as a result and we get an ACPI_NOTIFY_DEVICE_WAKE
> > notification for the device, which may be under a bridge (PCIe port really)
> > in D3_cold.  Is that the case?
> 
> That is not the case now.  Because we do not use pci_pme_wakeup_bus() in
> ACPI handler now.
> 
> But for PME poll, if some subordinate devices under a bridge is put in
> low power state, we still need to access configuration space of
> subordinate devices to poll PME status even when bridge is in low power
> state.  That may not work.

OK, I see.  You mean that pci_pme_list_scan() should avoid checking the
PME status for devices whose parents (bridges) are in low-power states, right?

I think that that check should be made by pci_pme_list_scan() itself, then.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2012-05-11 18:39 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-04  8:13 [RFC v2 0/5] PCIe, Add PCIe runtime D3cold support Huang Ying
2012-05-04  8:13 ` [RFC v2 1/5] PM, Runtime, Add power_must_be_on flag Huang Ying
2012-05-04 19:37   ` Rafael J. Wysocki
2012-05-05  5:15     ` huang ying
2012-05-07 20:33       ` Rafael J. Wysocki
2012-05-04 19:50   ` Bjorn Helgaas
2012-05-05  5:59     ` huang ying
2012-05-07 20:37       ` Rafael J. Wysocki
2012-05-04  8:13 ` [RFC v2 2/5] PM, Add sysfs file power_off to control device power off policy Huang Ying
2012-05-04 19:33   ` Rafael J. Wysocki
2012-05-05  6:29     ` huang ying
2012-05-07 20:53       ` Rafael J. Wysocki
2012-05-08  1:44         ` Huang Ying
2012-05-08 21:34           ` Rafael J. Wysocki
2012-05-09  6:46             ` Huang Ying
2012-05-09 10:38               ` Rafael J. Wysocki
2012-05-10  0:55                 ` Huang Ying
2012-05-10 14:48                   ` Alan Stern
2012-05-10 19:03                     ` Rafael J. Wysocki
2012-05-04 19:50   ` Bjorn Helgaas
2012-05-04 21:00     ` Rafael J. Wysocki
2012-05-05  6:36     ` huang ying
2012-05-04  8:13 ` [RFC v2 3/5] PCIe, Add runtime PM support to PCIe port Huang Ying
2012-05-04 19:43   ` Rafael J. Wysocki
2012-05-05  6:46     ` huang ying
2012-05-07 21:00       ` Rafael J. Wysocki
2012-05-11  7:57         ` Huang Ying
2012-05-11 18:44           ` Rafael J. Wysocki
2012-05-04 19:50   ` Bjorn Helgaas
2012-05-04 20:55     ` Rafael J. Wysocki
2012-05-05  6:54       ` huang ying
2012-05-07 21:06         ` Rafael J. Wysocki
2012-05-05  6:53     ` huang ying
2012-05-04  8:13 ` [RFC v2 4/5] ACPI, PM, Specify lowest allowed state for device sleep state Huang Ying
2012-05-04 20:10   ` Rafael J. Wysocki
2012-05-05  7:25     ` huang ying
2012-05-07 21:15       ` Rafael J. Wysocki
2012-05-08  1:49         ` Huang Ying
2012-05-04  8:13 ` [RFC v2 5/5] PCIe, Add PCIe runtime D3cold support Huang Ying
2012-05-04 19:51   ` Bjorn Helgaas
2012-05-05  7:34     ` huang ying
2012-05-04 20:50   ` Rafael J. Wysocki
2012-05-05  8:08     ` huang ying
2012-05-07 21:22       ` Rafael J. Wysocki
2012-05-08  2:22         ` Huang Ying
2012-05-08  8:34           ` Huang Ying
2012-05-10 19:28             ` Rafael J. Wysocki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).