* [PATCH] PM: Introduce core framework for run-time PM of I/O devices
@ 2009-06-13 22:23 Rafael J. Wysocki
  2009-06-14  9:41 ` Magnus Damm
                   ` (3 more replies)
  0 siblings, 4 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-13 22:23 UTC
  To: Alan Stern, Oliver Neukum, Magnus Damm
  Cc: pm list, LKML, Ingo Molnar, ACPI Devel Mailing List

Hi,

Below is the current version of my "run-time PM for I/O devices" patch.

I've done my best to address the comments received during the recent
discussions, but at the same time I've tried to make the patch only contain
the most essential things.  For this reason, for example, the sysfs interface
is not there and it's going to be added in a separate patch.

Please let me know if you want me to change anything in this patch or to add
anything new to it.  [Magnus, I remember you wanted something like
->runtime_wakeup() along with ->runtime_idle(), but I'm not sure it's really
necessary.  Please let me know if you have any particular usage scenario for
it.]

Best,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>

Introduce a core framework for run-time power management of I/O
devices.  Add device run-time PM fields to 'struct dev_pm_info'
and device run-time PM callbacks to 'struct dev_pm_ops'.  Introduce
a run-time PM workqueue and define some device run-time PM helper
functions at the core level.  Document all these things.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/runtime_pm.txt |  250 ++++++++++++++++++++
 drivers/base/dd.c                  |    9 
 drivers/base/power/Makefile        |    1 
 drivers/base/power/main.c          |    5 
 drivers/base/power/runtime.c       |  461 +++++++++++++++++++++++++++++++++++++
 include/linux/pm.h                 |   98 +++++++
 include/linux/pm_runtime.h         |   63 +++++
 kernel/power/Kconfig               |   14 +
 kernel/power/main.c                |   17 +
 9 files changed, 915 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/power/Kconfig
===================================================================
--- linux-2.6.orig/kernel/power/Kconfig
+++ linux-2.6/kernel/power/Kconfig
@@ -208,3 +208,17 @@ config APM_EMULATION
 	  random kernel OOPSes or reboots that don't seem to be related to
 	  anything, try disabling/enabling this option (or disabling/enabling
 	  APM in your BIOS).
+
+config PM_RUNTIME
+	bool "Run-time PM core functionality"
+	depends on PM
+	---help---
+	  Enable functionality allowing I/O devices to be put into energy-saving
+	  (low power) states at run time (or autosuspended) after a specified
+	  period of inactivity and woken up in response to a hardware-generated
+	  wake-up event or a driver's request.
+
+	  Hardware support is generally required for this functionality to work
+	  and the bus type drivers of the buses the devices are on are
+	  responsible for the actual handling of the autosuspend requests and
+	  wake-up events.
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -11,6 +11,7 @@
 #include <linux/kobject.h>
 #include <linux/string.h>
 #include <linux/resume-trace.h>
+#include <linux/workqueue.h>
 
 #include "power.h"
 
@@ -217,8 +218,24 @@ static struct attribute_group attr_group
 	.attrs = g,
 };
 
+#ifdef CONFIG_PM_RUNTIME
+struct workqueue_struct *pm_wq;
+
+static int __init pm_start_workqueue(void)
+{
+	pm_wq = create_freezeable_workqueue("pm");
+
+	return pm_wq ? 0 : -ENOMEM;
+}
+#else
+static inline int pm_start_workqueue(void) { return 0; }
+#endif
+
 static int __init pm_init(void)
 {
+	int error = pm_start_workqueue();
+	if (error)
+		return error;
 	power_kobj = kobject_create_and_add("power", NULL);
 	if (!power_kobj)
 		return -ENOMEM;
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -22,6 +22,9 @@
 #define _LINUX_PM_H
 
 #include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -165,6 +168,26 @@ typedef struct pm_message {
  * It is allowed to unregister devices while the above callbacks are being
  * executed.  However, it is not allowed to unregister a device from within any
  * of its own callbacks.
+ *
+ * There also are the following callbacks related to run-time power management
+ * of devices:
+ *
+ * @runtime_suspend: Prepare the device for a condition in which it won't be
+ *	able to communicate with the CPU(s) and RAM due to power management.
+ *	This need not mean that the device should be put into a low power state,
+ *	like for example when the device is behind a link, represented by a
+ *	separate device object, that is going to be turned off for power
+ *	management purposes.
+ *
+ * @runtime_resume: Put the device into the fully active state in response to a
+ *	wake-up event generated by hardware or at a request of software.  If
+ *	necessary, put the device into the full power state and restore its
+ *	registers, so that it is fully operational.
+ *
+ * @runtime_idle: Device appears to be inactive and it might be put into a low
+ *	power state if all of the necessary conditions are satisfied.  Check
+ *	these conditions and handle the device as appropriate, possibly queueing
+ *	a suspend request for it.
  */
 
 struct dev_pm_ops {
@@ -182,6 +205,11 @@ struct dev_pm_ops {
 	int (*thaw_noirq)(struct device *dev);
 	int (*poweroff_noirq)(struct device *dev);
 	int (*restore_noirq)(struct device *dev);
+#ifdef CONFIG_PM_RUNTIME
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+#endif
 };
 
 /**
@@ -315,14 +343,78 @@ enum dpm_state {
 	DPM_OFF_IRQ,
 };
 
+/**
+ * Device run-time power management state.
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.  They do not
+ * reflect the actual power state of the device or its status as seen by the
+ * driver.
+ *
+ * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
+ *			pending for it.
+ *
+ * RPM_IDLE		It has been requested that the device be suspended.
+ *			Suspend request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
+ *			executed.
+ *
+ * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
+ *			completed successfully.  The device is regarded as
+ *			suspended.
+ *
+ * RPM_WAKE		It has been requested that the device be woken up.
+ *			Resume request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
+ *			executed.
+ *
+ * RPM_ERROR		Represents a condition from which the PM core cannot
+ *			recover by itself.  If the device's run-time PM status
+ *			field has this value, all of the run-time PM operations
+ *			carried out for the device by the core will fail, until
+ *			the status field is changed to either RPM_ACTIVE or
+ *			RPM_SUSPENDED (it is not valid to use the other values
+ *			in such a situation) by the device's driver or bus type.
+ *			This happens when the device bus type's
+ *			->runtime_suspend() or ->runtime_resume() callback
+ *			returns an error code other than -EAGAIN or -EBUSY.
+ */
+
+#define RPM_ACTIVE	0
+#define RPM_IDLE	0x01
+#define RPM_SUSPENDING	0x02
+#define RPM_SUSPENDED	0x04
+#define RPM_WAKE	0x08
+#define RPM_RESUMING	0x10
+#define RPM_ERROR	(-1)
+
+#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
+#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)
+#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING)
+#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)
+
 struct dev_pm_info {
 	pm_message_t		power_state;
-	unsigned		can_wakeup:1;
-	unsigned		should_wakeup:1;
+	unsigned int		can_wakeup:1;
+	unsigned int		should_wakeup:1;
 	enum dpm_state		status;		/* Owned by the PM core */
-#ifdef	CONFIG_PM_SLEEP
+#ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
 #endif
+#ifdef CONFIG_PM_RUNTIME
+	struct delayed_work	runtime_work;
+	struct completion	work_done;
+	unsigned int		suspend_skip_children:1;
+	unsigned int		suspend_aborted:1;
+	unsigned int		runtime_status:5;
+	int			runtime_error;
+	atomic_t		depth;
+	spinlock_t		lock;
+#endif
 };
 
 /*
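
The RPM_* values in the hunk above are bit flags, so more than one of them can
be set in 'runtime_status' at once (for example RPM_SUSPENDING together with
RPM_WAKE when a resume request arrives during a suspend).  The user-space
sketch below is only an illustration of that encoding, not part of the patch:
'struct rpm_model', rpm_stored_status() and rpm_wake_during_suspend() are
hypothetical names, and only the #define values are copied from the hunk.
Assuming standard C conversion rules, it also suggests that storing RPM_ERROR
(-1) in a 5-bit unsigned bit-field reads back as 0x1f, i.e. with every status
bit set, so bit tests such as 'runtime_status & RPM_SUSPENDED' would match a
device in the error state as well.

```c
#include <stdbool.h>

/* Status values copied from the include/linux/pm.h hunk of the patch. */
#define RPM_ACTIVE	0
#define RPM_IDLE	0x01
#define RPM_SUSPENDING	0x02
#define RPM_SUSPENDED	0x04
#define RPM_WAKE	0x08
#define RPM_RESUMING	0x10
#define RPM_ERROR	(-1)

#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)

/* Hypothetical stand-in for the run-time PM status field of dev_pm_info. */
struct rpm_model {
	unsigned int runtime_status:5;
};

/* Store a status in the 5-bit field and return what can be read back. */
static unsigned int rpm_stored_status(int status)
{
	struct rpm_model m = { 0 };

	m.runtime_status = status;	/* converted modulo 2^5 */
	return m.runtime_status;
}

/* True if a resume request was queued while a suspend was still running. */
static bool rpm_wake_during_suspend(unsigned int status)
{
	return (status & RPM_SUSPENDING) && (status & RPM_WAKE);
}
```

If the truncation matters for the final version, a wider or signed status
field, or a dedicated error bit, would be possible ways to keep RPM_ERROR
distinguishable; which one fits best is a design decision for the patch author.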
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o
+obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
Index: linux-2.6/drivers/base/power/runtime.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/runtime.c
@@ -0,0 +1,461 @@
+/*
+ * drivers/base/power/runtime.c - Helper functions for device run-time PM
+ *
+ * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/pm_runtime.h>
+
+/**
+ * pm_runtime_reset - Clear all of the device run-time PM flags.
+ * @dev: Device object to clear the flags for.
+ */
+static void pm_runtime_reset(struct device *dev)
+{
+	dev->power.suspend_aborted = false;
+	dev->power.runtime_status = RPM_ACTIVE;
+}
+
+/**
+ * pm_device_suspended - Check if given device has been suspended at run time.
+ * @dev: Device to check.
+ * @data: Ignored.
+ *
+ * Returns 0 if the device has been suspended and it hasn't been requested to
+ * resume or -EBUSY otherwise.
+ */
+static int pm_device_suspended(struct device *dev, void *data)
+{
+	return dev->power.runtime_status == RPM_SUSPENDED ? 0 : -EBUSY;
+}
+
+/**
+ * pm_check_children - Check if all children of a device have been suspended.
+ * @dev: Device to check.
+ *
+ * Returns 0 if all children of the device have been suspended or -EBUSY
+ * otherwise.
+ */
+static int pm_check_children(struct device *dev)
+{
+	return dev->power.suspend_skip_children ? 0 :
+			device_for_each_child(dev, NULL, pm_device_suspended);
+}
+
+/**
+ * pm_runtime_notify_idle - Run a device bus type's runtime_idle() callback.
+ * @dev: Device to notify.
+ *
+ * Check if all children of given device are suspended and call the device bus
+ * type's ->runtime_idle() callback if that's the case.
+ */
+static void pm_runtime_notify_idle(struct device *dev)
+{
+	if (atomic_read(&dev->power.depth) > 0 || pm_check_children(dev))
+		return;
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_idle)
+		dev->bus->pm->runtime_idle(dev);
+}
+
+/**
+ * pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
+ * @dev: Device to suspend.
+ *
+ * Check if the status of the device is appropriate and run the
+ * ->runtime_suspend() callback provided by the device's bus type driver.
+ * Update the run-time PM flags in the device object to reflect the current
+ * status of the device.
+ */
+int pm_runtime_suspend(struct device *dev)
+{
+	int error = 0;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status & RPM_SUSPENDED) {
+		goto out;
+	} else if (dev->power.runtime_status & RPM_NO_SUSPEND) {
+		/* Device is resuming or there's a resume request pending. */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_IDLE
+	    && dev->power.suspend_aborted) {
+		dev->power.suspend_aborted = false;
+		dev->power.runtime_status = RPM_ACTIVE;
+		goto out;
+	} else if (pm_check_children(dev)) {
+		/*
+		 * We can only suspend the device if all of its children have
+		 * been suspended.
+		 */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+
+		/*
+		 * Another suspend is running in parallel with us.  Wait for it
+		 * to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	dev->power.runtime_status = RPM_SUSPENDING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_suspend)
+		error = dev->bus->pm->runtime_suspend(dev);
+
+	spin_lock(&dev->power.lock);
+
+	/*
+	 * Resume request might have been queued in the meantime, in which case
+	 * the RPM_WAKE bit is also set in runtime_status.
+	 */
+	dev->power.runtime_status &= ~RPM_SUSPENDING;
+	switch (error) {
+	case 0:
+		dev->power.runtime_status |= RPM_SUSPENDED;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete(&dev->power.work_done);
+
+	if (!error && !(dev->power.runtime_status & RPM_WAKE) && dev->parent) {
+		spin_unlock(&dev->power.lock);
+
+		pm_runtime_notify_idle(dev->parent);
+
+		return 0;
+	}
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(pm_runtime_suspend);
+
+/**
+ * pm_runtime_suspend_work - Run pm_runtime_suspend() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the suspend has been scheduled for and
+ * run pm_runtime_suspend() for it.
+ */
+static void pm_runtime_suspend_work(struct work_struct *work)
+{
+	pm_runtime_suspend(pm_work_to_device(work));
+}
+
+/**
+ * pm_request_suspend - Schedule run-time suspend of given device.
+ * @dev: Device to suspend.
+ * @delay: Time, in jiffies, to wait before attempting to suspend the device.
+ */
+void pm_request_suspend(struct device *dev, unsigned long delay)
+{
+	unsigned long flags;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ACTIVE)
+		goto out;
+
+	dev->power.runtime_status = RPM_IDLE;
+	dev->power.suspend_aborted = false;
+	INIT_DELAYED_WORK(&dev->power.runtime_work, pm_runtime_suspend_work);
+	queue_delayed_work(pm_wq, &dev->power.runtime_work, delay);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_suspend);
+
+/**
+ * pm_cancel_suspend - Cancel a pending suspend request for given device.
+ * @dev: Device to cancel the suspend request for.
+ *
+ * Should be called with dev->power.lock held and only if we are sure that the
+ * ->runtime_suspend() callback hasn't started yet.
+ */
+static void pm_cancel_suspend(struct device *dev)
+{
+	dev->power.suspend_aborted = true;
+	cancel_delayed_work(&dev->power.runtime_work);
+	dev->power.runtime_status = RPM_ACTIVE;
+}
+
+/**
+ * pm_runtime_resume - Run a device bus type's runtime_resume() callback.
+ * @dev: Device to resume.
+ *
+ * Check if the device is really suspended and run the ->runtime_resume()
+ * callback provided by the device's bus type driver.  Update the run-time PM
+ * flags in the device object to reflect the current status of the device.  If
+ * runtime suspend is in progress while this function is being run, wait for it
+ * to finish before resuming the device.  If runtime suspend is scheduled, but
+ * it hasn't started yet, cancel it and we're done.
+ */
+int pm_runtime_resume(struct device *dev)
+{
+	int error = 0;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	if (dev->parent)
+		spin_lock(&dev->parent->power.lock);
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ACTIVE) {
+		goto out_unlock;
+	} else if (dev->power.runtime_status == RPM_IDLE) {
+		/* ->runtime_suspend() hasn't started yet, no need to resume. */
+		pm_cancel_suspend(dev);
+		goto out_unlock;
+	}
+
+	if (dev->power.runtime_status & RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * A suspend is running in parallel with us.  Wait for it to
+		 * complete and repeat.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		goto repeat;
+	} else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
+	    && dev->parent->power.runtime_status != RPM_ACTIVE) {
+		spin_unlock(&dev->power.lock);
+		spin_unlock(&dev->parent->power.lock);
+
+		/* The device's parent is not active.  Resume it and repeat. */
+		error = pm_runtime_resume(dev->parent);
+		if (error)
+			return error;
+
+		goto repeat;
+	}
+
+	if (dev->power.runtime_status == RPM_RESUMING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * There's another resume running in parallel with us. Wait for
+		 * it to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	dev->power.runtime_status = RPM_RESUMING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_resume)
+		error = dev->bus->pm->runtime_resume(dev);
+
+	spin_lock(&dev->power.lock);
+
+	switch (error) {
+	case 0:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_SUSPENDED;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete(&dev->power.work_done);
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+
+ out_unlock:
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+	goto out;
+}
+EXPORT_SYMBOL_GPL(pm_runtime_resume);
+
+/**
+ * pm_runtime_resume_work - Run pm_runtime_resume() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the resume has been scheduled for and run
+ * pm_runtime_resume() for it.
+ */
+static void pm_runtime_resume_work(struct work_struct *work)
+{
+	pm_runtime_resume(pm_work_to_device(work));
+}
+
+/**
+ * pm_request_resume - Schedule run-time resume of given device.
+ * @dev: Device to resume.
+ */
+void pm_request_resume(struct device *dev)
+{
+	unsigned long parent_flags = 0, flags;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	if (dev->parent)
+		spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		/* Autosuspend request is pending, no need to resume. */
+		pm_cancel_suspend(dev);
+		goto out;
+	} else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
+		goto out;
+	} else if (dev->parent
+	    && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+
+		/* We have to resume the parent first. */
+		pm_request_resume(dev->parent);
+
+		goto repeat;
+	}
+
+	/*
+	 * The device may be suspending at the moment and we can't clear the
+	 * RPM_SUSPENDING bit in its runtime_status just yet.
+	 */
+	dev->power.runtime_status |= RPM_WAKE;
+	INIT_WORK(&dev->power.runtime_work.work, pm_runtime_resume_work);
+	queue_work(pm_wq, &dev->power.runtime_work.work);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (dev->parent)
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_resume);
+
+/**
+ * pm_cancel_runtime_suspend - Cancel a pending suspend request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_suspend(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		cancel_delayed_work(&dev->power.runtime_work);
+		pm_runtime_reset(dev);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_suspend);
+
+/**
+ * pm_cancel_runtime_resume - Cancel a pending resume request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_resume(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status & RPM_WAKE) {
+		work_clear_pending(&dev->power.runtime_work.work);
+		pm_runtime_reset(dev);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_resume);
+
+/**
+ * pm_runtime_disable - Disable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Increase the depth field in the device's dev_pm_info structure, which will
+ * cause the run-time PM functions above to return without doing anything.
+ * If there is a run-time PM operation in progress, wait for it to complete.
+ */
+void pm_runtime_disable(struct device *dev)
+{
+	might_sleep();
+
+	atomic_inc(&dev->power.depth);
+
+	if (dev->power.runtime_status & RPM_IN_PROGRESS)
+		wait_for_completion(&dev->power.work_done);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_disable);
+
+/**
+ * pm_runtime_enable - Enable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Enable run-time power management for given device by decreasing the depth
+ * field in its dev_pm_info structure.
+ */
+void pm_runtime_enable(struct device *dev)
+{
+	if (!atomic_add_unless(&dev->power.depth, -1, 0))
+		dev_warn(dev, "PM: Excessive pm_runtime_enable()!\n");
+}
+EXPORT_SYMBOL_GPL(pm_runtime_enable);
+
+/**
+ * pm_runtime_init - Initialize run-time PM fields in given device object.
+ * @dev: Device object to handle.
+ */
+void pm_runtime_init(struct device *dev)
+{
+	pm_runtime_reset(dev);
+	spin_lock_init(&dev->power.lock);
+	atomic_set(&dev->power.depth, 1);
+	pm_suspend_check_children(dev, true);
+}
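
The disable/enable pairing implemented by pm_runtime_disable(),
pm_runtime_enable() and pm_runtime_init() above can be modelled in user space
as a counter that starts at 1 and never drops below 0.  The sketch below is
only an approximation under stated assumptions: it uses C11 atomics instead of
the kernel's atomic_t, 'struct rpm_depth' and the rpm_depth_*() helpers are
hypothetical names, and the wait for an operation in progress done by
pm_runtime_disable() is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the 'power.depth' field of struct device. */
struct rpm_depth {
	atomic_int depth;
};

/* Like pm_runtime_init(): run-time PM starts out disabled (depth == 1). */
static void rpm_depth_init(struct rpm_depth *d)
{
	atomic_init(&d->depth, 1);
}

/* Like pm_runtime_disable(), without waiting for a running operation. */
static void rpm_depth_disable(struct rpm_depth *d)
{
	atomic_fetch_add(&d->depth, 1);
}

/*
 * Like pm_runtime_enable(): decrement the counter unless it is already 0.
 * Returns false for an unbalanced call, where the patch prints a warning.
 */
static bool rpm_depth_enable(struct rpm_depth *d)
{
	int old = atomic_load(&d->depth);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&d->depth, &old, old - 1))
			return true;
	}
	return false;
}

/* The run-time PM helpers only do their work while depth is 0. */
static bool rpm_allowed(struct rpm_depth *d)
{
	return atomic_load(&d->depth) == 0;
}
```

In this model a freshly initialized device needs exactly one enable call before
rpm_allowed() becomes true, which corresponds to the initial 'power.depth'
value of 1 set by pm_runtime_init().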
Index: linux-2.6/include/linux/pm_runtime.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pm_runtime.h
@@ -0,0 +1,63 @@
+/*
+ * pm_runtime.h - Device run-time power management helper functions.
+ *
+ * Copyright (C) 2009 Rafael J. Wysocki <rjw@sisk.pl>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_RUNTIME_H
+#define _LINUX_PM_RUNTIME_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+#ifdef CONFIG_PM_RUNTIME
+
+extern struct workqueue_struct *pm_wq;
+
+extern void pm_runtime_init(struct device *dev);
+extern int pm_runtime_suspend(struct device *dev);
+extern void pm_request_suspend(struct device *dev, unsigned long delay);
+extern int pm_runtime_resume(struct device *dev);
+extern void pm_request_resume(struct device *dev);
+extern void pm_cancel_runtime_suspend(struct device *dev);
+extern void pm_cancel_runtime_resume(struct device *dev);
+extern void pm_runtime_disable(struct device *dev);
+extern void pm_runtime_enable(struct device *dev);
+
+static inline struct device *pm_work_to_device(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(dw, struct dev_pm_info, runtime_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+	dev->power.suspend_skip_children = !enable;
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+static inline void pm_runtime_init(struct device *dev) {}
+static inline int pm_runtime_suspend(struct device *dev) { return -ENOSYS; }
+static inline void pm_request_suspend(struct device *dev, unsigned long delay)
+{
+}
+static inline int pm_runtime_resume(struct device *dev) { return -ENOSYS; }
+static inline void pm_request_resume(struct device *dev) {}
+static inline void pm_cancel_runtime_suspend(struct device *dev) {}
+static inline void pm_cancel_runtime_resume(struct device *dev) {}
+static inline void pm_runtime_disable(struct device *dev) {}
+static inline void pm_runtime_enable(struct device *dev) {}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+}
+
+#endif /* !CONFIG_PM_RUNTIME */
+
+#endif
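
pm_work_to_device() above walks from a work item back to the device that
embeds it by applying container_of() twice after to_delayed_work().  The
user-space sketch below illustrates that pointer arithmetic; it is not kernel
code.  The structure layouts are minimal hypothetical stand-ins (only the
nesting matches the patch), container_of() is re-implemented with offsetof(),
and model_work_to_device() is a hypothetical name.

```c
#include <stddef.h>

/* User-space stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Minimal hypothetical versions of the structures involved. */
struct work_struct { int pending; };
struct delayed_work { struct work_struct work; int timer; };
struct dev_pm_info { int status; struct delayed_work runtime_work; };
struct device { const char *name; struct dev_pm_info power; };

/* Same three steps as pm_work_to_device() in the patch. */
static struct device *model_work_to_device(struct work_struct *work)
{
	/* Step 1: what to_delayed_work() does in the kernel. */
	struct delayed_work *dw = container_of(work, struct delayed_work, work);
	/* Step 2: from the delayed work to the enclosing dev_pm_info. */
	struct dev_pm_info *dpi = container_of(dw, struct dev_pm_info,
					       runtime_work);

	/* Step 3: from dev_pm_info to the enclosing device. */
	return container_of(dpi, struct device, power);
}
```

Because the resume path in the patch queues '&dev->power.runtime_work.work',
which is the work item embedded in the same delayed_work, one lookup helper can
serve both the delayed suspend request and the immediate resume request.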
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -21,6 +21,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
@@ -88,6 +89,7 @@ void device_pm_add(struct device *dev)
 	}
 
 	list_add_tail(&dev->power.entry, &dpm_list);
+	pm_runtime_init(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -507,6 +509,7 @@ static void dpm_complete(pm_message_t st
 		get_device(dev);
 		if (dev->power.status > DPM_ON) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			mutex_unlock(&dpm_list_mtx);
 
 			device_complete(dev, state);
@@ -753,6 +756,7 @@ static int dpm_prepare(pm_message_t stat
 
 		get_device(dev);
 		dev->power.status = DPM_PREPARING;
+		pm_runtime_disable(dev);
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_prepare(dev, state);
@@ -760,6 +764,7 @@ static int dpm_prepare(pm_message_t stat
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			if (error == -EAGAIN) {
 				put_device(dev);
 				continue;
Index: linux-2.6/drivers/base/dd.c
===================================================================
--- linux-2.6.orig/drivers/base/dd.c
+++ linux-2.6/drivers/base/dd.c
@@ -23,6 +23,7 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
+#include <linux/pm_runtime.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -202,8 +203,12 @@ int driver_probe_device(struct device_dr
 	pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
 		 drv->bus->name, __func__, dev_name(dev), drv->name);
 
+	pm_runtime_disable(dev);
+
 	ret = really_probe(dev, drv);
 
+	pm_runtime_enable(dev);
+
 	return ret;
 }
 
@@ -306,6 +311,8 @@ static void __device_release_driver(stru
 
 	drv = dev->driver;
 	if (drv) {
+		pm_runtime_disable(dev);
+
 		driver_sysfs_remove(dev);
 
 		if (dev->bus)
@@ -320,6 +327,8 @@ static void __device_release_driver(stru
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
+
+		pm_runtime_enable(dev);
 	}
 }
 
Index: linux-2.6/Documentation/power/runtime_pm.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/power/runtime_pm.txt
@@ -0,0 +1,250 @@
+Run-time Power Management Framework for I/O Devices
+
+(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+1. Introduction
+
+The support for run-time power management (run-time PM) of I/O devices is
+provided at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queueing all work items related to run-time PM, because this allows
+  them to be synchronized with system-wide power transitions.  pm_wq is declared
+  in include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of run-time PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing run-time PM operations with one another.
+
+* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out run-time PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The device run-time PM fields defined in 'struct dev_pm_info', the helper
+functions and the run-time PM callbacks defined in 'struct dev_pm_ops' are
+described in what follows.
+
+2. Run-time PM Helper Functions and Device Fields
+
+The following helper functions are defined in drivers/base/power/runtime.c
+and include/linux/pm_runtime.h:
+
+* void pm_runtime_init(struct device *dev);
+* void pm_runtime_enable(struct device *dev);
+* void pm_runtime_disable(struct device *dev);
+* int pm_runtime_suspend(struct device *dev);
+* void pm_request_suspend(struct device *dev, unsigned long delay);
+* int pm_runtime_resume(struct device *dev);
+* void pm_request_resume(struct device *dev);
+* void pm_cancel_runtime_suspend(struct device *dev);
+* void pm_cancel_runtime_resume(struct device *dev);
+* void pm_suspend_check_children(struct device *dev, bool enable);
+
+pm_runtime_init() initializes the run-time PM fields in the 'power' member of
+the device object.  It is called during the initialization of the device object,
+in drivers/base/power/main.c:device_pm_add().
+
+pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
+respectively, pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume().  They do it by decreasing and increasing, respectively,
+the 'power.depth' field of 'struct device'.  If the value of this field is
+greater than 0, pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume() return immediately without doing anything and -EBUSY is
+returned by pm_runtime_suspend() and pm_runtime_resume().  Therefore, if
+pm_runtime_disable() is called several times in a row for the same device, it
+has to be balanced by the appropriate number of pm_runtime_enable() calls so
+that the other run-time PM functions can be used for that device.  The initial
+value of 'power.depth', as set by pm_runtime_init(), is 1.
+
+pm_runtime_disable() and pm_runtime_enable() are used by the device core to
+disable the run-time PM of the device temporarily during device probe and
+removal as well as during system-wide power transitions (i.e. system-wide
+suspend or hibernation, or resume from a system sleep state).
+
+pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume() use the 'power.runtime_status' and
+'power.suspend_aborted' fields of 'struct device' for mutual synchronization.
+These fields are initialized by pm_runtime_init() and set to RPM_ACTIVE and
+'false', respectively.
+
+pm_request_suspend() is used to queue up a suspend request for an active device.
+If the run-time PM status of the device (i.e. the value of the
+'power.runtime_status' field in 'struct device') is different from RPM_ACTIVE,
+it returns immediately.  Otherwise, it changes the device's run-time PM status
+to RPM_IDLE and puts a request to execute pm_runtime_suspend() into pm_wq.  The
+'delay' argument specifies the time, in jiffies, to wait before the request is
+executed.
+
+pm_runtime_suspend() is used to carry out a run-time suspend of an active
+device.  It is called either by the PM core, to complete a request queued up by
+pm_request_suspend(), or directly by a bus type or device driver.
+* It returns immediately if the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field ('power.runtime_status').
+* It returns -EAGAIN if at least one of the RPM_WAKE and RPM_RESUMING bits is
+  set in the device's run-time PM status field.
+* If the device's run-time PM status is RPM_IDLE and 'power.suspend_aborted'
+  flag is set for it, the device's run-time PM status is set to RPM_ACTIVE and
+  the function returns success.
+* If the device's children are not suspended and the
+  'power.suspend_skip_children' flag is not set for it, -EAGAIN is returned.
+* If the device's run-time PM status is RPM_SUSPENDING, which means that another
+  instance of pm_runtime_suspend() is running at the same time for the same
+  device, the function waits for the other instance to complete and returns the
+  error code (or success) returned by it.
+If none of the above takes place, the device's run-time PM status is set to
+RPM_SUSPENDING and the device bus type's ->runtime_suspend() callback is
+executed, which is responsible for handling the device as appropriate (for
+example, it may choose to execute the device driver's ->runtime_suspend()
+callback or to carry out any other suitable action depending on the bus type).
+Next:
+* If it completes successfully, the RPM_SUSPENDED bit is set and the
+  RPM_SUSPENDING bit is cleared in the device's run-time PM status field.  Once
+  that has happened, the device is regarded by the PM core as suspended, which
+  need not mean that the device has been put into a low power state.  What
+  really happens to the device at this point depends entirely on its bus type
+  (it may depend on the device's driver if the bus type chooses to call it).
+  Additionally, if the device bus type's ->runtime_suspend() callback completes
+  successfully, the device bus type's ->runtime_idle() callback is executed for
+  the device's parent if there is one and if all of its children are suspended
+  (or the 'power.suspend_skip_children' flag is set for it).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_ACTIVE.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to run pm_runtime_suspend(),
+  pm_request_suspend(), pm_runtime_resume(), and pm_request_resume() until the
+  status is changed to either RPM_ACTIVE or RPM_SUSPENDED by the device's bus
+  type or driver.
+Finally, pm_runtime_suspend() returns the error code (or success) returned by
+the device bus type's ->runtime_suspend() callback.
+
+pm_request_resume() is used to queue up a resume request for a device that is
+suspended, suspending or has a suspend request pending.
+* If a suspend request is pending for the device (i.e. the device's run-time PM
+  status is RPM_IDLE), it is cancelled and the function returns.
+* If the device is not suspended or suspending (i.e. none of the RPM_SUSPENDED
+  and RPM_SUSPENDING bits is set in the device's run-time PM status field), the
+  function returns.
+* If the device's parent is inactive, a resume request is scheduled for the
+  parent and the function is restarted.
+If none of the above happens, the RPM_WAKE bit is set in the device's run-time
+PM status field and the request to execute pm_runtime_resume() is put into
+pm_wq.
+
+pm_runtime_resume() is used to carry out a run-time resume of a device that is
+suspended, suspending or has a suspend request pending.  It is called either by
+the PM core, to complete a request queued up by pm_request_resume(), or
+directly by a bus type or device driver.
+* It returns immediately if the device's run-time PM status is RPM_ACTIVE.
+* If there's a suspend request pending for the device (i.e. the device's
+  run-time PM status is RPM_IDLE), it is cancelled and the function returns
+  success.
+* If the device is suspending (i.e. the RPM_SUSPENDING bit is set in the
+  device's run-time PM status field), the function waits for the suspend
+  operation to complete and restarts itself.
+* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field), the device's parent exists and is not active (i.e.
+  the parent's run-time PM status is not RPM_ACTIVE), pm_runtime_resume() is
+  called (recursively) for the parent and the function is restarted.
+* If the device is resuming (i.e. the device's run-time PM status is
+  RPM_RESUMING), which means that another instance of pm_runtime_resume() is
+  running at the same time for the same device, the function waits for the other
+  instance to complete and returns the result returned by it.
+If none of the above happens, the device's run-time PM status is set to
+RPM_RESUMING and the device bus type's ->runtime_resume() callback is executed,
+which is responsible for handling the device as appropriate (for example, it may
+choose to execute the device driver's ->runtime_resume() callback or to carry
+out any other suitable action depending on the bus type).  Next:
+* If it completes successfully, the device's run-time PM status is set to
+  RPM_ACTIVE, which means that the device is fully operational.  Thus, the
+  device bus type's ->runtime_resume() callback, when it is about to return
+  success, _must_ _ensure_ that this really is the case (i.e. when it returns,
+  the device _must_ be able to complete I/O operations as needed).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_SUSPENDED.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to run pm_runtime_suspend(),
+  pm_request_suspend(), pm_runtime_resume(), and pm_request_resume() until the
+  status is changed to either RPM_ACTIVE or RPM_SUSPENDED by the device's bus
+  type or driver.
+Finally, pm_runtime_resume() returns the error code (or success) returned by
+the device bus type's ->runtime_resume() callback.
+
+pm_cancel_runtime_suspend() is used to cancel a pending suspend request for an
+active device, but it can only be called when the run-time PM of the device
+is disabled.  It is supposed to be used during system-wide power transitions.
+
+pm_cancel_runtime_resume() is used to cancel a pending resume request for
+a suspended device.  It can only be called when the run-time PM of the device
+is disabled and it is supposed to be used during system-wide power transitions.
+
+pm_suspend_check_children() is used to set or unset the
+'power.suspend_skip_children' flag in 'struct device'.  If the 'enable'
+argument is 'true', the flag is set to 0, and if 'enable' is 'false', the flag
+is set to 1.  The default value of 'power.suspend_skip_children', as set by
+pm_runtime_init(), is 0.
+
+3. Device Run-time PM Callbacks
+
+There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend() callback is executed by pm_runtime_suspend() for the bus
+type of the device being suspended.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not,
+include executing the device driver's ->runtime_suspend() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the bus type's ->runtime_suspend() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_suspend() callback has returned successfully,
+  the PM core regards the device as suspended, which need not mean that the
+  device has been put into a low power state.  It is supposed to mean, however,
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_ACTIVE, which means that the device
+  _must_ be fully operational once this has happened.
+* If the bus type's ->runtime_suspend() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 1 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_resume() callback is executed by pm_runtime_resume() for the bus
+type of the device being woken up.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not,
+include executing the device driver's ->runtime_resume() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the bus type's ->runtime_resume() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_resume() callback has returned successfully,
+  the PM core regards the device as fully operational, which means that the
+  device _must_ be able to complete I/O operations as needed.
+* If the bus type's ->runtime_resume() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_SUSPENDED, which is supposed to mean
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_resume() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 1 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_idle() callback is executed by pm_runtime_suspend() for the bus
+type of a device the children of which are all suspended (or which has the
+'power.suspend_skip_children' flag set).  The action carried out by this
+callback is totally dependent on the bus type in question, but the expected
+action is to check if the device can be suspended (i.e. if all of the conditions
+necessary for suspending the device are met) and to queue up a suspend request
+for the device if that is the case.
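
As a rough illustration of the callback rules documented above (this is not
part of the patch), the way the PM core is described as interpreting the return
value of the bus type's ->runtime_suspend() and ->runtime_resume() can be
modelled in plain user-space C.  The enum and function names below are
hypothetical and only mirror the documentation text; the real status values
appear to be bit flags added to include/linux/pm.h.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the run-time PM status values. */
enum model_rpm_status {
	MODEL_RPM_ACTIVE,
	MODEL_RPM_SUSPENDED,
	MODEL_RPM_ERROR,
};

/* Status recorded after the bus type's ->runtime_suspend() returned 'ret'. */
static enum model_rpm_status model_status_after_suspend(int ret)
{
	assert(ret <= 0);	/* callbacks return 0 or a negative errno */

	if (ret == 0)
		return MODEL_RPM_SUSPENDED;
	if (ret == -EBUSY || ret == -EAGAIN)
		return MODEL_RPM_ACTIVE;	/* device stays operational */
	return MODEL_RPM_ERROR;			/* helpers refuse to run */
}

/* Status recorded after the bus type's ->runtime_resume() returned 'ret'. */
static enum model_rpm_status model_status_after_resume(int ret)
{
	assert(ret <= 0);

	if (ret == 0)
		return MODEL_RPM_ACTIVE;
	if (ret == -EBUSY || ret == -EAGAIN)
		return MODEL_RPM_SUSPENDED;	/* device stays suspended */
	return MODEL_RPM_ERROR;
}
```

In this model the two "try again" codes leave the device in the state it was in
before the call, and only other errors move it to the error state.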

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-13 22:23 [PATCH] PM: Introduce core framework for run-time PM of I/O devices Rafael J. Wysocki
@ 2009-06-14  9:41   ` Magnus Damm
  2009-06-14  9:41   ` Magnus Damm
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 118+ messages in thread
From: Magnus Damm @ 2009-06-14  9:41 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, Oliver Neukum, Magnus Damm, pm list, LKML,
	Ingo Molnar, ACPI Devel Maling List

Hi Rafael,

On Sun, Jun 14, 2009 at 7:23 AM, Rafael J. Wysocki<rjw@sisk.pl> wrote:
> Below is the current version of my "run-time PM for I/O devices" patch.
>
> I've done my best to address the comments received during the recent
> discussions, but at the same time I've tried to make the patch only contain
> the most essential things.  For this reason, for example, the sysfs interface
> is not there and it's going to be added in a separate patch.

Good decision. Let's do this step by step.

> Please let me know if you want me to change anything in this patch or to add
> anything new to it.  [Magnus, I remember you wanted something like
> ->runtime_wakeup() along with ->runtime_idle(), but I'm not sure it's really
> necessary.  Please let me know if you have any particular usage scenario for
> it.]

I will keep on building my arch specific platform bus code on top of
the latest version of this patch.

However, to begin with I'll not make use of the ->runtime_idle()
callback in the bus code. This is because rearranging the existing
platform devices into a tree will require a lot of rewriting, and I'm
not convinced it's the right approach. I'd rather focus on getting
basic functionality in place at this point. So if no one else needs
->runtime_idle(), feel free to exclude the ->runtime_idle() part if
you want to make the patch even leaner to begin with.

Together with the bus specific callbacks I plan to modify device
drivers to include pm_runtime_suspend() / pm_runtime_resume() calls to
notify the bus code when they are idle and when they need wakeup,
similar to my earlier proposal with
platform_device_idle()/platform_device_wakeup().

> --- linux-2.6.orig/include/linux/pm.h
> +++ linux-2.6/include/linux/pm.h
> @@ -182,6 +205,11 @@ struct dev_pm_ops {
>        int (*thaw_noirq)(struct device *dev);
>        int (*poweroff_noirq)(struct device *dev);
>        int (*restore_noirq)(struct device *dev);
> +#ifdef CONFIG_PM_RUNTIME
> +       int (*runtime_suspend)(struct device *dev);
> +       int (*runtime_resume)(struct device *dev);
> +       void (*runtime_idle)(struct device *dev);
> +#endif

Do we really need to wrap these in CONFIG_PM_RUNTIME? The callbacks
for STR and STD are not wrapped in CONFIG_SUSPEND and
CONFIG_HIBERNATION, right?

> --- /dev/null
> +++ linux-2.6/drivers/base/power/runtime.c
[snip]
> +/**
> + * pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
> + * @dev: Device to suspend.
> + *
> + * Check if the status of the device is appropriate and run the
> + * ->runtime_suspend() callback provided by the device's bus type driver.
> + * Update the run-time PM flags in the device object to reflect the current
> + * status of the device.
> + */
> +int pm_runtime_suspend(struct device *dev)
> +{
> +       int error = 0;

I'm sure you put a lot of thought into this already, but is it really
the best approach to assume that busses without runtime pm callbacks
can be suspended? I'd go with an error value by default and only
return 0 as callback return value.
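
To make the difference concrete, here is a small user-space sketch, assuming
the quoted code returns its initial 'error' value when the bus type has no
->runtime_suspend().  The structures and names are hypothetical stand-ins, not
the kernel's, and -ENOSYS in the usage note is only one possible choice of
error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_dev;

/* Hypothetical, cut-down stand-ins for the bus type and device structures. */
struct model_bus_ops {
	int (*runtime_suspend)(struct model_dev *dev);
};

struct model_dev {
	const struct model_bus_ops *ops;	/* NULL: bus has no run-time PM ops */
};

/* Example bus callback that always reports a successful suspend. */
static int model_bus_suspend_ok(struct model_dev *dev)
{
	(void)dev;
	return 0;
}

/*
 * 'no_callback_result' is what the caller gets when the bus type provides no
 * ->runtime_suspend(): 0 models the quoted code, a negative errno models the
 * alternative suggested above.
 */
static int model_runtime_suspend(struct model_dev *dev, int no_callback_result)
{
	int error = no_callback_result;

	assert(dev != NULL);
	if (dev->ops && dev->ops->runtime_suspend)
		error = dev->ops->runtime_suspend(dev);

	return error;
}
```

With a default of 0, a device on a bus without callbacks is reported as
suspended; with a default such as -ENOSYS, success can only come from a
callback that actually ran.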

> +/**
> + * pm_cancel_suspend - Cancel a pending suspend request for given device.
> + * @dev: Device to cancel the suspend request for.
> + *
> + * Should be called under pm_lock_device() and only if we are sure that the
> + * ->autosuspend() callback hasn't started to yet.
> + */
> +static void pm_cancel_suspend(struct device *dev)
> +{
> +       dev->power.suspend_aborted = true;
> +       cancel_delayed_work(&dev->power.runtime_work);
> +       dev->power.runtime_status = RPM_ACTIVE;
> +}

This pm_lock_device() comment seems to come from old code, no?

> +/**
> + * pm_runtime_resume - Run a device bus type's runtime_resume() callback.
> + * @dev: Device to resume.
> + *
> + * Check if the device is really suspended and run the ->runtime_resume()
> + * callback provided by the device's bus type driver.  Update the run-time PM
> + * flags in the device object to reflect the current status of the device.  If
> + * runtime suspend is in progress while this function is being run, wait for it
> + * to finish before resuming the device.  If runtime suspend is scheduled, but
> + * it hasn't started yet, cancel it and we're done.
> + */
> +int pm_runtime_resume(struct device *dev)
> +{
> +       int error = 0;

Same here, do non-existent runtime PM callbacks really mean we can resume?

> +/**
> + * pm_runtime_disable - Disable run-time power management for given device.
> + * @dev: Device to handle.
> + *
> + * Increase the depth field in the device's dev_pm_info structure, which will
> + * cause the run-time PM functions above to return without doing anything.
> + * If there is a run-time PM operation in progress, wait for it to complete.
> + */
> +void pm_runtime_disable(struct device *dev)
> +{
> +       might_sleep();
> +
> +       atomic_inc(&dev->power.depth);
> +
> +       if (dev->power.runtime_status & RPM_IN_PROGRESS)
> +               wait_for_completion(&dev->power.work_done);
> +}
> +EXPORT_SYMBOL_GPL(pm_runtime_disable);
> +
> +/**
> + * pm_runtime_enable - Disable run-time power management for given device.
> + * @dev: Device to handle.
> + *
> + * Enable run-time power management for given device by decreasing the depth
> + * field in its dev_pm_info structure.
> + */
> +void pm_runtime_enable(struct device *dev)
> +{
> +       if (!atomic_add_unless(&dev->power.depth, -1, 0))
> +               dev_warn(dev, "PM: Excessive pm_runtime_enable()!\n");
> +}
> +EXPORT_SYMBOL_GPL(pm_runtime_enable);

Any thoughts on performing ->runtime_resume()/->runtime_suspend() in
enable() and disable()? I guess it's performed too early/late to make
sense from the driver point of view?
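
For reference, the counting done by the quoted enable/disable pair can be
sketched in user-space C with C11 atomics.  This is only a model under stated
assumptions: the names are hypothetical, the initial depth of 1 is taken from
the documentation part of the patch, and "depth == 0 means enabled" is inferred
from the comments rather than from code shown here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* User-space stand-in for the 'power.depth' counter; not kernel code. */
struct model_pm {
	atomic_int depth;
};

static void model_pm_init(struct model_pm *pm)
{
	assert(pm);
	atomic_init(&pm->depth, 1);	/* run-time PM starts out disabled */
}

static void model_pm_disable(struct model_pm *pm)
{
	atomic_fetch_add(&pm->depth, 1);
}

/* Returns false on an excessive enable, where the patch prints a warning. */
static bool model_pm_enable(struct model_pm *pm)
{
	int old = atomic_load(&pm->depth);

	/* Decrement unless already 0, like atomic_add_unless(&depth, -1, 0). */
	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&pm->depth, &old, old - 1))
			return true;
	}
	return false;
}

static bool model_pm_enabled(struct model_pm *pm)
{
	return atomic_load(&pm->depth) == 0;
}
```

The model shows why nested disable calls have to be balanced: run-time PM only
becomes usable again once every disable has been matched by an enable.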

Looking good, thanks a lot for your work on this!

Any chance we can get this included in -rc1?

/ magnus

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [linux-pm] [PATCH] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-13 22:23 [PATCH] PM: Introduce core framework for run-time PM of I/O devices Rafael J. Wysocki
  2009-06-14  9:41 ` Magnus Damm
  2009-06-14  9:41   ` Magnus Damm
@ 2009-06-14  9:58 ` Rafael J. Wysocki
  2009-06-14 22:57   ` [patch update] " Rafael J. Wysocki
  2009-06-14 22:57   ` Rafael J. Wysocki
  2009-06-14  9:58 ` [PATCH] " Rafael J. Wysocki
  3 siblings, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-14  9:58 UTC (permalink / raw)
  To: Alan Stern, Oliver Neukum, Magnus Damm
  Cc: linux-pm, ACPI Devel Maling List, Ingo Molnar, LKML

On Sunday 14 June 2009, Rafael J. Wysocki wrote:
> Hi,
> 
> Below is the current version of my "run-time PM for I/O devices" patch.
> 
> I've done my best to address the comments received during the recent
> discussions, but at the same time I've tried to make the patch only contain
> the most essential things.  For this reason, for example, the sysfs interface
> is not there and it's going to be added in a separate patch.
> 
> Please let me know if you want me to change anything in this patch or to add
> anything new to it.  [Magnus, I remember you wanted something like
> ->runtime_wakeup() along with ->runtime_idle(), but I'm not sure it's really
> necessary.  Please let me know if you have any particular usage scenario for
> it.]

Sorry, I sent an outdated version of the patch.  The current one is below.

Best,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>

Introduce a core framework for run-time power management of I/O
devices.  Add device run-time PM fields to 'struct dev_pm_info'
and device run-time PM callbacks to 'struct dev_pm_ops'.  Introduce
a run-time PM workqueue and define some device run-time PM helper
functions at the core level.  Document all these things.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/runtime_pm.txt |  250 ++++++++++++++++++++
 drivers/base/dd.c                  |    9 
 drivers/base/power/Makefile        |    1 
 drivers/base/power/main.c          |    5 
 drivers/base/power/runtime.c       |  461 +++++++++++++++++++++++++++++++++++++
 include/linux/pm.h                 |   98 +++++++
 include/linux/pm_runtime.h         |   63 +++++
 kernel/power/Kconfig               |   14 +
 kernel/power/main.c                |   17 +
 9 files changed, 915 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/power/Kconfig
===================================================================
--- linux-2.6.orig/kernel/power/Kconfig
+++ linux-2.6/kernel/power/Kconfig
@@ -208,3 +208,17 @@ config APM_EMULATION
 	  random kernel OOPSes or reboots that don't seem to be related to
 	  anything, try disabling/enabling this option (or disabling/enabling
 	  APM in your BIOS).
+
+config PM_RUNTIME
+	bool "Run-time PM core functionality"
+	depends on PM
+	---help---
+	  Enable functionality allowing I/O devices to be put into energy-saving
+	  (low power) states at run time (or autosuspended) after a specified
+	  period of inactivity and woken up in response to a hardware-generated
+	  wake-up event or a driver's request.
+
+	  Hardware support is generally required for this functionality to work
+	  and the bus type drivers of the buses the devices are on are
+	  responsible for the actual handling of the autosuspend requests and
+	  wake-up events.
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -11,6 +11,7 @@
 #include <linux/kobject.h>
 #include <linux/string.h>
 #include <linux/resume-trace.h>
+#include <linux/workqueue.h>
 
 #include "power.h"
 
@@ -217,8 +218,24 @@ static struct attribute_group attr_group
 	.attrs = g,
 };
 
+#ifdef CONFIG_PM_RUNTIME
+struct workqueue_struct *pm_wq;
+
+static int __init pm_start_workqueue(void)
+{
+	pm_wq = create_freezeable_workqueue("pm");
+
+	return pm_wq ? 0 : -ENOMEM;
+}
+#else
+static inline int pm_start_workqueue(void) { return 0; }
+#endif
+
 static int __init pm_init(void)
 {
+	int error = pm_start_workqueue();
+	if (error)
+		return error;
 	power_kobj = kobject_create_and_add("power", NULL);
 	if (!power_kobj)
 		return -ENOMEM;
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -22,6 +22,9 @@
 #define _LINUX_PM_H
 
 #include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -165,6 +168,26 @@ typedef struct pm_message {
  * It is allowed to unregister devices while the above callbacks are being
  * executed.  However, it is not allowed to unregister a device from within any
  * of its own callbacks.
+ *
+ * There also are the following callbacks related to run-time power management
+ * of devices:
+ *
+ * @runtime_suspend: Prepare the device for a condition in which it won't be
+ *	able to communicate with the CPU(s) and RAM due to power management.
+ *	This need not mean that the device should be put into a low power state,
+ *	like for example when the device is behind a link, represented by a
+ *	separate device object, that is going to be turned off for power
+ *	management purposes.
+ *
+ * @runtime_resume: Put the device into the fully active state in response to a
+ *	wake-up event generated by hardware or at a request of software.  If
+ *	necessary, put the device into the full power state and restore its
+ *	registers, so that it is fully operational.
+ *
+ * @runtime_idle: Device appears to be inactive and it might be put into a low
+ *	power state if all of the necessary conditions are satisfied.  Check
+ *	these conditions and handle the device as appropriate, possibly queueing
+ *	a suspend request for it.
  */
 
 struct dev_pm_ops {
@@ -182,6 +205,11 @@ struct dev_pm_ops {
 	int (*thaw_noirq)(struct device *dev);
 	int (*poweroff_noirq)(struct device *dev);
 	int (*restore_noirq)(struct device *dev);
+#ifdef CONFIG_PM_RUNTIME
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+#endif
 };
 
 /**
@@ -315,14 +343,78 @@ enum dpm_state {
 	DPM_OFF_IRQ,
 };
 
+/**
+ * Device run-time power management state.
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.  They do not
+ * reflect the actual power state of the device or its status as seen by the
+ * driver.
+ *
+ * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
+ *			pending for it.
+ *
+ * RPM_IDLE		It has been requested that the device be suspended.
+ *			Suspend request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
+ *			executed.
+ *
+ * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
+ *			completed successfully.  The device is regarded as
+ *			suspended.
+ *
+ * RPM_WAKE		It has been requested that the device be woken up.
+ *			Resume request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
+ *			executed.
+ *
+ * RPM_ERROR		Represents a condition from which the PM core cannot
+ *			recover by itself.  If the device's run-time PM status
+ *			field has this value, all of the run-time PM operations
+ *			carried out for the device by the core will fail, until
+ *			the status field is changed to either RPM_ACTIVE or
+ *			RPM_SUSPENDED (it is not valid to use the other values
+ *			in such a situation) by the device's driver or bus type.
+ *			This happens when the device bus type's
+ *			->runtime_suspend() or ->runtime_resume() callback
+ *			returns an error code different from -EAGAIN or -EBUSY.
+ */
+
+#define RPM_ACTIVE	0
+#define RPM_IDLE	0x01
+#define RPM_SUSPENDING	0x02
+#define RPM_SUSPENDED	0x04
+#define RPM_WAKE	0x08
+#define RPM_RESUMING	0x10
+#define RPM_ERROR	(-1)
+
+#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
+#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)
+#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING)
+#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)
+
 struct dev_pm_info {
 	pm_message_t		power_state;
-	unsigned		can_wakeup:1;
-	unsigned		should_wakeup:1;
+	unsigned int		can_wakeup:1;
+	unsigned int		should_wakeup:1;
 	enum dpm_state		status;		/* Owned by the PM core */
-#ifdef	CONFIG_PM_SLEEP
+#ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
 #endif
+#ifdef CONFIG_PM_RUNTIME
+	struct delayed_work	runtime_work;
+	struct completion	work_done;
+	unsigned int		suspend_skip_children:1;
+	unsigned int		suspend_aborted:1;
+	unsigned int		runtime_status:5;
+	int			runtime_error;
+	atomic_t		depth;
+	spinlock_t		lock;
+#endif
 };
 
 /*
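[Illustration only, not part of the patch.]  The status values defined in the
pm.h hunk above are bit flags, so a pending resume request (RPM_WAKE) can be
recorded while RPM_SUSPENDING is still set.  The stand-alone userspace sketch
below copies the constants and the 5-bit width of 'runtime_status' from the
hunk; struct rpm_model, rpm_store() and rpm_wake_during_suspend() are
hypothetical names introduced here.  It also shows one thing that may be worth
double-checking: storing RPM_ERROR (-1) in an unsigned 5-bit field yields 0x1f,
which no longer compares equal to RPM_ERROR and has every flag bit set.

```c
#include <stdbool.h>

/* Constants copied from the include/linux/pm.h hunk. */
#define RPM_ACTIVE	0
#define RPM_IDLE	0x01
#define RPM_SUSPENDING	0x02
#define RPM_SUSPENDED	0x04
#define RPM_WAKE	0x08
#define RPM_RESUMING	0x10
#define RPM_ERROR	(-1)

#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)

/* Hypothetical stand-in for the run-time PM part of struct dev_pm_info. */
struct rpm_model {
	unsigned int runtime_status:5;
};

/* Store a status value in the 5-bit field and read it back. */
static int rpm_store(int value)
{
	struct rpm_model m;

	m.runtime_status = value;
	return m.runtime_status;
}

/* True if a resume request was queued while a suspend is still running. */
static bool rpm_wake_during_suspend(int status)
{
	return (status & RPM_SUSPENDING) && (status & RPM_WAKE);
}
```

With these definitions, rpm_store(RPM_SUSPENDING | RPM_WAKE) reads back as 0x0a
and rpm_store(RPM_ERROR) reads back as 0x1f, so a test such as
'status & RPM_SUSPENDED' would also be true for a device in the error state.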
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o
+obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
Index: linux-2.6/drivers/base/power/runtime.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/runtime.c
@@ -0,0 +1,461 @@
+/*
+ * drivers/base/power/runtime.c - Helper functions for device run-time PM
+ *
+ * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/pm_runtime.h>
+
+/**
+ * pm_runtime_reset - Clear all of the device run-time PM flags.
+ * @dev: Device object to clear the flags for.
+ */
+static void pm_runtime_reset(struct device *dev)
+{
+	dev->power.suspend_aborted = false;
+	dev->power.runtime_status = RPM_ACTIVE;
+}
+
+/**
+ * pm_device_suspended - Check if given device has been suspended at run time.
+ * @dev: Device to check.
+ * @data: Ignored.
+ *
+ * Returns 0 if the device has been suspended and it hasn't been requested to
+ * resume or -EBUSY otherwise.
+ */
+static int pm_device_suspended(struct device *dev, void *data)
+{
+	return dev->power.runtime_status == RPM_SUSPENDED ? 0 : -EBUSY;
+}
+
+/**
+ * pm_check_children - Check if all children of a device have been suspended.
+ * @dev: Device to check.
+ *
+ * Returns 0 if all children of the device have been suspended or -EBUSY
+ * otherwise.
+ */
+static int pm_check_children(struct device *dev)
+{
+	return dev->power.suspend_skip_children ? 0 :
+			device_for_each_child(dev, NULL, pm_device_suspended);
+}
+
+/**
+ * pm_runtime_notify_idle - Run a device bus type's runtime_idle() callback.
+ * @dev: Device to notify.
+ *
+ * Check if all children of given device are suspended and call the device bus
+ * type's ->runtime_idle() callback if that's the case.
+ */
+static void pm_runtime_notify_idle(struct device *dev)
+{
+	if (atomic_read(&dev->power.depth) > 0 || pm_check_children(dev))
+		return;
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_idle)
+		dev->bus->pm->runtime_idle(dev);
+}
+
+/**
+ * pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
+ * @dev: Device to suspend.
+ *
+ * Check if the status of the device is appropriate and run the
+ * ->runtime_suspend() callback provided by the device's bus type driver.
+ * Update the run-time PM flags in the device object to reflect the current
+ * status of the device.
+ */
+int pm_runtime_suspend(struct device *dev)
+{
+	int error = 0;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status & RPM_SUSPENDED) {
+		goto out;
+	} else if (dev->power.runtime_status & RPM_NO_SUSPEND) {
+		/* Device is resuming or there's a resume request pending. */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_IDLE
+	    && dev->power.suspend_aborted) {
+		dev->power.suspend_aborted = false;
+		dev->power.runtime_status = RPM_ACTIVE;
+		goto out;
+	} else if (pm_check_children(dev)) {
+		/*
+		 * We can only suspend the device if all of its children have
+		 * been suspended.
+		 */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+
+		/*
+		 * Another suspend is running in parallel with us.  Wait for it
+		 * to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	dev->power.runtime_status = RPM_SUSPENDING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_suspend)
+		error = dev->bus->pm->runtime_suspend(dev);
+
+	spin_lock(&dev->power.lock);
+
+	/*
+	 * Resume request might have been queued in the meantime, in which case
+	 * the RPM_WAKE bit is also set in runtime_status.
+	 */
+	dev->power.runtime_status &= ~RPM_SUSPENDING;
+	switch (error) {
+	case 0:
+		dev->power.runtime_status |= RPM_SUSPENDED;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete(&dev->power.work_done);
+
+	if (!error && !(dev->power.runtime_status & RPM_WAKE) && dev->parent) {
+		spin_unlock(&dev->power.lock);
+
+		pm_runtime_notify_idle(dev->parent);
+
+		return 0;
+	}
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(pm_runtime_suspend);
+
+/**
+ * pm_runtime_suspend_work - Run pm_runtime_suspend() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the suspend has been scheduled for and
+ * run pm_runtime_suspend() for it.
+ */
+static void pm_runtime_suspend_work(struct work_struct *work)
+{
+	pm_runtime_suspend(pm_work_to_device(work));
+}
+
+/**
+ * pm_request_suspend - Schedule run-time suspend of given device.
+ * @dev: Device to suspend.
+ * @delay: Time, in jiffies, to wait before attempting to suspend the device.
+ */
+void pm_request_suspend(struct device *dev, unsigned long delay)
+{
+	unsigned long flags;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ACTIVE)
+		goto out;
+
+	dev->power.runtime_status = RPM_IDLE;
+	dev->power.suspend_aborted = false;
+	INIT_DELAYED_WORK(&dev->power.runtime_work, pm_runtime_suspend_work);
+	queue_delayed_work(pm_wq, &dev->power.runtime_work, delay);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_suspend);
+
+/**
+ * pm_cancel_suspend - Cancel a pending suspend request for given device.
+ * @dev: Device to cancel the suspend request for.
+ *
+ * Should be called with dev->power.lock held and only if we are sure that the
+ * ->runtime_suspend() callback hasn't started yet.
+ */
+static void pm_cancel_suspend(struct device *dev)
+{
+	dev->power.suspend_aborted = true;
+	cancel_delayed_work(&dev->power.runtime_work);
+	dev->power.runtime_status = RPM_ACTIVE;
+}
+
+/**
+ * pm_runtime_resume - Run a device bus type's runtime_resume() callback.
+ * @dev: Device to resume.
+ *
+ * Check if the device is really suspended and run the ->runtime_resume()
+ * callback provided by the device's bus type driver.  Update the run-time PM
+ * flags in the device object to reflect the current status of the device.  If
+ * runtime suspend is in progress while this function is being run, wait for it
+ * to finish before resuming the device.  If runtime suspend is scheduled, but
+ * it hasn't started yet, cancel it and we're done.
+ */
+int pm_runtime_resume(struct device *dev)
+{
+	int error = 0;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	if (dev->parent)
+		spin_lock(&dev->parent->power.lock);
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ACTIVE) {
+		goto out_unlock;
+	} else if (dev->power.runtime_status == RPM_IDLE) {
+		/* ->runtime_suspend() hasn't started yet, no need to resume. */
+		pm_cancel_suspend(dev);
+		goto out_unlock;
+	}
+
+	if (dev->power.runtime_status & RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * A suspend is running in parallel with us.  Wait for it to
+		 * complete and repeat.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		goto repeat;
+	} else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
+	    && dev->parent->power.runtime_status != RPM_ACTIVE) {
+		spin_unlock(&dev->power.lock);
+		spin_unlock(&dev->parent->power.lock);
+
+		/* The device's parent is not active.  Resume it and repeat. */
+		error = pm_runtime_resume(dev->parent);
+		if (error)
+			return error;
+
+		goto repeat;
+	}
+
+	if (dev->power.runtime_status == RPM_RESUMING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * There's another resume running in parallel with us. Wait for
+		 * it to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	dev->power.runtime_status = RPM_RESUMING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_resume)
+		error = dev->bus->pm->runtime_resume(dev);
+
+	spin_lock(&dev->power.lock);
+
+	switch (error) {
+	case 0:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_SUSPENDED;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete(&dev->power.work_done);
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+
+ out_unlock:
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+	goto out;
+}
+EXPORT_SYMBOL_GPL(pm_runtime_resume);
+
+/**
+ * pm_runtime_resume_work - Run pm_runtime_resume() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the resume has been scheduled for and run
+ * pm_runtime_resume() for it.
+ */
+static void pm_runtime_resume_work(struct work_struct *work)
+{
+	pm_runtime_resume(pm_work_to_device(work));
+}
+
+/**
+ * pm_request_resume - Schedule run-time resume of given device.
+ * @dev: Device to resume.
+ */
+void pm_request_resume(struct device *dev)
+{
+	unsigned long parent_flags = 0, flags;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	if (dev->parent)
+		spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		/* Autosuspend request is pending, no need to resume. */
+		pm_cancel_suspend(dev);
+		goto out;
+	} else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
+		goto out;
+	} else if (dev->parent
+	    && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+
+		/* We have to resume the parent first. */
+		pm_request_resume(dev->parent);
+
+		goto repeat;
+	}
+
+	/*
+	 * The device may be suspending at the moment and we can't clear the
+	 * RPM_SUSPENDING bit in its runtime_status just yet.
+	 */
+	dev->power.runtime_status |= RPM_WAKE;
+	INIT_WORK(&dev->power.runtime_work.work, pm_runtime_resume_work);
+	queue_work(pm_wq, &dev->power.runtime_work.work);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (dev->parent)
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_resume);
+
+/**
+ * pm_cancel_runtime_suspend - Cancel a pending suspend request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_suspend(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		cancel_delayed_work(&dev->power.runtime_work);
+		pm_runtime_reset(dev);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_suspend);
+
+/**
+ * pm_cancel_runtime_resume - Cancel a pending resume request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_resume(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status & RPM_WAKE) {
+		work_clear_pending(&dev->power.runtime_work.work);
+		pm_runtime_reset(dev);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_resume);
+
+/**
+ * pm_runtime_disable - Disable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Increase the depth field in the device's dev_pm_info structure, which will
+ * cause the run-time PM functions above to return without doing anything.
+ * If there is a run-time PM operation in progress, wait for it to complete.
+ */
+void pm_runtime_disable(struct device *dev)
+{
+	might_sleep();
+
+	atomic_inc(&dev->power.depth);
+
+	if (dev->power.runtime_status & RPM_IN_PROGRESS)
+		wait_for_completion(&dev->power.work_done);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_disable);
+
+/**
+ * pm_runtime_enable - Enable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Enable run-time power management for given device by decreasing the depth
+ * field in its dev_pm_info structure.
+ */
+void pm_runtime_enable(struct device *dev)
+{
+	if (!atomic_add_unless(&dev->power.depth, -1, 0))
+		dev_warn(dev, "PM: Excessive pm_runtime_enable()!\n");
+}
+EXPORT_SYMBOL_GPL(pm_runtime_enable);
+
+/**
+ * pm_runtime_init - Initialize run-time PM fields in given device object.
+ * @dev: Device object to handle.
+ */
+void pm_runtime_init(struct device *dev)
+{
+	pm_runtime_reset(dev);
+	spin_lock_init(&dev->power.lock);
+	atomic_set(&dev->power.depth, 1);
+	pm_suspend_check_children(dev, true);
+}
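[Illustration only, not part of the patch.]  The way pm_runtime_suspend() and
pm_runtime_resume() update the status once the bus type callback has returned
can be summarized by the userspace sketch below.  It is a simplified model
under stated assumptions: the status is a plain int (so RPM_ERROR stays -1),
locking and completions are left out, and rpm_status_after_suspend() and
rpm_status_after_resume() are hypothetical names that do not exist in
runtime.c.

```c
#include <errno.h>

/* Constants copied from the include/linux/pm.h hunk. */
#define RPM_ACTIVE	0
#define RPM_SUSPENDING	0x02
#define RPM_SUSPENDED	0x04
#define RPM_WAKE	0x08
#define RPM_ERROR	(-1)

/*
 * Status after the bus type's ->runtime_suspend() has returned 'error'.
 * 'status' is the value seen on return, i.e. RPM_SUSPENDING, possibly
 * with RPM_WAKE set by a resume request queued in the meantime.
 */
static int rpm_status_after_suspend(int status, int error)
{
	status &= ~RPM_SUSPENDING;
	switch (error) {
	case 0:
		/* RPM_WAKE, if set, is preserved next to RPM_SUSPENDED. */
		status |= RPM_SUSPENDED;
		break;
	case -EAGAIN:
	case -EBUSY:
		/* Not a failure: the device simply stays active. */
		status = RPM_ACTIVE;
		break;
	default:
		status = RPM_ERROR;
	}
	return status;
}

/* Status after the bus type's ->runtime_resume() has returned 'error'. */
static int rpm_status_after_resume(int error)
{
	switch (error) {
	case 0:
		return RPM_ACTIVE;
	case -EAGAIN:
	case -EBUSY:
		/* Not a failure: the device simply stays suspended. */
		return RPM_SUSPENDED;
	default:
		return RPM_ERROR;
	}
}
```

In this model a successful suspend with a wake-up request pending ends in
RPM_SUSPENDED | RPM_WAKE, which is the state the queued resume work item is
expected to find.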
Index: linux-2.6/include/linux/pm_runtime.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pm_runtime.h
@@ -0,0 +1,63 @@
+/*
+ * pm_runtime.h - Device run-time power management helper functions.
+ *
+ * Copyright (C) 2009 Rafael J. Wysocki <rjw@sisk.pl>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_RUNTIME_H
+#define _LINUX_PM_RUNTIME_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+#ifdef CONFIG_PM_RUNTIME
+
+extern struct workqueue_struct *pm_wq;
+
+extern void pm_runtime_init(struct device *dev);
+extern int pm_runtime_suspend(struct device *dev);
+extern void pm_request_suspend(struct device *dev, unsigned long delay);
+extern int pm_runtime_resume(struct device *dev);
+extern void pm_request_resume(struct device *dev);
+extern void pm_cancel_runtime_suspend(struct device *dev);
+extern void pm_cancel_runtime_resume(struct device *dev);
+extern void pm_runtime_disable(struct device *dev);
+extern void pm_runtime_enable(struct device *dev);
+
+static inline struct device *pm_work_to_device(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(dw, struct dev_pm_info, runtime_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+	dev->power.suspend_skip_children = !enable;
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+static inline void pm_runtime_init(struct device *dev) {}
+static inline int pm_runtime_suspend(struct device *dev) { return -ENOSYS; }
+static inline void pm_request_suspend(struct device *dev, unsigned long delay)
+{
+}
+static inline int pm_runtime_resume(struct device *dev) { return -ENOSYS; }
+static inline void pm_request_resume(struct device *dev) {}
+static inline void pm_cancel_runtime_suspend(struct device *dev) {}
+static inline void pm_cancel_runtime_resume(struct device *dev) {}
+static inline void pm_runtime_disable(struct device *dev) {}
+static inline void pm_runtime_enable(struct device *dev) {}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+}
+
+#endif /* !CONFIG_PM_RUNTIME */
+
+#endif
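[Illustration only, not part of the patch.]  The pm_runtime_disable() and
pm_runtime_enable() pair behaves like a nesting counter that starts at 1.  The
userspace sketch below models only that counter; it uses a plain int instead
of atomic_t, and struct rpm_depth and the model_*() helpers are hypothetical
names introduced here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the 'power.depth' counter. */
struct rpm_depth {
	int depth;
};

/* pm_runtime_init() sets the counter to 1: run-time PM starts disabled. */
static void model_init(struct rpm_depth *d)
{
	d->depth = 1;
}

/* pm_runtime_disable() always increments the counter. */
static void model_disable(struct rpm_depth *d)
{
	d->depth++;
}

/*
 * pm_runtime_enable() decrements the counter unless it is already 0.
 * Returns false for an unbalanced call, where the patch prints a warning.
 */
static bool model_enable(struct rpm_depth *d)
{
	if (d->depth == 0)
		return false;
	d->depth--;
	return true;
}

/* The suspend/resume helpers only do work while the counter is 0. */
static bool model_runtime_pm_allowed(const struct rpm_depth *d)
{
	return d->depth == 0;
}
```

Because the counter starts at 1, the balanced disable/enable pairs added to
dd.c and main.c leave run-time PM disabled for a freshly initialized device;
in this model one additional model_enable() call is needed before
model_runtime_pm_allowed() returns true.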
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -21,6 +21,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
@@ -88,6 +89,7 @@ void device_pm_add(struct device *dev)
 	}
 
 	list_add_tail(&dev->power.entry, &dpm_list);
+	pm_runtime_init(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -507,6 +509,7 @@ static void dpm_complete(pm_message_t st
 		get_device(dev);
 		if (dev->power.status > DPM_ON) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			mutex_unlock(&dpm_list_mtx);
 
 			device_complete(dev, state);
@@ -753,6 +756,7 @@ static int dpm_prepare(pm_message_t stat
 
 		get_device(dev);
 		dev->power.status = DPM_PREPARING;
+		pm_runtime_disable(dev);
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_prepare(dev, state);
@@ -760,6 +764,7 @@ static int dpm_prepare(pm_message_t stat
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			if (error == -EAGAIN) {
 				put_device(dev);
 				continue;
Index: linux-2.6/drivers/base/dd.c
===================================================================
--- linux-2.6.orig/drivers/base/dd.c
+++ linux-2.6/drivers/base/dd.c
@@ -23,6 +23,7 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
+#include <linux/pm_runtime.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -202,8 +203,12 @@ int driver_probe_device(struct device_dr
 	pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
 		 drv->bus->name, __func__, dev_name(dev), drv->name);
 
+	pm_runtime_disable(dev);
+
 	ret = really_probe(dev, drv);
 
+	pm_runtime_enable(dev);
+
 	return ret;
 }
 
@@ -306,6 +311,8 @@ static void __device_release_driver(stru
 
 	drv = dev->driver;
 	if (drv) {
+		pm_runtime_disable(dev);
+
 		driver_sysfs_remove(dev);
 
 		if (dev->bus)
@@ -320,6 +327,8 @@ static void __device_release_driver(stru
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
+
+		pm_runtime_enable(dev);
 	}
 }
 
Index: linux-2.6/Documentation/power/runtime_pm.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/power/runtime_pm.txt
@@ -0,0 +1,250 @@
+Run-time Power Management Framework for I/O Devices
+
+(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+1. Introduction
+
+The support for run-time power management (run-time PM) of I/O devices is
+provided at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queuing all work items related to run-time PM, because this allows
+  them to be synchronized with system-wide power transitions.  pm_wq is declared
+  in include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of run-time PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing run-time PM operations with one another.
+
+* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out run-time PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The device run-time PM fields defined in 'struct dev_pm_info', the helper
+functions and the run-time PM callbacks defined in 'struct dev_pm_ops' are
+described in what follows.
+
+2. Run-time PM Helper Functions and Device Fields
+
+The following helper functions are defined in drivers/base/power/runtime.c
+and include/linux/pm_runtime.h:
+
+* void pm_runtime_init(struct device *dev);
+* void pm_runtime_enable(struct device *dev);
+* void pm_runtime_disable(struct device *dev);
+* int pm_runtime_suspend(struct device *dev);
+* void pm_request_suspend(struct device *dev, unsigned long delay);
+* int pm_runtime_resume(struct device *dev);
+* void pm_request_resume(struct device *dev);
+* void pm_cancel_runtime_suspend(struct device *dev);
+* void pm_cancel_runtime_resume(struct device *dev);
+* void pm_suspend_check_children(struct device *dev, bool enable);
+
+pm_runtime_init() initializes the run-time PM fields in the 'power' member of
+the device object.  It is called during the initialization of the device object,
+in drivers/base/power/main.c:device_pm_add().
+
+pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
+respectively, pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume().  They do it by decreasing and increasing, respectively,
+the 'power.depth' field of 'struct device'.  If the value of this field is
+greater than 0, pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume() return immediately without doing anything and -EBUSY is
+returned by pm_runtime_suspend() and pm_runtime_resume().  Therefore, if
+pm_runtime_disable() is called several times in a row for the same device, it
+has to be balanced by the appropriate number of pm_runtime_enable() calls so
+that the other run-time PM functions can be used for that device.  The initial
+value of 'power.depth', as set by pm_runtime_init(), is 1.
+
+pm_runtime_disable() and pm_runtime_enable() are used by the device core to
+disable the run-time PM of the device temporarily during device probe and
+removal as well as during system-wide power transitions (i.e. system-wide
+suspend or hibernation, or resume from a system sleep state).
+
+pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume() use the 'power.runtime_status' and
+'power.suspend_aborted' fields of 'struct device' for mutual synchronization.
+These fields are initialized by pm_runtime_init() and set to RPM_ACTIVE and
+'false', respectively.
+
+pm_request_suspend() is used to queue up a suspend request for an active device.
+If the run-time PM status of the device (i.e. the value of the
+'power.runtime_status' field in 'struct device') is different from RPM_ACTIVE,
+it returns immediately.  Otherwise, it changes the device's run-time PM status
+to RPM_IDLE and puts a request to execute pm_runtime_suspend() into pm_wq.  The
+'delay' argument specifies the time to wait, in jiffies, before the request is
+executed.
+
+pm_runtime_suspend() is used to carry out a run-time suspend of an active
+device.  It is called either by the PM core, to complete a request queued up by
+pm_request_suspend(), or directly by a bus type or device driver.
+* It returns immediately if the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field ('power.runtime_status').
+* It returns -EAGAIN if at least one of the RPM_WAKE and RPM_RESUMING bits is
+  set in the device's run-time PM status field.
+* If the device's run-time PM status is RPM_IDLE and 'power.suspend_aborted'
+  flag is set for it, the device's run-time PM status is set to RPM_ACTIVE and
+  the function returns success.
+* If the device's children are not suspended and the
+  'power.suspend_skip_children' flag is not set for it, -EAGAIN is returned.
+* If the device's run-time PM status is RPM_SUSPENDING, which means that another
+  instance of pm_runtime_suspend() is running at the same time for the same
+  device, the function waits for the other instance to complete and returns the
+  error code (or success) returned by it.
+If none of the above takes place, the device's run-time PM status is set to
+RPM_SUSPENDING and the device bus type's ->runtime_suspend() callback is
+executed, which is responsible for handling the device as appropriate (for
+example, it may choose to execute the device driver's ->runtime_suspend()
+callback or to carry out any other suitable action depending on the bus type).
+Next:
+* If it completes successfully, the RPM_SUSPENDED bit is set and the
+  RPM_SUSPENDING bit is cleared in the device's run-time PM status field.  Once
+  that has happened, the device is regarded by the PM core as suspended, but
+  that need not mean that the device has been put into a low power state.  What
+  really happens to the device at this point depends entirely on its bus type
+  (it may depend on the device's driver if the bus type chooses to call it).
+  Additionally, if the device bus type's ->runtime_suspend() callback completes
+  successfully, the device bus type's ->runtime_idle() callback is executed for
+  the device's parent if there is one and if all of its children are suspended
+  (or the 'power.suspend_skip_children' flag is set for it).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_ACTIVE.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to run pm_runtime_suspend(),
+  pm_request_suspend(), pm_runtime_resume(), and pm_request_resume() until the
+  status is changed to either RPM_ACTIVE or RPM_SUSPENDED by the device's bus
+  type or driver.
+Finally, pm_runtime_suspend() returns the error code (or success) returned by
+the device bus type's ->runtime_suspend() callback.
+
+pm_request_resume() is used to queue up a resume request for a device that is
+suspended, suspending or has a suspend request pending.
+* If a suspend request is pending for the device (i.e. the device's run-time PM
+  status is RPM_IDLE), it is cancelled and the function returns.
+* If the device is not suspended or suspending (i.e. neither the RPM_SUSPENDED
+  nor the RPM_SUSPENDING bit is set in the device's run-time PM status field),
+  the function returns.
+* If the device's parent is inactive, a resume request is scheduled for the
+  parent and the function is restarted.
+If none of the above happens, the RPM_WAKE bit is set in the device's run-time
+PM status field and the request to execute pm_runtime_resume() is put into
+pm_wq.
+
+pm_runtime_resume() is used to carry out a run-time resume of a device that is
+suspended, suspending or has a suspend request pending.  It is called either by
+the PM core, to complete a request queued up by pm_request_resume(), or
+directly by a bus type or device driver.
+* It returns immediately if the device's run-time PM status is RPM_ACTIVE.
+* If there's a suspend request pending for the device (i.e. the device's
+  run-time PM status is RPM_IDLE), it is cancelled and the function returns
+  success.
+* If the device is suspending (i.e. the RPM_SUSPENDING bit is set in the
+  device's run-time PM status field), the function waits for the suspend
+  operation to complete and restarts itself.
+* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field), the device's parent exists and is not active (i.e.
+  the parent's run-time PM status is not RPM_ACTIVE), pm_runtime_resume() is
+  called (recursively) for the parent and the function is restarted.
+* If the device is resuming (i.e. the device's run-time PM status is
+  RPM_RESUMING), which means that another instance of pm_runtime_resume() is
+  running at the same time for the same device, the function waits for the other
+  instance to complete and returns the result returned by it.
+If none of the above happens, the device's run-time PM status is set to
+RPM_RESUMING and the device bus type's ->runtime_resume() callback is executed,
+which is responsible for handling the device as appropriate (for example, it may
+choose to execute the device driver's ->runtime_resume() callback or to carry
+out any other suitable action depending on the bus type).  Next:
+* If it completes successfully, the device's run-time PM status is set to
+  RPM_ACTIVE, which means that the device is fully operational.  Thus, the
+  device bus type's ->runtime_resume() callback, when it is about to return
+  success, _must_ _ensure_ that this really is the case (i.e. when it returns,
+  the device _must_ be able to complete I/O operations as needed).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_SUSPENDED.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to run pm_runtime_suspend(),
+  pm_request_suspend(), pm_runtime_resume(), and pm_request_resume() until the
+  status is changed to either RPM_ACTIVE or RPM_SUSPENDED by the device's bus
+  type or driver.
+Finally, pm_runtime_resume() returns the error code (or success) returned by
+the device bus type's ->runtime_resume() callback.
+
+pm_cancel_runtime_suspend() is used to cancel a pending suspend request for an
+active device, but it can only be called when the run-time PM of the device
+is disabled.  It is supposed to be used during system-wide power transitions.
+
+pm_cancel_runtime_resume() is used to cancel a pending resume request for
+a suspended device.  It can only be called when the run-time PM of the device
+is disabled and it is supposed to be used during system-wide power transitions.
+
+pm_suspend_check_children() is used to set or unset the
+'power.suspend_skip_children' flag in 'struct device'.  If the 'enable'
+argument is 'true', the field is set to 0, and if 'enable' is 'false', the field
+is set to 1.  The default value of 'power.suspend_skip_children', as set by
+pm_runtime_init(), is 0.
+
+3. Device Run-time PM Callbacks
+
+There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend() callback is executed by pm_runtime_suspend() for the bus
+type of the device being suspended.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_suspend() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the bus type's ->runtime_suspend() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_suspend() callback has returned successfully,
+  the PM core regards the device as suspended, which need not mean that the
+  device has been put into a low power state.  It is supposed to mean, however,
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_ACTIVE, which means that the device
+  _must_ be fully operational once this has happened.
+* If the bus type's ->runtime_suspend() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 1 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_resume() callback is executed by pm_runtime_resume() for the bus
+type of the device being woken up.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_resume() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the bus type's ->runtime_resume() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_resume() callback has returned successfully,
+  the PM core regards the device as fully operational, which means that the
+  device _must_ be able to complete I/O operations as needed.
+* If the bus type's ->runtime_resume() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_SUSPENDED, which is supposed to mean
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_resume() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 1 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_idle() callback is executed by pm_runtime_suspend() for the bus
+type of a device the children of which are all suspended (or which has the
+'power.suspend_skip_children' flag set).  The action carried out by this
+callback is totally dependent on the bus type in question, but the expected
+action is to check if the device can be suspended (i.e. if all of the conditions
+necessary for suspending the device are met) and to queue up a suspend request
+for the device if that is the case.
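
As a reading aid for the status handling described above, here is a minimal
user-space sketch of how the run-time PM status changes once the bus type's
->runtime_suspend() or ->runtime_resume() callback returns.  It is not kernel
code: status_after_suspend() and status_after_resume() are hypothetical helpers
named for illustration only, the status is modelled as a plain int, and the
RPM_* values are copied from the include/linux/pm.h part of the follow-up
version of the patch.

```c
#include <errno.h>

/* Status values as defined in the patch's include/linux/pm.h hunk. */
#define RPM_ACTIVE	0
#define RPM_SUSPENDING	0x02
#define RPM_SUSPENDED	0x04
#define RPM_WAKE	0x08
#define RPM_RESUMING	0x10
#define RPM_ERROR	(-1)

/*
 * Hypothetical helper: status after the bus type's ->runtime_suspend()
 * callback has returned 'error'.  'status' is the value the field had while
 * the callback was running (RPM_SUSPENDING, possibly with RPM_WAKE set too).
 */
static int status_after_suspend(int status, int error)
{
	status &= ~RPM_SUSPENDING;

	switch (error) {
	case 0:
		/* Other bits (e.g. RPM_WAKE) are preserved on success. */
		return status | RPM_SUSPENDED;
	case -EAGAIN:
	case -EBUSY:
		/* Suspend refused for now, the device stays active. */
		return RPM_ACTIVE;
	default:
		/* Any other error blocks the helpers until the status is fixed. */
		return RPM_ERROR;
	}
}

/*
 * Hypothetical helper: status after the bus type's ->runtime_resume()
 * callback has returned 'error'.
 */
static int status_after_resume(int error)
{
	switch (error) {
	case 0:
		return RPM_ACTIVE;
	case -EAGAIN:
	case -EBUSY:
		/* Resume refused for now, the device stays suspended. */
		return RPM_SUSPENDED;
	default:
		return RPM_ERROR;
	}
}
```

In this model a successful suspend only clears RPM_SUSPENDING and sets
RPM_SUSPENDED, so an RPM_WAKE bit set by a resume request queued in the
meantime survives, whereas every other outcome overwrites the whole status.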


* Re: [PATCH] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-13 22:23 [PATCH] PM: Introduce core framework for run-time PM of I/O devices Rafael J. Wysocki
                   ` (2 preceding siblings ...)
  2009-06-14  9:58 ` [linux-pm] " Rafael J. Wysocki
@ 2009-06-14  9:58 ` Rafael J. Wysocki
  3 siblings, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-14  9:58 UTC (permalink / raw)
  To: Alan Stern, Oliver Neukum, Magnus Damm
  Cc: ACPI Devel Maling List, linux-pm, Ingo Molnar, LKML

On Sunday 14 June 2009, Rafael J. Wysocki wrote:
> Hi,
> 
> Below is the current version of my "run-time PM for I/O devices" patch.
> 
> I've done my best to address the comments received during the recent
> discussions, but at the same time I've tried to make the patch only contain
> the most essential things.  For this reason, for example, the sysfs interface
> is not there and it's going to be added in a separate patch.
> 
> Please let me know if you want me to change anything in this patch or to add
> anything new to it.  [Magnus, I remember you wanted something like
> ->runtime_wakeup() along with ->runtime_idle(), but I'm not sure it's really
> necessary.  Please let me know if you have any particular usage scenario for
> it.]

Sorry, I sent an outdated version of the patch.  The current one is below.

Best,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>

Introduce a core framework for run-time power management of I/O
devices.  Add device run-time PM fields to 'struct dev_pm_info'
and device run-time PM callbacks to 'struct dev_pm_ops'.  Introduce
a run-time PM workqueue and define some device run-time PM helper
functions at the core level.  Document all these things.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/runtime_pm.txt |  250 ++++++++++++++++++++
 drivers/base/dd.c                  |    9 
 drivers/base/power/Makefile        |    1 
 drivers/base/power/main.c          |    5 
 drivers/base/power/runtime.c       |  461 +++++++++++++++++++++++++++++++++++++
 include/linux/pm.h                 |   98 +++++++
 include/linux/pm_runtime.h         |   63 +++++
 kernel/power/Kconfig               |   14 +
 kernel/power/main.c                |   17 +
 9 files changed, 915 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/power/Kconfig
===================================================================
--- linux-2.6.orig/kernel/power/Kconfig
+++ linux-2.6/kernel/power/Kconfig
@@ -208,3 +208,17 @@ config APM_EMULATION
 	  random kernel OOPSes or reboots that don't seem to be related to
 	  anything, try disabling/enabling this option (or disabling/enabling
 	  APM in your BIOS).
+
+config PM_RUNTIME
+	bool "Run-time PM core functionality"
+	depends on PM
+	---help---
+	  Enable functionality allowing I/O devices to be put into energy-saving
+	  (low power) states at run time (or autosuspended) after a specified
+	  period of inactivity and woken up in response to a hardware-generated
+	  wake-up event or a driver's request.
+
+	  Hardware support is generally required for this functionality to work
+	  and the bus type drivers of the buses the devices are on are
+	  responsible for the actual handling of the autosuspend requests and
+	  wake-up events.
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -11,6 +11,7 @@
 #include <linux/kobject.h>
 #include <linux/string.h>
 #include <linux/resume-trace.h>
+#include <linux/workqueue.h>
 
 #include "power.h"
 
@@ -217,8 +218,24 @@ static struct attribute_group attr_group
 	.attrs = g,
 };
 
+#ifdef CONFIG_PM_RUNTIME
+struct workqueue_struct *pm_wq;
+
+static int __init pm_start_workqueue(void)
+{
+	pm_wq = create_freezeable_workqueue("pm");
+
+	return pm_wq ? 0 : -ENOMEM;
+}
+#else
+static inline int pm_start_workqueue(void) { return 0; }
+#endif
+
 static int __init pm_init(void)
 {
+	int error = pm_start_workqueue();
+	if (error)
+		return error;
 	power_kobj = kobject_create_and_add("power", NULL);
 	if (!power_kobj)
 		return -ENOMEM;
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -22,6 +22,9 @@
 #define _LINUX_PM_H
 
 #include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -165,6 +168,26 @@ typedef struct pm_message {
  * It is allowed to unregister devices while the above callbacks are being
  * executed.  However, it is not allowed to unregister a device from within any
  * of its own callbacks.
+ *
+ * There also are the following callbacks related to run-time power management
+ * of devices:
+ *
+ * @runtime_suspend: Prepare the device for a condition in which it won't be
+ *	able to communicate with the CPU(s) and RAM due to power management.
+ *	This need not mean that the device should be put into a low power state,
+ *	like for example when the device is behind a link, represented by a
+ *	separate device object, that is going to be turned off for power
+ *	management purposes.
+ *
+ * @runtime_resume: Put the device into the fully active state in response to a
+ *	wake-up event generated by hardware or at a request of software.  If
+ *	necessary, put the device into the full power state and restore its
+ *	registers, so that it is fully operational.
+ *
+ * @runtime_idle: Device appears to be inactive and it might be put into a low
+ *	power state if all of the necessary conditions are satisfied.  Check
+ *	these conditions and handle the device as appropriate, possibly queueing
+ *	a suspend request for it.
  */
 
 struct dev_pm_ops {
@@ -182,6 +205,11 @@ struct dev_pm_ops {
 	int (*thaw_noirq)(struct device *dev);
 	int (*poweroff_noirq)(struct device *dev);
 	int (*restore_noirq)(struct device *dev);
+#ifdef CONFIG_PM_RUNTIME
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+#endif
 };
 
 /**
@@ -315,14 +343,78 @@ enum dpm_state {
 	DPM_OFF_IRQ,
 };
 
+/**
+ * Device run-time power management state.
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.  They do not
+ * reflect the actual power state of the device or its status as seen by the
+ * driver.
+ *
+ * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
+ *			pending for it.
+ *
+ * RPM_IDLE		It has been requested that the device be suspended.
+ *			Suspend request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
+ *			executed.
+ *
+ * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
+ *			completed successfully.  The device is regarded as
+ *			suspended.
+ *
+ * RPM_WAKE		It has been requested that the device be woken up.
+ *			Resume request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
+ *			executed.
+ *
+ * RPM_ERROR		Represents a condition from which the PM core cannot
+ *			recover by itself.  If the device's run-time PM status
+ *			field has this value, all of the run-time PM operations
+ *			carried out for the device by the core will fail, until
+ *			the status field is changed to either RPM_ACTIVE or
+ *			RPM_SUSPENDED (it is not valid to use the other values
+ *			in such a situation) by the device's driver or bus type.
+ *			This happens when the device bus type's
+ *			->runtime_suspend() or ->runtime_resume() callback
+ *			returns an error code different from -EAGAIN or -EBUSY.
+ */
+
+#define RPM_ACTIVE	0
+#define RPM_IDLE	0x01
+#define RPM_SUSPENDING	0x02
+#define RPM_SUSPENDED	0x04
+#define RPM_WAKE	0x08
+#define RPM_RESUMING	0x10
+#define RPM_ERROR	(-1)
+
+#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
+#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)
+#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING)
+#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)
+
 struct dev_pm_info {
 	pm_message_t		power_state;
-	unsigned		can_wakeup:1;
-	unsigned		should_wakeup:1;
+	unsigned int		can_wakeup:1;
+	unsigned int		should_wakeup:1;
 	enum dpm_state		status;		/* Owned by the PM core */
-#ifdef	CONFIG_PM_SLEEP
+#ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
 #endif
+#ifdef CONFIG_PM_RUNTIME
+	struct delayed_work	runtime_work;
+	struct completion	work_done;
+	unsigned int		suspend_skip_children:1;
+	unsigned int		suspend_aborted:1;
+	unsigned int		runtime_status:5;
+	int			runtime_error;
+	atomic_t		depth;
+	spinlock_t		lock;
+#endif
 };
 
 /*
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o
+obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
Index: linux-2.6/drivers/base/power/runtime.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/runtime.c
@@ -0,0 +1,461 @@
+/*
+ * drivers/base/power/runtime.c - Helper functions for device run-time PM
+ *
+ * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/pm_runtime.h>
+
+/**
+ * pm_runtime_reset - Clear all of the device run-time PM flags.
+ * @dev: Device object to clear the flags for.
+ */
+static void pm_runtime_reset(struct device *dev)
+{
+	dev->power.suspend_aborted = false;
+	dev->power.runtime_status = RPM_ACTIVE;
+}
+
+/**
+ * pm_device_suspended - Check if given device has been suspended at run time.
+ * @dev: Device to check.
+ * @data: Ignored.
+ *
+ * Returns 0 if the device has been suspended and it hasn't been requested to
+ * resume or -EBUSY otherwise.
+ */
+static int pm_device_suspended(struct device *dev, void *data)
+{
+	return dev->power.runtime_status == RPM_SUSPENDED ? 0 : -EBUSY;
+}
+
+/**
+ * pm_check_children - Check if all children of a device have been suspended.
+ * @dev: Device to check.
+ *
+ * Returns 0 if all children of the device have been suspended or -EBUSY
+ * otherwise.
+ */
+static int pm_check_children(struct device *dev)
+{
+	return dev->power.suspend_skip_children ? 0 :
+			device_for_each_child(dev, NULL, pm_device_suspended);
+}
+
+/**
+ * pm_runtime_notify_idle - Run a device bus type's runtime_idle() callback.
+ * @dev: Device to notify.
+ *
+ * Check if all children of given device are suspended and call the device bus
+ * type's ->runtime_idle() callback if that's the case.
+ */
+static void pm_runtime_notify_idle(struct device *dev)
+{
+	if (atomic_read(&dev->power.depth) > 0 || pm_check_children(dev))
+		return;
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_idle)
+		dev->bus->pm->runtime_idle(dev);
+}
+
+/**
+ * pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
+ * @dev: Device to suspend.
+ *
+ * Check if the status of the device is appropriate and run the
+ * ->runtime_suspend() callback provided by the device's bus type driver.
+ * Update the run-time PM flags in the device object to reflect the current
+ * status of the device.
+ */
+int pm_runtime_suspend(struct device *dev)
+{
+	int error = 0;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status & RPM_SUSPENDED) {
+		goto out;
+	} else if (dev->power.runtime_status & RPM_NO_SUSPEND) {
+		/* Device is resuming or there's a resume request pending. */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_IDLE
+	    && dev->power.suspend_aborted) {
+		dev->power.suspend_aborted = false;
+		dev->power.runtime_status = RPM_ACTIVE;
+		goto out;
+	} else if (pm_check_children(dev)) {
+		/*
+		 * We can only suspend the device if all of its children have
+		 * been suspended.
+		 */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+
+		/*
+		 * Another suspend is running in parallel with us.  Wait for it
+		 * to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	dev->power.runtime_status = RPM_SUSPENDING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_suspend)
+		error = dev->bus->pm->runtime_suspend(dev);
+
+	spin_lock(&dev->power.lock);
+
+	/*
+	 * Resume request might have been queued in the meantime, in which case
+	 * the RPM_WAKE bit is also set in runtime_status.
+	 */
+	dev->power.runtime_status &= ~RPM_SUSPENDING;
+	switch (error) {
+	case 0:
+		dev->power.runtime_status |= RPM_SUSPENDED;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete(&dev->power.work_done);
+
+	if (!error && !(dev->power.runtime_status & RPM_WAKE) && dev->parent) {
+		spin_unlock(&dev->power.lock);
+
+		pm_runtime_notify_idle(dev->parent);
+
+		return 0;
+	}
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(pm_runtime_suspend);
+
+/**
+ * pm_runtime_suspend_work - Run pm_runtime_suspend() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the suspend has been scheduled for and
+ * run pm_runtime_suspend() for it.
+ */
+static void pm_runtime_suspend_work(struct work_struct *work)
+{
+	pm_runtime_suspend(pm_work_to_device(work));
+}
+
+/**
+ * pm_request_suspend - Schedule run-time suspend of given device.
+ * @dev: Device to suspend.
+ * @delay: Time, in jiffies, to wait before attempting to suspend the device.
+ */
+void pm_request_suspend(struct device *dev, unsigned long delay)
+{
+	unsigned long flags;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ACTIVE)
+		goto out;
+
+	dev->power.runtime_status = RPM_IDLE;
+	dev->power.suspend_aborted = false;
+	INIT_DELAYED_WORK(&dev->power.runtime_work, pm_runtime_suspend_work);
+	queue_delayed_work(pm_wq, &dev->power.runtime_work, delay);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_suspend);
+
+/**
+ * pm_cancel_suspend - Cancel a pending suspend request for given device.
+ * @dev: Device to cancel the suspend request for.
+ *
+ * Should be called with dev->power.lock held and only if we are sure that the
+ * ->runtime_suspend() callback hasn't started yet.
+ */
+static void pm_cancel_suspend(struct device *dev)
+{
+	dev->power.suspend_aborted = true;
+	cancel_delayed_work(&dev->power.runtime_work);
+	dev->power.runtime_status = RPM_ACTIVE;
+}
+
+/**
+ * pm_runtime_resume - Run a device bus type's runtime_resume() callback.
+ * @dev: Device to resume.
+ *
+ * Check if the device is really suspended and run the ->runtime_resume()
+ * callback provided by the device's bus type driver.  Update the run-time PM
+ * flags in the device object to reflect the current status of the device.  If
+ * runtime suspend is in progress while this function is being run, wait for it
+ * to finish before resuming the device.  If runtime suspend is scheduled, but
+ * it hasn't started yet, cancel it and we're done.
+ */
+int pm_runtime_resume(struct device *dev)
+{
+	int error = 0;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	if (dev->parent)
+		spin_lock(&dev->parent->power.lock);
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ACTIVE) {
+		goto out_unlock;
+	} else if (dev->power.runtime_status == RPM_IDLE) {
+		/* ->runtime_suspend() hasn't started yet, no need to resume. */
+		pm_cancel_suspend(dev);
+		goto out_unlock;
+	}
+
+	if (dev->power.runtime_status & RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * A suspend is running in parallel with us.  Wait for it to
+		 * complete and repeat.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		goto repeat;
+	} else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
+	    && dev->parent->power.runtime_status != RPM_ACTIVE) {
+		spin_unlock(&dev->power.lock);
+		spin_unlock(&dev->parent->power.lock);
+
+		/* The device's parent is not active.  Resume it and repeat. */
+		error = pm_runtime_resume(dev->parent);
+		if (error)
+			return error;
+
+		goto repeat;
+	}
+
+	if (dev->power.runtime_status == RPM_RESUMING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * There's another resume running in parallel with us. Wait for
+		 * it to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	dev->power.runtime_status = RPM_RESUMING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_resume)
+		error = dev->bus->pm->runtime_resume(dev);
+
+	spin_lock(&dev->power.lock);
+
+	switch (error) {
+	case 0:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_SUSPENDED;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete(&dev->power.work_done);
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+
+ out_unlock:
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+	goto out;
+}
+EXPORT_SYMBOL_GPL(pm_runtime_resume);
+
+/**
+ * pm_runtime_resume_work - Run pm_runtime_resume() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the resume has been scheduled for and run
+ * pm_runtime_resume() for it.
+ */
+static void pm_runtime_resume_work(struct work_struct *work)
+{
+	pm_runtime_resume(pm_work_to_device(work));
+}
+
+/**
+ * pm_request_resume - Schedule run-time resume of given device.
+ * @dev: Device to resume.
+ */
+void pm_request_resume(struct device *dev)
+{
+	unsigned long parent_flags = 0, flags;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	if (dev->parent)
+		spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		/* Autosuspend request is pending, no need to resume. */
+		pm_cancel_suspend(dev);
+		goto out;
+	} else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
+		goto out;
+	} else if (dev->parent
+	    && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+
+		/* We have to resume the parent first. */
+		pm_request_resume(dev->parent);
+
+		goto repeat;
+	}
+
+	/*
+	 * The device may be suspending at the moment and we can't clear the
+	 * RPM_SUSPENDING bit in its runtime_status just yet.
+	 */
+	dev->power.runtime_status |= RPM_WAKE;
+	INIT_WORK(&dev->power.runtime_work.work, pm_runtime_resume_work);
+	queue_work(pm_wq, &dev->power.runtime_work.work);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (dev->parent)
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_resume);
+
+/**
+ * pm_cancel_runtime_suspend - Cancel a pending suspend request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_suspend(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		cancel_delayed_work(&dev->power.runtime_work);
+		pm_runtime_reset(dev);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_suspend);
+
+/**
+ * pm_cancel_runtime_resume - Cancel a pending resume request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_resume(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status & RPM_WAKE) {
+		work_clear_pending(&dev->power.runtime_work.work);
+		pm_runtime_reset(dev);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_resume);
+
+/**
+ * pm_runtime_disable - Disable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Increase the depth field in the device's dev_pm_info structure, which will
+ * cause the run-time PM functions above to return without doing anything.
+ * If there is a run-time PM operation in progress, wait for it to complete.
+ */
+void pm_runtime_disable(struct device *dev)
+{
+	might_sleep();
+
+	atomic_inc(&dev->power.depth);
+
+	if (dev->power.runtime_status & RPM_IN_PROGRESS)
+		wait_for_completion(&dev->power.work_done);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_disable);
+
+/**
+ * pm_runtime_enable - Enable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Enable run-time power management for given device by decreasing the depth
+ * field in its dev_pm_info structure.
+ */
+void pm_runtime_enable(struct device *dev)
+{
+	if (!atomic_add_unless(&dev->power.depth, -1, 0))
+		dev_warn(dev, "PM: Excessive pm_runtime_enable()!\n");
+}
+EXPORT_SYMBOL_GPL(pm_runtime_enable);
+
+/**
+ * pm_runtime_init - Initialize run-time PM fields in given device object.
+ * @dev: Device object to handle.
+ */
+void pm_runtime_init(struct device *dev)
+{
+	pm_runtime_reset(dev);
+	spin_lock_init(&dev->power.lock);
+	atomic_set(&dev->power.depth, 1);
+	pm_suspend_check_children(dev, true);
+}
Index: linux-2.6/include/linux/pm_runtime.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pm_runtime.h
@@ -0,0 +1,63 @@
+/*
+ * pm_runtime.h - Device run-time power management helper functions.
+ *
+ * Copyright (C) 2009 Rafael J. Wysocki <rjw@sisk.pl>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_RUNTIME_H
+#define _LINUX_PM_RUNTIME_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+#ifdef CONFIG_PM_RUNTIME
+
+extern struct workqueue_struct *pm_wq;
+
+extern void pm_runtime_init(struct device *dev);
+extern int pm_runtime_suspend(struct device *dev);
+extern void pm_request_suspend(struct device *dev, unsigned long delay);
+extern int pm_runtime_resume(struct device *dev);
+extern void pm_request_resume(struct device *dev);
+extern void pm_cancel_runtime_suspend(struct device *dev);
+extern void pm_cancel_runtime_resume(struct device *dev);
+extern void pm_runtime_disable(struct device *dev);
+extern void pm_runtime_enable(struct device *dev);
+
+static inline struct device *pm_work_to_device(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(dw, struct dev_pm_info, runtime_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+	dev->power.suspend_skip_children = !enable;
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+static inline void pm_runtime_init(struct device *dev) {}
+static inline int pm_runtime_suspend(struct device *dev) { return -ENOSYS; }
+static inline void pm_request_suspend(struct device *dev, unsigned long delay)
+{
+}
+static inline int pm_runtime_resume(struct device *dev) { return -ENOSYS; }
+static inline void pm_request_resume(struct device *dev) {}
+static inline void pm_cancel_runtime_suspend(struct device *dev) {}
+static inline void pm_cancel_runtime_resume(struct device *dev) {}
+static inline void pm_runtime_disable(struct device *dev) {}
+static inline void pm_runtime_enable(struct device *dev) {}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+}
+
+#endif /* !CONFIG_PM_RUNTIME */
+
+#endif
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -21,6 +21,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
@@ -88,6 +89,7 @@ void device_pm_add(struct device *dev)
 	}
 
 	list_add_tail(&dev->power.entry, &dpm_list);
+	pm_runtime_init(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -507,6 +509,7 @@ static void dpm_complete(pm_message_t st
 		get_device(dev);
 		if (dev->power.status > DPM_ON) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			mutex_unlock(&dpm_list_mtx);
 
 			device_complete(dev, state);
@@ -753,6 +756,7 @@ static int dpm_prepare(pm_message_t stat
 
 		get_device(dev);
 		dev->power.status = DPM_PREPARING;
+		pm_runtime_disable(dev);
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_prepare(dev, state);
@@ -760,6 +764,7 @@ static int dpm_prepare(pm_message_t stat
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			if (error == -EAGAIN) {
 				put_device(dev);
 				continue;
Index: linux-2.6/drivers/base/dd.c
===================================================================
--- linux-2.6.orig/drivers/base/dd.c
+++ linux-2.6/drivers/base/dd.c
@@ -23,6 +23,7 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
+#include <linux/pm_runtime.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -202,8 +203,12 @@ int driver_probe_device(struct device_dr
 	pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
 		 drv->bus->name, __func__, dev_name(dev), drv->name);
 
+	pm_runtime_disable(dev);
+
 	ret = really_probe(dev, drv);
 
+	pm_runtime_enable(dev);
+
 	return ret;
 }
 
@@ -306,6 +311,8 @@ static void __device_release_driver(stru
 
 	drv = dev->driver;
 	if (drv) {
+		pm_runtime_disable(dev);
+
 		driver_sysfs_remove(dev);
 
 		if (dev->bus)
@@ -320,6 +327,8 @@ static void __device_release_driver(stru
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
+
+		pm_runtime_enable(dev);
 	}
 }
 
Index: linux-2.6/Documentation/power/runtime_pm.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/power/runtime_pm.txt
@@ -0,0 +1,250 @@
+Run-time Power Management Framework for I/O Devices
+
+(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+1. Introduction
+
+The support for run-time power management (run-time PM) of I/O devices is
+provided at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queuing all work items related to run-time PM, because this allows
+  them to be synchronized with system-wide power transitions.  pm_wq is declared
+  in include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of run-time PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing run-time PM operations with one another.
+
+* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out run-time PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The device run-time PM fields defined in 'struct dev_pm_info', the helper
+functions and the run-time PM callbacks defined in 'struct dev_pm_ops' are
+described in what follows.
+
+2. Run-time PM Helper Functions and Device Fields
+
+The following helper functions are defined in drivers/base/power/runtime.c
+and include/linux/pm_runtime.h:
+
+* void pm_runtime_init(struct device *dev);
+* void pm_runtime_enable(struct device *dev);
+* void pm_runtime_disable(struct device *dev);
+* int pm_runtime_suspend(struct device *dev);
+* void pm_request_suspend(struct device *dev, unsigned long delay);
+* int pm_runtime_resume(struct device *dev);
+* void pm_request_resume(struct device *dev);
+* void pm_cancel_runtime_suspend(struct device *dev);
+* void pm_cancel_runtime_resume(struct device *dev);
+* void pm_suspend_check_children(struct device *dev, bool enable);
+
+pm_runtime_init() initializes the run-time PM fields in the 'power' member of
+the device object.  It is called during the initialization of the device object,
+in drivers/base/power/main.c:device_pm_add().
+
+pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
+respectively, pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume().  They do it by decreasing and increasing, respectively,
+the 'power.depth' field of 'struct device'.  If the value of this field is
+greater than 0, pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume() return immediately without doing anything and -EBUSY is
+returned by pm_runtime_suspend() and pm_runtime_resume().  Therefore, if
+pm_runtime_disable() is called several times in a row for the same device, it
+has to be balanced by the appropriate number of pm_runtime_enable() calls so
+that the other run-time PM functions can be used for that device.  The initial
+value of 'power.depth', as set by pm_runtime_init(), is 1.
+
+pm_runtime_disable() and pm_runtime_enable() are used by the device core to
+disable the run-time PM of the device temporarily during device probe and
+removal as well as during system-wide power transitions (i.e. system-wide
+suspend or hibernation, or resume from a system sleep state).
+
+pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume() use the 'power.runtime_status' and
+'power.suspend_aborted' fields of 'struct device' for mutual synchronization.
+These fields are initialized by pm_runtime_init() and set to RPM_ACTIVE and
+'false', respectively.
+
+pm_request_suspend() is used to queue up a suspend request for an active device.
+If the run-time PM status of the device (i.e. the value of the
+'power.runtime_status' field in 'struct device') is different from RPM_ACTIVE,
+it returns immediately.  Otherwise, it changes the device's run-time PM status
+to RPM_IDLE and puts a request to execute pm_runtime_suspend() into pm_wq.  The
+'delay' argument specifies the time to wait, in jiffies, before the request is
+executed.
+
+pm_runtime_suspend() is used to carry out a run-time suspend of an active
+device.  It is called either by the PM core, to complete a request queued up by
+pm_request_suspend(), or directly by a bus type or device driver.
+* It returns immediately if the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field ('power.runtime_status').
+* It returns -EAGAIN if at least one of the RPM_WAKE and RPM_RESUMING bits is
+  set in the device's run-time PM status field.
+* If the device's run-time PM status is RPM_IDLE and the
+  'power.suspend_aborted' flag is set for it, the device's run-time PM status
+  is set to RPM_ACTIVE and the function returns success.
+* If the device's children are not suspended and the
+  'power.suspend_skip_children' flag is not set for it, -EAGAIN is returned.
+* If the device's run-time PM status is RPM_SUSPENDING, which means that another
+  instance of pm_runtime_suspend() is running at the same time for the same
+  device, the function waits for the other instance to complete and returns the
+  error code (or success) returned by it.
+If none of the above takes place, the device's run-time PM status is set to
+RPM_SUSPENDING and the device bus type's ->runtime_suspend() callback is
+executed, which is responsible for handling the device as appropriate (for
+example, it may choose to execute the device driver's ->runtime_suspend()
+callback or to carry out any other suitable action depending on the bus type).
+Next:
+* If it completes successfully, the RPM_SUSPENDED bit is set and the
+  RPM_SUSPENDING bit is cleared in the device's run-time PM status field.  Once
+  that has happened, the device is regarded by the PM core as suspended, but it
+  need not mean that the device has been put into a low power state.  What
+  really occurs to the device at this point totally depends on its bus type (it
+  may depend on the device's driver if the bus type chooses to call it).
+  Additionally, if the device bus type's ->runtime_suspend() callback completes
+  successfully, the device bus type's ->runtime_idle() callback is executed for
+  the device's parent if there is one and if all of its children are suspended
+  (or the 'power.suspend_skip_children' flag is set for it).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_ACTIVE.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to run pm_runtime_suspend(),
+  pm_request_suspend(), pm_runtime_resume(), and pm_request_resume() until the
+  status is changed to either RPM_ACTIVE or RPM_SUSPENDED by the device's bus
+  type or driver.
+Finally, pm_runtime_suspend() returns the error code (or success) returned by
+the device bus type's ->runtime_suspend() callback.
+
+pm_request_resume() is used to queue up a resume request for a device that is
+suspended, suspending or has a suspend request pending.
+* If a suspend request is pending for the device (i.e. the device's run-time PM
+  status is RPM_IDLE), it is cancelled and the function returns.
+* If the device is not suspended or suspending (i.e. none of the RPM_SUSPENDED
+  and RPM_SUSPENDING bits is set in the device's run-time PM status field), the
+  function returns.
+* If the device's parent is inactive, a resume request is scheduled for the
+  parent and the function is restarted.
+If none of the above happens, the RPM_WAKE bit is set in the device's run-time
+PM status field and the request to execute pm_runtime_resume() is put into
+pm_wq.
+
+pm_runtime_resume() is used to carry out a run-time resume of a device that is
+suspended, suspending or has a suspend request pending.  It is called either by
+the PM core, to complete a request queued up by pm_request_resume(), or
+directly by a bus type or device driver.
+* It returns immediately if the device's run-time PM status is RPM_ACTIVE.
+* If there's a suspend request pending for the device (i.e. the device's
+  run-time PM status is RPM_IDLE), it is cancelled and the function returns
+  success.
+* If the device is suspending (i.e. the RPM_SUSPENDING bit is set in the
+  device's run-time PM status field), the function waits for the suspend
+  operation to complete and restarts itself.
+* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field), the device's parent exists and is not active (i.e.
+  the parent's run-time PM status is not RPM_ACTIVE), pm_runtime_resume() is
+  called (recursively) for the parent and the function is restarted.
+* If the device is resuming (i.e. the device's run-time PM status is
+  RPM_RESUMING), which means that another instance of pm_runtime_resume() is
+  running at the same time for the same device, the function waits for the other
+  instance to complete and returns the result returned by it.
+If none of the above happens, the device's run-time PM status is set to
+RPM_RESUMING and the device bus type's ->runtime_resume() callback is executed,
+which is responsible for handling the device as appropriate (for example, it may
+choose to execute the device driver's ->runtime_resume() callback or to carry
+out any other suitable action depending on the bus type).  Next:
+* If it completes successfully, the device's run-time PM status is set to
+  RPM_ACTIVE, which means that the device is fully operational.  Thus, the
+  device bus type's ->runtime_resume() callback, when it is about to return
+  success, _must_ _ensure_ that this really is the case (i.e. when it returns,
+  the device _must_ be able to complete I/O operations as needed).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_SUSPENDED.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to run pm_runtime_suspend(),
+  pm_request_suspend(), pm_runtime_resume(), and pm_request_resume() until the
+  status is changed to either RPM_ACTIVE or RPM_SUSPENDED by the device's bus
+  type or driver.
+Finally, pm_runtime_resume() returns the error code (or success) returned by
+the device bus type's ->runtime_resume() callback.
+
+pm_cancel_runtime_suspend() is used to cancel a pending suspend request for an
+active device, but it can only be called when the run-time PM of the device
+is disabled.  It is supposed to be used during system-wide power transitions.
+
+pm_cancel_runtime_resume() is used to cancel a pending resume request for
+a suspended device.  It can only be called when the run-time PM of the device
+is disabled and it is supposed to be used during system-wide power transitions.
+
+pm_suspend_check_children() is used to set or unset the
+'power.suspend_skip_children' flag in 'struct device'.  If the 'enable'
+argument is 'true', the field is set to 0, and if 'enable' is 'false', the field
+is set to 1.  The default value of 'power.suspend_skip_children', as set by
+pm_runtime_init(), is 0.
+
+3. Device Run-time PM Callbacks
+
+There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend() callback is executed by pm_runtime_suspend() for the bus
+type of the device being suspended.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_suspend() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the bus type's ->runtime_suspend() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_suspend() callback has returned successfully,
+  the PM core regards the device as suspended, which need not mean that the
+  device has been put into a low power state.  It is supposed to mean, however,
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_ACTIVE, which means that the device
+  _must_ be fully operational once this has happened.
+* If the bus type's ->runtime_suspend() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_resume() callback is executed by pm_runtime_resume() for the bus
+type of the device being woken up.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_resume() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the bus type's ->runtime_resume() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_resume() callback has returned successfully,
+  the PM core regards the device as fully operational, which means that the
+  device _must_ be able to complete I/O operations as needed.
+* If the bus type's ->runtime_resume() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_SUSPENDED, which is supposed to mean
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_resume() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_idle() callback is executed by pm_runtime_suspend() for the bus
+type of a device the children of which are all suspended (or which has the
+'power.suspend_skip_children' flag set).  The action carried out by this
+callback is totally dependent on the bus type in question, but the expected
+action is to check if the device can be suspended (i.e. if all of the conditions
+necessary for suspending the device are met) and to queue up a suspend request
+for the device if that is the case.
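
For readers who want the 'power.depth' rule and the status update described
in Section 2 of the document above in executable form, here is a rough
userspace sketch.  It is not part of the patch: the model_* names, the enum
and the struct are made up for illustration, the status is a plain enum
rather than the RPM_* bits, and the workqueue, locking and the children
checks are not modeled at all.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-ins for the RPM_* status values (the patch uses bits). */
enum model_status {
	MODEL_ACTIVE,
	MODEL_SUSPENDED,
	MODEL_ERROR,
};

/* Hypothetical stand-in for the run-time PM fields of struct dev_pm_info. */
struct model_pm_info {
	int depth;			/* > 0: run-time PM is disabled */
	enum model_status status;
};

/* Documented initial state: depth is 1 and the device is active. */
static void model_runtime_init(struct model_pm_info *pm)
{
	pm->depth = 1;
	pm->status = MODEL_ACTIVE;
}

static void model_runtime_disable(struct model_pm_info *pm)
{
	pm->depth++;
}

/* Returns false on an unbalanced call, where the patch prints a warning. */
static bool model_runtime_enable(struct model_pm_info *pm)
{
	assert(pm->depth >= 0);
	if (pm->depth == 0)
		return false;
	pm->depth--;
	return true;
}

/* Status the document assigns after the bus type's callback has returned. */
static enum model_status model_status_after_suspend(int cb_result)
{
	if (cb_result == 0)
		return MODEL_SUSPENDED;
	if (cb_result == -EBUSY || cb_result == -EAGAIN)
		return MODEL_ACTIVE;
	return MODEL_ERROR;
}

/*
 * cb_result stands for the value returned by the bus type's
 * ->runtime_suspend() callback.  The model assumes an active device and
 * only covers the depth gate and the final status update.
 */
static int model_runtime_suspend(struct model_pm_info *pm, int cb_result)
{
	if (pm->depth > 0)
		return -EBUSY;

	pm->status = model_status_after_suspend(cb_result);
	return cb_result;
}
```

In this model a freshly initialized device refuses to suspend until
model_runtime_enable() has been called once, and every extra
model_runtime_disable() needs its own model_runtime_enable() before suspend
is allowed again.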

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-14  9:41   ` Magnus Damm
  (?)
  (?)
@ 2009-06-14 10:29   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-14 10:29 UTC (permalink / raw)
  To: Magnus Damm
  Cc: Alan Stern, Oliver Neukum, Magnus Damm, pm list, LKML,
	Ingo Molnar, ACPI Devel Maling List

On Sunday 14 June 2009, Magnus Damm wrote:
> Hi Rafael,
> 
> On Sun, Jun 14, 2009 at 7:23 AM, Rafael J. Wysocki<rjw@sisk.pl> wrote:
> > Below is the current version of my "run-time PM for I/O devices" patch.
> >
> > I've done my best to address the comments received during the recent
> > discussions, but at the same time I've tried to make the patch only contain
> > the most essential things.  For this reason, for example, the sysfs interface
> > is not there and it's going to be added in a separate patch.
> 
> Good decision. Let's do this step by step.
> 
> > Please let me know if you want me to change anything in this patch or to add
> > anything new to it.  [Magnus, I remember you wanted something like
> > ->runtime_wakeup() along with ->runtime_idle(), but I'm not sure it's really
> > necessary.  Please let me know if you have any particular usage scenario for
> > it.]
> 
> I will keep on building my arch specific platform bus code on top of
> the latest version of this patch.
> 
> However, to begin with I'll not make use of the ->runtime_idle()
> callback in the bus code. This because rearranging the existing
> platform devices into a tree will require a lot of rewriting, and I'm
> not convinced it's the right approach. I'd rather focus on getting
> basic functionality in place at this point. So if no one else needs
> ->runtime_idle(), feel free to exclude the ->runtime_idle() part if
> you want to make the patch even leaner to begin with.

I think it's going to be useful in general.  If not, we can just drop it.

> Together with the bus specific callbacks I plan to modify device
> drivers to include pm_runtime_suspend() / pm_runtime_resume() calls to
> notify the bus code when they are idle and when they need wakeup,
> similar to my earlier proposal with
> platform_device_idle()/platform_device_wakeup().

That sounds like a good plan.

> > --- linux-2.6.orig/include/linux/pm.h
> > +++ linux-2.6/include/linux/pm.h
> > @@ -182,6 +205,11 @@ struct dev_pm_ops {
> >        int (*thaw_noirq)(struct device *dev);
> >        int (*poweroff_noirq)(struct device *dev);
> >        int (*restore_noirq)(struct device *dev);
> > +#ifdef CONFIG_PM_RUNTIME
> > +       int (*runtime_suspend)(struct device *dev);
> > +       int (*runtime_resume)(struct device *dev);
> > +       void (*runtime_idle)(struct device *dev);
> > +#endif
> 
> Do we really need to wrap these in CONFIG_PM_RUNTIME? The callbacks
> for STR and STD are not wrapped in CONFIG_SUSPEND and
> CONFIG_HIBERNATION, right?
> 
> > --- /dev/null
> > +++ linux-2.6/drivers/base/power/runtime.c
> [snip]
> > +/**
> > + * pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
> > + * @dev: Device to suspend.
> > + *
> > + * Check if the status of the device is appropriate and run the
> > + * ->runtime_suspend() callback provided by the device's bus type driver.
> > + * Update the run-time PM flags in the device object to reflect the current
> > + * status of the device.
> > + */
> > +int pm_runtime_suspend(struct device *dev)
> > +{
> > +       int error = 0;
> 
> I'm sure you put a lot of thought into this already, but is it really
> the best approach to assume that busses without runtime pm callbacks
> can be suspended? I'd go with an error value by default and only
> return 0 as callback return value.

Hmm, yes.  I think you're right.
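
Roughly along these lines, as an untested userspace sketch rather than the
actual change.  The model_* names are made up for illustration, and -ENOSYS
is only a placeholder borrowed from the !CONFIG_PM_RUNTIME stubs in
pm_runtime.h; the error code to use here has not been decided.

```c
#include <errno.h>
#include <stddef.h>

struct model_device;

/* Hypothetical stand-in for the run-time PM part of struct dev_pm_ops. */
struct model_pm_ops {
	int (*runtime_suspend)(struct model_device *dev);
};

/* Hypothetical stand-in for struct device and its bus type's PM ops. */
struct model_device {
	const struct model_pm_ops *bus_pm;	/* NULL: bus has no PM ops */
};

/* Example bus callback that always reports a successful suspend. */
static int model_bus_suspend_ok(struct model_device *dev)
{
	(void)dev;
	return 0;
}

/*
 * Start from an error and let only an existing callback overwrite it, so
 * that a bus type without run-time PM callbacks never reports success.
 */
static int model_run_suspend_callback(struct model_device *dev)
{
	int error = -ENOSYS;	/* placeholder value, see the note above */

	if (dev->bus_pm && dev->bus_pm->runtime_suspend)
		error = dev->bus_pm->runtime_suspend(dev);

	return error;
}
```

That way 0 can only ever come from a callback that actually ran.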

> > +/**
> > + * pm_cancel_suspend - Cancel a pending suspend request for given device.
> > + * @dev: Device to cancel the suspend request for.
> > + *
> > + * Should be called under pm_lock_device() and only if we are sure that the
> > + * ->autosuspend() callback hasn't started to yet.
> > + */
> > +static void pm_cancel_suspend(struct device *dev)
> > +{
> > +       dev->power.suspend_aborted = true;
> > +       cancel_delayed_work(&dev->power.runtime_work);
> > +       dev->power.runtime_status = RPM_ACTIVE;
> > +}
> 
> This pm_lock_device() comment seems to come from old code, no?

Correct, I'll fix the comments.

> > +/**
> > + * pm_runtime_resume - Run a device bus type's runtime_resume() callback.
> > + * @dev: Device to resume.
> > + *
> > + * Check if the device is really suspended and run the ->runtime_resume()
> > + * callback provided by the device's bus type driver.  Update the run-time PM
> > + * flags in the device object to reflect the current status of the device.  If
> > + * runtime suspend is in progress while this function is being run, wait for it
> > + * to finish before resuming the device.  If runtime suspend is scheduled, but
> > + * it hasn't started yet, cancel it and we're done.
> > + */
> > +int pm_runtime_resume(struct device *dev)
> > +{
> > +       int error = 0;
> 
> Same here, does non-existing runtime pm callbacks really mean we can resume?

Well, in fact if we get to the callback and it doesn't exist, that will be a
bug.  So, I think it's a good idea to return an error code in such a case.

> > +/**
> > + * pm_runtime_disable - Disable run-time power management for given device.
> > + * @dev: Device to handle.
> > + *
> > + * Increase the depth field in the device's dev_pm_info structure, which will
> > + * cause the run-time PM functions above to return without doing anything.
> > + * If there is a run-time PM operation in progress, wait for it to complete.
> > + */
> > +void pm_runtime_disable(struct device *dev)
> > +{
> > +       might_sleep();
> > +
> > +       atomic_inc(&dev->power.depth);
> > +
> > +       if (dev->power.runtime_status & RPM_IN_PROGRESS)
> > +               wait_for_completion(&dev->power.work_done);
> > +}
> > +EXPORT_SYMBOL_GPL(pm_runtime_disable);
> > +
> > +/**
> > + * pm_runtime_enable - Disable run-time power management for given device.
> > + * @dev: Device to handle.
> > + *
> > + * Enable run-time power management for given device by decreasing the depth
> > + * field in its dev_pm_info structure.
> > + */
> > +void pm_runtime_enable(struct device *dev)
> > +{
> > +       if (!atomic_add_unless(&dev->power.depth, -1, 0))
> > +               dev_warn(dev, "PM: Excessive pm_runtime_enable()!\n");
> > +}
> > +EXPORT_SYMBOL_GPL(pm_runtime_enable);
> 
> Any thoughts on performing ->runtime_resume()/->runtime_suspend() in
> enable() and disable()? I guess it's performed too early/late to make
> sense from the driver point of view?

Some thoughts, yes.  As for an implementation, I'd like to wait until at least
one bus type uses the framework.

> Looking good, thanks a lot for your work on this!

Thanks for your comments.

> Any chance we can get this included in -rc1?

Well, in fact I have already pushed all of the changes I wanted in 2.6.31.
Also, I'd like to receive some comments on the $subject patch from other
people.

That said, the merge window is still open, so if the comments are supportive
and there's a chance to put the final version into linux-next for a couple of
days before the merge window ends, I may try to push it to Linus.  After all,
the patch is not going to introduce any regressions. ;-)

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-14  9:41   ` Magnus Damm
  (?)
@ 2009-06-14 10:29   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-14 10:29 UTC (permalink / raw)
  To: Magnus Damm
  Cc: LKML, ACPI Devel Maling List, Magnus Damm, pm list, Ingo Molnar

On Sunday 14 June 2009, Magnus Damm wrote:
> Hi Rafael,
> 
> On Sun, Jun 14, 2009 at 7:23 AM, Rafael J. Wysocki<rjw@sisk.pl> wrote:
> > Below is the current version of my "run-time PM for I/O devices" patch.
> >
> > I've done my best to address the comments received during the recent
> > discussions, but at the same time I've tried to make the patch only contain
> > the most essential things.  For this reason, for example, the sysfs interface
> > is not there and it's going to be added in a separate patch.
> 
> Good decision. Let's do this step by step.
> 
> > Please let me know if you want me to change anything in this patch or to add
> > anything new to it.  [Magnus, I remember you wanted something like
> > ->runtime_wakeup() along with ->runtime_idle(), but I'm not sure it's really
> > necessary.  Please let me know if you have any particular usage scenario for
> > it.]
> 
> I will keep on building my arch specific platform bus code on top of
> the latest version of this patch.
> 
> However, to begin with I'll not make use of the ->runtime_idle()
> callback in the bus code. This because rearranging the existing
> platform devices into a tree will require a lot of rewriting, and I'm
> not convinced it's the right approach. I'd rather focus on getting
> basic functionality in place at this point. So if no one else needs
> ->runtime_idle(), feel free to exclude the ->runtime_idle() part if
> you want to make the patch even leaner to begin with.

I think it's going be useful in general.  If not, we can just drop it.

> Together with the bus specific callbacks I plan to modify device
> drivers to include pm_runtime_suspend() / pm_runtime_resume() calls to
> notify the bus code when they are idle and when they need wakeup,
> similar to my earlier proposal with
> platform_device_idle()/platform_device_wakeup().

That sounds like a good plan.

> > --- linux-2.6.orig/include/linux/pm.h
> > +++ linux-2.6/include/linux/pm.h
> > @@ -182,6 +205,11 @@ struct dev_pm_ops {
> >        int (*thaw_noirq)(struct device *dev);
> >        int (*poweroff_noirq)(struct device *dev);
> >        int (*restore_noirq)(struct device *dev);
> > +#ifdef CONFIG_PM_RUNTIME
> > +       int (*runtime_suspend)(struct device *dev);
> > +       int (*runtime_resume)(struct device *dev);
> > +       void (*runtime_idle)(struct device *dev);
> > +#endif
> 
> Do we really need to wrap these in CONFIG_PM_RUNTIME? The callbacks
> for STR and STD are not wrapped in CONFIG_SUSPEND and
> CONFIG_HIBERNATION, right?
> 
> > --- /dev/null
> > +++ linux-2.6/drivers/base/power/runtime.c
> [snip]
> > +/**
> > + * pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
> > + * @dev: Device to suspend.
> > + *
> > + * Check if the status of the device is appropriate and run the
> > + * ->runtime_suspend() callback provided by the device's bus type driver.
> > + * Update the run-time PM flags in the device object to reflect the current
> > + * status of the device.
> > + */
> > +int pm_runtime_suspend(struct device *dev)
> > +{
> > +       int error = 0;
> 
> I'm sure you put a lot of thought into this already, but is it really
> the best approach to assume that busses without runtime pm callbacks
> can be suspended? I'd go with an error value by default and only
> return 0 as callback return value.

Hmm, yes.  I think you're right.

> > +/**
> > + * pm_cancel_suspend - Cancel a pending suspend request for given device.
> > + * @dev: Device to cancel the suspend request for.
> > + *
> > + * Should be called under pm_lock_device() and only if we are sure that the
> > + * ->autosuspend() callback hasn't started to yet.
> > + */
> > +static void pm_cancel_suspend(struct device *dev)
> > +{
> > +       dev->power.suspend_aborted = true;
> > +       cancel_delayed_work(&dev->power.runtime_work);
> > +       dev->power.runtime_status = RPM_ACTIVE;
> > +}
> 
> This pm_lock_device() comment seems to come from old code, no?

Correct, I'll fix the comments.

> > +/**
> > + * pm_runtime_resume - Run a device bus type's runtime_resume() callback.
> > + * @dev: Device to resume.
> > + *
> > + * Check if the device is really suspended and run the ->runtime_resume()
> > + * callback provided by the device's bus type driver.  Update the run-time PM
> > + * flags in the device object to reflect the current status of the device.  If
> > + * runtime suspend is in progress while this function is being run, wait for it
> > + * to finish before resuming the device.  If runtime suspend is scheduled, but
> > + * it hasn't started yet, cancel it and we're done.
> > + */
> > +int pm_runtime_resume(struct device *dev)
> > +{
> > +       int error = 0;
> 
> Same here, does non-existing runtime pm callbacks really mean we can resume?

Well, in fact if we get to the callback and it doesn't exist, that will be a
bug.  So, I think it's a good idea to return error code in such a case.

> > +/**
> > + * pm_runtime_disable - Disable run-time power management for given device.
> > + * @dev: Device to handle.
> > + *
> > + * Increase the depth field in the device's dev_pm_info structure, which will
> > + * cause the run-time PM functions above to return without doing anything.
> > + * If there is a run-time PM operation in progress, wait for it to complete.
> > + */
> > +void pm_runtime_disable(struct device *dev)
> > +{
> > +       might_sleep();
> > +
> > +       atomic_inc(&dev->power.depth);
> > +
> > +       if (dev->power.runtime_status & RPM_IN_PROGRESS)
> > +               wait_for_completion(&dev->power.work_done);
> > +}
> > +EXPORT_SYMBOL_GPL(pm_runtime_disable);
> > +
> > +/**
> > + * pm_runtime_enable - Disable run-time power management for given device.
> > + * @dev: Device to handle.
> > + *
> > + * Enable run-time power management for given device by decreasing the depth
> > + * field in its dev_pm_info structure.
> > + */
> > +void pm_runtime_enable(struct device *dev)
> > +{
> > +       if (!atomic_add_unless(&dev->power.depth, -1, 0))
> > +               dev_warn(dev, "PM: Excessive pm_runtime_enable()!\n");
> > +}
> > +EXPORT_SYMBOL_GPL(pm_runtime_enable);
> 
> Any thoughts on performing ->runtime_resume()/->runtime_suspend() in
> enable() and disable()? I guess it's performed too early/late to make
> sense from the driver point of view?

Some thoughts, yes.  As for an implementation, I'd like to wait until at least
one bus type uses the framework.

> Looking good, thanks a lot for your work on this!

Thanks for your comments.

> Any chance we can get this included in -rc1?

Well, in fact I have already pushed all of the changes I wanted in 2.6.31.
Also, I'd like to receive some comments on the $subject patch from the other
people.

That said, the merge window is still open, so if the comments are supportive
and there's a chance to put the final version into linux-next for a couple of
days before the merge window ends, I may try to push it to Linus.  After all,
the patch is not going to introduce any regressions. ;-)

Best,
Rafael

* [patch update] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-14  9:58 ` [linux-pm] " Rafael J. Wysocki
@ 2009-06-14 22:57   ` Rafael J. Wysocki
  2009-06-14 23:18     ` Arjan van de Ven
                       ` (5 more replies)
  2009-06-14 22:57   ` Rafael J. Wysocki
  1 sibling, 6 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-14 22:57 UTC (permalink / raw)
  To: Alan Stern, Oliver Neukum, Magnus Damm
  Cc: linux-pm, ACPI Devel Maling List, Ingo Molnar, LKML, Greg KH

On Sunday 14 June 2009, Rafael J. Wysocki wrote:
> On Sunday 14 June 2009, Rafael J. Wysocki wrote:
> > Hi,
> > 
> > Below is the current version of my "run-time PM for I/O devices" patch.
> > 
> > I've done my best to address the comments received during the recent
> > discussions, but at the same time I've tried to make the patch only contain
> > the most essential things.  For this reason, for example, the sysfs interface
> > is not there and it's going to be added in a separate patch.
> > 
> > Please let me know if you want me to change anything in this patch or to add
> > anything new to it.  [Magnus, I remember you wanted something like
> > ->runtime_wakeup() along with ->runtime_idle(), but I'm not sure it's really
> > necessary.  Please let me know if you have any particular usage scenario for
> > it.]

Appended is an update of the patch addressing today's comments from Magnus.

Best,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>

Introduce a core framework for run-time power management of I/O
devices.  Add device run-time PM fields to 'struct dev_pm_info'
and device run-time PM callbacks to 'struct dev_pm_ops'.  Introduce
a run-time PM workqueue and define some device run-time PM helper
functions at the core level.  Document all these things.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/runtime_pm.txt |  251 +++++++++++++++++++
 drivers/base/dd.c                  |    9 
 drivers/base/power/Makefile        |    1 
 drivers/base/power/main.c          |    5 
 drivers/base/power/runtime.c       |  466 +++++++++++++++++++++++++++++++++++++
 include/linux/pm.h                 |   96 +++++++
 include/linux/pm_runtime.h         |   63 +++++
 kernel/power/Kconfig               |   14 +
 kernel/power/main.c                |   17 +
 9 files changed, 919 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/power/Kconfig
===================================================================
--- linux-2.6.orig/kernel/power/Kconfig
+++ linux-2.6/kernel/power/Kconfig
@@ -208,3 +208,17 @@ config APM_EMULATION
 	  random kernel OOPSes or reboots that don't seem to be related to
 	  anything, try disabling/enabling this option (or disabling/enabling
 	  APM in your BIOS).
+
+config PM_RUNTIME
+	bool "Run-time PM core functionality"
+	depends on PM
+	---help---
+	  Enable functionality allowing I/O devices to be put into energy-saving
+	  (low power) states at run time (or autosuspended) after a specified
+	  period of inactivity and woken up in response to a hardware-generated
+	  wake-up event or a driver's request.
+
+	  Hardware support is generally required for this functionality to work
+	  and the bus type drivers of the buses the devices are on are
+	  responsible for the actual handling of the autosuspend requests and
+	  wake-up events.
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -11,6 +11,7 @@
 #include <linux/kobject.h>
 #include <linux/string.h>
 #include <linux/resume-trace.h>
+#include <linux/workqueue.h>
 
 #include "power.h"
 
@@ -217,8 +218,24 @@ static struct attribute_group attr_group
 	.attrs = g,
 };
 
+#ifdef CONFIG_PM_RUNTIME
+struct workqueue_struct *pm_wq;
+
+static int __init pm_start_workqueue(void)
+{
+	pm_wq = create_freezeable_workqueue("pm");
+
+	return pm_wq ? 0 : -ENOMEM;
+}
+#else
+static inline int pm_start_workqueue(void) { return 0; }
+#endif
+
 static int __init pm_init(void)
 {
+	int error = pm_start_workqueue();
+	if (error)
+		return error;
 	power_kobj = kobject_create_and_add("power", NULL);
 	if (!power_kobj)
 		return -ENOMEM;
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -22,6 +22,9 @@
 #define _LINUX_PM_H
 
 #include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -165,6 +168,26 @@ typedef struct pm_message {
  * It is allowed to unregister devices while the above callbacks are being
  * executed.  However, it is not allowed to unregister a device from within any
  * of its own callbacks.
+ *
+ * There also are the following callbacks related to run-time power management
+ * of devices:
+ *
+ * @runtime_suspend: Prepare the device for a condition in which it won't be
+ *	able to communicate with the CPU(s) and RAM due to power management.
+ *	This need not mean that the device should be put into a low power state,
+ *	like for example when the device is behind a link, represented by a
+ *	separate device object, that is going to be turned off for power
+ *	management purposes.
+ *
+ * @runtime_resume: Put the device into the fully active state in response to a
+ *	wake-up event generated by hardware or at a request of software.  If
+ *	necessary, put the device into the full power state and restore its
+ *	registers, so that it is fully operational.
+ *
+ * @runtime_idle: Device appears to be inactive and it might be put into a low
+ *	power state if all of the necessary conditions are satisfied.  Check
+ *	these conditions and handle the device as appropriate, possibly queueing
+ *	a suspend request for it.
  */
 
 struct dev_pm_ops {
@@ -182,6 +205,9 @@ struct dev_pm_ops {
 	int (*thaw_noirq)(struct device *dev);
 	int (*poweroff_noirq)(struct device *dev);
 	int (*restore_noirq)(struct device *dev);
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
 };
 
 /**
@@ -315,14 +341,78 @@ enum dpm_state {
 	DPM_OFF_IRQ,
 };
 
+/**
+ * Device run-time power management state.
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.  They do not
+ * reflect the actual power state of the device or its status as seen by the
+ * driver.
+ *
+ * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
+ *			pending for it.
+ *
+ * RPM_IDLE		It has been requested that the device be suspended.
+ *			Suspend request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
+ *			executed.
+ *
+ * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
+ *			completed successfully.  The device is regarded as
+ *			suspended.
+ *
+ * RPM_WAKE		It has been requested that the device be woken up.
+ *			Resume request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
+ *			executed.
+ *
+ * RPM_ERROR		Represents a condition from which the PM core cannot
+ *			recover by itself.  If the device's run-time PM status
+ *			field has this value, all of the run-time PM operations
+ *			carried out for the device by the core will fail, until
+ *			the status field is changed to either RPM_ACTIVE or
+ *			RPM_SUSPENDED (it is not valid to use the other values
+ *			in such a situation) by the device's driver or bus type.
+ *			This happens when the device bus type's
+ *			->runtime_suspend() or ->runtime_resume() callback
+ *			returns an error code different from -EAGAIN or -EBUSY.
+ */
+
+#define RPM_ACTIVE	0
+#define RPM_IDLE	0x01
+#define RPM_SUSPENDING	0x02
+#define RPM_SUSPENDED	0x04
+#define RPM_WAKE	0x08
+#define RPM_RESUMING	0x10
+#define RPM_ERROR	(-1)
+
+#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
+#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)
+#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING)
+#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)
+
 struct dev_pm_info {
 	pm_message_t		power_state;
-	unsigned		can_wakeup:1;
-	unsigned		should_wakeup:1;
+	unsigned int		can_wakeup:1;
+	unsigned int		should_wakeup:1;
 	enum dpm_state		status;		/* Owned by the PM core */
-#ifdef	CONFIG_PM_SLEEP
+#ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
 #endif
+#ifdef CONFIG_PM_RUNTIME
+	struct delayed_work	runtime_work;
+	struct completion	work_done;
+	unsigned int		suspend_skip_children:1;
+	unsigned int		suspend_aborted:1;
+	unsigned int		runtime_status:5;
+	int			runtime_error;
+	atomic_t		depth;
+	spinlock_t		lock;
+#endif
 };
 
 /*
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o
+obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
Index: linux-2.6/drivers/base/power/runtime.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/runtime.c
@@ -0,0 +1,466 @@
+/*
+ * drivers/base/power/runtime.c - Helper functions for device run-time PM
+ *
+ * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/pm_runtime.h>
+
+/**
+ * pm_runtime_reset - Clear all of the device run-time PM flags.
+ * @dev: Device object to clear the flags for.
+ */
+static void pm_runtime_reset(struct device *dev)
+{
+	dev->power.suspend_aborted = false;
+	dev->power.runtime_status = RPM_ACTIVE;
+}
+
+/**
+ * pm_device_suspended - Check if given device has been suspended at run time.
+ * @dev: Device to check.
+ * @data: Ignored.
+ *
+ * Returns 0 if the device has been suspended and it hasn't been requested to
+ * resume or -EBUSY otherwise.
+ */
+static int pm_device_suspended(struct device *dev, void *data)
+{
+	return dev->power.runtime_status == RPM_SUSPENDED ? 0 : -EBUSY;
+}
+
+/**
+ * pm_check_children - Check if all children of a device have been suspended.
+ * @dev: Device to check.
+ *
+ * Returns 0 if all children of the device have been suspended or -EBUSY
+ * otherwise.
+ */
+static int pm_check_children(struct device *dev)
+{
+	return dev->power.suspend_skip_children ? 0 :
+			device_for_each_child(dev, NULL, pm_device_suspended);
+}
+
+/**
+ * pm_runtime_notify_idle - Run a device bus type's runtime_idle() callback.
+ * @dev: Device to notify.
+ *
+ * Check if all children of given device are suspended and call the device bus
+ * type's ->runtime_idle() callback if that's the case.
+ */
+static void pm_runtime_notify_idle(struct device *dev)
+{
+	if (atomic_read(&dev->power.depth) > 0 || pm_check_children(dev))
+		return;
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_idle)
+		dev->bus->pm->runtime_idle(dev);
+}
+
+/**
+ * pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
+ * @dev: Device to suspend.
+ *
+ * Check if the status of the device is appropriate and run the
+ * ->runtime_suspend() callback provided by the device's bus type driver.
+ * Update the run-time PM flags in the device object to reflect the current
+ * status of the device.
+ */
+int pm_runtime_suspend(struct device *dev)
+{
+	int error = -EINVAL;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out;
+	} else if (dev->power.runtime_status & RPM_SUSPENDED) {
+		error = 0;
+		goto out;
+	} else if (dev->power.runtime_status & RPM_NO_SUSPEND) {
+		/* Device is resuming or there's a resume request pending. */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_IDLE
+	    && dev->power.suspend_aborted) {
+		dev->power.suspend_aborted = false;
+		dev->power.runtime_status = RPM_ACTIVE;
+		error = 0;
+		goto out;
+	} else if (pm_check_children(dev)) {
+		/*
+		 * We can only suspend the device if all of its children have
+		 * been suspended.
+		 */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+
+		/*
+		 * Another suspend is running in parallel with us.  Wait for it
+		 * to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	dev->power.runtime_status = RPM_SUSPENDING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_suspend)
+		error = dev->bus->pm->runtime_suspend(dev);
+
+	spin_lock(&dev->power.lock);
+
+	/*
+	 * Resume request might have been queued in the meantime, in which case
+	 * the RPM_WAKE bit is also set in runtime_status.
+	 */
+	dev->power.runtime_status &= ~RPM_SUSPENDING;
+	switch (error) {
+	case 0:
+		dev->power.runtime_status |= RPM_SUSPENDED;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete(&dev->power.work_done);
+
+	if (!error && !(dev->power.runtime_status & RPM_WAKE) && dev->parent) {
+		spin_unlock(&dev->power.lock);
+
+		pm_runtime_notify_idle(dev->parent);
+
+		return 0;
+	}
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(pm_runtime_suspend);
+
+/**
+ * pm_runtime_suspend_work - Run pm_runtime_suspend() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the suspend has been scheduled for and
+ * run pm_runtime_suspend() for it.
+ */
+static void pm_runtime_suspend_work(struct work_struct *work)
+{
+	pm_runtime_suspend(pm_work_to_device(work));
+}
+
+/**
+ * pm_request_suspend - Schedule run-time suspend of given device.
+ * @dev: Device to suspend.
+ * @delay: Time, in jiffies, to wait before attempting to suspend the device.
+ */
+void pm_request_suspend(struct device *dev, unsigned long delay)
+{
+	unsigned long flags;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ACTIVE)
+		goto out;
+
+	dev->power.runtime_status = RPM_IDLE;
+	dev->power.suspend_aborted = false;
+	INIT_DELAYED_WORK(&dev->power.runtime_work, pm_runtime_suspend_work);
+	queue_delayed_work(pm_wq, &dev->power.runtime_work, delay);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_suspend);
+
+/**
+ * pm_cancel_suspend - Cancel a pending suspend request for given device.
+ * @dev: Device to cancel the suspend request for.
+ */
+static void pm_cancel_suspend(struct device *dev)
+{
+	dev->power.suspend_aborted = true;
+	cancel_delayed_work(&dev->power.runtime_work);
+	dev->power.runtime_status = RPM_ACTIVE;
+}
+
+/**
+ * pm_runtime_resume - Run a device bus type's runtime_resume() callback.
+ * @dev: Device to resume.
+ *
+ * Check if the device is really suspended and run the ->runtime_resume()
+ * callback provided by the device's bus type driver.  Update the run-time PM
+ * flags in the device object to reflect the current status of the device.  If
+ * runtime suspend is in progress while this function is being run, wait for it
+ * to finish before resuming the device.  If runtime suspend is scheduled, but
+ * it hasn't started yet, cancel it and we're done.
+ */
+int pm_runtime_resume(struct device *dev)
+{
+	int error = -EINVAL;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	if (dev->parent)
+		spin_lock(&dev->parent->power.lock);
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out_unlock;
+	} else if (dev->power.runtime_status == RPM_ACTIVE) {
+		error = 0;
+		goto out_unlock;
+	} else if (dev->power.runtime_status == RPM_IDLE) {
+		/* ->runtime_suspend() hasn't started yet, no need to resume. */
+		pm_cancel_suspend(dev);
+		error = 0;
+		goto out_unlock;
+	}
+
+	if (dev->power.runtime_status & RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * A suspend is running in parallel with us.  Wait for it to
+		 * complete and repeat.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		goto repeat;
+	} else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
+	    && dev->parent->power.runtime_status != RPM_ACTIVE) {
+		spin_unlock(&dev->power.lock);
+		spin_unlock(&dev->parent->power.lock);
+
+		/* The device's parent is not active.  Resume it and repeat. */
+		error = pm_runtime_resume(dev->parent);
+		if (error)
+			return error;
+
+		goto repeat;
+	}
+
+	if (dev->power.runtime_status == RPM_RESUMING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * There's another resume running in parallel with us. Wait for
+		 * it to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	dev->power.runtime_status = RPM_RESUMING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_resume)
+		error = dev->bus->pm->runtime_resume(dev);
+
+	spin_lock(&dev->power.lock);
+
+	switch (error) {
+	case 0:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_SUSPENDED;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete(&dev->power.work_done);
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+
+ out_unlock:
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+	goto out;
+}
+EXPORT_SYMBOL_GPL(pm_runtime_resume);
+
+/**
+ * pm_runtime_resume_work - Run pm_runtime_resume() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the resume has been scheduled for and run
+ * pm_runtime_resume() for it.
+ */
+static void pm_runtime_resume_work(struct work_struct *work)
+{
+	pm_runtime_resume(pm_work_to_device(work));
+}
+
+/**
+ * pm_request_resume - Schedule run-time resume of given device.
+ * @dev: Device to resume.
+ */
+void pm_request_resume(struct device *dev)
+{
+	unsigned long parent_flags = 0, flags;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	if (dev->parent)
+		spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		/* Autosuspend request is pending, no need to resume. */
+		pm_cancel_suspend(dev);
+		goto out;
+	} else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
+		goto out;
+	} else if (dev->parent
+	    && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+
+		/* We have to resume the parent first. */
+		pm_request_resume(dev->parent);
+
+		goto repeat;
+	}
+
+	/*
+	 * The device may be suspending at the moment and we can't clear the
+	 * RPM_SUSPENDING bit in its runtime_status just yet.
+	 */
+	dev->power.runtime_status |= RPM_WAKE;
+	INIT_WORK(&dev->power.runtime_work.work, pm_runtime_resume_work);
+	queue_work(pm_wq, &dev->power.runtime_work.work);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (dev->parent)
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_resume);
+
+/**
+ * pm_cancel_runtime_suspend - Cancel a pending suspend request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_suspend(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		cancel_delayed_work(&dev->power.runtime_work);
+		pm_runtime_reset(dev);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_suspend);
+
+/**
+ * pm_cancel_runtime_resume - Cancel a pending resume request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_resume(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status & RPM_WAKE) {
+		work_clear_pending(&dev->power.runtime_work.work);
+		pm_runtime_reset(dev);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_resume);
+
+/**
+ * pm_runtime_disable - Disable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Increase the depth field in the device's dev_pm_info structure, which will
+ * cause the run-time PM functions above to return without doing anything.
+ * If there is a run-time PM operation in progress, wait for it to complete.
+ */
+void pm_runtime_disable(struct device *dev)
+{
+	might_sleep();
+
+	atomic_inc(&dev->power.depth);
+
+	if (dev->power.runtime_status & RPM_IN_PROGRESS)
+		wait_for_completion(&dev->power.work_done);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_disable);
+
+/**
+ * pm_runtime_enable - Enable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Enable run-time power management for given device by decreasing the depth
+ * field in its dev_pm_info structure.
+ */
+void pm_runtime_enable(struct device *dev)
+{
+	if (!atomic_add_unless(&dev->power.depth, -1, 0))
+		dev_warn(dev, "PM: Excessive pm_runtime_enable()!\n");
+}
+EXPORT_SYMBOL_GPL(pm_runtime_enable);
+
+/**
+ * pm_runtime_init - Initialize run-time PM fields in given device object.
+ * @dev: Device object to handle.
+ */
+void pm_runtime_init(struct device *dev)
+{
+	pm_runtime_reset(dev);
+	spin_lock_init(&dev->power.lock);
+	atomic_set(&dev->power.depth, 1);
+	pm_suspend_check_children(dev, true);
+}
Index: linux-2.6/include/linux/pm_runtime.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pm_runtime.h
@@ -0,0 +1,63 @@
+/*
+ * pm_runtime.h - Device run-time power management helper functions.
+ *
+ * Copyright (C) 2009 Rafael J. Wysocki <rjw@sisk.pl>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_RUNTIME_H
+#define _LINUX_PM_RUNTIME_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+#ifdef CONFIG_PM_RUNTIME
+
+extern struct workqueue_struct *pm_wq;
+
+extern void pm_runtime_init(struct device *dev);
+extern int pm_runtime_suspend(struct device *dev);
+extern void pm_request_suspend(struct device *dev, unsigned long delay);
+extern int pm_runtime_resume(struct device *dev);
+extern void pm_request_resume(struct device *dev);
+extern void pm_cancel_runtime_suspend(struct device *dev);
+extern void pm_cancel_runtime_resume(struct device *dev);
+extern void pm_runtime_disable(struct device *dev);
+extern void pm_runtime_enable(struct device *dev);
+
+static inline struct device *pm_work_to_device(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(dw, struct dev_pm_info, runtime_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+	dev->power.suspend_skip_children = !enable;
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+static inline void pm_runtime_init(struct device *dev) {}
+static inline int pm_runtime_suspend(struct device *dev) { return -ENOSYS; }
+static inline void pm_request_suspend(struct device *dev, unsigned long delay)
+{
+}
+static inline int pm_runtime_resume(struct device *dev) { return -ENOSYS; }
+static inline void pm_request_resume(struct device *dev) {}
+static inline void pm_cancel_runtime_suspend(struct device *dev) {}
+static inline void pm_cancel_runtime_resume(struct device *dev) {}
+static inline void pm_runtime_disable(struct device *dev) {}
+static inline void pm_runtime_enable(struct device *dev) {}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+}
+
+#endif /* !CONFIG_PM_RUNTIME */
+
+#endif
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -21,6 +21,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
@@ -88,6 +89,7 @@ void device_pm_add(struct device *dev)
 	}
 
 	list_add_tail(&dev->power.entry, &dpm_list);
+	pm_runtime_init(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -507,6 +509,7 @@ static void dpm_complete(pm_message_t st
 		get_device(dev);
 		if (dev->power.status > DPM_ON) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			mutex_unlock(&dpm_list_mtx);
 
 			device_complete(dev, state);
@@ -753,6 +756,7 @@ static int dpm_prepare(pm_message_t stat
 
 		get_device(dev);
 		dev->power.status = DPM_PREPARING;
+		pm_runtime_disable(dev);
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_prepare(dev, state);
@@ -760,6 +764,7 @@ static int dpm_prepare(pm_message_t stat
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			if (error == -EAGAIN) {
 				put_device(dev);
 				continue;
Index: linux-2.6/drivers/base/dd.c
===================================================================
--- linux-2.6.orig/drivers/base/dd.c
+++ linux-2.6/drivers/base/dd.c
@@ -23,6 +23,7 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
+#include <linux/pm_runtime.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -202,8 +203,12 @@ int driver_probe_device(struct device_dr
 	pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
 		 drv->bus->name, __func__, dev_name(dev), drv->name);
 
+	pm_runtime_disable(dev);
+
 	ret = really_probe(dev, drv);
 
+	pm_runtime_enable(dev);
+
 	return ret;
 }
 
@@ -306,6 +311,8 @@ static void __device_release_driver(stru
 
 	drv = dev->driver;
 	if (drv) {
+		pm_runtime_disable(dev);
+
 		driver_sysfs_remove(dev);
 
 		if (dev->bus)
@@ -320,6 +327,8 @@ static void __device_release_driver(stru
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
+
+		pm_runtime_enable(dev);
 	}
 }
 
Index: linux-2.6/Documentation/power/runtime_pm.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/power/runtime_pm.txt
@@ -0,0 +1,251 @@
+Run-time Power Management Framework for I/O Devices
+
+(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+1. Introduction
+
+The support for run-time power management (run-time PM) of I/O devices is
+provided at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queuing all work items related to run-time PM, because this allows
+  them to be synchronized with system-wide power transitions.  pm_wq is declared
+  in include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of run-time PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing run-time PM operations with one another.
+
+* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out run-time PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The device run-time PM fields defined in 'struct dev_pm_info', the helper
+functions and the run-time PM callbacks defined in 'struct dev_pm_ops' are
+described in what follows.
+
+2. Run-time PM Helper Functions and Device Fields
+
+The following helper functions are defined in drivers/base/power/runtime.c
+and include/linux/pm_runtime.h:
+
+* void pm_runtime_init(struct device *dev);
+* void pm_runtime_enable(struct device *dev);
+* void pm_runtime_disable(struct device *dev);
+* int pm_runtime_suspend(struct device *dev);
+* void pm_request_suspend(struct device *dev, unsigned long delay);
+* int pm_runtime_resume(struct device *dev);
+* void pm_request_resume(struct device *dev);
+* void pm_cancel_runtime_suspend(struct device *dev);
+* void pm_cancel_runtime_resume(struct device *dev);
+* void pm_suspend_check_children(struct device *dev, bool enable);
+
+pm_runtime_init() initializes the run-time PM fields in the 'power' member of
+the device object.  It is called during the initialization of the device object,
+in drivers/base/power/main.c:device_pm_add().
+
+pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
+respectively, pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume().  They do it by decreasing and increasing, respectively,
+the 'power.depth' field of 'struct device'.  If the value of this field is
+greater than 0, pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume() return immediately without doing anything and -EBUSY is
+returned by pm_runtime_suspend() and pm_runtime_resume().  Therefore, if
+pm_runtime_disable() is called several times in a row for the same device, it
+has to be balanced by the appropriate number of pm_runtime_enable() calls so
+that the other run-time PM functions can be used for that device.  The initial
+value of 'power.depth', as set by pm_runtime_init(), is 1.
+
+pm_runtime_disable() and pm_runtime_enable() are used by the device core to
+disable the run-time PM of the device temporarily during device probe and
+removal as well as during system-wide power transitions (i.e. system-wide
+suspend or hibernation, or resume from a system sleep state).
+
+pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume() use the 'power.runtime_status' and
+'power.suspend_aborted' fields of 'struct device' for mutual synchronization.
+These fields are initialized by pm_runtime_init() and set to RPM_ACTIVE and
+'false', respectively.
+
+pm_request_suspend() is used to queue up a suspend request for an active device.
+If the run-time PM status of the device (i.e. the value of the
+'power.runtime_status' field in 'struct device') is different from RPM_ACTIVE,
+it returns immediately.  Otherwise, it changes the device's run-time PM status
+to RPM_IDLE and puts a request to execute pm_runtime_suspend() into pm_wq.  The
+'delay' argument specifies the time, in jiffies, to wait before the request is
+executed.
+
+pm_runtime_suspend() is used to carry out a run-time suspend of an active
+device.  It is called either by the PM core, to complete a request queued up by
+pm_request_suspend(), or directly by a bus type or device driver.
+* It returns immediately if the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field ('power.runtime_status').
+* It returns -EAGAIN if at least one of the RPM_WAKE and RPM_RESUMING bits is
+  set in the device's run-time PM status field.
+* If the device's run-time PM status is RPM_IDLE and the
+  'power.suspend_aborted' flag is set for it, the device's run-time PM status is
+  set to RPM_ACTIVE and the function returns success.
+* If the device's children are not suspended and the
+  'power.suspend_skip_children' flag is not set for it, -EAGAIN is returned.
+* If the device's run-time PM status is RPM_SUSPENDING, which means that another
+  instance of pm_runtime_suspend() is running at the same time for the same
+  device, the function waits for the other instance to complete and returns the
+  error code (or success) returned by it.
+If none of the above takes place, the device's run-time PM status is set to
+RPM_SUSPENDING and the device bus type's ->runtime_suspend() callback is
+executed, which is responsible for handling the device as appropriate (for
+example, it may choose to execute the device driver's ->runtime_suspend()
+callback or to carry out any other suitable action depending on the bus type).
+Next:
+* If it completes successfully, the RPM_SUSPENDED bit is set and the
+  RPM_SUSPENDING bit is cleared in the device's run-time PM status field.  Once
+  that has happened, the device is regarded by the PM core as suspended, but it
+  need not mean that the device has been put into a low power state.  What
+  really occurs to the device at this point totally depends on its bus type (it
+  may depend on the device's driver if the bus type chooses to call it).
+  Additionally, if the device bus type's ->runtime_suspend() callback completes
+  successfully, the device bus type's ->runtime_idle() callback is executed for
+  the device's parent if there is one and if all of its children are suspended
+  (or the 'power.suspend_skip_children' flag is set for it).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_ACTIVE.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to run pm_runtime_suspend(),
+  pm_request_suspend(), pm_runtime_resume(), and pm_request_resume() until the
+  status is changed to either RPM_ACTIVE or RPM_SUSPENDED by the device's bus
+  type or driver.
+Finally, pm_runtime_suspend() returns the error code (or success) returned by
+the device bus type's ->runtime_suspend() callback.
+
+pm_request_resume() is used to queue up a resume request for a device that is
+suspended, suspending or has a suspend request pending.
+* If a suspend request is pending for the device (i.e. the device's run-time PM
+  status is RPM_IDLE), it is cancelled, the 'power.suspend_aborted' flag is set
+  for the device and the function returns.
+* If the device is not suspended or suspending (i.e. none of the RPM_SUSPENDED
+  and RPM_SUSPENDING bits is set in the device's run-time PM status field), the
+  function returns.
+* If the device's parent is inactive, a resume request is scheduled for the
+  parent and the function is restarted.
+If none of the above happens, the RPM_WAKE bit is set in the device's run-time
+PM status field and the request to execute pm_runtime_resume() is put into
+pm_wq.
+
+pm_runtime_resume() is used to carry out a run-time resume of a device that is
+suspended, suspending or has a suspend request pending.  It is called either by
+the PM core, to complete a request queued up by pm_request_resume(), or
+directly by a bus type or device driver.
+* It returns immediately if the device's run-time PM status is RPM_ACTIVE.
+* If there's a suspend request pending for the device (i.e. the device's
+  run-time PM status is RPM_IDLE), it is cancelled, the 'power.suspend_aborted'
+  flag is set for the device and the function returns success.
+* If the device is suspending (i.e. the RPM_SUSPENDING bit is set in the
+  device's run-time PM status field), the function waits for the suspend
+  operation to complete and restarts itself.
+* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field), the device's parent exists and is not active (i.e.
+  the parent's run-time PM status is not RPM_ACTIVE), pm_runtime_resume() is
+  called (recursively) for the parent and the function is restarted.
+* If the device is resuming (i.e. the device's run-time PM status is
+  RPM_RESUMING), which means that another instance of pm_runtime_resume() is
+  running at the same time for the same device, the function waits for the other
+  instance to complete and returns the result returned by it.
+If none of the above happens, the device's run-time PM status is set to
+RPM_RESUMING and the device bus type's ->runtime_resume() callback is executed,
+which is responsible for handling the device as appropriate (for example, it may
+choose to execute the device driver's ->runtime_resume() callback or to carry
+out any other suitable action depending on the bus type).  Next:
+* If it completes successfully, the device's run-time PM status is set to
+  RPM_ACTIVE, which means that the device is fully operational.  Thus, the
+  device bus type's ->runtime_resume() callback, when it is about to return
+  success, _must_ _ensure_ that this really is the case (i.e. when it returns,
+  the device _must_ be able to complete I/O operations as needed).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_SUSPENDED.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to run pm_runtime_suspend(),
+  pm_request_suspend(), pm_runtime_resume(), and pm_request_resume() until the
+  status is changed to either RPM_ACTIVE or RPM_SUSPENDED by the device's bus
+  type or driver.
+Finally, pm_runtime_resume() returns the error code (or success) returned by
+the device bus type's ->runtime_resume() callback.
+
+pm_cancel_runtime_suspend() is used to cancel a pending suspend request for an
+active device, but it can only be called when the run-time PM of the device
+is disabled.  It is supposed to be used during system-wide power transitions.
+
+pm_cancel_runtime_resume() is used to cancel a pending resume request for
+a suspended device.  It can only be called when the run-time PM of the device
+is disabled and it is supposed to be used during system-wide power transitions.
+
+pm_suspend_check_children() is used to set or unset the
+'power.suspend_skip_children' flag in 'struct device'.  If the 'enable'
+argument is 'true', the field is set to 0, and if 'enable' is 'false', the field
+is set to 1.  The default value of 'power.suspend_skip_children', as set by
+pm_runtime_init(), is 0.
+
+3. Device Run-time PM Callbacks
+
+There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend() callback is executed by pm_runtime_suspend() for the bus
+type of the device being suspended.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_suspend() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the bus type's ->runtime_suspend() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_suspend() callback has returned successfully,
+  the PM core regards the device as suspended, which need not mean that the
+  device has been put into a low power state.  It is supposed to mean, however,
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_ACTIVE, which means that the device
+  _must_ be fully operational once this has happened.
+* If the bus type's ->runtime_suspend() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 1 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_resume() callback is executed by pm_runtime_resume() for the bus
+type of the device being woken up.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_resume() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the bus type's ->runtime_resume() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_resume() callback has returned successfully,
+  the PM core regards the device as fully operational, which means that the
+  device _must_ be able to complete I/O operations as needed.
+* If the bus type's ->runtime_resume() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_SUSPENDED, which is supposed to mean
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_resume() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 1 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_idle() callback is executed by pm_runtime_suspend() for the bus
+type of a device the children of which are all suspended (or which has the
+'power.suspend_skip_children' flag set).  The action carried out by this
+callback is totally dependent on the bus type in question, but the expected
+action is to check if the device can be suspended (i.e. if all of the conditions
+necessary for suspending the device are met) and to queue up a suspend request
+for the device if that is the case.
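
[Illustration only, not part of the patch.]  The return-value handling
described above for ->runtime_suspend() and ->runtime_resume() can be modelled
in plain userspace C roughly as follows.  This is a sketch under assumptions:
model_status_after_suspend() and model_status_after_resume() are hypothetical
names that do not exist in the patch, the status is kept in a plain int rather
than in 'power.runtime_status', -1 is used as the error marker, and locking,
completions and the parent notification are left out.

```c
#include <errno.h>

/* Status bits as defined for include/linux/pm.h by the patch. */
#define RPM_ACTIVE	0
#define RPM_IDLE	0x01
#define RPM_SUSPENDING	0x02
#define RPM_SUSPENDED	0x04
#define RPM_WAKE	0x08
#define RPM_RESUMING	0x10
/* Assumption of this model: a plain int status with -1 as the error marker. */
#define RPM_ERROR	(-1)

/*
 * Hypothetical helper: status of the device after the bus type's
 * ->runtime_suspend() has returned 'error'.  'status' is the value seen when
 * the callback returns; RPM_WAKE may have been set in the meantime by a
 * resume request and is preserved on success.
 */
static int model_status_after_suspend(int status, int error)
{
	status &= ~RPM_SUSPENDING;

	switch (error) {
	case 0:
		return status | RPM_SUSPENDED;
	case -EAGAIN:
	case -EBUSY:
		return RPM_ACTIVE;	/* device must stay fully operational */
	default:
		return RPM_ERROR;	/* bus type or driver has to fix it up */
	}
}

/*
 * Hypothetical helper: status of the device after the bus type's
 * ->runtime_resume() has returned 'error'.
 */
static int model_status_after_resume(int error)
{
	switch (error) {
	case 0:
		return RPM_ACTIVE;
	case -EAGAIN:
	case -EBUSY:
		return RPM_SUSPENDED;	/* device is still regarded as suspended */
	default:
		return RPM_ERROR;
	}
}
```

The point of the model is that only -EBUSY and -EAGAIN are treated as
"try again later"; any other error code leaves the device in a state the PM
core will not touch until the bus type or driver resets the status.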


* [patch update] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-14  9:58 ` [linux-pm] " Rafael J. Wysocki
  2009-06-14 22:57   ` [patch update] " Rafael J. Wysocki
@ 2009-06-14 22:57   ` Rafael J. Wysocki
  1 sibling, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-14 22:57 UTC (permalink / raw)
  To: Alan Stern, Oliver Neukum, Magnus Damm
  Cc: ACPI Devel Maling List, linux-pm, Greg KH, Ingo Molnar, LKML

On Sunday 14 June 2009, Rafael J. Wysocki wrote:
> On Sunday 14 June 2009, Rafael J. Wysocki wrote:
> > Hi,
> > 
> > Below is the current version of my "run-time PM for I/O devices" patch.
> > 
> > I've done my best to address the comments received during the recent
> > discussions, but at the same time I've tried to make the patch only contain
> > the most essential things.  For this reason, for example, the sysfs interface
> > is not there and it's going to be added in a separate patch.
> > 
> > Please let me know if you want me to change anything in this patch or to add
> > anything new to it.  [Magnus, I remember you wanted something like
> > ->runtime_wakeup() along with ->runtime_idle(), but I'm not sure it's really
> > necessary.  Please let me know if you have any particular usage scenario for
> > it.]

Appended is an update of the patch addressing today's comments from Magnus.

Best,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>

Introduce a core framework for run-time power management of I/O
devices.  Add device run-time PM fields to 'struct dev_pm_info'
and device run-time PM callbacks to 'struct dev_pm_ops'.  Introduce
a run-time PM workqueue and define some device run-time PM helper
functions at the core level.  Document all these things.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/runtime_pm.txt |  251 +++++++++++++++++++
 drivers/base/dd.c                  |    9 
 drivers/base/power/Makefile        |    1 
 drivers/base/power/main.c          |    5 
 drivers/base/power/runtime.c       |  466 +++++++++++++++++++++++++++++++++++++
 include/linux/pm.h                 |   96 +++++++
 include/linux/pm_runtime.h         |   63 +++++
 kernel/power/Kconfig               |   14 +
 kernel/power/main.c                |   17 +
 9 files changed, 919 insertions(+), 3 deletions(-)
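
[Illustration only, not part of the patch.]  The documentation added by this
patch describes pm_runtime_disable() and pm_runtime_enable() as a nesting
counter ('power.depth') that starts at 1 and must reach 0 before the other
run-time PM helpers do anything.  A minimal userspace sketch of that rule is
shown here.  It assumes a plain int instead of the atomic_t used by the patch,
and struct model_dev and the model_*() functions are hypothetical names
introduced only for this example.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the 'power.depth' field of 'struct device'. */
struct model_dev {
	int depth;	/* > 0 means run-time PM is disabled */
};

/* Mirrors pm_runtime_init(): run-time PM starts out disabled. */
static void model_init(struct model_dev *d)
{
	d->depth = 1;
}

/* Mirrors pm_runtime_disable(): every call adds one nesting level. */
static void model_disable(struct model_dev *d)
{
	d->depth++;
}

/*
 * Mirrors pm_runtime_enable(): drop one nesting level unless the counter is
 * already 0.  Returns false for an unbalanced call, which is where the patch
 * prints its "Excessive pm_runtime_enable()" warning.
 */
static bool model_enable(struct model_dev *d)
{
	if (d->depth == 0)
		return false;

	d->depth--;
	return true;
}

/* The suspend/resume helpers only act while the counter is 0. */
static bool model_runtime_pm_allowed(const struct model_dev *d)
{
	return d->depth == 0;
}
```

With this model, two model_disable() calls in a row need two model_enable()
calls before model_runtime_pm_allowed() returns true again, which is the
balancing requirement stated in the documentation.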

Index: linux-2.6/kernel/power/Kconfig
===================================================================
--- linux-2.6.orig/kernel/power/Kconfig
+++ linux-2.6/kernel/power/Kconfig
@@ -208,3 +208,17 @@ config APM_EMULATION
 	  random kernel OOPSes or reboots that don't seem to be related to
 	  anything, try disabling/enabling this option (or disabling/enabling
 	  APM in your BIOS).
+
+config PM_RUNTIME
+	bool "Run-time PM core functionality"
+	depends on PM
+	---help---
+	  Enable functionality allowing I/O devices to be put into energy-saving
+	  (low power) states at run time (or autosuspended) after a specified
+	  period of inactivity and woken up in response to a hardware-generated
+	  wake-up event or a driver's request.
+
+	  Hardware support is generally required for this functionality to work
+	  and the bus type drivers of the buses the devices are on are
+	  responsible for the actual handling of the autosuspend requests and
+	  wake-up events.
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -11,6 +11,7 @@
 #include <linux/kobject.h>
 #include <linux/string.h>
 #include <linux/resume-trace.h>
+#include <linux/workqueue.h>
 
 #include "power.h"
 
@@ -217,8 +218,24 @@ static struct attribute_group attr_group
 	.attrs = g,
 };
 
+#ifdef CONFIG_PM_RUNTIME
+struct workqueue_struct *pm_wq;
+
+static int __init pm_start_workqueue(void)
+{
+	pm_wq = create_freezeable_workqueue("pm");
+
+	return pm_wq ? 0 : -ENOMEM;
+}
+#else
+static inline int pm_start_workqueue(void) { return 0; }
+#endif
+
 static int __init pm_init(void)
 {
+	int error = pm_start_workqueue();
+	if (error)
+		return error;
 	power_kobj = kobject_create_and_add("power", NULL);
 	if (!power_kobj)
 		return -ENOMEM;
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -22,6 +22,9 @@
 #define _LINUX_PM_H
 
 #include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -165,6 +168,26 @@ typedef struct pm_message {
  * It is allowed to unregister devices while the above callbacks are being
  * executed.  However, it is not allowed to unregister a device from within any
  * of its own callbacks.
+ *
+ * There also are the following callbacks related to run-time power management
+ * of devices:
+ *
+ * @runtime_suspend: Prepare the device for a condition in which it won't be
+ *	able to communicate with the CPU(s) and RAM due to power management.
+ *	This need not mean that the device should be put into a low power state,
+ *	like for example when the device is behind a link, represented by a
+ *	separate device object, that is going to be turned off for power
+ *	management purposes.
+ *
+ * @runtime_resume: Put the device into the fully active state in response to a
+ *	wake-up event generated by hardware or at a request of software.  If
+ *	necessary, put the device into the full power state and restore its
+ *	registers, so that it is fully operational.
+ *
+ * @runtime_idle: Device appears to be inactive and it might be put into a low
+ *	power state if all of the necessary conditions are satisfied.  Check
+ *	these conditions and handle the device as appropriate, possibly queueing
+ *	a suspend request for it.
  */
 
 struct dev_pm_ops {
@@ -182,6 +205,9 @@ struct dev_pm_ops {
 	int (*thaw_noirq)(struct device *dev);
 	int (*poweroff_noirq)(struct device *dev);
 	int (*restore_noirq)(struct device *dev);
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
 };
 
 /**
@@ -315,14 +341,78 @@ enum dpm_state {
 	DPM_OFF_IRQ,
 };
 
+/**
+ * Device run-time power management state.
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.  They do not
+ * reflect the actual power state of the device or its status as seen by the
+ * driver.
+ *
+ * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
+ *			pending for it.
+ *
+ * RPM_IDLE		It has been requested that the device be suspended.
+ *			Suspend request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
+ *			executed.
+ *
+ * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
+ *			completed successfully.  The device is regarded as
+ *			suspended.
+ *
+ * RPM_WAKE		It has been requested that the device be woken up.
+ *			Resume request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
+ *			executed.
+ *
+ * RPM_ERROR		Represents a condition from which the PM core cannot
+ *			recover by itself.  If the device's run-time PM status
+ *			field has this value, all of the run-time PM operations
+ *			carried out for the device by the core will fail, until
+ *			the status field is changed to either RPM_ACTIVE or
+ *			RPM_SUSPENDED (it is not valid to use the other values
+ *			in such a situation) by the device's driver or bus type.
+ *			This happens when the device bus type's
+ *			->runtime_suspend() or ->runtime_resume() callback
+ *			returns an error code different from -EAGAIN or -EBUSY.
+ */
+
+#define RPM_ACTIVE	0
+#define RPM_IDLE	0x01
+#define RPM_SUSPENDING	0x02
+#define RPM_SUSPENDED	0x04
+#define RPM_WAKE	0x08
+#define RPM_RESUMING	0x10
+#define RPM_ERROR	0x1f	/* -1 never compares equal to a 5-bit unsigned field */
+
+#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
+#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)
+#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING)
+#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)
+
 struct dev_pm_info {
 	pm_message_t		power_state;
-	unsigned		can_wakeup:1;
-	unsigned		should_wakeup:1;
+	unsigned int		can_wakeup:1;
+	unsigned int		should_wakeup:1;
 	enum dpm_state		status;		/* Owned by the PM core */
-#ifdef	CONFIG_PM_SLEEP
+#ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
 #endif
+#ifdef CONFIG_PM_RUNTIME
+	struct delayed_work	runtime_work;
+	struct completion	work_done;
+	unsigned int		suspend_skip_children:1;
+	unsigned int		suspend_aborted:1;
+	unsigned int		runtime_status:5;
+	int			runtime_error;
+	atomic_t		depth;
+	spinlock_t		lock;
+#endif
 };
 
 /*
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o
+obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
Index: linux-2.6/drivers/base/power/runtime.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/runtime.c
@@ -0,0 +1,466 @@
+/*
+ * drivers/base/power/runtime.c - Helper functions for device run-time PM
+ *
+ * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/pm_runtime.h>
+
+/**
+ * pm_runtime_reset - Clear all of the device run-time PM flags.
+ * @dev: Device object to clear the flags for.
+ */
+static void pm_runtime_reset(struct device *dev)
+{
+	dev->power.suspend_aborted = false;
+	dev->power.runtime_status = RPM_ACTIVE;
+}
+
+/**
+ * pm_device_suspended - Check if given device has been suspended at run time.
+ * @dev: Device to check.
+ * @data: Ignored.
+ *
+ * Returns 0 if the device has been suspended and it hasn't been requested to
+ * resume or -EBUSY otherwise.
+ */
+static int pm_device_suspended(struct device *dev, void *data)
+{
+	return dev->power.runtime_status == RPM_SUSPENDED ? 0 : -EBUSY;
+}
+
+/**
+ * pm_check_children - Check if all children of a device have been suspended.
+ * @dev: Device to check.
+ *
+ * Returns 0 if all children of the device have been suspended or -EBUSY
+ * otherwise.
+ */
+static int pm_check_children(struct device *dev)
+{
+	return dev->power.suspend_skip_children ? 0 :
+			device_for_each_child(dev, NULL, pm_device_suspended);
+}
+
+/**
+ * pm_runtime_notify_idle - Run a device bus type's runtime_idle() callback.
+ * @dev: Device to notify.
+ *
+ * Check if all children of given device are suspended and call the device bus
+ * type's ->runtime_idle() callback if that's the case.
+ */
+static void pm_runtime_notify_idle(struct device *dev)
+{
+	if (atomic_read(&dev->power.depth) > 0 || pm_check_children(dev))
+		return;
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_idle)
+		dev->bus->pm->runtime_idle(dev);
+}
+
+/**
+ * pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
+ * @dev: Device to suspend.
+ *
+ * Check if the status of the device is appropriate and run the
+ * ->runtime_suspend() callback provided by the device's bus type driver.
+ * Update the run-time PM flags in the device object to reflect the current
+ * status of the device.
+ */
+int pm_runtime_suspend(struct device *dev)
+{
+	int error = -EINVAL;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out;
+	} else if (dev->power.runtime_status & RPM_SUSPENDED) {
+		error = 0;
+		goto out;
+	} else if (dev->power.runtime_status & RPM_NO_SUSPEND) {
+		/* Device is resuming or there's a resume request pending. */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_IDLE
+	    && dev->power.suspend_aborted) {
+		dev->power.suspend_aborted = false;
+		dev->power.runtime_status = RPM_ACTIVE;
+		error = 0;
+		goto out;
+	} else if (pm_check_children(dev)) {
+		/*
+		 * We can only suspend the device if all of its children have
+		 * been suspended.
+		 */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+
+		/*
+		 * Another suspend is running in parallel with us.  Wait for it
+		 * to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	dev->power.runtime_status = RPM_SUSPENDING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_suspend)
+		error = dev->bus->pm->runtime_suspend(dev);
+
+	spin_lock(&dev->power.lock);
+
+	/*
+	 * Resume request might have been queued in the meantime, in which case
+	 * the RPM_WAKE bit is also set in runtime_status.
+	 */
+	dev->power.runtime_status &= ~RPM_SUSPENDING;
+	switch (error) {
+	case 0:
+		dev->power.runtime_status |= RPM_SUSPENDED;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete(&dev->power.work_done);
+
+	if (!error && !(dev->power.runtime_status & RPM_WAKE) && dev->parent) {
+		spin_unlock(&dev->power.lock);
+
+		pm_runtime_notify_idle(dev->parent);
+
+		return 0;
+	}
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(pm_runtime_suspend);
+
+/**
+ * pm_runtime_suspend_work - Run pm_runtime_suspend() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the suspend has been scheduled for and
+ * run pm_runtime_suspend() for it.
+ */
+static void pm_runtime_suspend_work(struct work_struct *work)
+{
+	pm_runtime_suspend(pm_work_to_device(work));
+}
+
+/**
+ * pm_request_suspend - Schedule run-time suspend of given device.
+ * @dev: Device to suspend.
+ * @delay: Time, in jiffies, to wait before attempting to suspend the device.
+ */
+void pm_request_suspend(struct device *dev, unsigned long delay)
+{
+	unsigned long flags;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ACTIVE)
+		goto out;
+
+	dev->power.runtime_status = RPM_IDLE;
+	dev->power.suspend_aborted = false;
+	INIT_DELAYED_WORK(&dev->power.runtime_work, pm_runtime_suspend_work);
+	queue_delayed_work(pm_wq, &dev->power.runtime_work, delay);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_suspend);
+
+/**
+ * pm_cancel_suspend - Cancel a pending suspend request for given device.
+ * @dev: Device to cancel the suspend request for.
+ */
+static void pm_cancel_suspend(struct device *dev)
+{
+	dev->power.suspend_aborted = true;
+	cancel_delayed_work(&dev->power.runtime_work);
+	dev->power.runtime_status = RPM_ACTIVE;
+}
+
+/**
+ * pm_runtime_resume - Run a device bus type's runtime_resume() callback.
+ * @dev: Device to resume.
+ *
+ * Check if the device is really suspended and run the ->runtime_resume()
+ * callback provided by the device's bus type driver.  Update the run-time PM
+ * flags in the device object to reflect the current status of the device.  If
+ * runtime suspend is in progress while this function is being run, wait for it
+ * to finish before resuming the device.  If runtime suspend is scheduled, but
+ * it hasn't started yet, cancel it and we're done.
+ */
+int pm_runtime_resume(struct device *dev)
+{
+	int error = -EINVAL;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	if (dev->parent)
+		spin_lock(&dev->parent->power.lock);
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out_unlock;
+	} else if (dev->power.runtime_status == RPM_ACTIVE) {
+		error = 0;
+		goto out_unlock;
+	} else if (dev->power.runtime_status == RPM_IDLE) {
+		/* ->runtime_suspend() hasn't started yet, no need to resume. */
+		pm_cancel_suspend(dev);
+		error = 0;
+		goto out_unlock;
+	}
+
+	if (dev->power.runtime_status & RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * A suspend is running in parallel with us.  Wait for it to
+		 * complete and repeat.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		goto repeat;
+	} else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
+	    && dev->parent->power.runtime_status != RPM_ACTIVE) {
+		spin_unlock(&dev->power.lock);
+		spin_unlock(&dev->parent->power.lock);
+
+		/* The device's parent is not active.  Resume it and repeat. */
+		error = pm_runtime_resume(dev->parent);
+		if (error)
+			return error;
+
+		goto repeat;
+	}
+
+	if (dev->power.runtime_status == RPM_RESUMING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * There's another resume running in parallel with us. Wait for
+		 * it to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	dev->power.runtime_status = RPM_RESUMING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_resume)
+		error = dev->bus->pm->runtime_resume(dev);
+
+	spin_lock(&dev->power.lock);
+
+	switch (error) {
+	case 0:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_SUSPENDED;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete(&dev->power.work_done);
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+
+ out_unlock:
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+	goto out;
+}
+EXPORT_SYMBOL_GPL(pm_runtime_resume);
+
+/**
+ * pm_runtime_resume_work - Run pm_runtime_resume() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the resume has been scheduled for and run
+ * pm_runtime_resume() for it.
+ */
+static void pm_runtime_resume_work(struct work_struct *work)
+{
+	pm_runtime_resume(pm_work_to_device(work));
+}
+
+/**
+ * pm_request_resume - Schedule run-time resume of given device.
+ * @dev: Device to resume.
+ */
+void pm_request_resume(struct device *dev)
+{
+	unsigned long parent_flags = 0, flags;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	if (dev->parent)
+		spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		/* Autosuspend request is pending, no need to resume. */
+		pm_cancel_suspend(dev);
+		goto out;
+	} else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
+		goto out;
+	} else if (dev->parent
+	    && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+
+		/* We have to resume the parent first. */
+		pm_request_resume(dev->parent);
+
+		goto repeat;
+	}
+
+	/*
+	 * The device may be suspending at the moment and we can't clear the
+	 * RPM_SUSPENDING bit in its runtime_status just yet.
+	 */
+	dev->power.runtime_status |= RPM_WAKE;
+	INIT_WORK(&dev->power.runtime_work.work, pm_runtime_resume_work);
+	queue_work(pm_wq, &dev->power.runtime_work.work);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (dev->parent)
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_resume);
+
+/**
+ * pm_cancel_runtime_suspend - Cancel a pending suspend request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_suspend(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		cancel_delayed_work(&dev->power.runtime_work);
+		pm_runtime_reset(dev);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_suspend);
+
+/**
+ * pm_cancel_runtime_resume - Cancel a pending resume request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_resume(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status & RPM_WAKE) {
+		work_clear_pending(&dev->power.runtime_work.work);
+		pm_runtime_reset(dev);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_resume);
+
+/**
+ * pm_runtime_disable - Disable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Increase the depth field in the device's dev_pm_info structure, which will
+ * cause the run-time PM functions above to return without doing anything.
+ * If there is a run-time PM operation in progress, wait for it to complete.
+ */
+void pm_runtime_disable(struct device *dev)
+{
+	might_sleep();
+
+	atomic_inc(&dev->power.depth);
+
+	if (dev->power.runtime_status & RPM_IN_PROGRESS)
+		wait_for_completion(&dev->power.work_done);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_disable);
+
+/**
+ * pm_runtime_enable - Enable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Enable run-time power management for given device by decreasing the depth
+ * field in its dev_pm_info structure.
+ */
+void pm_runtime_enable(struct device *dev)
+{
+	if (!atomic_add_unless(&dev->power.depth, -1, 0))
+		dev_warn(dev, "PM: Excessive pm_runtime_enable()!\n");
+}
+EXPORT_SYMBOL_GPL(pm_runtime_enable);
+
+/**
+ * pm_runtime_init - Initialize run-time PM fields in given device object.
+ * @dev: Device object to handle.
+ */
+void pm_runtime_init(struct device *dev)
+{
+	pm_runtime_reset(dev);
+	spin_lock_init(&dev->power.lock);
+	atomic_set(&dev->power.depth, 1);
+	pm_suspend_check_children(dev, true);
+}
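
The nesting of pm_runtime_disable() and pm_runtime_enable() above can be
modelled in user space.  What follows is a minimal sketch, assuming C11 atomics
as a stand-in for the kernel's atomic_t; 'struct model_dev' and all model_*()
functions are hypothetical names used for illustration, not part of the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the 'depth' field of dev_pm_info. */
struct model_dev {
	atomic_int depth;	/* > 0 means run-time PM is disabled */
};

/* Mirrors pm_runtime_init(): run-time PM starts out disabled (depth 1). */
static void model_init(struct model_dev *dev)
{
	atomic_init(&dev->depth, 1);
}

/* Add 'a' to *v unless *v == u; return true if the addition happened. */
static bool model_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return true;
	}
	return false;
}

/* Mirrors the counter update in pm_runtime_disable() (no waiting here). */
static void model_disable(struct model_dev *dev)
{
	atomic_fetch_add(&dev->depth, 1);
}

/* Returns false for an unbalanced enable, where the patch would warn. */
static bool model_enable(struct model_dev *dev)
{
	return model_add_unless(&dev->depth, -1, 0);
}

static bool model_enabled(struct model_dev *dev)
{
	return atomic_load(&dev->depth) == 0;
}
```

In this model, two model_disable() calls need two model_enable() calls before
model_enabled() reports true again, and a further model_enable() is rejected
instead of driving the counter below zero.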
Index: linux-2.6/include/linux/pm_runtime.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pm_runtime.h
@@ -0,0 +1,63 @@
+/*
+ * pm_runtime.h - Device run-time power management helper functions.
+ *
+ * Copyright (C) 2009 Rafael J. Wysocki <rjw@sisk.pl>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_RUNTIME_H
+#define _LINUX_PM_RUNTIME_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+#ifdef CONFIG_PM_RUNTIME
+
+extern struct workqueue_struct *pm_wq;
+
+extern void pm_runtime_init(struct device *dev);
+extern int pm_runtime_suspend(struct device *dev);
+extern void pm_request_suspend(struct device *dev, unsigned long delay);
+extern int pm_runtime_resume(struct device *dev);
+extern void pm_request_resume(struct device *dev);
+extern void pm_cancel_runtime_suspend(struct device *dev);
+extern void pm_cancel_runtime_resume(struct device *dev);
+extern void pm_runtime_disable(struct device *dev);
+extern void pm_runtime_enable(struct device *dev);
+
+static inline struct device *pm_work_to_device(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(dw, struct dev_pm_info, runtime_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+	dev->power.suspend_skip_children = !enable;
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+static inline void pm_runtime_init(struct device *dev) {}
+static inline int pm_runtime_suspend(struct device *dev) { return -ENOSYS; }
+static inline void pm_request_suspend(struct device *dev, unsigned long delay)
+{
+}
+static inline int pm_runtime_resume(struct device *dev) { return -ENOSYS; }
+static inline void pm_request_resume(struct device *dev) {}
+static inline void pm_cancel_runtime_suspend(struct device *dev) {}
+static inline void pm_cancel_runtime_resume(struct device *dev) {}
+static inline void pm_runtime_disable(struct device *dev) {}
+static inline void pm_runtime_enable(struct device *dev) {}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+}
+
+#endif /* !CONFIG_PM_RUNTIME */
+
+#endif
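
pm_work_to_device() above walks outward from the embedded work item to the
enclosing device in two container_of() steps.  The sketch below reproduces that
walk in user space under simplifying assumptions: the model_* structures are
hypothetical, heavily reduced layouts, and model_container_of() is a plain
offsetof()-based macro without the type checking of the kernel's container_of().

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of(): no type checking, unlike the kernel macro. */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, reduced stand-ins for the kernel structures. */
struct model_work {
	int pending;
};

struct model_delayed_work {
	struct model_work work;
	unsigned long delay;
};

struct model_pm_info {
	int runtime_status;
	struct model_delayed_work runtime_work;
};

struct model_device {
	const char *name;
	struct model_pm_info power;
};

/* Work item -> delayed work -> PM info -> device, as in the patch. */
static struct model_device *model_work_to_device(struct model_work *work)
{
	struct model_delayed_work *dw;
	struct model_pm_info *dpi;

	dw = model_container_of(work, struct model_delayed_work, work);
	dpi = model_container_of(dw, struct model_pm_info, runtime_work);
	return model_container_of(dpi, struct model_device, power);
}
```

Passing the address of dev.power.runtime_work.work for some 'struct
model_device dev' yields &dev again, which is the property the work functions
rely on to find their device.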
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -21,6 +21,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
@@ -88,6 +89,7 @@ void device_pm_add(struct device *dev)
 	}
 
 	list_add_tail(&dev->power.entry, &dpm_list);
+	pm_runtime_init(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -507,6 +509,7 @@ static void dpm_complete(pm_message_t st
 		get_device(dev);
 		if (dev->power.status > DPM_ON) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			mutex_unlock(&dpm_list_mtx);
 
 			device_complete(dev, state);
@@ -753,6 +756,7 @@ static int dpm_prepare(pm_message_t stat
 
 		get_device(dev);
 		dev->power.status = DPM_PREPARING;
+		pm_runtime_disable(dev);
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_prepare(dev, state);
@@ -760,6 +764,7 @@ static int dpm_prepare(pm_message_t stat
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			if (error == -EAGAIN) {
 				put_device(dev);
 				continue;
Index: linux-2.6/drivers/base/dd.c
===================================================================
--- linux-2.6.orig/drivers/base/dd.c
+++ linux-2.6/drivers/base/dd.c
@@ -23,6 +23,7 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
+#include <linux/pm_runtime.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -202,8 +203,12 @@ int driver_probe_device(struct device_dr
 	pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
 		 drv->bus->name, __func__, dev_name(dev), drv->name);
 
+	pm_runtime_disable(dev);
+
 	ret = really_probe(dev, drv);
 
+	pm_runtime_enable(dev);
+
 	return ret;
 }
 
@@ -306,6 +311,8 @@ static void __device_release_driver(stru
 
 	drv = dev->driver;
 	if (drv) {
+		pm_runtime_disable(dev);
+
 		driver_sysfs_remove(dev);
 
 		if (dev->bus)
@@ -320,6 +327,8 @@ static void __device_release_driver(stru
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
+
+		pm_runtime_enable(dev);
 	}
 }
 
Index: linux-2.6/Documentation/power/runtime_pm.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/power/runtime_pm.txt
@@ -0,0 +1,251 @@
+Run-time Power Management Framework for I/O Devices
+
+(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+1. Introduction
+
+The support for run-time power management (run-time PM) of I/O devices is
+provided at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queuing all work items related to run-time PM, because this allows
+  them to be synchronized with system-wide power transitions.  pm_wq is declared
+  in include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of run-time PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing run-time PM operations with one another.
+
+* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out run-time PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The device run-time PM fields defined in 'struct dev_pm_info', the helper
+functions and the run-time PM callbacks defined in 'struct dev_pm_ops' are
+described in what follows.
+
+2. Run-time PM Helper Functions and Device Fields
+
+The following helper functions are defined in drivers/base/power/runtime.c
+and include/linux/pm_runtime.h:
+
+* void pm_runtime_init(struct device *dev);
+* void pm_runtime_enable(struct device *dev);
+* void pm_runtime_disable(struct device *dev);
+* int pm_runtime_suspend(struct device *dev);
+* void pm_request_suspend(struct device *dev, unsigned long delay);
+* int pm_runtime_resume(struct device *dev);
+* void pm_request_resume(struct device *dev);
+* void pm_cancel_runtime_suspend(struct device *dev);
+* void pm_cancel_runtime_resume(struct device *dev);
+* void pm_suspend_check_children(struct device *dev, bool enable);
+
+pm_runtime_init() initializes the run-time PM fields in the 'power' member of
+the device object.  It is called during the initialization of the device object,
+in drivers/base/power/main.c:device_pm_add().
+
+pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
+respectively, pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume().  They do it by decreasing and increasing, respectively,
+the 'power.depth' field of 'struct device'.  If the value of this field is
+greater than 0, pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume() return immediately without doing anything and -EBUSY is
+returned by pm_runtime_suspend() and pm_runtime_resume().  Therefore, if
+pm_runtime_disable() is called several times in a row for the same device, it
+has to be balanced by the appropriate number of pm_runtime_enable() calls so
+that the other run-time PM functions can be used for that device.  The initial
+value of 'power.depth', as set by pm_runtime_init(), is 1.
+
+pm_runtime_disable() and pm_runtime_enable() are used by the device core to
+disable the run-time PM of the device temporarily during device probe and
+removal as well as during system-wide power transitions (i.e. system-wide
+suspend or hibernation, or resume from a system sleep state).
+
+pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume() use the 'power.runtime_status' and
+'power.suspend_aborted' fields of 'struct device' for mutual synchronization.
+These fields are initialized by pm_runtime_init() and set to RPM_ACTIVE and
+'false', respectively.
+
+pm_request_suspend() is used to queue up a suspend request for an active device.
+If the run-time PM status of the device (i.e. the value of the
+'power.runtime_status' field in 'struct device') is different from RPM_ACTIVE,
+it returns immediately.  Otherwise, it changes the device's run-time PM status
+to RPM_IDLE and puts a request to execute pm_runtime_suspend() into pm_wq.  The
+'delay' argument is used to specify time to wait before the request will be
+completed, in jiffies.
+
+pm_runtime_suspend() is used to carry out a run-time suspend of an active
+device.  It is called either by the PM core, to complete a request queued up by
+pm_request_suspend(), or directly by a bus type or device driver.
+* It returns immediately if the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field ('power.runtime_status').
+* It returns -EAGAIN if at least one of the RPM_WAKE and RPM_RESUMING bits is
+  set in the device's run-time PM status field.
+* If the device's run-time PM status is RPM_IDLE and 'power.suspend_aborted'
+  flag is set for it, the device's run-time PM status is set to RPM_ACTIVE and
+  the function returns success.
+* If the device's children are not suspended and the
+  'power.suspend_skip_children' flag is not set for it, -EAGAIN is returned.
+* If the device's run-time PM status is RPM_SUSPENDING, which means that another
+  instance of pm_runtime_suspend() is running at the same time for the same
+  device, the function waits for the other instance to complete and returns the
+  error code (or success) returned by it.
+If none of the above takes place, the device's run-time PM status is set to
+RPM_SUSPENDING and the device bus type's ->runtime_suspend() callback is
+executed, which is responsible for handling the device as appropriate (for
+example, it may choose to execute the device driver's ->runtime_suspend()
+callback or to carry out any other suitable action depending on the bus type).
+Next:
+* If it completes successfully, the RPM_SUSPENDED bit is set and the
+  RPM_SUSPENDING bit is cleared in the device's run-time PM status field.  Once
+  that has happened, the device is regarded by the PM core as suspended, but it
+  need not mean that the device has been put into a low power state.  What
+  really occurs to the device at this point totally depends on its bus type (it
+  may depend on the device's driver if the bus type chooses to call it).
+  Additionally, if the device bus type's ->runtime_suspend() callback completes
+  successfully, the device bus type's ->runtime_idle() callback is executed for
+  the device's parent if there is one and if all of its children are suspended
+  (or the 'power.suspend_skip_children' flag is set for it).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_ACTIVE.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to run pm_runtime_suspend(),
+  pm_request_suspend(), pm_runtime_resume(), and pm_request_resume() until the
+  status is changed to either RPM_ACTIVE or RPM_SUSPENDED by the device's bus
+  type or driver.
+Finally, pm_runtime_suspend() returns the error code (or success) returned by
+the device bus type's ->runtime_suspend() callback.
+
+pm_request_resume() is used to queue up a resume request for a device that is
+suspended, suspending or has a suspend request pending.
+* If a suspend request is pending for the device (i.e. the device's run-time PM
+  status is RPM_IDLE), it is cancelled, the 'power.suspend_aborted' flag is set
+  for the device and the function returns.
+* If the device is not suspended or suspending (i.e. none of the RPM_SUSPENDED
+  and RPM_SUSPENDING bits is set in the device's run-time PM status field), the
+  function returns.
+* If the device's parent is inactive, a resume request is scheduled for the
+  parent and the function is restarted.
+If none of the above happens, the RPM_WAKE bit is set in the device's run-time
+PM status field and the request to execute pm_runtime_resume() is put into
+pm_wq.
+
+pm_runtime_resume() is used to carry out a run-time resume of a device that is
+suspended, suspending or has a suspend request pending.  It is called either by
+the PM core, to complete a request queued up by pm_request_resume(), or
+directly by a bus type or device driver.
+* It returns immediately if the device's run-time PM status is RPM_ACTIVE.
+* If there's a suspend request pending for the device (i.e. the device's
+  run-time PM status is RPM_IDLE), it is cancelled, the 'power.suspend_aborted'
+  flag is set for the device and the function returns success.
+* If the device is suspending (i.e. the RPM_SUSPENDING bit is set in the
+  device's run-time PM status field), the function waits for the suspend
+  operation to complete and restarts itself.
+* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field), the device's parent exists and is not active (i.e.
+  the parent's run-time PM status is not RPM_ACTIVE), pm_runtime_resume() is
+  called (recursively) for the parent and the function is restarted.
+* If the device is resuming (i.e. the device's run-time PM status is
+  RPM_RESUMING), which means that another instance of pm_runtime_resume() is
+  running at the same time for the same device, the function waits for the other
+  instance to complete and returns the result returned by it.
+If none of the above happens, the device's run-time PM status is set to
+RPM_RESUMING and the device bus type's ->runtime_resume() callback is executed,
+which is responsible for handling the device as appropriate (for example, it may
+choose to execute the device driver's ->runtime_resume() callback or to carry
+out any other suitable action depending on the bus type).  Next:
+* If it completes successfully, the device's run-time PM status is set to
+  RPM_ACTIVE, which means that the device is fully operational.  Thus, the
+  device bus type's ->runtime_resume() callback, when it is about to return
+  success, _must_ _ensure_ that this really is the case (i.e. when it returns,
+  the device _must_ be able to complete I/O operations as needed).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_SUSPENDED.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to run pm_runtime_suspend(),
+  pm_request_suspend(), pm_runtime_resume(), and pm_request_resume() until the
+  status is changed to either RPM_ACTIVE or RPM_SUSPENDED by the device's bus
+  type or driver.
+Finally, pm_runtime_resume() returns the error code (or success) returned by
+the device bus type's ->runtime_resume() callback.
+
+pm_cancel_runtime_suspend() is used to cancel a pending suspend request for an
+active device, but it can only be called when the run-time PM of the device
+is disabled.  It is supposed to be used during system-wide power transitions.
+
+pm_cancel_runtime_resume() is used to cancel a pending resume request for
+a suspended device.  It can only be called when the run-time PM of the device
+is disabled and it is supposed to be used during system-wide power transitions.
+
+pm_suspend_check_children() is used to set or unset the
+'power.suspend_skip_children' flag in 'struct device'.  If the 'enable'
+argument is 'true', the field is set to 0, and if 'enable' is 'false', the field
+is set to 1.  The default value of 'power.suspend_skip_children', as set by
+pm_runtime_init(), is 0.
+
+3. Device Run-time PM Callbacks
+
+There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend() callback is executed by pm_runtime_suspend() for the bus
+type of the device being suspended.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_suspend() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the bus type's ->runtime_suspend() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_suspend() callback has returned successfully,
+  the PM core regards the device as suspended, which need not mean that the
+  device has been put into a low power state.  It is supposed to mean, however,
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_ACTIVE, which means that the device
+  _must_ be fully operational once this has happened.
+* If the bus type's ->runtime_suspend() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_resume() callback is executed by pm_runtime_resume() for the bus
+type of the device being woken up.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_resume() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the bus type's ->runtime_resume() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_resume() callback has returned successfully,
+  the PM core regards the device as fully operational, which means that the
+  device _must_ be able to complete I/O operations as needed.
+* If the bus type's ->runtime_resume() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_SUSPENDED, which is supposed to mean
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_resume() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_idle() callback is executed by pm_runtime_suspend() for the bus
+type of a device the children of which are all suspended (or which has the
+'power.suspend_skip_children' flag set).  The action carried out by this
+callback is totally dependent on the bus type in question, but the expected
+action is to check if the device can be suspended (i.e. if all of the conditions
+necessary for suspending the device are met) and to queue up a suspend request
+for the device if that is the case.
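
The way the documentation above maps the return value of the bus type's
->runtime_suspend() and ->runtime_resume() callbacks to the resulting run-time
PM status can be summarized in code.  This is only a sketch of the documented
rules, not the implementation from drivers/base/power/runtime.c: the
'enum model_status' values and both functions are hypothetical, and the patch
itself represents the status as RPM_* bit flags.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical status values; the patch defines RPM_* as bit flags. */
enum model_status {
	MODEL_ACTIVE,
	MODEL_SUSPENDED,
	MODEL_ERROR,
};

/* Status after the bus type's ->runtime_suspend() has returned 'ret'. */
static enum model_status model_status_after_suspend(int ret)
{
	if (ret == 0)
		return MODEL_SUSPENDED;
	if (ret == -EBUSY || ret == -EAGAIN)
		return MODEL_ACTIVE;	/* device stays fully operational */
	return MODEL_ERROR;		/* helpers refuse to run from now on */
}

/* Status after the bus type's ->runtime_resume() has returned 'ret'. */
static enum model_status model_status_after_resume(int ret)
{
	if (ret == 0)
		return MODEL_ACTIVE;
	if (ret == -EBUSY || ret == -EAGAIN)
		return MODEL_SUSPENDED;	/* device stays quiescent */
	return MODEL_ERROR;
}
```

The two functions are mirror images: a temporary failure leaves the device in
the state it was in before the call, and any other error parks it in the error
state until the bus type or driver sets the status explicitly.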


* Re: [patch update] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-14 22:57   ` [patch update] " Rafael J. Wysocki
@ 2009-06-14 23:18     ` Arjan van de Ven
  2009-06-15 20:02       ` Rafael J. Wysocki
  2009-06-15 20:02       ` Rafael J. Wysocki
  2009-06-14 23:18     ` Arjan van de Ven
                       ` (4 subsequent siblings)
  5 siblings, 2 replies; 118+ messages in thread
From: Arjan van de Ven @ 2009-06-14 23:18 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, Oliver Neukum, Magnus Damm, linux-pm,
	ACPI Devel Maling List, Ingo Molnar, LKML, Greg KH

On Mon, 15 Jun 2009 00:57:31 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:

> On Sunday 14 June 2009, Rafael J. Wysocki wrote:
> > On Sunday 14 June 2009, Rafael J. Wysocki wrote:
> > > Hi,
> > > 
> > > Below is the current version of my "run-time PM for I/O devices"
> > > patch.
> > > 
> > > I've done my best to address the comments received during the
> > > recent discussions, but at the same time I've tried to make the
> > > patch only contain the most essential things.  For this reason,
> > > for example, the sysfs interface is not there and it's going to
> > > be added in a separate patch.
> > > 
> > > Please let me know if you want me to change anything in this
> > > patch or to add anything new to it.  [Magnus, I remember you
> > > wanted something like ->runtime_wakeup() along with
> > > ->runtime_idle(), but I'm not sure it's really necessary.  Please
> > > let me know if you have any particular usage scenario for it.]
> 
> Appended is an update of the patch addressing the today's comments
> from Magnus.

few comments from me

1) For the usecases for upcoming hw from Intel (where you really can't talk to the hw while it's in powersave mode); the locking needs to be
   IRQ safe. Think of it like this:
   Lets assume you get a (shared) interrupt from your device. In the handler you need to make 100% sure that 
    1) you're not suspended at this point .. basically do a forced wakeup right there and then
    2) assure that you're not about to suspend
2) You use jiffies in the API; I would suggest exposing milliseconds instead and internally convert to jiffies;
   milliseconds tends to be a much more natural unit for this sort of thing
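
For illustration, the unit conversion suggested in point 2 might look roughly
like the user-space sketch below.  MODEL_HZ and model_msecs_to_ticks() are
hypothetical names and the tick rate is an arbitrary assumption; in the kernel
the existing msecs_to_jiffies() helper would do this job.

```c
#include <assert.h>

#define MODEL_HZ 250UL	/* assumed tick rate, for illustration only */

/*
 * Convert milliseconds to ticks, rounding up so that a non-zero delay
 * never turns into zero ticks.  Very large 'ms' values would overflow
 * the multiplication; this sketch does not guard against that.
 */
static unsigned long model_msecs_to_ticks(unsigned long ms)
{
	return (ms * MODEL_HZ + 999UL) / 1000UL;
}
```

With a 250 Hz tick, 4 ms fits into one tick while 5 ms is rounded up to two, so
a caller asking for a delay in milliseconds never gets less than requested.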



* Re: [patch update] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-14 23:18     ` Arjan van de Ven
@ 2009-06-15 20:02       ` Rafael J. Wysocki
  2009-06-15 20:02       ` Rafael J. Wysocki
  1 sibling, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-15 20:02 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Alan Stern, Oliver Neukum, Magnus Damm, linux-pm,
	ACPI Devel Maling List, Ingo Molnar, LKML, Greg KH

On Monday 15 June 2009, Arjan van de Ven wrote:
> On Mon, 15 Jun 2009 00:57:31 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> 
> > On Sunday 14 June 2009, Rafael J. Wysocki wrote:
> > > On Sunday 14 June 2009, Rafael J. Wysocki wrote:
> > > > Hi,
> > > > 
> > > > Below is the current version of my "run-time PM for I/O devices"
> > > > patch.
> > > > 
> > > > I've done my best to address the comments received during the
> > > > recent discussions, but at the same time I've tried to make the
> > > > patch only contain the most essential things.  For this reason,
> > > > for example, the sysfs interface is not there and it's going to
> > > > be added in a separate patch.
> > > > 
> > > > Please let me know if you want me to change anything in this
> > > > patch or to add anything new to it.  [Magnus, I remember you
> > > > wanted something like ->runtime_wakeup() along with
> > > > ->runtime_idle(), but I'm not sure it's really necessary.  Please
> > > > let me know if you have any particular usage scenario for it.]
> > 
> > Appended is an update of the patch addressing the today's comments
> > from Magnus.
> 
> few comments from me
> 
> 1) For the usecases for upcoming hw from Intel (where you really can't talk to the hw while it's in powersave mode); the locking needs to be
>    IRQ safe. Think of it like this:
>    Lets assume you get a (shared) interrupt from your device. In the handler you need to make 100% sure that 
>     1) you're not suspended at this point .. basically do a forced wakeup right there and then
>     2) assure that you're not about to suspend

Does it mean we need to use spin_[un]lock_irq[save|restore]() everywhere in the
framework?

> 2) You use jiffies in the API; I would suggest exposing milliseconds instead and internally convert to jiffies;
>    milliseconds tends to be a much more natural unit for this sort of thing

OK

Best,
Rafael


* Re: [patch update] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-14 22:57   ` [patch update] " Rafael J. Wysocki
@ 2009-06-15 21:08       ` Alan Stern
  2009-06-14 23:18     ` Arjan van de Ven
                         ` (4 subsequent siblings)
  5 siblings, 0 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-15 21:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Mon, 15 Jun 2009, Rafael J. Wysocki wrote:

> On Sunday 14 June 2009, Rafael J. Wysocki wrote:
> > On Sunday 14 June 2009, Rafael J. Wysocki wrote:
> > > Hi,
> > > 
> > > Below is the current version of my "run-time PM for I/O devices" patch.
> > > 
> > > I've done my best to address the comments received during the recent
> > > discussions, but at the same time I've tried to make the patch only contain
> > > the most essential things.  For this reason, for example, the sysfs interface
> > > is not there and it's going to be added in a separate patch.
> > > 
> > > Please let me know if you want me to change anything in this patch or to add
> > > anything new to it.  [Magnus, I remember you wanted something like
> > > ->runtime_wakeup() along with ->runtime_idle(), but I'm not sure it's really
> > > necessary.  Please let me know if you have any particular usage scenario for
> > > it.]
> 
> Appended is an update of the patch addressing the today's comments from Magnus.

This is really looking very good.  I'll do a more detailed review
later.  (In particular, I have not checked the details of the rather
intricate state machine transitions.)  For now, a couple of things 
struck my eye:

Shouldn't the calls to complete() really be complete_all()?  There
might be more than one thread waiting for a suspend or resume callback
to finish.
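
To illustrate the difference, here is a toy, single-threaded model of a completion.  It is only a sketch and not the kernel's implementation: the toy_* names are made up, and a real completion also has a wait queue and locking.  The point is that complete() lets one waiter through, while complete_all() lets every waiter through until the completion is re-initialised.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Toy model: 'done' counts how many waiters may still pass. */
struct toy_completion {
	unsigned int done;
};

static void toy_init_completion(struct toy_completion *c)
{
	c->done = 0;
}

/* Lets exactly one waiter through. */
static void toy_complete(struct toy_completion *c)
{
	c->done++;
}

/* Lets every present and future waiter through until re-initialised. */
static void toy_complete_all(struct toy_completion *c)
{
	c->done = UINT_MAX;
}

/* Non-blocking wait: returns true if the caller may proceed. */
static bool toy_try_wait(struct toy_completion *c)
{
	if (c->done == 0)
		return false;
	if (c->done != UINT_MAX)
		c->done--;
	return true;
}
```

In this model a second waiter is left behind after a single toy_complete(), which is the situation described above when two threads wait for the same callback to finish.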

Since pm_runtime_resume() takes care of powering up the parent, there's 
no need for pm_request_resume() to worry about it also.

The documentation should mention that the runtime_suspend method is
supposed to enable remote wakeup if it is available and if
device_may_wakeup(dev) is true.
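
As a rough userspace sketch of that rule (struct toy_device and the toy_* functions are hypothetical stand-ins, and the remote_wakeup_on flag merely represents whatever the hardware needs), a runtime_suspend method could look like this.  It assumes that device_may_wakeup() amounts to checking both the can_wakeup and should_wakeup bits of struct dev_pm_info.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the wakeup-related state of a device. */
struct toy_device {
	bool can_wakeup;	/* hardware is able to signal wake-up */
	bool should_wakeup;	/* policy allows it to do so */
	bool remote_wakeup_on;	/* illustrative: state of the hardware feature */
	bool suspended;
};

/* Models device_may_wakeup(): both bits must be set. */
static bool toy_device_may_wakeup(const struct toy_device *dev)
{
	return dev->can_wakeup && dev->should_wakeup;
}

/* Sketch of a ->runtime_suspend() method that follows the rule above. */
static int toy_runtime_suspend(struct toy_device *dev)
{
	/* Arm remote wakeup only when it is both available and allowed. */
	dev->remote_wakeup_on = toy_device_may_wakeup(dev);
	dev->suspended = true;
	return 0;
}
```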

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-15 21:08       ` Alan Stern
  (?)
@ 2009-06-15 23:21       ` Rafael J. Wysocki
  2009-06-16 14:30         ` Alan Stern
  2009-06-16 14:30           ` Alan Stern
  -1 siblings, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-15 23:21 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Monday 15 June 2009, Alan Stern wrote:
> On Mon, 15 Jun 2009, Rafael J. Wysocki wrote:
> 
> > On Sunday 14 June 2009, Rafael J. Wysocki wrote:
> > > On Sunday 14 June 2009, Rafael J. Wysocki wrote:
> > > > Hi,
> > > > 
> > > > Below is the current version of my "run-time PM for I/O devices" patch.
> > > > 
> > > > I've done my best to address the comments received during the recent
> > > > discussions, but at the same time I've tried to make the patch only contain
> > > > the most essential things.  For this reason, for example, the sysfs interface
> > > > is not there and it's going to be added in a separate patch.
> > > > 
> > > > Please let me know if you want me to change anything in this patch or to add
> > > > anything new to it.  [Magnus, I remember you wanted something like
> > > > ->runtime_wakeup() along with ->runtime_idle(), but I'm not sure it's really
> > > > necessary.  Please let me know if you have any particular usage scenario for
> > > > it.]
> > 
> > Appended is an update of the patch addressing the today's comments from Magnus.
> 
> This is really looking very good.  I'll do a more detailed review
> later.  (In particular, I have not checked the details of the rather
> intricate state machine transitions.)  For now, a couple of things 
> struck my eye:
> 
> Shouldn't the calls to complete() really be complete_all()?  There
> might be more than one thread waiting for a suspend or resume callback
> to finish.

Yes, thanks for pointing that out.

> Since pm_runtime_resume() takes care of powering up the parent, there's 
> no need for pm_request_resume() to worry about it also.

But still it won't hurt to do it IMO, because the parents are then going to be
resumed before our pm_runtime_resume() is called.

> The documentation should mention that the runtime_suspend method is 
> supposed to enable remote wakeup if it is available and if 
> device_may_wakeup(dev) is true.

Well, I thought that was obvious. :-)

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-15 23:21       ` Rafael J. Wysocki
@ 2009-06-16 14:30           ` Alan Stern
  2009-06-16 14:30           ` Alan Stern
  1 sibling, 0 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-16 14:30 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Tue, 16 Jun 2009, Rafael J. Wysocki wrote:

> > Since pm_runtime_resume() takes care of powering up the parent, there's 
> > no need for pm_request_resume() to worry about it also.
> 
> But still it won't hurt to do it IMO, because the parents are then going to be
> resumed before our pm_runtime_resume() is called.

It's extra code that isn't needed.  In essence, you are trading code
space for a shorter runtime stack.

> > The documentation should mention that the runtime_suspend method is 
> > supposed to enable remote wakeup if it is available and if 
> > device_may_wakeup(dev) is true.
> 
> Well, I thought that was obvious. :-)

Sometimes it doesn't hurt to state the obvious!  :-)

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch update 2] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-16 14:30           ` Alan Stern
  (?)
  (?)
@ 2009-06-16 21:30           ` Rafael J. Wysocki
  2009-06-16 22:33             ` [patch update 2 fix] " Rafael J. Wysocki
  2009-06-16 22:33             ` Rafael J. Wysocki
  -1 siblings, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-16 21:30 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Tuesday 16 June 2009, Alan Stern wrote:
> On Tue, 16 Jun 2009, Rafael J. Wysocki wrote:
> > > Since pm_runtime_resume() takes care of powering up the parent, there's
> > > no need for pm_request_resume() to worry about it also.
> >
> > But still it won't hurt to do it IMO, because the parents are then going
> > to be resumed before our pm_runtime_resume() is called.
>
> It's extra code that isn't needed.  In essence, you are trading code
> space for a shorter runtime stack.

That's correct.  I think the code size increase is small and it's better to 
keep the stack as small as reasonably possible.

> > > The documentation should mention that the runtime_suspend method is
> > > supposed to enable remote wakeup if it is available and if
> > > device_may_wakeup(dev) is true.
> >
> > Well, I thought that was obvious. :-)
>
> Sometimes it doesn't hurt to state the obvious!  :-)

Sure.

In the meantime I updated the patch once again.  I addressed your last
comments in this version and added the possibility to resume with blocking
suspend (i.e. after such a resume pm_runtime_suspend() and pm_request_suspend()
will return immediately until a special function is called).
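
A toy userspace model of that behaviour, using the status bits from the patch below, might look as follows.  It only tracks the status word, with no locking and no callbacks, and toy_clear_grace() is a hypothetical name standing in for the "special function" mentioned above.

```c
#include <assert.h>
#include <errno.h>

/* Status bits as defined in the patch. */
#define RPM_ACTIVE	0
#define RPM_SUSPENDED	0x04
#define RPM_WAKE	0x08
#define RPM_RESUMING	0x10
#define RPM_GRACE	0x20
#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING | RPM_GRACE)

/* Suspend is refused while any "no suspend" bit is set. */
static int toy_suspend(unsigned int *status)
{
	if (*status & RPM_NO_SUSPEND)
		return -EAGAIN;	/* blocked, e.g. by the grace bit */
	*status = RPM_SUSPENDED;
	return 0;
}

/* Resume and, if @grace is nonzero, block further suspends. */
static void toy_resume(unsigned int *status, int grace)
{
	*status = grace ? RPM_GRACE : RPM_ACTIVE;
}

/* Hypothetical helper that ends the post-resume grace period. */
static void toy_clear_grace(unsigned int *status)
{
	*status &= ~RPM_GRACE;
}
```

In this model a suspend attempt after toy_resume(&st, 1) returns -EAGAIN, and it succeeds again only once toy_clear_grace() has been called.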

I also fixed a couple of bugs. :-)

Best,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM: Introduce core framework for run-time PM of I/O devices

Introduce a core framework for run-time power management of I/O
devices.  Add device run-time PM fields to 'struct dev_pm_info'
and device run-time PM callbacks to 'struct dev_pm_ops'.  Introduce
a run-time PM workqueue and define some device run-time PM helper
functions at the core level.  Document all these things.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/runtime_pm.txt |  311 +++++++++++++++++++++++
 drivers/base/dd.c                  |    9 
 drivers/base/power/Makefile        |    1 
 drivers/base/power/main.c          |    5 
 drivers/base/power/runtime.c       |  499 +++++++++++++++++++++++++++++++++++++
 include/linux/pm.h                 |   97 ++++++-
 include/linux/pm_runtime.h         |  112 ++++++++
 kernel/power/Kconfig               |   14 +
 kernel/power/main.c                |   17 +
 9 files changed, 1062 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/power/Kconfig
===================================================================
--- linux-2.6.orig/kernel/power/Kconfig
+++ linux-2.6/kernel/power/Kconfig
@@ -208,3 +208,17 @@ config APM_EMULATION
 	  random kernel OOPSes or reboots that don't seem to be related to
 	  anything, try disabling/enabling this option (or disabling/enabling
 	  APM in your BIOS).
+
+config PM_RUNTIME
+	bool "Run-time PM core functionality"
+	depends on PM
+	---help---
+	  Enable functionality allowing I/O devices to be put into energy-saving
+	  (low power) states at run time (or autosuspended) after a specified
+	  period of inactivity and woken up in response to a hardware-generated
+	  wake-up event or a driver's request.
+
+	  Hardware support is generally required for this functionality to work
+	  and the bus type drivers of the buses the devices are on are
+	  responsible for the actual handling of the autosuspend requests and
+	  wake-up events.
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -11,6 +11,7 @@
 #include <linux/kobject.h>
 #include <linux/string.h>
 #include <linux/resume-trace.h>
+#include <linux/workqueue.h>
 
 #include "power.h"
 
@@ -217,8 +218,24 @@ static struct attribute_group attr_group
 	.attrs = g,
 };
 
+#ifdef CONFIG_PM_RUNTIME
+struct workqueue_struct *pm_wq;
+
+static int __init pm_start_workqueue(void)
+{
+	pm_wq = create_freezeable_workqueue("pm");
+
+	return pm_wq ? 0 : -ENOMEM;
+}
+#else
+static inline int pm_start_workqueue(void) { return 0; }
+#endif
+
 static int __init pm_init(void)
 {
+	int error = pm_start_workqueue();
+	if (error)
+		return error;
 	power_kobj = kobject_create_and_add("power", NULL);
 	if (!power_kobj)
 		return -ENOMEM;
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -22,6 +22,9 @@
 #define _LINUX_PM_H
 
 #include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -165,6 +168,26 @@ typedef struct pm_message {
  * It is allowed to unregister devices while the above callbacks are being
  * executed.  However, it is not allowed to unregister a device from within any
  * of its own callbacks.
+ *
+ * There also are the following callbacks related to run-time power management
+ * of devices:
+ *
+ * @runtime_suspend: Prepare the device for a condition in which it won't be
+ *	able to communicate with the CPU(s) and RAM due to power management.
+ *	This need not mean that the device should be put into a low power state,
+ *	like for example when the device is behind a link, represented by a
+ *	separate device object, that is going to be turned off for power
+ *	management purposes.
+ *
+ * @runtime_resume: Put the device into the fully active state in response to a
+ *	wake-up event generated by hardware or at a request of software.  If
+ *	necessary, put the device into the full power state and restore its
+ *	registers, so that it is fully operational.
+ *
+ * @runtime_idle: Device appears to be inactive and it might be put into a low
+ *	power state if all of the necessary conditions are satisfied.  Check
+ *	these conditions and handle the device as appropriate, possibly queueing
+ *	a suspend request for it.
  */
 
 struct dev_pm_ops {
@@ -182,6 +205,9 @@ struct dev_pm_ops {
 	int (*thaw_noirq)(struct device *dev);
 	int (*poweroff_noirq)(struct device *dev);
 	int (*restore_noirq)(struct device *dev);
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
 };
 
 /**
@@ -315,14 +341,79 @@ enum dpm_state {
 	DPM_OFF_IRQ,
 };
 
+/**
+ * Device run-time power management state.
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.  They do not
+ * reflect the actual power state of the device or its status as seen by the
+ * driver.
+ *
+ * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
+ *			pending for it.
+ *
+ * RPM_IDLE		It has been requested that the device be suspended.
+ *			Suspend request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
+ *			executed.
+ *
+ * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
+ *			completed successfully.  The device is regarded as
+ *			suspended.
+ *
+ * RPM_WAKE		It has been requested that the device be woken up.
+ *			Resume request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
+ *			executed.
+ *
+ * RPM_ERROR		Represents a condition from which the PM core cannot
+ *			recover by itself.  If the device's run-time PM status
+ *			field has this value, all of the run-time PM operations
+ *			carried out for the device by the core will fail, until
+ *			the status field is changed to either RPM_ACTIVE or
+ *			RPM_SUSPENDED (it is not valid to use the other values
+ *			in such a situation) by the device's driver or bus type.
+ *			This happens when the device bus type's
+ *			->runtime_suspend() or ->runtime_resume() callback
+ *			returns error code different from -EAGAIN or -EBUSY.
+ */
+
+#define RPM_ACTIVE	0
+#define RPM_IDLE	0x01
+#define RPM_SUSPENDING	0x02
+#define RPM_SUSPENDED	0x04
+#define RPM_WAKE	0x08
+#define RPM_RESUMING	0x10
+#define RPM_GRACE	0x20
+#define RPM_ERROR	(-1)
+
+#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
+#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)
+#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING | RPM_GRACE)
+#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)
+
 struct dev_pm_info {
 	pm_message_t		power_state;
-	unsigned		can_wakeup:1;
-	unsigned		should_wakeup:1;
+	unsigned int		can_wakeup:1;
+	unsigned int		should_wakeup:1;
 	enum dpm_state		status;		/* Owned by the PM core */
-#ifdef	CONFIG_PM_SLEEP
+#ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
 #endif
+#ifdef CONFIG_PM_RUNTIME
+	struct delayed_work	runtime_work;
+	struct completion	work_done;
+	unsigned int		suspend_skip_children:1;
+	unsigned int		suspend_aborted:1;
+	unsigned int		runtime_status:6;
+	int			runtime_error;
+	atomic_t		depth;
+	spinlock_t		lock;
+#endif
 };
 
 /*
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o
+obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
Index: linux-2.6/drivers/base/power/runtime.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/runtime.c
@@ -0,0 +1,499 @@
+/*
+ * drivers/base/power/runtime.c - Helper functions for device run-time PM
+ *
+ * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/pm_runtime.h>
+#include <linux/jiffies.h>
+
+/**
+ * __pm_runtime_change_status - Change the run-time PM status of a device.
+ * @dev: Device to handle.
+ * @status: Expected current run-time PM status of the device.
+ * @new_status: New value of the device's run-time PM status.
+ *
+ * Change the run-time PM status of the device to @new_status if its current
+ * value is equal to @status.
+ */
+void __pm_runtime_change_status(struct device *dev, unsigned int status,
+				unsigned int new_status)
+{
+	unsigned long flags;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == status)
+		dev->power.runtime_status = new_status;
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_change_status);
+
+/**
+ * pm_device_suspended - Check if given device has been suspended at run time.
+ * @dev: Device to check.
+ * @data: Ignored.
+ *
+ * Returns 0 if the device has been suspended and it hasn't been requested to
+ * resume or -EBUSY otherwise.
+ */
+static int pm_device_suspended(struct device *dev, void *data)
+{
+	return dev->power.runtime_status == RPM_SUSPENDED ? 0 : -EBUSY;
+}
+
+/**
+ * pm_check_children - Check if all children of a device have been suspended.
+ * @dev: Device to check.
+ *
+ * Returns 0 if all children of the device have been suspended or -EBUSY
+ * otherwise.
+ */
+static int pm_check_children(struct device *dev)
+{
+	return dev->power.suspend_skip_children ? 0 :
+			device_for_each_child(dev, NULL, pm_device_suspended);
+}
+
+/**
+ * pm_runtime_notify_idle - Run a device bus type's runtime_idle() callback.
+ * @dev: Device to notify.
+ *
+ * Check if all children of given device are suspended and call the device bus
+ * type's ->runtime_idle() callback if that's the case.
+ */
+static void pm_runtime_notify_idle(struct device *dev)
+{
+	if (atomic_read(&dev->power.depth) > 0 || pm_check_children(dev))
+		return;
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_idle)
+		dev->bus->pm->runtime_idle(dev);
+}
+
+/**
+ * __pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
+ * @dev: Device to suspend.
+ * @sync: If unset, the function has been called via pm_wq.
+ *
+ * Check if the status of the device is appropriate and run the
+ * ->runtime_suspend() callback provided by the device's bus type driver.
+ * Update the run-time PM flags in the device object to reflect the current
+ * status of the device.
+ */
+int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	int error = -EINVAL;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out;
+	} else if (dev->power.runtime_status & RPM_SUSPENDED) {
+		error = 0;
+		goto out;
+	} else if ((dev->power.runtime_status & RPM_NO_SUSPEND)
+	    || (!sync && dev->power.suspend_aborted)) {
+		/*
+		 * Device is resuming or in a post-resume grace period or
+		 * there's a resume request pending, or a pending suspend
+		 * request has just been cancelled and we're running as a result
+		 * of this request.
+		 */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+
+		/*
+		 * Another suspend is running in parallel with us.  Wait for it
+		 * to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	} else if (pm_check_children(dev)) {
+		/*
+		 * We can only suspend the device if all of its children have
+		 * been suspended.
+		 */
+		dev->power.runtime_status = RPM_ACTIVE;
+		error = -EAGAIN;
+		goto out;
+	}
+
+	dev->power.runtime_status = RPM_SUSPENDING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_suspend)
+		error = dev->bus->pm->runtime_suspend(dev);
+
+	spin_lock(&dev->power.lock);
+
+	/*
+	 * Resume request might have been queued in the meantime, in which case
+	 * the RPM_WAKE bit is also set in runtime_status.
+	 */
+	dev->power.runtime_status &= ~RPM_SUSPENDING;
+	switch (error) {
+	case 0:
+		dev->power.runtime_status |= RPM_SUSPENDED;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+	if (!error && !(dev->power.runtime_status & RPM_WAKE) && dev->parent) {
+		spin_unlock(&dev->power.lock);
+
+		pm_runtime_notify_idle(dev->parent);
+
+		return 0;
+	}
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_suspend);
+
+/**
+ * pm_runtime_suspend_work - Run pm_runtime_suspend() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the suspend has been scheduled for and
+ * run pm_runtime_suspend() for it.
+ */
+static void pm_runtime_suspend_work(struct work_struct *work)
+{
+	__pm_runtime_suspend(pm_work_to_device(work), false);
+}
+
+/**
+ * pm_request_suspend - Schedule run-time suspend of given device.
+ * @dev: Device to suspend.
+ * @msec: Time, in milliseconds, to wait before attempting to suspend the device.
+ */
+void pm_request_suspend(struct device *dev, unsigned int msec)
+{
+	unsigned long flags;
+	unsigned long delay = msecs_to_jiffies(msec);
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ACTIVE)
+		goto out;
+
+	dev->power.runtime_status = RPM_IDLE;
+	dev->power.suspend_aborted = false;
+	INIT_DELAYED_WORK(&dev->power.runtime_work, pm_runtime_suspend_work);
+	queue_delayed_work(pm_wq, &dev->power.runtime_work, delay);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_suspend);
+
+/**
+ * pm_cancel_suspend - Cancel a pending suspend request for given device.
+ * @dev: Device to cancel the suspend request for.
+ */
+static void pm_cancel_suspend(struct device *dev)
+{
+	cancel_delayed_work(&dev->power.runtime_work);
+	dev->power.runtime_status &= RPM_GRACE;
+	dev->power.suspend_aborted = true;
+}
+
+/**
+ * __pm_runtime_resume - Run a device bus type's runtime_resume() callback.
+ * @dev: Device to resume.
+ * @grace: If set, force a post-resume grace period.
+ *
+ * Check if the device is really suspended and run the ->runtime_resume()
+ * callback provided by the device's bus type driver.  Update the run-time PM
+ * flags in the device object to reflect the current status of the device.  If
+ * runtime suspend is in progress while this function is being run, wait for it
+ * to finish before resuming the device.  If runtime suspend is scheduled, but
+ * it hasn't started yet, cancel it and we're done.
+ */
+int __pm_runtime_resume(struct device *dev, bool grace)
+{
+	int error = -EINVAL;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	if (dev->parent)
+		spin_lock(&dev->parent->power.lock);
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out_unlock;
+	} else if (!(dev->power.runtime_status & ~RPM_GRACE)) {
+		/* Device is active or in a post-resume grace period. */
+		error = 0;
+		goto out_unlock;
+	} else if (dev->power.runtime_status == RPM_IDLE) {
+		/* ->runtime_suspend() hasn't started yet, no need to resume. */
+		pm_cancel_suspend(dev);
+		if (grace)
+			dev->power.runtime_status |= RPM_GRACE;
+		error = 0;
+		goto out_unlock;
+	}
+
+	if (dev->power.runtime_status & RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * A suspend is running in parallel with us.  Wait for it to
+		 * complete and repeat.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		goto repeat;
+	} else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
+	    && (dev->parent->power.runtime_status & ~RPM_GRACE)) {
+		spin_unlock(&dev->power.lock);
+		spin_unlock(&dev->parent->power.lock);
+
+		/* The device's parent is not active.  Resume it and repeat. */
+		error = __pm_runtime_resume(dev->parent, false);
+		if (error)
+			return error;
+
+		goto repeat;
+	}
+
+	if (dev->power.runtime_status == RPM_RESUMING) {
+		if (grace)
+			dev->power.runtime_status |= RPM_GRACE;
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * There's another resume running in parallel with us. Wait for
+		 * it to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	/* The RPM_GRACE bit may be set in runtime_status. */
+	dev->power.runtime_status &= ~(RPM_WAKE | RPM_SUSPENDED);
+	dev->power.runtime_status |= RPM_RESUMING;
+	if (grace)
+		dev->power.runtime_status |= RPM_GRACE;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_resume)
+		error = dev->bus->pm->runtime_resume(dev);
+
+	spin_lock(&dev->power.lock);
+
+	dev->power.runtime_status &= ~RPM_RESUMING;
+	switch (error) {
+	case 0:
+		/* Success: at most RPM_GRACE remains set, the device is active. */
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_SUSPENDED;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+
+ out_unlock:
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+	goto out;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_resume);
+
+/**
+ * pm_runtime_resume_work - Run __pm_runtime_resume() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the resume has been scheduled for and run
+ * __pm_runtime_resume() for it without forcing a grace period after the resume.
+ */
+static void pm_runtime_resume_work(struct work_struct *work)
+{
+	__pm_runtime_resume(pm_work_to_device(work), false);
+}
+
+/**
+ * __pm_request_resume - Schedule run-time resume of given device.
+ * @dev: Device to resume.
+ * @grace: If set, force a post-resume grace period.
+ */
+void __pm_request_resume(struct device *dev, bool grace)
+{
+	unsigned long parent_flags = 0, flags;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	if (dev->parent)
+		spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		/* Autosuspend request is pending, no need to resume. */
+		pm_cancel_suspend(dev);
+		if (grace)
+			dev->power.runtime_status |= RPM_GRACE;
+		goto out;
+	} else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
+		goto out;
+	} else if (dev->parent
+	    && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+
+		/* The parent is suspending, suspended or idle. Wake it up. */
+		__pm_request_resume(dev->parent, false);
+
+		goto repeat;
+	}
+
+	/*
+	 * The device may be suspending at the moment and we can't clear the
+	 * RPM_SUSPENDING bit in its runtime_status just yet.
+	 */
+	dev->power.runtime_status |= RPM_WAKE;
+	if (grace)
+		dev->power.runtime_status |= RPM_GRACE;
+	INIT_WORK(&dev->power.runtime_work.work, pm_runtime_resume_work);
+	queue_work(pm_wq, &dev->power.runtime_work.work);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (dev->parent)
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+}
+EXPORT_SYMBOL_GPL(__pm_request_resume);
+
+/**
+ * pm_cancel_runtime_suspend - Cancel a pending suspend request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_suspend(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		cancel_delayed_work(&dev->power.runtime_work);
+		dev->power.runtime_status = RPM_ACTIVE;
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_suspend);
+
+/**
+ * pm_cancel_runtime_resume - Cancel a pending resume request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_resume(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status & RPM_WAKE) {
+		work_clear_pending(&dev->power.runtime_work.work);
+		dev->power.runtime_status &= ~(RPM_WAKE | RPM_GRACE);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_resume);
+
+/**
+ * pm_runtime_disable - Disable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Increase the depth field in the device's dev_pm_info structure, which will
+ * cause the run-time PM functions above to return without doing anything.
+ * If there is a run-time PM operation in progress, wait for it to complete.
+ */
+void pm_runtime_disable(struct device *dev)
+{
+	might_sleep();
+
+	atomic_inc(&dev->power.depth);
+
+	if (dev->power.runtime_status & RPM_IN_PROGRESS)
+		wait_for_completion(&dev->power.work_done);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_disable);
+
+/**
+ * pm_runtime_enable - Enable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Enable run-time power management for given device by decreasing the depth
+ * field in its dev_pm_info structure.
+ */
+void pm_runtime_enable(struct device *dev)
+{
+	if (!atomic_add_unless(&dev->power.depth, -1, 0))
+		dev_warn(dev, "PM: Excessive pm_runtime_enable()!\n");
+}
+EXPORT_SYMBOL_GPL(pm_runtime_enable);
+
+/**
+ * pm_runtime_init - Initialize run-time PM fields in given device object.
+ * @dev: Device object to handle.
+ */
+void pm_runtime_init(struct device *dev)
+{
+	spin_lock_init(&dev->power.lock);
+	dev->power.runtime_status = RPM_ACTIVE;
+	atomic_set(&dev->power.depth, 1);
+	pm_suspend_check_children(dev, true);
+}
Index: linux-2.6/include/linux/pm_runtime.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pm_runtime.h
@@ -0,0 +1,112 @@
+/*
+ * pm_runtime.h - Device run-time power management helper functions.
+ *
+ * Copyright (C) 2009 Rafael J. Wysocki <rjw@sisk.pl>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_RUNTIME_H
+#define _LINUX_PM_RUNTIME_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+#ifdef CONFIG_PM_RUNTIME
+
+extern struct workqueue_struct *pm_wq;
+
+extern void pm_runtime_init(struct device *dev);
+extern void __pm_runtime_change_status(struct device *dev, unsigned int status,
+				       unsigned int new_status);
+extern int __pm_runtime_suspend(struct device *dev, bool sync);
+extern void pm_request_suspend(struct device *dev, unsigned int msec);
+extern int __pm_runtime_resume(struct device *dev, bool grace);
+extern void __pm_request_resume(struct device *dev, bool grace);
+extern void pm_cancel_runtime_suspend(struct device *dev);
+extern void pm_cancel_runtime_resume(struct device *dev);
+extern void pm_runtime_disable(struct device *dev);
+extern void pm_runtime_enable(struct device *dev);
+
+static inline struct device *pm_work_to_device(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(dw, struct dev_pm_info, runtime_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+	dev->power.suspend_skip_children = !enable;
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+static inline void pm_runtime_init(struct device *dev) {}
+static inline void __pm_runtime_change_status(struct device *dev,
+					      unsigned int status,
+					      unsigned int new_status) {}
+static inline int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	return -ENOSYS;
+}
+static inline void pm_request_suspend(struct device *dev, unsigned int msec) {}
+static inline int __pm_runtime_resume(struct device *dev, bool grace)
+{
+	return -ENOSYS;
+}
+static inline void __pm_request_resume(struct device *dev, bool grace) {}
+static inline void pm_cancel_runtime_suspend(struct device *dev) {}
+static inline void pm_cancel_runtime_resume(struct device *dev) {}
+static inline void pm_runtime_disable(struct device *dev) {}
+static inline void pm_runtime_enable(struct device *dev) {}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+}
+
+#endif /* !CONFIG_PM_RUNTIME */
+
+static inline int pm_runtime_suspend(struct device *dev)
+{
+	return __pm_runtime_suspend(dev, true);
+}
+
+static inline int pm_runtime_resume(struct device *dev)
+{
+	return __pm_runtime_resume(dev, false);
+}
+
+static inline int pm_runtime_resume_grace(struct device *dev)
+{
+	return __pm_runtime_resume(dev, true);
+}
+
+static inline void pm_request_resume(struct device *dev)
+{
+	__pm_request_resume(dev, false);
+}
+
+static inline void pm_request_resume_grace(struct device *dev)
+{
+	__pm_request_resume(dev, true);
+}
+
+static inline void pm_runtime_clear_active(struct device *dev)
+{
+	__pm_runtime_change_status(dev, RPM_ERROR, RPM_ACTIVE);
+}
+
+static inline void pm_runtime_clear_suspended(struct device *dev)
+{
+	__pm_runtime_change_status(dev, RPM_ERROR, RPM_SUSPENDED);
+}
+
+static inline void pm_runtime_release(struct device *dev)
+{
+	__pm_runtime_change_status(dev, RPM_GRACE, RPM_ACTIVE);
+}
+
+#endif
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -21,6 +21,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
@@ -88,6 +89,7 @@ void device_pm_add(struct device *dev)
 	}
 
 	list_add_tail(&dev->power.entry, &dpm_list);
+	pm_runtime_init(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -507,6 +509,7 @@ static void dpm_complete(pm_message_t st
 		get_device(dev);
 		if (dev->power.status > DPM_ON) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			mutex_unlock(&dpm_list_mtx);
 
 			device_complete(dev, state);
@@ -753,6 +756,7 @@ static int dpm_prepare(pm_message_t stat
 
 		get_device(dev);
 		dev->power.status = DPM_PREPARING;
+		pm_runtime_disable(dev);
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_prepare(dev, state);
@@ -760,6 +764,7 @@ static int dpm_prepare(pm_message_t stat
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			if (error == -EAGAIN) {
 				put_device(dev);
 				continue;
Index: linux-2.6/drivers/base/dd.c
===================================================================
--- linux-2.6.orig/drivers/base/dd.c
+++ linux-2.6/drivers/base/dd.c
@@ -23,6 +23,7 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
+#include <linux/pm_runtime.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -202,8 +203,12 @@ int driver_probe_device(struct device_dr
 	pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
 		 drv->bus->name, __func__, dev_name(dev), drv->name);
 
+	pm_runtime_disable(dev);
+
 	ret = really_probe(dev, drv);
 
+	pm_runtime_enable(dev);
+
 	return ret;
 }
 
@@ -306,6 +311,8 @@ static void __device_release_driver(stru
 
 	drv = dev->driver;
 	if (drv) {
+		pm_runtime_disable(dev);
+
 		driver_sysfs_remove(dev);
 
 		if (dev->bus)
@@ -320,6 +327,8 @@ static void __device_release_driver(stru
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
+
+		pm_runtime_enable(dev);
 	}
 }
 
Index: linux-2.6/Documentation/power/runtime_pm.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/power/runtime_pm.txt
@@ -0,0 +1,311 @@
+Run-time Power Management Framework for I/O Devices
+
+(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+1. Introduction
+
+The support for run-time power management (run-time PM) of I/O devices is
+provided at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queuing all work items related to run-time PM, because this allows
+  them to be synchronized with system-wide power transitions.  pm_wq is declared
+  in include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of run-time PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing run-time PM operations with one another.
+
+* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out run-time PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The device run-time PM fields defined in 'struct dev_pm_info', the helper
+functions and the run-time PM callbacks defined in 'struct dev_pm_ops' are
+described below.
+
+2. Run-time PM Helper Functions and Device Fields
+
+The following helper functions are defined in drivers/base/power/runtime.c
+and include/linux/pm_runtime.h:
+
+* void pm_runtime_init(struct device *dev);
+
+* void pm_runtime_enable(struct device *dev);
+* void pm_runtime_disable(struct device *dev);
+
+* int pm_runtime_suspend(struct device *dev);
+* void pm_request_suspend(struct device *dev, unsigned int msec);
+* int pm_runtime_resume(struct device *dev);
+* int pm_runtime_resume_grace(struct device *dev);
+* void pm_request_resume(struct device *dev);
+* void pm_request_resume_grace(struct device *dev);
+* void pm_runtime_release(struct device *dev);
+
+* void pm_cancel_runtime_suspend(struct device *dev);
+* void pm_cancel_runtime_resume(struct device *dev);
+
+* void pm_suspend_check_children(struct device *dev, bool enable);
+
+* void pm_runtime_clear_active(struct device *dev);
+* void pm_runtime_clear_suspended(struct device *dev);
+
+pm_runtime_init() initializes the run-time PM fields in the 'power' member of
+the device object.  It is called during the initialization of the device object,
+in drivers/base/power/main.c:device_pm_add().
+
+pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
+respectively, all of the run-time PM core operations.  They do it by decreasing
+and increasing, respectively, the 'power.depth' field of 'struct device'.  If
+the value of this field is greater than 0, pm_runtime_suspend(),
+pm_request_suspend(), pm_runtime_resume() and so on return immediately without
+doing anything and -EBUSY is returned by pm_runtime_suspend(),
+pm_runtime_resume() and pm_runtime_resume_grace().  Therefore, if
+pm_runtime_disable() is called several times in a row for the same device, it
+has to be balanced by the appropriate number of pm_runtime_enable() calls so
+that the other run-time PM core functions can be used for that device.  The
+initial value of 'power.depth', as set by pm_runtime_init(), is 1 (i.e. the
+run-time PM of the device is initially disabled).
+
+pm_runtime_disable() and pm_runtime_enable() are used by the device core to
+disable the run-time PM of the device temporarily during device probe and
+removal as well as during system-wide power transitions (i.e. system-wide
+suspend or hibernation, or resume from a system sleep state).
+
+pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+pm_runtime_resume_grace(), pm_request_resume(), and pm_request_resume_grace()
+use the 'power.runtime_status' and 'power.suspend_aborted' fields of
+'struct device' for mutual synchronization.  The 'power.runtime_status' field,
+called the device's run-time PM status in what follows, is set to RPM_ACTIVE by
+pm_runtime_init().
+
+pm_request_suspend() is used to queue up a suspend request for an active device.
+If the run-time PM status of the device (i.e. the value of the
+'power.runtime_status' field in 'struct device') is different from RPM_ACTIVE
+(i.e. the device is not active from the PM core standpoint), it returns
+immediately.  Otherwise, it changes the device's run-time PM status to RPM_IDLE
+and puts a request to suspend the device into pm_wq.  The 'msec' argument is
+used to specify the time to wait before the request will be carried out, in
+milliseconds.  It is valid to call this function from interrupt context.
+
+pm_runtime_suspend() is used to carry out a run-time suspend of an active
+device.  It is called directly by a bus type or device driver.  An asynchronous
+version of it is called by the PM core, to complete a request queued up by
+pm_request_suspend().  The only difference between them is the handling of
+situations when a queued up suspend request has just been cancelled.  Apart from
+this, they work in the same way.
+* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field, 'power.runtime_status'), success is returned.
+* If the device is about to resume or is in a post-resume grace period (i.e. at
+  least one of the RPM_WAKE, RPM_RESUMING, and RPM_GRACE bits is set in the
+  device's run-time PM status field), -EAGAIN is returned.  -EAGAIN is also
+  returned if the function has been called via pm_wq as a result of a cancelled
+  suspend request (the 'power.suspend_aborted' field is used for this purpose).
+* If the device is suspending (i.e. its run-time PM status is RPM_SUSPENDING),
+  which means that another instance of pm_runtime_suspend() is running at the
+  same time for the same device, the function waits for the other instance to
+  complete and returns the error code (or success) returned by it.
+* If the device's children are not suspended and the
+  'power.suspend_skip_children' flag is not set for it, the device's run-time PM
+  status is set to RPM_ACTIVE and -EAGAIN is returned.
+If none of the above takes place, the device's run-time PM status is set to
+RPM_SUSPENDING and its bus type's ->runtime_suspend() callback is executed.
+This callback is responsible for handling the device as appropriate (for
+example, it may choose to execute the device driver's ->runtime_suspend()
+callback or to carry out any other suitable action depending on the bus type).
+* If it completes successfully, the RPM_SUSPENDED bit is set and the
+  RPM_SUSPENDING bit is cleared in the device's run-time PM status field.  Once
+  that has happened, the device is regarded by the PM core as suspended, but it
+  _need_ _not_ mean that the device has been put into a low power state.  What
+  really occurs to the device at this point totally depends on its bus type (it
+  may depend on the device's driver if the bus type chooses to call it).
+  Additionally, if the device bus type's ->runtime_suspend() callback completes
+  successfully, the device bus type's ->runtime_idle() callback is executed for
+  the device's parent, if there is one and if all of its children are suspended
+  (or the 'power.suspend_skip_children' flag is set for it).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_ACTIVE.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to carry out any run-time PM operations
+  for it until the status is cleared by its bus type or driver with the help of
+  either pm_runtime_clear_active(), or pm_runtime_clear_suspended().
+Finally, pm_runtime_suspend() returns the error code (or success) returned by
+the device bus type's ->runtime_suspend() callback.  If the device's bus type
+doesn't implement ->runtime_suspend(), -EINVAL is returned and the device's
+run-time PM status is set to RPM_ERROR.
+
+pm_request_resume() and pm_request_resume_grace() are used to queue up a resume
+request for a device that is suspended, suspending or has a suspend request
+pending.  The difference between them is that pm_request_resume_grace() causes
+the RPM_GRACE bit to be set in the device's run-time PM status field, which
+prevents the PM core from suspending the device or queueing up a suspend request
+for it until the RPM_GRACE bit is cleared with the help of pm_runtime_release().
+Apart from this, they work in the same way.
+* If a suspend request is pending for the device (i.e. the device's run-time PM
+  status is RPM_IDLE), it is cancelled, the 'power.suspend_aborted' flag is set
+  for the device, the RPM_IDLE bit is cleared in the device's run-time PM status
+  field and the function returns (pm_request_resume_grace() additionally sets
+  the RPM_GRACE bit in the device's run-time PM status field).
+* If the device is not suspended or suspending (i.e. none of the RPM_SUSPENDED
+  and RPM_SUSPENDING bits is set in the device's run-time PM status field), the
+  function returns.
+* If the device's parent is inactive (i.e. at least one of the RPM_IDLE,
+  RPM_SUSPENDING, and RPM_SUSPENDED bits is set in its run-time PM status
+  field), a resume request is (recursively) scheduled for the parent and the
+  function is restarted.
+If none of the above happens, the RPM_WAKE bit is set in the device's run-time
+PM status field and the request to execute pm_runtime_resume() is put into
+pm_wq.
+
+pm_runtime_resume() and pm_runtime_resume_grace() are used to carry out a
+run-time resume of a device that is suspended, suspending or has a suspend
+request pending.  They are called either by the PM core, to complete a request
+queued up by pm_request_resume(), or directly by a bus type or device driver.
+The difference between them is that pm_runtime_resume_grace() causes the
+RPM_GRACE bit to be set in the device's run-time PM status field, which prevents
+the PM core from suspending the device or queueing up a suspend request for it
+until the RPM_GRACE bit is cleared with the help of pm_runtime_release().  Apart
+from this, they work in the same way.
+* If the device is active (i.e. all of the bits in its run-time PM status are
+  clear, possibly except for RPM_GRACE), success is returned.
+* If there's a suspend request pending for the device (i.e. the device's
+  run-time PM status is RPM_IDLE), it is cancelled, the 'power.suspend_aborted'
+  flag is set for the device, the RPM_IDLE bit is cleared in its run-time PM
+  status field and the function returns success (pm_runtime_resume_grace()
+  additionally sets the RPM_GRACE bit in the device's run-time PM status field).
+* If the device is suspending (i.e. the RPM_SUSPENDING bit is set in its
+  run-time PM status field), the function waits for the suspend operation to
+  complete and restarts itself.
+* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field), the device's parent exists and is not active (i.e.
+  the parent's run-time PM status is not RPM_ACTIVE or RPM_GRACE), the parent is
+  resumed (recursively) and the function restarts itself.
+* If the device is resuming (i.e. the device's run-time PM status is
+  RPM_RESUMING), which means that another instance of pm_runtime_resume() is
+  running at the same time for the same device, the function waits for the other
+  instance to complete and returns the result returned by it.
+If none of the above happens, the RPM_WAKE and RPM_SUSPENDED bits are cleared
+and the RPM_RESUMING bit is set in the device's run-time PM status field.  Next,
+the device bus type's ->runtime_resume() callback is executed, which is
+responsible for handling the device as appropriate (for example, it may choose
+to execute the device driver's ->runtime_resume() callback or to carry out any
+other suitable action depending on the bus type).
+* If it completes successfully, the device's run-time PM status is set to
+  'active' (i.e. the device's run-time PM status field is either RPM_ACTIVE, or
+  RPM_GRACE), which means that the device is fully operational.  Thus, the
+  device bus type's ->runtime_resume() callback, when it is about to return
+  success, _must_ _ensure_ that this really is the case (i.e. when it returns
+  success, the device _must_ be able to carry out I/O operations as needed).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_SUSPENDED.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to carry out any run-time PM operations
+  for it until the status is cleared by its bus type or driver with the help of
+  either pm_runtime_clear_active(), or pm_runtime_clear_suspended().
+Finally, pm_runtime_resume() returns the error code (or success) returned by
+the device bus type's ->runtime_resume() callback.  If the device's bus type
+doesn't implement ->runtime_resume(), -EINVAL is returned and the device's
+run-time PM status is set to RPM_ERROR.
+
+pm_runtime_release() is used to clear the RPM_GRACE bit in the device's run-time
+PM status field.  This bit, if set, causes the PM core to refuse to suspend
+the device or to queue up a suspend request for it.  In particular, it causes
+pm_runtime_suspend() to return -EAGAIN without doing anything else.  This may
+be useful if the device is resumed for a specific task and it shouldn't be
+suspended until the task is complete, but there are many potential sources of
+suspend requests that could disturb it.
+
+pm_cancel_runtime_suspend() is used to cancel a pending suspend request for an
+active device, but it can only be called when the run-time PM of the device
+is disabled.  It is supposed to be used during system-wide power transitions.
+
+pm_cancel_runtime_resume() is used to cancel a pending resume request for
+a suspended device.  It can only be called when the run-time PM of the device
+is disabled and it is supposed to be used during system-wide power transitions.
+
+pm_suspend_check_children() is used to set or unset the
+'power.suspend_skip_children' flag in 'struct device'.  If the 'enable'
+argument is 'true', the field is set to 0, and if 'enable' is 'false', the field
+is set to 1.  The default value of 'power.suspend_skip_children', as set by
+pm_runtime_init(), is 0.
+
+pm_runtime_clear_active() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_ACTIVE.
+
+pm_runtime_clear_suspended() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_SUSPENDED.
+
+3. Device Run-time PM Callbacks
+
+There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend() callback is executed by pm_runtime_suspend() for the bus
+type of the device being suspended.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_suspend() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the bus type's ->runtime_suspend() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_suspend() callback has returned successfully,
+  the PM core regards the device as suspended, which need not mean that the
+  device has been put into a low power state.  It is supposed to mean, however,
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_ACTIVE, which means that the device
+  _must_ be fully operational once this has happened.
+* If the bus type's ->runtime_suspend() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+In particular, it is recommended that ->runtime_suspend() return -EBUSY or
+-EAGAIN if device_may_wakeup() returns 'false' for the device.  On the other
+hand, if device_may_wakeup() returns 'true' for the device and the device is put
+into a low power state during the execution of ->runtime_suspend(), it is
+expected that remote wake-up (i.e. hardware mechanism allowing the device to
+request a change of its power state, such as PCI PME) will be enabled for the
+device.  Generally, remote wake-up should be enabled whenever the device is put
+into a low power state at run time and is expected to receive input from the
+outside of the system.
+
+The ->runtime_resume() callback is executed by pm_runtime_resume() for the bus
+type of the device being woken up.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_resume() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the bus type's ->runtime_resume() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_resume() callback has returned successfully,
+  the PM core regards the device as fully operational, which means that the
+  device _must_ be able to complete I/O operations as needed.
+* If the bus type's ->runtime_resume() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_SUSPENDED, which is supposed to mean
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_resume() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_idle() callback is executed by pm_runtime_suspend() for the bus
+type of a device the children of which are all suspended (or which has the
+'power.suspend_skip_children' flag set).  The action carried out by this
+callback is totally dependent on the bus type in question, but the expected
+action is to check if the device can be suspended (i.e. if all of the conditions
+necessary for suspending the device are met) and to queue up a suspend request
+for the device if that is the case.
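
The 'power.depth' handling described in Section 2 of the documentation
above can be illustrated with a small user-space model.  This is only a
sketch and not code from the patch: struct rpm_model and the depth_*()
names are hypothetical, C11 atomics stand in for the kernel's atomic_t
helpers, and the waiting for operations in progress that
pm_runtime_disable() does is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the 'power.depth' field of struct device. */
struct rpm_model {
	atomic_int depth;
};

/* Like pm_runtime_init(): run-time PM starts out disabled (depth == 1). */
static void depth_init(struct rpm_model *m)
{
	atomic_init(&m->depth, 1);
}

/* Like pm_runtime_disable(), without waiting for work in progress. */
static void depth_disable(struct rpm_model *m)
{
	atomic_fetch_add(&m->depth, 1);
}

/*
 * Like pm_runtime_enable(): decrement depth unless it is already 0.
 * Returns false for an unbalanced call, where the kernel code warns.
 */
static bool depth_enable(struct rpm_model *m)
{
	int old = atomic_load(&m->depth);

	while (old != 0) {
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&m->depth, &old, old - 1))
			return true;
	}
	return false;
}

/* Entry check used by the suspend/resume helpers: -EBUSY while disabled. */
static int depth_check(struct rpm_model *m)
{
	if (atomic_load(&m->depth) > 0)
		return -EBUSY;
	return 0;	/* the real helper would carry on from here */
}
```

In this model two depth_disable() calls in a row need two depth_enable()
calls before depth_check() stops returning -EBUSY, which is the balancing
rule the documentation states for pm_runtime_disable() and
pm_runtime_enable().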



* [patch update 2] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-16 14:30           ` Alan Stern
  (?)
@ 2009-06-16 21:30           ` Rafael J. Wysocki
  -1 siblings, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-16 21:30 UTC (permalink / raw)
  To: Alan Stern; +Cc: Greg KH, LKML, ACPI Devel Maling List, linux-pm, Ingo Molnar

On Tuesday 16 June 2009, Alan Stern wrote:
> On Tue, 16 Jun 2009, Rafael J. Wysocki wrote:
> > > Since pm_runtime_resume() takes care of powering up the parent, there's
> > > no need for pm_request_resume() to worry about it also.
> >
> > But still it won't hurt to do it IMO, because the parents are then going
> > to be resumed before our pm_runtime_resume() is called.
>
> It's extra code that isn't needed.  In essence, you are trading code
> space for a shorter runtime stack.

That's correct.  I think the code size increase is small and it's better to 
keep the stack as small as reasonably possible.

> > > The documentation should mention that the runtime_suspend method is
> > > supposed to enable remote wakeup if it is available and if
> > > device_may_wakeup(dev) is true.
> >
> > Well, I thought that was obvious. :-)
>
> Sometimes it doesn't hurt to state the obvious!  :-)

Sure.

In the meantime I updated the patch once again.  I addressed your last 
comments in this version and added the possibility to resume with blocking
suspend (i.e. after such a resume pm_runtime_suspend() and pm_request_suspend()
will return immediately until a special function is called).
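
The "resume with blocking suspend" behaviour can be sketched as a user-space model of the status bits. It is a simplification under stated assumptions: 'struct model_dev' and the model_*() functions are hypothetical, no callbacks, locking or workqueues are involved, and only the status checks from the patch are mirrored.

```c
#include <assert.h>
#include <errno.h>

/* Status bits as defined by the patch in include/linux/pm.h. */
#define RPM_ACTIVE	0
#define RPM_IDLE	0x01
#define RPM_SUSPENDING	0x02
#define RPM_SUSPENDED	0x04
#define RPM_WAKE	0x08
#define RPM_RESUMING	0x10
#define RPM_GRACE	0x20

#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING | RPM_GRACE)

/* Hypothetical stand-in for the run-time PM status in struct dev_pm_info. */
struct model_dev {
	unsigned int runtime_status;
};

/* Models the early status checks of __pm_runtime_suspend(). */
static int model_suspend(struct model_dev *dev)
{
	if (dev->runtime_status & RPM_SUSPENDED)
		return 0;
	if (dev->runtime_status & RPM_NO_SUSPEND)
		return -EAGAIN;	/* grace period: refuse to suspend */

	dev->runtime_status = RPM_SUSPENDED;
	return 0;
}

/* Models pm_runtime_resume_grace() assuming the resume itself succeeds. */
static int model_resume_grace(struct model_dev *dev)
{
	if (!(dev->runtime_status & ~RPM_GRACE))
		return 0;	/* already active, status is not changed */

	/* Idle or suspended: active again, with the grace bit set. */
	dev->runtime_status = RPM_GRACE;
	return 0;
}

/* Models pm_runtime_release(): RPM_GRACE becomes RPM_ACTIVE, else no-op. */
static void model_release(struct model_dev *dev)
{
	if (dev->runtime_status == RPM_GRACE)
		dev->runtime_status = RPM_ACTIVE;
}
```

In this model a suspended device that goes through model_resume_grace() rejects model_suspend() with -EAGAIN until model_release() has been called, after which suspending works again.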

I also fixed a couple of bugs. :-)

Best,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM: Introduce core framework for run-time PM of I/O devices

Introduce a core framework for run-time power management of I/O
devices.  Add device run-time PM fields to 'struct dev_pm_info'
and device run-time PM callbacks to 'struct dev_pm_ops'.  Introduce
a run-time PM workqueue and define some device run-time PM helper
functions at the core level.  Document all these things.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/runtime_pm.txt |  311 +++++++++++++++++++++++
 drivers/base/dd.c                  |    9 
 drivers/base/power/Makefile        |    1 
 drivers/base/power/main.c          |    5 
 drivers/base/power/runtime.c       |  499 +++++++++++++++++++++++++++++++++++++
 include/linux/pm.h                 |   97 ++++++-
 include/linux/pm_runtime.h         |  112 ++++++++
 kernel/power/Kconfig               |   14 +
 kernel/power/main.c                |   17 +
 9 files changed, 1062 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/power/Kconfig
===================================================================
--- linux-2.6.orig/kernel/power/Kconfig
+++ linux-2.6/kernel/power/Kconfig
@@ -208,3 +208,17 @@ config APM_EMULATION
 	  random kernel OOPSes or reboots that don't seem to be related to
 	  anything, try disabling/enabling this option (or disabling/enabling
 	  APM in your BIOS).
+
+config PM_RUNTIME
+	bool "Run-time PM core functionality"
+	depends on PM
+	---help---
+	  Enable functionality allowing I/O devices to be put into energy-saving
+	  (low power) states at run time (or autosuspended) after a specified
+	  period of inactivity and woken up in response to a hardware-generated
+	  wake-up event or a driver's request.
+
+	  Hardware support is generally required for this functionality to work
+	  and the bus type drivers of the buses the devices are on are
+	  responsible for the actual handling of the autosuspend requests and
+	  wake-up events.
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -11,6 +11,7 @@
 #include <linux/kobject.h>
 #include <linux/string.h>
 #include <linux/resume-trace.h>
+#include <linux/workqueue.h>
 
 #include "power.h"
 
@@ -217,8 +218,24 @@ static struct attribute_group attr_group
 	.attrs = g,
 };
 
+#ifdef CONFIG_PM_RUNTIME
+struct workqueue_struct *pm_wq;
+
+static int __init pm_start_workqueue(void)
+{
+	pm_wq = create_freezeable_workqueue("pm");
+
+	return pm_wq ? 0 : -ENOMEM;
+}
+#else
+static inline int pm_start_workqueue(void) { return 0; }
+#endif
+
 static int __init pm_init(void)
 {
+	int error = pm_start_workqueue();
+	if (error)
+		return error;
 	power_kobj = kobject_create_and_add("power", NULL);
 	if (!power_kobj)
 		return -ENOMEM;
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -22,6 +22,9 @@
 #define _LINUX_PM_H
 
 #include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -165,6 +168,26 @@ typedef struct pm_message {
  * It is allowed to unregister devices while the above callbacks are being
  * executed.  However, it is not allowed to unregister a device from within any
  * of its own callbacks.
+ *
+ * There also are the following callbacks related to run-time power management
+ * of devices:
+ *
+ * @runtime_suspend: Prepare the device for a condition in which it won't be
+ *	able to communicate with the CPU(s) and RAM due to power management.
+ *	This need not mean that the device should be put into a low power state,
+ *	like for example when the device is behind a link, represented by a
+ *	separate device object, that is going to be turned off for power
+ *	management purposes.
+ *
+ * @runtime_resume: Put the device into the fully active state in response to a
+ *	wake-up event generated by hardware or at a request of software.  If
+ *	necessary, put the device into the full power state and restore its
+ *	registers, so that it is fully operational.
+ *
+ * @runtime_idle: Device appears to be inactive and it might be put into a low
+ *	power state if all of the necessary conditions are satisfied.  Check
+ *	these conditions and handle the device as appropriate, possibly queueing
+ *	a suspend request for it.
  */
 
 struct dev_pm_ops {
@@ -182,6 +205,9 @@ struct dev_pm_ops {
 	int (*thaw_noirq)(struct device *dev);
 	int (*poweroff_noirq)(struct device *dev);
 	int (*restore_noirq)(struct device *dev);
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
 };
 
 /**
@@ -315,14 +341,79 @@ enum dpm_state {
 	DPM_OFF_IRQ,
 };
 
+/**
+ * Device run-time power management state.
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.  They do not
+ * reflect the actual power state of the device or its status as seen by the
+ * driver.
+ *
+ * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
+ *			pending for it.
+ *
+ * RPM_IDLE		It has been requested that the device be suspended.
+ *			Suspend request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
+ *			executed.
+ *
+ * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
+ *			completed successfully.  The device is regarded as
+ *			suspended.
+ *
+ * RPM_WAKE		It has been requested that the device be woken up.
+ *			Resume request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
+ *			executed.
+ *
+ * RPM_ERROR		Represents a condition from which the PM core cannot
+ *			recover by itself.  If the device's run-time PM status
+ *			field has this value, all of the run-time PM operations
+ *			carried out for the device by the core will fail, until
+ *			the status field is changed to either RPM_ACTIVE or
+ *			RPM_SUSPENDED (it is not valid to use the other values
+ *			in such a situation) by the device's driver or bus type.
+ *			This happens when the device bus type's
+ *			->runtime_suspend() or ->runtime_resume() callback
+ *			returns error code different from -EAGAIN or -EBUSY.
+ */
+
+#define RPM_ACTIVE	0
+#define RPM_IDLE	0x01
+#define RPM_SUSPENDING	0x02
+#define RPM_SUSPENDED	0x04
+#define RPM_WAKE	0x08
+#define RPM_RESUMING	0x10
+#define RPM_GRACE	0x20
+#define RPM_ERROR	(-1)
+
+#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
+#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)
+#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING | RPM_GRACE)
+#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)
+
 struct dev_pm_info {
 	pm_message_t		power_state;
-	unsigned		can_wakeup:1;
-	unsigned		should_wakeup:1;
+	unsigned int		can_wakeup:1;
+	unsigned int		should_wakeup:1;
 	enum dpm_state		status;		/* Owned by the PM core */
-#ifdef	CONFIG_PM_SLEEP
+#ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
 #endif
+#ifdef CONFIG_PM_RUNTIME
+	struct delayed_work	runtime_work;
+	struct completion	work_done;
+	unsigned int		suspend_skip_children:1;
+	unsigned int		suspend_aborted:1;
+	unsigned int		runtime_status:6;
+	int			runtime_error;
+	atomic_t		depth;
+	spinlock_t		lock;
+#endif
 };
 
 /*
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o
+obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
Index: linux-2.6/drivers/base/power/runtime.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/runtime.c
@@ -0,0 +1,499 @@
+/*
+ * drivers/base/power/runtime.c - Helper functions for device run-time PM
+ *
+ * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/pm_runtime.h>
+#include <linux/jiffies.h>
+
+/**
+ * __pm_runtime_change_status - Change the run-time PM status of a device.
+ * @dev: Device to handle.
+ * @status: Expected current run-time PM status of the device.
+ * @new_status: New value of the device's run-time PM status.
+ *
+ * Change the run-time PM status of the device to @new_status if its current
+ * value is equal to @status.
+ */
+void __pm_runtime_change_status(struct device *dev, unsigned int status,
+				unsigned int new_status)
+{
+	unsigned long flags;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == status)
+		dev->power.runtime_status = new_status;
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_change_status);
+
+/**
+ * pm_device_suspended - Check if given device has been suspended at run time.
+ * @dev: Device to check.
+ * @data: Ignored.
+ *
+ * Returns 0 if the device has been suspended and it hasn't been requested to
+ * resume or -EBUSY otherwise.
+ */
+static int pm_device_suspended(struct device *dev, void *data)
+{
+	return dev->power.runtime_status == RPM_SUSPENDED ? 0 : -EBUSY;
+}
+
+/**
+ * pm_check_children - Check if all children of a device have been suspended.
+ * @dev: Device to check.
+ *
+ * Returns 0 if all children of the device have been suspended or -EBUSY
+ * otherwise.
+ */
+static int pm_check_children(struct device *dev)
+{
+	return dev->power.suspend_skip_children ? 0 :
+			device_for_each_child(dev, NULL, pm_device_suspended);
+}
+
+/**
+ * pm_runtime_notify_idle - Run a device bus type's runtime_idle() callback.
+ * @dev: Device to notify.
+ *
+ * Check if all children of given device are suspended and call the device bus
+ * type's ->runtime_idle() callback if that's the case.
+ */
+static void pm_runtime_notify_idle(struct device *dev)
+{
+	if (atomic_read(&dev->power.depth) > 0 || pm_check_children(dev))
+		return;
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_idle)
+		dev->bus->pm->runtime_idle(dev);
+}
+
+/**
+ * __pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
+ * @dev: Device to suspend.
+ * @sync: If unset, the function has been called via pm_wq.
+ *
+ * Check if the status of the device is appropriate and run the
+ * ->runtime_suspend() callback provided by the device's bus type driver.
+ * Update the run-time PM flags in the device object to reflect the current
+ * status of the device.
+ */
+int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	int error = -EINVAL;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out;
+	} else if (dev->power.runtime_status & RPM_SUSPENDED) {
+		error = 0;
+		goto out;
+	} else if ((dev->power.runtime_status & RPM_NO_SUSPEND)
+	    || (!sync && dev->power.suspend_aborted)) {
+		/*
+		 * Device is resuming or in a post-resume grace period or
+		 * there's a resume request pending, or a pending suspend
+		 * request has just been cancelled and we're running as a result
+		 * of this request.
+		 */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+
+		/*
+		 * Another suspend is running in parallel with us.  Wait for it
+		 * to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	} else if (pm_check_children(dev)) {
+		/*
+		 * We can only suspend the device if all of its children have
+		 * been suspended.
+		 */
+		dev->power.runtime_status = RPM_ACTIVE;
+		error = -EAGAIN;
+		goto out;
+	}
+
+	dev->power.runtime_status = RPM_SUSPENDING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_suspend)
+		error = dev->bus->pm->runtime_suspend(dev);
+
+	spin_lock(&dev->power.lock);
+
+	/*
+	 * Resume request might have been queued in the meantime, in which case
+	 * the RPM_WAKE bit is also set in runtime_status.
+	 */
+	dev->power.runtime_status &= ~RPM_SUSPENDING;
+	switch (error) {
+	case 0:
+		dev->power.runtime_status |= RPM_SUSPENDED;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+	if (!error && !(dev->power.runtime_status & RPM_WAKE) && dev->parent) {
+		spin_unlock(&dev->power.lock);
+
+		pm_runtime_notify_idle(dev->parent);
+
+		return 0;
+	}
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_suspend);
+
+/**
+ * pm_runtime_suspend_work - Run __pm_runtime_suspend() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the suspend has been scheduled for and
+ * run __pm_runtime_suspend() for it.
+ */
+static void pm_runtime_suspend_work(struct work_struct *work)
+{
+	__pm_runtime_suspend(pm_work_to_device(work), false);
+}
+
+/**
+ * pm_request_suspend - Schedule run-time suspend of given device.
+ * @dev: Device to suspend.
+ * @msec: Time, in milliseconds, to wait before attempting to suspend the device.
+ */
+void pm_request_suspend(struct device *dev, unsigned int msec)
+{
+	unsigned long flags;
+	unsigned long delay = msecs_to_jiffies(msec);
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ACTIVE)
+		goto out;
+
+	dev->power.runtime_status = RPM_IDLE;
+	dev->power.suspend_aborted = false;
+	INIT_DELAYED_WORK(&dev->power.runtime_work, pm_runtime_suspend_work);
+	queue_delayed_work(pm_wq, &dev->power.runtime_work, delay);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_suspend);
+
+/**
+ * pm_cancel_suspend - Cancel a pending suspend request for given device.
+ * @dev: Device to cancel the suspend request for.
+ */
+static void pm_cancel_suspend(struct device *dev)
+{
+	cancel_delayed_work(&dev->power.runtime_work);
+	dev->power.runtime_status &= RPM_GRACE;
+	dev->power.suspend_aborted = true;
+}
+
+/**
+ * __pm_runtime_resume - Run a device bus type's runtime_resume() callback.
+ * @dev: Device to resume.
+ * @grace: If set, force a post-resume grace period.
+ *
+ * Check if the device is really suspended and run the ->runtime_resume()
+ * callback provided by the device's bus type driver.  Update the run-time PM
+ * flags in the device object to reflect the current status of the device.  If
+ * runtime suspend is in progress while this function is being run, wait for it
+ * to finish before resuming the device.  If runtime suspend is scheduled, but
+ * it hasn't started yet, cancel it and we're done.
+ */
+int __pm_runtime_resume(struct device *dev, bool grace)
+{
+	int error = -EINVAL;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	if (dev->parent)
+		spin_lock(&dev->parent->power.lock);
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out_unlock;
+	} else if (!(dev->power.runtime_status & ~RPM_GRACE)) {
+		/* Device is active or in a post-resume grace period. */
+		error = 0;
+		goto out_unlock;
+	} else if (dev->power.runtime_status == RPM_IDLE) {
+		/* ->runtime_suspend() hasn't started yet, no need to resume. */
+		pm_cancel_suspend(dev);
+		if (grace)
+			dev->power.runtime_status |= RPM_GRACE;
+		error = 0;
+		goto out_unlock;
+	}
+
+	if (dev->power.runtime_status & RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * A suspend is running in parallel with us.  Wait for it to
+		 * complete and repeat.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		goto repeat;
+	} else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
+	    && (dev->parent->power.runtime_status & ~RPM_GRACE)) {
+		spin_unlock(&dev->power.lock);
+		spin_unlock(&dev->parent->power.lock);
+
+		/* The device's parent is not active.  Resume it and repeat. */
+		error = __pm_runtime_resume(dev->parent, false);
+		if (error)
+			return error;
+
+		goto repeat;
+	}
+
+	if (dev->power.runtime_status == RPM_RESUMING) {
+		if (grace)
+			dev->power.runtime_status |= RPM_GRACE;
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * There's another resume running in parallel with us. Wait for
+		 * it to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	/* The RPM_GRACE bit may be set in runtime_status. */
+	dev->power.runtime_status &= ~(RPM_WAKE | RPM_SUSPENDED);
+	dev->power.runtime_status |= RPM_RESUMING;
+	if (grace)
+		dev->power.runtime_status |= RPM_GRACE;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_resume)
+		error = dev->bus->pm->runtime_resume(dev);
+
+	spin_lock(&dev->power.lock);
+
+	dev->power.runtime_status &= ~RPM_RESUMING;
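+	/* On success the status is RPM_ACTIVE, possibly with RPM_GRACE set. */
+	if (error == -EAGAIN || error == -EBUSY) {
+		/* The device could not be resumed now; it stays suspended. */
+		dev->power.runtime_status = RPM_SUSPENDED;
+	} else if (error) {
+		/* Any other error is unrecoverable for the PM core. */
+		dev->power.runtime_status = RPM_ERROR;
+	}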
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+
+ out_unlock:
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+	goto out;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_resume);
+
+/**
+ * pm_runtime_resume_work - Run __pm_runtime_resume() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the resume has been scheduled for and run
+ * __pm_runtime_resume() for it without forcing a grace period after the resume.
+ */
+static void pm_runtime_resume_work(struct work_struct *work)
+{
+	__pm_runtime_resume(pm_work_to_device(work), false);
+}
+
+/**
+ * __pm_request_resume - Schedule run-time resume of given device.
+ * @dev: Device to resume.
+ * @grace: If set, force a post-resume grace period.
+ */
+void __pm_request_resume(struct device *dev, bool grace)
+{
+	unsigned long parent_flags = 0, flags;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	if (dev->parent)
+		spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		/* Autosuspend request is pending, no need to resume. */
+		pm_cancel_suspend(dev);
+		if (grace)
+			dev->power.runtime_status |= RPM_GRACE;
+		goto out;
+	} else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
+		goto out;
+	} else if (dev->parent
+	    && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+
+		/* The parent is suspending, suspended or idle. Wake it up. */
+		__pm_request_resume(dev->parent, false);
+
+		goto repeat;
+	}
+
+	/*
+	 * The device may be suspending at the moment and we can't clear the
+	 * RPM_SUSPENDING bit in its runtime_status just yet.
+	 */
+	dev->power.runtime_status |= RPM_WAKE;
+	if (grace)
+		dev->power.runtime_status |= RPM_GRACE;
+	INIT_WORK(&dev->power.runtime_work.work, pm_runtime_resume_work);
+	queue_work(pm_wq, &dev->power.runtime_work.work);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (dev->parent)
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+}
+EXPORT_SYMBOL_GPL(__pm_request_resume);
+
+/**
+ * pm_cancel_runtime_suspend - Cancel a pending suspend request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_suspend(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		cancel_delayed_work(&dev->power.runtime_work);
+		dev->power.runtime_status = RPM_ACTIVE;
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_suspend);
+
+/**
+ * pm_cancel_runtime_resume - Cancel a pending resume request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_resume(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status & RPM_WAKE) {
+		work_clear_pending(&dev->power.runtime_work.work);
+		dev->power.runtime_status &= ~(RPM_WAKE | RPM_GRACE);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_resume);
+
+/**
+ * pm_runtime_disable - Disable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Increase the depth field in the device's dev_pm_info structure, which will
+ * cause the run-time PM functions above to return without doing anything.
+ * If there is a run-time PM operation in progress, wait for it to complete.
+ */
+void pm_runtime_disable(struct device *dev)
+{
+	might_sleep();
+
+	atomic_inc(&dev->power.depth);
+
+	if (dev->power.runtime_status & RPM_IN_PROGRESS)
+		wait_for_completion(&dev->power.work_done);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_disable);
+
+/**
+ * pm_runtime_enable - Enable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Enable run-time power management for given device by decreasing the depth
+ * field in its dev_pm_info structure.
+ */
+void pm_runtime_enable(struct device *dev)
+{
+	if (!atomic_add_unless(&dev->power.depth, -1, 0))
+		dev_warn(dev, "PM: Excessive pm_runtime_enable()!\n");
+}
+EXPORT_SYMBOL_GPL(pm_runtime_enable);
+
+/**
+ * pm_runtime_init - Initialize run-time PM fields in given device object.
+ * @dev: Device object to handle.
+ */
+void pm_runtime_init(struct device *dev)
+{
+	spin_lock_init(&dev->power.lock);
+	dev->power.runtime_status = RPM_ACTIVE;
+	atomic_set(&dev->power.depth, 1);
+	pm_suspend_check_children(dev, true);
+}
Index: linux-2.6/include/linux/pm_runtime.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pm_runtime.h
@@ -0,0 +1,112 @@
+/*
+ * pm_runtime.h - Device run-time power management helper functions.
+ *
+ * Copyright (C) 2009 Rafael J. Wysocki <rjw@sisk.pl>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_RUNTIME_H
+#define _LINUX_PM_RUNTIME_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+#ifdef CONFIG_PM_RUNTIME
+
+extern struct workqueue_struct *pm_wq;
+
+extern void pm_runtime_init(struct device *dev);
+extern void __pm_runtime_change_status(struct device *dev, unsigned int status,
+				       unsigned int new_status);
+extern int __pm_runtime_suspend(struct device *dev, bool sync);
+extern void pm_request_suspend(struct device *dev, unsigned int msec);
+extern int __pm_runtime_resume(struct device *dev, bool grace);
+extern void __pm_request_resume(struct device *dev, bool grace);
+extern void pm_cancel_runtime_suspend(struct device *dev);
+extern void pm_cancel_runtime_resume(struct device *dev);
+extern void pm_runtime_disable(struct device *dev);
+extern void pm_runtime_enable(struct device *dev);
+
+static inline struct device *pm_work_to_device(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(dw, struct dev_pm_info, runtime_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+	dev->power.suspend_skip_children = !enable;
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+static inline void pm_runtime_init(struct device *dev) {}
+static inline void __pm_runtime_change_status(struct device *dev,
+					      unsigned int status,
+					      unsigned int new_status) {}
+static inline int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	return -ENOSYS;
+}
+static inline void pm_request_suspend(struct device *dev, unsigned int msec) {}
+static inline int __pm_runtime_resume(struct device *dev, bool grace)
+{
+	return -ENOSYS;
+}
+static inline void __pm_request_resume(struct device *dev, bool grace) {}
+static inline void pm_cancel_runtime_suspend(struct device *dev) {}
+static inline void pm_cancel_runtime_resume(struct device *dev) {}
+static inline void pm_runtime_disable(struct device *dev) {}
+static inline void pm_runtime_enable(struct device *dev) {}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+}
+
+#endif /* !CONFIG_PM_RUNTIME */
+
+static inline int pm_runtime_suspend(struct device *dev)
+{
+	return __pm_runtime_suspend(dev, true);
+}
+
+static inline int pm_runtime_resume(struct device *dev)
+{
+	return __pm_runtime_resume(dev, false);
+}
+
+static inline int pm_runtime_resume_grace(struct device *dev)
+{
+	return __pm_runtime_resume(dev, true);
+}
+
+static inline void pm_request_resume(struct device *dev)
+{
+	__pm_request_resume(dev, false);
+}
+
+static inline void pm_request_resume_grace(struct device *dev)
+{
+	__pm_request_resume(dev, true);
+}
+
+static inline void pm_runtime_clear_active(struct device *dev)
+{
+	__pm_runtime_change_status(dev, RPM_ERROR, RPM_ACTIVE);
+}
+
+static inline void pm_runtime_clear_suspended(struct device *dev)
+{
+	__pm_runtime_change_status(dev, RPM_ERROR, RPM_SUSPENDED);
+}
+
+static inline void pm_runtime_release(struct device *dev)
+{
+	__pm_runtime_change_status(dev, RPM_GRACE, RPM_ACTIVE);
+}
+
+#endif
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -21,6 +21,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
@@ -88,6 +89,7 @@ void device_pm_add(struct device *dev)
 	}
 
 	list_add_tail(&dev->power.entry, &dpm_list);
+	pm_runtime_init(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -507,6 +509,7 @@ static void dpm_complete(pm_message_t st
 		get_device(dev);
 		if (dev->power.status > DPM_ON) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			mutex_unlock(&dpm_list_mtx);
 
 			device_complete(dev, state);
@@ -753,6 +756,7 @@ static int dpm_prepare(pm_message_t stat
 
 		get_device(dev);
 		dev->power.status = DPM_PREPARING;
+		pm_runtime_disable(dev);
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_prepare(dev, state);
@@ -760,6 +764,7 @@ static int dpm_prepare(pm_message_t stat
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			if (error == -EAGAIN) {
 				put_device(dev);
 				continue;
Index: linux-2.6/drivers/base/dd.c
===================================================================
--- linux-2.6.orig/drivers/base/dd.c
+++ linux-2.6/drivers/base/dd.c
@@ -23,6 +23,7 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
+#include <linux/pm_runtime.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -202,8 +203,12 @@ int driver_probe_device(struct device_dr
 	pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
 		 drv->bus->name, __func__, dev_name(dev), drv->name);
 
+	pm_runtime_disable(dev);
+
 	ret = really_probe(dev, drv);
 
+	pm_runtime_enable(dev);
+
 	return ret;
 }
 
@@ -306,6 +311,8 @@ static void __device_release_driver(stru
 
 	drv = dev->driver;
 	if (drv) {
+		pm_runtime_disable(dev);
+
 		driver_sysfs_remove(dev);
 
 		if (dev->bus)
@@ -320,6 +327,8 @@ static void __device_release_driver(stru
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
+
+		pm_runtime_enable(dev);
 	}
 }
 
Index: linux-2.6/Documentation/power/runtime_pm.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/power/runtime_pm.txt
@@ -0,0 +1,311 @@
+Run-time Power Management Framework for I/O Devices
+
+(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+1. Introduction
+
+The support for run-time power management (run-time PM) of I/O devices is
+provided at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queuing all work items related to run-time PM, because this allows
+  them to be synchronized with system-wide power transitions.  pm_wq is declared
+  in include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of run-time PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing run-time PM operations with one another.
+
+* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out run-time PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The device run-time PM fields defined in 'struct dev_pm_info', the helper
+functions and the run-time PM callbacks defined in 'struct dev_pm_ops' are
+described below.
+
+2. Run-time PM Helper Functions and Device Fields
+
+The following helper functions are defined in drivers/base/power/runtime.c
+and include/linux/pm_runtime.h:
+
+* void pm_runtime_init(struct device *dev);
+
+* void pm_runtime_enable(struct device *dev);
+* void pm_runtime_disable(struct device *dev);
+
+* int pm_runtime_suspend(struct device *dev);
+* void pm_request_suspend(struct device *dev, unsigned int msec);
+* int pm_runtime_resume(struct device *dev);
+* int pm_runtime_resume_grace(struct device *dev);
+* void pm_request_resume(struct device *dev);
+* void pm_request_resume_grace(struct device *dev);
+* void pm_runtime_release(struct device *dev);
+
+* void pm_cancel_runtime_suspend(struct device *dev);
+* void pm_cancel_runtime_resume(struct device *dev);
+
+* void pm_suspend_check_children(struct device *dev, bool enable);
+
+* void pm_runtime_clear_active(struct device *dev);
+* void pm_runtime_clear_suspended(struct device *dev);
+
+pm_runtime_init() initializes the run-time PM fields in the 'power' member of
+the device object.  It is called during the initialization of the device object,
+in drivers/base/power/main.c:device_pm_add().
+
+pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
+respectively, all of the run-time PM core operations.  They do it by decreasing
+and increasing, respectively, the 'power.depth' field of 'struct device'.  If
+the value of this field is greater than 0, pm_runtime_suspend(),
+pm_request_suspend(), pm_runtime_resume() and so on return immediately without
+doing anything and -EBUSY is returned by pm_runtime_suspend(),
+pm_runtime_resume() and pm_runtime_resume_grace().  Therefore, if
+pm_runtime_disable() is called several times in a row for the same device, it
+has to be balanced by the appropriate number of pm_runtime_enable() calls so
+that the other run-time PM core functions can be used for that device.  The
+initial value of 'power.depth', as set by pm_runtime_init(), is 1 (i.e. the
+run-time PM of the device is initially disabled).
+
+pm_runtime_disable() and pm_runtime_enable() are used by the device core to
+disable the run-time PM of the device temporarily during device probe and
+removal as well as during system-wide power transitions (i.e. system-wide
+suspend or hibernation, or resume from a system sleep state).
+
+pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+pm_runtime_resume_grace(), pm_request_resume(), and pm_request_resume_grace()
+use the 'power.runtime_status' and 'power.suspend_aborted' fields of
+'struct device' for mutual synchronization.  The 'power.runtime_status' field,
+called the device's run-time PM status in what follows, is set to RPM_ACTIVE by
+pm_runtime_init().
+
+pm_request_suspend() is used to queue up a suspend request for an active device.
+If the run-time PM status of the device (i.e. the value of the
+'power.runtime_status' field in 'struct device') is different from RPM_ACTIVE
+(i.e. the device is not active from the PM core standpoint), it returns
+immediately.  Otherwise, it changes the device's run-time PM status to RPM_IDLE
+and puts a request to suspend the device into pm_wq.  The 'msec' argument is
+used to specify the time to wait before the request will be completed, in
+milliseconds.  It is valid to call this function from interrupt context.
+
+pm_runtime_suspend() is used to carry out a run-time suspend of an active
+device.  It is called directly by a bus type or device driver.  An asynchronous
+version of it is called by the PM core, to complete a request queued up by
+pm_request_suspend().  The only difference between them is the handling of
+situations when a queued up suspend request has just been cancelled.  Apart from
+this, they work in the same way.
+* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field, 'power.runtime_status'), success is returned.
+* If the device is about to resume or is in a post-resume grace period (i.e. at
+  least one of the RPM_WAKE, RPM_RESUMING, and RPM_GRACE bits is set in the
+  device's run-time PM status field), -EAGAIN is returned.  -EAGAIN is also
+  returned if the function has been called via pm_wq as a result of a cancelled
+  suspend request (the 'power.suspend_aborted' field is used for this purpose).
+* If the device is suspending (i.e. its run-time PM status is RPM_SUSPENDING),
+  which means that another instance of pm_runtime_suspend() is running at the
+  same time for the same device, the function waits for the other instance to
+  complete and returns the error code (or success) returned by it.
+* If the device's children are not suspended and the
+  'power.suspend_skip_children' flag is not set for it, the device's run-time PM
+  status is set to RPM_ACTIVE and -EAGAIN is returned.
+If none of the above takes place, the device's run-time PM status is set to
+RPM_SUSPENDING and its bus type's ->runtime_suspend() callback is executed.
+This callback is responsible for handling the device as appropriate (for
+example, it may choose to execute the device driver's ->runtime_suspend()
+callback or to carry out any other suitable action depending on the bus type).
+* If it completes successfully, the RPM_SUSPENDED bit is set and the
+  RPM_SUSPENDING bit is cleared in the device's run-time PM status field.  Once
+  that has happened, the device is regarded by the PM core as suspended, but it
+  _need_ _not_ mean that the device has been put into a low power state.  What
+  really happens to the device at this point totally depends on its bus type (it
+  may depend on the device's driver if the bus type chooses to call it).
+  Additionally, if the device bus type's ->runtime_suspend() callback completes
+  successfully, the ->runtime_idle() callback of the parent's bus type is
+  executed for the device's parent, if there is one and if all of its children
+  are suspended (or the 'power.suspend_skip_children' flag is set for it).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_ACTIVE.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to carry out any run-time PM operations
+  for it until the status is cleared by its bus type or driver with the help of
+  either pm_runtime_clear_active(), or pm_runtime_clear_suspended().
+Finally, pm_runtime_suspend() returns the error code (or success) returned by
+the device bus type's ->runtime_suspend() callback.  If the device's bus type
+doesn't implement ->runtime_suspend(), -EINVAL is returned and the device's
+run-time PM status is set to RPM_ERROR.
+
+pm_request_resume() and pm_request_resume_grace() are used to queue up a resume
+request for a device that is suspended, suspending or has a suspend request
+pending.  The difference between them is that pm_request_resume_grace() causes
+the RPM_GRACE bit to be set in the device's run-time PM status field, which
+prevents the PM core from suspending the device or queueing up a suspend request
+for it until the RPM_GRACE bit is cleared with the help of pm_runtime_release().
+Apart from this, they work in the same way.
+* If a suspend request is pending for the device (i.e. the device's run-time PM
+  status is RPM_IDLE), it is cancelled, the 'power.suspend_aborted' flag is set
+  for the device, the RPM_IDLE bit is cleared in the device's run-time PM status
+  field and the function returns (pm_request_resume_grace() additionally sets
+  the RPM_GRACE bit in the device's run-time PM status field).
+* If the device is not suspended or suspending (i.e. neither the RPM_SUSPENDED
+  nor the RPM_SUSPENDING bit is set in the device's run-time PM status field),
+  the function returns.
+* If the device's parent is inactive (i.e. at least one of the RPM_IDLE,
+  RPM_SUSPENDING, and RPM_SUSPENDED bits is set in its run-time PM status
+  field), a resume request is (recursively) scheduled for the parent and the
+  function is restarted.
+If none of the above happens, the RPM_WAKE bit is set in the device's run-time
+PM status field and the request to execute pm_runtime_resume() is put into
+pm_wq.
+
+pm_runtime_resume() and pm_runtime_resume_grace() are used to carry out a
+run-time resume of a device that is suspended, suspending or has a suspend
+request pending.  They are called either by the PM core, to complete a request
+queued up by pm_request_resume(), or directly by a bus type or device driver.
+The difference between them is that pm_runtime_resume_grace() causes the
+RPM_GRACE bit to be set in the device's run-time PM status field, which prevents
+the PM core from suspending the device or queueing up a suspend request for it
+until the RPM_GRACE bit is cleared with the help of pm_runtime_release().  Apart
+from this, they work in the same way.
+* If the device is active (i.e. all of the bits in its run-time PM status are
+  clear, possibly except for RPM_GRACE), success is returned.
+* If there's a suspend request pending for the device (i.e. the device's
+  run-time PM status is RPM_IDLE), it is cancelled, the 'power.suspend_aborted'
+  flag is set for the device, the RPM_IDLE bit is cleared in its run-time PM
+  status field and the function returns success (pm_runtime_resume_grace()
+  additionally sets the RPM_GRACE bit in the device's run-time PM status field).
+* If the device is suspending (i.e. the RPM_SUSPENDING bit is set in its
+  run-time PM status field), the function waits for the suspend operation to
+  complete and restarts itself.
+* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field), the device's parent exists and is not active (i.e.
+  the parent's run-time PM status is not RPM_ACTIVE or RPM_GRACE), the parent is
+  resumed (recursively) and the function restarts itself.
+* If the device is resuming (i.e. the device's run-time PM status is
+  RPM_RESUMING), which means that another instance of pm_runtime_resume() is
+  running at the same time for the same device, the function waits for the other
+  instance to complete and returns the result returned by it.
+If none of the above happens, the RPM_WAKE and RPM_SUSPENDED bits are cleared
+and the RPM_RESUMING bit is set in the device's run-time PM status field.  Next,
+the device bus type's ->runtime_resume() callback is executed, which is
+responsible for handling the device as appropriate (for example, it may choose
+to execute the device driver's ->runtime_resume() callback or to carry out any
+other suitable action depending on the bus type).
+* If it completes successfully, the device's run-time PM status is set to
+  'active' (i.e. the device's run-time PM status field is either RPM_ACTIVE, or
+  RPM_GRACE), which means that the device is fully operational.  Thus, the
+  device bus type's ->runtime_resume() callback, when it is about to return
+  success, _must_ _ensure_ that this really is the case (i.e. when it returns
+  success, the device _must_ be able to carry out I/O operations as needed).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_SUSPENDED.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to carry out any run-time PM operations
+  for it until the status is cleared by its bus type or driver with the help of
+  either pm_runtime_clear_active(), or pm_runtime_clear_suspended().
+Finally, pm_runtime_resume() returns the error code (or success) returned by
+the device bus type's ->runtime_resume() callback.  If the device's bus type
+doesn't implement ->runtime_resume(), -EINVAL is returned and the device's
+run-time PM status is set to RPM_ERROR.
+
+pm_runtime_release() is used to clear the RPM_GRACE bit in the device's run-time
+PM status field.  This bit, if set, causes the PM core to refuse to suspend
+the device or to queue up a suspend request for it.  In particular, it causes
+pm_runtime_suspend() to return -EAGAIN without doing anything else.  This may
+be useful if the device is resumed for a specific task and it shouldn't be
+suspended until the task is complete, but there are many potential sources of
+suspend requests that could disturb it.
+
+pm_cancel_runtime_suspend() is used to cancel a pending suspend request for an
+active device, but it can only be called when the run-time PM of the device
+is disabled.  It is supposed to be used during system-wide power transitions.
+
+pm_cancel_runtime_resume() is used to cancel a pending resume request for
+a suspended device.  It can only be called when the run-time PM of the device
+is disabled and it is supposed to be used during system-wide power transitions.
+
+pm_suspend_check_children() is used to set or unset the
+'power.suspend_skip_children' flag in 'struct device'.  If the 'enable'
+argument is 'true', the field is set to 0, and if 'enable' is 'false', the field
+is set to 1.  The default value of 'power.suspend_skip_children', as set by
+pm_runtime_init(), is 0.
+
+pm_runtime_clear_active() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_ACTIVE.
+
+pm_runtime_clear_suspended() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_SUSPENDED.
+
+3. Device Run-time PM Callbacks
+
+There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend() callback is executed by pm_runtime_suspend() for the bus
+type of the device being suspended.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_suspend() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the bus type's ->runtime_suspend() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_suspend() callback has returned successfully,
+  the PM core regards the device as suspended, which need not mean that the
+  device has been put into a low power state.  It is supposed to mean, however,
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_ACTIVE, which means that the device
+  _must_ be fully operational once this has happened.
+* If the bus type's ->runtime_suspend() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+In particular, it is recommended that ->runtime_suspend() return -EBUSY or
+-EAGAIN if device_may_wakeup() returns 'false' for the device.  On the other
+hand, if device_may_wakeup() returns 'true' for the device and the device is put
+into a low power state during the execution of ->runtime_suspend(), it is
+expected that remote wake-up (i.e. a hardware mechanism allowing the device to
+request a change of its power state, such as PCI PME) will be enabled for the
+device.  Generally, remote wake-up should be enabled whenever the device is put
+into a low power state at run time and is expected to receive input from the
+outside of the system.
+
+The ->runtime_resume() callback is executed by pm_runtime_resume() for the bus
+type of the device being woken up.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_resume() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the bus type's ->runtime_resume() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_resume() callback has returned successfully,
+  the PM core regards the device as fully operational, which means that the
+  device _must_ be able to complete I/O operations as needed.
+* If the bus type's ->runtime_resume() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_SUSPENDED, which is supposed to mean
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_resume() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_idle() callback is executed by pm_runtime_suspend() for the bus
+type of a device the children of which are all suspended (or which has the
+'power.suspend_skip_children' flag set).  The action carried out by this
+callback is totally dependent on the bus type in question, but the expected
+action is to check if the device can be suspended (i.e. if all of the conditions
+necessary for suspending the device are met) and to queue up a suspend request
+for the device if that is the case.
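
For illustration only, the checks that the documentation above says
pm_runtime_suspend() makes before invoking the bus type's ->runtime_suspend()
can be modelled in user space roughly as follows.  This is a sketch, not part
of the patch: 'struct model_dev', 'enum precheck' and model_suspend_precheck()
are hypothetical names, the RPM_* values are copied from include/linux/pm.h in
the patch, locking is omitted, and the RPM_ERROR case is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Status bits, as defined by the patch in include/linux/pm.h. */
#define RPM_ACTIVE	0
#define RPM_IDLE	0x01
#define RPM_SUSPENDING	0x02
#define RPM_SUSPENDED	0x04
#define RPM_WAKE	0x08
#define RPM_RESUMING	0x10
#define RPM_GRACE	0x20
#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING | RPM_GRACE)

/* Hypothetical stand-in for the run-time PM fields of 'struct dev_pm_info'. */
struct model_dev {
	unsigned int runtime_status;
	bool suspend_aborted;
	bool children_suspended;	/* or suspend_skip_children is set */
	int depth;			/* > 0 means run-time PM is disabled */
};

enum precheck {
	PRECHECK_RETURN,	/* return early with the value stored in *ret */
	PRECHECK_WAIT,		/* wait for a parallel suspend to complete */
	PRECHECK_PROCEED,	/* ->runtime_suspend() would be called next */
};

/* Apply the documented checks in order; @sync is false when run via pm_wq. */
static enum precheck model_suspend_precheck(struct model_dev *dev, bool sync,
					    int *ret)
{
	assert(dev && ret);

	if (dev->depth > 0) {
		*ret = -EBUSY;		/* run-time PM disabled */
		return PRECHECK_RETURN;
	}
	if (dev->runtime_status & RPM_SUSPENDED) {
		*ret = 0;		/* already suspended */
		return PRECHECK_RETURN;
	}
	if ((dev->runtime_status & RPM_NO_SUSPEND)
	    || (!sync && dev->suspend_aborted)) {
		*ret = -EAGAIN;		/* resume pending, grace, or cancelled */
		return PRECHECK_RETURN;
	}
	if (dev->runtime_status == RPM_SUSPENDING)
		return PRECHECK_WAIT;
	if (!dev->children_suspended) {
		dev->runtime_status = RPM_ACTIVE;
		*ret = -EAGAIN;
		return PRECHECK_RETURN;
	}
	dev->runtime_status = RPM_SUSPENDING;
	return PRECHECK_PROCEED;
}
```

The ordering mirrors the bullet list in Section 2 of the document; a device in
a grace period is rejected before the children are even looked at, which is
why pm_runtime_release() has to be called before a suspend can succeed.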

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-16 21:30           ` Rafael J. Wysocki
@ 2009-06-16 22:33             ` Rafael J. Wysocki
  2009-06-17 20:08                 ` Alan Stern
  2009-06-17 20:08               ` Alan Stern
  2009-06-16 22:33             ` Rafael J. Wysocki
  1 sibling, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-16 22:33 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Tuesday 16 June 2009, Rafael J. Wysocki wrote:
> On Tuesday 16 June 2009, Alan Stern wrote:
> > On Tue, 16 Jun 2009, Rafael J. Wysocki wrote:
> > > > Since pm_runtime_resume() takes care of powering up the parent, there's
> > > > no need for pm_request_resume() to worry about it also.
> > >
> > > But still it won't hurt to do it IMO, because the parents are then going
> > > to be resumed before our pm_runtime_resume() is called.
> >
> > It's extra code that isn't needed.  In essence, you are trading code
> > space for a shorter runtime stack.
> 
> That's correct.  I think the code size increase is small and it's better to 
> keep the stack as small as reasonably possible.
> 
> > > > The documentation should mention that the runtime_suspend method is
> > > > supposed to enable remote wakeup if it is available and if
> > > > device_may_wakeup(dev) is true.
> > >
> > > Well, I thought that was obvious. :-)
> >
> > Sometimes it doesn't hurt to state the obvious!  :-)
> 
> Sure.
> 
> In the meantime I updated the patch once again.  I addressed your last 
> comments in this version and added the possibility to resume with blocking
> suspend (i.e. after such a resume pm_runtime_suspend() and pm_request_suspend()
> will return immediately until a special function is called).
> 
> I also fixed a couple of bugs. :-)

Sorry for the broken patch.  My mailer started to wordwrap messages
automatically and I didn't notice.

The correct patch is appended.

Best,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM: Introduce core framework for run-time PM of I/O devices

Introduce a core framework for run-time power management of I/O
devices.  Add device run-time PM fields to 'struct dev_pm_info'
and device run-time PM callbacks to 'struct dev_pm_ops'.  Introduce
a run-time PM workqueue and define some device run-time PM helper
functions at the core level.  Document all these things.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/runtime_pm.txt |  311 +++++++++++++++++++++++
 drivers/base/dd.c                  |    9 
 drivers/base/power/Makefile        |    1 
 drivers/base/power/main.c          |    5 
 drivers/base/power/runtime.c       |  499 +++++++++++++++++++++++++++++++++++++
 include/linux/pm.h                 |   97 ++++++-
 include/linux/pm_runtime.h         |  112 ++++++++
 kernel/power/Kconfig               |   14 +
 kernel/power/main.c                |   17 +
 9 files changed, 1062 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/power/Kconfig
===================================================================
--- linux-2.6.orig/kernel/power/Kconfig
+++ linux-2.6/kernel/power/Kconfig
@@ -208,3 +208,17 @@ config APM_EMULATION
 	  random kernel OOPSes or reboots that don't seem to be related to
 	  anything, try disabling/enabling this option (or disabling/enabling
 	  APM in your BIOS).
+
+config PM_RUNTIME
+	bool "Run-time PM core functionality"
+	depends on PM
+	---help---
+	  Enable functionality allowing I/O devices to be put into energy-saving
+	  (low power) states at run time (or autosuspended) after a specified
+	  period of inactivity and woken up in response to a hardware-generated
+	  wake-up event or a driver's request.
+
+	  Hardware support is generally required for this functionality to work
+	  and the bus type drivers of the buses the devices are on are
+	  responsible for the actual handling of the autosuspend requests and
+	  wake-up events.
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -11,6 +11,7 @@
 #include <linux/kobject.h>
 #include <linux/string.h>
 #include <linux/resume-trace.h>
+#include <linux/workqueue.h>
 
 #include "power.h"
 
@@ -217,8 +218,24 @@ static struct attribute_group attr_group
 	.attrs = g,
 };
 
+#ifdef CONFIG_PM_RUNTIME
+struct workqueue_struct *pm_wq;
+
+static int __init pm_start_workqueue(void)
+{
+	pm_wq = create_freezeable_workqueue("pm");
+
+	return pm_wq ? 0 : -ENOMEM;
+}
+#else
+static inline int pm_start_workqueue(void) { return 0; }
+#endif
+
 static int __init pm_init(void)
 {
+	int error = pm_start_workqueue();
+	if (error)
+		return error;
 	power_kobj = kobject_create_and_add("power", NULL);
 	if (!power_kobj)
 		return -ENOMEM;
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -22,6 +22,9 @@
 #define _LINUX_PM_H
 
 #include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -165,6 +168,26 @@ typedef struct pm_message {
  * It is allowed to unregister devices while the above callbacks are being
  * executed.  However, it is not allowed to unregister a device from within any
  * of its own callbacks.
+ *
+ * There also are the following callbacks related to run-time power management
+ * of devices:
+ *
+ * @runtime_suspend: Prepare the device for a condition in which it won't be
+ *	able to communicate with the CPU(s) and RAM due to power management.
+ *	This need not mean that the device should be put into a low power state,
+ *	like for example when the device is behind a link, represented by a
+ *	separate device object, that is going to be turned off for power
+ *	management purposes.
+ *
+ * @runtime_resume: Put the device into the fully active state in response to a
+ *	wake-up event generated by hardware or at a request of software.  If
+ *	necessary, put the device into the full power state and restore its
+ *	registers, so that it is fully operational.
+ *
+ * @runtime_idle: Device appears to be inactive and it might be put into a low
+ *	power state if all of the necessary conditions are satisfied.  Check
+ *	these conditions and handle the device as appropriate, possibly queueing
+ *	a suspend request for it.
  */
 
 struct dev_pm_ops {
@@ -182,6 +205,9 @@ struct dev_pm_ops {
 	int (*thaw_noirq)(struct device *dev);
 	int (*poweroff_noirq)(struct device *dev);
 	int (*restore_noirq)(struct device *dev);
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
 };
 
 /**
@@ -315,14 +341,79 @@ enum dpm_state {
 	DPM_OFF_IRQ,
 };
 
+/**
+ * Device run-time power management state.
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.  They do not
+ * reflect the actual power state of the device or its status as seen by the
+ * driver.
+ *
+ * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
+ *			pending for it.
+ *
+ * RPM_IDLE		It has been requested that the device be suspended.
+ *			Suspend request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
+ *			executed.
+ *
+ * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
+ *			completed successfully.  The device is regarded as
+ *			suspended.
+ *
+ * RPM_WAKE		It has been requested that the device be woken up.
+ *			Resume request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
+ *			executed.
+ *
+ * RPM_ERROR		Represents a condition from which the PM core cannot
+ *			recover by itself.  If the device's run-time PM status
+ *			field has this value, all of the run-time PM operations
+ *			carried out for the device by the core will fail, until
+ *			the status field is changed to either RPM_ACTIVE or
+ *			RPM_SUSPENDED (it is not valid to use the other values
+ *			in such a situation) by the device's driver or bus type.
+ *			This happens when the device bus type's
+ *			->runtime_suspend() or ->runtime_resume() callback
+ *			returns an error code different from -EAGAIN or -EBUSY.
+ */
+
+#define RPM_ACTIVE	0
+#define RPM_IDLE	0x01
+#define RPM_SUSPENDING	0x02
+#define RPM_SUSPENDED	0x04
+#define RPM_WAKE	0x08
+#define RPM_RESUMING	0x10
+#define RPM_GRACE	0x20
+#define RPM_ERROR	(-1)
+
+#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
+#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)
+#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING | RPM_GRACE)
+#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)
+
 struct dev_pm_info {
 	pm_message_t		power_state;
-	unsigned		can_wakeup:1;
-	unsigned		should_wakeup:1;
+	unsigned int		can_wakeup:1;
+	unsigned int		should_wakeup:1;
 	enum dpm_state		status;		/* Owned by the PM core */
-#ifdef	CONFIG_PM_SLEEP
+#ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
 #endif
+#ifdef CONFIG_PM_RUNTIME
+	struct delayed_work	runtime_work;
+	struct completion	work_done;
+	unsigned int		suspend_skip_children:1;
+	unsigned int		suspend_aborted:1;
+	unsigned int		runtime_status:6;
+	int			runtime_error;
+	atomic_t		depth;
+	spinlock_t		lock;
+#endif
 };
 
 /*
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o
+obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
Index: linux-2.6/drivers/base/power/runtime.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/runtime.c
@@ -0,0 +1,499 @@
+/*
+ * drivers/base/power/runtime.c - Helper functions for device run-time PM
+ *
+ * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/pm_runtime.h>
+#include <linux/jiffies.h>
+
+/**
+ * __pm_runtime_change_status - Change the run-time PM status of a device.
+ * @dev: Device to handle.
+ * @status: Expected current run-time PM status of the device.
+ * @new_status: New value of the device's run-time PM status.
+ *
+ * Change the run-time PM status of the device to @new_status if its current
+ * value is equal to @status.
+ */
+void __pm_runtime_change_status(struct device *dev, unsigned int status,
+				unsigned int new_status)
+{
+	unsigned long flags;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == status)
+		dev->power.runtime_status = new_status;
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_change_status);
+
+/**
+ * pm_device_suspended - Check if given device has been suspended at run time.
+ * @dev: Device to check.
+ * @data: Ignored.
+ *
+ * Returns 0 if the device has been suspended and it hasn't been requested to
+ * resume or -EBUSY otherwise.
+ */
+static int pm_device_suspended(struct device *dev, void *data)
+{
+	return dev->power.runtime_status == RPM_SUSPENDED ? 0 : -EBUSY;
+}
+
+/**
+ * pm_check_children - Check if all children of a device have been suspended.
+ * @dev: Device to check.
+ *
+ * Returns 0 if all children of the device have been suspended or -EBUSY
+ * otherwise.
+ */
+static int pm_check_children(struct device *dev)
+{
+	return dev->power.suspend_skip_children ? 0 :
+			device_for_each_child(dev, NULL, pm_device_suspended);
+}
+
+/**
+ * pm_runtime_notify_idle - Run a device bus type's runtime_idle() callback.
+ * @dev: Device to notify.
+ *
+ * Check if all children of given device are suspended and call the device bus
+ * type's ->runtime_idle() callback if that's the case.
+ */
+static void pm_runtime_notify_idle(struct device *dev)
+{
+	if (atomic_read(&dev->power.depth) > 0 || pm_check_children(dev))
+		return;
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_idle)
+		dev->bus->pm->runtime_idle(dev);
+}
+
+/**
+ * __pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
+ * @dev: Device to suspend.
+ * @sync: If unset, the function has been called via pm_wq.
+ *
+ * Check if the status of the device is appropriate and run the
+ * ->runtime_suspend() callback provided by the device's bus type driver.
+ * Update the run-time PM flags in the device object to reflect the current
+ * status of the device.
+ */
+int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	int error = -EINVAL;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out;
+	} else if (dev->power.runtime_status & RPM_SUSPENDED) {
+		error = 0;
+		goto out;
+	} else if ((dev->power.runtime_status & RPM_NO_SUSPEND)
+	    || (!sync && dev->power.suspend_aborted)) {
+		/*
+		 * Device is resuming or in a post-resume grace period or
+		 * there's a resume request pending, or a pending suspend
+		 * request has just been cancelled and we're running as a result
+		 * of this request.
+		 */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+
+		/*
+		 * Another suspend is running in parallel with us.  Wait for it
+		 * to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	} else if (pm_check_children(dev)) {
+		/*
+		 * We can only suspend the device if all of its children have
+		 * been suspended.
+		 */
+		dev->power.runtime_status = RPM_ACTIVE;
+		error = -EAGAIN;
+		goto out;
+	}
+
+	dev->power.runtime_status = RPM_SUSPENDING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_suspend)
+		error = dev->bus->pm->runtime_suspend(dev);
+
+	spin_lock(&dev->power.lock);
+
+	/*
+	 * Resume request might have been queued in the meantime, in which case
+	 * the RPM_WAKE bit is also set in runtime_status.
+	 */
+	dev->power.runtime_status &= ~RPM_SUSPENDING;
+	switch (error) {
+	case 0:
+		dev->power.runtime_status |= RPM_SUSPENDED;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+	if (!error && !(dev->power.runtime_status & RPM_WAKE) && dev->parent) {
+		spin_unlock(&dev->power.lock);
+
+		pm_runtime_notify_idle(dev->parent);
+
+		return 0;
+	}
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_suspend);
+
+/**
+ * pm_runtime_suspend_work - Run __pm_runtime_suspend() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the suspend has been scheduled for and
+ * run __pm_runtime_suspend() for it with @sync unset.
+ */
+static void pm_runtime_suspend_work(struct work_struct *work)
+{
+	__pm_runtime_suspend(pm_work_to_device(work), false);
+}
+
+/**
+ * pm_request_suspend - Schedule run-time suspend of given device.
+ * @dev: Device to suspend.
+ * @msec: Time to wait before attempting to suspend the device, in milliseconds.
+ */
+void pm_request_suspend(struct device *dev, unsigned int msec)
+{
+	unsigned long flags;
+	unsigned long delay = msecs_to_jiffies(msec);
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ACTIVE)
+		goto out;
+
+	dev->power.runtime_status = RPM_IDLE;
+	dev->power.suspend_aborted = false;
+	INIT_DELAYED_WORK(&dev->power.runtime_work, pm_runtime_suspend_work);
+	queue_delayed_work(pm_wq, &dev->power.runtime_work, delay);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_suspend);
+
+/**
+ * pm_cancel_suspend - Cancel a pending suspend request for given device.
+ * @dev: Device to cancel the suspend request for.
+ */
+static void pm_cancel_suspend(struct device *dev)
+{
+	cancel_delayed_work(&dev->power.runtime_work);
+	dev->power.runtime_status &= RPM_GRACE;
+	dev->power.suspend_aborted = true;
+}
+
+/**
+ * __pm_runtime_resume - Run a device bus type's runtime_resume() callback.
+ * @dev: Device to resume.
+ * @grace: If set, force a post-resume grace period.
+ *
+ * Check if the device is really suspended and run the ->runtime_resume()
+ * callback provided by the device's bus type driver.  Update the run-time PM
+ * flags in the device object to reflect the current status of the device.  If
+ * runtime suspend is in progress while this function is being run, wait for it
+ * to finish before resuming the device.  If runtime suspend is scheduled, but
+ * it hasn't started yet, cancel it and we're done.
+ */
+int __pm_runtime_resume(struct device *dev, bool grace)
+{
+	int error = -EINVAL;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	if (dev->parent)
+		spin_lock(&dev->parent->power.lock);
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out_unlock;
+	} else if (!(dev->power.runtime_status & ~RPM_GRACE)) {
+		/* Device is active or in a post-resume grace period. */
+		error = 0;
+		goto out_unlock;
+	} else if (dev->power.runtime_status == RPM_IDLE) {
+		/* ->runtime_suspend() hasn't started yet, no need to resume. */
+		pm_cancel_suspend(dev);
+		if (grace)
+			dev->power.runtime_status |= RPM_GRACE;
+		error = 0;
+		goto out_unlock;
+	}
+
+	if (dev->power.runtime_status & RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * A suspend is running in parallel with us.  Wait for it to
+		 * complete and repeat.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		goto repeat;
+	} else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
+	    && (dev->parent->power.runtime_status & ~RPM_GRACE)) {
+		spin_unlock(&dev->power.lock);
+		spin_unlock(&dev->parent->power.lock);
+
+		/* The device's parent is not active.  Resume it and repeat. */
+		error = __pm_runtime_resume(dev->parent, false);
+		if (error)
+			return error;
+
+		goto repeat;
+	}
+
+	if (dev->power.runtime_status == RPM_RESUMING) {
+		if (grace)
+			dev->power.runtime_status |= RPM_GRACE;
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * There's another resume running in parallel with us. Wait for
+		 * it to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	/* The RPM_GRACE bit may be set in runtime_status. */
+	dev->power.runtime_status &= ~(RPM_WAKE | RPM_SUSPENDED);
+	dev->power.runtime_status |= RPM_RESUMING;
+	if (grace)
+		dev->power.runtime_status |= RPM_GRACE;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_resume)
+		error = dev->bus->pm->runtime_resume(dev);
+
+	spin_lock(&dev->power.lock);
+
+	/*
+	 * On success leave runtime_status alone, so that only RPM_GRACE may
+	 * remain set in it.
+	 */
+	dev->power.runtime_status &= ~RPM_RESUMING;
+	if (error == -EAGAIN || error == -EBUSY)
+		dev->power.runtime_status = RPM_SUSPENDED;
+	else if (error)
+		dev->power.runtime_status = RPM_ERROR;
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+
+ out_unlock:
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+	goto out;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_resume);
+
+/**
+ * pm_runtime_resume_work - Run __pm_runtime_resume() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the resume has been scheduled for and run
+ * __pm_runtime_resume() for it without forcing a grace period after the resume.
+ */
+static void pm_runtime_resume_work(struct work_struct *work)
+{
+	__pm_runtime_resume(pm_work_to_device(work), false);
+}
+
+/**
+ * __pm_request_resume - Schedule run-time resume of given device.
+ * @dev: Device to resume.
+ * @grace: If set, force a post-resume grace period.
+ */
+void __pm_request_resume(struct device *dev, bool grace)
+{
+	unsigned long parent_flags = 0, flags;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	if (dev->parent)
+		spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		/* Autosuspend request is pending, no need to resume. */
+		pm_cancel_suspend(dev);
+		if (grace)
+			dev->power.runtime_status |= RPM_GRACE;
+		goto out;
+	} else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
+		goto out;
+	} else if (dev->parent
+	    && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+
+		/* The parent is suspending, suspended or idle. Wake it up. */
+		__pm_request_resume(dev->parent, false);
+
+		goto repeat;
+	}
+
+	/*
+	 * The device may be suspending at the moment and we can't clear the
+	 * RPM_SUSPENDING bit in its runtime_status just yet.
+	 */
+	dev->power.runtime_status |= RPM_WAKE;
+	if (grace)
+		dev->power.runtime_status |= RPM_GRACE;
+	INIT_WORK(&dev->power.runtime_work.work, pm_runtime_resume_work);
+	queue_work(pm_wq, &dev->power.runtime_work.work);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (dev->parent)
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+}
+EXPORT_SYMBOL_GPL(__pm_request_resume);
+
+/**
+ * pm_cancel_runtime_suspend - Cancel a pending suspend request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_suspend(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		cancel_delayed_work(&dev->power.runtime_work);
+		dev->power.runtime_status = RPM_ACTIVE;
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_suspend);
+
+/**
+ * pm_cancel_runtime_resume - Cancel a pending resume request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_resume(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status & RPM_WAKE) {
+		work_clear_pending(&dev->power.runtime_work.work);
+		dev->power.runtime_status &= ~(RPM_WAKE | RPM_GRACE);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_resume);
+
+/**
+ * pm_runtime_disable - Disable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Increase the depth field in the device's dev_pm_info structure, which will
+ * cause the run-time PM functions above to return without doing anything.
+ * If there is a run-time PM operation in progress, wait for it to complete.
+ */
+void pm_runtime_disable(struct device *dev)
+{
+	might_sleep();
+
+	atomic_inc(&dev->power.depth);
+
+	if (dev->power.runtime_status & RPM_IN_PROGRESS)
+		wait_for_completion(&dev->power.work_done);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_disable);
+
+/**
+ * pm_runtime_enable - Enable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Enable run-time power management for given device by decreasing the depth
+ * field in its dev_pm_info structure.
+ */
+void pm_runtime_enable(struct device *dev)
+{
+	if (!atomic_add_unless(&dev->power.depth, -1, 0))
+		dev_warn(dev, "PM: Excessive pm_runtime_enable()!\n");
+}
+EXPORT_SYMBOL_GPL(pm_runtime_enable);
+
+/**
+ * pm_runtime_init - Initialize run-time PM fields in given device object.
+ * @dev: Device object to handle.
+ */
+void pm_runtime_init(struct device *dev)
+{
+	spin_lock_init(&dev->power.lock);
+	dev->power.runtime_status = RPM_ACTIVE;
+	atomic_set(&dev->power.depth, 1);
+	pm_suspend_check_children(dev, true);
+}
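[Illustration only, not part of the patch.]  A minimal user-space sketch of
how the return value of the bus type's callback is meant to map to the device's
run-time PM status, following the rules stated in
Documentation/power/runtime_pm.txt in this patch.  The MODEL_* names and values
are hypothetical stand-ins for the RPM_* flags from include/linux/pm.h, and the
RPM_GRACE and RPM_WAKE bits are left out for brevity.

```c
#include <errno.h>

/* Hypothetical status values; the real RPM_* flags are bit masks. */
enum model_status {
	MODEL_ACTIVE,
	MODEL_SUSPENDED,
	MODEL_ERROR,
};

/* Status the core records after ->runtime_suspend() returned 'error'. */
static enum model_status model_status_after_suspend(int error)
{
	switch (error) {
	case 0:
		return MODEL_SUSPENDED;
	case -EAGAIN:
	case -EBUSY:
		/* The suspend was refused; the device stays active. */
		return MODEL_ACTIVE;
	default:
		/* Any other code is treated as unrecoverable. */
		return MODEL_ERROR;
	}
}

/* Status the core records after ->runtime_resume() returned 'error'. */
static enum model_status model_status_after_resume(int error)
{
	switch (error) {
	case 0:
		return MODEL_ACTIVE;
	case -EAGAIN:
	case -EBUSY:
		/* The resume was refused; the device stays suspended. */
		return MODEL_SUSPENDED;
	default:
		return MODEL_ERROR;
	}
}
```

The two mappings are mirror images of each other: a "try again" code leaves the
device in the state it started from, and only other error codes lead to the
error state.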
Index: linux-2.6/include/linux/pm_runtime.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pm_runtime.h
@@ -0,0 +1,112 @@
+/*
+ * pm_runtime.h - Device run-time power management helper functions.
+ *
+ * Copyright (C) 2009 Rafael J. Wysocki <rjw@sisk.pl>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_RUNTIME_H
+#define _LINUX_PM_RUNTIME_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+#ifdef CONFIG_PM_RUNTIME
+
+extern struct workqueue_struct *pm_wq;
+
+extern void pm_runtime_init(struct device *dev);
+extern void __pm_runtime_change_status(struct device *dev, unsigned int status,
+				       unsigned int new_status);
+extern int __pm_runtime_suspend(struct device *dev, bool sync);
+extern void pm_request_suspend(struct device *dev, unsigned int msec);
+extern int __pm_runtime_resume(struct device *dev, bool grace);
+extern void __pm_request_resume(struct device *dev, bool grace);
+extern void pm_cancel_runtime_suspend(struct device *dev);
+extern void pm_cancel_runtime_resume(struct device *dev);
+extern void pm_runtime_disable(struct device *dev);
+extern void pm_runtime_enable(struct device *dev);
+
+static inline struct device *pm_work_to_device(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(dw, struct dev_pm_info, runtime_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+	dev->power.suspend_skip_children = !enable;
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+static inline void pm_runtime_init(struct device *dev) {}
+static inline void __pm_runtime_change_status(struct device *dev,
+					      unsigned int status,
+					      unsigned int new_status) {}
+static inline int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	return -ENOSYS;
+}
+static inline void pm_request_suspend(struct device *dev, unsigned int msec) {}
+static inline int __pm_runtime_resume(struct device *dev, bool grace)
+{
+	return -ENOSYS;
+}
+static inline void __pm_request_resume(struct device *dev, bool grace) {}
+static inline void pm_cancel_runtime_suspend(struct device *dev) {}
+static inline void pm_cancel_runtime_resume(struct device *dev) {}
+static inline void pm_runtime_disable(struct device *dev) {}
+static inline void pm_runtime_enable(struct device *dev) {}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+}
+
+#endif /* !CONFIG_PM_RUNTIME */
+
+static inline int pm_runtime_suspend(struct device *dev)
+{
+	return __pm_runtime_suspend(dev, true);
+}
+
+static inline int pm_runtime_resume(struct device *dev)
+{
+	return __pm_runtime_resume(dev, false);
+}
+
+static inline int pm_runtime_resume_grace(struct device *dev)
+{
+	return __pm_runtime_resume(dev, true);
+}
+
+static inline void pm_request_resume(struct device *dev)
+{
+	__pm_request_resume(dev, false);
+}
+
+static inline void pm_request_resume_grace(struct device *dev)
+{
+	__pm_request_resume(dev, true);
+}
+
+static inline void pm_runtime_clear_active(struct device *dev)
+{
+	__pm_runtime_change_status(dev, RPM_ERROR, RPM_ACTIVE);
+}
+
+static inline void pm_runtime_clear_suspended(struct device *dev)
+{
+	__pm_runtime_change_status(dev, RPM_ERROR, RPM_SUSPENDED);
+}
+
+static inline void pm_runtime_release(struct device *dev)
+{
+	__pm_runtime_change_status(dev, RPM_GRACE, RPM_ACTIVE);
+}
+
+#endif
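[Illustration only, not part of the patch.]  pm_work_to_device() above walks
from a work item back to the device that embeds it, using container_of() twice.
The user-space sketch below shows the same pointer arithmetic.  All of the
model_* types and the model_container_of() macro are hypothetical
simplifications of the kernel structures, assumed here only to make the example
self-contained.

```c
#include <stddef.h>

/* User-space equivalent of the kernel's container_of(). */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins for work_struct, delayed_work, dev_pm_info, device. */
struct model_work {
	int pending;
};

struct model_delayed_work {
	struct model_work work;
	long timer;
};

struct model_pm_info {
	unsigned int runtime_status;
	struct model_delayed_work runtime_work;
};

struct model_device {
	const char *name;
	struct model_pm_info power;
};

/* Same steps as pm_work_to_device(): work -> delayed work -> PM info -> device. */
static struct model_device *model_work_to_device(struct model_work *work)
{
	struct model_delayed_work *dw;
	struct model_pm_info *dpi;

	dw = model_container_of(work, struct model_delayed_work, work);
	dpi = model_container_of(dw, struct model_pm_info, runtime_work);
	return model_container_of(dpi, struct model_device, power);
}
```

Because the work item is embedded in the device object, no extra pointer has to
be stored in order to find the device from the work function.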
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -21,6 +21,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
@@ -88,6 +89,7 @@ void device_pm_add(struct device *dev)
 	}
 
 	list_add_tail(&dev->power.entry, &dpm_list);
+	pm_runtime_init(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -507,6 +509,7 @@ static void dpm_complete(pm_message_t st
 		get_device(dev);
 		if (dev->power.status > DPM_ON) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			mutex_unlock(&dpm_list_mtx);
 
 			device_complete(dev, state);
@@ -753,6 +756,7 @@ static int dpm_prepare(pm_message_t stat
 
 		get_device(dev);
 		dev->power.status = DPM_PREPARING;
+		pm_runtime_disable(dev);
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_prepare(dev, state);
@@ -760,6 +764,7 @@ static int dpm_prepare(pm_message_t stat
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			if (error == -EAGAIN) {
 				put_device(dev);
 				continue;
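[Illustration only, not part of the patch.]  The pm_runtime_disable() and
pm_runtime_enable() calls added to dpm_prepare() and dpm_complete() above rely
on the 'power.depth' counter being balanced.  Below is a minimal user-space
sketch of that counter, assuming C11 atomics in place of the kernel's atomic_t.
The model_* names are hypothetical, and the wait for an operation in progress
done by pm_runtime_disable() is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the 'depth' field of struct dev_pm_info. */
struct model_dev {
	atomic_int depth;
};

/* Like pm_runtime_init(): run-time PM starts out disabled (depth == 1). */
static void model_init(struct model_dev *dev)
{
	atomic_init(&dev->depth, 1);
}

/* Like pm_runtime_disable(), without waiting for a running operation. */
static void model_disable(struct model_dev *dev)
{
	atomic_fetch_add(&dev->depth, 1);
}

/*
 * Like pm_runtime_enable(): decrement depth unless it is already 0.
 * Returns false for an unbalanced call, where the patch prints a warning.
 */
static bool model_enable(struct model_dev *dev)
{
	int old = atomic_load(&dev->depth);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&dev->depth, &old, old - 1))
			return true;
	}
	return false;
}

/* The helper functions only do anything while depth is 0. */
static bool model_pm_allowed(struct model_dev *dev)
{
	return atomic_load(&dev->depth) == 0;
}
```

With this model, two nested disable calls need two enable calls before
run-time PM operations are allowed again, which is the balancing rule stated in
the documentation part of the patch.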
Index: linux-2.6/drivers/base/dd.c
===================================================================
--- linux-2.6.orig/drivers/base/dd.c
+++ linux-2.6/drivers/base/dd.c
@@ -23,6 +23,7 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
+#include <linux/pm_runtime.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -202,8 +203,12 @@ int driver_probe_device(struct device_dr
 	pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
 		 drv->bus->name, __func__, dev_name(dev), drv->name);
 
+	pm_runtime_disable(dev);
+
 	ret = really_probe(dev, drv);
 
+	pm_runtime_enable(dev);
+
 	return ret;
 }
 
@@ -306,6 +311,8 @@ static void __device_release_driver(stru
 
 	drv = dev->driver;
 	if (drv) {
+		pm_runtime_disable(dev);
+
 		driver_sysfs_remove(dev);
 
 		if (dev->bus)
@@ -320,6 +327,8 @@ static void __device_release_driver(stru
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
+
+		pm_runtime_enable(dev);
 	}
 }
 
Index: linux-2.6/Documentation/power/runtime_pm.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/power/runtime_pm.txt
@@ -0,0 +1,311 @@
+Run-time Power Management Framework for I/O Devices
+
+(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+1. Introduction
+
+The support for run-time power management (run-time PM) of I/O devices is
+provided at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queuing all work items related to run-time PM, because this allows
+  them to be synchronized with system-wide power transitions.  pm_wq is declared
+  in include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of run-time PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing run-time PM operations with one another.
+
+* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out run-time PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The device run-time PM fields defined in 'struct dev_pm_info', the helper
+functions and the run-time PM callbacks defined in 'struct dev_pm_ops' are
+described below.
+
+2. Run-time PM Helper Functions and Device Fields
+
+The following helper functions are defined in drivers/base/power/runtime.c
+and include/linux/pm_runtime.h:
+
+* void pm_runtime_init(struct device *dev);
+
+* void pm_runtime_enable(struct device *dev);
+* void pm_runtime_disable(struct device *dev);
+
+* int pm_runtime_suspend(struct device *dev);
+* void pm_request_suspend(struct device *dev, unsigned int msec);
+* int pm_runtime_resume(struct device *dev);
+* int pm_runtime_resume_grace(struct device *dev);
+* void pm_request_resume(struct device *dev);
+* void pm_request_resume_grace(struct device *dev);
+* void pm_runtime_release(struct device *dev);
+
+* void pm_cancel_runtime_suspend(struct device *dev);
+* void pm_cancel_runtime_resume(struct device *dev);
+
+* void pm_suspend_check_children(struct device *dev, bool enable);
+
+* void pm_runtime_clear_active(struct device *dev);
+* void pm_runtime_clear_suspended(struct device *dev);
+
+pm_runtime_init() initializes the run-time PM fields in the 'power' member of
+the device object.  It is called during the initialization of the device object,
+in drivers/base/power/main.c:device_pm_add().
+
+pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
+respectively, all of the run-time PM core operations.  They do it by decreasing
+and increasing, respectively, the 'power.depth' field of 'struct device'.  If
+the value of this field is greater than 0, pm_runtime_suspend(),
+pm_request_suspend(), pm_runtime_resume() and so on return immediately without
+doing anything and -EBUSY is returned by pm_runtime_suspend(),
+pm_runtime_resume() and pm_runtime_resume_grace().  Therefore, if
+pm_runtime_disable() is called several times in a row for the same device, it
+has to be balanced by the appropriate number of pm_runtime_enable() calls so
+that the other run-time PM core functions can be used for that device.  The
+initial value of 'power.depth', as set by pm_runtime_init(), is 1 (i.e. the
+run-time PM of the device is initially disabled).
+
+pm_runtime_disable() and pm_runtime_enable() are used by the device core to
+disable the run-time PM of the device temporarily during device probe and
+removal as well as during system-wide power transitions (i.e. system-wide
+suspend or hibernation, or resume from a system sleep state).
+
+pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+pm_runtime_resume_grace(), pm_request_resume(), and pm_request_resume_grace()
+use the 'power.runtime_status' and 'power.suspend_aborted' fields of
+'struct device' for mutual synchronization.  The 'power.runtime_status' field,
+called the device's run-time PM status in what follows, is set to RPM_ACTIVE by
+pm_runtime_init().
+
+pm_request_suspend() is used to queue up a suspend request for an active device.
+If the run-time PM status of the device (i.e. the value of the
+'power.runtime_status' field in 'struct device') is different from RPM_ACTIVE
+(i.e. the device is not active from the PM core standpoint), it returns
+immediately.  Otherwise, it changes the device's run-time PM status to RPM_IDLE
+and puts a request to suspend the device into pm_wq.  The 'msec' argument is
+used to specify the time to wait before the request will be completed, in
+milliseconds.  It is valid to call this function from interrupt context.
+
+pm_runtime_suspend() is used to carry out a run-time suspend of an active
+device.  It is called directly by a bus type or device driver.  An asynchronous
+version of it is called by the PM core, to complete a request queued up by
+pm_request_suspend().  The only difference between them is the handling of
+situations when a queued up suspend request has just been cancelled.  Apart from
+this, they work in the same way.
+* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field, 'power.runtime_status'), success is returned.
+* If the device is about to resume or is in a post-resume grace period (i.e. at
+  least one of the RPM_WAKE, RPM_RESUMING, and RPM_GRACE bits is set in the
+  device's run-time PM status field), -EAGAIN is returned.  -EAGAIN is also
+  returned if the function has been called via pm_wq as a result of a cancelled
+  suspend request (the 'power.suspend_aborted' field is used for this purpose).
+* If the device is suspending (i.e. its run-time PM status is RPM_SUSPENDING),
+  which means that another instance of pm_runtime_suspend() is running at the
+  same time for the same device, the function waits for the other instance to
+  complete and returns the error code (or success) returned by it.
+* If the device's children are not suspended and the
+  'power.suspend_skip_children' flag is not set for it, the device's run-time PM
+  status is set to RPM_ACTIVE and -EAGAIN is returned.
+If none of the above takes place, the device's run-time PM status is set to
+RPM_SUSPENDING and its bus type's ->runtime_suspend() callback is executed.
+This callback is responsible for handling the device as appropriate (for
+example, it may choose to execute the device driver's ->runtime_suspend()
+callback or to carry out any other suitable action depending on the bus type).
+* If it completes successfully, the RPM_SUSPENDED bit is set and the
+  RPM_SUSPENDING bit is cleared in the device's run-time PM status field.  Once
+  that has happened, the device is regarded by the PM core as suspended, but it
+  _need_ _not_ mean that the device has been put into a low power state.  What
+  really occurs to the device at this point totally depends on its bus type (it
+  may depend on the device's driver if the bus type chooses to call it).
+  Additionally, if the device bus type's ->runtime_suspend() callback completes
+  successfully, the device bus type's ->runtime_idle() callback is executed for
+  the device's parent, if there is one and if all of its children are suspended
+  (or the 'power.suspend_skip_children' flag is set for it).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_ACTIVE.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to carry out any run-time PM operations
+  for it until the status is cleared by its bus type or driver with the help of
+  either pm_runtime_clear_active(), or pm_runtime_clear_suspended().
+Finally, pm_runtime_suspend() returns the error code (or success) returned by
+the device bus type's ->runtime_suspend() callback.  If the device's bus type
+doesn't implement ->runtime_suspend(), -EINVAL is returned and the device's
+run-time PM status is set to RPM_ERROR.
+
+pm_request_resume() and pm_request_resume_grace() are used to queue up a resume
+request for a device that is suspended, suspending or has a suspend request
+pending.  The difference between them is that pm_request_resume_grace() causes
+the RPM_GRACE bit to be set in the device's run-time PM status field, which
+prevents the PM core from suspending the device or queuing up a suspend request
+for it until the RPM_GRACE bit is cleared with the help of pm_runtime_release().
+Apart from this, they work in the same way.
+* If a suspend request is pending for the device (i.e. the device's run-time PM
+  status is RPM_IDLE), it is cancelled, the 'power.suspend_aborted' flag is set
+  for the device, the RPM_IDLE bit is cleared in the device's run-time PM status
+  field and the function returns (pm_request_resume_grace() additionally sets
+  the RPM_GRACE bit in the device's run-time PM status field).
+* If the device is not suspended or suspending (i.e. neither the RPM_SUSPENDED
+  nor the RPM_SUSPENDING bit is set in the device's run-time PM status field),
+  the function returns.
+* If the device's parent is inactive (i.e. at least one of the RPM_IDLE,
+  RPM_SUSPENDING, and RPM_SUSPENDED bits is set in its run-time PM status
+  field), a resume request is (recursively) scheduled for the parent and the
+  function is restarted.
+If none of the above happens, the RPM_WAKE bit is set in the device's run-time
+PM status field and the request to execute pm_runtime_resume() is put into
+pm_wq.
+
+pm_runtime_resume() and pm_runtime_resume_grace() are used to carry out a
+run-time resume of a device that is suspended, suspending or has a suspend
+request pending.  They are called either by the PM core, to complete a request
+queued up by pm_request_resume(), or directly by a bus type or device driver.
+The difference between them is that pm_runtime_resume_grace() causes the
+RPM_GRACE bit to be set in the device's run-time PM status field, which prevents
+the PM core from suspending the device or queuing up a suspend request for it
+until the RPM_GRACE bit is cleared with the help of pm_runtime_release().  Apart
+from this, they work in the same way.
+* If the device is active (i.e. all of the bits in its run-time PM status are
+  clear, possibly except for RPM_GRACE), success is returned.
+* If there's a suspend request pending for the device (i.e. the device's
+  run-time PM status is RPM_IDLE), it is cancelled, the 'power.suspend_aborted'
+  flag is set for the device, the RPM_IDLE bit is cleared in its run-time PM
+  status field and the function returns success (pm_runtime_resume_grace()
+  additionally sets the RPM_GRACE bit in the device's run-time PM status field).
+* If the device is suspending (i.e. the RPM_SUSPENDING bit is set in its
+  run-time PM status field), the function waits for the suspend operation to
+  complete and restarts itself.
+* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field), the device's parent exists and is not active (i.e.
+  the parent's run-time PM status is not RPM_ACTIVE or RPM_GRACE), the parent is
+  resumed (recursively) and the function restarts itself.
+* If the device is resuming (i.e. the device's run-time PM status is
+  RPM_RESUMING), which means that another instance of pm_runtime_resume() is
+  running at the same time for the same device, the function waits for the other
+  instance to complete and returns the result returned by it.
+If none of the above happens, the RPM_WAKE and RPM_SUSPENDED bits are cleared
+and the RPM_RESUMING bit is set in the device's run-time PM status field.  Next,
+the device bus type's ->runtime_resume() callback is executed, which is
+responsible for handling the device as appropriate (for example, it may choose
+to execute the device driver's ->runtime_resume() callback or to carry out any
+other suitable action depending on the bus type).
+* If it completes successfully, the device's run-time PM status is set to
+  'active' (i.e. the device's run-time PM status field is either RPM_ACTIVE, or
+  RPM_GRACE), which means that the device is fully operational.  Thus, the
+  device bus type's ->runtime_resume() callback, when it is about to return
+  success, _must_ _ensure_ that this really is the case (i.e. when it returns
+  success, the device _must_ be able to carry out I/O operations as needed).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_SUSPENDED.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to carry out any run-time PM operations
+  for it until the status is cleared by its bus type or driver with the help of
+  either pm_runtime_clear_active(), or pm_runtime_clear_suspended().
+Finally, pm_runtime_resume() returns the error code (or success) returned by
+the device bus type's ->runtime_resume() callback.  If the device's bus type
+doesn't implement ->runtime_resume(), -EINVAL is returned and the device's
+run-time PM status is set to RPM_ERROR.
+
+pm_runtime_release() is used to clear the RPM_GRACE bit in the device's run-time
+PM status field.  This bit, if set, causes the PM core to refuse to suspend
+the device or to queue up a suspend request for it.  In particular, it causes
+pm_runtime_suspend() to return -EAGAIN without doing anything else.  This may
+be useful if the device is resumed for a specific task and it shouldn't be
+suspended until the task is complete, but there are many potential sources of
+suspend requests that could disturb it.
+
+pm_cancel_runtime_suspend() is used to cancel a pending suspend request for an
+active device, but it can only be called when the run-time PM of the device
+is disabled.  It is supposed to be used during system-wide power transitions.
+
+pm_cancel_runtime_resume() is used to cancel a pending resume request for
+a suspended device.  It can only be called when the run-time PM of the device
+is disabled and it is supposed to be used during system-wide power transitions.
+
+pm_suspend_check_children() is used to set or unset the
+'power.suspend_skip_children' flag in 'struct device'.  If the 'enable'
+argument is 'true', the field is set to 0, and if 'enable' is 'false', the field
+is set to 1.  The default value of 'power.suspend_skip_children', as set by
+pm_runtime_init(), is 0.
+
+pm_runtime_clear_active() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_ACTIVE.
+
+pm_runtime_clear_suspended() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_SUSPENDED.
+
+3. Device Run-time PM Callbacks
+
+There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend() callback is executed by pm_runtime_suspend() for the bus
+type of the device being suspended.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_suspend() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the bus type's ->runtime_suspend() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_suspend() callback has returned successfully,
+  the PM core regards the device as suspended, which need not mean that the
+  device has been put into a low power state.  It is supposed to mean, however,
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_ACTIVE, which means that the device
+  _must_ be fully operational once this has happened.
+* If the bus type's ->runtime_suspend() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 1 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+In particular, it is recommended that ->runtime_suspend() return -EBUSY or
+-EAGAIN if device_may_wakeup() returns 'false' for the device.  On the other
+hand, if device_may_wakeup() returns 'true' for the device and the device is put
+into a low power state during the execution of ->runtime_suspend(), it is
+expected that remote wake-up (i.e. a hardware mechanism allowing the device to
+request a change of its power state, such as PCI PME) will be enabled for the
+device.  Generally, remote wake-up should be enabled whenever the device is put
+into a low power state at run time and is expected to receive input from the
+outside of the system.
+
+The ->runtime_resume() callback is executed by pm_runtime_resume() for the bus
+type of the device being woken up.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_resume() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the bus type's ->runtime_resume() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_resume() callback has returned successfully,
+  the PM core regards the device as fully operational, which means that the
+  device _must_ be able to complete I/O operations as needed.
+* If the bus type's ->runtime_resume() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_SUSPENDED, which is supposed to mean
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_resume() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 1 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_idle() callback is executed by pm_runtime_suspend() for the bus
+type of a device the children of which are all suspended (or which has the
+'power.suspend_skip_children' flag set).  The action carried out by this
+callback is totally dependent on the bus type in question, but the expected
+action is to check if the device can be suspended (i.e. if all of the conditions
+necessary for suspending the device are met) and to queue up a suspend request
+for the device if that is the case.
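
For illustration only, the return-code rules described in Section 3 above can
be modelled in a few lines of user-space C.  This is a sketch under
assumptions, not the PM core code: the ST_* values and the functions
status_after_suspend(), status_after_resume(), helpers_allowed() and
clear_error() are hypothetical names made up for this example, standing in for
RPM_ACTIVE, RPM_SUSPENDED, RPM_ERROR and the pm_runtime_clear_*() helpers.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for RPM_ACTIVE, RPM_SUSPENDED and RPM_ERROR. */
enum rpm_model_status { ST_ACTIVE, ST_SUSPENDED, ST_ERROR };

/* Status recorded after the bus type's ->runtime_suspend() returned 'ret'. */
static enum rpm_model_status status_after_suspend(int ret)
{
	if (ret == 0)
		return ST_SUSPENDED;	/* device is regarded as suspended */
	if (ret == -EBUSY || ret == -EAGAIN)
		return ST_ACTIVE;	/* device must be fully operational */
	return ST_ERROR;		/* unrecoverable from the core's view */
}

/* Status recorded after the bus type's ->runtime_resume() returned 'ret'. */
static enum rpm_model_status status_after_resume(int ret)
{
	if (ret == 0)
		return ST_ACTIVE;	/* device must be able to complete I/O */
	if (ret == -EBUSY || ret == -EAGAIN)
		return ST_SUSPENDED;	/* device stays suspended */
	return ST_ERROR;
}

/* The helper functions refuse to run while the status is the error one. */
static int helpers_allowed(enum rpm_model_status st)
{
	return st != ST_ERROR;
}

/*
 * Model of pm_runtime_clear_active() and pm_runtime_clear_suspended(): the
 * status only changes if it currently is the error one, and the only valid
 * targets are "active" and "suspended".
 */
static enum rpm_model_status clear_error(enum rpm_model_status cur,
					 enum rpm_model_status target)
{
	assert(target == ST_ACTIVE || target == ST_SUSPENDED);
	return cur == ST_ERROR ? target : cur;
}
```

A bus type or driver that hits the error status would, in this model, call
clear_error() with whichever of the two states matches the real condition of
the hardware before any of the other helpers can be used again.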

* [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-16 21:30           ` Rafael J. Wysocki
  2009-06-16 22:33             ` [patch update 2 fix] " Rafael J. Wysocki
@ 2009-06-16 22:33             ` Rafael J. Wysocki
  1 sibling, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-16 22:33 UTC (permalink / raw)
  To: Alan Stern; +Cc: Greg KH, LKML, ACPI Devel Maling List, linux-pm, Ingo Molnar

On Tuesday 16 June 2009, Rafael J. Wysocki wrote:
> On Tuesday 16 June 2009, Alan Stern wrote:
> > On Tue, 16 Jun 2009, Rafael J. Wysocki wrote:
> > > > Since pm_runtime_resume() takes care of powering up the parent, there's
> > > > no need for pm_request_resume() to worry about it also.
> > >
> > > But still it won't hurt to do it IMO, because the parents are then going
> > > to be resumed before our pm_runtime_resume() is called.
> >
> > It's extra code that isn't needed.  In essence, you are trading code
> > space for a shorter runtime stack.
> 
> That's correct.  I think the code size increase is small and it's better to 
> keep the stack as small as reasonably possible.
> 
> > > > The documentation should mention that the runtime_suspend method is
> > > > supposed to enable remote wakeup if it is available and if
> > > > device_may_wakeup(dev) is true.
> > >
> > > Well, I thought that was obvious. :-)
> >
> > Sometimes it doesn't hurt to state the obvious!  :-)
> 
> Sure.
> 
> In the meantime I updated the patch once again.  I addressed your last 
> comments in this version and added the possibility to resume with blocking
> suspend (ie. after such a resume pm_runtime_suspend() and pm_request_suspend() 
> will return immediately until a special function is called).
> 
> I also fixed a couple of bugs. :-)

Sorry for the broken patch.  My mailer started to wordwrap messages
automatically and I didn't notice.

The correct patch is appended.

Best,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM: Introduce core framework for run-time PM of I/O devices

Introduce a core framework for run-time power management of I/O
devices.  Add device run-time PM fields to 'struct dev_pm_info'
and device run-time PM callbacks to 'struct dev_pm_ops'.  Introduce
a run-time PM workqueue and define some device run-time PM helper
functions at the core level.  Document all these things.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/runtime_pm.txt |  311 +++++++++++++++++++++++
 drivers/base/dd.c                  |    9 
 drivers/base/power/Makefile        |    1 
 drivers/base/power/main.c          |    5 
 drivers/base/power/runtime.c       |  499 +++++++++++++++++++++++++++++++++++++
 include/linux/pm.h                 |   97 ++++++-
 include/linux/pm_runtime.h         |  112 ++++++++
 kernel/power/Kconfig               |   14 +
 kernel/power/main.c                |   17 +
 9 files changed, 1062 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/power/Kconfig
===================================================================
--- linux-2.6.orig/kernel/power/Kconfig
+++ linux-2.6/kernel/power/Kconfig
@@ -208,3 +208,17 @@ config APM_EMULATION
 	  random kernel OOPSes or reboots that don't seem to be related to
 	  anything, try disabling/enabling this option (or disabling/enabling
 	  APM in your BIOS).
+
+config PM_RUNTIME
+	bool "Run-time PM core functionality"
+	depends on PM
+	---help---
+	  Enable functionality allowing I/O devices to be put into energy-saving
+	  (low power) states at run time (or autosuspended) after a specified
+	  period of inactivity and woken up in response to a hardware-generated
+	  wake-up event or a driver's request.
+
+	  Hardware support is generally required for this functionality to work
+	  and the bus type drivers of the buses the devices are on are
+	  responsible for the actual handling of the autosuspend requests and
+	  wake-up events.
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -11,6 +11,7 @@
 #include <linux/kobject.h>
 #include <linux/string.h>
 #include <linux/resume-trace.h>
+#include <linux/workqueue.h>
 
 #include "power.h"
 
@@ -217,8 +218,24 @@ static struct attribute_group attr_group
 	.attrs = g,
 };
 
+#ifdef CONFIG_PM_RUNTIME
+struct workqueue_struct *pm_wq;
+
+static int __init pm_start_workqueue(void)
+{
+	pm_wq = create_freezeable_workqueue("pm");
+
+	return pm_wq ? 0 : -ENOMEM;
+}
+#else
+static inline int pm_start_workqueue(void) { return 0; }
+#endif
+
 static int __init pm_init(void)
 {
+	int error = pm_start_workqueue();
+	if (error)
+		return error;
 	power_kobj = kobject_create_and_add("power", NULL);
 	if (!power_kobj)
 		return -ENOMEM;
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -22,6 +22,9 @@
 #define _LINUX_PM_H
 
 #include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -165,6 +168,26 @@ typedef struct pm_message {
  * It is allowed to unregister devices while the above callbacks are being
  * executed.  However, it is not allowed to unregister a device from within any
  * of its own callbacks.
+ *
+ * There also are the following callbacks related to run-time power management
+ * of devices:
+ *
+ * @runtime_suspend: Prepare the device for a condition in which it won't be
+ *	able to communicate with the CPU(s) and RAM due to power management.
+ *	This need not mean that the device should be put into a low power state,
+ *	like for example when the device is behind a link, represented by a
+ *	separate device object, that is going to be turned off for power
+ *	management purposes.
+ *
+ * @runtime_resume: Put the device into the fully active state in response to a
+ *	wake-up event generated by hardware or at a request of software.  If
+ *	necessary, put the device into the full power state and restore its
+ *	registers, so that it is fully operational.
+ *
+ * @runtime_idle: Device appears to be inactive and it might be put into a low
+ *	power state if all of the necessary conditions are satisfied.  Check
+ *	these conditions and handle the device as appropriate, possibly queueing
+ *	a suspend request for it.
  */
 
 struct dev_pm_ops {
@@ -182,6 +205,9 @@ struct dev_pm_ops {
 	int (*thaw_noirq)(struct device *dev);
 	int (*poweroff_noirq)(struct device *dev);
 	int (*restore_noirq)(struct device *dev);
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
 };
 
 /**
@@ -315,14 +341,79 @@ enum dpm_state {
 	DPM_OFF_IRQ,
 };
 
+/**
+ * Device run-time power management state.
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.  They do not
+ * reflect the actual power state of the device or its status as seen by the
+ * driver.
+ *
+ * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
+ *			pending for it.
+ *
+ * RPM_IDLE		It has been requested that the device be suspended.
+ *			Suspend request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
+ *			executed.
+ *
+ * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
+ *			completed successfully.  The device is regarded as
+ *			suspended.
+ *
+ * RPM_WAKE		It has been requested that the device be woken up.
+ *			Resume request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
+ *			executed.
+ *
+ * RPM_ERROR		Represents a condition from which the PM core cannot
+ *			recover by itself.  If the device's run-time PM status
+ *			field has this value, all of the run-time PM operations
+ *			carried out for the device by the core will fail, until
+ *			the status field is changed to either RPM_ACTIVE or
+ *			RPM_SUSPENDED (it is not valid to use the other values
+ *			in such a situation) by the device's driver or bus type.
+ *			This happens when the device bus type's
+ *			->runtime_suspend() or ->runtime_resume() callback
+ *			returns an error code different from -EAGAIN or -EBUSY.
+ */
+
+#define RPM_ACTIVE	0
+#define RPM_IDLE	0x01
+#define RPM_SUSPENDING	0x02
+#define RPM_SUSPENDED	0x04
+#define RPM_WAKE	0x08
+#define RPM_RESUMING	0x10
+#define RPM_GRACE	0x20
+#define RPM_ERROR	(-1)
+
+#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
+#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)
+#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING | RPM_GRACE)
+#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)
+
 struct dev_pm_info {
 	pm_message_t		power_state;
-	unsigned		can_wakeup:1;
-	unsigned		should_wakeup:1;
+	unsigned int		can_wakeup:1;
+	unsigned int		should_wakeup:1;
 	enum dpm_state		status;		/* Owned by the PM core */
-#ifdef	CONFIG_PM_SLEEP
+#ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
 #endif
+#ifdef CONFIG_PM_RUNTIME
+	struct delayed_work	runtime_work;
+	struct completion	work_done;
+	unsigned int		suspend_skip_children:1;
+	unsigned int		suspend_aborted:1;
+	unsigned int		runtime_status:6;
+	int			runtime_error;
+	atomic_t		depth;
+	spinlock_t		lock;
+#endif
 };
 
 /*
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o
+obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
Index: linux-2.6/drivers/base/power/runtime.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/runtime.c
@@ -0,0 +1,499 @@
+/*
+ * drivers/base/power/runtime.c - Helper functions for device run-time PM
+ *
+ * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/pm_runtime.h>
+#include <linux/jiffies.h>
+
+/**
+ * __pm_runtime_change_status - Change the run-time PM status of a device.
+ * @dev: Device to handle.
+ * @status: Expected current run-time PM status of the device.
+ * @new_status: New value of the device's run-time PM status.
+ *
+ * Change the run-time PM status of the device to @new_status if its current
+ * value is equal to @status.
+ */
+void __pm_runtime_change_status(struct device *dev, unsigned int status,
+				unsigned int new_status)
+{
+	unsigned long flags;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == status)
+		dev->power.runtime_status = new_status;
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_change_status);
+
+/**
+ * pm_device_suspended - Check if given device has been suspended at run time.
+ * @dev: Device to check.
+ * @data: Ignored.
+ *
+ * Returns 0 if the device has been suspended and it hasn't been requested to
+ * resume or -EBUSY otherwise.
+ */
+static int pm_device_suspended(struct device *dev, void *data)
+{
+	return dev->power.runtime_status == RPM_SUSPENDED ? 0 : -EBUSY;
+}
+
+/**
+ * pm_check_children - Check if all children of a device have been suspended.
+ * @dev: Device to check.
+ *
+ * Returns 0 if all children of the device have been suspended or -EBUSY
+ * otherwise.
+ */
+static int pm_check_children(struct device *dev)
+{
+	return dev->power.suspend_skip_children ? 0 :
+			device_for_each_child(dev, NULL, pm_device_suspended);
+}
+
+/**
+ * pm_runtime_notify_idle - Run a device bus type's runtime_idle() callback.
+ * @dev: Device to notify.
+ *
+ * Check if all children of given device are suspended and call the device bus
+ * type's ->runtime_idle() callback if that's the case.
+ */
+static void pm_runtime_notify_idle(struct device *dev)
+{
+	if (atomic_read(&dev->power.depth) > 0 || pm_check_children(dev))
+		return;
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_idle)
+		dev->bus->pm->runtime_idle(dev);
+}
+
+/**
+ * __pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
+ * @dev: Device to suspend.
+ * @sync: If unset, the function has been called via pm_wq.
+ *
+ * Check if the status of the device is appropriate and run the
+ * ->runtime_suspend() callback provided by the device's bus type driver.
+ * Update the run-time PM flags in the device object to reflect the current
+ * status of the device.
+ */
+int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	int error = -EINVAL;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out;
+	} else if (dev->power.runtime_status & RPM_SUSPENDED) {
+		error = 0;
+		goto out;
+	} else if ((dev->power.runtime_status & RPM_NO_SUSPEND)
+	    || (!sync && dev->power.suspend_aborted)) {
+		/*
+		 * Device is resuming or in a post-resume grace period or
+		 * there's a resume request pending, or a pending suspend
+		 * request has just been cancelled and we're running as a result
+		 * of this request.
+		 */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+
+		/*
+		 * Another suspend is running in parallel with us.  Wait for it
+		 * to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	} else if (pm_check_children(dev)) {
+		/*
+		 * We can only suspend the device if all of its children have
+		 * been suspended.
+		 */
+		dev->power.runtime_status = RPM_ACTIVE;
+		error = -EAGAIN;
+		goto out;
+	}
+
+	dev->power.runtime_status = RPM_SUSPENDING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_suspend)
+		error = dev->bus->pm->runtime_suspend(dev);
+
+	spin_lock(&dev->power.lock);
+
+	/*
+	 * Resume request might have been queued in the meantime, in which case
+	 * the RPM_WAKE bit is also set in runtime_status.
+	 */
+	dev->power.runtime_status &= ~RPM_SUSPENDING;
+	switch (error) {
+	case 0:
+		dev->power.runtime_status |= RPM_SUSPENDED;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+	if (!error && !(dev->power.runtime_status & RPM_WAKE) && dev->parent) {
+		spin_unlock(&dev->power.lock);
+
+		pm_runtime_notify_idle(dev->parent);
+
+		return 0;
+	}
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_suspend);
+
+/**
+ * pm_runtime_suspend_work - Run __pm_runtime_suspend() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the suspend has been scheduled for and
+ * run __pm_runtime_suspend() for it.
+ */
+static void pm_runtime_suspend_work(struct work_struct *work)
+{
+	__pm_runtime_suspend(pm_work_to_device(work), false);
+}
+
+/**
+ * pm_request_suspend - Schedule run-time suspend of given device.
+ * @dev: Device to suspend.
+ * @msec: Time to wait before attempting to suspend the device, in milliseconds.
+ */
+void pm_request_suspend(struct device *dev, unsigned int msec)
+{
+	unsigned long flags;
+	unsigned long delay = msecs_to_jiffies(msec);
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ACTIVE)
+		goto out;
+
+	dev->power.runtime_status = RPM_IDLE;
+	dev->power.suspend_aborted = false;
+	INIT_DELAYED_WORK(&dev->power.runtime_work, pm_runtime_suspend_work);
+	queue_delayed_work(pm_wq, &dev->power.runtime_work, delay);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_suspend);
+
+/**
+ * pm_cancel_suspend - Cancel a pending suspend request for given device.
+ * @dev: Device to cancel the suspend request for.
+ */
+static void pm_cancel_suspend(struct device *dev)
+{
+	cancel_delayed_work(&dev->power.runtime_work);
+	dev->power.runtime_status &= RPM_GRACE;
+	dev->power.suspend_aborted = true;
+}
+
+/**
+ * __pm_runtime_resume - Run a device bus type's runtime_resume() callback.
+ * @dev: Device to resume.
+ * @grace: If set, force a post-resume grace period.
+ *
+ * Check if the device is really suspended and run the ->runtime_resume()
+ * callback provided by the device's bus type driver.  Update the run-time PM
+ * flags in the device object to reflect the current status of the device.  If
+ * runtime suspend is in progress while this function is being run, wait for it
+ * to finish before resuming the device.  If runtime suspend is scheduled, but
+ * it hasn't started yet, cancel it and we're done.
+ */
+int __pm_runtime_resume(struct device *dev, bool grace)
+{
+	int error = -EINVAL;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	if (dev->parent)
+		spin_lock(&dev->parent->power.lock);
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out_unlock;
+	} else if (!(dev->power.runtime_status & ~RPM_GRACE)) {
+		/* Device is active or in a post-resume grace period. */
+		error = 0;
+		goto out_unlock;
+	} else if (dev->power.runtime_status == RPM_IDLE) {
+		/* ->runtime_suspend() hasn't started yet, no need to resume. */
+		pm_cancel_suspend(dev);
+		if (grace)
+			dev->power.runtime_status |= RPM_GRACE;
+		error = 0;
+		goto out_unlock;
+	}
+
+	if (dev->power.runtime_status & RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * A suspend is running in parallel with us.  Wait for it to
+		 * complete and repeat.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		goto repeat;
+	} else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
+	    && (dev->parent->power.runtime_status & ~RPM_GRACE)) {
+		spin_unlock(&dev->power.lock);
+		spin_unlock(&dev->parent->power.lock);
+
+		/* The device's parent is not active.  Resume it and repeat. */
+		error = __pm_runtime_resume(dev->parent, false);
+		if (error)
+			return error;
+
+		goto repeat;
+	}
+
+	if (dev->power.runtime_status == RPM_RESUMING) {
+		if (grace)
+			dev->power.runtime_status |= RPM_GRACE;
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * There's another resume running in parallel with us. Wait for
+		 * it to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	/* The RPM_GRACE bit may be set in runtime_status. */
+	dev->power.runtime_status &= ~(RPM_WAKE | RPM_SUSPENDED);
+	dev->power.runtime_status |= RPM_RESUMING;
+	if (grace)
+		dev->power.runtime_status |= RPM_GRACE;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_resume)
+		error = dev->bus->pm->runtime_resume(dev);
+
+	spin_lock(&dev->power.lock);
+
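+	dev->power.runtime_status &= ~RPM_RESUMING;
+	/*
+	 * On success the status stays as it is (RPM_GRACE may still be set).
+	 * -EAGAIN and -EBUSY mean that the device remains suspended.
+	 */
+	if (error == -EAGAIN || error == -EBUSY)
+		dev->power.runtime_status = RPM_SUSPENDED;
+	else if (error)
+		dev->power.runtime_status = RPM_ERROR;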
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+
+ out_unlock:
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+	goto out;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_resume);
+
+/**
+ * pm_runtime_resume_work - Run __pm_runtime_resume() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the resume has been scheduled for and run
+ * __pm_runtime_resume() for it without forcing a grace period after the resume.
+ */
+static void pm_runtime_resume_work(struct work_struct *work)
+{
+	__pm_runtime_resume(pm_work_to_device(work), false);
+}
+
+/**
+ * __pm_request_resume - Schedule run-time resume of given device.
+ * @dev: Device to resume.
+ * @grace: If set, force a post-resume grace period.
+ */
+void __pm_request_resume(struct device *dev, bool grace)
+{
+	unsigned long parent_flags = 0, flags;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	if (dev->parent)
+		spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		/* Autosuspend request is pending, no need to resume. */
+		pm_cancel_suspend(dev);
+		if (grace)
+			dev->power.runtime_status |= RPM_GRACE;
+		goto out;
+	} else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
+		goto out;
+	} else if (dev->parent
+	    && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+
+		/* The parent is suspending, suspended or idle. Wake it up. */
+		__pm_request_resume(dev->parent, false);
+
+		goto repeat;
+	}
+
+	/*
+	 * The device may be suspending at the moment and we can't clear the
+	 * RPM_SUSPENDING bit in its runtime_status just yet.
+	 */
+	dev->power.runtime_status |= RPM_WAKE;
+	if (grace)
+		dev->power.runtime_status |= RPM_GRACE;
+	INIT_WORK(&dev->power.runtime_work.work, pm_runtime_resume_work);
+	queue_work(pm_wq, &dev->power.runtime_work.work);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (dev->parent)
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+}
+EXPORT_SYMBOL_GPL(__pm_request_resume);
+
+/**
+ * pm_cancel_runtime_suspend - Cancel a pending suspend request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_suspend(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		cancel_delayed_work(&dev->power.runtime_work);
+		dev->power.runtime_status = RPM_ACTIVE;
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_suspend);
+
+/**
+ * pm_cancel_runtime_resume - Cancel a pending resume request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_resume(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status & RPM_WAKE) {
+		work_clear_pending(&dev->power.runtime_work.work);
+		dev->power.runtime_status &= ~(RPM_WAKE | RPM_GRACE);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_resume);
+
+/**
+ * pm_runtime_disable - Disable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Increase the depth field in the device's dev_pm_info structure, which will
+ * cause the run-time PM functions above to return without doing anything.
+ * If there is a run-time PM operation in progress, wait for it to complete.
+ */
+void pm_runtime_disable(struct device *dev)
+{
+	might_sleep();
+
+	atomic_inc(&dev->power.depth);
+
+	if (dev->power.runtime_status & RPM_IN_PROGRESS)
+		wait_for_completion(&dev->power.work_done);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_disable);
+
+/**
+ * pm_runtime_enable - Enable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Enable run-time power management for given device by decreasing the depth
+ * field in its dev_pm_info structure.
+ */
+void pm_runtime_enable(struct device *dev)
+{
+	if (!atomic_add_unless(&dev->power.depth, -1, 0))
+		dev_warn(dev, "PM: Excessive pm_runtime_enable()!\n");
+}
+EXPORT_SYMBOL_GPL(pm_runtime_enable);
+
+/**
+ * pm_runtime_init - Initialize run-time PM fields in given device object.
+ * @dev: Device object to handle.
+ */
+void pm_runtime_init(struct device *dev)
+{
+	spin_lock_init(&dev->power.lock);
+	dev->power.runtime_status = RPM_ACTIVE;
+	atomic_set(&dev->power.depth, 1);
+	pm_suspend_check_children(dev, true);
+}
Index: linux-2.6/include/linux/pm_runtime.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pm_runtime.h
@@ -0,0 +1,112 @@
+/*
+ * pm_runtime.h - Device run-time power management helper functions.
+ *
+ * Copyright (C) 2009 Rafael J. Wysocki <rjw@sisk.pl>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_RUNTIME_H
+#define _LINUX_PM_RUNTIME_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+#ifdef CONFIG_PM_RUNTIME
+
+extern struct workqueue_struct *pm_wq;
+
+extern void pm_runtime_init(struct device *dev);
+extern void __pm_runtime_change_status(struct device *dev, unsigned int status,
+				       unsigned int new_status);
+extern int __pm_runtime_suspend(struct device *dev, bool sync);
+extern void pm_request_suspend(struct device *dev, unsigned int msec);
+extern int __pm_runtime_resume(struct device *dev, bool grace);
+extern void __pm_request_resume(struct device *dev, bool grace);
+extern void pm_cancel_runtime_suspend(struct device *dev);
+extern void pm_cancel_runtime_resume(struct device *dev);
+extern void pm_runtime_disable(struct device *dev);
+extern void pm_runtime_enable(struct device *dev);
+
+static inline struct device *pm_work_to_device(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(dw, struct dev_pm_info, runtime_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+	dev->power.suspend_skip_children = !enable;
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+static inline void pm_runtime_init(struct device *dev) {}
+static inline void __pm_runtime_change_status(struct device *dev,
+					      unsigned int status,
+					      unsigned int new_status) {}
+static inline int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	return -ENOSYS;
+}
+static inline void pm_request_suspend(struct device *dev, unsigned int msec) {}
+static inline int __pm_runtime_resume(struct device *dev, bool grace)
+{
+	return -ENOSYS;
+}
+static inline void __pm_request_resume(struct device *dev, bool grace) {}
+static inline void pm_cancel_runtime_suspend(struct device *dev) {}
+static inline void pm_cancel_runtime_resume(struct device *dev) {}
+static inline void pm_runtime_disable(struct device *dev) {}
+static inline void pm_runtime_enable(struct device *dev) {}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+}
+
+#endif /* !CONFIG_PM_RUNTIME */
+
+static inline int pm_runtime_suspend(struct device *dev)
+{
+	return __pm_runtime_suspend(dev, true);
+}
+
+static inline int pm_runtime_resume(struct device *dev)
+{
+	return __pm_runtime_resume(dev, false);
+}
+
+static inline int pm_runtime_resume_grace(struct device *dev)
+{
+	return __pm_runtime_resume(dev, true);
+}
+
+static inline void pm_request_resume(struct device *dev)
+{
+	__pm_request_resume(dev, false);
+}
+
+static inline void pm_request_resume_grace(struct device *dev)
+{
+	__pm_request_resume(dev, true);
+}
+
+static inline void pm_runtime_clear_active(struct device *dev)
+{
+	__pm_runtime_change_status(dev, RPM_ERROR, RPM_ACTIVE);
+}
+
+static inline void pm_runtime_clear_suspended(struct device *dev)
+{
+	__pm_runtime_change_status(dev, RPM_ERROR, RPM_SUSPENDED);
+}
+
+static inline void pm_runtime_release(struct device *dev)
+{
+	__pm_runtime_change_status(dev, RPM_GRACE, RPM_ACTIVE);
+}
+
+#endif
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -21,6 +21,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
@@ -88,6 +89,7 @@ void device_pm_add(struct device *dev)
 	}
 
 	list_add_tail(&dev->power.entry, &dpm_list);
+	pm_runtime_init(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -507,6 +509,7 @@ static void dpm_complete(pm_message_t st
 		get_device(dev);
 		if (dev->power.status > DPM_ON) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			mutex_unlock(&dpm_list_mtx);
 
 			device_complete(dev, state);
@@ -753,6 +756,7 @@ static int dpm_prepare(pm_message_t stat
 
 		get_device(dev);
 		dev->power.status = DPM_PREPARING;
+		pm_runtime_disable(dev);
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_prepare(dev, state);
@@ -760,6 +764,7 @@ static int dpm_prepare(pm_message_t stat
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			if (error == -EAGAIN) {
 				put_device(dev);
 				continue;
Index: linux-2.6/drivers/base/dd.c
===================================================================
--- linux-2.6.orig/drivers/base/dd.c
+++ linux-2.6/drivers/base/dd.c
@@ -23,6 +23,7 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
+#include <linux/pm_runtime.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -202,8 +203,12 @@ int driver_probe_device(struct device_dr
 	pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
 		 drv->bus->name, __func__, dev_name(dev), drv->name);
 
+	pm_runtime_disable(dev);
+
 	ret = really_probe(dev, drv);
 
+	pm_runtime_enable(dev);
+
 	return ret;
 }
 
@@ -306,6 +311,8 @@ static void __device_release_driver(stru
 
 	drv = dev->driver;
 	if (drv) {
+		pm_runtime_disable(dev);
+
 		driver_sysfs_remove(dev);
 
 		if (dev->bus)
@@ -320,6 +327,8 @@ static void __device_release_driver(stru
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
+
+		pm_runtime_enable(dev);
 	}
 }
 
Index: linux-2.6/Documentation/power/runtime_pm.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/power/runtime_pm.txt
@@ -0,0 +1,311 @@
+Run-time Power Management Framework for I/O Devices
+
+(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+1. Introduction
+
+The support for run-time power management (run-time PM) of I/O devices is
+provided at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queuing all work items related to run-time PM, because this allows
+  them to be synchronized with system-wide power transitions.  pm_wq is declared
+  in include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of run-time PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing run-time PM operations with one another.
+
+* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out run-time PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The device run-time PM fields defined in 'struct dev_pm_info', the helper
+functions and the run-time PM callbacks defined in 'struct dev_pm_ops' are
+described below.
+
+2. Run-time PM Helper Functions and Device Fields
+
+The following helper functions are defined in drivers/base/power/runtime.c
+and include/linux/pm_runtime.h:
+
+* void pm_runtime_init(struct device *dev);
+
+* void pm_runtime_enable(struct device *dev);
+* void pm_runtime_disable(struct device *dev);
+
+* int pm_runtime_suspend(struct device *dev);
+* void pm_request_suspend(struct device *dev, unsigned int msec);
+* int pm_runtime_resume(struct device *dev);
+* int pm_runtime_resume_grace(struct device *dev);
+* void pm_request_resume(struct device *dev);
+* void pm_request_resume_grace(struct device *dev);
+* void pm_runtime_release(struct device *dev);
+
+* void pm_cancel_runtime_suspend(struct device *dev);
+* void pm_cancel_runtime_resume(struct device *dev);
+
+* void pm_suspend_check_children(struct device *dev, bool enable);
+
+* void pm_runtime_clear_active(struct device *dev);
+* void pm_runtime_clear_suspended(struct device *dev);
+
+pm_runtime_init() initializes the run-time PM fields in the 'power' member of
+the device object.  It is called during the initialization of the device object,
+in drivers/base/power/main.c:device_pm_add().
+
+pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
+respectively, all of the run-time PM core operations.  They do it by decreasing
+and increasing, respectively, the 'power.depth' field of 'struct device'.  If
+the value of this field is greater than 0, pm_runtime_suspend(),
+pm_request_suspend(), pm_runtime_resume() and so on return immediately without
+doing anything and -EBUSY is returned by pm_runtime_suspend(),
+pm_runtime_resume() and pm_runtime_resume_grace().  Therefore, if
+pm_runtime_disable() is called several times in a row for the same device, it
+has to be balanced by the appropriate number of pm_runtime_enable() calls so
+that the other run-time PM core functions can be used for that device.  The
+initial value of 'power.depth', as set by pm_runtime_init(), is 1 (i.e. the
+run-time PM of the device is initially disabled).
+
+pm_runtime_disable() and pm_runtime_enable() are used by the device core to
+disable the run-time PM of the device temporarily during device probe and
+removal as well as during system-wide power transitions (i.e. system-wide
+suspend or hibernation, or resume from a system sleep state).
+
+pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+pm_runtime_resume_grace(), pm_request_resume(), and pm_request_resume_grace()
+use the 'power.runtime_status' and 'power.suspend_aborted' fields of
+'struct device' for mutual synchronization.  The 'power.runtime_status' field,
+called the device's run-time PM status in what follows, is set to RPM_ACTIVE by
+pm_runtime_init().
+
+pm_request_suspend() is used to queue up a suspend request for an active device.
+If the run-time PM status of the device (i.e. the value of the
+'power.runtime_status' field in 'struct device') is different from RPM_ACTIVE
+(i.e. the device is not active from the PM core standpoint), it returns
+immediately.  Otherwise, it changes the device's run-time PM status to RPM_IDLE
+and puts a request to suspend the device into pm_wq.  The 'msec' argument is
+used to specify the time to wait before the request will be completed, in
+milliseconds.  It is valid to call this function from interrupt context.
+
+pm_runtime_suspend() is used to carry out a run-time suspend of an active
+device.  It is called directly by a bus type or device driver.  An asynchronous
+version of it is called by the PM core, to complete a request queued up by
+pm_request_suspend().  The only difference between them is the handling of
+situations when a queued up suspend request has just been cancelled.  Apart from
+this, they work in the same way.
+* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field, 'power.runtime_status'), success is returned.
+* If the device is about to resume or is in a post-resume grace period (i.e. at
+  least one of the RPM_WAKE, RPM_RESUMING, and RPM_GRACE bits is set in the
+  device's run-time PM status field), -EAGAIN is returned.  -EAGAIN is also
+  returned if the function has been called via pm_wq as a result of a cancelled
+  suspend request (the 'power.suspend_aborted' field is used for this purpose).
+* If the device is suspending (i.e. its run-time PM status is RPM_SUSPENDING),
+  which means that another instance of pm_runtime_suspend() is running at the
+  same time for the same device, the function waits for the other instance to
+  complete and returns the error code (or success) returned by it.
+* If the device's children are not suspended and the
+  'power.suspend_skip_children' flag is not set for it, the device's run-time PM
+  status is set to RPM_ACTIVE and -EAGAIN is returned.
+If none of the above takes place, the device's run-time PM status is set to
+RPM_SUSPENDING and its bus type's ->runtime_suspend() callback is executed.
+This callback is responsible for handling the device as appropriate (for
+example, it may choose to execute the device driver's ->runtime_suspend()
+callback or to carry out any other suitable action depending on the bus type).
+* If it completes successfully, the RPM_SUSPENDED bit is set and the
+  RPM_SUSPENDING bit is cleared in the device's run-time PM status field.  Once
+  that has happened, the device is regarded by the PM core as suspended, but it
+  _need_ _not_ mean that the device has been put into a low power state.  What
+  really occurs to the device at this point totally depends on its bus type (it
+  may depend on the device's driver if the bus type chooses to call it).
+  Additionally, if the device bus type's ->runtime_suspend() callback completes
+  successfully, the device bus type's ->runtime_idle() callback is executed for
+  the device's parent, if there is one and if all of its children are suspended
+  (or the 'power.suspend_skip_children' flag is set for it).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_ACTIVE.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to carry out any run-time PM operations
+  for it until the status is cleared by its bus type or driver with the help of
+  either pm_runtime_clear_active(), or pm_runtime_clear_suspended().
+Finally, pm_runtime_suspend() returns the error code (or success) returned by
+the device bus type's ->runtime_suspend() callback.  If the device's bus type
+doesn't implement ->runtime_suspend(), -EINVAL is returned and the device's
+run-time PM status is set to RPM_ERROR.
+
+pm_request_resume() and pm_request_resume_grace() are used to queue up a resume
+request for a device that is suspended, suspending or has a suspend request
+pending.  The difference between them is that pm_request_resume_grace() causes
+the RPM_GRACE bit to be set in the device's run-time PM status field, which
+prevents the PM core from suspending the device or queuing up a suspend request
+for it until the RPM_GRACE bit is cleared with the help of pm_runtime_release().
+Apart from this, they work in the same way.
+* If a suspend request is pending for the device (i.e. the device's run-time PM
+  status is RPM_IDLE), it is cancelled, the 'power.suspend_aborted' flag is set
+  for the device, the RPM_IDLE bit is cleared in the device's run-time PM status
+  field and the function returns (pm_request_resume_grace() additionally sets
+  the RPM_GRACE bit in the device's run-time PM status field).
+* If the device is not suspended or suspending (i.e. none of the RPM_SUSPENDED
+  and RPM_SUSPENDING bits is set in the device's run-time PM status field), the
+  function returns.
+* If the device's parent is inactive (i.e. at least one of the RPM_IDLE,
+  RPM_SUSPENDING, and RPM_SUSPENDED bits is set in its run-time PM status
+  field), a resume request is (recursively) scheduled for the parent and the
+  function is restarted.
+If none of the above happens, the RPM_WAKE bit is set in the device's run-time
+PM status field and the request to execute pm_runtime_resume() is put into
+pm_wq.
+
+pm_runtime_resume() and pm_runtime_resume_grace() are used to carry out a
+run-time resume of a device that is suspended, suspending or has a suspend
+request pending.  They are called either by the PM core, to complete a request
+queued up by pm_request_resume(), or directly by a bus type or device driver.
+The difference between them is that pm_runtime_resume_grace() causes the
+RPM_GRACE bit to be set in the device's run-time PM status field, which prevents
+the PM core from suspending the device or queuing up a suspend request for it
+until the RPM_GRACE bit is cleared with the help of pm_runtime_release().  Apart
+from this, they work in the same way.
+* If the device is active (i.e. all of the bits in its run-time PM status are
+  clear, possibly except for RPM_GRACE), success is returned.
+* If there's a suspend request pending for the device (i.e. the device's
+  run-time PM status is RPM_IDLE), it is cancelled, the 'power.suspend_aborted'
+  flag is set for the device, the RPM_IDLE bit is cleared in its run-time PM
+  status field and the function returns success (pm_runtime_resume_grace()
+  additionally sets the RPM_GRACE bit in the device's run-time PM status field).
+* If the device is suspending (i.e. the RPM_SUSPENDING bit is set in its
+  run-time PM status field), the function waits for the suspend operation to
+  complete and restarts itself.
+* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field), the device's parent exists and is not active (i.e.
+  the parent's run-time PM status is not RPM_ACTIVE or RPM_GRACE), the parent is
+  resumed (recursively) and the function restarts itself.
+* If the device is resuming (i.e. the device's run-time PM status is
+  RPM_RESUMING), which means that another instance of pm_runtime_resume() is
+  running at the same time for the same device, the function waits for the other
+  instance to complete and returns the result returned by it.
+If none of the above happens, the RPM_WAKE and RPM_SUSPENDED bits are cleared
+and the RPM_RESUMING bit is set in the device's run-time PM status field.  Next,
+the device bus type's ->runtime_resume() callback is executed, which is
+responsible for handling the device as appropriate (for example, it may choose
+to execute the device driver's ->runtime_resume() callback or to carry out any
+other suitable action depending on the bus type).
+* If it completes successfully, the device's run-time PM status is set to
+  'active' (i.e. the device's run-time PM status field is either RPM_ACTIVE, or
+  RPM_GRACE), which means that the device is fully operational.  Thus, the
+  device bus type's ->runtime_resume() callback, when it is about to return
+  success, _must_ _ensure_ that this really is the case (i.e. when it returns
+  success, the device _must_ be able to carry out I/O operations as needed).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_SUSPENDED.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to carry out any run-time PM operations
+  for it until the status is cleared by its bus type or driver with the help of
+  either pm_runtime_clear_active(), or pm_runtime_clear_suspended().
+Finally, pm_runtime_resume() returns the error code (or success) returned by
+the device bus type's ->runtime_resume() callback.  If the device's bus type
+doesn't implement ->runtime_resume(), -EINVAL is returned and the device's
+run-time PM status is set to RPM_ERROR.
+
+pm_runtime_release() is used to clear the RPM_GRACE bit in the device's run-time
+PM status field.  This bit, if set, causes the PM core to refuse to suspend
+the device or to queue up a suspend request for it.  In particular, it causes
+pm_runtime_suspend() to return -EAGAIN without doing anything else.  This may
+be useful if the device is resumed for a specific task and it shouldn't be
+suspended until the task is complete, but there are many potential sources of
+suspend requests that could disturb it.
+
+pm_cancel_runtime_suspend() is used to cancel a pending suspend request for an
+active device, but it can only be called when the run-time PM of the device
+is disabled.  It is supposed to be used during system-wide power transitions.
+
+pm_cancel_runtime_resume() is used to cancel a pending resume request for
+a suspended device.  It can only be called when the run-time PM of the device
+is disabled and it is supposed to be used during system-wide power transitions.
+
+pm_suspend_check_children() is used to set or unset the
+'power.suspend_skip_children' flag in 'struct device'.  If the 'enable'
+argument is 'true', the field is set to 0, and if 'enable' is 'false', the field
+is set to 1.  The default value of 'power.suspend_skip_children', as set by
+pm_runtime_init(), is 0.
+
+pm_runtime_clear_active() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_ACTIVE.
+
+pm_runtime_clear_suspended() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_SUSPENDED.
+
+3. Device Run-time PM Callbacks
+
+There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend() callback is executed by pm_runtime_suspend() for the bus
+type of the device being suspended.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_suspend() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the bus type's ->runtime_suspend() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_suspend() callback has returned successfully,
+  the PM core regards the device as suspended, which need not mean that the
+  device has been put into a low power state.  It is supposed to mean, however,
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_ACTIVE, which means that the device
+  _must_ be fully operational once this has happened.
+* If the bus type's ->runtime_suspend() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+In particular, it is recommended that ->runtime_suspend() return -EBUSY or
+-EAGAIN if device_may_wakeup() returns 'false' for the device.  On the other
+hand, if device_may_wakeup() returns 'true' for the device and the device is put
+into a low power state during the execution of ->runtime_suspend(), it is
+expected that remote wake-up (i.e. a hardware mechanism allowing the device to
+request a change of its power state, such as PCI PME) will be enabled for the
+device.  Generally, remote wake-up should be enabled whenever the device is put
+into a low power state at run time and is expected to receive input from the
+outside of the system.
+
+The ->runtime_resume() callback is executed by pm_runtime_resume() for the bus
+type of the device being woken up.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_resume() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the bus type's ->runtime_resume() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_resume() callback has returned successfully,
+  the PM core regards the device as fully operational, which means that the
+  device _must_ be able to complete I/O operations as needed.
+* If the bus type's ->runtime_resume() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_SUSPENDED, which is supposed to mean
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_resume() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_idle() callback is executed by pm_runtime_suspend() for the bus
+type of a device the children of which are all suspended (or which has the
+'power.suspend_skip_children' flag set).  The action carried out by this
+callback is totally dependent on the bus type in question, but the expected
+action is to check if the device can be suspended (i.e. if all of the conditions
+necessary for suspending the device are met) and to queue up a suspend request
+for the device if that is the case.

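For reference, the return-code handling described in Sections 2 and 3 of the
documentation above can be condensed into a small user-space sketch.  This is
only a model under stated assumptions: the demo_* function names are
hypothetical and not part of the patch, the RPM_* values are copied from the
patch, and the RPM_GRACE case after a successful resume is deliberately left
out.

```c
#include <errno.h>

#define RPM_ACTIVE	0
#define RPM_SUSPENDED	0x04
#define RPM_ERROR	(-1)

/* Status the core records after ->runtime_suspend() returned @ret. */
int demo_status_after_suspend(int ret)
{
	if (ret == 0)
		return RPM_SUSPENDED;	/* regarded as suspended */
	if (ret == -EBUSY || ret == -EAGAIN)
		return RPM_ACTIVE;	/* device must remain operational */
	return RPM_ERROR;		/* cleared only by pm_runtime_clear_*() */
}

/* Status the core records after ->runtime_resume() returned @ret. */
int demo_status_after_resume(int ret)
{
	if (ret == 0)
		return RPM_ACTIVE;	/* RPM_GRACE variant not modelled */
	if (ret == -EBUSY || ret == -EAGAIN)
		return RPM_SUSPENDED;	/* device stays suspended */
	return RPM_ERROR;
}
```

Any other error code, -EIO for instance, leaves the device in RPM_ERROR in
this model until the bus type or driver clears it.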

* Re: [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
From: Alan Stern @ 2009-06-17 20:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Wed, 17 Jun 2009, Rafael J. Wysocki wrote:

> Sorry for the broken patch.  My mailer started to wordwrap messages
> automatically and I didn't notice.
> 
> The correct patch is appended.

> Index: linux-2.6/include/linux/pm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/pm.h
> +++ linux-2.6/include/linux/pm.h

> + * @runtime_suspend: Prepare the device for a condition in which it won't be
> + *	able to communicate with the CPU(s) and RAM due to power management.
> + *	This need not mean that the device should be put into a low power state,
> + *	like for example when the device is behind a link, represented by a

Suggested rephrasing: For example, if the device is behind a link
which is about to be turned off, the device may remain at full power.
But if the device does go to low power and if device_may_wakeup(dev)
is true, enable remote wakeup.


> +/**
> + * Device run-time power management state.
> + *
> + * These state labels are used internally by the PM core to indicate the current
> + * status of a device with respect to the PM core operations.  They do not
> + * reflect the actual power state of the device or its status as seen by the
> + * driver.
> + *
> + * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
> + *			pending for it.
> + *
> + * RPM_IDLE		It has been requested that the device be suspended.
> + *			Suspend request has been put into the run-time PM
> + *			workqueue and it's pending execution.
> + *
> + * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
> + *			executed.
> + *
> + * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
> + *			completed successfully.  The device is regarded as
> + *			suspended.
> + *
> + * RPM_WAKE		It has been requested that the device be woken up.
> + *			Resume request has been put into the run-time PM
> + *			workqueue and it's pending execution.
> + *
> + * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
> + *			executed.
> + *
> + * RPM_ERROR		Represents a condition from which the PM core cannot
> + *			recover by itself.  If the device's run-time PM status
> + *			field has this value, all of the run-time PM operations
> + *			carried out for the device by the core will fail, until
> + *			the status field is changed to either RPM_ACTIVE or
> + *			RPM_SUSPENDED (it is not valid to use the other values
> + *			in such a situation) by the device's driver or bus type.
> + *			This happens when the device bus type's
> + *			->runtime_suspend() or ->runtime_resume() callback
> + *			returns error code different from -EAGAIN or -EBUSY.

What about RPM_GRACE?

> + */
> +
> +#define RPM_ACTIVE	0
> +#define RPM_IDLE	0x01
> +#define RPM_SUSPENDING	0x02
> +#define RPM_SUSPENDED	0x04
> +#define RPM_WAKE	0x08
> +#define RPM_RESUMING	0x10
> +#define RPM_GRACE	0x20
> +#define RPM_ERROR	(-1)

This won't work very well when assigned to an unsigned 6-bit field.
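
To illustrate the truncation, here is a small user-space sketch.  It assumes
the status is stored in a 6-bit unsigned bit-field, as the remark above
implies; 'struct demo_pm_info' and the demo_* helpers are hypothetical names
and not part of the patch.

```c
#include <stdbool.h>

#define RPM_SUSPENDED	0x04
#define RPM_ERROR	(-1)

/* Hypothetical stand-in for the status field; the 6-bit width is assumed. */
struct demo_pm_info {
	unsigned int runtime_status:6;
};

/* Store a status value and return what the bit-field actually holds. */
unsigned int demo_store_status(int status)
{
	struct demo_pm_info info;

	info.runtime_status = status;	/* -1 is reduced modulo 64 */
	return info.runtime_status;
}

/* Mimics the core's test: does the stored value compare equal to RPM_ERROR? */
bool demo_error_detected(int status)
{
	struct demo_pm_info info;

	info.runtime_status = status;
	return info.runtime_status == RPM_ERROR;
}
```

In this model demo_store_status(RPM_ERROR) yields 0x3f, which has every
status bit set (RPM_SUSPENDED included), and demo_error_detected(RPM_ERROR)
is false, so a test against RPM_ERROR would never match.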

> +
> +#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
> +#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)
> +#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING | RPM_GRACE)
> +#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)

Since each of these is used only once, it would be better not to
define them as macros.  Use the parenthesized expression instead; this
will be easier for readers to understand.


> +/**
> + * __pm_runtime_change_status - Change the run-time PM status of a device.
> + * @dev: Device to handle.
> + * @status: Expected current run-time PM status of the device.
> + * @new_status: New value of the device's run-time PM status.
> + *
> + * Change the run-time PM status of the device to @new_status if its current
> + * value is equal to @status.
> + */
> +void __pm_runtime_change_status(struct device *dev, unsigned int status,

If RPM_ERROR is -1 then status better not be unsigned.

> +				unsigned int new_status)
> +{
> +	unsigned long flags;
> +
> +	if (atomic_read(&dev->power.depth) > 0)
> +		return;

Return only if new_status == RPM_SUSPENDED.  Is this routine ever
called with status equal to anything other than RPM_ERROR?


> +/**
> + * pm_check_children - Check if all children of a device have been suspended.
> + * @dev: Device to check.
> + *
> + * Returns 0 if all children of the device have been suspended or -EBUSY
> + * otherwise.
> + */
> +static int pm_check_children(struct device *dev)
> +{
> +	return dev->power.suspend_skip_children ? 0 :
> +			device_for_each_child(dev, NULL, pm_device_suspended);
> +}

Instead of a costly device_for_each_child(), would it be better to
maintain a counter with the number of unsuspended children?
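
A rough user-space sketch of the counter idea follows.  It is a model only:
'struct demo_node' and the demo_* functions are hypothetical, there is no
locking, and the question of resuming a child under a suspended parent is not
modelled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device node; not the patch's 'struct device'. */
struct demo_node {
	struct demo_node *parent;
	unsigned int unsuspended_children;
	bool suspended;
	bool skip_children;	/* models power.suspend_skip_children */
};

/* Register an active device and account for it in its parent. */
void demo_add(struct demo_node *dev, struct demo_node *parent)
{
	dev->parent = parent;
	dev->unsuspended_children = 0;
	dev->suspended = false;
	dev->skip_children = false;
	if (parent)
		parent->unsuspended_children++;
}

/* O(1) check that replaces the walk over the list of children. */
int demo_check_children(const struct demo_node *dev)
{
	if (dev->skip_children || dev->unsuspended_children == 0)
		return 0;
	return -EBUSY;
}

/* Suspend the device and drop it from the parent's counter. */
int demo_suspend(struct demo_node *dev)
{
	int error;

	if (dev->suspended)
		return 0;
	error = demo_check_children(dev);
	if (error)
		return error;
	dev->suspended = true;
	if (dev->parent)
		dev->parent->unsuspended_children--;
	return 0;
}

/* Resume the device and add it back to the parent's counter. */
void demo_resume(struct demo_node *dev)
{
	if (!dev->suspended)
		return;
	dev->suspended = false;
	if (dev->parent)
		dev->parent->unsuspended_children++;
}
```

The counter has to be updated on every status change of a child, under
whatever lock protects the parent's status, which is the price paid for the
cheaper check.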


> +/**
> + * __pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
> + * @dev: Device to suspend.
> + * @sync: If unset, the function has been called via pm_wq.
> + *
> + * Check if the status of the device is appropriate and run the
> + * ->runtime_suspend() callback provided by the device's bus type driver.
> + * Update the run-time PM flags in the device object to reflect the current
> + * status of the device.
> + */
> +int __pm_runtime_suspend(struct device *dev, bool sync)
> +{
> +	int error = -EINVAL;
> +
> +	if (atomic_read(&dev->power.depth) > 0)
> +		return -EBUSY;

Should this test be made inside the scope of the spinlock?

For that matter, should power.depth always be set within the spinlock?
If it is then it doesn't need to be an atomic_t.

> +
> +	spin_lock(&dev->power.lock);

Should be spin_lock_irq().  Same in other places.

> +
> +	if (dev->power.runtime_status == RPM_ERROR) {
> +		goto out;
> +	} else if (dev->power.runtime_status & RPM_SUSPENDED) {
> +		error = 0;
> +		goto out;
> +	} else if ((dev->power.runtime_status & RPM_NO_SUSPEND)
> +	    || (!sync && dev->power.suspend_aborted)) {
> +		/*
> +		 * Device is resuming or in a post-resume grace period or
> +		 * there's a resume request pending, or a pending suspend
> +		 * request has just been cancelled and we're running as a result
> +		 * of this request.
> +		 */

In the sync case, it might be better to wait until the ongoing resume
(or resume grace period) is finished and then do the suspend.

Of course, this depends on the context in which the synchronous
runtime suspend is carried out.  Right now, the only such context I
know of is when the user tells the system to force a USB device into a
suspended state.

> +		error = -EAGAIN;
> +		goto out;
> +	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
> +		spin_unlock(&dev->power.lock);
> +
> +		/*
> +		 * Another suspend is running in parallel with us.  Wait for it
> +		 * to complete and return.
> +		 */
> +		wait_for_completion(&dev->power.work_done);
> +
> +		return dev->power.runtime_error;
> +	} else if (pm_check_children(dev)) {
> +		/*
> +		 * We can only suspend the device if all of its children have
> +		 * been suspended.
> +		 */
> +		dev->power.runtime_status = RPM_ACTIVE;
> +		error = -EAGAIN;

-EBUSY would be more appropriate.

> +		goto out;
> +	}

> +/**
> + * pm_cancel_suspend - Cancel a pending suspend request for given device.
> + * @dev: Device to cancel the suspend request for.
> + */
> +static void pm_cancel_suspend(struct device *dev)
> +{
> +	cancel_delayed_work(&dev->power.runtime_work);
> +	dev->power.runtime_status &= RPM_GRACE;

This looks strange.  Aren't we guaranteed at this point that the
status is RPM_IDLE?

> +	dev->power.suspend_aborted = true;
> +}
> +
> +/**
> + * __pm_runtime_resume - Run a device bus type's runtime_resume() callback.
> + * @dev: Device to resume.
> + * @grace: If set, force a post-resume grace period.
> + *
> + * Check if the device is really suspended and run the ->runtime_resume()
> + * callback provided by the device's bus type driver.  Update the run-time PM
> + * flags in the device object to reflect the current status of the device.  If
> + * runtime suspend is in progress while this function is being run, wait for it
> + * to finish before resuming the device.  If runtime suspend is scheduled, but
> + * it hasn't started yet, cancel it and we're done.
> + */
> +int __pm_runtime_resume(struct device *dev, bool grace)
> +{
> +	int error = -EINVAL;
...
> +	} else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
> +	    && (dev->parent->power.runtime_status & ~RPM_GRACE)) {
> +		spin_unlock(&dev->power.lock);

Here's where you want to increment the parent's depth.  Figuring out
where to decrement it again isn't easy, given the way this routine is
structured.

> +		spin_unlock(&dev->parent->power.lock);
> +
> +		/* The device's parent is not active.  Resume it and repeat. */
> +		error = __pm_runtime_resume(dev->parent, false);
> +		if (error)
> +			return error;

Need to reset error to -EINVAL.


> +/**
> + * __pm_request_resume - Schedule run-time resume of given device.
> + * @dev: Device to resume.
> + * @grace: If set, force a post-resume grace period.
> + */
> +void __pm_request_resume(struct device *dev, bool grace)
> +{
> +	unsigned long parent_flags = 0, flags;
> +
> + repeat:
> +	if (atomic_read(&dev->power.depth) > 0)
> +		return;
> +
> +	if (dev->parent)
> +		spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
> +	spin_lock_irqsave(&dev->power.lock, flags);
> +
> +	if (dev->power.runtime_status == RPM_IDLE) {
> +		/* Autosuspend request is pending, no need to resume. */
> +		pm_cancel_suspend(dev);
> +		if (grace)
> +			dev->power.runtime_status |= RPM_GRACE;
> +		goto out;
> +	} else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
> +		goto out;
> +	} else if (dev->parent
> +	    && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
> +		spin_unlock_irqrestore(&dev->power.lock, flags);
> +		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
> +
> +		/* The parent is suspending, suspended or idle. Wake it up. */
> +		__pm_request_resume(dev->parent, false);
> +
> +		goto repeat;

What if the parent's state is RPM_SUSPENDING?  Won't this go into a
tight loop?  You need to test the parent's RPM_WAKE bit above.
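
As a sketch of the kind of test meant here (user-space model only;
demo_parent_needs_wake() is a hypothetical name, the RPM_* values are copied
from the patch, and what the caller does when the predicate is false is left
open):

```c
#include <stdbool.h>

#define RPM_ACTIVE	0
#define RPM_IDLE	0x01
#define RPM_SUSPENDING	0x02
#define RPM_SUSPENDED	0x04
#define RPM_WAKE	0x08

/*
 * Hypothetical predicate: should the child queue a resume request for its
 * parent?  Only if the parent is inactive and no wake-up is pending yet.
 */
bool demo_parent_needs_wake(unsigned int parent_status)
{
	if (parent_status & RPM_WAKE)
		return false;	/* request already queued, do not repeat it */

	return (parent_status &
		(RPM_IDLE | RPM_SUSPENDING | RPM_SUSPENDED)) != 0;
}
```

With the test as quoted, a parent in RPM_SUSPENDING | RPM_WAKE still matches
RPM_INACTIVE, so the child would keep jumping back to 'repeat' until the
parent's suspend finishes.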


> Index: linux-2.6/Documentation/power/runtime_pm.txt
> ===================================================================
> --- /dev/null
> +++ linux-2.6/Documentation/power/runtime_pm.txt
> @@ -0,0 +1,311 @@
> +Run-time Power Management Framework for I/O Devices
> +
> +(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
> +
> +1. Introduction
> +
> +The support for run-time power management (run-time PM) of I/O devices is

s/The support/Support/

> +provided at the power management core (PM core) level by means of:

> +pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
> +respectively, all of the run-time PM core operations.  They do it by decreasing
> +and increasing, respectively, the 'power.depth' field of 'struct device'.  If
> +the value of this field is greater than 0, pm_runtime_suspend(),
> +pm_request_suspend(), pm_runtime_resume() and so on return immediately without
> +doing anything and -EBUSY is returned by pm_runtime_suspend(),
> +pm_runtime_resume() and pm_runtime_resume_grace().  Therefore, if

In your code, pm_runtime_disable() doesn't actually do a resume.  So if
a driver wants to make sure a device is at full power and stays that
way, it has to call:

	pm_runtime_resume(dev);
	pm_runtime_disable(dev);

This is a race; another thread might suspend the device in between.
It would make more sense to have pm_runtime_resume() function
normally even when depth > 0.  Then the calls could be made in the
opposite order and there wouldn't be a race.

The equivalent code in USB does this automatically.  The
runtime-disable routine does a resume if the depth value was
originally 0, and the runtime-enable routine queues a delayed
autosuspend request if the final depth value is 0.

> +pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
> +pm_runtime_resume_grace(), pm_request_resume(), and pm_request_resume_grace()
> +use the 'power.runtime_status' and 'power.suspend_aborted' fields of
> +'struct device' for mutual synchronization.  The 'power.runtime_status' field,

Strictly speaking, they use those fields for mutual cooperation.  It's
the power.lock field which provides synchronization.


> +pm_runtime_suspend() is used to carry out a run-time suspend of an active
> +device.  It is called directly by a bus type or device driver.  An asynchronous
> +version of it is called by the PM core, to complete a request queued up by
> +pm_request_suspend().  The only difference between them is the handling of
> +situations when a queued up suspend request has just been cancelled.  Apart from
> +this, they work in the same way.
> +* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
> +  run-time PM status field, 'power.runtime_status'), success is returned.

Blank lines surrounding the *-ed paragraphs would make this more
readable.

> +pm_request_resume() and pm_request_resume_grace() are used to queue up a resume
> +request for a device that is suspended, suspending or has a suspend request
> +pending.  The difference between them is that pm_request_resume_grace() causes
> +the RPM_GRACE bit to be set in the device's run-time PM status field, which
> +prevents the PM core from suspending the device or queuing up a suspend request
> +for it until the RPM_GRACE bit is cleared with the help of pm_runtime_release().
> +Apart from this, they work in the same way.

Is RPM_GRACE really needed?  Can't we accomplish more or less the same
thing by using the autosuspend delay combined with the depth counter?

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
@ 2009-06-17 20:08                 ` Alan Stern
  0 siblings, 0 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-17 20:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Wed, 17 Jun 2009, Rafael J. Wysocki wrote:

> Sorry for the broken patch.  My mailer started to wordwrap messages
> automatically and I didn't notice.
> 
> The correct patch is appended.

> Index: linux-2.6/include/linux/pm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/pm.h
> +++ linux-2.6/include/linux/pm.h

> + * @runtime_suspend: Prepare the device for a condition in which it won't be
> + *	able to communicate with the CPU(s) and RAM due to power management.
> + *	This need not mean that the device should be put into a low power state,
> + *	like for example when the device is behind a link, represented by a

Suggested rephrasing: For example, if the device is behind a link
which is about to be turned off, the device may remain at full power.
But if the device does go to low power and if device_may_wakeup(dev)
is true, enable remote wakeup.


> +/**
> + * Device run-time power management state.
> + *
> + * These state labels are used internally by the PM core to indicate the current
> + * status of a device with respect to the PM core operations.  They do not
> + * reflect the actual power state of the device or its status as seen by the
> + * driver.
> + *
> + * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
> + *			pending for it.
> + *
> + * RPM_IDLE		It has been requested that the device be suspended.
> + *			Suspend request has been put into the run-time PM
> + *			workqueue and it's pending execution.
> + *
> + * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
> + *			executed.
> + *
> + * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
> + *			completed successfully.  The device is regarded as
> + *			suspended.
> + *
> + * RPM_WAKE		It has been requested that the device be woken up.
> + *			Resume request has been put into the run-time PM
> + *			workqueue and it's pending execution.
> + *
> + * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
> + *			executed.
> + *
> + * RPM_ERROR		Represents a condition from which the PM core cannot
> + *			recover by itself.  If the device's run-time PM status
> + *			field has this value, all of the run-time PM operations
> + *			carried out for the device by the core will fail, until
> + *			the status field is changed to either RPM_ACTIVE or
> + *			RPM_SUSPENDED (it is not valid to use the other values
> + *			in such a situation) by the device's driver or bus type.
> + *			This happens when the device bus type's
> + *			->runtime_suspend() or ->runtime_resume() callback
> + *			returns error code different from -EAGAIN or -EBUSY.

What about RPM_GRACE?

> + */
> +
> +#define RPM_ACTIVE	0
> +#define RPM_IDLE	0x01
> +#define RPM_SUSPENDING	0x02
> +#define RPM_SUSPENDED	0x04
> +#define RPM_WAKE	0x08
> +#define RPM_RESUMING	0x10
> +#define RPM_GRACE	0x20
> +#define RPM_ERROR	(-1)

This won't work very well when assigned to an unsigned 6-bit field.
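
To make the problem concrete, here is a minimal user-space sketch (not
kernel code).  It assumes the status lives in an unsigned int bit-field
six bits wide, as described above; the struct and helper names are made
up for illustration.

```c
#define RPM_ERROR	(-1)

/* Hypothetical stand-in for the status field in struct dev_pm_info. */
struct rpm_model {
	unsigned int runtime_status:6;
};

/* Store RPM_ERROR and return what the 6-bit field actually holds. */
static int stored_error_value(void)
{
	struct rpm_model m;

	m.runtime_status = RPM_ERROR;	/* -1 is reduced modulo 64 */
	return m.runtime_status;
}

/* Returns 1 if a field set to RPM_ERROR still compares equal to it. */
static int error_test_matches(void)
{
	struct rpm_model m;

	m.runtime_status = RPM_ERROR;
	/* The field is promoted to int before the comparison. */
	return m.runtime_status == RPM_ERROR;
}
```

Under these assumptions the field reads back as 0x3f, so the test
"runtime_status == RPM_ERROR" never succeeds, and since 0x3f has every
RPM_* flag bit set, tests such as "runtime_status & RPM_SUSPENDED"
succeed by accident.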

> +
> +#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
> +#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)
> +#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING | RPM_GRACE)
> +#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)

Since each of these is used only once, it would be better not to
define them as macros.  Use the parenthesized expression instead; this
will be easier for readers to understand.


> +/**
> + * __pm_runtime_change_status - Change the run-time PM status of a device.
> + * @dev: Device to handle.
> + * @status: Expected current run-time PM status of the device.
> + * @new_status: New value of the device's run-time PM status.
> + *
> + * Change the run-time PM status of the device to @new_status if its current
> + * value is equal to @status.
> + */
> +void __pm_runtime_change_status(struct device *dev, unsigned int status,

If RPM_ERROR is -1 then status better not be unsigned.

> +				unsigned int new_status)
> +{
> +	unsigned long flags;
> +
> +	if (atomic_read(&dev->power.depth) > 0)
> +		return;

Return only if new_status == RPM_SUSPENDED.  Is this routine ever
called with status equal to anything other than RPM_ERROR?


> +/**
> + * pm_check_children - Check if all children of a device have been suspended.
> + * @dev: Device to check.
> + *
> + * Returns 0 if all children of the device have been suspended or -EBUSY
> + * otherwise.
> + */
> +static int pm_check_children(struct device *dev)
> +{
> +	return dev->power.suspend_skip_children ? 0 :
> +			device_for_each_child(dev, NULL, pm_device_suspended);
> +}

Instead of a costly device_for_each_child(), would it be better to
maintain a counter with the number of unsuspended children?
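
A rough user-space sketch of the counter idea follows.  It is only a
model: the struct, its fields and the model_*() helpers are hypothetical,
locking is left out (real code would have to update the parent's counter
under the parent's power.lock), and suspend_skip_children is ignored.

```c
#include <errno.h>

/* Hypothetical model of a device; only the fields needed here. */
struct model_dev {
	struct model_dev *parent;
	int suspended;			/* 1 once the device is suspended */
	unsigned int active_children;	/* children not yet suspended */
};

/* Register a new (active) child with its parent. */
static void model_add_child(struct model_dev *parent, struct model_dev *child)
{
	child->parent = parent;
	child->suspended = 0;
	child->active_children = 0;
	parent->active_children++;
}

/* Constant-time check instead of walking the list of children. */
static int model_check_children(const struct model_dev *dev)
{
	return dev->active_children ? -EBUSY : 0;
}

static int model_suspend(struct model_dev *dev)
{
	if (dev->suspended)
		return 0;
	if (model_check_children(dev))
		return -EBUSY;
	dev->suspended = 1;
	if (dev->parent)
		dev->parent->active_children--;
	return 0;
}

static void model_resume(struct model_dev *dev)
{
	if (!dev->suspended)
		return;
	dev->suspended = 0;
	if (dev->parent)
		dev->parent->active_children++;
}
```

The cost moves from the suspend path, which would otherwise visit every
child, to a single increment or decrement whenever a child changes state.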


> +/**
> + * __pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
> + * @dev: Device to suspend.
> + * @sync: If unset, the function has been called via pm_wq.
> + *
> + * Check if the status of the device is appropriate and run the
> + * ->runtime_suspend() callback provided by the device's bus type driver.
> + * Update the run-time PM flags in the device object to reflect the current
> + * status of the device.
> + */
> +int __pm_runtime_suspend(struct device *dev, bool sync)
> +{
> +	int error = -EINVAL;
> +
> +	if (atomic_read(&dev->power.depth) > 0)
> +		return -EBUSY;

Should this test be made inside the scope of the spinlock?

For that matter, should power.depth always be set within the spinlock?
If it is then it doesn't need to be an atomic_t.

> +
> +	spin_lock(&dev->power.lock);

Should be spin_lock_irq().  Same in other places.

> +
> +	if (dev->power.runtime_status == RPM_ERROR) {
> +		goto out;
> +	} else if (dev->power.runtime_status & RPM_SUSPENDED) {
> +		error = 0;
> +		goto out;
> +	} else if ((dev->power.runtime_status & RPM_NO_SUSPEND)
> +	    || (!sync && dev->power.suspend_aborted)) {
> +		/*
> +		 * Device is resuming or in a post-resume grace period or
> +		 * there's a resume request pending, or a pending suspend
> +		 * request has just been cancelled and we're running as a result
> +		 * of this request.
> +		 */

In the sync case, it might be better to wait until the ongoing resume
(or resume grace period) is finished and then do the suspend.

Of course, this depends on the context in which the synchronous
runtime suspend is carried out.  Right now, the only such context I
know of is when the user tells the system to force a USB device into a
suspended state.

> +		error = -EAGAIN;
> +		goto out;
> +	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
> +		spin_unlock(&dev->power.lock);
> +
> +		/*
> +		 * Another suspend is running in parallel with us.  Wait for it
> +		 * to complete and return.
> +		 */
> +		wait_for_completion(&dev->power.work_done);
> +
> +		return dev->power.runtime_error;
> +	} else if (pm_check_children(dev)) {
> +		/*
> +		 * We can only suspend the device if all of its children have
> +		 * been suspended.
> +		 */
> +		dev->power.runtime_status = RPM_ACTIVE;
> +		error = -EAGAIN;

-EBUSY would be more appropriate.

> +		goto out;
> +	}

> +/**
> + * pm_cancel_suspend - Cancel a pending suspend request for given device.
> + * @dev: Device to cancel the suspend request for.
> + */
> +static void pm_cancel_suspend(struct device *dev)
> +{
> +	cancel_delayed_work(&dev->power.runtime_work);
> +	dev->power.runtime_status &= RPM_GRACE;

This looks strange.  Aren't we guaranteed at this point that the
status is RPM_IDLE?
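
For reference, the mask only matters if RPM_GRACE can be set together
with RPM_IDLE.  A tiny user-space check using the values from the patch
(the helper name is made up for illustration):

```c
#define RPM_ACTIVE	0
#define RPM_IDLE	0x01
#define RPM_GRACE	0x20

/* What the "&= RPM_GRACE" in pm_cancel_suspend() leaves in the field. */
static unsigned int status_after_cancel(unsigned int status)
{
	return status & RPM_GRACE;
}
```

If the status is exactly RPM_IDLE the result is always RPM_ACTIVE, so a
plain assignment would say the same thing more clearly.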

> +	dev->power.suspend_aborted = true;
> +}
> +
> +/**
> + * __pm_runtime_resume - Run a device bus type's runtime_resume() callback.
> + * @dev: Device to resume.
> + * @grace: If set, force a post-resume grace period.
> + *
> + * Check if the device is really suspended and run the ->runtime_resume()
> + * callback provided by the device's bus type driver.  Update the run-time PM
> + * flags in the device object to reflect the current status of the device.  If
> + * runtime suspend is in progress while this function is being run, wait for it
> + * to finish before resuming the device.  If runtime suspend is scheduled, but
> + * it hasn't started yet, cancel it and we're done.
> + */
> +int __pm_runtime_resume(struct device *dev, bool grace)
> +{
> +	int error = -EINVAL;
...
> +	} else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
> +	    && (dev->parent->power.runtime_status & ~RPM_GRACE)) {
> +		spin_unlock(&dev->power.lock);

Here's where you want to increment the parent's depth.  Figuring out
where to decrement it again isn't easy, given the way this routine is
structured.

> +		spin_unlock(&dev->parent->power.lock);
> +
> +		/* The device's parent is not active.  Resume it and repeat. */
> +		error = __pm_runtime_resume(dev->parent, false);
> +		if (error)
> +			return error;

Need to reset error to -EINVAL.


> +/**
> + * pm_request_resume - Schedule run-time resume of given device.
> + * @dev: Device to resume.
> + * @grace: If set, force a post-resume grace period.
> + */
> +void __pm_request_resume(struct device *dev, bool grace)
> +{
> +	unsigned long parent_flags = 0, flags;
> +
> + repeat:
> +	if (atomic_read(&dev->power.depth) > 0)
> +		return;
> +
> +	if (dev->parent)
> +		spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
> +	spin_lock_irqsave(&dev->power.lock, flags);
> +
> +	if (dev->power.runtime_status == RPM_IDLE) {
> +		/* Autosuspend request is pending, no need to resume. */
> +		pm_cancel_suspend(dev);
> +		if (grace)
> +			dev->power.runtime_status |= RPM_GRACE;
> +		goto out;
> +	} else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
> +		goto out;
> +	} else if (dev->parent
> +	    && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
> +		spin_unlock_irqrestore(&dev->power.lock, flags);
> +		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
> +
> +		/* The parent is suspending, suspended or idle. Wake it up. */
> +		__pm_request_resume(dev->parent, false);
> +
> +		goto repeat;

What if the parent's state is RPM_SUSPENDING?  Won't this go into a
tight loop?  You need to test the parent's RPM_WAKE bit above.
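
A sketch of the kind of test meant here, as a stand-alone predicate.  The
helper name is hypothetical, and it assumes that a queued resume request
ORs RPM_WAKE into the status while RPM_SUSPENDING stays set, which the
quoted code does not show.

```c
#include <stdbool.h>

#define RPM_IDLE	0x01
#define RPM_SUSPENDING	0x02
#define RPM_SUSPENDED	0x04
#define RPM_WAKE	0x08

/*
 * Hypothetical helper: should the child queue a resume request for the
 * parent?  Only if the parent is idle, suspending or suspended and no
 * resume request has been queued for it yet.
 */
static bool parent_needs_resume_request(unsigned int parent_status)
{
	unsigned int inactive = RPM_IDLE | RPM_SUSPENDING | RPM_SUSPENDED;

	return (parent_status & inactive) && !(parent_status & RPM_WAKE);
}
```

With a test like this the "goto repeat" path is taken at most once per
pending parent wakeup instead of spinning while the parent's
->runtime_suspend() is still running.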


> Index: linux-2.6/Documentation/power/runtime_pm.txt
> ===================================================================
> --- /dev/null
> +++ linux-2.6/Documentation/power/runtime_pm.txt
> @@ -0,0 +1,311 @@
> +Run-time Power Management Framework for I/O Devices
> +
> +(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
> +
> +1. Introduction
> +
> +The support for run-time power management (run-time PM) of I/O devices is

s/The support/Support/

> +provided at the power management core (PM core) level by means of:

> +pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
> +respectively, all of the run-time PM core operations.  They do it by decreasing
> +and increasing, respectively, the 'power.depth' field of 'struct device'.  If
> +the value of this field is greater than 0, pm_runtime_suspend(),
> +pm_request_suspend(), pm_runtime_resume() and so on return immediately without
> +doing anything and -EBUSY is returned by pm_runtime_suspend(),
> +pm_runtime_resume() and pm_runtime_resume_grace().  Therefore, if

In your code, pm_runtime_disable() doesn't actually do a resume.  So if
a driver wants to make sure a device is at full power and stays that
way, it has to call:

	pm_runtime_resume(dev);
	pm_runtime_disable(dev);

This is a race; another thread might suspend the device in between.
It would make more sense to have pm_runtime_resume() function
normally even when depth > 0.  Then the calls could be made in the
opposite order and there wouldn't be a race.

The equivalent code in USB does this automatically.  The
runtime-disable routine does a resume if the depth value was
originally 0, and the runtime-enable routine queues a delayed
autosuspend request if the final depth value is 0.
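
The ordering argument can be sketched with a small user-space model.
Everything here is hypothetical (the struct and the model_*() helpers are
not the patch's API), and real code would hold power.lock around each
step.

```c
#include <errno.h>

/* Hypothetical model of the depth counter and the suspended state. */
struct depth_model {
	int depth;	/* > 0 means run-time suspend is disabled */
	int suspended;
};

/* Suspend is refused while the device is disabled. */
static int model_suspend(struct depth_model *dev)
{
	if (dev->depth > 0)
		return -EBUSY;
	dev->suspended = 1;
	return 0;
}

/* Resume works regardless of depth, as proposed above. */
static int model_resume(struct depth_model *dev)
{
	dev->suspended = 0;
	return 0;
}

static void model_disable(struct depth_model *dev)
{
	dev->depth++;
}

static void model_enable(struct depth_model *dev)
{
	dev->depth--;
}
```

Calling model_disable() first closes the window: any suspend attempt
arriving between the two calls gets -EBUSY, and model_resume() then brings
the device to full power, where it stays until model_enable().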

> +pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
> +pm_runtime_resume_grace(), pm_request_resume(), and pm_request_resume_grace()
> +use the 'power.runtime_status' and 'power.suspend_aborted' fields of
> +'struct device' for mutual synchronization.  The 'power.runtime_status' field,

Strictly speaking, they use those fields for mutual cooperation.  It's
the power.lock field which provides synchronization.


> +pm_runtime_suspend() is used to carry out a run-time suspend of an active
> +device.  It is called directly by a bus type or device driver.  An asynchronous
> +version of it is called by the PM core, to complete a request queued up by
> +pm_request_suspend().  The only difference between them is the handling of
> +situations when a queued up suspend request has just been cancelled.  Apart from
> +this, they work in the same way.
> +* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
> +  run-time PM status field, 'power.runtime_status'), success is returned.

Blank lines surrounding the *-ed paragraphs would make this more
readable.

> +pm_request_resume() and pm_request_resume_grace() are used to queue up a resume
> +request for a device that is suspended, suspending or has a suspend request
> +pending.  The difference between them is that pm_request_resume_grace() causes
> +the RPM_GRACE bit to be set in the device's run-time PM status field, which
> +prevents the PM core from suspending the device or queuing up a suspend request
> +for it until the RPM_GRACE bit is cleared with the help of pm_runtime_release().
> +Apart from this, they work in the same way.

Is RPM_GRACE really needed?  Can't we accomplish more or less the same
thing by using the autosuspend delay combined with the depth counter?

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-17 20:08                 ` Alan Stern
  (?)
@ 2009-06-17 23:07                 ` Rafael J. Wysocki
  2009-06-18 18:17                   ` Alan Stern
  2009-06-18 18:17                     ` Alan Stern
  -1 siblings, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-17 23:07 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

Hi Alan,

Thanks a lot for the review!

On Wednesday 17 June 2009, Alan Stern wrote:
> On Wed, 17 Jun 2009, Rafael J. Wysocki wrote:
> 
> > Sorry for the broken patch.  My mailer started to wordwrap messages
> > automatically and I didn't notice.
> > 
> > The correct patch is appended.
> 
> > Index: linux-2.6/include/linux/pm.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/pm.h
> > +++ linux-2.6/include/linux/pm.h
> 
> > + * @runtime_suspend: Prepare the device for a condition in which it won't be
> > + *	able to communicate with the CPU(s) and RAM due to power management.
> > + *	This need not mean that the device should be put into a low power state,
> > + *	like for example when the device is behind a link, represented by a
> 
> Suggested rephrasing: For example, if the device is behind a link
> which is about to be turned off, the device may remain at full power.
> But if the device does go to low power and if device_may_wakeup(dev)
> is true, enable remote wakeup.

Done.

> > +/**
> > + * Device run-time power management state.
> > + *
> > + * These state labels are used internally by the PM core to indicate the current
> > + * status of a device with respect to the PM core operations.  They do not
> > + * reflect the actual power state of the device or its status as seen by the
> > + * driver.
> > + *
> > + * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
> > + *			pending for it.
> > + *
> > + * RPM_IDLE		It has been requested that the device be suspended.
> > + *			Suspend request has been put into the run-time PM
> > + *			workqueue and it's pending execution.
> > + *
> > + * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
> > + *			executed.
> > + *
> > + * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
> > + *			completed successfully.  The device is regarded as
> > + *			suspended.
> > + *
> > + * RPM_WAKE		It has been requested that the device be woken up.
> > + *			Resume request has been put into the run-time PM
> > + *			workqueue and it's pending execution.
> > + *
> > + * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
> > + *			executed.
> > + *
> > + * RPM_ERROR		Represents a condition from which the PM core cannot
> > + *			recover by itself.  If the device's run-time PM status
> > + *			field has this value, all of the run-time PM operations
> > + *			carried out for the device by the core will fail, until
> > + *			the status field is changed to either RPM_ACTIVE or
> > + *			RPM_SUSPENDED (it is not valid to use the other values
> > + *			in such a situation) by the device's driver or bus type.
> > + *			This happens when the device bus type's
> > + *			->runtime_suspend() or ->runtime_resume() callback
> > + *			returns error code different from -EAGAIN or -EBUSY.
> 
> What about RPM_GRACE?

Forgotten.

Well, I've already replaced it with a counter (more about it below).

> > + */
> > +
> > +#define RPM_ACTIVE	0
> > +#define RPM_IDLE	0x01
> > +#define RPM_SUSPENDING	0x02
> > +#define RPM_SUSPENDED	0x04
> > +#define RPM_WAKE	0x08
> > +#define RPM_RESUMING	0x10
> > +#define RPM_GRACE	0x20
> > +#define RPM_ERROR	(-1)
> 
> This won't work very well when assigned to an unsigned 6-bit field.

OK, I'm changing it to 0x1F (IOW, all bits set).
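
For illustration, here is a minimal user-space sketch of why a negative error
marker does not survive a trip through a narrow unsigned bit-field, while an
"all bits set" value does.  The struct and macro names are made up for this
example, and the 5-bit width assumes that RPM_GRACE is dropped from the status
field; none of this is taken from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the status field, 6 and 5 bits wide. */
struct demo_status6 {
	unsigned int runtime_status:6;
};

struct demo_status5 {
	unsigned int runtime_status:5;
};

#define DEMO_ERROR_NEGATIVE	(-1)
#define DEMO_ERROR_ALL_BITS	0x1F

/* Store a value in the 6-bit field and read it back. */
static unsigned int demo_store6(int value)
{
	struct demo_status6 s;

	s.runtime_status = value;	/* reduced modulo 2^6 */
	return s.runtime_status;
}

/* Store a value in the 6-bit field and compare it with the expected one. */
static bool demo_status6_equals(int value, int expected)
{
	struct demo_status6 s;

	s.runtime_status = value;
	return s.runtime_status == expected;
}

/* Same as above for the 5-bit field. */
static bool demo_status5_equals(int value, int expected)
{
	struct demo_status5 s;

	s.runtime_status = value;
	return s.runtime_status == expected;
}

static void demo_checks(void)
{
	/* -1 is stored as 0x3F, so a later test against -1 never matches. */
	assert(demo_store6(DEMO_ERROR_NEGATIVE) == 0x3F);
	assert(!demo_status6_equals(DEMO_ERROR_NEGATIVE, DEMO_ERROR_NEGATIVE));

	/* A value that fits the field compares equal to itself. */
	assert(demo_status5_equals(DEMO_ERROR_ALL_BITS, DEMO_ERROR_ALL_BITS));
}
```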

> > +
> > +#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
> > +#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)
> > +#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING | RPM_GRACE)
> > +#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)
> 
> Since each of these is used only once, it would be better not to
> define them as macros.  Use the parenthesized expression instead; this
> will be easier for readers to understand.

OK

> > +/**
> > + * __pm_runtime_change_status - Change the run-time PM status of a device.
> > + * @dev: Device to handle.
> > + * @status: Expected current run-time PM status of the device.
> > + * @new_status: New value of the device's run-time PM status.
> > + *
> > + * Change the run-time PM status of the device to @new_status if its current
> > + * value is equal to @status.
> > + */
> > +void __pm_runtime_change_status(struct device *dev, unsigned int status,
> 
> If RPM_ERROR is -1 then status better not be unsigned.

That's fixed by redefining RPM_ERROR (see above).

> > +				unsigned int new_status)
> > +{
> > +	unsigned long flags;
> > +
> > +	if (atomic_read(&dev->power.depth) > 0)
> > +		return;
> 
> Return only if new_status == RPM_SUSPENDED.

Not only then.  The dev->power.depth counter was meant to be a "disable
everything" one, because there are situations in which we don't want even
resume to run (probe, release, system-wide suspend, hibernation, resume from
a system sleep state, possibly others).

That said, I overlooked some problems related to it.  So, I think to disable
the runtime PM of given device, it will be necessary to run a synchronous
runtime resume with taking a ref to block suspend.

> Is this routine ever called with status equal to anything other than
> RPM_ERROR?

Not at the moment.  OK, I'll change it.

> +/**
> + * pm_check_children - Check if all children of a device have been suspended.
> + * @dev: Device to check.
> + *
> + * Returns 0 if all children of the device have been suspended or -EBUSY
> + * otherwise.
> + */
> +static int pm_check_children(struct device *dev)
> +{
> +	return dev->power.suspend_skip_children ? 0 :
> +			device_for_each_child(dev, NULL, pm_device_suspended);
> +}
> 
> Instead of a costly device_for_each_child(), would it be better to
> maintain a counter with the number of unsuspended children?

Hmm.  How exactly are we going to count them?  The only way I see at the moment
would be to increase this number by one when running pm_runtime_init() for a
new child.  Seems doable.
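
A rough user-space sketch of the counter idea, assuming the counter is bumped
when a child is initialized and adjusted whenever a child changes state.  The
struct, field and function names are hypothetical, and the locking that the
real code would need (power.lock) is left out on purpose.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model; not the kernel's struct device. */
struct child_dev {
	struct child_dev *parent;
	unsigned int active_children;	/* number of unsuspended children */
	bool suspended;
};

/* Counterpart of pm_runtime_init(): a new child starts out active. */
static void child_dev_init(struct child_dev *dev, struct child_dev *parent)
{
	dev->parent = parent;
	dev->active_children = 0;
	dev->suspended = false;
	if (parent)
		parent->active_children++;
}

/* Suspend is refused with -EBUSY while any child is still active. */
static int child_dev_suspend(struct child_dev *dev)
{
	if (dev->suspended)
		return 0;
	if (dev->active_children > 0)
		return -EBUSY;

	dev->suspended = true;
	if (dev->parent) {
		assert(dev->parent->active_children > 0);
		dev->parent->active_children--;
	}
	return 0;
}

/* Resume makes the device count as an active child again. */
static void child_dev_resume(struct child_dev *dev)
{
	if (!dev->suspended)
		return;

	dev->suspended = false;
	if (dev->parent)
		dev->parent->active_children++;
}
```

With this, the check before suspending is a single comparison rather than a
walk over all children.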

> > +/**
> > + * __pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
> > + * @dev: Device to suspend.
> > + * @sync: If unset, the function has been called via pm_wq.
> > + *
> > + * Check if the status of the device is appropriate and run the
> > + * ->runtime_suspend() callback provided by the device's bus type driver.
> > + * Update the run-time PM flags in the device object to reflect the current
> > + * status of the device.
> > + */
> > +int __pm_runtime_suspend(struct device *dev, bool sync)
> > +{
> > +	int error = -EINVAL;
> > +
> > +	if (atomic_read(&dev->power.depth) > 0)
> > +		return -EBUSY;
> 
> Should this test be made inside the scope of the spinlock?

Yes, it should.

> For that matter, should power.depth always be set within the spinlock?
> If it is then it doesn't need to be an atomic_t.

pm_runtime_[dis|en]able() don't take the lock when changing it, but it's
going to be dropped anyway.

> > +
> > +	spin_lock(&dev->power.lock);
> 
> Should be spin_lock_irq().  Same in other places.

OK, I wasn't sure about that.

> > +
> > +	if (dev->power.runtime_status == RPM_ERROR) {
> > +		goto out;
> > +	} else if (dev->power.runtime_status & RPM_SUSPENDED) {
> > +		error = 0;
> > +		goto out;
> > +	} else if ((dev->power.runtime_status & RPM_NO_SUSPEND)
> > +	    || (!sync && dev->power.suspend_aborted)) {
> > +		/*
> > +		 * Device is resuming or in a post-resume grace period or
> > +		 * there's a resume request pending, or a pending suspend
> > +		 * request has just been cancelled and we're running as a result
> > +		 * of this request.
> > +		 */
> 
> In the sync case, it might be better to wait until the ongoing resume
> (or resume grace period) is finished and then do the suspend.
>
> Of course, this depends on the context in which the synchronous
> runtime suspend is carried out.  Right now, the only such context I
> know of is when the user tells the system to force a USB device into a
> suspended state.

From the functionality point of view, nothing wrong happens if runtime suspend
fails as long as an error code is returned and the caller has to be prepared
for a failure anyway.  Moreover, we never know why the resume is carried out,
so it's not clear whether it will be valid to carry out the suspend after that.

> 
> > +		error = -EAGAIN;
> > +		goto out;
> > +	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
> > +		spin_unlock(&dev->power.lock);
> > +
> > +		/*
> > +		 * Another suspend is running in parallel with us.  Wait for it
> > +		 * to complete and return.
> > +		 */
> > +		wait_for_completion(&dev->power.work_done);
> > +
> > +		return dev->power.runtime_error;
> > +	} else if (pm_check_children(dev)) {
> > +		/*
> > +		 * We can only suspend the device if all of its children have
> > +		 * been suspended.
> > +		 */
> > +		dev->power.runtime_status = RPM_ACTIVE;
> > +		error = -EAGAIN;
> 
> -EBUSY would be more appropriate.

OK

> > +		goto out;
> > +	}
> 
> > +/**
> > + * pm_cancel_suspend - Cancel a pending suspend request for given device.
> > + * @dev: Device to cancel the suspend request for.
> > + */
> > +static void pm_cancel_suspend(struct device *dev)
> > +{
> > +	cancel_delayed_work(&dev->power.runtime_work);
> > +	dev->power.runtime_status &= RPM_GRACE;
> 
> This looks strange.  Aren't we guaranteed at this point that the
> status is RPM_IDLE?

Yes.

> > +	dev->power.suspend_aborted = true;
> > +}
> > +
> > +/**
> > + * __pm_runtime_resume - Run a device bus type's runtime_resume() callback.
> > + * @dev: Device to resume.
> > + * @grace: If set, force a post-resume grace period.
> > + *
> > + * Check if the device is really suspended and run the ->runtime_resume()
> > + * callback provided by the device's bus type driver.  Update the run-time PM
> > + * flags in the device object to reflect the current status of the device.  If
> > + * runtime suspend is in progress while this function is being run, wait for it
> > + * to finish before resuming the device.  If runtime suspend is scheduled, but
> > + * it hasn't started yet, cancel it and we're done.
> > + */
> > +int __pm_runtime_resume(struct device *dev, bool grace)
> > +{
> > +	int error = -EINVAL;
> ...
> > +	} else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
> > +	    && (dev->parent->power.runtime_status & ~RPM_GRACE)) {
> > +		spin_unlock(&dev->power.lock);
> 
> Here's where you want to increment the parent's depth.  Figuring out
> where to decrement it again isn't easy, given the way this routine is
> structured.

Hmm.  We can use a local bool variable to record that a ref has been taken on
the parent and drop that ref when leaving the function.
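
A simplified user-space sketch of that pattern: a local flag records that the
parent has been pinned, and a single exit path drops the reference again,
whatever the outcome.  All names here are invented for the example, the
'resume_fails' member only simulates a failing ->runtime_resume() callback,
and no locking is shown.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model; 'depth' stands in for power.depth. */
struct ref_dev {
	struct ref_dev *parent;
	int depth;
	bool suspended;
	bool resume_fails;	/* simulates a failing resume callback */
};

static int ref_dev_resume(struct ref_dev *dev)
{
	bool parent_ref = false;
	int error = 0;

	if (!dev->suspended)
		return 0;

	if (dev->parent) {
		/* Pin the parent so it cannot be suspended under us. */
		dev->parent->depth++;
		parent_ref = true;

		error = ref_dev_resume(dev->parent);
		if (error)
			goto out;
	}

	if (dev->resume_fails) {
		error = -EIO;
		goto out;
	}
	dev->suspended = false;

 out:
	/* Drop the parent's reference on every exit path. */
	if (parent_ref) {
		assert(dev->parent->depth > 0);
		dev->parent->depth--;
	}
	return error;
}
```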

> > +		spin_unlock(&dev->parent->power.lock);
> > +
> > +		/* The device's parent is not active.  Resume it and repeat. */
> > +		error = __pm_runtime_resume(dev->parent, false);
> > +		if (error)
> > +			return error;
> 
> Need to reset error to -EINVAL.

Why -EINVAL?

> > +/**
> > + * pm_request_resume - Schedule run-time resume of given device.
> > + * @dev: Device to resume.
> > + * @grace: If set, force a post-resume grace period.
> > + */
> > +void __pm_request_resume(struct device *dev, bool grace)
> > +{
> > +	unsigned long parent_flags = 0, flags;
> > +
> > + repeat:
> > +	if (atomic_read(&dev->power.depth) > 0)
> > +		return;
> > +
> > +	if (dev->parent)
> > +		spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
> > +	spin_lock_irqsave(&dev->power.lock, flags);
> > +
> > +	if (dev->power.runtime_status == RPM_IDLE) {
> > +		/* Autosuspend request is pending, no need to resume. */
> > +		pm_cancel_suspend(dev);
> > +		if (grace)
> > +			dev->power.runtime_status |= RPM_GRACE;
> > +		goto out;
> > +	} else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
> > +		goto out;
> > +	} else if (dev->parent
> > +	    && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
> > +		spin_unlock_irqrestore(&dev->power.lock, flags);
> > +		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
> > +
> > +		/* The parent is suspending, suspended or idle. Wake it up. */
> > +		__pm_request_resume(dev->parent, false);
> > +
> > +		goto repeat;
> 
> What if the parent's state is RPM_SUSPENDING?  Won't this go into a
> tight loop?  You need to test the parent's WAKEUP bit above.

Right.

> > Index: linux-2.6/Documentation/power/runtime_pm.txt
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/Documentation/power/runtime_pm.txt
> > @@ -0,0 +1,311 @@
> > +Run-time Power Management Framework for I/O Devices
> > +
> > +(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
> > +
> > +1. Introduction
> > +
> > +The support for run-time power management (run-time PM) of I/O devices is
> 
> s/The support/Support/

OK

> > +provided at the power management core (PM core) level by means of:
> 
> > +pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
> > +respectively, all of the run-time PM core operations.  They do it by decreasing
> > +and increasing, respectively, the 'power.depth' field of 'struct device'.  If
> > +the value of this field is greater than 0, pm_runtime_suspend(),
> > +pm_request_suspend(), pm_runtime_resume() and so on return immediately without
> > +doing anything and -EBUSY is returned by pm_runtime_suspend(),
> > +pm_runtime_resume() and pm_runtime_resume_grace().  Therefore, if
> 
> In your code, pm_runtime_disable() doesn't actually do a resume.  So if
> a driver wants to make sure a device is at full power and stays that
> way, it has to call:
> 
> 	pm_runtime_resume(dev);
> 	pm_runtime_disable(dev);
> 
> This is a race; another thread might suspend the device in between.
> It would make more sense to have pm_runtime_resume() function
> normally even when depth > 0.  Then the calls could be made in the
> opposite order and there wouldn't be a race.
> 
> The equivalent code in USB does this automatically.  The
> runtime-disable routine does a resume if the depth value was
> originally 0,

Yes, we should do that in general.

> and the runtime-enable routine queues a delayed autosuspend request if the
> final depth value is 0.

I don't like this.

> > +pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
> > +pm_runtime_resume_grace(), pm_request_resume(), and pm_request_resume_grace()
> > +use the 'power.runtime_status' and 'power.suspend_aborted' fields of
> > +'struct device' for mutual synchronization.  The 'power.runtime_status' field,
> 
> Strictly speaking, they use those fields for mutual cooperation.  It's
> the power.lock field which provides synchronization.

OK

> > +pm_runtime_suspend() is used to carry out a run-time suspend of an active
> > +device.  It is called directly by a bus type or device driver.  An asynchronous
> > +version of it is called by the PM core, to complete a request queued up by
> > +pm_request_suspend().  The only difference between them is the handling of
> > +situations when a queued up suspend request has just been cancelled.  Apart from
> > +this, they work in the same way.
> > +* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
> > +  run-time PM status field, 'power.runtime_status'), success is returned.
> 
> Blank lines surrounding the *-ed paragraphs would make this more
> readable.

OK

> > +pm_request_resume() and pm_request_resume_grace() are used to queue up a resume
> > +request for a device that is suspended, suspending or has a suspend request
> > +pending.  The difference between them is that pm_request_resume_grace() causes
> > +the RPM_GRACE bit to be set in the device's run-time PM status field, which
> > +prevents the PM core from suspending the device or queuing up a suspend request
> > +for it until the RPM_GRACE bit is cleared with the help of pm_runtime_release().
> > +Apart from this, they work in the same way.
> 
> Is RPM_GRACE really needed?  Can't we accomplish more or less the same
> thing by using the autosuspend delay combined with the depth counter?

No, it's not.  As I said above, I replaced it with a counter and then I
realized that 'disable' should in fact do 'resume and get', so we can handle
everything with just one counter.
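
To make the "resume and get" idea concrete, here is a small user-space sketch
under the assumption that one counter blocks suspend while it is nonzero.  The
names are hypothetical and there is no locking; the point is only the ordering:
the reference is taken before the device is brought back, so nothing can
suspend it in between.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device model; 'depth' stands in for power.depth. */
struct cnt_dev {
	int depth;
	bool suspended;
};

/* "Disable" as "resume and get": block suspend first, then resume. */
static void cnt_dev_disable(struct cnt_dev *dev)
{
	dev->depth++;
	dev->suspended = false;	/* stand-in for a synchronous resume */
}

/* "Enable" only drops the reference; it does not queue a suspend. */
static void cnt_dev_enable(struct cnt_dev *dev)
{
	assert(dev->depth > 0);
	dev->depth--;
}

/* Suspend is refused while anybody holds a reference. */
static int cnt_dev_suspend(struct cnt_dev *dev)
{
	if (dev->depth > 0)
		return -EBUSY;

	dev->suspended = true;
	return 0;
}
```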

I'll send a revised patch tomorrow.

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-17 20:08                 ` Alan Stern
  (?)
  (?)
@ 2009-06-17 23:07                 ` Rafael J. Wysocki
  -1 siblings, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-17 23:07 UTC (permalink / raw)
  To: Alan Stern; +Cc: Greg KH, LKML, ACPI Devel Maling List, linux-pm, Ingo Molnar

Hi Alan,

Thanks a lot for the review!

On Wednesday 17 June 2009, Alan Stern wrote:
> On Wed, 17 Jun 2009, Rafael J. Wysocki wrote:
> 
> > Sorry for the broken patch.  My mailer started to wordwrap messages
> > automatically and I didn't notice.
> > 
> > The correct patch is appended.
> 
> > Index: linux-2.6/include/linux/pm.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/pm.h
> > +++ linux-2.6/include/linux/pm.h
> 
> > + * @runtime_suspend: Prepare the device for a condition in which it won't be
> > + *	able to communicate with the CPU(s) and RAM due to power management.
> > + *	This need not mean that the device should be put into a low power state,
> > + *	like for example when the device is behind a link, represented by a
> 
> Suggested rephrasing: For example, if the device is behind a link
> which is about to be turned off, the device may remain at full power.
> But if the device does go to low power and if device_may_wakeup(dev)
> is true, enable remote wakeup.

Done.

> > +/**
> > + * Device run-time power management state.
> > + *
> > + * These state labels are used internally by the PM core to indicate the current
> > + * status of a device with respect to the PM core operations.  They do not
> > + * reflect the actual power state of the device or its status as seen by the
> > + * driver.
> > + *
> > + * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
> > + *			pending for it.
> > + *
> > + * RPM_IDLE		It has been requested that the device be suspended.
> > + *			Suspend request has been put into the run-time PM
> > + *			workqueue and it's pending execution.
> > + *
> > + * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
> > + *			executed.
> > + *
> > + * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
> > + *			completed successfully.  The device is regarded as
> > + *			suspended.
> > + *
> > + * RPM_WAKE		It has been requested that the device be woken up.
> > + *			Resume request has been put into the run-time PM
> > + *			workqueue and it's pending execution.
> > + *
> > + * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
> > + *			executed.
> > + *
> > + * RPM_ERROR		Represents a condition from which the PM core cannot
> > + *			recover by itself.  If the device's run-time PM status
> > + *			field has this value, all of the run-time PM operations
> > + *			carried out for the device by the core will fail, until
> > + *			the status field is changed to either RPM_ACTIVE or
> > + *			RPM_SUSPENDED (it is not valid to use the other values
> > + *			in such a situation) by the device's driver or bus type.
> > + *			This happens when the device bus type's
> > + *			->runtime_suspend() or ->runtime_resume() callback
> > + *			returns error code different from -EAGAIN or -EBUSY.
> 
> What about RPM_GRACE?

Forgotten.

Well, I've already replaced it with a counter (more about it below).

> > + */
> > +
> > +#define RPM_ACTIVE	0
> > +#define RPM_IDLE	0x01
> > +#define RPM_SUSPENDING	0x02
> > +#define RPM_SUSPENDED	0x04
> > +#define RPM_WAKE	0x08
> > +#define RPM_RESUMING	0x10
> > +#define RPM_GRACE	0x20
> > +#define RPM_ERROR	(-1)
> 
> This won't work very well when assigned to an unsigned 6-bit field.

OK, I'm changing it to 0x1F (IOW, all bits set).

> > +
> > +#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
> > +#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)
> > +#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING | RPM_GRACE)
> > +#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)
> 
> Since each of these is used only once, it would be better not to
> define them as macros.  Use the parenthesized expression instead; this
> will be easier for readers to understand.

OK

> > +/**
> > + * __pm_runtime_change_status - Change the run-time PM status of a device.
> > + * @dev: Device to handle.
> > + * @status: Expected current run-time PM status of the device.
> > + * @new_status: New value of the device's run-time PM status.
> > + *
> > + * Change the run-time PM status of the device to @new_status if its current
> > + * value is equal to @status.
> > + */
> > +void __pm_runtime_change_status(struct device *dev, unsigned int status,
> 
> If RPM_ERROR is -1 then status better not be unsigned.

That's fixed by redefining RPM_ERROR (see above).

> > +				unsigned int new_status)
> > +{
> > +	unsigned long flags;
> > +
> > +	if (atomic_read(&dev->power.depth) > 0)
> > +		return;
> 
> Return only if new_status == RPM_SUSPENDED.

Not only then.  The dev->power.depth counter was meant to be a "disable
everything" one, because there are situations in which we don't want even
resume to run (probe, release, system-wide suspend, hibernation, resume from
a system sleep state, possibly others).

That said, I overlooked some problems related to it.  So, I think to disable
the runtime PM of given device, it will be necessary to run a synchronous
runtime resume with taking a ref to block suspend.

> Is this routine ever called with status equal to anything other than
> RPM_ERROR?

Not at the moment.  OK, I'll change it.

> +/**
> + * pm_check_children - Check if all children of a device have been suspended.
> + * @dev: Device to check.
> + *
> + * Returns 0 if all children of the device have been suspended or -EBUSY
> + * otherwise.
> + */
> +static int pm_check_children(struct device *dev)
> +{
> +	return dev->power.suspend_skip_children ? 0 :
> +			device_for_each_child(dev, NULL, pm_device_suspended);
> +}
> 
> Instead of a costly device_for_each_child(), would it be better to
> maintain a counter with the number of unsuspended children?

Hmm.  How exactly are we going to count them?  The only way I see at the moment
would be to increase this number by one when running pm_runtime_init() for a
new child.  Seems doable.

> > +/**
> > + * __pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
> > + * @dev: Device to suspend.
> > + * @sync: If unset, the funtion has been called via pm_wq.
> > + *
> > + * Check if the status of the device is appropriate and run the
> > + * ->runtime_suspend() callback provided by the device's bus type driver.
> > + * Update the run-time PM flags in the device object to reflect the current
> > + * status of the device.
> > + */
> > +int __pm_runtime_suspend(struct device *dev, bool sync)
> > +{
> > +	int error = -EINVAL;
> > +
> > +	if (atomic_read(&dev->power.depth) > 0)
> > +		return -EBUSY;
> 
> Should this test be made inside the scope of the spinlock?

Yes, it should.

> For that matter, should power.depth always be set within the spinlock?
> If it is then it doesn't need to be an atomic_t.

pm_runtime_[dis|en]able() don't take the lock when changing it, but it's
going to be dropped anyway.

> > +
> > +	spin_lock(&dev->power.lock);
> 
> Should be spin_lock_irq().  Same in other places.

OK, I wasn't sure about that.

> > +
> > +	if (dev->power.runtime_status == RPM_ERROR) {
> > +		goto out;
> > +	} else if (dev->power.runtime_status & RPM_SUSPENDED) {
> > +		error = 0;
> > +		goto out;
> > +	} else if ((dev->power.runtime_status & RPM_NO_SUSPEND)
> > +	    || (!sync && dev->power.suspend_aborted)) {
> > +		/*
> > +		 * Device is resuming or in a post-resume grace period or
> > +		 * there's a resume request pending, or a pending suspend
> > +		 * request has just been cancelled and we're running as a result
> > +		 * of this request.
> > +		 */
> 
> In the sync case, it might be better to wait until the ongoing resume
> (or resume grace period) is finished and then do the suspend.
>
> Of course, this depends on the context in which the synchronous
> runtime suspend is carried out.  Right now, the only such context I
> know of is when the user tells the system to force a USB device into a
> suspended state.

>From the functionality point of view, nothing wrong happens if runtime suspend
fails as long as an error code is returned and the caller has to be prepared
for a failure anyway.  Moreover, we never know why the resume is carried out,
so it's not clear whether it will be valid to carry out the suspend after that.

> 
> > +		error = -EAGAIN;
> > +		goto out;
> > +	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
> > +		spin_unlock(&dev->power.lock);
> > +
> > +		/*
> > +		 * Another suspend is running in parallel with us.  Wait for it
> > +		 * to complete and return.
> > +		 */
> > +		wait_for_completion(&dev->power.work_done);
> > +
> > +		return dev->power.runtime_error;
> > +	} else if (pm_check_children(dev)) {
> > +		/*
> > +		 * We can only suspend the device if all of its children have
> > +		 * been suspended.
> > +		 */
> > +		dev->power.runtime_status = RPM_ACTIVE;
> > +		error = -EAGAIN;
> 
> -EBUSY would be more appropriate.

OK

> > +		goto out;
> > +	}
> 
> > +/**
> > + * pm_cancel_suspend - Cancel a pending suspend request for given device.
> > + * @dev: Device to cancel the suspend request for.
> > + */
> > +static void pm_cancel_suspend(struct device *dev)
> > +{
> > +	cancel_delayed_work(&dev->power.runtime_work);
> > +	dev->power.runtime_status &= RPM_GRACE;
> 
> This looks strange.  Aren't we guaranteed at this point that the
> status is RPM_IDLE?

Yes.

> > +	dev->power.suspend_aborted = true;
> > +}
> > +
> > +/**
> > + * __pm_runtime_resume - Run a device bus type's runtime_resume() callback.
> > + * @dev: Device to resume.
> > + * @grace: If set, force a post-resume grace period.
> > + *
> > + * Check if the device is really suspended and run the ->runtime_resume()
> > + * callback provided by the device's bus type driver.  Update the run-time PM
> > + * flags in the device object to reflect the current status of the device.  If
> > + * runtime suspend is in progress while this function is being run, wait for it
> > + * to finish before resuming the device.  If runtime suspend is scheduled, but
> > + * it hasn't started yet, cancel it and we're done.
> > + */
> > +int __pm_runtime_resume(struct device *dev, bool grace)
> > +{
> > +	int error = -EINVAL;
> ...
> > +	} else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
> > +	    && (dev->parent->power.runtime_status & ~RPM_GRACE)) {
> > +		spin_unlock(&dev->power.lock);
> 
> Here's where you want to increment the parent's depth.  Figuring out
> where to decrement it again isn't easy, given the way this routine is
> structured.

Hmm.  We can use a local bool variable to store the information that the ref
has been taken for the parent and dereference it when leaving the function.

> > +		spin_unlock(&dev->parent->power.lock);
> > +
> > +		/* The device's parent is not active.  Resume it and repeat. */
> > +		error = __pm_runtime_resume(dev->parent, false);
> > +		if (error)
> > +			return error;
> 
> Need to reset error to -EINVAL.

Why -EINVAL?

> > +/**
> > + * pm_request_resume - Schedule run-time resume of given device.
> > + * @dev: Device to resume.
> > + * @grace: If set, force a post-resume grace period.
> > + */
> > +void __pm_request_resume(struct device *dev, bool grace)
> > +{
> > +	unsigned long parent_flags = 0, flags;
> > +
> > + repeat:
> > +	if (atomic_read(&dev->power.depth) > 0)
> > +		return;
> > +
> > +	if (dev->parent)
> > +		spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
> > +	spin_lock_irqsave(&dev->power.lock, flags);
> > +
> > +	if (dev->power.runtime_status == RPM_IDLE) {
> > +		/* Autosuspend request is pending, no need to resume. */
> > +		pm_cancel_suspend(dev);
> > +		if (grace)
> > +			dev->power.runtime_status |= RPM_GRACE;
> > +		goto out;
> > +	} else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
> > +		goto out;
> > +	} else if (dev->parent
> > +	    && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
> > +		spin_unlock_irqrestore(&dev->power.lock, flags);
> > +		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
> > +
> > +		/* The parent is suspending, suspended or idle. Wake it up. */
> > +		__pm_request_resume(dev->parent, false);
> > +
> > +		goto repeat;
> 
> What if the parent's state is RPM_SUSPENDING?  Won't this go into a
> tight loop?  You need to test the parent's WAKEUP bit above.

Right.

> > Index: linux-2.6/Documentation/power/runtime_pm.txt
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/Documentation/power/runtime_pm.txt
> > @@ -0,0 +1,311 @@
> > +Run-time Power Management Framework for I/O Devices
> > +
> > +(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
> > +
> > +1. Introduction
> > +
> > +The support for run-time power management (run-time PM) of I/O devices is
> 
> s/The support/Support/

OK

> > +provided at the power management core (PM core) level by means of:
> 
> > +pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
> > +respectively, all of the run-time PM core operations.  They do it by decreasing
> > +and increasing, respectively, the 'power.depth' field of 'struct device'.  If
> > +the value of this field is greater than 0, pm_runtime_suspend(),
> > +pm_request_suspend(), pm_runtime_resume() and so on return immediately without
> > +doing anything and -EBUSY is returned by pm_runtime_suspend(),
> > +pm_runtime_resume() and pm_runtime_resume_grace().  Therefore, if
> 
> In your code, pm_runtime_disable() doesn't actually do a resume.  So if
> a driver wants to make sure a device is at full power and stays that
> way, it has to call:
> 
> 	pm_runtime_resume(dev);
> 	pm_runtime_disable(dev);
> 
> This is a race; another thread might suspend the device in between.
> It would make more sense to have have pm_runtime_resume() function
> normally even when depth > 0.  Then the calls could be made in the
> opposite order and there wouldn't be a race.
> 
> The equivalent code in USB does this automatically.  The
> runtime-disable routine does a resume if the depth value was
> originally 0,

Yes, we should do that in general.

> and the runtime-enable routine queues a delayed autosuspend request if the
> final depth value is 0.

I don't like this.

> > +pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
> > +pm_runtime_resume_grace(), pm_request_resume(), and pm_request_resume_grace()
> > +use the 'power.runtime_status' and 'power.suspend_aborted' fields of
> > +'struct device' for mutual synchronization.  The 'power.runtime_status' field,
> 
> Strictly speaking, they use those fields for mutual cooperation.  It's
> the power.lock field which provides synchronization.

OK

> > +pm_runtime_suspend() is used to carry out a run-time suspend of an active
> > +device.  It is called directly by a bus type or device driver.  An asynchronous
> > +version of it is called by the PM core, to complete a request queued up by
> > +pm_request_suspend().  The only difference between them is the handling of
> > +situations when a queued up suspend request has just been cancelled.  Apart from
> > +this, they work in the same way.
> > +* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
> > +  run-time PM status field, 'power.runtime_status'), success is returned.
> 
> Blank lines surrounding the *-ed paragraphs would make this more
> readable.

OK

> > +pm_request_resume() and pm_request_resume_grace() are used to queue up a resume
> > +request for a device that is suspended, suspending or has a suspend request
> > +pending.  The difference between them is that pm_request_resume_grace() causes
> > +the RPM_GRACE bit to be set in the device's run-time PM status field, which
> > +prevents the PM core from suspending the device or queuing up a suspend request
> > +for it until the RPM_GRACE bit is cleared with the help of pm_runtime_release().
> > +Apart from this, they work in the same way.
> 
> Is RPM_GRACE really needed?  Can't we accomplish more or less the same
> thing by using the autosuspend delay combined with the depth counter?

No, it's not.  As I said above, I replaced it with a counter and then I
realized that 'disable' should in fact do 'resume and get', so we can handle
everything with just one counter.
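
To show what I mean by "resume and get", here is a rough user-space sketch.
It is not code from the patch: struct toy_dev and the toy_*() helpers are
names made up for this mail, and the real thing would of course hold the
lock and call the bus type's callbacks.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for the run-time PM fields; not the layout used by the patch. */
struct toy_dev {
	int count;		/* number of reasons to stay at full power */
	bool suspended;
};

/* Take the reference first, then resume, so no suspend can slip in between. */
static void toy_get_sync(struct toy_dev *dev)
{
	dev->count++;
	if (dev->suspended)
		dev->suspended = false;	/* stands in for ->runtime_resume() */
}

/* Drop a reference; returns true if the device may now be suspended. */
static bool toy_put(struct toy_dev *dev)
{
	if (dev->count > 0)
		dev->count--;
	return dev->count == 0;
}

/* A suspend attempt is refused while anyone holds a reference. */
static int toy_suspend(struct toy_dev *dev)
{
	if (dev->count > 0)
		return -EBUSY;
	dev->suspended = true;
	return 0;
}
```

With that ordering, 'disable' is just toy_get_sync() and 'enable' is
toy_put(), and a separate RPM_GRACE bit is not needed.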

I'll send a revised patch tomorrow.

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-17 23:07                 ` Rafael J. Wysocki
@ 2009-06-18 18:17                     ` Alan Stern
  2009-06-18 18:17                     ` Alan Stern
  1 sibling, 0 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-18 18:17 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Thu, 18 Jun 2009, Rafael J. Wysocki wrote:

> Not only then.  The dev->power.depth counter was meant to be a "disable
> everything" one, because there are situations in which we don't want even
> resume to run (probe, release, system-wide suspend, hibernation, resume from
> a system sleep state, possibly others).
> 
> That said, I overlooked some problems related to it.  So, I think to disable
> the runtime PM of a given device, it will be necessary to run a synchronous
> runtime resume while taking a ref to block suspend.

There should also be an async version, which increases depth while
submitting a resume request.

In fact, maybe it would be best if pm_request_resume always increments
depth (unless it fails for some other reason) and __pm_runtime_resume
increments depth whenever called synchronously.  And likewise for the
suspend paths.

> > Instead of a costly device_for_each_child(), would it be better to
> > maintain a counter with the number of unsuspended children?
> 
> Hmm.  How exactly are we going to count them?  The only way I see at the moment
> would be to increase this number by one when running pm_runtime_init() for a
> new child.  Seems doable.

That's right.  You also have to decrement the number when an
unsuspended child device is removed, obviously.  The one thing to
watch out for is what happens if a device is removed while its
runtime_resume callback is running.  :-)
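
Roughly, the bookkeeping I have in mind looks like the user-space sketch
below.  All of the names (struct toy_dev, toy_init() and so on) are
invented for illustration, locking is left out, and it does not try to
handle the removal-during-resume case mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_dev {
	struct toy_dev *parent;
	int child_count;	/* number of unsuspended children */
	bool suspended;
};

/* A new device starts out active, so it counts against its parent. */
static void toy_init(struct toy_dev *dev, struct toy_dev *parent)
{
	dev->parent = parent;
	dev->child_count = 0;
	dev->suspended = false;
	if (parent)
		parent->child_count++;
}

/* Adjust the parent's counter only when the status really changes. */
static void toy_set_suspended(struct toy_dev *dev, bool suspended)
{
	if (dev->suspended == suspended)
		return;
	dev->suspended = suspended;
	if (dev->parent)
		dev->parent->child_count += suspended ? -1 : 1;
}

/* Removing an unsuspended child must drop the parent's counter too. */
static void toy_remove(struct toy_dev *dev)
{
	if (dev->parent && !dev->suspended)
		dev->parent->child_count--;
	dev->parent = NULL;
}

/* Constant-time replacement for walking all of the children. */
static bool toy_children_suspended(const struct toy_dev *dev)
{
	return dev->child_count == 0;
}
```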

> > > +	spin_lock(&dev->power.lock);
> > 
> > Should be spin_lock_irq().  Same in other places.
> 
> OK, I wasn't sure about that.

The reasoning isn't complicated.  If a spinlock can be taken by an
interrupt handler (or any other code that might run in interrupt
context) then you have the possibility of a deadlock as follows:

	spin_lock(&lock);
	<Interrupt occurs>
		irq_handler() {
			spin_lock(&lock);

The handler can't acquire the lock because it is already in use, and
it can't be released until the handler returns.

As a result, if a spinlock is ever taken within an interrupt handler
then it always has to be acquired with interrupts disabled.
Similarly, if it is never taken within an interrupt handler but it is
taken within a bottom-half routine, then it always has to be acquired
with bottom halves disabled.
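
If it helps, here is a toy single-CPU model of the two cases in plain
user-space C.  Nothing in it is a kernel API; the struct and function
names are made up, and "deadlocked" simply records that the handler found
the lock already held, which is where a real handler would spin forever.

```c
#include <stdbool.h>

/* Toy single-CPU state; none of these names exist in the kernel. */
struct toy_cpu {
	bool irqs_enabled;
	bool irq_pending;
	bool lock_held;
	bool deadlocked;
	int handled;
};

/* The handler needs the same lock that process context takes. */
static void toy_irq_handler(struct toy_cpu *cpu)
{
	if (cpu->lock_held) {
		cpu->deadlocked = true;	/* a real handler would spin here */
		return;
	}
	cpu->lock_held = true;
	cpu->handled++;
	cpu->lock_held = false;
}

/* Deliver a pending interrupt only if interrupts are currently enabled. */
static void toy_deliver(struct toy_cpu *cpu)
{
	if (cpu->irq_pending && cpu->irqs_enabled) {
		cpu->irq_pending = false;
		toy_irq_handler(cpu);
	}
}

/* Run a critical section during which an interrupt is raised. */
static bool toy_critical_section(struct toy_cpu *cpu, bool disable_irqs)
{
	bool saved = cpu->irqs_enabled;

	if (disable_irqs)
		cpu->irqs_enabled = false;	/* models spin_lock_irqsave() */
	cpu->lock_held = true;

	cpu->irq_pending = true;	/* interrupt arrives with the lock held */
	toy_deliver(cpu);

	cpu->lock_held = false;
	cpu->irqs_enabled = saved;	/* models spin_unlock_irqrestore() */
	toy_deliver(cpu);		/* a deferred interrupt runs now */

	return !cpu->deadlocked;
}
```

Calling toy_critical_section() with disable_irqs unset hits the deadlock
path; with it set, the interrupt is held off until the lock has been
dropped and the handler then runs normally.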

> From the functionality point of view, nothing wrong happens if runtime suspend
> fails as long as an error code is returned and the caller has to be prepared
> for a failure anyway.  Moreover, we never know why the resume is carried out,
> so it's not clear whether it will be valid to carry out the suspend after that.

Your first point certainly is correct.  As for the second point, if
whoever did the resume doesn't want the device suspended again, he
should have incremented depth.  So making the suspend wait until the
resume is finished and then failing because the depth is positive
would be a valid approach.

However there's no use worrying about this until we have some real
examples.

> > > +		spin_unlock(&dev->parent->power.lock);
> > > +
> > > +		/* The device's parent is not active.  Resume it and repeat. */
> > > +		error = __pm_runtime_resume(dev->parent, false);
> > > +		if (error)
> > > +			return error;
> > 
> > Need to reset error to -EINVAL.
> 
> Why -EINVAL?

We have lost the context because of email trimming.  Briefly, when you
jump back to "repeat:", the code there expects error to have been
initialized to -EINVAL.  Some of the pathways will return error
unchanged, expecting it to have that value.

Alternatively, you could have those pathways set error and then you
wouldn't have to initialize it.  Either way.
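
Since the context is gone, here is a stripped-down user-space model of the
problem.  It is only a sketch: toy_resume() and its flags are invented, and
the assignment "error = 0" stands in for a successful
__pm_runtime_resume(dev->parent, false).

```c
#include <errno.h>
#include <stdbool.h>

/*
 * The first pass finds the parent inactive and resumes it, which leaves
 * error == 0.  On the second pass the device turns out not to be resumable
 * any more and the function returns error unchanged.
 */
static int toy_resume(bool reset_error)
{
	bool parent_active = false;
	bool resumable = true;
	int error = -EINVAL;

repeat:
	if (!resumable)
		return error;	/* this path expects error == -EINVAL */

	if (!parent_active) {
		error = 0;		/* parent resume succeeded */
		parent_active = true;
		resumable = false;	/* status changed while unlocked */
		if (reset_error)
			error = -EINVAL;
		goto repeat;
	}

	return 0;
}
```

Without the reset, toy_resume(false) reports success even though nothing
was resumed; toy_resume(true) returns -EINVAL as intended.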


> > The equivalent code in USB does this automatically.  The
> > runtime-disable routine does a resume if the depth value was
> > originally 0,
> 
> Yes, we should do that in general.
> 
> > and the runtime-enable routine queues a delayed autosuspend request if the
> > final depth value is 0.
> 
> I don't like this.

I guess this is a question of how you view things.  My view has been that
whenever depth (or pm_usage_cnt in the USB code) is 0, it means neither
the driver nor anyone else has any reason to keep the device at full
power.  By definition, since that's what depth is -- a count of the
reasons for not suspending.

There might be some obscure other reason, but in general depth going
to 0 means a delayed autosuspend request should be queued.

Which reminds me... Something to think about: In an async call to
__pm_runtime_suspend, if the runtime_suspend callback returns -EBUSY
then perhaps your code should automatically requeue a new delayed
autosuspend request.  Which implies, of course, that the autosuspend
delay has to be stored in the dev_pm_info structure.  This isn't a bad
thing, since exposing the value in sysfs gives userspace a consistent
way to set the delay.
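
In outline, something like the following user-space sketch.  The field
name autosuspend_delay and all of the toy_*() functions are hypothetical;
the point is only that the async path can requeue by itself once the delay
is stored with the device.

```c
#include <errno.h>
#include <stdbool.h>

struct toy_dev {
	unsigned int autosuspend_delay;	/* hypothetical stored delay */
	unsigned int queued_delay;	/* delay of the queued request */
	bool request_queued;
	int busy_calls;			/* times the callback says -EBUSY */
};

/* Stand-in for the bus type's ->runtime_suspend() callback. */
static int toy_runtime_suspend_cb(struct toy_dev *dev)
{
	if (dev->busy_calls > 0) {
		dev->busy_calls--;
		return -EBUSY;
	}
	return 0;
}

/* Stand-in for queueing a delayed suspend request. */
static void toy_queue_suspend(struct toy_dev *dev, unsigned int delay)
{
	dev->request_queued = true;
	dev->queued_delay = delay;
}

/* Asynchronous suspend path, as it would run from the work item. */
static int toy_async_suspend(struct toy_dev *dev)
{
	int error;

	dev->request_queued = false;
	error = toy_runtime_suspend_cb(dev);
	if (error == -EBUSY)
		toy_queue_suspend(dev, dev->autosuspend_delay);
	return error;
}
```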

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-18 18:17                     ` Alan Stern
  (?)
  (?)
@ 2009-06-19  0:38                     ` Rafael J. Wysocki
  2009-06-19 16:25                       ` Alan Stern
  2009-06-19 16:25                         ` Alan Stern
  -1 siblings, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-19  0:38 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Thursday 18 June 2009, Alan Stern wrote:
> On Thu, 18 Jun 2009, Rafael J. Wysocki wrote:
> 
> > Not only then.  The dev->power.depth counter was meant to be a "disable
> > everything" one, because there are situations in which we don't want even
> > resume to run (probe, release, system-wide suspend, hibernation, resume from
> > a system sleep state, possibly others).
> > 
> > That said, I overlooked some problems related to it.  So, I think to disable
> > the runtime PM of a given device, it will be necessary to run a synchronous
> > runtime resume while taking a ref to block suspend.
> 
> There should also be an async version, which increases depth while
> submitting a resume request.
> 
> In fact, maybe it would be best if pm_request_resume always increments
> depth (unless it fails for some other reason) and __pm_runtime_resume
> increments depth whenever called synchronously.  And likewise for the
> suspend paths.

And how exactly are we going to check if pm_request_resume() was successful?

We'd have to be able to do that in a code path different from the one that has
called pm_request_resume().

> > > Instead of a costly device_for_each_child(), would it be better to
> > > maintain a counter with the number of unsuspended children?
> > 
> > Hmm.  How exactly are we going to count them?  The only way I see at the moment
> > would be to increase this number by one when running pm_runtime_init() for a
> > new child.  Seems doable.
> 
> That's right.  You also have to decrement the number when an
> unsuspended child device is removed, obviously.

I forgot about that, so it is not done in the patch below.

BTW, is it just me, or are we overcomplicating that thing beyond any
reasonable limit?

I think I'll just do the device_for_each_child() for now, because IMO this
optimization just isn't worth the complications resulting from it.
Realistically, how many children is a parent going to have in a normal system?

> The one thing to watch out for is what happens if a device is removed while
> its runtime_resume callback is running.  :-)

I don't think it's possible.

> > > > +	spin_lock(&dev->power.lock);
> > > 
> > > Should be spin_lock_irq().  Same in other places.
> > 
> > OK, I wasn't sure about that.
> 
> The reasoning isn't complicated.  If a spinlock can be taken by an
> interrupt handler (or any other code that might run in interrupt
> context) then you have the possibility of a deadlock as follows:
> 
> 	spin_lock(&lock);
> 	<Interrupt occurs>
> 		irq_handler() {
> 			spin_lock(&lock);
> 
> The handler can't acquire the lock because it is already in use, and
> it can't be released until the handler returns.
> 
> As a result, if a spinlock is ever taken within an interrupt handler
> then it always has to be acquired with interrupts disabled.
> Similarly, if it is never taken within an interrupt handler but it is
> taken within a bottom-half routine, then it always has to be acquired
> with bottom halves disabled.
> 
> > From the functionality point of view, nothing wrong happens if runtime suspend
> > fails as long as an error code is returned and the caller has to be prepared
> > for a failure anyway.  Moreover, we never know why the resume is carried out,
> > so it's not clear whether it will be valid to carry out the suspend after that.
> 
> Your first point certainly is correct.  As for the second point, if
> whoever did the resume doesn't want the device suspended again, he
> should have incremented depth.  So making the suspend wait until the
> resume is finished and then failing because the depth is positive
> would be a valid approach.
> 
> However there's no use worrying about this until we have some real
> examples.
> 
> > > > +		spin_unlock(&dev->parent->power.lock);
> > > > +
> > > > +		/* The device's parent is not active.  Resume it and repeat. */
> > > > +		error = __pm_runtime_resume(dev->parent, false);
> > > > +		if (error)
> > > > +			return error;
> > > 
> > > Need to reset error to -EINVAL.
> > 
> > Why -EINVAL?
> 
> We have lost the context because of email trimming.  Briefly, when you
> jump back to "repeat:", the code there expects error to have been
> initialized to -EINVAL.  Some of the pathways will return error
> unchanged, expecting it to have that value.
> 
> Alternatively, you could have those pathways set error and then you
> wouldn't have to initialize it.  Either way.

Ah, OK

> > > The equivalent code in USB does this automatically.  The
> > > runtime-disable routine does a resume if the depth value was
> > > originally 0,
> > 
> > Yes, we should do that in general.
> > 
> > > and the runtime-enable routine queues a delayed autosuspend request if the
> > > final depth value is 0.
> > 
> > I don't like this.
> 
> I guess this is a question of how you view things.  My view has been that
> whenever depth (or pm_usage_cnt in the USB code) is 0, it means neither
> the driver nor anyone else has any reason to keep the device at full
> power.  By definition, since that's what depth is -- a count of the
> reasons for not suspending.
> 
> There might be some obscure other reason, but in general depth going
> to 0 means a delayed autosuspend request should be queued.

OK there, but pm_runtime_disable() is called by the core in some places where
we'd rather not have the device suspended (like during system-wide power
transitions).

> Which reminds me... Something to think about: In an async call to
> __pm_runtime_suspend, if the runtime_suspend callback returns -EBUSY
> then perhaps your code should automatically requeue a new delayed
> autosuspend request.  Which implies, of course, that the autosuspend
> delay has to be stored in the dev_pm_info structure.  This isn't a bad
> thing, since exposing the value in sysfs gives userspace a consistent
> way to set the delay.

I think that functionality can be added later.  Let's keep things as simple
as possible initially, or we won't be able to make any progress.

Below is a new version of the patch.  Unfortunately, it is a major rework.
In short, I tried to address some of your recent comments and my observations.
It doesn't use depth any more, there's another counter (called resume_count)
instead, also playing the role of the RPM_GRACE bit from the previous version.

I've just finished it, so it may still be missing something apart from the
updating child_count on removal of an unsuspended child.

Best,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM: Introduce core framework for run-time PM of I/O devices (rev. 2)

Introduce a core framework for run-time power management of I/O
devices.  Add device run-time PM fields to 'struct dev_pm_info'
and device run-time PM callbacks to 'struct dev_pm_ops'.  Introduce
a run-time PM workqueue and define some device run-time PM helper
functions at the core level.  Document all these things.

Not-yet-signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/runtime_pm.txt |  378 ++++++++++++++++++++++++++
 drivers/base/dd.c                  |    9 
 drivers/base/power/Makefile        |    1 
 drivers/base/power/main.c          |    5 
 drivers/base/power/runtime.c       |  533 +++++++++++++++++++++++++++++++++++++
 include/linux/pm.h                 |   95 ++++++
 include/linux/pm_runtime.h         |  115 +++++++
 kernel/power/Kconfig               |   14 
 kernel/power/main.c                |   17 +
 9 files changed, 1164 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/power/Kconfig
===================================================================
--- linux-2.6.orig/kernel/power/Kconfig
+++ linux-2.6/kernel/power/Kconfig
@@ -208,3 +208,17 @@ config APM_EMULATION
 	  random kernel OOPSes or reboots that don't seem to be related to
 	  anything, try disabling/enabling this option (or disabling/enabling
 	  APM in your BIOS).
+
+config PM_RUNTIME
+	bool "Run-time PM core functionality"
+	depends on PM
+	---help---
+	  Enable functionality allowing I/O devices to be put into energy-saving
+	  (low power) states at run time (or autosuspended) after a specified
+	  period of inactivity and woken up in response to a hardware-generated
+	  wake-up event or a driver's request.
+
+	  Hardware support is generally required for this functionality to work
+	  and the bus type drivers of the buses the devices are on are
+	  responsible for the actual handling of the autosuspend requests and
+	  wake-up events.
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -11,6 +11,7 @@
 #include <linux/kobject.h>
 #include <linux/string.h>
 #include <linux/resume-trace.h>
+#include <linux/workqueue.h>
 
 #include "power.h"
 
@@ -217,8 +218,24 @@ static struct attribute_group attr_group
 	.attrs = g,
 };
 
+#ifdef CONFIG_PM_RUNTIME
+struct workqueue_struct *pm_wq;
+
+static int __init pm_start_workqueue(void)
+{
+	pm_wq = create_freezeable_workqueue("pm");
+
+	return pm_wq ? 0 : -ENOMEM;
+}
+#else
+static inline int pm_start_workqueue(void) { return 0; }
+#endif
+
 static int __init pm_init(void)
 {
+	int error = pm_start_workqueue();
+	if (error)
+		return error;
 	power_kobj = kobject_create_and_add("power", NULL);
 	if (!power_kobj)
 		return -ENOMEM;
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -22,6 +22,9 @@
 #define _LINUX_PM_H
 
 #include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -165,6 +168,28 @@ typedef struct pm_message {
  * It is allowed to unregister devices while the above callbacks are being
  * executed.  However, it is not allowed to unregister a device from within any
  * of its own callbacks.
+ *
+ * There also are the following callbacks related to run-time power management
+ * of devices:
+ *
+ * @runtime_suspend: Prepare the device for a condition in which it won't be
+ *	able to communicate with the CPU(s) and RAM due to power management.
+ *	This need not mean that the device should be put into a low power state.
+ *	For example, if the device is behind a link which is about to be turned
+ *	off, the device may remain at full power.  Still, if the device does go
+ *	to low power and if device_may_wakeup(dev) is true, remote wake-up
+ *	(i.e. hardware mechanism allowing the device to request a change of its
+ *	power state, such as PCI PME) should be enabled for it.
+ *
+ * @runtime_resume: Put the device into the fully active state in response to a
+ *	wake-up event generated by hardware or at a request of software.  If
+ *	necessary, put the device into the full power state and restore its
+ *	registers, so that it is fully operational.
+ *
+ * @runtime_idle: Device appears to be inactive and it might be put into a low
+ *	power state if all of the necessary conditions are satisfied.  Check
+ *	these conditions and handle the device as appropriate, possibly queueing
+ *	a suspend request for it.
  */
 
 struct dev_pm_ops {
@@ -182,6 +207,9 @@ struct dev_pm_ops {
 	int (*thaw_noirq)(struct device *dev);
 	int (*poweroff_noirq)(struct device *dev);
 	int (*restore_noirq)(struct device *dev);
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
 };
 
 /**
@@ -315,14 +343,75 @@ enum dpm_state {
 	DPM_OFF_IRQ,
 };
 
+/**
+ * Device run-time power management state.
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.  They do not
+ * reflect the actual power state of the device or its status as seen by the
+ * driver.
+ *
+ * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
+ *			pending for it.
+ *
+ * RPM_IDLE		It has been requested that the device be suspended.
+ *			Suspend request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
+ *			executed.
+ *
+ * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
+ *			completed successfully.  The device is regarded as
+ *			suspended.
+ *
+ * RPM_WAKE		It has been requested that the device be woken up.
+ *			Resume request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
+ *			executed.
+ *
+ * RPM_ERROR		Represents a condition from which the PM core cannot
+ *			recover by itself.  If the device's run-time PM status
+ *			field has this value, all of the run-time PM operations
+ *			carried out for the device by the core will fail, until
+ *			the status field is changed to either RPM_ACTIVE or
+ *			RPM_SUSPENDED (it is not valid to use the other values
+ *			in such a situation) by the device's driver or bus type.
+ *			This happens when the device bus type's
+ *			->runtime_suspend() or ->runtime_resume() callback
+ *			returns an error code different from -EAGAIN or -EBUSY.
+ */
+
+#define RPM_ACTIVE	0
+#define RPM_IDLE	0x01
+#define RPM_SUSPENDING	0x02
+#define RPM_SUSPENDED	0x04
+#define RPM_WAKE	0x08
+#define RPM_RESUMING	0x10
+#define RPM_ERROR	0x1F
+
 struct dev_pm_info {
 	pm_message_t		power_state;
-	unsigned		can_wakeup:1;
-	unsigned		should_wakeup:1;
+	unsigned int		can_wakeup:1;
+	unsigned int		should_wakeup:1;
 	enum dpm_state		status;		/* Owned by the PM core */
-#ifdef	CONFIG_PM_SLEEP
+#ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
 #endif
+#ifdef CONFIG_PM_RUNTIME
+	struct delayed_work	suspend_work;
+	struct work_struct	resume_work;
+	struct completion	work_done;
+	unsigned int		ignore_children:1;
+	unsigned int		suspend_aborted:1;
+	unsigned int		runtime_status:5;
+	int			runtime_error;
+	int			resume_count;
+	int			child_count;
+	spinlock_t		lock;
+#endif
 };
 
 /*
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o
+obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
Index: linux-2.6/drivers/base/power/runtime.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/runtime.c
@@ -0,0 +1,533 @@
+/*
+ * drivers/base/power/runtime.c - Helper functions for device run-time PM
+ *
+ * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/pm_runtime.h>
+#include <linux/jiffies.h>
+
+/**
+ * __pm_get_child - Increment the counter of unsuspended children of a device.
+ * @dev: Device to handle.
+ */
+static void __pm_get_child(struct device *dev)
+{
+	dev->power.child_count++;
+}
+
+/**
+ * __pm_put_child - Decrement the counter of unsuspended children of a device.
+ * @dev: Device to handle.
+ */
+static void __pm_put_child(struct device *dev)
+{
+	if (dev->power.child_count > 0)
+		dev->power.child_count--;
+	else
+		dev_warn(dev, "Excessive %s!\n", __func__);
+}
+
+/**
+ * pm_runtime_notify_idle - Run a device bus type's runtime_idle() callback.
+ * @dev: Device to notify.
+ *
+ * Check if all children of given device are suspended and call the device bus
+ * type's ->runtime_idle() callback if that's the case.
+ */
+static void pm_runtime_notify_idle(struct device *dev)
+{
+	if (!pm_children_suspended(dev))
+		return;
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_idle)
+		dev->bus->pm->runtime_idle(dev);
+}
+
+/**
+ * __pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
+ * @dev: Device to suspend.
+ * @sync: If unset, the function has been called via pm_wq.
+ *
+ * Check if the status of the device is appropriate and run the
+ * ->runtime_suspend() callback provided by the device's bus type driver.
+ * Update the run-time PM flags in the device object to reflect the current
+ * status of the device.
+ */
+int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	struct device *parent = NULL;
+	unsigned long parflags = 0, flags;
+	int error = -EINVAL;
+
+	might_sleep();
+
+ repeat:
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out;
+	} else if (dev->power.runtime_status & RPM_SUSPENDED) {
+		error = 0;
+		goto out;
+	} else if ((dev->power.runtime_status & (RPM_WAKE | RPM_RESUMING))
+	    || dev->power.resume_count > 0
+	    || (!sync && dev->power.suspend_aborted)) {
+		/*
+		 * Device is resuming, there's a resume request pending for it,
+		 * the device's resume counter is greater than 0, or a pending
+		 * suspend request has just been cancelled and we're running as
+		 * a result of that request.
+		 */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
+		/*
+		 * Another suspend is running in parallel with us.  Wait for it
+		 * to complete and return.
+		 */
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	} else if (sync && dev->power.runtime_status == RPM_IDLE) {
+		/*
+		 * Suspend request is pending, but we're not running as a result
+		 * of it, so cancel it and repeat.
+		 */
+		dev->power.suspend_aborted = true;
+		dev->power.runtime_status = RPM_ACTIVE;
+
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+
+		cancel_delayed_work_sync(&dev->power.suspend_work);
+
+		goto repeat;
+	}
+
+	if (!pm_children_suspended(dev)) {
+		/*
+		 * We can only suspend the device if all of its children have
+		 * been suspended.
+		 */
+		dev->power.runtime_status = RPM_ACTIVE;
+		error = -EBUSY;
+		goto out;
+	}
+
+	dev->power.runtime_status = RPM_SUSPENDING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_suspend)
+		error = dev->bus->pm->runtime_suspend(dev);
+	parent = dev->parent;
+
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	switch (error) {
+	case 0:
+		/*
+		 * Resume request might have been queued in the meantime, in
+		 * which case the RPM_WAKE bit is also set in runtime_status.
+		 */
+		dev->power.runtime_status &= ~RPM_SUSPENDING;
+		dev->power.runtime_status |= RPM_SUSPENDED;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+	if (!error && !(dev->power.runtime_status & RPM_WAKE) && parent) {
+		__pm_put_child(parent);
+
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		pm_runtime_notify_idle(parent);
+
+		return 0;
+	}
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent)
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_suspend);
+
+/**
+ * pm_runtime_suspend_work - Run __pm_runtime_suspend() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the suspend has been scheduled for and
+ * run __pm_runtime_suspend() for it.
+ */
+static void pm_runtime_suspend_work(struct work_struct *work)
+{
+	__pm_runtime_suspend(suspend_work_to_device(work), false);
+}
+
+/**
+ * pm_request_suspend - Schedule run-time suspend of given device.
+ * @dev: Device to suspend.
+ * @msec: Time to wait before attempting to suspend the device, in milliseconds.
+ */
+void pm_request_suspend(struct device *dev, unsigned int msec)
+{
+	unsigned long flags;
+	unsigned long delay = msecs_to_jiffies(msec);
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ACTIVE)
+		goto out;
+
+	dev->power.runtime_status = RPM_IDLE;
+	dev->power.suspend_aborted = false;
+	queue_delayed_work(pm_wq, &dev->power.suspend_work, delay);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_suspend);
+
+/**
+ * __pm_runtime_get - Increment the resume counter of given device.
+ * @dev: Device to handle.
+ */
+static void __pm_runtime_get(struct device *dev)
+{
+	dev->power.resume_count++;
+}
+
+/**
+ * __pm_runtime_put - Decrement the resume counter of given device.
+ * @dev: Device to handle.
+ */
+static void __pm_runtime_put(struct device *dev)
+{
+	if (dev->power.resume_count > 0)
+		dev->power.resume_count--;
+	else
+		dev_warn(dev, "Excessive %s!\n", __func__);
+}
+
+/**
+ * __pm_runtime_resume - Run a device bus type's runtime_resume() callback.
+ * @dev: Device to resume.
+ * @get: If set, increment the device's resume counter.
+ * @sync: If unset, the function has been called via pm_wq.
+ *
+ * Check if the device is really suspended and run the ->runtime_resume()
+ * callback provided by the device's bus type driver.  Update the run-time PM
+ * flags in the device object to reflect the current status of the device.  If
+ * runtime suspend is in progress while this function is being run, wait for it
+ * to finish before resuming the device.  If runtime suspend is scheduled, but
+ * it hasn't started yet, cancel it and we're done.
+ */
+int __pm_runtime_resume(struct device *dev, bool get, bool sync)
+{
+	struct device *parent = dev->parent;
+	unsigned long parflags = 0, flags;
+	bool put_parent = false;
+	int error = -EINVAL;
+
+	might_sleep();
+
+ repeat:
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out;
+	} else if (dev->power.runtime_status == RPM_ACTIVE) {
+		error = 0;
+		goto out;
+	}
+
+	if (dev->power.runtime_status & RPM_IDLE) {
+		/* Only a suspend request is pending, cancel it and repeat. */
+		dev->power.suspend_aborted = true;
+		dev->power.runtime_status &= ~RPM_IDLE;
+
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		if (parent)
+			spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		cancel_delayed_work_sync(&dev->power.suspend_work);
+
+		goto repeat;
+	} else if (sync && (dev->power.runtime_status & RPM_WAKE)) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		if (parent)
+			spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		/*
+		 * Resume request is pending, but we're not running as a result
+		 * of it, so it has to run before we continue in case it's
+		 * going to increment the device's resume counter.
+		 */
+		flush_work(&dev->power.resume_work);
+
+		goto repeat;
+	} else if (dev->power.runtime_status & RPM_SUSPENDING) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		if (parent)
+			spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		/*
+		 * A suspend is running in parallel with us.  Wait for it to
+		 * complete and repeat.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		goto repeat;
+	} else if (dev->power.runtime_status == RPM_SUSPENDED && parent
+	    && parent->power.runtime_status != RPM_ACTIVE) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		/* The parent has to be resumed before we continue. */
+		error = pm_runtime_resume_get(parent);
+		if (error)
+			return error;
+
+		put_parent = true;
+		error = -EINVAL;
+		goto repeat;
+	}
+
+	if (dev->power.runtime_status == RPM_RESUMING) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		if (parent) {
+			if (put_parent)
+				__pm_runtime_put(parent);
+			spin_unlock_irqrestore(&parent->power.lock, parflags);
+			parent = NULL;
+		}
+
+		/*
+		 * There's another resume running in parallel with us. Wait for
+		 * it to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		spin_lock_irqsave(&dev->power.lock, flags);
+
+		error = dev->power.runtime_error;
+		goto out;
+	}
+
+	if (dev->power.runtime_status == RPM_SUSPENDED && parent)
+		__pm_get_child(parent);
+
+	dev->power.runtime_status = RPM_RESUMING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent) {
+		if (put_parent)
+			__pm_runtime_put(parent);
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+		parent = NULL;
+	}
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_resume)
+		error = dev->bus->pm->runtime_resume(dev);
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	dev->power.runtime_status = error ? RPM_ERROR : RPM_ACTIVE;
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+ out:
+	if (!error && get)
+		__pm_runtime_get(dev);
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent) {
+		if (put_parent)
+			__pm_runtime_put(parent);
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+	}
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_resume);
+
+/**
+ * pm_runtime_resume_work - Run __pm_runtime_resume() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the resume has been scheduled for and run
+ * __pm_runtime_resume() for it.
+ */
+static void pm_runtime_resume_work(struct work_struct *work)
+{
+	__pm_runtime_resume(resume_work_to_device(work), false, false);
+}
+
+/**
+ * pm_request_resume - Schedule run-time resume of given device.
+ * @dev: Device to resume.
+ */
+void pm_request_resume(struct device *dev)
+{
+	struct device *parent = dev->parent;
+	unsigned long parflags = 0, flags;
+
+ repeat:
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if ((dev->power.runtime_status & RPM_WAKE)
+	    || !(dev->power.runtime_status &
+			(RPM_SUSPENDING | RPM_SUSPENDED))) {
+		goto out;
+	} else if (parent && !(parent->power.runtime_status & RPM_WAKE)
+	    && (parent->power.runtime_status &
+			(RPM_IDLE | RPM_SUSPENDING | RPM_SUSPENDED))) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		pm_request_resume(parent);
+
+		goto repeat;
+	}
+
+	if (dev->power.runtime_status == RPM_SUSPENDED && parent)
+		__pm_get_child(parent);
+
+	/*
+	 * The device may be suspending at the moment or a suspend request may
+	 * be pending for it and we can't clear the RPM_IDLE or RPM_SUSPENDING
+	 * bit in its runtime_status just yet.
+	 */
+	dev->power.runtime_status |= RPM_WAKE;
+	queue_work(pm_wq, &dev->power.resume_work);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent)
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+}
+EXPORT_SYMBOL_GPL(pm_request_resume);
+
+/**
+ * pm_runtime_put - Decrement the resume counter of a device under 'power.lock'.
+ * @dev: Device to handle.
+ */
+void pm_runtime_put(struct device *dev)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	__pm_runtime_put(dev);
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_put);
+
+/**
+ * pm_runtime_disable - Disable run-time suspend and resume of a device.
+ * @dev: Device to handle.
+ *
+ * Increase the resume counter of given device, so that it cannot be suspended
+ * at run time, and run pm_runtime_resume() for it to put it into the RPM_ACTIVE
+ * state, which also blocks run-time resume of it.
+ */
+void pm_runtime_disable(struct device *dev)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	__pm_runtime_get(dev);
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+
+	pm_runtime_resume(dev);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_disable);
+
+/**
+ * __pm_runtime_clear_status - Change the run-time PM status of a device.
+ * @dev: Device to handle.
+ * @status: New value of the device's run-time PM status.
+ *
+ * Change the run-time PM status of the device to @status, which must be
+ * either RPM_ACTIVE or RPM_SUSPENDED, if its current value is equal to
+ * RPM_ERROR.
+ */
+void __pm_runtime_clear_status(struct device *dev, unsigned int status)
+{
+	struct device *parent = dev->parent;
+	unsigned long parflags = 0, flags;
+
+	if (status & ~RPM_SUSPENDED)
+		return;
+
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ERROR)
+		goto out;
+
+	dev->power.runtime_status = status;
+	if (parent && status == RPM_SUSPENDED)
+		__pm_put_child(parent);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent)
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_clear_status);
+
+/**
+ * pm_runtime_init - Initialize run-time PM fields in given device object.
+ * @dev: Device object to handle.
+ */
+void pm_runtime_init(struct device *dev)
+{
+	struct device *parent = dev->parent;
+
+	spin_lock_init(&dev->power.lock);
+
+	dev->power.runtime_status = RPM_ACTIVE;
+	dev->power.resume_count = 1;
+	pm_suspend_ignore_children(dev, false);
+	dev->power.child_count = 0;
+	INIT_DELAYED_WORK(&dev->power.suspend_work, pm_runtime_suspend_work);
+	INIT_WORK(&dev->power.resume_work, pm_runtime_resume_work);
+
+	if (parent) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&parent->power.lock, flags);
+
+		__pm_get_child(parent);
+
+		spin_unlock_irqrestore(&parent->power.lock, flags);
+	}
+}
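
For reviewers: the checks that __pm_runtime_suspend() makes before calling the
bus type's ->runtime_suspend() can be read as a pure decision function.  The
user-space sketch below assumes the order of checks in the function above and
leaves out all locking and waiting; 'struct rpm_dev_model',
rpm_suspend_precheck() and the RPM_PROCEED, RPM_WAIT_OTHER and
RPM_CANCEL_REPEAT codes are hypothetical names that do not appear in the
patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Values copied from the include/linux/pm.h hunk of the patch. */
#define RPM_ACTIVE	0
#define RPM_IDLE	0x01
#define RPM_SUSPENDING	0x02
#define RPM_SUSPENDED	0x04
#define RPM_WAKE	0x08
#define RPM_RESUMING	0x10
#define RPM_ERROR	0x1F

/* Hypothetical outcomes for the branches that do not return an errno. */
enum { RPM_PROCEED = 1, RPM_WAIT_OTHER = 2, RPM_CANCEL_REPEAT = 3 };

/* Hypothetical, lock-free model of the fields the function looks at. */
struct rpm_dev_model {
	unsigned int status;
	int resume_count;
	int child_count;
	bool ignore_children;
	bool suspend_aborted;
};

static int rpm_suspend_precheck(const struct rpm_dev_model *d, bool sync)
{
	if (d->status == RPM_ERROR)
		return -EINVAL;		/* unrecoverable, refuse */
	if (d->status & RPM_SUSPENDED)
		return 0;		/* nothing to do */
	if ((d->status & (RPM_WAKE | RPM_RESUMING)) || d->resume_count > 0 ||
	    (!sync && d->suspend_aborted))
		return -EAGAIN;		/* resume wins, or request cancelled */
	if (d->status == RPM_SUSPENDING)
		return RPM_WAIT_OTHER;	/* another suspend is in flight */
	if (sync && d->status == RPM_IDLE)
		return RPM_CANCEL_REPEAT; /* drop the queued request, retry */
	if (!d->ignore_children && d->child_count > 0)
		return -EBUSY;		/* unsuspended children remain */
	return RPM_PROCEED;		/* call ->runtime_suspend() */
}
```

Note that in this model a device with unsuspended children yields -EBUSY, and
that a queued (asynchronous) request for an RPM_IDLE device falls through to
the children check instead of cancelling itself.
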
Index: linux-2.6/include/linux/pm_runtime.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pm_runtime.h
@@ -0,0 +1,115 @@
+/*
+ * pm_runtime.h - Device run-time power management helper functions.
+ *
+ * Copyright (C) 2009 Rafael J. Wysocki <rjw@sisk.pl>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_RUNTIME_H
+#define _LINUX_PM_RUNTIME_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+#ifdef CONFIG_PM_RUNTIME
+
+extern struct workqueue_struct *pm_wq;
+
+extern void pm_runtime_init(struct device *dev);
+extern int __pm_runtime_suspend(struct device *dev, bool sync);
+extern void pm_request_suspend(struct device *dev, unsigned int msec);
+extern int __pm_runtime_resume(struct device *dev, bool get, bool sync);
+extern void pm_request_resume(struct device *dev);
+extern void pm_runtime_put(struct device *dev);
+extern void pm_runtime_disable(struct device *dev);
+extern void __pm_runtime_clear_status(struct device *dev, unsigned int status);
+
+static inline struct device *suspend_work_to_device(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(dw, struct dev_pm_info, suspend_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline struct device *resume_work_to_device(struct work_struct *work)
+{
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(work, struct dev_pm_info, resume_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline bool pm_children_suspended(struct device *dev)
+{
+	return dev->power.ignore_children || !dev->power.child_count;
+}
+
+static inline bool pm_suspend_possible(struct device *dev)
+{
+	return pm_children_suspended(dev) && !(dev->power.resume_count > 0
+		|| (dev->power.runtime_status & (RPM_WAKE | RPM_RESUMING)));
+}
+
+static inline void pm_suspend_ignore_children(struct device *dev, bool enable)
+{
+	dev->power.ignore_children = enable;
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+static inline void pm_runtime_init(struct device *dev) {}
+static inline int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	return -ENOSYS;
+}
+static inline void pm_request_suspend(struct device *dev, unsigned int msec) {}
+static inline int __pm_runtime_resume(struct device *dev, bool get, bool sync)
+{
+	return -ENOSYS;
+}
+static inline void pm_request_resume(struct device *dev) {}
+static inline void pm_runtime_put(struct device *dev) {}
+static inline void pm_runtime_disable(struct device *dev) {}
+static inline void __pm_runtime_clear_status(struct device *dev,
+					      unsigned int status) {}
+
+static inline bool pm_children_suspended(struct device *dev) { return false; }
+static inline bool pm_suspend_possible(struct device *dev) { return false; }
+static inline void pm_suspend_ignore_children(struct device *dev, bool en) {}
+
+#endif /* !CONFIG_PM_RUNTIME */
+
+static inline int pm_runtime_suspend(struct device *dev)
+{
+	return __pm_runtime_suspend(dev, true);
+}
+
+static inline int pm_runtime_resume(struct device *dev)
+{
+	return __pm_runtime_resume(dev, false, true);
+}
+
+static inline int pm_runtime_resume_get(struct device *dev)
+{
+	return __pm_runtime_resume(dev, true, true);
+}
+
+static inline void pm_runtime_clear_active(struct device *dev)
+{
+	__pm_runtime_clear_status(dev, RPM_ACTIVE);
+}
+
+static inline void pm_runtime_clear_suspended(struct device *dev)
+{
+	__pm_runtime_clear_status(dev, RPM_SUSPENDED);
+}
+
+static inline void pm_runtime_enable(struct device *dev)
+{
+	pm_runtime_put(dev);
+}
+
+#endif
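
The two *_work_to_device() helpers above walk from an embedded work item back
to the enclosing 'struct device' in two container_of() steps.  A user-space
sketch of the same pointer arithmetic follows; the container_of() macro here
is a simplified version without the kernel's type checking, and the toy_*
types and toy_resume_work_to_device() are hypothetical stand-ins for the real
structures.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of(): no type checking, unlike the kernel macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for work_struct, dev_pm_info and device. */
struct toy_work {
	int pending;
};

struct toy_pm_info {
	int status;
	struct toy_work resume_work;
};

struct toy_device {
	const char *name;
	struct toy_pm_info power;
};

/* Step 1: work item -> PM info.  Step 2: PM info -> device. */
static struct toy_device *toy_resume_work_to_device(struct toy_work *work)
{
	struct toy_pm_info *dpi;

	dpi = container_of(work, struct toy_pm_info, resume_work);
	return container_of(dpi, struct toy_device, power);
}
```

Given 'struct toy_device dev', the call
toy_resume_work_to_device(&dev.power.resume_work) yields &dev again, which is
what the work functions in runtime.c rely on.
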
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -21,6 +21,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
@@ -88,6 +89,7 @@ void device_pm_add(struct device *dev)
 	}
 
 	list_add_tail(&dev->power.entry, &dpm_list);
+	pm_runtime_init(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -507,6 +509,7 @@ static void dpm_complete(pm_message_t st
 		get_device(dev);
 		if (dev->power.status > DPM_ON) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			mutex_unlock(&dpm_list_mtx);
 
 			device_complete(dev, state);
@@ -753,6 +756,7 @@ static int dpm_prepare(pm_message_t stat
 
 		get_device(dev);
 		dev->power.status = DPM_PREPARING;
+		pm_runtime_disable(dev);
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_prepare(dev, state);
@@ -760,6 +764,7 @@ static int dpm_prepare(pm_message_t stat
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			if (error == -EAGAIN) {
 				put_device(dev);
 				continue;
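
The pm_runtime_disable()/pm_runtime_enable() pairs added here (and around
probing and driver release in dd.c) rely on the resume counter acting as a
nesting count: a device may only be suspended at run time while the counter is
0.  The user-space sketch below models just the counter; the toy_* names are
hypothetical, and the real pm_runtime_disable() additionally resumes the
device, which the model leaves out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of 'power.resume_count' only. */
struct toy_rpm {
	int resume_count;
};

/* pm_runtime_init() starts with the counter at 1: suspend is blocked. */
static void toy_rpm_init(struct toy_rpm *r)
{
	r->resume_count = 1;
}

/* Models pm_runtime_disable(): take one more reference. */
static void toy_rpm_disable(struct toy_rpm *r)
{
	r->resume_count++;
}

/*
 * Models pm_runtime_enable(), i.e. pm_runtime_put(): drop one reference.
 * Returns false on an excessive put, where the patch prints a warning.
 */
static bool toy_rpm_enable(struct toy_rpm *r)
{
	if (r->resume_count > 0) {
		r->resume_count--;
		return true;
	}
	return false;
}

/* Only the counter condition of the real suspend checks is modelled. */
static bool toy_rpm_may_suspend(const struct toy_rpm *r)
{
	return r->resume_count == 0;
}
```

In this model a disable/enable pair around probing leaves the counter where it
was, so a freshly added device stays blocked until some other caller drops the
initial reference (presumably its bus type or driver; this excerpt does not
say which).
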
Index: linux-2.6/drivers/base/dd.c
===================================================================
--- linux-2.6.orig/drivers/base/dd.c
+++ linux-2.6/drivers/base/dd.c
@@ -23,6 +23,7 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
+#include <linux/pm_runtime.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -202,8 +203,12 @@ int driver_probe_device(struct device_dr
 	pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
 		 drv->bus->name, __func__, dev_name(dev), drv->name);
 
+	pm_runtime_disable(dev);
+
 	ret = really_probe(dev, drv);
 
+	pm_runtime_enable(dev);
+
 	return ret;
 }
 
@@ -306,6 +311,8 @@ static void __device_release_driver(stru
 
 	drv = dev->driver;
 	if (drv) {
+		pm_runtime_disable(dev);
+
 		driver_sysfs_remove(dev);
 
 		if (dev->bus)
@@ -320,6 +327,8 @@ static void __device_release_driver(stru
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
+
+		pm_runtime_enable(dev);
 	}
 }
 
Index: linux-2.6/Documentation/power/runtime_pm.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/power/runtime_pm.txt
@@ -0,0 +1,378 @@
+Run-time Power Management Framework for I/O Devices
+
+(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+1. Introduction
+
+Support for run-time power management (run-time PM) of I/O devices is provided
+at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queuing all work items related to run-time PM, because this allows
+  them to be synchronized with system-wide power transitions.  pm_wq is declared
+  in include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of run-time PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing run-time PM operations with one another.
+
+* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out run-time PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The device run-time PM fields of 'struct dev_pm_info', the helper functions
+using them and the run-time PM callbacks present in 'struct dev_pm_ops' are
+described below.
+
+2. Run-time PM Helper Functions and Device Fields
+
+The following helper functions are defined in drivers/base/power/runtime.c
+and include/linux/pm_runtime.h:
+
+* void pm_runtime_init(struct device *dev);
+
+* int pm_runtime_suspend(struct device *dev);
+* void pm_request_suspend(struct device *dev, unsigned int msec);
+* int pm_runtime_resume(struct device *dev);
+* int pm_runtime_resume_get(struct device *dev);
+* void pm_request_resume(struct device *dev);
+* void pm_runtime_put(struct device *dev);
+
+* bool pm_suspend_possible(struct device *dev);
+
+* void pm_runtime_enable(struct device *dev);
+* void pm_runtime_disable(struct device *dev);
+
+* void pm_suspend_ignore_children(struct device *dev, bool enable);
+
+* void pm_runtime_clear_active(struct device *dev);
+* void pm_runtime_clear_suspended(struct device *dev);
+
+pm_runtime_init() initializes the run-time PM fields in the 'power' member of
+a device object.  It is called during the initialization of the device object,
+in drivers/base/power/main.c:device_pm_add().
+
+pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+pm_runtime_resume_get(), pm_request_resume(), and pm_runtime_put()
+use the 'power.runtime_status', 'power.resume_count', 'power.suspend_aborted',
+and 'power.child_count' fields of 'struct device' for mutual cooperation.  In
+what follows the 'power.runtime_status', 'power.resume_count', and
+'power.child_count' fields are referred to as the device's run-time PM status,
+the device's resume counter, and the counter of unsuspended children of the
+device, respectively.  They are set to RPM_ACTIVE, 1 and 0, respectively, by
+pm_runtime_init().
+
+pm_runtime_suspend() is used to carry out a run-time suspend of an active
+device.  It is called directly by a bus type or device driver.  There also is
+an asynchronous version of it, which is executed by the PM core to complete a
+request queued up by pm_request_suspend().  However, the only difference between
+them is the handling of situations in which a queued up suspend request has just
+been cancelled.  Apart from this, they work in the same way.
+
+  * If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the
+    device's run-time PM status field, 'power.runtime_status'), success is
+    returned.
+
+  * If the device is about to resume (i.e. at least one of the RPM_WAKE and
+    RPM_RESUMING bits is set in its run-time PM status field) or its resume
+    counter is greater than 0, or the function has been called via pm_wq as a
+    result of a cancelled suspend request (the 'power.suspend_aborted' field is
+    used to signal the termination of a suspend request), -EAGAIN is returned.
+
+  * If the device is suspending (i.e. its run-time PM status is RPM_SUSPENDING),
+    which means that another instance of pm_runtime_suspend() is running at the
+    same time for the same device, the function waits for the other instance to
+    complete and returns the error code (or success) returned by it.
+
+  * If the device has a pending suspend request (i.e. the RPM_IDLE bit is set in
+    its run-time PM status) and the function hasn't been called as a result of
+    that request, it cancels the request and restarts itself in case another
+    suspend is running in parallel with it.
+
+  * If the children of the device are not suspended and the
+    'power.ignore_children' flag is not set for it, the device's run-time PM
+    status is set to RPM_ACTIVE and -EBUSY is returned.
+
+If none of the above takes place, the device's run-time PM status is set to
+RPM_SUSPENDING and its bus type's ->runtime_suspend() callback is executed.
+This callback is responsible for handling the device as appropriate (for
+example, it may choose to execute the device driver's ->runtime_suspend()
+callback or to carry out any other suitable action depending on the bus type).
+
+  * If it completes successfully, the RPM_SUSPENDING bit is cleared and the
+    RPM_SUSPENDED bit is set in the device's run-time PM status field.  Once
+    that has happened, the device is regarded by the PM core as suspended, but
+    that _need_ _not_ mean that the device has been put into a low power state.
+    What really happens to the device at this point depends entirely on its bus
+    type (it may depend on the device's driver if the bus type chooses to call
+    it).  Additionally, if the device bus type's ->runtime_suspend() callback
+    completes successfully and there's no resume request pending for the device
+    (i.e. the RPM_WAKE flag is not set in its run-time PM status field), and the
+    device has a parent, the parent's counter of unsuspended children (i.e. the
+    'power.child_count' field) is decremented.  Next, if it turns out to be
+    equal to zero (i.e. all children of the device's parent have been suspended)
+    or the parent has the 'power.ignore_children' flag set, the parent's bus
+    type's ->runtime_idle() callback is executed.
+
+  * If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+    set to RPM_ACTIVE.
+
+  * If another error code is returned, the device's run-time PM status is set to
+    RPM_ERROR, which makes the PM core refuse to carry out any run-time PM
+    operations for it until the status is cleared by its bus type or driver with
+    the help of pm_runtime_clear_active() or pm_runtime_clear_suspended().
+
+Finally, pm_runtime_suspend() returns the error code (or success) returned by
+the device bus type's ->runtime_suspend() callback.  If the device's bus type
+doesn't implement ->runtime_suspend(), -EINVAL is returned and the device's
+run-time PM status is set to RPM_ERROR.
+
+pm_request_suspend() is used to queue up a suspend request for an active device.
+If the run-time PM status of the device (i.e. the value of the
+'power.runtime_status' field in 'struct device') is different from RPM_ACTIVE
+(i.e. the device is not active from the PM core standpoint), the function
+returns immediately.  Otherwise, it changes the device's run-time PM status to
+RPM_IDLE and puts a request to suspend the device into pm_wq.  The 'msec'
+argument is used to specify the time to wait before the request will be
+completed, in milliseconds.  It is valid to call this function from interrupt
+context.
+
+pm_runtime_resume() and pm_runtime_resume_get() are used to carry out a
+run-time resume of a device that is suspended, suspending or has a suspend
+request pending.  They are called directly by a bus type or device driver.
+The difference between them is that pm_runtime_resume_get() increments the
+device's resume counter, which prevents the PM core from suspending the device
+or queuing up a suspend request for it until its resume counter is decreased
+down to 0 with the help of pm_runtime_put().  Apart from this, they work in the
+same way.  There also is an asynchronous version of pm_runtime_resume(), called
+by the PM core as a result of a resume request queued up by pm_request_resume(),
+which doesn't check if there's a concurrent pending resume request for the
+device.
+
+  * If the device is active (i.e. all of the bits in its run-time PM status are
+    unset), success is returned (pm_runtime_resume_get() increments the device's
+    resume counter in that case).
+
+  * If there's a suspend request pending for the device (i.e. the device's
+    run-time PM status is RPM_IDLE), it is cancelled, the
+    'power.suspend_aborted' flag is set for the device, the RPM_IDLE bit is
+    cleared in its run-time PM status field and the function restarts itself.
+
+  * If the device has a pending resume request (i.e. the RPM_WAKE bit is set in
+    its run-time PM status field), but the function hasn't been called as a
+    result of that request, the function waits for that request to complete
+    (in case it's going to increment the device's resume counter) and restarts
+    itself.
+
+  * If the device is suspending (i.e. the RPM_SUSPENDING bit is set in its
+    run-time PM status field), the function waits for the suspend operation to
+    complete and restarts itself.
+
+  * If the device is suspended and doesn't have a pending resume request (i.e.
+    its run-time PM status is RPM_SUSPENDED), and it has a parent that is not
+    active (i.e. the parent's run-time PM status is not RPM_ACTIVE),
+    pm_runtime_resume_get() is called (recursively) for the parent.  If the
+    parent's resume is successful, the function notes that the parent's resume
+    counter will have to be decremented and restarts itself.  Otherwise, it
+    returns the error code returned by the instance of pm_runtime_resume_get()
+    handling the device's parent.
+
+  * If the device is resuming (i.e. the device's run-time PM status is
+    RPM_RESUMING), which means that another instance of pm_runtime_resume() or
+    pm_runtime_resume_get() is running at the same time for the same device, the
+    function waits for the other instance to complete and returns the result
+    returned by it (pm_runtime_resume_get() increments the device's resume
+    counter if success is returned).
+
+If none of the above happens, the function checks if the device's run-time PM
+status is RPM_SUSPENDED, which means that the device doesn't have a resume
+request pending, and if it has a parent.  If that is the case, the parent's
+counter of unsuspended children is increased.  Next, the device's run-time PM
+status is set to RPM_RESUMING and its bus type's ->runtime_resume() callback is
+executed.  This callback is entirely responsible for handling the device as
+appropriate (for example, it may choose to execute the device driver's
+->runtime_resume() callback or to carry out any other suitable action depending
+on the bus type).
+
+  * If it completes successfully, the device's run-time PM status is set to
+    RPM_ACTIVE, which means that the device is fully operational.  Thus, the
+    device bus type's ->runtime_resume() callback, when it is about to return
+    success, _must_ _ensure_ that this really is the case (i.e. when it returns
+    success, the device _must_ be able to carry out I/O operations as needed).
+
+  * If an error code is returned, the device's run-time PM status is set to
+    RPM_ERROR, which makes the PM core refuse to carry out any run-time PM
+    operations for the device until the status is cleared by its bus type or
+    driver with the help of either pm_runtime_clear_active(), or
+    pm_runtime_clear_suspended().  Thus, it is strongly recommended that the
+    device bus type's ->runtime_resume() callback only return error codes in
+    fatal error conditions, when it is impossible to bring the device back to
+    the operational state by any available means.  Inability to wake up a
+    suspended device usually means a service loss and it may very well result in
+    a data loss to the user, so it _must_ be avoided if at all possible.
+
+Finally, pm_runtime_resume() and pm_runtime_resume_get() return the error code
+(or success) returned by the device bus type's ->runtime_resume() callback
+(pm_runtime_resume_get() increments the device's resume counter if success is
+returned).  If the device's bus type doesn't implement ->runtime_resume(),
+-EINVAL is returned and the device's run-time PM status is set to RPM_ERROR.
+
+pm_request_resume() is used to queue up a resume request for a device that is
+suspended, suspending or has a suspend request pending.
+
+  * If the device has a resume request pending (i.e. the RPM_WAKE bit is set in
+    its run-time PM status field) or the device is not suspended or suspending
+    (i.e. none of the RPM_SUSPENDED and RPM_SUSPENDING bits is set in its
+    run-time PM status field), the function returns.
+
+  * If the device has a parent and the parent is inactive (i.e. at least one of
+    the RPM_IDLE, RPM_SUSPENDING, and RPM_SUSPENDED bits is set in its run-time
+    PM status field), and the parent doesn't have a resume request pending
+    (i.e. the RPM_WAKE bit is not set in the parent's run-time PM status field),
+    a resume request is scheduled for the parent with the help of
+    pm_request_resume() (i.e. recursively) and the function is restarted.
+
+If none of the above happens, the function checks if the device's run-time PM
+status is RPM_SUSPENDED, which means that the device is not suspending at the
+moment, and if it has a parent.  If that is the case, the parent's counter of
+unsuspended children is increased.  Next, the RPM_WAKE bit is set in the
+device's run-time PM status field and the request to execute the asynchronous
+version of pm_runtime_resume() is put into pm_wq.  It is valid to call this
+function from interrupt context.
+
+Note that it is possible to have a resume request and a suspend request queued
+up at the same time.  In that case, if the suspend request is handled first,
+the asynchronous version of pm_runtime_suspend() run as a result
+of it will notice that the RPM_WAKE bit is set in the device's run-time PM
+status field and will return -EAGAIN without doing anything else.
+Then, the subsequent resume carried out as a result of the queued up request
+will notice that the RPM_IDLE bit is set in the device's run-time PM status
+field, so it will try to cancel the suspend request and the run-time PM status
+of the device will be set to RPM_ACTIVE.  On the other hand, if the resume
+request is handled first, which is more likely, it will cancel the pending
+suspend request and the run-time PM status of the device will be set to
+RPM_ACTIVE.
+
+pm_runtime_put() is used to decrease the device's resume counter by 1.  While
+the resume counter of the device is greater than 0, the PM core refuses to
+suspend the device or to queue up a suspend request for it.  In particular,
+pm_runtime_suspend() returns -EAGAIN without doing anything else.
+This may be useful if the device is resumed for a specific task and it shouldn't
+be suspended until the task is complete, but there are many potential sources of
+suspend requests that could disturb it.  It is valid to call this function from
+interrupt context.
+
+pm_suspend_possible() is used to check if the device may be suspended at this
+particular moment.  It checks the device's run-time PM status, resume counter,
+and the counter of unsuspended children.  It returns 'false' if the device's
+counter of unsuspended children is greater than 0 or the device's resume counter
+is greater than 0, or at least one of the RPM_WAKE and RPM_RESUMING bits is set
+in its run-time PM status field.  It is valid to call this function from
+interrupt context.
+
+pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
+respectively, all of the run-time PM core operations.  They do it by decreasing
+and increasing, respectively, the device's resume counter, but
+pm_runtime_disable() additionally calls pm_runtime_resume() for the device to
+make sure that the device will not be suspended while its run-time power
+management is disabled.  Therefore, if pm_runtime_disable() is called several
+times in a row for the same device, it has to be balanced by the appropriate
+number of pm_runtime_enable() calls so that the other run-time PM core functions
+can be used for that device.  The initial value of the device's resume counter,
+as set by pm_runtime_init(), is 1 (i.e. the device's run-time power management
+is initially disabled).
+
+pm_runtime_disable() and pm_runtime_enable() are used by the device core to
+disable the run-time power management of devices temporarily during device probe
+and removal as well as during system-wide power transitions (i.e. system-wide
+suspend or hibernation, or resume from a system sleep state).
+
+pm_suspend_ignore_children() is used to set or unset the
+'power.ignore_children' flag in 'struct device'.  If the 'enable'
+argument is 'true', the field is set to 1, and if 'enable' is 'false', the field
+is set to 0.  The default value of 'power.ignore_children', as set by
+pm_runtime_init(), is 0.
+
+pm_runtime_clear_active() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_ACTIVE.  It is valid to call this function from
+interrupt context.
+
+pm_runtime_clear_suspended() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_SUSPENDED.  If the device has a parent, the
+function additionally decrements the parent's counter of unsuspended children.
+It is valid to call this function from interrupt context.
+
+3. Device Run-time PM Callbacks
+
+There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend() callback is executed by pm_runtime_suspend() for the bus
+type of the device being suspended.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_suspend() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the bus type's ->runtime_suspend() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_suspend() callback has returned successfully,
+  the PM core regards the device as suspended, which need not mean that the
+  device has been put into a low power state.  It is supposed to mean, however,
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_ACTIVE, which means that the device
+  _must_ be fully operational once this has happened.
+* If the bus type's ->runtime_suspend() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 1 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+In particular, it is recommended that ->runtime_suspend() return -EBUSY or
+-EAGAIN if device_may_wakeup() returns 'false' for the device.  On the other
+hand, if device_may_wakeup() returns 'true' for the device and the device is put
+into a low power state during the execution of ->runtime_suspend(), it is
+expected that remote wake-up (i.e. hardware mechanism allowing the device to
+request a change of its power state, such as PCI PME) will be enabled for the
+device.  Generally, remote wake-up should be enabled whenever the device is put
+into a low power state at run time and is expected to receive input from the
+outside of the system.
+
+The ->runtime_resume() callback is executed by pm_runtime_resume() for the bus
+type of the device being woken up.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_resume() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the bus type's ->runtime_resume() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_resume() callback has returned successfully,
+  the PM core regards the device as fully operational, which means that the
+  device _must_ be able to complete I/O operations as needed.
+* If the bus type's ->runtime_resume() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_SUSPENDED, which is supposed to mean
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_resume() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 1 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_idle() callback is executed by pm_runtime_suspend() for the bus
+type of a device the children of which are all suspended (or which has the
+'power.ignore_children' flag set).  The action carried out by this
+callback is totally dependent on the bus type in question, but the expected
+action is to check if the device can be suspended (i.e. if all of the conditions
+necessary for suspending the device are met) and to queue up a suspend request
+for the device if that is the case.
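
The check that ->runtime_idle() is expected to perform can be illustrated
with a small userspace model of the pm_suspend_possible() rule described in
the documentation above.  This is only a sketch under stated assumptions:
'struct model_dev', model_suspend_possible() and model_bus_runtime_idle() are
hypothetical names that are not part of the patch, and only the RPM_* values
are taken from the include/linux/pm.h hunk posted in this thread.

```c
#include <stdbool.h>

/* Status bits as defined in the posted include/linux/pm.h hunk. */
#define RPM_ACTIVE	0
#define RPM_IDLE	0x01
#define RPM_SUSPENDING	0x02
#define RPM_SUSPENDED	0x04
#define RPM_WAKE	0x08
#define RPM_RESUMING	0x10

/* Hypothetical userspace stand-in for a device's run-time PM fields. */
struct model_dev {
	unsigned int runtime_status;
	int resume_count;	/* "resume counter" in the documentation */
	int child_count;	/* counter of unsuspended children */
	bool suspend_requested;	/* set where a suspend request would be queued */
};

/*
 * Model of the documented pm_suspend_possible() rule: suspend is not
 * possible if there are unsuspended children, if the resume counter is
 * greater than 0, or if RPM_WAKE or RPM_RESUMING is set in the status.
 */
static bool model_suspend_possible(const struct model_dev *dev)
{
	if (dev->child_count > 0 || dev->resume_count > 0)
		return false;

	return !(dev->runtime_status & (RPM_WAKE | RPM_RESUMING));
}

/*
 * Model of a bus type's ->runtime_idle(): check the conditions and record
 * that a suspend request would be queued.  The real callback returns void
 * and would call pm_request_suspend(); the bool is only for illustration.
 */
static bool model_bus_runtime_idle(struct model_dev *dev)
{
	if (!model_suspend_possible(dev))
		return false;

	dev->suspend_requested = true;
	return true;
}
```

With runtime_status set to RPM_ACTIVE and both counters at 0,
model_bus_runtime_idle() records a suspend request; raising resume_count to 1
or setting RPM_WAKE makes it return without requesting anything.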

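The return-value rules given for ->runtime_suspend() and ->runtime_resume()
in Section 3 of the documentation above can be summarized as a mapping from
the callback's return value to the resulting run-time PM status.  The sketch
below is a userspace model, not the kernel code: the two function names are
hypothetical, and the model ignores the case in which RPM_WAKE stays set
after a successful suspend because a resume request was queued in the
meantime.

```c
#include <errno.h>

/* Status values as defined in the posted include/linux/pm.h hunk. */
#define RPM_ACTIVE	0
#define RPM_SUSPENDED	0x04
#define RPM_ERROR	0x1F

/* Hypothetical helper: status after the bus type's ->runtime_suspend(). */
static unsigned int model_status_after_suspend(int error)
{
	switch (error) {
	case 0:
		return RPM_SUSPENDED;	/* device regarded as suspended */
	case -EAGAIN:
	case -EBUSY:
		return RPM_ACTIVE;	/* device must remain fully operational */
	default:
		return RPM_ERROR;	/* bus type or driver must clear this */
	}
}

/* Hypothetical helper: status after the bus type's ->runtime_resume(). */
static unsigned int model_status_after_resume(int error)
{
	switch (error) {
	case 0:
		return RPM_ACTIVE;	/* device must be able to do I/O */
	case -EAGAIN:
	case -EBUSY:
		return RPM_SUSPENDED;	/* device stays suspended */
	default:
		return RPM_ERROR;	/* bus type or driver must clear this */
	}
}
```

Any other error code leaves the device in RPM_ERROR until the bus type or
driver calls pm_runtime_clear_active() or pm_runtime_clear_suspended().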

* Re: [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-18 18:17                     ` Alan Stern
  (?)
@ 2009-06-19  0:38                     ` Rafael J. Wysocki
  -1 siblings, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-19  0:38 UTC (permalink / raw)
  To: Alan Stern; +Cc: Greg KH, LKML, ACPI Devel Maling List, linux-pm, Ingo Molnar

On Thursday 18 June 2009, Alan Stern wrote:
> On Thu, 18 Jun 2009, Rafael J. Wysocki wrote:
> 
> > Not only then.  The dev->power.depth counter was meant to be a "disable
> > everything" one, because there are situations in which we don't want even
> > resume to run (probe, release, system-wide suspend, hibernation, resume from
> > a system sleep state, possibly others).
> > 
> > That said, I overlooked some problems related to it.  So, I think to disable
> > the runtime PM of a given device, it will be necessary to run a synchronous
> > runtime resume while taking a ref to block suspend.
> 
> There should also be an async version, which increases depth while
> submitting a resume request.
> 
> In fact, maybe it would be best if pm_request_resume always increments
> depth (unless it fails for some other reason) and __pm_runtime_resume
> increments depth whenever called synchronously.  And likewise for the
> suspend paths.

And how exactly are we going to check if pm_request_resume() was successful?

We'd have to be able to do that in a code path different from the one that has
called pm_request_resume().

> > > Instead of a costly device_for_each_child(), would it be better to
> > > maintain a counter with the number of unsuspended children?
> > 
> > Hmm.  How exactly are we going to count them?  The only way I see at the moment
> > would be to increase this number by one when running pm_runtime_init() for a
> > new child.  Seems doable.
> 
> That's right.  You also have to decrement the number when an
> unsuspended child device is removed, obviously.

I forgot about that, so it is not done in the patch below.

BTW, is it just me, or are we overcomplicating that thing beyond any
reasonable limit?

I think I'll just do the device_for_each_child() for now, because IMO this
optimization just isn't worth the complications resulting from it.
Realistically, how many children is a parent going to have in a normal system?

> The one thing to watch out for is what happens if a device is removed while
> its runtime_resume callback is running.  :-)

I don't think it's possible.

> > > > +	spin_lock(&dev->power.lock);
> > > 
> > > Should be spin_lock_irq().  Same in other places.
> > 
> > OK, I wasn't sure about that.
> 
> The reasoning isn't complicated.  If a spinlock can be taken by an
> interrupt handler (or any other code that might run in interrupt
> context) then you have the possibility of a deadlock as follows:
> 
> 	spin_lock(&lock);
> 	<Interrupt occurs>
> 		irq_handler() {
> 			spin_lock(&lock);
> 
> The handler can't acquire the lock because it is already in use, and
> it can't be released until the handler returns.
> 
> As a result, if a spinlock is ever taken within an interrupt handler
> then it always has to be acquired with interrupts disabled.
> Similarly, if it is never taken within an interrupt handler but it is
> taken within a bottom-half routine, then it always has to be acquired
> with bottom halves disabled.
> 
> > From the functionality point of view, nothing wrong happens if runtime suspend
> > fails as long as an error code is returned and the caller has to be prepared
> > for a failure anyway.  Moreover, we never know why the resume is carried out,
> > so it's not clear whether it will be valid to carry out the suspend after that.
> 
> Your first point certainly is correct.  As for the second point, if
> whoever did the resume doesn't want the device suspended again, he
> should have incremented depth.  So making the suspend wait until the
> resume is finished and then failing because the depth is positive
> would be a valid approach.
> 
> However there's no use worrying about this until we have some real
> examples.
> 
> > > > +		spin_unlock(&dev->parent->power.lock);
> > > > +
> > > > +		/* The device's parent is not active.  Resume it and repeat. */
> > > > +		error = __pm_runtime_resume(dev->parent, false);
> > > > +		if (error)
> > > > +			return error;
> > > 
> > > Need to reset error to -EINVAL.
> > 
> > Why -EINVAL?
> 
> We have lost the context because of email trimming.  Briefly, when you
> jump back to "repeat:", the code there expects error to have been
> initialized to -EINVAL.  Some of the pathways will return error
> unchanged, expecting it to have that value.
> 
> Alternatively, you could have those pathways set error and then you
> wouldn't have to initialize it.  Either way.

Ah, OK

> > > The equivalent code in USB does this automatically.  The
> > > runtime-disable routine does a resume if the depth value was
> > > originally 0,
> > 
> > Yes, we should do that in general.
> > 
> > > and the runtime-enable routine queues a delayed autosuspend request if the
> > > final depth value is 0.
> > 
> > I don't like this.
> 
> I guess this is a question of how you view things.  My view has been that
> whenever depth (or pm_usage_cnt in the USB code) is 0, it means neither
> the driver nor anyone else has any reason to keep the device at full
> power.  By definition, since that's what depth is -- a count of the
> reasons for not suspending.
> 
> There might be some obscure other reason, but in general depth going
> to 0 means a delayed autosuspend request should be queued.

OK there, but pm_runtime_disable() is called by the core in some places where
we'd rather not have the device suspended (like during system-wide power
transitions).

> Which reminds me... Something to think about: In an async call to
> __pm_runtime_suspend, if the runtime_suspend callback returns -EBUSY
> then perhaps your code should automatically requeue a new delayed
> autosuspend request.  Which implies, of course, that the autosuspend
> delay has to be stored in the dev_pm_info structure.  This isn't a bad
> thing, since exposing the value in sysfs gives userspace a consistent
> way to set the delay.

I think that functionality can be added later.  Let's keep things as simple
as possible initially, or we won't be able to make any progress.

Below is a new version of the patch.  Unfortunately, it is a major rework.
In short, I tried to address some of your recent comments and my observations.
It doesn't use depth any more, there's another counter (called resume_count)
instead, also playing the role of the RPM_GRACE bit from the previous version.

I've just finished it, so it may still be missing something apart from
updating child_count on removal of an unsuspended child.

Best,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM: Introduce core framework for run-time PM of I/O devices (rev. 2)

Introduce a core framework for run-time power management of I/O
devices.  Add device run-time PM fields to 'struct dev_pm_info'
and device run-time PM callbacks to 'struct dev_pm_ops'.  Introduce
a run-time PM workqueue and define some device run-time PM helper
functions at the core level.  Document all these things.

Not-yet-signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/runtime_pm.txt |  378 ++++++++++++++++++++++++++
 drivers/base/dd.c                  |    9 
 drivers/base/power/Makefile        |    1 
 drivers/base/power/main.c          |    5 
 drivers/base/power/runtime.c       |  533 +++++++++++++++++++++++++++++++++++++
 include/linux/pm.h                 |   95 ++++++
 include/linux/pm_runtime.h         |  115 +++++++
 kernel/power/Kconfig               |   14 
 kernel/power/main.c                |   17 +
 9 files changed, 1164 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/power/Kconfig
===================================================================
--- linux-2.6.orig/kernel/power/Kconfig
+++ linux-2.6/kernel/power/Kconfig
@@ -208,3 +208,17 @@ config APM_EMULATION
 	  random kernel OOPSes or reboots that don't seem to be related to
 	  anything, try disabling/enabling this option (or disabling/enabling
 	  APM in your BIOS).
+
+config PM_RUNTIME
+	bool "Run-time PM core functionality"
+	depends on PM
+	---help---
+	  Enable functionality allowing I/O devices to be put into energy-saving
+	  (low power) states at run time (or autosuspended) after a specified
+	  period of inactivity and woken up in response to a hardware-generated
+	  wake-up event or a driver's request.
+
+	  Hardware support is generally required for this functionality to work
+	  and the bus type drivers of the buses the devices are on are
+	  responsible for the actual handling of the autosuspend requests and
+	  wake-up events.
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -11,6 +11,7 @@
 #include <linux/kobject.h>
 #include <linux/string.h>
 #include <linux/resume-trace.h>
+#include <linux/workqueue.h>
 
 #include "power.h"
 
@@ -217,8 +218,24 @@ static struct attribute_group attr_group
 	.attrs = g,
 };
 
+#ifdef CONFIG_PM_RUNTIME
+struct workqueue_struct *pm_wq;
+
+static int __init pm_start_workqueue(void)
+{
+	pm_wq = create_freezeable_workqueue("pm");
+
+	return pm_wq ? 0 : -ENOMEM;
+}
+#else
+static inline int pm_start_workqueue(void) { return 0; }
+#endif
+
 static int __init pm_init(void)
 {
+	int error = pm_start_workqueue();
+	if (error)
+		return error;
 	power_kobj = kobject_create_and_add("power", NULL);
 	if (!power_kobj)
 		return -ENOMEM;
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -22,6 +22,9 @@
 #define _LINUX_PM_H
 
 #include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -165,6 +168,28 @@ typedef struct pm_message {
  * It is allowed to unregister devices while the above callbacks are being
  * executed.  However, it is not allowed to unregister a device from within any
  * of its own callbacks.
+ *
+ * There also are the following callbacks related to run-time power management
+ * of devices:
+ *
+ * @runtime_suspend: Prepare the device for a condition in which it won't be
+ *	able to communicate with the CPU(s) and RAM due to power management.
+ *	This need not mean that the device should be put into a low power state.
+ *	For example, if the device is behind a link which is about to be turned
+ *	off, the device may remain at full power.  Still, if the device does go
+ *	to low power and if device_may_wakeup(dev) is true, remote wake-up
+ *	(i.e. hardware mechanism allowing the device to request a change of its
+ *	power state, such as PCI PME) should be enabled for it.
+ *
+ * @runtime_resume: Put the device into the fully active state in response to a
+ *	wake-up event generated by hardware or at a request of software.  If
+ *	necessary, put the device into the full power state and restore its
+ *	registers, so that it is fully operational.
+ *
+ * @runtime_idle: Device appears to be inactive and it might be put into a low
+ *	power state if all of the necessary conditions are satisfied.  Check
+ *	these conditions and handle the device as appropriate, possibly queueing
+ *	a suspend request for it.
  */
 
 struct dev_pm_ops {
@@ -182,6 +207,9 @@ struct dev_pm_ops {
 	int (*thaw_noirq)(struct device *dev);
 	int (*poweroff_noirq)(struct device *dev);
 	int (*restore_noirq)(struct device *dev);
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
 };
 
 /**
@@ -315,14 +343,75 @@ enum dpm_state {
 	DPM_OFF_IRQ,
 };
 
+/**
+ * Device run-time power management state.
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.  They do not
+ * reflect the actual power state of the device or its status as seen by the
+ * driver.
+ *
+ * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
+ *			pending for it.
+ *
+ * RPM_IDLE		It has been requested that the device be suspended.
+ *			Suspend request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
+ *			executed.
+ *
+ * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
+ *			completed successfully.  The device is regarded as
+ *			suspended.
+ *
+ * RPM_WAKE		It has been requested that the device be woken up.
+ *			Resume request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
+ *			executed.
+ *
+ * RPM_ERROR		Represents a condition from which the PM core cannot
+ *			recover by itself.  If the device's run-time PM status
+ *			field has this value, all of the run-time PM operations
+ *			carried out for the device by the core will fail, until
+ *			the status field is changed to either RPM_ACTIVE or
+ *			RPM_SUSPENDED (it is not valid to use the other values
+ *			in such a situation) by the device's driver or bus type.
+ *			This happens when the device bus type's
+ *			->runtime_suspend() or ->runtime_resume() callback
+ *			returns an error code different from -EAGAIN or -EBUSY.
+ */
+
+#define RPM_ACTIVE	0
+#define RPM_IDLE	0x01
+#define RPM_SUSPENDING	0x02
+#define RPM_SUSPENDED	0x04
+#define RPM_WAKE	0x08
+#define RPM_RESUMING	0x10
+#define RPM_ERROR	0x1F
+
 struct dev_pm_info {
 	pm_message_t		power_state;
-	unsigned		can_wakeup:1;
-	unsigned		should_wakeup:1;
+	unsigned int		can_wakeup:1;
+	unsigned int		should_wakeup:1;
 	enum dpm_state		status;		/* Owned by the PM core */
-#ifdef	CONFIG_PM_SLEEP
+#ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
 #endif
+#ifdef CONFIG_PM_RUNTIME
+	struct delayed_work	suspend_work;
+	struct work_struct	resume_work;
+	struct completion	work_done;
+	unsigned int		ignore_children:1;
+	unsigned int		suspend_aborted:1;
+	unsigned int		runtime_status:5;
+	int			runtime_error;
+	int			resume_count;
+	int			child_count;
+	spinlock_t		lock;
+#endif
 };
 
 /*
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o
+obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
Index: linux-2.6/drivers/base/power/runtime.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/runtime.c
@@ -0,0 +1,533 @@
+/*
+ * drivers/base/power/runtime.c - Helper functions for device run-time PM
+ *
+ * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/pm_runtime.h>
+#include <linux/jiffies.h>
+
+/**
+ * __pm_get_child - Increment the counter of unsuspended children of a device.
+ * @dev: Device to handle.
+ */
+static void __pm_get_child(struct device *dev)
+{
+	dev->power.child_count++;
+}
+
+/**
+ * __pm_put_child - Decrement the counter of unsuspended children of a device.
+ * @dev: Device to handle.
+ */
+static void __pm_put_child(struct device *dev)
+{
+	if (dev->power.child_count > 0)
+		dev->power.child_count--;
+	else
+		dev_warn(dev, "Excessive %s!\n", __FUNCTION__);
+}
+
+/**
+ * pm_runtime_notify_idle - Run a device bus type's runtime_idle() callback.
+ * @dev: Device to notify.
+ *
+ * Check if all children of given device are suspended and call the device bus
+ * type's ->runtime_idle() callback if that's the case.
+ */
+static void pm_runtime_notify_idle(struct device *dev)
+{
+	if (!pm_children_suspended(dev))
+		return;
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_idle)
+		dev->bus->pm->runtime_idle(dev);
+}
+
+/**
+ * __pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
+ * @dev: Device to suspend.
+ * @sync: If unset, the function has been called via pm_wq.
+ *
+ * Check if the status of the device is appropriate and run the
+ * ->runtime_suspend() callback provided by the device's bus type driver.
+ * Update the run-time PM flags in the device object to reflect the current
+ * status of the device.
+ */
+int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	struct device *parent = NULL;
+	unsigned long parflags = 0, flags;
+	int error = -EINVAL;
+
+	might_sleep();
+
+ repeat:
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out;
+	} else if (dev->power.runtime_status & RPM_SUSPENDED) {
+		error = 0;
+		goto out;
+	} else if ((dev->power.runtime_status & (RPM_WAKE | RPM_RESUMING))
+	    || dev->power.resume_count > 0
+	    || (!sync && dev->power.suspend_aborted)) {
+		/*
+		 * Device is resuming, there's a resume request pending for it,
+		 * the device's resume counter is greater than 0, or a pending
+		 * suspend request has just been cancelled and we're running as
+		 * a result of that request.
+		 */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
+		/*
+		 * Another suspend is running in parallel with us.  Wait for it
+		 * to complete and return.
+		 */
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	} else if (sync && dev->power.runtime_status == RPM_IDLE) {
+		/*
+		 * Suspend request is pending, but we're not running as a result
+		 * of it, so cancel it and repeat.
+		 */
+		dev->power.suspend_aborted = true;
+		dev->power.runtime_status = RPM_ACTIVE;
+
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+
+		cancel_delayed_work_sync(&dev->power.suspend_work);
+
+		goto repeat;
+	}
+
+	if (!pm_children_suspended(dev)) {
+		/*
+		 * We can only suspend the device if all of its children have
+		 * been suspended.
+		 */
+		dev->power.runtime_status = RPM_ACTIVE;
+		error = -EBUSY;
+		goto out;
+	}
+
+	dev->power.runtime_status = RPM_SUSPENDING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_suspend)
+		error = dev->bus->pm->runtime_suspend(dev);
+	parent = dev->parent;
+
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	switch (error) {
+	case 0:
+		/*
+		 * Resume request might have been queued in the meantime, in
+		 * which case the RPM_WAKE bit is also set in runtime_status.
+		 */
+		dev->power.runtime_status &= ~RPM_SUSPENDING;
+		dev->power.runtime_status |= RPM_SUSPENDED;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+	if (!error && !(dev->power.runtime_status & RPM_WAKE) && parent) {
+		__pm_put_child(parent);
+
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		pm_runtime_notify_idle(parent);
+
+		return 0;
+	}
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent)
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_suspend);
+
+/**
+ * pm_runtime_suspend_work - Run pm_runtime_suspend() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the suspend has been scheduled for and
+ * run pm_runtime_suspend() for it.
+ */
+static void pm_runtime_suspend_work(struct work_struct *work)
+{
+	__pm_runtime_suspend(suspend_work_to_device(work), false);
+}
+
+/**
+ * pm_request_suspend - Schedule run-time suspend of given device.
+ * @dev: Device to suspend.
+ * @msec: Time to wait before attempting to suspend the device, in milliseconds.
+ */
+void pm_request_suspend(struct device *dev, unsigned int msec)
+{
+	unsigned long flags;
+	unsigned long delay = msecs_to_jiffies(msec);
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ACTIVE)
+		goto out;
+
+	dev->power.runtime_status = RPM_IDLE;
+	dev->power.suspend_aborted = false;
+	queue_delayed_work(pm_wq, &dev->power.suspend_work, delay);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_suspend);
+
+/**
+ * __pm_runtime_get - Increment the resume counter of given device.
+ * @dev: Device to handle.
+ */
+static void __pm_runtime_get(struct device *dev)
+{
+	dev->power.resume_count++;
+}
+
+/**
+ * __pm_runtime_put - Decrement the resume counter of given device.
+ * @dev: Device to handle.
+ */
+static void __pm_runtime_put(struct device *dev)
+{
+	if (dev->power.resume_count > 0)
+		dev->power.resume_count--;
+	else
+		dev_warn(dev, "Excessive %s!\n", __func__);
+}
+
+/**
+ * __pm_runtime_resume - Run a device bus type's runtime_resume() callback.
+ * @dev: Device to resume.
+ * @get: If set, increment the device's resume counter.
+ * @sync: If unset, the function has been called via pm_wq.
+ *
+ * Check if the device is really suspended and run the ->runtime_resume()
+ * callback provided by the device's bus type driver.  Update the run-time PM
+ * flags in the device object to reflect the current status of the device.  If
+ * runtime suspend is in progress while this function is being run, wait for it
+ * to finish before resuming the device.  If runtime suspend is scheduled, but
+ * it hasn't started yet, cancel it and we're done.
+ */
+int __pm_runtime_resume(struct device *dev, bool get, bool sync)
+{
+	struct device *parent = dev->parent;
+	unsigned long parflags = 0, flags;
+	bool put_parent = false;
+	int error = -EINVAL;
+
+	might_sleep();
+
+ repeat:
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out;
+	} else if (dev->power.runtime_status == RPM_ACTIVE) {
+		error = 0;
+		goto out;
+	}
+
+	if (dev->power.runtime_status & RPM_IDLE) {
+		/* Only a suspend request is pending, cancel it and repeat. */
+		dev->power.suspend_aborted = true;
+		dev->power.runtime_status &= ~RPM_IDLE;
+
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		if (parent)
+			spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		cancel_delayed_work_sync(&dev->power.suspend_work);
+
+		goto repeat;
+	} else if (sync && (dev->power.runtime_status & RPM_WAKE)) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		if (parent)
+			spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		/*
+		 * Resume request is pending, but we're not running as a result
+		 * of it, so it has to run before we continue in case it's
+		 * going to increment the device's resume counter.
+		 */
+		flush_work(&dev->power.resume_work);
+
+		goto repeat;
+	} else if (dev->power.runtime_status & RPM_SUSPENDING) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		if (parent)
+			spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		/*
+		 * A suspend is running in parallel with us.  Wait for it to
+		 * complete and repeat.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		goto repeat;
+	} else if (dev->power.runtime_status == RPM_SUSPENDED && parent
+	    && parent->power.runtime_status != RPM_ACTIVE) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		/* The parent has to be resumed before we continue. */
+		error = pm_runtime_resume_get(parent);
+		if (error)
+			return error;
+
+		put_parent = true;
+		error = -EINVAL;
+		goto repeat;
+	}
+
+	if (dev->power.runtime_status == RPM_RESUMING) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		if (parent) {
+			if (put_parent)
+				__pm_runtime_put(parent);
+			spin_unlock_irqrestore(&parent->power.lock, parflags);
+			parent = NULL;
+		}
+
+		/*
+		 * There's another resume running in parallel with us. Wait for
+		 * it to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		spin_lock_irqsave(&dev->power.lock, flags);
+
+		error = dev->power.runtime_error;
+		goto out;
+	}
+
+	if (dev->power.runtime_status == RPM_SUSPENDED && parent)
+		__pm_get_child(parent);
+
+	dev->power.runtime_status = RPM_RESUMING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent) {
+		if (put_parent)
+			__pm_runtime_put(parent);
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+		parent = NULL;
+	}
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_resume)
+		error = dev->bus->pm->runtime_resume(dev);
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	dev->power.runtime_status = error ? RPM_ERROR : RPM_ACTIVE;
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+ out:
+	if (!error && get)
+		__pm_runtime_get(dev);
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent) {
+		if (put_parent)
+			__pm_runtime_put(parent);
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+	}
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_resume);
+
+/**
+ * pm_runtime_resume_work - Run __pm_runtime_resume() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the resume has been scheduled for and run
+ * __pm_runtime_resume() for it.
+ */
+static void pm_runtime_resume_work(struct work_struct *work)
+{
+	__pm_runtime_resume(resume_work_to_device(work), false, false);
+}
+
+/**
+ * pm_request_resume - Schedule run-time resume of given device.
+ * @dev: Device to resume.
+ */
+void pm_request_resume(struct device *dev)
+{
+	struct device *parent = dev->parent;
+	unsigned long parflags = 0, flags;
+
+ repeat:
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if ((dev->power.runtime_status & RPM_WAKE)
+	    || !(dev->power.runtime_status &
+			(RPM_SUSPENDING | RPM_SUSPENDED))) {
+		goto out;
+	} else if (parent && !(parent->power.runtime_status & RPM_WAKE)
+	    && (parent->power.runtime_status &
+			(RPM_IDLE | RPM_SUSPENDING | RPM_SUSPENDED))) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		pm_request_resume(parent);
+
+		goto repeat;
+	}
+
+	if (dev->power.runtime_status == RPM_SUSPENDED && parent)
+		__pm_get_child(parent);
+
+	/*
+	 * The device may be suspending at the moment or a suspend request may
+	 * be pending for it and we can't clear the RPM_IDLE or RPM_SUSPENDING
+	 * bit in its runtime_status just yet.
+	 */
+	dev->power.runtime_status |= RPM_WAKE;
+	queue_work(pm_wq, &dev->power.resume_work);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent)
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+}
+EXPORT_SYMBOL_GPL(pm_request_resume);
+
+/**
+ * pm_runtime_put - Decrement the resume counter of a device under 'power.lock'.
+ * @dev: Device to handle.
+ */
+void pm_runtime_put(struct device *dev)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	__pm_runtime_put(dev);
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_put);
+
+/**
+ * pm_runtime_disable - Disable run-time suspend and resume of a device.
+ * @dev: Device to handle.
+ *
+ * Increase the resume counter of given device, so that it cannot be suspended
+ * at run time, and run pm_runtime_resume() for it to put it into the RPM_ACTIVE
+ * state, which also blocks run-time resume of it.
+ */
+void pm_runtime_disable(struct device *dev)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	__pm_runtime_get(dev);
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+
+	pm_runtime_resume(dev);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_disable);
+
+/**
+ * __pm_runtime_clear_status - Change the run-time PM status of a device.
+ * @dev: Device to handle.
+ * @status: New value of the device's run-time PM status.
+ *
+ * Change the run-time PM status of the device to @status, which must be
+ * either RPM_ACTIVE or RPM_SUSPENDED, if its current value is equal to
+ * RPM_ERROR.
+ */
+void __pm_runtime_clear_status(struct device *dev, unsigned int status)
+{
+	struct device *parent = dev->parent;
+	unsigned long parflags = 0, flags;
+
+	if (status & ~RPM_SUSPENDED)
+		return;
+
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ERROR)
+		goto out;
+
+	dev->power.runtime_status = status;
+	if (parent && status == RPM_SUSPENDED)
+		__pm_put_child(parent);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent)
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_clear_status);
+
+/**
+ * pm_runtime_init - Initialize run-time PM fields in given device object.
+ * @dev: Device object to handle.
+ */
+void pm_runtime_init(struct device *dev)
+{
+	struct device *parent = dev->parent;
+
+	spin_lock_init(&dev->power.lock);
+
+	dev->power.runtime_status = RPM_ACTIVE;
+	dev->power.resume_count = 1;
+	pm_suspend_ignore_children(dev, false);
+	dev->power.child_count = 0;
+	INIT_DELAYED_WORK(&dev->power.suspend_work, pm_runtime_suspend_work);
+	INIT_WORK(&dev->power.resume_work, pm_runtime_resume_work);
+
+	if (parent) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&parent->power.lock, flags);
+
+		__pm_get_child(parent);
+
+		spin_unlock_irqrestore(&parent->power.lock, flags);
+	}
+}
Index: linux-2.6/include/linux/pm_runtime.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pm_runtime.h
@@ -0,0 +1,115 @@
+/*
+ * pm_runtime.h - Device run-time power management helper functions.
+ *
+ * Copyright (C) 2009 Rafael J. Wysocki <rjw@sisk.pl>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_RUNTIME_H
+#define _LINUX_PM_RUNTIME_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+#ifdef CONFIG_PM_RUNTIME
+
+extern struct workqueue_struct *pm_wq;
+
+extern void pm_runtime_init(struct device *dev);
+extern int __pm_runtime_suspend(struct device *dev, bool sync);
+extern void pm_request_suspend(struct device *dev, unsigned int msec);
+extern int __pm_runtime_resume(struct device *dev, bool get, bool sync);
+extern void pm_request_resume(struct device *dev);
+extern void pm_runtime_put(struct device *dev);
+extern void pm_runtime_disable(struct device *dev);
+extern void __pm_runtime_clear_status(struct device *dev, unsigned int status);
+
+static inline struct device *suspend_work_to_device(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(dw, struct dev_pm_info, suspend_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline struct device *resume_work_to_device(struct work_struct *work)
+{
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(work, struct dev_pm_info, resume_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline bool pm_children_suspended(struct device *dev)
+{
+	return dev->power.ignore_children || !dev->power.child_count;
+}
+
+static inline bool pm_suspend_possible(struct device *dev)
+{
+	return pm_children_suspended(dev) && !(dev->power.resume_count > 0
+		|| (dev->power.runtime_status & (RPM_WAKE | RPM_RESUMING)));
+}
+
+static inline void pm_suspend_ignore_children(struct device *dev, bool enable)
+{
+	dev->power.ignore_children = enable;
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+static inline void pm_runtime_init(struct device *dev) {}
+static inline int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	return -ENOSYS;
+}
+static inline void pm_request_suspend(struct device *dev, unsigned int msec) {}
+static inline int __pm_runtime_resume(struct device *dev, bool get, bool sync)
+{
+	return -ENOSYS;
+}
+static inline void pm_request_resume(struct device *dev) {}
+static inline void pm_runtime_put(struct device *dev) {}
+static inline void pm_runtime_disable(struct device *dev) {}
+static inline void __pm_runtime_clear_status(struct device *dev,
+					      unsigned int status) {}
+
+static inline bool pm_children_suspended(struct device *dev) { return false; }
+static inline bool pm_suspend_possible(struct device *dev) { return false; }
+static inline void pm_suspend_ignore_children(struct device *dev, bool en) {}
+
+#endif /* !CONFIG_PM_RUNTIME */
+
+static inline int pm_runtime_suspend(struct device *dev)
+{
+	return __pm_runtime_suspend(dev, true);
+}
+
+static inline int pm_runtime_resume(struct device *dev)
+{
+	return __pm_runtime_resume(dev, false, true);
+}
+
+static inline int pm_runtime_resume_get(struct device *dev)
+{
+	return __pm_runtime_resume(dev, true, true);
+}
+
+static inline void pm_runtime_clear_active(struct device *dev)
+{
+	__pm_runtime_clear_status(dev, RPM_ACTIVE);
+}
+
+static inline void pm_runtime_clear_suspended(struct device *dev)
+{
+	__pm_runtime_clear_status(dev, RPM_SUSPENDED);
+}
+
+static inline void pm_runtime_enable(struct device *dev)
+{
+	pm_runtime_put(dev);
+}
+
+#endif
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -21,6 +21,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
@@ -88,6 +89,7 @@ void device_pm_add(struct device *dev)
 	}
 
 	list_add_tail(&dev->power.entry, &dpm_list);
+	pm_runtime_init(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -507,6 +509,7 @@ static void dpm_complete(pm_message_t st
 		get_device(dev);
 		if (dev->power.status > DPM_ON) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			mutex_unlock(&dpm_list_mtx);
 
 			device_complete(dev, state);
@@ -753,6 +756,7 @@ static int dpm_prepare(pm_message_t stat
 
 		get_device(dev);
 		dev->power.status = DPM_PREPARING;
+		pm_runtime_disable(dev);
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_prepare(dev, state);
@@ -760,6 +764,7 @@ static int dpm_prepare(pm_message_t stat
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			if (error == -EAGAIN) {
 				put_device(dev);
 				continue;
Index: linux-2.6/drivers/base/dd.c
===================================================================
--- linux-2.6.orig/drivers/base/dd.c
+++ linux-2.6/drivers/base/dd.c
@@ -23,6 +23,7 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
+#include <linux/pm_runtime.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -202,8 +203,12 @@ int driver_probe_device(struct device_dr
 	pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
 		 drv->bus->name, __func__, dev_name(dev), drv->name);
 
+	pm_runtime_disable(dev);
+
 	ret = really_probe(dev, drv);
 
+	pm_runtime_enable(dev);
+
 	return ret;
 }
 
@@ -306,6 +311,8 @@ static void __device_release_driver(stru
 
 	drv = dev->driver;
 	if (drv) {
+		pm_runtime_disable(dev);
+
 		driver_sysfs_remove(dev);
 
 		if (dev->bus)
@@ -320,6 +327,8 @@ static void __device_release_driver(stru
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
+
+		pm_runtime_enable(dev);
 	}
 }
 
Index: linux-2.6/Documentation/power/runtime_pm.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/power/runtime_pm.txt
@@ -0,0 +1,378 @@
+Run-time Power Management Framework for I/O Devices
+
+(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+1. Introduction
+
+Support for run-time power management (run-time PM) of I/O devices is provided
+at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queuing all work items related to run-time PM, because this allows
+  them to be synchronized with system-wide power transitions.  pm_wq is declared
+  in include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of run-time PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing run-time PM operations with one another.
+
+* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out run-time PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The device run-time PM fields of 'struct dev_pm_info', the helper functions
+using them and the run-time PM callbacks present in 'struct dev_pm_ops' are
+described below.
+
+2. Run-time PM Helper Functions and Device Fields
+
+The following helper functions are defined in drivers/base/power/runtime.c
+and include/linux/pm_runtime.h:
+
+* void pm_runtime_init(struct device *dev);
+
+* int pm_runtime_suspend(struct device *dev);
+* void pm_request_suspend(struct device *dev, unsigned int msec);
+* int pm_runtime_resume(struct device *dev);
+* int pm_runtime_resume_get(struct device *dev);
+* void pm_request_resume(struct device *dev);
+* void pm_runtime_put(struct device *dev);
+
+* bool pm_suspend_possible(struct device *dev);
+
+* void pm_runtime_enable(struct device *dev);
+* void pm_runtime_disable(struct device *dev);
+
+* void pm_suspend_ignore_children(struct device *dev, bool enable);
+
+* void pm_runtime_clear_active(struct device *dev);
+* void pm_runtime_clear_suspended(struct device *dev);
+
+pm_runtime_init() initializes the run-time PM fields in the 'power' member of
+a device object.  It is called during the initialization of the device object,
+in drivers/base/power/main.c:device_pm_add().
+
+pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+pm_runtime_resume_get(), pm_request_resume(), and pm_runtime_put()
+use the 'power.runtime_status', 'power.resume_count', 'power.suspend_aborted',
+and 'power.child_count' fields of 'struct device' for mutual cooperation.  In
+what follows the 'power.runtime_status', 'power.resume_count', and
+'power.child_count' fields are referred to as the device's run-time PM status,
+the device's resume counter, and the counter of unsuspended children of the
+device, respectively.  They are set to RPM_ACTIVE, 1 and 0, respectively, by
+pm_runtime_init().
+
+pm_runtime_suspend() is used to carry out a run-time suspend of an active
+device.  It is called directly by a bus type or device driver.  There also is
+an asynchronous version of it, which is executed by the PM core to complete a
+request queued up by pm_request_suspend().  However, the only difference between
+them is the handling of situations in which a queued up suspend request has just
+been cancelled.  Apart from this, they work in the same way.
+
+  * If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the
+    device's run-time PM status field, 'power.runtime_status'), success is
+    returned.
+
+  * If the device is about to resume (i.e. at least one of the RPM_WAKE and
+    RPM_RESUMING bits is set in its run-time PM status field) or its resume
+    counter is greater than 0, or the function has been called via pm_wq as a
+    result of a cancelled suspend request (the 'power.suspend_aborted' field is
+    used to signal the termination of a suspend request), -EAGAIN is returned.
+
+  * If the device is suspending (i.e. its run-time PM status is RPM_SUSPENDING),
+    which means that another instance of pm_runtime_suspend() is running at the
+    same time for the same device, the function waits for the other instance to
+    complete and returns the error code (or success) returned by it.
+
+  * If the device has a pending suspend request (i.e. the RPM_IDLE bit is set in
+    its run-time PM status) and the function hasn't been called as a result of
+    that request, it cancels the request and restarts itself in case another
+    suspend is running in parallel with it.
+
+  * If the children of the device are not suspended and the
+    'power.ignore_children' flag is not set for it, the device's run-time PM
+    status is set to RPM_ACTIVE and -EAGAIN is returned.
+
+If none of the above takes place, the device's run-time PM status is set to
+RPM_SUSPENDING and its bus type's ->runtime_suspend() callback is executed.
+This callback is responsible for handling the device as appropriate (for
+example, it may choose to execute the device driver's ->runtime_suspend()
+callback or to carry out any other suitable action depending on the bus type).
+
+  * If it completes successfully, the RPM_SUSPENDING bit is cleared and the
+    RPM_SUSPENDED bit is set in the device's run-time PM status field.  Once
+    that has happened, the device is regarded by the PM core as suspended, but
+    it _need_ _not_ mean that the device has been put into a low power state.
+    What really occurs to the device at this point totally depends on its bus
+    type (it may depend on the device's driver if the bus type chooses to call
+    it).  Additionally, if the device bus type's ->runtime_suspend() callback
+    completes successfully and there's no resume request pending for the device
+    (i.e. the RPM_WAKE flag is not set in its run-time PM status field), and the
+    device has a parent, the parent's counter of unsuspended children (i.e. the
+    'power.child_count' field) is decremented.  Next, if it turns out to be
+    equal to zero (i.e. all children of the device's parent have been suspended)
+    or the parent has the 'power.ignore_children' flag set, the parent's bus
+    type's ->runtime_idle() callback is executed.
+
+  * If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+    set to RPM_ACTIVE.
+
+  * If another error code is returned, the device's run-time PM status is set to
+    RPM_ERROR, which makes the PM core refuse to carry out any run-time PM
+    operations for it until the status is cleared by its bus type or driver with
+    the help of pm_runtime_clear_active() or pm_runtime_clear_suspended().
+
+Finally, pm_runtime_suspend() returns the error code (or success) returned by
+the device bus type's ->runtime_suspend() callback.  If the device's bus type
+doesn't implement ->runtime_suspend(), -EINVAL is returned and the device's
+run-time PM status is set to RPM_ERROR.
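
The mapping from the ->runtime_suspend() return code to the resulting run-time
PM status can be sketched as a small user-space model.  This is only an
illustration: the helper name status_after_suspend() is hypothetical and the
numeric bit values are assumptions made for the example, not the definitions
used by the patch.

```c
#include <errno.h>

/* Assumed bit values, for illustration only. */
enum {
	RPM_ACTIVE     = 0,		/* no bits set */
	RPM_SUSPENDING = 1 << 1,
	RPM_SUSPENDED  = 1 << 2,
	RPM_WAKE       = 1 << 3,
	RPM_ERROR      = 1 << 5,
};

/*
 * Hypothetical helper: compute the new status from the status held while
 * the callback was running and the code the callback returned.
 */
static unsigned int status_after_suspend(unsigned int status, int error)
{
	switch (error) {
	case 0:
		/* RPM_WAKE survives if a resume request was queued meanwhile. */
		status &= ~RPM_SUSPENDING;
		status |= RPM_SUSPENDED;
		return status;
	case -EAGAIN:
	case -EBUSY:
		/* Transient refusal: the device simply stays active. */
		return RPM_ACTIVE;
	default:
		/* Any other error blocks run-time PM until it is cleared. */
		return RPM_ERROR;
	}
}
```

The model leaves out locking, the completion and the parent's child counter,
which the real function handles around the same switch statement.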
+
+pm_request_suspend() is used to queue up a suspend request for an active device.
+If the run-time PM status of the device (i.e. the value of the
+'power.runtime_status' field in 'struct device') is different from RPM_ACTIVE
+(i.e. the device is not active from the PM core standpoint), the function
+returns immediately.  Otherwise, it changes the device's run-time PM status to
+RPM_IDLE and puts a request to suspend the device into pm_wq.  The 'msec'
+argument is used to specify the time to wait before the request will be
+completed, in milliseconds.  It is valid to call this function from interrupt
+context.
+
+pm_runtime_resume() and pm_runtime_resume_get() are used to carry out a
+run-time resume of a device that is suspended, suspending or has a suspend
+request pending.  They are called directly by a bus type or device driver.
+The difference between them is that pm_runtime_resume_get() increments the
+device's resume counter, which prevents the PM core from suspending the device
+or queuing up a suspend request for it until its resume counter is decreased
+down to 0 with the help of pm_runtime_put().  Apart from this, they work in the
+same way.  There also is an asynchronous version of pm_runtime_resume(), called
+by the PM core as a result of a resume request queued up by pm_request_resume(),
+which doesn't check if there's a concurrent pending resume request for the
+device.
+
+  * If the device is active (i.e. all of the bits in its run-time PM status are
+    unset), success is returned (pm_runtime_resume_get() increments the device's
+    resume counter in that case).
+
+  * If there's a suspend request pending for the device (i.e. the device's
+    run-time PM status is RPM_IDLE), it is cancelled, the
+    'power.suspend_aborted' flag is set for the device, the RPM_IDLE bit is
+    cleared in its run-time PM status field and the function restarts itself.
+
+  * If the device has a pending resume request (i.e. the RPM_WAKE bit is set in
+    its run-time PM status field), but the function hasn't been called as a
+    result of that request, the function waits for that request to complete
+    (in case it's going to increment the device's resume counter) and restarts
+    itself.
+
+  * If the device is suspending (i.e. the RPM_SUSPENDING bit is set in its
+    run-time PM status field), the function waits for the suspend operation to
+    complete and restarts itself.
+
+  * If the device is suspended and doesn't have a pending resume request (i.e.
+    its run-time PM status is RPM_SUSPENDED), and it has a parent that is not
+    active (i.e. the parent's run-time PM status is not RPM_ACTIVE),
+    pm_runtime_resume_get() is called (recursively) for the parent.  If the
+    parent's resume is successful, the function notes that the parent's resume
+    counter will have to be decremented and restarts itself.  Otherwise, it
+    returns the error code returned by the instance of pm_runtime_resume_get()
+    handling the device's parent.
+
+  * If the device is resuming (i.e. the device's run-time PM status is
+    RPM_RESUMING), which means that another instance of pm_runtime_resume() or
+    pm_runtime_resume_get() is running at the same time for the same device, the
+    function waits for the other instance to complete and returns the result
+    returned by it (pm_runtime_resume_get() increments the device's resume
+    counter if success is returned).
+
+If none of the above happens, the function checks if the device's run-time PM
+status is RPM_SUSPENDED, which means that the device doesn't have a resume
+request pending, and if it has a parent.  If that is the case, the parent's
+counter of unsuspended children is increased.  Next, the device's run-time PM
+status is set to RPM_RESUMING and its bus type's ->runtime_resume() callback is
+executed.  This callback is entirely responsible for handling the device as
+appropriate (for example, it may choose to execute the device driver's
+->runtime_resume() callback or to carry out any other suitable action depending
+on the bus type).
+
+  * If it completes successfully, the device's run-time PM status is set to
+    RPM_ACTIVE, which means that the device is fully operational.  Thus, the
+    device bus type's ->runtime_resume() callback, when it is about to return
+    success, _must_ _ensure_ that this really is the case (i.e. when it returns
+    success, the device _must_ be able to carry out I/O operations as needed).
+
+  * If an error code is returned, the device's run-time PM status is set to
+    RPM_ERROR, which makes the PM core refuse to carry out any run-time PM
+    operations for the device until the status is cleared by its bus type or
+    driver with the help of either pm_runtime_clear_active(), or
+    pm_runtime_clear_suspended().  Thus, it is strongly recommended that the
+    device bus type's ->runtime_resume() callback only return error codes in
+    fatal error conditions, when it is impossible to bring the device back to
+    the operational state by any available means.  Inability to wake up a
+    suspended device usually means a service loss and it may very well result in
+    a data loss to the user, so it _must_ be avoided if at all possible.
+
+Finally, pm_runtime_resume() and pm_runtime_resume_get() return the error code
+(or success) returned by the device bus type's ->runtime_resume() callback
+(pm_runtime_resume_get() increments the device's resume counter if success is
+returned).  If the device's bus type doesn't implement ->runtime_resume(),
+-EINVAL is returned and the device's run-time PM status is set to RPM_ERROR.
+
+pm_request_resume() is used to queue up a resume request for a device that is
+suspended, suspending or has a suspend request pending.
+
+  * If the device has a resume request pending (i.e. the RPM_WAKE bit is set in
+    its run-time PM status field) or the device is not suspended or suspending
+    (i.e. none of the RPM_SUSPENDED and RPM_SUSPENDING bits is set in its
+    run-time PM status field), the function returns.
+
+  * If the device has a parent and the parent is inactive (i.e. at least one of
+    the RPM_IDLE, RPM_SUSPENDING, and RPM_SUSPENDED bits is set in its run-time
+    PM status field), and the parent doesn't have a resume request pending
+    (i.e. the RPM_WAKE bit is not set in the parent's run-time PM status field),
+    a resume request is scheduled for the parent with the help of
+    pm_request_resume() (i.e. recursively) and the function is restarted.
+
+If none of the above happens, the function checks if the device's run-time PM
+status is RPM_SUSPENDED, which means that the device is not suspending at the
+moment, and if it has a parent.  If that is the case, the parent's counter of
+unsuspended children is increased.  Next, the RPM_WAKE bit is set in the
+device's run-time PM status field and the request to execute the asynchronous
+version of pm_runtime_resume() is put into pm_wq.  It is valid to call this
+function from interrupt context.
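
The decision made by pm_request_resume() before anything is queued can be
modelled in user space as follows.  This is a sketch under stated assumptions:
the function and enum names are hypothetical, and the bit values are chosen
for the example only.  Note that, following the checks in the code of this
patch, a device whose status is only RPM_IDLE falls into the "do nothing"
case.

```c
/* Assumed bit values, for illustration only. */
enum {
	RPM_ACTIVE     = 0,
	RPM_IDLE       = 1 << 0,
	RPM_SUSPENDING = 1 << 1,
	RPM_SUSPENDED  = 1 << 2,
	RPM_WAKE       = 1 << 3,
};

/* Hypothetical names for the three possible outcomes. */
enum resume_action {
	RESUME_NOTHING,		/* return without queuing anything */
	RESUME_PARENT_FIRST,	/* request a resume of the parent and retry */
	RESUME_QUEUE,		/* set RPM_WAKE and queue the resume work */
};

static enum resume_action request_resume_action(unsigned int status,
						int has_parent,
						unsigned int parent_status)
{
	/* Already requested, or neither suspended nor suspending. */
	if ((status & RPM_WAKE)
	    || !(status & (RPM_SUSPENDING | RPM_SUSPENDED)))
		return RESUME_NOTHING;

	/* Inactive parent without a pending resume request of its own. */
	if (has_parent && !(parent_status & RPM_WAKE)
	    && (parent_status & (RPM_IDLE | RPM_SUSPENDING | RPM_SUSPENDED)))
		return RESUME_PARENT_FIRST;

	return RESUME_QUEUE;
}
```

In the real function the RESUME_PARENT_FIRST case drops both locks, calls
pm_request_resume() for the parent and jumps back to the start, so the checks
are repeated with the parent's RPM_WAKE bit set.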
+
+Note that it is possible to have a resume request and a suspend request queued
+up at the same time.  In that case, if the suspend request is processed first,
+the asynchronous version of pm_runtime_suspend() run as a result of it will
+notice that the RPM_WAKE bit is set in the device's run-time PM status field
+and will return -EAGAIN without doing anything else.  Then, the subsequent
+resume carried out as a result of the queued up request will notice that the
+RPM_IDLE bit is set in the device's run-time PM status field, so it will try to
+cancel the suspend request and the run-time PM status of the device will be set
+to RPM_ACTIVE.  On the other hand, if the resume request is processed first,
+which is more likely, it will cancel the pending suspend request and the
+run-time PM status of the device will be set to RPM_ACTIVE.
+
+pm_runtime_put() is used to decrease the device's resume counter by 1.  If the
+resume counter of the device is greater than 0, it causes the PM core to refuse
+to suspend the device or to queue up a suspend request for it.  In particular,
+it causes pm_runtime_suspend() to return -EAGAIN without doing anything else.
+This may be useful if the device is resumed for a specific task and it shouldn't
+be suspended until the task is complete, but there are many potential sources of
+suspend requests that could disturb it.  It is valid to call this function from
+interrupt context.
+
+pm_suspend_possible() is used to check if the device may be suspended at this
+particular moment.  It checks the device's run-time PM status, resume counter,
+and the counter of unsuspended children.  It returns 'false' if the device's
+counter of unsuspended children is greater than 0 or the device's resume counter
+is greater than 0, or at least one of the RPM_WAKE and RPM_RESUMING bits is set
+in its run-time PM status field.  It is valid to call this function from
+interrupt context.
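
The check can be written down as a user-space model that mirrors the inline
helpers in include/linux/pm_runtime.h.  It is a sketch only: the structure and
function names are hypothetical and the bit values are assumptions.  As in the
header, a non-zero counter of unsuspended children is disregarded when the
'power.ignore_children' flag is set.

```c
#include <stdbool.h>

/* Assumed bit values, for illustration only. */
enum {
	RPM_WAKE     = 1 << 3,
	RPM_RESUMING = 1 << 4,
};

/* Hypothetical stand-in for the run-time PM fields of 'struct dev_pm_info'. */
struct rpm_model {
	unsigned int runtime_status;
	unsigned int resume_count;
	unsigned int child_count;
	bool ignore_children;
};

static bool children_suspended(const struct rpm_model *m)
{
	return m->ignore_children || m->child_count == 0;
}

static bool suspend_possible(const struct rpm_model *m)
{
	return children_suspended(m)
		&& m->resume_count == 0
		&& !(m->runtime_status & (RPM_WAKE | RPM_RESUMING));
}
```

The model reads the fields without any locking, so, like the original helper,
its answer is only a snapshot that may be out of date by the time it is used.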
+
+pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
+respectively, all of the run-time PM core operations.  They do it by decreasing
+and increasing, respectively, the device's resume counter, but
+pm_runtime_disable() additionally calls pm_runtime_resume() for the device to
+make sure that the device will not be suspended while its run-time power
+management is disabled.  Therefore, if pm_runtime_disable() is called several
+times in a row for the same device, it has to be balanced by the appropriate
+number of pm_runtime_enable() calls so that the other run-time PM core functions
+can be used for that device.  The initial value of the device's resume counter,
+as set by pm_runtime_init(), is 1 (i.e. the device's run-time power management
+is initially disabled).
+
+pm_runtime_disable() and pm_runtime_enable() are used by the device core to
+disable the run-time power management of devices temporarily during device probe
+and removal as well as during system-wide power transitions (i.e. system-wide
+suspend or hibernation, or resume from a system sleep state).
+
+pm_suspend_ignore_children() is used to set or unset the
+'power.ignore_children' flag in 'struct device'.  If the 'enable'
+argument is 'true', the field is set to 1, and if 'enable' is 'false', the field
+is set to 0.  The default value of 'power.ignore_children', as set by
+pm_runtime_init(), is 0.
+
+pm_runtime_clear_active() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_ACTIVE.  It is valid to call this function from
+interrupt context.
+
+pm_runtime_clear_suspended() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_SUSPENDED.  If the device has a parent, the
+function additionally decrements the parent's counter of unsuspended children.
+It is valid to call this function from interrupt context.
+
+3. Device Run-time PM Callbacks
+
+There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend() callback is executed by pm_runtime_suspend() for the bus
+type of the device being suspended.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_suspend() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the bus type's ->runtime_suspend() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_suspend() callback has returned successfully,
+  the PM core regards the device as suspended, which need not mean that the
+  device has been put into a low power state.  It is supposed to mean, however,
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_ACTIVE, which means that the device
+  _must_ be fully operational once this has happened.
+* If the bus type's ->runtime_suspend() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 1 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+In particular, it is recommended that ->runtime_suspend() return -EBUSY or
+-EAGAIN if device_may_wakeup() returns 'false' for the device.  On the other
+hand, if device_may_wakeup() returns 'true' for the device and the device is put
+into a low power state during the execution of ->runtime_suspend(), it is
+expected that remote wake-up (i.e. hardware mechanism allowing the device to
+request a change of its power state, such as PCI PME) will be enabled for the
+device.  Generally, remote wake-up should be enabled whenever the device is put
+into a low power state at run time and is expected to receive input from the
+outside of the system.
+
+The ->runtime_resume() callback is executed by pm_runtime_resume() for the bus
+type of the device being woken up.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_resume() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the bus type's ->runtime_resume() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_resume() callback has returned successfully,
+  the PM core regards the device as fully operational, which means that the
+  device _must_ be able to complete I/O operations as needed.
+* If the bus type's ->runtime_resume() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_SUSPENDED, which is supposed to mean
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_resume() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 1 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_idle() callback is executed by pm_runtime_suspend() for the bus
+type of a device the children of which are all suspended (or which has the
+'power.ignore_children' flag set).  The action carried out by this
+callback is totally dependent on the bus type in question, but the expected
+action is to check if the device can be suspended (i.e. if all of the conditions
+necessary for suspending the device are met) and to queue up a suspend request
+for the device if that is the case.
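
The return-code rules documented above for the bus type's ->runtime_suspend()
and ->runtime_resume() callbacks can be summarized in a small user-space model.
This is only a sketch and not part of the patch: the MODEL_RPM_* enumerators and
both helper names are hypothetical, and mapping "any other error" to an error
status is an assumption based on the RPM_ERROR status mentioned in the
descriptions of pm_runtime_clear_active() and pm_runtime_clear_suspended().

```c
#include <errno.h>

/* Hypothetical stand-ins for the patch's RPM_* status bits. */
enum model_rpm_status {
	MODEL_RPM_ACTIVE,
	MODEL_RPM_SUSPENDED,
	MODEL_RPM_ERROR,
};

/* Status of the device after the bus type's ->runtime_suspend() returned ret. */
static enum model_rpm_status model_status_after_suspend(int ret)
{
	if (ret == 0)
		return MODEL_RPM_SUSPENDED;	/* device regarded as suspended */
	if (ret == -EBUSY || ret == -EAGAIN)
		return MODEL_RPM_ACTIVE;	/* device must be fully operational */
	return MODEL_RPM_ERROR;			/* helpers refuse to run until cleared */
}

/* Status of the device after the bus type's ->runtime_resume() returned ret. */
static enum model_rpm_status model_status_after_resume(int ret)
{
	if (ret == 0)
		return MODEL_RPM_ACTIVE;	/* device must complete I/O as needed */
	if (ret == -EBUSY || ret == -EAGAIN)
		return MODEL_RPM_SUSPENDED;	/* device stays quiescent */
	return MODEL_RPM_ERROR;			/* helpers refuse to run until cleared */
}
```

In this model the two callbacks are mirror images of each other: success moves
the device to the requested state, -EBUSY or -EAGAIN leaves it in the previous
one, and anything else requires the bus type or driver to clear the error.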

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-19  0:38                     ` Rafael J. Wysocki
@ 2009-06-19 16:25                         ` Alan Stern
  2009-06-19 16:25                         ` Alan Stern
  1 sibling, 0 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-19 16:25 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Fri, 19 Jun 2009, Rafael J. Wysocki wrote:

> > In fact, maybe it would be best if pm_request_resume always increments
> > depth (unless it fails for some other reason) and __pm_runtime_resume
> > increments depth whenever called synchronously.  And likewise for the
> > suspend paths.
> 
> And how exactly are we going to check if pm_request_resume() was successful?
> 
> We'd have to be able to do that in a code path different from the one that has
> called pm_request_resume().

No, no.  What I meant was: Increment depth if pm_request_resume calls 
queue_work.  If it exits early, don't increment depth.

But now I'm not sure this is the right thing to do.  It's not clear 
what the right model should be for asynchronous operation.

> > > > Instead of a costly device_for_each_child(), would it be better to
> > > > maintain a counter with the number of unsuspended children?
> > > 
> > > Hmm.  How exactly are we going to count them?  The only way I see at the moment
> > > would be to increase this number by one when running pm_runtime_init() for a
> > > new child.  Seems doable.
> > 
> > That's right.  You also have to decrement the number when an
> > unsuspended child device is removed, obviously.
> 
> I forgot about that, so it is not done in the patch below.
> 
> BTW, is it just me, or are we overcomplicating that thing beyond any
> reasonable limit?
> 
> I think I'll just do the device_for_each_child() for now, because IMO this
> optimization just isn't worth the complications resulting from it, because,
> realistically, how many children is a parent going to have in a normal system?

Okay for now.  For the future...  What I'm concerned about is this: If
a driver uses asynchronous operation, it might need to tell the core
each time an I/O operation finished.  Whatever this involves will
therefore be on a hot path, so it should minimize the amount of locking
and other activity -- but device_for_each_child takes a bunch of locks.


> > The one thing to watch out for is what happens if a device is removed while
> > its runtime_resume callback is running.  :-)
> 
> I don't think it's possible.

I guess this is more of a problem in the USB stack, because we use the
same field to keep track of whether a device is suspended and whether
it has been unplugged.  Okay, forget it.


> > I guess this is a question of how you view things.  My view has been that
> > whenever depth (or pm_usage_cnt in the USB code) is 0, it means neither
> > the driver nor anyone else has any reason to keep the device at full
> > power.  By definition, since that's what depth is -- a count of the
> > reasons for not suspending.
> > 
> > There might be some obscure other reason, but in general depth going
> > to 0 means a delayed autosuspend request should be queued.
> 
> OK there, but pm_runtime_disable() is called by the core in some places where
> we'd rather not want the device to be suspended (like during system-wide
> power transitions).

I'm not sure what you mean.  I was talking about pm_runtime_enable
(which decrements depth), not pm_runtime_disable (which increments it).  
When pm_runtime_enable finds that depth has gone to 0, it should queue
a delayed autosuspend request.


> > Which reminds me... Something to think about: In an async call to
> > __pm_runtime_suspend, if the runtime_suspend callback returns -EBUSY
> > then perhaps your code should automatically requeue a new delayed
> > autosuspend request.  Which implies, of course, that the autosuspend
> > delay has to be stored in the dev_pm_info structure.  This isn't a bad
> > thing, since exposing the value in sysfs gives userspace a consistent
> > way to set the delay.
> 
> I think that functionality can be added later.  Let's keep things as simple
> as possible initially, or we won't be able to make any progress.
> 
> Below is a new version of the patch.  Unfortunately, it is a major rework.
> In short, I tried to address some of your recent comments and my observations.
> It doesn't use depth any more, there's another counter (called resume_count)
> instead, also playing the role of the RPM_GRACE bit from the previous version.

Ah.  Okay.

> I've just finished it, so it may be still missing something apart from the
> updating child_count on removal of an unsuspended child.

I'll review it later.  For now, perhaps it would help to give some of
the considerations used by the USB stack.  For each device we store a
usage counter (equivalent to resume_count or depth), an autosuspend
delay value, and a time of last use.  Whenever a synchronous suspend or
any resume occurs, the time of last use is set to the current time.  
The same thing happens with delayed autosuspend requests that aren't 
requeues of an earlier request (see below).

Autosuspend is disallowed if:

	the driver doesn't support autosuspend;

	the usage counter is > 0;

	autosuspend has been disabled for this device;

	the driver requires remote wakeup during autosuspend
	but the user has disallowed wakeup.

If everything else is okay but not enough time has elapsed since the
device was last used, another delayed autosuspend request is queued and
the current one fails with -EAGAIN.
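
The conditions listed above could be modeled roughly as follows.  This is a
user-space sketch under stated assumptions, not the USB stack's actual code:
the structure, its field names and autosuspend_check() are hypothetical, and
the -EPERM and -EBUSY return values for the "disallowed" cases are
placeholders.  Only the -EAGAIN result for "not idle long enough" comes from
the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device autosuspend state. */
struct autosuspend_model {
	bool driver_supports_autosuspend;
	bool autosuspend_disabled;	/* disabled for this device */
	bool needs_remote_wakeup;	/* driver requires remote wakeup */
	bool wakeup_allowed;		/* user setting */
	int usage_count;		/* reasons for not suspending */
	long delay;			/* autosuspend delay, arbitrary time units */
	long last_use;			/* time of last use, same units */
};

/*
 * Returns 0 if an autosuspend may be carried out at time 'now', -EAGAIN if
 * the device has not been idle long enough (the caller is expected to queue
 * another delayed request), or a placeholder error if autosuspend is
 * disallowed altogether.
 */
static int autosuspend_check(const struct autosuspend_model *m, long now)
{
	assert(m != NULL);

	if (!m->driver_supports_autosuspend || m->autosuspend_disabled)
		return -EPERM;
	if (m->usage_count > 0)
		return -EBUSY;
	if (m->needs_remote_wakeup && !m->wakeup_allowed)
		return -EPERM;
	if (now - m->last_use < m->delay)
		return -EAGAIN;
	return 0;
}
```

The elapsed-time test comes last so that -EAGAIN is returned only when
everything else is okay, matching the ordering in the description.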

I believe the only circumstance where we would want to do an
autosuspend without decrementing the usage counter is if the usage
counter is already 0 (but the device hasn't been autosuspended yet) and
the autosuspend delay is changed.  For example, if the delay was
originally set to 30 seconds and the device has been idle for only 10
seconds, but then the user changes the delay to 5 seconds, we would do
an immediate autosuspend while leaving the counter at 0.

The model for asynchronous operation is that the usage counter remains
always at 0, and the driver updates the time-of-last-use field whenever
an I/O operation starts or completes.  The core keeps a delayed
autosuspend request queued; each time the request runs it checks
whether the device has been idle sufficiently long.  If not it
requeues itself; otherwise it carries out an autosuspend.

If an I/O operation takes too long (so that an autosuspend starts up in
the middle), the driver's suspend callback will return -EBUSY, thereby
causing another delayed autosuspend to be queued.
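
As a rough illustration of this asynchronous model, the following user-space
sketch simulates one delayed autosuspend request that requeues itself.  All
names here are hypothetical.  The 'io_in_progress' flag stands in for the
driver's suspend callback returning -EBUSY, and requeuing after a full delay
in that case is an assumption, since the description does not say how long
the new delay should be.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state for the asynchronous model. */
struct async_model {
	long last_use;		/* time of last use */
	long delay;		/* autosuspend delay */
	bool io_in_progress;	/* suspend callback would return -EBUSY */
	bool suspended;
	int requeues;		/* how many times the request requeued itself */
};

/* Driver side: called whenever an I/O operation starts or completes. */
static void note_io_activity(struct async_model *m, long now)
{
	assert(m != NULL);
	m->last_use = now;
}

/*
 * One run of the delayed autosuspend request at time 'now'.  Returns the
 * time at which the requeued request should run next, or -1 if the device
 * was autosuspended.
 */
static long autosuspend_work(struct async_model *m, long now)
{
	long idle;

	assert(m != NULL);
	idle = now - m->last_use;

	if (idle < m->delay) {
		/* Not idle long enough: run again when the delay expires. */
		m->requeues++;
		return m->last_use + m->delay;
	}
	if (m->io_in_progress) {
		/* Suspend callback said -EBUSY: try again later. */
		m->requeues++;
		return now + m->delay;
	}
	m->suspended = true;
	return -1;
}
```

The usage counter never appears here because, in this model, it stays at 0
and only the time-of-last-use field decides when the device is suspended.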

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-19 16:25                         ` Alan Stern
  (?)
@ 2009-06-19 22:42                         ` Rafael J. Wysocki
  2009-06-20  2:34                           ` Alan Stern
  2009-06-20  2:34                             ` Alan Stern
  -1 siblings, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-19 22:42 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Friday 19 June 2009, Alan Stern wrote:
> On Fri, 19 Jun 2009, Rafael J. Wysocki wrote:
> 
> > > In fact, maybe it would be best if pm_request_resume always increments
> > > depth (unless it fails for some other reason) and __pm_runtime_resume
> > > increments depth whenever called synchronously.  And likewise for the
> > > suspend paths.
> > 
> > And how exactly are we going to check if pm_request_resume() was successful?
> > 
> > We'd have to be able to do that in a code path different from the one that has
> > called pm_request_resume().
> 
> No, no.  What I meant was: Increment depth if pm_request_resume calls 
> queue_work.  If it exits early, don't increment depth.
> 
> But now I'm not sure this is the right thing to do.  It's not clear 
> what the right model should be for asynchronous operation.

I think we can grab a reference when queuing up a resume request and drop
it on the completion of it.  This way, suspend will be locked while we're
waiting for the resume to run, which I think is what we want.
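
A minimal user-space sketch of that idea, assuming a resume counter like the
one described in the documentation (a suspend attempt returns -EAGAIN while
the counter is greater than 0).  The structure and the model_* function names
are hypothetical and no locking is shown.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state. */
struct rpm_model {
	int resume_count;	/* suspend is refused while this is > 0 */
	bool resume_pending;	/* a resume request is queued */
	bool suspended;
};

/* Queue a resume request; take a reference only if one is actually queued. */
static bool model_request_resume(struct rpm_model *d)
{
	assert(d != NULL);
	if (d->resume_pending)
		return false;	/* already queued, no extra reference */
	d->resume_count++;
	d->resume_pending = true;
	return true;
}

/* Body of the queued work item: resume, then drop the reference. */
static void model_resume_work(struct rpm_model *d)
{
	assert(d != NULL);
	if (!d->resume_pending)
		return;
	d->suspended = false;
	d->resume_pending = false;
	d->resume_count--;
}

/* Suspend attempt: refused while any reference is held. */
static int model_suspend(struct rpm_model *d)
{
	assert(d != NULL);
	if (d->resume_count > 0)
		return -EAGAIN;
	d->suspended = true;
	return 0;
}
```

Between model_request_resume() and model_resume_work() every suspend attempt
fails with -EAGAIN, which is the "suspend will be locked" behavior described
above.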

> > > > > Instead of a costly device_for_each_child(), would it be better to
> > > > > maintain a counter with the number of unsuspended children?
> > > > 
> > > > Hmm.  How exactly are we going to count them?  The only way I see at the moment
> > > > would be to increase this number by one when running pm_runtime_init() for a
> > > > new child.  Seems doable.
> > > 
> > > That's right.  You also have to decrement the number when an
> > > unsuspended child device is removed, obviously.
> > 
> > I forgot about that, so it is not done in the patch below.
> > 
> > BTW, is it just me, or are we overcomplicating that thing beyond any
> > reasonable limit?
> > 
> > I think I'll just do the device_for_each_child() for now, because IMO this
> > optimization just isn't worth the complications resulting from it, because,
> > realistically, how many children is a parent going to have in a normal system?
> 
> Okay for now.  For the future...  What I'm concerned about is this: If
> a driver uses asynchronous operation, it might need to tell the core
> each time an I/O operation finished.  Whatever this involves will
> therefore be on a hot path, so it should minimize the amount of locking
> and other activity -- but device_for_each_child takes a bunch of locks.

OK, I think I'll try to do the counting, although it may be difficult to handle
all of the corner cases.
 
> > > The one thing to watch out for is what happens if a device is removed while
> > > its runtime_resume callback is running.  :-)
> > 
> > I don't think it's possible.
> 
> I guess this is more of a problem in the USB stack, because we use the
> same field to keep track of whether a device is suspended and whether
> it has been unplugged.  Okay, forget it.

Well, it seems to be easy to handle: increase resume_counter, wait for all
operations in progress to complete and change the status to 'active'
(that blocks resume).

> > > I guess this is a question of how you view things.  My view has been that
> > > whenever depth (or pm_usage_cnt in the USB code) is 0, it means neither
> > > the driver nor anyone else has any reason to keep the device at full
> > > power.  By definition, since that's what depth is -- a count of the
> > > reasons for not suspending.
> > > 
> > > There might be some obscure other reason, but in general depth going
> > > to 0 means a delayed autosuspend request should be queued.
> > 
> > OK there, but pm_runtime_disable() is called by the core in some places where
> > > we'd rather not want the device to be suspended (like during system-wide
> > > power transitions).
> 
> I'm not sure what you mean.  I was talking about pm_runtime_enable
> (which decrements depth), not pm_runtime_disable (which increments it).  
> When pm_runtime_enable finds that depth has gone to 0, it should queue
> a delayed autosuspend request.

OK, but I don't think that queuing a request without notifying the bus type is
the right thing to do.  IMO it's better to use ->runtime_idle() in that case
(in analogy with the situation in which the last child of a device has been
suspended).

> > > Which reminds me... Something to think about: In an async call to
> > > __pm_runtime_suspend, if the runtime_suspend callback returns -EBUSY
> > > then perhaps your code should automatically requeue a new delayed
> > > autosuspend request.  Which implies, of course, that the autosuspend
> > > delay has to be stored in the dev_pm_info structure.  This isn't a bad
> > > thing, since exposing the value in sysfs gives userspace a consistent
> > > way to set the delay.
> > 
> > I think that functionality can be added later.  Let's keep things as simple
> > as possible initially, or we won't be able to make any progress.
> > 
> > Below is a new version of the patch.  Unfortunately, it is a major rework.
> > In short, I tried to address some of your recent comments and my observations.
> > It doesn't use depth any more, there's another counter (called resume_count)
> > instead, also playing the role of the RPM_GRACE bit from the previous version.
> 
> Ah.  Okay.
> 
> > I've just finished it, so it may be still missing something apart from the
> > updating child_count on removal of an unsuspended child.
> 
> I'll review it later.

In fact I have updated it once again, so it's probably better to wait. :-)

> For now, perhaps it would help to give some of
> the considerations used by the USB stack.  For each device we store a
> usage counter (equivalent to resume_count or depth), an autosuspend
> delay value, and a time of last use.  Whenever a synchronous suspend or
> any resume occurs, the time of last use is set to the current time.  
> The same thing happens with delayed autosuspend requests that aren't 
> requeues of an earlier request (see below).
> 
> Autosuspend is disallowed if:
> 
> 	the driver doesn't support autosuspend;
> 
> 	the usage counter is > 0;
> 
> 	autosuspend has been disabled for this device;
> 
> 	the driver requires remote wakeup during autosuspend
> 	but the user has disallowed wakeup.

That's probably universal for all bus types and devices.

> If everything else is okay but not enough time has elapsed since the
> device was last used, another delayed autosuspend request is queued and
> the current one fails with -EAGAIN.

I wouldn't like to do the automatic queuing at the core level, simply because
the core may not have enough information to make a correct decision.

> I believe the only circumstance where we would want to do an
> autosuspend without decrementing the usage counter is if the usage
> counter is already 0 (but the device hasn't been autosuspended yet) and
> the autosuspend delay is changed.  For example, if the delay was
> originally set to 30 seconds and the device has been idle for only 10
> seconds, but then the user changes the delay to 5 seconds, we would do
> an immediate autosuspend while leaving the counter at 0.
> 
> The model for asynchronous operation is that the usage counter remains
> always at 0, and the driver updates the time-of-last-use field whenever
> an I/O operation starts or completes.  The core keeps a delayed
> autosuspend request queued; each time the request runs it checks
> whether the device has been idle sufficiently long.  If not it
> requeues itself; otherwise it carries out an autosuspend.

Again, I think it's a bus type's decision whether or not to use such a
"permanent" suspend request.

> If an I/O operation takes too long (so that an autosuspend starts up in
> the middle), the driver's suspend callback will return -EBUSY, thereby
> causing another delayed autosuspend to be queued.

OK, thanks for the description.

I think it probably is a good idea to store the time of last use in 'struct
device', so that bus types don't need to duplicate that field (all of them will
likely use it).  I'm not sure about the delay, though.  Well, I need some time
to think about it. :-)
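
The USB-style eligibility rules quoted above (usage counter, autosuspend delay,
time of last use) can be sketched roughly as follows.  This is only an
illustration, not code from the patch: 'struct toy_autosuspend',
toy_autosuspend_check() and the field names are made up, and the -EPERM/-EBUSY
return codes for the "disallowed" cases are assumptions (the thread only
names -EAGAIN for the "not idle long enough" case).

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-device autosuspend data, loosely following the USB model. */
struct toy_autosuspend {
	bool driver_supports;		/* driver supports autosuspend */
	bool disabled_by_user;		/* autosuspend disabled for this device */
	bool needs_remote_wakeup;	/* driver requires remote wakeup */
	bool wakeup_allowed;		/* user allows wakeup */
	int usage_count;		/* > 0 means someone needs full power */
	long last_use;			/* time of last use, arbitrary units */
	long delay;			/* autosuspend delay, same units */
};

/*
 * Returns 0 if the device may be autosuspended at time 'now'.
 * Returns -EAGAIN if it has not been idle long enough; *requeue_in then
 * holds the remaining time after which a new request could be queued.
 * The other error codes are assumptions made for this sketch.
 */
static int toy_autosuspend_check(const struct toy_autosuspend *a, long now,
				 long *requeue_in)
{
	long idle = now - a->last_use;

	*requeue_in = 0;
	if (!a->driver_supports || a->disabled_by_user)
		return -EPERM;
	if (a->usage_count > 0)
		return -EBUSY;
	if (a->needs_remote_wakeup && !a->wakeup_allowed)
		return -EPERM;
	if (idle < a->delay) {
		*requeue_in = a->delay - idle;
		return -EAGAIN;
	}
	return 0;
}
```

With a delay of 30 and 10 units of idle time the check asks for a requeue in
20 units; lowering the delay to 5 makes the same call succeed at once, which
matches the "user changes the delay" example in the quoted text.  Whether the
requeue itself belongs in the core or in the bus type is exactly the open
question here, so the sketch only reports the remaining time.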

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-19 22:42                         ` Rafael J. Wysocki
@ 2009-06-20  2:34                             ` Alan Stern
  2009-06-20  2:34                             ` Alan Stern
  1 sibling, 0 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-20  2:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Sat, 20 Jun 2009, Rafael J. Wysocki wrote:

> I think we can grab a reference when queuing up a resume request and drop
> it on the completion of it.  This way, suspend will be locked while we're
> waiting for the resume to run, which I think is what we want.

But suspend is already blocked from the time a resume request is queued 
until the resume completes, unless the suspend was underway when the 
request was made.  So that doesn't seem to make sense.

This really all depends on how drivers use async autoresume.  Here's 
one possible way they could be written:

irq_handler() {
	status = pm_request_resume();
	if (status indicates the device is currently resumed)
		handle_the_IO();
	else
		save_the_IO();
}

runtime_resume_method() {
	handle_saved_IO();
	pm_request_suspend();	/* Could call pm_notify_idle instead */
}

The implications of this design are:

	pm_request_resume should return one code if the status already
	is RPM_WAKE and a different code if the resume request had to
	be queued (or one was already queued).

	pm_request_suspend should run very quickly, since it will be
	called after every I/O operation.  Likewise, pm_request_resume
	should run very quickly if the status is RPM_ACTIVE or 
	RPM_IDLE.

	In order to prevent autosuspends from occurring while I/O is
	in progress, the pm_request_resume call should increment the
	usage counter (if it had to queue the request) and the 
	pm_request_suspend call should decrement it (maybe after
	waiting for the delay).
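
To make the three implications above concrete, here is a small userspace
model of that design.  It is a sketch under stated assumptions, not the
proposed kernel API: every toy_* name is hypothetical, the "work item" is
just a flag, and the return convention (0 = already active, 1 = resume
queued or already pending) is one possible choice.

```c
/* Hypothetical status values; only RPM_ACTIVE is a name used in the thread. */
enum toy_status { RPM_ACTIVE, RPM_SUSPENDED };

struct toy_dev {
	enum toy_status status;
	int usage_count;	/* reasons not to autosuspend */
	int resume_queued;	/* models a pending resume work item */
	int saved_io;		/* I/O deferred until the resume has run */
	int handled_io;		/* I/O actually processed */
};

/* Returns 0 if the device is already active, 1 if a resume is pending. */
static int toy_request_resume(struct toy_dev *d)
{
	if (d->status == RPM_ACTIVE)
		return 0;
	if (!d->resume_queued) {
		d->resume_queued = 1;
		d->usage_count++;	/* block autosuspend while I/O waits */
	}
	return 1;
}

/* Drops the reference taken when the resume request was queued. */
static void toy_request_suspend(struct toy_dev *d)
{
	if (d->usage_count > 0)
		d->usage_count--;
}

static void toy_irq_handler(struct toy_dev *d)
{
	if (toy_request_resume(d) == 0)
		d->handled_io++;	/* device is up: handle the I/O now */
	else
		d->saved_io++;		/* device is down: save the I/O */
}

/* Stands in for the workqueue running the queued resume request. */
static void toy_run_resume_work(struct toy_dev *d)
{
	if (!d->resume_queued)
		return;
	d->resume_queued = 0;
	d->status = RPM_ACTIVE;
	d->handled_io += d->saved_io;	/* runtime_resume_method(): saved I/O */
	d->saved_io = 0;
	toy_request_suspend(d);
}
```

In this model the counter is incremented only when a request actually gets
queued, so repeated interrupts while the device is suspended take one
reference in total, and the single toy_request_suspend() call in the resume
path balances it.
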


> OK, I think I'll try to do the counting, although it may be difficult to handle
> all of the corner cases.

No, I agree it's not worth worrying about for now.  It can always be 
added later.


> > > > There might be some obscure other reason, but in general depth going
> > > > to 0 means a delayed autosuspend request should be queued.
> > > 
> > > OK there, but pm_runtime_disable() is called by the core in some places where
> > > we'd rather not want the device to be suspended (like during system-wide
> > > power transitions).
> > 
> > I'm not sure what you mean.  I was talking about pm_runtime_enable
> > (which decrements depth), not pm_runtime_disable (which increments it).  
> > When pm_runtime_enable finds that depth has gone to 0, it should queue
> > a delayed autosuspend request.
> 
> OK, but I don't think that queuing a request without notifying the bus type is
> the right thing to do.  IMO it's better to use ->runtime_idle() in that case
> (in analogy with the situation in which the last child of a device has been
> suspended).

Agreed.


> > Autosuspend is disallowed if:
> > 
> > 	the driver doesn't support autosuspend;
> > 
> > 	the usage counter is > 0;
> > 
> > 	autosuspend has been disabled for this device;
> > 
> > 	the driver requires remote wakeup during autosuspend
> > 	but the user has disallowed wakeup.
> 
> That's probably universal for all bus types and devices.

Probably.  But you haven't provided a way for the driver to indicate 
that it requires wakeup.  It's not a big deal, since the 
runtime_suspend method can do its own checking.

> > If everything else is okay but not enough time has elapsed since the
> > device was last used, another delayed autosuspend request is queued and
> > the current one fails with -EAGAIN.
> 
> I wouldn't like to do the automatic queuing at the core level, simply because
> the core may not have enough information to make a correct decision.

Calling the notify_idle method would be good enough.

> > The model for asynchronous operation is that the usage counter remains
> > always at 0, and the driver updates the time-of-last-use field whenever
> > an I/O operation starts or completes.  The core keeps a delayed
> > autosuspend request queued; each time the request runs it checks
> > whether the device has been idle sufficiently long.  If not it
> > requeues itself; otherwise it carries out an autosuspend.
> 
> Again, I think it's a bus type's decision whether or not to use such a
> "permanent" suspend request.

Ironically, this model is different from the one I outlined above.  
There's more than one way to do this; it's not clear which is best, and 
AFAIK none of them has been implemented in a real driver yet.

> I think it probably is a good idea to store the time of last use in 'struct
> device', so that bus types don't need to duplicate that field (all of them will
> likely use it).  I'm not sure about the delay, though.  Well, I need some time
> to think about it. :-)

All bus types will want to implement _some_ delay; it doesn't make
sense to power down a device immediately after every operation and then
power it back up for the next operation.

But the time scales of the delays may vary widely.  Some devices might 
be able to power up in a millisecond or less; others will require 
seconds.  The delays should be set accordingly.

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [linux-pm] [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-20  2:34                             ` Alan Stern
  (?)
  (?)
@ 2009-06-20 14:30                             ` Alan Stern
  2009-06-20 23:48                               ` Rafael J. Wysocki
                                                 ` (3 more replies)
  -1 siblings, 4 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-20 14:30 UTC (permalink / raw)
  To: Rafael J. Wysocki, Magnus Damm
  Cc: Greg KH, LKML, ACPI Devel Maling List, Linux-pm mailing list,
	Ingo Molnar

Some more thoughts...

Magnus, you might have some insights here.  It occurred to me that some 
devices can switch power levels very quickly, and the drivers might 
therefore want the runtime suspend and resume methods to be called as 
soon as possible, even in interrupt context.

In terms of the current framework, this probably means holding the
runtime PM lock (i.e., not releasing it) across the calls to
->runtime_suspend and ->runtime_resume.  It also means that
pm_request_suspend and pm_request_resume should carry out their jobs
immediately instead of queuing a work item.  (Unless the current status 
is RPM_SUSPENDING or RPM_RESUMING, which should never happen.)

Should there be a flag in dev_pm_info to select this behavior?


When a device structure is unregistered and deallocated, we have to
ensure that there aren't any pending runtime PM workqueue items.  
Hence device_del should call a routine that changes the status to an
exceptional state (not RPM_ERROR but something else) to prevent new
requests from being queued, and then calls cancel_work_sync or
cancel_delayed_work_sync as required.

Similarly, we should ensure that runtime PM calls made before the
device is registered don't do anything.  So when the device structure
is first created and the contents are all 0, this should also be
interpreted as an exceptional state.  We could call it RPM_UNREGISTERED
and use it for both purposes.
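
A rough sketch of that idea, assuming the exceptional state is given the
value 0 so that a freshly zeroed structure is already in it.  Only
RPM_UNREGISTERED and RPM_ACTIVE are names from this discussion; the
toy_* helpers, the structure layout and the -EINVAL return code are
hypothetical, and clearing a flag merely stands in for cancel_work_sync().

```c
#include <errno.h>
#include <string.h>

enum toy_rpm_status {
	RPM_UNREGISTERED = 0,	/* what an all-zero structure reads as */
	RPM_ACTIVE,
	RPM_SUSPENDED,
};

struct toy_pm_info {
	enum toy_rpm_status status;
	int request_pending;	/* models a queued work item */
};

/* Refuses to queue anything before registration or after removal. */
static int toy_queue_request(struct toy_pm_info *pm)
{
	if (pm->status == RPM_UNREGISTERED)
		return -EINVAL;
	pm->request_pending = 1;
	return 0;
}

static void toy_register(struct toy_pm_info *pm)
{
	pm->status = RPM_ACTIVE;
}

static void toy_unregister(struct toy_pm_info *pm)
{
	/* First stop new requests from being queued ... */
	pm->status = RPM_UNREGISTERED;
	/* ... then get rid of whatever is still pending. */
	pm->request_pending = 0;
}
```

The ordering in toy_unregister() is the point: the status changes before the
pending request is cancelled, so nothing can be queued in between.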

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-20  2:34                             ` Alan Stern
                                               ` (2 preceding siblings ...)
  (?)
@ 2009-06-20 23:38                             ` Rafael J. Wysocki
  2009-06-21  2:23                               ` Alan Stern
  2009-06-21  2:23                                 ` Alan Stern
  -1 siblings, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-20 23:38 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Saturday 20 June 2009, Alan Stern wrote:
> On Sat, 20 Jun 2009, Rafael J. Wysocki wrote:
> 
> > I think we can grab a reference when queuing up a resume request and drop
> > it on the completion of it.  This way, suspend will be locked while we're
> > waiting for the resume to run, which I think is what we want.
> 
> But suspend is already blocked from the time a resume request is queued 
> until the resume completes, unless the suspend was underway when the 
> request was made.  So that doesn't seem to make sense.
> 
> This really all depends on how drivers use async autoresume.  Here's 
> one possible way they could be written:
> 
> irq_handler() {
> 	status = pm_request_resume();
> 	if (status indicates the device is currently resumed)
> 		handle_the_IO();
> 	else
> 		save_the_IO();
> }
> 
> runtime_resume_method() {
> 	handle_saved_IO();
> 	pm_request_suspend();	/* Could call pm_notify_idle instead */
> }
> 
> The implications of this design are:
> 
> 	pm_request_resume should return one code if the status already
> 	is RPM_WAKE and a different code if the resume request had to
> 	be queued (or one was already queued).

I did something like this in the patch below.
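
To illustrate, here is a simplified user-space model of the return codes of
__pm_request_resume() as they appear in the patch below.  It is only a sketch:
locking, the workqueue, the 'get' argument and the suspend_aborted handling
are left out, and model_request_resume() and model_can_handle_io_now() are
made-up names.

```c
#include <assert.h>
#include <errno.h>

#define RPM_ACTIVE	0
#define RPM_IDLE	0x01
#define RPM_SUSPENDING	0x02
#define RPM_SUSPENDED	0x04
#define RPM_WAKE	0x08
#define RPM_RESUMING	0x10
#define RPM_ERROR	0x1F

/* Simplified model of the status checks: no locking, no work items. */
int model_request_resume(unsigned int *status)
{
	assert(status);

	if (*status == RPM_ERROR)
		return -EINVAL;
	if (*status == RPM_ACTIVE)
		return -EBUSY;		/* device is operational right now */
	if (*status & (RPM_WAKE | RPM_RESUMING))
		return -EINPROGRESS;	/* a resume is already on its way */
	if (*status == RPM_IDLE)
		return -EBUSY;		/* only a pending suspend to cancel */

	*status |= RPM_WAKE;		/* resume request has been queued */
	return 0;
}

/* Hypothetical driver-side check: can the I/O be handled immediately? */
int model_can_handle_io_now(unsigned int *status)
{
	return model_request_resume(status) == -EBUSY;
}
```

With this, an interrupt handler written along the lines you describe would
handle the I/O directly on -EBUSY and save it for ->runtime_resume() on 0 or
-EINPROGRESS.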

> 	pm_request_suspend should run very quickly, since it will be
> 	called after every I/O operation.  Likewise, pm_request_resume
> 	should run very quickly if the status is RPM_ACTIVE or 
> 	RPM_IDLE.

Hmm.  pm_request_suspend() is really short, so it should be fast.
pm_request_resume() is a bit more complicated, though (it takes two spinlocks,
increases an atomic counter, possibly twice, and queues up a work item, also
in the RPM_IDLE case).

> 	In order to prevent autosuspends from occurring while I/O is
> 	in progress, the pm_request_resume call should increment the
> 	usage counter (if it had to queue the request) and the 
> 	pm_request_suspend call should decrement it (maybe after
> 	waiting for the delay).

I wouldn't like pm_request_suspend() to do that, because it's valid to
call it many times in a row (only the first request will be queued in such a
case).

I'd prefer the caller to do pm_request_resume_get() (please see the patch
below) to put a resume request into the queue and then pm_runtime_put_notify()
when it's done with the I/O.  That will result in ->runtime_idle() being called
automatically if the device may be suspended.
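
A rough user-space sketch of that get/put pattern, assuming a plain integer in
place of the atomic resume counter and ignoring children.  The names
model_dev, model_resume_get() and model_put_notify() are hypothetical and only
mirror pm_request_resume_get() and pm_runtime_put_notify():

```c
#include <assert.h>

struct model_dev {
	int resume_count;	/* models dev->power.resume_count */
	int idle_calls;		/* how many times ->runtime_idle() would run */
};

/* Models pm_request_resume_get(): take a reference before starting I/O. */
void model_resume_get(struct model_dev *dev)
{
	dev->resume_count++;
}

/*
 * Models pm_runtime_put_notify(): drop the reference and notify the bus
 * type once nobody needs the device to stay active any more.
 */
void model_put_notify(struct model_dev *dev)
{
	assert(dev->resume_count > 0);

	dev->resume_count--;
	if (dev->resume_count == 0)
		dev->idle_calls++;	/* stands in for ->runtime_idle(dev) */
}
```

Each I/O path takes its own reference, so the idle notification only happens
after the last one has been dropped, regardless of how the paths interleave.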

> > OK, I think I'll try to do the counting, although it may be difficult to handle
> > all of the corner cases.
> 
> No, I agree it's not worth worrying about for now.  It can always be 
> added later.

Well, I've done it already, so I'd prefer to keep it, unless it's broken. ;-)

> > > > > There might be some obscure other reason, but in general depth going
> > > > > to 0 means a delayed autosuspend request should be queued.
> > > > 
> > > > OK there, but pm_runtime_disable() is called by the core in some places where
> > > > we'd rather not have the device suspended (like during system-wide
> > > > power transitions).
> > > 
> > > I'm not sure what you mean.  I was talking about pm_runtime_enable
> > > (which decrements depth), not pm_runtime_disable (which increments it).  
> > > When pm_runtime_enable finds that depth has gone to 0, it should queue
> > > a delayed autosuspend request.
> > 
> > OK, but I don't think that queuing a request without notifying the bus type is
> > the right thing to do.  IMO it's better to use ->runtime_idle() in that case
> > (in analogy with the situation in which the last child of a device has been
> > suspended).
> 
> Agreed.
> 
> 
> > > Autosuspend is disallowed if:
> > > 
> > > 	the driver doesn't support autosuspend;
> > > 
> > > 	the usage counter is > 0;
> > > 
> > > 	autosuspend has been disabled for this device;
> > > 
> > > 	the driver requires remote wakeup during autosuspend
> > > 	but the user has disallowed wakeup.
> > 
> > That's probably universal for all bus types and devices.
> 
> Probably.  But you haven't provided a way for the driver to indicate 
> that it requires wakeup.  It's not a big deal, since the 
> runtime_suspend method can do its own checking.
> 
> > > If everything else is okay but not enough time has elapsed since the
> > > device was last used, another delayed autosuspend request is queued and
> > > the current one fails with -EAGAIN.
> > 
> > I wouldn't like to do the automatic queuing at the core level, simply because
> > the core may not have enough information to make a correct decision.
> 
> Calling the notify_idle method would be good enough.
> 
> > > The model for asynchronous operation is that the usage counter remains
> > > always at 0, and the driver updates the time-of-last-use field whenever
> > > an I/O operation starts or completes.  The core keeps a delayed
> > > autosuspend request queued; each time the request runs it checks
> > > whether the device has been idle sufficiently long.  If not it
> > > requeues itself; otherwise it carries out an autosuspend.
> > 
> > Again, I think it's a bus type's decision whether or not to use such a
> > "permanent" suspend request.
> 
> Ironically, this model is different from the one I outlined above.  
> There's more than one way to do this, it's not clear which is best, and 
> AFAIK none of them have been implemented in a real driver yet.
> 
> > I think it probably is a good idea to store the time of last use in 'struct
> > device', so that bus types don't need to duplicate that field (all of them will
> > likely use it).  I'm not sure about the delay, though.  Well, I need some time
> > to think about it. :-)
> 
> All bus types will want to implement _some_ delay; it doesn't make
> sense to power down a device immediately after every operation and then
> power it back up for the next operation.

Sure.  But it seems you can use pm_request_suspend()'s delay to achieve that
without storing the delay in 'struct device'.

> But the time scales of the delays may vary widely.  Some devices might 
> be able to power up in a millisecond or less; others will require 
> seconds.  The delays should be set accordingly.

Agreed.
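
For reference, the time-of-last-use model discussed above could be sketched in
user space roughly as follows.  This is an assumption-laden illustration, not
part of the patch: autosusp_dev, its fields and model_autosuspend_check() are
made-up names, and times are plain tick counts rather than jiffies.

```c
#include <assert.h>
#include <stdbool.h>

struct autosusp_dev {
	unsigned long last_busy;	/* time of last use, in ticks */
	unsigned long delay;		/* required idle time, in ticks */
	bool suspended;
};

/*
 * Models one run of the delayed autosuspend request.  Returns 0 if the
 * device has been idle long enough and was suspended, or the number of
 * ticks to wait before the request should run again.
 */
unsigned long model_autosuspend_check(struct autosusp_dev *dev,
				      unsigned long now)
{
	unsigned long idle = now - dev->last_busy;

	assert(dev->delay > 0);

	if (idle >= dev->delay) {
		dev->suspended = true;
		return 0;
	}
	return dev->delay - idle;	/* requeue for the remaining time */
}
```

A bus type would pick 'delay' to match the device: a few ticks for hardware
that powers up in a millisecond, much more for hardware that needs seconds.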

OK

Below is a new patch.  It's been reworked quite a bit since the previous
version I sent and I don't think there's anything I'd like to add to it at this
point, unless something is evidently wrong.

Best,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM: Introduce core framework for run-time PM of I/O devices (rev. 2)

Introduce a core framework for run-time power management of I/O
devices.  Add device run-time PM fields to 'struct dev_pm_info'
and device run-time PM callbacks to 'struct dev_pm_ops'.  Introduce
a run-time PM workqueue and define some device run-time PM helper
functions at the core level.  Document all these things.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/runtime_pm.txt |  416 ++++++++++++++++++++++++
 drivers/base/dd.c                  |    9 
 drivers/base/power/Makefile        |    1 
 drivers/base/power/main.c          |    6 
 drivers/base/power/runtime.c       |  617 +++++++++++++++++++++++++++++++++++++
 include/linux/pm.h                 |   95 +++++
 include/linux/pm_runtime.h         |  148 ++++++++
 kernel/power/Kconfig               |   14 
 kernel/power/main.c                |   17 +
 9 files changed, 1320 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/power/Kconfig
===================================================================
--- linux-2.6.orig/kernel/power/Kconfig
+++ linux-2.6/kernel/power/Kconfig
@@ -208,3 +208,17 @@ config APM_EMULATION
 	  random kernel OOPSes or reboots that don't seem to be related to
 	  anything, try disabling/enabling this option (or disabling/enabling
 	  APM in your BIOS).
+
+config PM_RUNTIME
+	bool "Run-time PM core functionality"
+	depends on PM
+	---help---
+	  Enable functionality allowing I/O devices to be put into energy-saving
+	  (low power) states at run time (or autosuspended) after a specified
+	  period of inactivity and woken up in response to a hardware-generated
+	  wake-up event or a driver's request.
+
+	  Hardware support is generally required for this functionality to work
+	  and the bus type drivers of the buses the devices are on are
+	  responsible for the actual handling of the autosuspend requests and
+	  wake-up events.
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -11,6 +11,7 @@
 #include <linux/kobject.h>
 #include <linux/string.h>
 #include <linux/resume-trace.h>
+#include <linux/workqueue.h>
 
 #include "power.h"
 
@@ -217,8 +218,24 @@ static struct attribute_group attr_group
 	.attrs = g,
 };
 
+#ifdef CONFIG_PM_RUNTIME
+struct workqueue_struct *pm_wq;
+
+static int __init pm_start_workqueue(void)
+{
+	pm_wq = create_freezeable_workqueue("pm");
+
+	return pm_wq ? 0 : -ENOMEM;
+}
+#else
+static inline int pm_start_workqueue(void) { return 0; }
+#endif
+
 static int __init pm_init(void)
 {
+	int error = pm_start_workqueue();
+	if (error)
+		return error;
 	power_kobj = kobject_create_and_add("power", NULL);
 	if (!power_kobj)
 		return -ENOMEM;
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -22,6 +22,9 @@
 #define _LINUX_PM_H
 
 #include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -165,6 +168,28 @@ typedef struct pm_message {
  * It is allowed to unregister devices while the above callbacks are being
  * executed.  However, it is not allowed to unregister a device from within any
  * of its own callbacks.
+ *
+ * There also are the following callbacks related to run-time power management
+ * of devices:
+ *
+ * @runtime_suspend: Prepare the device for a condition in which it won't be
+ *	able to communicate with the CPU(s) and RAM due to power management.
+ *	This need not mean that the device should be put into a low power state.
+ *	For example, if the device is behind a link which is about to be turned
+ *	off, the device may remain at full power.  Still, if the device does go
+ *	to low power and if device_may_wakeup(dev) is true, remote wake-up
+ *	(i.e. hardware mechanism allowing the device to request a change of its
+ *	power state, such as PCI PME) should be enabled for it.
+ *
+ * @runtime_resume: Put the device into the fully active state in response to a
+ *	wake-up event generated by hardware or at a request of software.  If
+ *	necessary, put the device into the full power state and restore its
+ *	registers, so that it is fully operational.
+ *
+ * @runtime_idle: Device appears to be inactive and it might be put into a low
+ *	power state if all of the necessary conditions are satisfied.  Check
+ *	these conditions and handle the device as appropriate, possibly queueing
+ *	a suspend request for it.
  */
 
 struct dev_pm_ops {
@@ -182,6 +207,9 @@ struct dev_pm_ops {
 	int (*thaw_noirq)(struct device *dev);
 	int (*poweroff_noirq)(struct device *dev);
 	int (*restore_noirq)(struct device *dev);
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
 };
 
 /**
@@ -315,14 +343,75 @@ enum dpm_state {
 	DPM_OFF_IRQ,
 };
 
+/**
+ * Device run-time power management state.
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.  They do not
+ * reflect the actual power state of the device or its status as seen by the
+ * driver.
+ *
+ * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
+ *			pending for it.
+ *
+ * RPM_IDLE		It has been requested that the device be suspended.
+ *			Suspend request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
+ *			executed.
+ *
+ * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
+ *			completed successfully.  The device is regarded as
+ *			suspended.
+ *
+ * RPM_WAKE		It has been requested that the device be woken up.
+ *			Resume request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
+ *			executed.
+ *
+ * RPM_ERROR		Represents a condition from which the PM core cannot
+ *			recover by itself.  If the device's run-time PM status
+ *			field has this value, all of the run-time PM operations
+ *			carried out for the device by the core will fail, until
+ *			the status field is changed to either RPM_ACTIVE or
+ *			RPM_SUSPENDED (it is not valid to use the other values
+ *			in such a situation) by the device's driver or bus type.
+ *			This happens when the device bus type's
+ *			->runtime_suspend() or ->runtime_resume() callback
+ *			returns error code different from -EAGAIN or -EBUSY.
+ */
+
+#define RPM_ACTIVE	0
+#define RPM_IDLE	0x01
+#define RPM_SUSPENDING	0x02
+#define RPM_SUSPENDED	0x04
+#define RPM_WAKE	0x08
+#define RPM_RESUMING	0x10
+#define RPM_ERROR	0x1F
+
 struct dev_pm_info {
 	pm_message_t		power_state;
-	unsigned		can_wakeup:1;
-	unsigned		should_wakeup:1;
+	unsigned int		can_wakeup:1;
+	unsigned int		should_wakeup:1;
 	enum dpm_state		status;		/* Owned by the PM core */
-#ifdef	CONFIG_PM_SLEEP
+#ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
 #endif
+#ifdef CONFIG_PM_RUNTIME
+	struct delayed_work	suspend_work;
+	struct work_struct	resume_work;
+	struct completion	work_done;
+	unsigned int		ignore_children:1;
+	unsigned int		suspend_aborted:1;
+	unsigned int		runtime_status:5;
+	int			runtime_error;
+	atomic_t		resume_count;
+	int			child_count;
+	spinlock_t		lock;
+#endif
 };
 
 /*
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o
+obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
Index: linux-2.6/drivers/base/power/runtime.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/runtime.c
@@ -0,0 +1,617 @@
+/*
+ * drivers/base/power/runtime.c - Helper functions for device run-time PM
+ *
+ * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/pm_runtime.h>
+#include <linux/jiffies.h>
+
+/**
+ * __pm_get_child - Increment the counter of unsuspended children of a device.
+ * @dev: Device to handle.
+ */
+static void __pm_get_child(struct device *dev)
+{
+	dev->power.child_count++;
+}
+
+/**
+ * __pm_put_child - Decrement the counter of unsuspended children of a device.
+ * @dev: Device to handle.
+ */
+static void __pm_put_child(struct device *dev)
+{
+	if (dev->power.child_count > 0)
+		dev->power.child_count--;
+	else
+		dev_warn(dev, "Excessive %s!\n", __func__);
+}
+
+/**
+ * pm_runtime_notify_idle - Run a device bus type's runtime_idle() callback.
+ * @dev: Device to notify.
+ *
+ * Check if the resume counter of given device is zero and call the device bus
+ * type's ->runtime_idle() callback if that's the case.
+ */
+static void pm_runtime_notify_idle(struct device *dev)
+{
+	if (atomic_read(&dev->power.resume_count) > 0)
+		return;
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_idle)
+		dev->bus->pm->runtime_idle(dev);
+}
+
+/**
+ * pm_runtime_put_notify - Decrement the resume counter of a device.
+ * @dev: Device to handle.
+ *
+ * Decrement the resume counter of a device, check if it went down to zero and
+ * notify the device's bus type in that case.
+ */
+void pm_runtime_put_notify(struct device *dev)
+{
+	pm_runtime_put(dev);
+
+	if (pm_children_suspended(dev))
+		pm_runtime_notify_idle(dev);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_put_notify);
+
+/**
+ * __pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
+ * @dev: Device to suspend.
+ * @sync: If unset, the function has been called via pm_wq.
+ *
+ * Check if the status of the device is appropriate and run the
+ * ->runtime_suspend() callback provided by the device's bus type driver.
+ * Update the run-time PM flags in the device object to reflect the current
+ * status of the device.
+ */
+int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	struct device *parent = NULL;
+	unsigned long parflags = 0, flags;
+	int error = -EINVAL;
+
+	might_sleep();
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+ repeat:
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out;
+	} else if (dev->power.runtime_status & RPM_SUSPENDED) {
+		error = 0;
+		goto out;
+	} else if (atomic_read(&dev->power.resume_count) > 0
+	    || (!sync && dev->power.runtime_status == RPM_IDLE
+	    && dev->power.suspend_aborted)) {
+		/*
+		 * We're forbidden to suspend the device (eg. it may be
+		 * resuming) or a pending suspend request has just been
+		 * cancelled (by a concurrent suspend) and we're running as a
+		 * result of that request.
+		 */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status & RPM_SUSPENDING) {
+		/*
+		 * Another suspend is running in parallel with us.  Wait for it
+		 * to complete and return.
+		 */
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	} else if (sync && dev->power.runtime_status == RPM_IDLE
+	    && !dev->power.suspend_aborted) {
+		/*
+		 * Suspend request is pending, but we're not running as a result
+		 * of that request, so cancel it.  Since we're not clearing the
+		 * RPM_IDLE bit now, no new suspend requests will be queued up
+		 * while the pending one is waited for to finish.
+		 */
+		dev->power.suspend_aborted = true;
+
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+
+		cancel_delayed_work_sync(&dev->power.suspend_work);
+
+		spin_lock_irqsave(&dev->power.lock, flags);
+
+		/* Repeat if anything has changed. */
+		if (dev->power.runtime_status != RPM_IDLE
+		    || !dev->power.suspend_aborted)
+			goto repeat;
+	}
+
+	if (!pm_children_suspended(dev)) {
+		/*
+		 * We can only suspend the device if all of its children have
+		 * been suspended.
+		 */
+		dev->power.runtime_status = RPM_ACTIVE;
+		error = -EBUSY;
+		goto out;
+	}
+
+	dev->power.runtime_status = RPM_SUSPENDING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_suspend)
+		error = dev->bus->pm->runtime_suspend(dev);
+	parent = dev->parent;
+
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	switch (error) {
+	case 0:
+		/*
+		 * Resume request might have been queued up in the meantime, in
+		 * which case the RPM_WAKE bit is also set in runtime_status.
+		 */
+		dev->power.runtime_status &= ~RPM_SUSPENDING;
+		dev->power.runtime_status |= RPM_SUSPENDED;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+	if (!error && !(dev->power.runtime_status & RPM_WAKE) && parent) {
+		__pm_put_child(parent);
+
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		if (!parent->power.child_count
+		    && !parent->power.ignore_children)
+			pm_runtime_notify_idle(parent);
+
+		return 0;
+	}
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent)
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_suspend);
+
+/**
+ * pm_runtime_suspend_work - Run __pm_runtime_suspend() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the suspend has been scheduled for and
+ * run __pm_runtime_suspend() for it.
+ */
+static void pm_runtime_suspend_work(struct work_struct *work)
+{
+	__pm_runtime_suspend(suspend_work_to_device(work), false);
+}
+
+/**
+ * pm_request_suspend - Schedule run-time suspend of given device.
+ * @dev: Device to suspend.
+ * @msec: Time to wait before attempting to suspend the device, in milliseconds.
+ */
+void pm_request_suspend(struct device *dev, unsigned int msec)
+{
+	unsigned long flags;
+	unsigned long delay = msecs_to_jiffies(msec);
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ACTIVE
+	    || atomic_read(&dev->power.resume_count) > 0)
+		goto out;
+
+	dev->power.runtime_status = RPM_IDLE;
+	dev->power.suspend_aborted = false;
+	queue_delayed_work(pm_wq, &dev->power.suspend_work, delay);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_suspend);
+
+/**
+ * __pm_runtime_resume - Run a device bus type's runtime_resume() callback.
+ * @dev: Device to resume.
+ * @get: If set, increment the device's resume counter.
+ * @sync: If unset, the function has been called via pm_wq.
+ *
+ * Check if the device is really suspended and run the ->runtime_resume()
+ * callback provided by the device's bus type driver.  Update the run-time PM
+ * flags in the device object to reflect the current status of the device.  If
+ * runtime suspend is in progress while this function is being run, wait for it
+ * to finish before resuming the device.  If runtime suspend is scheduled, but
+ * it hasn't started yet, cancel it and we're done.
+ */
+int __pm_runtime_resume(struct device *dev, bool get, bool sync)
+{
+	struct device *parent = dev->parent;
+	unsigned long parflags = 0, flags;
+	bool put_parent = false;
+	unsigned int status;
+	int error = -EINVAL;
+
+	might_sleep();
+
+	/*
+	 * This makes concurrent __pm_runtime_suspend() and pm_request_suspend()
+	 * started after us, or restarted, return immediately, so only the ones
+	 * started before us can execute ->runtime_suspend().
+	 */
+	pm_runtime_get(dev);
+
+ repeat:
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+ repeat_locked:
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out;
+	} else if (dev->power.runtime_status == RPM_ACTIVE) {
+		error = 0;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_IDLE
+	    && !dev->power.suspend_aborted) {
+		/* Suspend request is pending, so cancel it. */
+		dev->power.suspend_aborted = true;
+
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		if (parent)
+			spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		cancel_delayed_work_sync(&dev->power.suspend_work);
+
+		if (parent)
+			spin_lock_irqsave(&parent->power.lock, parflags);
+		spin_lock_irqsave(&dev->power.lock, flags);
+
+		/* Repeat if anything has changed. */
+		if (dev->power.runtime_status != RPM_IDLE
+		    || !dev->power.suspend_aborted)
+			goto repeat_locked;
+
+		/*
+		 * Suspend request has been cancelled and there's nothing more
+		 * to do.  Clear the RPM_IDLE bit and return.
+		 */
+		dev->power.runtime_status = RPM_ACTIVE;
+		error = 0;
+		goto out;
+	}
+
+	if (sync && (dev->power.runtime_status & RPM_WAKE)) {
+		/*
+		 * Resume request is pending, so let it run, because it has to
+		 * decrement the resume counter of the device.
+		 */
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		if (parent)
+			spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		flush_work(&dev->power.resume_work);
+
+		goto repeat;
+	} else if (dev->power.runtime_status & RPM_SUSPENDING) {
+		/*
+		 * Suspend is running in parallel with us.  Wait for it to
+		 * complete and repeat.
+		 */
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		if (parent)
+			spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		wait_for_completion(&dev->power.work_done);
+
+		goto repeat;
+	} else if (!put_parent && dev->power.runtime_status == RPM_SUSPENDED
+	    && parent && parent->power.runtime_status != RPM_ACTIVE) {
+		/* The parent has to be resumed before we can continue. */
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		error = pm_runtime_resume_get(parent);
+		if (error)
+			return error;
+
+		put_parent = true;
+		error = -EINVAL;
+		goto repeat;
+	}
+
+	status = dev->power.runtime_status;
+	if (status == RPM_RESUMING)
+		goto unlock;
+
+	if (dev->power.runtime_status == RPM_SUSPENDED && parent)
+		__pm_get_child(parent);
+	dev->power.runtime_status = RPM_RESUMING;
+	init_completion(&dev->power.work_done);
+
+ unlock:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent) {
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+		/*
+		 * We can decrement the parent's resume counter right now,
+		 * because it can't be suspended anyway after the
+		 * __pm_get_child() above.
+		 */
+		if (put_parent)
+			pm_runtime_put(parent);
+		parent = NULL;
+	}
+
+	if (status == RPM_RESUMING) {
+		/*
+		 * There's another resume running in parallel with us. Wait for
+		 * it to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		error = dev->power.runtime_error;
+		goto out_put;
+	}
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_resume)
+		error = dev->bus->pm->runtime_resume(dev);
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	dev->power.runtime_status = error ? RPM_ERROR : RPM_ACTIVE;
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent) {
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+		if (put_parent)
+			pm_runtime_put(parent);
+	}
+
+ out_put:
+	/* Allow suspends to run if we are supposed to. */
+	if (!get || error)
+		pm_runtime_put_notify(dev);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_resume);
+
+/**
+ * pm_runtime_resume_work - Run __pm_runtime_resume() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the resume has been scheduled for and run
+ * __pm_runtime_resume() for it.
+ */
+static void pm_runtime_resume_work(struct work_struct *work)
+{
+	struct device *dev = resume_work_to_device(work);
+
+	__pm_runtime_resume(dev, false, false);
+	pm_runtime_put_notify(dev);
+}
+
+/**
+ * pm_cancel_suspend_work - Cancel a pending suspend request.
+ *
+ * Use @work to get the device object the work item has been scheduled for and
+ * cancel a pending suspend request for it.
+ */
+static void pm_cancel_suspend_work(struct work_struct *work)
+{
+	struct device *dev = resume_work_to_device(work);
+	unsigned long flags;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_IDLE
+	    || !dev->power.suspend_aborted)
+		goto out;
+	/*
+	 * Suspend request is pending, so cancel it.  __pm_runtime_resume() and
+	 * __pm_request_resume() will notice that suspend_aborted is true, so
+	 * they will return immediately.  Suspend requests and direct attempts
+	 * to suspend are blocked by the increased resume counter.
+	 */
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+
+	cancel_delayed_work_sync(&dev->power.suspend_work);
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	/* Clear the status if someone else hasn't done it already. */
+	if (dev->power.runtime_status == RPM_IDLE && dev->power.suspend_aborted)
+		dev->power.runtime_status = RPM_ACTIVE;
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+
+	pm_runtime_put_notify(dev);
+}
+
+/**
+ * __pm_request_resume - Schedule run-time resume of given device.
+ * @dev: Device to resume.
+ */
+int __pm_request_resume(struct device *dev, bool get)
+{
+	struct device *parent = dev->parent;
+	unsigned long parflags = 0, flags;
+	int error = 0;
+
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		error = -EINVAL;
+		goto out;
+	}
+
+	if (get)
+		pm_runtime_get(dev);
+
+	if (dev->power.runtime_status == RPM_ACTIVE) {
+		error = -EBUSY;
+		goto out;
+	} else if (dev->power.runtime_status & (RPM_WAKE | RPM_RESUMING)) {
+		error = -EINPROGRESS;
+		goto out;
+	}
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		error = -EBUSY;
+
+		if (dev->power.suspend_aborted)
+			goto out;
+
+		/* Suspend request is pending.  Queue a request to cancel it. */
+		dev->power.suspend_aborted = true;
+		INIT_WORK(&dev->power.resume_work, pm_cancel_suspend_work);
+		goto queue;
+	}
+
+	if (dev->power.runtime_status == RPM_SUSPENDED && parent)
+		__pm_get_child(parent);
+
+	/*
+	 * The device may be suspending at the moment and we can't clear the
+	 * RPM_SUSPENDING bit in its runtime_status just yet.
+	 */
+	dev->power.runtime_status |= RPM_WAKE;
+	INIT_WORK(&dev->power.resume_work, pm_runtime_resume_work);
+
+ queue:
+	pm_runtime_get(dev);
+	queue_work(pm_wq, &dev->power.resume_work);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent)
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(__pm_request_resume);
+
+/**
+ * __pm_runtime_clear_status - Change the run-time PM status of a device.
+ * @dev: Device to handle.
+ * @status: New value of the device's run-time PM status.
+ *
+ * Change the run-time PM status of the device to @status, which must be
+ * either RPM_ACTIVE or RPM_SUSPENDED, if its current value is equal to
+ * RPM_ERROR.
+ */
+void __pm_runtime_clear_status(struct device *dev, unsigned int status)
+{
+	struct device *parent = dev->parent;
+	unsigned long parflags = 0, flags;
+
+	if (status & ~RPM_SUSPENDED)
+		return;
+
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ERROR)
+		goto out;
+
+	dev->power.runtime_status = status;
+	if (parent && status == RPM_SUSPENDED)
+		__pm_put_child(parent);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent)
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_clear_status);
+
+/**
+ * pm_runtime_init - Initialize run-time PM fields in given device object.
+ * @dev: Device object to initialize.
+ */
+void pm_runtime_init(struct device *dev)
+{
+	struct device *parent = dev->parent;
+
+	spin_lock_init(&dev->power.lock);
+
+	dev->power.runtime_status = RPM_ACTIVE;
+	atomic_set(&dev->power.resume_count, 1);
+	pm_suspend_ignore_children(dev, false);
+	dev->power.child_count = 0;
+	INIT_DELAYED_WORK(&dev->power.suspend_work, pm_runtime_suspend_work);
+
+	if (parent) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&parent->power.lock, flags);
+		__pm_get_child(parent);
+		spin_unlock_irqrestore(&parent->power.lock, flags);
+	}
+}
+
+/**
+ * pm_runtime_close - Prepare for the removal of a device object.
+ * @dev: Device object being removed.
+ */
+void pm_runtime_close(struct device *dev)
+{
+	struct device *parent = dev->parent;
+	unsigned long flags;
+	unsigned int status;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	/* This makes __pm_runtime_suspend() return immediately. */
+	pm_runtime_get(dev);
+
+	while (dev->power.runtime_status & (RPM_SUSPENDING | RPM_RESUMING)) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+
+		wait_for_completion(&dev->power.work_done);
+
+		spin_lock_irqsave(&dev->power.lock, flags);
+	}
+	status = dev->power.runtime_status;
+
+	/* This makes __pm_runtime_resume() return immediately. */
+	dev->power.runtime_status = RPM_ACTIVE;
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+
+	if (status != RPM_SUSPENDED && parent) {
+		spin_lock_irqsave(&parent->power.lock, flags);
+		__pm_put_child(parent);
+		spin_unlock_irqrestore(&parent->power.lock, flags);
+	}
+}
Index: linux-2.6/include/linux/pm_runtime.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pm_runtime.h
@@ -0,0 +1,148 @@
+/*
+ * pm_runtime.h - Device run-time power management helper functions.
+ *
+ * Copyright (C) 2009 Rafael J. Wysocki <rjw@sisk.pl>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_RUNTIME_H
+#define _LINUX_PM_RUNTIME_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+#ifdef CONFIG_PM_RUNTIME
+
+extern struct workqueue_struct *pm_wq;
+
+extern void pm_runtime_init(struct device *dev);
+extern void pm_runtime_close(struct device *dev);
+extern void pm_runtime_put_notify(struct device *dev);
+extern int __pm_runtime_suspend(struct device *dev, bool sync);
+extern void pm_request_suspend(struct device *dev, unsigned int msec);
+extern int __pm_runtime_resume(struct device *dev, bool get, bool sync);
+extern int __pm_request_resume(struct device *dev, bool get);
+extern void __pm_runtime_clear_status(struct device *dev, unsigned int status);
+
+static inline struct device *suspend_work_to_device(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(dw, struct dev_pm_info, suspend_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline struct device *resume_work_to_device(struct work_struct *work)
+{
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(work, struct dev_pm_info, resume_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline void pm_runtime_get(struct device *dev)
+{
+	atomic_inc(&dev->power.resume_count);
+}
+
+static inline void pm_runtime_put(struct device *dev)
+{
+	if (!atomic_add_unless(&dev->power.resume_count, -1, 0))
+		dev_warn(dev, "Excessive %s!\n", __func__);
+}
+
+static inline bool pm_children_suspended(struct device *dev)
+{
+	return dev->power.ignore_children || !dev->power.child_count;
+}
+
+static inline bool pm_suspend_possible(struct device *dev)
+{
+	return pm_children_suspended(dev)
+		&& !atomic_read(&dev->power.resume_count)
+		&& !(dev->power.runtime_status & RPM_WAKE);
+}
+
+static inline void pm_suspend_ignore_children(struct device *dev, bool enable)
+{
+	dev->power.ignore_children = enable;
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+static inline void pm_runtime_init(struct device *dev) {}
+static inline void pm_runtime_close(struct device *dev) {}
+static inline void pm_runtime_put_notify(struct device *dev) {}
+static inline int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	return -ENOSYS;
+}
+static inline void pm_request_suspend(struct device *dev, unsigned int msec) {}
+static inline int __pm_runtime_resume(struct device *dev, bool get, bool sync)
+{
+	return -ENOSYS;
+}
+static inline int __pm_request_resume(struct device *dev, bool get)
+{
+	return -ENOSYS;
+}
+static inline void __pm_runtime_clear_status(struct device *dev,
+					      unsigned int status) {}
+
+static inline void pm_runtime_get(struct device *dev) {}
+static inline bool pm_children_suspended(struct device *dev) { return false; }
+static inline bool pm_suspend_possible(struct device *dev) { return false; }
+static inline void pm_suspend_ignore_children(struct device *dev, bool en) {}
+static inline void pm_runtime_put(struct device *dev) {}
+
+#endif /* !CONFIG_PM_RUNTIME */
+
+static inline int pm_runtime_suspend(struct device *dev)
+{
+	return __pm_runtime_suspend(dev, true);
+}
+
+static inline int pm_runtime_resume(struct device *dev)
+{
+	return __pm_runtime_resume(dev, false, true);
+}
+
+static inline int pm_runtime_resume_get(struct device *dev)
+{
+	return __pm_runtime_resume(dev, true, true);
+}
+
+static inline int pm_request_resume(struct device *dev)
+{
+	return __pm_request_resume(dev, false);
+}
+
+static inline int pm_request_resume_get(struct device *dev)
+{
+	return __pm_request_resume(dev, true);
+}
+
+static inline void pm_runtime_clear_active(struct device *dev)
+{
+	__pm_runtime_clear_status(dev, RPM_ACTIVE);
+}
+
+static inline void pm_runtime_clear_suspended(struct device *dev)
+{
+	__pm_runtime_clear_status(dev, RPM_SUSPENDED);
+}
+
+static inline void pm_runtime_enable(struct device *dev)
+{
+	pm_runtime_put(dev);
+}
+
+static inline void pm_runtime_disable(struct device *dev)
+{
+	pm_runtime_get(dev);
+	pm_runtime_resume(dev);
+}
+
+#endif
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -21,6 +21,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
@@ -88,6 +89,7 @@ void device_pm_add(struct device *dev)
 	}
 
 	list_add_tail(&dev->power.entry, &dpm_list);
+	pm_runtime_init(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -104,6 +106,7 @@ void device_pm_remove(struct device *dev
 		 kobject_name(&dev->kobj));
 	mutex_lock(&dpm_list_mtx);
 	list_del_init(&dev->power.entry);
+	pm_runtime_close(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -507,6 +510,7 @@ static void dpm_complete(pm_message_t st
 		get_device(dev);
 		if (dev->power.status > DPM_ON) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			mutex_unlock(&dpm_list_mtx);
 
 			device_complete(dev, state);
@@ -753,6 +757,7 @@ static int dpm_prepare(pm_message_t stat
 
 		get_device(dev);
 		dev->power.status = DPM_PREPARING;
+		pm_runtime_disable(dev);
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_prepare(dev, state);
@@ -760,6 +765,7 @@ static int dpm_prepare(pm_message_t stat
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			if (error == -EAGAIN) {
 				put_device(dev);
 				continue;
Index: linux-2.6/drivers/base/dd.c
===================================================================
--- linux-2.6.orig/drivers/base/dd.c
+++ linux-2.6/drivers/base/dd.c
@@ -23,6 +23,7 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
+#include <linux/pm_runtime.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -202,8 +203,12 @@ int driver_probe_device(struct device_dr
 	pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
 		 drv->bus->name, __func__, dev_name(dev), drv->name);
 
+	pm_runtime_disable(dev);
+
 	ret = really_probe(dev, drv);
 
+	pm_runtime_enable(dev);
+
 	return ret;
 }
 
@@ -306,6 +311,8 @@ static void __device_release_driver(stru
 
 	drv = dev->driver;
 	if (drv) {
+		pm_runtime_disable(dev);
+
 		driver_sysfs_remove(dev);
 
 		if (dev->bus)
@@ -320,6 +327,8 @@ static void __device_release_driver(stru
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
+
+		pm_runtime_enable(dev);
 	}
 }
 
Index: linux-2.6/Documentation/power/runtime_pm.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/power/runtime_pm.txt
@@ -0,0 +1,416 @@
+Run-time Power Management Framework for I/O Devices
+
+(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+1. Introduction
+
+Support for run-time power management (run-time PM) of I/O devices is provided
+at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queuing all work items related to run-time PM, because this allows
+  them to be synchronized with system-wide power transitions.  pm_wq is declared
+  in include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of run-time PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing run-time PM operations with one another.
+
+* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out run-time PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The device run-time PM fields of 'struct dev_pm_info', the helper functions
+using them and the run-time PM callbacks present in 'struct dev_pm_ops' are
+described below.
+
+2. Run-time PM Helper Functions and Device Fields
+
+The following helper functions are defined in drivers/base/power/runtime.c
+and include/linux/pm_runtime.h:
+
+* void pm_runtime_init(struct device *dev);
+* void pm_runtime_close(struct device *dev);
+
+* void pm_runtime_get(struct device *dev);
+* void pm_runtime_put(struct device *dev);
+* void pm_runtime_put_notify(struct device *dev);
+* int pm_runtime_suspend(struct device *dev);
+* void pm_request_suspend(struct device *dev, unsigned int msec);
+* int pm_runtime_resume(struct device *dev);
+* int pm_runtime_resume_get(struct device *dev);
+* int pm_request_resume(struct device *dev);
+
+* bool pm_suspend_possible(struct device *dev);
+
+* void pm_runtime_enable(struct device *dev);
+* void pm_runtime_disable(struct device *dev);
+
+* void pm_suspend_ignore_children(struct device *dev, bool enable);
+
+* void pm_runtime_clear_active(struct device *dev);
+* void pm_runtime_clear_suspended(struct device *dev);
+
+pm_runtime_init() initializes the run-time PM fields in the 'power' member of
+a device object.  It is called during the initialization of the device object,
+in drivers/base/power/main.c:device_pm_add().
+
+pm_runtime_close() disables the run-time PM of a device and updates the 'power'
+member of its parent's device object to take the removal of the device into
+account.  It is called during the destruction of the device object, in
+drivers/base/power/main.c:device_pm_remove().
+
+pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+pm_runtime_resume_get(), pm_request_resume(), and pm_request_resume_get()
+use the 'power.runtime_status', 'power.resume_count', 'power.suspend_aborted',
+and 'power.child_count' fields of 'struct device' for mutual cooperation.  In
+what follows the 'power.runtime_status', 'power.resume_count', and
+'power.child_count' fields are referred to as the device's run-time PM status,
+the device's resume counter, and the counter of unsuspended children of the
+device, respectively.  They are set to RPM_ACTIVE, 1 and 0, respectively, by
+pm_runtime_init().
+
+pm_runtime_get() is used to increase the device's resume counter by 1.  If the
+resume counter of the device is greater than 0, it will cause the PM core to
+refuse to suspend the device or to queue up a suspend request for it.  This may
+be useful if the device is resumed for a specific task and it shouldn't be
+suspended until the task is complete, but there are many potential sources of
+suspend requests that could disturb it.  It is valid to call this function from
+interrupt context.
+
+pm_runtime_put() is used to decrease the device's resume counter by 1 if it's
+greater than 0.  pm_runtime_put_notify() additionally checks if the device's
+resume counter is equal to zero (after it's just been decreased) and if all
+children of the device are suspended (or it has the 'power.ignore_children' flag
+set).  If that is the case, the ->runtime_idle() callback provided by the
+device's bus type is executed for it.
+
+pm_runtime_suspend() is used to carry out a run-time suspend of an active
+device.  It is called directly by a bus type or device driver, but internally
+it calls __pm_runtime_suspend() that is also used for asynchronous suspending of
+devices (i.e. to complete requests queued up by pm_request_suspend()) and works
+as follows.
+
+  * If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the
+    device's run-time PM status field, 'power.runtime_status'), success is
+    returned.
+
+  * If the device's resume counter is greater than 0 or the function has been
+    called via pm_wq as a result of a cancelled suspend request (the RPM_IDLE
+    bit is set in the device's run-time PM status field and its
+    'power.suspend_aborted' flag is set), -EAGAIN is returned.
+
+  * If the device is suspending (i.e. the RPM_SUSPENDING bit is set in its
+    run-time PM status field), which means that another instance of
+    __pm_runtime_suspend() is running at the same time for the same device, the
+    function waits for the other instance to complete and returns the result
+    returned by it.
+
+  * If the device has a pending suspend request (i.e. the RPM_IDLE bit is set in
+    its run-time PM status) and the function hasn't been called as a result of
+    that request, it cancels the request (synchronously) and restarts itself if
+    a concurrent suspend or resume is running in parallel with it or a resume
+    request has just been queued up.
+
+  * If the children of the device are not suspended and the
+    'power.ignore_children' flag is not set for it, the device's run-time PM
+    status is set to RPM_ACTIVE and -EAGAIN is returned.
+
+If none of the above takes place, or a pending suspend request has been
+successfully cancelled, the device's run-time PM status is set to RPM_SUSPENDING
+and its bus type's ->runtime_suspend() callback is executed.  This callback is
+entirely responsible for handling the device as appropriate (for example, it may
+choose to execute the device driver's ->runtime_suspend() callback or to carry
+out any other suitable action depending on the bus type).
+
+  * If it completes successfully, the RPM_SUSPENDING bit is cleared and the
+    RPM_SUSPENDED bit is set in the device's run-time PM status field.  Once
+    that has happened, the device is regarded by the PM core as suspended, but
+    it _need_ _not_ mean that the device has been put into a low power state.
+    What really occurs to the device at this point entirely depends on its bus
+    type (it may depend on the device's driver if the bus type chooses to call
+    it).  Additionally, if the device bus type's ->runtime_suspend() callback
+    completes successfully and there's no resume request pending for the device
+    (i.e. the RPM_WAKE flag is not set in its run-time PM status field), and the
+    device has a parent, the parent's counter of unsuspended children (i.e. the
+    'power.child_count' field) is decremented.  If that counter turns out to be
+    equal to zero (i.e. the device was the last unsuspended child of its parent)
+    and the parent's 'power.ignore_children' flag is unset, and the parent's
+    resume counter is equal to 0, its bus type's ->runtime_idle() callback is
+    executed for it.
+
+  * If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+    set to RPM_ACTIVE.
+
+  * If another error code is returned, the device's run-time PM status is set to
+    RPM_ERROR, which makes the PM core refuse to carry out any run-time PM
+    operations for it until the status is cleared by its bus type or driver with
+    the help of pm_runtime_clear_active() or pm_runtime_clear_suspended().
+
+Finally, pm_runtime_suspend() returns the result returned by the device bus
+type's ->runtime_suspend() callback.  If the device's bus type doesn't implement
+->runtime_suspend(), -EINVAL is returned and the device's run-time PM status is
+set to RPM_ERROR.
+
+pm_request_suspend() is used to queue up a suspend request for an active device.
+If the run-time PM status of the device (i.e. the value of the
+'power.runtime_status' field in 'struct device') is different from RPM_ACTIVE
+or its resume counter is greater than 0 (i.e. the device is not active from the
+PM core standpoint), the function returns immediately.  Otherwise, it changes
+the device's run-time PM status to RPM_IDLE and puts a request to suspend the
+device into pm_wq.  The 'msec' argument is used to specify the time to wait
+before the request will be completed, in milliseconds.  It is valid to call this
+function from interrupt context.
+
+pm_runtime_resume() and pm_runtime_resume_get() are used to carry out a
+run-time resume of a device that is suspended, suspending or has a suspend
+request pending.  They are called directly by a bus type or device driver and
+the difference between them is that pm_runtime_resume_get() leaves the device's
+resume counter incremented.  Internally, however, they both call
+__pm_runtime_resume() that is also used for asynchronous resuming of devices
+(i.e. to complete requests queued up by pm_request_resume() or
+pm_request_resume_get()).  It first increments the device's resume counter to
+prevent new suspend requests from being queued up and to make subsequent
+attempts to suspend the device fail.  The device's resume counter will be
+decremented on return, unless success is about to be returned and the function
+is requested to hold a reference to the device (i.e. in the
+pm_runtime_resume_get() case).
+
+After incrementing the device's resume counter, __pm_runtime_resume()
+proceeds as follows.
+
+  * If the device is active (i.e. all of the bits in its run-time PM status are
+    unset), success is returned.
+
+  * If there's a suspend request pending for the device (i.e. the RPM_IDLE bit
+    is set in the device's run-time PM status field), the
+    'power.suspend_aborted' flag is set for the device and the request is
+    cancelled (synchronously).  Then, the function restarts itself if the
+    device's RPM_IDLE bit was cleared or the 'power.suspend_aborted' flag was
+    unset in the meantime by a concurrent thread.  Otherwise, the device's
+    run-time PM status is cleared to RPM_ACTIVE and the function returns
+    success.
+
+  * If the device has a pending resume request (i.e. the RPM_WAKE bit is set in
+    its run-time PM status field), but the function hasn't been called as a
+    result of that request, the function waits for the request to complete and
+    restarts itself.
+
+  * If the device is suspending (i.e. the RPM_SUSPENDING bit is set in its
+    run-time PM status field), the function waits for the suspend operation to
+    complete and restarts itself.
+
+  * If the device is suspended and doesn't have a pending resume request (i.e.
+    its run-time PM status is RPM_SUSPENDED), and it has a parent that is not
+    active (i.e. the parent's run-time PM status is not RPM_ACTIVE),
+    pm_runtime_resume_get() is called (recursively) for the parent.  If the
+    parent's resume is successful, the function notes that the parent's resume
+    counter will have to be decremented and restarts itself.  Otherwise, it
+    returns the error code returned by the instance of pm_runtime_resume_get()
+    handling the device's parent.
+
+  * If the device is resuming (i.e. the device's run-time PM status is
+    RPM_RESUMING), which means that another instance of __pm_runtime_resume() is
+    running at the same time for the same device, the function waits for the
+    other instance to complete and returns the result returned by it.
+
+If none of the above happens, the function checks if the device's run-time PM
+status is RPM_SUSPENDED, which means that the device doesn't have a resume
+request pending, and if it has a parent.  If that is the case, the parent's
+counter of unsuspended children is increased.  Next, the device's run-time PM
+status is set to RPM_RESUMING and its bus type's ->runtime_resume() callback is
+executed.  This callback is entirely responsible for handling the device as
+appropriate (for example, it may choose to execute the device driver's
+->runtime_resume() callback or to carry out any other suitable action depending
+on the bus type).
+
+  * If it completes successfully, the device's run-time PM status is set to
+    RPM_ACTIVE, which means that the device is fully operational.  Thus, the
+    device bus type's ->runtime_resume() callback, when it is about to return
+    success, _must_ _ensure_ that this really is the case (i.e. when it returns
+    success, the device _must_ be able to carry out I/O operations as needed).
+
+  * If an error code is returned, the device's run-time PM status is set to
+    RPM_ERROR, which makes the PM core refuse to carry out any run-time PM
+    operations for the device until the status is cleared by its bus type or
+    driver with the help of either pm_runtime_clear_active(), or
+    pm_runtime_clear_suspended().  Thus, it is strongly recommended that bus
+    types' ->runtime_resume() callbacks only return error codes in fatal error
+    conditions, when it is impossible to bring the device back to the
+    operational state by any available means.  Inability to wake up a suspended
+    device usually means a service loss and it may very well result in a data
+    loss to the user, so it _must_ be regarded as a severe problem and avoided
+    if at all possible.
+
+Finally, __pm_runtime_resume() returns the result returned by the device bus
+type's ->runtime_resume() callback.  The device's resume counter is decremented
+right before the function returns, unless success is about to be returned and
+the function is requested to hold a reference to the device (i.e. in the
+pm_runtime_resume_get() case).  If the device's bus type doesn't implement
+->runtime_resume(), -EINVAL is returned and the device's run-time PM status is
+set to RPM_ERROR.
+
+pm_request_resume() and pm_request_resume_get() are used to queue up a resume
+request for a device that is suspended, suspending or has a suspend request
+pending.  The difference between them is that pm_request_resume_get() leaves the
+device's resume counter incremented, so the device cannot be suspended by
+__pm_runtime_suspend() after it has run.  Internally, they both call
+__pm_request_resume() which works as follows.
+
+* If the function is requested to take a reference to the device (i.e. in the
+  pm_request_resume_get() case), the device's resume counter is incremented.
+
+* If the run-time PM status of the device is RPM_ACTIVE, -EBUSY is returned.
+
+* If the device is resuming or has a resume request pending (i.e. at least one
+  of the RPM_WAKE and RPM_RESUMING bits is set in the device's run-time PM
+  status field), -EINPROGRESS is returned.
+
+* If the device's run-time status is RPM_IDLE (i.e. a suspend request is pending
+  for it) and the 'power.suspend_aborted' flag is set (i.e. the pending request
+  is being cancelled), -EBUSY is returned.
+
+* If the device's run-time status is RPM_IDLE (i.e. a suspend request is pending
+  for it) and the 'power.suspend_aborted' flag is not set, the device's
+  'power.suspend_aborted' flag is set, a request to cancel the pending suspend
+  request is queued up and the device's resume counter is increased (it will be
+  decreased by the work function when it's done its job).  Finally, -EBUSY is
+  returned.
+
+If none of the above happens, the function checks if the device's run-time PM
+status is RPM_SUSPENDED and if it has a parent, in which case the parent's
+counter of unsuspended children is incremented.  Next, the function grabs a
+reference to the device by increasing its resume counter (this reference is
+going to be dropped automatically after the __pm_runtime_resume() handling the
+request has run), the RPM_WAKE bit is set in the device's run-time PM status
+field and the request to execute __pm_runtime_resume() is put into pm_wq.
+Finally, the function returns 0, which means that the resume request has been
+successfully queued up.  It is valid to call this function from interrupt
+context.
+
+Note that it usually is _not_ safe to access the device for I/O purposes
+immediately after __pm_request_resume() has returned, unless the returned result
+is -EBUSY, which means that it wasn't necessary to resume the device.
+
+Note also that only one suspend request or one resume request may be queued up
+at any given moment.  Moreover, a resume request cannot be queued up along with
+a suspend request.  Still, if it's necessary to queue up a request to cancel a
+pending suspend request, these two requests will be present in pm_wq at the
+same time.  In that case, regardless of which of them is processed first,
+the device's run-time PM status will be set to RPM_ACTIVE as a final
+result.
+
+pm_suspend_possible() is used to check if the device may be suspended at this
+particular moment.  It returns 'true' only if the device's resume counter is
+equal to 0, no resume request is pending for it (the RPM_WAKE bit is unset),
+and all of its children are suspended or 'power.ignore_children' is set.
+
+pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
+respectively, all of the run-time PM core operations.  They do it by
+decrementing and incrementing, respectively, the device's resume counter, which
+also is done by pm_runtime_get() and pm_runtime_put().  However,
+pm_runtime_enable() doesn't notify the device's bus type of its resume counter
+reaching 0 and pm_runtime_disable() additionally calls pm_runtime_resume() for
+the device after incrementing its resume counter to ensure that it will not be
+suspended while its run-time PM is disabled.  Therefore, if pm_runtime_disable()
+is called several times in a row for the same device, it has to be balanced by
+the appropriate number of pm_runtime_enable() calls so that the other run-time
+PM core functions work for that device.  The initial value of the device's
+resume counter, as set by pm_runtime_init(), is 1 (i.e. the device's run-time PM
+is initially disabled).
+
+pm_runtime_disable() and pm_runtime_enable() are used by the device core to
+disable the run-time power management of devices temporarily during device probe
+and removal as well as during system-wide power transitions (i.e. system-wide
+suspend or hibernation, or resume from a system sleep state).
+
+pm_suspend_ignore_children() is used to set or unset the
+'power.ignore_children' flag in 'struct device'.  If the 'enable'
+argument is 'true', the field is set to 1, and if 'enable' is 'false', the field
+is set to 0.  The default value of 'power.ignore_children', as set by
+pm_runtime_init(), is 0.
+
+pm_runtime_clear_active() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_ACTIVE.  It is valid to call this function from
+interrupt context.
+
+pm_runtime_clear_suspended() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_SUSPENDED.  If the device has a parent, the
+function additionally decrements the parent's counter of unsuspended children,
+although the parent's bus type is not notified if the counter becomes 0.  It is
+valid to call this function from interrupt context.
+
+3. Device Run-time PM Callbacks
+
+There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend() callback is executed by pm_runtime_suspend() for the bus
+type of the device being suspended.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_suspend() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the bus type's ->runtime_suspend() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_suspend() callback has returned successfully,
+  the PM core regards the device as suspended, which need not mean that the
+  device has been put into a low power state.  It is supposed to mean, however,
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_ACTIVE, which means that the device
+  _must_ be fully operational once this has happened.
+* If the bus type's ->runtime_suspend() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+In particular, it is recommended that ->runtime_suspend() return -EBUSY or
+-EAGAIN if device_may_wakeup() returns 'false' for the device.  On the other
+hand, if device_may_wakeup() returns 'true' for the device and the device is put
+into a low power state during the execution of ->runtime_suspend(), it is
+expected that remote wake-up (i.e. hardware mechanism allowing the device to
+request a change of its power state, such as PCI PME) will be enabled for the
+device.  Generally, remote wake-up should be enabled whenever the device is put
+into a low power state at run time and is expected to receive input from the
+outside of the system.
+
+The ->runtime_resume() callback is executed by pm_runtime_resume() for the bus
+type of the device being woken up.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_resume() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the bus type's ->runtime_resume() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_resume() callback has returned successfully,
+  the PM core regards the device as fully operational, which means that the
+  device _must_ be able to complete I/O operations as needed.
+* If the bus type's ->runtime_resume() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_SUSPENDED, which is supposed to mean
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_resume() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_idle() callback is executed by pm_runtime_suspend() for the bus
+type of a device the children of which are all suspended (or which has the
+'power.ignore_children' flag set).  The action carried out by this
+callback is totally dependent on the bus type in question, but the expected
+action is to check if the device can be suspended (i.e. if all of the conditions
+necessary for suspending the device are met) and to queue up a suspend request
+for the device if that is the case.
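
The way the return value of the bus type's ->runtime_suspend() callback
is turned into a run-time PM status, as described in Sections 2 and 3 of
the document above, can be summarized with a small userspace model.  This
is only a sketch and not code from the patch: model_status,
model_runtime_suspend() and the cb_*() callbacks are hypothetical names
introduced for illustration, and the three enum values merely stand in
for RPM_SUSPENDED, RPM_ACTIVE and RPM_ERROR.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for RPM_ACTIVE, RPM_SUSPENDED and RPM_ERROR. */
enum model_status { MODEL_ACTIVE, MODEL_SUSPENDED, MODEL_ERROR };

/* Stand-in for the bus type's ->runtime_suspend() callback. */
typedef int (*model_suspend_cb)(void);

/*
 * Run the (possibly missing) callback and record the resulting status:
 *   0                 -> suspended
 *   -EBUSY or -EAGAIN -> still active
 *   any other error   -> error state (a missing callback counts as -EINVAL)
 */
static int model_runtime_suspend(model_suspend_cb cb,
				 enum model_status *status)
{
	int ret;

	assert(status);
	ret = cb ? cb() : -EINVAL;

	if (ret == 0)
		*status = MODEL_SUSPENDED;
	else if (ret == -EBUSY || ret == -EAGAIN)
		*status = MODEL_ACTIVE;
	else
		*status = MODEL_ERROR;

	return ret;
}

/* Example callbacks covering the three documented outcomes. */
static int cb_ok(void) { return 0; }
static int cb_busy(void) { return -EBUSY; }
static int cb_io_error(void) { return -EIO; }
```

In this model only the last case needs outside help to recover, which
corresponds to the bus type or driver calling pm_runtime_clear_active()
or pm_runtime_clear_suspended() in the document above.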


* [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-20  2:34                             ` Alan Stern
                                               ` (3 preceding siblings ...)
  (?)
@ 2009-06-20 23:38                             ` Rafael J. Wysocki
  -1 siblings, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-20 23:38 UTC (permalink / raw)
  To: Alan Stern; +Cc: Greg KH, LKML, ACPI Devel Maling List, linux-pm, Ingo Molnar

On Saturday 20 June 2009, Alan Stern wrote:
> On Sat, 20 Jun 2009, Rafael J. Wysocki wrote:
> 
> > I think we can grab a reference when queuing up a resume request and drop
> > it on the completion of it.  This way, suspend will be locked while we're
> > waiting for the resume to run, which I think is what we want.
> 
> But suspend is already blocked from the time a resume request is queued 
> until the resume completes, unless the suspend was underway when the 
> request was made.  So that doesn't seem to make sense.
> 
> This really all depends on how drivers use async autoresume.  Here's 
> one possible way they could be written:
> 
> irq_handler() {
> 	status = pm_request_resume();
> 	if (status indicates the device is currently resumed)
> 		handle_the_IO();
> 	else
> 		save_the_IO();
> }
> 
> runtime_resume_method() {
> 	handle_saved_IO();
> 	pm_request_suspend();	/* Could call pm_notify_idle instead */
> }
> 
> The implications of this design are:
> 
> 	pm_request_resume should return one code if the status already
> 	is RPM_WAKE and a different code if the resume request had to
> 	be queued (or one was already queued).

I did something like this in the patch below.
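
For illustration only, here is a rough user-space model of the return codes
that __pm_request_resume() in the patch below is meant to produce.  It is a
sketch, not the real code: the rr_dev structure and the rr_* helpers are
made-up names, and locking, parent handling and the resume counter are all
left out.

```c
#include <assert.h>
#include <errno.h>

/* Status values copied from the patch below. */
#define RPM_ACTIVE	0
#define RPM_IDLE	0x01
#define RPM_SUSPENDING	0x02
#define RPM_SUSPENDED	0x04
#define RPM_WAKE	0x08
#define RPM_RESUMING	0x10
#define RPM_ERROR	0x1F

/* Hypothetical stand-in for the relevant fields of struct dev_pm_info. */
struct rr_dev {
	unsigned int runtime_status;
	int suspend_aborted;
	int queued;	/* number of work items the model has "queued" */
};

/* Simplified model of the return codes of __pm_request_resume(). */
static int rr_request_resume(struct rr_dev *dev)
{
	assert(dev);

	if (dev->runtime_status == RPM_ERROR)
		return -EINVAL;
	if (dev->runtime_status == RPM_ACTIVE)
		return -EBUSY;		/* already active, nothing queued */
	if (dev->runtime_status & (RPM_WAKE | RPM_RESUMING))
		return -EINPROGRESS;	/* resume already queued or running */
	if (dev->runtime_status == RPM_IDLE) {
		/* Still operational; cancel the pending suspend, once. */
		if (!dev->suspend_aborted) {
			dev->suspend_aborted = 1;
			dev->queued++;
		}
		return -EBUSY;
	}
	/* Suspended or suspending: a resume request has to be queued. */
	dev->runtime_status |= RPM_WAKE;
	dev->queued++;
	return 0;
}

/* Hypothetical caller-side check: can the I/O be handled right away? */
static int rr_can_handle_io_now(int status)
{
	return status == -EBUSY;
}
```

In this model an interrupt handler would handle the I/O immediately when
rr_can_handle_io_now() is true and save it for ->runtime_resume() when the
result is 0 or -EINPROGRESS.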

> 	pm_request_suspend should run very quickly, since it will be
> 	called after every I/O operation.  Likewise, pm_request_resume
> 	should run very quickly if the status is RPM_ACTIVE or 
> 	RPM_IDLE.

Hmm.  pm_request_suspend() is really short, so it should be fast.
pm_request_resume() is a bit more complicated, though (it takes two spinlocks,
increases an atomic counter, possibly twice, and queues up a work item, also
in the RPM_IDLE case).

> 	In order to prevent autosuspends from occurring while I/O is
> 	in progress, the pm_request_resume call should increment the
> 	usage counter (if it had to queue the request) and the 
> 	pm_request_suspend call should decrement it (maybe after
> 	waiting for the delay).

I wouldn't like pm_request_suspend() to do that, because it's valid to
call it many times in a row (only the first request will be queued in such a
case).

I'd prefer the caller to do pm_request_resume_get() (please see the patch
below) to put a resume request into the queue and then pm_runtime_put_notify()
when it's done with the I/O.  That will result in ->runtime_idle() being called
automatically if the device may be suspended.
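
As a minimal sketch of the counting I have in mind (user-space C, not the
kernel code; cnt_dev and the cnt_* functions are hypothetical names and a
plain int stands in for the atomic resume counter), the idle callback only
runs when the last reference is dropped and no unsuspended children remain:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the counters kept in struct dev_pm_info. */
struct cnt_dev {
	int resume_count;	/* models power.resume_count */
	int child_count;	/* models power.child_count */
	bool ignore_children;	/* models power.ignore_children */
	int idle_calls;		/* how often the idle callback has run */
};

/* Stands in for the bus type's ->runtime_idle() callback. */
static void cnt_runtime_idle(struct cnt_dev *dev)
{
	dev->idle_calls++;
}

/* Models pm_runtime_get(): block suspends while I/O is in flight. */
static void cnt_get(struct cnt_dev *dev)
{
	assert(dev);
	dev->resume_count++;
}

/* Models pm_runtime_put_notify(): drop a reference, then notify if idle. */
static void cnt_put_notify(struct cnt_dev *dev)
{
	assert(dev);

	if (dev->resume_count > 0)
		dev->resume_count--;

	/* Notify only if children allow it and nobody else holds a reference. */
	if ((dev->ignore_children || !dev->child_count)
	    && dev->resume_count == 0)
		cnt_runtime_idle(dev);
}
```

A driver following this scheme would pair every "get" made when I/O starts
with one "put" when it completes, so overlapping I/O keeps the counter above
zero until the last operation finishes.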

> > OK, I think I'll try to do the counting, although it may be difficult to handle
> > all of the corner cases.
> 
> No, I agree it's not worth worrying about for now.  It can always be 
> added later.

Well, I've done it already, so I'd prefer to keep it, unless it's broken. ;-)

> > > > > There might be some obscure other reason, but in general depth going
> > > > > to 0 means a delayed autosuspend request should be queued.
> > > > 
> > > > OK there, but pm_runtime_disable() is called by the core in some places where
> > > > we'd rather not want the device to be suspended (like during system-wide
> > > > power transitions).
> > > 
> > > I'm not sure what you mean.  I was talking about pm_runtime_enable
> > > (which decrements depth), not pm_runtime_disable (which increments it).  
> > > When pm_runtime_enable finds that depth has gone to 0, it should queue
> > > a delayed autosuspend request.
> > 
> > OK, but I don't think that queuing a request without notifying the bus type is
> > the right thing to do.  IMO it's better to use ->runtime_idle() in that case
> > (in analogy with the situation in which the last child of a device has been
> > suspended).
> 
> Agreed.
> 
> 
> > > Autosuspend is disallowed if:
> > > 
> > > 	the driver doesn't support autosuspend;
> > > 
> > > 	the usage counter is > 0;
> > > 
> > > 	autosuspend has been disabled for this device;
> > > 
> > > 	the driver requires remote wakeup during autosuspend
> > > 	but the user has disallowed wakeup.
> > 
> > That's probably universal for all bus types and devices.
> 
> Probably.  But you haven't provided a way for the driver to indicate 
> that it requires wakeup.  It's not a big deal, since the 
> runtime_suspend method can do its own checking.
> 
> > > If everything else is okay but not enough time has elapsed since the
> > > device was last used, another delayed autosuspend request is queued and
> > > the current one fails with -EAGAIN.
> > 
> > I wouldn't like to do the automatic queuing at the core level, simply because
> > the core may not have enough information to make a correct decision.
> 
> Calling the notify_idle method would be good enough.
> 
> > > The model for asynchronous operation is that the usage counter remains
> > > always at 0, and the driver updates the time-of-last-use field whenever
> > > an I/O operation starts or completes.  The core keeps a delayed
> > > autosuspend request queued; each time the request runs it checks
> > > whether the device has been idle sufficiently long.  If not it
> > > requeues itself; otherwise it carries out an autosuspend.
> > 
> > Again, I think it's a bus type's decision whether or not to use such a
> > "permanent" suspend request.
> 
> Ironically, this model is different from the one I outlined above.  
> There's more than one way to do this, it's not clear which is best, and 
> AFAIK none of them have been implemented in a real driver yet.
> 
> > I think it probably is a good idea to store the time of last use in 'struct
> > device', so that bus types don't need to duplicate that field (all of them will
> > likely use it).  I'm not sure about the delay, though.  Well, I need some time
> > to think about it. :-)
> 
> All bus types will want to implement _some_ delay; it doesn't make
> sense to power down a device immediately after every operation and then
> power it back up for the next operation.

Sure.  But it seems you can use the delay argument of pm_request_suspend() to
achieve that without storing the delay in 'struct device'.

> But the time scales of the delays may vary widely.  Some devices might 
> be able to power up in a millisecond or less; others will require 
> seconds.  The delays should be set accordingly.

Agreed.

OK

Below is a new patch.  It's been reworked quite a bit since the previous
version I sent and I don't think there's anything I'd like to add to it at this
point, unless something is evidently wrong.

Best,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM: Introduce core framework for run-time PM of I/O devices (rev. 2)

Introduce a core framework for run-time power management of I/O
devices.  Add device run-time PM fields to 'struct dev_pm_info'
and device run-time PM callbacks to 'struct dev_pm_ops'.  Introduce
a run-time PM workqueue and define some device run-time PM helper
functions at the core level.  Document all these things.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/runtime_pm.txt |  416 ++++++++++++++++++++++++
 drivers/base/dd.c                  |    9 
 drivers/base/power/Makefile        |    1 
 drivers/base/power/main.c          |    6 
 drivers/base/power/runtime.c       |  617 +++++++++++++++++++++++++++++++++++++
 include/linux/pm.h                 |   95 +++++
 include/linux/pm_runtime.h         |  148 ++++++++
 kernel/power/Kconfig               |   14 
 kernel/power/main.c                |   17 +
 9 files changed, 1320 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/power/Kconfig
===================================================================
--- linux-2.6.orig/kernel/power/Kconfig
+++ linux-2.6/kernel/power/Kconfig
@@ -208,3 +208,17 @@ config APM_EMULATION
 	  random kernel OOPSes or reboots that don't seem to be related to
 	  anything, try disabling/enabling this option (or disabling/enabling
 	  APM in your BIOS).
+
+config PM_RUNTIME
+	bool "Run-time PM core functionality"
+	depends on PM
+	---help---
+	  Enable functionality allowing I/O devices to be put into energy-saving
+	  (low power) states at run time (or autosuspended) after a specified
+	  period of inactivity and woken up in response to a hardware-generated
+	  wake-up event or a driver's request.
+
+	  Hardware support is generally required for this functionality to work
+	  and the bus type drivers of the buses the devices are on are
+	  responsible for the actual handling of the autosuspend requests and
+	  wake-up events.
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -11,6 +11,7 @@
 #include <linux/kobject.h>
 #include <linux/string.h>
 #include <linux/resume-trace.h>
+#include <linux/workqueue.h>
 
 #include "power.h"
 
@@ -217,8 +218,24 @@ static struct attribute_group attr_group
 	.attrs = g,
 };
 
+#ifdef CONFIG_PM_RUNTIME
+struct workqueue_struct *pm_wq;
+
+static int __init pm_start_workqueue(void)
+{
+	pm_wq = create_freezeable_workqueue("pm");
+
+	return pm_wq ? 0 : -ENOMEM;
+}
+#else
+static inline int pm_start_workqueue(void) { return 0; }
+#endif
+
 static int __init pm_init(void)
 {
+	int error = pm_start_workqueue();
+	if (error)
+		return error;
 	power_kobj = kobject_create_and_add("power", NULL);
 	if (!power_kobj)
 		return -ENOMEM;
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -22,6 +22,9 @@
 #define _LINUX_PM_H
 
 #include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -165,6 +168,28 @@ typedef struct pm_message {
  * It is allowed to unregister devices while the above callbacks are being
  * executed.  However, it is not allowed to unregister a device from within any
  * of its own callbacks.
+ *
+ * There also are the following callbacks related to run-time power management
+ * of devices:
+ *
+ * @runtime_suspend: Prepare the device for a condition in which it won't be
+ *	able to communicate with the CPU(s) and RAM due to power management.
+ *	This need not mean that the device should be put into a low power state.
+ *	For example, if the device is behind a link which is about to be turned
+ *	off, the device may remain at full power.  Still, if the device does go
+ *	to low power and if device_may_wakeup(dev) is true, remote wake-up
+ *	(i.e. hardware mechanism allowing the device to request a change of its
+ *	power state, such as PCI PME) should be enabled for it.
+ *
+ * @runtime_resume: Put the device into the fully active state in response to a
+ *	wake-up event generated by hardware or at a request of software.  If
+ *	necessary, put the device into the full power state and restore its
+ *	registers, so that it is fully operational.
+ *
+ * @runtime_idle: Device appears to be inactive and it might be put into a low
+ *	power state if all of the necessary conditions are satisfied.  Check
+ *	these conditions and handle the device as appropriate, possibly queueing
+ *	a suspend request for it.
  */
 
 struct dev_pm_ops {
@@ -182,6 +207,9 @@ struct dev_pm_ops {
 	int (*thaw_noirq)(struct device *dev);
 	int (*poweroff_noirq)(struct device *dev);
 	int (*restore_noirq)(struct device *dev);
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
 };
 
 /**
@@ -315,14 +343,75 @@ enum dpm_state {
 	DPM_OFF_IRQ,
 };
 
+/**
+ * Device run-time power management state.
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.  They do not
+ * reflect the actual power state of the device or its status as seen by the
+ * driver.
+ *
+ * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
+ *			pending for it.
+ *
+ * RPM_IDLE		It has been requested that the device be suspended.
+ *			Suspend request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
+ *			executed.
+ *
+ * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
+ *			completed successfully.  The device is regarded as
+ *			suspended.
+ *
+ * RPM_WAKE		It has been requested that the device be woken up.
+ *			Resume request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
+ *			executed.
+ *
+ * RPM_ERROR		Represents a condition from which the PM core cannot
+ *			recover by itself.  If the device's run-time PM status
+ *			field has this value, all of the run-time PM operations
+ *			carried out for the device by the core will fail, until
+ *			the status field is changed to either RPM_ACTIVE or
+ *			RPM_SUSPENDED (it is not valid to use the other values
+ *			in such a situation) by the device's driver or bus type.
+ *			This happens when the device bus type's
+ *			->runtime_suspend() or ->runtime_resume() callback
+ *			returns an error code different from -EAGAIN or -EBUSY.
+ */
+
+#define RPM_ACTIVE	0
+#define RPM_IDLE	0x01
+#define RPM_SUSPENDING	0x02
+#define RPM_SUSPENDED	0x04
+#define RPM_WAKE	0x08
+#define RPM_RESUMING	0x10
+#define RPM_ERROR	0x1F
+
 struct dev_pm_info {
 	pm_message_t		power_state;
-	unsigned		can_wakeup:1;
-	unsigned		should_wakeup:1;
+	unsigned int		can_wakeup:1;
+	unsigned int		should_wakeup:1;
 	enum dpm_state		status;		/* Owned by the PM core */
-#ifdef	CONFIG_PM_SLEEP
+#ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
 #endif
+#ifdef CONFIG_PM_RUNTIME
+	struct delayed_work	suspend_work;
+	struct work_struct	resume_work;
+	struct completion	work_done;
+	unsigned int		ignore_children:1;
+	unsigned int		suspend_aborted:1;
+	unsigned int		runtime_status:5;
+	int			runtime_error;
+	atomic_t		resume_count;
+	int			child_count;
+	spinlock_t		lock;
+#endif
 };
 
 /*
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o
+obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
Index: linux-2.6/drivers/base/power/runtime.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/runtime.c
@@ -0,0 +1,617 @@
+/*
+ * drivers/base/power/runtime.c - Helper functions for device run-time PM
+ *
+ * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/pm_runtime.h>
+#include <linux/jiffies.h>
+
+/**
+ * __pm_get_child - Increment the counter of unsuspended children of a device.
+ * @dev: Device to handle.
+ */
+static void __pm_get_child(struct device *dev)
+{
+	dev->power.child_count++;
+}
+
+/**
+ * __pm_put_child - Decrement the counter of unsuspended children of a device.
+ * @dev: Device to handle.
+ */
+static void __pm_put_child(struct device *dev)
+{
+	if (dev->power.child_count > 0)
+		dev->power.child_count--;
+	else
+		dev_warn(dev, "Excessive %s!\n", __func__);
+}
+
+/**
+ * pm_runtime_notify_idle - Run a device bus type's runtime_idle() callback.
+ * @dev: Device to notify.
+ *
+ * Check if the resume counter of given device is zero and call the device bus
+ * type's ->runtime_idle() callback if that's the case.
+ */
+static void pm_runtime_notify_idle(struct device *dev)
+{
+	if (atomic_read(&dev->power.resume_count) > 0)
+		return;
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_idle)
+		dev->bus->pm->runtime_idle(dev);
+}
+
+/**
+ * pm_runtime_put_notify - Decrement the resume counter and notify the bus type.
+ * @dev: Device to handle.
+ *
+ * Decrement the resume counter of a device, check if it went down to zero and
+ * notify the device's bus type in that case.
+ */
+void pm_runtime_put_notify(struct device *dev)
+{
+	pm_runtime_put(dev);
+
+	if (pm_children_suspended(dev))
+		pm_runtime_notify_idle(dev);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_put_notify);
+
+/**
+ * __pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
+ * @dev: Device to suspend.
+ * @sync: If unset, the function has been called via pm_wq.
+ *
+ * Check if the status of the device is appropriate and run the
+ * ->runtime_suspend() callback provided by the device's bus type driver.
+ * Update the run-time PM flags in the device object to reflect the current
+ * status of the device.
+ */
+int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	struct device *parent = NULL;
+	unsigned long parflags = 0, flags;
+	int error = -EINVAL;
+
+	might_sleep();
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+ repeat:
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out;
+	} else if (dev->power.runtime_status & RPM_SUSPENDED) {
+		error = 0;
+		goto out;
+	} else if (atomic_read(&dev->power.resume_count) > 0
+	    || (!sync && dev->power.runtime_status == RPM_IDLE
+	    && dev->power.suspend_aborted)) {
+		/*
+		 * We're forbidden to suspend the device (eg. it may be
+		 * resuming) or a pending suspend request has just been
+		 * cancelled (by a concurrent suspend) and we're running as a
+		 * result of that request.
+		 */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status & RPM_SUSPENDING) {
+		/*
+		 * Another suspend is running in parallel with us.  Wait for it
+		 * to complete and return.
+		 */
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	} else if (sync && dev->power.runtime_status == RPM_IDLE
+	    && !dev->power.suspend_aborted) {
+		/*
+		 * Suspend request is pending, but we're not running as a result
+		 * of that request, so cancel it.  Since we're not clearing the
+		 * RPM_IDLE bit now, no new suspend requests will be queued up
+		 * while the pending one is waited for to finish.
+		 */
+		dev->power.suspend_aborted = true;
+
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+
+		cancel_delayed_work_sync(&dev->power.suspend_work);
+
+		spin_lock_irqsave(&dev->power.lock, flags);
+
+		/* Repeat if anything has changed. */
+		if (dev->power.runtime_status != RPM_IDLE
+		    || !dev->power.suspend_aborted)
+			goto repeat;
+	}
+
+	if (!pm_children_suspended(dev)) {
+		/*
+		 * We can only suspend the device if all of its children have
+		 * been suspended.
+		 */
+		dev->power.runtime_status = RPM_ACTIVE;
+		error = -EBUSY;
+		goto out;
+	}
+
+	dev->power.runtime_status = RPM_SUSPENDING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_suspend)
+		error = dev->bus->pm->runtime_suspend(dev);
+	parent = dev->parent;
+
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	switch (error) {
+	case 0:
+		/*
+		 * Resume request might have been queued up in the meantime, in
+		 * which case the RPM_WAKE bit is also set in runtime_status.
+		 */
+		dev->power.runtime_status &= ~RPM_SUSPENDING;
+		dev->power.runtime_status |= RPM_SUSPENDED;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+	if (!error && !(dev->power.runtime_status & RPM_WAKE) && parent) {
+		__pm_put_child(parent);
+
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		if (!parent->power.child_count
+		    && !parent->power.ignore_children)
+			pm_runtime_notify_idle(parent);
+
+		return 0;
+	}
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent)
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_suspend);
+
+/**
+ * pm_runtime_suspend_work - Run pm_runtime_suspend() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the suspend has been scheduled for and
+ * run pm_runtime_suspend() for it.
+ */
+static void pm_runtime_suspend_work(struct work_struct *work)
+{
+	__pm_runtime_suspend(suspend_work_to_device(work), false);
+}
+
+/**
+ * pm_request_suspend - Schedule run-time suspend of given device.
+ * @dev: Device to suspend.
+ * @msec: Time to wait before attempting to suspend the device, in milliseconds.
+ */
+void pm_request_suspend(struct device *dev, unsigned int msec)
+{
+	unsigned long flags;
+	unsigned long delay = msecs_to_jiffies(msec);
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ACTIVE
+	    || atomic_read(&dev->power.resume_count) > 0)
+		goto out;
+
+	dev->power.runtime_status = RPM_IDLE;
+	dev->power.suspend_aborted = false;
+	queue_delayed_work(pm_wq, &dev->power.suspend_work, delay);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_suspend);
+
+/**
+ * __pm_runtime_resume - Run a device bus type's runtime_resume() callback.
+ * @dev: Device to resume.
+ * @get: If set, increment the device's resume counter.
+ * @sync: If unset, the function has been called via pm_wq.
+ *
+ * Check if the device is really suspended and run the ->runtime_resume()
+ * callback provided by the device's bus type driver.  Update the run-time PM
+ * flags in the device object to reflect the current status of the device.  If
+ * runtime suspend is in progress while this function is being run, wait for it
+ * to finish before resuming the device.  If runtime suspend is scheduled, but
+ * it hasn't started yet, cancel it and we're done.
+ */
+int __pm_runtime_resume(struct device *dev, bool get, bool sync)
+{
+	struct device *parent = dev->parent;
+	unsigned long parflags = 0, flags;
+	bool put_parent = false;
+	unsigned int status;
+	int error = -EINVAL;
+
+	might_sleep();
+
+	/*
+	 * This makes concurrent __pm_runtime_suspend() and pm_request_suspend()
+	 * started after us, or restarted, return immediately, so only the ones
+	 * started before us can execute ->runtime_suspend().
+	 */
+	pm_runtime_get(dev);
+
+ repeat:
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+ repeat_locked:
+	if (dev->power.runtime_status == RPM_ERROR) {
+		goto out;
+	} else if (dev->power.runtime_status == RPM_ACTIVE) {
+		error = 0;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_IDLE
+	    && !dev->power.suspend_aborted) {
+		/* Suspend request is pending, so cancel it. */
+		dev->power.suspend_aborted = true;
+
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		if (parent)
+			spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		cancel_delayed_work_sync(&dev->power.suspend_work);
+
+		if (parent)
+			spin_lock_irqsave(&parent->power.lock, parflags);
+		spin_lock_irqsave(&dev->power.lock, flags);
+
+		/* Repeat if anything has changed. */
+		if (dev->power.runtime_status != RPM_IDLE
+		    || !dev->power.suspend_aborted)
+			goto repeat_locked;
+
+		/*
+		 * Suspend request has been cancelled and there's nothing more
+		 * to do.  Clear the RPM_IDLE bit and return.
+		 */
+		dev->power.runtime_status = RPM_ACTIVE;
+		error = 0;
+		goto out;
+	}
+
+	if (sync && (dev->power.runtime_status & RPM_WAKE)) {
+		/*
+		 * Resume request is pending, so let it run, because it has to
+		 * decrement the resume counter of the device.
+		 */
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		if (parent)
+			spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		flush_work(&dev->power.resume_work);
+
+		goto repeat;
+	} else if (dev->power.runtime_status & RPM_SUSPENDING) {
+		/*
+		 * Suspend is running in parallel with us.  Wait for it to
+		 * complete and repeat.
+		 */
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		if (parent)
+			spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		wait_for_completion(&dev->power.work_done);
+
+		goto repeat;
+	} else if (!put_parent && dev->power.runtime_status == RPM_SUSPENDED
+	    && parent && parent->power.runtime_status != RPM_ACTIVE) {
+		/* The parent has to be resumed before we can continue. */
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+		error = pm_runtime_resume_get(parent);
+		if (error)
+			return error;
+
+		put_parent = true;
+		error = -EINVAL;
+		goto repeat;
+	}
+
+	status = dev->power.runtime_status;
+	if (status == RPM_RESUMING)
+		goto unlock;
+
+	if (dev->power.runtime_status == RPM_SUSPENDED && parent)
+		__pm_get_child(parent);
+	dev->power.runtime_status = RPM_RESUMING;
+	init_completion(&dev->power.work_done);
+
+ unlock:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent) {
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+		/*
+		 * We can decrement the parent's resume counter right now,
+		 * because it can't be suspended anyway after the
+		 * __pm_get_child() above.
+		 */
+		if (put_parent)
+			pm_runtime_put(parent);
+		parent = NULL;
+	}
+
+	if (status == RPM_RESUMING) {
+		/*
+		 * There's another resume running in parallel with us. Wait for
+		 * it to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		error = dev->power.runtime_error;
+		goto out_put;
+	}
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_resume)
+		error = dev->bus->pm->runtime_resume(dev);
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	dev->power.runtime_status = error ? RPM_ERROR : RPM_ACTIVE;
+	dev->power.runtime_error = error;
+	complete_all(&dev->power.work_done);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent) {
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+		if (put_parent)
+			pm_runtime_put(parent);
+	}
+
+ out_put:
+	/* Allow suspends to run if we are supposed to. */
+	if (!get || error)
+		pm_runtime_put_notify(dev);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_resume);
+
+/**
+ * pm_runtime_resume_work - Run __pm_runtime_resume() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the resume has been scheduled for and run
+ * __pm_runtime_resume() for it.
+ */
+static void pm_runtime_resume_work(struct work_struct *work)
+{
+	struct device *dev = resume_work_to_device(work);
+
+	__pm_runtime_resume(dev, false, false);
+	pm_runtime_put_notify(dev);
+}
+
+/**
+ * pm_cancel_suspend_work - Cancel a pending suspend request.
+ *
+ * Use @work to get the device object the work item has been scheduled for and
+ * cancel a pending suspend request for it.
+ */
+static void pm_cancel_suspend_work(struct work_struct *work)
+{
+	struct device *dev = resume_work_to_device(work);
+	unsigned long flags;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_IDLE
+	    || !dev->power.suspend_aborted)
+		goto out;
+	/*
+	 * Suspend request is pending, so cancel it.  __pm_runtime_resume() and
+	 * __pm_request_resume() will notice that suspend_aborted is true, so
+	 * they will return immediately.  Suspend requests and direct attempts
+	 * to suspend are blocked by the increased resume counter.
+	 */
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+
+	cancel_delayed_work_sync(&dev->power.suspend_work);
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	/* Clear the status if someone else hasn't done it already. */
+	if (dev->power.runtime_status == RPM_IDLE && dev->power.suspend_aborted)
+		dev->power.runtime_status = RPM_ACTIVE;
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+
+	pm_runtime_put_notify(dev);
+}
+
+/**
+ * __pm_request_resume - Schedule run-time resume of given device.
+ * @dev: Device to resume.
+ */
+int __pm_request_resume(struct device *dev, bool get)
+{
+	struct device *parent = dev->parent;
+	unsigned long parflags = 0, flags;
+	int error = 0;
+
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_ERROR) {
+		error = -EINVAL;
+		goto out;
+	}
+
+	if (get)
+		pm_runtime_get(dev);
+
+	if (dev->power.runtime_status == RPM_ACTIVE) {
+		error = -EBUSY;
+		goto out;
+	} else if (dev->power.runtime_status & (RPM_WAKE | RPM_RESUMING)) {
+		error = -EINPROGRESS;
+		goto out;
+	}
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		error = -EBUSY;
+
+		if (dev->power.suspend_aborted)
+			goto out;
+
+		/* Suspend request is pending.  Queue a request to cancel it. */
+		dev->power.suspend_aborted = true;
+		INIT_WORK(&dev->power.resume_work, pm_cancel_suspend_work);
+		goto queue;
+	}
+
+	if (dev->power.runtime_status == RPM_SUSPENDED && parent)
+		__pm_get_child(parent);
+
+	/*
+	 * The device may be suspending at the moment and we can't clear the
+	 * RPM_SUSPENDING bit in its runtime_status just yet.
+	 */
+	dev->power.runtime_status |= RPM_WAKE;
+	INIT_WORK(&dev->power.resume_work, pm_runtime_resume_work);
+
+ queue:
+	pm_runtime_get(dev);
+	queue_work(pm_wq, &dev->power.resume_work);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent)
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(__pm_request_resume);
+
+/**
+ * __pm_runtime_clear_status - Change the run-time PM status of a device.
+ * @dev: Device to handle.
+ * @status: New value of the device's run-time PM status.
+ *
+ * Change the run-time PM status of the device to @status, which must be
+ * either RPM_ACTIVE or RPM_SUSPENDED, if its current value is equal to
+ * RPM_ERROR.
+ */
+void __pm_runtime_clear_status(struct device *dev, unsigned int status)
+{
+	struct device *parent = dev->parent;
+	unsigned long parflags = 0, flags;
+
+	if (status & ~RPM_SUSPENDED)
+		return;
+
+	if (parent)
+		spin_lock_irqsave(&parent->power.lock, parflags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ERROR)
+		goto out;
+
+	dev->power.runtime_status = status;
+	if (parent && status == RPM_SUSPENDED)
+		__pm_put_child(parent);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (parent)
+		spin_unlock_irqrestore(&parent->power.lock, parflags);
+}
+EXPORT_SYMBOL_GPL(__pm_runtime_clear_status);
+
+/**
+ * pm_runtime_init - Initialize run-time PM fields in given device object.
+ * @dev: Device object to initialize.
+ */
+void pm_runtime_init(struct device *dev)
+{
+	struct device *parent = dev->parent;
+
+	spin_lock_init(&dev->power.lock);
+
+	dev->power.runtime_status = RPM_ACTIVE;
+	atomic_set(&dev->power.resume_count, 1);
+	pm_suspend_ignore_children(dev, false);
+	dev->power.child_count = 0;
+	INIT_DELAYED_WORK(&dev->power.suspend_work, pm_runtime_suspend_work);
+
+	if (parent) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&parent->power.lock, flags);
+		__pm_get_child(parent);
+		spin_unlock_irqrestore(&parent->power.lock, flags);
+	}
+}
+
+/**
+ * pm_runtime_close - Prepare for the removal of a device object.
+ * @dev: Device object being removed.
+ */
+void pm_runtime_close(struct device *dev)
+{
+	struct device *parent = dev->parent;
+	unsigned long flags;
+	unsigned int status;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	/* This makes __pm_runtime_suspend() return immediately. */
+	pm_runtime_get(dev);
+
+	while (dev->power.runtime_status & (RPM_SUSPENDING | RPM_RESUMING)) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+
+		wait_for_completion(&dev->power.work_done);
+
+		spin_lock_irqsave(&dev->power.lock, flags);
+	}
+	status = dev->power.runtime_status;
+
+	/* This makes __pm_runtime_resume() return immediately. */
+	dev->power.runtime_status = RPM_ACTIVE;
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+
+	if (status != RPM_SUSPENDED && parent) {
+		spin_lock_irqsave(&parent->power.lock, flags);
+		__pm_put_child(parent);
+		spin_unlock_irqrestore(&parent->power.lock, flags);
+	}
+}
Index: linux-2.6/include/linux/pm_runtime.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pm_runtime.h
@@ -0,0 +1,148 @@
+/*
+ * pm_runtime.h - Device run-time power management helper functions.
+ *
+ * Copyright (C) 2009 Rafael J. Wysocki <rjw@sisk.pl>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_RUNTIME_H
+#define _LINUX_PM_RUNTIME_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+#ifdef CONFIG_PM_RUNTIME
+
+extern struct workqueue_struct *pm_wq;
+
+extern void pm_runtime_init(struct device *dev);
+extern void pm_runtime_close(struct device *dev);
+extern void pm_runtime_put_notify(struct device *dev);
+extern int __pm_runtime_suspend(struct device *dev, bool sync);
+extern void pm_request_suspend(struct device *dev, unsigned int msec);
+extern int __pm_runtime_resume(struct device *dev, bool get, bool sync);
+extern int __pm_request_resume(struct device *dev, bool get);
+extern void __pm_runtime_clear_status(struct device *dev, unsigned int status);
+
+static inline struct device *suspend_work_to_device(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(dw, struct dev_pm_info, suspend_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline struct device *resume_work_to_device(struct work_struct *work)
+{
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(work, struct dev_pm_info, resume_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline void pm_runtime_get(struct device *dev)
+{
+	atomic_inc(&dev->power.resume_count);
+}
+
+static inline void pm_runtime_put(struct device *dev)
+{
+	if (!atomic_add_unless(&dev->power.resume_count, -1, 0))
+		dev_warn(dev, "Excessive %s!\n", __func__);
+}
+
+static inline bool pm_children_suspended(struct device *dev)
+{
+	return dev->power.ignore_children || !dev->power.child_count;
+}
+
+static inline bool pm_suspend_possible(struct device *dev)
+{
+	return pm_children_suspended(dev)
+		&& !atomic_read(&dev->power.resume_count)
+		&& !(dev->power.runtime_status & RPM_WAKE);
+}
+
+static inline void pm_suspend_ignore_children(struct device *dev, bool enable)
+{
+	dev->power.ignore_children = enable;
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+static inline void pm_runtime_init(struct device *dev) {}
+static inline void pm_runtime_close(struct device *dev) {}
+static inline void pm_runtime_put_notify(struct device *dev) {}
+static inline int __pm_runtime_suspend(struct device *dev, bool sync)
+{
+	return -ENOSYS;
+}
+static inline void pm_request_suspend(struct device *dev, unsigned int msec) {}
+static inline int __pm_runtime_resume(struct device *dev, bool get, bool sync)
+{
+	return -ENOSYS;
+}
+static inline int __pm_request_resume(struct device *dev, bool get)
+{
+	return -ENOSYS;
+}
+static inline void __pm_runtime_clear_status(struct device *dev,
+					      unsigned int status) {}
+
+static inline void pm_runtime_get(struct device *dev) {}
+static inline bool pm_children_suspended(struct device *dev) { return false; }
+static inline bool pm_suspend_possible(struct device *dev) { return false; }
+static inline void pm_suspend_ignore_children(struct device *dev, bool en) {}
+static inline void pm_runtime_put(struct device *dev) {}
+
+#endif /* !CONFIG_PM_RUNTIME */
+
+static inline int pm_runtime_suspend(struct device *dev)
+{
+	return __pm_runtime_suspend(dev, true);
+}
+
+static inline int pm_runtime_resume(struct device *dev)
+{
+	return __pm_runtime_resume(dev, false, true);
+}
+
+static inline int pm_runtime_resume_get(struct device *dev)
+{
+	return __pm_runtime_resume(dev, true, true);
+}
+
+static inline int pm_request_resume(struct device *dev)
+{
+	return __pm_request_resume(dev, false);
+}
+
+static inline int pm_request_resume_get(struct device *dev)
+{
+	return __pm_request_resume(dev, true);
+}
+
+static inline void pm_runtime_clear_active(struct device *dev)
+{
+	__pm_runtime_clear_status(dev, RPM_ACTIVE);
+}
+
+static inline void pm_runtime_clear_suspended(struct device *dev)
+{
+	__pm_runtime_clear_status(dev, RPM_SUSPENDED);
+}
+
+static inline void pm_runtime_enable(struct device *dev)
+{
+	pm_runtime_put(dev);
+}
+
+static inline void pm_runtime_disable(struct device *dev)
+{
+	pm_runtime_get(dev);
+	pm_runtime_resume(dev);
+}
+
+#endif
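
The resume counter semantics defined in the header above can be tried out in
user space with the sketch below.  It is only a model: rpm_counter and the
rpm_counter_*() helpers are hypothetical names, C11 atomics stand in for the
kernel's atomic_t, and the RPM_WAKE check done by pm_suspend_possible() is left
out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the run-time PM fields. */
struct rpm_counter {
	atomic_int resume_count;	/* models power.resume_count */
	unsigned int child_count;	/* models power.child_count */
	bool ignore_children;		/* models power.ignore_children */
};

/* Models pm_runtime_init(): the count starts at 1 (run-time PM disabled). */
static void rpm_counter_init(struct rpm_counter *c)
{
	atomic_init(&c->resume_count, 1);
	c->child_count = 0;
	c->ignore_children = false;
}

/* Models pm_runtime_get(). */
static void rpm_counter_get(struct rpm_counter *c)
{
	atomic_fetch_add(&c->resume_count, 1);
}

/*
 * Models atomic_add_unless(&count, -1, 0) as used by pm_runtime_put():
 * returns false, without changing the count, on an unbalanced put.
 */
static bool rpm_counter_put(struct rpm_counter *c)
{
	int old = atomic_load(&c->resume_count);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&c->resume_count, &old, old - 1))
			return true;
	}
	return false;
}

/* Models pm_suspend_possible() without the RPM_WAKE test. */
static bool rpm_counter_suspend_possible(struct rpm_counter *c)
{
	bool children_ok = c->ignore_children || c->child_count == 0;

	return children_ok && atomic_load(&c->resume_count) == 0;
}
```

Right after rpm_counter_init() a suspend is not possible.  A single
rpm_counter_put(), which is what pm_runtime_enable() amounts to, brings the
count to 0 and makes it possible, and any further put is reported as excessive.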
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -21,6 +21,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
@@ -88,6 +89,7 @@ void device_pm_add(struct device *dev)
 	}
 
 	list_add_tail(&dev->power.entry, &dpm_list);
+	pm_runtime_init(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -104,6 +106,7 @@ void device_pm_remove(struct device *dev
 		 kobject_name(&dev->kobj));
 	mutex_lock(&dpm_list_mtx);
 	list_del_init(&dev->power.entry);
+	pm_runtime_close(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -507,6 +510,7 @@ static void dpm_complete(pm_message_t st
 		get_device(dev);
 		if (dev->power.status > DPM_ON) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			mutex_unlock(&dpm_list_mtx);
 
 			device_complete(dev, state);
@@ -753,6 +757,7 @@ static int dpm_prepare(pm_message_t stat
 
 		get_device(dev);
 		dev->power.status = DPM_PREPARING;
+		pm_runtime_disable(dev);
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_prepare(dev, state);
@@ -760,6 +765,7 @@ static int dpm_prepare(pm_message_t stat
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			if (error == -EAGAIN) {
 				put_device(dev);
 				continue;
Index: linux-2.6/drivers/base/dd.c
===================================================================
--- linux-2.6.orig/drivers/base/dd.c
+++ linux-2.6/drivers/base/dd.c
@@ -23,6 +23,7 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
+#include <linux/pm_runtime.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -202,8 +203,12 @@ int driver_probe_device(struct device_dr
 	pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
 		 drv->bus->name, __func__, dev_name(dev), drv->name);
 
+	pm_runtime_disable(dev);
+
 	ret = really_probe(dev, drv);
 
+	pm_runtime_enable(dev);
+
 	return ret;
 }
 
@@ -306,6 +311,8 @@ static void __device_release_driver(stru
 
 	drv = dev->driver;
 	if (drv) {
+		pm_runtime_disable(dev);
+
 		driver_sysfs_remove(dev);
 
 		if (dev->bus)
@@ -320,6 +327,8 @@ static void __device_release_driver(stru
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
+
+		pm_runtime_enable(dev);
 	}
 }
 
Index: linux-2.6/Documentation/power/runtime_pm.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/power/runtime_pm.txt
@@ -0,0 +1,416 @@
+Run-time Power Management Framework for I/O Devices
+
+(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+1. Introduction
+
+Support for run-time power management (run-time PM) of I/O devices is provided
+at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queuing all work items related to run-time PM, because this allows
+  them to be synchronized with system-wide power transitions.  pm_wq is declared
+  in include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of run-time PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing run-time PM operations with one another.
+
+* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out run-time PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The device run-time PM fields of 'struct dev_pm_info', the helper functions
+using them and the run-time PM callbacks present in 'struct dev_pm_ops' are
+described below.
+
+2. Run-time PM Helper Functions and Device Fields
+
+The following helper functions are defined in drivers/base/power/runtime.c
+and include/linux/pm_runtime.h:
+
+* void pm_runtime_init(struct device *dev);
+* void pm_runtime_close(struct device *dev);
+
+* void pm_runtime_get(struct device *dev);
+* void pm_runtime_put(struct device *dev);
+* void pm_runtime_put_notify(struct device *dev);
+* int pm_runtime_suspend(struct device *dev);
+* void pm_request_suspend(struct device *dev, unsigned int msec);
+* int pm_runtime_resume(struct device *dev);
+* int pm_runtime_resume_get(struct device *dev);
+* int pm_request_resume(struct device *dev);
+
+* bool pm_suspend_possible(struct device *dev);
+
+* void pm_runtime_enable(struct device *dev);
+* void pm_runtime_disable(struct device *dev);
+
+* void pm_suspend_ignore_children(struct device *dev, bool enable);
+
+* void pm_runtime_clear_active(struct device *dev);
+* void pm_runtime_clear_suspended(struct device *dev);
+
+pm_runtime_init() initializes the run-time PM fields in the 'power' member of
+a device object.  It is called during the initialization of the device object,
+in drivers/base/power/main.c:device_pm_add().
+
+pm_runtime_close() disables the run-time PM of a device and updates the 'power'
+member of its parent's device object to take the removal of the device into
+account.  It is called during the destruction of the device object, in
+drivers/base/power/main.c:device_pm_remove().
+
+pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+pm_runtime_resume_get(), pm_request_resume(), and pm_request_resume_get()
+use the 'power.runtime_status', 'power.resume_count', 'power.suspend_aborted',
+and 'power.child_count' fields of 'struct device' for mutual cooperation.  In
+what follows the 'power.runtime_status', 'power.resume_count', and
+'power.child_count' fields are referred to as the device's run-time PM status,
+the device's resume counter, and the counter of unsuspended children of the
+device, respectively.  They are set to RPM_ACTIVE, 1 and 0, respectively, by
+pm_runtime_init().
+
+pm_runtime_get() is used to increase the device's resume counter by 1.  If the
+resume counter of the device is greater than 0, it will cause the PM core to
+refuse to suspend the device or to queue up a suspend request for it.  This may
+be useful if the device is resumed for a specific task and it shouldn't be
+suspended until the task is complete, but there are many potential sources of
+suspend requests that could disturb it.  It is valid to call this function from
+interrupt context.
+
+pm_runtime_put() is used to decrease the device's resume counter by 1 if it's
+greater than 0.  pm_runtime_put_notify() additionally checks if the device's
+resume counter is equal to zero (after it's just been decreased) and if all
+children of the device are suspended (or it has the 'power.ignore_children' flag
+set).  If that is the case, the ->runtime_idle() callback provided by the
+device's bus type is executed for it.
+
+pm_runtime_suspend() is used to carry out a run-time suspend of an active
+device.  It is called directly by a bus type or device driver, but internally
+it calls __pm_runtime_suspend() that is also used for asynchronous suspending of
+devices (i.e. to complete requests queued up by pm_request_suspend()) and works
+as follows.
+
+  * If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the
+    device's run-time PM status field, 'power.runtime_status'), success is
+    returned.
+
+  * If the device's resume counter is greater than 0 or the function has been
+    called via pm_wq as a result of a cancelled suspend request (the RPM_IDLE
+    bit is set in the device's run-time PM status field and its
+    'power.suspend_aborted' flag is set), -EAGAIN is returned.
+
+  * If the device is suspending (i.e. the RPM_SUSPENDING bit is set in its
+    run-time PM status field), which means that another instance of
+    __pm_runtime_suspend() is running at the same time for the same device, the
+    function waits for the other instance to complete and returns the result
+    returned by it.
+
+  * If the device has a pending suspend request (i.e. the RPM_IDLE bit is set in
+    its run-time PM status) and the function hasn't been called as a result of
+    that request, it cancels the request (synchronously) and restarts itself if
+    a concurrent suspend or resume is running in parallel with it or a resume
+    request has just been queued up.
+
+  * If the children of the device are not suspended and the
+    'power.ignore_children' flag is not set for it, the device's run-time PM
+    status is set to RPM_ACTIVE and -EAGAIN is returned.
+
+If none of the above takes place, or a pending suspend request has been
+successfully cancelled, the device's run-time PM status is set to RPM_SUSPENDING
+and its bus type's ->runtime_suspend() callback is executed.  This callback is
+entirely responsible for handling the device as appropriate (for example, it may
+choose to execute the device driver's ->runtime_suspend() callback or to carry
+out any other suitable action depending on the bus type).
+
+  * If it completes successfully, the RPM_SUSPENDING bit is cleared and the
+    RPM_SUSPENDED bit is set in the device's run-time PM status field.  Once
+    that has happened, the device is regarded by the PM core as suspended, but
+    it _need_ _not_ mean that the device has been put into a low power state.
+    What really happens to the device at this point entirely depends on its bus
+    type (it may depend on the device's driver if the bus type chooses to call
+    it).  Additionally, if the device bus type's ->runtime_suspend() callback
+    completes successfully and there's no resume request pending for the device
+    (i.e. the RPM_WAKE flag is not set in its run-time PM status field), and the
+    device has a parent, the parent's counter of unsuspended children (i.e. the
+    'power.child_count' field) is decremented.  If that counter turns out to be
+    equal to zero (i.e. the device was the last unsuspended child of its parent)
+    and the parent's 'power.ignore_children' flag is unset, and the parent's
+    resume counter is equal to 0, its bus type's ->runtime_idle() callback is
+    executed for it.
+
+  * If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+    set to RPM_ACTIVE.
+
+  * If another error code is returned, the device's run-time PM status is set to
+    RPM_ERROR, which makes the PM core refuse to carry out any run-time PM
+    operations for it until the status is cleared by its bus type or driver with
+    the help of pm_runtime_clear_active() or pm_runtime_clear_suspended().
+
+Finally, pm_runtime_suspend() returns the result returned by the device bus
+type's ->runtime_suspend() callback.  If the device's bus type doesn't implement
+->runtime_suspend(), -EINVAL is returned and the device's run-time PM status is
+set to RPM_ERROR.
+
+pm_request_suspend() is used to queue up a suspend request for an active device.
+If the run-time PM status of the device (i.e. the value of the
+'power.runtime_status' field in 'struct device') is different from RPM_ACTIVE
+or its resume counter is greater than 0 (i.e. the device is not active from the
+PM core standpoint), the function returns immediately.  Otherwise, it changes
+the device's run-time PM status to RPM_IDLE and puts a request to suspend the
+device into pm_wq.  The 'msec' argument is used to specify the time to wait
+before the request will be completed, in milliseconds.  It is valid to call this
+function from interrupt context.
+
+pm_runtime_resume() and pm_runtime_resume_get() are used to carry out a
+run-time resume of a device that is suspended, suspending or has a suspend
+request pending.  They are called directly by a bus type or device driver and
+the difference between them is that pm_runtime_resume_get() leaves the device's
+resume counter incremented.  Internally, however, they both call
+__pm_runtime_resume() that is also used for asynchronous resuming of devices
+(i.e. to complete requests queued up by pm_request_resume() or
+pm_request_resume_get()).  It first increments the device's resume counter to
+prevent new suspend requests from being queued up and to make subsequent
+attempts to suspend the device fail.  The device's resume counter will be
+decremented on return, unless success is about to be returned and the function
+is requested to hold a reference to the device (i.e. in the
+pm_runtime_resume_get() case).
+
+After incrementing the device's resume counter __pm_runtime_resume()
+proceeds as follows.
+
+  * If the device is active (i.e. all of the bits in its run-time PM status are
+    unset), success is returned.
+
+  * If there's a suspend request pending for the device (i.e. the RPM_IDLE bit
+    is set in the device's run-time PM status field), the
+    'power.suspend_aborted' flag is set for the device and the request is
+    cancelled (synchronously).  Then, the function restarts itself if the
+    device's RPM_IDLE bit was cleared or the 'power.suspend_aborted' flag was
+    unset in the meantime by a concurrent thread.  Otherwise, the device's
+    run-time PM status is cleared to RPM_ACTIVE and the function returns
+    success.
+
+  * If the device has a pending resume request (i.e. the RPM_WAKE bit is set in
+    its run-time PM status field), but the function hasn't been called as a
+    result of that request, the function waits for the request to complete and
+    restarts itself.
+
+  * If the device is suspending (i.e. the RPM_SUSPENDING bit is set in its
+    run-time PM status field), the function waits for the suspend operation to
+    complete and restarts itself.
+
+  * If the device is suspended and doesn't have a pending resume request (i.e.
+    its run-time PM status is RPM_SUSPENDED), and it has a parent that is not
+    active (i.e. the parent's run-time PM status is not RPM_ACTIVE),
+    pm_runtime_resume_get() is called (recursively) for the parent.  If the
+    parent's resume is successful, the function notes that the parent's resume
+    counter will have to be decremented and restarts itself.  Otherwise, it
+    returns the error code returned by the instance of pm_runtime_resume_get()
+    handling the device's parent.
+
+  * If the device is resuming (i.e. the device's run-time PM status is
+    RPM_RESUMING), which means that another instance of __pm_runtime_resume() is
+    running at the same time for the same device, the function waits for the
+    other instance to complete and returns the result returned by it.
+
+If none of the above happens, the function checks if the device's run-time PM
+status is RPM_SUSPENDED, which means that the device doesn't have a resume
+request pending, and if it has a parent.  If that is the case, the parent's
+counter of unsuspended children is increased.  Next, the device's run-time PM
+status is set to RPM_RESUMING and its bus type's ->runtime_resume() callback is
+executed.  This callback is entirely responsible for handling the device as
+appropriate (for example, it may choose to execute the device driver's
+->runtime_resume() callback or to carry out any other suitable action depending
+on the bus type).
+
+  * If it completes successfully, the device's run-time PM status is set to
+    RPM_ACTIVE, which means that the device is fully operational.  Thus, the
+    device bus type's ->runtime_resume() callback, when it is about to return
+    success, _must_ _ensure_ that this really is the case (i.e. when it returns
+    success, the device _must_ be able to carry out I/O operations as needed).
+
+  * If an error code is returned, the device's run-time PM status is set to
+    RPM_ERROR, which makes the PM core refuse to carry out any run-time PM
+    operations for the device until the status is cleared by its bus type or
+    driver with the help of either pm_runtime_clear_active(), or
+    pm_runtime_clear_suspended().  Thus, it is strongly recommended that bus
+    types' ->runtime_resume() callbacks only return error codes in fatal error
+    conditions, when it is impossible to bring the device back to the
+    operational state by any available means.  Inability to wake up a suspended
+    device usually means a service loss and it may very well result in a data
+    loss to the user, so it _must_ be regarded as a severe problem and avoided
+    if at all possible.
+
+Finally, __pm_runtime_resume() returns the result returned by the device bus
+type's ->runtime_resume() callback.  The device's resume counter is decremented
+right before the function returns, unless success is about to be returned and
+the function is requested to hold a reference to the device (i.e. in the
+pm_runtime_resume_get() case).  If the device's bus type doesn't implement
+->runtime_resume(), -EINVAL is returned and the device's run-time PM status is
+set to RPM_ERROR.
+
+pm_request_resume() and pm_request_resume_get() are used to queue up a resume
+request for a device that is suspended, suspending or has a suspend request
+pending.  The difference between them is that pm_request_resume_get() leaves the
+device's resume counter incremented, so the device cannot be suspended by
+__pm_runtime_suspend() after it has run.  Internally, they both call
+__pm_request_resume() which works as follows.
+
+* If the function is requested to take a reference to the device (i.e. in the
+  pm_request_resume_get() case), the device's resume counter is incremented.
+
+* If the run-time PM status of the device is RPM_ACTIVE, -EBUSY is returned.
+
+* If the device is resuming or has a resume request pending (i.e. at least one
+  of the RPM_WAKE and RPM_RESUMING bits is set in the device's run-time PM
+  status field), -EINPROGRESS is returned.
+
+* If the device's run-time status is RPM_IDLE (i.e. a suspend request is pending
+  for it) and the 'power.suspend_aborted' flag is set (i.e. the pending request
+  is being cancelled), -EBUSY is returned.
+
+* If the device's run-time status is RPM_IDLE (i.e. a suspend request is pending
+  for it) and the 'power.suspend_aborted' flag is not set, the device's
+  'power.suspend_aborted' flag is set, a request to cancel the pending suspend
+  request is queued up and the device's resume counter is increased (it will be
+  decreased by the work function when it's done its job).  Finally, -EBUSY is
+  returned.
+
+If none of the above happens, the function checks if the device's run-time PM
+status is RPM_SUSPENDED and if it has a parent, in which case the parent's
+counter of unsuspended children is incremented.  Next, the function grabs a
+reference to the device by increasing its resume counter (this reference is
+going to be dropped automatically after the __pm_runtime_resume() handling the
+request has run), the RPM_WAKE bit is set in the device's run-time PM status
+field and the request to execute __pm_runtime_resume() is put into pm_wq.
+Finally, the function returns 0, which means that the resume request has been
+successfully queued up.  It is valid to call this function from interrupt
+context.
+
+Note that it usually is _not_ safe to access the device for I/O purposes
+immediately after __pm_request_resume() has returned, unless the returned result
+is -EBUSY, which means that it wasn't necessary to resume the device.
+
+Note also that only one suspend request or one resume request may be queued up
+at any given moment.  Moreover, a resume request cannot be queued up along with
+a suspend request.  Still, if it's necessary to queue up a request to cancel a
+pending suspend request, these two requests will be present in pm_wq at the
+same time.  In that case, regardless of which request is processed first, the
+device's run-time PM status will be set to RPM_ACTIVE as the final
+result.
+
+pm_suspend_possible() is used to check if the device may be suspended at this
+particular moment.  It returns 'false' if the device's resume counter is greater
+than 0, if it has unsuspended children (and 'power.ignore_children' is unset),
+or if a resume request is pending for it (RPM_WAKE), and 'true' otherwise.
+
+pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
+respectively, all of the run-time PM core operations.  They do it by
+decrementing and incrementing, respectively, the device's resume counter, which
+also is done by pm_runtime_get() and pm_runtime_put().  However,
+pm_runtime_enable() doesn't notify the device's bus type of its resume counter
+reaching 0 and pm_runtime_disable() additionally calls pm_runtime_resume() for
+the device after incrementing its resume counter to ensure that it will not be
+suspended while its run-time PM is disabled.  Therefore, if pm_runtime_disable()
+is called several times in a row for the same device, it has to be balanced by
+the appropriate number of pm_runtime_enable() calls so that the other run-time
+PM core functions work for that device.  The initial value of the device's
+resume counter, as set by pm_runtime_init(), is 1 (i.e. the device's run-time PM
+is initially disabled).
+
+pm_runtime_disable() and pm_runtime_enable() are used by the device core to
+disable the run-time power management of devices temporarily during device probe
+and removal as well as during system-wide power transitions (i.e. system-wide
+suspend or hibernation, or resume from a system sleep state).
+
+pm_suspend_ignore_children() is used to set or unset the
+'power.ignore_children' flag in 'struct device'.  If the 'enable'
+argument is 'true', the field is set to 1, and if 'enable' is 'false', the field
+is set to 0.  The default value of 'power.ignore_children', as set by
+pm_runtime_init(), is 0.
+
+pm_runtime_clear_active() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_ACTIVE.  It is valid to call this function from
+interrupt context.
+
+pm_runtime_clear_suspended() is used to change the device's run-time PM status
+field from RPM_ERROR to RPM_SUSPENDED.  If the device has a parent, the
+function additionally decrements the parent's counter of unsuspended children,
+although the parent's bus type is not notified if the counter becomes 0.  It is
+valid to call this function from interrupt context.
+
+3. Device Run-time PM Callbacks
+
+There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend() callback is executed by pm_runtime_suspend() for the bus
+type of the device being suspended.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_suspend() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the bus type's ->runtime_suspend() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_suspend() callback has returned successfully,
+  the PM core regards the device as suspended, which need not mean that the
+  device has been put into a low power state.  It is supposed to mean, however,
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_ACTIVE, which means that the device
+  _must_ be fully operational once this has happened.
+* If the bus type's ->runtime_suspend() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+In particular, it is recommended that ->runtime_suspend() return -EBUSY or
+-EAGAIN if device_may_wakeup() returns 'false' for the device.  On the other
+hand, if device_may_wakeup() returns 'true' for the device and the device is put
+into a low power state during the execution of ->runtime_suspend(), it is
+expected that remote wake-up (i.e. a hardware mechanism allowing the device to
+request a change of its power state, such as PCI PME) will be enabled for the
+device.  Generally, remote wake-up should be enabled whenever the device is put
+into a low power state at run time and is expected to receive input from the
+outside of the system.
+
+The ->runtime_resume() callback is executed by pm_runtime_resume() for the bus
+type of the device being woken up.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_resume() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the bus type's ->runtime_resume() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_resume() callback has returned successfully,
+  the PM core regards the device as fully operational, which means that the
+  device _must_ be able to complete I/O operations as needed.
+* If the bus type's ->runtime_resume() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_SUSPENDED, which is supposed to mean
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_resume() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_idle() callback is executed by pm_runtime_suspend() for the bus
+type of a device the children of which are all suspended (or which has the
+'power.ignore_children' flag set).  The action carried out by this
+callback is totally dependent on the bus type in question, but the expected
+action is to check if the device can be suspended (i.e. if all of the conditions
+necessary for suspending the device are met) and to queue up a suspend request
+for the device if that is the case.
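
The way the document above maps the result of a bus type's ->runtime_suspend()
callback to the device's run-time PM status can be summarized with the small
user-space sketch below.  The enum and the function name are hypothetical (the
real RPM_* values are bit flags set by the PM core), and only the suspend
direction is modelled.

```c
#include <errno.h>

/* Hypothetical status values used only by this model. */
enum model_status {
	MODEL_ACTIVE,		/* stands for RPM_ACTIVE */
	MODEL_SUSPENDED,	/* stands for RPM_SUSPENDED */
	MODEL_ERROR,		/* stands for RPM_ERROR */
};

/*
 * Status the PM core is described as assigning after the bus type's
 * ->runtime_suspend() callback has returned 'ret'.
 */
static enum model_status model_status_after_suspend(int ret)
{
	if (ret == 0)
		return MODEL_SUSPENDED;	/* regarded as suspended */
	if (ret == -EBUSY || ret == -EAGAIN)
		return MODEL_ACTIVE;	/* device must stay fully operational */
	return MODEL_ERROR;		/* needs pm_runtime_clear_*() to recover */
}
```

A callback that cannot suspend the device right now would therefore return
-EBUSY or -EAGAIN rather than another error code, because anything else leaves
the device in the error state until its bus type or driver clears it.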


* Re: [linux-pm] [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-20 14:30                             ` [linux-pm] " Alan Stern
  2009-06-20 23:48                               ` Rafael J. Wysocki
@ 2009-06-20 23:48                               ` Rafael J. Wysocki
  2009-06-21  2:30                                 ` Alan Stern
  2009-06-21  2:30                                 ` [linux-pm] " Alan Stern
  2009-06-22  6:20                                 ` Magnus Damm
  2009-06-22  6:20                               ` Magnus Damm
  3 siblings, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-20 23:48 UTC (permalink / raw)
  To: Alan Stern
  Cc: Magnus Damm, Greg KH, LKML, ACPI Devel Maling List,
	Linux-pm mailing list, Ingo Molnar

On Saturday 20 June 2009, Alan Stern wrote:
> Some more thoughts...
> 
> Magnus, you might have some insights here.  It occurred to me that some 
> devices can switch power levels very quickly, and the drivers might 
> therefore want the runtime suspend and resume methods to be called as 
> soon as possible, even in interrupt context.

Then, we'll need special suspend and resume calls for them.

> In terms of the current framework, this probably means holding the
> runtime PM lock (i.e., not releasing it) across the calls to
> ->runtime_suspend and ->runtime_resume.  It also means that
> pm_request_suspend and pm_request_resume should carry out their jobs
> immediately instead of queuing a work item.  (Unless the current status 
> is RPM_SUSPENDING or RPM_RESUMING, which should never happen.)
> 
> Should there be a flag in dev_pm_info to select this behavior?

I don't think we should complicate pm_request_suspend() and pm_request_resume()
further to handle this particular case.  IMO it's better to provide separate
core calls for that.

> When a device structure is unregistered and deallocated, we have to
> insure that there aren't any pending runtime PM workqueue items.  
> Hence device_del should call a routine that changes the status to an
> exceptional state (not RPM_ERROR but something else) to prevent new
> requests from being queued, and then calls cancel_work_sync or
> cancel_delayed_work_sync as required.

This is done in the patch I've just sent.
 
> Similarly, we should insure that runtime PM calls made before the
> device is registered don't do anything.  So when the device structure
> is first created and the contents are all 0, this should also be
> interpreted as an exceptional state.  We could call it RPM_UNREGISTERED
> and use it for both purposes.

Hmm.  How do you think it is possible that the pm_runtime_* functions will be
called in such a situation?

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-20 23:38                             ` [patch update 3] " Rafael J. Wysocki
@ 2009-06-21  2:23                                 ` Alan Stern
  2009-06-21  2:23                                 ` Alan Stern
  1 sibling, 0 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-21  2:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Sun, 21 Jun 2009, Rafael J. Wysocki wrote:

> > 	pm_request_suspend should run very quickly, since it will be
> > 	called after every I/O operation.  Likewise, pm_request_resume
> > 	should run very quickly if the status is RPM_ACTIVE or 
> > 	RPM_IDLE.
> 
> Hmm.  pm_request_suspend() is really short, so it should be fast.
> pm_request_resume() is a bit more complicated, though (it takes two spinlocks,
> increases an atomic counter, possibly twice, and queues up a work item, also
> in the RPM_IDLE case).

Hmm, maybe that's a good reason for not trying to handle the parent
from within pm_request_resume().  :-)

Or maybe the routine should be optimized for the RPM_ACTIVE and 
RPM_IDLE cases, where it doesn't have to do much anyway.


> > 	In order to prevent autosuspends from occurring while I/O is
> > 	in progress, the pm_request_resume call should increment the
> > 	usage counter (if it had to queue the request) and the 
> > 	pm_request_suspend call should decrement it (maybe after
> > 	waiting for the delay).
> 
> I don't want pm_request_suspend() to do that, because it's valid to
> call it many times in a row (only the first request will be queued in such a
> case).

Sorry, what I meant was that in each case the counter should be
{inc,dec}remented if a new request had to be queued.  If one was
already queued then the counter should be left alone.

The reason behind this is that a bunch of pm_request_suspend calls
which all end up referring to the same workqueue item will result in a
single async call to the runtime_suspend method.  Therefore they should
cause a single decrement of the counter.  Likewise for
pm_request_resume.
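
As a rough illustration of the counting rule described above, here is a
user-space sketch (not code from the patch).  The names struct rpm_model,
model_request_resume() and model_request_suspend() are hypothetical, a
plain int stands in for the usage counter and a boolean stands in for the
queued workqueue item.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one "pending" flag per request type stands in
 * for the single workqueue item that repeated requests share. */
struct rpm_model {
	int usage_count;	/* stand-in for the usage counter */
	bool resume_pending;	/* a resume request is already queued */
	bool suspend_pending;	/* a suspend request is already queued */
};

/* Returns true only if a new resume request had to be queued. */
bool model_request_resume(struct rpm_model *m)
{
	if (m->resume_pending)
		return false;	/* already queued: leave the counter alone */
	m->resume_pending = true;
	m->usage_count++;	/* one increment per queued request */
	return true;
}

/* Returns true only if a new suspend request had to be queued. */
bool model_request_suspend(struct rpm_model *m)
{
	if (m->suspend_pending)
		return false;	/* already queued: leave the counter alone */
	m->suspend_pending = true;
	if (m->usage_count > 0)
		m->usage_count--;	/* one decrement per queued request */
	return true;
}
```

With this model, any number of back-to-back suspend requests changes the
counter exactly once, which matches the single asynchronous call to the
runtime_suspend method that they produce.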

> I'd prefer the caller to do pm_request_resume_get() (please see the patch
> below) to put a resume request into the queue and then pm_runtime_put_notify()
> when it's done with the I/O.  That will result in ->runtime_idle() being called
> automatically if the device may be suspended.

If anyone does pm_request_resume or pm_runtime_resume, what is there to 
prevent the device from being suspended again as soon as the resume is 
finished (and before anything useful can be accomplished)?

     1. The driver's runtime_suspend method might be smart enough to 
	return -EBUSY until the driver has completed whatever it's 
	doing.

     2. The usage counter might be > 0.

     3. The number of children might be > 0.

In case 1 there's no reason not to also increment the counter, since
the driver can decrement it again when it is finished.  In cases 2 and
3, we can assume the counter or the number of children was previously
equal to 0, since otherwise the resume call would have been vacuous.  
This implies that the resume call itself should be responsible for
incrementing either the counter or the number of children.
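
The three conditions listed above can be sketched as a single check.
This is only a user-space model under stated assumptions: struct
suspend_model, model_driver_suspend() and model_try_suspend() are
hypothetical names, and the particular error codes returned for a
nonzero usage counter or child count are illustrative choices, not taken
from the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-device state used only for this illustration. */
struct suspend_model {
	int usage_count;	/* case 2: references held by users */
	int child_count;	/* case 3: children that are not suspended */
	int busy;		/* case 1: driver has work in progress */
};

/* Stand-in for a driver's runtime_suspend method that refuses while busy. */
int model_driver_suspend(struct suspend_model *d)
{
	return d->busy ? -EBUSY : 0;
}

/* Returns 0 if the device may be suspended, a negative errno otherwise. */
int model_try_suspend(struct suspend_model *d)
{
	if (d->usage_count > 0)
		return -EAGAIN;	/* assumed code: counter still held */
	if (d->child_count > 0)
		return -EBUSY;	/* assumed code: children depend on us */
	return model_driver_suspend(d);
}
```

If a resume call always increments one of the two counters, the first two
checks are enough to keep the device active until the caller drops its
reference.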

What I'm getting at is this: There's no real point to having separate 
pm_request_resume and pm_request_resume_get calls.  All such calls 
should increment either the usage counter or the number of children.

(In the USB stack, a single counter is used for both purposes.  It 
doesn't look like that will work here.)


> > All bus types will want to implement _some_ delay; it doesn't make
> > sense to power down a device immediately after every operation and then
> > power it back up for the next operation.
> 
> Sure.  But you can use the pm_request_resume()'s delay to achieve that
> without storing the delay in 'struct device'.  It seems.

If you do it that way then the delay has to be hard-coded or stored in
some non-standardized location.  Which will be more common: devices
where the delay is fixed by the bus type (or driver), or devices where
the user should be able to adjust the delay?  If user-adjustable is 
more common then the delay should be stored in dev_pm_info, so that it 
can be controlled by a centralized sysfs attribute defined in the 
driver core.  If not then you are right, the delay doesn't need to be 
stored in struct device.
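
To make the user-adjustable option concrete, here is a user-space sketch
of a delay kept in a per-device PM structure with show/store helpers in
the style of a sysfs attribute.  Everything in it is an assumption for
illustration: the field name suspend_delay_ms, the millisecond unit and
the helper names are hypothetical, and no such attribute exists in the
patch under discussion.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the PM part of a device structure. */
struct pm_info_model {
	unsigned long suspend_delay_ms;	/* assumed field name and unit */
};

/* Format the current delay the way a "show" method would. */
int model_delay_show(const struct pm_info_model *pm, char *buf, size_t len)
{
	return snprintf(buf, len, "%lu\n", pm->suspend_delay_ms);
}

/* Parse a decimal delay the way a "store" method would.
 * Returns 0 on success or -EINVAL if the input is not a plain number. */
int model_delay_store(struct pm_info_model *pm, const char *buf)
{
	char *end;
	unsigned long val;

	if (!isdigit((unsigned char)buf[0]))
		return -EINVAL;
	errno = 0;
	val = strtoul(buf, &end, 10);
	if (errno)
		return -EINVAL;
	if (*end == '\n')
		end++;		/* tolerate one trailing newline */
	if (*end != '\0')
		return -EINVAL;
	pm->suspend_delay_ms = val;
	return 0;
}
```

Keeping the value in one standard place is what would let a single
attribute in the driver core serve every bus type.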

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [linux-pm] [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-20 23:48                               ` [linux-pm] " Rafael J. Wysocki
  2009-06-21  2:30                                 ` Alan Stern
@ 2009-06-21  2:30                                 ` Alan Stern
  2009-06-21 11:32                                   ` Rafael J. Wysocki
  2009-06-21 11:32                                   ` [linux-pm] " Rafael J. Wysocki
  1 sibling, 2 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-21  2:30 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Magnus Damm, Greg KH, LKML, ACPI Devel Maling List,
	Linux-pm mailing list, Ingo Molnar

On Sun, 21 Jun 2009, Rafael J. Wysocki wrote:

> On Saturday 20 June 2009, Alan Stern wrote:
> > Some more thoughts...
> > 
> > Magnus, you might have some insights here.  It occurred to me that some 
> > devices can switch power levels very quickly, and the drivers might 
> > therefore want the runtime suspend and resume methods to be called as 
> > soon as possible, even in interrupt context.
> 
> Then, we'll need special suspend and resume calls for them.

Good idea.  pm_runtime_resume_atomic() and pm_runtime_suspend_atomic().  
No need for _request variants since the status should never be 
RPM_SUSPENDING or RPM_RESUMING while the lock is released.


> > Similarly, we should insure that runtime PM calls made before the
> > device is registered don't do anything.  So when the device structure
> > is first created and the contents are all 0, this should also be
> > interpreted as an exceptional state.  We could call it RPM_UNREGISTERED
> > and use it for both purposes.
> 
> Hmm.  How do you think it is possible that the pm_runtime_* functions will be
> called in such a situation?

By mistake.  :-)

Seriously, there _are_ places where drivers get bound to devices before
those devices are registered.  This happens for example in USB when a
bunch of related interfaces are present in the same physical device.  
When the first interface is registered, its driver binds itself to all
the others even though they haven't been registered yet.
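
The RPM_UNREGISTERED idea quoted above could look roughly like the
following user-space sketch.  It is only a model: struct dev_model and
the model_*() helpers are hypothetical, the numeric values of the other
states are assumptions, and -ENODEV is an arbitrary choice of error code.
The one property it relies on is that the exceptional state has the value
0, so a zero-filled structure is already in it.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* RPM_UNREGISTERED is 0 so that a freshly zeroed structure reads as
 * "not registered".  The other values are illustrative. */
enum rpm_status_model {
	RPM_UNREGISTERED = 0,
	RPM_ACTIVE,
	RPM_SUSPENDED,
};

/* Hypothetical stand-in for the runtime PM part of a device. */
struct dev_model {
	enum rpm_status_model status;
	int requests_queued;
};

/* Refuse to queue anything before registration or after removal. */
int model_queue_request(struct dev_model *dev)
{
	if (dev->status == RPM_UNREGISTERED)
		return -ENODEV;	/* assumed error code */
	dev->requests_queued++;
	return 0;
}

void model_register(struct dev_model *dev)
{
	dev->status = RPM_ACTIVE;
}

void model_unregister(struct dev_model *dev)
{
	dev->status = RPM_UNREGISTERED;
}
```

The same state then covers both cases: calls made too early and calls
made after the device has been removed.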

Alan Stern

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [linux-pm] [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-21  2:30                                 ` [linux-pm] " Alan Stern
  2009-06-21 11:32                                   ` Rafael J. Wysocki
@ 2009-06-21 11:32                                   ` Rafael J. Wysocki
  2009-06-22 14:16                                     ` Alan Stern
  2009-06-22 14:16                                     ` Alan Stern
  1 sibling, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-21 11:32 UTC (permalink / raw)
  To: Alan Stern
  Cc: Magnus Damm, Greg KH, LKML, ACPI Devel Maling List,
	Linux-pm mailing list, Ingo Molnar

On Sunday 21 June 2009, Alan Stern wrote:
> On Sun, 21 Jun 2009, Rafael J. Wysocki wrote:
> 
> > On Saturday 20 June 2009, Alan Stern wrote:
> > > Some more thoughts...
> > > 
> > > Magnus, you might have some insights here.  It occurred to me that some 
> > > devices can switch power levels very quickly, and the drivers might 
> > > therefore want the runtime suspend and resume methods to be called as 
> > > soon as possible, even in interrupt context.
> > 
> > Then, we'll need special suspend and resume calls for them.
> 
> Good idea.  pm_runtime_resume_atomic() and pm_runtime_suspend_atomic().  
> No need for _request variants since the status should never be 
> RPM_SUSPENDING or RPM_RESUMING while the lock is released.

Yes, exactly.  I also thought of the same names. :-)

> > > Similarly, we should insure that runtime PM calls made before the
> > > device is registered don't do anything.  So when the device structure
> > > is first created and the contents are all 0, this should also be
> > > interpreted as an exceptional state.  We could call it RPM_UNREGISTERED
> > > and use it for both purposes.
> > 
> > Hmm.  How do you think it is possible that the pm_runtime_* functions will be
> > called in such a situation?
> 
> By mistake.  :-)
> 
> Seriously, there _are_ places where drivers get bound to devices before
> those devices are registered.  This happens for example in USB when a
> bunch of related interfaces are present in the same physical device.  
> When the first interface is registered, its driver binds itself to all
> the others even though they haven't been registered yet.

Well, the suspend functions could be protected against that under the
assumption that no suspend is possible for resume_counter = 0 (then, the "good
to go" value would be -1).

Still, the resume functions start by acquiring a spinlock, which is not going
to work if that spinlock is uninitialized.
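
One possible reading of the resume_counter convention mentioned above, as
a user-space sketch: a zero-filled structure has the counter at 0, which
blocks suspend, and initialization moves it to -1, the "good to go"
value.  struct counter_model and the model_*() helpers are hypothetical
names, and this is not how the patch implements the counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the counter kept in the device's PM data. */
struct counter_model {
	int resume_counter;
};

/* Initialization is the only way to reach the "good to go" value. */
void model_pm_init(struct counter_model *c)
{
	c->resume_counter = -1;
}

/* A zeroed (never initialized) counter is 0, so suspend is refused. */
bool model_suspend_allowed(const struct counter_model *c)
{
	return c->resume_counter == -1;
}

void model_counter_get(struct counter_model *c)
{
	c->resume_counter++;
}

void model_counter_put(struct counter_model *c)
{
	c->resume_counter--;
}
```

This only guards the suspend side.  It does nothing for a lock that has
never been initialized, which is the remaining problem for resume.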

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-21  2:23                                 ` Alan Stern
  (?)
  (?)
@ 2009-06-21 12:46                                 ` Rafael J. Wysocki
  2009-06-22 15:01                                     ` Alan Stern
  -1 siblings, 1 reply; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-21 12:46 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Sunday 21 June 2009, Alan Stern wrote:
> On Sun, 21 Jun 2009, Rafael J. Wysocki wrote:
> 
> > > 	pm_request_suspend should run very quickly, since it will be
> > > 	called after every I/O operation.  Likewise, pm_request_resume
> > > 	should run very quickly if the status is RPM_ACTIVE or 
> > > 	RPM_IDLE.
> > 
> > Hmm.  pm_request_suspend() is really short, so it should be fast.
> > pm_request_resume() is a bit more complicated, though (it takes two spinlocks,
> > increases an atomic counter, possibly twice, and queues up a work item, also
> > in the RPM_IDLE case).
> 
> Hmm, maybe that's a good reason for not trying to handle the parent
> from within pm_request_resume().  :-)
>
> Or maybe the routine should be optimized for the RPM_ACTIVE and 
> RPM_IDLE cases, where it doesn't have to do much anyway.

Yes, I think that's the way to go.

> > > 	In order to prevent autosuspends from occurring while I/O is
> > > 	in progress, the pm_request_resume call should increment the
> > > 	usage counter (if it had to queue the request) and the 
> > > 	pm_request_suspend call should decrement it (maybe after
> > > 	waiting for the delay).
> > 
> > I don't want pm_request_suspend() to do that, because it's valid to
> > call it many times in a row (only the first request will be queued in such a
> > case).
> 
> Sorry, what I meant was that in each case the counter should be
> {inc,dec}remented if a new request had to be queued.  If one was
> already queued then the counter should be left alone.
> 
> The reason behind this is that a bunch of pm_request_suspend calls
> which all end up referring to the same workqueue item will result in a
> single async call to the runtime_suspend method.

Yes, that's why only the first one results in queuing up a request.

There is a problem with that if the later calls are supposed to use shorter
delays, but I have no real idea how to handle this cleanly.

> Therefore they should cause a single decrement of the counter.  Likewise for
> pm_request_resume.

Hmm.  Why exactly do you think it's necessary to decrease the usage counter
in suspend functions?  You can't suspend a device more than once and you have
to resume it at the first request anyway.

I think it makes sense to increase the usage counter on every attempt to
resume, even if the device is not woken up as a result, because that means the
caller wants the device not to be suspended until the counter is decreased.
This way, even if the device is already active, multiple callers can prevent it
from suspending by calling pm_request_resume_get() or pm_runtime_resume_get()
and then dropping the references.

Now, we can also make pm_request_suspend() and pm_runtime_suspend() drop
the usage counter (if it's greater than zero), but that implies a usage model
in which a resume function called when I/O is started should be balanced with a
suspend function called after the I/O has been finished.

However, I'd prefer a usage model in which ->runtime_idle() is called when the
I/O is finished and the usage counter is zero and it decides whether to call a
suspend function.

So, perhaps I should make resume functions increase the usage counter
unconditionally and introduce pm_runtime_idle() to be called when the I/O is
done?  That is, pm_runtime_idle() will decrement the usage counter, check if
it's zero and call ->runtime_idle() when that's the case (well, this is what
pm_runtime_put_notify() does right now, but maybe the name is wrong).

Also, there should be a function to use when it's only necessary to drop the
usage counter, without calling ->runtime_idle() (for example, if another code
path is supposed to call a suspend function directly).
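
The usage model proposed in the last few paragraphs could be sketched in
user space as follows.  This is a model under stated assumptions, not the
patch's code: struct io_dev_model and the model_*() names are
hypothetical, a plain int replaces the atomic usage counter, and
count_idle() is a dummy callback that only records that it ran.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device model with an optional idle callback. */
struct io_dev_model {
	int usage_count;
	int idle_calls;		/* how many times the callback ran */
	void (*runtime_idle)(struct io_dev_model *dev);
};

/* Every resume attempt takes a reference, even if already active. */
void model_resume_get(struct io_dev_model *dev)
{
	dev->usage_count++;
}

/* Drop a reference without notifying anyone. */
void model_put(struct io_dev_model *dev)
{
	if (dev->usage_count > 0)
		dev->usage_count--;
}

/* Drop a reference and run the idle callback once the count is zero. */
void model_idle(struct io_dev_model *dev)
{
	model_put(dev);
	if (dev->usage_count == 0 && dev->runtime_idle != NULL)
		dev->runtime_idle(dev);
}

/* Dummy callback: a real one would decide whether to suspend. */
void count_idle(struct io_dev_model *dev)
{
	dev->idle_calls++;
}
```

Two callers that each take a reference keep the device active until both
have called model_idle(); only the second call reaches the callback.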

> > I'd prefer the caller to do pm_request_resume_get() (please see the patch
> > below) to put a resume request into the queue and then pm_runtime_put_notify()
> > when it's done with the I/O.  That will result in ->runtime_idle() being called
> > automatically if the device may be suspended.
> 
> If anyone does pm_request_resume or pm_runtime_resume, what is there to 
> prevent the device from being suspended again as soon as the resume is 
> finished (and before anything useful can be accomplished)?
> 
>      1. The driver's runtime_suspend method might be smart enough to 
> 	return -EBUSY until the driver has completed whatever it's 
> 	doing.
> 
>      2. The usage counter might be > 0.
> 
>      3. The number of children might be > 0.
> 
> In case 1 there's no reason not to also increment the counter, since
> the driver can decrement it again when it is finished.  In cases 2 and
> 3, we can assume the counter or the number of children was previously
> equal to 0, since otherwise the resume call would have been vacuous.  
> This implies that the resume call itself should be responsible for
> incrementing either the counter or the number of children.
> 
> What I'm getting at is this: There's no real point to having separate 
> pm_request_resume and pm_request_resume_get calls.  All such calls 
> should increment either the usage counter or the number of children.

OK

But there are devices with no children, so I think it's necessary to increment
the usage counter in all cases.

> (In the USB stack, a single counter is used for both purposes.  It 
> doesn't look like that will work here.)
> 
> 
> > > All bus types will want to implement _some_ delay; it doesn't make
> > > sense to power down a device immediately after every operation and then
> > > power it back up for the next operation.
> > 
> > Sure.  But you can use the pm_request_resume()'s delay to achieve that
> > without storing the delay in 'struct device'.  It seems.
> 
> If you do it that way then the delay has to be hard-coded or stored in
> some non-standardized location.  Which will be more common: devices
> where the delay is fixed by the bus type (or driver), or devices where
> the user should be able to adjust the delay?  If user-adjustable is 
> more common then the delay should be stored in dev_pm_info, so that it 
> can be controlled by a centralized sysfs attribute defined in the 
> driver core.  If not then you are right, the delay doesn't need to be 
> stored in struct device.

I agree, but I'd prefer to add the delay and timestamp fields to the picture
in a future patch.

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [linux-pm] [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-20 14:30                             ` [linux-pm] " Alan Stern
@ 2009-06-22  6:20                                 ` Magnus Damm
  2009-06-20 23:48                               ` [linux-pm] " Rafael J. Wysocki
                                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 118+ messages in thread
From: Magnus Damm @ 2009-06-22  6:20 UTC (permalink / raw)
  To: Alan Stern
  Cc: Rafael J. Wysocki, Greg KH, LKML, ACPI Devel Maling List,
	Linux-pm mailing list, Ingo Molnar

On Sat, Jun 20, 2009 at 11:30 PM, Alan Stern<stern@rowland.harvard.edu> wrote:
> Some more thoughts...
>
> Magnus, you might have some insights here.  It occurred to me that some
> devices can switch power levels very quickly, and the drivers might
> therefore want the runtime suspend and resume methods to be called as
> soon as possible, even in interrupt context.

I'd like to call pm_request_suspend() from interrupt context. I don't
depend on it, but being able to perform runtime suspend directly from
the ISR would be convenient from a device driver POV. I'm not sure if
that should result in bus/device ->runtime_suspend() calls from
interrupt context though.

In my case the bus specific code for ->runtime_suspend() may just
decrease the usage count of the powerdomain but refrain from calling
the device ->runtime_suspend() callbacks until all devices in the
powerdomain have been suspended. The bus/device runtime suspend
callbacks do not need to be executed from interrupt context. Just
noting that the device is idle is enough at interrupt time. This could
be handled by generic code IMO.
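
A rough user-space sketch of the power domain bookkeeping described above
follows.  It is only a model under stated assumptions: struct model_domain,
struct model_pdev and the model_*() functions are hypothetical names, and the
"powered_down" flag stands in for running the device-level
->runtime_suspend() callback.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_DEVS 4

struct model_domain;

/* Hypothetical device that belongs to one power domain. */
struct model_pdev {
	struct model_domain *domain;
	int idle;		/* bus code has noted the device as idle */
	int powered_down;	/* device-level suspend callback has run */
};

struct model_domain {
	struct model_pdev *devs[MODEL_MAX_DEVS];
	size_t ndevs;
	int active_count;	/* devices in the domain that are still in use */
};

static void model_domain_add(struct model_domain *pd, struct model_pdev *dev)
{
	assert(pd->ndevs < MODEL_MAX_DEVS);
	dev->domain = pd;
	dev->idle = 0;
	dev->powered_down = 0;
	pd->devs[pd->ndevs++] = dev;
	pd->active_count++;
}

/*
 * Bus-level suspend: only cheap bookkeeping per call.  The device-level
 * callbacks are deferred until the last device in the domain goes idle.
 */
static void model_bus_runtime_suspend(struct model_pdev *dev)
{
	struct model_domain *pd = dev->domain;
	size_t i;

	if (dev->idle)
		return;		/* already counted, nothing to do */
	dev->idle = 1;
	if (--pd->active_count > 0)
		return;		/* other devices still need the domain */
	for (i = 0; i < pd->ndevs; i++)
		pd->devs[i]->powered_down = 1;
}
```

Marking a device idle is a counter update, so that part could run at interrupt
time, while the final loop is the part that would be deferred to process
context in a real implementation.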

Runtime resume needs to block until the hardware is woken up though.
Just marking the device as resumed and letting the driver access the
hardware before it is woken up does not seem like a good idea. =) For
my SuperH devices I do not need to resume from interrupt context, at
least at this point.

From my perspective it's ok to specify that the ->runtime_suspend()
and ->runtime_resume() callbacks are executed from process context
only and may sleep. Seems like a simple and good interface that can be
accepted by many bus types. My bus and driver code do not need to
sleep though, so a direct-from-interrupt-context design is fine as
well.

> In terms of the current framework, this probably means holding the
> runtime PM lock (i.e., not releasing it) across the calls to
> ->runtime_suspend and ->runtime_resume.  It also means that
> pm_request_suspend and pm_request_resume should carry out their jobs
> immediately instead of queuing a work item.  (Unless the current status
> is RPM_SUSPENDING or RPM_RESUMING, which should never happen.)

No problem holding a per struct device lock. I suspect that executing
the callbacks from interrupt context is the most efficient design, but
it may come with interrupt latency side effects.

> Should there be a flag in dev_pm_info to select this behavior?

I'd say that executing the callbacks from process context is enough
for now. This will probably be a good match together with interrupt
threads as well.

Maybe the ARM guys have more advanced requirements?

Cheers,

/ magnus

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [linux-pm] [patch update 2 fix] PM: Introduce core framework for  run-time PM of I/O devices
  2009-06-22  6:20                                 ` Magnus Damm
@ 2009-06-22  6:43                                   ` Arjan van de Ven
  -1 siblings, 0 replies; 118+ messages in thread
From: Arjan van de Ven @ 2009-06-22  6:43 UTC (permalink / raw)
  To: Magnus Damm
  Cc: Alan Stern, Rafael J. Wysocki, Greg KH, LKML,
	ACPI Devel Maling List, Linux-pm mailing list, Ingo Molnar

On Mon, 22 Jun 2009 15:20:43 +0900
Magnus Damm <magnus.damm@gmail.com> wrote:

> On Sat, Jun 20, 2009 at 11:30 PM, Alan
> Stern<stern@rowland.harvard.edu> wrote:
> > Some more thoughts...
> >
> > Magnus, you might have some insights here.  It occurred to me that
> > some devices can switch power levels very quickly, and the drivers
> > might therefore want the runtime suspend and resume methods to be
> > called as soon as possible, even in interrupt context.
> 
> I'd like to call pm_request_suspend() from interrupt context. I don't

there are some really strong reasons to at least be able to call the
resume function from an interrupt handler.... shared interrupts are one
of them.

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-22  6:43                                   ` Arjan van de Ven
@ 2009-06-22  7:27                                     ` Magnus Damm
  -1 siblings, 0 replies; 118+ messages in thread
From: Magnus Damm @ 2009-06-22  7:27 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Greg KH, LKML, ACPI Devel Maling List, Linux-pm mailing list,
	Ingo Molnar

On Mon, Jun 22, 2009 at 3:43 PM, Arjan van de Ven<arjan@infradead.org> wrote:
> On Mon, 22 Jun 2009 15:20:43 +0900
> Magnus Damm <magnus.damm@gmail.com> wrote:
>
>> On Sat, Jun 20, 2009 at 11:30 PM, Alan
>> Stern<stern@rowland.harvard.edu> wrote:
>> > Some more thoughts...
>> >
>> > Magnus, you might have some insights here.  It occurred to me that
>> > some devices can switch power levels very quickly, and the drivers
>> > might therefore want the runtime suspend and resume methods to be
>> > called as soon as possible, even in interrupt context.
>>
>> I'd like to call pm_request_suspend() from interrupt context. I don't
>
> there are some really strong reasons to at least be able to call the
> resume function from an interrupt handler.... shared interrupts are one
> of them.

I suppose you mean that you need to resume the hardware device before
you can check if it has a pending interrupt source? If so then you
also mean that suspended hardware devices may generate interrupts, no?

My plan for SuperH SoC is that runtime suspend always stops the clock,
but register save and power off may happen. Regardless, interrupts are
not generated from suspended hardware devices.

In the rare case of shared interrupts on SuperH we can just skip over
the platform devices that have been runtime suspended since they will
not have generated any interrupt.
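
To make the idea concrete, here is a minimal user-space sketch of a shared
interrupt handler that skips runtime-suspended devices.  The names
(struct model_irq_dev, model_shared_irq()) are hypothetical, and the
"irq_pending" flag is an assumed stand-in for reading a device status
register, which is only safe while the device is not suspended.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device sharing one interrupt line with others. */
struct model_irq_dev {
	bool suspended;		/* runtime suspended: clock stopped, no IRQs */
	bool irq_pending;	/* stand-in for a status register read */
	int handled;		/* interrupts serviced by this device */
};

/* Walk all devices on the line; return how many serviced an interrupt. */
static int model_shared_irq(struct model_irq_dev *devs, size_t n)
{
	int handled = 0;
	size_t i;

	assert(devs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (devs[i].suspended)
			continue;	/* never touch a suspended device */
		if (!devs[i].irq_pending)
			continue;	/* active, but not the source */
		devs[i].irq_pending = false;
		devs[i].handled++;
		handled++;
	}
	return handled;
}
```

This only works if suspended hardware really cannot raise interrupts; on
hardware where it can, the handler would have to resume the device first.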

But yes, if there is hardware that can generate interrupts from
suspended state then we need to resume from interrupt context. This
probably means that the entire bus hierarchy must resume from
interrupt context as well.

/ magnus

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [linux-pm] [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-22  6:20                                 ` Magnus Damm
                                                   ` (2 preceding siblings ...)
  (?)
@ 2009-06-22  8:15                                 ` Oliver Neukum
  -1 siblings, 0 replies; 118+ messages in thread
From: Oliver Neukum @ 2009-06-22  8:15 UTC (permalink / raw)
  To: Magnus Damm
  Cc: Alan Stern, Rafael J. Wysocki, Greg KH, LKML,
	ACPI Devel Maling List, Linux-pm mailing list, Ingo Molnar

On Monday, 22 June 2009 08:20:43, Magnus Damm wrote:
> I'd like to call pm_request_suspend() from interrupt context. I don't
> depend on it, but being able to perform runtime suspend directly from
> the ISR would be convenient from a device driver POV. I'm not sure if
> that should result in bus/device ->runtime_suspend() calls from
> interrupt context though.
>
> In my case the bus specific code for ->runtime_suspend() may just
> decrease the usage count of the powerdomain but refrain from calling
> the device ->runtime_suspend() callbacks until all devices in the
> powerdomain have been suspended. The bus/device runtime suspend
> callbacks do not need to be executed from interrupt context. Just
> noting that the device is idle is enough at interrupt time. This could
> be handled by generic code IMO.

From practical experience doing USB power management I can tell
you that requesting suspension from interrupt makes things a lot
easier for driver writers.

	Regards
		Oliver


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [linux-pm] [patch update 2 fix] PM: Introduce core framework for  run-time PM of I/O devices
  2009-06-22  7:27                                     ` [linux-pm] " Magnus Damm
@ 2009-06-22 13:49                                       ` Arjan van de Ven
  -1 siblings, 0 replies; 118+ messages in thread
From: Arjan van de Ven @ 2009-06-22 13:49 UTC (permalink / raw)
  To: Magnus Damm
  Cc: Alan Stern, Rafael J. Wysocki, Greg KH, LKML,
	ACPI Devel Maling List, Linux-pm mailing list, Ingo Molnar

On Mon, 22 Jun 2009 16:27:29 +0900
Magnus Damm <magnus.damm@gmail.com> wrote:

> On Mon, Jun 22, 2009 at 3:43 PM, Arjan van de
> Ven<arjan@infradead.org> wrote:
> > On Mon, 22 Jun 2009 15:20:43 +0900
> > Magnus Damm <magnus.damm@gmail.com> wrote:
> >
> >> On Sat, Jun 20, 2009 at 11:30 PM, Alan
> >> Stern<stern@rowland.harvard.edu> wrote:
> >> > Some more thoughts...
> >> >
> >> > Magnus, you might have some insights here.  It occurred to me
> >> > that some devices can switch power levels very quickly, and the
> >> > drivers might therefore want the runtime suspend and resume
> >> > methods to be called as soon as possible, even in interrupt
> >> > context.
> >>
> >> I'd like to call pm_request_suspend() from interrupt context. I
> >> don't
> >
> > there are some really strong reasons to at least be able to call the
> > resume function from an interrupt handler.... shared interrupts are
> > one of them.
> 
> I suppose you mean that you need to resume the hardware device before
> you can check if it has a pending interrupt source? If so then you
> also mean that suspended hardware devices may generate interrupts, no?

yes and no. For the shared interrupt case.. no.
but yes for the hw I have in mind (and on my desk ;-) that can happen
as well from the device itself.

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [linux-pm] [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-21 11:32                                   ` [linux-pm] " Rafael J. Wysocki
@ 2009-06-22 14:16                                     ` Alan Stern
  2009-06-22 15:27                                       ` Rafael J. Wysocki
  2009-06-22 15:27                                       ` [linux-pm] " Rafael J. Wysocki
  2009-06-22 14:16                                     ` Alan Stern
  1 sibling, 2 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-22 14:16 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Magnus Damm, Greg KH, LKML, ACPI Devel Maling List,
	Linux-pm mailing list, Ingo Molnar

On Sun, 21 Jun 2009, Rafael J. Wysocki wrote:

> > Seriously, there _are_ places where drivers get bound to devices before
> > those devices are registered.  This happens for example in USB when a
> > bunch of related interfaces are present in the same physical device.  
> > When the first interface is registered, its driver binds itself to all
> > the others even though they haven't been registered yet.
> 
> Well, the suspend functions could be protected against that under the
> assumption that no suspend is possible for resume_counter = 0 (then, the "good
> to go" value would be -1).
> 
> Still, the resume functions start by acquiring a spinlock, which is not going
> to work if that spinlock is uninitialized.

The initialization needs to be improved.  Most of the code in
pm_runtime_init() should be called from device_pm_init(), and the rest
should be moved into a separate pm_runtime_add() routine to be called
from device_pm_add().

One of the things pm_runtime_add() could do is change the status from 
RPM_UNREGISTERED to RPM_ACTIVE.
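
A minimal user-space sketch of the two-phase initialization described
above, not kernel code.  The model_* names, the lock_ready flag standing
in for an initialized spinlock, and the initial resume_counter value of 1
are assumptions made for illustration only:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; this is not the kernel's struct dev_pm_info. */
enum rpm_status { RPM_UNREGISTERED, RPM_ACTIVE, RPM_SUSPENDED };

struct model_dev {
	bool lock_ready;	/* stands in for an initialized spinlock */
	int resume_counter;	/* > 0 blocks suspend (assumed convention) */
	enum rpm_status status;
};

/* Phase 1: the part that would be called from device_pm_init(). */
static void model_pm_runtime_init(struct model_dev *dev)
{
	dev->lock_ready = true;
	dev->resume_counter = 1;
	dev->status = RPM_UNREGISTERED;
}

/* Phase 2: the part that would be called from device_pm_add(). */
static void model_pm_runtime_add(struct model_dev *dev)
{
	assert(dev->lock_ready);	/* phase 1 must have run already */
	dev->status = RPM_ACTIVE;
}

/* Resume is safe any time after phase 1; returns 0 on success, -1 otherwise. */
static int model_resume(struct model_dev *dev)
{
	if (!dev->lock_ready)
		return -1;	/* real code would touch an uninitialized lock */
	dev->resume_counter++;
	if (dev->status == RPM_SUSPENDED)
		dev->status = RPM_ACTIVE;
	return 0;
}

/* Suspend is refused until the device is registered and unreferenced. */
static int model_suspend(struct model_dev *dev)
{
	if (dev->status != RPM_ACTIVE || dev->resume_counter > 0)
		return -1;
	dev->status = RPM_SUSPENDED;
	return 0;
}
```

In this model a driver bound before registration can call the resume
path without harm, because everything the resume path touches is set up
in phase 1, while suspend stays blocked until phase 2 has run.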

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-21 12:46                                 ` Rafael J. Wysocki
@ 2009-06-22 15:01                                     ` Alan Stern
  0 siblings, 0 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-22 15:01 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Greg KH, LKML, ACPI Devel Maling List, linux-pm, Ingo Molnar

On Sun, 21 Jun 2009, Rafael J. Wysocki wrote:

> > Sorry, what I meant was that in each case the counter should be
> > {inc,dec}remented if a new request had to be queued.  If one was
> > already queued then the counter should be left alone.
> > 
> > The reason behind this is that a bunch of pm_request_suspend calls
> > which all end up referring to the same workqueue item will result in a
> > single async call to the runtime_suspend method.
> 
> Yes, that's why only the first one results in queuing up a request.
> 
> There is a problem with that if the later calls are supposed to use shorter
> delays, but I have no real idea how to handle this cleanly.

Nor do I.  When the time-of-last-use and delay fields are implemented, 
this should never arise.

> > Therefore they should cause a single decrement of the counter.  Likewise for
> > pm_request_resume.
> 
> Hmm.  Why exactly do you think it's necessary to decrease the usage counter
> in suspend functions?  You can't suspend a device more than once and you have
> to resume it at the first request anyway.
> 
> I think it makes sense to increase the usage counter on every attempt to
> resume, even if the device is not woken up as a result, because that means the
> caller wants the device not to be suspended until the counter is decreased.
> This way, even if the device is already active, multiple callers can prevent it
> from suspending by calling pm_request_resume_get() or pm_runtime_resume_get()
> and then dropping the references.

Again, this boils down to how drivers decide to use the async 
interface.  I can see justifications for both pm_request_resume_get 
(which would always increment the counter) and pm_request_resume (which 
would increment the counter only if a work item had to be queued).  And 
of course, synchronous pm_runtime_resume should always increment the 
counter.
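
To make the difference between the two async flavors concrete, here is
a small user-space model, not kernel code.  The model_* names and the
single resume_queued flag standing in for the workqueue item are
assumptions made for illustration:

```c
#include <assert.h>
#include <stdbool.h>

struct model_dev {
	int usage_count;	/* > 0 keeps the device from suspending */
	bool resume_queued;	/* one pending async resume work item */
	int items_queued;	/* how many work items were really queued */
};

/* Queue a resume work item unless one is pending; true if newly queued. */
static bool model_queue_resume(struct model_dev *dev)
{
	assert(dev->usage_count >= 0);
	if (dev->resume_queued)
		return false;
	dev->resume_queued = true;
	dev->items_queued++;
	return true;
}

/* "_get" flavor: every caller takes a reference, queued or not. */
static void model_request_resume_get(struct model_dev *dev)
{
	dev->usage_count++;
	model_queue_resume(dev);
}

/* Plain flavor: only the call that queues the item takes a reference. */
static void model_request_resume(struct model_dev *dev)
{
	if (model_queue_resume(dev))
		dev->usage_count++;
}
```

Three back-to-back calls of the plain flavor leave the count at 1, so a
single put balances them; three calls of the "_get" flavor leave it at 3
and each caller has to drop its own reference.  Either way only one work
item is queued.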

> Now, we can also make pm_request_suspend() and pm_runtime_suspend() drop
> the usage counter (if it's greater than zero), but that implies a usage model
> in which a resume function called when I/O is started should be balanced with a
> suspend function called after the I/O has been finished.
> 
> However, I'd prefer a usage model in which ->runtime_idle() is called when the
> I/O is finished and the usage counter is zero and it decides whether to call a
> suspend function.
> 
> So, perhaps I should make resume functions increase the usage counter
> unconditionally and introduce pm_runtime_idle() to be called when the I/O is
> done?  That is, pm_runtime_idle() will decrement the usage counter, check if
> it's zero and call ->runtime_idle() when that's the case (well, this is what
> pm_runtime_put_notify() does right now, but maybe the name is wrong).

Maybe it should just be called pm_runtime_put.  There could be a
separate pm_runtime_idle that doesn't decrement the counter but invokes
the callback if the counter is already 0.  (This could be useful after
a runtime_resume method returned -EBUSY.)
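
A rough user-space model of that put/idle split, not kernel code.  The
model_* names, the count_idle callback and the idle_calls field are
hypothetical and exist only to make the behavior observable:

```c
#include <assert.h>
#include <stddef.h>

struct model_dev {
	int usage_count;	/* references held by users of the device */
	int idle_calls;		/* times the idle callback has run */
	void (*runtime_idle)(struct model_dev *dev);
};

/* Example callback: just records that it was invoked. */
static void count_idle(struct model_dev *dev)
{
	dev->idle_calls++;
}

static void model_pm_runtime_get(struct model_dev *dev)
{
	dev->usage_count++;
}

/* Does not touch the counter; notifies only if nobody holds a reference. */
static void model_pm_runtime_idle(struct model_dev *dev)
{
	if (dev->usage_count == 0 && dev->runtime_idle != NULL)
		dev->runtime_idle(dev);
}

/* Drops one reference and notifies if it was the last one. */
static void model_pm_runtime_put(struct model_dev *dev)
{
	assert(dev->usage_count > 0);
	dev->usage_count--;
	model_pm_runtime_idle(dev);
}
```

With two references held, the first put and any explicit idle call do
nothing visible; the second put triggers the callback, and a later
explicit idle call can trigger it again without changing the counter.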

> Also, there should be a function to use when it's only necessary to drop the
> usage counter, without calling ->runtime_idle() (for example, if another code
> path is supposed to call a suspend function directly).

I don't see any reason for that.  It says: "The device isn't in use any
more, but even though we support autosuspend we aren't going to try to
suspend it now."  What's the point?  And as for the other code path, if
the device is already suspended when it calls the suspend function
directly, there's no harm done.

Alan Stern

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [linux-pm] [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-22 14:16                                     ` Alan Stern
  2009-06-22 15:27                                       ` Rafael J. Wysocki
@ 2009-06-22 15:27                                       ` Rafael J. Wysocki
  2009-06-22 15:39                                         ` Alan Stern
  2009-06-22 15:39                                         ` Alan Stern
  1 sibling, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-22 15:27 UTC (permalink / raw)
  To: Alan Stern
  Cc: Magnus Damm, Greg KH, LKML, ACPI Devel Maling List,
	Linux-pm mailing list, Ingo Molnar

On Monday 22 June 2009, Alan Stern wrote:
> On Sun, 21 Jun 2009, Rafael J. Wysocki wrote:
> 
> > > Seriously, there _are_ places where drivers get bound to devices before
> > > those devices are registered.  This happens for example in USB when a
> > > bunch of related interfaces are present in the same physical device.  
> > > When the first interface is registered, its driver binds itself to all
> > > the others even though they haven't been registered yet.
> > 
> > Well, the suspend functions could be protected against that under the
> > assumption that no suspend is possible for resume_counter = 0 (then, the "good
> > to go" value would be -1).
> > 
> > Still, the resume functions start by acquiring a spinlock, which is not going
> > to work if that spinlock is uninitialized.
> 
> The initialization needs to be improved.  Most of the code in
> pm_runtime_init() should be called from device_pm_init(), and the rest
> should be moved into a separate pm_runtime_add() routine to be called
> from device_pm_add().

OK

In that case, I think, the initialization of the spinlock and resume_counter
can be put into the thing called by device_pm_init().

> One of the things pm_runtime_add() could do is change the status from 
> RPM_UNREGISTERED to RPM_ACTIVE.

If the status is initially (i.e. at the device_pm_init() point) RPM_ACTIVE and
resume_counter is initially 1, what are we going to need RPM_UNREGISTERED for?

Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [linux-pm] [patch update 2 fix] PM: Introduce core framework for  run-time PM of I/O devices
  2009-06-22  6:43                                   ` Arjan van de Ven
  (?)
  (?)
@ 2009-06-22 15:33                                   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-22 15:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Magnus Damm, Alan Stern, Greg KH, LKML, ACPI Devel Maling List,
	Linux-pm mailing list, Ingo Molnar

On Monday 22 June 2009, Arjan van de Ven wrote:
> On Mon, 22 Jun 2009 15:20:43 +0900
> Magnus Damm <magnus.damm@gmail.com> wrote:
> 
> > On Sat, Jun 20, 2009 at 11:30 PM, Alan
> > Stern<stern@rowland.harvard.edu> wrote:
> > > Some more thoughts...
> > >
> > > Magnus, you might have some insights here.  It occurred to me that
> > > some devices can switch power levels very quickly, and the drivers
> > > might therefore want the runtime suspend and resume methods to be
> > > called as soon as possible, even in interrupt context.
> > 
> > I'd like to call pm_request_suspend() from interrupt context. I don't
> 
> there are some really strong reasons to at least be able to call the
> resume function from an interrupt handler.... shared interrupts are one
> of them.

Yes.  But that requires your hardware to be able to wake up fast enough, so I
think we can introduce pm_runtime_resume_atomic() and
pm_runtime_suspend_atomic() to be used with the devices that can do that, as
proposed by Alan.

Surely not all devices can do it, though.

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [linux-pm] [patch update 2 fix] PM: Introduce core framework for  run-time PM of I/O devices
  2009-06-22 13:49                                       ` Arjan van de Ven
  (?)
  (?)
@ 2009-06-22 15:39                                       ` Rafael J. Wysocki
  -1 siblings, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-22 15:39 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Magnus Damm, Alan Stern, Greg KH, LKML, ACPI Devel Maling List,
	Linux-pm mailing list, Ingo Molnar

On Monday 22 June 2009, Arjan van de Ven wrote:
> On Mon, 22 Jun 2009 16:27:29 +0900
> Magnus Damm <magnus.damm@gmail.com> wrote:
> 
> > On Mon, Jun 22, 2009 at 3:43 PM, Arjan van de
> > Ven<arjan@infradead.org> wrote:
> > > On Mon, 22 Jun 2009 15:20:43 +0900
> > > Magnus Damm <magnus.damm@gmail.com> wrote:
> > >
> > >> On Sat, Jun 20, 2009 at 11:30 PM, Alan
> > >> Stern<stern@rowland.harvard.edu> wrote:
> > >> > Some more thoughts...
> > >> >
> > >> > Magnus, you might have some insights here.  It occurred to me
> > >> > that some devices can switch power levels very quickly, and the
> > >> > drivers might therefore want the runtime suspend and resume
> > >> > methods to be called as soon as possible, even in interrupt
> > >> > context.
> > >>
> > >> I'd like to call pm_request_suspend() from interrupt context. I
> > >> don't
> > >
> > > there are some really strong reasons to at least be able to call the
> > > resume function from an interrupt handler.... shared interrupts are
> > > one of them.
> > 
> > I suppose you mean that you need to resume the hardware device before
> > you can check if it has a pending interrupt source? If so then you
> > also mean that suspended hardware devices may generate interrupts, no?
> 
> yes and no. For the shared interrupt case.. no.
> but yes for the hw I have in mind (and on my desk ;-) that can happen
> as well from the device itself.

If that's PCI hardware (I guess it is ;-)), I'm not really sure if this
behavior is compliant with the specification.

Anyway, if the interrupt is not shared and the device can wake up fast enough,
we should be able to handle it.

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [linux-pm] [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-22 15:27                                       ` [linux-pm] " Rafael J. Wysocki
@ 2009-06-22 15:39                                         ` Alan Stern
  2009-06-22 15:53                                           ` Rafael J. Wysocki
  2009-06-22 15:53                                           ` [linux-pm] " Rafael J. Wysocki
  2009-06-22 15:39                                         ` Alan Stern
  1 sibling, 2 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-22 15:39 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Magnus Damm, Greg KH, LKML, ACPI Devel Maling List,
	Linux-pm mailing list, Ingo Molnar

On Mon, 22 Jun 2009, Rafael J. Wysocki wrote:

> On Monday 22 June 2009, Alan Stern wrote:
> > On Sun, 21 Jun 2009, Rafael J. Wysocki wrote:
> > 
> > > > Seriously, there _are_ places where drivers get bound to devices before
> > > > those devices are registered.  This happens for example in USB when a
> > > > bunch of related interfaces are present in the same physical device.  
> > > > When the first interface is registered, its driver binds itself to all
> > > > the others even though they haven't been registered yet.
> > > 
> > > Well, the suspend functions could be protected against that under the
> > > assumption that no suspend is possible for resume_counter = 0 (then, the "good
> > > to go" value would be -1).
> > > 
> > > Still, the resume functions start by acquiring a spinlock, which is not going
> > > to work if that spinlock is uninitialized.
> > 
> > The initialization needs to be improved.  Most of the code in
> > pm_runtime_init() should be called from device_pm_init(), and the rest
> > should be moved into a separate pm_runtime_add() routine to be called
> > from device_pm_add().
> 
> OK
> 
> In that case, I think, the initialization of the spinlock and resume_counter
> can be put into the thing called by device_pm_init().

Right.

> > One of the things pm_runtime_add() could do is change the status from 
> > RPM_UNREGISTERED to RPM_ACTIVE.
> 
> If the status is initially (i.e. at the device_pm_init() point) RPM_ACTIVE and
> resume_counter is initially 1, what are we going to need RPM_UNREGISTERED for?

Okay, we don't need it then.  I forgot to mention in the previous
message that there also has to be a pm_runtime_del() routine, which
should cancel pending workqueue items and set the counter to some high
value so that no new items are added.
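
Roughly, as a user-space model and not kernel code.  The model_* names,
the work_pending flag standing in for a queued workqueue item, and the
particular "high value" are assumptions made for illustration:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Arbitrary high value; leaves headroom so later increments cannot overflow. */
#define MODEL_COUNT_BLOCKED	(INT_MAX / 2)

struct model_dev {
	int resume_counter;	/* > 0 blocks new suspend requests */
	bool work_pending;	/* stands in for a queued work item */
};

/* Queue a suspend request; true if a new item was queued. */
static bool model_request_suspend(struct model_dev *dev)
{
	assert(dev->resume_counter >= 0);
	if (dev->resume_counter > 0 || dev->work_pending)
		return false;
	dev->work_pending = true;
	return true;
}

/* Teardown: drop anything pending and block all further requests. */
static void model_pm_runtime_del(struct model_dev *dev)
{
	dev->work_pending = false;	/* real code would cancel the work item */
	dev->resume_counter = MODEL_COUNT_BLOCKED;
}
```

After model_pm_runtime_del() the ordinary "counter > 0" check already
rejects every new request, so the request path needs no separate
"device is going away" test.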

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-22 15:01                                     ` Alan Stern
  (?)
  (?)
@ 2009-06-22 15:49                                     ` Rafael J. Wysocki
  2009-06-22 16:28                                       ` Alan Stern
                                                         ` (3 more replies)
  -1 siblings, 4 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-22 15:49 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Monday 22 June 2009, Alan Stern wrote:
> On Sun, 21 Jun 2009, Rafael J. Wysocki wrote:
> 
> > > Sorry, what I meant was that in each case the counter should be
> > > {inc,dec}remented if a new request had to be queued.  If one was
> > > already queued then the counter should be left alone.
> > > 
> > > The reason behind this is that a bunch of pm_request_suspend calls
> > > which all end up referring to the same workqueue item will result in a
> > > single async call to the runtime_suspend method.
> > 
> > Yes, that's why only the first one results in queuing up a request.
> > 
> > There is a problem with that if the later calls are supposed to use shorter
> > delays, but I have no real idea how to handle this cleanly.
> 
> Nor do I.  When the time-of-last-use and delay fields are implemented, 
> this should never arise.

OK, so I'd like to leave it as is for now, on the assumption that it's going
to be solved in the future.

> > > Therefore they should cause a single decrement of the counter.  Likewise for
> > > pm_request_resume.
> > 
> > Hmm.  Why exactly do you think it's necessary to decrease the usage counter
> > in suspend functions?  You can't suspend a device more than once and you have
> > to resume it at the first request anyway.
> > 
> > I think it makes sense to increase the usage counter on every attempt to
> > resume, even if the device is not woken up as a result, because that means the
> > caller wants the device not to be suspended until the counter is decreased.
> > This way, even if the device is already active, multiple callers can prevent it
> > from suspending by calling pm_request_resume_get() or pm_runtime_resume_get()
> > and then dropping the references.
> 
> Again, this boils down to how drivers decide to use the async 
> interface.  I can see justifications for both pm_request_resume_get 
> (which would always increment the counter) and pm_request_resume (which 
> would increment the counter only if a work item had to be queued).

OK, so this means we should provide both at the core level and let the drivers
decide which one to use.

I think in both cases the caller would be responsible for decrementing the
counter?
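
To make the difference between the two variants concrete, here is a rough
user-space sketch.  Everything in it is an assumption made for illustration:
the struct, the field names and the sketch_* functions are hypothetical, and a
boolean models the pending work item.

```c
#include <stdbool.h>

/* Hypothetical per-device state; not the layout used by the patch. */
struct rpm_sketch {
	int usage_count;
	bool resume_queued;	/* models a pending resume work item */
};

/* Increment the counter only when a new work item has to be queued. */
static bool sketch_request_resume(struct rpm_sketch *rpm)
{
	if (rpm->resume_queued)
		return false;	/* already queued: leave the counter alone */
	rpm->resume_queued = true;
	rpm->usage_count++;
	return true;
}

/* Always increment; queue a work item only if none is pending. */
static bool sketch_request_resume_get(struct rpm_sketch *rpm)
{
	rpm->usage_count++;
	if (rpm->resume_queued)
		return false;
	rpm->resume_queued = true;
	return true;
}
```

In both cases the caller would still owe one decrement per increment it
caused, which is why the two variants call for different put patterns.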

> And of course, synchronous pm_runtime_resume should always increment the 
> counter.

Sure.

> > Now, we can also make pm_request_suspend() and pm_runtime_suspend() drop
> > the usage counter (if it's greater than zero), but that implies a usage model
> > in which a resume function called when I/O is started should be balanced with a
> > suspend function called after the I/O has been finished.
> > 
> > However, I'd prefer a usage model in which ->runtime_idle() is called when the
> > I/O is finished and the usage counter is zero and it decides whether to call a
> > suspend function.
> > 
> > So, perhaps I should make resume functions increase the usage counter
> > unconditionally and introduce pm_runtime_idle() to be called when the I/O is
> > done?  That is, pm_runtime_idle() will decrement the usage counter, check if
> > it's zero and call ->runtime_idle() when that's the case (well, this is what
> > pm_runtime_put_notify() does right now, but maybe the name is wrong).
> 
> Maybe it should just be called pm_runtime_put.  There could be a
> separate pm_runtime_idle that doesn't decrement the counter but invokes
> the callback if the counter is already 0.  (This could be useful after
> a runtime_resume method returned -EBUSY.)

OK
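
A small sketch of how the two helpers could relate, assuming a plain integer
counter and a callback pointer.  The names below are hypothetical, and the
choice to silently ignore an unbalanced put is an assumption of this sketch,
not something settled in the discussion.

```c
/* Hypothetical device state for illustration only. */
struct idle_sketch {
	int usage_count;
	int idle_calls;		/* how often the idle callback ran */
	void (*runtime_idle)(struct idle_sketch *dev);
};

/* Example callback: just records that it was invoked. */
static void sketch_count_idle(struct idle_sketch *dev)
{
	dev->idle_calls++;
}

/* Does not touch the counter; notifies only if it is already zero. */
static void sketch_runtime_idle(struct idle_sketch *dev)
{
	if (dev->usage_count == 0 && dev->runtime_idle)
		dev->runtime_idle(dev);
}

/* Drops one reference, then notifies if the counter is now zero. */
static void sketch_runtime_put(struct idle_sketch *dev)
{
	if (dev->usage_count > 0)
		dev->usage_count--;
	sketch_runtime_idle(dev);
}
```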

> > Also, there should be a function to use when it's only necessary to drop the
> > usage counter, without calling ->runtime_idle() (for example, if another code
> > path is supposed to call a suspend function directly).
> 
> I don't see any reason for that.  It says: "The device isn't in use any
> more, but even though we support autosuspend we aren't going to try to
> suspend it now."  What's the point?  And as for the other code path, if
> the device is already suspended when it calls the suspend function
> directly, there's no harm done.

OK

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [linux-pm] [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-22 15:39                                         ` Alan Stern
  2009-06-22 15:53                                           ` Rafael J. Wysocki
@ 2009-06-22 15:53                                           ` Rafael J. Wysocki
  1 sibling, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-22 15:53 UTC (permalink / raw)
  To: Alan Stern
  Cc: Magnus Damm, Greg KH, LKML, ACPI Devel Maling List,
	Linux-pm mailing list, Ingo Molnar

On Monday 22 June 2009, Alan Stern wrote:
> On Mon, 22 Jun 2009, Rafael J. Wysocki wrote:
> 
> > On Monday 22 June 2009, Alan Stern wrote:
> > > On Sun, 21 Jun 2009, Rafael J. Wysocki wrote:
> > > 
> > > > > Seriously, there _are_ places where drivers get bound to devices before
> > > > > those devices are registered.  This happens for example in USB when a
> > > > > bunch of related interfaces are present in the same physical device.  
> > > > > When the first interface is registered, its driver binds itself to all
> > > > > the others even though they haven't been registered yet.
> > > > 
> > > > Well, the suspend functions could be protected against that under the
> > > > assumption that no suspend is possible for resume_counter = 0 (then, the "good
> > > > to go" value would be -1).
> > > > 
> > > > Still, the resume functions start by acquiring a spinlock, which is not going
> > > > to work if that spinlock is uninitialized.
> > > 
> > > The initialization needs to be improved.  Most of the code in
> > > pm_runtime_init() should be called from device_pm_init(), and the rest
> > > should be moved into a separate pm_runtime_add() routine to be called
> > > from device_pm_add().
> > 
> > OK
> > 
> > In that case, I think, the initialization of the spinlock and resume_counter
> > can be put into the thing called by device_pm_init().
> 
> Right.
> 
> > > One of the things pm_runtime_add() could do is change the status from 
> > > RPM_UNREGISTERED to RPM_ACTIVE.
> > 
> > If the status is initially (i.e. at the device_pm_init() point) RPM_ACTIVE and
> > resume_counter is initially 1, what are we going to need RPM_UNREGISTERED for?
> 
> Okay, we don't need it then.  I forgot to mention in the previous
> message that there also has to be a pm_runtime_del() routine, which
> should cancel pending workqueue items and set the counter to some high
> value so that no new items are added.

Should that be called by device_pm_remove()?  I think so.

Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-22 15:49                                     ` Rafael J. Wysocki
@ 2009-06-22 16:28                                         ` Alan Stern
  2009-06-22 16:28                                         ` Alan Stern
                                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-22 16:28 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Mon, 22 Jun 2009, Rafael J. Wysocki wrote:

> > Again, this boils down to how drivers decide to use the async 
> > interface.  I can see justifications for both pm_request_resume_get 
> > (which would always increment the counter) and pm_request_resume (which 
> > would increment the counter only if a work item had to be queued).
> 
> OK, so this means we should provide both at the core level and let the drivers
> decide which one to use.
> 
> I think in both cases the caller would be responsible for decrementing the
> counter?

Sure.  They could call pm_runtime_put just once at the end of their
runtime_resume method (assuming they used pm_request_resume), or they
could call it at every place where some deferred work was finished 
(assuming they used pm_request_resume_get).

> > Okay, we don't need it then.  I forgot to mention in the previous
> > message that there also has to be a pm_runtime_del() routine, which
> > should cancel pending workqueue items and set the counter to some high
> > value so that no new items are added.
> 
> Should that be called by device_pm_remove()?  I think so.

Yes.  I suppose it could be named pm_runtime_remove.  Either would be 
okay.

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-22 16:28                                         ` Alan Stern
  (?)
  (?)
@ 2009-06-22 23:02                                         ` Rafael J. Wysocki
  -1 siblings, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-22 23:02 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Monday 22 June 2009, Alan Stern wrote:
> On Mon, 22 Jun 2009, Rafael J. Wysocki wrote:
> 
> > > Again, this boils down to how drivers decide to use the async 
> > > interface.  I can see justifications for both pm_request_resume_get 
> > > (which would always increment the counter) and pm_request_resume (which 
> > > would increment the counter only if a work item had to be queued).
> > 
> > OK, so this means we should provide both at the core level and let the drivers
> > decide which one to use.
> > 
> > I think in both cases the caller would be responsible for decrementing the
> > counter?
> 
> Sure.  They could call pm_runtime_put just once at the end of their
> runtime_resume method (assuming they used pm_request_resume), or they
> could call it at every place where some deferred work was finished 
> (assuming they used pm_request_resume_get).
> 
> > > Okay, we don't need it then.  I forgot to mention in the previous
> > > message that there also has to be a pm_runtime_del() routine, which
> > > should cancel pending workqueue items and set the counter to some high
> > > value so that no new items are added.
> > 
> > Should that be called by device_pm_remove()?  I think so.
> 
> Yes.  I suppose it could be named pm_runtime_remove.  Either would be 
> okay.

OK

I'm sending a new version of the $subject patch, containing these changes
among other things, in a new thread.

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-22 15:49                                     ` Rafael J. Wysocki
@ 2009-06-23 17:02                                         ` Alan Stern
  2009-06-22 16:28                                         ` Alan Stern
                                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-23 17:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Mon, 22 Jun 2009, Rafael J. Wysocki wrote:

> > And of course, synchronous pm_runtime_resume should always increment the 
> > counter.
> 
> Sure.

Now that I've thought about it some more, I decided that we might want
to be more flexible.  Without subjecting you to the entire line of
reasoning, let's just say that I'm starting to wonder whether it's such
a good idea to tie the counter increments to the PM core runtime resume
calls at all.

Maybe it would be better (easier to use, less constraining) to require
the runtime_resume callback to do its own pm_runtime_get.  That way the
driver would be entirely responsible for managing the usage counter;
the PM core wouldn't be involved.  pm_runtime_get would simply
increment the counter, so it could be used even in interrupt context.  
At the moment, I don't see any need for it to queue an autoresume
request if the device happens to be suspended.
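
As a rough illustration of a "get" that only increments, here is a user-space
sketch using C11 atomics.  The struct and function names are hypothetical; in
the kernel the atomic_t helpers would presumably take the place of
<stdatomic.h>.

```c
#include <stdatomic.h>

/* Hypothetical device state; only the usage counter matters here. */
struct get_sketch {
	atomic_int usage_count;
};

/*
 * Only bumps the counter: no locks taken, nothing queued, no sleeping.
 * Returns the new counter value.
 */
static int sketch_runtime_get(struct get_sketch *dev)
{
	return atomic_fetch_add(&dev->usage_count, 1) + 1;
}
```

Because the function neither blocks nor queues work, nothing in it would
prevent a call from a context that cannot sleep.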

Something like this was probably your intention all along.  :-)

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-23 17:02                                         ` Alan Stern
  (?)
@ 2009-06-23 17:45                                         ` Rafael J. Wysocki
  2009-06-23 18:26                                           ` Alan Stern
  2009-06-23 18:26                                             ` Alan Stern
  -1 siblings, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-23 17:45 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Tuesday 23 June 2009, Alan Stern wrote:
> On Mon, 22 Jun 2009, Rafael J. Wysocki wrote:
> 
> > > And of course, synchronous pm_runtime_resume should always increment the 
> > > counter.
> > 
> > Sure.
> 
> Now that I've thought about it some more, I decided that we might want
> to be more flexible.  Without subjecting you to the entire line of
> reasoning, let's just say that I'm starting to wonder whether it's such
> a good idea to tie the counter increments to the PM core runtime resume
> calls at all.
> 
> Maybe it would be better (easier to use, less constraining) to require
> the runtime_resume callback to do its own pm_runtime_get.  That way the
> driver would be entirely responsible for managing the usage counter;
> the PM core wouldn't be involved.  pm_runtime_get would simply
> increment the counter, so it could be used even in interrupt context.  
> At the moment, I don't see any need for it to queue an autoresume
> request if the device happens to be suspended.
> 
> Something like this was probably your intention all along.  :-)

More or less. :-)

In short, I think suspending (or queuing a suspend request) should fail if the
usage counter is nonzero, but resuming (or queuing up a resume request)
should be possible regardless of its value.  The reason is that multiple
threads may in theory attempt to resume the device at the same time.

However, I'm not sure if the core should manipulate the usage counter by
itself, because it's sort of problematic (there's no good approach to decide
when to decrement the counter).

So, I'd let the callers use pm_runtime_get() to increment the counter
and pm_runtime_put() to decrement it, possibly queuing up an idle notification
if the counter happens to reach 0.  Also, I'm not sure if unbalanced
pm_runtime_put() should be regarded as a bug.

At the same time, I'd like the core to use runtime_status and the other
fields in dev_pm_info, except for the usage counter, to ensure that all
operations are only carried out when it makes sense.
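
The asymmetry described above could be sketched like this in user space.  The
names are hypothetical and the choice of -EBUSY as the failure code is an
assumption of the sketch, not something decided in this message.

```c
#include <errno.h>

enum sketch_status { SKETCH_ACTIVE, SKETCH_SUSPENDED };

/* Hypothetical device state for illustration only. */
struct susp_sketch {
	int usage_count;
	enum sketch_status status;
};

/* Suspending is refused while anyone holds a reference. */
static int sketch_suspend(struct susp_sketch *dev)
{
	if (dev->usage_count != 0)
		return -EBUSY;
	dev->status = SKETCH_SUSPENDED;
	return 0;
}

/* Resuming is allowed whatever the counter says. */
static int sketch_resume(struct susp_sketch *dev)
{
	dev->status = SKETCH_ACTIVE;
	return 0;
}
```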

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-23 17:45                                         ` Rafael J. Wysocki
@ 2009-06-23 18:26                                             ` Alan Stern
  2009-06-23 18:26                                             ` Alan Stern
  1 sibling, 0 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-23 18:26 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Tue, 23 Jun 2009, Rafael J. Wysocki wrote:

> In short, I think suspending (or queuing a suspend request) should fail if the
> usage counter is nonzero, but the resuming (or queuing up a resume request)
> should be possible regardless of its value.  The reason is that multiple
> threads may in theory attempt to resume the device at the same time.

Agreed.  Suspends and resumes aren't symmetrical -- a single resume 
request must outweigh numerous suspend requests.

> However, I'm not sure if the core should manipulate the usage counter by
> itself, because it's sort of problematic (there's no good approach to decide
> when to decrement the counter).

Yes.  The idea behind my previous message was that it's not really so
easy for the core to decide when to _increment_ the counter either.

> So, I'd let the callers use pm_runtime_get() to increment the counter
> and pm_runtime_put() to decrement it, possibly queuing up an idle notification
> if the counter happens to reach 0.  Also, I'm not sure if unbalanced
> pm_runtime_put() should be regarded as a bug.

It should be.  Once the counter is messed up, runtime PM wouldn't be
able to work properly.  But maybe you should add a pm_set_counter call
so that drivers can recover from imbalances.
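
As a rough illustration of what "regarded as a bug" plus a recovery
call could look like, here is a user-space sketch.  All names are
made up for the example, and the semantics of the setter are only an
assumption about what a pm_set_counter call might do; nothing like it
exists in the patch.

```c
/* Hypothetical model; not taken from the patch. */
struct counted_dev {
	int usage_count;
	int imbalance_warnings;	/* a kernel version might use WARN_ON() instead */
};

/* A put without a matching get is reported and otherwise ignored. */
static void counted_put(struct counted_dev *dev)
{
	if (dev->usage_count <= 0) {
		dev->imbalance_warnings++;
		return;
	}
	dev->usage_count--;
}

/*
 * Assumed recovery helper: lets a driver force the counter back to a
 * known value after it has detected an imbalance.  Negative values
 * are clamped to 0.
 */
static void counted_set(struct counted_dev *dev, int value)
{
	dev->usage_count = value < 0 ? 0 : value;
}
```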

One question still remains: If the counter is 0 at the end of a
successful pm_runtime_resume, should the core then call pm_notify_idle?  
Or should we make the driver responsible for that too?

> At the same time, I'd like the core to use runtime_status and the other
> fields in dev_pm_info, except for the usage counter, to ensure that all
> operations are only carried out when it makes sense.

Yes.  In fact, I'd say that when the counter is positive it doesn't
make sense to allow a runtime suspend -- so you don't need that
exception in your statement above.  :-)

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-23 18:26                                             ` Alan Stern
  (?)
  (?)
@ 2009-06-24  0:17                                             ` Rafael J. Wysocki
  2009-06-24 14:51                                               ` Alan Stern
  2009-06-24 14:51                                               ` Alan Stern
  -1 siblings, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-24  0:17 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Magnus Damm, linux-pm, ACPI Devel Maling List,
	Ingo Molnar, LKML, Greg KH

On Tuesday 23 June 2009, Alan Stern wrote:
> On Tue, 23 Jun 2009, Rafael J. Wysocki wrote:
> 
> > In short, I think suspending (or queuing a suspend request) should fail if the
> > usage counter is nonzero, but the resuming (or queuing up a resume request)
> > should be possible regardless of its value.  The reason is that multiple
> > threads may in theory attempt to resume the device at the same time.
> 
> Agreed.  Suspends and resumes aren't symmetrical -- a single resume 
> request must outweigh numerous suspend requests.
> 
> > However, I'm not sure if the core should manipulate the usage counter by
> > itself, because it's sort of problematic (there's no good approach to decide
> > when to decrement the counter).
> 
> Yes.  The idea behind my previous message was that it's not really so
> easy for the core to decide when to _increment_ the counter either.
> 
> > So, I'd let the callers use pm_runtime_get() to increment the counter
> > and pm_runtime_put() to decrement it, possibly queuing up an idle notification
> > if the counter happens to reach 0.  Also, I'm not sure if unbalanced
> > pm_runtime_put() should be regarded as a bug.
> 
> It should be.  Once the counter is messed up, runtime PM wouldn't be
> able to work properly.  But maybe you should add a pm_set_counter call
> so that drivers can recover from imbalances.
> 
> One question still remains: If the counter is 0 at the end of a
> successful pm_runtime_resume, should the core then call pm_notify_idle?  
> Or should we make the driver responsible for that too?

Good question. :-)

I think the core may call pm_notify_idle() in that case, but not necessarily in
the synchronous case.

> > At the same time, I'd like the core to use runtime_status and the other
> > fields in dev_pm_info, except for the usage counter, to ensure that all
> > operations are only carried out when it makes sense.
> 
> Yes.  In fact, I'd say that when the counter is positive it doesn't
> make sense to allow a runtime suspend -- so you don't need that
> exception in your statement above.  :-)

Well, you're right.

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-24  0:17                                             ` Rafael J. Wysocki
  2009-06-24 14:51                                               ` Alan Stern
@ 2009-06-24 14:51                                               ` Alan Stern
  2009-06-24 19:14                                                 ` Rafael J. Wysocki
  2009-06-24 19:14                                                 ` Rafael J. Wysocki
  1 sibling, 2 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-24 14:51 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Oliver Neukum, Magnus Damm, Linux-pm mailing list,
	ACPI Devel Mailing List, Ingo Molnar, LKML, Greg KH

On Wed, 24 Jun 2009, Rafael J. Wysocki wrote:

> > One question still remains: If the counter is 0 at the end of a
> > successful pm_runtime_resume, should the core then call pm_notify_idle?  
> > Or should we make the driver responsible for that too?
> 
> Good question. :-)
> 
> I think the core may call pm_notify_idle() in that case, but not necessarily in
> the synchronous case.

I'm not sure; we may want to do it even for synchronous resumes.  
Otherwise the callers would be forced to do it.

There's also the other side of the coin.  What if the counter is 0 at
the end of a failed pm_runtime_suspend?

For example, suppose the driver's runtime_suspend method decides that
the device hasn't been idle for long enough, so it wants to fail the
suspend attempt with -EBUSY and queue a new delayed autosuspend
request.  But at this point the status is RPM_SUSPENDING, so new
suspend requests won't be accepted (N.B., the test for this in the most
recent patch doesn't look right).

Even with a queued notification, there's no guarantee that the
notification won't be sent before the status changes from
RPM_SUSPENDING to RPM_ACTIVE.  So we really do need the notification to
be sent by pm_runtime_suspend, after it has updated the status and
dropped the lock.
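
The ordering I have in mind can be sketched in user space as follows.
This is only a model under stated assumptions: the lock is a plain
flag, the structure and all model_* names are hypothetical, and the
two callbacks at the bottom exist purely to record what the idle
notification sees.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of this is the patch's code. */
enum model_status { MODEL_ACTIVE, MODEL_SUSPENDING, MODEL_SUSPENDED };

struct model_dev {
	enum model_status status;
	bool lock_held;		/* stands in for the device's PM lock */
	int usage_count;
	int (*runtime_suspend)(struct model_dev *dev);
	void (*runtime_idle)(struct model_dev *dev);
	/* What the idle callback observed, recorded for illustration. */
	int idle_calls;
	enum model_status status_seen_by_idle;
	bool lock_seen_by_idle;
};

static int model_runtime_suspend(struct model_dev *dev)
{
	int error;

	dev->lock_held = true;
	if (dev->usage_count != 0 || dev->status != MODEL_ACTIVE) {
		dev->lock_held = false;
		return -EBUSY;
	}
	dev->status = MODEL_SUSPENDING;
	dev->lock_held = false;		/* the callback runs unlocked */

	error = dev->runtime_suspend(dev);

	dev->lock_held = true;
	dev->status = error ? MODEL_ACTIVE : MODEL_SUSPENDED;
	dev->lock_held = false;

	/*
	 * Notify only now: the status is final and the lock is dropped,
	 * so a suspend request made from the idle callback is accepted.
	 */
	if (error && dev->usage_count == 0 && dev->runtime_idle)
		dev->runtime_idle(dev);

	return error;
}

/* Example callback: the device has not been idle for long enough. */
static int busy_suspend(struct model_dev *dev)
{
	(void)dev;
	return -EBUSY;
}

/* Example callback: records the state in which it was invoked. */
static void record_idle(struct model_dev *dev)
{
	dev->idle_calls++;
	dev->status_seen_by_idle = dev->status;
	dev->lock_seen_by_idle = dev->lock_held;
}
```

With busy_suspend and record_idle installed, the idle callback always
sees MODEL_ACTIVE and an unlocked device, which is the point where a
driver could safely queue its new delayed autosuspend request.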


There's another totally separate issue worth discussing here.  This 
will affect the USB implementation of the new runtime PM framework.

The difficulty is that some USB interface drivers require remote wakeup
to be enabled while their interfaces are suspended.  But remote wakeup
is a global setting; it doesn't take effect until the entire physical
device is suspended.  (To put it another way, USB has no notion of
suspending interfaces.)  This means we must not allow these interfaces
to be suspended before the whole device is.  But the whole device is
the parent of the interfaces -- if we can't suspend the children before
suspending the parent then we're stuck.

Clearly this is something the USB stack has to deal with; it shouldn't
affect the general PM framework.  However the only solution I can think
of involves subverting the framework, which isn't very nice.  The idea
is to ignore runtime_suspend callbacks for these interface drivers;
allow them to keep on running even though the PM core thinks they are
suspended.  Then suspend and resume them as part of the callbacks for
the entire device.  (For interface drivers that don't require remote
wakeup there is no problem; it doesn't matter when they get suspended.)

This will work, but it's a hack.  Does anybody have a better idea?

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-14 22:57   ` [patch update] " Rafael J. Wysocki
                       ` (3 preceding siblings ...)
  2009-06-15 21:08       ` Alan Stern
@ 2009-06-24 15:04     ` Pavel Machek
  2009-06-27 21:52       ` Rafael J. Wysocki
  2009-06-27 21:52       ` Rafael J. Wysocki
  2009-06-24 15:04     ` Pavel Machek
  5 siblings, 2 replies; 118+ messages in thread
From: Pavel Machek @ 2009-06-24 15:04 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, Oliver Neukum, Magnus Damm, linux-pm,
	ACPI Devel Maling List, Ingo Molnar, LKML, Greg KH

Hi!


> +2. Run-time PM Helper Functions and Device Fields
> +
> +The following helper functions are defined in drivers/base/power/runtime.c
> +and include/linux/pm_runtime.h:
> +
> +* void pm_runtime_init(struct device *dev);
> +* void pm_runtime_enable(struct device *dev);
> +* void pm_runtime_disable(struct device *dev);
> +* int pm_runtime_suspend(struct device *dev);
> +* void pm_request_suspend(struct device *dev, unsigned long delay);
> +* int pm_runtime_resume(struct device *dev);
> +* void pm_request_resume(struct device *dev);
> +* void pm_cancel_runtime_suspend(struct device *dev);
> +* void pm_cancel_runtime_resume(struct device *dev);
> +* void pm_suspend_check_children(struct device *dev, bool enable);

Those *s look confusingly like pointers. Remove them?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-24 14:51                                               ` Alan Stern
  2009-06-24 19:14                                                 ` Rafael J. Wysocki
@ 2009-06-24 19:14                                                 ` Rafael J. Wysocki
  2009-06-24 20:19                                                   ` Alan Stern
  2009-06-24 20:19                                                   ` Alan Stern
  1 sibling, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-24 19:14 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Magnus Damm, Linux-pm mailing list,
	ACPI Devel Mailing List, Ingo Molnar, LKML, Greg KH

On Wednesday 24 June 2009, Alan Stern wrote:
> On Wed, 24 Jun 2009, Rafael J. Wysocki wrote:
> 
> > > One question still remains: If the counter is 0 at the end of a
> > > successful pm_runtime_resume, should the core then call pm_notify_idle?  
> > > Or should we make the driver responsible for that too?
> > 
> > Good question. :-)
> > 
> > I think the core may call pm_notify_idle() in that case, but not necessarily in
> > the synchronous case.
> 
> I'm not sure; we may want to do it even for synchronous resumes.  
> Otherwise the callers would be forced to do it.

I have no strong opinion.  We can do it in the synchronous case too.

> There's also the other side of the coin.  What if the counter is 0 at
> the end of a failed pm_runtime_suspend?
> 
> For example, suppose the driver's runtime_suspend method decides that
> the device hasn't been idle for long enough, so it wants to fail the
> suspend attempt with -EBUSY and queue a new delayed autosuspend
> request.  But at this point the status is RPM_SUSPENDING, so new
> suspend requests won't be accepted (N.B., the test for this in the most
> recent patch doesn't look right).

In fact it was inverted (fixed now), thanks for spotting this!

> Even with a queued notification, there's no guarantee that the
> notification won't be sent before the status changes from
> RPM_SUSPENDING to RPM_ACTIVE.  So we really do need the notification to
> be sent by pm_runtime_suspend, after it has updated the status and
> dropped the lock.

OK

> There's another totally separate issue worth discussing here.  This 
> will affect the USB implementation of the new runtime PM framework.
> 
> The difficulty is that some USB interface drivers require remote wakeup
> to be enabled while their interfaces are suspended.  But remote wakeup
> is a global setting; it doesn't take effect until the entire physical
> device is suspended.  (To put it another way, USB has no notion of
> suspending interfaces.)  This means we must not allow these interfaces
> to be suspended before the whole device is.  But the whole device is
> the parent of the interfaces -- if we can't suspend the children before
> suspending the parent then we're stuck.

Not if we use the power.ignore_children flag on the parent.

> Clearly this is something the USB stack has to deal with; it shouldn't
> affect the general PM framework.  However the only solution I can think
> of involves subverting the framework, which isn't very nice.  The idea
> is to ignore runtime_suspend callbacks for these interface drivers;
> allow them to keep on running even though the PM core thinks they are
> suspended.  Then suspend and resume them as part of the callbacks for
> the entire device.  (For interface drivers that don't require remote
> wakeup there is no problem; it doesn't matter when they get suspended.)
> 
> This will work, but it's a hack.  Does anybody have a better idea?

Well, as I said above, you can set power.ignore_children on the device
and then it can be suspended even if the interfaces aren't.
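
Roughly, the check I mean behaves like the user-space sketch below.
It is a simplified model, not the code from the patch: the structure,
the active_children field and model_suspend() are hypothetical, and
only the ignore_children flag is modeled on power.ignore_children.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a device with a parent link. */
struct model_dev {
	bool suspended;
	bool ignore_children;	/* modeled on power.ignore_children */
	int active_children;	/* children that are not suspended */
	struct model_dev *parent;
};

static int model_suspend(struct model_dev *dev)
{
	if (dev->suspended)
		return 0;

	/* Normally a device with active children must stay active. */
	if (!dev->ignore_children && dev->active_children > 0)
		return -EBUSY;

	dev->suspended = true;
	if (dev->parent != NULL)
		dev->parent->active_children--;
	return 0;
}
```

In the USB case the parent would be the whole device and the children
its interfaces: with the flag set, the device can be suspended first
and the interfaces handled from the device's own callbacks.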

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-24 14:51                                               ` Alan Stern
@ 2009-06-24 19:14                                                 ` Rafael J. Wysocki
  2009-06-24 19:14                                                 ` Rafael J. Wysocki
  1 sibling, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-24 19:14 UTC (permalink / raw)
  To: Alan Stern
  Cc: Greg KH, LKML, ACPI Devel Mailing List, Linux-pm mailing list,
	Ingo Molnar

On Wednesday 24 June 2009, Alan Stern wrote:
> On Wed, 24 Jun 2009, Rafael J. Wysocki wrote:
> 
> > > One question still remains: If the counter is 0 at the end of a
> > > successful pm_runtime_resume, should the core then call pm_notify_idle?  
> > > Or should we make the driver responsible for that too?
> > 
> > Good question. :-)
> > 
> > I think the core may call pm_notify_idle() in that case, but not necessarily in
> > the synchronous case.
> 
> I'm not sure; we may want to do it even for synchronous resumes.  
> Otherwise the callers would be forced to do it.

I have no strong opinion.  We can do it in the sychronous case too.

> There's also the other side of the coin.  What if the counter is 0 at
> the end of a failed pm_runtime_suspend?
> 
> For example, suppose the driver's runtime_suspend method decides that
> the device hasn't been idle for long enough, so it wants to fail the
> suspend attempt with -EBUSY and queue a new delayed autosuspend
> request.  But at this point the status is RPM_SUSPENDING, so new
> suspend requests won't be accepted (N.B., the test for this in the most
> recent patch doesn't look right).

In fact it was inversed (fixed now), thanks for spotting this!

> Even with a queued notification, there's no guarantee that the
> notification won't be sent before the status changes from
> RPM_SUSPENDING to RPM_ACTIVE.  So we really do need the notification to
> be sent by pm_runtime_suspend, after it has updated the status and
> dropped the lock.

OK

> There's another totally separate issue worth discussing here.  This 
> will affect the USB implementation of the new runtime PM framework.
> 
> The difficulty is that some USB interface drivers require remote wakeup
> to be enabled while their interfaces are suspended.  But remote wakeup
> is a global setting; it doesn't take effect until the entire physical
> device is suspended.  (To put it another way, USB has no notion of
> suspending interfaces.)  This means we must not allow these interfaces
> to be suspended before the whole device is.  But the whole device is
> the parent of the interfaces -- if we can't suspend the children before
> suspending the parent then we're stuck.

Not if we use the power.ignore_children flag on the parent.

> Clearly this is something the USB stack has to deal with; it shouldn't
> affect the general PM framework.  However the only solution I can think
> of involves subverting the framework, which isn't very nice.  The idea
> is to ignore runtime_suspend callbacks for these interface drivers;
> allow them to keep on running even though the PM core thinks they are
> suspended.  Then suspend and resume them as part of the callbacks for
> the entire device.  (For interface drivers that don't require remote
> wakeup there is no problem; it doesn't matter when they get suspended.)
> 
> This will work, but it's a hack.  Does anybody have a better idea?

Well, as I said above, you can set power.ignore_children on the device
and then it can be suspended even if the interfaces aren't.
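
To illustrate the ignore_children idea, here is a minimal userspace sketch.
It is only a model under stated assumptions, not kernel code: struct ic_dev
and ic_may_suspend() are hypothetical names made up for the example, and the
"suspended" field merely stands in for the RPM_SUSPENDED status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; this is not the kernel's struct device. */
struct ic_dev {
	bool suspended;		/* stands in for RPM_SUSPENDED */
	bool ignore_children;	/* stands in for power.ignore_children */
	struct ic_dev **children;
	size_t nr_children;
};

/* Return true if @dev may be suspended given the state of its children. */
bool ic_may_suspend(const struct ic_dev *dev)
{
	size_t i;

	assert(dev != NULL);

	/* With the flag set, the children are not consulted at all. */
	if (dev->ignore_children)
		return true;

	/* Otherwise a single active child is enough to block the suspend. */
	for (i = 0; i < dev->nr_children; i++)
		if (!dev->children[i]->suspended)
			return false;

	return true;
}
```

With the flag clear, one active interface keeps the USB device from being
suspended; with the flag set, the device can go down regardless of what the
core believes about the interfaces.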

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-24 19:14                                                 ` Rafael J. Wysocki
  2009-06-24 20:19                                                   ` Alan Stern
@ 2009-06-24 20:19                                                   ` Alan Stern
  2009-06-24 21:23                                                     ` Rafael J. Wysocki
  2009-06-24 21:23                                                     ` Rafael J. Wysocki
  1 sibling, 2 replies; 118+ messages in thread
From: Alan Stern @ 2009-06-24 20:19 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Oliver Neukum, Magnus Damm, Linux-pm mailing list,
	ACPI Devel Mailing List, Ingo Molnar, LKML, Greg KH

On Wed, 24 Jun 2009, Rafael J. Wysocki wrote:

> > The difficulty is that some USB interface drivers require remote wakeup
> > to be enabled while their interfaces are suspended.  But remote wakeup
> > is a global setting; it doesn't take effect until the entire physical
> > device is suspended.  (To put it another way, USB has no notion of
> > suspending interfaces.)  This means we must not allow these interfaces
> > to be suspended before the whole device is.  But the whole device is
> > the parent of the interfaces -- if we can't suspend the children before
> > suspending the parent then we're stuck.
> 
> Not if we use the power.ignore_children flag on the parent.
> 
> > Clearly this is something the USB stack has to deal with; it shouldn't
> > affect the general PM framework.  However the only solution I can think
> > of involves subverting the framework, which isn't very nice.  The idea
> > is to ignore runtime_suspend callbacks for these interface drivers;
> > allow them to keep on running even though the PM core thinks they are
> > suspended.  Then suspend and resume them as part of the callbacks for
> > the entire device.  (For interface drivers that don't require remote
> > wakeup there is no problem; it doesn't matter when they get suspended.)
> > 
> > This will work, but it's a hack.  Does anybody have a better idea?
> 
> Well, as I said above, you can set power.ignore_children on the device
> and then it can be suspended even if the interfaces aren't.

Hmm.  The hard part still remains: to make sure that the interfaces 
don't get suspended without the device also getting suspended.

I suppose we could attack this by making the device do a runtime_get on
each of the interfaces, which would be released in the device's
runtime_suspend method.  But then conversely, each interface driver
would have to do its gets and puts on the _device's_ resume_counter.  
If they used the interface counters then the values would never go to 0
and so nothing would ever be suspended.

You've got to admit, this does sound rather bizarre.  :-)  But it ought 
to work...
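
The counter arrangement described above can be sketched as a small userspace
model.  Everything here is an assumption made for illustration: the cnt_*
names are hypothetical, the suspend happens immediately instead of through a
delayed request, and there is no locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CNT_MAX_INTF 4

/* Hypothetical userspace model of one USB interface. */
struct cnt_intf {
	int usage;	/* the interface's own counter, held only by the device */
	bool suspended;
};

/* Hypothetical userspace model of the whole USB device. */
struct cnt_usb_dev {
	int usage;	/* the device's counter, used by every interface driver */
	bool suspended;
	struct cnt_intf intf[CNT_MAX_INTF];
	size_t nr_intf;
};

/* Take the device's reference on each interface and mark everything active. */
static void cnt_dev_resume(struct cnt_usb_dev *d)
{
	size_t i;

	for (i = 0; i < d->nr_intf; i++) {
		d->intf[i].usage = 1;
		d->intf[i].suspended = false;
	}
	d->suspended = false;
}

/* Drop the per-interface references, so interfaces and device go down together. */
static void cnt_dev_suspend(struct cnt_usb_dev *d)
{
	size_t i;

	for (i = 0; i < d->nr_intf; i++)
		if (--d->intf[i].usage == 0)
			d->intf[i].suspended = true;
	d->suspended = true;
}

/* Start with an active device that pins each of its interfaces. */
void cnt_dev_init(struct cnt_usb_dev *d, size_t nr_intf)
{
	assert(nr_intf <= CNT_MAX_INTF);
	d->usage = 0;
	d->nr_intf = nr_intf;
	cnt_dev_resume(d);
}

/* Called by an interface driver that needs the hardware. */
void cnt_get(struct cnt_usb_dev *d)
{
	if (d->usage++ == 0 && d->suspended)
		cnt_dev_resume(d);
}

/* Called by an interface driver that has gone idle. */
void cnt_put(struct cnt_usb_dev *d)
{
	assert(d->usage > 0);
	if (--d->usage == 0)
		cnt_dev_suspend(d);
}
```

Because the interface drivers only ever touch the device's counter, an
interface's own counter drops to zero solely from the device's suspend path,
which is what keeps an interface from being suspended on its own.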

Alan Stern


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update 3] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-24 20:19                                                   ` Alan Stern
@ 2009-06-24 21:23                                                     ` Rafael J. Wysocki
  2009-06-24 21:23                                                     ` Rafael J. Wysocki
  1 sibling, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-24 21:23 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Magnus Damm, Linux-pm mailing list,
	ACPI Devel Mailing List, Ingo Molnar, LKML, Greg KH

On Wednesday 24 June 2009, Alan Stern wrote:
> On Wed, 24 Jun 2009, Rafael J. Wysocki wrote:
> 
> > > The difficulty is that some USB interface drivers require remote wakeup
> > > to be enabled while their interfaces are suspended.  But remote wakeup
> > > is a global setting; it doesn't take effect until the entire physical
> > > device is suspended.  (To put it another way, USB has no notion of
> > > suspending interfaces.)  This means we must not allow these interfaces
> > > to be suspended before the whole device is.  But the whole device is
> > > the parent of the interfaces -- if we can't suspend the children before
> > > suspending the parent then we're stuck.
> > 
> > Not if we use the power.ignore_children flag on the parent.
> > 
> > > Clearly this is something the USB stack has to deal with; it shouldn't
> > > affect the general PM framework.  However the only solution I can think
> > > of involves subverting the framework, which isn't very nice.  The idea
> > > is to ignore runtime_suspend callbacks for these interface drivers;
> > > allow them to keep on running even though the PM core thinks they are
> > > suspended.  Then suspend and resume them as part of the callbacks for
> > > the entire device.  (For interface drivers that don't require remote
> > > wakeup there is no problem; it doesn't matter when they get suspended.)
> > > 
> > > This will work, but it's a hack.  Does anybody have a better idea?
> > 
> > Well, as I said above, you can set power.ignore_children on the device
> > and then it can be suspended even if the interfaces aren't.
> 
> Hmm.  The hard part still remains: to make sure that the interfaces 
> don't get suspended without the device also getting suspended.
> 
> I suppose we could attack this by making the device do a runtime_get on
> each of the interfaces, which would be released in the device's
> runtime_suspend method.  But then conversely, each interface driver
> would have to do its gets and puts on the _device's_ resume_counter.  
> If they used the interface counters then the values would never go to 0
> and so nothing would ever be suspended.
> 
> You've got to admit, this does sound rather bizarre.  :-)

Yeah.

> But it ought to work...

Alternatively, there can be a USB-specific flag for the interfaces meaning
"device suspended" if set.  Then, the device driver will have to set that flag
for all the interfaces so that they can be suspended and the interface drivers
as well as the device driver will use the device's resume counter to block
suspends of the whole thing.
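
One possible reading of the flag-based alternative, as a userspace sketch.
This is a model under assumptions, not a proposed implementation: the flg_*
names and the dev_suspended field are hypothetical, and locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define FLG_MAX_INTF 4

/* Hypothetical per-interface state carrying the USB-specific flag. */
struct flg_intf {
	bool dev_suspended;	/* set by the device driver, never by the interface */
};

/* Hypothetical userspace model of the whole USB device. */
struct flg_usb_dev {
	int resume_counter;	/* shared by the device and interface drivers */
	struct flg_intf intf[FLG_MAX_INTF];
	size_t nr_intf;
};

/* Device suspend: refuse while anyone holds the counter, else flag every interface. */
int flg_dev_suspend(struct flg_usb_dev *d)
{
	size_t i;

	assert(d->nr_intf <= FLG_MAX_INTF);
	if (d->resume_counter > 0)
		return -EBUSY;

	for (i = 0; i < d->nr_intf; i++)
		d->intf[i].dev_suspended = true;

	return 0;
}

/* Device resume: clear the flag so the interfaces count as active again. */
void flg_dev_resume(struct flg_usb_dev *d)
{
	size_t i;

	for (i = 0; i < d->nr_intf; i++)
		d->intf[i].dev_suspended = false;
}

/* Interface suspend: only allowed once the device driver has set the flag. */
int flg_intf_suspend(const struct flg_intf *intf)
{
	return intf->dev_suspended ? 0 : -EBUSY;
}
```

Here the interfaces keep no counters of their own; the only thing that blocks
a suspend of the whole thing is the device's resume counter.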

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-24 15:04     ` Pavel Machek
@ 2009-06-27 21:52       ` Rafael J. Wysocki
  2009-07-06  8:28         ` Pavel Machek
  2009-07-06  8:28         ` Pavel Machek
  2009-06-27 21:52       ` Rafael J. Wysocki
  1 sibling, 2 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-27 21:52 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Stern, Oliver Neukum, Magnus Damm, linux-pm,
	ACPI Devel Maling List, Ingo Molnar, LKML, Greg KH

On Wednesday 24 June 2009, Pavel Machek wrote:
> Hi!
> 
> 
> > +2. Run-time PM Helper Functions and Device Fields
> > +
> > +The following helper functions are defined in drivers/base/power/runtime.c
> > +and include/linux/pm_runtime.h:
> > +
> > +* void pm_runtime_init(struct device *dev);
> > +* void pm_runtime_enable(struct device *dev);
> > +* void pm_runtime_disable(struct device *dev);
> > +* int pm_runtime_suspend(struct device *dev);
> > +* void pm_request_suspend(struct device *dev, unsigned long delay);
> > +* int pm_runtime_resume(struct device *dev);
> > +* void pm_request_resume(struct device *dev);
> > +* void pm_cancel_runtime_suspend(struct device *dev);
> > +* void pm_cancel_runtime_resume(struct device *dev);
> > +* void pm_suspend_check_children(struct device *dev, bool enable);
> 
> Those *s look confusingly like pointers. Remove them?

From the doc?  OK, I can use another character. :-)

Best,
Rafael

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch update] PM: Introduce core framework for run-time PM of I/O devices
  2009-06-27 21:52       ` Rafael J. Wysocki
  2009-07-06  8:28         ` Pavel Machek
@ 2009-07-06  8:28         ` Pavel Machek
  1 sibling, 0 replies; 118+ messages in thread
From: Pavel Machek @ 2009-07-06  8:28 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, Oliver Neukum, Magnus Damm, linux-pm,
	ACPI Devel Maling List, Ingo Molnar, LKML, Greg KH

> On Wednesday 24 June 2009, Pavel Machek wrote:
> > Hi!
> > 
> > 
> > > +2. Run-time PM Helper Functions and Device Fields
> > > +
> > > +The following helper functions are defined in drivers/base/power/runtime.c
> > > +and include/linux/pm_runtime.h:
> > > +
> > > +* void pm_runtime_init(struct device *dev);
> > > +* void pm_runtime_enable(struct device *dev);
> > > +* void pm_runtime_disable(struct device *dev);
> > > +* int pm_runtime_suspend(struct device *dev);
> > > +* void pm_request_suspend(struct device *dev, unsigned long delay);
> > > +* int pm_runtime_resume(struct device *dev);
> > > +* void pm_request_resume(struct device *dev);
> > > +* void pm_cancel_runtime_suspend(struct device *dev);
> > > +* void pm_cancel_runtime_resume(struct device *dev);
> > > +* void pm_suspend_check_children(struct device *dev, bool enable);
> > 
> > Those *s look confusingly like pointers. Remove them?
> 
> From the doc?  OK, I can use another character. :-)

Yes. I suggest # :-)

> 
> Best,
> Rafael

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH] PM: Introduce core framework for run-time PM of I/O devices
@ 2009-06-13 22:23 Rafael J. Wysocki
  0 siblings, 0 replies; 118+ messages in thread
From: Rafael J. Wysocki @ 2009-06-13 22:23 UTC (permalink / raw)
  To: Alan Stern, Oliver Neukum, Magnus Damm
  Cc: ACPI Devel Maling List, pm list, Ingo Molnar, LKML

Hi,

Below is the current version of my "run-time PM for I/O devices" patch.

I've done my best to address the comments received during the recent
discussions, but at the same time I've tried to make the patch only contain
the most essential things.  For this reason, for example, the sysfs interface
is not there and it's going to be added in a separate patch.

Please let me know if you want me to change anything in this patch or to add
anything new to it.  [Magnus, I remember you wanted something like
->runtime_wakeup() along with ->runtime_idle(), but I'm not sure it's really
necessary.  Please let me know if you have any particular usage scenario for
it.]

Best,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>

Introduce a core framework for run-time power management of I/O
devices.  Add device run-time PM fields to 'struct dev_pm_info'
and device run-time PM callbacks to 'struct dev_pm_ops'.  Introduce
a run-time PM workqueue and define some device run-time PM helper
functions at the core level.  Document all these things.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/runtime_pm.txt |  250 ++++++++++++++++++++
 drivers/base/dd.c                  |    9 
 drivers/base/power/Makefile        |    1 
 drivers/base/power/main.c          |    5 
 drivers/base/power/runtime.c       |  461 +++++++++++++++++++++++++++++++++++++
 include/linux/pm.h                 |   98 +++++++
 include/linux/pm_runtime.h         |   63 +++++
 kernel/power/Kconfig               |   14 +
 kernel/power/main.c                |   17 +
 9 files changed, 915 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/power/Kconfig
===================================================================
--- linux-2.6.orig/kernel/power/Kconfig
+++ linux-2.6/kernel/power/Kconfig
@@ -208,3 +208,17 @@ config APM_EMULATION
 	  random kernel OOPSes or reboots that don't seem to be related to
 	  anything, try disabling/enabling this option (or disabling/enabling
 	  APM in your BIOS).
+
+config PM_RUNTIME
+	bool "Run-time PM core functionality"
+	depends on PM
+	---help---
+	  Enable functionality allowing I/O devices to be put into energy-saving
+	  (low power) states at run time (or autosuspended) after a specified
+	  period of inactivity and woken up in response to a hardware-generated
+	  wake-up event or a driver's request.
+
+	  Hardware support is generally required for this functionality to work
+	  and the bus type drivers of the buses the devices are on are
+	  responsible for the actual handling of the autosuspend requests and
+	  wake-up events.
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -11,6 +11,7 @@
 #include <linux/kobject.h>
 #include <linux/string.h>
 #include <linux/resume-trace.h>
+#include <linux/workqueue.h>
 
 #include "power.h"
 
@@ -217,8 +218,24 @@ static struct attribute_group attr_group
 	.attrs = g,
 };
 
+#ifdef CONFIG_PM_RUNTIME
+struct workqueue_struct *pm_wq;
+
+static int __init pm_start_workqueue(void)
+{
+	pm_wq = create_freezeable_workqueue("pm");
+
+	return pm_wq ? 0 : -ENOMEM;
+}
+#else
+static inline int pm_start_workqueue(void) { return 0; }
+#endif
+
 static int __init pm_init(void)
 {
+	int error = pm_start_workqueue();
+	if (error)
+		return error;
 	power_kobj = kobject_create_and_add("power", NULL);
 	if (!power_kobj)
 		return -ENOMEM;
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -22,6 +22,9 @@
 #define _LINUX_PM_H
 
 #include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -165,6 +168,26 @@ typedef struct pm_message {
  * It is allowed to unregister devices while the above callbacks are being
  * executed.  However, it is not allowed to unregister a device from within any
  * of its own callbacks.
+ *
+ * There also are the following callbacks related to run-time power management
+ * of devices:
+ *
+ * @runtime_suspend: Prepare the device for a condition in which it won't be
+ *	able to communicate with the CPU(s) and RAM due to power management.
+ *	This need not mean that the device should be put into a low power state,
+ *	like for example when the device is behind a link, represented by a
+ *	separate device object, that is going to be turned off for power
+ *	management purposes.
+ *
+ * @runtime_resume: Put the device into the fully active state in response to a
+ *	wake-up event generated by hardware or at a request of software.  If
+ *	necessary, put the device into the full power state and restore its
+ *	registers, so that it is fully operational.
+ *
+ * @runtime_idle: Device appears to be inactive and it might be put into a low
+ *	power state if all of the necessary conditions are satisfied.  Check
+ *	these conditions and handle the device as appropriate, possibly queueing
+ *	a suspend request for it.
  */
 
 struct dev_pm_ops {
@@ -182,6 +205,11 @@ struct dev_pm_ops {
 	int (*thaw_noirq)(struct device *dev);
 	int (*poweroff_noirq)(struct device *dev);
 	int (*restore_noirq)(struct device *dev);
+#ifdef CONFIG_PM_RUNTIME
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+#endif
 };
 
 /**
@@ -315,14 +343,78 @@ enum dpm_state {
 	DPM_OFF_IRQ,
 };
 
+/**
+ * Device run-time power management state.
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.  They do not
+ * reflect the actual power state of the device or its status as seen by the
+ * driver.
+ *
+ * RPM_ACTIVE		Device is fully operational, no run-time PM requests are
+ *			pending for it.
+ *
+ * RPM_IDLE		It has been requested that the device be suspended.
+ *			Suspend request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_SUSPENDING	Device bus type's ->runtime_suspend() callback is being
+ *			executed.
+ *
+ * RPM_SUSPENDED	Device bus type's ->runtime_suspend() callback has
+ *			completed successfully.  The device is regarded as
+ *			suspended.
+ *
+ * RPM_WAKE		It has been requested that the device be woken up.
+ *			Resume request has been put into the run-time PM
+ *			workqueue and it's pending execution.
+ *
+ * RPM_RESUMING		Device bus type's ->runtime_resume() callback is being
+ *			executed.
+ *
+ * RPM_ERROR		Represents a condition from which the PM core cannot
+ *			recover by itself.  If the device's run-time PM status
+ *			field has this value, all of the run-time PM operations
+ *			carried out for the device by the core will fail, until
+ *			the status field is changed to either RPM_ACTIVE or
+ *			RPM_SUSPENDED (it is not valid to use the other values
+ *			in such a situation) by the device's driver or bus type.
+ *			This happens when the device bus type's
+ *			->runtime_suspend() or ->runtime_resume() callback
+ *			returns an error code other than -EAGAIN or -EBUSY.
+ */
+
+#define RPM_ACTIVE	0
+#define RPM_IDLE	0x01
+#define RPM_SUSPENDING	0x02
+#define RPM_SUSPENDED	0x04
+#define RPM_WAKE	0x08
+#define RPM_RESUMING	0x10
+#define RPM_ERROR	(-1)
+
+#define RPM_IN_SUSPEND	(RPM_SUSPENDING | RPM_SUSPENDED)
+#define RPM_INACTIVE	(RPM_IDLE | RPM_IN_SUSPEND)
+#define RPM_NO_SUSPEND	(RPM_WAKE | RPM_RESUMING)
+#define RPM_IN_PROGRESS	(RPM_SUSPENDING | RPM_RESUMING)
+
 struct dev_pm_info {
 	pm_message_t		power_state;
-	unsigned		can_wakeup:1;
-	unsigned		should_wakeup:1;
+	unsigned int		can_wakeup:1;
+	unsigned int		should_wakeup:1;
 	enum dpm_state		status;		/* Owned by the PM core */
-#ifdef	CONFIG_PM_SLEEP
+#ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
 #endif
+#ifdef CONFIG_PM_RUNTIME
+	struct delayed_work	runtime_work;
+	struct completion	work_done;
+	unsigned int		suspend_skip_children:1;
+	unsigned int		suspend_aborted:1;
+	unsigned int		runtime_status:5;
+	int			runtime_error;
+	atomic_t		depth;
+	spinlock_t		lock;
+#endif
 };
 
 /*
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o
+obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
Index: linux-2.6/drivers/base/power/runtime.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/runtime.c
@@ -0,0 +1,461 @@
+/*
+ * drivers/base/power/runtime.c - Helper functions for device run-time PM
+ *
+ * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/pm_runtime.h>
+
+/**
+ * pm_runtime_reset - Clear all of the device run-time PM flags.
+ * @dev: Device object to clear the flags for.
+ */
+static void pm_runtime_reset(struct device *dev)
+{
+	dev->power.suspend_aborted = false;
+	dev->power.runtime_status = RPM_ACTIVE;
+}
+
+/**
+ * pm_device_suspended - Check if given device has been suspended at run time.
+ * @dev: Device to check.
+ * @data: Ignored.
+ *
+ * Returns 0 if the device has been suspended and it hasn't been requested to
+ * resume or -EBUSY otherwise.
+ */
+static int pm_device_suspended(struct device *dev, void *data)
+{
+	return dev->power.runtime_status == RPM_SUSPENDED ? 0 : -EBUSY;
+}
+
+/**
+ * pm_check_children - Check if all children of a device have been suspended.
+ * @dev: Device to check.
+ *
+ * Returns 0 if all children of the device have been suspended or -EBUSY
+ * otherwise.
+ */
+static int pm_check_children(struct device *dev)
+{
+	return dev->power.suspend_skip_children ? 0 :
+			device_for_each_child(dev, NULL, pm_device_suspended);
+}
+
+/**
+ * pm_runtime_notify_idle - Run a device bus type's runtime_idle() callback.
+ * @dev: Device to notify.
+ *
+ * Check if all children of given device are suspended and call the device bus
+ * type's ->runtime_idle() callback if that's the case.
+ */
+static void pm_runtime_notify_idle(struct device *dev)
+{
+	if (atomic_read(&dev->power.depth) > 0 || pm_check_children(dev))
+		return;
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_idle)
+		dev->bus->pm->runtime_idle(dev);
+}
+
+/**
+ * pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
+ * @dev: Device to suspend.
+ *
+ * Check if the status of the device is appropriate and run the
+ * ->runtime_suspend() callback provided by the device's bus type driver.
+ * Update the run-time PM flags in the device object to reflect the current
+ * status of the device.
+ */
+int pm_runtime_suspend(struct device *dev)
+{
+	int error = 0;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status & RPM_SUSPENDED) {
+		goto out;
+	} else if (dev->power.runtime_status & RPM_NO_SUSPEND) {
+		/* Device is resuming or there's a resume request pending. */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_IDLE
+	    && dev->power.suspend_aborted) {
+		dev->power.suspend_aborted = false;
+		dev->power.runtime_status = RPM_ACTIVE;
+		goto out;
+	} else if (pm_check_children(dev)) {
+		/*
+		 * We can only suspend the device if all of its children have
+		 * been suspended.
+		 */
+		error = -EAGAIN;
+		goto out;
+	} else if (dev->power.runtime_status == RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+
+		/*
+		 * Another suspend is running in parallel with us.  Wait for it
+		 * to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	dev->power.runtime_status = RPM_SUSPENDING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_suspend)
+		error = dev->bus->pm->runtime_suspend(dev);
+
+	spin_lock(&dev->power.lock);
+
+	/*
+	 * Resume request might have been queued in the meantime, in which case
+	 * the RPM_WAKE bit is also set in runtime_status.
+	 */
+	dev->power.runtime_status &= ~RPM_SUSPENDING;
+	switch (error) {
+	case 0:
+		dev->power.runtime_status |= RPM_SUSPENDED;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete(&dev->power.work_done);
+
+	if (!error && !(dev->power.runtime_status & RPM_WAKE) && dev->parent) {
+		spin_unlock(&dev->power.lock);
+
+		pm_runtime_notify_idle(dev->parent);
+
+		return 0;
+	}
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+}
+EXPORT_SYMBOL_GPL(pm_runtime_suspend);
+
+/**
+ * pm_runtime_suspend_work - Run pm_runtime_suspend() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the suspend has been scheduled for and
+ * run pm_runtime_suspend() for it.
+ */
+static void pm_runtime_suspend_work(struct work_struct *work)
+{
+	pm_runtime_suspend(pm_work_to_device(work));
+}
+
+/**
+ * pm_request_suspend - Schedule run-time suspend of given device.
+ * @dev: Device to suspend.
+ * @delay: Time, in jiffies, to wait before attempting to suspend the device.
+ */
+void pm_request_suspend(struct device *dev, unsigned long delay)
+{
+	unsigned long flags;
+
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status != RPM_ACTIVE)
+		goto out;
+
+	dev->power.runtime_status = RPM_IDLE;
+	dev->power.suspend_aborted = false;
+	INIT_DELAYED_WORK(&dev->power.runtime_work, pm_runtime_suspend_work);
+	queue_delayed_work(pm_wq, &dev->power.runtime_work, delay);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_suspend);
+
+/**
+ * pm_cancel_suspend - Cancel a pending suspend request for given device.
+ * @dev: Device to cancel the suspend request for.
+ *
+ * Should be called with dev->power.lock held and only if we are sure that the
+ * ->runtime_suspend() callback hasn't started yet.
+ */
+static void pm_cancel_suspend(struct device *dev)
+{
+	dev->power.suspend_aborted = true;
+	cancel_delayed_work(&dev->power.runtime_work);
+	dev->power.runtime_status = RPM_ACTIVE;
+}
+
+/**
+ * pm_runtime_resume - Run a device bus type's runtime_resume() callback.
+ * @dev: Device to resume.
+ *
+ * Check if the device is really suspended and run the ->runtime_resume()
+ * callback provided by the device's bus type driver.  Update the run-time PM
+ * flags in the device object to reflect the current status of the device.  If
+ * runtime suspend is in progress while this function is being run, wait for it
+ * to finish before resuming the device.  If runtime suspend is scheduled, but
+ * it hasn't started yet, cancel it and we're done.
+ */
+int pm_runtime_resume(struct device *dev)
+{
+	int error = 0;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return -EBUSY;
+
+	if (dev->parent)
+		spin_lock(&dev->parent->power.lock);
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_ACTIVE) {
+		goto out_unlock;
+	} else if (dev->power.runtime_status == RPM_IDLE) {
+		/* ->runtime_suspend() hasn't started yet, no need to resume. */
+		pm_cancel_suspend(dev);
+		goto out_unlock;
+	}
+
+	if (dev->power.runtime_status & RPM_SUSPENDING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * A suspend is running in parallel with us.  Wait for it to
+		 * complete and repeat.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		goto repeat;
+	} else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
+	    && dev->parent->power.runtime_status != RPM_ACTIVE) {
+		spin_unlock(&dev->power.lock);
+		spin_unlock(&dev->parent->power.lock);
+
+		/* The device's parent is not active.  Resume it and repeat. */
+		error = pm_runtime_resume(dev->parent);
+		if (error)
+			return error;
+
+		goto repeat;
+	}
+
+	if (dev->power.runtime_status == RPM_RESUMING) {
+		spin_unlock(&dev->power.lock);
+		if (dev->parent)
+			spin_unlock(&dev->parent->power.lock);
+
+		/*
+		 * There's another resume running in parallel with us. Wait for
+		 * it to complete and return.
+		 */
+		wait_for_completion(&dev->power.work_done);
+
+		return dev->power.runtime_error;
+	}
+
+	dev->power.runtime_status = RPM_RESUMING;
+	init_completion(&dev->power.work_done);
+
+	spin_unlock(&dev->power.lock);
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+
+	if (dev->bus && dev->bus->pm && dev->bus->pm->runtime_resume)
+		error = dev->bus->pm->runtime_resume(dev);
+
+	spin_lock(&dev->power.lock);
+
+	switch (error) {
+	case 0:
+		dev->power.runtime_status = RPM_ACTIVE;
+		break;
+	case -EAGAIN:
+	case -EBUSY:
+		dev->power.runtime_status = RPM_SUSPENDED;
+		break;
+	default:
+		dev->power.runtime_status = RPM_ERROR;
+	}
+	dev->power.runtime_error = error;
+	complete(&dev->power.work_done);
+
+ out:
+	spin_unlock(&dev->power.lock);
+
+	return error;
+
+ out_unlock:
+	if (dev->parent)
+		spin_unlock(&dev->parent->power.lock);
+	goto out;
+}
+EXPORT_SYMBOL_GPL(pm_runtime_resume);
+
+/**
+ * pm_runtime_resume_work - Run pm_runtime_resume() for a device.
+ * @work: Work structure used for scheduling the execution of this function.
+ *
+ * Use @work to get the device object the resume has been scheduled for and run
+ * pm_runtime_resume() for it.
+ */
+static void pm_runtime_resume_work(struct work_struct *work)
+{
+	pm_runtime_resume(pm_work_to_device(work));
+}
+
+/**
+ * pm_request_resume - Schedule run-time resume of given device.
+ * @dev: Device to resume.
+ */
+void pm_request_resume(struct device *dev)
+{
+	unsigned long parent_flags = 0, flags;
+
+ repeat:
+	if (atomic_read(&dev->power.depth) > 0)
+		return;
+
+	if (dev->parent)
+		spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		/* Autosuspend request is pending, no need to resume. */
+		pm_cancel_suspend(dev);
+		goto out;
+	} else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
+		goto out;
+	} else if (dev->parent
+	    && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
+		spin_unlock_irqrestore(&dev->power.lock, flags);
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+
+		/* We have to resume the parent first. */
+		pm_request_resume(dev->parent);
+
+		goto repeat;
+	}
+
+	/*
+	 * The device may be suspending at the moment and we can't clear the
+	 * RPM_SUSPENDING bit in its runtime_status just yet.
+	 */
+	dev->power.runtime_status |= RPM_WAKE;
+	INIT_WORK(&dev->power.runtime_work.work, pm_runtime_resume_work);
+	queue_work(pm_wq, &dev->power.runtime_work.work);
+
+ out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	if (dev->parent)
+		spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
+}
+EXPORT_SYMBOL_GPL(pm_request_resume);
+
+/**
+ * pm_cancel_runtime_suspend - Cancel a pending suspend request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_suspend(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status == RPM_IDLE) {
+		cancel_delayed_work(&dev->power.runtime_work);
+		pm_runtime_reset(dev);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_suspend);
+
+/**
+ * pm_cancel_runtime_resume - Cancel a pending resume request for a device.
+ * @dev: Device to handle.
+ *
+ * This routine is only supposed to be called when the run-time PM workqueue is
+ * frozen (i.e. during system-wide suspend or hibernation) when it is guaranteed
+ * that no work items are being executed.
+ */
+void pm_cancel_runtime_resume(struct device *dev)
+{
+	spin_lock(&dev->power.lock);
+
+	if (dev->power.runtime_status & RPM_WAKE) {
+		work_clear_pending(&dev->power.runtime_work.work);
+		pm_runtime_reset(dev);
+	}
+
+	spin_unlock(&dev->power.lock);
+}
+EXPORT_SYMBOL_GPL(pm_cancel_runtime_resume);
+
+/**
+ * pm_runtime_disable - Disable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Increase the depth field in the device's dev_pm_info structure, which will
+ * cause the run-time PM functions above to return without doing anything.
+ * If there is a run-time PM operation in progress, wait for it to complete.
+ */
+void pm_runtime_disable(struct device *dev)
+{
+	might_sleep();
+
+	atomic_inc(&dev->power.depth);
+
+	if (dev->power.runtime_status & RPM_IN_PROGRESS)
+		wait_for_completion(&dev->power.work_done);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_disable);
+
+/**
+ * pm_runtime_enable - Enable run-time power management for given device.
+ * @dev: Device to handle.
+ *
+ * Enable run-time power management for given device by decreasing the depth
+ * field in its dev_pm_info structure.
+ */
+void pm_runtime_enable(struct device *dev)
+{
+	if (!atomic_add_unless(&dev->power.depth, -1, 0))
+		dev_warn(dev, "PM: Excessive pm_runtime_enable()!\n");
+}
+EXPORT_SYMBOL_GPL(pm_runtime_enable);
+
+/**
+ * pm_runtime_init - Initialize run-time PM fields in given device object.
+ * @dev: Device object to handle.
+ */
+void pm_runtime_init(struct device *dev)
+{
+	pm_runtime_reset(dev);
+	spin_lock_init(&dev->power.lock);
+	atomic_set(&dev->power.depth, 1);
+	pm_suspend_check_children(dev, true);
+}
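
For reference, the status bookkeeping that pm_runtime_suspend() does after the
bus type's ->runtime_suspend() callback returns can be modelled in user space.
This is only a sketch: the RPM_* bit values below are assumptions (the real
ones are defined in include/linux/pm.h) and suspend_finish() is a hypothetical
name that is not part of the patch.

```c
#include <assert.h>
#include <errno.h>

/* Assumed bit values, for illustration only. */
enum {
	RPM_ACTIVE     = 0,
	RPM_IDLE       = 0x01,
	RPM_SUSPENDING = 0x02,
	RPM_SUSPENDED  = 0x04,
	RPM_WAKE       = 0x08,
	RPM_ERROR      = 0x10,
};

/*
 * Hypothetical helper: compute the new run-time PM status from the old one
 * and the error code returned by the bus type's ->runtime_suspend().
 */
unsigned int suspend_finish(unsigned int status, int error)
{
	/* RPM_WAKE may have been set by a resume request in the meantime. */
	status &= ~RPM_SUSPENDING;

	switch (error) {
	case 0:
		/* Success: keep RPM_WAKE, if set, next to RPM_SUSPENDED. */
		status |= RPM_SUSPENDED;
		break;
	case -EAGAIN:
	case -EBUSY:
		/* The callback refused to suspend; the device stays active. */
		status = RPM_ACTIVE;
		break;
	default:
		/* Any other error is treated as unrecoverable. */
		status = RPM_ERROR;
	}
	return status;
}
```

Note that only the success path preserves RPM_WAKE, which is why the parent is
notified only if the suspend succeeded and no resume request is pending.
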
Index: linux-2.6/include/linux/pm_runtime.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pm_runtime.h
@@ -0,0 +1,63 @@
+/*
+ * pm_runtime.h - Device run-time power management helper functions.
+ *
+ * Copyright (C) 2009 Rafael J. Wysocki <rjw@sisk.pl>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_RUNTIME_H
+#define _LINUX_PM_RUNTIME_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+#ifdef CONFIG_PM_RUNTIME
+
+extern struct workqueue_struct *pm_wq;
+
+extern void pm_runtime_init(struct device *dev);
+extern int pm_runtime_suspend(struct device *dev);
+extern void pm_request_suspend(struct device *dev, unsigned long delay);
+extern int pm_runtime_resume(struct device *dev);
+extern void pm_request_resume(struct device *dev);
+extern void pm_cancel_runtime_suspend(struct device *dev);
+extern void pm_cancel_runtime_resume(struct device *dev);
+extern void pm_runtime_disable(struct device *dev);
+extern void pm_runtime_enable(struct device *dev);
+
+static inline struct device *pm_work_to_device(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct dev_pm_info *dpi;
+
+	dpi = container_of(dw, struct dev_pm_info, runtime_work);
+	return container_of(dpi, struct device, power);
+}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+	dev->power.suspend_skip_children = !enable;
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+static inline void pm_runtime_init(struct device *dev) {}
+static inline int pm_runtime_suspend(struct device *dev) { return -ENOSYS; }
+static inline void pm_request_suspend(struct device *dev, unsigned long delay)
+{
+}
+static inline int pm_runtime_resume(struct device *dev) { return -ENOSYS; }
+static inline void pm_request_resume(struct device *dev) {}
+static inline void pm_cancel_runtime_suspend(struct device *dev) {}
+static inline void pm_cancel_runtime_resume(struct device *dev) {}
+static inline void pm_runtime_disable(struct device *dev) {}
+static inline void pm_runtime_enable(struct device *dev) {}
+
+static inline void pm_suspend_check_children(struct device *dev, bool enable)
+{
+}
+
+#endif /* !CONFIG_PM_RUNTIME */
+
+#endif
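
pm_work_to_device() above gets from the work item back to the device with
to_delayed_work() followed by two container_of() steps.  A user-space model of
that pointer arithmetic, assuming simplified stand-in structures (the layouts
below are made up for illustration and work_to_device() is a hypothetical
name):

```c
#include <assert.h>
#include <stddef.h>

/* User-space equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins for the kernel structures. */
struct work_struct { int pending; };
struct delayed_work { struct work_struct work; long timer; };
struct dev_pm_info { int runtime_status; struct delayed_work runtime_work; };
struct device { const char *name; struct dev_pm_info power; };

/* Walk from the embedded work item back up to the enclosing device. */
struct device *work_to_device(struct work_struct *work)
{
	struct delayed_work *dw;
	struct dev_pm_info *dpi;

	dw = container_of(work, struct delayed_work, work);
	dpi = container_of(dw, struct dev_pm_info, runtime_work);
	return container_of(dpi, struct device, power);
}
```

The same helper serves both the delayed suspend work and the resume work,
because the resume request is queued through the 'work' member embedded in
'runtime_work'.
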
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -21,6 +21,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
+#include <linux/pm_runtime.h>
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
@@ -88,6 +89,7 @@ void device_pm_add(struct device *dev)
 	}
 
 	list_add_tail(&dev->power.entry, &dpm_list);
+	pm_runtime_init(dev);
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -507,6 +509,7 @@ static void dpm_complete(pm_message_t st
 		get_device(dev);
 		if (dev->power.status > DPM_ON) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			mutex_unlock(&dpm_list_mtx);
 
 			device_complete(dev, state);
@@ -753,6 +756,7 @@ static int dpm_prepare(pm_message_t stat
 
 		get_device(dev);
 		dev->power.status = DPM_PREPARING;
+		pm_runtime_disable(dev);
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_prepare(dev, state);
@@ -760,6 +764,7 @@ static int dpm_prepare(pm_message_t stat
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			dev->power.status = DPM_ON;
+			pm_runtime_enable(dev);
 			if (error == -EAGAIN) {
 				put_device(dev);
 				continue;
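
The pm_runtime_disable()/pm_runtime_enable() pairs added to dpm_prepare() and
dpm_complete() rely on 'power.depth' being a nesting counter.  A user-space
model of that counter using C11 atomics, assuming hypothetical names
(struct rpm_depth and the rpm_*() functions are not part of the patch):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct rpm_depth {
	atomic_int depth;
};

/* Like pm_runtime_init(): run-time PM starts out disabled (depth == 1). */
void rpm_depth_init(struct rpm_depth *d)
{
	atomic_init(&d->depth, 1);
}

/* Like pm_runtime_disable(), without waiting for operations in progress. */
void rpm_disable(struct rpm_depth *d)
{
	atomic_fetch_add(&d->depth, 1);
}

/*
 * Like pm_runtime_enable(): decrement unless the counter is already 0.
 * Returns false for an unbalanced call (where the patch prints a warning).
 */
bool rpm_enable(struct rpm_depth *d)
{
	int old = atomic_load(&d->depth);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&d->depth, &old, old - 1))
			return true;
	}
	return false;
}

/* The helper functions only do anything while the depth is 0. */
bool rpm_allowed(struct rpm_depth *d)
{
	return atomic_load(&d->depth) == 0;
}
```

With this model, a disable in dpm_prepare() nested inside another disable
keeps run-time PM off until both have been balanced by an enable.
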
Index: linux-2.6/drivers/base/dd.c
===================================================================
--- linux-2.6.orig/drivers/base/dd.c
+++ linux-2.6/drivers/base/dd.c
@@ -23,6 +23,7 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
+#include <linux/pm_runtime.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -202,8 +203,12 @@ int driver_probe_device(struct device_dr
 	pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
 		 drv->bus->name, __func__, dev_name(dev), drv->name);
 
+	pm_runtime_disable(dev);
+
 	ret = really_probe(dev, drv);
 
+	pm_runtime_enable(dev);
+
 	return ret;
 }
 
@@ -306,6 +311,8 @@ static void __device_release_driver(stru
 
 	drv = dev->driver;
 	if (drv) {
+		pm_runtime_disable(dev);
+
 		driver_sysfs_remove(dev);
 
 		if (dev->bus)
@@ -320,6 +327,8 @@ static void __device_release_driver(stru
 		devres_release_all(dev);
 		dev->driver = NULL;
 		klist_remove(&dev->p->knode_driver);
+
+		pm_runtime_enable(dev);
 	}
 }
 
Index: linux-2.6/Documentation/power/runtime_pm.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/power/runtime_pm.txt
@@ -0,0 +1,250 @@
+Run-time Power Management Framework for I/O Devices
+
+(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+1. Introduction
+
+The support for run-time power management (run-time PM) of I/O devices is
+provided at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queueing all work items related to run-time PM, because this allows
+  them to be synchronized with system-wide power transitions.  pm_wq is declared
+  in include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of run-time PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing run-time PM operations with one another.
+
+* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out run-time PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The device run-time PM fields defined in 'struct dev_pm_info', the helper
+functions and the run-time PM callbacks defined in 'struct dev_pm_ops' are
+described in what follows.
+
+2. Run-time PM Helper Functions and Device Fields
+
+The following helper functions are defined in drivers/base/power/runtime.c
+and include/linux/pm_runtime.h:
+
+* void pm_runtime_init(struct device *dev);
+* void pm_runtime_enable(struct device *dev);
+* void pm_runtime_disable(struct device *dev);
+* int pm_runtime_suspend(struct device *dev);
+* void pm_request_suspend(struct device *dev, unsigned long delay);
+* int pm_runtime_resume(struct device *dev);
+* void pm_request_resume(struct device *dev);
+* void pm_cancel_runtime_suspend(struct device *dev);
+* void pm_cancel_runtime_resume(struct device *dev);
+* void pm_suspend_check_children(struct device *dev, bool enable);
+
+pm_runtime_init() initializes the run-time PM fields in the 'power' member of
+the device object.  It is called during the initialization of the device object,
+in drivers/base/power/main.c:device_pm_add().
+
+pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
+respectively, pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume().  They do it by decreasing and increasing, respectively,
+the 'power.depth' field of 'struct device'.  If the value of this field is
+greater than 0, pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume() return immediately without doing anything and -EBUSY is
+returned by pm_runtime_suspend() and pm_runtime_resume().  Therefore, if
+pm_runtime_disable() is called several times in a row for the same device, it
+has to be balanced by the appropriate number of pm_runtime_enable() calls so
+that the other run-time PM functions can be used for that device.  The initial
+value of 'power.depth', as set by pm_runtime_init(), is 1.
+
+pm_runtime_disable() and pm_runtime_enable() are used by the device core to
+disable the run-time PM of the device temporarily during device probe and
+removal as well as during system-wide power transitions (i.e. system-wide
+suspend or hibernation, or resume from a system sleep state).
+
+pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
+and pm_request_resume() use the 'power.runtime_status' and
+'power.suspend_aborted' fields of 'struct device' for mutual synchronization.
+These fields are initialized by pm_runtime_init() and set to RPM_ACTIVE and
+'false', respectively.
+
+pm_request_suspend() is used to queue up a suspend request for an active device.
+If the run-time PM status of the device (i.e. the value of the
+'power.runtime_status' field in 'struct device') is different from RPM_ACTIVE,
+it returns immediately.  Otherwise, it changes the device's run-time PM status
+to RPM_IDLE and puts a request to execute pm_runtime_suspend() into pm_wq.  The
+'delay' argument specifies the time, in jiffies, to wait before the request is
+carried out.
+
+pm_runtime_suspend() is used to carry out a run-time suspend of an active
+device.  It is called either by the PM core, to complete a request queued up by
+pm_request_suspend(), or directly by a bus type or device driver.
+* It returns immediately if the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field ('power.runtime_status').
+* It returns -EAGAIN if at least one of the RPM_WAKE and RPM_RESUMING bits is
+  set in the device's run-time PM status field.
+* If the device's run-time PM status is RPM_IDLE and the
+  'power.suspend_aborted' flag is set for it, the device's run-time PM status
+  is set to RPM_ACTIVE and the function returns success.
+* If the device's children are not suspended and the
+  'power.suspend_skip_children' flag is not set for it, -EAGAIN is returned.
+* If the device's run-time PM status is RPM_SUSPENDING, which means that another
+  instance of pm_runtime_suspend() is running at the same time for the same
+  device, the function waits for the other instance to complete and returns the
+  error code (or success) returned by it.
+If none of the above takes place, the device's run-time PM status is set to
+RPM_SUSPENDING and the device bus type's ->runtime_suspend() callback is
+executed, which is responsible for handling the device as appropriate (for
+example, it may choose to execute the device driver's ->runtime_suspend()
+callback or to carry out any other suitable action depending on the bus type).
+Next:
+* If it completes successfully, the RPM_SUSPENDED bit is set and the
+  RPM_SUSPENDING bit is cleared in the device's run-time PM status field.  Once
+  that has happened, the device is regarded by the PM core as suspended, which
+  need not mean that the device has been put into a low power state.  What
+  really happens to the device at this point depends entirely on its bus type
+  (it may depend on the device's driver if the bus type chooses to call it).
+  Additionally, if the device bus type's ->runtime_suspend() callback completes
+  successfully, the device bus type's ->runtime_idle() callback is executed for
+  the device's parent if there is one and if all of its children are suspended
+  (or the 'power.suspend_skip_children' flag is set for it).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_ACTIVE.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to run pm_runtime_suspend(),
+  pm_request_suspend(), pm_runtime_resume(), and pm_request_resume() until the
+  status is changed to either RPM_ACTIVE or RPM_SUSPENDED by the device's bus
+  type or driver.
+Finally, pm_runtime_suspend() returns the error code (or success) returned by
+the device bus type's ->runtime_suspend() callback.
+
+pm_request_resume() is used to queue up a resume request for a device that is
+suspended, suspending or has a suspend request pending.
+* If a suspend request is pending for the device (i.e. the device's run-time PM
+  status is RPM_IDLE), it is cancelled and the function returns.
+* If the device is not suspended or suspending (i.e. neither the RPM_SUSPENDED
+  nor the RPM_SUSPENDING bit is set in the device's run-time PM status field),
+  the function returns.
+* If the device's parent is inactive, a resume request is scheduled for the
+  parent and the function is restarted.
+If none of the above happens, the RPM_WAKE bit is set in the device's run-time
+PM status field and the request to execute pm_runtime_resume() is put into
+pm_wq.
+
+pm_runtime_resume() is used to carry out a run-time resume of a device that is
+suspended, suspending or has a suspend request pending.  It is called either by
+the PM core, to complete a request queued up by pm_request_resume(), or
+directly by a bus type or device driver.
+* It returns immediately if the device's run-time PM status is RPM_ACTIVE.
+* If there's a suspend request pending for the device (i.e. the device's
+  run-time PM status is RPM_IDLE), it is cancelled and the function returns
+  success.
+* If the device is suspending (i.e. the RPM_SUSPENDING bit is set in the
+  device's run-time PM status field), the function waits for the suspend
+  operation to complete and restarts itself.
+* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
+  run-time PM status field), the device's parent exists and is not active (i.e.
+  the parent's run-time PM status is not RPM_ACTIVE), pm_runtime_resume() is
+  called (recursively) for the parent and the function is restarted.
+* If the device is resuming (i.e. the device's run-time PM status is
+  RPM_RESUMING), which means that another instance of pm_runtime_resume() is
+  running at the same time for the same device, the function waits for the other
+  instance to complete and returns the result returned by it.
+If none of the above happens, the device's run-time PM status is set to
+RPM_RESUMING and the device bus type's ->runtime_resume() callback is executed,
+which is responsible for handling the device as appropriate (for example, it may
+choose to execute the device driver's ->runtime_resume() callback or to carry
+out any other suitable action depending on the bus type).  Next:
+* If it completes successfully, the device's run-time PM status is set to
+  RPM_ACTIVE, which means that the device is fully operational.  Thus, the
+  device bus type's ->runtime_resume() callback, when it is about to return
+  success, _must_ _ensure_ that this really is the case (i.e. when it returns,
+  the device _must_ be able to complete I/O operations as needed).
+* If either -EBUSY or -EAGAIN is returned, the device's run-time PM status is
+  set to RPM_SUSPENDED.
+* If another error code is returned, the device's run-time PM status is set to
+  RPM_ERROR and the PM core will refuse to run pm_runtime_suspend(),
+  pm_request_suspend(), pm_runtime_resume(), and pm_request_resume() until the
+  status is changed to either RPM_ACTIVE or RPM_SUSPENDED by the device's bus
+  type or driver.
+Finally, pm_runtime_resume() returns the error code (or success) returned by
+the device bus type's ->runtime_resume() callback.
+
+pm_cancel_runtime_suspend() is used to cancel a pending suspend request for an
+active device, but it can only be called when the run-time PM of the device
+is disabled.  It is supposed to be used during system-wide power transitions.
+
+pm_cancel_runtime_resume() is used to cancel a pending resume request for
+a suspended device.  It can only be called when the run-time PM of the device
+is disabled and it is supposed to be used during system-wide power transitions.
+
+pm_suspend_check_children() is used to set or unset the
+'power.suspend_skip_children' flag in 'struct device'.  If the 'enable'
+argument is 'true', the field is set to 0, and if 'enable' is 'false', the field
+is set to 1.  The default value of 'power.suspend_skip_children', as set by
+pm_runtime_init(), is 0.
+
+3. Device Run-time PM Callbacks
+
+There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	void (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend() callback is executed by pm_runtime_suspend() for the bus
+type of the device being suspended.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_suspend() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the bus type's ->runtime_suspend() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_suspend() callback has returned successfully,
+  the PM core regards the device as suspended, which need not mean that the
+  device has been put into a low power state.  It is supposed to mean, however,
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_ACTIVE, which means that the device
+  _must_ be fully operational once this has happened.
+* If the bus type's ->runtime_suspend() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_resume() callback is executed by pm_runtime_resume() for the bus
+type of the device being woken up.  The bus type's callback is then _fully_
+_responsible_ for handling the device as appropriate, which may, but need not
+include executing the device driver's ->runtime_resume() callback (from the PM
+core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the bus type's ->runtime_resume() knows
+what to do to handle the device).
+* Once the bus type's ->runtime_resume() callback has returned successfully,
+  the PM core regards the device as fully operational, which means that the
+  device _must_ be able to complete I/O operations as needed.
+* If the bus type's ->runtime_resume() callback returns -EBUSY or -EAGAIN, the
+  device's run-time PM status is set to RPM_SUSPENDED, which is supposed to mean
+  that the device will not communicate with the CPU(s) and RAM until the bus
+  type's ->runtime_resume() callback is executed for it.
+* If the bus type's ->runtime_resume() callback returns an error code different
+  from -EBUSY or -EAGAIN, the PM core regards this as an unrecoverable error and
+  will refuse to run the helper functions described in Section 2 until the
+  status is changed to either RPM_SUSPENDED or RPM_ACTIVE by the device's bus
+  type or driver.
+
+The ->runtime_idle() callback is executed by pm_runtime_suspend() for the bus
+type of a device the children of which are all suspended (or which has the
+'power.suspend_skip_children' flag set).  The action carried out by this
+callback is totally dependent on the bus type in question, but the expected
+action is to check if the device can be suspended (i.e. if all of the conditions
+necessary for suspending the device are met) and to queue up a suspend request
+for the device if that is the case.
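
The decision list given for pm_request_resume() in Section 2 of the document
above can be written down as a small pure function.  This is a user-space
sketch under stated assumptions: the RPM_* bit values are invented for the
example (the real ones are in include/linux/pm.h), and enum resume_action and
request_resume_action() are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit values, for illustration only. */
enum {
	RPM_ACTIVE     = 0,
	RPM_IDLE       = 0x01,
	RPM_SUSPENDING = 0x02,
	RPM_SUSPENDED  = 0x04,
	RPM_WAKE       = 0x08,
};

enum resume_action {
	RESUME_NOTHING,		/* device is not suspended or suspending */
	RESUME_CANCEL_SUSPEND,	/* pending suspend request is cancelled */
	RESUME_PARENT_FIRST,	/* request a resume of the parent, then retry */
	RESUME_QUEUE,		/* set RPM_WAKE and queue the resume work */
};

/* Hypothetical helper mirroring the order of the checks in the text. */
enum resume_action request_resume_action(unsigned int status,
					 bool parent_inactive)
{
	if (status == RPM_IDLE)
		return RESUME_CANCEL_SUSPEND;
	if (!(status & (RPM_SUSPENDING | RPM_SUSPENDED)))
		return RESUME_NOTHING;
	if (parent_inactive)
		return RESUME_PARENT_FIRST;
	return RESUME_QUEUE;
}
```

The order matters: the RPM_IDLE check has to come first, because an idle
device has neither of the "in suspend" bits set and would otherwise fall into
the "nothing to do" case with its suspend request still pending.
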


end of thread, other threads:[~2009-07-06  8:28 UTC | newest]

Thread overview: 118+ messages
2009-06-13 22:23 [PATCH] PM: Introduce core framework for run-time PM of I/O devices Rafael J. Wysocki
2009-06-14  9:41 ` Magnus Damm
2009-06-14  9:41 ` Magnus Damm
2009-06-14  9:41   ` Magnus Damm
2009-06-14 10:29   ` Rafael J. Wysocki
2009-06-14 10:29   ` Rafael J. Wysocki
2009-06-14  9:58 ` [linux-pm] " Rafael J. Wysocki
2009-06-14 22:57   ` [patch update] " Rafael J. Wysocki
2009-06-14 23:18     ` Arjan van de Ven
2009-06-15 20:02       ` Rafael J. Wysocki
2009-06-15 20:02       ` Rafael J. Wysocki
2009-06-14 23:18     ` Arjan van de Ven
2009-06-15 21:08     ` Alan Stern
2009-06-15 21:08     ` Alan Stern
2009-06-15 21:08       ` Alan Stern
2009-06-15 23:21       ` Rafael J. Wysocki
2009-06-16 14:30         ` Alan Stern
2009-06-16 14:30         ` Alan Stern
2009-06-16 14:30           ` Alan Stern
2009-06-16 21:30           ` [patch update 2] " Rafael J. Wysocki
2009-06-16 21:30           ` Rafael J. Wysocki
2009-06-16 22:33             ` [patch update 2 fix] " Rafael J. Wysocki
2009-06-17 20:08               ` Alan Stern
2009-06-17 20:08                 ` Alan Stern
2009-06-17 23:07                 ` Rafael J. Wysocki
2009-06-18 18:17                   ` Alan Stern
2009-06-18 18:17                   ` Alan Stern
2009-06-18 18:17                     ` Alan Stern
2009-06-19  0:38                     ` Rafael J. Wysocki
2009-06-19  0:38                     ` Rafael J. Wysocki
2009-06-19 16:25                       ` Alan Stern
2009-06-19 16:25                       ` Alan Stern
2009-06-19 16:25                         ` Alan Stern
2009-06-19 22:42                         ` Rafael J. Wysocki
2009-06-20  2:34                           ` Alan Stern
2009-06-20  2:34                           ` Alan Stern
2009-06-20  2:34                             ` Alan Stern
2009-06-20 14:30                             ` Alan Stern
2009-06-20 14:30                             ` [linux-pm] " Alan Stern
2009-06-20 23:48                               ` Rafael J. Wysocki
2009-06-20 23:48                               ` [linux-pm] " Rafael J. Wysocki
2009-06-21  2:30                                 ` Alan Stern
2009-06-21  2:30                                 ` [linux-pm] " Alan Stern
2009-06-21 11:32                                   ` Rafael J. Wysocki
2009-06-21 11:32                                   ` [linux-pm] " Rafael J. Wysocki
2009-06-22 14:16                                     ` Alan Stern
2009-06-22 15:27                                       ` Rafael J. Wysocki
2009-06-22 15:27                                       ` [linux-pm] " Rafael J. Wysocki
2009-06-22 15:39                                         ` Alan Stern
2009-06-22 15:53                                           ` Rafael J. Wysocki
2009-06-22 15:53                                           ` [linux-pm] " Rafael J. Wysocki
2009-06-22 15:39                                         ` Alan Stern
2009-06-22 14:16                                     ` Alan Stern
2009-06-22  6:20                               ` [linux-pm] " Magnus Damm
2009-06-22  6:20                                 ` Magnus Damm
2009-06-22  6:43                                 ` Arjan van de Ven
2009-06-22  6:43                                   ` Arjan van de Ven
2009-06-22  7:27                                   ` Magnus Damm
2009-06-22  7:27                                     ` [linux-pm] " Magnus Damm
2009-06-22 13:49                                     ` Arjan van de Ven
2009-06-22 13:49                                     ` [linux-pm] " Arjan van de Ven
2009-06-22 13:49                                       ` Arjan van de Ven
2009-06-22 15:39                                       ` Rafael J. Wysocki
2009-06-22 15:39                                       ` [linux-pm] " Rafael J. Wysocki
2009-06-22 15:33                                   ` Rafael J. Wysocki
2009-06-22 15:33                                   ` Rafael J. Wysocki
2009-06-22  6:43                                 ` Arjan van de Ven
2009-06-22  8:15                                 ` [linux-pm] " Oliver Neukum
2009-06-22  8:15                                 ` Oliver Neukum
2009-06-22  6:20                               ` Magnus Damm
2009-06-20 23:38                             ` [patch update 3] " Rafael J. Wysocki
2009-06-21  2:23                               ` Alan Stern
2009-06-21  2:23                               ` Alan Stern
2009-06-21  2:23                                 ` Alan Stern
2009-06-21 12:46                                 ` Rafael J. Wysocki
2009-06-21 12:46                                 ` Rafael J. Wysocki
2009-06-22 15:01                                   ` Alan Stern
2009-06-22 15:01                                     ` Alan Stern
2009-06-22 15:49                                     ` Rafael J. Wysocki
2009-06-22 15:49                                     ` Rafael J. Wysocki
2009-06-22 16:28                                       ` Alan Stern
2009-06-22 16:28                                       ` Alan Stern
2009-06-22 16:28                                         ` Alan Stern
2009-06-22 23:02                                         ` Rafael J. Wysocki
2009-06-22 23:02                                         ` Rafael J. Wysocki
2009-06-23 17:02                                       ` Alan Stern
2009-06-23 17:02                                       ` Alan Stern
2009-06-23 17:02                                         ` Alan Stern
2009-06-23 17:45                                         ` Rafael J. Wysocki
2009-06-23 18:26                                           ` Alan Stern
2009-06-23 18:26                                           ` Alan Stern
2009-06-23 18:26                                             ` Alan Stern
2009-06-24  0:17                                             ` Rafael J. Wysocki
2009-06-24  0:17                                             ` Rafael J. Wysocki
2009-06-24 14:51                                               ` Alan Stern
2009-06-24 14:51                                               ` Alan Stern
2009-06-24 19:14                                                 ` Rafael J. Wysocki
2009-06-24 19:14                                                 ` Rafael J. Wysocki
2009-06-24 20:19                                                   ` Alan Stern
2009-06-24 20:19                                                   ` Alan Stern
2009-06-24 21:23                                                     ` Rafael J. Wysocki
2009-06-24 21:23                                                     ` Rafael J. Wysocki
2009-06-23 17:45                                         ` Rafael J. Wysocki
2009-06-20 23:38                             ` Rafael J. Wysocki
2009-06-19 22:42                         ` [patch update 2 fix] " Rafael J. Wysocki
2009-06-17 23:07                 ` Rafael J. Wysocki
2009-06-17 20:08               ` Alan Stern
2009-06-16 22:33             ` Rafael J. Wysocki
2009-06-15 23:21       ` [patch update] " Rafael J. Wysocki
2009-06-24 15:04     ` Pavel Machek
2009-06-27 21:52       ` Rafael J. Wysocki
2009-07-06  8:28         ` Pavel Machek
2009-07-06  8:28         ` Pavel Machek
2009-06-27 21:52       ` Rafael J. Wysocki
2009-06-24 15:04     ` Pavel Machek
2009-06-14 22:57   ` Rafael J. Wysocki
2009-06-14  9:58 ` [PATCH] " Rafael J. Wysocki
  -- strict thread matches above, loose matches on Subject: below --
2009-06-13 22:23 Rafael J. Wysocki
