linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] coupled cpuidle state support
@ 2011-12-21  0:09 Colin Cross
  2011-12-21  0:09 ` [PATCH 1/3] cpuidle: refactor out cpuidle_enter_state Colin Cross
                   ` (5 more replies)
  0 siblings, 6 replies; 36+ messages in thread
From: Colin Cross @ 2011-12-21  0:09 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-pm
  Cc: Len Brown, Kevin Hilman, Santosh Shilimkar, Amit Kucheria,
	Arjan van de Ven, Trinabh Gupta, Deepthi Dharwar, linux-omap,
	linux-tegra, Colin Cross

On some ARM SMP SoCs (OMAP4460, Tegra 2, and probably more), the
cpus cannot be independently powered down, either due to
sequencing restrictions (on Tegra 2, cpu 0 must be the last to
power down), or due to HW bugs (on OMAP4460, a cpu powering up
will corrupt the gic state unless the other cpu runs a work
around).  Each cpu has a power state that it can enter without
coordinating with the other cpu (usually Wait For Interrupt, or
WFI), and one or more "coupled" power states that affect blocks
shared between the cpus (L2 cache, interrupt controller, and
sometimes the whole SoC).  Entering a coupled power state must
be tightly controlled on both cpus.

The easiest solution to implementing coupled cpu power states is
to hotplug all but one cpu whenever possible, usually using a
cpufreq governor that looks at cpu load to determine when to
enable the secondary cpus.  This causes problems, as hotplug is an
expensive operation, so the number of hotplug transitions must be
minimized, leading to very slow response to loads, often on the
order of seconds.

This patch series implements an alternative solution, where each
cpu will wait in the WFI state until all cpus are ready to enter
a coupled state, at which point the coupled state function will
be called on all cpus at approximately the same time.

Once all cpus are ready to enter idle, they are woken by an smp
cross call.  At this point, there is a chance that one of the
cpus will find work to do, and choose not to enter suspend.  A
final pass is needed to guarantee that all cpus will call the
power state enter function at the same time.  During this pass,
each cpu will increment the ready counter, and continue once the
ready counter matches the number of online coupled cpus.  If any
cpu exits idle, the other cpus will decrement their counter and
retry.

To use coupled cpuidle states, a cpuidle driver must:

   Set struct cpuidle_device.coupled_cpus to the mask of all
   coupled cpus, usually the same as cpu_possible_mask if all cpus
   are part of the same cluster.  The coupled_cpus mask must be
   set in the struct cpuidle_device for each cpu.

   Set struct cpuidle_device.safe_state to a state that is not a
   coupled state.  This is usually WFI.

   Set CPUIDLE_FLAG_COUPLED in struct cpuidle_state.flags for each
   state that affects multiple cpus.

   Provide a struct cpuidle_state.enter function for each state
   that affects multiple cpus.  This function is guaranteed to be
   called on all cpus at approximately the same time.  The driver
   should ensure that the cpus all abort together if any cpu tries
   to abort once the function is called.

This series was functionally tested on v3.0, but has only been
compile-tested on v3.2 after the removal of per-cpu state fields.

This patch set has a few disadvantages over the hotplug governor,
but I think they are all fairly minor:
   * Worst-case interrupt latency can be increased.  If one cpu
     receives an interrupt while the other is spinning in the
     ready_count loop, the second cpu will be stuck with
     interrupts off until the first cpu finished processing
     its interrupt and exits idle.  This will increase the worst
     case interrupt latency by the worst-case interrupt processing
     time, but should be very rare.
   * Interrupts are processed while still inside pm_idle.
     Normally, interrupts are only processed at the very end of
     pm_idle, just before it returns to the idle loop.  Coupled
     states requires processing interrupts inside
     cpuidle_enter_state_coupled in order to distinguish between
     the smp_cross_call from another cpu that is now idle and an
     interrupt that should cause idle to exit.
     I don't see a way to fix this without either being able to
     read the next pending irq from the interrupt chip, or
     querying the irq core for which interrupts were processed.
   * Since interrupts are processed inside cpuidle, the next
     timer event could change.  The new timer event will be
     handled correctly, but the idle state decision made by
     the governor will be out of date, and will not be revisited.
     The governor select function could be called again every time,
     but this could lead to a lot of work being done by an idle
     cpu if the other cpu was mostly busy.
   * The spinlock that protects requested_state and ready_count is
     should probably be replaced with careful use of atomics and
     barriers.

None of the platforms I work with have an SMP idle implementation
upstream, so I can't easily show a patch that converts a platform
from hotplug governor to coupled cpuidle states.  Instead, I'll
give a quick example implementation assuming functions that handle
hotplug and single-cpu idle already exist.

static int mach_enter_idle_coupled(struct cpuidle_device *dev,
		struct cpuidle_driver *drv, int index)
{
	ktime_t enter, exit;
	s64 us;

	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);

	enter = ktime_get();

	cpu_pm_enter();

        if (dev->cpu == 0) {
	        for_each_online_cpu(i)
        	        while (i != dev->cpu && !mach_cpu_is_reset(i))
                	        cpu_relax();

		mach_cpu_idle();

		for_each_online_cpu(i)
			if (i != cpu)
				mach_cpu_online(i);
	} else {
		mach_cpu_offline();
	}

	cpu_pm_exit();

	exit = ktime_sub(ktime_get(), enter);
	us = ktime_to_us(exit);

	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);

	local_irq_enable();

	dev->last_residency = us;

	return index;
}

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/3] cpuidle: refactor out cpuidle_enter_state
  2011-12-21  0:09 [PATCH 0/3] coupled cpuidle state support Colin Cross
@ 2011-12-21  0:09 ` Colin Cross
  2012-01-04 14:08   ` [linux-pm] " Jean Pihet
  2011-12-21  0:09 ` [PATCH 2/3] cpuidle: fix error handling in __cpuidle_register_device Colin Cross
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 36+ messages in thread
From: Colin Cross @ 2011-12-21  0:09 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-pm
  Cc: Len Brown, Kevin Hilman, Santosh Shilimkar, Amit Kucheria,
	Arjan van de Ven, Trinabh Gupta, Deepthi Dharwar, linux-omap,
	linux-tegra, Colin Cross

Split the code to enter a state and update the stats into a helper
function, cpuidle_enter_state, and export it.  This function will
be called by the coupled state code to handle entering the safe
state and the final coupled state.

Signed-off-by: Colin Cross <ccross@android.com>
---
 drivers/cpuidle/cpuidle.c |   43 +++++++++++++++++++++++++++++--------------
 drivers/cpuidle/cpuidle.h |    2 ++
 2 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 06ce268..1486b3c 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -54,6 +54,34 @@ static void cpuidle_kick_cpus(void) {}
 static int __cpuidle_register_device(struct cpuidle_device *dev);
 
 /**
+ * cpuidle_enter_state
+ *
+ * enter the state and update stats
+ */
+int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
+		int next_state)
+{
+	int entered_state;
+	struct cpuidle_state *target_state;
+
+	target_state = &drv->states[next_state];
+
+	entered_state = target_state->enter(dev, drv, next_state);
+
+	if (entered_state >= 0) {
+		/* Update cpuidle counters */
+		/* This can be moved to within driver enter routine
+		 * but that results in multiple copies of same code.
+		 */
+		dev->states_usage[entered_state].time +=
+				(unsigned long long)dev->last_residency;
+		dev->states_usage[entered_state].usage++;
+	}
+
+	return entered_state;
+}
+
+/**
  * cpuidle_idle_call - the main idle loop
  *
  * NOTE: no locks or semaphores should be used here
@@ -63,7 +91,6 @@ int cpuidle_idle_call(void)
 {
 	struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
 	struct cpuidle_driver *drv = cpuidle_get_driver();
-	struct cpuidle_state *target_state;
 	int next_state, entered_state;
 
 	if (off)
@@ -92,26 +119,14 @@ int cpuidle_idle_call(void)
 		return 0;
 	}
 
-	target_state = &drv->states[next_state];
-
 	trace_power_start(POWER_CSTATE, next_state, dev->cpu);
 	trace_cpu_idle(next_state, dev->cpu);
 
-	entered_state = target_state->enter(dev, drv, next_state);
+	entered_state = cpuidle_enter_state(dev, drv, next_state);
 
 	trace_power_end(dev->cpu);
 	trace_cpu_idle(PWR_EVENT_EXIT, dev->cpu);
 
-	if (entered_state >= 0) {
-		/* Update cpuidle counters */
-		/* This can be moved to within driver enter routine
-		 * but that results in multiple copies of same code.
-		 */
-		dev->states_usage[entered_state].time +=
-				(unsigned long long)dev->last_residency;
-		dev->states_usage[entered_state].usage++;
-	}
-
 	/* give the governor an opportunity to reflect on the outcome */
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev, entered_state);
diff --git a/drivers/cpuidle/cpuidle.h b/drivers/cpuidle/cpuidle.h
index 38c3fd8..dd2df8f 100644
--- a/drivers/cpuidle/cpuidle.h
+++ b/drivers/cpuidle/cpuidle.h
@@ -14,6 +14,8 @@ extern struct list_head cpuidle_detected_devices;
 extern struct mutex cpuidle_lock;
 extern spinlock_t cpuidle_driver_lock;
 extern int cpuidle_disabled(void);
+int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
+		int next_state);
 
 /* idle loop */
 extern void cpuidle_install_idle_handler(void);
-- 
1.7.4.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 2/3] cpuidle: fix error handling in __cpuidle_register_device
  2011-12-21  0:09 [PATCH 0/3] coupled cpuidle state support Colin Cross
  2011-12-21  0:09 ` [PATCH 1/3] cpuidle: refactor out cpuidle_enter_state Colin Cross
@ 2011-12-21  0:09 ` Colin Cross
  2011-12-21  0:09 ` [PATCH 3/3] cpuidle: add support for states that affect multiple cpus Colin Cross
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: Colin Cross @ 2011-12-21  0:09 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-pm
  Cc: Len Brown, Kevin Hilman, Santosh Shilimkar, Amit Kucheria,
	Arjan van de Ven, Trinabh Gupta, Deepthi Dharwar, linux-omap,
	linux-tegra, Colin Cross

Fix the error handling in __cpuidle_register_device to include
the missing list_del.  Move it to a label, which will simplify
the error handling when coupled states are added.

Signed-off-by: Colin Cross <ccross@android.com>
---
 drivers/cpuidle/cpuidle.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 1486b3c..ea00a16 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -318,13 +318,18 @@ static int __cpuidle_register_device(struct cpuidle_device *dev)
 
 	per_cpu(cpuidle_devices, dev->cpu) = dev;
 	list_add(&dev->device_list, &cpuidle_detected_devices);
-	if ((ret = cpuidle_add_sysfs(sys_dev))) {
-		module_put(cpuidle_driver->owner);
-		return ret;
-	}
+	ret = cpuidle_add_sysfs(sys_dev);
+	if (ret)
+		goto err_sysfs;
 
 	dev->registered = 1;
 	return 0;
+
+err_sysfs:
+	module_put(cpuidle_driver->owner);
+	list_del(&dev->device_list);
+	per_cpu(cpuidle_devices, dev->cpu) = NULL;
+	return ret;
 }
 
 /**
-- 
1.7.4.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 3/3] cpuidle: add support for states that affect multiple cpus
  2011-12-21  0:09 [PATCH 0/3] coupled cpuidle state support Colin Cross
  2011-12-21  0:09 ` [PATCH 1/3] cpuidle: refactor out cpuidle_enter_state Colin Cross
  2011-12-21  0:09 ` [PATCH 2/3] cpuidle: fix error handling in __cpuidle_register_device Colin Cross
@ 2011-12-21  0:09 ` Colin Cross
  2011-12-21  9:02 ` [PATCH 0/3] coupled cpuidle state support Arjan van de Ven
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: Colin Cross @ 2011-12-21  0:09 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-pm
  Cc: Len Brown, Kevin Hilman, Santosh Shilimkar, Amit Kucheria,
	Arjan van de Ven, Trinabh Gupta, Deepthi Dharwar, linux-omap,
	linux-tegra, Colin Cross

On some ARM SMP SoCs (OMAP4460, Tegra 2, and probably more), the
cpus cannot be independently powered down, either due to
sequencing restrictions (on Tegra 2, cpu 0 must be the last to
power down), or due to HW bugs (on OMAP4460, a cpu powering up
will corrupt the gic state unless the other cpu runs a work
around).  Each cpu has a power state that it can enter without
coordinating with the other cpu (usually Wait For Interrupt, or
WFI), and one or more "coupled" power states that affect blocks
shared between the cpus (L2 cache, interrupt controller, and
sometimes the whole SoC).  Entering a coupled power state must
be tightly controlled on both cpus.

The easiest solution to implementing coupled cpu power states is
to hotplug all but one cpu whenever possible, usually using a
cpufreq governor that looks at cpu load to determine when to
enable the secondary cpus.  This causes problems, as hotplug is an
expensive operation, so the number of hotplug transitions must be
minimized, leading to very slow response to loads, often on the
order of seconds.

This file implements an alternative solution, where each cpu will
wait in the WFI state until all cpus are ready to enter a coupled
state, at which point the coupled state function will be called
on all cpus at approximately the same time.

Once all cpus are ready to enter idle, they are woken by an smp
cross call.  At this point, there is a chance that one of the
cpus will find work to do, and choose not to enter suspend.  A
final pass is needed to guarantee that all cpus will call the
power state enter function at the same time.  During this pass,
each cpu will increment the ready counter, and continue once the
ready counter matches the number of online coupled cpus.  If any
cpu exits idle, the other cpus will decrement their counter and
retry.

To use coupled cpuidle states, a cpuidle driver must:

   Set struct cpuidle_device.coupled_cpus to the mask of all
   coupled cpus, usually the same as cpu_possible_mask if all cpus
   are part of the same cluster.  The coupled_cpus mask must be
   set in the struct cpuidle_device for each cpu.

   Set struct cpuidle_device.safe_state to a state that is not a
   coupled state.  This is usually WFI.

   Set CPUIDLE_FLAG_COUPLED in struct cpuidle_state.flags for each
   state that affects multiple cpus.

   Provide a struct cpuidle_state.enter function for each state
   that affects multiple cpus.  This function is guaranteed to be
   called on all cpus at approximately the same time.  The driver
   should ensure that the cpus all abort together if any cpu tries
   to abort once the function is called.

Signed-off-by: Colin Cross <ccross@android.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Amit Kucheria <amit.kucheria@linaro.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Trinabh Gupta <g.trinabh@gmail.com>
Cc: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
---
 drivers/cpuidle/Kconfig   |    3 +
 drivers/cpuidle/Makefile  |    1 +
 drivers/cpuidle/coupled.c |  413 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/cpuidle/cpuidle.c |   14 ++-
 drivers/cpuidle/cpuidle.h |   39 +++++
 include/linux/cpuidle.h   |    7 +
 6 files changed, 476 insertions(+), 1 deletions(-)
 create mode 100644 drivers/cpuidle/coupled.c

diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 7dbc4a8..7a72e55 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -18,3 +18,6 @@ config CPU_IDLE_GOV_MENU
 	bool
 	depends on CPU_IDLE && NO_HZ
 	default y
+
+config ARCH_NEEDS_CPU_IDLE_COUPLED
+	def_bool n
diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
index 5634f88..38c8f69 100644
--- a/drivers/cpuidle/Makefile
+++ b/drivers/cpuidle/Makefile
@@ -3,3 +3,4 @@
 #
 
 obj-y += cpuidle.o driver.o governor.o sysfs.o governors/
+obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
diff --git a/drivers/cpuidle/coupled.c b/drivers/cpuidle/coupled.c
new file mode 100644
index 0000000..3fb7d24
--- /dev/null
+++ b/drivers/cpuidle/coupled.c
@@ -0,0 +1,413 @@
+/*
+ * coupled.c - helper functions to enter the same idle state on multiple cpus
+ *
+ * Copyright (c) 2011 Google, Inc.
+ *
+ * Author: Colin Cross <ccross@android.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/cpu.h>
+#include <linux/cpuidle.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+#include "cpuidle.h"
+
+/*
+ * coupled cpuidle states
+ *
+ * On some ARM SMP SoCs (OMAP4460, Tegra 2, and probably more), the
+ * cpus cannot be independently powered down, either due to
+ * sequencing restrictions (on Tegra 2, cpu 0 must be the last to
+ * power down), or due to HW bugs (on OMAP4460, a cpu powering up
+ * will corrupt the gic state unless the other cpu runs a work
+ * around).  Each cpu has a power state that it can enter without
+ * coordinating with the other cpu (usually Wait For Interrupt, or
+ * WFI), and one or more "coupled" power states that affect blocks
+ * shared between the cpus (L2 cache, interrupt controller, and
+ * sometimes the whole SoC).  Entering a coupled power state must
+ * be tightly controlled on both cpus.
+ *
+ * The easiest solution to implementing coupled cpu power states is
+ * to hotplug all but one cpu whenever possible, usually using a
+ * cpufreq governor that looks at cpu load to determine when to
+ * enable the secondary cpus.  This causes problems, as hotplug is an
+ * expensive operation, so the number of hotplug transitions must be
+ * minimized, leading to very slow response to loads, often on the
+ * order of seconds.
+ *
+ * This file implements an alternative solution, where each cpu will
+ * wait in the WFI state until all cpus are ready to enter a coupled
+ * state, at which point the coupled state function will be called
+ * on all cpus at approximately the same time.
+ *
+ * Once all cpus are ready to enter idle, they are woken by an smp
+ * cross call.  At this point, there is a chance that one of the
+ * cpus will find work to do, and choose not to enter suspend.  A
+ * final pass is needed to guarantee that all cpus will call the
+ * power state enter function at the same time.  During this pass,
+ * each cpu will increment the ready counter, and continue once the
+ * ready counter matches the number of online coupled cpus.  If any
+ * cpu exits idle, the other cpus will decrement their counter and
+ * retry.
+ *
+ * To use coupled cpuidle states, a cpuidle driver must:
+ *
+ *    Set struct cpuidle_device.coupled_cpus to the mask of all
+ *    coupled cpus, usually the same as cpu_possible_mask if all cpus
+ *    are part of the same cluster.  The coupled_cpus mask must be
+ *    set in the struct cpuidle_device for each cpu.
+ *
+ *    Set struct cpuidle_device.safe_state to a state that is not a
+ *    coupled state.  This is usually WFI.
+ *
+ *    Set CPUIDLE_FLAG_COUPLED in struct cpuidle_state.flags for each
+ *    state that affects multiple cpus.
+ *
+ *    Provide a struct cpuidle_state.enter function for each state
+ *    that affects multiple cpus.  This function is guaranteed to be
+ *    called on all cpus at approximately the same time.  The driver
+ *    should ensure that the cpus all abort together if any cpu tries
+ *    to abort once the function is called.
+ *
+ */
+
+static DEFINE_MUTEX(cpuidle_coupled_lock);
+static DEFINE_PER_CPU(struct call_single_data, cpuidle_coupled_poke_cb);
+static cpumask_t cpuidle_coupled_poked_mask;
+
+/**
+ * cpuidle_state_is_coupled
+ *
+ * Returns true if the target state is coupled with cpus besides this one
+ */
+bool cpuidle_state_is_coupled(struct cpuidle_device *dev,
+	struct cpuidle_driver *drv, int state)
+{
+	return drv->states[state].flags & CPUIDLE_FLAG_COUPLED;
+}
+
+/**
+ * cpuidle_all_coupled_cpus_idle
+ *
+ * Returns true if all cpus coupled to this target state are idle
+ */
+static inline bool
+cpuidle_coupled_cpus_idle(struct cpuidle_coupled *coupled)
+{
+	int i;
+
+	assert_spin_locked(&coupled->lock);
+
+	smp_rmb();
+
+	for_each_cpu_mask(i, coupled->alive_coupled_cpus)
+		if (coupled->requested_state[i] < 0)
+			return false;
+
+	return true;
+}
+
+/**
+ * cpuidle_get_idle_state
+ *
+ * Returns the deepest idle state that all coupled cpus can enter
+ */
+static inline int cpuidle_coupled_get_state(struct cpuidle_device *dev,
+		struct cpuidle_coupled *coupled)
+{
+	int i;
+	int state = INT_MAX;
+
+	assert_spin_locked(&coupled->lock);
+
+	for_each_cpu_mask(i, coupled->alive_coupled_cpus)
+		if (coupled->requested_state[i] < state)
+			state = coupled->requested_state[i];
+
+	BUG_ON(state >= dev->state_count || state < 0);
+
+	return state;
+}
+
+static void cpuidle_coupled_poked(void *info)
+{
+	int cpu = (unsigned long)info;
+	cpumask_clear_cpu(cpu, &cpuidle_coupled_poked_mask);
+}
+
+static void cpuidle_coupled_poke(int cpu)
+{
+	struct call_single_data *csd = &per_cpu(cpuidle_coupled_poke_cb, cpu);
+	if (cpu_online(cpu))
+		if (!cpumask_test_and_set_cpu(cpu, &cpuidle_coupled_poked_mask))
+			__smp_call_function_single(cpu, csd, 0);
+}
+
+/**
+ * cpuidle_coupled_update_state
+ *
+ * Updates the requested idle state for the specified cpuidle device,
+ * poking all coupled cpus out of idle to let them see the new state.
+ */
+static void cpuidle_coupled_update_state(struct cpuidle_device *dev,
+		struct cpuidle_coupled *coupled, int next_state)
+{
+	int cpu;
+
+	assert_spin_locked(&coupled->lock);
+
+	coupled->requested_state[dev->cpu] = next_state;
+	smp_wmb();
+
+	if (next_state >= 0)
+		for_each_cpu_mask(cpu, coupled->alive_coupled_cpus)
+			if (cpu != dev->cpu)
+				cpuidle_coupled_poke(cpu);
+}
+
+/**
+ * cpuidle_enter_state_coupled
+ *
+ * Coordinate with coupled cpus to enter the target state.  This is a two
+ * stage process.  In the first stage, the cpus are operating independently,
+ * and may call into cpuidle_enter_state_coupled at completely different times.
+ * To save as much power as possible, the first cpus to call this function will
+ * go to an intermediate state (the cpuidle_device's safe state), and wait for
+ * all the other cpus to call this function.  Once all coupled cpus are idle,
+ * the second stage will start.  Each coupled cpu will spin until all cpus have
+ * guaranteed that they will call the target_state.
+ */
+int cpuidle_enter_state_coupled(struct cpuidle_device *dev,
+		struct cpuidle_driver *drv, int next_state)
+{
+	int entered_state = -1;
+	struct cpuidle_coupled *coupled = dev->coupled;
+
+	spin_lock(&coupled->lock);
+	if (!cpumask_test_cpu(dev->cpu, &coupled->alive_coupled_cpus)) {
+		/*
+		 * somebody took us out of the online coupled cpus mask, we
+		 * must be on the way up or down
+		 */
+		spin_unlock(&coupled->lock);
+		return -1;
+	}
+
+	BUG_ON(coupled->ready_count);
+	cpuidle_coupled_update_state(dev, coupled, next_state);
+
+retry:
+	/*
+	 * Wait for all coupled cpus to be idle, using the deepest state
+	 * allowed for a single cpu.
+	 */
+	while (!need_resched() && !cpuidle_coupled_cpus_idle(coupled)) {
+		spin_unlock(&coupled->lock);
+
+		entered_state = cpuidle_enter_state(dev, drv,
+			dev->safe_state_index);
+
+		local_irq_enable();
+		cpu_relax();
+		local_irq_disable();
+
+		spin_lock(&coupled->lock);
+	}
+
+	/* give a chance to process any remaining pokes */
+	local_irq_enable();
+	cpu_relax();
+	local_irq_disable();
+
+	if (need_resched()) {
+		cpuidle_coupled_update_state(dev, coupled, -1);
+		goto out;
+	}
+
+	/*
+	 * All coupled cpus are probably idle.  There is a small chance that
+	 * one of the other cpus just became active.  Increment a counter when
+	 * ready, and spin until all coupled cpus have incremented the counter.
+	 * Once a cpu has incremented the counter, it cannot abort idle and must
+	 * spin until either the count has hit num_online_cpus(), or another
+	 * cpu leaves idle.
+	 */
+
+	coupled->ready_count++;
+
+	while (coupled->ready_count !=
+			cpumask_weight(&coupled->alive_coupled_cpus)) {
+		if (!cpuidle_coupled_cpus_idle(coupled)) {
+			coupled->ready_count--;
+			goto retry;
+		}
+
+		spin_unlock(&coupled->lock);
+		cpu_relax();
+		spin_lock(&coupled->lock);
+	}
+
+	/* all cpus have acked the coupled state */
+	next_state = cpuidle_coupled_get_state(dev, coupled);
+	spin_unlock(&coupled->lock);
+
+	entered_state = cpuidle_enter_state(dev, drv, next_state);
+
+	spin_lock(&coupled->lock);
+
+	cpuidle_coupled_update_state(dev, coupled, -1);
+	coupled->ready_count--;
+
+out:
+	local_irq_enable();
+
+	while (coupled->ready_count > 0) {
+		spin_unlock(&coupled->lock);
+		cpu_relax();
+		spin_lock(&coupled->lock);
+	}
+
+	spin_unlock(&coupled->lock);
+
+	return entered_state;
+}
+
+/**
+ * cpuidle_coupled_register_device
+ *
+ * Called from cpuidle_register_device to handle coupled idle init.  Finds the
+ * cpuidle_coupled struct for this set of coupled cpus, or creates one if none
+ * exists yet.
+ */
+int cpuidle_coupled_register_device(struct cpuidle_device *dev)
+{
+	int cpu;
+	struct cpuidle_device *other_dev;
+	struct call_single_data *csd;
+
+	if (cpumask_empty(&dev->coupled_cpus))
+		return 0;
+
+	for_each_cpu_mask(cpu, dev->coupled_cpus) {
+		other_dev = per_cpu(cpuidle_devices, cpu);
+		if (other_dev && other_dev->coupled) {
+			BUG_ON(!cpumask_equal(&dev->coupled_cpus,
+				&other_dev->coupled_cpus));
+			dev->coupled = other_dev->coupled;
+			goto have_coupled;
+		}
+	}
+
+	/* No existing coupled info found, create a new one */
+	dev->coupled = kzalloc(sizeof(struct cpuidle_coupled), GFP_KERNEL);
+	if (!dev->coupled)
+		return -ENOMEM;
+
+	spin_lock_init(&dev->coupled->lock);
+
+have_coupled:
+	spin_lock(&dev->coupled->lock);
+
+	dev->coupled->requested_state[dev->cpu] = -1;
+
+	if (cpu_online(dev->cpu))
+		cpumask_set_cpu(dev->cpu, &dev->coupled->alive_coupled_cpus);
+	dev->coupled->refcnt++;
+
+	csd = &per_cpu(cpuidle_coupled_poke_cb, dev->cpu);
+	csd->func = cpuidle_coupled_poked;
+	csd->info = (void *)(unsigned long)dev->cpu;
+
+	spin_unlock(&dev->coupled->lock);
+
+	return 0;
+}
+
+/**
+ * cpuidle_coupled_unregister_device
+ *
+ * Called from cpuidle_unregister_device to tear down coupled idle.  Removes the
+ * cpu from the coupled idle set, and frees the cpuidle_coupled_info struct if
+ * this was the last cpu in the set.
+ */
+void cpuidle_coupled_unregister_device(struct cpuidle_device *dev)
+{
+	if (cpumask_empty(&dev->coupled_cpus))
+		return;
+
+	cpumask_clear_cpu(dev->cpu, &dev->coupled->alive_coupled_cpus);
+	if (--dev->coupled->refcnt)
+		kfree(dev->coupled);
+
+	dev->coupled = NULL;
+}
+
+static void cpuidle_coupled_cpu_set_alive(int cpu, bool online)
+{
+	struct cpuidle_device *dev;
+
+	mutex_lock(&cpuidle_lock);
+
+	dev = per_cpu(cpuidle_devices, cpu);
+	if (!dev->coupled)
+		goto out;
+
+	spin_lock(&dev->coupled->lock);
+
+	if (online)
+		cpumask_set_cpu(dev->cpu, &dev->coupled->alive_coupled_cpus);
+	else
+		cpumask_clear_cpu(dev->cpu, &dev->coupled->alive_coupled_cpus);
+
+	spin_unlock(&dev->coupled->lock);
+
+out:
+	mutex_unlock(&cpuidle_lock);
+}
+
+/**
+ * cpuidle_coupled_cpu_notify
+ *
+ * Called when a cpu is brought on or offline using hotplug.  Updates the
+ * coupled cpu set appropriately
+ */
+static int cpuidle_coupled_cpu_notify(struct notifier_block *nb,
+		unsigned long action, void *hcpu)
+{
+	int cpu = (unsigned long)hcpu;
+
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_DEAD:
+	case CPU_UP_CANCELED:
+		cpuidle_coupled_cpu_set_alive(cpu, false);
+		break;
+	case CPU_UP_PREPARE:
+		cpuidle_coupled_cpu_set_alive(cpu, true);
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block cpuidle_coupled_cpu_notifier = {
+	.notifier_call = cpuidle_coupled_cpu_notify,
+};
+
+static int __init cpuidle_coupled_init(void)
+{
+	return register_cpu_notifier(&cpuidle_coupled_cpu_notifier);
+}
+core_initcall(cpuidle_coupled_init);
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index ea00a16..e3d61b2 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -122,7 +122,10 @@ int cpuidle_idle_call(void)
 	trace_power_start(POWER_CSTATE, next_state, dev->cpu);
 	trace_cpu_idle(next_state, dev->cpu);
 
-	entered_state = cpuidle_enter_state(dev, drv, next_state);
+	if (cpuidle_state_is_coupled(dev, drv, next_state))
+		entered_state = cpuidle_enter_state_coupled(dev, drv, next_state);
+	else
+		entered_state = cpuidle_enter_state(dev, drv, next_state);
 
 	trace_power_end(dev->cpu);
 	trace_cpu_idle(PWR_EVENT_EXIT, dev->cpu);
@@ -322,9 +325,16 @@ static int __cpuidle_register_device(struct cpuidle_device *dev)
 	if (ret)
 		goto err_sysfs;
 
+	ret = cpuidle_coupled_register_device(dev);
+	if (ret)
+		goto err_coupled;
+
 	dev->registered = 1;
 	return 0;
 
+err_coupled:
+	cpuidle_remove_sysfs(sys_dev);
+	wait_for_completion(&dev->kobj_unregister);
 err_sysfs:
 	module_put(cpuidle_driver->owner);
 	list_del(&dev->device_list);
@@ -379,6 +389,8 @@ void cpuidle_unregister_device(struct cpuidle_device *dev)
 	wait_for_completion(&dev->kobj_unregister);
 	per_cpu(cpuidle_devices, dev->cpu) = NULL;
 
+	cpuidle_coupled_unregister_device(dev);
+
 	cpuidle_resume_and_unlock();
 
 	module_put(cpuidle_driver->owner);
diff --git a/drivers/cpuidle/cpuidle.h b/drivers/cpuidle/cpuidle.h
index dd2df8f..55a0c6f 100644
--- a/drivers/cpuidle/cpuidle.h
+++ b/drivers/cpuidle/cpuidle.h
@@ -32,4 +32,43 @@ extern void cpuidle_remove_state_sysfs(struct cpuidle_device *device);
 extern int cpuidle_add_sysfs(struct sys_device *sysdev);
 extern void cpuidle_remove_sysfs(struct sys_device *sysdev);
 
+/* coupled states */
+struct cpuidle_coupled {
+	spinlock_t lock;
+	int requested_state[NR_CPUS];
+	int ready_count;
+	cpumask_t alive_coupled_cpus;
+	int refcnt;
+};
+
+#ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
+bool cpuidle_state_is_coupled(struct cpuidle_device *dev,
+		struct cpuidle_driver *drv, int state);
+int cpuidle_enter_state_coupled(struct cpuidle_device *dev,
+		struct cpuidle_driver *drv, int next_state);
+int cpuidle_coupled_register_device(struct cpuidle_device *dev);
+void cpuidle_coupled_unregister_device(struct cpuidle_device *dev);
+#else
+static inline bool cpuidle_state_is_coupled(struct cpuidle_device *dev,
+		struct cpuidle_driver *drv, int state)
+{
+	return false;
+}
+
+static inline int cpuidle_enter_state_coupled(struct cpuidle_device *dev,
+		struct cpuidle_driver *drv, int next_state)
+{
+	return -1;
+}
+
+static inline int cpuidle_coupled_register_device(struct cpuidle_device *dev)
+{
+	return 0;
+}
+
+static inline void cpuidle_coupled_unregister_device(struct cpuidle_device *dev)
+{
+}
+#endif
+
 #endif /* __DRIVER_CPUIDLE_H */
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 7408af8..5438a09 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -53,6 +53,7 @@ struct cpuidle_state {
 
 /* Idle State Flags */
 #define CPUIDLE_FLAG_TIME_VALID	(0x01) /* is residency time measurable? */
+#define CPUIDLE_FLAG_COUPLED	(0x02) /* state applies to multiple cpus */
 
 #define CPUIDLE_DRIVER_FLAGS_MASK (0xFFFF0000)
 
@@ -97,6 +98,12 @@ struct cpuidle_device {
 	struct kobject		kobj;
 	struct completion	kobj_unregister;
 	void			*governor_data;
+
+#ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
+	int			safe_state_index;
+	cpumask_t		coupled_cpus;
+	struct cpuidle_coupled	*coupled;
+#endif
 };
 
 DECLARE_PER_CPU(struct cpuidle_device *, cpuidle_devices);
-- 
1.7.4.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2011-12-21  0:09 [PATCH 0/3] coupled cpuidle state support Colin Cross
                   ` (2 preceding siblings ...)
  2011-12-21  0:09 ` [PATCH 3/3] cpuidle: add support for states that affect multiple cpus Colin Cross
@ 2011-12-21  9:02 ` Arjan van de Ven
  2011-12-21  9:40   ` Colin Cross
  2012-01-04  0:41 ` Kevin Hilman
  2012-01-20  8:46 ` Daniel Lezcano
  5 siblings, 1 reply; 36+ messages in thread
From: Arjan van de Ven @ 2011-12-21  9:02 UTC (permalink / raw)
  To: Colin Cross
  Cc: linux-kernel, linux-arm-kernel, linux-pm, Len Brown,
	Kevin Hilman, Santosh Shilimkar, Amit Kucheria, Trinabh Gupta,
	Deepthi Dharwar, linux-omap, linux-tegra

On 12/21/2011 1:09 AM, Colin Cross wrote:
> On some ARM SMP SoCs (OMAP4460, Tegra 2, and probably more), the
> cpus cannot be independently powered down, either due to
> sequencing restrictions (on Tegra 2, cpu 0 must be the last to
> power down), or due to HW bugs (on OMAP4460, a cpu powering up
> will corrupt the gic state unless the other cpu runs a work
> around).  Each cpu has a power state that it can enter without
> coordinating with the other cpu (usually Wait For Interrupt, or
> WFI), and one or more "coupled" power states that affect blocks
> shared between the cpus (L2 cache, interrupt controller, and
> sometimes the whole SoC).  Entering a coupled power state must
> be tightly controlled on both cpus.
> 
> The easiest solution to implementing coupled cpu power states is
> to hotplug all but one cpu whenever possible, usually using a
> cpufreq governor that looks at cpu load to determine when to
> enable the secondary cpus.  This causes problems, as hotplug is an
> expensive operation, so the number of hotplug transitions must be
> minimized, leading to very slow response to loads, often on the
> order of seconds.
> 
> This patch series implements an alternative solution, where each
> cpu will wait in the WFI state until all cpus are ready to enter
> a coupled state, at which point the coupled state function will
> be called on all cpus at approximately the same time.
> 
> Once all cpus are ready to enter idle, they are woken by an smp
> cross call.  At this point, there is a chance that one of the
> cpus will find work to do, and choose not to enter suspend.  A
> final pass is needed to guarantee that all cpus will call the
> power state enter function at the same time.  During this pass,
> each cpu will increment the ready counter, and continue once the
> ready counter matches the number of online coupled cpus.  If any
> cpu exits idle, the other cpus will decrement their counter and
> retry.

this smells fundamentally racey to me; you can get an interrupt one
cycle after you think you're done, but before the last guy enters WFI...

how do you solve that issue ?



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2011-12-21  9:02 ` [PATCH 0/3] coupled cpuidle state support Arjan van de Ven
@ 2011-12-21  9:40   ` Colin Cross
  2011-12-21  9:44     ` Arjan van de Ven
  0 siblings, 1 reply; 36+ messages in thread
From: Colin Cross @ 2011-12-21  9:40 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, linux-arm-kernel, linux-pm, Len Brown,
	Kevin Hilman, Santosh Shilimkar, Amit Kucheria, Trinabh Gupta,
	Deepthi Dharwar, linux-omap, linux-tegra

On Wed, Dec 21, 2011 at 1:02 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 12/21/2011 1:09 AM, Colin Cross wrote:
>> On some ARM SMP SoCs (OMAP4460, Tegra 2, and probably more), the
>> cpus cannot be independently powered down, either due to
>> sequencing restrictions (on Tegra 2, cpu 0 must be the last to
>> power down), or due to HW bugs (on OMAP4460, a cpu powering up
>> will corrupt the gic state unless the other cpu runs a work
>> around).  Each cpu has a power state that it can enter without
>> coordinating with the other cpu (usually Wait For Interrupt, or
>> WFI), and one or more "coupled" power states that affect blocks
>> shared between the cpus (L2 cache, interrupt controller, and
>> sometimes the whole SoC).  Entering a coupled power state must
>> be tightly controlled on both cpus.
>>
>> The easiest solution to implementing coupled cpu power states is
>> to hotplug all but one cpu whenever possible, usually using a
>> cpufreq governor that looks at cpu load to determine when to
>> enable the secondary cpus.  This causes problems, as hotplug is an
>> expensive operation, so the number of hotplug transitions must be
>> minimized, leading to very slow response to loads, often on the
>> order of seconds.
>>
>> This patch series implements an alternative solution, where each
>> cpu will wait in the WFI state until all cpus are ready to enter
>> a coupled state, at which point the coupled state function will
>> be called on all cpus at approximately the same time.
>>
>> Once all cpus are ready to enter idle, they are woken by an smp
>> cross call.  At this point, there is a chance that one of the
>> cpus will find work to do, and choose not to enter suspend.  A
>> final pass is needed to guarantee that all cpus will call the
>> power state enter function at the same time.  During this pass,
>> each cpu will increment the ready counter, and continue once the
>> ready counter matches the number of online coupled cpus.  If any
>> cpu exits idle, the other cpus will decrement their counter and
>> retry.
>
> this smells fundamentally racey to me; you can get an interrupt one
> cycle after you think you're done, but before the last guy enters WFI...
>
> how do you solve that issue ?

All the cpus have interrupts off when they increment the counter, so
they cannot receive an interrupt.  If an interrupt is pending on one
of those cpus, it will be handled later when WFI aborts due to the
pending interrupt.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2011-12-21  9:40   ` Colin Cross
@ 2011-12-21  9:44     ` Arjan van de Ven
  2011-12-21  9:55       ` Colin Cross
  0 siblings, 1 reply; 36+ messages in thread
From: Arjan van de Ven @ 2011-12-21  9:44 UTC (permalink / raw)
  To: Colin Cross
  Cc: linux-kernel, linux-arm-kernel, linux-pm, Len Brown,
	Kevin Hilman, Santosh Shilimkar, Amit Kucheria, Trinabh Gupta,
	Deepthi Dharwar, linux-omap, linux-tegra

On 12/21/2011 10:40 AM, Colin Cross wrote:

>> this smells fundamentally racey to me; you can get an interrupt one
>> cycle after you think you're done, but before the last guy enters WFI...
>>
>> how do you solve that issue ?
> 
> All the cpus have interrupts off when they increment the counter, so
> they cannot receive an interrupt.  If an interrupt is pending on one
> of those cpus, it will be handled later when WFI aborts due to the
> pending interrupt.

... but this leads to cases where you're aborting before other cpus are
entering..... so your "last guy in" doesn't really work, since while cpu
0 thinks it's the last guy, cpu 1 is already on the way out/out
already...  (heck it might already be going back to sleep if your idle
code can run fast, like in the size of a cache miss)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2011-12-21  9:44     ` Arjan van de Ven
@ 2011-12-21  9:55       ` Colin Cross
  2011-12-21 12:12         ` Arjan van de Ven
  0 siblings, 1 reply; 36+ messages in thread
From: Colin Cross @ 2011-12-21  9:55 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, linux-arm-kernel, linux-pm, Len Brown,
	Kevin Hilman, Santosh Shilimkar, Amit Kucheria, Trinabh Gupta,
	Deepthi Dharwar, linux-omap, linux-tegra

On Wed, Dec 21, 2011 at 1:44 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 12/21/2011 10:40 AM, Colin Cross wrote:
>
>>> this smells fundamentally racey to me; you can get an interrupt one
>>> cycle after you think you're done, but before the last guy enters WFI...
>>>
>>> how do you solve that issue ?
>>
>> All the cpus have interrupts off when they increment the counter, so
>> they cannot receive an interrupt.  If an interrupt is pending on one
>> of those cpus, it will be handled later when WFI aborts due to the
>> pending interrupt.
>
> ... but this leads to cases where you're aborting before other cpus are
> entering..... so your "last guy in" doesn't really work, since while cpu
> 0 thinks it's the last guy, cpu 1 is already on the way out/out
> already...  (heck it might already be going back to sleep if your idle
> code can run fast, like in the size of a cache miss)

Once a cpu has incremented the counter, it has no way out unless either
1: another cpu (that hasn't incremented the counter yet) receives an
interrupt, aborts idle, and clears its idle flag
or
2: all cpus enter the ready counter, and call the cpuidle driver's
enter function.

In your example, cpu 1 has incremented the counter, so it cannot be on
the way out unless cpu 0 aborts (in which case it will not increment
the counter, and the counter will never be equal to the number of
cpus), or unless cpu 0 turns off its interrupts and incrementes the
counter (in which case neither cpu can return until after the cpuidle
driver's enter function has been called on all cpus).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2011-12-21  9:55       ` Colin Cross
@ 2011-12-21 12:12         ` Arjan van de Ven
  2011-12-21 19:05           ` Colin Cross
  2012-03-14  0:39           ` Colin Cross
  0 siblings, 2 replies; 36+ messages in thread
From: Arjan van de Ven @ 2011-12-21 12:12 UTC (permalink / raw)
  To: Colin Cross
  Cc: linux-kernel, linux-arm-kernel, linux-pm, Len Brown,
	Kevin Hilman, Santosh Shilimkar, Amit Kucheria, Trinabh Gupta,
	Deepthi Dharwar, linux-omap, linux-tegra

On 12/21/2011 10:55 AM, Colin Cross wrote:
> On Wed, Dec 21, 2011 at 1:44 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>> On 12/21/2011 10:40 AM, Colin Cross wrote:
>>
>>>> this smells fundamentally racey to me; you can get an interrupt one
>>>> cycle after you think you're done, but before the last guy enters WFI...
>>>>
>>>> how do you solve that issue ?
>>>
>>> All the cpus have interrupts off when they increment the counter, so
>>> they cannot receive an interrupt.  If an interrupt is pending on one
>>> of those cpus, it will be handled later when WFI aborts due to the
>>> pending interrupt.
>>
>> ... but this leads to cases where you're aborting before other cpus are
>> entering..... so your "last guy in" doesn't really work, since while cpu
>> 0 thinks it's the last guy, cpu 1 is already on the way out/out
>> already...  (heck it might already be going back to sleep if your idle
>> code can run fast, like in the size of a cache miss)
> 
> Once a cpu has incremented the counter, it has no way out unless either
> 1: another cpu (that hasn't incremented the counter yet) receives an
> interrupt, aborts idle, and clears its idle flag
> or
> 2: all cpus enter the ready counter, and call the cpuidle driver's
> enter function.

.. or it enters WFI, and a physical device sends it an interrupt,
at which point it exits.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2011-12-21 12:12         ` Arjan van de Ven
@ 2011-12-21 19:05           ` Colin Cross
  2011-12-21 19:36             ` Arjan van de Ven
  2012-03-14  0:39           ` Colin Cross
  1 sibling, 1 reply; 36+ messages in thread
From: Colin Cross @ 2011-12-21 19:05 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, linux-arm-kernel, linux-pm, Len Brown,
	Kevin Hilman, Santosh Shilimkar, Amit Kucheria, Trinabh Gupta,
	Deepthi Dharwar, linux-omap, linux-tegra

On Wed, Dec 21, 2011 at 4:12 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 12/21/2011 10:55 AM, Colin Cross wrote:
>> On Wed, Dec 21, 2011 at 1:44 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>>> On 12/21/2011 10:40 AM, Colin Cross wrote:
>>>
>>>>> this smells fundamentally racey to me; you can get an interrupt one
>>>>> cycle after you think you're done, but before the last guy enters WFI...
>>>>>
>>>>> how do you solve that issue ?
>>>>
>>>> All the cpus have interrupts off when they increment the counter, so
>>>> they cannot receive an interrupt.  If an interrupt is pending on one
>>>> of those cpus, it will be handled later when WFI aborts due to the
>>>> pending interrupt.
>>>
>>> ... but this leads to cases where you're aborting before other cpus are
>>> entering..... so your "last guy in" doesn't really work, since while cpu
>>> 0 thinks it's the last guy, cpu 1 is already on the way out/out
>>> already...  (heck it might already be going back to sleep if your idle
>>> code can run fast, like in the size of a cache miss)
>>
>> Once a cpu has incremented the counter, it has no way out unless either
>> 1: another cpu (that hasn't incremented the counter yet) receives an
>> interrupt, aborts idle, and clears its idle flag
>> or
>> 2: all cpus enter the ready counter, and call the cpuidle driver's
>> enter function.
>
> .. or it enters WFI, and a physical device sends it an interrupt,
> at which point it exits.

None of the cpus will return to the idle loop until all cpus have
decremented the ready counter back to 0, so they can't wrap around
again.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2011-12-21 19:05           ` Colin Cross
@ 2011-12-21 19:36             ` Arjan van de Ven
  2011-12-21 19:42               ` [linux-pm] " Colin Cross
  0 siblings, 1 reply; 36+ messages in thread
From: Arjan van de Ven @ 2011-12-21 19:36 UTC (permalink / raw)
  To: Colin Cross
  Cc: linux-kernel, linux-arm-kernel, linux-pm, Len Brown,
	Kevin Hilman, Santosh Shilimkar, Amit Kucheria, Trinabh Gupta,
	Deepthi Dharwar, linux-omap, linux-tegra

>>
>> .. or it enters WFI, and a physical device sends it an interrupt,
>> at which point it exits.
> 
> None of the cpus will return to the idle loop until all cpus have
> decremented the ready counter back to 0, so they can't wrap around
> again.


yikes, so you IPI all the cpus on the first exit.
that must burn power ;-(

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-pm] [PATCH 0/3] coupled cpuidle state support
  2011-12-21 19:36             ` Arjan van de Ven
@ 2011-12-21 19:42               ` Colin Cross
  2011-12-22  8:35                 ` Shilimkar, Santosh
  0 siblings, 1 reply; 36+ messages in thread
From: Colin Cross @ 2011-12-21 19:42 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Kevin Hilman, Len Brown, linux-kernel, Amit Kucheria,
	linux-tegra, linux-pm, linux-omap, linux-arm-kernel

On Wed, Dec 21, 2011 at 11:36 AM, Arjan van de Ven
<arjan@linux.intel.com> wrote:
>>>
>>> .. or it enters WFI, and a physical device sends it an interrupt,
>>> at which point it exits.
>>
>> None of the cpus will return to the idle loop until all cpus have
>> decremented the ready counter back to 0, so they can't wrap around
>> again.
>
>
> yikes, so you IPI all the cpus on the first exit.
> that must burn power ;-(

No, you're not understanding the point of this series.

If your cpus can go in and out of idle independently, you don't use
this code at all.  Each cpu can call WFI and come back out without
talking to the other cpu.

However, if you have two cpus that share some part of the SoC that can
be turned off in idle, like the L2 cache controller or the system bus,
the two cpus need to go to idle together, and they will both boot
together when either one receives an interrupt (although one will
likely immediately go back to a shallower state that doesn't require
coordination with the other cpu).  There is no way around this, it's
how the hardware works on some ARM platforms.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-pm] [PATCH 0/3] coupled cpuidle state support
  2011-12-21 19:42               ` [linux-pm] " Colin Cross
@ 2011-12-22  8:35                 ` Shilimkar, Santosh
  2011-12-22  8:53                   ` Arjan van de Ven
  0 siblings, 1 reply; 36+ messages in thread
From: Shilimkar, Santosh @ 2011-12-22  8:35 UTC (permalink / raw)
  To: Colin Cross
  Cc: Arjan van de Ven, Kevin Hilman, Len Brown, linux-kernel,
	Amit Kucheria, linux-tegra, linux-pm, linux-omap,
	linux-arm-kernel

On Thu, Dec 22, 2011 at 1:12 AM, Colin Cross <ccross@android.com> wrote:
> On Wed, Dec 21, 2011 at 11:36 AM, Arjan van de Ven
> <arjan@linux.intel.com> wrote:
>>>>
>>>> .. or it enters WFI, and a physical device sends it an interrupt,
>>>> at which point it exits.
>>>
>>> None of the cpus will return to the idle loop until all cpus have
>>> decremented the ready counter back to 0, so they can't wrap around
>>> again.
>>
>>
>> yikes, so you IPI all the cpus on the first exit.
>> that must burn power ;-(
>
> No, you're not understanding the point of this series.
>
> If your cpus can go in and out of idle independently, you don't use
> this code at all.  Each cpu can call WFI and come back out without
> talking to the other cpu.
>
Indeed. The SOCs, Arch's which does support low power
state independently and doesn't need any co-ordination between CPU's
will continue to work same way as before with this series.

> However, if you have two cpus that share some part of the SoC that can
> be turned off in idle, like the L2 cache controller or the system bus,
> the two cpus need to go to idle together, and they will both boot
> together when either one receives an interrupt (although one will
> likely immediately go back to a shallower state that doesn't require
> coordination with the other cpu).  There is no way around this, it's
> how the hardware works on some ARM platforms.

Apart from shared peripherals which Colin pointed out, OMAP also
brings in the security software state which is kind of executing in
parallel with linux. This state is lost in certain deeper low power
states and since the security software is affine to only master CPU
(because of OMAP security architecture), the co-ordination is
mandatory to achieve those low power states.

So this additional CPU co-ordination logic for such C-states
really helps to solve the issue in most generic way.

Regards
Santosh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-pm] [PATCH 0/3] coupled cpuidle state support
  2011-12-22  8:35                 ` Shilimkar, Santosh
@ 2011-12-22  8:53                   ` Arjan van de Ven
  2011-12-22  9:30                     ` Shilimkar, Santosh
  2011-12-22 21:20                     ` Colin Cross
  0 siblings, 2 replies; 36+ messages in thread
From: Arjan van de Ven @ 2011-12-22  8:53 UTC (permalink / raw)
  To: Shilimkar, Santosh
  Cc: Colin Cross, Kevin Hilman, Len Brown, linux-kernel,
	Amit Kucheria, linux-tegra, linux-pm, linux-omap,
	linux-arm-kernel

On 12/22/2011 9:35 AM, Shilimkar, Santosh wrote:

> Indeed. The SOCs, Arch's which does support low power
> state independently and doesn't need any co-ordination between CPU's
> will continue to work same way as before with this series.

btw I think you misunderstand; I don't object to a need for something
like this, I am just very concerned that this may not be possible to be
done in a race-free way.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-pm] [PATCH 0/3] coupled cpuidle state support
  2011-12-22  8:53                   ` Arjan van de Ven
@ 2011-12-22  9:30                     ` Shilimkar, Santosh
  2011-12-22 21:20                     ` Colin Cross
  1 sibling, 0 replies; 36+ messages in thread
From: Shilimkar, Santosh @ 2011-12-22  9:30 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Colin Cross, Kevin Hilman, Len Brown, linux-kernel,
	Amit Kucheria, linux-tegra, linux-pm, linux-omap,
	linux-arm-kernel

On Thu, Dec 22, 2011 at 2:23 PM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 12/22/2011 9:35 AM, Shilimkar, Santosh wrote:
>
>> Indeed. The SOCs, Arch's which does support low power
>> state independently and doesn't need any co-ordination between CPU's
>> will continue to work same way as before with this series.
>
> btw I think you misunderstand; I don't object to a need for something
> like this,
I did. Thanks for clarification.

> I am just very concerned that this may not be possible to be
> done in a race-free way.
>
I agree to those races but may be they are harmless.
Also the safe state need not be just WFI and can be bit deeper
where the co-ordination between isn't necessary. So that should
still not burn the power that much.

For simplicity let's assume a two CPU scenario.
Ideal scenario:
CPU 1 has entered idle and incremented counter for the
co-ordinated C state. CPU0 also enter and increments the
counter and now the subsystem and interconnect can go
down along with CPU cluster.

Few of the race conditions will possibly lead to void
the above conditions  and in that case the counter would
reflect that and such a C-state won't be attempted and
a safe C-state would be attempted. That should still work
fine.

Some how this hardware/security restriction is bit stupid
and likely going against the existing CPUIDLE design
which expect that a CPUIDLE thread are per CPU and it should
be independent and local to that CPU.

Regards
Santosh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-pm] [PATCH 0/3] coupled cpuidle state support
  2011-12-22  8:53                   ` Arjan van de Ven
  2011-12-22  9:30                     ` Shilimkar, Santosh
@ 2011-12-22 21:20                     ` Colin Cross
  1 sibling, 0 replies; 36+ messages in thread
From: Colin Cross @ 2011-12-22 21:20 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Shilimkar, Santosh, Kevin Hilman, Len Brown, linux-kernel,
	Amit Kucheria, linux-tegra, linux-pm, linux-omap,
	linux-arm-kernel

On Thu, Dec 22, 2011 at 12:53 AM, Arjan van de Ven
<arjan@linux.intel.com> wrote:
> On 12/22/2011 9:35 AM, Shilimkar, Santosh wrote:
>
>> Indeed. The SOCs, Arch's which does support low power
>> state independently and doesn't need any co-ordination between CPU's
>> will continue to work same way as before with this series.
>
> btw I think you misunderstand; I don't object to a need for something
> like this, I am just very concerned that this may not be possible to be
> done in a race-free way.

I agree that there are many potential races in this code, but I
believe I've handled all of them.  This patch set is a refactoring of
the common parts of the SMP idle code that has shipped on Tegra and
OMAP4 devices, so the basic idea has been hammered on extensively.

I think I've explained the protection against the race condition you
asked about earlier.  As for the power impact, the power savings of
getting into the deeper coupled power states far outweighs the cost of
booting all coupled cpus out of idle and letting them loop back to a
shallower idle state.

On an OMAP4 platform, at the slowest cpu speed, the system-wide power
usage with both cpus in the non-coupled state (WFI) is around 50 mA,
while the deepest coupled state is 12 mA.  At the fastest cpu speed,
WFI is over 100 mA.

On some platforms it may be possible to have only one cpu boot out of
idle when an interrupt arrives, but that introduces the race condition
you asked about before, where the last cpu tries to go into idle at
the same time that an already-idle cpu comes back out.  This way has
fairly minimal overhead (booting the 2nd cpu and then going to WFI),
and that's a path that already needs to be fast because it's part of
the normal single-cpu idle path.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2011-12-21  0:09 [PATCH 0/3] coupled cpuidle state support Colin Cross
                   ` (3 preceding siblings ...)
  2011-12-21  9:02 ` [PATCH 0/3] coupled cpuidle state support Arjan van de Ven
@ 2012-01-04  0:41 ` Kevin Hilman
  2012-01-04 17:27   ` Shilimkar, Santosh
  2012-01-20  8:46 ` Daniel Lezcano
  5 siblings, 1 reply; 36+ messages in thread
From: Kevin Hilman @ 2012-01-04  0:41 UTC (permalink / raw)
  To: Colin Cross
  Cc: linux-kernel, linux-arm-kernel, linux-pm, Len Brown,
	Santosh Shilimkar, Amit Kucheria, Arjan van de Ven,
	Trinabh Gupta, Deepthi Dharwar, linux-omap, linux-tegra

Colin Cross <ccross@android.com> writes:


> This patch series implements an alternative solution, where each
> cpu will wait in the WFI state until all cpus are ready to enter
> a coupled state, at which point the coupled state function will
> be called on all cpus at approximately the same time.

This looks great, and is certainly preferred to hotplug IMO.

[...]

> None of the platforms I work with have an SMP idle implementation
> upstream, so I can't easily show a patch that converts a platform
> from hotplug governor to coupled cpuidle states.  

Basic OMAP4 CPUidle support is in now queued for v3.3 (see omap4
branch Tony's tree[1].)  

Can you (or Santosh) send a patch that adds coupled support to that
driver so it can see some broader testing on OMAP4?

Thanks,

Kevin

[1] git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap.git

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-pm] [PATCH 1/3] cpuidle: refactor out cpuidle_enter_state
  2011-12-21  0:09 ` [PATCH 1/3] cpuidle: refactor out cpuidle_enter_state Colin Cross
@ 2012-01-04 14:08   ` Jean Pihet
  0 siblings, 0 replies; 36+ messages in thread
From: Jean Pihet @ 2012-01-04 14:08 UTC (permalink / raw)
  To: Colin Cross
  Cc: linux-kernel, linux-arm-kernel, linux-pm, Kevin Hilman,
	Len Brown, Amit Kucheria, linux-tegra, linux-omap,
	Arjan van de Ven

Hi Colin,

On Wed, Dec 21, 2011 at 1:09 AM, Colin Cross <ccross@android.com> wrote:
> Split the code to enter a state and update the stats into a helper
> function, cpuidle_enter_state, and export it.  This function will
> be called by the coupled state code to handle entering the safe
> state and the final coupled state.
>
> Signed-off-by: Colin Cross <ccross@android.com>
> ---
>  drivers/cpuidle/cpuidle.c |   43 +++++++++++++++++++++++++++++--------------
>  drivers/cpuidle/cpuidle.h |    2 ++
>  2 files changed, 31 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 06ce268..1486b3c 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -54,6 +54,34 @@ static void cpuidle_kick_cpus(void) {}
>  static int __cpuidle_register_device(struct cpuidle_device *dev);
>
>  /**
> + * cpuidle_enter_state
> + *
> + * enter the state and update stats
> + */
> +int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
> +               int next_state)
> +{
> +       int entered_state;
> +       struct cpuidle_state *target_state;
> +
> +       target_state = &drv->states[next_state];
> +
> +       entered_state = target_state->enter(dev, drv, next_state);
> +
> +       if (entered_state >= 0) {
> +               /* Update cpuidle counters */
> +               /* This can be moved to within driver enter routine
> +                * but that results in multiple copies of same code.
> +                */
> +               dev->states_usage[entered_state].time +=
> +                               (unsigned long long)dev->last_residency;
> +               dev->states_usage[entered_state].usage++;
> +       }
> +
> +       return entered_state;
> +}
> +
> +/**
>  * cpuidle_idle_call - the main idle loop
>  *
>  * NOTE: no locks or semaphores should be used here
> @@ -63,7 +91,6 @@ int cpuidle_idle_call(void)
>  {
>        struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
>        struct cpuidle_driver *drv = cpuidle_get_driver();
> -       struct cpuidle_state *target_state;
>        int next_state, entered_state;
>
>        if (off)
> @@ -92,26 +119,14 @@ int cpuidle_idle_call(void)
>                return 0;
>        }
>
> -       target_state = &drv->states[next_state];
> -
>        trace_power_start(POWER_CSTATE, next_state, dev->cpu);
>        trace_cpu_idle(next_state, dev->cpu);
>
> -       entered_state = target_state->enter(dev, drv, next_state);
> +       entered_state = cpuidle_enter_state(dev, drv, next_state);
>
>        trace_power_end(dev->cpu);
>        trace_cpu_idle(PWR_EVENT_EXIT, dev->cpu);
The cpu_idle traces are only present in this function and not in
cpuidle_enter_state. Is that expected? Can all the transitions from
all the cpus get traced that way?

Regards,
Jean

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2012-01-04  0:41 ` Kevin Hilman
@ 2012-01-04 17:27   ` Shilimkar, Santosh
  0 siblings, 0 replies; 36+ messages in thread
From: Shilimkar, Santosh @ 2012-01-04 17:27 UTC (permalink / raw)
  To: Kevin Hilman
  Cc: Colin Cross, linux-kernel, linux-arm-kernel, linux-pm, Len Brown,
	Amit Kucheria, Arjan van de Ven, Trinabh Gupta, Deepthi Dharwar,
	linux-omap, linux-tegra

On Wed, Jan 4, 2012 at 1:41 AM, Kevin Hilman <khilman@ti.com> wrote:
> Colin Cross <ccross@android.com> writes:
>
>
>> This patch series implements an alternative solution, where each
>> cpu will wait in the WFI state until all cpus are ready to enter
>> a coupled state, at which point the coupled state function will
>> be called on all cpus at approximately the same time.
>
> This looks great, and is certainly preferred to hotplug IMO.
>
> [...]
>
>> None of the platforms I work with have an SMP idle implementation
>> upstream, so I can't easily show a patch that converts a platform
>> from hotplug governor to coupled cpuidle states.
>
> Basic OMAP4 CPUidle support is in now queued for v3.3 (see omap4
> branch Tony's tree[1].)
>
> Can you (or Santosh) send a patch that adds coupled support to that
> driver so it can see some broader testing on OMAP4?
>
I briefly attempted OMAP4 idle with this series but had some lock
up related issues which we faced in the older development.

Then I got busy into other activities which will keep me occupied for
next few weeks. I will be happy to share the patches in case some one
would like to have a look till then.

Regards
Santosh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2011-12-21  0:09 [PATCH 0/3] coupled cpuidle state support Colin Cross
                   ` (4 preceding siblings ...)
  2012-01-04  0:41 ` Kevin Hilman
@ 2012-01-20  8:46 ` Daniel Lezcano
  2012-01-20 20:40   ` Colin Cross
       [not found]   ` <8762e8kqi6.fsf@ti.com>
  5 siblings, 2 replies; 36+ messages in thread
From: Daniel Lezcano @ 2012-01-20  8:46 UTC (permalink / raw)
  To: Colin Cross
  Cc: linux-kernel, linux-arm-kernel, linux-pm, Len Brown,
	Kevin Hilman, Santosh Shilimkar, Amit Kucheria, Arjan van de Ven,
	Trinabh Gupta, Deepthi Dharwar, linux-omap, linux-tegra

On 12/21/2011 01:09 AM, Colin Cross wrote:
> On some ARM SMP SoCs (OMAP4460, Tegra 2, and probably more), the
> cpus cannot be independently powered down, either due to
> sequencing restrictions (on Tegra 2, cpu 0 must be the last to
> power down), or due to HW bugs (on OMAP4460, a cpu powering up
> will corrupt the gic state unless the other cpu runs a work
> around).  Each cpu has a power state that it can enter without
> coordinating with the other cpu (usually Wait For Interrupt, or
> WFI), and one or more "coupled" power states that affect blocks
> shared between the cpus (L2 cache, interrupt controller, and
> sometimes the whole SoC).  Entering a coupled power state must
> be tightly controlled on both cpus.
>
> The easiest solution to implementing coupled cpu power states is
> to hotplug all but one cpu whenever possible, usually using a
> cpufreq governor that looks at cpu load to determine when to
> enable the secondary cpus.  This causes problems, as hotplug is an
> expensive operation, so the number of hotplug transitions must be
> minimized, leading to very slow response to loads, often on the
> order of seconds.
>
> This patch series implements an alternative solution, where each
> cpu will wait in the WFI state until all cpus are ready to enter
> a coupled state, at which point the coupled state function will
> be called on all cpus at approximately the same time.
>
> Once all cpus are ready to enter idle, they are woken by an smp
> cross call.  At this point, there is a chance that one of the
> cpus will find work to do, and choose not to enter suspend.  A
> final pass is needed to guarantee that all cpus will call the
> power state enter function at the same time.  During this pass,
> each cpu will increment the ready counter, and continue once the
> ready counter matches the number of online coupled cpus.  If any
> cpu exits idle, the other cpus will decrement their counter and
> retry.
>
> To use coupled cpuidle states, a cpuidle driver must:
>
>     Set struct cpuidle_device.coupled_cpus to the mask of all
>     coupled cpus, usually the same as cpu_possible_mask if all cpus
>     are part of the same cluster.  The coupled_cpus mask must be
>     set in the struct cpuidle_device for each cpu.
>
>     Set struct cpuidle_device.safe_state to a state that is not a
>     coupled state.  This is usually WFI.
>
>     Set CPUIDLE_FLAG_COUPLED in struct cpuidle_state.flags for each
>     state that affects multiple cpus.
>
>     Provide a struct cpuidle_state.enter function for each state
>     that affects multiple cpus.  This function is guaranteed to be
>     called on all cpus at approximately the same time.  The driver
>     should ensure that the cpus all abort together if any cpu tries
>     to abort once the function is called.
>
> This series was functionally tested on v3.0, but has only been
> compile-tested on v3.2 after the removal of per-cpu state fields.

Hi Colin,

this patchset could be interesting to resolve in a generic way the cpu 
dependencies.
What is the status of this patchset ?

Did you have the opportunity to measure the power consumption with and 
without this patchset ?

Thanks
-- Daniel

-- 
  <http://www.linaro.org/>  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:<http://www.facebook.com/pages/Linaro>  Facebook |
<http://twitter.com/#!/linaroorg>  Twitter |
<http://www.linaro.org/linaro-blog/>  Blog



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2012-01-20  8:46 ` Daniel Lezcano
@ 2012-01-20 20:40   ` Colin Cross
  2012-01-25 14:04     ` Daniel Lezcano
  2012-01-27  8:54     ` [linux-pm] " Vincent Guittot
       [not found]   ` <8762e8kqi6.fsf@ti.com>
  1 sibling, 2 replies; 36+ messages in thread
From: Colin Cross @ 2012-01-20 20:40 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-kernel, linux-arm-kernel, linux-pm, Len Brown,
	Kevin Hilman, Santosh Shilimkar, Amit Kucheria, Arjan van de Ven,
	Trinabh Gupta, Deepthi Dharwar, linux-omap, linux-tegra

On Fri, Jan 20, 2012 at 12:46 AM, Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
> Hi Colin,
>
> this patchset could be interesting to resolve in a generic way the cpu
> dependencies.
> What is the status of this patchset ?

I can't do much with it right now, because I don't have any devices
that can do SMP idle with a v3.2 kernel.  I've started working on an
updated version that avoids the spinlock, but it might be a while
before I can test and post it.  I'm mostly looking for feedback on the
approach taken in this patch, and whether it will be useful for other
SoCs besides Tegra and OMAP4.

> Did you have the opportunity to measure the power consumption with and
> without this patchset ?

Power consumption will be very dependent on the specific SoC in
question.  The most important factors are the power savings of the
independent cpuidle state (normally WFI) vs. the hotplug state
(normally 1 cpu in OFF), and the workload being tested.

On a very idle system, these patches result in the same total power as
hotplugging one cpu and letting the other idle normally.  On a 25%
busy system, you might see a slight increase in power, as the best
independent cpuidle state might be WFI, vs 1 cpu in OFF mode in
hotplug.  On OMAP4, that difference is small, on the order of 10 mW.
Once you hit the threshold where a hotplug governor would have
hotplugged in the second cpu (lets say 40%), the savings from these
patches are enormous, as you can hit the lowest power state up to 60%
of the time, where the hotplug solution would never be going below WFI
on both cpus.  On OMAP4, that can be well over 100 mW.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2012-01-20 20:40   ` Colin Cross
@ 2012-01-25 14:04     ` Daniel Lezcano
  2012-01-31 14:13       ` Daniel Lezcano
  2012-01-27  8:54     ` [linux-pm] " Vincent Guittot
  1 sibling, 1 reply; 36+ messages in thread
From: Daniel Lezcano @ 2012-01-25 14:04 UTC (permalink / raw)
  To: Colin Cross
  Cc: Daniel Lezcano, linux-kernel, linux-arm-kernel, linux-pm,
	Len Brown, Kevin Hilman, Santosh Shilimkar, Amit Kucheria,
	Arjan van de Ven, Trinabh Gupta, Deepthi Dharwar, linux-omap,
	linux-tegra

On 01/20/2012 09:40 PM, Colin Cross wrote:
> On Fri, Jan 20, 2012 at 12:46 AM, Daniel Lezcano
> <daniel.lezcano@linaro.org>  wrote:
>> Hi Colin,
>>
>> this patchset could be interesting to resolve in a generic way the cpu
>> dependencies.
>> What is the status of this patchset ?
> I can't do much with it right now, because I don't have any devices
> that can do SMP idle with a v3.2 kernel.  I've started working on an
> updated version that avoids the spinlock, but it might be a while
> before I can test and post it.  I'm mostly looking for feedback on the
> approach taken in this patch, and whether it will be useful for other
> SoCs besides Tegra and OMAP4.

Hi Colin,

I will be happy to test your patchset. Do you have a pointer to a more 
recent kernel ?


>> Did you have the opportunity to measure the power consumption with and
>> without this patchset ?
> Power consumption will be very dependent on the specific SoC in
> question.  The most important factors are the power savings of the
> independent cpuidle state (normally WFI) vs. the hotplug state
> (normally 1 cpu in OFF), and the workload being tested.
>
> On a very idle system, these patches result in the same total power as
> hotplugging one cpu and letting the other idle normally.  On a 25%
> busy system, you might see a slight increase in power, as the best
> independent cpuidle state might be WFI, vs 1 cpu in OFF mode in
> hotplug.  On OMAP4, that difference is small, on the order of 10 mW.
> Once you hit the threshold where a hotplug governor would have
> hotplugged in the second cpu (lets say 40%), the savings from these
> patches are enormous, as you can hit the lowest power state up to 60%
> of the time, where the hotplug solution would never be going below WFI
> on both cpus.  On OMAP4, that can be well over 100 mW.

Interesting.


Thanks

   -- Daniel


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-pm] [PATCH 0/3] coupled cpuidle state support
  2012-01-20 20:40   ` Colin Cross
  2012-01-25 14:04     ` Daniel Lezcano
@ 2012-01-27  8:54     ` Vincent Guittot
  2012-01-27 17:32       ` Colin Cross
  1 sibling, 1 reply; 36+ messages in thread
From: Vincent Guittot @ 2012-01-27  8:54 UTC (permalink / raw)
  To: Colin Cross
  Cc: Daniel Lezcano, Kevin Hilman, Len Brown, linux-kernel,
	Amit Kucheria, linux-tegra, linux-pm, linux-omap,
	Arjan van de Ven, linux-arm-kernel

On 20 January 2012 21:40, Colin Cross <ccross@android.com> wrote:
> On Fri, Jan 20, 2012 at 12:46 AM, Daniel Lezcano
> <daniel.lezcano@linaro.org> wrote:
>> Hi Colin,
>>
>> this patchset could be interesting to resolve in a generic way the cpu
>> dependencies.
>> What is the status of this patchset ?
>
> I can't do much with it right now, because I don't have any devices
> that can do SMP idle with a v3.2 kernel.  I've started working on an
> updated version that avoids the spinlock, but it might be a while
> before I can test and post it.  I'm mostly looking for feedback on the
> approach taken in this patch, and whether it will be useful for other
> SoCs besides Tegra and OMAP4.
>

Hi Colin,

In your patch, you put in safe state (WFI for most of platform) the
cpus that become idle and these cpus are woken up each time a new cpu
of the cluster becomes idle. Then, the cluster state is chosen and the
cpus enter the selected C-state. On ux500, we are using another
behavior for synchronizing  the cpus. The cpus are prepared to enter
the c-state that has been chosen by the governor and the last cpu,
that enters idle, chooses the final cluster state (according to cpus'
C-state). The main advantage of this solution is that you don't need
to wake other cpus to enter the C-state of a cluster. This can be
quite worth full when tasks mainly run on one cpu. Have you also think
about such behavior when developing the coupled cpuidle driver ? It
could be interesting to add such behavior.

Regards,
Vincent

>> Did you have the opportunity to measure the power consumption with and
>> without this patchset ?
>
> Power consumption will be very dependent on the specific SoC in
> question.  The most important factors are the power savings of the
> independent cpuidle state (normally WFI) vs. the hotplug state
> (normally 1 cpu in OFF), and the workload being tested.
>
> On a very idle system, these patches result in the same total power as
> hotplugging one cpu and letting the other idle normally.  On a 25%
> busy system, you might see a slight increase in power, as the best
> independent cpuidle state might be WFI, vs 1 cpu in OFF mode in
> hotplug.  On OMAP4, that difference is small, on the order of 10 mW.
> Once you hit the threshold where a hotplug governor would have
> hotplugged in the second cpu (lets say 40%), the savings from these
> patches are enormous, as you can hit the lowest power state up to 60%
> of the time, where the hotplug solution would never be going below WFI
> on both cpus.  On OMAP4, that can be well over 100 mW.
> _______________________________________________
> linux-pm mailing list
> linux-pm@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/linux-pm

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-pm] [PATCH 0/3] coupled cpuidle state support
  2012-01-27  8:54     ` [linux-pm] " Vincent Guittot
@ 2012-01-27 17:32       ` Colin Cross
  2012-02-01 12:13         ` Vincent Guittot
  0 siblings, 1 reply; 36+ messages in thread
From: Colin Cross @ 2012-01-27 17:32 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Daniel Lezcano, Kevin Hilman, Len Brown, linux-kernel,
	Amit Kucheria, linux-tegra, linux-pm, linux-omap,
	Arjan van de Ven, linux-arm-kernel

On Fri, Jan 27, 2012 at 12:54 AM, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
> On 20 January 2012 21:40, Colin Cross <ccross@android.com> wrote:
>> On Fri, Jan 20, 2012 at 12:46 AM, Daniel Lezcano
>> <daniel.lezcano@linaro.org> wrote:
>>> Hi Colin,
>>>
>>> this patchset could be interesting to resolve in a generic way the cpu
>>> dependencies.
>>> What is the status of this patchset ?
>>
>> I can't do much with it right now, because I don't have any devices
>> that can do SMP idle with a v3.2 kernel.  I've started working on an
>> updated version that avoids the spinlock, but it might be a while
>> before I can test and post it.  I'm mostly looking for feedback on the
>> approach taken in this patch, and whether it will be useful for other
>> SoCs besides Tegra and OMAP4.
>>
>
> Hi Colin,
>
> In your patch, you put in safe state (WFI for most of platform) the
> cpus that become idle and these cpus are woken up each time a new cpu
> of the cluster becomes idle. Then, the cluster state is chosen and the
> cpus enter the selected C-state. On ux500, we are using another
> behavior for synchronizing  the cpus. The cpus are prepared to enter
> the c-state that has been chosen by the governor and the last cpu,
> that enters idle, chooses the final cluster state (according to cpus'
> C-state). The main advantage of this solution is that you don't need
> to wake other cpus to enter the C-state of a cluster. This can be
> quite worth full when tasks mainly run on one cpu. Have you also think
> about such behavior when developing the coupled cpuidle driver ? It
> could be interesting to add such behavior.

Waking up the cpus that are in the safe state is not done just to
choose the target state, it's done to allow the cpus to take
themselves to the target low power state.  On ux500, are you saying
you take the cpus directly from the safe state to a lower power state
without ever going back to the active state?  I once implemented Tegra
that way, and it required lots of nasty synchronization to prevent
resetting the cpu at the same time that it was booting due to an
interrupt, and I was later told that Tegra can't handle that sequence
at all, although I haven't verified it yet.

On platforms that can't turn the cpus off in a random order, or that
can't take a cpu directly from the safe state to the target state,
something like these coupled cpuidle patches are required.  On
platforms that can, the low power modes can be implemented without
these patches, although it is very hard to do without race conditions.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2012-01-25 14:04     ` Daniel Lezcano
@ 2012-01-31 14:13       ` Daniel Lezcano
  0 siblings, 0 replies; 36+ messages in thread
From: Daniel Lezcano @ 2012-01-31 14:13 UTC (permalink / raw)
  To: Colin Cross
  Cc: Daniel Lezcano, linux-kernel, linux-arm-kernel, linux-pm,
	Len Brown, Kevin Hilman, Santosh Shilimkar, Amit Kucheria,
	Arjan van de Ven, Trinabh Gupta, Deepthi Dharwar, linux-omap,
	linux-tegra

On 01/25/2012 03:04 PM, Daniel Lezcano wrote:
> On 01/20/2012 09:40 PM, Colin Cross wrote:
>> On Fri, Jan 20, 2012 at 12:46 AM, Daniel Lezcano
>> <daniel.lezcano@linaro.org>  wrote:
>>> Hi Colin,
>>>
>>> this patchset could be interesting to resolve in a generic way the cpu
>>> dependencies.
>>> What is the status of this patchset ?
>> I can't do much with it right now, because I don't have any devices
>> that can do SMP idle with a v3.2 kernel.  I've started working on an
>> updated version that avoids the spinlock, but it might be a while
>> before I can test and post it.  I'm mostly looking for feedback on the
>> approach taken in this patch, and whether it will be useful for other
>> SoCs besides Tegra and OMAP4.
>
> Hi Colin,
>
> I will be happy to test your patchset. Do you have a pointer to a more 
> recent kernel ?

Hi Colin,

sorry for bothering you. I am working on a cpuidle driver for an ARM SMP 
board, I will be really happy to test your patchset and look how it fits 
with the cpuidle driver.

Thanks
   -- Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-pm] [PATCH 0/3] coupled cpuidle state support
  2012-01-27 17:32       ` Colin Cross
@ 2012-02-01 12:13         ` Vincent Guittot
  2012-02-01 14:59           ` Lorenzo Pieralisi
  0 siblings, 1 reply; 36+ messages in thread
From: Vincent Guittot @ 2012-02-01 12:13 UTC (permalink / raw)
  To: Colin Cross
  Cc: Daniel Lezcano, Kevin Hilman, Len Brown, linux-kernel,
	Amit Kucheria, linux-tegra, linux-pm, linux-omap,
	Arjan van de Ven, linux-arm-kernel

Hi Colin,

Sorry for this late reply

On 27 January 2012 18:32, Colin Cross <ccross@android.com> wrote:
> On Fri, Jan 27, 2012 at 12:54 AM, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
>> On 20 January 2012 21:40, Colin Cross <ccross@android.com> wrote:
>>> On Fri, Jan 20, 2012 at 12:46 AM, Daniel Lezcano
>>> <daniel.lezcano@linaro.org> wrote:
>>>> Hi Colin,
>>>>
>>>> this patchset could be interesting to resolve in a generic way the cpu
>>>> dependencies.
>>>> What is the status of this patchset ?
>>>
>>> I can't do much with it right now, because I don't have any devices
>>> that can do SMP idle with a v3.2 kernel.  I've started working on an
>>> updated version that avoids the spinlock, but it might be a while
>>> before I can test and post it.  I'm mostly looking for feedback on the
>>> approach taken in this patch, and whether it will be useful for other
>>> SoCs besides Tegra and OMAP4.
>>>
>>
>> Hi Colin,
>>
>> In your patch, you put in safe state (WFI for most of platform) the
>> cpus that become idle and these cpus are woken up each time a new cpu
>> of the cluster becomes idle. Then, the cluster state is chosen and the
>> cpus enter the selected C-state. On ux500, we are using another
>> behavior for synchronizing  the cpus. The cpus are prepared to enter
>> the c-state that has been chosen by the governor and the last cpu,
>> that enters idle, chooses the final cluster state (according to cpus'
>> C-state). The main advantage of this solution is that you don't need
>> to wake other cpus to enter the C-state of a cluster. This can be
>> quite worth full when tasks mainly run on one cpu. Have you also think
>> about such behavior when developing the coupled cpuidle driver ? It
>> could be interesting to add such behavior.
>
> Waking up the cpus that are in the safe state is not done just to
> choose the target state, it's done to allow the cpus to take
> themselves to the target low power state.  On ux500, are you saying
> you take the cpus directly from the safe state to a lower power state
> without ever going back to the active state?  I once implemented Tegra

yes it is

> that way, and it required lots of nasty synchronization to prevent
> resetting the cpu at the same time that it was booting due to an
> interrupt, and I was later told that Tegra can't handle that sequence
> at all, although I haven't verified it yet.

you have to 2 main things to check :
 - this cpu is the last one to enter an idle state
 - other cpus are prepared to enter a cluster power state
 - other cpus are in WFI

Vincent

>
> On platforms that can't turn the cpus off in a random order, or that
> can't take a cpu directly from the safe state to the target state,
> something like these coupled cpuidle patches are required.  On
> platforms that can, the low power modes can be implemented without
> these patches, although it is very hard to do without race conditions.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-pm] [PATCH 0/3] coupled cpuidle state support
  2012-02-01 12:13         ` Vincent Guittot
@ 2012-02-01 14:59           ` Lorenzo Pieralisi
  2012-02-01 17:30             ` Colin Cross
  0 siblings, 1 reply; 36+ messages in thread
From: Lorenzo Pieralisi @ 2012-02-01 14:59 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Colin Cross, Daniel Lezcano, Kevin Hilman, Len Brown,
	linux-kernel, Amit Kucheria, linux-tegra, linux-pm, linux-omap,
	Arjan van de Ven, linux-arm-kernel

On Wed, Feb 01, 2012 at 12:13:26PM +0000, Vincent Guittot wrote:

[...]

> >> In your patch, you put in safe state (WFI for most of platform) the
> >> cpus that become idle and these cpus are woken up each time a new cpu
> >> of the cluster becomes idle. Then, the cluster state is chosen and the
> >> cpus enter the selected C-state. On ux500, we are using another
> >> behavior for synchronizing  the cpus. The cpus are prepared to enter
> >> the c-state that has been chosen by the governor and the last cpu,
> >> that enters idle, chooses the final cluster state (according to cpus'
> >> C-state). The main advantage of this solution is that you don't need
> >> to wake other cpus to enter the C-state of a cluster. This can be
> >> quite worth full when tasks mainly run on one cpu. Have you also think
> >> about such behavior when developing the coupled cpuidle driver ? It
> >> could be interesting to add such behavior.
> >
> > Waking up the cpus that are in the safe state is not done just to
> > choose the target state, it's done to allow the cpus to take
> > themselves to the target low power state.  On ux500, are you saying
> > you take the cpus directly from the safe state to a lower power state
> > without ever going back to the active state?  I once implemented Tegra
> 
> yes it is

But if there is a single power rail for the entire cluster, when a CPU
is "prepared" for shutdown this means that you have to save the context and
clean L1, maybe for nothing since if other CPUs are up and running the
CPU going idle can just enter a simple standby wfi (clock-gated but power on).

With Colin's approach, context is saved and L1 cleaned only when it is
almost certain the cluster is powered off (so the CPUs).

It is a trade-off, I am not saying one approach is better than the
other; we just have to make sure that preparing the CPU for "possible" shutdown 
is better than sending IPIs to take CPUs out of wfi and synchronize
them (this happens if and only if CPUs enter coupled C-states).

As usual this will depend on use cases (and silicon implementations :) )

It is definitely worth benchmarking them.

Lorenzo


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-pm] [PATCH 0/3] coupled cpuidle state support
  2012-02-01 14:59           ` Lorenzo Pieralisi
@ 2012-02-01 17:30             ` Colin Cross
  2012-02-01 18:07               ` Lorenzo Pieralisi
  0 siblings, 1 reply; 36+ messages in thread
From: Colin Cross @ 2012-02-01 17:30 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Vincent Guittot, Daniel Lezcano, Kevin Hilman, Len Brown,
	linux-kernel, Amit Kucheria, linux-tegra, linux-pm, linux-omap,
	Arjan van de Ven, linux-arm-kernel

On Wed, Feb 1, 2012 at 6:59 AM, Lorenzo Pieralisi
<lorenzo.pieralisi@arm.com> wrote:
> On Wed, Feb 01, 2012 at 12:13:26PM +0000, Vincent Guittot wrote:
>
> [...]
>
>> >> In your patch, you put in safe state (WFI for most of platform) the
>> >> cpus that become idle and these cpus are woken up each time a new cpu
>> >> of the cluster becomes idle. Then, the cluster state is chosen and the
>> >> cpus enter the selected C-state. On ux500, we are using another
>> >> behavior for synchronizing  the cpus. The cpus are prepared to enter
>> >> the c-state that has been chosen by the governor and the last cpu,
>> >> that enters idle, chooses the final cluster state (according to cpus'
>> >> C-state). The main advantage of this solution is that you don't need
>> >> to wake other cpus to enter the C-state of a cluster. This can be
>> >> quite worth full when tasks mainly run on one cpu. Have you also think
>> >> about such behavior when developing the coupled cpuidle driver ? It
>> >> could be interesting to add such behavior.
>> >
>> > Waking up the cpus that are in the safe state is not done just to
>> > choose the target state, it's done to allow the cpus to take
>> > themselves to the target low power state.  On ux500, are you saying
>> > you take the cpus directly from the safe state to a lower power state
>> > without ever going back to the active state?  I once implemented Tegra
>>
>> yes it is
>
> But if there is a single power rail for the entire cluster, when a CPU
> is "prepared" for shutdown this means that you have to save the context and
> clean L1, maybe for nothing since if other CPUs are up and running the
> CPU going idle can just enter a simple standby wfi (clock-gated but power on).
>
> With Colin's approach, context is saved and L1 cleaned only when it is
> almost certain the cluster is powered off (so the CPUs).
>
> It is a trade-off, I am not saying one approach is better than the
> other; we just have to make sure that preparing the CPU for "possible" shutdown
> is better than sending IPIs to take CPUs out of wfi and synchronize
> them (this happens if and only if CPUs enter coupled C-states).
>
> As usual this will depend on use cases (and silicon implementations :) )
>
> It is definitely worth benchmarking them.
>

I'm less worried about performance, and more worried about race
conditions.  How do you deal with the following situation:
CPU0 goes to WFI, and saves its state
CPU1 goes idle, and selects a deep idle state that powers down CPU0
CPU1 saves is state, and is about to trigger the power down
CPU0 gets an interrupt, restores its state, and modifies state (maybe
takes a spinlock during boot)
CPU1 cuts the power to CPU0

On OMAP4, the race is handled in hardware.  When CPU1 tries to cut the
power to the blocks shared by CPU0 the hardware will ignore the
request if CPU0 is not in WFI.  On Tegra2, there is no hardware
support and I had to handle it with a spinlock implemented in scratch
registers because CPU0 is out of coherency when it starts booting and
ldrex/strex don't work.  I'm not convinced my implementation is
correct, and I'd be curious to see any other implementations.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-pm] [PATCH 0/3] coupled cpuidle state support
  2012-02-01 17:30             ` Colin Cross
@ 2012-02-01 18:07               ` Lorenzo Pieralisi
  2012-02-03  1:19                 ` Colin Cross
  0 siblings, 1 reply; 36+ messages in thread
From: Lorenzo Pieralisi @ 2012-02-01 18:07 UTC (permalink / raw)
  To: Colin Cross
  Cc: Vincent Guittot, Daniel Lezcano, Kevin Hilman, Len Brown,
	linux-kernel, Amit Kucheria, linux-tegra, linux-pm, linux-omap,
	Arjan van de Ven, linux-arm-kernel

On Wed, Feb 01, 2012 at 05:30:15PM +0000, Colin Cross wrote:
> On Wed, Feb 1, 2012 at 6:59 AM, Lorenzo Pieralisi
> <lorenzo.pieralisi@arm.com> wrote:
> > On Wed, Feb 01, 2012 at 12:13:26PM +0000, Vincent Guittot wrote:
> >
> > [...]
> >
> >> >> In your patch, you put in safe state (WFI for most of platform) the
> >> >> cpus that become idle and these cpus are woken up each time a new cpu
> >> >> of the cluster becomes idle. Then, the cluster state is chosen and the
> >> >> cpus enter the selected C-state. On ux500, we are using another
> >> >> behavior for synchronizing  the cpus. The cpus are prepared to enter
> >> >> the c-state that has been chosen by the governor and the last cpu,
> >> >> that enters idle, chooses the final cluster state (according to cpus'
> >> >> C-state). The main advantage of this solution is that you don't need
> >> >> to wake other cpus to enter the C-state of a cluster. This can be
> >> >> quite worth full when tasks mainly run on one cpu. Have you also think
> >> >> about such behavior when developing the coupled cpuidle driver ? It
> >> >> could be interesting to add such behavior.
> >> >
> >> > Waking up the cpus that are in the safe state is not done just to
> >> > choose the target state, it's done to allow the cpus to take
> >> > themselves to the target low power state.  On ux500, are you saying
> >> > you take the cpus directly from the safe state to a lower power state
> >> > without ever going back to the active state?  I once implemented Tegra
> >>
> >> yes it is
> >
> > But if there is a single power rail for the entire cluster, when a CPU
> > is "prepared" for shutdown this means that you have to save the context and
> > clean L1, maybe for nothing since if other CPUs are up and running the
> > CPU going idle can just enter a simple standby wfi (clock-gated but power on).
> >
> > With Colin's approach, context is saved and L1 cleaned only when it is
> > almost certain the cluster is powered off (so the CPUs).
> >
> > It is a trade-off, I am not saying one approach is better than the
> > other; we just have to make sure that preparing the CPU for "possible" shutdown
> > is better than sending IPIs to take CPUs out of wfi and synchronize
> > them (this happens if and only if CPUs enter coupled C-states).
> >
> > As usual this will depend on use cases (and silicon implementations :) )
> >
> > It is definitely worth benchmarking them.
> >
> 
> I'm less worried about performance, and more worried about race
> conditions.  How do you deal with the following situation:
> CPU0 goes to WFI, and saves its state
> CPU1 goes idle, and selects a deep idle state that powers down CPU0
> CPU1 saves is state, and is about to trigger the power down
> CPU0 gets an interrupt, restores its state, and modifies state (maybe
> takes a spinlock during boot)
> CPU1 cuts the power to CPU0
> 
> On OMAP4, the race is handled in hardware.  When CPU1 tries to cut the
> power to the blocks shared by CPU0 the hardware will ignore the
> request if CPU0 is not in WFI.  On Tegra2, there is no hardware
> support and I had to handle it with a spinlock implemented in scratch
> registers because CPU0 is out of coherency when it starts booting and
> ldrex/strex don't work.  I'm not convinced my implementation is
> correct, and I'd be curious to see any other implementations.

That's a problem you solved with coupled C-states (ie your example in
the cover letter), where the primary waits for other CPUs to be reset
before issuing the power down command, right ? At that point in time 
secondaries cannot wake up (?) and if wfi (ie power down) aborts you just
take the secondaries out of reset and restart executing simultaneously,
correct ? It mirrors the suspend behaviour, which is easier to deal with
than completely random idle paths.

It is true that this should be managed by the PM HW; if HW is not
capable of managing these situations things get nasty as you highlighted.

And it is also true ldrex/strex on cacheable memory might not be available in
those early warm-boot stages. I came up with a locking algorithm on
strongly ordered memory to deal with that, but I am still not sure it is
something we really really need.

I will test coupled C-state code ASAP, and come back with feedback.

Thanks,
Lorenzo


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-pm] [PATCH 0/3] coupled cpuidle state support
  2012-02-01 18:07               ` Lorenzo Pieralisi
@ 2012-02-03  1:19                 ` Colin Cross
  0 siblings, 0 replies; 36+ messages in thread
From: Colin Cross @ 2012-02-03  1:19 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Vincent Guittot, Daniel Lezcano, Kevin Hilman, Len Brown,
	linux-kernel, Amit Kucheria, linux-tegra, linux-pm, linux-omap,
	Arjan van de Ven, linux-arm-kernel

On Wed, Feb 1, 2012 at 10:07 AM, Lorenzo Pieralisi
<lorenzo.pieralisi@arm.com> wrote:
> On Wed, Feb 01, 2012 at 05:30:15PM +0000, Colin Cross wrote:
>> On Wed, Feb 1, 2012 at 6:59 AM, Lorenzo Pieralisi
>> <lorenzo.pieralisi@arm.com> wrote:
>> > On Wed, Feb 01, 2012 at 12:13:26PM +0000, Vincent Guittot wrote:
>> >
>> > [...]
>> >
>> >> >> In your patch, you put in safe state (WFI for most of platform) the
>> >> >> cpus that become idle and these cpus are woken up each time a new cpu
>> >> >> of the cluster becomes idle. Then, the cluster state is chosen and the
>> >> >> cpus enter the selected C-state. On ux500, we are using another
>> >> >> behavior for synchronizing  the cpus. The cpus are prepared to enter
>> >> >> the c-state that has been chosen by the governor and the last cpu,
>> >> >> that enters idle, chooses the final cluster state (according to cpus'
>> >> >> C-state). The main advantage of this solution is that you don't need
>> >> >> to wake other cpus to enter the C-state of a cluster. This can be
>> >> >> quite worth full when tasks mainly run on one cpu. Have you also think
>> >> >> about such behavior when developing the coupled cpuidle driver ? It
>> >> >> could be interesting to add such behavior.
>> >> >
>> >> > Waking up the cpus that are in the safe state is not done just to
>> >> > choose the target state, it's done to allow the cpus to take
>> >> > themselves to the target low power state.  On ux500, are you saying
>> >> > you take the cpus directly from the safe state to a lower power state
>> >> > without ever going back to the active state?  I once implemented Tegra
>> >>
>> >> yes it is
>> >
>> > But if there is a single power rail for the entire cluster, when a CPU
>> > is "prepared" for shutdown this means that you have to save the context and
>> > clean L1, maybe for nothing since if other CPUs are up and running the
>> > CPU going idle can just enter a simple standby wfi (clock-gated but power on).
>> >
>> > With Colin's approach, context is saved and L1 cleaned only when it is
>> > almost certain the cluster is powered off (so the CPUs).
>> >
>> > It is a trade-off, I am not saying one approach is better than the
>> > other; we just have to make sure that preparing the CPU for "possible" shutdown
>> > is better than sending IPIs to take CPUs out of wfi and synchronize
>> > them (this happens if and only if CPUs enter coupled C-states).
>> >
>> > As usual this will depend on use cases (and silicon implementations :) )
>> >
>> > It is definitely worth benchmarking them.
>> >
>>
>> I'm less worried about performance, and more worried about race
>> conditions.  How do you deal with the following situation:
>> CPU0 goes to WFI, and saves its state
>> CPU1 goes idle, and selects a deep idle state that powers down CPU0
>> CPU1 saves is state, and is about to trigger the power down
>> CPU0 gets an interrupt, restores its state, and modifies state (maybe
>> takes a spinlock during boot)
>> CPU1 cuts the power to CPU0
>>
>> On OMAP4, the race is handled in hardware.  When CPU1 tries to cut the
>> power to the blocks shared by CPU0 the hardware will ignore the
>> request if CPU0 is not in WFI.  On Tegra2, there is no hardware
>> support and I had to handle it with a spinlock implemented in scratch
>> registers because CPU0 is out of coherency when it starts booting and
>> ldrex/strex don't work.  I'm not convinced my implementation is
>> correct, and I'd be curious to see any other implementations.
>
> That's a problem you solved with coupled C-states (ie your example in
> the cover letter), where the primary waits for other CPUs to be reset
> before issuing the power down command, right ? At that point in time
> secondaries cannot wake up (?) and if wfi (ie power down) aborts you just
> take the secondaries out of reset and restart executing simultaneously,
> correct ? It mirrors the suspend behaviour, which is easier to deal with
> than completely random idle paths.

Yes, anything that supports hotplug and suspend should support coupled
cpuidle states fairly easily.  The only thing required that is not
already used by hotplug/suspend is the ability to save and restore
context on cpu1, but most implementations end up doing that already.

> It is true that this should be managed by the PM HW; if HW is not
> capable of managing these situations things get nasty as you highlighted.

Yes - on some platforms, the HW is not designed to handle it.  On
others, it is designed to, but due to HW bugs it cannot be used.

> And it is also true ldrex/strex on cacheable memory might not be available in
> those early warm-boot stages. I came up with a locking algorithm on
> strongly ordered memory to deal with that, but I am still not sure it is
> something we really really need.

I did the same, but with device memory.

> I will test coupled C-state code ASAP, and come back with feedback.
>
> Thanks,
> Lorenzo
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
       [not found]   ` <8762e8kqi6.fsf@ti.com>
@ 2012-03-14  0:28     ` Colin Cross
  2012-03-14  0:47       ` Colin Cross
  2012-03-14  2:04     ` Arjan van de Ven
  1 sibling, 1 reply; 36+ messages in thread
From: Colin Cross @ 2012-03-14  0:28 UTC (permalink / raw)
  To: Kevin Hilman
  Cc: Daniel Lezcano, linux-kernel, linux-arm-kernel, linux-pm,
	Len Brown, Santosh Shilimkar, Amit Kucheria, Arjan van de Ven,
	Trinabh Gupta, Deepthi Dharwar, linux-omap, linux-tegra

On Tue, Mar 13, 2012 at 4:52 PM, Kevin Hilman <khilman@ti.com> wrote:
> Hi Colin,
>
> On 12/21/2011 01:09 AM, Colin Cross wrote:
>
>> To use coupled cpuidle states, a cpuidle driver must:
>
> [...]
>
>>     Provide a struct cpuidle_state.enter function for each state
>>     that affects multiple cpus.  This function is guaranteed to be
>>     called on all cpus at approximately the same time.  The driver
>>     should ensure that the cpus all abort together if any cpu tries
>>     to abort once the function is called.
>
> I've discoved the last sentence above is crucial, and in order to catch
> all the corner cases I found it useful to have the struct
> cpuidle_coupled in cpuidle.h so that the driver can check ready_count
> itself (patch below, on top of $SUBJECT series.)

ready_count is an internal state of core coupled code, and will change
significantly in the next version of the patches.  Drivers cannot
depend on it.

> As you know, on OMAP4, when entering the coupled state, CPU0 has to wait
> for CPU1 to enter its low power state before entering itself.  The first
> pass at implementing this was to just spin waiting for the powerdomain
> of CPU1 to hit off.  That works... most of the time.
>
> If CPU1 wakes up immediately (or before CPU0 starts checking), or more
> likely, fails to hit the low-power state because of other hardware
> "conditions", CPU0 will end up stuck in the loop waiting for CPU1.
>
> To solve this, in addition to checking the power state of CPU1, I also
> check if (coupled->ready_count != cpumask_weight(&coupled->alive_coupled_cpus)).
> If true, it means that CPU1 has already exited/aborted so CPU0 had
> better abort as well.
>
> Checking the ready_count seemed like an easy way to do this, but did you
> have any other mechanisms in mind for CPUs to communicate that they've
> exited/aborted?

Why not set a flag from CPU1 when it exits the low power state, and
have CPU0 spin on the powerdomain register or the flag?  You can then
use the parallel barrier function to ensure both cpus have seen the
flag and reset it to 0 before returning.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2011-12-21 12:12         ` Arjan van de Ven
  2011-12-21 19:05           ` Colin Cross
@ 2012-03-14  0:39           ` Colin Cross
  1 sibling, 0 replies; 36+ messages in thread
From: Colin Cross @ 2012-03-14  0:39 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, linux-arm-kernel, linux-pm, Len Brown,
	Kevin Hilman, Santosh Shilimkar, Amit Kucheria, Trinabh Gupta,
	Deepthi Dharwar, linux-omap, linux-tegra

On Wed, Dec 21, 2011 at 4:12 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 12/21/2011 10:55 AM, Colin Cross wrote:
>> On Wed, Dec 21, 2011 at 1:44 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>>> On 12/21/2011 10:40 AM, Colin Cross wrote:
>>>
>>>>> this smells fundamentally racey to me; you can get an interrupt one
>>>>> cycle after you think you're done, but before the last guy enters WFI...
>>>>>
>>>>> how do you solve that issue ?
>>>>
>>>> All the cpus have interrupts off when they increment the counter, so
>>>> they cannot receive an interrupt.  If an interrupt is pending on one
>>>> of those cpus, it will be handled later when WFI aborts due to the
>>>> pending interrupt.
>>>
>>> ... but this leads to cases where you're aborting before other cpus are
>>> entering..... so your "last guy in" doesn't really work, since while cpu
>>> 0 thinks it's the last guy, cpu 1 is already on the way out/out
>>> already...  (heck it might already be going back to sleep if your idle
>>> code can run fast, like in the size of a cache miss)
>>
>> Once a cpu has incremented the counter, it has no way out unless either
>> 1: another cpu (that hasn't incremented the counter yet) receives an
>> interrupt, aborts idle, and clears its idle flag
>> or
>> 2: all cpus enter the ready counter, and call the cpuidle driver's
>> enter function.
>
> .. or it enters WFI, and a physical device sends it an interrupt,
> at which point it exits.

Rereading this, I think I understand what you meant.  I misunderstood
because of your reference to WFI, which is not entered at this point.
WFI generally refers to the shallowest idle state, which only affects
a single cpu.  For deeper states, the cpu may or may not execute the
WFI instruction to trigger the transition (depending on the platform),
but it's not really entering WFI, and the cpu will generally not ever
execute the next instruction (again, depending on the platform).

For coupled cpus, only one cpu is capable of booting first, so all
interrupts should be masked on cpu1 by the time it gets to its low
power state, so it should never abort.  If it does abort (like the
example Kevin Hillman just posted to the end of this thread), cpu1
needs to set a flag to say it is not going to get to idle, and cpu0
needs to wait for cpu1 to be in idle (by polling a power controller
register), or for the flag to be set.  Afterwards, the cpus need to
resynchronize and reset the flag, which will be easy with the parallel
barrier helper function that will be included in the next patch set.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2012-03-14  0:28     ` Colin Cross
@ 2012-03-14  0:47       ` Colin Cross
  2012-03-14 14:23         ` [linux-pm] " Kevin Hilman
  0 siblings, 1 reply; 36+ messages in thread
From: Colin Cross @ 2012-03-14  0:47 UTC (permalink / raw)
  To: Kevin Hilman
  Cc: Daniel Lezcano, linux-kernel, linux-arm-kernel, linux-pm,
	Len Brown, Santosh Shilimkar, Amit Kucheria, Arjan van de Ven,
	Trinabh Gupta, Deepthi Dharwar, linux-omap, linux-tegra

On Tue, Mar 13, 2012 at 5:28 PM, Colin Cross <ccross@android.com> wrote:
> On Tue, Mar 13, 2012 at 4:52 PM, Kevin Hilman <khilman@ti.com> wrote:
>> Hi Colin,
>>
>> On 12/21/2011 01:09 AM, Colin Cross wrote:
>>
>>> To use coupled cpuidle states, a cpuidle driver must:
>>
>> [...]
>>
>>>     Provide a struct cpuidle_state.enter function for each state
>>>     that affects multiple cpus.  This function is guaranteed to be
>>>     called on all cpus at approximately the same time.  The driver
>>>     should ensure that the cpus all abort together if any cpu tries
>>>     to abort once the function is called.
>>
>> I've discoved the last sentence above is crucial, and in order to catch
>> all the corner cases I found it useful to have the struct
>> cpuidle_coupled in cpuidle.h so that the driver can check ready_count
>> itself (patch below, on top of $SUBJECT series.)
>
> ready_count is an internal state of core coupled code, and will change
> significantly in the next version of the patches.  Drivers cannot
> depend on it.
>
>> As you know, on OMAP4, when entering the coupled state, CPU0 has to wait
>> for CPU1 to enter its low power state before entering itself.  The first
>> pass at implementing this was to just spin waiting for the powerdomain
>> of CPU1 to hit off.  That works... most of the time.
>>
>> If CPU1 wakes up immediately (or before CPU0 starts checking), or more
>> likely, fails to hit the low-power state because of other hardware
>> "conditions", CPU0 will end up stuck in the loop waiting for CPU1.
>>
>> To solve this, in addition to checking the power state of CPU1, I also
>> check if (coupled->ready_count != cpumask_weight(&coupled->alive_coupled_cpus)).
>> If true, it means that CPU1 has already exited/aborted so CPU0 had
>> better abort as well.
>>
>> Checking the ready_count seemed like an easy way to do this, but did you
>> have any other mechanisms in mind for CPUs to communicate that they've
>> exited/aborted?
>
> Why not set a flag from CPU1 when it exits the low power state, and
> have CPU0 spin on the powerdomain register or the flag?  You can then
> use the parallel barrier function to ensure both cpus have seen the
> flag and reset it to 0 before returning.

I realized the parallel barrier helper was not included in the patch
set I posted, it will be in the next patch set.  Short version, no
caller to cpuidle_coupled_parallel_barrier will return before all cpus
in the coupled set have called it.  It allows you to resynchronize the
cpus after an abort to ensure they have all seen the abort flag before
clearing it and returning, leaving everything in the correct state for
the next idle attempt.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
       [not found]   ` <8762e8kqi6.fsf@ti.com>
  2012-03-14  0:28     ` Colin Cross
@ 2012-03-14  2:04     ` Arjan van de Ven
  2012-03-14  2:21       ` Colin Cross
  1 sibling, 1 reply; 36+ messages in thread
From: Arjan van de Ven @ 2012-03-14  2:04 UTC (permalink / raw)
  To: Kevin Hilman
  Cc: Colin Cross, Daniel Lezcano, linux-kernel, linux-arm-kernel,
	linux-pm, Len Brown, Santosh Shilimkar, Amit Kucheria,
	Trinabh Gupta, Deepthi Dharwar, linux-omap, linux-tegra

On 3/13/2012 4:52 PM, Kevin Hilman wrote:
> Checking the ready_count seemed like an easy way to do this, but did you
> have any other mechanisms in mind for CPUs to communicate that they've
> exited/aborted?

this indeed is the tricky part (which I warned about earlier);
I've spent quite a lot of time (weeks) to get this provably working for
an Intel system with similar requirements... and it's extremely unfunny,
and needed firmware support to close some of the race conditions.

I sure hope that hardware with these requirements is on the way out...
it's not very OS friendly.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] coupled cpuidle state support
  2012-03-14  2:04     ` Arjan van de Ven
@ 2012-03-14  2:21       ` Colin Cross
  0 siblings, 0 replies; 36+ messages in thread
From: Colin Cross @ 2012-03-14  2:21 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Kevin Hilman, Daniel Lezcano, linux-kernel, linux-arm-kernel,
	linux-pm, Len Brown, Santosh Shilimkar, Amit Kucheria,
	Trinabh Gupta, Deepthi Dharwar, linux-omap, linux-tegra

On Tue, Mar 13, 2012 at 7:04 PM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 3/13/2012 4:52 PM, Kevin Hilman wrote:
>> Checking the ready_count seemed like an easy way to do this, but did you
>> have any other mechanisms in mind for CPUs to communicate that they've
>> exited/aborted?
>
> this indeed is the tricky part (which I warned about earlier);
> I've spent quite a lot of time (weeks) to get this provably working for
> an Intel system with similar requirements... and it's extremely unfunny,
> and needed firmware support to close some of the race conditions.

As long as you can tell from cpu0 that cpu1 has succesfully entered
the low power state, this should be easy, and the
coupled_cpuidle_parallel_barrier helper should make it even easier.
This series just allows the exact same sequence of transitions used by
hotplug cpufreq governors to happen from the idle thread:

1. cpu0 forces cpu1 offline.  From a wakeup perspective, this is
exactly like hotplug removing cpu1.  Unlike hotplug, the state of cpu1
has to be saved, and the scheduler is not told that the cpu is gone.
Instead of using the IPI signalling that hotplug uses, we call the
same function on both cpus, and one cpu runs the equivalent of
platform_cpu_kill, and the other the equivalent of platform_cpu_die.
Wakeup events that are only targeted at cpu1 will need to be
temporarily migrated to cpu0.
2. If the hotplug is successful, cpu0 goes to a low power state, the
same way it would when the hotplug cpufreq governor had removed cpu1.
3. cpu0 wakes up, because all wakeup events are pointed at it.
4. After restoring its own state, cpu0 brings cpu1 back online,
exactly like hotplug online would, except that the boot vector has to
point to a cpu_resume handler instead of secondary start.

> I sure hope that hardware with these requirements is on the way out...
> it's not very OS friendly.

Even hardware that was designed not not have these requirements
sometimes has bugs that require this kind of sequencing.  OMAP4430
should theoretically support idle without sequencing, but OMAP4460
introduced a ROM code bug that requires sequencing again (turning on
cpu1 corrupts the interrupt controller, so cpu0 has to be waiting with
interrupts off to run a workaround whenever cpu1 turns on).

Out of curiosity, what Intel hardware needs this?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-pm] [PATCH 0/3] coupled cpuidle state support
  2012-03-14  0:47       ` Colin Cross
@ 2012-03-14 14:23         ` Kevin Hilman
  0 siblings, 0 replies; 36+ messages in thread
From: Kevin Hilman @ 2012-03-14 14:23 UTC (permalink / raw)
  To: Colin Cross
  Cc: Len Brown, linux-kernel, Amit Kucheria, linux-tegra, linux-pm,
	linux-omap, Arjan van de Ven, linux-arm-kernel

Colin Cross <ccross@android.com> writes:

> On Tue, Mar 13, 2012 at 5:28 PM, Colin Cross <ccross@android.com> wrote:
>> On Tue, Mar 13, 2012 at 4:52 PM, Kevin Hilman <khilman@ti.com> wrote:

[...]

>>>
>>> Checking the ready_count seemed like an easy way to do this, but did you
>>> have any other mechanisms in mind for CPUs to communicate that they've
>>> exited/aborted?
>>
>> Why not set a flag from CPU1 when it exits the low power state, and
>> have CPU0 spin on the powerdomain register or the flag?  You can then
>> use the parallel barrier function to ensure both cpus have seen the
>> flag and reset it to 0 before returning.
>
> I realized the parallel barrier helper was not included in the patch
> set I posted, it will be in the next patch set.  

Do you have an ETA for the next patchset?  I'll be glad to give it a
spin on OMAP4.

> Short version, no caller to cpuidle_coupled_parallel_barrier will
> return before all cpus in the coupled set have called it.  It allows
> you to resynchronize the cpus after an abort to ensure they have all
> seen the abort flag before clearing it and returning, leaving
> everything in the correct state for the next idle attempt.

OK, this sounds like it will work well for what I need.

Kevin

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2012-03-14 14:23 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-21  0:09 [PATCH 0/3] coupled cpuidle state support Colin Cross
2011-12-21  0:09 ` [PATCH 1/3] cpuidle: refactor out cpuidle_enter_state Colin Cross
2012-01-04 14:08   ` [linux-pm] " Jean Pihet
2011-12-21  0:09 ` [PATCH 2/3] cpuidle: fix error handling in __cpuidle_register_device Colin Cross
2011-12-21  0:09 ` [PATCH 3/3] cpuidle: add support for states that affect multiple cpus Colin Cross
2011-12-21  9:02 ` [PATCH 0/3] coupled cpuidle state support Arjan van de Ven
2011-12-21  9:40   ` Colin Cross
2011-12-21  9:44     ` Arjan van de Ven
2011-12-21  9:55       ` Colin Cross
2011-12-21 12:12         ` Arjan van de Ven
2011-12-21 19:05           ` Colin Cross
2011-12-21 19:36             ` Arjan van de Ven
2011-12-21 19:42               ` [linux-pm] " Colin Cross
2011-12-22  8:35                 ` Shilimkar, Santosh
2011-12-22  8:53                   ` Arjan van de Ven
2011-12-22  9:30                     ` Shilimkar, Santosh
2011-12-22 21:20                     ` Colin Cross
2012-03-14  0:39           ` Colin Cross
2012-01-04  0:41 ` Kevin Hilman
2012-01-04 17:27   ` Shilimkar, Santosh
2012-01-20  8:46 ` Daniel Lezcano
2012-01-20 20:40   ` Colin Cross
2012-01-25 14:04     ` Daniel Lezcano
2012-01-31 14:13       ` Daniel Lezcano
2012-01-27  8:54     ` [linux-pm] " Vincent Guittot
2012-01-27 17:32       ` Colin Cross
2012-02-01 12:13         ` Vincent Guittot
2012-02-01 14:59           ` Lorenzo Pieralisi
2012-02-01 17:30             ` Colin Cross
2012-02-01 18:07               ` Lorenzo Pieralisi
2012-02-03  1:19                 ` Colin Cross
     [not found]   ` <8762e8kqi6.fsf@ti.com>
2012-03-14  0:28     ` Colin Cross
2012-03-14  0:47       ` Colin Cross
2012-03-14 14:23         ` [linux-pm] " Kevin Hilman
2012-03-14  2:04     ` Arjan van de Ven
2012-03-14  2:21       ` Colin Cross

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).