* [RFC PATCH V4 0/5] cpuidle: Cleanup pm_idle and include driver/cpuidle.c in-kernel @ 2011-03-22 12:32 Trinabh Gupta 2011-03-22 12:32 ` [RFC PATCH V4 1/5] cpuidle: Remove pm_idle pointer for x86 Trinabh Gupta ` (4 more replies) 0 siblings, 5 replies; 51+ messages in thread From: Trinabh Gupta @ 2011-03-22 12:32 UTC (permalink / raw) To: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak; +Cc: linux-kernel, sfr Changes in V4: * Implemented cpuidle driver for xen. Earlier pm_idle function pointer would be set to default idle. Now a cpuidle_driver encapsulating the idle routine is cleanly registered for this using cpuidle_register_driver API. * Implemented a cpuidle driver for apm_cpu_idle() as part of APM BIOS driver. Earlier APM BIOS driver would flip pm_idle function pointer to apm_cpu_idle. * This patch series applies on 2.6.38, and was tested on x86 system with multiple sleep states. Goal: This patch series tries to achieve the goal of having cpuidle manage all idle routine for x86. It removes pm_idle function pointer which causes problems discussed at http://lkml.org/lkml/2009/8/28/43 and http://lkml.org/lkml/2009/8/28/50. V1 of this series is at https://lkml.org/lkml/2010/10/19/449, V2 is at https://lkml.org/lkml/2011/1/13/98 and V3 is at https://lkml.org/lkml/2011/2/8/73 --- Trinabh Gupta (5): cpuidle: cpuidle driver for apm cpuidle: driver for xen cpuidle: default idle driver for x86 cpuidle: list based cpuidle driver registration and selection cpuidle: Remove pm_idle pointer for x86 arch/x86/Kconfig | 2 arch/x86/kernel/apm_32.c | 75 ++++++- arch/x86/kernel/process.c | 339 -------------------------------- arch/x86/kernel/process_32.c | 4 arch/x86/kernel/process_64.c | 4 arch/x86/xen/setup.c | 42 ++++ drivers/acpi/processor_idle.c | 2 drivers/cpuidle/Kconfig | 9 + drivers/cpuidle/cpuidle.c | 50 ++++- drivers/cpuidle/driver.c | 114 ++++++++++- drivers/idle/Makefile | 1 drivers/idle/default_driver.c | 437 +++++++++++++++++++++++++++++++++++++++++ include/linux/cpuidle.h | 3 13 files changed, 707 insertions(+), 375 deletions(-) create mode 100644 drivers/idle/default_driver.c -- -Trinabh ^ permalink raw reply [flat|nested] 51+ messages in thread
* [RFC PATCH V4 1/5] cpuidle: Remove pm_idle pointer for x86 2011-03-22 12:32 [RFC PATCH V4 0/5] cpuidle: Cleanup pm_idle and include driver/cpuidle.c in-kernel Trinabh Gupta @ 2011-03-22 12:32 ` Trinabh Gupta 2011-03-23 1:00 ` Stephen Rothwell 2011-03-22 12:32 ` [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection Trinabh Gupta ` (3 subsequent siblings) 4 siblings, 1 reply; 51+ messages in thread From: Trinabh Gupta @ 2011-03-22 12:32 UTC (permalink / raw) To: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak; +Cc: linux-kernel, sfr This patch removes pm_idle function pointer and directly calls cpuidle_idle_call from the idle loop on x86. Hence, CPUIdle has to be built into the kernel for x86 and optional for other archs. Archs that still use pm_idle can continue to set pm_idle=cpuidle_idle_call() and co-exist. The cpuidle_(un)install_idle_handler() is defined per arch in cpuidle.c and would set pm_idle pointer on non-x86 architectures until they are incrementally converted to directly call cpuidle_idle_call() and not use the pm_idle pointer. Signed-off-by: Trinabh Gupta <trinabh@linux.vnet.ibm.com> --- arch/x86/kernel/process_32.c | 4 +++- arch/x86/kernel/process_64.c | 4 +++- drivers/cpuidle/Kconfig | 3 ++- drivers/cpuidle/cpuidle.c | 2 +- 4 files changed, 9 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 8d12878..17b7101 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -74,6 +74,8 @@ static inline void play_dead(void) } #endif +extern void cpuidle_idle_call(void); + /* * The idle thread. There's no useful work to be * done, so just try to conserve power and have a @@ -109,7 +111,7 @@ void cpu_idle(void) local_irq_disable(); /* Don't trace irqs off for idle */ stop_critical_timings(); - pm_idle(); + cpuidle_idle_call(); start_critical_timings(); } tick_nohz_restart_sched_tick(); diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index bd387e8..c736bf3 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -99,6 +99,8 @@ static inline void play_dead(void) } #endif +extern void cpuidle_idle_call(void); + /* * The idle thread. There's no useful work to be * done, so just try to conserve power and have a @@ -136,7 +138,7 @@ void cpu_idle(void) enter_idle(); /* Don't trace irqs off for idle */ stop_critical_timings(); - pm_idle(); + cpuidle_idle_call(); start_critical_timings(); /* In many cases the interrupt that ended idle diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig index 7dbc4a8..e67c258 100644 --- a/drivers/cpuidle/Kconfig +++ b/drivers/cpuidle/Kconfig @@ -1,7 +1,8 @@ config CPU_IDLE bool "CPU idle PM support" - default ACPI + default y if X86 + default y if ACPI help CPU idle is a generic framework for supporting software-controlled idle processor power management. It includes modular cross-platform diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index bf50924..8baaa04 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -47,7 +47,7 @@ static int __cpuidle_register_device(struct cpuidle_device *dev); * * NOTE: no locks or semaphores should be used here */ -static void cpuidle_idle_call(void) +void cpuidle_idle_call(void) { struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices); struct cpuidle_state *target_state; ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 1/5] cpuidle: Remove pm_idle pointer for x86 2011-03-22 12:32 ` [RFC PATCH V4 1/5] cpuidle: Remove pm_idle pointer for x86 Trinabh Gupta @ 2011-03-23 1:00 ` Stephen Rothwell 2011-03-23 10:10 ` Trinabh Gupta 0 siblings, 1 reply; 51+ messages in thread From: Stephen Rothwell @ 2011-03-23 1:00 UTC (permalink / raw) To: Trinabh Gupta Cc: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak, linux-kernel [-- Attachment #1: Type: text/plain, Size: 721 bytes --] Hi, Just some simple comments. Does having this patch first in the series break APM idle? On Tue, 22 Mar 2011 18:02:27 +0530 Trinabh Gupta <trinabh@linux.vnet.ibm.com> wrote: > > diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c > index 8d12878..17b7101 100644 > --- a/arch/x86/kernel/process_32.c > +++ b/arch/x86/kernel/process_32.c > @@ -74,6 +74,8 @@ static inline void play_dead(void) > } > #endif > > +extern void cpuidle_idle_call(void); Put this declaration in a header file and include that header file here and in the file that defines that function. -- Cheers, Stephen Rothwell sfr@canb.auug.org.au http://www.canb.auug.org.au/~sfr/ [-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 1/5] cpuidle: Remove pm_idle pointer for x86 2011-03-23 1:00 ` Stephen Rothwell @ 2011-03-23 10:10 ` Trinabh Gupta 0 siblings, 0 replies; 51+ messages in thread From: Trinabh Gupta @ 2011-03-23 10:10 UTC (permalink / raw) To: Stephen Rothwell Cc: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel Hi Stephen, On 03/23/2011 06:30 AM, Stephen Rothwell wrote: > Hi, > > Just some simple comments. > > Does having this patch first in the series break APM idle? Thanks for reviewing. Yes, I think removal of "pm_idle = default_idle" statement in APM may have to be moved here. > > On Tue, 22 Mar 2011 18:02:27 +0530 Trinabh Gupta<trinabh@linux.vnet.ibm.com> wrote: >> >> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c >> index 8d12878..17b7101 100644 >> --- a/arch/x86/kernel/process_32.c >> +++ b/arch/x86/kernel/process_32.c >> @@ -74,6 +74,8 @@ static inline void play_dead(void) >> } >> #endif >> >> +extern void cpuidle_idle_call(void); > > Put this declaration in a header file and include that header file here > and in the file that defines that function. > Ok, thanks -Trinabh ^ permalink raw reply [flat|nested] 51+ messages in thread
* [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection 2011-03-22 12:32 [RFC PATCH V4 0/5] cpuidle: Cleanup pm_idle and include driver/cpuidle.c in-kernel Trinabh Gupta 2011-03-22 12:32 ` [RFC PATCH V4 1/5] cpuidle: Remove pm_idle pointer for x86 Trinabh Gupta @ 2011-03-22 12:32 ` Trinabh Gupta 2011-03-23 2:59 ` Len Brown 2011-03-22 12:33 ` [RFC PATCH V4 3/5] cpuidle: default idle driver for x86 Trinabh Gupta ` (2 subsequent siblings) 4 siblings, 1 reply; 51+ messages in thread From: Trinabh Gupta @ 2011-03-22 12:32 UTC (permalink / raw) To: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak; +Cc: linux-kernel, sfr A cpuidle_driver structure represents a cpuidle driver like acpi_idle, intel_idle providing low level idle routines. A cpuidle_driver is global in nature as it provides routines for all the CPUS. Each CPU registered with the cpuidle subsystem is represented as a cpuidle_device. A cpuidle_device structure points to the low level idle routines for that CPU provided by a certain driver. In other words, a cpuidle driver creates a cpuidle_device structure for each CPU that it registers with the cpuidle subsystem. Whenever cpuidle idle loop is called, the cpuidle subsystem picks the cpuidle_device structure for that cpu and calls one of the low level idle routines through that structure. In the current design, only one cpuidle_driver may be registered and registration of any subsequent driver fails. The same registered driver provides low level idle routines for each cpuidle_device. A list based registration for cpuidle_driver provides a clean mechanism for multiple subsystems/modules to register their own idle routines and thus avoids using pm_idle. This patch implements a list based registration for cpuidle driver. Different drivers can be registered and the driver to be used is selected based on a per driver priority. On a cpuidle driver registration or unregistration cpuidle_device structure for each CPU is changed. Each cpuidle_device points to its driver to facilitate this registration and unregistration. --------- --------- |cpuidle| |cpuidle| |driver |----------------------- |driver | |default| |acpi | --------- --------- ^ ^ | | ---------------------------- --------------------------- ^ ^ ^ ^ | | | | | | | | --------- --------- --------- ---------- |cpuidle| |cpuidle| |cpuidle| |cpuidle | |device | |device | .... |device | |device | ..... | CPU0 | | CPU1 | |CPU0 | |CPU1 | --------- --------- --------- ---------- Signed-off-by: Trinabh Gupta <trinabh@linux.vnet.ibm.com> --- drivers/acpi/processor_idle.c | 2 + drivers/cpuidle/Kconfig | 6 ++ drivers/cpuidle/cpuidle.c | 46 ++++++++++++++--- drivers/cpuidle/driver.c | 114 ++++++++++++++++++++++++++++++++++++++--- include/linux/cpuidle.h | 3 + 5 files changed, 157 insertions(+), 14 deletions(-) diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c index d615b7d..65dde00 100644 --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -971,6 +971,7 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev, struct cpuidle_driver acpi_idle_driver = { .name = "acpi_idle", .owner = THIS_MODULE, + .priority = 10, }; /** @@ -992,6 +993,7 @@ static int acpi_processor_setup_cpuidle(struct acpi_processor *pr) } dev->cpu = pr->id; + dev->drv = &acpi_idle_driver; for (i = 0; i < CPUIDLE_STATE_MAX; i++) { dev->states[i].name[0] = '\0'; dev->states[i].desc[0] = '\0'; diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig index e67c258..8cc73c8 100644 --- a/drivers/cpuidle/Kconfig +++ b/drivers/cpuidle/Kconfig @@ -19,3 +19,9 @@ config CPU_IDLE_GOV_MENU bool depends on CPU_IDLE && NO_HZ default y + +config ARCH_USES_PMIDLE + bool + depends on CPUIDLE + depends on !X86 + default y diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 8baaa04..91ef1bf 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -116,6 +116,7 @@ void cpuidle_idle_call(void) cpuidle_curr_governor->reflect(dev); } +#ifdef CONFIG_ARCH_USES_PMIDLE /** * cpuidle_install_idle_handler - installs the cpuidle idle loop handler */ @@ -138,6 +139,10 @@ void cpuidle_uninstall_idle_handler(void) cpuidle_kick_cpus(); } } +#else +void cpuidle_install_idle_handler(void) {} +void cpuidle_uninstall_idle_handler(void) {} +#endif /** * cpuidle_pause_and_lock - temporarily disables CPUIDLE @@ -293,9 +298,21 @@ static int __cpuidle_register_device(struct cpuidle_device *dev) struct sys_device *sys_dev = get_cpu_sysdev((unsigned long)dev->cpu); struct cpuidle_driver *cpuidle_driver = cpuidle_get_driver(); + if (dev->registered) + return 0; if (!sys_dev) return -EINVAL; - if (!try_module_get(cpuidle_driver->owner)) + /* + * This will maintain compatibility with old cpuidle_device + * structure where driver pointer is not set. + * + * To Do: Change all call sites to set dev->drv pointer. + * This can be changed to BUG() later in case somebody + * registers without a driver pointer. + */ + if (!dev->drv) + dev->drv = cpuidle_driver; + if (!try_module_get(dev->drv->owner)) return -EINVAL; init_completion(&dev->kobj_unregister); @@ -320,10 +337,11 @@ static int __cpuidle_register_device(struct cpuidle_device *dev) dev->states[i].power_usage = -1 - i; } - per_cpu(cpuidle_devices, dev->cpu) = dev; + if (cpuidle_driver == dev->drv) + per_cpu(cpuidle_devices, dev->cpu) = dev; list_add(&dev->device_list, &cpuidle_detected_devices); if ((ret = cpuidle_add_sysfs(sys_dev))) { - module_put(cpuidle_driver->owner); + module_put(dev->drv->owner); return ret; } @@ -374,13 +392,18 @@ void cpuidle_unregister_device(struct cpuidle_device *dev) cpuidle_disable_device(dev); cpuidle_remove_sysfs(sys_dev); - list_del(&dev->device_list); wait_for_completion(&dev->kobj_unregister); - per_cpu(cpuidle_devices, dev->cpu) = NULL; + if (cpuidle_driver == dev->drv) + per_cpu(cpuidle_devices, dev->cpu) = NULL; + /* + * To Do: Ensure each CPU has exited the current + * idle routine which may belong to this module/driver. + * cpuidle_kick_cpus() equivalent is required. + */ cpuidle_resume_and_unlock(); - module_put(cpuidle_driver->owner); + module_put(dev->drv->owner); } EXPORT_SYMBOL_GPL(cpuidle_unregister_device); @@ -420,6 +443,15 @@ static inline void latency_notifier_init(struct notifier_block *n) #endif /* CONFIG_SMP */ +#ifdef CONFIG_ARCH_USES_PMIDLE +static inline void save_pm_idle(void) +{ + pm_idle_old = pm_idle; +} +#else +static inline void save_pm_idle(void) {} +#endif + /** * cpuidle_init - core initializer */ @@ -427,7 +459,7 @@ static int __init cpuidle_init(void) { int ret; - pm_idle_old = pm_idle; + save_pm_idle(); ret = cpuidle_add_class_sysfs(&cpu_sysdev_class); if (ret) diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c index fd1601e..c235286 100644 --- a/drivers/cpuidle/driver.c +++ b/drivers/cpuidle/driver.c @@ -14,8 +14,83 @@ #include "cpuidle.h" +#define MAX_PRIORITY 1000 +#define DEFAULT_PRIORITY 100 + static struct cpuidle_driver *cpuidle_curr_driver; DEFINE_SPINLOCK(cpuidle_driver_lock); +LIST_HEAD(registered_cpuidle_drivers); + +static struct cpuidle_driver *select_cpuidle_driver(void) +{ + struct cpuidle_driver *item = NULL, *selected = NULL; + unsigned int min_priority = MAX_PRIORITY; + + list_for_each_entry(item, ®istered_cpuidle_drivers, + driver_list) { + if (item->priority <= min_priority) { + selected = item; + min_priority = item->priority; + } + } + return selected; +} + +static void set_current_cpuidle_driver(struct cpuidle_driver *drv) +{ + struct cpuidle_device *item = NULL; + if (drv == cpuidle_curr_driver) + return; + + /* Unregister the previous drivers devices */ + /* Do we need to take cpuidle_lock ? {un}register_device takes lock*/ + if (cpuidle_curr_driver) { + list_for_each_entry(item, &cpuidle_detected_devices, + device_list) { + if (item->drv == cpuidle_curr_driver) + cpuidle_unregister_device(item); + } + } + + cpuidle_curr_driver = drv; + + if (drv == NULL) + return; + else { + /* Register the new driver devices */ + list_for_each_entry(item, &cpuidle_detected_devices, + device_list) { + if (item->drv == drv) + cpuidle_register_device(item); + } + } +} + +static inline int arch_allow_register(void) +{ +#ifdef CONFIG_ARCH_USES_PMIDLE + if (cpuidle_curr_driver) + return -EPERM; + else + return 0; +#else + return 0; +#endif +} + +static inline int arch_allow_unregister(struct cpuidle_driver *drv) +{ +#ifdef CONFIG_ARCH_USES_PMIDLE + if (drv != cpuidle_curr_driver) { + WARN(1, "invalid cpuidle_unregister_driver(%s)\n", + drv->name); + return -EPERM; + } else + return 0; +#else + return 0; +#endif +} /** * cpuidle_register_driver - registers a driver @@ -23,15 +98,29 @@ DEFINE_SPINLOCK(cpuidle_driver_lock); */ int cpuidle_register_driver(struct cpuidle_driver *drv) { + struct cpuidle_driver *item = NULL; if (!drv) return -EINVAL; + if (drv->priority == 0) + drv->priority = DEFAULT_PRIORITY; spin_lock(&cpuidle_driver_lock); - if (cpuidle_curr_driver) { + + if (arch_allow_register()) { spin_unlock(&cpuidle_driver_lock); return -EBUSY; } - cpuidle_curr_driver = drv; + + /* Check if driver already registered */ + list_for_each_entry(item, ®istered_cpuidle_drivers, driver_list) { + if (item == drv) { + spin_unlock(&cpuidle_driver_lock); + return -EINVAL; + } + } + + list_add(&drv->driver_list, ®istered_cpuidle_drivers); + set_current_cpuidle_driver(select_cpuidle_driver()); spin_unlock(&cpuidle_driver_lock); return 0; @@ -54,14 +143,25 @@ EXPORT_SYMBOL_GPL(cpuidle_get_driver); */ void cpuidle_unregister_driver(struct cpuidle_driver *drv) { - if (drv != cpuidle_curr_driver) { - WARN(1, "invalid cpuidle_unregister_driver(%s)\n", - drv->name); + struct cpuidle_device *item = NULL; + + if (arch_allow_unregister(drv)) return; - } spin_lock(&cpuidle_driver_lock); - cpuidle_curr_driver = NULL; + + /* Set some other driver as current */ + list_del(&drv->driver_list); + set_current_cpuidle_driver(select_cpuidle_driver()); + + /* Delete all devices corresponding to this driver */ + mutex_lock(&cpuidle_lock); + list_for_each_entry(item, &cpuidle_detected_devices, device_list) { + if (item->drv == drv) + list_del(&item->device_list); + } + mutex_unlock(&cpuidle_lock); + spin_unlock(&cpuidle_driver_lock); } diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h index 36719ea..a5f5fd0 100644 --- a/include/linux/cpuidle.h +++ b/include/linux/cpuidle.h @@ -90,6 +90,7 @@ struct cpuidle_device { struct cpuidle_state *last_state; struct list_head device_list; + struct cpuidle_driver *drv; struct kobject kobj; struct completion kobj_unregister; void *governor_data; @@ -119,6 +120,8 @@ static inline int cpuidle_get_last_residency(struct cpuidle_device *dev) struct cpuidle_driver { char name[CPUIDLE_NAME_LEN]; struct module *owner; + unsigned int priority; + struct list_head driver_list; }; #ifdef CONFIG_CPU_IDLE ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection 2011-03-22 12:32 ` [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection Trinabh Gupta @ 2011-03-23 2:59 ` Len Brown 2011-03-23 9:22 ` Trinabh Gupta 0 siblings, 1 reply; 51+ messages in thread From: Len Brown @ 2011-03-23 2:59 UTC (permalink / raw) To: Trinabh Gupta Cc: arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr the original cpuidle prototype supported multiple driver registration, but no production use for it could be imagined, and so it was deleted. Subsequently on x86, we added intel_idle to replace acpi_idle and a typical kernel will have them both built in. We still don't allow mutliple registrations, we just arrange affairs such that the preferred intel_idle probes before the backup, acpi_idle. If intel_idle recognizes the platform, its probe succeeds, else acpi_idle gets a go. If there is a problem with intel_idle, or a comparison needs to be made, a bootparam is available to tell intel_idle not to probe. This mechanism takes approximately 10 lines of code -- the bootparam to disable the preferred driver. What is the benefit of all the code to support the feature of run-time multiple driver registration and switching? thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection 2011-03-23 2:59 ` Len Brown @ 2011-03-23 9:22 ` Trinabh Gupta 2011-03-23 20:51 ` Len Brown 0 siblings, 1 reply; 51+ messages in thread From: Trinabh Gupta @ 2011-03-23 9:22 UTC (permalink / raw) To: Len Brown Cc: arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr, xen-devel Hi Len, The goal of the patch series is to remove exported pm_idle function pointer (see http://lkml.org/lkml/2009/8/28/43 and http://lkml.org/lkml/2009/8/28/50 for problems related to pm_idle). The first patch in the series removes pm_idle for x86 and we now directly call cpuidle_idle_call as suggested by Arjan (https://lkml.org/lkml/2010/10/19/453). But we also have to replace the functionality provided by pm_idle, i.e. call default_idle for platforms where no better idle routine exists, call mwait for pre-nehalem platforms, use intel_idle or acpi_idle for nehalem architectures etc. To manage all this we need a registration mechanism which is conveniently provided by cpuidle. In theory I agree that we can maybe do without list based registration i.e probe and pick the best for the platform, but things may become less predictable and difficult to manage as we have more and more platforms and drivers. By directly calling into cpuidle, we already have arch default other than intel_idle and acpi_idle. Then APM and xen (though it uses default_idle) also have their own idle routines. List based management and selection based on priority would provide a cleaner solution. Thanks, -Trinabh On 03/23/2011 08:29 AM, Len Brown wrote: > the original cpuidle prototype supported multiple driver registration, > but no production use for it could be imagined, and so it was deleted. > > Subsequently on x86, we added intel_idle to replace acpi_idle > and a typical kernel will have them both built in. > We still don't allow mutliple registrations, we just arrange > affairs such that the preferred intel_idle probes before > the backup, acpi_idle. If intel_idle recognizes the platform, > its probe succeeds, else acpi_idle gets a go. > If there is a problem with intel_idle, or a comparison needs to be made, > a bootparam is available to tell intel_idle not to probe. > > This mechanism takes approximately 10 lines of code -- the bootparam > to disable the preferred driver. > > What is the benefit of all the code to support the feature of run-time > multiple driver registration and switching? > > thanks, > Len Brown, Intel Open Source Technology Center > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection 2011-03-23 9:22 ` Trinabh Gupta @ 2011-03-23 20:51 ` Len Brown 2011-03-24 4:41 ` Len Brown 2011-03-24 14:13 ` Trinabh Gupta 0 siblings, 2 replies; 51+ messages in thread From: Len Brown @ 2011-03-23 20:51 UTC (permalink / raw) To: Trinabh Gupta Cc: arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr, xen-devel > The goal of the patch series is to remove exported pm_idle function > pointer (see http://lkml.org/lkml/2009/8/28/43 and > http://lkml.org/lkml/2009/8/28/50 for problems related to pm_idle). > The first patch in the series removes pm_idle for x86 and we > now directly call cpuidle_idle_call as suggested by Arjan > (https://lkml.org/lkml/2010/10/19/453). So the problem statement with "pm_idle" is that it is visible to modules and thus potentially racey and unsafe? Any reason we can't delete his line today to address most of the concern? EXPORT_SYMBOL(pm_idle); > But we also have to replace the functionality provided by pm_idle, > i.e. call default_idle for platforms where no better idle routine > exists, call mwait for pre-nehalem platforms, use intel_idle or > acpi_idle for nehalem architectures etc. To manage all this > we need a registration mechanism which is conveniently provided > by cpuidle. It isn't immediately clear to me that all of these options need to be preserved. Are we suggesting that x86 must always build with cpuidle? I'm sure that somebody someplace will object to that. OTOH, if cpuidle is included, I'd like to see the non-cpuidle code excluded, since nobody will run it... > In theory I agree that we can maybe do without list based > registration i.e probe and pick the best for the platform, but things > may become less predictable and difficult to manage as > we have more and more platforms and drivers. > By directly calling into cpuidle, we already have arch default > other than intel_idle and acpi_idle. Then APM and xen (though > it uses default_idle) also have their own idle routines. > List based management and selection based on priority would provide Does anybody actually use the latest kernel in APM mode? I'm not even sure the last version of Windows that would talk to APM, it was whatever was before Windows-95, I think. But don't get me wrong, I agree that pm_idle should go. I agree that cpuidle should have a default other than the polling loop it currently uses. I just don't think we should spend a lot of code and time preserving every conceivable option and feature. We should first do some spring cleaning to see if we can simplify the problem. thanks, -Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection 2011-03-23 20:51 ` Len Brown @ 2011-03-24 4:41 ` Len Brown 2011-03-24 14:13 ` Trinabh Gupta 1 sibling, 0 replies; 51+ messages in thread From: Len Brown @ 2011-03-24 4:41 UTC (permalink / raw) To: Trinabh Gupta Cc: arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr, xen-devel > Does anybody actually use the latest kernel in APM mode? Google tells me that Windows 2000 still supported APM, and it was still present in Windows XP APM was not present in Windows Vista, aka Windows 2006. MS dropped support in their latest OS 5 years ago. When will the latest Linux drop APM support? -Len ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection 2011-03-23 20:51 ` Len Brown 2011-03-24 4:41 ` Len Brown @ 2011-03-24 14:13 ` Trinabh Gupta 2011-03-24 16:52 ` Vaidyanathan Srinivasan 2011-03-25 7:05 ` Len Brown 1 sibling, 2 replies; 51+ messages in thread From: Trinabh Gupta @ 2011-03-24 14:13 UTC (permalink / raw) To: Len Brown, arjan, peterz Cc: suresh.b.siddha, benh, venki, ak, linux-kernel, sfr, xen-devel On 03/24/2011 02:21 AM, Len Brown wrote: >> The goal of the patch series is to remove exported pm_idle function >> pointer (see http://lkml.org/lkml/2009/8/28/43 and >> http://lkml.org/lkml/2009/8/28/50 for problems related to pm_idle). >> The first patch in the series removes pm_idle for x86 and we >> now directly call cpuidle_idle_call as suggested by Arjan >> (https://lkml.org/lkml/2010/10/19/453). > > So the problem statement with "pm_idle" is that it is visible to modules > and thus potentially racey and unsafe? > > Any reason we can't delete his line today to address most of the concern? Hi Len, I think there are other problems too, related to saving and restoring of pm_idle pointer. For example, cpuidle itself saves current value of pm_idle, flips it and then restores the saved value. There is no guarantee that the saved function still exists. APM does exact same thing (though it may not be used these days). The problem also is that a number of architectures have copied the same design based on pm_idle; so its spreading. > > EXPORT_SYMBOL(pm_idle); > >> But we also have to replace the functionality provided by pm_idle, >> i.e. call default_idle for platforms where no better idle routine >> exists, call mwait for pre-nehalem platforms, use intel_idle or >> acpi_idle for nehalem architectures etc. To manage all this >> we need a registration mechanism which is conveniently provided >> by cpuidle. > > It isn't immediately clear to me that all of these options > need to be preserved. So what do you suggest can be removed? > > Are we suggesting that x86 must always build with cpuidle? > I'm sure that somebody someplace will object to that. Arjan argued that since almost everyone today runs cpuidle it may be best to include it in the kernel (https://lkml.org/lkml/2010/10/20/243). But yes, we agreed that we would have to make cpuidle lighter incrementally. Making ladder governor optional could be one way for example. > > OTOH, if cpuidle is included, I'd like to see the > non-cpuidle code excluded, since nobody will run it... > >> In theory I agree that we can maybe do without list based >> registration i.e probe and pick the best for the platform, but things >> may become less predictable and difficult to manage as >> we have more and more platforms and drivers. >> By directly calling into cpuidle, we already have arch default >> other than intel_idle and acpi_idle. Then APM and xen (though >> it uses default_idle) also have their own idle routines. >> List based management and selection based on priority would provide > > Does anybody actually use the latest kernel in APM mode? > I'm not even sure the last version of Windows that would talk to APM, > it was whatever was before Windows-95, I think. > > But don't get me wrong, I agree that pm_idle should go. > I agree that cpuidle should have a default other than > the polling loop it currently uses. Sure, thanks -Trinabh ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection 2011-03-24 14:13 ` Trinabh Gupta @ 2011-03-24 16:52 ` Vaidyanathan Srinivasan 2011-03-25 7:13 ` Len Brown 2011-03-25 7:05 ` Len Brown 1 sibling, 1 reply; 51+ messages in thread From: Vaidyanathan Srinivasan @ 2011-03-24 16:52 UTC (permalink / raw) To: Trinabh Gupta Cc: Len Brown, arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr, xen-devel * Trinabh Gupta <trinabh@linux.vnet.ibm.com> [2011-03-24 19:43:43]: [snip] > >>But we also have to replace the functionality provided by pm_idle, > >>i.e. call default_idle for platforms where no better idle routine > >>exists, call mwait for pre-nehalem platforms, use intel_idle or > >>acpi_idle for nehalem architectures etc. To manage all this > >>we need a registration mechanism which is conveniently provided > >>by cpuidle. > > > >It isn't immediately clear to me that all of these options > >need to be preserved. > > So what do you suggest can be removed? Can we use safe_halt() until intel_idle/acpi_idle take over? But what if they do not take over? If safe_halt() is not very bad compared to the variants like mwait_idle and c1e_idle, then we can remove the old code and no need to move them to default driver. > >Are we suggesting that x86 must always build with cpuidle? > >I'm sure that somebody someplace will object to that. > > Arjan argued that since almost everyone today runs cpuidle > it may be best to include it in the kernel > (https://lkml.org/lkml/2010/10/20/243). But yes, we agreed > that we would have to make cpuidle lighter incrementally. > Making ladder governor optional could be one way for example. > > > >OTOH, if cpuidle is included, I'd like to see the > >non-cpuidle code excluded, since nobody will run it... The non-cpuidle code will be the select_idle_routine() and related function that cam move to default_driver that register to cpuidle. We can load on-demand as module if better routines fail to register. Maybe we don't need this at all as discussed in the above point? --Vaidy [snip] ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection 2011-03-24 16:52 ` Vaidyanathan Srinivasan @ 2011-03-25 7:13 ` Len Brown 0 siblings, 0 replies; 51+ messages in thread From: Len Brown @ 2011-03-25 7:13 UTC (permalink / raw) To: Vaidyanathan Srinivasan Cc: Trinabh Gupta, arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr, xen-devel > > So what do you suggest can be removed? > > Can we use safe_halt() until intel_idle/acpi_idle take over? But what > if they do not take over? If safe_halt() is not very bad compared to > the variants like mwait_idle and c1e_idle, then we can remove the old > code and no need to move them to default driver. One reason I'd like a default cpuidle driver is that today there is a race. cpuidle registers, but until its driver registers it will use polling. go ahead and look: grep . /sys/devices/system/cpu/cpu*/cpuidle/state0/usage that should be 0, but it isn't... > > >Are we suggesting that x86 must always build with cpuidle? > > >I'm sure that somebody someplace will object to that. > > > > Arjan argued that since almost everyone today runs cpuidle > > it may be best to include it in the kernel > > (https://lkml.org/lkml/2010/10/20/243). But yes, we agreed > > that we would have to make cpuidle lighter incrementally. > > Making ladder governor optional could be one way for example. > > > > > >OTOH, if cpuidle is included, I'd like to see the > > >non-cpuidle code excluded, since nobody will run it... > > The non-cpuidle code will be the select_idle_routine() and related > function that cam move to default_driver that register to cpuidle. > We can load on-demand as module if better routines fail to register. > Maybe we don't need this at all as discussed in the above point? Right, though I don't share your fascination with modules. cheers, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection 2011-03-24 14:13 ` Trinabh Gupta 2011-03-24 16:52 ` Vaidyanathan Srinivasan @ 2011-03-25 7:05 ` Len Brown 2011-03-25 15:35 ` [Xen-devel] " Konrad Rzeszutek Wilk 1 sibling, 1 reply; 51+ messages in thread From: Len Brown @ 2011-03-25 7:05 UTC (permalink / raw) To: Trinabh Gupta Cc: arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr, xen-devel > I think there are other problems too, related to saving and restoring > of pm_idle pointer. For example, cpuidle itself saves current value > of pm_idle, flips it and then restores the saved value. There is > no guarantee that the saved function still exists. APM does exact > same thing (though it may not be used these days). > > The problem also is that a number of architectures have copied the > same design based on pm_idle; so its spreading. pm_idle is a primitive design yes, but I think the issue with pm_idle is a theoretical one, at least on x86; as there isn't any other code scribbling on pm_idle in practice. So this is clean-up, rather than bug-fix work... > > It isn't immediately clear to me that all of these options > > need to be preserved. > > So what do you suggest can be removed? I sent a series of small patches yesterday to get the ball rolling... https://lkml.org/lkml/2011/3/24/54 I think the xen thing can go away. I proposed that APM be removed entirely, but that will take a few releases to conclude.... > > Are we suggesting that x86 must always build with cpuidle? > > I'm sure that somebody someplace will object to that. > > Arjan argued that since almost everyone today runs cpuidle > it may be best to include it in the kernel > (https://lkml.org/lkml/2010/10/20/243). But yes, we agreed > that we would have to make cpuidle lighter incrementally. > Making ladder governor optional could be one way for example. ladder is already optional. cheers, -Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [Xen-devel] Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection 2011-03-25 7:05 ` Len Brown @ 2011-03-25 15:35 ` Konrad Rzeszutek Wilk 2011-03-31 2:25 ` Len Brown 0 siblings, 1 reply; 51+ messages in thread From: Konrad Rzeszutek Wilk @ 2011-03-25 15:35 UTC (permalink / raw) To: Len Brown Cc: Trinabh Gupta, venki, ak, suresh.b.siddha, sfr, peterz, benh, linux-kernel, xen-devel, arjan On Fri, Mar 25, 2011 at 03:05:36AM -0400, Len Brown wrote: > > I think there are other problems too, related to saving and restoring > > of pm_idle pointer. For example, cpuidle itself saves current value > > of pm_idle, flips it and then restores the saved value. There is > > no guarantee that the saved function still exists. APM does exact > > same thing (though it may not be used these days). > > > > The problem also is that a number of architectures have copied the > > same design based on pm_idle; so its spreading. > > pm_idle is a primitive design yes, but I think the issue > with pm_idle is a theoretical one, at least on x86; > as there isn't any other code scribbling on pm_idle > in practice. So this is clean-up, rather than bug-fix work... > > > > It isn't immediately clear to me that all of these options > > > need to be preserved. > > > > So what do you suggest can be removed? > > I sent a series of small patches yesterday to get the ball rolling... > https://lkml.org/lkml/2011/3/24/54 > > I think the xen thing can go away. The xen thing being the setting of cpuidle to halt or the proposed patch? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [Xen-devel] Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection 2011-03-25 15:35 ` [Xen-devel] " Konrad Rzeszutek Wilk @ 2011-03-31 2:25 ` Len Brown 0 siblings, 0 replies; 51+ messages in thread From: Len Brown @ 2011-03-31 2:25 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Trinabh Gupta, venki, ak, suresh.b.siddha, sfr, peterz, benh, linux-kernel, xen-devel, arjan thanks, Len Brown, Intel Open Source Technology Center On Fri, 25 Mar 2011, Konrad Rzeszutek Wilk wrote: > On Fri, Mar 25, 2011 at 03:05:36AM -0400, Len Brown wrote: > > > I think there are other problems too, related to saving and restoring > > > of pm_idle pointer. For example, cpuidle itself saves current value > > > of pm_idle, flips it and then restores the saved value. There is > > > no guarantee that the saved function still exists. APM does exact > > > same thing (though it may not be used these days). > > > > > > The problem also is that a number of architectures have copied the > > > same design based on pm_idle; so its spreading. > > > > pm_idle is a primitive design yes, but I think the issue > > with pm_idle is a theoretical one, at least on x86; > > as there isn't any other code scribbling on pm_idle > > in practice. So this is clean-up, rather than bug-fix work... > > > > > > It isn't immediately clear to me that all of these options > > > > need to be preserved. > > > > > > So what do you suggest can be removed? > > > > I sent a series of small patches yesterday to get the ball rolling... > > https://lkml.org/lkml/2011/3/24/54 > > > > I think the xen thing can go away. > > The xen thing being the setting of cpuidle to halt or the proposed > patch? I don't think Xen needs a cpuidle driver. Xen just needs to tell the kernel to call HALT, and I think we'll keep that around for the non-cpuidle case, and for the idle periods before cpuidle initializes. cheers, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* [RFC PATCH V4 3/5] cpuidle: default idle driver for x86 2011-03-22 12:32 [RFC PATCH V4 0/5] cpuidle: Cleanup pm_idle and include driver/cpuidle.c in-kernel Trinabh Gupta 2011-03-22 12:32 ` [RFC PATCH V4 1/5] cpuidle: Remove pm_idle pointer for x86 Trinabh Gupta 2011-03-22 12:32 ` [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection Trinabh Gupta @ 2011-03-22 12:33 ` Trinabh Gupta 2011-03-23 3:13 ` Len Brown 2011-03-22 12:33 ` [RFC PATCH V4 4/5] cpuidle: driver for xen Trinabh Gupta 2011-03-22 12:33 ` [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm Trinabh Gupta 4 siblings, 1 reply; 51+ messages in thread From: Trinabh Gupta @ 2011-03-22 12:33 UTC (permalink / raw) To: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak; +Cc: linux-kernel, sfr This default cpuidle_driver parses idle= boot parameters, selects the optimal idle routine for x86 during bootup and registers with cpuidle. The code for idle routines and the selection of optimal routine is moved from arch/x86/kernel/process.c . At module_init this default driver is registered with cpuidle and for non ACPI platforms it continues to be used. For ACPI platforms, acpi_idle driver would replace this driver at a later point in time during bootup. Until this driver's registration, architecture supplied compile time default idle routine is called from within cpuidle_idle_call(). Signed-off-by: Trinabh Gupta <trinabh@linux.vnet.ibm.com> --- arch/x86/Kconfig | 2 arch/x86/kernel/process.c | 339 -------------------------------- drivers/cpuidle/cpuidle.c | 2 drivers/idle/Makefile | 1 drivers/idle/default_driver.c | 437 +++++++++++++++++++++++++++++++++++++++++ 5 files changed, 440 insertions(+), 341 deletions(-) create mode 100644 drivers/idle/default_driver.c diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index d5ed94d..6c03c92 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -148,7 +148,7 @@ config RWSEM_XCHGADD_ALGORITHM def_bool X86_XADD config ARCH_HAS_CPU_IDLE_WAIT - def_bool y + def_bool n config GENERIC_CALIBRATE_DELAY def_bool y diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index ff45541..27950dc 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -7,7 +7,6 @@ #include <linux/sched.h> #include <linux/module.h> #include <linux/pm.h> -#include <linux/clockchips.h> #include <linux/random.h> #include <linux/user-return-notifier.h> #include <linux/dmi.h> @@ -330,344 +329,6 @@ long sys_execve(const char __user *name, return error; } -/* - * Idle related variables and functions - */ -unsigned long boot_option_idle_override = IDLE_NO_OVERRIDE; -EXPORT_SYMBOL(boot_option_idle_override); - -/* - * Powermanagement idle function, if any.. - */ -void (*pm_idle)(void); -EXPORT_SYMBOL(pm_idle); - -#ifdef CONFIG_X86_32 -/* - * This halt magic was a workaround for ancient floppy DMA - * wreckage. It should be safe to remove. - */ -static int hlt_counter; -void disable_hlt(void) -{ - hlt_counter++; -} -EXPORT_SYMBOL(disable_hlt); - -void enable_hlt(void) -{ - hlt_counter--; -} -EXPORT_SYMBOL(enable_hlt); - -static inline int hlt_use_halt(void) -{ - return (!hlt_counter && boot_cpu_data.hlt_works_ok); -} -#else -static inline int hlt_use_halt(void) -{ - return 1; -} -#endif - -/* - * We use this if we don't have any better - * idle routine.. - */ -void default_idle(void) -{ - if (hlt_use_halt()) { - trace_power_start(POWER_CSTATE, 1, smp_processor_id()); - trace_cpu_idle(1, smp_processor_id()); - current_thread_info()->status &= ~TS_POLLING; - /* - * TS_POLLING-cleared state must be visible before we - * test NEED_RESCHED: - */ - smp_mb(); - - if (!need_resched()) - safe_halt(); /* enables interrupts racelessly */ - else - local_irq_enable(); - current_thread_info()->status |= TS_POLLING; - trace_power_end(smp_processor_id()); - trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id()); - } else { - local_irq_enable(); - /* loop is done by the caller */ - cpu_relax(); - } -} -#ifdef CONFIG_APM_MODULE -EXPORT_SYMBOL(default_idle); -#endif - -void stop_this_cpu(void *dummy) -{ - local_irq_disable(); - /* - * Remove this CPU: - */ - set_cpu_online(smp_processor_id(), false); - disable_local_APIC(); - - for (;;) { - if (hlt_works(smp_processor_id())) - halt(); - } -} - -static void do_nothing(void *unused) -{ -} - -/* - * cpu_idle_wait - Used to ensure that all the CPUs discard old value of - * pm_idle and update to new pm_idle value. Required while changing pm_idle - * handler on SMP systems. - * - * Caller must have changed pm_idle to the new value before the call. Old - * pm_idle value will not be used by any CPU after the return of this function. - */ -void cpu_idle_wait(void) -{ - smp_mb(); - /* kick all the CPUs so that they exit out of pm_idle */ - smp_call_function(do_nothing, NULL, 1); -} -EXPORT_SYMBOL_GPL(cpu_idle_wait); - -/* - * This uses new MONITOR/MWAIT instructions on P4 processors with PNI, - * which can obviate IPI to trigger checking of need_resched. - * We execute MONITOR against need_resched and enter optimized wait state - * through MWAIT. Whenever someone changes need_resched, we would be woken - * up from MWAIT (without an IPI). - * - * New with Core Duo processors, MWAIT can take some hints based on CPU - * capability. - */ -void mwait_idle_with_hints(unsigned long ax, unsigned long cx) -{ - if (!need_resched()) { - if (cpu_has(__this_cpu_ptr(&cpu_info), X86_FEATURE_CLFLUSH_MONITOR)) - clflush((void *)¤t_thread_info()->flags); - - __monitor((void *)¤t_thread_info()->flags, 0, 0); - smp_mb(); - if (!need_resched()) - __mwait(ax, cx); - } -} - -/* Default MONITOR/MWAIT with no hints, used for default C1 state */ -static void mwait_idle(void) -{ - if (!need_resched()) { - trace_power_start(POWER_CSTATE, 1, smp_processor_id()); - trace_cpu_idle(1, smp_processor_id()); - if (cpu_has(__this_cpu_ptr(&cpu_info), X86_FEATURE_CLFLUSH_MONITOR)) - clflush((void *)¤t_thread_info()->flags); - - __monitor((void *)¤t_thread_info()->flags, 0, 0); - smp_mb(); - if (!need_resched()) - __sti_mwait(0, 0); - else - local_irq_enable(); - trace_power_end(smp_processor_id()); - trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id()); - } else - local_irq_enable(); -} - -/* - * On SMP it's slightly faster (but much more power-consuming!) - * to poll the ->work.need_resched flag instead of waiting for the - * cross-CPU IPI to arrive. Use this option with caution. - */ -static void poll_idle(void) -{ - trace_power_start(POWER_CSTATE, 0, smp_processor_id()); - trace_cpu_idle(0, smp_processor_id()); - local_irq_enable(); - while (!need_resched()) - cpu_relax(); - trace_power_end(smp_processor_id()); - trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id()); -} - -/* - * mwait selection logic: - * - * It depends on the CPU. For AMD CPUs that support MWAIT this is - * wrong. Family 0x10 and 0x11 CPUs will enter C1 on HLT. Powersavings - * then depend on a clock divisor and current Pstate of the core. If - * all cores of a processor are in halt state (C1) the processor can - * enter the C1E (C1 enhanced) state. If mwait is used this will never - * happen. - * - * idle=mwait overrides this decision and forces the usage of mwait. - */ - -#define MWAIT_INFO 0x05 -#define MWAIT_ECX_EXTENDED_INFO 0x01 -#define MWAIT_EDX_C1 0xf0 - -int mwait_usable(const struct cpuinfo_x86 *c) -{ - u32 eax, ebx, ecx, edx; - - if (boot_option_idle_override == IDLE_FORCE_MWAIT) - return 1; - - if (c->cpuid_level < MWAIT_INFO) - return 0; - - cpuid(MWAIT_INFO, &eax, &ebx, &ecx, &edx); - /* Check, whether EDX has extended info about MWAIT */ - if (!(ecx & MWAIT_ECX_EXTENDED_INFO)) - return 1; - - /* - * edx enumeratios MONITOR/MWAIT extensions. Check, whether - * C1 supports MWAIT - */ - return (edx & MWAIT_EDX_C1); -} - -bool c1e_detected; -EXPORT_SYMBOL(c1e_detected); - -static cpumask_var_t c1e_mask; - -void c1e_remove_cpu(int cpu) -{ - if (c1e_mask != NULL) - cpumask_clear_cpu(cpu, c1e_mask); -} - -/* - * C1E aware idle routine. We check for C1E active in the interrupt - * pending message MSR. If we detect C1E, then we handle it the same - * way as C3 power states (local apic timer and TSC stop) - */ -static void c1e_idle(void) -{ - if (need_resched()) - return; - - if (!c1e_detected) { - u32 lo, hi; - - rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); - - if (lo & K8_INTP_C1E_ACTIVE_MASK) { - c1e_detected = true; - if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC)) - mark_tsc_unstable("TSC halt in AMD C1E"); - printk(KERN_INFO "System has AMD C1E enabled\n"); - } - } - - if (c1e_detected) { - int cpu = smp_processor_id(); - - if (!cpumask_test_cpu(cpu, c1e_mask)) { - cpumask_set_cpu(cpu, c1e_mask); - /* - * Force broadcast so ACPI can not interfere. - */ - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_FORCE, - &cpu); - printk(KERN_INFO "Switch to broadcast mode on CPU%d\n", - cpu); - } - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); - - default_idle(); - - /* - * The switch back from broadcast mode needs to be - * called with interrupts disabled. - */ - local_irq_disable(); - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); - local_irq_enable(); - } else - default_idle(); -} - -void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c) -{ -#ifdef CONFIG_SMP - if (pm_idle == poll_idle && smp_num_siblings > 1) { - printk_once(KERN_WARNING "WARNING: polling idle and HT enabled," - " performance may degrade.\n"); - } -#endif - if (pm_idle) - return; - - if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) { - /* - * One CPU supports mwait => All CPUs supports mwait - */ - printk(KERN_INFO "using mwait in idle threads.\n"); - pm_idle = mwait_idle; - } else if (cpu_has_amd_erratum(amd_erratum_400)) { - /* E400: APIC timer interrupt does not wake up CPU from C1e */ - printk(KERN_INFO "using C1E aware idle routine\n"); - pm_idle = c1e_idle; - } else - pm_idle = default_idle; -} - -void __init init_c1e_mask(void) -{ - /* If we're using c1e_idle, we need to allocate c1e_mask. */ - if (pm_idle == c1e_idle) - zalloc_cpumask_var(&c1e_mask, GFP_KERNEL); -} - -static int __init idle_setup(char *str) -{ - if (!str) - return -EINVAL; - - if (!strcmp(str, "poll")) { - printk("using polling idle threads.\n"); - pm_idle = poll_idle; - boot_option_idle_override = IDLE_POLL; - } else if (!strcmp(str, "mwait")) { - boot_option_idle_override = IDLE_FORCE_MWAIT; - } else if (!strcmp(str, "halt")) { - /* - * When the boot option of idle=halt is added, halt is - * forced to be used for CPU idle. In such case CPU C2/C3 - * won't be used again. - * To continue to load the CPU idle driver, don't touch - * the boot_option_idle_override. - */ - pm_idle = default_idle; - boot_option_idle_override = IDLE_HALT; - } else if (!strcmp(str, "nomwait")) { - /* - * If the boot option of "idle=nomwait" is added, - * it means that mwait will be disabled for CPU C2/C3 - * states. In such case it won't touch the variable - * of boot_option_idle_override. - */ - boot_option_idle_override = IDLE_NOMWAIT; - } else - return -1; - - return 0; -} -early_param("idle", idle_setup); - unsigned long arch_align_stack(unsigned long sp) { if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 91ef1bf..7486e0f 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -34,7 +34,7 @@ static void cpuidle_kick_cpus(void) { cpu_idle_wait(); } -#elif defined(CONFIG_SMP) +#elif defined(CONFIG_SMP) && defined(CONFIG_ARCH_USES_PMIDLE) # error "Arch needs cpu_idle_wait() equivalent here" #else /* !CONFIG_ARCH_HAS_CPU_IDLE_WAIT && !CONFIG_SMP */ static void cpuidle_kick_cpus(void) {} diff --git a/drivers/idle/Makefile b/drivers/idle/Makefile index 23d295c..f16e866 100644 --- a/drivers/idle/Makefile +++ b/drivers/idle/Makefile @@ -1,3 +1,4 @@ obj-$(CONFIG_I7300_IDLE) += i7300_idle.o obj-$(CONFIG_INTEL_IDLE) += intel_idle.o +obj-$(CONFIG_X86) += default_driver.o diff --git a/drivers/idle/default_driver.c b/drivers/idle/default_driver.c new file mode 100644 index 0000000..21de286 --- /dev/null +++ b/drivers/idle/default_driver.c @@ -0,0 +1,437 @@ +#include <linux/kernel.h> +#include <linux/errno.h> +#include <linux/sched.h> +#include <linux/cpuidle.h> +#include <linux/clockchips.h> +#include <linux/slab.h> +#include <trace/events/power.h> +#include <asm/mwait.h> + + +/* + * Idle related variables and functions + */ +unsigned long boot_option_idle_override = IDLE_NO_OVERRIDE; +EXPORT_SYMBOL(boot_option_idle_override); + +static struct cpuidle_state *opt_state; + +#ifdef CONFIG_X86_32 +/* + * This halt magic was a workaround for ancient floppy DMA + * wreckage. It should be safe to remove. + */ +static int hlt_counter; +void disable_hlt(void) +{ + hlt_counter++; +} +EXPORT_SYMBOL(disable_hlt); + +void enable_hlt(void) +{ + hlt_counter--; +} +EXPORT_SYMBOL(enable_hlt); + +static inline int hlt_use_halt(void) +{ + return (!hlt_counter && boot_cpu_data.hlt_works_ok); +} +#else +static inline int hlt_use_halt(void) +{ + return 1; +} +#endif + +/* + * We use this if we don't have any better + * idle routine.. + */ +void default_idle(void) +{ + if (hlt_use_halt()) { + trace_power_start(POWER_CSTATE, 1, smp_processor_id()); + trace_cpu_idle(1, smp_processor_id()); + current_thread_info()->status &= ~TS_POLLING; + /* + * TS_POLLING-cleared state must be visible before we + * test NEED_RESCHED: + */ + smp_mb(); + + if (!need_resched()) + safe_halt(); /* enables interrupts racelessly */ + else + local_irq_enable(); + current_thread_info()->status |= TS_POLLING; + trace_power_end(smp_processor_id()); + trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id()); + } else { + local_irq_enable(); + /* loop is done by the caller */ + cpu_relax(); + } +} +#ifdef CONFIG_APM_MODULE +EXPORT_SYMBOL(default_idle); +#endif + +void stop_this_cpu(void *dummy) +{ + local_irq_disable(); + /* + * Remove this CPU: + */ + set_cpu_online(smp_processor_id(), false); + disable_local_APIC(); + + for (;;) { + if (hlt_works(smp_processor_id())) + halt(); + } +} + +/* + * This uses new MONITOR/MWAIT instructions on P4 processors with PNI, + * which can obviate IPI to trigger checking of need_resched. + * We execute MONITOR against need_resched and enter optimized wait state + * through MWAIT. Whenever someone changes need_resched, we would be woken + * up from MWAIT (without an IPI). + * + * New with Core Duo processors, MWAIT can take some hints based on CPU + * capability. + */ +void mwait_idle_with_hints(unsigned long ax, unsigned long cx) +{ + if (!need_resched()) { + if (cpu_has(__this_cpu_ptr(&cpu_info), X86_FEATURE_CLFLUSH_MONITOR)) + clflush((void *)¤t_thread_info()->flags); + + __monitor((void *)¤t_thread_info()->flags, 0, 0); + smp_mb(); + if (!need_resched()) + __mwait(ax, cx); + } +} + +/* Default MONITOR/MWAIT with no hints, used for default C1 state */ +static void mwait_idle(void) +{ + if (!need_resched()) { + trace_power_start(POWER_CSTATE, 1, smp_processor_id()); + trace_cpu_idle(1, smp_processor_id()); + if (cpu_has(__this_cpu_ptr(&cpu_info), X86_FEATURE_CLFLUSH_MONITOR)) + clflush((void *)¤t_thread_info()->flags); + + __monitor((void *)¤t_thread_info()->flags, 0, 0); + smp_mb(); + if (!need_resched()) + __sti_mwait(0, 0); + else + local_irq_enable(); + trace_power_end(smp_processor_id()); + trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id()); + } else + local_irq_enable(); +} + +/* + * On SMP it's slightly faster (but much more power-consuming!) + * to poll the ->work.need_resched flag instead of waiting for the + * cross-CPU IPI to arrive. Use this option with caution. + */ +static void poll_idle(void) +{ + trace_power_start(POWER_CSTATE, 0, smp_processor_id()); + trace_cpu_idle(0, smp_processor_id()); + local_irq_enable(); + while (!need_resched()) + cpu_relax(); + trace_power_end(smp_processor_id()); + trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id()); +} + +/* + * mwait selection logic: + * + * It depends on the CPU. For AMD CPUs that support MWAIT this is + * wrong. Family 0x10 and 0x11 CPUs will enter C1 on HLT. Powersavings + * then depend on a clock divisor and current Pstate of the core. If + * all cores of a processor are in halt state (C1) the processor can + * enter the C1E (C1 enhanced) state. If mwait is used this will never + * happen. + * + * idle=mwait overrides this decision and forces the usage of mwait. + */ + +#define MWAIT_INFO 0x05 +#define MWAIT_ECX_EXTENDED_INFO 0x01 +#define MWAIT_EDX_C1 0xf0 + +int mwait_usable(const struct cpuinfo_x86 *c) +{ + u32 eax, ebx, ecx, edx; + + if (boot_option_idle_override == IDLE_FORCE_MWAIT) + return 1; + + if (c->cpuid_level < MWAIT_INFO) + return 0; + + cpuid(MWAIT_INFO, &eax, &ebx, &ecx, &edx); + /* Check, whether EDX has extended info about MWAIT */ + if (!(ecx & MWAIT_ECX_EXTENDED_INFO)) + return 1; + + /* + * edx enumeratios MONITOR/MWAIT extensions. Check, whether + * C1 supports MWAIT + */ + return (edx & MWAIT_EDX_C1); +} + +bool c1e_detected; +EXPORT_SYMBOL(c1e_detected); + +static cpumask_var_t c1e_mask; + +void c1e_remove_cpu(int cpu) +{ + if (c1e_mask != NULL) + cpumask_clear_cpu(cpu, c1e_mask); +} + +/* + * C1E aware idle routine. We check for C1E active in the interrupt + * pending message MSR. If we detect C1E, then we handle it the same + * way as C3 power states (local apic timer and TSC stop) + */ +static void c1e_idle(void) +{ + if (need_resched()) + return; + + if (!c1e_detected) { + u32 lo, hi; + + rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); + + if (lo & K8_INTP_C1E_ACTIVE_MASK) { + c1e_detected = true; + if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC)) + mark_tsc_unstable("TSC halt in AMD C1E"); + printk(KERN_INFO "System has AMD C1E enabled\n"); + } + } + + if (c1e_detected) { + int cpu = smp_processor_id(); + + if (!cpumask_test_cpu(cpu, c1e_mask)) { + cpumask_set_cpu(cpu, c1e_mask); + /* + * Force broadcast so ACPI can not interfere. + */ + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_FORCE, + &cpu); + printk(KERN_INFO "Switch to broadcast mode on CPU%d\n", + cpu); + } + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); + + default_idle(); + + /* + * The switch back from broadcast mode needs to be + * called with interrupts disabled. + */ + local_irq_disable(); + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); + local_irq_enable(); + } else + default_idle(); +} + +static int poll_idle_wrapper(struct cpuidle_device *dev, + struct cpuidle_state *state) +{ + poll_idle(); + return 0; +} + +static int mwait_idle_wrapper(struct cpuidle_device *dev, + struct cpuidle_state *state) +{ + mwait_idle(); + return 0; +} + +static int c1e_idle_wrapper(struct cpuidle_device *dev, + struct cpuidle_state *state) +{ + c1e_idle(); + return 0; +} + +int default_idle_wrapper(struct cpuidle_device *dev, + struct cpuidle_state *state) +{ + default_idle(); + return 0; +} + +static struct cpuidle_state state_poll = { + .name = "POLL", + .desc = "POLL", + .driver_data = (void *) 0x00, + .flags = CPUIDLE_FLAG_TIME_VALID, + .exit_latency = 1, + .target_residency = 1, + .enter = &poll_idle_wrapper, +}; + +static struct cpuidle_state state_mwait = { + .name = "C1", + .desc = "MWAIT No Hints", + .driver_data = (void *) 0x01, + .flags = CPUIDLE_FLAG_TIME_VALID, + .exit_latency = 1, + .target_residency = 1, + .enter = &mwait_idle_wrapper, +}; + +static struct cpuidle_state state_c1e = { + .name = "C1E", + .desc = "C1E", + .driver_data = (void *) 0x02, + .flags = CPUIDLE_FLAG_TIME_VALID, + .exit_latency = 1, + .target_residency = 1, + .enter = &c1e_idle_wrapper, +}; + +struct cpuidle_state state_default_idle = { + .name = "DEFAULT-IDLE", + .desc = "Default idle routine", + .driver_data = (void *) 0x03, + .flags = CPUIDLE_FLAG_TIME_VALID, + .exit_latency = 1, + .target_residency = 1, + .enter = &default_idle_wrapper, +}; + +void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c) +{ +#ifdef CONFIG_SMP + if (opt_state == &state_poll && smp_num_siblings > 1) { + printk_once(KERN_WARNING "WARNING: polling idle and HT enabled," + " performance may degrade.\n"); + } +#endif + if (opt_state) + return; + + if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) { + /* + * One CPU supports mwait => All CPUs supports mwait + */ + printk(KERN_INFO "using mwait in idle threads.\n"); + opt_state = &state_mwait; + } else if (cpu_has_amd_erratum(amd_erratum_400)) { + /* E400: APIC timer interrupt does not wake up CPU from C1e */ + printk(KERN_INFO "using C1E aware idle routine\n"); + opt_state = &state_c1e; + } else + opt_state = &state_default_idle; +} + +void __init init_c1e_mask(void) +{ + /* If we're using c1e_idle, we need to allocate c1e_mask. */ + if (opt_state == &state_c1e) + zalloc_cpumask_var(&c1e_mask, GFP_KERNEL); +} + +static int __init idle_setup(char *str) +{ + if (!str) + return -EINVAL; + + if (!strcmp(str, "poll")) { + printk("using polling idle threads.\n"); + opt_state = &state_poll; + boot_option_idle_override = IDLE_POLL; + } else if (!strcmp(str, "mwait")) { + boot_option_idle_override = IDLE_FORCE_MWAIT; + } else if (!strcmp(str, "halt")) { + /* + * When the boot option of idle=halt is added, halt is + * forced to be used for CPU idle. In such case CPU C2/C3 + * won't be used again. + * To continue to load the CPU idle driver, don't touch + * the boot_option_idle_override. + */ + opt_state = &state_default_idle; + boot_option_idle_override = IDLE_HALT; + } else if (!strcmp(str, "nomwait")) { + /* + * If the boot option of "idle=nomwait" is added, + * it means that mwait will be disabled for CPU C2/C3 + * states. In such case it won't touch the variable + * of boot_option_idle_override. + */ + boot_option_idle_override = IDLE_NOMWAIT; + } else + return -1; + + return 0; +} +early_param("idle", idle_setup); + +static struct cpuidle_driver default_idle_driver = { + .name = "default_idle", + .owner = THIS_MODULE, + .priority = 100, +}; + +static int setup_cpuidle(int cpu) +{ + struct cpuidle_device *dev = kzalloc(sizeof(struct cpuidle_device), + GFP_KERNEL); + int count = CPUIDLE_DRIVER_STATE_START; + dev->cpu = cpu; + dev->drv = &default_idle_driver; + + BUG_ON(opt_state == NULL); + dev->states[count] = *opt_state; + count++; + + dev->state_count = count; + + if (cpuidle_register_device(dev)) + return -EIO; + return 0; +} + +static int __init default_idle_init(void) +{ + int retval, i; + retval = cpuidle_register_driver(&default_idle_driver); + + for_each_online_cpu(i) { + setup_cpuidle(i); + } + + return 0; +} + + +static void __exit default_idle_exit(void) +{ + cpuidle_unregister_driver(&default_idle_driver); + return; +} + +device_initcall(default_idle_init); ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 3/5] cpuidle: default idle driver for x86 2011-03-22 12:33 ` [RFC PATCH V4 3/5] cpuidle: default idle driver for x86 Trinabh Gupta @ 2011-03-23 3:13 ` Len Brown 2011-03-23 9:31 ` Trinabh Gupta 0 siblings, 1 reply; 51+ messages in thread From: Len Brown @ 2011-03-23 3:13 UTC (permalink / raw) To: Trinabh Gupta Cc: arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr Why is this patch a step forward? > +obj-$(CONFIG_X86) += default_driver.o BTW, that's a pretty generic name for an x86 specific idle driver... I think that on builds that support intel_idle and acpi_idle, everything in this file will be unused, unless somebody uses some debugging cmdline params that should have been deleted ages ago. thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 3/5] cpuidle: default idle driver for x86 2011-03-23 3:13 ` Len Brown @ 2011-03-23 9:31 ` Trinabh Gupta 2011-03-24 16:32 ` Vaidyanathan Srinivasan 0 siblings, 1 reply; 51+ messages in thread From: Trinabh Gupta @ 2011-03-23 9:31 UTC (permalink / raw) To: Len Brown Cc: arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr, xen-devel On 03/23/2011 08:43 AM, Len Brown wrote: > Why is this patch a step forward? Hi Len, I have basically moved the code for arch default and mwait idle from arch/x86/kernel/process.c to a driver. This was suggested by Venki (https://lkml.org/lkml/2010/10/19/460) as part of pm_idle cleanup and direct call of cpuidle_idle_call(). There is not much new code here. > >> +obj-$(CONFIG_X86) += default_driver.o > > BTW, that's a pretty generic name for an x86 specific idle driver... > > I think that on builds that support intel_idle and acpi_idle, > everything in this file will be unused, unless somebody uses some > debugging cmdline params that should have been deleted ages ago. Yes, I agree that the name has to be x86 specific. I think the routines would be used for pre-nehalem architectures that use arch default or mwait. Thanks, -Trinabh > > thanks, > Len Brown, Intel Open Source Technology Center > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 3/5] cpuidle: default idle driver for x86 2011-03-23 9:31 ` Trinabh Gupta @ 2011-03-24 16:32 ` Vaidyanathan Srinivasan 0 siblings, 0 replies; 51+ messages in thread From: Vaidyanathan Srinivasan @ 2011-03-24 16:32 UTC (permalink / raw) To: Trinabh Gupta Cc: Len Brown, arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr, xen-devel * Trinabh Gupta <trinabh@linux.vnet.ibm.com> [2011-03-23 15:01:14]: > > On 03/23/2011 08:43 AM, Len Brown wrote: > >Why is this patch a step forward? > > Hi Len, > > I have basically moved the code for arch default and mwait > idle from arch/x86/kernel/process.c to a driver. This was > suggested by Venki (https://lkml.org/lkml/2010/10/19/460) > as part of pm_idle cleanup and direct call of > cpuidle_idle_call(). There is not much new code here. > > > > >>+obj-$(CONFIG_X86) += default_driver.o > > > >BTW, that's a pretty generic name for an x86 specific idle driver... > > > >I think that on builds that support intel_idle and acpi_idle, > >everything in this file will be unused, unless somebody uses some > >debugging cmdline params that should have been deleted ages ago. > > Yes, I agree that the name has to be x86 specific. I think the > routines would be used for pre-nehalem architectures that use > arch default or mwait. Mainly selection between default_idle (safe_halt), mwait_idle and c1e_idle needs to be placed in a default driver. This is the code that was 'outside' of cpuidle framework and directly used pm_idle(). This is mostly unused and overridden by intel_idle or acpi_idle, but still cannot be discarded. Maybe keep this as a module and probe/load only if both intel_idle and acpi_idle failed to load or excluded by command line or otherwise. --Vaidy ^ permalink raw reply [flat|nested] 51+ messages in thread
* [RFC PATCH V4 4/5] cpuidle: driver for xen 2011-03-22 12:32 [RFC PATCH V4 0/5] cpuidle: Cleanup pm_idle and include driver/cpuidle.c in-kernel Trinabh Gupta ` (2 preceding siblings ...) 2011-03-22 12:33 ` [RFC PATCH V4 3/5] cpuidle: default idle driver for x86 Trinabh Gupta @ 2011-03-22 12:33 ` Trinabh Gupta 2011-03-22 14:50 ` Konrad Rzeszutek Wilk 2011-03-22 12:33 ` [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm Trinabh Gupta 4 siblings, 1 reply; 51+ messages in thread From: Trinabh Gupta @ 2011-03-22 12:33 UTC (permalink / raw) To: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak; +Cc: linux-kernel, sfr This patch implements a default cpuidle driver for xen. Earlier pm_idle was flipped to default_idle. Maybe there is a better way to ensure default_idle is called without using this cpuidle driver. Signed-off-by: Trinabh Gupta <trinabh@linux.vnet.ibm.com> --- arch/x86/xen/setup.c | 42 +++++++++++++++++++++++++++++++++++++++++- 1 files changed, 41 insertions(+), 1 deletions(-) diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c index a8a66a5..4fce4cd 100644 --- a/arch/x86/xen/setup.c +++ b/arch/x86/xen/setup.c @@ -9,6 +9,8 @@ #include <linux/mm.h> #include <linux/pm.h> #include <linux/memblock.h> +#include <linux/cpuidle.h> +#include <linux/slab.h> #include <asm/elf.h> #include <asm/vdso.h> @@ -321,6 +323,44 @@ void __cpuinit xen_enable_syscall(void) #endif /* CONFIG_X86_64 */ } +static struct cpuidle_driver xen_idle_driver = { + .name = "xen_idle", + .owner = THIS_MODULE, + .priority = 1, +}; + +extern struct cpuidle_state state_default_state; + +static int setup_cpuidle(int cpu) +{ + struct cpuidle_device *dev = kzalloc(sizeof(struct cpuidle_device), + GFP_KERNEL); + int count = CPUIDLE_DRIVER_STATE_START; + dev->cpu = cpu; + dev->drv = &xen_idle_driver; + + dev->states[count] = state_default_idle; + count++; + + dev->state_count = count; + + if (cpuidle_register_device(dev)) + return -EIO; + return 0; +} + +static int xen_idle_init(void) +{ + int retval, i; + retval = cpuidle_register_driver(&xen_idle_driver); + + for_each_online_cpu(i) { + setup_cpuidle(i); + } + + return 0; +} + void __init xen_arch_setup(void) { xen_panic_handler_init(); @@ -354,7 +394,7 @@ void __init xen_arch_setup(void) #ifdef CONFIG_X86_32 boot_cpu_data.hlt_works_ok = 1; #endif - pm_idle = default_idle; + xen_idle_init(); boot_option_idle_override = IDLE_HALT; fiddle_vdso(); ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 4/5] cpuidle: driver for xen 2011-03-22 12:33 ` [RFC PATCH V4 4/5] cpuidle: driver for xen Trinabh Gupta @ 2011-03-22 14:50 ` Konrad Rzeszutek Wilk 2011-03-23 9:57 ` Trinabh Gupta 0 siblings, 1 reply; 51+ messages in thread From: Konrad Rzeszutek Wilk @ 2011-03-22 14:50 UTC (permalink / raw) To: Trinabh Gupta Cc: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr, xen-devel On Tue, Mar 22, 2011 at 06:03:28PM +0530, Trinabh Gupta wrote: > This patch implements a default cpuidle driver for xen. > Earlier pm_idle was flipped to default_idle. Maybe there > is a better way to ensure default_idle is called > without using this cpuidle driver. Please also CC the Xen devel mailing list (I did this for you) I couldn't find it in the description, but I was wondering what is that you are trying to fix? What is the problem description? Two, you mention in your description that it was tested on x86 systems. did you test this with Xen? If so, what version. > > Signed-off-by: Trinabh Gupta <trinabh@linux.vnet.ibm.com> > --- > > arch/x86/xen/setup.c | 42 +++++++++++++++++++++++++++++++++++++++++- > 1 files changed, 41 insertions(+), 1 deletions(-) > > diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c > index a8a66a5..4fce4cd 100644 > --- a/arch/x86/xen/setup.c > +++ b/arch/x86/xen/setup.c > @@ -9,6 +9,8 @@ > #include <linux/mm.h> > #include <linux/pm.h> > #include <linux/memblock.h> > +#include <linux/cpuidle.h> > +#include <linux/slab.h> > > #include <asm/elf.h> > #include <asm/vdso.h> > @@ -321,6 +323,44 @@ void __cpuinit xen_enable_syscall(void) > #endif /* CONFIG_X86_64 */ > } > > +static struct cpuidle_driver xen_idle_driver = { > + .name = "xen_idle", > + .owner = THIS_MODULE, > + .priority = 1, > +}; > + > +extern struct cpuidle_state state_default_state; > + > +static int setup_cpuidle(int cpu) > +{ > + struct cpuidle_device *dev = kzalloc(sizeof(struct cpuidle_device), > + GFP_KERNEL); No checking to see if dev == NULL? > + int count = CPUIDLE_DRIVER_STATE_START; > + dev->cpu = cpu; > + dev->drv = &xen_idle_driver; > + > + dev->states[count] = state_default_idle; > + count++; > + > + dev->state_count = count; > + > + if (cpuidle_register_device(dev)) > + return -EIO; No cleanup of the 'dev' so that we don't leak memory? > + return 0; > +} > + > +static int xen_idle_init(void) > +{ > + int retval, i; > + retval = cpuidle_register_driver(&xen_idle_driver); > + > + for_each_online_cpu(i) { > + setup_cpuidle(i); > + } > + > + return 0; > +} > + > void __init xen_arch_setup(void) > { > xen_panic_handler_init(); > @@ -354,7 +394,7 @@ void __init xen_arch_setup(void) > #ifdef CONFIG_X86_32 > boot_cpu_data.hlt_works_ok = 1; > #endif > - pm_idle = default_idle; > + xen_idle_init(); > boot_option_idle_override = IDLE_HALT; > > fiddle_vdso(); > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 4/5] cpuidle: driver for xen 2011-03-22 14:50 ` Konrad Rzeszutek Wilk @ 2011-03-23 9:57 ` Trinabh Gupta 2011-03-24 7:18 ` Len Brown 0 siblings, 1 reply; 51+ messages in thread From: Trinabh Gupta @ 2011-03-23 9:57 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr, xen-devel On 03/22/2011 08:20 PM, Konrad Rzeszutek Wilk wrote: > On Tue, Mar 22, 2011 at 06:03:28PM +0530, Trinabh Gupta wrote: >> This patch implements a default cpuidle driver for xen. >> Earlier pm_idle was flipped to default_idle. Maybe there >> is a better way to ensure default_idle is called >> without using this cpuidle driver. > Hi Konrad, > Please also CC the Xen devel mailing list (I did this for you) Yes, I will. Thanks > > I couldn't find it in the description, but I was wondering > what is that you are trying to fix? What is the problem description? We are trying to remove the exported function pointer pm_idle which is set to the desired idle routine to be used. For example, xen sets it to default_idle. Having exported function pointer is not very secure. The first problem is that this exported pointer can be modified/flipped by any subsystem. There is no tracking or notification mechanism. Secondly and more importantly, various subsystems save the value of this pointer, flip it and later restore to the saved value. There is no guarantee that the saved value is still valid (see http://lkml.org/lkml/2009/8/28/43 and http://lkml.org/lkml/2009/8/28/50) The first patch of the series removed pm_idle and now we directly call into CPUIdle subsystem. As a result for all the subsystems using pm_idle, we have to implement a driver that can be registered to cpuidle. > > Two, you mention in your description that it was tested on x86 systems. did > you test this with Xen? If so, what version. The patches are still in RFC stage and haven't been tested with Xen. Once we are clear on a particular solution, we will carefully test the patches. Thanks for the code review. Yes, I have missed a couple of things. I will look at how to maintain per cpu dev pointers and free the memory. Thanks, -Trinabh > >> >> Signed-off-by: Trinabh Gupta<trinabh@linux.vnet.ibm.com> >> --- >> >> arch/x86/xen/setup.c | 42 +++++++++++++++++++++++++++++++++++++++++- >> 1 files changed, 41 insertions(+), 1 deletions(-) >> >> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c >> index a8a66a5..4fce4cd 100644 >> --- a/arch/x86/xen/setup.c >> +++ b/arch/x86/xen/setup.c >> @@ -9,6 +9,8 @@ >> #include<linux/mm.h> >> #include<linux/pm.h> >> #include<linux/memblock.h> >> +#include<linux/cpuidle.h> >> +#include<linux/slab.h> >> >> #include<asm/elf.h> >> #include<asm/vdso.h> >> @@ -321,6 +323,44 @@ void __cpuinit xen_enable_syscall(void) >> #endif /* CONFIG_X86_64 */ >> } >> >> +static struct cpuidle_driver xen_idle_driver = { >> + .name = "xen_idle", >> + .owner = THIS_MODULE, >> + .priority = 1, >> +}; >> + >> +extern struct cpuidle_state state_default_state; >> + >> +static int setup_cpuidle(int cpu) >> +{ >> + struct cpuidle_device *dev = kzalloc(sizeof(struct cpuidle_device), >> + GFP_KERNEL); > > No checking to see if dev == NULL? >> + int count = CPUIDLE_DRIVER_STATE_START; >> + dev->cpu = cpu; >> + dev->drv =&xen_idle_driver; >> + >> + dev->states[count] = state_default_idle; >> + count++; >> + >> + dev->state_count = count; >> + >> + if (cpuidle_register_device(dev)) >> + return -EIO; > No cleanup of the 'dev' so that we don't leak memory? > >> + return 0; >> +} >> + >> +static int xen_idle_init(void) >> +{ >> + int retval, i; >> + retval = cpuidle_register_driver(&xen_idle_driver); >> + >> + for_each_online_cpu(i) { >> + setup_cpuidle(i); >> + } >> + >> + return 0; >> +} >> + >> void __init xen_arch_setup(void) >> { >> xen_panic_handler_init(); >> @@ -354,7 +394,7 @@ void __init xen_arch_setup(void) >> #ifdef CONFIG_X86_32 >> boot_cpu_data.hlt_works_ok = 1; >> #endif >> - pm_idle = default_idle; >> + xen_idle_init(); >> boot_option_idle_override = IDLE_HALT; >> >> fiddle_vdso(); >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 4/5] cpuidle: driver for xen 2011-03-23 9:57 ` Trinabh Gupta @ 2011-03-24 7:18 ` Len Brown 2011-03-24 12:05 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 51+ messages in thread From: Len Brown @ 2011-03-24 7:18 UTC (permalink / raw) To: Trinabh Gupta Cc: Konrad Rzeszutek Wilk, arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr, xen-devel Is a CONFIG_XEN kernel supposed to use just HLT in idle? xen_arch_setup() does this: pm_idle = default_idle; boot_option_idle_override = IDLE_HALT; which has that effect. I guess this makes sense b/c the CONFIG_XEN kernel is Dom0 and the real C-sates are done by the hypervisor? Would the same CONFIG_XEN kernel binary ever not run xen_arch_setup(), run on raw hardware, and want to use idle states other than HLT? thanks, -Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 4/5] cpuidle: driver for xen 2011-03-24 7:18 ` Len Brown @ 2011-03-24 12:05 ` Konrad Rzeszutek Wilk 2011-03-25 7:19 ` Len Brown 2011-03-25 14:38 ` Jeremy Fitzhardinge 0 siblings, 2 replies; 51+ messages in thread From: Konrad Rzeszutek Wilk @ 2011-03-24 12:05 UTC (permalink / raw) To: Len Brown Cc: Trinabh Gupta, arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr, xen-devel On Thu, Mar 24, 2011 at 03:18:14AM -0400, Len Brown wrote: > Is a CONFIG_XEN kernel supposed to use just HLT in idle? For right now.. > > xen_arch_setup() does this: > > pm_idle = default_idle; > boot_option_idle_override = IDLE_HALT; > > which has that effect. I guess this makes sense b/c the > CONFIG_XEN kernel is Dom0 and the real C-sates are done > by the hypervisor? Correct. There are some patches that make the C-states be visible in the Linux kernel, but that hasn't been ported over yet. > > Would the same CONFIG_XEN kernel binary ever not > run xen_arch_setup(), run on raw hardware, and want ever not? I am not sure of the question, so let me state: The Linux kernel if compiled with CONFIG_XEN, and if run on native hardware, would _never_ run 'xen_arch_setup()'*. It would run the normal, native type setup. > to use idle states other than HLT? > *: It could if you really really wanted. You would need to change the GRUB2 to inject some extra data in the 'sub_hardware' flag to be the Xen specific. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 4/5] cpuidle: driver for xen 2011-03-24 12:05 ` Konrad Rzeszutek Wilk @ 2011-03-25 7:19 ` Len Brown 2011-03-25 14:43 ` [Xen-devel] " Jeremy Fitzhardinge 2011-03-25 14:38 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 51+ messages in thread From: Len Brown @ 2011-03-25 7:19 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Trinabh Gupta, arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr, xen-devel > On Thu, Mar 24, 2011 at 03:18:14AM -0400, Len Brown wrote: > > Is a CONFIG_XEN kernel supposed to use just HLT in idle? > > For right now.. > > > > xen_arch_setup() does this: > > > > pm_idle = default_idle; > > boot_option_idle_override = IDLE_HALT; > > > > which has that effect. I guess this makes sense b/c the > > CONFIG_XEN kernel is Dom0 and the real C-sates are done > > by the hypervisor? > > Correct. There are some patches that make the C-states > be visible in the Linux kernel, but that hasn't been ported > over yet. The Xen Dom0 kernel will trap into the hypervisor whenever it does a HLT or an MWAIT, yes? What is the benefit of having Dom0 decided between C-states that it can't actually enter? What is the mechanism by which those C-states are made visible to Dom0, and how are those states related to the states that are supported on the bare iron? > > Would the same CONFIG_XEN kernel binary ever not > > run xen_arch_setup(), run on raw hardware, and want > > to use idle states other than HLT? > ever not? I am not sure of the question, so let me state: > The Linux kernel if compiled with CONFIG_XEN, and if run on > native hardware, would _never_ run 'xen_arch_setup()'*. It would > run the normal, native type setup. Thanks, that clarifies. -Len Brown, Intel Open Source Technolgy Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [Xen-devel] Re: [RFC PATCH V4 4/5] cpuidle: driver for xen 2011-03-25 7:19 ` Len Brown @ 2011-03-25 14:43 ` Jeremy Fitzhardinge 0 siblings, 0 replies; 51+ messages in thread From: Jeremy Fitzhardinge @ 2011-03-25 14:43 UTC (permalink / raw) To: Len Brown Cc: Konrad Rzeszutek Wilk, venki, ak, suresh.b.siddha, sfr, peterz, benh, linux-kernel, xen-devel, arjan, Trinabh Gupta On 03/25/2011 07:19 AM, Len Brown wrote: > The Xen Dom0 kernel will trap into the hypervisor > whenever it does a HLT or an MWAIT, yes? Yes, on hlt. > What is the benefit of having Dom0 decided between > C-states that it can't actually enter? There might be a slight benefit to allow a domain to tell Xen what its overall utilisation is (ie, "I'd like this VCPU to run, but it isn't very important so you can take that into account when choosing scheduling priority and/or PCPU performance"). But there's nothing like that at present. > What is the mechanism by which those C-states are > made visible to Dom0, and how are those states > related to the states that are supported on > the bare iron? Because dom0 is the official ACPI owner (ie, it has the AML interpreter), we need dom0 to handle complex interactions with ACPI (the hypervisor can do simple table parsing). At present the mechanism for power states is that dom0 gets them out of ACPI, and then passes them to Xen which actually uses them. But no guest kernel has any runtime use of power states. J ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [Xen-devel] Re: [RFC PATCH V4 4/5] cpuidle: driver for xen 2011-03-24 12:05 ` Konrad Rzeszutek Wilk 2011-03-25 7:19 ` Len Brown @ 2011-03-25 14:38 ` Jeremy Fitzhardinge 2011-03-31 2:02 ` Len Brown 1 sibling, 1 reply; 51+ messages in thread From: Jeremy Fitzhardinge @ 2011-03-25 14:38 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Len Brown, venki, ak, suresh.b.siddha, sfr, peterz, benh, linux-kernel, xen-devel, arjan, Trinabh Gupta On 03/24/2011 12:05 PM, Konrad Rzeszutek Wilk wrote: > On Thu, Mar 24, 2011 at 03:18:14AM -0400, Len Brown wrote: >> Is a CONFIG_XEN kernel supposed to use just HLT in idle? > For right now.. For always, I should think. >> xen_arch_setup() does this: >> >> pm_idle = default_idle; >> boot_option_idle_override = IDLE_HALT; >> >> which has that effect. I guess this makes sense b/c the >> CONFIG_XEN kernel is Dom0 and the real C-sates are done >> by the hypervisor? > Correct. There are some patches that make the C-states > be visible in the Linux kernel, but that hasn't been ported > over yet. All we need is for the idle CPU to block in the hypervisor; a plain "hlt" is always going to be sufficient (which is overridden as a pvop into a sched_idle hypercall). Xen will choose an appropriate power state for the physical cpus depending on the overall busyness of the system (which any individual virtual machine can't determine). J ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [Xen-devel] Re: [RFC PATCH V4 4/5] cpuidle: driver for xen 2011-03-25 14:38 ` Jeremy Fitzhardinge @ 2011-03-31 2:02 ` Len Brown 2011-03-31 21:26 ` Len Brown 0 siblings, 1 reply; 51+ messages in thread From: Len Brown @ 2011-03-31 2:02 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Konrad Rzeszutek Wilk, venki, ak, suresh.b.siddha, sfr, peterz, benh, linux-kernel, xen-devel, arjan, Trinabh Gupta > >> Is a CONFIG_XEN kernel supposed to use just HLT in idle? > > For right now.. > > For always, I should think. Yay! > >> xen_arch_setup() does this: > >> > >> pm_idle = default_idle; > >> boot_option_idle_override = IDLE_HALT; > >> > >> which has that effect. I guess this makes sense b/c the > >> CONFIG_XEN kernel is Dom0 and the real C-sates are done > >> by the hypervisor? > > Correct. There are some patches that make the C-states > > be visible in the Linux kernel, but that hasn't been ported > > over yet. > > All we need is for the idle CPU to block in the hypervisor; a plain > "hlt" is always going to be sufficient (which is overridden as a pvop > into a sched_idle hypercall). > > Xen will choose an appropriate power state for the physical cpus > depending on the overall busyness of the system (which any individual > virtual machine can't determine). Okay, knowing that the Dom0 kernel 1. can boot in non-xen mode on bare hardware and run cpuidle 2. needs just HLT when booted in xen mode will help us keep things simple. thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [Xen-devel] Re: [RFC PATCH V4 4/5] cpuidle: driver for xen 2011-03-31 2:02 ` Len Brown @ 2011-03-31 21:26 ` Len Brown 2011-03-31 22:36 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 51+ messages in thread From: Len Brown @ 2011-03-31 21:26 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Konrad Rzeszutek Wilk, venki, ak, suresh.b.siddha, sfr, peterz, benh, linux-kernel, xen-devel, arjan, Trinabh Gupta > > >> xen_arch_setup() does this: > > >> > > >> pm_idle = default_idle; > > >> boot_option_idle_override = IDLE_HALT; What happens on a Xen kernel if these lines are not there? Does Xen export the C-states tables to Dom0 kernel, and the Dom0 kernel has an acpi processor driver, and thus it would try to use all the C-states? If yes, must Xen show those tables to Dom? If it did not, then the lines above would not be necessary, as in the absence of any C-states, the kernel should use halt by default. thanks, -Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [Xen-devel] Re: [RFC PATCH V4 4/5] cpuidle: driver for xen 2011-03-31 21:26 ` Len Brown @ 2011-03-31 22:36 ` Jeremy Fitzhardinge 2011-04-01 3:03 ` Len Brown 0 siblings, 1 reply; 51+ messages in thread From: Jeremy Fitzhardinge @ 2011-03-31 22:36 UTC (permalink / raw) To: Len Brown Cc: Konrad Rzeszutek Wilk, venki, ak, suresh.b.siddha, sfr, peterz, benh, linux-kernel, xen-devel, arjan, Trinabh Gupta On 03/31/2011 02:26 PM, Len Brown wrote: >>>>> xen_arch_setup() does this: >>>>> >>>>> pm_idle = default_idle; >>>>> boot_option_idle_override = IDLE_HALT; > What happens on a Xen kernel if these lines are not there? > Does Xen export the C-states tables to Dom0 kernel, and the Dom0 > kernel has an acpi processor driver, and thus it would try to > use all the C-states? If they're no there it tries to use the Intel cpuidle driver, which fails (just hangs forever in idle, I think). > If yes, must Xen show those tables to Dom? > If it did not, then the lines above would not be necessary, > as in the absence of any C-states, the kernel should > use halt by default. The dom0 kernel gets all the ACPI state so it can get all the juicy goodness from it. It does extract the C state info, but passes it back to Xen rather than use it itself. We don't generally try to filter the ACPI state before letting dom0 see it (DMAR being the exception, since dom0 really has no business knowing about that). (We have this basic problem that neither Xen nor dom0 are the ideal owners of ACPI. In principle Xen should own ACPI as the most privileged "OS", but it really only cares about things like power states, interrupt routing, system topology, busses, etc. But dom0 cares about lid switches, magic keyboard keys, volume controls, video output switching, etc, etc. At the moment it seems to work best if dom0 do all ACPI processing then pass Xen the parts it needs, which are generally fixed-at-startup config info items.) J ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [Xen-devel] Re: [RFC PATCH V4 4/5] cpuidle: driver for xen 2011-03-31 22:36 ` Jeremy Fitzhardinge @ 2011-04-01 3:03 ` Len Brown 0 siblings, 0 replies; 51+ messages in thread From: Len Brown @ 2011-04-01 3:03 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Konrad Rzeszutek Wilk, venki, ak, suresh.b.siddha, sfr, peterz, benh, linux-kernel, xen-devel, arjan, Trinabh Gupta > >>>>> xen_arch_setup() does this: > >>>>> > >>>>> pm_idle = default_idle; > >>>>> boot_option_idle_override = IDLE_HALT; > > What happens on a Xen kernel if these lines are not there? > > Does Xen export the C-states tables to Dom0 kernel, and the Dom0 > > kernel has an acpi processor driver, and thus it would try to > > use all the C-states? > > If they're no there it tries to use the Intel cpuidle driver, which > fails (just hangs forever in idle, I think). > > > If yes, must Xen show those tables to Dom? > > If it did not, then the lines above would not be necessary, > > as in the absence of any C-states, the kernel should > > use halt by default. > > The dom0 kernel gets all the ACPI state so it can get all the juicy > goodness from it. It does extract the C state info, but passes it back > to Xen rather than use it itself. We don't generally try to filter the > ACPI state before letting dom0 see it (DMAR being the exception, since > dom0 really has no business knowing about that). > > (We have this basic problem that neither Xen nor dom0 are the ideal > owners of ACPI. In principle Xen should own ACPI as the most privileged > "OS", but it really only cares about things like power states, interrupt > routing, system topology, busses, etc. But dom0 cares about lid > switches, magic keyboard keys, volume controls, video output switching, > etc, etc. At the moment it seems to work best if dom0 do all ACPI > processing then pass Xen the parts it needs, which are generally > fixed-at-startup config info items.) Okay. Since a Xen kernel can also boot on bare iron, and thus includes cpuidle, acpi_idle, intel_idle; and when in Dom0 mode it simply wants to run HLT in idle... I think what we want to do is simply disable cpuidle when booted in Dom0 mode. That will allow it to fall back to default_idle w/o xen_setup needing to tinker with installing an idle driver. I'll send a patch in the next series. If you can try it out, that would be great. thanks, -Len ^ permalink raw reply [flat|nested] 51+ messages in thread
* [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm 2011-03-22 12:32 [RFC PATCH V4 0/5] cpuidle: Cleanup pm_idle and include driver/cpuidle.c in-kernel Trinabh Gupta ` (3 preceding siblings ...) 2011-03-22 12:33 ` [RFC PATCH V4 4/5] cpuidle: driver for xen Trinabh Gupta @ 2011-03-22 12:33 ` Trinabh Gupta 2011-03-23 1:14 ` Stephen Rothwell 2011-03-24 4:27 ` [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm Len Brown 4 siblings, 2 replies; 51+ messages in thread From: Trinabh Gupta @ 2011-03-22 12:33 UTC (permalink / raw) To: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak; +Cc: linux-kernel, sfr Signed-off-by: Trinabh Gupta <trinabh@linux.vnet.ibm.com> --- arch/x86/kernel/apm_32.c | 75 +++++++++++++++++++++++++++++++++++++--------- 1 files changed, 60 insertions(+), 15 deletions(-) diff --git a/arch/x86/kernel/apm_32.c b/arch/x86/kernel/apm_32.c index 0e4f24c..22d40bf 100644 --- a/arch/x86/kernel/apm_32.c +++ b/arch/x86/kernel/apm_32.c @@ -882,8 +882,6 @@ static void apm_do_busy(void) #define IDLE_CALC_LIMIT (HZ * 100) #define IDLE_LEAKY_MAX 16 -static void (*original_pm_idle)(void) __read_mostly; - /** * apm_cpu_idle - cpu idling for APM capable Linux * @@ -947,10 +945,7 @@ recalc: break; } } - if (original_pm_idle) - original_pm_idle(); - else - default_idle(); + default_idle(); local_irq_disable(); jiffies_since_last_check = jiffies - last_jiffies; if (jiffies_since_last_check > idle_period) @@ -2256,6 +2251,63 @@ static struct dmi_system_id __initdata apm_dmi_table[] = { { } }; +int apm_idle_wrapper(struct cpuidle_device *dev, + struct cpuidle_state *state) +{ + apm_cpu_idle(); + return 0; +} + +struct cpuidle_state state_apm_idle = { + .name = "APM-IDLE", + .desc = "APM idle routine", + .exit_latency = 1, + .target_residency = 1, + .enter = &apm_idle_wrapper, +}; + +static struct cpuidle_driver apm_idle_driver = { + .name = "apm_idle", + .owner = THIS_MODULE, + .priority = 1, +}; + +static int apm_setup_cpuidle(int cpu) +{ + struct cpuidle_device *dev = kzalloc(sizeof(struct cpuidle_device), + GFP_KERNEL); + int count = CPUIDLE_DRIVER_STATE_START; + dev->cpu = cpu; + dev->drv = &apm_idle_driver; + + dev->states[count] = state_apm_idle; + count++; + + dev->state_count = count; + + if (cpuidle_register_device(dev)) + return -EIO; + return 0; +} + +static int apm_idle_init(void) +{ + int retval, i; + retval = cpuidle_register_driver(&apm_idle_driver); + + for_each_online_cpu(i) { + apm_setup_cpuidle(i); + } + + return 0; +} + +static void apm_idle_exit(void) +{ + cpuidle_unregister_driver(&apm_idle_driver); + return; +} + /* * Just start the APM thread. We do NOT want to do APM BIOS * calls from anything but the APM thread, if for no other reason @@ -2393,8 +2445,7 @@ static int __init apm_init(void) if (HZ != 100) idle_period = (idle_period * HZ) / 100; if (idle_threshold < 100) { - original_pm_idle = pm_idle; - pm_idle = apm_cpu_idle; + apm_idle_init(); set_pm_idle = 1; } @@ -2406,13 +2457,7 @@ static void __exit apm_exit(void) int error; if (set_pm_idle) { - pm_idle = original_pm_idle; - /* - * We are about to unload the current idle thread pm callback - * (pm_idle), Wait for all processors to update cached/local - * copies of pm_idle before proceeding. - */ - cpu_idle_wait(); + apm_idle_exit(); } if (((apm_info.bios.flags & APM_BIOS_DISENGAGED) == 0) && (apm_info.connection_version > 0x0100)) { ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm 2011-03-22 12:33 ` [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm Trinabh Gupta @ 2011-03-23 1:14 ` Stephen Rothwell 2011-03-23 10:25 ` Trinabh Gupta 2011-03-24 4:27 ` [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm Len Brown 1 sibling, 1 reply; 51+ messages in thread From: Stephen Rothwell @ 2011-03-23 1:14 UTC (permalink / raw) To: Trinabh Gupta Cc: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak, linux-kernel [-- Attachment #1: Type: text/plain, Size: 950 bytes --] Hi, On Tue, 22 Mar 2011 18:03:40 +0530 Trinabh Gupta <trinabh@linux.vnet.ibm.com> wrote: > > +static int apm_setup_cpuidle(int cpu) > +{ > + struct cpuidle_device *dev = kzalloc(sizeof(struct cpuidle_device), > + GFP_KERNEL); Same as xen comments: no NULL check. > + int count = CPUIDLE_DRIVER_STATE_START; > + dev->cpu = cpu; > + dev->drv = &apm_idle_driver; Also wondering why you would ever have a different idle routine on different cpus? > + > + dev->states[count] = state_apm_idle; > + count++; > + > + dev->state_count = count; > + > + if (cpuidle_register_device(dev)) > + return -EIO; And we leak dev. > +static void apm_idle_exit(void) > +{ > + cpuidle_unregister_driver(&apm_idle_driver); Do we leak the per cpu device structure here? > + return; Unnecessary return statement. -- Cheers, Stephen Rothwell sfr@canb.auug.org.au http://www.canb.auug.org.au/~sfr/ [-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm 2011-03-23 1:14 ` Stephen Rothwell @ 2011-03-23 10:25 ` Trinabh Gupta 2011-03-23 20:32 ` Len Brown 0 siblings, 1 reply; 51+ messages in thread From: Trinabh Gupta @ 2011-03-23 10:25 UTC (permalink / raw) To: Stephen Rothwell Cc: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel Hi Stephen, Thanks for reviewing. On 03/23/2011 06:44 AM, Stephen Rothwell wrote: > Hi, > > On Tue, 22 Mar 2011 18:03:40 +0530 Trinabh Gupta<trinabh@linux.vnet.ibm.com> wrote: >> >> +static int apm_setup_cpuidle(int cpu) >> +{ >> + struct cpuidle_device *dev = kzalloc(sizeof(struct cpuidle_device), >> + GFP_KERNEL); > > Same as xen comments: no NULL check. > >> + int count = CPUIDLE_DRIVER_STATE_START; >> + dev->cpu = cpu; >> + dev->drv =&apm_idle_driver; > > Also wondering why you would ever have a different idle routine on > different cpus? Yes, this is an ongoing debate. Apparently it is a possibility because of ACPI bugs. CPU's can have asymmetric C-states and overall different idle routines on different cpus. Please refer to http://lkml.org/lkml/2009/9/24/132 and https://lkml.org/lkml/2011/2/10/37 for a discussion around this. I have posted a patch series that does global registration i.e same idle routines for each cpu. Please check http://lkml.org/lkml/2011/3/22/161 . That series applies on top of this series. Global registration significantly simplifies the design, but still we are not sure about the direction to take. > >> + >> + dev->states[count] = state_apm_idle; >> + count++; >> + >> + dev->state_count = count; >> + >> + if (cpuidle_register_device(dev)) >> + return -EIO; > > And we leak dev. > >> +static void apm_idle_exit(void) >> +{ >> + cpuidle_unregister_driver(&apm_idle_driver); > > Do we leak the per cpu device structure here? I will see how we can save per cpu device structure pointers so that we can free them. > >> + return; > > Unnecessary return statement. > Thanks, -Trinabh ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm 2011-03-23 10:25 ` Trinabh Gupta @ 2011-03-23 20:32 ` Len Brown 2011-03-24 14:28 ` Trinabh Gupta 0 siblings, 1 reply; 51+ messages in thread From: Len Brown @ 2011-03-23 20:32 UTC (permalink / raw) To: Trinabh Gupta Cc: Stephen Rothwell, arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel > > Also wondering why you would ever have a different idle routine on > > different cpus? > > Yes, this is an ongoing debate. Apparently it is a possibility > because of ACPI bugs. CPU's can have asymmetric C-states > and overall different idle routines on different cpus. Please > refer to http://lkml.org/lkml/2009/9/24/132 and > https://lkml.org/lkml/2011/2/10/37 for a discussion around this. Althought the ACPI specification allows the BIOS to tell the OS about different C-states per-processor, I know of zero system in the field and zero systems in development that require that capability. That isn't a guarantee that capability will never be used, but I'm not holding my breath. If there are systems with broken tables that make them appear asymetric, then we should have a workaround that handles that case, rather than complicating the normal code for the broken case. So I recommend deleting the extra per-cpu registration stuff unless there is some other architecture that requires it and can't hadle the asymmetry in another way. > I have posted a patch series that does global registration > i.e same idle routines for each cpu. Please check > http://lkml.org/lkml/2011/3/22/161 . That series applies on > top of this series. Global registration significantly > simplifies the design, but still we are not sure about the > direction to take. I'll review that. thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm 2011-03-23 20:32 ` Len Brown @ 2011-03-24 14:28 ` Trinabh Gupta 2011-03-24 16:21 ` Vaidyanathan Srinivasan 2011-03-25 7:24 ` Len Brown 0 siblings, 2 replies; 51+ messages in thread From: Trinabh Gupta @ 2011-03-24 14:28 UTC (permalink / raw) To: Len Brown, arjan Cc: Stephen Rothwell, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel On 03/24/2011 02:02 AM, Len Brown wrote: >>> Also wondering why you would ever have a different idle routine on >>> different cpus? >> >> Yes, this is an ongoing debate. Apparently it is a possibility >> because of ACPI bugs. CPU's can have asymmetric C-states >> and overall different idle routines on different cpus. Please >> refer to http://lkml.org/lkml/2009/9/24/132 and >> https://lkml.org/lkml/2011/2/10/37 for a discussion around this. > > Althought the ACPI specification allows the BIOS to tell the OS > about different C-states per-processor, I know of zero system > in the field and zero systems in development that require that > capability. That isn't a guarantee that capability will never > be used, but I'm not holding my breath. > > If there are systems with broken tables that make them > appear asymetric, then we should have a workaround that handles > that case, rather than complicating the normal code for > the broken case. > > So I recommend deleting the extra per-cpu registration stuff > unless there is some other architecture that requires it > and can't hadle the asymmetry in another way. Yes, lets go forward with removal of per-cpu registration and handle rare case of asymmetry in some other may. Using intersection or union of C-states for each cpu may be a solution. Using intersection or lowest common C-state has the corner case that we could have packages/cores supporting a new lower C-state in case of thermal limit and they would want OS to go to this state. Using intersection or lowest common C-state may prevent this. Another option is to use union of C-states; but I am not sure what happens if a CPU uses a state that is not reported for it??? Maybe there is some other way to handle asymmetry ?? > >> I have posted a patch series that does global registration >> i.e same idle routines for each cpu. Please check >> http://lkml.org/lkml/2011/3/22/161 . That series applies on >> top of this series. Global registration significantly >> simplifies the design, but still we are not sure about the >> direction to take. > > I'll review that. Thanks; please review especially the data structure changes https://lkml.org/lkml/2011/3/22/162 -Trinabh ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm 2011-03-24 14:28 ` Trinabh Gupta @ 2011-03-24 16:21 ` Vaidyanathan Srinivasan 2011-03-25 7:24 ` Len Brown 1 sibling, 0 replies; 51+ messages in thread From: Vaidyanathan Srinivasan @ 2011-03-24 16:21 UTC (permalink / raw) To: Trinabh Gupta Cc: Len Brown, arjan, Stephen Rothwell, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel * Trinabh Gupta <trinabh@linux.vnet.ibm.com> [2011-03-24 19:58:29]: > > > On 03/24/2011 02:02 AM, Len Brown wrote: > >>>Also wondering why you would ever have a different idle routine on > >>>different cpus? > >> > >>Yes, this is an ongoing debate. Apparently it is a possibility > >>because of ACPI bugs. CPU's can have asymmetric C-states > >>and overall different idle routines on different cpus. Please > >>refer to http://lkml.org/lkml/2009/9/24/132 and > >>https://lkml.org/lkml/2011/2/10/37 for a discussion around this. > > > >Althought the ACPI specification allows the BIOS to tell the OS > >about different C-states per-processor, I know of zero system > >in the field and zero systems in development that require that > >capability. That isn't a guarantee that capability will never > >be used, but I'm not holding my breath. > > > >If there are systems with broken tables that make them > >appear asymetric, then we should have a workaround that handles > >that case, rather than complicating the normal code for > >the broken case. > > > >So I recommend deleting the extra per-cpu registration stuff > >unless there is some other architecture that requires it > >and can't hadle the asymmetry in another way. Hi Len, The fear of breaking legacy hardware is what is holding us. Arjan also pointed that there are systems that has asymmetric C-States either intentionally or due to a bug. I agree with you that we should remove per-cpu registration at this point and move forward with a single registration. We can workaround the corner cases. > Yes, lets go forward with removal of per-cpu registration > and handle rare case of asymmetry in some other may. Yes. Lets discuss the design/problems on the other patch series. (http://lkml.org/lkml/2011/3/22/161) > Using intersection or union of C-states for each cpu may > be a solution. Using intersection or lowest common C-state > has the corner case that we could have packages/cores > supporting a new lower C-state in case of thermal limit and > they would want OS to go to this state. Using intersection > or lowest common C-state may prevent this. > > Another option is to use union of C-states; > but I am not sure what happens if a CPU uses a state that > is not reported for it??? > > Maybe there is some other way to handle asymmetry ?? The simplest method will be a union of all C-States. Basically let the CPU that reports the maximum C-States to register or override the current setup. During boot-up allow the first CPU to do the registration, but later override if another CPU comes up with more C-States. This will work assuming that calling a new (deeper) C-State on CPUs that did not report them will cause no harm. It should degenerate to closest supported C-State. > > > >>I have posted a patch series that does global registration > >>i.e same idle routines for each cpu. Please check > >>http://lkml.org/lkml/2011/3/22/161 . That series applies on > >>top of this series. Global registration significantly > >>simplifies the design, but still we are not sure about the > >>direction to take. > > > >I'll review that. > Thanks; please review especially the data structure changes > https://lkml.org/lkml/2011/3/22/162 Single registration will allow us to combine struct cpuidle_device and struct cpuidle_driver, and simplify the framework for large systems. AFAIK other archs like powerpc or ARM would not need per-cpu definitions. --Vaidy ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm 2011-03-24 14:28 ` Trinabh Gupta 2011-03-24 16:21 ` Vaidyanathan Srinivasan @ 2011-03-25 7:24 ` Len Brown 2011-03-25 18:01 ` Vaidyanathan Srinivasan 1 sibling, 1 reply; 51+ messages in thread From: Len Brown @ 2011-03-25 7:24 UTC (permalink / raw) To: Trinabh Gupta Cc: arjan, Stephen Rothwell, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel > Maybe there is some other way to handle asymmetry ?? When somebody invents a system that needs asymmetry on purpose, that is the time to get fancy. Until that day, simply print a FW_BUG WARNING and fall back to default_idle(). cheers, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm 2011-03-25 7:24 ` Len Brown @ 2011-03-25 18:01 ` Vaidyanathan Srinivasan 2011-03-31 2:17 ` cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) Len Brown 0 siblings, 1 reply; 51+ messages in thread From: Vaidyanathan Srinivasan @ 2011-03-25 18:01 UTC (permalink / raw) To: Len Brown Cc: Trinabh Gupta, arjan, Stephen Rothwell, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel * Len Brown <lenb@kernel.org> [2011-03-25 03:24:24]: > > > Maybe there is some other way to handle asymmetry ?? > > When somebody invents a system that needs asymmetry on purpose, > that is the time to get fancy. Until that day, simply print a > FW_BUG WARNING and fall back to default_idle(). We may print a warning but not switch to default_idle(). There could be laptops that gets an extra C-State when run on battery. So when the ACPI _PPC event happens for the C-State change, we need to handle the situation such that the states are refreshed for all CPUs in one step. Basically while we are updating the new C-State, the framework should not detect this as asymmetric and switch to default_idle(). Please fell free to suggest a more elegant solution to handle the runtime C-State change event, though symmetric across all CPUs. --Vaidy ^ permalink raw reply [flat|nested] 51+ messages in thread
* cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) 2011-03-25 18:01 ` Vaidyanathan Srinivasan @ 2011-03-31 2:17 ` Len Brown 2011-03-31 13:18 ` Peter Zijlstra 2011-04-01 7:02 ` Trinabh Gupta 0 siblings, 2 replies; 51+ messages in thread From: Len Brown @ 2011-03-31 2:17 UTC (permalink / raw) To: Vaidyanathan Srinivasan Cc: Trinabh Gupta, arjan, Stephen Rothwell, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel > > > Maybe there is some other way to handle asymmetry ?? I mis-spoke on asymmetry. Moorestown is already an example of an asymmetric system, since its deepest c-state is available on cpu0, but not on cpu1. So it needs different tables for each cpu. I think what would work is a default c-state table for the system, and the ability of a per-cpu override table. I think that would gracefully handle the case of many identical cpus, and also systems with different tables per cpu. The same goes for write-access to the tables. In the typical case, a single table can be shared for the entire system and nobody will be writing to it. However, with the governor changes to call dev->prepare and sift through all the states to find the legal one with the lowest power_usage... There is software today out of tree that updates that power_usage entry from prepare(). As I mentioned, I'm not fond of that mechanism - it looks racey to me. I'd rather see the capability of a drivers idle handler to demote to another handler in the driver and for the accounting to not get messed up when that happens. I think the way to do that is to let the driver do the accounting rather than doing it in the cpuidle caller. cheers, -Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) 2011-03-31 2:17 ` cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) Len Brown @ 2011-03-31 13:18 ` Peter Zijlstra 2011-04-01 4:09 ` Len Brown 2011-04-01 7:02 ` Trinabh Gupta 1 sibling, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2011-03-31 13:18 UTC (permalink / raw) To: Len Brown Cc: Vaidyanathan Srinivasan, Trinabh Gupta, arjan, Stephen Rothwell, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel On Wed, 2011-03-30 at 22:17 -0400, Len Brown wrote: > > Moorestown is already an example of an asymmetric system, > since its deepest c-state is available on cpu0, but not on cpu1. > So it needs different tables for each cpu. wtf are these hardware guys smoking and how the heck are we supposed to schedule on such a machine? Prefer to keep cpu1 busy while idling cpu0? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) 2011-03-31 13:18 ` Peter Zijlstra @ 2011-04-01 4:09 ` Len Brown 2011-04-01 8:15 ` Dipankar Sarma 2011-04-01 14:02 ` Peter Zijlstra 0 siblings, 2 replies; 51+ messages in thread From: Len Brown @ 2011-04-01 4:09 UTC (permalink / raw) To: Peter Zijlstra Cc: Vaidyanathan Srinivasan, Trinabh Gupta, arjan, Stephen Rothwell, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel > > Moorestown is already an example of an asymmetric system, > > since its deepest c-state is available on cpu0, but not on cpu1. > > So it needs different tables for each cpu. > > wtf are these hardware guys smoking and how the heck are we supposed to > schedule on such a machine? Prefer to keep cpu1 busy while idling cpu0? they are smoking micro-amps:-) S0i3 on cpu0 can be entered only after cpu1 is already off-line, among other system hardware dependencies... So it makes no sense to export S0i3 as a c-state on cpu1. When cpu1 is online, the scheduler treats it as a normal SMP. cheers, -Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) 2011-04-01 4:09 ` Len Brown @ 2011-04-01 8:15 ` Dipankar Sarma 2011-04-01 14:38 ` Arjan van de Ven 2011-04-01 14:02 ` Peter Zijlstra 1 sibling, 1 reply; 51+ messages in thread From: Dipankar Sarma @ 2011-04-01 8:15 UTC (permalink / raw) To: Len Brown Cc: Peter Zijlstra, Vaidyanathan Srinivasan, Trinabh Gupta, arjan, Stephen Rothwell, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel On Fri, Apr 01, 2011 at 12:09:25AM -0400, Len Brown wrote: > > > Moorestown is already an example of an asymmetric system, > > > since its deepest c-state is available on cpu0, but not on cpu1. > > > So it needs different tables for each cpu. > > > > wtf are these hardware guys smoking and how the heck are we supposed to > > schedule on such a machine? Prefer to keep cpu1 busy while idling cpu0? > > they are smoking micro-amps:-) > > S0i3 on cpu0 can be entered only after cpu1 is already off-line, > among other system hardware dependencies... > > So it makes no sense to export S0i3 as a c-state on cpu1. > > When cpu1 is online, the scheduler treats it as a normal SMP. Isn't S0i3 a "system" state, as opposed to cpu state ? Perhaps we can treat it as such and still keep the c-states symmetric. The cpu can transition to S0i3 from any cpu as long as others are offlined, no ? In that sense, really all the cpus would be in S0i3, which would make it symmetric. If this isn't how mrst cpuidle works, then cpuidle accounting may be broken in principle because of this. Thanks Dipankar ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) 2011-04-01 8:15 ` Dipankar Sarma @ 2011-04-01 14:38 ` Arjan van de Ven 2011-04-03 16:18 ` Dipankar Sarma 0 siblings, 1 reply; 51+ messages in thread From: Arjan van de Ven @ 2011-04-01 14:38 UTC (permalink / raw) To: dipankar Cc: Len Brown, Peter Zijlstra, Vaidyanathan Srinivasan, Trinabh Gupta, Stephen Rothwell, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel On 4/1/2011 1:15 AM, Dipankar Sarma wrote: > On Fri, Apr 01, 2011 at 12:09:25AM -0400, Len Brown wrote: >>>> Moorestown is already an example of an asymmetric system, >>>> since its deepest c-state is available on cpu0, but not on cpu1. >>>> So it needs different tables for each cpu. >>> wtf are these hardware guys smoking and how the heck are we supposed to >>> schedule on such a machine? Prefer to keep cpu1 busy while idling cpu0? >> they are smoking micro-amps:-) >> >> S0i3 on cpu0 can be entered only after cpu1 is already off-line, >> among other system hardware dependencies... >> >> So it makes no sense to export S0i3 as a c-state on cpu1. >> >> When cpu1 is online, the scheduler treats it as a normal SMP. > Isn't S0i3 a "system" state, as opposed to cpu state ? it's misnamed. it's a C state to the OS. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) 2011-04-01 14:38 ` Arjan van de Ven @ 2011-04-03 16:18 ` Dipankar Sarma 0 siblings, 0 replies; 51+ messages in thread From: Dipankar Sarma @ 2011-04-03 16:18 UTC (permalink / raw) To: Arjan van de Ven Cc: Len Brown, Peter Zijlstra, Vaidyanathan Srinivasan, Trinabh Gupta, Stephen Rothwell, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel On Fri, Apr 01, 2011 at 07:38:23AM -0700, Arjan van de Ven wrote: > On 4/1/2011 1:15 AM, Dipankar Sarma wrote: > >On Fri, Apr 01, 2011 at 12:09:25AM -0400, Len Brown wrote: > >>>>Moorestown is already an example of an asymmetric system, > >>>>since its deepest c-state is available on cpu0, but not on cpu1. > >>>>So it needs different tables for each cpu. > >>>wtf are these hardware guys smoking and how the heck are we supposed to > >>>schedule on such a machine? Prefer to keep cpu1 busy while idling cpu0? > >>they are smoking micro-amps:-) > >> > >>S0i3 on cpu0 can be entered only after cpu1 is already off-line, > >>among other system hardware dependencies... > >> > >>So it makes no sense to export S0i3 as a c-state on cpu1. > >> > >>When cpu1 is online, the scheduler treats it as a normal SMP. > >Isn't S0i3 a "system" state, as opposed to cpu state ? > > it's misnamed. it's a C state to the OS. I understand that it has been implemented as a C-state from this - http://build.meego.com/package/view_file?file=linux-2.6.37-mrst-s0i3.patch&package=kernel-adaptation-mrst&project=home%3Adliu9&srcmd5=a0929a2863150f5c8454507d6cd8f09d The key question is this - +int mrst_check_state_availability(struct cpuidle_device *dev) +{ + int cpu = smp_processor_id(); + + /* + * If there is another CPU running, the GPU is active, + * the PMU is uninitialized, or there is a still-unprocessed + * PMU command, we cannot enter S0i3. + */ + if (!pmu_reg || !cpumask_equal(cpu_online_mask, cpumask_of(cpu)) || + s0i3_pmu_command_pending) + dev->states[5].flags |= CPUIDLE_FLAG_IGNORE; + else + dev->states[5].flags &= ~CPUIDLE_FLAG_IGNORE; Is this really asymetric ? Or is it that if the other cpu(s) are in C6, the chip can enter S0i3 ? If that is the case, then isn't this equivalent to all the cpus in the chip entering S0i3 and thus symmetrical ? Also, if going to S0i3 relies on other cpus offlined, then we don't need to worry about asymetry from the scheduler/cpuidle point of view, no ? Thanks Dipankar ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) 2011-04-01 4:09 ` Len Brown 2011-04-01 8:15 ` Dipankar Sarma @ 2011-04-01 14:02 ` Peter Zijlstra 2011-04-04 14:32 ` Dipankar Sarma 1 sibling, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2011-04-01 14:02 UTC (permalink / raw) To: Len Brown Cc: Vaidyanathan Srinivasan, Trinabh Gupta, arjan, Stephen Rothwell, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel On Fri, 2011-04-01 at 00:09 -0400, Len Brown wrote: > > > Moorestown is already an example of an asymmetric system, > > > since its deepest c-state is available on cpu0, but not on cpu1. > > > So it needs different tables for each cpu. > > > > wtf are these hardware guys smoking and how the heck are we supposed to > > schedule on such a machine? Prefer to keep cpu1 busy while idling cpu0? > > they are smoking micro-amps:-) Has anybody told them that pushing lots of logic into software generally burns more amps because it keeps the thing running longer? > S0i3 on cpu0 can be entered only after cpu1 is already off-line, > among other system hardware dependencies... > > So it makes no sense to export S0i3 as a c-state on cpu1. > > When cpu1 is online, the scheduler treats it as a normal SMP. Dipankar's reply seems to address this issue well. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) 2011-04-01 14:02 ` Peter Zijlstra @ 2011-04-04 14:32 ` Dipankar Sarma 2011-04-05 15:01 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Dipankar Sarma @ 2011-04-04 14:32 UTC (permalink / raw) To: Peter Zijlstra Cc: Len Brown, Vaidyanathan Srinivasan, Trinabh Gupta, arjan, Stephen Rothwell, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel On Fri, Apr 01, 2011 at 04:02:36PM +0200, Peter Zijlstra wrote: > > > S0i3 on cpu0 can be entered only after cpu1 is already off-line, > > among other system hardware dependencies... > > > > So it makes no sense to export S0i3 as a c-state on cpu1. > > > > When cpu1 is online, the scheduler treats it as a normal SMP. > > Dipankar's reply seems to address this issue well. I can't find any Moorestown documentation at the Intel site, but thinking about Len's inputs a bit more, it seems there may be still a problem asymetry from the scheduler perspective. If cpu0 or cpu1 either of them can be offlined, there is no asymetry. If only cpu1 can be offlined, it would mean that one cpu may be more efficient depending on how we do cpu offlining for power savings. It gets a bit messy. Len, what exacty is the significance of offlining here ? Apart from going to C6, what else is needed in cpu1 for the chip to go to S0i3 ? Why is idle C6 not enough ? Thanks Dipankar ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) 2011-04-04 14:32 ` Dipankar Sarma @ 2011-04-05 15:01 ` Peter Zijlstra 2011-04-05 15:48 ` Dipankar Sarma 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2011-04-05 15:01 UTC (permalink / raw) To: dipankar Cc: Len Brown, Vaidyanathan Srinivasan, Trinabh Gupta, arjan, Stephen Rothwell, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel On Mon, 2011-04-04 at 20:02 +0530, Dipankar Sarma wrote: > On Fri, Apr 01, 2011 at 04:02:36PM +0200, Peter Zijlstra wrote: > > > > > S0i3 on cpu0 can be entered only after cpu1 is already off-line, > > > among other system hardware dependencies... > > > > > > So it makes no sense to export S0i3 as a c-state on cpu1. > > > > > > When cpu1 is online, the scheduler treats it as a normal SMP. > > > > Dipankar's reply seems to address this issue well. > > I can't find any Moorestown documentation at the Intel site, but > thinking about Len's inputs a bit more, it seems there may > be still a problem asymetry from the scheduler perspective. > > If cpu0 or cpu1 either of them can be offlined, there is no > asymetry. If only cpu1 can be offlined, it would mean that > one cpu may be more efficient depending on how we do > cpu offlining for power savings. It gets a bit messy. > > Len, what exacty is the significance of offlining here ? > Apart from going to C6, what else is needed in cpu1 for > the chip to go to S0i3 ? Why is idle C6 not enough ? I don't think offlining is relevant, anybody using that for power management is doing it wrong, _very_ wrong. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) 2011-04-05 15:01 ` Peter Zijlstra @ 2011-04-05 15:48 ` Dipankar Sarma 0 siblings, 0 replies; 51+ messages in thread From: Dipankar Sarma @ 2011-04-05 15:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Len Brown, Vaidyanathan Srinivasan, Trinabh Gupta, arjan, Stephen Rothwell, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel On Tue, Apr 05, 2011 at 05:01:32PM +0200, Peter Zijlstra wrote: > On Mon, 2011-04-04 at 20:02 +0530, Dipankar Sarma wrote: > > On Fri, Apr 01, 2011 at 04:02:36PM +0200, Peter Zijlstra wrote: > > I can't find any Moorestown documentation at the Intel site, but > > thinking about Len's inputs a bit more, it seems there may > > be still a problem asymetry from the scheduler perspective. > > > > If cpu0 or cpu1 either of them can be offlined, there is no > > asymetry. If only cpu1 can be offlined, it would mean that > > one cpu may be more efficient depending on how we do > > cpu offlining for power savings. It gets a bit messy. > > > > Len, what exacty is the significance of offlining here ? > > Apart from going to C6, what else is needed in cpu1 for > > the chip to go to S0i3 ? Why is idle C6 not enough ? > > I don't think offlining is relevant, anybody using that for power > management is doing it wrong, _very_ wrong. I am suggesting that it depends on the offlining logic. If cpu1 is being used as an added co-processor for some specific apps and mostly offline otherwise, it may not be an issue. If offlining is being used as a meta-scheduler over the kernel scheduler (like power savings or whatever logic) than it will cause asymmetry problems dependent on the ooffline logic - e.g. it may be more advantageous for the kernel scheduler to schedule on cpu0 keeping cpu1 free leading to S0i3 more often. Not advocating it when we are trying to run away from it on powerpc :) For now, it seems we are OK in handling S0i3 through cpuidle, just that it would be nice to understand the overall offline logic and why it is needed. Similar questions have come up in my discussions with ARM guys in the recent times as well. Thanks Dipankar ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) 2011-03-31 2:17 ` cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) Len Brown 2011-03-31 13:18 ` Peter Zijlstra @ 2011-04-01 7:02 ` Trinabh Gupta 1 sibling, 0 replies; 51+ messages in thread From: Trinabh Gupta @ 2011-04-01 7:02 UTC (permalink / raw) To: Len Brown Cc: Vaidyanathan Srinivasan, arjan, Stephen Rothwell, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, xen-devel On 03/31/2011 07:47 AM, Len Brown wrote: >>>> Maybe there is some other way to handle asymmetry ?? > > I mis-spoke on asymmetry. > > Moorestown is already an example of an asymmetric system, > since its deepest c-state is available on cpu0, but not on cpu1. > So it needs different tables for each cpu. > > I think what would work is a default c-state table for the system, > and the ability of a per-cpu override table. I think that would > gracefully handle the case of many identical cpus, and also systems > with different tables per cpu. Hi Len, What would happen if a cpu enters a state which wasn't reported for it? I am wondering if an approach like union of states of each would work. Can we handle asymmetry through checking and demotion inside the routine itself; just like you are proposing as dev->prepare alternative? But I guess this may not be efficient if this happens often. I am not sure if having a per-cpu override would be very tidy (ideas ?); and much better than per-cpu stuff. So just want to check what would be the best way forward? > > The same goes for write-access to the tables. > In the typical case, a single table can be shared for the entire system > and nobody will be writing to it. However, with the governor changes > to call dev->prepare and sift through all the states to find the > legal one with the lowest power_usage... There is software today > out of tree that updates that power_usage entry from prepare(). > > As I mentioned, I'm not fond of that mechanism - it looks racey > to me. I'd rather see the capability of a drivers idle handler > to demote to another handler in the driver and for the accounting > to not get messed up when that happens. I think the way to do that > is to let the driver do the accounting rather than doing it in > the cpuidle caller. I agree with this; we should move update of statistics inside the driver routines (enter routines). They already take device/stats structure as parameter. Thanks, -Trinabh ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm 2011-03-22 12:33 ` [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm Trinabh Gupta 2011-03-23 1:14 ` Stephen Rothwell @ 2011-03-24 4:27 ` Len Brown 1 sibling, 0 replies; 51+ messages in thread From: Len Brown @ 2011-03-24 4:27 UTC (permalink / raw) To: Trinabh Gupta Cc: arjan, peterz, suresh.b.siddha, benh, venki, ak, linux-kernel, sfr does anybody still have an APM-capable box that can run the latest kernel and test this code? thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2011-04-05 15:48 UTC | newest] Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2011-03-22 12:32 [RFC PATCH V4 0/5] cpuidle: Cleanup pm_idle and include driver/cpuidle.c in-kernel Trinabh Gupta 2011-03-22 12:32 ` [RFC PATCH V4 1/5] cpuidle: Remove pm_idle pointer for x86 Trinabh Gupta 2011-03-23 1:00 ` Stephen Rothwell 2011-03-23 10:10 ` Trinabh Gupta 2011-03-22 12:32 ` [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection Trinabh Gupta 2011-03-23 2:59 ` Len Brown 2011-03-23 9:22 ` Trinabh Gupta 2011-03-23 20:51 ` Len Brown 2011-03-24 4:41 ` Len Brown 2011-03-24 14:13 ` Trinabh Gupta 2011-03-24 16:52 ` Vaidyanathan Srinivasan 2011-03-25 7:13 ` Len Brown 2011-03-25 7:05 ` Len Brown 2011-03-25 15:35 ` [Xen-devel] " Konrad Rzeszutek Wilk 2011-03-31 2:25 ` Len Brown 2011-03-22 12:33 ` [RFC PATCH V4 3/5] cpuidle: default idle driver for x86 Trinabh Gupta 2011-03-23 3:13 ` Len Brown 2011-03-23 9:31 ` Trinabh Gupta 2011-03-24 16:32 ` Vaidyanathan Srinivasan 2011-03-22 12:33 ` [RFC PATCH V4 4/5] cpuidle: driver for xen Trinabh Gupta 2011-03-22 14:50 ` Konrad Rzeszutek Wilk 2011-03-23 9:57 ` Trinabh Gupta 2011-03-24 7:18 ` Len Brown 2011-03-24 12:05 ` Konrad Rzeszutek Wilk 2011-03-25 7:19 ` Len Brown 2011-03-25 14:43 ` [Xen-devel] " Jeremy Fitzhardinge 2011-03-25 14:38 ` Jeremy Fitzhardinge 2011-03-31 2:02 ` Len Brown 2011-03-31 21:26 ` Len Brown 2011-03-31 22:36 ` Jeremy Fitzhardinge 2011-04-01 3:03 ` Len Brown 2011-03-22 12:33 ` [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm Trinabh Gupta 2011-03-23 1:14 ` Stephen Rothwell 2011-03-23 10:25 ` Trinabh Gupta 2011-03-23 20:32 ` Len Brown 2011-03-24 14:28 ` Trinabh Gupta 2011-03-24 16:21 ` Vaidyanathan Srinivasan 2011-03-25 7:24 ` Len Brown 2011-03-25 18:01 ` Vaidyanathan Srinivasan 2011-03-31 2:17 ` cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm) Len Brown 2011-03-31 13:18 ` Peter Zijlstra 2011-04-01 4:09 ` Len Brown 2011-04-01 8:15 ` Dipankar Sarma 2011-04-01 14:38 ` Arjan van de Ven 2011-04-03 16:18 ` Dipankar Sarma 2011-04-01 14:02 ` Peter Zijlstra 2011-04-04 14:32 ` Dipankar Sarma 2011-04-05 15:01 ` Peter Zijlstra 2011-04-05 15:48 ` Dipankar Sarma 2011-04-01 7:02 ` Trinabh Gupta 2011-03-24 4:27 ` [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm Len Brown
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).