The cpuidle-haltpoll driver with haltpoll governor allows the guest
vcpus to poll for a specified amount of time before halting.
This provides the following benefits over host side polling:

	1) The POLL flag is set while polling is performed, which allows
	   a remote vCPU to avoid sending an IPI (and the associated
	   cost of handling the IPI) when performing a wakeup.

	2) The VM-exit cost can be avoided.

The downside of guest side polling is that polling is performed
even with other runnable tasks in the host.

Results comparing halt_poll_ns and server/client application
where a small packet is ping-ponged:

host                                        --> 31.33
halt_poll_ns=300000 / no guest busy spin    --> 33.40   (93.8%)
halt_poll_ns=0 / guest_halt_poll_ns=300000  --> 32.73   (95.7%)

For the SAP HANA benchmarks (where idle_spin is a parameter
of the previous version of the patch, results should be the
same):

hpns == halt_poll_ns

                          idle_spin=0/   idle_spin=800/    idle_spin=0/
                          hpns=200000    hpns=0            hpns=800000
DeleteC06T03 (100 thread) 1.76           1.71 (-3%)        1.78   (+1%)
InsertC16T02 (100 thread) 2.14           2.07 (-3%)        2.18   (+1.8%)
DeleteC00T01 (1 thread)   1.34           1.28 (-4.5%)      1.29   (-3.7%)
UpdateC00T03 (1 thread)   4.72           4.18 (-12%)       4.53   (-5%)

V2:

- Move from x86 to generic code (Paolo/Christian)
- Add auto-tuning logic (Paolo)
- Add MSR to disable host side polling (Paolo)

V3:

- Do not be specific about HLT VM-exit in the documentation (Ankur Arora)
- Mark tuning parameters static and __read_mostly (Andrea Arcangeli)
- Add WARN_ON if host does not support poll control (Joao Martins)
- Use sched_clock and cleanup haltpoll_enter_idle (Peter Zijlstra)
- Mark certain functions in kvm.c as static (kernel test robot)
- Remove tracepoints as they use RCU from extended quiescent state (kernel
  test robot)

V4:

- Use a haltpoll governor, use poll_state.c poll code (Rafael J. Wysocki)

V5:

- Take latency requirement into consideration (Rafael J. Wysocki)
- Set target_residency/exit_latency to 1 (Rafael J. Wysocki)
- Do not load cpuidle driver if not virtualized (Rafael J. Wysocki)
Add a cpuidle driver that calls the architecture default_idle routine. To be used in conjunction with the haltpoll governor. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> --- arch/x86/kernel/process.c | 2 - drivers/cpuidle/Kconfig | 9 +++++ drivers/cpuidle/Makefile | 1 drivers/cpuidle/cpuidle-haltpoll.c | 65 +++++++++++++++++++++++++++++++++++++ 4 files changed, 76 insertions(+), 1 deletion(-) Index: linux-2.6-newcpuidle.git/arch/x86/kernel/process.c =================================================================== --- linux-2.6-newcpuidle.git.orig/arch/x86/kernel/process.c +++ linux-2.6-newcpuidle.git/arch/x86/kernel/process.c @@ -580,7 +580,7 @@ void __cpuidle default_idle(void) safe_halt(); trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id()); } -#ifdef CONFIG_APM_MODULE +#if defined(CONFIG_APM_MODULE) || defined(CONFIG_HALTPOLL_CPUIDLE_MODULE) EXPORT_SYMBOL(default_idle); #endif Index: linux-2.6-newcpuidle.git/drivers/cpuidle/Kconfig =================================================================== --- linux-2.6-newcpuidle.git.orig/drivers/cpuidle/Kconfig +++ linux-2.6-newcpuidle.git/drivers/cpuidle/Kconfig @@ -51,6 +51,15 @@ depends on PPC source "drivers/cpuidle/Kconfig.powerpc" endmenu +config HALTPOLL_CPUIDLE + tristate "Halt poll cpuidle driver" + depends on X86 && KVM_GUEST + default y + help + This option enables halt poll cpuidle driver, which allows to poll + before halting in the guest (more efficient than polling in the + host via halt_poll_ns for some scenarios). 
+ endif config ARCH_NEEDS_CPU_IDLE_COUPLED Index: linux-2.6-newcpuidle.git/drivers/cpuidle/Makefile =================================================================== --- linux-2.6-newcpuidle.git.orig/drivers/cpuidle/Makefile +++ linux-2.6-newcpuidle.git/drivers/cpuidle/Makefile @@ -7,6 +7,7 @@ obj-y += cpuidle.o driver.o governor.o s obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o +obj-$(CONFIG_HALTPOLL_CPUIDLE) += cpuidle-haltpoll.o ################################################################################## # ARM SoC drivers Index: linux-2.6-newcpuidle.git/drivers/cpuidle/cpuidle-haltpoll.c =================================================================== --- /dev/null +++ linux-2.6-newcpuidle.git/drivers/cpuidle/cpuidle-haltpoll.c @@ -0,0 +1,69 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * cpuidle driver for haltpoll governor. + * + * Copyright 2019 Red Hat, Inc. and/or its affiliates. + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ * + * Authors: Marcelo Tosatti <mtosatti@redhat.com> + */ + +#include <linux/init.h> +#include <linux/cpuidle.h> +#include <linux/module.h> +#include <linux/sched/idle.h> +#include <linux/kvm_para.h> + +static int default_enter_idle(struct cpuidle_device *dev, + struct cpuidle_driver *drv, int index) +{ + if (current_clr_polling_and_test()) { + local_irq_enable(); + return index; + } + default_idle(); + return index; +} + +static struct cpuidle_driver haltpoll_driver = { + .name = "haltpoll", + .owner = THIS_MODULE, + .states = { + { /* entry 0 is for polling */ }, + { + .enter = default_enter_idle, + .exit_latency = 1, + .target_residency = 1, + .power_usage = -1, + .name = "haltpoll idle", + .desc = "default architecture idle", + }, + }, + .safe_state_index = 0, + .state_count = 2, +}; + +static int __init haltpoll_init(void) +{ + struct cpuidle_driver *drv = &haltpoll_driver; + + cpuidle_poll_state_init(drv); + + if (!kvm_para_available()) + return 0; + + return cpuidle_register(&haltpoll_driver, NULL); +} + +static void __exit haltpoll_exit(void) +{ + cpuidle_unregister(&haltpoll_driver); +} + +module_init(haltpoll_init); +module_exit(haltpoll_exit); +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Marcelo Tosatti <mtosatti@redhat.com>"); +
Add a "get_poll_time" callback to the cpuidle_governor structure, and change poll state to poll for that amount of time. Provide a default method for it, while allowing individual governors to override it. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> --- drivers/cpuidle/cpuidle.c | 40 ++++++++++++++++++++++++++++++++++++++++ drivers/cpuidle/poll_state.c | 11 ++--------- include/linux/cpuidle.h | 8 ++++++++ 3 files changed, 50 insertions(+), 9 deletions(-) Index: linux-2.6-newcpuidle.git/drivers/cpuidle/cpuidle.c =================================================================== --- linux-2.6-newcpuidle.git.orig/drivers/cpuidle/cpuidle.c +++ linux-2.6-newcpuidle.git/drivers/cpuidle/cpuidle.c @@ -362,6 +362,46 @@ void cpuidle_reflect(struct cpuidle_devi } /** + * cpuidle_default_poll_time - default routine used to return poll time + * governors can override it if necessary + * + * @drv: the cpuidle driver tied with the cpu + * @dev: the cpuidle device + * + */ +static u64 cpuidle_default_poll_time(struct cpuidle_driver *drv, + struct cpuidle_device *dev) +{ + int i; + + for (i = 1; i < drv->state_count; i++) { + if (drv->states[i].disabled || dev->states_usage[i].disable) + continue; + + return (u64)drv->states[i].target_residency * NSEC_PER_USEC; + } + + return TICK_NSEC; +} + +/** + * cpuidle_get_poll_time - tell the polling driver how much time to poll, + * in nanoseconds. 
+ * + * @drv: the cpuidle driver tied with the cpu + * @dev: the cpuidle device + * + */ +u64 cpuidle_get_poll_time(struct cpuidle_driver *drv, + struct cpuidle_device *dev) +{ + if (cpuidle_curr_governor->get_poll_time) + return cpuidle_curr_governor->get_poll_time(drv, dev); + + return cpuidle_default_poll_time(drv, dev); +} + +/** * cpuidle_install_idle_handler - installs the cpuidle idle loop handler */ void cpuidle_install_idle_handler(void) Index: linux-2.6-newcpuidle.git/drivers/cpuidle/poll_state.c =================================================================== --- linux-2.6-newcpuidle.git.orig/drivers/cpuidle/poll_state.c +++ linux-2.6-newcpuidle.git/drivers/cpuidle/poll_state.c @@ -20,16 +20,9 @@ static int __cpuidle poll_idle(struct cp local_irq_enable(); if (!current_set_polling_and_test()) { unsigned int loop_count = 0; - u64 limit = TICK_NSEC; - int i; + u64 limit; - for (i = 1; i < drv->state_count; i++) { - if (drv->states[i].disabled || dev->states_usage[i].disable) - continue; - - limit = (u64)drv->states[i].target_residency * NSEC_PER_USEC; - break; - } + limit = cpuidle_get_poll_time(drv, dev); while (!need_resched()) { cpu_relax(); Index: linux-2.6-newcpuidle.git/include/linux/cpuidle.h =================================================================== --- linux-2.6-newcpuidle.git.orig/include/linux/cpuidle.h +++ linux-2.6-newcpuidle.git/include/linux/cpuidle.h @@ -132,6 +132,8 @@ extern int cpuidle_select(struct cpuidle extern int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev, int index); extern void cpuidle_reflect(struct cpuidle_device *dev, int index); +extern u64 cpuidle_get_poll_time(struct cpuidle_driver *drv, + struct cpuidle_device *dev); extern int cpuidle_register_driver(struct cpuidle_driver *drv); extern struct cpuidle_driver *cpuidle_get_driver(void); @@ -166,6 +168,9 @@ static inline int cpuidle_enter(struct c struct cpuidle_device *dev, int index) {return -ENODEV; } static inline void 
cpuidle_reflect(struct cpuidle_device *dev, int index) { }
+static inline u64 cpuidle_get_poll_time(struct cpuidle_driver *drv,
+					 struct cpuidle_device *dev)
+{return 0; }
 static inline int cpuidle_register_driver(struct cpuidle_driver *drv)
 {return -ENODEV; }
 static inline struct cpuidle_driver *cpuidle_get_driver(void) {return NULL; }
@@ -246,6 +251,9 @@ struct cpuidle_governor {
 					struct cpuidle_device *dev,
 					bool *stop_tick);
 	void (*reflect)		(struct cpuidle_device *dev, int index);
+
+	u64 (*get_poll_time)	(struct cpuidle_driver *drv,
+				 struct cpuidle_device *dev);
 };

 #ifdef CONFIG_CPU_IDLE
The cpuidle_haltpoll governor, in conjunction with the haltpoll cpuidle driver, allows guest vcpus to poll for a specified amount of time before halting. This provides the following benefits to host side polling: 1) The POLL flag is set while polling is performed, which allows a remote vCPU to avoid sending an IPI (and the associated cost of handling the IPI) when performing a wakeup. 2) The VM-exit cost can be avoided. The downside of guest side polling is that polling is performed even with other runnable tasks in the host. Results comparing halt_poll_ns and server/client application where a small packet is ping-ponged: host --> 31.33 halt_poll_ns=300000 / no guest busy spin --> 33.40 (93.8%) halt_poll_ns=0 / guest_halt_poll_ns=300000 --> 32.73 (95.7%) For the SAP HANA benchmarks (where idle_spin is a parameter of the previous version of the patch, results should be the same): hpns == halt_poll_ns idle_spin=0/ idle_spin=800/ idle_spin=0/ hpns=200000 hpns=0 hpns=800000 DeleteC06T03 (100 thread) 1.76 1.71 (-3%) 1.78 (+1%) InsertC16T02 (100 thread) 2.14 2.07 (-3%) 2.18 (+1.8%) DeleteC00T01 (1 thread) 1.34 1.28 (-4.5%) 1.29 (-3.7%) UpdateC00T03 (1 thread) 4.72 4.18 (-12%) 4.53 (-5%) Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> --- Documentation/virtual/guest-halt-polling.txt | 79 ++++++++++++ drivers/cpuidle/Kconfig | 11 + drivers/cpuidle/governors/Makefile | 1 drivers/cpuidle/governors/haltpoll.c | 175 +++++++++++++++++++++++++++ 4 files changed, 266 insertions(+) Index: linux-2.6-newcpuidle.git/drivers/cpuidle/Kconfig =================================================================== --- linux-2.6-newcpuidle.git.orig/drivers/cpuidle/Kconfig +++ linux-2.6-newcpuidle.git/drivers/cpuidle/Kconfig @@ -33,6 +33,17 @@ config CPU_IDLE_GOV_TEO Some workloads benefit from using it and it generally should be safe to use. Say Y here if you are not happy with the alternatives. 
+config CPU_IDLE_GOV_HALTPOLL + bool "Haltpoll governor (for virtualized systems)" + depends on KVM_GUEST + help + This governor implements haltpoll idle state selection, to be + used in conjunction with the haltpoll cpuidle driver, allowing + for polling for a certain amount of time before entering idle + state. + + Some virtualized workloads benefit from using it. + config DT_IDLE_STATES bool Index: linux-2.6-newcpuidle.git/drivers/cpuidle/governors/Makefile =================================================================== --- linux-2.6-newcpuidle.git.orig/drivers/cpuidle/governors/Makefile +++ linux-2.6-newcpuidle.git/drivers/cpuidle/governors/Makefile @@ -6,3 +6,4 @@ obj-$(CONFIG_CPU_IDLE_GOV_LADDER) += ladder.o obj-$(CONFIG_CPU_IDLE_GOV_MENU) += menu.o obj-$(CONFIG_CPU_IDLE_GOV_TEO) += teo.o +obj-$(CONFIG_CPU_IDLE_GOV_HALTPOLL) += haltpoll.o Index: linux-2.6-newcpuidle.git/drivers/cpuidle/governors/haltpoll.c =================================================================== --- /dev/null +++ linux-2.6-newcpuidle.git/drivers/cpuidle/governors/haltpoll.c @@ -0,0 +1,176 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * haltpoll.c - haltpoll idle governor + * + * Copyright 2019 Red Hat, Inc. and/or its affiliates. + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ * + * Authors: Marcelo Tosatti <mtosatti@redhat.com> + */ + +#include <linux/kernel.h> +#include <linux/cpuidle.h> +#include <linux/time.h> +#include <linux/ktime.h> +#include <linux/hrtimer.h> +#include <linux/tick.h> +#include <linux/sched.h> +#include <linux/module.h> +#include <linux/kvm_para.h> + +static unsigned int guest_halt_poll_us __read_mostly = 200; +module_param(guest_halt_poll_us, uint, 0644); + +/* division factor to shrink halt_poll_us */ +static unsigned int guest_halt_poll_shrink __read_mostly = 2; +module_param(guest_halt_poll_shrink, uint, 0644); + +/* multiplication factor to grow per-cpu halt_poll_us */ +static unsigned int guest_halt_poll_grow __read_mostly = 2; +module_param(guest_halt_poll_grow, uint, 0644); + +/* value in us to start growing per-cpu halt_poll_us */ +static unsigned int guest_halt_poll_grow_start __read_mostly = 50; +module_param(guest_halt_poll_grow_start, uint, 0644); + +/* allow shrinking guest halt poll */ +static bool guest_halt_poll_allow_shrink __read_mostly = true; +module_param(guest_halt_poll_allow_shrink, bool, 0644); + +struct haltpoll_device { + int last_state_idx; + unsigned int halt_poll_us; +}; + +static DEFINE_PER_CPU_ALIGNED(struct haltpoll_device, hpoll_devices); + +/** + * haltpoll_select - selects the next idle state to enter + * @drv: cpuidle driver containing state data + * @dev: the CPU + * @stop_tick: indication on whether or not to stop the tick + */ +static int haltpoll_select(struct cpuidle_driver *drv, + struct cpuidle_device *dev, + bool *stop_tick) +{ + struct haltpoll_device *hdev = this_cpu_ptr(&hpoll_devices); + int latency_req = cpuidle_governor_latency_req(dev->cpu); + + if (!drv->state_count || latency_req == 0) { + *stop_tick = false; + return 0; + } + + if (hdev->halt_poll_us == 0) + return 1; + + /* Last state was poll? 
*/ + if (hdev->last_state_idx == 0) { + /* Halt if no event occurred on poll window */ + if (dev->poll_time_limit == true) + return 1; + + *stop_tick = false; + /* Otherwise, poll again */ + return 0; + } + + *stop_tick = false; + /* Last state was halt: poll */ + return 0; +} + +static void adjust_haltpoll_us(unsigned int block_us, + struct haltpoll_device *dev) +{ + unsigned int val; + + /* Grow cpu_halt_poll_us if + * cpu_halt_poll_us < block_ns < guest_halt_poll_us + */ + if (block_us > dev->halt_poll_us && block_us <= guest_halt_poll_us) { + val = dev->halt_poll_us * guest_halt_poll_grow; + + if (val < guest_halt_poll_grow_start) + val = guest_halt_poll_grow_start; + if (val > guest_halt_poll_us) + val = guest_halt_poll_us; + + dev->halt_poll_us = val; + } else if (block_us > guest_halt_poll_us && + guest_halt_poll_allow_shrink) { + unsigned int shrink = guest_halt_poll_shrink; + + val = dev->halt_poll_us; + if (shrink == 0) + val = 0; + else + val /= shrink; + dev->halt_poll_us = val; + } +} + +/** + * haltpoll_reflect - update variables and update poll time + * @dev: the CPU + * @index: the index of actual entered state + */ +static void haltpoll_reflect(struct cpuidle_device *dev, int index) +{ + struct haltpoll_device *hdev = this_cpu_ptr(&hpoll_devices); + + hdev->last_state_idx = index; + + if (index != 0) + adjust_haltpoll_us(dev->last_residency, hdev); +} + +/** + * haltpoll_enable_device - scans a CPU's states and does setup + * @drv: cpuidle driver + * @dev: the CPU + */ +static int haltpoll_enable_device(struct cpuidle_driver *drv, + struct cpuidle_device *dev) +{ + struct haltpoll_device *hdev = &per_cpu(hpoll_devices, dev->cpu); + + memset(hdev, 0, sizeof(struct haltpoll_device)); + + return 0; +} + +/** + * haltpoll_get_poll_time - return amount of poll time + * @drv: cpuidle driver + * @dev: the CPU + */ +static u64 haltpoll_get_poll_time(struct cpuidle_driver *drv, + struct cpuidle_device *dev) +{ + struct haltpoll_device *hdev = 
&per_cpu(hpoll_devices, dev->cpu); + + return hdev->halt_poll_us * NSEC_PER_USEC; +} + +static struct cpuidle_governor haltpoll_governor = { + .name = "haltpoll", + .rating = 21, + .enable = haltpoll_enable_device, + .select = haltpoll_select, + .reflect = haltpoll_reflect, + .get_poll_time = haltpoll_get_poll_time, +}; + +static int __init init_haltpoll(void) +{ + if (kvm_para_available()) + return cpuidle_register_governor(&haltpoll_governor); + + return 0; +} + +postcore_initcall(init_haltpoll); Index: linux-2.6-newcpuidle.git/Documentation/virtual/guest-halt-polling.txt =================================================================== --- /dev/null +++ linux-2.6-newcpuidle.git/Documentation/virtual/guest-halt-polling.txt @@ -0,0 +1,79 @@ +Guest halt polling +================== + +The cpuidle_haltpoll driver, with the haltpoll governor, allows +the guest vcpus to poll for a specified amount of time before +halting. +This provides the following benefits to host side polling: + + 1) The POLL flag is set while polling is performed, which allows + a remote vCPU to avoid sending an IPI (and the associated + cost of handling the IPI) when performing a wakeup. + + 2) The VM-exit cost can be avoided. + +The downside of guest side polling is that polling is performed +even with other runnable tasks in the host. + +The basic logic as follows: A global value, guest_halt_poll_us, +is configured by the user, indicating the maximum amount of +time polling is allowed. This value is fixed. + +Each vcpu has an adjustable guest_halt_poll_us +("per-cpu guest_halt_poll_us"), which is adjusted by the algorithm +in response to events (explained below). + +Module Parameters +================= + +The haltpoll governor has 5 tunable module parameters: + +1) guest_halt_poll_us: +Maximum amount of time, in microseconds, that polling is +performed before halting. 
+
+Default: 200
+
+2) guest_halt_poll_shrink:
+Division factor used to shrink per-cpu guest_halt_poll_us when
+wakeup event occurs after the global guest_halt_poll_us.
+
+Default: 2
+
+3) guest_halt_poll_grow:
+Multiplication factor used to grow per-cpu guest_halt_poll_us
+when event occurs after per-cpu guest_halt_poll_us
+but before global guest_halt_poll_us.
+
+Default: 2
+
+4) guest_halt_poll_grow_start:
+The per-cpu guest_halt_poll_us eventually reaches zero
+in case of an idle system. This value sets the initial
+per-cpu guest_halt_poll_us when growing. This can
+be increased from 10, to avoid misses during the initial
+growth stage:
+
+10, 20, 40, ... (example assumes guest_halt_poll_grow=2).
+
+Default: 50
+
+5) guest_halt_poll_allow_shrink:
+
+Bool parameter which allows shrinking. Set to N
+to avoid it (per-cpu guest_halt_poll_us will remain
+high once it reaches the global guest_halt_poll_us value).
+
+Default: Y
+
+The module parameters can be set from the sysfs files in:
+
+	/sys/module/haltpoll/parameters/
+
+Further Notes
+=============
+
+- Care should be taken when setting the guest_halt_poll_us parameter as a
+large value has the potential to drive the cpu usage to 100% on a machine which
+would be almost entirely idle otherwise.
+
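The grow/shrink rules described in the documentation above can be tried out in user space. The following is only a sketch that mirrors the logic of adjust_haltpoll_us() from this patch; the struct haltpoll_params type, the haltpoll_defaults object and the next_poll_us() helper are illustrative names that do not exist in the patch, and the defaults are copied from the module parameters.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in for the five module parameters. */
struct haltpoll_params {
	unsigned int max_us;		/* guest_halt_poll_us */
	unsigned int shrink;		/* guest_halt_poll_shrink */
	unsigned int grow;		/* guest_halt_poll_grow */
	unsigned int grow_start;	/* guest_halt_poll_grow_start */
	bool allow_shrink;		/* guest_halt_poll_allow_shrink */
};

/* Defaults as listed in the patch. */
static const struct haltpoll_params haltpoll_defaults = {
	.max_us = 200,
	.shrink = 2,
	.grow = 2,
	.grow_start = 50,
	.allow_shrink = true,
};

/*
 * Return the next per-cpu poll window, given the current window (cur_us)
 * and how long the vCPU stayed halted before the wakeup (block_us).
 */
static unsigned int next_poll_us(const struct haltpoll_params *p,
				 unsigned int cur_us, unsigned int block_us)
{
	if (block_us > cur_us && block_us <= p->max_us) {
		/* Wakeup came after the window but within the global limit: grow. */
		unsigned int val = cur_us * p->grow;

		if (val < p->grow_start)
			val = p->grow_start;
		if (val > p->max_us)
			val = p->max_us;
		return val;
	}

	if (block_us > p->max_us && p->allow_shrink) {
		/* Wakeup came after the global limit: shrink (0 disables polling). */
		return p->shrink ? cur_us / p->shrink : 0;
	}

	/* Otherwise leave the window unchanged. */
	return cur_us;
}
```

With the defaults, a window of 0 grows to 50 on the first qualifying wakeup, then to 100 and 200, and is halved again after a halt longer than 200us.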
Add an MSR which allows the guest to disable host polling (specifically,
the cpuidle-haltpoll driver, when performing polling in the guest,
disables host side polling).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 Documentation/virtual/kvm/msr.txt    |    9 +++++++++
 arch/x86/include/asm/kvm_host.h      |    2 ++
 arch/x86/include/uapi/asm/kvm_para.h |    2 ++
 arch/x86/kvm/Kconfig                 |    1 +
 arch/x86/kvm/cpuid.c                 |    3 ++-
 arch/x86/kvm/x86.c                   |   23 +++++++++++++++++++++++
 6 files changed, 39 insertions(+), 1 deletion(-)

Index: linux-2.6-newcpuidle.git/Documentation/virtual/kvm/msr.txt
===================================================================
--- linux-2.6-newcpuidle.git.orig/Documentation/virtual/kvm/msr.txt
+++ linux-2.6-newcpuidle.git/Documentation/virtual/kvm/msr.txt
@@ -273,3 +273,12 @@ MSR_KVM_EOI_EN: 0x4b564d04
 	guest must both read the least significant bit in the memory area and
 	clear it using a single CPU instruction, such as test and clear, or
 	compare and exchange.
+
+MSR_KVM_POLL_CONTROL: 0x4b564d05
+	Control host side polling.
+
+	data: Bit 0 enables (1) or disables (0) host halt poll
+	logic.
+	KVM guests can disable host halt polling when performing
+	polling themselves.
+ Index: linux-2.6-newcpuidle.git/arch/x86/include/asm/kvm_host.h =================================================================== --- linux-2.6-newcpuidle.git.orig/arch/x86/include/asm/kvm_host.h +++ linux-2.6-newcpuidle.git/arch/x86/include/asm/kvm_host.h @@ -752,6 +752,8 @@ struct kvm_vcpu_arch { struct gfn_to_hva_cache data; } pv_eoi; + u64 msr_kvm_poll_control; + /* * Indicate whether the access faults on its page table in guest * which is set when fix page fault and used to detect unhandeable Index: linux-2.6-newcpuidle.git/arch/x86/include/uapi/asm/kvm_para.h =================================================================== --- linux-2.6-newcpuidle.git.orig/arch/x86/include/uapi/asm/kvm_para.h +++ linux-2.6-newcpuidle.git/arch/x86/include/uapi/asm/kvm_para.h @@ -29,6 +29,7 @@ #define KVM_FEATURE_PV_TLB_FLUSH 9 #define KVM_FEATURE_ASYNC_PF_VMEXIT 10 #define KVM_FEATURE_PV_SEND_IPI 11 +#define KVM_FEATURE_POLL_CONTROL 12 #define KVM_HINTS_REALTIME 0 @@ -47,6 +48,7 @@ #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 #define MSR_KVM_STEAL_TIME 0x4b564d03 #define MSR_KVM_PV_EOI_EN 0x4b564d04 +#define MSR_KVM_POLL_CONTROL 0x4b564d05 struct kvm_steal_time { __u64 steal; Index: linux-2.6-newcpuidle.git/arch/x86/kvm/Kconfig =================================================================== --- linux-2.6-newcpuidle.git.orig/arch/x86/kvm/Kconfig +++ linux-2.6-newcpuidle.git/arch/x86/kvm/Kconfig @@ -41,6 +41,7 @@ config KVM select PERF_EVENTS select HAVE_KVM_MSI select HAVE_KVM_CPU_RELAX_INTERCEPT + select HAVE_KVM_NO_POLL select KVM_GENERIC_DIRTYLOG_READ_PROTECT select KVM_VFIO select SRCU Index: linux-2.6-newcpuidle.git/arch/x86/kvm/cpuid.c =================================================================== --- linux-2.6-newcpuidle.git.orig/arch/x86/kvm/cpuid.c +++ linux-2.6-newcpuidle.git/arch/x86/kvm/cpuid.c @@ -640,7 +640,8 @@ static inline int __do_cpuid_ent(struct (1 << KVM_FEATURE_PV_UNHALT) | (1 << KVM_FEATURE_PV_TLB_FLUSH) | (1 << KVM_FEATURE_ASYNC_PF_VMEXIT) | - 
(1 << KVM_FEATURE_PV_SEND_IPI); + (1 << KVM_FEATURE_PV_SEND_IPI) | + (1 << KVM_FEATURE_POLL_CONTROL); if (sched_info_on()) entry->eax |= (1 << KVM_FEATURE_STEAL_TIME); Index: linux-2.6-newcpuidle.git/arch/x86/kvm/x86.c =================================================================== --- linux-2.6-newcpuidle.git.orig/arch/x86/kvm/x86.c +++ linux-2.6-newcpuidle.git/arch/x86/kvm/x86.c @@ -1174,6 +1174,7 @@ static u32 emulated_msrs[] = { MSR_IA32_POWER_CTL, MSR_K7_HWCR, + MSR_KVM_POLL_CONTROL, }; static unsigned num_emulated_msrs; @@ -2625,6 +2626,14 @@ int kvm_set_msr_common(struct kvm_vcpu * return 1; break; + case MSR_KVM_POLL_CONTROL: + /* only enable bit supported */ + if (data & (-1ULL << 1)) + return 1; + + vcpu->arch.msr_kvm_poll_control = data; + break; + case MSR_IA32_MCG_CTL: case MSR_IA32_MCG_STATUS: case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1: @@ -2874,6 +2883,9 @@ int kvm_get_msr_common(struct kvm_vcpu * case MSR_KVM_PV_EOI_EN: msr_info->data = vcpu->arch.pv_eoi.msr_val; break; + case MSR_KVM_POLL_CONTROL: + msr_info->data = vcpu->arch.msr_kvm_poll_control; + break; case MSR_IA32_P5_MC_ADDR: case MSR_IA32_P5_MC_TYPE: case MSR_IA32_MCG_CAP: @@ -8874,6 +8886,10 @@ void kvm_arch_vcpu_postcreate(struct kvm msr.host_initiated = true; kvm_write_tsc(vcpu, &msr); vcpu_put(vcpu); + + /* poll control enabled by default */ + vcpu->arch.msr_kvm_poll_control = 1; + mutex_unlock(&vcpu->mutex); if (!kvmclock_periodic_sync) @@ -9948,6 +9964,13 @@ bool kvm_vector_hashing_enabled(void) } EXPORT_SYMBOL_GPL(kvm_vector_hashing_enabled); +bool kvm_arch_no_poll(struct kvm_vcpu *vcpu) +{ + return (vcpu->arch.msr_kvm_poll_control & 1) == 0; +} +EXPORT_SYMBOL_GPL(kvm_arch_no_poll); + + EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
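The MSR semantics in this patch (only bit 0 is valid, the value defaults to 1, and host polling is skipped when bit 0 is clear) can be modelled outside the kernel. This is a minimal sketch under that reading of the diff; struct vcpu_model, set_poll_control() and no_poll() are made-up names standing in for the vcpu->arch field, the kvm_set_msr_common() case and kvm_arch_no_poll().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the one vcpu->arch field the patch adds. */
struct vcpu_model {
	uint64_t msr_kvm_poll_control;
};

/*
 * Model of the MSR write path: returns 1 (rejected) if any bit other
 * than bit 0 is set, otherwise stores the value and returns 0.
 * ~1ULL is the same mask as the patch's (-1ULL << 1).
 */
static int set_poll_control(struct vcpu_model *v, uint64_t data)
{
	if (data & ~1ULL)
		return 1;

	v->msr_kvm_poll_control = data;
	return 0;
}

/* Model of kvm_arch_no_poll(): true when the guest disabled host polling. */
static bool no_poll(const struct vcpu_model *v)
{
	return (v->msr_kvm_poll_control & 1) == 0;
}
```

A rejected write leaves the previous value in place, so a guest that writes a reserved bit does not accidentally change the host polling behaviour.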
When performing guest side polling, it is not necessary to also perform host side polling. So disable host side polling, via the new MSR interface, when loading cpuidle-haltpoll driver. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> --- arch/x86/Kconfig | 7 +++++ arch/x86/include/asm/cpuidle_haltpoll.h | 8 ++++++ arch/x86/kernel/kvm.c | 42 ++++++++++++++++++++++++++++++++ drivers/cpuidle/cpuidle-haltpoll.c | 10 ++++++- include/linux/cpuidle_haltpoll.h | 16 ++++++++++++ 5 files changed, 82 insertions(+), 1 deletion(-) Index: linux-2.6-newcpuidle.git/arch/x86/include/asm/cpuidle_haltpoll.h =================================================================== --- /dev/null +++ linux-2.6-newcpuidle.git/arch/x86/include/asm/cpuidle_haltpoll.h @@ -0,0 +1,8 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ARCH_HALTPOLL_H +#define _ARCH_HALTPOLL_H + +void arch_haltpoll_enable(void); +void arch_haltpoll_disable(void); + +#endif Index: linux-2.6-newcpuidle.git/drivers/cpuidle/cpuidle-haltpoll.c =================================================================== --- linux-2.6-newcpuidle.git.orig/drivers/cpuidle/cpuidle-haltpoll.c +++ linux-2.6-newcpuidle.git/drivers/cpuidle/cpuidle-haltpoll.c @@ -15,6 +15,7 @@ #include <linux/module.h> #include <linux/sched/idle.h> #include <linux/kvm_para.h> +#include <linux/cpuidle_haltpoll.h> static int default_enter_idle(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) @@ -47,6 +48,7 @@ static struct cpuidle_driver haltpoll_dr static int __init haltpoll_init(void) { + int ret; struct cpuidle_driver *drv = &haltpoll_driver; cpuidle_poll_state_init(drv); @@ -54,11 +56,16 @@ static int __init haltpoll_init(void) if (!kvm_para_available()) return 0; - return cpuidle_register(&haltpoll_driver, NULL); + ret = cpuidle_register(&haltpoll_driver, NULL); + if (ret == 0) + arch_haltpoll_enable(); + + return ret; } static void __exit haltpoll_exit(void) { + arch_haltpoll_disable(); cpuidle_unregister(&haltpoll_driver); } 
Index: linux-2.6-newcpuidle.git/include/linux/cpuidle_haltpoll.h =================================================================== --- /dev/null +++ linux-2.6-newcpuidle.git/include/linux/cpuidle_haltpoll.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _CPUIDLE_HALTPOLL_H +#define _CPUIDLE_HALTPOLL_H + +#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL +#include <asm/cpuidle_haltpoll.h> +#else +static inline void arch_haltpoll_enable(void) +{ +} + +static inline void arch_haltpoll_disable(void) +{ +} +#endif +#endif Index: linux-2.6-newcpuidle.git/arch/x86/Kconfig =================================================================== --- linux-2.6-newcpuidle.git.orig/arch/x86/Kconfig +++ linux-2.6-newcpuidle.git/arch/x86/Kconfig @@ -787,6 +787,7 @@ config KVM_GUEST bool "KVM Guest support (including kvmclock)" depends on PARAVIRT select PARAVIRT_CLOCK + select ARCH_CPUIDLE_HALTPOLL default y ---help--- This option enables various optimizations for running under the KVM @@ -795,6 +796,12 @@ config KVM_GUEST underlying device model, the host provides the guest with timing infrastructure such as time of day, and system time +config ARCH_CPUIDLE_HALTPOLL + def_bool n + prompt "Disable host haltpoll when loading haltpoll driver" + help + If virtualized under KVM, disable host haltpoll. 
+
 config PVH
 	bool "Support for running PVH guests"
 	---help---
Index: linux-2.6-newcpuidle.git/arch/x86/kernel/kvm.c
===================================================================
--- linux-2.6-newcpuidle.git.orig/arch/x86/kernel/kvm.c
+++ linux-2.6-newcpuidle.git/arch/x86/kernel/kvm.c
@@ -853,3 +853,45 @@ void __init kvm_spinlock_init(void)
 }

 #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
+
+#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL
+
+static void kvm_disable_host_haltpoll(void *i)
+{
+	wrmsrl(MSR_KVM_POLL_CONTROL, 0);
+}
+
+static void kvm_enable_host_haltpoll(void *i)
+{
+	wrmsrl(MSR_KVM_POLL_CONTROL, 1);
+}
+
+void arch_haltpoll_enable(void)
+{
+	if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL)) {
+		printk(KERN_ERR "kvm: host does not support poll control\n");
+		printk(KERN_ERR "kvm: host upgrade recommended\n");
+		return;
+	}
+
+	preempt_disable();
+	/* Enabling guest halt poll disables host halt poll */
+	kvm_disable_host_haltpoll(NULL);
+	smp_call_function(kvm_disable_host_haltpoll, NULL, 1);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(arch_haltpoll_enable);
+
+void arch_haltpoll_disable(void)
+{
+	if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL))
+		return;
+
+	preempt_disable();
+	/* Disabling guest halt poll re-enables host halt poll */
+	kvm_enable_host_haltpoll(NULL);
+	smp_call_function(kvm_enable_host_haltpoll, NULL, 1);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(arch_haltpoll_disable);
+#endif
On Mon, Jul 1, 2019 at 8:57 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> Add a "get_poll_time" callback to the cpuidle_governor structure,
> and change poll state to poll for that amount of time.
>
> Provide a default method for it, while allowing individual governors
> to override it.
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

I had ACKed this before, but then it occurred to me that it would be
less intrusive to use a new field, say poll_limit_ns (equal to 0 by
default), in struct cpuidle_device.

>
> ---
>  drivers/cpuidle/cpuidle.c    |   40 ++++++++++++++++++++++++++++++++++++++++
>  drivers/cpuidle/poll_state.c |   11 ++---------
>  include/linux/cpuidle.h      |    8 ++++++++
>  3 files changed, 50 insertions(+), 9 deletions(-)
>
> Index: linux-2.6-newcpuidle.git/drivers/cpuidle/cpuidle.c
> ===================================================================
> --- linux-2.6-newcpuidle.git.orig/drivers/cpuidle/cpuidle.c
> +++ linux-2.6-newcpuidle.git/drivers/cpuidle/cpuidle.c
> @@ -362,6 +362,46 @@ void cpuidle_reflect(struct cpuidle_devi
>  }
>
>  /**
> + * cpuidle_default_poll_time - default routine used to return poll time
> + * governors can override it if necessary
> + *
> + * @drv:   the cpuidle driver tied with the cpu
> + * @dev:   the cpuidle device
> + *
> + */
> +static u64 cpuidle_default_poll_time(struct cpuidle_driver *drv,
> +                                    struct cpuidle_device *dev)

With this new field in place this could be called cpuidle_poll_time() and ->

> +{
> +       int i;

-> do something like this here:

if (dev->poll_limit_ns)
        return dev->poll_limit_ns;

and the governor changes below wouldn't be necessary any more.

Then, the governor could update poll_limit_ns if it wanted to override
the default.

It also would be possible to use poll_limit_ns as a sort of poll limit
cache to store the last value in it and clear it on state
disable/enable to avoid the search through the states every time even
without haltpoll.

> +
> +	for (i = 1; i < drv->state_count; i++) {
> +		if (drv->states[i].disabled || dev->states_usage[i].disable)
> +			continue;
> +
> +		return (u64)drv->states[i].target_residency * NSEC_PER_USEC;
> +	}
> +
> +	return TICK_NSEC;
> +}
> +
> +/**
> + * cpuidle_get_poll_time - tell the polling driver how much time to poll,
> + *			   in nanoseconds.
> + *
> + * @drv: the cpuidle driver tied with the cpu
> + * @dev: the cpuidle device
> + *
> + */
> +u64 cpuidle_get_poll_time(struct cpuidle_driver *drv,
> +			  struct cpuidle_device *dev)
> +{
> +	if (cpuidle_curr_governor->get_poll_time)
> +		return cpuidle_curr_governor->get_poll_time(drv, dev);
> +
> +	return cpuidle_default_poll_time(drv, dev);
> +}
> +
> +/**
>   * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
>   */
>  void cpuidle_install_idle_handler(void)
> Index: linux-2.6-newcpuidle.git/drivers/cpuidle/poll_state.c
> ===================================================================
> --- linux-2.6-newcpuidle.git.orig/drivers/cpuidle/poll_state.c
> +++ linux-2.6-newcpuidle.git/drivers/cpuidle/poll_state.c
> @@ -20,16 +20,9 @@ static int __cpuidle poll_idle(struct cp
>  	local_irq_enable();
>  	if (!current_set_polling_and_test()) {
>  		unsigned int loop_count = 0;
> -		u64 limit = TICK_NSEC;
> -		int i;
> +		u64 limit;
>
> -		for (i = 1; i < drv->state_count; i++) {
> -			if (drv->states[i].disabled || dev->states_usage[i].disable)
> -				continue;
> -
> -			limit = (u64)drv->states[i].target_residency * NSEC_PER_USEC;
> -			break;
> -		}
> +		limit = cpuidle_get_poll_time(drv, dev);
>
>  		while (!need_resched()) {
>  			cpu_relax();
> Index: linux-2.6-newcpuidle.git/include/linux/cpuidle.h
> ===================================================================
> --- linux-2.6-newcpuidle.git.orig/include/linux/cpuidle.h
> +++ linux-2.6-newcpuidle.git/include/linux/cpuidle.h
> @@ -132,6 +132,8 @@ extern int cpuidle_select(struct cpuidle
>  extern int cpuidle_enter(struct cpuidle_driver *drv,
>  			 struct cpuidle_device *dev, int index);
>  extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
> +extern u64 cpuidle_get_poll_time(struct cpuidle_driver *drv,
> +				 struct cpuidle_device *dev);
>
>  extern int cpuidle_register_driver(struct cpuidle_driver *drv);
>  extern struct cpuidle_driver *cpuidle_get_driver(void);
> @@ -166,6 +168,9 @@ static inline int cpuidle_enter(struct c
>  				struct cpuidle_device *dev, int index)
>  {return -ENODEV; }
>  static inline void cpuidle_reflect(struct cpuidle_device *dev, int index) { }
> +extern u64 cpuidle_get_poll_time(struct cpuidle_driver *drv,
> +				 struct cpuidle_device *dev)
> +{return 0; }
>  static inline int cpuidle_register_driver(struct cpuidle_driver *drv)
>  {return -ENODEV; }
>  static inline struct cpuidle_driver *cpuidle_get_driver(void) {return NULL; }
> @@ -246,6 +251,9 @@ struct cpuidle_governor {
>  					struct cpuidle_device *dev,
>  					bool *stop_tick);
>  	void (*reflect)		(struct cpuidle_device *dev, int index);
> +
> +	u64 (*get_poll_time)	(struct cpuidle_driver *drv,
> +				 struct cpuidle_device *dev);
>  };
>
>  #ifdef CONFIG_CPU_IDLE
>
>
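
The review comment on this patch (a poll_limit_ns field in struct
cpuidle_device that is checked before the state search) can be sketched
in userspace C. This is only a model under stated assumptions, not
kernel code: the model_* structures are simplified stand-ins for the
real ones, TICK_NSEC is pinned to the value it would have with HZ=250,
and poll_limit_ns is the hypothetical field name taken from the reply.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL
#define TICK_NSEC     4000000ULL  /* assumption: one tick at HZ=250 */
#define MAX_STATES    4

/* Simplified stand-ins for the kernel structures (not the real layouts). */
struct model_state {
	bool disabled;
	unsigned int target_residency_us;
};

struct model_driver {
	int state_count;
	struct model_state states[MAX_STATES];
};

struct model_device {
	uint64_t poll_limit_ns;  /* hypothetical field; 0 means "not set" */
};

/*
 * Return the poll limit in nanoseconds. A non-zero poll_limit_ns wins
 * (set by a governor, or cached by an earlier call); otherwise fall
 * back to the target residency of the first enabled non-polling state,
 * or to one tick if there is none.
 */
static uint64_t model_poll_time(const struct model_driver *drv,
				struct model_device *dev)
{
	uint64_t limit = TICK_NSEC;
	int i;

	if (dev->poll_limit_ns)
		return dev->poll_limit_ns;

	/* State 0 is the polling state itself, so start at 1. */
	for (i = 1; i < drv->state_count; i++) {
		if (drv->states[i].disabled)
			continue;
		limit = (uint64_t)drv->states[i].target_residency_us *
			NSEC_PER_USEC;
		break;
	}

	/* Cache the result; a real version would clear this whenever a
	 * state is enabled or disabled. */
	dev->poll_limit_ns = limit;
	return limit;
}
```

With this shape a governor that wants a different limit only writes
dev->poll_limit_ns, and no extra governor callback is needed. One
consequence of sharing the field is that a cached default and a
governor-supplied value look the same to the caller.
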
On Mon, Jul 1, 2019 at 8:57 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> Add a cpuidle driver that calls the architecture default_idle routine.
>
> To be used in conjunction with the haltpoll governor.
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

>
> ---
>  arch/x86/kernel/process.c          |    2 -
>  drivers/cpuidle/Kconfig            |    9 +++++
>  drivers/cpuidle/Makefile           |    1
>  drivers/cpuidle/cpuidle-haltpoll.c |   65 +++++++++++++++++++++++++++++++++++++
>  4 files changed, 76 insertions(+), 1 deletion(-)
>
> Index: linux-2.6-newcpuidle.git/arch/x86/kernel/process.c
> ===================================================================
> --- linux-2.6-newcpuidle.git.orig/arch/x86/kernel/process.c
> +++ linux-2.6-newcpuidle.git/arch/x86/kernel/process.c
> @@ -580,7 +580,7 @@ void __cpuidle default_idle(void)
>  	safe_halt();
>  	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
>  }
> -#ifdef CONFIG_APM_MODULE
> +#if defined(CONFIG_APM_MODULE) || defined(CONFIG_HALTPOLL_CPUIDLE_MODULE)
>  EXPORT_SYMBOL(default_idle);
>  #endif
>
> Index: linux-2.6-newcpuidle.git/drivers/cpuidle/Kconfig
> ===================================================================
> --- linux-2.6-newcpuidle.git.orig/drivers/cpuidle/Kconfig
> +++ linux-2.6-newcpuidle.git/drivers/cpuidle/Kconfig
> @@ -51,6 +51,15 @@ depends on PPC
>  source "drivers/cpuidle/Kconfig.powerpc"
>  endmenu
>
> +config HALTPOLL_CPUIDLE
> +	tristate "Halt poll cpuidle driver"
> +	depends on X86 && KVM_GUEST
> +	default y
> +	help
> +	  This option enables halt poll cpuidle driver, which allows to poll
> +	  before halting in the guest (more efficient than polling in the
> +	  host via halt_poll_ns for some scenarios).
> +
>  endif
>
>  config ARCH_NEEDS_CPU_IDLE_COUPLED
> Index: linux-2.6-newcpuidle.git/drivers/cpuidle/Makefile
> ===================================================================
> --- linux-2.6-newcpuidle.git.orig/drivers/cpuidle/Makefile
> +++ linux-2.6-newcpuidle.git/drivers/cpuidle/Makefile
> @@ -7,6 +7,7 @@ obj-y += cpuidle.o driver.o governor.o s
>  obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
>  obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o
>  obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o
> +obj-$(CONFIG_HALTPOLL_CPUIDLE) += cpuidle-haltpoll.o
>
>  ##################################################################################
>  # ARM SoC drivers
> Index: linux-2.6-newcpuidle.git/drivers/cpuidle/cpuidle-haltpoll.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6-newcpuidle.git/drivers/cpuidle/cpuidle-haltpoll.c
> @@ -0,0 +1,69 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * cpuidle driver for haltpoll governor.
> + *
> + * Copyright 2019 Red Hat, Inc. and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + *
> + * Authors: Marcelo Tosatti <mtosatti@redhat.com>
> + */
> +
> +#include <linux/init.h>
> +#include <linux/cpuidle.h>
> +#include <linux/module.h>
> +#include <linux/sched/idle.h>
> +#include <linux/kvm_para.h>
> +
> +static int default_enter_idle(struct cpuidle_device *dev,
> +			      struct cpuidle_driver *drv, int index)
> +{
> +	if (current_clr_polling_and_test()) {
> +		local_irq_enable();
> +		return index;
> +	}
> +	default_idle();
> +	return index;
> +}
> +
> +static struct cpuidle_driver haltpoll_driver = {
> +	.name = "haltpoll",
> +	.owner = THIS_MODULE,
> +	.states = {
> +		{ /* entry 0 is for polling */ },
> +		{
> +			.enter = default_enter_idle,
> +			.exit_latency = 1,
> +			.target_residency = 1,
> +			.power_usage = -1,
> +			.name = "haltpoll idle",
> +			.desc = "default architecture idle",
> +		},
> +	},
> +	.safe_state_index = 0,
> +	.state_count = 2,
> +};
> +
> +static int __init haltpoll_init(void)
> +{
> +	struct cpuidle_driver *drv = &haltpoll_driver;
> +
> +	cpuidle_poll_state_init(drv);
> +
> +	if (!kvm_para_available())
> +		return 0;
> +
> +	return cpuidle_register(&haltpoll_driver, NULL);
> +}
> +
> +static void __exit haltpoll_exit(void)
> +{
> +	cpuidle_unregister(&haltpoll_driver);
> +}
> +
> +module_init(haltpoll_init);
> +module_exit(haltpoll_exit);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Marcelo Tosatti <mtosatti@redhat.com>");
> +
>
>
On Mon, Jul 1, 2019 at 8:57 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> The cpuidle_haltpoll governor, in conjunction with the haltpoll cpuidle
> driver, allows guest vcpus to poll for a specified amount of time before
> halting.
> This provides the following benefits to host side polling:
>
>         1) The POLL flag is set while polling is performed, which allows
>            a remote vCPU to avoid sending an IPI (and the associated
>            cost of handling the IPI) when performing a wakeup.
>
>         2) The VM-exit cost can be avoided.
>
> The downside of guest side polling is that polling is performed
> even with other runnable tasks in the host.
>
> Results comparing halt_poll_ns and server/client application
> where a small packet is ping-ponged:
>
> host                                        --> 31.33
> halt_poll_ns=300000 / no guest busy spin    --> 33.40   (93.8%)
> halt_poll_ns=0 / guest_halt_poll_ns=300000  --> 32.73   (95.7%)
>
> For the SAP HANA benchmarks (where idle_spin is a parameter
> of the previous version of the patch, results should be the
> same):
>
> hpns == halt_poll_ns
>
>                            idle_spin=0/   idle_spin=800/   idle_spin=0/
>                            hpns=200000    hpns=0           hpns=800000
> DeleteC06T03 (100 thread)  1.76           1.71 (-3%)       1.78 (+1%)
> InsertC16T02 (100 thread)  2.14           2.07 (-3%)       2.18 (+1.8%)
> DeleteC00T01 (1 thread)    1.34           1.28 (-4.5%)     1.29 (-3.7%)
> UpdateC00T03 (1 thread)    4.72           4.18 (-12%)      4.53 (-5%)
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>
>
> ---
>  Documentation/virtual/guest-halt-polling.txt |   79 ++++++++++++
>  drivers/cpuidle/Kconfig                      |   11 +
>  drivers/cpuidle/governors/Makefile           |    1
>  drivers/cpuidle/governors/haltpoll.c         |  175 +++++++++++++++++++++++++++
>  4 files changed, 266 insertions(+)
>
> Index: linux-2.6-newcpuidle.git/drivers/cpuidle/Kconfig
> ===================================================================
> --- linux-2.6-newcpuidle.git.orig/drivers/cpuidle/Kconfig
> +++ linux-2.6-newcpuidle.git/drivers/cpuidle/Kconfig
> @@ -33,6 +33,17 @@ config CPU_IDLE_GOV_TEO
>  	  Some workloads benefit from using it and it generally should be safe
>  	  to use. Say Y here if you are not happy with the alternatives.
>
> +config CPU_IDLE_GOV_HALTPOLL
> +	bool "Haltpoll governor (for virtualized systems)"
> +	depends on KVM_GUEST
> +	help
> +	  This governor implements haltpoll idle state selection, to be
> +	  used in conjunction with the haltpoll cpuidle driver, allowing
> +	  for polling for a certain amount of time before entering idle
> +	  state.
> +
> +	  Some virtualized workloads benefit from using it.
> +
>  config DT_IDLE_STATES
>  	bool
>
> Index: linux-2.6-newcpuidle.git/drivers/cpuidle/governors/Makefile
> ===================================================================
> --- linux-2.6-newcpuidle.git.orig/drivers/cpuidle/governors/Makefile
> +++ linux-2.6-newcpuidle.git/drivers/cpuidle/governors/Makefile
> @@ -6,3 +6,4 @@
>  obj-$(CONFIG_CPU_IDLE_GOV_LADDER) += ladder.o
>  obj-$(CONFIG_CPU_IDLE_GOV_MENU) += menu.o
>  obj-$(CONFIG_CPU_IDLE_GOV_TEO) += teo.o
> +obj-$(CONFIG_CPU_IDLE_GOV_HALTPOLL) += haltpoll.o
> Index: linux-2.6-newcpuidle.git/drivers/cpuidle/governors/haltpoll.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6-newcpuidle.git/drivers/cpuidle/governors/haltpoll.c
> @@ -0,0 +1,176 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * haltpoll.c - haltpoll idle governor
> + *
> + * Copyright 2019 Red Hat, Inc. and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + *
> + * Authors: Marcelo Tosatti <mtosatti@redhat.com>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/cpuidle.h>
> +#include <linux/time.h>
> +#include <linux/ktime.h>
> +#include <linux/hrtimer.h>
> +#include <linux/tick.h>
> +#include <linux/sched.h>
> +#include <linux/module.h>
> +#include <linux/kvm_para.h>
> +
> +static unsigned int guest_halt_poll_us __read_mostly = 200;
> +module_param(guest_halt_poll_us, uint, 0644);
> +
> +/* division factor to shrink halt_poll_us */
> +static unsigned int guest_halt_poll_shrink __read_mostly = 2;
> +module_param(guest_halt_poll_shrink, uint, 0644);
> +
> +/* multiplication factor to grow per-cpu halt_poll_us */
> +static unsigned int guest_halt_poll_grow __read_mostly = 2;
> +module_param(guest_halt_poll_grow, uint, 0644);
> +
> +/* value in us to start growing per-cpu halt_poll_us */
> +static unsigned int guest_halt_poll_grow_start __read_mostly = 50;
> +module_param(guest_halt_poll_grow_start, uint, 0644);
> +
> +/* allow shrinking guest halt poll */
> +static bool guest_halt_poll_allow_shrink __read_mostly = true;
> +module_param(guest_halt_poll_allow_shrink, bool, 0644);
> +
> +struct haltpoll_device {
> +	int last_state_idx;
> +	unsigned int halt_poll_us;
> +};

Say you have poll_limit_ns in struct cpuidle_device as mentioned in
the other reply. Since all of the existing governors use
last_state_idx (or equivalent), that could be moved to struct
cpuidle_device too, in principle.

Would you still need the new structure here then?

> +
> +static DEFINE_PER_CPU_ALIGNED(struct haltpoll_device, hpoll_devices);
> +
> +/**
> + * haltpoll_select - selects the next idle state to enter
> + * @drv: cpuidle driver containing state data
> + * @dev: the CPU
> + * @stop_tick: indication on whether or not to stop the tick
> + */
> +static int haltpoll_select(struct cpuidle_driver *drv,
> +			   struct cpuidle_device *dev,
> +			   bool *stop_tick)
> +{
> +	struct haltpoll_device *hdev = this_cpu_ptr(&hpoll_devices);
> +	int latency_req = cpuidle_governor_latency_req(dev->cpu);
> +
> +	if (!drv->state_count || latency_req == 0) {
> +		*stop_tick = false;
> +		return 0;
> +	}
> +
> +	if (hdev->halt_poll_us == 0)
> +		return 1;
> +
> +	/* Last state was poll? */
> +	if (hdev->last_state_idx == 0) {
> +		/* Halt if no event occurred on poll window */
> +		if (dev->poll_time_limit == true)
> +			return 1;
> +
> +		*stop_tick = false;
> +		/* Otherwise, poll again */
> +		return 0;
> +	}
> +
> +	*stop_tick = false;
> +	/* Last state was halt: poll */
> +	return 0;
> +}
> +
> +static void adjust_haltpoll_us(unsigned int block_us,
> +			       struct haltpoll_device *dev)
> +{
> +	unsigned int val;
> +
> +	/* Grow cpu_halt_poll_us if
> +	 * cpu_halt_poll_us < block_ns < guest_halt_poll_us
> +	 */
> +	if (block_us > dev->halt_poll_us && block_us <= guest_halt_poll_us) {
> +		val = dev->halt_poll_us * guest_halt_poll_grow;
> +
> +		if (val < guest_halt_poll_grow_start)
> +			val = guest_halt_poll_grow_start;
> +		if (val > guest_halt_poll_us)
> +			val = guest_halt_poll_us;
> +
> +		dev->halt_poll_us = val;
> +	} else if (block_us > guest_halt_poll_us &&
> +		   guest_halt_poll_allow_shrink) {
> +		unsigned int shrink = guest_halt_poll_shrink;
> +
> +		val = dev->halt_poll_us;
> +		if (shrink == 0)
> +			val = 0;
> +		else
> +			val /= shrink;
> +		dev->halt_poll_us = val;
> +	}
> +}
> +
> +/**
> + * haltpoll_reflect - update variables and update poll time
> + * @dev: the CPU
> + * @index: the index of actual entered state
> + */
> +static void haltpoll_reflect(struct cpuidle_device *dev, int index)
> +{
> +	struct haltpoll_device *hdev = this_cpu_ptr(&hpoll_devices);
> +
> +	hdev->last_state_idx = index;
> +
> +	if (index != 0)
> +		adjust_haltpoll_us(dev->last_residency, hdev);
> +}
> +
> +/**
> + * haltpoll_enable_device - scans a CPU's states and does setup
> + * @drv: cpuidle driver
> + * @dev: the CPU
> + */
> +static int haltpoll_enable_device(struct cpuidle_driver *drv,
> +				  struct cpuidle_device *dev)
> +{
> +	struct haltpoll_device *hdev = &per_cpu(hpoll_devices, dev->cpu);
> +
> +	memset(hdev, 0, sizeof(struct haltpoll_device));
> +
> +	return 0;
> +}
> +
> +/**
> + * haltpoll_get_poll_time - return amount of poll time
> + * @drv: cpuidle driver
> + * @dev: the CPU
> + */
> +static u64 haltpoll_get_poll_time(struct cpuidle_driver *drv,
> +				  struct cpuidle_device *dev)
> +{
> +	struct haltpoll_device *hdev = &per_cpu(hpoll_devices, dev->cpu);
> +
> +	return hdev->halt_poll_us * NSEC_PER_USEC;
> +}
> +
> +static struct cpuidle_governor haltpoll_governor = {
> +	.name = "haltpoll",
> +	.rating = 21,
> +	.enable = haltpoll_enable_device,
> +	.select = haltpoll_select,
> +	.reflect = haltpoll_reflect,
> +	.get_poll_time = haltpoll_get_poll_time,
> +};
> +
> +static int __init init_haltpoll(void)
> +{
> +	if (kvm_para_available())
> +		return cpuidle_register_governor(&haltpoll_governor);
> +
> +	return 0;
> +}
> +
> +postcore_initcall(init_haltpoll);
> Index: linux-2.6-newcpuidle.git/Documentation/virtual/guest-halt-polling.txt
> ===================================================================
> --- /dev/null
> +++ linux-2.6-newcpuidle.git/Documentation/virtual/guest-halt-polling.txt
> @@ -0,0 +1,79 @@
> +Guest halt polling
> +==================
> +
> +The cpuidle_haltpoll driver, with the haltpoll governor, allows
> +the guest vcpus to poll for a specified amount of time before
> +halting.
> +This provides the following benefits to host side polling:
> +
> +	1) The POLL flag is set while polling is performed, which allows
> +	   a remote vCPU to avoid sending an IPI (and the associated
> +	   cost of handling the IPI) when performing a wakeup.
> +
> +	2) The VM-exit cost can be avoided.
> +
> +The downside of guest side polling is that polling is performed
> +even with other runnable tasks in the host.
> +
> +The basic logic as follows: A global value, guest_halt_poll_us,
> +is configured by the user, indicating the maximum amount of
> +time polling is allowed. This value is fixed.
> +
> +Each vcpu has an adjustable guest_halt_poll_us
> +("per-cpu guest_halt_poll_us"), which is adjusted by the algorithm
> +in response to events (explained below).
> +
> +Module Parameters
> +=================
> +
> +The haltpoll governor has 5 tunable module parameters:
> +
> +1) guest_halt_poll_us:
> +Maximum amount of time, in microseconds, that polling is
> +performed before halting.
> +
> +Default: 200
> +
> +2) guest_halt_poll_shrink:
> +Division factor used to shrink per-cpu guest_halt_poll_us when
> +wakeup event occurs after the global guest_halt_poll_us.
> +
> +Default: 2
> +
> +3) guest_halt_poll_grow:
> +Multiplication factor used to grow per-cpu guest_halt_poll_us
> +when event occurs after per-cpu guest_halt_poll_us
> +but before global guest_halt_poll_us.
> +
> +Default: 2
> +
> +4) guest_halt_poll_grow_start:
> +The per-cpu guest_halt_poll_us eventually reaches zero
> +in case of an idle system. This value sets the initial
> +per-cpu guest_halt_poll_us when growing. This can
> +be increased from 10, to avoid misses during the initial
> +growth stage:
> +
> +10, 20, 40, ... (example assumes guest_halt_poll_grow=2).
> +
> +Default: 50
> +
> +5) guest_halt_poll_allow_shrink:
> +
> +Bool parameter which allows shrinking. Set to N
> +to avoid it (per-cpu guest_halt_poll_us will remain
> +high once achieves global guest_halt_poll_us value).
> +
> +Default: Y
> +
> +The module parameters can be set from the debugfs files in:
> +
> +	/sys/module/haltpoll/parameters/
> +
> +Further Notes
> +=============
> +
> +- Care should be taken when setting the guest_halt_poll_us parameter as a
> +large value has the potential to drive the cpu usage to 100% on a machine which
> +would be almost entirely idle otherwise.
> +
>
>
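
The grow/shrink rule from adjust_haltpoll_us() and the documentation
above can be tried out in userspace. The following is a minimal sketch,
assuming the default tunables from the patch; model_adjust() is a
hypothetical helper that mirrors the quoted logic but returns the new
per-cpu window instead of updating a per-cpu structure.

```c
/* Tunables mirrored from the patch defaults (microseconds / factors). */
static unsigned int guest_halt_poll_us = 200;
static unsigned int guest_halt_poll_shrink = 2;
static unsigned int guest_halt_poll_grow = 2;
static unsigned int guest_halt_poll_grow_start = 50;
static int guest_halt_poll_allow_shrink = 1;

/*
 * Hypothetical userspace model of adjust_haltpoll_us(): given the
 * current per-cpu poll window and the time the vCPU stayed halted
 * (both in microseconds), return the next per-cpu poll window.
 */
static unsigned int model_adjust(unsigned int cur_us, unsigned int block_us)
{
	unsigned int val = cur_us;

	if (block_us > cur_us && block_us <= guest_halt_poll_us) {
		/* Wakeup came after the per-cpu window but within the
		 * global limit: polling longer would have caught it. */
		val = cur_us * guest_halt_poll_grow;
		if (val < guest_halt_poll_grow_start)
			val = guest_halt_poll_grow_start;
		if (val > guest_halt_poll_us)
			val = guest_halt_poll_us;
	} else if (block_us > guest_halt_poll_us &&
		   guest_halt_poll_allow_shrink) {
		/* Wakeup came after the global limit: polling would not
		 * have helped, so back off (a zero factor resets to 0). */
		val = guest_halt_poll_shrink ?
			cur_us / guest_halt_poll_shrink : 0;
	}

	return val;
}
```

Starting from a window of 0 with these defaults, halts of 30, 80 and
150 us grow the window to 50, 100 and 200 us in turn, after which it
stays capped at guest_halt_poll_us. A single 500 us halt then halves it
back to 100 us. A halt shorter than the current window leaves the
window unchanged.
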