On 6/4/2019 12:52 AM, Marcelo Tosatti wrote:
> The cpuidle-haltpoll driver allows the guest vcpus to poll for a specified
> amount of time before halting. This provides the following benefits
> to host side polling:
>
> 1) The POLL flag is set while polling is performed, which allows
> a remote vCPU to avoid sending an IPI (and the associated
> cost of handling the IPI) when performing a wakeup.
>
> 2) The HLT VM-exit cost can be avoided.
>
> The downside of guest side polling is that polling is performed
> even with other runnable tasks in the host.
>
> Results comparing halt_poll_ns and server/client application
> where a small packet is ping-ponged:
>
> host --> 31.33
> halt_poll_ns=300000 / no guest busy spin --> 33.40 (93.8%)
> halt_poll_ns=0 / guest_halt_poll_ns=300000 --> 32.73 (95.7%)
>
> For the SAP HANA benchmarks (where idle_spin is a parameter
> of the previous version of the patch, results should be the
> same):
>
> hpns == halt_poll_ns
>
> idle_spin=0/ idle_spin=800/ idle_spin=0/
> hpns=200000 hpns=0 hpns=800000
> DeleteC06T03 (100 thread) 1.76 1.71 (-3%) 1.78 (+1%)
> InsertC16T02 (100 thread) 2.14 2.07 (-3%) 2.18 (+1.8%)
> DeleteC00T01 (1 thread) 1.34 1.28 (-4.5%) 1.29 (-3.7%)
> UpdateC00T03 (1 thread) 4.72 4.18 (-12%) 4.53 (-5%)
>
> V2:
>
> - Move from x86 to generic code (Paolo/Christian).
> - Add auto-tuning logic (Paolo).
> - Add MSR to disable host side polling (Paolo).
>
>
>
First of all, please CC power management patches (including cpuidle,
cpufreq etc) to linux-pm@vger.kernel.org (there are people on that list
who may want to see your changes before they go in) and CC cpuidle
material (in particular) to Peter Zijlstra.

Second, I'm not a big fan of this approach to be honest, as it kind of
is a driver trying to play the role of a governor.

We have a "polling state" already that could be used here in principle
so I wonder what would be wrong with that. Also note that there seems
to be at least some code duplication between your code and the "polling
state" implementation, so maybe it would be possible to do some things
in a common way?

On 6/4/2019 12:52 AM, Marcelo Tosatti wrote:

It is rather inconvenient to comment on patches posted as attachments.

Anyway, IMO adding trace events at this point is premature, please move
adding them to a separate patch at the end of the series.

On Fri, Jun 07, 2019 at 11:49:51AM +0200, Rafael J. Wysocki wrote:
> On 6/4/2019 12:52 AM, Marcelo Tosatti wrote:
> >The cpuidle-haltpoll driver allows the guest vcpus to poll for a specified
> >amount of time before halting. This provides the following benefits
> >to host side polling:
> >
> > 1) The POLL flag is set while polling is performed, which allows
> > a remote vCPU to avoid sending an IPI (and the associated
> > cost of handling the IPI) when performing a wakeup.
> >
> > 2) The HLT VM-exit cost can be avoided.
> >
> >The downside of guest side polling is that polling is performed
> >even with other runnable tasks in the host.
> >
> >Results comparing halt_poll_ns and server/client application
> >where a small packet is ping-ponged:
> >
> >host --> 31.33
> >halt_poll_ns=300000 / no guest busy spin --> 33.40 (93.8%)
> >halt_poll_ns=0 / guest_halt_poll_ns=300000 --> 32.73 (95.7%)
> >
> >For the SAP HANA benchmarks (where idle_spin is a parameter
> >of the previous version of the patch, results should be the
> >same):
> >
> >hpns == halt_poll_ns
> >
> > idle_spin=0/ idle_spin=800/ idle_spin=0/
> > hpns=200000 hpns=0 hpns=800000
> >DeleteC06T03 (100 thread) 1.76 1.71 (-3%) 1.78 (+1%)
> >InsertC16T02 (100 thread) 2.14 2.07 (-3%) 2.18 (+1.8%)
> >DeleteC00T01 (1 thread) 1.34 1.28 (-4.5%) 1.29 (-3.7%)
> >UpdateC00T03 (1 thread) 4.72 4.18 (-12%) 4.53 (-5%)
> >
> >V2:
> >
> >- Move from x86 to generic code (Paolo/Christian).
> >- Add auto-tuning logic (Paolo).
> >- Add MSR to disable host side polling (Paolo).
> >
> >
> >
> First of all, please CC power management patches (including cpuidle,
> cpufreq etc) to linux-pm@vger.kernel.org (there are people on that
> list who may want to see your changes before they go in) and CC
> cpuidle material (in particular) to Peter Zijlstra.

Ok, Peter is CC'ed, will include linux-pm@vger in the next patches.

> Second, I'm not a big fan of this approach to be honest, as it kind
> of is a driver trying to play the role of a governor.

Well, it's not trying to choose which idle state to enter, because
there is only one idle state to enter when virtualized (HLT).

> We have a "polling state" already that could be used here in
> principle so I wonder what would be wrong with that.

There is no "target residency" concept in the virtualized use-case
(which is what poll_state.c uses to calculate the poll time).

Moreover the cpuidle-haltpoll driver uses an adaptive logic to
tune poll time (which apparently does not make sense for poll_state).

The only thing they share is the main loop structure:

"while (!need_resched()) {
	cpu_relax();
	now = ktime_get();
}"

> Also note that
> there seems to be at least some code duplication between your code
> and the "polling state" implementation, so maybe it would be
> possible to do some things in a common way?

Again, it's just the main loop structure that is shared:
there is no target residency in the virtualized case,
and we want an adaptive scheme.

Let's think about deduplication: you would have a cpuidle driver,
with a fake "target residency". Now, it makes no sense to use
a governor for the virtualized case (again, there is only
one idle state: HLT, the host governor is used for
the actual idle state decision in the host).

So i fail to see how i would go about integrating these two
and what the advantages of doing so are?

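To make the shared "main loop structure" concrete, here is a minimal userspace sketch of a bounded poll loop. It is not kernel code: struct poll_env, env_need_resched() and poll_until() are hypothetical names, and a simulated clock stands in for need_resched(), cpu_relax() and ktime_get(). The only point is that the loop body is the same whether the limit comes from a state's target residency (as in poll_state.c) or from an adaptive per-cpu value (as in the proposed driver).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simulated environment; none of this exists in the kernel. */
struct poll_env {
	uint64_t now_ns;      /* simulated clock */
	uint64_t step_ns;     /* time consumed by one loop iteration */
	uint64_t event_at_ns; /* time at which a wakeup becomes pending */
};

/* Stand-in for need_resched(): true once the simulated event has arrived. */
static bool env_need_resched(const struct poll_env *env)
{
	return env->now_ns >= env->event_at_ns;
}

/*
 * Poll until work is pending or more than limit_ns has elapsed.
 * Returns true if the wakeup arrived inside the poll window,
 * false if the caller should go on and halt.
 */
static bool poll_until(struct poll_env *env, uint64_t limit_ns)
{
	uint64_t start = env->now_ns;

	while (!env_need_resched(env)) {
		if (env->now_ns - start > limit_ns)
			return false;
		/* stands in for cpu_relax() plus re-reading the clock */
		env->now_ns += env->step_ns;
	}
	return true;
}
```

With a 100 ns step and a 10000 ns limit, an event at 5000 ns is caught by the poll, while an event at 50000 ns is not and the loop gives up at 10100 ns.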
On 07/06/19 19:16, Marcelo Tosatti wrote:
> There is no "target residency" concept in the virtualized use-case
> (which is what poll_state.c uses to calculate the poll time).
Actually there is: it is the cost of a vmexit, and it can be calibrated with
a very short CPUID loop (e.g. run 100 CPUID instructions and take the
smallest TSC interval---it should take less than 50 microseconds, and
less than a millisecond even on nested virt).
I think it would make sense to improve poll_state.c to use an adaptive
algorithm similar to the one you implemented, which includes optionally
allowing to poll for an interval larger than the target residency.
Paolo
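
The calibration idea above (repeat a cheap exiting operation and keep the smallest measured interval) can be sketched in userspace as follows. This is only an illustration under stated assumptions: calibrate_min_cost(), struct fake_cpu, fake_read() and fake_exit() are hypothetical names, and a fake counter replaces the real TSC read and CPUID instruction so that the example is deterministic.

```c
#include <stdint.h>

/* Hypothetical callback types: a timestamp source and the operation to time. */
typedef uint64_t (*clock_fn)(void *ctx);
typedef void (*op_fn)(void *ctx);

/*
 * Time op() runs times and return the smallest interval seen.
 * Taking the minimum filters out runs disturbed by interrupts
 * or other noise. Returns UINT64_MAX if runs is 0.
 */
static uint64_t calibrate_min_cost(clock_fn read_clock, op_fn op,
				   void *ctx, unsigned int runs)
{
	uint64_t best = UINT64_MAX;

	for (unsigned int i = 0; i < runs; i++) {
		uint64_t t0 = read_clock(ctx);

		op(ctx);

		uint64_t delta = read_clock(ctx) - t0;

		if (delta < best)
			best = delta;
	}
	return best;
}

/* Fake counter for demonstration; a real version would read the TSC. */
struct fake_cpu {
	uint64_t tsc;          /* simulated cycle counter */
	const uint64_t *costs; /* simulated cost of each successive exit */
	unsigned int next;     /* index of the next cost to apply */
};

static uint64_t fake_read(void *ctx)
{
	return ((struct fake_cpu *)ctx)->tsc;
}

/* Stand-in for one CPUID: advance the counter by the next simulated cost. */
static void fake_exit(void *ctx)
{
	struct fake_cpu *cpu = ctx;

	cpu->tsc += cpu->costs[cpu->next++];
}
```

With simulated costs of 1800, 1500, 9000, 1450 and 1600 cycles, five runs yield 1450: the 9000-cycle outlier has no effect on the result.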
On Fri, Jun 07, 2019 at 08:22:35PM +0200, Paolo Bonzini wrote:
> On 07/06/19 19:16, Marcelo Tosatti wrote:
> > There is no "target residency" concept in the virtualized use-case
> > (which is what poll_state.c uses to calculate the poll time).
>
> Actually there is: it is the cost of a vmexit, and it can be calibrated with
> a very short CPUID loop (e.g. run 100 CPUID instructions and take the
> smallest TSC interval---it should take less than 50 microseconds, and
> less than a millisecond even on nested virt).

For a given application, you want to configure the poll time to the
maximum time it takes for an event to happen after starting the idle
procedure. For SAP HANA, that value is between 200us - 800us (most tests
require less than 800us, but some require 800us, to significantly avoid
the IPIs).

"The target residency is the minimum time the hardware must spend in the
given state, including the time needed to enter it (which may be
substantial), in order to save more energy than it would save by
entering one of the shallower idle states instead."

Clearly these are two different things...

> I think it would make sense to improve poll_state.c to use an adaptive
> algorithm similar to the one you implemented, which includes optionally
> allowing to poll for an interval larger than the target residency.
>
> Paolo

Ok, so i'll move the adaptive code to poll_state.c, where
the driver selects whether to use target_residency or the
adaptive value (based on module parameters).

Not sure if an adaptive value is desirable for
the non virtualized case.

On Fri, Jun 07, 2019 at 11:49:51AM +0200, Rafael J. Wysocki wrote:
> On 6/4/2019 12:52 AM, Marcelo Tosatti wrote:
> >The cpuidle-haltpoll driver allows the guest vcpus to poll for a specified
> >amount of time before halting. This provides the following benefits
> >to host side polling:
> >
> > 1) The POLL flag is set while polling is performed, which allows
> > a remote vCPU to avoid sending an IPI (and the associated
> > cost of handling the IPI) when performing a wakeup.
> >
> > 2) The HLT VM-exit cost can be avoided.
> >
> >The downside of guest side polling is that polling is performed
> >even with other runnable tasks in the host.
> >
> >Results comparing halt_poll_ns and server/client application
> >where a small packet is ping-ponged:
> >
> >host --> 31.33
> >halt_poll_ns=300000 / no guest busy spin --> 33.40 (93.8%)
> >halt_poll_ns=0 / guest_halt_poll_ns=300000 --> 32.73 (95.7%)
> >
> >For the SAP HANA benchmarks (where idle_spin is a parameter
> >of the previous version of the patch, results should be the
> >same):
> >
> >hpns == halt_poll_ns
> >
> > idle_spin=0/ idle_spin=800/ idle_spin=0/
> > hpns=200000 hpns=0 hpns=800000
> >DeleteC06T03 (100 thread) 1.76 1.71 (-3%) 1.78 (+1%)
> >InsertC16T02 (100 thread) 2.14 2.07 (-3%) 2.18 (+1.8%)
> >DeleteC00T01 (1 thread) 1.34 1.28 (-4.5%) 1.29 (-3.7%)
> >UpdateC00T03 (1 thread) 4.72 4.18 (-12%) 4.53 (-5%)
> >
> >V2:
> >
> >- Move from x86 to generic code (Paolo/Christian).
> >- Add auto-tuning logic (Paolo).
> >- Add MSR to disable host side polling (Paolo).
> >
> >
> >
> First of all, please CC power management patches (including cpuidle,
> cpufreq etc) to linux-pm@vger.kernel.org (there are people on that
> list who may want to see your changes before they go in) and CC
> cpuidle material (in particular) to Peter Zijlstra.
>
> Second, I'm not a big fan of this approach to be honest, as it kind
> of is a driver trying to play the role of a governor.
>
> We have a "polling state" already that could be used here in
> principle so I wonder what would be wrong with that. Also note that
> there seems to be at least some code duplication between your code
> and the "polling state" implementation, so maybe it would be
> possible to do some things in a common way?

Hi Rafael,

After modifying poll_state.c to use a generic "poll time" driver
callback [1] (since using a variable "target_residency" for that
looks really ugly), would need a governor which does:

haltpoll_governor_select_next_state()
	if (prev_state was poll and evt happened on prev poll window) -> POLL.
	if (prev_state == HLT) -> POLL
	otherwise -> HLT

And a "default_idle" cpuidle driver that:

defaultidle_idle()
	if (current_clr_polling_and_test()) {
		local_irq_enable();
		return index;
	}
	default_idle();
	return

Using such governor with any other cpuidle driver would
be pointless (since it would enter the first state only
and therefore not save power).

Not certain about using the default_idle driver with
other governors: one would rather use a driver that
supports all states on a given machine.

This combination of governor/driver pair, for the sake
of sharing the idle loop, seems awkward to me.
And fails the governor/driver separation: one will use the
pair in practice.

But i have no problem with it, so i'll proceed with that.

Let me know otherwise.

Thanks.
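
The governor pseudocode above amounts to a three-rule decision, which can be written as a small standalone C function. This is a sketch only: enum idle_state and select_next_state() are hypothetical names chosen for illustration and are not part of the cpuidle governor API.

```c
#include <stdbool.h>

/* Hypothetical two-state model: poll in the guest, or halt (HLT). */
enum idle_state { STATE_POLL, STATE_HLT };

/*
 * Mirror of the three rules in the pseudocode:
 *  - previous state was POLL and the event arrived within the
 *    previous poll window: keep polling;
 *  - previous state was HLT: poll first on the next idle entry;
 *  - otherwise (polled, but no event arrived in the window): halt.
 */
static enum idle_state select_next_state(enum idle_state prev,
					 bool event_in_poll_window)
{
	if (prev == STATE_POLL && event_in_poll_window)
		return STATE_POLL;
	if (prev == STATE_HLT)
		return STATE_POLL;
	return STATE_HLT;
}
```

Read this way, the governor only ever alternates between POLL and HLT, which is why pairing it with a driver that exposes deeper states would not save power.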
On Mon, Jun 10, 2019 at 5:00 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Fri, Jun 07, 2019 at 11:49:51AM +0200, Rafael J. Wysocki wrote:
> > On 6/4/2019 12:52 AM, Marcelo Tosatti wrote:
> > >The cpuidle-haltpoll driver allows the guest vcpus to poll for a specified
> > >amount of time before halting. This provides the following benefits
> > >to host side polling:
> > >
> > > 1) The POLL flag is set while polling is performed, which allows
> > > a remote vCPU to avoid sending an IPI (and the associated
> > > cost of handling the IPI) when performing a wakeup.
> > >
> > > 2) The HLT VM-exit cost can be avoided.
> > >
> > >The downside of guest side polling is that polling is performed
> > >even with other runnable tasks in the host.
> > >
> > >Results comparing halt_poll_ns and server/client application
> > >where a small packet is ping-ponged:
> > >
> > >host --> 31.33
> > >halt_poll_ns=300000 / no guest busy spin --> 33.40 (93.8%)
> > >halt_poll_ns=0 / guest_halt_poll_ns=300000 --> 32.73 (95.7%)
> > >
> > >For the SAP HANA benchmarks (where idle_spin is a parameter
> > >of the previous version of the patch, results should be the
> > >same):
> > >
> > >hpns == halt_poll_ns
> > >
> > > idle_spin=0/ idle_spin=800/ idle_spin=0/
> > > hpns=200000 hpns=0 hpns=800000
> > >DeleteC06T03 (100 thread) 1.76 1.71 (-3%) 1.78 (+1%)
> > >InsertC16T02 (100 thread) 2.14 2.07 (-3%) 2.18 (+1.8%)
> > >DeleteC00T01 (1 thread) 1.34 1.28 (-4.5%) 1.29 (-3.7%)
> > >UpdateC00T03 (1 thread) 4.72 4.18 (-12%) 4.53 (-5%)
> > >
> > >V2:
> > >
> > >- Move from x86 to generic code (Paolo/Christian).
> > >- Add auto-tuning logic (Paolo).
> > >- Add MSR to disable host side polling (Paolo).
> > >
> > >
> > >
> > First of all, please CC power management patches (including cpuidle,
> > cpufreq etc) to linux-pm@vger.kernel.org (there are people on that
> > list who may want to see your changes before they go in) and CC
> > cpuidle material (in particular) to Peter Zijlstra.
> >
> > Second, I'm not a big fan of this approach to be honest, as it kind
> > of is a driver trying to play the role of a governor.
> >
> > We have a "polling state" already that could be used here in
> > principle so I wonder what would be wrong with that. Also note that
> > there seems to be at least some code duplication between your code
> > and the "polling state" implementation, so maybe it would be
> > possible to do some things in a common way?
>
> Hi Rafael,
>
> After modifying poll_state.c to use a generic "poll time" driver
> callback [1] (since using a variable "target_residency" for that
> looks really ugly), would need a governor which does:
>
> haltpoll_governor_select_next_state()
> if (prev_state was poll and evt happened on prev poll window) -> POLL.
> if (prev_state == HLT) -> POLL
> otherwise -> HLT
>
> And a "default_idle" cpuidle driver that:
>
> defaultidle_idle()
> if (current_clr_polling_and_test()) {
> local_irq_enable();
> return index;
> }
> default_idle();
> return
>
> Using such governor with any other cpuidle driver would
> be pointless (since it would enter the first state only
> and therefore not save power).
>
> Not certain about using the default_idle driver with
> other governors: one would rather use a driver that
> supports all states on a given machine.
>
> This combination of governor/driver pair, for the sake
> of sharing the idle loop, seems awkward to me.
> And fails the governor/driver separation: one will use the
> pair in practice.
>
> But i have no problem with it, so i'll proceed with that.
>
> Let me know otherwise.

If my understanding of your argumentation is correct, it is only
necessary to take the default_idle_call() branch of
cpuidle_idle_call() in the VM case, so it should be sufficient to
provide a suitable default_idle_call() which is what you seem to be
trying to do.

I might have been confused by the terminology used in the patch series
if that's the case.

Also, if that's the case, this is not cpuidle matter really. It is a
matter of providing a better default_idle_call() for the arch at hand.

Thanks,
Rafael

On Tue, Jun 11, 2019 at 12:03:26AM +0200, Rafael J. Wysocki wrote:
> On Mon, Jun 10, 2019 at 5:00 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >
> > On Fri, Jun 07, 2019 at 11:49:51AM +0200, Rafael J. Wysocki wrote:
> > > On 6/4/2019 12:52 AM, Marcelo Tosatti wrote:
> > > >The cpuidle-haltpoll driver allows the guest vcpus to poll for a specified
> > > >amount of time before halting. This provides the following benefits
> > > >to host side polling:
> > > >
> > > > 1) The POLL flag is set while polling is performed, which allows
> > > > a remote vCPU to avoid sending an IPI (and the associated
> > > > cost of handling the IPI) when performing a wakeup.
> > > >
> > > > 2) The HLT VM-exit cost can be avoided.
> > > >
> > > >The downside of guest side polling is that polling is performed
> > > >even with other runnable tasks in the host.
> > > >
> > > >Results comparing halt_poll_ns and server/client application
> > > >where a small packet is ping-ponged:
> > > >
> > > >host --> 31.33
> > > >halt_poll_ns=300000 / no guest busy spin --> 33.40 (93.8%)
> > > >halt_poll_ns=0 / guest_halt_poll_ns=300000 --> 32.73 (95.7%)
> > > >
> > > >For the SAP HANA benchmarks (where idle_spin is a parameter
> > > >of the previous version of the patch, results should be the
> > > >same):
> > > >
> > > >hpns == halt_poll_ns
> > > >
> > > > idle_spin=0/ idle_spin=800/ idle_spin=0/
> > > > hpns=200000 hpns=0 hpns=800000
> > > >DeleteC06T03 (100 thread) 1.76 1.71 (-3%) 1.78 (+1%)
> > > >InsertC16T02 (100 thread) 2.14 2.07 (-3%) 2.18 (+1.8%)
> > > >DeleteC00T01 (1 thread) 1.34 1.28 (-4.5%) 1.29 (-3.7%)
> > > >UpdateC00T03 (1 thread) 4.72 4.18 (-12%) 4.53 (-5%)
> > > >
> > > >V2:
> > > >
> > > >- Move from x86 to generic code (Paolo/Christian).
> > > >- Add auto-tuning logic (Paolo).
> > > >- Add MSR to disable host side polling (Paolo).
> > > >
> > > >
> > > >
> > > First of all, please CC power management patches (including cpuidle,
> > > cpufreq etc) to linux-pm@vger.kernel.org (there are people on that
> > > list who may want to see your changes before they go in) and CC
> > > cpuidle material (in particular) to Peter Zijlstra.
> > >
> > > Second, I'm not a big fan of this approach to be honest, as it kind
> > > of is a driver trying to play the role of a governor.
> > >
> > > We have a "polling state" already that could be used here in
> > > principle so I wonder what would be wrong with that. Also note that
> > > there seems to be at least some code duplication between your code
> > > and the "polling state" implementation, so maybe it would be
> > > possible to do some things in a common way?
> >
> > Hi Rafael,
> >
> > After modifying poll_state.c to use a generic "poll time" driver
> > callback [1] (since using a variable "target_residency" for that
> > looks really ugly), would need a governor which does:
> >
> > haltpoll_governor_select_next_state()
> > if (prev_state was poll and evt happened on prev poll window) -> POLL.
> > if (prev_state == HLT) -> POLL
> > otherwise -> HLT
> >
> > And a "default_idle" cpuidle driver that:
> >
> > defaultidle_idle()
> > if (current_clr_polling_and_test()) {
> > local_irq_enable();
> > return index;
> > }
> > default_idle();
> > return
> >
> > Using such governor with any other cpuidle driver would
> > be pointless (since it would enter the first state only
> > and therefore not save power).
> >
> > Not certain about using the default_idle driver with
> > other governors: one would rather use a driver that
> > supports all states on a given machine.
> >
> > This combination of governor/driver pair, for the sake
> > of sharing the idle loop, seems awkward to me.
> > And fails the governor/driver separation: one will use the
> > pair in practice.
> >
> > But i have no problem with it, so i'll proceed with that.
> >
> > Let me know otherwise.
>
> If my understanding of your argumentation is correct, it is only
> necessary to take the default_idle_call() branch of
> cpuidle_idle_call() in the VM case, so it should be sufficient to
> provide a suitable default_idle_call() which is what you seem to be
> trying to do.

In the VM case, we need to poll before actually halting (this is because
it's tricky to implement MWAIT in guests, so polling for some amount
of time allows the IPI avoidance optimization,
see trace_sched_wake_idle_without_ipi, to take place).

The amount of time we poll is variable and adjusted (see adjust_haltpoll_ns
in the patchset).

> I might have been confused by the terminology used in the patch series
> if that's the case.
>
> Also, if that's the case, this is not cpuidle matter really. It is a
> matter of providing a better default_idle_call() for the arch at hand.

Peter Zijlstra suggested a cpuidle driver for this.

Also, other architectures will use the same "poll before exiting to VM"
logic (so we'd rather avoid duplicating this code): PPC, x86, S/390,
MIPS... So in my POV it makes sense to unify this.

So, back to your initial suggestion:

Q) "Can you unify code with poll_state.c?"
A) Yes, but it requires a new governor, which seems overkill and unfit
for the purpose.

Moreover, the logic in menu to decide whether it's necessary or not
to stop sched tick is useful for us (so a default_idle_call is not
sufficient), because the cost of enabling/disabling the sched tick is
high on VMs.

So i'll fix the comments of the cpuidle driver (which everyone seems
to agree with, except your understandable distaste for it) and repost.

[-- Attachment #1: 01-cpuidle-haltpoll --]
[-- Type: text/plain, Size: 10829 bytes --]

The cpuidle_haltpoll driver allows the guest vcpus to poll for a specified
amount of time before halting. This provides the following benefits
to host side polling:

	1) The POLL flag is set while polling is performed, which allows
	   a remote vCPU to avoid sending an IPI (and the associated
	   cost of handling the IPI) when performing a wakeup.

	2) The HLT VM-exit cost can be avoided.

The downside of guest side polling is that polling is performed
even with other runnable tasks in the host.

Results comparing halt_poll_ns and server/client application
where a small packet is ping-ponged:

host                                        --> 31.33
halt_poll_ns=300000 / no guest busy spin    --> 33.40 (93.8%)
halt_poll_ns=0 / guest_halt_poll_ns=300000  --> 32.73 (95.7%)

For the SAP HANA benchmarks (where idle_spin is a parameter
of the previous version of the patch, results should be the
same):

hpns == halt_poll_ns

                           idle_spin=0/   idle_spin=800/   idle_spin=0/
                           hpns=200000    hpns=0           hpns=800000
DeleteC06T03 (100 thread)  1.76           1.71 (-3%)       1.78 (+1%)
InsertC16T02 (100 thread)  2.14           2.07 (-3%)       2.18 (+1.8%)
DeleteC00T01 (1 thread)    1.34           1.28 (-4.5%)     1.29 (-3.7%)
UpdateC00T03 (1 thread)    4.72           4.18 (-12%)      4.53 (-5%)

---
 Documentation/virtual/guest-halt-polling.txt |   96 +++++++++++++++++
 arch/x86/kernel/process.c                    |    2 
 drivers/cpuidle/Kconfig                      |    9 +
 drivers/cpuidle/Makefile                     |    1 
 drivers/cpuidle/cpuidle-haltpoll.c           |  145 +++++++++++++++++++++++++++
 5 files changed, 252 insertions(+), 1 deletion(-)

Index: linux-2.6.git/Documentation/virtual/guest-halt-polling.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.git/Documentation/virtual/guest-halt-polling.txt	2019-06-11 16:38:55.072877644 -0300
@@ -0,0 +1,96 @@
+Guest halt polling
+==================
+
+The cpuidle_haltpoll driver allows the guest vcpus to poll for a specified
+amount of time before halting. This provides the following benefits
+to host side polling:
+
+	1) The POLL flag is set while polling is performed, which allows
+	   a remote vCPU to avoid sending an IPI (and the associated
+	   cost of handling the IPI) when performing a wakeup.
+
+	2) The VM-exit cost can be avoided.
+
+The downside of guest side polling is that polling is performed
+even with other runnable tasks in the host.
+
+The basic logic as follows: A global value, guest_halt_poll_ns,
+is configured by the user, indicating the maximum amount of
+time polling is allowed. This value is fixed.
+
+Each vcpu has an adjustable guest_halt_poll_ns
+("per-cpu guest_halt_poll_ns"), which is adjusted by the algorithm
+in response to events (explained below).
+
+Module Parameters
+=================
+
+The cpuidle_haltpoll module has 5 tunable module parameters:
+
+1) guest_halt_poll_ns:
+
+Maximum amount of time, in nanoseconds, that polling is
+performed before halting.
+
+Default: 0
+
+2) guest_halt_poll_shrink:
+
+Division factor used to shrink per-cpu guest_halt_poll_ns when
+wakeup event occurs after the global guest_halt_poll_ns.
+
+Default: 2
+
+3) guest_halt_poll_grow:
+
+Multiplication factor used to grow per-cpu guest_halt_poll_ns
+when event occurs after per-cpu guest_halt_poll_ns
+but before global guest_halt_poll_ns.
+
+Default: 2
+
+4) guest_halt_poll_grow_start:
+
+The per-cpu guest_halt_poll_ns eventually reaches zero
+in case of an idle system. This value sets the initial
+per-cpu guest_halt_poll_ns when growing. This can
+be increased from 10000, to avoid misses during the initial
+growth stage:
+
+10000, 20000, 40000, ... (example assumes guest_halt_poll_grow=2).
+
+Default: 10000
+
+5) guest_halt_poll_allow_shrink:
+
+Bool parameter which allows shrinking. Set to N
+to avoid it (per-cpu guest_halt_poll_ns will remain
+high once achieves global guest_halt_poll_ns value).
+
+Default: Y
+
+The module parameters can be set from the debugfs files in:
+
+	/sys/module/cpuidle_haltpoll/parameters/
+
+Host and guest polling
+======================
+
+KVM also performs host side polling (that is, it can poll for a certain
+amount of time before halting) on behalf of guest vcpus.
+
+Modern hosts support poll control MSRs, which are used by
+cpuidle_haltpoll to disable host side polling on a per-VM basis.
+
+If the KVM host does not support this interface, then both guest side
+and host side polling can be performed, which can incur extra CPU and
+energy consumption. One might consider disabling host side polling
+manually if upgrading to a new host is not an option.
+
+Further Notes
+=============
+
+Care should be taken when setting the guest_halt_poll_ns parameter
+as a large value has the potential to drive the cpu usage to 100% on a
+machine which would be almost entirely idle otherwise.
+
Index: linux-2.6.git/arch/x86/kernel/process.c
===================================================================
--- linux-2.6.git.orig/arch/x86/kernel/process.c	2019-06-11 12:14:37.731286353 -0300
+++ linux-2.6.git/arch/x86/kernel/process.c	2019-06-11 12:14:44.699424799 -0300
@@ -580,7 +580,7 @@
 	safe_halt();
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
 }
-#ifdef CONFIG_APM_MODULE
+#if defined(CONFIG_APM_MODULE) || defined(CONFIG_HALTPOLL_CPUIDLE_MODULE)
 EXPORT_SYMBOL(default_idle);
 #endif
 
Index: linux-2.6.git/drivers/cpuidle/Kconfig
===================================================================
--- linux-2.6.git.orig/drivers/cpuidle/Kconfig	2019-06-11 12:14:37.731286353 -0300
+++ linux-2.6.git/drivers/cpuidle/Kconfig	2019-06-11 16:38:13.984060707 -0300
@@ -51,6 +51,15 @@
 source "drivers/cpuidle/Kconfig.powerpc"
 endmenu
 
+config HALTPOLL_CPUIDLE
+	tristate "Halt poll cpuidle driver"
+	depends on X86
+	default y
+	help
+	 This option enables halt poll cpuidle driver, which allows to poll
+	 before halting in the guest (more efficient than polling in the
+	 host via halt_poll_ns for some scenarios).
+
 endif
 
 config ARCH_NEEDS_CPU_IDLE_COUPLED
Index: linux-2.6.git/drivers/cpuidle/Makefile
===================================================================
--- linux-2.6.git.orig/drivers/cpuidle/Makefile	2019-06-11 12:14:37.731286353 -0300
+++ linux-2.6.git/drivers/cpuidle/Makefile	2019-06-11 12:14:44.700424819 -0300
@@ -7,6 +7,7 @@
 obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
 obj-$(CONFIG_DT_IDLE_STATES)		  += dt_idle_states.o
 obj-$(CONFIG_ARCH_HAS_CPU_RELAX)	  += poll_state.o
+obj-$(CONFIG_HALTPOLL_CPUIDLE)		  += cpuidle-haltpoll.o
 
 ##################################################################################
 # ARM SoC drivers
Index: linux-2.6.git/drivers/cpuidle/cpuidle-haltpoll.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.git/drivers/cpuidle/cpuidle-haltpoll.c	2019-06-11 16:37:23.328053569 -0300
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * cpuidle driver for halt polling.
+ *
+ * Copyright 2019 Red Hat, Inc. and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Authors: Marcelo Tosatti <mtosatti@redhat.com>
+ */
+
+#include <linux/init.h>
+#include <linux/cpuidle.h>
+#include <linux/module.h>
+#include <linux/sched/clock.h>
+#include <linux/sched/idle.h>
+
+static unsigned int guest_halt_poll_ns __read_mostly = 200000;
+module_param(guest_halt_poll_ns, uint, 0644);
+
+/* division factor to shrink halt_poll_ns */
+static unsigned int guest_halt_poll_shrink __read_mostly = 2;
+module_param(guest_halt_poll_shrink, uint, 0644);
+
+/* multiplication factor to grow per-cpu halt_poll_ns */
+static unsigned int guest_halt_poll_grow __read_mostly = 2;
+module_param(guest_halt_poll_grow, uint, 0644);
+
+/* value in ns to start growing per-cpu halt_poll_ns */
+static unsigned int guest_halt_poll_grow_start __read_mostly = 10000;
+module_param(guest_halt_poll_grow_start, uint, 0644);
+
+/* value in ns to start growing per-cpu halt_poll_ns */
+static bool guest_halt_poll_allow_shrink __read_mostly = true;
+module_param(guest_halt_poll_allow_shrink, bool, 0644);
+
+static DEFINE_PER_CPU(unsigned int, halt_poll_ns);
+
+static void adjust_haltpoll_ns(unsigned int block_ns,
+			       unsigned int *cpu_halt_poll_ns)
+{
+	unsigned int val;
+
+	/* Grow cpu_halt_poll_ns if
+	 * cpu_halt_poll_ns < block_ns < guest_halt_poll_ns
+	 */
+	if (block_ns > *cpu_halt_poll_ns && block_ns <= guest_halt_poll_ns) {
+		val = *cpu_halt_poll_ns * guest_halt_poll_grow;
+
+		if (val < guest_halt_poll_grow_start)
+			val = guest_halt_poll_grow_start;
+		if (val > guest_halt_poll_ns)
+			val = guest_halt_poll_ns;
+
+		*cpu_halt_poll_ns = val;
+	} else if (block_ns > guest_halt_poll_ns &&
+		   guest_halt_poll_allow_shrink) {
+		unsigned int shrink = guest_halt_poll_shrink;
+
+		val = *cpu_halt_poll_ns;
+		if (shrink == 0)
+			val = 0;
+		else
+			val /= shrink;
+		*cpu_halt_poll_ns = val;
+	}
+}
+
+static int haltpoll_enter_idle(struct cpuidle_device *dev,
+			       struct cpuidle_driver *drv, int index)
+{
+	unsigned int *cpu_halt_poll_ns;
+	unsigned long long start, now, block_ns;
+	int cpu = smp_processor_id();
+
+	cpu_halt_poll_ns = per_cpu_ptr(&halt_poll_ns, cpu);
+
+	if (current_set_polling_and_test()) {
+		local_irq_enable();
+		goto out;
+	}
+
+	start = sched_clock();
+	local_irq_enable();
+	for (;;) {
+		if (need_resched()) {
+			current_clr_polling();
+			goto out;
+		}
+
+		now = sched_clock();
+		if (now - start > *cpu_halt_poll_ns)
+			break;
+
+		cpu_relax();
+	}
+
+	local_irq_disable();
+	if (current_clr_polling_and_test()) {
+		local_irq_enable();
+		goto out;
+	}
+
+	default_idle();
+	block_ns = sched_clock() - start;
+	adjust_haltpoll_ns(block_ns, cpu_halt_poll_ns);
+
+out:
+	return index;
+}
+
+static struct cpuidle_driver haltpoll_driver = {
+	.name = "haltpoll_idle",
+	.owner = THIS_MODULE,
+	.states = {
+		{ /* entry 0 is for polling */ },
+		{
+			.enter			= haltpoll_enter_idle,
+			.exit_latency		= 0,
+			.target_residency	= 0,
+			.power_usage		= -1,
+			.name			= "Halt poll",
+			.desc			= "Halt poll idle",
+		},
+	},
+	.safe_state_index = 0,
+	.state_count = 2,
+};
+
+static int __init haltpoll_init(void)
+{
+	return cpuidle_register(&haltpoll_driver, NULL);
+}
+
+static void __exit haltpoll_exit(void)
+{
+	cpuidle_unregister(&haltpoll_driver);
+}
+
+module_init(haltpoll_init);
+module_exit(haltpoll_exit);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Marcelo Tosatti <mtosatti@redhat.com>");
+

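The grow/shrink rule in adjust_haltpoll_ns() is easier to follow with concrete numbers. Below is a userspace sketch of the same logic: struct haltpoll_params and adjust_poll_ns() are hypothetical names introduced here, the module parameters are passed in a struct instead of being globals, and the new value is returned rather than written through a pointer. It is meant as a worked example of the posted logic, not as a replacement for it.

```c
#include <stdbool.h>

/* Hypothetical bundle of the module parameters from the patch. */
struct haltpoll_params {
	unsigned int max_ns;     /* guest_halt_poll_ns (global limit) */
	unsigned int grow;       /* guest_halt_poll_grow */
	unsigned int shrink;     /* guest_halt_poll_shrink */
	unsigned int grow_start; /* guest_halt_poll_grow_start */
	bool allow_shrink;       /* guest_halt_poll_allow_shrink */
};

/*
 * Return the new per-cpu poll time given the current one (cur) and
 * block_ns, the time from idle entry until the wakeup arrived.
 */
static unsigned int adjust_poll_ns(const struct haltpoll_params *p,
				   unsigned int cur, unsigned int block_ns)
{
	if (block_ns > cur && block_ns <= p->max_ns) {
		/* Wakeup came after our window but within the limit: grow. */
		unsigned int val = cur * p->grow;

		if (val < p->grow_start)
			val = p->grow_start;
		if (val > p->max_ns)
			val = p->max_ns;
		return val;
	}
	if (block_ns > p->max_ns && p->allow_shrink) {
		/* Wakeup came after the global limit: polling was wasted. */
		return p->shrink ? cur / p->shrink : 0;
	}
	/* Wakeup came inside the current window: leave it unchanged. */
	return cur;
}
```

With the defaults from the code (limit 200000, grow 2, shrink 2, grow_start 10000), a per-cpu value of 0 grows to 10000 after a 5000 ns block, 10000 grows to 20000 after a 15000 ns block, 160000 is clamped to 200000 after a 180000 ns block, and 20000 shrinks to 10000 after a 300000 ns block.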
On Tue, Jun 11, 2019 at 4:27 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Tue, Jun 11, 2019 at 12:03:26AM +0200, Rafael J. Wysocki wrote:
> > On Mon, Jun 10, 2019 at 5:00 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > >
> > > On Fri, Jun 07, 2019 at 11:49:51AM +0200, Rafael J. Wysocki wrote:
> > > > On 6/4/2019 12:52 AM, Marcelo Tosatti wrote:
> > > > >The cpuidle-haltpoll driver allows the guest vcpus to poll for a specified
> > > > >amount of time before halting. This provides the following benefits
> > > > >to host side polling:
> > > > >
> > > > > 1) The POLL flag is set while polling is performed, which allows
> > > > >    a remote vCPU to avoid sending an IPI (and the associated
> > > > >    cost of handling the IPI) when performing a wakeup.
> > > > >
> > > > > 2) The HLT VM-exit cost can be avoided.
> > > > >
> > > > >The downside of guest side polling is that polling is performed
> > > > >even with other runnable tasks in the host.
> > > > >
> > > > >Results comparing halt_poll_ns and server/client application
> > > > >where a small packet is ping-ponged:
> > > > >
> > > > >host                                        --> 31.33
> > > > >halt_poll_ns=300000 / no guest busy spin    --> 33.40 (93.8%)
> > > > >halt_poll_ns=0 / guest_halt_poll_ns=300000  --> 32.73 (95.7%)
> > > > >
> > > > >For the SAP HANA benchmarks (where idle_spin is a parameter
> > > > >of the previous version of the patch, results should be the
> > > > >same):
> > > > >
> > > > >hpns == halt_poll_ns
> > > > >
> > > > >                          idle_spin=0/   idle_spin=800/  idle_spin=0/
> > > > >                          hpns=200000    hpns=0          hpns=800000
> > > > >DeleteC06T03 (100 thread) 1.76           1.71 (-3%)      1.78 (+1%)
> > > > >InsertC16T02 (100 thread) 2.14           2.07 (-3%)      2.18 (+1.8%)
> > > > >DeleteC00T01 (1 thread)   1.34           1.28 (-4.5%)    1.29 (-3.7%)
> > > > >UpdateC00T03 (1 thread)   4.72           4.18 (-12%)     4.53 (-5%)
> > > > >
> > > > >V2:
> > > > >
> > > > >- Move from x86 to generic code (Paolo/Christian).
> > > > >- Add auto-tuning logic (Paolo).
> > > > >- Add MSR to disable host side polling (Paolo).
> > > > >
> > > > >
> > > > >
> > > > First of all, please CC power management patches (including cpuidle,
> > > > cpufreq etc) to linux-pm@vger.kernel.org (there are people on that
> > > > list who may want to see your changes before they go in) and CC
> > > > cpuidle material (in particular) to Peter Zijlstra.
> > > >
> > > > Second, I'm not a big fan of this approach to be honest, as it kind
> > > > of is a driver trying to play the role of a governor.
> > > >
> > > > We have a "polling state" already that could be used here in
> > > > principle so I wonder what would be wrong with that. Also note that
> > > > there seems to be at least some code duplication between your code
> > > > and the "polling state" implementation, so maybe it would be
> > > > possible to do some things in a common way?
> > >
> > > Hi Rafael,
> > >
> > > After modifying poll_state.c to use a generic "poll time" driver
> > > callback [1] (since using a variable "target_residency" for that
> > > looks really ugly), would need a governor which does:
> > >
> > > haltpoll_governor_select_next_state()
> > >         if (prev_state was poll and evt happened on prev poll window) -> POLL.
> > >         if (prev_state == HLT) -> POLL
> > >         otherwise -> HLT
> > >
> > > And a "default_idle" cpuidle driver that:
> > >
> > > defaultidle_idle()
> > >         if (current_clr_polling_and_test()) {
> > >                 local_irq_enable();
> > >                 return index;
> > >         }
> > >         default_idle();
> > >         return
> > >
> > > Using such governor with any other cpuidle driver would
> > > be pointless (since it would enter the first state only
> > > and therefore not save power).
> > >
> > > Not certain about using the default_idle driver with
> > > other governors: one would rather use a driver that
> > > supports all states on a given machine.
> > >
> > > This combination of governor/driver pair, for the sake
> > > of sharing the idle loop, seems awkward to me.
> > > And fails the governor/driver separation: one will use the
> > > pair in practice.
> > >
> > > But i have no problem with it, so i'll proceed with that.
> > >
> > > Let me know otherwise.
> >
> > If my understanding of your argumentation is correct, it is only
> > necessary to take the default_idle_call() branch of
> > cpuidle_idle_call() in the VM case, so it should be sufficient to
> > provide a suitable default_idle_call() which is what you seem to be
> > trying to do.
>
> In the VM case, we need to poll before actually halting (this is because
> it's tricky to implement MWAIT in guests, so polling for some amount
> of time allows the IPI avoidance optimization,
> see trace_sched_wake_idle_without_ipi, to take place).
>
> The amount of time we poll is variable and adjusted (see adjust_haltpoll_ns
> in the patchset).
>
> > I might have been confused by the terminology used in the patch series
> > if that's the case.
> >
> > Also, if that's the case, this is not cpuidle matter really. It is a
> > matter of providing a better default_idle_call() for the arch at hand.
>
> Peter Zijlstra suggested a cpuidle driver for this.

So I wonder what his rationale was.

> Also, other architectures will use the same "poll before exiting to VM"
> logic (so we'd rather avoid duplicating this code): PPC, x86, S/390,
> MIPS... So in my POV it makes sense to unify this.

The logic is fine IMO, but the implementation here is questionable.

> So, back to your initial suggestion:
>
> Q) "Can you unify code with poll_state.c?"
> A) Yes, but it requires a new governor, which seems overkill and unfit
> for the purpose.
>
> Moreover, the logic in menu to decide whether it's necessary or not
> to stop sched tick is useful for us (so a default_idle_call is not
> sufficient), because the cost of enabling/disabling the sched tick is
> high on VMs.

So in fact you need a governor, but you really only need it to decide
whether or not to stop the tick for you.

menu has a quite high overhead for that. :-)

> So i'll fix the comments of the cpuidle driver (which everyone seems
> to agree with, except your understandable distaste for it) and repost.

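For readers following along, the three-way rule in the quoted
haltpoll_governor_select_next_state() pseudo-code could be sketched
roughly as below. This is a plain userspace C illustration, not kernel
code and not part of the patchset: the enum, the struct, its field
names and the function name select_next_state() are all hypothetical,
and the real cpuidle governor interface looks different.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two idle states discussed above. */
enum idle_state { STATE_POLL, STATE_HLT };

/* Hypothetical per-CPU record of what happened on the last idle entry. */
struct cpu_history {
	enum idle_state prev_state;  /* state entered last time */
	bool woke_in_poll_window;    /* event arrived while still polling */
};

/*
 * Mirrors the quoted pseudo-code:
 *   prev was POLL and the event hit the poll window -> POLL
 *   prev was HLT                                    -> POLL
 *   otherwise (prev was POLL, window missed)        -> HLT
 */
static enum idle_state select_next_state(const struct cpu_history *h)
{
	if (h->prev_state == STATE_POLL && h->woke_in_poll_window)
		return STATE_POLL;
	if (h->prev_state == STATE_HLT)
		return STATE_POLL;
	return STATE_HLT;
}
```

Under these assumptions the only path to HLT is a poll that expired
without a wakeup, and a halt is always followed by another polling
attempt, which seems to be why such a governor would only make sense
paired with a two-state driver.
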
On Tue, Jun 11, 2019 at 11:24:39PM +0200, Rafael J. Wysocki wrote:
> > Peter Zijlstra suggested a cpuidle driver for this.
>
> So I wonder what his rationale was.
I was thinking we don't need this hard-coded in the idle loop when virt
can load a special driver.