linux-kernel.vger.kernel.org archive mirror
* RE: [RFC/RFT][PATCH] cpuidle: New timer events oriented governor for tickless systems
@ 2018-10-14  6:53 Doug Smythies
  2018-10-15  7:52 ` Rafael J. Wysocki
  2018-10-16  3:00 ` Doug Smythies
  0 siblings, 2 replies; 7+ messages in thread
From: Doug Smythies @ 2018-10-14  6:53 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Srinivas Pandruvada', 'Peter Zijlstra',
	'LKML', 'Frederic Weisbecker',
	'Mel Gorman', 'Giovanni Gherdovich',
	'Daniel Lezcano', 'Linux PM',
	Doug Smythies

Hi Rafael,

I tried your TEO idle governor.

On 2018.10.11 14:02 Rafael J. Wysocki wrote:

...[cut]...

> It has been tested on a few different systems with a number of
> different workloads and compared with the menu governor.  In the
> majority of cases the workloads performed similarly regardless of
> the cpuidle governor in use, but in one case the TEO governor
> allowed the workload to perform 75% better, which is a clear
> indication that some workloads may benefit from using it quite
> a bit depending on the hardware they run on.

Could you supply more detail for the 75% better case, so that
I can try to repeat the results on my system?

...[cut]...

> It is likely to select the "polling" state less often than menu
> due to the lack of the extra latency limit derived from the
> predicted idle duration, so the workloads depending on that
> behavior may be worse off (but less energy should be used
> while running them at the same time).

Yes, and I see exactly that with the 1 core pipe test: Less
performance (~10%), but also less processor package power
(~3%), compared to the 8 patch set results from the other day.

The iperf test (running 3 clients at once) results were similar
for both power and throughput.

> Overall, it selects deeper idle states than menu more often, but
> that doesn't seem to make a significant difference in the majority
> of cases.

Not always: that vicious powernightmare sweep test that I run used
far more processor package power and spent a staggering amount
of time in idle state 0. [1].

... [cut]...

> + * The sleep length is the maximum duration of the upcoming idle time of the
> + * CPU and it is always known to the kernel.  Using it alone for selecting an
> + * idle state for the CPU every time is a viable option in principle, but that
> + * might lead to suboptimal results if the other wakeup sources are more active
> + * for some reason.  Thus this governor estimates whether or not the CPU idle
> + * time is likely to be significantly shorter than the sleep length and selects
> + * an idle state for it in accordance with that, as follows:

There is something wrong here, in my opinion, but I have not isolated exactly where
by staring at the code.
Read on.

... [cut]...

> + * Assuming an idle interval every second tick, take the maximum number of CPU
> + * wakeups regarded as recent to roughly correspond to 10 minutes.
> + *
> + * When the total number of CPU wakeups goes above this value, all of the
> + * counters corresponding to the given CPU undergo a "decay" and the counting
> + * of recent events starts over.
> + */
> +#define TEO_MAX_RECENT_WAKEUPS	(300 * HZ)

In my opinion, there are problems with this approach:

First, the huge range of possible times between decay events,
anywhere from about a second to approximately a week.
	In an idle 1000 HZ system, at 2 idle entries per 4-second watchdog event:
	time = 300,000 wakes * 2 seconds/wake = 6.9 days
	Note: The longest single idle time I measured was 3.5 seconds, but that is
	always combined with a shorter one. Even using a more realistic, and
	just now measured, average value of 0.7 idles/second would be 2.4 days.
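
To make the scaling explicit, the time between decay events is simply the
wakeups-to-decay threshold multiplied by the average time per idle entry.
A back-of-the-envelope sketch of the arithmetic above (assuming HZ = 1000):

	#include <stdio.h>

	int main(void)
	{
		const double hz = 1000.0;
		const double max_recent = 300.0 * hz;	/* TEO_MAX_RECENT_WAKEUPS */
		const double secs_per_wakeup = 2.0;	/* idle system: ~2 idle entries per 4 seconds */

		printf("decay roughly every %.1f days\n",
		       max_recent * secs_per_wakeup / 86400.0);
		return 0;
	}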

Second: It leads to unpredictable behaviour, sometimes for a long time, until
the effects of some previous work are completely flushed. And from the first
point above, that previous work might have been days ago. In my case, and while
doing this work, it resulted in non-repeatability of tests and confusion
for a while. Decay events are basically asynchronous to the actual tasks being
executed. To gather data supporting this, I did the following:
	Do a bunch of times {
		Start the powernightmare sweep test.
		Abort after several seconds (enough time to flush filters
			and prefer idle state 0)
		Wait a random amount of time
		Start a very light work load, but such that the sleep time
			per work cycle is less than one tick
		Observe varying times until idle state 0 is not excessively selected.
			Anywhere from 0 to 17 minutes (the maximum length of test) was observed.
	}

Additional information:

Periodic workflow: I am having difficulty understanding an unexpectedly high
number of idle entries/exits in steady state (i.e. once things have settled
down and the filters have finally flushed). For example, a 60% work / 40% sleep
at 500 hertz workflow seems to have an extra idle entry/exit. Trace excerpt
(edited, the first column is uSeconds since the previous event):

     140 cpu_idle: state=4294967295 cpu_id=7
    1152 cpu_idle: state=4 cpu_id=7    <<<< The expected ~1200 uSecs of work
     690 cpu_idle: state=4294967295 cpu_id=7  <<<< Unexpected, Expected ~800 uSecs
      18 cpu_idle: state=2 cpu_id=7           <<<< So this extra idle makes up the difference
     138 cpu_idle: state=4294967295 cpu_id=7  <<<< But why is it there?
    1152 cpu_idle: state=4 cpu_id=7     <<<< Repeat
     690 cpu_idle: state=4294967295 cpu_id=7
      13 cpu_idle: state=2 cpu_id=7
     143 cpu_idle: state=4294967295 cpu_id=7
    1152 cpu_idle: state=4 cpu_id=7      <<<< Repeat
     689 cpu_idle: state=4294967295 cpu_id=7
      19 cpu_idle: state=2 cpu_id=7

Now compare with trace data for kernel 4.16-rc6 with the 9 patches
from the other day (which is what I expect to see):

     846 cpu_idle: state=4294967295 cpu_id=7
    1150 cpu_idle: state=4 cpu_id=7    <<<< The expected ~1200 uSecs of work
     848 cpu_idle: state=4294967295 cpu_id=7 <<<< The expected ~800 uSecs of idle
    1152 cpu_idle: state=4 cpu_id=7    <<<< Repeat
     848 cpu_idle: state=4294967295 cpu_id=7
    1151 cpu_idle: state=4 cpu_id=7    <<<< Repeat
     848 cpu_idle: state=4294967295 cpu_id=7
    1152 cpu_idle: state=4 cpu_id=7    <<<< Repeat
     848 cpu_idle: state=4294967295 cpu_id=7
    1152 cpu_idle: state=4 cpu_id=7    <<<< Repeat
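
For reference, the 60% work / 40% sleep periodic load at 500 hertz described
above is essentially equivalent to the following loop (a simplified sketch,
not the exact program I use): each 2000 uSec period is ~1200 uSec of spinning
and ~800 uSec of sleep.

	#include <time.h>

	static void busy_wait_us(long us)
	{
		struct timespec start, now;

		clock_gettime(CLOCK_MONOTONIC, &start);
		do {
			clock_gettime(CLOCK_MONOTONIC, &now);
		} while ((now.tv_sec - start.tv_sec) * 1000000L +
			 (now.tv_nsec - start.tv_nsec) / 1000L < us);
	}

	int main(void)
	{
		struct timespec sleep_time = { 0, 800 * 1000L };	/* ~800 uSec */

		for (;;) {
			busy_wait_us(1200);		/* ~60% of the 2000 uSec period */
			nanosleep(&sleep_time, NULL);	/* ~40% of the 2000 uSec period */
		}
	}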

Anyway, in the end we really only care about power. So for this test:
Kernel 4.19-rc6 + 9 patches: 9.133 watts
TEO (on top of 4.19-rc7):
At start, high number of idle state 0 entries: 11.33 watts (+24%)
After awhile, it shifted to idle state 1: 10.00 watts (+9.5%)
After awhile, it shifted to idle state 2: 9.67 watts (+5.9%)
That seemed to finally be a steady state scenario (at least for over 2 hours).
Note: it was always using idle state 4 also.

...[snip]...

> +	/* Decay past events information. */
> +	for (i = 0; i < drv->state_count; i++) {
> +		cpu_data->states[i].early_wakeups_old += cpu_data->states[i].early_wakeups;
> +		cpu_data->states[i].early_wakeups_old /= 2;
> +		cpu_data->states[i].early_wakeups = 0;
> +
> +		cpu_data->states[i].hits_old += cpu_data->states[i].hits;
> +		cpu_data->states[i].hits_old /= 2;
> +		cpu_data->states[i].hits = 0;

I wonder if this decay rate is strong enough.
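
For example, something along these lines inside the same loop would forget
old history faster (just a sketch of the kind of change I mean, completely
untested):

	/* Hypothetical steeper decay: keep 1/4 of the history instead of 1/2. */
	cpu_data->states[i].early_wakeups_old += cpu_data->states[i].early_wakeups;
	cpu_data->states[i].early_wakeups_old /= 4;
	cpu_data->states[i].early_wakeups = 0;

	cpu_data->states[i].hits_old += cpu_data->states[i].hits;
	cpu_data->states[i].hits_old /= 4;
	cpu_data->states[i].hits = 0;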

Hope this helps.

... Doug

[1] http://fast.smythies.com/linux-pm/k419/k419-pn-sweep-teo.htm




* Re: [RFC/RFT][PATCH] cpuidle: New timer events oriented governor for tickless systems
  2018-10-14  6:53 [RFC/RFT][PATCH] cpuidle: New timer events oriented governor for tickless systems Doug Smythies
@ 2018-10-15  7:52 ` Rafael J. Wysocki
  2018-10-16  3:00 ` Doug Smythies
  1 sibling, 0 replies; 7+ messages in thread
From: Rafael J. Wysocki @ 2018-10-15  7:52 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Srinivas Pandruvada, Peter Zijlstra,
	Linux Kernel Mailing List, Frederic Weisbecker, Mel Gorman,
	Giovanni Gherdovich, Daniel Lezcano, Linux PM

Hi Doug,

On Sun, Oct 14, 2018 at 8:53 AM Doug Smythies <dsmythies@telus.net> wrote:
>
> Hi Rafael,
>
> I tried your TEO idle governor.

Thanks!

> On 2018.10.11 14:02 Rafael J. Wysocki wrote:
>
> ...[cut]...
>
> > It has been tested on a few different systems with a number of
> > different workloads and compared with the menu governor.  In the
> > majority of cases the workloads performed similarly regardless of
> > the cpuidle governor in use, but in one case the TEO governor
> > allowed the workload to perform 75% better, which is a clear
> > indication that some workloads may benefit from using it quite
> > a bit depending on the hardware they run on.
>
> Could you supply more detail for the 75% better case, so that
> I can try to repeat the results on my system?

This was encryption on Skylake X, but I'll get more details on that later.

> ...[cut]...
>
> > It is likely to select the "polling" state less often than menu
> > due to the lack of the extra latency limit derived from the
> > predicted idle duration, so the workloads depending on that
> > behavior may be worse off (but less energy should be used
> > while running them at the same time).
>
> Yes, and I see exactly that with the 1 core pipe test: Less
> performance (~10%), but also less processor package power
> (~3%), compared to the 8 patch set results from the other day.
>
> The iperf test (running 3 clients at once) results were similar
> for both power and throughput.
>
> > Overall, it selects deeper idle states than menu more often, but
> > that doesn't seem to make a significant difference in the majority
> > of cases.
>
> Not always: that vicious powernightmare sweep test that I run used
> far more processor package power and spent a staggering amount
> of time in idle state 0. [1].

Can you please remind me what exactly the workload is in that test?

>
> ... [cut]...
>
> > + * The sleep length is the maximum duration of the upcoming idle time of the
> > + * CPU and it is always known to the kernel.  Using it alone for selecting an
> > + * idle state for the CPU every time is a viable option in principle, but that
> > + * might lead to suboptimal results if the other wakeup sources are more active
> > + * for some reason.  Thus this governor estimates whether or not the CPU idle
> > + * time is likely to be significantly shorter than the sleep length and selects
> > + * an idle state for it in accordance with that, as follows:
>
> There is something wrong here, in my opinion, but I have not isolated exactly where
> by staring at the code.
> Read on.
>
> ... [cut]...
>
> > + * Assuming an idle interval every second tick, take the maximum number of CPU
> > + * wakeups regarded as recent to roughly correspond to 10 minutes.
> > + *
> > + * When the total number of CPU wakeups goes above this value, all of the
> > + * counters corresponding to the given CPU undergo a "decay" and the counting
> > + * of recent events starts over.
> > + */
> > +#define TEO_MAX_RECENT_WAKEUPS       (300 * HZ)
>
> In my opinion, there are problems with this approach:
>
> First, the huge range of possible times between decay events,
> anywhere from about a second to approximately a week.
>         In an idle 1000 HZ system, at 2 idle entries per 4-second watchdog event:
>         time = 300,000 wakes * 2 seconds/wake = 6.9 days
>         Note: The longest single idle time I measured was 3.5 seconds, but that is
>         always combined with a shorter one. Even using a more realistic, and
>         just now measured, average value of 0.7 idles/second would be 2.4 days.
>
> Second: It leads to unpredictable behaviour, sometimes for a long time, until
> the effects of some previous work are completely flushed. And from the first
> point above, that previous work might have been days ago. In my case, and while
> doing this work, it resulted in non-repeatability of tests and confusion
> for a while. Decay events are basically asynchronous to the actual tasks being
> executed. To gather data supporting this, I did the following:
>         Do a bunch of times {
>                 Start the powernightmare sweep test.
>                 Abort after several seconds (enough time to flush filters
>                         and prefer idle state 0)
>                 Wait a random amount of time
>                 Start a very light work load, but such that the sleep time
>                         per work cycle is less than one tick
>                 Observe varying times until idle state 0 is not excessively selected.
>                         Anywhere from 0 to 17 minutes (the maximum length of test) was observed.
>         }
>
> Additional information:
>
> Periodic workflow: I am having difficulty understanding an unexpectedly high
> number of idle entries/exits in steady state (i.e. once things have settled
> down and the filters have finally flushed). For example, a 60% work / 40% sleep
> at 500 hertz workflow seems to have an extra idle entry/exit. Trace excerpt
> (edited, the first column is uSeconds since the previous event):
>
>      140 cpu_idle: state=4294967295 cpu_id=7
>     1152 cpu_idle: state=4 cpu_id=7    <<<< The expected ~1200 uSecs of work
>      690 cpu_idle: state=4294967295 cpu_id=7  <<<< Unexpected, Expected ~800 uSecs
>       18 cpu_idle: state=2 cpu_id=7           <<<< So this extra idle makes up the difference
>      138 cpu_idle: state=4294967295 cpu_id=7  <<<< But why is it there?
>     1152 cpu_idle: state=4 cpu_id=7     <<<< Repeat
>      690 cpu_idle: state=4294967295 cpu_id=7
>       13 cpu_idle: state=2 cpu_id=7
>      143 cpu_idle: state=4294967295 cpu_id=7
>     1152 cpu_idle: state=4 cpu_id=7      <<<< Repeat
>      689 cpu_idle: state=4294967295 cpu_id=7
>       19 cpu_idle: state=2 cpu_id=7
>
> Now compare with trace data for kernel 4.16-rc6 with the 9 patches
> from the other day (which is what I expect to see):
>
>      846 cpu_idle: state=4294967295 cpu_id=7
>     1150 cpu_idle: state=4 cpu_id=7    <<<< The expected ~1200 uSecs of work
>      848 cpu_idle: state=4294967295 cpu_id=7 <<<< The expected ~800 uSecs of idle
>     1152 cpu_idle: state=4 cpu_id=7    <<<< Repeat
>      848 cpu_idle: state=4294967295 cpu_id=7
>     1151 cpu_idle: state=4 cpu_id=7    <<<< Repeat
>      848 cpu_idle: state=4294967295 cpu_id=7
>     1152 cpu_idle: state=4 cpu_id=7    <<<< Repeat
>      848 cpu_idle: state=4294967295 cpu_id=7
>     1152 cpu_idle: state=4 cpu_id=7    <<<< Repeat
>
> Anyway, in the end we really only care about power. So for this test:
> Kernel 4.19-rc6 + 9 patches: 9.133 watts
> TEO (on top of 4.19-rc7):
> At start, high number of idle state 0 entries: 11.33 watts (+24%)
> After awhile, it shifted to idle state 1: 10.00 watts (+9.5%)
> After awhile, it shifted to idle state 2: 9.67 watts (+5.9%)
> That seemed to finally be a steady state scenario (at least for over 2 hours).
> Note: it was always using idle state 4 also.
>
> ...[snip]...
>
> > +     /* Decay past events information. */
> > +     for (i = 0; i < drv->state_count; i++) {
> > +             cpu_data->states[i].early_wakeups_old += cpu_data->states[i].early_wakeups;
> > +             cpu_data->states[i].early_wakeups_old /= 2;
> > +             cpu_data->states[i].early_wakeups = 0;
> > +
> > +             cpu_data->states[i].hits_old += cpu_data->states[i].hits;
> > +             cpu_data->states[i].hits_old /= 2;
> > +             cpu_data->states[i].hits = 0;
>
> I wonder if this decay rate is strong enough.
>
> Hope this helps.

Yes, it does, thank you!

Cheers,
Rafael


* RE: [RFC/RFT][PATCH] cpuidle: New timer events oriented governor for tickless systems
  2018-10-14  6:53 [RFC/RFT][PATCH] cpuidle: New timer events oriented governor for tickless systems Doug Smythies
  2018-10-15  7:52 ` Rafael J. Wysocki
@ 2018-10-16  3:00 ` Doug Smythies
  2018-10-16  8:00   ` Rafael J. Wysocki
  1 sibling, 1 reply; 7+ messages in thread
From: Doug Smythies @ 2018-10-16  3:00 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Rafael J. Wysocki', 'Srinivas Pandruvada',
	'Peter Zijlstra', 'Linux Kernel Mailing List',
	'Frederic Weisbecker', 'Mel Gorman',
	'Giovanni Gherdovich', 'Daniel Lezcano',
	'Linux PM',
	Doug Smythies

On 2018.10.15 00:52 Rafael J. Wysocki wrote:
> On Sun, Oct 14, 2018 at 8:53 AM Doug Smythies <dsmythies@telus.net> wrote:
>> On 2018.10.11 14:02 Rafael J. Wysocki wrote:
>
> ...[cut]...
>
>>> Overall, it selects deeper idle states than menu more often, but
>>> that doesn't seem to make a significant difference in the majority
>>> of cases.
>>
>> Not always: that vicious powernightmare sweep test that I run used
>> far more processor package power and spent a staggering amount
>> of time in idle state 0. [1].
>
> Can you please remind me what exactly the workload is in that test?

The problem with my main test computer is that I have never had a good
way to make it use idle state 0 and/or idle state 1 a significant amount,
while not setting the need-resched flag. Due to the minimum overheads
involved, a tight-loop C program calling nanosleep() with an argument of
only 1 nanosecond actually takes about 50 (44 to 57 measured) microseconds
per iteration, which is much too long to invoke idle state 0 or 1 (at
least on my test computer). So, for my 8 CPU older model i7-2600K, the
idea is to spawn 40 threads doing short sleeps in an attempt to pile up
events such that the shallower idle states are invoked more often.
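
The measurement referred to above is essentially this (a minimal sketch, not
the exact program): time how long a nanosleep() request for 1 nanosecond
really takes on average.

	#include <stdio.h>
	#include <time.h>

	int main(void)
	{
		struct timespec req = { 0, 1 };		/* ask for 1 nanosecond */
		struct timespec t1, t2;
		long long total_ns = 0;
		long i, iters = 100000;

		for (i = 0; i < iters; i++) {
			clock_gettime(CLOCK_MONOTONIC, &t1);
			nanosleep(&req, NULL);
			clock_gettime(CLOCK_MONOTONIC, &t2);
			total_ns += (t2.tv_sec - t1.tv_sec) * 1000000000LL +
				    (t2.tv_nsec - t1.tv_nsec);
		}
		printf("average nanosleep(1 ns) took %lld uSec\n",
		       total_ns / iters / 1000);
		return 0;
	}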

Why 40 threads, one might wonder? This was many months ago now, but
I tested quite a number of threads, and 40 seemed to provide the
most interesting results for this type of work. I have not rechecked
it since (probably should).

For the testing I did in August for this:

"[PATCH] cpuidle: menu: Retain tick when shallow state is selected"
[2].
The thinking was to sweep through a wide range of sleep times,
and see if anything odd shows up. The test description is copied
here:

In [2] Doug wrote:
> Test 1: A Thomas Ilsche type "powernightmare" test:
> (forever ((10 times - variable usec sleep) 0.999 seconds sleep) X 40 staggered
> threads. Where the "variable" was from 0.05 to 5 in steps of 0.05, for the first ~200
> minutes of the test. (note: overheads mean that actual loop times are quite
> different.) And then from 5 to 500 in steps of 1, for the remaining 1000 minutes of
> the test. Each step ran for 2 minutes. The system was idle for 1 minute at the start,
> and a few minutes at the end of the graphs.
> While called "kernel 4.18", the baseline was actually from mainline at head =
> df2def4, or just after Rafael's linux-pm "pm-4.19-rc1-2" merge.
> (actually after the next acpi merge).
> Reference kernel = df2def4 with the two patches reverted.

However, that description was flawed, because there actually was never
a long sleep (incompetence on my part, but it doesn't really matter).
That test was 1200 minutes, and is worth looking at [3].
Notice how, as the test progresses, a migration through the idle
states can be observed, just as expected.
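
Roughly, each of the 40 threads does something like this (a hypothetical
reconstruction from the description above, not the actual test program; as
noted, in the runs referenced here the long sleep was effectively absent):

	#include <time.h>

	/* sleep_ns is stepped from 50 ns (0.05 uSec) up to 500 uSec over the test. */
	static volatile long sleep_ns;

	static void *pn_thread(void *arg)
	{
		struct timespec ts;
		int i;

		for (;;) {
			ts.tv_sec = 0;
			ts.tv_nsec = sleep_ns;
			for (i = 0; i < 10; i++)
				nanosleep(&ts, NULL);

			/* intended 0.999 second pause between bursts */
			ts.tv_sec = 0;
			ts.tv_nsec = 999000000L;
			nanosleep(&ts, NULL);
		}
		return NULL;
	}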

The next old reference of this test was the 8 patch set on top of
Kernel 4.19-rc6 [4], from a week ago. However, I shortened the test
by 900 minutes. Why? Well, there is only so much time in a day.

So now, back to the test this thread is about [1]. It might be
argued that maybe the TEO governor should be spending more time
in idle state 0 near the start of test, as the test shows. Trace
data does, maybe, support such an argument, but I haven't had
time to dig into it.

I also wonder if some of the weirdness later in the test is
repeatable or not (re: discussion elsewhere on this thread,
now cut, about lack of repeatability). However, I have not
had time to repeat the test.

Hope this helps, and sorry for any confusion and this long e-mail.
 
... Doug

[1] http://fast.smythies.com/linux-pm/k419/k419-pn-sweep-teo.htm
[2] https://marc.info/?l=linux-pm&m=153531591826718&w=2
[3] http://fast.smythies.com/linux-pm/k418-pn-sweep-rjw.htm
[4] http://fast.smythies.com/linux-pm/k419/k419-pn-sweep-rjw.htm
   



* Re: [RFC/RFT][PATCH] cpuidle: New timer events oriented governor for tickless systems
  2018-10-16  3:00 ` Doug Smythies
@ 2018-10-16  8:00   ` Rafael J. Wysocki
  0 siblings, 0 replies; 7+ messages in thread
From: Rafael J. Wysocki @ 2018-10-16  8:00 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael J. Wysocki', 'Srinivas Pandruvada',
	'Peter Zijlstra', 'Linux Kernel Mailing List',
	'Frederic Weisbecker', 'Mel Gorman',
	'Giovanni Gherdovich', 'Daniel Lezcano',
	'Linux PM'

On Tuesday, October 16, 2018 5:00:19 AM CEST Doug Smythies wrote:
> On 2018.10.15 00:52 Rafael J. Wysocki wrote:
> > On Sun, Oct 14, 2018 at 8:53 AM Doug Smythies <dsmythies@telus.net> wrote:
> >> On 2018.10.11 14:02 Rafael J. Wysocki wrote:
> >
> > ...[cut]...
> >
> >>> Overall, it selects deeper idle states than menu more often, but
> >>> that doesn't seem to make a significant difference in the majority
> >>> of cases.
> >>
> >> Not always: that vicious powernightmare sweep test that I run used
> >> far more processor package power and spent a staggering amount
> >> of time in idle state 0. [1].
> >
> > Can you please remind me what exactly the workload is in that test?
> 
> The problem with my main test computer is that I have never had a good
> way to make it use idle state 0 and/or idle state 1 a significant amount,
> while not setting the need-resched flag. Due to the minimum overheads
> involved, a tight-loop C program calling nanosleep() with an argument of
> only 1 nanosecond actually takes about 50 (44 to 57 measured) microseconds
> per iteration, which is much too long to invoke idle state 0 or 1 (at
> least on my test computer). So, for my 8 CPU older model i7-2600K, the
> idea is to spawn 40 threads doing short sleeps in an attempt to pile up
> events such that the shallower idle states are invoked more often.
> 
> Why 40 threads, one might wonder? This was many months ago now, but
> I tested quite a number of threads, and 40 seemed to provide the
> most interesting results for this type of work. I have not rechecked
> it since (probably should).
> 
> For the testing I did in August for this:
> 
> "[PATCH] cpuidle: menu: Retain tick when shallow state is selected"
> [2].
> The thinking was to sweep through a wide range of sleep times,
> and see if anything odd shows up. The test description is copied
> here:
> 
> In [2] Doug wrote:
> > Test 1: A Thomas Ilsche type "powernightmare" test:
> > (forever ((10 times - variable usec sleep) 0.999 seconds sleep) X 40 staggered
> > threads. Where the "variable" was from 0.05 to 5 in steps of 0.05, for the first ~200
> > minutes of the test. (note: overheads mean that actual loop times are quite
> > different.) And then from 5 to 500 in steps of 1, for the remaining 1000 minutes of
> > the test. Each step ran for 2 minutes. The system was idle for 1 minute at the start,
> > and a few minutes at the end of the graphs.
> > While called "kernel 4.18", the baseline was actually from mainline at head =
> > df2def4, or just after Rafael's linux-pm "pm-4.19-rc1-2" merge.
> > (actually after the next acpi merge).
> > Reference kernel = df2def4 with the two patches reverted.
> 
> However, that description was flawed, because there actually was never
> a long sleep (incompetence on my part, but it doesn't really matter).
> That test was 1200 minutes, and is worth looking at [3].
> Notice how, as the test progresses, a migration through the idle
> states can be observed, just as expected.
> 
> The next old reference of this test was the 8 patch set on top of
> Kernel 4.19-rc6 [4], from a week ago. However, I shortened the test
> by 900 minutes. Why? Well, there is only so much time in a day.
> 
> So now, back to the test this thread is about [1]. It might be
> argued that maybe the TEO governor should be spending more time
> in idle state 0 near the start of test, as the test shows. Trace
> data does, maybe, support such an argument, but I haven't had
> time to dig into it.
> 
> I also wonder if some of the weirdness later in the test is
> repeatable or not (re: discussion elsewhere on this thread,
> now cut, about lack of repeatability). However, I have not
> had time to repeat the test.
> 
> Hope this helps, and sorry for any confusion and this long e-mail.

Yes, it helps, many thanks again and no worries about long emails. :-)

I'm going to make some changes to the new governor to take your observations
into account.

Cheers,
Rafael



* Re: [RFC/RFT][PATCH] cpuidle: New timer events oriented governor for tickless systems
  2018-10-22  8:54 ` Giovanni Gherdovich
@ 2018-10-26  7:58   ` Rafael J. Wysocki
  0 siblings, 0 replies; 7+ messages in thread
From: Rafael J. Wysocki @ 2018-10-26  7:58 UTC (permalink / raw)
  To: ggherdovich
  Cc: rjw, linux-pm, srinivas.pandruvada, peterz, linux-kernel,
	frederic, mgorman, dsmythies, daniel.lezcano

Hi Giovanni,

On Mon, Oct 22, 2018 at 10:51 AM Giovanni Gherdovich
<ggherdovich@suse.cz> wrote:
>
> Hello Rafael,
>
> I ran some benchmarks and will send you a detailed report later;

Thanks a lot, much appreciated!

Even though I'm about to send a v2 of the $subject patch which is a
complete rewrite of some parts of the code, results from the previous
one will be good to see anyway.

> for now I have some questions to make sure I understand the code.
>
> First off, I like your algorithm and you do an excellent job of
> documenting it with comments. Since it's a heuristic, it's not "obvious"
> code and comments help a lot.
>
> On Thu, 2018-10-11 at 23:01 +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >

[cut]

> > +static void teo_update(struct cpuidle_driver *drv,
> > +                        struct cpuidle_device *dev)
> > +{
> > +     struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> > +     unsigned int measured_us = dev->last_residency;
>
> question: how reliable is the value "dev->last_residency"? does it come
> "from the hardware" somehow? I'm asking because I'm more familiar with
> CPUfreq than CPUidle, and there you have a bit of a disconnect between what
> the OS think and what the hardware actually does. So I wonder if there is a
> similar situation with the last_residency value, and ask if it comes from
> some sort of feedback from the processor itself (if that makes any
> sense). I see it's written to in the cpuidle_enter_state() function in
> cpuidle.c, but I couldn't clearly understand that function either.

It is measured by the kernel, so it is not very precise, but it should
be precise enough for idle duration estimation on average.

> > +     int i = cpu_data->last_state;
> > +
> > +     if (measured_us >= 2 * drv->states[i].exit_latency)
> > +             measured_us -= drv->states[i].exit_latency;
> > +     else
> > +             measured_us /= 2;
> > +
> > +     /* Find the state matching the measured idle duration. */
> > +     for (i = drv->state_count - 1; i > 0; i--) {
> > +             if (drv->states[i].target_residency <= measured_us)
> > +                     break;
> > +     }
>
> Wait, I'm lost here. You just initialized "int i = cpu_data->last_state;"
> ten lines above, and here you're computing "given how much we idled, what
> would have been the best state to enter?". Was the initialization of i a
> mistake? what's the intention?

It's intentional, but just because of the measured_us computation
earlier that would need to refer to
drv->states[cpu_data->last_state].exit_latency twice.  "i" serves as
auxiliary index storage for that.

> > +
> > +     cpu_data->recent_wakeups++;
> > +
> > +     /*
> > +      * If the state matches the time till the closest timer event used
> > +      * during the most recent state selection too, it would be sufficient
> > +      * to use that time alone for the idle state selection, so increment
> > +      * the "hits" number for the matching state and clear the "early series"
> > +      * data.
> > +      *
> > +      * Note that sleep_length_us cannot be less than measured_us (or it
> > +      * would mean a late timer event), so it is sufficient to check the
> > +      * lower boundary here.
> > +      */
> > +     if (i == drv->state_count - 1 ||
> > +         drv->states[i+1].target_residency > cpu_data->sleep_length_us) {
>
> Correct me if I'm wrong: we know that states[i+1].target_residency > measured_us,
> because that's just how we've chosen i. We know that sleep_length_us >= measured_us,
> because that's how timers work.
>
> Now, the naive way to tell if we're in a "hit" situation would be to check if
> measured_us is equal (or, well, "close") to sleep_length_us. If, on the
> other side, measured_us is a lot smaller than sleep_length_us, it's an
> "early". But you don't check it that way, what you do is to see if
> target_residency[i+1] is in between measured_us and sleep_length_us or
> not. Like this (sorry for the sloppy notation):
>
> target_residency[i] < measured_us <= sleep_length_us < target_residency[i+1]
>
> and that would be a hit. Otherwise:
>
> target_residency[i] < measured_us < target_residency[i+1] <= sleep_length_us
>
> and that's an "early". Right?

The idea here is for "early" to mean "so much earlier than it makes a
real difference", where "a real difference" is when different idle
states need to be used in both cases.

For example, say you have two idle states, X with the target residency
of 20 and Y with the target residency of 150.  Say that your
sleep_length is 300 and your measured_us (idle duration) is 100.
Then, you should have used idle state X and knowing sleep_length
upfront wouldn't have helped, so it is a "miss".  Now, if your
sleep_length is (again) 300 and your measured_us is 200 (say), you
could still use idle state Y (as indicated by the sleep_length value),
so relying on sleep_length alone would give you a correct outcome, so
it is a "hit".

Generally, a "hit" is when you could rely on sleep_length alone for
the current outcome.

I'm afraid I cannot explain this much better, sorry. :-)
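
Maybe putting it into code helps.  A user-space sketch of the same check,
using the two example states above (X = 20 us, Y = 150 us):

	#include <stdio.h>

	#define NR_STATES 2

	static const unsigned int target_residency[NR_STATES] = { 20, 150 };

	static const char *classify(unsigned int measured_us,
				    unsigned int sleep_length_us)
	{
		int i;

		/* Find the deepest state matching the measured idle duration. */
		for (i = NR_STATES - 1; i > 0; i--)
			if (target_residency[i] <= measured_us)
				break;

		/* "Hit" if the sleep length alone would pick the same state. */
		if (i == NR_STATES - 1 ||
		    target_residency[i + 1] > sleep_length_us)
			return "hit";
		return "early";
	}

	int main(void)
	{
		printf("measured 100, sleep 300 -> %s\n", classify(100, 300));
		printf("measured 200, sleep 300 -> %s\n", classify(200, 300));
		return 0;
	}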

> Again, no guarantee that i is actually the state we idle-ed in, it's just
> one we make up for this bookkeeping.

[cut]

> > +     early_count = 0;
> > +     max_count = 0;
> > +     max_count_idx = -1;
> > +     prev_count = 0;
> > +     prev_idx = -1;
> > +     idx = -1;
> > +
> > +     for (i = 0; i < drv->state_count; i++) {
> > +             struct cpuidle_state *s = &drv->states[i];
> > +             struct cpuidle_state_usage *su = &dev->states_usage[i];
> > +             u64 count = cpu_data->states[i].early_wakeups +
> > +                             cpu_data->states[i].early_wakeups_old;
> > +
> > +             if (s->disabled || su->disable) {
> > +                     /*
> > +                      * This state is disabled, so add its count of matched
> > +                      * "early" wakeups to the last enabled state's count.
> > +                      */
> > +                     if (prev_idx >= 0) {
> > +                             prev_count += count;
> > +                             if (prev_count > max_count) {
> > +                                     max_count = prev_count;
> > +                                     max_count_idx = prev_idx;
> > +                             }
> > +                     }
> > +                     continue;
>
> I'm not sure I get this part: I have the impression you introduced the
> prev_idx variable exactly for its use in this check (skip disabled states),
> but it looks to me you had to use idx instead. At every iteration, idx is
> the last enabled state. Why prev_idx?

Well, this is a bit messy, sorry about that.

The idea is that if one of the states was disabled, it can't be
selected as the one with the maximum count of "early" hits, so its
count of "early" hits is kind of "stolen" by the first enabled
shallower state to make it look more "important".

I've dropped this in the new version of the patch as it covers a
corner case that may not be worth the effort.

> > +             }
> > +
> > +             if (idx < 0)
> > +                     idx = i; /* first enabled state */
> > +
> > +             if (s->target_residency > duration_us) {
> > +                     /*
> > +                      * If the next wakeup is expected to be "early", the
> > +                      * time frame of it is known already.
> > +                      */
> > +                     if (duration_us < cpu_data->sleep_length_us) {
> > +                             /*
> > +                              * Use a physical idle state, not busy polling,
> > +                              * unless a timer will trigger really soon.
> > +                              */
> > +                             if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> > +                                 s->exit_latency <= latency_req &&
> > +                                 s->target_residency <= cpu_data->sleep_length_us) {
>
> Just to be clear: 0 is a polling idle state; can there be any other? You
> seem to keep it generic here, but seems to me you're actually checking if
> idx so far is zero and try to see if there is another option.

The polling state has to be state 0.

> Also: you put this check "may I convince you not to poll?" inside the
> branch that handle the 3-or-more case (where duration_us has been changed
> to avg_us). My question is: don't you need the check also outside of the
> branch? (where, if I understand correctly, duration_us is equal to
> sleep_length_us).

No, because if s->target_residency == cpu_data->sleep_length_us, we'll
still use polling and duration_us cannot be greater than the sleep
length.

> > +                                     duration_us = s->target_residency;
> > +                                     idx = i;
> > +                             }
> > +                             break;
> > +                     }
> > +
> > +                     /*
> > +                      * If the number of "hits" for the state matching the
> > +                      * sleep length is greater than the total number of
> > +                      * "early" wakeups for all of the shallower states, the
> > +                      * state selected so far is the one to use.
> > +                      */
> > +                     if (early_count <= cpu_data->states[idx].hits +
> > +                                        cpu_data->states[idx].hits_old)
> > +                             break;
> > +
> > +                     /*
> > +                      * It is more likely that one of the shallower states
> > +                      * will match the idle duration measured after wakeup,
> > +                      * so take the one with the maximum count of matched
> > +                      * "early" wakeups.
> > +                      */
> > +                     idx = max_count_idx;
> > +                     duration_us = drv->states[idx].target_residency;
> > +                     break;
> > +             }
> > +             if (s->exit_latency > latency_req) {
> > +                     /*
> > +                      * If we break out of the loop for latency reasons, use
> > +                      * the target residency of the selected state as the
> > +                      * expected idle duration to avoid stopping the tick
> > +                      * as long as that target residency is low enough.
> > +                      */
> > +                     duration_us = drv->states[idx].target_residency;
> > +                     break;
> > +             }
> > +
> > +             early_count += prev_count;
>
> I have the slight feeling that here you wanted "early_count += count;"
> instead. As I understand early_count is meant to be "one iteration behind"
> in the loop (meaning it accumulates up to i-1). If you update it with
> prev_count instead of count, that would make it two iterations
> behind. Wrong?

So that's because of the "stealing" idea described above, but I might
have made a mistake here now looking at it again.

[cut]

>
> Just a curiosity: why isn't the disable() function from the
> cpuidle_governor interface implemented? I see it isn't in "menu" as well,
> so I wonder what disable() is supposed to do.

Clean up if any allocations have been made in ->enable(), for example.
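
For instance, a governor that allocated per-CPU data in ->enable() would free
it in ->disable(), roughly like this (a made-up sketch, not taken from any
in-tree governor):

	/* Hypothetical per-CPU governor data allocated on enable. */
	struct foo_cpu {
		u64 stats[CPUIDLE_STATE_MAX];
	};

	static DEFINE_PER_CPU(struct foo_cpu *, foo_cpu_data);

	static int foo_enable_device(struct cpuidle_driver *drv,
				     struct cpuidle_device *dev)
	{
		struct foo_cpu *data = kzalloc(sizeof(*data), GFP_KERNEL);

		if (!data)
			return -ENOMEM;

		per_cpu(foo_cpu_data, dev->cpu) = data;
		return 0;
	}

	static void foo_disable_device(struct cpuidle_driver *drv,
				       struct cpuidle_device *dev)
	{
		kfree(per_cpu(foo_cpu_data, dev->cpu));
		per_cpu(foo_cpu_data, dev->cpu) = NULL;
	}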

Thanks,
Rafael


* Re: [RFC/RFT][PATCH] cpuidle: New timer events oriented governor for tickless systems
  2018-10-11 21:01 Rafael J. Wysocki
@ 2018-10-22  8:54 ` Giovanni Gherdovich
  2018-10-26  7:58   ` Rafael J. Wysocki
  0 siblings, 1 reply; 7+ messages in thread
From: Giovanni Gherdovich @ 2018-10-22  8:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: Srinivas Pandruvada, Peter Zijlstra, LKML, Frederic Weisbecker,
	Mel Gorman, Doug Smythies, Daniel Lezcano

Hello Rafael,

I ran some benchmarks and will send you a detailed report later; for now I
have some questions to make sure I understand the code.

First off, I like your algorithm and you do an excellent job of
documenting it with comments. Since it's a heuristic, it's not "obvious"
code and comments help a lot.

On Thu, 2018-10-11 at 23:01 +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The venerable menu governor does some things that are quite
> questionable in my view.  First, it includes timer wakeups in
> the pattern detection data and mixes them up with wakeups from
> other sources which in some cases causes it to expect what
> essentially would be a timer wakeup in a time frame in which
> no timer wakeups are possible (because it knows the time until
> the next timer event and that is later than the expected wakeup
> time).  Second, it uses the extra exit latency limit based on
> the predicted idle duration and depending on the number of tasks
> waiting on I/O, even though those tasks may run on a different
> CPU when they are woken up.  Moreover, the time ranges used by it
> for the sleep length correction factors are not correlated to the
> list of available idle states in any way whatever and different
> correction factors are used depending on whether or not there are
> tasks waiting on I/O, which again doesn't imply anything in
> particular.
> 
> A major rework of the menu governor would be required to address
> these issues and it is likely that the performance of some
> workloads would suffer from that.  It is thus better to introduce
> an entirely new governor without them and let everybody use the
> governor that works better with their actual workloads.
> 
> The new governor introduced here, the timer events oriented (TEO)
> governor, uses the same basic strategy as menu: it always tries to
> find the deepest idle state that can be used in the given conditions.
> However, it uses a different approach to that problem.  Namely,
> instead of attempting to predict the idle duration itself which is
> likely to be inaccurate no matter what, it selects the idle state
> that was the best "match" for the time range given by the sleep
> length value in the past (see the code comments for details).
> 
> It has been tested on a few different systems with a number of
> different workloads and compared with the menu governor.  In the
> majority of cases the workloads performed similarly regardless of
> the cpuidle governor in use, but in one case the TEO governor
> allowed the workload to perform 75% better, which is a clear
> indication that some workloads may benefit from using it quite
> a bit depending on the hardware they run on.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> I'm running this governor on all of my systems now without any
> visible adverse effects.
> 
> It is likely to select the "polling" state less often than menu
> due to the lack of the extra latency limit derived from the
> predicted idle duration, so the workloads depending on that
> behavior may be worse off (but less energy should be used
> while running them at the same time).
> 
> Overall, it selects deeper idle states than menu more often, but
> that doesn't seem to make a significant difference in the majority
> of cases.
> 
> In this preliminary revision it overtakes menu as the default governor
> for tickless systems (due to the higher rating), but that is likely
> to change going forward.  At this point I'm mostly asking for feedback
> and possibly testing with whatever workloads you can throw at it.
> 
> The patch should apply on top of 4.19-rc7, although I'm running it on
> top of my linux-next branch.
> 
> ---
>  drivers/cpuidle/Kconfig            |   11 +
>  drivers/cpuidle/governors/Makefile |    1 
>  drivers/cpuidle/governors/teo.c    |  397 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 409 insertions(+)
> 
> Index: linux-pm/drivers/cpuidle/governors/teo.c
> ===================================================================
> --- /dev/null
> +++ linux-pm/drivers/cpuidle/governors/teo.c
> @@ -0,0 +1,397 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Timer events oriented CPU idle governor
> + *
> + * Copyright (C) 2018 Intel Corporation
> + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> + *
> + * The idea of this governor is based on the observation that on many systems
> + * timer events are two or more orders of magnitude more frequent than any
> + * other interrupts, so they are likely to be the most significant source of CPU
> + * wakeups from idle states.  Moreover, it is only necessary to estimate which
> + * idle state is the most likely one to match the idle duration measured after
> + * wakeup and it may not be necessary to predict the idle duration itself
> + * accurately.  [For example, if only two idle states are available, it is
> + * sufficient to estimate whether or not the first state is more likely to match
> + * the measured idle duration than the second one and the precise length of the
> + * idle interval itself need not be predicted for this purpose.]
> + *
> + * Also, if the time till the closest timer event (sleep length) is used to
> + * select an idle state for a CPU and that state turns out to match the idle
> + * duration measured after wakeup, it doesn't matter if the CPU really has been
> + * woken up by the timer.  What matters is that it has been woken up in the time
> + * frame matching the selected state and not what mechanism in particular has
> + * been the real cause of the wakeup.
> + *
> + * The sleep length is the maximum duration of the upcoming idle time of the
> + * CPU and it is always known to the kernel.  Using it alone for selecting an
> + * idle state for the CPU every time is a viable option in principle, but that
> + * might lead to suboptimal results if the other wakeup sources are more active
> + * for some reason.  Thus this governor estimates whether or not the CPU idle
> + * time is likely to be significantly shorter than the sleep length and selects
> + * an idle state for it in accordance with that, as follows:
> + *
> + * - If there is a series of 3 or more consecutive wakeups each significantly
> + *   earlier than the closest timer event, expect one more of them to occur and
> + *   use the average time between them to select an idle state for the CPU.
> + *
> + * - Otherwise, find the state on the basis of the sleep length and state
> + *   statistics collected over time:
> + *
> + *   o Find the deepest idle state whose target residency is less than or equal
> + *     to the sleep length.
> + *
> + *   o Select it if the number of times it matched both the sleep length and the
> + *     idle duration measured after wakeup in the past is not less than the total
> + *     number of "early" CPU wakeups matched by all of the shallower states.
> + *
> + *   o Otherwise, select the shallower state with the greatest number of matched
> + *     "early" wakeups.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/cpuidle.h>
> +#include <linux/jiffies.h>
> +#include <linux/tick.h>
> +
> +/*
> + * Assuming an idle interval every second tick, take the maximum number of CPU
> + * wakeups regarded as recent to roughly correspond to 10 minutes.
> + *
> + * When the total number of CPU wakeups goes above this value, all of the
> + * counters corresponding to the given CPU undergo a "decay" and the counting
> + * of recent events starts over.
> + */
> +#define TEO_MAX_RECENT_WAKEUPS	(300 * HZ)
> +
> +/**
> + * struct teo_idle_state - Idle state data used by the TEO cpuidle governor.
> + * @early_wakeups: Number of "early" CPU wakeups matched by this state.
> + * @early_wakeups_old: Past "early" CPU wakeups matched by this state (decayed).
> + * @hits: Number of "on time" CPU wakeups matched by this state.
> + * @hits_old: Past "on time" CPU wakeups matched by this state (decayed).
> + *
> + * A CPU wakeup is "matched" by a given idle state if the idle duration measured
> + * after the wakeup is between the target residency of that state and the target
> + * residency of the next one (or if this is the deepest state provided, it
> + * "matches" a CPU wakeup when the measured idle duration is at least equal to
> + * its target residency).
> + *
> + * Also, from the TEO governor perspective, a CPU wakeup from idle is "early" if
> + * it occurs significantly earlier than the closest expected timer event (that
> + * is, early enough to match an idle state shallower than the one matching the
> + * time till the closest timer event).  Otherwise, the wakeup is "on time", or
> + * it is a "hit".
> + */
> +struct teo_idle_state {
> +	u64 early_wakeups;
> +	u64 early_wakeups_old;
> +	u64 hits;
> +	u64 hits_old;
> +};
> +
> +/**
> + * struct teo_cpu - CPU data used by the TEO cpuidle governor.
> + * @states: Idle states data corresponding to this CPU.
> + * @early_series_count: Number of recent "early" wakeups in a row (series).
> + * @early_series_time_us: Total idle time corresponding to the "early" series.
> + * @recent_wakeups: Number of wakeups since the counters were decayed last time.
> + * @sleep_length_us: Time till the closest timer event.
> + * @last_state: Idle state entered by the CPU last time.
> + */
> +struct teo_cpu {
> +	struct teo_idle_state states[CPUIDLE_STATE_MAX];
> +	u64 early_series_count;
> +	u64 early_series_time_us;
> +	unsigned int recent_wakeups;
> +	unsigned int sleep_length_us;
> +	int last_state;
> +};
> +
> +static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
> +
> +/**
> + * teo_update - Update CPU data after wakeup.
> + * @drv: cpuidle driver containing state data.
> + * @dev: Target CPU.
> + */
> +static void teo_update(struct cpuidle_driver *drv,
> +			   struct cpuidle_device *dev)
> +{
> +	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> +	unsigned int measured_us = dev->last_residency;

question: how reliable is the value "dev->last_residency"? does it come
"from the hardware" somehow? I'm asking because I'm more familiar with
CPUfreq than CPUidle, and there you have a bit of a disconnect between what
the OS think and what the hardware actually does. So I wonder if there is a
similar situation with the last_residency value, and ask if it comes from
some sort of feedback from the processor itself (if that makes any
sense). I see it's written to in the cpuidle_enter_state() function in
cpuidle.c, but I couldn't clearly understand that function either.

> +	int i = cpu_data->last_state;
> +
> +	if (measured_us >= 2 * drv->states[i].exit_latency)
> +		measured_us -= drv->states[i].exit_latency;
> +	else
> +		measured_us /= 2;
> +
> +	/* Find the state matching the measured idle duration. */
> +	for (i = drv->state_count - 1; i > 0; i--) {
> +		if (drv->states[i].target_residency <= measured_us)
> +			break;
> +	}

Wait, I'm lost here. You just initialized "int i = cpu_data->last_state;"
ten lines above, and here you're computing "given how much we idled, what
would have been the best state to enter?". Was the initialization of i a
mistake? what's the intention?

> +
> +	cpu_data->recent_wakeups++;
> +
> +	/*
> +	 * If the state matches the time till the closest timer event used
> +	 * during the most recent state selection too, it would be sufficient
> +	 * to use that time alone for the idle state selection, so increment
> +	 * the "hits" number for the matching state and clear the "early series"
> +	 * data.
> +	 *
> +	 * Note that sleep_length_us cannot be less than measured_us (or it
> +	 * would mean a late timer event), so it is sufficient to check the
> +	 * lower boundary here.
> +	 */
> +	if (i == drv->state_count - 1 ||
> +	    drv->states[i+1].target_residency > cpu_data->sleep_length_us) {

Correct me if I'm wrong: we know that states[i+1].target_residency > measured_us,
because that's just how we've chosen i. We know that sleep_length_us >= measured_us,
because that's how timers work.

Now, the naive way to tell if we're in a "hit" situation would be to check if
measured_us is equal (or, well, "close") to sleep_length_us. If, on the
other side, measured_us is a lot smaller than sleep_length_us, it's an
"early". But you don't check it that way, what you do is to see if
target_residency[i+1] is in between measured_us and sleep_length_us or
not. Like this (sorry for the sloppy notation):

target_residency[i] < measured_us <= sleep_length_us < target_residency[i+1]

and that would be a hit. Otherwise:

target_residency[i] < measured_us < target_residency[i+1] <= sleep_length_us

and that's an "early". Right?

Again, no guarantee that i is actually the state we idle-ed in, it's just
one we make up for this bookkeeping.


> +		cpu_data->states[i].hits++;
> +
> +		cpu_data->early_series_count = 0;
> +		cpu_data->early_series_time_us = 0;
> +	} else {
> +		/*
> +		 * The wakeup was significantly earlier than the closest timer
> +		 * event used for idle state selection, so update the CPU data
> +		 * to reflect that.
> +		 */
> +		cpu_data->states[i].early_wakeups++;
> +
> +		cpu_data->early_series_count++;
> +		cpu_data->early_series_time_us += measured_us;
> +	}
> +
> +	if (cpu_data->recent_wakeups <= TEO_MAX_RECENT_WAKEUPS)
> +		return;
> +
> +	cpu_data->recent_wakeups = 0;
> +
> +	/* Decay past events information. */
> +	for (i = 0; i < drv->state_count; i++) {
> +		cpu_data->states[i].early_wakeups_old += cpu_data->states[i].early_wakeups;
> +		cpu_data->states[i].early_wakeups_old /= 2;
> +		cpu_data->states[i].early_wakeups = 0;
> +
> +		cpu_data->states[i].hits_old += cpu_data->states[i].hits;
> +		cpu_data->states[i].hits_old /= 2;
> +		cpu_data->states[i].hits = 0;
> +	}
> +}
> +
> +/**
> + * teo_select - Selects the next idle state to enter.
> + * @drv: cpuidle driver containing state data.
> + * @dev: Target CPU.
> + * @stop_tick: Indication on whether or not to stop the scheduler tick.
> + */
> +static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> +		      bool *stop_tick)
> +{
> +	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> +	int latency_req = cpuidle_governor_latency_req(dev->cpu);
> +	unsigned int duration_us;
> +	ktime_t delta_next;
> +	int idx, i;
> +	u64 early_count, max_count, prev_count;
> +	int max_count_idx, prev_idx;
> +
> +	if (cpu_data->last_state >= 0) {
> +		teo_update(drv, dev);
> +		cpu_data->last_state = -1;
> +	}
> +
> +	cpu_data->sleep_length_us = ktime_to_us(tick_nohz_get_sleep_length(&delta_next));
> +
> +	duration_us = cpu_data->sleep_length_us;
> +
> +	if (cpu_data->early_series_count > 2) {
> +		unsigned int avg_us;
> +
> +		/*
> +		 * There is a series of consecutive wakeups each significantly
> +		 * earlier than the closest timer event, so expect one more of
> +		 * them to occur and use the average time between them for idle
> +		 * state selection, as long as it is within the tick boundary
> +		 * and the tick has not been stopped.
> +		 */
> +		avg_us = div64_u64(cpu_data->early_series_time_us,
> +				   cpu_data->early_series_count);
> +
> +		if (avg_us < TICK_USEC && avg_us < duration_us &&
> +		    !tick_nohz_tick_stopped())
> +			duration_us = avg_us;
> +	}
> +
> +	early_count = 0;
> +	max_count = 0;
> +	max_count_idx = -1;
> +	prev_count = 0;
> +	prev_idx = -1;
> +	idx = -1;
> +
> +	for (i = 0; i < drv->state_count; i++) {
> +		struct cpuidle_state *s = &drv->states[i];
> +		struct cpuidle_state_usage *su = &dev->states_usage[i];
> +		u64 count = cpu_data->states[i].early_wakeups +
> +				cpu_data->states[i].early_wakeups_old;
> +
> +		if (s->disabled || su->disable) {
> +			/*
> +			 * This state is disabled, so add its count of matched
> +			 * "early" wakeups to the last enabled state's count.
> +			 */
> +			if (prev_idx >= 0) {
> +				prev_count += count;
> +				if (prev_count > max_count) {
> +					max_count = prev_count;
> +					max_count_idx = prev_idx;
> +				}
> +			}
> +			continue;

I'm not sure I get this part: I have the impression you introduced the
prev_idx variable exactly for its use in this check (skip disabled states),
but it looks to me you had to use idx instead. At every iteration, idx is
the last enabled state. Why prev_idx?

> +		}
> +
> +		if (idx < 0)
> +			idx = i; /* first enabled state */
> +
> +		if (s->target_residency > duration_us) {
> +			/*
> +			 * If the next wakeup is expected to be "early", the
> +			 * time frame of it is known already.
> +			 */
> +			if (duration_us < cpu_data->sleep_length_us) {
> +				/*
> +				 * Use a physical idle state, not busy polling,
> +				 * unless a timer will trigger really soon.
> +				 */
> +				if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> +				    s->exit_latency <= latency_req &&
> +				    s->target_residency <= cpu_data->sleep_length_us) {

Just to be clear: 0 is a polling idle state; can there be any other? You
seem to keep it generic here, but seems to me you're actually checking if
idx so far is zero and try to see if there is another option.

Also: you put this check "may I convince you not to poll?" inside the
branch that handle the 3-or-more case (where duration_us has been changed
to avg_us). My question is: don't you need the check also outside of the
branch? (where, if I understand correctly, duration_us is equal to
sleep_length_us).

> +					duration_us = s->target_residency;
> +					idx = i;
> +				}
> +				break;
> +			}
> +
> +			/*
> +			 * If the number of "hits" for the state matching the
> +			 * sleep length is greater than the total number of
> +			 * "early" wakeups for all of the shallower states, the
> +			 * state selected so far is the one to use.
> +			 */
> +			if (early_count <= cpu_data->states[idx].hits +
> +					   cpu_data->states[idx].hits_old)
> +				break;
> +
> +			/*
> +			 * It is more likely that one of the shallower states
> +			 * will match the idle duration measured after wakeup,
> +			 * so take the one with the maximum count of matched
> +			 * "early" wakeups.
> +			 */
> +			idx = max_count_idx;
> +			duration_us = drv->states[idx].target_residency;
> +			break;
> +		}
> +		if (s->exit_latency > latency_req) {
> +			/*
> +			 * If we break out of the loop for latency reasons, use
> +			 * the target residency of the selected state as the
> +			 * expected idle duration to avoid stopping the tick
> +			 * as long as that target residency is low enough.
> +			 */
> +			duration_us = drv->states[idx].target_residency;
> +			break;
> +		}
> +
> +		early_count += prev_count;

I have the slight feeling that here you wanted "early_count += count;"
instead. As I understand early_count is meant to be "one iteration behind"
in the loop (meaning it accumulates up to i-1). If you update it with
prev_count instead of count, that would make it two iterations
behind. Wrong?

> +
> +		idx = i;
> +
> +		if (count > max_count) {
> +			max_count = count;
> +			max_count_idx = idx;
> +		}
> +
> +		prev_count = count;
> +		prev_idx = idx;
> +	}
> +
> +	if (idx < 0)
> +		idx = 0; /* No states enabled. Must use 0. */
> +
> +	/*
> +	 * Don't stop the tick if the selected state is a polling one or if the
> +	 * expected idle duration is shorter than the tick period length.
> +	 */
> +	if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
> +	    duration_us < TICK_USEC) && !tick_nohz_tick_stopped()) {
> +		unsigned int delta_next_us = ktime_to_us(delta_next);
> +
> +		cpu_data->sleep_length_us = delta_next_us;
> +		*stop_tick = false;
> +
> +		if (idx > 0 && drv->states[idx].target_residency > delta_next_us) {
> +			/*
> +			 * The tick is not going to be stopped and the target
> +			 * residency of the state to be returned is not within
> +			 * the time until the closest timer event including the
> +			 * tick, so try to correct that.
> +			 */
> +			for (i = idx - 1; i > 0; i--) {
> +				if (drv->states[i].disabled ||
> +				    dev->states_usage[i].disable)
> +					continue;
> +
> +				if (drv->states[i].target_residency <= delta_next_us)
> +					break;
> +			}
> +			idx = i;
> +		}
> +	}
> +
> +	return idx;
> +}
> +
> +/**
> + * teo_reflect - Note that governor data for the CPU need to be updated.
> + * @dev: Target CPU.
> + * @state: Entered state.
> + */
> +static void teo_reflect(struct cpuidle_device *dev, int state)
> +{
> +	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> +
> +	cpu_data->last_state = state;
> +}
> +
> +/**
> + * teo_enable_device - Initialize the governor's data for the target CPU.
> + * @drv: cpuidle driver (not used).
> + * @dev: Target CPU.
> + */
> +static int teo_enable_device(struct cpuidle_driver *drv,
> +			     struct cpuidle_device *dev)
> +{
> +	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> +
> +	memset(cpu_data, 0, sizeof(*cpu_data));
> +	return 0;
> +}
> +
> +static struct cpuidle_governor teo_governor = {
> +	.name =		"teo",
> +	.rating =	22,
> +	.enable =	teo_enable_device,
> +	.select =	teo_select,
> +	.reflect =	teo_reflect,
> +};

Just out of curiosity: why isn't the disable() function from the
cpuidle_governor interface implemented?  I see it isn't in "menu" either,
so I wonder what disable() is supposed to do.


Thanks,
Giovanni


> +
> +static int __init teo_governor_init(void)
> +{
> +	return cpuidle_register_governor(&teo_governor);
> +}
> +
> +postcore_initcall(teo_governor_init);
> Index: linux-pm/drivers/cpuidle/Kconfig
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/Kconfig
> +++ linux-pm/drivers/cpuidle/Kconfig
> @@ -23,6 +23,17 @@ config CPU_IDLE_GOV_LADDER
>  config CPU_IDLE_GOV_MENU
>  	bool "Menu governor (for tickless system)"
>  
> +config CPU_IDLE_GOV_TEO
> +	bool "Timer events oriented governor (for tickless systems)"
> +	help
> +	  Menu governor derivative that uses a simplified idle state
> +	  selection method focused on timer events and does not do any
> +	  interactivity boosting.
> +
> +	  Some workloads benefit from using this governor and it generally
> +	  should be safe to use.  Say Y here if you are not happy with the
> +	  alternatives.
> +
>  config DT_IDLE_STATES
>  	bool
>  
> Index: linux-pm/drivers/cpuidle/governors/Makefile
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/governors/Makefile
> +++ linux-pm/drivers/cpuidle/governors/Makefile
> @@ -4,3 +4,4 @@
>  
>  obj-$(CONFIG_CPU_IDLE_GOV_LADDER) += ladder.o
>  obj-$(CONFIG_CPU_IDLE_GOV_MENU) += menu.o
> +obj-$(CONFIG_CPU_IDLE_GOV_TEO) += teo.o
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC/RFT/[PATCH] cpuidle: New timer events oriented governor for tickless systems
@ 2018-10-11 21:01 Rafael J. Wysocki
  2018-10-22  8:54 ` Giovanni Gherdovich
  0 siblings, 1 reply; 7+ messages in thread
From: Rafael J. Wysocki @ 2018-10-11 21:01 UTC (permalink / raw)
  To: Linux PM
  Cc: Srinivas Pandruvada, Peter Zijlstra, LKML, Frederic Weisbecker,
	Mel Gorman, Giovanni Gherdovich, Doug Smythies, Daniel Lezcano

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

The venerable menu governor does some things that are quite
questionable in my view.  First, it includes timer wakeups in
the pattern detection data and mixes them up with wakeups from
other sources, which in some cases causes it to expect what
essentially would be a timer wakeup in a time frame in which
no timer wakeups are possible (because it knows the time until
the next timer event and that is later than the expected wakeup
time).  Second, it uses the extra exit latency limit based on
the predicted idle duration and depending on the number of tasks
waiting on I/O, even though those tasks may run on a different
CPU when they are woken up.  Moreover, the time ranges it uses
for the sleep length correction factors are not correlated to the
list of available idle states in any way whatsoever, and different
correction factors are used depending on whether or not there are
tasks waiting on I/O, which again doesn't imply anything in
particular.

A major rework of the menu governor would be required to address
these issues and it is likely that the performance of some
workloads would suffer from that.  It is thus better to introduce
an entirely new governor without them and let everybody use the
governor that works better with their actual workloads.

The new governor introduced here, the timer events oriented (TEO)
governor, uses the same basic strategy as menu: it always tries to
find the deepest idle state that can be used in the given conditions.
However, it uses a different approach to that problem.  Namely,
instead of attempting to predict the idle duration itself which is
likely to be inaccurate no matter what, it selects the idle state
that was the best "match" for the time range given by the sleep
length value in the past (see the code comments for details).
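
Roughly, the bookkeeping idea can be pictured with the standalone sketch
below (the state table, counters and sleep length are made up, and the
latency and tick handling are left out; this is not the governor code
itself):

#include <stdio.h>

struct state {
	unsigned int residency_us;	/* target residency */
	unsigned int hits;		/* past "on time" wakeups matched */
	unsigned int early;		/* past "early" wakeups matched */
};

int main(void)
{
	struct state s[4] = {
		{    0,  10,  0 },	/* polling */
		{  100,  30, 25 },
		{ 1000,  20, 40 },
		{ 5000, 200,  0 },
	};
	unsigned int sleep_length_us = 3000;
	unsigned int early_sum = 0;
	int i, idx = 0;

	/* deepest state whose target residency fits the sleep length */
	for (i = 1; i < 4; i++) {
		if (s[i].residency_us > sleep_length_us)
			break;
		idx = i;
	}

	/* total "early" wakeups matched by the shallower states */
	for (i = 0; i < idx; i++)
		early_sum += s[i].early;

	if (early_sum > s[idx].hits) {
		/*
		 * A shallower state is more likely to match, so take the
		 * one with the greatest number of "early" matches.
		 */
		int best = 0;

		for (i = 1; i < idx; i++)
			if (s[i].early > s[best].early)
				best = i;
		idx = best;
	}

	printf("selected state %d\n", idx);
	return 0;
}

With those made-up numbers the shallower state 1 wins, because the "early"
wakeups matched by the shallower states outweigh the "hits" of the deepest
state that fits the sleep length.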

It has been tested on a few different systems with a number of
different workloads and compared with the menu governor.  In the
majority of cases the workloads performed similarly regardless of
the cpuidle governor in use, but in one case the TEO governor
allowed the workload to perform 75% better, which is a clear
indication that some workloads may benefit from using it quite
a bit depending on the hardware they run on.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

I'm running this governor on all of my systems now without any
visible adverse effects.

It is likely to select the "polling" state less often than menu
due to the lack of the extra latency limit derived from the
predicted idle duration, so the workloads depending on that
behavior may be worse off (but less energy should be used
while running them at the same time).

Overall, it selects deeper idle states than menu more often, but
that doesn't seem to make a significant difference in the majority
of cases.

In this preliminary revision it overtakes menu as the default governor
for tickless systems (due to the higher rating), but that is likely
to change going forward.  At this point I'm mostly asking for feedback
and possibly testing with whatever workloads you can throw at it.

The patch should apply on top of 4.19-rc7, although I'm running it on
top of my linux-next branch.

---
 drivers/cpuidle/Kconfig            |   11 +
 drivers/cpuidle/governors/Makefile |    1 
 drivers/cpuidle/governors/teo.c    |  397 +++++++++++++++++++++++++++++++++++++
 3 files changed, 409 insertions(+)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -0,0 +1,397 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Timer events oriented CPU idle governor
+ *
+ * Copyright (C) 2018 Intel Corporation
+ * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ *
+ * The idea of this governor is based on the observation that on many systems
+ * timer events are two or more orders of magnitude more frequent than any
+ * other interrupts, so they are likely to be the most significant source of CPU
+ * wakeups from idle states.  Moreover, it is only necessary to estimate which
+ * idle state is the most likely one to match the idle duration measured after
+ * wakeup and it may not be necessary to predict the idle duration itself
+ * accurately.  [For example, if only two idle states are available, it is
+ * sufficient to estimate whether or not the first state is more likely to match
+ * the measured idle duration than the second one and the precise length of the
+ * idle interval itself need not be predicted for this purpose.]
+ *
+ * Also, if the time till the closest timer event (sleep length) is used to
+ * select an idle state for a CPU and that state turns out to match the idle
+ * duration measured after wakeup, it doesn't matter if the CPU really has been
+ * woken up by the timer.  What matters is that it has been woken up in the time
+ * frame matching the selected state and not what mechanism in particular has
+ * been the real cause of the wakeup.
+ *
+ * The sleep length is the maximum duration of the upcoming idle time of the
+ * CPU and it is always known to the kernel.  Using it alone for selecting an
+ * idle state for the CPU every time is a viable option in principle, but that
+ * might lead to suboptimal results if the other wakeup sources are more active
+ * for some reason.  Thus this governor estimates whether or not the CPU idle
+ * time is likely to be significantly shorter than the sleep length and selects
+ * an idle state for it in accordance with that, as follows:
+ *
+ * - If there is a series of 3 or more consecutive wakeups each significantly
+ *   earlier than the closest timer event, expect one more of them to occur and
+ *   use the average time between them to select an idle state for the CPU.
+ *
+ * - Otherwise, find the state on the basis of the sleep length and state
+ *   statistics collected over time:
+ *
+ *   o Find the deepest idle state whose target residency is less than or equal
+ *     to the sleep length.
+ *
+ *   o Select it if the number of times it matched both the sleep length and the
+ *     idle duration measured after wakeup in the past is not less than the total
+ *     number of "early" CPU wakeups matched by all of the shallower states.
+ *
+ *   o Otherwise, select the shallower state with the greatest number of matched
+ *     "early" wakeups.
+ */
+
+#include <linux/kernel.h>
+#include <linux/cpuidle.h>
+#include <linux/jiffies.h>
+#include <linux/tick.h>
+
+/*
+ * Assuming an idle interval every second tick, take the maximum number of CPU
+ * wakeups regarded as recent to roughly correspond to 10 minutes.
+ *
+ * When the total number of CPU wakeups goes above this value, all of the
+ * counters corresponding to the given CPU undergo a "decay" and the counting
+ * of recent events starts over.
+ */
+#define TEO_MAX_RECENT_WAKEUPS	(300 * HZ)
+
+/**
+ * struct teo_idle_state - Idle state data used by the TEO cpuidle governor.
+ * @early_wakeups: Number of "early" CPU wakeups matched by this state.
+ * @early_wakeups_old: Past "early" CPU wakeups matched by this state (decayed).
+ * @hits: Number of "on time" CPU wakeups matched by this state.
+ * @hits_old: Past "on time" CPU wakeups matched by this state (decayed).
+ *
+ * A CPU wakeup is "matched" by a given idle state if the idle duration measured
+ * after the wakeup is between the target residency of that state and the target
+ * residency of the next one (or if this is the deepest state provided, it
+ * "matches" a CPU wakeup when the measured idle duration is at least equal to
+ * its target residency).
+ *
+ * Also, from the TEO governor perspective, a CPU wakeup from idle is "early" if
+ * it occurs significantly earlier than the closest expected timer event (that
+ * is, early enough to match an idle state shallower than the one matching the
+ * time till the closest timer event).  Otherwise, the wakeup is "on time", or
+ * it is a "hit".
+ */
+struct teo_idle_state {
+	u64 early_wakeups;
+	u64 early_wakeups_old;
+	u64 hits;
+	u64 hits_old;
+};
+
+/**
+ * struct teo_cpu - CPU data used by the TEO cpuidle governor.
+ * @states: Idle states data corresponding to this CPU.
+ * @early_series_count: Number of recent "early" wakeups in a row (series).
+ * @early_series_time_us: Total idle time corresponding to the "early" series.
+ * @recent_wakeups: Number of wakeups since the counters were decayed last time.
+ * @sleep_length_us: Time till the closest timer event.
+ * @last_state: Idle state entered by the CPU last time.
+ */
+struct teo_cpu {
+	struct teo_idle_state states[CPUIDLE_STATE_MAX];
+	u64 early_series_count;
+	u64 early_series_time_us;
+	unsigned int recent_wakeups;
+	unsigned int sleep_length_us;
+	int last_state;
+};
+
+static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
+
+/**
+ * teo_update - Update CPU data after wakeup.
+ * @drv: cpuidle driver containing state data.
+ * @dev: Target CPU.
+ */
+static void teo_update(struct cpuidle_driver *drv,
+			   struct cpuidle_device *dev)
+{
+	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+	unsigned int measured_us = dev->last_residency;
+	int i = cpu_data->last_state;
+
+	if (measured_us >= 2 * drv->states[i].exit_latency)
+		measured_us -= drv->states[i].exit_latency;
+	else
+		measured_us /= 2;
+
+	/* Find the state matching the measured idle duration. */
+	for (i = drv->state_count - 1; i > 0; i--) {
+		if (drv->states[i].target_residency <= measured_us)
+			break;
+	}
+
+	cpu_data->recent_wakeups++;
+
+	/*
+	 * If the state matches the time till the closest timer event used
+	 * during the most recent state selection too, it would be sufficient
+	 * to use that time alone for the idle state selection, so increment
+	 * the "hits" number for the matching state and clear the "early series"
+	 * data.
+	 *
+	 * Note that sleep_length_us cannot be less than measured_us (or it
+	 * would mean a late timer event), so it is sufficient to check the
+	 * lower boundary here.
+	 */
+	if (i == drv->state_count - 1 ||
+	    drv->states[i+1].target_residency > cpu_data->sleep_length_us) {
+		cpu_data->states[i].hits++;
+
+		cpu_data->early_series_count = 0;
+		cpu_data->early_series_time_us = 0;
+	} else {
+		/*
+		 * The wakeup was significantly earlier than the closest timer
+		 * event used for idle state selection, so update the CPU data
+		 * to reflect that.
+		 */
+		cpu_data->states[i].early_wakeups++;
+
+		cpu_data->early_series_count++;
+		cpu_data->early_series_time_us += measured_us;
+	}
+
+	if (cpu_data->recent_wakeups <= TEO_MAX_RECENT_WAKEUPS)
+		return;
+
+	cpu_data->recent_wakeups = 0;
+
+	/* Decay past events information. */
+	for (i = 0; i < drv->state_count; i++) {
+		cpu_data->states[i].early_wakeups_old += cpu_data->states[i].early_wakeups;
+		cpu_data->states[i].early_wakeups_old /= 2;
+		cpu_data->states[i].early_wakeups = 0;
+
+		cpu_data->states[i].hits_old += cpu_data->states[i].hits;
+		cpu_data->states[i].hits_old /= 2;
+		cpu_data->states[i].hits = 0;
+	}
+}
+
+/**
+ * teo_select - Selects the next idle state to enter.
+ * @drv: cpuidle driver containing state data.
+ * @dev: Target CPU.
+ * @stop_tick: Indication on whether or not to stop the scheduler tick.
+ */
+static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+		      bool *stop_tick)
+{
+	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+	int latency_req = cpuidle_governor_latency_req(dev->cpu);
+	unsigned int duration_us;
+	ktime_t delta_next;
+	int idx, i;
+	u64 early_count, max_count, prev_count;
+	int max_count_idx, prev_idx;
+
+	if (cpu_data->last_state >= 0) {
+		teo_update(drv, dev);
+		cpu_data->last_state = -1;
+	}
+
+	cpu_data->sleep_length_us = ktime_to_us(tick_nohz_get_sleep_length(&delta_next));
+
+	duration_us = cpu_data->sleep_length_us;
+
+	if (cpu_data->early_series_count > 2) {
+		unsigned int avg_us;
+
+		/*
+		 * There is a series of consecutive wakeups each significantly
+		 * earlier than the closest timer event, so expect one more of
+		 * them to occur and use the average time between them for idle
+		 * state selection, as long as it is within the tick boundary
+		 * and the tick has not been stopped.
+		 */
+		avg_us = div64_u64(cpu_data->early_series_time_us,
+				   cpu_data->early_series_count);
+
+		if (avg_us < TICK_USEC && avg_us < duration_us &&
+		    !tick_nohz_tick_stopped())
+			duration_us = avg_us;
+	}
+
+	early_count = 0;
+	max_count = 0;
+	max_count_idx = -1;
+	prev_count = 0;
+	prev_idx = -1;
+	idx = -1;
+
+	for (i = 0; i < drv->state_count; i++) {
+		struct cpuidle_state *s = &drv->states[i];
+		struct cpuidle_state_usage *su = &dev->states_usage[i];
+		u64 count = cpu_data->states[i].early_wakeups +
+				cpu_data->states[i].early_wakeups_old;
+
+		if (s->disabled || su->disable) {
+			/*
+			 * This state is disabled, so add its count of matched
+			 * "early" wakeups to the last enabled state's count.
+			 */
+			if (prev_idx >= 0) {
+				prev_count += count;
+				if (prev_count > max_count) {
+					max_count = prev_count;
+					max_count_idx = prev_idx;
+				}
+			}
+			continue;
+		}
+
+		if (idx < 0)
+			idx = i; /* first enabled state */
+
+		if (s->target_residency > duration_us) {
+			/*
+			 * If the next wakeup is expected to be "early", the
+			 * time frame of it is known already.
+			 */
+			if (duration_us < cpu_data->sleep_length_us) {
+				/*
+				 * Use a physical idle state, not busy polling,
+				 * unless a timer will trigger really soon.
+				 */
+				if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
+				    s->exit_latency <= latency_req &&
+				    s->target_residency <= cpu_data->sleep_length_us) {
+					duration_us = s->target_residency;
+					idx = i;
+				}
+				break;
+			}
+
+			/*
+			 * If the number of "hits" for the state matching the
+			 * sleep length is greater than the total number of
+			 * "early" wakeups for all of the shallower states, the
+			 * state selected so far is the one to use.
+			 */
+			if (early_count <= cpu_data->states[idx].hits +
+					   cpu_data->states[idx].hits_old)
+				break;
+
+			/*
+			 * It is more likely that one of the shallower states
+			 * will match the idle duration measured after wakeup,
+			 * so take the one with the maximum count of matched
+			 * "early" wakeups.
+			 */
+			idx = max_count_idx;
+			duration_us = drv->states[idx].target_residency;
+			break;
+		}
+		if (s->exit_latency > latency_req) {
+			/*
+			 * If we break out of the loop for latency reasons, use
+			 * the target residency of the selected state as the
+			 * expected idle duration to avoid stopping the tick
+			 * as long as that target residency is low enough.
+			 */
+			duration_us = drv->states[idx].target_residency;
+			break;
+		}
+
+		early_count += prev_count;
+
+		idx = i;
+
+		if (count > max_count) {
+			max_count = count;
+			max_count_idx = idx;
+		}
+
+		prev_count = count;
+		prev_idx = idx;
+	}
+
+	if (idx < 0)
+		idx = 0; /* No states enabled. Must use 0. */
+
+	/*
+	 * Don't stop the tick if the selected state is a polling one or if the
+	 * expected idle duration is shorter than the tick period length.
+	 */
+	if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
+	    duration_us < TICK_USEC) && !tick_nohz_tick_stopped()) {
+		unsigned int delta_next_us = ktime_to_us(delta_next);
+
+		cpu_data->sleep_length_us = delta_next_us;
+		*stop_tick = false;
+
+		if (idx > 0 && drv->states[idx].target_residency > delta_next_us) {
+			/*
+			 * The tick is not going to be stopped and the target
+			 * residency of the state to be returned is not within
+			 * the time until the closest timer event including the
+			 * tick, so try to correct that.
+			 */
+			for (i = idx - 1; i > 0; i--) {
+				if (drv->states[i].disabled ||
+				    dev->states_usage[i].disable)
+					continue;
+
+				if (drv->states[i].target_residency <= delta_next_us)
+					break;
+			}
+			idx = i;
+		}
+	}
+
+	return idx;
+}
+
+/**
+ * teo_reflect - Note that governor data for the CPU need to be updated.
+ * @dev: Target CPU.
+ * @state: Entered state.
+ */
+static void teo_reflect(struct cpuidle_device *dev, int state)
+{
+	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+
+	cpu_data->last_state = state;
+}
+
+/**
+ * teo_enable_device - Initialize the governor's data for the target CPU.
+ * @drv: cpuidle driver (not used).
+ * @dev: Target CPU.
+ */
+static int teo_enable_device(struct cpuidle_driver *drv,
+			     struct cpuidle_device *dev)
+{
+	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+
+	memset(cpu_data, 0, sizeof(*cpu_data));
+	return 0;
+}
+
+static struct cpuidle_governor teo_governor = {
+	.name =		"teo",
+	.rating =	22,
+	.enable =	teo_enable_device,
+	.select =	teo_select,
+	.reflect =	teo_reflect,
+};
+
+static int __init teo_governor_init(void)
+{
+	return cpuidle_register_governor(&teo_governor);
+}
+
+postcore_initcall(teo_governor_init);
Index: linux-pm/drivers/cpuidle/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpuidle/Kconfig
+++ linux-pm/drivers/cpuidle/Kconfig
@@ -23,6 +23,17 @@ config CPU_IDLE_GOV_LADDER
 config CPU_IDLE_GOV_MENU
 	bool "Menu governor (for tickless system)"
 
+config CPU_IDLE_GOV_TEO
+	bool "Timer events oriented governor (for tickless systems)"
+	help
+	  Menu governor derivative that uses a simplified idle state
+	  selection method focused on timer events and does not do any
+	  interactivity boosting.
+
+	  Some workloads benefit from using this governor and it generally
+	  should be safe to use.  Say Y here if you are not happy with the
+	  alternatives.
+
 config DT_IDLE_STATES
 	bool
 
Index: linux-pm/drivers/cpuidle/governors/Makefile
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/Makefile
+++ linux-pm/drivers/cpuidle/governors/Makefile
@@ -4,3 +4,4 @@
 
 obj-$(CONFIG_CPU_IDLE_GOV_LADDER) += ladder.o
 obj-$(CONFIG_CPU_IDLE_GOV_MENU) += menu.o
+obj-$(CONFIG_CPU_IDLE_GOV_TEO) += teo.o


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-10-26  7:58 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-14  6:53 [RFC/RFT/[PATCH] cpuidle: New timer events oriented governor for tickless systems Doug Smythies
2018-10-15  7:52 ` Rafael J. Wysocki
2018-10-16  3:00 ` Doug Smythies
2018-10-16  8:00   ` Rafael J. Wysocki
  -- strict thread matches above, loose matches on Subject: below --
2018-10-11 21:01 Rafael J. Wysocki
2018-10-22  8:54 ` Giovanni Gherdovich
2018-10-26  7:58   ` Rafael J. Wysocki
