sched: fair: Use the earliest break even

Message ID 20200304114844.17700-1-daniel.lezcano@linaro.org
State Superseded

Commit Message

Daniel Lezcano March 4, 2020, 11:48 a.m. UTC
In the idle CPU selection process occurring in the slow path via the
find_idlest_group_cpu() function, we pick in priority an idle CPU in
the shallowest idle state; otherwise we fall back to the least loaded
CPU.

In order to be more energy efficient without impacting performance,
let's use another criterion: the break-even deadline.

At idle time, when we store the idle state the CPU is entering, we
compute the earliest deadline at which the CPU could be woken up
without the sleep having cost more energy than it saved.

At selection time, we still pick the CPU in the shallowest idle state,
but among those we choose the one with the earliest break-even
deadline instead of relying on idle_timestamp. When the CPU is idle,
the timestamp has little meaning because the CPU could have woken up
and gone back to sleep several times without exiting the idle loop. In
that case the break-even deadline is more relevant as it increases the
probability of choosing a CPU which has reached its break even.

Tested on a synquacer 24 cores, 6 sched domains:

sched/perf and messaging do not show a performance regression. Ran
schbench and adrestia 10 times each.

with break-even deadline:
-------------------------
schbench / *99.0th      : 14844.8
adrestia / periodic     : 57.95
adrestia / single       : 49.3

Without break-even deadline:
----------------------------
schbench / *99.0th      : 15017.6
adrestia / periodic     : 57
adrestia / single       : 55.4

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 kernel/sched/fair.c  | 18 ++++++++++++++++--
 kernel/sched/idle.c  |  9 ++++++++-
 kernel/sched/sched.h | 20 ++++++++++++++++++++
 3 files changed, 44 insertions(+), 3 deletions(-)

Comments

Qais Yousef March 4, 2020, 3:01 p.m. UTC | #1
Hi Daniel

Adding Rafael to CC as I think he might be interested in this too.

On 03/04/20 12:48, Daniel Lezcano wrote:
> In the idle CPU selection process occuring in the slow path via the
> find_idlest_group_cpu() function, we pick up in priority an idle CPU
> with the shallowest idle state otherwise we fall back to the least
> loaded CPU.
> 
> In order to be more energy efficient but without impacting the
> performances, let's use another criteria: the break even deadline.

What is the break even deadline?

> 
> At idle time, when we store the idle state the CPU is entering in, we
> compute the next deadline where the CPU could be woken up without
> spending more energy to sleep.

I think that's its definition, but could do with more explanation.

So the break even deadline is the time window during which we can abort the CPU
while entering its shallowest idle state?

So if we have 2 cpus entering the shallowest idle state, we pick the one that
has a faster abort? And the energy saving comes from the fact we avoided
unnecessary sleep-just-to-wakeup-immediately cycle?

> 
> At the selection process, we use the shallowest CPU but in addition we
> choose the one with the minimal break even deadline instead of relying
> on the idle_timestamp. When the CPU is idle, the timestamp has less
> meaning because the CPU could have wake up and sleep again several times
> without exiting the idle loop. In this case the break even deadline is
> more relevant as it increases the probability of choosing a CPU which
> reached its break even.
> 
> Tested on a synquacer 24 cores, 6 sched domains:
> 
> sched/perf and messaging does not show a performance regression. Ran
> 10 times schbench and adrestia.
> 
> with break-even deadline:
> -------------------------
> schbench *99.0th        : 14844.8
> adrestia / periodic     : 57.95
> adrestia / single       : 49.3
> 
> Without break-even deadline:
> ----------------------------
> schbench / *99.0th      : 15017.6
> adrestia / periodic     : 57
> adrestia / single       : 55.4

Lower is better or worse? The numbers might be popular and maybe I should have
known them, but adding some explanation will always help.

Do you have any energy measurement on the impact of this patch?

> 
> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> ---
>  kernel/sched/fair.c  | 18 ++++++++++++++++--
>  kernel/sched/idle.c  |  9 ++++++++-
>  kernel/sched/sched.h | 20 ++++++++++++++++++++
>  3 files changed, 44 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fcc968669aea..520c5e822fdd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5793,6 +5793,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>  {
>  	unsigned long load, min_load = ULONG_MAX;
>  	unsigned int min_exit_latency = UINT_MAX;
> +	s64 min_break_even = S64_MAX;
>  	u64 latest_idle_timestamp = 0;
>  	int least_loaded_cpu = this_cpu;
>  	int shallowest_idle_cpu = -1;
> @@ -5810,6 +5811,8 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>  		if (available_idle_cpu(i)) {
>  			struct rq *rq = cpu_rq(i);
>  			struct cpuidle_state *idle = idle_get_state(rq);
> +			s64 break_even = idle_get_break_even(rq);
> +			
>  			if (idle && idle->exit_latency < min_exit_latency) {
>  				/*
>  				 * We give priority to a CPU whose idle state
> @@ -5817,10 +5820,21 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>  				 * of any idle timestamp.
>  				 */
>  				min_exit_latency = idle->exit_latency;
> +				min_break_even = break_even;
>  				latest_idle_timestamp = rq->idle_stamp;
>  				shallowest_idle_cpu = i;
> -			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
> -				   rq->idle_stamp > latest_idle_timestamp) {
> +			} else if ((idle && idle->exit_latency == min_exit_latency) &&
> +				   break_even < min_break_even) {
> +				/*
> +				 * We give priority to the shallowest
> +				 * idle states with the minimal break
> +				 * even deadline to decrease the
> +				 * probability to choose a CPU which
> +				 * did not reach its break even yet
> +				 */
> +				min_break_even = break_even;
> +				shallowest_idle_cpu = i;
> +			} else if (!idle && rq->idle_stamp > latest_idle_timestamp) {

Shouldn't you retain the original if condition here? You omitted the 2nd part
of this check compared to the original condition

	(!idle || >>>idle->exit_latency == min_exit_latency<<<)

>  				/*
>  				 * If equal or no active idle state, then
>  				 * the most recently idled CPU might have
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index b743bf38f08f..189cd51cd474 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -19,7 +19,14 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[];
>   */
>  void sched_idle_set_state(struct cpuidle_state *idle_state)
>  {
> -	idle_set_state(this_rq(), idle_state);
> +	struct rq *rq = this_rq();
> +	ktime_t kt;
> +
> +	if (likely(idle_state)) {
> +		kt = ktime_add_ns(ktime_get(), idle_state->exit_latency_ns);
> +		idle_set_state(rq, idle_state);

This changes the behavior of the function.

There's a call to sched_idle_set_state(NULL), so with this it'd be a NOP.

Is this intentional?

Don't you need to reset the break_even value if idle_state is NULL too?


Thanks

--
Qais Yousef

> +		idle_set_break_even(rq, ktime_to_ns(kt));
> +	}
>  }
>  
>  static int __read_mostly cpu_idle_force_poll;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 2a0caf394dd4..abf2d2e73575 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1015,6 +1015,7 @@ struct rq {
>  #ifdef CONFIG_CPU_IDLE
>  	/* Must be inspected within a rcu lock section */
>  	struct cpuidle_state	*idle_state;
> +	s64			break_even;
>  #endif
>  };
>  
> @@ -1850,6 +1851,16 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
>  
>  	return rq->idle_state;
>  }
> +
> +static inline void idle_set_break_even(struct rq *rq, s64 break_even)
> +{
> +	rq->break_even = break_even;
> +}
> +
> +static inline s64 idle_get_break_even(struct rq *rq)
> +{
> +	return rq->break_even;
> +}
>  #else
>  static inline void idle_set_state(struct rq *rq,
>  				  struct cpuidle_state *idle_state)
> @@ -1860,6 +1871,15 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
>  {
>  	return NULL;
>  }
> +
> +static inline void idle_set_break_even(struct rq *rq, s64 break_even)
> +{
> +}
> +
> +static inline s64 idle_get_break_even(struct rq *rq)
> +{
> +	return 0;
> +}
>  #endif
>  
>  extern void schedule_idle(void);
> -- 
> 2.17.1
>
Valentin Schneider March 4, 2020, 3:22 p.m. UTC | #2
On Wed, Mar 04 2020, Daniel Lezcano wrote:
> In the idle CPU selection process occuring in the slow path via the
> find_idlest_group_cpu() function, we pick up in priority an idle CPU
> with the shallowest idle state otherwise we fall back to the least
> loaded CPU.
>
> In order to be more energy efficient but without impacting the
> performances, let's use another criteria: the break even deadline.
>
> At idle time, when we store the idle state the CPU is entering in, we
> compute the next deadline where the CPU could be woken up without
> spending more energy to sleep.
>
> At the selection process, we use the shallowest CPU but in addition we
> choose the one with the minimal break even deadline instead of relying
> on the idle_timestamp. When the CPU is idle, the timestamp has less
> meaning because the CPU could have wake up and sleep again several times
> without exiting the idle loop. In this case the break even deadline is
> more relevant as it increases the probability of choosing a CPU which
> reached its break even.
>

Ok so we still favour smallest exit latency, but if we have to pick
among several CPUs with the same exit latency, we can use the break
even. I'll want to test this on stuff, but I like the overall idea.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fcc968669aea..520c5e822fdd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5793,6 +5793,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>  {
>       unsigned long load, min_load = ULONG_MAX;
>       unsigned int min_exit_latency = UINT_MAX;
> +	s64 min_break_even = S64_MAX;
>       u64 latest_idle_timestamp = 0;
>       int least_loaded_cpu = this_cpu;
>       int shallowest_idle_cpu = -1;
> @@ -5810,6 +5811,8 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>               if (available_idle_cpu(i)) {
>                       struct rq *rq = cpu_rq(i);
>                       struct cpuidle_state *idle = idle_get_state(rq);
> +			s64 break_even = idle_get_break_even(rq);
> +

Nit: there's tabs in that line break.

>                       if (idle && idle->exit_latency < min_exit_latency) {
>                               /*
>                                * We give priority to a CPU whose idle state
> @@ -5817,10 +5820,21 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>                                * of any idle timestamp.
>                                */
>                               min_exit_latency = idle->exit_latency;
> +				min_break_even = break_even;
>                               latest_idle_timestamp = rq->idle_stamp;
>                               shallowest_idle_cpu = i;
> -			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
> -				   rq->idle_stamp > latest_idle_timestamp) {
> +			} else if ((idle && idle->exit_latency == min_exit_latency) &&
> +				   break_even < min_break_even) {
> +				/*
> +				 * We give priority to the shallowest
> +				 * idle states with the minimal break
> +				 * even deadline to decrease the
> +				 * probability to choose a CPU which
> +				 * did not reach its break even yet
> +				 */
> +				min_break_even = break_even;
> +				shallowest_idle_cpu = i;
> +			} else if (!idle && rq->idle_stamp > latest_idle_timestamp) {
>                               /*
>                                * If equal or no active idle state, then
>                                * the most recently idled CPU might have

That comment will need to be changed as well, that condition now only
caters to the !idle case.

With that said, that comment actually raises a valid point: picking
recently idled CPUs might give us warmer cache. So by using the break
even stat, we can end up picking CPUs with colder caches (have been
idling for longer) than the current logic would. I suppose more testing
will tell us where we stand.

> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index b743bf38f08f..189cd51cd474 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -19,7 +19,14 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[];
>   */
>  void sched_idle_set_state(struct cpuidle_state *idle_state)
>  {
> -	idle_set_state(this_rq(), idle_state);
> +	struct rq *rq = this_rq();
> +	ktime_t kt;
> +
> +	if (likely(idle_state)) {

Doesn't this break things? e.g. calling this with NULL?

> +		kt = ktime_add_ns(ktime_get(), idle_state->exit_latency_ns);

ISTR there were objections to using ktime stuff in the scheduler, but I
can't remember anything specific. If we only call into it when actually
entering an idle state (and not when we're exiting one), I suppose that
would be fine?

> +		idle_set_state(rq, idle_state);
> +		idle_set_break_even(rq, ktime_to_ns(kt));
> +	}
>  }
>
>  static int __read_mostly cpu_idle_force_poll;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 2a0caf394dd4..abf2d2e73575 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1015,6 +1015,7 @@ struct rq {
>  #ifdef CONFIG_CPU_IDLE
>       /* Must be inspected within a rcu lock section */
>       struct cpuidle_state	*idle_state;
> +	s64			break_even;

Why signed? This should be purely positive, right?

>  #endif
>  };
>
> @@ -1850,6 +1851,16 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
>
>       return rq->idle_state;
>  }
> +
> +static inline void idle_set_break_even(struct rq *rq, s64 break_even)
> +{
> +	rq->break_even = break_even;
> +}
> +
> +static inline s64 idle_get_break_even(struct rq *rq)
> +{
> +	return rq->break_even;
> +}

I'm not super familiar with the callsites for setting idle states,
what's the locking situation there? Do we have any rq lock?

In find_idlest_group_cpu() we're in a read-side RCU section, so the
idle_state is protected (speaking of which, why isn't idle_get_state()
using rcu_dereference()?), but that's doesn't cover the break even.

IIUC at the very least we may want to give them the READ/WRITE_ONCE()
treatment.
Daniel Lezcano March 4, 2020, 4:17 p.m. UTC | #3
Hi Qais,

On 04/03/2020 16:01, Qais Yousef wrote:
> Hi Daniel
> 
> Adding Rafael to CC as I think he might be interested in this too.
> 
> On 03/04/20 12:48, Daniel Lezcano wrote:
>> In the idle CPU selection process occuring in the slow path via the
>> find_idlest_group_cpu() function, we pick up in priority an idle CPU
>> with the shallowest idle state otherwise we fall back to the least
>> loaded CPU.
>>
>> In order to be more energy efficient but without impacting the
>> performances, let's use another criteria: the break even deadline.
> 
> What is the break even deadline?
>
>> At idle time, when we store the idle state the CPU is entering in, we
>> compute the next deadline where the CPU could be woken up without
>> spending more energy to sleep.
> 
> I think that's its definition, but could do with more explanation.
> 
> So the break even deadline is the time window during which we can abort the CPU
> while entering its shallowest idle state?

No, it is the moment in absolute time when the min residency is reached.

From Documentation/devicetree/bindings/arm/idle-states.yaml

"
min-residency is defined for a given idle state as the minimum expected
residency time for a state (inclusive of preparation and entry) after
which choosing that state become the most energy efficient option. A
good way to visualise this, is by taking the same graph above and
comparing some states energy consumptions plots.
"
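To make that concrete, here is a hypothetical standalone sketch of the
relationship (made-up helper names, not the patch code; note the patch
itself derives the deadline from exit_latency_ns rather than from
min-residency):

```c
#include <stdint.h>

/*
 * Sketch only: the break-even deadline is the absolute time after
 * which staying in the chosen idle state has become the most
 * energy-efficient option.  entry_ns is when the CPU entered idle;
 * cost_ns is the state's amortization horizon (min-residency in the
 * DT definition above, exit_latency_ns in the patch).
 */
static inline int64_t break_even_deadline(int64_t entry_ns, uint64_t cost_ns)
{
	return entry_ns + (int64_t)cost_ns;
}

/* A CPU has reached its break even once "now" passes the deadline. */
static inline int reached_break_even(int64_t now_ns, int64_t deadline_ns)
{
	return now_ns >= deadline_ns;
}
```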

> So if we have 2 cpus entering the shallowest idle state, we pick the one that
> has a faster abort? And the energy saving comes from the fact we avoided
> unnecessary sleep-just-to-wakeup-immediately cycle?

No, actually it is about choosing a CPU which has a better chance of
having reached its min residency. Basically, entering an idle state
has a cost (cache flush / refill, context saving/restoring, etc.), so
there is a peak of energy consumption and the CPU has to stay asleep
long enough to compensate for this extra consumption.

If the scheduler constantly wakes up an idle CPU before it has slept
long enough, we lose both energy and performance.

Example 1, the CPUs are in a state:
 - CPU0 (power down)
 - CPU1 (power down)
 - CPU2 (WFI)
 - CPU3 (power down)

 The routine chooses CPU2 because it is in the shallowest state.

Example 2, the CPUs are in a state:
 - CPU0 (power down) (bedl = 1234)
 - CPU1 (power down) (bedl = 4321)
 - CPU2 (power down) (bedl = 9876)
 - CPU3 (power down) (bedl = 3421)

* bedl = break even deadline

  The routine chooses CPU0 because its bedl (1234) is the smallest.
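The two examples can be modeled with a small userspace sketch of the
selection rule (made-up types and values mirroring the logic of the
find_idlest_group_cpu() change, not the kernel code itself):

```c
#include <stdint.h>

struct cpu_idle_info {
	uint32_t exit_latency;	/* idle state depth; lower = shallower */
	int64_t  break_even;	/* absolute break-even deadline (bedl) */
};

/*
 * Prefer the shallowest idle state; among CPUs in equally shallow
 * states, prefer the earliest break-even deadline.
 */
static int pick_cpu(const struct cpu_idle_info *cpus, int n)
{
	uint32_t min_exit_latency = UINT32_MAX;
	int64_t min_break_even = INT64_MAX;
	int best = -1;

	for (int i = 0; i < n; i++) {
		if (cpus[i].exit_latency < min_exit_latency) {
			min_exit_latency = cpus[i].exit_latency;
			min_break_even = cpus[i].break_even;
			best = i;
		} else if (cpus[i].exit_latency == min_exit_latency &&
			   cpus[i].break_even < min_break_even) {
			min_break_even = cpus[i].break_even;
			best = i;
		}
	}
	return best;
}
```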


>> At the selection process, we use the shallowest CPU but in addition we
>> choose the one with the minimal break even deadline instead of relying
>> on the idle_timestamp. When the CPU is idle, the timestamp has less
>> meaning because the CPU could have wake up and sleep again several times
>> without exiting the idle loop. In this case the break even deadline is
>> more relevant as it increases the probability of choosing a CPU which
>> reached its break even.
>>
>> Tested on a synquacer 24 cores, 6 sched domains:
>>
>> sched/perf and messaging does not show a performance regression. Ran
>> 10 times schbench and adrestia.
>>
>> with break-even deadline:
>> -------------------------
>> schbench *99.0th        : 14844.8
>> adrestia / periodic     : 57.95
>> adrestia / single       : 49.3
>>
>> Without break-even deadline:
>> ----------------------------
>> schbench / *99.0th      : 15017.6
>> adrestia / periodic     : 57
>> adrestia / single       : 55.4
> 
> Lower is better or worse? The numbers might be popular and maybe I should have
> known them, but adding some explanation will always help.

 * lower is better

> Do you have any energy measurement on the impact of this patch?

No, I don't have any platform allowing that. I have an instrumented
big.LITTLE, but this patch is not relevant there as EAS takes over.

The only platform I have in my hands now is the synquacer 24 cores which
makes sense for this slow path selection.

>> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
>> ---
>>  kernel/sched/fair.c  | 18 ++++++++++++++++--
>>  kernel/sched/idle.c  |  9 ++++++++-
>>  kernel/sched/sched.h | 20 ++++++++++++++++++++
>>  3 files changed, 44 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index fcc968669aea..520c5e822fdd 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5793,6 +5793,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>  {
>>  	unsigned long load, min_load = ULONG_MAX;
>>  	unsigned int min_exit_latency = UINT_MAX;
>> +	s64 min_break_even = S64_MAX;
>>  	u64 latest_idle_timestamp = 0;
>>  	int least_loaded_cpu = this_cpu;
>>  	int shallowest_idle_cpu = -1;
>> @@ -5810,6 +5811,8 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>  		if (available_idle_cpu(i)) {
>>  			struct rq *rq = cpu_rq(i);
>>  			struct cpuidle_state *idle = idle_get_state(rq);
>> +			s64 break_even = idle_get_break_even(rq);
>> +			
>>  			if (idle && idle->exit_latency < min_exit_latency) {
>>  				/*
>>  				 * We give priority to a CPU whose idle state
>> @@ -5817,10 +5820,21 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>  				 * of any idle timestamp.
>>  				 */
>>  				min_exit_latency = idle->exit_latency;
>> +				min_break_even = break_even;
>>  				latest_idle_timestamp = rq->idle_stamp;
>>  				shallowest_idle_cpu = i;
>> -			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
>> -				   rq->idle_stamp > latest_idle_timestamp) {
>> +			} else if ((idle && idle->exit_latency == min_exit_latency) &&
>> +				   break_even < min_break_even) {
>> +				/*
>> +				 * We give priority to the shallowest
>> +				 * idle states with the minimal break
>> +				 * even deadline to decrease the
>> +				 * probability to choose a CPU which
>> +				 * did not reach its break even yet
>> +				 */
>> +				min_break_even = break_even;
>> +				shallowest_idle_cpu = i;
>> +			} else if (!idle && rq->idle_stamp > latest_idle_timestamp) {
> 
> Shouldn't you retain the original if condition here? You omitted the 2nd part
> of this check compared to the original condition
> 
> 	(!idle || >>>idle->exit_latency == min_exit_latency<<<)

It is done on purpose because of the condition right before.

>>  				/*
>>  				 * If equal or no active idle state, then
>>  				 * the most recently idled CPU might have
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index b743bf38f08f..189cd51cd474 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -19,7 +19,14 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[];
>>   */
>>  void sched_idle_set_state(struct cpuidle_state *idle_state)
>>  {
>> -	idle_set_state(this_rq(), idle_state);
>> +	struct rq *rq = this_rq();
>> +	ktime_t kt;
>> +
>> +	if (likely(idle_state)) {
>> +		kt = ktime_add_ns(ktime_get(), idle_state->exit_latency_ns);
>> +		idle_set_state(rq, idle_state);
> 
> This changes the behavior of the function.
> 
> There's a call to sched_idle_set_state(NULL), so with this it'd be a NOP.
> 
> Is this intentional?

Nope, thanks for spotting it. It should be:

 void sched_idle_set_state(struct cpuidle_state *idle_state)
 {
-       idle_set_state(this_rq(), idle_state);
+       struct rq *rq = this_rq();
+       idle_set_state(rq, idle_state);
+
+       if (likely(idle_state)) {
+               ktime_t kt = ktime_add_ns(ktime_get(),
+                                         idle_state->exit_latency_ns);
+               idle_set_break_even(rq, ktime_to_ns(kt));
+       }
 }


> Don't you need to reset the break_even value if idle_state is NULL too?

If the idle_state is NULL, the routine won't use the break_even.
Daniel Lezcano March 4, 2020, 4:51 p.m. UTC | #4
On 04/03/2020 16:22, Valentin Schneider wrote:
> 
> On Wed, Mar 04 2020, Daniel Lezcano wrote:
>> In the idle CPU selection process occuring in the slow path via the
>> find_idlest_group_cpu() function, we pick up in priority an idle CPU
>> with the shallowest idle state otherwise we fall back to the least
>> loaded CPU.
>>
>> In order to be more energy efficient but without impacting the
>> performances, let's use another criteria: the break even deadline.
>>
>> At idle time, when we store the idle state the CPU is entering in, we
>> compute the next deadline where the CPU could be woken up without
>> spending more energy to sleep.
>>
>> At the selection process, we use the shallowest CPU but in addition we
>> choose the one with the minimal break even deadline instead of relying
>> on the idle_timestamp. When the CPU is idle, the timestamp has less
>> meaning because the CPU could have wake up and sleep again several times
>> without exiting the idle loop. In this case the break even deadline is
>> more relevant as it increases the probability of choosing a CPU which
>> reached its break even.
>>
> 
> Ok so we still favour smallest exit latency, but if we have to pick
> among several CPUs with the same exit latency, we can use the break
> even. I'll want to test this on stuff, but I like the overall idea.
> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index fcc968669aea..520c5e822fdd 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5793,6 +5793,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>  {
>>       unsigned long load, min_load = ULONG_MAX;
>>       unsigned int min_exit_latency = UINT_MAX;
>> +	s64 min_break_even = S64_MAX;
>>       u64 latest_idle_timestamp = 0;
>>       int least_loaded_cpu = this_cpu;
>>       int shallowest_idle_cpu = -1;
>> @@ -5810,6 +5811,8 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>               if (available_idle_cpu(i)) {
>>                       struct rq *rq = cpu_rq(i);
>>                       struct cpuidle_state *idle = idle_get_state(rq);
>> +			s64 break_even = idle_get_break_even(rq);
>> +
> 
> Nit: there's tabs in that line break.
> 
>>                       if (idle && idle->exit_latency < min_exit_latency) {
>>                               /*
>>                                * We give priority to a CPU whose idle state
>> @@ -5817,10 +5820,21 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>                                * of any idle timestamp.
>>                                */
>>                               min_exit_latency = idle->exit_latency;
>> +				min_break_even = break_even;
>>                               latest_idle_timestamp = rq->idle_stamp;
>>                               shallowest_idle_cpu = i;
>> -			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
>> -				   rq->idle_stamp > latest_idle_timestamp) {
>> +			} else if ((idle && idle->exit_latency == min_exit_latency) &&
>> +				   break_even < min_break_even) {
>> +				/*
>> +				 * We give priority to the shallowest
>> +				 * idle states with the minimal break
>> +				 * even deadline to decrease the
>> +				 * probability to choose a CPU which
>> +				 * did not reach its break even yet
>> +				 */
>> +				min_break_even = break_even;
>> +				shallowest_idle_cpu = i;
>> +			} else if (!idle && rq->idle_stamp > latest_idle_timestamp) {
>>                               /*
>>                                * If equal or no active idle state, then
>>                                * the most recently idled CPU might have
> 
> That comment will need to be changed as well, that condition now only
> catters to the !idle case.

Right.

> With that said, that comment actually raises a valid point: picking
> recently idled CPUs might give us warmer cache. So by using the break
> even stat, we can end up picking CPUs with colder caches (have been
> idling for longer) than the current logic would. I suppose more testing
> will tell us where we stand.

Actually I'm not sure this comment still applies. If the CPU is powered
down, the cache is flushed, or we could be picking a CPU whose cache
was totally trashed by interrupt processing.

>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index b743bf38f08f..189cd51cd474 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -19,7 +19,14 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[];
>>   */
>>  void sched_idle_set_state(struct cpuidle_state *idle_state)
>>  {
>> -	idle_set_state(this_rq(), idle_state);
>> +	struct rq *rq = this_rq();
>> +	ktime_t kt;
>> +
>> +	if (likely(idle_state)) {
> 
> Doesn't this break things? e.g. calling this with NULL?

Yes, Qais spotted it.

>> +		kt = ktime_add_ns(ktime_get(), idle_state->exit_latency_ns);
> 
> ISTR there were objections to using ktime stuff in the scheduler, but I
> can't remember anything specific. If we only call into it when actually
> entering an idle state (and not when we're exiting one), I suppose that
> would be fine?

In this slow path it is fine. In the fast path, it is unacceptable.

>> +		idle_set_state(rq, idle_state);
>> +		idle_set_break_even(rq, ktime_to_ns(kt));
>> +	}
>>  }
>>
>>  static int __read_mostly cpu_idle_force_poll;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 2a0caf394dd4..abf2d2e73575 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1015,6 +1015,7 @@ struct rq {
>>  #ifdef CONFIG_CPU_IDLE
>>       /* Must be inspected within a rcu lock section */
>>       struct cpuidle_state	*idle_state;
>> +	s64			break_even;
> 
> Why signed? This should be purely positive, right?

It should be, but s64 matches the return type of ktime_to_ns():

static inline s64 ktime_to_ns(const ktime_t kt)

>>  #endif
>>  };
>>
>> @@ -1850,6 +1851,16 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
>>
>>       return rq->idle_state;
>>  }
>> +
>> +static inline void idle_set_break_even(struct rq *rq, s64 break_even)
>> +{
>> +	rq->break_even = break_even;
>> +}
>> +
>> +static inline s64 idle_get_break_even(struct rq *rq)
>> +{
>> +	return rq->break_even;
>> +}
> 
> I'm not super familiar with the callsites for setting idle states,
> what's the locking situation there? Do we have any rq lock?

It is safe, we are under RCU; this section was discussed several years
ago when idle_set_state() was introduced:

 https://lkml.org/lkml/2014/9/19/332

> In find_idlest_group_cpu() we're in a read-side RCU section, so the
> idle_state is protected (speaking of which, why isn't idle_get_state()
> using rcu_dereference()?), but that's doesn't cover the break even.
> 
> IIUC at the very least we may want to give them the READ/WRITE_ONCE()
> treatment.
>
Valentin Schneider March 4, 2020, 6:31 p.m. UTC | #5
On Wed, Mar 04 2020, Daniel Lezcano wrote:
>> With that said, that comment actually raises a valid point: picking
>> recently idled CPUs might give us warmer cache. So by using the break
>> even stat, we can end up picking CPUs with colder caches (have been
>> idling for longer) than the current logic would. I suppose more testing
>> will tell us where we stand.
>
> Actually I'm not sure this comment still applies. If the CPU is powered
> down, the cache is flushed or we can be picking up CPU with their cache
> totally trashed by interrupt processing.
>
>>> +++ b/kernel/sched/sched.h
>>> @@ -1015,6 +1015,7 @@ struct rq {
>>>  #ifdef CONFIG_CPU_IDLE
>>>       /* Must be inspected within a rcu lock section */
>>>       struct cpuidle_state	*idle_state;
>>> +	s64			break_even;
>>
>> Why signed? This should be purely positive, right?
>
> It should be, but s64 complies with the functions ktime_to_ns signature.
>
> static inline s64 ktime_to_ns(const ktime_t kt)
>

Would there be harm then in simply storing:

  ktime_get_ns() + idle_state->exit_latency_ns

(which is u64)?

>>>  #endif
>>>  };
>>>
>>> @@ -1850,6 +1851,16 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
>>>
>>>       return rq->idle_state;
>>>  }
>>> +
>>> +static inline void idle_set_break_even(struct rq *rq, s64 break_even)
>>> +{
>>> +	rq->break_even = break_even;
>>> +}
>>> +
>>> +static inline s64 idle_get_break_even(struct rq *rq)
>>> +{
>>> +	return rq->break_even;
>>> +}
>>
>> I'm not super familiar with the callsites for setting idle states,
>> what's the locking situation there? Do we have any rq lock?
>
> It is safe, we are under rcu, this section was discussed several years
> ago when introducing the idle_set_state():
>
>  https://lkml.org/lkml/2014/9/19/332
>

Thanks for the link!

So while we (should) have the relevant barriers, there can still be
concurrent writing (from the CPU entering/leaving idle) and reading
(from the CPU gathering stats).

rcu_dereference() gives you READ_ONCE(), and the rcu_assign_pointer()
should give you an smp_store_release(). What I'm thinking here is, if we
have reasons not to use the RCU primitives, we should at least slap some
READ/WRITE_ONCE() to the accesses. Also, can RCU even do anything about
scalar values like the break even you're storing?

>> In find_idlest_group_cpu() we're in a read-side RCU section, so the
>> idle_state is protected (speaking of which, why isn't idle_get_state()
>> using rcu_dereference()?), but that doesn't cover the break even.
>>
>> IIUC at the very least we may want to give them the READ/WRITE_ONCE()
>> treatment.
>>
Qais Yousef March 5, 2020, 3:25 p.m. UTC | #6
On 03/04/20 17:17, Daniel Lezcano wrote:
> 
> Hi Qais,
> 
> On 04/03/2020 16:01, Qais Yousef wrote:
> > Hi Daniel
> > 
> > Adding Rafael to CC as I think he might be interested in this too.
> > 
> > On 03/04/20 12:48, Daniel Lezcano wrote:
> >> In the idle CPU selection process occurring in the slow path via the
> >> find_idlest_group_cpu() function, we pick up in priority an idle CPU
> >> with the shallowest idle state otherwise we fall back to the least
> >> loaded CPU.
> >>
> >> In order to be more energy efficient but without impacting the
> >> performances, let's use another criteria: the break even deadline.
> > 
> > What is the break even deadline?
> >
> >> At idle time, when we store the idle state the CPU is entering in, we
> >> compute the next deadline where the CPU could be woken up without
> >> spending more energy to sleep.
> > 
> > I think that's its definition, but could do with more explanation.
> > 
> > So the break even deadline is the time window during which we can abort the CPU
> > while entering its shallowest idle state?
> 
> No, it is the moment in absolute time when the min residency is reached.
> 
> From Documentation/devicetree/bindings/arm/idle-states.yaml
> 
> "
> min-residency is defined for a given idle state as the minimum expected
> residency time for a state (inclusive of preparation and entry) after
> which choosing that state become the most energy efficient option. A
> good way to visualise this, is by taking the same graph above and
> comparing some states energy consumptions plots.
> "
> 
> > So if we have 2 cpus entering the shallowest idle state, we pick the one that
> > has a faster abort? And the energy saving comes from the fact we avoided
> > unnecessary sleep-just-to-wakeup-immediately cycle?
> 
> No, actually it is about choosing a CPU which has a better chance of
> having reached its min residency. Basically, when the CPU enters an
> idle state, that has a cost (cache flush / refill, context
> saving/restoring, etc.), so there is a peak of energy and the CPU has
> to save energy long enough to compensate for this extra consumption.
> 
> If the scheduler is constantly waking up an idle CPU before it slept
> long enough, we lose energy and performance.
> 
> Example 1, the CPUs are in a state:
>  - CPU0 (power down)
>  - CPU1 (power down)
>  - CPU2 (WFI)
>  - CPU3 (power down)
> 
>  The routine chooses CPU2 because it is in the shallowest state.
> 
> Example 2, the CPUs are in a state:
>  - CPU0 (power down) (bedl = 1234)
>  - CPU1 (power down) (bedl = 4321)
>  - CPU2 (power down) (bedl = 9876)
>  - CPU3 (power down) (bedl = 3421)
> 
> * bedl = break even deadline
> 
>   The routine chooses CPU0 because its bedl is the smallest.

Thanks for the explanation and the reference. The idle-states.yaml has a nice
diagram. I won't insist, but I think including the definition of break even
will help a lot to make the patch more understandable.
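
For readers, the two steps Daniel describes (stamping a break-even
deadline on idle entry, then preferring the idle CPU with the earliest
deadline) can be sketched in plain user-space C. `struct idle_cpu` and
the helper names below are hypothetical stand-ins for the kernel's
`struct rq` / `struct cpuidle_state`, not the patch's actual code:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU state used by the patch. */
struct idle_cpu {
	uint64_t exit_latency_ns;  /* cost of leaving the idle state  */
	int64_t  break_even;       /* absolute deadline, nanoseconds  */
};

/* On idle entry: the CPU breaks even once its exit latency has been
 * amortized, so the deadline is simply now + exit latency. */
static void enter_idle(struct idle_cpu *cpu, int64_t now_ns)
{
	cpu->break_even = now_ns + (int64_t)cpu->exit_latency_ns;
}

/* Selection: among CPUs in the same (shallowest) idle state, pick the
 * one with the earliest break-even deadline - it is the most likely to
 * have already amortized its idle-entry cost. */
static size_t pick_cpu(const struct idle_cpu *cpus, size_t n)
{
	size_t best = 0;

	for (size_t i = 1; i < n; i++)
		if (cpus[i].break_even < cpus[best].break_even)
			best = i;
	return best;
}
```

Run against Daniel's example 2 (bedl = 1234, 4321, 9876, 3421), this
picks CPU0, the CPU with the smallest break-even deadline.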

[...]

> > Shouldn't you retain the original if condition here? You omitted the 2nd part
> > of this check compared to the original condition
> > 
> > 	(!idle || >>>idle->exit_latency == min_exit_latency<<<)
> 
> It is done on purpose because of the condition right before.

I see it now. If we denote the condition as B: if it was false, then the
OR will reduce to !idle. But if it was true, then which path you take
will depend only on the state of idle, which you check in both paths.
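
This reduction can be checked exhaustively over the four (idle, B)
combinations. Below, original_branch() encodes the old guard
`(!idle || B)` and split_covered() the union of the two new branches,
`(idle && B)` and `!idle` (function names are illustrative, not from
the patch):

```c
#include <stdbool.h>

/* Old guard of the second branch in find_idlest_group_cpu(), where B
 * stands for 'idle->exit_latency == min_exit_latency'. */
static bool original_branch(bool idle, bool b)
{
	return !idle || b;
}

/* Union of the patch's two replacement branches: the equal-latency
 * idle branch (idle && B) plus the non-idle branch (!idle). */
static bool split_covered(bool idle, bool b)
{
	return (idle && b) || !idle;
}
```

Enumerating all four truth assignments shows the two predicates accept
exactly the same cases, so no CPU falls through the cracks of the split.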

[...]

> 
> Nope, thanks for spotting it. It should be:
> 
>  void sched_idle_set_state(struct cpuidle_state *idle_state)
>  {
> -       idle_set_state(this_rq(), idle_state);
> +       struct rq *rq = this_rq();
> +       idle_set_state(rq, idle_state);
> +
> +       if (likely(idle_state)) {
> +               ktime_t kt = ktime_add_ns(ktime_get(),
> +                                         idle_state->exit_latency_ns);
> +               idle_set_break_even(rq, ktime_to_ns(kt));
> +       }
>  }
> 
> 
> > Don't you need to reset the break_even value if idle_state is NULL too?
> 
> If the idle_state is NULL, the routine won't use the break_even.

+1

Thanks

--
Qais Yousef
Daniel Lezcano March 6, 2020, 9:28 a.m. UTC | #7
Hi Valentin,

[cc'ing Thomas Gleixner]

On 04/03/2020 19:31, Valentin Schneider wrote:
> 
> On Wed, Mar 04 2020, Daniel Lezcano wrote:
>>> With that said, that comment actually raises a valid point: picking
>>> recently idled CPUs might give us warmer cache. So by using the break
>>> even stat, we can end up picking CPUs with colder caches (have been
>>> idling for longer) than the current logic would. I suppose more testing
>>> will tell us where we stand.
>>
>> Actually I'm not sure this comment still applies. If the CPU is powered
>> down, the cache is flushed or we can be picking up a CPU with its
>> cache totally trashed by interrupt processing.
>>
>>>> +++ b/kernel/sched/sched.h
>>>> @@ -1015,6 +1015,7 @@ struct rq {
>>>>  #ifdef CONFIG_CPU_IDLE
>>>>       /* Must be inspected within a rcu lock section */
>>>>       struct cpuidle_state	*idle_state;
>>>> +	s64			break_even;
>>>
>>> Why signed? This should be purely positive, right?
>>
>> It should be, but s64 matches the ktime_to_ns() function signature.
>>
>> static inline s64 ktime_to_ns(const ktime_t kt)
>>
> 
> Would there be harm then in simply storing:
> 
>   ktime_get_ns() + idle_state->exit_latency_ns
> 
> (which is u64)?

Actually I'm not sure ktime_get_ns() is consistent: it returns a u64 but
is actually implemented as:

	return ktime_to_ns(ktime_get());

where ktime_to_ns is returning a s64.

There are the following functions:

static inline u64 ktime_get_coarse_ns(void)
static inline u64 ktime_get_coarse_real_ns(void)
static inline u64 ktime_get_coarse_boottime_ns(void)
static inline u64 ktime_get_coarse_clocktai_ns(void)
static inline u64 ktime_get_ns(void)
static inline u64 ktime_get_real_ns(void)
static inline u64 ktime_get_boottime_ns(void)
static inline u64 ktime_get_clocktai_ns(void)
static inline u64 ktime_get_raw_ns(void)

extern u64 ktime_get_mono_fast_ns(void);
extern u64 ktime_get_raw_fast_ns(void);
extern u64 ktime_get_boot_fast_ns(void);
extern u64 ktime_get_real_fast_ns(void);

They are all built on top of ktime_to_ns(), which returns a s64.

Not sure this type mismatch was done on purpose or not.

Thomas?
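
Whatever the answer, the mismatch looks harmless in practice: monotonic
nanoseconds stay far below 2^63 (about 292 years), so a positive s64
round-trips through u64 losslessly. A quick sanity sketch (helper names
are illustrative, not kernel API):

```c
#include <stdint.h>

/* A break-even deadline built as in the patch: now + exit latency. */
static int64_t compute_break_even(int64_t now_ns, uint64_t exit_latency_ns)
{
	return now_ns + (int64_t)exit_latency_ns;
}

/* A non-negative s64 survives a cast to u64 and back unchanged, so
 * storing it in either type loses nothing. */
static int s64_u64_roundtrip_ok(int64_t v)
{
	return v >= 0 && (int64_t)(uint64_t)v == v;
}
```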

Patch
diff mbox series

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fcc968669aea..520c5e822fdd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5793,6 +5793,7 @@  find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 {
 	unsigned long load, min_load = ULONG_MAX;
 	unsigned int min_exit_latency = UINT_MAX;
+	s64 min_break_even = S64_MAX;
 	u64 latest_idle_timestamp = 0;
 	int least_loaded_cpu = this_cpu;
 	int shallowest_idle_cpu = -1;
@@ -5810,6 +5811,8 @@  find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 		if (available_idle_cpu(i)) {
 			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
+			s64 break_even = idle_get_break_even(rq);
+			
 			if (idle && idle->exit_latency < min_exit_latency) {
 				/*
 				 * We give priority to a CPU whose idle state
@@ -5817,10 +5820,21 @@  find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 				 * of any idle timestamp.
 				 */
 				min_exit_latency = idle->exit_latency;
+				min_break_even = break_even;
 				latest_idle_timestamp = rq->idle_stamp;
 				shallowest_idle_cpu = i;
-			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
-				   rq->idle_stamp > latest_idle_timestamp) {
+			} else if ((idle && idle->exit_latency == min_exit_latency) &&
+				   break_even < min_break_even) {
+				/*
+				 * We give priority to the shallowest
+				 * idle states with the minimal break
+				 * even deadline to decrease the
+				 * probability of choosing a CPU which
+				 * has not yet reached its break even.
+				 */
+				min_break_even = break_even;
+				shallowest_idle_cpu = i;
+			} else if (!idle && rq->idle_stamp > latest_idle_timestamp) {
 				/*
 				 * If equal or no active idle state, then
 				 * the most recently idled CPU might have
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index b743bf38f08f..189cd51cd474 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -19,7 +19,14 @@  extern char __cpuidle_text_start[], __cpuidle_text_end[];
  */
 void sched_idle_set_state(struct cpuidle_state *idle_state)
 {
-	idle_set_state(this_rq(), idle_state);
+	struct rq *rq = this_rq();
+	ktime_t kt;
+
+	if (likely(idle_state)) {
+		kt = ktime_add_ns(ktime_get(), idle_state->exit_latency_ns);
+		idle_set_state(rq, idle_state);
+		idle_set_break_even(rq, ktime_to_ns(kt));
+	}
 }
 
 static int __read_mostly cpu_idle_force_poll;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2a0caf394dd4..abf2d2e73575 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1015,6 +1015,7 @@  struct rq {
 #ifdef CONFIG_CPU_IDLE
 	/* Must be inspected within a rcu lock section */
 	struct cpuidle_state	*idle_state;
+	s64			break_even;
 #endif
 };
 
@@ -1850,6 +1851,16 @@  static inline struct cpuidle_state *idle_get_state(struct rq *rq)
 
 	return rq->idle_state;
 }
+
+static inline void idle_set_break_even(struct rq *rq, s64 break_even)
+{
+	rq->break_even = break_even;
+}
+
+static inline s64 idle_get_break_even(struct rq *rq)
+{
+	return rq->break_even;
+}
 #else
 static inline void idle_set_state(struct rq *rq,
 				  struct cpuidle_state *idle_state)
@@ -1860,6 +1871,15 @@  static inline struct cpuidle_state *idle_get_state(struct rq *rq)
 {
 	return NULL;
 }
+
+static inline void idle_set_break_even(struct rq *rq, s64 break_even)
+{
+}
+
+static inline s64 idle_get_break_even(struct rq *rq)
+{
+	return 0;
+}
 #endif
 
 extern void schedule_idle(void);