* [PATCH] cpufreq: intel_pstate: Handle no_turbo in frequency invariance
@ 2022-04-07 23:42 Chen Yu
  2022-04-08  8:22 ` Peter Zijlstra
  2022-04-08 14:22 ` Giovanni Gherdovich
  0 siblings, 2 replies; 4+ messages in thread
From: Chen Yu @ 2022-04-07 23:42 UTC (permalink / raw)
  To: linux-pm
  Cc: Rafael J. Wysocki, Srinivas Pandruvada, Vincent Guittot,
	Peter Zijlstra, Len Brown, Tim Chen, Giovanni Gherdovich,
	Chen Yu, linux-kernel, Zhang Rui, Chen Yu

Problem statement:
Once the user has disabled turbo frequency by
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo,
the cfs_rq's util_avg becomes quite small when compared with
CPU capacity.

Steps to reproduce:

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

./x86_cpuload --count 1 --start 3 --timeout 100 --busy 99
This launches 1 thread bound to CPU3 for 100 seconds, with a CPU
utilization of 99%. [1]

top result:
%Cpu3  : 98.4 us,  0.0 sy,  0.0 ni,  1.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

check util_avg:
cat /sys/kernel/debug/sched/debug | grep "cfs_rq\[3\]" -A 20 | grep util_avg
  .util_avg                      : 611

So util_avg/CPU capacity is 611/1024 (roughly 60%), which is much smaller
than the 98.4% shown by top.

This might impact some logic in the scheduler. For example,
group_is_overloaded() compares group_capacity and group_util within a
sched group to check whether that sched group is overloaded. With this
gap, the sched group will not be regarded as overloaded even under a
nearly 100% workload. Besides group_is_overloaded(), there are other
victims. There is ongoing work that aims to optimize task wakeup within
an LLC domain. The main idea is to stop searching for idle CPUs if the
sched domain is overloaded [2]. That proposal also relies on
util_avg/CPU capacity to decide whether the LLC domain is overloaded.
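
For reference, the overload check in kernel/sched/fair.c is roughly the
following (paraphrased from around v5.17; exact fields and details may
differ between kernel versions):

	static inline bool
	group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
	{
		if (sgs->sum_nr_running <= sgs->group_weight)
			return false;

		/* With util_avg under-estimated, this rarely triggers. */
		if ((sgs->group_capacity * 100) <
				(sgs->group_util * imbalance_pct))
			return true;

		return false;
	}

With util_avg stuck around 611/1024, group_util stays well below
group_capacity * 100 / imbalance_pct, so the group is not reported as
overloaded even though the CPU is almost fully busy.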

Analysis:
CPU frequency invariance causes this difference. In summary, when CPU
frequency invariance is enabled, the util_sum of the cfs_rq decays quite
fast while the CPU is idle.

The details are as follows:

As described in update_rq_clock_pelt(), when frequency invariance is
enabled, there are two clock variables on each rq, clock_task and
clock_pelt (a simplified sketch of the function follows the diagram
below):

   The clock_pelt scales the time to reflect the effective amount of
   computation done during the running delta time but then syncs back to
   clock_task when rq is idle.

   absolute time    | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
   @ max frequency  ------******---------------******---------------
   @ half frequency ------************---------************---------
   clock pelt       | 1| 2|    3|    4| 7| 8| 9|   10|   11|14|15|16
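
A minimal sketch of the scaling, paraphrased from kernel/sched/pelt.h
around v5.17 (simplified, comments added; details may differ between
kernel versions):

	static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
	{
		if (unlikely(is_idle_task(rq->curr))) {
			/* The rq is idle: sync clock_pelt back to clock_task. */
			rq->clock_pelt = rq_clock_task(rq);
			return;
		}

		/*
		 * Scale the elapsed time by the CPU and frequency capacity,
		 * so a CPU running below its assumed max frequency accumulates
		 * PELT time more slowly than clock_task.
		 */
		delta = cap_scale(delta, arch_scale_cpu_capacity(cpu_of(rq)));
		delta = cap_scale(delta, arch_scale_freq_capacity(cpu_of(rq)));

		rq->clock_pelt += delta;
	}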

The fast decay of util_sum during idle happens because:
1. rq->clock_pelt always lags behind rq->clock_task.
2. rq->last_update is set to rq->clock_pelt' after invoking ___update_load_sum().
3. When the CPU then becomes idle, rq->clock_pelt' is suddenly advanced
   all the way up to rq->clock_task.
4. On the next ___update_load_sum(), the idle period is calculated as
   rq->clock_task - rq->last_update, i.e. rq->clock_task - rq->clock_pelt'.
   The lower the CPU frequency, the larger this delta becomes. Since the
   idle period is used only to decay util_sum, util_sum drops
   significantly across the idle period (a numeric sketch follows below).
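
As a rough numeric sketch (numbers are made up for illustration):
suppose the CPU runs for 8 ms at half of the assumed max (turbo)
frequency and then idles for 2 ms. clock_task advances by 8 ms, but
clock_pelt advances by only ~4 ms, so rq->last_update ends up ~4 ms
behind clock_task. When the CPU goes idle, clock_pelt is synced back to
clock_task, so the next ___update_load_sum() sees an idle delta of
~4 ms + 2 ms = 6 ms: three times the real idle time is spent purely
decaying util_sum.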

Proposal:
This symptom is not only caused by disabling turbo; it would also appear
if the user limits the max frequency at runtime, because whenever the
running frequency stays below the assumed max frequency, CPU frequency
invariance decays util_sum quite fast during idle.

As some end users disable turbo after boot, this patch aims to prevent
this symptom and deals with the turbo scenario for now. Ideally, CPU
frequency invariance would also become aware of a user-specified max CPU
frequency at runtime in the future.
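
For reference, arch_set_max_freq_ratio() on x86 is roughly the following
(paraphrased from arch/x86/kernel/smpboot.c around v5.17; details may
differ between kernel versions):

	void arch_set_max_freq_ratio(bool turbo_disabled)
	{
		/*
		 * With turbo disabled, treat the base frequency as the max,
		 * i.e. a ratio of SCHED_CAPACITY_SCALE; otherwise keep the
		 * precomputed turbo/base ratio.
		 */
		arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE :
						       arch_turbo_freq_ratio;
	}

Passing global.no_turbo from store_no_turbo() therefore makes the
frequency-invariance code stop assuming a turbo-level freq_max once the
user flips the sysfs knob.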

[The previous posting seems to have been lost on LKML; this is a resend,
sorry for any inconvenience.]

Link: https://github.com/yu-chen-surf/x86_cpuload.git #1
Link: https://lore.kernel.org/lkml/20220310005228.11737-1-yu.c.chen@intel.com/ #2
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 drivers/cpufreq/intel_pstate.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 846bb3a78788..2216b24b6f84 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1322,6 +1322,7 @@ static ssize_t store_no_turbo(struct kobject *a, struct kobj_attribute *b,
 	mutex_unlock(&intel_pstate_limits_lock);
 
 	intel_pstate_update_policies();
+	arch_set_max_freq_ratio(global.no_turbo);
 
 	mutex_unlock(&intel_pstate_driver_lock);
 
-- 
2.25.1



* Re: [PATCH] cpufreq: intel_pstate: Handle no_turbo in frequency invariance
  2022-04-07 23:42 [PATCH] cpufreq: intel_pstate: Handle no_turbo in frequency invariance Chen Yu
@ 2022-04-08  8:22 ` Peter Zijlstra
  2022-04-08 14:22 ` Giovanni Gherdovich
  1 sibling, 0 replies; 4+ messages in thread
From: Peter Zijlstra @ 2022-04-08  8:22 UTC (permalink / raw)
  To: Chen Yu
  Cc: linux-pm, Rafael J. Wysocki, Srinivas Pandruvada,
	Vincent Guittot, Len Brown, Tim Chen, Giovanni Gherdovich,
	Chen Yu, linux-kernel, Zhang Rui

On Fri, Apr 08, 2022 at 07:42:58AM +0800, Chen Yu wrote:
> Problem statement:
> Once the user has disabled turbo frequency by
> echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo,
> the cfs_rq's util_avg becomes quite small when compared with
> CPU capacity.
>
> [... rest of the changelog and patch snipped; see the original message above ...]

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>


* Re: [PATCH] cpufreq: intel_pstate: Handle no_turbo in frequency invariance
  2022-04-07 23:42 [PATCH] cpufreq: intel_pstate: Handle no_turbo in frequency invariance Chen Yu
  2022-04-08  8:22 ` Peter Zijlstra
@ 2022-04-08 14:22 ` Giovanni Gherdovich
  2022-04-13 15:39   ` Rafael J. Wysocki
  1 sibling, 1 reply; 4+ messages in thread
From: Giovanni Gherdovich @ 2022-04-08 14:22 UTC (permalink / raw)
  To: Chen Yu, linux-pm
  Cc: Rafael J. Wysocki, Srinivas Pandruvada, Vincent Guittot,
	Peter Zijlstra, Len Brown, Tim Chen, Chen Yu, linux-kernel,
	Zhang Rui

On Fri, 2022-04-08 at 07:42 +0800, Chen Yu wrote:
> Problem statement:
> Once the user has disabled turbo frequency by
> echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo,
> the cfs_rq's util_avg becomes quite small when compared with
> CPU capacity.
>
> [... rest of the changelog and patch snipped; see the original message above ...]

Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>

You're right: when turbo is disabled, the frequency invariance code needs to
know about it; it calculates freq_curr/freq_max and assumes that freq_max is
some turbo level. For example, commit 918229cdd5ab ("x86/intel_pstate: Handle
runtime turbo disablement/enablement in frequency invariance") takes care of
this when global.turbo_disabled changes, but before your patch nothing checked
whether the user disabled turbo from sysfs. Thanks for the fix!
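
Concretely, the per-tick scale factor that x86 frequency invariance
feeds into PELT is, conceptually (simplified; the actual code works with
APERF/MPERF deltas and the precomputed arch_max_freq_ratio):

	freq_scale ~= SCHED_CAPACITY_SCALE * freq_curr / freq_max

so if turbo is disabled only via sysfs while freq_max still reflects a
turbo level, freq_scale stays well below SCHED_CAPACITY_SCALE even on a
fully busy CPU, which is where the under-estimated util_avg comes from.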

Giovanni

> [... patch diff snipped; see the original message above ...]



* Re: [PATCH] cpufreq: intel_pstate: Handle no_turbo in frequency invariance
  2022-04-08 14:22 ` Giovanni Gherdovich
@ 2022-04-13 15:39   ` Rafael J. Wysocki
  0 siblings, 0 replies; 4+ messages in thread
From: Rafael J. Wysocki @ 2022-04-13 15:39 UTC (permalink / raw)
  To: Giovanni Gherdovich, Chen Yu
  Cc: Linux PM, Rafael J. Wysocki, Srinivas Pandruvada,
	Vincent Guittot, Peter Zijlstra, Len Brown, Tim Chen, Chen Yu,
	Linux Kernel Mailing List, Zhang Rui

On Fri, Apr 8, 2022 at 4:22 PM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
>
> On Fri, 2022-04-08 at 07:42 +0800, Chen Yu wrote:
> > Problem statement:
> > Once the user has disabled turbo frequency by
> > echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo,
> > the cfs_rq's util_avg becomes quite small when compared with
> > CPU capacity.
> >
> > [... rest of the changelog and patch snipped; see the original message above ...]
>
> Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
>
> [... review comments snipped; see Giovanni's message above ...]
>
> Giovanni

Applied as 5.19 material, thanks!

