kernel-janitors.vger.kernel.org archive mirror
* [PATCH] sched/fair: check for idle core
@ 2020-10-20 16:37 Julia Lawall
  2020-10-21  7:29 ` Vincent Guittot
  2020-10-21 11:20 ` Mel Gorman
  0 siblings, 2 replies; 68+ messages in thread
From: Julia Lawall @ 2020-10-20 16:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: kernel-janitors, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles.Muller

On a thread wakeup, the change [1] from runnable load average to load
average for comparing candidate cores means that recent short-running
daemons on the core where a thread ran previously can be considered to
have a higher load than the core performing the wakeup, even when the
core where the thread ran previously is currently idle.  This can
cause a thread to migrate, taking the place of some other thread that
is about to wake up, and so on.  To avoid unnecessary migrations,
extend wake_affine_idle to check whether the core where the thread
previously ran is currently idle, and if so return that core as the
target.

[1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable
load in wakeup path")

This particularly has an impact when using passive (intel_cpufreq)
power management, where kworkers run every 0.004 seconds on all cores,
increasing the likelihood that an idle core will be considered to have
a load.

The following numbers were obtained with the benchmarking tool
hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
management were used.  Times are in seconds.  All experiments use all
160 hardware threads.

	v5.9/active		v5.9+patch/active
bt.C.c	24.725724+-0.962340	23.349608+-1.607214
lu.C.x	29.105952+-4.804203	25.249052+-5.561617
sp.C.x	31.220696+-1.831335	30.227760+-2.429792
ua.C.x	26.606118+-1.767384	25.778367+-1.263850

	v5.9/passive		v5.9+patch/passive
bt.C.c	25.330360+-1.028316	23.544036+-1.020189
lu.C.x	35.872659+-4.872090	23.719295+-3.883848
sp.C.x	32.141310+-2.289541	29.125363+-0.872300
ua.C.x	29.024597+-1.667049	25.728888+-1.539772

On the smaller data sets (A and B) and on the other NAS benchmarks
there is no impact on performance.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>

---
 kernel/sched/fair.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa4c6227cd6d..9b23dad883ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5804,6 +5804,9 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
 		return this_cpu;
 
+	if (available_idle_cpu(prev_cpu))
+		return prev_cpu;
+
 	return nr_cpumask_bits;
 }
 

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH] sched/fair: check for idle core
  2020-10-20 16:37 [PATCH] sched/fair: check for idle core Julia Lawall
@ 2020-10-21  7:29 ` Vincent Guittot
  2020-10-21 11:13   ` Peter Zijlstra
  2020-10-21 12:27   ` Vincent Guittot
  2020-10-21 11:20 ` Mel Gorman
  1 sibling, 2 replies; 68+ messages in thread
From: Vincent Guittot @ 2020-10-21  7:29 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller

Hi Julia,

On Tue, 20 Oct 2020 at 19:21, Julia Lawall <Julia.Lawall@inria.fr> wrote:
>
> [...]
> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>



* Re: [PATCH] sched/fair: check for idle core
  2020-10-21  7:29 ` Vincent Guittot
@ 2020-10-21 11:13   ` Peter Zijlstra
  2020-10-21 12:27   ` Vincent Guittot
  1 sibling, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2020-10-21 11:13 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller

On Wed, Oct 21, 2020 at 09:29:46AM +0200, Vincent Guittot wrote:
> Hi Julia,
> 
> On Tue, 20 Oct 2020 at 19:21, Julia Lawall <Julia.Lawall@inria.fr> wrote:
> > [...]
> > Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
> 
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

Thanks!


* Re: [PATCH] sched/fair: check for idle core
  2020-10-20 16:37 [PATCH] sched/fair: check for idle core Julia Lawall
  2020-10-21  7:29 ` Vincent Guittot
@ 2020-10-21 11:20 ` Mel Gorman
  2020-10-21 11:56   ` Julia Lawall
  2020-10-21 12:25   ` Vincent Guittot
  1 sibling, 2 replies; 68+ messages in thread
From: Mel Gorman @ 2020-10-21 11:20 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles.Muller

On Tue, Oct 20, 2020 at 06:37:59PM +0200, Julia Lawall wrote:
> On a thread wakeup, the change [1] from runnable load average to load
> average for comparing candidate cores means that recent short-running
> daemons on the core where a thread ran previously can be considered to
> have a higher load than the core performing the wakeup, even when the
> core where the thread ran previously is currently idle.  This can
> cause a thread to migrate, taking the place of some other thread that
> is about to wake up, and so on.  To avoid unnecessary migrations,
> extend wake_affine_idle to check whether the core where the thread
> previously ran is currently idle, and if so return that core as the
> target.
> 
> [...]
> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>

I suspect that the benefit of this patch is due to avoiding the overhead
of the wake_affine_weight() check, because the following check already
exists in select_idle_sibling():

        /*
         * If the previous CPU is cache affine and idle, don't be stupid:
         */
        if (prev != target && cpus_share_cache(prev, target) &&
            (available_idle_cpu(prev) || sched_idle_cpu(prev)))
                return prev;

Still, the concept makes some sense to avoid wake_affine_weight but look
at the earlier part of wake_affine_idle()

        if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
                return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

This thing is almost completely useless because this_cpu is only going to
be idle if it's a wakeup from interrupt context when the CPU was otherwise
idle *but* it takes care to only use the CPU if this and prev share LLC.

The patch as it stands may leave a task on a remote node when it should
have been pulled local to the waker because prev happened to be idle. This
is not guaranteed because a node could have multiple LLCs and prev is
still appropriate but that's a different problem entirely and requires
much deeper surgery. Still, not pulling a task from a remote node is
a change in expected behaviour. While it's possible that NUMA domains
will not even reach this path, it depends on the NUMA distance as can
be seen in sd_init() for the setting of SD_WAKE_AFFINE so I think the
cpus_share_cache check is necessary.

I think it would be more appropriate to rework that block that checks
this_cpu to instead check if the CPUs share cache first and then return one
of them (preference to prev based on the comment above it about avoiding
a migration) if either one is idle.

I see Vincent already agreed with the patch so I could be wrong.  Vincent,
did I miss something stupid?

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 11:20 ` Mel Gorman
@ 2020-10-21 11:56   ` Julia Lawall
  2020-10-21 12:19     ` Peter Zijlstra
  2020-10-21 12:28     ` Mel Gorman
  2020-10-21 12:25   ` Vincent Guittot
  1 sibling, 2 replies; 68+ messages in thread
From: Julia Lawall @ 2020-10-21 11:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller



On Wed, 21 Oct 2020, Mel Gorman wrote:

> On Tue, Oct 20, 2020 at 06:37:59PM +0200, Julia Lawall wrote:
> > [...]
>
> I suspect that the benefit of this patch is due to avoiding the overhead
> of wake_affine_weight() check because the following check exists in
> select_idle_sibling

I'm running 160 threads on 160 cores (80 physical cores).  All of the
threads are thus best off to just stay where they are.  If one thread
moves to the socket of prev, then they will be displacing other threads
that also expect to return to where they were previously located.

You can see this in the traces shown here:

https://pages.lip6.fr/Julia.Lawall/uas.pdf

Prior to 5.8, my machine was using intel_pstate and had few background
tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
the kernel decided that intel_cpufreq would be more appropriate, which
introduced kworkers every 0.004 seconds on all cores.  In the graphs for
early versions, sometimes the whole benchmark runs with the threads just
staying on their cores, or a few migrations.  Starting with 5.8, after 5
seconds where there are a number of synchronizations, all of the threads
move around between all of the cores.  Typically, one bad placement leads
to 10-15 threads moving around, until one ends up on the idle core where
the original thread was intended to be.

>
>         /*
>          * If the previous CPU is cache affine and idle, don't be stupid:
>          */
>         if (prev != target && cpus_share_cache(prev, target) &&
>             (available_idle_cpu(prev) || sched_idle_cpu(prev)))
>                 return prev;

This isn't triggered in the problematic case, because the problematic case
is where the prev core and the waker core are on different sockets.

To my understanding, when the runnable load was used and prev was idle,
wake_affine_weight would fail, and then wake_affine would return prev.
With the load average, in the case where there is a thread on the waker
core and there has recently been a daemon on the prev core, the comparison
between the cores is a bit random.  The patch thus tries to restore the
previous behavior.

julia

> [...]


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 11:56   ` Julia Lawall
@ 2020-10-21 12:19     ` Peter Zijlstra
  2020-10-21 12:42       ` Julia Lawall
  2020-10-21 13:10       ` Peter Zijlstra
  2020-10-21 12:28     ` Mel Gorman
  1 sibling, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2020-10-21 12:19 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller, Rafael J. Wysocki, viresh.kumar,
	srinivas.pandruvada

On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> Prior to 5.8, my machine was using intel_pstate and had few background
> tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> the kernel decided that intel_cpufreq would be more appropriate, which
> introduced kworkers every 0.004 seconds on all cores.

That still doesn't make any sense. Are you running the legacy on-demand
thing or something?

Rafael, Srinivas, Viresh, how come it defaults to that?


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 11:20 ` Mel Gorman
  2020-10-21 11:56   ` Julia Lawall
@ 2020-10-21 12:25   ` Vincent Guittot
  2020-10-21 12:47     ` Mel Gorman
  1 sibling, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2020-10-21 12:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Julia Lawall, Ingo Molnar, kernel-janitors, Peter Zijlstra,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles.Muller

On Wednesday 21 Oct 2020 at 12:20:38 (+0100), Mel Gorman wrote:
> [...]
> I think it would be more appropriate to rework that block that checks
> this_cpu to instead check if the CPUs share cache first and then return one
> of them (preference to prev based on the comment above it about avoiding
> a migration) if either one is idle.
> 
> I see Vincent already agreed with the patch so I could be wrong.  Vincent,
> did I miss something stupid?

This patch fixes the problem that, since commit 11f10e5420f6ce, we no longer
favor prev_cpu when it is idle: load is not null when a cpu is idle, whereas
runnable_load was. And this is important because it then decides in which
LLC we will look for a cpu.

> 
> -- 
> Mel Gorman
> SUSE Labs


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21  7:29 ` Vincent Guittot
  2020-10-21 11:13   ` Peter Zijlstra
@ 2020-10-21 12:27   ` Vincent Guittot
  1 sibling, 0 replies; 68+ messages in thread
From: Vincent Guittot @ 2020-10-21 12:27 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller

On Wed, 21 Oct 2020 at 09:29, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> Hi Julia,
>
> On Tue, 20 Oct 2020 at 19:21, Julia Lawall <Julia.Lawall@inria.fr> wrote:
> > [...]

A Fixes: tag is missing:
Fixes: 11f10e5420f6 ("sched/fair: Use load instead of runnable load in
wakeup path")

> > Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> > [...]


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 11:56   ` Julia Lawall
  2020-10-21 12:19     ` Peter Zijlstra
@ 2020-10-21 12:28     ` Mel Gorman
  1 sibling, 0 replies; 68+ messages in thread
From: Mel Gorman @ 2020-10-21 12:28 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller

On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > I suspect that the benefit of this patch is due to avoiding the overhead
> > of wake_affine_weight() check because the following check exists in
> > select_idle_sibling
> 
> I'm running 160 threads on 160 cores (80 physical cores).  All of the
> threads are thus best off to just stay where they are.  If one thread
> moves to the socket of prev, then they will be displacing other threads
> that also expect to return to where they were previously located.
> 
> You can see this in the traces shown here:
> 
> https://pages.lip6.fr/Julia.Lawall/uas.pdf
> 

This is an important point -- almost all CPUs are commonly active and the
search for an idle CPU is going to be difficult given that it is inherently
race-prone. In this case, you're right, it may indeed be better off to just
leave the task where it is if the prev CPU is idle. Not only does it avoid
the wake_affine_weight cost, it avoids a useless search of idle siblings.

However, there is also the important case where a machine is only
partially or lightly utilised. In that case, a cross-node migration
may be beneficial as the wakee may need to access data in the waker's
cache or memory local to the waker (no guarantee, the waker/wakee could
be completely independent but the scheduler cannot tell). In the low
utilisation case, select_idle_sibling is also likely to be less expensive.

Leaving a task on a remote node because the prev CPU was idle is an
important semantic change in the behaviour of the scheduler when there is
spare idle capacity in either domain. It's non-obvious from the changelog
and the patch itself that this change of behaviour happens.
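
The semantic change could be narrowed with a cache-affinity guard; below is a
minimal sketch of that hypothetical variant (not the posted patch), with plain
ints standing in for the kernel's available_idle_cpu() and cpus_share_cache()
checks:

```c
#define NR_CPUMASK_BITS 256	/* models the "no affine preference" return */

/*
 * Hypothetical variant of wake_affine_idle(): keep the task on its
 * idle prev CPU only when prev shares a cache with the waking CPU,
 * so cross-node wakeups can still migrate toward the waker.
 */
int wake_affine_idle_sketch(int this_cpu, int prev_cpu,
			    int prev_is_idle, int share_cache)
{
	if (prev_is_idle && share_cache)
		return prev_cpu;

	return NR_CPUMASK_BITS;	/* fall through to wake_affine_weight() */
}
```

This mirrors the select_idle_sibling check quoted below, moved earlier in the
wakeup decision; whether that trade-off is right is exactly what is being
debated here.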

> Prior to 5.8, my machine was using intel_pstate and had few background
> tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> the kernel decided that intel_cpufreq would be more appropriate, which
> introduced kworkers every 0.004 seconds on all cores.  In the graphs for
> early versions, sometimes the whole benchmark runs with the threads just
> staying on their cores, or a few migrations.  Starting with 5.8, after 5
> seconds where there are a number of synchronizations, all of the threads
> move around between all of the cores.  Typically, one bad placement leads
> to 10-15 threads moving around, until one ends up on the idle core where
> the original thread was intended to be.
> 

And this is an issue but I don't think the fix is avoiding cross-node
migrations entirely if the prev CPU happened to be idle because cache
lines will have to be bounced and/or remote data be accessed.  At minimum,
the low utilisation case should be considered and the changelog justify
why avoiding cross-node wakeups is still appropriate when the waker CPU
has plenty of idle CPUs.

> >
> >         /*
> >          * If the previous CPU is cache affine and idle, don't be stupid:
> >          */
> >         if (prev != target && cpus_share_cache(prev, target) &&
> >             (available_idle_cpu(prev) || sched_idle_cpu(prev)))
> >                 return prev;
> 
> This isn't triggered in the problematic case, because the problematic case
> is where the prev core and the waker core are on different sockets.
> 

I simply wanted to highlight the fact it checks whether caches are shared
or not as that is also taken into account in select_idle_sibling for
example.

> To my understanding, when the runnable load was used and prev was idle,
> wake_affine_weight would fail, and then wake_affine would return prev.
> With the load average, in the case where there is a thread on the waker
> core and there has recently been a daemon on the prev core, the comparison
> between the cores is a bit random.  The patch thus tries to restore the
> previous behavior.
> 

I think wake_affine_weight() is a bit random anyway simply because it picks
a CPU that select_idle_sibling ignores in a lot of cases and in others
simply stacks the wakee on top of the waker. Every so often I think that
it should simply be ripped out because the scheduler avoids stacking on
the wakeup paths unless there is no other choice in which case the task
might as well remain on prev anyway. Maybe it made more sense when there
was an effort to detect when task stacking is appropriate but right now,
wake_affine_weight() is basically a random migration generator -- leave
it to the load balancer to figure out placement decisions based on load.
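
The effect being described can be illustrated with a toy model of the load
comparison (illustrative numbers and scaling, not the kernel's actual
wake_affine_weight() code or units): under load_avg, an idle prev CPU can still
carry residual decaying load from a recent kworker and lose to the waker, while
under runnable_load_avg an idle prev contributed zero and could never lose.

```c
/*
 * Toy model of a wake_affine_weight()-style comparison: the waker
 * side wins (the task migrates) when its effective load, scaled by
 * 100, is strictly below prev's load scaled by an imbalance
 * percentage. Illustrative only.
 */
int waker_wins(long this_load, long task_load,
	       long prev_load, int imbalance_pct)
{
	long this_eff = (this_load + task_load) * 100;
	long prev_eff = prev_load * imbalance_pct;

	return this_eff < prev_eff;
}
```

With prev_load = 0 (idle prev under runnable_load semantics) the waker can
never win; give the idle prev a small residual load_avg and the waker starts
winning, triggering the migrations discussed in this thread.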

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 12:19     ` Peter Zijlstra
@ 2020-10-21 12:42       ` Julia Lawall
  2020-10-21 12:52         ` Peter Zijlstra
  2020-10-21 18:15         ` Rafael J. Wysocki
  2020-10-21 13:10       ` Peter Zijlstra
  1 sibling, 2 replies; 68+ messages in thread
From: Julia Lawall @ 2020-10-21 12:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
	Valentin Schneider, Gilles Muller, Rafael J. Wysocki,
	viresh.kumar, srinivas.pandruvada



On Wed, 21 Oct 2020, Peter Zijlstra wrote:

> On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > Prior to 5.8, my machine was using intel_pstate and had few background
> > tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> > the kernel decided that intel_cpufreq would be more appropriate, which
> > introduced kworkers every 0.004 seconds on all cores.
>
> That still doesn't make any sense. Are you running the legacy on-demand
> thing or something?
>
> Rafael, Srinivas, Viresh, how come it defaults to that?

The relevant commits are 33aa46f252c7 and 39a188b88332, which fixes a small
bug.  I have an Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that does not
have the HWP feature, even though the cores seemed to be able to change
their frequencies at the hardware level.

julia


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 12:25   ` Vincent Guittot
@ 2020-10-21 12:47     ` Mel Gorman
  2020-10-21 12:56       ` Julia Lawall
  0 siblings, 1 reply; 68+ messages in thread
From: Mel Gorman @ 2020-10-21 12:47 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Julia Lawall, Ingo Molnar, kernel-janitors, Peter Zijlstra,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles.Muller

On Wed, Oct 21, 2020 at 02:25:32PM +0200, Vincent Guittot wrote:
> > I see Vincent already agreed with the patch so I could be wrong.  Vincent,
> > did I miss something stupid?
> 
> > This patch fixes the problem that, since commit 11f10e5420f6ce, we no longer
> > favor the prev_cpu when it is idle, because load is not null when a cpu is
> > idle whereas runnable_load was. And this is important because this will then
> > decide in which LLC we will look for a cpu.
> 

Ok, that is understandable but I'm still concerned that the fix simply
trades one problem for another by leaving related tasks remote to each
other and increasing cache misses and remote data accesses.

wake_affine_weight is a giant pain because really we don't care about the
load on the waker CPU or its available capacity, we care about whether it has idle
siblings that can be found quickly. As tempting as ripping it out is,
it never happened because sometimes it makes the right decision.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 12:42       ` Julia Lawall
@ 2020-10-21 12:52         ` Peter Zijlstra
  2020-10-21 18:18           ` Rafael J. Wysocki
  2020-10-21 18:15         ` Rafael J. Wysocki
  1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2020-10-21 12:52 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller, Rafael J. Wysocki, viresh.kumar,
	srinivas.pandruvada

On Wed, Oct 21, 2020 at 02:42:20PM +0200, Julia Lawall wrote:
> 
> 
> On Wed, 21 Oct 2020, Peter Zijlstra wrote:
> 
> > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > introduced kworkers every 0.004 seconds on all cores.
> >
> > That still doesn't make any sense. Are you running the legacy on-demand
> > thing or something?
> >
> > Rafael, Srinivas, Viresh, how come it defaults to that?
> 
> The relevant commits are 33aa46f252c7 and 39a188b88332, which fixes a small
> bug.  I have an Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that does not
> have the HWP feature, even though the cores seemed to be able to change
> their frequencies at the hardware level.

That just makes intel_pstate not prefer active mode. With the clear
intent that it should then go use schedutil, but somehow it looks like
you landed on ondemand, which is absolutely atrocious.

What's:

  $ for i in /sys/devices/system/cpu/cpu0/cpufreq/scaling_*; do echo -n $i ": "; cat $i; done

say, for you? And if you do:

  $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo schedutil > $i; done

Are the kworkers gone?


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 12:47     ` Mel Gorman
@ 2020-10-21 12:56       ` Julia Lawall
  2020-10-21 13:18         ` Mel Gorman
  0 siblings, 1 reply; 68+ messages in thread
From: Julia Lawall @ 2020-10-21 12:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Julia Lawall, Ingo Molnar, kernel-janitors,
	Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
	Valentin Schneider, Gilles.Muller




On Wed, 21 Oct 2020, Mel Gorman wrote:

> On Wed, Oct 21, 2020 at 02:25:32PM +0200, Vincent Guittot wrote:
> > > I see Vincent already agreed with the patch so I could be wrong.  Vincent,
> > > did I miss something stupid?
> >
> > This patch fixes the problem that, since commit 11f10e5420f6ce, we no longer
> > favor the prev_cpu when it is idle, because load is not null when a cpu is
> > idle whereas runnable_load was. And this is important because this will then
> > decide in which LLC we will look for a cpu.
> >
>
> Ok, that is understandable but I'm still concerned that the fix simply
> trades one problem for another by leaving related tasks remote to each
> other and increasing cache misses and remote data accesses.
>
> wake_affine_weight is a giant pain because really we don't care about the
> load on the waker CPU or its available capacity, we care about whether it has idle
> siblings that can be found quickly. As tempting as ripping it out is,
> it never happened because sometimes it makes the right decision.

My goal was to restore the previous behavior, when runnable load was used.
The patch removing the use of runnable load (11f10e5420f6) basically
presented it as: load balancing was using it, so wakeup should use it
too, and in any case it didn't matter because idle CPUs were checked for
anyway.

Is your point of view that the proposed change is overkill?  Or is it that
the original behavior was not desirable?

thanks,
julia


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 12:19     ` Peter Zijlstra
  2020-10-21 12:42       ` Julia Lawall
@ 2020-10-21 13:10       ` Peter Zijlstra
  2020-10-21 18:11         ` Rafael J. Wysocki
  1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2020-10-21 13:10 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller, Rafael J. Wysocki, viresh.kumar,
	srinivas.pandruvada

On Wed, Oct 21, 2020 at 02:19:50PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > Prior to 5.8, my machine was using intel_pstate and had few background
> > tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> > the kernel decided that intel_cpufreq would be more appropriate, which
> > introduced kworkers every 0.004 seconds on all cores.
> 
> That still doesn't make any sense. Are you running the legacy on-demand
> thing or something?
> 
> Rafael, Srinivas, Viresh, how come it defaults to that?

Do we want something like this?

---
 arch/x86/configs/i386_defconfig   | 3 +--
 arch/x86/configs/x86_64_defconfig | 3 +--
 drivers/cpufreq/Kconfig           | 7 +++++--
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig
index 78210793d357..c343ad459737 100644
--- a/arch/x86/configs/i386_defconfig
+++ b/arch/x86/configs/i386_defconfig
@@ -41,8 +41,7 @@ CONFIG_PM_DEBUG=y
 CONFIG_PM_TRACE_RTC=y
 CONFIG_ACPI_DOCK=y
 CONFIG_ACPI_BGRT=y
-CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
-CONFIG_CPU_FREQ_GOV_ONDEMAND=y
+CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
 CONFIG_X86_ACPI_CPUFREQ=y
 CONFIG_EFI_VARS=y
 CONFIG_KPROBES=y
diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
index 9936528e1939..23e1ea85c1ec 100644
--- a/arch/x86/configs/x86_64_defconfig
+++ b/arch/x86/configs/x86_64_defconfig
@@ -38,8 +38,7 @@ CONFIG_PM_DEBUG=y
 CONFIG_PM_TRACE_RTC=y
 CONFIG_ACPI_DOCK=y
 CONFIG_ACPI_BGRT=y
-CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
-CONFIG_CPU_FREQ_GOV_ONDEMAND=y
+CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
 CONFIG_X86_ACPI_CPUFREQ=y
 CONFIG_IA32_EMULATION=y
 CONFIG_EFI_VARS=y
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 2c7171e0b001..8dfca6e9b836 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -37,8 +37,7 @@ config CPU_FREQ_STAT
 choice
 	prompt "Default CPUFreq governor"
 	default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ
-	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
-	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
+	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if SMP
 	default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
 	help
 	  This option sets which CPUFreq governor shall be loaded at
@@ -71,6 +70,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE
 
 config CPU_FREQ_DEFAULT_GOV_ONDEMAND
 	bool "ondemand"
+	depends on !SMP
 	select CPU_FREQ_GOV_ONDEMAND
 	select CPU_FREQ_GOV_PERFORMANCE
 	help
@@ -83,6 +83,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND
 
 config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
 	bool "conservative"
+	depends on !SMP
 	select CPU_FREQ_GOV_CONSERVATIVE
 	select CPU_FREQ_GOV_PERFORMANCE
 	help
@@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE
 
 config CPU_FREQ_GOV_ONDEMAND
 	tristate "'ondemand' cpufreq policy governor"
+	depends on !SMP
 	select CPU_FREQ_GOV_COMMON
 	help
 	  'ondemand' - This driver adds a dynamic cpufreq policy governor.
@@ -163,6 +165,7 @@ config CPU_FREQ_GOV_ONDEMAND
 config CPU_FREQ_GOV_CONSERVATIVE
 	tristate "'conservative' cpufreq governor"
 	depends on CPU_FREQ
+	depends on !SMP
 	select CPU_FREQ_GOV_COMMON
 	help
 	  'conservative' - this driver is rather similar to the 'ondemand'


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 12:56       ` Julia Lawall
@ 2020-10-21 13:18         ` Mel Gorman
  2020-10-21 13:24           ` Julia Lawall
  2020-10-21 13:48           ` Julia Lawall
  0 siblings, 2 replies; 68+ messages in thread
From: Mel Gorman @ 2020-10-21 13:18 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Vincent Guittot, Ingo Molnar, kernel-janitors, Peter Zijlstra,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles.Muller

On Wed, Oct 21, 2020 at 02:56:06PM +0200, Julia Lawall wrote:
> 
> 
> On Wed, 21 Oct 2020, Mel Gorman wrote:
> 
> > On Wed, Oct 21, 2020 at 02:25:32PM +0200, Vincent Guittot wrote:
> > > > I see Vincent already agreed with the patch so I could be wrong.  Vincent,
> > > > did I miss something stupid?
> > >
> > > This patch fixes the problem that, since commit 11f10e5420f6ce, we no longer
> > > favor the prev_cpu when it is idle, because load is not null when a cpu is
> > > idle whereas runnable_load was. And this is important because this will then
> > > decide in which LLC we will look for a cpu.
> > >
> >
> > Ok, that is understandable but I'm still concerned that the fix simply
> > trades one problem for another by leaving related tasks remote to each
> > other and increasing cache misses and remote data accesses.
> >
> > wake_affine_weight is a giant pain because really we don't care about the
> > load on the waker CPU or its available capacity, we care about whether it has idle
> > siblings that can be found quickly. As tempting as ripping it out is,
> > it never happened because sometimes it makes the right decision.
> 
> My goal was to restore the previous behavior, when runnable load was used.
> > The patch removing the use of runnable load (11f10e5420f6) basically
> > presented it as: load balancing was using it, so wakeup should use it
> > too, and in any case it didn't matter because idle CPUs were checked for
> > anyway.
> 

Which is fair.

> Is your point of view that the proposed change is overkill?  Or is it that
> the original behavior was not desirable?
> 

I worry it's overkill because prev is always used if it is idle even
if it is on a node remote to the waker. It cuts off the option of a
wakee moving to a CPU local to the waker which is not equivalent to the
original behaviour.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 13:18         ` Mel Gorman
@ 2020-10-21 13:24           ` Julia Lawall
  2020-10-21 15:08             ` Mel Gorman
  2020-10-21 13:48           ` Julia Lawall
  1 sibling, 1 reply; 68+ messages in thread
From: Julia Lawall @ 2020-10-21 13:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Julia Lawall, Vincent Guittot, Ingo Molnar, kernel-janitors,
	Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
	Valentin Schneider, Gilles Muller




On Wed, 21 Oct 2020, Mel Gorman wrote:

> On Wed, Oct 21, 2020 at 02:56:06PM +0200, Julia Lawall wrote:
> >
> >
> > On Wed, 21 Oct 2020, Mel Gorman wrote:
> >
> > > On Wed, Oct 21, 2020 at 02:25:32PM +0200, Vincent Guittot wrote:
> > > > > I see Vincent already agreed with the patch so I could be wrong.  Vincent,
> > > > > did I miss something stupid?
> > > >
> > > > This patch fixes the problem that, since commit 11f10e5420f6ce, we no longer
> > > > favor the prev_cpu when it is idle, because load is not null when a cpu is
> > > > idle whereas runnable_load was. And this is important because this will then
> > > > decide in which LLC we will look for a cpu.
> > > >
> > >
> > > Ok, that is understandable but I'm still concerned that the fix simply
> > > trades one problem for another by leaving related tasks remote to each
> > > other and increasing cache misses and remote data accesses.
> > >
> > > wake_affine_weight is a giant pain because really we don't care about the
> > > load on the waker CPU or its available capacity, we care about whether it has idle
> > > siblings that can be found quickly. As tempting as ripping it out is,
> > > it never happened because sometimes it makes the right decision.
> >
> > My goal was to restore the previous behavior, when runnable load was used.
> > > The patch removing the use of runnable load (11f10e5420f6) basically
> > > presented it as: load balancing was using it, so wakeup should use it
> > > too, and in any case it didn't matter because idle CPUs were checked for
> > > anyway.
> >
>
> Which is fair.
>
> > Is your point of view that the proposed change is overkill?  Or is it that
> > the original behavior was not desirable?
> >
>
> I worry it's overkill because prev is always used if it is idle even
> if it is on a node remote to the waker. It cuts off the option of a
> wakee moving to a CPU local to the waker which is not equivalent to the
> original behaviour.

But it is equal to the original behavior in the idle prev case if you go
back to the runnable load average days...

The problem seems impossible to solve, because there is no way to know by
looking only at prev and this whether the thread would prefer to stay
where it was or go to the waker.

julia


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 13:18         ` Mel Gorman
  2020-10-21 13:24           ` Julia Lawall
@ 2020-10-21 13:48           ` Julia Lawall
  2020-10-21 15:26             ` Mel Gorman
  1 sibling, 1 reply; 68+ messages in thread
From: Julia Lawall @ 2020-10-21 13:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Julia Lawall, Vincent Guittot, Ingo Molnar, kernel-janitors,
	Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
	Valentin Schneider, Gilles.Muller




On Wed, 21 Oct 2020, Mel Gorman wrote:

> On Wed, Oct 21, 2020 at 02:56:06PM +0200, Julia Lawall wrote:
> >
> >
> > On Wed, 21 Oct 2020, Mel Gorman wrote:
> >
> > > On Wed, Oct 21, 2020 at 02:25:32PM +0200, Vincent Guittot wrote:
> > > > > I see Vincent already agreed with the patch so I could be wrong.  Vincent,
> > > > > did I miss something stupid?
> > > >
> > > > This patch fixes the problem that, since commit 11f10e5420f6ce, we no longer
> > > > favor the prev_cpu when it is idle, because load is not null when a cpu is
> > > > idle whereas runnable_load was. And this is important because this will then
> > > > decide in which LLC we will look for a cpu.
> > > >
> > >
> > > Ok, that is understandable but I'm still concerned that the fix simply
> > > trades one problem for another by leaving related tasks remote to each
> > > other and increasing cache misses and remote data accesses.
> > >
> > > wake_affine_weight is a giant pain because really we don't care about the
> > > load on the waker CPU or its available capacity, we care about whether it has idle
> > > siblings that can be found quickly. As tempting as ripping it out is,
> > > it never happened because sometimes it makes the right decision.
> >
> > My goal was to restore the previous behavior, when runnable load was used.
> > > The patch removing the use of runnable load (11f10e5420f6) basically
> > > presented it as: load balancing was using it, so wakeup should use it
> > > too, and in any case it didn't matter because idle CPUs were checked for
> > > anyway.
> >
>
> Which is fair.
>
> > Is your point of view that the proposed change is overkill?  Or is it that
> > the original behavior was not desirable?
> >
>
> I worry it's overkill because prev is always used if it is idle even
> if it is on a node remote to the waker. It cuts off the option of a
> wakee moving to a CPU local to the waker which is not equivalent to the
> original behaviour.

Could it be possible to check p->recent_used_cpu?  If that is prev (or on
the same socket?), then prev could be a good choice.  If that is on the
same socket as the waker, then maybe the waker would be better.
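
A sketch of the heuristic proposed here (hypothetical; node[] is a plain
cpu-to-NUMA-node lookup standing in for cpu_to_node()): prefer prev when the
recently used CPU is on prev's node, otherwise lean toward the waker.

```c
/*
 * Hypothetical recent_used_cpu heuristic. If the task's recently
 * used CPU is on prev's node, guess its working set is there and
 * keep prev; if it is on the waker's node, guess the waker's node
 * is better. Illustrative only.
 */
int example_node[4] = {0, 0, 1, 1};	/* cpu -> node map for the example */

int pick_by_recent(const int *node, int this_cpu, int prev_cpu,
		   int recent_cpu)
{
	if (node[recent_cpu] == node[prev_cpu])
		return prev_cpu;
	if (node[recent_cpu] == node[this_cpu])
		return this_cpu;
	return prev_cpu;	/* no hint: default to prev */
}
```

Whether recent_used_cpu is a trustworthy hint for this is the open question,
given how it is updated on the wakeup path.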

julia


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 13:24           ` Julia Lawall
@ 2020-10-21 15:08             ` Mel Gorman
  2020-10-21 15:18               ` Julia Lawall
  2020-10-21 15:19               ` Vincent Guittot
  0 siblings, 2 replies; 68+ messages in thread
From: Mel Gorman @ 2020-10-21 15:08 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Vincent Guittot, Ingo Molnar, kernel-janitors, Peter Zijlstra,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller

On Wed, Oct 21, 2020 at 03:24:48PM +0200, Julia Lawall wrote:
> > I worry it's overkill because prev is always used if it is idle even
> > if it is on a node remote to the waker. It cuts off the option of a
> > wakee moving to a CPU local to the waker which is not equivalent to the
> > original behaviour.
> 
> But it is equal to the original behavior in the idle prev case if you go
> back to the runnable load average days...
> 

It is similar but it misses the sync treatment and sd->imbalance_pct part of
wake_affine_weight, which has unpredictable consequences. The data
available covers only the fully utilised case.

> The problem seems impossible to solve, because there is no way to know by
> looking only at prev and this whether the thread would prefer to stay
> where it was or go to the waker.
> 

Yes, this is definitely true. Looking at prev_cpu and this_cpu is a
crude approximation and the path is heavily limited in terms of how
clever it can be.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 15:08             ` Mel Gorman
@ 2020-10-21 15:18               ` Julia Lawall
  2020-10-21 15:23                 ` Vincent Guittot
  2020-10-21 15:19               ` Vincent Guittot
  1 sibling, 1 reply; 68+ messages in thread
From: Julia Lawall @ 2020-10-21 15:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, kernel-janitors, Peter Zijlstra,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller



On Wed, 21 Oct 2020, Mel Gorman wrote:

> On Wed, Oct 21, 2020 at 03:24:48PM +0200, Julia Lawall wrote:
> > > I worry it's overkill because prev is always used if it is idle even
> > > if it is on a node remote to the waker. It cuts off the option of a
> > > wakee moving to a CPU local to the waker which is not equivalent to the
> > > original behaviour.
> >
> > But it is equal to the original behavior in the idle prev case if you go
> > back to the runnable load average days...
> >
>
> It is similar but it misses the sync treatment and sd->imbalance_pct part of
> wake_affine_weight which has unpredictable consequences. The data
> available is only on the fully utilised case.

OK, what if my patch were:

@@ -5800,6 +5800,9 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
 		return this_cpu;

+	if (!sync && available_idle_cpu(prev_cpu))
+		return prev_cpu;
+
 	return nr_cpumask_bits;
 }

The sd->imbalance_pct part would have previously been a multiplication by
0, so it doesn't need to be taken into account.
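
The multiplication-by-zero argument can be made concrete with a simplified
model of the old comparison (illustrative, not the kernel code): with
runnable_load, an idle prev contributed 0, so whatever sd->imbalance_pct was,
prev's effective load was 0 * pct = 0 and this_cpu could never compare
strictly lower.

```c
/*
 * Simplified model of the old runnable_load comparison. When prev
 * is idle its runnable load is 0, so prev's effective load is zero
 * for any imbalance percentage, and the waker side cannot compare
 * strictly lower. Illustrative only.
 */
long prev_eff_load(long prev_runnable_load, int imbalance_pct)
{
	return prev_runnable_load * imbalance_pct;
}

int this_side_wins(long this_eff_load, long prev_eff)
{
	return this_eff_load * 100 < prev_eff;
}
```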

julia

>
> > The problem seems impossible to solve, because there is no way to know by
> > looking only at prev and this whether the thread would prefer to stay
> > where it was or go to the waker.
> >
>
> Yes, this is definitely true. Looking at prev_cpu and this_cpu is a
> crude approximation and the path is heavily limited in terms of how
> clever it can be.
>
> --
> Mel Gorman
> SUSE Labs
>


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 15:08             ` Mel Gorman
  2020-10-21 15:18               ` Julia Lawall
@ 2020-10-21 15:19               ` Vincent Guittot
  2020-10-21 17:00                 ` Mel Gorman
  1 sibling, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2020-10-21 15:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Julia Lawall, Ingo Molnar, kernel-janitors, Peter Zijlstra,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller

On Wed, 21 Oct 2020 at 17:08, Mel Gorman <mgorman@suse.de> wrote:
>
> On Wed, Oct 21, 2020 at 03:24:48PM +0200, Julia Lawall wrote:
> > > I worry it's overkill because prev is always used if it is idle even
> > > if it is on a node remote to the waker. It cuts off the option of a
> > > wakee moving to a CPU local to the waker which is not equivalent to the
> > > original behaviour.
> >
> > But it is equal to the original behavior in the idle prev case if you go
> > back to the runnable load average days...
> >
>
> It is similar but it misses the sync treatment and sd->imbalance_pct part of
> wake_affine_weight which has unpredictable consequences. The data
> available is only on the fully utilised case.

In fact it's the same, because runnable_load_avg was null when a cpu was idle,
so if prev_cpu was idle, we were selecting prev_cpu

>
> > The problem seems impossible to solve, because there is no way to know by
> > looking only at prev and this whether the thread would prefer to stay
> > where it was or go to the waker.
> >
>
> Yes, this is definitely true. Looking at prev_cpu and this_cpu is a
> crude approximation and the path is heavily limited in terms of how
> clever it can be.
>
> --
> Mel Gorman
> SUSE Labs


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 15:18               ` Julia Lawall
@ 2020-10-21 15:23                 ` Vincent Guittot
  2020-10-21 15:33                   ` Julia Lawall
  0 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2020-10-21 15:23 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Mel Gorman, Ingo Molnar, kernel-janitors, Peter Zijlstra,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller

On Wed, 21 Oct 2020 at 17:18, Julia Lawall <julia.lawall@inria.fr> wrote:
>
>
>
> On Wed, 21 Oct 2020, Mel Gorman wrote:
>
> > On Wed, Oct 21, 2020 at 03:24:48PM +0200, Julia Lawall wrote:
> > > > I worry it's overkill because prev is always used if it is idle even
> > > > if it is on a node remote to the waker. It cuts off the option of a
> > > > wakee moving to a CPU local to the waker which is not equivalent to the
> > > > original behaviour.
> > >
> > > But it is equal to the original behavior in the idle prev case if you go
> > > back to the runnable load average days...
> > >
> >
> > It is similar but it misses the sync treatment and sd->imbalance_pct part of
> > wake_affine_weight which has unpredictable consequences. The data
> > available is only on the fully utilised case.
>
> OK, what if my patch were:
>
> @@ -5800,6 +5800,9 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>         if (sync && cpu_rq(this_cpu)->nr_running == 1)
>                 return this_cpu;
>
> +       if (!sync && available_idle_cpu(prev_cpu))
> +               return prev_cpu;
> +

This is not useful, because when prev_cpu is idle its runnable_avg was
null, so the only way for this_cpu to be selected by wake_affine_weight
is for its load to be null too, which is not really possible when sync
is set, because sync is used to say that the running task on this cpu
is about to sleep.

>         return nr_cpumask_bits;
>  }
>
> The sd->imbalance_pct part would have previously been a multiplication by
> 0, so it doesn't need to be taken into account.
>
> julia
>
> >
> > > The problem seems impossible to solve, because there is no way to know by
> > > looking only at prev and this whether the thread would prefer to stay
> > > where it was or go to the waker.
> > >
> >
> > Yes, this is definitely true. Looking at prev_cpu and this_cpu is a
> > crude approximation and the path is heavily limited in terms of how
> > clever it can be.
> >
> > --
> > Mel Gorman
> > SUSE Labs
> >


* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 13:48           ` Julia Lawall
@ 2020-10-21 15:26             ` Mel Gorman
  0 siblings, 0 replies; 68+ messages in thread
From: Mel Gorman @ 2020-10-21 15:26 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Vincent Guittot, Ingo Molnar, kernel-janitors, Peter Zijlstra,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles.Muller

On Wed, Oct 21, 2020 at 03:48:59PM +0200, Julia Lawall wrote:
> > I worry it's overkill because prev is always used if it is idle even
> > if it is on a node remote to the waker. It cuts off the option of a
> > wakee moving to a CPU local to the waker which is not equivalent to the
> > original behaviour.
> 
> Could it be possible to check p->recent_used_cpu?  If that is prev (or on
> the same socket?), then prev could be a good choice.  If that is on the
> same socket as the waker, then maybe the waker would be better.
> 

It is an interesting idea, but the treatment of p->recent_used_cpu is a
bit strange. At one point, I looked at reconciling all the select
CPU paths into one single pass (prototype worked back in 5.6-era but was
not a universal win). As part of that, I remembered that the setting
of recent_used_cpu is a problem because it can be set to the wakers
recent_used_cpu by this chunk

                if (want_affine)
                        current->recent_used_cpu = cpu;

Fixing that on its own causes other problems. From an unposted series
that "fixed it" as the last part of a 14-patch series:

    After select_idle_sibling, the *waker's* recent_used_cpu is set to the
    CPU used for the wakee. This was necessary at the time as, without it,
    the miss rate was unacceptably high.  It still works, but it is less
    efficient than it could be.

    This patch leaves the waker state alone and uses either the previous CPU on
    a hit or the target CPU on a miss when the recently used CPU is examined
    for idleness. The rest of the series is important as without it, this
    patch improves the number of hits but the miss rate gets ridiculously high.
    After the rest of the series, the hit and miss rates are both higher but
    the miss rate is acceptable.

As the hunk means that recent_used_cpu could have been set based on a
previous wakeup, it is unreliable for making cross-node decisions.
p->recent_used_cpu's primary purpose at the moment is to avoid
select_idle_sibling searches.
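
The recent_used_cpu fast path described above can be sketched in user
space. This is only an illustration of Mel's description: the struct, the
idle table, and the linear scan below are made-up stand-ins, not the
kernel's actual select_idle_sibling(), and the cpus_share_cache and
capacity checks of the real code are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative task state; only the field discussed here is modeled. */
struct task {
	int recent_used_cpu;
};

static bool idle_cpu_stub(int cpu, const bool *idle)
{
	return idle[cpu];
}

/*
 * Sketch of the recent_used_cpu fast path: if the recently used CPU is
 * idle, reuse it and skip the full search.  The field is refreshed on
 * every wakeup (and in mainline it can even be updated on the *waker*,
 * per the hunk quoted above), which is why its value may derive from a
 * previous wakeup and is unreliable for cross-node decisions.
 */
static int select_idle_sibling_sketch(struct task *p, int prev, int target,
				      const bool *idle, int nr_cpus)
{
	int recent = p->recent_used_cpu;

	p->recent_used_cpu = prev;	/* refreshed for the next wakeup */

	if (recent != prev && recent != target && idle_cpu_stub(recent, idle))
		return recent;		/* fast path: no search needed */

	/* stand-in for the full idle-CPU search */
	for (int cpu = 0; cpu < nr_cpus; cpu++)
		if (idle_cpu_stub(cpu, idle))
			return cpu;
	return target;
}
```

When the recently used CPU is idle, it is returned without any scan; the
cost of the LLC-wide search is what the field exists to avoid.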

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 15:23                 ` Vincent Guittot
@ 2020-10-21 15:33                   ` Julia Lawall
  0 siblings, 0 replies; 68+ messages in thread
From: Julia Lawall @ 2020-10-21 15:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Ingo Molnar, kernel-janitors, Peter Zijlstra,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller



On Wed, 21 Oct 2020, Vincent Guittot wrote:

> On Wed, 21 Oct 2020 at 17:18, Julia Lawall <julia.lawall@inria.fr> wrote:
> >
> >
> >
> > On Wed, 21 Oct 2020, Mel Gorman wrote:
> >
> > > On Wed, Oct 21, 2020 at 03:24:48PM +0200, Julia Lawall wrote:
> > > > > I worry it's overkill because prev is always used if it is idle even
> > > > > if it is on a node remote to the waker. It cuts off the option of a
> > > > > wakee moving to a CPU local to the waker which is not equivalent to the
> > > > > original behaviour.
> > > >
> > > > But it is equal to the original behavior in the idle prev case if you go
> > > > back to the runnable load average days...
> > > >
> > >
> > > It is similar but it misses the sync treatment and sd->imbalance_pct part of
> > > wake_affine_weight which has unpredictable consequences. The data
> > > available is only on the fully utilised case.
> >
> > OK, what if my patch were:
> >
> > @@ -5800,6 +5800,9 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> >         if (sync && cpu_rq(this_cpu)->nr_running == 1)
> >                 return this_cpu;
> >
> > +       if (!sync && available_idle_cpu(prev_cpu))
> > +               return prev_cpu;
> > +
>
> this is not useful because when prev_cpu is idle, its runnable_avg was
> null, so the only way for this_cpu to be selected by wake_affine_weight
> is to be null too, which is not really possible when sync is set,
> because sync is used to say that the running task on this cpu is about
> to sleep

OK, I agree.  Previously prev_eff_load was 0 when prev was idle, and
whether or not the sync code is executed in wake_affine_weight, it will
not be the case that this_eff_load < prev_eff_load, so this_cpu will not
be selected.
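
For reference, the two checks at issue can be modelled with a small
user-space sketch. It covers only the sync short-cut and the proposed
idle-prev check; the cpus_share_cache and other conditions of the real
wake_affine_idle() are omitted, and the state tables are stubs, not
kernel data structures.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUMASK_BITS 160	/* stand-in for nr_cpumask_bits */

/* Stubbed per-CPU state for the sketch. */
static bool cpu_idle[NR_CPUMASK_BITS];
static int  cpu_nr_running[NR_CPUMASK_BITS];

static bool available_idle_cpu(int cpu)
{
	return cpu_idle[cpu];
}

/*
 * Sketch of wake_affine_idle() with the proposed check added: a
 * non-sync wakeup keeps an idle prev_cpu instead of falling through
 * to wake_affine_weight().
 */
static int wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
	/* sync: the running task on this_cpu is about to sleep */
	if (sync && cpu_nr_running[this_cpu] == 1)
		return this_cpu;

	/* proposed addition from the patch under discussion */
	if (!sync && available_idle_cpu(prev_cpu))
		return prev_cpu;

	return NR_CPUMASK_BITS;	/* no decision here */
}
```

As the discussion concludes, the added check mostly matches what
wake_affine_weight would decide anyway when prev's load is null, which is
why the extra branch is redundant rather than wrong.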

julia


>
> >         return nr_cpumask_bits;
> >  }
> >
> > The sd->imbalance_pct part would have previously been a multiplication by
> > 0, so it doesn't need to be taken into account.
> >
> > julia
> >
> > >
> > > > The problem seems impossible to solve, because there is no way to know by
> > > > looking only at prev and this whether the thread would prefer to stay
> > > > where it was or go to the waker.
> > > >
> > >
> > > Yes, this is definitely true. Looking at prev_cpu and this_cpu is a
> > > crude approximation and the path is heavily limited in terms of how
> > > clever it can be.
> > >
> > > --
> > > Mel Gorman
> > > SUSE Labs
> > >
>

* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 15:19               ` Vincent Guittot
@ 2020-10-21 17:00                 ` Mel Gorman
  2020-10-21 17:39                   ` Julia Lawall
  0 siblings, 1 reply; 68+ messages in thread
From: Mel Gorman @ 2020-10-21 17:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Julia Lawall, Ingo Molnar, kernel-janitors, Peter Zijlstra,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller

On Wed, Oct 21, 2020 at 05:19:53PM +0200, Vincent Guittot wrote:
> On Wed, 21 Oct 2020 at 17:08, Mel Gorman <mgorman@suse.de> wrote:
> >
> > On Wed, Oct 21, 2020 at 03:24:48PM +0200, Julia Lawall wrote:
> > > > I worry it's overkill because prev is always used if it is idle even
> > > > if it is on a node remote to the waker. It cuts off the option of a
> > > > wakee moving to a CPU local to the waker which is not equivalent to the
> > > > original behaviour.
> > >
> > > But it is equal to the original behavior in the idle prev case if you go
> > > back to the runnable load average days...
> > >
> >
> > It is similar but it misses the sync treatment and sd->imbalance_pct part of
> > wake_affine_weight which has unpredictable consequences. The data
> > available is only on the fully utilised case.
> 
> In fact it's the same, because runnable_load_avg was null when a cpu was
> idle, so if prev_cpu was idle, we were selecting prev_cpu
> 

Sync wakeups may only consider this_cpu and the load of the waker, but
in that case this_cpu was probably selected already by the sync check in
wake_affine_idle, which will pass except when the domain is overloaded.
Fair enough, I'll withdraw any concerns. It could have done with a
comment :/

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 17:00                 ` Mel Gorman
@ 2020-10-21 17:39                   ` Julia Lawall
  0 siblings, 0 replies; 68+ messages in thread
From: Julia Lawall @ 2020-10-21 17:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Julia Lawall, Ingo Molnar, kernel-janitors,
	Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
	Valentin Schneider, Gilles Muller



On Wed, 21 Oct 2020, Mel Gorman wrote:

> On Wed, Oct 21, 2020 at 05:19:53PM +0200, Vincent Guittot wrote:
> > On Wed, 21 Oct 2020 at 17:08, Mel Gorman <mgorman@suse.de> wrote:
> > >
> > > On Wed, Oct 21, 2020 at 03:24:48PM +0200, Julia Lawall wrote:
> > > > > I worry it's overkill because prev is always used if it is idle even
> > > > > if it is on a node remote to the waker. It cuts off the option of a
> > > > > wakee moving to a CPU local to the waker which is not equivalent to the
> > > > > original behaviour.
> > > >
> > > > But it is equal to the original behavior in the idle prev case if you go
> > > > back to the runnable load average days...
> > > >
> > >
> > > It is similar but it misses the sync treatment and sd->imbalance_pct part of
> > > wake_affine_weight which has unpredictable consequences. The data
> > > available is only on the fully utilised case.
> >
> > In fact it's the same, because runnable_load_avg was null when a cpu was
> > idle, so if prev_cpu was idle, we were selecting prev_cpu
> >
>
> Sync wakeups may only consider this_cpu and the load of the waker, but
> in that case this_cpu was probably selected already by the sync check in
> wake_affine_idle, which will pass except when the domain is overloaded.
> Fair enough, I'll withdraw any concerns. It could have done with a
> comment :/

Sure, I'll resend the patch and extend the log message with this issue.

Otherwise, I was wondering: are there any particular kinds of applications
where gathering the threads back with the waker is a good idea?  I've been
looking more at applications with N threads on N cores, where it would be
best for the threads to remain where they are.

thanks,
julia

* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 13:10       ` Peter Zijlstra
@ 2020-10-21 18:11         ` Rafael J. Wysocki
  2020-10-22  4:53           ` Viresh Kumar
  2020-10-22  7:11           ` Peter Zijlstra
  0 siblings, 2 replies; 68+ messages in thread
From: Rafael J. Wysocki @ 2020-10-21 18:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
	Valentin Schneider, Gilles Muller, viresh.kumar,
	srinivas.pandruvada

On Wednesday, October 21, 2020 3:10:26 PM CEST Peter Zijlstra wrote:
> On Wed, Oct 21, 2020 at 02:19:50PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > introduced kworkers every 0.004 seconds on all cores.
> > 
> > That still doesn't make any sense. Are you running the legacy on-demand
> > thing or something?
> > 
> > Rafael, Srinivas, Viresh, how come it defaults to that?
> 
> Do we want something like this?
> 
> ---
>  arch/x86/configs/i386_defconfig   | 3 +--
>  arch/x86/configs/x86_64_defconfig | 3 +--
>  drivers/cpufreq/Kconfig           | 7 +++++--
>  3 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig
> index 78210793d357..c343ad459737 100644
> --- a/arch/x86/configs/i386_defconfig
> +++ b/arch/x86/configs/i386_defconfig
> @@ -41,8 +41,7 @@ CONFIG_PM_DEBUG=y
>  CONFIG_PM_TRACE_RTC=y
>  CONFIG_ACPI_DOCK=y
>  CONFIG_ACPI_BGRT=y
> -CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
> -CONFIG_CPU_FREQ_GOV_ONDEMAND=y
> +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
>  CONFIG_X86_ACPI_CPUFREQ=y
>  CONFIG_EFI_VARS=y
>  CONFIG_KPROBES=y
> diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
> index 9936528e1939..23e1ea85c1ec 100644
> --- a/arch/x86/configs/x86_64_defconfig
> +++ b/arch/x86/configs/x86_64_defconfig
> @@ -38,8 +38,7 @@ CONFIG_PM_DEBUG=y
>  CONFIG_PM_TRACE_RTC=y
>  CONFIG_ACPI_DOCK=y
>  CONFIG_ACPI_BGRT=y
> -CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
> -CONFIG_CPU_FREQ_GOV_ONDEMAND=y
> +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
>  CONFIG_X86_ACPI_CPUFREQ=y
>  CONFIG_IA32_EMULATION=y
>  CONFIG_EFI_VARS=y
> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> index 2c7171e0b001..8dfca6e9b836 100644
> --- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig
> @@ -37,8 +37,7 @@ config CPU_FREQ_STAT
>  choice
>  	prompt "Default CPUFreq governor"
>  	default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ
> -	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
> -	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
> +	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if SMP
>  	default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
>  	help
>  	  This option sets which CPUFreq governor shall be loaded at
> @@ -71,6 +70,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE
>  
>  config CPU_FREQ_DEFAULT_GOV_ONDEMAND
>  	bool "ondemand"
> +	depends on !SMP
>  	select CPU_FREQ_GOV_ONDEMAND
>  	select CPU_FREQ_GOV_PERFORMANCE
>  	help
> @@ -83,6 +83,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND
>  
>  config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
>  	bool "conservative"
> +	depends on !SMP
>  	select CPU_FREQ_GOV_CONSERVATIVE
>  	select CPU_FREQ_GOV_PERFORMANCE
>  	help

The changes above should work.

> @@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE
>  
>  config CPU_FREQ_GOV_ONDEMAND
>  	tristate "'ondemand' cpufreq policy governor"
> +	depends on !SMP

But I don't think that we can do this and the one below.

>  	select CPU_FREQ_GOV_COMMON
>  	help
>  	  'ondemand' - This driver adds a dynamic cpufreq policy governor.
> @@ -163,6 +165,7 @@ config CPU_FREQ_GOV_ONDEMAND
>  config CPU_FREQ_GOV_CONSERVATIVE
>  	tristate "'conservative' cpufreq governor"
>  	depends on CPU_FREQ
> +	depends on !SMP
>  	select CPU_FREQ_GOV_COMMON
>  	help
>  	  'conservative' - this driver is rather similar to the 'ondemand'
> 

* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 12:42       ` Julia Lawall
  2020-10-21 12:52         ` Peter Zijlstra
@ 2020-10-21 18:15         ` Rafael J. Wysocki
  2020-10-21 19:47           ` Julia Lawall
  1 sibling, 1 reply; 68+ messages in thread
From: Rafael J. Wysocki @ 2020-10-21 18:15 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Peter Zijlstra, Mel Gorman, Ingo Molnar, kernel-janitors,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
	Valentin Schneider, Gilles Muller, viresh.kumar,
	srinivas.pandruvada

On Wednesday, October 21, 2020 2:42:20 PM CEST Julia Lawall wrote:
> 
> On Wed, 21 Oct 2020, Peter Zijlstra wrote:
> 
> > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > introduced kworkers every 0.004 seconds on all cores.
> >
> > That still doesn't make any sense. Are you running the legacy on-demand
> > thing or something?
> >
> > Rafael, Srinivas, Viresh, how come it defaults to that?
> 
> The relevant commits are 33aa46f252c7 and 39a188b88332, the latter of which
> fixes a small bug.  I have an Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz
> that does not have the HWP feature, even though the cores seemed to be able
> to change their frequencies at the hardware level.

That's in the range of "turbo" P-states (if a P-state above a certain threshold
is requested by the governor, the processor has a license to choose P-states
in the range above this threshold by itself).

Cheers!

* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 12:52         ` Peter Zijlstra
@ 2020-10-21 18:18           ` Rafael J. Wysocki
  0 siblings, 0 replies; 68+ messages in thread
From: Rafael J. Wysocki @ 2020-10-21 18:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
	Valentin Schneider, Gilles Muller, viresh.kumar,
	srinivas.pandruvada

On Wednesday, October 21, 2020 2:52:03 PM CEST Peter Zijlstra wrote:
> On Wed, Oct 21, 2020 at 02:42:20PM +0200, Julia Lawall wrote:
> > 
> > 
> > On Wed, 21 Oct 2020, Peter Zijlstra wrote:
> > 
> > > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > > tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> > > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > > introduced kworkers every 0.004 seconds on all cores.
> > >
> > > That still doesn't make any sense. Are you running the legacy on-demand
> > > thing or something?
> > >
> > > Rafael, Srinivas, Viresh, how come it defaults to that?
> > 
> > The relevant commits are 33aa46f252c7 and 39a188b88332, the latter of which
> > fixes a small bug.  I have an Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz
> > that does not have the HWP feature, even though the cores seemed to be able
> > to change their frequencies at the hardware level.
> 
> That just makes intel_pstate not prefer active mode, with the clear
> intent that it should then go use schedutil, but somehow it looks like
> you landed on ondemand, which is absolutely atrocious.

Probably, the "default governor" setting was inherited from the old config.

It didn't matter when the active mode was used by default, because it only
recognizes the "powersave" and "performance" settings (and "ondemand" causes
it to fall back to "powersave" IIRC).

Cheers!

* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 18:15         ` Rafael J. Wysocki
@ 2020-10-21 19:47           ` Julia Lawall
  2020-10-21 20:25             ` Rafael J. Wysocki
  0 siblings, 1 reply; 68+ messages in thread
From: Julia Lawall @ 2020-10-21 19:47 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Julia Lawall, Peter Zijlstra, Mel Gorman, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller, viresh.kumar,
	srinivas.pandruvada



On Wed, 21 Oct 2020, Rafael J. Wysocki wrote:

> On Wednesday, October 21, 2020 2:42:20 PM CEST Julia Lawall wrote:
> >
> > On Wed, 21 Oct 2020, Peter Zijlstra wrote:
> >
> > > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > > tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> > > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > > introduced kworkers every 0.004 seconds on all cores.
> > >
> > > That still doesn't make any sense. Are you running the legacy on-demand
> > > thing or something?
> > >
> > > Rafael, Srinivas, Viresh, how come it defaults to that?
> >
> > The relevant commits are 33aa46f252c7 and 39a188b88332, the latter of which
> > fixes a small bug.  I have an Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz
> > that does not have the HWP feature, even though the cores seemed to be able
> > to change their frequencies at the hardware level.
>
> That's in the range of "turbo" P-states (if a P-state above a certain threshold
> is requested by the governor, the processor has a license to choose P-states
> in the range above this threshold by itself).

Sorry, but I don't understand this answer at all.

thanks,
julia

* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 19:47           ` Julia Lawall
@ 2020-10-21 20:25             ` Rafael J. Wysocki
  0 siblings, 0 replies; 68+ messages in thread
From: Rafael J. Wysocki @ 2020-10-21 20:25 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Peter Zijlstra, Mel Gorman, Ingo Molnar, kernel-janitors,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
	Valentin Schneider, Gilles Muller, viresh.kumar,
	srinivas.pandruvada

On Wednesday, October 21, 2020 9:47:51 PM CEST Julia Lawall wrote:
> 
> On Wed, 21 Oct 2020, Rafael J. Wysocki wrote:
> 
> > On Wednesday, October 21, 2020 2:42:20 PM CEST Julia Lawall wrote:
> > >
> > > On Wed, 21 Oct 2020, Peter Zijlstra wrote:
> > >
> > > > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > > > tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> > > > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > > > introduced kworkers every 0.004 seconds on all cores.
> > > >
> > > > That still doesn't make any sense. Are you running the legacy on-demand
> > > > thing or something?
> > > >
> > > > Rafael, Srinivas, Viresh, how come it defaults to that?
> > >
> > > The relevant commits are 33aa46f252c7 and 39a188b88332, the latter of which
> > > fixes a small bug.  I have an Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz
> > > that does not have the HWP feature, even though the cores seemed to be able
> > > to change their frequencies at the hardware level.
> >
> > That's in the range of "turbo" P-states (if a P-state above a certain threshold
> > is requested by the governor, the processor has a license to choose P-states
> > in the range above this threshold by itself).
> 
> Sorry, but I don't understand this answer at all.

Well, sorry about that and let me rephrase then.

Contemporary CPUs have two ranges of P-states, the so called "guaranteed
performance" range and the "turbo" range.

In the "guaranteed performance" range the CPU runs in the P-state requested
by the governor, unless a higher P-state has been requested for another CPU in
its frequency domain (usually covering the entire processor package), in which
case that higher P-state will be used (the effective P-state for all CPUs in the
frequency domain is the maximum of all P-states requested for individual CPUs).

However, if the governor requests a P-state from the "turbo" range, the
processor is not required to take that request literally and the PM unit in it may
override the governor's choice and cause the CPU to run in a different P-state
(also from the "turbo" range), even if lower P-states have been requested for
the other CPUs in the processor package.

This is also explained in Documentation/admin-guide/pm/intel_pstate.rst (in the
Turbo P-states Support section), in more detail and hopefully more clearly.
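
The two ranges can be summarised in a few lines of illustrative code. The
threshold value and the function names (effective_pstate, turbo_license)
are made up for the sketch and do not come from intel_pstate; see the
documentation referenced above for the real mechanism.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative top of the "guaranteed performance" range. */
#define GUARANTEED_MAX 20

/*
 * In the guaranteed range, the effective P-state of a frequency domain
 * is the maximum of the P-states requested for its CPUs.
 */
static int effective_pstate(const int *requested, int ncpus)
{
	int max = 0;

	for (int i = 0; i < ncpus; i++)
		if (requested[i] > max)
			max = requested[i];
	return max;
}

/*
 * Above GUARANTEED_MAX the processor has a "license" to pick P-states
 * in the turbo range by itself; the governor's request is only a hint.
 */
static bool turbo_license(int pstate)
{
	return pstate > GUARANTEED_MAX;
}
```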

Cheers!

* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 18:11         ` Rafael J. Wysocki
@ 2020-10-22  4:53           ` Viresh Kumar
  2020-10-22  7:11           ` Peter Zijlstra
  1 sibling, 0 replies; 68+ messages in thread
From: Viresh Kumar @ 2020-10-22  4:53 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada

On 21-10-20, 20:11, Rafael J. Wysocki wrote:
> On Wednesday, October 21, 2020 3:10:26 PM CEST Peter Zijlstra wrote:
> > On Wed, Oct 21, 2020 at 02:19:50PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > > tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> > > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > > introduced kworkers every 0.004 seconds on all cores.
> > > 
> > > That still doesn't make any sense. Are you running the legacy on-demand
> > > thing or something?
> > > 
> > > Rafael, Srinivas, Viresh, how come it defaults to that?
> > 
> > Do we want something like this?
> > 
> > ---
> >  arch/x86/configs/i386_defconfig   | 3 +--
> >  arch/x86/configs/x86_64_defconfig | 3 +--
> >  drivers/cpufreq/Kconfig           | 7 +++++--
> >  3 files changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig
> > index 78210793d357..c343ad459737 100644
> > --- a/arch/x86/configs/i386_defconfig
> > +++ b/arch/x86/configs/i386_defconfig
> > @@ -41,8 +41,7 @@ CONFIG_PM_DEBUG=y
> >  CONFIG_PM_TRACE_RTC=y
> >  CONFIG_ACPI_DOCK=y
> >  CONFIG_ACPI_BGRT=y
> > -CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
> > -CONFIG_CPU_FREQ_GOV_ONDEMAND=y
> > +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
> >  CONFIG_X86_ACPI_CPUFREQ=y
> >  CONFIG_EFI_VARS=y
> >  CONFIG_KPROBES=y
> > diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
> > index 9936528e1939..23e1ea85c1ec 100644
> > --- a/arch/x86/configs/x86_64_defconfig
> > +++ b/arch/x86/configs/x86_64_defconfig
> > @@ -38,8 +38,7 @@ CONFIG_PM_DEBUG=y
> >  CONFIG_PM_TRACE_RTC=y
> >  CONFIG_ACPI_DOCK=y
> >  CONFIG_ACPI_BGRT=y
> > -CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
> > -CONFIG_CPU_FREQ_GOV_ONDEMAND=y
> > +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
> >  CONFIG_X86_ACPI_CPUFREQ=y
> >  CONFIG_IA32_EMULATION=y
> >  CONFIG_EFI_VARS=y
> > diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> > index 2c7171e0b001..8dfca6e9b836 100644
> > --- a/drivers/cpufreq/Kconfig
> > +++ b/drivers/cpufreq/Kconfig
> > @@ -37,8 +37,7 @@ config CPU_FREQ_STAT
> >  choice
> >  	prompt "Default CPUFreq governor"
> >  	default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ
> > -	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
> > -	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
> > +	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if SMP
> >  	default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
> >  	help
> >  	  This option sets which CPUFreq governor shall be loaded at
> > @@ -71,6 +70,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE
> >  
> >  config CPU_FREQ_DEFAULT_GOV_ONDEMAND
> >  	bool "ondemand"
> > +	depends on !SMP
> >  	select CPU_FREQ_GOV_ONDEMAND
> >  	select CPU_FREQ_GOV_PERFORMANCE
> >  	help
> > @@ -83,6 +83,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND
> >  
> >  config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
> >  	bool "conservative"
> > +	depends on !SMP
> >  	select CPU_FREQ_GOV_CONSERVATIVE
> >  	select CPU_FREQ_GOV_PERFORMANCE
> >  	help
> 
> The changes above should work.
> 
> > @@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE
> >  
> >  config CPU_FREQ_GOV_ONDEMAND
> >  	tristate "'ondemand' cpufreq policy governor"
> > +	depends on !SMP
> 
> But I don't think that we can do this and the one below.

I have exactly the same comments.

> >  	select CPU_FREQ_GOV_COMMON
> >  	help
> >  	  'ondemand' - This driver adds a dynamic cpufreq policy governor.
> > @@ -163,6 +165,7 @@ config CPU_FREQ_GOV_ONDEMAND
> >  config CPU_FREQ_GOV_CONSERVATIVE
> >  	tristate "'conservative' cpufreq governor"
> >  	depends on CPU_FREQ
> > +	depends on !SMP
> >  	select CPU_FREQ_GOV_COMMON
> >  	help
> >  	  'conservative' - this driver is rather similar to the 'ondemand'
> > 
> 
> 
> 

-- 
viresh

* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 18:11         ` Rafael J. Wysocki
  2020-10-22  4:53           ` Viresh Kumar
@ 2020-10-22  7:11           ` Peter Zijlstra
  2020-10-22 10:59             ` Viresh Kumar
  2020-10-22 11:21             ` AW: " Walter Harms
  1 sibling, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2020-10-22  7:11 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
	Valentin Schneider, Gilles Muller, viresh.kumar,
	srinivas.pandruvada

On Wed, Oct 21, 2020 at 08:11:59PM +0200, Rafael J. Wysocki wrote:

> > @@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE
> >  
> >  config CPU_FREQ_GOV_ONDEMAND
> >  	tristate "'ondemand' cpufreq policy governor"
> > +	depends on !SMP
> 
> But I don't think that we can do this and the one below.

Well, but we need to do something to force people onto schedutil,
otherwise we'll get more crap like this thread.

Can we take the choice away? Only let Kconfig select which governors are
available and then set the default ourselves? I mean, since the end goal
is to not have selectable governors at all, this seems like a good step
anyway.

---

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 2c7171e0b001..3a9f54b382c0 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -34,77 +34,6 @@ config CPU_FREQ_STAT
 
 	  If in doubt, say N.
 
-choice
-	prompt "Default CPUFreq governor"
-	default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ
-	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
-	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
-	default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
-	help
-	  This option sets which CPUFreq governor shall be loaded at
-	  startup. If in doubt, use the default setting.
-
-config CPU_FREQ_DEFAULT_GOV_PERFORMANCE
-	bool "performance"
-	select CPU_FREQ_GOV_PERFORMANCE
-	help
-	  Use the CPUFreq governor 'performance' as default. This sets
-	  the frequency statically to the highest frequency supported by
-	  the CPU.
-
-config CPU_FREQ_DEFAULT_GOV_POWERSAVE
-	bool "powersave"
-	select CPU_FREQ_GOV_POWERSAVE
-	help
-	  Use the CPUFreq governor 'powersave' as default. This sets
-	  the frequency statically to the lowest frequency supported by
-	  the CPU.
-
-config CPU_FREQ_DEFAULT_GOV_USERSPACE
-	bool "userspace"
-	select CPU_FREQ_GOV_USERSPACE
-	help
-	  Use the CPUFreq governor 'userspace' as default. This allows
-	  you to set the CPU frequency manually or when a userspace 
-	  program shall be able to set the CPU dynamically without having
-	  to enable the userspace governor manually.
-
-config CPU_FREQ_DEFAULT_GOV_ONDEMAND
-	bool "ondemand"
-	select CPU_FREQ_GOV_ONDEMAND
-	select CPU_FREQ_GOV_PERFORMANCE
-	help
-	  Use the CPUFreq governor 'ondemand' as default. This allows
-	  you to get a full dynamic frequency capable system by simply
-	  loading your cpufreq low-level hardware driver.
-	  Be aware that not all cpufreq drivers support the ondemand
-	  governor. If unsure have a look at the help section of the
-	  driver. Fallback governor will be the performance governor.
-
-config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
-	bool "conservative"
-	select CPU_FREQ_GOV_CONSERVATIVE
-	select CPU_FREQ_GOV_PERFORMANCE
-	help
-	  Use the CPUFreq governor 'conservative' as default. This allows
-	  you to get a full dynamic frequency capable system by simply
-	  loading your cpufreq low-level hardware driver.
-	  Be aware that not all cpufreq drivers support the conservative
-	  governor. If unsure have a look at the help section of the
-	  driver. Fallback governor will be the performance governor.
-
-config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
-	bool "schedutil"
-	depends on SMP
-	select CPU_FREQ_GOV_SCHEDUTIL
-	select CPU_FREQ_GOV_PERFORMANCE
-	help
-	  Use the 'schedutil' CPUFreq governor by default. If unsure,
-	  have a look at the help section of that governor. The fallback
-	  governor will be 'performance'.
-
-endchoice
-
 config CPU_FREQ_GOV_PERFORMANCE
 	tristate "'performance' governor"
 	help
@@ -145,6 +74,7 @@ config CPU_FREQ_GOV_USERSPACE
 config CPU_FREQ_GOV_ONDEMAND
 	tristate "'ondemand' cpufreq policy governor"
 	select CPU_FREQ_GOV_COMMON
+	select CPU_FREQ_GOV_PERFORMANCE
 	help
 	  'ondemand' - This driver adds a dynamic cpufreq policy governor.
 	  The governor does a periodic polling and 
@@ -164,6 +94,7 @@ config CPU_FREQ_GOV_CONSERVATIVE
 	tristate "'conservative' cpufreq governor"
 	depends on CPU_FREQ
 	select CPU_FREQ_GOV_COMMON
+	select CPU_FREQ_GOV_PERFORMANCE
 	help
 	  'conservative' - this driver is rather similar to the 'ondemand'
 	  governor both in its source code and its purpose, the difference is
@@ -188,6 +119,7 @@ config CPU_FREQ_GOV_SCHEDUTIL
 	bool "'schedutil' cpufreq policy governor"
 	depends on CPU_FREQ && SMP
 	select CPU_FREQ_GOV_ATTR_SET
+	select CPU_FREQ_GOV_PERFORMANCE
 	select IRQ_WORK
 	help
 	  This governor makes decisions based on the utilization data provided
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 1877f5e2e5b0..6848e3337b65 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -626,6 +626,49 @@ static struct cpufreq_governor *find_governor(const char *str_governor)
 	return NULL;
 }
 
+static struct cpufreq_governor *cpufreq_default_governor(void)
+{
+	struct cpufreq_governor *gov = NULL;
+
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+	gov = find_governor("schedutil");
+	if (gov)
+		return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_ONDEMAND
+	gov = find_governor("ondemand");
+	if (gov)
+		return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_CONSERVATIVE
+	gov = find_governor("conservative");
+	if (gov)
+		return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
+	gov = find_governor("userspace");
+	if (gov)
+		return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_POWERSAVE
+	gov = find_governor("powersave");
+	if (gov)
+		return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_PERFORMANCE
+	gov = find_governor("performance");
+	if (gov)
+		return gov;
+#endif
+
+	return gov;
+}
+
 static struct cpufreq_governor *get_governor(const char *str_governor)
 {
 	struct cpufreq_governor *t;
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index aa39ff31ec9f..25449a1ee954 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -330,12 +330,5 @@ MODULE_DESCRIPTION("'cpufreq_conservative' - A dynamic cpufreq governor for "
 		"optimised for use in a battery environment");
 MODULE_LICENSE("GPL");
 
-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
-	return &CPU_FREQ_GOV_CONSERVATIVE;
-}
-#endif
-
 cpufreq_governor_init(CPU_FREQ_GOV_CONSERVATIVE);
 cpufreq_governor_exit(CPU_FREQ_GOV_CONSERVATIVE);
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index ac361a8b1d3b..2a7fd645e1fb 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -467,12 +467,5 @@ MODULE_DESCRIPTION("'cpufreq_ondemand' - A dynamic cpufreq governor for "
 	"Low Latency Frequency Transition capable processors");
 MODULE_LICENSE("GPL");
 
-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
-	return &CPU_FREQ_GOV_ONDEMAND;
-}
-#endif
-
 cpufreq_governor_init(CPU_FREQ_GOV_ONDEMAND);
 cpufreq_governor_exit(CPU_FREQ_GOV_ONDEMAND);
diff --git a/drivers/cpufreq/cpufreq_performance.c b/drivers/cpufreq/cpufreq_performance.c
index 71c1d9aba772..c6c7473a3a73 100644
--- a/drivers/cpufreq/cpufreq_performance.c
+++ b/drivers/cpufreq/cpufreq_performance.c
@@ -23,12 +23,6 @@ static struct cpufreq_governor cpufreq_gov_performance = {
 	.limits		= cpufreq_gov_performance_limits,
 };
 
-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
-	return &cpufreq_gov_performance;
-}
-#endif
 #ifndef CONFIG_CPU_FREQ_GOV_PERFORMANCE_MODULE
 struct cpufreq_governor *cpufreq_fallback_governor(void)
 {
diff --git a/drivers/cpufreq/cpufreq_powersave.c b/drivers/cpufreq/cpufreq_powersave.c
index 7749522355b5..8b9555f70e9d 100644
--- a/drivers/cpufreq/cpufreq_powersave.c
+++ b/drivers/cpufreq/cpufreq_powersave.c
@@ -27,12 +27,5 @@ MODULE_AUTHOR("Dominik Brodowski <linux@brodo.de>");
 MODULE_DESCRIPTION("CPUfreq policy governor 'powersave'");
 MODULE_LICENSE("GPL");
 
-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
-	return &cpufreq_gov_powersave;
-}
-#endif
-
 cpufreq_governor_init(cpufreq_gov_powersave);
 cpufreq_governor_exit(cpufreq_gov_powersave);
diff --git a/drivers/cpufreq/cpufreq_userspace.c b/drivers/cpufreq/cpufreq_userspace.c
index 50a4d7846580..48ce53038ac0 100644
--- a/drivers/cpufreq/cpufreq_userspace.c
+++ b/drivers/cpufreq/cpufreq_userspace.c
@@ -131,12 +131,5 @@ MODULE_AUTHOR("Dominik Brodowski <linux@brodo.de>, "
 MODULE_DESCRIPTION("CPUfreq policy governor 'userspace'");
 MODULE_LICENSE("GPL");
 
-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
-	return &cpufreq_gov_userspace;
-}
-#endif
-
 cpufreq_governor_init(cpufreq_gov_userspace);
 cpufreq_governor_exit(cpufreq_gov_userspace);
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 5ae7b4e6e8d6..3b77e408377a 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -888,13 +888,6 @@ struct cpufreq_governor schedutil_gov = {
 	.limits			= sugov_limits,
 };
 
-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
-	return &schedutil_gov;
-}
-#endif
-
 cpufreq_governor_init(schedutil_gov);
 
 #ifdef CONFIG_ENERGY_MODEL

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH] sched/fair: check for idle core
  2020-10-22  7:11           ` Peter Zijlstra
@ 2020-10-22 10:59             ` Viresh Kumar
  2020-10-22 11:45               ` Rafael J. Wysocki
  2020-10-22 11:21             ` AW: " Walter Harms
  1 sibling, 1 reply; 68+ messages in thread
From: Viresh Kumar @ 2020-10-22 10:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Julia Lawall, Mel Gorman, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada

On 22-10-20, 09:11, Peter Zijlstra wrote:
> Well, but we need to do something to force people onto schedutil,
> otherwise we'll get more crap like this thread.
> 
> Can we take the choice away? Only let Kconfig select which governors are
> available and then set the default ourselves? I mean, the end goal being
> to not have selectable governors at all, this seems like a good step
> anyway.

Just to clarify and complete the point a bit here, the users can still
pass the default governor from cmdline using
cpufreq.default_governor=, which will take precedence over the one the
below code is playing with. And later once the kernel is up, they can
still choose a different governor from userspace.
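To make the two override paths above concrete, here is a minimal illustration. The `cpufreq.default_governor=` parameter and the sysfs path are the real cpufreq interfaces; the governor names chosen are just examples:

```sh
# At boot, on the kernel command line -- takes precedence over the
# built-in default selected by the code below:
cpufreq.default_governor=performance

# At runtime, per policy, from userspace via sysfs:
echo schedutil > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```

Writing to `scaling_governor` only affects the policy it belongs to; each policy can be switched independently.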

> ---
> 
> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> index 2c7171e0b001..3a9f54b382c0 100644
> --- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig
> @@ -34,77 +34,6 @@ config CPU_FREQ_STAT
>  
>  	  If in doubt, say N.
>  
> -choice
> -	prompt "Default CPUFreq governor"
> -	default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ
> -	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
> -	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
> -	default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
> -	help
> -	  This option sets which CPUFreq governor shall be loaded at
> -	  startup. If in doubt, use the default setting.
> -
> -config CPU_FREQ_DEFAULT_GOV_PERFORMANCE
> -	bool "performance"
> -	select CPU_FREQ_GOV_PERFORMANCE
> -	help
> -	  Use the CPUFreq governor 'performance' as default. This sets
> -	  the frequency statically to the highest frequency supported by
> -	  the CPU.
> -
> -config CPU_FREQ_DEFAULT_GOV_POWERSAVE
> -	bool "powersave"
> -	select CPU_FREQ_GOV_POWERSAVE
> -	help
> -	  Use the CPUFreq governor 'powersave' as default. This sets
> -	  the frequency statically to the lowest frequency supported by
> -	  the CPU.
> -
> -config CPU_FREQ_DEFAULT_GOV_USERSPACE
> -	bool "userspace"
> -	select CPU_FREQ_GOV_USERSPACE
> -	help
> -	  Use the CPUFreq governor 'userspace' as default. This allows
> -	  you to set the CPU frequency manually or when a userspace 
> -	  program shall be able to set the CPU dynamically without having
> -	  to enable the userspace governor manually.
> -
> -config CPU_FREQ_DEFAULT_GOV_ONDEMAND
> -	bool "ondemand"
> -	select CPU_FREQ_GOV_ONDEMAND
> -	select CPU_FREQ_GOV_PERFORMANCE
> -	help
> -	  Use the CPUFreq governor 'ondemand' as default. This allows
> -	  you to get a full dynamic frequency capable system by simply
> -	  loading your cpufreq low-level hardware driver.
> -	  Be aware that not all cpufreq drivers support the ondemand
> -	  governor. If unsure have a look at the help section of the
> -	  driver. Fallback governor will be the performance governor.
> -
> -config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
> -	bool "conservative"
> -	select CPU_FREQ_GOV_CONSERVATIVE
> -	select CPU_FREQ_GOV_PERFORMANCE
> -	help
> -	  Use the CPUFreq governor 'conservative' as default. This allows
> -	  you to get a full dynamic frequency capable system by simply
> -	  loading your cpufreq low-level hardware driver.
> -	  Be aware that not all cpufreq drivers support the conservative
> -	  governor. If unsure have a look at the help section of the
> -	  driver. Fallback governor will be the performance governor.
> -
> -config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
> -	bool "schedutil"
> -	depends on SMP
> -	select CPU_FREQ_GOV_SCHEDUTIL
> -	select CPU_FREQ_GOV_PERFORMANCE
> -	help
> -	  Use the 'schedutil' CPUFreq governor by default. If unsure,
> -	  have a look at the help section of that governor. The fallback
> -	  governor will be 'performance'.
> -
> -endchoice
> -
>  config CPU_FREQ_GOV_PERFORMANCE
>  	tristate "'performance' governor"
>  	help
> @@ -145,6 +74,7 @@ config CPU_FREQ_GOV_USERSPACE
>  config CPU_FREQ_GOV_ONDEMAND
>  	tristate "'ondemand' cpufreq policy governor"
>  	select CPU_FREQ_GOV_COMMON
> +	select CPU_FREQ_GOV_PERFORMANCE
>  	help
>  	  'ondemand' - This driver adds a dynamic cpufreq policy governor.
>  	  The governor does a periodic polling and 
> @@ -164,6 +94,7 @@ config CPU_FREQ_GOV_CONSERVATIVE
>  	tristate "'conservative' cpufreq governor"
>  	depends on CPU_FREQ
>  	select CPU_FREQ_GOV_COMMON
> +	select CPU_FREQ_GOV_PERFORMANCE
>  	help
>  	  'conservative' - this driver is rather similar to the 'ondemand'
>  	  governor both in its source code and its purpose, the difference is
> @@ -188,6 +119,7 @@ config CPU_FREQ_GOV_SCHEDUTIL
>  	bool "'schedutil' cpufreq policy governor"
>  	depends on CPU_FREQ && SMP
>  	select CPU_FREQ_GOV_ATTR_SET
> +	select CPU_FREQ_GOV_PERFORMANCE

And I am not really sure why we always wanted this backup performance
governor to be there unless the said governors are built as module.

>  	select IRQ_WORK
>  	help
>  	  This governor makes decisions based on the utilization data provided
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 1877f5e2e5b0..6848e3337b65 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -626,6 +626,49 @@ static struct cpufreq_governor *find_governor(const char *str_governor)
>  	return NULL;
>  }
>  
> +static struct cpufreq_governor *cpufreq_default_governor(void)
> +{
> +	struct cpufreq_governor *gov = NULL;
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> +	gov = find_governor("schedutil");
> +	if (gov)
> +		return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_ONDEMAND
> +	gov = find_governor("ondemand");
> +	if (gov)
> +		return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_CONSERVATIVE
> +	gov = find_governor("conservative");
> +	if (gov)
> +		return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
> +	gov = find_governor("userspace");
> +	if (gov)
> +		return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_POWERSAVE
> +	gov = find_governor("powersave");
> +	if (gov)
> +		return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_PERFORMANCE
> +	gov = find_governor("performance");
> +	if (gov)
> +		return gov;
> +#endif
> +
> +	return gov;
> +}

I think that would be fine with me. Though we would be required to
update couple of defconfigs here to make sure they keep working the
way they wanted to.

-- 
viresh

^ permalink raw reply	[flat|nested] 68+ messages in thread

* AW: [PATCH] sched/fair: check for idle core
  2020-10-22  7:11           ` Peter Zijlstra
  2020-10-22 10:59             ` Viresh Kumar
@ 2020-10-22 11:21             ` Walter Harms
  1 sibling, 0 replies; 68+ messages in thread
From: Walter Harms @ 2020-10-22 11:21 UTC (permalink / raw)
  To: Peter Zijlstra, Rafael J. Wysocki
  Cc: Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
	Valentin Schneider, Gilles Muller, viresh.kumar,
	srinivas.pandruvada

To avoid this #ifdef jungle ...
I guess it would be safe to search for a governor, even
ones that are not enabled.

And a second thing: can we change the subject, please?

jm2c
 wh



________________________________________
From: Peter Zijlstra [peterz@infradead.org]
Sent: Thursday, 22 October 2020 09:11
To: Rafael J. Wysocki
Cc: Julia Lawall; Mel Gorman; Ingo Molnar; kernel-janitors@vger.kernel.org; Juri Lelli; Vincent Guittot; Dietmar Eggemann; Steven Rostedt; Ben Segall; Daniel Bristot de Oliveira; linux-kernel@vger.kernel.org; Valentin Schneider; Gilles Muller; viresh.kumar@linaro.org; srinivas.pandruvada@linux.intel.com
Subject: Re: [PATCH] sched/fair: check for idle core

On Wed, Oct 21, 2020 at 08:11:59PM +0200, Rafael J. Wysocki wrote:

> > @@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE
> >
> >  config CPU_FREQ_GOV_ONDEMAND
> >     tristate "'ondemand' cpufreq policy governor"
> > +   depends on !SMP
>
> But I don't think that we can do this and the one below.

Well, but we need to do something to force people onto schedutil,
otherwise we'll get more crap like this thread.

Can we take the choice away? Only let Kconfig select which governors are
available and then set the default ourselves? I mean, the end goal being
to not have selectable governors at all, this seems like a good step
anyway.

---

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 2c7171e0b001..3a9f54b382c0 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -34,77 +34,6 @@ config CPU_FREQ_STAT

          If in doubt, say N.

-choice
-       prompt "Default CPUFreq governor"
-       default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ
-       default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
-       default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
-       default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
-       help
-         This option sets which CPUFreq governor shall be loaded at
-         startup. If in doubt, use the default setting.
-
-config CPU_FREQ_DEFAULT_GOV_PERFORMANCE
-       bool "performance"
-       select CPU_FREQ_GOV_PERFORMANCE
-       help
-         Use the CPUFreq governor 'performance' as default. This sets
-         the frequency statically to the highest frequency supported by
-         the CPU.
-
-config CPU_FREQ_DEFAULT_GOV_POWERSAVE
-       bool "powersave"
-       select CPU_FREQ_GOV_POWERSAVE
-       help
-         Use the CPUFreq governor 'powersave' as default. This sets
-         the frequency statically to the lowest frequency supported by
-         the CPU.
-
-config CPU_FREQ_DEFAULT_GOV_USERSPACE
-       bool "userspace"
-       select CPU_FREQ_GOV_USERSPACE
-       help
-         Use the CPUFreq governor 'userspace' as default. This allows
-         you to set the CPU frequency manually or when a userspace
-         program shall be able to set the CPU dynamically without having
-         to enable the userspace governor manually.
-
-config CPU_FREQ_DEFAULT_GOV_ONDEMAND
-       bool "ondemand"
-       select CPU_FREQ_GOV_ONDEMAND
-       select CPU_FREQ_GOV_PERFORMANCE
-       help
-         Use the CPUFreq governor 'ondemand' as default. This allows
-         you to get a full dynamic frequency capable system by simply
-         loading your cpufreq low-level hardware driver.
-         Be aware that not all cpufreq drivers support the ondemand
-         governor. If unsure have a look at the help section of the
-         driver. Fallback governor will be the performance governor.
-
-config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
-       bool "conservative"
-       select CPU_FREQ_GOV_CONSERVATIVE
-       select CPU_FREQ_GOV_PERFORMANCE
-       help
-         Use the CPUFreq governor 'conservative' as default. This allows
-         you to get a full dynamic frequency capable system by simply
-         loading your cpufreq low-level hardware driver.
-         Be aware that not all cpufreq drivers support the conservative
-         governor. If unsure have a look at the help section of the
-         driver. Fallback governor will be the performance governor.
-
-config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
-       bool "schedutil"
-       depends on SMP
-       select CPU_FREQ_GOV_SCHEDUTIL
-       select CPU_FREQ_GOV_PERFORMANCE
-       help
-         Use the 'schedutil' CPUFreq governor by default. If unsure,
-         have a look at the help section of that governor. The fallback
-         governor will be 'performance'.
-
-endchoice
-
 config CPU_FREQ_GOV_PERFORMANCE
        tristate "'performance' governor"
        help
@@ -145,6 +74,7 @@ config CPU_FREQ_GOV_USERSPACE
 config CPU_FREQ_GOV_ONDEMAND
        tristate "'ondemand' cpufreq policy governor"
        select CPU_FREQ_GOV_COMMON
+       select CPU_FREQ_GOV_PERFORMANCE
        help
          'ondemand' - This driver adds a dynamic cpufreq policy governor.
          The governor does a periodic polling and
@@ -164,6 +94,7 @@ config CPU_FREQ_GOV_CONSERVATIVE
        tristate "'conservative' cpufreq governor"
        depends on CPU_FREQ
        select CPU_FREQ_GOV_COMMON
+       select CPU_FREQ_GOV_PERFORMANCE
        help
          'conservative' - this driver is rather similar to the 'ondemand'
          governor both in its source code and its purpose, the difference is
@@ -188,6 +119,7 @@ config CPU_FREQ_GOV_SCHEDUTIL
        bool "'schedutil' cpufreq policy governor"
        depends on CPU_FREQ && SMP
        select CPU_FREQ_GOV_ATTR_SET
+       select CPU_FREQ_GOV_PERFORMANCE
        select IRQ_WORK
        help
          This governor makes decisions based on the utilization data provided
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 1877f5e2e5b0..6848e3337b65 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -626,6 +626,49 @@ static struct cpufreq_governor *find_governor(const char *str_governor)
        return NULL;
 }

+static struct cpufreq_governor *cpufreq_default_governor(void)
+{
+       struct cpufreq_governor *gov = NULL;
+
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+       gov = find_governor("schedutil");
+       if (gov)
+               return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_ONDEMAND
+       gov = find_governor("ondemand");
+       if (gov)
+               return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_CONSERVATIVE
+       gov = find_governor("conservative");
+       if (gov)
+               return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
+       gov = find_governor("userspace");
+       if (gov)
+               return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_POWERSAVE
+       gov = find_governor("powersave");
+       if (gov)
+               return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_PERFORMANCE
+       gov = find_governor("performance");
+       if (gov)
+               return gov;
+#endif
+
+       return gov;
+}
+
 static struct cpufreq_governor *get_governor(const char *str_governor)
 {
        struct cpufreq_governor *t;
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index aa39ff31ec9f..25449a1ee954 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -330,12 +330,5 @@ MODULE_DESCRIPTION("'cpufreq_conservative' - A dynamic cpufreq governor for "
                "optimised for use in a battery environment");
 MODULE_LICENSE("GPL");

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
-       return &CPU_FREQ_GOV_CONSERVATIVE;
-}
-#endif
-
 cpufreq_governor_init(CPU_FREQ_GOV_CONSERVATIVE);
 cpufreq_governor_exit(CPU_FREQ_GOV_CONSERVATIVE);
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index ac361a8b1d3b..2a7fd645e1fb 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -467,12 +467,5 @@ MODULE_DESCRIPTION("'cpufreq_ondemand' - A dynamic cpufreq governor for "
        "Low Latency Frequency Transition capable processors");
 MODULE_LICENSE("GPL");

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
-       return &CPU_FREQ_GOV_ONDEMAND;
-}
-#endif
-
 cpufreq_governor_init(CPU_FREQ_GOV_ONDEMAND);
 cpufreq_governor_exit(CPU_FREQ_GOV_ONDEMAND);
diff --git a/drivers/cpufreq/cpufreq_performance.c b/drivers/cpufreq/cpufreq_performance.c
index 71c1d9aba772..c6c7473a3a73 100644
--- a/drivers/cpufreq/cpufreq_performance.c
+++ b/drivers/cpufreq/cpufreq_performance.c
@@ -23,12 +23,6 @@ static struct cpufreq_governor cpufreq_gov_performance = {
        .limits         = cpufreq_gov_performance_limits,
 };

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
-       return &cpufreq_gov_performance;
-}
-#endif
 #ifndef CONFIG_CPU_FREQ_GOV_PERFORMANCE_MODULE
 struct cpufreq_governor *cpufreq_fallback_governor(void)
 {
diff --git a/drivers/cpufreq/cpufreq_powersave.c b/drivers/cpufreq/cpufreq_powersave.c
index 7749522355b5..8b9555f70e9d 100644
--- a/drivers/cpufreq/cpufreq_powersave.c
+++ b/drivers/cpufreq/cpufreq_powersave.c
@@ -27,12 +27,5 @@ MODULE_AUTHOR("Dominik Brodowski <linux@brodo.de>");
 MODULE_DESCRIPTION("CPUfreq policy governor 'powersave'");
 MODULE_LICENSE("GPL");

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
-       return &cpufreq_gov_powersave;
-}
-#endif
-
 cpufreq_governor_init(cpufreq_gov_powersave);
 cpufreq_governor_exit(cpufreq_gov_powersave);
diff --git a/drivers/cpufreq/cpufreq_userspace.c b/drivers/cpufreq/cpufreq_userspace.c
index 50a4d7846580..48ce53038ac0 100644
--- a/drivers/cpufreq/cpufreq_userspace.c
+++ b/drivers/cpufreq/cpufreq_userspace.c
@@ -131,12 +131,5 @@ MODULE_AUTHOR("Dominik Brodowski <linux@brodo.de>, "
 MODULE_DESCRIPTION("CPUfreq policy governor 'userspace'");
 MODULE_LICENSE("GPL");

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
-       return &cpufreq_gov_userspace;
-}
-#endif
-
 cpufreq_governor_init(cpufreq_gov_userspace);
 cpufreq_governor_exit(cpufreq_gov_userspace);
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 5ae7b4e6e8d6..3b77e408377a 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -888,13 +888,6 @@ struct cpufreq_governor schedutil_gov = {
        .limits                 = sugov_limits,
 };

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
-       return &schedutil_gov;
-}
-#endif
-
 cpufreq_governor_init(schedutil_gov);

 #ifdef CONFIG_ENERGY_MODEL

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH] sched/fair: check for idle core
  2020-10-22 10:59             ` Viresh Kumar
@ 2020-10-22 11:45               ` Rafael J. Wysocki
  2020-10-22 12:02                 ` default cpufreq gov, was: " Peter Zijlstra
  2020-10-23  6:24                 ` Viresh Kumar
  0 siblings, 2 replies; 68+ messages in thread
From: Rafael J. Wysocki @ 2020-10-22 11:45 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada

On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> On 22-10-20, 09:11, Peter Zijlstra wrote:
> > Well, but we need to do something to force people onto schedutil,
> > otherwise we'll get more crap like this thread.
> > 
> > Can we take the choice away? Only let Kconfig select which governors are
> > available and then set the default ourselves? I mean, the end goal being
> > to not have selectable governors at all, this seems like a good step
> > anyway.
> 
> Just to clarify and complete the point a bit here, the users can still
> pass the default governor from cmdline using
> cpufreq.default_governor=, which will take precedence over the one the
> below code is playing with. And later once the kernel is up, they can
> still choose a different governor from userspace.

Right.

Also some people simply set "performance" as the default governor and then
don't touch cpufreq otherwise (the idea is to get everything to the max
freq right away and stay in that mode forever).  This still needs to be
possible IMO.

> > ---
> > 
> > diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> > index 2c7171e0b001..3a9f54b382c0 100644
> > --- a/drivers/cpufreq/Kconfig
> > +++ b/drivers/cpufreq/Kconfig
> > @@ -34,77 +34,6 @@ config CPU_FREQ_STAT
> >  
> >  	  If in doubt, say N.
> >  
> > -choice
> > -	prompt "Default CPUFreq governor"
> > -	default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ
> > -	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
> > -	default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
> > -	default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
> > -	help
> > -	  This option sets which CPUFreq governor shall be loaded at
> > -	  startup. If in doubt, use the default setting.
> > -
> > -config CPU_FREQ_DEFAULT_GOV_PERFORMANCE
> > -	bool "performance"
> > -	select CPU_FREQ_GOV_PERFORMANCE
> > -	help
> > -	  Use the CPUFreq governor 'performance' as default. This sets
> > -	  the frequency statically to the highest frequency supported by
> > -	  the CPU.
> > -
> > -config CPU_FREQ_DEFAULT_GOV_POWERSAVE
> > -	bool "powersave"
> > -	select CPU_FREQ_GOV_POWERSAVE
> > -	help
> > -	  Use the CPUFreq governor 'powersave' as default. This sets
> > -	  the frequency statically to the lowest frequency supported by
> > -	  the CPU.
> > -
> > -config CPU_FREQ_DEFAULT_GOV_USERSPACE
> > -	bool "userspace"
> > -	select CPU_FREQ_GOV_USERSPACE
> > -	help
> > -	  Use the CPUFreq governor 'userspace' as default. This allows
> > -	  you to set the CPU frequency manually or when a userspace 
> > -	  program shall be able to set the CPU dynamically without having
> > -	  to enable the userspace governor manually.
> > -
> > -config CPU_FREQ_DEFAULT_GOV_ONDEMAND
> > -	bool "ondemand"
> > -	select CPU_FREQ_GOV_ONDEMAND
> > -	select CPU_FREQ_GOV_PERFORMANCE
> > -	help
> > -	  Use the CPUFreq governor 'ondemand' as default. This allows
> > -	  you to get a full dynamic frequency capable system by simply
> > -	  loading your cpufreq low-level hardware driver.
> > -	  Be aware that not all cpufreq drivers support the ondemand
> > -	  governor. If unsure have a look at the help section of the
> > -	  driver. Fallback governor will be the performance governor.
> > -
> > -config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
> > -	bool "conservative"
> > -	select CPU_FREQ_GOV_CONSERVATIVE
> > -	select CPU_FREQ_GOV_PERFORMANCE
> > -	help
> > -	  Use the CPUFreq governor 'conservative' as default. This allows
> > -	  you to get a full dynamic frequency capable system by simply
> > -	  loading your cpufreq low-level hardware driver.
> > -	  Be aware that not all cpufreq drivers support the conservative
> > -	  governor. If unsure have a look at the help section of the
> > -	  driver. Fallback governor will be the performance governor.
> > -
> > -config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
> > -	bool "schedutil"
> > -	depends on SMP
> > -	select CPU_FREQ_GOV_SCHEDUTIL
> > -	select CPU_FREQ_GOV_PERFORMANCE
> > -	help
> > -	  Use the 'schedutil' CPUFreq governor by default. If unsure,
> > -	  have a look at the help section of that governor. The fallback
> > -	  governor will be 'performance'.
> > -
> > -endchoice
> > -
> >  config CPU_FREQ_GOV_PERFORMANCE
> >  	tristate "'performance' governor"
> >  	help
> > @@ -145,6 +74,7 @@ config CPU_FREQ_GOV_USERSPACE
> >  config CPU_FREQ_GOV_ONDEMAND
> >  	tristate "'ondemand' cpufreq policy governor"
> >  	select CPU_FREQ_GOV_COMMON
> > +	select CPU_FREQ_GOV_PERFORMANCE
> >  	help
> >  	  'ondemand' - This driver adds a dynamic cpufreq policy governor.
> >  	  The governor does a periodic polling and 
> > @@ -164,6 +94,7 @@ config CPU_FREQ_GOV_CONSERVATIVE
> >  	tristate "'conservative' cpufreq governor"
> >  	depends on CPU_FREQ
> >  	select CPU_FREQ_GOV_COMMON
> > +	select CPU_FREQ_GOV_PERFORMANCE
> >  	help
> >  	  'conservative' - this driver is rather similar to the 'ondemand'
> >  	  governor both in its source code and its purpose, the difference is
> > @@ -188,6 +119,7 @@ config CPU_FREQ_GOV_SCHEDUTIL
> >  	bool "'schedutil' cpufreq policy governor"
> >  	depends on CPU_FREQ && SMP
> >  	select CPU_FREQ_GOV_ATTR_SET
> > +	select CPU_FREQ_GOV_PERFORMANCE
> 
> And I am not really sure why we always wanted this backup performance
> governor to be there unless the said governors are built as module.

Apparently, some old drivers had problems with switching frequencies fast enough
for ondemand to be used with them and the fallback was for those cases.  AFAICS.

> >  	select IRQ_WORK
> >  	help
> >  	  This governor makes decisions based on the utilization data provided
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 1877f5e2e5b0..6848e3337b65 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -626,6 +626,49 @@ static struct cpufreq_governor *find_governor(const char *str_governor)
> >  	return NULL;
> >  }
> >  
> > +static struct cpufreq_governor *cpufreq_default_governor(void)
> > +{
> > +	struct cpufreq_governor *gov = NULL;
> > +
> > +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > +	gov = find_governor("schedutil");
> > +	if (gov)
> > +		return gov;
> > +#endif
> > +
> > +#ifdef CONFIG_CPU_FREQ_GOV_ONDEMAND
> > +	gov = find_governor("ondemand");
> > +	if (gov)
> > +		return gov;
> > +#endif
> > +
> > +#ifdef CONFIG_CPU_FREQ_GOV_CONSERVATIVE
> > +	gov = find_governor("conservative");
> > +	if (gov)
> > +		return gov;
> > +#endif
> > +
> > +#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
> > +	gov = find_governor("userspace");
> > +	if (gov)
> > +		return gov;
> > +#endif
> > +
> > +#ifdef CONFIG_CPU_FREQ_GOV_POWERSAVE
> > +	gov = find_governor("powersave");
> > +	if (gov)
> > +		return gov;
> > +#endif
> > +
> > +#ifdef CONFIG_CPU_FREQ_GOV_PERFORMANCE
> > +	gov = find_governor("performance");
> > +	if (gov)
> > +		return gov;
> > +#endif
> > +
> > +	return gov;
> > +}
> 
> I think that would be fine with me. Though we would be required to
> update couple of defconfigs here to make sure they keep working the
> way they wanted to.

Generally agreed, but see the point about the "performance" governor above.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 11:45               ` Rafael J. Wysocki
@ 2020-10-22 12:02                 ` Peter Zijlstra
  2020-10-22 12:19                   ` Rafael J. Wysocki
                                     ` (2 more replies)
  2020-10-23  6:24                 ` Viresh Kumar
  1 sibling, 3 replies; 68+ messages in thread
From: Peter Zijlstra @ 2020-10-22 12:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Viresh Kumar, Julia Lawall, Mel Gorman, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada

On Thu, Oct 22, 2020 at 01:45:25PM +0200, Rafael J. Wysocki wrote:
> On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > On 22-10-20, 09:11, Peter Zijlstra wrote:
> > > Well, but we need to do something to force people onto schedutil,
> > > otherwise we'll get more crap like this thread.
> > > 
> > > Can we take the choice away? Only let Kconfig select which governors are
> > > available and then set the default ourselves? I mean, the end goal being
> > > to not have selectable governors at all, this seems like a good step
> > > anyway.
> > 
> > Just to clarify and complete the point a bit here, the users can still
> > pass the default governor from cmdline using
> > cpufreq.default_governor=, which will take precedence over the one the
> > below code is playing with. And later once the kernel is up, they can
> > still choose a different governor from userspace.
> 
> Right.
> 
> Also some people simply set "performance" as the default governor and then
> don't touch cpufreq otherwise (the idea is to get everything to the max
> freq right away and stay in that mode forever).  This still needs to be
> possible IMO.

Performance/powersave make sense to keep.

However I do want to retire ondemand, conservative and also very much
intel_pstate/active mode. I also have very little sympathy for
userspace.

We should start by making it hard to use them and eventually just delete
them outright.
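
[Editorial note: a rough sketch of what "taking the choice away" could look
 like in Kconfig. This is hypothetical illustration, not an actual patch:
 the legacy governors are hidden behind EXPERT so they must be opted into
 deliberately, and schedutil becomes the only default offered.]

```kconfig
# Hypothetical fragment: legacy governors are opt-in via EXPERT only,
# and the default-governor choice disappears in favour of schedutil.
config CPU_FREQ_GOV_ONDEMAND
	bool "'ondemand' cpufreq policy governor (legacy)"
	depends on EXPERT
	help
	  Legacy dynamic governor kept for transition purposes only;
	  scheduled for removal in favour of schedutil.

config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
	def_bool y
	select CPU_FREQ_GOV_SCHEDUTIL
```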

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 12:02                 ` default cpufreq gov, was: " Peter Zijlstra
@ 2020-10-22 12:19                   ` Rafael J. Wysocki
  2020-10-22 12:29                     ` Peter Zijlstra
  2020-10-22 16:23                   ` [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate Rafael J. Wysocki
  2020-10-27 11:11                   ` default cpufreq gov, was: [PATCH] sched/fair: check for idle core Qais Yousef
  2 siblings, 1 reply; 68+ messages in thread
From: Rafael J. Wysocki @ 2020-10-22 12:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Viresh Kumar, Julia Lawall, Mel Gorman, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada, Linux PM, Len Brown

[CC linux-pm and Len]

On Thursday, October 22, 2020 2:02:13 PM CEST Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 01:45:25PM +0200, Rafael J. Wysocki wrote:
> > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > > On 22-10-20, 09:11, Peter Zijlstra wrote:
> > > > Well, but we need to do something to force people onto schedutil,
> > > > otherwise we'll get more crap like this thread.
> > > > 
> > > > Can we take the choice away? Only let Kconfig select which governors are
> > > > available and then set the default ourselves? I mean, the end goal being
> > > > to not have selectable governors at all, this seems like a good step
> > > > anyway.
> > > 
> > > Just to clarify and complete the point a bit here, the users can still
> > > pass the default governor from cmdline using
> > > cpufreq.default_governor=, which will take precedence over the one the
> > > below code is playing with. And later once the kernel is up, they can
> > > still choose a different governor from userspace.
> > 
> > Right.
> > 
> > Also some people simply set "performance" as the default governor and then
> > don't touch cpufreq otherwise (the idea is to get everything to the max
> > freq right away and stay in that mode forever).  This still needs to be
> > possible IMO.
> 
> Performance/powersave make sense to keep.
> 
> However I do want to retire ondemand, conservative and also very much
> intel_pstate/active mode.

I agree in general, but IMO it would not be prudent to do that without making
schedutil provide the same level of performance in all of the relevant use
cases.

> I also have very little sympathy for userspace.

That I completely agree with.

> We should start by making it hard to use them and eventually just delete
> them outright.

Right, but see above: IMO step 0 should be to ensure that schedutil is a viable
replacement for all users.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 12:19                   ` Rafael J. Wysocki
@ 2020-10-22 12:29                     ` Peter Zijlstra
  2020-10-22 14:52                       ` Mel Gorman
  2020-10-22 15:45                       ` A L
  0 siblings, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2020-10-22 12:29 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Viresh Kumar, Julia Lawall, Mel Gorman, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada, Linux PM, Len Brown

On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
> > However I do want to retire ondemand, conservative and also very much
> > intel_pstate/active mode.
> 
> I agree in general, but IMO it would not be prudent to do that without making
> schedutil provide the same level of performance in all of the relevant use
> cases.

Agreed; I thought we were there already.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 12:29                     ` Peter Zijlstra
@ 2020-10-22 14:52                       ` Mel Gorman
  2020-10-22 14:58                         ` Colin Ian King
  2020-10-22 15:25                         ` Peter Zijlstra
  2020-10-22 15:45                       ` A L
  1 sibling, 2 replies; 68+ messages in thread
From: Mel Gorman @ 2020-10-22 14:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar,
	Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown

On Thu, Oct 22, 2020 at 02:29:49PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
> > > However I do want to retire ondemand, conservative and also very much
> > > intel_pstate/active mode.
> > 
> > I agree in general, but IMO it would not be prudent to do that without making
> > schedutil provide the same level of performance in all of the relevant use
> > cases.
> 
> Agreed; I thought we were there already.

AFAIK, not quite (added Giovanni as he has been paying more attention).
Schedutil has improved since it was merged but not to the extent where
it is a drop-in replacement. The standard it needs to meet is that
it is at least equivalent to powersave (in intel_pstate language)
or ondemand (acpi_cpufreq) and within a reasonable percentage of the
performance governor. Defaulting to performance is a) giving up and b)
the performance governor is not a universal win. There are some questions
currently on whether schedutil is good enough when HWP is not available.
There was some evidence (I don't have the data, Giovanni was looking into
it) that HWP was a requirement to make schedutil work well. That is a
hazard in itself because someone could test on the latest gen Intel CPU
and conclude everything is fine and miss that Intel-specific technology
is needed to make it work well while throwing everyone else under a bus.
Giovanni knows a lot more than I do about this, I could be wrong or
forgetting things.

For distros, switching to schedutil by default would be nice because
frequency selection state would follow the task instead of being per-cpu
and we could stop worrying about different HWP implementations but it's
not at the point where the switch is advisable. I would expect hard data
before switching the default and still would strongly advise having a
period of time where we can fall back when someone inevitably finds a
new corner case or exception.

For reference, SLUB had the same problem for years. It was switched
on by default in the kernel config but it was a long time before
SLUB was generally equivalent to SLAB in terms of performance. Block
multiqueue also had vaguely similar issues before the default changes
and a period of time before it was removed (example whinging mail
https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5spfdb@techsingularity.net/)
It's schedutil's turn :P

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 14:52                       ` Mel Gorman
@ 2020-10-22 14:58                         ` Colin Ian King
  2020-10-22 15:12                           ` Phil Auld
  2020-10-22 15:25                         ` Peter Zijlstra
  1 sibling, 1 reply; 68+ messages in thread
From: Colin Ian King @ 2020-10-22 14:58 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar,
	Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown

On 22/10/2020 15:52, Mel Gorman wrote:
> On Thu, Oct 22, 2020 at 02:29:49PM +0200, Peter Zijlstra wrote:
>> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
>>>> However I do want to retire ondemand, conservative and also very much
>>>> intel_pstate/active mode.
>>>
>>> I agree in general, but IMO it would not be prudent to do that without making
>>> schedutil provide the same level of performance in all of the relevant use
>>> cases.
>>
>> Agreed; I thought we were there already.
> 
> AFAIK, not quite (added Giovanni as he has been paying more attention).
> Schedutil has improved since it was merged but not to the extent where
> it is a drop-in replacement. The standard it needs to meet is that
> it is at least equivalent to powersave (in intel_pstate language)
> or ondemand (acpi_cpufreq) and within a reasonable percentage of the
> performance governor. Defaulting to performance is a) giving up and b)
> the performance governor is not a universal win. There are some questions
> currently on whether schedutil is good enough when HWP is not available.
> There was some evidence (I don't have the data, Giovanni was looking into
> it) that HWP was a requirement to make schedutil work well. That is a
> hazard in itself because someone could test on the latest gen Intel CPU
> and conclude everything is fine and miss that Intel-specific technology
> is needed to make it work well while throwing everyone else under a bus.
> Giovanni knows a lot more than I do about this, I could be wrong or
> forgetting things.
> 
> For distros, switching to schedutil by default would be nice because
> frequency selection state would follow the task instead of being per-cpu
> and we could stop worrying about different HWP implementations but it's
> not at the point where the switch is advisable. I would expect hard data
> before switching the default and still would strongly advise having a
> period of time where we can fall back when someone inevitably finds a
> new corner case or exception.

...and it would be really useful for distros to know when the hard data
is available so that they can make an informed decision when to move to
schedutil.

> 
> For reference, SLUB had the same problem for years. It was switched
> on by default in the kernel config but it was a long time before
> SLUB was generally equivalent to SLAB in terms of performance. Block
> multiqueue also had vaguely similar issues before the default changes
> > and a period of time before it was removed (example whinging mail
> https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5spfdb@techsingularity.net/)
> It's schedutil's turn :P
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 14:58                         ` Colin Ian King
@ 2020-10-22 15:12                           ` Phil Auld
  2020-10-22 16:35                             ` Mel Gorman
  0 siblings, 1 reply; 68+ messages in thread
From: Phil Auld @ 2020-10-22 15:12 UTC (permalink / raw)
  To: Colin Ian King
  Cc: Mel Gorman, Peter Zijlstra, Rafael J. Wysocki,
	Giovanni Gherdovich, Viresh Kumar, Julia Lawall, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada, Linux PM, Len Brown

On Thu, Oct 22, 2020 at 03:58:13PM +0100 Colin Ian King wrote:
> On 22/10/2020 15:52, Mel Gorman wrote:
> > On Thu, Oct 22, 2020 at 02:29:49PM +0200, Peter Zijlstra wrote:
> >> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
> >>>> However I do want to retire ondemand, conservative and also very much
> >>>> intel_pstate/active mode.
> >>>
> >>> I agree in general, but IMO it would not be prudent to do that without making
> >>> schedutil provide the same level of performance in all of the relevant use
> >>> cases.
> >>
> >> Agreed; I thought we were there already.
> > 
> > AFAIK, not quite (added Giovanni as he has been paying more attention).
> > Schedutil has improved since it was merged but not to the extent where
> > it is a drop-in replacement. The standard it needs to meet is that
> > it is at least equivalent to powersave (in intel_pstate language)
> > or ondemand (acpi_cpufreq) and within a reasonable percentage of the
> > performance governor. Defaulting to performance is a) giving up and b)
> > the performance governor is not a universal win. There are some questions
> > currently on whether schedutil is good enough when HWP is not available.
> > There was some evidence (I don't have the data, Giovanni was looking into
> > it) that HWP was a requirement to make schedutil work well. That is a
> > hazard in itself because someone could test on the latest gen Intel CPU
> > and conclude everything is fine and miss that Intel-specific technology
> > is needed to make it work well while throwing everyone else under a bus.
> > Giovanni knows a lot more than I do about this, I could be wrong or
> > forgetting things.
> > 
> > For distros, switching to schedutil by default would be nice because
> > frequency selection state would follow the task instead of being per-cpu
> > and we could stop worrying about different HWP implementations but it's
> > not at the point where the switch is advisable. I would expect hard data
> > before switching the default and still would strongly advise having a
> > period of time where we can fall back when someone inevitably finds a
> > new corner case or exception.
> 
> ...and it would be really useful for distros to know when the hard data
> is available so that they can make an informed decision when to move to
> schedutil.
>

I think distros are on the hook to generate that hard data themselves
with which to make such a decision.  I don't expect it to be done by
someone else. 

> > 
> > For reference, SLUB had the same problem for years. It was switched
> > on by default in the kernel config but it was a long time before
> > SLUB was generally equivalent to SLAB in terms of performance. Block
> > multiqueue also had vaguely similar issues before the default changes
> > > and a period of time before it was removed (example whinging mail
> > https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5spfdb@techsingularity.net/)
> > It's schedutil's turn :P
> > 
> 

Agreed. I'd like the option to switch back if we make the default change.
It's on the table and I'd like to be able to go that way. 

Cheers,
Phil

-- 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 14:52                       ` Mel Gorman
  2020-10-22 14:58                         ` Colin Ian King
@ 2020-10-22 15:25                         ` Peter Zijlstra
  2020-10-22 15:55                           ` Rafael J. Wysocki
                                             ` (2 more replies)
  1 sibling, 3 replies; 68+ messages in thread
From: Peter Zijlstra @ 2020-10-22 15:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar,
	Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown

On Thu, Oct 22, 2020 at 03:52:50PM +0100, Mel Gorman wrote:

> There are some questions
> currently on whether schedutil is good enough when HWP is not available.

Srinivas and Rafael will know better, but Intel does run a lot of tests
and IIRC it was found that schedutil was on-par for !HWP. That was the
basis for commit:

  33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")

But now it turns out that commit results in running intel_pstate-passive
on ondemand, which is quite horrible.

> There was some evidence (I don't have the data, Giovanni was looking into
> it) that HWP was a requirement to make schedutil work well.

That seems to be the question; Rafael just said the opposite.

> For distros, switching to schedutil by default would be nice because
> frequency selection state would follow the task instead of being per-cpu
> and we could stop worrying about different HWP implementations but it's

s/HWP/cpufreq-governors/ ? But yes.

> not at the point where the switch is advisable. I would expect hard data
> before switching the default and still would strongly advise having a
> period of time where we can fall back when someone inevitably finds a
> new corner case or exception.

Which is why I advocated to make it 'difficult' to use the old ones and
only later remove them.

> For reference, SLUB had the same problem for years. It was switched
> on by default in the kernel config but it was a long time before
> SLUB was generally equivalent to SLAB in terms of performance.

I remember :-)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 12:29                     ` Peter Zijlstra
  2020-10-22 14:52                       ` Mel Gorman
@ 2020-10-22 15:45                       ` A L
  2020-10-22 15:55                         ` Vincent Guittot
  1 sibling, 1 reply; 68+ messages in thread
From: A L @ 2020-10-22 15:45 UTC (permalink / raw)
  To: Peter Zijlstra, Rafael J. Wysocki
  Cc: Viresh Kumar, Julia Lawall, Mel Gorman, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada, Linux PM, Len Brown



---- From: Peter Zijlstra <peterz@infradead.org> -- Sent: 2020-10-22 - 14:29 ----

> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
>> > However I do want to retire ondemand, conservative and also very much
>> > intel_pstate/active mode.
>> 
>> I agree in general, but IMO it would not be prudent to do that without making
>> schedutil provide the same level of performance in all of the relevant use
>> cases.
> 
> Agreed; I thought we were there already.

Hi, 


Currently schedutil does not populate all stats like ondemand does, which can be a problem for some monitoring software. 

On my AMD 3000G CPU with kernel-5.9.1:


grep . /sys/devices/system/cpu/cpufreq/policy0/stats/*

With ondemand:
time_in_state:3900000 145179
time_in_state:1600000 9588482
total_trans:177565
trans_table:   From  :    To
trans_table:         :   3900000   1600000
trans_table:  3900000:         0     88783
trans_table:  1600000:     88782         0

With schedutil only two files exist:
reset:<empty>
total_trans:216609 


I'd really like to have these stats populated with schedutil, if that's possible.

Thanks. 
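
[Editorial note: the time_in_state data above is easy to post-process. A
 small sketch, assuming the two-column "freq_khz time_in_10ms_units" format
 shown in the output above; the helper name is made up:]

```shell
# Summarize a cpufreq time_in_state file: one "freq time" pair per line,
# where time is in units of 10 ms. Prints the percentage of total time
# spent at each frequency.
summarize_time_in_state() {
    awk '{ t[$1] += $2; total += $2 }
         END { for (f in t) printf "%s kHz: %.1f%%\n", f, 100 * t[f] / total }' "$@"
}
```

Usage would be, e.g.:
summarize_time_in_state /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state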

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 15:45                       ` A L
@ 2020-10-22 15:55                         ` Vincent Guittot
  2020-10-23  5:23                           ` Viresh Kumar
  0 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2020-10-22 15:55 UTC (permalink / raw)
  To: A L
  Cc: Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar, Julia Lawall,
	Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller, Srinivas Pandruvada, Linux PM, Len Brown

On Thu, 22 Oct 2020 at 17:45, A L <mail@lechevalier.se> wrote:
>
>
>
> ---- From: Peter Zijlstra <peterz@infradead.org> -- Sent: 2020-10-22 - 14:29 ----
>
> > On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
> >> > However I do want to retire ondemand, conservative and also very much
> >> > intel_pstate/active mode.
> >>
> >> I agree in general, but IMO it would not be prudent to do that without making
> >> schedutil provide the same level of performance in all of the relevant use
> >> cases.
> >
> > Agreed; I thought we were there already.
>
> Hi,
>
>
> Currently schedutil does not populate all stats like ondemand does, which can be a problem for some monitoring software.
>
> On my AMD 3000G CPU with kernel-5.9.1:
>
>
> grep . /sys/devices/system/cpu/cpufreq/policy0/stats/*
>
> With ondemand:
> time_in_state:3900000 145179
> time_in_state:1600000 9588482
> total_trans:177565
> trans_table:   From  :    To
> trans_table:         :   3900000   1600000
> trans_table:  3900000:         0     88783
> trans_table:  1600000:     88782         0
>
> With schedutil only two files exist:
> reset:<empty>
> total_trans:216609
>
>
> I'd really like to have these stats populated with schedutil, if that's possible.

Your problem might have been fixed with
commit 96f60cddf7a1 ("cpufreq: stats: Enable stats for fast-switch as well")


>
> Thanks.
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 15:25                         ` Peter Zijlstra
@ 2020-10-22 15:55                           ` Rafael J. Wysocki
  2020-10-22 16:29                           ` Mel Gorman
  2020-10-22 20:10                           ` Giovanni Gherdovich
  2 siblings, 0 replies; 68+ messages in thread
From: Rafael J. Wysocki @ 2020-10-22 15:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar,
	Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Linux Kernel Mailing List,
	Valentin Schneider, Gilles Muller, Srinivas Pandruvada, Linux PM,
	Len Brown

On Thu, Oct 22, 2020 at 5:25 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 22, 2020 at 03:52:50PM +0100, Mel Gorman wrote:
>
> > There are some questions
> > currently on whether schedutil is good enough when HWP is not available.
>
> Srinivas and Rafael will know better, but Intel does run a lot of tests
> and IIRC it was found that schedutil was on-par for !HWP. That was the
> basis for commit:
>
>   33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")
>
> But now it turns out that commit results in running intel_pstate-passive
> on ondemand, which is quite horrible.

It doesn't in general.  AFAICS this happens only if "ondemand" was
selected as the default governor in the old kernel config, which
should not be the common case.

But I do agree that this needs to be avoided.

> > There was some evidence (I don't have the data, Giovanni was looking into
> > it) that HWP was a requirement to make schedutil work well.
>
> That seems to be the question; Rafael just said the opposite.

I'm not aware of any data like that.

HWP should not be required and it should always be possible to make an
HWP system run without HWP (except for those with exotic BIOS
configs).  However, schedutil should work without HWP as well as (or
better than) the "ondemand" and "conservative" governors on top of the
same driver (whatever it is) and it should work as well as (or better
than) "raw" HWP (so to speak) on top of intel_pstate in the passive
mode with HWP enabled (before 5.9 it couldn't work in that
configuration at all and now it can do that, which I guess may be
regarded as an improvement).

> > For distros, switching to schedutil by default would be nice because
> > frequency selection state would follow the task instead of being per-cpu
> > and we could stop worrying about different HWP implementations but it's
>
> s/HWP/cpufreq-governors/ ? But yes.

Well, different HWP implementations in different processor generations
may be a concern as well in general.

> > not at the point where the switch is advisable. I would expect hard data
> > before switching the default and still would strongly advise having a
> > period of time where we can fall back when someone inevitably finds a
> > new corner case or exception.
>
> Which is why I advocated to make it 'difficult' to use the old ones and
> only later remove them.

Slightly less convenient may be sufficient IMV.

> > For reference, SLUB had the same problem for years. It was switched
> > on by default in the kernel config but it was a long time before
> > SLUB was generally equivalent to SLAB in terms of performance.
>
> I remember :-)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate
  2020-10-22 12:02                 ` default cpufreq gov, was: " Peter Zijlstra
  2020-10-22 12:19                   ` Rafael J. Wysocki
@ 2020-10-22 16:23                   ` Rafael J. Wysocki
  2020-10-23  6:29                     ` Viresh Kumar
  2020-10-23 15:15                     ` [PATCH v2] " Rafael J. Wysocki
  2020-10-27 11:11                   ` default cpufreq gov, was: [PATCH] sched/fair: check for idle core Qais Yousef
  2 siblings, 2 replies; 68+ messages in thread
From: Rafael J. Wysocki @ 2020-10-22 16:23 UTC (permalink / raw)
  To: Peter Zijlstra, Viresh Kumar, Julia Lawall
  Cc: Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate

Commit 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by
default without HWP") was meant to cause intel_pstate without HWP
to be used in the passive mode with the schedutil governor on top of
it by default, but it missed the case in which either "ondemand" or
"conservative" was selected as the default governor in the existing
kernel config, in which case the previous old governor configuration
would be used, causing the default legacy governor to be used on top
of intel_pstate instead of schedutil.

Address this by preventing "ondemand" and "conservative" from being
configured as the default cpufreq governor in the case when schedutil
is the default choice for the default governor setting.

[Note that the default cpufreq governor can still be set via the
 kernel command line if need be and that choice is not limited,
 so if anyone really wants to use one of the legacy governors by
 default, it can be achieved this way.]

Fixes: 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")
Cc: 5.8+ <stable@vger.kernel.org> # 5.8+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/Kconfig |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -71,6 +71,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE
 
 config CPU_FREQ_DEFAULT_GOV_ONDEMAND
 	bool "ondemand"
+	depends on !SMP || !X86_INTEL_PSTATE
 	select CPU_FREQ_GOV_ONDEMAND
 	select CPU_FREQ_GOV_PERFORMANCE
 	help
@@ -83,6 +84,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND
 
 config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
 	bool "conservative"
+	depends on !SMP || !X86_INTEL_PSTATE
 	select CPU_FREQ_GOV_CONSERVATIVE
 	select CPU_FREQ_GOV_PERFORMANCE
 	help
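
[Editorial note: for completeness, and not part of the patch: with the
 above, a legacy governor can still be requested on the kernel command
 line (cpufreq.default_governor=ondemand) or switched at run time via
 sysfs. A sketch of the run-time path; the helper name is made up, and the
 sysfs root is a parameter only so the sketch can be exercised outside a
 real system:]

```shell
# Sketch: set the scaling governor on every cpufreq policy via sysfs.
# Defaults to the real sysfs location; pass another root for testing.
set_governor() {
    gov="$1"
    root="${2:-/sys/devices/system/cpu/cpufreq}"
    for pol in "$root"/policy*; do
        [ -w "$pol/scaling_governor" ] || continue
        printf '%s\n' "$gov" > "$pol/scaling_governor"
    done
}
```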

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 15:25                         ` Peter Zijlstra
  2020-10-22 15:55                           ` Rafael J. Wysocki
@ 2020-10-22 16:29                           ` Mel Gorman
  2020-10-22 20:10                           ` Giovanni Gherdovich
  2 siblings, 0 replies; 68+ messages in thread
From: Mel Gorman @ 2020-10-22 16:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar,
	Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown

On Thu, Oct 22, 2020 at 05:25:14PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 03:52:50PM +0100, Mel Gorman wrote:
> 
> > There are some questions
> > currently on whether schedutil is good enough when HWP is not available.
> 
> Srinivas and Rafael will know better, but Intel does run a lot of tests
> and IIRC it was found that schedutil was on-par for !HWP. That was the
> basis for commit:
> 
>   33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")
> 
> But now it turns out that commit results in running intel_pstate-passive
> on ondemand, which is quite horrible.
> 

I know Intel ran a lot of tests, no question about it and no fingers are
being pointed. I know I've had enough bug patches tested with a battery
of tests on various machines and still ended up with bug reports :)

> > There was some evidence (I don't have the data, Giovanni was looking into
> > it) that HWP was a requirement to make schedutil work well.
> 
> That seems to be the question; Rafael just said the opposite.
> 
> > For distros, switching to schedutil by default would be nice because
> > frequency selection state would follow the task instead of being per-cpu
> > and we could stop worrying about different HWP implementations but it's
> 
> s/HWP/cpufreq-governors/ ? But yes.
> 

I've seen cases where HWP had variable behaviour between CPU
generations. It was hard to quantify and/or figure out because HWP is a
black box.

> > not at the point where the switch is advisable. I would expect hard data
> > before switching the default and still would strongly advise having a
> > period of time where we can fall back when someone inevitably finds a
> > new corner case or exception.
> 
> Which is why I advocated to make it 'difficult' to use the old ones and
> only later remove them.
> 

That's fair.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 15:12                           ` Phil Auld
@ 2020-10-22 16:35                             ` Mel Gorman
  2020-10-22 17:59                               ` Rafael J. Wysocki
  0 siblings, 1 reply; 68+ messages in thread
From: Mel Gorman @ 2020-10-22 16:35 UTC (permalink / raw)
  To: Phil Auld
  Cc: Colin Ian King, Peter Zijlstra, Rafael J. Wysocki,
	Giovanni Gherdovich, Viresh Kumar, Julia Lawall, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada, Linux PM, Len Brown

On Thu, Oct 22, 2020 at 11:12:00AM -0400, Phil Auld wrote:
> > > AFAIK, not quite (added Giovanni as he has been paying more attention).
> > > Schedutil has improved since it was merged but not to the extent where
> > > it is a drop-in replacement. The standard it needs to meet is that
> > > it is at least equivalent to powersave (in intel_pstate language)
> > > or ondemand (acpi_cpufreq) and within a reasonable percentage of the
> > > performance governor. Defaulting to performance is a) giving up and b)
> > > the performance governor is not a universal win. There are some questions
> > > currently on whether schedutil is good enough when HWP is not available.
> > > There was some evidence (I don't have the data, Giovanni was looking into
> > > it) that HWP was a requirement to make schedutil work well. That is a
> > > hazard in itself because someone could test on the latest gen Intel CPU
> > > and conclude everything is fine and miss that Intel-specific technology
> > > is needed to make it work well while throwing everyone else under a bus.
> > > Giovanni knows a lot more than I do about this, I could be wrong or
> > > forgetting things.
> > > 
> > > For distros, switching to schedutil by default would be nice because
> > > frequency selection state would follow the task instead of being per-cpu
> > > and we could stop worrying about different HWP implementations but it's
> > > not at the point where the switch is advisable. I would expect hard data
> > > before switching the default and still would strongly advise having a
> > > period of time where we can fall back when someone inevitably finds a
> > > new corner case or exception.
> > 
> > ..and it would be really useful for distros to know when the hard data
> > is available so that they can make an informed decision when to move to
> > schedutil.
> >
> 
> I think distros are on the hook to generate that hard data themselves
> with which to make such a decision.  I don't expect it to be done by
> someone else. 
> 

Yep, distros are on the hook. When I said "I would expect hard data",
it was in the knowledge that for openSUSE/SLE, we (as in SUSE) would be
generating said data and making a call based on it. I'd be surprised if
Phil was not thinking along the same lines.

> > > For reference, SLUB had the same problem for years. It was switched
> > > on by default in the kernel config but it was a long time before
> > > SLUB was generally equivalent to SLAB in terms of performance. Block
> > > multiqueue also had vaguely similar issues before the default changes
> > > and a period of time before it was removed (example whinging mail
> > > https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5spfdb@techsingularity.net/)
> > > It's schedutil's turn :P
> > > 
> > 
> 
> Agreed. I'd like the option to switch back if we make the default change.
> It's on the table and I'd like to be able to go that way. 
> 

Yep. It sounds chicken, but it's a useful safety net and a reasonable
way to deprecate a feature. It's also useful for bug creation -- User X
running whatever found that schedutil is worse than the old governor and
had to temporarily switch back. Repeat until complaining stops and then
tear out the old stuff.

When/if there is a patch setting schedutil as the default, cc suitable
distro people (Giovanni and myself for openSUSE). Other distros assuming
they're watching can nominate their own victim.

-- 
Mel Gorman
SUSE Labs

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 16:35                             ` Mel Gorman
@ 2020-10-22 17:59                               ` Rafael J. Wysocki
  2020-10-22 20:32                                 ` Mel Gorman
  0 siblings, 1 reply; 68+ messages in thread
From: Rafael J. Wysocki @ 2020-10-22 17:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Colin Ian King, Peter Zijlstra, Rafael J. Wysocki,
	Giovanni Gherdovich, Viresh Kumar, Julia Lawall, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	Linux Kernel Mailing List, Valentin Schneider, Gilles Muller,
	Srinivas Pandruvada, Linux PM, Len Brown

On Thu, Oct 22, 2020 at 6:35 PM Mel Gorman <mgorman@suse.de> wrote:
>
> On Thu, Oct 22, 2020 at 11:12:00AM -0400, Phil Auld wrote:
> > > > AFAIK, not quite (added Giovanni as he has been paying more attention).
> > > > Schedutil has improved since it was merged but not to the extent where
> > > > it is a drop-in replacement. The standard it needs to meet is that
> > > > it is at least equivalent to powersave (in intel_pstate language)
> > > > or ondemand (acpi_cpufreq) and within a reasonable percentage of the
> > > > performance governor. Defaulting to performance is a) giving up and b)
> > > > the performance governor is not a universal win. There are some questions
> > > > currently on whether schedutil is good enough when HWP is not available.
> > > > There was some evidence (I don't have the data, Giovanni was looking into
> > > > it) that HWP was a requirement to make schedutil work well. That is a
> > > > hazard in itself because someone could test on the latest gen Intel CPU
> > > > and conclude everything is fine and miss that Intel-specific technology
> > > > is needed to make it work well while throwing everyone else under a bus.
> > > > Giovanni knows a lot more than I do about this, I could be wrong or
> > > > forgetting things.
> > > >
> > > > For distros, switching to schedutil by default would be nice because
> > > > frequency selection state would follow the task instead of being per-cpu
> > > > and we could stop worrying about different HWP implementations but it's
> > > > not at the point where the switch is advisable. I would expect hard data
> > > > before switching the default and still would strongly advise having a
> > > > period of time where we can fall back when someone inevitably finds a
> > > > new corner case or exception.
> > >
> > > ..and it would be really useful for distros to know when the hard data
> > > is available so that they can make an informed decision when to move to
> > > schedutil.
> > >
> >
> > I think distros are on the hook to generate that hard data themselves
> > with which to make such a decision.  I don't expect it to be done by
> > someone else.
> >
>
> Yep, distros are on the hook. When I said "I would expect hard data",
> it was in the knowledge that for openSUSE/SLE, we (as in SUSE) would be
> generating said data and making a call based on it. I'd be surprised if
> Phil was not thinking along the same lines.
>
> > > > For reference, SLUB had the same problem for years. It was switched
> > > > on by default in the kernel config but it was a long time before
> > > > SLUB was generally equivalent to SLAB in terms of performance. Block
> > > > multiqueue also had vaguely similar issues before the default changes
> > > > and a period of time before it was removed (example whinging mail
> > > > https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5spfdb@techsingularity.net/)
> > > > It's schedutil's turn :P
> > > >
> > >
> >
> > Agreed. I'd like the option to switch back if we make the default change.
> > It's on the table and I'd like to be able to go that way.
> >
>
> Yep. It sounds chicken, but it's a useful safety net and a reasonable
> way to deprecate a feature. It's also useful for bug creation -- User X
> running whatever found that schedutil is worse than the old governor and
> had to temporarily switch back. Repeat until complaining stops and then
> tear out the old stuff.
>
> When/if there is a patch setting schedutil as the default, cc suitable
> distro people (Giovanni and myself for openSUSE).

So for the record, Giovanni was on the CC list of the "cpufreq:
intel_pstate: Use passive mode by default without HWP" patch that this
discussion resulted from (and which kind of belongs to the above
category).

> Other distros assuming they're watching can nominate their own victim.

But no other victims had been nominated at that time.

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 15:25                         ` Peter Zijlstra
  2020-10-22 15:55                           ` Rafael J. Wysocki
  2020-10-22 16:29                           ` Mel Gorman
@ 2020-10-22 20:10                           ` Giovanni Gherdovich
  2020-10-22 20:16                             ` Giovanni Gherdovich
  2020-10-23  7:03                             ` Peter Zijlstra
  2 siblings, 2 replies; 68+ messages in thread
From: Giovanni Gherdovich @ 2020-10-22 20:10 UTC (permalink / raw)
  To: Peter Zijlstra, Rafael J. Wysocki
  Cc: Mel Gorman, Viresh Kumar, Julia Lawall, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada, Linux PM, Len Brown

Hello Peter, Rafael,

back in August I tested a v5.8 kernel adding Rafael's patches from v5.9 that
make schedutil and HWP work together, i.e. f6ebbcf08f37 ("cpufreq: intel_pstate:
Implement passive mode with HWP enabled").

The main point I took from the exercise is that tbench (network benchmark
in localhost) is problematic for schedutil and only with HWP (thanks to
Rafael's patch above) it reaches the throughput of the other governors.
When HWP isn't available, the penalty is 5-10% and I need to understand if
the cause is something that can affect other applications too (or just a
quirk of this test).

I ran this campaign this summer when Rafael CC'ed me to f6ebbcf08f37
("cpufreq: intel_pstate: Implement passive mode with HWP enabled"),
I didn't reply as the patch was a win anyways (my bad, I should have posted
the positive results). The regression of tbench with schedutil w/o HWP,
that went unnoticed for long, got the best of my attention.

Other remarks

* on gitsource (running the git unit test suite, measures elapsed time)
  schedutil is a lot better than Intel's powersave but not as good as the
  performance governor.

* for the AMD EPYC machines we haven't yet implemented frequency invariant
  accounting, which might explain why schedutil loses to ondemand on all
  the benchmarks.

* on dbench (filesystem, measures latency) and kernbench (kernel compilation),
  sugov is as good as the Intel performance governor. You can add or remove
  HWP (to either sugov or perfgov), it doesn't make a difference. Intel's
  powersave in general trails behind.

* generally my main concern is performance, not power efficiency, but I was
  a little disappointed to see schedutil being just as efficient as
  perfgov (the performance-per-watt ratios): there are even a few cases
  where (on tbench) the performance governor is both faster and more
  efficient. From previous conversations with Rafael I recall that
  switching frequency has an energy cost, so it could be that schedutil
  switches too often to amortize it. I haven't checked.
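The bookkeeping behind those ratios can be sketched in a couple of lines; the helper names and the numbers here are made up for illustration, not taken from the runs in this thread. A performance-per-watt figure is score/watts, and each table cell is one governor's figure normalized to the sugov baseline (so sugov reads 1.00):

```shell
# Hypothetical helpers mirroring how the ratio tables are built.
ppw() {
  # performance-per-watt: benchmark score divided by average power draw
  awk -v score="$1" -v watts="$2" 'BEGIN { printf "%.3f\n", score / watts }'
}

ratio_vs_baseline() {
  # normalize a figure against the baseline governor's figure
  awk -v base="$1" -v v="$2" 'BEGIN { printf "%.2f\n", v / base }'
}

# Made-up example: sugov scores 1000 at 200 W, perfgov 1030 at 230 W.
sugov=$(ppw 1000 200)                 # 5.000
perfgov=$(ppw 1030 230)               # 4.478
ratio_vs_baseline "$sugov" "$perfgov" # 0.90: faster, yet less efficient
```

This is the shape of result described above: the performance governor can win on raw throughput while losing on performance-per-watt.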

To read the tables:

Tilde (~) means the result is the same as baseline (or, the ratio is close
to 1). The double asterisk (**) is a visual aid and means the result is
worse than baseline (higher or lower depending on the case).

For an overview of the possible configurations (intel_pstate passive,
active, HWP on/off etc) I made the diagram at
https://beta.suse.com/private/ggherdovich/cpufreq/x86-cpufreq.png

1) INTEL, HWP-CAPABLE MACHINES
2) INTEL, NON-HWP-CAPABLE MACHINES
3) AMD EPYC

1) INTEL, HWP-CAPABLE MACHINES:

64x_SKYLAKE_NUMA: Intel Skylake SP, 32 cores / 64 threads, NUMA, SATA SSD storage
------------------------------------------------------------------------------
            sugov-HWP   sugov-no-HWP   powersave-HWP   perfgov-HWP   better if
------------------------------------------------------------------------------
                                  PERFORMANCE RATIOS
tbench        1.00        0.68           ~               1.03**        higher
dbench        1.00        ~              1.03            ~             lower
kernbench     1.00        ~              1.11            ~             lower
gitsource     1.00        1.03           2.26            0.82**        lower
------------------------------------------------------------------------------
                             PERFORMANCE-PER-WATT RATIOS
tbench        1.00        0.74           ~               ~             higher
dbench        1.00        ~              ~               ~             higher
kernbench     1.00        ~              0.96            ~             higher
gitsource     1.00        0.96           0.45            1.15**        higher


8x_SKYLAKE_UMA: Intel Skylake (client), 4 cores / 8 threads, UMA, SATA SSD storage
------------------------------------------------------------------------------
            sugov-HWP   sugov-no-HWP   powersave-HWP   perfgov-HWP   better if
------------------------------------------------------------------------------
                                  PERFORMANCE RATIOS
tbench        1.00        0.91           ~               ~             higher
dbench        1.00        ~              ~               ~             lower
kernbench     1.00        ~              ~               ~             lower
gitsource     1.00        1.04           1.77            ~             lower
------------------------------------------------------------------------------
                             PERFORMANCE-PER-WATT RATIOS
tbench        1.00        0.95           ~               ~             higher
dbench        1.00        ~              ~               ~             higher
kernbench     1.00        ~              ~               ~             higher
gitsource     1.00        ~              0.74            ~             higher


8x_COFFEELAKE_UMA: Intel Coffee Lake, 4 cores / 8 threads, UMA, NVMe SSD storage
---------------------------------------------------------------
            sugov-HWP   powersave-HWP   perfgov-HWP   better if
---------------------------------------------------------------
                        PERFORMANCE RATIOS
tbench        1.00        ~               ~             higher
dbench        1.00        1.12            ~             lower
kernbench     1.00        ~               ~             lower
gitsource     1.00        2.05            ~             lower
---------------------------------------------------------------
                    PERFORMANCE-PER-WATT RATIOS
tbench        1.00        ~               ~             higher
dbench        1.00        1.80**          ~             higher
kernbench     1.00        ~               ~             higher
gitsource     1.00        1.52**          ~             higher


2) INTEL, NON-HWP-CAPABLE MACHINES:

80x_BROADWELL_NUMA: Intel Broadwell EP, 40 cores / 80 threads, NUMA, SATA SSD storage
---------------------------------------------------------------
              sugov     powersave       perfgov       better if
---------------------------------------------------------------
                        PERFORMANCE RATIOS
tbench        1.00        1.11**          1.10**        higher
dbench        1.00        1.10            ~             lower
kernbench     1.00        1.10            ~             lower
gitsource     1.00        2.27            0.95**        lower
---------------------------------------------------------------
                    PERFORMANCE-PER-WATT RATIOS
tbench        1.00         1.05**         1.04**        higher
dbench        1.00         1.24**         0.95          higher
kernbench     1.00         ~              ~             higher
gitsource     1.00         0.86           1.04**        higher


48x_HASWELL_NUMA: Intel Haswell EP, 24 cores / 48 threads, NUMA, HDD storage
---------------------------------------------------------------
              sugov     powersave       perfgov       better if
---------------------------------------------------------------
                        PERFORMANCE RATIOS
tbench        1.00         1.25**         1.27**        higher
dbench        1.00         1.17           ~             lower
kernbench     1.00         1.04           ~             lower
gitsource     1.00         1.54           0.79**        lower
---------------------------------------------------------------
                    PERFORMANCE-PER-WATT RATIOS
tbench        1.00         1.18**         1.11**        higher
dbench        1.00         1.25**         ~             higher
kernbench     1.00         1.04**         0.97          higher
gitsource     1.00         0.77           ~             higher


3) AMD EPYC:

256x_ROME_NUMA: AMD Rome, 128 cores / 256 threads, NUMA, SATA SSD storage
---------------------------------------------------------------
              sugov      ondemand       perfgov       better if
---------------------------------------------------------------
                        PERFORMANCE RATIOS
tbench        1.00         1.11**       1.58**          higher
dbench        1.00         0.44**       0.40**          lower
kernbench     1.00         ~            0.91**          lower
gitsource     1.00         0.96**       0.65**          lower


128x_NAPLES_NUMA: AMD Naples, 64 cores / 128 threads, NUMA, SATA SSD storage
---------------------------------------------------------------
              sugov      ondemand       perfgov       better if
---------------------------------------------------------------
                        PERFORMANCE RATIOS
tbench        1.00         1.10**       1.19**          higher
dbench        1.00         1.05         0.95**          lower
kernbench     1.00         ~            0.95**          lower
gitsource     1.00         0.93**       0.55**          lower


Giovanni

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 20:10                           ` Giovanni Gherdovich
@ 2020-10-22 20:16                             ` Giovanni Gherdovich
  2020-10-23  7:03                             ` Peter Zijlstra
  1 sibling, 0 replies; 68+ messages in thread
From: Giovanni Gherdovich @ 2020-10-22 20:16 UTC (permalink / raw)
  To: Peter Zijlstra, Rafael J. Wysocki
  Cc: Mel Gorman, Viresh Kumar, Julia Lawall, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada, Linux PM, Len Brown

On Thu, 2020-10-22 at 22:10 +0200, Giovanni Gherdovich wrote:
> [...]
> To read the tables:
> 
> Tilde (~) means the result is the same as baseline (or, the ratio is close
> to 1). The double asterisk (**) is a visual aid and means the result is
> worse than baseline (higher or lower depending on the case).

Ouch, the opposite. Double asterisk (**) is where the result is better
than baseline, and schedutil needs improvement.


Giovanni

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 17:59                               ` Rafael J. Wysocki
@ 2020-10-22 20:32                                 ` Mel Gorman
  2020-10-22 20:39                                   ` Phil Auld
  0 siblings, 1 reply; 68+ messages in thread
From: Mel Gorman @ 2020-10-22 20:32 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Phil Auld, Colin Ian King, Peter Zijlstra, Rafael J. Wysocki,
	Giovanni Gherdovich, Viresh Kumar, Julia Lawall, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	Linux Kernel Mailing List, Valentin Schneider, Gilles Muller,
	Srinivas Pandruvada, Linux PM, Len Brown

On Thu, Oct 22, 2020 at 07:59:43PM +0200, Rafael J. Wysocki wrote:
> > > Agreed. I'd like the option to switch back if we make the default change.
> > > It's on the table and I'd like to be able to go that way.
> > >
> >
> > Yep. It sounds chicken, but it's a useful safety net and a reasonable
> > way to deprecate a feature. It's also useful for bug creation -- User X
> > running whatever found that schedutil is worse than the old governor and
> > had to temporarily switch back. Repeat until complaining stops and then
> > tear out the old stuff.
> >
> > When/if there is a patch setting schedutil as the default, cc suitable
> > distro people (Giovanni and myself for openSUSE).
> 
> So for the record, Giovanni was on the CC list of the "cpufreq:
> intel_pstate: Use passive mode by default without HWP" patch that this
> discussion resulted from (and which kind of belongs to the above
> category).
> 

Oh I know, I did not mean to suggest that you did not. He made people
aware that this was going to be coming down the line and has been looking
into the "what if schedutil was the default" question.  AFAIK, it's still
a work-in-progress and I don't know all the specifics but he knows more
than I do on the topic. I only know enough that if we flipped the switch
tomorrow that we could be plagued with google searches suggesting it be
turned off again just like there is still broken advice out there about
disabling intel_pstate for usually the wrong reasons.

The passive patch was a clear flag that the intent is that schedutil will
be the default at some unknown point in the future. That point is now a
bit closer and this thread could have encouraged a premature change of
the default resulting in unfair finger pointing at one company's test
team. If at least two distros check it out and it still goes wrong, at
least there will be shared blame :/

> > Other distros assuming they're watching can nominate their own victim.
> 
> But no other victims had been nominated at that time.

We have one, possibly two if Phil agrees. That's better than zero or
unfairly placing the full responsibility on the Intel guys that have been
testing it out.

-- 
Mel Gorman
SUSE Labs

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 20:32                                 ` Mel Gorman
@ 2020-10-22 20:39                                   ` Phil Auld
  0 siblings, 0 replies; 68+ messages in thread
From: Phil Auld @ 2020-10-22 20:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rafael J. Wysocki, Colin Ian King, Peter Zijlstra,
	Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar,
	Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Linux Kernel Mailing List,
	Valentin Schneider, Gilles Muller, Srinivas Pandruvada, Linux PM,
	Len Brown

On Thu, Oct 22, 2020 at 09:32:55PM +0100 Mel Gorman wrote:
> On Thu, Oct 22, 2020 at 07:59:43PM +0200, Rafael J. Wysocki wrote:
> > > > Agreed. I'd like the option to switch back if we make the default change.
> > > > It's on the table and I'd like to be able to go that way.
> > > >
> > >
> > > Yep. It sounds chicken, but it's a useful safety net and a reasonable
> > > way to deprecate a feature. It's also useful for bug creation -- User X
> > > running whatever found that schedutil is worse than the old governor and
> > > had to temporarily switch back. Repeat until complaining stops and then
> > > tear out the old stuff.
> > >
> > > When/if there is a patch setting schedutil as the default, cc suitable
> > > distro people (Giovanni and myself for openSUSE).
> > 
> > So for the record, Giovanni was on the CC list of the "cpufreq:
> > intel_pstate: Use passive mode by default without HWP" patch that this
> > discussion resulted from (and which kind of belongs to the above
> > category).
> > 
> 
> Oh I know, I did not mean to suggest that you did not. He made people
> aware that this was going to be coming down the line and has been looking
> into the "what if schedutil was the default" question.  AFAIK, it's still
> a work-in-progress and I don't know all the specifics but he knows more
> than I do on the topic. I only know enough that if we flipped the switch
> tomorrow that we could be plagued with google searches suggesting it be
> turned off again just like there is still broken advice out there about
> disabling intel_pstate for usually the wrong reasons.
> 
> The passive patch was a clear flag that the intent is that schedutil will
> be the default at some unknown point in the future. That point is now a
> bit closer and this thread could have encouraged a premature change of
> the default resulting in unfair finger pointing at one company's test
> team. If at least two distros check it out and it still goes wrong, at
> least there will be shared blame :/
> 
> > > Other distros assuming they're watching can nominate their own victim.
> > 
> > But no other victims had been nominated at that time.
> 
> We have one, possibly two if Phil agrees. That's better than zero or
> unfairly placing the full responsibility on the Intel guys that have been
> testing it out.
>

Yes. I agree and we (RHEL) are planning to test this soon. I'll try to get
to it.  You can certainly CC me, please, although I also try to watch for this
sort of thing on list. 


Cheers,
Phil

> -- 
> Mel Gorman
> SUSE Labs
> 

-- 

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 15:55                         ` Vincent Guittot
@ 2020-10-23  5:23                           ` Viresh Kumar
  0 siblings, 0 replies; 68+ messages in thread
From: Viresh Kumar @ 2020-10-23  5:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: A L, Peter Zijlstra, Rafael J. Wysocki, Julia Lawall, Mel Gorman,
	Ingo Molnar, kernel-janitors, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	Srinivas Pandruvada, Linux PM, Len Brown

On 22-10-20, 17:55, Vincent Guittot wrote:
> On Thu, 22 Oct 2020 at 17:45, A L <mail@lechevalier.se> wrote:
> >
> >
> >
> > ---- From: Peter Zijlstra <peterz@infradead.org> -- Sent: 2020-10-22 - 14:29 ----
> >
> > > On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
> > >> > However I do want to retire ondemand, conservative and also very much
> > >> > intel_pstate/active mode.
> > >>
> > >> I agree in general, but IMO it would not be prudent to do that without making
> > >> schedutil provide the same level of performance in all of the relevant use
> > >> cases.
> > >
> > > Agreed; I though to have understood we were there already.
> >
> > Hi,
> >
> >
> > Currently schedutil does not populate all stats like ondemand does, which can be a problem for some monitoring software.
> >
> > On my AMD 3000G CPU with kernel-5.9.1:
> >
> >
> > grep . /sys/devices/system/cpu/cpufreq/policy0/stats/*
> >
> > With ondemand:
> > time_in_state:3900000 145179
> > time_in_state:1600000 9588482
> > total_trans:177565
> > trans_table:   From  :    To
> > trans_table:         :   3900000   1600000
> > trans_table:  3900000:         0     88783
> > trans_table:  1600000:     88782         0
> >
> > With schedutil only two files exist:
> > reset:<empty>
> > total_trans:216609
> >
> >
> > I'd really like to have these stats populated with schedutil, if that's possible.
> 
> Your problem might have been fixed with
> commit 96f60cddf7a1 ("cpufreq: stats: Enable stats for fast-switch as well")

Thanks Vincent. Right, I have already fixed that for everyone.
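For reference, the time_in_state output quoted above is easy to reduce to percentages; the helper name below is made up, and on a live system the input would come from /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state rather than a printf:

```shell
# Summarize a time_in_state table (freq_khz time_in_ticks per line)
# as the percentage of time spent at each frequency.
summarize_time_in_state() {
  awk '{ t[$1] = $2; total += $2 }
       END { for (f in t) printf "%s %.1f%%\n", f, 100 * t[f] / total }'
}

# The ondemand sample quoted above: ~98.5% of the time at 1.6 GHz.
printf '3900000 145179\n1600000 9588482\n' | summarize_time_in_state | sort
```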

-- 
viresh

* Re: [PATCH] sched/fair: check for idle core
  2020-10-22 11:45               ` Rafael J. Wysocki
  2020-10-22 12:02                 ` default cpufreq gov, was: " Peter Zijlstra
@ 2020-10-23  6:24                 ` Viresh Kumar
  2020-10-23 15:06                   ` Rafael J. Wysocki
  1 sibling, 1 reply; 68+ messages in thread
From: Viresh Kumar @ 2020-10-23  6:24 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada

On 22-10-20, 13:45, Rafael J. Wysocki wrote:
> On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > And I am not really sure why we always wanted this backup performance
> > governor to be there unless the said governors are built as module.
> 
> Apparently, some old drivers had problems with switching frequencies fast enough
> for ondemand to be used with them and the fallback was for those cases.  AFAICS.

Do we still need this? Or better, ask those platforms to individually
enable both of them.

-- 
viresh

* Re: [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate
  2020-10-22 16:23                   ` [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate Rafael J. Wysocki
@ 2020-10-23  6:29                     ` Viresh Kumar
  2020-10-23 11:59                       ` Rafael J. Wysocki
  2020-10-23 15:15                     ` [PATCH v2] " Rafael J. Wysocki
  1 sibling, 1 reply; 68+ messages in thread
From: Viresh Kumar @ 2020-10-23  6:29 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada, Linux PM, Len Brown

On 22-10-20, 18:23, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate
> 
> Commit 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by
> default without HWP") was meant to cause intel_pstate without HWP
> to be used in the passive mode with the schedutil governor on top of
> it by default, but it missed the case in which either "ondemand" or
> "conservative" was selected as the default governor in the existing
> kernel config, in which case the previous old governor configuration
> would be used, causing the default legacy governor to be used on top
> of intel_pstate instead of schedutil.
> 
> Address this by preventing "ondemand" and "conservative" from being
> configured as the default cpufreq governor in the case when schedutil
> is the default choice for the default governor setting.
> 
> [Note that the default cpufreq governor can still be set via the
>  kernel command line if need be and that choice is not limited,
>  so if anyone really wants to use one of the legacy governors by
>  default, it can be achieved this way.]
> 
> Fixes: 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")
> Cc: 5.8+ <stable@vger.kernel.org> # 5.8+
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpufreq/Kconfig |    2 ++
>  1 file changed, 2 insertions(+)
> 
> Index: linux-pm/drivers/cpufreq/Kconfig
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/Kconfig
> +++ linux-pm/drivers/cpufreq/Kconfig
> @@ -71,6 +71,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE
>  
>  config CPU_FREQ_DEFAULT_GOV_ONDEMAND
>  	bool "ondemand"
> +	depends on !SMP || !X86_INTEL_PSTATE
>  	select CPU_FREQ_GOV_ONDEMAND
>  	select CPU_FREQ_GOV_PERFORMANCE
>  	help
> @@ -83,6 +84,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND
>  
>  config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
>  	bool "conservative"
> +	depends on !SMP || !X86_INTEL_PSTATE

While reading this it first felt like an SMP-platform-related problem
(which surprised me), and then I understood what you are
doing.

I wonder if rewriting it this way makes it more readable, with the
same end result.

	depends on !(X86_INTEL_PSTATE && SMP)

-- 
viresh

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 20:10                           ` Giovanni Gherdovich
  2020-10-22 20:16                             ` Giovanni Gherdovich
@ 2020-10-23  7:03                             ` Peter Zijlstra
  2020-10-23 17:46                               ` Tom Lendacky
  1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2020-10-23  7:03 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Rafael J. Wysocki, Mel Gorman, Viresh Kumar, Julia Lawall,
	Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown,
	thomas.lendacky, puwen, yazen.ghannam, kim.phillips,
	suravee.suthikulpanit

On Thu, Oct 22, 2020 at 10:10:35PM +0200, Giovanni Gherdovich wrote:
> * for the AMD EPYC machines we haven't yet implemented frequency invariant
>   accounting, which might explain why schedutil loses to ondemand on all
>   the benchmarks.

Right, I poked the AMD people on that a few times, but nothing seems to
be forthcoming :/ Tom, any way you could perhaps expedite the matter?

In particular we're looking for some X86_VENDOR_AMD/HYGON code to run in

  arch/x86/kernel/smpboot.c:init_freq_invariance()

The main issue is finding a 'max' frequency that is not the absolute max
turbo boost (this could result in not reaching it very often) but also
not too low such that we're always clipping.
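
To illustrate the tradeoff (a rough sketch with made-up frequencies, not actual kernel code): frequency-invariant accounting scales each utilization sample by the ratio of the current frequency to a chosen reference "max".

```python
# Rough sketch of the frequency-invariance tradeoff described above.
# All frequency values are made-up illustrative numbers.

def freq_invariant_util(raw_util, cur_freq_khz, ref_max_freq_khz):
    """Scale a raw utilization sample by cur/ref frequency, clipping at 1.0."""
    return min(raw_util * cur_freq_khz / ref_max_freq_khz, 1.0)

# If the reference 'max' is the absolute turbo boost (rarely sustained),
# a CPU that is 100% busy at a lower frequency never reports full
# utilization:
print(freq_invariant_util(1.0, 2100, 3000))   # 0.7 -- under-reports

# If the reference 'max' is picked too low, every boosted frequency clips
# to 1.0 and information is lost:
print(freq_invariant_util(1.0, 2400, 2000))   # 1.0 -- always clipping
```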

And while we're here, IIUC AMD is still using acpi_cpufreq, but AFAIK
the chips have a CPPC interface which could be used instead. Is there
any progress on that?

* Re: [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate
  2020-10-23  6:29                     ` Viresh Kumar
@ 2020-10-23 11:59                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 68+ messages in thread
From: Rafael J. Wysocki @ 2020-10-23 11:59 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Peter Zijlstra, Julia Lawall, Mel Gorman,
	Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Linux Kernel Mailing List,
	Valentin Schneider, Gilles Muller, Srinivas Pandruvada, Linux PM,
	Len Brown

On Fri, Oct 23, 2020 at 8:17 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 22-10-20, 18:23, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Subject: [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate
> >
> > Commit 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by
> > default without HWP") was meant to cause intel_pstate without HWP
> > to be used in the passive mode with the schedutil governor on top of
> > it by default, but it missed the case in which either "ondemand" or
> > "conservative" was selected as the default governor in the existing
> > kernel config, in which case the previous old governor configuration
> > would be used, causing the default legacy governor to be used on top
> > of intel_pstate instead of schedutil.
> >
> > Address this by preventing "ondemand" and "conservative" from being
> > configured as the default cpufreq governor in the case when schedutil
> > is the default choice for the default governor setting.
> >
> > [Note that the default cpufreq governor can still be set via the
> >  kernel command line if need be and that choice is not limited,
> >  so if anyone really wants to use one of the legacy governors by
> >  default, it can be achieved this way.]
> >
> > Fixes: 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")
> > Cc: 5.8+ <stable@vger.kernel.org> # 5.8+
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >  drivers/cpufreq/Kconfig |    2 ++
> >  1 file changed, 2 insertions(+)
> >
> > Index: linux-pm/drivers/cpufreq/Kconfig
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/Kconfig
> > +++ linux-pm/drivers/cpufreq/Kconfig
> > @@ -71,6 +71,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE
> >
> >  config CPU_FREQ_DEFAULT_GOV_ONDEMAND
> >       bool "ondemand"
> > +     depends on !SMP || !X86_INTEL_PSTATE
> >       select CPU_FREQ_GOV_ONDEMAND
> >       select CPU_FREQ_GOV_PERFORMANCE
> >       help
> > @@ -83,6 +84,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND
> >
> >  config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
> >       bool "conservative"
> > +     depends on !SMP || !X86_INTEL_PSTATE
>
> While reading this, it first felt like an SMP-platform-related problem
> (which I was surprised about), and then I understood what you are
> doing.
>
> I wonder if rewriting it this way makes it more readable, with the same
> result eventually.
>
>         depends on !(X86_INTEL_PSTATE && SMP)

Agreed, will update.

Thanks!

* Re: [PATCH] sched/fair: check for idle core
  2020-10-23  6:24                 ` Viresh Kumar
@ 2020-10-23 15:06                   ` Rafael J. Wysocki
  2020-10-27  3:13                     ` Viresh Kumar
  0 siblings, 1 reply; 68+ messages in thread
From: Rafael J. Wysocki @ 2020-10-23 15:06 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada, linux-pm

On Friday, October 23, 2020 8:12:46 AM CEST Viresh Kumar wrote:
> On 22-10-20, 13:45, Rafael J. Wysocki wrote:
> > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > > And I am not really sure why we always wanted this backup performance
> > > governor to be there unless the said governors are built as modules.
> > 
> > Apparently, some old drivers had problems with switching frequencies fast enough
> > for ondemand to be used with them and the fallback was for those cases.  AFAICS.
> 
> Do we still need this ?

For reasonably modern hardware, I don't think so.

> Or better ask those platforms to individually
> enable both of them.

But who knows what they are? :-)

* [PATCH v2] cpufreq: Avoid configuring old governors as default with intel_pstate
  2020-10-22 16:23                   ` [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate Rafael J. Wysocki
  2020-10-23  6:29                     ` Viresh Kumar
@ 2020-10-23 15:15                     ` Rafael J. Wysocki
  2020-10-27  3:13                       ` Viresh Kumar
  1 sibling, 1 reply; 68+ messages in thread
From: Rafael J. Wysocki @ 2020-10-23 15:15 UTC (permalink / raw)
  To: Peter Zijlstra, Viresh Kumar, Julia Lawall
  Cc: Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Commit 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by
default without HWP") was meant to cause intel_pstate to be used
in the passive mode with the schedutil governor on top of it, but
it missed the case in which either "ondemand" or "conservative"
was selected as the default governor in the existing kernel config,
in which case the previous old governor configuration would be used,
causing the default legacy governor to be used on top of intel_pstate
instead of schedutil.

Address this by preventing "ondemand" and "conservative" from being
configured as the default cpufreq governor in the case when schedutil
is the default choice for the default governor setting.

[Note that the default cpufreq governor can still be set via the
 kernel command line if need be and that choice is not limited,
 so if anyone really wants to use one of the legacy governors by
 default, it can be achieved this way.]

Fixes: 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Cc: 5.8+ <stable@vger.kernel.org> # 5.8+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

The v2 addresses a review comment from Viresh regarding the expression format
and adds a missing Reported-by for Julia.

---
 drivers/cpufreq/Kconfig |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -71,6 +71,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE
 
 config CPU_FREQ_DEFAULT_GOV_ONDEMAND
 	bool "ondemand"
+	depends on !(X86_INTEL_PSTATE && SMP)
 	select CPU_FREQ_GOV_ONDEMAND
 	select CPU_FREQ_GOV_PERFORMANCE
 	help
@@ -83,6 +84,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND
 
 config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
 	bool "conservative"
+	depends on !(X86_INTEL_PSTATE && SMP)
 	select CPU_FREQ_GOV_CONSERVATIVE
 	select CPU_FREQ_GOV_PERFORMANCE
 	help

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-23  7:03                             ` Peter Zijlstra
@ 2020-10-23 17:46                               ` Tom Lendacky
  2020-10-26 19:52                                 ` Fontenot, Nathan
  0 siblings, 1 reply; 68+ messages in thread
From: Tom Lendacky @ 2020-10-23 17:46 UTC (permalink / raw)
  To: Peter Zijlstra, Giovanni Gherdovich
  Cc: Rafael J. Wysocki, Mel Gorman, Viresh Kumar, Julia Lawall,
	Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown, puwen,
	yazen.ghannam, kim.phillips, suravee.suthikulpanit, Fontenot,
	Nathan

On 10/23/20 2:03 AM, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 10:10:35PM +0200, Giovanni Gherdovich wrote:
>> * for the AMD EPYC machines we haven't yet implemented frequency invariant
>>    accounting, which might explain why schedutil loses to ondemand on all
>>    the benchmarks.
> 
> Right, I poked the AMD people on that a few times, but nothing seems to
> be forthcoming :/ Tom, any way you could perhaps expedite the matter?

Adding Nathan to the thread to help out here.

Thanks,
Tom

> 
> In particular we're looking for some X86_VENDOR_AMD/HYGON code to run in
> 
>    arch/x86/kernel/smpboot.c:init_freq_invariance()
> 
> The main issue is finding a 'max' frequency that is not the absolute max
> turbo boost (this could result in not reaching it very often) but also
> not too low such that we're always clipping.
> 
> And while we're here, IIUC AMD is still using acpi_cpufreq, but AFAIK
> the chips have a CPPC interface which could be used instead. Is there
> any progress on that?
> 

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-23 17:46                               ` Tom Lendacky
@ 2020-10-26 19:52                                 ` Fontenot, Nathan
  0 siblings, 0 replies; 68+ messages in thread
From: Fontenot, Nathan @ 2020-10-26 19:52 UTC (permalink / raw)
  To: Tom Lendacky, Peter Zijlstra, Giovanni Gherdovich
  Cc: Rafael J. Wysocki, Mel Gorman, Viresh Kumar, Julia Lawall,
	Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown, puwen,
	yazen.ghannam, kim.phillips, suravee.suthikulpanit

On 10/23/2020 12:46 PM, Tom Lendacky wrote:
> On 10/23/20 2:03 AM, Peter Zijlstra wrote:
>> On Thu, Oct 22, 2020 at 10:10:35PM +0200, Giovanni Gherdovich wrote:
>>> * for the AMD EPYC machines we haven't yet implemented frequency invariant
>>>    accounting, which might explain why schedutil loses to ondemand on all
>>>    the benchmarks.
>>
>> Right, I poked the AMD people on that a few times, but nothing seems to
>> be forthcoming :/ Tom, any way you could perhaps expedite the matter?
> 
> Adding Nathan to the thread to help out here.
> 
> Thanks,
> Tom

Thanks Tom, diving in...

> 
>>
>> In particular we're looking for some X86_VENDOR_AMD/HYGON code to run in
>>
>>    arch/x86/kernel/smpboot.c:init_freq_invariance()
>>
>> The main issue is finding a 'max' frequency that is not the absolute max
>> turbo boost (this could result in not reaching it very often) but also
>> not too low such that we're always clipping.

I've started looking into this and have a lead but need to confirm that the
frequency value I'm getting is not an absolute max.

>>
>> And while we're here, IIUC AMD is still using acpi_cpufreq, but AFAIK
>> the chips have a CPPC interface which could be used instead. Is there
>> any progress on that?
>>

Correct, AMD uses acpi_cpufreq. The newer AMD chips do have a CPPC interface
(not sure how far back 'newer' covers). I'll take a look at schedutil and
cppc_cpufreq and the possibility of transitioning to them for AMD.

-Nathan

* Re: [PATCH] sched/fair: check for idle core
  2020-10-23 15:06                   ` Rafael J. Wysocki
@ 2020-10-27  3:13                     ` Viresh Kumar
  0 siblings, 0 replies; 68+ messages in thread
From: Viresh Kumar @ 2020-10-27  3:13 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada, linux-pm

On 23-10-20, 17:06, Rafael J. Wysocki wrote:
> On Friday, October 23, 2020 8:12:46 AM CEST Viresh Kumar wrote:
> > On 22-10-20, 13:45, Rafael J. Wysocki wrote:
> > > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > > > And I am not really sure why we always wanted this backup performance
> > > > governor to be there unless the said governors are built as modules.
> > > 
> > > Apparently, some old drivers had problems with switching frequencies fast enough
> > > for ondemand to be used with them and the fallback was for those cases.  AFAICS.
> > 
> > Do we still need this ?
> 
> For reasonably modern hardware, I don't think so.
> 
> > Or better ask those platforms to individually
> > enable both of them.
> 
> But who knows what they are? :-)

I was planning to break them and let them complain :)

-- 
viresh

* Re: [PATCH v2] cpufreq: Avoid configuring old governors as default with intel_pstate
  2020-10-23 15:15                     ` [PATCH v2] " Rafael J. Wysocki
@ 2020-10-27  3:13                       ` Viresh Kumar
  0 siblings, 0 replies; 68+ messages in thread
From: Viresh Kumar @ 2020-10-27  3:13 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar,
	kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Valentin Schneider, Gilles Muller,
	srinivas.pandruvada, Linux PM, Len Brown

On 23-10-20, 17:15, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Commit 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by
> default without HWP") was meant to cause intel_pstate to be used
> in the passive mode with the schedutil governor on top of it, but
> it missed the case in which either "ondemand" or "conservative"
> was selected as the default governor in the existing kernel config,
> in which case the previous old governor configuration would be used,
> causing the default legacy governor to be used on top of intel_pstate
> instead of schedutil.
> 
> Address this by preventing "ondemand" and "conservative" from being
> configured as the default cpufreq governor in the case when schedutil
> is the default choice for the default governor setting.
> 
> [Note that the default cpufreq governor can still be set via the
>  kernel command line if need be and that choice is not limited,
>  so if anyone really wants to use one of the legacy governors by
>  default, it can be achieved this way.]
> 
> Fixes: 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")
> Reported-by: Julia Lawall <julia.lawall@inria.fr>
> Cc: 5.8+ <stable@vger.kernel.org> # 5.8+
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> The v2 addresses a review comment from Viresh regarding the expression format
> and adds a missing Reported-by for Julia.
> 
> ---
>  drivers/cpufreq/Kconfig |    2 ++
>  1 file changed, 2 insertions(+)

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-22 12:02                 ` default cpufreq gov, was: " Peter Zijlstra
  2020-10-22 12:19                   ` Rafael J. Wysocki
  2020-10-22 16:23                   ` [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate Rafael J. Wysocki
@ 2020-10-27 11:11                   ` Qais Yousef
  2020-10-27 11:26                     ` Valentin Schneider
  2 siblings, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-10-27 11:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Viresh Kumar, Julia Lawall, Mel Gorman,
	Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
	Gilles Muller, srinivas.pandruvada

On 10/22/20 14:02, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 01:45:25PM +0200, Rafael J. Wysocki wrote:
> > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > > On 22-10-20, 09:11, Peter Zijlstra wrote:
> > > > Well, but we need to do something to force people onto schedutil,
> > > > otherwise we'll get more crap like this thread.
> > > > 
> > > > Can we take the choice away? Only let Kconfig select which governors are
> > > > available and then set the default ourselves? I mean, the end goal being
> > > > to not have selectable governors at all, this seems like a good step
> > > > anyway.
> > > 
> > > Just to clarify and complete the point a bit here, the users can still
> > > pass the default governor from cmdline using
> > > cpufreq.default_governor=, which will take precedence over the one the
> > > below code is playing with. And later once the kernel is up, they can
> > > still choose a different governor from userspace.
> > 
> > Right.
> > 
> > Also some people simply set "performance" as the default governor and then
> > don't touch cpufreq otherwise (the idea is to get everything to the max
> > freq right away and stay in that mode forever).  This still needs to be
> > possible IMO.
> 
> Performance/powersave make sense to keep.
> 
> However I do want to retire ondemand, conservative and also very much
> intel_pstate/active mode. I also have very little sympathy for
> userspace.

Userspace is useful for testing and sanity checking. Not sure if people use it
to measure voltage/current at each frequency to generate
dynamic-power-coefficient for their platform. Lukasz, Dietmar?

Thanks

--
Qais Yousef

> 
> We should start by making it hard to use them and eventually just delete
> them outright.
> 

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-27 11:11                   ` default cpufreq gov, was: [PATCH] sched/fair: check for idle core Qais Yousef
@ 2020-10-27 11:26                     ` Valentin Schneider
  2020-10-27 11:42                       ` Qais Yousef
  0 siblings, 1 reply; 68+ messages in thread
From: Valentin Schneider @ 2020-10-27 11:26 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar, Julia Lawall,
	Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Gilles Muller,
	srinivas.pandruvada


On 27/10/20 11:11, Qais Yousef wrote:
> On 10/22/20 14:02, Peter Zijlstra wrote:
>> However I do want to retire ondemand, conservative and also very much
>> intel_pstate/active mode. I also have very little sympathy for
>> userspace.
>
> Userspace is useful for testing and sanity checking. Not sure if people use it
> to measure voltage/current at each frequency to generate
> dynamic-power-coefficient for their platform. Lukasz, Dietmar?
>

It's valuable even just for cpufreq sanity checking - we have that test
that goes through increasing frequencies and asserts the work done is
monotonically increasing. This has been quite useful in the past to detect
broken bits.

That *should* still be totally doable with any other governor by using the
scaling_{min, max}_freq sysfs interface.
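
As a hedged sketch of the assertion such a test makes (the numbers are hypothetical; the real test would pin each frequency, e.g. by writing the same value to scaling_min_freq and scaling_max_freq, then benchmark):

```python
def is_monotonic_non_decreasing(work_samples):
    """work_samples: work done in a fixed interval at each frequency,
    ordered by increasing frequency. Higher frequency should never
    do less work."""
    return all(a <= b for a, b in zip(work_samples, work_samples[1:]))

# Hypothetical measurements across four pinned frequencies:
healthy = [510, 740, 980, 1200]
broken  = [510, 740, 700, 1200]   # a broken OPP would show up like this

print(is_monotonic_non_decreasing(healthy))  # True
print(is_monotonic_non_decreasing(broken))   # False
```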

> Thanks

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-27 11:26                     ` Valentin Schneider
@ 2020-10-27 11:42                       ` Qais Yousef
  2020-10-27 11:48                         ` Viresh Kumar
  0 siblings, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-10-27 11:42 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar, Julia Lawall,
	Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel, Gilles Muller,
	srinivas.pandruvada

On 10/27/20 11:26, Valentin Schneider wrote:
> 
> On 27/10/20 11:11, Qais Yousef wrote:
> > On 10/22/20 14:02, Peter Zijlstra wrote:
> >> However I do want to retire ondemand, conservative and also very much
> >> intel_pstate/active mode. I also have very little sympathy for
> >> userspace.
> >
> > Userspace is useful for testing and sanity checking. Not sure if people use it
> > to measure voltage/current at each frequency to generate
> > dynamic-power-coefficient for their platform. Lukasz, Dietmar?
> >
> 
> It's valuable even just for cpufreq sanity checking - we have that test
> that goes through increasing frequencies and asserts the work done is
> monotonically increasing. This has been quite useful in the past to detect
> broken bits.
> 
> That *should* still be totally doable with any other governor by using the
> scaling_{min, max}_freq sysfs interface.

True. This effectively makes every governor a potential user space governor.

/me not sure to be happy or grumpy about it

Thanks

--
Qais Yousef

* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
  2020-10-27 11:42                       ` Qais Yousef
@ 2020-10-27 11:48                         ` Viresh Kumar
  0 siblings, 0 replies; 68+ messages in thread
From: Viresh Kumar @ 2020-10-27 11:48 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Valentin Schneider, Peter Zijlstra, Rafael J. Wysocki,
	Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
	Gilles Muller, srinivas.pandruvada

On 27-10-20, 11:42, Qais Yousef wrote:
> On 10/27/20 11:26, Valentin Schneider wrote:
> > 
> > On 27/10/20 11:11, Qais Yousef wrote:
> > > On 10/22/20 14:02, Peter Zijlstra wrote:
> > >> However I do want to retire ondemand, conservative and also very much
> > >> intel_pstate/active mode. I also have very little sympathy for
> > >> userspace.
> > >
> > > Userspace is useful for testing and sanity checking. Not sure if people use it
> > > to measure voltage/current at each frequency to generate
> > > dynamic-power-coefficient for their platform. Lukasz, Dietmar?
> > >
> > 
> > It's valuable even just for cpufreq sanity checking - we have that test
> > that goes through increasing frequencies and asserts the work done is
> > monotonically increasing. This has been quite useful in the past to detect
> > broken bits.
> > 
> > That *should* still be totally doable with any other governor by using the
> > scaling_{min, max}_freq sysfs interface.
> 
> True. This effectively makes every governor a potential user space governor.
> 
> /me not sure to be happy or grumpy about it

The userspace governor should be kept as is; it is very effective for
getting unnecessary governor code out of the path when testing basic
functioning of the hardware/driver. It is quite useful when things
don't work as expected.

-- 
viresh

end of thread, other threads:[~2020-10-27 11:48 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-20 16:37 [PATCH] sched/fair: check for idle core Julia Lawall
2020-10-21  7:29 ` Vincent Guittot
2020-10-21 11:13   ` Peter Zijlstra
2020-10-21 12:27   ` Vincent Guittot
2020-10-21 11:20 ` Mel Gorman
2020-10-21 11:56   ` Julia Lawall
2020-10-21 12:19     ` Peter Zijlstra
2020-10-21 12:42       ` Julia Lawall
2020-10-21 12:52         ` Peter Zijlstra
2020-10-21 18:18           ` Rafael J. Wysocki
2020-10-21 18:15         ` Rafael J. Wysocki
2020-10-21 19:47           ` Julia Lawall
2020-10-21 20:25             ` Rafael J. Wysocki
2020-10-21 13:10       ` Peter Zijlstra
2020-10-21 18:11         ` Rafael J. Wysocki
2020-10-22  4:53           ` Viresh Kumar
2020-10-22  7:11           ` Peter Zijlstra
2020-10-22 10:59             ` Viresh Kumar
2020-10-22 11:45               ` Rafael J. Wysocki
2020-10-22 12:02                 ` default cpufreq gov, was: " Peter Zijlstra
2020-10-22 12:19                   ` Rafael J. Wysocki
2020-10-22 12:29                     ` Peter Zijlstra
2020-10-22 14:52                       ` Mel Gorman
2020-10-22 14:58                         ` Colin Ian King
2020-10-22 15:12                           ` Phil Auld
2020-10-22 16:35                             ` Mel Gorman
2020-10-22 17:59                               ` Rafael J. Wysocki
2020-10-22 20:32                                 ` Mel Gorman
2020-10-22 20:39                                   ` Phil Auld
2020-10-22 15:25                         ` Peter Zijlstra
2020-10-22 15:55                           ` Rafael J. Wysocki
2020-10-22 16:29                           ` Mel Gorman
2020-10-22 20:10                           ` Giovanni Gherdovich
2020-10-22 20:16                             ` Giovanni Gherdovich
2020-10-23  7:03                             ` Peter Zijlstra
2020-10-23 17:46                               ` Tom Lendacky
2020-10-26 19:52                                 ` Fontenot, Nathan
2020-10-22 15:45                       ` A L
2020-10-22 15:55                         ` Vincent Guittot
2020-10-23  5:23                           ` Viresh Kumar
2020-10-22 16:23                   ` [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate Rafael J. Wysocki
2020-10-23  6:29                     ` Viresh Kumar
2020-10-23 11:59                       ` Rafael J. Wysocki
2020-10-23 15:15                     ` [PATCH v2] " Rafael J. Wysocki
2020-10-27  3:13                       ` Viresh Kumar
2020-10-27 11:11                   ` default cpufreq gov, was: [PATCH] sched/fair: check for idle core Qais Yousef
2020-10-27 11:26                     ` Valentin Schneider
2020-10-27 11:42                       ` Qais Yousef
2020-10-27 11:48                         ` Viresh Kumar
2020-10-23  6:24                 ` Viresh Kumar
2020-10-23 15:06                   ` Rafael J. Wysocki
2020-10-27  3:13                     ` Viresh Kumar
2020-10-22 11:21             ` AW: " Walter Harms
2020-10-21 12:28     ` Mel Gorman
2020-10-21 12:25   ` Vincent Guittot
2020-10-21 12:47     ` Mel Gorman
2020-10-21 12:56       ` Julia Lawall
2020-10-21 13:18         ` Mel Gorman
2020-10-21 13:24           ` Julia Lawall
2020-10-21 15:08             ` Mel Gorman
2020-10-21 15:18               ` Julia Lawall
2020-10-21 15:23                 ` Vincent Guittot
2020-10-21 15:33                   ` Julia Lawall
2020-10-21 15:19               ` Vincent Guittot
2020-10-21 17:00                 ` Mel Gorman
2020-10-21 17:39                   ` Julia Lawall
2020-10-21 13:48           ` Julia Lawall
2020-10-21 15:26             ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).